##### State Transition Equation

![Robot Maze](https://datawhalechina.github.io/joyrl-book/figs/ch3/robot_maze_2.png)

$$
f(i,j) = \begin{cases}
0, & (i,j) = (0,0) \\
1, & i = 0 \text{ or } j = 0, (i,j) \neq (0,0) \\
f(i-1,j) + f(i,j-1), & i > 0, j > 0
\end{cases}
$$

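```python
# Implementation
def solve(m, n):
    # Boundary condition: each cell in the first row or first column is reached by exactly one path
    f = [[1] * n] + [[1] + [0] * (n - 1) for _ in range(m - 1)]

    # State transition: a path enters (i, j) from the cell above or the cell to the left
    for i in range(1, m):
        for j in range(1, n):
            f[i][j] = f[i - 1][j] + f[i][j - 1]

    # Report the count in scientific notation
    return "{:.2e}".format(f[m - 1][n - 1])

# Example
solve(32, 76)  # How many paths are there through a 32 x 76 maze?
```

Output:

```
'5.62e+26'
```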
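
As a sanity check, the recurrence above has a closed form: every path consists of $m-1$ downward moves and $n-1$ rightward moves in some order, so the count is $\binom{m+n-2}{m-1}$. The following is a minimal sketch of that check; the function names `count_paths_dp` and `count_paths_closed` are introduced here for illustration only.

```python
from math import comb

def count_paths_dp(m, n):
    """Count paths through an m x n grid with the recurrence f(i-1, j) + f(i, j-1)."""
    f = [[1] * n for _ in range(m)]  # first row and first column keep the value 1
    for i in range(1, m):
        for j in range(1, n):
            f[i][j] = f[i - 1][j] + f[i][j - 1]
    return f[m - 1][n - 1]

def count_paths_closed(m, n):
    """Choose which of the m + n - 2 moves go down."""
    return comb(m + n - 2, m - 1)

# Both approaches should agree on any grid size
print(count_paths_dp(3, 3), count_paths_closed(3, 3))
print(count_paths_dp(32, 76) == count_paths_closed(32, 76))
```

Returning the integer rather than a formatted string keeps the exact value available for comparisons like this one.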

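##### State-Value Function
The expected sum of discounted rewards starting from a state $s$ (comparable to a discounted cash flow). Here $R(s)$ and $p(s' \mid s)$ denote the reward and transition probability obtained when actions are chosen by the policy $\pi$:

$$
\begin{aligned}
V(s) &= \mathbb{E}_{\pi}[G_t \mid S_t=s] \\
     &= R(s) + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s) V(s')
\end{aligned}
$$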
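
The recursive form can be read as a fixed-point update: start from $V = 0$ and repeatedly replace $V(s)$ with the right-hand side. The sketch below does this for a made-up two-state example; the rewards `R`, transition matrix `P`, discount `gamma`, and the function name `evaluate_mrp` are all illustrative assumptions, not part of the original notes.

```python
def evaluate_mrp(R, P, gamma, tol=1e-10, max_iters=10_000):
    """Iterate V <- R + gamma * P V until the largest change drops below tol."""
    n = len(R)
    V = [0.0] * n
    for _ in range(max_iters):
        V_new = [
            R[s] + gamma * sum(P[s][t] * V[t] for t in range(n))
            for s in range(n)
        ]
        if max(abs(a - b) for a, b in zip(V_new, V)) < tol:
            return V_new
        V = V_new
    return V

# Hypothetical example: state 1 is absorbing and yields no reward
R = [1.0, 0.0]
P = [[0.5, 0.5],
     [0.0, 1.0]]
V = evaluate_mrp(R, P, gamma=0.9)
print(V)
```

For this example the fixed point can also be solved by hand: $V(1) = 0$ and $V(0) = 1 + 0.9 \cdot 0.5 \cdot V(0)$, so $V(0) = 1/0.55$.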

##### Action-Value Function
Conditioning additionally on the action $a$ taken in state $s$ gives the action-value function:
$$
\begin{aligned}
Q(s,a) &= \mathbb{E}_{\pi}[G_t|S_t=s,A_t=a] \\
       &= R(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s, a) \sum_{a' \in \mathcal{A}} \pi(a' \mid s') Q(s',a')
\end{aligned}
$$

Evidently,
$$
V(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) Q(s,a)
$$
where $\pi(a \mid s)$ is the probability of taking action $a$ in state $s$.
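
A small numeric illustration of this relation, assuming a single state with two hypothetical actions (`left`, `right`) and made-up probabilities and action values:

```python
def state_value(pi_s, q_s):
    """V(s) = sum over a of pi(a|s) * Q(s, a), for a single state s."""
    return sum(pi_s[a] * q_s[a] for a in pi_s)

# Hypothetical policy and action values for one state
pi_s = {"left": 0.25, "right": 0.75}
q_s = {"left": 2.0, "right": 4.0}

v = state_value(pi_s, q_s)  # 0.25 * 2.0 + 0.75 * 4.0
print(v)
```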

##### Greedy Policy

Define the policy $\pi$ as a deterministic function of the state $s$ that picks the action with the highest action value:
$$
\pi(s) = \arg\max_{a} Q(s,a)
$$
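
With a tabular $Q$, the greedy policy is a per-state lookup of the best action. A minimal sketch, assuming `Q` is stored as a nested dictionary (a representation chosen here for illustration) with hypothetical states and actions:

```python
def greedy_policy(Q):
    """Map each state to an action with the highest Q-value (ties go to the first listed action)."""
    return {s: max(q_s, key=q_s.get) for s, q_s in Q.items()}

# Hypothetical Q-table: Q[state][action]
Q = {
    "s0": {"left": 1.0, "right": 2.5},
    "s1": {"left": 0.3, "right": -1.0},
}
policy = greedy_policy(Q)
print(policy)
```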

### Value Iteration vs Policy Iteration
Value Iteration repeatedly applies the Bellman Optimality Equation as an update rule:
$$
V_{k+1}(s) = \max_a \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma V_k(s') \right]
$$
Policy Iteration instead alternates policy evaluation, which solves the Bellman Expectation Equation for the current policy, with Policy Improvement:
$$
\begin{aligned}
V^{\pi_k}(s) &= \sum_{s',r} p(s',r \mid s,\pi_k(s)) \left[ r + \gamma V^{\pi_k}(s') \right] \\
\pi_{k+1}(s) &= \arg\max_a \sum_{s',r} p(s',r \mid s,a) \left[ r + \gamma V^{\pi_k}(s') \right]
\end{aligned}
$$
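
The value-iteration update can be sketched on a toy problem. Everything specific below is an assumption made for illustration: the three-state MDP, the encoding `P[s][a] = [(prob, next_state, reward), ...]`, the tolerance, and the helper names `q_value` and `value_iteration`.

```python
def q_value(P, V, s, a, gamma):
    """Expected one-step return of action a in state s, backed up from V."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

def value_iteration(P, gamma=0.9, tol=1e-10):
    """Apply V_{k+1}(s) = max_a sum p * (r + gamma * V_k(s')) until V stops changing."""
    V = {s: 0.0 for s in P}
    sweeps = 0
    while True:
        V_new = {s: max(q_value(P, V, s, a, gamma) for a in P[s]) for s in P}
        sweeps += 1
        if max(abs(V_new[s] - V[s]) for s in P) < tol:
            return V_new, sweeps
        V = V_new

# Hypothetical MDP: action 0 moves back or stays, action 1 tries to advance.
# State 2 is absorbing; reaching it from state 1 pays a reward of 1.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 0.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(0.9, 2, 1.0), (0.1, 1, 0.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},
}
gamma = 0.9
V, sweeps = value_iteration(P, gamma)

# Read off the greedy policy from the converged values
policy = {s: max(P[s], key=lambda a: q_value(P, V, s, a, gamma)) for s in P}
print(V, policy, sweeps)
```

The sketch uses a synchronous update (a fresh `V_new` per sweep) so that it matches the $V_k \to V_{k+1}$ form of the equation; updating `V` in place is a common variant.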
##### Which is better (faster)?
Policy Iteration converges by alternating between policy evaluation and policy improvement. The improvement step is cheap, since it is a single greedy $\arg\max$ over the actions in each state; the cost sits in policy evaluation, which has to solve (or iterate to convergence) the Bellman Expectation Equation for the current policy. In return, the policy typically stops changing after only a few alternations.

Value Iteration converges by applying the Bellman Optimality update to every state, sweep after sweep, until the value function stops changing. Each sweep is cheaper than a full policy evaluation, but many more sweeps are usually needed, so on small problems it often takes longer overall than Policy Iteration.
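
To make the trade-off concrete, the sketch below runs Policy Iteration on a toy problem and counts both the outer improvement rounds and the inner evaluation sweeps. The three-state MDP, the encoding `P[s][a] = [(prob, next_state, reward), ...]`, the tolerance, and all function names are assumptions chosen for illustration.

```python
def q_value(P, V, s, a, gamma):
    """Expected one-step return of action a in state s, backed up from V."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

def policy_evaluation(P, policy, gamma, tol=1e-10):
    """Iterate the Bellman Expectation Equation for a fixed policy."""
    V = {s: 0.0 for s in P}
    sweeps = 0
    while True:
        V_new = {s: q_value(P, V, s, policy[s], gamma) for s in P}
        sweeps += 1
        if max(abs(V_new[s] - V[s]) for s in P) < tol:
            return V_new, sweeps
        V = V_new

def policy_iteration(P, gamma=0.9):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    policy = {s: next(iter(P[s])) for s in P}  # start with the first action everywhere
    rounds, total_sweeps = 0, 0
    while True:
        V, sweeps = policy_evaluation(P, policy, gamma)
        total_sweeps += sweeps
        rounds += 1
        new_policy = {s: max(P[s], key=lambda a: q_value(P, V, s, a, gamma)) for s in P}
        if new_policy == policy:
            return policy, V, rounds, total_sweeps
        policy = new_policy

# Hypothetical MDP: action 0 moves back or stays, action 1 tries to advance.
# State 2 is absorbing; reaching it from state 1 pays a reward of 1.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 0.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(0.9, 2, 1.0), (0.1, 1, 0.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},
}
policy, V, rounds, total_sweeps = policy_iteration(P, gamma=0.9)
print(policy, rounds, total_sweeps)
```

On this example the policy stabilizes after a handful of rounds, while `total_sweeps` shows how much work the evaluations inside those rounds required. The counts are specific to this toy problem and should not be read as a general benchmark.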

##### The Three Properties of Dynamic Programming
1. Principle of Optimality:
   An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.
2. Memoryless Property (or Markov Property):
   The future state depends only on the current state, not on the sequence of events that preceded it.
3. Overlapping Subproblems:
   The same subproblems recur many times in a naive recursive approach, so their solutions can be stored and reused instead of recomputed.
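
The third property can be observed directly on the maze-counting recurrence. The sketch below compares a naive recursion with a memoized one; the function names and the call counter are introduced here for illustration, and `functools.lru_cache` is used as one convenient way to cache results.

```python
from functools import lru_cache

def count_paths_naive(i, j, counter):
    """Plain recursion: recomputes the same (i, j) subproblem many times."""
    counter[0] += 1
    if i == 0 or j == 0:
        return 1
    return count_paths_naive(i - 1, j, counter) + count_paths_naive(i, j - 1, counter)

@lru_cache(maxsize=None)
def count_paths_memo(i, j):
    """Same recursion, but each (i, j) is solved once and then served from the cache."""
    if i == 0 or j == 0:
        return 1
    return count_paths_memo(i - 1, j) + count_paths_memo(i, j - 1)

calls = [0]
naive = count_paths_naive(8, 8, calls)
memo = count_paths_memo(8, 8)

print(naive, memo)                           # both give the same count
print(calls[0])                              # number of naive recursive calls
print(count_paths_memo.cache_info().misses)  # number of distinct subproblems solved
```

The naive version makes tens of thousands of calls for a 9 x 9 grid, whereas the memoized version solves fewer than a hundred distinct subproblems, which is the saving that the tabular `f[i][j]` approach exploits.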