# Lecture 3: Planning by Dynamic Programming

* Dynamic programming assumes full knowledge of the MDP
* It is used for *planning* in an MDP
* For prediction:
    * Input: MDP $<S, A, P, R, \gamma>$
    * or: MRP $<S, P^{\pi}, R^{\pi}, \gamma>$
    * Output: value function $v_{\pi}$
* Or for control:
    * Input: MDP $<S, A, P, R, \gamma>$
    * Output: optimal value function $v_{*}$
    * and: optimal policy $\pi_{*}$

### Iterative Policy Evaluation

* Problem: evaluate a given policy $\pi$
* Solution: iterative application of Bellman expectation backup
* $v_{1} \rightarrow v_{2} \rightarrow ... \rightarrow v_{\pi}$
* Using synchronous backups: every sweep visits all states and computes a completely new value function $v_{k+1}$ from the previous one $v_{k}$
    * At each iteration $k+1$
    * For all states $s \in S$
    * Update $v_{k+1}(s)$ from $v_{k}(s')$
    * where $s'$ is a successor state of $s$
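
As a rough illustration of the synchronous sweep described above, the following NumPy sketch applies the Bellman expectation backup $v_{k+1}(s) = \sum_{a} \pi(a|s) \left( R_{s}^{a} + \gamma \sum_{s'} P_{ss'}^{a} v_{k}(s') \right)$ a fixed number of times. The array layout (`P[a, s, t]`, `R[s, a]`, `policy[s, a]`), the function name `policy_evaluation`, and the two-state MDP are assumptions made for this example and are not part of the lecture.

```python
import numpy as np

def policy_evaluation(P, R, policy, gamma, n_sweeps):
    """Apply n_sweeps synchronous Bellman expectation backups.

    P[a, s, t]   : probability of moving from state s to state t under action a
    R[s, a]      : expected immediate reward for taking action a in state s
    policy[s, a] : probability pi(a|s)
    """
    v = np.zeros(R.shape[0])
    for _ in range(n_sweeps):
        # q[s, a] = R[s, a] + gamma * sum_t P[a, s, t] * v[t]
        q = R + gamma * np.einsum("ast,t->sa", P, v)
        # The new array is built from the old v only (synchronous backup)
        v = (policy * q).sum(axis=1)
    return v

# Hypothetical two-state MDP: state 0 is ordinary, state 1 is terminal.
# Action 0 ("stay") keeps the agent in place, action 1 ("go") moves to state 1.
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],   # stay
    [[0.0, 1.0], [0.0, 1.0]],   # go
])
R = np.array([[-1.0, -1.0],     # every step from state 0 costs 1
              [0.0, 0.0]])      # the terminal state gives no reward
uniform = np.full((2, 2), 0.5)  # uniform random policy

v = policy_evaluation(P, R, uniform, gamma=1.0, n_sweeps=50)
print(v)  # v[0] approaches -2, the solution of v = -1 + 0.5 * v
```

Building `v` as a fresh array in each sweep is what makes the backup synchronous; updating entries in place would give an asynchronous variant instead.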

<img src="Figures/03-iterative-policy-evaluation.png" style="width: 550px;"/>

<img src="Figures/03-gridworld.png" style="width: 550px;"/>

<img src="Figures/03-policy-evaluation-1.png" style="width: 550px;"/>

<img src="Figures/03-policy-evaluation-2.png" style="width: 550px;"/>

### Policy Iteration

* Given a policy $\pi$
    * *Evaluate* the policy $\pi$
    $$ v_{\pi}(s) = \mathbb E[R_{t+1} + \gamma R_{t+2} + ... | S_{t} = s]$$
    * *Improve* the policy by acting greedily with respect to $v_{\pi}$
    $$ \pi' = \text{greedy}(v_{\pi})$$
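
The evaluate/improve loop above could be sketched as follows. This is a minimal version under stated assumptions: deterministic policies stored as an integer array `pi[s]`, a discount $\gamma < 1$ so that evaluation converges, the array layout `P[a, s, t]` and `R[s, a]`, and a made-up three-state chain. The helper names `evaluate`, `greedy` and `policy_iteration` are illustrative only.

```python
import numpy as np

def evaluate(P, R, pi, gamma, tol=1e-10):
    """Iterative policy evaluation for a deterministic policy pi[s] -> action."""
    idx = np.arange(R.shape[0])
    v = np.zeros(R.shape[0])
    while True:
        # P[pi, idx] selects, for each state s, the transition row P[pi[s], s, :]
        v_new = R[idx, pi] + gamma * P[pi, idx] @ v
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

def greedy(P, R, v, gamma):
    """Act greedily with respect to v via a one-step lookahead."""
    q = R + gamma * np.einsum("ast,t->sa", P, v)
    return np.argmax(q, axis=1)

def policy_iteration(P, R, gamma):
    pi = np.zeros(R.shape[0], dtype=int)   # arbitrary initial policy
    while True:
        v = evaluate(P, R, pi, gamma)
        pi_new = greedy(P, R, v, gamma)
        if np.array_equal(pi_new, pi):     # greedy policy is stable: stop
            return pi, v
        pi = pi_new

# Hypothetical chain 0 - 1 - 2, where state 2 is terminal.
# Action 0 moves left, action 1 moves right; each step costs 1.
P = np.zeros((2, 3, 3))
P[0, 0, 0] = P[0, 1, 0] = P[0, 2, 2] = 1.0   # left
P[1, 0, 1] = P[1, 1, 2] = P[1, 2, 2] = 1.0   # right
R = np.array([[-1.0, -1.0], [-1.0, -1.0], [0.0, 0.0]])

pi_star, v_star = policy_iteration(P, R, gamma=0.9)
print(pi_star, v_star)
```

On this chain the loop ends with the policy that moves right in both non-terminal states. The action reported for the terminal state is arbitrary, since both actions have equal value there.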
    
<img src="Figures/03-policy-iteration.png" style="width: 550px;"/>

<img src="Figures/03-policy-improvement.png" style="width: 550px;"/>

<img src="Figures/03-policy-improvement-2.png" style="width: 550px;"/>

### Modified Policy Iteration

* Policy evaluation does not need to converge to $v_{\pi}$ before the policy is improved
* We can use a stopping condition, e.g. $\epsilon$-convergence of the value function
* Or simply stop after $k$ iterations of iterative policy evaluation
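
Both stopping rules can be combined in one evaluation routine. The sketch below is only an illustration: the name `truncated_evaluation`, its parameters `max_sweeps` and `eps`, the array layout `P[a, s, t]` and `R[s, a]`, and the three-state chain are all assumptions made here.

```python
import numpy as np

def truncated_evaluation(P, R, pi, gamma, v, max_sweeps, eps=None):
    """Run at most max_sweeps backups for the deterministic policy pi, starting from v.

    If eps is given, stop early once the largest change in a sweep
    falls below eps (epsilon-convergence).
    """
    idx = np.arange(R.shape[0])
    for _ in range(max_sweeps):
        v_new = R[idx, pi] + gamma * P[pi, idx] @ v
        delta = np.max(np.abs(v_new - v))
        v = v_new
        if eps is not None and delta < eps:
            break
    return v

# Hypothetical chain 0 - 1 - 2 with terminal state 2; action 0 = left, 1 = right.
P = np.zeros((2, 3, 3))
P[0, 0, 0] = P[0, 1, 0] = P[0, 2, 2] = 1.0
P[1, 0, 1] = P[1, 1, 2] = P[1, 2, 2] = 1.0
R = np.array([[-1.0, -1.0], [-1.0, -1.0], [0.0, 0.0]])
pi = np.array([1, 1, 0])   # always move right

v_one = truncated_evaluation(P, R, pi, 0.9, np.zeros(3), max_sweeps=1)
v_eps = truncated_evaluation(P, R, pi, 0.9, np.zeros(3), max_sweeps=100, eps=1e-8)
print(v_one, v_eps)
```

After a single sweep the estimate is still rough, but it may already be good enough to improve the policy, which is the point of stopping early.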

### Value Iteration

Any optimal policy can be subdivided into two components:
* An optimal first action $A_{*}$
* Followed by an optimal policy from successor state $S'$

#### Principle of Optimality 
A policy $\pi(a|s)$ achieves the optimal value from state $s$, $v_{\pi}(s) = v_{*}(s)$, iff
* For any state $s'$ reachable from $s$
* $\pi$ achieves the optimal value from state $s'$, $v_{\pi}(s') = v_{*}(s')$

For example, whichever state $s'$ the wind might blow us to, the policy must behave optimally from that state onwards.
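
Read as a recursion, the principle says that if the solutions to the subproblems $v_{*}(s')$ are known, then $v_{*}(s)$ follows from a one-step lookahead. Writing $R_{s}^{a}$ for the expected immediate reward and $P_{ss'}^{a}$ for the transition probability (notation assumed here, matching the $R$ and $P$ of the MDP tuple), this is the Bellman optimality equation:

$$ v_{*}(s) = \max_{a \in A} \left( R_{s}^{a} + \gamma \sum_{s' \in S} P_{ss'}^{a} v_{*}(s') \right) $$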

#### Value iteration in MDPs

* Problem: find optimal policy $\pi_{*}$
* Solution: iterative application of Bellman optimality backup
* $v_{1} \rightarrow v_{2} \rightarrow ... \rightarrow v_{*}$
* Using synchronous backups
    * At each iteration $k+1$
    * For all states $s \in S$
    * Update $v_{k+1}(s)$ from $v_{k}(s')$
* Unlike policy iteration, there is no explicit policy
* Intermediate value functions may not correspond to any policy
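
A minimal NumPy sketch of these steps is shown below. It assumes the array layout `P[a, s, t]` and `R[s, a]`, a tolerance-based stopping rule, and a made-up three-state chain, none of which come from the lecture. The greedy policy is read off only at the end, since the loop itself works purely on value functions.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-10):
    """Repeat synchronous Bellman optimality backups until the values stop changing."""
    v = np.zeros(R.shape[0])
    while True:
        # q[s, a] = R[s, a] + gamma * sum_t P[a, s, t] * v[t]
        q = R + gamma * np.einsum("ast,t->sa", P, v)
        v_new = q.max(axis=1)              # max over actions, no explicit policy
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=1) # extract a greedy policy at the end
        v = v_new

# Hypothetical chain 0 - 1 - 2 with terminal state 2; action 0 = left, 1 = right.
P = np.zeros((2, 3, 3))
P[0, 0, 0] = P[0, 1, 0] = P[0, 2, 2] = 1.0
P[1, 0, 1] = P[1, 1, 2] = P[1, 2, 2] = 1.0
R = np.array([[-1.0, -1.0], [-1.0, -1.0], [0.0, 0.0]])

v_star, pi_star = value_iteration(P, R, gamma=0.9)
print(v_star, pi_star)
```

Compared with policy iteration, the only change in the backup is that the policy-weighted average over actions is replaced by a maximum.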

<img src="Figures/03-value-iteration.png" style="width: 550px;"/>

<img src="Figures/03-async-dp.png" style="width: 550px;"/>