# Dynamic Programming

## Policy Evaluation vs. Control

**Policy evaluation** is the task of determining the value function for a specific policy. **Control** is the task of finding a policy that obtains as much reward as possible; in other words, a policy that maximizes the value function. Control is the ultimate goal of reinforcement learning, but policy evaluation is usually a necessary first step: it is hard to improve a policy if we have no way to assess how good it is.

**Dynamic programming algorithms** use the Bellman equations to define iterative algorithms for both policy evaluation and control. Before diving into the details of this approach, imagine someone hands you a policy and your job is to determine how good that policy is. Formally, policy evaluation is the task of determining the state-value function $v_{\pi}$ for a particular policy $\pi$ ($\pi \rightarrow v_{\pi}$). Recall that the value of a state under a policy $\pi$ is the expected return from that state if we act according to $\pi$. The return ($G_t$) is itself a discounted sum of future rewards.

$$
v_{\pi}(s) \doteq \mathbb{E}_{\pi} [\color{blue}{G_t} | S_t = s] \hspace{100px} \color{blue}{G_t = \sum\limits_{k=0}^{\infty} \gamma^k R_{t+k+1}}
$$
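As a small illustration of the return $G_t$, the sketch below computes the discounted sum for a finite reward sequence. The function name and the sample rewards are hypothetical, and truncating the infinite sum to a finite list is an assumption made only for the example.

```python
def discounted_return(rewards, gamma):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for a finite list of rewards."""
    g = 0.0
    # Work backwards using the recursion G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical rewards R_{t+1}, R_{t+2}, R_{t+3}.
rewards = [1.0, 0.0, 2.0]
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
```

The backward loop is the same recursion that underlies the Bellman equation: the return from one step equals the immediate reward plus the discounted return from the next step.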

We have seen how the Bellman equation reduces the problem of finding $v_{\pi}$ to a system of linear equations, one equation for each state. 

$$
v_{\color{green}{\pi}}(s) = \sum\limits_a \color{green}{\pi}(a|s) \sum\limits_{s'}\sum\limits_{r} p(s',r | s, a) \left [ r + \gamma v_{\color{green}{\pi}}(s') \right ]
$$

So the problem of policy evaluation reduces to solving this system of linear equations. In principle, we could approach this task with a variety of methods from linear algebra.

<img src="images/linear_system_solver.svg" width="35%" align="center"/>
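To make the linear-algebra route concrete: once the policy is fixed, the Bellman equation can be written in matrix form as $(I - \gamma P_{\pi}) v_{\pi} = r_{\pi}$, where $P_{\pi}$ holds the state-to-state transition probabilities under $\pi$ and $r_{\pi}$ the expected one-step rewards. The sketch below solves that system with NumPy for a hypothetical two-state MDP; the names `P_pi`, `r_pi`, and `evaluate_policy_exact` are illustrative assumptions, not part of the original notes.

```python
import numpy as np


def evaluate_policy_exact(P_pi, r_pi, gamma):
    """Solve (I - gamma * P_pi) v = r_pi for the state values v."""
    n = P_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)


# Hypothetical 2-state example, already averaged over the policy:
# state 0 stays or moves to state 1 with equal probability and earns reward 1;
# state 1 is absorbing and earns reward 0.
P_pi = np.array([[0.5, 0.5],
                 [0.0, 1.0]])
r_pi = np.array([1.0, 0.0])

v = evaluate_policy_exact(P_pi, r_pi, gamma=0.9)
print(v)  # v[0] is roughly 1.82, v[1] is 0
```

A direct solve costs roughly cubic time in the number of states, which is one reason iterative methods become attractive as the state space grows.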

In practice, the iterative solution methods of dynamic programming are more suitable for general MDPs. 

<img src="images/dynamic_programming.svg" width="35%" align="center"/>
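As a rough sketch of what such an iterative method can look like, the code below turns the Bellman equation for $v_{\pi}$ into an update rule and sweeps over the states until the values stop changing. The data layout (`policy[s][a]` for $\pi(a|s)$, `dynamics[(s, a)]` as a list of `(prob, next_state, reward)` triples for $p$), the threshold `theta`, and the two-state MDP are all assumptions made for illustration.

```python
def iterative_policy_evaluation(states, policy, dynamics, gamma, theta=1e-10):
    """Approximate v_pi by repeatedly applying the Bellman equation as an update.

    policy[s][a]     -> pi(a|s)
    dynamics[(s, a)] -> list of (prob, next_state, reward) triples, i.e. p(s', r | s, a)
    """
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = sum(
                pi_a * prob * (reward + gamma * v[s_next])
                for a, pi_a in policy[s].items()
                for prob, s_next, reward in dynamics[(s, a)]
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v  # in-place update: later states in the sweep see the new value
        if delta < theta:
            return v


# Hypothetical two-state MDP: "B" is absorbing with zero reward.
states = ["A", "B"]
policy = {"A": {"stay": 0.5, "go": 0.5}, "B": {"stay": 1.0}}
dynamics = {
    ("A", "stay"): [(1.0, "A", 1.0)],
    ("A", "go"): [(1.0, "B", 1.0)],
    ("B", "stay"): [(1.0, "B", 0.0)],
}

v = iterative_policy_evaluation(states, policy, dynamics, gamma=0.9)
print(v)  # v["A"] approaches 1 / 0.55, v["B"] stays at 0
```

The loop only needs one pass over the dynamics per sweep and never forms or inverts a matrix.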

Control is the task of improving a policy. Recall that a policy $\pi_2$ is as good as or better than a policy $\pi_1$ (panel (a) in the image below) if the value under $\pi_2$ is greater than or equal to the value under $\pi_1$ in every state. We say $\pi_2$ is strictly better than $\pi_1$ if $\pi_2$ is as good as or better than $\pi_1$ and there is at least one state where the value under $\pi_2$ is strictly greater than the value under $\pi_1$. The goal of the control task is to modify a policy to produce a new one that is strictly better. We can then try to improve the policy repeatedly to obtain a sequence of better and better policies. When this is no longer possible, no policy is strictly better than the current one, so the current policy must be an optimal policy ($\pi_*$) and the control task is complete.

<img src="images/control.svg" width="70%" align="center"/>
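The two comparisons between policies can be written down directly once their value functions are known. The sketch below assumes each value function is stored as a dictionary from state to value; the function names and the sample values are hypothetical.

```python
def as_good_or_better(v_new, v_old):
    """True if v_new(s) >= v_old(s) in every state."""
    return all(v_new[s] >= v_old[s] for s in v_old)


def strictly_better(v_new, v_old):
    """True if v_new is as good or better everywhere and greater in at least one state."""
    return as_good_or_better(v_new, v_old) and any(v_new[s] > v_old[s] for s in v_old)


# Hypothetical value functions for two policies over the same states.
v_pi1 = {"A": 1.0, "B": 2.0}
v_pi2 = {"A": 1.0, "B": 3.0}

print(as_good_or_better(v_pi2, v_pi1))  # True
print(strictly_better(v_pi2, v_pi1))    # True
print(strictly_better(v_pi1, v_pi1))    # False: equal everywhere, greater nowhere
```

Note that this is only a partial order: two policies can each be better in a different state, in which case neither is as good as or better than the other.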

Imagine we had access to the dynamics of the environment ($p$). We will learn how to use this knowledge to solve the tasks of policy evaluation and control. Even with access to these dynamics, we need careful thought and clever algorithms to compute value functions and optimal policies. We will investigate a class of solution methods called dynamic programming for this purpose. Dynamic programming uses the various Bellman equations we have seen, along with knowledge of $p$, to work out value functions and optimal policies. Classical dynamic programming does not involve interaction with the environment at all; instead, it computes value functions and optimal policies from a model of the MDP. Nonetheless, dynamic programming is very useful for understanding other reinforcement learning algorithms. Most reinforcement learning algorithms can be seen as an approximation to dynamic programming without the model. This connection is perhaps most striking in temporal-difference-based algorithms.
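One way to combine evaluation and improvement using only the model $p$ is the loop commonly called policy iteration: evaluate the current policy, act greedily with respect to the resulting values, and stop when the policy no longer changes. The sketch below is a minimal version under stated assumptions: deterministic policies, a `dynamics[(s, a)]` table of `(prob, next_state, reward)` triples, and a hypothetical two-state MDP. It is not necessarily the exact algorithm developed later in these notes.

```python
def evaluate(states, policy, dynamics, gamma, theta=1e-10):
    """Iteratively evaluate a deterministic policy (policy[s] is an action)."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = sum(p * (r + gamma * v[s2]) for p, s2, r in dynamics[(s, policy[s])])
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v


def greedy_policy(states, actions, dynamics, v, gamma):
    """Pick, in each state, the action with the highest one-step lookahead value."""
    def q(s, a):
        return sum(p * (r + gamma * v[s2]) for p, s2, r in dynamics[(s, a)])
    # Ties are broken by the order of actions[s].
    return {s: max(actions[s], key=lambda a: q(s, a)) for s in states}


def policy_iteration(states, actions, dynamics, gamma):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    policy = {s: actions[s][0] for s in states}  # arbitrary starting policy
    while True:
        v = evaluate(states, policy, dynamics, gamma)
        improved = greedy_policy(states, actions, dynamics, v, gamma)
        if improved == policy:
            return policy, v
        policy = improved


# Hypothetical MDP: staying in "B" pays 2 per step, moving from "A" to "B" pays 1.
states = ["A", "B"]
actions = {"A": ["stay", "go"], "B": ["stay", "go"]}
dynamics = {
    ("A", "stay"): [(1.0, "A", 0.0)],
    ("A", "go"): [(1.0, "B", 1.0)],
    ("B", "stay"): [(1.0, "B", 2.0)],
    ("B", "go"): [(1.0, "A", 0.0)],
}

policy, v = policy_iteration(states, actions, dynamics, gamma=0.9)
print(policy)  # {'A': 'go', 'B': 'stay'}
```

Nothing in this loop samples from the environment; every quantity is computed from the `dynamics` table, which is what distinguishes classical dynamic programming from learning methods that lack the model.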