# Dynamic Programming

## Introduction

The term dynamic programming (DP) refers to a collection of algorithms that can be
used to compute optimal policies given a perfect model of the environment as a Markov
decision process (MDP). Classical DP algorithms are of limited utility in reinforcement
learning both because of their assumption of a perfect model and because of their great
computational expense, but they are still important theoretically. In fact, they are the building blocks for other methods that need less computation and do not assume a perfect model of the environment.

Assume that the environment is a finite MDP. That is, we assume that its state, action, and reward sets are finite, and that its dynamics are given by a
set of probabilities $p(s^{\prime}, r | s, a)$. Although
DP ideas can be applied to problems with continuous state and action spaces, exact
solutions are possible only in special cases. A common way of obtaining approximate
solutions for tasks with continuous states and actions is to quantize the state and action
spaces and then apply finite-state DP methods.
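As a minimal sketch of the quantization step, the function below maps a continuous scalar to a bin index that can then serve as a discrete state. The function name `quantize`, the uniform bin width, and the clipping of out-of-range inputs are illustrative assumptions, not part of any particular method described above.

```python
def quantize(x, low, high, n_bins):
    """Map a continuous value x in [low, high] to a bin index in 0..n_bins-1."""
    if not low < high:
        raise ValueError("low must be strictly less than high")
    # Clip so that out-of-range inputs fall into the first or last bin.
    x = min(max(x, low), high)
    width = (high - low) / n_bins
    # x == high would produce index n_bins, so cap at the last bin.
    return min(int((x - low) / width), n_bins - 1)


# Hypothetical usage: a position in [0, 1] split into 4 discrete states.
print(quantize(0.25, 0.0, 1.0, 4))  # bin 1
print(quantize(1.0, 0.0, 1.0, 4))   # bin 3 (upper boundary goes to the last bin)
```

Finer bins approximate the continuous problem more closely, but the number of discrete states grows with the bin count in every dimension.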

The key idea of DP, and of reinforcement learning generally, is the use of value functions
to organize and structure the search for good policies. From the Bellman optimality equations, we know that

$v_*(s) = \max_{a \in A(s)} q_*(s, a) = \max_{a \in A(s)} \sum_{s^{\prime}, r} p(s^{\prime}, r | s, a)[r + \gamma v_{*}(s^\prime)]$

$q_*(s, a) = E[R_{t+1} + \gamma v_*(S_{t+1}) | S_t=s, A_t=a] = \sum_{s^{\prime}, r} p(s^{\prime}, r | s, a)[r + \gamma \max_{a^{\prime} \in A(s^\prime)} q_*(s^{\prime}, a^{\prime})]$

We can easily obtain optimal policies once we have found the optimal value function $v_{*}$ or $q_{*}$.
DP algorithms are obtained by
turning Bellman equations such as these into assignments, that is, into update rules for
improving approximations of the desired value functions.
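To illustrate how an optimal policy can be read off from $v_{*}$, here is a minimal sketch of a one-step lookahead that picks the action maximizing $\sum_{s^{\prime}, r} p(s^{\prime}, r | s, a)[r + \gamma v_{*}(s^\prime)]$. The data layout (`P[s][a]` as a list of `(probability, next_state, reward)` tuples), the function name `greedy_policy`, and the two-state MDP are all assumptions made for this example.

```python
def greedy_policy(P, v, gamma):
    """Return a deterministic policy {state: action} that is greedy w.r.t. v."""
    policy = {}
    for s, actions in P.items():
        if not actions:  # terminal state: there is no action to choose
            continue
        # q(s, a) = sum over (s', r) of p(s', r | s, a) * (r + gamma * v(s'))
        q = {
            a: sum(prob * (reward + gamma * v[s_next])
                   for prob, s_next, reward in outcomes)
            for a, outcomes in actions.items()
        }
        policy[s] = max(q, key=q.get)
    return policy


# Hypothetical MDP: in "A", "stay" pays 0 and remains in "A";
# "go" pays 1 and moves to the terminal state "T".
P = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(1.0, "T", 1.0)]},
    "T": {},
}
v_star = {"A": 1.0, "T": 0.0}  # optimal values of this toy MDP for gamma = 0.9
print(greedy_policy(P, v_star, gamma=0.9))  # {'A': 'go'}
```

With $q_{*}$ instead of $v_{*}$, the lookahead is unnecessary: the policy simply takes the action with the largest $q_{*}(s, a)$ in each state, without consulting the model.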

## Policy Evaluation (Prediction)

We first consider how to compute the state-value function $v_{\pi}$ for any policy $\pi$. This is called policy evaluation in the DP literature. We also refer to it as the prediction
problem. We know that for all $s \in S$

$v_{\pi} (s) = E_{\pi}[G_t | S_t = s] = \sum_{a} \pi(a | s) \sum_{s^{\prime}}\sum_{r}p(s^{\prime}, r | s, a)[r + \gamma v_{\pi}(s^\prime)]$

The existence and uniqueness of $v_{\pi}$ are guaranteed if $\gamma < 1$ or if eventual termination is guaranteed from every state under the policy $\pi$ (an episodic task). If the environment's dynamics are completely known, then the equation above is a system of $|S|$ simultaneous linear equations in $|S|$ unknowns (the values $v_{\pi}(s)$, $s \in S$). In principle, its solution
is a straightforward, if tedious, computation. For our purposes, iterative solution methods
are most suitable.
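To make the linear system explicit, it can be written in matrix form. The symbols $r_{\pi}$ (expected one-step reward vector) and $P_{\pi}$ (state-transition matrix under $\pi$) are notation introduced here for illustration only:

$r_{\pi}(s) = \sum_{a} \pi(a | s) \sum_{s^{\prime}, r} p(s^{\prime}, r | s, a) \, r, \qquad P_{\pi}(s, s^{\prime}) = \sum_{a} \pi(a | s) \sum_{r} p(s^{\prime}, r | s, a)$

$v_{\pi} = r_{\pi} + \gamma P_{\pi} v_{\pi} \iff (I - \gamma P_{\pi}) v_{\pi} = r_{\pi}$

When $\gamma < 1$, the matrix $I - \gamma P_{\pi}$ is invertible, so $v_{\pi} = (I - \gamma P_{\pi})^{-1} r_{\pi}$. Solving this directly by Gaussian elimination takes on the order of $|S|^3$ operations, which is one reason to prefer iterative methods when the state space is large.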

### Iterative policy evaluation

Consider a sequence of approximate value functions $v^{0}_{\pi}, v^{1}_{\pi}, v^{2}_{\pi}, \ldots$, each mapping states to real numbers. The initial approximation, $v^{0}_{\pi}$, is chosen arbitrarily, except that the terminal state, if any, must be given value 0. Each successive approximation is obtained by using the
Bellman equation for $v_{\pi}$ as an update rule:

$v^{k+1}_{\pi} (s) = E_{\pi}[R_{t+1} + \gamma v_{\pi}^{k}(S_{t+1}) | S_t = s] = \sum_{a} \pi(a | s) \sum_{s^{\prime}}\sum_{r}p(s^{\prime}, r | s, a)[r + \gamma v_{\pi}^{k}(s^\prime)]$

$v_{\pi}$ is a fixed point of this update rule, because the Bellman equation for $v_{\pi}$ guarantees equality when $v^{k}_{\pi} = v_{\pi}$. Indeed, the sequence $\{v^{k}_{\pi}\}$ can be shown in general to converge to $v_{\pi}$ as $k \rightarrow \infty$ under the same conditions that guarantee the existence of $v_{\pi}$, namely
$\gamma < 1$ or an episodic task. This algorithm is called **iterative policy evaluation**.

To produce $v^{k+1}_{\pi}$ from $v^{k}_{\pi}$, iterative policy evaluation applies the same operation to each state $s$: it replaces the old value $v^{k}_{\pi}(s)$ with a new value obtained from the old values of the successor states, $v^{k}_{\pi}(s^\prime)$, and the expected immediate rewards, along all the one-step transitions possible under the policy being
evaluated. We call this kind of operation an **expected update**, because it is based on an expectation over all possible next states rather than on a single sampled next state. Each iteration of iterative policy evaluation
updates the value of every state once to produce the new approximate value function.
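The sweep described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the layout `P[s][a]` (a list of `(probability, next_state, reward)` tuples), the policy table `pi[s][a]`, the function name, and the toy MDP are assumptions introduced for the example. Each call builds a fresh dictionary, so every new value is computed from old values only.

```python
def expected_update_sweep(P, pi, v_old, gamma):
    """One sweep of expected updates: compute v^{k+1} from v^{k} for every state."""
    v_new = {}
    for s in P:
        total = 0.0
        for a, outcomes in P[s].items():
            for prob, s_next, reward in outcomes:
                # pi(a|s) * p(s', r | s, a) * [r + gamma * v^k(s')]
                total += pi[s][a] * prob * (reward + gamma * v_old[s_next])
        v_new[s] = total  # terminal states have no actions, so they stay at 0
    return v_new


# Hypothetical MDP: from "A", "left" stays in "A" and "right" terminates;
# every step costs -1, and the policy picks each action with probability 0.5.
P = {
    "A": {"left": [(1.0, "A", -1.0)], "right": [(1.0, "T", -1.0)]},
    "T": {},
}
pi = {"A": {"left": 0.5, "right": 0.5}, "T": {}}

v = {s: 0.0 for s in P}  # v^0: all zeros, including the terminal state
for k in range(3):
    v = expected_update_sweep(P, pi, v, gamma=1.0)
    print(k + 1, v["A"])  # prints -1.0, then -1.5, then -1.75
```

In this toy problem the exact value solves $v = -1 + 0.5 v$, giving $v_{\pi}(A) = -2$, and the printed iterates move toward it by halving the remaining error on each sweep.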

### Implementation of iterative policy evaluation

To write a sequential computer program that implements iterative policy evaluation as given above, the direct approach uses **two arrays**, one for the old values $v^{k}_{\pi}$ and one for the new values $v^{k+1}_{\pi}$.
With two arrays, the new values can be computed one by one from the old values without the old values being changed. Alternatively, we could use **one array** and update the values in place, with each new value immediately overwriting the old one. Then, depending on
the order in which the states are updated, new values are sometimes used instead of old ones on the right-hand side of the update. This in-place algorithm also converges to $v_{\pi}$. In fact, it usually converges faster than the two-array version, because it uses new data as soon as they are available. We think of the updates as
being done in a sweep through the state space. For
the in-place algorithm, the order in which states have their values updated during the
sweep has a significant influence on the rate of convergence. We usually have the in-place
version in mind when we think of DP algorithms.

<img src='inplace-algorithm.png'>

The algorithm above is the in-place version. It stops when the largest change in any state's value during a sweep, $\max_{s} |v^{k+1}_{\pi}(s) - v^{k}_{\pi}(s)|$, is less than some threshold $\theta$.
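A minimal Python sketch of the in-place variant with the $\theta$ stopping rule is given below. It is written under assumptions of its own: the layout `P[s][a]` (a list of `(probability, next_state, reward)` tuples), the policy table `pi[s][a]`, the default threshold, and the three-state MDP are hypothetical and are not taken from the figure.

```python
def iterative_policy_evaluation(P, pi, gamma, theta=1e-8):
    """In-place iterative policy evaluation; returns (values, number_of_sweeps)."""
    v = {s: 0.0 for s in P}  # arbitrary start; terminal states must be 0
    sweeps = 0
    while True:
        delta = 0.0  # largest value change seen during this sweep
        for s in P:
            if not P[s]:  # terminal state: its value stays 0
                continue
            new_value = sum(
                pi[s][a] * prob * (reward + gamma * v[s_next])
                for a, outcomes in P[s].items()
                for prob, s_next, reward in outcomes
            )
            delta = max(delta, abs(new_value - v[s]))
            v[s] = new_value  # overwrite immediately: later states see this value
        sweeps += 1
        if delta < theta:
            return v, sweeps


# Hypothetical chain A <-> B -> T with reward -1 per step and a uniform random policy.
P = {
    "A": {"left": [(1.0, "A", -1.0)], "right": [(1.0, "B", -1.0)]},
    "B": {"left": [(1.0, "A", -1.0)], "right": [(1.0, "T", -1.0)]},
    "T": {},
}
pi = {s: {a: 0.5 for a in P[s]} for s in P}

v, sweeps = iterative_policy_evaluation(P, pi, gamma=1.0)
# v["A"] is close to -6 and v["B"] is close to -4, the exact solution of
# v(A) = -1 + 0.5 v(A) + 0.5 v(B) and v(B) = -1 + 0.5 v(A).
print(v, sweeps)
```

Because `v[s]` is overwritten inside the loop, the update for `"B"` already uses the value of `"A"` computed earlier in the same sweep, which is exactly the difference from the two-array version.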

## Policy improvement

## Policy iteration

## Value iteration