# Dynamic Programming

## Introduction

The term dynamic programming (DP) refers to a collection of algorithms that can be
used to compute optimal policies given a **perfect model of the environment** as a Markov
decision process (MDP). Classical DP algorithms are of limited utility in reinforcement
learning both because of their assumption of a perfect model and because of their great
computational expense, but they are still important theoretically. In fact, they are the building blocks for other methods that need less computation and do not assume a perfect model of the environment.

Assume that the problem is a finite MDP. That is, we assume that its state, action, and reward sets are finite, and that its dynamics are given by a
set of probabilities $p(s^{\prime}, r | s, a)$. Although
DP ideas can be applied to problems with continuous state and action spaces, exact
solutions are possible only in special cases. A common way of obtaining approximate
solutions for tasks with continuous states and actions is to quantize the state and action
spaces and then apply finite-state DP methods.
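As a rough illustration of the quantization step, the sketch below maps a continuous scalar state to one of a fixed number of equal-width bins. The function name `quantize`, the bounds, and the bin count are hypothetical choices made for this example; real tasks may need non-uniform or multi-dimensional grids.

```python
def quantize(x, low, high, n_bins):
    """Map a continuous value x to a bin index in 0 .. n_bins - 1.

    Assumes equal-width bins over [low, high]; values outside are clipped.
    """
    x = min(max(x, low), high)                      # clip into the covered range
    index = int((x - low) / (high - low) * n_bins)  # position of x on the grid
    return min(index, n_bins - 1)                   # x == high falls in the last bin


# Example: a position in [0, 1] discretized into 4 states.
print([quantize(x, 0.0, 1.0, 4) for x in (0.0, 0.25, 0.6, 1.0)])
```

The resulting bin indices can then serve as the finite state set of the MDP.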

The key idea of DP, and of reinforcement learning generally, is the use of value functions
to organize and structure the search for good policies. From the Bellman optimality equations, we know that

$v_*(s) = \text{max}_{a \in A(s)} q_{*}(s, a) = \text{max}_{a \in A(s)} \sum_{s^{\prime}, r}p(s^{\prime}, r | s, a)[r + \gamma v_{*}(s^\prime)]$

$q_* (s, a) = E[R_{t+1} + \gamma v_*(S_{t+1}) | S_t=s, A_t=a] = \sum_{s^{\prime}, r} p(s^{\prime}, r | s, a)[r + \gamma \text{max}_{a^{\prime} \in A(s^\prime)} q_{*}(s^{\prime}, a^{\prime})]$

We can easily obtain optimal policies once we have found the optimal value function $v_{*}$ or $q_{*}$.
DP algorithms are obtained by
turning Bellman equations such as these into assignments, that is, into update rules for
improving approximations of the desired value functions.
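To make "easily obtain" concrete: given $q_*$, an optimal policy picks a maximizing action in each state, and given $v_*$ it does the same after a one-step lookahead through the model. In the notation used above:

$\pi_*(s) = \text{argmax}_{a \in A(s)} q_{*}(s, a) = \text{argmax}_{a \in A(s)} \sum_{s^{\prime}, r}p(s^{\prime}, r | s, a)[r + \gamma v_{*}(s^\prime)]$

Note that the second form requires the dynamics $p$, while the first does not.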

## Policy Evaluation (Prediction)

We first consider how to compute the state-value function $v_{\pi}$ for any policy $\pi$. This is called policy evaluation in the DP literature. We also refer to it as the prediction
problem. We know that for all $s \in S$

$v_{\pi} (s) = E_{\pi}[G_t | S_t = s] = \sum_{a} \pi(a | s) \sum_{s^{\prime}}\sum_{r}p(s^{\prime}, r | s, a)[r + \gamma v_{\pi}(s^\prime)]$

The existence and uniqueness of $v_{\pi}$ are guaranteed if $\gamma < 1$ or if the task is episodic, so that termination is guaranteed from every state under $\pi$. If the environment's dynamics are completely known, then the equation above is a system of $|S|$ simultaneous linear equations in $|S|$ unknowns, the values $v_{\pi}(s)$ for $s \in S$. In principle, its solution
is a straightforward, if tedious, computation. For our purposes, iterative solution methods
are most suitable.
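To see the linear system explicitly, it can help to collect the policy-averaged quantities into a vector and a matrix. The symbols $r_{\pi}$ and $P_{\pi}$ below are shorthand introduced only for this sketch:

$r_{\pi}(s) = \sum_{a} \pi(a | s) \sum_{s^{\prime}, r} p(s^{\prime}, r | s, a) \, r, \qquad P_{\pi}(s, s^{\prime}) = \sum_{a} \pi(a | s) \sum_{r} p(s^{\prime}, r | s, a)$

With these, the Bellman equation for $v_{\pi}$ reads

$v_{\pi} = r_{\pi} + \gamma P_{\pi} v_{\pi} \quad \Longleftrightarrow \quad (I - \gamma P_{\pi}) v_{\pi} = r_{\pi}$

When $\gamma < 1$ the matrix $I - \gamma P_{\pi}$ is invertible, so $v_{\pi} = (I - \gamma P_{\pi})^{-1} r_{\pi}$. Solving this directly by Gaussian elimination costs on the order of $|S|^3$ operations, which is part of why iterative methods tend to be preferred for large state spaces.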

### Iterative policy evaluation

Consider a sequence of approximate value functions $v^{0}_{\pi}, v^{1}_{\pi}, v^{2}_{\pi}, \dots$. The initial approximation, $v^{0}_{\pi}$, is chosen arbitrarily, except that the terminal state, if any, must be given value 0, and each successive approximation is obtained by using the
Bellman equation for $v_{\pi}$ as an update rule:

$v^{k+1}_{\pi} (s) = E_{\pi}[R_{t+1} + \gamma v_{\pi}^{k}(S_{t+1}) | S_t = s] = \sum_{a} \pi(a | s) \sum_{s^{\prime}}\sum_{r}p(s^{\prime}, r | s, a)[r + \gamma v_{\pi}^{k}(s^\prime)]$

Indeed, the sequence $\{v^{k}_{\pi}\}$ can be shown in general to converge to $v_{\pi}$ as $k \rightarrow \infty$ under the same conditions that guarantee the existence of $v_{\pi}$, namely
$\gamma < 1$ or an episodic task. This algorithm is called **iterative policy evaluation**.

To produce $v^{k+1}_{\pi}$ from $v^{k}_{\pi}$, iterative policy evaluation applies the same operation to each state $s$: it replaces the old value $v^{k}_{\pi}(s)$ with a new value obtained from the old values of the successor states of $s$, $v^{k}_{\pi}(s^\prime)$, and the expected immediate rewards, along all the one-step transitions possible under the policy being
evaluated. We call this kind of operation an **expected update**. Each iteration of iterative policy evaluation
updates the value of every state once to produce the new approximate value function.
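The expected update and a full iteration can be sketched as follows. This is a minimal illustration on a hypothetical three-state MDP invented for this example (the dictionary layout `P[s][a]`, the discount `GAMMA`, and the function names are all assumptions, not part of the original text). It keeps the old and new value functions separate, matching the update rule above.

```python
GAMMA = 0.9

# Hypothetical toy MDP: states 0 and 1 are non-terminal, state 2 is terminal.
# P[s][a] is a list of (probability, next_state, reward) triples.
# Action 0 moves left (bumping into a wall at state 0), action 1 moves right.
P = {
    0: {0: [(1.0, 0, -1.0)], 1: [(1.0, 1, -1.0)]},
    1: {0: [(1.0, 0, -1.0)], 1: [(1.0, 2, -1.0)]},
    2: {},  # terminal state: no actions, value fixed at 0
}

# Equiprobable random policy: pi[s][a] is the probability of taking a in s.
pi = {s: {a: 1.0 / len(acts) for a in acts} for s, acts in P.items() if acts}


def expected_update(v_old, s):
    """New value of s computed from the old values of its successor states."""
    return sum(
        pi[s][a] * prob * (reward + GAMMA * v_old[s_next])
        for a, outcomes in P[s].items()
        for prob, s_next, reward in outcomes
    )


def sweep(v_old):
    """One iteration: every state is updated once, reading only from v_old."""
    v_new = dict(v_old)
    for s in pi:  # non-terminal states only
        v_new[s] = expected_update(v_old, s)
    return v_new


v = {s: 0.0 for s in P}  # arbitrary start, terminal value 0
for k in range(200):
    v = sweep(v)
print(v)
```

A fixed number of sweeps is used here only to keep the sketch short; a stopping rule based on the size of the change is more usual.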

### Implementation of iterative policy evaluation

To write a sequential computer program to implement iterative policy evaluation as above, we need to use **two arrays**, one for the old values $v^{k}_{\pi}$, and one for the new values $v^{k+1}_{\pi}$.
With two arrays, the new values can be computed one by one from the old values without the old values being changed. Alternatively, we could use **one array** and update the values in place, that is, with each new value immediately overwriting the old one. Then, depending on
the order in which the states are updated, sometimes new values are used instead of old ones. This in-place algorithm also converges to $v_{\pi}$. In fact, it usually converges faster than the two-array version, because it uses new data as soon as they are available. We think of the updates as
being done in a sweep through the state space. For
the in-place algorithm, the order in which states have their values updated during the
sweep has a significant influence on the rate of convergence. We usually have the in-place
version in mind when we think of DP algorithms.

<img src='pngs/inplace-algorithm.png'>

The algorithm above is the in-place version. It stops when the largest change in any state's value during a sweep, $\text{max}_{s} |v^{k+1}_{\pi}(s) - v^{k}_{\pi}(s)|$, is less than some small threshold $\theta > 0$.
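A minimal Python sketch of the in-place variant with the threshold test is given below. It is an illustration under assumptions, not a transcription of the figure: the toy MDP, the dictionary layout `P[s][a]`, and the names `evaluate_in_place`, `gamma`, and `theta` are all chosen for this example.

```python
def evaluate_in_place(P, pi, gamma=0.9, theta=1e-10):
    """In-place iterative policy evaluation; returns (values, sweeps used)."""
    v = {s: 0.0 for s in P}  # single array; terminal states stay at 0
    sweeps = 0
    while True:
        delta = 0.0  # largest change seen in this sweep
        for s, actions in P.items():
            if not actions:  # terminal state: nothing to update
                continue
            old = v[s]
            # v already holds new values for states updated earlier in this sweep
            v[s] = sum(
                pi[s][a] * prob * (reward + gamma * v[s_next])
                for a, outcomes in actions.items()
                for prob, s_next, reward in outcomes
            )
            delta = max(delta, abs(old - v[s]))
        sweeps += 1
        if delta < theta:
            return v, sweeps


# Hypothetical toy MDP: P[s][a] = [(probability, next_state, reward), ...]
P = {
    0: {0: [(1.0, 0, -1.0)], 1: [(1.0, 1, -1.0)]},
    1: {0: [(1.0, 0, -1.0)], 1: [(1.0, 2, -1.0)]},
    2: {},  # terminal
}
pi = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}  # equiprobable random policy
v, sweeps = evaluate_in_place(P, pi)
print(v, sweeps)
```

The states are swept in dictionary order here; a different order would change how quickly `delta` shrinks, but not the limit.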

## Policy improvement

### Deterministic policy $\pi(s)$

Our reason for computing the value function for a policy is to help find better policies. Suppose we have determined the value function $v_{\pi}$ for an arbitrary deterministic policy $\pi$. For some state s, we would like
to know whether or not we should change the policy to **deterministically** choose an action $a \neq \pi(s)$. We know how good it is to follow the current policy from $s$, namely $v_{\pi}(s)$. But would it be better or worse to change to the
new policy? One way to answer this question is to consider selecting $a$ in $s$ and thereafter following the existing policy $\pi$. The value of this way of behaving is:

$q_{\pi}(s, a) = E[R_{t+1} + \gamma v_{\pi}(S_{t+1}) | S_t=s, A_t=a] = \sum_{s^{\prime}, r} p(s^{\prime}, r | s, a)[r + \gamma v_{\pi}(s^{\prime})]$

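The key criterion is whether this is greater than or less than $v_{\pi} (s)$. If it is greater, that is, if it is better to select $a$ once in $s$ and thereafter follow $\pi$ than to follow $\pi$ all the time, then one would expect it to be better still to select $a$ every time $s$ is encountered, and that the new policy would in fact be a better one overall. That this is true is a special case of a general result called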
the *policy improvement theorem*.

Let $\pi, \pi^\prime$ be any pair of deterministic policies such that, for all $s \in S$:

$q_{\pi} (s, \pi^\prime (s)) \geq v_{\pi} (s)$ *(1)*

Then the policy $\pi^\prime$ must be as good as, or better than, $\pi$. That is, it must obtain greater or equal expected return from all states $s \in S$:

$v_{\pi^\prime} (s) \geq v_{\pi} (s)$ *(2)*

Easy proof:

\begin{aligned}
v_{\pi} (s) & \leq q_{\pi} (s, \pi^{\prime}(s))\\
& = E[R_{t+1} + \gamma v_{\pi} (S_{t+1}) | S_t = s, A_t = \pi^\prime (s)]\\
& \leq E[R_{t+1} + \gamma q_{\pi} (S_{t+1}, \pi^{\prime}(S_{t+1})) | S_t = s, A_t = \pi^\prime (s)]\\
& = E_{\pi^\prime}[R_{t+1} + \gamma q_{\pi} (S_{t+1}, \pi^{\prime}(S_{t+1})) | S_t = s]\\
& = E_{\pi^\prime}[R_{t+1} + \gamma R_{t+2} + \gamma^2 v_{\pi} (S_{t+2}) | S_t = s]\\
& \;\; \vdots\\
& \leq E_{\pi^\prime}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots | S_t = s]\\
& = v_{\pi^\prime} (s)
\end{aligned}

Moreover, if there is strict inequality in *(1)* at any state, then there must be strict inequality in *(2)* at that state.
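The theorem can be checked numerically on a small example. The sketch below uses a hypothetical three-state MDP and two hand-picked deterministic policies (all names and numbers are assumptions for illustration). It verifies condition *(1)* using $v_{\pi}$ and then confirms conclusion *(2)* by evaluating $\pi^\prime$ directly.

```python
GAMMA = 0.9

# Hypothetical toy MDP: P[s][a] = [(probability, next_state, reward), ...]
# Action 0 moves left, action 1 moves right; state 2 is terminal.
P = {
    0: {0: [(1.0, 0, -1.0)], 1: [(1.0, 1, -1.0)]},
    1: {0: [(1.0, 0, -1.0)], 1: [(1.0, 2, -1.0)]},
    2: {},
}


def q_value(v, s, a):
    """Value of taking a in s and following the policy behind v afterwards."""
    return sum(prob * (reward + GAMMA * v[s_next]) for prob, s_next, reward in P[s][a])


def evaluate(policy, theta=1e-10):
    """In-place iterative policy evaluation for a deterministic policy."""
    v = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in policy:
            new = q_value(v, s, policy[s])
            delta = max(delta, abs(new - v[s]))
            v[s] = new
        if delta < theta:
            return v


pi = {0: 1, 1: 0}      # right in state 0, left in state 1 (never terminates)
pi_new = {0: 1, 1: 1}  # differs from pi only in state 1

v_pi = evaluate(pi)
v_new = evaluate(pi_new)

# Condition (1), with a small tolerance for the approximation error in v_pi.
condition_1 = all(q_value(v_pi, s, pi_new[s]) >= v_pi[s] - 1e-9 for s in pi)
# Conclusion (2).
conclusion_2 = all(v_new[s] >= v_pi[s] for s in pi)
print(condition_1, conclusion_2)
```

In this example condition *(1)* holds with equality at state 0, yet $\pi^\prime$ is still strictly better there, which is consistent with the theorem: strictness in *(1)* is sufficient for strictness in *(2)*, not necessary.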

So far we have seen how, given a policy and its value function, we can easily evaluate
a change in the policy at a single state. It is a natural extension to consider changes at all states, selecting at each state the action that appears best according to
$q_{\pi} (s, a)$. In other words, to consider the new greedy policy, $\pi^\prime$, given by

$\pi^\prime (s) = \text{argmax}_{a} q_{\pi} (s, a) = \text{argmax}_{a} E[R_{t+1} + \gamma v_{\pi} (S_{t+1}) | S_t = s, A_t = a] = \text{argmax}_{a} \sum_{s^{\prime}, r} p(s^{\prime}, r | s, a)[r + \gamma v_{\pi} (s^\prime)]$

This equation says that at each state we select the action $a$ that maximizes the expected return obtained by starting from this state, taking this action, and following policy $\pi$ thereafter (with ties broken arbitrarily, for example when two actions are equally good). The greedy policy
takes the action that looks best in the short term, after one step of lookahead, according to $v_{\pi}$. By construction, the greedy policy meets the conditions of the policy improvement theorem, so we know that it is as good as, or better than, the original policy.
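The greedy step amounts to a one-step lookahead followed by an argmax. Here is a small sketch on a hypothetical toy MDP; the names `q_from_v` and `greedy_policy`, the data layout, and the rounded values in `v_pi` are assumptions made for this example.

```python
GAMMA = 0.9

# Hypothetical toy MDP: P[s][a] = [(probability, next_state, reward), ...]
# Action 0 moves left, action 1 moves right; state 2 is terminal.
P = {
    0: {0: [(1.0, 0, -1.0)], 1: [(1.0, 1, -1.0)]},
    1: {0: [(1.0, 0, -1.0)], 1: [(1.0, 2, -1.0)]},
    2: {},
}

# Rounded values of v_pi for the equiprobable random policy on this toy MDP.
v_pi = {0: -4.17, 1: -2.88, 2: 0.0}


def q_from_v(v, s, a):
    """One-step lookahead: expected reward plus discounted value of the successor."""
    return sum(prob * (reward + GAMMA * v[s_next]) for prob, s_next, reward in P[s][a])


def greedy_policy(v):
    """Deterministic policy that is greedy with respect to v.

    Ties are broken by taking the first maximizing action in dictionary order.
    """
    return {
        s: max(actions, key=lambda a: q_from_v(v, s, a))
        for s, actions in P.items()
        if actions  # skip terminal states
    }


print(greedy_policy(v_pi))
```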

### Stochastic policy $\pi(a|s)$

In the general case, a stochastic policy $\pi $ specifies probabilities, $\pi(a|s)$, for taking each
action, a, in each state, s. In fact, all ideas of policy improvement from deterministic policies can be extended to stochastic policies.
In particular, the policy improvement
theorem carries through as stated for the stochastic case. In addition, if there are ties in the policy improvement step, that is, if there are several actions at which the
maximum is achieved, then in the stochastic case we need not select a single action from
among them. Instead, each maximizing action can be given a portion of the probability
of being selected in the new greedy policy. Any apportioning scheme is allowed as long
as all submaximal actions are given zero probability.
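One possible apportioning scheme, shown here only as an example, splits the probability equally among the maximizing actions. The function name `greedy_distribution`, the tolerance `tol`, and the action labels are hypothetical.

```python
def greedy_distribution(q_values, tol=1e-9):
    """Split probability equally among all maximizing actions.

    q_values maps each action to q_pi(s, a) for one fixed state s.
    Equal splitting is just one of the allowed apportioning schemes.
    """
    best = max(q_values.values())
    # Actions within tol of the maximum are treated as tied.
    maximizers = [a for a, q in q_values.items() if best - q <= tol]
    share = 1.0 / len(maximizers)
    return {a: (share if a in maximizers else 0.0) for a in q_values}


# Two tied best actions and one submaximal action.
print(greedy_distribution({"left": -1.9, "right": -1.9, "stay": -2.71}))
```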

## Policy iteration

Once a policy $\pi$ has been improved using $v_{\pi}$ to yield a better policy $\pi^\prime$, we can then compute $v_{\pi^\prime}$ and improve it again to
yield an even better $\pi^{\prime\prime}$. We can thus obtain a sequence of monotonically improving policies and value functions:

<img src="pngs/policy-iteration.png">

where "E" denotes a policy evaluation and "I" denotes a policy improvement. Each policy is guaranteed to be a strict improvement over the previous one (unless the previous one is already an optimal one).
Because a finite MDP has only a finite number of policies, this process must converge to an optimal policy and the optimal value function in a finite number of iterations.

This way of finding an optimal policy is called **policy iteration**.

<img src="pngs/policy-iteration-algo.png">

Policy iteration often converges in surprisingly few iterations. The policy improvement
theorem assures us that each policy in the sequence is at least as good as the one before it.
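The evaluation and improvement steps can be combined into a short loop. The following is a sketch under assumptions rather than a transcription of the algorithm figure: the toy MDP, the data layout, and the parameter `tie_tol` are invented for this example. The policy is changed only when another action is clearly better, a design choice that prevents the loop from switching back and forth between equally good actions.

```python
def q_value(P, v, s, a, gamma):
    """One-step lookahead value of taking a in s, then continuing according to v."""
    return sum(prob * (reward + gamma * v[s_next]) for prob, s_next, reward in P[s][a])


def policy_iteration(P, gamma=0.9, theta=1e-10, tie_tol=1e-9):
    # Arbitrary initial policy: the first listed action in every non-terminal state.
    policy = {s: next(iter(acts)) for s, acts in P.items() if acts}
    v = {s: 0.0 for s in P}
    while True:
        # E step: in-place iterative policy evaluation of the current policy.
        while True:
            delta = 0.0
            for s in policy:
                new = q_value(P, v, s, policy[s], gamma)
                delta = max(delta, abs(new - v[s]))
                v[s] = new
            if delta < theta:
                break
        # I step: make the policy greedy with respect to v.
        stable = True
        for s in policy:
            best = max(P[s], key=lambda a: q_value(P, v, s, a, gamma))
            # Switch only on a clear improvement so ties cannot cause cycling.
            if q_value(P, v, s, best, gamma) > q_value(P, v, s, policy[s], gamma) + tie_tol:
                policy[s] = best
                stable = False
        if stable:
            return policy, v


# Hypothetical toy MDP: P[s][a] = [(probability, next_state, reward), ...]
# Action 0 moves left, action 1 moves right; state 2 is terminal.
P = {
    0: {0: [(1.0, 0, -1.0)], 1: [(1.0, 1, -1.0)]},
    1: {0: [(1.0, 0, -1.0)], 1: [(1.0, 2, -1.0)]},
    2: {},
}
policy, v = policy_iteration(P)
print(policy, v)
```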

## Value iteration

One drawback to policy iteration is that each of its iterations involves policy evaluation,
which may itself be a protracted iterative computation requiring multiple sweeps through
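the state set. If policy evaluation is done iteratively, then convergence exactly to $v_{\pi}$ occurs only in the limit. Must we wait for exact convergence, or can we stop short of that?

In fact, the policy evaluation step of policy iteration can be truncated in several ways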
without losing the convergence guarantees of policy iteration. One important special case is when policy evaluation is stopped after just one sweep (one update of each state). This algorithm is called **value iteration**.
It can be written as a particularly simple update operation that combines the policy improvement and truncated policy evaluation steps:

$v_{*}^{k+1} (s) = \text{max}_{a \in A(s)} q_{*}^{k}(s, a) = \text{max}_{a \in A(s)} E[R_{t+1} + \gamma v_{*}^{k}(S_{t+1}) | S_t=s, A_t=a]$

$= \text{max}_{a \in A(s)} \sum_{s^{\prime}, r}p(s^{\prime}, r | s, a)[r + \gamma v_{*}^{k}(s^\prime)]$

This is the Bellman optimality equation turned into an update rule, taking $v_{*}^{k}$ to $v_{*}^{k+1}$. The update is identical to the policy evaluation update, except that instead of averaging over actions according to the policy, we take the maximum over all actions.

<img src="pngs/value-iteration.png">

Value iteration effectively combines, in each of its sweeps, one sweep of policy evaluation
and one sweep of policy improvement. Faster convergence is often achieved by interposing
multiple policy evaluation sweeps between each policy improvement sweep. In general,
the entire class of truncated policy iteration algorithms can be thought of as sequences
of sweeps, some of which use policy evaluation updates and some of which use value
iteration updates. Because the max operation is the only difference between these updates, this just means that the max operation is added to some sweeps of policy
evaluation. All of these algorithms converge to an optimal policy for discounted finite
MDPs.
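For comparison with the update rule above, here is a minimal value iteration sketch. The toy MDP, the data layout, and the names `value_iteration`, `gamma`, and `theta` are assumptions for illustration. Each sweep applies the max update in place, and a greedy policy is read off once the values have stopped changing.

```python
def value_iteration(P, gamma=0.9, theta=1e-10):
    """Return (values, greedy policy) for an MDP given as P[s][a] = [(p, s', r), ...]."""

    def lookahead(v, s, a):
        return sum(prob * (reward + gamma * v[s_next]) for prob, s_next, reward in P[s][a])

    v = {s: 0.0 for s in P}  # terminal states keep value 0
    while True:
        delta = 0.0
        for s, actions in P.items():
            if not actions:  # terminal state
                continue
            # Max over actions replaces the policy-weighted average.
            new = max(lookahead(v, s, a) for a in actions)
            delta = max(delta, abs(new - v[s]))
            v[s] = new
        if delta < theta:
            break
    # Output a deterministic policy that is greedy with respect to the final values.
    policy = {
        s: max(actions, key=lambda a: lookahead(v, s, a))
        for s, actions in P.items()
        if actions
    }
    return v, policy


# Hypothetical toy MDP: action 0 moves left, action 1 moves right; state 2 is terminal.
P = {
    0: {0: [(1.0, 0, -1.0)], 1: [(1.0, 1, -1.0)]},
    1: {0: [(1.0, 0, -1.0)], 1: [(1.0, 2, -1.0)]},
    2: {},
}
v, policy = value_iteration(P)
print(v, policy)
```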

## Generalized Policy Iteration

We use the term generalized policy iteration (GPI) to refer
to the general idea of letting policy-evaluation and policy-improvement
processes interact, independent of the granularity
and other details of the two processes. Almost all reinforcement
learning methods are well described as GPI. That is, all have
identifiable policies and value functions, with the policy always
being improved with respect to the value function and the value
function always being driven toward the value function for the
policy, as suggested by the diagram to the right. If both the
evaluation process and the improvement process stabilize, that
is, no longer produce changes, then the value function and policy
must be optimal. The value function stabilizes only when it
is consistent with the current policy, and the policy stabilizes
only when it is greedy with respect to the current value function.
Thus, both processes stabilize only when a policy has been found that is greedy with
respect to its own evaluation function. This implies that the Bellman optimality equation
holds, and thus that the policy and the value function are optimal.

One might also think of the interaction between the evaluation and improvement
processes in GPI in terms of two constraints or goals—for example, as two lines in a two-dimensional space as suggested by the diagram
to the right. Although the real geometry is
much more complicated than this, the diagram suggests
what happens in the real case. Each process
drives the value function or policy toward one of
the lines representing a solution to one of the two
goals. The goals interact because the two lines are
not orthogonal. Driving directly toward one goal
causes some movement away from the other goal.
Inevitably, however, the joint process is brought closer to the overall goal of optimality.
The arrows in this diagram correspond to the behavior of policy iteration in that each
takes the system all the way to achieving one of the two goals completely. In GPI
one could also take smaller, incomplete steps toward each goal. In either case, the two
processes together achieve the overall goal of optimality even though neither is attempting
to achieve it directly.