# On-policy Control with Approximation

## Episodic Semi-gradient Control

We now consider examples of the form $S_t, A_t \rightarrow U_t$. The update target $U_t$ can be any approximation of $q_{\pi} (S_t, A_t)$, including the usual backed-up values such as the full
MC return $G_t$ or any of the n-step SARSA returns (i.e., we take an unbiased sample of $T^{\pi} V_{k}$). The general gradient-descent update for action-value prediction is obtained by minimizing:

$\arg\min_{\hat q} \frac{1}{2} \| U_t - \hat{q} (S_t, A_t, w_t)\|^2_{2, D_n}$ (approximate error minimization)

By taking the derivative w.r.t $w_t$ we have:

$-\frac{1}{N} \sum_{i=1}^{N} (U_{t, i} - \hat{q}_i (S_t, A_t, w_t))\nabla \hat{q}_i (S_t, A_t, w_t)$

In general, the stochastic semi-gradient descent update (using only one sample) is:

$w_{t+1} = w_{t} + \alpha [U_t - \hat{q} (S_t, A_t, w_t)]\nabla \hat{q} (S_t, A_t, w_t) $

We can substitute any sample-based estimate for the target $U_t$, for example the one-step SARSA target:

$w_{t+1} = w_{t} + \alpha [R_{t+1} + \gamma \hat{q} (S_{t+1}, A_{t+1}, w_t)  - \hat{q} (S_t, A_t, w_t)]\nabla \hat{q} (S_t, A_t, w_t) $

This is called a semi-gradient method because a bootstrapped target $U_t$ also depends on $w_t$, but the update treats it as a constant and ignores that dependence to avoid bias in the gradient-descent update (i.e., if we treat the target as a function of $w_t$, we get Bellman residual minimization (BRM), whose loss function is itself biased).

We call this method *episodic semi-gradient one-step Sarsa*. For a constant policy, this method converges in the same way that TD(0) does.
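As a minimal sketch of the one-step update above, assume a linear approximator $\hat{q}(s, a, w) = w^\top x(s, a)$, so that $\nabla \hat{q}$ is just the feature vector. The function names, the feature vectors, and the `terminal` flag are illustrative assumptions, not part of the original algorithm box.

```python
def q_hat(w, x):
    """Linear action-value estimate: dot product of weights and features."""
    return sum(wi * xi for wi, xi in zip(w, x))


def sarsa_update(w, x, reward, x_next, alpha, gamma, terminal=False):
    """One semi-gradient one-step SARSA update; returns the new weights.

    x and x_next are hypothetical feature vectors for (S_t, A_t) and
    (S_{t+1}, A_{t+1}). The target is treated as a constant, so only the
    gradient of q_hat(S_t, A_t, w) appears, which is x for a linear model.
    """
    target = reward if terminal else reward + gamma * q_hat(w, x_next)
    td_error = target - q_hat(w, x)
    return [wi + alpha * td_error * xi for wi, xi in zip(w, x)]


# Example step with made-up numbers.
w = [0.0, 0.0]
w = sarsa_update(w, x=[1.0, 0.0], reward=1.0, x_next=[0.0, 1.0],
                 alpha=0.5, gamma=0.9)
print(w)
```

Only the weights attached to the features of $(S_t, A_t)$ move, because the target's dependence on $w_t$ is ignored.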

To form control methods, we need to couple such action-value prediction methods with techniques for policy improvement and action selection. Suitable techniques applicable to **continuous actions, or to actions from large discrete sets**, are a topic of ongoing research with as yet no clear resolution.
On the other hand, if the action set is discrete and not too large, then we can use the techniques already developed in previous chapters. That is, for each possible action $a$ available in the next state $S_{t+1}$, we can compute $\hat{q} (S_{t+1}, a, w_t)$ and then find the greedy action. Policy improvement is then done by changing the estimation policy to a soft approximation of the greedy policy, such as the $\epsilon$-greedy policy. Actions are
selected according to this same policy.
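The action-selection step for a small discrete action set can be sketched as follows. The function name and the `rng` parameter are assumptions made for illustration; `q_values` stands for the list of $\hat{q}(S_{t+1}, a, w_t)$ over all actions $a$.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index: random with probability epsilon, else greedy.

    q_values[a] is the current estimate for action a. Ties among greedy
    actions are broken uniformly at random.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    best = max(q_values)
    greedy_actions = [a for a, q in enumerate(q_values) if q == best]
    return rng.choice(greedy_actions)


# With epsilon = 0 the choice is purely greedy.
action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)
print(action)
```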

<img src='pngs/episodic_semi_gradient_SARSA.png'>

## Semi-gradient n-step SARSA

We can obtain an n-step version of episodic semi-gradient SARSA by using an n-step return as the update target in the semi-gradient Sarsa update equation. The n-step return immediately generalizes from its tabular form to a function approximation form:

<img src='pngs/semi_gradient_n_step_SARSA.png'>
<img src='pngs/episodic_emi_gradient_n_step_SARSA.png'>

As we have seen before, performance is best if an intermediate level of bootstrapping
is used, corresponding to an n larger than 1.
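The n-step return used as the update target in this section can be computed with a short backward recursion. This is a sketch under the assumption that the target is the discounted sum of the next $n$ rewards plus a discounted bootstrap from $\hat{q}(S_{t+n}, A_{t+n}, w)$; the function and argument names are hypothetical.

```python
def n_step_return(rewards, gamma, bootstrap_value=0.0):
    """Discounted n-step return.

    rewards is [R_{t+1}, ..., R_{t+n}]. bootstrap_value is the estimate
    q_hat(S_{t+n}, A_{t+n}, w), or 0.0 if the episode ended within n steps.
    """
    g = bootstrap_value
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Two rewards followed by a bootstrap estimate (made-up numbers).
print(n_step_return([1.0, 2.0], gamma=0.5, bootstrap_value=4.0))
```

With `n = 1` and a bootstrap value this reduces to the one-step SARSA target; with an empty bootstrap and all remaining rewards it is the full MC return.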

## Average Reward: A New Problem Setting for Continuing Tasks

We now introduce a third classical setting, alongside the episodic and discounted settings, for formulating the goal in Markov decision problems (MDPs). Like the discounted setting, the average-reward setting applies to continuing problems, that is, problems for which the interaction between agent and environment goes on forever without termination or start states.
Unlike that setting, however, there is no discounting: the agent cares just as much about delayed rewards as it does about immediate reward. The average-reward setting is one of the major settings commonly considered in the classical theory of dynamic programming and less commonly in reinforcement learning. As we discuss in the next section, the discounted setting is problematic with function approximation, and thus the average-reward setting is needed to replace it.

In the average-reward setting, the quality of a policy $\pi$ is defined as the average rate of reward, or simply average reward, while following that policy, which we denote as $r(\pi)$:

$r(\pi) = \lim_{h \rightarrow \infty} \frac{1}{h} \sum_{t=1}^{h} E[R_{t} | S_0, A_{0:t-1} \sim \pi]$ (the expected total reward over a horizon $h$ divided by the horizon length, which is just the expected reward per step)

$= \lim_{t \rightarrow \infty} E[R_{t} | S_{0}, A_{0:t-1} \sim \pi]$

$= \sum_{s} \mu_{\pi}(s) \sum_{a} \pi(a | s) \sum_{r, s^{\prime}} p(s^{\prime}, r | s, a) r$ (i.e., the expectation is over the state $s$ drawn from the stationary distribution, the action $a$ chosen by $\pi$, and the next state $s^{\prime}$ and reward $r$ drawn from $p$; the stationary distribution itself depends on the policy that generates the actions $A_0, \ldots$)

Here the expectations are conditioned on the initial state, $S_0$, and on the subsequent actions, $A_0, \ldots, A_{t-1}$, being taken according to $\pi$. The second and third equations hold if the stationary distribution, $\mu_{\pi} (s) = \lim_{t\rightarrow \infty} P(S_t=s | A_{0:t-1} \sim \pi)$, exists and is independent of $S_{0}$, in other words, if the MDP is ergodic. In an ergodic MDP, the starting state and any early decision made by the agent can have only a temporary effect;
in the long run, the expectation of being in a state depends only on the policy and the MDP transition probabilities. Ergodicity is sufficient but not necessary to guarantee the existence of the limit.

There are subtle distinctions that can be drawn between different kinds of optimality in the undiscounted continuing case. Nevertheless, for most practical purposes it may be adequate simply to order policies according to their average reward per time step, in other words, according to their $r(\pi)$. This quantity is essentially the average reward under $\pi$, or the reward rate. In particular, we consider all policies that attain the maximal value of $r(\pi)$ to be optimal.

Note that the stationary distribution $\mu_\pi$ is the special distribution under which, if you select actions according to $\pi$, you remain in the same distribution. That is, for which

$\sum_{s} \mu_{\pi} (s) \sum_a \pi(a | s) p(s^{\prime} | s, a) = \mu_{\pi} (s^{\prime})$ (definition of stationary distribution)
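To make the stationary distribution and $r(\pi)$ concrete, here is a small sketch on a made-up two-state chain. It assumes the policy has already been folded into the dynamics, so `P[s][s2]` stands for $\sum_a \pi(a|s) p(s_2|s,a)$ and `expected_reward[s]` for the expected one-step reward from $s$ under $\pi$. Power iteration is one simple way to approximate $\mu_\pi$ for an ergodic chain; all names and numbers are illustrative.

```python
def stationary_distribution(P, iterations=1000):
    """Approximate mu with mu <- mu P, starting from the uniform distribution."""
    n = len(P)
    mu = [1.0 / n] * n
    for _ in range(iterations):
        mu = [sum(mu[s] * P[s][s2] for s in range(n)) for s2 in range(n)]
    return mu


# Hypothetical policy-averaged transition matrix and expected rewards.
P = [[0.9, 0.1],
     [0.5, 0.5]]
expected_reward = [1.0, 0.0]

mu = stationary_distribution(P)

# Stationarity check: applying P once more should leave mu unchanged.
mu_next = [sum(mu[s] * P[s][s2] for s in range(2)) for s2 in range(2)]

# Average reward r(pi) = sum_s mu(s) * expected one-step reward from s.
r_pi = sum(m * r for m, r in zip(mu, expected_reward))
print(mu, r_pi)
```

For this chain the balance condition $0.1\,\mu(0) = 0.5\,\mu(1)$ gives $\mu = (5/6, 1/6)$, so $r(\pi) = 5/6$.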

In the average-reward setting, returns are defined in terms of differences between rewards and the average reward:

$G_t = R_{t+1} - r(\pi) + R_{t+2} - r(\pi) + ...$

This is known as the differential return, and the corresponding value functions are known as differential value functions. Differential value functions are defined in terms of the new return just as conventional value functions were defined in terms of the discounted return; thus we will use the same notation for differential value functions. Differential value functions also have Bellman equations, just slightly different from those we have seen earlier. We simply remove all $\gamma$s and replace all rewards by the difference between the reward and the true average reward:

$v_{\pi} (s) = \sum_a \pi(a | s) \sum_{r, s^{\prime}} p(s^{\prime}, r | s, a) [r - r(\pi) + v_{\pi} (s^{\prime})]$

$q_{\pi} (s, a) =  \sum_{r, s^{\prime}} p(s^{\prime}, r | s, a) [r - r(\pi) + \sum_{a^\prime} \pi (a^\prime|s^\prime) q_{\pi} (s^\prime, a^\prime)]$

$v_{*} (s) = \max_{a} \sum_{r, s^\prime} p(s^\prime, r | s, a)[r - \max_{\pi} r(\pi) + v_{*} (s^\prime)]$

$q_{*} (s, a) = \sum_{r, s^\prime} p(s^\prime, r | s, a)[r - \max_{\pi} r(\pi) + \max_{a^\prime} q_{*} (s^\prime, a^{\prime})]$
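The differential Bellman equation for $v_{\pi}$ can be checked numerically on a tiny example. The sketch below assumes a made-up two-state ergodic chain with the policy folded into the transition matrix `P` and the expected rewards `r`, and it iterates $v \leftarrow r - r(\pi) + P v$ from zero. That iteration converges for this aperiodic chain; it is not claimed to be a general-purpose solver.

```python
def stationary_distribution(P, iterations=1000):
    """Approximate mu with mu <- mu P, starting from the uniform distribution."""
    n = len(P)
    mu = [1.0 / n] * n
    for _ in range(iterations):
        mu = [sum(mu[s] * P[s][s2] for s in range(n)) for s2 in range(n)]
    return mu


def differential_values(P, r, avg_reward, iterations=200):
    """Iterate v <- r - avg_reward + P v, starting from v = 0."""
    n = len(P)
    v = [0.0] * n
    for _ in range(iterations):
        v = [r[s] - avg_reward + sum(P[s][s2] * v[s2] for s2 in range(n))
             for s in range(n)]
    return v


# Hypothetical policy-averaged dynamics and expected one-step rewards.
P = [[0.9, 0.1],
     [0.5, 0.5]]
r = [1.0, 0.0]

mu = stationary_distribution(P)
avg_reward = sum(m * ri for m, ri in zip(mu, r))
v = differential_values(P, r, avg_reward)

# Bellman residual: should be close to zero in every state.
residual = [v[s] - (r[s] - avg_reward + sum(P[s][s2] * v[s2] for s2 in range(2)))
            for s in range(2)]
print(v, residual)
```

Differential values are only determined up to an additive constant by the Bellman equation, so the meaningful quantity here is the difference $v(0) - v(1)$, which equals $5/3$ for this chain.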


There is also a differential form of the two TD errors:

$\delta_{t} = R_{t+1} - \bar{R_{t}} + \hat{v}(S_{t+1}, w_t) - \hat{v}(S_t, w_t)$

$\delta_{t} = R_{t+1} - \bar{R_{t}} + \hat{q}(S_{t+1}, A_{t+1}, w_t) - \hat{q}(S_t, A_t, w_t)$

Here $\bar{R_{t}}$ is an estimate at time $t$ of the average reward $r(\pi)$. With these alternate definitions, most of our algorithms and many theoretical results carry through to the average-reward setting without change.
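A single differential semi-gradient SARSA step can be sketched as follows, assuming a linear $\hat{q}$ and assuming the average-reward estimate is nudged toward the TD error with its own step size `beta`. The function name, the feature vectors, and both step sizes are illustrative assumptions.

```python
def differential_sarsa_step(w, avg_reward, x, reward, x_next, alpha, beta):
    """One differential semi-gradient SARSA step with a linear q_hat.

    x and x_next are hypothetical feature vectors for (S_t, A_t) and
    (S_{t+1}, A_{t+1}). Returns (new weights, new average-reward estimate,
    TD error).
    """
    q = sum(wi * xi for wi, xi in zip(w, x))
    q_next = sum(wi * xi for wi, xi in zip(w, x_next))
    # Differential TD error: no discounting, reward measured against R-bar.
    delta = reward - avg_reward + q_next - q
    new_avg_reward = avg_reward + beta * delta
    new_w = [wi + alpha * delta * xi for wi, xi in zip(w, x)]
    return new_w, new_avg_reward, delta


# Example step with made-up numbers.
w, avg_reward, delta = differential_sarsa_step(
    w=[0.0, 0.0], avg_reward=0.0, x=[1.0, 0.0], reward=2.0,
    x_next=[0.0, 1.0], alpha=0.5, beta=0.1)
print(w, avg_reward, delta)
```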

<img src='pngs/differential_semi_gradient_SARSA.png'>

## Deprecating the Discounted Setting

The continuing, discounted problem formulation has been very useful in the tabular case, in which the returns from each state can be separately identified and averaged. But in the approximate case it is questionable whether one should ever use this problem formulation.

To see why, consider an infinite sequence of returns with no beginning or end, and no
clearly identified states. The states might be represented only by feature vectors, which
may do little to distinguish the states from each other. As a special case, all the feature
vectors may be the same. Thus one really has only the reward sequence (and the actions),
and performance has to be assessed purely from these. How could it be done? One way
is by averaging the rewards over a long interval—this is the idea of the average-reward
setting. How could discounting be used? Well, for each time step we could measure
the discounted return. Some returns would be small and some big, so again we would
have to average them over a sufficiently large time interval. In the continuing setting
there are no starts and ends, and no special time steps, so there is nothing else that
could be done. However, if you do this, it turns out that the average of the discounted
returns is proportional to the average reward. In fact, for policy $\pi$, the average of the
discounted returns is always $\frac{r(\pi)}{ 1 - \gamma}$, that is, it is essentially the average reward, $r(\pi)$. In particular, the ordering of all policies in the average discounted return setting
would be exactly the same as in the average-reward setting. The discount rate $\gamma$ thus has no effect on the problem formulation. It could in fact be zero and the ranking would be unchanged.

**Proof**:

Perhaps discounting can be saved by choosing an objective that sums discounted values over the distribution with which states occur under the policy:

(we want to maximize this objective)

$J(\pi) = \sum_s \mu_{\pi} (s) v_{\pi}^{\gamma} (s)$ (the values of all states are weighted by the stationary distribution of states, which is reasonable)

$= \sum_s \mu_{\pi} (s) \sum_a \pi(a | s) \sum_{s^\prime, r} p(s^\prime, r | s, a) [r + \gamma * v_{\pi}^{\gamma} (s^\prime)]$

$= r(\pi) + \sum_{s} \mu_{\pi} (s) \sum_{a} \pi(a | s) \sum_{s^\prime, r} p(s^\prime, r | s, a) \gamma * v_{\pi}^{\gamma} (s^\prime)$

$= r(\pi) + \sum_{s} \mu_{\pi} (s) \sum_{a} \pi(a | s) \sum_{s^\prime} p(s^\prime| s, a) \gamma * v_{\pi}^{\gamma} (s^\prime)$ (i.e., $p(s^\prime | s, a) = \sum_{r}p(s^\prime, r | s, a)$)

$= r(\pi) + \gamma \sum_{s^\prime} \mu_{\pi} (s^\prime) v_{\pi}^{\gamma} (s^\prime)$ (by property of stationary distribution)

$= r(\pi) + \gamma J(\pi)$

$= r(\pi) + \gamma r(\pi) + \gamma^2 r(\pi) + \cdots$ (applying the same identity to $J(\pi)$ repeatedly)

$= \frac{1}{1 - \gamma} r(\pi)$

The proposed discounted objective orders policies identically to the undiscounted
(average reward) objective. The discount rate $\gamma$ does not influence the ordering!

This result shows that, if we optimized discounted value over the on-policy distribution (stationary distribution), then the effect would be identical to optimizing undiscounted average reward; the actual value of $\gamma$ would have no effect. This strongly suggests that discounting has no role to play in the definition of the control problem with function approximation.
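The identity $J(\pi) = \frac{1}{1-\gamma} r(\pi)$ can be checked numerically on a small example. The sketch below uses a made-up two-state ergodic chain with the policy folded into `P` and `r` (both hypothetical), approximates $\mu_\pi$ by power iteration and $v_{\pi}^{\gamma}$ by repeated Bellman backups, and compares the two sides for several discount rates.

```python
def stationary_distribution(P, iterations=1000):
    """Approximate mu with mu <- mu P, starting from the uniform distribution."""
    n = len(P)
    mu = [1.0 / n] * n
    for _ in range(iterations):
        mu = [sum(mu[s] * P[s][s2] for s in range(n)) for s2 in range(n)]
    return mu


def discounted_values(P, r, gamma, iterations=2000):
    """Approximate v^gamma with repeated backups v <- r + gamma P v."""
    n = len(P)
    v = [0.0] * n
    for _ in range(iterations):
        v = [r[s] + gamma * sum(P[s][s2] * v[s2] for s2 in range(n))
             for s in range(n)]
    return v


# Hypothetical policy-averaged dynamics and expected one-step rewards.
P = [[0.9, 0.1],
     [0.5, 0.5]]
r = [1.0, 0.0]

mu = stationary_distribution(P)
r_pi = sum(m * ri for m, ri in zip(mu, r))

results = []
for gamma in (0.0, 0.5, 0.9):
    v = discounted_values(P, r, gamma)
    J = sum(m * vi for m, vi in zip(mu, v))
    results.append((gamma, J, r_pi / (1.0 - gamma)))
    print(gamma, J, r_pi / (1.0 - gamma))
```

The two printed columns agree for every $\gamma$, so ranking policies by $J(\pi)$ is the same as ranking them by $r(\pi)$.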

The root cause of the difficulties with the discounted control setting is that with
function approximation we have lost the policy improvement theorem (Section 4.2). It is
no longer true that if we change the policy to improve the discounted value of one state
then we are guaranteed to have improved the overall policy in any useful sense. That
guarantee was key to the theory of our reinforcement learning control methods. With function approximation we have lost it!

## Differential Semi-gradient n-step SARSA

<img src='pngs/differential_semi_gradient_n_step_SARSA.png'>
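As a rough sketch of the update target in this setting, assume the n-step differential return is formed by analogy with the differential return defined earlier: each of the next $n$ rewards is measured against the current average-reward estimate, with no discounting, and the tail is replaced by a bootstrap from $\hat{q}$. The function and argument names are hypothetical.

```python
def differential_n_step_return(rewards, avg_reward, bootstrap_value):
    """Undiscounted n-step target with rewards measured against avg_reward.

    rewards is [R_{t+1}, ..., R_{t+n}], avg_reward is the current estimate
    R-bar, and bootstrap_value is q_hat(S_{t+n}, A_{t+n}, w).
    """
    return sum(r - avg_reward for r in rewards) + bootstrap_value


# Three rewards against an average-reward estimate of 1.5 (made-up numbers).
print(differential_n_step_return([1.0, 2.0, 3.0], avg_reward=1.5,
                                 bootstrap_value=0.5))
```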
