# Experiment 1: zero value function

For our first experiment we initialise all value estimates to zero. This means that the bootstrapped values initially contribute nothing to the targets. When performing the greedy action choice, ties are resolved by choosing an action at random (this makes no difference when we select and bootstrap actions using the same value estimates, since the same value is propagated whichever we choose).
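The tie-breaking rule can be sketched as follows. This is a minimal illustration, assuming the value estimates for one state are held in a NumPy array; the function name `greedy_action` is hypothetical and not taken from the experiment code.

```python
import numpy as np

def greedy_action(q_values, rng):
    """Return the index of a maximal entry, breaking ties uniformly at random."""
    q_values = np.asarray(q_values, dtype=float)
    # Indices of every action whose value equals the maximum.
    best = np.flatnonzero(q_values == q_values.max())
    return int(rng.choice(best))

rng = np.random.default_rng(0)
q_row = [0.0, 0.0]  # all-zero initialisation: both actions tie
picks = [greedy_action(q_row, rng) for _ in range(200)]
# Whichever action is picked, the bootstrapped value max(q_row) is the same.
bootstrap_value = max(q_row)
```

Because the bootstrap uses the value of the selected action, and tied actions share that value, the random choice changes which action is recorded but not the target.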


Let us visualise what is happening on the very first epoch:

<div>
<img src="figures/Q_learning_1D_linewalk/All_zeros/value_estimates_epoch0.png" width="600"/>
</div>


Here
- the blue line shows the values to be bootstrapped from
- the orange line shows the target $q$-values obtained by combining the n-step empirical returns with the greedy bootstraps
- the red line shows the current value estimates

Note that the blue and red lines are currently identical, since we bootstrap from the current value function at every epoch.

Since every action comes with an empirical reward of $-1$, the target function pulls all estimates down by a flat amount. Two exceptions occur when an action moves left or right into a boundary, since this incurs an additional penalty and so the immediate reward is $-2$. This artefact is not visible in the 'True' curve because hitting the boundary means that we do not end up one state further away from the finish. The value of the next state-action is therefore $+1$ higher than if we had actually changed position, cancelling out the $-1$ difference in the immediate reward.
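The cancellation for the true values can be checked with a few lines of arithmetic. The sketch below assumes an undiscounted problem in which the optimal value of a position is minus the number of steps to the finish; the function names and the example positions are hypothetical.

```python
def true_state_value(position, goal):
    """Optimal undiscounted value: -1 for every step still needed to reach the goal."""
    return -abs(goal - position)

def q_bump_into_wall(position, goal):
    """Action that hits the boundary: reward -2, position unchanged."""
    return -2 + true_state_value(position, goal)

def q_step_away(position, goal):
    """Same action if there were no wall: reward -1, one state further from the goal."""
    away = position - 1 if goal >= position else position + 1
    return -1 + true_state_value(away, goal)

# Left edge of a line whose finish sits at position 5 (assumed layout).
with_wall = q_bump_into_wall(0, 5)
without_wall = q_step_away(0, 5)
```

Both expressions give the same true value, which is why the boundary shows up in the targets built from zero bootstraps but not in the 'True' curve.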

Fast-forwarding a few epochs, we find

<div>
<img src="figures/Q_learning_1D_linewalk/All_zeros/value_estimates_epoch10.png" width="600"/>
</div>


All we have done to most states is shift them down by a constant amount, reflecting the fact that they all receive the same immediate rewards and bootstrap from the same value estimates. Now the near-terminal states are approaching their correct values and updating by smaller amounts. This information is starting to propagate to nearby state-action pairs. Since the 'better' action is now being updated by smaller amounts, the model and bootstrap policies have already converged on the true policy. This will make learning updates more stable from here on, as we will never bootstrap from the wrong action.

If we were bootstrapping from the wrong action, we would expect this information to propagate more slowly. This is because the policy would have to correct itself before we 

Passing through all future epochs, we see that the convergence to the correct function propagates outwards, because all state-action pairs bootstrap from the correct greedy action using values which get progressively closer to the true values.

<div>
<img src="figures/Q_learning_1D_linewalk/All_zeros/value_estimates_animated.gif" width="600"/>
</div>


Let us now consider how the different metrics behave:
- The MSE w.r.t. the target function starts at a maximum, since every error is equal to $-1$ or $-2$. As training progresses, the difference between the target and the estimates becomes monotonically smaller as both converge on the true values. This curve therefore decreases monotonically.
- The MSE considering only state-action pairs which correspond to the true greedy policy behaves the same.
- The MSE w.r.t. the true optimal values decreases monotonically as we converge.
- The accuracy of the greedy policy very quickly reaches 100% and remains there, aiding the stability of convergence.
- The maximum $q$-value moves monotonically towards its true value and then levels off, as the estimates converge on the true function from a single direction.

<div>
<img src="figures/Q_learning_1D_linewalk/All_zeros/training_curves.pdf" width="600"/>
</div>
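One possible way to compute the metrics listed above is sketched here. It assumes the estimates, targets and true values are stored as arrays of shape `(n_states, n_actions)`. The function name, the dictionary keys and the example numbers are hypothetical, and reading the third metric as an MSE against the true optimal values is an interpretation rather than something stated in the text.

```python
import numpy as np

def training_metrics(q_est, q_target, q_true):
    """Summarise one epoch; all inputs have shape (n_states, n_actions)."""
    q_est, q_target, q_true = (
        np.asarray(x, dtype=float) for x in (q_est, q_target, q_true)
    )
    rows = np.arange(q_true.shape[0])
    true_greedy = q_true.argmax(axis=1)  # optimal action per state
    est_greedy = q_est.argmax(axis=1)    # note: argmax takes the first index on ties
    greedy_err = q_est[rows, true_greedy] - q_target[rows, true_greedy]
    return {
        "mse_target": float(np.mean((q_est - q_target) ** 2)),
        "mse_target_greedy": float(np.mean(greedy_err ** 2)),
        "mse_true": float(np.mean((q_est - q_true) ** 2)),
        "greedy_accuracy": float(np.mean(est_greedy == true_greedy)),
        "max_q": float(q_est.max()),
    }

# Tiny made-up example: two states, two actions, zero-initialised estimates.
example = training_metrics(
    q_est=[[0.0, 0.0], [0.0, 0.0]],
    q_target=[[-1.0, -2.0], [-1.0, -1.0]],
    q_true=[[-1.0, -3.0], [-4.0, -2.0]],
)
```

With tied estimates the reported accuracy depends on how ties are resolved, so a version that matches the random tie-breaking used during training may be preferable.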

## Experiment 2: Wrong actions

In this experiment we set up the initial value function such that it has the correct dependence but the greedy policy selects the incorrect action at every state. We set the scale of the dependence to 70% of the true function, so that the multi-step error is dominated by immediate rewards and not bootstrapping.

Initially we have

<div>
<img src="figures/Q_learning_1D_linewalk/Wrong_actions/value_estimates_epoch0.png" width="600"/>
</div>



We can understand this plot as follows. The update errors are $\delta(s,a) = r + q(s',a') - q(s,a)$, with $r=-1$ in nearly all cases. When $a$ is the correct action, we bootstrap with a value of $q(s',a') = q(s,a) + 1.4$ (this is because we incorrectly select the action which returns us to our initial position, and inherit the poor modelling). This is an example of over-estimation of the values, and incorrectly results in $\delta(s,a)>0$.
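As a worked check for the correct action, taking the bootstrap offset of $1.4$ quoted above at face value and using $r=-1$:

$$
\delta(s,a) = r + \big[\,q(s',a') - q(s,a)\,\big] = -1 + 1.4 = +0.4 > 0 .
$$

The over-estimated bootstrap outweighs the immediate reward, so the estimate for the correct action is pushed upwards even though its true value lies below the current estimate.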

By contrast, when $a$ is the incorrect action, we move further away from our goal. Since we have modelled the correct dependence of the action values, we correctly say that this makes the expected value more negative. Along with the immediate reward of $r=-1$, this applies a strong negative pull on the estimates, and we correctly find $\delta(s,a)<0$.

After a relatively short amount of time, the values are pulled into the correct order so that the greedy policy becomes the correct one. From that point onwards, the near-terminal states are well-modelled and so this information propagates outwards to all other states.


<div>
<img src="figures/Q_learning_1D_linewalk/Wrong_actions/value_estimates_animated.gif" width="600"/>
</div>


Since the initial updates pull some value estimates in the wrong direction, we observe that the MSE metrics sometimes get worse. Once the greedy policy accuracy reaches 100%, they undergo a phase transition and all metrics improve monotonically until convergence.

<div>
<img src="figures/Q_learning_1D_linewalk/Wrong_actions/training_curves.pdf" width="600"/>
</div>


## Experiment 3: Wrong actions and large scale

The initial value estimates are

<div>
<img src="figures/Q_learning_1D_linewalk/Wrong_actions_scaled_up/value_estimates_epoch0.png" width="600"/>
</div>

and evolve according to

<div>
<img src="figures/Q_learning_1D_linewalk/Wrong_actions_scaled_up/value_estimates_animated.gif" width="600"/>
</div>

with the metrics

<div>
<img src="figures/Q_learning_1D_linewalk/Wrong_actions_scaled_up/training_curves.pdf" width="600"/>
</div>

which behave qualitatively the same as before.

## Experiment 4: Wrong dependence

The initial value estimates are

<div>
<img src="figures/Q_learning_1D_linewalk/Wrong_dependence/value_estimates_epoch0.png" width="600"/>
</div>

Here we see that the target once again pulls any 'wrong action' values in the wrong direction. This time it is driven by the fact that we are bootstrapping from the wrong values, even when we choose the correct action. When $a$ is the correct action, the impact of the immediate rewards pulls the error update into the correct direction.

This evolves into

<div>
<img src="figures/Q_learning_1D_linewalk/Wrong_dependence/value_estimates_epoch20.png" width="600"/>
</div>

where we see that by updating the value estimates in the wrong direction, we have caused the greedy policy to _lose_ accuracy. However, our update error is now dominated by the empirical piece, which will pull the estimates in the correct direction despite this.


Fast-forwarding, we find that the whole function is pulled down until near-terminal states begin to take their correct values, and the estimates evolve towards the correct dependence. This initially occurs very slowly, since there is a cancellation between the pressure applied by the (negative) empirical rewards and the (over-estimated due to both action choice and bad-policy) bootstrap.

<div>
<img src="figures/Q_learning_1D_linewalk/Wrong_dependence/value_estimates_epoch60.png" width="600"/>
</div>

Once the value estimates become essentially flat and equal, all of the learning pressure comes from the flat $-1$ immediate rewards, and convergence occurs at a rate similar to the "zeros" case, which was also flat.

<div>
<img src="figures/Q_learning_1D_linewalk/Wrong_dependence/value_estimates_epoch100.png" width="600"/>
</div>

The full evolution is

<div>
<img src="figures/Q_learning_1D_linewalk/Wrong_dependence/value_estimates_animated.gif" width="600"/>
</div>

with the metrics

<div>
<img src="figures/Q_learning_1D_linewalk/Wrong_dependence/training_curves.pdf" width="600"/>
</div>


Interestingly, since the distance between the estimates and the target is initially small (causing the slow learning) and increases as the estimates become flat, we actually observe that MSE(estimates-target) increases over time. It then decreases again once we begin to converge on the true values. This metric is therefore not a good way to gauge the convergence of an experiment. It may appear converged when we simply have a cancellation between the empirical return and $\gamma q(s',a') - q(s,a)$, leading to $\delta(s,a) \approx 0$. In fact, a small positive gradient means that learning is _accelerating_ in that direction.
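The cancellation can be made concrete with a single update error. The numbers below are invented purely for illustration, and `td_error` is a hypothetical helper that follows the one-step form of $\delta(s,a)$ used in this document.

```python
def td_error(reward, q_next, q_current, gamma=1.0):
    """One-step update error: reward + gamma * q(s', a') - q(s, a)."""
    return reward + gamma * q_next - q_current

# Invented values: the estimate is far from the truth, but the bootstrap
# sits exactly one unit above it, so it cancels the -1 reward.
q_true = -5.0
q_current = 50.0
q_next = 51.0

delta = td_error(-1.0, q_next, q_current)
gap_to_truth = q_current - q_true
```

Here the update error vanishes while the estimate is still 55 away from its true value, so a small MSE(estimates-target) on its own says nothing about how close the estimates are to the solution.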

Furthermore, the accuracy of the greedy policy appears to be very low for much of training. However, since both actions are flat and approximately equal, this has no impact on learning and we still update our estimates in the correct direction by approximately the same amount as if we had chosen the correct action.

## Experiment 5: Wrong dependence and large scale


Once again, increasing the scale of the problem makes little qualitative difference to the behaviour. However, we now converge towards an initial almost-flat distribution which is very far from the final stationary point (since it is on the scale of the initial highest value estimate). This causes convergence to take a very long time (it takes a long time for rewards of $-1$ to pull estimates of $\mathcal{O}(100)$ down to the correct position). This is observed as a long plateau in the metric MSE(model-target), despite the value estimates being nowhere near converged.
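A rough feel for the time scale comes from counting updates. The sketch below assumes a flat update error of $-1$ and a fixed step size `alpha`; neither the step size nor the helper name comes from the experiments, so the counts are only indicative.

```python
def updates_to_close_gap(offset, alpha, delta=-1.0):
    """Count updates q <- q + alpha * delta needed to lower q by `offset`."""
    q = float(offset)
    n_updates = 0
    while q > 0.0:
        q += alpha * delta  # each update moves the estimate by alpha
        n_updates += 1
    return n_updates

# An offset of order 100 versus an offset of order 1, both with alpha = 0.5.
large_scale = updates_to_close_gap(100, alpha=0.5)
small_scale = updates_to_close_gap(1, alpha=0.5)
```

Under these assumptions the number of updates grows linearly with the initial offset, which is consistent with the long plateau described above.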

The initial state is

<div>
<img src="figures/Q_learning_1D_linewalk/Wrong_dependence_scaled_up/value_estimates_epoch0.png" width="600"/>
</div>

with the full evolution as

<div>
<img src="figures/Q_learning_1D_linewalk/Wrong_dependence_scaled_up/value_estimates_animated.gif" width="600"/>
</div>

and the metrics

<div>
<img src="figures/Q_learning_1D_linewalk/Wrong_dependence_scaled_up/training_curves.pdf" width="600"/>
</div>


## Experiment 6: Wrong dependence and wrong actions


The initial state is

<div>
<img src="figures/Q_learning_1D_linewalk/Wrong_action_and_dependence/value_estimates_epoch0.png" width="600"/>
</div>

We now choose the wrong action, but this action also has the wrong dependence. This means that we accidentally bootstrap using the value of a correct action-choice and dependence, and we update in the correct direction!

Initially learning is slow, but as the functions level out, the choice of action becomes irrelevant because all $\gamma q(s',a') - q(s,a)$ are approximately $0$. We therefore iterate towards the correct solution as usual.

The full evolution is

<div>
<img src="figures/Q_learning_1D_linewalk/Wrong_action_and_dependence/value_estimates_animated.gif" width="600"/>
</div>

with the metrics

<div>
<img src="figures/Q_learning_1D_linewalk/Wrong_action_and_dependence/training_curves.pdf" width="600"/>
</div>


## Experiment 7: Wrong dependence and wrong actions (scaled up)


The initial state is

<div>
<img src="figures/Q_learning_1D_linewalk/Wrong_action_and_dependence_scaled_up/value_estimates_epoch0.png" width="600"/>
</div>

We now select the wrong action, but this action _also_ pulls in the wrong direction. Once again the target keeps pulling the model towards a flat state, at which point the empirical returns take over and provide a learning pressure towards the correct function.

<div>
<img src="figures/Q_learning_1D_linewalk/Wrong_action_and_dependence_scaled_up/value_estimates_animated.gif" width="600"/>
</div>

with the metrics

<div>
<img src="figures/Q_learning_1D_linewalk/Wrong_action_and_dependence_scaled_up/training_curves.pdf" width="600"/>
</div>



## General behaviour

In general, the updates may pull individual state-action values towards or away from the optimal solution, depending on whether the current value estimates cause the bootstraps to have the correct dependence and the greedy actions to be selected accordingly. When we iterate in the wrong direction, the pressure from the empirical returns tends to dampen the updates and flatten the value estimates. When this happens, the updates become driven by the immediate rewards, and the MSE appears to flatten or even increase as the scale of the updates increases.

The terminal state-action pairs tend to be learned first, since they are always iterated towards the correct final target. This information then propagates to other state-action pairs. The speed of this propagation may depend on whether the action-selection picks up the correct bootstraps.
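The outward propagation can be reproduced in a deliberately simplified setting. The sketch below is not the experimental code: it assumes a tabular value function, one-step targets, synchronous sweeps over every state-action pair, a step size of 1, no discounting, and a finish in the middle of an 11-state line. All function names are hypothetical.

```python
import numpy as np

def make_line_walk(n_states, goal):
    """Tables for a 1D walk with actions 0 = left, 1 = right."""
    next_state = np.zeros((n_states, 2), dtype=int)
    reward = np.full((n_states, 2), -1.0)
    for s in range(n_states):
        for a, step in enumerate((-1, 1)):
            s_next = s + step
            if s_next < 0 or s_next >= n_states:
                s_next = s           # bump into the boundary: stay put
                reward[s, a] = -2.0  # additional boundary penalty
            next_state[s, a] = s_next
    terminal = np.zeros(n_states, dtype=bool)
    terminal[goal] = True
    return next_state, reward, terminal

def sweep(q, next_state, reward, terminal, alpha=1.0, gamma=1.0):
    """One synchronous one-step update of every state-action pair."""
    # Greedy bootstrap value per state; the finish contributes zero.
    bootstrap = np.where(terminal, 0.0, q.max(axis=1))
    target = reward + gamma * bootstrap[next_state]
    q_new = q + alpha * (target - q)
    q_new[terminal] = 0.0  # no values are learned for the finish itself
    return q_new

next_state, reward, terminal = make_line_walk(n_states=11, goal=5)
q_early = np.zeros((11, 2))
for _ in range(3):
    q_early = sweep(q_early, next_state, reward, terminal)

q_late = q_early.copy()
for _ in range(10):
    q_late = sweep(q_late, next_state, reward, terminal)
```

After three sweeps the pairs next to the finish already hold their final values, while pairs at the edge of the line are still limited by the number of sweeps performed. After enough sweeps every pair has settled.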

We expect multi-step returns to help remove the sensitivity to a badly-modelled dependence and provide greater learning pressure in the correct direction based on optimising short-term rewards. However, if the multi-step returns are biased by poor action-selection, this source of bias may increase.

In fact, in our case multi-step returns may help a lot, because the empirical returns increase from $-1$ to $-3$ and so impose much greater pressure towards the correct solution.
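The change from $-1$ to $-3$ corresponds to moving from a one-step to a three-step empirical return. A minimal sketch of such a target is given below, assuming the standard discounted n-step form; `n_step_target` is a hypothetical helper and the bootstrap value is passed in directly.

```python
def n_step_target(rewards, bootstrap_value, gamma=1.0):
    """Sum of discounted rewards plus the discounted bootstrap after the last step."""
    target = 0.0
    discount = 1.0
    for r in rewards:
        target += discount * r
        discount *= gamma
    return target + discount * bootstrap_value

# With a zero bootstrap the target is just the empirical return.
one_step = n_step_target([-1.0], bootstrap_value=0.0)
three_step = n_step_target([-1.0, -1.0, -1.0], bootstrap_value=0.0)
```

The bootstrap enters with the same weight in both cases when there is no discounting, so the longer return raises the share of the target that comes from observed rewards.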