# Function Approximation
In the previous chapters, each update could modify the value estimate of only one state or one state-action pair.

Unfortunately, state spaces and action spaces can be very large in some tasks, even infinite. It would be impossible to update all state-action pairs individually.

Function approximation tries to address this problem by approximating these values with a parametric function, so that updating the parameters of this function updates the value estimates of many states or state-action pairs at once.

When updating parameters according to experience, value estimates of states or state-action pairs that we have not visited yet can also be updated.

## Basics
FA uses mathematical model to estimate real value. For MC and TD algorithms, target to approximate are values (including optimal ones), such as $v_\pi, q_\pi, v_*, q_*$.

Will consider optimising policy and dynamics in remaining chapters of the book.

### Machine Learning: Parametric Model and Nonparametric Model
Parametric models predefine the functional form of the model.

E.g: Feature Vector $\vec{x}$, target vector $\vec{y}$, with relationship $\vec{y} = \vec{F}(\vec{x})$ where $\vec{F}$ is the mapping of some kind.

We want to model the mapping somehow. A parametric model assumes that the model has some form $\vec{y} = \vec{f}(\vec{x}; \vec{w}) + \vec{n}$.

Linear Model Example:
$$f(\vec{x}; \vec{w}) = \vec{x}^\top\vec{w} = \sum_{j \in J}x_jw_j$$
Simple and easy to understand, but it is limited to being a linear relationship when it is much more likely in practice that it is nonlinear.

Things about parametric models:
- Easy to train - fast and need little data, especially when number of parameters is small.
- Can be interpreted easily, especially when function form is simple.
- They limit the function form, but the optimal solution may not have the same form. Therefore, parametric models may not be optimal. Easier to underfit.
- Require a priori knowledge to determine function form. E.g: neural network models need some a priori knowledge for designing suitable network structure and parameters.

Successful parametric model requires both suitable function forms and suitable parameters.

Nonparametric models directly learn model from data without assumption of specific function form. E.g: nearest neighbours, decision trees, kernel methods.

Nonparametric models differ from parametric models in that they do not have presupposed function forms.
- More adaptable, since unconstrained by some function forms. They can fit data better.
- Require little a priori knowledge.
- More difficult to train. More data needed, involve more parameters when models are complex, training is slower.
- More likely to overfit.

When feature set is finite, can generate estimate for every feature. This method is called the tabular method. This does not assume a function form, so it is nonparametric. Can also be viewed as a special case of the linear model.

To see this, consider a feature space $\mathcal{X}$. Can map every $x \in \mathcal{X}$ to a vector $(\mathbb{1}_{[x=x']}: x' \in \mathcal{X})^\top$. This vector is equivalent to the feature vector in the linear model. 
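
A minimal sketch of this equivalence, assuming a small hypothetical feature space (`feature_space`) and made-up table entries (`table`). With indicator features, the linear model $\vec{x}^\top\vec{w}$ simply returns the weight that belongs to $x$, i.e. a table lookup.

```python
def one_hot(x, feature_space):
    """Map x to the indicator vector (1[x == x'] : x' in feature_space)."""
    return [1.0 if x == x_prime else 0.0 for x_prime in feature_space]


def linear_model(features, weights):
    """Linear model f(x; w) = x^T w."""
    return sum(f * w for f, w in zip(features, weights))


# Hypothetical finite feature space with one table entry per element.
feature_space = ["left", "middle", "right"]
table = {"left": -1.0, "middle": 0.5, "right": 2.0}

# Using the table entries as weights turns the linear model into a lookup.
weights = [table[x] for x in feature_space]
estimates = {x: linear_model(one_hot(x, feature_space), weights) for x in feature_space}
print(estimates)
```

The weight vector has one entry per element of the feature space, which is why this view only works when the feature set is finite.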

In RL, function approximation uses maths models to estimate functions such as value functions.

Parametric models include linear model and neural networks. Nonparametric models include kernel method. 

We focus on parametric models.

State values: $v(s; \vec{w})$, action values $q(s, a; \vec{w})$. If action space is finite, then can also use vector function $\vec{q}(s; \vec{w}) = (q(s, a; \vec{w}): a \in \mathcal{A})$ to estimate action values.

Each element in $\vec{q}(s; \vec{w})$ corresponds to an action, and the input of the vector function is the state. Can be used to approximate either action values of a policy or the optimal action values.

Linear approximation uses linear combination of multiple vectors to approximate values. E.g: can define vector as feature for each state-action pair, i.e: $\vec{x}(s, a) = (x_j(s,a): j \in J)$. Linear combination of these features can be written as follows:
$$q(s, a; \vec{w}) = [\vec{x}(s,a)]^\top\vec{w} = \sum_{j \in J}x_j(s,a)w_j$$
Linear approximation for state values:
$$v(s; \vec{w}) = [\vec{x}(s)]^\top\vec{w} = \sum_{j \in J}x_j(s)w_j$$
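
A small sketch of linear action-value approximation. The feature construction is an assumption made for illustration: `state_features` is a hypothetical hand-crafted map of a scalar state, and `state_action_features` copies it into the block of the chosen action so that each action has its own weights.

```python
def state_features(s):
    """Hypothetical hand-crafted features of a scalar state: bias, s, s^2."""
    return [1.0, s, s * s]


def state_action_features(s, a, n_actions):
    """x(s, a): place the state features in the block that belongs to action a."""
    base = state_features(s)
    x = [0.0] * (len(base) * n_actions)
    x[a * len(base):(a + 1) * len(base)] = base
    return x


def q(s, a, w, n_actions):
    """q(s, a; w) = x(s, a)^T w."""
    x = state_action_features(s, a, n_actions)
    return sum(xj * wj for xj, wj in zip(x, w))


def q_vector(s, w, n_actions):
    """Vector function q(s; w) with one element per action."""
    return [q(s, a, w, n_actions) for a in range(n_actions)]


n_actions = 2
w = [0.0, 1.0, 0.0,   # action 0: q = s
     1.0, 0.0, 2.0]   # action 1: q = 1 + 2 s^2
values = q_vector(0.5, w, n_actions)
print(values)
```

For a linear model the gradient with respect to $\vec{w}$ is just the feature vector, $\nabla q(s, a; \vec{w}) = \vec{x}(s, a)$, which keeps gradient-based updates cheap.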

DRL = Deep Reinforcement Learning.

## Stochastic Gradient Descent (SGD)
First order iterative optimisation algorithm to find optimal point for a subdifferentiable objective.

Consider differentiable and convex $f(\vec{x}) = \mathbb{E}[F(\vec{x})]$. Want to find solution of $\nabla f(\vec{x}) = \vec{0}$, i.e: solution of $\mathbb{E}[\nabla F(\vec{x})] = \vec{0}$. According to Robbins-Monro algorithm, we may consider the following update:
$$\vec{x}_{k+1} = \vec{x}_k - \alpha_k \nabla F(\vec{x}_k)$$
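
A toy sketch of this update, assuming the hypothetical objective $f(x) = \mathbb{E}[(x - Y)^2/2]$ with noisy samples $Y$, whose minimiser is $\mathbb{E}[Y]$. The decaying step size $\alpha_k = 1/(k+1)$ is one possible choice; with it the iterate coincides with the running mean of the samples.

```python
import random


def robbins_monro_mean(samples):
    """Minimise E[(x - Y)^2 / 2] using one noisy gradient per sample."""
    x = 0.0
    for k, y in enumerate(samples):
        alpha_k = 1.0 / (k + 1)      # decaying step size
        grad = x - y                 # gradient of the sample loss (x - y)^2 / 2
        x = x - alpha_k * grad
    return x


rng = random.Random(0)
samples = [rng.gauss(3.0, 1.0) for _ in range(10_000)]   # Y ~ N(3, 1), made-up data
x_final = robbins_monro_mean(samples)
sample_mean = sum(samples) / len(samples)
print(x_final, sample_mean)
```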

In RL, we can define the sample loss as $[G_t - q(S_t, A_t; \vec{w})]^2$, or $[G_t - v(S_t; \vec{w})]^2$. The total loss for an entire episode is
$$F(\vec{w}) = \frac{1}{2}\sum_{t = 0}^{T-1}[G_t - q(S_t, A_t; \vec{w})]^2 \quad \text{or} \quad F(\vec{w}) = \frac{1}{2}\sum_{t=0}^{T-1}[G_t - v(S_t; \vec{w})]^2$$
We multiply by $1/2$ because it cancels out when the gradient is taken.
For the single term at step $t$, this means
\begin{align*}
\nabla F(\vec{w}) &= \frac{1}{2}\nabla [G_t - q(S_t, A_t; \vec{w})]^2\\
&= - [G_t - q(S_t, A_t; \vec{w})]\nabla q(S_t, A_t; \vec{w})
\end{align*}
Or, if we are using state value estimates, 
\begin{align*}
\nabla F(\vec{w}) &= \frac{1}{2}\nabla [G_t - v(S_t; \vec{w})]^2\\
&= - [G_t - v(S_t; \vec{w})]\nabla v(S_t; \vec{w})
\end{align*}
Note that these are single-step gradients.

The following are single step updates as a consequence:
\begin{align*}
\vec{w} &\leftarrow \vec{w} + \alpha_t[G_t - q(S_t, A_t; \vec{w})]\nabla q(S_t, A_t; \vec{w})\\
\vec{w} &\leftarrow \vec{w} + \alpha_t[G_t - v(S_t; \vec{w})]\nabla v(S_t; \vec{w})
\end{align*}

### Algorithm 6.2: Policy Optimisation with function approximation and SGD
1. $\vec{w} \leftarrow$ arbitrary values.
2. For each episode:
    1. Use policy derived from current optimal action value estimates $q(\cdot, \cdot; \vec{w})$ (e.g: $\epsilon$-greedy policy) to generate trajectory $S_0, A_0, R_1, S_1, A_1, R_2, \dots, S_{T-1}, A_{T-1}, R_T, S_T$.
    2. $G\leftarrow 0$.
    3. For each $t \leftarrow T-1, T-2, \dots, 0$.
        1. $G \leftarrow \gamma G + R_{t+1}$.
        2. $\vec{w} \leftarrow \vec{w} + \alpha [G - q(S_t, A_t; \vec{w})]\nabla q(S_t, A_t; \vec{w})$.
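
A minimal sketch of the backward pass over one episode (the loop over $t$), assuming a linear $q$ so that $\nabla q(S_t, A_t; \vec{w}) = \vec{x}(S_t, A_t)$. The episode below is made-up data, and generating the trajectory with an $\epsilon$-greedy policy is left out.

```python
def mc_update(w, episode, alpha, gamma):
    """Walk the episode backwards, accumulate the return G and update w in place.

    episode is a list of (x, r) pairs with x = x(S_t, A_t) and r = R_{t+1}.
    """
    g = 0.0
    for x, r in reversed(episode):
        g = gamma * g + r                              # G <- gamma * G + R_{t+1}
        q = sum(xj * wj for xj, wj in zip(x, w))       # q(S_t, A_t; w)
        error = g - q
        for j, xj in enumerate(x):                     # gradient of linear q is x
            w[j] += alpha * error * xj
    return w


# Hypothetical two-step episode with one-hot features.
episode = [([1.0, 0.0], 0.0), ([0.0, 1.0], 1.0)]
w = mc_update([0.0, 0.0], episode, alpha=0.5, gamma=0.9)
print(w)
```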

## Semi-Gradient Descent
With parameterisation, 1-step TD return looks like
$$U_t = R_{t+1} + \gamma q(S_{t+1}, A_{t+1}; \vec{w})$$
When TD learning algorithms try to update value estimates, they will not change target $U_t$. Furthermore, when an algorithm with function approximation tries to update $\vec{w}$, it should not try to update the target $U_t$. 

Although we can see that $U_t$ depends on $\vec{w}$, we will still treat $U_t$ as a constant that has already been evaluated. This is how semi-gradient descent works.
1. (Initialise parameters) $\vec{w} \leftarrow$ arbitrary values.
2. For each episode:
    1. Choose initial state $S$.
    2. Loop until episode ends:
        1. (Decide) For policy evaluation, use $\pi(\cdot \mid S)$ to determine action $A$. For policy optimisation, use the policy derived from $q(S, \cdot; \vec{w})$ (e.g: $\epsilon$-greedy policy) to determine action $A$.
        2. (Sample) Execute $A$, observe $R$, $S'$, $D'$.
        3. (Calculate TD return) Perform **one** of the following: 
        \begin{align*}
        U &\leftarrow R + \gamma (1 - D')v(S'; \vec{w})\\
        U &\leftarrow R + \gamma (1 - D') \sum_a\pi(a \mid S'; \vec{w})q(S', a; \vec{w})\\
        U &\leftarrow R + \gamma (1 - D')\max_a q(S', a; \vec{w})
        \end{align*}
        4. (Update value parameter) Perform one of the following:
        \begin{align*}
        \vec{w} &\leftarrow \vec{w} + \alpha [U - v(S; \vec{w})]\nabla v(S;\vec{w})\\
        \vec{w} &\leftarrow \vec{w} + \alpha [U - q(S, A; \vec{w})]\nabla q(S, A; \vec{w})
        \end{align*}
        5. $S \leftarrow S'$.
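
A sketch of a single semi-gradient update for state values, assuming a linear $v(s; \vec{w}) = \vec{x}(s)^\top\vec{w}$. The feature vectors and numbers are made up. The target $U$ is computed from $\vec{w}$ but then treated as a constant, so only the features of $S$ enter the update.

```python
def dot(x, w):
    return sum(xj * wj for xj, wj in zip(x, w))


def semi_gradient_td0(w, x_s, reward, x_next, done, alpha, gamma):
    """Return the updated weights after one semi-gradient TD(0) step."""
    u = reward + gamma * (1.0 - float(done)) * dot(x_next, w)   # TD return, held fixed
    delta = u - dot(x_s, w)
    # Gradient of v(S; w) only; no gradient flows through u.
    return [wj + alpha * delta * xj for wj, xj in zip(w, x_s)]


w = [1.0, 2.0]
w_new = semi_gradient_td0(w, x_s=[1.0, 0.0], reward=1.0, x_next=[0.0, 1.0],
                          done=False, alpha=0.1, gamma=0.5)
w_terminal = semi_gradient_td0(w, x_s=[1.0, 0.0], reward=1.0, x_next=[0.0, 1.0],
                               done=True, alpha=0.1, gamma=0.5)
print(w_new, w_terminal)
```

In the first call the weight of the next state's feature stays untouched, even though $U$ depends on it. A full gradient of $[U - v(S; \vec{w})]^2$ would change it as well.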

## Semi-Gradient Descent with Eligibility Trace
ET algorithm maintains ET for each value estimate to indicate weight for updating.

Recently visited states/state-action pairs have larger weights, while the ones that were visited a long time ago are weighted less. Each update will modify all ETs of entire trajectory.

Function approximation can also be applied to ETs. Here, ETs correspond to the value params $\vec{w}$. That is, the ET parameter $\vec{z}$ has the same shape as $\vec{w}$, and each element of $\vec{z}$ corresponds to exactly one element of $\vec{w}$. For each pair of corresponding elements $w$ and $z$:
\begin{align*}
    w &\leftarrow w + \alpha z [U - q(S_t, A_t; \vec{w})]\\
    w &\leftarrow w + \alpha z [U - v(S_t; \vec{w})]
\end{align*}
Overall:
\begin{align*}
    \vec{w} &\leftarrow \vec{w} + \alpha [U - q(S_t, A_t; \vec{w})]\vec{z}\\
    \vec{w} &\leftarrow \vec{w} + \alpha [U - v(S_t;\vec{w})]\vec{z}
\end{align*}
When updating the ET parameter $\vec{z}$, we perform the following:
\begin{align*}
    \vec{z}_t &\leftarrow \gamma \lambda \vec{z}_{t-1} + \nabla q(S_t, A_t; \vec{w})\\
    \vec{z}_t &\leftarrow \gamma \lambda \vec{z}_{t-1} + \nabla v(S_t; \vec{w})
\end{align*}

### TD($\lambda$) policy evaluation or optimisation
1. $\vec{w} \leftarrow$ arbitrary values.
2. For each episode:
    1. $\vec{z} \leftarrow \vec{0}$.
    2. Choose initial state $S$. For policy evaluation use $\pi(\cdot \mid S)$ to determine action $A$. For policy optimisation, use policy derived from $q(S, \cdot; \vec{w})$ (such as $\epsilon$-greedy) to determine action $A$.
    3. Loop until episode end:
        1. Execute $A$, observe $R$, $S'$, $D'$.
        2. For policy evaluation, use policy $\pi(\cdot\mid S')$ to determine action $A'$. For policy optimisation, use policy derived from $q(S', \cdot; \vec{w})$ (e.g: $\epsilon$-greedy) to determine action $A'$.
        3. Perform **one** of the following:
        \begin{align*}
            U &\leftarrow R + \gamma (1 - D')q(S', A'; \vec{w})\\
            U &\leftarrow R + \gamma (1 - D')v(S'; \vec{w})\\
            U &\leftarrow R + \gamma (1 - D')\sum_a\pi(a\mid S'; \vec{w})q(S', a; \vec{w})\\
            U &\leftarrow R + \gamma (1 - D')\max_aq(S', a; \vec{w})
        \end{align*}
        4. $\vec{z} \leftarrow \gamma\lambda \vec{z} + \nabla q(S, A; \vec{w})$ (or $\nabla v(S; \vec{w})$ when approximating state values).
        5. Perform **one** of the following:
        \begin{align*}
            \vec{w} &\leftarrow \vec{w} + \alpha[U - q(S, A; \vec{w})]\vec{z}\\
            \vec{w} &\leftarrow \vec{w} + \alpha[U - v(S; \vec{w})]\vec{z}
        \end{align*}
        6. $S \leftarrow S'$, $A \leftarrow A'$.
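
A sketch of the inner loop for the state-value variant, assuming a linear $v$ with made-up one-hot features. It illustrates how the trace $\vec{z}$ passes part of a later TD error back to a state visited earlier.

```python
def dot(x, w):
    return sum(xj * wj for xj, wj in zip(x, w))


def td_lambda_step(w, z, x_s, reward, x_next, done, alpha, gamma, lam):
    """One TD(lambda) step: compute U, decay and bump the trace, then update w."""
    u = reward + gamma * (1.0 - float(done)) * dot(x_next, w)
    z = [gamma * lam * zj + xj for zj, xj in zip(z, x_s)]    # gradient of linear v is x
    delta = u - dot(x_s, w)
    w = [wj + alpha * delta * zj for wj, zj in zip(w, z)]
    return w, z


w, z = [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]
# Step 1: state 0 -> state 1, no reward.
w, z = td_lambda_step(w, z, [1.0, 0.0, 0.0], 0.0, [0.0, 1.0, 0.0], False,
                      alpha=0.1, gamma=1.0, lam=0.5)
# Step 2: state 1 -> terminal state, reward 1.
w, z = td_lambda_step(w, z, [0.0, 1.0, 0.0], 1.0, [0.0, 0.0, 1.0], True,
                      alpha=0.1, gamma=1.0, lam=0.5)
print(w, z)
```

After the second step the weight of state 0 has also moved, scaled by $\gamma\lambda$, although the reward arrived one step after leaving it.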

## Convergence of Function Approximation
Convergence of Policy Evaluation Algorithms
| Algorithm     | Tabular  | Linear Approximation | Non-Linear Approximation |
| ------------- | -------- | -------------------- | ------------------------ |
| on-policy MC  | converge | converge             | converge                 |
| on-policy TD  | converge | converge             | not always converge      |
| off-policy MC | converge | converge             | converge                 |
| off-policy TD | converge | not always converge  | not always converge      |

Convergence of Policy Optimisation Algorithms.
| Algorithm  | Tabular  | Linear Approximation              | Non-Linear Approximation |
| ---------- | -------- | --------------------------------- | ------------------------ |
| MC         | converge | converge/swing around optimal sol | not always converge      |
| SARSA      | converge | converge/swing around optimal sol | not always converge      |
| Q-Learning | converge | not always converge               | not always converge      |

All convergence guaranteed only when learning rate satisfies conditions of Robbins-Monro Algorithm. This can be proved.
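
For reference, the usual statement of the Robbins-Monro conditions on the learning rate sequence $\alpha_k$ is:

$$\alpha_k \geq 0, \qquad \sum_{k=0}^{\infty}\alpha_k = +\infty, \qquad \sum_{k=0}^{\infty}\alpha_k^2 < +\infty$$

For example, $\alpha_k = 1/(k+1)$ satisfies all three, while a constant learning rate violates the last one.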

## DQN: Deep Q Network
Use ANN to approximate action values. Since ANNs are expressive and can find features automatically, they have much larger potential than classic feature engineering methods.

Baird's Counterexample shows that we cannot guarantee the convergence of an algorithm that uses off-policy, bootstrapping and function approximation at the same time.

Some tricks have been devised:
- Experience replay: store transition experience and sample afterward from storage according to some rules.
- Target network: Change the way the networks are updated so that we do not immediately use the parameters we have just learned for bootstrapping.

### Experience Replay
Store experience so that it can be repeatedly used. 2 steps:
1. Store experience - a part of the trajectory such as $(S_t, A_t, R_{t+1}, S_{t+1}, D_{t+1})$.
2. Replay: randomly select some experiences from the storage, according to some selection rule.

Advantages:
- Same experience can be used multiple times, so the sample efficiency is improved. Useful when it is difficult or expensive to obtain data.
- Rearranges experiences, so the correlation between adjacent experiences is reduced. Distribution of data becomes more stable, making the training of neural networks easier.

Disadvantages:
- Takes some space to store experiences.
- Length of experience is limited, so it cannot be used for MC update when episodes have infinite steps.
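
A minimal sketch of the store and replay steps with uniform selection. The class name `ReplayBuffer` and its methods are hypothetical. A bounded `deque` drops the oldest experience once the capacity is reached, which is one common way to limit the storage.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity storage with uniform random replay."""

    def __init__(self, capacity, seed=None):
        self._storage = deque(maxlen=capacity)   # oldest entries are evicted first
        self._rng = random.Random(seed)

    def store(self, state, action, reward, next_state, done):
        self._storage.append((state, action, reward, next_state, done))

    def replay(self, batch_size):
        """Sample batch_size distinct experiences with equal probability."""
        return self._rng.sample(list(self._storage), batch_size)

    def __len__(self):
        return len(self._storage)


buffer = ReplayBuffer(capacity=3, seed=0)
for t in range(5):                               # made-up transitions
    buffer.store(t, 0, 1.0, t + 1, t == 4)
batch = buffer.replay(batch_size=2)
print(len(buffer), batch)
```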

### Algorithm 6.7: DQN Policy Optimisation with Experience Replay
Parameters: params for experience replay e.g: storage capacity, number of replayed samples in each batch, optimiser, discount factor.

1. Initialise
    1. Initialise $\vec{w}$.
    2. (Initialise experience storage) $\mathcal{D} \leftarrow \emptyset$.
2. For each episode:
    1. Choose initial state $S$.
    2. Loop until episode ends:
        1. (Collect experiences) Perform the following at least once:
            1. (Decide) Use the policy derived from $q(S, \cdot; \vec{w})$ (e.g: $\epsilon$-greedy policy) to determine action $A$.
            2. (Sample) Execute $A$, observe $R$, $S'$, and $D'$.
            3. Save experience $(S, A, R, S', D')$ in $\mathcal{D}$.
            4. $S\leftarrow S'$.
        2. (Use experiences) Perform the following at least once:
            1. (Replay) Sample batch of experience $\mathcal{B}$ from $\mathcal{D}$. Each entry is in the form $(S, A, R, S', D')$.
            2. $U \leftarrow R + \gamma (1 - D')\max_aq(S', a; \vec{w})$.
            3. Update $\vec{w}$ to reduce 
            $$\frac{1}{|\mathcal{B}|}\sum_{(S, A, R, S', D')\in \mathcal{B}}[U - q(S, A; \vec{w})]^2$$
            Namely, 
            $$\vec{w} \leftarrow \vec{w} + \frac{\alpha}{|\mathcal{B}|}\sum_{(S, A, R, S', D')\in \mathcal{B}}[U - q(S, A;\vec{w})]\nabla q(S, A; \vec{w})$$
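
A sketch of the "use experiences" step on one batch. To stay self-contained it assumes a linear $q$ with one weight row per action instead of a neural network, and the batch is made-up data. The targets are computed for every entry of the batch before any weight changes.

```python
def q_values(x, w):
    """Vector q(s; w) for a linear model with one weight row per action."""
    return [sum(xj * wj for xj, wj in zip(x, row)) for row in w]


def dqn_batch_update(w, batch, alpha, gamma):
    """Return new weights after one averaged semi-gradient step over the batch."""
    grad_sum = [[0.0] * len(row) for row in w]
    for x, a, r, x_next, done in batch:
        u = r + gamma * (1.0 - float(done)) * max(q_values(x_next, w))   # target U
        delta = u - q_values(x, w)[a]
        for j, xj in enumerate(x):            # only the row of action a has a gradient
            grad_sum[a][j] += delta * xj
    scale = alpha / len(batch)
    return [[wj + scale * gj for wj, gj in zip(row, g_row)]
            for row, g_row in zip(w, grad_sum)]


w = [[0.0, 0.0], [0.0, 0.0]]                  # 2 actions, 2 features
batch = [([1.0, 0.0], 0, 1.0, [0.0, 1.0], False),
         ([0.0, 1.0], 1, 2.0, [1.0, 0.0], True)]
w_new = dqn_batch_update(w, batch, alpha=0.5, gamma=0.9)
print(w_new)
```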

### Algorithm 6.8: DQN policy optimisation with experience replay (without looping over episodes explicitly).
1. Initialise
    1. Initialise $\vec{w}$.
    2. $\mathcal{D} \leftarrow \emptyset$.
    3. Choose initial state $S$.
2. Loop:
    1. (Collect experiences) Do the following once or multiple times:
        1. (Decide) Use the policy derived from $q(S, \cdot ; \vec{w})$ (e.g: $\epsilon$-greedy policy) to determine action $A$.
        2. (Sample) Execute the action $A$, observe $R$, $S'$, $D'$.
        3. Save the experience $(S, A, R, S', D')$ in $\mathcal{D}$.
        4. If the episode does not end, $S \leftarrow S'$. Otherwise choose initial state $S$ for the next episode.
    2. (Use experiences) Do the following at least once:
        1. (Replay) Sample batch of experiences $\mathcal{B}$ from storage $\mathcal{D}$.
        2. $U \leftarrow R + \gamma (1 - D') \max_aq(S', a; \vec{w})$.
        3. $$\vec{w} \leftarrow \vec{w} + \frac{\alpha}{|\mathcal{B}|}\sum_{(S, A, R, S', D') \in \mathcal{B}}[U - q(S, A; \vec{w})]\nabla q(S, A; \vec{w})$$

Experience replay can be further subclassed:
- Centralised replay: agent interacts with one environment only, and all experiences are stored in a storage.
- Distributed replay: Multiple agent workers interact with their own environments, and then store all experiences in a storage. Uses more resources to generate experiences more quickly, so convergence can be faster in terms of wall-clock time.

From replay aspect, experience replay can be subclassed:
- Uniform experience replay: when replayer selects experiences from the storage, all experiences can be selected with equal probability.
- Prioritised Experience Replay (PER): designate a priority value for each experience. When replayer selects experiences from storage, some experiences with higher priority can be selected with larger probability.

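For PER, one example method selects experience $i$ with priority $p_i$ with the following probability, where the exponent $\alpha \geq 0$ (not the learning rate) controls how strongly the priorities affect selection: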
$$\frac{p_i^\alpha}{\sum_k p_k^\alpha}$$
- Proportional priority: The priority for experience $i$ is $$p_i = |\delta_i| + \epsilon$$ where $\delta_i$ is the TD error, defined as $\delta_i = U_i - q(S_i, A_i;\vec{w})$ or $\delta_i = U_i - v(S_i;\vec{w})$ and $\epsilon$ is a small positive number.
- Rank-based priority: priority for experience $i$ is $$p_i = \frac{1}{\text{rank}_i}$$ where $\text{rank}_i$ is the rank of experience $i$ sorted in descending order according to $|\delta_i|$, starting from 1.
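
A short sketch of both priority definitions and the resulting selection probabilities. The TD errors are made-up numbers, and the argument `alpha` is the priority exponent from the probability formula, not the learning rate.

```python
def proportional_priorities(td_errors, epsilon=0.01):
    """p_i = |delta_i| + epsilon."""
    return [abs(d) + epsilon for d in td_errors]


def rank_based_priorities(td_errors):
    """p_i = 1 / rank_i, where rank 1 is the largest |delta_i|."""
    order = sorted(range(len(td_errors)), key=lambda i: abs(td_errors[i]), reverse=True)
    priorities = [0.0] * len(td_errors)
    for rank, i in enumerate(order, start=1):
        priorities[i] = 1.0 / rank
    return priorities


def selection_probabilities(priorities, alpha):
    """P(i) = p_i^alpha / sum_k p_k^alpha."""
    powered = [p ** alpha for p in priorities]
    total = sum(powered)
    return [p / total for p in powered]


td_errors = [0.5, -2.0, 0.0, 1.0]
prop = proportional_priorities(td_errors)
ranked = rank_based_priorities(td_errors)
probs = selection_probabilities(prop, alpha=1.0)
uniform = selection_probabilities(prop, alpha=0.0)
print(ranked, probs, uniform)
```

Setting the exponent to 0 recovers uniform replay, while larger values concentrate the replay on experiences with large TD errors.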

Selecting experience via PER requires a lot of computation, and this cannot be accelerated via GPU. 

To effectively select experiences, developers usually use trees (sum tree or binary indexed tree) to maintain priorities.
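
A compact sketch of a sum tree, assuming the leaves hold the (already exponentiated) priorities. The class and method names are hypothetical. Both updating one priority and drawing one sample take time logarithmic in the capacity.

```python
import random


class SumTree:
    """Complete binary tree in a flat list; each internal node stores the sum of its children."""

    def __init__(self, capacity):
        self.size = 1
        while self.size < capacity:              # round up to a power of two
            self.size *= 2
        self.nodes = [0.0] * (2 * self.size)     # leaf i lives at index size + i

    def update(self, index, priority):
        """Set the priority of one experience and refresh the sums above it."""
        pos = self.size + index
        self.nodes[pos] = priority
        pos //= 2
        while pos >= 1:
            self.nodes[pos] = self.nodes[2 * pos] + self.nodes[2 * pos + 1]
            pos //= 2

    def total(self):
        return self.nodes[1]

    def find(self, mass):
        """Return the index whose cumulative-priority interval contains mass."""
        pos = 1
        while pos < self.size:
            left = 2 * pos
            if mass < self.nodes[left]:
                pos = left
            else:
                mass -= self.nodes[left]
                pos = left + 1
        return pos - self.size

    def sample(self, rng):
        """Draw an index with probability proportional to its priority."""
        return self.find(rng.random() * self.total())


tree = SumTree(capacity=4)
for i, p in enumerate([1.0, 2.0, 3.0, 4.0]):     # made-up priorities
    tree.update(i, p)
found = [tree.find(m) for m in (0.5, 1.0, 2.9, 3.0, 9.9)]
print(tree.total(), found, tree.sample(random.Random(0)))
```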

## Deep Q Learning with Target Network
Since TD learning uses bootstrapping, both the TD return and the value estimate depend on the parameter $\vec{w}$, so changing $\vec{w}$ modifies both. Learning will be unstable if the action-value estimates are chasing a moving target.

Use semi-gradient descent - avoids calculating gradients $\frac{\partial U_t}{\partial w_i}$. One way to avoid gradient calculation is to copy parameter as $\vec{w}_\text{target}$, and use it to calculate $U_t$.

The target network has the same architecture as the original network (the evaluation network). We use the target network to calculate the TD return that serves as the learning target. Each update only changes the params of the eval network, not the params of the target network. This keeps the learning target relatively constant.

After some iterations, can assign params of eval network to target network, so the target network can be updated too.

This dynamic provides a more stable learning process for the algorithm.

### Algorithm 6.9: DQN with experience replay and target network.
1. Initialise:
    1. Initialise $\vec{w}$, and $\vec{w}_\text{target} \leftarrow \vec{w}$.
    2. $\mathcal{D} \leftarrow \emptyset$.
2. For each episode:
    1. Choose initial state $S$.
    2. Loop until episode ends:
        1. Collect experiences by performing the following at least once:
            1. (Decide) use policy derived from $q(S, \cdot; \vec{w})$ (e.g: $\epsilon$-greedy) to determine action $A$.
            2. (Sample) Execute $A$, observe $R$, $S'$, $D'$.
            3. Save experience $(S, A, R, S', D')$ in $\mathcal{D}$.
            4. $S\leftarrow S'$.
        2. Use the experiences by performing the following at least once:
            1. (Replay) Sample batch of experience $\mathcal{B} \subseteq \mathcal{D}$.
            2. $U \leftarrow R + \gamma (1 - D')\max_a q(S', a; \vec{w}_\text{target})$.
            3. $\vec{w} \leftarrow \vec{w} + \frac{\alpha}{|\mathcal{B}|}\sum_{(S, A, R, S', D') \in \mathcal{B}}[U - q(S, A; \vec{w})]\nabla q(S, A; \vec{w})$.
            4. (Update target network) Under some condition (e.g: every few updates), $\vec{w}_\text{target} \leftarrow (1 - \alpha_\text{target})\vec{w}_\text{target} + \alpha_\text{target}\vec{w}$.
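
A sketch of the target network update alone, with the weights as plain lists. With $\alpha_\text{target} = 1$ the update is a hard copy, while smaller values blend the two parameter sets. The schedule `update_every` and the fake "gradient step" are assumptions for illustration.

```python
def update_target(w_target, w, alpha_target):
    """Return (1 - alpha_target) * w_target + alpha_target * w, element by element."""
    return [(1.0 - alpha_target) * wt + alpha_target * we
            for wt, we in zip(w_target, w)]


w = [1.0, 2.0]
w_target = [0.0, 0.0]
soft = update_target(w_target, w, alpha_target=0.1)   # small blend towards w
hard = update_target(w_target, w, alpha_target=1.0)   # plain copy of w

# Hypothetical schedule: sync the target every 3 updates of the eval network.
update_every = 3
history = []
for step in range(1, 7):
    w = [wj + 1.0 for wj in w]                         # stand-in for a gradient step
    if step % update_every == 0:
        w_target = update_target(w_target, w, alpha_target=1.0)
    history.append(w_target[0])
print(soft, hard, history)
```

The recorded `history` stays flat between syncs, which is what keeps the learning target fixed while the eval network changes.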

## Double DQN
We know that Q learning may introduce maximisation bias, and double learning can help reduce this. Tabular version of double Q learning maintains 2 copies of estimates $q^{(0)}$ and $q^{(1)}$, and updates one of them every update.

Applying this methodology into DQN leads to Double DQN. Since DQN algorithm already has 2 networks, double DQN can still use them.

Each update chooses one network as eval network to select action, and uses another as target network to calculate TD return.

Changing
$$U \leftarrow R + \gamma (1 - D') \max_a q(S', a; \vec{w}_\text{target})$$
into
$$U \leftarrow R + \gamma (1 - D') q(S', \arg\max_a q(S', a; \vec{w}); \vec{w}_\text{target})$$
will form the Double DQN algorithm.
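
A side-by-side sketch of the two targets for a single transition. The action values of $S'$ under the eval network (`q_next_eval`) and the target network (`q_next_target`) are made-up numbers chosen so that the two networks disagree.

```python
def dqn_target(reward, done, q_next_target, gamma):
    """U = R + gamma * (1 - D') * max_a q(S', a; w_target)."""
    return reward + gamma * (1.0 - float(done)) * max(q_next_target)


def double_dqn_target(reward, done, q_next_eval, q_next_target, gamma):
    """Select the action with the eval network, evaluate it with the target network."""
    best = max(range(len(q_next_eval)), key=lambda a: q_next_eval[a])
    return reward + gamma * (1.0 - float(done)) * q_next_target[best]


q_next_eval = [1.0, 3.0, 2.0]      # q(S', . ; w)
q_next_target = [5.0, 0.5, 4.0]    # q(S', . ; w_target)
u_dqn = dqn_target(1.0, False, q_next_target, gamma=0.5)
u_double = double_dqn_target(1.0, False, q_next_eval, q_next_target, gamma=0.5)
print(u_dqn, u_double)
```

Because the evaluated action is not necessarily the maximiser of the target network, the Double DQN target can never exceed the DQN target computed from the same values.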

## Dueling DQN
Advantage is the difference between action values and state values:
$$a(s, a) = q(s, a) - v(s)$$
In some RL tasks, difference among multiple advantages of different actions with the same state is much smaller than the difference among different state values.

This gave rise to the dueling network. It is still used to approximate the action values $q(\cdot, \cdot; \vec{w})$, but the action value network $q(\cdot, \cdot; \vec{w})$ is implemented as the summation of a state value network and an advantage network, i.e:
$$q(s,a; \vec{w}) = v(s; \vec{w}) + a(s, a; \vec{w})$$
During training, $v$ and $a$ are jointly trained, and the training process has no difference compared to that of a normal network.

Infinite number of ways to partition a set of action value estimates $q(s, a; \vec{w})$ into a set of state values $v$ and a set of advantages $a$. Specifically, we can write
$$q(s, a; \vec{w}) = v(s; \vec{w}) + a(s, a; \vec{w}) = (v(s; \vec{w}) + c(s)) + (a(s, a; \vec{w}) - c(s))$$
To help with training, design special network architecture on advantage network so that the advantage is unique. Common methods include:
- Limit advantage ($a_\text{duel}$) after partition so its simple average over different actions is 0. $$\sum_aa_\text{duel}(s,a;\vec{w}) = 0$$ This can be achieved by the following network structure: $$a_\text{duel}(s, a; \vec{w}) = a(s, a; \vec{w}) - \frac{1}{|\mathcal{A}|}\sum_{a'}a(s, a'; \vec{w})$$
- Limit advantage after partition $a_\text{duel}$ so that its maximum value over different actions is 0. $$\max_a a_\text{duel}(s, a; \vec{w}) = 0$$ This can be achieved by the following network structure $$a_\text{duel}(s, a; \vec{w}) = a(s, a; \vec{w}) - \max_{a'}a(s, a'; \vec{w})$$
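
A sketch of the two constraints applied to the raw outputs of the value and advantage streams for one state (made-up numbers). Subtracting the mean makes the advantages average to 0, and subtracting the maximum makes their largest value 0. In both cases, adding a constant $c(s)$ to every raw advantage no longer changes the result.

```python
def dueling_q_mean(v, advantages):
    """q = v + (a - mean(a)), so the adjusted advantages average to 0."""
    mean_a = sum(advantages) / len(advantages)
    return [v + (a - mean_a) for a in advantages]


def dueling_q_max(v, advantages):
    """q = v + (a - max(a)), so the largest adjusted advantage is 0."""
    max_a = max(advantages)
    return [v + (a - max_a) for a in advantages]


v = 2.0
advantages = [1.0, 3.0, 2.0]
shifted = [a + 10.0 for a in advantages]      # same advantages plus a constant c(s)

q_mean = dueling_q_mean(v, advantages)
q_max = dueling_q_max(v, advantages)
print(q_mean, q_max)
print(dueling_q_mean(v, shifted), dueling_q_max(v, shifted))
```

With the max variant, $v(s; \vec{w})$ equals the largest action value of the state. With the mean variant it equals the average action value.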

### Tricks used by different algorithms
| Algorithm   | Experience Replay | Target Network | Double Learning | Dueling Network |
| ----------- | ----------------- | -------------- | --------------- | --------------- |
| DQN         | V                 | V              |                 |                 |
| Double DQN  | V                 | V              | V               |                 |
| Dueling DQN | V                 | V              |                 | V               |
| D3QN        | V                 | V              | V               | V               |

## Feature Encoding: One-Hot Coding and Tile Coding
Discretise continuous inputs to finite features.

In `MountainCar-v0`, observation has position and velocity.

Easiest way to construct finite features from this continuous observation space is one-hot encoding.

Partition 2d pos-vel space into grid. The total length of position axis is $l_x$ and position length of each cell is $\delta_x$, so there are $b_x = \lceil l_x / \delta_x\rceil$ cells in the position axis. Similarly, the total length of velocity axis is $l_v$, and the velocity length of each cell is $\delta_v$, so there are $b_v = \lceil l_v/\delta_v\rceil$ cells in the velocity axis. Therefore there are $b_xb_v$ features in total.

One-hot encoding approximates the values such that all state-action pairs in the same cell will have the same feature values. Can reduce $\delta_x$ and $\delta_v$ to make the approximation more accurate, but it will also increase the number of features.
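
A sketch of the grid construction for `MountainCar-v0`, whose observation bounds are position in $[-1.2, 0.6]$ and velocity in $[-0.07, 0.07]$. The cell sizes are hypothetical, and the small tolerance inside `grid_shape` is an assumption added to guard against floating-point rounding in $\lceil l / \delta \rceil$.

```python
import math

# Observation bounds of MountainCar-v0.
LOW = (-1.2, -0.07)
HIGH = (0.6, 0.07)


def grid_shape(cell):
    """Cells per axis, b = ceil(l / delta), with a tolerance for rounding errors."""
    return tuple(math.ceil((hi - lo) / d - 1e-9) for lo, hi, d in zip(LOW, HIGH, cell))


def one_hot_features(position, velocity, cell):
    """One-hot list of length b_x * b_v marking the cell that contains the observation."""
    b_x, b_v = grid_shape(cell)
    i_x = min(int((position - LOW[0]) / cell[0]), b_x - 1)   # clamp the upper boundary
    i_v = min(int((velocity - LOW[1]) / cell[1]), b_v - 1)
    features = [0.0] * (b_x * b_v)
    features[i_x * b_v + i_v] = 1.0
    return features


cell = (0.1, 0.01)                       # hypothetical (delta_x, delta_v)
features = one_hot_features(-0.45, 0.005, cell)
print(grid_shape(cell), len(features), features.index(1.0))
```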

Tile coding tries to reduce number of features without sacrificing the precision.

Introduces multiple layers of large grids. For tile coding with $m > 1$ layers, the large grid in each layer is $m$ times as wide and $m$ times as high as the small grid in one-hot encoding. For every two adjacent layers of large grids, the difference of positions between these two layers in either dimension equals the length of a small cell.

Given arbitrary pos-vel pair, it will fall into one large cell in every layer. Therefore, we can conduct one-hot coding on each large grid, and each layer has roughly $b_x/m \times b_v / m$ features. All $m$ layers together have approximately $b_xb_v/m$ features, which is much less than the size of naive one-hot encoding.
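
A sketch of tile coding on the `MountainCar-v0` observation space under one possible reading of the layout: layer $k$ is shifted by $k$ small cells in both dimensions, and each large cell spans $m$ small cells. The cell sizes and the dictionary of weights are assumptions for illustration.

```python
import math

LOW = (-1.2, -0.07)       # lower bounds of position and velocity in MountainCar-v0
CELL = (0.1, 0.01)        # hypothetical small-cell sizes (delta_x, delta_v)


def active_tiles(position, velocity, m):
    """Return one active tile (layer, i_x, i_v) for each of the m layers."""
    tiles = []
    for layer in range(m):
        # Layer k is shifted by k small cells; a large cell spans m small cells.
        i_x = math.floor((position - LOW[0] + layer * CELL[0]) / (m * CELL[0]))
        i_v = math.floor((velocity - LOW[1] + layer * CELL[1]) / (m * CELL[1]))
        tiles.append((layer, i_x, i_v))
    return tiles


def value(tiles, weights):
    """Linear value estimate: sum of the weights of the active tiles."""
    return sum(weights.get(tile, 0.0) for tile in tiles)


tiles_a = active_tiles(-0.45, 0.005, m=3)
tiles_b = active_tiles(-0.35, 0.005, m=3)      # one small cell to the right
shared = set(tiles_a) & set(tiles_b)

weights = {tile: 1.0 for tile in tiles_a}      # pretend only tiles_a has been updated
print(tiles_a, tiles_b, len(shared))
print(value(tiles_a, weights), value(tiles_b, weights))
```

The two nearby observations share two of their three tiles, so an update at one of them partly generalises to the other while the remaining tile still tells them apart.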