# Function Approximators
The key idea in this lesson is that Q tables stop being usable once the number of states grows: when the state is an image, a Go board or a chess board, for instance, the number of possible states is far too big for a Q table. The only solution is to try to capture the logic of the game, that is, the main features that need to be taken into account, in order to reduce the dimension of the problem. Some approaches relied on handcrafted features (Deep Blue, for instance, combined them with an opening and an endgame library), but handcrafted features encode too many priors about the game.
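To get a feel for why a Q table does not scale, here is a rough back-of-the-envelope count. It is only a sketch: it assumes every cell of the board can independently be empty, black or white, so it gives an upper bound that also counts illegal positions. The helper name `table_entries` is made up for this example.

```python
def table_entries(num_cells: int, values_per_cell: int = 3) -> int:
    """Upper bound on the number of board configurations (illegal ones included)."""
    return values_per_cell ** num_cells

tic_tac_toe = table_entries(3 * 3)   # small enough for a Q table
go = table_entries(19 * 19)          # hopeless for a Q table

print(f"tic-tac-toe: at most {tic_tac_toe} board configurations")
print(f"Go (19x19): roughly 10^{len(str(go)) - 1} board configurations")
```

Even before multiplying by the number of actions, the Go count is far beyond anything that can be stored or visited, which is why the values have to be generalized across states instead of being stored one by one.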

If we think about chess, one feature that comes to mind is the material count: you assume that if you have more pieces than your opponent you are in a winning position. There might be some truth in that, but giving this feature to a neural network will encourage a very protective strategy that can keep it from winning. The only thing that matters in chess is checkmate, no matter how many pawns, rooks, knights, queens and bishops you have.

In a reinforcement learning setting you give your agent an encoded version of the board (see [here](https://arxiv.org/pdf/1712.01815.pdf) for details on how it can be achieved), a reward of 1 if it wins and a reward of -1 if it loses. You let it build its own interpretation of the environment, of the game. Free of priors and of handcrafted features, it can find the optimal policy (convergence is not easy, but it is doable, as DeepMind showed us).

Function approximators are the key to generalization: choosing a suitable one (convolutional networks for board games, for instance) can help convergence a lot, but many other tricks are needed as well to make it work.

This lesson introduces the use of function approximators in RL settings, but the devil is in the details: the solutions presented here don't always work.

## Value approximation

The goal here is to learn the function $v$ by approximating it with a parametric function $v_{\theta} : \mathcal S \rightarrow \mathbb R$. Ideally we would like to minimize the squared error between the two and find the best set of parameters $\theta^*$:


$$
\theta^* = \text{argmin}_\theta \, \mathbb E _{s \in \mathcal S} \left[(v(s)-v_{\theta}(s))^2\right]
$$

However, in practice we don't have access to $v$.

Thus we replace $v$ with the return $G_t$ and approximate the expectation over sampled states $S_i$:

$$
\theta^* \approx \text{argmin}_\theta \, \frac{1}{N}\sum_{i=1}^{N} \left(G_t(S_i)-v_{\theta}(S_i)\right)^2 \qquad \text{(1)}
$$
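As an illustration of objective (1) with Monte-Carlo returns, here is a minimal sketch. Everything in it is an assumption made for the example: the environment is a toy 5-state random walk (reward 1 when leaving on the right, 0 on the left, no discounting), $v_\theta$ is linear in two handmade features, and the names `features`, `predict`, `collect_samples` and `fit` are hypothetical.

```python
import random

N_STATES = 5   # non-terminal states 0..4 of a toy random-walk chain
START = 2      # every episode starts in the middle

def features(s):
    """Feature vector phi(s): a bias term and the normalised position."""
    return (1.0, s / (N_STATES - 1))

def predict(theta, phi):
    """Linear value approximation v_theta(s) = theta . phi(s)."""
    return theta[0] * phi[0] + theta[1] * phi[1]

def collect_samples(n_episodes, seed=0):
    """Play random episodes and return (phi(S_i), G_i) pairs, every-visit MC."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_episodes):
        s, visited = START, []
        while 0 <= s < N_STATES:
            visited.append(s)
            s += rng.choice((-1, 1))
        g = 1.0 if s == N_STATES else 0.0   # return is the terminal reward (gamma = 1)
        samples.extend((features(v), g) for v in visited)
    return samples

def fit(samples, lr=0.5, n_steps=200):
    """Full-batch gradient descent on the mean squared error of objective (1)."""
    theta = [0.0, 0.0]
    n = len(samples)
    for _ in range(n_steps):
        grad = [0.0, 0.0]
        for phi, g in samples:
            err = g - predict(theta, phi)
            grad[0] += err * phi[0]
            grad[1] += err * phi[1]
        # Step along -1/2 of the gradient of the mean squared error.
        theta = [theta[0] + lr * grad[0] / n, theta[1] + lr * grad[1] / n]
    return theta

theta = fit(collect_samples(500))
print([round(predict(theta, features(s)), 2) for s in range(N_STATES)])
```

For this particular chain the true values are $1/6, 2/6, \dots, 5/6$, which happen to be exactly linear in the chosen features, so the fitted values should land close to them. With features that cannot represent $v$, the fit would only be the best linear compromise.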

This return $G_t$ can be estimated with Monte-Carlo, TD(0) or TD($\lambda$). In the case of TD(0) and TD($\lambda$) we need to use the function approximator itself to compute the target, following the **Bellman expectation equation:**
$$
G_t^{(1)} = R_{t+1} + \gamma v(S_{t+1}) \approx R_{t+1} + \gamma v_{\theta}(S_{t+1})
$$
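The TD(0) target above can be plugged into an online update, often called semi-gradient TD(0) because the target is treated as a constant when differentiating. The sketch below is only an illustration under assumptions: a toy 5-state random walk (reward 1 when leaving on the right, 0 on the left), a linear $v_\theta$ with two handmade features, and hypothetical names `features`, `v_theta` and `td0_train`.

```python
import random

N_STATES = 5   # non-terminal states 0..4 of a toy random-walk chain
START = 2
GAMMA = 1.0    # undiscounted episodic task

def features(s):
    """Feature vector phi(s): a bias term and the normalised position."""
    return (1.0, s / (N_STATES - 1))

def v_theta(theta, s):
    """Linear value approximation v_theta(s) = theta . phi(s)."""
    return sum(t * x for t, x in zip(theta, features(s)))

def td0_train(n_episodes=5000, alpha=0.01, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    for _ in range(n_episodes):
        s = START
        while True:
            s_next = s + rng.choice((-1, 1))
            done = s_next < 0 or s_next >= N_STATES
            reward = 1.0 if s_next == N_STATES else 0.0
            # TD(0) target: bootstrap from the current approximator.
            target = reward if done else reward + GAMMA * v_theta(theta, s_next)
            td_error = target - v_theta(theta, s)
            # Semi-gradient step: only v_theta(s) is differentiated.
            theta = [t + alpha * td_error * x for t, x in zip(theta, features(s))]
            if done:
                break
            s = s_next
    return theta

theta = td0_train()
print([round(v_theta(theta, s), 2) for s in range(N_STATES)])
```

Compared with a Monte-Carlo target, each update happens right after a transition instead of at the end of the episode, at the price of a target that moves whenever $\theta$ changes.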


This value approximation was used by [Tesauro's TD-Gammon](https://www.bkgm.com/articles/tesauro/tdl.html) to play backgammon: it was one of the very first successes in mastering a game through self-play with a non-linear function approximator, by performing gradient descent on the objective function (1).


## Action Value Approximation
In a similar fashion we can build an approximator of the action value function with sampled states and actions $(S_i,A_i)$:
$$
\theta^* \approx \text{argmin}_\theta \, \frac{1}{N}\sum_{i=1}^{N} \left(G_t(S_i,A_i)-q_{\theta}(S_i,A_i)\right)^2 \qquad \text{(2)}
$$
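The target can be computed using MC, SARSA or SARSA($\lambda$). Note that SARSA bootstraps from the next action actually taken, $q_{\theta}(S_{t+1},A_{t+1})$; taking the maximum over actions instead, as in the **Bellman optimality equation** below, gives the Q-learning target: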

$$
G_t^{(1)} = R_{t+1} + \gamma \max_{a \in \mathcal A} q(S_{t+1},a) \approx R_{t+1} + \gamma \max_{a \in \mathcal A} q_{\theta}(S_{t+1},a)
$$
The best set of parameters can be found using gradient descent on (2).
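A single gradient step on (2) can be sketched as follows. This is a minimal illustration, assuming a linear $q_\theta(s,a) = \theta \cdot \phi(s,a)$ with made-up feature vectors; the function names are hypothetical. It also shows the two one-step targets side by side: the one that bootstraps from the next action actually taken (SARSA) and the one that takes the maximum over next actions (Q-learning).

```python
def q_value(theta, phi):
    """Linear action-value approximation q_theta(s, a) = theta . phi(s, a)."""
    return sum(t * x for t, x in zip(theta, phi))

def sarsa_target(theta, reward, next_phi, gamma, done):
    """One-step target bootstrapping from the next action actually taken."""
    return reward if done else reward + gamma * q_value(theta, next_phi)

def q_learning_target(theta, reward, next_phis, gamma, done):
    """One-step target bootstrapping from the greedy next action (max over a)."""
    if done:
        return reward
    return reward + gamma * max(q_value(theta, phi) for phi in next_phis)

def semi_gradient_step(theta, phi, target, alpha):
    """Move q_theta(s, a) towards the target, treating the target as a constant."""
    error = target - q_value(theta, phi)
    return [t + alpha * error * x for t, x in zip(theta, phi)]

# Toy numbers, chosen only for illustration.
theta = [0.5, -0.5]
phi = (1.0, 0.0)                            # features of the current (S_t, A_t)
next_phis = [(1.0, 0.0), (0.0, 1.0)]        # features of (S_{t+1}, a) for each action a

target_q = q_learning_target(theta, 1.0, next_phis, gamma=0.9, done=False)
target_sarsa = sarsa_target(theta, 1.0, next_phis[1], gamma=0.9, done=False)
new_theta = semi_gradient_step(theta, phi, target_q, alpha=0.1)
print(target_q, target_sarsa, new_theta)
```

The two targets differ as soon as the action taken in $S_{t+1}$ is not the greedy one, which is the case in this toy example.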

## What is wrong in practice
After the encouraging results of the previous sections you might feel ready to solve every problem on earth, yet in practice it doesn't work: if the samples used for learning are collected on the fly and the TD target is computed with the current value approximator, the function has major difficulties converging.

Some tips/tricks have to be implemented in order to properly control the convergence and I'll 