# Learn RL Notes Part 3

# 6 Value Function Approximation

## Large-Scale Reinforcement Learning
Reinforcement learning can be used to solve large problems, e.g.
* Backgammon: $10^{20}$ states 
* Computer Go: $10^{170}$ states 
* Helicopter: continuous state space

How can we **scale up** the model-free methods for prediction and control from the last two lectures?

So far we have represented the value function by a **lookup table**
* Every state s has an entry V(s)
* Or every state-action pair s,a has an entry Q(s,a)

Problem with large MDPs:
* There are too many states and/or actions to store in memory 
* It is too slow to learn the value of each state individually

## Value Function Approximation
Solution for large MDPs:
* Estimate value function with function approximation
$$ \hat{v} ( s , w ) \approx v_π ( s ) $$
or
$$ \hat{q} ( s , a , w ) \approx q_π ( s , a ) $$
* Generalise from seen states to unseen states 
* Update parameter w using MC or TD learning
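
As a rough illustration of updating $w$ with TD learning, the sketch below applies a semi-gradient TD(0) update to a linear approximator $\hat{v}(s, w) = x(s)^\top w$. The feature map, step size, and the small random-walk task are assumptions chosen for illustration; they are not taken from the lecture.

```python
import random


def features(state, n_states):
    """Hypothetical feature map: a bias term plus the position scaled to [0, 1]."""
    return [1.0, state / (n_states - 1)]


def v_hat(state, w, n_states):
    """Linear value estimate: dot product of the features and the weights."""
    return sum(x_i * w_i for x_i, w_i in zip(features(state, n_states), w))


def td0_linear(n_states=7, episodes=10000, alpha=0.005, gamma=1.0, seed=0):
    """Semi-gradient TD(0) on a random walk with terminal states at both ends.

    The reward is +1 on reaching the right terminal state and 0 otherwise.
    """
    rng = random.Random(seed)
    w = [0.0, 0.0]
    for _ in range(episodes):
        s = n_states // 2  # every episode starts in the middle
        while 0 < s < n_states - 1:
            s_next = s + rng.choice((-1, 1))
            terminal = s_next in (0, n_states - 1)
            reward = 1.0 if s_next == n_states - 1 else 0.0
            # TD target: reward plus discounted estimate of the next state
            # (terminal states have value 0 by definition).
            bootstrap = 0.0 if terminal else gamma * v_hat(s_next, w, n_states)
            delta = reward + bootstrap - v_hat(s, w, n_states)
            # For a linear approximator the gradient of v_hat w.r.t. w is x(s).
            x = features(s, n_states)
            w = [w_i + alpha * delta * x_i for w_i, x_i in zip(w, x)]
            s = s_next
    return w


w = td0_linear()
print([round(v_hat(s, w, 7), 2) for s in range(1, 6)])
```

Only two weights are learned, yet they produce an estimate for every non-terminal state, which is the generalisation the lookup table lacks.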

<img width=600 src="images/rl-fa-types.png" />
<img width=600 src="images/rl-fa-choose.png" />

## 6.1 Incremental Methods

### 6.1.1 Gradient Descent
<img width=600 src="images/rl-fa-gd.png" />
<img width=600 src="images/rl-fa-sgd.png" />

### 6.1.2 Linear Function Approximation
<img width=600 src="images/rl-fa-feature.png" />

<img width=600 src="images/rl-fa-linear.png" />

<img width=600 src="images/rl-fa-table-lookup-features.png" />


### 6.1.3 Incremental Prediction Algorithms
<img width=600 src="images/rl-fa-incremental-prediction-algo.png" />

<img width=600 src="images/rl-fa-mc.png" />

<img width=600 src="images/rl-fa-td0.png" />

<img width=600 src="images/rl-fa-td-lambda2.png" />


### 6.1.4 Incremental Control Algorithms
<img width=600 src="images/rl-fa-control.png" />

<img width=600 src="images/rl-fa-action-value-fa.png" />

<img width=600 src="images/rl-fa-linear-action-value-fa.png" />

<img width=600 src="images/rl-fa-incremental-control-algo.png" />


### 6.1.5 Convergence
<img width=600 src="images/rl-fa-convergence-prediction.png" />
<img width=600 src="images/rl-fa-convergence-gradient-td.png" />
<img width=600 src="images/rl-fa-convergence-control.png" />

## 6.2 Batch Methods
### Batch Reinforcement Learning
* Gradient descent is simple and appealing
* But it is not sample efficient
* Batch methods seek to find the best fitting value function 
* Given the agent’s experience (“training data”)

### 6.2.1 Least Squares Prediction
<img width=600 src="images/rl-fa-lsp.png" />

**Stochastic Gradient Descent with Experience Replay**
<img width=600 src="images/rl-fa-sgd-with-experience-replay.png" />
<img width=600 src="images/rl-fa-dqn.png" />

**Linear Least Squares Prediction**
* Experience replay finds the least squares solution
* But it may take many iterations
* Using linear value function approximation $\hat{v} (s, w) = x(s)^T w$
* **We can solve for the least squares solution directly**
<img width=600 src="images/rl-fa-lsp-linear.png" />
<img width=600 src="images/rl-fa-lsp-linear-algo.png" />
<img width=600 src="images/rl-fa-lsp-linear-algo2.png" />
<img width=600 src="images/rl-fa-lsp-linear-algo-convergence.png" />
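
The direct solution can be sketched as follows, assuming each piece of experience is a feature vector $x(s_t)$ paired with a target value (for example a Monte-Carlo return). The weights then solve the normal equations $\left(\sum_t x(s_t) x(s_t)^T\right) w = \sum_t x(s_t) v_t$. The function names and the toy data are hypothetical, and the small Gaussian-elimination solver is only there to keep the example dependency-free.

```python
def solve_linear_system(a, b):
    """Solve a w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: features may be linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    # Back substitution on the upper-triangular system.
    w = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * w[k] for k in range(row + 1, n))
        w[row] = (m[row][n] - tail) / m[row][row]
    return w


def least_squares_weights(feature_vectors, targets):
    """Return w minimising sum_t (target_t - x_t . w)^2 via the normal equations."""
    n = len(feature_vectors[0])
    a = [[0.0] * n for _ in range(n)]  # accumulates sum_t x_t x_t^T
    b = [0.0] * n                      # accumulates sum_t x_t * target_t
    for x, target in zip(feature_vectors, targets):
        for i in range(n):
            b[i] += x[i] * target
            for j in range(n):
                a[i][j] += x[i] * x[j]
    return solve_linear_system(a, b)


# Hypothetical experience: features [1, s] with noiseless targets 2 + 3 s.
xs = [[1.0, float(s)] for s in range(5)]
vs = [2.0 + 3.0 * s for s in range(5)]
weights = least_squares_weights(xs, vs)
print(weights)
```

Solving once costs on the order of $n^3$ operations for $n$ features, so this route is attractive mainly when the feature vector is small.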

### 6.2.2 Least Squares Control
<img width=600 src="images/rl-fa-lsc.png" />
<img width=600 src="images/rl-fa-lsc-action-value-fa.png" />
<img width=600 src="images/rl-fa-lsc-overview.png" />
<img width=600 src="images/rl-fa-lsc-lsq.png" />
<img width=600 src="images/rl-fa-lsc-lspi.png" />
<img width=600 src="images/rl-fa-lsc-control-algo-convergence.png" />


# 7 Policy Gradient Methods
**Policy-Based Reinforcement Learning**
* In the last lecture we approximated the value or action-value function using parameters θ,
$$ V_θ (s) ≈ V^π (s) \\
Q_θ (s, a) ≈ Q^π (s, a)$$

* A policy was generated directly from the value function
    * e.g. using $\epsilon$-greedy
  
* In this lecture we will directly parametrise the policy
$$π_θ (s, a) = \mathbb{P} [a | s, θ]$$
* We will focus again on **model-free** reinforcement learning

**Value-Based and Policy-Based RL**
<img width=600 src="images/rl-pa-types.png" />
* Value Based
    * Learnt Value Function
    * Implicit policy (e.g. $\epsilon$-greedy)
* Policy Based
    * No Value Function
    * Learnt Policy
* Actor-Critic
    * Learnt Value Function
    * Learnt Policy

**Advantages of Policy-Based RL**
* Advantages:
    * Better convergence properties
    * Effective in high-dimensional or continuous action spaces
    * Can learn stochastic policies
* Disadvantages:
    * Typically converges to a local rather than global optimum
    * Evaluating a policy is typically inefficient and high variance
    
## 7.1 Policy Search
**Policy Objective Functions**
* Goal: given policy $π_θ (s, a)$ with parameters θ, find best θ
* But how do we measure the quality of a policy $π_θ$ ?
* In episodic environments we can use the start value
$$J_1 (θ) = V^{π_θ} (s_1 ) = \mathbb{E}_{π_θ} [v_1 ]$$

* In continuing environments we can use the average value
$$J_{avV} (θ) = \sum_s d^{π_θ} (s) V^{π_θ} (s)$$

* Or the average reward per time-step
$$ J_{avR} (θ) = \sum_s d^{π_θ} (s) \sum_a \pi_{\theta}(s, a) R^a_s $$
* where $d^{π_θ} (s)$ is the stationary distribution of the Markov chain for $π_θ$
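
To make the average-reward objective concrete, here is a small worked sketch for a tabular MDP. It assumes the chain induced by the policy has a unique stationary distribution that power iteration converges to; the function names and the two-state example are hypothetical.

```python
def stationary_distribution(p_pi, iterations=1000):
    """Approximate d(s) by repeatedly applying d <- d P_pi (power iteration)."""
    n = len(p_pi)
    d = [1.0 / n] * n
    for _ in range(iterations):
        d = [sum(d[s] * p_pi[s][s2] for s in range(n)) for s2 in range(n)]
    return d


def average_reward(policy, transitions, rewards):
    """J_avR = sum_s d(s) sum_a pi(s, a) R[s][a].

    policy[s][a]          probability of action a in state s
    transitions[s][a][s2] probability of moving from s to s2 under a
    rewards[s][a]         expected immediate reward for a in s
    """
    n_states = len(policy)
    n_actions = len(policy[0])
    # State-to-state transition matrix of the Markov chain induced by the policy.
    p_pi = [
        [
            sum(policy[s][a] * transitions[s][a][s2] for a in range(n_actions))
            for s2 in range(n_states)
        ]
        for s in range(n_states)
    ]
    d = stationary_distribution(p_pi)
    return sum(
        d[s] * sum(policy[s][a] * rewards[s][a] for a in range(n_actions))
        for s in range(n_states)
    )


# Hypothetical MDP: action 0 stays in place, action 1 switches state.
# Only staying in state 1 is rewarded.
transitions = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
rewards = [[0.0, 0.0], [1.0, 0.0]]

uniform = [[0.5, 0.5], [0.5, 0.5]]
go_to_one_and_stay = [[0.0, 1.0], [1.0, 0.0]]
print(average_reward(uniform, transitions, rewards))
print(average_reward(go_to_one_and_stay, transitions, rewards))
```

The second policy scores higher because it spends all of its long-run time in the rewarded state, which is exactly what the weighting by $d^{π_θ}(s)$ captures.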

**Policy Optimisation**
* Policy based reinforcement learning is an optimisation problem
* Find θ that maximises J(θ)
* Some approaches do not use the gradient
    * Hill climbing
    * Simplex / amoeba / Nelder-Mead
    * Genetic algorithms
* Greater efficiency is often possible using the gradient
    * Gradient descent
    * Conjugate gradient
    * Quasi-Newton
* We focus on gradient descent; many extensions are possible
* And on methods that exploit sequential structure

## 7.2 Finite Difference Policy Gradient
**Policy Gradient**
<img width=600 src="images/rl-pa-policy-gradient.png" />

**Computing Gradients By Finite Differences**

* To evaluate policy gradient of $π_θ (s, a)$
* For each dimension k ∈ [1, n]
    * Estimate kth partial derivative of objective function w.r.t. θ
    * By perturbing θ by small amount in kth dimension
$$ \frac {\partial J(θ)} {\partial \theta_k} \approx \frac {J(\theta + \epsilon u_k) - J(\theta)} {\epsilon} $$
    * where $u_k$ is the unit vector with 1 in kth component, 0 elsewhere

* Uses n evaluations to compute policy gradient in n dimensions
* **Simple, noisy, inefficient - but sometimes effective**
* **Works for arbitrary policies, even if policy is not differentiable**
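
The procedure above can be sketched in a few lines. Here `objective` is a hypothetical stand-in for $J(θ)$; in practice it would be a noisy estimate obtained by running the policy, whereas this example uses a deterministic quadratic so the result is easy to check.

```python
def finite_difference_gradient(objective, theta, epsilon=1e-4):
    """Estimate the gradient of objective at theta, one dimension at a time."""
    base = objective(theta)  # J(theta), shared by all n perturbed evaluations
    grad = []
    for k in range(len(theta)):
        perturbed = list(theta)
        perturbed[k] += epsilon  # theta + epsilon * u_k
        grad.append((objective(perturbed) - base) / epsilon)
    return grad


def toy_objective(theta):
    """Hypothetical J(theta) with its maximum at theta = (1, -2)."""
    return -((theta[0] - 1.0) ** 2) - (theta[1] + 2.0) ** 2


# The exact gradient at the origin is (2, -4).
gradient = finite_difference_gradient(toy_objective, [0.0, 0.0])
print(gradient)
```

Nothing here requires `objective` to be differentiable in its internals, which is why the approach also applies to policies without an analytic gradient.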

## 7.3 Monte-Carlo Policy Gradient
### 7.3.1 Score Function
<img width=600 src="images/rl-pa-score-function.png" />
<img width=600 src="images/rl-pa-score-function-softmax.png" />
<img width=600 src="images/rl-pa-score-function-gaussian.png" />
### 7.3.2 Policy Gradient Theorem
<img width=600 src="images/rl-pa-1step-mdp.png" />
<img width=600 src="images/rl-pa-policy-gradient-theorem.png" />
<img width=600 src="images/rl-pa-mc-policy-gradient.png" />
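
As a hedged sketch of a Monte-Carlo policy gradient update, the code below assumes a softmax policy with one preference parameter per action (equivalent to one-hot action features) and a one-step task, so the sampled return is just the immediate reward. For this policy the score function is the one-hot vector of the chosen action minus the action probabilities. The task, step size, and function names are illustrative assumptions.

```python
import math
import random


def softmax_policy(theta):
    """Action probabilities proportional to exp(theta[a])."""
    m = max(theta)  # subtract the maximum for numerical stability
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def score(theta, action):
    """Gradient of log pi(action) w.r.t. theta for the softmax policy above."""
    probs = softmax_policy(theta)
    return [(1.0 if a == action else 0.0) - p for a, p in enumerate(probs)]


def reinforce_one_step(rewards, steps=2000, alpha=0.1, seed=0):
    """Stochastic gradient ascent on expected reward using sampled returns."""
    rng = random.Random(seed)
    theta = [0.0] * len(rewards)
    for _ in range(steps):
        probs = softmax_policy(theta)
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        sampled_return = rewards[action]  # one-step episode: return = reward
        grad_log_pi = score(theta, action)
        theta = [
            t + alpha * sampled_return * g for t, g in zip(theta, grad_log_pi)
        ]
    return theta


# Hypothetical two-action task where only action 0 is rewarded.
theta = reinforce_one_step([1.0, 0.0])
print(softmax_policy(theta))
```

With multi-step episodes the same update would be applied at every time step, using the return observed from that step onward in place of the immediate reward.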


## 7.4 Actor-Critic Policy Gradient



# 8 Integrating Learning and Planning
# 9 Exploration and Exploitation
# 10 Case study - RL in games
