# Learn RL Notes Part 3

# 6 Value Function Approximation

## Large-Scale Reinforcement Learning
Reinforcement learning can be used to solve large problems, e.g.
* Backgammon: $10^{20}$ states 
* Computer Go: $10^{170}$ states 
* Helicopter: continuous state space

How can we **scale up** the model-free methods for prediction and control from the last two lectures?

So far we have represented the value function by a **lookup table**
* Every state s has an entry V(s)
* Or every state-action pair s,a has an entry Q(s,a)

Problem with large MDPs:
* There are too many states and/or actions to store in memory 
* It is too slow to learn the value of each state individually

## Value Function Approximation
Solution for large MDPs:
* Estimate value function with function approximation
$$ \hat{v} ( s , w ) \approx v_π ( s ) $$
or
$$ \hat{q} ( s , a , w ) \approx q_π ( s , a ) $$
* Generalise from seen states to unseen states 
* Update parameter w using MC or TD learning
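
As a minimal sketch of the last bullet, assume a linear approximator $\hat{v}(s, w) = x(s)^T w$ and a TD(0) target. The names `v_hat` and `td0_update`, the step size and the one-hot features are illustrative assumptions, not part of the lecture.

```python
def v_hat(x, w):
    """Linear value estimate: dot product of the features x(s) and the weights w."""
    return sum(xi * wi for xi, wi in zip(x, w))


def td0_update(w, x, reward, x_next, alpha=0.1, gamma=0.9, terminal=False):
    """One semi-gradient TD(0) step: w <- w + alpha * delta * x(s)."""
    target = reward if terminal else reward + gamma * v_hat(x_next, w)
    delta = target - v_hat(x, w)  # TD error
    return [wi + alpha * delta * xi for wi, xi in zip(w, x)]


# Two states with one-hot features, which reduces to the lookup-table case.
w = [0.0, 0.0]
w = td0_update(w, x=[1.0, 0.0], reward=1.0, x_next=[0.0, 1.0])
```

With one-hot features only the weight of the visited state moves, which is the table-lookup update. With shared features the same step also changes the estimates of unseen states.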

<img width=600 src="images/rl-fa-types.png" />
<img width=600 src="images/rl-fa-choose.png" />

## 6.1 Incremental Methods

### 6.1.1 Gradient Descent
<img width=600 src="images/rl-fa-gd.png" />
<img width=600 src="images/rl-fa-sgd.png" />

### 6.1.2 Linear Function Approximation
<img width=600 src="images/rl-fa-feature.png" />

<img width=600 src="images/rl-fa-linear.png" />

<img width=600 src="images/rl-fa-table-lookup-features.png" />


### 6.1.3 Incremental Prediction Algorithms
<img width=600 src="images/rl-fa-incremental-prediction-algo.png" />

<img width=600 src="images/rl-fa-mc.png" />

<img width=600 src="images/rl-fa-td0.png" />

<img width=600 src="images/rl-fa-td-lambda2.png" />


### 6.1.4 Incremental Control Algorithms
<img width=600 src="images/rl-fa-control.png" />

<img width=600 src="images/rl-fa-action-value-fa.png" />

<img width=600 src="images/rl-fa-linear-action-value-fa.png" />

<img width=600 src="images/rl-fa-incremental-control-algo.png" />


### 6.1.5 Convergence
<img width=600 src="images/rl-fa-convergence-prediction.png" />
<img width=600 src="images/rl-fa-convergence-gradient-td.png" />
<img width=600 src="images/rl-fa-convergence-control.png" />

## 6.2 Batch Methods
### Batch Reinforcement Learning
* Gradient descent is simple and appealing
* But it is not sample efficient
* Batch methods seek to find the best fitting value function 
* Given the agent’s experience (“training data”)

### 6.2.1 Least Squares Prediction
<img width=600 src="images/rl-fa-lsp.png" />

** Stochastic Gradient Descent with Experience Replay **
<img width=600 src="images/rl-fa-sgd-with-experience-replay.png" />
<img width=600 src="images/rl-fa-dqn.png" />

** Linear Least Squares Prediction **
* Experience replay finds least squares solution
* But it may take many iterations
* Using linear value function approximation $\hat{v} (s, w) = x(s)^T w$
* **We can solve the least squares solution directly**
<img width=600 src="images/rl-fa-lsp-linear.png" />
<img width=600 src="images/rl-fa-lsp-linear-algo.png" />
<img width=600 src="images/rl-fa-lsp-linear-algo2.png" />
<img width=600 src="images/rl-fa-lsp-linear-algo-convergence.png" />

### 6.2.2 Least Squares Control
<img width=600 src="images/rl-fa-lsc.png" />
<img width=600 src="images/rl-fa-lsc-action-value-fa.png" />
<img width=600 src="images/rl-fa-lsc-overview.png" />
<img width=600 src="images/rl-fa-lsc-lsq.png" />
<img width=600 src="images/rl-fa-lsc-lspi.png" />
<img width=600 src="images/rl-fa-lsc-control-algo-convergence.png" />


# 7 Policy Gradient Methods
** Policy-Based Reinforcement Learning **
* In the last lecture we approximated the value or action-value function using parameters θ,
$$ V_θ (s) ≈ V^π (s) \\
Q_θ (s, a) ≈ Q^π (s, a)$$

* A policy was generated directly from the value function
    * e.g. using $\epsilon$-greedy
  
* In this lecture we will directly parametrise the policy
$$π_θ (s, a) = \mathbb{P} [a | s, θ]$$
* We will focus again on **model-free** reinforcement learning

** Value-Based and Policy-Based RL **
<img width=600 src="images/rl-pa-types.png" />

* Value Based
    * Learnt Value Function
    * Implicit policy (e.g. $\epsilon$-greedy)
* Policy Based
    * No Value Function
    * Learnt Policy
* Actor-Critic
    * Learnt Value Function
    * Learnt Policy

** Advantages of Policy-Based RL **
* Advantages:
    * Better convergence properties
    * Effective in high-dimensional or continuous action spaces
    * Can learn stochastic policies
* Disadvantages:
    * Typically converge to a local rather than global optimum
    * Evaluating a policy is typically inefficient and high variance
    
## 7.1 Policy Search
** Policy Objective Functions **
* Goal: given policy $π_θ (s, a)$ with parameters θ, find best θ
* But how do we measure the quality of a policy $π_θ$ ?
* In episodic environments we can use the start value
$$J_1 (θ) = V^{π_θ} (s_1 ) = \mathbb{E}_{π_θ} [v_1 ]$$

* In continuing environments we can use the average value
$$J_{avV} (θ) = \sum_s d^{π_θ} (s) V^{π_θ} (s)$$

* Or the average reward per time-step
$$ J_{avR} (θ) = \sum_s d^{π_θ} (s) \sum_a \pi_{\theta}(s, a) R^a_s $$
* where $d^{π_θ} (s)$ is the stationary distribution of the Markov chain for $π_θ$

** Policy Optimisation **
* Policy based reinforcement learning is an optimisation problem
* Find θ that maximises J(θ)
* Some approaches do not use gradient
    * Hill climbing
    * Simplex / amoeba / Nelder-Mead
    * Genetic algorithms
* Greater efficiency often possible using gradient
    * Gradient descent
    * Conjugate gradient
    * Quasi-Newton
* We focus on gradient descent, many extensions possible
* And on methods that exploit sequential structure

## 7.2 Finite Difference Policy Gradient
** Policy Gradient **
<img width=600 src="images/rl-pa-policy-gradient.png" />

** Computing Gradients By Finite Differences **

* To evaluate policy gradient of $π_θ (s, a)$
* For each dimension k ∈ [1, n]
    * Estimate kth partial derivative of objective function w.r.t. θ
    * By perturbing θ by small amount in kth dimension
$$ \frac {\partial J(θ)} {\partial \theta_k} \approx \frac {J(\theta + \epsilon u_k) - J(\theta)} {\epsilon} $$
    * where $u_k$ is the unit vector with 1 in the kth component, 0 elsewhere
    
* Uses n evaluations to compute policy gradient in n dimensions
* **Simple, noisy, inefficient - but sometimes effective**
* **Works for arbitrary policies, even if policy is not differentiable**
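
The procedure above can be sketched as follows. This is an illustration under assumptions: `J` stands for any routine that evaluates a policy with parameters θ (in practice a noisy estimate from rollouts), and `toy_objective` is a hypothetical smooth stand-in for it.

```python
def finite_difference_gradient(J, theta, eps=1e-4):
    """Estimate each partial derivative of J by perturbing one dimension by eps."""
    base = J(theta)
    grad = []
    for k in range(len(theta)):
        perturbed = list(theta)
        perturbed[k] += eps  # theta + eps * u_k
        grad.append((J(perturbed) - base) / eps)
    return grad


# Hypothetical objective standing in for a policy evaluation routine.
def toy_objective(theta):
    return -(theta[0] - 1.0) ** 2 - (theta[1] + 2.0) ** 2


grad = finite_difference_gradient(toy_objective, [0.0, 0.0])
```

The sketch makes one baseline evaluation plus one perturbed evaluation per dimension. When `J` is a noisy rollout estimate, each evaluation usually needs many episodes, which is where the inefficiency comes from.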

## 7.3 Monte-Carlo Policy Gradient (REINFORCE)
### 7.3.1 Score Function
<img width=600 src="images/rl-pa-score-function.png" />
<img width=600 src="images/rl-pa-score-function-softmax.png" />
<img width=600 src="images/rl-pa-score-function-gaussian.png" />

### 7.3.2 Policy Gradient Theorem
<img width=600 src="images/rl-pa-1step-mdp.png" />
<img width=600 src="images/rl-pa-policy-gradient-theorem.png" />

### Monte-Carlo Policy Gradient (REINFORCE)
<img width=600 src="images/rl-pa-mc-policy-gradient.png" />
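
A minimal REINFORCE sketch on a one-step problem, assuming a softmax policy with one preference parameter per action. In that case the score function is the indicator of the chosen action minus the action probabilities. The function names, the step size and the two-armed reward table are illustrative assumptions.

```python
import math
import random


def softmax(prefs):
    """Numerically stable softmax over action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def reinforce_bandit(rewards, steps=200, alpha=0.1, seed=0):
    """REINFORCE update: theta += alpha * grad log pi(a) * G."""
    rng = random.Random(seed)
    theta = [0.0] * len(rewards)
    for _ in range(steps):
        probs = softmax(theta)
        a = rng.choices(range(len(theta)), weights=probs)[0]
        G = rewards[a]  # return of this one-step episode
        for b in range(len(theta)):
            score = (1.0 if b == a else 0.0) - probs[b]  # d log pi(a) / d theta_b
            theta[b] += alpha * score * G
    return softmax(theta)


probs = reinforce_bandit([1.0, 0.0])
```

With these deterministic rewards the policy shifts probability towards the first action. With noisy returns the same update has high variance, which is what motivates the critic in the next section.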


## 7.4 Actor-Critic Policy Gradient

### 7.4.1 Q Actor-Critic

** Reducing Variance Using a Critic **
<img width=600 src="images/rl-pa-actor-critic.png" />

** Estimating the Action-Value Function **
* The critic is solving a familiar problem: policy evaluation
* How good is policy $π_θ$ for current parameters θ?
* This problem was explored in the previous two lectures, e.g.
    * Monte-Carlo policy evaluation
    * Temporal-Difference learning
    * TD(λ)
* Could also use e.g. least-squares policy evaluation

** Q Actor-Critic **
<img width=600 src="images/rl-pa-qac.png" />

### 7.4.2 Advantage Actor-Critic

** Bias in Actor-Critic Algorithms **

* Approximating the policy gradient introduces bias
* A biased policy gradient may not find the right solution
    * e.g. if $Q_w (s, a)$ uses aliased features, can we solve gridworld example?
* Luckily, if we choose value function approximation carefully
* Then we can avoid introducing any bias
* i.e. We can still follow the exact policy gradient

** Compatible Function Approximation Theorem **

If the following two conditions are satisfied:
1. The value function approximator is **compatible** with the policy
$$ \nabla_w Q_w (s, a) = \nabla_\theta \log π_{\theta} (s, a)$$
2. The value function parameters w minimise the mean-squared error
$$ ε = \mathbb{E}_{π_θ} \left[ (Q^{π_θ} (s, a) − Q_w (s, a))^2 \right] $$
Then the policy gradient is exact,
$$ \nabla_θ J(θ) = \mathbb{E}_{π_θ} [\nabla_θ \log π_θ (s, a) Q_w (s, a)] $$
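
A short sketch of why the two conditions suffice, following the standard argument in the notation above. If w minimises ε, the gradient of ε with respect to w is zero, and condition 1 lets the score function replace $\nabla_w Q_w$:

$$ \nabla_w ε = 0 \\
\mathbb{E}_{π_θ} [(Q^{π_θ} (s, a) − Q_w (s, a)) \nabla_w Q_w (s, a)] = 0 \\
\mathbb{E}_{π_θ} [(Q^{π_θ} (s, a) − Q_w (s, a)) \nabla_θ \log π_θ (s, a)] = 0 \\
\mathbb{E}_{π_θ} [Q^{π_θ} (s, a) \nabla_θ \log π_θ (s, a)] = \mathbb{E}_{π_θ} [Q_w (s, a) \nabla_θ \log π_θ (s, a)] $$

By the policy gradient theorem the left-hand side of the last line is $\nabla_θ J(θ)$, so substituting $Q_w$ for the true action-value function leaves the gradient unchanged.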

<img width=600 src="images/rl-pa-aac-reduce-variance.png" />
<img width=600 src="images/rl-pa-aac.png" />
<img width=600 src="images/rl-pa-aac2.png" />

### 7.4.3 TD Actor-Critic
<img width=600 src="images/rl-pa-td-ac.png" />
<img width=600 src="images/rl-pa-td-ac2.png" />
<img width=600 src="images/rl-pa-td-ac3.png" />

### 7.4.4 Natural Actor-Critic
<img width=600 src="images/rl-pa-alternative-direction.png" />
<img width=600 src="images/rl-pa-natural-ac.png" />
<img width=600 src="images/rl-pa-natural-ac2.png" />

** Summary of Policy Gradient Algorithms **
<img width=600 src="images/rl-pa-algorithms.png" />


# 8 Integrating Learning and Planning

## 8.1 Introduction
** Model-Based Reinforcement Learning **
* Last lecture: learn policy directly from experience
* Previous lectures: learn value function directly from experience
* This lecture: learn model directly from experience
* and use planning to construct a value function or policy
* Integrate learning and planning into a single architecture

* Model-Free RL
    * No model
    * Learn value function (and/or policy) from experience
* Model-Based RL
    * Learn a model from experience
    * Plan value function (and/or policy) from model

* Advantages:
    * Can efficiently learn model by supervised learning methods
    * Can reason about model uncertainty
* Disadvantages:
    * First learn a model, then construct a value function, which gives two sources of approximation error

<img width=600 src="images/rl-model-based.png" />

## 8.2 Model-Based Reinforcement Learning

** What is a Model? **
* A model $\mathcal{M}$ is a representation of an MDP $<\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}>$ , parametrized by η
* We will assume state space $\mathcal{S}$ and action space $\mathcal{A}$ are known
* So a model $M = <\mathcal{P}_η , \mathcal{R}_η>$ represents state transitions $\mathcal{P}_η ≈ \mathcal{P}$ and rewards $\mathcal{R}_η ≈ \mathcal{R}$
$$ S_{t+1} ∼ P_η (S_{t+1} | S_t , A_t ) \\
R_{t+1} = R_η (R_{t+1} | S_t , A_t ) $$
* Typically assume conditional independence between state transitions and rewards
$$ P [S_{t+1} , R_{t+1} | S_t , A_t ] = P [S_{t+1} | S_t , A_t ] P [R_{t+1} | S_t , A_t ]$$

** Model Learning **
* Goal: estimate model $M_η$ from experience $\{S_1 , A_1 , R_2 , ..., S_T \}$
* This is a supervised learning problem
$$
S_1 , A_1 → R_2 , S_2 \\
S_2 , A_2 → R_3 , S_3 \\
\vdots \\
S_{T-1} , A_{T-1} → R_T , S_T
$$
* Learning s, a → r is a **regression problem**
* Learning s, a → s' is a **density estimation problem**
* Pick loss function, e.g. **mean-squared error, KL divergence**, ...
* Find parameters η that minimise **empirical loss**

** Examples of Models **
* Table Lookup Model
* Linear Expectation Model
* Linear Gaussian Model
* Gaussian Process Model
* Deep Belief Network Model
* ...

** Table Lookup Model **
* Model is an explicit MDP, $\hat{P},\hat{R}$
* Count visits N(s, a) to each state action pair
$$ \hat{P}^a_{s,s'} = \frac{1}{N(s,a)}\sum_{t=1}^{T}\mathbf{1}(S_t, A_t, S_{t+1} = s,a,s') \\
\hat{R}^a_s = \frac{1}{N(s,a)}\sum_{t=1}^{T}\mathbf{1}(S_t, A_t = s, a)R_{t+1}
$$
* Alternatively
    * At each time-step t, record experience tuple $<S_t , A_t , R_{t+1} , S_{t+1}>$
    * To sample model, randomly pick tuple matching $<s, a, ·, ·>$

### Planning with a Model 
* Given a model $M_η = <P_η , R_η>$
* Solve the MDP $<S, A, P_η , R_η>$
* Using your favourite planning algorithm
    * Value iteration
    * Policy iteration
    * Tree search
    * ...
    
** Sample-Based Planning **
* A simple but powerful approach to planning
* Use the model only to generate samples
* Sample experience from model
$$ S_{t+1} ∼ P_η (S_{t+1} | S_t , A_t ) \\
R_{t+1} = R_η (R_{t+1} | S_t , A_t ) $$
* Apply model-free RL to samples, e.g.:
    * Monte-Carlo control
    * Sarsa
    * Q-learning
* Sample-based planning methods are often more efficient

** Planning with an Inaccurate Model **
* Given an imperfect model $<P_η , R_η> \neq <P, R>$
* Performance of model-based RL is limited to optimal policy for approximate MDP $<S, A, P_η , R_η>$
* i.e. Model-based RL is only as good as the estimated model
* When the model is inaccurate, planning process will compute a suboptimal policy
* Solution 1: when model is wrong, use model-free RL
* Solution 2: reason explicitly about model uncertainty

## 8.3 Integrated Architectures

### 8.3.1 Dyna
** Real and Simulated Experience **
* We consider two sources of experience
* **Real experience** Sampled from environment (true MDP)
$$ S' ∼ P^a_{s,s'} \\
R = R^a_s $$
* **Simulated experience** Sampled from model (approximate MDP)
$$ S' ∼ P_η (S' | S, A) \\
R = R_η (R | S, A) $$

** Integrating Learning and Planning **
* Model-Free RL
    * No model
    * Learn value function (and/or policy) from real experience
    
* Model-Based RL (using Sample-Based Planning)
    * Learn a model from **real experience**
    * Plan value function (and/or policy) from **simulated experience**
* Dyna
    * Learn a model from real experience
    * **Learn and plan value function (and/or policy) from real and simulated experience**
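
A tabular sketch in the style of Dyna-Q shows how the three ingredients fit together: Q-learning on each real step, a model update, and extra Q-learning updates on transitions replayed from the model. It is written under assumptions: the model stores only the last observed outcome per (s, a), which suits a deterministic environment, and `chain_step` is a hypothetical three-state chain used for illustration.

```python
import random
from collections import defaultdict


def dyna_q(env_step, start, actions, episodes=50, n_planning=10,
           alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Q-learning on real steps plus n_planning replayed model steps per real step."""
    rng = random.Random(seed)
    Q = defaultdict(float)
    model = {}  # (s, a) -> (r, s_next, done), last observed outcome

    def q_update(s, a, r, s_next, done):
        best_next = 0.0 if done else max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

    for _ in range(episodes):
        s, done = start, False
        while not done:
            if rng.random() < epsilon:
                a = rng.choice(actions)
            else:
                a = max(actions, key=lambda b: Q[(s, b)])
            r, s_next, done = env_step(s, a)   # real experience
            q_update(s, a, r, s_next, done)    # direct RL
            model[(s, a)] = (r, s_next, done)  # model learning
            for _ in range(n_planning):        # planning on simulated experience
                ps, pa = rng.choice(list(model))
                q_update(ps, pa, *model[(ps, pa)])
            s = s_next
    return Q


# Hypothetical chain 0 - 1 - 2: reaching state 2 gives reward 1 and ends the episode.
def chain_step(s, a):
    s_next = min(s + 1, 2) if a == "right" else max(s - 1, 0)
    done = s_next == 2
    return (1.0 if done else 0.0), s_next, done


Q = dyna_q(chain_step, start=0, actions=["left", "right"])
```

Setting `n_planning=0` recovers plain Q-learning, so the parameter controls how much computation is spent on simulated experience per real step.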

<img width=600 src="images/rl-model-dyna.png" />
<img width=600 src="images/rl-model-dyna-algo.png" />

## 8.4 Simulation-Based Search
### 8.4.1 Simulation-Based Search
<img width=600 src="images/rl-model-forward-search.png" />
<img width=600 src="images/rl-model-simulation-based-search.png" />

* Simulate episodes of experience from now with the model
$$ \left\{ s_t^k , A_t^k , R_{t+1}^k, ..., S_T^k \right\}_{k=1}^K \sim M_ν $$
* Apply model-free RL to simulated episodes
    * Monte-Carlo control → Monte-Carlo search
    * Sarsa → TD search
    
### 8.4.2 Monte-Carlo Tree Search
** Simple Monte-Carlo Search **
* Given a model $M_ν$ and a **simulation policy** π
* For each action a ∈ A
    * Simulate K episodes from current (real) state $s_t$
$$\left\{s_t , a, R_{t+1}^k, S_{t+1}^k, A^k_{t+1} , ..., S_T^k \right\}^K_{k=1} \sim M_ν , π $$
    * Evaluate actions by mean return (Monte-Carlo evaluation)
$$ Q(s_t , a) = \frac{1}{K} \sum^K_{k=1}G_t \overset{P}{\to} q_π (s_t , a) $$
* Select current (real) action with maximum value
$$a_t = \underset{a \in A}{argmax} Q(s_t , a)$$
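
The three steps above can be sketched as follows, assuming the model is exposed as a sampling function `model_step(s, a, rng)` that returns `(reward, next_state, done)`. That interface, the function names and the one-step `bandit_model` are assumptions made for illustration.

```python
import random


def simple_mc_search(model_step, state, actions, rollout_policy,
                     K=100, gamma=1.0, seed=0):
    """Return the action whose K simulated episodes have the highest mean return."""
    rng = random.Random(seed)
    q = {}
    for a in actions:
        total = 0.0
        for _ in range(K):
            s, act, ret, discount, done = state, a, 0.0, 1.0, False
            while not done:
                r, s, done = model_step(s, act, rng)  # sample from the model
                ret += discount * r
                discount *= gamma
                if not done:
                    act = rollout_policy(s, rng)      # simulation policy pi
            total += ret
        q[a] = total / K  # Monte-Carlo estimate of Q(s_t, a)
    return max(q, key=q.get), q


# Hypothetical one-step model: "safe" always pays 0.5, "risky" pays 1 with probability 0.3.
def bandit_model(s, a, rng):
    if a == "safe":
        return 0.5, s, True
    return (1.0 if rng.random() < 0.3 else 0.0), s, True


best, q = simple_mc_search(
    bandit_model, "s0", ["safe", "risky"],
    rollout_policy=lambda s, rng: rng.choice(["safe", "risky"]),
)
```

Only the first action is fixed. Every later action comes from the simulation policy, so the estimates are of $q_π$ for that fixed policy rather than of the optimal values.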

** Monte-Carlo Tree Search (Evaluation) **
* Given a model $M_ν$
* Simulate K episodes from current state $s_t$ using current simulation policy π
$$\left\{s_t , A^k_t , R_{t+1}^k, S_{t+1}^k, ..., S _T^k \right\}^K_{k=1} \sim M_ν , π$$
* Build a search tree containing visited states and actions
* Evaluate states Q(s, a) by mean return of episodes from s, a
$$ Q(s, a) =\frac{1}{N(s, a)}\sum^K_{k=1}\sum^T_{u=t}\mathbf{1}(S_u, A_u = s, a)G_u \overset{P}{\to} q_{\pi}(s,a)$$
* After search is finished, select current (real) action with maximum value in search tree
$$a_t = \underset{a \in A}{argmax} Q(s_t , a)$$

** Monte-Carlo Tree Search (Simulation) **
* In MCTS, the simulation policy π improves
* Each simulation consists of two phases (in-tree, out-of-tree)
    * Tree policy (improves): pick actions to maximise Q(S, A)
    * Default policy (fixed): pick actions randomly
* Repeat (each simulation)
    * Evaluate states Q(S, A) by Monte-Carlo evaluation
    * Improve tree policy, e.g. by $\epsilon$-greedy(Q)
* Monte-Carlo control applied to simulated experience
* Converges on the optimal search tree, $Q(S, A) → q_∗ (S, A)$

** Advantages of MC Tree Search **
* Highly selective best-first search
* Evaluates states dynamically (unlike e.g. DP)
* Uses sampling to break curse of dimensionality
* Works for “black-box” models (only requires samples)
* Computationally efficient, anytime, parallelisable

### 8.4.3 MCTS in Go
### 8.4.4 Temporal-Difference Search
** Temporal-Difference Search **
* Simulation-based search
* Using TD instead of MC (bootstrapping)
* MC tree search applies MC control to sub-MDP from now
* TD search applies Sarsa to sub-MDP from now

** MC vs. TD search **
* For model-free reinforcement learning, bootstrapping is helpful
    * TD learning reduces variance but increases bias
    * TD learning is usually more efficient than MC
    * TD(λ) can be much more efficient than MC

* For simulation-based search, bootstrapping is also helpful
    * TD search reduces variance but increases bias
    * TD search is usually more efficient than MC search
    * TD(λ) search can be much more efficient than MC search

** TD Search **
* Simulate episodes from the current (real) state $s_t$
* Estimate action-value function Q(s, a)
* For each step of simulation, update action-values by Sarsa
$$ ∆Q(S, A) = α(R + γQ(S' , A' ) − Q(S, A))$$
* Select actions based on action-values Q(s, a) e.g. $\epsilon$-greedy
* May also use function approximation for Q
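
A tabular sketch of TD search, assuming the same sampling interface `model_step(s, a, rng)` returning `(reward, next_state, done)`. The function names and the two-step `toy_model` are hypothetical, and the sketch uses a lookup table rather than function approximation.

```python
import random
from collections import defaultdict


def td_search(model_step, root, actions, n_episodes=200,
              alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    """Run Sarsa on episodes simulated from the current (real) state."""
    rng = random.Random(seed)
    Q = defaultdict(float)

    def choose(s):
        # epsilon-greedy with respect to the current action-values
        if rng.random() < epsilon:
            return rng.choice(actions)
        return max(actions, key=lambda b: Q[(s, b)])

    for _ in range(n_episodes):
        s, a, done = root, choose(root), False
        while not done:
            r, s_next, done = model_step(s, a, rng)
            a_next = None if done else choose(s_next)
            target = r if done else r + gamma * Q[(s_next, a_next)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])  # Sarsa update
            s, a = s_next, a_next
    return max(actions, key=lambda b: Q[(root, b)]), Q


# Hypothetical model: "b" ends at once with reward 0.2, "a" leads to a state worth 1.
def toy_model(s, a, rng):
    if s == "root" and a == "b":
        return 0.2, "end", True
    if s == "root":
        return 0.0, "mid", False
    return 1.0, "end", True


best, Q = td_search(toy_model, "root", ["a", "b"])
```

Because each update bootstraps from the next action-value, the value of the intermediate state is reused across simulated episodes instead of being re-estimated from complete returns.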

** Dyna-2 **
* In Dyna-2, the agent stores two sets of feature weights
    * Long-term memory
    * Short-term (working) memory
* Long-term memory is updated from real experience using TD learning
    * General **domain knowledge** that applies to any episode
* Short-term memory is updated from simulated experience using TD search
    * Specific **local knowledge** about the current situation
* Overall value function is the sum of long-term and short-term memories