# Deep Reinforcement Learning

[David Silver](http://www0.cs.ucl.ac.uk/staff/d.silver/web/Home.html) - Google DeepMind

06/19/2016

## Weight sharing

__Over time__
- Recurrent Neural Networks
  - Apply the same weights from *t<sub>0</sub>* to *t<sub>1</sub>*, ad nauseam

__Over space__
- Convolutional Neural Network
  - Shares weights between local regions

## Reinforcement learning

*The science of optimal decision making*

- Covers many fields:
  - Optimization
  - Game theory
  - Dynamic programming
  
__Agent (brain) and Environment (world)__
- At each step the agent:
  - executes some action
  - receives an observation
  - receives a scalar reward
- The experience is a sequence of observations, actions, rewards.
- The state is a summary of experiences

__Major components__
1. Policy
  - The agent's behaviour
  - A map from state to action
2. Value function
  - A prediction of a future reward
    - *"How much reward will I get from action A in state S"*
    - Q-value function is total expected reward
      - From state *s* and action *a*
      - Under policy π
      - With discount factor γ
    - Value functions decompose into Bellman equations
    - An optimal function is the maximum achievable value
      - Q<sup>\*</sup>(s,a) = max<sub>π</sub> Q<sup>π</sup>(s,a) = Q<sup>π\*</sup>(s,a)
    - Once we have Q<sup>\*</sup> we can act optimally:
      - π<sup>\*</sup>(s) = argmax<sub>a</sub> Q<sup>\*</sup>(s,a)
3. Model
    - The agent's proxy for the environment
    - Learnt from experience
    - Planner interacts with the model using lookahead search
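
As a rough illustration of how the Bellman decomposition leads to Q<sup>\*</sup> and then to a greedy policy, here is a minimal Python sketch. The four-state corridor, its rewards, and the names `step`, `q_value_iteration`, and `greedy_policy` are assumptions made up for this example; they are not from the talk.

```python
# Hypothetical toy MDP: states 0..3 on a line, state 3 is terminal.
GAMMA = 0.9
STATES = [0, 1, 2, 3]
ACTIONS = [-1, +1]  # move left / move right
TERMINAL = 3


def step(s, a):
    """Deterministic transition: reaching the terminal state pays reward 1."""
    s2 = min(max(s + a, 0), TERMINAL)
    return s2, (1.0 if s2 == TERMINAL else 0.0)


def q_value_iteration(sweeps=50):
    """Repeatedly apply Q(s,a) <- r + gamma * max_b Q(s',b)."""
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(sweeps):
        for s in STATES:
            if s == TERMINAL:
                continue
            for a in ACTIONS:
                s2, r = step(s, a)
                future = 0.0 if s2 == TERMINAL else max(q[(s2, b)] for b in ACTIONS)
                q[(s, a)] = r + GAMMA * future
    return q


def greedy_policy(q):
    """pi*(s) = argmax_a Q*(s,a) for every non-terminal state."""
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in STATES if s != TERMINAL}


q_star = q_value_iteration()
pi_star = greedy_policy(q_star)  # {0: 1, 1: 1, 2: 1}: always move right
```

In this toy problem the values shrink by a factor of γ per step away from the goal (1.0, 0.9, 0.81), which is the discounting described above.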
    
__Three approaches__
1. Value-based
  - Estimate the optimal value function
2. Policy-based
  - Don't search for optimal value, but the optimal policy
3. Model-based
  - Build a model and plan to use lookahead search to optimize reward

### Deep reinforcement learning

Use deep neural networks to represent
1. Value function
2. Policy
3. Model

Optimize loss function by SGD

## Value-based Deep RL

__Deep reinforcement learning in Atari__

- End-to-end learning of values *Q(s, a)* from pixels *s*
- Input state *s* is a stack of raw pixels from last four frames

[Deep reinforcement (DQN) examples](http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html)

__Improvements since Nature DQN__

1. Double DQN&mdash;remove upward bias caused by *max Q(s,a,w)*
    - Current Q-network __w__ is used to select actions
    - Older Q-network __w<sup>-</sup>__ is used to evaluate actions
  
2. Prioritized replay&mdash;weight experience according to surprise
    - Store experience in priority queue according to DQN error
  
3. Dueling network&mdash;split Q-network into two channels
    - Action-independent value function *V(s,v)*
    - Action-dependent advantage function *A(s,a,w)*
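
The first and third improvements can be sketched in a few lines of plain Python. This is only an illustration: the function names are hypothetical, the Q-values are passed in as plain lists rather than produced by networks, and subtracting the mean advantage in `dueling_q` is one common way to combine the two channels that the notes above do not specify.

```python
GAMMA = 0.99


def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda i: values[i])


def dqn_target(reward, q_next_old, done):
    """Baseline target: the older network both selects and evaluates (max)."""
    return reward if done else reward + GAMMA * max(q_next_old)


def double_dqn_target(reward, q_next_current, q_next_old, done):
    """Current network selects the action; the older network evaluates it."""
    if done:
        return reward
    a_star = argmax(q_next_current)
    return reward + GAMMA * q_next_old[a_star]


def dueling_q(state_value, advantages):
    """Combine V(s) and A(s,a) into Q(s,a); mean-centering is an assumption."""
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + adv - mean_adv for adv in advantages]


# Made-up Q-values for the next state under the two networks.
q_current = [1.0, 2.0, 0.5]
q_old = [1.5, 1.2, 3.0]
plain = dqn_target(0.0, q_old, done=False)                     # uses max(q_old) = 3.0
double = double_dqn_target(0.0, q_current, q_old, done=False)  # uses q_old[1] = 1.2
```

With these numbers the double-DQN target is smaller than the plain target, because the action is chosen by one network and scored by the other instead of taking the maximum of a single noisy estimate.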
    
__Gorila (General Reinforcement Learning Architecture)__
- Distributed deep reinforcement learning
  - Exploits multithreading of standard CPU
  - *Parallelism actually decorrelates the data(!!!)*

## Policy-based Deep RL

- Represent policy by deep network with weights __u__
  - Stochastic: *a ~ π(a|s, __u__)*, or deterministic: *a = π(s, __u__)*
  
- Define objective function as total discounted reward
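
For concreteness, the total discounted reward of a single episode can be computed as below. This is a minimal sketch; the function name and the default discount are illustrative choices, not values from the talk.

```python
def discounted_return(rewards, gamma=0.99):
    """Return r_1 + gamma * r_2 + gamma^2 * r_3 + ... for one episode."""
    g = 0.0
    # Work backwards so each step is a single multiply-add.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


example = discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25 = 1.75
```

The policy objective is the expectation of this quantity over episodes generated by the policy.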


__Actor-Critic algorithm__

Estimate value functions $$Q(s,a,w) \approx Q^{\pi}(s,a)$$
  - Update policy parameters u by stochastic gradient ascent
  
  
__Asynchronous Advantage Actor-Critic__

1. Estimate state-value function V(s,v)
  - Q-value estimated by an *n*-step sample
  
2. Actor is updated towards target
3. Critic is updated to minimize MSE w.r.t. target
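
The three steps above can be sketched for a single *n*-step rollout as follows. The function names and the scalar inputs are hypothetical; in practice the value estimates and log-probability would come from the networks, and this sketch only computes the loss values, not their gradients.

```python
def n_step_target(rewards, bootstrap_value, gamma=0.99):
    """n-step sample of the Q-value: discounted rewards plus gamma^n * V(s_{t+n})."""
    target = bootstrap_value
    for r in reversed(rewards):
        target = r + gamma * target
    return target


def actor_critic_losses(rewards, value_now, bootstrap_value, log_prob_action, gamma=0.99):
    """Return (target, actor_loss, critic_loss) for one state-action pair."""
    target = n_step_target(rewards, bootstrap_value, gamma)
    advantage = target - value_now              # how much better than expected
    actor_loss = -log_prob_action * advantage   # advantage is treated as a constant
    critic_loss = advantage ** 2                # squared error w.r.t. the target
    return target, actor_loss, critic_loss


target, actor_loss, critic_loss = actor_critic_losses(
    rewards=[1.0, 0.0, 2.0], value_now=1.5, bootstrap_value=4.0,
    log_prob_action=-1.0, gamma=0.5,
)
# target = 1 + 0.5*0 + 0.25*2 + 0.125*4 = 2.0
```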


__A3C in [Labyrinth](https://deepmind.com/blog)__
- Conv. net
- Needs to recollect state (i.e., "what's behind me?")


### Continuous action spaces

How can we deal with high-dimensional continuous action spaces?
  - Can't easily compute $$\max_{a} Q(s,a)$$
    - Actor-critic algorithms learn without max
  - Q-values are differentiable w.r.t. *a*
    - Deterministic policy gradients exploit knowledge of $$\frac{\partial Q}{\partial a}$$
    
__DPG__ is the continuous analogue of DQN
  - Experience replay: build data-set from agent's experience
  - Critic estimates value of current policy by DQN
    - (to deal with non-stationarity, the target copies of u and w are held fixed)
  - Actor updates policy in direction that improves Q
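
A toy sketch of the actor update, under strong simplifying assumptions: the critic is a fixed, hand-written function Q(s,a) = -(a - 2s)<sup>2</sup> rather than a learned network, and the actor is the linear policy a = u·s with a single weight. All names here are hypothetical.

```python
def dq_da(s, a):
    """Gradient of the assumed critic Q(s,a) = -(a - 2s)^2 with respect to a."""
    return -2.0 * (a - 2.0 * s)


def train_actor(u=0.0, lr=0.1, steps=200, states=(0.5, 1.0, -1.0)):
    """Gradient ascent on Q via the chain rule: dQ/du = dQ/da * da/du."""
    for _ in range(steps):
        for s in states:
            a = u * s          # deterministic policy a = pi(s, u)
            da_du = s          # derivative of the linear policy
            u += lr * dq_da(s, a) * da_du
    return u


u_learned = train_actor()  # approaches 2.0, where a = 2s maximizes Q
```

No maximization over actions is needed: the actor simply follows the critic's gradient with respect to the action.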
  
__A3C in simulated physics demo__
- Asynchronous RL is viable alternative to experience replay
- Train a hierarchical, recurrent locomotion controller
- Retrain controller on more challenging tasks


### Fictitious self-play (FSP)

- Can deep RL find Nash equilibria in multi-agent games?
  - Q-network learns "best response" to opponent policies
    - By applying DQN with experience replay
  - Policy network π(a|s,__u__) learns an average of best responses:
  
  $$\frac{\partial l}{\partial u} = \frac{\partial \log\pi(a|s, u)}{\partial u}$$
  
  - Actions *a* are sampled from a mix of the policy network and the best response
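
One way to read the mixing step is sketched below, assuming a mixing probability `eta` (a hypothetical parameter; the notes do not say how the mix is formed). With probability `eta` the agent plays the greedy best response from the Q-values, otherwise it samples from the average policy.

```python
import random


def sample_action(avg_policy_probs, q_values, eta, rng):
    """Mix of best response (greedy on Q) and the average policy."""
    if rng.random() < eta:
        # Best response: greedy action under the Q-network's estimates.
        return max(range(len(q_values)), key=lambda a: q_values[a])
    # Average policy: sample an action from pi(a|s, u).
    return rng.choices(range(len(avg_policy_probs)), weights=avg_policy_probs, k=1)[0]


rng = random.Random(0)
action = sample_action([0.2, 0.5, 0.3], [0.1, 0.4, 0.9], eta=0.1, rng=rng)
```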
  

## Model-based Deep RL

__Caveat:__ errors compound!! Model-based deep RL is weak in the sense that it cannot handle situations it has not seen before.

But, what happens if we have a perfect model?

__Go__:
Game tree complexity = *b<sup>d</sup>* (roughly 200<sup>200</sup>)
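
To get a feel for that number, the arithmetic can be checked directly, using the rough branching factor and depth quoted above:

```python
import math

branching, depth = 200, 200
leaves = branching ** depth                       # exact integer arithmetic
digits = len(str(leaves))                         # 461 decimal digits
log10_leaves = depth * math.log10(branching)      # about 460.2
```

That is a number with 461 digits, far beyond anything that can be enumerated.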

Brute force search intractable:
  1. Search space is huge
  2. "Impossible" for computers to evaluate who is winning
  
Tackled using CNNs
  - Input space is like an image
  - Output space (where to play next stone) is also like an image, or a probability distribution
  - Two different neural networks:
    1. Represent value function
    2. Represent policy
    
Combined deep RL and supervised learning
  - To get off the ground, it learned from human games
  - For the reinforcement learning stage, it played against itself thousands of times, which generates a labeled dataset
  - Finally, a value network is trained on that dataset
  
__Results__:
1 week on 50 GPUs with semi-supervised learning: 80% vs. pure supervised learning (4 wks) 57%

## Challenges

Must break correlations in the data. Positions that differ by only one square are very highly correlated, so the machine had to be explicitly taught *not* to play this way.

__Search space challenge__
The search tree is truncated after 3-4 nodes, and the value network is queried to estimate who would win from that position (rather than playing the game out to the end at each stage).
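
The idea of truncating the search and asking a value estimate who is ahead can be sketched on a much smaller game. Everything here is an assumption for illustration: the game is a take-1-to-3 stone game (whoever takes the last stone wins), and `value_estimate` is a hand-written stand-in for a trained value network.

```python
MOVES = (1, 2, 3)


def value_estimate(stones):
    """Stand-in for a value network: score in [-1, 1] for the player to move."""
    return -1.0 if stones % 4 == 0 else 1.0


def negamax(stones, depth):
    """Value for the player to move, searching `depth` plies before truncating."""
    if stones == 0:
        return -1.0  # the opponent took the last stone, so the mover has lost
    if depth == 0:
        return value_estimate(stones)  # truncate: ask the estimate, don't play out
    return max(-negamax(stones - m, depth - 1) for m in MOVES if m <= stones)


def best_move(stones, depth=3):
    """Pick the move whose resulting position is worst for the opponent."""
    legal = [m for m in MOVES if m <= stones]
    return max(legal, key=lambda m: -negamax(stones - m, depth - 1))


move = best_move(10)  # 2: leaves 8 stones, a losing position for the opponent
```

The search cost depends only on the truncation depth, not on how long the game would take to play out.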