# Guest Lecture: Deep Reinforcement Learning in TensorFlow
*Danijar Hafner*

## 0. Overview
- Intro to RL
- Value Based Methods
- Policy Based Methods
- Further Resources

## 1. Reinforcement Learning
![rl](figures/14_01.png)
Difference from supervised learning: the agent is never shown the correct output, it only receives a reward signal

### 1.1 Formalization as Markov Decision Process
#### Environment
- Markovian states $s \in S$
- actions $a \in A$
- scalar reward function $R(r_t | s_t, a_t)$
- transition function $P(s_{t+1}| s_t, a_t)$

#### Agent
- act according to stochastic policy $\pi(a_t | s_0,...,s_t)$
- collects experience tuples $(s_t,a_t,r_t,s_{t+1})$
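
The interaction loop implied by these definitions can be sketched in plain Python. The two-state `ToyEnv`, its reward values, and the uniform `random_policy` are illustrative assumptions, not part of the lecture.

```python
import random


class ToyEnv:
    """Hypothetical two-state MDP: landing in state 1 pays reward 1."""

    def __init__(self):
        self.state = 0

    def step(self, action):
        # Deterministic transition P(s' | s, a): the chosen action becomes the next state.
        next_state = action
        # Reward R(r | s, a): 1.0 when landing in state 1, else 0.0.
        reward = 1.0 if next_state == 1 else 0.0
        self.state = next_state
        return next_state, reward


def random_policy(state, rng):
    # Stochastic policy pi(a | s): uniform over the two actions.
    return rng.choice([0, 1])


def collect(env, policy, steps, rng):
    """Roll out the policy and return experience tuples (s, a, r, s')."""
    experience = []
    state = env.state
    for _ in range(steps):
        action = policy(state, rng)
        next_state, reward = env.step(action)
        experience.append((state, action, reward, next_state))
        state = next_state
    return experience


rng = random.Random(0)
experience = collect(ToyEnv(), random_policy, 5, rng)
```

Each tuple's next state is the following tuple's state, which is what makes consecutive experience correlated.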

#### Objective
- Maximize expectation of return $R_t=\sum_{i=0,...,\infty} \gamma^i r_{t+i}$ discounted by $0< \gamma <1$
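
For a finite list of rewards following step $t$, the discounted return (each reward $r_{t+i}$ weighted by $\gamma^i$) can be computed as in the sketch below. The function name and the example rewards are assumptions chosen for illustration.

```python
def discounted_return(rewards, gamma):
    """Compute sum_i gamma^i * r_{t+i} for a finite reward sequence."""
    result = 0.0
    # Iterate backwards so that each step applies R_t = r_t + gamma * R_{t+1}.
    for reward in reversed(rewards):
        result = reward + gamma * result
    return result


example = discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25
```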

### 1.2 Overview of Methods
![methods](figures/14_02.png)

## 2. Value Based Methods
### 2.1 Value Learning
- value function:  
$$V(s_t)=E[R_t]=E[\sum_{i=0,...,\infty} \gamma^i r_{t+i}]$$
- Bellman equation:  
$$V(s)=r + \gamma \sum_{s' \in S}\left\{ P(s' | s, \pi(a|s))V(s') \right\}$$
- act according to best $V(s')$, sometimes randomly
- estimate $V(s)$ using learning rate:   
$$V'(s)= (1- \alpha)V(s) + \alpha (r+ \gamma V(s'))$$  
(Converges to true value function and optimal behavior)
- problem: need the transition model $P(s'|s, a)$ to act (available in board games, for example Go)
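
A minimal tabular sketch of this update, with the discount factor $\gamma$ applied to the bootstrapped value as in the Bellman equation. The four-state chain, its rewards, and the fixed "always move right" policy are assumptions made for the example.

```python
def td_update(values, state, reward, next_state, alpha, gamma, terminal):
    """Move V(s) toward the bootstrapped target r + gamma * V(s')."""
    future = 0.0 if terminal else values[next_state]
    values[state] = (1 - alpha) * values[state] + alpha * (reward + gamma * future)


# Hypothetical chain 0 -> 1 -> 2 -> 3 where entering terminal state 3 pays 1.
values = [0.0, 0.0, 0.0, 0.0]
for _ in range(200):
    for state in range(3):
        next_state = state + 1
        terminal = next_state == 3
        reward = 1.0 if terminal else 0.0
        td_update(values, state, reward, next_state,
                  alpha=0.5, gamma=0.9, terminal=terminal)
```

With $\gamma = 0.9$ the estimates approach 0.81, 0.9 and 1.0 for states 0, 1 and 2, the discounted value of the single reward at the end of the chain.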

### 2.2 Q Learning (Watkins89)
- Q function: $$Q(s_t, a_t)=E(R_t)$$
- Bellman equation: $$Q^* (s,a)=r + \gamma \max_{a' \in A}Q^*(s', a')$$
- act according to best $Q(s, a)$, sometimes randomly
- estimate $Q^*(s, a)$ using learning rate:  $$Q'(s, a) = (1 - \alpha) Q(s, a) + \alpha (r + \gamma \max_{a' \in A} Q(s', a'))$$
- converges to optimal function $Q^*(s, a)$ and optimal behavior
- doesn't depend on policy, can learn from demonstrations or old experience
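
The off-policy property can be illustrated with a tabular sketch: the agent below behaves purely at random, yet the greedy policy read off the learned table is optimal. The chain environment, the constants, and the helper names are assumptions for the example.

```python
import random

NUM_STATES, NUM_ACTIONS, GOAL = 4, 2, 3


def step(state, action):
    """Hypothetical chain: action 1 moves right, action 0 moves left; reaching GOAL pays 1."""
    next_state = min(state + 1, GOAL) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL


def q_update(q, state, action, reward, next_state, done, alpha, gamma):
    """Q(s,a) <- (1 - alpha) Q(s,a) + alpha (r + gamma max_a' Q(s',a'))."""
    future = 0.0 if done else max(q[next_state])
    q[state][action] = (1 - alpha) * q[state][action] + alpha * (reward + gamma * future)


rng = random.Random(0)
q = [[0.0] * NUM_ACTIONS for _ in range(NUM_STATES)]
for _ in range(500):  # episodes
    state, done = 0, False
    while not done:
        action = rng.randrange(NUM_ACTIONS)  # purely random behavior policy
        next_state, reward, done = step(state, action)
        q_update(q, state, action, reward, next_state, done, alpha=0.5, gamma=0.9)
        state = next_state

# Greedy action in every non-terminal state.
greedy = [max(range(NUM_ACTIONS), key=lambda a: q[s][a]) for s in range(GOAL)]
```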

#### compare Value Learning & Q Learning
![vlql](figures/14_03.png)

### 2.3 Epsilon Greedy Exploration
- convergence and optimality only when visiting each state infinitely often
- main challenge in RL: exploration
- simple approach: acting randomly with probability $\epsilon$
- visit each $(s, a)$ infinitely often in the limit
- decay $\epsilon$ exponentially to ensure convergence
- right amount of exploration is often critical in practice

In [None]:
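# Anneal the exploration rate from 1.0 to 0.05 over 50000 steps.
epsilon = exponential_decay(step, 50000, 1.0, 0.05, rate=0.5)

# Greedy action for a single observation (add and strip a batch dimension).
best_action = tf.argmax(_qvalues([observ])[0], 0)
random_action = tf.random_uniform((), 0, num_actions, tf.int64)

# Explore with probability epsilon, otherwise exploit.
should_explore = tf.random_uniform((), 0, 1) < epsilon
action = tf.cond(should_explore, lambda: random_action, lambda: best_action)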

def exponential_decay(step, total, initial, final, rate=1e-4, stairs=None):
    if stairs is not None:
        step = stairs * tf.floor(step / stairs)
    scale, offset = 1. / (1. - rate), 1. - (1. / (1. - rate))
    progress = tf.cast(step, tf.float32) / tf.cast(total, tf.float32)
    # scale * rate ** progress + offset runs from 1 to 0, so value runs from initial to final.
    value = (initial - final) * (scale * rate ** progress + offset) + final
    lower, upper = tf.minimum(initial, final), tf.maximum(initial, final)
    return tf.maximum(lower, tf.minimum(value, upper))

### 2.4 Deep Neural Networks
#### Nonlinear Function Approximation
- rationale: too many states for a lookup table, want to approximate $Q(s, a)$ using a deep neural network
- can capture complex dependencies between $s, a$ and $Q(s, a)$ -> agent can learn sophisticated behavior
- Convolutional networks for reinforcement learning from pixels
    - Share some tricks from papers of the last two years
    - Sketch out implementations in TensorFlow

#### Predicting All Q-Values at Once (Mnih13)
Only one forward pass to find the best action
![predallq](figures/14_04.png)


In [None]:
def _qvalues(observ):
    with tf.variable_scope('qvalues', reuse=tf.AUTO_REUSE):
        # Network from DQN (Mnih 2015)
        h1 = tf.layers.conv2d(observ, 32, 8, 4, tf.nn.relu)
        h2 = tf.layers.conv2d(h1, 64, 4, 2, tf.nn.relu)
        h3 = tf.layers.conv2d(h2, 64, 3, 1, tf.nn.relu)
        h4 = tf.layers.dense(tf.layers.flatten(h3), 512, tf.nn.relu)
        return tf.layers.dense(h4, num_actions, None)
    
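# Select the Q-value of the action actually taken in each transition.
current = tf.reduce_sum(_qvalues(observ) * tf.one_hot(action, num_actions), 1)
# Terminal transitions have no future return, so only the reward remains.
future = tf.reduce_max(_qvalues(nextob), 1)
future = tf.where(done, tf.zeros_like(future), future)
target = tf.stop_gradient(reward + gamma * future)
loss = (current - target) ** 2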

#### Trick 1: Experience Replay (Mnih13)
- stochastic gradient descent expects <u>independent</u> samples
- consecutive experience collected by the agent is <u>highly correlated</u>
- solution: store experience tuples in a large buffer and select random batch for training  
-> decorrelates training examples
- even better: select training examples prioritized by last training cost (Schaul15)  
-> focuses on rare training examples

In [None]:
class ReplayBuffer:
    def __init__(self, template, capacity):
        self._capacity = capacity
        self._buffers = self._create_buffers(template)
        self._index = tf.Variable(0, dtype=tf.int32, trainable=False)
        
    def size(self):
        return tf.minimum(self._index, self._capacity)
    
    def append(self, tensors):
        position = tf.mod(self._index, self._capacity)
        with tf.control_dependencies([b[position].assign(t) for b, t in zip(self._buffers, tensors)]):
            return self._index.assign_add(1)
        
    def sample(self, amount):
        # maxval is exclusive, so this draws indices in [0, size).
        positions = tf.random_uniform((amount,), 0, self.size(), tf.int32)
        return [tf.gather(b, positions) for b in self._buffers]
    
    def _create_buffers(self, template):
        buffers = []
        for tensor in template:
            shape = tf.TensorShape([self._capacity]).concatenate(tensor.get_shape())
            initial = tf.zeros(shape, tensor.dtype)
            buffers.append(tf.Variable(initial, trainable=False))
        return buffers

    
class PrioritizedReplayBuffer:
    def __init__(self, template, capacity):
        template = (tf.constant(0.0),) + tuple(template)
        self._buffer = ReplayBuffer(template, capacity)
    
    def size(self):
        return self._buffer.size()
    
    def append(self, priority, tensors):
        return self._buffer.append((priority,) + tuple(tensors))
    
    def sample(self, amount, temperature=1):
        priorities = self._buffer._buffers[0].value()[:self._buffer.size()]
        logprobs = tf.log(priorities / tf.reduce_sum(priorities)) / temperature
        positions = tf.multinomial(logprobs[None, ...], amount)[0]
        return [tf.gather(b, positions) for b in self._buffer._buffers[1:]]
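
The effect of `temperature` in the prioritized sampler can be checked with a small plain-Python calculation. Since `tf.multinomial` interprets its input as unnormalized log-probabilities, dividing the log-priorities by a temperature $T$ samples item $i$ with probability proportional to $p_i^{1/T}$. The function name and example priorities below are assumptions for illustration.

```python
import math


def sampling_probabilities(priorities, temperature=1.0):
    """Softmax of log-priorities divided by temperature, i.e. p_i^(1/T) normalized."""
    total = sum(priorities)
    logits = [math.log(p / total) / temperature for p in priorities]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]  # subtract the max for stability
    norm = sum(weights)
    return [w / norm for w in weights]


priorities = [1.0, 1.0, 2.0]
proportional = sampling_probabilities(priorities, temperature=1.0)
flatter = sampling_probabilities(priorities, temperature=10.0)
```

At temperature 1 sampling is proportional to priority; higher temperatures move the distribution toward uniform sampling.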

#### Trick 2: Target Network (Mnih15, Lillicrap16, ...)
- targets $r + \gamma \max_{a' \in A}Q(s', a')$ depend on own current network $Q(s, a)$
- training towards moving target makes training unstable
- use a moving average $Q^T(s, a)$ of the network to compute the targets
- update network parameters $\theta^T_{t+1}= (1 - \beta) \theta^T_t + \beta \theta_t$ with $\beta \ll 1$
- get weights using graph editor and apply `tf.train.ExponentialMovingAverage`
- use graph editor to copy network graph and bind to averaged variables
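
Independently of the graph-editor machinery, the parameter update itself is a one-liner. The sketch below applies it to plain Python lists; the weight values and the name `soft_update` are assumptions for illustration.

```python
def soft_update(target_params, params, beta):
    """Return (1 - beta) * theta_target + beta * theta, element by element."""
    return [(1 - beta) * t + beta * p for t, p in zip(target_params, params)]


online = [1.0, -2.0]  # hypothetical online network weights, held fixed here
target = [0.0, 0.0]   # target network weights start elsewhere
first = soft_update(target, online, beta=0.01)
for _ in range(1000):
    target = soft_update(target, online, beta=0.01)
```

A moving-average decay of 0.999 corresponds to $\beta = 0.001$: the target weights trail the online weights and only catch up over roughly $1 / \beta$ updates.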


In [None]:
def bind(output, inputs):
    for key in inputs:
        if isinstance(inputs[key], tf.Variable):
            inputs[key] = inputs[key].value()
    return tf.contrib.graph_editor.graph_replace(output, inputs)

def moving_average(output, decay=0.999, collection=tf.GraphKeys.TRAINABLE_VARIABLES):
    average = tf.train.ExponentialMovingAverage(decay=decay)
    variables = set(v.value() for v in output.graph.get_collection(collection))
    deps = tf.contrib.graph_editor.get_backward_walk_ops(output)
    deps = [t for o in deps for t in o.values()]
    deps = set([t for t in deps if t in variables])
    update_op = average.apply(deps)
    new_output = bind(output, {t: average.average(t) for t in deps})
    return new_output, update_op

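current = tf.reduce_sum(_qvalues(observ) * tf.one_hot(action, num_actions), 1)
# moving_average returns the rebound output and the op that updates the averages;
# run update_target together with the training op.
target_qvalues, update_target = moving_average(_qvalues(nextob), 0.999)
future = tf.reduce_max(target_qvalues, 1)
future = tf.where(done, tf.zeros_like(future), future)
target = tf.stop_gradient(reward + gamma * future)
loss = (current - target) ** 2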

#### Trick 3: Double Q Learning (Hasselt10, Hasselt15)
- Q Learning tends to overestimate Q values
- same network chooses best action and evaluates it
- $r + \gamma \max_{a' \in A}Q(s', a') = r + \gamma\ Q(s', \arg\max_{a' \in A}Q(s', a'))$
- learning two Q functions from different experience would be ideal
- for efficiency, use target network $Q^T(s, a)$ to evaluate action
- targets become $r + \gamma\ Q^T(s', \arg\max_{a' \in A}Q(s', a'))$
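
The overestimation can be reproduced without any network. In the sketch below every true Q-value is 0 and the estimates carry Gaussian noise; the setup, noise level, and function name are assumptions for illustration.

```python
import random


def estimate_bias(num_actions=10, trials=5000, noise=1.0, seed=0):
    """Compare max-based and double estimators when every true Q-value is 0."""
    rng = random.Random(seed)
    single_total, double_total = 0.0, 0.0
    for _ in range(trials):
        # Two independent noisy estimates of the same (all-zero) Q-values.
        q_a = [rng.gauss(0.0, noise) for _ in range(num_actions)]
        q_b = [rng.gauss(0.0, noise) for _ in range(num_actions)]
        best = max(range(num_actions), key=lambda a: q_a[a])
        single_total += q_a[best]  # same estimate selects and evaluates
        double_total += q_b[best]  # second estimate evaluates the selected action
    return single_total / trials, double_total / trials


single_bias, double_bias = estimate_bias()
```

Selecting and evaluating with the same noisy estimate yields a clearly positive average even though the true value is 0, while evaluating the selected action with an independent estimate stays close to 0.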

In [None]:
# Q Learning
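current = tf.reduce_sum(_qvalues(observ) * tf.one_hot(action, num_actions), 1)
target_qvalues, update_target = moving_average(_qvalues(nextob), 0.999)
# The target network both selects and evaluates the best next action.
future = tf.reduce_max(target_qvalues, 1)
future = tf.where(done, tf.zeros_like(future), future)
target = tf.stop_gradient(reward + gamma * future)
loss = (current - target) ** 2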


# Double Q Learning
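current = tf.reduce_sum(_qvalues(observ) * tf.one_hot(action, num_actions), 1)
target_qvalues, update_target = moving_average(_qvalues(nextob), 0.999)
# The online network selects the action, the target network evaluates it.
future_action = tf.argmax(_qvalues(nextob), 1)
future = tf.reduce_sum(target_qvalues * tf.one_hot(future_action, num_actions), 1)
future = tf.where(done, tf.zeros_like(future), future)
target = tf.stop_gradient(reward + gamma * future)
loss = (current - target) ** 2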

## 3. Policy Based Methods
### 3.1 Policy Gradient (Williams92)
- learn policy $\pi(a_t | s_0, \ldots, s_t)$ directly (instead of value function)
- train network to maximize expected return $E[ R_t ]$
- $R(r | s, a)$ is unknown, but the gradient of the expectation can still be estimated from samples: $\nabla_{\theta} E[ R_t ] = E[ R_t \nabla_{\theta} \ln \pi(a|s) ]$
- can only train on-policy because returns won't match otherwise
![policybased](figures/14_05.png)
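
The score-function gradient can be tried on the simplest possible problem before moving to networks. The sketch below runs REINFORCE with a softmax policy on a hypothetical two-armed bandit (arm 1 pays 1, arm 0 pays 0); the bandit, step count, and learning rate are assumptions for illustration.

```python
import math
import random


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(steps=2000, learning_rate=0.1, seed=0):
    """REINFORCE on a two-armed bandit where arm 1 pays 1 and arm 0 pays 0."""
    rng = random.Random(seed)
    logits = [0.0, 0.0]  # policy parameters theta
    for _ in range(steps):
        probs = softmax(logits)
        action = 0 if rng.random() < probs[0] else 1
        return_ = 1.0 if action == 1 else 0.0
        # For a softmax policy, d ln pi(a) / d logit_i = one_hot(a)_i - probs_i.
        for i in range(2):
            grad_log_prob = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += learning_rate * return_ * grad_log_prob  # gradient ascent
    return softmax(logits)


final_probs = reinforce_bandit()
```

Only sampled actions and their returns enter the update, so no model of the reward function is needed.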

In [None]:
def _policy(observ):
    with tf.variable_scope('policy', reuse=tf.AUTO_REUSE):
        # Network from A3C (Mnih 2016)
        h1 = tf.layers.conv2d(observ, 16, 8, 4, tf.nn.relu)
        h2 = tf.layers.conv2d(h1, 32, 4, 2, tf.nn.relu)
        h3 = tf.layers.dense(tf.layers.flatten(h2), 256, tf.nn.relu)
        cell = tf.contrib.rnn.GRUCell(256)
        h4, _ = tf.nn.dynamic_rnn(cell, h3[None, ...], dtype=tf.float32)
        return tf.layers.dense(h4[0], num_actions, None)
    
action_mask = tf.one_hot(action, num_actions)
# _policy returns logits, so normalize them before reading off the probability.
prob_under_policy = tf.reduce_sum(tf.nn.softmax(_policy(observ)) * action_mask, 1)
loss = -return_ * tf.log(prob_under_policy + 1e-13)

### 3.2 Variance Reduction Via Baseline (Williams92, Sutton98)
- idea: learn the best actions and don't care about other parts of reward
- subtract baseline $b(s)$ from return $R_t$ to reduce variance
- advantage actor critic maximizes advantage function $A(s, a) = R_t - V(s)$
- in practice, actor and critic often share lower layers

![vrvb](figures/14_06.png)
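
Why a baseline helps can be seen by computing the mean and variance of the gradient estimate exactly for a tiny problem. The two-action setup with returns 10 and 11 and a uniform softmax policy is an assumption for illustration.

```python
def gradient_stats(returns, probs, baseline):
    """Mean and variance of the score-function gradient w.r.t. the logit of action 1."""
    samples = []
    for action, (ret, prob) in enumerate(zip(returns, probs)):
        # Softmax score for logit 1: one_hot(a)[1] - probs[1].
        grad_log_prob = (1.0 if action == 1 else 0.0) - probs[1]
        samples.append((prob, (ret - baseline) * grad_log_prob))
    mean = sum(p * g for p, g in samples)
    variance = sum(p * (g - mean) ** 2 for p, g in samples)
    return mean, variance


returns, probs = [10.0, 11.0], [0.5, 0.5]  # hypothetical two-action problem
plain_mean, plain_var = gradient_stats(returns, probs, baseline=0.0)
base_mean, base_var = gradient_stats(returns, probs, baseline=10.5)
```

Both estimators have the same expected gradient, but subtracting the mean return removes the part of the reward shared by all actions, which is where the variance comes from in this example.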

In [None]:
def _shared_network(observ):
    with tf.variable_scope('shared_network', reuse=tf.AUTO_REUSE):
        # Network from A3C (Mnih 2016)
        h1 = tf.layers.conv2d(observ, 16, 8, 4, tf.nn.relu)
        h2 = tf.layers.conv2d(h1, 32, 4, 2, tf.nn.relu)
        h3 = tf.layers.dense(tf.layers.flatten(h2), 256, tf.nn.relu)
        cell = tf.contrib.rnn.GRUCell(256)
        h4, _ = tf.nn.dynamic_rnn(cell, h3[None, ...], dtype=tf.float32)
        return h4[0]
    
features = _shared_network(observ)
policy = tf.layers.dense(features, num_actions, None)  # action logits
value = tf.layers.dense(features, 1, None)[:, 0]  # state value V(s), shape [batch]
# Treat the advantage as a constant so the policy loss does not train the critic.
advantage = tf.stop_gradient(return_ - value)
action_mask = tf.one_hot(action, num_actions)
prob_under_policy = tf.reduce_sum(tf.nn.softmax(policy) * action_mask, 1)
policy_loss = -advantage * tf.log(prob_under_policy + 1e-13)
value_loss = (return_ - value) ** 2

### 3.3 Continuous Control using Policy Gradients
- many control problems are better formulated using continuous actions  
e.g. control steering angle rather than just left/center/right  
- policy gradients don't max over actions as Q Learning does  
-> well suited for continuous action spaces  
- decompose policy into mean and noise: sample $a = \mu (s) + z(s)$ with $a \sim \pi(a | s)$
- learn mean and add fixed noise source, or learn both
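
A minimal sketch of such a policy for a single continuous action, assuming Gaussian noise with a fixed standard deviation. The function names and the example numbers are illustrative assumptions.

```python
import math
import random


def sample_action(mean, stddev, rng):
    """Draw a = mu(s) + z with Gaussian noise z ~ N(0, stddev^2)."""
    return mean + stddev * rng.gauss(0.0, 1.0)


def log_prob(action, mean, stddev):
    """Log density of the Gaussian policy at the given action."""
    return (-0.5 * ((action - mean) / stddev) ** 2
            - math.log(stddev * math.sqrt(2 * math.pi)))


def grad_log_prob_mean(action, mean, stddev):
    """d/d mu of ln pi(a | s), the term that multiplies the return in the policy gradient."""
    return (action - mean) / stddev ** 2


rng = random.Random(0)
action = sample_action(mean=0.3, stddev=0.1, rng=rng)  # e.g. a steering angle
score = grad_log_prob_mean(action, mean=0.3, stddev=0.1)
```

Actions that landed above the mean and led to a high return push the mean up, and vice versa; no maximization over actions is involved.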

#### Deterministic Policy Gradient (Silver14, Lillicrap16)
- continuous policy gradient algorithm that can learn off-policy
- evaluate actions using a critic network $Q(s, a)$ rather than the environment  
On-policy SARSA doesn't need max over actions!
- Backpropagate gradient through the critic to the action and on to the policy parameters: $E[ \nabla_a Q(s, a) \nabla_{\theta} \pi(s) ]$

![dpg](figures/14_07.png)
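
The chain rule behind this update can be followed by hand in one dimension. The sketch below assumes a known critic $Q(s, a) = -(a - s)^2$ and a linear deterministic policy $a = \theta s$; both are illustrative assumptions, whereas in practice the critic is itself learned.

```python
def critic(state, action):
    """Hypothetical known critic Q(s, a) = -(a - s)^2, maximized by a = s."""
    return -(action - state) ** 2


def critic_grad_action(state, action):
    """dQ/da for the critic above."""
    return -2.0 * (action - state)


def train_policy(states, steps=200, learning_rate=0.05):
    theta = 0.0  # linear deterministic policy a = theta * s
    for _ in range(steps):
        grad = 0.0
        for state in states:
            action = theta * state
            # Chain rule: dQ/dtheta = dQ/da * da/dtheta, with da/dtheta = s.
            grad += critic_grad_action(state, action) * state
        theta += learning_rate * grad / len(states)  # gradient ascent on Q
    return theta


theta = train_policy(states=[-1.0, 0.5, 2.0])
```

The policy never samples noisy actions here: the critic's gradient with respect to the action tells it directly in which direction to move.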

In [None]:
features = _shared_network(observ)
action = _policy(features, action_size)
qvalue = _qvalue(features, action)

direction = tf.gradients([qvalue], [action])[0]
if self._clip_q_grad:
    direction = tf.clip_by_value(direction, -1, 1)
target = tf.stop_gradient(action + direction)
policy_loss = tf.reduce_sum((target - action) ** 2, 2)

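# Evaluate the next state with the action the policy would take there
# (assumes _qvalue takes features and an action, as in the call above).
next_features = _shared_network(nextob)
target_qvalue = _qvalue(next_features, _policy(next_features, action_size))
target_qvalue, update_target = moving_average(target_qvalue, 0.999)
# Terminal transitions have no future return, so only the reward remains.
target_qvalue = tf.where(done, tf.zeros_like(target_qvalue), target_qvalue)
target = tf.stop_gradient(reward + gamma * target_qvalue)
loss = (qvalue - target) ** 2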

## 4. Further Resources
- Reading
    - [Richard Sutton](http://incompleteideas.net/sutton/book/the-book-2nd.html)
    - [Andrej Karpathy](http://karpathy.github.io/2016/05/31/rl/)
- Lectures
    - [David Silver](https://www.youtube.com/watch?v=2pWv7GOvuf0&feature=youtu.be)
    - [John Schulman](https://www.youtube.com/watch?v=oPGVsoBonLM&feature=youtu.be)
- Software
    - [Gym](https://gym.openai.com/)
    - [RL Lab](https://github.com/openai/rllab)
    - [Modular RL](https://github.com/joschu/modular_rl)
    - [Mindpark](https://github.com/danijar/mindpark)