<a href="https://colab.research.google.com/github/raphaelletseng/AI4Good2021/blob/main/RL_AI4Good_Workshop.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AI4Good: Introduction to RL Workshop

Welcome to our introductory workshop in Reinforcement Learning!

In this notebook, we will implement different types of *Q-learning* agents and train them on the Catch environment.

Please <font color='red'>**make a copy**</font> of this notebook to your Drive: **File** > **Save copy in Drive**.

In [None]:
#@title Imports
%%capture
!pip install dm_env

import collections
import dm_env
import numpy as np
import tensorflow as tf

from matplotlib import pyplot as plt
import matplotlib.animation as animation
from matplotlib import rc
rc('animation', html='jshtml')
%matplotlib inline

## The Catch environment

*Catch* is a classic, simple RL environment in which the agent must learn to catch a falling ball by moving a paddle. Below we provide a simple implementation of the environment, in which the three scalar actions $(0, 1, 2)$ correspond to moving the paddle left, keeping it in place, and moving it right, respectively. The agent receives a reward of $1.0$ if the paddle is directly below the ball when it reaches the bottom of the board, and $-1.0$ otherwise. All intermediate steps give a reward of $0.0$.
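As a rough illustration of the action mapping described above, the sketch below mirrors how an action index becomes a paddle displacement that is clipped to the board. The helper name `move_paddle` is hypothetical and uses plain Python instead of NumPy; it is not part of the environment itself.

```python
def move_paddle(paddle_x, action, columns=5):
    """Returns the new paddle column after applying `action` (0, 1 or 2)."""
    dx = action - 1  # 0 -> -1 (left), 1 -> 0 (stay), 2 -> +1 (right).
    # Keep the paddle inside the board, as the environment does with a clip.
    return min(max(paddle_x + dx, 0), columns - 1)


print(move_paddle(2, 0))  # 1: moved left from the middle column.
print(move_paddle(4, 2))  # 4: already at the right edge, so it stays put.
```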

<img src="https://drive.google.com/uc?id=1xkpEZAkl08E_XJQsCe8b3Y0JYRhsScS2" width="400">

In [None]:
#@title Catch implementation
_ACTIONS = (0, 1, 2)  # Left, no-op, right.


class Catch(dm_env.Environment):
  """A Catch environment built on the `dm_env.Environment` class."""

  def __init__(self, rows=10, columns=5, seed=1):
    self._rows = rows
    self._columns = columns
    self._rng = np.random.RandomState(seed)
    self._board = np.zeros((rows, columns), dtype=np.float32)
    self._ball_x = None
    self._ball_y = None
    self._paddle_x = None
    self._paddle_y = self._rows - 1
    self._reset_next_step = True

  def reset(self):
    """Returns the first `TimeStep` of a new episode."""
    self._reset_next_step = False
    self._ball_x = self._rng.randint(self._columns)
    self._ball_y = 0
    self._paddle_x = self._columns // 2
    return dm_env.restart(self._observation())

  def step(self, action):
    """Updates the environment according to the action."""
    if self._reset_next_step:
      return self.reset()

    # Move the paddle.
    dx = _ACTIONS[action] - 1
    self._paddle_x = np.clip(self._paddle_x + dx, 0, self._columns - 1)

    # Drop the ball.
    self._ball_y += 1

    # Check for termination.
    if self._ball_y == self._paddle_y:
      reward = 1. if self._paddle_x == self._ball_x else -1.
      self._reset_next_step = True
      return dm_env.termination(reward=reward, observation=self._observation())
    else:
      return dm_env.transition(reward=0., observation=self._observation())

  def _observation(self):
    self._board.fill(0.)
    self._board[self._ball_y, self._ball_x] = 1.
    self._board[self._paddle_y, self._paddle_x] = 1.
    return self._board.copy()

  def observation_spec(self):
    return dm_env.specs.BoundedArray(
        shape=self._board.shape,
        dtype=self._board.dtype,
        name="board",
        minimum=0,
        maximum=1)

  def action_spec(self):
    return dm_env.specs.DiscreteArray(
        dtype=int, num_values=len(_ACTIONS), name="action")

### Let's observe a random agent acting on Catch!

First we are going to take a look at what the agent-environment interaction looks like when an agent acts randomly. We see that the board is represented by a $10\times 5$ array of zeroes, where both the ball and the paddle position are denoted by a value of $1.0$.

In [None]:
#@title Take random actions
env = Catch()

res = []
timestep = env.reset()
print('Observation format (what the agent sees):')
print(timestep.observation)
res.append(timestep.observation)
for step in range(50):
  action = np.random.randint(3)
  timestep = env.step(action)
  res.append(timestep.observation)

Observation format (what the agent sees):
[[0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0.]]


In [None]:
#@title Render animation
%%capture
im = plt.imshow(res[0])
def animate(frame):
  im.set_data(frame)
  return im,
anim = animation.FuncAnimation(plt.gcf(), animate, frames=res, blit=False, repeat=True)

In [None]:
anim

### Agent training and evaluation

In [None]:
def train(agent, env, num_episodes=100):
  print('Training agent...')
  training_returns = []
  for episode in range(num_episodes):
    timestep = env.reset()
    agent.observe_first(timestep)
    sum_rewards = 0.0
    while timestep.step_type != dm_env.StepType.LAST:
      action = agent.select_action(timestep.observation)
      timestep = env.step(action)
      if timestep.reward is not None:
        sum_rewards += timestep.reward
      agent.observe(action, timestep)
      agent.update()
    training_returns.append(sum_rewards)
    if episode % 10 == 0:
      print(f'Episode: {episode}, Return: {sum_rewards}, '
            f'Mean return: {np.mean(training_returns[-50:])}')

def evaluate(agent, env, num_episodes = 10):
  print('\nEvaluating agent...')
  agent.set_epsilon(0)
  eval_returns = []
  observations = []
  for episode in range(num_episodes):
    sum_rewards = 0.0
    timestep = env.reset()
    observations.append(timestep.observation)
    agent.observe_first(timestep)
    while timestep.step_type != dm_env.StepType.LAST:
      action = agent.select_action(timestep.observation)
      timestep = env.step(action)
      observations.append(timestep.observation)
      if timestep.reward is not None:
        sum_rewards += timestep.reward
      agent.observe(action, timestep)
      # agent.update()  # Don't update.
    eval_returns.append(sum_rewards)
    print(f'Episode: {episode}, Return: {sum_rewards}')
  print(f'mean: {np.mean(eval_returns)}, std: {np.std(eval_returns)}')
  return observations

## 1. Tabular Q-Learning Agent

In Q-learning, the agent estimates the *value* of (state, action) pairs. This estimate reflects how much total return the agent anticipates up until the end of the episode, assuming that it takes action $A$ in state $S$. In *tabular* Q-learning in particular, these value estimates are stored explicitly in a table, for example:

| (State, Action) | Q-value  |
| ----------------| ---------|
| ($S_i$, left)   | 0.7      |
| ($S_i$, stay)   | 0.0      |
| ($S_i$, right)  | -0.5     |
| ($S_j$, left)   | 0.32     |
| ($S_j$, stay)   | -1.0     |
| ($S_j$, right)  | 0.1     |
| $\dots$         | $\dots$  |

These estimates of (state, action) pairs will drive the behaviour (*policy*) of the agent.
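To make the link between the table and the policy concrete, here is a minimal sketch that stores the example values above in a dictionary and picks the greedy action. The names `q_table`, `ACTIONS` and `greedy_action` are illustrative assumptions, and the states are plain strings standing in for real observations.

```python
# Toy Q-table: keys are (state, action) pairs, values are Q-value estimates.
q_table = {
    ("S_i", "left"): 0.7,
    ("S_i", "stay"): 0.0,
    ("S_i", "right"): -0.5,
    ("S_j", "left"): 0.32,
    ("S_j", "stay"): -1.0,
    ("S_j", "right"): 0.1,
}
ACTIONS = ("left", "stay", "right")


def greedy_action(q, state):
    """Returns the action with the highest Q-value estimate in `state`."""
    return max(ACTIONS, key=lambda action: q[(state, action)])


print(greedy_action(q_table, "S_i"))  # left
print(greedy_action(q_table, "S_j"))  # left
```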

In [None]:
class QLearning():
  """Simple Q-learning agent."""

  def __init__(self,
               learning_rate: float = 0.2,
               epsilon: float = 0.1,
               discount: float = 0.99):
    self._learning_rate = learning_rate
    self._epsilon = epsilon
    self._discount = discount
    self._q = collections.defaultdict(np.random.random)
    self._latest_action = None
    self._timestep_before_action = None
    self._timestep_after_action = None

  def observe_first(self, first_timestep: dm_env.TimeStep):
    self._timestep_after_action = first_timestep

  def observe(self, action, next_timestep):
    self._latest_action = action
    self._timestep_before_action = self._timestep_after_action
    self._timestep_after_action = next_timestep

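  def update(self):
    reward = self._timestep_after_action.reward
    obs = self._timestep_after_action.observation
    obs_before = self._timestep_before_action.observation

    best_action = self._best_action(obs)
    # `dm_env` sets `discount` to 0 on the final step of an episode, so the
    # target does not bootstrap from the (never-updated) terminal state.
    discount = self._discount * self._timestep_after_action.discount
    td = reward + discount * self._q_func(obs, best_action) - self._q_func(
        obs_before, self._latest_action)
    self._q[(str(obs_before), self._latest_action)] += self._learning_rate * td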

  def select_action(self, latest_obs):
    action = np.argmax([self._q_func(latest_obs, a) for a in range(3)])
    # Exercise: sample the action from a softmax over these Q-values instead
    # (NumPy has no `np.softmax`; `scipy.special.softmax` is one option).
    if np.random.random() < self._epsilon:
      action = np.random.randint(0, 3)
    return action

  def _best_action(self, obs):
    return np.argmax([self._q_func(obs, a) for a in range(3)])
    # Exercise: try a softmax-based choice here (NumPy has no `np.softmax`).

  def _q_func(self, obs, action):
    return self._q[(str(obs), action)]

  def set_epsilon(self, eps: float):
    self._epsilon = eps

### Train Tabular Q-learning agent

During training, the agent acts in the environment (i.e. plays the game) and updates its Q-value estimates after every step, based on what it observes. In particular, the estimates are updated based on the [Bellman equation](https://en.wikipedia.org/wiki/Bellman_equation):

$$Q_{new}(s_t, a_t) = Q_{old}(s_t, a_t) + \alpha \left(R_t + \gamma \max_a Q_{old}(s_{t+1}, a) - Q_{old}(s_t, a_t)\right)$$

Here $\alpha$ is the learning rate and $\gamma$ is the discount factor. During this process, the Q-value estimates will become more and more accurate, leading to gradually increasing performance (i.e. the agent is *learning* to play the game well).

After the training process, the agent is *evaluated*. This means we assess its performance on the environment without any randomness in its behaviour. At this time, the agent does not make any updates to its Q-value estimates, therefore its behaviour is not changing anymore.
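As a small worked example of the update rule used during training, the sketch below applies one Q-learning update to made-up numbers. The function name `q_update` and the input values are assumptions chosen for illustration; the default learning rate and discount match the `QLearning` agent above.

```python
def q_update(q_old, reward, max_next_q, learning_rate=0.2, discount=0.99):
    """Applies one Q-learning update and returns the new estimate."""
    # Temporal-difference error: bootstrapped target minus current estimate.
    td_error = reward + discount * max_next_q - q_old
    return q_old + learning_rate * td_error


# Current estimate 0.5, no reward on this step, best next-state estimate 1.0.
new_q = q_update(q_old=0.5, reward=0.0, max_next_q=1.0)
print(round(new_q, 3))  # 0.598
```

The estimate moves only a fraction (the learning rate) of the way towards the target of $0.99$, which is why many episodes are needed before the values settle.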

In [None]:
env = Catch()
agent = QLearning()

train(agent, env)
res = evaluate(agent, env)

Training agent...
Episode: 0, Return: 1.0, Mean return: 1.0
Episode: 10, Return: -1.0, Mean return: -0.09090909090909091
Episode: 20, Return: 1.0, Mean return: 0.047619047619047616
Episode: 30, Return: -1.0, Mean return: -0.0967741935483871
Episode: 40, Return: 1.0, Mean return: -0.07317073170731707
Episode: 50, Return: -1.0, Mean return: -0.08
Episode: 60, Return: 1.0, Mean return: 0.12
Episode: 70, Return: 1.0, Mean return: 0.12
Episode: 80, Return: 1.0, Mean return: 0.36
Episode: 90, Return: 1.0, Mean return: 0.52

Evaluating agent...
Episode: 0, Return: 1.0
Episode: 1, Return: 1.0
Episode: 2, Return: 1.0
Episode: 3, Return: 1.0
Episode: 4, Return: 1.0
Episode: 5, Return: 1.0
Episode: 6, Return: 1.0
Episode: 7, Return: 1.0
Episode: 8, Return: 1.0
Episode: 9, Return: 1.0
mean: 1.0, std: 0.0


In [None]:
#@title Render animation
%%capture
im = plt.imshow(res[0])
def animate(frame):
  im.set_data(frame)
  return im,
anim = animation.FuncAnimation(plt.gcf(), animate, frames=res, blit=False, repeat=True)

In [None]:
anim

### Exercises (5 mins)

* Try different epsilons.
* Change policy from **epsilon greedy** to something else (e.g. fixed left-right-left-right cycles).
* Swap `argmax` with another function (e.g. `softmax`).
* Experiment with different learning rates.
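For the `softmax` exercise, one possible starting point is sketched below. It uses only the Python standard library, and the `temperature` parameter and function names are assumptions, not part of the agent above. Instead of always picking the highest Q-value, the action is sampled with probability proportional to the exponentiated Q-values.

```python
import math
import random


def softmax(q_values, temperature=1.0):
    """Converts a list of Q-values into action probabilities."""
    max_q = max(q_values)  # Subtract the max for numerical stability.
    exps = [math.exp((q - max_q) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(q_values, temperature=1.0, rng=random):
    """Samples an action index according to the softmax probabilities."""
    probs = softmax(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


probs = softmax([0.7, 0.0, -0.5])
action = sample_action([0.7, 0.0, -0.5])
print(probs, action)
```

A lower temperature makes the choice closer to `argmax`, while a higher temperature makes it closer to uniform random.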

## 2. Q-Learning agent with Neural Networks

A major limitation of the tabular approach is that when the state space is large, it quickly becomes infeasible to visit every (state, action) pair often enough to obtain a realistic estimate of its Q-value. Apart from explicit Q-value tables, another way for an agent to represent its Q-value estimates is using *Neural Networks*. Neural networks are [universal function approximators](https://en.wikipedia.org/wiki/Universal_approximation_theorem), therefore in theory they can be arbitrarily accurate estimators of the true $Q(s,a)$ function. They also help overcome the problem of large state spaces, because they can exploit underlying structure in the observation space.

In our Catch example, we can take our existing tabular Q-learning agent and replace its `_q_func()` and `update()` methods to use neural networks. The `_q_func()` method will now compute the Q-value as the output of the NN model, rather than reading it directly from a table. The `update()` method, instead of overwriting an entry of the Q-table, will fit the model to the new target value.
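The network in this section takes a single vector as input: the flattened $10\times 5$ board followed by a one-hot encoding of the action, for $50 + 3 = 53$ values in total. The sketch below shows that layout with plain Python lists; the helper name `make_input` is illustrative, and the agent's own `_make_input()` does the equivalent with TensorFlow ops.

```python
ROWS, COLUMNS, NUM_ACTIONS = 10, 5, 3


def make_input(board, action):
    """Flattens the board and appends a one-hot encoding of `action`."""
    flat_board = [cell for row in board for cell in row]
    one_hot = [0.0] * NUM_ACTIONS
    one_hot[action] = 1.0
    return flat_board + one_hot


board = [[0.0] * COLUMNS for _ in range(ROWS)]
board[0][3] = 1.0         # Ball in the top row.
board[ROWS - 1][2] = 1.0  # Paddle in the bottom row.

x = make_input(board, action=2)
print(len(x))  # 53
print(x[-3:])  # [0.0, 0.0, 1.0]
```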

In [None]:
class QLearningNN():
  """Simple Q-learning agent using a Neural Network."""

  def __init__(self,
               model,
               epsilon: float = 0.1,
               discount: float = 0.99):
    self._model = model
    self._epsilon = epsilon
    self._discount = discount
    self._latest_action = None
    self._timestep_before_action = None
    self._timestep_after_action = None

  def observe_first(self, first_timestep: dm_env.TimeStep):
    self._timestep_after_action = first_timestep

  def observe(self, action, next_timestep):
    self._latest_action = action
    self._timestep_before_action = self._timestep_after_action
    self._timestep_after_action = next_timestep

  def _best_action(self, obs):
    return np.argmax([self._q_func(obs, a) for a in range(3)])

  def update(self):
    reward = self._timestep_after_action.reward
    obs = self._timestep_after_action.observation
    obs_before = self._timestep_before_action.observation

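    best_action = self._best_action(obs)
    # `dm_env` sets `discount` to 0 on the final step of an episode, so the
    # target does not bootstrap from the terminal state.
    discount = self._discount * self._timestep_after_action.discount
    target_output = reward + discount * self._q_func(obs, best_action)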
    model_input = self._make_input(obs_before, self._latest_action)
    model_input = tf.expand_dims(model_input, axis=0)
    self._model.fit(model_input, tf.convert_to_tensor([[target_output]]),
                    verbose=0, batch_size=1)

  def _make_input(self, obs, action):
    flatten_obs = tf.reshape(obs, shape=(tf.math.reduce_prod(obs.shape)))
    # Create one-hot encoded representation of the action.
    a = np.zeros([3])
    a[action] = 1
    # Concatenate the one-hot encoded action to the flattened observation.
    model_input = tf.concat([flatten_obs, a], axis=0)
    return model_input

  def _q_func(self, latest_obs, action):
    model_input = self._make_input(latest_obs, action)
    model_input = tf.expand_dims(model_input, axis=0)  # Add batch dimension.
    output = self._model(model_input)
    output = tf.squeeze(output)  # Remove batch dimension.
    return output

  def select_action(self, latest_obs):
    q_values = [self._q_func(latest_obs, a) for a in range(3)]
    action = int(tf.math.argmax(q_values))  # Plain int, like the random branch.
    if np.random.random() < self._epsilon:
      action = np.random.randint(0, 3)
    return action

  def set_epsilon(self, eps: float):
    self._epsilon = eps

### Train Q-learning agent with NNs

In [None]:
# Create environment.
env = Catch()

# Build model for agent.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(50, input_shape=(53,), activation='relu', name='layer1'),
    tf.keras.layers.Dense(10, activation='relu', name='layer2'),
    tf.keras.layers.Dense(1, name='layer3'),
])
optimizer = tf.keras.optimizers.Adam(learning_rate=0.0005)
model.compile(optimizer=optimizer, loss='mse')

# Create agent.
agent = QLearningNN(model)

with tf.device('/device:GPU:0'):
  train(agent, env)
  res = evaluate(agent, env)

Training agent...
Episode: 0, Return: -1.0, Mean return: -1.0
Episode: 10, Return: -1.0, Mean return: -1.0
Episode: 20, Return: -1.0, Mean return: -0.7142857142857143
Episode: 30, Return: -1.0, Mean return: -0.6774193548387096
Episode: 40, Return: -1.0, Mean return: -0.7073170731707317
Episode: 50, Return: -1.0, Mean return: -0.68
Episode: 60, Return: -1.0, Mean return: -0.44
Episode: 70, Return: -1.0, Mean return: -0.4
Episode: 80, Return: -1.0, Mean return: -0.4
Episode: 90, Return: -1.0, Mean return: -0.44

Evaluating agent...
Episode: 0, Return: 1.0
Episode: 1, Return: -1.0
Episode: 2, Return: 1.0
Episode: 3, Return: -1.0
Episode: 4, Return: -1.0
Episode: 5, Return: -1.0
Episode: 6, Return: -1.0
Episode: 7, Return: 1.0
Episode: 8, Return: -1.0
Episode: 9, Return: 1.0
mean: -0.2, std: 0.9797958971132712


In [None]:
#@title Render animation
%%capture
im = plt.imshow(res[0])
def animate(frame):
  im.set_data(frame)
  return im,
anim = animation.FuncAnimation(plt.gcf(), animate, frames=res, blit=False, repeat=True)

In [None]:
anim

### Exercises (3 mins)


Experiment with model architectures:
* Try different activation functions.
* Change the layer sizes.
* Change the number of layers.

## 3. Q-Learning agent with NNs and a Replay Buffer

Another way to make our algorithm more efficient is to introduce a *Replay Buffer*. In the previous example, each call to `model.fit` used a single transition (the most recent one). Instead of fitting on a single datapoint, we can fit on a *set* of datapoints. To do this, we store a number of previously seen transitions $(S_i, a_i, S_{i+1})$, and at each update we fit the model on a sample of these.
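The idea can be sketched independently of the agent as a small container that stores transitions and hands back random samples. Note the assumptions: the class name `ReplayBuffer` is hypothetical, and this version evicts the *oldest* entry when full (via `collections.deque`), whereas the agent in this section removes a random entry.

```python
import collections
import random


class ReplayBuffer:
    """Fixed-capacity buffer with uniform sampling (with replacement)."""

    def __init__(self, capacity):
        # A deque with `maxlen` silently drops the oldest entry when full.
        self._entries = collections.deque(maxlen=capacity)

    def add(self, transition):
        self._entries.append(transition)

    def sample(self, num_samples, rng=random):
        return [rng.choice(self._entries) for _ in range(num_samples)]

    def __len__(self):
        return len(self._entries)


buffer = ReplayBuffer(capacity=3)
for transition in range(5):  # Integers stand in for (S_i, a_i, S_i+1) tuples.
    buffer.add(transition)

print(len(buffer))  # 3: only the three most recent transitions remain.
batch = buffer.sample(10)
```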

In [None]:
class QLearningNNReplay():
  """Simple Q-learning agent using a Neural Network and a replay buffer."""

  def __init__(self,
               model,
               max_replay_entries: int = 10000,
               num_samples_per_update: int = 10,
               epsilon: float = 0.1,
               discount: float = 0.99):
    self._model = model
    self._replay = []
    self._max_replay_entries = max_replay_entries
    self._num_samples_per_update = num_samples_per_update
    self._epsilon = epsilon
    self._discount = discount
    self._latest_action = None
    self._timestep_before_action = None
    self._timestep_after_action = None

  def observe_first(self, first_timestep: dm_env.TimeStep):
    self._timestep_after_action = first_timestep

  def observe(self, action, next_timestep):
    self._latest_action = action
    self._timestep_before_action = self._timestep_after_action
    self._timestep_after_action = next_timestep
    # Add (S_i, a_i, S_i+1) to replay buffer.
    self._replay.append((self._timestep_before_action, self._latest_action,
                         self._timestep_after_action))
    if len(self._replay) >= self._max_replay_entries:
      # Remove a random entry from the buffer if capacity is reached.
      random_index = np.random.randint(len(self._replay))
      del self._replay[random_index]

  def _best_action(self, obs):
    return np.argmax([self._q_func(obs, a) for a in range(3)])

  def update(self):
    # Sample `self._num_samples_per_update` from replay buffer.
    samples = [
        self._replay[np.random.randint(len(self._replay))]
        for _ in range(self._num_samples_per_update)
    ]

    # Samples take the form (S_i, a_i, S_i+1).
    best_actions = [self._best_action(x[2].observation) for x in samples]
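    # `x[2].discount` is 0 on the final step of an episode, so terminal
    # transitions do not bootstrap from the terminal state.
    ys = np.array([
        x[2].reward
        + self._discount * x[2].discount * self._q_func(x[2].observation, a)
        for x, a in zip(samples, best_actions)
    ])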
    xs = np.array([self._make_input(x[0].observation, x[1]) for x in samples])
    self._model.fit(xs, tf.convert_to_tensor(ys), verbose=0, batch_size=16)

  def _make_input(self, obs, action):
    flatten_obs = tf.reshape(obs, shape=(tf.math.reduce_prod(obs.shape)))
    a = np.zeros([3])
    a[action] = 1  # One-hot action
    model_input = tf.concat([flatten_obs, a], axis=0) # Concatenate them
    return model_input

  def _q_func(self, latest_obs, action):
    model_input = self._make_input(latest_obs, action)
    model_input = tf.expand_dims(model_input, axis=0)
    output = self._model(model_input)
    output = tf.squeeze(output)  # Remove batch dimension.
    return output

  def select_action(self, latest_obs):
    q_values = [self._q_func(latest_obs, a) for a in range(3)]
    action = int(tf.math.argmax(q_values))  # Plain int, like the random branch.
    if np.random.random() < self._epsilon:
      action = np.random.randint(0, 3)
    return action

  def set_epsilon(self, eps: float):
    self._epsilon = eps

### Train Q-learning agent with NNs and replay buffer

In [None]:
# Create environment.
env = Catch()

# Build model for the agent.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(50, input_shape=(53,), activation='relu', name='layer1'),
    tf.keras.layers.Dense(10, activation='relu', name='layer2'),
    tf.keras.layers.Dense(1, name='layer3'),
])
optimizer = tf.keras.optimizers.Adam(learning_rate=0.0005)
model.compile(optimizer=optimizer, loss='mse')

# Create agent.
agent = QLearningNNReplay(model)

with tf.device('/device:GPU:0'):
  train(agent, env)
  res = evaluate(agent, env)

In [None]:
#@title Render animation
%%capture
im = plt.imshow(res[0])
def animate(frame):
  im.set_data(frame)
  return im,
anim = animation.FuncAnimation(plt.gcf(), animate, frames=res, blit=False, repeat=True)

In [None]:
anim

### Exercises (3 mins)

Experiment with Replay Buffer settings:
* Modify the sampling method (e.g. give higher priority to recent items instead of sampling uniformly).
* Change the eviction strategy.
* Change the size of the replay buffer (i.e. the maximum number of entries).
* Change the size of the samples.
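For the first exercise, one possible way to favour recent items is to weight each entry by how recently it was added. The sketch below is only a suggestion: the function name `sample_recent_biased` and the `decay` parameter are assumptions, and it expects the buffer to be a list ordered from oldest to newest.

```python
import random


def sample_recent_biased(buffer, num_samples, decay=0.9, rng=random):
    """Samples entries with weights that decay geometrically with age."""
    n = len(buffer)
    # The newest entry (index n - 1) gets weight 1, the one before it `decay`,
    # the one before that `decay ** 2`, and so on.
    weights = [decay ** (n - 1 - i) for i in range(n)]
    return rng.choices(buffer, weights=weights, k=num_samples)


toy_buffer = list(range(100))  # Integers stand in for stored transitions.
recent_batch = sample_recent_biased(
    toy_buffer, num_samples=1000, rng=random.Random(0))
print(sum(recent_batch) / len(recent_batch))  # Well above the uniform mean.
```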

### Train Q-learning agent with NNs and replay buffer (CNN version)

In this experiment we replace our simple feedforward neural network with a convolutional neural network (CNN). CNNs are usually more effective when the observations are *images*, although in the case of Catch the observations are small enough that we don't necessarily expect a CNN to have an advantage over a simple network.
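Before building the model, it can help to work out how many features the convolution produces. The sketch below assumes the Keras `Conv2D` defaults of stride 1 and `'valid'` padding (no padding), which is what the model in the next cell relies on; the helper name is hypothetical.

```python
def conv2d_valid_output_shape(height, width, kernel_size, filters, stride=1):
    """Output shape of a square-kernel convolution without padding."""
    out_height = (height - kernel_size) // stride + 1
    out_width = (width - kernel_size) // stride + 1
    return out_height, out_width, filters


shape = conv2d_valid_output_shape(height=10, width=5, kernel_size=3, filters=2)
num_features = shape[0] * shape[1] * shape[2]
print(shape)             # (8, 3, 2)
print(num_features)      # 48 values after flattening.
print(num_features + 3)  # 51 inputs to the first dense layer (plus action).
```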

In [None]:
# Create environment.
env = Catch()

# Build model for the agent.
inputs = tf.keras.Input(shape=(53,))
action = tf.keras.layers.Lambda(lambda x: x[:, 50:])(inputs)
obs = tf.keras.layers.Lambda(lambda x: tf.reshape(x[:, :50], (-1, 10, 5, 1)))(inputs)
cnn = tf.keras.layers.Conv2D(filters=2, kernel_size=3, input_shape=(10, 5, 1))(obs)
flattened_cnn = tf.keras.layers.Flatten()(cnn)
merged = tf.keras.layers.Concatenate()([flattened_cnn, action])
layer1 = tf.keras.layers.Dense(20, activation='relu', name='layer1')(merged)
layer2 = tf.keras.layers.Dense(10, activation='relu', name='layer2')(layer1)
outputs = tf.keras.layers.Dense(1, name='layer3')(layer2)
model = tf.keras.Model(inputs=inputs, outputs=outputs)

optimizer = tf.keras.optimizers.Adam(learning_rate=0.0005)
model.compile(optimizer=optimizer, loss='mse')

# Create agent.
agent = QLearningNNReplay(model)

with tf.device('/device:GPU:0'):
  train(agent, env)
  res = evaluate(agent, env)

In [None]:
#@title Render animation
%%capture
im = plt.imshow(res[0])
def animate(frame):
  im.set_data(frame)
  return im,
anim = animation.FuncAnimation(plt.gcf(), animate, frames=res, blit=False, repeat=True)

In [None]:
anim

### Exercises (3 mins)

Experiment with CNN settings:
* Change the number of filters.
* Change the kernel size.
* Change other [`Conv2D()`](https://keras.io/api/layers/convolution_layers/convolution2d/) settings.
* Use memory (e.g. [`LSTM`](https://keras.io/api/layers/recurrent_layers/lstm/)) (optional)


## Bonus: AndroidEnv demo

Reinforcement Learning algorithms such as Q-learning can be applied to other, more interesting environments, where the agent can learn to solve tasks of a very different nature.

For example, we are going to demonstrate that it is possible to run RL agents on an Android device, allowing them to learn to use a phone like a real user. Of course, mastering such complex tasks requires much more sophisticated agents than our toy example.

\\
<img src="https://drive.google.com/uc?id=1y2Tu12KbyWpMd5Iq1CUPM0hyd49toYyn" width="600">

\\
Feel free to give AndroidEnv a try yourself: https://github.com/deepmind/android_env.