# Set up




## Launch local runtime

To run this colab, you'll need to run your own Jupyter runtime in a Python environment with tf_agents installed. See instructions [here](https://research.google.com/colaboratory/local-runtimes.html).

## Imports

In [0]:
import matplotlib
import matplotlib.pyplot as plt

import functools
import tensorflow as tf

from tf_agents.agents.dqn import dqn_agent
from tf_agents.drivers import dynamic_step_driver
from tf_agents.environments import suite_gym
from tf_agents.environments import tf_py_environment
from tf_agents.metrics import metric_utils
from tf_agents.metrics import py_metrics
from tf_agents.metrics import tf_metrics
from tf_agents.metrics import tf_py_metric
from tf_agents.networks import q_network
from tf_agents.policies import py_tf_policy
from tf_agents.policies import random_tf_policy
from tf_agents.replay_buffers import batched_replay_buffer

nest = tf.contrib.framework.nest


## Params

In [0]:
env_name = 'CartPole-v0'
num_iterations = 20000  # @param

# Params for collect
initial_collect_steps = 1000  # @param
collect_steps_per_iteration = 1 # @param
epsilon_greedy = 0.1 # @param
replay_buffer_capacity = 100000 # @param

# Params for target update
target_update_tau = 0.05 # @param
target_update_period = 5 # @param

# Params for train
batch_size = 64 # @param
learning_rate = 1e-3 # @param
gamma = 0.99 # @param

# Params for eval
num_eval_episodes = 100 # @param
eval_interval = 1000 # @param


# Introduction

This example shows how to train a DQN agent in Graph mode, where we first create a TensorFlow graph to hold all our operations and later execute them in a TF session. This is the most computationally efficient way of using TensorFlow, but can be a little hard to debug. Other ways of using TF-Agents, and TensorFlow in general, are eager mode (see dqn_eager_tutorial) and out of graph mode, where only the necessary components such as networks are in TensorFlow (see dqn_oog_tutorial). To get a general understanding of DQN, check out the [DQN paper](https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf) or our Introduction to DQN colab.

We will now walk through all the components of an RL pipeline for training and evaluating a DQN agent on the CartPole environment:

- Environment
- Agent
- Network
- Replay Buffers
- Policies
- Data Collection
- Training
- Evaluation
- Visualization


# Creating the Graph

In graph mode, we first construct a graph to hold all the operations and later execute them in a session. This is in contrast to eager mode, where the operations are executed immediately when they are defined.

## Environment

Environments in RL represent the task or problem that we are trying to solve. In TF-Agents, environments can be written in TensorFlow or pure Python. The most important method of an environment is `time_step = environment.step(action)`, which performs one simulation step of the environment given an action and returns a `TimeStep` named tuple containing the next observation, reward, etc. Other important methods are `time_step_spec()` and `action_spec()`, which return the specifications (types, shapes, bounds) of the `time_step` and `action` respectively.
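
To make the `step()` contract concrete, here is a minimal pure-Python sketch of an environment loop. It is not the TF-Agents implementation: `CountdownEnv`, its reward rule, and the local `TimeStep` tuple are hypothetical stand-ins that only mimic the interface described above.

```python
import collections

# Hypothetical stand-in for the TimeStep named tuple described above.
TimeStep = collections.namedtuple(
    'TimeStep', ['step_type', 'reward', 'discount', 'observation'])

FIRST, MID, LAST = 0, 1, 2  # Step types: start, middle, end of an episode.


class CountdownEnv(object):
  """Toy environment (an assumption): the episode ends after `horizon` steps."""

  def __init__(self, horizon=3):
    self._horizon = horizon
    self._t = 0

  def reset(self):
    self._t = 0
    return TimeStep(FIRST, 0.0, 1.0, self._t)

  def step(self, action):
    # Reward is 1 for action 1 and 0 otherwise; the observation is the
    # number of steps taken so far.
    self._t += 1
    reward = 1.0 if action == 1 else 0.0
    if self._t >= self._horizon:
      return TimeStep(LAST, reward, 0.0, self._t)
    return TimeStep(MID, reward, 1.0, self._t)


env = CountdownEnv(horizon=3)
time_step = env.reset()
total_reward = 0.0
while time_step.step_type != LAST:
  time_step = env.step(1)  # Always take action 1.
  total_reward += time_step.reward
```

The loop keeps calling `step()` until the environment reports the last step of the episode, which is the same pattern the drivers and metrics later in this colab rely on.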

Standard environments can be easily created in TF-Agents using `suites`. We have different suites for loading environments from OpenAI Gym, Atari, DM Control, etc., given a string environment name. See the [environments tutorial](environments_tutorial.ipynb) for a detailed tutorial on environments in TF-Agents.

In this example, we will create two environments: one for training and one for evaluation. The evaluation environment will be in Python. For efficiency, we will keep the training environment in the TensorFlow graph because the networks are in TensorFlow. This is done automatically using a wrapper called `TFPyEnvironment` (under the hood, this converts the Python functions into TensorFlow ops).

TODO: Consider evaling in TF as well, now that we have better metrics.

In [0]:
eval_py_env = suite_gym.load(env_name)
train_tf_env = tf_py_environment.TFPyEnvironment(suite_gym.load(env_name))

## Network

The DQN agent needs a network that can compute the Q values for each action. In this example we will use a set of fully connected layers.


 #TODO(oars): Update once we pass in a network instance to the agent.

## Agent

The main parameters required to create the DQN agent are:

*   `time_step_spec`: Specs describing time steps produced by the environment.
*   `action_spec`: Specs describing the actions expected by the environment.
*   `q_network`: A network instance (here a `q_network.QNetwork`) that predicts Q values for each action given an observation.
*   `optimizer`: The optimizer for training the Q network.
*   `epsilon_greedy`: Epsilon (probability of choosing a random action) of the Epsilon-Greedy policy used for data collection.
*   `target_update_tau`: A factor for soft variable update of the target network. If set to 1, the weights are exactly copied from the Q network.
*   `target_update_period`: The number of train steps between updates of the target network.
*   `gamma`: A factor for discounting future rewards relative to immediate rewards.
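
The interaction between `target_update_tau` and `target_update_period` can be sketched in plain Python. This is an illustration only: it assumes the soft update has the usual form `tau * source + (1 - tau) * target`, which is consistent with the description above (a value of 1 copies the weights), and `soft_update` is a hypothetical helper rather than a TF-Agents function.

```python
def soft_update(source, target, tau):
  """Returns tau * source + (1 - tau) * target, element by element."""
  return [tau * s + (1.0 - tau) * t for s, t in zip(source, target)]


q_weights = [1.0, 2.0]       # Toy weights of the Q network.
target_weights = [0.0, 0.0]  # Toy weights of the target network.
target_update_tau = 0.05
target_update_period = 5

for train_step in range(1, 11):
  # Only refresh the target network every `target_update_period` train steps.
  if train_step % target_update_period == 0:
    target_weights = soft_update(q_weights, target_weights, target_update_tau)
```

After ten train steps the target weights have been updated twice and have moved only a small fraction of the way towards the Q network weights, which is what keeps the training targets stable.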



In [0]:
tf_agent = dqn_agent.DqnAgent(
    train_tf_env.time_step_spec(),
    train_tf_env.action_spec(),
    q_network.QNetwork(
        train_tf_env.time_step_spec().observation,
        train_tf_env.action_spec()),
    optimizer=tf.train.AdamOptimizer(learning_rate=learning_rate),
    epsilon_greedy=epsilon_greedy,
    target_update_tau=target_update_tau,
    target_update_period=target_update_period,
    gamma=gamma)

## Replay Buffer

TODO: Rewrite this section after we move the replay buffer out of the agent.

TODO: link to replay buffer tutorial once that is done.

To keep track of the collected data, we will use the `BatchedReplayBuffer` provided in TF-Agents.

This replay buffer is constructed by giving it a nest of specs describing the tensors that are to be stored. For example, in DQN the agent stores trajectories. We can extract the spec through `tf_agent.collect_data_spec()`.

To sample data from the replay buffer, we will create a `tf.data` pipeline which we can then feed to the agent's train method.

In [0]:
replay_buffer = batched_replay_buffer.BatchedReplayBuffer(
    tf_agent.collect_data_spec(),
    batch_size=1,
    max_length=replay_buffer_capacity)

dataset = replay_buffer.as_dataset(
    num_parallel_calls=3,
    sample_batch_size=batch_size,
    num_steps=2,
    time_stacked=True).prefetch(3)

iterator = dataset.make_initializable_iterator()
# The iterator's get_next() op is created in the Training section below.
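
As a rough mental model of what `max_length`, `sample_batch_size` and `num_steps=2` mean, here is a small pure-Python sketch. `TinyReplayBuffer` is a hypothetical class, not the TF-Agents replay buffer, and for simplicity it ignores episode boundaries when forming windows.

```python
import collections
import random


class TinyReplayBuffer(object):
  """Fixed-capacity FIFO buffer with uniform sampling of adjacent steps."""

  def __init__(self, max_length):
    self._items = collections.deque(maxlen=max_length)

  def add(self, item):
    # Once the buffer is full, the deque drops the oldest item.
    self._items.append(item)

  def sample(self, sample_batch_size, num_steps, rng):
    # Each sample is a window of `num_steps` consecutive items.
    last_start = len(self._items) - num_steps
    starts = [rng.randint(0, last_start) for _ in range(sample_batch_size)]
    return [[self._items[s + k] for k in range(num_steps)] for s in starts]


toy_buffer = TinyReplayBuffer(max_length=5)
for step in range(8):
  toy_buffer.add(step)  # Steps 0..2 are evicted; steps 3..7 remain.

batch = toy_buffer.sample(
    sample_batch_size=4, num_steps=2, rng=random.Random(0))
```

Each element of `batch` is a pair of adjacent steps, which is the shape of data a one-step Q-learning update needs: a transition from one time step to the next.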

## Policies

In TF-Agents, policies represent the standard notion of policies in RL: given a `time_step`, produce an action or a distribution over actions. The main method is `policy_step = policy.step(time_step)`, where `policy_step` is a named tuple `PolicyStep(action, state, info)`. `policy_step.action` is the `action` to be applied to the environment, `state` represents the state for stateful (RNN) policies, and `info` may contain auxiliary information such as log probabilities of the actions. For a more detailed description of policies, please see XXX.

Agents expose two policies: the main policy that is used for evaluation/deployment (`agent.policy()`) and another (exploratory) policy that is used for data collection (`agent.collect_policy()`). Since the evaluation environment is in Python, we wrap the eval policy in Python using the `PyTFPolicy` wrapper. TODO: Consider switching eval to TF.

Initially, it is also advantageous to collect some data using a purely random policy to get good coverage of the state/action space, so we will create that policy as well.

In [0]:
eval_policy = py_tf_policy.PyTFPolicy(tf_agent.policy())
collect_policy = tf_agent.collect_policy()
initial_collect_policy = random_tf_policy.RandomTFPolicy(
    train_tf_env.time_step_spec(), train_tf_env.action_spec())
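
The difference between the greedy evaluation policy and the exploratory collect policy can be sketched in a few lines of plain Python. The helper names below are hypothetical and this is not how TF-Agents implements its policies; it only illustrates the epsilon-greedy rule with `epsilon = 0.1`.

```python
import random


def greedy_action(q_values):
  """Index of the largest Q value (the first one on ties)."""
  return max(range(len(q_values)), key=lambda a: q_values[a])


def epsilon_greedy_action(q_values, epsilon, rng):
  """Random action with probability epsilon, greedy action otherwise."""
  if rng.random() < epsilon:
    return rng.randrange(len(q_values))
  return greedy_action(q_values)


rng = random.Random(0)
q_values = [0.2, 0.7]  # Toy Q values for the two CartPole actions.
actions = [epsilon_greedy_action(q_values, 0.1, rng) for _ in range(1000)]
greedy_fraction = actions.count(1) / float(len(actions))
```

With two actions and `epsilon = 0.1`, the greedy action is chosen roughly 95% of the time: 90% from the greedy branch, plus half of the random draws.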

## Data Collection

Data collection is done using drivers, which are simply loops that run a policy in an environment and broadcast each time_step and action to a list of observers. We have drivers that run for a specific number of steps or episodes. (see drivers_tutorial XXX).

Observers are defined as a callable that takes `Trajectory` data and uses it. In this case we will define an observer to add data to the replay buffer. In general, observers can also include other things such as metrics.

In [0]:
observers = [replay_buffer.add_batch]

initial_collect_op = dynamic_step_driver.DynamicStepDriver(
    train_tf_env,
    initial_collect_policy,
    observers=observers,
    num_steps=initial_collect_steps).run()

collect_op = dynamic_step_driver.DynamicStepDriver(
    train_tf_env,
    collect_policy,
    observers=observers,
    num_steps=collect_steps_per_iteration).run()  
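
Conceptually, a step driver is a loop like the one below. This is a pure-Python sketch under simplifying assumptions: `run_steps`, `Transition` and `counter_env_step` are hypothetical names, and the real `DynamicStepDriver` builds TensorFlow ops rather than running a Python loop.

```python
import collections

# Hypothetical, simplified record of one environment transition.
Transition = collections.namedtuple(
    'Transition', ['observation', 'action', 'reward'])


def run_steps(env_step, policy, observers, num_steps, observation=0):
  """Runs `policy` for `num_steps` steps, notifying every observer."""
  for _ in range(num_steps):
    action = policy(observation)
    next_observation, reward = env_step(observation, action)
    transition = Transition(observation, action, reward)
    for observer in observers:
      observer(transition)
    observation = next_observation
  return observation


def counter_env_step(observation, action):
  """Toy dynamics: the observation is a counter; every step pays reward 1."""
  return observation + 1, 1.0


stored = []      # Plays the role of the replay buffer.
reward_log = []  # A second observer, for example feeding a metric.
final_observation = run_steps(
    counter_env_step,
    policy=lambda observation: observation % 2,
    observers=[stored.append, lambda t: reward_log.append(t.reward)],
    num_steps=4)
```

Because observers are just callables, the same loop can feed a replay buffer and any number of metrics without changes to the driver itself.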

## Training

Now we create a global step variable to keep track of how many times the network is updated and create a train op which does one step of training/network update. 


In [0]:
global_step = tf.train.get_or_create_global_step()
experience, unused_ids, unused_probs = iterator.get_next()
train_op = tf_agent.train(experience, train_step_counter=global_step)

The `train()` function looks roughly as follows:

```python
batch, _, _ = iterator.get_next()
time_steps, actions, next_time_steps = batch
loss = self.loss(time_steps, actions, next_time_steps)
train_step = create_train_step(loss, optimizer, global_step=train_step_counter)

# Do we need these two dependencies? Couldn't we switch them around?
# Create the target update op so that it runs only after the train step.
with tf.control_dependencies([train_step]):
  target_update_op = self.update_targets(
      self._target_update_tau, self._target_update_period)

# Make the returned op depend on the target update so that both are executed.
with tf.control_dependencies([target_update_op]):
  train_step = tf.identity(train_step)
```
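
To give a feel for what the loss computes, here is a pure-Python sketch of a one-step Q-learning target and a squared TD error. It is an illustration, not the agent's code: the helper names are hypothetical, and the use of a plain squared error is an assumption (the agent's actual loss function may differ).

```python
def td_targets(rewards, discounts, next_q_values, gamma):
  """One-step targets: r + gamma * discount * max_a' Q_target(s', a')."""
  return [r + gamma * d * max(q_next)
          for r, d, q_next in zip(rewards, discounts, next_q_values)]


def squared_td_loss(q_values, actions, targets):
  """Mean squared error between Q(s, a) for the taken action and the target."""
  errors = [q[a] - y for q, a, y in zip(q_values, actions, targets)]
  return sum(e * e for e in errors) / len(errors)


gamma = 0.99
q_values = [[1.0, 2.0], [0.5, 0.0]]       # Q(s, .) from the Q network.
actions = [1, 0]                          # Actions actually taken.
rewards = [1.0, 1.0]
discounts = [1.0, 0.0]                    # 0.0 marks a terminal transition.
next_q_values = [[3.0, 1.0], [9.0, 9.0]]  # Q_target(s', .).

targets = td_targets(rewards, discounts, next_q_values, gamma)
loss = squared_td_loss(q_values, actions, targets)
```

Note how the zero discount on the second transition removes the bootstrap term, so the target for a terminal step is just the reward.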




## Initialization

We create the initialization ops last, because `tf.local_variables_initializer()` and `tf.global_variables_initializer()` create initializers only for the variables that already exist in the graph when they are called. Additionally, `tf_agent.initialize()` creates ops for initializing the agent. In DQN, this is just an op that copies the weights from the Q network to the target Q network.

In [0]:
init_local_variables_op = tf.local_variables_initializer()
init_global_variables_op = tf.global_variables_initializer()
init_agent_op = tf_agent.initialize()

# Executing the graph

To execute the graph, we have to create a session:

In [0]:
sess = tf.InteractiveSession()

## Initialization

First we execute the initialization ops. In addition to the variable and agent initialization ops, we also run the `initial_collect_op` to fill the replay buffer with some initial data, and then initialize the dataset iterator.

In [0]:
#@test {"skip": true}
sess.run(init_global_variables_op)
sess.run(init_local_variables_op)
sess.run(init_agent_op)
sess.run(initial_collect_op)
sess.run(iterator.initializer)

Before we start training, let us evaluate the current policy held by the agent. In RL, the most common metric is the `AverageReturnMetric`. The Return is the sum of rewards in an episode, and average return refers to averaging this across multiple episodes.

The `metric_utils.compute()` method can be used to compute a list of metrics given a python environment and a python policy. The `num_episodes` argument can be used to set how many episodes we compute the average over.



In [0]:
#@test {"skip": true}
average_return_metric = py_metrics.AverageReturnMetric()
returns = []

# Compute evaluation metrics.
metric_utils.compute(
    [average_return_metric],
    eval_py_env,
    eval_policy,
    num_episodes=num_eval_episodes,
)
returns.append(average_return_metric.result())
print('Step = {0}: Return = {1}'.format(0, returns[-1]))
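
The quantity being reported can be written down directly. The sketch below uses a hypothetical `average_return` helper and made-up episode data; it only illustrates the definition (sum of rewards per episode, averaged over episodes), not how `AverageReturnMetric` is implemented.

```python
def average_return(episode_rewards):
  """Mean over episodes of the undiscounted sum of rewards per episode."""
  episode_returns = [sum(rewards) for rewards in episode_rewards]
  return sum(episode_returns) / float(len(episode_returns))


# Three hypothetical CartPole-like episodes lasting 2, 4 and 3 steps,
# each step paying a reward of 1.
episodes = [[1.0, 1.0], [1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
print(average_return(episodes))  # → 3.0
```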

## Main Train/Collect/Eval Loop

The main loop involves executing the collect and train ops, evaluating the policy at regular intervals, logging, etc. Note in particular that the train and collect ops can easily be run in parallel using a single `session.run()` call.

In [0]:
#@test {"skip": true}
# sess.make_callable is just a way to make sess.run faster when calling it repeatedly. 
train_step_call = sess.make_callable([train_op, global_step, collect_op])

for _ in range(num_iterations):
  # Train/collect/eval.
  total_loss, global_step_val, _ = train_step_call()

  if global_step_val % eval_interval == 0:
    metric_utils.compute(
        [average_return_metric],
        eval_py_env,
        eval_policy,
        num_episodes=num_eval_episodes,
    )
    returns.append(average_return_metric.result())
    print('Step = {0}: Return = {1}'.format(global_step_val, returns[-1]))

Finally, close the session.

In [0]:
sess.close()

## Visualization

### Plots

We can plot the average return against global steps to see the performance of our agent. In `CartPole-v0`, the environment gives a reward of +1 for every time step the pole stays up, and since the maximum number of steps is 200, the maximum possible return is also 200.

In [0]:
#@test {"skip": true}
plt.plot(range(0, num_iterations + 1, eval_interval), returns)
plt.ylabel('Average Return')
plt.xlabel('Global Step')

### Videos

TODO: Use moviepy once pillow has been imported: b/63250444