##### Copyright 2018 The TF-Agents Authors.

### Get Started
<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/tensorflow/agents/blob/master/tf_agents/colabs/metrics_tutorial.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/tensorflow/agents/blob/master/tf_agents/colabs/metrics_tutorial.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
  </td>
</table>


In [0]:
# Note: If you haven't installed tf-agents yet, run:
!pip install tf-agents

### Imports

In [0]:
import numpy as np
import tensorflow as tf
nest = tf.contrib.framework.nest

from tf_agents.environments import time_step as ts
from tf_agents.environments import trajectory

from tf_agents.metrics import py_metric
from tf_agents.metrics import py_metrics
from tf_agents.metrics import tf_metric
from tf_agents.metrics import tf_metrics
from tf_agents.metrics import batched_py_metric
from tf_agents.metrics import tf_py_metric

# Clear any leftover state from previous colab runs.
# (This is not necessary for normal programs.)
tf.reset_default_graph()

# Introduction

A metric is a scalar value that is updated continuously during the training or evaluation of a reinforcement learning agent in an environment. Often, a metric measures the performance of the agent (e.g., the average return over the last 10 episodes). Other times, a metric tracks auxiliary information from the environment or training loop (e.g., the number of episodes the agent has seen so far).

`StepMetric`s are calculated based on the [trajectory](https://github.com/tensorflow/agents/tree/master/tf_agents/environments/trajectory.py), which includes information such as the observation, action, reward, discount, etc. For example, one of the most common metrics used in RL is the return, i.e. the sum of the rewards in an episode.

Metrics can be implemented either in Python (see [PyMetric](https://github.com/tensorflow/agents/tree/master/tf_agents/metrics/py_metric.py)) or TensorFlow (see [TFStepMetric](https://github.com/tensorflow/agents/tree/master/tf_agents/metrics/tf_metric.py)). Usually, the easiest way to implement a metric is to write it in Python for a single environment. This metric can then be wrapped in a `BatchedPyMetric` to make it work automatically in batched/parallel environments. Both regular and batched Python metrics can be used in TensorFlow by wrapping them in a `TFPyMetric`.

# Python Metrics

Python metrics are commonly used during evaluation and out-of-graph data collection. The interface for Python metrics looks like this:

---
```python
class PyMetric(object):

  @property
  def name(self):
    """Name of the metric."""
    return self._name

  def __call__(self, *args):
    """Processes the input to update the metric."""
    self.call(*args)

  @abc.abstractmethod
  def reset(self):
    """Resets internal variables used to compute the metric."""

  @abc.abstractmethod
  def result(self):
    """Returns the current value of the metric."""

  @staticmethod
  def aggregate(metrics):
    """Aggregates a list of metrics of this class."""
    return np.mean([metric.result() for metric in metrics])


class PyStepMetric(PyMetric):

  @abc.abstractmethod
  def call(self, trajectory):
    """Processes the trajectory to update the metric."""
```
---
To create a new metric, child classes override the `call()` method to specify how the metric is updated at every call, and the `result()` method to specify how results are finalized and returned. (For a discussion of `aggregate()`, please see [BatchedPyMetric](#scrollTo=P-6Y2QRFqxyP).)

We will look at a couple of examples below. For more, see `tf_agents/metrics/py_metrics.py`.


## Example 1: AverageReturnMetric

Average Return is the most common metric used in reinforcement learning. A return is defined as the sum of rewards received by an agent in an episode, and average return refers to averaging this return across multiple episodes.
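
As a small worked illustration of this definition, the sketch below computes the return of two made-up episodes and then averages them. The reward values are purely illustrative and no tf-agents API is involved.

```python
import numpy as np

# Per-step rewards for two illustrative episodes.
episode_rewards = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
]

# Return of an episode: the undiscounted sum of its rewards.
episode_returns = [float(np.sum(rewards)) for rewards in episode_rewards]

# Average return: the mean of the per-episode returns.
average_return = float(np.mean(episode_returns))

print(episode_returns)  # [6.0, 15.0]
print(average_return)   # 10.5
```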

### Streaming Metric

`AverageReturnMetric` is implemented by subclassing `StreamingMetric`, which has a deque buffer to keep track of the last (up to) K values of the metric. Calling `result()` on a streaming metric returns the average of the values in the buffer.


---
```python
class StreamingMetric(py_metric.PyStepMetric):

  def reset(self):
    self._buffer.clear()
    self._reset()

  def add_to_buffer(self, value):
    """Appends a new value to the buffer."""
    self._buffer.append(value)

  def result(self):
    """Returns the value of this metric."""
    return np.mean(self._buffer) if self._buffer else 0.0
```
---

Child classes of `StreamingMetric` must override the `call(trajectory)` method. 
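
The "last (up to) K values" behaviour can be illustrated with a bounded deque. This is a minimal sketch that assumes the buffer behaves like `collections.deque(maxlen=K)`; `BUFFER_SIZE` and `buffered_mean` are names chosen for illustration, not taken from the library.

```python
import collections
import numpy as np

BUFFER_SIZE = 3  # K: illustrative window size
buffer = collections.deque(maxlen=BUFFER_SIZE)


def buffered_mean(buf):
  """Mean of the buffered values, or 0.0 when the buffer is empty."""
  return float(np.mean(buf)) if buf else 0.0


empty_result = buffered_mean(buffer)

for value in [10.0, 20.0, 30.0, 40.0]:
  buffer.append(value)  # once the deque is full, the oldest value is dropped

windowed_result = buffered_mean(buffer)  # mean of [20.0, 30.0, 40.0]
print(empty_result, windowed_result)  # 0.0 30.0
```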

### Average Return Metric

The `AverageReturnMetric` keeps track of the sum of rewards in the current episode in a variable called `self._episode_return`. This is updated in every `call(trajectory)` and added to the buffer at the end of the episode. Trajectories at the boundary between two episodes have an invalid reward, so they are ignored.

---
```python
class AverageReturnMetric(StreamingMetric):
  """Computes the average undiscounted reward."""

  def _reset(self):
    """Resets stat gathering variables."""
    self._episode_return = 0

  def call(self, trajectory):
    """Processes the trajectory to update the metric."""
    if not trajectory.is_boundary():
      self._episode_return += trajectory.reward
    if trajectory.is_last():
      self.add_to_buffer(self._episode_return)
      self._episode_return = 0      
```
---

The `AverageReturnMetric.result()` method returns the average of the returns saved in the buffer (implemented in the base class `StreamingMetric`).



The AverageReturnMetric can be used as:

In [0]:
metric = py_metrics.AverageReturnMetric()

# TODO(kbanoop): Make this more readable using kwargs
metric(trajectory.boundary((), (), (), 0., 1.))
metric(trajectory.first((), (), (), 1., 1.))
metric(trajectory.mid((), (), (), 2., 1.))
metric(trajectory.last((), (), (), 3., 0.))
metric(trajectory.boundary((), (), (), 0., 1.))
metric(trajectory.first((), (), (), 4., 1.))
metric(trajectory.mid((), (), (), 5., 1.))
metric(trajectory.last((), (), (), 6., 0.))

print(metric.result())
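
To trace the boundary handling without tf-agents installed, here is a minimal plain-Python sketch of the same logic. `Step` and `SketchAverageReturn` are hypothetical stand-ins for the trajectory and the metric, not library classes. The boundary rewards are deliberately set to a large value to show that they do not affect the result.

```python
import collections
import numpy as np

# Hypothetical stand-in for a trajectory: only the fields this sketch needs.
Step = collections.namedtuple("Step", ["kind", "reward"])


class SketchAverageReturn(object):
  """Plain-Python stand-in mirroring the AverageReturnMetric logic."""

  def __init__(self, buffer_size=10):
    self._buffer = collections.deque(maxlen=buffer_size)
    self._episode_return = 0.0

  def __call__(self, step):
    if step.kind != "boundary":  # boundary rewards are invalid, so skip them
      self._episode_return += step.reward
    if step.kind == "last":  # episode finished: store the return and restart
      self._buffer.append(self._episode_return)
      self._episode_return = 0.0

  def result(self):
    return float(np.mean(self._buffer)) if self._buffer else 0.0


sketch_metric = SketchAverageReturn()
for step in [
    Step("boundary", 100.0), Step("first", 1.0), Step("mid", 2.0), Step("last", 3.0),
    Step("boundary", 100.0), Step("first", 4.0), Step("mid", 5.0), Step("last", 6.0),
]:
  sketch_metric(step)

sketch_result = sketch_metric.result()
print(sketch_result)  # 10.5, the mean of the episode returns 6.0 and 15.0
```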

## Example 2: BatchedPyMetric

Certain environments, such as `BatchedPyEnvironment` and `ParallelPyEnvironment`, manage multiple independent copies of an environment. Their observations, actions, rewards, etc., and therefore their trajectories, are batched, with one entry per sub-environment.

An easy way to make regular Python metrics work with such batches of `Trajectories` is to wrap them in a `BatchedPyMetric`. Internally, `BatchedPyMetric` creates a metric for each environment. Every time `BatchedPyMetric` is called with a batch of `Trajectories`, it unbatches them into a list of individual `Trajectory` objects and calls a different metric with each `Trajectory`.

The `BatchedPyMetric` is implemented roughly as follows (for more details, see `tf_agents/metrics/batched_py_metric.py`):

---
```python
class BatchedPyMetric(py_metric.PyStepMetric):

  def call(self, batched_trajectory):
    trajectories = unstack(batched_trajectory)
    for metric, trajectory in zip(self._metrics, trajectories):
      metric(trajectory)
 
  def result(self):
    return self._metric_class.aggregate(self._metrics)
```
---

Note that the different metrics for the items in the batch are combined using the `aggregate()` method. `aggregate()`  is defined in the base class `PyMetric` and specifies how metrics of a certain class are aggregated. The default behaviour is to average them.

Also note that once a specific batch size has been used, all further calls to `BatchedPyMetric` must be done with `Trajectories` batched with the same size.
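
The unbatch-then-aggregate flow described above can be sketched in plain Python. `SketchReturn` and `SketchBatched` are hypothetical stand-ins that take batched rewards and end-of-episode flags directly instead of `Trajectories`. Raising a `ValueError` on a batch-size change is an assumption of this sketch, not necessarily what the library does.

```python
import numpy as np


class SketchReturn(object):
  """Hypothetical per-environment metric: mean of completed episode returns."""

  def __init__(self):
    self._returns = []
    self._episode_return = 0.0

  def __call__(self, reward, is_last):
    self._episode_return += reward
    if is_last:
      self._returns.append(self._episode_return)
      self._episode_return = 0.0

  def result(self):
    return float(np.mean(self._returns)) if self._returns else 0.0

  @staticmethod
  def aggregate(metrics):
    # Default aggregation: the mean of the per-environment results.
    return float(np.mean([m.result() for m in metrics]))


class SketchBatched(object):
  """Keeps one metric per batch entry and fans each batched call out to them."""

  def __init__(self, metric_class):
    self._metric_class = metric_class
    self._metrics = None

  def __call__(self, batched_reward, batched_is_last):
    if self._metrics is None:
      # The batch size is fixed by the first call.
      self._metrics = [self._metric_class() for _ in batched_reward]
    if len(batched_reward) != len(self._metrics):
      raise ValueError("Batch size changed between calls.")
    for m, reward, is_last in zip(self._metrics, batched_reward, batched_is_last):
      m(reward, is_last)

  def result(self):
    return self._metric_class.aggregate(self._metrics)


batched = SketchBatched(SketchReturn)
batched(np.array([1.0, 4.0]), np.array([False, False]))
batched(np.array([2.0, 5.0]), np.array([False, False]))
batched(np.array([3.0, 6.0]), np.array([True, True]))

batched_result = batched.result()
print(batched_result)  # 10.5, the mean of the per-environment returns 6.0 and 15.0
```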

The AverageReturnMetric can be called with batched time steps in the following way:

In [0]:
# TODO(b/112359343): Update from time_steps to trajectories 


# STEP_TYPE = np.ones([2], dtype=np.int32)
# DISCOUNT = np.ones([2], dtype=np.float32)
# OBSERVATION = np.ones([2, 2])

# ts0 = ts.TimeStep(step_type=0 * STEP_TYPE, discount=DISCOUNT, observation=OBSERVATION, reward=np.array([0., 0.]))
# ts1 = ts.TimeStep(step_type=1 * STEP_TYPE, discount=DISCOUNT, observation=OBSERVATION, reward=np.array([1., 4.]))
# ts2 = ts.TimeStep(step_type=1 * STEP_TYPE, discount=DISCOUNT, observation=OBSERVATION, reward=np.array([2., 5.]))
# ts3 = ts.TimeStep(step_type=2 * STEP_TYPE, discount=DISCOUNT, observation=OBSERVATION, reward=np.array([3., 6.]))

# batched_metric = batched_py_metric.BatchedPyMetric(
#         py_metrics.AverageReturnMetric)

# batched_metric(ts0)
# batched_metric(ts1)
# batched_metric(ts2)
# batched_metric(ts3)

# print(batched_metric.result())


# TensorFlow Metrics

TensorFlow metrics are usually used during data collection, e.g., to measure the average length of episodes, the average return of the current policy, etc.

TF metrics are derived from TF Eager metrics and roughly follow the same interface as Python metrics. The main methods are:

TODO(kbanoop): Should we make this interface and py_metric.Base closer? Main differences are no `reset()` and returning the arguments from `__call__()`.

---
```python
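class TFStepMetric(eager_metrics.Metric):

  def __call__(self, *args, **kwargs):
    """Updates the metric, building its variables on the first call."""
    if not self._built:
      self.build(*args, **kwargs)
      self._built = True
    return self.call(*args, **kwargs)

  @abc.abstractmethod
  def call(self, *args, **kwargs):
    """Processes the input to update the metric and returns the input."""

  @abc.abstractmethod
  def build(self, *args, **kwargs):
    """Creates the variables that hold the state of the metric."""

  @abc.abstractmethod
  def result(self):
    """Returns the current value of the metric."""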

```
---


There are a few key differences to keep in mind. Since all TensorFlow environments are batched by default, the `call()` method has to handle batches of time steps and actions. Also `call` has to return its input trajectory, so that metrics can be chained together, i.e.:

```python
for metric in metrics:
  trajectory = metric(trajectory)
```
  
Most metrics have internal TensorFlow variables to keep track of the state of the metric. These are built lazily the first time the `call` method is called. The `build` method has to be overridden to construct these variables. Now let us look at a couple of examples (for more, see `tf_agents/metrics/tf_metrics.py`).
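
The lazy-build and pass-through conventions can be sketched in plain Python, without TensorFlow. The class names below are hypothetical, and ordinary attributes stand in for TensorFlow variables. The sketch only illustrates the calling pattern.

```python
class SketchMetric(object):
  """Hypothetical base class mimicking the lazy-build __call__ pattern."""

  def __init__(self):
    self._built = False

  def __call__(self, step):
    if not self._built:
      self.build(step)  # state is created on the first call only
      self._built = True
    return self.call(step)


class SketchStepCount(SketchMetric):

  def build(self, step):
    self.count = 0  # stands in for a TensorFlow variable

  def call(self, step):
    self.count += 1
    return step  # pass the input through so metrics can be chained


class SketchRewardSum(SketchMetric):

  def build(self, step):
    self.total = 0.0

  def call(self, step):
    self.total += step["reward"]
    return step


sketch_metrics = [SketchStepCount(), SketchRewardSum()]
for step in [{"reward": 1.0}, {"reward": 2.5}]:
  for sketch_metric in sketch_metrics:
    step = sketch_metric(step)

print(sketch_metrics[0].count, sketch_metrics[1].total)  # 2 3.5
```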

## Example 1: NumberOfEpisodes Metric

The NumberOfEpisodes metric is used to count the number of episodes, e.g. while collecting data in TensorFlow. The implementation roughly looks like this: 

---
```python
class NumberOfEpisodes(tf_metric.TFStepMetric):
  """Counts the number of episodes in the environment."""

  def build(self, *args, **kwargs):
    self.number_episodes = self.add_variable(shape=(), initializer=tf.zeros_initializer())

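  def call(self, trajectory):
    # is_last() is boolean, so cast it before summing over the batch.
    is_last = tf.cast(trajectory.is_last(), self.number_episodes.dtype)
    num_episodes = tf.reduce_sum(is_last)
    self.number_episodes.assign_add(num_episodes)
    # Open question: is a control dependency on the assign_add needed here?
    return trajectory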

  def result(self):
    return tf.identity(self.number_episodes)
```
---  
The `call` method receives a batch of `Trajectories`. `trajectory.is_last()` indicates, for each entry in the batch, whether it is the last step of an episode, so summing `trajectory.is_last()` over the batch gives the number of episodes that ended in this call. Accumulating these sums across calls yields the number of episodes seen so far.
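
The counting logic can be illustrated with NumPy alone. This sketch assumes step types are encoded as 0 (first), 1 (mid) and 2 (last). The array of step types is made up for illustration.

```python
import numpy as np

LAST = 2  # assumed step-type value marking the final step of an episode

# One row per call, one column per environment (batch size 2).
batched_step_types = np.array([
    [0, 0],
    [1, 1],
    [2, 1],
    [0, 2],
    [2, 0],
])

number_episodes = 0
for step_types in batched_step_types:
  is_last = step_types == LAST  # boolean vector, one entry per environment
  number_episodes += int(np.sum(is_last))  # episodes that ended in this call

print(number_episodes)  # 3
```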

In [0]:
# TODO(b/112359343): Update from time_steps to trajectories 


# with tf.Graph().as_default():
#   ts0 = ts.restart(observation=tf.zeros(2, 2), batch_size=2)
#   ts1 = ts.transition(observation=tf.zeros(2, 2), reward=tf.constant([1., 2.]))
#   ts2 = ts.termination(observation=tf.zeros(2, 2), reward=tf.constant([3., 4.]))
#   ts3 = ts.restart(observation=tf.zeros(2, 2), batch_size=2)
#   ts4 = ts.transition(observation=tf.zeros(2, 2), reward=tf.constant([5., 6.]))
#   ts5 = ts.termination(observation=tf.zeros(2, 2), reward=tf.constant([7., 8.]))
  
#   time_steps = [ts0, ts1, ts2, ts3, ts4, ts5]
#   num_episodes = tf_metrics.NumberOfEpisodes()

#   deps = []
#   for i in range(len(time_steps)):
#     with tf.control_dependencies(deps):
#       time_step_action = num_episodes(time_steps[i])
#       deps = nest.flatten(time_step_action)
#   with tf.control_dependencies(deps):
#     result = num_episodes.result()

#   with tf.Session() as sess:
#     sess.run(num_episodes.init_variables())
#     print(sess.run(result))

## Example 2: TFPyMetric

A `TFPyMetric` can be used to wrap any Python metric in TensorFlow. Metrics are usually easier to implement in Python than in TensorFlow; however, we might still want to use these Python metrics in a TensorFlow graph setting. The `TFPyMetric` allows us to do this very easily. Internally, `TFPyMetric` wraps the methods of `PyMetric` using `py_func`s. `TFPyMetric` can wrap both regular `PyMetrics` and `BatchedPyMetrics`. `TFPyMetric` is thread-safe.

Let us look at an example. We will first wrap the `AverageReturnMetric` in a `BatchedPyMetric` so that it works with batches of Python `Trajectories`. Then we will wrap the `BatchedPyMetric` in a `TFPyMetric` so that it can work with batches of TensorFlow `Trajectories`.


In [0]:
# TODO(b/112359343): Update from time_steps to trajectories 

# with tf.Graph().as_default():
#   ts0 = ts.restart(observation=tf.zeros(2, 2), batch_size=2)
#   ts1 = ts.transition(observation=tf.zeros(2, 2), reward=tf.constant([1., 2.]))
#   ts2 = ts.termination(observation=tf.zeros(2, 2), reward=tf.constant([3., 4.]))
#   ts3 = ts.restart(observation=tf.zeros(2, 2), batch_size=2)
#   ts4 = ts.transition(observation=tf.zeros(2, 2), reward=tf.constant([5., 6.]))
#   ts5 = ts.termination(observation=tf.zeros(2, 2), reward=tf.constant([7., 8.]))
  
#   time_steps = [ts0, ts1, ts2, ts3, ts4, ts5]

#   batched_avg_return_metric = batched_py_metric.BatchedPyMetric(
#           py_metrics.AverageReturnMetric)    
#   tf_avg_return_metric = tf_py_metric.TFPyMetric(batched_avg_return_metric)
  
#   deps = []
#   for i in range(len(time_steps)):
#     with tf.control_dependencies(deps):
#       time_step_action = tf_avg_return_metric(time_steps[i])
#       deps = nest.flatten(time_step_action)
      
#   with tf.control_dependencies(deps):
#     result = tf_avg_return_metric.result()

#   with tf.Session() as sess:  
#     result_ = sess.run(result)
#   print(result_)