<a href="https://colab.research.google.com/github/rahiakela/deep-reinforcement-learning-hands-on/blob/chapter-4-the-cross-entropy-method/1_the_cross_entropy_method.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# The Cross-Entropy Method

Although it is much less famous than other tools in the RL practitioner's toolbox, such as deep Q-network (DQN) or advantage actor-critic, the cross-entropy method has its own strengths. First, it is really simple, which makes it an easy method to follow. For example, its implementation in PyTorch is less than 100 lines of code.

Second, the method has good convergence. In simple environments that don't require complex, multistep policies to be learned and discovered, and that have short episodes with frequent rewards, the cross-entropy method usually works very well. Of course, lots of practical problems don't fall into this category, but sometimes they do.

The cross-entropy method falls into the model-free and policy-based category of
methods. All methods in RL can be classified along several dimensions:
* Model-free or model-based
* Value-based or policy-based
* On-policy or off-policy

The term "model-free" means that the method doesn't build a model of the
environment or reward; it just directly connects observations to actions (or values that are related to actions). In other words, the agent takes current observations and does some computations on them, and the result is the action that it should take.

In contrast, model-based methods try to predict what the next observation and/or
reward will be. Based on this prediction, the agent tries to choose the best possible action to take, very often making such predictions multiple times to look more and more steps into the future.

Looking from another angle, policy-based methods directly approximate the
policy of the agent, that is, what actions the agent should carry out at every step. The policy is usually represented by a probability distribution over the available actions.

In contrast, the method could be value-based. In this case, instead of the probability of actions, the agent calculates the value of every possible action and chooses the action with the best value. Both of those families of methods are equally popular.
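
To make the contrast concrete, here is a minimal sketch of how the two families pick an action. The probability and value arrays are made-up numbers standing in for a model's output; they are not taken from any trained agent.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Policy-based: the model outputs a probability distribution over actions,
# and the agent samples an action from it.
action_probs = np.array([0.1, 0.7, 0.2])    # made-up policy output
policy_action = rng.choice(len(action_probs), p=action_probs)

# Value-based: the model outputs one value per action,
# and the agent picks the action with the highest value.
action_values = np.array([1.5, -0.3, 2.1])  # made-up value estimates
value_action = int(np.argmax(action_values))

print("sampled from policy:", policy_action)
print("greedy on values:", value_action)
```

The policy-based agent may pick a different action on every call, while the value-based agent in this sketch always returns the same one for the same values.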

The third important classification of methods is on-policy versus off-policy. For now, it will be enough to explain off-policy as the ability of the method to learn from historical data (obtained by a previous version of the agent, recorded by human demonstration, or just seen by the same agent several episodes ago).

So, our cross-entropy method is model-free, policy-based, and on-policy, which
means the following:

* It doesn't build any model of the environment; it just tells the agent
what to do at every step.
* It approximates the policy of the agent.
* It requires fresh data obtained from the environment.

## Setup: Installing all required libraries

In [0]:
! pip install torch==1.5.0+cpu torchvision==0.6.0+cpu -f https://download.pytorch.org/whl/torch_stable.html
! pip install atari-py
! pip install gym
! pip install opencv-python
! pip install pytorch-ignite
! pip install ptan
! pip install tensorboardX
! pip install tensorboard

In [0]:
import gym
from collections import namedtuple
import numpy as np
from tensorboardX import SummaryWriter

import torch
import torch.nn as nn
import torch.optim as optim

## The cross-entropy method in practice

The cross-entropy method's description is split into two unequal parts: **practical and theoretical. The practical part is intuitive in its nature, while the theoretical explanation of why the cross-entropy method works, and what's happening, is more sophisticated.**

You may remember that **the central and trickiest thing in RL is the agent, which is trying to accumulate as much total reward as possible by communicating with the environment. In practice, we follow a common machine learning (ML) approach and replace all of the complications of the agent with some kind of nonlinear trainable function, which maps the agent's input (observations from the environment) to some output.** The details of the output that this function produces may depend on a particular method or a family of methods, as described in the previous section (such as value-based versus policy-based methods). **As our cross-entropy method is policy-based, our nonlinear function (neural network (NN)) produces the policy, which basically says for every observation which action the agent should take.**

<img src='https://github.com/rahiakela/img-repo/blob/master/deep-reinforcement-learning-hands-on/high-level-rl.png?raw=1' width='800'/>

In practice, **the policy is usually represented as a probability distribution over actions, which makes it very similar to a classification problem, with the number of classes being equal to the number of actions we can carry out.**

This abstraction makes our agent very simple: it needs to pass an observation
from the environment to the NN, get a probability distribution over actions, and
perform random sampling using the probability distribution to get an action to
carry out. This random sampling adds randomness to our agent, which is a good
thing, as at the beginning of the training, when our weights are random, the agent behaves randomly. After the agent gets an action to issue, it fires the action to the environment and obtains the next observation and reward for the last action. Then the loop continues.
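
The loop described above can be sketched without any RL library. Everything here is an assumption made for illustration: `ToyEnv` is a hypothetical stand-in environment (it is not Gym's API), and the "network" is a single random linear layer rather than a trained NN.

```python
import numpy as np

rng = np.random.default_rng(0)

class ToyEnv:
    """Hypothetical environment: 4-number observations, 2 actions,
    reward 1.0 per step, episode ends after 5 steps."""
    def reset(self):
        self.t = 0
        return rng.normal(size=4)

    def step(self, action):
        self.t += 1
        obs = rng.normal(size=4)
        return obs, 1.0, self.t >= 5

def softmax(logits):
    z = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return z / z.sum()

weights = rng.normal(size=(4, 2))  # random "network": one linear layer

env = ToyEnv()
obs = env.reset()
total_reward, actions = 0.0, []
done = False
while not done:
    probs = softmax(obs @ weights)            # observation -> action probabilities
    action = rng.choice(len(probs), p=probs)  # sample an action from the policy
    obs, reward, done = env.step(action)      # act, get next observation and reward
    total_reward += reward
    actions.append(int(action))

print(actions, total_reward)
```

Because the weights are random, the sampled actions are essentially arbitrary, which is the behavior we expect from an untrained agent.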

**During the agent's lifetime, its experience is presented as episodes. Every episode is a sequence of observations that the agent has got from the environment, actions it has issued, and rewards for these actions.**

Imagine that our agent has played several such episodes. For every episode, we can calculate the total reward that the agent has claimed. It can be discounted or not discounted; for simplicity, let's assume a discount factor of $\gamma = 1$, which means just a sum of all local rewards for every episode. This total reward shows how good this episode was for the agent.
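
As a small worked example, the total reward of an episode with per-step rewards $r_0, r_1, \dots$ and discount factor $\gamma$ is $\sum_t \gamma^t r_t$. The helper name `total_reward` and the reward values below are made up for illustration.

```python
def total_reward(rewards, gamma=1.0):
    """Sum of rewards, each discounted by gamma raised to its step index."""
    return sum(r * gamma ** t for t, r in enumerate(rewards))

episode_rewards = [1.0, 1.0, 1.0]                 # made-up per-step rewards
print(total_reward(episode_rewards))              # gamma = 1: plain sum of rewards
print(total_reward(episode_rewards, gamma=0.9))   # 1 + 0.9 + 0.81
```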

Let's illustrate this with a diagram, which contains four episodes (note that different episodes have different values for $o_i, a_i, r_i$):

<img src='https://github.com/rahiakela/img-repo/blob/master/deep-reinforcement-learning-hands-on/sample-episodes.png?raw=1' width='800'/>

**Every cell represents the agent's step in the episode. Due to randomness in the environment and the way that the agent selects actions to take, some episodes will be better than others. The core of the cross-entropy method is to throw away bad episodes and train on better ones.**

So, the steps of the method are as follows:
1. Play N episodes using our current model and environment.
2. Calculate the total reward for every episode and decide on a reward
boundary. Usually, we use some percentile of all rewards, such as 50th
or 70th.
3. Throw away all episodes with a reward below the boundary.
4. Train on the remaining "elite" episodes using observations as the input and
issued actions as the desired output.
5. Repeat from step 1 until we become satisfied with the result.
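
The five steps above can be sketched on a toy problem. This is a simplification under stated assumptions, not the NN-based implementation used later in this notebook: the task is made up (each step pays 1.0 when action 1 is chosen), the policy is a plain two-element probability vector instead of a NN, and "training" moves the policy toward the action frequencies seen in the elite episodes instead of running gradient descent. `SMOOTHING` is a hypothetical parameter introduced only for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

N_EPISODES = 16     # episodes played per iteration
EPISODE_LEN = 5     # steps per episode in this toy task
PERCENTILE = 70     # reward boundary percentile
SMOOTHING = 0.5     # how far to move the policy toward the elite frequencies

probs = np.array([0.5, 0.5])  # policy: probabilities of actions 0 and 1

for iteration in range(30):
    # 1. Play N episodes with the current policy.
    #    Each row is one episode; each entry is the action taken at a step.
    batch = rng.choice(2, size=(N_EPISODES, EPISODE_LEN), p=probs)
    # 2. Total reward per episode (number of times action 1 was chosen)
    #    and the reward boundary.
    rewards = batch.sum(axis=1)
    bound = np.percentile(rewards, PERCENTILE)
    # 3. Throw away episodes with a reward below the boundary.
    elite = batch[rewards >= bound]
    # 4. "Train" on the elite episodes: move the policy toward the
    #    action frequencies observed in them.
    freq = np.bincount(elite.ravel(), minlength=2) / elite.size
    probs = (1 - SMOOTHING) * probs + SMOOTHING * freq
    # 5. Repeat.

print(probs)
```

After a few dozen iterations, the probability of the rewarding action should be close to 1, because every round of filtering keeps the episodes in which that action was chosen more often.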

With the preceding procedure, our NN learns to repeat actions that lead to a larger reward, constantly moving the boundary higher and higher. Despite the simplicity of this method, it works well in basic environments, it's easy to implement, and it's quite robust to changes in hyperparameters, which makes it an ideal baseline method to try.

Let's now apply it to our CartPole environment.


## The cross-entropy method on CartPole

Our model's core is a one-hidden-layer NN with a rectified linear unit (ReLU) activation and 128 hidden neurons (an absolutely arbitrary choice). Other hyperparameters are also set almost randomly and aren't tuned, as the method is robust and converges very quickly.

In [0]:
HIDDEN_SIZE = 128  # the count of neurons in the hidden layer
BATCH_SIZE = 16    # the count of episodes we play on every iteration
PERCENTILE = 70    # the percentile of episodes' total rewards that we use for "elite" episode filtering.

We will take the 70th percentile, which means that we will keep the top 30% of episodes sorted by reward.
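
A quick check of what the 70th percentile does, using ten made-up episode rewards rather than real CartPole results:

```python
import numpy as np

rewards = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]  # made-up episode rewards
bound = np.percentile(rewards, 70)                   # interpolates between 70 and 80
elite = [r for r in rewards if r >= bound]           # episodes at or above the boundary

print(bound)  # approximately 73.0
print(elite)  # [80, 90, 100]
```

Three of the ten episodes survive here. Note that episodes whose reward equals the boundary are kept, so when many episodes tie at the boundary value, more than 30% of the batch can pass the filter.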

There is nothing special about our NN; it takes a single observation from the
environment as an input vector and outputs a number for every action we can
perform. The output from the NN is a probability distribution over actions, so
a straightforward way to proceed would be to include softmax nonlinearity after
the last layer.

In [0]:
class Network(nn.Module):

  def __init__(self, obs_size, hidden_size, n_actions):
    super(Network, self).__init__()
    self.network = nn.Sequential(
        nn.Linear(obs_size, hidden_size),
        nn.ReLU(),
        nn.Linear(hidden_size, n_actions)
    )

  def forward(self, x):
    return self.network(x)

**However, the preceding NN doesn't apply softmax to its output; we leave it out to increase
the numerical stability of the training process. Rather than calculating softmax
(which uses exponentiation) and then calculating cross-entropy loss (which uses
a logarithm of probabilities), we can use the PyTorch class nn.CrossEntropyLoss,
which combines both softmax and cross-entropy in a single, more numerically
stable expression.**

**CrossEntropyLoss requires raw, unnormalized values from the
NN (also called logits). The downside of this is that we need to remember to apply softmax every time we need to get probabilities from our NN's output.**
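
The stability argument can be illustrated in plain NumPy. This is only a sketch of the general idea (the log-sum-exp trick); it does not claim to reproduce how PyTorch implements nn.CrossEntropyLoss internally. The function names and the deliberately extreme logits are made up for the demonstration.

```python
import numpy as np

def naive_cross_entropy(logits, target):
    """Softmax first, then the log of the target probability."""
    with np.errstate(over="ignore", invalid="ignore", divide="ignore"):
        probs = np.exp(logits) / np.exp(logits).sum()
        return -np.log(probs[target])

def stable_cross_entropy(logits, target):
    """Log-softmax computed directly with the log-sum-exp trick."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[target]

logits = np.array([1000.0, 0.0])  # made-up, deliberately large scores
print(naive_cross_entropy(logits, target=1))   # inf: exp(1000) overflows
print(stable_cross_entropy(logits, target=1))  # 1000.0
```

The naive version breaks because the exponent overflows before the logarithm can undo it, while the combined form never leaves the log domain.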

In [0]:
Episode = namedtuple('Episode', field_names=['reward', 'steps'])
EpisodeStep = namedtuple('EpisodeStep', field_names=['observation', 'action'])

Here we will define two helper classes that are named tuples from the collections package in the standard library:

* **EpisodeStep**: This will be used to represent one single step that our agent
made in the episode, and it stores the observation from the environment
and what action the agent completed. We will use episode steps from "elite"
episodes as training data.

* **Episode**: This is a single episode stored as total undiscounted reward and
a collection of EpisodeStep.
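
A short usage sketch of the two named tuples. The observation values are made up; they only mimic the four-number observations that CartPole returns.

```python
from collections import namedtuple

Episode = namedtuple('Episode', field_names=['reward', 'steps'])
EpisodeStep = namedtuple('EpisodeStep', field_names=['observation', 'action'])

# Made-up two-step episode.
steps = [
    EpisodeStep(observation=[0.01, 0.02, -0.03, 0.04], action=1),
    EpisodeStep(observation=[0.02, 0.21, -0.02, -0.25], action=0),
]
episode = Episode(reward=2.0, steps=steps)

reward, episode_steps = episode            # named tuples unpack like ordinary tuples
print(reward)                              # 2.0
print([s.action for s in episode_steps])   # [1, 0]
```

Tuple unpacking is what lets the filtering code later iterate with `for reward, steps in batch`.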

Let's look at a function that generates batches of episodes. It accepts the environment, our NN, and the number of episodes to generate on every iteration, and it accumulates finished episodes in the batch list (a list of Episode instances).

We also declare a reward counter for the current episode and its
list of steps (the EpisodeStep objects). Then we reset our environment to obtain the first observation and create a softmax layer, which will be used to convert the NN's output to a probability distribution of actions. With these preparations complete, we are ready to start the environment loop.


In [0]:
def iterate_batches(env, network, batch_size):
  batch = []
  episode_reward = 0.0
  episode_steps = []
  obs = env.reset()
  softmax = nn.Softmax(dim=1)
  '''At every iteration, we convert our current observation to a PyTorch tensor and pass it to the NN to obtain action probabilities.'''
  while True:
    obs_v = torch.FloatTensor([obs])
    action_probs_v = softmax(network(obs_v))   # softmax converts the raw action scores (logits) into action probabilities
    action_probs = action_probs_v.data.numpy()[0]  # get the first batch element to obtain a one-dimensional vector of action probabilities.
    '''
    Now that we have the probability distribution of actions, we can use it to obtain
    the actual action for the current step by sampling this distribution using NumPy's function random.choice().
    '''
    action = np.random.choice(len(action_probs), p=action_probs)
    next_obs, reward, is_done, _ = env.step(action)
    '''After this, we will pass this action to the environment to get our next observation, our reward, and the indication of the episode ending.'''
    episode_reward += reward
    step = EpisodeStep(observation=obs, action=action)
    episode_steps.append(step)
    '''
    The reward is added to the current episode's total reward, and our list of episode
    steps is also extended with an (observation, action) pair. Note that we save the
    observation that was used to choose the action, but not the observation returned by
    the environment as a result of the action. These are the tiny, but important, details that you need to keep in mind.
    '''
    if is_done:  # handle the situation when the current episode is over
      e = Episode(reward=episode_reward, steps=episode_steps)
      batch.append(e)
      episode_reward = 0.0
      episode_steps = []      # reset total reward accumulator and clean the list of steps
      next_obs = env.reset()  # reset environment to start over.
      '''
      In case batch has reached the desired count of episodes, return it to the caller for processing using yield. Our function is a generator, so every time
      the yield operator is executed, the control is transferred to the outer iteration loop and then continues after the yield line.
      '''
      if len(batch) == batch_size:
        yield batch
        batch = []
    obs = next_obs

The last, but very important, step in our loop is to assign an observation obtained from the environment to our current observation variable. After that, everything repeats infinitely—we pass the observation to the NN, sample the action to perform, ask the environment to process the action, and remember the result of this processing.

**One very important fact to understand in this function logic is that the training of our NN and the generation of our episodes are performed at the same time.** They are not completely in parallel, but every time our loop accumulates enough episodes (16), it passes control to this function's caller, which is supposed to train the NN using gradient descent. So, when execution resumes after the yield, the NN will have different, slightly better (we hope) behavior.

**We don't need to worry about synchronization, as our training and data gathering activities are performed in the same thread of execution, but you need to understand those constant jumps from NN training to its utilization.**
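
The jumps between the generator and its caller can be seen in a stripped-down sketch. The names `produce_batches` and `state` are hypothetical: the dictionary stands in for the NN's weights, and incrementing its value stands in for one training step.

```python
def produce_batches(state, batch_size):
    """Yield batches forever; each item records the state value seen when it was made."""
    batch = []
    while True:
        batch.append(state["version"])
        if len(batch) == batch_size:
            yield batch          # control goes to the caller here...
            batch = []           # ...and resumes here on the next iteration

state = {"version": 0}           # stands in for the NN's weights
seen = []
for i, batch in enumerate(produce_batches(state, batch_size=3)):
    seen.append(batch)
    state["version"] += 1        # stands in for one training step
    if i == 2:
        break

print(seen)  # [[0, 0, 0], [1, 1, 1], [2, 2, 2]]
```

Each batch is produced entirely with the state left behind by the caller's previous update, which mirrors how every batch of episodes is played with the NN as it was after the latest training step.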

Okay, now we need to define yet another function and then we will be ready to
switch to the training loop.

This function is at the core of the cross-entropy method—from the given batch
of episodes and percentile value, it calculates a boundary reward, which is used
to filter "elite" episodes to train on. 

To obtain the boundary reward, we will use
NumPy's percentile function, which, from the list of values and the desired
percentile, calculates the percentile's value. Then, we will calculate the mean
reward, which is used only for monitoring.

In [0]:
def filter_batch(batch, percentile):
  rewards = list(map(lambda s: s.reward, batch))
  reward_bound = np.percentile(rewards, percentile)
  reward_mean = float(np.mean(rewards))
  
  train_obs = []
  train_action = []
  for reward, steps in batch:
    if reward < reward_bound:
      continue
    train_obs.extend(map(lambda step: step.observation, steps))
    train_action.extend(map(lambda step: step.action, steps))
  '''
  The loop above filters the episodes. Every episode whose total reward is below
  the boundary is skipped; for the remaining "elite" episodes, we populate the
  lists of observations and actions that we will train on.
  '''
  train_obs_v = torch.FloatTensor(train_obs)
  train_action_v = torch.LongTensor(train_action)  # nn.CrossEntropyLoss expects integer class indices as targets

  return train_obs_v, train_action_v, reward_bound, reward_mean

As the final step of the function, we will convert our observations and actions from "elite" episodes into tensors, and return a tuple of four: observations, actions, the boundary of reward, and the mean reward. The last two values will be used only to write them into TensorBoard to check the performance of our agent.

Now, the final chunk of code that glues everything together, and mostly consists
of the training loop, is as follows:

In the beginning, we create all the required objects: the environment, our NN, the objective function, the optimizer, and the summary writer for TensorBoard.

In [0]:
env = gym.make('CartPole-v0')
# env = gym.wrappers.Monitor(env, directory="mon", force=True)
obs_size = env.observation_space.shape[0]
n_actions = env.action_space.n

network = Network(obs_size, HIDDEN_SIZE, n_actions)
objective = nn.CrossEntropyLoss()
optimizer = optim.Adam(params=network.parameters(), lr=0.01)
writer = SummaryWriter(comment='-cartpole')

# The commented line creates a monitor to write videos of your agent's performance.
for iter_no, batch in enumerate(iterate_batches(env, network, BATCH_SIZE)):
  obs_v, action_v, reward_b, reward_m = filter_batch(batch, PERCENTILE)  # perform filtering of the "elite" episodes using the filter_batch function.
  optimizer.zero_grad()   # zero gradients of NN and pass observations to the NN, obtaining its action scores.
  action_scores_v = network(obs_v)   
  # scores are passed to the objective function, which will calculate cross-entropy between the NN output and the actions that the agent took.
  loss_v = objective(action_scores_v, action_v)
  loss_v.backward()
  optimizer.step()

  '''
  The idea of this is to reinforce our NN to carry out those "elite" actions that have led to good rewards. Then, we calculate gradients
  on the loss and ask the optimizer to adjust our NN.
  '''

  # We also write the same values to TensorBoard, to get a nice chart of the agent's learning performance.
  