<a href="https://colab.research.google.com/github/rahiakela/deep-reinforcement-learning-hands-on/blob/chapter-4-the-cross-entropy-method/1_the_cross_entropy_method.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# The Cross-Entropy Method

Although it is much less famous than other tools in the RL practitioner's toolbox, such as deep Q-network (DQN) or advantage actor-critic, the cross-entropy method has its own strengths. First, it is really simple, which makes it easy to follow. For example, its implementation in PyTorch is less than 100 lines of code.

Second, the method has good convergence. In simple environments that don't require complex, multistep policies to be learned and discovered, and that have short episodes with frequent rewards, the cross-entropy method usually works very well. Of course, lots of practical problems don't fall into this category, but sometimes they do.

The cross-entropy method falls into the model-free and policy-based category of
methods. All methods in RL can be classified along several axes:
* Model-free or model-based
* Value-based or policy-based
* On-policy or off-policy

The term "model-free" means that the method doesn't build a model of the
environment or reward; it just directly connects observations to actions (or values that are related to actions). In other words, the agent takes current observations and does some computations on them, and the result is the action that it should take.

In contrast, model-based methods try to predict what the next observation and/or
reward will be. Based on this prediction, the agent tries to choose the best possible action to take, very often making such predictions multiple times to look more and more steps into the future.
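
As a rough illustration of the model-based idea, the sketch below plans by exhaustive lookahead. The `model` function is a hypothetical hand-written stand-in for a learned environment model (a toy one-dimensional world with a reward at position 3); it is an assumption made for this example, not part of the cross-entropy method.

```python
def model(state, action):
    """Hypothetical environment model: predicts (next_state, reward)."""
    next_state = state + action
    reward = 1.0 if next_state == 3 else 0.0
    return next_state, reward


def plan(state, actions, depth):
    """Return (best_total_reward, best_first_action) found by looking `depth` steps ahead."""
    if depth == 0:
        return 0.0, None
    best_value, best_action = float("-inf"), None
    for action in actions:
        next_state, reward = model(state, action)
        future_value, _ = plan(next_state, actions, depth - 1)
        if reward + future_value > best_value:
            best_value, best_action = reward + future_value, action
    return best_value, best_action


# Starting at position 1, look two steps into the future.
value, first_action = plan(state=1, actions=[-1, +1], depth=2)
```

A model-free method skips the `model` call entirely and maps the observation straight to an action.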

From another angle, policy-based methods directly approximate the
policy of the agent, that is, what actions the agent should carry out at every step. The policy is usually represented by a probability distribution over the available actions.
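
To make "a probability distribution over the available actions" concrete, here is a minimal sketch using only the standard library. The raw scores and the two-action setup are made-up values for illustration; in practice the scores would come from a trained function.

```python
import math
import random


def softmax(scores):
    """Convert real-valued scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs, rng):
    """Draw one action index according to the policy's probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical raw scores for two actions (for example, "left" and "right").
action_scores = [1.0, 2.0]
policy = softmax(action_scores)

rng = random.Random(0)
action = sample_action(policy, rng)
```

Sampling from the distribution, rather than always taking the most likely action, is what lets a policy-based agent explore.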

In contrast, a method can be value-based. In this case, instead of the probability of actions, the agent calculates the value of every possible action and chooses the action with the best value. Both families of methods are equally popular.
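
For comparison, a value-based agent's action choice can be sketched as a simple argmax. The action values below are invented numbers used only for illustration.

```python
def greedy_action(action_values):
    """Return the index of the action with the highest estimated value."""
    return max(range(len(action_values)), key=lambda a: action_values[a])


# Hypothetical value estimates for three actions.
action_values = [0.2, 1.5, -0.3]
best = greedy_action(action_values)
```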

The third important classification of methods is on-policy versus off-policy. For now, it is enough to explain off-policy as the ability of the method to learn from historical data (obtained by a previous version of the agent, recorded from human demonstrations, or just seen by the same agent several episodes ago).
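
One way to picture the difference is in how collected data is stored. The sketch below is an assumption-laden illustration: `Transition`, `collect_episode`, and the buffer size are hypothetical names and values, and the "episode" is made-up data rather than output from a real environment.

```python
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", ["obs", "action", "reward", "next_obs"])


def collect_episode(length):
    """Stand-in for running the current policy in an environment (made-up data)."""
    return [Transition(obs=t, action=0, reward=1.0, next_obs=t + 1) for t in range(length)]


# Off-policy: transitions gathered by older policies stay in a buffer and are reused.
replay_buffer = deque(maxlen=1000)
replay_buffer.extend(collect_episode(5))
rng = random.Random(0)
batch = rng.sample(list(replay_buffer), k=3)

# On-policy: train on freshly collected data, then discard it.
fresh_episode = collect_episode(5)
# ... a training step on fresh_episode would go here ...
fresh_episode.clear()
```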

So, our cross-entropy method is model-free, policy-based, and on-policy, which
means the following:

* It doesn't build any model of the environment; it just says to the agent
what to do at every step.
* It approximates the policy of the agent.
* It requires fresh data obtained from the environment.

## Setup: Installing all required libraries

In [0]:
! pip install torch==1.5.0+cpu torchvision==0.6.0+cpu -f https://download.pytorch.org/whl/torch_stable.html

In [0]:
! pip install atari-py
! pip install gym
! pip install opencv-python
! pip install pytorch-ignite
! pip install ptan
! pip install tensorboardX
! pip install tensorboard

In [0]:
import gym
from collections import namedtuple
import numpy as np
from tensorboardX import SummaryWriter

import torch
import torch.nn as nn
import torch.optim as optim

## The cross-entropy method in practice

The cross-entropy method's description is split into two unequal parts: **practical and theoretical. The practical part is intuitive in its nature, while the theoretical explanation of why the cross-entropy method works, and what's happening, is more sophisticated.**

You may remember that **the central and trickiest thing in RL is the agent, which is trying to accumulate as much total reward as possible by communicating with the environment. In practice, we follow a common machine learning (ML) approach and replace all of the complications of the agent with some kind of nonlinear trainable function, which maps the agent's input (observations from the environment) to some output.** The details of the output that this function produces may depend on a particular method or a family of methods, as described in the previous section (such as value-based versus policy-based methods). **As our cross-entropy method is policy-based, our nonlinear function (neural network (NN)) produces the policy, which basically says for every observation which action the agent should take.**
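
A minimal sketch of such a function in PyTorch might look like the following. The sizes (`OBS_SIZE`, `HIDDEN_SIZE`, `N_ACTIONS`) and the single hidden layer are assumptions chosen for illustration, not values taken from the text above, and the network here is untrained.

```python
import torch
import torch.nn as nn

OBS_SIZE = 4       # assumed length of the observation vector
HIDDEN_SIZE = 128  # assumed width of the hidden layer
N_ACTIONS = 2      # assumed number of discrete actions

# Nonlinear trainable function: observation -> one raw score (logit) per action.
policy_net = nn.Sequential(
    nn.Linear(OBS_SIZE, HIDDEN_SIZE),
    nn.ReLU(),
    nn.Linear(HIDDEN_SIZE, N_ACTIONS),
)

# A dummy observation with a batch dimension of 1.
obs = torch.zeros(1, OBS_SIZE)

with torch.no_grad():
    logits = policy_net(obs)
    # Softmax turns the raw scores into a probability distribution: the policy.
    action_probs = torch.softmax(logits, dim=1)

# Sample an action index according to the policy.
action = torch.multinomial(action_probs, num_samples=1).item()
```

Having the network output raw scores and applying softmax separately is a common design choice, because PyTorch's `nn.CrossEntropyLoss` expects unnormalized scores during training.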