# Cross Entropy Method


## Taxonomy of RL Methods
All RL methods can be classified along three dimensions:
- Model-free or model-based
- Value-based or policy-based
- On-policy or off-policy

**Model-free** means that the method doesn't build a model of the environment or reward; it directly connects observations to actions. The agent takes the current observation, does some computation on it, and selects the action based on the result.

**Model-based** methods try to predict what the next observation and/or reward will be.

**Policy-based** methods directly approximate the policy of the agent, i.e. what action the agent should carry out at every step. In these methods we calculate the probabilities of the actions.

**Value-based** methods approximate the value of every action the agent might take and choose the action with the best value.

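**On-policy or off-policy** describes whether the agent can learn from historical data: off-policy methods can learn from data gathered by an earlier version of the agent or by another policy, while on-policy methods require fresh data obtained with the current policy.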

### Cross Entropy Method
Our cross-entropy method is model-free, policy-based, and on-policy, which means:
- It doesn't build any model of the environment; it just tells the agent what to do at every step
- It approximates the policy of the agent
- It requires fresh data obtained from the environment

#### Practical cross-entropy
![](policy_approximation_cross_entropy.png)

In practice, the policy is usually represented as a probability distribution over actions, which makes it very similar to a classification problem, with the number of classes equal to the number of actions we can carry out. This abstraction makes our agent very simple: it passes an observation from the environment to the network, gets a probability distribution over actions, and samples from that distribution to get the action to carry out. This random sampling adds randomness to our agent, which is a good thing: at the beginning of training, when the weights are random, the agent behaves randomly. After the agent gets an action, it issues the action to the environment and obtains the next observation and the reward for the last action. Then the loop continues.
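
The observation-to-action step described above can be sketched without any deep learning library. This is a minimal illustration, not the code used later in these notes: the `logits` values stand in for a hypothetical network output, and `softmax` and `sample_action` are helper names introduced here only for the example.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into a probability distribution over actions."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs, rng):
    """Draw one action index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)
logits = [0.2, -0.1]      # hypothetical network output for one observation
probs = softmax(logits)   # one probability per action, summing to 1
action = sample_action(probs, rng)
print(probs, action)
```

Because the action is sampled rather than taken as the argmax, a less likely action is still chosen some of the time, which is where the agent's exploration comes from.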

During the agent's lifetime, its experience is presented as episodes. Every episode is a sequence of observations that the agent has got from the environment, actions it has issued, and rewards for these actions. Imagine that our agent has played several such episodes. For every episode, we can calculate the total reward that the agent has collected. It can be discounted or not, but for simplicity, let's assume a discount factor of gamma = 1, which means the total is just the sum of all local rewards in the episode.
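
As a small worked example of the total reward, the sketch below computes it for a toy reward sequence, with and without discounting. The function name `total_reward` and the reward values are made up for illustration.

```python
def total_reward(rewards, gamma=1.0):
    """Return r_0 + gamma * r_1 + gamma^2 * r_2 + ... for one episode."""
    total = 0.0
    # Walk backwards so that each earlier step discounts everything after it
    for r in reversed(rewards):
        total = r + gamma * total
    return total


rewards = [1.0, 1.0, 1.0, 1.0]                   # toy episode: +1 per step
undiscounted = total_reward(rewards)             # gamma = 1: plain sum
discounted = total_reward(rewards, gamma=0.5)    # 1 + 0.5 + 0.25 + 0.125
print(undiscounted, discounted)
```

With gamma = 1 the result is simply the number of steps survived in this toy case, which is the quantity the rest of these notes use.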

![](Sample_episode.png)

Every cell represents the agent's step in the episode. Due to randomness in the environment and the way that the agent selects actions to take, some episodes will be better than others. The core of the cross-entropy method is to throw away bad episodes and train on better ones. So, the steps of the method are as follows:

1. Play N episodes using our current model and environment.
2. Calculate the total reward for every episode and decide on a reward boundary. Usually, we use some percentile of all rewards, such as the 50th or 70th.
3. Throw away all episodes with a reward below the boundary.
4. Train on the remaining "elite" episodes using observations as the input and issued actions as the desired output.
5. Repeat from step 1 until we become satisfied with the result.
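
Steps 2 to 4 can be sketched on toy data in plain Python. This is only an illustration under stated assumptions: episodes are represented here as hypothetical `(total_reward, steps)` tuples, the observations are placeholder strings, and `percentile`, `select_elite` and `to_training_pairs` are names introduced for this example. The `percentile` helper uses linear interpolation, the same convention as NumPy's default.

```python
def percentile(values, q):
    """q-th percentile with linear interpolation between sorted values."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def select_elite(episodes, q):
    """Keep the episodes whose total reward is at or above the boundary."""
    bound = percentile([reward for reward, _ in episodes], q)
    elite = [(reward, steps) for reward, steps in episodes if reward >= bound]
    return elite, bound


def to_training_pairs(elite):
    """Flatten elite episodes into inputs (observations) and targets (actions)."""
    observations, actions = [], []
    for _, steps in elite:
        for obs, action in steps:
            observations.append(obs)
            actions.append(action)
    return observations, actions


# Toy batch: (total reward, [(observation, action), ...])
episodes = [
    (10.0, [("s0", 0)]),
    (20.0, [("s0", 1)]),
    (30.0, [("s0", 0), ("s1", 1)]),
    (40.0, [("s0", 1), ("s1", 1)]),
    (50.0, [("s0", 0), ("s1", 0), ("s2", 1)]),
]
elite, bound = select_elite(episodes, 70)
observations, actions = to_training_pairs(elite)
print(bound, [reward for reward, _ in elite], actions)
```

With the 70th percentile the boundary falls between the rewards 30 and 40, so only the two best episodes survive, and every step of those episodes becomes one supervised training example.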

### Cross Entropy Cartpole Example


import gym
from collections import namedtuple
import numpy as np
from tensorboardX import SummaryWriter

import torch
import torch.nn as nn
import torch.optim as optim

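HIDDEN_SIZE = 128   # neurons in the hidden layer
BATCH_SIZE = 16     # episodes played per training iteration
PERCENTILE = 70     # reward percentile used as the elite boundary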

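class Net(nn.Module):
    def __init__(self, obs_size, hidden_size, n_actions):
        super().__init__()
        # One hidden layer; the output is one raw score (logit) per action.
        # There is no softmax here because nn.CrossEntropyLoss expects raw scores.
        self.net = nn.Sequential(
            nn.Linear(obs_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, n_actions)
        )

    def forward(self, x):
        return self.net(x)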
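
The network above ends in a plain linear layer, and the training loop further down uses `nn.CrossEntropyLoss`, which applies log-softmax to the raw scores internally. The sketch below reproduces that loss for a single sample in plain Python, to show what is being minimised. The function name `cross_entropy_from_logits` and the example scores are illustrative assumptions, not part of PyTorch.

```python
import math


def cross_entropy_from_logits(logits, target):
    """Compute -log(softmax(logits)[target]) using a stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


# Two actions with equal scores: the policy is uniform, so the loss is ln(2)
loss_uniform = cross_entropy_from_logits([0.0, 0.0], target=0)

# A network that strongly prefers the action that was actually taken
loss_confident = cross_entropy_from_logits([4.0, 0.0], target=0)
print(loss_uniform, loss_confident)
```

For two actions, a policy that has learned nothing gives a loss of ln(2), roughly 0.693, which is close to the first loss value printed in the training log at the end of this section.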

Episode = namedtuple("Episode", field_names = ["reward","steps"])
EpisodeStep = namedtuple("EpisodeStep", field_names = ["observation", "action"])

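def iterate_batches(env, net, batch_size):
    """Play episodes with the current network and yield them in batches."""
    batch = []
    episode_reward = 0.0
    episode_steps = []
    # Assumes the classic Gym API: reset() returns only the observation
    obs = env.reset()
    sm = nn.Softmax(dim=1)
    while True:
        # Wrap the observation in a batch of one and get action probabilities
        obs_v = torch.FloatTensor(np.array([obs]))
        action_probs_v = sm(net(obs_v))
        action_probs = action_probs_v.detach().numpy()[0]
        # Sample the action from the policy's distribution
        action = np.random.choice(len(action_probs), p=action_probs)
        # Classic Gym API: step() returns a 4-tuple
        next_obs, reward, done, _ = env.step(action)
        episode_reward += reward
        # Store the observation that led to the action, not the resulting one
        episode_steps.append(EpisodeStep(observation=obs, action=action))
        if done:
            batch.append(Episode(reward=episode_reward, steps=episode_steps))
            episode_reward = 0.0
            episode_steps = []
            next_obs = env.reset()
            if len(batch) == batch_size:
                yield batch
                batch = []
        obs = next_obs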

def filter_batches(batch, percentile):
    """Keep only the elite episodes and return their steps as training tensors."""
    rewards = [episode.reward for episode in batch]
    reward_bound = np.percentile(rewards, percentile)
    reward_mean = float(np.mean(rewards))
    train_obs = []
    train_act = []
    for episode in batch:
        # Throw away episodes below the reward boundary
        if episode.reward < reward_bound:
            continue
        train_obs.extend(step.observation for step in episode.steps)
        train_act.extend(step.action for step in episode.steps)
    train_obs_v = torch.FloatTensor(np.array(train_obs))
    train_act_v = torch.LongTensor(train_act)
    return train_obs_v, train_act_v, reward_bound, reward_mean

if __name__ == "__main__":
    env = gym.make("CartPole-v0")
    # env = gym.wrappers.Monitor(env, directory="mon", force=True)
    obs_size = env.observation_space.shape[0]
    n_actions = env.action_space.n

    net = Net(obs_size, HIDDEN_SIZE, n_actions)
    objective = nn.CrossEntropyLoss()
    optimizer = optim.Adam(params=net.parameters(), lr=0.01)
    writer = SummaryWriter(comment="-cartpole")
    print(BATCH_SIZE, PERCENTILE)
    for iter_no, batch in enumerate(iterate_batches(env, net, BATCH_SIZE)):
        # Keep the elite episodes and train the network to reproduce their actions
        obs_v, actions_v, reward_b, reward_m = filter_batches(batch, PERCENTILE)
        optimizer.zero_grad()
        action_scores_v = net(obs_v)
        loss_v = objective(action_scores_v, actions_v)
        loss_v.backward()
        optimizer.step()
        print("%d: loss=%.3f, reward_mean=%.1f, reward_bound=%.1f" % (
            iter_no, loss_v.item(), reward_m, reward_b))
        writer.add_scalar("loss", loss_v.item(), iter_no)
        writer.add_scalar("reward_bound", reward_b, iter_no)
        writer.add_scalar("reward_mean", reward_m, iter_no)
        # CartPole-v0 episodes are capped at 200 steps, so a mean above 199 counts as solved
        if reward_m > 199:
            print("Solved!")
            break
    writer.close()

16 70
0: loss=0.686, reward_mean=28.9, reward_bound=31.0
1: loss=0.660, reward_mean=42.8, reward_bound=45.5
2: loss=0.651, reward_mean=43.1, reward_bound=50.5
3: loss=0.639, reward_mean=48.9, reward_bound=53.0
4: loss=0.619, reward_mean=54.1, reward_bound=64.0
5: loss=0.616, reward_mean=52.6, reward_bound=62.0
6: loss=0.601, reward_mean=55.7, reward_bound=65.0
7: loss=0.584, reward_mean=62.1, reward_bound=72.5
8: loss=0.586, reward_mean=67.9, reward_bound=83.5
9: loss=0.572, reward_mean=72.9, reward_bound=82.5
10: loss=0.558, reward_mean=94.5, reward_bound=107.0
11: loss=0.553, reward_mean=74.4, reward_bound=84.5
12: loss=0.556, reward_mean=74.1, reward_bound=85.5
13: loss=0.541, reward_mean=102.0, reward_bound=121.5
14: loss=0.546, reward_mean=117.2, reward_bound=143.0
15: loss=0.542, reward_mean=108.8, reward_bound=127.5
16: loss=0.522, reward_mean=143.8, reward_bound=190.0
17: loss=0.517, reward_mean=137.1, reward_bound=150.5
18: loss=0.531, reward_mean=144.7, reward_bound=160.0
19: