<h1 style="text-align: center;">The Cross-Entropy Method</h1>

<br>

In this chapter, we will learn one of the RL methods, called __cross-entropy__. Despite being less famous than other RL algorithms, this method has its own strengths:

- __Simplicity:__ This method is really simple, which makes it intuitive to follow. Its implementation in PyTorch is less than 100 lines of code.

- __Good convergence:__ In simple environments, this method usually works very well. Lots of practical problems don't fall into this category, but sometimes they do. In such cases, cross-entropy (on its own or as a part of a larger system) can be the perfect fit.

In the following sections, we will start from the practical side of cross-entropy, and then look at how it works in two environments in Gym. Then, at the end of the chapter, we will take a look at the theoretical background of the method. 

<br>

# 1. Taxonomy of RL methods

---

All methods in RL can be classified along several aspects:

1. Model-free or model-based
2. Value-based or policy-based
3. On-policy or off-policy

For example, the cross-entropy method falls into the model-free and policy-based category of methods. These notions are new, so let's spend some time exploring them. There are other ways that you can taxonomize RL methods, but for now we're interested in the preceding three. 

<img width="800px" src="./assets/rl_taxomy.png">

<br>

### 1.1. Model-Free or Model-Based


Let's define the model-free and model-based methods:

- __Model-free__ means that the agent takes current observations and does some computation on them, and the result is the action that it should take. In other words, the method directly connects observations to actions (or values that are related to actions).


- __Model-based__ means that the method tries to predict the next observation and/or reward. Based on that prediction, the agent chooses the best action to take.

<img width = "600px" src = "assets/model_free_model_based.gif">

Both classes of methods have strong and weak sides, but usually pure __model-based__ methods are used in deterministic environments (such as board games with strict rules). On the other hand, __model-free__ methods are usually easier to train as it's hard to build good models of complex environments with rich observations. 

<img width = "400px" src = "assets/model_free_model_based_2.png">

All of the methods in this book are model-free. Recently, researchers have started to mix the benefits of both sides (for example, refer to DeepMind's papers on imagination in agents). This approach will be described in Chapter 17, Beyond Model-Free – Imagination.

<br>

### 1.2. Value-Based or Policy-Based

__Policy-based__ methods directly approximate the policy of the agent (i.e. the actions that the agent should take at every step). The policy is usually represented by a probability distribution over the available actions.

In __value-based__ methods, the agent calculates the value of every possible action, and chooses the action with the best value. 

Both of these families of methods are equally popular and we'll discuss both of them later on.
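As a rough illustration of the difference, the sketch below picks an action both ways for a single observation. The numbers in `action_values` and `action_probs` are made up for the example; in a real agent they would come from a trained model.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seeded only to make the sketch repeatable

# Hypothetical model outputs for one observation with three available actions
action_values = np.array([1.2, 3.4, 0.7])  # value-based: one value estimate per action
action_probs = np.array([0.1, 0.7, 0.2])   # policy-based: one probability per action

# Value-based: choose the action with the best estimated value
greedy_action = int(np.argmax(action_values))

# Policy-based: sample an action from the probability distribution
sampled_action = int(rng.choice(len(action_probs), p=action_probs))

print("value-based choice:", greedy_action)
print("policy-based choice:", sampled_action)
```

The value-based choice is always action 1 here, while the policy-based choice varies from call to call according to the probabilities.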

<br>

### 1.3. On-policy or off-policy

We'll discuss the distinction between on-policy and off-policy methods in more detail in parts 2 and 3 of the book. For now, it is enough to know that an __off-policy__ method can learn from old historical data (obtained by a previous version of the agent, recorded from human demonstrations, or just seen by the same agent several episodes ago), whereas an __on-policy__ method requires fresh data obtained from the environment.

<br>

<br>

# 2. Practical cross-entropy

---

The cross-entropy method is: 
- __Model-free:__ It doesn't build any model of the environment; it just tells the agent what to do at every step
- __Policy-based:__ It approximates the policy of the agent
- __On-policy:__ It requires fresh data obtained from the environment

The central thing in RL is the agent, which tries to accumulate as much total reward as possible. In practice, all of the agent's complexity is replaced with some kind of nonlinear trainable function that maps the agent's input (observations from the environment) to some output. Since the cross-entropy method is policy-based, our nonlinear function (a neural network) produces the policy, which says for every observation which action the agent should take.

<img width="500px" src="./assets/high_level_RL.png">

The policy is usually represented as a probability distribution over actions. This is similar to a classification problem in which the number of classes is equal to the number of actions. This makes our agent very simple: it passes an observation from the environment to the network, takes the resulting probability distribution over actions, and samples an action from that distribution. This random sampling adds randomness to our agent, which is a good thing: at the beginning of training, when our weights are random, the agent behaves randomly and so explores the environment. After the agent takes an action in the environment, it obtains the next observation and the reward for the last action. Then the loop continues.


The agent's experience is presented as __episodes__. An episode is represented as:
1. A sequence of observations that the agent received from the environment.
2. The actions it took.
3. The rewards it received (from the action it took).

In each episode, the total reward can be __discounted__ or not. For simplicity, let's assume a discount factor of gamma = 1 (i.e. the total reward is just the sum of all local rewards in the episode). This total reward shows how good the episode was for the agent. Let's illustrate this with a diagram, which contains four episodes (note that different episodes have different values for O<sub>i</sub>, a<sub>i</sub>, and r<sub>i</sub>):
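To make the role of gamma concrete, here is a minimal sketch of the total reward computation. The helper name `total_reward` and the reward values are assumptions made for this example and are not part of the chapter's code.

```python
def total_reward(rewards, gamma=1.0):
    """Sum of gamma**t * r_t over the steps t of one episode."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# Hypothetical local rewards of a four-step episode
episode_rewards = [1.0, 1.0, 1.0, 1.0]

# With gamma = 1 the total reward is the plain sum of the local rewards
undiscounted = total_reward(episode_rewards)

# With gamma < 1 later rewards count less: 1 + 0.5 + 0.25 + 0.125
discounted = total_reward(episode_rewards, gamma=0.5)

print(undiscounted, discounted)
```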


<img width="700px" src="./assets/sample_episode.png">

Every cell represents the agent's step in the episode. Due to randomness in the environment and the way that the agent selects actions, some episodes will be better than others. The core of the cross-entropy method is to throw away bad episodes and train on better ones. 

__The steps for the cross-entropy method are as follows:__

1. Play N episodes using our current model and environment.
2. Calculate the total reward for every episode and decide on a reward boundary. Usually, we use some percentile of all rewards, such as 50th or 70th.
3. Throw away all episodes with a reward below the boundary.
4. Train on the remaining "elite" episodes using observations as the input and issued actions as the desired output.
5. Repeat from step 1 until we become satisfied with the result.

With the preceding procedure, our neural network learns how to repeat the actions that lead to a larger reward, constantly moving the boundary higher and higher. Despite its simplicity, this method works well in simple environments, it's easy to implement, and it's quite robust to changes in hyperparameters, which makes it an ideal baseline method to try.
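Steps 2 and 3 of the procedure can be sketched on toy data as follows. The ten reward values are invented for illustration; the only library call is `np.percentile`, which the CartPole code later in this chapter also uses.

```python
import numpy as np

# Hypothetical total rewards of N = 10 episodes played with the current policy
rewards = np.array([12.0, 30.0, 18.0, 45.0, 22.0, 9.0, 51.0, 27.0, 15.0, 38.0])
percentile = 70

# Step 2: the reward boundary is the chosen percentile of all episode rewards
reward_bound = np.percentile(rewards, percentile)

# Step 3: throw away every episode whose reward is below the boundary
elite_indices = [i for i, r in enumerate(rewards) if r >= reward_bound]

print("boundary:", reward_bound)
print("elite episodes:", elite_indices)
```

With the 70th percentile, three of the ten episodes survive, and only their (observation, action) pairs would be used as training data in step 4.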

<br>

# 3. Cross-entropy on CartPole

---

In this section, we will apply the cross-entropy method to our CartPole environment. Our model's core is a one-hidden-layer neural network, with ReLU and 128 hidden neurons (which is absolutely arbitrary). Other hyperparameters are also set almost randomly and aren't tuned, as the method is robust and converges very quickly.

In [1]:
# Import the libraries
import gym
from collections import namedtuple
import numpy as np
from tensorboardX import SummaryWriter
import torch
import torch.nn as nn
import torch.optim as optim

In [2]:
# Hyperparameters
HIDDEN_SIZE = 128     # Count of neurons in the hidden layer
BATCH_SIZE = 16       # Count of episodes we play on every iteration
PERCENTILE = 70       # The percentile of episodes' total rewards that we use for elite episode filtering. 70th percentile means that we'll leave the top 30% of episodes sorted by reward

The network in the cross-entropy method takes a single observation as input and outputs a number for every action. Since the output should be a probability distribution over actions, the natural choice would be to include a softmax nonlinearity at the very end. However, we leave the softmax layer out of this network to increase the numerical stability of the training process. Instead of calculating softmax and then calculating cross-entropy loss, we'll use the PyTorch class __nn.CrossEntropyLoss__, which combines both softmax and cross-entropy in a single, more numerically stable expression. CrossEntropyLoss requires raw, unnormalized values from the network (i.e. logits). The downside is that we need to remember to apply softmax every time we need probabilities from the network's output.
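To see why the combined form is more stable, the NumPy sketch below compares a naive "softmax, then log" computation with a log-sum-exp formulation. This is only an illustration of the idea, not PyTorch's actual implementation, and both function names are made up for the example.

```python
import numpy as np


def naive_cross_entropy(logits, target):
    """Softmax first, then -log of the target probability (can overflow)."""
    exp = np.exp(logits)
    probs = exp / exp.sum()
    return -np.log(probs[target])


def fused_cross_entropy(logits, target):
    """Log-sum-exp form: -logit[target] + log(sum(exp(logits))), shifted by the max."""
    shifted = logits - logits.max()
    return -shifted[target] + np.log(np.exp(shifted).sum())


small = np.array([2.0, 1.0, 0.1])
large = np.array([1000.0, 0.0, -1000.0])

# Silence the overflow warnings that the naive version triggers on purpose
with np.errstate(over="ignore", invalid="ignore", divide="ignore"):
    naive_small = naive_cross_entropy(small, 0)
    naive_large = naive_cross_entropy(large, 0)

fused_small = fused_cross_entropy(small, 0)
fused_large = fused_cross_entropy(large, 0)

print("small logits:", naive_small, fused_small)
print("large logits:", naive_large, fused_large)
```

For moderate logits both versions agree, but for large logits the naive version produces `nan` because `exp(1000)` overflows, while the shifted version still returns a finite loss.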

In [3]:
# Network class that inherits nn.Module
class Net(nn.Module):
    
    # The constructor
    def __init__(self, obs_size, hidden_size, n_actions):
        
        # Call the parent's constructor to initialize itself
        super(Net, self).__init__()
        
        # Create a sequential with layers
        self.net = nn.Sequential(nn.Linear(in_features = obs_size, out_features = hidden_size),
                                 nn.ReLU(),
                                 nn.Linear(in_features = hidden_size, out_features = n_actions))
    
    # Forward function
    def forward(self, x):
        return self.net(x)

Here we will define two helper classes:
- __EpisodeStep:__ This will be used to represent one single step that the agent makes in the episode. It stores the observation from the environment and the action that the agent took. We'll use episode steps from elite episodes as training data.
- __Episode:__ This is a single episode which stores the total (un-discounted) reward, and a collection of EpisodeStep.

In [4]:
# Namedtuple for episode (it stores reward and collection of EpisodeStep)
Episode = namedtuple('Episode', field_names=['reward', 'steps'])

# Namedtuple for episode step (it stores observation and action)
EpisodeStep = namedtuple('EpisodeStep', field_names=['observation', 'action'])

In [None]:
# Generates batches with episodes
def iterate_batches(env, net, batch_size):
    """
    Generate batches with episodes.
    
    PARAMETERS
    =================
        - env: The environment
        - net: The network
        - batch_size: Size of the batches
        
    RETURNS
    =================
        - batch: Yielding batches
    """
    
    # Initialize a list for batch
    batch = []
    
    # Initialize the episode reward
    episode_reward = 0.0
    
    # Initialize a list for episode steps
    episode_steps = []
    
    # Reset the environment and get the first observation
    obs = env.reset()
    
    # Create a softmax layer (for converting the network's output to a probability distribution of actions)
    sm = nn.Softmax(dim = 1)
    
    # Infinite loop
    while True:
        
        # Convert the current observation to a PyTorch tensor
        obs_v = torch.FloatTensor([obs])
        
        # Pass the observation to the network and get the action probabilities
        act_probs_v = sm(net(obs_v))
        
        # Get the data (since tensors track gradients) + convert to NumPy array + Get the first batch element to obtain a one-dimensional vector of action probabilities
        act_probs = act_probs_v.data.numpy()[0]
        
        # Random sample of action from probability distribution of actions
        action = np.random.choice(len(act_probs), p = act_probs)
        
        # Render environment
        #env.render()
        
        # Take action and get the next observation, reward, the indication of the episode ending
        next_obs, reward, is_done, _ = env.step(action)
        
        # Add the reward to episode's total reward
        episode_reward += reward

        # Add (observation, action) pair to episode steps - We save the observation that was used to choose the action, and not the observation returned by the environment. Keep these tiny details in mind.
        episode_steps.append(EpisodeStep(observation = obs, action = action))
        
        # If the episode has ended (in the CartPole environment, when the stick has fallen down or the step limit is reached)
        if is_done:

            # Append the finalized episode to the batch
            batch.append(Episode(reward = episode_reward, steps = episode_steps))

            # Reset the total reward accumulator
            episode_reward = 0.0

            # Clean the list of episode steps
            episode_steps = []

            # Reset the environment (to start over) and get the observation
            next_obs = env.reset()

            # If the batch has reached the desired count of episodes
            if len(batch) == batch_size:

                # Yield the batch
                yield batch

                # Empty the batch
                batch = []
        
        # Update the current observation
        obs = next_obs

In [None]:
# Function for calculating the boundary reward (which is used to filter elite episodes )
def filter_batch(batch, percentile):
    """
    Given batch of episodes and percentile value, this function calculates a boundary reward, which is used 
    to filter elite episodes to train on. 
    
    PARAMETERS
    ==================
        - batch
        - percentile
        
    RETURNS
    ==================
        - train_obs_v: The training observations
        - train_act_v: The training actions
        - reward_bound: The boundary of reward (used only for TensorBoard)
        - reward_mean: The mean reward (used only for TensorBoard)
    """
    # Get the total reward of each episode in the batch
    rewards = list(map(lambda s: s.reward, batch))
    
    # Obtain the boundary reward
    reward_bound = np.percentile(rewards, percentile)
    
    # Calculate the mean reward (used for monitoring)
    reward_mean = float(np.mean(rewards))
    
    # Initialize an empty list for training observations
    train_obs = []
    
    # Initialize an empty list for training actions
    train_act = []
    
    # Iterate through each episode in the batch
    for example in batch:
        
        # Skip the episode if its reward is below the boundary reward
        if example.reward < reward_bound:
            continue
            
        # Populate the list of observations we will train on    
        train_obs.extend(map(lambda step: step.observation, example.steps))
        
        # Populate the list of actions we will train on    
        train_act.extend(map(lambda step: step.action, example.steps))

    # Convert the observations from elite episodes into tensors
    train_obs_v = torch.FloatTensor(train_obs)
    
    # Convert the actions from elite episodes into tensors
    train_act_v = torch.LongTensor(train_act)
    
    return train_obs_v, train_act_v, reward_bound, reward_mean

In [None]:
# Execute the main program
if __name__ == "__main__":
    
    # Make the CartPole-v0 environment
    env = gym.make("CartPole-v0")
    
    # Create a monitor to write videos of the agent's performance
    # env = gym.wrappers.Monitor(env, directory="mon", force=True)
    
    # Get the observation size
    obs_size = env.observation_space.shape[0]
    
    # Get the number of actions
    n_actions = env.action_space.n

    # Initialize the network
    net = Net(obs_size, HIDDEN_SIZE, n_actions)
    
    # Initialize the cross entropy loss
    objective = nn.CrossEntropyLoss()
    
    # Initialize the Adam optimizer
    optimizer = optim.Adam(params = net.parameters(), lr = 0.01)
    
    # Writer of data (for TensorBoard)
    writer = SummaryWriter(comment = "-cartpole")

    # Iterate through the batches (which are a list of episode objects)
    for iter_no, batch in enumerate(iterate_batches(env, net, BATCH_SIZE)):
        
        # Perform filtering of the elite episodes and get observations, actions, the reward boundary (used for filtering), and the mean reward 
        obs_v, acts_v, reward_b, reward_m = filter_batch(batch, PERCENTILE)
        
        # Zero out the gradients of our network
        optimizer.zero_grad()
        
        # Pass the observations to the network and get the action scores
        action_scores_v = net(obs_v)
        
        # Calculates the cross-entropy between the network output and the actions that the agent took - The idea of this is to reinforce our network to carry out those "elite" actions which have led to good rewards.
        loss_v = objective(action_scores_v, acts_v)
        
        # Calculate the gradients on the loss
        loss_v.backward()
        
        # Make optimizer to adjust the network
        optimizer.step()
        
        # Print out the loss, mean reward, and boundary reward
        print("%d: loss=%.3f, reward_mean=%.1f, reward_bound=%.1f" % (
            iter_no, loss_v.item(), reward_m, reward_b))
        
        # Write the values for loss, boundary reward, and mean reward to TensorBoard
        writer.add_scalar("loss", loss_v.item(), iter_no)
        writer.add_scalar("reward_bound", reward_b, iter_no)
        writer.add_scalar("reward_mean", reward_m, iter_no)
        
        # If mean reward becomes greater than 199 then stop the training
        if reward_m > 199:
            print("Solved!")
            break
            
    # Close the writer
    writer.close()

0: loss=0.695, reward_mean=18.3, reward_bound=22.0
1: loss=0.673, reward_mean=27.6, reward_bound=33.0
2: loss=0.653, reward_mean=43.6, reward_bound=45.0
3: loss=0.638, reward_mean=40.1, reward_bound=39.5
4: loss=0.617, reward_mean=41.6, reward_bound=47.5
5: loss=0.613, reward_mean=49.6, reward_bound=59.0
6: loss=0.597, reward_mean=55.0, reward_bound=60.5
7: loss=0.595, reward_mean=57.2, reward_bound=56.5
8: loss=0.567, reward_mean=75.7, reward_bound=101.0
9: loss=0.598, reward_mean=52.7, reward_bound=55.0
10: loss=0.546, reward_mean=63.8, reward_bound=74.0
11: loss=0.579, reward_mean=56.7, reward_bound=55.0
12: loss=0.552, reward_mean=64.6, reward_bound=72.0
13: loss=0.539, reward_mean=63.5, reward_bound=66.0
14: loss=0.532, reward_mean=73.9, reward_bound=87.0
15: loss=0.542, reward_mean=71.8, reward_bound=82.0
16: loss=0.526, reward_mean=82.6, reward_bound=88.5
17: loss=0.527, reward_mean=90.1, reward_bound=100.5
18: loss=0.533, reward_mean=77.7, reward_bound=79.0
19: loss=0.545, rewa

In [None]:
!tensorboard --logdir runs --host localhost

TensorBoard 1.12.2 at http://localhost:6006 (Press CTRL+C to quit)


In the code above, we stop training when the mean reward becomes greater than 199. Why 199? In Gym, the CartPole environment is considered solved when the mean reward over the last 100 episodes is greater than 195, but our method converges so quickly that 100 episodes are usually all we need. A properly trained agent can balance the stick indefinitely (obtaining any amount of score), but the length of an episode in CartPole is limited to 200 steps (if you look at the environment variable of CartPole, you may notice the TimeLimit wrapper, which stops the episode after 200 steps). With all this in mind, we stop training once the mean reward in the batch is greater than 199, which is a good indication that our agent knows how to balance the stick like a pro.
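For comparison, the "mean over the last 100 episodes" criterion mentioned above could be checked with a small helper like the one below. `is_solved` is a hypothetical function written for this note and is not used by the training loop, which relies on the simpler per-batch check.

```python
from collections import deque


def is_solved(episode_rewards, window=100, threshold=195.0):
    """True once the mean of the last `window` episode rewards exceeds `threshold`."""
    recent = deque(episode_rewards, maxlen=window)  # keeps only the newest entries
    if len(recent) < window:
        return False
    return sum(recent) / window > threshold


# Hypothetical training history: 50 poor episodes followed by 100 perfect ones
history = [20.0] * 50 + [200.0] * 100

print(is_solved(history[:120]))  # the window still contains poor episodes
print(is_solved(history))        # the window holds only 200-step episodes
```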


Training here usually doesn't take the agent more than 50 batches to solve the environment. My experiments show something from 25 to 45 batches, which is a really good learning performance (remember, we need to play only 16 episodes for every batch). TensorBoard shows our agent consistently making progress, pushing the upper boundary at almost every batch (there are some periods of rolling down, but most of the time it improves).


<img width="800px" src="assets/report_training.png">

To check our agent in action, you can enable Monitor by uncommenting the line right after the environment creation. After restarting (possibly with xvfb-run to provide a virtual X11 display), our program will create a mon directory with videos recorded at different training steps:

    rl_book_samples/Chapter04$ xvfb-run -s "-screen 0 640x480x24" ./01_cartpole.py
    [2017-10-04 13:52:23,806] Making new env: CartPole-v0
    [2017-10-04 13:52:23,814] Creating monitor directory mon
    [2017-10-04 13:52:23,920] Starting new video recorder writing to mon/openaigym.video.0.4430.video000000.mp4
    [2017-10-04 13:52:25,229] Starting new video recorder writing to mon/openaigym.video.0.4430.video000001.mp4
    [2017-10-04 13:52:25,771] Starting new video recorder writing to mon/openaigym.video.0.4430.video000008.mp4
    0: loss=0.682, reward_mean=18.9, reward_bound=20.5
    [2017-10-04 13:52:26,297] Starting new video recorder writing to mon/openaigym.video.0.4430.video000027.mp4
    1: loss=0.687, reward_mean=16.6, reward_bound=19.0
    2: loss=0.677, reward_mean=21.1, reward_bound=21.0
    [2017-10-04 13:52:26,964] Starting new video recorder writing to mon/openaigym.video.0.4430.video000064.mp4
    3: loss=0.653, reward_mean=33.2, reward_bound=48.5
    4: loss=0.642, reward_mean=37.4, reward_bound=42.5
    .........
    29: loss=0.561, reward_mean=111.6, reward_bound=122.0
    30: loss=0.540, reward_mean=135.1, reward_bound=166.0
    [2017-10-04 13:52:40,176] Starting new video recorder writing to mon/openaigym.video.0.4430.video000512.mp4
    31: loss=0.546, reward_mean=147.5, reward_bound=179.5
    32: loss=0.559, reward_mean=140.0, reward_bound=171.5
    33: loss=0.558, reward_mean=160.4, reward_bound=200.0
    34: loss=0.547, reward_mean=167.6, reward_bound=195.5
    35: loss=0.550, reward_mean=179.5, reward_bound=200.0
    36: loss=0.563, reward_mean=173.9, reward_bound=200.0
    37: loss=0.542, reward_mean=162.9, reward_bound=200.0
    38: loss=0.552, reward_mean=159.1, reward_bound=200.0
    39: loss=0.548, reward_mean=189.6, reward_bound=200.0
    40: loss=0.546, reward_mean=191.1, reward_bound=200.0
    41: loss=0.548, reward_mean=199.1, reward_bound=200.0
    Solved!
                        
As you can see from the output, the monitor periodically records the agent's activity into separate video files, which can give you an idea of what your agent's sessions look like.

<img width="400px" src="./assets/cartpole_visualize.png">

Let's now pause a bit and think about what's just happened. Our neural network has learned how to play the environment purely from observations and rewards, without a single word of interpretation of the observed values. The environment could easily be not a cart with a stick but, say, a warehouse model with product quantities as an observation and money earned as a reward. Our implementation doesn't depend on environment details. This is the beauty of the RL model, and in the next section, we'll look at how exactly the same method can be applied to a different environment from the Gym collection.

<br>

# 4. Cross-entropy on FrozenLake

---

The next environment we'll try to solve using the cross-entropy method is __FrozenLake__. This environment has the following properties:
- This environment is a grid world of 4x4.
- The agent can move in four directions: up, down, left, and right. 
- The agent always starts at a top-left position, and its goal is to reach the bottom-right cell of the grid. 
- There are holes in the fixed cells of the grid and if you get into those holes, your reward is 0.0 and the episode ends.
- If the agent reaches the destination cell, then it obtains the reward 1.0 and the episode ends.
- To make life more complicated, the world is slippery (it's a frozen lake after all), so the agent's actions do not always turn out as expected: there is a 33% chance that it will slip to the right or to the left. You want the agent to move left, for example, but there is a 33% probability that it will indeed move left, a 33% chance that it will end up in the cell above, and a 33% chance that it will end up in the cell below. As we'll see at the end of the section, this makes progress difficult.

<img width="300px" src="assets/FrozenLake environment.png">

Let's look at how this environment is represented in Gym:

In [1]:
# Import the libraries (NumPy is needed later for the one-hot wrapper)
import gym
import numpy as np

In [2]:
# Initialize the FrozenLake environment
e = gym.make("FrozenLake-v0")

Our __observation space__ is discrete, which means that it's just a number from zero to 15 (inclusive). Obviously, this number is our current position in the grid. 

In [3]:
# Take a look at the observation space
e.observation_space

Discrete(16)

The __action space__ is also discrete, with values from zero to three (inclusive).

In [4]:
# Take a look at the action space
e.action_space

Discrete(4)

In [5]:
# Reset the environment and return the initial observation
e.reset()

0

In [6]:
# Render the environment
e.render()


[41mS[0mFFF
FHFH
FFFH
HFFG


Our network from the CartPole example expects a vector of numbers. To get this, we can apply the traditional "one-hot encoding" of discrete inputs, which means that the input to our network will be a vector of 16 floats that is zero everywhere except at the index that we encode. To minimize changes in our code, we can use the ObservationWrapper class from Gym and implement our DiscreteOneHotWrapper class:

In [7]:
# Class for one-hot encoding the discrete inputs
class DiscreteOneHotWrapper(gym.ObservationWrapper):
    
    # The constructor
    def __init__(self, env):
        
        # Call the parent's constructor to initialize itself
        super(DiscreteOneHotWrapper, self).__init__(env)
        
        # Check to make sure that the observation space is discrete type
        assert isinstance(env.observation_space, gym.spaces.Discrete)
        
        # The observation space
        self.observation_space = gym.spaces.Box(low = 0.0, 
                                                high = 1.0, 
                                                shape = (env.observation_space.n, ), 
                                                dtype = np.float32)
        
    # Convert the discrete observation (a cell index) into a one-hot vector
    def observation(self, observation):
        res = np.copy(self.observation_space.low)
        res[observation] = 1.0
        return res
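The encoding performed by the `observation` method can be tried in isolation, without creating an environment. The `one_hot` helper below is a stand-in written for this example; it mirrors the wrapper's logic but is not part of it.

```python
import numpy as np


def one_hot(index, size):
    """Return a float32 vector of length `size` with a single 1.0 at `index`."""
    vec = np.zeros(size, dtype=np.float32)
    vec[index] = 1.0
    return vec


# Encode cell number 5 of the 4x4 grid (16 possible positions)
encoded = one_hot(5, 16)
print(encoded)
```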

With that wrapper applied to the environment, both the observation space and action space are 100% compatible with our CartPole solution (source code __Chapter04/02_frozenlake_naive.py__). However, by launching it, we can see that the score doesn't improve over time.

<img width="700px" src="./assets/report.png">

To understand what's going on, we need to look deeper at the reward structure of both environments. In CartPole, every step of the environment gives us the reward 1.0, until the moment that the pole falls. So, the longer our agent balanced the pole, the more reward it obtained. Due to randomness in our agent's behavior, different episodes were of different lengths, which gave us a pretty normal distribution of the episodes' rewards. After choosing a reward boundary, we rejected less successful episodes and learned how to repeat better ones (by training on successful episodes' data).

This is shown in the following diagram:

<img width="600px" src="./assets/reward_distribution.png">

In the FrozenLake environment, episodes and their rewards look different. We get the reward of 1.0 only when we reach the goal, and this reward says nothing about how good each episode was. Was it quick and efficient, or did we make four rounds on the lake before we randomly stepped into the final cell? We don't know; it's just a 1.0 reward and that's it. The distribution of rewards for our episodes is also problematic. There are only two kinds of episodes possible, with zero reward (failed) and one reward (successful), and failed episodes will obviously dominate at the beginning of the training. So, our percentile selection of "elite" episodes is totally wrong and gives us bad examples to train on. This is the reason for our training failure.

<img width="600px" src="./assets/reward_distribution_2.png">
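The failure of the percentile selection can be reproduced with a few lines of NumPy. The batch below is invented for illustration (16 episodes, one success), and the keep rule matches the `reward < reward_bound` check in the CartPole `filter_batch`.

```python
import numpy as np

# Hypothetical batch of 16 FrozenLake episodes early in training:
# fifteen failures (reward 0.0) and a single success (reward 1.0)
rewards = np.array([0.0] * 15 + [1.0])

# The 70th percentile of an almost-all-zero batch is zero
reward_bound = np.percentile(rewards, 70)

# Only episodes strictly below the boundary are dropped, so nothing is dropped
kept = [r for r in rewards if not r < reward_bound]

print("boundary:", reward_bound)
print("episodes kept:", len(kept), "of", len(rewards))
```

Because the boundary collapses to 0.0, every failed episode passes the filter and the network is trained mostly on bad examples.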

This example shows us the limitations of the cross-entropy method:
- For training, our episodes have to be finite and, preferably, short
- The total reward for the episodes should have enough variability to separate good episodes from bad ones
- There is no intermediate indication about whether the agent has succeeded or failed

Later in the book, we'll become familiar with other methods, which address these limitations. For now, if you're curious about how FrozenLake can be solved using cross-entropy, here is a list of tweaks of the code that you need to make (the full example is in __Chapter04/03_frozenlake_tweaked.py__):
- __Larger batches of played episodes:__ <br>In CartPole, it was enough to have 16 episodes on every iteration, but FrozenLake requires at least 100 just to get some successful episodes.<br><br>
- __Discount factor applied to reward:__ <br>To make the total reward for the episode depend on episode length, and add variety in episodes, we can use a discounted total reward with the discount factor 0.9 or 0.95. In this case, the reward for shorter episodes will be higher than the reward for longer ones.<br><br>
- __Keeping "elite" episodes for a longer time:__ <br>In the CartPole training, we sampled episodes from the environment, trained on the best ones, and threw them away. In FrozenLake, a successful episode is a much rarer animal, so we need to keep them for several iterations to train on them.<br><br>
- __Decrease learning rate:__ <br>This will give our network time to average more training samples.<br><br>
- __Much longer training time:__ <br>Due to the sparsity of successful episodes, and the random outcome of our actions, it's much harder for our network to get an idea of the best behavior to perform in any particular situation. To reach 50% successful episodes, about 5k training iterations are required.<br><br>
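The effect of the discount factor on the episode ranking can be sketched as follows. The helper `discounted_episode_reward` and the step counts are assumptions made for this example; the formula itself (final reward times gamma to the power of the episode length) is the one used in the tweaked `filter_batch`.

```python
GAMMA = 0.9  # one of the discount factors suggested above


def discounted_episode_reward(final_reward, n_steps, gamma=GAMMA):
    """Scale the episode's total reward by gamma ** (episode length)."""
    return final_reward * (gamma ** n_steps)


# Hypothetical episodes: two successes of different length and one failure
short_success = discounted_episode_reward(1.0, n_steps=6)
long_success = discounted_episode_reward(1.0, n_steps=30)
failure = discounted_episode_reward(0.0, n_steps=4)

print(short_success, long_success, failure)
```

The short successful episode now scores higher than the long one, and both score higher than the failure, so the percentile boundary has some variability to work with.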

To incorporate all these into our code, we need to change the filter_batch function to calculate discounted reward and return "elite" episodes for us to keep:

In [8]:
# Discount factor for the episode reward (the text above suggests 0.9 or 0.95)
GAMMA = 0.9

# Function for calculating the boundary reward (which is used to filter elite episodes)
def filter_batch(batch, percentile):
    
    # Get the discounted reward of each episode in the batch
    disc_rewards = list(map(lambda s: s.reward * (GAMMA ** len(s.steps)), batch))
    
    # Obtain the boundary reward
    reward_bound = np.percentile(disc_rewards, percentile)
    
    # Initialize an empty list for training observations
    train_obs = []
    
    # Initialize an empty list for training actions
    train_act = []
    
    # Initialize an empty list for elite batch
    elite_batch = []
    
    # Iterate through each episode and discounted reward in the batch
    for example, discounted_reward in zip(batch, disc_rewards):
        
        # If discounted reward is higher than boundary reward
        if discounted_reward > reward_bound:
            
            # Populate the list of observations we will train on    
            train_obs.extend(map(lambda step: step.observation, example.steps))
            
            # Populate the list of actions we will train on    
            train_act.extend(map(lambda step: step.action, example.steps))
            
            # Append the episode to elite_batch list
            elite_batch.append(example)
            
    return elite_batch, train_obs, train_act, reward_bound

Then, in the training loop, we will store previous "elite" episodes to pass them to the preceding function on the next training iteration.

In [None]:
# Initialize an empty list for the full batch
full_batch = []

# Iterate through batches
for iter_no, batch in enumerate(iterate_batches(env, net, BATCH_SIZE)):
    
    # Get the mean reward
    reward_mean = float(np.mean(list(map(lambda s: s.reward, batch))))
    
    # Filter the stored elite episodes together with the new batch; get the elite episodes, observations, actions, and boundary reward
    full_batch, obs, acts, reward_bound = filter_batch(full_batch + batch, PERCENTILE)
    
    # If no elite episodes were kept, skip this iteration
    if not full_batch:
        continue
        
    # Convert the observations into a tensor    
    obs_v = torch.FloatTensor(obs)
    
    # Convert the actions into a tensor
    acts_v = torch.LongTensor(acts)
    
    # Keep only the 500 most recent elite episodes and drop the rest
    full_batch = full_batch[-500:]
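
The following sketch shows why the loop above can skip an iteration, and how the elite buffer is capped. It assumes the same strict "greater than" comparison as filter_batch; the helper name elite_indices and the reward lists are hypothetical and exist only for this illustration.

```python
import numpy as np

PERCENTILE = 30
GAMMA = 0.9

def elite_indices(disc_rewards, percentile):
    """Indices of episodes whose discounted reward is strictly above the boundary."""
    reward_bound = np.percentile(disc_rewards, percentile)
    return [i for i, r in enumerate(disc_rewards) if r > reward_bound]

# A batch of 100 failed episodes: the boundary is 0.0 and nothing is above it,
# so the elite list is empty and the training step is skipped.
all_failed = [0.0] * 100
print(elite_indices(all_failed, PERCENTILE))   # prints []

# A batch with a single success after 6 steps: only that episode survives.
one_success = [0.0] * 99 + [GAMMA ** 6]
print(elite_indices(one_success, PERCENTILE))  # prints [99]

# Capping the buffer: slicing with [-500:] keeps the 500 most recent entries.
buffer = list(range(620))
buffer = buffer[-500:]
print(len(buffer), buffer[0])                  # prints 500 120
```

Because failed episodes all score exactly zero, the strict comparison removes them from the elite buffer, which is what lets rare successes accumulate across iterations.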

The rest of the code is the same, except that the learning rate is decreased 10 times (from 0.01 to 0.001) and BATCH_SIZE is set to 100. After a period of patient waiting (the new version takes about one and a half hours to finish 10k iterations), we can see that training stops improving at around 55% of solved episodes. There are ways to address this (by applying entropy loss regularization, for example), but those techniques will be discussed in the upcoming chapters.

<img width="700px" src="./assets/convergence.png">

The final point to note here is the effect of "slipperiness" in the FrozenLake environment. Each of our actions is executed as intended with only 33% probability; otherwise, it is replaced with one of the two actions rotated by 90° (the "up" action, for instance, succeeds with probability 0.33, is replaced with the "left" action with probability 0.33, and with the "right" action with probability 0.33).
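
The slipperiness rule can be sketched in a few lines. This is a simplified model of the behavior described above, not the environment's actual code; the action numbering (0 = left, 1 = down, 2 = right, 3 = up) and the helper names are assumptions made for the example.

```python
import numpy as np

# Assumed action numbering for this sketch
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3

def slippery_outcomes(action):
    """The intended action and its two 90-degree neighbours, all equally likely."""
    return [(action - 1) % 4, action, (action + 1) % 4]

def slip(action, rng):
    """Sample the action that is actually executed."""
    return int(rng.choice(slippery_outcomes(action)))

rng = np.random.default_rng(0)
samples = [slip(UP, rng) for _ in range(30000)]

# Empirical frequency of each executed action when "up" was requested
freq = {a: samples.count(a) / len(samples) for a in slippery_outcomes(UP)}
print(freq)
```

When "up" is requested, the agent moves up, left, or right about a third of the time each, and never down.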

The nonslippery version is in __Chapter04/04_frozenlake_nonslippery.py__, and the only difference is in the environment creation (we need to peek into the core of Gym to create the instance of the environment with tweaked arguments):

In [None]:
# Create the FrozenLake environment that is not slippery
env = gym.envs.toy_text.frozen_lake.FrozenLakeEnv(is_slippery=False)

# Give a time limit of 100 steps
env = gym.wrappers.TimeLimit(env, max_episode_steps=100)

# Create a One-Hot Encoding for each of the states in the environment
env = DiscreteOneHotWrapper(env)

The effect is dramatic! The nonslippery version of the environment can be solved in 120-140 batch iterations, which is 100 times faster than the noisy environment:

                    rl_book_samples/Chapter04$ ./04_frozenlake_nonslippery.py
                    0: loss=1.379, reward_mean=0.010, reward_bound=0.000, batch=1
                    1: loss=1.375, reward_mean=0.010, reward_bound=0.000, batch=2
                    2: loss=1.359, reward_mean=0.010, reward_bound=0.000, batch=3
                    3: loss=1.361, reward_mean=0.010, reward_bound=0.000, batch=4
                    4: loss=1.355, reward_mean=0.000, reward_bound=0.000, batch=4
                    5: loss=1.342, reward_mean=0.010, reward_bound=0.000, batch=5
                    6: loss=1.353, reward_mean=0.020, reward_bound=0.000, batch=7
                    7: loss=1.351, reward_mean=0.040, reward_bound=0.000, batch=11
                    ......
                    124: loss=0.484, reward_mean=0.680, reward_bound=0.000, batch=68
                    125: loss=0.373, reward_mean=0.710, reward_bound=0.430, batch=114
                    126: loss=0.305, reward_mean=0.690, reward_bound=0.478, batch=133
                    128: loss=0.413, reward_mean=0.790, reward_bound=0.478, batch=73
                    129: loss=0.297, reward_mean=0.810, reward_bound=0.478, batch=108
                    Solved!
                 
<br>                 
<img width="700px" src="./assets/convergence_2.png">

<br>

# 5. Theoretical background of the cross-entropy method

---

This section is optional and included for readers who are interested in why the method works. If you wish, you can refer to the original paper on cross-entropy, which is referenced at the end of the section.

The basis of the cross-entropy method lies in the importance sampling theorem, which states this:

<img width="600px" src="assets/formula_1.png">
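
Assuming the formula above is the standard importance sampling identity, that is, the expectation of H(x) under p(x) equals the expectation of (p(x)/q(x))·H(x) under q(x), it can be checked numerically. The sketch below uses a small discrete example; the distributions and the values of H are made up for illustration.

```python
import numpy as np

# Hypothetical discrete example with four outcomes x = 0..3
p = np.array([0.1, 0.2, 0.3, 0.4])       # target distribution p(x)
q = np.array([0.25, 0.25, 0.25, 0.25])   # sampling distribution q(x)
H = np.array([1.0, 0.0, 2.0, 5.0])       # value H(x) of each outcome

# Expectation of H under p, computed exactly
expect_under_p = float(np.sum(p * H))

# The same quantity written as an expectation under q of the reweighted values
expect_under_q = float(np.sum(q * (p / q) * H))

# Monte Carlo estimate: draw samples from q only and reweight them by p/q
rng = np.random.default_rng(0)
xs = rng.choice(len(q), size=200_000, p=q)
mc_estimate = float(np.mean(p[xs] / q[xs] * H[xs]))

print(expect_under_p, expect_under_q, mc_estimate)
```

The two exact values agree, and the sampled estimate comes close to them even though no sample was drawn from p(x).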

In our RL case, H(x) is a reward value obtained by some policy x, and p(x) is a distribution over all possible policies. We don't want to maximize our reward by searching all possible policies; instead, we want to find a way to approximate p(x)H(x) by q(x), iteratively minimizing the distance between them. The distance between two probability distributions is calculated by the Kullback-Leibler (KL) divergence, which is as follows:

<img width="600px" src="assets/formula_2.png">

The first term in KL is called entropy and doesn't depend on the second (approximating) distribution, so it can be omitted during the minimization. The second term is called cross-entropy and is a very common optimization objective in DL.
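
A small numerical sketch of this decomposition, assuming the usual definition of KL divergence for discrete distributions. The distributions below are made up for the example, and the names p1 and p2 are labels chosen here for the fixed and the approximating distribution.

```python
import numpy as np

# Fixed distribution p1 and three hypothetical candidates for the approximation p2
p1 = np.array([0.5, 0.3, 0.2])
candidates = {
    "uniform": np.array([1 / 3, 1 / 3, 1 / 3]),
    "close": np.array([0.45, 0.35, 0.20]),
    "exact": p1.copy(),
}

def kl(p1, p2):
    """KL divergence between two discrete distributions."""
    return float(np.sum(p1 * np.log(p1 / p2)))

def cross_entropy(p1, p2):
    """Cross-entropy of p2 relative to p1."""
    return float(-np.sum(p1 * np.log(p2)))

# The first term depends only on p1 (strictly, it is the negative entropy of p1)
first_term = float(np.sum(p1 * np.log(p1)))

for name, p2 in candidates.items():
    print(name, kl(p1, p2), first_term + cross_entropy(p1, p2))

# Since the first term is constant, both criteria pick the same candidate
best_by_kl = min(candidates, key=lambda n: kl(p1, candidates[n]))
best_by_ce = min(candidates, key=lambda n: cross_entropy(p1, candidates[n]))
print(best_by_kl, best_by_ce)
```

Because the first term is the same for every candidate, ranking candidates by cross-entropy gives the same result as ranking them by KL divergence.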


Combining both formulas, we get an iterative algorithm that starts with q0(x) = p(x) and improves on every step. Each step gives a better approximation of p(x)H(x), using the following update:

<img width="500px" src="assets/formula_3.png">

This is the generic cross-entropy method, which can be significantly simplified in our RL case. First, we replace H(x) with an indicator function, which is 1 when the reward for the episode is above the threshold and 0 when it is below. Our policy update then looks like this:

<img width="500px" src="assets/formula_4.png">

Strictly speaking, the preceding formula is missing the normalization term, but it still works in practice without it. So, the method is quite clear: we sample episodes using our current policy (starting with some random initial policy) and minimize the negative log likelihood of the most successful samples under our policy.

There is a whole book dedicated to this method, written by Dirk P. Kroese. A shorter description can be found in the Cross-Entropy Method paper by Dirk P. Kroese (https://people.smp.uq.edu.au/DirkKroese/ps/eormsCE.pdf).
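
The negative log likelihood objective mentioned above is what nn.CrossEntropyLoss computes in the training code of this chapter. The sketch below checks this on random data; the logits and the "elite" actions are made up for the example and do not come from a trained network.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical network outputs (raw scores) for 5 elite steps and 4 actions
logits = torch.randn(5, 4)
# Hypothetical actions taken at those steps in the elite episodes
actions = torch.tensor([0, 2, 1, 3, 2])

# Loss used in the training loop
loss = nn.CrossEntropyLoss()(logits, actions)

# The same quantity by hand: mean negative log probability of the taken actions
log_probs = torch.log_softmax(logits, dim=1)
nll = -log_probs[torch.arange(len(actions)), actions].mean()

print(loss.item(), nll.item())
```

Minimizing this loss raises the probability that the policy assigns to the actions seen in the elite episodes.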

<br>

# 6. FrozenLake Naive

---

In [1]:
import gym, gym.spaces
from collections import namedtuple
import numpy as np
from tensorboardX import SummaryWriter
import torch
import torch.nn as nn
import torch.optim as optim

In [2]:
HIDDEN_SIZE = 128
BATCH_SIZE = 16
PERCENTILE = 70

In [3]:
class DiscreteOneHotWrapper(gym.ObservationWrapper):
    def __init__(self, env):
        super(DiscreteOneHotWrapper, self).__init__(env)
        assert isinstance(env.observation_space, gym.spaces.Discrete)
        self.observation_space = gym.spaces.Box(0.0, 1.0, (env.observation_space.n, ), dtype=np.float32)

    def observation(self, observation):
        res = np.copy(self.observation_space.low)
        res[observation] = 1.0
        return res

In [5]:
class Net(nn.Module):
    def __init__(self, obs_size, hidden_size, n_actions):
        super(Net, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, n_actions)
        )

    def forward(self, x):
        return self.net(x)

In [6]:
Episode = namedtuple('Episode', field_names=['reward', 'steps'])
EpisodeStep = namedtuple('EpisodeStep', field_names=['observation', 'action'])

In [7]:
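def iterate_batches(env, net, batch_size):
    # Episodes collected for the current batch
    batch = []
    # Total reward and steps of the episode in progress
    episode_reward = 0.0
    episode_steps = []
    obs = env.reset()
    # Softmax turns the network's raw scores into action probabilities
    sm = nn.Softmax(dim=1)
    while True:
        # Wrap the observation in a batch of size 1 and get the action probabilities
        obs_v = torch.FloatTensor([obs])
        act_probs_v = sm(net(obs_v))
        act_probs = act_probs_v.data.numpy()[0]
        # Sample an action according to the current policy
        action = np.random.choice(len(act_probs), p=act_probs)
        next_obs, reward, is_done, _ = env.step(action)
        episode_reward += reward
        # Store the observation that led to the action, not the resulting one
        episode_steps.append(EpisodeStep(observation=obs, action=action))
        if is_done:
            # Episode finished: save it and reset the environment
            batch.append(Episode(reward=episode_reward, steps=episode_steps))
            episode_reward = 0.0
            episode_steps = []
            next_obs = env.reset()
            if len(batch) == batch_size:
                # Hand the full batch to the training loop, then start a new one
                yield batch
                batch = []
        obs = next_obs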

In [8]:
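def filter_batch(batch, percentile):
    # Undiscounted total reward of every episode in the batch
    rewards = list(map(lambda s: s.reward, batch))
    # Boundary reward: episodes below it are discarded
    reward_bound = np.percentile(rewards, percentile)
    # Mean reward, used only for monitoring
    reward_mean = float(np.mean(rewards))

    train_obs = []
    train_act = []
    for example in batch:
        # Skip episodes below the boundary
        if example.reward < reward_bound:
            continue
        # Keep the observations and actions of the remaining episodes
        train_obs.extend(map(lambda step: step.observation, example.steps))
        train_act.extend(map(lambda step: step.action, example.steps))

    # Convert the training data to tensors
    train_obs_v = torch.FloatTensor(train_obs)
    train_act_v = torch.LongTensor(train_act)
    return train_obs_v, train_act_v, reward_bound, reward_mean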

In [None]:
if __name__ == "__main__":
    env = DiscreteOneHotWrapper(gym.make("FrozenLake-v0"))
    # env = gym.wrappers.Monitor(env, directory="mon", force=True)
    obs_size = env.observation_space.shape[0]
    n_actions = env.action_space.n

    net = Net(obs_size, HIDDEN_SIZE, n_actions)
    objective = nn.CrossEntropyLoss()
    optimizer = optim.Adam(params=net.parameters(), lr=0.01)
    writer = SummaryWriter(comment="-frozenlake-naive")

    for iter_no, batch in enumerate(iterate_batches(env, net, BATCH_SIZE)):
        obs_v, acts_v, reward_b, reward_m = filter_batch(batch, PERCENTILE)
        optimizer.zero_grad()
        action_scores_v = net(obs_v)
        loss_v = objective(action_scores_v, acts_v)
        loss_v.backward()
        optimizer.step()
        print("%d: loss=%.3f, reward_mean=%.1f, reward_bound=%.1f" % (
            iter_no, loss_v.item(), reward_m, reward_b))
        writer.add_scalar("loss", loss_v.item(), iter_no)
        writer.add_scalar("reward_bound", reward_b, iter_no)
        writer.add_scalar("reward_mean", reward_m, iter_no)
        if reward_m > 0.8:
            print("Solved!")
            break
    writer.close()

0: loss=1.387, reward_mean=0.0, reward_bound=0.0
1: loss=1.367, reward_mean=0.0, reward_bound=0.0
2: loss=1.340, reward_mean=0.0, reward_bound=0.0
3: loss=1.342, reward_mean=0.0, reward_bound=0.0
4: loss=1.267, reward_mean=0.1, reward_bound=0.0
5: loss=1.237, reward_mean=0.0, reward_bound=0.0
6: loss=1.219, reward_mean=0.1, reward_bound=0.0
7: loss=1.134, reward_mean=0.0, reward_bound=0.0
8: loss=1.134, reward_mean=0.0, reward_bound=0.0
9: loss=1.086, reward_mean=0.0, reward_bound=0.0
10: loss=1.014, reward_mean=0.0, reward_bound=0.0
11: loss=1.081, reward_mean=0.0, reward_bound=0.0
12: loss=0.939, reward_mean=0.0, reward_bound=0.0
13: loss=1.055, reward_mean=0.0, reward_bound=0.0
14: loss=0.941, reward_mean=0.0, reward_bound=0.0
15: loss=0.981, reward_mean=0.0, reward_bound=0.0
16: loss=0.939, reward_mean=0.0, reward_bound=0.0
17: loss=0.930, reward_mean=0.0, reward_bound=0.0
18: loss=0.995, reward_mean=0.0, reward_bound=0.0
19: loss=0.910, reward_mean=0.1, reward_bound=0.0
20: loss=0

164: loss=0.363, reward_mean=0.0, reward_bound=0.0
165: loss=0.463, reward_mean=0.1, reward_bound=0.0
166: loss=0.412, reward_mean=0.0, reward_bound=0.0
167: loss=0.241, reward_mean=0.0, reward_bound=0.0
168: loss=0.352, reward_mean=0.0, reward_bound=0.0
169: loss=0.366, reward_mean=0.0, reward_bound=0.0
170: loss=0.199, reward_mean=0.0, reward_bound=0.0
171: loss=0.207, reward_mean=0.0, reward_bound=0.0
172: loss=0.175, reward_mean=0.0, reward_bound=0.0
173: loss=0.346, reward_mean=0.0, reward_bound=0.0
174: loss=0.262, reward_mean=0.0, reward_bound=0.0
175: loss=0.420, reward_mean=0.0, reward_bound=0.0
176: loss=0.115, reward_mean=0.0, reward_bound=0.0
177: loss=0.212, reward_mean=0.0, reward_bound=0.0
178: loss=0.117, reward_mean=0.0, reward_bound=0.0
179: loss=0.205, reward_mean=0.0, reward_bound=0.0
180: loss=0.249, reward_mean=0.0, reward_bound=0.0
181: loss=0.106, reward_mean=0.0, reward_bound=0.0
182: loss=0.112, reward_mean=0.0, reward_bound=0.0
183: loss=0.147, reward_mean=0.

<br>

# 7. FrozenLake Tweaked

---

In [1]:
import random
import gym
import gym.spaces
from collections import namedtuple
import numpy as np
from tensorboardX import SummaryWriter
import torch
import torch.nn as nn
import torch.optim as optim

In [2]:
HIDDEN_SIZE = 128
BATCH_SIZE = 100
PERCENTILE = 30
GAMMA = 0.9

In [3]:
class DiscreteOneHotWrapper(gym.ObservationWrapper):
    def __init__(self, env):
        super(DiscreteOneHotWrapper, self).__init__(env)
        assert isinstance(env.observation_space, gym.spaces.Discrete)
        self.observation_space = gym.spaces.Box(0.0, 1.0, (env.observation_space.n, ), dtype=np.float32)

    def observation(self, observation):
        res = np.copy(self.observation_space.low)
        res[observation] = 1.0
        return res

In [4]:
class Net(nn.Module):
    def __init__(self, obs_size, hidden_size, n_actions):
        super(Net, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, n_actions)
        )

    def forward(self, x):
        return self.net(x)

In [5]:
Episode = namedtuple('Episode', field_names=['reward', 'steps'])
EpisodeStep = namedtuple('EpisodeStep', field_names=['observation', 'action'])

In [6]:
def iterate_batches(env, net, batch_size):
    batch = []
    episode_reward = 0.0
    episode_steps = []
    obs = env.reset()
    sm = nn.Softmax(dim=1)
    while True:
        obs_v = torch.FloatTensor([obs])
        act_probs_v = sm(net(obs_v))
        act_probs = act_probs_v.data.numpy()[0]
        action = np.random.choice(len(act_probs), p=act_probs)
        next_obs, reward, is_done, _ = env.step(action)
        episode_reward += reward
        episode_steps.append(EpisodeStep(observation=obs, action=action))
        if is_done:
            batch.append(Episode(reward=episode_reward, steps=episode_steps))
            episode_reward = 0.0
            episode_steps = []
            next_obs = env.reset()
            if len(batch) == batch_size:
                yield batch
                batch = []
        obs = next_obs

In [7]:
def filter_batch(batch, percentile):
    disc_rewards = list(map(lambda s: s.reward * (GAMMA ** len(s.steps)), batch))
    reward_bound = np.percentile(disc_rewards, percentile)

    train_obs = []
    train_act = []
    elite_batch = []
    for example, discounted_reward in zip(batch, disc_rewards):
        if discounted_reward > reward_bound:
            train_obs.extend(map(lambda step: step.observation, example.steps))
            train_act.extend(map(lambda step: step.action, example.steps))
            elite_batch.append(example)

    return elite_batch, train_obs, train_act, reward_bound

In [None]:
if __name__ == "__main__":
    random.seed(12345)
    env = DiscreteOneHotWrapper(gym.make("FrozenLake-v0"))
    # env = gym.wrappers.Monitor(env, directory="mon", force=True)
    obs_size = env.observation_space.shape[0]
    n_actions = env.action_space.n

    net = Net(obs_size, HIDDEN_SIZE, n_actions)
    objective = nn.CrossEntropyLoss()
    optimizer = optim.Adam(params=net.parameters(), lr=0.001)
    writer = SummaryWriter(comment="-frozenlake-tweaked")

    full_batch = []
    for iter_no, batch in enumerate(iterate_batches(env, net, BATCH_SIZE)):
        reward_mean = float(np.mean(list(map(lambda s: s.reward, batch))))
        full_batch, obs, acts, reward_bound = filter_batch(full_batch + batch, PERCENTILE)
        if not full_batch:
            continue
        obs_v = torch.FloatTensor(obs)
        acts_v = torch.LongTensor(acts)
        full_batch = full_batch[-500:]

        optimizer.zero_grad()
        action_scores_v = net(obs_v)
        loss_v = objective(action_scores_v, acts_v)
        loss_v.backward()
        optimizer.step()
        print("%d: loss=%.3f, reward_mean=%.3f, reward_bound=%.3f, batch=%d" % (
            iter_no, loss_v.item(), reward_mean, reward_bound, len(full_batch)))
        writer.add_scalar("loss", loss_v.item(), iter_no)
        writer.add_scalar("reward_mean", reward_mean, iter_no)
        writer.add_scalar("reward_bound", reward_bound, iter_no)
        if reward_mean > 0.8:
            print("Solved!")
            break
    writer.close()

0: loss=1.394, reward_mean=0.020, reward_bound=0.000, batch=2
1: loss=1.386, reward_mean=0.010, reward_bound=0.000, batch=3
2: loss=1.386, reward_mean=0.010, reward_bound=0.000, batch=4
3: loss=1.382, reward_mean=0.000, reward_bound=0.000, batch=4
4: loss=1.384, reward_mean=0.010, reward_bound=0.000, batch=5
5: loss=1.378, reward_mean=0.040, reward_bound=0.000, batch=9
6: loss=1.373, reward_mean=0.010, reward_bound=0.000, batch=10
7: loss=1.373, reward_mean=0.010, reward_bound=0.000, batch=11
8: loss=1.370, reward_mean=0.000, reward_bound=0.000, batch=11
9: loss=1.368, reward_mean=0.010, reward_bound=0.000, batch=12
10: loss=1.371, reward_mean=0.010, reward_bound=0.000, batch=13
11: loss=1.369, reward_mean=0.000, reward_bound=0.000, batch=13
12: loss=1.368, reward_mean=0.020, reward_bound=0.000, batch=15
13: loss=1.367, reward_mean=0.000, reward_bound=0.000, batch=15
14: loss=1.365, reward_mean=0.000, reward_bound=0.000, batch=15
15: loss=1.364, reward_mean=0.010, reward_bound=0.000, b

<br>

# 8. FrozenLake Non-Slippery

---

In [1]:
import random
import gym
import gym.spaces
import gym.wrappers
import gym.envs.toy_text.frozen_lake
from collections import namedtuple
import numpy as np
from tensorboardX import SummaryWriter
import torch
import torch.nn as nn
import torch.optim as optim

In [2]:
HIDDEN_SIZE = 128
BATCH_SIZE = 100
PERCENTILE = 30
GAMMA = 0.9

In [3]:
class DiscreteOneHotWrapper(gym.ObservationWrapper):
    def __init__(self, env):
        super(DiscreteOneHotWrapper, self).__init__(env)
        assert isinstance(env.observation_space, gym.spaces.Discrete)
        self.observation_space = gym.spaces.Box(0.0, 1.0, (env.observation_space.n, ), dtype=np.float32)

    def observation(self, observation):
        res = np.copy(self.observation_space.low)
        res[observation] = 1.0
        return res

In [4]:
class Net(nn.Module):
    def __init__(self, obs_size, hidden_size, n_actions):
        super(Net, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, n_actions)
        )

    def forward(self, x):
        return self.net(x)

In [5]:
Episode = namedtuple('Episode', field_names=['reward', 'steps'])
EpisodeStep = namedtuple('EpisodeStep', field_names=['observation', 'action'])

In [6]:
def iterate_batches(env, net, batch_size):
    batch = []
    episode_reward = 0.0
    episode_steps = []
    obs = env.reset()
    sm = nn.Softmax(dim=1)
    while True:
        obs_v = torch.FloatTensor([obs])
        act_probs_v = sm(net(obs_v))
        act_probs = act_probs_v.data.numpy()[0]
        action = np.random.choice(len(act_probs), p=act_probs)
        next_obs, reward, is_done, _ = env.step(action)
        episode_reward += reward
        episode_steps.append(EpisodeStep(observation=obs, action=action))
        if is_done:
            batch.append(Episode(reward=episode_reward, steps=episode_steps))
            episode_reward = 0.0
            episode_steps = []
            next_obs = env.reset()
            if len(batch) == batch_size:
                yield batch
                batch = []
        obs = next_obs

In [7]:
def filter_batch(batch, percentile):
    disc_rewards = list(map(lambda s: s.reward * (GAMMA ** len(s.steps)), batch))
    reward_bound = np.percentile(disc_rewards, percentile)

    train_obs = []
    train_act = []
    elite_batch = []
    for example, discounted_reward in zip(batch, disc_rewards):
        if discounted_reward > reward_bound:
            train_obs.extend(map(lambda step: step.observation, example.steps))
            train_act.extend(map(lambda step: step.action, example.steps))
            elite_batch.append(example)

    return elite_batch, train_obs, train_act, reward_bound

In [8]:
if __name__ == "__main__":
    random.seed(12345)
    env = gym.envs.toy_text.frozen_lake.FrozenLakeEnv(is_slippery=False)
    #env = gym.wrappers.TimeLimit(env, max_episode_steps=100)
    env = DiscreteOneHotWrapper(env)
    # env = gym.wrappers.Monitor(env, directory="mon", force=True)
    obs_size = env.observation_space.shape[0]
    n_actions = env.action_space.n

    net = Net(obs_size, HIDDEN_SIZE, n_actions)
    objective = nn.CrossEntropyLoss()
    optimizer = optim.Adam(params=net.parameters(), lr=0.001)
    writer = SummaryWriter(comment="-frozenlake-nonslippery")

    full_batch = []
    for iter_no, batch in enumerate(iterate_batches(env, net, BATCH_SIZE)):
        reward_mean = float(np.mean(list(map(lambda s: s.reward, batch))))
        full_batch, obs, acts, reward_bound = filter_batch(full_batch + batch, PERCENTILE)
        if not full_batch:
            continue
        obs_v = torch.FloatTensor(obs)
        acts_v = torch.LongTensor(acts)
        full_batch = full_batch[-500:]

        optimizer.zero_grad()
        action_scores_v = net(obs_v)
        loss_v = objective(action_scores_v, acts_v)
        loss_v.backward()
        optimizer.step()
        print("%d: loss=%.3f, reward_mean=%.3f, reward_bound=%.3f, batch=%d" % (
            iter_no, loss_v.item(), reward_mean, reward_bound, len(full_batch)))
        writer.add_scalar("loss", loss_v.item(), iter_no)
        writer.add_scalar("reward_mean", reward_mean, iter_no)
        writer.add_scalar("reward_bound", reward_bound, iter_no)
        if reward_mean > 0.8:
            print("Solved!")
            break
    writer.close()

0: loss=1.373, reward_mean=0.010, reward_bound=0.000, batch=1
1: loss=1.381, reward_mean=0.020, reward_bound=0.000, batch=3
2: loss=1.386, reward_mean=0.010, reward_bound=0.000, batch=4
3: loss=1.384, reward_mean=0.010, reward_bound=0.000, batch=5
4: loss=1.379, reward_mean=0.000, reward_bound=0.000, batch=5
5: loss=1.375, reward_mean=0.000, reward_bound=0.000, batch=5
6: loss=1.375, reward_mean=0.020, reward_bound=0.000, batch=7
7: loss=1.372, reward_mean=0.020, reward_bound=0.000, batch=9
8: loss=1.370, reward_mean=0.020, reward_bound=0.000, batch=11
9: loss=1.366, reward_mean=0.010, reward_bound=0.000, batch=12
10: loss=1.362, reward_mean=0.020, reward_bound=0.000, batch=14
11: loss=1.358, reward_mean=0.000, reward_bound=0.000, batch=14
12: loss=1.353, reward_mean=0.040, reward_bound=0.000, batch=18
13: loss=1.354, reward_mean=0.020, reward_bound=0.000, batch=20
14: loss=1.351, reward_mean=0.050, reward_bound=0.000, batch=25
15: loss=1.349, reward_mean=0.060, reward_bound=0.000, bat

<br>

# 9. Summary

---

In this chapter, we became familiar with our first RL method, cross-entropy, which is simple but quite powerful, despite its limitations. We applied it to the CartPole environment (with huge success) and to FrozenLake (with much more modest success). This chapter ends the introductory part of the book.
In the upcoming chapters, we will explore more complex, but more powerful, tools of deep RL.

___THE END___