# **Deep Reinforcement Learning**

# M3-1 Introduction to Approximate Solutions

## Simple NN-based Agent 

Below we will see a simple example that will allow us to understand the concepts introduced in this module. 

## 1. CartPole Environment

In this exercise we are going to load the [CartPole](https://gymnasium.farama.org/environments/classic_control/cart_pole/) environment and perform some tests.

The following code loads the necessary packages for the example, creates the environment using the `make` method, and prints:
- the installed **Gymnasium version**,
- the **action space** (two actions: 0 = push the cart left and 1 = push the cart right),
- the **observation space** (four observations: cart position, cart velocity, pole angle, and pole angular velocity).

The code does not print a reward range; in CartPole the agent receives a reward of +1 for every step taken.

In [1]:
import gymnasium as gym
import numpy as np

env = gym.make('CartPole-v1')

print("Gymnasium version is {} ".format(gym.__version__))
print("Action space is {} ".format(env.action_space))
print("Observation space is {} ".format(env.observation_space))

Gymnasium version is 1.2.1 
Action space is Discrete(2) 
Observation space is Box([-4.8               -inf -0.41887903        -inf], [4.8               inf 0.41887903        inf], (4,), float32) 


## 2. Defining a simple NN-based Agent


<u>Notes</u>:
- This code is based on [Deep-Reinforcement-Learning-Hands-On-Second-Edition, published by Packt](https://github.com/PacktPublishing/Deep-Reinforcement-Learning-Hands-On-Second-Edition)

In [2]:
from collections import namedtuple
import numpy as np

import torch
import torch.nn as nn
import torch.optim as optim

In [3]:
import torch
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else "No GPU detected")

#Use cuda if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

True
NVIDIA GeForce RTX 3050 Laptop GPU
Using device: cuda


In [4]:
import wandb

# start a new wandb run to track this script
wandb.init(project="M3-1_Example_1")

[34m[1mwandb[0m: Currently logged in as: [33mvictorbrr17[0m ([33mvictorbrr17-universitat-aut-noma-de-barcelona[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


We start by defining the hyperparameters as constants: the number of neurons in the hidden layer, the number of episodes we play in every iteration (16), and the percentile of episodes' total rewards that we use for elite episode filtering. We take the 70th percentile, which means that we keep the top 30% of episodes ranked by reward:

In [5]:
HIDDEN_SIZE = 128
BATCH_SIZE = 16
PERCENTILE = 70

Our model's core is a one-hidden-layer neural network, with ReLU and 128 hidden neurons (which is absolutely arbitrary). Other hyperparameters are also set almost randomly and aren't tuned, as the method is robust and converges very quickly.

There is nothing special about our network:
- It takes a single observation from the environment as an input vector and outputs a number for every action we can perform. 
- The output of the network should be interpreted as a probability distribution over actions, so a straightforward way to proceed would be to add a softmax nonlinearity after the last layer. However, the following network deliberately omits softmax, because leaving it out improves the numerical stability of the training process.

Rather than calculating softmax (which uses exponentiation) and then calculating cross-entropy loss (which uses logarithm of probabilities), we'll use the PyTorch class, `nn.CrossEntropyLoss`, which combines both softmax and cross-entropy in a single, more numerically stable expression. `CrossEntropyLoss` requires raw, unnormalized values from the network (also called _logits_), and the downside of this is that we need to remember to apply softmax every time we need to get probabilities from our network's output.
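The stability argument can be illustrated with a minimal NumPy sketch. This is not PyTorch's actual implementation; the two helper functions and the example logits are assumptions made up for illustration. The first function applies softmax and then the logarithm, while the second computes the log-probabilities directly with the log-sum-exp trick:

```python
import numpy as np

def naive_cross_entropy(logits, target):
    """Softmax first, then log: the intermediate exp() can overflow."""
    # Warnings are silenced so the overflow shows up only in the result.
    with np.errstate(over="ignore", invalid="ignore", divide="ignore"):
        exp = np.exp(logits)
        probs = exp / exp.sum()
        return -np.log(probs[target])

def fused_cross_entropy(logits, target):
    """Log-softmax computed directly via the log-sum-exp trick."""
    shifted = logits - logits.max()          # largest logit becomes 0
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[target]

small = np.array([1.0, 2.0])      # made-up, well-behaved logits
large = np.array([1000.0, 0.0])   # made-up, extreme logits

# Both versions agree on moderate logits.
print(naive_cross_entropy(small, 0), fused_cross_entropy(small, 0))
# On extreme logits the naive version is no longer finite.
print(naive_cross_entropy(large, 1), fused_cross_entropy(large, 1))
```

For moderate logits the two functions agree, but for the extreme pair only the fused version returns a finite loss.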

In [6]:
class Net(nn.Module):
    def __init__(self, obs_size, hidden_size, n_actions):
        super(Net, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, n_actions)
        )

    def forward(self, x):
        return self.net(x)

Here we will define two helper classes that are named tuples from the collections package in the standard library:
- `EpisodeStep`: This will be used to represent one single step that our agent made in the episode, and it stores the observation from the environment and what action the agent completed. We'll use episode steps from elite episodes as training data.
- `Episode`: This is a single episode stored as total undiscounted reward and a collection of EpisodeStep.

In [7]:
Episode = namedtuple('Episode', field_names=['reward', 'steps'])
EpisodeStep = namedtuple('EpisodeStep', field_names=['observation', 'action'])
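As a quick illustration of how these tuples are used, the following sketch builds a two-step episode by hand. The observations, actions, and reward are made-up values, and the observations are shortened to two numbers for brevity:

```python
from collections import namedtuple

Episode = namedtuple('Episode', field_names=['reward', 'steps'])
EpisodeStep = namedtuple('EpisodeStep', field_names=['observation', 'action'])

# Made-up two-step episode (real CartPole observations have four numbers).
steps = [
    EpisodeStep(observation=[0.0, 0.1], action=1),
    EpisodeStep(observation=[0.1, 0.2], action=0),
]
episode = Episode(reward=2.0, steps=steps)

# Fields can be read by name or by tuple unpacking.
reward, episode_steps = episode
actions = [step.action for step in episode_steps]
print(reward, actions)
```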

Let's look at a function that generates batches with episodes:

- The function accepts the environment (the `Env` class instance from the Gymnasium library), our neural network, and the count of episodes it should generate on every iteration. 
- The `batch` variable will be used to accumulate our batch (which is a list of the `Episode` instances). 
- We also declare a reward counter for the current episode and its list of steps (the `EpisodeStep` objects). 

Then we reset our environment to obtain the first observation and create a softmax layer, which will be used to convert the network's output to a probability distribution of actions.

In [8]:
def iterate_batches(env, net, batch_size):
    batch = []
    episode_reward = 0.0
    episode_steps = []
    obs = env.reset()[0]
    sm = nn.Softmax(dim=1)
    
    while True:
        obs_v = torch.FloatTensor(np.array([obs]))
        act_probs_v = sm(net(obs_v))
        act_probs = act_probs_v.data.numpy()[0]
        action = np.random.choice(len(act_probs), p=act_probs)
        next_obs, reward, terminated, truncated, _ = env.step(action)
        is_done = terminated or truncated
        episode_reward += reward
        step = EpisodeStep(observation=obs, action=action)
        episode_steps.append(step)
        
        if is_done:
            e = Episode(reward=episode_reward, steps=episode_steps)
            batch.append(e)
            episode_reward = 0.0
            episode_steps = []
            next_obs = env.reset()[0]
            if len(batch) == batch_size:
                yield batch
                batch = []
        obs = next_obs

At every iteration of the `while True` loop, we convert our current observation to a PyTorch tensor and pass it to the network to obtain action probabilities. There are several things to note here:
- All `nn.Module` instances in PyTorch expect a batch of data items and the same is true for our network, so we convert our observation (which is a vector of four numbers in CartPole) into a tensor of size $1 \times 4$ (to achieve this we pass an observation in a single-element list).

- As we haven't used nonlinearity at the output of our network, it outputs raw action scores, which we need to feed through the softmax function.

- Both our network and the softmax layer return tensors that track gradients, so we unpack them by accessing the `tensor.data` field and then convert the tensor into a NumPy array. This array has the same two-dimensional structure as the input, with the batch dimension on axis 0, so we take the first batch element to obtain a one-dimensional vector of action probabilities:

> act_probs = act_probs_v.data.numpy()[0]

- Now that we have the probability distribution over actions, we sample it with NumPy's `random.choice()` function to obtain the actual action for the current step. We then pass this action to the environment to get the next observation, the reward, and the two flags (`terminated` and `truncated`) that indicate the end of the episode:

> action = np.random.choice(len(act_probs), p=act_probs)

> next_obs, reward, terminated, truncated, _ = env.step(action)

- The reward is added to the current episode's total reward, and our list of episode steps is extended with an (observation, action) pair. Note that we save the observation that was used to choose the action, not the observation returned by the environment as a result of the action. These are the tiny but important details that you need to keep in mind:

> episode_reward += reward

> episode_steps.append(EpisodeStep(observation=obs, action=action))
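The shape handling and sampling described in these notes can be sketched without PyTorch. In the NumPy version below, the observation values and the random weight matrix `W` are hypothetical stand-ins for a real CartPole observation and for the network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up CartPole observation: four numbers.
obs = np.array([0.01, -0.02, 0.03, 0.04])
obs_batch = np.array([obs])        # shape (1, 4): a batch with one item

# Hypothetical stand-in for the network: one linear layer with 2 outputs.
W = rng.normal(size=(4, 2))
logits = obs_batch @ W             # shape (1, 2): raw action scores

# Softmax over the action axis (axis 1), mirroring nn.Softmax(dim=1).
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

act_probs = probs[0]               # drop the batch dimension
action = rng.choice(len(act_probs), p=act_probs)
print(obs_batch.shape, logits.shape, act_probs, action)
```

Sampling from the distribution, rather than always taking the most probable action, is what gives the agent its exploration.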

At the end of the episode (the `if is_done` branch):

- This branch handles the situation when the current episode is over.
- We append the finalized episode to the batch, saving the total reward (as the episode has been completed and we've accumulated all the reward) and the steps we've taken.
- Then we reset our total reward accumulator and clear the list of steps. After that, we reset our environment to start over.
- If our batch has reached the desired number of episodes, we return it to the caller for processing, using `yield`. Our function is a generator, so every time the `yield` operator is executed, control is transferred to the outer iteration loop, and execution later resumes right after the `yield` line.
- Once the caller has processed the batch and control returns to the generator, we reset `batch` to an empty list.

The last, but very important, step in our loop is to assign an observation obtained from the environment to our current observation variable. After that, everything repeats infinitely: we pass the observation to the net, sample the action to perform, ask the environment to process the action, and remember the result of this processing.

One very important fact to understand in this function's logic is that the training of our network and the generation of our episodes are interleaved. They do not run in parallel, but every time our loop accumulates enough episodes (16), it passes control to the function's caller, which is supposed to train the network using gradient descent. So, when execution resumes after `yield`, the network will have different, slightly better (we hope) behavior.

We don't need to worry about synchronization, as training and data gathering are performed in the same thread of execution, but you need to understand those constant jumps from training the network to using it.
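This back-and-forth between the generator and its caller can be seen in a toy sketch. The `policy` dictionary and the function `iterate_batches_toy` are hypothetical stand-ins for the network and for `iterate_batches`; incrementing the version number plays the role of a gradient step:

```python
def iterate_batches_toy(policy, batch_size):
    """Yield batches that record which policy version produced each item."""
    batch = []
    while True:
        batch.append(policy["version"])   # "play an episode" with the current policy
        if len(batch) == batch_size:
            yield batch                    # hand control to the caller
            batch = []                     # resumes here on the next iteration

policy = {"version": 0}
seen = []
for iter_no, batch in enumerate(iterate_batches_toy(policy, 3)):
    seen.append(batch)
    policy["version"] += 1                # stands in for one training step
    if iter_no == 2:
        break

print(seen)
```

Each batch is generated entirely by the policy version that existed when the generator was resumed, so the update made by the caller is visible in the very next batch.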

We need to define yet another function and we'll be ready to switch to the training loop:

In [9]:
def filter_batch(batch, percentile):
    rewards = list(map(lambda s: s.reward, batch))
    reward_bound = np.percentile(rewards, percentile)
    reward_mean = float(np.mean(rewards))

    train_obs = []
    train_act = []
    for reward, steps in batch:
        if reward < reward_bound:
            continue
        train_obs.extend(map(lambda step: step.observation, steps))
        train_act.extend(map(lambda step: step.action, steps))

    train_obs_v = torch.FloatTensor(np.array(train_obs))
    train_act_v = torch.LongTensor(np.array(train_act))
    return train_obs_v, train_act_v, reward_bound, reward_mean

This function is at the core of the cross-entropy method: 
- from the given batch of episodes and percentile value, it calculates a boundary reward, which is used to filter elite episodes to train on. 

To obtain the boundary reward, we use NumPy's `percentile` function, which takes the list of values and the desired percentile and returns the value at that percentile. Then we calculate the mean reward, which is used only for monitoring.

Next, we filter our episodes. For every episode in the batch, we check whether its total reward is at least as high as our boundary (episodes with `reward < reward_bound` are skipped), and if it is, we add its observations and actions to the lists that we will train on.

As the final step of the function, we convert the observations and actions from the elite episodes into tensors and return a tuple of four values: observations, actions, the reward boundary, and the mean reward. The last two values are used only for logging to Weights & Biases (wandb), to check the performance of our agent.
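To make the filtering concrete, here is a small sketch that applies the same percentile rule to ten made-up episode rewards (plain numbers rather than `Episode` objects):

```python
import numpy as np

# Made-up total rewards for a batch of ten episodes.
rewards = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

reward_bound = np.percentile(rewards, 70)          # boundary for elite episodes
reward_mean = float(np.mean(rewards))              # used only for monitoring
elite = [r for r in rewards if r >= reward_bound]  # same rule as filter_batch

print(reward_bound, reward_mean, elite)
```

With NumPy's default linear interpolation the boundary falls between the seventh and eighth values, so the three best episodes (the top 30%) are kept for training.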

The final chunk of code that glues everything together and mostly consists of the training loop is as follows:

In [10]:
env = gym.make("CartPole-v1")
obs_size = env.observation_space.shape[0]
n_actions = env.action_space.n

net = Net(obs_size, HIDDEN_SIZE, n_actions)
objective = nn.CrossEntropyLoss()
optimizer = optim.Adam(params=net.parameters(), lr=0.01)

for iter_no, batch in enumerate(iterate_batches(env, net, BATCH_SIZE)):
    obs_v, acts_v, reward_b, reward_m = filter_batch(batch, PERCENTILE)
    optimizer.zero_grad()
    action_scores_v = net(obs_v)
    loss_v = objective(action_scores_v, acts_v)
    loss_v.backward()
    optimizer.step()

    # print the metrics and log them to wandb
    print("Iteration %d: loss=%.3f, reward_mean=%.1f, reward_bound=%.1f" % (iter_no, loss_v.item(), reward_m, reward_b))
    wandb.log({"loss": loss_v.item(), "reward_mean": reward_m, "reward_bound": reward_b}, step=iter_no)
    
    if reward_m > 199:
        print("Solved!")
        break
        
# Finish the wandb run, necessary in notebooks
wandb.finish()

Iteration 0: loss=0.693, reward_mean=19.4, reward_bound=22.5
Iteration 1: loss=0.691, reward_mean=20.6, reward_bound=22.5
Iteration 2: loss=0.688, reward_mean=20.5, reward_bound=26.0
Iteration 3: loss=0.676, reward_mean=22.9, reward_bound=28.5
Iteration 4: loss=0.680, reward_mean=24.3, reward_bound=27.5
Iteration 5: loss=0.677, reward_mean=27.4, reward_bound=30.0
Iteration 6: loss=0.664, reward_mean=31.4, reward_bound=34.5
Iteration 7: loss=0.665, reward_mean=30.2, reward_bound=33.5
Iteration 8: loss=0.637, reward_mean=36.4, reward_bound=41.0
Iteration 9: loss=0.646, reward_mean=38.4, reward_bound=46.5
Iteration 10: loss=0.624, reward_mean=40.5, reward_bound=46.0
Iteration 11: loss=0.620, reward_mean=36.1, reward_bound=42.0
Iteration 12: loss=0.618, reward_mean=59.1, reward_bound=56.0
Iteration 13: loss=0.613, reward_mean=40.8, reward_bound=45.0
Iteration 14: loss=0.613, reward_mean=51.4, reward_bound=58.5
Iteration 15: loss=0.592, reward_mean=45.1, reward_bound=54.0
Iteration 16: loss

[34m[1mwandb[0m: [32m[41mERROR[0m The nbformat package was not found. It is required to save notebook history.


Iteration 57: loss=0.484, reward_mean=206.9, reward_bound=219.0
Solved!


Run history:
loss: ██▇██▇▆▇▆▆▅▅▅▅▅▄▄▃▃▃▃▃▂▃▂▂▂▂▂▂▂▂▁▂▂▁▁▁▁▁
reward_bound: ▁▁▁▁▁▁▁▂▂▂▂▂▂▃▂▂▃▃▃▃▃▄▃▃▃▄▄▄▆▆▆▆▆▇▆▆▆▇▆█
reward_mean: ▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▃▃▃▃▃▃▄▃▃▃▄▄▄▅▅▆▆▆▆▆▆▇▆▆█

Run summary:
loss: 0.48419
reward_bound: 219.0
reward_mean: 206.875


In the beginning, we create all the required objects: the environment, our neural network, the objective function, and the optimizer. The wandb run used for logging was already started earlier with `wandb.init`.

In the training loop, we will iterate our batches (which are a list of `Episode` objects), then we perform filtering of the elite episodes using the `filter_batch` function. The result is variables of observations and taken actions, the reward boundary used for filtering and the mean reward. After that, we zero gradients of our network and pass observations to the network, obtaining its action scores. These scores are passed to the objective function, which calculates cross-entropy between the network output and the actions that the agent took. The idea of this is to reinforce our network to carry out those "elite" actions which have led to good rewards. Then, we will calculate gradients on the loss and ask the optimizer to adjust our network.
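The effect of one such update can be sketched in NumPy with a hand-written gradient. This is an illustration under simplifying assumptions, not the notebook's training code: the policy is a single linear layer `W`, the "elite" observations and actions are made up, and plain gradient descent replaces Adam:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
obs = rng.normal(size=(5, 4))        # five made-up elite observations
acts = np.array([0, 1, 1, 0, 1])     # actions taken in those states
W = np.zeros((4, 2))                 # hypothetical linear policy: logits = obs @ W

def loss_and_grad(W):
    """Mean cross-entropy between the policy output and the taken actions."""
    n = len(acts)
    probs = softmax(obs @ W)
    loss = -np.log(probs[np.arange(n), acts]).mean()
    dlogits = probs.copy()
    dlogits[np.arange(n), acts] -= 1.0   # gradient w.r.t. logits: probs - one_hot
    return loss, obs.T @ dlogits / n

loss_before, grad = loss_and_grad(W)
W = W - 0.1 * grad                   # one plain gradient-descent step
loss_after, _ = loss_and_grad(W)
print(loss_before, loss_after)
```

With all-zero weights both actions are equally likely, so the initial loss is ln 2 ≈ 0.693, the same value the training log shows at iteration 0. After the step the loss is lower, meaning the policy assigns more probability to the elite actions.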

The rest of the loop is mostly the monitoring of progress. On the console, we show the iteration number, the loss, the mean reward of the batch, and the reward boundary. We also log the same values to wandb, to get a nice chart of the agent's learning performance.

The last check in the loop compares the mean reward of the episodes in our batch with a threshold. When it becomes greater than `199`, we stop training. Why `199`? The value comes from the older `CartPole-v0` environment, which is considered solved when the mean reward over the last 100 episodes reaches 195, and whose `TimeLimit` wrapper stops every episode after 200 steps. The `CartPole-v1` environment used here raises the episode limit to 500 steps and the solving threshold to 475, which is why the final log line can report a mean reward above 200 (206.9). A properly trained agent could balance the stick indefinitely, so it is the time limit that caps the score. Note also that we use the mean over the 16 episodes of the current batch rather than over the last 100 episodes, which is a noisier estimate. With all this in mind, we stop training once the mean reward in the batch exceeds `199`: this is a good indication that our agent has learned to balance the stick, although it is a looser criterion than the official `CartPole-v1` threshold.