In [1]:
### Note: this notebook is taken from the tutorial provided on the SB3 website. Go there for more information and documentation.

# Stable Baselines3 Tutorial - Getting Started

Github repo: https://github.com/araffin/rl-tutorial-jnrr19

Stable-Baselines3: https://github.com/DLR-RM/stable-baselines3

Documentation: https://stable-baselines.readthedocs.io/en/master/

RL Baselines3 zoo: https://github.com/DLR-RM/rl-baselines3-zoo


[RL Baselines3 Zoo](https://github.com/DLR-RM/rl-baselines3-zoo) is a collection of pre-trained Reinforcement Learning agents using Stable-Baselines3.

It also provides basic scripts for training, evaluating agents, tuning hyperparameters and recording videos.


## Introduction

In this notebook, you will learn the basics of using the Stable-Baselines3 library: how to create an RL model, train it, and evaluate it. Because all algorithms share the same interface, we will see how simple it is to switch from one algorithm to another.


## Install Dependencies and Stable Baselines3 Using Pip

The full list of dependencies can be found in the [README](https://github.com/DLR-RM/stable-baselines3).


```
pip install stable-baselines3[extra]
```

In [2]:
### Feel free to disable GPU if you don't have one
import os
os.environ["CUDA_VISIBLE_DEVICES"]=""

In [4]:
import stable_baselines3
stable_baselines3.__version__

'1.6.1a0'

## Imports

Stable-Baselines3 works on environments that follow the [gym interface](https://stable-baselines.readthedocs.io/en/master/guide/custom_env.html).
You can find a list of available environments [here](https://gym.openai.com/envs/#classic_control).

It is also recommended to check the [source code](https://github.com/openai/gym) to learn more about the observation and action spaces of each env, as gym does not have proper documentation.
Not all algorithms work with all action spaces; you can find more details in this [recap table](https://stable-baselines.readthedocs.io/en/master/guide/algos.html).

In [5]:
import gym
import numpy as np

The first thing you need to import is the RL model. Check the documentation to know which algorithm you can use on which problem.

In [6]:
from stable_baselines3 import PPO

The next thing you need to import is the policy class that will be used to create the networks (for the policy/value functions).
This step is optional as you can directly use strings in the constructor: 

```PPO('MlpPolicy', env)``` instead of ```PPO(MlpPolicy, env)```

Note that some algorithms like `SAC` have their own `MlpPolicy`; that's why using a string for the policy is the recommended option.

In [8]:
from stable_baselines3.ppo import MlpPolicy

## Create the Gym env and instantiate the agent

For this example, we will use the CartPole environment, a classic control problem.

"A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. "

Cartpole environment: [https://gym.openai.com/envs/CartPole-v1/](https://gym.openai.com/envs/CartPole-v1/)

![Cartpole](https://cdn-images-1.medium.com/max/1143/1*h4WTQNVIsvMXJTCpXm_TAw.gif)

Note: vectorized environments make it easy to run training across multiple processes. In this example, we are using only one process, hence the `DummyVecEnv` (SB3 wraps the environment in one automatically).
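To make the "vectorized" idea concrete, here is a toy, NumPy-only sketch of what a single-process vectorized wrapper does: it steps each wrapped environment in turn, stacks the results into arrays, and resets an environment automatically when its episode ends. `CountdownEnv` and `ToyVecEnv` are hypothetical names invented for this illustration; this is not SB3's `DummyVecEnv` implementation.

```python
import numpy as np


class CountdownEnv:
    """Hypothetical toy env: the episode ends after `length` steps."""

    def __init__(self, length):
        self.length = length
        self.t = 0

    def reset(self):
        self.t = 0
        return np.array([self.t], dtype=np.float32)

    def step(self, action):
        self.t += 1
        obs = np.array([self.t], dtype=np.float32)
        done = self.t >= self.length
        return obs, 1.0, done, {}


class ToyVecEnv:
    """Steps several envs sequentially and returns batched arrays."""

    def __init__(self, env_fns):
        self.envs = [fn() for fn in env_fns]

    def reset(self):
        return np.stack([env.reset() for env in self.envs])

    def step(self, actions):
        obs_list, rewards, dones, infos = [], [], [], []
        for env, action in zip(self.envs, actions):
            obs, reward, done, info = env.step(action)
            if done:
                # Auto-reset so the caller always gets a valid next observation
                obs = env.reset()
            obs_list.append(obs)
            rewards.append(reward)
            dones.append(done)
            infos.append(info)
        return np.stack(obs_list), np.array(rewards), np.array(dones), infos


vec_env = ToyVecEnv([lambda: CountdownEnv(length=2)])
first_obs = vec_env.reset()
obs, rewards, dones, infos = vec_env.step([0])
obs2, rewards2, dones2, infos2 = vec_env.step([0])
```

Even with a single wrapped environment, observations carry a leading batch dimension of 1 and `dones` is an array, which is why the code in this notebook treats actions, rewards, and dones as arrays.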

We chose the MlpPolicy because the input of CartPole is a feature vector, not an image.

The type of action to use (discrete/continuous) is deduced automatically from the environment's action space.


Here we are using the [Proximal Policy Optimization](https://stable-baselines.readthedocs.io/en/master/modules/ppo2.html) algorithm, which is an Actor-Critic method: it uses a value function to improve the policy gradient update (by reducing its variance).

It combines ideas from [A2C](https://stable-baselines.readthedocs.io/en/master/modules/a2c.html) (having multiple workers and using an entropy bonus for exploration) and [TRPO](https://stable-baselines.readthedocs.io/en/master/modules/trpo.html) (it uses a trust region to improve stability and avoid catastrophic drops in performance).

PPO is an on-policy algorithm, which means that the trajectories used to update the networks must be collected using the latest policy.
It is usually less sample efficient than off-policy algorithms like [DQN](https://stable-baselines.readthedocs.io/en/master/modules/dqn.html), [SAC](https://stable-baselines.readthedocs.io/en/master/modules/sac.html) or [TD3](https://stable-baselines.readthedocs.io/en/master/modules/td3.html), but is much faster in terms of wall-clock time.
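The trust-region idea can be illustrated with PPO's clipped surrogate objective. The sketch below is a NumPy illustration of that formula on made-up numbers, not SB3's training code; the function name `clipped_surrogate` and the sample values are assumptions for the example, and `clip_range=0.2` matches the `clip_range` value reported in the training logs of this notebook.

```python
import numpy as np


def clipped_surrogate(ratio, advantage, clip_range=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantage
    return np.minimum(unclipped, clipped)


# ratio = pi_new(a|s) / pi_old(a|s); advantage = how much better than expected the action was
ratio = np.array([1.5, 0.5, 1.1])
advantage = np.array([1.0, -1.0, 2.0])
objective = clipped_surrogate(ratio, advantage)
```

In the first two samples the ratio has already moved more than 20% away from 1 in the direction the advantage favors, so the objective is capped (at `1.2` and `-0.8`) and gives no incentive to move the policy further. The third sample stays inside the clip range and is left untouched.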


In [6]:
env = gym.make('CartPole-v1')

model = PPO(MlpPolicy, env, verbose=0)

We create a helper function to evaluate the agent:

In [7]:
def evaluate(model, num_episodes=100, deterministic=True):
    """
    Evaluate an RL agent
    :param model: (BaseRLModel object) the RL Agent
    :param num_episodes: (int) number of episodes to evaluate it
    :param deterministic: (bool) whether to use deterministic actions
    :return: (float) Mean reward over num_episodes
    """
    # This function will only work for a single Environment
    env = model.get_env()
    all_episode_rewards = []
    for i in range(num_episodes):
        episode_rewards = []
        done = False
        obs = env.reset()
        while not done:
            # _states are only useful when using LSTM policies
            action, _states = model.predict(obs, deterministic=deterministic)
            # here, action, rewards and dones are arrays
            # because we are using vectorized env
            obs, reward, done, info = env.step(action)
            episode_rewards.append(reward)

        all_episode_rewards.append(sum(episode_rewards))

    mean_episode_reward = np.mean(all_episode_rewards)
    print("Mean reward:", mean_episode_reward, "Num episodes:", num_episodes)

    return mean_episode_reward

In fact, Stable-Baselines3 already provides you with that helper:

In [8]:
from stable_baselines3.common.evaluation import evaluate_policy

Let's evaluate the untrained agent; it should behave like a random agent.

In [9]:
# Use a separate environment for evaluation
eval_env = gym.make('CartPole-v1')

# Random Agent, before training
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=100)

print(f"mean_reward:{mean_reward:.2f} +/- {std_reward:.2f}")

mean_reward:9.31 +/- 0.70




## Train the agent and evaluate it

In [10]:
# Train the agent for 10000 steps
model.learn(total_timesteps=10000)

<stable_baselines3.ppo.ppo.PPO at 0x7fed19081520>

In [11]:
# Evaluate the trained agent
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=100)

print(f"mean_reward:{mean_reward:.2f} +/- {std_reward:.2f}")

mean_reward:385.32 +/- 114.70


The training apparently went well: the mean reward increased a lot!

### Prepare video recording

In [12]:
# # Set up fake display; otherwise rendering will fail
# import os
# os.system("Xvfb :1 -screen 0 1024x768x24 &")
# os.environ['DISPLAY'] = ':1'

In [13]:
import base64
from pathlib import Path

from IPython import display as ipythondisplay

def show_videos(video_path='', prefix=''):
  """
  Taken from https://github.com/eleurent/highway-env

  :param video_path: (str) Path to the folder containing videos
  :param prefix: (str) Filter the videos, showing only the ones starting with this prefix
  """
  html = []
  for mp4 in Path(video_path).glob("{}*.mp4".format(prefix)):
      video_b64 = base64.b64encode(mp4.read_bytes())
      html.append('''<video alt="{}" autoplay 
                    loop controls style="height: 400px;">
                    <source src="data:video/mp4;base64,{}" type="video/mp4" />
                </video>'''.format(mp4, video_b64.decode('ascii')))
  ipythondisplay.display(ipythondisplay.HTML(data="<br>".join(html)))

We will record a video using the [VecVideoRecorder](https://stable-baselines.readthedocs.io/en/master/guide/vec_envs.html#vecvideorecorder) wrapper; you will learn about those wrappers in the next notebook.

In [14]:
from stable_baselines3.common.vec_env import VecVideoRecorder, DummyVecEnv

def record_video(env_id, model, video_length=500, prefix='', video_folder='videos/'):
  """
  :param env_id: (str)
  :param model: (RL model)
  :param video_length: (int)
  :param prefix: (str)
  :param video_folder: (str)
  """
  # Build the environment from env_id instead of a hard-coded name
  eval_env = DummyVecEnv([lambda: gym.make(env_id)])
  # Start the video at step=0 and record video_length steps
  eval_env = VecVideoRecorder(eval_env, video_folder=video_folder,
                              record_video_trigger=lambda step: step == 0, video_length=video_length,
                              name_prefix=prefix)

  obs = eval_env.reset()
  for _ in range(video_length):
    action, _ = model.predict(obs)
    obs, _, _, _ = eval_env.step(action)

  # Close the video recorder
  eval_env.close()

### Visualize trained agent



In [16]:
record_video('CartPole-v1', model, video_length=500, prefix='ppo-cartpole')

Saving video to /Users/scottyang/Desktop/FA22/DSC 180/code-base/capstone_venv/videos/ppo-cartpole-step-0-to-step-500.mp4


In [17]:
show_videos('videos', prefix='ppo')

## Bonus: Train a RL Model in One Line

The policy class to use will be inferred and the environment will be automatically created. This works because both are [registered](https://stable-baselines.readthedocs.io/en/master/guide/quickstart.html).

In [19]:
model = PPO('MlpPolicy', "CartPole-v1", verbose=1).learn(100000)

Using cpu device
Creating environment from the given name 'CartPole-v1'
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 22.3     |
|    ep_rew_mean     | 22.3     |
| time/              |          |
|    fps             | 3988     |
|    iterations      | 1        |
|    time_elapsed    | 0        |
|    total_timesteps | 2048     |
---------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 25.6        |
|    ep_rew_mean          | 25.6        |
| time/                   |             |
|    fps                  | 2666        |
|    iterations           | 2           |
|    time_elapsed         | 1           |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.008158682 |
|    clip_fraction        | 0.0877      |
...

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 166         |
|    ep_rew_mean          | 166         |
| time/                   |             |
|    fps                  | 2143        |
|    iterations           | 11          |
|    time_elapsed         | 10          |
|    total_timesteps      | 22528       |
| train/                  |             |
|    approx_kl            | 0.004537469 |
|    clip_fraction        | 0.0252      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.558      |
|    explained_variance   | 0.875       |
|    learning_rate        | 0.0003      |
|    loss                 | 18          |
|    n_updates            | 100         |
|    policy_gradient_loss | -0.00585    |
|    value_loss           | 19.1        |
-----------------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 185         |
...

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 343         |
|    ep_rew_mean          | 343         |
| time/                   |             |
|    fps                  | 2115        |
|    iterations           | 21          |
|    time_elapsed         | 20          |
|    total_timesteps      | 43008       |
| train/                  |             |
|    approx_kl            | 0.009811033 |
|    clip_fraction        | 0.0971      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.502      |
|    explained_variance   | 0.654       |
|    learning_rate        | 0.0003      |
|    loss                 | -0.0117     |
|    n_updates            | 200         |
|    policy_gradient_loss | -0.0108     |
|    value_loss           | 0.1         |
-----------------------------------------
------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 356          |
...

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 487         |
|    ep_rew_mean          | 487         |
| time/                   |             |
|    fps                  | 2103        |
|    iterations           | 31          |
|    time_elapsed         | 30          |
|    total_timesteps      | 63488       |
| train/                  |             |
|    approx_kl            | 0.002336335 |
|    clip_fraction        | 0.0451      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.463      |
|    explained_variance   | 0.414       |
|    learning_rate        | 0.0003      |
|    loss                 | -0.0154     |
|    n_updates            | 300         |
|    policy_gradient_loss | -0.00156    |
|    value_loss           | 0.00129     |
-----------------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 493         |
...

------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 500          |
|    ep_rew_mean          | 500          |
| time/                   |              |
|    fps                  | 2098         |
|    iterations           | 41           |
|    time_elapsed         | 40           |
|    total_timesteps      | 83968        |
| train/                  |              |
|    approx_kl            | 0.0025598132 |
|    clip_fraction        | 0.0416       |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.393       |
|    explained_variance   | 0.3          |
|    learning_rate        | 0.0003       |
|    loss                 | -0.0061      |
|    n_updates            | 400          |
|    policy_gradient_loss | -0.000246    |
|    value_loss           | 4.42e-05     |
------------------------------------------
-----------------------------------------
| rollout/                |             |
...

### Additional tasks:

1. Change the environment
2. Benchmark two separate algorithms in this new environment.
3. (Optional) Play with hyperparameters

# Part 2: Behavioral Cloning

Here, you'll implement behavioral cloning, a deep learning approach that learns from stored transitions collected over multiple expert episodes. We'll do it from scratch so that it'll be easier to perturb the individual components.
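Behavioral cloning treats the expert's (observation, action) pairs as a supervised dataset and minimizes the negative log-likelihood of the expert's actions under the learned policy. A minimal NumPy sketch of that objective follows; the name `bc_loss` and the toy probabilities are assumptions made for this illustration, not part of the notebook.

```python
import numpy as np


def bc_loss(action_probs, expert_actions):
    """Mean negative log-likelihood of the expert's actions under the policy."""
    action_probs = np.asarray(action_probs, dtype=np.float64)
    expert_actions = np.asarray(expert_actions)
    # Probability the policy assigns to the action the expert actually took
    picked = action_probs[np.arange(len(expert_actions)), expert_actions]
    return float(-np.mean(np.log(picked)))


# Toy policy outputs for three observations (rows sum to 1) and the expert's choices
probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
expert = np.array([0, 1, 0])
loss = bc_loss(probs, expert)

# A policy that always puts probability 1 on the expert's action has zero loss
perfect = bc_loss(np.array([[1.0, 0.0], [0.0, 1.0]]), np.array([0, 1]))
```

The loss shrinks as the policy puts more probability on the actions the expert chose, which is the quantity a cross-entropy loss optimizes during training.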

In [220]:
### Step 1: Make a data structure to store transitions.
### At the very least, you'll need observations and actions.

# example implementation
class TransitionStorage:
    def __init__(self):
        self.obs = []
        self.action = []

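    def store_transition(self, obs, action):
        """
        Append obs to self.obs and the one-hot encoded action to self.action
        (CartPole has two discrete actions: 0 and 1)
        """
        if action == 0:
            self.action.append([1, 0])
        else:
            self.action.append([0, 1])
        self.obs.append(obs)

    def get_batch(self, batch_size):
        """
        Yield consecutive (obs, action) batches of at most batch_size transitions
        """
        curr = 0
        # strict comparison: stop once every stored transition has been yielded
        # (the original `>=` produced one extra, empty batch at the end)
        while curr < len(self.obs):
            yield self.obs[curr:curr+batch_size], self.action[curr:curr+batch_size]
            curr += batch_size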
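`get_batch` walks through the stored transitions in the order they were collected, so consecutive batches tend to come from the same episode. A common alternative is to shuffle the indices once per epoch. The helper below is a hypothetical sketch of that idea with NumPy; the name `shuffled_batches` and the toy data are not part of the notebook.

```python
import numpy as np


def shuffled_batches(obs, actions, batch_size, rng):
    """Yield (obs, action) minibatches in random order, covering every sample once."""
    obs = np.asarray(obs)
    actions = np.asarray(actions)
    order = rng.permutation(len(obs))
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        yield obs[idx], actions[idx]


rng = np.random.default_rng(0)
toy_obs = np.arange(10, dtype=np.float32).reshape(10, 1)
toy_actions = np.eye(2)[np.arange(10) % 2]  # one-hot labels, as in store_transition
batches = list(shuffled_batches(toy_obs, toy_actions, batch_size=4, rng=rng))
```

Calling the generator again at the start of each epoch gives a fresh ordering; the last batch is smaller when the dataset size is not a multiple of `batch_size`.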

In [222]:
### Step 2: Reuse the evaluate function, but now use it to collect data

def evaluate_and_collect(model, storage, num_episodes=100, deterministic=True):
    """
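    Evaluate an RL agent and store every (obs, action) pair it encounters
    :param model: (BaseRLModel object) the RL Agent
    :param storage: (TransitionStorage) where the transitions are stored
    :param num_episodes: (int) number of episodes to evaluate it
    :param deterministic: (bool) whether to use deterministic actions
    :return: (float) Mean reward over num_episodes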
    """
    # This function will only work for a single Environment
    env = model.get_env()
    all_episode_rewards = []
    for i in range(num_episodes):
        episode_rewards = []
        done = False
        obs = env.reset()
        while not done:
            # _states are only useful when using LSTM policies
            action, _states = model.predict(obs, deterministic=deterministic)
            # here, action, rewards and dones are arrays
            # because we are using vectorized env
            next_obs, reward, done, info = env.step(action)
            episode_rewards.append(reward)

            # store the transition
            storage.store_transition(obs, action)

            obs = next_obs

        all_episode_rewards.append(sum(episode_rewards))

    mean_episode_reward = np.mean(all_episode_rewards)
    print("Mean reward:", mean_episode_reward, "Num episodes:", num_episodes)

    return mean_episode_reward

In [223]:
model

<stable_baselines3.ppo.ppo.PPO at 0x7fed190d82b0>

In [224]:
### Step 3: Collect some data on an expert.
### You can use the evaluate_and_collect function you wrote above.

storage = TransitionStorage()
evaluate_and_collect(model, storage, num_episodes=1000)

Mean reward: 500.0 Num episodes: 1000


500.0

In [287]:
### Step 4: Define a network that you'll train through behavioral cloning (supervised learning)

import torch
import torch.nn as nn
import torch.nn.functional as F

# BCNetwork but with a discrete action space
class BCNetworkDiscrete(nn.Module):
    def __init__(self, obs_dim, action_dim):
        # assumes that observation and action are one-dimensional
        super(BCNetworkDiscrete, self).__init__()
        self.obs_dim = obs_dim
        self.action_dim = action_dim

        self.fc1 = nn.Linear(self.obs_dim, 64)
        self.fc2 = nn.Linear(64, 64)
        self.fc3 = nn.Linear(64, self.action_dim)

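    def forward(self, obs):
        x = F.relu(self.fc1(obs))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        # softmax over the last (action) dimension turns the scores into a probability distribution
        # note: nn.CrossEntropyLoss expects raw scores, so these probabilities are normalized again in the loss
        x = F.softmax(x, dim=-1)
        return x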

In [290]:
### Step 5: Train the network

# initialize the network
network = BCNetworkDiscrete(env.observation_space.shape[0], env.action_space.n)

# define the optimizer
optimizer = torch.optim.Adam(network.parameters(), lr=1e-3)

# define the loss function
loss_fn = nn.CrossEntropyLoss()

# define the number of epochs
num_epochs = 10

# define the batch size
batch_size = 32

# define the number of batches
num_batches = len(storage.obs) // batch_size

# note: you can keep the obs and action completely in memory 
# (the training loop is set up for that). make sure that:
# 1. the types are correct
# 2. the shapes match

# train the network
for epoch in range(num_epochs):
    gen = storage.get_batch(batch_size)
    # accumulate loss
    epoch_loss = 0

    for batch in range(num_batches):
        # get the batch somehow. you can either write a method 
        # into the storage class or just directly access the 
        # values in it
        
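        batch_obs, batch_action = next(gen)
        # convert via NumPy first: building a tensor from a list of arrays is slow
        batch_obs = torch.as_tensor(np.array(batch_obs), dtype=torch.float32)
        batch_action = torch.as_tensor(np.array(batch_action), dtype=torch.float32)
        # forward pass (the network returns action probabilities, see forward())
        logits = network(batch_obs)
        # drop the vectorized-env dimension: (batch, 1, n_actions) -> (batch, n_actions)
        logits = torch.squeeze(logits, dim=1)

        # compute the loss against the one-hot expert actions
        loss = loss_fn(logits, batch_action)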

        # backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # accumulate loss
        epoch_loss += loss.item()
    # print the loss
    print("Epoch: {}, Loss: {}".format(epoch, epoch_loss / num_batches))

Epoch: 0, Loss: 0.3724624923381805
Epoch: 1, Loss: 0.3316619104347229
Epoch: 2, Loss: 0.32863264413642884
Epoch: 3, Loss: 0.3276375488548279
Epoch: 4, Loss: 0.327039477104187
Epoch: 5, Loss: 0.32647510086631776
Epoch: 6, Loss: 0.32616590786743166
Epoch: 7, Loss: 0.32604323777389527
Epoch: 8, Loss: 0.3250926556186676
Epoch: 9, Loss: 0.32491208099746705
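
The loss plateaus near 0.325 instead of approaching zero. One likely explanation: `nn.CrossEntropyLoss` applies a log-softmax to its input internally, but `forward` already returns softmax probabilities, so they are normalized a second time. With two actions, the best achievable loss in that setup is log(1 + e^-1) ≈ 0.313. The stdlib-only sketch below reproduces that arithmetic; `cross_entropy` is a hypothetical helper written for this illustration, not a PyTorch function.

```python
import math


def cross_entropy(scores, target_index):
    """Cross entropy of softmax(scores) against a one-hot target."""
    log_norm = math.log(sum(math.exp(s) for s in scores))
    return log_norm - scores[target_index]


# Feeding already-normalized probabilities as if they were raw scores:
# even a perfectly confident prediction [1.0, 0.0] keeps a nonzero loss.
loss_on_probs = cross_entropy([1.0, 0.0], target_index=0)

# Feeding raw scores: a large margin drives the loss toward zero.
loss_on_logits = cross_entropy([10.0, -10.0], target_index=0)
```

If this is the cause, returning the raw output of `fc3` from `forward` during training (and applying softmax or argmax only at inference) would let the loss approach zero. The greedy action would be unchanged, because softmax preserves the argmax.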


In [291]:
### Step 6: run the trained network on the environment, based on the evaluate function but using network instead of model
def evaluate_network(network, num_episodes=100):
    """
    Evaluate a behavioral cloning network
    :param network: (nn.Module) the trained network
    :param num_episodes: (int) number of episodes to evaluate it
    :return: (float) Mean reward over num_episodes
    """
    # This function will only work for a single Environment
    # (it reuses the vectorized env of the global PPO `model`)
    env = model.get_env()
    all_episode_rewards = []
    for i in range(num_episodes):
        episode_rewards = []
        done = False
        obs = env.reset()
        while not done:
            # need to add the additional dimension because the network
            # was trained on batches; argmax picks the most likely action
            action = network(torch.tensor([obs], dtype=torch.float32)).argmax().item()
            # here, action, rewards and dones are arrays
            # because we are using vectorized env
            obs, reward, done, info = env.step([action])
            episode_rewards.append(reward)

        all_episode_rewards.append(sum(episode_rewards))

    mean_episode_reward = np.mean(all_episode_rewards)
    print("Mean reward:", mean_episode_reward, "Num episodes:", num_episodes)

    return mean_episode_reward

evaluate_network(network)

Mean reward: 500.0 Num episodes: 100


500.0

In [295]:
### Step 7: visualize the new policy, and save to an mp4 file

env = model.get_env()
obs = env.reset()
done = False
frames = []
while not done:
    action = network(torch.tensor([obs], dtype=torch.float32)).argmax().item()
    obs, reward, done, info = env.step([action])
    frames.append(env.render(mode='rgb_array'))
env.close()

# save the video
import imageio  # note: pip install imageio[ffmpeg]
imageio.mimsave('bc.mp4', frames, fps=30)

### Messing around with BC

In [None]:
# Write (or reuse) a loop and add noise to the policy actions. What happens?


In [None]:
# Add noise to the observations


In [None]:
# (Optional): Try collecting data from a trained policy while applying noise, 
# then try training a new policy on that data. How does the new policy fare?
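
As a starting point for these experiments, here is a minimal NumPy sketch of two noise helpers. The names `noisy_action` and `noisy_observation`, and the parameters `epsilon` and `sigma`, are assumptions made for this example. They could be called inside a loop like `evaluate_network`: `noisy_observation` before the network's forward pass, `noisy_action` before `env.step`.

```python
import numpy as np


def noisy_action(action, n_actions, epsilon, rng):
    """With probability epsilon, replace the action with a uniformly random one."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(action)


def noisy_observation(obs, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma to an observation."""
    obs = np.asarray(obs, dtype=np.float32)
    noise = rng.normal(0.0, sigma, size=obs.shape).astype(np.float32)
    return obs + noise


rng = np.random.default_rng(0)
obs = np.zeros((1, 4), dtype=np.float32)  # same shape as a vectorized CartPole observation
perturbed = noisy_observation(obs, sigma=0.1, rng=rng)
kept = noisy_action(1, n_actions=2, epsilon=0.0, rng=rng)        # never perturbed
randomized = noisy_action(1, n_actions=2, epsilon=1.0, rng=rng)  # always resampled
```

Sweeping `epsilon` or `sigma` upward from zero and recording the mean reward at each level gives a simple picture of how quickly the cloned policy degrades.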
