# Purpose
This notebook is a hands-on introduction to reinforcement learning (RL): we train, evaluate, and tune an agent on a simple simulated environment.

## Step 1: Dependencies
 - Stable Baselines3 is a library that makes RL easier by providing ready-made implementations of common algorithms. It is built on PyTorch and descends from Baselines, a package that was developed by OpenAI
 - Gymnasium is the maintained fork of Gym, a package created by OpenAI. It provides a variety of environments

In [1]:
!pip install stable-baselines3[extra] gymnasium tensorflow tensorboard

Collecting stable-baselines3[extra]
  Obtaining dependency information for stable-baselines3[extra] from https://files.pythonhosted.org/packages/d9/57/13d4e4b7bbbc940815964ac31e205263b8133f1f2a0147bd4ca884a6e174/stable_baselines3-2.0.0-py3-none-any.whl.metadata
  Using cached stable_baselines3-2.0.0-py3-none-any.whl.metadata (5.4 kB)
Collecting gymnasium
  Obtaining dependency information for gymnasium from https://files.pythonhosted.org/packages/3f/00/a728a4a8608213482fc38d76d842657d29b546f214e83801a044de074612/gymnasium-0.29.0-py3-none-any.whl.metadata
  Downloading gymnasium-0.29.0-py3-none-any.whl.metadata (10 kB)
Collecting tensorflow
  Obtaining dependency information for tensorflow from https://files.pythonhosted.org/packages/ba/7c/b971f2485155917ecdcebb210e021e36a6b65457394590be01cc61515310/tensorflow-2.13.0-cp310-cp310-win_amd64.whl.metadata
  Using cached tensorflow-2.13.0-cp310-cp310-win_amd64.whl.metadata (2.6 kB)
Collecting tensorboard
  Using cached tensorboard-2.13.0-py3

In [2]:
import os
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.evaluation import evaluate_policy
import numpy as np

 - PPO is one of many RL algorithms. It stands for Proximal Policy Optimization. It limits the size of each update, so that the new policy stays close to the old policy
 - DummyVecEnv is a wrapper that exposes one or more environments through the vectorized interface that Stable Baselines3 expects. It steps the environments one after another in a single process, which is all we need for a single environment
 - evaluate_policy helps us evaluate our model. It returns the mean reward over a given number of episodes, and the standard deviation
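
To make the "stays close to the old policy" idea concrete, here is a minimal sketch of the per-sample clipped objective that PPO is named for. This is an illustration in plain Python, not the Stable Baselines3 implementation; the function name `clipped_surrogate` is made up for this example, and the default `clip_range=0.2` is assumed to match the `clip_range` value shown in the training logs.

```python
def clipped_surrogate(ratio: float, advantage: float, clip_range: float = 0.2) -> float:
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    ratio     = new_policy_prob / old_policy_prob for the action taken
    advantage = how much better the action was than expected
    """
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    return min(ratio * advantage, clipped_ratio * advantage)


# A good action (positive advantage) whose probability rose by 50%:
# the objective is capped at 1.2 * advantage, so pushing further gains nothing.
print(clipped_surrogate(ratio=1.5, advantage=1.0))

# A bad action (negative advantage) whose probability was halved:
# the objective is capped at 0.8 * advantage.
print(clipped_surrogate(ratio=0.5, advantage=-1.0))
```

Because the objective stops improving once the ratio leaves the range [0.8, 1.2], the optimizer has little reason to move the policy far in a single update.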

## Step 2: Load Environment

Gymnasium (the successor to OpenAI Gym) is a convenient way to use a simulated environment. Simulated environments are helpful as they are cheap, fast, and cannot cause real-world damage. For example, it is much safer to train a self-driving car in a simulated environment first, before deploying it on the road

For this first example, we will be using the CartPole environment. This is a simple env in which the agent balances a pole on a moving cart

In [3]:
env_name = 'CartPole-v1'
render_mode = 'human'

Now that we have chosen the env, let's create it and take a look at it before we do anything more

In [33]:
env = gym.make(env_name, render_mode = render_mode)

episodes = 5
for episode in range(1, episodes+1):
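    #reset() returns the first observation and an info dict
    state, info = env.reset()
    done = False
    score = 0

    while not done:
        #Visualize the env
        env.render()
        #Create a random action
        action = env.action_space.sample()
        #Pass in our random action to the environment, and collect observations
        #terminated: the pole fell or the cart went out of bounds
        #truncated: the episode hit the time limit
        n_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
        #Add to the score
        score += reward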
    print(f'Episode:{episode} Score:{score}')
env.close()

Episode:1 Score:21.0
Episode:2 Score:26.0
Episode:3 Score:13.0
Episode:4 Score:11.0
Episode:5 Score:20.0


- Episode: An episode is essentially one full game within an environment. For some games, the episode ends when the game ends, such as when a player dies or runs out of lives. For CartPole-v1, an episode ends when the pole tips too far, the cart moves too far from the center, or the agent survives 500 steps.
- State: This is the set of observations returned by the environment; env.reset() returns the initial one. These observations are passed to the agent, which uses them to choose the action it expects to earn the most reward.
- env.render(): This allows us to view what is happening in the environment
- action: We are generating a random action, not one that is actually useful yet
- env.step(): Here we are passing our action into the environment. We get back the next observation, the reward, two end-of-episode flags (terminated and truncated), and an info dict.
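
The reset/step loop described above can be sketched without any RL library. `ToyEnv` below is a hypothetical stand-in that only mimics the shape of the Gymnasium API (`reset()` returning `(observation, info)` and `step()` returning five values); its dynamics are invented for illustration and have nothing to do with CartPole.

```python
import random


class ToyEnv:
    """Hypothetical environment that mimics the Gymnasium reset/step signatures."""

    def __init__(self, max_steps: int = 10, fail_prob: float = 0.2, seed: int = 0):
        self.max_steps = max_steps
        self.fail_prob = fail_prob
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0, {}  # (observation, info)

    def step(self, action):
        self.t += 1
        # terminated: the task itself ended (here: a random failure)
        terminated = self.rng.random() < self.fail_prob
        # truncated: the time limit cut the episode short
        truncated = (not terminated) and self.t >= self.max_steps
        return float(self.t), 1.0, terminated, truncated, {}


def run_episode(env):
    """Play one episode with random actions and return the score and end flags."""
    obs, info = env.reset()
    score, done = 0.0, False
    while not done:
        action = random.choice([0, 1])
        obs, reward, terminated, truncated, info = env.step(action)
        score += reward
        done = terminated or truncated
    return score, terminated, truncated


print(run_episode(ToyEnv(max_steps=10, fail_prob=0.0)))  # always runs to the time limit
```

The key detail is that an episode is over when either flag is set, so the loop condition has to combine both.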

Note that there are two different spaces, the action space and the observation space
 - observation space: the observations about the environment
 - action space: the actions that can be taken


In [20]:
#These are the actions that can be taken. Discrete(2) means that there are exactly two discrete actions
print(env.action_space)
#This number will be either 0 or 1. It is a randomly chosen action, corresponding to pushing the cart left (0) or right (1)
print(env.action_space.sample())
#This is the space that our observations live in. It gives the lower and upper bounds of each of the 4 numbers
print(env.observation_space)
#This is a random sample from the observation space, not an observation from an actual episode
print(env.observation_space.sample())

Discrete(2)
1
Box([-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38], [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38], (4,), float32)
[ 2.4412444e+00  1.7722344e+38  1.1473202e-01 -2.6802146e+38]


Here's what the numbers in the observation space represent
 - Number 1: Cart Position
 - Number 2: Cart Velocity
 - Number 3: Pole Angle
 - Number 4: Pole Angular Velocity

See more: https://gymnasium.farama.org/environments/classic_control/cart_pole/
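
As a rough illustration of how these four numbers decide when an episode ends, the sketch below checks an observation against the termination limits described on the CartPole page linked above (cart position beyond ±2.4, pole angle beyond ±12 degrees). The helper `is_terminal` and the constant names are made up for this example; the environment performs this check internally.

```python
import math

# Limits taken from the Gymnasium CartPole documentation (assumed here, not read from the env)
X_THRESHOLD = 2.4                          # cart position limit
THETA_THRESHOLD = 12 * math.pi / 180       # pole angle limit, about 0.2095 rad


def is_terminal(observation) -> bool:
    """Return True if the observation lies outside the CartPole termination limits."""
    cart_position, cart_velocity, pole_angle, pole_angular_velocity = observation
    return abs(cart_position) > X_THRESHOLD or abs(pole_angle) > THETA_THRESHOLD


print(is_terminal([0.0, 0.0, 0.0, 0.0]))   # upright and centered
print(is_terminal([0.0, 0.0, 0.25, 0.0]))  # pole has tipped past 12 degrees
```

Note that the observation space bounds printed earlier (±4.8 and ±0.418) are wider than these limits, so a random sample from the observation space can already be a terminal state.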

## Step 3: Training

It is important to understand our training metrics. They fall into three main groups, plus a few others

Evaluation Metrics: These have to do with each episode
- ep_len_mean: average length of each episode, essentially the length of one game
- ep_rew_mean: average reward for each episode

Time Metrics: These have to do with training speed and progress
- fps
- iterations
- time_elapsed
- total_timesteps

Loss Metrics:
- entropy_loss
- policy_gradient_loss
- value_loss

Other Metrics:
- explained_variance
- learning_rate
- n_updates
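
Of these, explained_variance is the least self-explanatory. A minimal sketch of the usual definition, 1 - Var(actual - predicted) / Var(actual), is shown below; the function is written from that formula for illustration and is not copied from Stable Baselines3. Here "predicted" stands for the value function's estimates and "actual" for the returns that were observed.

```python
from statistics import pvariance


def explained_variance(predicted, actual) -> float:
    """Fraction of the variance in `actual` that `predicted` accounts for.

    1.0 means perfect predictions, 0.0 means no better than predicting a constant,
    and negative values mean worse than that.
    """
    var_actual = pvariance(actual)
    if var_actual == 0:
        return float("nan")  # undefined when the targets never vary
    residuals = [a - p for a, p in zip(actual, predicted)]
    return 1.0 - pvariance(residuals) / var_actual


print(explained_variance([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # perfect predictions
print(explained_variance([2.0, 2.0, 2.0], [1.0, 2.0, 3.0]))  # constant prediction
```

A value that climbs toward 1 during training suggests the value network is learning to predict returns well.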

In [8]:
log_path = os.path.join('Training', 'Logs')

In [22]:
#DummyVecEnv takes a list of functions that each create an environment
train_env = DummyVecEnv([lambda: gym.make(env_name)])
model = PPO('MlpPolicy', train_env, verbose = 1, tensorboard_log = log_path)

Using cpu device


- policy: the rule an agent follows to choose an action from an observation in a given environment
- timestep: essentially a frame. A decision point where the agent reads the observations and chooses its next action
    * Note that 20,000 timesteps is considered low

Now, it's time to train our agent. This requires little code. Note that we specify how many timesteps to train for, not how many episodes
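
One consequence of counting timesteps is worth a quick calculation. PPO gathers experience in fixed-size rollouts, and the training log reports 2048 timesteps after the first iteration, so a request for 20,000 timesteps is rounded up to a whole number of rollouts. The helper below is a back-of-the-envelope sketch; `ppo_rollouts` is a made-up name, and the 2048-step rollout with a single environment is assumed from the log rather than read from the model.

```python
import math


def ppo_rollouts(total_timesteps: int, n_steps: int = 2048, n_envs: int = 1):
    """Return (iterations, timesteps actually collected) for a PPO training run."""
    steps_per_iteration = n_steps * n_envs
    iterations = math.ceil(total_timesteps / steps_per_iteration)
    return iterations, iterations * steps_per_iteration


print(ppo_rollouts(20_000))  # (10, 20480)
```

So under these assumptions, training runs for 10 iterations and stops slightly past the requested 20,000 timesteps.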

In [24]:
#This cell can be run many times, to train the model further
model.learn(total_timesteps=20000)

Logging to Training\Logs\PPO_10
-----------------------------
| time/              |      |
|    fps             | 1109 |
|    iterations      | 1    |
|    time_elapsed    | 1    |
|    total_timesteps | 2048 |
-----------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 900         |
|    iterations           | 2           |
|    time_elapsed         | 4           |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.004228264 |
|    clip_fraction        | 0.0498      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.564      |
|    explained_variance   | 0.971       |
|    learning_rate        | 0.0003      |
|    loss                 | 0.524       |
|    n_updates            | 110         |
|    policy_gradient_loss | -0.00781    |
|    value_loss           | 8.29        |
-----------------------------------------
--

<stable_baselines3.ppo.ppo.PPO at 0x1116a2aef80>

## Step 4: Save and Reload Model

In [5]:
#Defining the path to save our model to
PPO_Path = os.path.join('Training', 'Saved Models', 'PPO_Model_CartPole')

In [26]:
#Saving the model
model.save(PPO_Path)

In [27]:
#Deleting the model from memory
del model

In [4]:
#Reloading the model
#Must pass in the save path. Passing an environment is optional, but it is needed to continue training
model = PPO.load(PPO_Path, env=train_env)


## Step 5: Evaluate

We did not render the environment during training, as rendering makes training significantly slower. However, we do want to render during evaluation, so we can actually see how the model performs. To do this, we will evaluate in a new env, separate from the one we trained in

In [6]:
#Create evaluation env
eval_env = gym.make(env_name, render_mode='human')
#Load in our model with our evaluation environment
model = PPO.load(PPO_Path, env = eval_env)

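#Running the evaluation of the model
#evaluate_policy returns the mean and the standard deviation of the episode rewards
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, render=True)
print(f'Mean reward:{mean_reward} Std:{std_reward}')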

eval_env.close()

Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
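
To see what the two numbers from an evaluation mean, here is a small sketch that computes a mean and a standard deviation from a list of episode scores. It uses the five random-agent scores printed in Step 2 as sample data. The helper `summarize_rewards` is made up for this example, and the use of the population standard deviation is an assumption about how such summaries are usually computed.

```python
from statistics import fmean, pstdev


def summarize_rewards(episode_rewards):
    """Return (mean, population standard deviation) of a list of episode rewards."""
    return fmean(episode_rewards), pstdev(episode_rewards)


# Scores of the random agent from Step 2
random_scores = [21.0, 26.0, 13.0, 11.0, 20.0]
mean_reward, std_reward = summarize_rewards(random_scores)
print(f"{mean_reward:.2f} +/- {std_reward:.2f}")  # 18.20 +/- 5.49
```

A trained CartPole agent should score far above this random baseline, ideally with a small standard deviation.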




## Step 6: Test Model

Now, we will deploy our model

In [84]:
test_env = gym.make(env_name, render_mode = render_mode)

episodes = 5
for episode in range(1, episodes+1):
    done = False
    score = 0
    obs = test_env.reset()[0]

    while not done:
        #Visualize the test_env
        test_env.render()

        # Using our model to predict the next action
        action, _ = model.predict(obs)

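        #Pass in our action to the environment, and collect observations
        obs, reward, terminated, truncated, info = test_env.step(action)
        #The episode is over if the pole fell (terminated) or the time limit was reached (truncated)
        done = terminated or truncated
        #Add to the score
        score += reward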
    print(f'Episode:{episode} Score:{score}') 

test_env.close()

Episode:1 Score:500.0
Episode:2 Score:500.0
Episode:3 Score:500.0
Episode:4 Score:500.0
Episode:5 Score:500.0


## Step 7: Viewing logs in TensorBoard

This will open up analytics about our model. View these by running the TensorBoard cell below, then navigating to localhost:6006 in a browser

In [9]:
training_log_path = os.path.join(log_path, 'PPO_9')

In [10]:
training_log_path

'Training\\Logs\\PPO_9'

In [11]:
!tensorboard --logdir={training_log_path}

^C
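
The run folder above is hard-coded, but the logs show a new numbered folder for each training run (PPO_9, PPO_10, PPO_11). A small helper can pick the most recent one instead. This is a sketch using only the standard library; `latest_run_dir` is a made-up name, and it assumes the folders follow the `PPO_<number>` pattern seen in the logs.

```python
import os
import re


def latest_run_dir(log_dir: str, prefix: str = "PPO"):
    """Return the path of the highest-numbered '<prefix>_<n>' folder, or None if there is none."""
    pattern = re.compile(rf"^{re.escape(prefix)}_(\d+)$")
    runs = []
    for name in os.listdir(log_dir):
        match = pattern.match(name)
        if match and os.path.isdir(os.path.join(log_dir, name)):
            # Compare by the numeric suffix so that PPO_10 sorts after PPO_9
            runs.append((int(match.group(1)), name))
    if not runs:
        return None
    return os.path.join(log_dir, max(runs)[1])


log_dir = os.path.join("Training", "Logs")
if os.path.isdir(log_dir):
    print(latest_run_dir(log_dir))
```

Sorting on the numeric suffix matters here, because a plain string sort would place PPO_9 after PPO_10.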


Note that these are the most important statistics:
- average reward (ep_rew_mean)
- average episode length (ep_len_mean)

If the model is not meeting expectations, try these strategies:
- Train longer
- Tune hyperparameters
- Try different algorithms

## Step 8: Adding a Callback to the Training Stage

Callbacks allow for intervention during the training stage. For example, if the reward is deemed to be high enough (the agent is good), the model can stop training. Custom actions can be linked to these interventions

In [12]:
from stable_baselines3.common.callbacks import EvalCallback, StopTrainingOnRewardThreshold

In [13]:
save_path = os.path.join('Training', 'Saved Models')
log_path = os.path.join('Training', 'Logs')

In [15]:
cb_env = DummyVecEnv([lambda: gym.make(env_name)])

- stop_callback: This will stop our training once the mean evaluation reward reaches a given threshold, in this case 500
- eval_callback: This will periodically run an evaluation of the model, in this case every 10000 timesteps.
    It saves the best model seen so far to best_model_save_path. We must also pass in our env, and the callback to run whenever a new best is found

In [16]:
stop_callback = StopTrainingOnRewardThreshold(reward_threshold=500, verbose=1)
eval_callback = EvalCallback(
    cb_env,
    callback_on_new_best = stop_callback,
    eval_freq=10000,
    best_model_save_path=save_path,
    verbose=1
)
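
The way these two callbacks interact can be sketched in plain Python. The function below is a simplified model of the logic, not the Stable Baselines3 code: it takes a list of mean evaluation rewards, one per evaluation, and reports when training would stop. The name `simulate_early_stopping` is made up, and the third reward in the example (500.0) is a hypothetical value added to show the stop condition.

```python
def simulate_early_stopping(eval_rewards, eval_freq=10_000, reward_threshold=500.0):
    """Return (timestep at which training stops, best mean reward).

    The timestep is None if the threshold is never reached.
    """
    best_mean_reward = float("-inf")
    for i, mean_reward in enumerate(eval_rewards, start=1):
        timesteps = i * eval_freq
        if mean_reward > best_mean_reward:
            # New best: this is where the best model would be saved,
            # and where the stop callback gets to run
            best_mean_reward = mean_reward
            if best_mean_reward >= reward_threshold:
                return timesteps, best_mean_reward
    return None, best_mean_reward


print(simulate_early_stopping([236.2, 437.0, 500.0]))  # (30000, 500.0)
print(simulate_early_stopping([236.2, 200.0]))         # (None, 236.2)
```

Because the stop check only runs on a new best, a threshold equal to the maximum possible reward (500 for CartPole-v1) stops training only once an evaluation is perfect.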

Now that we have these callbacks, let's retrain the agent using them

In [17]:
model = PPO('MlpPolicy', cb_env, verbose=1, tensorboard_log=log_path)

Using cpu device


In [18]:
model.learn(total_timesteps=20000, callback=eval_callback)

Logging to Training\Logs\PPO_11
-----------------------------
| time/              |      |
|    fps             | 1227 |
|    iterations      | 1    |
|    time_elapsed    | 1    |
|    total_timesteps | 2048 |
-----------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 954         |
|    iterations           | 2           |
|    time_elapsed         | 4           |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.009362006 |
|    clip_fraction        | 0.129       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.685      |
|    explained_variance   | 0.00634     |
|    learning_rate        | 0.0003      |
|    loss                 | 7.23        |
|    n_updates            | 10          |
|    policy_gradient_loss | -0.0202     |
|    value_loss           | 50          |
-----------------------------------------
--



Eval num_timesteps=10000, episode_reward=236.20 +/- 71.40
Episode length: 236.20 +/- 71.40
------------------------------------------
| eval/                   |              |
|    mean_ep_length       | 236          |
|    mean_reward          | 236          |
| time/                   |              |
|    total_timesteps      | 10000        |
| train/                  |              |
|    approx_kl            | 0.0076094978 |
|    clip_fraction        | 0.0475       |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.616       |
|    explained_variance   | 0.114        |
|    learning_rate        | 0.0003       |
|    loss                 | 22.7         |
|    n_updates            | 40           |
|    policy_gradient_loss | -0.0131      |
|    value_loss           | 73.4         |
------------------------------------------
New best mean reward!
------------------------------
| time/              |       |
|    fps             | 791   |
|    iterations     

<stable_baselines3.ppo.ppo.PPO at 0x1c1e0079f90>

## Step 9: Changing Policies
Let's try running CartPole with a different architecture.

We'll start by defining a new architecture for the neural net
- The pi param specifies the architecture for the actor (the policy network). It is 4 layers, each consisting of 128 nodes
- The vf param stands for value function. It uses the same architecture as the actor

In [19]:
#Recent versions of Stable Baselines3 expect a dict here, rather than a list containing a dict
net_arch = dict(pi=[128,128,128,128], vf=[128,128,128,128])
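
To get a feel for how much bigger this network is, we can count its trainable parameters by hand. The sketch below assumes plain fully connected layers with biases, 4 inputs (the CartPole observation), and 2 outputs (the two actions). The comparison against two hidden layers of 64 units assumes that this is the library default, which is worth confirming in the Stable Baselines3 docs. `mlp_param_count` is a made-up helper.

```python
def mlp_param_count(layer_sizes) -> int:
    """Count weights and biases of a fully connected network.

    layer_sizes lists the width of every layer, from input to output.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus bias vector
    return total


actor_large = mlp_param_count([4, 128, 128, 128, 128, 2])
actor_small = mlp_param_count([4, 64, 64, 2])
print(actor_large, actor_small)  # 50434 4610
```

The larger actor has roughly ten times as many parameters, which helps explain why the frames-per-second figure tends to drop with the bigger architecture.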

In [21]:
model = PPO('MlpPolicy', cb_env, verbose=1, policy_kwargs={'net_arch': net_arch})

Using cpu device




In [22]:
model.learn(total_timesteps=20000, callback=eval_callback)

-----------------------------
| time/              |      |
|    fps             | 1241 |
|    iterations      | 1    |
|    time_elapsed    | 1    |
|    total_timesteps | 2048 |
-----------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 827         |
|    iterations           | 2           |
|    time_elapsed         | 4           |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.014296461 |
|    clip_fraction        | 0.184       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.682      |
|    explained_variance   | 0.00942     |
|    learning_rate        | 0.0003      |
|    loss                 | 2.31        |
|    n_updates            | 10          |
|    policy_gradient_loss | -0.0216     |
|    value_loss           | 20.2        |
-----------------------------------------
----------------------------------



Eval num_timesteps=9520, episode_reward=437.00 +/- 87.57
Episode length: 437.00 +/- 87.57
-----------------------------------------
| eval/                   |             |
|    mean_ep_length       | 437         |
|    mean_reward          | 437         |
| time/                   |             |
|    total_timesteps      | 9520        |
| train/                  |             |
|    approx_kl            | 0.007967362 |
|    clip_fraction        | 0.0844      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.559      |
|    explained_variance   | 0.516       |
|    learning_rate        | 0.0003      |
|    loss                 | 12          |
|    n_updates            | 40          |
|    policy_gradient_loss | -0.0134     |
|    value_loss           | 38.9        |
-----------------------------------------
------------------------------
| time/              |       |
|    fps             | 588   |
|    iterations      | 5     |
|    time_elapsed    | 17    |

<stable_baselines3.ppo.ppo.PPO at 0x1c1dfff0520>

## Step 10: Changing Algorithms

Let's try a different algorithm instead of PPO: DQN (Deep Q-Network)

In [23]:
from stable_baselines3 import DQN

In [24]:
model = DQN('MlpPolicy', cb_env, verbose=1)

Using cpu device


In [25]:
model.learn(total_timesteps=20000, callback=eval_callback)

----------------------------------
| rollout/            |          |
|    exploration_rate | 0.967    |
| time/               |          |
|    episodes         | 4        |
|    fps              | 6358     |
|    time_elapsed     | 0        |
|    total_timesteps  | 70       |
----------------------------------
----------------------------------
| rollout/            |          |
|    exploration_rate | 0.932    |
| time/               |          |
|    episodes         | 8        |
|    fps              | 5298     |
|    time_elapsed     | 0        |
|    total_timesteps  | 143      |
----------------------------------
----------------------------------
| rollout/            |          |
|    exploration_rate | 0.898    |
| time/               |          |
|    episodes         | 12       |
|    fps              | 6131     |
|    time_elapsed     | 0        |
|    total_timesteps  | 214      |
----------------------------------
----------------------------------
| rollout/          



----------------------------------
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 476      |
|    fps              | 5398     |
|    time_elapsed     | 2        |
|    total_timesteps  | 10880    |
----------------------------------
----------------------------------
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 480      |
|    fps              | 5387     |
|    time_elapsed     | 2        |
|    total_timesteps  | 10987    |
----------------------------------
----------------------------------
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 484      |
|    fps              | 5372     |
|    time_elapsed     | 2        |
|    total_timesteps  | 11046    |
----------------------------------
----------------------------------
| rollout/          

<stable_baselines3.dqn.dqn.DQN at 0x1c1dfff0310>

In [26]:
dqn_path = os.path.join('Training', 'Saved Models', 'DQN_model')

In [27]:
model.save(dqn_path)
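
The exploration_rate column in the DQN training log above is the probability that the agent takes a random action instead of the one its network prefers. The logged values are consistent with a linear decay from 1.0 to 0.05 over the first 10% of training. The sketch below reproduces that schedule; the function name and the parameter values are assumptions inferred from the log, not read from the model.

```python
def exploration_rate(
    timestep: int,
    total_timesteps: int = 20_000,
    exploration_fraction: float = 0.1,
    initial_eps: float = 1.0,
    final_eps: float = 0.05,
) -> float:
    """Linearly decay epsilon over the first `exploration_fraction` of training."""
    progress = timestep / total_timesteps
    if progress > exploration_fraction:
        return final_eps
    return initial_eps + progress * (final_eps - initial_eps) / exploration_fraction


# Timesteps taken from the training log
for t in (70, 143, 214, 10880):
    print(t, round(exploration_rate(t), 3))
```

Under these assumptions the agent is exploring almost entirely at random for the first few hundred steps, and from timestep 2000 onward it takes a random action only 5% of the time.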