# Installing Dependencies

Stable Baselines3 is a reinforcement learning (RL) library for working with model-free algorithms. Stable Baselines3 runs on PyTorch; the original Stable Baselines ran on TensorFlow.

In [21]:
!pip install --user tensorflow

Collecting tensorflow
  Using cached https://files.pythonhosted.org/packages/46/18/cdee520cbd3761e202163f035b647ab1e12b57fd9b7b8c594673a01e4448/tensorflow-2.6.0-cp37-cp37m-win_amd64.whl
Collecting gast==0.4.0 (from tensorflow)
  Using cached https://files.pythonhosted.org/packages/b6/48/583c032b79ae5b3daa02225a675aeb673e58d2cb698e78510feceb11958c/gast-0.4.0-py3-none-any.whl
Collecting google-pasta~=0.2 (from tensorflow)
  Using cached https://files.pythonhosted.org/packages/a3/de/c648ef6835192e6e2cc03f40b19eeda4382c49b5bafb43d88b931c4c74ac/google_pasta-0.2.0-py3-none-any.whl
Collecting termcolor~=1.1.0 (from tensorflow)
Collecting flatbuffers~=1.12.0 (from tensorflow)
  Using cached https://files.pythonhosted.org/packages/eb/26/712e578c5f14e26ae3314c39a1bdc4eb2ec2f4ddc89b708cf8e0a0d20423/flatbuffers-1.12-py2.py3-none-any.whl
Collecting tensorflow-estimator~=2.6 (from tensorflow)
  Using cached https://files.pythonhosted.org/packages/c8/54/1b2f1e22a2670546cc02e4df1b80425edaee02133173bb9

ERROR: astroid 2.3.1 requires typed-ast<1.5,>=1.4.0; implementation_name == "cpython" and python_version < "3.8", which is not installed.
ERROR: astroid 2.3.1 has requirement six==1.12, but you'll have six 1.15.0 which is incompatible.
ERROR: astroid 2.3.1 has requirement wrapt==1.11.*, but you'll have wrapt 1.12.1 which is incompatible.


In [22]:
import tensorflow as tf

In [1]:
!pip install stable-baselines3[extra]



# Imports

In [2]:
import os  # builds the paths where the model and the logs are saved
import gym  # OpenAI Gym: build environments or work with pre-existing ones
from stable_baselines3 import PPO  # the RL algorithm used here
# Stable Baselines can vectorize environments so an agent trains on several environments at the same time, which boosts training speed
from stable_baselines3.common.vec_env import DummyVecEnv  # not true vectorization: a wrapper that gives a single env the interface Stable Baselines expects
from stable_baselines3.common.evaluation import evaluate_policy  # tests how a model is performing: average reward over a number of episodes

In [3]:
from platform import python_version

print(python_version())

3.7.4


# Load Environment

In [3]:
pip install pyglet

Collecting pyglet
  Using cached https://files.pythonhosted.org/packages/48/c2/5898d5cce5d5ce7e74b5a515f2d107a82f2c4d0d4505c0ca119cb34c6b01/pyglet-1.5.19-py3-none-any.whl
Installing collected packages: pyglet
Successfully installed pyglet-1.5.19
Note: you may need to restart the kernel to use updated packages.


In [5]:
# load the environment
environment_name = 'CartPole-v0'
env = gym.make(environment_name)

In [14]:
# Test the environment with random actions to see how an agent can interact with it.
episodes = 5  # run 5 test episodes
for episode in range(1, episodes + 1):  # loop through each episode
    state = env.reset()  # reset the environment at the start of every episode and get the initial observations
    # An RL agent would use these observations to choose the action that maximizes reward; we aren't doing that yet.
    done = False  # the episode is not done
    score = 0

    # the actions push the cart to the left or to the right
    while not done:
        env.render()  # visual representation of the environment
        action = env.action_space.sample()  # a random action, NOT an action informed by the observations
        # Supply the action to the environment. It returns the next observations, the reward
        # for that action, whether the episode is done (if so, stop), and extra info.
        n_state, reward, done, info = env.step(action)
        score += reward
    print('Episode:{} Score:{}'.format(episode, score))
#env.close()
    

TypeError: 'int' object is not subscriptable

In [8]:
env.reset() # returns the initial observation: four values drawn from the observation space

array([-0.02500592,  0.0207521 , -0.04510446, -0.01304501])

In [5]:
env.action_space # two possible actions: 0 or 1

Discrete(2)

In [6]:
env.action_space.sample() 

1

There are two different spaces within any environment: the action space and the observation space.

In [8]:
env.observation_space

Box([-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38], [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38], (4,), float32)

In [9]:
env.observation_space.sample() # also randomly generated

array([ 1.9326792e+00,  1.3847042e+38,  3.3825469e-01, -2.5008390e+38],
      dtype=float32)

# Understanding the Environment

Remember, there are two parts of our environment: the action space and the observation space.

# What do the action space values mean?

Type: Discrete(2)
0: push cart to the left, 1: push cart to the right

The reward is 1 for every step taken.
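
To make the meaning of `Discrete(2)` concrete, here is a minimal sketch using a hypothetical stand-in class, `ToyDiscrete`. It is not Gym's implementation; it only mimics the idea of sampling an integer action and shows that, with a reward of 1 per step, the score equals the number of steps taken.

```python
import random


class ToyDiscrete:
    """Hypothetical stand-in for a Discrete(n) action space (not Gym's class)."""

    def __init__(self, n, seed=None):
        self.n = n
        self._rng = random.Random(seed)

    def sample(self):
        # Pick one of the integers 0 .. n-1 uniformly at random.
        return self._rng.randrange(self.n)

    def contains(self, action):
        return isinstance(action, int) and 0 <= action < self.n


ACTION_MEANING = {0: "push cart to the left", 1: "push cart to the right"}

space = ToyDiscrete(2, seed=0)
actions = [space.sample() for _ in range(10)]

# With a reward of 1 per step, the score is just the number of steps taken.
score = sum(1.0 for _ in actions)

for action in actions[:3]:
    print(action, "->", ACTION_MEANING[action])
print("score after", len(actions), "steps:", score)
```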

In [13]:
env.action_space

Discrete(2)

In [14]:
env.action_space.sample()

0

# What do these observation spaces values mean?

You can find these in the OpenAI Gym documentation, but here they are:

Type: Box(4)
Num 0: Cart Position, [-4.8, 4.8]
Num 1: Cart Velocity, (-Inf, Inf)
Num 2: Pole Angle, [-24 degrees, 24 degrees]
Num 3: Pole Angular Velocity, (-Inf, Inf)
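
The numbers printed by `env.observation_space` look different from this table, but they describe the same bounds. The short check below (standard library only) suggests how to read them: the pole angle bound is 24 degrees expressed in radians, and the "infinite" velocity bounds are stored as the largest finite float32 value.

```python
import math
import struct

# 24 degrees in radians; compare with 4.1887903e-01 in the printed Box bounds.
pole_angle_bound = math.radians(24)

# Largest finite float32 value; compare with 3.4028235e+38 in the printed Box bounds.
float32_max = struct.unpack("<f", bytes.fromhex("ffff7f7f"))[0]

print(round(pole_angle_bound, 7))
print(float32_max)
```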

In [11]:
env.observation_space

Box([-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38], [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38], (4,), float32)

In [15]:
env.observation_space.sample()

array([-4.5754972e+00,  2.6312240e+38,  2.7803665e-01,  1.0333603e+38],
      dtype=float32)

# Training our agent for real 

In [4]:
# location for our TensorBoard logs: useful for monitoring how the model is performing
log_path = os.path.join('Training', 'Logs') # inside the folder you're working in: a folder called Training, which contains a folder called Logs

In [3]:
log_path

'Training\\Logs'
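
The doubled backslash in this output is just how Python displays a Windows path separator. As a small illustration, `os.path` is an alias for either `ntpath` (Windows) or `posixpath` (macOS, Linux), so the same `os.path.join` call produces a different string on each platform:

```python
import ntpath
import posixpath

parts = ("Training", "Logs")

# os.path points to one of these two modules depending on the platform.
windows_style = ntpath.join(*parts)
posix_style = posixpath.join(*parts)

print(repr(windows_style))  # 'Training\\Logs'
print(repr(posix_style))    # 'Training/Logs'
```

Building paths with `os.path.join` rather than hard-coding a separator is what keeps the notebook portable between the two.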

In [6]:
# instantiate the algorithm (PPO)
env = gym.make(environment_name)
env = DummyVecEnv([lambda: env]) # wrap the environment
model = PPO('MlpPolicy', env, verbose = 1, tensorboard_log = log_path) # define the model
# MlpPolicy: the policy is a neural network (a multilayer perceptron)
# verbose = 1: log out the training results

Using cpu device
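
To give a feel for what wrapping an environment does, here is a simplified sketch. `ToyEnv` and `ToyVecWrapper` are hypothetical classes written for illustration, not Stable Baselines code, and the sketch leaves out details such as automatic resets. The point is only that a vectorized wrapper returns one entry per environment, which is why scores from a wrapped environment print as `[200.]` rather than `200.0`.

```python
class ToyEnv:
    """Hypothetical environment: every step gives reward 1, ends after 3 steps."""

    def reset(self):
        self.t = 0
        return [0.0]  # a one-element observation

    def step(self, action):
        self.t += 1
        done = self.t >= 3
        return [float(self.t)], 1.0, done, {}


class ToyVecWrapper:
    """Batches a list of environments (simplified stand-in, not DummyVecEnv)."""

    def __init__(self, env_fns):
        self.envs = [fn() for fn in env_fns]

    def reset(self):
        return [env.reset() for env in self.envs]

    def step(self, actions):
        results = [env.step(a) for env, a in zip(self.envs, actions)]
        obs, rewards, dones, infos = (list(x) for x in zip(*results))
        return obs, rewards, dones, infos


vec_env = ToyVecWrapper([lambda: ToyEnv()])
obs = vec_env.reset()
obs, rewards, dones, infos = vec_env.step([0])  # one action per environment

# Every returned value now has a leading "one per environment" axis.
print(obs, rewards, dones)
```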


# You need to look at the documentation

In [5]:
model.learn(total_timesteps=20000) 

Logging to Training\Logs\PPO_4
-----------------------------
| time/              |      |
|    fps             | 2682 |
|    iterations      | 1    |
|    time_elapsed    | 0    |
|    total_timesteps | 2048 |
-----------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 1843        |
|    iterations           | 2           |
|    time_elapsed         | 2           |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.008439124 |
|    clip_fraction        | 0.103       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.686      |
|    explained_variance   | 0.00611     |
|    learning_rate        | 0.0003      |
|    loss                 | 6.29        |
|    n_updates            | 10          |
|    policy_gradient_loss | -0.0155     |
|    value_loss           | 49.5        |
-----------------------------------------
---

<stable_baselines3.ppo.ppo.PPO at 0x1c1cf26e488>
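
In the log above, `total_timesteps` grows by 2048 per iteration. The sketch below estimates how many iterations a given budget produces, under the assumption that PPO collects 2048 steps per iteration (its default rollout length with a single environment) and keeps collecting full rollouts until the budget is reached. `rollout_summary` is a hypothetical helper, not part of Stable Baselines3.

```python
import math

N_STEPS = 2048  # assumption: PPO's default rollout length with one environment


def rollout_summary(total_timesteps, n_steps=N_STEPS):
    """Return (iterations, timesteps actually collected) for a requested budget."""
    iterations = math.ceil(total_timesteps / n_steps)
    return iterations, iterations * n_steps


print(rollout_summary(20000))  # (10, 20480)
print(rollout_summary(1000))   # (1, 2048)
```

Under this assumption, a request for fewer than 2048 timesteps still logs one full iteration of 2048.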

# Save and Reload Model

In [6]:
PPO_Path = os.path.join('Training', 'Saved Models', 'PPO_Model_Cartpole')

In [7]:
model.save(PPO_Path)

In [9]:
del model

In [8]:
model.learn(total_timesteps=1000)

Logging to Training\Logs\PPO_5
-----------------------------
| time/              |      |
|    fps             | 2754 |
|    iterations      | 1    |
|    time_elapsed    | 0    |
|    total_timesteps | 2048 |
-----------------------------


<stable_baselines3.ppo.ppo.PPO at 0x1c1cf26e488>

In [12]:
# after deleting the model, you can recover it by reloading it from disk back into memory
model = PPO.load(PPO_Path, env= env)

# Evaluation

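For CartPole-v0, the environment is considered solved when the average reward approaches the 200-step maximum (Gym's documented threshold is an average of 195 over 100 consecutive episodes).
Test our model to see how it's performing.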

In [9]:
# check the documentation for evaluate_policy, too
evaluate_policy(model, env, n_eval_episodes = 10, render = True)



(200.0, 0.0)

Check out the output: the first value is the average reward and the second is the standard deviation.


Reward for CartPole is calculated as 1 point for every step that the pole remains upright (with a max of 200 steps). The episode ends if the pole is more than 12 degrees from vertical or the cart moves more than 2.4 units from the center.
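
The end-of-episode rules can be written as a small check. This is a sketch with a hypothetical helper, `episode_over`, assuming the limits given in Gym's CartPole description (12 degrees, 2.4 units) and the 200-step cap of CartPole-v0. It is not the environment's actual code.

```python
import math

ANGLE_LIMIT = math.radians(12)  # pole angle limit, per Gym's CartPole description
POSITION_LIMIT = 2.4            # cart position limit
MAX_STEPS = 200                 # episode cap for CartPole-v0


def episode_over(cart_position, pole_angle, steps_taken):
    """Hypothetical helper: True if any end-of-episode condition holds."""
    return (
        abs(cart_position) > POSITION_LIMIT
        or abs(pole_angle) > ANGLE_LIMIT
        or steps_taken >= MAX_STEPS
    )


print(episode_over(0.0, 0.05, 10))   # False: pole nearly upright, cart centered
print(episode_over(2.5, 0.0, 10))    # True: cart too far from the center
print(episode_over(0.0, 0.0, 200))   # True: step cap reached
```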

In [10]:
env.close()

# Testing Agent

In [11]:
# Test the environment again. The difference from before: the trained model now chooses the actions.
episodes = 5  # run 5 test episodes
for episode in range(1, episodes + 1):  # loop through each episode
    obs = env.reset()  # reset the environment at the start of every episode and get the initial observations
    # these observations are passed to the agent, which picks the action it expects to maximize reward
    done = False  # the episode is not done
    score = 0

    # the actions push the cart to the left or to the right
    while not done:
        env.render()  # visual representation of the environment
        action, _ = model.predict(obs)  # NOW USING THE MODEL instead of a random action
        # Supply the predicted action to the environment. It returns the next observations,
        # the reward for that action, whether the episode is done (if so, stop), and extra info.
        obs, reward, done, info = env.step(action)
        score += reward
    print('Episode:{} Score:{}'.format(episode, score))
#env.close()

Episode:1 Score:[200.]
Episode:2 Score:[199.]
Episode:3 Score:[200.]
Episode:4 Score:[200.]
Episode:5 Score:[200.]
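
The five scores above can be summarized the same way `evaluate_policy` reports its result, as a mean and a standard deviation. A minimal sketch using only the standard library, assuming the population standard deviation (NumPy's default) is the one wanted:

```python
from statistics import mean, pstdev

# The five episode scores printed above.
episode_scores = [200.0, 199.0, 200.0, 200.0, 200.0]

mean_reward = mean(episode_scores)
std_reward = pstdev(episode_scores)  # population standard deviation

print(round(mean_reward, 2), round(std_reward, 2))  # 199.8 0.4
```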


In [12]:
env.close()

# Viewing Logs in Tensorboard

It's a good idea to use TensorBoard when you have a much larger or more complex environment. Ideally, run it from a command prompt rather than from the notebook.

In [7]:
# get log directory that you want to view
# specify the training log path
training_log_path = os.path.join(log_path, 'PPO_2')

training_log_path

'Training\\Logs\\PPO_2'

In [12]:
log_path

'Training\\Logs'

In [14]:
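!tensorboard --logdir={training_log_path}

# Breaking down this line:
# a leading exclamation mark in a Jupyter notebook runs the rest of the line as a shell command,
# and {training_log_path} is replaced with the value of the Python variable.
# There must be no spaces around '='. Writing `--logdir = {training_log_path}` makes tensorboard
# print its usage message instead of starting, which is what the output below shows.


# see if this works on your mac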

2021-09-01 12:24:07.998972: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found
2021-09-01 12:24:07.999165: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
usage: tensorboard [-h] [--helpfull] [--logdir PATH] [--logdir_spec PATH_SPEC]
                   [--host ADDR] [--bind_all] [--port PORT]
                   [--reuse_port BOOL] [--load_fast {false,auto,true}]
                   [--extra_data_server_flags EXTRA_DATA_SERVER_FLAGS]
                   [--grpc_creds_type {local,ssl,ssl_dev}]
                   [--grpc_data_provider PORT] [--purge_orphaned_data BOOL]
                   [--db URI] [--db_import] [--inspect] [--version_tb]
                   [--tag TAG] [--event_file PATH] [--path_prefix PATH]
                   [--window_title TEXT] [--max_reload_threads COUNT]
                   [--reload_inter
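
The usage message above is what tensorboard prints when it cannot parse its arguments. One likely cause is the spacing around `=`: the shell splits the command on spaces, so the flag, the `=`, and the path arrive as separate tokens. The sketch below only builds and inspects the argument lists and does not launch anything. The example path in `broken_tokens` is written with forward slashes purely for illustration.

```python
import os

log_path = os.path.join("Training", "Logs")
training_log_path = os.path.join(log_path, "PPO_2")

# Flag and value stay together in a single argument: no spaces around '='.
command = ["tensorboard", f"--logdir={training_log_path}"]

# With spaces around '=', splitting on whitespace yields four separate tokens.
broken_tokens = "tensorboard --logdir = Training/Logs/PPO_2".split()

print(command)
print(broken_tokens)
```

A list like `command` could be passed to `subprocess.run` from a script, or the same text typed at a command prompt.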

In [None]:
env.close()

Note: the core metrics to look at when testing your model's performance are the average reward and the average episode length.

Average Episode Length:

- shows how long the model lasts in the environment
- particularly important in environments that don't have a fixed episode length; we will see more of this in other examples.

If the model is not performing well:
- Train for longer
- Tune the hyperparameters

# Adding a callback to the training stage

Import dependencies

EvalCallback: the callback that runs periodically during training to evaluate the model
StopTrainingOnRewardThreshold: think of this as a checker. Once the model passes a reward threshold, it stops the training

In [15]:
from stable_baselines3.common.callbacks import EvalCallback, StopTrainingOnRewardThreshold

In [16]:
save_path = os.path.join('Training', 'Saved Models')

In [20]:
stop_callback = StopTrainingOnRewardThreshold(reward_threshold = 200, verbose = 1) # stops training once the reward threshold is reached
eval_callback = EvalCallback(env, callback_on_new_best = stop_callback, eval_freq = 10000, 
                            best_model_save_path= save_path, verbose = 1)
# Arguments: the environment; the callback to run on a new best model (every time there is a new best model, stop_callback runs);
# eval_freq: how often to evaluate, in timesteps;
# best_model_save_path: the eval callback saves the model every time there is a new best model.
# save_path is the folder where the best model is saved.

# So every 10000 timesteps the model is evaluated. If it is a new best, it is saved, and if it has passed the reward threshold, training stops.
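
The interaction between the two callbacks can be sketched as a toy training loop. Everything here is hypothetical (`train_with_eval`, the `evaluate` function, the linear reward curve) and it is not how Stable Baselines3 implements callbacks. It only mirrors the logic described above: evaluate every `eval_freq` timesteps, track the best mean reward, and stop once a new best reaches the threshold.

```python
def train_with_eval(total_timesteps, eval_freq, reward_threshold, evaluate):
    """Toy training loop (hypothetical; not SB3's implementation).

    evaluate(timestep) returns the mean reward of the current policy.
    Returns (timestep at which training ended, best mean reward seen).
    """
    best_mean_reward = float("-inf")
    for timestep in range(1, total_timesteps + 1):
        # ... one environment step and any policy updates would happen here ...
        if timestep % eval_freq == 0:
            mean_reward = evaluate(timestep)
            if mean_reward > best_mean_reward:
                best_mean_reward = mean_reward  # new best: the model would be saved here
                if best_mean_reward >= reward_threshold:
                    return timestep, best_mean_reward  # threshold reached: stop early
    return total_timesteps, best_mean_reward


# Hypothetical reward curve: improves linearly with training, capped at 200.
stopped_at, best = train_with_eval(
    total_timesteps=20000,
    eval_freq=10000,
    reward_threshold=200,
    evaluate=lambda t: min(200.0, t / 50),
)
print(stopped_at, best)  # 10000 200.0
```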

Now that these are established, we need to associate these with our model

In [21]:
model = PPO('MlpPolicy', env, verbose = 1, tensorboard_log = log_path)

Using cpu device


In [22]:
model.learn(total_timesteps = 20000, callback = eval_callback)

Logging to Training\Logs\PPO_6
-----------------------------
| time/              |      |
|    fps             | 814  |
|    iterations      | 1    |
|    time_elapsed    | 2    |
|    total_timesteps | 2048 |
-----------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 1012        |
|    iterations           | 2           |
|    time_elapsed         | 4           |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.008749456 |
|    clip_fraction        | 0.0987      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.686      |
|    explained_variance   | 0.000267    |
|    learning_rate        | 0.0003      |
|    loss                 | 7.4         |
|    n_updates            | 10          |
|    policy_gradient_loss | -0.0142     |
|    value_loss           | 52.8        |
-----------------------------------------
---



Eval num_timesteps=10000, episode_reward=200.00 +/- 0.00
Episode length: 200.00 +/- 0.00
-----------------------------------------
| eval/                   |             |
|    mean_ep_length       | 200         |
|    mean_reward          | 200         |
| time/                   |             |
|    total timesteps      | 10000       |
| train/                  |             |
|    approx_kl            | 0.009751417 |
|    clip_fraction        | 0.0962      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.611      |
|    explained_variance   | 0.341       |
|    learning_rate        | 0.0003      |
|    loss                 | 22.3        |
|    n_updates            | 40          |
|    policy_gradient_loss | -0.0195     |
|    value_loss           | 57.5        |
-----------------------------------------
New best mean reward!
Stopping training because the mean reward 200.00  is above the threshold 200


<stable_baselines3.ppo.ppo.PPO at 0x27760720d88>

Callbacks give you a lot more control during training.