##### Github : @saty035 Twitter : @heysaty

# Install Dependencies

### Stable Baselines
Stable Baselines3 is a Python library that makes it easier to get up and running with reinforcement learning (RL).

It was originally based on the OpenAI Baselines package, with additional features that make it easy to get started with RL.

https://stable-baselines.readthedocs.io/en/master/

In [3]:
# install Stable Baselines3

# the 3 selects Stable Baselines3, the current version of the library;
# [extra] also installs the optional dependencies
!pip install stable_baselines3[extra]



In [6]:
import os
import gym
# proximal policy optimization
from stable_baselines3 import PPO
# Stable Baselines lets you vectorize environments, i.e. train your agent
# on multiple environments at the same time (which can speed up training considerably)
from stable_baselines3.common.vec_env import DummyVecEnv
# check how model is actually performing
from stable_baselines3.common.evaluation import evaluate_policy

# Load Environment

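### OpenAI Gym Spaces
OpenAI Gym provides ready-made environments, so you have a starting point rather than having to write all of the code yourself. Each environment describes its observations and actions with one of these space types:

1. Box
2. Discrete
3. Tuple
4. Dict
5. MultiBinary
6. MultiDiscrete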
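
To make the two space types used by CartPole concrete, here is a minimal sketch of what a `Discrete` and a `Box` space represent. The classes `ToyDiscrete` and `ToyBox` are hypothetical stand-ins written for this note, not Gym's own implementations, and the bounds passed to `ToyBox` are illustrative values only.

```python
import random


class ToyDiscrete:
    """Stand-in for a Discrete(n) space: the integers 0 .. n-1."""

    def __init__(self, n):
        self.n = n

    def sample(self):
        # Pick one of the n possible values uniformly at random.
        return random.randrange(self.n)

    def contains(self, x):
        return isinstance(x, int) and 0 <= x < self.n


class ToyBox:
    """Stand-in for a Box space: a vector with per-component bounds."""

    def __init__(self, low, high):
        assert len(low) == len(high)
        self.low, self.high = list(low), list(high)

    def sample(self):
        # Draw every component independently within its own bounds.
        return [random.uniform(lo, hi) for lo, hi in zip(self.low, self.high)]

    def contains(self, x):
        return len(x) == len(self.low) and all(
            lo <= v <= hi for lo, v, hi in zip(self.low, x, self.high)
        )


actions = ToyDiscrete(2)  # two possible actions: 0 or 1
observations = ToyBox([-4.8, -10.0, -0.42, -10.0], [4.8, 10.0, 0.42, 10.0])  # illustrative bounds

print(actions.sample())
print(observations.sample())
```

A `Discrete` sample is a single integer, while a `Box` sample is a vector of floats, which matches the shapes returned by `env.action_space.sample()` and `env.observation_space.sample()` further down.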

In [7]:
envornment_name='CartPole-v0'
env=gym.make(envornment_name)

In [8]:
envornment_name

'CartPole-v0'

In [17]:
episodes=5
for episode in range(1,episodes+1):
    state=env.reset()
    done=False
    score=0
    
    while not done:
        env.render()
        action=env.action_space.sample()
        n_state,reward,done,info=env.step(action)
        score+=reward
    print('Episode :{}, Score : {}'.format(episode,score))
# env.close()

Episode :1, Score : 54.0
Episode :2, Score : 37.0
Episode :3, Score : 21.0
Episode :4, Score : 22.0
Episode :5, Score : 18.0


In [18]:
env.close()

In [19]:
env.reset()

array([-0.01662121,  0.03712428,  0.01222216,  0.00685307])

# Understanding Our Environment
##### https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py

In [23]:
env.observation_space

Box([-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38], [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38], (4,), float32)
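
The four numbers in the observation are, going by the Gym source linked above, the cart position, cart velocity, pole angle (in radians) and pole angular velocity. The sketch below labels them with a hypothetical `CartPoleObs` namedtuple (not part of Gym) and converts the angle bound from the `Box` to degrees. The `3.4028235e+38` bounds are simply the largest float32 value, so the two velocities are effectively unbounded.

```python
import math
from collections import namedtuple

# Hypothetical helper for readability; Gym itself returns a plain array.
CartPoleObs = namedtuple(
    "CartPoleObs",
    ["cart_position", "cart_velocity", "pole_angle", "pole_angular_velocity"],
)

# Values taken from the env.reset() output shown earlier.
obs = CartPoleObs(-0.01662121, 0.03712428, 0.01222216, 0.00685307)
print(obs.pole_angle)

# The pole-angle bound in the Box is expressed in radians.
angle_bound_deg = math.degrees(0.41887903)
print(round(angle_bound_deg, 1))  # 24.0
```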

In [29]:
env.observation_space.sample()  # random choice

array([ 1.9969771e+00,  2.8919903e+38, -3.3435616e-01,  3.1651899e+38],
      dtype=float32)

In [24]:
env.action_space

Discrete(2)

In [26]:
env.action_space.sample() # random choice

1

##### look at choosing_algo


# Train an RL model

In [30]:
# make your directories first
log_path=os.path.join('Training','Logs')

In [31]:
log_path

'Training\\Logs'
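
The doubled backslash in the output above is just how Python displays a Windows path separator. As a small sketch using only the standard library, the snippet below shows how the same join behaves under Windows and POSIX rules, and one way to create the directory up front. The temporary root directory is an assumption made so the example leaves no files behind.

```python
import ntpath
import os
import posixpath
import tempfile

# The same join call yields different separators depending on the platform rules.
print(ntpath.join("Training", "Logs"))     # Training\Logs  (Windows rules)
print(posixpath.join("Training", "Logs"))  # Training/Logs  (Linux/macOS rules)

# Create the directory before training; exist_ok=True makes the call safe to repeat.
with tempfile.TemporaryDirectory() as root:  # temporary root, for illustration only
    log_dir = os.path.join(root, "Training", "Logs")
    os.makedirs(log_dir, exist_ok=True)
    os.makedirs(log_dir, exist_ok=True)  # second call does not raise
    created = os.path.isdir(log_dir)

print(created)  # True
```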

In [33]:
# recreate the env
env=gym.make(envornment_name)
# wrap it in a DummyVecEnv
env=DummyVecEnv([lambda:env])
# MlpPolicy: a multilayer perceptron (MLP) policy
# Policy: think of an agent's policy as the rule that tells it how to act in the env
# verbose: integer. 0 = no output, 1 = info messages, 2 = debug messages
model=PPO('MlpPolicy',env,verbose=1,tensorboard_log=log_path)

Using cpu device
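
A policy is just a mapping from an observation to an action. PPO learns that mapping as a neural network, but a hand-written rule is a policy too. The sketch below is a simple heuristic, assuming the observation layout from the Gym CartPole source (pole angle at index 2) and the action encoding 0 = push left, 1 = push right. The name `lean_policy` is made up for this note.

```python
def lean_policy(obs):
    """Push the cart toward the side the pole is leaning to.

    Assumes obs[2] is the pole angle, positive when the pole leans right.
    """
    pole_angle = obs[2]
    return 1 if pole_angle > 0 else 0


print(lean_policy([0.0, 0.0, 0.05, 0.0]))   # 1 (push right)
print(lean_policy([0.0, 0.0, -0.05, 0.0]))  # 0 (push left)
```

A learned policy plays the same role as this function; it is only the rule inside that differs.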


In [34]:
# PPO?

In [36]:
# each time we run this cell, the model trains for another 20,000 timesteps
model.learn(total_timesteps=20000)

Logging to Training\Logs\PPO_2
-----------------------------
| time/              |      |
|    fps             | 525  |
|    iterations      | 1    |
|    time_elapsed    | 3    |
|    total_timesteps | 2048 |
-----------------------------
----------------------------------------
| time/                   |            |
|    fps                  | 363        |
|    iterations           | 2          |
|    time_elapsed         | 11         |
|    total_timesteps      | 4096       |
| train/                  |            |
|    approx_kl            | 0.00466931 |
|    clip_fraction        | 0.0248     |
|    clip_range           | 0.2        |
|    entropy_loss         | -0.551     |
|    explained_variance   | 0.371      |
|    learning_rate        | 0.0003     |
|    loss                 | 48.2       |
|    n_updates            | 110        |
|    policy_gradient_loss | -0.00689   |
|    value_loss           | 127        |
----------------------------------------
---------------------

<stable_baselines3.ppo.ppo.PPO at 0x21357207c10>
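
In the log above, `total_timesteps` grows by 2048 per iteration, so the requested budget is consumed in whole rollouts. As a rough sketch, assuming 2048 steps per iteration (as the log shows) and a single environment, the number of iterations can be estimated like this. The helper `ppo_iterations` is hypothetical, not a Stable Baselines function.

```python
import math


def ppo_iterations(total_timesteps, steps_per_iteration=2048):
    """Return (iterations, timesteps actually collected) for a timestep budget."""
    # Training proceeds in whole rollouts, so the budget is rounded up.
    iterations = math.ceil(total_timesteps / steps_per_iteration)
    return iterations, iterations * steps_per_iteration


print(ppo_iterations(20000))  # (10, 20480)
print(ppo_iterations(10000))  # (5, 10240)
```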

# Save And Reload Our Model

In [37]:
# joining the path
PPO_Path=os.path.join('Training','Saved Models','PPO_Model_CartPole')

In [38]:
# saving our model
model.save(PPO_Path)

In [39]:
PPO_Path

'Training\\Saved Models\\PPO_Model_CartPole'

In [40]:
# deleting the model
del model

In [41]:
model

NameError: name 'model' is not defined

In [42]:
# reload the model
model=PPO.load(PPO_Path,env=env)

In [44]:
model.learn(total_timesteps=10000)

Logging to Training\Logs\PPO_3
-----------------------------
| time/              |      |
|    fps             | 539  |
|    iterations      | 1    |
|    time_elapsed    | 3    |
|    total_timesteps | 2048 |
-----------------------------
------------------------------------------
| time/                   |              |
|    fps                  | 360          |
|    iterations           | 2            |
|    time_elapsed         | 11           |
|    total_timesteps      | 4096         |
| train/                  |              |
|    approx_kl            | 0.0038065212 |
|    clip_fraction        | 0.0248       |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.539       |
|    explained_variance   | 0.419        |
|    learning_rate        | 0.0003       |
|    loss                 | 75.7         |
|    n_updates            | 210          |
|    policy_gradient_loss | -0.00221     |
|    value_loss           | 151          |
----------------------------

<stable_baselines3.ppo.ppo.PPO at 0x2136b6b6670>

# Evaluation
Reward for CartPole is calculated as 1 point for every step that the pole 
remains upright (with a maximum of 200 steps for CartPole-v0).
If the pole tilts more than 12 degrees from vertical or the cart moves more than 2.4 units 
from the center, the episode ends.

In [45]:
evaluate_policy(model,env,n_eval_episodes=10,render=True)
# returns avg reward and std deviation



(200.0, 0.0)
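
The pair returned above is the mean and the standard deviation of the per-episode rewards. The sketch below reproduces that summary with the standard library, assuming a population standard deviation. `summarize` is a hypothetical helper, and the random-agent scores are the five printed in the first rollout loop of this notebook.

```python
from statistics import fmean, pstdev


def summarize(episode_rewards):
    """Return (mean, population standard deviation) of a list of episode rewards."""
    return fmean(episode_rewards), pstdev(episode_rewards)


# Ten evaluation episodes that all reach the 200-step cap.
trained = [200.0] * 10
print(summarize(trained))  # (200.0, 0.0)

# Scores of the random agent from the first loop, for comparison.
random_agent = [54.0, 37.0, 21.0, 22.0, 18.0]
mean, std = summarize(random_agent)
print(round(mean, 1), round(std, 1))  # 30.4 13.5
```

A standard deviation of 0 means every evaluation episode hit the cap, so the trained agent is not just better on average but consistently so.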

In [46]:
env.close()

In [49]:
obs=env.reset()
model.predict(obs) # returns the model's action and the next hidden state (None for non-recurrent policies)

(array([1], dtype=int64), None)

# Testing Our Model


In [65]:
#  Agent is our model 
episodes=5
for episode in range(1,episodes+1):
    obs=env.reset()
    done=False
    score=0
    
    while not done:
        env.render()
        # Now we are using our model to predict the actions
        action, _ = model.predict(obs) # returns the action and the next hidden state (unused here)
        obs,reward,done,info=env.step(action)
        score+=reward
    print('Episode :{}, Score : {}'.format(episode,score))
# env.close()

Episode :1, Score : [200.]
Episode :2, Score : [200.]
Episode :3, Score : [200.]
Episode :4, Score : [200.]
Episode :5, Score : [200.]
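
The scores print as `[200.]` rather than `200.0` because the environment is wrapped in a vectorized env, which returns one entry per wrapped environment. The toy classes below (`CountdownEnv` and `ToyVecEnv`, both hypothetical and far simpler than `DummyVecEnv`, which for example also resets finished environments automatically) sketch that batching and how to index into it for a plain number.

```python
class CountdownEnv:
    """Toy environment: pays reward 1.0 per step and ends after `horizon` steps."""

    def __init__(self, horizon):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0]

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        return [float(self.t)], 1.0, done, {}


class ToyVecEnv:
    """Wraps a list of environments and batches their outputs into lists."""

    def __init__(self, env_fns):
        self.envs = [fn() for fn in env_fns]

    def reset(self):
        return [env.reset() for env in self.envs]

    def step(self, actions):
        results = [env.step(a) for env, a in zip(self.envs, actions)]
        # Transpose [(obs, reward, done, info), ...] into four parallel lists.
        obs, rewards, dones, infos = map(list, zip(*results))
        return obs, rewards, dones, infos


vec_env = ToyVecEnv([lambda: CountdownEnv(horizon=3)])
obs = vec_env.reset()
done, score = False, 0.0
while not done:
    obs, rewards, dones, infos = vec_env.step([0])
    score += rewards[0]  # index 0: the first (and only) wrapped environment
    done = dones[0]

print(score)  # 3.0
```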


In [66]:
env.close()

In [56]:
obs=env.reset()

In [60]:
model.predict?
# returns the model's action and the next hidden state (used only by recurrent policies)

In [62]:
env.action_space.sample()

0

In [70]:
action,_=model.predict(obs)
# checking reward
env.step(action)

(array([[ 0.04241536, -0.03299304,  0.04505168,  0.05955849]],
       dtype=float32),
 array([1.], dtype=float32),
 array([False]),
 [{}])

# Viewing Logs in Tensorboard

In [71]:
training_log_path=os.path.join(log_path,'PPO_2')

In [72]:
training_log_path

'Training\\Logs\\PPO_2'
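
Each call to `learn` in this notebook logs to a new numbered folder (`PPO_2`, `PPO_3`, `PPO_7`, ...), so the hard-coded `PPO_2` may not be the run you want to inspect. As a sketch, the hypothetical helper `latest_run` below picks the highest-numbered folder for a given prefix from a list of names, for example the result of `os.listdir(log_path)`.

```python
import re


def latest_run(dir_names, prefix):
    """Return the name with the highest numeric suffix for `prefix`, or None."""
    pattern = re.compile(rf"^{re.escape(prefix)}_(\d+)$")
    numbered = []
    for name in dir_names:
        match = pattern.match(name)
        if match:
            # Compare by integer so that PPO_10 ranks above PPO_9.
            numbered.append((int(match.group(1)), name))
    return max(numbered)[1] if numbered else None


print(latest_run(["PPO_2", "PPO_3", "PPO_7", "PPO_8", "DQN_2"], "PPO"))  # PPO_8
```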

In [76]:
!tensorboard --logdir={training_log_path}
# alternatively, run ''tensorboard --logdir=.'' from a terminal inside the PPO directory

#### Core metrics to look at:
1.  Average reward
2.  Average episode length

#### Training Strategies:
1. Train for longer
2. Hyperparameter tuning
3. Try different algorithms

#### Applying Callbacks:
You can use callbacks in Stable Baselines to log data or to save the model when a certain condition is met.

#### Modifying Neural Network Architecture:
You can also change the underlying neural network that Stable Baselines uses as part of the policy.

#### Using a Different Algorithm:
Stable Baselines comes pre-packaged with a number of different algorithms that can be used to train your agent.

# Adding a Callback to the Training Stage

In [78]:
# EvalCallback runs periodically during training to evaluate the agent and save the best model so far
# StopTrainingOnRewardThreshold stops training once the mean evaluation reward reaches a threshold
from stable_baselines3.common.callbacks import EvalCallback, StopTrainingOnRewardThreshold

In [79]:
save_path=os.path.join('Training','Saved Models')

In [90]:
stop_callback=StopTrainingOnRewardThreshold(reward_threshold=200,verbose=1)
eval_callback=EvalCallback(env,
                          callback_on_new_best=stop_callback,
                          eval_freq=5000,
                          best_model_save_path=save_path,
                          verbose=1)
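
The interplay of the two callbacks can be sketched without any RL library: evaluate every `eval_freq` steps, remember the best mean reward, and stop once a new best reaches the threshold. The function `run_with_early_stop` and the list of evaluation results are hypothetical and only illustrate the control flow; they are not the library's implementation.

```python
def run_with_early_stop(eval_rewards, eval_freq, reward_threshold):
    """Return (timestep at which training stops, best mean reward seen).

    eval_rewards: mean reward measured at each successive evaluation.
    """
    best = float("-inf")
    for i, mean_reward in enumerate(eval_rewards, start=1):
        step = i * eval_freq
        if mean_reward > best:
            best = mean_reward  # the real callback would also save the model here
            if best >= reward_threshold:
                return step, best  # threshold reached: stop training early
    # Threshold never reached: training runs through every evaluation.
    return len(eval_rewards) * eval_freq, best


# Made-up evaluation results, one per 5000 training steps.
print(run_with_early_stop([50.0, 120.0, 200.0, 200.0], 5000, 200))  # (15000, 200.0)
```

With a threshold of 200 on CartPole-v0, training stops only once an evaluation reaches the maximum possible score.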

In [91]:
model=PPO('MlpPolicy',env,verbose=1,tensorboard_log=log_path)

Using cpu device


In [92]:
model.learn(total_timesteps=20000,callback=eval_callback)

Logging to Training\Logs\PPO_7
-----------------------------
| time/              |      |
|    fps             | 630  |
|    iterations      | 1    |
|    time_elapsed    | 3    |
|    total_timesteps | 2048 |
-----------------------------
------------------------------------------
| time/                   |              |
|    fps                  | 422          |
|    iterations           | 2            |
|    time_elapsed         | 9            |
|    total_timesteps      | 4096         |
| train/                  |              |
|    approx_kl            | 0.0065668626 |
|    clip_fraction        | 0.0803       |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.687       |
|    explained_variance   | -0.0051      |
|    learning_rate        | 0.0003       |
|    loss                 | 9.28         |
|    n_updates            | 10           |
|    policy_gradient_loss | -0.012       |
|    value_loss           | 63           |
----------------------------

<stable_baselines3.ppo.ppo.PPO at 0x2136bfc9760>

# Changing Policies

In [93]:
# four hidden layers of 128 units each, for both the policy (pi) and value function (vf) networks
net_arch=[dict(pi=[128,128,128,128],vf=[128,128,128,128])]
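
To get a feel for how much larger this architecture is, the sketch below counts the weights and biases of a fully connected network. It assumes 4 inputs and 2 outputs for the policy head (from the observation and action spaces shown earlier) and a single output for the value head. `mlp_param_count` is a hypothetical helper, and the 64-unit network is included only as a smaller point of comparison.

```python
def mlp_param_count(layer_sizes):
    """Count weights plus biases of a fully connected network."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


obs_dim, n_actions = 4, 2  # assumed from the CartPole spaces
hidden = [128, 128, 128, 128]

print(mlp_param_count([obs_dim, *hidden, n_actions]))  # 50434 (policy network)
print(mlp_param_count([obs_dim, *hidden, 1]))          # 50305 (value network)
print(mlp_param_count([obs_dim, 64, 64, n_actions]))   # 4610  (smaller network)
```

More parameters mean slower updates per iteration, which is consistent with the lower `fps` reported in the training log for the larger network.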

In [101]:
model=PPO('MlpPolicy',env,verbose=1,tensorboard_log=log_path,policy_kwargs={'net_arch':net_arch})

Using cpu device


In [95]:
model.learn(total_timesteps=20000,callback=eval_callback)

Logging to Training\Logs\PPO_8
-----------------------------
| time/              |      |
|    fps             | 455  |
|    iterations      | 1    |
|    time_elapsed    | 4    |
|    total_timesteps | 2048 |
-----------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 282         |
|    iterations           | 2           |
|    time_elapsed         | 14          |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.014702489 |
|    clip_fraction        | 0.228       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.682      |
|    explained_variance   | -0.00201    |
|    learning_rate        | 0.0003      |
|    loss                 | 2.3         |
|    n_updates            | 10          |
|    policy_gradient_loss | -0.0272     |
|    value_loss           | 19.5        |
-----------------------------------------
Eva

<stable_baselines3.ppo.ppo.PPO at 0x2136c190040>

# Using an Alternative Algorithm

In [96]:
from stable_baselines3 import DQN

In [102]:
model=DQN('MlpPolicy',env,verbose=1,tensorboard_log=log_path)

Using cpu device


In [103]:
model.learn(total_timesteps=20000)

Logging to Training\Logs\DQN_2
----------------------------------
| rollout/            |          |
|    exploration_rate | 0.962    |
| time/               |          |
|    episodes         | 4        |
|    fps              | 3234     |
|    time_elapsed     | 0        |
|    total_timesteps  | 81       |
----------------------------------
----------------------------------
| rollout/            |          |
|    exploration_rate | 0.912    |
| time/               |          |
|    episodes         | 8        |
|    fps              | 4072     |
|    time_elapsed     | 0        |
|    total_timesteps  | 186      |
----------------------------------
----------------------------------
| rollout/            |          |
|    exploration_rate | 0.862    |
| time/               |          |
|    episodes         | 12       |
|    fps              | 4219     |
|    time_elapsed     | 0        |
|    total_timesteps  | 290      |
----------------------------------
------------------------

----------------------------------
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 108      |
|    fps              | 3433     |
|    time_elapsed     | 0        |
|    total_timesteps  | 2642     |
----------------------------------
----------------------------------
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 112      |
|    fps              | 3396     |
|    time_elapsed     | 0        |
|    total_timesteps  | 2717     |
----------------------------------
----------------------------------
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 116      |
|    fps              | 3370     |
|    time_elapsed     | 0        |
|    total_timesteps  | 2784     |
----------------------------------
----------------------------------
| rollout/          

----------------------------------
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 216      |
|    fps              | 3095     |
|    time_elapsed     | 1        |
|    total_timesteps  | 4921     |
----------------------------------
----------------------------------
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 220      |
|    fps              | 3098     |
|    time_elapsed     | 1        |
|    total_timesteps  | 4980     |
----------------------------------
----------------------------------
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 224      |
|    fps              | 3109     |
|    time_elapsed     | 1        |
|    total_timesteps  | 5063     |
----------------------------------
----------------------------------
| rollout/          

----------------------------------
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 324      |
|    fps              | 2632     |
|    time_elapsed     | 2        |
|    total_timesteps  | 7176     |
----------------------------------
----------------------------------
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 328      |
|    fps              | 2606     |
|    time_elapsed     | 2        |
|    total_timesteps  | 7296     |
----------------------------------
----------------------------------
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 332      |
|    fps              | 2605     |
|    time_elapsed     | 2        |
|    total_timesteps  | 7386     |
----------------------------------
----------------------------------
| rollout/          

----------------------------------
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 432      |
|    fps              | 2548     |
|    time_elapsed     | 3        |
|    total_timesteps  | 9662     |
----------------------------------
----------------------------------
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 436      |
|    fps              | 2545     |
|    time_elapsed     | 3        |
|    total_timesteps  | 9729     |
----------------------------------
----------------------------------
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 440      |
|    fps              | 2538     |
|    time_elapsed     | 3        |
|    total_timesteps  | 9800     |
----------------------------------
----------------------------------
| rollout/          

----------------------------------
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 540      |
|    fps              | 2571     |
|    time_elapsed     | 4        |
|    total_timesteps  | 11902    |
----------------------------------
----------------------------------
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 544      |
|    fps              | 2576     |
|    time_elapsed     | 4        |
|    total_timesteps  | 11978    |
----------------------------------
----------------------------------
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 548      |
|    fps              | 2586     |
|    time_elapsed     | 4        |
|    total_timesteps  | 12085    |
----------------------------------
----------------------------------
| rollout/          

----------------------------------
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 648      |
|    fps              | 2700     |
|    time_elapsed     | 5        |
|    total_timesteps  | 14268    |
----------------------------------
----------------------------------
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 652      |
|    fps              | 2703     |
|    time_elapsed     | 5        |
|    total_timesteps  | 14328    |
----------------------------------
----------------------------------
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 656      |
|    fps              | 2714     |
|    time_elapsed     | 5        |
|    total_timesteps  | 14462    |
----------------------------------
----------------------------------
| rollout/          

----------------------------------
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 756      |
|    fps              | 2798     |
|    time_elapsed     | 6        |
|    total_timesteps  | 16926    |
----------------------------------
----------------------------------
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 760      |
|    fps              | 2798     |
|    time_elapsed     | 6        |
|    total_timesteps  | 17011    |
----------------------------------
----------------------------------
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 764      |
|    fps              | 2802     |
|    time_elapsed     | 6        |
|    total_timesteps  | 17075    |
----------------------------------
----------------------------------
| rollout/          

----------------------------------
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 864      |
|    fps              | 2850     |
|    time_elapsed     | 6        |
|    total_timesteps  | 19211    |
----------------------------------
----------------------------------
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 868      |
|    fps              | 2852     |
|    time_elapsed     | 6        |
|    total_timesteps  | 19316    |
----------------------------------
----------------------------------
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 872      |
|    fps              | 2855     |
|    time_elapsed     | 6        |
|    total_timesteps  | 19397    |
----------------------------------
----------------------------------
| rollout/          

<stable_baselines3.dqn.dqn.DQN at 0x2136c1e0250>
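
The `exploration_rate` column in the log above falls from near 1.0 to 0.05 and then stays there. A linear schedule reproduces the logged values, as the sketch below shows. The parameter values (start at 1.0, end at 0.05, decay over the first 10% of the 20,000 timesteps) are assumptions chosen because they match the log, and `linear_epsilon` is a hypothetical helper rather than the library's own function.

```python
def linear_epsilon(step, total_timesteps, initial_eps=1.0, final_eps=0.05, fraction=0.1):
    """Linearly decay epsilon over the first `fraction` of training, then hold it."""
    decay_steps = fraction * total_timesteps
    progress = min(step / decay_steps, 1.0)  # clamp once the decay phase is over
    return initial_eps + progress * (final_eps - initial_eps)


# Timesteps taken from the training log above.
for step in (81, 186, 290, 2642):
    print(step, round(linear_epsilon(step, 20000), 3))
# 81 0.962
# 186 0.912
# 290 0.862
# 2642 0.05
```

During the decay phase the agent mostly acts randomly to explore; once the rate reaches 0.05 it follows its learned Q-values 95% of the time.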