Stable Baselines3 documentation: [https://stable-baselines3.readthedocs.io/en/master/](https://stable-baselines3.readthedocs.io/en/master/)

### 1. Import Dependencies

In [1]:
!pip install stable-baselines3[extra]



In [2]:
!pip install pyglet



### 2. Load Environment

In [3]:
import os
import gym   # builds environments and gives access to pre-existing ones
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv  # vectorizes environments so a model can train on several at once
from stable_baselines3.common.evaluation import evaluate_policy  # tests how well the model performs

**Box** - n-dimensional tensor of continuous values within a range
E.g. Box(0, 1, shape=(3, 3))

**Discrete** - set of items, indexed 0 to n-1
E.g. Discrete(3)

**Tuple** - tuple of other spaces, e.g. Box or Discrete
E.g. Tuple((Discrete(2), Box(0, 100, shape=(1,))))

**Dict** - dictionary of spaces, e.g. Box or Discrete
E.g. Dict({'height': Discrete(2), 'speed': Box(0, 100, shape=(1,))})

**MultiBinary** - vector of independent binary (0/1) values
E.g. MultiBinary(4)

**MultiDiscrete** - multiple discrete values, each with its own number of options
E.g. MultiDiscrete([5, 2, 2])
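
To make the space types above concrete, here is a minimal plain-Python sketch of what one sample from each kind of space looks like. The helper names (`sample_discrete`, `sample_box`, and so on) are hypothetical stand-ins written for illustration, not the `gym.spaces` API; in gym itself the equivalent call is `space.sample()`.

```python
import random

def sample_discrete(n):
    """One integer in {0, ..., n-1}, like Discrete(n)."""
    return random.randrange(n)

def sample_box(low, high, shape):
    """Nested lists of floats in [low, high], like Box(low, high, shape)."""
    if len(shape) == 1:
        return [random.uniform(low, high) for _ in range(shape[0])]
    return [sample_box(low, high, shape[1:]) for _ in range(shape[0])]

def sample_multi_binary(n):
    """n independent 0/1 flags, like MultiBinary(n); any number of them may be 1."""
    return [random.randint(0, 1) for _ in range(n)]

def sample_multi_discrete(nvec):
    """One integer per entry, the i-th drawn from {0, ..., nvec[i]-1}."""
    return [random.randrange(n) for n in nvec]

# Tuple and Dict simply bundle samples from other spaces.
tuple_sample = (sample_discrete(2), sample_box(0, 100, (1,)))
dict_sample = {'height': sample_discrete(2), 'speed': sample_box(0, 100, (1,))}

print(sample_discrete(3))
print(sample_box(0, 1, (3, 3)))
print(sample_multi_binary(4))
print(sample_multi_discrete([5, 2, 2]))
print(tuple_sample, dict_sample)
```

The values differ on every run because each helper draws at random, just as `space.sample()` does.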

In [4]:
environment_name = 'CartPole-v0'
env = gym.make(environment_name)

In [3]:
environment_name

'CartPole-v0'

In [4]:
episodes = 20
for episode in range(1, episodes+1):
    state = env.reset()
    done = False
    score = 0
    
    while not done:
        env.render()
        action = env.action_space.sample()
        n_state, reward, done, info = env.step(action)
        score += reward
    print(f'Episode: {episode} score: {score}')
env.close()

Episode: 1 score: 33.0
Episode: 2 score: 15.0
Episode: 3 score: 12.0
Episode: 4 score: 27.0
Episode: 5 score: 18.0
Episode: 6 score: 11.0
Episode: 7 score: 18.0
Episode: 8 score: 26.0
Episode: 9 score: 12.0
Episode: 10 score: 21.0
Episode: 11 score: 22.0
Episode: 12 score: 15.0
Episode: 13 score: 22.0
Episode: 14 score: 19.0
Episode: 15 score: 48.0
Episode: 16 score: 34.0
Episode: 17 score: 22.0
Episode: 18 score: 13.0
Episode: 19 score: 10.0
Episode: 20 score: 40.0
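
The scores above are what a purely random policy achieves, which gives a baseline to compare a trained agent against. A short worked calculation over the twenty printed scores (copied from the output above):

```python
from statistics import fmean

# Scores printed by the random-action loop above, episodes 1 to 20.
random_scores = [33, 15, 12, 27, 18, 11, 18, 26, 12, 21,
                 22, 15, 22, 19, 48, 34, 22, 13, 10, 40]

baseline_mean = fmean(random_scores)
print(f'mean={baseline_mean:.1f} min={min(random_scores)} max={max(random_scores)}')
```

A random agent averages roughly 22 steps per episode here, far below the score of 200 that the trained model reaches during evaluation later in the notebook.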


## Understanding the environment
[https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py](https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py)

### Action Space
The action is an `ndarray` with shape `(1,)` which can take values `{0, 1}`, indicating the direction of the fixed force the cart is pushed with.

| Num | Action                 |
|-----|------------------------|
| 0   | Push cart to the left  |
| 1   | Push cart to the right |

**Note**: The velocity that is reduced or increased by the applied force is not fixed; it depends on the angle the pole is pointing. The center of gravity of the pole varies the amount of energy needed to move the cart underneath it.

### Observation Space
The observation is an `ndarray` with shape `(4,)` with the values corresponding to the following positions and velocities:

| Num | Observation           | Min                 | Max               |
|-----|-----------------------|---------------------|-------------------|
| 0   | Cart Position         | -4.8                | 4.8               |
| 1   | Cart Velocity         | -Inf                | Inf               |
| 2   | Pole Angle            | ~ -0.418 rad (-24°) | ~ 0.418 rad (24°) |
| 3   | Pole Angular Velocity | -Inf                | Inf               |
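
The pole angle bounds in the table are given in radians, with the degree equivalent in parentheses. A quick check of that conversion, using the bound `0.41887903` that gym reports for this observation:

```python
import math

pole_angle_bound_rad = 0.41887903   # bound reported by the observation space
pole_angle_bound_deg = math.degrees(pole_angle_bound_rad)

print(round(pole_angle_bound_deg, 2))
```

The result rounds to 24 degrees, matching the table.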

In [5]:
env.action_space

Discrete(2)

In [48]:
env.action_space.sample()

1

In [49]:
env.observation_space

Box([-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38], [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38], (4,), float32)

In [50]:
env.observation_space.sample()

array([-4.6927400e+00, -2.8965313e+38, -6.6490263e-02, -2.9868164e+38],
      dtype=float32)
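
The `-Inf`/`Inf` bounds from the table show up here as `±3.4028235e+38`, which is the largest finite value a 32-bit float can hold. A small standard-library check of that number (no gym required):

```python
import struct

# Largest finite IEEE 754 single-precision value: (2 - 2**-23) * 2**127.
float32_max = (2 - 2**-23) * 2**127

# Round-trip through a 4-byte float to confirm it is exactly representable.
round_tripped = struct.unpack('f', struct.pack('f', float32_max))[0]

print(f'{float32_max:.7e}')
print(round_tripped == float32_max)
```

This is why sampled velocities such as `-2.8965313e+38` look enormous: the space is sampled across the whole representable range, not across physically plausible values.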

### 3. Train an RL Model

In [5]:
# create directories first
log_path = os.path.join('Training','Logs')

In [7]:
log_path

'Training\\Logs'
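
The comment above says to create the directories first. One way to do that from Python is `os.makedirs` with `exist_ok=True`. The sketch below builds the same `Training/Logs` layout inside a temporary directory; the temporary root and the name `demo_log_path` are assumptions made so the example has no side effects, and in the notebook you would call `os.makedirs` on `log_path` directly.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    demo_log_path = os.path.join(root, 'Training', 'Logs')
    os.makedirs(demo_log_path, exist_ok=True)   # creates Training/ and Training/Logs/
    os.makedirs(demo_log_path, exist_ok=True)   # calling it again is harmless
    created = os.path.isdir(demo_log_path)

print(created)
```

`os.path.join` uses the platform's separator, which is why the path above prints with backslashes on Windows.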

In [8]:
env = gym.make(environment_name)
env = DummyVecEnv([lambda:env])
model = PPO('MlpPolicy', env, verbose = 1, tensorboard_log=log_path)

Using cuda device


https://spinningup.openai.com/en/latest/spinningup/rl_intro.html

In [54]:
PPO??

In [9]:
model.learn(total_timesteps=20000)

Logging to Training\Logs\PPO_1
-----------------------------
| time/              |      |
|    fps             | 266  |
|    iterations      | 1    |
|    time_elapsed    | 7    |
|    total_timesteps | 2048 |
-----------------------------
------------------------------------------
| time/                   |              |
|    fps                  | 311          |
|    iterations           | 2            |
|    time_elapsed         | 13           |
|    total_timesteps      | 4096         |
| train/                  |              |
|    approx_kl            | 0.0082615465 |
|    clip_fraction        | 0.0864       |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.687       |
|    explained_variance   | -0.00191     |
|    learning_rate        | 0.0003       |
|    loss                 | 6.9          |
|    n_updates            | 10           |
|    policy_gradient_loss | -0.0124      |
|    value_loss           | 46.1         |
----------------------------

<stable_baselines3.ppo.ppo.PPO at 0x2b63424ad60>
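
Each `iterations` row in the log corresponds to one rollout of 2048 steps (2048 is PPO's default `n_steps` in Stable Baselines3, and this notebook uses a single environment). Because PPO only stops at rollout boundaries, a 20000-step budget is rounded up to a whole number of rollouts. A worked version of that arithmetic:

```python
import math

total_timesteps = 20000   # value passed to model.learn
n_steps = 2048            # assumed: SB3 PPO default rollout length per environment
n_envs = 1                # a single environment wrapped in DummyVecEnv

steps_per_iteration = n_steps * n_envs
iterations = math.ceil(total_timesteps / steps_per_iteration)
actual_timesteps = iterations * steps_per_iteration

print(iterations, actual_timesteps)
```

This gives 10 iterations and 20480 timesteps, which matches the final row of a full 20000-step run logged later in the notebook (`iterations 10`, `total_timesteps 20480`).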

### 4. Save and Reload Model

In [10]:
PPO_Path = os.path.join('Training','Saved Models','PPO_Model_Cartpole')

In [11]:
model.save(PPO_Path)



In [58]:
del model

In [59]:
PPO_Path

'Training\\Saved Models\\PPO_Model_Cartpole'

In [63]:
model = PPO.load(PPO_Path, env=env)

In [64]:
model.learn(total_timesteps=20000)

Logging to Training\Logs\PPO_7
-----------------------------
| time/              |      |
|    fps             | 538  |
|    iterations      | 1    |
|    time_elapsed    | 3    |
|    total_timesteps | 2048 |
-----------------------------


AssertionError: If capturable=False, state_steps should not be CUDA tensors.

### 5. Evaluation

In [65]:
from stable_baselines3.common.evaluation import evaluate_policy

In [12]:
evaluate_policy(model, env, n_eval_episodes=10, render=True)



(200.0, 0.0)
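
`evaluate_policy` returns a pair: the mean episode reward and its standard deviation over the evaluation episodes. A mean of 200 with a standard deviation of 0 therefore means all ten episodes scored exactly 200. A plain-Python sketch of that summary (the reward list is inferred from the output, since a zero standard deviation implies identical episodes):

```python
from statistics import fmean, pstdev

# Inferred per-episode rewards: ten evaluation episodes that all scored 200.
episode_rewards = [200.0] * 10

mean_reward = fmean(episode_rewards)
std_reward = pstdev(episode_rewards)   # population standard deviation

print((mean_reward, std_reward))
```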

In [13]:
env.close()

### 6. Test Model

In [14]:
episodes = 5
for episode in range(1, episodes+1):
    obs = env.reset()
    done = False
    score = 0
    
    while not done:
        env.render()
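        action, _ = model.predict(obs)
        # Update obs on every step; if the policy keeps seeing the reset observation,
        # scores stay near the random baseline (as in the output recorded below).
        obs, reward, done, info = env.step(action)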
        score += reward
    print(f'Episode: {episode} score: {score}')
env.close()

Episode: 1 score: [29.]
Episode: 2 score: [31.]
Episode: 3 score: [16.]
Episode: 4 score: [13.]
Episode: 5 score: [18.]


In [15]:
env.reset()

array([[-0.04757111,  0.02003796, -0.04715839,  0.01410156]],
      dtype=float32)

### 7. Viewing Logs in TensorBoard

In [32]:
training_log_path = os.path.join(log_path,'PPO_1')

In [43]:
training_log_path

'Training\\Logs\\PPO_1'

In [44]:
!tensorboard --logdir={training_log_path}
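
The flag and its value need to be written as `--logdir=PATH` or `--logdir PATH`. With spaces around `=`, the command line is split into separate tokens, so the parser most likely takes `=` itself as the value of `--logdir` and is left with the real path as a stray argument, which is when TensorBoard prints its usage text instead of starting. A small illustration with the standard library's `shlex` (forward slashes are used in the path purely to keep the example platform-neutral):

```python
import shlex

bad = shlex.split('tensorboard --logdir = Training/Logs/PPO_1')
good = shlex.split('tensorboard --logdir=Training/Logs/PPO_1')

print(bad)    # four tokens: the path is separated from the flag
print(good)   # two tokens: flag and path travel together
```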

In [40]:
!dir

 Volume in drive C has no label.
 Volume Serial Number is 62FE-D118

 Directory of C:\Users\Awesome\Downloads\ATT blog\Open AI Gym\baseline

17-Jul-22  12:10 AM    <DIR>          .
17-Jul-22  12:10 AM    <DIR>          ..
14-Jul-22  08:20 AM    <DIR>          .ipynb_checkpoints
17-Jul-22  12:10 AM            50,820 baseline_rl.ipynb
16-Jul-22  10:43 PM    <DIR>          Training
               1 File(s)         50,820 bytes
               4 Dir(s)  315,797,667,840 bytes free


### 8. Adding a callback to the training stage

In [1]:
from stable_baselines3.common.callbacks import EvalCallback, StopTrainingOnRewardThreshold

In [6]:
save_path = os.path.join('Training','Saved Models')

In [8]:
stop_callback = StopTrainingOnRewardThreshold(reward_threshold=200, verbose=1)
eval_callback = EvalCallback(env,
                            callback_on_new_best = stop_callback,
                            eval_freq=10000,
                            best_model_save_path=save_path,
                            verbose=1)

In [9]:
model = PPO('MlpPolicy',env, verbose=1, tensorboard_log=log_path)

Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


In [11]:
model.learn(total_timesteps=20000, callback=eval_callback)

Logging to Training\Logs\PPO_2
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 21.1     |
|    ep_rew_mean     | 21.1     |
| time/              |          |
|    fps             | 47       |
|    iterations      | 1        |
|    time_elapsed    | 42       |
|    total_timesteps | 2048     |
---------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 27.1        |
|    ep_rew_mean          | 27.1        |
| time/                   |             |
|    fps                  | 83          |
|    iterations           | 2           |
|    time_elapsed         | 49          |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.008520352 |
|    clip_fraction        | 0.0875      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.687      |
|    explained_variance   | -0.00147    |



Eval num_timesteps=10000, episode_reward=200.00 +/- 0.00
Episode length: 200.00 +/- 0.00
-----------------------------------------
| eval/                   |             |
|    mean_ep_length       | 200         |
|    mean_reward          | 200         |
| time/                   |             |
|    total_timesteps      | 10000       |
| train/                  |             |
|    approx_kl            | 0.007378431 |
|    clip_fraction        | 0.0634      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.607      |
|    explained_variance   | 0.308       |
|    learning_rate        | 0.0003      |
|    loss                 | 23.1        |
|    n_updates            | 40          |
|    policy_gradient_loss | -0.0151     |
|    value_loss           | 61.4        |
-----------------------------------------
New best mean reward!
Stopping training because the mean reward 200.00  is above the threshold 200


<stable_baselines3.ppo.ppo.PPO at 0x1f2bdc4a7f0>
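
The log shows the two callbacks working together: `EvalCallback` evaluates the agent every `eval_freq` steps, and whenever it finds a new best mean reward it calls `StopTrainingOnRewardThreshold`, which ends training once that reward reaches the threshold. The sketch below mimics this control flow in plain Python; `steps_until_stop` and the second reward sequence are hypothetical and only illustrate the logic, they are not part of Stable Baselines3.

```python
def steps_until_stop(eval_rewards, eval_freq, reward_threshold):
    """Return the timestep at which training would stop, or None if it never does.

    eval_rewards[i] is the mean reward measured at the (i+1)-th evaluation.
    """
    best_mean_reward = float('-inf')
    for i, mean_reward in enumerate(eval_rewards, start=1):
        if mean_reward > best_mean_reward:          # "New best mean reward!"
            best_mean_reward = mean_reward
            if best_mean_reward >= reward_threshold:
                return i * eval_freq                # stop training here
    return None

# As in the run above: the first evaluation at 10000 steps already scores 200.
print(steps_until_stop([200.0], eval_freq=10000, reward_threshold=200))
# A made-up slower run that only reaches the threshold at the third evaluation.
print(steps_until_stop([120.0, 180.0, 200.0], eval_freq=10000, reward_threshold=200))
```

Because the threshold check only runs when an evaluation happens, a smaller `eval_freq` lets training stop sooner, at the cost of more time spent evaluating.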

### 9. Changing Policies

In [14]:
net_arch = [dict(pi=[128,128,128,128], vf=[128,128,128,128])]

#### Define new neural network

In [15]:
model = PPO('MlpPolicy', env, verbose=1, tensorboard_log=log_path, policy_kwargs={'net_arch': net_arch})

Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
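
The `net_arch` above asks for four hidden layers of 128 units for both the policy (`pi`) and the value function (`vf`). To get a feel for how much larger that is than a two-layer, 64-unit network, the sketch below counts the weights and biases of a fully connected network. It assumes CartPole's 4-dimensional observation as input, 2 outputs for the policy and 1 for the value function. `mlp_param_count` is a hypothetical helper, and the 64-unit comparison is an assumption about the default architecture, so treat the numbers as an illustration rather than an exact dump of the model.

```python
def mlp_param_count(layer_sizes):
    """Weights plus biases of a fully connected network with the given layer sizes."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

obs_dim, n_actions = 4, 2   # CartPole: 4 observation values, 2 discrete actions

custom_pi = mlp_param_count([obs_dim, 128, 128, 128, 128, n_actions])
custom_vf = mlp_param_count([obs_dim, 128, 128, 128, 128, 1])
small_pi = mlp_param_count([obs_dim, 64, 64, n_actions])

print(custom_pi, custom_vf, small_pi)
```

Under these assumptions the custom policy network has about eleven times as many parameters as the small one, which is a lot of capacity for a task as simple as CartPole.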


In [16]:
model.learn(total_timesteps=20000, callback=eval_callback)

Logging to Training\Logs\PPO_3
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 22.1     |
|    ep_rew_mean     | 22.1     |
| time/              |          |
|    fps             | 428      |
|    iterations      | 1        |
|    time_elapsed    | 4        |
|    total_timesteps | 2048     |
---------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 29.3        |
|    ep_rew_mean          | 29.3        |
| time/                   |             |
|    fps                  | 329         |
|    iterations           | 2           |
|    time_elapsed         | 12          |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.015458845 |
|    clip_fraction        | 0.22        |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.681      |
|    explained_variance   | -0.000876   |

---------------------------------
| rollout/           |          |
|    ep_len_mean     | 156      |
|    ep_rew_mean     | 156      |
| time/              |          |
|    fps             | 282      |
|    iterations      | 10       |
|    time_elapsed    | 72       |
|    total_timesteps | 20480    |
---------------------------------


<stable_baselines3.ppo.ppo.PPO at 0x1f2d03dc940>

### 10. Using a Different Algorithm

In [17]:
from stable_baselines3 import DQN

In [20]:
model = DQN('MlpPolicy', env, verbose=1, tensorboard_log = log_path)

Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


In [21]:
model.learn(total_timesteps=20000)

Logging to Training\Logs\DQN_1
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 21.5     |
|    ep_rew_mean      | 21.5     |
|    exploration_rate | 0.959    |
| time/               |          |
|    episodes         | 4        |
|    fps              | 1329     |
|    time_elapsed     | 0        |
|    total_timesteps  | 86       |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 19.9     |
|    ep_rew_mean      | 19.9     |
|    exploration_rate | 0.924    |
| time/               |          |
|    episodes         | 8        |
|    fps              | 2013     |
|    time_elapsed     | 0        |
|    total_timesteps  | 159      |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 20.3     |
|    ep_rew_mean      | 20.3     |
|    exploration_rate | 0.884    |
| time/               | 

----------------------------------
| rollout/            |          |
|    ep_len_mean      | 22.9     |
|    ep_rew_mean      | 22.9     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 92       |
|    fps              | 7154     |
|    time_elapsed     | 0        |
|    total_timesteps  | 2105     |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 23.1     |
|    ep_rew_mean      | 23.1     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 96       |
|    fps              | 7257     |
|    time_elapsed     | 0        |
|    total_timesteps  | 2222     |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 23       |
|    ep_rew_mean      | 23       |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes       

----------------------------------
| rollout/            |          |
|    ep_len_mean      | 24.5     |
|    ep_rew_mean      | 24.5     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 180      |
|    fps              | 8118     |
|    time_elapsed     | 0        |
|    total_timesteps  | 4224     |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 24.3     |
|    ep_rew_mean      | 24.3     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 184      |
|    fps              | 8144     |
|    time_elapsed     | 0        |
|    total_timesteps  | 4327     |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 24.7     |
|    ep_rew_mean      | 24.7     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes       

----------------------------------
| rollout/            |          |
|    ep_len_mean      | 23.1     |
|    ep_rew_mean      | 23.1     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 268      |
|    fps              | 8375     |
|    time_elapsed     | 0        |
|    total_timesteps  | 6189     |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 22.9     |
|    ep_rew_mean      | 22.9     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 272      |
|    fps              | 8376     |
|    time_elapsed     | 0        |
|    total_timesteps  | 6256     |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 22.8     |
|    ep_rew_mean      | 22.8     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes       

----------------------------------
| rollout/            |          |
|    ep_len_mean      | 21.1     |
|    ep_rew_mean      | 21.1     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 356      |
|    fps              | 8545     |
|    time_elapsed     | 0        |
|    total_timesteps  | 8078     |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 21.3     |
|    ep_rew_mean      | 21.3     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 360      |
|    fps              | 8556     |
|    time_elapsed     | 0        |
|    total_timesteps  | 8174     |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 21.9     |
|    ep_rew_mean      | 21.9     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes       

----------------------------------
| rollout/            |          |
|    ep_len_mean      | 23       |
|    ep_rew_mean      | 23       |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 444      |
|    fps              | 7928     |
|    time_elapsed     | 1        |
|    total_timesteps  | 10059    |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 22.8     |
|    ep_rew_mean      | 22.8     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 448      |
|    fps              | 7884     |
|    time_elapsed     | 1        |
|    total_timesteps  | 10145    |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 22.6     |
|    ep_rew_mean      | 22.6     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes       

----------------------------------
| rollout/            |          |
|    ep_len_mean      | 20.8     |
|    ep_rew_mean      | 20.8     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 532      |
|    fps              | 8035     |
|    time_elapsed     | 1        |
|    total_timesteps  | 11905    |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 20.7     |
|    ep_rew_mean      | 20.7     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 536      |
|    fps              | 8037     |
|    time_elapsed     | 1        |
|    total_timesteps  | 11964    |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 20.8     |
|    ep_rew_mean      | 20.8     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes       

----------------------------------
| rollout/            |          |
|    ep_len_mean      | 20.3     |
|    ep_rew_mean      | 20.3     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 620      |
|    fps              | 8089     |
|    time_elapsed     | 1        |
|    total_timesteps  | 13676    |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 20.6     |
|    ep_rew_mean      | 20.6     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 624      |
|    fps              | 8099     |
|    time_elapsed     | 1        |
|    total_timesteps  | 13789    |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 21       |
|    ep_rew_mean      | 21       |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes       

----------------------------------
| rollout/            |          |
|    ep_len_mean      | 21.8     |
|    ep_rew_mean      | 21.8     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 708      |
|    fps              | 8200     |
|    time_elapsed     | 1        |
|    total_timesteps  | 15648    |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 22       |
|    ep_rew_mean      | 22       |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 712      |
|    fps              | 8199     |
|    time_elapsed     | 1        |
|    total_timesteps  | 15720    |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 22.4     |
|    ep_rew_mean      | 22.4     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes       

----------------------------------
| rollout/            |          |
|    ep_len_mean      | 23.3     |
|    ep_rew_mean      | 23.3     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 796      |
|    fps              | 8309     |
|    time_elapsed     | 2        |
|    total_timesteps  | 17684    |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 23.3     |
|    ep_rew_mean      | 23.3     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 800      |
|    fps              | 8289     |
|    time_elapsed     | 2        |
|    total_timesteps  | 17766    |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 23.2     |
|    ep_rew_mean      | 23.2     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes       

----------------------------------
| rollout/            |          |
|    ep_len_mean      | 22.2     |
|    ep_rew_mean      | 22.2     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 884      |
|    fps              | 8346     |
|    time_elapsed     | 2        |
|    total_timesteps  | 19618    |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 22.6     |
|    ep_rew_mean      | 22.6     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 888      |
|    fps              | 8351     |
|    time_elapsed     | 2        |
|    total_timesteps  | 19728    |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 22.6     |
|    ep_rew_mean      | 22.6     |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes       

<stable_baselines3.dqn.dqn.DQN at 0x1f284352100>
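
The `exploration_rate` column in the DQN log follows a linear epsilon schedule. Assuming DQN's defaults in Stable Baselines3 (`exploration_initial_eps=1.0`, `exploration_final_eps=0.05`, `exploration_fraction=0.1`), epsilon falls from 1.0 to 0.05 over the first 10% of the 20000 training steps and then stays there. The sketch below reproduces the logged values; `exploration_rate` is a hypothetical helper written for this check, not the library's implementation.

```python
def exploration_rate(step, total_timesteps=20000,
                     initial_eps=1.0, final_eps=0.05, fraction=0.1):
    """Linearly anneal epsilon over the first `fraction` of training, then hold it."""
    progress = step / total_timesteps
    if progress > fraction:
        return final_eps
    return initial_eps + progress * (final_eps - initial_eps) / fraction

# Timesteps taken from three rows of the log above.
for step in (86, 159, 2105):
    print(step, round(exploration_rate(step), 3))
```

The three results (0.959, 0.924 and 0.05) agree with the `exploration_rate` values logged at 86, 159 and 2105 timesteps.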