# Reinforcement Learning Model #

Installation

In [1]:
!pip install gymnasium #install Gymnasium (the package imported below)
!pip install stable-baselines3[extra] #install Stable Baselines3
!apt-get install swig cmake libopenmpi-dev zlib1g-dev #necessary for Lunar Lander
!pip install box2d-py==2.3.8 #necessary for Lunar Lander

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
libopenmpi-dev is already the newest version (4.1.2-2ubuntu1).
swig is already the newest version (4.0.2-1ubuntu1).
cmake is already the newest version (3.22.1-1ubuntu1.22.04.1).
zlib1g-dev is already the newest version (1:1.2.11.dfsg-2ubuntu9.2).
0 upgraded, 0 newly installed, 0 to remove and 24 not upgraded.


Libraries

In [2]:
import gymnasium as gym
import numpy as np
import stable_baselines3
from stable_baselines3 import A2C #RL model A2C
from stable_baselines3 import PPO #RL model PPO
from stable_baselines3 import DQN #RL model DQN
from stable_baselines3.common.evaluation import evaluate_policy #evaluate model/agent
from gymnasium.wrappers import RecordVideo #visualization
from IPython.display import Video #visualization

Environment

In [3]:
env = gym.make("LunarLander-v2") #create environment
observation = env.reset() #reset environment; returns an (observation, info) tuple
print("Observation:", observation) #initial observation of the environment
print("\nObservation Space:", env.observation_space) #range of values that an observation can take
print("\nObservation Space Shape:", env.observation_space.shape) #number of items in observation
print("\nObservation Sample:", env.observation_space) #prints the space itself; env.observation_space.sample() would return an example observation
print("\nAction Space:", env.action_space) #action space determines which RL model to use
print("\nSample Action:", env.action_space.sample()) #example of an action drawn from the action space
env.close() #close environment

Observation: (array([ 0.00558386,  1.4094867 ,  0.56556   , -0.06372489, -0.00646342,
       -0.12810776,  0.        ,  0.        ], dtype=float32), {})

Observation Space: Box([-1.5       -1.5       -5.        -5.        -3.1415927 -5.
 -0.        -0.       ], [1.5       1.5       5.        5.        3.1415927 5.        1.
 1.       ], (8,), float32)

Observation Space Shape: (8,)

Observation Sample: Box([-1.5       -1.5       -5.        -5.        -3.1415927 -5.
 -0.        -0.       ], [1.5       1.5       5.        5.        3.1415927 5.        1.
 1.       ], (8,), float32)

Action Space: Discrete(4)

Sample Action: 2
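
To make the printed values easier to read, the sketch below pairs each of the eight observation entries with the meaning given for that position in the Gymnasium Lunar Lander documentation, and does the same for the four discrete actions. The constant names `OBSERVATION_FIELDS` and `ACTION_MEANINGS` are hypothetical helpers, not part of Gymnasium; the numbers are copied from the output above.

```python
# Field and action descriptions follow the Gymnasium Lunar Lander documentation.
# The constant names are illustrative helpers, not Gymnasium API.
OBSERVATION_FIELDS = [
    "x position", "y position", "x velocity", "y velocity",
    "angle", "angular velocity", "left leg contact", "right leg contact",
]
ACTION_MEANINGS = {
    0: "do nothing",
    1: "fire left orientation engine",
    2: "fire main engine",
    3: "fire right orientation engine",
}

# Initial observation copied from the cell output above.
initial_observation = [0.00558386, 1.4094867, 0.56556, -0.06372489,
                       -0.00646342, -0.12810776, 0.0, 0.0]

# Pair each value with its description.
labeled = dict(zip(OBSERVATION_FIELDS, initial_observation))
for name, value in labeled.items():
    print(f"{name:>18}: {value: .4f}")

# The sampled action printed above was 2.
print("Sample action 2 ->", ACTION_MEANINGS[2])
```

Because the action space is `Discrete(4)`, all three imported algorithms (A2C, PPO, DQN) are applicable; DQN in particular only supports discrete actions.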


Random Agent

In [4]:
env = gym.make("LunarLander-v2", render_mode = 'rgb_array') #create environment
env = RecordVideo(env, 'video') #use gymnasium wrapper RecordVideo to record environment
observation, info = env.reset() #reset environment; reset() returns (observation, info)
env.start_video_recorder() #start recording

while True: #create random agent
  env.render() #assists in visualization
  action = env.action_space.sample() #create random action
  observation, reward, terminated, truncated, info = env.step(action) #update environment with action
  if terminated or truncated: #end loop when terminal or truncation state is reached
    break

env.close() #close environment

Moviepy - Building video /content/video/rl-video-episode-0.mp4.
Moviepy - Writing video /content/video/rl-video-episode-0.mp4





Moviepy - Done !
Moviepy - video ready /content/video/rl-video-episode-0.mp4




In [5]:
!ln -sf '/content/video/rl-video-episode-0.mp4' rl-video-episode-0.mp4 #get video file
Video("rl-video-episode-0.mp4", embed=True) #display video
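
The random-agent loop above stops when `step()` reports either `terminated` (the episode reached an end state) or `truncated` (a time limit cut it short). The sketch below replays the same loop shape without Box2D, using a toy stand-in class, `CountdownEnv`, that only mimics the return shapes of Gymnasium's `reset()` and `step()`. The class, its reward values, and `run_random_episode` are all hypothetical and exist only for illustration.

```python
import random

class CountdownEnv:
    """Hypothetical stand-in that mimics the Gymnasium reset()/step() return shapes."""

    def __init__(self, goal=5, max_steps=20):
        self.goal = goal
        self.max_steps = max_steps

    def reset(self, seed=None):
        self.rng = random.Random(seed)
        self.position = 0
        self.steps = 0
        return self.position, {}                    # (observation, info)

    def sample_action(self):
        return self.rng.choice([0, 1])              # random action: stay or move

    def step(self, action):
        self.position += action
        self.steps += 1
        terminated = self.position >= self.goal     # task finished
        truncated = self.steps >= self.max_steps    # time limit reached
        reward = 1.0 if terminated else -0.1
        return self.position, reward, terminated, truncated, {}

def run_random_episode(env, seed=0):
    """Run one episode with random actions and return its summary."""
    observation, info = env.reset(seed=seed)
    episode_return, length = 0.0, 0
    while True:
        action = env.sample_action()
        observation, reward, terminated, truncated, info = env.step(action)
        episode_return += reward                    # accumulate the episode return
        length += 1
        if terminated or truncated:                 # same stop condition as above
            return episode_return, length, terminated, truncated

episode_return, length, terminated, truncated = run_random_episode(CountdownEnv())
print(f"return={episode_return:.1f} length={length} "
      f"terminated={terminated} truncated={truncated}")
```

Summing `reward` inside the loop gives the episode return, which is the quantity the later evaluation cells average over many episodes.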

RL Model

In [6]:
model = A2C("MlpPolicy", env, verbose = 1) #create model using RL algorithm; env is still the RecordVideo-wrapped environment from In [4], so training also records videos

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


Evaluate Untrained Model/Agent

In [7]:
eval_env = gym.make("LunarLander-v2", render_mode = 'rgb_array') #create evaluation environment

mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes = 100) #determine the untrained agent's mean reward over 100 episodes
print("Mean Reward = %d +/- %d" % (mean_reward, std_reward))

eval_env.close() #close evaluation environment



Mean Reward = -400 +/- 185
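
The two numbers reported above are the mean and the standard deviation of the per-episode returns. The sketch below reproduces that summary with the standard library, assuming (as in the Stable Baselines3 source) that the spread is a population standard deviation. The five returns are made-up values for illustration, not measurements from this notebook.

```python
from statistics import fmean, pstdev

# Hypothetical per-episode returns, invented for illustration (not measured).
episode_returns = [-520.0, -310.5, -415.2, -198.7, -560.1]

mean_reward = fmean(episode_returns)
std_reward = pstdev(episode_returns)  # population standard deviation

# "%d" drops the fractional part, as in the print statement above.
print("Mean Reward = %d +/- %d" % (mean_reward, std_reward))
# An f-string keeps two decimals if more precision is wanted.
print(f"Mean Reward = {mean_reward:.2f} +/- {std_reward:.2f}")
```

A standard deviation this large relative to the mean indicates that individual episodes vary widely, so a comparison between two agents should look at both numbers.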


Train Model/Agent

In [8]:
model.learn(total_timesteps = 100_000) #train agent for 100_000 steps

Moviepy - Building video /content/video/rl-video-episode-1.mp4.
Moviepy - Writing video /content/video/rl-video-episode-1.mp4





Moviepy - Done !
Moviepy - video ready /content/video/rl-video-episode-1.mp4
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 91.2     |
|    ep_rew_mean        | -308     |
| time/                 |          |
|    fps                | 287      |
|    iterations         | 100      |
|    time_elapsed       | 1        |
|    total_timesteps    | 500      |
| train/                |          |
|    entropy_loss       | -1.21    |
|    explained_variance | -0.0209  |
|    learning_rate      | 0.0007   |
|    n_updates          | 99       |
|    policy_loss        | -14.2    |
|    value_loss         | 144      |
------------------------------------
Moviepy - Building video /content/video/rl-video-episode-8.mp4.
Moviepy - Writing video /content/video/rl-video-episode-8.mp4





Moviepy - Done !
Moviepy - video ready /content/video/rl-video-episode-8.mp4
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 87.6     |
|    ep_rew_mean        | -282     |
| time/                 |          |
|    fps                | 278      |
|    iterations         | 200      |
|    time_elapsed       | 3        |
|    total_timesteps    | 1000     |
| train/                |          |
|    entropy_loss       | -1.3     |
|    explained_variance | -0.0277  |
|    learning_rate      | 0.0007   |
|    n_updates          | 199      |
|    policy_loss        | -9.29    |
|    value_loss         | 41.9     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 93.5     |
|    ep_rew_mean        | -244     |
| time/                 |          |
|    fps                | 331      |
|    iterations         | 300      |
|    time_elapsed       | 4        |
...
Moviepy - Done !
Moviepy - video ready /content/video/rl-video-episode-27.mp4
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 160      |
|    ep_rew_mean        | -191     |
| time/                 |          |
|    fps                | 306      |
|    iterations         | 900      |
|    time_elapsed       | 14       |
|    total_timesteps    | 4500     |
| train/                |          |
|    entropy_loss       | -0.657   |
|    explained_variance | -0.0321  |
|    learning_rate      | 0.0007   |
|    n_updates          | 899      |
|    policy_loss        | 1.29     |
|    value_loss         | 29.2     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 162      |
|    ep_rew_mean        | -188     |
| time/                 |          |
|    fps                | 311      |
|    iterations         | 1000     |
|    time_elapsed       | 16       |
...
Moviepy - Done !
Moviepy - video ready /content/video/rl-video-episode-64.mp4
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 197      |
|    ep_rew_mean        | -139     |
| time/                 |          |
|    fps                | 344      |
|    iterations         | 2600     |
|    time_elapsed       | 37       |
|    total_timesteps    | 13000    |
| train/                |          |
|    entropy_loss       | -0.585   |
|    explained_variance | 0.617    |
|    learning_rate      | 0.0007   |
|    n_updates          | 2599     |
|    policy_loss        | -0.262   |
|    value_loss         | 1.78     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 197      |
|    ep_rew_mean        | -135     |
| time/                 |          |
|    fps                | 345      |
|    iterations         | 2700     |
|    time_elapsed       | 39       |
...
Moviepy - Done !
Moviepy - video ready /content/video/rl-video-episode-125.mp4
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 248      |
|    ep_rew_mean        | -81.8    |
| time/                 |          |
|    fps                | 369      |
|    iterations         | 5800     |
|    time_elapsed       | 78       |
|    total_timesteps    | 29000    |
| train/                |          |
|    entropy_loss       | -0.561   |
|    explained_variance | -0.846   |
|    learning_rate      | 0.0007   |
|    n_updates          | 5799     |
|    policy_loss        | -3.27    |
|    value_loss         | 41.6     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 249      |
|    ep_rew_mean        | -81.5    |
| time/                 |          |
|    fps                | 369      |
|    iterations         | 5900     |
|    time_elapsed       | 79       |
...
Moviepy - Done !
Moviepy - video ready /content/video/rl-video-episode-216.mp4
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 264      |
|    ep_rew_mean        | -62      |
| time/                 |          |
|    fps                | 380      |
|    iterations         | 10700    |
|    time_elapsed       | 140      |
|    total_timesteps    | 53500    |
| train/                |          |
|    entropy_loss       | -0.518   |
|    explained_variance | 0.156    |
|    learning_rate      | 0.0007   |
|    n_updates          | 10699    |
|    policy_loss        | 0.545    |
|    value_loss         | 1.6      |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 264      |
|    ep_rew_mean        | -60.4    |
| time/                 |          |
|    fps                | 381      |
|    iterations         | 10800    |
|    time_elapsed       | 141      |
...
Moviepy - Done !
Moviepy - video ready /content/video/rl-video-episode-343.mp4
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 334      |
|    ep_rew_mean        | 13.5     |
| time/                 |          |
|    fps                | 384      |
|    iterations         | 19000    |
|    time_elapsed       | 247      |
|    total_timesteps    | 95000    |
| train/                |          |
|    entropy_loss       | -0.379   |
|    explained_variance | 0.632    |
|    learning_rate      | 0.0007   |
|    n_updates          | 18999    |
|    policy_loss        | -1.68    |
|    value_loss         | 33.8     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 339      |
|    ep_rew_mean        | 16.9     |
| time/                 |          |
|    fps                | 383      |
|    iterations         | 19100    |
|    time_elapsed       | 248      |
...

<stable_baselines3.a2c.a2c.A2C at 0x7e22a16692a0>
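
The `iterations` and `total_timesteps` rows in the log above are tied together by the rollout length: each iteration collects `n_steps` transitions per environment. The sketch below checks that relation against the logged values, assuming the A2C default of `n_steps = 5` and a single environment. `timesteps_after` is a hypothetical helper, not a Stable Baselines3 function.

```python
N_STEPS = 5  # assumed: the Stable Baselines3 default n_steps for A2C
N_ENVS = 1   # a single environment inside the DummyVecEnv

def timesteps_after(iterations, n_steps=N_STEPS, n_envs=N_ENVS):
    """Timesteps collected after a given number of rollout iterations."""
    return iterations * n_steps * n_envs

# (iterations, total_timesteps) pairs copied from the training log above.
logged = [(100, 500), (200, 1000), (900, 4500), (2600, 13000),
          (5800, 29000), (10700, 53500), (19000, 95000)]
for iterations, total in logged:
    assert timesteps_after(iterations) == total

# Number of iterations needed to reach the requested training budget.
total_iterations = 100_000 // (N_STEPS * N_ENVS)
print("iterations needed for 100_000 timesteps:", total_iterations)
```

Under these assumptions, the last complete table in the log (19000 iterations, 95000 timesteps) sits at 95% of the training budget.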

Evaluate Trained Model/Agent

In [9]:
eval_env = gym.make("LunarLander-v2", render_mode = 'rgb_array') #create evaluation environment

mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes = 100) #determine trained agent's mean reward
print("Mean Reward = %d +/- %d" % (mean_reward, std_reward))

eval_env.close() #close evaluation environment

Mean Reward = -65 +/- 180


Visualize Model/Agent

In [10]:
eval_env = gym.make("LunarLander-v2", render_mode = 'rgb_array') #create evaluation environment
eval_env = RecordVideo(eval_env, 'video') #use gymnasium wrapper RecordVideo to record the evaluation environment
observation, info = eval_env.reset() #reset environment; reset() returns (observation, info)
eval_env.start_video_recorder() #start recording

episodes = 10 #number of times to repeat game
for episode in range(episodes): #loop through episodes
  observation, info = eval_env.reset() #reset environment
  while True: #create trained agent
    eval_env.render() #assists in visualization
    action, _ = model.predict(observation) #determine action based on model
    observation, reward, terminated, truncated, info = eval_env.step(action) #update environment with action
    if terminated or truncated: #end loop when terminal or truncation state is reached
      break

eval_env.close() #close evaluation environment

Moviepy - Building video /content/video/rl-video-episode-0.mp4.
Moviepy - Writing video /content/video/rl-video-episode-0.mp4





Moviepy - Done !
Moviepy - video ready /content/video/rl-video-episode-0.mp4
Moviepy - Building video /content/video/rl-video-episode-1.mp4.
Moviepy - Writing video /content/video/rl-video-episode-1.mp4





Moviepy - Done !
Moviepy - video ready /content/video/rl-video-episode-1.mp4
Moviepy - Building video /content/video/rl-video-episode-8.mp4.
Moviepy - Writing video /content/video/rl-video-episode-8.mp4





Moviepy - Done !
Moviepy - video ready /content/video/rl-video-episode-8.mp4


In [11]:
!ln -sf '/content/video/rl-video-episode-0.mp4' rl-video-episode-0.mp4 #get video file
Video("rl-video-episode-0.mp4", embed=True) #display video
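
The video files written in this notebook are numbered 0, 1, 8, 27, 64, 125, 216 and 343, which are all perfect cubes. That matches the capped cubic schedule that `RecordVideo` appears to fall back on when no `episode_trigger` is passed. The function below is a local re-implementation written for illustration; the name `records_episode` is hypothetical and the function is not imported from Gymnasium.

```python
def records_episode(episode_id: int) -> bool:
    """Sketch of a capped cubic schedule: perfect cubes below 1000, then every 1000th episode."""
    if episode_id < 1000:
        # Round the cube root, cube it again, and compare with the original id.
        return round(episode_id ** (1.0 / 3)) ** 3 == episode_id
    return episode_id % 1000 == 0

# Episodes that would be recorded among the first 400.
recorded = [episode for episode in range(400) if records_episode(episode)]
print(recorded)
```

To record every episode instead, `RecordVideo` accepts an `episode_trigger` callable, for example `episode_trigger=lambda episode_id: True`; this costs noticeably more time and disk space during long runs.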

References:
- Files:
  - cartpole-updated.ipynb
  - RLenvsetup.md
- Piazza:
  - Project Progress Discussion
  - Project Demo Clips
- Gymnasium:
  - General Info: https://gymnasium.farama.org/content/basic_usage/
  - Environment: https://gymnasium.farama.org/api/env/
  - Lunar Lander: https://gymnasium.farama.org/environments/box2d/lunar_lander/
- Stable Baselines3:
  - Tutorial: https://colab.research.google.com/github/Stable-Baselines-Team/rl-colab-notebooks/blob/sb3/stable_baselines_getting_started.ipynb#scrollTo=63M8mSKR-6Zt
  - Tutorial: https://www.youtube.com/watch?v=XbWhJdQgi7E
  - A2C: https://stable-baselines3.readthedocs.io/en/master/modules/a2c.html
  - PPO: https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html
  - DQN: https://stable-baselines3.readthedocs.io/en/master/modules/dqn.html
  - RL Algorithms: https://stable-baselines3.readthedocs.io/en/master/guide/algos.html
- Visualizations: https://stackoverflow.com/questions/18019477/how-can-i-play-a-local-video-in-my-ipython-notebook