# Stable Baselines3 - Train on Atari Games

GitHub repo: [https://github.com/DLR-RM/stable-baselines3](https://github.com/DLR-RM/stable-baselines3)


[RL Baselines3 Zoo](https://github.com/DLR-RM/rl-baselines3-zoo) is a training framework for Reinforcement Learning (RL), using Stable Baselines3.

It provides scripts for training, evaluating agents, tuning hyperparameters, plotting results and recording videos.

Documentation is available online: [https://stable-baselines3.readthedocs.io/](https://stable-baselines3.readthedocs.io/)

## Install Dependencies and Stable Baselines Using Pip


```
pip install stable-baselines3[extra]
```

In [None]:
!pip install "stable-baselines3[extra]>=2.0.0a4"

Collecting stable-baselines3>=2.0.0a4 (from stable-baselines3[extra]>=2.0.0a4)
  Downloading stable_baselines3-2.4.0a7-py3-none-any.whl.metadata (5.1 kB)
Collecting gymnasium<0.30,>=0.28.1 (from stable-baselines3>=2.0.0a4->stable-baselines3[extra]>=2.0.0a4)
  Downloading gymnasium-0.29.1-py3-none-any.whl.metadata (10 kB)
Collecting shimmy~=1.3.0 (from shimmy[atari]~=1.3.0; extra == "extra"->stable-baselines3[extra]>=2.0.0a4)
  Downloading Shimmy-1.3.0-py3-none-any.whl.metadata (3.7 kB)
Collecting autorom~=0.6.1 (from autorom[accept-rom-license]~=0.6.1; extra == "extra"->stable-baselines3[extra]>=2.0.0a4)
  Downloading AutoROM-0.6.1-py3-none-any.whl.metadata (2.4 kB)
Collecting AutoROM.accept-rom-license (from autorom[accept-rom-license]~=0.6.1; extra == "extra"->stable-baselines3[extra]>=2.0.0a4)
  Downloading AutoROM.accept-rom-license-0.6.1.tar.gz (434 kB)

## Import policy, RL agent, ...

In [None]:
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.vec_env import VecFrameStack

In [None]:
from google.colab import files


## Training on Atari

We will use the Atari wrapper, which downsamples each frame and converts it to grayscale.

About Atari preprocessing: [Frame Skipping and Pre-Processing for Deep Q-Networks on Atari 2600 Games](https://danieltakeshi.github.io/2016/11/25/frame-skipping-and-preprocessing-for-deep-q-networks-on-atari-2600-games/)

![Pong](https://cdn-images-1.medium.com/max/800/1*UHYJE7lF8IDZS_U5SsAFUQ.gif)
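To make the two preprocessing steps concrete, here is a minimal pure-Python sketch of grayscale conversion followed by downsampling to 84x84. It is an illustration only: the helper names `to_grayscale` and `downsample` are hypothetical, the luma weights are the common ITU-R 601 values, and nearest-neighbour resizing is an assumption chosen for simplicity (the real wrapper relies on an image library and may interpolate differently).

```python
# Illustrative sketch of Atari-style preprocessing, not the library implementation.
# A frame is a nested list: frame[row][col] = (r, g, b), each value in 0..255.

def to_grayscale(frame):
    """Convert an RGB frame to grayscale with ITU-R 601 luma weights (assumed)."""
    return [[int(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
            for row in frame]

def downsample(gray, out_h, out_w):
    """Nearest-neighbour resize; a simplifying assumption for this sketch."""
    in_h, in_w = len(gray), len(gray[0])
    return [[gray[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
            for y in range(out_h)]

# Synthetic 210x160 frame (the native Atari resolution) filled with pure red.
frame = [[(255, 0, 0)] * 160 for _ in range(210)]
small = downsample(to_grayscale(frame), 84, 84)
print(len(small), len(small[0]), small[0][0])
```

The result has one channel and far fewer pixels than the raw frame, which is what keeps the CNN input small.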

In [None]:
# There already exists an environment generator that will make and wrap atari environments correctly.
env = make_atari_env("SpaceInvadersNoFrameskip-v4", n_envs=8, seed=0)
# Stack 4 frames
vec_env = VecFrameStack(env, n_stack=4)

In [None]:
import os
models_dir = "models/PPO"
if not os.path.exists(models_dir):
    os.makedirs(models_dir)

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
from stable_baselines3.common.callbacks import CheckpointCallback
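# save_freq counts callback calls (one per vectorized step), so with n_envs=8
# this saves a checkpoint every 250_000 * 8 = 2_000_000 timesteps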
checkpoint_callback = CheckpointCallback(save_freq=250000, save_path="./gdrive/MyDrive/models/PPO",
                                         name_prefix="ppo_spaceInvadersLvl4Train")

from collections import OrderedDict
hyperParamsDictForPpo = OrderedDict([('batch_size', 256), ('clip_range', 0.001), ('ent_coef', 0.01), ('env_wrapper', ['stable_baselines3.common.atari_wrappers.AtariWrapper']), ('frame_stack', 4), ('learning_rate', 0.0001), ('n_envs', 8), ('n_epochs', 4), ('n_steps', 128), ('n_timesteps', 100000), ('policy', 'CnnPolicy'), ('vf_coef', 0.5), ('normalize', False)])
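The relationship between `save_freq`, the number of parallel environments and the total timesteps between checkpoints is easy to get wrong, so here is a small arithmetic sketch. The helper names are hypothetical; the only assumption is that the callback is invoked once per vectorized step, so each call advances the timestep counter by `n_envs`.

```python
def timesteps_between_checkpoints(save_freq: int, n_envs: int) -> int:
    """Total environment timesteps that elapse between two checkpoints."""
    return save_freq * n_envs

def save_freq_for(target_timesteps: int, n_envs: int) -> int:
    """save_freq that gives roughly one checkpoint per target_timesteps."""
    return max(target_timesteps // n_envs, 1)

# Values used in this notebook: save_freq=250000 with 8 parallel environments.
print(timesteps_between_checkpoints(250_000, 8))
print(save_freq_for(2_000_000, 8))
```

With these numbers the checkpoint file names (which end in `_<num_timesteps>_steps`) should advance by about two million timesteps from one file to the next.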

In [None]:
# model = PPO("CnnPolicy", vec_env, verbose=1, tensorboard_log="./gdrive/MyDrive/models/ppo_spaceinvaders_tensorboard/",
#             batch_size=256,
#             clip_range=0.001,
#             ent_coef=0.01,
#             learning_rate=0.0001,
#             n_epochs=4,
#             n_steps=128,
#             vf_coef=0.5
#             )

Using cuda device
Wrapping the env in a VecTransposeImage.
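The hyperparameter dictionary defined earlier mixes PPO constructor arguments with keys that describe the environment setup. The sketch below separates the two groups and works out how much data each PPO update sees. The split in `NON_ALGO_KEYS` is an assumption inferred from the commented-out constructor call above, and the variable names are illustrative.

```python
from collections import OrderedDict

hyperparams = OrderedDict([
    ("batch_size", 256), ("clip_range", 0.001), ("ent_coef", 0.01),
    ("env_wrapper", ["stable_baselines3.common.atari_wrappers.AtariWrapper"]),
    ("frame_stack", 4), ("learning_rate", 0.0001), ("n_envs", 8),
    ("n_epochs", 4), ("n_steps", 128), ("n_timesteps", 100000),
    ("policy", "CnnPolicy"), ("vf_coef", 0.5), ("normalize", False),
])

# Keys handled by the environment/training setup rather than passed as
# keyword arguments to PPO (assumed split, see lead-in).
NON_ALGO_KEYS = {"env_wrapper", "frame_stack", "n_envs", "n_timesteps",
                 "policy", "normalize"}
ppo_kwargs = {k: v for k, v in hyperparams.items() if k not in NON_ALGO_KEYS}

# Each rollout collects n_steps transitions from every parallel environment.
rollout_size = hyperparams["n_envs"] * hyperparams["n_steps"]
minibatches_per_epoch = rollout_size // hyperparams["batch_size"]
gradient_steps_per_rollout = minibatches_per_epoch * hyperparams["n_epochs"]

print(sorted(ppo_kwargs))
print(rollout_size, minibatches_per_epoch, gradient_steps_per_rollout)
```

Under these settings each rollout holds 1024 transitions, split into 4 minibatches of 256 and reused for 4 epochs.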


In [None]:
model = PPO.load("/content/gdrive/MyDrive/models/PPO/ppo_spaceInvadersLvl4Train_33955088_steps", verbose=1, tensorboard_log="./gdrive/MyDrive/models/ppo_spaceinvaders_tensorboard/",
                 force_reset=False)
model.set_env(vec_env)

Wrapping the env in a VecTransposeImage.


In [None]:
from stable_baselines3.common.callbacks import EvalCallback, StopTrainingOnRewardThreshold

# Stop training when the model reaches the reward threshold
ptsPerInvaderCol = 60
colPerLevel = 12
ptsPerLevel = ptsPerInvaderCol*colPerLevel
levelsToBeat = 3
# ptsPerLevel*levelsToBeat = 2160, rounded up to 2200 for the threshold
reward_threshold = 2200
callback_on_best = StopTrainingOnRewardThreshold(reward_threshold=reward_threshold, verbose=1)
eval_callback = EvalCallback(vec_env, callback_on_new_best=callback_on_best, verbose=1)
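As a sanity check on the threshold, the sketch below recomputes the target score and shows the kind of comparison a reward-threshold stop performs. The function `should_continue` is a hypothetical stand-in, not the library's code. Note also that the `learn` call further down passes only `checkpoint_callback`, so as written the threshold does not stop that run.

```python
pts_per_invader_col = 60
cols_per_level = 12
levels_to_beat = 3

# Score needed to clear the requested number of levels.
target_score = pts_per_invader_col * cols_per_level * levels_to_beat

def should_continue(best_mean_reward: float, reward_threshold: float) -> bool:
    """Return False once the best mean evaluation reward reaches the threshold."""
    return best_mean_reward < reward_threshold

print(target_score)
print(should_continue(1800.0, 2200), should_continue(2390.0, 2200))
```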

In [None]:
timestepsPerLvl = 2000
gamesToPlay = 1e6
ttlTimeSteps = int(levelsToBeat*timestepsPerLvl*gamesToPlay)
print("ttlTimeSteps: ", ttlTimeSteps)

ttlTimeSteps:  6000000000
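Six billion timesteps is far more than a single Colab session can deliver. A rough wall-clock estimate makes this visible; the sketch assumes the throughput of about 555 frames per second reported in the training log later in this notebook, and the helper name is hypothetical.

```python
def estimated_hours(total_timesteps: int, fps: float) -> float:
    """Wall-clock hours needed to collect total_timesteps at a steady fps."""
    return total_timesteps / fps / 3600

# 30M timesteps is the budget actually passed to model.learn below.
hours_for_run = estimated_hours(30_000_000, 555)
# 6e9 timesteps is the ttlTimeSteps value computed above.
days_for_full_budget = estimated_hours(6_000_000_000, 555) / 24

print(round(hours_for_run, 1), "hours")
print(round(days_for_full_budget, 1), "days")
```

At that speed the 30M-step run takes roughly 15 hours, which agrees with the `time_elapsed` value in the log, while the full `ttlTimeSteps` budget would take on the order of months.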


In [None]:
#!rm ./models/PPO/ppo*

In [None]:
model.learn(total_timesteps=30000000, callback=[checkpoint_callback], reset_num_timesteps=False)

Streaming output truncated to the last 5000 lines.
|    value_loss           | 0.87         |
------------------------------------------
-------------------------------------------
| rollout/                |               |
|    ep_len_mean          | 9.79e+03      |
|    ep_rew_mean          | 2.39e+03      |
| time/                   |               |
|    fps                  | 555           |
|    iterations           | 29060         |
|    time_elapsed         | 53593         |
|    total_timesteps      | 63712528      |
| train/                  |               |
|    approx_kl            | 0.00065765623 |
|    clip_fraction        | 0.861         |
|    clip_range           | 0.001         |
|    entropy_loss         | -1.34         |
|    explained_variance   | 0.943         |
|    learning_rate        | 0.0001        |
|    loss                 | 0.0635        |
|    n_updates            | 266444        |
|    policy_gradient_loss | 0.0013        |
|    value_lo

<stable_baselines3.ppo.ppo.PPO at 0x7894d3d9bdf0>
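The log above reports a `clip_fraction` of 0.861 together with a `clip_range` of 0.001. The following sketch shows how such a fraction can be computed and why a very small clip range tends to push it up. It is a plain-Python illustration with made-up probability ratios, not the library's implementation, and both function names are hypothetical.

```python
def clip_fraction(ratios, clip_range):
    """Fraction of samples whose probability ratio lies outside [1 - eps, 1 + eps]."""
    outside = [abs(r - 1.0) > clip_range for r in ratios]
    return sum(outside) / len(outside)

def clipped_surrogate(ratio, advantage, clip_range):
    """PPO clipped surrogate objective for a single sample (to be maximised)."""
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    return min(ratio * advantage, clipped_ratio * advantage)

# Made-up ratios between the new and old policy probabilities.
ratios = [0.9985, 1.0004, 1.002, 0.9999, 1.0015]
print(clip_fraction(ratios, clip_range=0.001))
print(clip_fraction(ratios, clip_range=0.2))
print(clipped_surrogate(1.5, 1.0, clip_range=0.2))
```

With `clip_range=0.001` even tiny policy changes count as clipped, so a high clip fraction here mostly reflects how narrow the allowed band is.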

## Download / Upload Trained Agent and Continue Training

Save and download trained model

In [None]:
from google.colab import files

In [None]:
model.save("ppo_spaceInvadersLvl4Train")
files.download("ppo_spaceInvadersLvl4Train.zip")


In [None]:
import time
# sleep for 60 seconds to finish downloading
time.sleep(60)

In [None]:
drive.flush_and_unmount()

In [None]:
# terminate sessions
# free resources
from google.colab import runtime
runtime.unassign()

Upload the trained agent from your local machine

In [None]:
files.upload()

In [None]:
!du -h ppo*

Load the agent, and then you can continue training

In [None]:
trained_model = PPO.load("ppo_spaceInvaders", verbose=1)
env = make_atari_env('SpaceInvadersNoFrameskip-v4', n_envs=4, seed=0)
env = VecFrameStack(env, n_stack=4)
trained_model.set_env(env)

In [None]:
trained_model.learn(int(0.5e6))

In [None]:
trained_model.save("ppo_spaceInvaders_2")
files.download("ppo_spaceInvaders_2.zip")

In [None]:
!pip install renderlab

In [None]:
import gymnasium as gym
import renderlab as rl
from stable_baselines3.common.atari_wrappers import WarpFrame
import numpy as np

env = gym.make("SpaceInvadersNoFrameskip-v4", render_mode = "rgb_array")
env = rl.RenderFrame(env, "./output")

resizeFrame = WarpFrame(env, width=84, height=84)

observation, info = env.reset()

reward_list = []

while True:
  # rescale and duplicate the observation so it works for the vectorized model,
  # which expects 4 stacked greyscale images
  obsForAgent = resizeFrame.observation(observation)
  obsForAgent = obsForAgent.squeeze()
  obsForAgent = np.repeat(obsForAgent[np.newaxis, :, :], 4, axis=0)
  action = model.predict(obsForAgent)
  observation, reward, terminated, truncated, info = env.step(action[0])

  reward_list.append(reward)

  #if terminated or truncated:
  if terminated:
    break

env.play()
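The loop above feeds the agent four copies of the same frame, so the stacked observation carries no motion information, whereas the training environment stacks the four most recent frames. A rolling buffer is one way to close that gap. The sketch below is a hypothetical helper (`FrameStacker` is not a library class) and repeats the first frame at reset, which is an assumed convention; frames are represented by strings to keep the example short.

```python
from collections import deque

class FrameStacker:
    """Keep the most recent n frames, oldest first (illustrative helper)."""

    def __init__(self, n_stack=4):
        self.frames = deque(maxlen=n_stack)

    def reset(self, first_frame):
        # No history exists at episode start, so the first frame is repeated.
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return list(self.frames)

    def step(self, new_frame):
        # Appending to a full deque drops the oldest frame automatically.
        self.frames.append(new_frame)
        return list(self.frames)

stacker = FrameStacker(n_stack=4)
initial = stacker.reset("f0")
for name in ["f1", "f2"]:
    stacked = stacker.step(name)
print(initial)
print(stacked)
```

In the notebook's loop, the preprocessed 84x84 frame would take the place of the strings, and the stacked list would be converted to an array before calling `model.predict`.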



In [None]:
tot_reward = 0
for rew in reward_list:
  tot_reward += rew

print("Total reward: ", tot_reward)

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.vec_env import VecFrameStack

# There already exists an environment generator that will make and wrap atari environments correctly.
env = make_atari_env("SpaceInvadersNoFrameskip-v4", n_envs=8, seed=0)
# Stack 4 frames
vec_env = VecFrameStack(env, n_stack=4)


model = PPO.load("/content/gdrive/MyDrive/models/PPO/ppo_spaceInvadersLvl4Train_63955088_steps",
                 verbose=1,
                 force_reset=False)


In [None]:
observation = vec_env.reset()
model.set_env(vec_env)

reward_list = []
img_list = []
render_list = []
terminated_list = []
info_list = []
stepCounter = 0

while stepCounter<1000:
  action = model.predict(observation)
  observation, reward, terminated, info = vec_env.step(action[0])

  reward_list.append(reward)

  render_list.append(vec_env.render())
  img_list.append(vec_env.get_images())
  terminated_list.append(terminated)
  info_list.append(info)

  stepCounter += 1


Wrapping the env in a VecTransposeImage.



In [None]:
print(type(img_list[0]))
print(type(img_list[0][0]))
print(img_list[0][0].shape)

<class 'list'>
<class 'numpy.ndarray'>
(210, 160, 3)



In [None]:
envIdx = 0
epLen = 0
numTrues = 0
for term in terminated_list:
  epLen += 1
  if term[envIdx]:
    numTrues += 1
    print(epLen)
    epLen = 0

print(numTrues)
print(len(terminated_list) - numTrues)

728
1934
285
157
3906
508
598
846
8
9992
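The same bookkeeping can return the summed reward of each episode along with its length. The sketch below is a self-contained helper with small made-up inputs; `split_episodes` is a hypothetical name. One caveat, assuming the Atari wrapper's default settings are in effect: a lost life is reported as the end of an episode and step rewards are clipped, so these totals are per-life, clipped sums and not the on-screen game score.

```python
def split_episodes(rewards, dones, env_idx):
    """Return (length, total_reward) for each finished episode of one sub-environment."""
    episodes = []
    length, total = 0, 0.0
    for step_rewards, step_dones in zip(rewards, dones):
        length += 1
        total += step_rewards[env_idx]
        if step_dones[env_idx]:
            episodes.append((length, total))
            length, total = 0, 0.0
    return episodes

# Made-up data for two parallel environments over four steps.
rewards = [[0, 1], [5, 0], [0, 0], [10, 2]]
dones = [[False, False], [True, False], [False, False], [True, True]]
print(split_episodes(rewards, dones, env_idx=0))
print(split_episodes(rewards, dones, env_idx=1))
```

Calling it as `split_episodes(reward_list, terminated_list, envIdx)` would apply it to the lists collected above.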



In [None]:
info = info_list[5]
type(info)
len(info)
info


[{'lives': 3,
  'episode_frame_number': 54,
  'frame_number': 54,
  'TimeLimit.truncated': False},
 {'lives': 3,
  'episode_frame_number': 37,
  'frame_number': 37,
  'TimeLimit.truncated': False},
 {'lives': 3,
  'episode_frame_number': 51,
  'frame_number': 51,
  'TimeLimit.truncated': False},
 {'lives': 3,
  'episode_frame_number': 57,
  'frame_number': 57,
  'TimeLimit.truncated': False},
 {'lives': 3,
  'episode_frame_number': 33,
  'frame_number': 33,
  'TimeLimit.truncated': False},
 {'lives': 3,
  'episode_frame_number': 61,
  'frame_number': 61,
  'TimeLimit.truncated': False},
 {'lives': 3,
  'episode_frame_number': 53,
  'frame_number': 53,
  'TimeLimit.truncated': False},
 {'lives': 3,
  'episode_frame_number': 46,
  'frame_number': 46,
  'TimeLimit.truncated': False}]

In [None]:
siEnv = make_atari_env("SpaceInvadersNoFrameskip-v4", n_envs=8, seed=0)
dir(siEnv)
siEnv.metadata


{'render_modes': ['human', 'rgb_array'],
 'obs_types': {'grayscale', 'ram', 'rgb'}}

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# write whole video
# all games
import numpy as np
import cv2
from cv2 import cvtColor

envIdx = 1
size = 210, 160
fps = 61 # 32
out = cv2.VideoWriter('output61.mp4', cv2.VideoWriter_fourcc(*'mp4v'), fps, (size[1], size[0]), 1)
for idx in range( len(img_list) ):
    data = img_list[idx][envIdx]
    # out.write(data)
    out.write(cv2.cvtColor(data, cv2.COLOR_RGB2BGR))
out.release()
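The `cv2.COLOR_RGB2BGR` conversion in the cell above is there because the environment returns frames in RGB order while OpenCV's writer expects BGR; without it, colours come out swapped. The pure-Python sketch below shows what the conversion does to each pixel. The helper `rgb_to_bgr` is hypothetical and operates on nested lists for illustration only.

```python
def rgb_to_bgr(frame):
    """Reverse the channel order of every pixel in a nested-list RGB frame."""
    return [[(b, g, r) for (r, g, b) in row] for row in frame]

frame = [[(255, 0, 0), (0, 255, 0)],
         [(0, 0, 255), (10, 20, 30)]]
converted = rgb_to_bgr(frame)
print(converted)
```

Applying the swap twice returns the original frame, which is why writing unconverted frames only exchanges the red and blue channels and leaves green untouched.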

In [None]:
#########################################
# write each game of the selected sub-environment (envIdx) to a separate file

import numpy as np
import cv2

size = 210, 160
fps = 32
envIdx = 1
mp4Suffix = '.mp4'
filePrefix = 'spaceInvadersPpoAgent35MframesGame'
game = 1
gameScore = 0
'''
for termIdx in range( len(img_list) ):
    gameScore += reward_list[termIdx][envIdx]
    if terminated_list[termIdx][envIdx]:
      break
vidFileName = filePrefix + str(game) + 'Score' + str(gameScore) + mp4Suffix
'''
vidFileName = filePrefix + str(game) + mp4Suffix
out = cv2.VideoWriter(vidFileName, cv2.VideoWriter_fourcc(*'mp4v'), fps, (size[1], size[0]), 1)
for idx in range( len(img_list) ):
    data = img_list[idx][envIdx]
    gameScore += reward_list[idx][envIdx]
    # out.write(data)
    out.write(cv2.cvtColor(data, cv2.COLOR_RGB2BGR))
    if terminated_list[idx][envIdx]:
      out.release()
      game += 1
      gameScore = 0
      '''
      for termIdx in range(1, len(img_list) ):
        forwardIdx = idx + termIdx
        if forwardIdx>=len(img_list):
          break
        gameScore += reward_list[forwardIdx][envIdx]
        if terminated_list[forwardIdx][envIdx]:
          break
      # vidFileName = filePrefix + str(game) + 'Score' + str(gameScore) + mp4Suffix
      '''
      vidFileName = filePrefix + str(game) + mp4Suffix
      out = cv2.VideoWriter(vidFileName, cv2.VideoWriter_fourcc(*'mp4v'), fps, (size[1], size[0]), 1)


out.release()
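The commented-out blocks in the cell above try to look ahead for each game's score so it can go into the file name. One alternative is a first pass that plans every file (name, first frame, last frame) before any video is written. The sketch below shows that planning step on made-up data; `plan_video_files` and the `"game"` prefix are hypothetical, and integer rewards are assumed so the score formats cleanly.

```python
def plan_video_files(rewards, dones, env_idx, prefix, suffix=".mp4"):
    """Return (file_name, first_frame, last_frame) for every finished game."""
    plans = []
    start, score, game = 0, 0, 1
    for idx, (step_rewards, step_dones) in enumerate(zip(rewards, dones)):
        score += step_rewards[env_idx]
        if step_dones[env_idx]:
            name = f"{prefix}{game}Score{score}{suffix}"
            plans.append((name, start, idx))
            # The next game starts on the following frame with a fresh score.
            start, score, game = idx + 1, 0, game + 1
    return plans

# Made-up data for two parallel environments over four steps.
rewards = [[0, 1], [5, 0], [0, 0], [10, 2]]
dones = [[False, False], [True, False], [False, False], [True, True]]
plans = plan_video_files(rewards, dones, env_idx=0, prefix="game")
print(plans)
```

A second pass can then open one writer per planned entry and write frames `first_frame` through `last_frame`. Frames after the last finished game are left out of the plan.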

In [None]:
!rm /content/spaceInvadersPpoAgent35MframesGame*

In [None]:
size = 720*16//9, 720
duration = 2
fps = 25
out = cv2.VideoWriter('output.mp4', cv2.VideoWriter_fourcc(*'mp4v'), fps, (size[1], size[0]), False)
for _ in range(fps * duration):
    data = np.random.randint(0, 256, size, dtype='uint8')
    out.write(data)
out.release()

In [None]:
!pip install moviepy


In [None]:
from moviepy.editor import VideoFileClip, concatenate_videoclips



vidPrefix = "/content/spaceInvadersPpoAgent35MframesGame"
vidPrefix = "/content/"+filePrefix

clip_1 = VideoFileClip(vidPrefix + "7.mp4")
clip_2 = VideoFileClip(vidPrefix + "8.mp4")
clip_3 = VideoFileClip(vidPrefix + "9.mp4")
final_clip = concatenate_videoclips([clip_3,clip_1,clip_2])
# final_clip = concatenate_videoclips([clip_2, clip_1])
final_clip.write_videofile("game3.mp4")



Moviepy - Building video game3.mp4.
Moviepy - Writing video game3.mp4
Moviepy - Done !
Moviepy - video ready game3.mp4




In [None]:
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.vec_env import VecFrameStack

# There already exists an environment generator that will make and wrap atari environments correctly.
env = make_atari_env("SpaceInvadersNoFrameskip-v4", n_envs=8, seed=0)
# Stack 4 frames
vec_env = VecFrameStack(env, n_stack=4)


In [None]:
object_methods = [method_name for method_name in dir(vec_env)
                  if callable(getattr(vec_env, method_name))]

In [None]:
object_methods

['__class__',
 '__delattr__',
 '__dir__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '_get_indices',
 '_get_target_envs',
 '_obs_from_buf',
 '_reset_options',
 '_reset_seeds',
 '_save_obs',
 'close',
 'env_is_wrapped',
 'env_method',
 'get_attr',
 'get_images',
 'getattr_depth_check',
 'render',
 'reset',
 'seed',
 'set_attr',
 'set_options',
 'step',
 'step_async',
 'step_wait']
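Most of the names in that listing are dunder or private methods. A small filter keeps only the public callables, which is usually what matters when exploring an unfamiliar object. The sketch below uses a hypothetical `Dummy` class so it runs without any environment; `public_methods` is likewise an illustrative name.

```python
def public_methods(obj):
    """Names of callable attributes that do not start with an underscore."""
    return [name for name in dir(obj)
            if not name.startswith("_") and callable(getattr(obj, name))]

class Dummy:
    """Stand-in object with one data attribute and a few methods."""
    value = 3

    def step(self):
        return None

    def reset(self):
        return None

    def _private(self):
        return None

names = public_methods(Dummy())
print(names)
```

Calling `public_methods(vec_env)` should reduce the listing above to the entries from `close` through `step_wait`.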

In [None]:
object_methods = [method_name for method_name in dir(env)
                  if callable(getattr(env, method_name))]
object_methods

['__class__',
 '__delattr__',
 '__dir__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '_get_indices',
 '_get_target_envs',
 '_obs_from_buf',
 '_reset_options',
 '_reset_seeds',
 '_save_obs',
 'close',
 'env_is_wrapped',
 'env_method',
 'get_attr',
 'get_images',
 'getattr_depth_check',
 'render',
 'reset',
 'seed',
 'set_attr',
 'set_options',
 'step',
 'step_async',
 'step_wait']

In [None]:
env.get_attr('score')


AttributeError: 'AtariEnv' object has no attribute 'score'

In [None]:
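# possible fix for the color issue: frames are RGB, but OpenCV writes BGR
# (`out` is the cv2.VideoWriter created in the video cells above)
out.write(cv2.cvtColor(data, cv2.COLOR_RGB2BGR))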

OrderedDict([('batch_size', 256),
             ('clip_range', 0.001),
             ('ent_coef', 0.01),
             ('env_wrapper',
              ['stable_baselines3.common.atari_wrappers.AtariWrapper']),
             ('frame_stack', 4),
             ('learning_rate', 0.0001),
             ('n_envs', 8),
             ('n_epochs', 4),
             ('n_steps', 128),
             ('n_timesteps', 100000),
             ('policy', 'CnnPolicy'),
             ('vf_coef', 0.5),
             ('normalize', False)])