# Stable Baselines3 - Train on Atari Games

GitHub repo: [https://github.com/DLR-RM/stable-baselines3](https://github.com/DLR-RM/stable-baselines3)


[RL Baselines3 Zoo](https://github.com/DLR-RM/rl-baselines3-zoo) is a training framework for Reinforcement Learning (RL), using Stable Baselines3.

It provides scripts for training, evaluating agents, tuning hyperparameters, plotting results and recording videos.

Documentation is available online: [https://stable-baselines3.readthedocs.io/](https://stable-baselines3.readthedocs.io/)

## Install Dependencies and Stable Baselines Using Pip


```
pip install stable-baselines3[extra]
```

In [1]:
!pip install "stable-baselines3[extra]>=2.0.0a4"

Collecting stable-baselines3>=2.0.0a4 (from stable-baselines3[extra]>=2.0.0a4)
  Downloading stable_baselines3-2.4.0a7-py3-none-any.whl.metadata (5.1 kB)
Collecting gymnasium<0.30,>=0.28.1 (from stable-baselines3>=2.0.0a4->stable-baselines3[extra]>=2.0.0a4)
  Downloading gymnasium-0.29.1-py3-none-any.whl.metadata (10 kB)
Collecting shimmy~=1.3.0 (from shimmy[atari]~=1.3.0; extra == "extra"->stable-baselines3[extra]>=2.0.0a4)
  Downloading Shimmy-1.3.0-py3-none-any.whl.metadata (3.7 kB)
Collecting autorom~=0.6.1 (from autorom[accept-rom-license]~=0.6.1; extra == "extra"->stable-baselines3[extra]>=2.0.0a4)
  Downloading AutoROM-0.6.1-py3-none-any.whl.metadata (2.4 kB)
Collecting AutoROM.accept-rom-license (from autorom[accept-rom-license]~=0.6.1; extra == "extra"->stable-baselines3[extra]>=2.0.0a4)
  Downloading AutoROM.accept-rom-license-0.6.1.tar.gz (434 kB)

## Import policy, RL agent, ...

In [2]:
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.vec_env import VecFrameStack

In [3]:
from google.colab import files



## Training on Atari

We use the Atari wrapper, which downsamples each frame and converts it to grayscale.

About Atari preprocessing: [Frame Skipping and Pre-Processing for Deep Q-Networks on Atari 2600 Games](https://danieltakeshi.github.io/2016/11/25/frame-skipping-and-preprocessing-for-deep-q-networks-on-atari-2600-games/)

![Pong](https://cdn-images-1.medium.com/max/800/1*UHYJE7lF8IDZS_U5SsAFUQ.gif)
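
To make the two preprocessing steps concrete, here is a minimal NumPy sketch of grayscale conversion followed by downsampling to 84x84. It is an illustration only: `preprocess_frame` is a hypothetical helper, the nearest-neighbour resize is a simplification (my understanding is that the SB3 wrapper resizes with OpenCV, so pixel values will differ), and the wrapper applies further steps that are not shown here.

```python
import numpy as np

def preprocess_frame(frame: np.ndarray, size: int = 84) -> np.ndarray:
    """Convert an RGB frame (H, W, 3) to a grayscale (size, size, 1) uint8 image."""
    # Luminance-weighted grayscale conversion (ITU-R BT.601 weights).
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    gray = frame.astype(np.float32) @ weights
    # Nearest-neighbour downsampling: pick evenly spaced rows and columns.
    rows = np.linspace(0, gray.shape[0] - 1, size).round().astype(int)
    cols = np.linspace(0, gray.shape[1] - 1, size).round().astype(int)
    small = gray[np.ix_(rows, cols)]
    return small.round().clip(0, 255).astype(np.uint8)[..., None]

# A random stand-in for a raw 210x160 RGB Atari frame.
rng = np.random.default_rng(0)
raw_frame = rng.integers(0, 256, size=(210, 160, 3), dtype=np.uint8)
obs = preprocess_frame(raw_frame)
print(raw_frame.shape, "->", obs.shape, obs.dtype)
```

The single grayscale channel is what makes frame stacking cheap: four stacked frames cost only slightly more memory than one RGB frame at the same resolution.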

In [4]:
# There already exists an environment generator that will make and wrap atari environments correctly.
env = make_atari_env("SpaceInvadersNoFrameskip-v4", n_envs=8, seed=0)
# Stack 4 frames
vec_env = VecFrameStack(env, n_stack=4)
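
As a rough illustration of what stacking 4 frames does to the observation, the sketch below keeps the most recent frames along the channel axis for a single environment. `FrameStackSketch` is a hypothetical class, not the SB3 implementation: `VecFrameStack` works on batched observations from all 8 environments, and the zero-filled history after a reset is an assumption about its behaviour.

```python
import numpy as np

class FrameStackSketch:
    """Keep the last n_stack single-channel frames stacked on the channel axis."""

    def __init__(self, n_stack: int, height: int = 84, width: int = 84):
        self.stacked = np.zeros((height, width, n_stack), dtype=np.uint8)

    def reset(self, frame: np.ndarray) -> np.ndarray:
        # Clear the history, then place the first frame in the newest slot.
        self.stacked[...] = 0
        self.stacked[..., -1:] = frame
        return self.stacked.copy()

    def step(self, frame: np.ndarray) -> np.ndarray:
        # Shift older frames towards index 0 and write the new frame last.
        self.stacked = np.roll(self.stacked, shift=-1, axis=-1)
        self.stacked[..., -1:] = frame
        return self.stacked.copy()

stack = FrameStackSketch(n_stack=4)
# Constant-valued frames make it easy to see the ordering inside the stack.
frames = [np.full((84, 84, 1), value, dtype=np.uint8) for value in (10, 20, 30)]
obs = stack.reset(frames[0])
for frame in frames[1:]:
    obs = stack.step(frame)
print(obs.shape, obs[0, 0].tolist())  # (84, 84, 4) [0, 10, 20, 30]
```

Stacking consecutive frames lets a feed-forward CNN policy infer motion, such as the direction of a projectile, which a single frame cannot convey.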

In [5]:
import os
models_dir = "models/PPO"
if not os.path.exists(models_dir):
    os.makedirs(models_dir)

In [6]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [7]:
from stable_baselines3.common.callbacks import CheckpointCallback
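# save_freq counts callback calls, and each call steps all 8 environments,
# so 250_000 calls gives a checkpoint every 2_000_000 total timesteps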
checkpoint_callback = CheckpointCallback(save_freq=250000, save_path="./gdrive/MyDrive/models/PPO",
                                         name_prefix="ppo_spaceInvadersPpoDefaults")

from collections import OrderedDict

# Hyperparameter set; several keys (e.g. env_wrapper, n_timesteps) are not PPO constructor arguments
hyperParamsDictForPpo = OrderedDict([
    ('batch_size', 256), ('clip_range', 0.001), ('ent_coef', 0.01),
    ('env_wrapper', ['stable_baselines3.common.atari_wrappers.AtariWrapper']),
    ('frame_stack', 4), ('learning_rate', 0.0001), ('n_envs', 8), ('n_epochs', 4),
    ('n_steps', 128), ('n_timesteps', 100000), ('policy', 'CnnPolicy'),
    ('vf_coef', 0.5), ('normalize', False),
])
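
The relation between `save_freq` and the number of parallel environments is easy to get wrong, so here is a small arithmetic sketch. Both helper functions are hypothetical names introduced for illustration; the only assumption about the library is that `save_freq` is counted in callback calls, with each call advancing every environment by one step.

```python
def timesteps_between_checkpoints(save_freq: int, n_envs: int) -> int:
    """Total environment steps collected between two checkpoint saves."""
    # Each callback call corresponds to one step in every parallel environment.
    return save_freq * n_envs

def save_freq_for(target_timesteps: int, n_envs: int) -> int:
    """Callback calls needed to save roughly every `target_timesteps` total steps."""
    return max(target_timesteps // n_envs, 1)

n_envs = 8  # matches make_atari_env(..., n_envs=8)
print(timesteps_between_checkpoints(250_000, n_envs))  # 2000000
print(save_freq_for(2_000_000, n_envs))                # 250000
```

If the number of environments changes, recomputing `save_freq` this way keeps the checkpoint spacing in total timesteps constant.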

In [None]:
# model = PPO("CnnPolicy", vec_env, verbose=1, tensorboard_log="./gdrive/MyDrive/models/ppo_spaceinvaders_tensorboard/",
#             batch_size=256,
#             clip_range=0.001,
#             ent_coef=0.01,
#             learning_rate=0.0001,
#             n_epochs=4,
#             n_steps=128,
#             vf_coef=0.5
#             )

Using cuda device
Wrapping the env in a VecTransposeImage.
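
The hyperparameter dictionary mixes algorithm settings with environment setup options, while the commented-out `PPO(...)` call uses only seven of its keys. One possible way to avoid copying values by hand is to split the dictionary, as in this sketch. The names `PPO_KWARG_NAMES`, `zoo_style_params`, `ppo_kwargs`, and `setup_params` are hypothetical, and the key list is taken from the commented-out call rather than from the library.

```python
# Keys that the commented-out PPO(...) call passes to the constructor.
PPO_KWARG_NAMES = (
    "batch_size", "clip_range", "ent_coef", "learning_rate",
    "n_epochs", "n_steps", "vf_coef",
)

# Same values as hyperParamsDictForPpo, written as a plain dict.
zoo_style_params = {
    "batch_size": 256, "clip_range": 0.001, "ent_coef": 0.01,
    "env_wrapper": ["stable_baselines3.common.atari_wrappers.AtariWrapper"],
    "frame_stack": 4, "learning_rate": 0.0001, "n_envs": 8, "n_epochs": 4,
    "n_steps": 128, "n_timesteps": 100000, "policy": "CnnPolicy",
    "vf_coef": 0.5, "normalize": False,
}

# Algorithm settings go to the model; everything else describes the setup.
ppo_kwargs = {k: v for k, v in zoo_style_params.items() if k in PPO_KWARG_NAMES}
setup_params = {k: v for k, v in zoo_style_params.items() if k not in PPO_KWARG_NAMES}

print(ppo_kwargs)
print(sorted(setup_params))
# The model could then be built along these lines:
# model = PPO(zoo_style_params["policy"], vec_env, verbose=1, **ppo_kwargs)
```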


In [8]:
# create model with default settings

# stable_baselines3.ppo.PPO(policy, env, learning_rate=0.0003, n_steps=2048, batch_size=64, n_epochs=10, gamma=0.99, gae_lambda=0.95, clip_range=0.2, clip_range_vf=None, normalize_advantage=True, ent_coef=0.0, vf_coef=0.5, max_grad_norm=0.5, use_sde=False, sde_sample_freq=-1, rollout_buffer_class=None, rollout_buffer_kwargs=None, target_kl=None, stats_window_size=100, tensorboard_log=None, policy_kwargs=None, verbose=0, seed=None, device='auto', _init_setup_model=True)

model = PPO("CnnPolicy", vec_env, verbose=1, tensorboard_log="./gdrive/MyDrive/models/ppo_spaceinvaders_tensorboard/"
            )

# model.set_env(vec_env)

Using cuda device
Wrapping the env in a VecTransposeImage.


In [None]:
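# Alternative to creating a new model: resume from a saved checkpoint
# Note: as far as I can tell, force_reset is only consulted when env= is passed to load();
# set_env() below resets the environment by default
model = PPO.load("/content/gdrive/MyDrive/models/PPO/ppo_spaceInvadersLvl4Train_33955088_steps", verbose=1, tensorboard_log="./gdrive/MyDrive/models/ppo_spaceinvaders_tensorboard/",
                 force_reset=False)
model.set_env(vec_env)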

Wrapping the env in a VecTransposeImage.
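
Checkpoint files written by the callback end in `_<num_timesteps>_steps`, as the path above shows, so the most recent one can be picked by comparing step counts numerically. The sketch below is one way to do that; `latest_checkpoint` and the file names in `listing` are hypothetical and only mimic the naming pattern.

```python
import re
from typing import Iterable, Optional

# Matches the trailing "_<digits>_steps" part, with or without the .zip extension.
STEP_PATTERN = re.compile(r"_(\d+)_steps(?:\.zip)?$")

def latest_checkpoint(filenames: Iterable[str], prefix: str) -> Optional[str]:
    """Return the checkpoint with the highest step count for a given prefix."""
    best_name, best_steps = None, -1
    for name in filenames:
        match = STEP_PATTERN.search(name)
        if not name.startswith(prefix + "_") or match is None:
            continue
        steps = int(match.group(1))
        if steps > best_steps:
            best_name, best_steps = name, steps
    return best_name

# Hypothetical directory listing; in Colab it could come from os.listdir(save_path).
listing = [
    "ppo_example_2000000_steps.zip",
    "ppo_example_10000000_steps.zip",
    "ppo_example_4000000_steps.zip",
    "notes.txt",
]
print(latest_checkpoint(listing, prefix="ppo_example"))
```

Comparing the parsed integers matters here: a plain alphabetical sort would rank `4000000` after `10000000`.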


In [9]:
from stable_baselines3.common.callbacks import EvalCallback, StopTrainingOnRewardThreshold

# Stop training when the model reaches the reward threshold
ptsPerInvaderCol = 60
colPerLevel = 12
ptsPerLevel = ptsPerInvaderCol*colPerLevel
levelsToBeat = 3
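# ptsPerLevel*levelsToBeat = 2160, so the threshold is set slightly above it
reward_threshold = 2200
callback_on_best = StopTrainingOnRewardThreshold(reward_threshold=reward_threshold, verbose=1)
# Note: eval_callback only takes effect if it is included in the callback list passed to model.learn()
eval_callback = EvalCallback(vec_env, callback_on_new_best=callback_on_best, verbose=1)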
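
The threshold arithmetic and the stopping rule can be checked in isolation. This sketch reuses the point values from the cell above; `should_continue` is a hypothetical function that mirrors my understanding of the callback (training continues while the best mean evaluation reward is below the threshold), and the two reward values passed to it are made-up inputs.

```python
# Point values taken from the cell above.
pts_per_invader_col = 60
col_per_level = 12
levels_to_beat = 3

pts_per_level = pts_per_invader_col * col_per_level
target_score = pts_per_level * levels_to_beat
print(pts_per_level, target_score)  # 720 2160

def should_continue(best_mean_reward: float, reward_threshold: float) -> bool:
    """Keep training while the best evaluation reward is below the threshold."""
    return best_mean_reward < reward_threshold

reward_threshold = 2200
print(should_continue(500.0, reward_threshold))   # True
print(should_continue(2250.0, reward_threshold))  # False
```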

In [None]:
model.learn(total_timesteps=30000000, callback=[checkpoint_callback], reset_num_timesteps=False)

Streaming output truncated to the last 5000 lines.
|    value_loss           | 0.131     |
---------------------------------------
----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 3.05e+03   |
|    ep_rew_mean          | 462        |
| time/                   |            |
|    fps                  | 373        |
|    iterations           | 1595       |
|    time_elapsed         | 69877      |
|    total_timesteps      | 26132480   |
| train/                  |            |
|    approx_kl            | 0.44387078 |
|    clip_fraction        | 0.247      |
|    clip_range           | 0.2        |
|    entropy_loss         | -0.341     |
|    explained_variance   | 0.951      |
|    learning_rate        | 0.0003     |
|    loss                 | 0.0387     |
|    n_updates            | 15940      |
|    policy_gradient_loss | -0.015     |
|    value_loss           | 0.121      |
-----------------------------------

<stable_baselines3.ppo.ppo.PPO at 0x7904f7726ce0>
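
The counters in the last log block can be cross-checked against the PPO defaults quoted in the signature comment. The sketch below copies the numbers from the log and recomputes them; the interpretation of `n_updates` (incremented by `n_epochs` per training phase and logged before the current iteration's training) is an assumption that happens to match the values shown.

```python
n_envs = 8        # from make_atari_env(..., n_envs=8)
n_steps = 2048    # PPO default listed in the signature comment
n_epochs = 10     # PPO default listed in the signature comment
batch_size = 64   # PPO default listed in the signature comment

# Values copied from the last log block.
iterations = 1595
total_timesteps = 26_132_480
time_elapsed = 69_877
n_updates = 15_940

# One rollout collects n_steps transitions from every environment.
rollout_size = n_steps * n_envs
minibatches_per_epoch = rollout_size // batch_size

print(rollout_size)                          # 16384
print(iterations * rollout_size)             # 26132480
print(int(total_timesteps / time_elapsed))   # 373
print(minibatches_per_epoch)                 # 256
print((iterations - 1) * n_epochs)           # 15940
```

Under these assumptions, the logged `total_timesteps`, `fps`, and `n_updates` are all consistent with a run that used the default `n_steps`, `batch_size`, and `n_epochs`.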

## Download / Upload Trained Agent and Continue Training

Save and download the trained model.

In [None]:
from google.colab import files

In [None]:
model.save("ppoDefault_spaceInvaders")
files.download("ppoDefault_spaceInvaders.zip")

In [None]:
import time
# sleep for 60 seconds to finish downloading
time.sleep(60)

In [None]:
drive.flush_and_unmount()

In [None]:
# terminate sessions
# free resources
from google.colab import runtime
runtime.unassign()