# Stable Baselines, a Fork of OpenAI Baselines - Training, Saving and Loading

Github Repo: [https://github.com/hill-a/stable-baselines](https://github.com/hill-a/stable-baselines)

Medium article: [https://medium.com/@araffin/stable-baselines-a-fork-of-openai-baselines-df87c4b2fc82](https://medium.com/@araffin/stable-baselines-a-fork-of-openai-baselines-df87c4b2fc82)

[RL Baselines Zoo](https://github.com/araffin/rl-baselines-zoo) is a collection of pre-trained Reinforcement Learning agents using Stable-Baselines.

It also provides basic scripts for training, evaluating agents, tuning hyperparameters and recording videos.

Documentation is available online: [https://stable-baselines.readthedocs.io/](https://stable-baselines.readthedocs.io/)

## Install Dependencies and Stable Baselines Using Pip

The full list of dependencies can be found in the [README](https://github.com/hill-a/stable-baselines).

```
sudo apt-get update && sudo apt-get install cmake libopenmpi-dev zlib1g-dev
```


```
pip install stable-baselines[mpi]
```

In [None]:
# The original Stable Baselines only supports TensorFlow 1.x;
# stable-baselines3 (installed below) is built on PyTorch instead.
%tensorflow_version 1.x
!pip install requests==2.23 # Avoid error
!apt install swig cmake libopenmpi-dev zlib1g-dev
!pip install stable-baselines3
!pip install compiler_gym


Collecting requests==2.23
  Downloading requests-2.23.0-py2.py3-none-any.whl (58 kB)
     |████████████████████████████████| 58 kB 3.0 MB/s eta 0:00:01
Installing collected packages: requests
  Attempting uninstall: requests
    Found existing installation: requests 2.26.0
    Uninstalling requests-2.26.0:
      Successfully uninstalled requests-2.26.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
kapre 0.3.6 requires tensorflow>=2.0.0, but you have tensorflow 1.15.2 which is incompatible.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.
compiler-gym 0.2.1 requires requests>=2.24.0, but you have requests 2.23.0 which is incompatible.
Successfully installed requests-2.23.0


Reading package lists... Done
Building dependency tree       
Reading state information... Done
zlib1g-dev is already the newest version (1:1.2.11.dfsg-0ubuntu2).
libopenmpi-dev is already the newest version (2.1.1-8).
swig is already the newest version (3.0.12-1).
cmake is already the newest version (3.10.2-1ubuntu2.18.04.2).
0 upgraded, 0 newly installed, 0 to remove and 37 not upgraded.
Collecting gym<0.20,>=0.17
  Downloading gym-0.19.0.tar.gz (1.6 MB)
     |████████████████████████████████| 1.6 MB 5.4 MB/s
Building wheels for collected packages: gym
  Building wheel for gym (setup.py) ... done
  Created wheel for gym: filename=gym-0.19.0-py3-none-any.whl size=1663113 sha256=381a1c589b42b74c4e3fbcd50a3462da963fd2b5057f3c353b58f63d7b0d22b4
  Stored in directory: /root/.cache/pip/wheels/ef/9d/70/8bea53f7edec2fdb4f98d9d64ac9f11aea95dfcb98099d7712
Successfully built gym
Installing collected packages: gym
  Attempting uninstall: gym
    Found existing installation: gym 0

Collecting requests>=2.24.0
  Using cached requests-2.26.0-py2.py3-none-any.whl (62 kB)
Installing collected packages: requests
  Attempting uninstall: requests
    Found existing installation: requests 2.23.0
    Uninstalling requests-2.23.0:
      Successfully uninstalled requests-2.23.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
kapre 0.3.6 requires tensorflow>=2.0.0, but you have tensorflow 1.15.2 which is incompatible.
google-colab 1.0.0 requires requests~=2.23.0, but you have requests 2.26.0 which is incompatible.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.
Successfully installed requests-2.26.0


## Import policy, RL agent, ...

In [None]:
import os

import gym

import compiler_gym
from compiler_gym.leaderboard.llvm_instcount import eval_llvm_instcount_policy # Evaluation method used by FB for leaderboard
from compiler_gym.envs import LlvmEnv
from compiler_gym.wrappers import TimeLimit

import numpy as np

from stable_baselines3 import DQN

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
os.chdir('/') # Navigate to root
os.chdir('/content/drive/Shareddrives/csc461_A-Team/experiments/cg_exp_1') # Navigate to this directory

In [None]:
pwd

'/content/drive/Shareddrives/csc461_A-Team/experiments/cg_exp_1'

## Create the Gym env and instantiate the agent

In [None]:
def make_env() -> compiler_gym.envs.CompilerEnv:
    """Make the reinforcement learning environment for this experiment.
    
      From FB example.
    """
    env = compiler_gym.make(
        "llvm-ic-v0",
        observation_space="Autophase",
        reward_space="IrInstructionCountOz",
    )

    # Finally, we impose a time limit on the environment so that every episode
    # runs for 5 steps or fewer. This is because the environment's task is
    # continuous and no action is guaranteed to result in a terminal state.
    # Adding a time limit means we don't have to worry about learning when an
    # agent should stop, though it also limits the improvement the agent can
    # achieve compared to using an unbounded maximum episode length.
    env = TimeLimit(env, max_episode_steps=5)
    return env
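
To make the effect of the time limit concrete, here is a minimal sketch of what such a wrapper does. `NeverDoneEnv`, `StepLimit` and `run_episode` are hypothetical toy names made up for illustration; this is not the `compiler_gym.wrappers.TimeLimit` implementation, only the general idea of forcing `done` after a fixed number of steps.

```python
class NeverDoneEnv:
    """Hypothetical toy environment that never terminates on its own."""

    def reset(self):
        return 0  # dummy observation

    def step(self, action):
        # observation, reward, done, info: `done` is always False here
        return 0, 0.0, False, {}


class StepLimit:
    """Toy stand-in for a time-limit wrapper: force `done` after N steps."""

    def __init__(self, env, max_episode_steps):
        self.env = env
        self.max_episode_steps = max_episode_steps
        self.elapsed = 0

    def reset(self):
        self.elapsed = 0
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.elapsed += 1
        if self.elapsed >= self.max_episode_steps:
            done = True  # end the episode even though the task is continuous
        return obs, reward, done, info


def run_episode(env):
    """Step with a dummy action until the episode ends; return its length."""
    env.reset()
    steps, done = 0, False
    while not done:
        _, _, done, _ = env.step(0)
        steps += 1
    return steps


episode_length = run_episode(StepLimit(NeverDoneEnv(), max_episode_steps=5))
print(episode_length)  # 5
```

Without the wrapper, `run_episode(NeverDoneEnv())` would loop forever, which is the situation the comment in `make_env()` describes.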

In [None]:
from itertools import islice

with make_env() as env:
  # The dataset we will be using:
  chstone = env.datasets["chstone-v0"] # Small dataset

  # Each dataset has a `benchmarks()` method that returns an iterator over the
  # benchmarks within the dataset. Here we use iterator slicing to grab a
  # handful of benchmarks for training and validation.

  # Take up to 55 benchmarks (islice stops early if the dataset has fewer),
  # then keep the first 50 for training and the rest for validation.
  train_benchmarks = list(islice(chstone.benchmarks(), 55))
  train_benchmarks, val_benchmarks = train_benchmarks[:50], train_benchmarks[50:]

  # # We will use the entire chstone-v0 dataset for testing.
  # test_benchmarks = list(chstone.benchmarks())
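
One thing worth checking with this kind of split: `islice` silently stops early when the iterator runs out, so a small dataset can leave the validation list empty. The sketch below uses made-up benchmark names (`bench-0`, `bench-1`, ...) and a hypothetical helper `split_benchmarks` to show both cases; the dataset sizes are illustrative, not measurements of `chstone-v0`.

```python
from itertools import islice


def split_benchmarks(benchmarks, n_total=55, n_train=50):
    """Take up to n_total items, then split into train / validation lists."""
    taken = list(islice(benchmarks, n_total))
    return taken[:n_train], taken[n_train:]


# Hypothetical benchmark names standing in for real dataset entries.
large = (f"bench-{i}" for i in range(100))
small = (f"bench-{i}" for i in range(12))

train_large, val_large = split_benchmarks(large)
print(len(train_large), len(val_large))  # 50 5

train_small, val_small = split_benchmarks(small)
print(len(train_small), len(val_small))  # 12 0
```

Printing `len(train_benchmarks)` and `len(val_benchmarks)` after the cell above is a cheap way to confirm the split is what you expect.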

In [None]:
from compiler_gym.wrappers import CycleOverBenchmarks

training_env = CycleOverBenchmarks(make_env(), train_benchmarks)
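
As a rough mental model, the wrapper hands the environment the next benchmark from a repeating sequence on every `reset()`. The sketch below illustrates that idea with toy classes (`ToyEnv` and `CycleOnReset` are hypothetical names); it is not the CompilerGym implementation.

```python
from itertools import cycle


class ToyEnv:
    """Hypothetical minimal environment that records which benchmark is active."""

    def __init__(self):
        self.benchmark = None

    def reset(self, benchmark=None):
        if benchmark is not None:
            self.benchmark = benchmark
        return self.benchmark


class CycleOnReset:
    """Toy stand-in: pick the next benchmark from a repeating list on each reset."""

    def __init__(self, env, benchmarks):
        self.env = env
        self._benchmarks = cycle(benchmarks)

    def reset(self):
        return self.env.reset(benchmark=next(self._benchmarks))


toy = CycleOnReset(ToyEnv(), ["a", "b", "c"])
seen = [toy.reset() for _ in range(5)]
print(seen)  # ['a', 'b', 'c', 'a', 'b']
```

With 5-step episodes, this means the agent moves to a new training program every 5 timesteps.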

In [None]:
model = DQN('MlpPolicy', training_env, learning_rate=.001, verbose=1, exploration_final_eps=0.15)

Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
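
The `exploration_final_eps=0.15` argument sets the floor of the epsilon-greedy schedule. Below is a minimal sketch of a linear annealing schedule, assuming an initial epsilon of 1.0 and an exploration fraction of 0.1 (I believe these are the Stable-Baselines3 defaults, but check them against your installed version). `linear_epsilon` is a hypothetical helper, not a library function.

```python
def linear_epsilon(step, total_timesteps, initial_eps=1.0, final_eps=0.15,
                   exploration_fraction=0.1):
    """Linearly anneal epsilon from initial_eps to final_eps, then hold it."""
    progress = step / total_timesteps
    if progress >= exploration_fraction:
        return final_eps
    return initial_eps + progress * (final_eps - initial_eps) / exploration_fraction


for step in (0, 5_000, 10_000, 83_400):
    print(step, round(linear_epsilon(step, total_timesteps=100_000), 3))
```

Under these assumptions epsilon reaches 0.15 after 10,000 of the 100,000 timesteps and stays there, which would be consistent with the `exploration_rate` of 0.15 in the training log.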


In [None]:
# Train the agent
model.learn(total_timesteps=100000, log_interval=10)

Streaming output truncated to the last 5000 lines.
| train/              |          |
|    learning_rate    | 0.001    |
|    loss             | 2.43     |
|    n_updates        | 8337     |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 5        |
|    ep_rew_mean      | -0.644   |
|    exploration_rate | 0.15     |
| time/               |          |
|    episodes         | 16680    |
|    fps              | 268      |
|    time_elapsed     | 310      |
|    total_timesteps  | 83400    |
| train/              |          |
|    learning_rate    | 0.001    |
|    loss             | 2.16     |
|    n_updates        | 8349     |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 5        |
|    ep_rew_mean      | -0.643   |
|    exploration_rate | 0.15     |
| time/               |          |
|    episodes         | 1

<stable_baselines3.dqn.dqn.DQN at 0x7f7c36cf4890>
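
A quick sanity check on the log: the numbers in one complete block should be consistent with each other. The sketch below only uses values copied from the output above; the variable names are my own.

```python
# Values copied from one block of the training log above.
episodes = 16680
ep_len_mean = 5
total_timesteps = 83400
fps = 268
time_elapsed = 310  # seconds

# Every episode is capped at 5 steps, so steps = episodes * episode length.
assert episodes * ep_len_mean == total_timesteps

# fps and time_elapsed are logged as whole numbers, so their product only
# approximately reproduces total_timesteps.
approx_steps = fps * time_elapsed
relative_gap = abs(approx_steps - total_timesteps) / total_timesteps
print(approx_steps, f"{relative_gap:.2%}")
```

An `ep_len_mean` of exactly 5 also confirms that every episode is ending at the time limit rather than at a terminal state.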

Saving the model

In [None]:
# Create save dir
# save_dir = "/results/llvm-codesize/"
# os.makedirs(save_dir, exist_ok=True)

model.save("dqn_llvm_v0")

In [None]:
# NOTE: Don't run until you know the model file has been saved.
# del model  # delete trained model to demonstrate loading
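
Before deleting the in-memory model, it is easy to check that the archive actually landed on disk. This sketch assumes the library appended `.zip` to the name because no extension was given; `saved_model_path` is a hypothetical helper, not part of Stable-Baselines3.

```python
from pathlib import Path


def saved_model_path(name, directory="."):
    """Return the path of the saved archive if it exists, else None.

    Assumes ".zip" was appended on save because `name` has no extension.
    """
    candidate = Path(directory) / name
    if candidate.suffix == "":
        candidate = candidate.with_suffix(".zip")
    return candidate if candidate.is_file() else None


path = saved_model_path("dqn_llvm_v0")
if path is None:
    print("Model archive not found; keep the model in memory.")
else:
    print(f"Found {path} ({path.stat().st_size} bytes); safe to `del model`.")
```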

In [None]:
!python3 11-29_llvm_codesize.py

I1202 00:43:27.687420  3505 CreateAndRunCompilerGymServiceImpl.h:123] Service "/dev/shm/compiler_gym_root/s/1202T004327-667792-6ace" listening on 39847, PID = 3505
Writing results to llvm_instcount-results.csv
Writing logs to llvm_instcount-results.log
=== Evaluating policy on 230 cbench-v1 benchmarks ===

Runtime: 0:00:00. Estimated completion: 0:00:00. Completed: 0 / 230 (0.0%).
Current mean walltime: 0.001s / benchmark.
Runtime: 0:00:02. Estimated completion: 0:09:33. Completed: 0 / 230 (0.0%).
Current mean walltime: 2.482s / benchmark.
Current geomean reward: 0.0000.
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
Exception in thread Thread-5:
Traceback (most recent call last):
  File "/usr/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.7/dist-packages/compiler_gym/leaderboard/llvm_instcount.py", line 117, in run
    self.policy(self.env)
  File "11-29_llvm_codesize.py", li

Loading the model

In [None]:
# Load the model saved above; verbose=1 is set at load time
# loaded_model = DQN.load("dqn_llvm_v0", verbose=1)

In [None]:
# NOTE: Not sure if this is needed; it has not been tested yet.
from compiler_gym.envs import LlvmEnv
def dqn_policy(env: LlvmEnv) -> None:
  """Step `env` with actions from the global `model` until the episode ends."""
  done = False
  obs = env.reset()
  while not done:
    # _states is only meaningful for recurrent policies
    action, _states = model.predict(obs)

    obs, reward, done, info = env.step(action)


In [None]:
# NOTE: This seems to need a terminal: called from a notebook cell, the evaluator's
# flag parser rejects the kernel's own `-f` argument (see the error below).

%tb # See full stack trace
# Evaluate the agent using Facebook's criteria for leaderboards.
# Code on Github: https://github.com/facebookresearch/CompilerGym/blob/f4ef8b5e809b65a6da38e74cfa8dbba7fb1fa6fe/compiler_gym/leaderboard/llvm_instcount.py#L141
# Making leaderboard submissions: https://github.com/facebookresearch/CompilerGym/blob/development/CONTRIBUTING.md#leaderboard-submissions
#if __name__ == "__main__":
eval_llvm_instcount_policy(dqn_policy)


SystemExit: ignored

FATAL Flags parsing error: Unknown command line flag 'f'
Pass --helpshort or --helpfull to see help on flags.


SystemExit: ignored

  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)
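
The `Unknown command line flag 'f'` message is typical of a flag parser reading the notebook kernel's own `-f <connection file>` argument from `sys.argv`. One possible workaround, sketched below, is to hide those arguments temporarily. `clean_argv` is a hypothetical helper, and I have not verified that it is enough for `eval_llvm_instcount_policy`; the evaluator may still end with `SystemExit`, so running the evaluation as a standalone script remains the safer route.

```python
import sys
from contextlib import contextmanager


@contextmanager
def clean_argv(program_name="notebook"):
    """Temporarily hide the kernel's command-line arguments (such as -f)."""
    saved = sys.argv
    sys.argv = [program_name]
    try:
        yield
    finally:
        sys.argv = saved  # always restore the original arguments


# Hypothetical usage inside a notebook cell (untested assumption):
#
#     with clean_argv():
#         eval_llvm_instcount_policy(dqn_policy)

with clean_argv():
    print(sys.argv)  # ['notebook']
```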


#### Other code

We create a helper function to evaluate the agent:

In [None]:
def evaluate(model, num_steps=1000):
  """
  Evaluate an RL agent on the global `env`, which must be defined before calling.
  :param model: (BaseAlgorithm) the RL agent
  :param num_steps: (int) number of timesteps to evaluate it for
  :return: (float) mean reward over the last 100 episodes
  """
  episode_rewards = [0.0]
  obs = env.reset()
  for i in range(num_steps):
      # _states are only useful when using LSTM policies
      action, _states = model.predict(obs)

      obs, reward, done, info = env.step(action)
      
      # Stats
      episode_rewards[-1] += reward
      if done:
          obs = env.reset()
          episode_rewards.append(0.0)
  # Compute mean reward for the last 100 episodes
  mean_100ep_reward = round(np.mean(episode_rewards[-100:]), 1)
  print("Mean reward:", mean_100ep_reward, "Num episodes:", len(episode_rewards))
  
  return mean_100ep_reward
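
Note that `evaluate()` above reads `env` from the enclosing scope. A variant that takes the environment as an argument is easier to test in isolation. The sketch below runs the same loop against toy stand-ins (`ToyEnv`, `ToyModel` and `evaluate_on` are hypothetical names), using only the standard library.

```python
from statistics import mean


class ToyEnv:
    """Hypothetical environment: reward 1.0 per step, episodes last 5 steps."""

    def __init__(self):
        self.t = 0

    def reset(self):
        self.t = 0
        return 0

    def step(self, action):
        self.t += 1
        return 0, 1.0, self.t >= 5, {}


class ToyModel:
    """Hypothetical agent exposing the same predict() shape as above."""

    def predict(self, obs):
        return 0, None


def evaluate_on(model, env, num_steps=1000):
    """Same loop as evaluate(), but the environment is passed explicitly."""
    episode_rewards = [0.0]
    obs = env.reset()
    for _ in range(num_steps):
        action, _states = model.predict(obs)
        obs, reward, done, info = env.step(action)
        episode_rewards[-1] += reward
        if done:
            obs = env.reset()
            episode_rewards.append(0.0)
    return mean(episode_rewards[-100:]), len(episode_rewards)


mean_reward, num_episodes = evaluate_on(ToyModel(), ToyEnv(), num_steps=20)
print(mean_reward, num_episodes)  # 4.0 5
```

The toy run finishes four episodes worth 5.0 each, yet reports a mean of 4.0 over 5 entries: the list always ends with an unfinished (here empty) episode, which pulls the average down. The same applies to `evaluate()` above.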

Let's evaluate the untrained agent; it should behave like a random agent.

In [None]:
# Random Agent, before training
mean_reward_before_train = evaluate(model, num_steps=10000)

Mean reward: -895.1 Num episodes: 88


## Train the agent and save it

Warning: this may take a while

## Load the trained agent

In [None]:
model = DQN.load("dqn_lunar")

Loading a model without an environment, this model cannot be trained until it has a valid environment.


In [None]:
# Evaluate the trained agent
mean_reward = evaluate(model, num_steps=10000)