
# PFRL Quickstart Guide

This is a quickstart guide for users who just want to try PFRL for the first time.

If you have not yet installed PFRL, run the command below to install it:
```
pip install pfrl
```

If you have already installed PFRL, let's begin!

First, import the necessary modules. PFRL's module name is `pfrl`. The cell below also installs the display and rendering dependencies and imports `torch`, `gym`, and `numpy`, since they are used later.

In [1]:
# Install prerequisite display packages (-y answers apt's confirmation prompt)
!apt update && apt install -y xvfb python-opengl ffmpeg
# Install torch and plotting packages (pathlib ships with Python 3, so it is not installed here)
!pip install torchvision matplotlib seaborn pandas numpy
#install gym and physics engine for box2d environments
!pip install gym box2d-py

#install wrapper to visualize environment
!pip install gym-notebook-wrapper
!pip install pyvirtualdisplay
import pyvirtualdisplay
disp = pyvirtualdisplay.Display()
disp.start() # Start Xvfb and set "DISPLAY" environment properly.
!pip install pfrl
import pfrl
import torch
import torch.nn
import gym
import numpy

import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import random, os.path, math, glob, csv, base64, itertools, sys
from gym.wrappers import Monitor
import gnwrapper

PFRL can be applied to any problem that is modeled as an "environment". [OpenAI Gym](https://github.com/openai/gym) provides various benchmark environments and defines a common interface for them. PFRL uses a subset of that interface. Specifically, an environment must define its observation space and action space and have at least two methods: `reset` and `step`.

- `env.reset` will reset the environment to the initial state and return the initial observation.
- `env.step` will execute a given action, move to the next state and return four values:
  - a next observation
  - a scalar reward
  - a boolean value indicating whether the current state is terminal or not
  - additional information
- `env.render` will render the current state. (optional)
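
To make the method contract above concrete, here is a minimal sketch of a toy environment written in plain Python. `CoinFlipEnv` and `run_toy_episode` are hypothetical names invented for this illustration; the class deliberately omits `observation_space` and `action_space`, so it shows only the `reset`/`step` contract and is not a drop-in PFRL environment.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment exposing only reset() and step()."""

    def __init__(self, episode_len=5, seed=0):
        self.episode_len = episode_len
        self.rng = random.Random(seed)
        self.t = 0
        self.coin = 0

    def reset(self):
        # Go back to the initial state and return the initial observation.
        self.t = 0
        self.coin = self.rng.randint(0, 1)
        return [float(self.coin)]

    def step(self, action):
        # Reward 1.0 when the action matches the coin that was observed.
        reward = 1.0 if action == self.coin else 0.0
        self.t += 1
        self.coin = self.rng.randint(0, 1)  # move to the next state
        done = self.t >= self.episode_len   # terminal after episode_len steps
        info = {}                           # no additional information
        return [float(self.coin)], reward, done, info


def run_toy_episode(toy_env):
    """Play one episode by copying the observed coin as the action."""
    toy_obs = toy_env.reset()
    toy_done = False
    toy_return = 0.0
    while not toy_done:
        toy_action = int(toy_obs[0])
        toy_obs, toy_reward, toy_done, _ = toy_env.step(toy_action)
        toy_return += toy_reward
    return toy_return


toy_return = run_toy_episode(CoinFlipEnv(episode_len=5))
print("return:", toy_return)
```

Because the observation reveals the coin, copying it as the action earns the full reward on every step. A real environment would also expose its observation and action spaces so that the agent can size its network.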

This notebook uses `LunarLander-v2`, a Box2D control problem, in place of the classic `CartPole` task (left commented out in the cell below). You can see below that its observation space consists of eight real numbers while its action space consists of four discrete actions.

In [12]:
#env = gym.make('CartPole-v1')
env = gym.make('LunarLander-v2')
env.seed(0)
# Start Xvfb and record a video every 50 episodes; force=True overwrites existing saved videos
env = gnwrapper.Monitor(env, directory="./train", force=True, video_callable=lambda num: num % 50 == 0)
print('observation space:', env.observation_space)
print('action space:', env.action_space)

obs = env.reset()
print('initial observation:', obs)

action = env.action_space.sample()
obs, r, done, info = env.step(action)
print('next observation:', obs)
print('reward:', r)
print('done:', done)
print('info:', info)

# Render the current state of the environment (comment out if no display is available)
env.render()

observation space: Box(-inf, inf, (8,), float32)
action space: Discrete(4)
initial observation: [-5.9156417e-04  1.4134574e+00 -5.9935719e-02  1.1277095e-01
  6.9228926e-04  1.3576316e-02  0.0000000e+00  0.0000000e+00]
next observation: [-1.0364533e-03  1.4164169e+00 -4.5885697e-02  1.3153474e-01
  2.0719476e-03  2.7596612e-02  0.0000000e+00  0.0000000e+00]
reward: -1.8939137747137977
done: False
info: {}


True

Now you have defined your environment. Next, you need to define an agent, which will learn through interactions with the environment.

PFRL provides various agents, each of which implements a deep reinforcement learning algorithm.

Let's try using the DoubleDQN algorithm (https://arxiv.org/abs/1509.06461), which is implemented by `pfrl.agents.DoubleDQN`. This algorithm trains a Q-function that receives an observation and returns an expected future return for each action the agent can take. In PFRL, you can define your Q-function as `torch.nn.Module` as below. Note that the outputs are wrapped by `pfrl.action_value.DiscreteActionValue`. By wrapping the outputs of Q-functions, PFRL can support not only discrete-action Q-functions like this but also continuous-action Q-functions (via [Normalized Advantage Functions](https://arxiv.org/abs/1603.00748)) in the same way.

In [13]:
class QFunction(torch.nn.Module):

    def __init__(self, obs_size, n_actions):
        super().__init__()
        self.l1 = torch.nn.Linear(obs_size, 50)
        self.l2 = torch.nn.Linear(50, 50)
        self.l3 = torch.nn.Linear(50, n_actions)

    def forward(self, x):
        h = x
        h = torch.nn.functional.relu(self.l1(h))
        h = torch.nn.functional.relu(self.l2(h))
        h = self.l3(h)
        return pfrl.action_value.DiscreteActionValue(h)

obs_size = env.observation_space.low.size
n_actions = env.action_space.n
q_func = QFunction(obs_size, n_actions)
print(q_func)

QFunction(
  (l1): Linear(in_features=8, out_features=50, bias=True)
  (l2): Linear(in_features=50, out_features=50, bias=True)
  (l3): Linear(in_features=50, out_features=4, bias=True)
)


It is also possible to define the same model using `torch.nn.Sequential`. `pfrl.q_functions.DiscreteActionValueHead` is just a `torch.nn.Module` that packs its input to `pfrl.action_value.DiscreteActionValue`.

In [14]:
q_func2 = torch.nn.Sequential(
    torch.nn.Linear(obs_size, 50),
    torch.nn.ReLU(),
    torch.nn.Linear(50, 50),
    torch.nn.ReLU(),
    torch.nn.Linear(50, n_actions),
    pfrl.q_functions.DiscreteActionValueHead(),
)
print(q_func2)

Sequential(
  (0): Linear(in_features=8, out_features=50, bias=True)
  (1): ReLU()
  (2): Linear(in_features=50, out_features=50, bias=True)
  (3): ReLU()
  (4): Linear(in_features=50, out_features=4, bias=True)
  (5): DiscreteActionValueHead()
)
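
To show what the per-action values are used for, the following sketch computes a one-step Double DQN learning target for a single transition, using plain Python lists in place of network outputs. The function name `double_dqn_target` and all numbers are made up for illustration; this is the textbook form of the target, not PFRL's internal code.

```python
def double_dqn_target(reward, done, gamma, q_online_next, q_target_next):
    """One-step Double DQN target for a single transition.

    q_online_next / q_target_next are the per-action values at the next
    state from the online and target networks (plain lists here).
    """
    if done:
        return reward  # no bootstrapping from a terminal state
    # Select the next action with the online network...
    best_action = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    # ...but evaluate that action with the target network.
    return reward + gamma * q_target_next[best_action]


dqn_target_example = double_dqn_target(
    reward=1.0,
    done=False,
    gamma=0.9,
    q_online_next=[0.2, 1.5, 0.3, 0.1],  # online network prefers action 1
    q_target_next=[0.4, 1.0, 2.0, 0.0],  # target network values
)
print(dqn_target_example)
```

Here the online network picks action 1, so the target uses the target network's value 1.0 for that action rather than its maximum of 2.0. Decoupling selection from evaluation in this way is what distinguishes Double DQN from plain DQN.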


As usual in PyTorch, `torch.optim.Optimizer` is used to optimize a model.

In [15]:
# Use Adam to optimize q_func. eps=1e-2 is for stability.
optimizer = torch.optim.Adam(q_func.parameters(), eps=1e-2)

To create a DoubleDQN agent with the Q-function and optimizer, you need to specify a few more parameters and settings.

In [16]:
# Set the discount factor that discounts future rewards.
gamma = 0.9

# Use epsilon-greedy for exploration
explorer = pfrl.explorers.ConstantEpsilonGreedy(
    epsilon=0.3, random_action_func=env.action_space.sample)

# DQN uses Experience Replay.
# Specify a replay buffer and its capacity.
replay_buffer = pfrl.replay_buffers.ReplayBuffer(capacity=10 ** 6)

# PyTorch layers expect float32 inputs by default, so specify a
# converter as a feature extractor function phi. LunarLander-v2
# observations are already float32, so astype(copy=False) is a no-op here.
phi = lambda x: x.astype(numpy.float32, copy=False)

# Set the device id to use GPU. To use CPU only, set it to -1.
gpu = -1

# Now create an agent that will interact with the environment.
agent = pfrl.agents.DoubleDQN(
    q_func,
    optimizer,
    replay_buffer,
    gamma,
    explorer,
    replay_start_size=500,
    update_interval=1,
    target_update_interval=100,
    phi=phi,
    gpu=gpu,
)
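
As a rough illustration of the exploration setting above, here is a standalone sketch of epsilon-greedy action selection. The helper `epsilon_greedy` and the Q-values are hypothetical; the sketch mirrors the idea behind a constant-epsilon explorer rather than reproducing PFRL's implementation.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: uniform random action
    # Exploit: index of the largest Q-value.
    return max(range(len(q_values)), key=q_values.__getitem__)


sketch_rng = random.Random(0)
sketch_q_values = [0.1, 0.7, 0.2, 0.0]  # made-up values for four actions
sketch_actions = [
    epsilon_greedy(sketch_q_values, epsilon=0.3, rng=sketch_rng)
    for _ in range(1000)
]
greedy_fraction = sketch_actions.count(1) / len(sketch_actions)
print("fraction of greedy picks:", greedy_fraction)
```

With `epsilon=0.3` and four actions, the greedy action should be chosen roughly 77.5% of the time: 70% from exploitation plus a quarter of the 30% random picks.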

Now you have an agent and an environment. It's time to start reinforcement learning!

During training, two methods of `agent` must be called: `agent.act` and `agent.observe`. `agent.act(obs)` takes the current observation as input and returns an exploratory action. Once the returned action is processed in the env, `agent.observe(obs, reward, done, reset)` then observes the consequences:
- `obs`: next observation.
- `reward`: an immediate reward.
- `done`: a boolean value set to True if it reached a terminal state.
- `reset`: a boolean value set to True if an episode is interrupted at a non-terminal state, typically by a time limit.
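
The difference between `done` and `reset` matters for value-based learning in general. The sketch below, with a hypothetical helper `bootstrap_target` and made-up numbers, shows the usual reasoning: a true terminal state has no future value, whereas a state where the episode was merely cut off by a time limit still does. It illustrates the general idea and is not a description of PFRL's internals.

```python
def bootstrap_target(reward, next_value, gamma, done):
    """One-step value target; only a true terminal state stops bootstrapping."""
    return reward if done else reward + gamma * next_value


# The episode ended because the task finished (done=True): nothing follows.
terminal_target = bootstrap_target(reward=1.0, next_value=5.0, gamma=0.9, done=True)

# The episode was interrupted by a time limit (reset=True, done=False):
# the next state still has value, so the target keeps bootstrapping from it.
truncated_target = bootstrap_target(reward=1.0, next_value=5.0, gamma=0.9, done=False)

print(terminal_target, truncated_target)
```

Treating a time-limit cut-off as terminal would teach the agent that the interrupted state is worth less than it really is, which is why the two flags are passed separately.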

Optionally, you can get training statistics of the agent via `agent.get_statistics`.

In [17]:
n_episodes = 300
max_episode_len = 200
for i in range(1, n_episodes + 1):
    obs = env.reset()
    R = 0  # return (sum of rewards)
    t = 0  # time step
    while True:
        # Render the current state at every step (comment out to skip rendering)
        env.render()
        action = agent.act(obs) # get action from agent
        obs, reward, done, _ = env.step(action)
        R += reward
        t += 1
        reset = t == max_episode_len
        agent.observe(obs, reward, done, reset)
        if done or reset:
            break
    if i % 10 == 0:
        print('episode:', i, 'R:', R)
    if i % 50 == 0:
        print('statistics:', agent.get_statistics())
print('Finished.')

episode: 10 R: -272.2957162484753
episode: 20 R: -2.3689266328752407
episode: 30 R: -14.680004501413823
episode: 40 R: 13.970698906312707
episode: 50 R: 6.125502747199036
statistics: [('average_q', 0.7527589), ('average_loss', 0.6254820474982261), ('cumulative_steps', 8441), ('n_updates', 7942), ('rlen', 8441)]
episode: 60 R: 35.677403141885776
episode: 70 R: 4.459698328875573
episode: 80 R: 16.351358501711054
episode: 90 R: 0.9821798187193795
episode: 100 R: -1.2640507902555733
statistics: [('average_q', 3.2971933), ('average_loss', 0.4599005316197872), ('cumulative_steps', 18441), ('n_updates', 17942), ('rlen', 18441)]
episode: 110 R: 12.394432902693248
episode: 120 R: 19.76671146595775
episode: 130 R: 53.95136596309791
episode: 140 R: -25.3991385859201
episode: 150 R: 5.729127694323825
statistics: [('average_q', 2.2713606), ('average_loss', 0.4418302783370018), ('cumulative_steps', 28362), ('n_updates', 27863), ('rlen', 28362)]
episode: 160 R: -15.558915929418689
episode: 170 R: 47.

You have now finished training the DoubleDQN agent for 300 episodes. How good is the agent now? You can evaluate it inside a `with agent.eval_mode():` block, in which exploration such as epsilon-greedy is no longer used.

In [18]:
with agent.eval_mode():
    for i in range(10):
        obs = env.reset()
        R = 0
        t = 0
        while True:
            # Uncomment to watch the behavior in a GUI window
            # env.render()
            action = agent.act(obs)
            obs, r, done, _ = env.step(action)
            R += r
            t += 1
            reset = t == 200
            agent.observe(obs, r, done, reset)
            if done or reset:
                break
        print('evaluation episode:', i, 'R:', R)

evaluation episode: 0 R: 17.05141820084007
evaluation episode: 1 R: -14.293021443311684
evaluation episode: 2 R: 6.463629655025237
evaluation episode: 3 R: 22.31893413263944
evaluation episode: 4 R: 46.33641495449315
evaluation episode: 5 R: 1.9695294002284887
evaluation episode: 6 R: 31.943813311150503
evaluation episode: 7 R: 33.09348942035398
evaluation episode: 8 R: 25.12066859209115
evaluation episode: 9 R: 24.35689373879785
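
Individual episode returns are noisy, so it can help to summarize them. The sketch below uses only the standard library and the ten returns printed above, rounded to two decimals; the variable names are chosen for this illustration.

```python
import statistics

# Returns of the ten evaluation episodes printed above, rounded to two decimals.
eval_returns = [17.05, -14.29, 6.46, 22.32, 46.34, 1.97, 31.94, 33.09, 25.12, 24.36]

mean_return = statistics.mean(eval_returns)
stdev_return = statistics.stdev(eval_returns)  # sample standard deviation

print(
    f"mean: {mean_return:.2f}, stdev: {stdev_return:.2f}, "
    f"min: {min(eval_returns):.2f}, max: {max(eval_returns):.2f}"
)
```

Comparing the mean across training runs is usually more informative than looking at any single episode.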


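For reference, `LunarLander-v2` is commonly considered solved at an average return of 200 or more. The evaluation returns above are well short of that, which is not surprising after only 300 episodes capped at 200 steps each. You can train the agent longer by running the training loop again.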

If the results are good enough, the only remaining task is to save the agent so that you can reuse it. Simply call `agent.save` to save the agent and `agent.load` to load it again.

In [19]:
# Show the videos recorded by the gnwrapper.Monitor wrapper
env.display()
# Save an agent to the 'agent' directory
agent.save('agent')

# Uncomment to load an agent from the 'agent' directory
# agent.load('agent')

'openaigym.video.2.60.video000000.mp4'

'openaigym.video.2.60.video000050.mp4'

'openaigym.video.2.60.video000100.mp4'

'openaigym.video.2.60.video000150.mp4'

'openaigym.video.2.60.video000200.mp4'

'openaigym.video.2.60.video000250.mp4'

'openaigym.video.2.60.video000300.mp4'

RL completed!

Writing this loop by hand every time you use RL can be tedious, so PFRL provides utility functions that handle training and evaluation for you.

In [20]:
# Set up the logger to print info messages for understandability.
import logging
import sys
logging.basicConfig(level=logging.INFO, stream=sys.stdout, format='')

pfrl.experiments.train_agent_with_evaluation(
    agent,
    env,
    steps=2000,           # Train the agent for 2000 steps
    eval_n_steps=None,       # Evaluate by number of episodes, not by number of steps
    eval_n_episodes=10,       # 10 episodes are sampled for each evaluation
    train_max_episode_len=200,  # Maximum length of each episode
    eval_interval=1000,   # Evaluate the agent after every 1000 steps
    outdir='result',      # Save everything to 'result' directory
)

outdir:result step:200 episode:0 R:-4.889253439623856
statistics:[('average_q', 3.4998457), ('average_loss', 0.4453967161476612), ('cumulative_steps', 58324), ('n_updates', 57825), ('rlen', 58324)]
outdir:result step:400 episode:1 R:29.714623769771695
statistics:[('average_q', 3.485362), ('average_loss', 0.4436208690702915), ('cumulative_steps', 58524), ('n_updates', 58025), ('rlen', 58524)]
outdir:result step:600 episode:2 R:6.789792322324227
statistics:[('average_q', 2.9759436), ('average_loss', 0.48929208680987357), ('cumulative_steps', 58724), ('n_updates', 58225), ('rlen', 58724)]
outdir:result step:800 episode:3 R:-6.103379765722471
statistics:[('average_q', 3.6794488), ('average_loss', 0.45635294005274774), ('cumulative_steps', 58924), ('n_updates', 58425), ('rlen', 58924)]
outdir:result step:1000 episode:4 R:82.44397224225678
statistics:[('average_q', 3.5074635), ('average_loss', 0.4247084227204323), ('cumulative_steps', 59124), ('n_updates', 58625), ('rlen', 59124)]
evaluation

(<pfrl.agents.double_dqn.DoubleDQN at 0x7f7645070050>,
 [{'average_loss': 0.4247084227204323,
   'average_q': 3.5074635,
   'cumulative_steps': 59124,
   'eval_score': 8.401110523343535,
   'n_updates': 58625,
   'rlen': 59124},
  {'average_loss': 0.41273755371570586,
   'average_q': 4.237748,
   'cumulative_steps': 60124,
   'eval_score': 15.218966050613723,
   'n_updates': 59625,
   'rlen': 60124}])
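
The returned list holds one entry per evaluation, and the two entries above are 1000 cumulative steps apart. One plausible reading of this is sketched below: an evaluation is checked at the end of each training episode and runs once an interval boundary has been reached. The function `evaluation_points` is hypothetical and only models the schedule seen in the log; it is an assumption, not PFRL's actual implementation.

```python
def evaluation_points(steps, eval_interval, episode_len):
    """Steps at which an evaluation would run, checking only at episode ends."""
    points = []
    next_eval = eval_interval
    t = 0
    while t < steps:
        t = min(t + episode_len, steps)  # one training episode (capped at steps)
        if t >= next_eval:
            points.append(t)
            # Schedule the next evaluation at the following interval boundary.
            next_eval = (t // eval_interval + 1) * eval_interval
    return points


schedule = evaluation_points(steps=2000, eval_interval=1000, episode_len=200)
print(schedule)
```

With the settings used above (2000 steps, an interval of 1000, episodes of 200 steps), this sketch yields two evaluation points, which matches the two entries in the returned list.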

That's all for the PFRL quickstart guide. To learn more about PFRL, look into the `examples` directory and read and run the examples. Thank you!