
# PFRL Quickstart Guide

This is a quickstart guide for users who just want to try PFRL for the first time.

If you have not yet installed PFRL, run the command below to install it:
```
pip install pfrl
```

If you have already installed PFRL, let's begin!

First, import the necessary modules. The module name of PFRL is `pfrl`. Let's import `torch`, `gym`, and `numpy` as well, since they are used later.

In [1]:
#installing prerequisite display packages
!apt update && apt install xvfb python-opengl ffmpeg
#install torch and plotting packages (pathlib is part of the standard library)
!pip install torchvision matplotlib seaborn pandas numpy
#install gym and physics engine for box2d environments
!pip install gym box2d-py

#install wrapper to visualize environment
!pip install gym-notebook-wrapper
!pip install pyvirtualdisplay
import pyvirtualdisplay
disp = pyvirtualdisplay.Display()
disp.start() # Start Xvfb and set "DISPLAY" environment properly.
!pip install pfrl
import pfrl
import torch
import torch.nn
import gym
import numpy

import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import random, os.path, math, glob, csv, base64, itertools, sys
from gym.wrappers import Monitor
import gnwrapper



PFRL can be applied to any problem that is modeled as an "environment". [OpenAI Gym](https://github.com/openai/gym) provides various benchmark environments and defines a common interface for them. PFRL uses a subset of that interface. Specifically, an environment must define its observation space and action space and have at least two methods: `reset` and `step`.

- `env.reset` will reset the environment to the initial state and return the initial observation.
- `env.step` will execute a given action, move to the next state and return four values:
  - a next observation
  - a scalar reward
  - a boolean value indicating whether the current state is terminal or not
  - additional information
- `env.render` will render the current state. (optional)
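
The interface above can be illustrated with a toy environment. This is a minimal sketch in plain Python: `LineWalkEnv` and `run_episode` are hypothetical names invented for illustration, and the class omits the observation and action spaces that a real Gym environment would also define.

```python
class LineWalkEnv:
    """Hypothetical toy environment: walk from position 0 to `goal` on a line."""

    def __init__(self, goal=3):
        self.goal = goal
        self.pos = 0

    def reset(self):
        # Reset to the initial state and return the initial observation.
        self.pos = 0
        return self.pos

    def step(self, action):
        # Action 1 moves right; any other action moves left (never below 0).
        self.pos = self.pos + 1 if action == 1 else max(0, self.pos - 1)
        reward = -1.0                  # scalar reward: every step costs 1
        done = self.pos == self.goal   # terminal once the goal is reached
        info = {}                      # additional information (unused here)
        return self.pos, reward, done, info


def run_episode(env, policy, max_steps=100):
    """Roll out one episode and return (sum of rewards, number of steps)."""
    obs = env.reset()
    total, steps = 0.0, 0
    for _ in range(max_steps):
        obs, reward, done, _info = env.step(policy(obs))
        total += reward
        steps += 1
        if done:
            break
    return total, steps


print(run_episode(LineWalkEnv(goal=3), lambda obs: 1))
```

Any object that follows this `reset`/`step` contract can be driven by the same rollout loop, which is why the rest of this guide works unchanged across Gym environments.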

Let's try `LunarLander-v2`, a Box2D control problem. You can see below that its observation space consists of eight real numbers while its action space consists of four discrete actions.

In [2]:
#env = gym.make('CartPole-v1')
env = gym.make('LunarLander-v2')
env.seed(0)
# Start Xvfb and record every 50th episode; force=True overwrites existing saved videos
env = gnwrapper.Monitor(env, directory="./train", force=True, video_callable=lambda num: num % 50 == 0)
print('observation space:', env.observation_space)
print('action space:', env.action_space)

obs = env.reset()
print('initial observation:', obs)

action = env.action_space.sample()
obs, r, done, info = env.step(action)
print('next observation:', obs)
print('reward:', r)
print('done:', done)
print('info:', info)

# Render the current state of the environment (comment out to skip rendering)
env.render()

observation space: Box(-inf, inf, (8,), float32)
action space: Discrete(4)
initial observation: [-5.9156417e-04  1.4134574e+00 -5.9935719e-02  1.1277095e-01
  6.9228926e-04  1.3576316e-02  0.0000000e+00  0.0000000e+00]
next observation: [-1.0890007e-03  1.4154159e+00 -4.8010956e-02  8.7042563e-02
 -1.0120891e-03 -3.4089852e-02  0.0000000e+00  0.0000000e+00]
reward: 2.5724821158297018
done: False
info: {}


True

In [3]:

from collections import deque

Now you have defined your environment. Next, you need to define an agent, which will learn through interactions with the environment.

PFRL provides various agents, each of which implements a deep reinforcement learning algorithm.

Let's try using the DoubleDQN algorithm (https://arxiv.org/abs/1509.06461), which is implemented by `pfrl.agents.DoubleDQN`. This algorithm trains a Q-function that receives an observation and returns an expected future return for each action the agent can take. In PFRL, you can define your Q-function as `torch.nn.Module` as below. Note that the outputs are wrapped by `pfrl.action_value.DiscreteActionValue`. By wrapping the outputs of Q-functions, PFRL can support not only discrete-action Q-functions like this but also continuous-action Q-functions (via [Normalized Advantage Functions](https://arxiv.org/abs/1603.00748)) in the same way.
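
To make the idea behind Double DQN concrete, here is a minimal sketch of its regression target for a single transition, following the paper linked above. It is not PFRL's implementation: the function names are hypothetical, and plain Python lists stand in for the per-action outputs of the online and target networks.

```python
def double_dqn_target(reward, done, q_online_next, q_target_next, gamma):
    """Double DQN target for one transition (illustrative only)."""
    if done:
        # No future return after a terminal state.
        return reward
    # The online network selects the next action ...
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    # ... and the target network evaluates that action.
    return reward + gamma * q_target_next[best_action]


def dqn_target(reward, done, q_target_next, gamma):
    """Standard DQN target, shown for comparison."""
    if done:
        return reward
    return reward + gamma * max(q_target_next)


# Made-up Q-values for a next state with four actions.
q_online = [1.0, 2.0, 0.5, 0.0]
q_target = [3.0, 1.0, 0.0, 0.0]
print(double_dqn_target(1.0, False, q_online, q_target, gamma=0.5))
print(dqn_target(1.0, False, q_target, gamma=0.5))
```

Because selection and evaluation use different networks, the Double DQN target here is lower than the standard DQN target, which always takes the maximum of the target network's estimates.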

In [4]:
class QFunction(torch.nn.Module):

    def __init__(self, obs_size, n_actions):
        super().__init__()
        self.l1 = torch.nn.Linear(obs_size, 50)
        self.l2 = torch.nn.Linear(50, 50)
        self.l3 = torch.nn.Linear(50, n_actions)

    def forward(self, x):
        h = x
        h = torch.nn.functional.relu(self.l1(h))
        h = torch.nn.functional.relu(self.l2(h))
        h = self.l3(h)
        return pfrl.action_value.DiscreteActionValue(h)

obs_size = env.observation_space.low.size
n_actions = env.action_space.n
q_func = QFunction(obs_size, n_actions)
print(q_func)

QFunction(
  (l1): Linear(in_features=8, out_features=50, bias=True)
  (l2): Linear(in_features=50, out_features=50, bias=True)
  (l3): Linear(in_features=50, out_features=4, bias=True)
)


It is also possible to define the same model using `torch.nn.Sequential`. `pfrl.q_functions.DiscreteActionValueHead` is just a `torch.nn.Module` that packs its input to `pfrl.action_value.DiscreteActionValue`.

In [5]:
q_func2 = torch.nn.Sequential(
    torch.nn.Linear(obs_size, 50),
    torch.nn.ReLU(),
    torch.nn.Linear(50, 50),
    torch.nn.ReLU(),
    torch.nn.Linear(50, n_actions),
    pfrl.q_functions.DiscreteActionValueHead(),
)
print(q_func2)

Sequential(
  (0): Linear(in_features=8, out_features=50, bias=True)
  (1): ReLU()
  (2): Linear(in_features=50, out_features=50, bias=True)
  (3): ReLU()
  (4): Linear(in_features=50, out_features=4, bias=True)
  (5): DiscreteActionValueHead()
)


As usual in PyTorch, `torch.optim.Optimizer` is used to optimize a model.

In [6]:
# Use Adam to optimize q_func. eps=1e-2 is for stability.
optimizer = torch.optim.Adam(q_func.parameters(), eps=1e-2)

To create a DoubleDQN agent with the Q-function and optimizer, you need to specify a few more parameters and settings.

In [14]:
# Set the discount factor that discounts future rewards.
gamma = 0.9

# Use epsilon-greedy for exploration
explorer = pfrl.explorers.ConstantEpsilonGreedy(
    epsilon=0.9, random_action_func=env.action_space.sample)

# DQN uses Experience Replay.
# Specify a replay buffer and its capacity.
replay_buffer = pfrl.replay_buffers.ReplayBuffer(capacity=10 ** 6)

# PyTorch only accepts numpy.float32 by default, so specify
# a converter as a feature extractor function phi.
# LunarLander-v2 observations are already float32, so with
# copy=False this cast returns them unchanged.
phi = lambda x: x.astype(numpy.float32, copy=False)

# Set the device id to use GPU. To use CPU only, set it to -1.
gpu = -1

# Now create an agent that will interact with the environment.
agent = pfrl.agents.DoubleDQN(
    q_func,
    optimizer,
    replay_buffer,
    gamma,
    explorer,
    replay_start_size=500,
    update_interval=1,
    target_update_interval=100,
    phi=phi,
    gpu=gpu,
)
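
The epsilon-greedy rule used by the explorer can be sketched in a few lines. This is a plain-Python illustration, not PFRL's `ConstantEpsilonGreedy` class: `epsilon_greedy` is a hypothetical helper and the Q-values are made up.

```python
import random


def epsilon_greedy(q_values, epsilon, random_action_func, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return random_action_func()
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)
q_values = [0.1, 0.7, 0.2, 0.0]  # made-up Q-values; action 1 is greedy
sample_action = lambda: rng.randrange(len(q_values))

# With epsilon=0.9, as configured above, most choices are random.
picks = [epsilon_greedy(q_values, 0.9, sample_action, rng) for _ in range(10000)]
greedy_share = picks.count(1) / len(picks)
print("share of greedy action:", round(greedy_share, 2))
```

With four actions and `epsilon=0.9`, the greedy action is expected about 0.1 + 0.9 / 4 = 32.5% of the time, so behavior during training is mostly random under this setting.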

Now you have an agent and an environment. It's time to start reinforcement learning!

During training, two methods of `agent` must be called: `agent.act` and `agent.observe`. `agent.act(obs)` takes the current observation as input and returns an exploratory action. Once the returned action is processed in the env, `agent.observe(obs, reward, done, reset)` then observes the consequences:
- `obs`: next observation.
- `reward`: an immediate reward.
- `done`: a boolean value set to True if it reached a terminal state.
- `reset`: a boolean value set to True if an episode is interrupted at a non-terminal state, typically by a time limit.
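
The difference between `done` and `reset` matters for value estimates. The sketch below assumes the conventional one-step target, in which only a true terminal state removes the estimated future return; `td_target` and `episode_over` are hypothetical helpers, not PFRL functions.

```python
def td_target(reward, next_value, done, gamma=0.9):
    """One-step target: drop the future term only on a true terminal state."""
    return reward if done else reward + gamma * next_value


def episode_over(done, reset):
    """The rollout stops on either flag, as in `if done or reset: break`."""
    return done or reset


max_episode_len = 200
cases = [
    ("mid-episode", 50, False),
    ("terminal state", 120, True),
    ("time limit", 200, False),
]
for name, t, done in cases:
    reset = t == max_episode_len
    print(name, "| episode over:", episode_over(done, reset),
          "| target:", td_target(1.0, 10.0, done))
```

In the time-limit case the episode ends but the target still includes the next state's value, because the state itself was not terminal.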

Optionally, you can get training statistics of the agent via `agent.get_statistics`.
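
The statistics come back as a list of `(name, value)` pairs, so converting them to a dict makes individual entries easy to read. The values below are copied from the training log in this guide. The `expected_updates` helper is an assumption inferred from those logged numbers and the `replay_start_size=500`, `update_interval=1` settings, not taken from PFRL's source.

```python
# Statistics copied from the training log in this guide (after 50 episodes).
stats = [
    ("average_q", 7.6913347),
    ("average_loss", 1.2356057489663363),
    ("cumulative_steps", 5001),
    ("n_updates", 4502),
    ("rlen", 5001),
]
stats_by_name = dict(stats)


def expected_updates(cumulative_steps, replay_start_size=500):
    """Assumed update count with update_interval=1: one update per step
    once the replay buffer holds `replay_start_size` transitions."""
    return max(0, cumulative_steps - replay_start_size + 1)


print("logged n_updates:", stats_by_name["n_updates"])
print("expected n_updates:", expected_updates(stats_by_name["cumulative_steps"]))
```

The same relation also matches the later log lines (9703 steps with 9204 updates, 14621 steps with 14122 updates), which suggests that updates begin only after the first 500 transitions are stored.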

In [15]:
n_episodes = 500
max_episode_len = 200
for i in range(1, n_episodes + 1):
    obs = env.reset()
    R = 0  # return (sum of rewards)
    t = 0  # time step
    while True:
        # Render the current state (comment out to skip rendering)
        env.render()
        action = agent.act(obs) # get action from agent
        obs, reward, done, _ = env.step(action)
        R += reward
        t += 1
        reset = t == max_episode_len
        agent.observe(obs, reward, done, reset)
        if done or reset:
            break
    if i % 10 == 0:
        print('episode:', i, 'R:', R)
    if i % 50 == 0:
        print('statistics:', agent.get_statistics())
print('Finished.')

episode: 10 R: -208.6467574866477
episode: 20 R: -60.280405721230295
episode: 30 R: -54.04944536978143
episode: 40 R: -105.37362578527232
episode: 50 R: -130.17861750394343
statistics: [('average_q', 7.6913347), ('average_loss', 1.2356057489663363), ('cumulative_steps', 5001), ('n_updates', 4502), ('rlen', 5001)]
episode: 60 R: -128.66155985780205
episode: 70 R: -82.82940042637729
episode: 80 R: -124.00411504387068
episode: 90 R: -226.67729293809012
episode: 100 R: -100.57957949732989
statistics: [('average_q', 6.9151406), ('average_loss', 1.211688529253006), ('cumulative_steps', 9703), ('n_updates', 9204), ('rlen', 9703)]
episode: 110 R: -63.689021538202795
episode: 120 R: -99.66151497290355
episode: 130 R: -351.31177513104194
episode: 140 R: -144.82511078153817
episode: 150 R: -133.8786409235346
statistics: [('average_q', 6.9616637), ('average_loss', 1.0841694302856921), ('cumulative_steps', 14621), ('n_updates', 14122), ('rlen', 14621)]
episode: 160 R: -79.4738429380318
episode: 170

Now you have finished training the DoubleDQN agent for 500 episodes. How good is the agent now? You can evaluate it inside a `with agent.eval_mode():` block, where exploration such as epsilon-greedy is disabled.

In [16]:
with agent.eval_mode():
    for i in range(30):
        obs = env.reset()
        R = 0
        t = 0
        while True:
            # Uncomment to watch the behavior in a GUI window
            # env.render()
            action = agent.act(obs)
            obs, r, done, _ = env.step(action)
            R += r
            t += 1
            reset = t == 200
            agent.observe(obs, r, done, reset)
            if done or reset:
                break
        print('evaluation episode:', i, 'R:', R)

evaluation episode: 0 R: 21.72548282013763
evaluation episode: 1 R: -13.602197955014
evaluation episode: 2 R: 30.536389251521165
evaluation episode: 3 R: -21.042082589149402
evaluation episode: 4 R: -14.657867622873685
evaluation episode: 5 R: -33.765550912436005
evaluation episode: 6 R: -12.57935073561571
evaluation episode: 7 R: 4.607229984455572
evaluation episode: 8 R: 14.008810468457241
evaluation episode: 9 R: 29.0809485286108
evaluation episode: 10 R: 14.088986331151261
evaluation episode: 11 R: 30.539620655656712
evaluation episode: 12 R: -19.84500613344716
evaluation episode: 13 R: -7.872491361247681
evaluation episode: 14 R: -13.385482119079017
evaluation episode: 15 R: -2.226830308764944
evaluation episode: 16 R: -24.679977916786825
evaluation episode: 17 R: 31.014416404290525
evaluation episode: 18 R: -33.368999291343144
evaluation episode: 19 R: -3.3129068507523405
evaluation episode: 20 R: 8.183207309174357
evaluation episode: 21 R: 4.600054263659681
evaluation episode: 2

For your information, `LunarLander-v2` is conventionally considered solved at a return of 200. The evaluation returns above are far below that, so the agent needs more training. You can train it longer by running the training loop again.
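
To compare evaluation results against the 200 mark mentioned above, it helps to summarize the returns rather than read them one by one. This is a small sketch using only the standard library; the values are the first five evaluation returns from the output above, and `target_return` is a name chosen here for illustration.

```python
import statistics

# First five evaluation returns copied from the output above.
returns = [
    21.72548282013763,
    -13.602197955014,
    30.536389251521165,
    -21.042082589149402,
    -14.657867622873685,
]
target_return = 200  # the return of 200 discussed above

mean_return = statistics.mean(returns)
std_return = statistics.stdev(returns)
print(f"mean={mean_return:.2f} std={std_return:.2f} best={max(returns):.2f}")
print("mean reaches target:", mean_return >= target_return)
```

A mean over many evaluation episodes is a steadier signal than any single episode, since individual returns vary widely from run to run.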

If the results are good enough, the only remaining task is to save the agent so that you can reuse it. Simply call `agent.save` to save the agent and `agent.load` to load a saved agent.

In [17]:
env.display()
# Save an agent to the 'agent' directory
agent.save('agent')

# Uncomment to load an agent from the 'agent' directory
# agent.load('agent')

'openaigym.video.0.59.video000000.mp4'

'openaigym.video.0.59.video000050.mp4'

'openaigym.video.0.59.video000100.mp4'

'openaigym.video.0.59.video000150.mp4'

'openaigym.video.0.59.video000200.mp4'

'openaigym.video.0.59.video000250.mp4'

'openaigym.video.0.59.video000300.mp4'

'openaigym.video.0.59.video000350.mp4'

'openaigym.video.0.59.video000400.mp4'

'openaigym.video.0.59.video000450.mp4'

'openaigym.video.0.59.video000500.mp4'

'openaigym.video.0.59.video000550.mp4'

'openaigym.video.0.59.video000600.mp4'

'openaigym.video.0.59.video000650.mp4'

'openaigym.video.0.59.video000700.mp4'

'openaigym.video.0.59.video000750.mp4'

'openaigym.video.0.59.video000800.mp4'

'openaigym.video.0.59.video000850.mp4'

'openaigym.video.0.59.video000900.mp4'

'openaigym.video.0.59.video000950.mp4'

'openaigym.video.0.59.video001000.mp4'

'openaigym.video.0.59.video001050.mp4'

'openaigym.video.0.59.video001100.mp4'

'openaigym.video.0.59.video001150.mp4'

'openaigym.video.0.59.video001200.mp4'

'openaigym.video.0.59.video001250.mp4'

'openaigym.video.0.59.video001300.mp4'

'openaigym.video.0.59.video001350.mp4'

RL completed!

Writing code like this every time you use RL can be tedious, so PFRL provides utility functions that handle the training and evaluation loops for you.

In [None]:
# Set up the logger to print info messages for understandability.
import logging
import sys
logging.basicConfig(level=logging.INFO, stream=sys.stdout, format='')

pfrl.experiments.train_agent_with_evaluation(
    agent,
    env,
    steps=2000,           # Train the agent for 2000 steps
    eval_n_steps=None,       # We evaluate for episodes, not time
    eval_n_episodes=10,       # 10 episodes are sampled for each evaluation
    train_max_episode_len=200,  # Maximum length of each episode
    eval_interval=1000,   # Evaluate the agent after every 1000 steps
    outdir='result',      # Save everything to 'result' directory
)

outdir:result step:200 episode:0 R:41.37984804611819
statistics:[('average_q', 3.1913528), ('average_loss', 0.42760026663541795), ('cumulative_steps', 57400), ('n_updates', 56901), ('rlen', 57400)]
outdir:result step:400 episode:1 R:25.966680368432897
statistics:[('average_q', 3.4991524), ('average_loss', 0.49347428292036055), ('cumulative_steps', 57600), ('n_updates', 57101), ('rlen', 57600)]
outdir:result step:590 episode:2 R:40.047485579277776
statistics:[('average_q', 3.2654471), ('average_loss', 0.5446228332817554), ('cumulative_steps', 57790), ('n_updates', 57291), ('rlen', 57790)]
outdir:result step:790 episode:3 R:7.695951305545852
statistics:[('average_q', 4.0521207), ('average_loss', 0.4585756452381611), ('cumulative_steps', 57990), ('n_updates', 57491), ('rlen', 57990)]
outdir:result step:990 episode:4 R:53.80433045120746
statistics:[('average_q', 3.8925943), ('average_loss', 0.46209473460912703), ('cumulative_steps', 58190), ('n_updates', 57691), ('rlen', 58190)]
outdir:res

(<pfrl.agents.double_dqn.DoubleDQN at 0x7f33a080c850>,
 [{'average_loss': 0.5509733194112778,
   'average_q': 3.59798,
   'cumulative_steps': 58390,
   'eval_score': 23.968355982988914,
   'n_updates': 57891,
   'rlen': 58390},
  {'average_loss': 0.4764625625312328,
   'average_q': 3.3027246,
   'cumulative_steps': 59200,
   'eval_score': 9.045334810050294,
   'n_updates': 58701,
   'rlen': 59200}])

That concludes the PFRL quickstart guide. To learn more about PFRL, look into the `examples` directory and read and run the examples. Thank you!