# Basic Policy Gradients (REINFORCE) example in PyTorch

This notebook is adapted from the [PyTorch REINFORCE tutorial](https://github.com/pytorch/examples/blob/master/reinforcement_learning/reinforce.py).

**Lab exercise prepared by [Víctor Campos](https://imatge.upc.edu/web/people/victor-campos), and adapted by [Juan José Nieto](https://www.linkedin.com/in/juan-jose-nieto-salas/) and [Xavier Giro-i-Nieto](https://imatge.upc.edu/web/people/xavier-giro) for the [Postgraduate course in Artificial Intelligence with Deep Learning](https://www.talent.upc.edu/ing/estudis/formacio/curs/310400/postgrau-artificial-intelligence-deep-learning/) at [UPC School](https://www.talent.upc.edu/ing/) (2020).**

![Víctor Campos](https://imatge.upc.edu/web/sites/default/files/styles/medium/public/users/vcampos//photo.jpg?itok=eCtqXNX9)
![Juan José Nieto](https://media-exp1.licdn.com/dms/image/C5603AQHLMgUe1Jvx-g/profile-displayphoto-shrink_200_200/0?e=1593648000&v=beta&t=AhGwoXIDMNNQj_2P6pFbp7RwD39PhpmpOU8OvOdRqC4)
![Xavier Giro-i-Nieto](https://telecombcn-dl.github.io/2019-dlcv/img/instructors/XavierGiro-160x160.jpg)

## Installing dependencies

We will use OpenAI Gym, which might not be installed by default, to simulate the environment. We also need a few extra packages to render the simulations (installing them may take a while).

In [0]:
!pip install gym pyvirtualdisplay > /dev/null 2>&1
!apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1

## Setting up the environment

We will need some tricks to visualize the simulations in the browser, as simply calling `env.render()` will not work in this notebook ([source](https://colab.research.google.com/drive/1flu31ulJlgiRL1dnN2ir8wGh9p7Zij2t#scrollTo=Jyb2Ujuozfi2&forceEdit=true&offline=true&sandboxMode=true)).

In [0]:
import gym
import time
import numpy as np
import matplotlib.pyplot as plt

In [0]:
from gym import logger as gymlogger
from gym.wrappers import Monitor
gymlogger.set_level(40)  # 40 = ERROR: only log errors

import math
import glob
import io
import base64
from IPython.display import HTML
from IPython import display as ipythondisplay

In [0]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [0]:
print('PyTorch version: ', torch.__version__)

In [0]:
from pyvirtualdisplay import Display
display = Display(visible=0, size=(1400, 900))
display.start()

In [0]:
def show_video():
  """Embed the first recorded episode found in ./video in the notebook output."""
  mp4list = glob.glob('video/*.mp4')
  if len(mp4list) > 0:
    mp4 = mp4list[0]
    # Read the file as bytes and inline it as a base64 data URI
    with open(mp4, 'rb') as f:
      video = f.read()
    encoded = base64.b64encode(video)
    ipythondisplay.display(HTML(data='''<video alt="test" autoplay
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))
  else:
    print("Could not find video")

def wrap_env(env):
  """Wrap env so that its episodes are recorded to ./video."""
  env = Monitor(env, './video', force=True)
  return env

## Visualize a random policy in the environment

Our goal is to train an agent that solves the CartPole problem, in which a pole is attached to a cart moving along a horizontal track. The agent interacts with the environment by applying a force of +1 or -1 to the cart. An episode terminates when the pole tilts more than 12 degrees from vertical or the cart moves more than 2.4 units from the center of the track; `CartPole-v1` also cuts episodes off after 500 timesteps. The agent receives a reward of +1 for every timestep until the episode ends.

We can visualize what a random policy would do in this environment:

In [0]:
# Let's generate a random trajectory...
env = wrap_env(gym.make("CartPole-v1"))
ob, done, total_rew = env.reset(), False, 0
while not done:
  env.render()
  ac = env.action_space.sample()
  ob, rew, done, info = env.step(ac)
  total_rew += rew
print('Cumulative reward:', total_rew)
  
# ... and visualize it!
env.close()
show_video()

## Create the model

Now we will define our policy, parameterized by a feedforward neural network.

**Exercise #1.** Implement the policy as an MLP with one hidden layer of 128 units with ReLU activation, followed by a softmax output layer over the actions.
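
As a reminder of what a softmax output layer does, here is a minimal pure-Python sketch, independent of PyTorch. The helper name `softmax` and the example scores are illustrative assumptions, not part of the notebook; the sketch only shows how raw action scores become a probability distribution.

```python
import math

def softmax(scores):
    """Map a list of real-valued scores to probabilities that sum to 1."""
    # Subtracting the maximum score improves numerical stability
    # and does not change the result.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for the two CartPole actions
probs = softmax([2.0, 1.0])
print(probs)
```

The action with the larger score gets the larger probability, but every action keeps a non-zero chance of being sampled, which is what lets the policy explore.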

In [0]:
class Policy(nn.Module):
    def __init__(self, inputs, outputs):
        super(Policy, self).__init__()
        # TODO
        self.affine1 = nn.TODO
        self.affine2 = nn.TODO

        self.saved_log_probs = []
        self.rewards = []

    def forward(self, x):
        # TODO
        x = self.TODO
        x = F.TODO
        action_scores = self.TODO
        return F.softmax(action_scores, dim=1)

## Functions for collecting experience and updating the policy

**Exercise #2.** Use the policy to predict a probability distribution over actions.

**Exercise #3.** Compute the discounted returns from the rewards collected by the policy.

**Exercise #4.** Complete the loss computation using the returns and the log-probabilities of the sampled actions.
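
For reference, the discounted return at step t can be written recursively as G_t = r_t + gamma * G_{t+1}, which is why it is convenient to read the rewards backwards. The sketch below works through this on a toy reward list in plain Python. The helpers `discounted_returns` and `normalize` are hypothetical names used only for illustration, and the toy value `gamma=0.5` was picked to keep the arithmetic easy to follow.

```python
import statistics

def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_{T-1}] with G_t = r_t + gamma * G_{t+1}."""
    returns = []
    running = 0.0
    # Walk backwards so each return reuses the one computed for the next step
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns

def normalize(values, eps=1e-8):
    """Shift to zero mean and scale by the sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [(v - mean) / (std + eps) for v in values]

rewards = [1.0, 1.0, 1.0]   # CartPole gives +1 per timestep
returns = discounted_returns(rewards, gamma=0.5)
print(returns)              # [1.75, 1.5, 1.0]
print(normalize(returns))
```

Early steps receive larger returns than late ones because more reward follows them. After normalization, steps with above-average returns get positive weights and the others get negative weights.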

In [0]:
def select_action(policy, state):
    # Convert state into PyTorch tensor
    state = torch.from_numpy(state).float().unsqueeze(0)
    # TODO: Compute action probabilities
    probs = TODO
    # Sample action
    m = torch.distributions.Categorical(probs)
    action = m.sample()
    # Bookkeeping
    policy.saved_log_probs.append(m.log_prob(action))
    return action.item()


def train(policy, optimizer):
    # Note: relies on the globals `gamma` and `eps`, which are defined in later cells
    R = 0
    policy_loss = []
    returns = []
    # Compute the returns by reading the rewards vector backwards
    for r in policy.rewards[::-1]:
        # TODO: Complete the computation of the return using gamma
        R = r + TODO
        returns.insert(0, R)
    returns = torch.tensor(returns)
    # Normalize returns (this usually accelerates convergence)
    returns = (returns - returns.mean()) / (returns.std() + eps)
    for log_prob, R in zip(policy.saved_log_probs, returns):
        # TODO: Complete the 'loss' computation using the returns and the log probs.
        policy_loss.append(TODO)
    # Update policy:
    #  (1) reset optimizer grads
    optimizer.zero_grad()
    #  (2) sum the per-step terms into the surrogate policy gradient loss
    policy_loss = torch.cat(policy_loss).sum()
    #  (3) backpropagate and take an optimizer step
    policy_loss.backward()
    optimizer.step()
    # Clear the episode buffers before collecting the next episode
    del policy.rewards[:]
    del policy.saved_log_probs[:]
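
For reference, the textbook REINFORCE surrogate loss for one episode of length $T$ can be written as follows. This is the standard formulation expressed with the quantities used in `train`, given as a sketch rather than as the authors' exact solution:

$$
L(\theta) = -\sum_{t=0}^{T-1} \hat{G}_t \, \log \pi_\theta(a_t \mid s_t),
\qquad
G_t = \sum_{k=t}^{T-1} \gamma^{\,k-t} r_k,
\qquad
\hat{G}_t = \frac{G_t - \mu_G}{\sigma_G + \epsilon}
$$

Here $\mu_G$ and $\sigma_G$ are the mean and standard deviation of the returns within the episode, and $\epsilon$ is a small constant that avoids division by zero. Minimizing $L(\theta)$ by gradient descent raises the log-probability of actions followed by above-average returns and lowers it for the others.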

## Training the agent

In [0]:
# Hyperparameters
env_name = 'CartPole-v1'
gamma = 0.99  # discount factor
seed = 543  # random seed
log_interval = 10  # controls how often we log progress
max_ep_len = 1000  # maximum episode length (CartPole-v1 itself stops episodes after 500 steps)
num_episodes = 1500  # number of episodes to train on

In [0]:
# Create environment
env = gym.make(env_name)

# Fix random seed (for reproducibility)
env.seed(seed)
torch.manual_seed(seed)

In [0]:
# Create policy and optimizer
n_inputs = env.observation_space.shape[0]
n_actions = env.action_space.n
policy = Policy(n_inputs, n_actions)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
eps = np.finfo(np.float32).eps.item()  # small constant that avoids division by zero when normalizing returns


# Training loop
print("Target reward: {}".format(env.spec.reward_threshold))
running_reward = 10
ep_rew_history = []
for i_episode in range(num_episodes):
    # Collect experience
    state, ep_reward = env.reset(), 0
    for t in range(max_ep_len):  # Don't infinite loop while learning
        
        action = select_action(policy, state)
        state, reward, done, _ = env.step(action)
        policy.rewards.append(reward)
        ep_reward += reward
        if done:
            break

    # Update running reward
    running_reward = 0.05 * ep_reward + (1 - 0.05) * running_reward
    
    # Perform training step
    train(policy, optimizer)
    ep_rew_history.append((i_episode, ep_reward))
    if i_episode % log_interval == 0:
        print('Episode {}\tLast reward: {:.2f}\tAverage reward: {:.2f}'.format(
              i_episode, ep_reward, running_reward))
    if running_reward > env.spec.reward_threshold:
        print("Solved!")
        break

# t is a zero-based step index, so the last episode lasted t + 1 time steps
print("Finished training! Running reward is now {:.2f} and "
      "the last episode runs to {} time steps!".format(running_reward, t + 1))
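
The running reward tracked in the training loop is an exponential moving average with a smoothing factor of 0.05. The sketch below isolates that update in plain Python; the helper name `update_running_reward` and the sample episode rewards are illustrative assumptions.

```python
def update_running_reward(running_reward, ep_reward, alpha=0.05):
    """Move a fraction alpha of the way toward the newest episode reward."""
    return alpha * ep_reward + (1 - alpha) * running_reward

running = 10.0  # same initial value as in the training loop
for ep_reward in [20.0, 20.0, 20.0]:
    running = update_running_reward(running, ep_reward)
    print(running)
```

Because each episode contributes only 5% of the new value, the running reward lags behind the most recent episodes. The "Solved!" check therefore only triggers after the policy has performed well over many consecutive episodes, not after a single lucky one.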

In [0]:
plt.plot([x[0] for x in ep_rew_history], [x[1] for x in ep_rew_history])
plt.xlabel('Episode')
plt.ylabel('Reward')

## Visualize trained policy

In [0]:
test_env = wrap_env(gym.make(env_name))
state, ep_reward, done = test_env.reset(), 0, False
while not done:
    test_env.render()
    action = select_action(policy, state)
    state, reward, done, _ = test_env.step(action)
    ep_reward += reward
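print("Cumulative reward: {}".format(ep_reward))

# select_action stores log-probs for training; discard them after evaluation
del policy.saved_log_probs[:]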

test_env.close()
show_video()