In [1]:
!pip install box2d-py



# REINFORCE with a Learned Baseline

In this tutorial we will train an agent to play `LunarLander-v2` using REINFORCE with a learned baseline.

## Introduction

Previously we implemented the REINFORCE algorithm and discussed how introducing a baseline reduces variance and improves performance. We noted that the value function $v$ would be an ideal baseline (if we knew it) and used the average return as a rough approximation. In this tutorial we will try to learn the value function so that we can use it as a baseline in REINFORCE. Like the policy, we will parametrise the value function using a simple neural network with parameters $\omega$. Recall that the policy gradient theorem gives us
\begin{align}
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta}\left[\sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t | s_t) \left(G_t - v_\omega(s_t) \right) \right], \, \text{ where } \,
G_t = \sum_{t'=t}^T \gamma^{t' - t} r(s_{t'}, a_{t'}).
\end{align}
Here we have used our parametrised value function $v_\omega$ as the baseline. **But how do we learn $v_\omega$?** We can use Monte Carlo learning: since we can calculate the discounted returns $G_t$, we can regress $v_\omega(s_t)$ onto them by minimizing $\frac{1}{2}\sum_{t=0}^T|G_t - v_\omega(s_t)|^2$.
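
To see why subtracting a baseline does not bias the gradient, here is a short derivation. It is a sketch that assumes a discrete action space and an arbitrary baseline $b(s)$ that depends on the state but not on the action:
\begin{align}
\mathbb{E}_{a \sim \pi_\theta(\cdot | s)}\left[\nabla_\theta \log \pi_\theta(a | s)\, b(s)\right]
&= b(s) \sum_a \pi_\theta(a | s)\, \nabla_\theta \log \pi_\theta(a | s) \\
&= b(s) \sum_a \nabla_\theta \pi_\theta(a | s) \\
&= b(s)\, \nabla_\theta \sum_a \pi_\theta(a | s) = b(s)\, \nabla_\theta 1 = 0.
\end{align}
Choosing $b(s_t) = v_\omega(s_t)$ therefore changes only the variance of the estimate, not its expectation.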

**REINFORCE with a Learned Baseline**:
1. sample a trajectory $\tau = (s_0, a_0, r_1, s_1, \ldots, s_{T}, r_{T})$ using the policy $\pi_\theta$.
2. compute the vector of returns $[G_0, G_1, \ldots, G_T]$.
3. compute the policy gradient $\nabla_\theta J(\theta) \approx \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t | s_t) \left(G_t - v_\omega(s_t) \right)$.
4. compute the gradient of the value function loss $\nabla_\omega \mathcal{L}(\omega) = \nabla_\omega \frac{1}{2}\sum_{t=0}^T|G_t - v_\omega(s_t)|^2$.
5. update policy parameters $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$.
6. update the value function parameters $\omega \leftarrow \omega - \alpha \nabla_\omega \mathcal{L}(\omega)$.
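
The effect of the baseline in step 3 can be checked by hand on a tiny problem. The sketch below is not part of the tutorial's training code: it assumes a hypothetical one-step task with two actions, a uniform softmax policy, and made-up returns of 10 and 11. Because there are only two actions, the mean and variance of the gradient estimate can be computed exactly by enumeration rather than by sampling.

```python
# Toy one-step problem (hypothetical numbers): two actions, uniform policy.
probs = [0.5, 0.5]      # pi(a) for a = 0, 1
rewards = [10.0, 11.0]  # return G obtained after each action


def grad_samples(baseline):
    """Score-function estimate of dJ/d(logit_0) for each possible action.

    For a softmax policy, d log pi(a) / d logit_0 = 1[a == 0] - pi(0).
    """
    return [((1.0 if a == 0 else 0.0) - probs[0]) * (rewards[a] - baseline)
            for a in range(2)]


def mean_and_variance(samples):
    """Exact mean and variance under the policy (no sampling noise)."""
    mean = sum(p * x for p, x in zip(probs, samples))
    var = sum(p * (x - mean) ** 2 for p, x in zip(probs, samples))
    return mean, var


# Baseline 0 is plain REINFORCE; the second baseline is the true state value.
value = sum(p * r for p, r in zip(probs, rewards))
mean_plain, var_plain = mean_and_variance(grad_samples(0.0))
mean_base, var_base = mean_and_variance(grad_samples(value))

print(mean_plain, var_plain)
print(mean_base, var_base)
```

Both estimators have the same mean, which is the true gradient, but in this toy case the variance drops from 27.5625 to 0 once the state value is subtracted.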

In [2]:
import gym
import numpy as np
from collections import deque
import matplotlib.pyplot as plt
%matplotlib inline
import random

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.distributions import Categorical 

In [3]:
# use gpu if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print('device:', device)

# configure matplotlib
%matplotlib inline
plt.rcParams['figure.figsize'] = (15.0, 10.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

device: cpu


# The Environment

In `LunarLander-v2` our goal is to land a space shuttle on the moon. We have 4 actions: fire the left thruster, fire the right thruster, fire the main thruster, or do nothing. The state consists of the shuttle's position ((x, y) coordinates), its velocity, its tilt angle, its angular velocity, and 2 boolean flags indicating whether the left and right legs of the shuttle are in contact with the ground. For more details refer to the OpenAI Gym GitHub wiki page ([link](https://github.com/openai/gym/wiki/Leaderboard#lunarlander-v2)).

In [4]:
env = gym.make('LunarLander-v2')
print('Environment:', 'LunarLander-v2')
print('\t','action space:', env.action_space)
print('\t','observation space:', env.observation_space)

Environment: LunarLander-v2
	 action space: Discrete(4)
	 observation space: Box(8,)
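
To make the 8-dimensional observation easier to read, it can help to give each component a name. The sketch below assumes the components appear in the order listed above, with each of position and velocity split into x and y parts. The field names and the sample vector are made up for illustration and are not part of the gym API.

```python
from collections import namedtuple

# Field names are illustrative labels, not part of the gym API.
LanderState = namedtuple(
    "LanderState",
    ["x", "y", "vx", "vy", "angle", "angular_velocity",
     "left_leg_contact", "right_leg_contact"],
)

# A made-up observation vector with the same length as Box(8,).
observation = [0.01, 1.40, -0.20, -0.50, 0.05, 0.10, 0.0, 1.0]
named_state = LanderState(*observation)

print(named_state.y, bool(named_state.right_leg_contact))
```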


### Watching a Random Policy In Action

Let's see how a random policy performs in this environment:

In [None]:
env_1 = gym.make('LunarLander-v2')
state = env_1.reset()
for t in range(500):
    # sample a random action
    action = env_1.action_space.sample()
    env_1.render()
    state, reward, done, _ = env_1.step(action)
    if done:
        state = env_1.reset()
env_1.close()
del env_1

That's some awful flying! We'll try to do better with REINFORCE now.

## The Policy and Value Function Networks

First, let's define our policy and value function. 

For efficiency we can use a single network with two heads. For a given state, the first head will output a Categorical distribution over the actions while the second head will return the value of the state.

![two headed network](https://i.imgur.com/Z4Fq3cO.png)

In [6]:
class Net(nn.Module):
    def __init__(self, s_size=8, h_size=128, a_size=4):
        super(Net, self).__init__()
        # The first layer should be a shared linear layer with
        # an input size of env.observation_space.shape[0] (8) and an output size of 128
        self.fc_shared = nn.Linear(s_size, h_size)
        # The policy head should be a linear layer with input size of 128 
        # and an output size of env.action_space.n
        self.fc_policy = nn.Linear(h_size, a_size)
        # The value function head should be a linear layer with input size of 128
        # and an output size of 1
        self.fc_value_function = nn.Linear(h_size, 1)
        
    def forward(self, x):
        # Define the forward pass
        # apply a ReLU activation after the shared layer
        x = F.relu(self.fc_shared(x))
        # apply the policy head layer (without an activation).
        logits = self.fc_policy(x)
        # apply the value function head layer (without an activation)
        value = self.fc_value_function(x)
        # define a Categorical distribution over the actions
        dist = Categorical(logits=logits)
        return dist, value
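
As a rough illustration of what sharing the first layer saves, the sketch below counts the parameters of `Net` with its default sizes and compares that with two separate networks of the same shape. The helper `linear_params` is introduced here for illustration only.

```python
def linear_params(n_in, n_out):
    """Number of weights plus biases in a fully connected layer."""
    return n_in * n_out + n_out


s_size, h_size, a_size = 8, 128, 4   # default sizes of Net

shared = linear_params(s_size, h_size)        # 8*128 + 128
policy_head = linear_params(h_size, a_size)   # 128*4 + 4
value_head = linear_params(h_size, 1)         # 128*1 + 1

total = shared + policy_head + value_head
# two independent networks would each need their own first layer
separate = 2 * shared + policy_head + value_head
print(shared, policy_head, value_head, total, separate)
```

With these sizes the two-headed network has 1797 parameters, against 2949 for two separate networks.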

### How do we use the net?

For a given state our network returns a tuple consisting of a PyTorch `Categorical` object and a PyTorch `Tensor`. To recap, we can use `sample` to sample an action from the distribution and `log_prob` to find the log probability of a particular action.

In [7]:
net = Net().to(device)
state = env.reset()
state = torch.from_numpy(state).float().to(device)
dist, value = net(state)
action = dist.sample()
print('Sampled action: ', action.item())
print('Log probability of action: ', dist.log_prob(action).item())
print('Estimated value of the state: ', value.item())

Sampled action:  2
Log probability of action:  -1.1634083986282349
Estimated value of the state:  0.12388109415769577
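
For intuition about what a categorical distribution built from logits computes, here is a plain-Python sketch of the same idea: a log-softmax turns the logits into log probabilities, and an action is drawn in proportion to the resulting probabilities. The logits are made-up numbers and the helper `log_softmax` is written here for illustration; it is not the PyTorch implementation.

```python
import math
import random


def log_softmax(logits):
    """Log probabilities of a categorical distribution given raw logits."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - log_z for l in logits]


logits = [0.2, -0.1, 0.4, 0.0]   # made-up policy-head outputs for 4 actions
log_probs = log_softmax(logits)
probs = [math.exp(lp) for lp in log_probs]

rng = random.Random(0)
# draw one action index in proportion to probs (the role of dist.sample())
action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
# look up its log probability (the role of dist.log_prob(action))
print(action, log_probs[action])
```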


## Computing the Return

Given a sequence of rewards, compute the vector of discounted returns $[G_0, G_1, \ldots, G_T]$. Note that we also use the trick of 'normalizing' the returns, i.e. we subtract the mean and divide by the standard deviation.

In [8]:
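def compute_returns(rewards, gamma, eps=1e-8):
    # accumulate G_t = r_t + gamma * G_{t+1}, working backwards from the last step
    R = 0.0
    returns = []
    for step in reversed(range(len(rewards))):
        R = rewards[step] + gamma * R
        returns.append(R)
    returns = np.array(returns[::-1], dtype=np.float64)
    # normalize to zero mean and unit standard deviation
    # (eps avoids division by zero when all returns are equal)
    returns -= returns.mean()
    returns /= returns.std() + eps
    return returns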
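
A small worked example may help here. The sketch below uses plain Python and a made-up reward sequence `[1, 1, 1]` with `gamma = 0.5`; the function name `discounted_returns` is chosen for this illustration. It applies the same backward recursion $G_t = r_t + \gamma G_{t+1}$ and then the same normalisation.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_T] with G_t = r_t + gamma * G_{t+1}."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


toy_returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
print(toy_returns)  # [1.75, 1.5, 1.0]

# Normalisation: zero mean and unit (population) standard deviation.
mean = sum(toy_returns) / len(toy_returns)
std = (sum((g - mean) ** 2 for g in toy_returns) / len(toy_returns)) ** 0.5
normalised = [(g - mean) / std for g in toy_returns]
print(normalised)
```

The last step has return 1, the middle step 1 + 0.5 * 1 = 1.5, and the first step 1 + 0.5 * 1.5 = 1.75.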

## REINFORCE with a Learned Baseline

1. sample a trajectory $\tau = (s_0, a_0, r_1, s_1, \ldots, s_{T}, r_{T})$ using the policy $\pi_\theta$.
2. compute the vector of returns $[G_0, G_1, \ldots, G_T]$.
3. compute the policy gradient $\nabla_\theta J(\theta) \approx \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t | s_t) \left(G_t - v_\omega(s_t) \right)$.
4. compute the gradient of the value function loss $\nabla_\omega \mathcal{L}(\omega) = \nabla_\omega \frac{1}{2}\sum_{t=0}^T|G_t - v_\omega(s_t)|^2$.
5. update policy parameters $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$.
6. update the value function parameters $\omega \leftarrow \omega - \alpha \nabla_\omega \mathcal{L}(\omega)$.

In practice we combine steps 3. and 4. by defining the composite loss:
\begin{align}
\texttt{loss} = -\sum_{t=0}^T \log \pi_\theta(a_t | s_t) \left(G_t - \text{detach}(v_\omega(s_t)) \right) + \frac{1}{2}\sum_{t=0}^T|G_t - v_\omega(s_t)|^2.
\end{align}
The policy term carries a minus sign because the optimizer minimizes the loss, so descending on $-J(\theta)$ is the same as ascending on $J(\theta)$. This also allows us to perform the policy and value function updates in a single step.
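
To check that a single backward pass reproduces steps 3 and 4, one can differentiate the composite loss term by term. The sketch below takes the policy term with a negative sign, as `policy_loss` does in the training code, and writes $\delta_t = G_t - v_\omega(s_t)$:
\begin{align}
\nabla_\theta \texttt{loss} &= -\sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t | s_t)\, \text{detach}(\delta_t) \approx -\nabla_\theta J(\theta), \\
\nabla_\omega \texttt{loss} &= \nabla_\omega \frac{1}{2}\sum_{t=0}^T |G_t - v_\omega(s_t)|^2 = -\sum_{t=0}^T \delta_t\, \nabla_\omega v_\omega(s_t) = \nabla_\omega \mathcal{L}(\omega).
\end{align}
The detach is what keeps the first term from contributing to the value head's gradient. Because the two heads share their first layer in this tutorial, the weights of that layer receive the sum of both gradients.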

![algor](https://i.imgur.com/c2sE5Z9.png)

In [9]:
# define some hyperparameters
gamma = 0.99
lr = 0.02
seed = 401
number_episodes = 1250

In [10]:
def reinforce_learned_baseline(seed):
    env = gym.make('LunarLander-v2')
    
    # set random seeds (for reproducibility)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    env.seed(seed)
    random.seed(seed)

    # instantiate the policy and optimizer
    net = Net().to(device)
    optimizer = optim.Adam(net.parameters(), lr=lr)

    scores = []
    scores_deque = deque(maxlen=50)
    for episode in range(1, number_episodes+1):
        ##################################################################
        # 1. Collect a trajectory using our policy and save the rewards, #
        # log probabilities, and the estimated value of each state.      #
        ##################################################################
        log_probs = []
        values = []
        rewards = []

        state = env.reset()
        for t in range(1000):
            # convert state to a torch Tensor
            state = torch.from_numpy(state).float().to(device)
            # get the distribution over actions and the estimated value of state
            dist, value = net(state)

            # sample an action from the distribution
            action = dist.sample()
            
            # compute the log probability
            log_prob = dist.log_prob(action)
            
            # take a step in the environment
            state, reward, done, _ = env.step(action.item())

            # save the reward, log probability, and value
            rewards.append(reward)
            log_probs.append(log_prob.unsqueeze(0))
            values.append(value)

            if done:
                break
    
        # for reporting save the score
        scores.append(sum(rewards))
        scores_deque.append(sum(rewards))

        ##################################################################
        # 2. Compute the vector of discounted returns                    #
        ##################################################################
        returns = compute_returns(rewards, gamma)
        returns = torch.from_numpy(returns).float().to(device)

        ##################################################################
        # 3. and 4. Compute the loss for gradient descent                #
        ##################################################################
        values = torch.cat(values)
        log_probs = torch.cat(log_probs)

        # compute the difference between the returns and the values
        delta = returns - values

        # compute the policy loss term. multiply the log probabilities by delta and sum
        # (remember to call .detach() on delta since we do not want the gradient to propagate
        # to the value function network here)
        policy_loss = -torch.sum(log_probs*delta.detach())

        # compute the value function loss term
        value_function_loss = 0.5*torch.sum(delta**2)

        # compute the composite loss
        loss = policy_loss + value_function_loss

        #################################################################
        # 5. and 6. Update the policy and value function parameters     #
        #################################################################
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if episode % 50 == 0:
            print('Episode {}\tAverage Score: {:.2f}'.format(episode, np.mean(scores_deque)))

    return net, scores

In [11]:
net, scores = reinforce_learned_baseline(seed)

Episode 50	Average Score: -162.77
Episode 100	Average Score: -118.39
Episode 150	Average Score: -150.96
Episode 200	Average Score: -81.04
Episode 250	Average Score: -82.67
Episode 300	Average Score: 1.10
Episode 350	Average Score: 64.52
Episode 400	Average Score: 17.90
Episode 450	Average Score: 58.05
Episode 500	Average Score: 30.90
Episode 550	Average Score: 96.95
Episode 600	Average Score: 103.18
Episode 650	Average Score: 113.35
Episode 700	Average Score: 96.09
Episode 750	Average Score: 118.01
Episode 800	Average Score: 130.27
Episode 850	Average Score: 137.48
Episode 900	Average Score: 142.25
Episode 950	Average Score: 136.40
Episode 1000	Average Score: 133.01
Episode 1050	Average Score: 130.14
Episode 1100	Average Score: 128.77
Episode 1150	Average Score: 140.36
Episode 1200	Average Score: 146.88
Episode 1250	Average Score: 153.03
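
The "Average Score" printed above is the mean over the 50 most recent episodes, which the training loop keeps in `scores_deque`. The sketch below shows the same rolling-mean mechanism on a made-up list of scores with a window of 3; the function name `rolling_means` is chosen for this illustration.

```python
from collections import deque


def rolling_means(scores, window):
    """Mean of the last `window` scores after each episode."""
    recent = deque(maxlen=window)   # the oldest score drops out automatically
    means = []
    for score in scores:
        recent.append(score)
        means.append(sum(recent) / len(recent))
    return means


toy_scores = [-100.0, -50.0, 0.0, 50.0, 100.0]   # made-up episode scores
print(rolling_means(toy_scores, window=3))
# [-100.0, -75.0, -50.0, 0.0, 50.0]
```

Applying the same function to the returned `scores` list with `window=50` would give a smoothed learning curve that could be passed to `plt.plot`.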


## Watching Our Agent in Action

Finally, let's see how our agent performs in the `LunarLander` environment.

In [None]:
env_1 = gym.make('LunarLander-v2')
state = env_1.reset()
for t in range(2000):
    state = torch.from_numpy(state).float().to(device)
    # no gradients are needed when we only act with the trained network
    with torch.no_grad():
        dist, value = net(state)
    action = dist.sample().item()
    env_1.render()
    state, reward, done, _ = env_1.step(action)
    if done:
        state = env_1.reset()
env_1.close()
del env_1