# REINFORCE with PyTorch 

This is an implementation exercise in applying a policy gradient method, REINFORCE, with PyTorch. The challenge is the "PixelcopterEnv" environment, which we solve with a policy-based method.

# Coding the Reinforce algorithm from scratch

The goal is for your agent to achieve a return of >= 5 on PixelCopter.

## Installing dependencies for virtual display 

In [None]:
%%capture
!apt install python-opengl
!apt install ffmpeg
!apt install xvfb
!pip install pyvirtualdisplay
!pip install pyglet==1.5.1

In [None]:
# Virtual display
from pyvirtualdisplay import Display

virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()

## Install the dependencies

In [None]:
!pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit4/requirements-unit4.txt

## Import the packages 
In addition to importing the installed libraries, we also import:

- `imageio`: A library that will help us to generate a replay video



In [None]:
import numpy as np

from collections import deque

import matplotlib.pyplot as plt
%matplotlib inline

# PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.distributions import Categorical

# Gym
import gym
import gym_pygame

# Hugging Face Hub
from huggingface_hub import notebook_login # To log to our Hugging Face account to be able to upload models to the Hub.
import imageio

## Check if we have a GPU

- Let's check whether a GPU is available
- If so, the `print(device)` cell below should show `cuda:0`; otherwise it shows `cpu`

In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

In [None]:
print(device)

We're now ready to implement our Reinforce algorithm 🔥

## PixelCopter 🚁

💡 A good habit when you start to use an environment is to check its documentation 

- [The Environment documentation](https://pygame-learning-environment.readthedocs.io/en/latest/user/games/pixelcopter.html)


# Q1: Let's see what the Environment looks like (10 pts)

In [None]:
env_id = "Pixelcopter-PLE-v0"
env = gym.make(env_id)
eval_env = gym.make(env_id)
s_size = env.observation_space.shape[0]
a_size = env.action_space.n

In [None]:
print("_____OBSERVATION SPACE_____ \n")
print("The State Space is: ", s_size)
print("Sample observation", env.observation_space.sample()) # Get a random observation

In [None]:
print("\n _____ACTION SPACE_____ \n")
print("The Action Space is: ", a_size)
print("Action Space Sample", env.action_space.sample()) # Take a random action

## Q1.1 What are the possible actions and observations ? (7 pts)

## Write your answer here

## Q1.2 What is the terminal state for this environment ? (3 pts)

## Write your answer here

# Q2 Defining the Policy with Neural Networks (10 pts)

## Q2.1 Fill in the missing portions of this code marked by "Code Here" (10 pts)

In [None]:
class Policy(nn.Module):
    def __init__(self, s_size, a_size, h_size):
        super(Policy, self).__init__()
        # Define the three layers here
        # Code Here

    def forward(self, x):
        # Define the forward process here (with ReLU activation for the first 2 layers)
        # x -> fc1 -> ReLU -> fc2 -> ReLU -> fc3 
        # Code Here

        # We output the softmax
        return F.softmax(x, dim=1)
    
    def act(self, state):
        state = torch.from_numpy(state).float().unsqueeze(0).to(device)
        probs = self.forward(state).cpu()
        m = Categorical(probs)
        action = m.sample()
        return action.item(), m.log_prob(action)
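
To make the `act` method more concrete, here is a minimal sketch of what `Categorical` does with a softmax output. The probabilities are made-up values for a hypothetical two-action policy, not the output of a trained network.

```python
import torch
from torch.distributions import Categorical

# Hypothetical softmax output for a batch of one state and two actions.
probs = torch.tensor([[0.25, 0.75]])

m = Categorical(probs)
action = m.sample()            # tensor of shape [1], holding 0 or 1
log_prob = m.log_prob(action)  # log of the probability of the sampled action

# The batch dimension added by unsqueeze(0) is kept, so log_prob has shape [1].
print(action.item(), log_prob.shape)
```

Because each stored log-probability is a one-element tensor, `torch.cat` can join them into a single vector in the training loop.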

### Reinforce algorithm Pseudocode

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/pg_pseudocode.png" alt="Policy gradient pseudocode"/>

# Q3 Defining the Reinforce algorithm and Training (55 pts)

## Q3.1 Fill the missing code (marked with "Code Here") for the Reinforce algorithm (40 pts)

Note: There are 4 spots where you need to edit the code.

In [None]:
def reinforce(policy, optimizer, n_training_episodes, max_t, gamma, print_every):
    # Help us to calculate the score during the training
    scores_deque = deque(maxlen=100)
    scores = []
    # Line 3 of pseudocode
    for i_episode in range(1, n_training_episodes+1):
        saved_log_probs = []
        rewards = []
        state = # Code Here: reset the environment
        # Line 4 of pseudocode
        for t in range(max_t):
            action, log_prob = # Code Here: get the action
            saved_log_probs.append(log_prob)
            state, reward, done, _ = # Code Here: take an env step
            rewards.append(reward)
            if done:
                break 
        scores_deque.append(sum(rewards))
        scores.append(sum(rewards))
        
        # Line 6 of pseudocode: calculate the return
        returns = deque(maxlen=max_t) 
        n_steps = len(rewards) 
        
        # Compute the discounted return at each timestep:
        # G_t = reward at time t + gamma * G_(t+1), the discounted return of the next timestep
        
        ## We compute this from the last timestep to the first, so each return reuses the one after it
        
        ## deque.appendleft() inserts at position 0 in O(1) time,
        ## whereas inserting at the front of a list is O(n)
        
        for t in range(n_steps)[::-1]:
            disc_return_t = (returns[0] if len(returns)>0 else 0)
            returns.appendleft(    ) # Code Here: complete here        
       
        ## standardization for training stability
        eps = np.finfo(np.float32).eps.item()
        
        ## eps is added to the standard deviation of the returns to avoid numerical instabilities
        returns = torch.tensor(returns)
        returns = (returns - returns.mean()) / (returns.std() + eps)
        
        # Line 7:
        policy_loss = []
        for log_prob, disc_return in zip(saved_log_probs, returns):
            policy_loss.append(-log_prob * disc_return)
        policy_loss = torch.cat(policy_loss).sum()
        
        # Line 8: PyTorch prefers gradient descent 
        optimizer.zero_grad()
        policy_loss.backward()
        optimizer.step()
        
        if i_episode % print_every == 0:
            print('Episode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_deque)))
        
    return scores
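
The backward computation of the returns can be illustrated on its own, outside the training loop. The following is a minimal sketch, assuming the recursion G_t = r_t + gamma * G_(t+1); the helper name `discounted_returns` and the toy rewards are hypothetical and only serve the illustration.

```python
from collections import deque

def discounted_returns(rewards, gamma):
    """Return G_t for every timestep, using G_t = r_t + gamma * G_{t+1}."""
    returns = deque()
    # Walk backwards so each return reuses the one computed just before it.
    for reward in reversed(rewards):
        next_return = returns[0] if returns else 0.0
        returns.appendleft(reward + gamma * next_return)
    return list(returns)

# Toy episode: three steps with reward 1 and a discount of 0.5.
print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

Each return is obtained from its successor with one multiplication and one addition, so the whole episode is processed in a single pass.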

### Defining the hyperparameters 

In [None]:
pixelcopter_hyperparameters = {
    "h_size": 64,
    "n_training_episodes": 10000,
    "n_evaluation_episodes": 10,
    "max_t": 10000,
    "gamma": 0.99,
    "lr": 1e-4,
    "env_id": env_id,
    "state_space": s_size,
    "action_space": a_size,
}
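
As a rough intuition for `gamma`: a reward received k steps in the future is weighted by gamma to the power k. The short sketch below shows how quickly that weight decays for the value used here, together with the common rule-of-thumb horizon 1 / (1 - gamma). The variable names are illustrative only.

```python
gamma = 0.99  # same value as in the hyperparameters above

# Weight applied to a reward received k steps in the future.
for k in (1, 10, 100, 500):
    print(f"k={k:>3}  weight={gamma ** k:.4f}")

# Rule-of-thumb horizon: the sum of all weights, 1 + gamma + gamma^2 + ...
effective_horizon = 1 / (1 - gamma)
print(round(effective_horizon))  # 100
```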

###  Train it
- We're now ready to train our agent 🔥.

In [None]:
# Create policy and place it to the device
torch.manual_seed(50) # Don't change this
pixelcopter_policy = Policy(pixelcopter_hyperparameters["state_space"], pixelcopter_hyperparameters["action_space"], pixelcopter_hyperparameters["h_size"]).to(device)
pixelcopter_optimizer = optim.Adam(pixelcopter_policy.parameters(), lr=pixelcopter_hyperparameters["lr"])

In [None]:
scores = reinforce(pixelcopter_policy,
                   pixelcopter_optimizer,
                   pixelcopter_hyperparameters["n_training_episodes"], 
                   pixelcopter_hyperparameters["max_t"],
                   pixelcopter_hyperparameters["gamma"], 
                   1000)
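
Per-episode scores in REINFORCE tend to be noisy, so a smoothed curve can make the training trend easier to read. The following is a small sketch of a moving average; the helper name `moving_average` and the toy scores are hypothetical and not part of the assignment code.

```python
def moving_average(values, window):
    """Average of each consecutive run of `window` values."""
    if window <= 0:
        raise ValueError("window must be positive")
    averages = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            # Drop the value that just left the window.
            running_sum -= values[i - window]
        if i >= window - 1:
            averages.append(running_sum / window)
    return averages

# Toy scores, not real training output.
toy_scores = [0, 2, 4, 6, 8]
print(moving_average(toy_scores, window=2))  # [1.0, 3.0, 5.0, 7.0]
```

With the list returned by `reinforce`, something like `plt.plot(moving_average(scores, 100))` would give a smoothed learning curve.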

## Q3.2 What can you learn from the training progress above ? (10 pts)

## Write your answer here

## Q3.3 Modify the hyperparameter h_size and comment on how the training is impacted by this change in h_size (5 pts)

We have added the two sizes we want you to experiment with. You are free to try other h_sizes and even to change other hyperparameters.

In [None]:
pixelcopter_hyperparameters_32 = {
    "h_size": 32,
    "n_training_episodes": 10000,
    "n_evaluation_episodes": 10,
    "max_t": 10000,
    "gamma": 0.99,
    "lr": 1e-4,
    "env_id": env_id,
    "state_space": s_size,
    "action_space": a_size,
}

In [None]:
# Create policy and place it to the device
torch.manual_seed(50) # Don't change this
pixelcopter_policy_32 = Policy(pixelcopter_hyperparameters_32["state_space"], pixelcopter_hyperparameters_32["action_space"], pixelcopter_hyperparameters_32["h_size"]).to(device)
pixelcopter_optimizer_32 = optim.Adam(pixelcopter_policy_32.parameters(), lr=pixelcopter_hyperparameters_32["lr"])

In [None]:
# Stored under its own name so the earlier `scores` list is not overwritten
scores_32 = reinforce(pixelcopter_policy_32,
                      pixelcopter_optimizer_32,
                      pixelcopter_hyperparameters_32["n_training_episodes"], 
                      pixelcopter_hyperparameters_32["max_t"],
                      pixelcopter_hyperparameters_32["gamma"], 
                      1000)

In [None]:
pixelcopter_hyperparameters_128 = {
    "h_size": 128,
    "n_training_episodes": 10000,
    "n_evaluation_episodes": 10,
    "max_t": 10000,
    "gamma": 0.99,
    "lr": 1e-4,
    "env_id": env_id,
    "state_space": s_size,
    "action_space": a_size,
}

In [None]:
# Create policy and place it to the device
torch.manual_seed(50) # Don't change this
pixelcopter_policy_128 = Policy(pixelcopter_hyperparameters_128["state_space"], pixelcopter_hyperparameters_128["action_space"], pixelcopter_hyperparameters_128["h_size"]).to(device)
pixelcopter_optimizer_128 = optim.Adam(pixelcopter_policy_128.parameters(), lr=pixelcopter_hyperparameters_128["lr"])

In [None]:
# Stored under its own name so the earlier score lists are not overwritten
scores_128 = reinforce(pixelcopter_policy_128,
                       pixelcopter_optimizer_128,
                       pixelcopter_hyperparameters_128["n_training_episodes"], 
                       pixelcopter_hyperparameters_128["max_t"],
                       pixelcopter_hyperparameters_128["gamma"], 
                       1000)
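
When comparing h_sizes, it can help to see how the number of trainable parameters grows. The sketch below assumes a three-layer network in which both hidden layers have `h_size` units (fc1: s to h, fc2: h to h, fc3: h to a); that layout is an assumption, since the layer sizes are left to the exercise. The state and action sizes are placeholders, and `mlp_parameter_count` is a hypothetical helper.

```python
def mlp_parameter_count(s_size, a_size, h_size):
    """Weights plus biases for fc1 (s->h), fc2 (h->h) and fc3 (h->a)."""
    fc1 = s_size * h_size + h_size
    fc2 = h_size * h_size + h_size
    fc3 = h_size * a_size + a_size
    return fc1 + fc2 + fc3

# Placeholder sizes for illustration; use s_size and a_size from the environment.
example_s_size, example_a_size = 8, 2
for h in (32, 64, 128):
    print(h, mlp_parameter_count(example_s_size, example_a_size, h))
```

Under this assumed layout the hidden-to-hidden layer dominates, so the parameter count grows roughly with the square of `h_size`.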

## Write your answer here

# Q4 Evaluation (25 pts)

## Evaluation method
- Here we have defined the evaluation method that we're going to use to test the Reinforce agent.

In [None]:
def evaluate_agent(env, max_steps, n_eval_episodes, policy):
  """
  Evaluate the agent for ``n_eval_episodes`` episodes and return the mean and standard deviation of the episode rewards.
  :param env: The evaluation environment
  :param max_steps: Maximum number of steps per episode
  :param n_eval_episodes: Number of episodes to evaluate the agent
  :param policy: The Reinforce agent
  """
  episode_rewards = []
  for episode in range(n_eval_episodes):
    state = env.reset()
    step = 0
    done = False
    total_rewards_ep = 0
    
    for step in range(max_steps):
      action, _ = policy.act(state)
      new_state, reward, done, info = env.step(action)
      total_rewards_ep += reward
        
      if done:
        break
      state = new_state
    episode_rewards.append(total_rewards_ep)
  mean_reward = np.mean(episode_rewards)
  std_reward = np.std(episode_rewards)

  return mean_reward, std_reward

In [None]:
evaluate_agent(eval_env, 
               pixelcopter_hyperparameters["max_t"], 
               pixelcopter_hyperparameters["n_evaluation_episodes"],
               pixelcopter_policy)
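
`evaluate_agent` returns a `(mean_reward, std_reward)` tuple. As a small sketch of what those two numbers represent, the snippet below computes them for a made-up list of episode rewards using only the standard library. The reward values are invented for illustration.

```python
import statistics

# Toy per-episode rewards, not real evaluation output.
episode_rewards = [4.0, 6.0, 8.0, 10.0]

mean_reward = statistics.fmean(episode_rewards)
# np.std defaults to the population standard deviation (ddof=0),
# which statistics.pstdev also computes.
std_reward = statistics.pstdev(episode_rewards)

print(f"mean_reward = {mean_reward:.2f} +/- {std_reward:.2f}")
```

A large standard deviation relative to the mean suggests that the agent's performance varies a lot from one episode to the next.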

## Q4.1: What can you learn from the evaluation results above? (10 pts)

## Write your answer here

## Q4.2 Run the evaluation with the policies with different h_sizes (32 and 128) and comment on the results (10 pts)

You are free to experiment with other h_sizes as well

In [None]:
evaluate_agent(eval_env, 
               pixelcopter_hyperparameters_32["max_t"], 
               pixelcopter_hyperparameters_32["n_evaluation_episodes"],
               pixelcopter_policy_32)

In [None]:
evaluate_agent(eval_env, 
               pixelcopter_hyperparameters_128["max_t"], 
               pixelcopter_hyperparameters_128["n_evaluation_episodes"],
               pixelcopter_policy_128)

## Write your answer here

## Q4.3: Record Agent (5 pts)


In [None]:
def record_video(env, policy, out_directory, fps=30):
  """
  Generate a replay video of the agent
  :param env: The environment to record
  :param policy: The Reinforce agent whose actions are recorded
  :param out_directory: Path of the output video file
  :param fps: Frames per second of the output video
  """
  images = []  
  done = False
  state = env.reset()
  img = env.render(mode='rgb_array')
  images.append(img)
  while not done:
    # Sample an action from the policy's action distribution for the current state
    action, _ = policy.act(state)
    state, reward, done, info = env.step(action) # We directly put next_state = state for recording logic
    img = env.render(mode='rgb_array')
    images.append(img)
  imageio.mimsave(out_directory, [np.array(img) for img in images], fps=fps)

In [None]:
record_video(eval_env, pixelcopter_policy, './replay.mp4')