# Vanilla Policy Gradient

## Formulation
In this exercise, we will train an RL agent with Vanilla Policy Gradient (VPG), the simplest on-policy RL algorithm. The objective of VPG is to maximize the expected return of the trajectories sampled from a policy $\pi_\theta$, which is expressed by
$$ \max_\theta \eta(\pi_\theta), \qquad \eta(\pi_\theta) = \mathbb{E}_{\tau \sim p_{\pi_\theta}(\tau)} \left[ R(\tau) \right], $$
where $R(\tau)$ is the discounted return of a trajectory $\tau$ of length $T$. Using the log-derivative trick, we can estimate the gradient of the objective with respect to $\theta$ from sampled trajectories, which is given by
$$ \nabla_\theta \eta(\pi_\theta) \approx \frac{1}{m} \sum_{i=1}^m \left( \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta (a_t^i \mid s_t^i) \right) \left( \sum_{t=0}^{T-1} \gamma^t r(s_t^i, a_t^i) \right), $$
where $m$ is the number of trajectories sampled for training and $r(s, a)$ is the immediate reward for taking action $a$ in state $s$. Because an action taken at time step $t$ cannot affect rewards received before $t$, we can rewrite the gradient above as
$$ \nabla_\theta \eta(\pi_\theta) \approx \frac{1}{m} \sum_{i=1}^m \left( \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta (a_t^i \mid s_t^i) \sum_{t'=t}^{T-1} \gamma^{t'-t} r(s_{t'}^i, a_{t'}^i) \right), $$
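where the inner sum (the reward-to-go) excludes the rewards received before time step $t$, the step at which the policy is queried. The discount is also re-indexed to start at $t$, i.e., the common factor $\gamma^t$ is dropped, which is the usual choice in practice.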
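
As a minimal sketch of the two return definitions above, the snippet below computes the full discounted return and the reward-to-go for a single trajectory. It uses plain Python lists for clarity; the function names are illustrative and are not taken from `rlkit`, whose actual implementation presumably works on arrays or tensors.

```python
def discounted_return(rewards, gamma):
    """Full-trajectory discounted return: sum_t gamma^t * r_t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def rewards_to_go(rewards, gamma):
    """Reward-to-go at every step t: sum_{t' >= t} gamma^(t' - t) * r_t'."""
    out = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the value of the next one:
    # G_t = r_t + gamma * G_{t+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


# Hypothetical three-step trajectory with a reward of 1 at every step.
rewards = [1.0, 1.0, 1.0]
rtg = rewards_to_go(rewards, gamma=0.5)
full = discounted_return(rewards, gamma=0.5)
```

The first entry of the reward-to-go equals the full discounted return, while later entries ignore rewards that were collected earlier in the trajectory.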

## Baseline
The policy gradient estimator suffers from high variance. To reduce it, we introduce a baseline function and subtract it from the sum of rewards. By the expected grad-log-prob (EGLP) lemma, this does not change the expected value of the gradient, so the estimator remains unbiased. The most common choice of baseline is the on-policy value function $V_\phi^\pi$, which acts as a state-dependent baseline. The value function will be trained to approximate the discounted sum of future rewards starting from a particular state:
$$ V_\phi^\pi(s_t) \approx \sum_{t'=t}^{T-1} \gamma^{t'-t} \mathbb{E}_{\pi_\theta} \left[ r(s_{t'}, a_{t'}) \mid s_t \right]. $$
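To see why a state-dependent baseline leaves the gradient unbiased, the EGLP lemma can be written out for a discrete action space (for continuous actions, replace the sum with an integral). Here $b(s)$ denotes any function of the state alone; the symbol is introduced only for this sketch:
$$ \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)} \left[ \nabla_\theta \log \pi_\theta(a \mid s) \right] = \sum_a \pi_\theta(a \mid s) \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)} = \nabla_\theta \sum_a \pi_\theta(a \mid s) = \nabla_\theta 1 = 0. $$
Since $b(s_t)$ does not depend on $a_t$, it follows that
$$ \mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)} \left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, b(s_t) \right] = b(s_t) \cdot 0 = 0, $$
so choosing $b = V_\phi^\pi$ changes the variance of the estimator but not its expectation.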
Finally, the policy gradient becomes
$$ \nabla_\theta \eta(\pi_\theta) \approx \frac{1}{m} \sum_{i=1}^m \left( \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta (a_t^i \mid s_t^i) \underbrace{\left( \sum_{t'=t}^{T-1} \gamma^{t'-t} r(s_{t'}^i, a_{t'}^i) - V_\phi^\pi(s_t^i) \right)}_{A(s_t^i, a_t^i)} \right), $$
where $A$ is called the advantage function. In practice, we use standardized advantages, i.e., advantages shifted and scaled to have zero mean and unit standard deviation.
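
The advantage computation and its standardization can be sketched as follows. This is an illustrative example in plain Python, not the `rlkit` implementation: the function name, the small constant `eps` (added to avoid division by zero), and the use of the population standard deviation are all assumptions.

```python
import statistics


def compute_advantages(returns_to_go, baselines, standardize=True, eps=1e-8):
    """Subtract baseline predictions from reward-to-go, optionally standardizing."""
    advantages = [q - b for q, b in zip(returns_to_go, baselines)]
    if standardize:
        mean = statistics.fmean(advantages)
        std = statistics.pstdev(advantages)  # population standard deviation
        # eps is an assumed safeguard for the case where all advantages are equal
        advantages = [(a - mean) / (std + eps) for a in advantages]
    return advantages


# Hypothetical reward-to-go values and baseline predictions for three steps.
raw = compute_advantages([1.75, 1.5, 1.0], [1.0, 1.0, 1.0], standardize=False)
adv = compute_advantages([1.75, 1.5, 1.0], [1.0, 1.0, 1.0])
```

After standardization the advantages have approximately zero mean and unit standard deviation, which keeps the scale of the gradient comparable across batches.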

## Implementation
To implement Vanilla Policy Gradient, you need to fill in the blanks marked with `TODO` in the following classes:
- `MLPPolicyPG` (rlkit/policies/mlp_policy.py): an MLP policy for VPG, which takes an observation as input and outputs an action.
- `PGAgent` (rlkit/agents/pg_agent.py): a VPG agent, which updates the policy via VPG using the given trajectories.

### Implementing `MLPPolicyPG`

`MLPPolicyPG` consists of a policy network and an optional baseline network, each of which is a feed-forward deep neural network. You should implement a method named `update`, which computes the losses for the policy and the baseline and updates both networks.
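
The two loss values that `update` is expected to produce can be sketched as below. This is only a numerical illustration in plain Python under stated assumptions: in an actual implementation the log-probabilities and value predictions would come from the networks as tensors of an autodiff framework, and the function names here are hypothetical. Averaging over all time steps in the batch (rather than over trajectories) is also an assumption; it rescales the gradient without changing its direction.

```python
def policy_loss(log_probs, advantages):
    """Surrogate loss; its gradient w.r.t. the policy parameters is the
    negated policy-gradient estimate, so minimizing it ascends the objective."""
    n = len(log_probs)
    return -sum(lp * adv for lp, adv in zip(log_probs, advantages)) / n


def baseline_loss(predicted_values, target_returns):
    """Mean squared error between baseline predictions and reward-to-go targets."""
    n = len(predicted_values)
    return sum((v - q) ** 2 for v, q in zip(predicted_values, target_returns)) / n


# Hypothetical batch of two time steps.
p_loss = policy_loss(log_probs=[-0.5, -1.0], advantages=[1.0, -1.0])
b_loss = baseline_loss(predicted_values=[1.0, 2.0], target_returns=[1.5, 1.0])
```

The advantages are treated as constants in the policy loss: gradients should flow only through the log-probabilities, not through the baseline predictions.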

### Implementing `PGAgent`

`PGAgent` computes baselines, discounted returns, and advantages from the given trajectories and passes them to `self.actor`, an instance of the `MLPPolicyPG` class. You should implement methods such as `calculate_baselines` and `calculate_advantages` to build the whole training procedure of a PG agent given trajectory data.
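
One detail worth illustrating is that returns must be computed per trajectory before the data is flattened into a single batch, so that rewards never leak across trajectory boundaries. The sketch below assumes each trajectory is a dict with `"observations"` and `"rewards"` lists and that the baseline is a callable on a single observation; these names and the `build_batch` helper are hypothetical and do not reflect the actual `rlkit` data layout.

```python
def rewards_to_go(rewards, gamma):
    """Reward-to-go for one trajectory: G_t = r_t + gamma * G_{t+1}."""
    out, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


def build_batch(trajectories, gamma, baseline_fn=None):
    """Flatten trajectories into (observations, q_values, advantages)."""
    observations, q_values = [], []
    for traj in trajectories:
        observations.extend(traj["observations"])
        # Computed separately for each trajectory, then concatenated.
        q_values.extend(rewards_to_go(traj["rewards"], gamma))
    if baseline_fn is None:
        advantages = list(q_values)  # no baseline: advantage = reward-to-go
    else:
        advantages = [q - baseline_fn(o) for o, q in zip(observations, q_values)]
    return observations, q_values, advantages


# Two hypothetical trajectories of lengths 2 and 1, with a constant baseline.
trajs = [
    {"observations": [0, 1], "rewards": [1.0, 1.0]},
    {"observations": [2], "rewards": [1.0]},
]
obs, q, adv = build_batch(trajs, gamma=0.5, baseline_fn=lambda o: 1.0)
```

The resulting flat lists are what would be handed to the policy's `update` method in this sketch.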

# Setup

In [None]:
#@title 1. Mount your Google Drive

from google.colab import drive

drive.mount('/content/drive', force_remount=True)

# enter the foldername in your Drive where you have saved the unzipped 'rlkit' folder
FOLDERNAME = 'hw9'

assert FOLDERNAME is not None, "[!] Enter the foldername."

%cd /content/drive/MyDrive/$FOLDERNAME

In [None]:
#@title 2. Install packages

#@markdown Please run the following script to install external Linux and Python packages.

#@markdown This may take a few minutes.

!apt update 
!apt install -y xvfb ffmpeg

!pip install tensorboard tensorboardX pyvirtualdisplay selenium
!pip install gym==0.22.0

# Run Vanilla Policy Gradient

In [None]:
#@title 1. Import packages

import os
from pyvirtualdisplay import Display

from rlkit.infrastructure.rl_trainer import OnPolicyRLTrainer
from rlkit.agents.pg_agent import PGAgent

%load_ext autoreload
%autoreload 2

In [None]:
#@title 2. Runtime arguments

class Args:
  def __getitem__(self, key):
    return getattr(self, key)

  def __setitem__(self, key, val):
    setattr(self, key, val)

  def __contains__(self, key):
    return hasattr(self, key)

  env_name = "CartPole-v1" #@param
  exp_name = "vpg" #@param

  #@markdown main parameters of interest
  n_iter = 200 #@param {type: "integer"}

  ## PDF will tell you how to set ep_len
  ## and discount for each environment
  ep_len = 500 #@param {type: "integer"}
  discount = 0.99 #@param {type: "number"}
  nn_baseline = True #@param {type: "boolean"}
  standardize_advantages = True #@param {type: "boolean"}

  #@markdown batches and steps
  n_trajs = 5 #@param {type: "integer"}
  eval_n_trajs = 5 #@param {type: "integer"}
  num_agent_train_steps_per_iter = 1 #@param {type: "integer"}
  learning_rate = 1e-3 #@param {type: "number"}

  #@markdown MLP parameters
  n_layers = 2 #@param {type: "integer"}
  size = 64 #@param {type: "integer"}

  #@markdown system
  save_params = False #@param {type: "boolean"}
  no_gpu = True #@param {type: "boolean"}
  which_gpu = 0 #@param {type: "integer"}
  seed = 1337 #@param {type: "integer"}

  #@markdown logging
  ## default is to not log video so
  ## that logs are small enough to be
  ## uploaded to Gradescope
  video_log_freq = -1 #@param {type: "integer"}
  scalar_log_freq = 1 #@param {type: "integer"}


args = Args()

In [None]:
#@title 3. Create directory for logging

base_logdir = "logs"
exp_name = args["exp_name"] + '_' + args["env_name"]
logdir = os.path.join(base_logdir, exp_name)
os.makedirs(logdir, exist_ok=True)
args["logdir"] = logdir

In [None]:
#@title 4. Define policy gradient trainer

class PG_Trainer(object):

    def __init__(self, params):

        #####################
        ## SET AGENT PARAMS
        #####################

        computation_graph_args = {
            'n_layers': params['n_layers'],
            'size': params['size'],
            'learning_rate': params['learning_rate'],
            }

        estimate_advantage_args = {
            'gamma': params['discount'],
            'standardize_advantages': params['standardize_advantages'],
            'nn_baseline': params['nn_baseline'],
        }

        train_args = {
            'num_agent_train_steps_per_iter': params['num_agent_train_steps_per_iter'],
        }

        agent_params = {**computation_graph_args, **estimate_advantage_args, **train_args}

        self.params = params
        self.params['agent_class'] = PGAgent
        self.params['agent_params'] = agent_params

        ################
        ## RL TRAINER
        ################

        self.rl_trainer = OnPolicyRLTrainer(self.params)

    def run_training_loop(self):

        self.rl_trainer.run_training_loop(
            self.params['n_iter'],
            collect_policy = self.rl_trainer.agent.actor,
            eval_policy = self.rl_trainer.agent.actor,
        )

In [None]:
#@title 5. Run training

#@markdown If your implementation is correct, the average return will be above 400.

#@markdown This may take a few minutes.

trainer = PG_Trainer(args)
trainer.run_training_loop()

In [None]:
#@title 6. Run Tensorboard

%load_ext tensorboard
%tensorboard --logdir /content/drive/MyDrive/$FOLDERNAME/logs