## Policy Gradient

The general organization of the homework is given below.

- pg
    - reinforce
        - model
        - cartpole
    - a2c
        - model
        - vecenv
        - box2d
        - pong
    - ppo
        - model
        - box2d
        - walker
    - common

In this homework, you will implement REINFORCE, A2C, and PPO agents and run these agents in CartPole, LunarLander, Pong, and BipedalWalker environments.

### Running

Each experiment is trained from scratch with 3 different seeds (except for Pong) so that you get a good sense of the stochasticity involved in training. You can run your experiments with command-line arguments from the Jupyter notebook, as shown below, or from a bash script.


In [None]:
!python pg/a2c/box2d.py --nenv 16 --log-dir log/a2c_lunar


You should obtain scores higher than those listed below:
- CartPole: 400
- LunarLander: 200
- BipedalWalker: 100
- Pong: 10

The default hyperparameters are not tuned, but they are tested and work well. However, hyperparameters are sensitive to implementation details, so you may need to tune them.

By default, the log file (named progress.csv) is saved in a temporary location if the ```--log-dir``` command-line argument is not given. You can load the csv file into a Pandas dataframe object and use it to visualize the training with the given visualization script.

### Submission

Submission should include logs and models folders.

Example log folder:
- log
    - a2c_lunar
        - 05-23-2021-00-01-02
        - 05-23-2021-01-02-02
        - 05-23-2021-02-03-02
    - a2c_pong
        - 05-23-2021-03-01-02
    - ppo_lunar
        - 05-23-2021-04-01-02
        - 05-23-2021-05-02-02
        - 05-23-2021-06-03-02
    - ppo_walker
        - 05-23-2021-07-01-02
        - 05-23-2021-08-02-02
        - 05-23-2021-09-03-02
    - reinforce_cartpole
        - 05-23-2021-10-01-02
        - 05-23-2021-11-02-02
        - 05-23-2021-12-03-02
        
As long as you pass the command-line argument ```--log-dir``` with a value such as ```log/a2c_lunar```, the folder will be filled automatically.

### Plotting

When you are done with the experiments, you can plot the statistics. As long as the number of rows in every log within the given directory matches, ```Plotter``` draws the statistics of the runs. For example, if you have 3 runs for a certain experiment, ```Plotter``` draws a mean curve and shades the region between the $\alpha$ and $1-\alpha$ quantiles of the runs.
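
The mean curve and the quantile band described above can be sketched with NumPy as follows. This is only an illustration and not the code inside ```Plotter```: the function name ```mean_and_quantile_band``` and the value ```alpha=0.25``` are assumptions. It also shows why the row counts must match, since the runs are stacked into a single 2-D array.

```python
import numpy as np


def mean_and_quantile_band(runs, alpha=0.25):
    """Return the mean curve and the alpha / (1 - alpha) quantile curves.

    runs: list of equal-length 1-D sequences, one per seed (hypothetical input).
    """
    # Stacking only works when every run has the same number of rows.
    stacked = np.asarray(runs, dtype=float)  # shape: (n_runs, n_rows)
    mean = stacked.mean(axis=0)
    lower = np.quantile(stacked, alpha, axis=0)
    upper = np.quantile(stacked, 1.0 - alpha, axis=0)
    return mean, lower, upper


# Three made-up runs with three logged rows each.
runs = [[0.0, 10.0, 20.0], [2.0, 12.0, 26.0], [4.0, 14.0, 32.0]]
mean, lower, upper = mean_and_quantile_band(runs, alpha=0.25)
```

The region between ```lower``` and ```upper``` is what would be shaded around ```mean```.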

Below is an example of plotting code.

In [None]:
import os
import pandas as pd
from pg.visualize import Plotter

log_dir = os.path.join("log", "a2c_lunar")
df_dict = {
    "lunar a2c": [pd.read_csv(os.path.join(log_dir, folder, "progress.csv"))
                for folder in os.listdir(log_dir)]
}

plotter = Plotter(df_dict)
plotter()

## Instructions

### Reinforce

The implementation of REINFORCE is fairly simple. There are two Python scripts, and we begin by filling in the "model.py" script. Since you are familiar with REINFORCE, we can start by implementing the ```learn``` method in the ```Reinforce``` class. It expects ```args```, ```opt```, and ```env``` as parameters. We loop for as many episodes as stated in the argument ```args.n_episodes``` and collect rollouts to update the policy parameters. Note that if an episode does not terminate within ```args.max_episode_len``` steps, we terminate it manually.

We need to obtain a rollout that consists of a list of transitions. Each transition in the rollout needs to include the log probability of the selected (and performed) action and the intermediate reward. We expect ```policynet``` to return a Categorical distribution over the actions for a given state whenever its forward method is called. At each iteration in an episode, use the action distribution to sample an action and step the environment with it. Store the log probability of the sampled action and the intermediate reward in a ```Transition``` object. The rollout is a list of ```Transition```s created during the episode.

Once we have the rollout, we can calculate the gradients with respect to the policy parameters. Fill in the missing part in ```accumulate_gradient```. Don't forget to call ```.backward()``` for each log probability to minimize the negative log-likelihood. Note that we do not update the parameters in this method; we only calculate the gradients.
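
As a rough illustration of the quantity that the gradient is taken of, the sketch below computes discounted returns and the return-weighted negative log-likelihood with plain Python floats. The ```Transition``` fields, the function names, and the discount factor ```gamma``` are assumptions made for this example; the homework code may differ. With PyTorch tensors, each ```-log_prob * G``` term would be a tensor on which ```.backward()``` can be called, which accumulates gradients in the parameters.

```python
from collections import namedtuple

# Hypothetical container; the homework's Transition may use other field names.
Transition = namedtuple("Transition", ["log_prob", "reward"])


def discounted_returns(rewards, gamma):
    """Compute G_t = r_t + gamma * G_{t+1} by iterating backwards in time."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def reinforce_loss(rollout, gamma):
    """Sum of -log pi(a_t | s_t) * G_t over the rollout."""
    returns = discounted_returns([t.reward for t in rollout], gamma)
    return sum(-t.log_prob * g for t, g in zip(rollout, returns))


# Made-up rollout of three transitions.
rollout = [Transition(-0.5, 1.0), Transition(-1.0, 1.0), Transition(-0.25, 1.0)]
returns = discounted_returns([t.reward for t in rollout], gamma=0.5)
loss = reinforce_loss(rollout, gamma=0.5)
```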

Now, fill the ```cartpole.py``` script. Fill the missing parts in the ```PolicyNet``` class so that it can return a categorical distribution for a given state.

After the implementation is completed, you can run the experiments. Don't forget to tune the hyperparameters as they affect the performance of the training.

### Experiments

- Run REINFORCE in CartPole-v1 (Reach at least +400 mean reward for the last 20 episodes)


By default, the writer logs the mean reward of the last 20 episodes. This can be changed by overriding the ```--log-window-length``` command-line argument.
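
For intuition, a windowed mean like the one mentioned above can be tracked with a bounded ```deque```. This is a minimal sketch and not the homework's writer; the class name ```WindowedMean``` and its methods are hypothetical.

```python
from collections import deque


class WindowedMean:
    """Track the mean of the most recent `window_length` episode rewards."""

    def __init__(self, window_length=20):
        # A deque with maxlen drops the oldest reward automatically.
        self.rewards = deque(maxlen=window_length)

    def append(self, episode_reward):
        self.rewards.append(episode_reward)

    def mean(self):
        # Return 0.0 before any episode has finished.
        return sum(self.rewards) / len(self.rewards) if self.rewards else 0.0


tracker = WindowedMean(window_length=3)
for episode_reward in [10.0, 20.0, 30.0, 40.0]:
    tracker.append(episode_reward)
```

Here only the last three rewards remain in the window, so the first reward no longer affects the mean.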

Plot the results. Also, please **keep** the ```progress.csv``` files as shown under the submission section above.

In [None]:
!python pg/reinforce/cartpole.py --log-dir log/reinforce_cartpole

In [None]:
# Plot Experiment 1

### A2C Implementation

A2C is a synchronous version of the popular A3C algorithm. The synchronization is done via a wrapper that runs multiple environments in parallel. We use the wrapper provided in the ```pg/a2c/vecenv.py``` script. Unlike in the REINFORCE algorithm, we do not collect rollouts until termination. Instead, each rollout in A2C consists of a fixed number of transitions collected from the parallel environments.

Although you do not need to fill in ```pg/a2c/vecenv.py```, please take a look at it to understand how the environments are vectorized.

Before starting, you may want to check the ```learn``` method in the ```pg/a2c/model.py``` script to see the overall structure of the algorithm.

#### Model

Fill in the missing part of the ```forward``` method in the ```pg/a2c/model.py``` script. You may want to check the return value of the network in ```pg/a2c/box2d.py``` to know what to expect when you call ```self.network(.)``` in the ```forward``` method. We want to sample actions and calculate the log probabilities of these actions, as well as the entropy of the action distribution, to use them later in the parameter update.

Next, you need to fill in the ```collect_rollout``` method in the ```pg/a2c/model.py``` script. This is the method that collects ```args.n_step``` transitions in parallel to make a rollout. Note that you need to calculate the value of the last state (the one returned by the ```step``` function at the end of the rollout) so that we can calculate the target value later. By combining the list of transitions and the value of the last state, you can form a ```Rollout``` object. We also return the last state and the last GRU hidden state for future calls.

We continue with the ```calculate_gae``` method in the ```pg/a2c/model.py``` script. You need to read the GAE paper before implementing this part. This method returns a list of advantages and a list of returns (the capital letter G in the book notation). We will use the advantages to calculate the policy loss and the returns to calculate the value loss.
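
To make the recursion concrete, here is a minimal single-environment sketch of generalized advantage estimation with plain Python lists. The function name ```gae_advantages```, its signature, and the default values of ```gamma``` and ```gae_lambda``` are assumptions for illustration; the homework version works on batched tensors from parallel environments and may handle termination differently.

```python
def gae_advantages(rewards, values, dones, last_value, gamma=0.99, gae_lambda=0.95):
    """Return (advantages, returns) for one rollout of a single environment.

    rewards[t], values[t], dones[t] describe step t; last_value is the value
    estimate of the state reached after the final step of the rollout.
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    next_advantage = 0.0
    for t in reversed(range(len(rewards))):
        # Do not bootstrap across an episode boundary.
        not_done = 1.0 - float(dones[t])
        # One-step TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # A_t = delta_t + gamma * lambda * A_{t+1}.
        next_advantage = delta + gamma * gae_lambda * not_done * next_advantage
        advantages[t] = next_advantage
        next_value = values[t]
    # The return target is the advantage plus the value baseline.
    returns = [adv + val for adv, val in zip(advantages, values)]
    return advantages, returns


adv, ret = gae_advantages([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], [False, False, False],
                          last_value=0.0, gamma=1.0, gae_lambda=1.0)
```

With ```gae_lambda=1.0``` the advantages reduce to discounted returns minus the value baseline, and with ```gae_lambda=0.0``` they reduce to the one-step TD errors.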

Now that we have a rollout and the lists of advantages and returns, we can calculate a loss and update the parameters. Fill in the ```parameter_update``` method in the ```pg/a2c/model.py``` script. You can use the ```rollout_data_loader``` method to obtain flattened tensors.

Finally, fill in the ```evaluate``` method, which runs the policy for multiple episodes to obtain an average score. We use this method to measure the performance of a trained agent. Now that we have filled in all of the missing parts, we can see how they come together in the ```learn``` method in ```pg/a2c/model.py```.

#### Box2d

Now we need to create a neural network that represents the policy and the value functions. Unlike before, we will use a recurrent layer (a GRU layer). That is why we have additional tensors like ```gru_hx``` in the ```forward``` and ```collect_rollout``` methods. We assume familiarity with recurrent networks for this part.

Start by filling in the ```__init__``` method. You can use separate networks to represent the policy and the value functions, or a shared feature network with two separate head layers. Next, fill in the ```forward``` method. Remember to return the policy logits (no nonlinearity, so that you can use them in the ```forward``` method in the ```pg/a2c/model.py``` script to create a Categorical distribution), the value (no nonlinearity), and the hidden vector of the GRU layer. The Categorical distribution will be created in the ```A2C``` class and not within the ```network```.

#### Pong

Pong is a visual domain, so you may want to use convolutional layers (not mandatory). Other than that, there are only a handful of differences between the Pong network and the box2d network. We use simple environment wrappers that are designed specifically for the Pong environment. You don't have to use a GPU in this experiment, as it trains quite fast provided the implementation is correct.


#### Experiments
When you have completed all the implementations mentioned above, you can start experimenting (and debugging) with the LunarLander and Pong environments.

We will run two experiments.

- Run A2C in LunarLander-v2 (Reach at least +200 mean reward for the last 20 episodes)
- Run A2C in Pong (Reach at least 10 mean reward for the last 20 episodes) (a single seed is enough)

Plot these results (2 Plots). Also **keep** the ```.csv``` files and the models (you can leave a Google Drive link for the model files if you cannot submit them through Ninova).

In [None]:
!python pg/a2c/box2d.py --log-dir ./log/a2c_lunar

In [None]:
# Plot Experiment 1

In [None]:
!python pg/a2c/pong.py --log-dir ./log/a2c_pong

In [None]:
# Plot Experiment 2

In [None]:
# Evaluate Pong 5 times
import torch

from pg.a2c.pong import make_env, GruNet
from pg.a2c.model import A2C


model_data = torch.load("models/pong.b")
env = make_env()
in_size = env.observation_space.shape[0]
out_size = env.action_space.n

network = GruNet(in_size, out_size, model_data["args"]["hidden_size"])
model = A2C(network, None, None, None)
model.load_state_dict(model_data["state_dict"])
model.evaluate(make_env, n_episodes=5)


### PPO Implementation

PPO is very similar to A2C in terms of implementation steps. Most of the structure is the same, and you can follow the same order as you did in the A2C part. In the PPO experiments, we will not use a recurrent architecture, as that makes the implementation more challenging. There are two experiments with PPO: LunarLander and BipedalWalker. Note that BipedalWalker is a continuous action space environment.

#### Differences

- The ```parameter_update``` method updates the parameters multiple times, once per mini-batch.
- Unlike in A2C, we have a ```forward_given_actions``` method that is used within the ```parameter_update``` method to calculate the log probabilities, values, and entropies over multiple passes. Since the parameters are updated after every mini-batch, we cannot reuse the same log probabilities and need to recalculate them.
- ```rollout_data_loader``` method needs to yield mini-batches of rollout data as opposed to full rollout data.
- You need to fill in the ```linear_annealing``` method, which is used for scheduling the ```clip_range``` parameter.
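
The last two ideas in the list above can be sketched with NumPy as follows: a linearly annealed clip range and the clipped surrogate loss of PPO evaluated on one mini-batch. The function names, signatures, and the schedule end points are assumptions for illustration; in particular, this ```linear_annealing``` is not the homework's method signature. In the homework, the same computation would be done on tensors so that gradients can flow through the new log probabilities.

```python
import numpy as np


def linear_annealing(step, n_steps, start, end):
    """Move linearly from `start` to `end` over `n_steps`, then stay at `end`."""
    fraction = min(max(step / n_steps, 0.0), 1.0)
    return start + (end - start) * fraction


def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages, clip_range):
    """PPO clipped objective, negated so that it can be minimized."""
    new_log_probs = np.asarray(new_log_probs, dtype=float)
    old_log_probs = np.asarray(old_log_probs, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    # Probability ratio between the current policy and the rollout policy.
    ratio = np.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantages
    # Take the pessimistic (smaller) objective for each sample.
    return -np.minimum(unclipped, clipped).mean()


clip_range = linear_annealing(step=5, n_steps=10, start=0.2, end=0.0)
loss = clipped_surrogate_loss([np.log(2.0)], [0.0], [1.0], clip_range=0.2)
```

In the last line the ratio is 2 and the advantage is positive, so the objective is clipped at ```1 + clip_range```.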

#### Box2d

Similar to A2C.

#### BipedalWalker

This is a continuous action space environment. Use a normal distribution to represent the action distribution.
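
As a sketch of what this involves, the NumPy code below evaluates the log probability and entropy of a diagonal Gaussian policy. The function names and the choice of parameterizing the policy by ```mean``` and ```log_std``` are assumptions for illustration. The key point is that the per-dimension terms are summed over the action dimensions (BipedalWalker actions have 4 dimensions), which also applies if you use ```torch.distributions.Normal```, since it returns per-dimension values.

```python
import numpy as np


def gaussian_log_prob(action, mean, log_std):
    """Log density of a diagonal Gaussian, summed over the action dimensions."""
    std = np.exp(log_std)
    per_dim = (-0.5 * ((action - mean) / std) ** 2
               - log_std
               - 0.5 * np.log(2.0 * np.pi))
    return per_dim.sum(axis=-1)


def gaussian_entropy(log_std):
    """Entropy of a diagonal Gaussian, summed over the action dimensions."""
    return (0.5 + 0.5 * np.log(2.0 * np.pi) + log_std).sum(axis=-1)


rng = np.random.default_rng(0)
mean = np.zeros(4)       # made-up policy output for one state
log_std = np.zeros(4)    # std = 1 in every dimension
action = mean + np.exp(log_std) * rng.standard_normal(4)  # sampled action
log_prob = gaussian_log_prob(action, mean, log_std)
entropy = gaussian_entropy(log_std)
```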

#### Experiments

We will run two experiments each containing 3 runs as mentioned previously.

- Run PPO in LunarLander-v2 (Reach at least +200 mean reward for the last 20 episodes)
- Run PPO in BipedalWalker-v2 (Reach at least +100 mean reward for the last 20 episodes)

> Notice: Recent gym versions may require BipedalWalker-v3 instead of BipedalWalker-v2. Please change it accordingly if necessary.

Plot these results (2 Plots). Also **keep** the ```.csv``` files and the models (you can leave a Google Drive link for the model files).

In [None]:
!python pg/ppo/box2d.py --log-dir ./log/ppo_lunar

In [None]:
# Plot Experiment 1

In [None]:
!python pg/ppo/walker.py --log-dir ./log/ppo_walker

In [None]:
# Plot Experiment 2

In [None]:
# Evaluate Walker agent

### Comparison

Now that you have completed the implementations, you can compare the training performance of A2C and PPO.


In [None]:
import os
import pandas as pd
from pg.visualize import Plotter

a2c_log_dir = os.path.join("log", "a2c_lunar")
ppo_log_dir = os.path.join("log", "ppo_lunar")
df_dict = {
    "a2c Lunarlander": [pd.read_csv(os.path.join(a2c_log_dir, folder, "progress.csv"))
                for folder in os.listdir(a2c_log_dir)],
    "ppo Lunarlander": [pd.read_csv(os.path.join(ppo_log_dir, folder, "progress.csv"))
                for folder in os.listdir(ppo_log_dir)],
}

plotter = Plotter(df_dict)
plotter()

#### Your comments (Bonus + 5)

> Explain the score comparison you observe on Lunar Lander environment.

> Explain the advantages and disadvantages of the methods you implemented within this homework