# CPSC 533V: Assignment 4 - Policy Gradients and Proximal Policy Optimization (PPO)

## 45 points total (9% of final grade)

## Due Date: Fri Oct 29, any-time-on-earth

---

## Submission Information

- Complete the assignment by editing and executing the associated Python files.
- Task 1 should be completed in the notebook, i.e. include your answers under each question.
- Tasks 2-4 are coding and experiment questions. Copy and paste your results (screenshots and logs) into the notebook. You should also copy your completed code into this notebook and paste it under the corresponding questions; it should be only a few lines at most.
- When done, upload the completed Jupyter notebook (ipynb file) to Canvas.
- **We recommend working in groups of two**. List your names and student numbers below (if you use a different name on Canvas).

<ul style="list-style-type: none; font-size: 1.2em;">
<li>Name (and student ID): Curtis Fox, 23673149</li>
<li>Name (and student ID): Tianyue Zhang, 24991151</li>
</ul>

*As always, you are encouraged to discuss your ideas and approaches with other students, even if you are not working as a group.*

## Assignment Background

This assignment is on vanilla policy gradient (VPG) methods and Proximal Policy Optimization (PPO).
You will be implementing the loss functions for vanilla policy gradients (VPG), running some experiments on it, and then implementing clipped-PPO policy gradients and its loss function.  The change for PPO is simple and it yields efficient-to-compute stable policy updates, making PPO one of the most widely used DeepRL algorithms today.


Goals:
- To understand policy gradient RL and to implement the relevant training losses, for both discrete and continuous action spaces
- To observe the sensitivity issues that plague vanilla policy gradients
- To understand and implement the PPO clipping objective and to observe how it addresses these issues

External resources:
- [Sutton's Book Chapter 13: Policy Gradients](http://incompleteideas.net/book/the-book-2nd.html)
- [Andrej Karpathy's post on RL in general and policy gradients specifically](http://karpathy.github.io/2016/05/31/rl/)
- [OpenAI's Spinning Up for coverage of policy gradients and PPO](https://spinningup.openai.com/en/latest/)
- [PPO paper](https://arxiv.org/pdf/1707.06347.pdf)
- [Matthew's StackOverflow Post on PPO](https://stackoverflow.com/questions/46422845/what-is-the-way-to-understand-proximal-policy-optimization-algorithm-in-rl/50663200#50663200)

## Task 0: Preliminaries



### Dependencies

In addition to dependencies from past assignments, we will learn to use TensorBoard to view our experiment results. 
```
pip install tensorboard
```

If you want to experiment with LunarLander instead of CartPole, you'll also need to install the box2d environment.
```
pip install 'gym[box2d]'
```

### Debugging


You can include:  `import ipdb; ipdb.set_trace()` in your code and it will drop you to that point in the code, where you can interact with variables and test out expressions.  We recommend this as an effective method to debug the algorithms.

---

### Quick recap of policy gradients

The idea is that we create a **differentiable policy** $\pi$ to be optimized so as to produce actions that yield high return.  To optimize the policy, we generate samples in the environment and we use those to compute a "modulated gradient" usable for gradient ascent on the policy parameters.  The modulated gradient consists of two terms: (1) the bare policy gradient term $\text{log}(\pi_\theta(a_t | s_t))$, and (2) the reward/advantage modulator term $A_t$.  Note that $a_t$ is the action that was actually chosen and sent to the environment.  In PyTorch, we implement this modulated gradient by multiplying the two terms together in the following loss function and then calling backward on it:
$$L^{PG}(\theta) = \text{log}(\pi_\theta(a_t | s_t)) * A_t$$

The policy gradient term by itself indicates the direction required to move the policy parameters to *make the action that we chose more probable*.  By itself, this does nothing useful if applied equally to all samples.  However, by multiplying this gradient by the advantage $A_t$, the full modulated gradient tells us how to move in the direction that makes good actions more probable and bad actions less probable.  When $A_t$ is large in absolute value, we should change the probability a lot. When $A_t$ is negative, we should make that action less likely.  This lets us use a non-differentiable reward signal to modulate the policy's gradient.
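
To make the recap concrete, here is a minimal PyTorch sketch of one modulated-gradient step. It is not taken from the assignment code: the three-action policy, the hand-picked advantages, and the step size of 0.1 are all hypothetical values chosen for illustration. Because optimizers minimize, the sketch negates $L^{PG}$ before calling `backward`.

```python
import torch
from torch.distributions.categorical import Categorical

# Hypothetical toy setup: a uniform policy over 3 actions.
logits = torch.nn.Parameter(torch.zeros(3))
actions = torch.tensor([0, 1, 2])             # actions that were "sampled"
advantages = torch.tensor([1.0, -1.0, 0.0])   # made-up advantage estimates

# log pi(a_t) for each sampled action, modulated by its advantage.
logp = Categorical(logits=logits).log_prob(actions)
loss = -(logp * advantages).mean()            # negate: minimizing this ascends L^PG
loss.backward()

# One plain gradient-descent step on the loss (step size is arbitrary).
with torch.no_grad():
    new_logits = logits - 0.1 * logits.grad
    new_probs = torch.softmax(new_logits, dim=-1)

print("gradient:", logits.grad)
print("updated probabilities:", new_probs)
```

After the step, the action with positive advantage becomes more probable and the one with negative advantage becomes less probable, while the zero-advantage sample contributes nothing to the gradient.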

Here is a reference of a full vanilla policy gradient algorithm from OpenAI's Spinning Up resources.  This uses a critic value function $V$ trained to predict return.

![alt text](https://spinningup.openai.com/en/latest/_images/math/262538f3077a7be8ce89066abbab523575132996.svg)

---

## Task 1: Getting up to speed [14pts]
We have provided template code to get started.
For an overview, the files are: `models.py`, `pg_buffer.py`, `main.py`, `utils.py`.  You need only modify `main.py`, but you may modify the others if you so choose.
- `models.py` has the implementation of the networks and the action distributions we will use
- `pg_buffer.py` has the implementation of the policy gradient buffer (similar to a replay buffer, but only for the most recent on-policy data)
- `main.py` has the (incomplete) implementation of the policy gradient losses and training loop
- `utils.py` has utility (helper) functions


### `models.py`

#### 1.a.  Read `models.py` [1pts]
Read through `models.py` and describe what the contained classes do and how they are used.  Include notes that also help support your own understanding and any questions you have.  If you find an answer to these questions later, you can write it here. Pay attention to the distributions used and how they are parameterized, and also what is different between data collection and optimization time.

models.py contains 4 classes: 


*   `Network`, the network used for both the actor and the critic. It has a ReLU hidden layer, and its output layer has weights and bias initialized to 0.
*   `DiscreteActor`, the actor network when actions are discrete. It chooses one discrete action from a Categorical distribution parameterized by the log probabilities output by the network.
*   `GaussianActor`, the actor network for continuous actions. It samples from a normal distribution parameterized by a mean mu output by the network and a standard deviation.
*   `ActorCritic`, whose constructor builds a Discrete or Gaussian actor network depending on the input, plus a value function network; `act` calls `step`, which runs a single forward pass without backprop.



#### 1.b.  Categorical distribution [1pts]
Imagine we have 4 possible actions {0, 1, 2, 3}, and our network outputs logits of `[-2.5, 0.9, 2.4, 3.7]`.  How does `Categorical` convert this into a valid probability distribution, i.e., having probabilities that sum to 1?  What mathematical function is used and what would be the probabilities returned in this case?

In [None]:
import torch
import torch.nn.functional as F
x = torch.tensor([-2.5, 0.9, 2.4, 3.7])
x = F.softmax(x, dim = -1)
print(x)
print("sum(x):", sum(x))

tensor([0.0015, 0.0455, 0.2041, 0.7489])
sum(x): tensor(1.)


Softmax is used to find the probability of each action. In this case, the probabilities for actions {0, 1, 2, 3} are {0.0015, 0.0455, 0.2041, 0.7489} respectively.
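
As a cross-check, the softmax can also be written out by hand, $\pi(i) = e^{z_i} / \sum_j e^{z_j}$ (with $z$ denoting the logits, a symbol introduced here), and compared against what `Categorical` stores in `probs`. This is a small sketch that assumes only the logits from the question:

```python
import torch
from torch.distributions.categorical import Categorical

logits = torch.tensor([-2.5, 0.9, 2.4, 3.7])

# Softmax written out explicitly: exponentiate, then normalize.
manual = torch.exp(logits) / torch.exp(logits).sum()

# Probabilities that Categorical derives from the same logits.
probs = Categorical(logits=logits).probs

print(manual)
print(probs)
print("match:", torch.allclose(manual, probs))
```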



#### 1.c. Gradient of Categorical distribution [3pts]
Continuing from the previous question, assume that we sample from that distribution such that we choose the action corresponding to index 2 (i.e., $a_t = 2$).  Now we want to compute the log prob gradient of this action.  What would be the value of this gradient with respect to all of the logit inputs? In other words, what is $\nabla_{\text{logits}} \text{log}(\pi(a_t))$ if $\pi$ is our Categorical?

You can solve this either by deriving the gradient on paper using your answer from 1.b. or by empirically computing it with code.  In the latter case, you may use the pseudocode below, but you must write a mathematical expression for how the logit gradients are related to the probabilities of the Categorical (`c.probs`).

```
logits = torch.nn.Parameter(torch.tensor([-2.5, 0.9, 2.4, 3.7]))   # imagine these came from the output of the network
c = Categorical(logits=logits)
a_t = torch.tensor(2)  # imagine this came from c.sample()
logp = c.log_prob(a_t)
logp.backward()
print(logits.grad)
```

In [None]:
from torch.distributions.categorical import Categorical
logits = torch.nn.Parameter(torch.tensor([-2.5, 0.9, 2.4, 3.7]))   # imagine these came from the output of the network
c = Categorical(logits=logits)
a_t = torch.tensor(2)  # imagine this came from c.sample()
logp = c.log_prob(a_t)
print(torch.tensor([0, 0, 1, 0]) - c.probs)
# This is the logit gradient computed from the categorical probabilities:
# a one-hot indicator (1 at the chosen index, 0 otherwise) minus c.probs
logp.backward()
print(logits.grad)

tensor([-0.0015, -0.0455,  0.7959, -0.7489], grad_fn=<SubBackward0>)
tensor([-0.0015, -0.0455,  0.7959, -0.7489])


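$\frac{\partial}{\partial \text{logit}_i} \text{log}(\pi(a_t)) = \frac{1}{\pi(a_t)} \cdot \pi(a_t)(1 - \pi(a_t)) = 1 - \pi(a_t)$ when $i = a_t$ (the index of the chosen action), and $-\pi(i)$ for every other index $i$. In vector form, $\nabla_{\text{logits}} \text{log}(\pi(a_t)) = \text{onehot}(a_t) - \pi$, where $\pi$ is the vector of probabilities `c.probs`.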
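
The same result can be derived on paper. A short sketch, writing the logits as $z_i$ (a symbol introduced here for illustration) and using the softmax from 1.b:

$$\text{log}(\pi(a_t)) = z_{a_t} - \text{log}\sum_j e^{z_j}$$

$$\frac{\partial}{\partial z_i} \text{log}(\pi(a_t)) = \mathbb{1}[i = a_t] - \frac{e^{z_i}}{\sum_j e^{z_j}} = \mathbb{1}[i = a_t] - \pi(i)$$

For $a_t = 2$ this gives $[0, 0, 1, 0] - [0.0015, 0.0455, 0.2041, 0.7489] = [-0.0015, -0.0455, 0.7959, -0.7489]$, which matches the printed `logits.grad`.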

#### 1.d. Gaussian actor [2pts]
Now imagine we have a continuous action space with 2 actions, and our network outputs a mu of `[0.0, 1.2]`.  Then assume we sampled from that distribution to get $a_t = [0.1, 1.0]$.  What is $\nabla_\mu \text{log}(\pi_\mu(a_t))$ if $\pi$ is our Normal?  Give the value for this case, and write a mathematical expression for the gradient value in general, as a function of $\mu$ and $a_t$.

In [None]:
import numpy as np
import torch
from torch.distributions.normal import Normal
mu = torch.nn.Parameter(torch.tensor([0.0, 1.2]))
std = torch.exp(torch.tensor(-0.5 * np.ones(2, dtype=np.float32)))
d = Normal(mu, std)
a = torch.tensor([0.1, 1.0])
logp = d.log_prob(a).sum(axis=-1)
print((a-mu) / (std * std)) # This is the gradient of the normal log-probability w.r.t. the mean
logp.backward()
print(mu.grad)

tensor([ 0.2718, -0.5437], grad_fn=<DivBackward0>)
tensor([ 0.2718, -0.5437])


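$\nabla_{\mu} \text{log}(\pi_\mu(a_t)) = \frac{1}{\pi_\mu(a_t)} \cdot \pi_\mu(a_t) \cdot \frac{a_t - \mu}{\sigma^2} = \frac{a_t - \mu}{\sigma^2}$, where $\sigma$ is the standard deviation (`std`). In this case the value is $[0.2718, -0.5437]$.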
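
A brief derivation sketch, assuming independent action dimensions indexed by $k$ (an index introduced here) and a fixed standard deviation $\sigma$, as in the cell above where `std` $= e^{-0.5}$ and therefore $\sigma^2 = e^{-1}$:

$$\text{log}(\pi_\mu(a_t)) = \sum_k \left[ -\frac{(a_{t,k} - \mu_k)^2}{2\sigma^2} \right] + \text{const}$$

$$\frac{\partial}{\partial \mu_k} \text{log}(\pi_\mu(a_t)) = \frac{a_{t,k} - \mu_k}{\sigma^2}$$

Here "const" collects the terms that do not depend on $\mu$. Plugging in the numbers: $(0.1 - 0.0) \cdot e \approx 0.2718$ and $(1.0 - 1.2) \cdot e \approx -0.5437$.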

#### 1.e. Meaning of these gradients [1pts]
For both continuous and discrete actions, what are these gradients telling us to do, in terms of the logits and the mus and the actions chosen?  

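The gradients tell us how to make the chosen action more likely: for discrete actions, by increasing the logit of the chosen action relative to the others, and for continuous actions, by shifting the mean towards the chosen action.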

###  `pg_buffer.py`

This code implements a buffer used to store the data we accumulate so we can process it in a batch.
Notably, it also computes GAE-Lambda Advantages. To answer the questions below, you should first skim the GAE paper, including at least the abstract and Equation 1 with the different options for $\Psi$ (`psi`): https://arxiv.org/pdf/1506.02438.pdf.  


#### 1.f  Why use GAE-lambda? [1pts]
What is the main argument from the GAE paper about why one should use the Advantage function, (rather than sum of discounted future reward for example) for our choice of $A_t$?

From the paper: the advantage function yields almost the lowest possible variance. It measures whether an action is better or worse than the policy's default behavior, so that a step in the policy gradient direction increases the probability of better-than-average actions and decreases the probability of worse-than-average actions.

#### 1.g  Paper to code correspondence [1pts]
See the `finish_path` function.  In which line of the GAE algorithm (pg 8) would you call it? And which equation in the GAE paper does the `adv_buf` line (`pg_buffer.py:61`) correspond to?

`finish_path` implements the line of the algorithm that computes the advantage estimates at all timesteps (the third line inside the for loop of the paper's algorithm). The `adv_buf` line corresponds to equation (16) in the paper.
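
The computation behind that correspondence can be sketched as follows. This is a minimal NumPy sketch under stated assumptions, not the contents of `pg_buffer.py`: the function names `discount_cumsum` and `gae_advantages`, their arguments, and the default `gamma` and `lam` values are hypothetical and chosen only for illustration.

```python
import numpy as np

def discount_cumsum(x, discount):
    """Return y with y[t] = sum over l of discount**l * x[t + l]."""
    out = np.zeros_like(x, dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(x))):
        running = x[t] + discount * running
        out[t] = running
    return out

def gae_advantages(rewards, values, last_val, gamma=0.99, lam=0.95):
    """GAE-Lambda advantages and rewards-to-go for one trajectory segment.

    last_val is the bootstrap value for the state after the final step
    (0 if the trajectory ended in a terminal state).
    """
    rews = np.append(rewards, last_val)
    vals = np.append(values, last_val)
    # TD residuals: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    deltas = rews[:-1] + gamma * vals[1:] - vals[:-1]
    # Advantage: sum over l of (gamma * lam)**l * delta_{t+l}
    adv = discount_cumsum(deltas, gamma * lam)
    # Rewards-to-go, used as regression targets for the value function
    ret = discount_cumsum(rews, gamma)[:-1]
    return adv, ret

adv, ret = gae_advantages(np.array([1.0, 1.0, 1.0]), np.array([0.2, 0.1, 0.0]), last_val=0.0)
print("advantages:", adv)
print("rewards-to-go:", ret)
```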

### `main.py`

#### 1.h. Read `main.py` [2pts]

Read through the code and write down any notes that help your understanding of what is going on, as well as any questions you have.

In each epoch, we run a number of environment steps and store the observations in the buffer, handling the ends of episodes and trajectories along the way. We then periodically call `update`, which runs the policy optimizer for a number of iterations and the value optimizer for a number of iterations, using the loss functions `compute_loss_pi` and `compute_loss_v` respectively. These are the two places we need to implement.
Question: what is the `pi_info` output from `compute_loss_pi`? How is it used?


#### 1.i. Order of data collection and updating [1pts]
Note the order that we collect data and run optimization in.  How many steps do we collect data before running an update (w/ default args)?  Then how many optimization steps do we run in `update` by default?

By default we collect 1000 steps of data before running an update; this number is stored in the `steps_per_epoch` argument.  
In `update`, we run `train_pi_iters` (default 4) policy updates and `train_v_iters` (default 40) value function updates.

#### 1.j. End of episode handling [1pts]
Describe how the episode terminals / timeouts are handled

If a timeout occurs or the end of the epoch is reached before the trajectory reaches a terminal state, we bootstrap the value target from the value network; otherwise we use v = 0. We then pass this value to the `finish_path` function to compute the advantage estimates and rewards-to-go for each state. If the trajectory reached a terminal state, we save the episode reward and length. Then we reset the environment.
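
The bootstrapping rule described above can be sketched as a small helper. This is a hedged illustration rather than the code in `main.py`: the function name `bootstrap_value` and its arguments are hypothetical.

```python
def bootstrap_value(timeout: bool, epoch_ended: bool, value_estimate: float) -> float:
    """Value used to close out a trajectory segment before computing advantages.

    If the trajectory was cut off (timeout or end of epoch), future reward is
    still expected, so the critic's estimate stands in for it. If the
    environment reached a true terminal state, no reward follows, so use 0.
    """
    if timeout or epoch_ended:
        return value_estimate
    return 0.0

# Cut off by the epoch boundary: keep the critic's estimate.
print(bootstrap_value(timeout=False, epoch_ended=True, value_estimate=12.5))
# True terminal state: nothing left to collect.
print(bootstrap_value(timeout=False, epoch_ended=False, value_estimate=12.5))
```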

---

## Task 2: Implementing Policy Gradient Losses [10pts]

Now you will implement the vanilla policy gradient losses.  This includes the policy gradient loss $L^{PG}$ as well as a critic loss $L^{V}$, where the critic will be used to compute better advantages. You can reference any external sources you would like, but we suggest first trying to implement the losses without these.

$$L^{PG}(\theta) = \text{log}(\pi_\theta(a_t | s_t)) A_t$$

$$L^{V}(\phi) = (V_{\phi}(s_t) - R_t)^2$$

In this homework, choose between CartPole and LunarLander, although experiment with other environments if you are feeling adventurous.  We recommend LunarLander because it is fun and more challenging than CartPole, and good policies are generally quick to learn.  It takes around 10 minutes to reach interesting behavior on a decent computer, and should be fine for this homework.  However, if you find that it is taking too long to train, you can switch to CartPole.  LunarLander also has both discrete and continuous versions so you can try both modes.

- Fill in the TODOs in the `compute_loss_pi` and `compute_loss_v` functions.
- Run your code and make sure it is correct.

The figure below gives examples of the learning curves that you can expect to see with a correct implementation.  This is LunarLander-v2 performance run with the default arg settings.  Note that watching the losses is not as helpful as it is in supervised learning. Losses in RL can be deceiving.  They can be increasing while your policy is still improving a lot.  The reverse can also happen.  They are mostly good to watch as a sanity check and diagnostic. Also note that entropy is a less obvious, but very helpful metric to watch in RL, especially for discrete action spaces.  It should not stay at its maximum, nor should it drop very quickly; it should decrease somewhat gradually, as shown in the figure. 

![example curves](./Ut7R1C9.png)
You might see something slightly different due to small differences in your implementation.  Command to run: `tensorboard --logdir=logs/`
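
To calibrate what "maximum entropy" means when reading the entropy curve, here is a small sketch. It assumes a discrete action space with 4 actions (as in the discrete LunarLander); the peaked logits are simply reused from Task 1.b for contrast.

```python
import math
import torch
from torch.distributions.categorical import Categorical

n_actions = 4  # assumption: 4 discrete actions

# A uniform policy has the largest possible entropy, log(n_actions).
uniform = Categorical(logits=torch.zeros(n_actions))
max_entropy = math.log(n_actions)

# A peaked policy (logits from Task 1.b) has lower entropy.
peaked = Categorical(logits=torch.tensor([-2.5, 0.9, 2.4, 3.7]))

print("uniform entropy:", uniform.entropy().item())
print("log(n_actions): ", max_entropy)
print("peaked entropy: ", peaked.entropy().item())
```

An `ent` value near $\text{log}(4) \approx 1.386$ therefore indicates a policy that is still close to uniform, and the value falls as the policy commits to particular actions.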

In [None]:
# ANSWERS for Task 2

# Copy your completed functions (or relevant sections) from main.py and paste them here
# Set up function for computing policy loss
def compute_loss_pi(batch):
    obs, act, psi, logp_old = batch['obs'], batch['act'], batch['psi'], batch['logp']
    pi, logp = ac.pi(obs, act)

    # Policy loss
    if args.loss_mode == 'vpg':
        # TODO (Task 2): implement vanilla policy gradient loss
        loss_pi = -(psi*logp).mean()
    # elif args.loss_mode == 'ppo':
    #     # TODO (Task 4): implement clipped PPO loss
    else:
        raise Exception('Invalid loss_mode option', args.loss_mode)

    # Useful extra info
    approx_kl = (logp_old - logp).mean().item()
    ent = pi.entropy().mean().item()
    pi_info = dict(kl=approx_kl, ent=ent)

    return loss_pi, pi_info

# Set up function for computing value loss
def compute_loss_v(batch):
    obs, ret = batch['obs'], batch['ret']
    v = ac.v(obs)
    loss_v = ((v - ret) ** 2).mean()  # mean squared error between predicted value and return
    return loss_v
# Figure and logs: see question below

---

## Task 3: Experimenting with the code [11pts]
 
Once you verify your losses are correct by seeing that your policy starts learning, you will run some experiments.  For this, we have created several command line options that can be used to vary parameters, as well as a logging system that prints to stdout and logs scalars (and optionally gifs) to TensorBoard.  

#### 3.a.  REINFORCE vs. GAE-Lambda [3pts]

As the GAE paper discusses, there are many possible choices for the advantage term to use in the policy gradient.  One of the first ones imagined is the discounted future return (`future_return` in the code).  This choice leads to the REINFORCE algorithm, excerpted below from the [Sutton book Chapter 13](http://incompleteideas.net/book/the-book-2nd.html) for your reference (where $G$ is the discounted future return):

![REINFORCE](./WzyIzgg.png)

You will compare the REINFORCE advantage (discounted return) to the GAE-Lambda advantage.  Before you run the experiment, write down what you think will happen.  Why might REINFORCE do better or why might GAE-Lambda do better for this environment? Then run the following two experiments and measure the difference.  You should run them for at least 100 epochs, and maybe more if you want.  Then write down what happened and include a TensorBoard screenshot with both results.

```
python3 main.py --psi_mode=future_return --prefix=logs/3a/ --epochs=100  # you can make these longer if you want
```

```
python3 main.py --psi_mode=gae --prefix=logs/3a/ --epochs=100
```
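
For intuition about what is being compared, the GAE paper writes the estimator as a discounted sum of TD residuals and gives its two limiting cases. In the notation of that paper (with $\delta_t$ the TD residual and $V$ the value function):

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad \hat{A}_t^{GAE(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}$$

$$\hat{A}_t^{GAE(\gamma, 0)} = \delta_t, \qquad \hat{A}_t^{GAE(\gamma, 1)} = \sum_{l=0}^{\infty} \gamma^l r_{t+l} - V(s_t)$$

With $\lambda = 1$ the estimator is the discounted return minus a value baseline, which has high variance because it sums many reward terms. With $\lambda = 0$ it has much lower variance but is biased whenever $V$ is inaccurate. Intermediate values of $\lambda$ trade off between the two.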

In [None]:
# ANSWERS for Task 3.a

# Describe your predictions. Why might REINFORCE do better or why might GAE-Lambda do better for this environment? 
# Write down what actually happened
# Include any screenshots, logs, etc

Prediction: I think GAE will do better, because it reduces the variance of the gradient estimate while adding only a small amount of bias.
Actual result: (pink is GAE and green is REINFORCE) The REINFORCE episode length fluctuates more than that of GAE, but the two seem to achieve similar rewards at around the same point in training.
![pic](https://drive.google.com/uc?id=1nEGahO3pcNsGIFrzVide1BhSdJAtixfH)

In [None]:
(base) kuanconkandeMBP:a4 helennnnnnnn$ python3 main.py --psi_mode=gae --prefix=logs/3a/ --epochs=500
Number of parameters 4605
Epoch 0 {'ep_ret': -214.21225455581796, 'ep_len': 100.22222222222223, 'kl': 0.006314518861472607, 'ent': 1.3862947225570679, 'loss_v': 15757.0087890625, 'loss_pi': -0.010490997694432735}
Epoch 10 {'ep_ret': -166.2573821086001, 'ep_len': 103.97752808988764, 'kl': 0.000648377019729196, 'ent': 1.341350543498993, 'loss_v': 8726.1359375, 'loss_pi': 0.0075670818099752065}
Epoch 20 {'ep_ret': -179.0545925724785, 'ep_len': 145.890625, 'kl': 0.001931461409549229, 'ent': 1.191381287574768, 'loss_v': 6640.911840820312, 'loss_pi': 0.06138229067437351}
Epoch 30 {'ep_ret': -117.53276816451823, 'ep_len': 123.81538461538462, 'kl': 0.0014535660426190588, 'ent': 1.226637876033783, 'loss_v': 3815.2077026367188, 'loss_pi': 0.03848858191631734}
Epoch 40 {'ep_ret': -171.20906206035517, 'ep_len': 249.53846153846155, 'kl': 0.0009813318349188194, 'ent': 1.1815240263938904, 'loss_v': 3843.297497558594, 'loss_pi': 0.00295040481723845}
Epoch 50 {'ep_ret': -89.5848657684806, 'ep_len': 174.29268292682926, 'kl': 0.001928536222862931, 'ent': 1.1605192184448243, 'loss_v': 2668.432824707031, 'loss_pi': 0.0011303705163300038}
Epoch 60 {'ep_ret': -101.17662673128672, 'ep_len': 280.2916666666667, 'kl': 0.0007110616264981217, 'ent': 1.1283333420753479, 'loss_v': 2141.364404296875, 'loss_pi': -0.03619013940333389}
Epoch 70 {'ep_ret': -92.1969182029093, 'ep_len': 342.53846153846155, 'kl': 0.0017034097261785063, 'ent': 1.1207642793655395, 'loss_v': 2165.875341796875, 'loss_pi': -0.03355207906570286}
Epoch 80 {'ep_ret': -57.415489648974834, 'ep_len': 255.03333333333333, 'kl': 0.0016286730868159793, 'ent': 1.070542925596237, 'loss_v': 1752.8075134277344, 'loss_pi': -0.036445910762995484}
Epoch 90 {'ep_ret': -25.988981779157314, 'ep_len': 267.45, 'kl': 0.00255399796878919, 'ent': 1.0938203454017639, 'loss_v': 1093.906671142578, 'loss_pi': -0.04558830441674218}
Epoch 100 {'ep_ret': -29.846830096200204, 'ep_len': 387.46666666666664, 'kl': 0.0012903226524940692, 'ent': 0.9984289050102234, 'loss_v': 1396.8307647705078, 'loss_pi': -0.029355354653671385}
Epoch 110 {'ep_ret': -36.678575082660664, 'ep_len': 433.1875, 'kl': 0.00038354772452748876, 'ent': 1.0075349152088164, 'loss_v': 1461.2985473632812, 'loss_pi': -0.0685744141228497}
Epoch 120 {'ep_ret': -140.6204786976047, 'ep_len': 717.5, 'kl': 0.002839423010664177, 'ent': 1.0348931968212127, 'loss_v': 816.6007232666016, 'loss_pi': -0.03385047069750726}
Epoch 130 {'ep_ret': -19.97083941553056, 'ep_len': 578.2307692307693, 'kl': 0.0016026289231376722, 'ent': 0.9662361443042755, 'loss_v': 1361.4172790527343, 'loss_pi': -0.06982025774195791}
Epoch 140 {'ep_ret': 34.75954177417856, 'ep_len': 659.8, 'kl': 0.00101313726954686, 'ent': 0.901612389087677, 'loss_v': 1564.9589294433595, 'loss_pi': -0.03177134785801172}
Epoch 150 {'ep_ret': -19.041407780053323, 'ep_len': 565.7692307692307, 'kl': 0.001896938587378827, 'ent': 0.9290532410144806, 'loss_v': 1242.8437118530273, 'loss_pi': -0.045345538854599}
Epoch 160 {'ep_ret': 12.906431099590685, 'ep_len': 889.6, 'kl': 0.001034053658713674, 'ent': 0.916285115480423, 'loss_v': 778.7779067993164, 'loss_pi': -0.019634436443448068}
Epoch 170 {'ep_ret': 35.196550847728744, 'ep_len': 815.2, 'kl': 0.0007011102024989668, 'ent': 0.9107542335987091, 'loss_v': 809.6455123901367, 'loss_pi': -0.029301108606159688}
Epoch 180 {'ep_ret': 69.18715260610692, 'ep_len': 701.5833333333334, 'kl': 0.0013816051789035555, 'ent': 0.8420754492282867, 'loss_v': 965.2607315063476, 'loss_pi': -0.015486479242099449}
Epoch 190 {'ep_ret': 29.927752127368247, 'ep_len': 854.6363636363636, 'kl': 0.0005208881346334237, 'ent': 0.8842112183570862, 'loss_v': 588.5229019165039, 'loss_pi': -0.05752984154969454}
Epoch 200 {'ep_ret': 24.258699447751237, 'ep_len': 845.5454545454545, 'kl': 0.002662870446266652, 'ent': 0.873357230424881, 'loss_v': 616.0157699584961, 'loss_pi': -0.025773294316604734}
Epoch 210 {'ep_ret': -21.85577452948535, 'ep_len': 965.7, 'kl': 0.0018200343285570853, 'ent': 0.9049549996852875, 'loss_v': 318.1192813873291, 'loss_pi': -0.050817842478863895}
Epoch 220 {'ep_ret': 80.29993102284493, 'ep_len': 782.2, 'kl': 0.000753272731162724, 'ent': 0.8429440915584564, 'loss_v': 534.8887977600098, 'loss_pi': -0.0647812323179096}
Epoch 230 {'ep_ret': 10.577714065662514, 'ep_len': 908.5, 'kl': 0.003422435460379347, 'ent': 0.8780409872531891, 'loss_v': 335.5973297119141, 'loss_pi': -0.09545747777447104}
Epoch 240 {'ep_ret': 49.7608403177274, 'ep_len': 944.3, 'kl': 0.0014206269922397042, 'ent': 0.8306865274906159, 'loss_v': 237.4087341308594, 'loss_pi': -0.006168042216449976}
Epoch 250 {'ep_ret': 53.671450130152856, 'ep_len': 927.6, 'kl': 0.004968803009251133, 'ent': 0.8752592980861664, 'loss_v': 252.6212646484375, 'loss_pi': -0.08525692094117403}
Epoch 260 {'ep_ret': 126.25973980416039, 'ep_len': 675.1818181818181, 'kl': 0.009218841631081887, 'ent': 0.7324049055576325, 'loss_v': 620.1683334350586, 'loss_pi': -0.0096840504091233}
Epoch 270 {'ep_ret': 118.57827972140299, 'ep_len': 568.9230769230769, 'kl': 0.009216191718587652, 'ent': 0.7325763285160065, 'loss_v': 832.7316436767578, 'loss_pi': 0.0013507307739928365}
Epoch 280 {'ep_ret': -69.1055420511967, 'ep_len': 751.5833333333334, 'kl': 0.007795039092889056, 'ent': 0.8500260233879089, 'loss_v': 826.6844192504883, 'loss_pi': -0.0418662644457072}
Epoch 290 {'ep_ret': 38.652255872235024, 'ep_len': 984.0, 'kl': 0.0025976556731620803, 'ent': 0.8449670553207398, 'loss_v': 194.42973518371582, 'loss_pi': -0.06043443167582154}
Epoch 300 {'ep_ret': -43.01129125652659, 'ep_len': 743.0833333333334, 'kl': 0.0019573747158574406, 'ent': 0.8493375778198242, 'loss_v': 732.732071685791, 'loss_pi': -0.021871429681777955}
Epoch 310 {'ep_ret': -7.789043571402469, 'ep_len': 954.8, 'kl': 0.003240781651402358, 'ent': 0.89419926404953, 'loss_v': 199.03268737792968, 'loss_pi': -0.05941832759417594}
Epoch 320 {'ep_ret': -23.522944612684242, 'ep_len': 950.2, 'kl': 0.0027198207440960686, 'ent': 0.8851964890956878, 'loss_v': 336.1387908935547, 'loss_pi': -0.1068357102572918}
Epoch 330 {'ep_ret': -13.413878923417922, 'ep_len': 803.8, 'kl': 0.0013693542230612365, 'ent': 0.8754893839359283, 'loss_v': 705.9811080932617, 'loss_pi': -0.07177656893618405}
Epoch 340 {'ep_ret': 67.27209598613601, 'ep_len': 952.2, 'kl': 0.004187869062297978, 'ent': 0.8079078733921051, 'loss_v': 224.30722160339354, 'loss_pi': -0.0024598005693405867}
Epoch 350 {'ep_ret': 67.50918269609765, 'ep_len': 925.3, 'kl': 0.0012797439962923818, 'ent': 0.8047679424285888, 'loss_v': 287.80372581481936, 'loss_pi': -0.046231273212470114}
Epoch 360 {'ep_ret': 42.642268685157305, 'ep_len': 937.9, 'kl': 0.002992481629780741, 'ent': 0.8373865127563477, 'loss_v': 261.58691558837893, 'loss_pi': -0.05857139974832535}
Epoch 370 {'ep_ret': 120.18496580890996, 'ep_len': 776.7272727272727, 'kl': 0.0018626411765580997, 'ent': 0.7706810712814331, 'loss_v': 512.0759216308594, 'loss_pi': -0.08438789541833103}
Epoch 380 {'ep_ret': 45.013920334320346, 'ep_len': 983.5, 'kl': 0.007988379464950412, 'ent': 0.8260229706764222, 'loss_v': 161.84496574401857, 'loss_pi': -0.049442334845662114}
Epoch 390 {'ep_ret': -71.36650829682102, 'ep_len': 1000.0, 'kl': 0.00695834013458807, 'ent': 0.8806710183620453, 'loss_v': 120.03015937805176, 'loss_pi': -0.08296642457135021}
Epoch 400 {'ep_ret': -13.389217835419839, 'ep_len': 990.0, 'kl': 0.002107598018483259, 'ent': 0.8742916047573089, 'loss_v': 158.48674468994142, 'loss_pi': -0.09788168556988239}
Epoch 410 {'ep_ret': 7.390961623991513, 'ep_len': 976.4, 'kl': 0.0018088463868480176, 'ent': 0.8526305675506591, 'loss_v': 256.62505531311035, 'loss_pi': -0.09108900967985392}
Epoch 420 {'ep_ret': 62.477725316688755, 'ep_len': 799.0, 'kl': 0.002091269294032827, 'ent': 0.7828056991100312, 'loss_v': 522.996293258667, 'loss_pi': -0.05697792451828718}
Epoch 430 {'ep_ret': 81.32633840518085, 'ep_len': 844.1, 'kl': 0.016218351727729896, 'ent': 0.7374095678329468, 'loss_v': 416.4759796142578, 'loss_pi': 0.02085097467061132}
Epoch 440 {'ep_ret': 82.47965746920956, 'ep_len': 681.3333333333334, 'kl': 0.0020326384459622205, 'ent': 0.8222067534923554, 'loss_v': 751.5440383911133, 'loss_pi': -0.13409714102745057}
Epoch 450 {'ep_ret': 57.07093190746548, 'ep_len': 942.3, 'kl': 0.001255245394713711, 'ent': 0.8094427049160003, 'loss_v': 310.5242210388184, 'loss_pi': -0.0908955754712224}
Epoch 460 {'ep_ret': 50.08721667118617, 'ep_len': 915.0, 'kl': 0.0043128517578679745, 'ent': 0.8189766824245452, 'loss_v': 411.30494956970216, 'loss_pi': -0.10073938928544521}
Epoch 470 {'ep_ret': -30.701948977112217, 'ep_len': 1000.0, 'kl': 0.0020760115958182723, 'ent': 0.8794787526130676, 'loss_v': 123.20777206420898, 'loss_pi': -0.09566179290413857}
Epoch 480 {'ep_ret': -9.038131622870035, 'ep_len': 976.7, 'kl': 0.0031955501181073487, 'ent': 0.8261014759540558, 'loss_v': 225.63126220703126, 'loss_pi': -0.06613923572003841}
Epoch 490 {'ep_ret': -66.8633519392495, 'ep_len': 1000.0, 'kl': 0.01095159500837326, 'ent': 0.8654287934303284, 'loss_v': 73.85282878875732, 'loss_pi': -0.09046921581029892}
(base) kuanconkandeMBP:a4 helennnnnnnn$ tensorboard --logdir logs/3a

NOTE: Using experimental fast data loading logic. To disable, pass
    "--load_fast=false" and report issues on GitHub. More details:
    https://github.com/tensorflow/tensorboard/issues/4784

Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.6.0 at http://localhost:6006/ (Press CTRL+C to quit)
^C(base) kuanconkandeMBP:a4 helennnnnnnn
(base) kuanconkandeMBP:a4 helennnnnnnn$ python3 main.py --psi_mode=future_return --prefix=logs/3a/ --epochs=500
Number of parameters 4605
Epoch 0 {'ep_ret': -214.21225455581796, 'ep_len': 100.22222222222223, 'kl': 0.004362524952739477, 'ent': 1.3862947225570679, 'loss_v': 15757.0087890625, 'loss_pi': -0.00837092474102974}
Epoch 10 {'ep_ret': -175.65771949016408, 'ep_len': 96.64, 'kl': 0.0015320111764594913, 'ent': 1.3497097134590148, 'loss_v': 10259.399755859375, 'loss_pi': 0.0016291525767883285}
Epoch 20 {'ep_ret': -111.25963441870006, 'ep_len': 98.11224489795919, 'kl': 0.0009092620755836833, 'ent': 1.3549686193466186, 'loss_v': 1750.1989868164062, 'loss_pi': -0.009326366311870515}
Epoch 30 {'ep_ret': -75.18164732257921, 'ep_len': 122.62820512820512, 'kl': 0.0004910191200906411, 'ent': 1.27357097864151, 'loss_v': 1320.704522705078, 'loss_pi': 0.016874381963862106}
Epoch 40 {'ep_ret': -64.32146177999188, 'ep_len': 189.4047619047619, 'kl': 0.0005376249500841368, 'ent': 1.2019302010536195, 'loss_v': 2212.932275390625, 'loss_pi': -0.003817028598859906}
Epoch 50 {'ep_ret': -98.14209030574372, 'ep_len': 264.51851851851853, 'kl': 0.000467168923933059, 'ent': 1.1208897233009338, 'loss_v': 2512.967156982422, 'loss_pi': 0.0048919258872047065}
Epoch 60 {'ep_ret': -99.95631172221243, 'ep_len': 385.6666666666667, 'kl': 0.0018746814617770723, 'ent': 1.1077221989631654, 'loss_v': 1810.9427673339844, 'loss_pi': 0.0026488347444683313}
Epoch 70 {'ep_ret': -56.35701200890649, 'ep_len': 444.6111111111111, 'kl': 0.0009042753532412462, 'ent': 1.1040005087852478, 'loss_v': 1783.3175476074218, 'loss_pi': -0.0050733347423374655}
Epoch 80 {'ep_ret': 10.348820295858559, 'ep_len': 441.8666666666667, 'kl': 0.0008466617105625573, 'ent': 1.120531713962555, 'loss_v': 1258.145263671875, 'loss_pi': -0.003258096519857645}
Epoch 90 {'ep_ret': -34.18279395015618, 'ep_len': 660.4, 'kl': 0.0008491218524341093, 'ent': 1.064324402809143, 'loss_v': 974.472543334961, 'loss_pi': 0.01863491553813219}
Epoch 100 {'ep_ret': 7.214503641554719, 'ep_len': 192.57894736842104, 'kl': 0.0016742002626415342, 'ent': 1.0915809273719788, 'loss_v': 1253.2992370605468, 'loss_pi': -0.007496039138641208}
Epoch 110 {'ep_ret': -43.98123747140541, 'ep_len': 480.7857142857143, 'kl': 0.0017073394454200752, 'ent': 1.0606858611106873, 'loss_v': 1408.8691497802733, 'loss_pi': 0.0024059736635535954}
Epoch 120 {'ep_ret': -78.05684680782818, 'ep_len': 565.2857142857143, 'kl': 0.0007273736562638077, 'ent': 1.0709572792053224, 'loss_v': 1555.684375, 'loss_pi': 0.0011752949678339065}
Epoch 130 {'ep_ret': 18.822912853017105, 'ep_len': 390.85714285714283, 'kl': 0.0020747775852214545, 'ent': 1.128619408607483, 'loss_v': 1023.4502593994141, 'loss_pi': -0.015169486636295915}
Epoch 140 {'ep_ret': 15.850676912718399, 'ep_len': 775.3636363636364, 'kl': 0.0007529568232712336, 'ent': 1.1441360592842102, 'loss_v': 821.6227508544922, 'loss_pi': -0.015390200921683573}
Epoch 150 {'ep_ret': -23.38798267334431, 'ep_len': 295.2916666666667, 'kl': 0.0030442818475421517, 'ent': 1.004940241575241, 'loss_v': 1553.9721801757812, 'loss_pi': -0.02101633595302701}
Epoch 160 {'ep_ret': 28.66856567385297, 'ep_len': 158.47368421052633, 'kl': 0.00585164733347483, 'ent': 1.0346574306488037, 'loss_v': 1329.787579345703, 'loss_pi': -0.01646288074553013}
Epoch 170 {'ep_ret': 27.097928727620683, 'ep_len': 133.46511627906978, 'kl': 0.0015714748413302004, 'ent': 0.98082737326622, 'loss_v': 1718.147637939453, 'loss_pi': -0.013091415003873407}
Epoch 180 {'ep_ret': 16.112069076869606, 'ep_len': 499.7142857142857, 'kl': 0.001414133316211519, 'ent': 0.996200966835022, 'loss_v': 1753.9300415039063, 'loss_pi': -0.02394020827487111}
Epoch 190 {'ep_ret': 28.623170948717945, 'ep_len': 207.46511627906978, 'kl': 0.0007418748296913691, 'ent': 0.8685294449329376, 'loss_v': 1652.2901733398437, 'loss_pi': 0.018031103303655982}
Epoch 200 {'ep_ret': 48.763004651607616, 'ep_len': 404.63157894736844, 'kl': 0.00037728732713731005, 'ent': 0.8106696486473084, 'loss_v': 2203.025958251953, 'loss_pi': -0.0019447423983365297}
Epoch 210 {'ep_ret': 86.98540366786122, 'ep_len': 440.29411764705884, 'kl': 0.0014470923342742026, 'ent': 0.7591051518917084, 'loss_v': 2612.017578125, 'loss_pi': 0.02739402123261243}
Epoch 220 {'ep_ret': 39.24338478664355, 'ep_len': 255.09375, 'kl': 0.0009750309217452013, 'ent': 0.738649332523346, 'loss_v': 3033.1204345703127, 'loss_pi': -0.010489753528963775}
Epoch 230 {'ep_ret': 77.08964867642143, 'ep_len': 307.3076923076923, 'kl': 0.0012280937909963541, 'ent': 0.6735723048448563, 'loss_v': 2133.8473999023436, 'loss_pi': 0.020718086860142648}
Epoch 240 {'ep_ret': 59.36440378177991, 'ep_len': 426.47058823529414, 'kl': 0.00027522643940756095, 'ent': 0.7695506572723388, 'loss_v': 2592.6838928222655, 'loss_pi': 0.026612690836191177}
Epoch 250 {'ep_ret': 29.6744636873788, 'ep_len': 500.0, 'kl': 0.0004724642749351915, 'ent': 0.7596228182315826, 'loss_v': 2004.7586608886718, 'loss_pi': -0.024955570080783217}
Epoch 260 {'ep_ret': 39.920108784008846, 'ep_len': 384.0, 'kl': -0.00038101776117400733, 'ent': 0.7442680537700653, 'loss_v': 1970.192626953125, 'loss_pi': 0.019750402518548073}
Epoch 270 {'ep_ret': 108.55350304087267, 'ep_len': 447.25, 'kl': 0.0006685426546027884, 'ent': 0.6580725252628327, 'loss_v': 1815.5755920410156, 'loss_pi': 0.048639046552125365}
Epoch 280 {'ep_ret': 131.64763699494085, 'ep_len': 363.94736842105266, 'kl': 8.691849561728304e-05, 'ent': 0.6724378824234009, 'loss_v': 1659.33447265625, 'loss_pi': 0.04276154562830925}
Epoch 290 {'ep_ret': 95.50848755409615, 'ep_len': 277.7096774193548, 'kl': 1.7202345770783722e-05, 'ent': 0.6787461400032043, 'loss_v': 1957.4685607910155, 'loss_pi': 0.021198019199073315}
Epoch 300 {'ep_ret': 99.16714954680671, 'ep_len': 392.5238095238095, 'kl': 0.004112819919828326, 'ent': 0.6940427541732788, 'loss_v': 1773.9920318603515, 'loss_pi': 0.03246861603111029}
Epoch 310 {'ep_ret': 114.29375429761886, 'ep_len': 695.8333333333334, 'kl': 0.005641216260846705, 'ent': 0.8238339245319366, 'loss_v': 647.290919494629, 'loss_pi': -0.03669326715171337}
Epoch 320 {'ep_ret': 46.68814643436355, 'ep_len': 781.3333333333334, 'kl': 0.004579547824687324, 'ent': 0.865211009979248, 'loss_v': 647.4459350585937, 'loss_pi': -0.015990993287414313}
Epoch 330 {'ep_ret': -76.14656067987885, 'ep_len': 912.4, 'kl': 0.0005285488110530423, 'ent': 0.9263999938964844, 'loss_v': 919.6610412597656, 'loss_pi': -0.026798319071531296}
Epoch 340 {'ep_ret': -55.389237023518, 'ep_len': 873.2, 'kl': 0.0017460110830143094, 'ent': 0.948418939113617, 'loss_v': 812.2001007080078, 'loss_pi': -0.02685932591557503}
Epoch 350 {'ep_ret': 55.818641722894014, 'ep_len': 801.4, 'kl': 0.0012072678769982303, 'ent': 0.8027755558490753, 'loss_v': 964.4644454956054, 'loss_pi': 0.05821645453106612}
Epoch 360 {'ep_ret': 188.5627082484432, 'ep_len': 469.29411764705884, 'kl': 0.0003635070774180349, 'ent': 0.6674540787935257, 'loss_v': 1458.2451110839843, 'loss_pi': 0.03690716773271561}
Epoch 370 {'ep_ret': 157.27367939677745, 'ep_len': 416.6666666666667, 'kl': 0.04238936559559079, 'ent': 0.7511776089668274, 'loss_v': 1292.0616577148437, 'loss_pi': -0.024812142737209796}
Epoch 380 {'ep_ret': 11.114534690846332, 'ep_len': 243.71428571428572, 'kl': 0.010713674174621702, 'ent': 0.7247811019420624, 'loss_v': 2728.200506591797, 'loss_pi': -0.027063210494816303}
Epoch 390 {'ep_ret': 13.206942083915285, 'ep_len': 118.4225352112676, 'kl': 0.006852838676422834, 'ent': 0.7154208958148957, 'loss_v': 2043.5300537109374, 'loss_pi': 0.01563389743678272}
Epoch 400 {'ep_ret': 23.047181479932092, 'ep_len': 117.56962025316456, 'kl': 0.0002469066035700962, 'ent': 0.6691586196422576, 'loss_v': 2117.0609741210938, 'loss_pi': 0.060180356912314895}
Epoch 410 {'ep_ret': 23.481167755485632, 'ep_len': 123.08108108108108, 'kl': 0.00013954939204268156, 'ent': 0.7106789767742157, 'loss_v': 2353.3981689453126, 'loss_pi': 0.0502410676330328}
Epoch 420 {'ep_ret': 37.057357761830644, 'ep_len': 144.05084745762713, 'kl': 0.0020232188107911497, 'ent': 0.6804318994283676, 'loss_v': 2401.059716796875, 'loss_pi': 0.028392123663797973}
Epoch 430 {'ep_ret': 88.68474307801569, 'ep_len': 235.21875, 'kl': 0.0005140680499607697, 'ent': 0.5895979762077331, 'loss_v': 2214.8840576171874, 'loss_pi': 0.002825002744793892}
Epoch 440 {'ep_ret': 64.420930853162, 'ep_len': 279.06451612903226, 'kl': 0.0011627026775386184, 'ent': 0.6114434421062469, 'loss_v': 2768.2092407226564, 'loss_pi': -0.03538628695532679}
Epoch 450 {'ep_ret': 14.771152394516, 'ep_len': 245.26470588235293, 'kl': 0.0011812014156021178, 'ent': 0.6577556133270264, 'loss_v': 3580.3987060546874, 'loss_pi': -0.033512470638379456}
Epoch 460 {'ep_ret': -20.457958405449855, 'ep_len': 348.16, 'kl': 0.0009727408802973514, 'ent': 0.7519823253154755, 'loss_v': 3224.285534667969, 'loss_pi': -0.043255950696766375}
Epoch 470 {'ep_ret': -6.486748150772264, 'ep_len': 370.3478260869565, 'kl': 0.0004704127233708277, 'ent': 0.7448117136955261, 'loss_v': 2914.9983032226564, 'loss_pi': -0.08195782182738184}
Epoch 480 {'ep_ret': -11.004519454319393, 'ep_len': 395.95238095238096, 'kl': -2.4445558665320277e-05, 'ent': 0.7414065003395081, 'loss_v': 2829.82529296875, 'loss_pi': -0.05361060313880443}
Epoch 490 {'ep_ret': 38.11055760516861, 'ep_len': 429.52941176470586, 'kl': 0.0012660471441904519, 'ent': 0.6563057035207749, 'loss_v': 2302.3315551757814, 'loss_pi': -0.04170649778097868}
(base) kuanconkandeMBP:a4 helennnnnnnn$ tensorboard --logdir logs/3a

NOTE: Using experimental fast data loading logic. To disable, pass
    "--load_fast=false" and report issues on GitHub. More details:
    https://github.com/tensorflow/tensorboard/issues/4784

Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.6.0 at http://localhost:6006/ (Press CTRL+C to quit)
^C(base) kuanconkandeMBP:a4 helennnnnnnn$ 


#### 3.b.  Running with different numbers of policy training steps / vanilla policy gradient failure [3pts]

One issue with vanilla policy gradient methods is that they are fairly unstable.
In general, one cannot run too many update steps on the most recent batch of data, because the policy will then overfit to that local data.
It will update too much based on that local gradient estimate, and the update will eventually cause more harm than good.
Once this happens, it is very difficult to recover.

This is a well-known issue that motivated the development of TRPO and PPO, and you are going to test it for yourself. By default, our code only runs 4 policy iterations during each update phase. What happens if you try to run more?  Try the following experiments, include a screenshot, and write some thoughts you had about this.  Anything expected or unexpected?  (Note that you will rerun these experiments with PPO in a minute.)

```
python3 main.py --prefix=logs/3b/ --train_pi_iters=4  --epochs=150 # you can also just keep your results from part a
```

```
python3 main.py --prefix=logs/3b/ --train_pi_iters=10 --epochs=150  
```

```
python3 main.py --prefix=logs/3b/ --train_pi_iters=20 --epochs=150
```
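As a rough illustration of why extra policy iterations on the same batch are risky, the sketch below repeatedly applies the VPG gradient to a toy 4-action softmax policy using one fixed batch, then reports the KL divergence from the data-collecting policy. The advantages, learning rate, and function names are made up for illustration and are not part of the assignment code. The exact expectation over the old policy stands in for a sampled batch.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def kl_after_vpg_steps(n_steps, lr=0.3):
    """Take n_steps gradient steps on the VPG surrogate using ONE fixed batch.

    Hypothetical setup: 4 actions, a uniform data-collecting policy, and
    hand-picked advantages (psi). Returns KL(old || new).
    """
    adv = [1.0, -1.0, 0.5, -0.5]   # assumed per-action advantages
    logits = [0.0, 0.0, 0.0, 0.0]  # uniform initial policy
    old = softmax(logits)
    # Weight of each action in the surrogate sum_a old[a] * adv[a] * log pi[a]
    weights = [o * a for o, a in zip(old, adv)]
    for _ in range(n_steps):
        pi = softmax(logits)
        # Derivative of sum_a weights[a] * log pi[a] with respect to logits[k]
        grad = [weights[k] - pi[k] * sum(weights) for k in range(len(logits))]
        logits = [z + lr * g for z, g in zip(logits, grad)]  # gradient ascent
    return kl_divergence(old, softmax(logits))

for iters in (4, 10, 20):
    print(iters, kl_after_vpg_steps(iters))
```

Nothing in the VPG loss discourages the policy from drifting further from the one that collected the data, so in this toy setting the KL keeps growing with the number of iterations. The epoch-0 `kl` values in the logs for this task show the same trend (about 0.006, 0.047, and 0.22 for 4, 10, and 20 iterations).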

In [None]:
# ANSWERS for Task 3.b

# Describe anything expected or unexpected in the experiment
# Include any screenshots, logs, etc

(orange: pi_iter = 4; blue: pi_iter = 10; red: pi_iter = 20)

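From the plot, when running 20 policy iterations per update, both episode length and return get stuck and never recover: the policy becomes very unstable, as expected. In the logs below, the entropy of the 20-iteration run collapses to roughly 0.002 by epoch 70, while the return stays near -570 and the episode length near 66, which suggests the policy has become almost deterministic and can no longer explore. The 10-iteration run is noisier than the 4-iteration run (for example, its return drops to about -306 at epoch 60), but it still recovers.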
![policy](https://drive.google.com/uc?id=1qOn7SOwxP3dkjdbMvByznhFFytd2A5q8)

In [None]:
^C(base) kuanconkandeMBP:a4 helennnnnnnn$ python3 main.py --prefix=logs/3b/ --train_pi_iters=4  --epochs=150
Number of parameters 4605
Epoch 0 {'ep_ret': -214.21225455581796, 'ep_len': 100.22222222222223, 'kl': 0.006314518861472607, 'ent': 1.3862947225570679, 'loss_v': 15757.0087890625, 'loss_pi': -0.010490997694432735}
Epoch 10 {'ep_ret': -166.2573821086001, 'ep_len': 103.97752808988764, 'kl': 0.000648377019729196, 'ent': 1.341350543498993, 'loss_v': 8726.1359375, 'loss_pi': 0.0075670818099752065}
Epoch 20 {'ep_ret': -179.0545925724785, 'ep_len': 145.890625, 'kl': 0.001931461409549229, 'ent': 1.191381287574768, 'loss_v': 6640.911840820312, 'loss_pi': 0.06138229067437351}
Epoch 30 {'ep_ret': -117.53276816451823, 'ep_len': 123.81538461538462, 'kl': 0.0014535660426190588, 'ent': 1.226637876033783, 'loss_v': 3815.2077026367188, 'loss_pi': 0.03848858191631734}
Epoch 40 {'ep_ret': -171.20906206035517, 'ep_len': 249.53846153846155, 'kl': 0.0009813318349188194, 'ent': 1.1815240263938904, 'loss_v': 3843.297497558594, 'loss_pi': 0.00295040481723845}
Epoch 50 {'ep_ret': -89.5848657684806, 'ep_len': 174.29268292682926, 'kl': 0.001928536222862931, 'ent': 1.1605192184448243, 'loss_v': 2668.432824707031, 'loss_pi': 0.0011303705163300038}
Epoch 60 {'ep_ret': -101.17662673128672, 'ep_len': 280.2916666666667, 'kl': 0.0007110616264981217, 'ent': 1.1283333420753479, 'loss_v': 2141.364404296875, 'loss_pi': -0.03619013940333389}
Epoch 70 {'ep_ret': -92.1969182029093, 'ep_len': 342.53846153846155, 'kl': 0.0017034097261785063, 'ent': 1.1207642793655395, 'loss_v': 2165.875341796875, 'loss_pi': -0.03355207906570286}
Epoch 80 {'ep_ret': -57.415489648974834, 'ep_len': 255.03333333333333, 'kl': 0.0016286730868159793, 'ent': 1.070542925596237, 'loss_v': 1752.8075134277344, 'loss_pi': -0.036445910762995484}
Epoch 90 {'ep_ret': -25.988981779157314, 'ep_len': 267.45, 'kl': 0.00255399796878919, 'ent': 1.0938203454017639, 'loss_v': 1093.906671142578, 'loss_pi': -0.04558830441674218}
Epoch 100 {'ep_ret': -29.846830096200204, 'ep_len': 387.46666666666664, 'kl': 0.0012903226524940692, 'ent': 0.9984289050102234, 'loss_v': 1396.8307647705078, 'loss_pi': -0.029355354653671385}
Epoch 110 {'ep_ret': -36.678575082660664, 'ep_len': 433.1875, 'kl': 0.00038354772452748876, 'ent': 1.0075349152088164, 'loss_v': 1461.2985473632812, 'loss_pi': -0.0685744141228497}
Epoch 120 {'ep_ret': -140.6204786976047, 'ep_len': 717.5, 'kl': 0.002839423010664177, 'ent': 1.0348931968212127, 'loss_v': 816.6007232666016, 'loss_pi': -0.03385047069750726}
Epoch 130 {'ep_ret': -19.97083941553056, 'ep_len': 578.2307692307693, 'kl': 0.0016026289231376722, 'ent': 0.9662361443042755, 'loss_v': 1361.4172790527343, 'loss_pi': -0.06982025774195791}
Epoch 140 {'ep_ret': 34.75954177417856, 'ep_len': 659.8, 'kl': 0.00101313726954686, 'ent': 0.901612389087677, 'loss_v': 1564.9589294433595, 'loss_pi': -0.03177134785801172}
(base) kuanconkandeMBP:a4 helennnnnnnn$ python3 main.py --prefix=logs/3b/ --train_pi_iters=10 --epochs=150 
Number of parameters 4605
Epoch 0 {'ep_ret': -214.21225455581796, 'ep_len': 100.22222222222223, 'kl': 0.047204405069351196, 'ent': 1.3862947225570679, 'loss_v': 15757.0087890625, 'loss_pi': -0.04691065847873688}
Epoch 10 {'ep_ret': -155.68464338754532, 'ep_len': 118.58441558441558, 'kl': 0.01681785506370943, 'ent': 1.2145734548568725, 'loss_v': 8536.910461425781, 'loss_pi': -0.010101783205755055}
Epoch 20 {'ep_ret': -131.0415370432866, 'ep_len': 174.82926829268294, 'kl': 0.03827856084681116, 'ent': 0.9881149053573608, 'loss_v': 4976.862841796875, 'loss_pi': -0.0346862341510132}
Epoch 30 {'ep_ret': -43.35768366870372, 'ep_len': 152.14285714285714, 'kl': 0.02525985613465309, 'ent': 1.008419942855835, 'loss_v': 969.1934478759765, 'loss_pi': -0.008060578879667445}
Epoch 40 {'ep_ret': -57.570326681518566, 'ep_len': 328.1904761904762, 'kl': 0.01126612345688045, 'ent': 0.9115507781505585, 'loss_v': 1392.9020080566406, 'loss_pi': -0.020805374276824294}
Epoch 50 {'ep_ret': -35.112193368656214, 'ep_len': 306.1666666666667, 'kl': 0.028931190026924014, 'ent': 0.876833838224411, 'loss_v': 1393.7796813964844, 'loss_pi': -0.01585372700355947}
Epoch 60 {'ep_ret': -306.34060477366125, 'ep_len': 121.53947368421052, 'kl': 0.1697205040603876, 'ent': 0.8673194348812103, 'loss_v': 6452.027886962891, 'loss_pi': -0.044062312599271534}
Epoch 70 {'ep_ret': -189.40420214342342, 'ep_len': 136.8955223880597, 'kl': 0.18200836339965462, 'ent': 0.8031679928302765, 'loss_v': 6432.506372070313, 'loss_pi': -0.05110915489494801}
Epoch 80 {'ep_ret': 1.636955417864454, 'ep_len': 204.39473684210526, 'kl': 0.010192935465602205, 'ent': 0.7304623574018478, 'loss_v': 4230.202294921875, 'loss_pi': 0.0679513767361641}
Epoch 90 {'ep_ret': -12.323186090371689, 'ep_len': 580.8461538461538, 'kl': 0.011491665779612959, 'ent': 0.8358985722064972, 'loss_v': 2323.0611572265625, 'loss_pi': -0.03669778229668737}
Epoch 100 {'ep_ret': 0.6435803735162903, 'ep_len': 719.0909090909091, 'kl': 0.010066089547763113, 'ent': 0.8110317051410675, 'loss_v': 2148.5935485839846, 'loss_pi': -0.03541271910071373}
Epoch 110 {'ep_ret': 53.11341591658313, 'ep_len': 394.36842105263156, 'kl': 0.011554712982615456, 'ent': 0.6501742303371429, 'loss_v': 2708.7699462890623, 'loss_pi': 0.07870282009243965}
Epoch 120 {'ep_ret': -29.064540825634605, 'ep_len': 535.6428571428571, 'kl': 0.02524184245849028, 'ent': 0.7965717732906341, 'loss_v': 1783.3997131347655, 'loss_pi': -0.043176391953602435}
Epoch 130 {'ep_ret': 86.11843384045548, 'ep_len': 493.0, 'kl': 0.032746067868356474, 'ent': 0.6479901224374771, 'loss_v': 2122.7394104003906, 'loss_pi': 0.011860024929046632}
Epoch 140 {'ep_ret': 89.27088644171086, 'ep_len': 382.5263157894737, 'kl': 0.002102634127368219, 'ent': 0.5730206042528152, 'loss_v': 2468.001989746094, 'loss_pi': 0.05800626892596483}
(base) kuanconkandeMBP:a4 helennnnnnnn$ python3 main.py --prefix=logs/3b/ --train_pi_iters=20 --epochs=150
Number of parameters 4605
Epoch 0 {'ep_ret': -214.21225455581796, 'ep_len': 100.22222222222223, 'kl': 0.2247946709394455, 'ent': 1.3862947225570679, 'loss_v': 15757.0087890625, 'loss_pi': -0.16476057469844818}
Epoch 10 {'ep_ret': -170.4581275913731, 'ep_len': 131.13235294117646, 'kl': 0.0926096479408443, 'ent': 1.090891468524933, 'loss_v': 10734.807934570312, 'loss_pi': 0.0021152547677047552}
Epoch 20 {'ep_ret': -487.58517496548313, 'ep_len': 240.64864864864865, 'kl': 0.11441107920254581, 'ent': 0.8454310953617096, 'loss_v': 23513.958154296874, 'loss_pi': 0.18345231860876082}
Epoch 30 {'ep_ret': -301.33311143693504, 'ep_len': 118.05194805194805, 'kl': 0.5170039484102744, 'ent': 0.7564738914370537, 'loss_v': 15367.602160644532, 'loss_pi': -0.18869921527802944}
Epoch 40 {'ep_ret': -569.3521327379008, 'ep_len': 67.24647887323944, 'kl': 0.005113115650601685, 'ent': 0.12293624356389046, 'loss_v': 38739.8259765625, 'loss_pi': -0.004101280658505857}
Epoch 50 {'ep_ret': -368.49466774979936, 'ep_len': 71.36842105263158, 'kl': 0.9408271691761911, 'ent': 0.2417846668511629, 'loss_v': 18914.25654296875, 'loss_pi': -0.5559942780528218}
Epoch 60 {'ep_ret': -557.7665572223823, 'ep_len': 66.8125, 'kl': 0.04847471280268571, 'ent': 0.026529814931564033, 'loss_v': 20134.58134765625, 'loss_pi': -0.031033773958915843}
Epoch 70 {'ep_ret': -587.3742794706134, 'ep_len': 68.16783216783217, 'kl': 7.878637077851636e-05, 'ent': 0.0023542185430414976, 'loss_v': 20521.65380859375, 'loss_pi': -0.002696374258448486}
Epoch 80 {'ep_ret': -577.3849836239632, 'ep_len': 66.21379310344828, 'kl': 3.713874850745924e-05, 'ent': 0.0026839184865821153, 'loss_v': 18769.57470703125, 'loss_pi': -0.001534139213617891}
Epoch 90 {'ep_ret': -591.6478047246218, 'ep_len': 67.87234042553192, 'kl': 9.307136467668897e-05, 'ent': 0.003617431048769504, 'loss_v': 18451.737109375, 'loss_pi': -0.0036560249965987167}
Epoch 100 {'ep_ret': -572.0498666359987, 'ep_len': 66.95172413793104, 'kl': 0.00013802023102016392, 'ent': 0.002031703817192465, 'loss_v': 17745.82392578125, 'loss_pi': -0.003082756641379092}
Epoch 110 {'ep_ret': -572.3060514778997, 'ep_len': 66.57931034482759, 'kl': 2.5492906547697203e-05, 'ent': 0.0017541816225275398, 'loss_v': 16713.76962890625, 'loss_pi': -0.0009542148007312789}
Epoch 120 {'ep_ret': -567.9614671400806, 'ep_len': 66.04081632653062, 'kl': 0.00015512027812292217, 'ent': 0.0036479943024460225, 'loss_v': 15993.2794921875, 'loss_pi': -0.0029041541594779117}
Epoch 130 {'ep_ret': -557.0199635922386, 'ep_len': 65.75167785234899, 'kl': 0.00016278600496661967, 'ent': 0.0040604508365504445, 'loss_v': 15359.5263671875, 'loss_pi': -0.003336326847784221}
Epoch 140 {'ep_ret': -569.8522755286215, 'ep_len': 66.26896551724138, 'kl': 0.001590388221086414, 'ent': 0.009696421585977078, 'loss_v': 15140.78603515625, 'loss_pi': -0.011464910081122071}
(base) kuanconkandeMBP:a4 helennnnnnnn$ tensorboard --logdir logs/3b

NOTE: Using experimental fast data loading logic. To disable, pass
    "--load_fast=false" and report issues on GitHub. More details:
    https://github.com/tensorflow/tensorboard/issues/4784

Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.6.0 at http://localhost:6006/ (Press CTRL+C to quit)



#### 3.c.  Design your own experiment(s) [5pts]

NOTE: you can defer this until after implementing the PPO loss if you wish

Now you get to design your own experiments.  Perhaps you are curious about how the learning rate affects things or how a different network would work.  This is your chance to experiment with whatever you think would be interesting to try.
You are free to make any modifications to the code that you would like to run your experiments.

Here is a further list of ideas:
- network arch, activations
- learning rates
- implementing other $\psi$ versions (e.g., #6)
- continuous environment.  comparing how LunarLander does against LunarLanderContinuous
- effect of gamma parameter
- effect of lambda parameter
- how much better performance is if we use the mean action instead of sampling (deterministic forward pass, for evaluation)
- how different random seeds compare
- anything else that you think of
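
For the evaluation idea in the list above, one possible approach is to keep sampling during training but pick the most likely action (discrete case) or the mean of the Gaussian (continuous case) at test time. The sketch below uses plain Python with hypothetical helper names. It is not how `main.py` selects actions, which would need to be checked before wiring this in.

```python
import random

def select_discrete_action(probs, deterministic=False, rng=random):
    """Pick an action index from categorical probabilities."""
    if deterministic:
        # Evaluation: take the most likely action.
        return max(range(len(probs)), key=lambda i: probs[i])
    # Training: sample so that the policy keeps exploring.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def select_continuous_action(mean, std, deterministic=False, rng=random):
    """Pick a scalar action from a Gaussian policy head."""
    if deterministic:
        # Evaluation: use the mean of the distribution.
        return mean
    # Training: sample around the mean.
    return rng.gauss(mean, std)

probs = [0.1, 0.7, 0.2]  # made-up action probabilities
print(select_discrete_action(probs, deterministic=True))
print(select_continuous_action(0.3, 0.5, deterministic=True))
```

Comparing returns with `deterministic=True` against sampled rollouts of the same trained policy would isolate how much of the evaluation noise comes from exploration.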

Describe what you are testing, and your predictions about what is going to happen.
Then run the experiment and report the results, including screenshots.

In [None]:
# ANSWERS for Task 3.c

# Describe what you are testing and your predictions
# Include any screenshots, logs, etc

(This is done with PPO.)
Trying out LunarLanderContinuous-v2. It's hard to say which one PPO would perform better on, the continuous or the discrete action space... Guess I'll find out :)

Result: performance seems to be consistent with the discrete action space, with a higher episode return for the continuous version (in the logs, it peaks at about 113 at epoch 70, versus about 7 for the discrete PPO run in Task 4). (Red: continuous; Orange: discrete)
![continuous](https://drive.google.com/uc?id=1Q2jcmjP6jtV6XTbkeob4o77FJVo__ghh)

In [None]:
(base) kuanconkandeMBP:a4 helennnnnnnn$ python3 main.py --loss_mode=ppo --prefix=logs/4/ --train_pi_iters=20 --epochs=150 --env LunarLanderContinuous-v2
Number of parameters 4205
Epoch 0 {'ep_ret': -226.08613767180157, 'ep_len': 111.11111111111111, 'kl': 0.0021905219182372093, 'ent': 0.9189384579658508, 'loss_v': 17021.583984375, 'loss_pi': -0.01019517332315445}
Epoch 10 {'ep_ret': -132.4524613408711, 'ep_len': 114.84415584415585, 'kl': 0.003693257097620517, 'ent': 0.9558463871479035, 'loss_v': 5843.788854980468, 'loss_pi': -0.00970200877636671}
Epoch 20 {'ep_ret': -60.87723395103291, 'ep_len': 137.39705882352942, 'kl': 0.004876104796130676, 'ent': 0.9910628616809845, 'loss_v': 1485.0810241699219, 'loss_pi': -0.009947406873106957}
Epoch 30 {'ep_ret': -7.6728565195040614, 'ep_len': 165.38095238095238, 'kl': 0.005269259307533502, 'ent': 1.000085985660553, 'loss_v': 1082.0081420898437, 'loss_pi': -0.009438554663211108}
Epoch 40 {'ep_ret': -8.813910378285966, 'ep_len': 216.55172413793105, 'kl': 0.005722285737283528, 'ent': 0.9873232543468475, 'loss_v': 1669.6092224121094, 'loss_pi': -0.009350317250937223}
Epoch 50 {'ep_ret': 52.579065879911134, 'ep_len': 488.0, 'kl': 0.001812034723116085, 'ent': 0.9573383092880249, 'loss_v': 1313.1483459472656, 'loss_pi': -0.007921617990359665}
Epoch 60 {'ep_ret': 100.26701598817598, 'ep_len': 929.2, 'kl': 0.0012358671079709892, 'ent': 0.883703351020813, 'loss_v': 455.9629364013672, 'loss_pi': -0.006723375990986824}
Epoch 70 {'ep_ret': 113.50733260925551, 'ep_len': 884.4, 'kl': -0.00011308640096103772, 'ent': 0.8255425572395325, 'loss_v': 550.6210113525391, 'loss_pi': -0.006796361971646547}
Epoch 80 {'ep_ret': 50.23747835351873, 'ep_len': 872.1818181818181, 'kl': 0.0027077826322056352, 'ent': 0.7575180470943451, 'loss_v': 525.1468444824219, 'loss_pi': -0.0070605931337922815}
Epoch 90 {'ep_ret': 10.578220962738465, 'ep_len': 955.8, 'kl': 0.0004247735138051212, 'ent': 0.697680550813675, 'loss_v': 417.23081359863284, 'loss_pi': -0.006504626478999853}
Epoch 100 {'ep_ret': -49.80091546923553, 'ep_len': 831.9090909090909, 'kl': 0.0008381779683986679, 'ent': 0.6302671194076538, 'loss_v': 549.2763198852539, 'loss_pi': -0.005889856489375234}
Epoch 110 {'ep_ret': -4.748155475413064, 'ep_len': 941.6, 'kl': 0.0015564757946776807, 'ent': 0.5867776691913604, 'loss_v': 384.8773002624512, 'loss_pi': -0.005480307200923562}
Epoch 120 {'ep_ret': 31.768905748292593, 'ep_len': 984.2, 'kl': 0.001829358353279531, 'ent': 0.5110917329788208, 'loss_v': 316.6283218383789, 'loss_pi': -0.006738315708935261}
Epoch 130 {'ep_ret': -48.01248397223802, 'ep_len': 957.3, 'kl': 0.0011212736018933356, 'ent': 0.4418627142906189, 'loss_v': 249.46277770996093, 'loss_pi': -0.005114978877827525}
Epoch 140 {'ep_ret': -49.85965469221845, 'ep_len': 995.9, 'kl': 0.0018475057542673313, 'ent': 0.4573903292417526, 'loss_v': 301.9609375, 'loss_pi': -0.006202252442017197}


---

## Task 4: Trying out the PPO clipping objective [10pts]

The following are useful resources for understanding PPO:
- [OpenAI's Spinning Up for coverage of policy gradients and PPO](https://spinningup.openai.com/en/latest/)
- [PPO paper](https://arxiv.org/pdf/1707.06347.pdf)
- [Matthew's StackOverflow Post on PPO](https://stackoverflow.com/questions/46422845/what-is-the-way-to-understand-proximal-policy-optimization-algorithm-in-rl/50663200#50663200)


Now implement the PPO clipped loss objective in the `compute_loss_pi` function. It is a small fix (only a few lines) to our policy gradients implementation.  Once you have confirmed that it is learning by running the command below, you will compare it to VPG.
```
python3 main.py --loss_mode=ppo
```
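
To build intuition for the clipped objective, here is a per-sample sketch in plain Python that follows the objective in the PPO paper. The function name and the default `clip_ratio` of 0.2 are illustrative assumptions; the value actually used in the assignment is set in `main.py`.

```python
def clipped_surrogate(ratio, adv, clip_ratio=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = max(1.0 - clip_ratio, min(ratio, 1.0 + clip_ratio))
    return min(ratio * adv, clipped * adv)

# Positive advantage: no extra credit once the ratio exceeds 1 + eps.
print(clipped_surrogate(1.5, 1.0), clipped_surrogate(3.0, 1.0))
# Negative advantage: no extra credit once the ratio falls below 1 - eps.
print(clipped_surrogate(0.5, -1.0), clipped_surrogate(0.1, -1.0))
# Moves in the harmful direction are not clipped, so they are still penalized.
print(clipped_surrogate(1.5, -1.0))
```

Because the objective is flat once the ratio leaves the interval in the favourable direction, those samples contribute zero gradient, which limits how far repeated updates on the same batch can push the policy. The policy loss is the negative mean of this quantity over the batch.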

This would have been problematic before, but now the algorithm should stay fairly stable:
```
python3 main.py --loss_mode=ppo --prefix=logs/4/ --train_pi_iters=20 --epochs=150
```
vs.

```
python3 main.py --loss_mode=vpg --prefix=logs/4/ --train_pi_iters=20 --epochs=150
```


Record the results of what happened and consider including some screenshots.  You are free to run and include any other tests that you found interesting.  You can also try to further tune PPO and find hyperparameters that make it work better.

In [None]:
# ANSWERS for Task 4

# Copy your completed function (or relevant sections) here
# Include any screenshots, logs, etc
# Describe anything else you have tried
# Set up function for computing policy loss
def compute_loss_pi(batch):
    obs, act, psi, logp_old = batch['obs'], batch['act'], batch['psi'], batch['logp']
    pi, logp = ac.pi(obs, act)

    # Policy loss
    if args.loss_mode == 'vpg':
        loss_pi = -(psi*logp).mean()
    elif args.loss_mode == 'ppo':
        ratio = (logp - logp_old).exp()
        surr_loss = ratio * psi
        clipped_loss = torch.clamp(ratio, 1.0 - args.clip_ratio, 1.0 + args.clip_ratio) * psi
        loss_pi = -torch.min(surr_loss, clipped_loss).mean()
    else:
        raise Exception('Invalid loss_mode option', args.loss_mode)

    # Useful extra info
    approx_kl = (logp_old - logp).mean().item()
    ent = pi.entropy().mean().item()
    pi_info = dict(kl=approx_kl, ent=ent)

    return loss_pi, pi_info

# Set up function for computing value loss
def compute_loss_v(batch):
    obs, ret = batch['obs'], batch['ret']
    v = ac.v(obs)
    loss_v = torch.square(torch.subtract(v, ret)).mean()
    return loss_v

(blue: vpg; orange: ppo) 

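Here PPO performs much better than VPG when running 20 policy iterations per update. Magic!! In the logged epochs, VPG's KL spikes as high as 0.94 (epoch 50) and its entropy collapses, whereas PPO's KL stays below 0.006 throughout.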
![ppo](https://drive.google.com/uc?id=1lQqgN_Z_hWxPoYLiUzVxMmsj8wO-s9mb)

In [None]:
(base) kuanconkandeMBP:a4 helennnnnnnn$ python3 main.py --loss_mode=ppo --prefix=logs/4/ --train_pi_iters=20 --epochs=150
Number of parameters 4605
Epoch 0 {'ep_ret': -214.21225455581796, 'ep_len': 100.22222222222223, 'kl': 0.00405900739133358, 'ent': 1.3862947225570679, 'loss_v': 15757.0087890625, 'loss_pi': -0.01281240489333868}
Epoch 10 {'ep_ret': -139.18536111294503, 'ep_len': 105.21111111111111, 'kl': 0.004727941410965286, 'ent': 1.3517048239707947, 'loss_v': 7285.217077636718, 'loss_pi': -0.01077119647525251}
Epoch 20 {'ep_ret': -78.66842122363883, 'ep_len': 129.04545454545453, 'kl': 0.005964844836853445, 'ent': 1.2509766340255737, 'loss_v': 1738.0119384765626, 'loss_pi': -0.01073956536129117}
Epoch 30 {'ep_ret': -63.47945944765385, 'ep_len': 215.11111111111111, 'kl': 0.004261300421785563, 'ent': 1.1393129467964171, 'loss_v': 1058.6724792480468, 'loss_pi': -0.009728099079802632}
Epoch 40 {'ep_ret': -68.50405355061837, 'ep_len': 423.4736842105263, 'kl': 0.004535131659940817, 'ent': 1.1233717322349548, 'loss_v': 1081.9293273925782, 'loss_pi': -0.009394727228209377}
Epoch 50 {'ep_ret': -35.88930477586346, 'ep_len': 385.6666666666667, 'kl': 0.004022385855205357, 'ent': 1.1003846645355224, 'loss_v': 970.7476989746094, 'loss_pi': -0.008355038054287434}
Epoch 60 {'ep_ret': -79.25308500764699, 'ep_len': 746.0, 'kl': 0.0047756223240867255, 'ent': 1.0867365002632141, 'loss_v': 766.7895050048828, 'loss_pi': -0.008290067967027425}
Epoch 70 {'ep_ret': -72.30430841638778, 'ep_len': 826.1818181818181, 'kl': 0.004455080890329555, 'ent': 1.0222611844539642, 'loss_v': 336.0399883270264, 'loss_pi': -0.008144951751455664}
Epoch 80 {'ep_ret': -28.11128844263427, 'ep_len': 960.4, 'kl': 0.0022176060185302047, 'ent': 0.980959665775299, 'loss_v': 193.86274223327638, 'loss_pi': -0.00660975503269583}
Epoch 90 {'ep_ret': 7.449606873858485, 'ep_len': 948.3, 'kl': 0.0021534039806283545, 'ent': 0.9393785059452057, 'loss_v': 173.1554153442383, 'loss_pi': -0.00541592005174607}
Epoch 100 {'ep_ret': -45.98441982774091, 'ep_len': 1000.0, 'kl': 0.0018384976203378756, 'ent': 0.9053065657615662, 'loss_v': 102.14805355072022, 'loss_pi': -0.004900675662793219}
Epoch 110 {'ep_ret': -51.21991798445468, 'ep_len': 998.7, 'kl': 0.0029897905222242118, 'ent': 0.8979464709758759, 'loss_v': 170.94191246032716, 'loss_pi': -0.005202228459529579}
Epoch 120 {'ep_ret': -85.06000419185777, 'ep_len': 983.0, 'kl': 0.0036088830674998462, 'ent': 0.8790291011333465, 'loss_v': 91.62356605529786, 'loss_pi': -0.0043179821572266516}
Epoch 130 {'ep_ret': -86.21776810246884, 'ep_len': 992.2, 'kl': 0.00443066909792833, 'ent': 0.9192185163497925, 'loss_v': 125.68757648468018, 'loss_pi': -0.0047918226104229685}
Epoch 140 {'ep_ret': -94.36478404850139, 'ep_len': 921.0, 'kl': 0.0009337107989267679, 'ent': 0.8975055158138275, 'loss_v': 160.69817810058595, 'loss_pi': -0.004136195976752788}
(base) kuanconkandeMBP:a4 helennnnnnnn$ python3 main.py --loss_mode=vpg --prefix=logs/4/ --train_pi_iters=20 --epochs=150
Number of parameters 4605
Epoch 0 {'ep_ret': -214.21225455581796, 'ep_len': 100.22222222222223, 'kl': 0.2247946709394455, 'ent': 1.3862947225570679, 'loss_v': 15757.0087890625, 'loss_pi': -0.16476057469844818}
Epoch 10 {'ep_ret': -170.4581275913731, 'ep_len': 131.13235294117646, 'kl': 0.0926096479408443, 'ent': 1.090891468524933, 'loss_v': 10734.807934570312, 'loss_pi': 0.0021152547677047552}
Epoch 20 {'ep_ret': -487.58517496548313, 'ep_len': 240.64864864864865, 'kl': 0.11441107920254581, 'ent': 0.8454310953617096, 'loss_v': 23513.958154296874, 'loss_pi': 0.18345231860876082}
Epoch 30 {'ep_ret': -301.33311143693504, 'ep_len': 118.05194805194805, 'kl': 0.5170039484102744, 'ent': 0.7564738914370537, 'loss_v': 15367.602160644532, 'loss_pi': -0.18869921527802944}
Epoch 40 {'ep_ret': -569.3521327379008, 'ep_len': 67.24647887323944, 'kl': 0.005113115650601685, 'ent': 0.12293624356389046, 'loss_v': 38739.8259765625, 'loss_pi': -0.004101280658505857}
Epoch 50 {'ep_ret': -368.49466774979936, 'ep_len': 71.36842105263158, 'kl': 0.9408271691761911, 'ent': 0.2417846668511629, 'loss_v': 18914.25654296875, 'loss_pi': -0.5559942780528218}
Epoch 60 {'ep_ret': -557.7665572223823, 'ep_len': 66.8125, 'kl': 0.04847471280268571, 'ent': 0.026529814931564033, 'loss_v': 20134.58134765625, 'loss_pi': -0.031033773958915843}
Epoch 70 {'ep_ret': -587.3742794706134, 'ep_len': 68.16783216783217, 'kl': 7.878637077851636e-05, 'ent': 0.0023542185430414976, 'loss_v': 20521.65380859375, 'loss_pi': -0.002696374258448486}
Epoch 80 {'ep_ret': -577.3849836239632, 'ep_len': 66.21379310344828, 'kl': 3.713874850745924e-05, 'ent': 0.0026839184865821153, 'loss_v': 18769.57470703125, 'loss_pi': -0.001534139213617891}
Epoch 90 {'ep_ret': -591.6478047246218, 'ep_len': 67.87234042553192, 'kl': 9.307136467668897e-05, 'ent': 0.003617431048769504, 'loss_v': 18451.737109375, 'loss_pi': -0.0036560249965987167}
Epoch 100 {'ep_ret': -572.0498666359987, 'ep_len': 66.95172413793104, 'kl': 0.00013802023102016392, 'ent': 0.002031703817192465, 'loss_v': 17745.82392578125, 'loss_pi': -0.003082756641379092}
Epoch 110 {'ep_ret': -572.3060514778997, 'ep_len': 66.57931034482759, 'kl': 2.5492906547697203e-05, 'ent': 0.0017541816225275398, 'loss_v': 16713.76962890625, 'loss_pi': -0.0009542148007312789}
Epoch 120 {'ep_ret': -567.9614671400806, 'ep_len': 66.04081632653062, 'kl': 0.00015512027812292217, 'ent': 0.0036479943024460225, 'loss_v': 15993.2794921875, 'loss_pi': -0.0029041541594779117}
Epoch 130 {'ep_ret': -557.0199635922386, 'ep_len': 65.75167785234899, 'kl': 0.00016278600496661967, 'ent': 0.0040604508365504445, 'loss_v': 15359.5263671875, 'loss_pi': -0.003336326847784221}
Epoch 140 {'ep_ret': -569.8522755286215, 'ep_len': 66.26896551724138, 'kl': 0.001590388221086414, 'ent': 0.009696421585977078, 'loss_v': 15140.78603515625, 'loss_pi': -0.011464910081122071}
(base) kuanconkandeMBP:a4 helennnnnnnn$ tensorboard --logdir logs/4

NOTE: Using experimental fast data loading logic. To disable, pass
    "--load_fast=false" and report issues on GitHub. More details:
    https://github.com/tensorflow/tensorboard/issues/4784

Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.6.0 at http://localhost:6006/ (Press CTRL+C to quit)
^C(base) kuanconkandeMBP:a4 helennnnnnnn$ 


---

## Task 5: Optional

#### 5.1 Fully solving LunarLander

During initial testing, you likely did not fully solve LunarLander.  An optimal reward is about 300.  Your first bonus task is to adapt your implementation, as needed, to achieve this high reward.  This likely involves parameter tuning, implementing learning rate annealing, and maybe some observation normalization.
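
Two of the tweaks mentioned above can be sketched in a few lines of plain Python: linear learning rate annealing and running observation normalization. Both helpers are hypothetical (they do not exist in the assignment code) and work on scalars for brevity. A real implementation would normalize each observation dimension and write the annealed rate into the optimizer. The initial rate of `3e-4` is also just an assumed example value.

```python
import math

def linear_lr(epoch, total_epochs, initial_lr):
    """Linearly decay the learning rate from initial_lr at epoch 0 to 0 at total_epochs."""
    frac = 1.0 - epoch / total_epochs
    return initial_lr * max(frac, 0.0)

class RunningNormalizer:
    """Track a running mean and variance (Welford's algorithm) for one observation dimension."""

    def __init__(self, eps=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.eps = eps

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def var(self):
        # Population variance; fall back to 1.0 before any data has been seen.
        return self.m2 / self.count if self.count > 0 else 1.0

    def normalize(self, x):
        return (x - self.mean) / math.sqrt(self.var + self.eps)

norm = RunningNormalizer()
for obs in (1.0, 2.0, 3.0):
    norm.update(obs)
print(norm.mean, norm.var, norm.normalize(3.0))
print(linear_lr(75, 150, 3e-4))
```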

#### 5.2 Implementing parallelized environments

A major bottleneck right now is that we are only collecting data with 1 agent at a time.  Most modern algorithm implementations use many parallel episodic evaluations to collect data.  This greatly speeds up training and is a practical necessity if you want to use these algorithms to solve new problems.  Your second bonus task is to implement parallelized environment data collection.  One fairly easy way to do this is to use OpenAI gym's `AsyncVectorEnv`.  This runs N environments each in their own process.  To use it, you will have to make a few slight modifications to your data collection code to handle stacks of observations and actions.

Documentation (see tests for usage): https://github.com/openai/gym/tree/master/gym/vector
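
The "stacks of observations and actions" change can be sketched without gym. In the block below, `ToyEnv` and `ToyVectorEnv` are hypothetical stand-ins that run sequentially in one process (unlike `AsyncVectorEnv`). The auto-reset on `done` is modelled on what gym's vector environments are understood to do, which is worth confirming in the linked documentation.

```python
import random

class ToyEnv:
    """Hypothetical stand-in for a gym env: reward 1 per step, episode ends after `horizon` steps."""

    def __init__(self, horizon, seed=0):
        self.horizon = horizon
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return [self.rng.random()]

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        return [self.rng.random()], 1.0, done, {}

class ToyVectorEnv:
    """Steps several envs in lockstep and returns stacked results."""

    def __init__(self, envs):
        self.envs = envs

    def reset(self):
        return [env.reset() for env in self.envs]

    def step(self, actions):
        obs_batch, rew_batch, done_batch = [], [], []
        for env, action in zip(self.envs, actions):
            obs, rew, done, _ = env.step(action)
            if done:
                obs = env.reset()  # auto-reset so every slot always holds a live episode
            obs_batch.append(obs)
            rew_batch.append(rew)
            done_batch.append(done)
        return obs_batch, rew_batch, done_batch

def collect_episode_returns(vec_env, n_steps):
    """Run n_steps of batched collection, tracking one running return per env."""
    obs_batch = vec_env.reset()
    running = [0.0] * len(obs_batch)
    finished = []
    for _ in range(n_steps):
        actions = [0 for _ in obs_batch]  # placeholder policy: one action per observation
        obs_batch, rews, dones = vec_env.step(actions)
        for i, (rew, done) in enumerate(zip(rews, dones)):
            running[i] += rew
            if done:
                finished.append(running[i])
                running[i] = 0.0
    return finished

vec_env = ToyVectorEnv([ToyEnv(horizon=2, seed=0), ToyEnv(horizon=3, seed=1)])
print(collect_episode_returns(vec_env, n_steps=6))
```

The main point is the per-environment bookkeeping: episode returns, lengths, and trajectory boundaries must be tracked separately for each slot, because episodes in different environments end at different times.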

#### 5.3 New environments

Your third bonus task is to try solving the PyBullet environments (or MuJoCo if you want to get a free license).  `HalfCheetah` is a good place to start as one of the easier control tasks.  See the [Bullet code here](https://github.com/bulletphysics/bullet3/blob/master/examples/pybullet/gym/pybullet_envs/examples/enjoy_TF_HalfCheetahBulletEnv_v0_2017may.py) for how to make the bullet envs.


```
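# Example environment usage
import gym
import pybullet_envs  # importing this registers the Bullet environments with gym

env = gym.make("HalfCheetahBulletEnv-v0")
env.render(mode="human")  # called before reset(), as in the linked Bullet example

obs = env.reset()
while True:
    act = env.action_space.sample()  # random action, just to exercise the environment
    obs, rew, done, info = env.step(act)
    if done:
        obs = env.reset()  # start a new episode instead of stepping a finished one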
```

#### 5.4 Setting up MuJoCo [2pts]

MuJoCo is free as of October 18, 2021 ([News](https://deepmind.com/blog/announcements/mujoco)). The Python binding for MuJoCo ([mujoco-py](https://github.com/openai/mujoco-py)) is still pending an update.  This bonus task is about installing and running MuJoCo on your machine.  MuJoCo is slightly faster (and more popular in the RL community) than PyBullet, so you might consider using it for your projects.