### **Algorithms**
In this project, I used Proximal Policy Optimization (PPO, Schulman et al. 2017) to solve the Reacher environment. PPO addresses a major limitation of policy gradient (PG) methods: data inefficiency. In vanilla PG, each trajectory can validly be used to update the policy network only once, which is wasteful, especially when generating trajectories is slow, resource-consuming, or even dangerous. By combining importance sampling with a clipped surrogate objective, PPO allows the policy network to be updated multiple times on a trajectory generated by an "old policy" without losing track of the true objective function. This greatly improves data efficiency by giving PG an off-policy-like capability (improving a policy other than the one that generated the trajectory).
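
To make the clipping idea concrete, here is a minimal, dependency-free sketch of the clipped surrogate for a small batch. It is not the project's code: the function name `clipped_surrogate`, its argument names, and the default `epsilon=0.2` are assumptions chosen for illustration, and a real implementation would operate on tensors rather than Python lists.

```python
import math

def clipped_surrogate(old_log_probs, new_log_probs, advantages, epsilon=0.2):
    """Mean clipped surrogate over a batch of (state, action) samples.

    old_log_probs: log pi_old(a|s) recorded when the trajectory was generated
    new_log_probs: log pi_new(a|s) under the policy currently being updated
    advantages:    advantage estimates for the same samples
    epsilon:       clip range (hypothetical default, not the project's value)
    """
    total = 0.0
    for old_lp, new_lp, adv in zip(old_log_probs, new_log_probs, advantages):
        # Importance-sampling ratio pi_new / pi_old, computed from log-probabilities.
        ratio = math.exp(new_lp - old_lp)
        # Keep the ratio inside [1 - epsilon, 1 + epsilon].
        clipped_ratio = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
        # Taking the minimum gives a pessimistic bound on the unclipped objective.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)
```

When the new policy equals the old one, every ratio is 1 and the value reduces to the mean advantage. Once the ratio leaves the clip range in the direction that would increase the objective, the clipped term is constant in the policy parameters, so those samples stop contributing gradient. This is what makes several update passes over the same trajectory reasonably safe.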

### **Implementation**
My implementation is based on Algorithm.xx (algorithm x.x) in John Schulman's 2017 paper. The learning rules are summarized here:
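
As a reference point, and assuming the project follows the standard formulation in the PPO paper, the learning rules can be written as follows. The coefficients $c_1$ and $c_2$ are the paper's notation and may not match the names used in the code; the equations actually used in the project are the ones in `equations.png` (see Appendix).

$$
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}
$$

$$
L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]
$$

$$
L(\theta) = \hat{\mathbb{E}}_t\left[L^{CLIP}_t(\theta) - c_1\left(V_\theta(s_t) - V_t^{targ}\right)^2 + c_2\, S[\pi_\theta](s_t)\right]
$$

Here $\hat{A}_t$ is the advantage estimate, $V_t^{targ}$ is the value target, and $S$ is the policy entropy. $L(\theta)$ is maximized, so the entropy bonus enters with a positive sign.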

In addition, PPO is naturally parallel: a single policy can generate multiple trajectories simultaneously, and the agent can then learn from all of them. I therefore chose the multi-agent version of the environment to take advantage of this. I created a trajectory buffer using torch.utils.data to organize the data format and generate mini-batches.
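
The buffer logic can be sketched without any dependencies as follows. This is a plain-Python stand-in for the `torch.utils.data`-based buffer, not the project's code: the class name `TrajectoryBuffer`, its methods, and the per-step field layout are all assumptions made for illustration.

```python
import random

class TrajectoryBuffer:
    """Pools per-step samples from several parallel agents and serves shuffled mini-batches."""

    def __init__(self):
        # Flat list of (state, action, old_log_prob, advantage, return) tuples.
        self.samples = []

    def add_trajectory(self, states, actions, old_log_probs, advantages, returns):
        # One call per agent: zip pairs up the per-step fields of that agent's trajectory.
        self.samples.extend(zip(states, actions, old_log_probs, advantages, returns))

    def __len__(self):
        return len(self.samples)

    def minibatches(self, batch_size, rng=random):
        # Shuffle indices once per pass so each SGD epoch sees a different batch order.
        indices = list(range(len(self.samples)))
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            yield [self.samples[i] for i in indices[start:start + batch_size]]

    def clear(self):
        # Called after the policy update, before new trajectories are collected.
        self.samples.clear()
```

A typical loop would call `add_trajectory` once per agent, iterate over `minibatches(...)` for each SGD epoch, and then call `clear()` so that the next update only uses trajectories generated by the updated policy.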

### **Results**  

#### **Statistics**
**Discussion**

**Video recording of a trained agent**

### **Future Work**
- Compare the results with other PG-based methods.
- Experiment with the more recent Soft Actor-Critic (SAC).

### **Reference**
Research Papers:
- [Proximal Policy Optimization 2017](https://arxiv.org/abs/1707.06347)
- [Dueling DQN 2016](https://arxiv.org/abs/1511.06581)
- [Double DQN 2016](https://arxiv.org/abs/1509.06461)

Related Projects:
- [tnakae: Udacity-DeepRL-p2-Continuous](https://github.com/tnakae/Udacity-DeepRL-p2-Continuous)



### **Appendix**
1. Key equations and the corresponding lines of code in the project are summarized in ![./equations.png](./equations.png)
2. Hyperparameters

| Hyperparameter                      | Value |
| ----------------------------------- | ----- |
| Agent Model Type                    | MLP   |
| Agent Model Arch                    | [in, 20, 20, out] |
| Update (Learning) Frequency         | every 4 steps |
| Replay buffer size                  | 1e5   |
| Batch size                          | 64    |
| $\gamma$ (discount factor)          | 0.99  |
| Optimizer                           | Adam  |
| Learning rate                       | 5e-4  |
| Soft-Update (*1)                      | True  |
| $\tau$ (soft-update mixing rate)    | 1e-3  |
| $\epsilon$ (exploration rate) start | 1.0   |
| $\epsilon$ minimum                  | 0.1   |
| $\epsilon$ decay                    | 0.995 |

(*1.) Soft-update is a way to update the target network by mixing the current local parameters and the current target parameters via the following formula:

$w_{target} \leftarrow \tau * w_{local} + (1-\tau) * w_{target}$

The original DQN uses a hard update (i.e. $\tau=1$), meaning that the target network is fully overwritten by the current local network. In practice, a hard update needs to be applied less frequently, since the local network must first accumulate more knowledge. The choice of hard update is exposed in the project, and one can experiment with it by setting soft-update to False.
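
The formula can be illustrated with a short sketch on flat lists of weights. The function name `soft_update` and the list representation are assumptions made for illustration; in a PyTorch project the same mixing would be applied to each pair of network parameter tensors.

```python
def soft_update(local_params, target_params, tau=1e-3):
    """Return new target weights: tau * w_local + (1 - tau) * w_target, element-wise."""
    return [
        tau * w_local + (1.0 - tau) * w_target
        for w_local, w_target in zip(local_params, target_params)
    ]
```

Setting `tau=1.0` reproduces the hard update (the target becomes a copy of the local weights), while a small `tau` moves the target only slightly towards the local network at each step.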



## Report

[image1]: ./ppo_future_rewards-exp1.png "Crawler"

- Using PPO to solve Reacher environment:
- Exp-1: v1: Future Reward PPO/ Reinforce PPO => not working: 
    - Reinforce PPO ![][image1]
    - `hyperparams = {"max_t": 1000, "SGD_epoch": 2, "gamma": 0.99, "n_episodes": 150, "grad_clip": 0.2, "epsilon": 0.01, "beta": 0.01}`
- Exp-2: using single agent environment:
    - Exp-2.1: Found a bug: the entropy should be maximized, not minimized.
    - It is still not working. The agent is able to learn something, but it does not seem to try out anything meaningful!
        - reward_to_go is not working!
        - reinforce is ...? not working!
        - `hyperparams = {"max_t": 1000, "SGD_epoch": 2, "gamma": 0.99, "n_episodes": 150, "grad_clip": 0.0, "epsilon": 0.01, "beta": 0.01}` (the maximal possible t_step is only 1000)
        - `setting = {"multi_env": False, "plotting": False}`
        - with gradient clip

- Exp-3: using tnakae's implementation:
    - Exp-3.1: it is working, yay (PPO with GAE)
    - Exp-3.2: it is not working if I use the return as the value (last R taken from the value function); it reaches ~11 by episode 150 and gets stuck there.
    - Exp-3.3: using R_last=0
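
For reference, the quantities compared in Exp-3 can be sketched as below. This is a generic Generalized Advantage Estimation (GAE) routine, not tnakae's or this project's code. The function name `compute_gae` and the `lam` default are assumptions, and episode-termination flags are ignored on the assumption that every trajectory has the same fixed length.

```python
def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Return (advantages, returns) for one trajectory.

    rewards:    r_0 .. r_{T-1}
    values:     V(s_0) .. V(s_{T-1}) from the critic
    last_value: estimate of V(s_T); pass 0.0 to drop the bootstrap (R_last = 0)
    lam:        GAE mixing parameter (hypothetical default, not the project's value)
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        # One-step TD error at time t.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of the current and all later TD errors.
        gae = delta + gamma * lam * gae
        advantages[t] = gae
        next_value = values[t]
    # Value targets: advantage plus the critic's baseline.
    returns = [adv + val for adv, val in zip(advantages, values)]
    return advantages, returns
```

With `lam=1.0` the `returns` list is the discounted reward-to-go, bootstrapped with `last_value` (the Exp-3.2 style) or not bootstrapped when `last_value=0.0` (the Exp-3.3 style). Smaller `lam` values trade variance for bias by relying more on the critic.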