# Continuous Control

---

In this report, I describe solving the Reacher environment with two policy gradient methods: Proximal Policy Optimization ("PPO") and Deep Deterministic Policy Gradient ("DDPG").

## Policy Gradient Methods

According to Schulman, Wolski et al., 2017, policy gradient methods work by computing an estimator of the policy gradient and plugging it into a stochastic gradient ascent algorithm. The most commonly used gradient estimator has the form:

$$
\hat g = \hat E_{t} \bigg[ ∇_{θ} \log π_{θ}(a_{t} | s_{t}) \hat A_{t} \bigg]
$$

where $π_{θ}$ is a stochastic policy and $\hat A_{t}$ is an estimator of the advantage function at timestep $t$. The advantage function measures how much better an action is than the average over the actions available in that state. It can be expressed as:

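$$
A(s_{t}, a_{t}) = Q(s_{t}, a_{t}) - V(s_{t})
$$

Because $Q(s_{t}, a_{t})$ is the expected value of $r_{t} + \gamma V(s_{t+1})$, the advantage can be estimated from a single transition as:

$$
\hat A_{t} = r_{t} + \gamma V(s_{t+1}) - V(s_{t})
$$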
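
As a minimal sketch of the one-step expression above, the following pure-Python function computes advantages for a short rollout. The function name `td_advantages`, the `dones` flag, and the default `gamma` are illustrative assumptions and are not taken from the project code.

```python
def td_advantages(rewards, values, next_values, dones, gamma=0.99):
    """One-step advantage r + gamma * V(s') - V(s) for each transition.

    Assumption: the bootstrap term is dropped when the episode ended
    on that transition (done flag set).
    """
    advantages = []
    for r, v, v_next, done in zip(rewards, values, next_values, dones):
        bootstrap = 0.0 if done else gamma * v_next
        advantages.append(r + bootstrap - v)
    return advantages


# Toy rollout with made-up numbers; the second transition ends the episode.
toy_advantages = td_advantages(
    rewards=[1.0, 0.0],
    values=[0.5, 0.2],
    next_values=[0.2, 0.9],
    dones=[False, True],
    gamma=0.9,
)
```

A positive entry means the action did better than the critic expected for that state, and a negative entry means it did worse.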

## PPO

### Result
I was able to solve the 20-agent version with PPO. The score first surpassed 30 at the 40th episode, and the average score reached 30 after 106 episodes.

![image.png](attachment:image.png)

### Methodology
PPO is a policy gradient method whose main features are a clipped surrogate objective and the use of importance sampling. The "Clipped Surrogate Objective" aims to prevent large policy updates: the change in the probability ratio is ignored when it would make the objective improve, and included when it makes the objective worse. It is expressed as:

$$
L^{CLIP}(θ) = \hat E_{t} \bigg[ \min\big(r_{t}(θ) \hat A_{t}, \operatorname{clip}(r_{t}(θ), 1 - ϵ, 1 + ϵ) \hat A_{t}\big) \bigg]
$$

where $r_{t}(θ) = π_{θ}(a_{t} | s_{t}) / π_{θ_{old}}(a_{t} | s_{t})$ is the probability ratio between the new and old policies and $ϵ$ is the clipping parameter.

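Importance sampling is useful in policy gradient methods because it allows a trajectory sampled under the old policy to be reused for several gradient updates. Each sample is reweighted by the probability ratio between the new and old policies, so new trajectories do not have to be collected after every update.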
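
The clipped objective and the importance-sampling ratio can be sketched together in a few lines of pure Python. This is an illustration under assumptions rather than the project's implementation: the function name `clipped_surrogate`, the use of stored log-probabilities, and the default `epsilon=0.2` are all hypothetical choices.

```python
import math


def clipped_surrogate(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    """Average clipped surrogate objective over a batch of samples."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        # Importance-sampling ratio: pi_new(a|s) / pi_old(a|s).
        ratio = math.exp(new_lp - old_lp)
        clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
        # The minimum keeps the more pessimistic of the two estimates.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)
```

The old log-probabilities are computed once when the trajectory is collected, while the new ones are recomputed from the current policy at every update, which is what lets the same samples be reused.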

### Implementation
I followed the actor-critic style PPO as suggested by Schulman, Wolski et al., 2017. 

![image-2.png](attachment:image-2.png)

In addition, the continuous control setting requires the policy to output a probability distribution over actions. I used the normal distribution and added its standard deviation as a parameter to be optimized. Actions are sampled from the distribution to enable exploration.
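
The sampling step can be sketched as follows. This is a simplified illustration, not the project code: the function name `sample_action` is hypothetical, and parameterizing the standard deviation through its logarithm (`log_stds`) is an assumption made here to keep it positive.

```python
import math
import random


def sample_action(means, log_stds, rng=random):
    """Sample one action per dimension from independent normal distributions.

    Returns the action and its log-probability summed over dimensions.
    """
    action, log_prob = [], 0.0
    for mean, log_std in zip(means, log_stds):
        std = math.exp(log_std)  # assumption: log(std) is the learned parameter
        a = rng.gauss(mean, std)
        action.append(a)
        # Log-density of a normal distribution evaluated at the sampled action.
        log_prob += (-0.5 * ((a - mean) / std) ** 2
                     - log_std
                     - 0.5 * math.log(2.0 * math.pi))
    return action, log_prob
```

The returned log-probability is the quantity that would be stored with the transition and later compared against the updated policy's log-probability.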

I referred to this GitHub repository, https://github.com/SimonBirrell/ppo-continuous-control, for the exact implementation. In this version, the tanh function is used as the activation for hidden layers. The leaky ReLU function also seems to work, but training frequently ran into all gradients becoming NaN. Lastly, the ordinary ReLU function did not appear to work well.

### Future Ideas
- A more systematic approach in tuning hyperparameters such as grid search or random search would be beneficial. 
- Employing different regularization techniques such as weight decay may help the model generalize better. For this project, I used the Cross-Entropy method provided by Udacity.
- Trying out different distributions for actions, or even a non-parametric one, may be worthwhile. For this project, I used the normal distribution.
- Surprisingly, this algorithm barely worked for the single-agent environment. I tried tuning hyperparameters in various ways with no luck. It would be good to figure out why this model would not work to deepen the understanding.
- Instead of using advantage functions, it may be worth trying discounted future rewards or even the current reward.


## DDPG

### Results
I was able to solve the single-agent version with three variants of DDPG: 1) vanilla DDPG, 2) DDPG with Prioritized Experience Replay ("PER"), and 3) DDPG with PER but without a target actor model. The three models reached an average score of 30 in 273, 164, and 145 episodes respectively. Although DDPG with PER solved the environment in fewer episodes, each episode took longer to run. Also, surprisingly, DDPG without a target actor worked perfectly well, at least in this environment.

![image.png](attachment:image.png)
![image-3.png](attachment:image-3.png)
![image-4.png](attachment:image-4.png)

### Methodology
DDPG is an algorithm that concurrently learns a Q-function and a policy. It uses off-policy data and the Bellman equation to learn the Q-function, and uses the Q-function to learn the policy. DDPG can also be thought of as deep Q-learning for continuous action spaces (from the OpenAI Spinning Up website).
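
The two learning steps can be written out as follows. The notation follows the OpenAI Spinning Up description of DDPG and is an assumption here, since this report does not define these symbols: $Q_{\phi}$ is the Q-function, $\mu_{\theta}$ the deterministic policy, $\mathcal{D}$ the set of stored transitions, $d$ the episode-termination flag, and the subscript "targ" marks target networks.

$$
L(\phi, \mathcal{D}) = \underset{(s,a,r,s',d) \sim \mathcal{D}}{E} \bigg[ \Big( Q_{\phi}(s,a) - \big( r + \gamma (1 - d) Q_{\phi_{targ}}(s', \mu_{\theta_{targ}}(s')) \big) \Big)^{2} \bigg]
$$

$$
\max_{\theta} \underset{s \sim \mathcal{D}}{E} \big[ Q_{\phi}(s, \mu_{\theta}(s)) \big]
$$

The first expression is the Bellman error minimized when learning the Q-function. The second is the objective maximized when learning the policy, with the Q-function parameters held fixed.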

In addition to the aforementioned features, other features include:
- Use of experience replay. In fact, all standard algorithms for training a deep neural network to approximate $\hat Q^*(s,a)$ make use of an experience replay buffer. 
- Target networks. They are introduced to stabilize learning of the Q-function: the target in the mean-squared Bellman error (MSBE) depends on the same parameters being trained, which can make minimizing it highly unstable.
- Noise. Since the policy is deterministic (mapping a state to an action with certainty), noise is added to the actions to make the agent explore.
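
The first two features in the list above can be sketched in pure Python. This is a minimal illustration under assumptions, not the project code: the names `ReplayBuffer` and `soft_update`, the default `tau`, and the representation of network parameters as flat lists of floats are all hypothetical.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size buffer of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity, seed=None):
        self.memory = deque(maxlen=capacity)  # oldest transitions are evicted first
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement from the stored transitions.
        return self.rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


def soft_update(target_params, local_params, tau=0.001):
    """Move each target parameter a fraction tau toward its local counterpart."""
    return [tau * local + (1.0 - tau) * target
            for target, local in zip(target_params, local_params)]
```

With a small `tau`, the target parameters trail the trained parameters slowly, which keeps the Bellman target from shifting abruptly between updates.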

### Implementation
For the vanilla DDPG algorithm, I generally followed the structure described in the pseudocode below.
![image-2.png](attachment:image-2.png)

Next, I also implemented DDPG with PER. PER prioritizes samples with a large TD error, i.e., a large difference between the Q-value of the current state/action and the sum of the current reward and the discounted Q-value of the next state/action. This calls for the SumTree data structure: the sampling probabilities must be recalculated continuously from the newly computed TD errors, which is computationally very expensive if done naively.
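
A SumTree can be sketched as below. This is a minimal illustration and not the project's implementation: the class layout, the method names, and the circular overwrite of old entries are assumptions. Each leaf stores one priority, each internal node stores the sum of its children, and both updating a priority and sampling proportionally to priority take time logarithmic in the buffer size.

```python
import random


class SumTree:
    """Binary tree whose leaves hold priorities and whose root holds their sum."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = [0.0] * (2 * capacity - 1)  # internal nodes, then leaves
        self.data = [None] * capacity
        self.write = 0  # next leaf to overwrite (circular)

    def total(self):
        return self.tree[0]

    def update(self, leaf, priority):
        """Set a leaf's priority and propagate the change up to the root."""
        idx = leaf + self.capacity - 1
        change = priority - self.tree[idx]
        self.tree[idx] = priority
        while idx > 0:
            idx = (idx - 1) // 2
            self.tree[idx] += change

    def add(self, priority, item):
        self.data[self.write] = item
        self.update(self.write, priority)
        self.write = (self.write + 1) % self.capacity

    def get(self, value):
        """Return (leaf, priority, item) whose cumulative range contains value."""
        idx = 0
        while idx < self.capacity - 1:  # descend until a leaf is reached
            left = 2 * idx + 1
            if value <= self.tree[left]:
                idx = left
            else:
                value -= self.tree[left]
                idx = left + 1
        leaf = idx - (self.capacity - 1)
        return leaf, self.tree[idx], self.data[leaf]

    def sample(self, rng=random):
        # Drawing uniformly in [0, total] selects leaves in proportion to priority.
        return self.get(rng.uniform(0.0, self.total()))
```

After a learning step, the new TD error of each sampled transition would be written back with `update`, so only the path from that leaf to the root is recomputed.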

Finally, I implemented DDPG with PER, but without a target actor. As the results show, the Reacher environment could be solved in the least number of episodes using this algorithm.

### Future Ideas
- As with PPO, a more systematic approach to tuning hyperparameters, such as grid search or random search, would be beneficial.
- It may expedite the agent's learning if ideas from double DQN or dueling DQN are introduced.
- It would be useful to consider different ways to initialize the weights for PER. For this project, I used experience replay until the weights for a batch of observations were gathered.
- It may be worth considering different noise functions such as uncorrelated, mean-zero Gaussian noise. For this project, I used Ornstein–Uhlenbeck noise for DDPG. 
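
The two noise options in the last item can be compared with a short sketch. This is an illustration under assumptions rather than the project code: the class and function names and the default values of `theta` and `sigma` are hypothetical.

```python
import random


class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated, mean-reverting noise."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, seed=None):
        self.mu, self.theta, self.sigma = mu, theta, sigma
        self.rng = random.Random(seed)
        self.state = [mu] * size

    def reset(self):
        """Return the internal state to the mean, e.g. at the start of an episode."""
        self.state = [self.mu] * len(self.state)

    def sample(self):
        # x <- x + theta * (mu - x) + sigma * N(0, 1), applied per dimension.
        self.state = [
            x + self.theta * (self.mu - x) + self.sigma * self.rng.gauss(0.0, 1.0)
            for x in self.state
        ]
        return list(self.state)


def gaussian_noise(size, sigma=0.2, rng=random):
    """Uncorrelated, mean-zero Gaussian noise: each call is independent."""
    return [rng.gauss(0.0, sigma) for _ in range(size)]
```

The practical difference is that consecutive `OUNoise.sample()` calls drift from the previous value, while `gaussian_noise` has no memory between steps.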

## References
- https://github.com/SimonBirrell/ppo-continuous-control
- https://spinningup.openai.com/en/latest/
- https://adventuresinmachinelearning.com/sumtree-introduction-python/
- https://towardsdatascience.com/importance-sampling-introduction-e76b2c32e744