# Report for Udacity Deep Reinforcement Learning
# Project 3: Collaboration and Competition

## Overview

This report covers the Collaboration and Competition project of the Deep Reinforcement Learning course by Udacity. The objective of the project is to train two agents to control rackets and bounce a ball over a net as many times as possible.
If an agent hits the ball over the net, it receives a reward of +0.1. If an agent lets the ball hit the ground or hits the ball out of bounds, it receives a reward of -0.01.

The observation space consists of 8 variables corresponding to the position and velocity of the ball and racket. Two continuous actions are available, corresponding to movement toward (or away from) the net, and jumping.

## Learning Algorithm

The same learning algorithm as in [p2-ball-tracking](https://github.com/yshk-mrt/p2-ball-tracking) is applied.

[A2C](https://openai.com/blog/baselines-acktr-a2c/) is used, which is a synchronous variant of [A3C](https://arxiv.org/abs/1602.01783). Pseudocode of A3C from the cited paper is as follows.

<img src="files/content/a3c_suedocode.png" width="800"/> 
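
The core of the pseudocode is the backward recursion $R \leftarrow r_i + \gamma R$, which turns the rewards of one n-step rollout into return targets for the critic and advantages for the actor. The following is a minimal plain-Python sketch of that step only. The function name `n_step_returns`, its arguments, and the sample numbers are illustrative and are not taken from TrainAgent.ipynb.

```python
def n_step_returns(rewards, bootstrap_value, gamma=0.99):
    """Compute discounted return targets for one n-step rollout.

    rewards         : rewards collected during the rollout, oldest first
    bootstrap_value : critic estimate for the state after the last step
                      (use 0.0 if the episode terminated)
    gamma           : discount factor (0.99 in this report)
    """
    returns = []
    running = bootstrap_value
    # Walk backwards through the rollout: R <- r_i + gamma * R
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


# Advantage = return target minus the critic's value estimate.
rewards = [0.0, 0.0, 0.1, 0.0, 0.0]      # one 5-step rollout (made-up values)
values = [0.05, 0.06, 0.08, 0.02, 0.02]  # hypothetical critic outputs V(s_i)
returns = n_step_returns(rewards, bootstrap_value=0.03)
advantages = [ret - val for ret, val in zip(returns, values)]
```

With N-Step set to 5, each update would use five such return targets per agent before the rollout buffer is cleared.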

### Model

A single neural network provides both the policy $\pi(a|s;\theta')$ and the value function $V(s;\theta'_v)$, and it is shared by all agents. The first and second hidden fully connected layers are shared by the policy (actor) and the value function (critic). To generate actions for continuous control, the network outputs the mean and standard deviation of a Gaussian distribution, and actions are sampled from that distribution. Here is the corresponding code in [TrainAgent.ipynb](./TrainAgent.ipynb):


```python
    out1 = F.relu(self.fc1(state))
    out2 = F.relu(self.fc2(out1))

    # Mean of the Gaussian distribution, squashed into [-1, 1] by tanh
    mean = torch.tanh(self.fc_actor(out2))
    # V value
    value = self.fc_critic(out2)

    # Create distribution from mean and standard deviation
    # Use softplus function to make deviation always positive
    # SoftPlus is a smooth approximation to ReLU function
    dist = torch.distributions.Normal(mean, F.softplus(self.std))

    # Sample next action from the distribution.
    action = dist.sample()
    action = torch.clamp(action, min=-1.0, max=1.0)
```
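
To make the role of SoftPlus and the final clamp concrete, here is a plain-Python sketch of the same sampling step for a single action component. It uses `random.gauss` in place of `torch.distributions.Normal`, and the names `softplus`, `sample_action`, `raw_mean`, and `raw_std` are illustrative rather than taken from the notebook.

```python
import math
import random


def softplus(x):
    """Smooth, always-positive approximation of ReLU: log(1 + exp(x))."""
    # Numerically stable form: max(x, 0) + log1p(exp(-|x|))
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sample_action(raw_mean, raw_std, rng=random):
    """Sample one action component in the same way as the snippet above.

    raw_mean : actor-head output before tanh (hypothetical name)
    raw_std  : learnable parameter before softplus (0.0 initially in this report)
    rng      : object with a gauss(mu, sigma) method, e.g. random.Random(seed)
    """
    mean = math.tanh(raw_mean)          # mean is squashed into [-1, 1]
    std = softplus(raw_std)             # standard deviation is always > 0
    action = rng.gauss(mean, std)       # draw from N(mean, std^2)
    return max(-1.0, min(1.0, action))  # clamp to the valid action range


initial_std = softplus(0.0)  # equals log(2), roughly 0.693
action = sample_action(raw_mean=0.3, raw_std=0.0)
```

Because the standard deviation parameter starts at 0.0, the initial standard deviation after SoftPlus is log 2, which is wide relative to the [-1, 1] action range and therefore favors exploration early in training.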

#### Model structure

The model structure, rendered with the torchviz library:
<img src="files/content/model.png" width="800"/> 

#### Model parameters
The following table shows the model parameters. Linear-1 and Linear-2 are the hidden layers, Linear-3 is the actor head (actions), and Linear-4 is the critic head (value function).
```
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Linear-1               [-1, 1, 128]           3,200
            Linear-2                [-1, 1, 64]           8,256
            Linear-3                 [-1, 1, 2]             130
            Linear-4                 [-1, 1, 1]              65
================================================================
Total params: 11,651
```
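
The counts in the table can be checked by hand, since a fully connected layer with `in` inputs and `out` outputs has `(in + 1) * out` parameters. The sketch below redoes that arithmetic. Note that the input size of 24 is an inference from 3,200 = (24 + 1) × 128 and is not stated explicitly in this report.

```python
def linear_params(in_features, out_features):
    """Parameter count of a fully connected layer: weights plus biases."""
    return in_features * out_features + out_features


# Layer sizes read from the table above. The input size of 24 is inferred
# from the Linear-1 parameter count, not given in the report.
layers = {
    "Linear-1 (hidden)": (24, 128),
    "Linear-2 (hidden)": (128, 64),
    "Linear-3 (actor)": (64, 2),
    "Linear-4 (critic)": (64, 1),
}

counts = {name: linear_params(*shape) for name, shape in layers.items()}
total = sum(counts.values())

for name, count in counts.items():
    print(f"{name:<20} {count:>6,}")
print(f"{'Total':<20} {total:>6,}")
```

The total equals the sum of the four linear layers, so the learnable standard deviation (`self.std`) does not appear to be included in this summary.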


### Hyperparameters
The following hyperparameters are used.

| Parameter                       | Value         | Description                                                    |
|:--------------------------------|--------------:|:--------------------------------------------------------------|
| N-Step                          |             5 | Update the model every N steps                                |
| Discount factor \[$ \gamma $\]  |          0.99 | Discount factor for the n-step returns                        |
| Critic_loss_coef                |         100.0 | Weight of critic loss                                         |
| Entropy_loss_coef               |        0.0001 | Weight of entropy loss                                        |
| Learning rate \[$ \alpha $\]    |        0.0005 | Learning rate for Adam                                        |
| Standard deviation              |           0.0 | Initial standard deviation before SoftPlus is applied         |
| Max norm of the gradients       |           5.0 | Max norm for gradient clipping                                |
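
As a rough illustration of where the two loss coefficients enter, the sketch below combines the three terms in the way A2C implementations commonly do: actor loss plus weighted critic loss minus weighted entropy. This form is an assumption, and the exact expression in TrainAgent.ipynb may differ. The function name `a2c_loss` and its list-based inputs are hypothetical.

```python
CRITIC_LOSS_COEF = 100.0    # "Critic_loss_coef" from the table
ENTROPY_LOSS_COEF = 0.0001  # "Entropy_loss_coef" from the table


def a2c_loss(log_probs, advantages, entropies):
    """Combine actor, critic, and entropy terms for one n-step batch.

    Assumes the common A2C formulation; inputs are plain lists of floats.
    In an autograd framework the advantage would be treated as a constant
    inside the actor term.
    """
    n = len(advantages)
    # Actor: policy-gradient loss, -log pi(a|s) * advantage
    actor_loss = -sum(lp * adv for lp, adv in zip(log_probs, advantages)) / n
    # Critic: mean squared advantage (return target minus value estimate)
    critic_loss = sum(adv ** 2 for adv in advantages) / n
    # Entropy bonus: higher entropy lowers the loss, encouraging exploration
    entropy = sum(entropies) / n
    total = (actor_loss
             + CRITIC_LOSS_COEF * critic_loss
             - ENTROPY_LOSS_COEF * entropy)
    return total, actor_loss, critic_loss, entropy


total, actor, critic, entropy = a2c_loss(
    log_probs=[-1.0, -2.0], advantages=[0.5, -0.5], entropies=[1.0, 1.0]
)
```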


## Results

### Plot of Rewards
At episode 6921, A2C first achieved a score of 0.5 in a single episode.  
At episode 19779, the average score reached 0.5.  
The environment is therefore considered solved.
![trained_result](files/content/score.png)
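
The second milestone is a trailing average rather than a single score. A minimal sketch of how such a milestone can be located in a list of episode scores is shown below. It assumes a 100-episode window, the same length used by `moving_average` in the appendix. The function name `first_solved_episode` is hypothetical.

```python
from collections import deque


def first_solved_episode(scores, target=0.5, window=100):
    """Return the 1-based episode at which the trailing average first
    reaches `target`, or None if it never does.

    The average is taken over the most recent `window` episodes and is
    only evaluated once a full window is available.
    """
    recent = deque(maxlen=window)  # old scores are dropped automatically
    for episode, score in enumerate(scores, start=1):
        recent.append(score)
        if len(recent) == window and sum(recent) / window >= target:
            return episode
    return None


# Tiny synthetic example with a 4-episode window.
demo_scores = [0.0] * 5 + [1.0] * 5
demo_episode = first_solved_episode(demo_scores, target=0.5, window=4)
```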

### Trained agents' behavior

The following GIF shows the behavior of the trained agents. For this run, the standard deviation for action generation is set to nearly zero, so the agents always take greedy actions. As a result, the agents achieved an average score of 1.91 over 100 episodes.
![trained_agent](files/results/final/tennis.gif)
![trained_agent_score](files/content/DetermisticScore.png)

## Considerations

1. Entropy

The following graph shows the average score and the entropy. When the score improves, the entropy decreases. Conversely, when the score drops, the entropy increases, which helps the score recover by exploring a broader region of the action space. The weight of the entropy term is therefore important for balancing convergence (exploitation) and divergence (exploration).

![entropy](files/content/score_entropy.png)
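
For a Gaussian policy, the entropy per action dimension depends only on the standard deviation: $H = \frac{1}{2}\ln(2\pi e\,\sigma^2)$. A falling entropy curve therefore corresponds to a shrinking standard deviation, and a rising one to a widening policy. The sketch below evaluates this formula for a few illustrative values of $\sigma$, which are not measurements from this project.

```python
import math


def gaussian_entropy(std):
    """Differential entropy of a 1-D normal distribution: 0.5 * ln(2*pi*e*std^2)."""
    return 0.5 * math.log(2.0 * math.pi * math.e * std ** 2)


# Entropy shrinks as the policy's standard deviation shrinks.
# log(2) is the initial standard deviation implied by softplus(0.0).
for std in (math.log(2.0), 0.3, 0.1, 0.01):
    entropy = gaussian_entropy(std)
    print(f"std = {std:.3f}  entropy per action dimension = {entropy:+.3f}")
```

The entropy becomes negative for small standard deviations, which is expected for a differential entropy.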

2. Weight of critic loss

Graph (a) below shows the actor loss and the critic loss without any weighting of the critic loss. Graph (b) shows the absolute value of the actor loss divided by the critic loss. The actor loss is 10 to 1000 times larger than the critic loss, and the ratio decreases as learning continues. If the weight could be changed adaptively to keep the ratio constant, learning might be more stable. (A quick experiment with this idea did not work.)
In this report, a fixed weight of 100.0 is used.

| ![actor_critic]| ![actor_divided_by_critic]|
|---|---|
|(a) Actor loss and critic loss|(b) Actor loss divided by critic loss|

[actor_critic]:files/content/actor_critic.png
[actor_divided_by_critic]:files/content/actor_critic_log.png


## Ideas for Future Work

For more stable results:
 * Add a small value to std to prevent the log probability from becoming infinite when std is zero.
 * Use gradient clipping (done)
 * Use Trust Region Policy Optimization (TRPO)
 * Try RMSprop instead of Adam
 * Try off-policy methods
 * Try separate networks for the actor and the critic
 * Try the adaptive critic loss weight mentioned in the second consideration point
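
The first idea in the list can be sketched in plain Python as follows. The Gaussian log-density contains both a division by the standard deviation and its logarithm, so a small constant keeps both finite. The function name `gaussian_log_prob` and the value `EPS = 1e-6` are illustrative choices, not taken from the project code.

```python
import math

EPS = 1e-6  # illustrative value; not taken from the report


def gaussian_log_prob(action, mean, std, eps=EPS):
    """Log density of N(mean, (std + eps)^2) evaluated at `action`."""
    safe_std = std + eps  # keeps the division and log finite when std == 0
    z = (action - mean) / safe_std
    return -0.5 * z * z - math.log(safe_std) - 0.5 * math.log(2.0 * math.pi)


# Without the epsilon, std == 0 would raise an error or produce infinities.
collapsed = gaussian_log_prob(action=0.2, mean=0.2, std=0.0)
off_mean = gaussian_log_prob(action=0.3, mean=0.2, std=0.0)
```

The result stays finite, but it can still be very large in magnitude when the action is far from the mean, so this would complement gradient clipping rather than replace it.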

## References
* [GitHub:ShangtongZhang/DeepRL](https://github.com/ShangtongZhang/DeepRL)
* [GitHub:qiaochen/A2C](https://github.com/qiaochen/A2C)
* [Let’s make an A3C: Implementation](https://jaromiru.com/2017/03/26/lets-make-an-a3c-implementation)
* [Understanding Actor Critic Methods and A2C](https://towardsdatascience.com/understanding-actor-critic-methods-931b97b6df3f)
* [ゼロから作る A3C (A3C from Scratch)](https://qiita.com/s05319ss/items/2fe9bfe562fea1707e79)

## Appendix
Code used to generate the graphs in this report.

```python
# %matplotlib inline  # Jupyter magic; uncomment when running in a notebook
import numpy as np
import matplotlib.pyplot as plt


def moving_average(data, points=100):
    """Trailing moving average, zero-padded so the output has len(data) entries."""
    kernel = np.ones(points) / points
    padding = np.zeros(points - 1)
    return np.convolve(np.hstack((padding, data)), kernel, mode='valid')


def plot(data_source):
    """Plot data
        Parameters
        ----------
            data_source : list
                [Data label, data file path, line type]
    """
    plt.rcParams["figure.dpi"] = 100.0
    max_length = 30000

    for source in data_source:
        # Note: newer NumPy releases may only accept a single-character delimiter
        data = np.loadtxt(source[1], skiprows=1, delimiter=', \t', dtype='float')
        length = min(len(data), max_length)
        data = data[:length]
        episodes = np.linspace(1, length, length, endpoint=True)

        # Comment/Uncomment to select data type
        #plt.plot(episodes, moving_average(data[:,1]), label="score " + source[0], ls=source[2])
        #plt.plot(episodes, moving_average(data[:,2]), label="total_loss " + source[0], ls=source[2])
        #plt.plot(episodes, moving_average(data[:,3]), label="actor_loss " + source[0], ls=source[2])
        #plt.plot(episodes, moving_average(data[:,4]), label="critic_loss " + source[0], ls=source[2])
        #plt.plot(episodes, moving_average(data[:,5]), label="entropy " + source[0], ls=source[2])
        plt.plot(episodes, moving_average(abs(data[:,3]/data[:,4])), label="actor/critic " + source[0], ls=source[2])

    plt.xlabel('Episode #')
    plt.legend()
    plt.xlim(0, length)
    #plt.ylim(0, 1000)
    plt.yscale("log")
    plt.grid()
    plt.show()


data_paths = []

data_paths.append(['tennis', 'results/final/result.txt', "-"])
plot(data_paths)
```