[//]: # (Image References)

[image1]: https://user-images.githubusercontent.com/10624937/42135623-e770e354-7d12-11e8-998d-29fc74429ca2.gif "Trained Agent"
[image2]: https://user-images.githubusercontent.com/10624937/42135622-e55fb586-7d12-11e8-8a54-3c31da15a90a.gif "Soccer"


# Project 3: Collaboration and Competition

### Introduction

For this project, you will work with the [Tennis](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Learning-Environment-Examples.md#tennis) environment.

![Trained Agent][image1]

In this environment, two agents control rackets to bounce a ball over a net. If an agent hits the ball over the net, it receives a reward of +0.1.  If an agent lets a ball hit the ground or hits the ball out of bounds, it receives a reward of -0.01.  Thus, the goal of each agent is to keep the ball in play.

The observation space consists of 8 variables corresponding to the position and velocity of the ball and racket. Each agent receives its own, local observation.  Two continuous actions are available, corresponding to movement toward (or away from) the net, and jumping. 

The task is episodic, and in order to solve the environment, your agents must get an average score of +0.5 (over 100 consecutive episodes, after taking the maximum over both agents). Specifically,

- After each episode, we add up the rewards that each agent received (without discounting), to get a score for each agent. This yields 2 (potentially different) scores. We then take the maximum of these 2 scores.
- This yields a single **score** for each episode.

The environment is considered solved when the average (over 100 episodes) of those **scores** is at least +0.5.
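
The scoring rule above can be sketched in a few lines of plain Python. This is only an illustration: the function names `episode_score` and `is_solved` are hypothetical and are not taken from the project code.

```python
def episode_score(agent_rewards):
    """agent_rewards holds one list of per-step rewards for each agent."""
    # Sum each agent's rewards without discounting, then keep the larger total.
    return max(sum(rewards) for rewards in agent_rewards)


def is_solved(scores, window=100, target=0.5):
    """True once the mean of the last `window` episode scores reaches `target`."""
    if len(scores) < window:
        return False
    recent = list(scores)[-window:]
    return sum(recent) / window >= target


# Example episode: agent 0 returns the ball twice, agent 1 returns it once
# and then lets it hit the ground.
example = episode_score([[0.1, 0.0, 0.1], [0.1, 0.0, -0.01]])
```

In a training loop, `episode_score` would be appended to a running list after every episode, and `is_solved` checked on that list.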

### Agent

#### MADDPG Agent
The MADDPG agent, [maddpg_agent.py](maddpg_agent.py), implements the algorithm from the [MADDPG paper](https://arxiv.org/abs/1706.02275).

* MADDPG is a multi-agent version of DDPG, adapted to handle multi-agent collaboration and competition problems. The Tennis environment can probably be solved in a simpler way, but MADDPG is used here for the sake of learning.
* To train multiple agents in a collaborative, competitive, or mixed environment, the algorithm adopts the framework of centralized training with decentralized execution. During training, the critic in each agent receives the full observation (the observations of all agents, plus additional information if available), so that the agents have the opportunity to learn to collaborate or compete. The actors still use only their own observations and learn about collaboration or competition through the feedback from their critics. That is the centralized training part. At test or execution time, only the actors are used, and no change is required because they take the same local observation as during training.
* The actor and critic network parameters used in the project are listed below. In the critic network, the actions are included at the second fully connected layer (Linear-3 in the summary).

    Actor network parameters
    ```
    ----------------------------------------------------------------
            Layer (type)               Output Shape         Param #
    ================================================================
           BatchNorm1d-1                   [-1, 24]              48
                Linear-2                  [-1, 400]          10,000
                Linear-3                  [-1, 300]         120,300
                Linear-4                    [-1, 2]             602
    ================================================================
    Total params: 130,950
    Trainable params: 130,950
    Non-trainable params: 0
    ----------------------------------------------------------------
    Input size (MB): 0.00
    Forward/backward pass size (MB): 0.01
    Params size (MB): 0.50
    Estimated Total Size (MB): 0.51
    ----------------------------------------------------------------
    ```
    
    Critic network parameters
    ```
    ----------------------------------------------------------------
            Layer (type)               Output Shape         Param #
    ================================================================
           BatchNorm1d-1                   [-1, 48]              96
                Linear-2                  [-1, 400]          19,600
                Linear-3                  [-1, 300]         121,500
                Linear-4                    [-1, 1]             301
    ================================================================
    Total params: 141,497
    Trainable params: 141,497
    Non-trainable params: 0
    ----------------------------------------------------------------
    Input size (MB): 0.00
    Forward/backward pass size (MB): 0.01
    Params size (MB): 0.54
    Estimated Total Size (MB): 0.55
    ----------------------------------------------------------------
    ```
* Apart from the points mentioned above, the MADDPG algorithm is similar to DDPG. For additional information on DDPG, see my write-up for [Project 2](https://github.com/weicheng113/p2_continuous-control/blob/master/Report.ipynb).
![Algorithm](./assets/algorithm.jpg)
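
The parameter counts in the two summaries above can be reproduced by hand. The sketch below is a sanity check rather than the project's own code. It assumes that each agent's observation has 24 values (read off the actor's `BatchNorm1d` row), that the critic sees both agents' observations (48 values), and that both agents' actions (4 values) are concatenated to the input of the second fully connected layer. The last assumption is inferred from the 121,500 parameters of `Linear-3`.

```python
def linear_params(in_features, out_features):
    # Weight matrix plus one bias per output unit.
    return in_features * out_features + out_features


def batchnorm_params(num_features):
    # One learnable scale and one learnable shift per feature.
    return 2 * num_features


OBS_SIZE = 24     # per-agent observation size (assumed from the actor summary)
ACTION_SIZE = 2   # two continuous actions per agent
NUM_AGENTS = 2

actor_total = (
    batchnorm_params(OBS_SIZE)
    + linear_params(OBS_SIZE, 400)
    + linear_params(400, 300)
    + linear_params(300, ACTION_SIZE)
)

full_obs = NUM_AGENTS * OBS_SIZE         # 48 values: observations of both agents
all_actions = NUM_AGENTS * ACTION_SIZE   # 4 values: actions of both agents

critic_total = (
    batchnorm_params(full_obs)
    + linear_params(full_obs, 400)
    + linear_params(400 + all_actions, 300)  # actions join at this layer
    + linear_params(300, 1)
)

print(actor_total, critic_total)  # 130950 141497
```

Both totals match the summaries, which supports the reading that the critic is centralized over the two agents' observations and actions.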

#### Hyper-Parameters
* Replay buffer size, BUFFER_SIZE, of **100,000**.
* Minibatch size, BATCH_SIZE, of **256**.
* Discount factor, GAMMA, of **0.99**.
* Soft update of target parameters, TAU, of **0.001**.
* Actor learning rate of **0.0001** and critic learning rate of **0.001**.
* Unmodified Ornstein-Uhlenbeck noise: mu of **0.0**, theta of **0.15** and sigma of **0.2**.
* Initial noise scale: initial_noise_scale of **1.0** and noise reduction factor: noise_reduction of **0.99**. These two parameters control how the exploration noise is scaled down in each step.
* Episodes before training: episodes_before_train of **300**. The agent only collects samples during these episodes, so that the replay buffer holds enough of them before training starts.
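
To make the noise and soft-update settings above concrete, here is a minimal sketch in plain Python. It is not the project's implementation: the class and function names are hypothetical, the Gaussian increment is one common form of the Ornstein-Uhlenbeck process, and the multiplicative decay of the noise scale after every sample is an assumption about how `noise_reduction` is applied.

```python
import random


class OUNoise:
    """Ornstein-Uhlenbeck process whose output is multiplied by a decaying scale."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2,
                 initial_noise_scale=1.0, noise_reduction=0.99, seed=0):
        self.mu = [mu] * size
        self.theta = theta
        self.sigma = sigma
        self.scale = initial_noise_scale
        self.noise_reduction = noise_reduction
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        # Restart the process at its mean, for example at the start of an episode.
        self.state = list(self.mu)

    def sample(self):
        # x <- x + theta * (mu - x) + sigma * N(0, 1), applied per dimension.
        self.state = [
            x + self.theta * (m - x) + self.sigma * self.rng.gauss(0.0, 1.0)
            for x, m in zip(self.state, self.mu)
        ]
        noise = [self.scale * x for x in self.state]
        # Assumed schedule: shrink the scale multiplicatively after every sample.
        self.scale *= self.noise_reduction
        return noise


def soft_update(local_params, target_params, tau=0.001):
    """Return tau * local + (1 - tau) * target for each pair of parameters."""
    return [tau * l + (1.0 - tau) * t for l, t in zip(local_params, target_params)]
```

With TAU at 0.001, each soft update moves the target parameters only 0.1% of the way toward the local parameters, which keeps the learning targets slow-moving.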

#### Result
The following is the training result. The environment was solved in 1260 episodes, with an average score of 0.51 over the last 100 episodes.

![Result](./assets/result.jpg)

### (Optional) Challenge: Play Soccer

#### Agent

* The agents are trained with an MA-PPO approach, which adopts a training framework similar to MADDPG: centralized training and decentralized execution.
* The critics take the full states, **4 * 336**, as input.
* For training only, the reward structure was modified as follows.
$ reward = ownReward * (1.0 - teamSpirit) + teamMemberReward * teamSpirit $. **teamSpirit = 0.4** at the moment.
* [Source code](https://github.com/weicheng113/deep-reinforcement-learning/tree/master/soccer-twos-ppo)
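
The modified reward above is a weighted blend of an agent's own reward and its teammate's reward. A minimal sketch follows. The function name `shaped_reward` is hypothetical and is not taken from the linked source code.

```python
def shaped_reward(own_reward, teammate_reward, team_spirit=0.4):
    """Blend an agent's own reward with its teammate's reward."""
    # team_spirit = 0 keeps the agent's own reward unchanged;
    # team_spirit = 1 makes the agent care only about its teammate's reward.
    return own_reward * (1.0 - team_spirit) + teammate_reward * team_spirit


# With team_spirit = 0.4, an agent keeps 60% of its own reward
# and receives 40% of its teammate's reward.
own_only = shaped_reward(1.0, 0.0)
teammate_only = shaped_reward(0.0, 1.0)
```

When both teammates receive the same reward, the blend leaves it unchanged, so the shaping only matters when their rewards differ.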

#### Result

After 5000 episodes of training, the performance of the trained agents can be seen below.
* trained agents vs random agents - [YouTube video](https://youtu.be/P0BsQ1wL7vA)
* trained agents vs trained agents - [YouTube video](https://youtu.be/T_IKQryICCI)

For comparison, here are the agents from Unity ML-Agents, which are said to be independent PPO agents - [YouTube video](https://www.youtube.com/watch?v=Hg3nmYD3DjQ)

### Future Improvements
In MA-PPO for Soccer Twos, I removed the batch normalization layer because of an issue during training. Based on my experience with other projects, I believe batch normalization would improve training efficiency, so I would like to resolve the issue and bring the layer back when I have time. I have not yet had time to train an adapted version of MADDPG for discrete actions, which is something else I would like to try. Apart from these, I also want to try Prioritized Experience Replay for training efficiency and Generalized Advantage Estimation to better balance bias and variance in the advantage estimates.