# Project 3 : Tennis Project (Collaboration and Competition)

## Project Report 

This is the report for the third project in the Udacity Deep Reinforcement Learning Nanodegree. The purpose of this project is to learn to use deep reinforcement learning algorithms to train multiple agents. In the Tennis environment, two agents control rackets to bounce a ball over a net. If an agent hits the ball over the net, it receives a reward of +0.1. If an agent lets the ball hit the ground or hits the ball out of bounds, it receives a reward of -0.01. **Thus, the goal of each agent is to keep the ball in play**.



I chose to solve this project with the DDPG algorithm. I reuse the DDPG code from https://github.com/udacity/deep-reinforcement-learning/tree/master/ddpg-pendulum as well as the code I developed for the second project.

Following https://github.com/Zartris/TD3_multi_agent_tennis/blob/master/report.pdf, I introduce larger noise in the actions in the early stage of training and gradually reduce the noise.

The code for training and validation is in **'Tennis.ipynb'**. In that notebook, I run the training program with one set of hyperparameters. The final result shows that the agents are able to obtain an average score of 0.5 over 100 episodes. In the validation part, the agents use the trained neural networks to play tennis and obtain an average score over 0.5 in 100 episodes.

## Environment details

The environment is based on [Unity ML-agents](https://github.com/Unity-Technologies/ml-agents). The project environment is similar to, but not identical to, the [Tennis](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Learning-Environment-Examples.md) environment on the Unity ML-Agents GitHub page.

The observation space consists of 24 variables corresponding to the position and velocity of the ball and racket. Each agent receives its own local observation. Two continuous actions are available, corresponding to movement toward (or away from) the net, and jumping. Every entry in the action vector should be a number between -1 and 1.

The task is episodic, and in order to solve the environment, agents must get an average score of +0.5 (over 100 consecutive episodes, after taking the maximum over both agents). Specifically,

- After each episode, we add up the rewards that each agent received (without discounting), to get a score for each agent. This yields 2 (potentially different) scores. We then take the maximum of these 2 scores.
- This yields a single score for each episode.
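
The scoring rule above can be sketched in a few lines of plain Python. The function names `episode_score` and `first_solved_episode` are hypothetical and are not taken from the project notebook; the sketch only assumes that per-step rewards are available for each agent.

```python
from collections import deque


def episode_score(agent_rewards):
    """Return the episode score: the larger of the agents' undiscounted returns.

    agent_rewards holds one list of per-step rewards for each agent.
    """
    return max(sum(rewards) for rewards in agent_rewards)


def first_solved_episode(episode_scores, window=100, target=0.5):
    """Return the first episode (1-indexed) at which the mean of the last
    `window` episode scores reaches `target`, or None if that never happens."""
    recent = deque(maxlen=window)
    for episode, score in enumerate(episode_scores, start=1):
        recent.append(score)
        if len(recent) == window and sum(recent) / window >= target:
            return episode
    return None
```

A training loop could append `episode_score(...)` to a list after every episode and stop once `first_solved_episode` returns a value.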


### Code implementation

The multi-agent DDPG algorithm is adopted from [Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments](https://arxiv.org/abs/1706.02275).
The code for this DDPG agent is built upon the example from https://github.com/udacity/deep-reinforcement-learning/tree/master/ddpg-pendulum. I modified it to adapt it to the multi-agent environment. The code includes:
 
- 'Tennis.ipynb': main code for training and validating the model;

- 'maddpg_agent.py': 
> **Agent class**  for a single agent. The Agent class has an actor network and a critic network.  
> **MAagent class**  that contains several Agents, and is responsible for combined action execution, managing the replay buffer, and updating the neural networks of each individual agent.  

- 'model.py' : contains the neural networks for the actor network and critic network. One batch normalization layer is added after the first linear layer in both the actor and critic networks. 


### Neural network implementation description
Each agent maintains its own actor neural network. The actor takes only the local observation of an individual agent and outputs the action for that specific agent, so the **Actor Neural Networks** use the following architecture:

```
Input nodes (state_size=24)
  -> Fully Connected Layer
    -> Batch Normalization, ReLU activation
      -> Fully Connected Layer (ReLU activation)
        -> Output nodes (2 nodes, tanh activation)
```
Each agent maintains its own critic neural network, but during training each agent has access to the other agents' observations and action policies, so the **Critic Neural Networks** use the following architecture:

```
Input nodes (num_agents*state_size + num_agents*action_size = 2*24 + 2*2)
  -> Fully Connected Layer
    -> Batch Normalization, ReLU activation
      -> Fully Connected Layer (ReLU activation)
        -> Output node (1 node, no activation)
```
**Description of experience buffer**: In the project, there are two agents. The "states" returned from the environment is a $2\times24$ numpy ndarray. The "actions" returned from MAagent is a $2\times2$ numpy ndarray (each row corresponds to one agent's action). When "actions" is given to the environment, "next_states", "rewards" and "dones" are given back. "next_states" is a $2\times24$ numpy ndarray; "rewards" and "dones" are length-two lists. They are stored in the experience buffer as they are.

**Minibatch**: When a minibatch is sampled from the experiences, the returned items include:
- obs: length-two *list*; obs[i] corresponds to agent i's local observations and is a $batch\_size\times24$ tensor. obs[i] can be given directly to the actor neural network;
- rewards: length-two *list*; rewards[i] corresponds to agent i's local reward and is a $batch\_size\times1$ tensor;
- next_obs: length-two *list*, constructed similarly to obs;
- dones: length-two *list*, constructed similarly to rewards;

- states: a $batch\_size\times 48$ *tensor*; each row corresponds to the combined observation of all agents;
- actions: a $batch\_size\times 4$ *tensor*; each row corresponds to the combined actions of all agents;
- next_states: constructed similarly to states.
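
As a rough illustration of how these items could be assembled, the sketch below uses nested Python lists as stand-ins for the numpy arrays and PyTorch tensors of the real code. The names `flatten` and `sample_minibatch`, and the tuple layout of a stored experience, are assumptions made for this example and are not taken from `maddpg_agent.py`.

```python
import random


def flatten(rows):
    """Concatenate the per-agent rows of one experience into a single joint row."""
    return [value for row in rows for value in row]


def sample_minibatch(buffer, batch_size, num_agents=2):
    """Sample experiences and split them into per-agent and joint views.

    Each experience is assumed to be a tuple
    (states, actions, rewards, next_states, dones), where states and
    next_states are num_agents x 24, actions is num_agents x 2, and
    rewards and dones have one entry per agent.
    """
    batch = random.sample(buffer, batch_size)
    agents = range(num_agents)
    return {
        # Per-agent views: one batch_size x 24 (or batch_size x 1) block per agent.
        "obs": [[list(e[0][i]) for e in batch] for i in agents],
        "rewards": [[[e[2][i]] for e in batch] for i in agents],
        "next_obs": [[list(e[3][i]) for e in batch] for i in agents],
        "dones": [[[float(e[4][i])] for e in batch] for i in agents],
        # Joint views: one concatenated row per sampled experience.
        "states": [flatten(e[0]) for e in batch],
        "actions": [flatten(e[1]) for e in batch],
        "next_states": [flatten(e[3]) for e in batch],
    }
```

The per-agent views feed the actors, while the joint views feed the critics, which is why both are returned together.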

### Critic neural network updating:
When a minibatch is sampled:
- each agent i takes next_obs[i] to calculate its next action: agents[i].**actor_target**(next_obs[i]); the actions are concatenated to produce **actions_next** (a $batch\_size\times 4$ *tensor*)
- each agent i calculates Q_targets_next = agents[i].**critic_target**(next_states, actions_next), then Q_targets = rewards[i] + (gamma * Q_targets_next * (1 - dones[i]))
- each agent i calculates Q_expected = agents[i].**critic_local**(states, actions)
- the critic network is updated based on the loss function F.mse_loss(Q_expected, Q_targets)
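
The arithmetic of the target and the loss can be checked with a small sketch that uses plain Python lists of scalars in place of $batch\_size\times1$ tensors. The helper names `td_targets` and `mse_loss` are hypothetical; the actual code uses PyTorch tensors and `F.mse_loss`.

```python
def td_targets(rewards, q_targets_next, dones, gamma=0.99):
    """Compute r + gamma * Q_next * (1 - done) for every sample in the batch.

    A done flag of 1.0 removes the bootstrapped term, so the target for a
    terminal transition is just the reward.
    """
    return [r + gamma * q * (1.0 - d)
            for r, q, d in zip(rewards, q_targets_next, dones)]


def mse_loss(q_expected, q_targets):
    """Mean squared error between the critic's predictions and the targets."""
    squared_errors = [(e - t) ** 2 for e, t in zip(q_expected, q_targets)]
    return sum(squared_errors) / len(squared_errors)
```

For a non-terminal sample with reward 0.1 and a target-critic value of 2.0, the target is 0.1 + 0.99 * 2.0; for a terminal sample the target equals the reward alone.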

### Actor neural network updating:
After the critic neural network is updated:
- each agent i takes obs[i] to calculate its action: agents[i].**actor_local**(obs[i]); the actions are concatenated to produce **actions_pred** (a $batch\_size\times 4$ *tensor*)
- the actor network is updated based on the loss function -self.agents[i].**critic_local**(states, **actions_pred**).mean()
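
Written as a formula, the actor loss for agent $i$ reads as follows. The symbols are introduced here for illustration only: $B$ is the batch size, $x^{(b)}$ is row $b$ of states, $o_j^{(b)}$ is row $b$ of obs[j], $\mu_j$ is agents[j].actor_local, $Q_i$ is agents[i].critic_local, and $N=2$ is the number of agents.

$$
L_{actor}(\theta_i) = -\frac{1}{B}\sum_{b=1}^{B} Q_i\left(x^{(b)},\ \mu_1(o_1^{(b)}),\ \dots,\ \mu_N(o_N^{(b)})\right)
$$

Minimizing this loss pushes agent $i$'s actor toward actions that its own critic rates highly. The report does not state whether the other agents' predicted actions are detached from the computation graph, so that detail is left open here.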

### MAagent parameters and results

#### Methodology

The set of hyperparameters used in this project is as follows:
```
BUFFER_SIZE = int(1e5)          # replay buffer size
BATCH_SIZE = 256                # minibatch size
max_t = 1000                    # maximum allowed time steps in each episode
n_episodes = 10000              # maximum allowed episodes
random_seed = 2                 # random seed
noise_scalar_init = 2.0         # initial scale of the action noise
noise_reduction_factor = 0.999  # multiplicative decay of the noise scale
update_every = 2                # time steps between neural network updates from a minibatch
actor_fc1_units = 200           # number of neurons in first layer in actor network
actor_fc2_units = 150           # number of neurons in second layer in actor network
critic_fcs1_units = 400         # number of neurons in first layer in critic network
critic_fc2_units = 300          # number of neurons in second layer in critic network

gamma = 0.99                    # discount rate
tau = 1e-2                      # soft update rate; a larger tau helps the training

lr_actor = 1e-4                 # actor network learning rate
lr_critic = 1e-3                # critic network learning rate
weight_decay = 0                # neural network weight decay rate

mu = 0.                         # Ornstein-Uhlenbeck noise parameter
theta = 0.15                    # Ornstein-Uhlenbeck noise parameter
sigma = 0.2                     # Ornstein-Uhlenbeck noise parameter
```
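
To show how the noise parameters might interact, here is a minimal sketch of an Ornstein-Uhlenbeck process combined with a decaying noise scale. It rests on several assumptions that the report does not confirm: the increments are Gaussian, the scale is multiplied by `noise_reduction_factor` once per decay step (whether that is a time step or an episode is not stated), and noisy actions are clipped to [-1, 1]. The names `OUNoise`, `noise_scale` and `noisy_action` are used here for illustration.

```python
import random


class OUNoise:
    """Ornstein-Uhlenbeck process with Gaussian increments (an assumption)."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, seed=2):
        self.mu = [mu] * size
        self.theta = theta
        self.sigma = sigma
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        """Reset the internal state to the mean."""
        self.state = list(self.mu)

    def sample(self):
        """Pull the state toward the mean, add a random kick, and return a copy."""
        self.state = [
            x + self.theta * (m - x) + self.sigma * self.rng.gauss(0.0, 1.0)
            for x, m in zip(self.state, self.mu)
        ]
        return list(self.state)


def noise_scale(step, init=2.0, factor=0.999):
    """Noise scale after `step` multiplicative decay steps."""
    return init * factor ** step


def noisy_action(action, noise, scale):
    """Add scaled noise to an action and clip every entry to [-1, 1]."""
    return [max(-1.0, min(1.0, a + scale * n)) for a, n in zip(action, noise)]
```

With these values the scale starts at 2.0 and falls to roughly 0.74 after 1000 decay steps, so exploration is strong early on and fades gradually.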
I did not experiment with different hyperparameters in this project. I noticed that it takes over a thousand episodes for the agents to really begin learning something, so trying out different hyperparameter combinations for a few episodes would not show how the model really performs. Therefore, I picked the above set of hyperparameters and let it train for 10,000 episodes. In the end it worked.

The scores during training are plotted below. It can be observed that it takes about 3000 episodes to train the agents to obtain an average score over 0.5.
![title](imag/training_score_plot.png)


After training, we use the trained agents to play for 100 episodes and show that they can achieve an average score of 0.5 over 100 episodes.
![title](imag/validation_score_plot.png)


### Ideas for future work

In the discussion section, I found many discussions about how to improve the performance of this project. For example, https://knowledge.udacity.com/questions/133212 gives some suggestions, such as:
> - Use parameter space noise rather than noise on action. https://vimeo.com/252185862 https://github.com/jvmancuso/ParamNoise
> - We can use prioritised experience buffer. https://github.com/Damcy/prioritized-experience-replay
> - Different replay buffer for actor/critic
> - Try adding dropouts in critic network
> - Turn off OU noise and use random noise
> - Set batch size to 256, set your actor and critic NN hidden layers to 512x256, OU sigma to 0.1, and perhaps make your buffer size 1e6

I suspect that this environment has a "sparse reward" problem. I found this tutorial on tackling sparse rewards: [DRL Lecture 7: Sparse Reward (in Chinese)](https://www.youtube.com/watch?v=-5cCWhu0OaM&list=PLJV_el3uVTsODxQFgzMzPLa16h6B8kWM_&index=7). Maybe we can do some reward shaping to purposefully lead the agents to better actions.

In the paper [Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments](https://arxiv.org/abs/1706.02275), section 4.2 discusses inferring the policies of other agents. When training the critic network, we assume that every agent has access to the other agents' action policies. If this assumption does not hold, then each agent has to maintain approximate policy networks of the other agents. I think this is an interesting point, and I wish to explore it further.

This concludes the report.
