# Project 3 Collaborative Competition

## Learning Algorithm

A DDPG (Deep Deterministic Policy Gradient) agent was used to control the two players. DDPG is a deterministic policy gradient method with an actor-critic structure: the actor network maps an input state to a continuous action, and the critic network estimates the Q-value of a given state-action pair.

DDPG is also an off-policy algorithm: it learns from stored experiences generated by earlier versions of the policy rather than only from the current one. The learning targets for the critic are computed with separate target networks, which keeps those targets fairly stable instead of shifting with every update. In the code, the target networks have the same architectures as the corresponding local networks. The target networks still have to track the local networks, which produce the agent's behaviour, so that the two do not diverge dramatically. Currently this is done at every learning step, using a soft update controlled by `TAU`.

DDPG is known to be unstable in the absence of a replay buffer. Two variants were tried here: a regular replay buffer that samples past experiences uniformly at random, and a replay buffer prioritized by TD error. The prioritized replay buffer appeared to overfit early in training, keeping the average score from going beyond 0.1, while the uniformly sampled replay buffer did not show this problem and reached the target score.
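
The soft update of the target networks can be illustrated with a minimal sketch. It uses plain Python lists in place of network weights; the function name `soft_update` and the toy values are assumptions for illustration and are not taken from the notebook.

```python
def soft_update(local_params, target_params, tau):
    """Blend local parameters into the target parameters.

    Applies: target <- tau * local + (1 - tau) * target
    """
    return [tau * l + (1.0 - tau) * t
            for l, t in zip(local_params, target_params)]


TAU = 1e-3  # same value as in the hyperparameter list

# Toy "weights" standing in for network parameters (hypothetical values).
local = [1.0, -2.0, 0.5]
target = [0.0, 0.0, 0.0]

# With a small tau, the target drifts slowly toward the local parameters.
for _ in range(1000):
    target = soft_update(local, target, TAU)
```

With a small `TAU`, each update moves the target only slightly, which is why the targets stay stable even though the update runs at every learning step.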

The code can be found in the IPython (Jupyter) notebook `Tennis_Final_Solution.ipynb`.

The saved weights are located in the `Final_Weights` directory.

To see the saved weights in action, you can run the `Test_Agents.ipynb` notebook.

The framework was adapted from the previous project. The actor network consists of a single 128-unit hidden layer connected to the output layer. The critic has three hidden layers of 128, 64 and 32 units. This architecture was arrived at through discussions and several iterations.
The other hyperparameters are listed below:

    BUFFER_SIZE = int(1e6)  # replay buffer size
    BATCH_SIZE = 1024       # minibatch size
    GAMMA = 0.99            # discount factor
    TAU = 1e-3              # for soft update of target parameters
    LR_ACTOR = 1e-3         # learning rate of the actor
    LR_CRITIC = 3e-3        # learning rate of the critic
    WEIGHT_DECAY = 0.0      # L2 weight decay
    LATERAL_LIMIT = 1
    JUMP_LIMIT_MAX = 1
    JUMP_LIMIT_MIN = 0
    ACTOR_FC1_UNITS = 128
    ACTOR_FC2_UNITS = 64    # Not used
    ACTOR_FC3_UNITS = 32    # Not used

    CRITIC_FC1_UNITS = 128
    CRITIC_FC2_UNITS = 64
    CRITIC_FC3_UNITS = 32
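
The list does not say how `LATERAL_LIMIT`, `JUMP_LIMIT_MAX` and `JUMP_LIMIT_MIN` are used. One plausible reading, which is an assumption here and not confirmed by the notebook, is that they bound a two-component action (lateral movement and jump) before it is sent to the environment. The helper `clip_action` below is hypothetical and only sketches that reading.

```python
LATERAL_LIMIT = 1
JUMP_LIMIT_MAX = 1
JUMP_LIMIT_MIN = 0


def clip_action(action):
    """Clamp a (lateral, jump) action to the configured limits.

    Assumption: lateral movement is symmetric around zero, while the
    jump component is restricted to [JUMP_LIMIT_MIN, JUMP_LIMIT_MAX].
    """
    lateral, jump = action
    lateral = max(-LATERAL_LIMIT, min(LATERAL_LIMIT, lateral))
    jump = max(JUMP_LIMIT_MIN, min(JUMP_LIMIT_MAX, jump))
    return (lateral, jump)


# An out-of-range action is pulled back inside the limits.
clipped = clip_action((1.7, -0.3))
```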

A prioritized replay buffer was attempted during implementation, but training stalled at an average score of about 0.1. One possible explanation is that a prioritized replay buffer is more valuable once the agent has already been trained sufficiently. The results presented here use the regular replay buffer, with which the agent took 8166 episodes to reach the target score. This number may look large, but it varies considerably between runs: in previous runs the agent reached the target score in as few as 1500 episodes, and with some other random seeds it did not reach the target at all within 10,000 episodes. Beyond reaching the target score, the agent achieved a maximum score of 2.2, and the saved weights correspond to that result.
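
The difference between the two sampling schemes can be sketched as follows. This is a simplified illustration, not the notebook's code: the class and function names, the `alpha` and `eps` parameters and their defaults are assumptions, and the prioritized variant omits the importance-sampling correction that a full prioritized replay implementation normally includes.

```python
import random
from collections import deque


class UniformReplayBuffer:
    """Fixed-size buffer that samples stored experiences uniformly."""

    def __init__(self, buffer_size, batch_size, seed=0):
        self.memory = deque(maxlen=buffer_size)  # oldest entries drop out
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self):
        # Every stored experience is equally likely; no repeats in a batch.
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)


def prioritized_sample(experiences, td_errors, batch_size,
                       alpha=0.6, eps=1e-5, rng=random):
    """Sample with probability proportional to (|TD error| + eps) ** alpha.

    Sampling is done with replacement, so high-error experiences can
    appear several times in one batch.
    """
    priorities = [(abs(e) + eps) ** alpha for e in td_errors]
    return rng.choices(experiences, weights=priorities, k=batch_size)
```

Because high-error experiences are replayed more often under prioritized sampling, a small set of transitions can dominate early training, which is consistent with the overfitting seen here, although the source does not establish the cause.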

[regular_result_image]: ./Training_Result.png "Regular Replay Training result"

Below is an image of the plot of average and maximum scores using a regular replay buffer:
![Regular Buffer Training Result][regular_result_image]
<center> Regular Replay Buffer </center>

[pri_replay_result_image]: ./Pri_replay_training_result.png "Prioritized Replay Training result"

Below is an image of the plot of average and maximum scores using a prioritized replay buffer:
![Prioritized Replay Buffer][pri_replay_result_image]
<center> Prioritized Replay Buffer </center>

Training was slow on our machine, so we limited the agent to one learning step every other timestep. A framework was created to make it easy to switch between different training configurations. These configurations were:
1. Completely separate agents training independently, with separate actor and critic models
2. Agents sharing the replay buffer, but with separate actor and critic models
3. Agents sharing the replay buffer and actor model, but with separate critic models
4. Agents sharing the replay buffer and critic model, but with separate actor models
5. Agents sharing the replay buffer, the actor model and the critic model

The final solution used the fifth configuration, in which the agents share the replay buffer, the actor model and the critic model.
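
One way to express these configurations is as a small set of sharing flags. The sketch below is an assumption about how such a framework could look; `SharingProfile`, `PROFILES`, `component_counts`, `LEARN_EVERY` and `should_learn` are hypothetical names that do not come from the notebook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SharingProfile:
    """Which components the two agents share."""
    share_buffer: bool
    share_actor: bool
    share_critic: bool


# The five configurations, keyed by their number in the list.
PROFILES = {
    1: SharingProfile(False, False, False),
    2: SharingProfile(True, False, False),
    3: SharingProfile(True, True, False),
    4: SharingProfile(True, False, True),
    5: SharingProfile(True, True, True),
}


def component_counts(profile, num_agents=2):
    """Number of distinct buffers, actors and critics to create."""
    return {
        "buffers": 1 if profile.share_buffer else num_agents,
        "actors": 1 if profile.share_actor else num_agents,
        "critics": 1 if profile.share_critic else num_agents,
    }


LEARN_EVERY = 2  # learn once every other timestep


def should_learn(timestep):
    """Return True on the timesteps where a learning step is run."""
    return timestep % LEARN_EVERY == 0
```

Under this sketch, configuration 5 builds a single buffer, actor and critic that both agents use, while configuration 1 builds two of each.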



## Future Work
A common view online is that PPO has significant advantages over DDPG, so that is something I would definitely want to try next.

## Video of Agents playing

Below is a link to a video of the agents playing against each other using the saved weights.

<a href="http://www.youtube.com/watch?feature=player_embedded&v=hrtIdekQtw8" target="_blank"><img src="http://img.youtube.com/vi/hrtIdekQtw8/0.jpg"
alt="Demo of the result" width="240" height="180" border="10" /></a>