##  Project 3: Collaboration and Competition

# Goal and Environment

In this project, I built agents that collaborate on a racket-and-ball task. Details can be found in the [README](https://github.com/readerwei/Reinforcement_Learning_Degree/blob/master/p3_collab-compet/README.md).

In this environment, two agents control rackets to bounce a ball over a net. If an agent hits the ball over the net, it receives a reward of +0.1. If an agent lets a ball hit the ground or hits the ball out of bounds, it receives a reward of -0.01. Thus, the goal of each agent is to keep the ball in play.

The environment is considered solved when the average over 100 consecutive episodes of the episode score is at least +0.5, where the episode score is the maximum of the two agents' scores for that episode.
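
As a minimal sketch of this criterion, the helper below tracks a rolling window of episode scores. The function names and the toy score history are illustrative assumptions, not part of the project code.

```python
from collections import deque


def episode_score(agent_returns):
    """Score of one episode: the larger of the agents' summed rewards."""
    return max(agent_returns)


def first_solved_episode(per_episode_returns, target=0.5, window=100):
    """Return the 1-based episode at which the rolling mean over `window`
    episodes first reaches `target`, or None if it never does."""
    recent = deque(maxlen=window)  # keeps only the latest `window` scores
    for episode, agent_returns in enumerate(per_episode_returns, start=1):
        recent.append(episode_score(agent_returns))
        if len(recent) == window and sum(recent) / window >= target:
            return episode
    return None


# Toy history (made up): 100 weak episodes followed by 100 strong ones.
history = [(0.0, 0.1)] * 100 + [(1.0, 0.9)] * 100
solved_at = first_solved_episode(history)
print(solved_at)
```

Taking the maximum over agents means that one strong agent is enough to score well in a single episode, but sustaining a 100-episode average of +0.5 in practice requires both agents to keep the rally going.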

# Learning Algorithm

For this project I adopted the recommended *multi-agent deep deterministic policy gradient* (MADDPG) algorithm, which is a natural extension of DDPG. Both algorithms are described in these papers:

* [MADDPG](https://arxiv.org/pdf/1706.02275.pdf)

* [DDPG](https://arxiv.org/pdf/1509.02971.pdf)

Details of these algorithms are below:

## Theoretical Foundation of MADDPG

#### DQN vs DDPG
In the DQN algorithm, one estimates the optimal state-action values for a continuous state space and a discrete action space, then chooses the optimal action with an $\arg\max$ operation. In continuous action spaces, however, performing such an $\arg\max$ is difficult.

On the other hand, one can use a policy based method to optimize the policy gradient directly. Actor-critic methods leverage the strengths of both policy-based and value-based methods. Using a policy-based approach, the agent (actor) learns how to act by directly estimating the optimal policy and maximizing reward through gradient ascent. Meanwhile, employing a value-based approach, the agent (critic) learns how to estimate the value (i.e., the future cumulative reward) of different state-action pairs. Actor-critic methods combine these two approaches in order to accelerate the learning process. Actor-critic agents are also more stable than value-based agents, while requiring fewer training samples than policy-based agents.

DDPG attacks the problem from a third direction. It still uses a Q-network $Q$ to evaluate the value of a state-action pair. However, it uses another deep network $g$ to output the action directly, similar to the actor in the actor-critic method:
$$ g_\pi(s; \Phi) = \mathbf{a} $$
The goal is to find parameters $\Phi$ that maximize the value function $Q^*(s,\mathbf{a}; \Theta)$, where $\mathbf{a}$ is the output of the above actor function. Here, the Q-network uses a one-step temporal-difference (TD) method to evaluate the action values, similar to the critic in the actor-critic method. As in the previous project, to increase the efficiency of learning, we adopt a replay buffer $\mathbb{D}$. Also, to avoid the moving-target problem, we set up target networks for both $\Phi$ and $\Theta$, whose parameters are labeled $\Phi^-$ and $\Theta^-$.

Thus, the algorithm can be summarized as follows:
- Take action $\mathbf{a} = g_\pi(s; \Phi) + \mathbf{z}$ in state $s$, where $\mathbf{z}$ is random exploration noise (described below).
- Observe the next state $s'$ and reward $R$, and add $(s, \mathbf{a}, R, s')$ to the replay buffer $\mathbb{D}$.
- Sample a mini-batch of transitions $(s^{(i)}, \mathbf{a}^{(i)}, R^{(i)}, s'^{(i)})$ from $\mathbb{D}$.
- Update $\Theta$ by gradient descent on the squared one-step TD error:
    $$\Theta \leftarrow \Theta - \eta \nabla_{\Theta} L(\Theta)$$
    where
    $$ L(\Theta) = \sum_{i}{\left[R^{(i)}+\gamma Q(s'^{(i)}, g_\pi(s'^{(i)}; \Phi^-); \Theta^-) - Q(s^{(i)},\mathbf{a}^{(i)}; \Theta) \right]^2} $$

- Update $\Phi$ by gradient ascent on the critic's value of the actor's own actions:
    $$\Phi \leftarrow \Phi + \lambda \nabla_{\Phi} \sum_{i} Q(s^{(i)}, g_\pi(s^{(i)}; \Phi); \Theta)$$
- Slowly update the target networks:
    $\Theta^- \leftarrow \tau \Theta + (1-\tau) \Theta^-$ and $\Phi^- \leftarrow \tau \Phi + (1-\tau) \Phi^-$
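
The critic update in the steps above can be illustrated with plain numbers. The sketch below is only a worked example of the one-step TD target and the summed squared TD error. The `done` mask, the value of `gamma`, and the sample Q-values are assumptions for illustration; the real implementation computes these quantities on PyTorch tensors.

```python
def td_targets(rewards, next_q_values, dones, gamma=0.99):
    """One-step TD targets: y_i = R_i + gamma * Q_target(s'_i, g(s'_i)).

    Bootstrapping is switched off on terminal transitions (done=True),
    which is an assumption added here; the equations above omit it.
    """
    return [r + gamma * q_next * (0.0 if done else 1.0)
            for r, q_next, done in zip(rewards, next_q_values, dones)]


def critic_loss(q_values, targets):
    """Sum over the mini-batch of squared TD errors."""
    return sum((y - q) ** 2 for q, y in zip(q_values, targets))


# Made-up mini-batch of two transitions.
rewards = [0.1, -0.01]
next_q = [1.0, 2.0]      # stand-ins for the target critic's output at s'
dones = [False, True]    # the second transition ends the episode
current_q = [0.5, 0.0]   # stand-ins for the local critic's output at (s, a)

targets = td_targets(rewards, next_q, dones, gamma=0.9)
loss = critic_loss(current_q, targets)
print(targets, loss)
```

Because the targets are computed from the target networks, they are treated as constants when the gradient with respect to the local critic's parameters is taken.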

The DDPG logic is implemented as part of the `Agent()` class in [ddpg_agent.py](https://github.com/readerwei/Reinforcement_Learning_Degree/blob/master/p2_continuous-control/ddpg_agent.py#L30) of the source code. The actor and critic networks are defined by the `Actor()` and `Critic()` classes in [model.py](https://github.com/readerwei/Reinforcement_Learning_Degree/blob/master/p2_continuous-control/model.py).



#### Experience Replay
Experience replay allows the RL agent to learn from past experience.

As with DQN in the previous project, DDPG also uses a replay buffer to gather experiences from each agent. Each experience is stored in the replay buffer as the agent interacts with the environment. In this project, there is one central replay buffer shared by both agents, which allows the agents to learn from each other's experiences. This turned out to be critically important for achieving a good result in a small number of episodes.

The replay buffer contains a collection of experience tuples with the state, action, reward, and next state $(s, a, R, s')$. Each agent samples from this buffer as part of the learning step. Experiences are sampled randomly, so that the data is uncorrelated. This prevents action values from oscillating or diverging catastrophically, since a naive algorithm could otherwise become biased by correlations between sequential experience tuples.

Also, experience replay improves learning through repetition. By doing multiple passes over the data, our agents have multiple opportunities to learn from a single experience tuple. This is particularly useful for state-action pairs that occur infrequently within the environment.

The implementation of the replay buffer can be found in [ddpg_agent.py](https://github.com/readerwei/Reinforcement_Learning_Degree/blob/master/p2_continuous-control/ddpg_agent.py#L189) of the source code.

The size of the buffer is also very important to the success of training. Initially I used the default buffer size of 1e5, but training never reached the target performance with this size. After I increased the size to 1e6, the agents reached the performance shown in this report.
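
A minimal sketch of such a fixed-size buffer with uniform random sampling is shown below, using only the standard library. It is not the project's implementation: the `batch_size` default and the `seed` argument are assumptions, and the real class additionally converts sampled batches to PyTorch tensors.

```python
import random
from collections import deque, namedtuple

Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"])


class ReplayBuffer:
    """Fixed-size store of experience tuples with uniform random sampling."""

    def __init__(self, buffer_size=int(1e6), batch_size=128, seed=0):
        # A deque with maxlen silently drops the oldest entry when full.
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(
            Experience(state, action, reward, next_state, done))

    def sample(self):
        # Sampling without replacement breaks the temporal correlation
        # between consecutive experiences.
        return self.rng.sample(self.memory, k=self.batch_size)

    def __len__(self):
        return len(self.memory)


# Tiny demonstration with integer stand-ins for states.
buffer = ReplayBuffer(buffer_size=3, batch_size=2)
for t in range(5):
    buffer.add(state=t, action=0.0, reward=0.1, next_state=t + 1, done=False)
print(len(buffer), [e.state for e in buffer.memory])
```

With a larger `buffer_size`, older experiences stay available for longer before being overwritten, which is one plausible reason the 1e6 setting behaved differently from 1e5.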

#### Exploration vs Exploitation
One challenge is choosing which action to take while the agent is still learning the optimal policy. Should the agent choose an action based on the rewards observed thus far? Or, should the agent try a new action in hopes of earning a higher reward? This is known as the **exploration vs. exploitation dilemma**.

In the Navigation project, I addressed this by implementing an 𝛆-greedy algorithm. This algorithm allows the agent to systematically manage the exploration vs. exploitation trade-off. The agent "explores" by picking a random action with some probability epsilon `𝛜`. Meanwhile, the agent continues to "exploit" its knowledge of the environment by choosing actions based on the deterministic policy with probability (1-𝛜). However, this approach won't work for controlling a racket, because the actions are no longer a discrete set of simple directions (i.e., up, down, left, right).

Instead, we'll use the **Ornstein-Uhlenbeck process**, as suggested in the previously mentioned [paper by Google DeepMind](https://arxiv.org/pdf/1509.02971.pdf) (see bottom of page 4). The Ornstein-Uhlenbeck process adds a certain amount of noise to the action values at each timestep. This noise is correlated with the previous noise, and therefore tends to stay in the same direction for longer durations without canceling itself out. This allows the racket to maintain velocity and explore the action space with more continuity.

You can find the Ornstein-Uhlenbeck process implemented in the [`OUNoise`](https://github.com/readerwei/Reinforcement_Learning_Degree/blob/master/p2_continuous-control/ddpg_agent.py#L167) class of the source code.
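
The following standard-library sketch shows one common discretization of the process, in which each step pulls the state back toward the mean and adds Gaussian noise. The parameter values `theta=0.15` and `sigma=0.2` are the ones quoted in the DDPG paper; the `seed` argument and the use of Gaussian increments are assumptions and may differ from the project's `OUNoise` class.

```python
import random


class OUNoise:
    """Discretized Ornstein-Uhlenbeck process (illustrative sketch)."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, seed=0):
        self.mu = [mu] * size
        self.theta = theta    # strength of the pull back toward mu
        self.sigma = sigma    # scale of the random perturbation
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        """Reset the internal state (the noise) to the mean."""
        self.state = list(self.mu)

    def sample(self):
        """Advance one step: x <- x + theta * (mu - x) + sigma * N(0, 1)."""
        self.state = [
            x + self.theta * (m - x) + self.sigma * self.rng.gauss(0.0, 1.0)
            for x, m in zip(self.state, self.mu)
        ]
        return list(self.state)


noise = OUNoise(size=2, seed=1)
first = noise.sample()
second = noise.sample()   # depends on `first`, so the noise is correlated
print(first, second)
```

Each new sample starts from the previous one, which is what makes consecutive noise values correlated rather than independent.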

# Code Implementation

The code used here is adapted from the ["Physical Deception Lab"](https://classroom.udacity.com/nanodegrees/nd893/) tutorial from the Deep Reinforcement Learning Nanodegree, and has been adjusted for use with the Racket Environment.

The code consists of:

- model.py : This Python file implements the PyTorch Actor and Critic classes, both of which inherit from the nn.Module base class. Each is a regular fully connected deep neural network built with the [PyTorch Framework](https://pytorch.org/docs/0.4.0/). The actor network is trained to generate the actions to perform given the observed environment states, while the critic network is trained to evaluate the value of those actions. These networks are used by the DDPG agent and are composed of:

  - Actor network
      - input layer of size equal to the state_size (33)
      - 2 hidden fully connected layers of 128 and 64 cells each
      - output layer that returns the actions to be taken by the agent; its size is set by the action_size parameter passed to the constructor, which is 4 in our problem
      
  - Critic network
      - input layer of size 33
      - 1st hidden fully connected layer of 128 cells, with 4 extra units that feed in the action values
      - 2nd hidden fully connected layer of 64 cells
      - output layer which returns the Q-value of size 1
  
- ddpg_agent.py : In this python file, a DDPG agent, an OUNoise class and a Replay Buffer memory (used by the DDPG agent) are defined.

  - The Agent class implements the DDPG algorithm described above. It provides several methods:
    - constructor : 
      - Initialize the memory buffer (*Replay Buffer*)
      - Initialize the OUNoise instance
      - Initialize 2 instances of the Actor neural network: the *target* network and the *local* network
      - Initialize 2 instances of the Critic neural network: the *target* network and the *local* network
      
    - step() : 
      - Stores a step taken by the agent (state, action, reward, next_state, done) in the replay buffer
      - At every step, if there are enough samples available in the replay buffer, samples a mini-batch from the buffer and performs a learning step
      
    - act():
      - Returns actions for the given state according to the current policy (actor) network
      - Adds noise to each action the actor takes
      
    - learn():
      - Updates the parameters of both the critic and the actor networks using a given batch of experiences from the replay buffer
      - Updates the weights of the two *target* networks by gradually blending in the current weights of the *local* networks
      
    - soft_update():
      - Called by learn() to slowly blend the weights of the *local* network into the *target* network
      
  - The ReplayBuffer class implements a fixed-size buffer to store experience tuples (state, action, reward, next_state, done)
    - add() adds an experience step to the memory
    
    - sample() randomly samples a mini-batch of experience steps for learning
    
  - The OUNoise class implements the Ornstein-Uhlenbeck process, which serves as the exploration mechanism for the actions
    - reset() resets the internal state (= noise) to the mean (mu)
    
    - sample() updates the internal state and returns it as a noise sample to be added to the action
    
    
- Continous-Control_ddpg_gpu.ipynb : This Jupyter notebook trains the agent. In more detail, it lets you:
  - Import the Necessary Packages 
  - Examine the State and Action Spaces
  - Take Random Actions in the Environment
  - Train an agent using DDPG; the main training function is called ddpg()
  - Use TensorBoard to monitor the training procedure
  - Plot the scores
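
To make the `soft_update()` step listed under the Agent class concrete, here is a small sketch of the blending rule applied to plain lists of numbers. The function signature and the value `tau=1e-3` are assumptions for illustration; the project's method operates on the parameters of PyTorch networks instead.

```python
def soft_update(local_params, target_params, tau=1e-3):
    """Blend local weights into target weights:
    target <- tau * local + (1 - tau) * target
    """
    return [tau * w_local + (1.0 - tau) * w_target
            for w_local, w_target in zip(local_params, target_params)]


local = [1.0, -1.0]    # stand-ins for the local network's weights
target = [0.0, 0.0]    # stand-ins for the target network's weights

# After 1000 updates with tau = 1e-3 the target has moved about 63%
# of the way toward the (fixed) local weights.
for _ in range(1000):
    target = soft_update(local, target, tau=1e-3)
print(target)
```

A small `tau` keeps the target networks changing slowly, which is what stabilizes the TD targets used to train the critic.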
