# DRLND - P1 - Continuous Control : Report
________________________________________________________________________

This report describes the environment and the algorithm I used to solve a continuous control problem in which a number of agents, each a double-jointed arm, must reach specified goal locations.

## Environment

We work with the Unity ML-Agents [Reacher](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Learning-Environment-Examples.md#reacher) environment in this project.

The environment contains 20 double-jointed arms, each of which can move to locations around it. An agent receives a reward of +0.1 for every step in which its hand is at the goal location, so the aim is to keep each arm at its target for as many time steps as possible. The problem is considered solved when the score, averaged over all 20 agents, reaches at least +30 over 100 consecutive episodes.
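
The solve criterion can be sketched in a few lines. This is a minimal illustration, assuming each agent's rewards are summed into a per-episode score, the 20 scores are averaged into one episode score, and a moving average over the last 100 episodes is compared with the target. The names `episode_score` and `is_solved` are hypothetical and not taken from the project code, and the scores below are made-up data.

```python
from collections import deque
from statistics import mean

TARGET_SCORE = 30.0   # required average score
WINDOW = 100          # number of consecutive episodes

def episode_score(agent_scores):
    """Average the per-agent episode returns into one episode score."""
    return mean(agent_scores)

def is_solved(score_window):
    """True once a full window of episodes averages at least the target."""
    return len(score_window) == WINDOW and mean(score_window) >= TARGET_SCORE

# Illustrative data only: 20 agents, alternating returns of 30 and 32.
scores_window = deque(maxlen=WINDOW)
for _ in range(WINDOW):
    scores_window.append(episode_score([30.0, 32.0] * 10))

print(is_solved(scores_window))
```

Using a `deque` with `maxlen=WINDOW` means the oldest episode drops out automatically as new episodes are appended.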

Given below are the characteristics of the agent.

* Unity brain name: ReacherBrain
* Number of Visual Observations (per agent): 0
* Vector Observation space type: continuous
* Vector Observation space size (per agent): 33
* Number of stacked Vector Observation: 1
* Vector Action space type: continuous
* Vector Action space size (per agent): 4

## Learning Algorithm

To solve this reinforcement learning problem, I use Deep Deterministic Policy Gradient (DDPG), as taught in the course.

### Deep Deterministic Policy Gradient (DDPG)

**DDPG** is an actor-critic algorithm that extends **DQN** to continuous action spaces. It uses two deep neural networks: an actor that maps states to actions and a critic that estimates the value of a state-action pair. Similar network architectures are used for both. The **Adam** optimizer is used with **learning rates of 0.0005** and **0.001** for the actor and critic, respectively, and the **discount factor** is **0.99**.

```python
GAMMA = 0.99            # discount factor
LR_ACTOR = 5e-4         # learning rate of the actor 
LR_CRITIC = 1e-3        # learning rate of the critic
```
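
To illustrate where the discount factor enters training: in DDPG the critic is usually regressed toward a one-step TD target, `y = r + GAMMA * Q'(s', mu'(s'))`, with the bootstrap term dropped at episode end. The sketch below uses plain Python numbers in place of tensors and takes the target critic's value as an input. The function `td_target` is hypothetical and is not taken from the project code.

```python
GAMMA = 0.99  # discount factor, as in the report

def td_target(reward, next_q, done, gamma=GAMMA):
    """One-step TD target: r + gamma * Q'(s', mu'(s')) for non-terminal steps.

    next_q stands in for the target critic's estimate at the next state,
    evaluated at the action chosen by the target actor.
    """
    return reward + gamma * next_q * (1.0 - float(done))

print(td_target(0.1, 2.0, done=False))  # bootstraps from the next state
print(td_target(0.1, 2.0, done=True))   # terminal step: reward only
```
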
##### Neural Network Architecture

Actor: State --> BatchNorm --> Linear(400) --> ReLU --> Linear(300) --> ReLU --> Linear(action size) --> tanh --> action

##### PyTorch Implementation
```python
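    # Layer definitions (assumes torch, torch.nn as nn, torch.nn.functional as F)
    self.bn1 = nn.BatchNorm1d(state_size)   # normalize raw observations
    self.fc1 = nn.Linear(state_size, 400)
    self.fc2 = nn.Linear(400, 300)
    self.fc3 = nn.Linear(300, action_size)

    ...

    # Forward pass: map a batch of states to actions
    state = self.bn1(state)
    x = F.relu(self.fc1(state))
    x = F.relu(self.fc2(x))
    x = torch.tanh(self.fc3(x))             # F.tanh is deprecated; output lies in [-1, 1]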
```

#### Experience Replay

We store the last one million experience tuples (S, A, R, S') in a **replay buffer**, from which we sample **mini-batches of 1024** experiences. Sampling at random breaks the correlation between consecutive experiences, which makes training more stable.

```python
    BUFFER_SIZE = int(1e6)  # replay buffer size
    BATCH_SIZE = 1024       # minibatch size
```
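
A replay buffer of this kind can be sketched with the standard library alone. This is a minimal illustration rather than the project's actual class: the names `ReplayBuffer` and `Experience` are assumptions, and the `done` flag is an addition to the (S, A, R, S') tuple that is commonly stored to mark episode ends.

```python
import random
from collections import deque, namedtuple

BUFFER_SIZE = int(1e6)  # replay buffer size
BATCH_SIZE = 1024       # minibatch size

Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"]
)

class ReplayBuffer:
    """Fixed-size buffer that evicts the oldest experience when full."""

    def __init__(self, buffer_size=BUFFER_SIZE, batch_size=BATCH_SIZE):
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        """Draw a uniform random minibatch without replacement."""
        return random.sample(self.memory, self.batch_size)

    def __len__(self):
        return len(self.memory)

# Small sizes here only to keep the demonstration quick.
buffer = ReplayBuffer(buffer_size=100, batch_size=4)
for t in range(10):
    buffer.add(state=t, action=0.0, reward=0.1, next_state=t + 1, done=False)
batch = buffer.sample()
```

In a training loop, learning would typically start only once `len(buffer)` is at least the batch size, since `random.sample` raises a `ValueError` when asked for more items than are stored.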

#### Soft Target Updates

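Separate target networks for the actor and the critic are used to compute the target values. Instead of copying the weights over periodically, a **soft target update** blends a small fraction `TAU` of the local network weights into the target weights at each update: θ_target ← τ·θ_local + (1 − τ)·θ_target.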

```python
    TAU = 1e-3              # for soft update of target parameters
```
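
The update rule can be illustrated without any deep learning library. The sketch below assumes plain Python lists of floats standing in for network parameters. The function `soft_update` is a hypothetical helper written for this illustration, not the project's implementation, which would apply the same rule to each parameter tensor.

```python
TAU = 1e-3  # interpolation factor, as in the report

def soft_update(local_params, target_params, tau=TAU):
    """Return target parameters nudged toward the local ones.

    theta_target = tau * theta_local + (1 - tau) * theta_target
    """
    return [
        tau * local + (1.0 - tau) * target
        for local, target in zip(local_params, target_params)
    ]

local = [1.0, -2.0]    # stand-in for the local network weights
target = [0.0, 0.0]    # stand-in for the target network weights
for _ in range(1000):
    target = soft_update(local, target)

print(target)  # after 1000 updates the target has moved roughly 63% of the way
```

With `TAU = 1e-3` the target weights trail the local weights by on the order of a thousand updates, which keeps the targets used in learning from shifting abruptly.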

## Plot of Rewards

After tuning the hyperparameters and adjusting the hidden layer sizes, I solved the problem in **229 episodes**. The plot below shows the reward per episode along with the target score.


![Plot of Rewards](plot.jpg)

## Result

![Result](result.gif)

The trained models, _checkpoint_actor.pth_ and _checkpoint_critic.pth_, can be found [here](weights/).

## Ideas for Future Work

* To improve performance, I plan to implement algorithms such as [PPO](https://arxiv.org/pdf/1707.06347.pdf), [A3C](https://arxiv.org/pdf/1602.01783.pdf), and [D4PG](https://openreview.net/pdf?id=SyZipzbCb), which use multiple (non-interacting, parallel) copies of the same agent to distribute the task of gathering experience.

* Try the same algorithms on the **Crawler** environment.