# Report for Project 3: Collaboration and Competition


## Learning Algorithm

DDPG is used to train the agents. The DDPG architecture, including the local/target actor, the local/target critic, and the replay buffer, is shared between the two agents.

At every time step, the DDPG agent receives two experience tuples, one from each tennis agent, and both are added to a single replay buffer. If training is due at this time step (as determined by the hyperparameters that control training frequency), a batch of samples is drawn uniformly from the replay buffer and used to train the DDPG agent. Since the observation is stacked, each state has length 3 * 8 = 24.
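
The shared buffer described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the project's actual code: the names `Experience`, `SharedReplayBuffer`, and `add_step` are hypothetical, and the real implementation presumably converts sampled batches to tensors.

```python
import random
from collections import deque, namedtuple

# Hypothetical names for illustration; the project's own classes may differ.
Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"]
)


class SharedReplayBuffer:
    """One buffer that stores the experiences of both agents."""

    def __init__(self, max_size, seed=0):
        self.memory = deque(maxlen=max_size)  # oldest entries are dropped first
        self.rng = random.Random(seed)

    def add_step(self, states, actions, rewards, next_states, dones):
        # Each argument holds one entry per agent; store one tuple per agent.
        for experience in zip(states, actions, rewards, next_states, dones):
            self.memory.append(Experience(*experience))

    def sample(self, batch_size):
        # Uniform sampling without replacement.
        return self.rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)
```

With two agents, every call to `add_step` grows the buffer by two entries, so both agents' experiences end up in the same training batches.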

DDPG trains as follows. Given a batch of experiences `(states, actions, rewards, next_states, dones)`, the target Q-value is first calculated as:

`y = rewards + (1 - dones) * gamma * critic_target(next_states, actor_target(next_states))`

Then the local critic is trained using the loss:

`MSE(critic_local(states, actions), y)`

The local actor is trained using the loss:

`-reduce_mean(critic_local(states, actor_local(states)))`

Finally, if the soft-update frequency condition is met, the weights of the local networks are soft-copied to the target networks.
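
To make the arithmetic of these steps concrete, here is a small sketch in plain Python. It is only an illustration under assumptions: the lists stand in for network outputs and weights, the helper names `td_targets`, `mse`, and `soft_update` are hypothetical, and the soft update uses the common `tau * local + (1 - tau) * target` interpolation. The actor loss is omitted because it needs automatic differentiation.

```python
def td_targets(rewards, dones, next_q_values, gamma=0.99):
    """y = rewards + (1 - dones) * gamma * next_q_values, per sample."""
    return [
        r + (1.0 - float(d)) * gamma * q
        for r, d, q in zip(rewards, dones, next_q_values)
    ]


def mse(predictions, targets):
    """Mean squared error, as used for the critic loss."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def soft_update(local_params, target_params, tau=1e-3):
    """Move each target weight a fraction tau of the way toward the local weight."""
    return [tau * l + (1.0 - tau) * t for l, t in zip(local_params, target_params)]


# Toy numbers: next_q_values stands in for
# critic_target(next_states, actor_target(next_states)).
y = td_targets(rewards=[0.1, -0.01], dones=[False, True], next_q_values=[2.0, 5.0])
critic_loss = mse(predictions=[2.0, 0.0], targets=y)
new_target = soft_update(local_params=[1.0, -1.0], target_params=[0.0, 0.0])
```

For the terminal sample (`done=True`), the bootstrap term vanishes and the target is just the reward, which is what the `(1 - dones)` factor is for.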

Architecture of the actor:

```
  # both fc1 and fc2 use the ReLU activation,
  # and fc3 uses tanh

  # states are passed to fc1 and flow through to the end
  (fc1): Linear(in_features=24, out_features=400, bias=True)
  (bn1): BatchNorm1d(400, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (fc2): Linear(in_features=400, out_features=300, bias=True)
  (fc3): NoisyLinear(in_features=300 out_features=2, bias=True)
```

Architecture of the critic:

```
  # all fc layers use the ReLU activation,
  # except for the last one, for which there is no non-linearity
  
  # states are passed to fc1
  (fc1): Linear(in_features=24, out_features=400, bias=True)
  (bn1): BatchNorm1d(400, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  
  # the transformed states are concatenated with the actions (400 + 2 = 402),
  # and the merged tensor flows through to the end
  (fc2): Linear(in_features=402, out_features=300, bias=True)
  (fc3): Linear(in_features=300, out_features=1, bias=True)
```
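
As a quick sanity check on the critic's layer sizes, the sketch below recomputes the input width of `fc2` and the number of trainable parameters from the printed shapes. The helper names are hypothetical, and the action size of 2 is taken from the actor's output layer.

```python
def linear_params(in_features, out_features, bias=True):
    """Number of trainable parameters in a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


def batchnorm_params(num_features):
    """Affine batch norm learns one scale and one shift per feature."""
    return 2 * num_features


state_size, action_size = 24, 2
fc1_out, fc2_out = 400, 300

# The transformed states are concatenated with the actions before fc2.
fc2_in = fc1_out + action_size

critic_params = (
    linear_params(state_size, fc1_out)
    + batchnorm_params(fc1_out)
    + linear_params(fc2_in, fc2_out)
    + linear_params(fc2_out, 1)
)
```

Here `fc2_in` matches the `in_features=402` shown above, and almost all of the critic's parameters sit in `fc2`.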

Hyperparameters:

| name                          | value    | comment|
| ---                           | ---      | --- |
| memory_max_size               | 1,000,000 | the size of the replay buffer |
| num_episodes                  | 10000    | number of episodes to train |
| batch_size                    | 128      | batch size |
| gamma                         | 0.99     | discount factor for future rewards |
| critic_local_lr               | 1e-3     | critic's learning rate |
| actor_local_lr                | 1e-3     | actor's learning rate |
| update_target_every_learnings | 1        | soft-update the target networks after every learning step |
| learn_every_new_experiences   | 20       | start a round of learning after every 20 new experiences |
| times_consequtive_learn       | 10       | number of consecutive learning steps per round; combined with the two settings above, for every 20 new samples the local networks are trained 10 times and the target networks are updated 10 times |
| soft_update_tau               | 1e-3     | each soft update copies only this fraction of the local network's weights to the target |
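
The interaction of the three frequency settings can be checked with a small counting sketch. It is an assumption-laden illustration, not the project's scheduler: `count_updates` is a hypothetical helper, it uses simple modulo counters, and it ignores any warm-up condition such as waiting until the buffer holds at least `batch_size` samples.

```python
def count_updates(
    num_new_experiences,
    learn_every_new_experiences=20,
    times_consequtive_learn=10,
    update_target_every_learnings=1,
):
    """Count local-network training steps and target-network soft updates."""
    learnings = 0
    target_updates = 0
    for n in range(1, num_new_experiences + 1):
        # A round of learning starts after every N new experiences.
        if n % learn_every_new_experiences == 0:
            for _ in range(times_consequtive_learn):
                learnings += 1
                # The target networks follow after every M learning steps.
                if learnings % update_target_every_learnings == 0:
                    target_updates += 1
    return learnings, target_updates
```

With the defaults from the table, `count_updates(20)` gives 10 training steps and 10 target updates. Since each time step adds two experiences, a round of learning would start every 10 time steps under these assumptions.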


## Plot of Rewards

As the plot shows, the environment was solved at around episode 8100 (an average score of at least +0.5 over 100 consecutive episodes). The average score peaked at episode 8270, where it reached +1.6.

![](run/Jul09_08-51-44_0.png)
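
The "solved" criterion can be expressed as a moving average over the last 100 episodes. The sketch below is a minimal illustration, assuming `episode_scores` holds one score per episode; the function name `first_solved_episode` is hypothetical.

```python
from collections import deque


def first_solved_episode(episode_scores, target=0.5, window=100):
    """Return the 1-based episode at which the moving average first reaches target."""
    recent = deque(maxlen=window)  # keeps only the last `window` scores
    for episode, score in enumerate(episode_scores, start=1):
        recent.append(score)
        if len(recent) == window and sum(recent) / window >= target:
            return episode
    return None  # never solved within the given scores
```

The function returns `None` when the average never reaches the target, which makes it easy to tell an unsolved run apart from a solved one.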


## Ideas for Future Work

- Try prioritized experience replay
- More stable learning