# Collaboration and Competition
This notebook describes the implementation and demonstrates the two trained agents, Alice and Bob, playing in the actual environment.

# Background
#### Multi-Agent Deep Deterministic Policy Gradients (MADDPG)
Expanding on [DDPG](https://arxiv.org/pdf/1509.02971.pdf), MADDPG is an algorithm that trains a group of agents to solve a task. The agents can collaborate, compete, or do a mixture of both. While one could run DDPG separately for every agent, the authors of [the MADDPG paper](https://arxiv.org/pdf/1706.02275.pdf) propose several improvements over this approach. Most importantly:
* All agents share the same replay buffer
* While actors can potentially only "see" a part of the environment state (their individual observation), the critics learn from the totality of the state and actions taken by all actors.


# Parameters
#### Actor
The actor network has three layers, mapping an input of size `24` to sizes `512`, `256` and `2` (output). ReLUs are applied after the first and second layers. A tanh function is applied after the last layer to keep the outputs in the range `[-1, 1]`. The actor adds noise to its actions, generated by an Ornstein-Uhlenbeck process with $\mu=0$, $\theta=1$ and $\sigma=0.15$. This noise decays after each episode by a factor of $0.95$.
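The exploration noise described above can be sketched roughly as follows. This is a minimal illustration and not the notebook's actual code: the class name `OUNoise`, the Gaussian increments, the unit time step, and applying the decay as a multiplicative scale on each sample are all assumptions.

```python
import numpy as np


class OUNoise:
    """Discretised Ornstein-Uhlenbeck process with a per-episode decaying scale.

    Hypothetical helper: Gaussian increments and a unit time step are assumed.
    """

    def __init__(self, size, mu=0.0, theta=1.0, sigma=0.15, decay=0.95, seed=0):
        self.mu = mu * np.ones(size)
        self.theta = theta
        self.sigma = sigma
        self.decay = decay
        self.scale = 1.0                        # multiplies every sample (assumed decay mechanism)
        self.rng = np.random.default_rng(seed)
        self.state = self.mu.copy()

    def sample(self):
        # x <- x + theta * (mu - x) + sigma * N(0, 1)
        pull = self.theta * (self.mu - self.state)
        kick = self.sigma * self.rng.standard_normal(self.state.shape)
        self.state = self.state + pull + kick
        return self.scale * self.state

    def end_episode(self):
        """Reset the process to its mean and shrink the noise scale."""
        self.state = self.mu.copy()
        self.scale *= self.decay


noise = OUNoise(size=2)
action = np.array([0.9, -0.3])                  # hypothetical actor output
noisy_action = np.clip(action + noise.sample(), -1.0, 1.0)
```

Under this discretisation, $\theta=1$ pulls the state fully back to $\mu$ at every step, so each sample reduces to independent Gaussian noise with standard deviation $\sigma$.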

#### Critic
The critic network has three layers, mapping an input of size `52` to sizes `512`, `256` and `1` (output). ReLUs are applied after the first and second layers. The input size equals twice the state space size plus twice the action space size ($2 \cdot 24 + 2 \cdot 2 = 52$), since there are two agents and each critic sees the observations and actions of both.
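To make the input size concrete, here is a small sketch of how a critic input of width `52` could be assembled from both agents' observations and actions. The function name `critic_input` and the ordering (all observations first, then all actions) are assumptions; the actual implementation may arrange them differently.

```python
import numpy as np

NUM_AGENTS = 2
STATE_SIZE = 24    # 8 observation values stacked over 3 frames
ACTION_SIZE = 2


def critic_input(states, actions):
    """Flatten per-agent observations and actions into one vector per sample.

    states:  array of shape (batch, NUM_AGENTS, STATE_SIZE)
    actions: array of shape (batch, NUM_AGENTS, ACTION_SIZE)
    """
    batch = states.shape[0]
    flat_states = states.reshape(batch, -1)      # (batch, 48)
    flat_actions = actions.reshape(batch, -1)    # (batch, 4)
    return np.concatenate([flat_states, flat_actions], axis=1)


# Dummy mini-batch of 400 transitions
states = np.zeros((400, NUM_AGENTS, STATE_SIZE))
actions = np.zeros((400, NUM_AGENTS, ACTION_SIZE))
x = critic_input(states, actions)
```

Here `x` has one row per sample and `2 * 24 + 2 * 2 = 52` columns.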

#### Learning
This implementation uses the following learning parameters:
* Replay buffer size $10^6$
* Mini-batch size `400`
* Discount rate $\gamma=0.99$
* Soft update rate $\tau=10^{-3}$
* Actor's learning rate $\eta=10^{-4}$
* Critic's learning rate $\eta=5 \times 10^{-4}$
* Weight decay `= 0`
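As a rough illustration of where $\gamma$ and $\tau$ enter, the sketch below shows the standard DDPG-style TD target and soft target update on plain NumPy arrays. The function names are hypothetical and the parameters are stand-ins for the network weights; this is not taken from the notebook's agent code.

```python
import numpy as np

GAMMA = 0.99   # discount rate
TAU = 1e-3     # soft update rate


def td_target(reward, next_q, done, gamma=GAMMA):
    """y = r + gamma * Q_target(s', a') for non-terminal transitions, else y = r."""
    return reward + gamma * next_q * (1.0 - done)


def soft_update(local_params, target_params, tau=TAU):
    """Move each target parameter a fraction tau towards its local counterpart, in place."""
    for local, target in zip(local_params, target_params):
        target *= (1.0 - tau)
        target += tau * local


# Stand-in "weights": one parameter array per network
local = [np.ones(3)]
target = [np.zeros(3)]
soft_update(local, target)

y = td_target(reward=0.1, next_q=2.0, done=0.0)
```

With $\tau=10^{-3}$ the target parameters move only 0.1% of the way towards the local parameters per update, which keeps the TD targets slowly changing.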

# Results

![Results](media/scores.png)

The MADDPG agents solve the environment after 4375 episodes.

# Improvements

The following are suggestions on how to improve the performance.
* **Rewrite the replay buffer.** The replay buffer is implemented as a Python deque, which appends and pops quickly at both ends, but is [inefficient (O(n))](https://docs.python.org/3/library/collections.html#collections.deque) when accessing elements in the middle, as random sampling does. A faster replay buffer means faster training and more time for experiments.
* **Perform a hyperparameter search.** A better set of hyperparameters almost certainly exists and would likely speed up learning.
* **Try a common critic.** Try training a shared critic instead of a separate critic for each agent.
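The first suggestion could be sketched as a ring buffer backed by a plain list, where every sampled index is an O(1) lookup. The class name `RingReplayBuffer` and its interface are hypothetical and do not mirror the buffer used in this project.

```python
import random


class RingReplayBuffer:
    """Fixed-capacity replay buffer backed by a list, giving O(1) random access."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.storage = []
        self.next_index = 0                  # slot that the next transition is written to
        self.rng = random.Random(seed)

    def add(self, transition):
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
        else:
            self.storage[self.next_index] = transition   # overwrite the oldest entry
        self.next_index = (self.next_index + 1) % self.capacity

    def sample(self, batch_size):
        # Draw distinct indices; each list lookup is O(1)
        indices = self.rng.sample(range(len(self.storage)), batch_size)
        return [self.storage[i] for i in indices]

    def __len__(self):
        return len(self.storage)


# Tiny capacity to show the wrap-around behaviour
buffer = RingReplayBuffer(capacity=5, seed=0)
for step in range(8):
    buffer.add(("state", "action", float(step)))
batch = buffer.sample(3)
```

Once the buffer is full, new transitions overwrite the oldest ones, so memory stays bounded just as with a deque that has a `maxlen`.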

# Agent demonstration
Below we will see the trained agents following a greedy policy in the actual environment.

**Note:** Running the cell under "Initialization" will open a full-screen Unity window. Switch windows or run the whole script at once.

In [1]:
from unityagents import UnityEnvironment
import numpy as np
import time

# Invite our agent & import utils
from ddpg_agent import Agent
#from random import random as rnd
import torch

### Initialization

In [2]:
PATH_TO_ENV = "Tennis_Linux/Tennis.x86"
BRAIN = "TennisBrain"
TRAINING = False

env = UnityEnvironment(file_name=PATH_TO_ENV, no_graphics=TRAINING)

ACTION_SIZE = env.brains[BRAIN].vector_action_space_size
env_info = env.reset(train_mode=TRAINING)[BRAIN]
NUM_AGENTS = len(env_info.agents)
STATE_SIZE = env_info.vector_observations.shape[1]

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: TennisBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 8
        Number of stacked Vector Observation: 3
        Vector Action space type: continuous
        Vector Action space size (per agent): 2
        Vector Action descriptions: , 


In [3]:
def act(env, actions, brain_name=BRAIN) -> tuple:
    """Sends actions to the environment env and observes the results.
    Returns a tuple of rewards, next_states, dones (One per agent)"""
    action_result = env.step(actions)[brain_name] # Act on the environment and observe the result
    return (action_result.rewards,
            action_result.vector_observations, # next states
            action_result.local_done) # True if the episode ended
    
def reset(env, training=TRAINING, brain_name=BRAIN) -> np.ndarray:
    """Syntactic sugar for resetting the unity environment"""
    return env.reset(train_mode=training)[brain_name].vector_observations

def visualize(agents, env):
    """Plays one episode with the given agents in real time and prints the scores."""
    states = reset(env)
    scores = np.zeros(NUM_AGENTS)
    done = False
    while not done:
        actions = np.vstack([agent.decide(np.expand_dims(state, 0), as_tensor=False) 
                             for agent, state in zip(agents, states)]) # Choose actions
        rewards, next_states, dones = act(env, actions)    # Send the action to the environment
        scores += rewards[0]                                # Update the score; adds the first agent's reward to every entry
        states = next_states                             # Roll over the state to next time step
        done = np.any(dones)
        time.sleep(.03)
    print("Scores: {}".format(scores))  

### Run the simulation

In [4]:
# Load the agents
agents = []
state_dict = torch.load('models/20-01-13_21.18-4376-0.5.pth')
for agent_name in ('Alice', 'Bob'):
    agent = Agent(STATE_SIZE, ACTION_SIZE, NUM_AGENTS)
    agent.actor_local.load_state_dict(state_dict[f'{agent_name}_actor_state_dict'])
    agent.critic_local.load_state_dict(state_dict[f'{agent_name}_critic_state_dict'])
    agents.append(agent)

In [5]:
visualize(agents, env)

Scores: [1.19000002 1.19000002]


In [6]:
# Shuts down the Unity environment
env.close()