[//]: # (Image References)

[image1]: https://user-images.githubusercontent.com/10624937/42135623-e770e354-7d12-11e8-998d-29fc74429ca2.gif "Trained Agent"


# Collaboration and Competition Report

---

In this notebook, you will learn how to use the Unity ML-Agents environment for the third project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893) program.

### 1. About the problem

Reinforcement learning (RL) has recently been applied to solve challenging problems, from game
playing to robotics. In industrial applications, RL is emerging as a practical component
in large scale systems such as data center cooling. Most of the successes of RL have been in
single agent domains, where modelling or predicting the behaviour of other actors in the environment
is largely unnecessary.

However, there are a number of important applications that involve interaction between multiple
agents, where emergent behavior and complexity arise from agents co-evolving together. For example,
multi-robot control, the discovery of communication and language, multiplayer games,
and the analysis of social dilemmas all operate in a multi-agent domain. Related problems,
such as variants of hierarchical reinforcement learning, can also be seen as multi-agent systems,
with multiple levels of hierarchy being equivalent to multiple agents. Additionally, multi-agent
self-play has recently been shown to be a useful training paradigm. Successfully scaling RL
to environments with multiple agents is crucial to building artificially intelligent systems that can
productively interact with humans and each other.

For this project, you will work with the [Tennis](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Learning-Environment-Examples.md#tennis) environment.

![Trained Agent][image1]

In this environment, two agents control rackets to bounce a ball over a net. If an agent hits the ball over the net, it receives a reward of +0.1.  If an agent lets a ball hit the ground or hits the ball out of bounds, it receives a reward of -0.01.  Thus, the goal of each agent is to keep the ball in play.

At each time step, every agent observes 8 variables corresponding to the position and velocity of the ball and racket. The environment stacks 3 consecutive observations, so each agent receives its own local observation vector of length 24. Two continuous actions are available, corresponding to movement toward (or away from) the net, and jumping.
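
The length-24 state printed by the notebook later in this report is three stacked 8-value frames. The sketch below splits that vector back into frames. It assumes the frames are ordered oldest first, which is consistent with the printed vector (only the last 8 entries are nonzero right after a reset), but that ordering is an inference, not something the environment states. The helper name `split_frames` is illustrative.

```python
# First observation of agent 0, copied from the notebook output in this report.
state = [0.0] * 16 + [-6.65278625, -1.5, -0.0, 0.0, 6.83172083, 6.0, -0.0, 0.0]

FRAME_SIZE = 8  # variables per frame, from the environment log


def split_frames(obs, frame_size=FRAME_SIZE):
    """Split a flat stacked observation into a list of per-frame lists."""
    return [obs[i:i + frame_size] for i in range(0, len(obs), frame_size)]


frames = split_frames(state)
print(len(frames), "frames of", len(frames[0]), "values")
print("newest frame (assumed to be the last one):", frames[-1])
```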

The task is episodic, and in order to solve the environment, your agents must get an average score of +0.5 (over 100 consecutive episodes, after taking the maximum over both agents). Specifically,

- After each episode, we add up the rewards that each agent received (without discounting), to get a score for each agent. This yields 2 (potentially different) scores. We then take the maximum of these 2 scores.
- This yields a single **score** for each episode.

The environment is considered solved, when the average (over 100 episodes) of those **scores** is at least +0.5.
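
The scoring rule above can be sketched in a few lines of plain Python. The function names `episode_score` and `solved_at` are illustrative and do not come from the project code. The sketch also assumes that a full window of 100 episodes is required before the environment can count as solved.

```python
from collections import deque


def episode_score(rewards_per_step):
    """Sum each agent's undiscounted rewards, then take the max over agents."""
    num_agents = len(rewards_per_step[0])
    totals = [sum(step[a] for step in rewards_per_step) for a in range(num_agents)]
    return max(totals)


def solved_at(episode_scores, window=100, target=0.5):
    """Return the 1-based episode where the rolling mean first reaches target, else None."""
    recent = deque(maxlen=window)
    for i, score in enumerate(episode_scores, start=1):
        recent.append(score)
        if len(recent) == window and sum(recent) / window >= target:
            return i
    return None


# Toy episode: agent 0 hits the ball twice, agent 1 hits once and then misses.
rewards = [(0.1, 0.0), (0.0, 0.1), (0.1, 0.0), (0.0, -0.01)]
print(round(episode_score(rewards), 2))
```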

#### Explore the environment

```python
import torch

torch.cuda.is_available()
```

```
False
```

```python
# Unity ml-agents path
import sys
sys.path.append("../python/")
```

```python
from unityagents import UnityEnvironment
import numpy as np
```

```python
env = UnityEnvironment(file_name="../Tennis_Windows_x86_64/Tennis.exe")
```

```
INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :

Unity brain name: TennisBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 8
        Number of stacked Vector Observation: 3
        Vector Action space type: continuous
        Vector Action space size (per agent): 2
        Vector Action descriptions: , 
```


```python
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]
```

```python
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])
```

```
Number of agents: 2
Size of each action: 2
There are 2 agents. Each observes a state with length: 24
The state for the first agent looks like: [ 0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.         -6.65278625 -1.5
 -0.          0.          6.83172083  6.         -0.          0.        ]
```


### 2. Benchmark

The benchmark is a pair of agents that take random actions. Their average score stays below 0.05, even with good luck.

```python
for i in range(1, 6):                                      # play game for 5 episodes
    env_info = env.reset(train_mode=False)[brain_name]     # reset the environment
    states = env_info.vector_observations                  # get the current state (for each agent)
    scores = np.zeros(num_agents)                          # initialize the score (for each agent)
    t = 0
    while True:
        actions = np.random.randn(num_agents, action_size) # select an action (for each agent)
        actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
        env_info = env.step(actions)[brain_name]           # send all actions to the environment
        next_states = env_info.vector_observations         # get next state (for each agent)
        rewards = env_info.rewards                         # get reward (for each agent)
        dones = env_info.local_done                        # see if episode finished
        scores += env_info.rewards                         # update the score (for each agent)
        states = next_states                               # roll over states to next time step
        t += 1
        if np.any(dones):                                  # exit loop if episode finished
            break
    print('Score (max over agents) from episode {}: {}'.format(i, np.max(scores)))
```

```
Score (max over agents) from episode 1: 0.10000000149011612
Score (max over agents) from episode 2: 0.09000000171363354
Score (max over agents) from episode 3: 0.10000000149011612
Score (max over agents) from episode 4: 0.0
Score (max over agents) from episode 5: 0.0
```


### 3. Solution

#### MADDPG
Multi-agent DDPG (MADDPG) adopts the framework of centralized training with
decentralized execution: the policies may use extra information to ease training, so
long as this information is not used at test time. This is unnatural to do with Q-learning, as the Q
function generally cannot contain different information at training and test time. MADDPG is therefore
a simple extension of actor-critic policy gradient methods in which the critic is augmented with extra
information about the policies of other agents.

![maddpg](./pic/maddpg.png)

The difference between basic DDPG and MADDPG is that the critic network also receives the other agents' inputs, both when it is trained and when it evaluates actions.
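
To make the difference concrete, here is a minimal NumPy sketch of what each network sees in the textbook MADDPG scheme, using the Tennis sizes from this report (2 agents, 24 observation values, 2 action values). The function names are illustrative and the layout is an assumption. It is not the code in `maddpg.py`, where the actor's first layer is also sized for both agents' states.

```python
import numpy as np

NUM_AGENTS, OBS_SIZE, ACTION_SIZE = 2, 24, 2


def actor_input(observations, agent_index):
    """Decentralized execution: an actor only sees its own local observation."""
    return observations[agent_index]


def critic_input(observations, actions):
    """Centralized training: the critic sees every agent's observation and action."""
    return np.concatenate([observations.ravel(), actions.ravel()])


rng = np.random.default_rng(0)
observations = rng.standard_normal((NUM_AGENTS, OBS_SIZE))
actions = np.clip(rng.standard_normal((NUM_AGENTS, ACTION_SIZE)), -1, 1)

print(actor_input(observations, 0).shape)         # (24,)
print(critic_input(observations, actions).shape)  # (52,) = 2 * 24 + 2 * 2
```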


<img src="./pic/psuedo.png"  height="500" width="800" align='left'>


#### MADDPG model

The basic agent is adapted from the DDPG code at https://github.com/udacity/deep-reinforcement-learning/tree/master/ddpg-pendulum, with the following modifications.

```python
class Actor(nn.Module):
    """Actor (Policy) Model."""

    def __init__(self, state_size, action_size, seed, fc1_units=256, fc2_units=128):
        ...
        # the first fully connected layer takes the concatenated states of both agents
        self.fc1 = nn.Linear(state_size*2, fc1_units)
        # batch normalization after the first layer
        self.bn1 = nn.BatchNorm1d(fc1_units)
        ...
```
```python
class Critic(nn.Module):
    """Critic (Value) Model."""

    def __init__(self, state_size, action_size, seed, fcs1_units=256, fc2_units=128):
        ...
        # the first fully connected layer takes the concatenated states of both agents
        self.fcs1 = nn.Linear(state_size*2, fcs1_units)
        # batch normalization after the first layer
        self.bn1 = nn.BatchNorm1d(fcs1_units)
        # the second layer also takes the concatenated actions of both agents
        self.fc2 = nn.Linear(fcs1_units+(action_size*2), fc2_units)
        ...
```


#### Prioritized Replay

Prioritized replay can make learning from experience replay
more efficient. Here's one possible implementation of prioritized replay.

<img src="./pic/priority.png"  height="500" width="800" align='left'>
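
As a rough illustration of proportional prioritization, the sketch below turns TD errors into sampling probabilities and importance-sampling weights. It uses a flat array instead of the sum tree found in typical PER implementations, and the values of `alpha`, `beta`, and `eps` are assumed defaults for illustration, not settings taken from `maddpg_v3.py`.

```python
import numpy as np


def per_probabilities(td_errors, alpha=0.6, eps=0.01):
    """Proportional prioritization: p_i = (|delta_i| + eps) ** alpha, P(i) = p_i / sum(p)."""
    priorities = (np.abs(td_errors) + eps) ** alpha
    return priorities / priorities.sum()


def importance_weights(probabilities, beta=0.4):
    """Importance-sampling weights w_i = (N * P(i)) ** -beta, scaled so that max(w) == 1."""
    n = len(probabilities)
    weights = (n * probabilities) ** (-beta)
    return weights / weights.max()


td_errors = np.array([0.05, 0.5, 2.0, 0.0])
probs = per_probabilities(td_errors)
weights = importance_weights(probs)

rng = np.random.default_rng(0)
# Experiences with a large TD error are drawn more often than the others.
batch = rng.choice(len(td_errors), size=2, p=probs)
print(probs.round(3), weights.round(3), batch)
```

The weights shrink the update for frequently sampled experiences, which compensates for the bias that non-uniform sampling introduces.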


The basic PER code is adapted from https://github.com/rlcode/per, with the following modifications.
```python
class PEReplayBuffer:
    """
    prioritized experience replay memory
    """
    def __init__(self, buffer_size, batch_size, seed):
        ...
        # add namedtuple to organize our experience
        self.experience = namedtuple("Experience", field_names=["state", "action", "reward", "next_state", "done"])
        ...
        # initial priority assigned to newly added experiences
        self.p_max = 1

```
Computing the TD error for every newly added sample is time consuming. Instead, my code assigns each new experience the maximum priority seen so far, which is kept in a local variable (`p_max`) and refreshed whenever priorities are updated.
```python
    def update(self, idx, error):
        ...
        if self.p_max == 1:
            self.p_max = np.max(p)
        else:
            if self.p_max < np.max(p):
                self.p_max = np.max(p)
        ...

```
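
The idea can be illustrated with a toy buffer. This is a simplified sketch, not the class from `maddpg_v3.py`: it stores priorities in a plain list rather than a sum tree, and it keeps `p_max` as a running maximum that never decreases, which differs slightly from the excerpt above. The class name `MaxPriorityBuffer` is hypothetical.

```python
class MaxPriorityBuffer:
    """Toy buffer in which new samples get the largest priority seen so far."""

    def __init__(self):
        self.samples = []
        self.priorities = []
        self.p_max = 1.0  # initial priority for new experiences

    def add(self, sample):
        # No TD error is computed here; the sample inherits the running maximum,
        # so it is likely to be replayed at least once soon after being stored.
        self.samples.append(sample)
        self.priorities.append(self.p_max)

    def update(self, idx, priority):
        # Called after a learning step, once the TD-error-based priority is known.
        self.priorities[idx] = priority
        self.p_max = max(self.p_max, priority)


buffer = MaxPriorityBuffer()
buffer.add("first experience")
buffer.update(0, 2.5)            # learning revealed a large TD error
buffer.add("second experience")  # inherits the new maximum
print(buffer.priorities)
```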

#### Train process

```python
import time
from collections import deque

import matplotlib.pyplot as plt
import numpy as np
import torch
%matplotlib inline

def train(n_episodes=500, max_t=1000, random_seed=1, agent=None, debug=False):

    agents = [Agent(state_size=state_size, action_size=action_size, 
                      random_seed=random_seed, 
                   num_agents=num_agents) for i in range(num_agents)]
       
    scores_window = deque(maxlen=100)  # last 100 scores
    scores_plot = []
    scores_ave = []
    scores_agent = []
   
    for i_episode in range(1, n_episodes + 1):
        
        env_info = env.reset(train_mode=True)[brain_name]
        states = np.reshape(env_info.vector_observations, (1,num_agents*state_size))
        scores = np.zeros(num_agents)
        for agent in agents:
            agent.reset()

        time_start = time.time()
        
        for _ in range(max_t):
            actions = [agent.act(states, True) for agent in agents]
            actions = np.concatenate(actions, axis=0).flatten()
            env_info = env.step(actions)[brain_name]
            next_states = np.reshape(env_info.vector_observations, (1, num_agents*state_size))
            rewards = env_info.rewards  # get the reward
            dones = env_info.local_done  # see if episode has finished
            for i, agent in enumerate(agents):
                agent.step(states, actions, rewards[i], next_states, dones[i], i)

            states = next_states
            scores += rewards
                            
            if np.any(dones):
                break
            
        duration = time.time() - time_start
        
        # save most recent score; the mean over agents is never higher than the max used by the official criterion
        scores_window.append(np.mean(scores))
        scores_plot.append(np.mean(scores))
        scores_ave.append(np.mean(scores_window))
        scores_agent.append(scores)
                
        print('\rEpisode {}({}sec)\t MIN:{:.2f} MAX:{:.2f} MEAN:{:.2f} MEANo100:{:.2f} {}'.format(i_episode, 
                                    round(duration), np.min(scores), 
                                     np.max(scores), np.mean(scores), 
                                     np.mean(scores_window), ' '*10), end='')
        if i_episode % 100 == 0:
            print('\nEpisode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_window)))            
            # plot the scores
            fig, ax = plt.subplots()
            
            plt.plot(np.arange(len(scores_ave)), scores_ave, label='Score Mean 100')
            for i in range(num_agents):
                plt.plot(np.arange(len(np.vstack(scores_agent))), 
                         np.vstack(scores_agent)[:,i], label='Agent {}'.format(i+1))
            plt.plot(np.arange(len(scores_plot)), scores_plot, label='Score Ave')
            plt.xlabel('Episode #')
            ax.legend()
            plt.show()
                        
            for i in range(num_agents):
                torch.save(agents[i].actor_local.state_dict(), 'actor{}_{}.pth'.format(i, i_episode))
                torch.save(agents[i].critic_local.state_dict(), 'critic{}_{}.pth'.format(i, i_episode))

        if np.mean(scores_window) >= 0.5:
            print('\nEnvironment solved in {:d} episodes!\tAverage Score: {:.2f}'.format(i_episode,
                                                                                         np.mean(scores_window)))
            for i in range(num_agents):
                torch.save(agents[i].actor_local.state_dict(), 'actor{}.pth'.format(i))
                torch.save(agents[i].critic_local.state_dict(), 'critic{}.pth'.format(i))
            break
            
    return scores_ave, agents  # return all trained agents rather than the last loop variable
```

### 4. Training and result

I implemented two versions of the solution:

- MADDPG with a uniform replay buffer. Full code: [maddpg.py](maddpg.py).
- MADDPG with prioritized experience replay (PER). Full code: [maddpg_v3.py](maddpg_v3.py).

Both versions use the same hyperparameters.

```python
BUFFER_SIZE = int(1e6)  # replay buffer size
BATCH_SIZE = 512        # minibatch size
LR_ACTOR = 1e-3         # learning rate of the actor
LR_CRITIC = 1e-3        # learning rate of the critic
WEIGHT_DECAY = 0        # L2 weight decay
LEARN_EVERY = 1         # learning timestep interval
LEARN_NUM = 5           # number of learning passes
GAMMA = 0.99            # discount factor
TAU = 8e-3              # for soft update of target parameters
OU_SIGMA = 0.2          # Ornstein-Uhlenbeck noise parameter, volatility
OU_THETA = 0.15         # Ornstein-Uhlenbeck noise parameter, speed of mean reversion
EPS_START = 5.0         # initial value for epsilon in noise decay process in Agent.act()
EPS_DECAY = 6e-4        # decay rate for epsilon in the noise decay process
EPS_FINAL = 0           # final value for epsilon after decay
```
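
Two of these settings are easier to read next to the update rules they control. The sketch below shows a soft target update driven by `TAU` and an Ornstein-Uhlenbeck process driven by `OU_THETA` and `OU_SIGMA`. It is a NumPy illustration under stated assumptions (Gaussian increments, parameters held as plain arrays), not the code in `maddpg.py`.

```python
import numpy as np

TAU = 8e-3
OU_THETA, OU_SIGMA = 0.15, 0.2


def soft_update(local_params, target_params, tau=TAU):
    """theta_target <- tau * theta_local + (1 - tau) * theta_target."""
    return [tau * l + (1.0 - tau) * t for l, t in zip(local_params, target_params)]


class OUNoise:
    """Ornstein-Uhlenbeck process: dx = theta * (mu - x) + sigma * N(0, 1)."""

    def __init__(self, size, mu=0.0, theta=OU_THETA, sigma=OU_SIGMA, seed=0):
        self.mu = mu * np.ones(size)
        self.theta, self.sigma = theta, sigma
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        # Restart the process at its long-run mean.
        self.state = self.mu.copy()

    def sample(self):
        pull = self.theta * (self.mu - self.state)  # mean reversion
        kick = self.sigma * self.rng.standard_normal(self.state.shape)  # volatility
        self.state = self.state + pull + kick
        return self.state


local = [np.ones(3)]
target = soft_update(local, [np.zeros(3)])
print(target[0])  # each entry moves 0.8% of the way toward the local value

noise = OUNoise(size=2)
print(noise.sample())
```

A small `TAU` keeps the target networks changing slowly, which tends to stabilize the critic's bootstrapped targets.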

#### maddpg with replay buffer

The following code loads the model definitions and the training code.


```python
import maddpg
from importlib import reload
reload(maddpg)
from maddpg import *
```

The printed result text:

Episode 100(41sec)	 MIN:0.09 MAX:0.20 MEAN:0.15 MEANo100:0.01            
Episode 100	Average Score: 0.01

Episode 200(18sec)	 MIN:-0.01 MAX:0.10 MEAN:0.05 MEANo100:0.02            
Episode 200	Average Score: 0.02

Episode 300(9sec)	 MIN:-0.01 MAX:0.00 MEAN:-0.00 MEANo100:0.03            
Episode 300	Average Score: 0.03

Episode 400(39sec)	 MIN:0.09 MAX:0.20 MEAN:0.15 MEANo100:0.03             
Episode 400	Average Score: 0.03

Episode 500(18sec)	 MIN:0.00 MAX:0.09 MEAN:0.05 MEANo100:0.04             
Episode 500	Average Score: 0.04

Episode 600(41sec)	 MIN:0.09 MAX:0.20 MEAN:0.15 MEANo100:0.05             
Episode 600	Average Score: 0.05

Episode 700(9sec)	 MIN:-0.01 MAX:0.00 MEAN:-0.00 MEANo100:0.03            
Episode 700	Average Score: 0.03

Episode 800(33sec)	 MIN:0.09 MAX:0.10 MEAN:0.10 MEANo100:0.04             
Episode 800	Average Score: 0.04

Episode 900(9sec)	 MIN:-0.01 MAX:0.00 MEAN:-0.00 MEANo100:0.03            
Episode 900	Average Score: 0.03

Episode 1000(9sec)	 MIN:-0.01 MAX:0.00 MEAN:-0.00 MEANo100:0.04           
Episode 1000	Average Score: 0.04

Episode 1100(18sec)	 MIN:-0.01 MAX:0.10 MEAN:0.05 MEANo100:0.04            
Episode 1100	Average Score: 0.04

Episode 1200(8sec)	 MIN:-0.01 MAX:0.00 MEAN:-0.00 MEANo100:0.04            
Episode 1200	Average Score: 0.04

Episode 1300(32sec)	 MIN:0.09 MAX:0.10 MEAN:0.10 MEANo100:0.06             
Episode 1300	Average Score: 0.06

Episode 1400(19sec)	 MIN:-0.01 MAX:0.10 MEAN:0.05 MEANo100:0.07            
Episode 1400	Average Score: 0.07

Episode 1500(8sec)	 MIN:-0.01 MAX:0.00 MEAN:-0.00 MEANo100:0.08            
Episode 1500	Average Score: 0.08

Episode 1600(53sec)	 MIN:0.19 MAX:0.20 MEAN:0.20 MEANo100:0.09             
Episode 1600	Average Score: 0.09

Episode 1700(110sec)	 MIN:0.39 MAX:0.50 MEAN:0.45 MEANo100:0.14           
Episode 1700	Average Score: 0.14

Episode 1800(88sec)	 MIN:0.29 MAX:0.40 MEAN:0.35 MEANo100:0.25            
Episode 1800	Average Score: 0.25

Episode 1900(440sec)	 MIN:1.89 MAX:1.90 MEAN:1.90 MEANo100:0.39            
Episode 1900	Average Score: 0.39

Episode 2000(89sec)	 MIN:0.29 MAX:0.40 MEAN:0.35 MEANo100:0.31            
Episode 2000	Average Score: 0.31

Episode 2100(39sec)	 MIN:0.10 MAX:0.19 MEAN:0.15 MEANo100:0.35            
Episode 2100	Average Score: 0.35

Episode 2200(96sec)	 MIN:0.29 MAX:0.40 MEAN:0.35 MEANo100:0.38            
Episode 2200	Average Score: 0.38

Episode 2300(51sec)	 MIN:0.19 MAX:0.20 MEAN:0.20 MEANo100:0.35            
Episode 2300	Average Score: 0.35

Episode 2400(142sec)	 MIN:0.49 MAX:0.50 MEAN:0.50 MEANo100:0.27           
Episode 2400	Average Score: 0.27

Episode 2500(19sec)	 MIN:-0.01 MAX:0.10 MEAN:0.05 MEANo100:0.19           
Episode 2500	Average Score: 0.19

Episode 2600(32sec)	 MIN:-0.01 MAX:0.10 MEAN:0.05 MEANo100:0.17           
Episode 2600	Average Score: 0.17

Episode 2700(151sec)	 MIN:0.59 MAX:0.60 MEAN:0.60 MEANo100:0.39           
Episode 2700	Average Score: 0.39

Episode 2755(33sec)	 MIN:0.09 MAX:0.10 MEAN:0.10 MEANo100:0.50            
Environment solved in 2755 episodes!	Average Score: 0.50

Plot of the mean score over the last 100 episodes:

<img src="./pic/100mean_maddpg.png"  height="300" width="400" align='left'>


More details [here](./Tennis-MADDPG.ipynb).

#### Watch two smart agents play

```python
import maddpg
from importlib import reload
reload(maddpg)
from maddpg import *

def play(play_agent, t=10, add_noise=False):
    # play_agent: list of trained agents
    # t: number of episodes to play
    for i in range(t):                                         # play game for t episodes
        env_info = env.reset(train_mode=False)[brain_name]     # reset the environment
        states = np.reshape(env_info.vector_observations, (1,num_agents*state_size))
        scores = np.zeros(num_agents)                          # initialize the score (for each agent)
        step = 0
        while True:
            step += 1
            actions = [a.act(states, add_noise) for a in play_agent]
            actions = np.concatenate(actions, axis=0).flatten()
            env_info = env.step(actions)[brain_name]           # send all actions to the environment
            next_states = np.reshape(env_info.vector_observations, (1, num_agents*state_size))
            rewards = env_info.rewards                         # get reward (for each agent)
            dones = env_info.local_done                        # see if episode finished
            scores += env_info.rewards                         # update the score (for each agent)
            states = next_states                               # roll over states to next time step
            if np.any(dones):                                  # exit loop if episode finished
                break
        print('Score (sum over agents) from episode {}: {:.2f}'.format(i, np.sum(scores)))
```

```python
# read model weights
best_agent_0 = Agent(state_size=state_size, action_size=action_size, random_seed=1, num_agents=num_agents)
best_agent_1 = Agent(state_size=state_size, action_size=action_size, random_seed=1, num_agents=num_agents)
actor_state_dict_0 = torch.load('./best_pth/1st/actor0.pth', map_location='cpu')
actor_state_dict_1 = torch.load('./best_pth/1st/actor1.pth', map_location='cpu')

best_agent_0.actor_local.load_state_dict(actor_state_dict_0)
best_agent_1.actor_local.load_state_dict(actor_state_dict_1)

play([best_agent_0, best_agent_1], t=20)
```

```
Score (sum over agents) from episode 0: 0.29
Score (sum over agents) from episode 1: 0.29
Score (sum over agents) from episode 2: 1.29
Score (sum over agents) from episode 3: 1.39
Score (sum over agents) from episode 4: 0.29
Score (sum over agents) from episode 5: 0.29
Score (sum over agents) from episode 6: 0.79
Score (sum over agents) from episode 7: 0.19
Score (sum over agents) from episode 8: 1.19
Score (sum over agents) from episode 9: 0.19
Score (sum over agents) from episode 10: 2.29
Score (sum over agents) from episode 11: 1.39
Score (sum over agents) from episode 12: 2.09
Score (sum over agents) from episode 13: 0.29
Score (sum over agents) from episode 14: 0.19
Score (sum over agents) from episode 15: 1.09
Score (sum over agents) from episode 16: 1.09
Score (sum over agents) from episode 17: 0.89
Score (sum over agents) from episode 18: 0.19
Score (sum over agents) from episode 19: 0.39
```


#### maddpg with prioritized experience replay

The following code loads the model definitions and the training code.


```python
import maddpg_v3
from importlib import reload
reload(maddpg_v3)
from maddpg_v3 import *
```

The printed result text:

Episode 100(27sec)	 MIN:-0.01 MAX:0.10 MEAN:0.05 MEANo100:0.01           
Episode 100	Average Score: 0.01

Episode 200(61sec)	 MIN:0.09 MAX:0.20 MEAN:0.15 MEANo100:0.05             
Episode 200	Average Score: 0.05

Episode 300(12sec)	 MIN:-0.01 MAX:0.00 MEAN:-0.00 MEANo100:0.04           
Episode 300	Average Score: 0.04

Episode 400(77sec)	 MIN:0.19 MAX:0.20 MEAN:0.20 MEANo100:0.04             
Episode 400	Average Score: 0.04

Episode 500(78sec)	 MIN:0.19 MAX:0.20 MEAN:0.20 MEANo100:0.04             
Episode 500	Average Score: 0.04

Episode 600(27sec)	 MIN:0.00 MAX:0.09 MEAN:0.05 MEANo100:0.04             
Episode 600	Average Score: 0.04

Episode 700(12sec)	 MIN:-0.01 MAX:0.00 MEAN:-0.00 MEANo100:0.05           
Episode 700	Average Score: 0.05

Episode 800(12sec)	 MIN:-0.01 MAX:0.00 MEAN:-0.00 MEANo100:0.04           
Episode 800	Average Score: 0.04

Episode 900(12sec)	 MIN:-0.01 MAX:0.00 MEAN:-0.00 MEANo100:0.05           
Episode 900	Average Score: 0.05

Episode 1000(45sec)	 MIN:0.09 MAX:0.10 MEAN:0.10 MEANo100:0.06            
Episode 1000	Average Score: 0.06

Episode 1100(27sec)	 MIN:-0.01 MAX:0.10 MEAN:0.05 MEANo100:0.05            
Episode 1100	Average Score: 0.05

Episode 1150(79sec)	 MIN:0.19 MAX:0.20 MEAN:0.20 MEANo100:0.07        

...

Because of limited GPU time, I could not run the PER version to completion. The partial results nevertheless suggest that the PER agents learn more efficiently from experience. With the same hyperparameters, the PER agents reach a 100-episode average of 0.05 after 200 episodes, while the replay buffer agents need about 600 episodes. The PER agents reach 0.07 before episode 1200, while the replay buffer agents need 1400 episodes.

More details [here](./Tennis-MADDPG-PER.ipynb).

### 5. Future work

Implement other approaches to the multi-agent problem and compare the algorithms:

- Asynchronous Advantage Actor-Critic (A3C)
- Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO)

### 6. Reference

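- Silver et al., "Deterministic Policy Gradient Algorithms", 2014
- Lillicrap et al., "Continuous Control with Deep Reinforcement Learning", 2016
- Schaul et al., "Prioritized Experience Replay", 2016, https://arxiv.org/abs/1511.05952
- Lowe et al., "Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments", 2017, https://papers.nips.cc/paper/7217-multi-agent-actor-critic-for-mixed-cooperative-competitive-environments.pdf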
- https://spinningup.openai.com/en/latest/algorithms/ddpg.html#background
- https://github.com/rlcode/per
- https://github.com/udacity/deep-reinforcement-learning/tree/master/ddpg-pendulum
- https://github.com/shariqiqbal2810/maddpg-pytorch
- https://towardsdatascience.com/training-two-agents-to-play-tennis-8285ebfaec5f
- https://github.com/xuehy/pytorch-maddpg