# Collaboration and Competition

---

Congratulations on completing the third project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893) program!  In this notebook, you will learn how to control agents in a more challenging environment, where the goal is to train a team of agents to play soccer.  **Note that this exercise is optional!**

### 1. Start the Environment

We begin by importing the necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [1]:
from unityagents import UnityEnvironment
import numpy as np
from collections import deque
from itertools import count
import time
import torch
from Agent import Agent
import matplotlib.pyplot as plt
%matplotlib inline

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Soccer.app"`
- **Windows** (x86): `"path/to/Soccer_Windows_x86/Soccer.exe"`
- **Windows** (x86_64): `"path/to/Soccer_Windows_x86_64/Soccer.exe"`
- **Linux** (x86): `"path/to/Soccer_Linux/Soccer.x86"`
- **Linux** (x86_64): `"path/to/Soccer_Linux/Soccer.x86_64"`
- **Linux** (x86, headless): `"path/to/Soccer_Linux_NoVis/Soccer.x86"`
- **Linux** (x86_64, headless): `"path/to/Soccer_Linux_NoVis/Soccer.x86_64"`

For instance, if you are using a Mac, then you downloaded `Soccer.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Soccer.app")
```
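
If the notebook is shared across machines, the path can be picked programmatically instead of edited by hand. The following is a minimal sketch, not part of the project instructions: `soccer_file_name` and its `root` argument are hypothetical names, and `"path/to"` is the same placeholder used in the list above.

```python
import platform
import sys


def soccer_file_name(system, is_64bit, headless=False, root="path/to"):
    """Map a platform description to one of the Soccer build paths listed above.

    Hypothetical helper: `root` is a placeholder for the download folder.
    """
    if system == "Darwin":
        return f"{root}/Soccer.app"
    if system == "Windows":
        folder = "Soccer_Windows_x86_64" if is_64bit else "Soccer_Windows_x86"
        return f"{root}/{folder}/Soccer.exe"
    if system == "Linux":
        folder = "Soccer_Linux_NoVis" if headless else "Soccer_Linux"
        suffix = "x86_64" if is_64bit else "x86"
        return f"{root}/{folder}/Soccer.{suffix}"
    raise ValueError(f"Unsupported platform: {system}")


# sys.maxsize reflects the bitness of the Python interpreter, not of the OS.
file_name = soccer_file_name(platform.system(), sys.maxsize > 2**32)
print(file_name)
```

The resulting string can then be passed as `file_name` to `UnityEnvironment`.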

In [2]:
env = UnityEnvironment(file_name="Soccer_Windows_x86_64/Soccer.exe", worker_id=1, seed=2)

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 2
        Number of External Brains : 2
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: GoalieBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 112
        Number of stacked Vector Observation: 3
        Vector Action space type: discrete
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 
Unity brain name: StrikerBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 112
        Number of stacked Vector Observation: 3
        Vector Action space type: discrete
        Vector Action space size (per agent): 6
        Vector Action descriptions: , , , , , 


Environments contain **_brains_**, which are responsible for deciding the actions of their associated agents. Here we obtain separate brains for the goalie and striker agents.

In [3]:
# print the brain names
print(env.brain_names)

# set the goalie brain
g_brain_name = env.brain_names[0]
g_brain = env.brains[g_brain_name]

# set the striker brain
s_brain_name = env.brain_names[1]
s_brain = env.brains[s_brain_name]

['GoalieBrain', 'StrikerBrain']


### 2. Examine the State and Action Spaces

Run the code cell below to print some information about the environment.

In [4]:
# reset the environment
env_info = env.reset(train_mode=True)

# number of agents 
num_g_agents = len(env_info[g_brain_name].agents)
print('Number of goalie agents:', num_g_agents)
num_s_agents = len(env_info[s_brain_name].agents)
print('Number of striker agents:', num_s_agents)

# number of actions
g_action_size = g_brain.vector_action_space_size
print('Number of goalie actions:', g_action_size)
s_action_size = s_brain.vector_action_space_size
print('Number of striker actions:', s_action_size)

# examine the state space 
g_states = env_info[g_brain_name].vector_observations
g_state_size = g_states.shape[1]
print('There are {} goalie agents. Each receives a state with length: {}'.format(g_states.shape[0], g_state_size))
s_states = env_info[s_brain_name].vector_observations
s_state_size = s_states.shape[1]
print('There are {} striker agents. Each receives a state with length: {}'.format(s_states.shape[0], s_state_size))

Number of goalie agents: 2
Number of striker agents: 2
Number of goalie actions: 4
Number of striker actions: 6
There are 2 goalie agents. Each receives a state with length: 336
There are 2 striker agents. Each receives a state with length: 336
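
The state length of 336 follows from the brain summary printed when the environment started: each agent observes 112 values, and 3 consecutive observations are stacked, so 112 × 3 = 336. The sketch below checks that arithmetic on a stand-in array and splits the flat vector back into frames. It assumes the stacked frames are laid out contiguously, one after another, which the output above does not confirm; `fake_states` is a placeholder for `vector_observations`.

```python
import numpy as np

OBS_SIZE = 112     # "Vector Observation space size (per agent)"
NUM_STACKED = 3    # "Number of stacked Vector Observation"

# Stand-in for env_info[g_brain_name].vector_observations: 2 agents x 336 values.
fake_states = np.arange(2 * OBS_SIZE * NUM_STACKED, dtype=np.float32).reshape(2, -1)
assert fake_states.shape[1] == OBS_SIZE * NUM_STACKED

# Split each agent's flat vector into NUM_STACKED frames of OBS_SIZE values.
# Assumption: frames are stored contiguously in the flat vector.
frames = fake_states.reshape(fake_states.shape[0], NUM_STACKED, OBS_SIZE)
print(frames.shape)
```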


### 3. It's Your Turn!

#### 1. Instantiate the Agents

In [5]:
g_agent = Agent(state_size=g_state_size, action_size=g_action_size, num_agents=num_g_agents, random_seed=2)
s_agent = Agent(state_size=s_state_size, action_size=s_action_size, num_agents=num_s_agents, random_seed=2)

def save_model():
    print("Model Save...")    
    torch.save(g_agent.actor_local.state_dict(), 'checkpoint_goalie_actor.pth')
    torch.save(g_agent.critic_local.state_dict(), 'checkpoint_goalie_critic.pth')
    
    torch.save(s_agent.actor_local.state_dict(), 'checkpoint_striker_actor.pth')
    torch.save(s_agent.critic_local.state_dict(), 'checkpoint_striker_critic.pth')    

#### 2. Train the Agents with DDPG

In [6]:


def ddpg(n_episodes=2000, max_t=1000, print_every=10, save_every=10):
    g_scores_deque = deque(maxlen=100)
    s_scores_deque = deque(maxlen=100)
    
    scores_global = []

    for i_episode in range(1, n_episodes+1):
        env_info = env.reset(train_mode=True)                 # reset the environment    
        
        g_states = env_info[g_brain_name].vector_observations  # get initial state (goalies)
        s_states = env_info[s_brain_name].vector_observations  # get initial state (strikers)
        
        g_scores = np.zeros(num_g_agents)                      # initialize the score (goalies)
        s_scores = np.zeros(num_s_agents)                      # initialize the score (strikers)
                
        for t in range(max_t):
            # select actions and send to environment
            g_actions = g_agent.act(g_states)
            s_actions = s_agent.act(s_states)

            # the environment expects one entry per brain name
            actions = dict(zip([g_brain_name, s_brain_name], 
                               [g_actions, s_actions]))
            
            env_info = env.step(actions)                       

            # get next states
            g_next_states = env_info[g_brain_name].vector_observations         
            s_next_states = env_info[s_brain_name].vector_observations

            # get reward and update scores
            g_rewards = env_info[g_brain_name].rewards  
            s_rewards = env_info[s_brain_name].rewards
            g_scores += g_rewards
            s_scores += s_rewards

            # check if episode finished; build one done flag per agent
            # (the built-in bool replaces the removed np.bool alias)
            done = np.any(env_info[g_brain_name].local_done)
            g_dones = np.full(num_g_agents, done, dtype=bool)
            s_dones = np.full(num_s_agents, done, dtype=bool)
            
            g_agent.step(g_states, g_actions, g_rewards, g_next_states, g_dones)
            s_agent.step(s_states, s_actions, s_rewards, s_next_states, s_dones)

            # roll over states to next time step
            g_states = g_next_states
            s_states = s_next_states

            # exit loop if episode finished
            if done:                                           
                break
                         
        g_score_mean = np.mean(g_scores)
        s_score_mean = np.mean(s_scores)
        
        g_scores_deque.append(g_score_mean)
        s_scores_deque.append(s_score_mean)
        
        g_score_average = np.mean(g_scores_deque)
        s_score_average = np.mean(s_scores_deque)
        
        scores_global.append([g_score_mean, s_score_mean])
                
        if i_episode % save_every == 0:
            save_model()
            
        print('Scores from episode {}: {} (goalies), {} (strikers)'.format(i_episode, g_scores, s_scores))
        
    return scores_global

scores = ddpg()    

RuntimeError: size mismatch, m1: [1024 x 129], m2: [132 x 64] at c:\programdata\miniconda3\conda-bld\pytorch_1533090623466\work\aten\src\thc\generic/THCTensorMathBlas.cu:249
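
The error above comes from a matrix product inside a network layer: `m1` is a batch of 1024 inputs with 129 features each, while `m2` is a weight matrix that expects 132 input features. The `Agent` module is not shown in this notebook, so the source of the three missing features is not confirmed here. The sketch below only reproduces the shape rule with NumPy stand-ins; `check_matmul_shapes` is a hypothetical helper, not part of PyTorch or of the project code.

```python
import numpy as np


def check_matmul_shapes(m1_shape, m2_shape):
    """Return True when an (n x k) by (k x m) product is well-formed."""
    return m1_shape[1] == m2_shape[0]


batch = np.zeros((1024, 129))   # m1 in the error: 1024 samples, 129 features
weight = np.zeros((132, 64))    # m2 in the error: layer expects 132 features

compatible = check_matmul_shapes(batch.shape, weight.shape)
print("Shapes compatible:", compatible)

try:
    batch @ weight
except ValueError as err:
    # NumPy rejects the product for the same reason PyTorch does.
    print("Product rejected:", err)
```

Printing the shapes of the tensors passed to each layer, just before the failing call, is usually the quickest way to find which input is narrower than the layer expects.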

#### 3. Plot

In [None]:
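scores_arr = np.array(scores)                  # shape: (n_episodes, 2)
episodes = np.arange(1, len(scores_arr)+1)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(episodes, scores_arr[:, 0], label='Goalies')
ax.plot(episodes, scores_arr[:, 1], label='Strikers')
ax.set_ylabel('Score')
ax.set_xlabel('Episode #')
ax.legend()
plt.show()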

#### 4. Watch a Smart Agent

In [None]:
g_agent.actor_local.load_state_dict(torch.load('checkpoint_goalie_actor.pth'))
g_agent.critic_local.load_state_dict(torch.load('checkpoint_goalie_critic.pth'))

s_agent.actor_local.load_state_dict(torch.load('checkpoint_striker_actor.pth'))
s_agent.critic_local.load_state_dict(torch.load('checkpoint_striker_critic.pth'))

env_info = env.reset(train_mode=False)                 # reset the environment   

g_scores = np.zeros(num_g_agents)                      # initialize the score (goalies)
s_scores = np.zeros(num_s_agents)                      # initialize the score (strikers)

g_states = env_info[g_brain_name].vector_observations  # get initial state (goalies)
s_states = env_info[s_brain_name].vector_observations  # get initial state (strikers)

while True:
    # select actions and send to environment
    g_actions = g_agent.act(g_states)
    s_actions = s_agent.act(s_states)    

    actions = dict(zip([g_brain_name, s_brain_name], 
                       [g_actions, s_actions]))    
    
    env_info = env.step(actions)  

    # get next states
    g_next_states = env_info[g_brain_name].vector_observations         
    s_next_states = env_info[s_brain_name].vector_observations

    # get reward and update scores
    g_rewards = env_info[g_brain_name].rewards  
    s_rewards = env_info[s_brain_name].rewards
    g_scores += g_rewards
    s_scores += s_rewards  
    
    # roll over states to next time step
    g_states = g_next_states
    s_states = s_next_states    
    
    # check if episode finished
    done = np.any(env_info[g_brain_name].local_done)      
    
    # exit loop if episode finished
    if done:                                           
        break    
        
print('Scores: {} (goalies), {} (strikers)'.format(g_scores, s_scores)) 