# 3-Collaborative Reinforcement Learning

---

In this notebook, we will train and evaluate our model for the third project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893).

### 1. Start the Environment

We begin by importing some necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [None]:
from unityagents import UnityEnvironment
import numpy as np
import gym
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
env = UnityEnvironment(file_name="./Tennis_Linux/Tennis.x86_64")

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: TennisBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 8
        Number of stacked Vector Observation: 3
        Vector Action space type: continuous
        Vector Action space size (per agent): 2
        Vector Action descriptions: , 


### 2. Define the Agent 
We first define and initialize a multi-agent DDPG agent, which creates two DDPG agent instances (one for each player). The agent is imported here directly from the ma_ddgp_agent.py script. Refer to ma_ddgp_agent.py and ddpg_agent.py to define a new agent or modify the existing one.<br> <img src="files/logging/DDPG_algorithm.svg"> <br> The algorithm shown above is for a single agent. To apply it to a multi-agent system, we write a multi-agent wrapper class **maddpgagent**, which instantiates two separate **ddpg_agent** agents. <br> 

### 2.1 Multi-agent DDPG class 

This class instantiates two separate DDPG agents and provides a few helper methods to act and step in the environment. The idea behind the multi-agent DDPG algorithm is that each agent acts separately in the environment and the outputs of both agents are concatenated. After an action is taken in the environment, the (states, actions, reward, next states, dones) tuple is saved in the replay buffer. Note that this tuple contains the observations and actions of both agents, whereas the reward is specific to each agent. During the learning phase, each agent learns individually from the observations of both agents.
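
As a rough illustration of the joint layout described above, the sketch below uses plain Python lists instead of the NumPy arrays used in this notebook. The helper names `flatten_observations` and `split_joint_action` are hypothetical and do not appear in the project scripts. The per-agent observation size of 24 is taken from the environment summary (8 values with 3 stacked observations).

```python
# Minimal sketch with plain lists; the notebook itself reshapes NumPy arrays.
OBS_DIM = 24      # per-agent observation size (8 values x 3 stacked frames)
ACTION_DIM = 2    # per-agent action size


def flatten_observations(per_agent_obs):
    """Concatenate per-agent observation vectors into one joint vector."""
    joint = []
    for obs in per_agent_obs:
        joint.extend(obs)
    return joint


def split_joint_action(joint_action, agent_index, action_dim=ACTION_DIM):
    """Return the slice of the joint action that belongs to one agent."""
    start = agent_index * action_dim
    return joint_action[start:start + action_dim]


obs = [[0.1] * OBS_DIM, [0.2] * OBS_DIM]      # one observation per player
joint_obs = flatten_observations(obs)          # joint vector of length 48
joint_action = [0.5, -0.3] + [0.1, 0.9]        # agent 0 first, then agent 1

# Each agent stores the same joint data but only its own reward.
rewards = [0.1, 0.0]
transitions = [(joint_obs, joint_action, rewards[i]) for i in range(2)]
```

The slices returned by `split_joint_action` correspond to `actions[:, :2]` and `actions[:, 2:]` in the `learn` method.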

In [3]:
from ddpg_agent import Agent
from ma_ddgp_agent import maddpgagent

brain_name = env.brain_names[0]
brain = env.brains[brain_name]
env_info = env.reset(train_mode=True)[brain_name]
state_dim = env_info.vector_observations.shape[1]
action_dim = brain.vector_action_space_size
# number of agents
num_agents = len(env_info.agents)

class maddpgagent:
    def __init__(self, state_dim, action_dim,num_agents, seed):

        # Initialize each agent
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.num_agents = num_agents
        self.num_unique_agents=1
        self.seed = seed
        self.agents = [Agent(self.state_dim, self.action_dim,self.num_unique_agents, self.seed) for _ in range(num_agents)]


    def reset(self):
        for agent in self.agents:
            agent.reset()

    def ma_act(self,states):
        action_0 = self.agents[0].act(states, True)
        action_1 = self.agents[1].act(states, True)
        actions = np.concatenate((action_0, action_1), axis=0) 
        actions = np.reshape(actions, (1, 4))
        return actions
        # return np.concatenate((action_0, action_1), axis=0).flatten()
        # actions = [agent.act(states) for agent in self.agents ]
        # return np.reshape(actions, (1, self.num_agents*self.action_dim))

    def ma_step(self, states, actions, rewards, next_states, dones):
        for i, agent in enumerate(self.agents):
            agent.step(states, actions, rewards[i], next_states, dones,i)


### 2.2 DDPG Agent 

The DDPG agent implements a **Replay Buffer** to store and retrieve experiences. We sample experiences from this buffer at random, which breaks the correlation between consecutive transitions that could otherwise hurt training; this was highlighted in the original DQN paper. The DDPG agent also implements **OU Noise** and the **Act**, **Step** and **Learn** methods. The **Act** method selects an action from the current deterministic policy and optionally adds exploration noise. The **Step** method stores each agent's experience in the replay buffer and triggers a learning iteration every few timesteps. The **Learn** method computes the critic loss, which trains the critic to predict the **optimal action-value function**, and the actor loss, which trains the actor to predict the **best action**.
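
To make the two update rules concrete, here is a minimal sketch on plain floats rather than PyTorch tensors. The functions `td_target` and `soft_update` are simplified stand-ins written for illustration, and the numeric inputs are made up. `GAMMA` and `TAU` use the values from this notebook's hyperparameters.

```python
GAMMA = 0.99   # discount factor
TAU = 6e-2     # soft-update interpolation rate


def td_target(reward, next_q, done, gamma=GAMMA):
    """Critic target: y = r + gamma * Q_target(s', a') * (1 - done)."""
    return reward + gamma * next_q * (1.0 - float(done))


def soft_update(local_params, target_params, tau=TAU):
    """theta_target <- tau * theta_local + (1 - tau) * theta_target."""
    return [tau * l + (1.0 - tau) * t
            for l, t in zip(local_params, target_params)]


# Illustrative numbers only.
y_running = td_target(reward=0.1, next_q=2.0, done=False)   # bootstraps from next_q
y_terminal = td_target(reward=0.1, next_q=2.0, done=True)   # reward only
new_target = soft_update([1.0, -1.0], [0.0, 0.0])           # moves 6% toward local
```

The critic is trained to minimize the squared difference between its prediction and this target, while the target networks drift slowly toward the local networks through the soft update.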

In [None]:
import random
import torch
import numpy as np
from actor_critic import Actor, Critic
from collections import deque, namedtuple
import torch.optim as optim
import torch.nn.functional as F
import copy


BATCH_SIZE = 128        # minibatch size
BUFFER_SIZE = int(1e6)  # replay buffer size
GAMMA = 0.99            # discount factor
LR_ACTOR = 1e-3         # learning rate of the actor 
LR_CRITIC = 1e-3        # learning rate of the critic
TAU = 6e-2              # for soft update of target parameters
WEIGHT_DECAY = 0        # L2 weight decay
UPDATE_EVERY = 1        # time steps between network updates
N_UPDATES = 1           # number of times training
ADD_NOISE = True

eps_start = 6           # Noise level start
eps_end = 0             # Noise level end
eps_decay = 250         # Decay divisor: eps drops by 1/eps_decay on every learn() call

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


class Agent():
    """Interacts with and learns from the environment."""
    
    def __init__(self, state_size, action_size, num_agents, random_seed):
        """Initialize an Agent object.
        
        Params
        ======
            state_size (int): dimension of each state
            action_size (int): dimension of each action
            num_agents (int): number of agents
            random_seed (int): random seed
        """
        self.state_size = state_size
        self.action_size = action_size
        self.num_agents = num_agents
        self.seed = random.seed(random_seed)
        self.eps = eps_start
        self.t_step = 0

        # Actor Network (w/ Target Network)
        self.actor_local = Actor(state_size, action_size, random_seed).to(device)
        self.actor_target = Actor(state_size, action_size, random_seed).to(device)
        self.actor_optimizer = optim.Adam(self.actor_local.parameters(), lr=LR_ACTOR)

        # Critic Network (w/ Target Network)
        self.critic_local = Critic(state_size, action_size, random_seed).to(device)
        self.critic_target = Critic(state_size, action_size, random_seed).to(device)
        self.critic_optimizer = optim.Adam(self.critic_local.parameters(), lr=LR_CRITIC, weight_decay=WEIGHT_DECAY)
                
        # Noise process
        self.noise = OUNoise((num_agents, action_size), random_seed)

        # Replay memory
        self.memory = ReplayBuffer(action_size, BUFFER_SIZE, BATCH_SIZE, random_seed)
    
    def step(self, state, action, reward, next_state, done, agent_number):
        """Save experience in replay memory, and use random sample from buffer to learn."""
        self.t_step += 1
        # Save experience / reward
        self.memory.add(state, action, reward, next_state, done)
    
        # Learn, if enough samples are available in memory and at interval settings
        if len(self.memory) > BATCH_SIZE:
            if self.t_step % UPDATE_EVERY == 0:
                for _ in range(N_UPDATES):
                    experiences = self.memory.sample()
                    self.learn(experiences, GAMMA, agent_number)

    def act(self, states, add_noise):
        """Returns actions for given state as per current policy."""
        states = torch.from_numpy(states).float().to(device)
        actions = np.zeros((self.num_agents, self.action_size))
        self.actor_local.eval()
        with torch.no_grad():
            for agent_num, state in enumerate(states):
                action = self.actor_local(state).cpu().data.numpy()
                actions[agent_num, :] = action
        self.actor_local.train()
        if add_noise:
            actions += self.eps * self.noise.sample()
        return np.clip(actions, -1, 1)

    def reset(self):
        self.noise.reset()

    def learn(self, experiences, gamma, agent_number):
        """Update policy and value parameters using given batch of experience tuples.
        Q_targets = r + γ * critic_target(next_state, actor_target(next_state))
        where:
            actor_target(state) -> action
            critic_target(state, action) -> Q-value
        Params
        ======
            experiences (Tuple[torch.Tensor]): tuple of (s, a, r, s', done) tuples 
            gamma (float): discount factor
        """
        states, actions, rewards, next_states, dones = experiences

        # ---------------------------- update critic ---------------------------- #
        # Get predicted next-state actions and Q values from target models
        actions_next = self.actor_target(next_states)
        
        if agent_number == 0:
            actions_next = torch.cat((actions_next, actions[:,2:]), dim=1)
        else:
            actions_next = torch.cat((actions[:,:2], actions_next), dim=1)
            
        Q_targets_next = self.critic_target(next_states, actions_next)
        # Compute Q targets for current states (y_i)
        Q_targets = rewards + (gamma * Q_targets_next * (1 - dones))
        # Compute critic loss
        Q_expected = self.critic_local(states, actions)
        critic_loss = F.mse_loss(Q_expected, Q_targets)
        # Minimize the loss
        self.critic_optimizer.zero_grad()
        critic_loss.backward()
        self.critic_optimizer.step()

        # ---------------------------- update actor ---------------------------- #
        # Compute actor loss
        actions_pred = self.actor_local(states)
        
        if agent_number == 0:
            actions_pred = torch.cat((actions_pred, actions[:,2:]), dim=1)
        else:
            actions_pred = torch.cat((actions[:,:2], actions_pred), dim=1)

        actor_loss = -self.critic_local(states, actions_pred).mean()
        # Minimize the loss
        self.actor_optimizer.zero_grad()
        actor_loss.backward()
        self.actor_optimizer.step()

        # ----------------------- update target networks ----------------------- #
        self.soft_update(self.critic_local, self.critic_target, TAU)
        self.soft_update(self.actor_local, self.actor_target, TAU)                     

        # Update epsilon noise value
        self.eps = self.eps - (1/eps_decay)
        if self.eps < eps_end:
            self.eps=eps_end
                  
    def soft_update(self, local_model, target_model, tau):
        """Soft update model parameters.
        θ_target = τ*θ_local + (1 - τ)*θ_target
        Params
        ======
            local_model: PyTorch model (weights will be copied from)
            target_model: PyTorch model (weights will be copied to)
            tau (float): interpolation parameter 
        """
        for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
            target_param.data.copy_(tau*local_param.data + (1.0-tau)*target_param.data)


class OUNoise:
    """Ornstein-Uhlenbeck process."""

    def __init__(self, size, seed, mu=0.0, theta=0.13, sigma=0.2):
        """Initialize parameters and noise process."""
        self.mu = mu * np.ones(size)
        self.theta = theta
        self.sigma = sigma
        self.seed = random.seed(seed)
        self.size = size
        self.reset()

    def reset(self):
        """Reset the internal state (= noise) to mean (mu)."""
        self.state = copy.copy(self.mu)

    def sample(self):
        """Update internal state and return it as a noise sample."""
        x = self.state
        dx = self.theta * (self.mu - x) + self.sigma * np.random.standard_normal(self.size)
        self.state = x + dx
        return self.state

class ReplayBuffer:
    """Fixed-size buffer to store experience tuples."""

    def __init__(self, action_size, buffer_size, batch_size, seed):
        """Initialize a ReplayBuffer object.
        Params
        ======
            buffer_size (int): maximum size of buffer
            batch_size (int): size of each training batch
        """
        self.action_size = action_size
        self.memory = deque(maxlen=buffer_size)  # internal memory (deque)
        self.batch_size = batch_size
        self.experience = namedtuple("Experience", field_names=["state", "action", "reward", "next_state", "done"])
        self.seed = random.seed(seed)
    
    def add(self, state, action, reward, next_state, done):
        """Add a new experience to memory."""
        e = self.experience(state, action, reward, next_state, done)
        self.memory.append(e)
    
    def sample(self):
        """Randomly sample a batch of experiences from memory."""
        experiences = random.sample(self.memory, k=self.batch_size)

        states = torch.from_numpy(np.vstack([e.state for e in experiences if e is not None])).float().to(device)
        actions = torch.from_numpy(np.vstack([e.action for e in experiences if e is not None])).float().to(device)
        rewards = torch.from_numpy(np.vstack([e.reward for e in experiences if e is not None])).float().to(device)
        next_states = torch.from_numpy(np.vstack([e.next_state for e in experiences if e is not None])).float().to(device)
        dones = torch.from_numpy(np.vstack([e.done for e in experiences if e is not None]).astype(np.uint8)).float().to(device)

        return (states, actions, rewards, next_states, dones)

    def __len__(self):
        """Return the current size of internal memory."""
        return len(self.memory)

agent = maddpgagent(state_dim=state_dim, action_dim=action_dim,num_agents = num_agents, seed=0)

### 2.3 Policy Architecture 
The policy architecture consists of separate actor and critic networks. The **actor** is a multi-layer neural network that outputs an **action** given a **state**. The **critic** predicts the **optimal action-value function** given a **(state, action)** pair.
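
As a quick check of the layer sizes, the sketch below lists the `(in_features, out_features)` pair of each linear layer, following the `Actor` and `Critic` constructors in this section. The helper functions are hypothetical and exist only for this illustration. It assumes `state_dim = 24` and `action_dim = 2`, as reported by the Tennis environment.

```python
def actor_layer_shapes(state_dim, action_dim, fc1_dim=256, fc2_dim=128):
    """(in, out) sizes of the actor's three linear layers."""
    return [
        (state_dim * 2, fc1_dim),   # joint observation of both agents
        (fc1_dim, fc2_dim),
        (fc2_dim, action_dim),      # one agent's action, squashed by tanh
    ]


def critic_layer_shapes(state_dim, action_dim, fc1_dim=256, fc2_dim=128):
    """(in, out) sizes of the critic's three linear layers."""
    return [
        (state_dim * 2, fc1_dim),                # joint observation
        (fc1_dim + action_dim * 2, fc2_dim),     # joint action joins after layer 1
        (fc2_dim, 1),                            # scalar Q-value
    ]


actor_shapes = actor_layer_shapes(state_dim=24, action_dim=2)
critic_shapes = critic_layer_shapes(state_dim=24, action_dim=2)
```

Note that the actor sees the observations of both agents (48 inputs) but outputs only its own 2-dimensional action, whereas the critic receives the 4-dimensional joint action.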

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

def hidden_init(layer):
    d = layer.weight.data.size()[0]
    lim = 1./np.sqrt(d)
    return (-lim, lim)

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, seed, fc1_dim=256, fc2_dim=128):
        super(Actor, self).__init__()
        self.seed = torch.manual_seed(seed)

        self.fc1 = nn.Linear(state_dim*2, fc1_dim)
        self.fc2 = nn.Linear(fc1_dim, fc2_dim)
        self.fc3 = nn.Linear(fc2_dim, action_dim)
        self.reset_parameters()
    
    def reset_parameters(self):
        self.fc1.weight.data.uniform_(*hidden_init(self.fc1))
        self.fc2.weight.data.uniform_(*hidden_init(self.fc2))
        self.fc3.weight.data.uniform_(-3e-3,3e-3)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        x = torch.tanh(self.fc3(x))
        return x

class Critic(nn.Module):
    def __init__(self, state_dim, action_dim, seed, c_fc1_dim=256, c_fc2_dim=128):
        super(Critic, self).__init__()
        self.seed = torch.manual_seed(seed)

        self.c_fc1 = nn.Linear(state_dim*2, c_fc1_dim)
        self.c_fc2 = nn.Linear(c_fc1_dim+(action_dim*2), c_fc2_dim)
        self.c_fc3 = nn.Linear(c_fc2_dim, 1)
        self.reset_parameters()

    def reset_parameters(self):
        self.c_fc1.weight.data.uniform_(*hidden_init(self.c_fc1))
        self.c_fc2.weight.data.uniform_(*hidden_init(self.c_fc2))
        self.c_fc3.weight.data.uniform_(-3e-3,3e-3)

    def forward(self, state, action):
        xs = F.relu(self.c_fc1(state))
        x = torch.cat((xs,action), dim=1)
        x = F.relu(self.c_fc2(x))
        x = self.c_fc3(x)
        return x

### 2.4 Hyperparameters

In [1]:
BATCH_SIZE = 128        # minibatch size
BUFFER_SIZE = int(1e6)  # replay buffer size
GAMMA = 0.99            # discount factor
LR_ACTOR = 1e-3         # learning rate of the actor 
LR_CRITIC = 1e-3        # learning rate of the critic
TAU = 6e-2              # for soft update of target parameters
WEIGHT_DECAY = 0        # L2 weight decay
UPDATE_EVERY = 1        # time steps between network updates
N_UPDATES = 1           # number of times training
ADD_NOISE = True

eps_start = 6           # Noise level start
eps_end = 0             # Noise level end
eps_decay = 250         # Decay divisor: eps drops by 1/eps_decay on every learn() call
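
The noise schedule implied by these values can be sketched as follows. In `Agent.learn`, `eps` is reduced by `1/eps_decay` on every call and clamped at `eps_end`, so with the settings above the noise scale should reach zero after roughly `eps_start * eps_decay = 1500` learning updates. The function `eps_after` is a hypothetical helper written only for this illustration.

```python
EPS_START = 6.0   # initial noise scale
EPS_END = 0.0     # final noise scale
EPS_DECAY = 250   # eps drops by 1/EPS_DECAY per learning update


def eps_after(n_learn_calls, eps=EPS_START, eps_end=EPS_END, eps_decay=EPS_DECAY):
    """Noise scale after a given number of learn() calls."""
    for _ in range(n_learn_calls):
        eps = max(eps - 1.0 / eps_decay, eps_end)   # linear decay with a floor
    return eps


eps_initial = eps_after(0)       # no updates yet
eps_early = eps_after(250)       # one unit lower than the start
eps_late = eps_after(2000)       # clamped at EPS_END
```

Because the decay is tied to learning updates rather than episodes, exploration noise fades early in training when `UPDATE_EVERY = 1`.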

### 3. Train the agent 
Run the code below to train the agent from scratch.
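
The training loop scores each episode by the maximum return over the two agents and tracks a moving average over the last 100 episodes, with 0.5 as the benchmark. A minimal sketch of that check follows, assuming a hypothetical helper `first_solved_episode` and made-up scores.

```python
from collections import deque


def first_solved_episode(episode_scores, window=100, threshold=0.5):
    """Return the 1-based episode where the moving average first reaches
    the threshold, or None if it never does."""
    recent = deque(maxlen=window)          # keeps only the last `window` scores
    for i, score in enumerate(episode_scores, start=1):
        recent.append(score)
        if sum(recent) / len(recent) >= threshold:
            return i
    return None


# Illustrative data: ten poor episodes followed by ten good ones.
toy_scores = [0.0] * 10 + [1.0] * 10
solved_at = first_solved_episode(toy_scores, window=10)
never_solved = first_solved_episode([0.0] * 5, window=10)
```

As in the notebook, the average is taken over however many scores the window currently holds, so it is also defined before the window is full.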


In [4]:
from collections import deque
import torch
import numpy as np
import math
import time


def interact(env,state_dim, brain_name, agent, num_agents,max_t=1500, num_episodes=1500, window=100):
    scores = []
    PRINT_EVERY=10
    rolling_avg=[]
    best_score=0
    scores_window = deque(maxlen=window)
    for i_episode in range(1, num_episodes+1):
        # Reset env and get current state
        env_info = env.reset(train_mode=True)[brain_name]
        states = np.reshape(env_info.vector_observations, (1,48)) # flatten states
        score = np.zeros(num_agents)
        agent.reset()
        while True:
            actions = agent.ma_act(states)
            env_info = env.step(actions)[brain_name]
            next_states = np.reshape(env_info.vector_observations, (1,48)) # flatten states
            rewards = env_info.rewards
            dones = env_info.local_done
            agent.ma_step(states, actions, rewards, next_states, dones)
            states = next_states
            score += rewards
            if any(dones):
                break
        scores.append(np.max(score))
        scores_window.append(np.max(score))
        rolling_avg.append(np.mean(scores_window))
        # print results
        if i_episode % PRINT_EVERY == 0:
            print('Episodes {:0>4d}\tMax Reward: {:.3f}\tMoving Average: {:.3f}'.format(i_episode, np.max(score), np.mean(scores_window)))
        # print('\rEpisode {}\tMax Reward: {:.2f}\tAverage Score: {:.2f}'.format(i_episode, np.max(scores_all[-PRINT_EVERY:]),np.mean(scores_window)))
        if np.mean(scores_window)>=0.5:
            torch.save(agent.agents[0].actor_local.state_dict(), './logging/checkpoint_actor_0.pth')
            torch.save(agent.agents[0].critic_local.state_dict(), './logging/checkpoint_critic_0.pth')
            torch.save(agent.agents[1].actor_local.state_dict(), './logging/checkpoint_actor_1.pth')
            torch.save(agent.agents[1].critic_local.state_dict(), './logging/checkpoint_critic_1.pth')
            scores_filename = "./logging/ma_ddpg_agent_score_" +str(i_episode) + ".csv"
            rolling_avg_filename = "./logging/ma_ddpg_agent_rolling_avg_" +str(i_episode) + ".csv"
            np.savetxt(scores_filename, scores, delimiter=",")
            np.savetxt(rolling_avg_filename, rolling_avg, delimiter=",")
        if np.mean(scores_window)>=best_score:
            torch.save(agent.agents[0].actor_local.state_dict(), './logging/best_checkpoint_actor_0.pth')
            torch.save(agent.agents[0].critic_local.state_dict(), './logging/best_checkpoint_critic_0.pth')
            torch.save(agent.agents[1].actor_local.state_dict(), './logging/best_checkpoint_actor_1.pth')
            torch.save(agent.agents[1].critic_local.state_dict(), './logging/best_checkpoint_critic_1.pth')
            best_score = np.mean(scores_window)
        if i_episode % window == 0:
            print('\rEpisode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_window)))
    return scores

scores = interact(env, state_dim, brain_name, agent,num_agents)

Episodes 0010	Max Reward: 0.000	Moving Average: 0.000
Episodes 0020	Max Reward: 0.000	Moving Average: 0.005
Episodes 0030	Max Reward: 0.000	Moving Average: 0.003
Episodes 0040	Max Reward: 0.000	Moving Average: 0.003
Episodes 0050	Max Reward: 0.000	Moving Average: 0.002
Episodes 0060	Max Reward: 0.000	Moving Average: 0.002
Episodes 0070	Max Reward: 0.000	Moving Average: 0.001
Episodes 0080	Max Reward: 0.000	Moving Average: 0.001
Episodes 0090	Max Reward: 0.000	Moving Average: 0.001
Episodes 0100	Max Reward: 0.000	Moving Average: 0.001
Episode 100	Average Score: 0.00
Episodes 0110	Max Reward: 0.000	Moving Average: 0.003
Episodes 0120	Max Reward: 0.000	Moving Average: 0.002
Episodes 0130	Max Reward: 0.000	Moving Average: 0.002
Episodes 0140	Max Reward: 0.000	Moving Average: 0.005
Episodes 0150	Max Reward: 0.090	Moving Average: 0.008
Episodes 0160	Max Reward: 0.000	Moving Average: 0.010
Episodes 0170	Max Reward: 0.000	Moving Average: 0.013
Episodes 0180	Max Reward: 0.000	Moving Average: 0.

Episodes 1450	Max Reward: 1.500	Moving Average: 0.664
Episodes 1460	Max Reward: 0.300	Moving Average: 0.653
Episodes 1470	Max Reward: 1.200	Moving Average: 0.591
Episodes 1480	Max Reward: 0.500	Moving Average: 0.592
Episodes 1490	Max Reward: 0.100	Moving Average: 0.519
Episodes 1500	Max Reward: 0.100	Moving Average: 0.530
Episode 1500	Average Score: 0.53


In [None]:
from plot import plot_results

plot_results(benchmark_score=0.5)

<br> <img src="files/logging/plot.jpg"> <br>

### 3. Watch a smart agent
Run the code below to watch a smart agent acting inside the environment.


In [None]:
env = UnityEnvironment(file_name="./Tennis_Linux/Tennis.x86_64")
# reset env and extract state_dim and action_dim
brain_name = env.brain_names[0]
brain = env.brains[brain_name]
env_info = env.reset(train_mode=False)[brain_name]    # reset the environment  
state_dim = env_info.vector_observations.shape[1]
action_dim = brain.vector_action_space_size
# number of agents
num_agents = len(env_info.agents)
agent = maddpgagent(state_dim=state_dim, action_dim=action_dim,num_agents = num_agents, seed=0)

#Watch a smart agent 
# Load the saved training parameters

agent_0 = Agent(state_dim, action_dim, 1, random_seed=0)
agent_1 = Agent(state_dim, action_dim, 1, random_seed=0)
agent_0.actor_local.load_state_dict(torch.load('logging/checkpoint_actor_0.pth', map_location='cpu'))
agent_0.critic_local.load_state_dict(torch.load('logging/checkpoint_critic_0.pth', map_location='cpu'))
agent_1.actor_local.load_state_dict(torch.load('logging/checkpoint_actor_1.pth', map_location='cpu'))
agent_1.critic_local.load_state_dict(torch.load('logging/checkpoint_critic_1.pth', map_location='cpu'))
  
states = env_info.vector_observations                  # get the current state (for each agent)
states = np.reshape(states, (1,48))
score = np.zeros(num_agents)                          # initialize the score (for each agent)
while True:
    # actions = agent.ma_act(states)
    action_0 = agent_0.act(states, add_noise=False)         
    action_1 = agent_1.act(states, add_noise=False)        
    actions = np.concatenate((action_0, action_1), axis=0) 
    actions = np.reshape(actions, (1, 4))
    env_info = env.step(actions)[brain_name]
    next_states = np.reshape(env_info.vector_observations, (1,48)) # flatten states
    rewards = env_info.rewards
    dones = env_info.local_done
    states = next_states
    score += rewards
    if any(dones):
        break

In [None]:
env.close()

### 4. Future Improvements
[Solve the crawler environment using DDPG](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Learning-Environment-Examples.md#crawler)