# Multi-Agent Deep Deterministic Policy Gradient for Tennis

---

In this notebook, we implement the [Multi-Agent Deep Deterministic Policy Gradient (MADDPG)](https://arxiv.org/abs/1706.02275) reinforcement learning algorithm for a simulated tennis game. This is the third project of the [Udacity Deep Reinforcement Learning Nanodegree program](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893).

### Setup

We begin by importing the necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md).

In [1]:
import copy
import torch
import random
import numpy as np
from collections import deque
import matplotlib.pyplot as plt
from unityagents import UnityEnvironment

### Create and Examine The Tennis Environment

In this environment, two agents control rackets to bounce a ball over a net. If an agent hits the ball over the net, it receives a reward of +0.1. If an agent lets the ball hit the ground or hits the ball out of bounds, it receives a reward of -0.01. Thus, the goal of each agent is to keep the ball in play.
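
The training loop later in this notebook scores each episode by the larger of the two agents' summed rewards and tracks a running mean over the last 100 episodes. A minimal sketch of that bookkeeping follows, using made-up per-step rewards; the `episode_score` helper is hypothetical and not part of the notebook.

```python
from collections import deque

def episode_score(step_rewards):
    """Return the larger of the agents' summed rewards for one episode.

    step_rewards is a list of per-step reward lists, one entry per agent.
    """
    num_agents = len(step_rewards[0])
    totals = [0.0] * num_agents
    for rewards in step_rewards:
        for i, r in enumerate(rewards):
            totals[i] += r
    return max(totals)

# Illustrative rewards: agent 0 returns the ball once, then agent 1 lets it drop.
episode = [[0.0, 0.0], [0.1, 0.0], [0.0, -0.01]]

scores_window = deque(maxlen=100)   # same window length as the training loop
scores_window.append(episode_score(episode))
running_mean = sum(scores_window) / len(scores_window)
print(round(running_mean, 3))
```

Taking the maximum means one agent's penalties do not lower the episode score as long as the other agent did better.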

Each observation consists of 8 variables corresponding to the position and velocity of the ball and racket. The environment stacks 3 consecutive observations, so each agent receives a state vector of length 24. Two continuous actions are available, corresponding to movement toward (or away from) the net, and jumping.
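
The environment printout below reports 8 variables per observation and 3 stacked observations, which is why each agent's state has length 24. Here is a minimal sketch of splitting such a state back into frames, using the first agent's state from that printout. The assumption that the newest frame comes last is mine, suggested by the leading zeros at reset, and `split_frames` is a hypothetical helper.

```python
OBS_SIZE = 8      # variables per single observation
NUM_STACKED = 3   # observations stacked into one state

def split_frames(state, obs_size=OBS_SIZE):
    """Split a flat stacked state into a list of per-frame observations."""
    return [state[i:i + obs_size] for i in range(0, len(state), obs_size)]

# First agent's initial state as printed by the environment cell.
state = [0.0] * 16 + [-6.65278625, -1.5, -0.0, 0.0, 6.83172083, 6.0, -0.0, 0.0]

frames = split_frames(state)
print(len(frames), "frames of length", len(frames[0]))
print("last frame:", frames[-1])
```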

Run the code cell below to print some information about the environment.

In [2]:
env = UnityEnvironment(file_name="/home/sebastian/udacity_drl/drlnd-multiagent-project/Tennis_Linux/Tennis.x86_64")

# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents 
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: TennisBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 8
        Number of stacked Vector Observation: 3
        Vector Action space type: continuous
        Vector Action space size (per agent): 2
        Vector Action descriptions: , 
Number of agents: 2
Size of each action: 2
There are 2 agents. Each observes a state with length: 24
The state for the first agent looks like: [ 0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.         -6.65278625 -1.5
 -0.          0.          6.83172083  6.         -0.          0.        ]


### Create and Train a MADDPG agent
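
The cell below sets `TAU`, the soft-update weight for the target networks. The `ddpg` module itself is not shown in this notebook, so the following is only a sketch of the standard DDPG soft update, `target = tau * local + (1 - tau) * target`. It is written over plain lists of floats rather than network parameters, and `soft_update` is a hypothetical name.

```python
TAU = 2.5e-3  # same value as in the training cell

def soft_update(target_params, local_params, tau=TAU):
    """Blend local parameters into the target parameters with weight tau."""
    return [tau * local + (1.0 - tau) * target
            for target, local in zip(target_params, local_params)]

target = [0.0, 1.0]
local = [1.0, 1.0]

# One update moves the target only 0.25% of the way toward the local values.
target = soft_update(target, local)
print(target)
```

With a small `tau`, the target networks change slowly, which tends to keep the critic's learning targets stable between updates.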

In [3]:
from ddpg import DDPGAgent
from utils import ReplayBuffer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# DDPG parameters
GAMMA = 0.99    # Discount factor
TAU = 2.5e-3    # Soft update weight for target networks

# Training parameters
NUM_EPISODES = 2000         # Total number of episodes
BATCH_SIZE = 128            # Batch size for training
UPDATE_EVERY = 5            # Number of time steps between training steps
NUM_UPDATES = 10            # Number of iterations per training step
REPLAY_BUF_SIZE = int(5e5)  # Replay buffer size
LR_ACTOR = 2e-4             # Learning rate for actor network
LR_CRITIC = 1e-3            # Learning rate for critic network
WEIGHT_DECAY_ACTOR = 1e-6   # Regularization for actor network weights
WEIGHT_DECAY_CRITIC = 1e-5  # Regularization for critic network weights
CLIP_GRAD = 1.0             # Clip value for gradients during training

# OU Noise decay parameters
INIT_NOISE = 0.5
INIT_SIGMA = 0.25
MIN_NOISE = 0.05
MIN_SIGMA = 0.05
NOISE_DECAY = 0.999

# Reproducibility by fixing random seed
SEED = 4321
torch.manual_seed(SEED)
np.random.seed(SEED)
random.seed(SEED)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# Create agents and replay buffer
agents = [DDPGAgent(num_agents=num_agents, gamma=GAMMA, tau=TAU,
                    lr_actor=LR_ACTOR, lr_critic=LR_CRITIC, 
                    weight_decay_actor=WEIGHT_DECAY_ACTOR, 
                    weight_decay_critic=WEIGHT_DECAY_CRITIC) 
          for _ in range(num_agents)]
buffer = ReplayBuffer(size=REPLAY_BUF_SIZE)

# Training Loop
noise = INIT_NOISE
sigma = INIT_SIGMA
best_score = 0
scores_vec = []
scores_avg_vec = []
score_window_size = 100 
scores_window = deque(maxlen=score_window_size)
idx = 0
for ep in range(NUM_EPISODES):

    # Reset the environment
    env_info = env.reset(train_mode=True)[brain_name]
    states = env_info.vector_observations
    scores = np.zeros(num_agents)

    # Run an episode
    ep_done = False
    while not ep_done:

        # Choose actions from the agents
        with torch.no_grad():
            actions = np.zeros((num_agents, action_size)) # select an action (for each agent)
            for aidx in range(num_agents):
                state_tensor = torch.tensor(states[aidx], dtype=torch.float)
                act_tensor = agents[aidx].act(state_tensor.to(device), noise=noise)
                actions[aidx, :] = act_tensor.cpu().detach().numpy()

        # Step the environment and collect data
        env_info = env.step(actions)[brain_name]           # send all actions to the environment
        next_states = env_info.vector_observations         # get next state (for each agent)
        rewards = env_info.rewards                         # get reward (for each agent)
        dones = env_info.local_done                        # see if episode finished
        scores += env_info.rewards                         # update the score (for each agent)
        
        # Push to the experience buffer
        for aidx in range(num_agents):
            experience = (states[aidx], actions[aidx], [rewards[aidx]], 
                        next_states[aidx], [dones[aidx]], 
                        states, next_states, actions)
            buffer.push(experience)

        # Update the agents once the buffer is sufficiently full
        if len(buffer) > BATCH_SIZE and idx % UPDATE_EVERY == 0:
            for _ in range(NUM_UPDATES):
                for agent in agents:
                    samples = buffer.sample(BATCH_SIZE)
                    agent.train(samples)

        # Update the time step index
        idx += 1

        # Roll over next state and terminate
        states = next_states
        if np.any(dones):
            ep_done = True

    # Reset and decay the OU noise
    noise = max(MIN_NOISE, noise*NOISE_DECAY)
    sigma = max(MIN_SIGMA, sigma*NOISE_DECAY)
    for agent in agents:
        agent.noise.reset()
        agent.noise.sigma = sigma
    
    # Update the average scores
    max_score = max(scores)
    scores_vec.append(max_score)
    scores_window.append(max_score)
    if (ep+1) % score_window_size == 0 or ep == 0 or ep == NUM_EPISODES-1:
        mean_max_score = np.mean(scores_window)
        scores_avg_vec.append(mean_max_score)
        print("\rEpisode {}/{} -- Buffer Size: {} -- Average Max Score: {:.3f}".format(
              ep+1, NUM_EPISODES, len(buffer), mean_max_score))

        # Update the best model and record the score it achieved
        if mean_max_score > best_score:
            best_score = mean_max_score
            best_idx = np.argmax(scores)
            best_model = copy.deepcopy(agents[best_idx].actor)

# Plot the final scores
plt.plot(np.arange(len(scores_vec)), scores_vec)
plt.plot(np.arange(len(scores_avg_vec))*score_window_size, scores_avg_vec)
plt.xlabel("Episode Number")
plt.ylabel("Total Return (Maximum of Both Agents)")
plt.legend(["Raw", "Average over {} episodes".format(score_window_size)])
plt.show()

# Save the trained agent to file
torch.save(best_model.state_dict(), "trained_actor_weights.pth")

Episode 1/2000 -- Buffer Size: 30 -- Average Max Score: 0.000
Episode 100/2000 -- Buffer Size: 2898 -- Average Max Score: 0.000
Episode 200/2000 -- Buffer Size: 5740 -- Average Max Score: 0.000
Episode 300/2000 -- Buffer Size: 8754 -- Average Max Score: 0.005
Episode 400/2000 -- Buffer Size: 11594 -- Average Max Score: 0.000
Episode 500/2000 -- Buffer Size: 14434 -- Average Max Score: 0.000
Episode 600/2000 -- Buffer Size: 17274 -- Average Max Score: 0.000
Episode 700/2000 -- Buffer Size: 20114 -- Average Max Score: 0.000
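
The training cell multiplies `noise` and `sigma` by `NOISE_DECAY` after every episode and clamps them at their minimums. The sketch below replays that schedule on its own to count how many episodes each value needs to reach its floor; `episodes_to_floor` is a hypothetical helper.

```python
INIT_NOISE, MIN_NOISE = 0.5, 0.05
INIT_SIGMA, MIN_SIGMA = 0.25, 0.05
NOISE_DECAY = 0.999
NUM_EPISODES = 2000

def episodes_to_floor(value, floor, decay):
    """Count per-episode decay steps until value is clamped at floor."""
    steps = 0
    while value > floor:
        value = max(floor, value * decay)
        steps += 1
    return steps

noise_steps = episodes_to_floor(INIT_NOISE, MIN_NOISE, NOISE_DECAY)
sigma_steps = episodes_to_floor(INIT_SIGMA, MIN_SIGMA, NOISE_DECAY)
print("noise reaches its floor after", noise_steps, "episodes")
print("sigma reaches its floor after", sigma_steps, "episodes")
print("noise floor reached within training:", noise_steps <= NUM_EPISODES)
```

Under these settings, the noise scale needs about 2300 decay steps, more than the 2000 training episodes provide, so only `sigma` reaches its minimum (after about 1600 episodes).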


### Test Trained Agent

Now we take the best actor weights from training and use them for self-play.

That is, both agents will use the same actor with no added OU noise.
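
For reference, an Ornstein-Uhlenbeck (OU) process produces temporally correlated noise that drifts back toward a mean. The notebook's own noise class lives in the `ddpg` module and is not shown, so this is a generic sketch. The class name and the `theta`, `mu`, and `seed` parameters are assumptions, while `reset()` and `sigma` mirror how the training loop uses `agent.noise`.

```python
import random

class OUNoise:
    """Generic Ornstein-Uhlenbeck noise process (illustrative only)."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.25, seed=4321):
        self.size = size
        self.mu = mu
        self.theta = theta      # pull strength toward mu (assumed value)
        self.sigma = sigma      # scale of the random perturbation
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        """Return the internal state to the mean."""
        self.state = [self.mu] * self.size

    def sample(self):
        """Advance the process one step and return the new state."""
        self.state = [
            x + self.theta * (self.mu - x) + self.sigma * self.rng.gauss(0.0, 1.0)
            for x in self.state
        ]
        return list(self.state)

noise = OUNoise(size=2)
print(noise.sample())   # correlated perturbation for a 2-dimensional action

# With sigma set to zero the process stays at its mean, so nothing is added.
noise.reset()
noise.sigma = 0.0
print(noise.sample())
```

During evaluation no such perturbation is added, so the actor's output is used directly as the action.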

In [4]:
# Create a new actor network and assign it the trained weights
from networks import ActorNetwork
actor = ActorNetwork().to(device)
actor.load_state_dict(torch.load("trained_actor_weights.pth", map_location=device))

# Reset the environment for evaluation
env_info = env.reset(train_mode=False)[brain_name]
states = env_info.vector_observations
scores = np.zeros(num_agents)

# Run an episode
MAX_STEPS = 10000
ep_done = False
idx = 0
while not ep_done:

    # Choose actions for both agents using the trained actor
    with torch.no_grad():
        actions = np.zeros((num_agents, action_size))
        for aidx in range(num_agents):
            state_tensor = torch.tensor(states[aidx], dtype=torch.float)
            act_tensor = actor.act(state_tensor.to(device))
            actions[aidx, :] = act_tensor.cpu().detach().numpy()

    # Step the environment and collect data
    env_info = env.step(actions)[brain_name]
    next_states = env_info.vector_observations
    rewards = env_info.rewards
    dones = env_info.local_done
    scores += env_info.rewards
    states = next_states
    idx += 1

    # Termination condition
    if np.any(dones) or idx >= MAX_STEPS:
        ep_done = True

print("Final scores after {} steps: {}".format(idx, scores))

FileNotFoundError: [Errno 2] No such file or directory: 'trained_actor_weights.pth'

In [5]:
# Finally, close the environment
env.close()