# Collaboration and Competition

---

In this notebook, you will learn how to use the Unity ML-Agents environment for the third project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893) program.

### 1. Start the Environment

We begin by importing the necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [1]:
from unityagents import UnityEnvironment
import numpy as np

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Tennis.app"`
- **Windows** (x86): `"path/to/Tennis_Windows_x86/Tennis.exe"`
- **Windows** (x86_64): `"path/to/Tennis_Windows_x86_64/Tennis.exe"`
- **Linux** (x86): `"path/to/Tennis_Linux/Tennis.x86"`
- **Linux** (x86_64): `"path/to/Tennis_Linux/Tennis.x86_64"`
- **Linux** (x86, headless): `"path/to/Tennis_Linux_NoVis/Tennis.x86"`
- **Linux** (x86_64, headless): `"path/to/Tennis_Linux_NoVis/Tennis.x86_64"`

For instance, if you are using a Mac, then you downloaded `Tennis.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Tennis.app")
```

In [2]:
env = UnityEnvironment(file_name="Tennis.app")

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: TennisBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 8
        Number of stacked Vector Observation: 3
        Vector Action space type: continuous
        Vector Action space size (per agent): 2
        Vector Action descriptions: , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [3]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

In this environment, two agents control rackets to bounce a ball over a net. If an agent hits the ball over the net, it receives a reward of +0.1.  If an agent lets a ball hit the ground or hits the ball out of bounds, it receives a reward of -0.01.  Thus, the goal of each agent is to keep the ball in play.

The observation space consists of 8 variables corresponding to the position and velocity of the ball and racket. The environment stacks 3 consecutive observations, so each agent receives a state vector of length 24. Two continuous actions are available, corresponding to movement toward (or away from) the net, and jumping.

Run the code cell below to print some information about the environment.

In [4]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents 
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])

Number of agents: 2
Size of each action: 2
There are 2 agents. Each observes a state with length: 24
The state for the first agent looks like: [ 0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.         -6.65278625 -1.5
 -0.          0.          6.83172083  6.         -0.          0.        ]
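The length of 24 follows from the 8 variables per observation and the 3 stacked observations reported in the brain summary. As a rough sketch, and assuming the stacked frames are simply concatenated with the most recent one last (the printed state, with zeros in its first 16 entries right after a reset, is consistent with this but does not prove it), the vector can be split back into frames with a reshape:

```python
import numpy as np

# State of the first agent, copied from the output above.
state = np.array(
    [0.0] * 16
    + [-6.65278625, -1.5, -0.0, 0.0, 6.83172083, 6.0, -0.0, 0.0]
)

# Assumption: 3 frames of 8 variables each, oldest first.
frames = state.reshape(3, 8)
latest = frames[-1]

print(frames.shape)  # shape of the stacked observation
print(latest)        # the most recent 8-variable observation
```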


### 3. Take Random Actions in the Environment

In the next code cell, you will learn how to use the Python API to control the agents and receive feedback from the environment.

To run the code in the cell, set `try_random_actions` to `True`. Once the cell is executed, you can watch the agents' performance as they select actions at random at each time step. A window should pop up that allows you to observe the agents as they move through the environment.

Of course, as part of the project, you'll have to change the code so that the agents are able to use their experiences to gradually choose better actions when interacting with the environment!

In [5]:
try_random_actions = False

if try_random_actions:
    for i in range(1, 6):                                      # play game for 5 episodes
        env_info = env.reset(train_mode=False)[brain_name]     # reset the environment    
        states = env_info.vector_observations                  # get the current state (for each agent)
        scores = np.zeros(num_agents)                          # initialize the score (for each agent)
        while True:
            actions = np.random.randn(num_agents, action_size) # select an action (for each agent)
            actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
            env_info = env.step(actions)[brain_name]           # send all actions to the environment
            next_states = env_info.vector_observations         # get next state (for each agent)
            rewards = env_info.rewards                         # get reward (for each agent)
            dones = env_info.local_done                        # see if episode finished
            scores += env_info.rewards                         # update the score (for each agent)
            states = next_states                               # roll over states to next time step
            if np.any(dones):                                  # exit loop if episode finished
                break
        print('Score (max over agents) from episode {}: {}'.format(i, np.max(scores)))
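
The score printed for each episode is the larger of the two agents' summed rewards. The sketch below replays a short, made-up reward sequence (the values are illustrative, not taken from the environment) and then applies the same kind of rolling 100-episode average and 0.5 threshold that the training loop later in this notebook uses; `is_solved` is a hypothetical helper name.

```python
from collections import deque

import numpy as np

# Made-up per-step rewards for two agents (illustrative only).
step_rewards = [
    [0.0, 0.0],
    [0.1, 0.0],    # agent 0 hits the ball over the net
    [0.0, 0.1],    # agent 1 returns it
    [0.1, 0.0],
    [0.0, -0.01],  # agent 1 lets the ball hit the ground
]

scores = np.zeros(2)
for rewards in step_rewards:
    scores += rewards              # accumulate rewards per agent
episode_score = np.max(scores)     # episode score = max over agents


def is_solved(window, target=0.5):
    """True once the window is full and its mean reaches the target."""
    return len(window) == window.maxlen and np.mean(window) >= target


window = deque(maxlen=100)         # rolling window of episode scores
window.append(episode_score)
print(episode_score, is_solved(window))
```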

When finished, you can close the environment as shown below. Note, however, that once the environment is closed, you need to restart the notebook to continue running the remaining cells. After restarting, set `try_random_actions` to `False` in the cell above so that both that cell and the subsequent one that closes the environment are skipped.

In [6]:
if try_random_actions:
    env.close()

### 4. It's Your Turn!

Now it's your turn to train your own agent to solve the environment!  When training the agent, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```

### 5. Import the Necessary Packages


In [7]:
import torch
import random
import time
import pprint
import numpy as np
from collections import deque, namedtuple
import matplotlib.pyplot as plt
%matplotlib inline

from ddpg_agent import Agent, override_config

### 6. Instantiate the Environment and a Multi-Agent DDPG

In [8]:
config = {"state_size": state_size,
          "action_size": action_size,
          "parallel_agents": num_agents,
                    
          # Actor and Critic model parameters
          "actor_layer_sizes": [256, 128],
          "critic_layer_sizes": [256, 128],
          "actor_sees_other_state": True,
          "critic_combines_state_action": True,
          
          # Agent's parameters
          "train_every": 20,
          "train_steps": 10,
          "buffer_size": int(1048576) , # power of two
          "batch_size": 512,
          "lr_actor": 1e-4,
          "lr_critic": 1e-3,
          "weight_decay": 0.00001,
          "gamma": 0.99,
          "tau": 0.001, 
          "tau_increase": 1.0001, # definitely use 1.0001 for prioritized replay. Otherwise, 1.001
          
          # Exploration using noise or random actions
          "add_noise": True,           # If I turn this off, the agent appears to overfit and play a fixed game!
          "use_ounoise": True,
          "random_action_period": 1500, # If I turn off both sources of noise, the agent still learns (but only without prioritized replay), though more slowly. Perhaps the replay buffer is a source of noise.
          "minimum_random_action_prob": 0.01, # If both sources of noise are off, prioritized replay had better be off too!

          # OUNoise parameters
          "noise_theta": 0.15,
          "noise_sigma": 0.2,
          "theta_decay": 0.99,
          "sigma_decay": 0.99,
          
          # Prioritized replay parameters
          "prioritized_replay": True,
          "epsilon_error": 1e-7,
          "maximum_error": 1e4,
          "alpha": 0.6,
          "beta_start": 0.4,
          "beta_decay": 0.99,
          "beta_end": 1.0
         }

# Override the global parameters used by the agent
override_config(config)

agent = Agent(state_size=state_size, action_size=action_size, random_seed=2,
#              actor_sees_other_state=config["actor_sees_other_state"], critic_combines_state_action=config["critic_combines_state_action"],
              prioritized_replay=config["prioritized_replay"], use_ounoise=config["use_ounoise"],
              parallel_agents=num_agents, train_every=config["train_every"], train_steps=config["train_steps"])
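
The `noise_theta`, `noise_sigma`, `theta_decay`, and `sigma_decay` entries configure Ornstein-Uhlenbeck exploration noise. The `OUNoise` class in `ddpg_agent` is not shown in this notebook, so the following is only a minimal sketch of the usual discrete OU update, with the per-episode decay mirroring what the training loop below does with `agent.reset(theta, sigma)`; the class layout and the `seed` argument are assumptions.

```python
import numpy as np


class OUNoise:
    """Minimal Ornstein-Uhlenbeck process (a sketch, not the agent's class)."""

    def __init__(self, size, theta=0.15, sigma=0.2, mu=0.0, seed=2):
        self.mu = mu * np.ones(size)
        self.theta = theta
        self.sigma = sigma
        self.rng = np.random.default_rng(seed)
        self.state = self.mu.copy()

    def reset(self, theta=None, sigma=None):
        """Return to the mean and optionally install new theta/sigma."""
        if theta is not None:
            self.theta = theta
        if sigma is not None:
            self.sigma = sigma
        self.state = self.mu.copy()

    def sample(self):
        """Pull the state toward mu and add Gaussian noise scaled by sigma."""
        drift = self.theta * (self.mu - self.state)
        diffusion = self.sigma * self.rng.standard_normal(self.state.shape)
        self.state = self.state + drift + diffusion
        return self.state.copy()


# One noise value per agent and action dimension (2 agents, 2 actions).
noise = OUNoise(size=(2, 2))
theta, sigma = 0.15, 0.2
for episode in range(3):
    noise.reset(theta, sigma)
    first_sample = noise.sample()
    theta *= 0.99   # theta_decay
    sigma *= 0.99   # sigma_decay
```

Because the noise is temporally correlated, consecutive samples drift rather than jump, which tends to produce smoother exploratory motion than independent Gaussian noise.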

### 7. Train the DDPG Agent

In [9]:
train_every = config["train_every"]  # train the agent after every train_every time steps
train_steps = config["train_steps"]  # train for train_steps after every train_every time steps

add_noise = config["add_noise"]      
noise_theta = config["noise_theta"]  # OUNoise theta
noise_sigma = config["noise_sigma"]  # OUNoise sigma
theta_decay = config["theta_decay"]  # 0.99
sigma_decay = config["sigma_decay"]  # 0.99

gamma = config["gamma"]

beta_start = config["beta_start"]
beta_decay = config["beta_decay"]
beta_end = config["beta_end"]    

# Increase tau
# Add noisy actions that reduce in frequency during the first 1500 steps but never go to zero
# When training the actor through the critic, and passing state and action to the critic, compute the
# value of the situation assuming that only your action has changed (what the actor computes), but that
# the other agent's actions have not changed

def ddpg(n_episodes=10000, print_every=100, solution_score=0.5, train=True):
    scores_deque = deque(maxlen=print_every)
    scores_array = []
    theta = noise_theta
    sigma = noise_sigma
    store_only_steps_remaining = train_every
    train_steps_remaining = 0
    trained_steps = 0
    steps = 0
    beta = beta_start
    
    start_time = time.time()
    print_every_start_time = start_time

    for i_episode in range(1, n_episodes+1):
        env_info = env.reset(train_mode=train)[brain_name]     # reset the environment (train_mode selects the training vs. inference configuration)
        states = env_info.vector_observations                  # get the current state (for each agent)
        scores = np.zeros(num_agents)                          # initialize the score (for each agent)
        agent.reset(theta, sigma)
        theta *= theta_decay
        sigma *= sigma_decay
        while True:
            actions = agent.act(states, add_noise=add_noise)
            env_info = env.step(actions)[brain_name]           # send all actions to the environment
            next_states = env_info.vector_observations         # get next state (for each agent)
            dones = env_info.local_done                        # see if episode finished
            
            rewards = env_info.rewards                         # get reward (for each agent)
            scores += rewards                                  # update the score (for each agent)
        
            agent.step(states, actions, rewards, next_states, dones, beta=beta)
                            
            states = next_states
            if np.any(dones):                                  # exit loop if episode finished
                break 

        scores_deque.append(np.max(scores))
        scores_array.append(np.max(scores))
        average_over_window = np.mean(scores_deque)
        
        end_time = time.time()
        runtime = end_time - start_time
        start_time = end_time
        print('\rEpisode {}\tRuntime {:.2f}\tAverage Score: {:.2f}'.format(i_episode, runtime, average_over_window), end="")
        
        if train and config["prioritized_replay"] and i_episode >= 100:
            beta = min(beta_end, (1 - beta_decay * (1 - beta)))
        
        if i_episode % print_every == 0:
            print_every_end_time = time.time()
            runtime = print_every_end_time - print_every_start_time
            print_every_start_time = print_every_end_time
            print('\rEpisode {}\tRuntime {:.2f}\tAverage Score: {:.2f}'.format(i_episode, runtime, average_over_window))
        
        if len(scores_deque) >= print_every:            
            if average_over_window >= solution_score:
                print('\rSolved after Episode {}\tAverage Score: {:.2f}'.format(i_episode, average_over_window))
                torch.save(agent.actor_local.state_dict(), 'checkpoint_actor.pth')
                torch.save(agent.critic_local.state_dict(), 'checkpoint_critic.pth')
                break
            
    return scores_array

pprint.pprint(config)
scores = ddpg(train=True)

fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(np.arange(1, len(scores)+1), scores)
plt.ylabel('Score')
plt.xlabel('Episode #')
plt.show()

{'action_size': 2,
 'actor_layer_sizes': [256, 128],
 'actor_sees_other_state': True,
 'add_noise': True,
 'alpha': 0.6,
 'batch_size': 512,
 'beta_decay': 0.99,
 'beta_end': 1.0,
 'beta_start': 0.4,
 'buffer_size': 1048576,
 'critic_combines_state_action': True,
 'critic_layer_sizes': [256, 128],
 'epsilon_error': 1e-07,
 'gamma': 0.99,
 'lr_actor': 0.0001,
 'lr_critic': 0.001,
 'maximum_error': 10000.0,
 'minimum_random_action_prob': 0.01,
 'noise_sigma': 0.2,
 'noise_theta': 0.15,
 'parallel_agents': 2,
 'prioritized_replay': True,
 'random_action_period': 1500,
 'sigma_decay': 0.99,
 'state_size': 24,
 'tau': 0.001,
 'tau_increase': 1.0001,
 'theta_decay': 0.99,
 'train_every': 20,
 'train_steps': 10,
 'use_ounoise': True,
 'weight_decay': 1e-05}
Episode 39	Runtime 0.05	Average Score: 0.01

KeyboardInterrupt: 
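
The update `beta = min(beta_end, 1 - beta_decay * (1 - beta))` in the training loop shrinks the gap between beta and 1 by a factor of `beta_decay` each episode (once it starts, at episode 100). The sketch below iterates that rule in isolation and compares it with the closed form `1 - (1 - beta_start) * beta_decay**k`; `anneal_beta` is a name introduced here for illustration.

```python
def anneal_beta(beta, beta_decay=0.99, beta_end=1.0):
    """One annealing step, as written in the training loop."""
    return min(beta_end, 1 - beta_decay * (1 - beta))


beta_start = 0.4
beta = beta_start
history = [beta]
for _ in range(300):
    beta = anneal_beta(beta)
    history.append(beta)

# Closed form after k annealing steps.
k = 100
closed_form = 1 - (1 - beta_start) * 0.99 ** k
print(history[k], closed_form)
```

With these values, beta approaches 1 only asymptotically, so the importance-sampling correction is never fully applied within a finite number of episodes.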

First solution found: increasing tau over time appears to be important.
Note that I have also removed any extra initialization of the weights and biases in the model.

- Tried turning on `add_noise`, and it was bad!
- Tried `tau_increase` values of 1.0 and 1.001:
  - 1.0 was bad.
  - 1.001 was a bit less stable, but faster.
  - I prefer 1.0001.
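
For context on the `tau_increase` values above: in DDPG, the target networks usually track the local networks through a soft update with mixing rate tau. The agent code is not shown in this notebook, so the following is only a sketch under the assumption that tau is multiplied by `tau_increase` after every update and capped at 1; `soft_update` and `tau_schedule` are hypothetical names, and plain NumPy arrays stand in for network weights.

```python
import numpy as np


def soft_update(local, target, tau):
    """Blend local weights into the target: tau*local + (1 - tau)*target."""
    return tau * local + (1.0 - tau) * target


def tau_schedule(tau0, tau_increase, n_updates, tau_max=1.0):
    """Assumed schedule: tau grows geometrically per update, capped at tau_max."""
    return min(tau_max, tau0 * tau_increase ** n_updates)


# Compare how quickly tau grows for the three rates mentioned above.
for rate in (1.0, 1.0001, 1.001):
    taus = [tau_schedule(0.001, rate, n) for n in (0, 1000, 10000)]
    print(rate, [round(t, 5) for t in taus])

local = np.ones(3)
target = np.zeros(3)
target = soft_update(local, target, tau_schedule(0.001, 1.0001, 1000))
```

Under this assumption, a rate of 1.001 reaches the cap within several thousand updates, while 1.0001 grows far more gently, which may relate to the stability difference noted above.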

Also tried turning off the random noise during early training, in addition to `add_noise=False`. The agent still trained, as if there were other sources of noise for exploration, perhaps the replay buffer. In fact, the training was less noisy. So, we will keep this source of noise off!


Try `"critic_combines_state_action"=False`. The difference is in the critic: whether it evaluates the actor's action in the context of both agents' states plus the other agent's action, or evaluates the actions of the two agents in the context of their two states. Of course, the first option (`"critic_combines_state_action"=True`) makes much more sense.
The agent still learns, but very slowly, and even once its performance is above 0.25 it still fluctuates and drops instead of rising steadily as in the other configurations. Because of this, I aborted the training after more than 3000 steps.
The first fully connected layer of the critic encodes the context in which an action is evaluated, so it makes sense that this context should include the action of the other agent.
This also means that feeding the context and action inputs into different layers makes a lot of sense!

Try `"actor_sees_other_state": False`. Note that we have already decided on `"critic_combines_state_action"=True`.
It doesn't look promising, so I aborted it.

Next, let's try `prioritized_replay=True`. Note, however, that this feature has its own hyperparameters that need tuning, so it may not work well out of the box. Perhaps we want to keep sampling closer to uniform for longer (similar to when prioritized replay is off) by changing the alpha and beta parameters.

The agent does train, but more slowly than without prioritized replay. Also, because the replay buffer errors are sorted repeatedly, training takes longer in wall-clock time. We pick experiences with larger errors more often, but the gradients we compute for them are scaled down to compensate for sampling them more often. Thus, perhaps we end up not learning much from the other experiences.
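
The trade-off described above (sample high-error experiences more often, then scale their gradients down) can be illustrated with the standard proportional form of prioritized replay. The notebook's own buffer is not shown and may differ (the mention of sorting suggests it might), so this is a sketch with hypothetical function names, using the `alpha`, `beta_start`, and `epsilon_error` values from the config.

```python
import numpy as np


def sampling_probabilities(td_errors, alpha=0.6, epsilon_error=1e-7):
    """P(i) proportional to (|error_i| + eps)^alpha; alpha=0 gives uniform sampling."""
    priorities = (np.abs(td_errors) + epsilon_error) ** alpha
    return priorities / priorities.sum()


def importance_weights(probs, beta=0.4):
    """w_i = (N * P(i))^(-beta), normalized so the largest weight is 1."""
    n = len(probs)
    weights = (n * probs) ** (-beta)
    return weights / weights.max()


td_errors = np.array([0.01, 0.1, 1.0, 2.0])   # made-up TD errors
probs = sampling_probabilities(td_errors)
weights = importance_weights(probs)

rng = np.random.default_rng(0)
batch_indices = rng.choice(len(td_errors), size=2, p=probs)

print(probs)    # larger errors are sampled more often
print(weights)  # and receive smaller loss weights
```

Lowering `alpha` moves sampling back toward uniform, and raising `beta` strengthens the down-weighting, which are the two knobs the note above refers to.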


 'prioritized_replay': False,
 'add_noise': False,
 'tau_increase': 1.001, 
 
{'action_size': 2,
 'actor_layer_sizes': [256, 128],
 'actor_sees_other_state': True,
 'add_noise': False,
 'alpha': 0.6,
 'batch_size': 512,
 'beta_decay': 0.99,
 'beta_end': 1.0,
 'beta_start': 0.4,
 'buffer_size': 1048576,
 'critic_combines_state_action': True,
 'critic_layer_sizes': [256, 128],
 'epsilon_error': 1e-07,
 'gamma': 0.99,
 'lr_actor': 0.0001,
 'lr_critic': 0.001,
 'maximum_error': 10000.0,
 'minimum_random_action_prob': 0.01,
 'noise_sigma': 0.2,
 'noise_theta': 0.15,
 'parallel_agents': 2,
 'prioritized_replay': False,
 'random_action_period': 1500,
 'sigma_decay': 0.99,
 'state_size': 24,
 'tau': 0.001,
 'tau_increase': 1.001,
 'theta_decay': 0.99,
 'train_every': 20,
 'train_steps': 10,
 'use_ounoise': True,
 'weight_decay': 1e-05}
Episode 100	Runtime 20.50	Average Score: 0.01
Episode 200	Runtime 22.67	Average Score: 0.00
Episode 300	Runtime 22.01	Average Score: 0.00
Episode 400	Runtime 23.51	Average Score: 0.00
Episode 500	Runtime 28.01	Average Score: 0.02
Episode 600	Runtime 23.28	Average Score: 0.00
Episode 700	Runtime 27.26	Average Score: 0.02
Episode 800	Runtime 30.15	Average Score: 0.03
Episode 900	Runtime 23.54	Average Score: 0.01
Episode 1000	Runtime 25.94	Average Score: 0.01
Episode 1100	Runtime 29.42	Average Score: 0.02
Episode 1200	Runtime 29.65	Average Score: 0.02
Episode 1300	Runtime 30.93	Average Score: 0.01
Episode 1400	Runtime 39.22	Average Score: 0.05
Episode 1500	Runtime 39.53	Average Score: 0.05
Episode 1600	Runtime 31.89	Average Score: 0.03
Episode 1700	Runtime 40.27	Average Score: 0.06
Episode 1800	Runtime 39.03	Average Score: 0.06
Episode 1900	Runtime 45.01	Average Score: 0.07
Episode 2000	Runtime 51.19	Average Score: 0.07
Episode 2100	Runtime 54.06	Average Score: 0.09
Episode 2200	Runtime 60.84	Average Score: 0.10
Episode 2300	Runtime 67.73	Average Score: 0.11
Episode 2400	Runtime 84.80	Average Score: 0.14
Episode 2500	Runtime 73.57	Average Score: 0.12
Episode 2600	Runtime 74.63	Average Score: 0.12
Episode 2700	Runtime 71.67	Average Score: 0.12
Episode 2800	Runtime 74.21	Average Score: 0.12
Episode 2900	Runtime 81.17	Average Score: 0.12
Episode 3000	Runtime 93.80	Average Score: 0.13
Episode 3100	Runtime 90.11	Average Score: 0.13
Episode 3200	Runtime 194.88	Average Score: 0.26
Solved after Episode 3237	Average Score: 0.500
    
![Screen%20Shot%202021-08-04%20at%206.17.15%20PM.png](attachment:Screen%20Shot%202021-08-04%20at%206.17.15%20PM.png)    

 'prioritized_replay': True,
 'add_noise': False,
 'tau_increase': 1.0001,
 
 {'action_size': 2,
 'actor_layer_sizes': [256, 128],
 'actor_sees_other_state': True,
 'add_noise': False,
 'alpha': 0.6,
 'batch_size': 512,
 'beta_decay': 0.99,
 'beta_end': 1.0,
 'beta_start': 0.4,
 'buffer_size': 1048576,
 'critic_combines_state_action': True,
 'critic_layer_sizes': [256, 128],
 'epsilon_error': 1e-07,
 'gamma': 0.99,
 'lr_actor': 0.0001,
 'lr_critic': 0.001,
 'maximum_error': 10000.0,
 'minimum_random_action_prob': 0.01,
 'noise_sigma': 0.2,
 'noise_theta': 0.15,
 'parallel_agents': 2,
 'prioritized_replay': True,
 'random_action_period': 1500,
 'sigma_decay': 0.99,
 'state_size': 24,
 'tau': 0.001,
 'tau_increase': 1.0001,
 'theta_decay': 0.99,
 'train_every': 20,
 'train_steps': 10,
 'use_ounoise': True,
 'weight_decay': 1e-05}
Episode 100	Runtime 38.92	Average Score: 0.00
Episode 200	Runtime 42.17	Average Score: 0.00
Episode 300	Runtime 42.49	Average Score: 0.00
Episode 400	Runtime 43.19	Average Score: 0.00
Episode 500	Runtime 43.69	Average Score: 0.00
Episode 600	Runtime 44.91	Average Score: 0.00
Episode 700	Runtime 58.45	Average Score: 0.02
Episode 800	Runtime 76.84	Average Score: 0.05
Episode 900	Runtime 78.13	Average Score: 0.05
Episode 1000	Runtime 68.18	Average Score: 0.02
Episode 1100	Runtime 51.73	Average Score: 0.00
Episode 1200	Runtime 54.98	Average Score: 0.01
Episode 1300	Runtime 60.46	Average Score: 0.01
Episode 1400	Runtime 62.70	Average Score: 0.02
Episode 1500	Runtime 125.40	Average Score: 0.07
Episode 1600	Runtime 79.55	Average Score: 0.02
Episode 1700	Runtime 133.48	Average Score: 0.07
Episode 1800	Runtime 178.32	Average Score: 0.11
Episode 1900	Runtime 116.21	Average Score: 0.07
Episode 2000	Runtime 106.14	Average Score: 0.06
Episode 2100	Runtime 186.07	Average Score: 0.12
Episode 2200	Runtime 392.91	Average Score: 0.25
Solved after Episode 2286	Average Score: 0.522

![Screen%20Shot%202021-08-04%20at%207.50.03%20PM.png](attachment:Screen%20Shot%202021-08-04%20at%207.50.03%20PM.png)

 'prioritized_replay': True,
 'tau_increase': 1.0001,
 'add_noise': True,
 
{'action_size': 2,
 'actor_layer_sizes': [256, 128],
 'actor_sees_other_state': True,
 'add_noise': True,
 'alpha': 0.6,
 'batch_size': 512,
 'beta_decay': 0.99,
 'beta_end': 1.0,
 'beta_start': 0.4,
 'buffer_size': 1048576,
 'critic_combines_state_action': True,
 'critic_layer_sizes': [256, 128],
 'epsilon_error': 1e-07,
 'gamma': 0.99,
 'lr_actor': 0.0001,
 'lr_critic': 0.001,
 'maximum_error': 10000.0,
 'minimum_random_action_prob': 0.01,
 'noise_sigma': 0.2,
 'noise_theta': 0.15,
 'parallel_agents': 2,
 'prioritized_replay': True,
 'random_action_period': 1500,
 'sigma_decay': 0.99,
 'state_size': 24,
 'tau': 0.001,
 'tau_increase': 1.0001,
 'theta_decay': 0.99,
 'train_every': 20,
 'train_steps': 10,
 'use_ounoise': True,
 'weight_decay': 1e-05}
Episode 100	Runtime 38.61	Average Score: 0.01
Episode 200	Runtime 42.43	Average Score: 0.00
Episode 300	Runtime 43.35	Average Score: 0.00
Episode 400	Runtime 48.37	Average Score: 0.00
Episode 500	Runtime 49.56	Average Score: 0.00
Episode 600	Runtime 47.06	Average Score: 0.00
Episode 700	Runtime 48.20	Average Score: 0.00
Episode 800	Runtime 47.34	Average Score: 0.00
Episode 900	Runtime 49.24	Average Score: 0.00
Episode 1000	Runtime 49.33	Average Score: 0.00
Episode 1100	Runtime 49.65	Average Score: 0.01
Episode 1200	Runtime 48.85	Average Score: 0.00
Episode 1300	Runtime 61.87	Average Score: 0.02
Episode 1400	Runtime 56.44	Average Score: 0.01
Episode 1500	Runtime 48.42	Average Score: 0.00
Episode 1600	Runtime 49.30	Average Score: 0.00
Episode 1700	Runtime 50.03	Average Score: 0.00
Episode 1800	Runtime 91.29	Average Score: 0.05
Episode 1900	Runtime 111.16	Average Score: 0.08
Episode 2000	Runtime 168.79	Average Score: 0.12
Episode 2100	Runtime 170.40	Average Score: 0.12
Episode 2200	Runtime 235.77	Average Score: 0.15
Episode 2300	Runtime 248.62	Average Score: 0.17
Episode 2400	Runtime 625.44	Average Score: 0.40
Episode 2500	Runtime 568.35	Average Score: 0.35
Solved after Episode 2545	Average Score: 0.500
![Screen%20Shot%202021-08-04%20at%209.04.11%20PM.png](attachment:Screen%20Shot%202021-08-04%20at%209.04.11%20PM.png)

 'random_action_period': 1,
 'minimum_random_action_prob': 0.0,
 'prioritized_replay': True,
 'add_noise': True,
 'tau_increase': 1.0001,

No learning seems to have happened.
    
{'action_size': 2,
 'actor_layer_sizes': [256, 128],
 'actor_sees_other_state': True,
 'add_noise': True,
 'alpha': 0.6,
 'batch_size': 512,
 'beta_decay': 0.99,
 'beta_end': 1.0,
 'beta_start': 0.4,
 'buffer_size': 1048576,
 'critic_combines_state_action': True,
 'critic_layer_sizes': [256, 128],
 'epsilon_error': 1e-07,
 'gamma': 0.99,
 'lr_actor': 0.0001,
 'lr_critic': 0.001,
 'maximum_error': 10000.0,
 'minimum_random_action_prob': 0.0,
 'noise_sigma': 0.2,
 'noise_theta': 0.15,
 'parallel_agents': 2,
 'prioritized_replay': True,
 'random_action_period': 1,
 'sigma_decay': 0.99,
 'state_size': 24,
 'tau': 0.001,
 'tau_increase': 1.0001,
 'theta_decay': 0.99,
 'train_every': 20,
 'train_steps': 10,
 'use_ounoise': True,
 'weight_decay': 1e-05}

Episode 100	Runtime 38.85	Average Score: 0.00
Episode 200	Runtime 43.60	Average Score: 0.00
Episode 300	Runtime 47.35	Average Score: 0.00
Episode 400	Runtime 46.64	Average Score: 0.00
Episode 500	Runtime 45.12	Average Score: 0.00
Episode 600	Runtime 46.07	Average Score: 0.00
Episode 700	Runtime 46.43	Average Score: 0.00
Episode 800	Runtime 46.91	Average Score: 0.00
Episode 900	Runtime 48.23	Average Score: 0.00
Episode 1000	Runtime 47.71	Average Score: 0.00
Episode 1100	Runtime 49.12	Average Score: 0.00
Episode 1200	Runtime 50.17	Average Score: 0.00
Episode 1300	Runtime 50.79	Average Score: 0.00
Episode 1400	Runtime 49.73	Average Score: 0.00
Episode 1500	Runtime 50.07	Average Score: 0.00
Episode 1600	Runtime 48.93	Average Score: 0.00
Episode 1700	Runtime 49.59	Average Score: 0.00
Episode 1800	Runtime 50.39	Average Score: 0.00
Episode 1900	Runtime 49.21	Average Score: 0.00
Episode 2000	Runtime 49.43	Average Score: 0.00
Episode 2100	Runtime 50.35	Average Score: 0.00
Episode 2200	Runtime 60.93	Average Score: 0.00
Episode 2300	Runtime 59.65	Average Score: 0.00
Episode 2400	Runtime 61.19	Average Score: 0.00
Episode 2500	Runtime 68.64	Average Score: 0.00
Episode 2600	Runtime 73.63	Average Score: 0.00
Episode 2700	Runtime 107.91	Average Score: 0.00
Episode 2800	Runtime 171.68	Average Score: 0.00
Episode 2900	Runtime 217.21	Average Score: 0.00
Episode 2920	Runtime 0.15	Average Score: 0.00

 'random_action_period': 1,
 'minimum_random_action_prob': 0.01,
 'prioritized_replay': True,
 'add_noise': True,
 'tau_increase': 1.0001,

No sign of training! This suggests that it is crucial to have `'random_action_period': 1500` and `'minimum_random_action_prob': 0.01`.

    
{'action_size': 2,
 'actor_layer_sizes': [256, 128],
 'actor_sees_other_state': True,
 'add_noise': True,
 'alpha': 0.6,
 'batch_size': 512,
 'beta_decay': 0.99,
 'beta_end': 1.0,
 'beta_start': 0.4,
 'buffer_size': 1048576,
 'critic_combines_state_action': True,
 'critic_layer_sizes': [256, 128],
 'epsilon_error': 1e-07,
 'gamma': 0.99,
 'lr_actor': 0.0001,
 'lr_critic': 0.001,
 'maximum_error': 10000.0,
 'minimum_random_action_prob': 0.01,
 'noise_sigma': 0.2,
 'noise_theta': 0.15,
 'parallel_agents': 2,
 'prioritized_replay': True,
 'random_action_period': 1,
 'sigma_decay': 0.99,
 'state_size': 24,
 'tau': 0.001,
 'tau_increase': 1.0001,
 'theta_decay': 0.99,
 'train_every': 20,
 'train_steps': 10,
 'use_ounoise': True,
 'weight_decay': 1e-05}
Episode 100	Runtime 38.16	Average Score: 0.00
Episode 200	Runtime 44.33	Average Score: 0.00
Episode 300	Runtime 45.80	Average Score: 0.00
Episode 400	Runtime 48.28	Average Score: 0.00
Episode 500	Runtime 53.22	Average Score: 0.00
Episode 600	Runtime 47.58	Average Score: 0.00
Episode 700	Runtime 47.25	Average Score: 0.00
Episode 800	Runtime 48.73	Average Score: 0.00
Episode 900	Runtime 48.40	Average Score: 0.00
Episode 1000	Runtime 50.04	Average Score: 0.00
Episode 1100	Runtime 50.44	Average Score: 0.00
Episode 1200	Runtime 50.70	Average Score: 0.00
Episode 1300	Runtime 50.11	Average Score: 0.00
Episode 1400	Runtime 57.87	Average Score: 0.00
Episode 1500	Runtime 56.11	Average Score: 0.00
Episode 1600	Runtime 54.76	Average Score: 0.00
Episode 1700	Runtime 51.92	Average Score: 0.00
Episode 1800	Runtime 54.81	Average Score: 0.00
Episode 1900	Runtime 53.75	Average Score: 0.00
Episode 2000	Runtime 55.90	Average Score: 0.00
Episode 2100	Runtime 59.62	Average Score: 0.00
Episode 2200	Runtime 55.34	Average Score: 0.00
Episode 2300	Runtime 58.38	Average Score: 0.00
Episode 2377	Runtime 0.12	Average Score: 0.00
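
One way to read `random_action_period` and `minimum_random_action_prob` is as a schedule for the probability of replacing the policy's action with a random one. The agent's actual rule is not shown here, so the sketch below assumes a linear decay from 1 to the minimum over `random_action_period` steps; both function names are hypothetical. Under this assumption, a period of 1 leaves essentially no early random exploration, which would be consistent with the two runs above that failed to learn.

```python
import numpy as np


def random_action_prob(step, period=1500, minimum=0.01):
    """Assumed schedule: linear decay from 1.0, floored at `minimum`."""
    return max(minimum, 1.0 - step / period)


def choose_actions(policy_actions, step, rng, period=1500, minimum=0.01):
    """Return uniform random actions in [-1, 1] with the scheduled probability."""
    if rng.random() < random_action_prob(step, period, minimum):
        return rng.uniform(-1.0, 1.0, size=policy_actions.shape)
    return policy_actions


rng = np.random.default_rng(0)
policy_actions = np.zeros((2, 2))   # 2 agents, 2 action dimensions

for step in (0, 750, 1500, 3000):
    print(step, random_action_prob(step))

early = choose_actions(policy_actions, step=0, rng=rng)
```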

{'action_size': 2,
 'actor_layer_sizes': [256, 128],
 'actor_sees_other_state': True,
 'add_noise': True,
 'alpha': 0.6,
 'batch_size': 512,
 'beta_decay': 0.99,
 'beta_end': 1.0,
 'beta_start': 0.4,
 'buffer_size': 1048576,
 'critic_combines_state_action': True,
 'critic_layer_sizes': [256, 128],
 'epsilon_error': 1e-07,
 'gamma': 0.99,
 'lr_actor': 0.0001,
 'lr_critic': 0.001,
 'maximum_error': 10000.0,
 'minimum_random_action_prob': 0.01,
 'noise_sigma': 0.2,
 'noise_theta': 0.15,
 'parallel_agents': 2,
 'prioritized_replay': True,
 'random_action_period': 1500,
 'sigma_decay': 0.99,
 'state_size': 24,
 'tau': 0.001,
 'tau_increase': 1.0001,
 'theta_decay': 0.99,
 'train_every': 20,
 'train_steps': 10,
 'use_ounoise': True,
 'weight_decay': 1e-05}
Episode 100	Runtime 41.61	Average Score: 0.00
Episode 200	Runtime 43.18	Average Score: 0.00
Episode 300	Runtime 43.78	Average Score: 0.00
Episode 400	Runtime 46.86	Average Score: 0.00
Episode 500	Runtime 44.56	Average Score: 0.00
Episode 600	Runtime 44.93	Average Score: 0.00
Episode 700	Runtime 46.39	Average Score: 0.00
Episode 800	Runtime 55.72	Average Score: 0.02
Episode 900	Runtime 46.55	Average Score: 0.00
Episode 1000	Runtime 46.11	Average Score: 0.00
Episode 1100	Runtime 47.12	Average Score: 0.00
Episode 1200	Runtime 62.18	Average Score: 0.02
Episode 1300	Runtime 63.59	Average Score: 0.01
Episode 1400	Runtime 58.16	Average Score: 0.01
Episode 1500	Runtime 65.00	Average Score: 0.02
Episode 1600	Runtime 83.35	Average Score: 0.05
Episode 1700	Runtime 131.06	Average Score: 0.10
Solved after Episode 1738	Average Score: 0.511

![Screen%20Shot%202021-08-05%20at%2011.54.34%20AM.png](attachment:Screen%20Shot%202021-08-05%20at%2011.54.34%20AM.png)

### 8. Load the Trained Agent

In [None]:
agent.actor_local.load_state_dict(torch.load('checkpoint_actor.pth'))
agent.critic_local.load_state_dict(torch.load('checkpoint_critic.pth'))

### 9. Watch the Trained Agent in Action

Run the next cell to watch the trained agent's performance. A window should pop up that allows you to observe the agent(s) acting in the environment.

In [None]:
for i in range(1, 6):                                      # play game for 5 episodes
    env_info = env.reset(train_mode=False)[brain_name]     # reset the environment    
    states = env_info.vector_observations                  # get the current state (for each agent)
    scores = np.zeros(num_agents)                          # initialize the score (for each agent)
    while True:
        actions = agent.act(states, add_noise=False)       # select an action (for each agent)
        actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
        env_info = env.step(actions)[brain_name]           # send all actions to the environment
        next_states = env_info.vector_observations         # get next state (for each agent)
        rewards = env_info.rewards                         # get reward (for each agent)
        dones = env_info.local_done                        # see if episode finished
        scores += env_info.rewards                         # update the score (for each agent)
        states = next_states                               # roll over states to next time step
        if np.any(dones):                                  # exit loop if episode finished
            break
    print('Score (max over agents) from episode {}: {}'.format(i, np.max(scores)))

When finished, you can close the environment by uncommenting the following line and running the cell.

In [None]:
# env.close()