# Continuous Control

---

This coding environment will be used to train the agent for the project.

### 1. Start the Environment

The next code cell installs a few packages.  This line will take a few minutes to run!

In [1]:
%%time
!pip -q install ./python

tensorflow 1.7.1 has requirement numpy>=1.13.3, but you'll have numpy 1.12.1 which is incompatible.
ipython 6.5.0 has requirement prompt-toolkit<2.0.0,>=1.0.15, but you'll have prompt-toolkit 2.0.10 which is incompatible.
CPU times: user 696 ms, sys: 117 ms, total: 813 ms
Wall time: 51.2 s


In [2]:
from unityagents import UnityEnvironment
import numpy as np

In [3]:
def initialize_env(unity_file):
    # Initialize the environment
    env = UnityEnvironment(file_name=unity_file)

    # Get default brain
    brain_name = env.brain_names[0]
    brain = env.brains[brain_name]

    # Get state and action spaces
    env_info = env.reset(train_mode=True)[brain_name]
    state_size = env_info.vector_observations.shape[1]
    action_size = brain.vector_action_space_size
    n_agents = len(env_info.agents)
    
    print('State size: ', state_size)
    print('Action size: ', action_size)
    print('Number of agents: ', n_agents)
    
    return env, brain_name, brain, state_size, action_size, n_agents

Both versions of the environment are already saved in the Workspace and can be accessed at the file paths provided below.
Please select one of the two options below for loading the environment.

In [4]:
# select this option to load version 1 (with a single agent) of the environment
unity_file = '/data/Reacher_One_Linux_NoVis/Reacher_One_Linux_NoVis.x86_64'
# select this option to load version 2 (with 20 agents) of the environment
# unity_file='/data/Reacher_Linux_NoVis/Reacher.x86_64'

env, brain_name, brain, state_size, action_size, n_agents = initialize_env(unity_file)

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		goal_speed -> 1.0
		goal_size -> 5.0
Unity brain name: ReacherBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 33
        Number of stacked Vector Observation: 1
        Vector Action space type: continuous
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


State size:  33
Action size:  4
Number of agents:  1


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

### 2. Examine the State and Action Spaces

Let's print some information about the environment.

In [5]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'
      .format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])

Number of agents: 1
Size of each action: 4
There are 1 agents. Each observes a state with length: 33
The state for the first agent looks like: [  0.00000000e+00  -4.00000000e+00   0.00000000e+00   1.00000000e+00
  -0.00000000e+00  -0.00000000e+00  -4.37113883e-08   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00  -1.00000000e+01   0.00000000e+00
   1.00000000e+00  -0.00000000e+00  -0.00000000e+00  -4.37113883e-08
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00  -6.30408478e+00  -1.00000000e+00
  -4.92529202e+00   0.00000000e+00   1.00000000e+00   0.00000000e+00
  -5.33014059e-01]


### 3. Take Random Actions in the Environment

**In this coding environment, we will not be able to watch the agents while they are training**, and we should set `train_mode=True` to restart the environment.

In [6]:
env_info = env.reset(train_mode=True)[brain_name]      # reset the environment    
states = env_info.vector_observations                  # get the current state (for each agent)
scores = np.zeros(num_agents)                          # initialize the score (for each agent)
i_episode = 0
while True:
    actions = np.random.randn(num_agents, action_size) # select an action (for each agent)
    actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
    env_info = env.step(actions)[brain_name]           # send all actions to the environment
    next_states = env_info.vector_observations         # get next state (for each agent)
    rewards = env_info.rewards                         # get reward (for each agent)
    dones = env_info.local_done                        # see if episode finished
    scores += env_info.rewards                         # update the score (for each agent)
    states = next_states                               # roll over states to next time step
    i_episode += 1
    if np.any(dones):                                  # exit loop if episode finished
        break
print('Total score (averaged over agents) this episode: {}. Number of episodes: {}.'
      .format(np.mean(scores), i_episode-1))

# Just to be sure they're not used any more
del states, actions, next_states, rewards, dones, scores

Total score (averaged over agents) this episode: 0.0. Number of episodes: 1000.


### 4. It's Your Turn!

- When training the environment, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```
- You cannot watch the agents while they are training in this environment. **_After training the agents_**, you can download the saved model weights to watch the agents on your own machine!

### 5. Submission

> The following solution is based on the code provided in [Udacity ddpg-bipedal](https://github.com/udacity/deep-reinforcement-learning/blob/master/ddpg-bipedal/DDPG.ipynb). In particular, it uses the files [model.py](https://github.com/udacity/deep-reinforcement-learning/blob/master/ddpg-bipedal/model.py) and [ddpg_agent.py](https://github.com/udacity/deep-reinforcement-learning/blob/master/ddpg-bipedal/ddpg_agent.py), as well as the code provided in the training loop `ddpg()`.

In [7]:
import random
import datetime, time
import torch
from collections import deque
import matplotlib.pyplot as plt
%matplotlib inline
from workspace_utils import active_session

from ddpg_agent import Agent

#### Learning Algorithm

TODO - Describe DDPG...

####  Neural Network Architecture

TODO - Finalize...
After trying the initial architecture from the Udacity bipedal walker example and reading the discussions in the student hub, I adapted the model as follows:

For the **actor**, I added another fully connected layer with 128 neurons.
This results in the following number of neurons per layer (input, FC1, FC2, output): `33-256-128-4`.
In addition, a batch normalization layer is introduced between the fully connected layers to help normalize the activations across the batch of samples drawn from the replay buffer.
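
To make the layer sizes concrete, the following sketch counts the trainable parameters of the `33-256-128-4` actor. It assumes plain fully connected layers with biases and a single batch normalization layer (one scale and one shift per feature) applied to the output of FC1. The exact placement of batch normalization in `model.py` is not stated here, so that part is an assumption, and the helper names are hypothetical.

```python
# Layer sizes of the actor as described above: input, FC1, FC2, output.
LAYER_SIZES = [33, 256, 128, 4]

def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

def actor_param_count(layer_sizes, batch_norm_features=0):
    """Count trainable parameters; batch norm adds a scale and a shift per feature."""
    dense = sum(dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))
    return dense + 2 * batch_norm_features

# Assumption: batch normalization is applied to the 256 outputs of FC1.
total = actor_param_count(LAYER_SIZES, batch_norm_features=LAYER_SIZES[1])
print('Actor parameters:', total)
```

Most of the parameters sit in the `256 x 128` connection between FC1 and FC2, so changing the width of either hidden layer has by far the largest effect on model size.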

The **critic** is adjusted in the following way:
128-128

In [8]:
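def ddpg(env, brain_name, agent, n_agents,
         n_episodes=750, max_t=700):
    """Train a single DDPG agent and return the results.

    Returns the list of episode scores, the last rolling average over up to
    100 episodes, and the duration of the last episode.
    """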
    
    scores_deque = deque(maxlen=100)
    scores = []
    last_average_score = 0
    
    for i_episode in range(1, n_episodes+1):
        
        t_start = datetime.datetime.now()
        env_info = env.reset(train_mode=True)[brain_name] # reset the environment
        state = env_info.vector_observations[0]
        agent.reset()
        score = 0
        
        for t in range(max_t):
            print('\rEpisode {}/{}, t: {}/{}'
                  .format(i_episode, n_episodes, t, max_t), end="")
            action = agent.act(state)
            
            env_info = env.step(action)[brain_name]      # send the action to the environment
            next_state = env_info.vector_observations[0] # get next state (first agent only)
            reward = env_info.rewards[0]                 # get reward (first agent only)
            done = env_info.local_done[0]                # see if episode finished

            agent.step(state, action, reward, next_state, done, t)
            
            score += reward
            state = next_state                           # roll over states to next time step
            
            if done:                                     # exit loop if episode finished
                break
                
        scores_deque.append(score)
        scores.append(score)
        
        t_episode = datetime.datetime.now() - t_start
        average_score = np.mean(scores_deque)
        improvement = average_score - last_average_score
        last_average_score = average_score
        print('\rEpisode: {}, Average Score: {:.2f} ({:.2f}), Score: {:.2f}, time: {}'
              .format(i_episode, average_score, improvement, score, t_episode), end="")

        # average_score should be above 5. after 100 episodes! Abort if not...
        if i_episode >= 100:
            if average_score < 5.:
                print('\nAverage score is only {:.3f} after {} episodes, '\
                    'hence aborting this run!\n'.format(average_score, i_episode))
                break
        
        if i_episode % 100 == 0:
            torch.save(agent.actor_local.state_dict(), 'checkpoint_actor.pth')
            torch.save(agent.critic_local.state_dict(), 'checkpoint_critic.pth')
            print('\nEpisode: {}, Average Score: {:.2f}\n'
                  .format(i_episode, average_score))

        if np.mean(scores_deque) >= 30.0:
            print('\n\nEnvironment solved in {:d} episodes!\tAverage Score: {:.2f}'
                  .format(i_episode, average_score))
            torch.save(agent.actor_local.state_dict(), 'checkpoint_actor_solution.pth')
            torch.save(agent.critic_local.state_dict(), 'checkpoint_critic_solution.pth')
            break
            
    return scores, average_score, t_episode

We have several hyperparameters for training the DDPG algorithm.
Some of them are listed in the following cell:

In [9]:
N = 400  # Should be enough to solve this!
BUFFER_SIZE = [int(1e5), int(1e7)]
BATCH_SIZE = [64, 128, 256]
GAMMA = .99
TAU = 1e-3
LEARN_RATE = [1e-3, 1e-4]
WEIGHT_DECAY = 0.0
SEED = 2
MAX_T = [1000, 3000]
NEURONS = [128, 256]
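
Of the values above, `TAU` is perhaps the least self-explanatory. In DDPG it typically controls how quickly the target networks track the local networks through a soft update, `target = TAU * local + (1 - TAU) * target`. The sketch below applies that rule to plain lists of numbers rather than network weights. It assumes that `ddpg_agent.py` follows this standard rule, which is not shown in this notebook, and `soft_update` here is an illustrative stand-in.

```python
TAU = 1e-3  # same value as in the cell above

def soft_update(local_params, target_params, tau):
    """Blend local parameters into target parameters: tau*local + (1 - tau)*target."""
    return [tau * p_local + (1.0 - tau) * p_target
            for p_local, p_target in zip(local_params, target_params)]

# Made-up parameter values, for illustration only.
local = [1.0, -2.0]
target = [0.0, 0.0]
for _ in range(1000):
    target = soft_update(local, target, TAU)
print(target)  # still well short of the local values after 1000 updates
```

With such a small `TAU`, the target values move only a fraction of the way towards the local values per update, which is what keeps the learning targets stable.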

#### Hyperparameter grid search

Manual trial and error with a few hyperparameters unfortunately did not lead to a successful model.
Hence, in order to find the best hyperparameters, we define some possible values and perform a grid search over the hyperparameter space.
Note that the `ddpg()` function aborts a run if the average score has not reached at least `5.0` after `100` episodes!
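
Both the abort rule and the solving criterion in `ddpg()` depend on the rolling average over the last 100 episodes. A minimal sketch of that logic, using made-up scores and a hypothetical helper `check_progress` (not part of the notebook), could look like this:

```python
from collections import deque

def check_progress(scores_window, i_episode, min_episodes=100,
                   abort_below=5.0, solved_at=30.0):
    """Classify a run from its rolling average score.

    Returns 'solved', 'abort', or 'continue'. The thresholds mirror the
    values used in ddpg(); the function itself is illustrative only.
    """
    average = sum(scores_window) / len(scores_window)
    if average >= solved_at:
        return 'solved'
    if i_episode >= min_episodes and average < abort_below:
        return 'abort'
    return 'continue'

window = deque(maxlen=100)  # keeps only the most recent 100 scores
status = 'continue'
for i_episode in range(1, 151):
    window.append(1.0)      # made-up score: every episode scores 1.0
    status = check_progress(window, i_episode)
    if status != 'continue':
        break
print(i_episode, status)
```

Because the check only becomes active at episode 100, a run that learns slowly at first still gets a full window of episodes before it is discarded.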

First, let's create all combinations of hyperparameters.

In [10]:
from collections import namedtuple
from itertools import product, starmap
import pickle

def named_product(**items):
    Combination = namedtuple('Combination', items.keys())
    return starmap(Combination, product(*items.values()))

hyperparameter_combinations = []
for combination in named_product(buffer_size=BUFFER_SIZE,
                                 batch_size=BATCH_SIZE,
                                 learn_rate=LEARN_RATE,
                                 max_t=MAX_T,
                                 neurons=NEURONS):
    # print(combination)
    hyperparameter_combinations.append(combination)

len_hyperparameter_combinations = len(hyperparameter_combinations)
print('We will be testing {} combinations of hyperparameters!'
      .format(len_hyperparameter_combinations))

# Let's check a few of them
num_examples = 3
print('{} random Examples:'.format(num_examples))
for _ in range(num_examples):
    print(hyperparameter_combinations[random.randint(0, len_hyperparameter_combinations-1)])

We will be testing 48 combinations of hyperparameters!
3 random Examples:
Combination(buffer_size=100000, batch_size=128, learn_rate=0.001, max_t=3000, neurons=128)
Combination(buffer_size=100000, batch_size=64, learn_rate=0.0001, max_t=3000, neurons=128)
Combination(buffer_size=10000000, batch_size=128, learn_rate=0.001, max_t=1000, neurons=256)


In [11]:
run_grid_search = False

In [12]:
%%time
agent = None

scores_grid_search = []
missing_combinations = [12, 14, 15, 22, 23, 24, 25, 26, 27, 28, 38]

if run_grid_search:
    with active_session():
        for combination_id, hyperparameters in enumerate(hyperparameter_combinations, start=1):
            if combination_id not in missing_combinations:
                continue

            print('Testing following hyperparameters ({}/{}):\n{}\n'
                  .format(combination_id, len(hyperparameter_combinations), hyperparameters))

            # Delete any existing agent
            if agent is not None:
                del agent

            # Initialize agent
            agent = Agent(state_size,
                          action_size,
                          n_agents,
                          buffer_size=hyperparameters.buffer_size,
                          batch_size=hyperparameters.batch_size,
                          gamma=GAMMA,
                          tau=TAU,
                          lr_actor=hyperparameters.learn_rate,
                          lr_critic=hyperparameters.learn_rate,
                          weight_decay=WEIGHT_DECAY,
                          neurons=hyperparameters.neurons,
                          random_seed=SEED)

            # Run training
            scores, average_score, t_episode = ddpg(env, brain_name, agent, n_agents,
                                                    n_episodes=N, max_t=hyperparameters.max_t)

            # Save results
            print('Saving results...')
            scores_grid_search.append((hyperparameters, t_episode, scores, average_score))
            # TODO Write directly to CSV file...

CPU times: user 5 µs, sys: 1 µs, total: 6 µs
Wall time: 9.54 µs
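
The `TODO` in the grid search loop suggests writing the results directly to a CSV file. A minimal sketch using only the standard library is shown below. The helper `append_result`, the file name `grid_search_example.csv`, and the example values are hypothetical; the column names follow the named columns of `grid_search.csv`.

```python
import csv
import os

# Column names follow the named columns of grid_search.csv.
FIELDS = ['id', 'buffer_size', 'batch_size', 'learn_rate', 'max_t',
          'neurons', 'episodes', 'avg_score', 'time']

def append_result(path, row):
    """Append one result row, writing the header if the file is new or empty."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow(row)

# Example with made-up values (not a real training result).
example = {'id': 1, 'buffer_size': 100000, 'batch_size': 64,
           'learn_rate': 0.001, 'max_t': 1000, 'neurons': 128,
           'episodes': 100, 'avg_score': 0.0, 'time': '0:00:15'}
append_result('grid_search_example.csv', example)
```

Appending one row per finished run means that partial results survive an interrupted session, which the in-memory list `scores_grid_search` does not.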


The output of the grid search cell above has been manually copied into the file `screen_scrape.txt` and then processed by `create_csv.py`.
The resulting CSV file is imported in the following cell.

In [13]:
import pandas as pd
grid_search = pd.read_csv("grid_search.csv")
grid_search['time'] = pd.to_datetime(grid_search['time']).dt.time
grid_search.sort_values(by='avg_score', ascending=False)

| Unnamed: 0 | id | buffer_size | batch_size | learn_rate | max_t | neurons | episodes | avg_score | time |
|---|---|---|---|---|---|---|---|---|---|
| 45 | 27 | 10000000 | 64 | 0.001 | 3000 | 128 | 100 | 1.37 | 00:00:14.567689 |
| 29 | 41 | 10000000 | 256 | 0.001 | 1000 | 128 | 100 | 1.29 | 00:00:21.143977 |
| 26 | 37 | 10000000 | 128 | 0.0001 | 1000 | 128 | 100 | 1.28 | 00:00:17.895239 |
| 16 | 20 | 100000 | 256 | 0.001 | 3000 | 256 | 100 | 1.27 | 00:00:20.971953 |
| 37 | 12 | 100000 | 128 | 0.001 | 3000 | 256 | 100 | 1.2 | 00:00:16.543660 |
| 22 | 33 | 10000000 | 128 | 0.001 | 1000 | 128 | 100 | 1.19 | 00:00:17.244626 |
| 46 | 28 | 10000000 | 64 | 0.001 | 3000 | 256 | 100 | 1.18 | 00:00:14.360792 |
| 44 | 26 | 10000000 | 64 | 0.001 | 1000 | 256 | 100 | 1.18 | 00:00:15.332227 |
| 32 | 44 | 10000000 | 256 | 0.001 | 3000 | 256 | 100 | 1.16 | 00:00:19.806695 |
| 30 | 42 | 10000000 | 256 | 0.001 | 1000 | 256 | 100 | 1.11 | 00:00:19.747336 |


As can be seen in the table above, none of the hyperparameter combinations results in an average score of more than `1.37` after the first `100` episodes.
Several of the training runs had already been aborted after `50` episodes, since the score did not reach `1.0` by then.

Training is expected to reach a score of at least `5.0` after `100` episodes, as can be seen in other students' submissions and as discussed on the Udacity Nanodegree Slack channel.
This holds even for the single-agent environment.
Hence, an error in the current implementation of the training algorithm is suspected.

In [None]:
def plot_scores(scores):
    fig = plt.figure()
    ax = fig.add_subplot(111)
    plt.plot(np.arange(len(scores)), scores)
    plt.ylabel('Score')
    plt.xlabel('Episode #')
    plt.show()

plot_scores(scores)

When finished, we can close the environment.

In [6]:
env.close()