# Continuous Control

---

You are welcome to use this coding environment to train your agent for the project.  Follow the instructions below to get started!

### 1. Start the Environment

Run the next code cell to install a few packages.  This line will take a few minutes to run!

In [1]:
!pip -q install ./python

[31mtensorflow 1.7.1 has requirement numpy>=1.13.3, but you'll have numpy 1.12.1 which is incompatible.[0m
[31mipython 6.5.0 has requirement prompt-toolkit<2.0.0,>=1.0.15, but you'll have prompt-toolkit 3.0.30 which is incompatible.[0m
[31mjupyter-console 6.4.3 has requirement jupyter-client>=7.0.0, but you'll have jupyter-client 5.2.4 which is incompatible.[0m


The environments corresponding to both versions of the environment are already saved in the Workspace and can be accessed at the file paths provided below.  

Please select one of the two options below for loading the environment.

In [2]:
from unityagents import UnityEnvironment
import numpy as np

# select this option to load version 1 (with a single agent) of the environment
env = UnityEnvironment(file_name='/data/Reacher_One_Linux_NoVis/Reacher_One_Linux_NoVis.x86_64')

# select this option to load version 2 (with 20 agents) of the environment
# env = UnityEnvironment(file_name='/data/Reacher_Linux_NoVis/Reacher.x86_64')

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		goal_size -> 5.0
		goal_speed -> 1.0
Unity brain name: ReacherBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 33
        Number of stacked Vector Observation: 1
        Vector Action space type: continuous
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [3]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

In [4]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

import numpy as np
import random
import copy
from collections import deque, namedtuple

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

In [5]:
# Models
def hidden_init(layer):
    fan_in = layer.weight.data.size()[0]
    lim = 1. / np.sqrt(fan_in)
    return (-lim, lim)

class Critic(nn.Module):
    def __init__(self):
        super(Critic, self).__init__()
        self.fc1 = nn.Linear(33, 128)
        self.fc2 = nn.Linear(132, 128)
        self.fc3 = nn.Linear(128, 1)
        self.bn1 = nn.BatchNorm1d(128) # ADDED
        self.reset_parameters()

    def reset_parameters(self):
        self.fc1.weight.data.uniform_(*hidden_init(self.fc1))
        self.fc2.weight.data.uniform_(*hidden_init(self.fc2))
        self.fc3.weight.data.uniform_(-3e-3, 3e-3)
        
    def forward(self, states, actions):
        x = self.bn1(F.relu(self.fc1(states.to(device))))
        actions = torch.tensor(actions, dtype = torch.float)
        x = torch.cat((x, actions.to(device)), -1)
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
    
class Actor(nn.Module):
    def __init__(self):
        super(Actor, self).__init__()
        self.fc1 = nn.Linear(33, 128)
        self.fc2 = nn.Linear(128, 128)
        self.fc3 = nn.Linear(128, 4)
        self.bn1 = nn.BatchNorm1d(128) # ADDED
        self.reset_parameters()

    def reset_parameters(self):
        self.fc1.weight.data.uniform_(*hidden_init(self.fc1))
        self.fc2.weight.data.uniform_(*hidden_init(self.fc2))
        self.fc3.weight.data.uniform_(-3e-3, 3e-3)
        
    def forward(self, states):
        x = torch.tensor(states, dtype = torch.float).to(device)
        if x.dim() == 1: x = torch.unsqueeze(x,0) # ADDED
        x = self.bn1(F.relu(self.fc1(x)))
        x = F.relu(self.fc2(x))
        x = F.tanh(self.fc3(x))
        return x

In [6]:
class OUNoise:
    def __init__(self, size, mu=0., theta=0.15, sigma=0.1):
        self.mu = mu * np.ones(size)
        self.theta = theta
        self.sigma = sigma
        self.reset()

    def reset(self):
        self.state = copy.copy(self.mu)

    def sample(self):
        x = self.state
        dx = self.theta * (self.mu - x) + self.sigma * np.random.randn(len(x)) # ADDED
        self.state = x + dx
        return self.state

In [7]:
# Agent
class Agent(nn.Module):
    def __init__(self, buffer_size, batch_size):
        super(Agent, self).__init__()
        self.buffer_size = buffer_size
        self.batch_size = batch_size
        self.experience = namedtuple("Experience", field_names=["state", "action", "reward", "done", "next_state"])
        self.memory = deque(maxlen=self.buffer_size)
        self.dqn = Critic().to(device)
        self.dqn_target = Critic().to(device)
        for target_param, local_param in zip(self.dqn_target.parameters(), self.dqn.parameters()):
            target_param.data.copy_(local_param)
        self.policy = Actor().to(device)
        self.policy_target = Actor().to(device)
        for target_param, local_param in zip(self.policy_target.parameters(), self.policy.parameters()):
            target_param.data.copy_(local_param)
        self.optimizer_dqn = optim.Adam(self.dqn.parameters(), lr = 2e-4, weight_decay = 0)
        self.optimizer_policy = optim.Adam(self.policy.parameters(), lr = 2e-4)
        self.t_step = 0
        self.noise = OUNoise(4)
    
    def act(self, state, add_noise=True):
        state = torch.from_numpy(state).float().to(device)
        self.policy.eval()
        with torch.no_grad():
            action = self.policy(state).cpu().data.numpy()
        self.policy.train()
        action += self.noise.sample()
        return np.clip(action, -1, 1)
    
    def step(self, state, action, reward, done, next_state):
        e = self.experience(state, action, reward, done, next_state)
        self.memory.append(e) 
        
        self.t_step = (self.t_step + 1)
        if self.t_step % 1 == 0:
            if len(self.memory)>=self.batch_size:
                experiences = self.sample()
                states, actions, rewards, dones, next_states = experiences
                
                Q = self.dqn_learn(states, actions, rewards, dones, next_states)
                self.optimizer_dqn.zero_grad()
                Q.backward()
                torch.nn.utils.clip_grad_norm_(self.dqn.parameters(), 1)
                self.optimizer_dqn.step()
                for target_param, local_param in zip(self.dqn_target.parameters(), self.dqn.parameters()):
                    target_param.data.copy_(0.001*local_param.data + (1.0-0.001)*target_param.data)
                
                P = self.policy_learn(states)
                self.optimizer_policy.zero_grad()
                P.backward()
                self.optimizer_policy.step()
                for target_param, local_param in zip(self.policy_target.parameters(), self.policy.parameters()):
                    target_param.data.copy_(0.001*local_param.data + (1.0-0.001)*target_param.data)
                return Q.cpu().data.numpy(), P.cpu().data.numpy()

    def sample(self):
        experiences = random.sample(self.memory, k=self.batch_size)
        states = torch.from_numpy(np.vstack([e.state for e in experiences if e is not None])).float().to(device)
        actions = torch.from_numpy(np.vstack([e.action for e in experiences if e is not None])).long().to(device)
        rewards = torch.from_numpy(np.vstack([e.reward for e in experiences if e is not None])).float().to(device)
        next_states = torch.from_numpy(np.vstack([e.next_state for e in experiences if e is not None])).float().to(device)
        dones = torch.from_numpy(np.vstack([e.done for e in experiences if e is not None]).astype(np.uint8)).float().to(device)
        return (states, actions, rewards, dones, next_states)
    
    def dqn_learn(self, states, actions, rewards, dones, next_states, discount = 0.99):
        next_actions = self.policy_target(next_states)
        dqn_future_reward = rewards + (discount * self.dqn_target(next_states, next_actions) * (1 - dones))
        dqn_expected_reward = self.dqn(states, actions)
        return F.mse_loss(dqn_future_reward, dqn_expected_reward)
    
    def policy_learn(self, states):
        states = torch.tensor(states, dtype=torch.float)
        actions = self.policy(states)
        rewards = self.dqn(states, actions)
        return -torch.mean(rewards)
    
    def reset(self):
        self.noise.reset()
agent = Agent(int(1e5), 128)

In [8]:
from workspace_utils import active_session
 
with active_session():
    def training(n_episodes=500, max_t=1000):
        scores = []                        # list containing scores from each episode
        scores_window = deque(maxlen=100)  # last 100 scores
        q_scores_window = deque(maxlen=100)  # last 100 scores
        policy_scores_window = deque(maxlen=100)  # last 100 scores
        for i_episode in range(1, n_episodes+1):
            env_info = env.reset(train_mode = True)[brain_name]
            agent.reset()
            state = env_info.vector_observations[0]
            score = 0
            q_score = 0
            policy_score = 0
            for t in range(max_t):
                action = agent.act(state)
                env_info = env.step(action)[brain_name]           
                next_state, reward, done = env_info.vector_observations[0], env_info.rewards[0], env_info.local_done[0]                        
                score += env_info.rewards[0]                      
                q_learn = agent.step(state, action, reward, done, next_state)
                if q_learn is not None:
                    q, p = q_learn
                    q_score += q
                    policy_score += p
                state = next_state                                # roll over states to next time step
                if done:                                  # exit loop if episode finished
                    break
            scores_window.append(score)       
            q_scores_window.append(q_score)       
            policy_scores_window.append(policy_score)       
            scores.append(score)              
            print('\rEpisode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_window, axis = 0)))
            print('\rEpisode {}\tAverage Score Q: {:.2f}'.format(i_episode, np.mean(q_scores_window, axis = 0)))
            print('\rEpisode {}\tAverage Score Policy: {:.2f}'.format(i_episode, np.mean(policy_scores_window, axis = 0)))
            if i_episode % 100 == 0:
                print('\rEpisode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_window)))
            if np.mean(scores_window)>=30.0:
                print('\nEnvironment solved in {:d} episodes!\tAverage Score: {:.2f}'.format(i_episode-100, np.mean(scores_window)))
                torch.save(agent.dqn.state_dict(), 'checkpoint.pth')
                torch.save(agent.policy, 'policy.policy')
                break
        return scores

    scores = training()

Episode 1	Average Score: 1.58
Episode 1	Average Score Q: 0.00
Episode 1	Average Score Policy: -20.84
Episode 2	Average Score: 0.90
Episode 2	Average Score Q: 0.00
Episode 2	Average Score Policy: -24.42
Episode 3	Average Score: 0.90
Episode 3	Average Score Q: 0.00
Episode 3	Average Score Policy: -26.69
Episode 4	Average Score: 0.79
Episode 4	Average Score Q: 0.01
Episode 4	Average Score Policy: -28.70
Episode 5	Average Score: 0.75
Episode 5	Average Score Q: 0.01
Episode 5	Average Score Policy: -30.52
Episode 6	Average Score: 0.70
Episode 6	Average Score Q: 0.01
Episode 6	Average Score Policy: -32.17
Episode 7	Average Score: 0.65
Episode 7	Average Score Q: 0.01
Episode 7	Average Score Policy: -33.78
Episode 8	Average Score: 0.74
Episode 8	Average Score Q: 0.01
Episode 8	Average Score Policy: -35.32
Episode 9	Average Score: 0.84
Episode 9	Average Score Q: 0.01
Episode 9	Average Score Policy: -36.84
Episode 10	Average Score: 1.01
Episode 10	Average Score Q: 0.01
Episode 10	Average Score Po

Episode 80	Average Score: 2.54
Episode 80	Average Score Q: 0.26
Episode 80	Average Score Policy: -186.83
Episode 81	Average Score: 2.58
Episode 81	Average Score Q: 0.26
Episode 81	Average Score Policy: -189.19
Episode 82	Average Score: 2.59
Episode 82	Average Score Q: 0.27
Episode 82	Average Score Policy: -191.56
Episode 83	Average Score: 2.61
Episode 83	Average Score Q: 0.27
Episode 83	Average Score Policy: -193.91
Episode 84	Average Score: 2.62
Episode 84	Average Score Q: 0.27
Episode 84	Average Score Policy: -196.27
Episode 85	Average Score: 2.65
Episode 85	Average Score Q: 0.27
Episode 85	Average Score Policy: -198.63
Episode 86	Average Score: 2.69
Episode 86	Average Score Q: 0.28
Episode 86	Average Score Policy: -200.99
Episode 87	Average Score: 2.72
Episode 87	Average Score Q: 0.28
Episode 87	Average Score Policy: -203.34
Episode 88	Average Score: 2.77
Episode 88	Average Score Q: 0.28
Episode 88	Average Score Policy: -205.69
Episode 89	Average Score: 2.79
Episode 89	Average Score

Episode 157	Average Score: 6.31
Episode 157	Average Score Q: 0.78
Episode 157	Average Score Policy: -510.51
Episode 158	Average Score: 6.37
Episode 158	Average Score Q: 0.79
Episode 158	Average Score Policy: -515.78
Episode 159	Average Score: 6.41
Episode 159	Average Score Q: 0.80
Episode 159	Average Score Policy: -521.08
Episode 160	Average Score: 6.63
Episode 160	Average Score Q: 0.81
Episode 160	Average Score Policy: -526.40
Episode 161	Average Score: 6.78
Episode 161	Average Score Q: 0.83
Episode 161	Average Score Policy: -531.77
Episode 162	Average Score: 6.88
Episode 162	Average Score Q: 0.84
Episode 162	Average Score Policy: -537.17
Episode 163	Average Score: 6.89
Episode 163	Average Score Q: 0.85
Episode 163	Average Score Policy: -542.60
Episode 164	Average Score: 6.95
Episode 164	Average Score Q: 0.86
Episode 164	Average Score Policy: -548.07
Episode 165	Average Score: 7.06
Episode 165	Average Score Q: 0.87
Episode 165	Average Score Policy: -553.59
Episode 166	Average Score: 7

Episode 233	Average Score: 17.54
Episode 233	Average Score Q: 2.33
Episode 233	Average Score Policy: -1113.88
Episode 234	Average Score: 17.85
Episode 234	Average Score Q: 2.36
Episode 234	Average Score Policy: -1125.83
Episode 235	Average Score: 18.10
Episode 235	Average Score Q: 2.38
Episode 235	Average Score Policy: -1137.93
Episode 236	Average Score: 18.36
Episode 236	Average Score Q: 2.41
Episode 236	Average Score Policy: -1150.17
Episode 237	Average Score: 18.46
Episode 237	Average Score Q: 2.43
Episode 237	Average Score Policy: -1162.52
Episode 238	Average Score: 18.65
Episode 238	Average Score Q: 2.45
Episode 238	Average Score Policy: -1175.00
Episode 239	Average Score: 18.79
Episode 239	Average Score Q: 2.48
Episode 239	Average Score Policy: -1187.60
Episode 240	Average Score: 19.05
Episode 240	Average Score Q: 2.50
Episode 240	Average Score Policy: -1200.34
Episode 241	Average Score: 19.16
Episode 241	Average Score Q: 2.52
Episode 241	Average Score Policy: -1213.20
Episode 24

Episode 308	Average Score: 28.21
Episode 308	Average Score Q: 3.26
Episode 308	Average Score Policy: -2274.43
Episode 309	Average Score: 28.21
Episode 309	Average Score Q: 3.25
Episode 309	Average Score Policy: -2291.75
Episode 310	Average Score: 28.46
Episode 310	Average Score Q: 3.25
Episode 310	Average Score Policy: -2309.04
Episode 311	Average Score: 28.45
Episode 311	Average Score Q: 3.25
Episode 311	Average Score Policy: -2326.33
Episode 312	Average Score: 28.75
Episode 312	Average Score Q: 3.24
Episode 312	Average Score Policy: -2343.61
Episode 313	Average Score: 28.99
Episode 313	Average Score Q: 3.24
Episode 313	Average Score Policy: -2360.88
Episode 314	Average Score: 29.00
Episode 314	Average Score Q: 3.24
Episode 314	Average Score Policy: -2378.16
Episode 315	Average Score: 28.94
Episode 315	Average Score Q: 3.23
Episode 315	Average Score Policy: -2395.41
Episode 316	Average Score: 28.66
Episode 316	Average Score Q: 3.23
Episode 316	Average Score Policy: -2412.58
Episode 31

  "type " + obj.__name__ + ". It won't be checked "
