# Continuous Control

---

Congratulations on completing the second project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893) program!  In this notebook, you will learn how to control an agent in a more challenging environment, where the goal is to train a creature with four arms to walk forward.  **Note that this exercise is optional!**

### 1. Start the Environment

We begin by importing the necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [1]:
from unityagents import UnityEnvironment
import numpy as np

env = UnityEnvironment(file_name='Crawler_Linux_NoVis/Crawler.x86_64')

# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: CrawlerBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 129
        Number of stacked Vector Observation: 1
        Vector Action space type: continuous
        Vector Action space size (per agent): 20
        Vector Action descriptions: , , , , , , , , , , , , , , , , , , , 


Number of agents: 12
Size of each action: 20
There are 12 agents. Each observes a state with length: 129
The state for the first agent looks like: [ 0.00000000e+00  0.00000000e+00  0.00000000e+00  2.25000000e+00
  1.00000000e+00  0.00000000e+00  1.78813934e-07  0.00000000e+00
  1.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  6.06093168e-01 -1.42857209e-01 -6.06078804e-01  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  1.33339906e+00 -1.42857209e-01
 -1.33341408e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
 -6.0609
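
A batch of actions for this environment is an array with one row per agent and one column per action dimension. The sketch below builds such a batch from random numbers using only NumPy, so it runs without the Unity environment. The random generator and its seed are assumptions made for illustration; the sizes come from the printout above, and the clipping mirrors the `np.clip` call used later in `CrawlerTask.step`.

```python
import numpy as np

# Sizes taken from the environment printout above.
num_agents = 12
action_size = 20

# Seeded only so that the sketch is reproducible (an assumption, not part of the notebook).
rng = np.random.default_rng(0)

# One row of actions per agent, drawn from a standard normal distribution.
actions = rng.standard_normal((num_agents, action_size))

# Keep every entry in [-1, 1], as CrawlerTask.step does before calling env.step.
actions = np.clip(actions, -1, 1)

print(actions.shape)  # (12, 20)
```

An array of this shape is what `env.step` receives in the training code further down.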

In [7]:
#######################################################################
# Copyright (C) 2017 Shangtong Zhang(zhangshangtong.cpp@gmail.com)    #
# Permission given to modify the code as long as you keep this        #
# declaration at the top                                              #
#######################################################################

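from deep_rl import *

import torch
import numpy as np
from deep_rl.utils import *
import torch.multiprocessing as mp
from collections import deque
from skimage.io import imsave
from deep_rl.network import *
from deep_rl.component import *

# explicit imports for names used below (pickle, time, nn, F)
import pickle
import time
import torch.nn as nn
import torch.nn.functional as F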


class BaseAgent:
    def __init__(self, config):
        self.config = config
        self.logger = get_logger(tag=config.tag, log_level=config.log_level)
        self.task_ind = 0
        self.episode_rewards = []
        self.rewards = None
        self.episodic_return = None

    def close(self):
        close_obj(self.task)

    def save(self, filename):
        torch.save(self.network.state_dict(), '%s.model' % (filename))
        with open('%s.stats' % (filename), 'wb') as f:
            pickle.dump(self.config.state_normalizer.state_dict(), f)

    def load(self, filename):
        state_dict = torch.load('%s.model' % filename, map_location=lambda storage, loc: storage)
        self.network.load_state_dict(state_dict)
        with open('%s.stats' % (filename), 'rb') as f:
            self.config.state_normalizer.load_state_dict(pickle.load(f))

    def eval_step(self, state):
        raise NotImplementedError

    def eval_episode(self):
        env = self.config.eval_env
        state = env.reset()
        while True:
            action = self.eval_step(state)
            state, reward, done, info = env.step(action)
            ret = info[0]['episodic_return']
            if ret is not None:
                break
        return ret

    def eval_episodes(self):
        episodic_returns = []
        for ep in range(self.config.eval_episodes):
            total_rewards = self.eval_episode()
            episodic_returns.append(np.sum(total_rewards))
        self.episode_rewards = episodic_returns
        self.logger.info('steps %d, episodic_return_test %.2f(%.2f)' % (
            self.total_steps, np.mean(episodic_returns), np.std(episodic_returns) / np.sqrt(len(episodic_returns))
        ))
        self.logger.add_scalar('episodic_return_test', np.mean(episodic_returns), self.total_steps)
        return {
            'episodic_return_test': np.mean(episodic_returns),
        }

    def record_online_return(self, info, offset=0):
        if isinstance(info, dict):
            ret = info['episodic_return']
            self.rewards = info['all_rewards']
            if(self.rewards is not None):
                episode = len(self.rewards)
            if ret is not None:
                self.episodic_return = ret
#                 self.logger.add_scalar('episodic_return_train', ret, self.total_steps + offset)
#                 self.logger.info('Episode %d, steps %d, episodic_return_train %s' % (episode,self.total_steps + offset, ret))
        elif isinstance(info, tuple):
            for i, info_ in enumerate(info):
                self.record_online_return(info_, i)
        else:
            raise NotImplementedError

    def switch_task(self):
        config = self.config
        if not config.tasks:
            return
        segs = np.linspace(0, config.max_steps, len(config.tasks) + 1)
        if self.total_steps > segs[self.task_ind + 1]:
            self.task_ind += 1
            self.task = config.tasks[self.task_ind]
            self.states = self.task.reset()
            self.states = config.state_normalizer(self.states)

    def record_episode(self, dir, env):
        mkdir(dir)
        steps = 0
        state = env.reset()
        while True:
            self.record_obs(env, dir, steps)
            action = self.record_step(state)
            state, reward, done, info = env.step(action)
            ret = info[0]['episodic_return']
            steps += 1
            if ret is not None:
                break

    def record_step(self, state):
        raise NotImplementedError

    # For DMControl
    def record_obs(self, env, dir, steps):
        env = env.env.envs[0]
        obs = env.render(mode='rgb_array')
        imsave('%s/%04d.png' % (dir, steps), obs)

class PPOAgent(BaseAgent):
    def __init__(self, config):
        BaseAgent.__init__(self, config)
        self.config = config
        self.task = config.task_fn()
        self.network = config.network_fn()
        self.opt = config.optimizer_fn(self.network.parameters())
        self.total_steps = 0
        self.states = self.task.reset()
        self.states = config.state_normalizer(self.states)

    def step(self):
        config = self.config
        storage = Storage(config.rollout_length)
        states = self.states
        for _ in range(config.rollout_length):
            prediction = self.network(states)
            next_states, rewards, terminals, info = self.task.step(to_np(prediction['a']))
            self.record_online_return(info)
            rewards = config.reward_normalizer(rewards)
            next_states = config.state_normalizer(next_states)
            storage.add(prediction)
            storage.add({'r': tensor(rewards).unsqueeze(-1),
                         'm': tensor(1 - terminals).unsqueeze(-1),
                         's': tensor(states)})
            states = next_states
            self.total_steps += config.num_workers

        self.states = states
        prediction = self.network(states)
        storage.add(prediction)
        storage.placeholder()

        # walk the rollout backwards, accumulating discounted returns and advantages
        advantages = tensor(np.zeros((config.num_workers, 1)))
        returns = prediction['v'].detach()  # bootstrap from the value of the final state
        for i in reversed(range(config.rollout_length)):
            returns = storage.r[i] + config.discount * storage.m[i] * returns
            if not config.use_gae:
                advantages = returns - storage.v[i].detach()
            else:
                # generalized advantage estimation: decayed sum of one-step TD errors
                td_error = storage.r[i] + config.discount * storage.m[i] * storage.v[i + 1] - storage.v[i]
                advantages = advantages * config.gae_tau * config.discount * storage.m[i] + td_error
            storage.adv[i] = advantages.detach()
            storage.ret[i] = returns.detach()

        states, actions, log_probs_old, returns, advantages = storage.cat(['s', 'a', 'log_pi_a', 'ret', 'adv'])
        actions = actions.detach()
        log_probs_old = log_probs_old.detach()
        advantages = (advantages - advantages.mean()) / advantages.std()

        for _ in range(config.optimization_epochs):
            sampler = random_sample(np.arange(states.size(0)), config.mini_batch_size)
            for batch_indices in sampler:
                batch_indices = tensor(batch_indices).long()
                sampled_states = states[batch_indices]
                sampled_actions = actions[batch_indices]
                sampled_log_probs_old = log_probs_old[batch_indices]
                sampled_returns = returns[batch_indices]
                sampled_advantages = advantages[batch_indices]

                prediction = self.network(sampled_states, sampled_actions)
                # probability ratio between the current policy and the rollout policy
                ratio = (prediction['log_pi_a'] - sampled_log_probs_old).exp()
                obj = ratio * sampled_advantages
                # clipped surrogate: limit how far the ratio can move from 1
                obj_clipped = ratio.clamp(1.0 - self.config.ppo_ratio_clip,
                                          1.0 + self.config.ppo_ratio_clip) * sampled_advantages
                policy_loss = -torch.min(obj, obj_clipped).mean() - config.entropy_weight * prediction['ent'].mean()

                # mean squared error between the critic's estimate and the computed returns
                value_loss = 0.5 * (sampled_returns - prediction['v']).pow(2).mean()

                self.opt.zero_grad()
                (policy_loss + value_loss).backward()
                nn.utils.clip_grad_norm_(self.network.parameters(), config.gradient_clip)
                self.opt.step()
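
The two core computations in `PPOAgent.step`, the backward advantage recursion and the clipped surrogate loss, can be examined in isolation. The following is a minimal NumPy sketch of both, written under a few assumptions: the function names `gae_advantages` and `clipped_surrogate` are hypothetical, the inputs are plain arrays rather than the `Storage` object, and the entropy bonus and value loss are left out. The default `discount`, `tau`, and `clip` values match the configuration used in this notebook.

```python
import numpy as np


def gae_advantages(rewards, values, masks, discount=0.99, tau=0.99):
    """Generalized advantage estimation over one rollout.

    rewards, masks: arrays of shape (T, N).
    values: array of shape (T + 1, N); the last row is the bootstrap value
    of the state reached after the rollout.
    masks[t] is 0 where the episode ended at step t, and 1 otherwise.
    """
    T = rewards.shape[0]
    advantages = np.zeros_like(rewards)
    running = np.zeros(rewards.shape[1])
    for t in reversed(range(T)):
        td_error = rewards[t] + discount * masks[t] * values[t + 1] - values[t]
        running = running * tau * discount * masks[t] + td_error
        advantages[t] = running
    return advantages


def clipped_surrogate(log_probs_new, log_probs_old, advantages, clip=0.2):
    """PPO clipped objective, returned as a loss to be minimized."""
    ratio = np.exp(log_probs_new - log_probs_old)
    obj = ratio * advantages
    obj_clipped = np.clip(ratio, 1.0 - clip, 1.0 + clip) * advantages
    return -np.minimum(obj, obj_clipped).mean()


# Tiny worked example: 3 steps, 1 worker, no terminal states, zero value estimates.
rewards = np.array([[1.0], [1.0], [1.0]])
values = np.zeros((4, 1))
masks = np.ones((3, 1))
adv = gae_advantages(rewards, values, masks, discount=1.0, tau=1.0)
print(adv.ravel())  # [3. 2. 1.]
```

With `discount = tau = 1` and zero value estimates, each advantage is simply the sum of the remaining rewards, which makes the recursion easy to check by hand.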

In [8]:
def run_steps_custom(agent):
    config = agent.config
    agent_name = agent.__class__.__name__
    t0 = time.time()
    rewards_deque = deque(maxlen=100)
    rewards_all = []
    while True:
        rewards = agent.episodic_return
        if rewards is not None:
            rewards_deque.append(np.mean(rewards))
            rewards_all.append(np.mean(rewards))
        if config.log_interval and not agent.total_steps % config.log_interval and (rewards is not None):
            agent.logger.info('Episode %d,last %d episodes, mean rewards  %.2f,  steps %d, %.2f steps/s' % (len(rewards_all),len(rewards_deque),np.mean(rewards_deque),agent.total_steps, config.log_interval / (time.time() - t0)))
            t0 = time.time()
#         if config.max_steps and agent.total_steps >= config.max_steps:
#             agent.close()
#             return True,rewards_deque,rewards_all
        if (rewards is not None) and np.mean(rewards_deque) > 2000:
            agent.save('./data/model-%s.bin' % (agent_name))
            agent.close()
            return True,rewards_deque,rewards_all
        # checkpoint every 200 recorded episodes
        if rewards_all and len(rewards_all) % 200 == 0:
            agent.save('./data/model-%s.bin' % (agent_name))


        agent.step()
        agent.switch_task()

class CrawlerTask():
    def __init__(self):
#         BaseTask.__init__(self)
        self.name = 'Crawler'
        self.env = env
        self.action_dim = brain.vector_action_space_size
        self.state_dim = brain.vector_observation_space_size
        self.info = {"all_rewards":None}
        self.total_rewards = np.zeros(num_agents)  # per-agent return for the current episode
        self.rewards = []

    def reset(self):
        env_info = self.env.reset(train_mode=True)[brain_name]
        return np.array(env_info.vector_observations)

    def step(self, action):
        action = np.clip(action, -1, 1)
        env_info = self.env.step(action)[brain_name]
        next_state = env_info.vector_observations   # next state
        reward = env_info.rewards                   # reward
        done = env_info.local_done

        self.total_rewards += reward

        if np.any(done):
            # replace NaN returns with -5 before reporting them
            if any(np.isnan(self.total_rewards.reshape(-1))):
                self.total_rewards[np.isnan(self.total_rewards)] = -5
            self.info['episodic_return'] = self.total_rewards
            self.rewards.append(self.total_rewards)
            self.info['all_rewards'] = self.rewards
            self.total_rewards = np.zeros(num_agents)
            next_state = self.reset()
        else:
            self.info['episodic_return'] = None

        return np.array(next_state), np.array(reward), np.array(done), self.info

    def seed(self, random_seed):
        return 10

def ppo_continuous():
    config = Config()
    config.num_workers = num_agents
    task_fn = lambda : CrawlerTask()
    config.task_fn = task_fn
    config.eval_env = task_fn()

    config.network_fn = lambda: GaussianActorCriticNet(
        config.state_dim, config.action_dim,
        actor_body=FCBody(config.state_dim, hidden_units=(128, 128), gate=F.leaky_relu),
        critic_body=FCBody(config.state_dim, hidden_units=(128, 128), gate=F.leaky_relu))
    config.optimizer_fn = lambda params: torch.optim.Adam(params, 3e-4, eps=1e-5)
    config.discount = 0.99
    config.use_gae = True
    config.gae_tau = 0.99
    config.gradient_clip = 5
    config.rollout_length = 64
    config.optimization_epochs = 4
    config.mini_batch_size = 64
    config.ppo_ratio_clip = 0.2
    config.log_interval = 4096
    config.max_steps = 1e4
    config.state_normalizer = MeanStdNormalizer()
    agent = PPOAgent(config)
#     agent.load('data/model-PPOAgent.bin')
    return run_steps_custom(agent)

success, rewards_deque, rewards_all = ppo_continuous()

INFO:root:Episode 16,last 16 episodes, mean rewards  0.96,  steps 12288, 802.09 steps/s
INFO:root:Episode 32,last 32 episodes, mean rewards  2.15,  steps 24576, 800.68 steps/s
INFO:root:Episode 48,last 48 episodes, mean rewards  3.22,  steps 36864, 801.93 steps/s
INFO:root:Episode 64,last 64 episodes, mean rewards  3.76,  steps 49152, 786.38 steps/s
INFO:root:Episode 80,last 80 episodes, mean rewards  4.57,  steps 61440, 801.91 steps/s
INFO:root:Episode 96,last 96 episodes, mean rewards  5.36,  steps 73728, 806.23 steps/s
INFO:root:Episode 112,last 100 episodes, mean rewards  6.43,  steps 86016, 800.45 steps/s
INFO:root:Episode 128,last 100 episodes, mean rewards  7.08,  steps 98304, 797.35 steps/s
INFO:root:Episode 144,last 100 episodes, mean rewards  7.85,  steps 110592, 810.01 steps/s
INFO:root:Episode 160,last 100 episodes, mean rewards  8.66,  steps 122880, 811.91 steps/s
INFO:root:Episode 176,last 100 episodes, mean rewards  9.19,  steps 135168, 791.20 steps/s
INFO:root:Episode 1

INFO:root:Episode 1456,last 100 episodes, mean rewards  20.53,  steps 1118208, 797.19 steps/s
INFO:root:Episode 1472,last 100 episodes, mean rewards  21.32,  steps 1130496, 801.90 steps/s
INFO:root:Episode 1488,last 100 episodes, mean rewards  21.61,  steps 1142784, 799.71 steps/s
INFO:root:Episode 1504,last 100 episodes, mean rewards  21.41,  steps 1155072, 798.24 steps/s
INFO:root:Episode 1520,last 100 episodes, mean rewards  21.25,  steps 1167360, 796.78 steps/s
INFO:root:Episode 1536,last 100 episodes, mean rewards  21.02,  steps 1179648, 793.81 steps/s
INFO:root:Episode 1552,last 100 episodes, mean rewards  21.35,  steps 1191936, 810.63 steps/s
INFO:root:Episode 1568,last 100 episodes, mean rewards  21.56,  steps 1204224, 788.46 steps/s
INFO:root:Episode 1584,last 100 episodes, mean rewards  21.72,  steps 1216512, 786.75 steps/s
INFO:root:Episode 1600,last 100 episodes, mean rewards  22.10,  steps 1228800, 783.73 steps/s
INFO:root:Episode 1616,last 100 episodes, mean rewards  21.9

INFO:root:Episode 2864,last 100 episodes, mean rewards  24.22,  steps 2199552, 792.04 steps/s
INFO:root:Episode 2880,last 100 episodes, mean rewards  24.07,  steps 2211840, 794.57 steps/s
INFO:root:Episode 2896,last 100 episodes, mean rewards  24.22,  steps 2224128, 793.32 steps/s
INFO:root:Episode 2912,last 100 episodes, mean rewards  24.39,  steps 2236416, 787.78 steps/s
INFO:root:Episode 2928,last 100 episodes, mean rewards  24.67,  steps 2248704, 799.41 steps/s
INFO:root:Episode 2944,last 100 episodes, mean rewards  24.82,  steps 2260992, 787.93 steps/s
INFO:root:Episode 2960,last 100 episodes, mean rewards  24.96,  steps 2273280, 812.42 steps/s
INFO:root:Episode 2976,last 100 episodes, mean rewards  24.35,  steps 2285568, 789.37 steps/s
INFO:root:Episode 2992,last 100 episodes, mean rewards  24.15,  steps 2297856, 767.00 steps/s
INFO:root:Episode 3008,last 100 episodes, mean rewards  24.55,  steps 2310144, 780.09 steps/s
INFO:root:Episode 3024,last 100 episodes, mean rewards  24.5

INFO:root:Episode 4272,last 100 episodes, mean rewards  25.08,  steps 3280896, 797.66 steps/s
INFO:root:Episode 4288,last 100 episodes, mean rewards  25.88,  steps 3293184, 795.15 steps/s
INFO:root:Episode 4304,last 100 episodes, mean rewards  25.76,  steps 3305472, 803.91 steps/s
INFO:root:Episode 4320,last 100 episodes, mean rewards  25.22,  steps 3317760, 793.81 steps/s
INFO:root:Episode 4336,last 100 episodes, mean rewards  24.87,  steps 3330048, 788.47 steps/s
INFO:root:Episode 4352,last 100 episodes, mean rewards  24.92,  steps 3342336, 798.39 steps/s
INFO:root:Episode 4368,last 100 episodes, mean rewards  24.53,  steps 3354624, 789.96 steps/s
INFO:root:Episode 4384,last 100 episodes, mean rewards  24.88,  steps 3366912, 787.51 steps/s
INFO:root:Episode 4400,last 100 episodes, mean rewards  25.11,  steps 3379200, 795.67 steps/s
INFO:root:Episode 4416,last 100 episodes, mean rewards  25.38,  steps 3391488, 798.37 steps/s
INFO:root:Episode 4432,last 100 episodes, mean rewards  25.4

INFO:root:Episode 5680,last 100 episodes, mean rewards  24.38,  steps 4362240, 785.81 steps/s
INFO:root:Episode 5696,last 100 episodes, mean rewards  24.01,  steps 4374528, 797.55 steps/s
INFO:root:Episode 5712,last 100 episodes, mean rewards  23.59,  steps 4386816, 789.85 steps/s
INFO:root:Episode 5728,last 100 episodes, mean rewards  23.43,  steps 4399104, 783.42 steps/s
INFO:root:Episode 5744,last 100 episodes, mean rewards  23.29,  steps 4411392, 788.74 steps/s
INFO:root:Episode 5760,last 100 episodes, mean rewards  23.22,  steps 4423680, 789.23 steps/s
INFO:root:Episode 5776,last 100 episodes, mean rewards  22.47,  steps 4435968, 801.31 steps/s
INFO:root:Episode 5792,last 100 episodes, mean rewards  21.83,  steps 4448256, 797.28 steps/s
INFO:root:Episode 5808,last 100 episodes, mean rewards  21.93,  steps 4460544, 800.03 steps/s
INFO:root:Episode 5824,last 100 episodes, mean rewards  21.16,  steps 4472832, 794.34 steps/s
INFO:root:Episode 5840,last 100 episodes, mean rewards  19.6

INFO:root:Episode 7088,last 100 episodes, mean rewards  9.69,  steps 5443584, 786.25 steps/s
INFO:root:Episode 7104,last 100 episodes, mean rewards  10.59,  steps 5455872, 794.32 steps/s
INFO:root:Episode 7120,last 100 episodes, mean rewards  11.40,  steps 5468160, 790.66 steps/s
INFO:root:Episode 7136,last 100 episodes, mean rewards  11.32,  steps 5480448, 788.43 steps/s
INFO:root:Episode 7152,last 100 episodes, mean rewards  11.80,  steps 5492736, 792.81 steps/s
INFO:root:Episode 7168,last 100 episodes, mean rewards  11.73,  steps 5505024, 787.83 steps/s
INFO:root:Episode 7184,last 100 episodes, mean rewards  11.30,  steps 5517312, 785.84 steps/s
INFO:root:Episode 7200,last 100 episodes, mean rewards  10.72,  steps 5529600, 788.43 steps/s
INFO:root:Episode 7216,last 100 episodes, mean rewards  9.96,  steps 5541888, 786.34 steps/s
INFO:root:Episode 7232,last 100 episodes, mean rewards  9.53,  steps 5554176, 781.27 steps/s
INFO:root:Episode 7248,last 100 episodes, mean rewards  8.85,  

INFO:root:Episode 8496,last 100 episodes, mean rewards  15.24,  steps 6524928, 790.30 steps/s
INFO:root:Episode 8512,last 100 episodes, mean rewards  15.39,  steps 6537216, 791.00 steps/s
INFO:root:Episode 8528,last 100 episodes, mean rewards  15.63,  steps 6549504, 786.31 steps/s
INFO:root:Episode 8544,last 100 episodes, mean rewards  16.53,  steps 6561792, 789.41 steps/s
INFO:root:Episode 8560,last 100 episodes, mean rewards  17.15,  steps 6574080, 790.57 steps/s
INFO:root:Episode 8576,last 100 episodes, mean rewards  17.62,  steps 6586368, 807.62 steps/s
INFO:root:Episode 8592,last 100 episodes, mean rewards  18.40,  steps 6598656, 788.14 steps/s
INFO:root:Episode 8608,last 100 episodes, mean rewards  19.10,  steps 6610944, 802.12 steps/s
INFO:root:Episode 8624,last 100 episodes, mean rewards  20.02,  steps 6623232, 792.97 steps/s
INFO:root:Episode 8640,last 100 episodes, mean rewards  19.75,  steps 6635520, 799.28 steps/s
INFO:root:Episode 8656,last 100 episodes, mean rewards  19.5

INFO:root:Episode 9904,last 100 episodes, mean rewards  8.65,  steps 7606272, 765.50 steps/s
INFO:root:Episode 9920,last 100 episodes, mean rewards  8.67,  steps 7618560, 765.02 steps/s
INFO:root:Episode 9936,last 100 episodes, mean rewards  8.08,  steps 7630848, 761.90 steps/s
INFO:root:Episode 9952,last 100 episodes, mean rewards  7.73,  steps 7643136, 775.93 steps/s
INFO:root:Episode 9968,last 100 episodes, mean rewards  7.25,  steps 7655424, 763.31 steps/s
INFO:root:Episode 9984,last 100 episodes, mean rewards  6.90,  steps 7667712, 772.90 steps/s
INFO:root:Episode 10000,last 100 episodes, mean rewards  6.94,  steps 7680000, 769.31 steps/s
INFO:root:Episode 10016,last 100 episodes, mean rewards  6.77,  steps 7692288, 768.05 steps/s
INFO:root:Episode 10032,last 100 episodes, mean rewards  7.22,  steps 7704576, 770.55 steps/s
INFO:root:Episode 10048,last 100 episodes, mean rewards  7.36,  steps 7716864, 778.62 steps/s
INFO:root:Episode 10064,last 100 episodes, mean rewards  7.98,  st

KeyboardInterrupt: 
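
The log lines above report "last 16 episodes", "last 32 episodes", and so on until the count settles at 100. That is the behaviour of `deque(maxlen=100)` in `run_steps_custom`: once the window is full, each new entry pushes out the oldest one. The short sketch below shows this with hypothetical scores (the integers 1 to 150), which are made up purely to exercise the window.

```python
from collections import deque

import numpy as np

window = deque(maxlen=100)  # same window length as rewards_deque above

# Hypothetical per-episode scores, used only to exercise the window.
for score in range(1, 151):
    window.append(float(score))

print(len(window))      # 100: the first 50 scores have been dropped
print(np.mean(window))  # 100.5, the mean of 51..150
```

Because only the most recent 100 entries are averaged, the reported mean can fall as well as rise, as it does in the later part of the training log.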