# Continuous Control

---

*Jun Zhu, zhujun981661@gmail.com, 08.2020*

In this notebook, I present the solution for the second project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893) program.

### 1. Start the Environment

In [None]:
from unityagents import UnityEnvironment
import numpy as np

import torch
import torch.nn as nn
import torch.nn.functional as F

!pip install matplotlib

import matplotlib.pyplot as plt
%matplotlib inline

import sys
sys.path.append('..')
from ddpg_agent import DdpgAgent
from utilities import check_environment, play, plot_score_history

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Reacher.app"`
- **Windows** (x86): `"path/to/Reacher_Windows_x86/Reacher.exe"`
- **Windows** (x86_64): `"path/to/Reacher_Windows_x86_64/Reacher.exe"`
- **Linux** (x86): `"path/to/Reacher_Linux/Reacher.x86"`
- **Linux** (x86_64): `"path/to/Reacher_Linux/Reacher.x86_64"`
- **Linux** (x86, headless): `"path/to/Reacher_Linux_NoVis/Reacher.x86"`
- **Linux** (x86_64, headless): `"path/to/Reacher_Linux_NoVis/Reacher.x86_64"`

For instance, if you are using a Mac, then you downloaded `Reacher.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Reacher.app")
```

In [None]:
env = UnityEnvironment(file_name='Reacher_Linux/Reacher.x86_64')

Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [None]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

In this environment, a double-jointed arm can move to target locations. A reward of `+0.1` is provided for each step that the agent's hand is in the goal location. Thus, the goal of your agent is to maintain its position at the target location for as many time steps as possible.

The observation space consists of `33` variables corresponding to the position, rotation, velocity, and angular velocity of the arm.  Each action is a vector of four numbers, corresponding to the torques applied to the two joints.  Every entry in the action vector must be a number between `-1` and `1`.
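
As a rough illustration of the action format described above, the following sketch draws random actions and clips them to the valid range. The names `num_agents` and `action_size` are placeholders chosen for this example; they are not read from the environment here.

```python
import numpy as np

num_agents = 1    # assumed: the single-agent version of the environment
action_size = 4   # four numbers per action, as described above

rng = np.random.default_rng(0)

# sample from a standard normal distribution, then clip every entry into [-1, 1]
actions = np.clip(rng.standard_normal((num_agents, action_size)), -1.0, 1.0)
```

Clipping is only one way to respect the bounds; the actor defined later in this notebook uses a `tanh` output layer for the same purpose.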

Run the code cell below to print some information about the environment.

In [None]:
brain_name, state_space, action_space = check_environment(env)
brain_name

### 3. Take Random Actions in the Environment

In [None]:
# play(env, brain_name, continuous=True)

### 4. Train a DDPG (Deep Deterministic Policy Gradient) Agent

The agent implements the DDPG algorithm proposed in this [paper](https://arxiv.org/pdf/1509.02971.pdf). DDPG is a model-free, actor-critic algorithm based on the deterministic policy gradient that can operate over **continuous action spaces**.
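
Two ingredients of DDPG are the bootstrapped critic target and the soft update of the target networks. The sketch below shows both on plain NumPy arrays. The function names `td_target` and `soft_update` are hypothetical and are not taken from `ddpg_agent`; the default `gamma` and `tau` mirror the values passed to `agent.train` in this notebook.

```python
import numpy as np


def td_target(reward, done, next_q, gamma=0.99):
    """Critic target: y = r + gamma * (1 - done) * Q'(s', mu'(s'))."""
    return reward + gamma * (1.0 - done) * next_q


def soft_update(target_params, local_params, tau=0.001):
    """Blend local weights into the target: theta' <- tau * theta + (1 - tau) * theta'."""
    return [tau * local + (1.0 - tau) * target
            for target, local in zip(target_params, local_params)]


# toy batch of two transitions; the second one ends its episode
y = td_target(reward=np.array([0.1, 0.0]),
              done=np.array([0.0, 1.0]),
              next_q=np.array([2.0, 5.0]))

# a single "layer" of three weights, moved a small step toward the local network
new_target = soft_update([np.zeros(3)], [np.ones(3)], tau=0.001)
```

In an actual PyTorch implementation the same blend would typically be applied in place to each pair of target and local parameters.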

In [None]:
def init_layer(layer):
    layer.weight.data.uniform_(-3e-3, 3e-3)
    layer.bias.data.uniform_(-3e-3, 3e-3)


class Actor(nn.Module):
    def __init__(self):
        super().__init__()
        
        self._fc1 = nn.Linear(state_space, 512)
        self._fc2 = nn.Linear(512, 128)
        
        self._fc_final = nn.Linear(128, action_space)
        init_layer(self._fc_final)
        
    def forward(self, state):
        x = state
        for fc in (self._fc1, self._fc2):
            x = F.relu(fc(x))
        
        return torch.tanh(self._fc_final(x))
    

class Critic(nn.Module):
    def __init__(self):
        super().__init__()
        
        self._fc1 = nn.Linear(state_space + action_space, 512)
        self._fc2 = nn.Linear(512, 64)
        
        self._fc_final = nn.Linear(64, 1)
        init_layer(self._fc_final)

    def forward(self, state, action):
        x = torch.cat([state, action], dim=-1)
        
        for fc in (self._fc1, self._fc2):
            x = F.relu(fc(x))
            
        return self._fc_final(x)

In [None]:
target_score = 30

# initialize the agent
agent = DdpgAgent(action_space, (Actor(), Critic()), replay_memory_size=int(1e5))

# train the agent with given hyperparameters (remove ./ddpg_checkpoint.pth to train a new model from scratch)
scores = agent.train(env,
                     theta=0.15,  # Ornstein-Uhlenbeck mean-reversion rate
                     sigma=0.20,  # Ornstein-Uhlenbeck noise scale
                     decay_rate=0.995,  # noise decay rate
                     tau=0.001,  # soft update rate of the target network
                     gamma=0.99,  # discount factor
                     learning_rate=(1e-3, 1e-3),  # learning rates of actor and critic models
                     weight_decay=(1e-4, 1e-4),  # L2 penalties of actor and critic models
                     batch_size=64,  # mini batch size
                     n_episodes=1000,
                     target_score=target_score,
                     output_frequency=1)
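
To make the `theta`, `sigma`, and `decay_rate` arguments above more concrete, here is a minimal sketch of a discretized Ornstein-Uhlenbeck process. The class `OUNoise` is hypothetical and is not the code inside `DdpgAgent`; in particular, decaying the noise amplitude once per episode is an assumption made for this example.

```python
import numpy as np


class OUNoise:
    """Discretized Ornstein-Uhlenbeck process (illustrative only)."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.20, seed=0):
        self.mu = mu * np.ones(size)
        self.theta = theta
        self.sigma = sigma
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        # start each episode at the long-run mean
        self.state = self.mu.copy()

    def sample(self):
        # pull the state toward mu, then add Gaussian noise
        dx = (self.theta * (self.mu - self.state)
              + self.sigma * self.rng.standard_normal(self.state.shape))
        self.state = self.state + dx
        return self.state


noise = OUNoise(size=4)
scale = 1.0
for episode in range(3):
    noise.reset()
    perturbation = scale * noise.sample()  # would be added to the actor's action
    scale *= 0.995  # assumed: noise amplitude decays once per episode
```

Because each sample depends on the previous state, the noise is temporally correlated, which is the reason the DDPG paper gives for using it in physical control tasks.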

In [None]:
_, ax = plt.subplots(1, 1, figsize=(12, 6))
plot_score_history(ax, scores, target_score)

In [None]:
for _ in range(10):
    agent.play(env)

In [None]:
env.close()

### 5. Further Improvement

Use the second version of the environment, which contains 20 agents, and implement algorithms such as [PPO](https://arxiv.org/pdf/1707.06347.pdf), [A3C](https://arxiv.org/pdf/1602.01783.pdf), or [D4PG](https://openreview.net/pdf?id=SyZipzbCb), which use multiple (non-interacting, parallel) copies of the same agent to distribute the task of gathering experience.
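
One possible way to distribute experience gathering is sketched below. It assumes an off-policy learner with a replay buffer, as in DDPG or D4PG, and would not apply directly to on-policy methods such as PPO or A3C. The helper `store_step` and the placeholder data are hypothetical.

```python
import random
from collections import deque

num_agents = 20                 # size of the second environment, per the text above
memory = deque(maxlen=100_000)  # shared replay buffer (hypothetical helper)


def store_step(states, actions, rewards, next_states, dones):
    """Append one transition per agent to the shared buffer."""
    for transition in zip(states, actions, rewards, next_states, dones):
        memory.append(transition)


# one fake environment step with placeholder data for every agent
store_step(states=[[0.0] * 33] * num_agents,
           actions=[[0.0] * 4] * num_agents,
           rewards=[0.1] * num_agents,
           next_states=[[0.0] * 33] * num_agents,
           dones=[False] * num_agents)

# mini-batch drawn across the experience of all agents
batch = random.sample(list(memory), k=8)
```

With this layout a single environment step yields 20 transitions, so the buffer fills 20 times faster than in the single-agent setting.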