# Continuous Control

---


Run the next code cell to install a few packages. 

In [1]:
# !pip -q install ./python

## 0. Learning Algorithm

---

To solve the single-agent Unity Reacher problem, I chose the [DDPG algorithm](https://arxiv.org/pdf/1509.02971.pdf) (Lillicrap et al., *Continuous Control with Deep Reinforcement Learning*).

**Algorithm**

To let a DQN-style agent handle continuous action spaces, the authors propose an actor-critic extension called DDPG. A DDPG agent consists of two kinds of networks: the actor network approximates the policy function, and the critic network approximates the Q-value function.

Just like DQN, DDPG uses a replay buffer that stores past experiences and samples a small batch of tuples from it, which removes correlations between consecutive observations. It also uses DQN's target networks, but modifies them to use soft target updates.

The authors add Ornstein-Uhlenbeck noise to the action produced by the actor network to encourage exploration. Lastly, they use the Adam optimizer to learn the neural network parameters.
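A soft target update blends a small fraction of the local weights into the target weights at every learning step. The snippet below is a minimal, framework-agnostic sketch using NumPy arrays in plain dictionaries; the function name `soft_update` and the dictionary layout are assumptions for illustration, not the notebook's actual implementation.

```python
import numpy as np

TAU = 1e-3  # interpolation factor; same value as TAU in this notebook's hyperparameters


def soft_update(local_params, target_params, tau=TAU):
    """Update the target dict entry by entry: target = tau * local + (1 - tau) * target."""
    for name, local_w in local_params.items():
        target_params[name] = tau * local_w + (1.0 - tau) * target_params[name]
    return target_params


# Hypothetical single-layer example: the target moves 0.1% of the way toward the local weights.
local = {"fc1": np.ones((2, 2))}
target = {"fc1": np.zeros((2, 2))}
soft_update(local, target)
print(target["fc1"][0, 0])  # 0.001
```

With a small `tau`, the target networks change slowly, which keeps the learning targets for the critic relatively stable.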

<br>
<figure>
  <img src = "./ddpg_algorithm.png" width = 80% style = "border: thin silver solid; padding: 10px">
      <figcaption style = "text-align: center; font-style: italic">Fig 1. - DDPG Algorithm.</figcaption>
</figure> 
<br>


**Hyperparameters**

All hyperparameters except the buffer size and batch size are taken from the paper's experiment details. The buffer size and batch size were chosen for this project; the buffer is much smaller because the task is simpler.

```
BUFFER_SIZE = int(1e5)  # replay buffer size
BATCH_SIZE = 128        # minibatch size
GAMMA = 0.99            # discount factor
TAU = 1e-3              # for soft update of target parameters
LR_ACTOR = 1e-4         # learning rate of the actor 
LR_CRITIC = 1e-3        # learning rate of the critic
WEIGHT_DECAY = 1e-2     # L2 weight decay
SIGMA = 0.2             # noise scale of the OU process
THETA = 0.15            # mean-reversion rate of the OU process

```

**Model Architecture**

Since the Reacher problem is low-dimensional, both the actor and the critic consist of a few fully connected layers. The critic takes states and actions as input and outputs the action value. The paper states that the actions were not included until the 2nd hidden layer of Q, so in this implementation the actions are merged in between the 1st and 2nd hidden layers.

```
Actor(
  (fc1): Linear(in_features=33, out_features=64, bias=True)
  (fc2): Linear(in_features=64, out_features=128, bias=True)
  (fc3): Linear(in_features=128, out_features=4, bias=True)
)
```


```
Critic(
  (fc1): Linear(in_features=33, out_features=64, bias=True)
  (fc_merged): Linear(in_features=68, out_features=128, bias=True)
  (fc2): Linear(in_features=128, out_features=64, bias=True)
  (fc3): Linear(in_features=64, out_features=4, bias=True)
)
```
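The `in_features=68` of `fc_merged` comes from concatenating the 64 outputs of `fc1` with the 4 action values. The following is a framework-agnostic NumPy sketch of that forward pass, not the notebook's PyTorch code. The helper names are hypothetical, the random weights are placeholders, and the final layer is assumed to emit a single Q-value per sample.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_SIZE, ACTION_SIZE = 33, 4


def linear(in_features, out_features):
    """Placeholder layer: random weights in [-1/sqrt(fan_in), 1/sqrt(fan_in)], zero bias."""
    bound = 1.0 / np.sqrt(in_features)
    weights = rng.uniform(-bound, bound, (in_features, out_features))
    return weights, np.zeros(out_features)


def relu(x):
    return np.maximum(x, 0.0)


W1, b1 = linear(STATE_SIZE, 64)        # fc1: state only
Wm, bm = linear(64 + ACTION_SIZE, 128) # fc_merged: hidden features + action
W2, b2 = linear(128, 64)               # fc2
Wq, bq = linear(64, 1)                 # assumed scalar Q-value head


def critic_forward(states, actions):
    h = relu(states @ W1 + b1)                # (batch, 64)
    h = np.concatenate([h, actions], axis=1)  # (batch, 68): actions merged here
    h = relu(h @ Wm + bm)                     # (batch, 128)
    h = relu(h @ W2 + b2)                     # (batch, 64)
    return h @ Wq + bq                        # (batch, 1)


q_values = critic_forward(np.zeros((5, STATE_SIZE)), np.zeros((5, ACTION_SIZE)))
print(q_values.shape)  # (5, 1)
```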


## 1. Implementation


- Replay Buffer
- Actor and Critic Network
- OUNoise
- Agent

---

In [1]:
import torch
import torch.nn.functional as F
import torch.optim as optim
import torch.nn as nn
import torch.nn.init as I

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

import numpy as np
from collections import deque  # used by the training loops below

### 1. Replay Buffer

Reinforcement learning is unstable when a nonlinear function approximator is used to represent the action-value function, partly because the sequence of experiences can be highly correlated. The DDPG agent stores its experience at each time step and then samples from the buffer at random, which helps prevent the action values from oscillating or diverging.
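A minimal sketch of such a buffer, using only the standard library, might look like the following. The class and field names are assumptions for illustration; the notebook's own `ReplayBuffer` is not shown here and may differ (for example, it likely returns PyTorch tensors).

```python
import random
from collections import deque, namedtuple

Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])


class ReplayBuffer:
    """Fixed-size buffer; the oldest experiences are dropped once it is full."""

    def __init__(self, buffer_size, batch_size, seed=0):
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        """Draw a uniformly random minibatch without replacement."""
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)


# Hypothetical usage: 10 transitions pushed into a buffer that holds only 5.
buffer = ReplayBuffer(buffer_size=5, batch_size=3)
for step in range(10):
    buffer.add(state=step, action=0, reward=0.0, next_state=step + 1, done=False)
batch = buffer.sample()
print(len(buffer), len(batch))  # 5 3
```

Because sampling is uniform over the stored transitions, a minibatch mixes experiences from many different time steps, which is what breaks the temporal correlation.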

### 2. Actor and Critic Networks

According to the paper, the authors initialized the final-layer weights and biases of both the actor and critic from a uniform distribution: $[-3 \times 10^{-3}, 3 \times 10^{-3}]$ for the low-dimensional case and $[-3 \times 10^{-4}, 3 \times 10^{-4}]$ for the pixel case. This ensures that the initial outputs of the policy and value estimates are near zero. The other layers were initialized from uniform distributions $[-\frac{1}{\sqrt{f}}, \frac{1}{\sqrt{f}}]$, where $f$ is the fan-in of the layer.

Since every entry in the action vector should be a number between -1 and 1, the actor's output activation function is **tanh**.
The hidden layers are activated by **relu**.
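The initialization rule from the paper (bound $1/\sqrt{f}$ for hidden layers, $3 \times 10^{-3}$ for the final layer in the low-dimensional case) can be sketched with NumPy as follows. The helper `init_layer` is hypothetical, and applying the same bound to the hidden-layer biases is an assumption. In PyTorch, the equivalent would be `torch.nn.init.uniform_` applied to each layer's parameters.

```python
import numpy as np


def init_layer(fan_in, fan_out, rng, final=False, final_bound=3e-3):
    """Return (weights, biases) drawn uniformly from [-bound, bound]."""
    # Hidden layers: bound = 1 / sqrt(fan_in). Final layer: small fixed bound.
    bound = final_bound if final else 1.0 / np.sqrt(fan_in)
    weights = rng.uniform(-bound, bound, size=(fan_out, fan_in))
    biases = rng.uniform(-bound, bound, size=fan_out)  # assumption: same bound for biases
    return weights, biases


rng = np.random.default_rng(0)
w_hidden, b_hidden = init_layer(33, 64, rng)            # first hidden layer, fan-in 33
w_final, b_final = init_layer(128, 4, rng, final=True)  # actor output layer

# With near-zero final weights, tanh of the output starts close to 0 for every action entry.
initial_action = np.tanh(w_final @ np.ones(128) + b_final)
print(np.abs(initial_action).max() < 0.5)  # True
```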

### 3. OUNoise

To encourage the agent to explore, noise from an Ornstein–Uhlenbeck process is added to the action produced by the actor (policy) network. The process is mean-reverting: `theta` pulls each sample back toward the mean, while `sigma` sets the size of the random perturbation. Consecutive samples are temporally correlated, which ensures that two consecutive actions do not differ wildly. The authors use theta=0.15 and sigma=0.2 for this noise process.
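One common discrete-time form of the Ornstein–Uhlenbeck process is sketched below with NumPy. The class layout, the unit time step, and the Gaussian increments are assumptions for illustration; the notebook's own `OUNoise` class may be written differently.

```python
import numpy as np


class OUNoise:
    """Discrete Ornstein-Uhlenbeck process with a unit time step (assumed)."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, seed=0):
        self.mu = mu * np.ones(size)
        self.theta = theta
        self.sigma = sigma
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        """Restart the process at its mean, e.g. at the start of each episode."""
        self.state = self.mu.copy()

    def sample(self):
        # Pull toward the mean (theta term) plus a random Gaussian kick (sigma term).
        dx = self.theta * (self.mu - self.state) + self.sigma * self.rng.standard_normal(self.state.shape)
        self.state = self.state + dx
        return self.state


# Hypothetical usage: perturb a 4-dimensional action and keep it inside [-1, 1].
noise = OUNoise(size=4)
actor_output = np.zeros(4)
noisy_action = np.clip(actor_output + noise.sample(), -1.0, 1.0)
print(noisy_action.shape)  # (4,)
```

Each sample starts from the previous state, so successive noise values drift rather than jump, which is the temporal correlation described above.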

### 4. Agent

## 2. Plot of Rewards
---
### 1. Training Code

In [29]:
from unityagents import UnityEnvironment
# select this option to load version 1 (with a single agent) of the environment
# for linux : /data/Reacher_One_Linux_NoVis/Reacher_One_Linux_NoVis.x86_64
env = UnityEnvironment(file_name='data/single_agent/Reacher_Windows_x86_64/Reacher.exe')

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		goal_speed -> 1.0
		goal_size -> 5.0
Unity brain name: ReacherBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 33
        Number of stacked Vector Observation: 1
        Vector Action space type: continuous
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


In [30]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])

# shape of its elements
print('Shape of next state: {}'.format(env_info.vector_observations.shape))
print('Shape of the rewards : {}'.format(env_info.rewards))
print('Shape of dones : {}'.format(env_info.local_done))

Number of agents: 1
Size of each action: 4
There are 1 agents. Each observes a state with length: 33
The state for the first agent looks like: [ 0.00000000e+00 -4.00000000e+00  0.00000000e+00  1.00000000e+00
 -0.00000000e+00 -0.00000000e+00 -4.37113883e-08  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00 -1.00000000e+01  0.00000000e+00
  1.00000000e+00 -0.00000000e+00 -0.00000000e+00 -4.37113883e-08
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  5.75471878e+00 -1.00000000e+00
  5.55726671e+00  0.00000000e+00  1.00000000e+00  0.00000000e+00
 -1.68164849e-01]
Shape of next state: (1, 33)
Shape of the rewards : [0.0]
Shape of dones : [False]


In [32]:
states.shape

(1, 33)

In [34]:
for state in states:
    print(state.shape)

(33,)


In [11]:
env_info = env.reset(train_mode=True)[brain_name]      # reset the environment    
states = env_info.vector_observations                  # get the current state (for each agent)
scores = np.zeros(num_agents)                          # initialize the score (for each agent)
while True:
    actions = np.random.randn(num_agents, action_size) # select an action (for each agent)
    actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
    env_info = env.step(actions)[brain_name]           # send all actions to the environment
    next_states = env_info.vector_observations         # get next state (for each agent)
    rewards = env_info.rewards                         # get reward (for each agent)
    dones = env_info.local_done                        # see if episode finished
    scores += env_info.rewards                         # update the score (for each agent)
    states = next_states                               # roll over states to next time step
    if np.any(dones):                                  # exit loop if episode finished
        break
print('Total score (averaged over agents) this episode: {}'.format(np.mean(scores)))

Total score (averaged over agents) this episode: 0.09549999786540866


In [14]:
agent = DDPGAgent(state_size=state_size, action_size=action_size, seed=3)

In [15]:
agent.actor_local

Actor(
  (bn0): BatchNorm1d(33, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (fc1): Linear(in_features=33, out_features=256, bias=True)
  (bn1): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (fc2): Linear(in_features=256, out_features=128, bias=True)
  (bn2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (fc3): Linear(in_features=128, out_features=4, bias=True)
)

In [16]:
agent.critic_local

Critic(
  (fc1): Linear(in_features=33, out_features=256, bias=True)
  (bn1): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (fc_merged): Linear(in_features=260, out_features=128, bias=True)
  (bn2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (fc2): Linear(in_features=128, out_features=1, bias=True)
)


In [17]:
def ddpg(n_episodes=200, timestep_max=1000, print_every=50):
    # per-episode scores, kept for plotting the score graph
    scores = []
    # sliding window for the mean score over the last 100 episodes
    score_deque = deque(maxlen=100)
    # episodes are numbered 1..n_episodes (inclusive)
    for i_episode in range(1, n_episodes + 1):
        # get the initial state
        env_info = env.reset(train_mode=True)[brain_name]
        state = env_info.vector_observations[0]
        # reset the exploration noise
        agent.reset()
        # episode score
        score = 0
        for t in range(timestep_max):
            action = agent.act(state)

            env_info = env.step(action)[brain_name]
            next_state = env_info.vector_observations[0]
            reward = env_info.rewards[0]
            done = env_info.local_done[0]

            # store the transition and let the agent learn
            agent.step(state, action, reward, next_state, done)
            state = next_state
            score += reward
            if done:
                break
        score_deque.append(score)
        scores.append(score)
        print('\rEpisode: {}\t Score: {:.2f}'.format(i_episode, score), end="")
        # report the running average every print_every episodes
        if i_episode % print_every == 0:
            print('Episode: {}\t Average score: {}'.format(i_episode, np.mean(score_deque)))
        # save model parameters and stop once the running average reaches 30
        if np.mean(score_deque) >= 30:
            torch.save(agent.actor_local.state_dict(), 'checkpoint_actor.pth')
            torch.save(agent.critic_local.state_dict(), 'checkpoint_critic.pth')
            break

    return scores, agent.actor_loss, agent.critic_loss

In [12]:
multi_agent=MultiAgent(20, state_size, action_size, 0)

def ddpg_multi(n_episodes=1000, timestep_max=1000, print_every=50):
    # per-episode scores, kept for plotting the score graph
    scores = []
    # for calculating mean score of consecutive episodes.
    score_deque = deque(maxlen=100)
    for i_episode in range(n_episodes):
        # Get Initial State
        env_info = env.reset(train_mode=True)[brain_name]
        states = env_info.vector_observations
        # noise reset
        multi_agent.reset()
        # Episode score
        score = np.zeros(20)
        for t in range(timestep_max):
            actions = multi_agent.act(states)
            
            env_info = env.step(actions)[brain_name]
            next_states = env_info.vector_observations
            rewards = env_info.rewards
            dones = env_info.local_done
        
            multi_agent.step(states, actions, rewards, next_states, dones)
            states = next_states
            score += rewards
            if np.any(dones):
                break
        avg_score = score.mean()
        score_deque.append(avg_score)
        scores.append(avg_score)
        print('\rEpisode: {}\t score: {}'.format(i_episode, avg_score), end="")
        # report the running average every print_every episodes
        if i_episode%print_every == 0:
            print('Episode: {}\t Average score: {}'.format(i_episode, np.mean(score_deque)))
#         if np.mean(score_deque)>=30:
#             torch.save(agent.actor_local.state_dict(), 'checkpoint_actor.pth')
#             torch.save(agent.critic_local.state_dict(), 'checkpoint_critic.pth')
#             break           
            
    return scores 

### 2. Results

In [18]:
import matplotlib.pyplot as plt

In [None]:
scores= ddpg_multi()

Episode: 0	 score: 0.012999999709427357Episode: 0	 Average score: 0.012999999709427357
Episode: 13	 score: 0.026999999396502973

In [None]:
scores, actor_loss, critic_loss= ddpg()

Episode: 50	 Score: 0.15Episode: 50	 Average score: 0.5887999868392945
Episode: 100	 Score: 0.24Episode: 100	 Average score: 0.636699985768646
Episode: 108	 Score: 0.33

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(np.arange(1, len(actor_loss) + 1), actor_loss)
plt.ylabel('Actor loss')
plt.xlabel('Episode')
plt.show()

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(np.arange(1, len(critic_loss) + 1), critic_loss)
plt.ylabel('Critic loss')
plt.xlabel('Episode')
plt.show()
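The episode scores returned by `ddpg()` can be smoothed over the same 100-episode window that the training loop uses for its stopping check. The sketch below assumes `scores` is a list of per-episode floats; `moving_average` and `first_solved_episode` are hypothetical helpers, not part of the notebook.

```python
import numpy as np


def moving_average(scores, window=100):
    """Mean of the last `window` scores at each episode (shorter windows at the start)."""
    scores = np.asarray(scores, dtype=float)
    return np.array([scores[max(0, i - window + 1): i + 1].mean() for i in range(len(scores))])


def first_solved_episode(scores, target=30.0, window=100):
    """1-based episode at which the running average first reaches `target`, or None."""
    averages = moving_average(scores, window)
    hits = np.flatnonzero(averages >= target)
    return int(hits[0]) + 1 if hits.size else None


# Toy data for illustration only; these are not results from this project.
toy_scores = [10.0, 50.0, 50.0]
print(moving_average(toy_scores, window=2))                    # [10. 30. 50.]
print(first_solved_episode(toy_scores, target=30, window=2))   # 2
```

Plotting `scores` together with `moving_average(scores)` using `plt.plot` gives the reward curve along with its running mean.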

In [None]:
#env.close()

## 3. Ideas for Future Work

1. Tune the hyperparameters

    The authors found their optimal hyperparameters through extensive trials on their own tasks, and the tasks in the paper obviously differ from this project's task, so there may be a better set of hyperparameters.
    
    