# Continuous Control

---


Run the next code cell to install a few packages. 

In [1]:
# !pip -q install ./python

## 0. Learning Algorithm

---

To solve the one-agent Unity Reacher problem, I chose the [DDPG algorithm](https://arxiv.org/pdf/1509.02971.pdf) (Lillicrap et al., *Continuous Control with Deep Reinforcement Learning*).

**Algorithm**

To let a DQN-style agent handle a continuous action space, the authors propose DDPG, an extension of the actor-critic model. A DDPG agent consists of two kinds of networks: the actor network approximates the policy function, and the critic network approximates the Q-value function.
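
For reference, the updates these two networks receive can be written out as in the DDPG paper. The notation is the paper's, not this notebook's code: $\mu$ is the actor, $Q$ is the critic, primes mark the target networks, $\gamma$ is the discount factor, and $N$ is the minibatch size.

$$
y_i = r_i + \gamma \, Q'\big(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\big)
$$

$$
L = \frac{1}{N} \sum_i \big(y_i - Q(s_i, a_i \mid \theta^{Q})\big)^2
$$

$$
\nabla_{\theta^{\mu}} J \approx \frac{1}{N} \sum_i \nabla_a Q(s, a \mid \theta^{Q})\big|_{s=s_i,\, a=\mu(s_i)} \, \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s=s_i}
$$

The critic is trained to minimize $L$, and the actor is moved along the sampled policy gradient.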

Just like DQN, DDPG uses a replay buffer that stores past experiences and samples small batches of tuples from it, which removes the correlations between consecutive observations. It also uses the target networks from DQN, but modifies them to use soft target updates.
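
A soft target update blends a small fraction of the learned weights into the target weights at every step, $\theta' \leftarrow \tau\theta + (1-\tau)\theta'$. The sketch below illustrates this with plain NumPy arrays standing in for network parameters. The function name `soft_update` and the list-of-arrays representation are assumptions made for illustration, not the code used by this project.

```python
import numpy as np

TAU = 1e-3  # same value as TAU in this notebook's hyperparameters


def soft_update(local_params, target_params, tau=TAU):
    """Blend local weights into target weights in place.

    Both arguments are lists of NumPy arrays with matching shapes
    (an assumption of this sketch; the project itself uses PyTorch).
    """
    for local, target in zip(local_params, target_params):
        # target <- tau * local + (1 - tau) * target
        target *= (1.0 - tau)
        target += tau * local


# Toy parameters: local weights are all ones, target weights start at zero.
local = [np.ones((2, 2)), np.ones(2)]
target = [np.zeros((2, 2)), np.zeros(2)]
soft_update(local, target)
```

With a small `tau`, the target weights trail the learned weights slowly, which keeps the critic's training targets from shifting abruptly.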

To encourage exploration, they add Ornstein-Uhlenbeck noise to the action produced by the actor network. Lastly, they use the Adam optimizer to learn the neural network parameters.

<br>
<figure>
  <img src = "./ddpg_algorithm.png" width = 80% style = "border: thin silver solid; padding: 10px">
      <figcaption style = "text-align: center; font-style: italic">Fig 1. - DDPG Algorithm.</figcaption>
</figure> 
<br>


**Hyperparameters**

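All of these hyperparameters except the buffer size and batch size come from the paper's experiment details. The buffer size is smaller than the paper's $10^6$ because this task is simpler, while the batch size of 128 is larger than the 64 the paper uses for low-dimensional tasks.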

```
BUFFER_SIZE = int(1e5)  # replay buffer size
BATCH_SIZE = 128        # minibatch size
GAMMA = 0.99            # discount factor
TAU = 1e-3              # for soft update of target parameters
LR_ACTOR = 1e-4         # learning rate of the actor 
LR_CRITIC = 1e-3        # learning rate of the critic
WEIGHT_DECAY = 1e-2     # L2 weight decay
SIGMA = 0.2             # parameter for the OU process
THETA = 0.15            # parameter for the OU process

```

**Model Architecture**

Since this Reacher problem is low-dimensional, both the actor and the critic consist of a few fully connected layers. The critic takes states and actions as input and outputs the action value. The paper states that the actions were not included until the second hidden layer of Q, so in this implementation the actions are merged in between the first and second hidden layers.

```
Actor(
  (fc1): Linear(in_features=33, out_features=64, bias=True)
  (fc2): Linear(in_features=64, out_features=128, bias=True)
  (fc3): Linear(in_features=128, out_features=4, bias=True)
)
```


```
Critic(
  (fc1): Linear(in_features=33, out_features=64, bias=True)
  (fc_merged): Linear(in_features=68, out_features=128, bias=True)
  (fc2): Linear(in_features=128, out_features=64, bias=True)
  (fc3): Linear(in_features=64, out_features=4, bias=True)
)
```
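
To illustrate how the actions enter the critic after the first layer, here is a minimal NumPy sketch of the first two critic layers. The layer sizes (33, 64, 68, 128) come from the printed `Critic` above. Everything else is an assumption made for illustration: the weights are random, biases are omitted, and the function name `critic_front` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

STATE_SIZE, ACTION_SIZE = 33, 4  # sizes reported by the environment
H1, H_MERGED = 64, 128           # layer widths from the printed Critic


def relu(x):
    return np.maximum(x, 0.0)


# Random weights stand in for trained parameters (illustration only, no biases).
w_fc1 = rng.normal(scale=0.1, size=(STATE_SIZE, H1))
w_merged = rng.normal(scale=0.1, size=(H1 + ACTION_SIZE, H_MERGED))


def critic_front(states, actions):
    """First two critic layers: a state-only layer, then the merged layer."""
    h1 = relu(states @ w_fc1)                       # (batch, 64), states only
    merged = np.concatenate([h1, actions], axis=1)  # (batch, 68), actions join here
    return relu(merged @ w_merged)                  # (batch, 128)


states = rng.normal(size=(5, STATE_SIZE))
actions = rng.uniform(-1.0, 1.0, size=(5, ACTION_SIZE))
out = critic_front(states, actions)
```

The concatenation explains why `fc_merged` has 68 input features: 64 from the first hidden layer plus the 4 action components.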


In [1]:
1e5

100000.0

In [2]:
%load_ext autoreload
%autoreload 1
%aimport agent, model, buffer

## 1. Implementation


- Replay Buffer
- Actor and Critic Network
- OUNoise
- Agent

---

In [3]:
import torch
import torch.nn.functional as F
import torch.optim as optim
import torch.nn as nn
import torch.nn.init as I
import json

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

import numpy as np
from collections import deque
from agent import MultiAgent
import agent
import time
from tqdm import tqdm

### 1. Replay Buffer

Reinforcement learning is unstable when a nonlinear function approximator represents the action-value function, in part because the sequence of experiences can be highly correlated. The DDPG agent stores its experience at each time step and then samples from the buffer at random, which helps keep the action values from oscillating or diverging.
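
The idea can be sketched with the standard library alone. The class below is a hypothetical stand-in, not the contents of this project's `buffer` module: the name `SimpleReplayBuffer`, the field names, and the integer toy states are all assumptions made for illustration.

```python
import random
from collections import deque, namedtuple

Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"]
)


class SimpleReplayBuffer:
    """Fixed-size FIFO store with uniform random sampling."""

    def __init__(self, buffer_size, batch_size, seed=0):
        self.memory = deque(maxlen=buffer_size)  # oldest entries drop out first
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        # Uniform sampling without replacement mixes steps from different times.
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)


# Toy usage: integers stand in for state vectors.
replay = SimpleReplayBuffer(buffer_size=5, batch_size=3)
for t in range(8):
    replay.add(state=t, action=0.0, reward=1.0, next_state=t + 1, done=False)

batch = replay.sample()
```

Because the buffer holds at most five items here, only the five most recent transitions remain available for sampling.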

### 2. Actor and Critic Networks

According to the paper, the authors initialized the final-layer weights and biases of both the actor and the critic from a uniform distribution over $[-3 \times 10^{-3}, 3 \times 10^{-3}]$ for low-dimensional inputs and $[-3 \times 10^{-4}, 3 \times 10^{-4}]$ for pixel inputs. This ensures that the initial outputs of the policy and value estimates are near zero. The other layers were initialized from uniform distributions over $[-\frac{1}{\sqrt{f}}, \frac{1}{\sqrt{f}}]$, where $f$ is the fan-in of the layer.
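
A minimal NumPy sketch of this initialization scheme follows, using the DDPG paper's bounds of $\pm 1/\sqrt{f}$ for hidden layers and $\pm 3 \times 10^{-3}$ for the final layer in the low-dimensional case. The helper names `fan_in_uniform` and `final_layer_uniform` are hypothetical, and the project's own `model` module may implement this differently.

```python
import numpy as np

rng = np.random.default_rng(0)


def fan_in_uniform(fan_in, fan_out):
    """Hidden-layer weights drawn from U(-1/sqrt(fan_in), 1/sqrt(fan_in))."""
    bound = 1.0 / np.sqrt(fan_in)
    return rng.uniform(-bound, bound, size=(fan_in, fan_out))


def final_layer_uniform(fan_in, fan_out, bound=3e-3):
    """Output-layer weights drawn from U(-bound, bound) to keep outputs near zero."""
    return rng.uniform(-bound, bound, size=(fan_in, fan_out))


w_hidden = fan_in_uniform(33, 64)   # 33 = state size in this task
w_out = final_layer_uniform(64, 4)  # 4 = action size in this task
```

Wider layers get a tighter bound, so the scale of each layer's pre-activations stays comparable regardless of its input width.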

Since every entry in the action vector must be a number between -1 and 1, the actor's output activation is **tanh**.
The hidden layers are activated by **relu**.

### 3. OUNoise

To encourage the agent to explore, noise from an Ornstein-Uhlenbeck process is added to each action produced by the actor (policy) network. The process is mean-reverting: each step pulls the noise back toward its mean, so the perturbation stays concentrated around the mean instead of drifting away. Consecutive samples are also temporally correlated, which ensures that two consecutive actions do not differ wildly. The authors use theta=0.15 and sigma=0.2 for this noise process.
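
A discretized version of such a process can be sketched as follows. This is an illustration under stated assumptions: a unit time step, a mean of zero, and a hypothetical class name `OUNoiseSketch`. The project's own noise class may differ in its details.

```python
import numpy as np


class OUNoiseSketch:
    """Discretized OU process: x <- x + theta * (mu - x) + sigma * N(0, 1)."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, seed=0):
        self.mu = mu * np.ones(size)
        self.theta = theta
        self.sigma = sigma
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        # Restart the process at its mean (done at the start of each episode).
        self.state = self.mu.copy()

    def sample(self):
        pull = self.theta * (self.mu - self.state)  # mean-reverting drift
        kick = self.sigma * self.rng.standard_normal(self.state.shape)
        self.state = self.state + pull + kick
        return self.state


noise = OUNoiseSketch(size=4)  # one noise component per action dimension
samples = np.array([noise.sample() for _ in range(1000)])
```

In use, a sample would be added to the actor's output and the result clipped to the valid range, for example `np.clip(action + noise.sample(), -1, 1)`.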

### 4. Agent

## 2. Plot of Rewards
---
### 1. Training Code

In [4]:
from unityagents import UnityEnvironment
# load the multi-agent build of the environment (the path below points to data/multi_agent/)
# for linux : /data/Reacher_One_Linux_NoVis/Reacher_One_Linux_NoVis.x86_64
env = UnityEnvironment(file_name='data/multi_agent/Reacher_Windows_x86_64/Reacher.exe')

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		goal_speed -> 1.0
		goal_size -> 5.0
Unity brain name: ReacherBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 33
        Number of stacked Vector Observation: 1
        Vector Action space type: continuous
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


In [None]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))

# shape of its elements
print('Shape of next state: {}'.format(env_info.vector_observations.shape))

Number of agents: 20
Size of each action: 4
There are 20 agents. Each observes a state with length: 33
Shape of next state: (20, 33)


In [None]:
load_previous_model = False
prev_model_path = 'data/tmp/'

In [None]:
def load_print_prev_scores(prev_model_path):
    """
    Load previous saved scores and print.
    Get the current episode index and set up the scores list.
    """
    # Load previous score info
    with open(prev_model_path+'prev_scores.json', 'r') as f:
        prev_data = json.load(f)
            
    # Print prev scores
    for i in range(len(prev_data)):
        print('\rEpisode : {} \t Current Score: {:.2f} \t Average Score : {:.2f}'\
              .format(prev_data[i]['episode'], prev_data[i]['current_score'], prev_data[i]['avg_score']))
    
    scores = [data['current_score'] for data in prev_data]
    cur_episode = len(prev_data)+1
    
    return cur_episode, scores

In [None]:
multiAgent = MultiAgent(state_size=state_size, action_size=action_size, 
                        num_agents=num_agents, seed=0, critic_type=1)

In [None]:
multiAgent.agents[0].actor

Actor(
  (bn0): BatchNorm1d(33, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (fc1): Linear(in_features=33, out_features=96, bias=False)
  (bn1): BatchNorm1d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (fc2): Linear(in_features=96, out_features=96, bias=False)
  (bn2): BatchNorm1d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (fc3): Linear(in_features=96, out_features=4, bias=True)
)

In [None]:
def ddpg(n_episodes=100, timestep_max=1000, print_every=50, \
         load_previous_model=False, prev_model_path='data/tmp/', new_json_name='scores.json'):
    # for plotting the score graph
    scores = []
    # for calculating mean score of consecutive episodes.
    score_deque = deque(maxlen=100)
    
    # If resuming from a previous model
    if load_previous_model:
        # Load previous score info
        cur_episode, scores = load_print_prev_scores(prev_model_path)
        # Load previous model parameters
        multiAgent.load_model(prev_model_path)
    else:
        cur_episode = 1
        
    with open('data/tmp/'+new_json_name, mode='w') as f:
        json.dump([], f)
        
    for i in range(cur_episode,n_episodes+1):
        # Get Initial State
        env_info = env.reset(train_mode=True)[brain_name]
        states = env_info.vector_observations
        # noise reset
        multiAgent.reset()
        # Episode score
        score = np.zeros(num_agents) # shape : [num_agents,]
        
        
        action_time = []
        step_time = []
        epi_start = time.time()
        for t in range(timestep_max):
            
            action_start = time.time()
            actions = multiAgent.act(states)
            
            env_info = env.step(actions)[brain_name]
            next_states = env_info.vector_observations
            rewards = env_info.rewards
            dones = env_info.local_done
            action_stop = time.time()
            action_time.append(action_stop-action_start)
            
            step_start = time.time()
            multiAgent.step(states, actions, rewards, next_states, dones)
            step_stop = time.time()
            step_time.append(step_stop-step_start)
            states = next_states
            score += rewards
            # any agent arrives at terminal state.
            if np.any(dones):
                break
        
        epi_stop = time.time()
        print('\r1 Episode time: {}'.format(epi_stop-epi_start))
        single_score = np.mean(score)
        score_deque.append(single_score)
        scores.append(single_score)
        
        # print('\rEpisode: {}\t Score: {:.2f}'.format(i, score), end="")
        print('\rEpisode : {} \t Current Score: {:.2f} \t Average Score : {:.2f}'\
              .format(i, single_score, np.mean(score_deque)))
        
#         if i%print_every == 0:
#             print('\rEpisode : {} \t Current Score: {:.2f} \t Average Score : {:.2f}'.format(i, single_score, np.mean(score_deque)))
        
        if np.mean(score_deque)>=30:
            print('The number of episodes needed to solve the problem : {}'.format(i))
            # save model parameters
            multiAgent.save_model()
            # save scores
            with open('data/model/avg_scores.json', 'w') as f:
                json.dump(scores, f)
            break 
        
        # Save score info and model parameters
        # Save score info
        with open('data/tmp/'+new_json_name, mode='r') as f:
            score_data = json.load(f)
        current_data = {'episode' : i, 'current_score' : single_score, 'avg_score':np.mean(score_deque)}
        score_data.append(current_data)
        with open('data/tmp/'+new_json_name, mode='w') as f:
            json.dump(score_data, f)
        multiAgent.save_model('data/tmp/')
                     
    return scores
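
The stopping rule in the loop above averages the most recent episode scores (up to 100) and stops once that average reaches 30. The standalone sketch below isolates that rule. The function name `first_solved_episode` is hypothetical, and the scores are made up for illustration; they are not results from this project.

```python
from collections import deque


def first_solved_episode(episode_scores, window=100, target=30.0):
    """Return the 1-based episode at which the mean of the most recent
    `window` scores first reaches `target`, or None if it never does."""
    recent = deque(maxlen=window)
    for episode, score in enumerate(episode_scores, start=1):
        recent.append(score)
        # As in the training loop, the check also runs before the window is full.
        if sum(recent) / len(recent) >= target:
            return episode
    return None


# Made-up scores: a linear ramp from 1 to 40, then flat at 40.
fake_scores = [min(float(i), 40.0) for i in range(1, 201)]
solved_at = first_solved_episode(fake_scores, window=10)
```

A shorter window reacts faster to improvement but is noisier. The window of 100 used in the training loop smooths over episode-to-episode variation.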

### 2. Results

In [None]:
scores= ddpg()

In [None]:
import matplotlib.pyplot as plt

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(np.arange(1, len(actor_loss)+1, 1), actor_loss)
plt.ylabel('actor loss')
plt.xlabel('Episode')
plt.show()

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(np.arange(1, len(critic_loss)+1, 1), critic_loss)
plt.ylabel('critic loss')
plt.xlabel('Episode')
plt.show()

In [None]:
#env.close()

Old version


In [None]:
def ddpg(n_episodes=100, timestep_max=1000, print_every=50):
    # for plotting the score graph
    scores = []
    # for calculating mean score of consecutive episodes.
    score_deque = deque(maxlen=100)
    for i in range(1,n_episodes+1):
        # Get Initial State
        env_info = env.reset(train_mode=True)[brain_name]
        states = env_info.vector_observations
        # noise reset
        multiAgent.reset()
        # Episode score
        score = np.zeros(num_agents) # shape : [num_agents,]
        for t in range(timestep_max):
            actions = multiAgent.act(states)
            
            env_info = env.step(actions)[brain_name]
            next_states = env_info.vector_observations
            rewards = env_info.rewards
            dones = env_info.local_done
        
            multiAgent.step(states, actions, rewards, next_states, dones)
            
            states = next_states
            score += rewards
            # any agent arrives at terminal state.
            if np.any(dones):
                break
        
        score_deque.append(np.mean(score))
        scores.append(np.mean(score))
        
        # print('\rEpisode: {}\t Score: {:.2f}'.format(i, score), end="")
        print('\rEpisode : {} \t Current Score: {:.2f} \t Average Score : {:.2f}'\
              .format(i, np.mean(score), np.mean(score_deque)))
        if i%print_every == 0:
            print('\rEpisode : {} \t Current Score: {:.2f} \t Average Score : {:.2f}'.format(i, np.mean(score), np.mean(score_deque)))
        if np.mean(score_deque)>=30:
            print('The number of episodes needed to solve the problem : {}'.format(i))
            # save model parameters
            multiAgent.save_model()
            break           
            
    return scores

## 3. Ideas for Future Work

1. Tune the hyperparameters.
    
    The authors found their optimal hyperparameters through extensive trials on their own tasks, and the tasks in the paper differ from the task in this project, so a better set of hyperparameters may exist for this environment.
    
    