# Continuous Control

---


Run the next code cell to install a few packages. 

In [1]:
# !pip -q install ./python

## 0. Learning Algorithm

---

To solve Unity reacher one agent problem, I choose [DDPG algorithm](https://arxiv.org/pdf/1509.02971.pdf)(Lillicrap et al., CONTINUOUS CONTROL WITH DEEP REINFORCEMENT LEARNING).

**Algorithm**

To make DQN agent deal with continuos action space, the authors suggest extension of actor-critic model called DDPG. The DDPG agent is consists of 2 kinds of networks. Actor network is responsible for the policy function approximator. Critic Network is reponsible for Q-value function approximator. 

Just like DQN, DDPG also uses replay buffer which stores old experiences and samples a small batch of tuples to remove correlations in consecutive observations. They also use target network used in DQN, but modified it to use soft target updates.

And they add Ornstein-Uhlenbeck Noise to the action produced by actor network for encouraging exploration. Lastly, they used Adam optimizer for learning the nueral network paramters.

<br>
<figure>
  <img src = "./ddpg_algorithm.png" width = 80% style = "border: thin silver solid; padding: 10px">
      <figcaption style = "text-align: center; font-style: italic">Fig 1. - DDPG Algorithm.</figcaption>
</figure> 
<br>


**Hyperparamters**

All these hyperparamters except buffer size and batch size are from the paper's experiment details. Buffer size and batch size are much smaller since the task is more simple.

```
BUFFER_SIZE = int(1e5)  # replay buffer size
BATCH_SIZE = 128        # minibatch size
GAMMA = 0.99            # discount factor
TAU = 1e-3              # for soft update of target parameters
LR_ACTOR = 1e-4         # learning rate of the actor 
LR_CRITIC = 1e-3        # learning rate of the critic
WEIGHT_DECAY = 1e-2     # L2 weight decay
SIGMA = 0.2             # Paramter for OU Process
THETA = 0.15            # Paramter for OU Process

```

**Model Architecture**

Since this reacher problem is low dimensional problem, Both Actor and Critic are consist of few several fully connected layers. Critic gets states and actions as input and ouputs the action-value. The paper states that the actions were not included until the 2nd hidden layer of Q. So In this Implementatiom, actions are merged into the hidden layer between the 1st and 2nd one.

```
Actor(
  (fc1): Linear(in_features=33, out_features=64, bias=True)
  (fc2): Linear(in_features=64, out_features=128, bias=True)
  (fc3): Linear(in_features=128, out_features=4, bias=True)
)
```


```
Critic(
  (fc1): Linear(in_features=33, out_features=64, bias=True)
  (fc_merged): Linear(in_features=68, out_features=128, bias=True)
  (fc2): Linear(in_features=128, out_features=64, bias=True)
  (fc3): Linear(in_features=64, out_features=4, bias=True)
)
```


## 1.Implementation


- Replay Buffer
- Actor and Critic Network
- OUNoise
- Agent

---

In [28]:
for i in range(1):
    print(i)

0


In [1]:
import torch
import torch.nn.functional as F
import torch.optim as optim
import torch.nn as nn
import torch.nn.init as I

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

import numpy as np

In [38]:
from torch.autograd import Variable

In [111]:
import random

In [None]:
random.seed()

In [None]:
np.random.seed()

In [85]:
a = [True, True, False]

In [88]:
np.stack(a).shape

(3,)

In [None]:
F.mse_loss()

In [91]:
a = torch.tensor([3.0], requires_grad=False)
print(a.view(-1).requires_grad)
with torch.no_grad():
    b = a**2
    c = b*2
    d = c*3

False


In [84]:
print(b.requires_grad)
print(c.requires_grad)
print(d.requires_grad)

False
False
False


In [104]:
a = torch.tensor([3.0, 2.0], requires_grad=True)
b = torch.tensor([4.0, 1.0], requires_grad=False)
hubber_loss = torch.nn.SmoothL1Loss()
critic_loss = hubber_loss(a, target=b)

In [105]:
critic_loss.backward()

In [106]:
print(a.grad)

tensor([-0.5000,  0.5000])


In [95]:
print(b.grad)

None


In [None]:
F.smooth_l1_loss()

In [107]:
a = torch.tensor([[3.0], [2.0]], requires_grad=True)
b = torch.tensor([[4.0], [1.0]], requires_grad=False)
loss = F.mse_loss(a,b)

In [108]:
loss.backward()

In [109]:
loss

tensor(1.)

In [110]:
print(a.grad)

tensor([[-1.],
        [ 1.]])


In [77]:
a = Variable(torch.tensor([3.0]), requires_grad=True)
b = a**2
c = b*2

In [78]:
d = c*3
d.backward(retain_graph=True)

print(a.grad)
print(d.grad)

tensor([ 36.])
None


In [79]:
d = c*3
a.grad.zero_()
d.backward(retain_graph=True)

print(a.grad)

tensor([ 36.])


In [80]:
d = c*3
a.grad.zero_()
d.backward(retain_graph=True)
print(a.grad)

tensor([ 36.])


In [49]:
e.backward()

In [50]:
print(a.grad)

tensor([[-1.1647, -7.3983, -3.2755,  7.9343]])


In [2]:
x = torch.tensor([1])

In [3]:
x.requires_grad

False

In [4]:
x = torch.tensor(data=[1], requires_grad=True)
y =x**2
z = 2*y
w = z**3

In [6]:
print(y.requires_grad)
print(z.requires_grad)
print(w.requires_grad)

True
True
True


In [9]:
p = z

In [10]:
type(p)

torch.Tensor

In [11]:
q = torch.tensor(data=[2], requires_grad=True)
pq = p*q

In [12]:
pq.backward(retain_graph=True)

In [14]:
w.backward()
print(x.grad)

tensor([ 56])


In [15]:
x = torch.tensor([1], requires_grad=True)
y = x**2
z = 2*y
w = z**3

In [16]:
p = z.detach()

In [19]:
p.requires_grad

False

In [17]:
q = torch.tensor([2], requires_grad=True)
pq = p*q
pq.backward(retain_graph=True)

In [18]:
w.backward()
print(x.grad)

tensor([ 48])


In [23]:
q.requires_grad

True

In [25]:
q.

True

In [26]:
drop = nn.Dropout(0.5)
x = torch.ones(1,10)

drop.train()
print(drop(x))


tensor([[ 2.,  2.,  0.,  0.,  0.,  0.,  0.,  0.,  2.,  2.]])


In [27]:
drop.eval()
print(drop(x))

tensor([[ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.]])


In [7]:
a=torch.randn(2,4)
print(a.requires_grad)

False


In [8]:
b=a.permute(1,0)
print(b.requires_grad)

False


In [5]:
a=torch.randn(2,4)
b = [1,2]

In [7]:
for t,t2 in zip(a,b):
    print(t)
    print(t2)

tensor([ 0.8118, -0.8935,  0.5180,  0.4833])
1
tensor([ 0.7906, -0.9023,  2.2028, -0.8669])
2


In [8]:
a.shape

torch.Size([2, 4])

In [9]:
a.unsqueeze(0).shape

torch.Size([1, 2, 4])

In [11]:
a.unsqueeze(0)

tensor([[[ 0.8118, -0.8935,  0.5180,  0.4833],
         [ 0.7906, -0.9023,  2.2028, -0.8669]]])

In [None]:
nn.Linear()

In [19]:
test = np.random.randn(2)

In [20]:
test

array([-0.71846398, -0.803826  ])

In [21]:
np.argmax(test)

0

In [28]:
m = nn.Linear(20, 30)
mb = nn.BatchNorm1d(30)
input = torch.randn(40,20)
output = m(input)
print(output.size())

torch.Size([40, 30])


In [29]:
mb.eval()
output = mb(output)
mb.train()
print(output.size())


torch.Size([40, 30])


In [52]:
for i in [1,2,3,4,5]:
    print(i)

1
2
3
4
5


In [None]:
torch.squeeze()

In [37]:
t = [np.random.randn(2), np.random.randn(2), np.random.randn(2), np.random.randn(2)]

In [39]:
t

[array([-1.13954628,  1.18039878]),
 array([-1.4491894 ,  0.04311405]),
 array([ 0.90179285, -1.69994976]),
 array([0.91887564, 0.06323698])]

In [53]:
np.stack(t).shape

(4, 2)

In [40]:
np.random.randn(4).shape

(4,)

In [44]:
np.random.randn(4)

array([-3.25208563, -0.66612742, -1.51034224, -0.27338439])

In [42]:
np.random.randn(4,1).shape

(4, 1)

In [45]:
np.random.randn(4,1)

array([[-0.52165329],
       [-0.86011339],
       [-0.44075813],
       [ 0.54924978]])

In [43]:
np.random.randn(1,4).shape

(1, 4)

In [46]:
np.random.randn(1,4)

array([[-0.15024536, -1.12597114,  1.55068629, -0.11011235]])

In [47]:
np.random.randn(4)+np.random.randn(1,4)

array([[ 1.48978565,  1.35344593, -0.53593591, -0.07508489]])

In [48]:
np.ones(4).shape

(4,)

In [50]:
np.random.randn(4).shape

(4,)

### 1.Replay Buffer

Reinforcement learning is unstable when a nonlinear function approximator is used to represent action-value function, because the sequence of experiencs can be highly correlated. DDPG Agent stores the experience at each time step. Then by sampling from the buffer at random, It can prevent action values from oscillating or diverging.

In [3]:
from collections import deque, namedtuple
import random

class ReplayBuffer():
    def __init__(self, buf_size, batch_size, seed):
        """
        Params
        ----------
        buf_size (int): size of memory
        batch_size (int): number of samples to be sampled
        seed (int): random seed
        """
        # When the replay buffer was full, the oldest sample needs to be discarded.
        # So deque is suitalbe data structure. 
        self.memory = deque(maxlen=buf_size)
        self.seed = random.seed(seed)
        self.batch_size = batch_size
        self.experience = namedtuple('Trajectory', field_names=["state", "action", "reward", "next_state", "done"])
        
    def __len__(self):
        """
        Return the size of memory
        """
        
        return len(self.memory)
    def add(self, state, action, reward, next_state, done):
        """
        Add the agent's experiences at eacy time to the memory
        """
        # Instantiate new experience with custom nemaedTuple
        e = self.experience(state, action, reward, next_state, done)
        # Add the tuple to the memory
        self.memory.append(e)
        
    def sample(self):
        """
        Draw a sample.
        Since the sample data is used by pytorch model, It needs to be converted to a torch Tensor.
        
        Returns
        -------
        A tuple of torch tensor. Each tenosr's outermost dimension is batch_size.
        """
        # list of sampled experience namedtuple of size of self.batch_size
        experiences = random.sample(self.memory, k=self.batch_size)
        
        # states : [batch_size, state_size]
        states = torch.from_numpy(np.vstack([e.state for e in experiences if e is not None])).float().to(device)
        actions = torch.from_numpy(np.vstack([e.action for e in experiences if e is not None])).float().to(device)
        rewards = torch.from_numpy(np.vstack([e.reward for e in experiences if e is not None])).float().to(device)
        next_states = torch.from_numpy(np.vstack([e.next_state for e in experiences if e is not None])).float().to(device)
        # dones is needed to calculated the Q-value. At terminal state(dones=1), the Q-value should be just latest rewards.
        # Convert it to np.uint8
        dones = torch.from_numpy(np.vstack([e.done for e in experiences if e is not None]).astype(np.uint8)).float().to(device)
        
        return (states, actions, rewards, next_states, dones)
        

### 2.Actor and Critic Networks

According to the papar, The authors initialized the final layer weights and biases of both the actor and critic from a uniform distribution $ [−3×10^−3, 3×10^−3] $ and $[3×10^−4, 3×10^−4]$. This was to ensure the initial outputs for the policy and value estimates were near zero. The other layers were initialized from uniform distributions $ [− \sqrt{f} , \sqrt{f} ] $ where f is the fan-in of the layer.

Since eveny entry in the action vector should be a number between -1 and 1, The activation function is **tanh**.
Other hidden layers is activated by **relu**. 

In [4]:
class Actor(nn.Module):
    """
    Policy Network : state -> specific action (Not Probability distribution of Actions)
    """
    def __init__(self, state_size, action_size, seed, fc1_units=256, fc2_units=128):
        """
        Initialization
        
        Params
        -------
        state_size : Vector Observation space size(per agent)
        action_size : Vector Action space size(per agent) 
        seed : seed
        fc1_units : first hidden layer 
        fc2_units : seccond hidden layer
        
        Returns
        -------
        actions (Torch Tensor)
        """
        super(Actor, self).__init__()
        #self.seed = torch.manual_seed(seed)
        self.bn0 = nn.BatchNorm1d(state_size)
        
        self.fc1 = nn.Linear(state_size, fc1_units)
        self.bn1 = nn.BatchNorm1d(fc1_units)
        
        self.fc2 = nn.Linear(fc1_units, fc2_units)
        self.bn2 = nn.BatchNorm1d(fc2_units)
        
        self.fc3 = nn.Linear(fc2_units, action_size)
        
        self.reset_parameters()
        
    def get_fan_in(self, layer):
        """
        Get the fan-in in each layer.
        """
        fan_in = 1/np.sqrt(layer.in_features)
        return -fan_in, fan_in
        
    def reset_parameters(self):
        """
        Initialize weights and bais in each layer
        """
        I.uniform_(self.fc1.weight, *self.get_fan_in(self.fc1))
        I.uniform_(self.fc2.weight, *self.get_fan_in(self.fc2))
        I.uniform_(self.fc3.weight, -3*1e-3, 3*1e-3)
            
    def forward(self, state):
        
        x = self.bn0(state)
        
        x = F.relu(self.fc1(x))
        x = self.bn1(x)
        
        x = F.relu(self.fc2(x))
        x = self.bn2(x)
        # Every entry in action vector : [-1, 1]
        actions = F.tanh(self.fc3(x))
        
        return actions

In [5]:
class Critic(nn.Module):
    def __init__(self, state_size, action_size, seed, fc1_units=256, fc2_units=128):
        super(Critic, self).__init__()
        #self.seed = torch.manual_seed(seed)
        
        # hidden layer for state pathway
        self.fc1 = nn.Linear(state_size, fc1_units)
        self.bn1 = nn.BatchNorm1d(fc1_units)
        
        self.fc_merged = nn.Linear(fc1_units+action_size, fc2_units)
        self.bn2 = nn.BatchNorm1d(fc2_units)
        
        self.fc2 = nn.Linear(fc2_units, 1)

        
        # Initialize Weights and Biases
        self.reset_parameters()
        
    def reset_parameters(self):
        I.uniform_(self.fc1.weight, *self.get_fan_in(self.fc1))
        I.uniform_(self.fc_merged.weight, *self.get_fan_in(self.fc_merged))
        I.uniform_(self.fc2.weight, -3*1e-4, 3*1e-4)
        #I.uniform_(self.fc3.weight, -3*1e-3, 3*1e+3)
        
    def get_fan_in(self, layer):
        fan_in = 1/np.sqrt(layer.in_features)
        return -fan_in, fan_in
    
    def forward(self, state, action):
        
        x = F.relu((self.fc1(state)))
        x = self.bn1(x)
                   
        # state : [batch_size, state_size], action : [batch_size, action_size]
        # merged : [batch_size, state_size + action_size]
        x = torch.cat((x, action), dim=1)
        #x = self.bn2(x)
        
        x = F.relu(self.fc_merged(x))
        x = self.bn2(x)
        
        #x = F.relu(self.bn3(self.fc2(x)))
        q = self.fc2(x)
        return q

### 3.OUNoise

To encourage agnet do exploration at initial step, add noise from Ornstein–Uhlenbeck noise process to the specific action produced by the actor(policy) network. The variation of the noise process decreases as time goes by. Therefore it can lead to reducing the exploration as the agent train. Also 2 consecutive samples are temporally correlated. This will ensure that 2 consecutive actions are not different widly. The authors use theta=0.15 sigma=0.2 for this noise process.

In [6]:
import copy 

class OUNoise:
    """
    Ornstein–Uhlenbeck noise
    """
    def __init__(self, size, seed, mu=0., theta=0.15, sigma=0.1):
        """
        Intialization
        
        Params
        ------
        size : number of noise. It should be action_size = size
        seed : random seed
        mu : mean, defalut : 0 to reduce the noise over time.
        theta : hyper paramter
        sigam : hyper paramter
        """
        self.mu = mu * np.ones(size)
        self.theta = theta
        self.sigma = sigma
        self.seed = random.seed(seed)
        self.reset()
        
    def reset(self):
        """
        Reset the internal state to mu
        """
        self.state = copy.copy(self.mu)
    def sample(self):
        """
        Process the internal state.
        The Wiener process states are sampled by random.random()
        
        Returns
        -----
        self.state (numpy array)
        """
        x = self.state
        dx = self.theta * (self.mu - x) + self.sigma*np.random.randn(len(x))
        self.state = x+dx
        return self.state

### 4.Agent

In [7]:
# Hyper parameters
BUFFER_SIZE = int(1e6)  # replay buffer size
BATCH_SIZE = 256        # minibatch size
GAMMA = 0.99            # discount factor
TAU = 1e-3              # for soft update of target parameters
LR_ACTOR = 1e-4        # learning rate of the actor 
LR_CRITIC = 1e-4        # learning rate of the critic
WEIGHT_DECAY = 1e-2     # L2 weight decay for the critic
UP_FREQ = 4
class DDPGAgent():
    """
    RL Agent whose actor and critic networks 
    """
    
    def __init__(self, state_size, action_size, seed):
        """
        Initialize an Agent object.
        
        Params
        -------
        state_size : 
        
        """
        self.state_size = state_size
        self.action_size = action_size
        self.seed = seed
        
        # Actor networks
        self.actor_local = Actor(state_size, action_size, seed).to(device)
        self.actor_target = Actor(state_size, action_size, seed).to(device)
        self.actor_optimizer = optim.Adam(self.actor_local.parameters(), lr=LR_ACTOR)
        # Critic networks
        self.critic_local = Critic(state_size, action_size, seed).to(device)
        self.critic_target = Critic(state_size, action_size, seed).to(device)
        self.critic_optimizer = optim.Adam(self.critic_local.parameters(), lr=LR_CRITIC, weight_decay=WEIGHT_DECAY)
        # Replay buffer
        self.memory = ReplayBuffer(BUFFER_SIZE, BATCH_SIZE, seed)
        # OUNoise
        self.noise = OUNoise(size=action_size, seed=self.seed)
        
        
        self.actor_loss = []
        self.critic_loss = []
        
    def reset(self):
        """
        Reset the OUNoise state.
        """
        self.noise.reset()
        
    def act(self, state, add_noise=True):
        """
        Returns actions for state based on current policy.
        Params
        -----
        state : 1d numpy array [state_size]
        add_noise : boolean , default: True
        Returns
        -----
        action : numpy array ,shape:[1 x action_size]
        """
        # Convert state into torch tensor 
        # Use unsqueeze(0) for batch normalization. [1 x state_size]
        state = torch.from_numpy(state).float().unsqueeze(0).to(device)
        # Set evaluation mode
        self.actor_local.eval()
        # Disable gradient calculation
        with torch.no_grad():
            action = self.actor_local(state).cpu().data.numpy() # [1 x action_size]
        self.actor_local.train()
        
        if add_noise:
            action += self.noise.sample()
            
        return np.clip(action, -1, 1)
    
    def step(self, state, action, reward, next_state, done):
        """
        Store experience in replay buffer
        """
        self.memory.add(state, action, reward, next_state, done)
        
        if len(self.memory)>BATCH_SIZE:
            # Get minibatch experiences
            experiences = self.memory.sample()
            self.learn(experiences, GAMMA)
    
    def learn(self, experiences, gamma):
        """
        Update actor and critic network
        
        Params
        -----
        experiences (Tuple[torch.Tensor])
        gamma (float)
        """
        # Each of them is batch sized torch Tenosr
        states, actions, rewards, next_states, dones = experiences
        
        # Update local critic
        # 1. Get next actions
        next_actions = self.actor_target(next_states)
        # 2. Get target value from the target network 
        q_next = self.critic_target(next_states, next_actions)
        y = rewards + gamma*q_next*(1-dones)
        # 3. Critic objective function
        critic_loss = F.mse_loss(self.critic_local(states, actions), y)
        # 4. Minimize
        self.critic_optimizer.zero_grad()
        critic_loss.backward()
        nn.utils.clip_grad_value_(self.critic_local.parameters(),1.0)
        self.critic_optimizer.step()
        self.critic_loss.append(critic_loss.cpu().data.numpy())
        
        # Update local actor
        # 1. Get predicted action
        pred_actions = self.actor_local(states)
        # 2. Actor objective
        actor_loss = -self.critic_local(states, pred_actions).mean()
        # 3. Minimize loss
        self.actor_optimizer.zero_grad()
        actor_loss.backward()
        # nn.utils.clip_grad_value_(self.actor_local.parameters(),1.0)
        self.actor_optimizer.step()
        self.actor_loss.append(actor_loss.cpu().data.numpy())
        # Update target networks
        self.soft_update(self.actor_local, self.actor_target, TAU)
        self.soft_update(self.critic_local, self.critic_target, TAU)
        
        
    def soft_update(self, local, target, tau):
        """
        Soft update
        
        𝜃_target = 𝜏*𝜃_local + (1 - 𝜏)*𝜃_target
        
        """
        # .parameters() is generator
        for target_param, local_param in zip(target.parameters(), local.parameters()):
            target_param.data.copy_(tau*local_param.data + (1-tau)*local_param.data).to(device)

In [8]:
class MultiAgent():
    
    def __init__(self, n_agents, state_size, action_size, seed):
        
        self.shared_buffer = ReplayBuffer(BUFFER_SIZE,BATCH_SIZE,seed)
        self.n_agents = n_agents
        self.add_noise = False
        self.agent_list = [DDPGAgent(state_size, action_size, seed) for i in range(n_agents)]
        self.t_step = 0
        
    def reset(self):
        for i in range(self.n_agents):
            self.agent_list[i].reset()
            
    def act(self, states):
        act_list = []
        for i in range(self.n_agents):
            act_list.append(self.agent_list[i].act(states[i], self.add_noise))
        return act_list
    
    def step(self, states, actions, rewards, next_states, dones):
        """
        Store experience in replay buffer
        """
        for i in range(self.n_agents):
            self.shared_buffer.add(states[i], actions[i], rewards[i], next_states[i], dones[i])
        #self.t_step = (self.t_step+1) % UP_FREQ
        #if self.t_step == 0:
        #for i in range(UP_FREQ):
        if len(self.shared_buffer)>BATCH_SIZE:
            # Get minibatch experiences
            for j in range(self.n_agents):
                experiences = self.shared_buffer.sample()
                self.agent_list[j].learn(experiences, GAMMA)
        
        
        

## 2.Plot of Rewards
---
### 1.Training Code

In [29]:
from unityagents import UnityEnvironment
# select this option to load version 1 (with a single agent) of the environment
# for linux : /data/Reacher_One_Linux_NoVis/Reacher_One_Linux_NoVis.x86_64
env = UnityEnvironment(file_name='data/single_agent/Reacher_Windows_x86_64/Reacher.exe')

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		goal_speed -> 1.0
		goal_size -> 5.0
Unity brain name: ReacherBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 33
        Number of stacked Vector Observation: 1
        Vector Action space type: continuous
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


In [30]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])

# shape of its elements
print('Shape of next state: {}'.format(env_info.vector_observations.shape))
print('Shape of the rewards : {}'.format(env_info.rewards))
print('Shape of dones : {}'.format(env_info.local_done))

Number of agents: 1
Size of each action: 4
There are 1 agents. Each observes a state with length: 33
The state for the first agent looks like: [ 0.00000000e+00 -4.00000000e+00  0.00000000e+00  1.00000000e+00
 -0.00000000e+00 -0.00000000e+00 -4.37113883e-08  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00 -1.00000000e+01  0.00000000e+00
  1.00000000e+00 -0.00000000e+00 -0.00000000e+00 -4.37113883e-08
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  5.75471878e+00 -1.00000000e+00
  5.55726671e+00  0.00000000e+00  1.00000000e+00  0.00000000e+00
 -1.68164849e-01]
Shape of next state: (1, 33)
Shape of the rewards : [0.0]
Shape of dones : [False]


In [32]:
states.shape

(1, 33)

In [34]:
for state in states:
    print(state.shape)

(33,)


In [11]:
env_info = env.reset(train_mode=True)[brain_name]      # reset the environment    
states = env_info.vector_observations                  # get the current state (for each agent)
scores = np.zeros(num_agents)                          # initialize the score (for each agent)
while True:
    actions = np.random.randn(num_agents, action_size) # select an action (for each agent)
    actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
    env_info = env.step(actions)[brain_name]           # send all actions to tne environment
    next_states = env_info.vector_observations         # get next state (for each agent)
    rewards = env_info.rewards                         # get reward (for each agent)
    dones = env_info.local_done                        # see if episode finished
    scores += env_info.rewards                         # update the score (for each agent)
    states = next_states                               # roll over states to next time step
    if np.any(dones):                                  # exit loop if episode finished
        break
print('Total score (averaged over agents) this episode: {}'.format(np.mean(scores)))

Total score (averaged over agents) this episode: 0.09549999786540866


In [14]:
agent = DDPGAgent(state_size=state_size, action_size=action_size, seed=3)

In [15]:
agent.actor_local

Actor(
  (bn0): BatchNorm1d(33, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (fc1): Linear(in_features=33, out_features=256, bias=True)
  (bn1): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (fc2): Linear(in_features=256, out_features=128, bias=True)
  (bn2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (fc3): Linear(in_features=128, out_features=4, bias=True)
)

In [16]:
agent.critic_local

Critic(
  (fc1): Linear(in_features=33, out_features=256, bias=True)
  (bn1): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (fc_merged): Linear(in_features=260, out_features=128, bias=True)
  (bn2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (fc2): Linear(in_features=128, out_features=1, bias=True)
)

In [3]:
a=0

if a==0:
    print('a')

a


In [17]:
def ddpg(n_episodes=200, timestep_max=1000, print_every=50):
    # for plotting score grpha
    scores = []
    # for calculating mean score of consecutive episodes.
    score_deque = deque(maxlen=100)
    for i_episode in range(1,n_episodes):
        # Get Initial State
        env_info = env.reset(train_mode=True)[brain_name]
        state = env_info.vector_observations[0]
        # noise reset
        agent.reset()
        # Episode score
        score = 0
        for t in range(timestep_max):
            action = agent.act(state)
            
            env_info = env.step(action)[brain_name]
            next_state = env_info.vector_observations[0]
            reward = env_info.rewards[0]
            done = env_info.local_done[0]
        
            agent.step(state, action, reward, next_state, done)
            state = next_state
            score += reward
            if done:
                break
        score_deque.append(score)
        scores.append(score)
        print('\rEpisode: {}\t Score: {:.2f}'.format(i_episode, score), end="")
        # save model parameters
        if i_episode%print_every == 0:
            print('Episode: {}\t Average score: {}'.format(i_episode, np.mean(score_deque)))
        if np.mean(score_deque)>=30:
            torch.save(agent.actor_local.state_dict(), 'checkpoint_actor.pth')
            torch.save(agent.critic_local.state_dict(), 'checkpoint_critic.pth')
            break           
            
    return scores, agent.actor_loss, agent.critic_loss

In [12]:
multi_agent=MultiAgent(20, state_size, action_size, 0)

def ddpg_multi(n_episodes=1000, timestep_max=1000, print_every=50):
    # for plotting score grpha
    scores = []
    # for calculating mean score of consecutive episodes.
    score_deque = deque(maxlen=100)
    for i_episode in range(n_episodes):
        # Get Initial State
        env_info = env.reset(train_mode=True)[brain_name]
        states = env_info.vector_observations
        # noise reset
        multi_agent.reset()
        # Episode score
        score = np.zeros(20)
        for t in range(timestep_max):
            actions = multi_agent.act(states)
            
            env_info = env.step(actions)[brain_name]
            next_states = env_info.vector_observations
            rewards = env_info.rewards
            dones = env_info.local_done
        
            multi_agent.step(states, actions, rewards, next_states, dones)
            states = next_states
            score += rewards
            if np.any(dones):
                break
        avg_score = score.mean()
        score_deque.append(avg_score)
        scores.append(avg_score)
        print('\rEpisode: {}\t score: {}'.format(i_episode, avg_score), end="")
        # save model parameters
        if i_episode%print_every == 0:
            print('Episode: {}\t Average score: {}'.format(i_episode, np.mean(score_deque)))
#         if np.mean(score_deque)>=30:
#             torch.save(agent.actor_local.state_dict(), 'checkpoint_actor.pth')
#             torch.save(agent.critic_local.state_dict(), 'checkpoint_critic.pth')
#             break           
            
    return scores 

### 2. Results

In [18]:
import matplotlib.pyplot as plt

In [None]:
scores= ddpg_multi()

Episode: 0	 score: 0.012999999709427357Episode: 0	 Average score: 0.012999999709427357
Episode: 13	 score: 0.026999999396502973

In [None]:
scores, actor_loss, critic_loss= ddpg()

Episode: 50	 Score: 0.15Episode: 50	 Average score: 0.5887999868392945
Episode: 100	 Score: 0.24Episode: 100	 Average score: 0.636699985768646
Episode: 108	 Score: 0.33

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(np.arange(1, len(actor_loss+1), 1), actor_loss)
plt.ylabel('average score')
plt.xlabel('Episode')
plt.show()

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(np.arange(1, len(critic_loss+1), 1), critic_loss)
plt.ylabel('average score')
plt.xlabel('Episode')
plt.show()

In [None]:
#env.close()

## 3. Ideas for Future Work

1. Hyper paramter를 조정해본다
    
    Authors get the optiaml hyper paratmeter by extensive trial to the task. and obviously the task in the paper and this project task are different. so there might be a better set of hyperparamters.
    
    