# Collaboration and Competition

---

In this notebook, you will learn how to use the Unity ML-Agents environment for the third project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893) program.

### 1. Start the Environment

We begin by importing the necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [1]:
import os
import sys
sys.path.append('D:/UserData/Z003XD5A/dev/deep_rl_projects/ml-agents/python')

from unityagents import UnityEnvironment
import numpy as np

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Tennis.app"`
- **Windows** (x86): `"path/to/Tennis_Windows_x86/Tennis.exe"`
- **Windows** (x86_64): `"path/to/Tennis_Windows_x86_64/Tennis.exe"`
- **Linux** (x86): `"path/to/Tennis_Linux/Tennis.x86"`
- **Linux** (x86_64): `"path/to/Tennis_Linux/Tennis.x86_64"`
- **Linux** (x86, headless): `"path/to/Tennis_Linux_NoVis/Tennis.x86"`
- **Linux** (x86_64, headless): `"path/to/Tennis_Linux_NoVis/Tennis.x86_64"`

For instance, if you are using a Mac, then you downloaded `Tennis.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Tennis.app")
```

In [2]:
print(os.listdir('../../unity_environments/Tennis_Windows_x86_64'))

['Tennis.exe', 'Tennis_Data', 'UnityPlayer.dll']


In [3]:
env = UnityEnvironment(file_name='../../unity_environments/Tennis_Windows_x86_64/Tennis.exe')

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: TennisBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 8
        Number of stacked Vector Observation: 3
        Vector Action space type: continuous
        Vector Action space size (per agent): 2
        Vector Action descriptions: , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [6]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

In this environment, two agents control rackets to bounce a ball over a net. If an agent hits the ball over the net, it receives a reward of +0.1.  If an agent lets a ball hit the ground or hits the ball out of bounds, it receives a reward of -0.01.  Thus, the goal of each agent is to keep the ball in play.

The observation space consists of 8 variables corresponding to the position and velocity of the ball and racket. The environment stacks 3 consecutive observations per agent, so each agent's state vector has length 24. Two continuous actions are available, corresponding to movement toward (or away from) the net, and jumping.
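A minimal sketch of how the 8-variable observation and the 3 stacked observations reported in the brain summary combine into one flat state vector. The helper `split_frames` is hypothetical, and the assumption that the most recent observation comes last is only inferred from the state printed right after a reset (leading zeros, values at the end).

```python
import numpy as np

FRAME_SIZE = 8    # vector observation size per agent (from the brain summary)
NUM_STACKED = 3   # number of stacked vector observations (from the brain summary)

def split_frames(state, frame_size=FRAME_SIZE):
    """Reshape one agent's flat state into rows of `frame_size` values (hypothetical helper)."""
    return np.asarray(state, dtype=float).reshape(-1, frame_size)

# Toy state shaped like the one printed after a reset: two empty frames,
# then one frame with values (rounded here for readability).
latest = [-6.69, -1.5, 0.0, 0.0, 6.83, 5.99, 0.0, 0.0]
state = np.concatenate([np.zeros(FRAME_SIZE * (NUM_STACKED - 1)), latest])

frames = split_frames(state)
print(frames.shape)   # (3, 8)
```

Splitting the vector this way is only useful for inspection; the networks in this notebook consume the flat 24-value state directly.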

Run the code cell below to print some information about the environment.

In [7]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents 
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
print('states shape ', states.shape)
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])

Number of agents: 2
Size of each action: 2
states shape  (2, 24)
There are 2 agents. Each observes a state with length: 24
The state for the first agent looks like: [ 0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.         -6.69487906 -1.5
 -0.          0.          6.83172083  5.98822832 -0.          0.        ]


### 3. Take Random Actions in the Environment

In the next code cell, you will learn how to use the Python API to control the agents and receive feedback from the environment.

Once this cell is executed, you can watch how the agents perform when they select actions at random at each time step.  A window should pop up that allows you to observe the agents.

Of course, as part of the project, you'll have to change the code so that the agents are able to use their experiences to gradually choose better actions when interacting with the environment!

In [15]:
for i in range(1, 6):                                      # play game for 5 episodes
    env_info = env.reset(train_mode=False)[brain_name]     # reset the environment    
    states = env_info.vector_observations                  # get the current state (for each agent)
    scores = np.zeros(num_agents)                          # initialize the score (for each agent)
    while True:
        actions = np.random.randn(num_agents, action_size) # select an action (for each agent)
        actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
        env_info = env.step(actions)[brain_name]           # send all actions to the environment
        next_states = env_info.vector_observations         # get next state (for each agent)
        rewards = env_info.rewards                         # get reward (for each agent)
        dones = env_info.local_done                        # see if episode finished
        scores += env_info.rewards                         # update the score (for each agent)
        states = next_states                               # roll over states to next time step
        if np.any(dones):                                  # exit loop if episode finished
            break
    print('Score (max over agents) from episode {}: {}'.format(i, np.max(scores)))

Score (max over agents) from episode 1: 0.0
Score (max over agents) from episode 2: 0.10000000149011612
Score (max over agents) from episode 3: 0.0
Score (max over agents) from episode 4: 0.0
Score (max over agents) from episode 5: 0.0


When finished, you can close the environment.

In [None]:
env.close()

### 4. It's Your Turn!

Now it's your turn to train your own agents to solve the environment!  When training, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```

#### Explore return states and actions

In [38]:
   
env_info = env.reset(train_mode=False)[brain_name]     # reset the environment    
states = env_info.vector_observations                  # get the current state (for each agent)
print('states shape ', states.shape)
scores = np.zeros(num_agents)                          # initialize the score (for each agent)
print('scores ', scores.shape)
actions = np.random.randn(num_agents, action_size) # select an action (for each agent)
actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
print('actions ', actions.shape)
env_info = env.step(actions)[brain_name]           # send all actions to the environment
next_states = env_info.vector_observations         # get next state (for each agent)
print('next states ', next_states.shape)
rewards = env_info.rewards                         # get reward (for each agent)
print('rewards type ', type(rewards), ' len ', len(rewards), ' rewards ', rewards)
dones = env_info.local_done                        # see if episode finished
print('dones type ', type(dones), ' len ', len(dones), ' any done ', np.any(dones), ' dones ', dones)
scores += env_info.rewards                         # update the score (for each agent)
states = next_states                               # roll over states to next time step


states shape  (2, 24)
scores  (2,)
actions  (2, 2)
next states  (2, 24)
rewards type  <class 'list'>  len  2  rewards  [0.0, 0.0]
dones type  <class 'list'>  len  2  any done  False  dones  [False, False]


In [9]:
import torch
import torch.nn.functional as F

import numpy as np

In [10]:
%load_ext autoreload
%autoreload 2

#from maddpg import *
#from ddpg import *
from buffer import *
#from network import *

In [11]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

def hidden_init(layer):
    # nn.Linear stores its weight as (out_features, in_features), so fan-in is dimension 1
    fan_in = layer.weight.data.size()[1]
    lim = 1. / np.sqrt(fan_in)
    return (-lim, lim)

class Network(nn.Module):
    def __init__(self, state_size, action_size, hidden_in_dim, hidden_out_dim, activation=F.relu, is_actor=False):
        super(Network, self).__init__()

        """self.input_norm = nn.BatchNorm1d(input_dim)
        self.input_norm.weight.data.fill_(1)
        self.input_norm.bias.data.fill_(0)"""

        self.bn0 = nn.BatchNorm1d(state_size)
        self.fc1 = nn.Linear(state_size,hidden_in_dim)
        self.bn1 = nn.BatchNorm1d(hidden_in_dim)
        self.fc2_actor = nn.Linear(hidden_in_dim,hidden_out_dim)
        self.fc2_critic = nn.Linear(hidden_in_dim+action_size,hidden_out_dim)
        self.bn2 = nn.BatchNorm1d(hidden_out_dim)
        self.fc3_actor = nn.Linear(hidden_out_dim,action_size)
        self.fc3_critic = nn.Linear(hidden_out_dim,1)
        self.activation = activation 
        self.is_actor = is_actor
        #self.reset_parameters()

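    def reset_parameters(self):
        # initialize the layers that actually exist on this module (actor and critic heads)
        self.fc1.weight.data.uniform_(*hidden_init(self.fc1))
        self.fc2_actor.weight.data.uniform_(*hidden_init(self.fc2_actor))
        self.fc2_critic.weight.data.uniform_(*hidden_init(self.fc2_critic))
        self.fc3_actor.weight.data.uniform_(-1e-3, 1e-3)
        self.fc3_critic.weight.data.uniform_(-1e-3, 1e-3)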

    def forward(self, x, action=None):
        if self.is_actor:
            # return a vector of the force
            x = self.bn0(x)
            x = self.activation(self.bn1(self.fc1(x)))
            x = self.activation(self.bn2(self.fc2_actor(x)))
            return torch.tanh(self.fc3_actor(x))
        
        else:
            # critic network simply outputs a number
            x = self.bn0(x)
            x = self.activation(self.bn1(self.fc1(x)))
            x = torch.cat((x, action), dim=-1)
            x = self.activation(self.fc2_critic(x))
            return self.fc3_critic(x)
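
The `hidden_init` helper is meant to draw hidden-layer weights uniformly from ±1/sqrt(fan_in). As a rough sketch, assuming the layer sizes used later in this notebook (24 inputs, 128 hidden units) and a hypothetical helper `uniform_limits`, the bounds can be computed from the weight shape alone. PyTorch's `nn.Linear` stores its weight as `(out_features, in_features)`, which the tuple below mimics.

```python
import math

def uniform_limits(weight_shape):
    """Return (-lim, lim) with lim = 1/sqrt(fan_in) for an (out_features, in_features) shape."""
    out_features, in_features = weight_shape
    lim = 1.0 / math.sqrt(in_features)
    return (-lim, lim)

# Shape of the first actor layer under the assumed sizes: 24 inputs -> 128 hidden units.
low, high = uniform_limits((128, 24))
print(round(high, 4))   # 0.2041
```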

In [97]:
import copy
import random

import torch
import torch.optim as optim
import torch.nn.functional as F
from ou_noise import *

class DDPG():
    """Interacts with and learns from the environment."""

    def __init__(self, device, state_size, action_size, random_seed, hidden_in_dim, hidden_out_dim, activation, 
                 tau, lr_actor, lr_critic, weight_decay, epsilon, epsilon_decay):
             
        """Initialize an Agent object.
        Params
        ======
            state_size (int): dimension of each state
            action_size (int): dimension of each action
            random_seed (int): random seed
        """
        super(DDPG, self).__init__()
        
        self.state_size = state_size
        self.action_size = action_size
        
        self.device = device
        self.tau = tau
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay

        # Actor Network (w/ Target Network)
        self.actor_local = Network(self.state_size, self.action_size, hidden_in_dim, hidden_out_dim, activation=activation, is_actor=True).to(self.device)
        self.actor_target = Network(self.state_size, self.action_size, hidden_in_dim, hidden_out_dim, activation=activation, is_actor=True).to(self.device)
        self.actor_optimizer = optim.Adam(self.actor_local.parameters(), lr=lr_actor)

        # Critic Network (w/ Target Network)
        self.critic_local = Network(self.state_size*2, self.action_size*2, hidden_in_dim, hidden_out_dim, activation=activation, is_actor=False).to(self.device)
        self.critic_target = Network(self.state_size*2, self.action_size*2, hidden_in_dim, hidden_out_dim, activation=activation, is_actor=False).to(self.device)
        self.critic_optimizer = optim.Adam(self.critic_local.parameters(), lr=lr_critic, weight_decay=weight_decay)

        # Same initialization
        self.__copy__(self.actor_local, self.actor_target)
        self.__copy__(self.critic_local, self.critic_target)

        # Noise process
        self.noise = OUNoise(action_size, seed=random_seed)
                
    def act(self, state, noise_scale=0.0):
        """Returns actions for given state as per current policy."""

        if isinstance(state, np.ndarray):
            state = torch.from_numpy(state).float()
        
        self.actor_local.eval()
        with torch.no_grad():
            action = self.actor_local(state.to(self.device))
        self.actor_local.train()
        return action + noise_scale*self.noise.noise()

    def target_act(self, state, noise_scale=0.0):
        """Returns actions for given state as per the target policy."""

        if isinstance(state, np.ndarray):
            state = torch.from_numpy(state).float()
        return self.actor_target(state.to(self.device)) + noise_scale*self.noise.noise()

    def reset(self):
        self.noise.reset()

    def update_exploration_strategy(self, experiences, gamma):
        """Decay the exploration rate and reset the noise process.
        The arguments are currently unused; the TD target itself is computed in the critic loss as
        Q_targets = r + γ * critic_target(next_state, actor_target(next_state))
        where:
            actor_target(state) -> action
            critic_target(state, action) -> Q-value
        Params
        ======
            experiences (Tuple[torch.Tensor]): tuple of (s, a, r, s', done) tuples
            gamma (float): discount factor
        """
        # ---------------------------- update noise ---------------------------- #
        self.epsilon -= self.epsilon_decay
        self.noise.reset()

    def soft_update(self, local_model, target_model):
        """Soft update model parameters.
        θ_target = τ*θ_local + (1 - τ)*θ_target
        Params
        ======
            local_model: PyTorch model (weights will be copied from)
            target_model: PyTorch model (weights will be copied to)
        The interpolation parameter τ is taken from self.tau.
        """
        for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
            target_param.data.copy_(self.tau*local_param.data + (1.-self.tau)*target_param.data)

    def __copy__(self, source, target):
        for src_param, target_param in zip(source.parameters(), target.parameters()):
            target_param.data.copy_(src_param.data)
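
To make the soft update concrete, here is a small NumPy sketch of the same interpolation on toy arrays. The function below is a stand-alone illustration with made-up parameter values, not the `DDPG.soft_update` method itself. `tau=1e-3` matches the value set in the hyperparameter cell of this notebook.

```python
import numpy as np

def soft_update(local_params, target_params, tau):
    """Blend each target array toward its local counterpart, in place."""
    for local, target in zip(local_params, target_params):
        target[...] = tau * local + (1.0 - tau) * target

# Toy parameters standing in for network weights (hypothetical values).
local_params = [np.ones((2, 2)), np.ones(2)]
target_params = [np.zeros((2, 2)), np.zeros(2)]

soft_update(local_params, target_params, tau=1e-3)
print(target_params[1])   # [0.001 0.001]
```

With such a small `tau`, the target networks trail the local networks slowly, which is the usual motivation for soft updates in DDPG-style methods.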



In [114]:
import torch
import numpy as np


class MADDPG:
    def __init__(self, device, gamma, ddpg_settings):
    
        '''
            ddpg_settings: dict 
        '''
        super(MADDPG, self).__init__()
        self.device = device
        self.marl = [DDPG(device=device, **ddpg_settings), DDPG(device=device, **ddpg_settings)]
        self.gamma = gamma
    
    def act(self, obs_per_agent, noise_scale=0.0):
        """get actions from all agents in the MADDPG object
            obs_per_agent: numpy array of shape 2x24, where 2 is the number of agents and 24 is the state size
        """
        # torch requires input of shape [num_samples, state_size], so we add a leading batch dimension
        actions = [agent.act(obs[np.newaxis,:], noise_scale) for agent, obs in zip(self.marl, obs_per_agent)]
        return actions
    
    def target_act(self, obs_per_agent, noise_scale=0.0):
        """get target network actions from all the agents in the MADDPG object """
        target_actions = [agent.target_act(obs[np.newaxis,:], noise_scale) for agent, obs in zip(self.marl, obs_per_agent)]
        return target_actions
    
    def critic_loss_function(self, agent_number, global_states, actions, rewards, dones,
                             next_local_states, next_global_states):
        '''
            agent_number: index of the ddpg agent whose critic loss is computed
        '''
        
        next_target_actions = [a.target_act(next_local_state) for a, next_local_state in zip(self.marl, next_local_states)]
        #print('next_target_actions ', len(next_target_actions), next_target_actions[0].shape)
        next_target_actions = torch.cat(next_target_actions, dim=-1)
        #print('next_target_actions ', next_target_actions.shape)
        with torch.no_grad():
            q_next = self.marl[agent_number].critic_target(next_global_states, next_target_actions.to(self.device))
        #print('q_next ', q_next.shape)
        #print('reward[agent_number] ', rewards[agent_number].shape, '  done[agent_number] ',  dones[agent_number].shape)
        y = rewards[agent_number] + self.gamma * q_next * (1. - dones[agent_number])
        #print('y ', y.shape)
        #print('global states ', global_states.shape, ' actions ', actions.shape)
        q = self.marl[agent_number].critic_local(global_states, actions)
        #print('q ', q.shape)
        return F.smooth_l1_loss(q, y.detach())
    
    def actor_loss_function(self, agent_number, local_states, global_states):
        
        predicted_actions = [self.marl[i].actor_local(state) if i == agent_number \
                             else self.marl[i].actor_local(state).detach() \
                             for i, state in enumerate(local_states)]
        #print('predicted_actions ', len(predicted_actions), predicted_actions[0].shape)
        predicted_actions = torch.cat(predicted_actions, dim=-1)
        #print('predicted_actions ', predicted_actions.shape)
        return -self.marl[agent_number].critic_local(global_states, predicted_actions).mean()
    
    def update(self, experiences, agent_number):
        '''
            Update learned critic and actor networks
            experiences: random samples from replay buffer
            agent_number: index to an agent
        '''
        local_states, global_states, actions, rewards, next_local_states, next_global_states, dones = experiences
        # get selected agent
        agent = self.marl[agent_number]
        
        #----------------update critic-----------------------
        # pass every tensor the critic loss needs; it previously relied on undefined names
        critic_loss = self.critic_loss_function(agent_number, global_states, actions, rewards, dones,
                                                next_local_states, next_global_states)
        agent.critic_optimizer.zero_grad()
        critic_loss.backward()
        torch.nn.utils.clip_grad_norm_(agent.critic_local.parameters(), 1.)
        agent.critic_optimizer.step()
        
        #----------------update actor-----------------------
        actor_loss = self.actor_loss_function(agent_number, local_states, global_states)
        agent.actor_optimizer.zero_grad()
        actor_loss.backward()
        #torch.nn.utils.clip_grad_norm_(agent.actor.parameters(),0.5)
        agent.actor_optimizer.step()
        
        return critic_loss.cpu().detach().item(), actor_loss.cpu().detach().item()
    
    def update_targets(self):
        '''Update target critic and actor networks'''
        for agent in self.marl:
            agent.soft_update(agent.critic_local, agent.critic_target)
            agent.soft_update(agent.actor_local, agent.actor_target)
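
The critic target used in `critic_loss_function` is `y = r + gamma * q_next * (1 - done)`. The NumPy sketch below reproduces that arithmetic on a made-up batch for a single agent. `td_target` is a hypothetical helper and the numbers are invented for illustration.

```python
import numpy as np

def td_target(rewards, q_next, dones, gamma=0.99):
    """y = r + gamma * Q'(x', a') * (1 - done), computed element-wise."""
    rewards = np.asarray(rewards, dtype=float)
    q_next = np.asarray(q_next, dtype=float)
    dones = np.asarray(dones, dtype=float)
    return rewards + gamma * q_next * (1.0 - dones)

# Toy batch of three transitions, each column of shape [batch, 1] (made-up numbers).
rewards = [[0.1], [0.0], [-0.01]]
q_next = [[0.5], [0.2], [0.3]]
dones = [[0], [0], [1]]

y = td_target(rewards, q_next, dones)
print(y.ravel())
```

The third transition is terminal, so its target reduces to the reward alone: the `(1 - done)` factor removes the bootstrapped value.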

In [81]:
def print_transitions(transition):
    print('obs ',len(transition), transition[0].shape)
    print('obs_full ', len(transition), transition[1].shape)#, len(transition[1][0][0]))
    print('actions ', len(transition),transition[2].shape)
    print('rewards ', len(transition), transition[3].shape)#, len(transition[1][0][0]))
    print('next obs ', len(transition), transition[4].shape)
    print('next obs full ', len(transition), transition[5].shape)#, len(transition[1][0][0]))
    print('dones ', len(transition), transition[6].shape)#, len(transition[1][0][0]))
    #print(transition[6])

def view_replay_buffer_data(replay_buffer):
    # only inspect the first stored transition when the buffer is non-empty
    if len(replay_buffer) > 0:
        transition = replay_buffer.memory[0]
        print_transitions(transition)

In [15]:
random_seed = 1234
num_batch_permute = 10
gamma = 0.99
tau = 1.e-3
buffer_size = int(1e6)
batch_size = 256
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
ddpg_settings = {'state_size':state_size, 'action_size':action_size, 'random_seed':random_seed,
            'hidden_in_dim':128, 'hidden_out_dim':128, 'activation':F.relu,
            'tau':tau, 'lr_actor':1e-3, 'lr_critic':1e-3,
           'weight_decay':0., 'epsilon':1., 'epsilon_decay':1e-6}


In [99]:
maddpg = MADDPG(device, gamma, ddpg_settings)
replay_buffer = ReplayBuffer(device, action_size, buffer_size, batch_size, random_seed) 

In [41]:
print(states.shape)
state0 = torch.from_numpy(states[0]).float().to(device)
state1 = torch.from_numpy(states[1]).float().to(device)
print(state0.shape, state0)
print(state1.shape, state1)
act0 = maddpg.marl[0].act(state0.unsqueeze(0), 1.)
print('act 0 ', act0.shape, act0)

actions = maddpg.act(states, 1.)
print('actions ', len(actions), ' action0 ', actions[0].shape)
print(actions)
actions = [action.numpy().squeeze() for action in actions] 
# print(actions)
# actions+= [np.random.randn(1,2)]
# print(actions)
print(np.hstack(actions).shape, '\n', np.hstack(actions))

(2, 24)
torch.Size([24]) tensor([  0.0000,   0.0000,   0.0000,   0.0000,   0.0000,   0.0000,   0.0000,
          0.0000,  -7.3793,  -1.5000,  -0.0000,   0.0000,   7.7616,   5.9882,
         -0.0000,   0.0000,  -9.8775,  -0.9832, -24.9819,   6.2152,   7.7616,
          5.8901, -24.9819,   6.2152])
torch.Size([24]) tensor([  0.0000,   0.0000,   0.0000,   0.0000,   0.0000,   0.0000,   0.0000,
          0.0000,  -7.2174,  -1.5000,   0.0000,   0.0000,  -7.7616,   5.9882,
          0.0000,   0.0000, -10.2174,  -1.5589, -30.0000,  -0.9810,  -7.7616,
          5.8901, -30.0000,  -0.9810])
act 0  torch.Size([1, 2]) tensor([[-0.1312,  0.5264]])
actions  2  action0  torch.Size([1, 2])
[tensor([[0.0553, 0.6361]]), tensor([[0.0751, 0.1770]])]
(4,) 
 [0.05529726 0.63613534 0.07509224 0.17695409]


In [69]:
max_t = 300
noise_scale = 1.
for i_episode in range(6):                              # play game for 6 episodes
    env_info = env.reset(train_mode=True)[brain_name]     # reset the environment    
    states = env_info.vector_observations                  # get the current state (for each agent)
    for t in range(max_t):
        actions = maddpg.act(states, noise_scale)
        #actions = np.vstack([action.numpy() for action in actions]) # num_agent x action_size
        actions = np.hstack([action.numpy().squeeze() for action in actions])
#         actions = np.random.randn(num_agents, action_size) # select an action (for each agent)
#         actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
        env_info = env.step(actions)[brain_name]           # send all actions to the environment
        next_states = env_info.vector_observations         # get next state (for each agent)
        rewards = env_info.rewards                         # get reward (for each agent)
        dones = env_info.local_done                        # see if episode finished
        replay_buffer.add(states, np.hstack(states), actions, np.vstack(rewards), next_states, np.hstack(next_states), np.vstack(dones).astype(np.uint8))
        scores += env_info.rewards                         # update the score (for each agent)
        states = next_states                               # roll over states to next time step
        if np.any(dones):                                  # exit loop if episode finished
            break
print(len(replay_buffer))
view_replay_buffer_data(replay_buffer)

362
obs  7 (2, 24)
obs_full  7 (48,)
actions  7 (4,)
rewards  7 (2, 1)
next obs  7 (2, 24)
next obs full  7 (48,)
dones  7 (2, 1)
[[0]
 [0]]
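
The shapes printed above follow from how the `replay_buffer.add` call packs one time step. A short NumPy sketch with placeholder zeros, assuming the sizes printed earlier in this notebook (2 agents, state size 24, action size 2):

```python
import numpy as np

num_agents, state_size, action_size = 2, 24, 2   # sizes printed earlier in the notebook

# Placeholder per-agent data for one time step (zeros stand in for real values).
states = np.zeros((num_agents, state_size))
actions = [np.zeros(action_size) for _ in range(num_agents)]
rewards = [0.0, 0.0]
dones = [False, False]

global_state = np.hstack(states)                  # concatenate both agents' states
joint_action = np.hstack(actions)                 # concatenate both agents' actions
reward_col = np.vstack(rewards)                   # one row per agent
done_col = np.vstack(dones).astype(np.uint8)      # one row per agent, 0/1 flags

print(global_state.shape, joint_action.shape, reward_col.shape, done_col.shape)
# (48,) (4,) (2, 1) (2, 1)
```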


In [83]:
experiences = replay_buffer.sample()
print_transitions(experiences)

obs  7 torch.Size([2, 256, 24])
obs_full  7 torch.Size([256, 48])
actions  7 torch.Size([256, 4])
rewards  7 torch.Size([2, 256, 1])
next obs  7 torch.Size([2, 256, 24])
next obs full  7 torch.Size([256, 48])
dones  7 torch.Size([2, 256, 1])
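
The per-agent tensors above have the agent index first, then the batch index. The `buffer` module is not shown here, so the following is only a sketch of one way to get that layout, mirroring the `np.stack(...).swapaxes(0,1)` pattern used in the exploration cells later in this notebook. A batch of 4 is used instead of 256 to keep it readable.

```python
import numpy as np

batch_size, num_agents, state_size = 4, 2, 24   # small batch for illustration

# One stored local state per sampled transition, each of shape (num_agents, state_size).
# Transition k is filled with the value k so it can be tracked through the reshaping.
local_states = [np.full((num_agents, state_size), k, dtype=float) for k in range(batch_size)]

stacked = np.stack(local_states)          # (batch_size, num_agents, state_size)
per_agent = stacked.swapaxes(0, 1)        # (num_agents, batch_size, state_size)

print(stacked.shape, per_agent.shape)     # (4, 2, 24) (2, 4, 24)
```

With this layout, `per_agent[i]` is the full batch of local states for agent `i`, which is what each actor consumes.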


In [103]:
agent_number = 0
agent = maddpg.marl[agent_number]
print(agent)
print(maddpg.gamma)

<__main__.DDPG object at 0x000001E68A1C42E8>
0.99


## Critic Loss

In [92]:
#critic_target = Network(state_size*2, action_size*2, hidden_in_dim=128, hidden_out_dim=128, activation=F.relu, is_actor=False).to(device)

In [101]:
local_states, global_states, actions, rewards, next_local_states, next_global_states, dones = experiences

In [108]:
next_target_actions = [agent.target_act(next_local_state, noise_scale) for agent, next_local_state in zip(maddpg.marl, next_local_states)]
print('next_target_actions ', len(next_target_actions), next_target_actions[0].shape)
next_target_actions = torch.cat(next_target_actions, dim=-1)
print('next_target_actions ', next_target_actions.shape)
with torch.no_grad():
    q_next = agent.critic_target(next_global_states, next_target_actions.to(device))
print('q_next ', q_next.shape)
print('reward[agent_number] ', rewards[agent_number].shape, '  done[agent_number] ',  dones[agent_number].shape)
y = rewards[agent_number] + maddpg.gamma * q_next * (1. - dones[agent_number])
print('y ', y.shape)
print('global states ', global_states.shape, ' actions ', actions.shape)
q = agent.critic_local(global_states, actions)
print('q ', q.shape)
critic_loss = F.smooth_l1_loss(q, y.detach())
print(critic_loss)

next_target_actions  2 torch.Size([256, 2])
next_target_actions  torch.Size([256, 4])
q_next  torch.Size([256, 1])
reward[agent_number]  torch.Size([256, 1])   done[agent_number]  torch.Size([256, 1])
y  torch.Size([256, 1])
global states  torch.Size([256, 48])  actions  torch.Size([256, 4])
q  torch.Size([256, 1])
tensor(0.0015, grad_fn=<SmoothL1LossBackward>)
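
The critic loss above uses `F.smooth_l1_loss`. As a rough NumPy sketch of what that loss computes, assuming PyTorch's default settings (`beta=1.0`, mean reduction): errors smaller than `beta` are penalized quadratically and larger ones linearly. The function name `smooth_l1` and the sample values are made up for illustration.

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Mean smooth L1 loss: quadratic for small errors, linear for large ones."""
    diff = np.abs(np.asarray(pred, dtype=float) - np.asarray(target, dtype=float))
    loss = np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return loss.mean()

# Made-up critic outputs and TD targets.
q = [0.10, 0.50, 3.00]
y = [0.00, 0.40, 0.00]
print(smooth_l1(q, y))
```

The linear branch keeps a single large TD error from dominating the gradient, which is one common reason to prefer this loss over a plain mean squared error.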


## Actor Loss

In [113]:
predicted_actions = [maddpg.marl[i].actor_local(state) if i == agent_number \
                     else maddpg.marl[i].actor_local(state).detach() \
                     for i, state in enumerate(local_states)]
print('predicted_actions ', len(predicted_actions), predicted_actions[0].shape)
predicted_actions = torch.cat(predicted_actions, dim=-1)
print('predicted_actions ', predicted_actions.shape)
actor_loss = -agent.critic_local(global_states, predicted_actions).mean()
print(actor_loss)

predicted_actions  2 torch.Size([256, 2])
predicted_actions  torch.Size([256, 4])
tensor(0.0553, grad_fn=<NegBackward>)
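
Written out as a formula, the actor loss computed above for agent $i$ can be sketched as follows. The notation is introduced here for illustration and is not defined elsewhere in this notebook: $x_k$ is the global state of sample $k$, $o_{j,k}$ is agent $j$'s local state, $\mu_j$ is agent $j$'s local actor, $Q_i$ is agent $i$'s local critic, $N$ is the batch size, and $M = 2$ is the number of agents.

$$
L_i(\theta_i) = -\frac{1}{N}\sum_{k=1}^{N} Q_i\big(x_k,\; \mu_1(o_{1,k}),\, \ldots,\, \mu_i(o_{i,k};\theta_i),\, \ldots,\, \mu_M(o_{M,k})\big)
$$

Only $\mu_i$ carries gradients. The other agent's action is detached, so minimizing $L_i$ updates agent $i$'s actor alone.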


In [96]:
x = critic_target.bn0(next_global_states)
x = critic_target.activation(critic_target.bn1(critic_target.fc1(x)))
print(x.shape, next_target_actions.shape)
x = torch.cat((x, next_target_actions), dim=-1)
print(x.shape)
x = critic_target.activation(critic_target.fc2_critic(x))
print(x.shape)
x = critic_target.fc3_critic(x)
print(x.shape)
x = critic_target(next_global_states, next_target_actions)
print(x.shape)

torch.Size([256, 128]) torch.Size([256, 4])
torch.Size([256, 132])
torch.Size([256, 128])
torch.Size([256, 1])
torch.Size([256, 1])


In [74]:
exp_local_states = [e.local_state for e in experiences if e is not None]
print(len(exp_local_states), exp_local_states[0].shape)
np_exp_local_states = np.stack(exp_local_states).swapaxes(0,1)
print('np_exp_local_states ', np_exp_local_states.shape)
### global state
exp_global_states = [e.global_state for e in experiences if e is not None]
print(len(exp_global_states), exp_global_states[0].shape)
np_exp_global_states = np.stack(exp_global_states)
print('np_exp_global_states ', np_exp_global_states.shape)
### actions
exp_actions = [e.action for e in experiences if e is not None]
print(len(exp_actions), exp_actions[0].shape)
np_exp_actions = np.stack(exp_actions)
print('np_exp_actions ', np_exp_actions.shape)
### rewards
exp_rewards = [e.reward for e in experiences if e is not None]
print(len(exp_rewards), exp_rewards[0].shape)
np_exp_rewards = np.stack(exp_rewards).swapaxes(0,1)
print('np_exp_rewards ', np_exp_rewards.shape)
### next state
exp_next_local_state = [e.next_local_state for e in experiences if e is not None]
print(len(exp_next_local_state), exp_next_local_state[0].shape)
np_exp_next_local_state = np.stack(exp_next_local_state).swapaxes(0,1)
print('np_exp_next_local_state ', np_exp_next_local_state.shape)
### next  global state
exp_next_global_states = [e.next_global_state for e in experiences if e is not None]
print(len(exp_next_global_states), exp_next_global_states[0].shape)
np_exp_next_global_states = np.stack(exp_next_global_states)
print('np_exp_next_global_states ', np_exp_next_global_states.shape)
### dones
exp_dones = [e.done for e in experiences if e is not None]
print(len(exp_dones), exp_dones[0].shape)
np_exp_dones = np.stack(exp_dones).swapaxes(0,1)
print('np_exp_dones ', np_exp_dones.shape)
exp_dones

256 (2, 24)
np_exp_local_states  (2, 256, 24)
256 (48,)
np_exp_global_states  (256, 48)
256 (4,)
np_exp_actions  (256, 4)
256 (2, 1)
np_exp_rewards  (2, 256, 1)
256 (2, 24)
np_exp_next_local_state  (2, 256, 24)
256 (48,)
np_exp_next_global_states  (256, 48)
256 (2, 1)
np_exp_dones  (2, 256, 1)


[array([[0],
        [0]], dtype=uint8), array([[0],
        [0]], dtype=uint8), array([[0],
        [0]], dtype=uint8), array([[0],
        [0]], dtype=uint8), array([[0],
        [0]], dtype=uint8), array([[0],
        [0]], dtype=uint8), array([[0],
        [0]], dtype=uint8), array([[0],
        [0]], dtype=uint8), array([[1],
        [1]], dtype=uint8), array([[0],
        [0]], dtype=uint8), array([[0],
        [0]], dtype=uint8), array([[0],
        [0]], dtype=uint8), array([[0],
        [0]], dtype=uint8), array([[0],
        [0]], dtype=uint8), array([[0],
        [0]], dtype=uint8), array([[0],
        [0]], dtype=uint8), array([[0],
        [0]], dtype=uint8), array([[0],
        [0]], dtype=uint8), array([[0],
        [0]], dtype=uint8), array([[0],
        [0]], dtype=uint8), array([[0],
        [0]], dtype=uint8), array([[0],
        [0]], dtype=uint8), array([[0],
        [0]], dtype=uint8), array([[0],
        [0]], dtype=uint8), array([[0],
        [0]], dtype=uint8),

In [75]:
# shape [num_agents, num_samples, state_size] --- [2, batch_size, 24]
local_states = torch.from_numpy(np.stack([e.local_state for e in experiences if e is not None]).swapaxes(0,1)).float().to(device)
# shape [num_samples, num_agentsxstate_size] --- [batch_size, 2x24=48]
global_states = torch.from_numpy(np.stack([e.global_state for e in experiences if e is not None])).float().to(device)
# shape [num_samples, num_agentsxaction_size] --- [batch_size, 2x2=4]
actions = torch.from_numpy(np.stack([e.action for e in experiences if e is not None])).float().to(device)
# shape [num_agents, num_samples, rewards_size] --- [2, batch_size, 1]
rewards = torch.from_numpy(np.stack([e.reward for e in experiences if e is not None]).swapaxes(0,1)).float().to(device)
# shape [num_agents, num_samples, state_size] --- [2, batch_size, 24]
next_local_states = torch.from_numpy(np.stack([e.next_local_state for e in experiences if e is not None]).swapaxes(0,1)).float().to(device)
# shape [num_samples, num_agentsxstate_size] --- [batch_size, 2x24=48]
next_global_states = torch.from_numpy(np.stack([e.next_global_state for e in experiences if e is not None])).float().to(device)
# shape [num_agents, num_samples, done_size] --- [2, batch_size, 1]
dones = torch.from_numpy(np.stack([e.done for e in experiences if e is not None]).swapaxes(0,1)).float().to(device)


torch.Size([2, 256, 1])
