# Introduction to Multi-Agent RL

## Motivation for Multi-Agent Systems
* The ultimate goal of AI is to solve intelligence.
* We live in a multi-agent world; we do not become intelligent in isolation.
* We learn from our own experiences and from those of others.
* Our intelligence is therefore a result of our interactions with multiple agents over our lifetime.
* If we want to build intelligent agents that are used in the real world, they have to interact with humans and also with other agents.
* This leads to the multi-agent scenario.
* The multi-agent case is a very complex kind of environment because all the agents are learning simultaneously while also interacting with one another.
### Summary
* We live in a multi-agent world.
* Intelligent agents have to interact with humans.
* Agents need to work in complex environments.

## Applications of Multi-Agent Systems
* Some potential real-life applications of multi-agent systems:
1. A group of drones or robots whose aim is to pick up a package and drop it at its destination is a multi-agent system.
2. In the stock market, each person who is trading can be considered an agent, and profit maximization can be modeled as a multi-agent problem.
3. Interactive robots or humanoids that interact with humans to get some task done are multi-agent systems if we consider the humans to be agents.
4. Windmills in a wind farm can be thought of as multiple agents.
    * It would be cool if the agents, that is, the wind turbines, figured out the optimal directions to face by themselves and obtained maximum energy from the wind farm.
    * The aim here is to collaboratively maximize the profit obtained from the wind farm.

## Benefits of Multi-Agent Systems
* The agents can share their experiences with one another, making each other smarter, just as we learn from our teachers and friends.
* However, when agents want to share, they have to communicate, which carries a cost of communication, such as extra hardware and software capabilities.
* A multi-agent system is robust.
* Agents can be replaced with a copy when they fail. Other agents in the system can take over the tasks of the failed agent, but the substituting agent now has to do some extra work.
* Scalability comes by virtue of design, as most multi-agent systems allow new agents to be inserted easily.
* But if more agents are added to the system, the system becomes more complex than before.
* So whether or not these advantages can be exploited depends on the assumptions made by the algorithm and on the software and hardware capabilities of the agents.
* From here onwards, we will learn about multi-agent RL, also known as **MARL**.
* When a multi-agent system uses reinforcement learning techniques to train the agents and make them learn their behaviours, we call the process **multi-agent reinforcement learning**.
* Next we learn about the __framework__ for MARL, which plays the role that Markov decision processes (__MDPs__) play for __single-agent RL__.

# Markov Games
* Consider an example of single-agent reinforcement learning.
* We have a drone with the task of grabbing a package. The possible actions are going right, left, up, down, and grasping.
* The reward is +50 for grasping the package, and -1 otherwise.
* The difference in multi-agent RL is that we have more than one agent. Say we have a second drone. Now both agents are collaboratively trying to grasp the package.
* They are both observing the package from their respective positions.
* They both have their own policies that return an action for their observations.
* Both have their own set of actions. **The main thing about multi-agent RL is that there is also a __joint set of actions__.**
* Both the left drone and the right drone must choose an action, and the two choices together form one joint action.
* For example, the pair DL means that the left drone moves down and the right drone moves left.
* This example illustrates the Markov game framework, which we are now ready to discuss in more detail.
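
The joint action set from the drone example can be enumerated directly. The following is a minimal sketch; the single-letter action labels are an assumption made for illustration (only the pair "DL" appears in the text).

```python
from itertools import product

# Individual action set shared by both drones (labels are illustrative).
ACTIONS = ["R", "L", "U", "D", "G"]  # right, left, up, down, grasp

# Joint action set: every (left drone action, right drone action) pair.
joint_actions = list(product(ACTIONS, repeat=2))

print(len(joint_actions))           # 25
print(("D", "L") in joint_actions)  # True
```

With 5 actions per drone, the two drones together have 5 x 5 = 25 joint actions.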


* A Markov game is a tuple written as:
$$(n,S,A_1,...,A_n,O_1,...,O_n,R_1,...,R_n,\pi_1,...,\pi_n,T)$$
    - $n$: number of agents
    - $S$: set of environment states
    - $A$: $A_1 \times A_2 \times \dots \times A_n$, the joint action space ($A_i$ is the set of actions of agent $i$)
    - $O_i$: set of observations of agent $i$
    - $R_i$: $S \times A_i \to \mathbb{R}$, the reward function of agent $i$, which returns a real value for taking an action in a particular state
    - $\pi_i$: $O_i \to A_i$, the policy of agent $i$, which, given an observation, returns a probability distribution over the actions $A_i$
    - $T$: $S \times A \to S$, the state transition function, which, given the current state and the joint action, provides a probability distribution over the set of possible next states
    
    
* Note that even here the state transitions are Markovian, just like in an MDP. Recall that **Markovian** means that the **next state depends only on the present state and the action taken in this state.**
* However, the transition function now depends on the **joint action**.
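
To make the dependence on the joint action concrete, here is a small sketch of a transition function. It assumes a hypothetical one-dimensional world with deterministic moves, which is a special case of the probabilistic $T$ defined above; the names `MOVES` and `transition` are illustrative.

```python
# Hypothetical 1-D world: each drone sits at an integer position.
MOVES = {"L": -1, "R": +1, "S": 0}  # left, right, stay

def transition(state, joint_action):
    """Next state computed from the current state and the joint action only."""
    return tuple(pos + MOVES[a] for pos, a in zip(state, joint_action))

state = (0, 5)
next_state = transition(state, ("R", "L"))
print(next_state)  # (1, 4)
```

The function never looks at earlier states, which is the Markov property, and it needs the actions of both drones to produce the next state.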


## Approaches to MARL
* Can we adapt the single-agent RL techniques we've learned about so far to the multi-agent case?
* Two extreme approaches come to mind.
* The simplest approach is to train all the agents **independently**, without considering the **existence** of other agents.
* In this approach, each agent **considers all the others to be a part of the environment** and learns its own policy.
* Since all agents are learning simultaneously, the environment as seen from the perspective of a single agent **changes dynamically**.
* This condition is called **non-stationarity** of the environment.
* Most single-agent algorithms assume that the environment is **stationary**, which leads to certain **convergence** guarantees.
* Hence, under **non-stationarity** conditions, these **guarantees of convergence no longer hold**.
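
Non-stationarity can be seen in a toy example. The sketch below assumes a hypothetical two-agent matching game (reward 1 when both agents pick the same action, 0 otherwise); it is not taken from the lesson, and only illustrates how the value of a fixed action drifts while the other agent learns.

```python
def expected_reward(my_action, p_other_plays_0):
    """Expected reward when the other agent plays action 0 with the given probability."""
    return p_other_plays_0 if my_action == 0 else 1.0 - p_other_plays_0

# Early in training the other agent mostly plays 0; later it mostly plays 1.
early = expected_reward(my_action=0, p_other_plays_0=0.9)
late = expected_reward(my_action=0, p_other_plays_0=0.1)
print(early, late)  # 0.9 0.1
```

From the first agent's point of view, nothing about its own action changed, yet the "environment" now pays a different expected reward for it.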
#### Second Approach
* The second approach is the **meta-agent** approach.
* The meta-agent approach takes into account the **existence of multiple agents.**
* Here, a single policy is learned for all the agents. It takes as input the present state of the environment and returns the action of each agent in the form of a single joint **action vector**.
$$Policy: S\to A_1 \times A_2 \times \dots \times A_n$$
* Typically, a **single reward function, given the environment state and the action vector, returns a global reward**:
$$R: S \times A \to \mathbb{R}$$
* The joint action space grows **exponentially** with the number of agents.
* If the environment is **partially observable**, or the agents can only see locally, each agent will have a different **observation** of the environment state, so it will be difficult to **disambiguate** the state of the environment from the different local observations.
* So this approach works well only when each agent knows **everything** about the environment.
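
The exponential growth of the joint action space is easy to check numerically. This sketch assumes every agent has the same number of actions (5, as in the drone example); `joint_action_count` is an illustrative helper name.

```python
def joint_action_count(actions_per_agent, n_agents):
    """Size of A_1 x ... x A_n when every agent has the same number of actions."""
    return actions_per_agent ** n_agents

for n in (1, 2, 5, 10):
    print(n, joint_action_count(5, n))
# 1 5
# 2 25
# 5 3125
# 10 9765625
```

A meta-agent policy for ten such agents would have to choose among almost ten million joint actions at every step.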

## Cooperation, Competition, Mixed Environments
#### Example Condition
### CASE 1
* Let's pretend that you and your sister are playing a game of Pong.
* We are given one bank of 100 coins, from which we plan to buy a video game console.
* Each time either of us misses the ball, we lose one coin from the bank to our parents.
* Hence, we will both try to keep the ball in play to have as many coins as possible at the end. -- __cooperation__
* This is an example of a cooperative environment, where the agents are **concerned** with accomplishing a group task and cooperate to do so.
### CASE 2
* Consider that now we both have separate banks.
* Whoever misses the ball gives a coin from their bank to the other.
* So now, instead of cooperating, we're competing with each other.
* One sibling's gain is the other's loss.
* This is an example of a competitive environment, where the agents are just concerned with maximizing their own rewards.


* Notice that in the cooperative setting both of us lose a coin, while in the competitive setting one loses a coin when the other gains a coin.
* So the way the reward is defined makes the agents' behaviour apparently competitive or apparently collaborative.
* In many environments, the agents have to show a mixture of both cooperative and competitive behaviours, which leads to mixed cooperative-competitive environments.
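
The two Pong cases differ only in how a missed ball is turned into rewards. The following sketch encodes that difference; the function name and the modelling of "losing a coin" as a reward of -1 are assumptions made for illustration.

```python
def pong_rewards(who_missed, mode):
    """Return (reward_player_0, reward_player_1) for one missed ball."""
    if mode == "cooperative":
        # Shared bank: both players are worse off, whoever missed.
        return (-1, -1)
    if mode == "competitive":
        # Separate banks: the player who missed pays the other one coin.
        return (-1, +1) if who_missed == 0 else (+1, -1)
    raise ValueError(f"unknown mode: {mode}")

print(pong_rewards(0, "cooperative"))  # (-1, -1)
print(pong_rewards(0, "competitive"))  # (-1, 1)
```

In the competitive case the two rewards always sum to zero, which is what makes one sibling's gain the other's loss.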

## Research Topics
* The field of multi-agent RL is abuzz with cutting-edge research.
* Recently, OpenAI announced that its team of five neural networks, OpenAI Five, has learned to defeat amateur Dota 2 players.
* OpenAI Five has been trained using a scaled-up version of **PPO**.
* Coordination between agents is controlled using a hyperparameter called `team_spirit`.
* It ranges from zero to one, where zero means the agents care only about their **individual reward** functions, while one means that they care completely about the team's reward function.
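
One way to picture `team_spirit` is as a weight that blends each agent's own reward with the team average. The linear blend below is an assumption made for illustration; the text above only fixes the meaning of the two endpoints, 0 and 1, and `blend_rewards` is a hypothetical helper.

```python
def blend_rewards(individual_rewards, team_spirit):
    """Mix each agent's own reward with the team mean (illustrative formula)."""
    team_mean = sum(individual_rewards) / len(individual_rewards)
    return [(1.0 - team_spirit) * r + team_spirit * team_mean
            for r in individual_rewards]

rewards = [1.0, 0.0, 0.0, 0.0, 4.0]
print(blend_rewards(rewards, 0.0))  # [1.0, 0.0, 0.0, 0.0, 4.0]
print(blend_rewards(rewards, 1.0))  # [1.0, 1.0, 1.0, 1.0, 1.0]
```

At 0 every agent sees only its own reward; at 1 every agent sees the same team-level reward.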

There are many interesting papers out there on MARL. For the purposes of this lesson we will stick to one particular paper, [Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments](https://papers.nips.cc/paper/7217-multi-agent-actor-critic-for-mixed-cooperative-competitive-environments.pdf) by OpenAI.

## Paper Description, Part 1
* The paper we have chosen implements a multi-agent version of DDPG.
* DDPG, as we might remember, is an off-policy actor-critic algorithm that uses the concept of target networks.
* The input of the actor network is the current state, while the output is a real value or a vector representing an action chosen from a continuous action space.
* OpenAI has created a multi-agent environment called multi-agent particle.
* It consists of particles, which are the agents, and some landmarks.
* A lot of interesting experimental scenarios have been laid out in this environment.
* We will be working on one of the many scenarios, called physical deception.
* Here, n agents cooperate to reach the target landmark out of n landmarks.
* There is an adversary which is also trying to reach the target landmark, but it doesn't know which of the n landmarks is the target.
<img src = "images/a23.png">

## Paper Description, Part 2
* The normal agents are rewarded based on the least distance of any of the agents to the target landmark, and penalized the closer the adversary gets to the target landmark.
* Under this reward structure, the agents cooperate to spread out across all the landmarks, so as to deceive the adversary.
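
The reward structure described above can be sketched as a small function. This is an illustration under stated assumptions (2-D positions, unscaled Euclidean distances, one adversary), not the environment's actual implementation; `good_agent_reward` is a hypothetical name.

```python
import math

def good_agent_reward(good_positions, adversary_position, target_position):
    """Shared reward for the cooperating agents (illustrative, unscaled)."""
    closest_good = min(math.dist(p, target_position) for p in good_positions)
    adversary_dist = math.dist(adversary_position, target_position)
    # Closer good agent -> higher reward; closer adversary -> lower reward.
    return -closest_good + adversary_dist

target = (0.0, 0.0)
r = good_agent_reward([(3.0, 4.0), (0.0, 1.0)], (6.0, 8.0), target)
print(r)  # 9.0
```

Because only the closest good agent counts, the other good agent is free to sit on a different landmark and mislead the adversary.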

* The paper adopts the framework of centralized training with decentralized execution.
* This means that some extra information is used to ease training, but that information is not used at test time.
* This framework can be naturally implemented using an actor-critic algorithm.
#### Important
* During training, the __critic__ for each agent uses **extra information**, such as the __states observed and actions taken by all the other agents__.
* As for the actors, there is one for each agent.
* Each actor has access only to its own agent's observations and actions.
* During execution, only the actors are present, and hence only each agent's own observations and actions are used.
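
The difference between what an actor sees and what a critic sees shows up directly in the input sizes. The sketch below uses the dimensions from the lab code later in this document (14-dimensional observations, 2-dimensional actions, 3 agents); the helper names are illustrative.

```python
OBS_DIM = 14     # observation size used in the lab code
ACTION_DIM = 2   # 2-D force per agent
N_AGENTS = 3     # one adversary and two good agents

def actor_input(own_obs):
    """Decentralized: the actor only sees its own observation."""
    return list(own_obs)

def critic_input(obs_full, all_actions):
    """Centralized: the critic sees the full observation and every agent's action."""
    flat_actions = [x for action in all_actions for x in action]
    return list(obs_full) + flat_actions

obs = [0.0] * OBS_DIM
actions = [[0.0] * ACTION_DIM for _ in range(N_AGENTS)]
print(len(actor_input(obs)))            # 14
print(len(critic_input(obs, actions)))  # 20
```

At execution time only `actor_input` is needed, so no agent has to know what the others observe or do.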
<img src = "images/a24.png">

## Summary 
* We began by introducing ourselves to the multi-agent systems present in our surroundings.
* We reasoned why multi-agent systems are an important piece of the puzzle in solving AI, and decided to pursue this complex topic.
* We also studied the Markov game framework, which is a generalization of MDPs to the multi-agent case.
* We talked about using single-agent RL algorithms as they are in the multi-agent case.
* This leads either to **non-stationarity** or to a large joint action space.
* We saw the interesting kinds of environments present in the multi-agent case, namely cooperative, competitive, and mixed.
* Towards the end, we implemented the multi-agent DDPG algorithm, a **centralized training** and **decentralized execution** algorithm that can be used in any of the above environments.


## Mini Project -- Physical Deception
For this lab, we will train agents to solve the **Physical Deception** problem.

### Goal of the environment
The blue dots are the **good agents**, and the red dot is the **adversary**. The goal of every agent is to get near the green target. The blue agents know which target is green, but the red agent is color-blind and does not know which target is green and which is black! The optimal solution is for the red agent to chase one of the blue agents, and for the blue agents to split up and go towards each of the targets.

### Running within the workspace (recommended option)
* No explicit setup commands need to be run by you; we have taken care of all the installations in this lab. Enjoy exploring.
* `./run_training.sh` lets you run the program based on the parameters provided in the main program.
* `./run_tensorboard.sh` gives you a URL to view the dashboard, where you will find visualizations of how your agents are performing. Use this as a guide to see how the changes you made are affecting the program.
* The folder named `Model_dir` stores the `episode-XXX.gif` files, which visualize how your agents are performing.

* `torch.norm()` computes the $p$-norm $||x||_{p} = \sqrt[p]{|x_{1}|^{p} + |x_{2}|^{p} + \ldots + |x_{N}|^{p}}$. By default it returns the 2-norm taken over all elements of the tensor.
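
The formula can be checked by hand on a small vector. The sketch below reimplements the $p$-norm in plain Python, so it does not call `torch.norm()` itself; `p_norm` is an illustrative helper.

```python
def p_norm(values, p=2):
    """p-norm of a flat sequence: (sum |x_i|^p)^(1/p)."""
    return sum(abs(v) ** p for v in values) ** (1.0 / p)

x = [3.0, -4.0]
print(p_norm(x, p=2))  # 5.0
print(p_norm(x, p=1))  # 7.0
```

For $p = 2$ this is the usual Euclidean length, which is how the actor network below measures the size of its output force.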

# MADDPG LAB

## networkforall.py

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as f
import numpy as np

def hidden_init(layer):
    # nn.Linear stores its weight as (out_features, in_features), so size()[0]
    # is the number of output units; the lab uses this value as "fan_in"
    fan_in = layer.weight.data.size()[0]
    lim = 1. / np.sqrt(fan_in)
    return (-lim, lim)

class Network(nn.Module):
    def __init__(self, input_dim, hidden_in_dim, hidden_out_dim,
                 output_dim, actor=False):
        super(Network, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_in_dim)
        self.fc2 = nn.Linear(hidden_in_dim, hidden_out_dim)
        self.fc3 = nn.Linear(hidden_out_dim, output_dim)
        self.nonlin = f.relu  # leaky_relu is an alternative
        self.actor = actor

    def reset_parameters(self):
        # optional custom initialization; not called from __init__
        self.fc1.weight.data.uniform_(*hidden_init(self.fc1))
        self.fc2.weight.data.uniform_(*hidden_init(self.fc2))
        self.fc3.weight.data.uniform_(-1e-3, 1e-3)

    def forward(self, x):
        if self.actor:
            h1 = self.nonlin(self.fc1(x))
            h2 = self.nonlin(self.fc2(h1))
            h3 = self.fc3(h2)
            # torch.norm takes the norm over the whole tensor
            norm = torch.norm(h3)
            # h3 is a 2D vector (a force that is applied to the agent)
            # we bound the norm of the vector to be between 0 and 10
            return 10.0 * torch.tanh(norm) * h3 / norm if norm > 0 else 10 * h3

        else:
            # critic network simply outputs a number
            h1 = self.nonlin(self.fc1(x))
            # feed the first hidden layer's output into the second layer
            h2 = self.nonlin(self.fc2(h1))
            h3 = self.fc3(h2)
            return h3
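
The norm-bounding step at the end of the actor's forward pass can be tried out in isolation. The sketch below mirrors that one line in plain Python (no PyTorch), so it is an illustration of the idea rather than the lab's code; `bound_action` and `length` are hypothetical helpers.

```python
import math

def length(v):
    """Euclidean norm of a flat vector."""
    return math.sqrt(sum(x * x for x in v))

def bound_action(h, max_norm=10.0):
    """Rescale h so its norm becomes max_norm * tanh(|h|), keeping its direction."""
    norm = length(h)
    if norm == 0.0:
        return list(h)
    scale = max_norm * math.tanh(norm) / norm
    return [scale * x for x in h]

small = bound_action([0.03, 0.04])    # input norm 0.05
large = bound_action([300.0, 400.0])  # input norm 500
print(length(small))  # about 0.5, close to 10 * 0.05
print(length(large))  # saturates at 10
```

Since `tanh` maps any non-negative norm into the interval from 0 to 1, the resulting force can never exceed `max_norm`, however large the raw network output is.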

## ddpg.py


In [None]:
# individual network settings for each actor + critic pair

# see networkforall for details
from networkforall import Network
from utilities import hard_update, gumbel_softmax, onehot_from_logits
from torch.optim import Adam
import torch
import numpy as np

# add OU noise for exploration
from OUNoise import OUNoise
device = "cpu"

class DDPGAgent:
    def __init__(self, in_actor, hidden_in_actor, hidden_out_actor, out_actor,
                 in_critic, hidden_in_critic, hidden_out_critic, lr_actor=1.0e-2,
                 lr_critic=1.0e-2):
        super(DDPGAgent, self).__init__()
        self.actor = Network(in_actor, hidden_in_actor, hidden_out_actor,
                             out_actor, actor=True).to(device)
        self.critic = Network(in_critic, hidden_in_critic, hidden_out_critic,
                              1).to(device)
        # the target networks must have the same shapes as the local networks
        self.target_actor = Network(in_actor, hidden_in_actor, hidden_out_actor,
                                    out_actor, actor=True).to(device)
        self.target_critic = Network(in_critic, hidden_in_critic, hidden_out_critic,
                                     1).to(device)
        self.noise = OUNoise(out_actor, scale=1.0)

        # initialize the targets with the same weights as the local networks
        self.hard_update(self.target_actor, self.actor)
        self.hard_update(self.target_critic, self.critic)

        self.actor_optimizer = Adam(self.actor.parameters(), lr=lr_actor)
        self.critic_optimizer = Adam(self.critic.parameters(), lr=lr_critic, weight_decay=1.0e-5)

    def act(self, obs, noise=0.0):
        obs = obs.to(device)
        action = self.actor(obs) + noise * self.noise.noise()
        return action

    def target_act(self, obs, noise=0.0):
        obs = obs.to(device)
        action = self.target_actor(obs) + noise * self.noise.noise()
        return action

    def hard_update(self, target, source):
        """copy the source parameters into the target network"""
        for target_param, param in zip(target.parameters(), source.parameters()):
            target_param.data.copy_(param.data)

    def soft_update(self, target, source, tau):
        """move the target parameters a fraction tau towards the source parameters"""
        for target_param, param in zip(target.parameters(), source.parameters()):
            target_param.data.copy_(target_param.data * (1 - tau) + tau * param.data)
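
The effect of `soft_update` is easier to see on plain numbers. The sketch below applies the same rule, `target = (1 - tau) * target + tau * source`, to Python lists instead of network parameters; it is an illustration, not the lab's code.

```python
def soft_update(target, source, tau):
    """Return target moved a fraction tau towards source."""
    return [(1.0 - tau) * t + tau * s for t, s in zip(target, source)]

target = [0.0, 0.0]
source = [1.0, -1.0]

one_step = soft_update(target, source, tau=0.02)
print(one_step)  # [0.02, -0.02]

# After many small steps the target has drifted most of the way to the source.
for _ in range(100):
    target = soft_update(target, source, tau=0.02)
print(target)
```

With `tau=0.02` the target closes 2% of the remaining gap per update, so it trails the local network smoothly, which is what keeps the training targets from jumping around.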
            

## maddpg.py

In [None]:
## main code that contains the neural network setup
# policy + critic updates

from ddpg import DDPGAgent
import torch
from utilities import soft_update, transpose_to_tensor, transpose_list

device = "cpu"

class MADDPG:
    def __init__(self, discount_factor=0.95, tau=0.02):
        super(MADDPG, self).__init__()
        
        # critic input = obs_full + actions = 14 (state space) + 2 (action of agent 1) + 2 (action of agent 2) + 2 (action of agent 3) = 20
        # three agents in total: one adversary and two good agents
        self.maddpg_agent = [DDPGAgent(14, 16, 8, 2, 20, 32, 16),
                            DDPGAgent(14, 16, 8, 2, 20, 32, 16),
                            DDPGAgent(14, 16, 8, 2, 20, 32, 16)]
        
        self.discount_factor = discount_factor
        self.tau = tau
        self.iter = 0
        
    def get_actors(self):
        """get actors of all the agents in the MADDPG object"""
        actors = [ddpg_agent.actor for ddpg_agent in self.maddpg_agent]
        return actors
    
    def get_target_actors(self):
        """get target_actors of all the agents in the MADDPG object"""
        target_actors = [ddpg_agent.target_actor for ddpg_agent in self.maddpg_agent]
        return target_actors
    
    def act(self, obs_all_agents, noise=0.0):
        """get actions from all agents in the MADDPG object"""
        actions = [agent.act(obs, noise) for agent, obs in zip(self.maddpg_agent, obs_all_agents)]
        return actions
    
    def target_act(self, obs_all_agents, noise=0.0):
        """Get target network actions from all the agent in the MADDPG object"""
        
        target_actions = [ddpg_agent.target_act(obs, noise) for ddpg_agent,obs in zip(self.maddpg_agent, obs_all_agents)]
        return target_actions
    
    def update(self, samples, agent_number, logger):
        """update the critics and actors of all the agents"""
        
        #need to transpose each element of the samples
        #to flip obs ::  dim -> [parallel_agent][agent_number] to 
        # obs :: dim -> [agent_number][parallel_agent]
        obs, obs_full, action, reward, next_obs, next_obs_full, done = map(transpose_to_tensor, samples)
        
        obs_full = torch.stack(obs_full)
        next_obs_full = torch.stack(next_obs_full)
        
        agent = self.maddpg_agent[agent_number]
        
        ## ============================== ##
        #       Critic Training            #
        ## ============================== ##
        
        agent.critic_optimizer.zero_grad()
        
        # critic loss = batch mean of the Huber loss between y and Q(s, a) from the local critic
        # y = reward at this timestep + discount * Q(s_{t+1}, a_{t+1}) from the target critic
        # a_{t+1} comes from the target actors
        target_actions = self.target_act(next_obs)  # one (batch, 2) tensor per agent, three agents
        target_actions = torch.cat(target_actions, dim=1)  # size (batch, 6)
        
        target_critic_input = torch.cat((next_obs_full.t(),target_actions),dim=1).to(device)
        
        with torch.no_grad():
            q_next = agent.target_critic(target_critic_input)
        y = reward[agent_number].view(-1,1) + self.discount_factor * q_next * (1 - done[agent_number].view(-1,1))
        
        action = torch.cat(action, dim=1)
        critic_input = torch.cat((obs_full.t(),action), dim=1).to(device)
        q = agent.critic(critic_input)
        
        huber_loss = torch.nn.SmoothL1Loss()
        critic_loss = huber_loss(q,y.detach())
        critic_loss.backward()
        
        agent.critic_optimizer.step()
        ## ============================== ##
        #         Agent Training           #
        ## ============================== ##
        
        agent.actor_optimizer.zero_grad()
        # build the critic input from the actions of the current actors
        # detach the other agents' actions so that no gradients are computed for their actors
        # (see the documentation of the .detach() method); this saves computation

        q_input = [self.maddpg_agent[i].actor(ob) if i == agent_number
                   else self.maddpg_agent[i].actor(ob).detach()
                   for i, ob in enumerate(obs)]
        q_input = torch.cat(q_input, dim=1)
        # combine all the actions and observations for input to the critic
        # many of the obs are redundant, and obs[1] contains all useful information already
        
        q_input2 = torch.cat((obs_full.t(), q_input), dim =1)
        
        # get the policy gradient
        actor_loss = -agent.critic(q_input2).mean()
        actor_loss.backward()
        agent.actor_optimizer.step()
        
        al = actor_loss.cpu().detach().item()
        cl = critic_loss.cpu().detach().item()
        logger.add_scalars('agent%i/losses' % agent_number,
                   {'critic loss': cl,
                    'actor_loss': al},
                   self.iter)

    def update_targets(self):
        """soft update targets"""
        self.iter +=1
        for ddpg_agent in self.maddpg_agent:
            ddpg_agent.soft_update(ddpg_agent.target_actor, ddpg_agent.actor, self.tau)
            ddpg_agent.soft_update(ddpg_agent.target_critic, ddpg_agent.critic, self.tau)
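
The target `y` used in the critic update can be worked through for a single transition. The sketch below uses scalars instead of batched tensors, with the same discount factor as the `MADDPG` default (0.95); `td_target` is an illustrative helper.

```python
def td_target(reward, q_next, done, discount_factor=0.95):
    """y = r + discount * Q_target(s', a') * (1 - done)."""
    return reward + discount_factor * q_next * (1 - done)

print(td_target(reward=-1.0, q_next=20.0, done=0))  # 18.0
print(td_target(reward=-1.0, q_next=20.0, done=1))  # -1.0
```

The `(1 - done)` factor removes the bootstrapped value on terminal transitions, so the target for the final step of an episode is just the immediate reward.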
            
        

In [1]:
import torch

In [10]:
a = torch.randn(3,2)
a

tensor([[-0.4740,  0.2601],
        [ 0.9360, -0.2556],
        [ 0.6471, -1.2109]])

In [11]:
torch.norm(a)

tensor(1.7660)