# Simple Spread Testing

## Standard Simple Spread (Collaborative Only)

Agent observations: `[self_vel, self_pos, landmark_rel_positions, other_agent_rel_positions, communication]`
 - `self_vel = (2, )`
 - `self_pos = (2, )`
 - `landmark_rel_positions = (2 * N, )`
 - `other_agent_rel_positions = (2 * (N - 1), )`
 - `communication = (2 * (N - 1), )`

Agent action space: `[no_action, move_left, move_right, move_down, move_up] = (0-4)` 
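
The layout above can be turned into index ranges. This is a minimal sketch, assuming the fields are concatenated in exactly the order listed and that the number of landmarks equals the number of agents `N`; `spread_obs_slices` is a hypothetical helper, not part of PettingZoo.

```python
def spread_obs_slices(n):
    """Map each observation field to its slice for N agents and N landmarks."""
    sizes = [
        ("self_vel", 2),
        ("self_pos", 2),
        ("landmark_rel_positions", 2 * n),
        ("other_agent_rel_positions", 2 * (n - 1)),
        ("communication", 2 * (n - 1)),
    ]
    slices, start = {}, 0
    for name, size in sizes:
        slices[name] = slice(start, start + size)
        start += size
    # `start` now equals the total observation length
    return slices, start


slices, total = spread_obs_slices(5)
print(total)
print(slices["communication"])
```

A field can then be read with `observations["agent_0"][slices["landmark_rel_positions"]]`, and `total` can be compared with the shape printed by the cells below.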

In [None]:
from pettingzoo.mpe import simple_spread_v3

In [None]:
env = simple_spread_v3.parallel_env(N=5)
observations, infos = env.reset()
observations, infos


In [None]:
env.num_agents

In [None]:
env.state()

In [None]:
print(observations["agent_0"].shape)
observations["agent_0"]

In [None]:
env.action_space("agent_0")

In [None]:
# this is where you would insert your policy
# actions = {agent: env.action_space(agent).sample() for agent in env.agents}
actions = {agent: 0 for agent in env.agents}  # every agent takes no_action (0)
observations, rewards, terminations, truncations, infos = env.step(actions)
observations, rewards, terminations, truncations, infos

## Adversarial Variant (Custom)

Agent observations: `[self_is_adversary, self_vel, self_pos, landmark_rel_positions, other_agent_is_adversary_rel_positions]`
 - `self_is_adversary = (1, )`: 0 / 1 flag
 - `self_vel = (2, )`
 - `self_pos = (2, )`
 - `landmark_rel_positions = (2 * n_landmarks, )`
 - `other_agent_is_adversary_rel_positions = ((1 + 2) * (n_agents + n_adversaries - 1), )`: for each other agent, a 0 / 1 flag marking whether that agent is an adversary, plus that agent's 2D relative position

Agent action space: `[no_action, move_left, move_right, move_down, move_up] = (0-4)` 
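
The sizes above can be checked and unpacked with a small helper. This is only a sketch: both function names are hypothetical, and the ordering inside the last block (flag, then x, then y, repeated per other agent) is an assumption that should be checked against `simple_spread_adversarial.py`.

```python
def adversarial_obs_dim(n_agents, n_adversaries, n_landmarks):
    """Expected observation length for the adversarial variant."""
    n_others = n_agents + n_adversaries - 1
    return 1 + 2 + 2 + 2 * n_landmarks + (1 + 2) * n_others


def split_adversarial_obs(obs, n_agents, n_adversaries, n_landmarks):
    """Split a flat observation into named fields.

    ASSUMPTION: each other agent contributes (is_adversary, rel_x, rel_y)
    as three consecutive values. The real ordering may differ.
    """
    obs = list(obs)
    lm_end = 5 + 2 * n_landmarks
    tail = obs[lm_end:]
    n_others = n_agents + n_adversaries - 1
    others = [
        {"is_adversary": tail[3 * i], "rel_pos": tail[3 * i + 1:3 * i + 3]}
        for i in range(n_others)
    ]
    return {
        "self_is_adversary": obs[0],
        "self_vel": obs[1:3],
        "self_pos": obs[3:5],
        "landmark_rel_positions": obs[5:lm_end],
        "others": others,
    }


print(adversarial_obs_dim(n_agents=2, n_adversaries=2, n_landmarks=2))
```

The computed length can be compared with `observations["agent_0"].shape` in the cells below.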

In [None]:
%load_ext autoreload
%autoreload 2

import simple_spread_adversarial

In [None]:
env = simple_spread_adversarial.parallel_env(n_agents=2, n_adversaries=2, n_landmarks=2)
observations, infos = env.reset()
observations, infos

In [None]:
env.num_agents, env.agents

In [None]:
env.state()

In [None]:
print(observations["agent_0"].shape)
observations["agent_0"]

In [None]:
env.action_space("agent_0")

In [None]:
# this is where you would insert your policy
actions = {agent: env.action_space(agent).sample() for agent in env.agents}
observations, rewards, terminations, truncations, infos = env.step(actions)
observations, rewards, terminations, truncations, infos

In [None]:
# Visualize full episode
env = simple_spread_adversarial.parallel_env(
    n_agents=2,
    n_adversaries=2,
    n_landmarks=3,
    render_mode="human"
)
observations, infos = env.reset()

while env.agents:
    # this is where you would insert your policy
    actions = {agent: env.action_space(agent).sample() for agent in env.agents}

    observations, rewards, terminations, truncations, infos = env.step(actions)
env.close()


# Testing Communication

In [None]:
temp = 3
env = simple_spread_v3.parallel_env(N=temp)
observations, infos = env.reset()
observations, infos

In [None]:
for agent in env.agents:
    observation = observations[agent]
    
    self_vel = observation[:2]
    self_pos = observation[2:4]
    idx = 4 + temp * 2
    landmark = observation[4:idx]
    idx2 = idx + (temp - 1) * 2
    other_pos = observation[idx:idx2]
    comms = observation[idx2:]
    
    print("self vel: ", self_vel)
    print("self pos: ", self_pos)
    print("landmarks: ", landmark)
    print("other players: ", other_pos)
    print("comms: ", comms)
    print("")

In [None]:
#     def observation(self, agent, world):
#         # get positions of all entities in this agent's reference frame
#         entity_pos = []
#         for entity in world.landmarks:  # world.entities:
#             entity_pos.append(entity.state.p_pos - agent.state.p_pos)
#         # communication of all other agents
#         comm = []
#         other_pos = []
#         for other in world.agents:
#             if other is agent:
#                 continue
#             comm.append(other.state.c)
#             other_pos.append(other.state.p_pos - agent.state.p_pos)
#         return np.concatenate(
#             [agent.state.p_vel] + [agent.state.p_pos] + entity_pos + other_pos + comm
#         )

Thoughts:
 - Vary the amount of comm being transferred. By default, the other agents' positions are included outside of the comm vector.
 - Potential baselines: mask landmarks / mask other pos. Masking both doesn't make much sense as it essentially becomes ...
 - Run on a small number of iterations to learn a policy.

Ideas for a custom-defined comm vector:
 - Provide the velocity of self to the other agents (2 per other agent, `2 * (N - 1)` in total, like right now).
 - Alternatively, provide the Euclidean distance to each of the landmarks (N per agent). My thinking is that it would explicitly force the agents to learn, instead of learning implicitly via the reward function. This could be either an absolute L2 distance or some binary variable.
 - The binary variable could either be N per other agent (1 if within some parameter bound of landmark x, 0 if not) or 2 per other agent (1 if within some parameter bound of any landmark, 0 if not).
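
The distance-based comm ideas and the masking baselines can be sketched in plain Python. Everything here is illustrative: the function names, the `radius` parameter, and the assumption that observations follow the standard layout (`self_vel`, `self_pos`, landmarks, other positions, comm) are not taken from `simple_spread_comms.py`.

```python
import math


def landmark_distances(agent_pos, landmarks):
    """Absolute L2 distance from one agent to every landmark (N values)."""
    return [math.dist(agent_pos, lm) for lm in landmarks]


def near_each_landmark(agent_pos, landmarks, radius):
    """N binary flags: 1.0 if the agent is within `radius` of landmark x."""
    return [1.0 if d <= radius else 0.0
            for d in landmark_distances(agent_pos, landmarks)]


def near_any_landmark(agent_pos, landmarks, radius):
    """Single binary flag: 1.0 if the agent is within `radius` of any landmark."""
    return 1.0 if any(near_each_landmark(agent_pos, landmarks, radius)) else 0.0


def mask_field(obs, n, field):
    """Baseline: return a copy of `obs` with one field zeroed.

    `field` is "landmarks" or "others"; assumes N agents and N landmarks.
    """
    if field == "landmarks":
        start, size = 4, 2 * n
    elif field == "others":
        start, size = 4 + 2 * n, 2 * (n - 1)
    else:
        raise ValueError(f"unknown field: {field}")
    masked = list(obs)
    masked[start:start + size] = [0.0] * size
    return masked


landmarks = [(3.0, 4.0), (0.0, 0.5), (1.0, 0.0)]
print(landmark_distances((0.0, 0.0), landmarks))
print(near_each_landmark((0.0, 0.0), landmarks, radius=0.75))
```

Each function describes what one sender would broadcast; how many values every receiver ends up with depends on how the env concatenates the messages from the other agents.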

# Communication (Custom)

In [1]:
import simple_spread_comms

In [10]:
# temp sets the number of landmarks and players; the printing cell below reuses it to slice the observation
# could split it up like in the adversarial file so that landmarks and players are different numbers
temp = 3
# see the bottom of simple_spread_comms.py for what each comm_mode (0-4) does
env = simple_spread_comms.parallel_env(N=temp, comm_mode=4)
observations, infos = env.reset()
observations, infos

({'agent_0': array([ 0.        ,  0.        ,  0.15948704, -0.44753256, -0.16793479,
          0.74102026,  0.5801595 , -0.16484697, -0.4447318 ,  1.1122671 ,
         -1.0538346 ,  0.6712349 ,  0.5373433 ,  0.39537844,  0.        ,
          0.        ], dtype=float32),
  'agent_1': array([ 0.        ,  0.        , -0.89434755,  0.22370231,  0.8858998 ,
          0.06978536,  1.6339941 , -0.83608186,  0.6091027 ,  0.44103223,
          1.0538346 , -0.6712349 ,  1.5911779 , -0.27585644,  0.        ,
          0.        ], dtype=float32),
  'agent_2': array([ 0.        ,  0.        ,  0.6968304 , -0.05215412, -0.7052781 ,
          0.3456418 ,  0.04281615, -0.5602254 , -0.98207515,  0.71688867,
         -0.5373433 , -0.39537844, -1.5911779 ,  0.27585644,  0.        ,
          0.        ], dtype=float32)},
 {'agent_0': {}, 'agent_1': {}, 'agent_2': {}})

In [3]:
for agent in env.agents:
    observation = observations[agent]
    
    self_vel = observation[:2]
    self_pos = observation[2:4]
    idx = 4 + temp * 2
    landmark = observation[4:idx]
    idx2 = idx + (temp - 1) * 2
    other_pos = observation[idx:idx2]
    comms = observation[idx2:]
    
    print("self vel: ", self_vel)
    print("self pos: ", self_pos)
    print("landmarks: ", landmark)
    print("relative pos of other players: ", other_pos)
    print("comms: ", comms)
    print("")

self vel:  [0. 0.]
self pos:  [0.6473063  0.52806616]
landmarks:  [-1.2742863  -1.1879019  -1.4164813  -0.7793342  -1.140535    0.41489962]
relative pos of other players:  [-0.4060039  -0.41880074 -1.0631205  -0.43103477]
comms:  [0. 1.]

self vel:  [0. 0.]
self pos:  [0.24130242 0.10926543]
landmarks:  [-0.8682824  -0.7691011  -1.0104774  -0.36053345 -0.7345311   0.83370036]
relative pos of other players:  [ 0.4060039   0.41880074 -0.65711653 -0.01223403]
comms:  [0. 1.]

self vel:  [0. 0.]
self pos:  [-0.41581413  0.0970314 ]
landmarks:  [-0.21116579 -0.75686705 -0.35336083 -0.3482994  -0.07741455  0.8459344 ]
relative pos of other players:  [1.0631205  0.43103477 0.65711653 0.01223403]
comms:  [0. 0.]



In [11]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

In [12]:
# DQN PLACEHOLDER
class DQN(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super(DQN, self).__init__()
        self.fc1 = nn.Linear(state_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = nn.Linear(hidden_dim, action_dim)

    def forward(self, state):
        x = torch.relu(self.fc1(state))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)
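
The training cell that follows sets `epsilon = 1.0`, `epsilon_decay = 0.995` and `epsilon_min = 0.01`, decaying once per episode. A quick sketch of what that schedule implies (the helper names are hypothetical):

```python
import math


def epsilon_after(episodes, start=1.0, decay=0.995, floor=0.01):
    """Epsilon after `episodes` per-episode decay steps, clipped at `floor`."""
    eps = start
    for _ in range(episodes):
        eps = max(floor, eps * decay)
    return eps


def episodes_to_floor(start=1.0, decay=0.995, floor=0.01):
    """Smallest number of decay steps after which epsilon sits at `floor`."""
    return math.ceil(math.log(floor / start) / math.log(decay))


print(round(epsilon_after(5), 4))  # 0.9752
print(episodes_to_floor())         # 919
```

With `num_episodes = 5`, epsilon only falls to about 0.975, so the agents act almost entirely at random. Reaching `epsilon_min` takes roughly 919 episodes at this decay rate.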

In [13]:
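num_episodes = 5
episode_max_length = 100

gamma = 0.95
learning_rate = 0.001
epsilon = 1.0
epsilon_decay = 0.995
epsilon_min = 0.01

# reset() returns (observations, infos). Indexing that tuple with an agent
# name is what raised:
#   TypeError: tuple indices must be integers or slices, not str
states, infos = env.reset()

# One Q-network shared by all agents; sizes are read from the env itself
first_agent = env.agents[0]
q_network = DQN(len(states[first_agent]), env.action_space(first_agent).n)
optimizer = optim.Adam(q_network.parameters(), lr=learning_rate)

reward_of_episodes = []
for episode in range(num_episodes):
    states, infos = env.reset()
    total_reward = 0

    for t in range(episode_max_length):
        # Epsilon-greedy action selection for every live agent
        actions = {}
        for agent in env.agents:
            if np.random.rand() < epsilon:
                actions[agent] = env.action_space(agent).sample()
            else:
                with torch.no_grad():
                    q_values = q_network(torch.FloatTensor(states[agent]))
                actions[agent] = int(torch.argmax(q_values))

        next_states, rewards, terminations, truncations, infos = env.step(actions)

        # Iterate over the agents that acted: env.agents is emptied on the final step
        for agent in actions:
            state = torch.FloatTensor(states[agent])
            next_state = torch.FloatTensor(next_states[agent])
            reward = float(rewards[agent])
            done = terminations[agent]

            # Calculate the target Q-value (no bootstrapping past a terminal state)
            with torch.no_grad():
                target_q_value = reward + gamma * torch.max(q_network(next_state)) * (1.0 - float(done))

            # Calculate the current Q-value
            current_q_value = q_network(state)[actions[agent]]

            # Calculate the loss and perform optimization
            loss = nn.functional.mse_loss(current_q_value, target_q_value)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            total_reward += reward

        states = next_states

        # Stop once every agent has terminated or been truncated
        if not env.agents:
            break

    # Decay epsilon
    epsilon = max(epsilon_min, epsilon * epsilon_decay)

    # Other possible code here
    reward_of_episodes.append(total_reward)

    print(f"Episode {episode + 1}: Total Reward = {total_reward}")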