# Multi-Agent Soccer with ML-Agents

In this notebook, we'll train a 2vs2 soccer team using Unity ML-Agents. We'll use the MA-POCA (Multi-Agent POsthumous Credit Assignment) algorithm to train cooperative behavior among agents.

## Overview

The environment is called `SoccerTwos`. The goal is to get the ball into the opponent's goal while preventing the ball from entering your own goal.

This notebook will guide you through:
1. Setting up ML-Agents
2. Understanding the environment
3. Understanding the MA-POCA algorithm
4. Configuring training parameters
5. Training the agents
6. Pushing the trained model to Hugging Face Hub
7. Participating in AI vs. AI challenges

## Step 0: Install ML-Agents and Download the Environment

First, we need to install ML-Agents and download the SoccerTwos environment executable.

In [None]:
# Install ML-Agents
!git clone https://github.com/Unity-Technologies/ml-agents
%cd ml-agents
!pip install -e ./ml-agents-envs
!pip install -e ./ml-agents

# Apple Silicon Mac users: if the installation above fails, try installing grpcio with conda first
# !conda install grpcio

### Download the Environment Executable

Based on your operating system, download one of these executables, unzip it, and place it in a new folder inside `ml-agents` called `training-envs-executables`:

- Windows: [Download executable](https://drive.google.com/file/d/1sqFxbEdTMubjVktnV4C6ICjp89wLhUcP/view?usp=sharing)
- Linux (Ubuntu): [Download executable](https://drive.google.com/file/d/1KuqBKYiXiIcU4kNMqEzhgypuFP5_45CL/view?usp=sharing)
- Mac: [Download executable](https://drive.google.com/drive/folders/1h7YB0qwjoxxghApQdEUQmk95ZwIDxrPG?usp=share_link)

For Mac users, you'll also need to run this command to make the executable runnable:
```bash
xattr -cr training-envs-executables/SoccerTwos/SoccerTwos.app
```

## Step 1: Understand the Environment

### SoccerTwos Environment

The SoccerTwos environment is a 2vs2 soccer game where each team has two agents. The goal is to get the ball into the opponent's goal while preventing the ball from entering your own goal.

### Reward Function

The reward function is:
- +1 when the team scores a goal
- -1 when the opponent team scores a goal
- Small negative reward each step to encourage faster goal-scoring

### Observation Space

The observation space is composed of vectors of size 336:
- 11 ray-casts forward distributed over 120 degrees (264 state dimensions)
- 3 ray-casts backward distributed over 90 degrees (72 state dimensions)
- Both sets of ray-casts can detect 6 object types:
  - Ball
  - Blue Goal
  - Purple Goal
  - Wall
  - Blue Agent
  - Purple Agent
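The 336 figure can be checked with a little arithmetic. The sketch below assumes that each ray encodes a one-hot vector over the 6 detectable object types plus 2 extra values (a hit flag and a normalized distance), and that 3 consecutive observations are stacked. Those per-ray details are assumptions about how the ray sensors are laid out, not something stated in this notebook.

```python
# Assumed encoding (not stated in this notebook): one-hot over the detectable
# object types plus 2 extra values per ray, with 3 stacked observations.
NUM_OBJECT_TYPES = 6
VALUES_PER_RAY = NUM_OBJECT_TYPES + 2
STACKED_OBSERVATIONS = 3


def ray_sensor_size(num_rays: int) -> int:
    """Return the flattened size of one stacked ray-cast sensor."""
    return num_rays * VALUES_PER_RAY * STACKED_OBSERVATIONS


forward_size = ray_sensor_size(11)   # forward rays over 120 degrees
backward_size = ray_sensor_size(3)   # backward rays over 90 degrees
total_size = forward_size + backward_size

print(forward_size, backward_size, total_size)  # prints: 264 72 336
```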

### Action Space

The action space consists of three discrete branches:
1. Forward/Backward movement
2. Rotation
3. Side movement
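To get a feel for how large this branched action space is, the joint actions can be enumerated. This is only a sketch: it assumes each branch offers 3 options (a no-op plus two directions), and the option labels are made up for illustration.

```python
from itertools import product

# Assumption: 3 options per branch. The labels below are hypothetical names,
# not identifiers from the environment.
BRANCHES = {
    "forward_backward": ("none", "forward", "backward"),
    "rotation": ("none", "rotate_left", "rotate_right"),
    "side": ("none", "left", "right"),
}

# One joint action is a tuple with one choice per branch.
joint_actions = list(product(*BRANCHES.values()))

print(len(joint_actions))  # prints: 27
```

Because the branches are independent, the policy only has to output three small categorical distributions rather than a single distribution over every combination.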

## Step 2: Understand MA-POCA

MA-POCA (Multi-Agent POsthumous Credit Assignment) is a training algorithm designed for cooperative multi-agent scenarios.

### Key Concepts:

1. **Centralized Critic**: A centralized critic processes the states of all agents in the team to estimate how well each agent is doing. Think of this critic as a coach.

2. **Decentralized Execution**: Each agent makes decisions based only on what it perceives locally.

3. **Credit Assignment**: The algorithm helps determine the contribution of each agent to the team's success.

This approach combines Self-Play (to learn competitive behavior against opponents) with MA-POCA (to learn cooperative behavior within the team).
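The split between a centralized critic and decentralized execution can be illustrated with a toy example. The sketch below is not the ML-Agents implementation: the real trainer uses neural networks and is considerably more involved. The linear "networks", their weights, and the function names here are all hypothetical and only show which inputs each component is allowed to see.

```python
from typing import List, Sequence

Observation = Sequence[float]


def actor(local_obs: Observation, weights: List[List[float]]) -> int:
    """Decentralized execution: choose an action from this agent's own observation only."""
    scores = [sum(w * x for w, x in zip(row, local_obs)) for row in weights]
    return max(range(len(scores)), key=scores.__getitem__)


def centralized_critic(team_obs: List[Observation], weights: List[float]) -> float:
    """Centralized critic: score the concatenated observations of the whole team."""
    joint = [x for obs in team_obs for x in obs]
    return sum(w * x for w, x in zip(weights, joint))


# Two teammates with two-dimensional toy observations (hypothetical values).
team_obs = [[1.0, 0.0], [0.0, 1.0]]
actor_weights = [[0.5, -0.5], [-0.5, 0.5]]
critic_weights = [0.25, 0.25, 0.25, 0.25]

actions = [actor(obs, actor_weights) for obs in team_obs]     # each agent acts alone
team_value = centralized_critic(team_obs, critic_weights)     # the "coach" sees everyone

print(actions, team_value)  # prints: [0, 1] 0.5
```

The critic is only needed during training, so at inference time each agent can run from its local observation alone.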

## Step 3: Define the Configuration File

Let's create a configuration file for our training. This file will define the hyperparameters for the MA-POCA algorithm and self-play settings.

In [2]:
%%writefile config/poca/SoccerTwos.yaml
behaviors:
  SoccerTwos:
    trainer_type: poca
    hyperparameters:
      batch_size: 2048
      buffer_size: 20480
      learning_rate: 0.0003
      beta: 0.005
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 3
      learning_rate_schedule: constant
    network_settings:
      normalize: false
      hidden_units: 512
      num_layers: 2
      vis_encode_type: simple
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
    keep_checkpoints: 5
    max_steps: 5000000
    time_horizon: 1000
    summary_freq: 10000
    self_play:
      save_steps: 50000
      team_change: 200000
      swap_steps: 2000
      window: 10
      play_against_latest_model_ratio: 0.5
      initial_elo: 1200.0


### Understanding the Configuration Parameters

Let's break down some of the key parameters:

#### Hyperparameters:
- `batch_size`: Number of experiences used in each gradient descent iteration
- `buffer_size`: Number of experiences to collect before each policy update
- `learning_rate`: Step size of each gradient descent update
- `epsilon`: Clipping threshold that limits how far the updated policy may diverge from the old one in a single update
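As a rough illustration of how `batch_size`, `buffer_size`, and `num_epoch` interact, the sketch below works through the values from the configuration file above. It assumes that each epoch makes one full pass over the buffer in `batch_size` chunks, which is the usual pattern for this family of algorithms but is an assumption here.

```python
# Values copied from the configuration file above.
hyperparameters = {"batch_size": 2048, "buffer_size": 20480, "num_epoch": 3}

# Assumption: every epoch sweeps the whole buffer once, one minibatch at a time.
minibatches_per_epoch = hyperparameters["buffer_size"] // hyperparameters["batch_size"]
gradient_steps_per_update = minibatches_per_epoch * hyperparameters["num_epoch"]

print(minibatches_per_epoch, gradient_steps_per_update)  # prints: 10 30
```

Keeping `buffer_size` a multiple of `batch_size` avoids a leftover partial minibatch.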

#### Network Settings:
- `hidden_units`: Number of units in the hidden layers
- `num_layers`: Number of hidden layers

#### Self-Play Parameters:
- `save_steps`: Number of trainer steps between snapshots of the policy
- `team_change`: Number of trainer steps between switching which team is learning
- `swap_steps`: Number of steps between swapping the opponent's policy for a different snapshot
- `window`: Size of the sliding window of past snapshots from which opponents are sampled
- `play_against_latest_model_ratio`: Probability of playing against the most recent policy rather than a past snapshot
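The way `window` and `play_against_latest_model_ratio` shape the choice of opponent can be sketched in a few lines. This is a simplified model of the idea, not ML-Agents' actual code: the function name, the snapshot names, and the uniform sampling inside the window are assumptions.

```python
import random
from typing import List, Optional


def pick_opponent(
    snapshots: List[str],
    window: int,
    latest_ratio: float,
    rng: Optional[random.Random] = None,
) -> str:
    """Return "latest" or the name of one of the last `window` snapshots."""
    rng = rng or random.Random()
    if not snapshots or rng.random() < latest_ratio:
        return "latest"
    # Assumption: snapshots inside the window are sampled uniformly.
    return rng.choice(snapshots[-window:])


# 15 hypothetical snapshots, one saved every `save_steps` steps.
snapshots = [f"snapshot_{i}" for i in range(15)]
rng = random.Random(0)
picks = [
    pick_opponent(snapshots, window=10, latest_ratio=0.5, rng=rng)
    for _ in range(1000)
]
latest_share = picks.count("latest") / len(picks)

print(f"share of matches against the latest policy: {latest_share:.2f}")
```

A larger `window` exposes the agents to older and more varied opponents, while a higher ratio keeps them focused on beating their most recent selves.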

You can modify these parameters to experiment with different training strategies.

## Step 4: Start the Training

Now we're ready to start training our agents. This will take several hours (5-8 hours for 5M timesteps), so be prepared to let your computer run for a while.

**Note**: The training command below should be run in a terminal, not in this notebook, as it will run for many hours.

In [3]:
# Pick the command for your OS. When running it in a terminal, drop the leading `!`.
# For Windows:
# !mlagents-learn ./config/poca/SoccerTwos.yaml --env=./training-envs-executables/SoccerTwos.exe --run-id="SoccerTwos" --no-graphics

# For Mac:
!mlagents-learn ./config/poca/SoccerTwos.yaml --env=./training-envs-executables/SoccerTwos/SoccerTwos.app --run-id="SoccerTwos" --no-graphics

# For Linux:
# !mlagents-learn ./config/poca/SoccerTwos.yaml --env=./training-envs-executables/SoccerTwos/SoccerTwos.x86_64 --run-id="SoccerTwos" --no-graphics


### Training Tips

- It's normal not to see a big increase in Elo score (or even to see it drop below 1200) before 2M timesteps. Your agents will spend most of their time moving randomly on the field before they learn to score goals.
- You can stop the training with Ctrl + C, but be careful to only press it once to allow ML-Agents to generate the final .onnx file before closing.
- Training progress can be monitored using TensorBoard.
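To see why the rating can dip below its starting value of 1200 early on, it helps to look at how an Elo update works. The sketch below uses the standard Elo formula. The K-factor of 16 is a hypothetical choice for illustration, and ML-Agents' internal bookkeeping may differ in its details.

```python
def expected_score(rating: float, opponent_rating: float) -> float:
    """Expected result (between 0 and 1) for `rating` under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((opponent_rating - rating) / 400.0))


def updated_rating(
    rating: float, opponent_rating: float, result: float, k: float = 16.0
) -> float:
    """Apply one Elo update. `result` is 1.0 for a win, 0.5 for a draw, 0.0 for a loss.

    The K-factor `k` is a hypothetical value chosen for this example.
    """
    return rating + k * (result - expected_score(rating, opponent_rating))


start = 1200.0
after_loss = updated_rating(start, 1200.0, result=0.0)
after_win = updated_rating(start, 1200.0, result=1.0)

print(after_loss, after_win)  # prints: 1192.0 1208.0
```

Against an equally rated opponent the expected score is 0.5, so a run of early losses is enough to pull the rating under 1200 before the policy starts to improve.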

## Step 5: Push the Agent to the Hugging Face Hub

After training, you can push your model to the Hugging Face Hub to participate in the AI vs. AI challenge.

In [None]:
# Login to Hugging Face
!huggingface-cli login

# Push your model to the Hub
# Replace the placeholders with your actual values
# !mlagents-push-to-hf --run-id="SoccerTwos" --local-dir="./results/SoccerTwos" --repo-id="YourUsername/poca-SoccerTwos" --commit-message="First Push"

## Step 6: Verify Your Model for the AI vs. AI Challenge

To ensure your model is ready for the AI vs. AI Challenge, check that:

1. Your model has the tag `ML-Agents-SoccerTwos` in its Hugging Face repository
2. Your repository contains a `SoccerTwos.onnx` file

If these conditions are met, your model will be automatically added to the challenge pool.
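Before pushing, the second condition can be checked locally. The sketch below searches the results folder for the `.onnx` file. The helper name is hypothetical, and the folder is assumed to match the `--local-dir` used in the push command. The search is recursive because the exact location of the file inside that folder is an assumption.

```python
from pathlib import Path
from typing import List


def find_onnx_files(results_dir: str, name: str = "SoccerTwos.onnx") -> List[Path]:
    """Return every file called `name` found anywhere under `results_dir`."""
    root = Path(results_dir)
    if not root.is_dir():
        return []
    return sorted(root.rglob(name))


# Assumed location: the same folder passed as --local-dir when pushing.
matches = find_onnx_files("./results/SoccerTwos")
print("ready to push" if matches else "no SoccerTwos.onnx found yet")
```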

## Step 7: Visualize Matches in the Demo

You can visualize how your model performs against others using the [ML-Agents-SoccerTwos demo](https://huggingface.co/spaces/unity/ML-Agents-SoccerTwos).

To do this:
1. Go to the demo
2. Select your model as team blue (or purple)
3. Select another model to compete against
4. Watch the match!

The best opponents to compare your model against are either the top models on the leaderboard or the [baseline model](https://huggingface.co/unity/MLAgents-SoccerTwos).

## Conclusion

In this notebook, we've learned how to:
- Set up ML-Agents for multi-agent reinforcement learning
- Understand the SoccerTwos environment and MA-POCA algorithm
- Configure and train cooperative agents
- Push models to the Hugging Face Hub
- Participate in AI vs. AI challenges

This represents a practical application of multi-agent reinforcement learning where agents need to both cooperate with teammates and compete against opponents.