<a href="https://colab.research.google.com/github/wohecha/HuggingFace-Unit1/blob/main/notebooks/unit7/unit7.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---
# A2A Agent vs. Agent.
---
based on:
https://huggingface.co/learn/deep-rl-course/unit7/hands-on

/!\ **This will take 5 to 8 hours of training** /!\

## Rules

1) Don't change the observation space or the action space of the agent.  
*Otherwise the model will not work during evaluation.*
2) Use the Unity MLAgents trainer.
3) To avoid bugs, use the provided executable.  
*The Unity Editor can be used at your own risk.*

If you run into issues, open an issue on the GitHub repo:  
https://github.com/huggingface/deep-rl-class/issues

## Install MLAgents and executables

Conda is recommended both as the package manager and for creating the environment.  
Install conda, then create the environment.

### Install conda and activate environment

In [None]:
# change notebooks directory (return to /content)
%cd /content/

In [None]:
# Download and install Miniconda
!wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
!chmod +x Miniconda3-latest-Linux-x86_64.sh
!./Miniconda3-latest-Linux-x86_64.sh -b -f -p /usr/local

In [None]:
# accept terms of service
!conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/main
!conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/r

In [None]:
%%capture
# Activate Miniconda and install Python ver 3.10.12
# Note: each `!` line runs in its own subshell, so the effects of `source`
# and `export` below do not persist into later lines or cells.
!source /usr/local/bin/activate
!conda install -q -y --prefix /usr/local python=3.10.12 ujson  # Specify the version here

# Set environment variables for Python and conda paths
!export PYTHONPATH=/usr/local/lib/python3.10/site-packages/
!export CONDA_PREFIX=/usr/local/envs/myenv

In [None]:
# Python Version in New Virtual Environment (Compatible with ML-Agents)
!python --version
#should be 3.10.xx
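
The same check can be done programmatically. This is a minimal sketch, assuming the interpreter prints its version in the usual `Python X.Y.Z` form; `is_supported` is a hypothetical helper and not part of ML-Agents.

```python
import re


def is_supported(version_output: str, required=(3, 10)) -> bool:
    """Return True if `python --version` output matches the required major.minor."""
    match = re.search(r"Python (\d+)\.(\d+)", version_output)
    if match is None:
        return False
    return (int(match.group(1)), int(match.group(2))) == required


print(is_supported("Python 3.10.12"))  # True
print(is_supported("Python 3.12.3"))   # False
```

In the notebook, the string to check would be the output of `!python --version`.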

In [None]:
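%%capture
# -y skips the interactive confirmation prompt
!conda create -y --name rl python=3.10.12
# Note: `!conda activate` runs in its own subshell and does not persist to later cells
!conda activate rl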

In [None]:
# verify that environment is activated
!echo $CONDA_DEFAULT_ENV

### Install MLAgents

In [None]:
# change notebooks directory (return to /content)
%cd /content/

In [None]:
%%capture
# clone the repository (2.63GB)
!git clone https://github.com/Unity-Technologies/ml-agents

In [None]:
%%capture
%cd ml-agents
!pip install -e ./ml-agents-envs
!pip install -e ./ml-agents

If you run into issues, see:  
https://github.com/Unity-Technologies/ml-agents/issues/6019

### git-LFS
Git LFS is used for versioning large files and should be preinstalled by default since 2022:  
https://github.com/git-lfs/git-lfs/issues/3605

In [None]:
!git-lfs

### Install executables

In [None]:
# Create directory for the executables
!mkdir -p /content/ml-agents/training-envs-executables/linux

In [None]:
# download the executable
!wget "https://drive.usercontent.google.com/open?id=1sqFxbEdTMubjVktnV4C6ICjp89wLhUcP&authuser=0" -O /content/ml-agents/training-envs-executables/linux/SoccerTwo.zip

In [None]:
# unzip the executable
!unzip -d ./training-envs-executables/linux/ ./training-envs-executables/linux/SoccerTwo.zip

In [None]:
# make sure the file is accessible
!chmod -R 755 ./training-envs-executables/linux/SoccerTwo
!ls -lah ./training-envs-executables/linux/

### Understand MLAgents

MLAgents was created by the Unity ML-Agents team.  
The documentation for this environment can be found here:  
https://github.com/Unity-Technologies/ml-agents/blob/develop/docs/Learning-Environment-Examples.md#soccer-twos

The objective of the game is to get the ball into the opponent's goal while preventing the ball from entering your own goal.

#### - Reward function:
- `1 - accumulated time penalty`:  
  When the ball enters the opponent's goal. The accumulated time penalty is incremented by `1/maxStep` every fixed update and is reset to 0 at the beginning of an episode.
- `-1`:  
  When the ball enters the team's own goal.
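
The reward rule above can be sketched in a few lines. This is an illustration only: `goal_reward` and `MAX_STEP` are hypothetical names, and the episode length of 1000 is an assumed value, not one taken from the environment.

```python
def goal_reward(fixed_updates: int, max_step: int) -> float:
    """Reward for the scoring team: 1 minus the accumulated time penalty."""
    # The penalty grows by 1/max_step on every fixed update of the episode.
    time_penalty = fixed_updates / max_step
    return 1.0 - time_penalty


MAX_STEP = 1000        # assumed episode length, for illustration only
CONCEDE_REWARD = -1.0  # reward when the ball enters the team's own goal

print(goal_reward(0, MAX_STEP))    # 1.0
print(goal_reward(250, MAX_STEP))  # 0.75
```

The earlier a team scores within an episode, the closer its reward is to 1.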

#### - Observation space:
The observation space is a vector of size 336:
- 11 forward ray-casts distributed over 120° *(264 state dimensions)*
- 3 backward ray-casts distributed over 90° *(72 state dimensions)*
- Both sets of ray-casts can detect 6 object types:
  - Ball
  - Blue Goal
  - Purple Goal
  - Wall
  - Blue Agent
  - Purple Agent
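
The sizes listed above can be reproduced with a short calculation. The decomposition below is an assumption that is not stated in this notebook: each ray is taken to encode a one-hot over the 6 detectable objects plus 2 extra values (a hit flag and a normalized distance), stacked over 3 consecutive observations.

```python
NUM_TAGS = 6                    # Ball, Blue Goal, Purple Goal, Wall, Blue Agent, Purple Agent
VALUES_PER_RAY = NUM_TAGS + 2   # assumption: one-hot tag + hit flag + distance
STACKED_STEPS = 3               # assumption: three stacked observations


def ray_sensor_size(num_rays: int) -> int:
    """Number of floats produced by one ray-cast sensor under the assumptions above."""
    return num_rays * VALUES_PER_RAY * STACKED_STEPS


forward = ray_sensor_size(11)
backward = ray_sensor_size(3)
print(forward, backward, forward + backward)  # 264 72 336
```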

#### - Action Space:
The action space consists of 3 discrete branches:
- Forward axis: forward/backward
- Sideways: left/right
- Rotation: left/right
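
A branched discrete action picks one option per branch at every step. The sketch below assumes that each branch has three options, with index 0 meaning "do nothing"; the branch sizes and the `sample_action` helper are assumptions for illustration, not values read from the environment.

```python
import random

# Assumption: each branch has three options (index 0 is a no-op).
BRANCHES = {
    "forward": ("none", "forward", "backward"),
    "sideways": ("none", "left", "right"),
    "rotation": ("none", "left", "right"),
}


def sample_action(rng: random.Random) -> dict:
    """Pick one option index per branch, independently of the other branches."""
    return {name: rng.randrange(len(options)) for name, options in BRANCHES.items()}


# Number of distinct joint actions is the product of the branch sizes.
num_joint_actions = 1
for options in BRANCHES.values():
    num_joint_actions *= len(options)

action = sample_action(random.Random(0))
print(num_joint_actions)  # 27
print({name: BRANCHES[name][index] for name, index in action.items()})
```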

### Understanding MA-POCA

MA-POCA: *Multi-Agent POsthumous Credit Assignment*:
##### Context:
Self-play works well for 1 vs. 1.  
In our case we play 2 vs. 2 (each team has 2 agents), so cooperative behaviour is required.

Agents receive a reward as a group (+1 - penalty) when their team scores a goal.  
-> Every agent on the team is rewarded, even if the agents did not contribute equally to the goal.

<i>Source: Unity Blog:<a href="https://blog.unity.com/technology/ml-agents-v20-release-now-supports-training-complex-cooperative-behaviors" target="_blank" rel="noopener noreferrer"> ML-Agents v2.0</a></i>

##### Problem:
But this makes it difficult for an agent to learn what to do independently.

##### Solution:
The Unity MLAgents team developed a multi-agent trainer called **MA-POCA: *Multi-Agent POsthumous Credit Assignment***.

##### The idea:
A centralized critic processes the state of all agents in the team to estimate the performance of each agent.

This enables each agent to:  
- Make decisions based only on what it perceives locally.
- Simultaneously evaluate how its actions perform in the context of the whole group.

<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit10/mapoca.png"  height='400' style="height:400px;" alt="MlAgents learn"/>
<figcaption>source: <a href="https://blog.unity.com/technology/ml-agents-plays-dodgeball" target="_blank" rel="noopener noreferrer">ML-Agents Plays Dodgeball</a>
</figcaption>
</figure>
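
The split between local decisions and a team-wide evaluation can be illustrated with a toy sketch. This is not the MA-POCA implementation: the actual trainer uses neural networks, while `actor` and `centralized_critic` below are hypothetical stand-ins with hand-written rules, meant only to show which inputs each component sees.

```python
from typing import Sequence

Observation = Sequence[float]


def actor(local_obs: Observation) -> int:
    """Decentralized policy: chooses an action from the agent's own observation only."""
    # Toy rule: index of the largest observation value.
    return max(range(len(local_obs)), key=lambda i: local_obs[i])


def centralized_critic(team_obs: Sequence[Observation]) -> float:
    """Centralized value estimate: receives the observations of every teammate."""
    # Toy rule: mean over all values of all agents.
    flat = [value for obs in team_obs for value in obs]
    return sum(flat) / len(flat)


team_obs = [[0.1, 0.9, 0.0], [0.7, 0.2, 0.1]]
actions = [actor(obs) for obs in team_obs]  # each agent acts on its own view
value = centralized_critic(team_obs)        # the critic sees the whole team
print(actions)  # [1, 0]
```

Only the critic needs the team-wide view, and it is used during training; at play time each agent acts from its local observation.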

#### Self-play with a MA-POCA-trainer:

- The POCA trainer helps train cooperative behavior.
- Self-play helps the team learn to win against the opponent team.


More information: https://arxiv.org/pdf/2111.05992.pdf

### Config.yaml

See the documentation for more information on the hyperparameters in the config file:  
https://github.com/Unity-Technologies/ml-agents/blob/release_20_docs/docs/Training-Configuration-File.md


In [None]:
config="""
behaviors:
  SoccerTwos:
    trainer_type: poca
    hyperparameters:
      batch_size: 2048
      buffer_size: 20480
      learning_rate: 0.0003
      beta: 0.005
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 3
      learning_rate_schedule: constant
    network_settings:
      normalize: false
      hidden_units: 512
      num_layers: 2
      vis_encode_type: simple
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
    keep_checkpoints: 5
    max_steps: 5000000 # 5M (recommended value: will take 5 to 8 hours of training)
    time_horizon: 1000
    summary_freq: 10000
    self_play:
      save_steps: 50000
      team_change: 200000
      swap_steps: 2000
      window: 10
      play_against_latest_model_ratio: 0.5
      initial_elo: 1200.0
"""

In [None]:
# write config file
fname = "/content/ml-agents/config/poca/SoccerTwos.yaml"
with open(fname, "w", encoding="utf-8") as f:
    f.write(config)
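
A few quantities implied by the config values can be worked out directly. The sketch below copies the numbers from the config into plain variables (no YAML parsing) and assumes the usual reading of these fields: the buffer is split into minibatches of `batch_size` for `num_epoch` passes, and `save_steps` and `team_change` are counted in trainer steps. The results are rough counts, not exact guarantees.

```python
# Values copied from the config above.
batch_size = 2048
buffer_size = 20480
num_epoch = 3
max_steps = 5_000_000
summary_freq = 10_000
save_steps = 50_000
team_change = 200_000

minibatches_per_epoch = buffer_size // batch_size      # minibatches per pass over the buffer
updates_per_buffer = minibatches_per_epoch * num_epoch  # gradient updates per filled buffer
opponent_snapshots = max_steps // save_steps            # policy snapshots saved for self-play
team_switches = max_steps // team_change                # times the learning team changes
summaries = max_steps // summary_freq                   # training summaries written

print(minibatches_per_epoch, updates_per_buffer)        # 10 30
print(opponent_snapshots, team_switches, summaries)     # 100 25 500
```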

### Training

5M timesteps is the recommended value.  
This might take around 5 to 8 hours of training.

In [None]:
!mlagents-learn {fname} \
--env /content/ml-agents/training-envs-executables/linux/training-envs-executables/SoccerTwo \
--run-id "SoccerTwos" \
--no-graphics
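
The same command can also be assembled in Python, which makes the substitution of the config path explicit. This is a sketch under stated assumptions: `build_train_command` is a hypothetical helper, it only uses the flags shown in the cell above, and it builds the command without running it.

```python
import shlex


def build_train_command(config_path, env_path, run_id, no_graphics=True):
    """Assemble the mlagents-learn invocation as an argument list."""
    cmd = ["mlagents-learn", config_path, "--env", env_path, "--run-id", run_id]
    if no_graphics:
        cmd.append("--no-graphics")
    return cmd


cmd = build_train_command(
    "/content/ml-agents/config/poca/SoccerTwos.yaml",
    "/content/ml-agents/training-envs-executables/linux/training-envs-executables/SoccerTwo",
    "SoccerTwos",
)
print(shlex.join(cmd))
# To launch training from Python instead of a `!` cell:
# import subprocess; subprocess.run(cmd, check=True)
```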