Copyright (c) 2021, salesforce.com, inc.\
All rights reserved.\
SPDX-License-Identifier: BSD-3-Clause\
For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause

Try this notebook on [Colab](http://colab.research.google.com/github/salesforce/warp-drive/blob/master/tutorials/tutorial-5-training_with_warp_drive.ipynb)!

# ⚠️ PLEASE NOTE:
This notebook runs on a GPU runtime.\
If running on Colab, choose Runtime > Change runtime type from the menu, then select `GPU` in the 'Hardware accelerator' dropdown menu.

# Introduction

In this tutorial, we describe how to
- Use the WarpDrive framework to perform end-to-end training of multi-agent reinforcement learning (RL) agents.
- Visualize the behavior using the trained policies.

If you haven't yet familiarized yourself with WarpDrive, please see our other tutorials:
- [WarpDrive basics](https://www.github.com/salesforce/warp-drive/blob/master/tutorials/tutorial-1-warp_drive_basics.ipynb)
- [WarpDrive sampler](https://www.github.com/salesforce/warp-drive/blob/master/tutorials/tutorial-2-warp_drive_sampler.ipynb)
- [WarpDrive reset and log controller](https://www.github.com/salesforce/warp-drive/blob/master/tutorials/tutorial-3-warp_drive_reset_and_log.ipynb)

Please also see our [tutorial](https://www.github.com/salesforce/warp-drive/blob/master/tutorials/tutorial-4-create_custom_environments.md) on creating your own RL environment in CUDA C. Once you have your own environment in CUDA C, this tutorial explains how to integrate it with the WarpDrive framework to perform training.

# Dependencies

You can install the WarpDrive package either

- with the pip package manager, OR
- by cloning the warp-drive repository and installing it from source (we use this option when running on Colab).

In [None]:
import sys

IN_COLAB = "google.colab" in sys.modules

if IN_COLAB:
    ! git clone https://github.com/salesforce/warp-drive.git
    % cd warp-drive
    ! pip install -e .
    % cd tutorials
else:
    ! pip install rl_warp_drive

In [None]:
from warp_drive.env_wrapper import EnvWrapper
from warp_drive.training.trainer import Trainer
from warp_drive.training.models.fully_connected import FullyConnected

from example_envs.tag_continuous.tag_continuous import TagContinuous
from utils.generate_rollout_animation import generate_tag_env_rollout_animation

In [None]:
from gym.spaces import Discrete, MultiDiscrete
from IPython.display import HTML
import yaml
import numpy as np
import torch

# Training the continuous version of Tag with WarpDrive

We will now explain how to train your environments using WarpDrive in just a few steps. For the sake of exposition, we consider the continuous version of Tag.

For your reference, there is also an example end-to-end RL training script [here](https://github.com/salesforce/warp-drive/blob/master/warp_drive/training/example_training_script.py) that contains all the steps below. You can use it to set up your own custom training pipeline. Invoke training by using
```shell
python warp_drive/training/example_training_script.py --env <ENV-NAME>
```
where `<ENV-NAME>` can be `tag_gridworld` or `tag_continuous` (or any new env that you build).

## Step 1: Specify a set of run configurations for your experiments.

In order to run the training for these environments, we first need to specify a *run config*, which comprises the set of environment, training, and model parameters.

Note: there are also some default configurations in 'warp_drive/training/run_configs/default_configs.yaml', and the run configurations you provide will override them.
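
One way to picture this override behavior is as a recursive dictionary merge. The sketch below is illustrative only: `merge_configs` is a hypothetical helper (not part of the WarpDrive API), and the keys and values are made up for the example.

```python
import copy


def merge_configs(defaults, overrides):
    """Return a copy of `defaults` with `overrides` applied recursively."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dictionaries: merge them key by key.
            merged[key] = merge_configs(merged[key], value)
        else:
            # Otherwise the user-provided value replaces the default.
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical values, for illustration only.
default_config = {"trainer": {"num_envs": 10, "some_other_setting": 1}}
user_config = {"trainer": {"num_envs": 100}}

merged_config = merge_configs(default_config, user_config)
print(merged_config)
```

In this sketch, settings you do not specify keep their default values, while the ones you do specify take precedence.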

For this tutorial, we will use the configuration [here](assets/tag_continuous_training/run_config.yaml). Specifically, we'll use $5$ taggers and $100$ runners in a $20 \times 20$ square grid. The taggers and runners have the same skill level, i.e., the runners can move just as fast as the taggers. 

In [None]:
# Load the run config.
with open("assets/tag_continuous_training/run_config.yaml", encoding="utf8") as fp:
    run_config = yaml.safe_load(fp)

## Step 2: Create the environment object using WarpDrive's `EnvWrapper`.

### Important! Ensure that 'use_cuda' is set to True below (in order to run the simulation on the GPU).

In [None]:
env_wrapper = EnvWrapper(
    env_obj=TagContinuous(**run_config["env"]),
    num_envs=run_config["trainer"]["num_envs"],
    use_cuda=True,
)

Creating the env wrapper initializes the CUDA data manager and pushes some reserved data arrays to the GPU. It also initializes the CUDA function manager, and loads some WarpDrive library CUDA kernels.

## Step 3: Specify a mapping from the policy to agent indices.

Next, we need to map each trainable policy to the indices of the agents that use it. In this example, there are two policies (tagger and runner), and we map each one to the corresponding agents.

In [None]:
policy_tag_to_agent_id_map = {
    "tagger": list(env_wrapper.env.taggers),
    "runner": list(env_wrapper.env.runners),
}

Note that if you wish to use just a single policy across all the agents (or if you wish to use many other policies), you will need to update both the run configuration and the `policy_tag_to_agent_id_map`.

For example, to use a shared policy across all agents (say `shared_policy`), you can set the run configuration to
```python
    "policy": {
        "shared_policy": {
            "to_train": True,
            ...
        },
    },
```
and also set all the agent ids to use this shared policy
```python
    policy_tag_to_agent_id_map = {
        "shared_policy": np.arange(env_wrapper.env.num_agents),
    }
```

**Importantly, make sure the `policy` keys and the `policy_tag_to_agent_id_map` keys are identical.**
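
A quick sanity check along these lines can catch a mismatch before the trainer is created. This is a minimal sketch: `policy_key_mismatch` is a hypothetical helper (not part of WarpDrive), and the inputs are toy dictionaries that only mimic the structure used in this tutorial.

```python
def policy_key_mismatch(run_config, policy_tag_to_agent_id_map):
    """Return (missing_from_map, missing_from_config) as sorted lists.

    Both lists are empty when the two sets of policy names are identical.
    """
    config_policies = set(run_config["policy"])
    mapped_policies = set(policy_tag_to_agent_id_map)
    missing_from_map = sorted(config_policies - mapped_policies)
    missing_from_config = sorted(mapped_policies - config_policies)
    return missing_from_map, missing_from_config


# Toy inputs, for illustration only.
toy_run_config = {
    "policy": {"tagger": {"to_train": True}, "runner": {"to_train": True}}
}
toy_map = {"tagger": [0, 1], "runner": [2, 3, 4]}

print(policy_key_mismatch(toy_run_config, toy_map))
print(policy_key_mismatch(toy_run_config, {"shared_policy": [0, 1, 2, 3, 4]}))
```

The first call reports no mismatch; the second reports that `tagger` and `runner` are missing from the map and that `shared_policy` is missing from the run configuration.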

## Step 4: Create the Trainer object.

In [None]:
trainer = Trainer(
    env_wrapper,
    run_config,
    policy_tag_to_agent_id_map
)

When the trainer object is created, all the relevant data arrays (e.g., "loc_x", "loc_y", "speed") are pushed from the CPU to the GPU. Additionally, the observation, reward, action, and done-flag data arrays are pushed as well. During training, all these arrays are updated in place, and no data is transferred back to the CPU.

# Visualizing the trainer policies

## Visualizing an episode roll-out before training 

Let us visualize an episode rollout before training begins. Note that at any time during training, we can fetch the data arrays on the GPU using the trainer API `fetch_episode_states`.

Below, we fetch the state arrays holding the agent locations and the indicators of which agents are still active in the game, and use them to visualize an episode roll-out.

In [None]:
episode_states = trainer.fetch_episode_states(
    [
        "loc_x",
        "loc_y",
        "still_in_the_game"
    ]
)

In [None]:
# Visualize the entire episode roll-out
anim = generate_tag_env_rollout_animation(env_wrapper.env, episode_states)
HTML(anim.to_html5_video())

In the visualization above, the large purple dots represent the taggers, while the smaller blue dots represent the runners. Before training, the runners and taggers move around randomly, and that only results in about half the runners getting tagged, just by chance.
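
To put a number on this observation, the fraction of tagged runners can be computed from the `still_in_the_game` indicators. The sketch below makes two assumptions that you should verify against your own data: the array is indexed as `[time_step][agent_id]`, and it holds 1 for agents that are still active and 0 for agents that have been tagged. The helper `fraction_of_runners_tagged` is hypothetical, and the data is a toy example.

```python
def fraction_of_runners_tagged(still_in_the_game, runner_ids):
    """Fraction of runners that are no longer active at the final time step.

    Assumes `still_in_the_game` is indexed as [time_step][agent_id] and
    holds 1 for active agents and 0 for tagged agents.
    """
    final_step = still_in_the_game[-1]
    num_active = sum(1 for agent_id in runner_ids if final_step[agent_id] == 1)
    return 1.0 - num_active / len(runner_ids)


# Toy data: 2 time steps; agent 0 is a tagger, agents 1 to 4 are runners.
toy_states = [
    [1, 1, 1, 1, 1],
    [1, 0, 1, 0, 1],
]
print(fraction_of_runners_tagged(toy_states, runner_ids=[1, 2, 3, 4]))
```

If the layout assumption holds for the fetched data, the same function could be called with `episode_states["still_in_the_game"]` and `list(env_wrapper.env.runners)`.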

## Step 5: Perform training

Training is performed by calling `trainer.train()`. We run training for just a few episodes, as specified in the run configuration.

In [None]:
trainer.train()

During training, we log the speed performance numbers and the metrics for all the trained policies every `metrics_log_freq` iterations. The training results and the model checkpoints are also saved periodically, as specified by the run configuration parameter `model_params_save_freq`.

## Visualizing an episode roll-out after training (for about 20M steps)

We can also initialize the trainer's model parameters from saved model checkpoints via the `load_model_checkpoint` API. This lets us, for example, fetch the episode states for a trained model. We will now visualize an episode roll-out using trained tagger and runner policy model weights (trained for 20M steps), which are located in [this](assets/tag_continuous_training/) folder.

In [None]:
trainer.load_model_checkpoint(
    {
        "tagger": "assets/tag_continuous_training/tagger_20000000.state_dict",
        "runner": "assets/tag_continuous_training/runner_20000000.state_dict"
    }
)
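
The two file names above follow the pattern `<policy>_<num_steps>.state_dict`. If you keep several checkpoints around, a small helper can build the dictionary for you. This is a sketch only: `checkpoint_paths` is a hypothetical helper (not part of WarpDrive), and it assumes your own checkpoints use the same naming pattern as the two files in this tutorial.

```python
import posixpath


def checkpoint_paths(folder, policy_names, num_steps):
    """Map each policy name to '<folder>/<policy>_<num_steps>.state_dict'."""
    return {
        name: posixpath.join(folder, f"{name}_{num_steps}.state_dict")
        for name in policy_names
    }


paths = checkpoint_paths(
    "assets/tag_continuous_training", ["tagger", "runner"], 20_000_000
)
print(paths)
```

The resulting dictionary has the same keys and paths as the one passed to `load_model_checkpoint` above.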

In [None]:
episode_states = trainer.fetch_episode_states(
    [
        "loc_x",
        "loc_y",
        "still_in_the_game"
    ]
)

In [None]:
# Visualize the entire episode roll-out
anim = generate_tag_env_rollout_animation(env_wrapper.env, episode_states)
HTML(anim.to_html5_video())

After training, the runners learn to run away from the taggers, and the taggers learn to chase them; in some instances, the taggers also team up to chase and tag the runners. By the end of the episode, (almost) all the runners are caught.

In [None]:
# Close the trainer to clear up the CUDA memory heap
trainer.graceful_close()

You've now seen the entire end-to-end multi-agent RL pipeline!

# Learn More and Explore our Tutorials!

For your reference, all our tutorials are here:
- [A simple end-to-end RL training example](https://www.github.com/salesforce/warp-drive/blob/master/tutorials/simple-end-to-end-example.ipynb)
- [WarpDrive basics](https://www.github.com/salesforce/warp-drive/blob/master/tutorials/tutorial-1-warp_drive_basics.ipynb)
- [WarpDrive sampler](https://www.github.com/salesforce/warp-drive/blob/master/tutorials/tutorial-2-warp_drive_sampler.ipynb)
- [WarpDrive reset and log](https://www.github.com/salesforce/warp-drive/blob/master/tutorials/tutorial-3-warp_drive_reset_and_log.ipynb)
- [Creating custom environments](https://www.github.com/salesforce/warp-drive/blob/master/tutorials/tutorial-4-create_custom_environments.md)
- [Training with WarpDrive](https://www.github.com/salesforce/warp-drive/blob/master/tutorials/tutorial-5-training_with_warp_drive.ipynb)