Copyright (c) 2021, salesforce.com, inc.\
All rights reserved.\
SPDX-License-Identifier: BSD-3-Clause\
For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause

### Colab

Try this notebook on [Colab](http://colab.research.google.com/github/salesforce/warp-drive/blob/master/tutorials/tutorial-5-training_with_warp_drive.ipynb)!

## ⚠️ PLEASE NOTE:
This notebook runs on a GPU runtime.\
If running on Colab, choose Runtime > Change runtime type from the menu, then select 'GPU' in the dropdown.

# Introduction

In this tutorial, we describe how to:
- Use the WarpDrive framework to perform end-to-end training of multi-agent reinforcement learning (RL) agents.
- Visualize the behavior using the trained policies.

If you haven't yet familiarized yourself with WarpDrive, please see the other tutorials we have prepared for you:
- [WarpDrive basics](https://www.github.com/salesforce/warp-drive/blob/master/tutorials/tutorial-1-warp_drive_basics.ipynb)
- [WarpDrive sampler](https://www.github.com/salesforce/warp-drive/blob/master/tutorials/tutorial-2-warp_drive_sampler.ipynb)
- [WarpDrive reset and log controller](https://www.github.com/salesforce/warp-drive/blob/master/tutorials/tutorial-3-warp_drive_reset_and_log.ipynb)

Please also see our [tutorial](https://www.github.com/salesforce/warp-drive/blob/master/tutorials/tutorial-4-create_custom_environments.ipynb) on creating your own RL environment in CUDA C. Once you have your own environment in CUDA C, this tutorial explains how to integrate it with the WarpDrive framework to perform training.

## Dependencies

You can install the WarpDrive package either

- with the pip package manager, or
- by cloning the warp-drive repository and installing it from source (we use this option when running on Colab).

In [None]:
import sys
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    ! git clone https://github.com/salesforce/warp-drive.git 
    % cd warp-drive
    ! pip install -e .
else:
    ! pip install rl_warp_drive

In [None]:
from warp_drive.env_wrapper import EnvWrapper
from warp_drive.training.models.fully_connected import FullyConnected

from example_envs.tag_continuous.tag_continuous import TagContinuous
from utils.generate_rollout_animation import generate_env_rollout_animation

In [None]:
from gym.spaces import Discrete, MultiDiscrete
from IPython.display import HTML
import json
import numpy as np
import torch

## Training the tag-continuous environment with WarpDrive

For your convenience, there is an end-to-end RL training script at `warp_drive/training/example_training_script.py`. Currently, it supports training both the discrete and the continuous versions of Tag.

In order to run the training for these environments, we first need to configure the *run config*: the set of environment, training, and model parameters. 

The run configs for each of the environments are listed in `warp_drive/training/run_configs`, and a sample set of good configs for the **tag-continuous** environment is shown below. 

In this tutorial, we'll use $5$ taggers and $100$ runners in a $20 \times 20$ square grid. The taggers and runners have the same skill level, i.e., the runners can move just as fast as the taggers. 

```yaml
# YAML configuration for the tag continuous environment
name: "tag_continuous"

# Environment settings
env:
    num_taggers: 5
    num_runners: 100
    grid_length: 20
    episode_length: 500
    max_acceleration: 0.1
    min_acceleration: -0.1
    max_turn: 2.35  # 3*pi/4 radians
    min_turn: -2.35  # -3*pi/4 radians
    num_acceleration_levels: 20
    num_turn_levels: 20
    skill_level_runner: 1
    skill_level_tagger: 1
    seed: 274880
    use_full_observation: False
    runner_exits_game_after_tagged: True
    num_other_agents_observed: 10
    tag_reward_for_tagger: 10.0
    tag_penalty_for_runner: -10.0
    step_penalty_for_tagger: -0.00
    step_reward_for_runner: 0.00
    edge_hit_penalty: -0.0
    end_of_game_reward_for_runner: 1.0
    tagging_distance: 0.02
    
# Trainer settings
trainer:
    num_envs: 1  # Number of environment replicas
    num_episodes: 1000000000  # Number of episodes to run the training for
    train_batch_size: 100  # total batch size used for training per iteration (across all the environments)
    algorithm: "A2C"  # trainer algorithm
    vf_loss_coeff: 1  # loss coefficient for the value function loss
    entropy_coeff: 0.05  # coefficient for the entropy component of the loss
    clip_grad_norm: True  # flag indicating whether to clip the gradient norm or not
    max_grad_norm: 0.5  # when clip_grad_norm is True, the clip level
    normalize_advantage: False  # flag indicating whether to normalize advantage or not
    normalize_return: False  # flag indicating whether to normalize return or not

# Policy network settings
policy:  # list all the policies below
    runner:
        to_train: True
        name: "fully_connected"
        gamma: 0.98  # discount rate gamma
        lr: 0.005  # learning rate
        model:        
            fc_dims: [256, 256]  # dimension(s) of the fully connected layers as a list
            model_ckpt_filepath: ""
    tagger:
        to_train: True
        name: "fully_connected"
        gamma: 0.98
        lr: 0.002
        model:
            fc_dims: [256, 256]
            model_ckpt_filepath: ""
            
# Checkpoint saving setting
saving:
    print_metrics_freq: 100  # How often (in iterations) to print the metrics
    save_model_params_freq: 5000  # How often (in iterations) to save the model parameters
    basedir: "/tmp"  # base folder used for saving
    tag: "800runners_5taggers_bs100"
```
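
To make the trainer settings more concrete, here is a minimal sketch of how `vf_loss_coeff`, `entropy_coeff`, and `max_grad_norm` commonly enter an A2C-style update. It is a generic, pure-Python illustration for a single sample, not WarpDrive's actual trainer code; the function names `entropy`, `a2c_loss`, and `clip_by_global_norm` are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def a2c_loss(log_prob, advantage, value, target_return, probs,
             vf_loss_coeff=1.0, entropy_coeff=0.05):
    """Textbook A2C objective for one sample (assumed form, for illustration)."""
    policy_loss = -log_prob * advantage          # policy-gradient term
    value_loss = (target_return - value) ** 2    # value-function regression term
    # The entropy bonus is subtracted, so higher entropy lowers the loss.
    return policy_loss + vf_loss_coeff * value_loss - entropy_coeff * entropy(probs)


def clip_by_global_norm(grads, max_grad_norm=0.5):
    """Rescale a flat list of gradients so that its L2 norm is at most max_grad_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_grad_norm:
        return list(grads)
    scale = max_grad_norm / norm
    return [g * scale for g in grads]


# Example: a uniform two-action policy and an oversized gradient vector.
probs = [0.5, 0.5]
loss = a2c_loss(math.log(0.5), advantage=1.0, value=0.0,
                target_return=1.0, probs=probs)
clipped = clip_by_global_norm([3.0, 4.0], max_grad_norm=0.5)
```

With `clip_grad_norm: True`, a larger `max_grad_norm` permits bigger parameter updates per iteration, while a larger `entropy_coeff` pushes the policies toward more exploratory behavior.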

Next, we need to specify a mapping from each policy to the indices of the agents that use it. This mapping is set in `warp_drive/training/example_training_script.py`. Here, we have a tagger policy and a runner policy, and we map each to the corresponding agents:

```python
    policy_tag_to_agent_id_map = {
        "tagger": list(envObj.env.taggers),
        "runner": list(envObj.env.runners),
    }
```


Note that if you wish to use a single policy across all the agents, or a different set of policies, you will need to update both the run configuration and the `policy_tag_to_agent_id_map`.

For example, to use a policy shared across all agents (say `shared_policy`), you can set the `policy` section of the run configuration to
```python
    "policy": {
        "shared_policy": {
            "to_train": True,
            "name": "fully_connected",
            "gamma": 0.98,
            "lr": 0.002,
            "model": {
                "fc_dims": [256, 256],
                "model_ckpt_filepath": "",
            },
        },
    },
```
and map all the agent ids to this shared policy:
```python
    policy_tag_to_agent_id_map = {
        "shared_policy": np.arange(envObj.env.num_agents),
    }
```

**Note: make sure the `policy` keys and the `policy_tag_to_agent_id_map` keys are identical.**
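
A quick way to catch a mismatch early is to compare the two sets of keys before starting training. The helper below is a minimal sketch using plain dictionaries; `check_policy_keys` is a hypothetical name and is not part of WarpDrive. The additional check that every agent id appears exactly once is an assumption about what a sensible mapping looks like.

```python
def check_policy_keys(run_config, policy_tag_to_agent_id_map, num_agents):
    """Raise ValueError if the run config and the agent mapping disagree."""
    config_keys = set(run_config["policy"])
    map_keys = set(policy_tag_to_agent_id_map)
    if config_keys != map_keys:
        raise ValueError(
            f"Policy keys differ: config has {sorted(config_keys)}, "
            f"mapping has {sorted(map_keys)}"
        )
    # Assumption: each agent id should be assigned to exactly one policy.
    assigned = sorted(
        int(agent_id)
        for agent_ids in policy_tag_to_agent_id_map.values()
        for agent_id in agent_ids
    )
    if assigned != list(range(num_agents)):
        raise ValueError("Every agent id must appear in exactly one policy.")


# Example with a toy configuration: 2 taggers and 3 runners.
toy_config = {"policy": {"tagger": {}, "runner": {}}}
toy_map = {"tagger": [0, 1], "runner": [2, 3, 4]}
check_policy_keys(toy_config, toy_map, num_agents=5)  # passes silently
```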

Once the run configuration and the policy to agent id mapping are set, you can invoke training by using
```shell
python warp_drive/training/example_training_script.py --env <ENV-NAME>
```
where `<ENV-NAME>` can be `tag_gridworld` or `tag_continuous` (or any new env that you build). And that's it!

The training script performs the following steps, in order:
1. Creates the pertinent environment object (with the `use_cuda` flag set to True).
2. Creates and pushes the observation, action, reward, and done placeholder data arrays to the device.
3. Creates the trainer object using the environment object, the run configuration, and the policy-to-agent-id mapping.
4. Invokes `trainer.train()`.

## Visualizing the trained policies

The run config has a `save_model_params_freq` parameter that controls how often (in iterations) model checkpoints are saved. With a model checkpoint, we can initialize the neural network weights and generate a full episode rollout.

We can find an example run config and the trained tagger and runner policy model weights (after about 20M steps) in the `assets/tag_continuous_training/` folder.

In [None]:
# Load the run config.
with open("assets/tag_continuous_training/run_config.json") as f:
    run_config = json.load(f)

In [None]:
# Create the environment object.
env_wrapper = EnvWrapper(TagContinuous(**run_config['env']))

The taggers (runners) use a shared tagger (runner) policy model. The `policy_tag_to_agent_id_map` describes this mapping.

In [None]:
# Define the policy tag to agent id mapping.
policy_tag_to_agent_id_map = {
    "tagger": list(env_wrapper.env.taggers),
    "runner": list(env_wrapper.env.runners),
}

In [None]:
# Step through the environment.
# The environment(s) store and update the rollout data internally in env.global_state.


def generate_rollout_inplace(env_wrapper, run_config, load_model_weights=False):
    assert env_wrapper is not None
    assert run_config is not None
    
    obs = env_wrapper.reset_all_envs()
    action_space = env_wrapper.env.action_space[0]        
        
    # Instantiate the policy models.
    policy_models = {}

    for policy in policy_tag_to_agent_id_map:
        policy_config = run_config["policy"][policy]
        if policy_config["name"] == "fully_connected":
            policy_models[policy] = FullyConnected(
                env=env_wrapper,
                model_config=policy_config["model"],
                policy=policy,
                policy_tag_to_agent_id_map=policy_tag_to_agent_id_map,
            )
        else:
            raise NotImplementedError
    
    if load_model_weights:
        print("Loading saved weights into the policy models...")
        for policy in policy_models:
            state_dict_filepath = f"assets/tag_continuous_training/{policy}_after_training.state_dict"
            policy_models[policy].load_state_dict(torch.load(state_dict_filepath))            
            print(f"Loaded ckpt {state_dict_filepath} for {policy} policy model.")

    for t in range(env_wrapper.env.episode_length):
        # np.stack expects a sequence, so convert the dict view to a list first.
        stacked_obs = np.stack(list(obs.values())).astype(np.float32)
        
        
        # Create dict to collect the actions for all agents.
        if isinstance(action_space, Discrete):
            actions = {agent_id: 0 for agent_id in range(env_wrapper.env.num_agents)}
        elif isinstance(action_space, MultiDiscrete):
            actions = {agent_id: [0, 0] for agent_id in range(env_wrapper.env.num_agents)}
        else:
            raise NotImplementedError

        
        # Sample actions for all agents.
        for policy in policy_models:
            agent_ids = policy_tag_to_agent_id_map[policy]
            probabilities, vals = policy_models[policy](
                obs=torch.from_numpy(stacked_obs[agent_ids])
            )
            if isinstance(action_space, Discrete):
                for idx, probs in enumerate(probabilities):
                    sampled_actions = torch.multinomial(probs, num_samples=1)
                    for sample_action_idx, action in enumerate(sampled_actions):
                        actions[agent_ids[sample_action_idx]] = action.numpy()[0]

            elif isinstance(action_space, MultiDiscrete):
                for idx, probs in enumerate(probabilities):
                    sampled_actions = torch.multinomial(probs, num_samples=1)
                    for sample_action_idx, action in enumerate(sampled_actions):
                        actions[agent_ids[sample_action_idx]][idx] = action.numpy()[0]
            else:
                raise NotImplementedError
        
        
        # Execute actions in the environment.
        obs, rew, done, info = env_wrapper.step(actions)
        
        if done["__all__"]:
            break
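
The action-sampling step in the rollout loop above can be easier to follow in isolation. The sketch below reproduces its structure with the standard library only, assuming (as the loop suggests) that a policy model returns one block of probabilities per action dimension, with one row per agent. `sample_actions` is a hypothetical helper, and `random.choices` stands in for `torch.multinomial`.

```python
import random


def sample_actions(probabilities, agent_ids, rng=random):
    """Sample one action index per agent and per action dimension.

    probabilities: list over action dimensions; each entry is a list of
        per-agent probability rows (row i belongs to agent_ids[i]).
    Returns a dict {agent_id: [action index for each dimension]}.
    """
    actions = {agent_id: [0] * len(probabilities) for agent_id in agent_ids}
    for dim, per_agent_probs in enumerate(probabilities):
        for row, probs in enumerate(per_agent_probs):
            # Draw a single index with probability proportional to its weight.
            choice = rng.choices(range(len(probs)), weights=probs, k=1)[0]
            actions[agent_ids[row]][dim] = choice
    return actions


# Example: two agents (ids 3 and 7), two action dimensions, three levels each.
# One-hot rows make the outcome deterministic.
toy_probabilities = [
    [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]],  # dimension 0 (e.g., acceleration)
    [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]],  # dimension 1 (e.g., turn)
]
toy_actions = sample_actions(toy_probabilities, agent_ids=[3, 7])
```

Note that row `i` of each probability block belongs to `agent_ids[i]`, not to agent `i`, which is why the loop indexes the action dictionary through `agent_ids`.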

## Visualize the environment before training

In [None]:
generate_rollout_inplace(env_wrapper, run_config)
anim = generate_env_rollout_animation(env_wrapper.env, i_start=1, fps=50, fig_width=6, fig_height=6)
HTML(anim.to_html5_video())

In the visualization above, the large purple dots represent the taggers, while the smaller blue dots represent the runners. Before training, the runners and taggers move around randomly, so only some of the runners get tagged, purely by chance.

## Visualize the environment after training (for about 20M steps)

In [None]:
generate_rollout_inplace(env_wrapper, run_config, load_model_weights=True)
anim = generate_env_rollout_animation(env_wrapper.env, i_start=1, fps=50, fig_width=6, fig_height=6)
HTML(anim.to_html5_video())

After training, the runners learn to run away from the taggers, and the taggers learn to chase them; in some instances, the taggers even team up to chase and tag the runners. Overall, about 80% of the runners are now caught.

# Learn More and Explore our Tutorials!

You've now seen the entire end-to-end multi-agent RL pipeline!

For your reference, all our tutorials are here:
- [A simple end-to-end RL training example](https://www.github.com/salesforce/warp-drive/blob/master/tutorials/simple-end-to-end-example.ipynb)
- [WarpDrive basics](https://www.github.com/salesforce/warp-drive/blob/master/tutorials/tutorial-1-warp_drive_basics.ipynb)
- [WarpDrive sampler](https://www.github.com/salesforce/warp-drive/blob/master/tutorials/tutorial-2-warp_drive_sampler.ipynb)
- [WarpDrive reset and log](https://www.github.com/salesforce/warp-drive/blob/master/tutorials/tutorial-3-warp_drive_reset_and_log.ipynb)
- [Creating custom environments](https://www.github.com/salesforce/warp-drive/blob/master/tutorials/tutorial-4-create_custom_environments.ipynb)
- [Training with WarpDrive](https://www.github.com/salesforce/warp-drive/blob/master/tutorials/tutorial-5-training_with_warp_drive.ipynb)