Copyright (c) 2021, salesforce.com, inc. \
All rights reserved. \
SPDX-License-Identifier: BSD-3-Clause. \
For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause.

### Colab

Try this notebook on [Colab](http://colab.research.google.com/github/salesforce/warp-drive/blob/master/tutorials/tutorial-4-create_custom_environments.ipynb)!

# Introduction

In this tutorial, we will describe how to implement your own environment in CUDA C, and integrate it with WarpDrive for simulating the environment dynamics on the GPU.

In case you haven't familiarized yourself with WarpDrive, please see the other tutorials:

- [WarpDrive basics](https://www.github.com/salesforce/warp-drive/blob/master/tutorials/tutorial-1-warp_drive_basics.ipynb)
- [WarpDrive sampler](https://www.github.com/salesforce/warp-drive/blob/master/tutorials/tutorial-2-warp_drive_sampler.ipynb)
- [WarpDrive reset and log](https://www.github.com/salesforce/warp-drive/blob/master/tutorials/tutorial-3-warp_drive_reset_and_log.ipynb)

We follow the OpenAI [gym](https://gym.openai.com/) style. Each simulation should have `__init__`, `reset` and `step` methods. 
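As a rough illustration of that interface, here is a minimal gym-style skeleton in plain Python. The class `ToyCounterEnv` and its fields are hypothetical and are not part of WarpDrive or gym; they only show how `__init__`, `reset` and `step` fit together.

```python
class ToyCounterEnv:
    """Hypothetical toy environment: each agent's state is a counter."""

    def __init__(self, num_agents=3, episode_length=5):
        self.num_agents = num_agents
        self.episode_length = episode_length
        self.timestep = 0
        self.state = [0] * num_agents

    def reset(self):
        # Restore the initial state and return the first observation.
        self.timestep = 0
        self.state = [0] * self.num_agents
        return list(self.state)

    def step(self, actions):
        # Apply one action per agent, then report obs, rewards and done.
        assert len(actions) == self.num_agents
        self.timestep += 1
        self.state = [s + a for s, a in zip(self.state, actions)]
        obs = list(self.state)
        rewards = [float(a) for a in actions]
        done = self.timestep >= self.episode_length
        return obs, rewards, done, {}


env = ToyCounterEnv()
first_obs = env.reset()
obs, rewards, done, info = env.step([1, 0, 1])
```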

To use WarpDrive, you only need to implement the `step()` method in CUDA C. WarpDrive can automatically reinitialize the environment after it's done, i.e., at every `reset`, using the environment `Wrapper` class. This class takes your CUDA C `step()` function and manages the simulation flow on the GPU. 

You can then do RL! See the [next tutorial](https://www.github.com/salesforce/warp-drive/blob/master/tutorials/tutorial-5-training_with_warp_drive.ipynb) to learn how to perform end-to-end multi-agent RL on a single GPU with WarpDrive.

## Building Simulations in CUDA C

CUDA C is an extension of C. See [this Nvidia blog](https://developer.nvidia.com/blog/even-easier-introduction-cuda/) and the [CUDA documentation](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html) for more info and CUDA tutorials.

For our initial release of WarpDrive, we focus on relatively simple simulations. A key reason is that CUDA C can give you significantly faster simulations, but requires careful memory management, among other things. 

To make sure that everything works properly, one approach is to first implement your simulation logic in Python. You can then implement the same logic in CUDA C and check that the two simulations behave identically.

To help with this process, we provide an *environment consistency checker* method to do consistency tests between Python and CUDA C simulations. 

This workflow helps you familiarize yourself with CUDA C and works well for relatively simple simulations.

# Case Study: Building a CUDA Version of Tag

Within the WarpDrive package, you can find the source code for the discrete and continuous versions of Tag.

- [Tag (GridWorld)](https://www.github.com/salesforce/warp-drive/blob/master/example_envs/tag_gridworld/tag_gridworld.py)
- [Tag (Continuous)](https://www.github.com/salesforce/warp-drive/blob/master/example_envs/tag_continuous/tag_continuous.py)

Tag is a simple multi-agent game involving 'taggers' and 'runners'. The taggers chase and try to tag the runners. Tagged runners leave the game. Runners try to get away from the taggers.

Next, we'll use the *continuous* version of Tag to explain some important elements of building CUDA C simulations.

## Managing CUDA Simulations from Python using WarpDrive

We begin with the Python implementation of the continuous version of [Tag](https://www.github.com/salesforce/warp-drive/blob/master/example_envs/tag_continuous/tag_continuous.py). The simulation follows the [gym](https://gym.openai.com/) format, implementing `reset` and `step` methods. We now detail all the steps necessary to transform the `step` function into [CUDA code](https://www.github.com/salesforce/warp-drive/blob/master/example_envs/tag_continuous/tag_continuous_step.cu) that can be run on a GPU. Importantly, WarpDrive lets you call these CUDA methods from Python, so you can design your own RL workflow entirely in Python.

### 1. Add data to be pushed to GPU using DataFeed()

First, we need to push all the data relevant to performing the `reset()` and `step()` functions on the GPU. In particular, there are two methods that need to be added to the environment:
```python
    def get_data_dictionary(self):
        data_dict = DataFeed()
        ...
        return data_dict
```
and 
```python
    def get_tensor_dictionary(self):
        data_dict = DataFeed()
        ...
        return data_dict
```
WarpDrive automatically handles pushing the data arrays provided within these methods to the GPU global memory. The data dictionary is used for data that does not need to be modified during training: once pushed to the GPU, this data persists there unmodified. The tensor dictionary comprises data that is directly accessible by PyTorch, and is handy for data that needs to be modified during training. Each of these two methods needs to return a `DataFeed` object, which is essentially a dictionary with additional attributes.

With the help of the `DataFeed` class, we can push arrays that are created when the environment is initialized and need to be re-initialized at every reset.

```python
data_dict = DataFeed()
for feature in ["loc_x", "loc_y", "speed", "direction", "acceleration"]:
    data_dict.add_data(
        name=feature,
        data=self.global_state[feature][0],
        save_copy_and_apply_at_reset=True,
    )
```

Importantly, notice the `save_copy_and_apply_at_reset` flag set to True. This instructs WarpDrive to make a copy of this data and automatically re-initialize the data array to that exact value at each reset.
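To make the effect of this flag concrete, the sketch below mimics the save-and-restore behavior in plain Python. `ToyDataStore` is a hypothetical stand-in written for illustration; it is not WarpDrive's `DataFeed` or data manager, and it ignores the GPU entirely.

```python
import copy


class ToyDataStore:
    """Hypothetical stand-in that mimics save-and-restore at reset."""

    def __init__(self):
        self.data = {}
        self._reset_copies = {}

    def add_data(self, name, data, save_copy_and_apply_at_reset=False):
        self.data[name] = data
        if save_copy_and_apply_at_reset:
            # Keep an independent snapshot of the initial values.
            self._reset_copies[name] = copy.deepcopy(data)

    def reset(self):
        # Only arrays that were flagged are restored to their snapshot.
        for name, snapshot in self._reset_copies.items():
            self.data[name] = copy.deepcopy(snapshot)


store = ToyDataStore()
store.add_data("loc_x", [0.5, 1.5], save_copy_and_apply_at_reset=True)
store.add_data("num_steps_taken", [0])  # hypothetical array, not flagged

store.data["loc_x"][0] = 9.0  # the simulation modifies the data in place
store.data["num_steps_taken"][0] = 7
store.reset()
```

After `reset()`, the flagged `loc_x` array holds its initial values again, while the unflagged array keeps whatever the simulation wrote to it.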

We can also push environment configuration parameters, for example,

```python
data_dict.add_data(
    name="tag_reward_for_tagger", data=self.tag_reward_for_tagger
)
data_dict.add_data(
    name="distance_margin_for_reward", data=self.distance_margin_for_reward
)
```

and any auxiliary variables that will be useful for modeling the step function dynamics:
```python
data_dict.add_data(
    name="neighbor_distances",
    data=np.zeros((self.num_agents, self.num_agents - 1), dtype=np.float32),
    save_copy_and_apply_at_reset=True,
)
```

An important point to note is that the CUDA C step code in this tutorial uses **32-bit precision** (for example, `float*` arrays), so it's good practice to cast all the numpy arrays used in the Python simulation to 32-bit precision as well, before you push them.
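A minimal sketch of such a cast with NumPy is shown below. The array names and values are made up for illustration.

```python
import numpy as np

# NumPy creates 64-bit floats by default.
loc_x = np.array([0.1, 0.2, 0.3])
agent_ids = np.arange(3)

# Cast explicitly before handing the arrays to the data feed.
loc_x_32 = loc_x.astype(np.float32)
agent_ids_32 = agent_ids.astype(np.int32)

# The cast rounds the values slightly, so compare Python and CUDA C
# results with a tolerance rather than with exact equality.
max_rounding_error = float(np.max(np.abs(loc_x_32.astype(np.float64) - loc_x)))
```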

### 2. Invoke the CUDA version of *step* in Python

After all the relevant data is added to the data dictionary, we need to invoke the CUDA C kernel code for stepping through the environment (when `self.use_cuda` is `True`). The syntax to do this is as follows:

```python
if self.use_cuda:
    self.cuda_step(
        self.cuda_data_manager.device_data("loc_x"),
        self.cuda_data_manager.device_data("loc_y"),
        self.cuda_data_manager.device_data("speed"),
        ...
```

where you need to pass all the keys of the data dictionary (in no particular order) as arguments to the step function. Also, remember to pass the essential `observations`, `sampled_actions` and `rewards` data:

```python
...
self.cuda_data_manager.device_data("observations"),
self.cuda_data_manager.device_data("sampled_actions"),
self.cuda_data_manager.device_data("rewards"),
...
```

It is also very useful to add the following reserved keywords: `_done_` and `_timestep_`, along with `n_agents`, `episode_length`, `block` and `grid`:
```python
...
self.cuda_data_manager.device_data("_done_"),
self.cuda_data_manager.device_data("_timestep_"),
self.cuda_data_manager.meta_info("n_agents"),
self.cuda_data_manager.meta_info("episode_length"),
block=self.cuda_function_manager.block,
grid=self.cuda_function_manager.grid,
```
Note that `n_agents` and `episode_length` are part of the meta information of the data manager, so they can be referenced directly from there. The `block` and `grid` arguments are essential: they tell the CUDA implementation how many threads and blocks to launch for the environment simulation.


### 3. Write the *step* method in CUDA C

The most laborious part of this exercise is actually writing out the step function in CUDA C. This function needs to be named `Cuda<env.name>Step`, so that WarpDrive knows it represents the CUDA version of the step function for the particular environment. The order of the arguments must match the order used where the CUDA C kernel is invoked from Python.

```C
__global__ void CudaTagContinuousStep(
        float* loc_x_arr,
        float* loc_y_arr,
        float* speed_arr,
        ...
```

Note the keyword `__global__` used on the step function. Global functions are also called "kernels"; they are functions that you can launch from the host. In our implementation of the CUDA C [code](https://www.github.com/salesforce/warp-drive/blob/master/example_envs/tag_continuous/tag_continuous_step.cu) for the Tag environment, you will also notice the keyword `__device__` (for example, `__device__ void CudaTagContinuousGenerateObservation()` and `__device__ void CudaTagContinuousComputeReward()`), which marks functions that cannot be called from the host and can only be called from other device or global functions.

Also, note the `void` return type: CUDA step functions don't need to return anything, because all the data arrays are modified in place.

When writing the step code in CUDA C, the environment logic follows the same logic as in the Python step code. Remember that each thread acts only on a single agent in a single environment. The code excerpt below is a side-by-side comparison of Python and CUDA C code for updating the agents' x and y location coordinates.

On the CUDA C side, we can simplify the code and make it more readable by using constants such as `kThisAgentId` and `kEnvId` (we have used this [naming style guide](https://google.github.io/styleguide/cppguide.html#General_Naming_Rules)) to indicate the thread and block indices, respectively. As you may have noticed by now, since each thread only writes to a specific index of the data array, understanding array indexing is critical.

<table align="left">
<tr>
<th> Python </th>
<th> CUDA C </th>
</tr>
<tr>
<td>

```python
loc_x_curr_t = loc_x_prev_t + speed_curr_t * np.cos(dir_curr_t)
loc_y_curr_t = loc_y_prev_t + speed_curr_t * np.sin(dir_curr_t)
```

</td>

<td>

```c
const int kThisAgentId = threadIdx.x;
const int kEnvId = blockIdx.x;
if (kThisAgentId < kNumAgents) {
    const int kThisAgentArrayIdx = kEnvId * kNumAgents + kThisAgentId;

    loc_x_arr[kThisAgentArrayIdx] += speed_arr[kThisAgentArrayIdx] * cos(direction_arr[kThisAgentArrayIdx]);
    loc_y_arr[kThisAgentArrayIdx] += speed_arr[kThisAgentArrayIdx] * sin(direction_arr[kThisAgentArrayIdx]);
}
```

</td>
</tr>

</table>
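
To see how the two columns relate, the following sketch emulates the CUDA thread layout in plain Python: the outer loop plays the role of `blockIdx.x` (one block per environment) and the inner loop plays the role of `threadIdx.x` (one thread per agent). The function name `emulate_location_update` and the sample values are made up for illustration; on the GPU, the loop bodies run in parallel rather than one after another.

```python
import math


def emulate_location_update(loc_x, loc_y, speed, direction, num_envs, num_agents):
    """Sequentially emulate the per-thread location update on flat arrays."""
    for env_id in range(num_envs):  # plays the role of blockIdx.x
        for agent_id in range(num_agents):  # plays the role of threadIdx.x
            idx = env_id * num_agents + agent_id
            loc_x[idx] += speed[idx] * math.cos(direction[idx])
            loc_y[idx] += speed[idx] * math.sin(direction[idx])


num_envs, num_agents = 2, 2
loc_x = [0.0, 1.0, 2.0, 3.0]  # flattened (num_envs, num_agents) array
loc_y = [0.0, 0.0, 0.0, 0.0]
speed = [1.0, 2.0, 1.0, 2.0]
direction = [0.0, math.pi / 2, math.pi, 0.0]

# The arrays are modified in place, just like in the CUDA C kernel.
emulate_location_update(loc_x, loc_y, speed, direction, num_envs, num_agents)
```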

### 4. The EnvWrapper Class

Once the CUDA version of the code is ready, WarpDrive provides an environment wrapper class to help launch the simulation on the CPU or the GPU. This wrapper determines whether the simulation needs to run on the CPU or the GPU (via the `use_cuda` argument flag), and proceeds accordingly. If the environment runs on the CPU, the `reset` and `step` calls also occur on the CPU. If the environment runs on the GPU, only the first `reset` happens on the CPU; all the relevant data is then copied over to the GPU, and the subsequent steps (and resets) all happen on the GPU. In the latter case, the environment wrapper also uses the `num_envs` argument to instantiate multiple replicas of the environment on the GPU.

Additionally, the environment wrapper handles all the tasks required to run the environment on the GPU:

- Determines the environment's observation and action spaces
- Initializes the CUDA data and function managers for the environment
- Registers the CUDA version of the step() function
- Pushes the data listed in the data dictionary and tensor dictionary attributes of the environment, and repeats them across the environment dimension, if necessary.
- Automatically resets each environment when it is done.
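
As an illustration of repeating data across the environment dimension, the sketch below replicates a per-agent array so that every environment replica gets its own copy. It uses plain NumPy and only assumes the resulting layout, shaped `(num_envs, num_agents)`; how WarpDrive performs the replication internally may differ.

```python
import numpy as np

num_envs, num_agents = 3, 4

# Initial per-agent data for a single environment.
loc_x_single_env = np.linspace(0.0, 1.0, num_agents, dtype=np.float32)

# Stack one independent copy per environment replica.
loc_x_all_envs = np.repeat(loc_x_single_env[np.newaxis, :], num_envs, axis=0)

# Modifying one replica leaves the others (and the original) untouched.
loc_x_all_envs[0, 0] = 5.0
```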

#### Register the CUDA environment 

Here are some more details on how EnvWrapper can identify and build your environment automatically once the CUDA C step function is ready.

To make this work, you need to register your environment in `warp_drive/utils/common`. In `get_env_directory()`, you can simply provide the path to your CUDA environment source code. Please remember that the registry uses the environment name defined in your environment class as the key, so that the EnvWrapper class can link it to the right environment.

The **FULL_PATH_TO_YOUR_ENV_SRC** can be any path inside or outside of WarpDrive. For example, you can develop your own CUDA step function and environment in your codebase and register right here.

```python
   envs = {
       "TagGridWorld": f"{get_project_root()}/example_envs/tag_gridworld/tag_gridworld_step.cu",
       "TagContinuous": f"{get_project_root()}/example_envs/tag_continuous/tag_continuous_step.cu",
       "YOUR_ENVIRONMENT": "FULL_PATH_TO_YOUR_ENV_SRC",
   }
```

Now, inside the EnvWrapper, the function manager can pass `self.num_env` and `self.num_agents` to CUDA at compile time to build and load a unique CUDA environment context for all the tasks.

#### Unittest the CUDA environment 

You may also want to register the CUDA environment source code in `cuda_includes/test_build.cu`. You may have already used this CUDA file to explore some WarpDrive functionality in earlier tutorials such as [WarpDrive sampler](https://www.github.com/salesforce/warp-drive/blob/master/tutorials/tutorial-2-warp_drive_sampler.ipynb) and [WarpDrive reset and log](https://www.github.com/salesforce/warp-drive/blob/master/tutorials/tutorial-3-warp_drive_reset_and_log.ipynb). Registering your environment there brings it into the unit test scope. For example, if you have a unit test module that tests the consistency between the CPU and the GPU (as discussed later in this notebook), a CUDA environment with 5 agents and 2 environments will be built automatically and used by the unit test framework. The build and test can be run by calling `make compile-test`.

## Important CUDA C Concepts

Writing CUDA programs requires basic knowledge of C and how CUDA C extends C. Here's a [quick reference](https://learnxinyminutes.com/docs/c/) to see the syntax of C. 

For many simulations, basic C concepts should get you very far. However, you could make very complex simulations -- the sky is the limit! 

Below, we'll discuss two important CUDA C concepts -- we're planning to add more materials and tools in the future to facilitate developing CUDA simulations.

### Array Indexing

As described in the first [tutorial](https://www.github.com/salesforce/warp-drive/blob/master/tutorials/tutorial-1-warp_drive_basics.ipynb#Array-indexing), CUDA stores arrays in a C-contiguous, or row-major, fashion.

In general, it helps to set up some indexing constants as you develop code, so you can reuse them across your code. For example, the index for a specific agent id `kThisAgentId` ($0 \leq \text{kThisAgentId} < \text{NumAgents}$) in the location arrays (shaped (`NumEnvs, NumAgents`)) would be
```C
const int kThisAgentArrayIdx = kEnvId * kNumAgents + kThisAgentId;
```
and this index can be reused across different contexts.
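
The same row-major arithmetic can be checked in plain Python. In the sketch below, the helper `flat_index` and the sample values are hypothetical; it confirms that the flat index addresses the same element as the two-dimensional `(env, agent)` lookup.

```python
def flat_index(env_id, agent_id, num_agents):
    """Row-major index into a flattened (num_envs, num_agents) array."""
    return env_id * num_agents + agent_id


num_envs, num_agents = 2, 3
nested = [[10, 11, 12], [20, 21, 22]]  # shape (num_envs, num_agents)
flat = [value for row in nested for value in row]  # row-major flattening

# Every (env, agent) pair maps to the same value in both layouts.
layouts_agree = all(
    flat[flat_index(e, a, num_agents)] == nested[e][a]
    for e in range(num_envs)
    for a in range(num_agents)
)

# The mapping can be inverted with divmod.
env_id, agent_id = divmod(flat_index(1, 2, num_agents), num_agents)
```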

### __syncthreads

Another keyword that is useful to understand in the context of multi-agent simulations is `__syncthreads()`. While all the agents can operate fully in parallel, there are often operations that need to be performed sequentially by the agents, or by only one of the agents. For such cases, we can use the **__syncthreads()** command, a thread block-level synchronization barrier. Each thread waits at that point until all the threads in the block have reached it, and only then proceeds.

```C
        // Increment time ONCE -- only 1 thread can do this.
        if (kThisAgentId == 0) {
            env_timestep_arr[kEnvId] += 1;
        }

        // Wait here until timestep has been updated
        __syncthreads();
```
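
As a loose CPU-side analogy (not GPU code), the same pattern can be reproduced with Python threads and `threading.Barrier`: one thread updates the shared timestep, and every thread waits at the barrier before reading it. The variable names below are made up for illustration.

```python
import threading

NUM_AGENTS = 4
timestep = [0]  # shared state, like env_timestep_arr[kEnvId]
observed = [None] * NUM_AGENTS
barrier = threading.Barrier(NUM_AGENTS)  # plays the role of __syncthreads()


def agent_step(agent_id):
    # Increment time ONCE: only agent 0 does this.
    if agent_id == 0:
        timestep[0] += 1
    # Wait here until every agent thread has arrived, so the update is done.
    barrier.wait()
    observed[agent_id] = timestep[0]


threads = [threading.Thread(target=agent_step, args=(i,)) for i in range(NUM_AGENTS)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

Without the barrier, some threads could read the timestep before agent 0 has incremented it.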

## Debugging and Checking Consistency

Once you are done building your environment, you may use the `env_cpu_gpu_consistency_checker` function in WarpDrive to ensure the Python and CUDA C versions of the environment are logically consistent with one another. The consistency tests run across two full episode lengths (to ensure consistent behavior even beyond the point when the environments are reset), and ensure that the observations, rewards, and done flags match one another. For catching syntax errors, the C compiler is pretty good at pointing out the exact error and the line number. Often, to figure out deeper issues with the code, `printf` is your best friend.
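
Conceptually, a consistency check steps two implementations from the same initial state and compares their outputs at every step within a tolerance. The sketch below shows that idea with two toy Python step functions; all function names are hypothetical, and it does not reproduce the interface or internals of `env_cpu_gpu_consistency_checker`.

```python
import math


def step_reference(loc_x, speed, direction):
    """Reference step: returns the new x locations as a fresh list."""
    return [x + s * math.cos(d) for x, s, d in zip(loc_x, speed, direction)]


def step_in_place(loc_x, speed, direction):
    """Alternative step: updates the x locations in place, index by index."""
    for idx in range(len(loc_x)):
        loc_x[idx] += speed[idx] * math.cos(direction[idx])


def rollouts_match(num_steps, abs_tol=1e-6):
    """Run both implementations from the same state and compare each step."""
    speed, direction = [1.0, 0.5], [0.0, math.pi / 3]
    state_a, state_b = [0.0, 1.0], [0.0, 1.0]
    for _ in range(num_steps):
        state_a = step_reference(state_a, speed, direction)
        step_in_place(state_b, speed, direction)
        pairs = zip(state_a, state_b)
        if not all(math.isclose(a, b, abs_tol=abs_tol) for a, b in pairs):
            return False
    return True


consistent = rollouts_match(num_steps=10)
```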

# Learn More and Explore our Tutorials!

And that's it for this tutorial. Good luck building your environments.
Once you are done building, see our next [tutorial](https://www.github.com/salesforce/warp-drive/blob/master/tutorials/tutorial-5-training_with_warp_drive.ipynb) on training your environment with WarpDrive.

For your reference, all our tutorials are here:
- [A simple end-to-end RL training example](https://www.github.com/salesforce/warp-drive/blob/master/tutorials/simple-end-to-end-example.ipynb)
- [WarpDrive basics](https://www.github.com/salesforce/warp-drive/blob/master/tutorials/tutorial-1-warp_drive_basics.ipynb)
- [WarpDrive sampler](https://www.github.com/salesforce/warp-drive/blob/master/tutorials/tutorial-2-warp_drive_sampler.ipynb)
- [WarpDrive reset and log](https://www.github.com/salesforce/warp-drive/blob/master/tutorials/tutorial-3-warp_drive_reset_and_log.ipynb)
- [Creating custom environments](https://www.github.com/salesforce/warp-drive/blob/master/tutorials/tutorial-4-create_custom_environments.ipynb)
- [Training with WarpDrive](https://www.github.com/salesforce/warp-drive/blob/master/tutorials/tutorial-5-training_with_warp_drive.ipynb)