Copyright (c) 2021, salesforce.com, inc.\
All rights reserved.\
SPDX-License-Identifier: BSD-3-Clause\
For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause

**Try this notebook on [Colab](http://colab.research.google.com/github/salesforce/warp-drive/blob/master/tutorials/tutorial-1.b-warp_drive_basics.ipynb)!**

# ⚠️ PLEASE NOTE:
This notebook runs on a GPU runtime.\
If running on Colab, choose Runtime > Change runtime type from the menu, then select `GPU` in the 'Hardware accelerator' dropdown menu.

In [None]:
import torch

assert torch.cuda.device_count() > 0, "This notebook needs a GPU to run!"

# Dependencies

You can install the warp_drive package either

- with the pip package manager, OR
- by cloning the warp_drive repository and installing the requirements.

We will install the latest version of WarpDrive using the pip package manager.

In [None]:
! pip install -U rl_warp_drive

In [None]:
import numpy as np
from timeit import Timer

In [None]:
from warp_drive.managers.numba_managers.numba_data_manager import NumbaDataManager
from warp_drive.managers.numba_managers.numba_function_manager import NumbaFunctionManager
from warp_drive.utils.data_feed import DataFeed
from warp_drive.utils.common import get_project_root

In [None]:
# Set logger level e.g., DEBUG, INFO, WARNING, ERROR
import logging

logging.getLogger().setLevel(logging.INFO)

# Numba Example

In this tutorial, we will focus on using the Numba backend to run the same content as in tutorial 1.a.

In the following, we will demonstrate how to push and pull data between the host and the device, and how to write simple CUDA functions to manipulate the data. Let's begin by creating a `NumbaDataManager` object.

We specify a few multi-agent RL parameters in the `DataManager` constructor.

We'll create a multi-agent RL environment with 3 agents, an episode length of 5, and 2 environment replicas.

In [None]:
num_agents = 3
num_envs = 2
episode_length = 5

cuda_data_manager = NumbaDataManager(num_agents, num_envs, episode_length=episode_length)

Now, let's create some (random) data that we would like to push to the device. In the context of RL, this can pertain to the starting states created by `env.reset()`.

The starting states are arrays that need to hold data such as observations, actions and rewards during the course of the episode. They could also contain environment configuration settings and hyperparameters. 

Each environment and agent will have its own data, so we create a `(num_envs, num_agents)`-shaped array that will be pushed to the GPU.

In [None]:
random_data = np.random.rand(num_envs, num_agents)

In [None]:
random_data

# Push and pull data from host (CPU) to device (GPU)

In order to push data to the device, we have created a **DataFeed** helper object. For all data pushed from the host to device, we will need to provide a name identifier, the actual data, and two flags (both default to False):

- `save_copy_and_apply_at_reset` - if `True`, we make a copy of the starting data so that we can set the data array to that value at every environment reset, and
- `log_data_across_episode` - if `True`, we add a time dimension to the data, of size `episode_length`, set all $t>0$ index values to zeros, and store the data array at each time step separately. This is primarily used for logging the data for an episode rollout.
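
To make the two flags concrete, here is a minimal host-side sketch in plain Python. It only mirrors the behaviour described above and is not WarpDrive's implementation; the variable names (`reset_copy`, `live`, `logged`) are illustrative assumptions, and the library's internal layout may differ.

```python
import copy

num_envs, num_agents, episode_length = 2, 3, 5

# Starting data with one row per environment and one column per agent.
start = [[float(env + agent) for agent in range(num_agents)] for env in range(num_envs)]

# save_copy_and_apply_at_reset=True: keep a pristine copy for later resets.
reset_copy = copy.deepcopy(start)
live = copy.deepcopy(start)
live[0][0] += 10.0                 # the data changes during an episode
live = copy.deepcopy(reset_copy)   # a reset restores the starting values

# log_data_across_episode=True: add a leading time axis of size episode_length;
# only t = 0 holds the starting data, every later step starts out as zeros.
zeros = [[0.0] * num_agents for _ in range(num_envs)]
logged = [copy.deepcopy(start)] + [copy.deepcopy(zeros) for _ in range(episode_length - 1)]

print(len(logged), len(logged[0]), len(logged[0][0]))
```

The printed triple is the shape of the logged data: time steps, environments, agents.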

In [None]:
data_feed = DataFeed()
data_feed.add_data(
    name="random_data",
    data=random_data,
    save_copy_and_apply_at_reset=False,
    log_data_across_episode=False,
)

In [None]:
data_feed

The CUDA data manager provides the **push_data_to_device()** and **pull_data_from_device()** API methods to handle data transfer between the host and the device.

In [None]:
cuda_data_manager.push_data_to_device(data_feed)

Notice that the data manager cast the data from float64 to float32. WarpDrive stores numbers on the device using 32-bit floating-point or integer representations.
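
The effect of such a cast can be sketched on the host without a GPU. The helper `to_float32` below is a hypothetical name introduced for illustration; it rounds a 64-bit Python float to the nearest 32-bit float using the standard `struct` module, which is one way to see how small the resulting difference is.

```python
import struct


def to_float32(x):
    """Round a 64-bit Python float to the nearest 32-bit float (illustrative helper)."""
    return struct.unpack("f", struct.pack("f", x))[0]


value = 0.1                      # not exactly representable in binary
cast = to_float32(value)
error = abs(cast - value)        # rounding error introduced by the cast

print(f"original={value!r} cast={cast!r} error={error:.3e}")
```

Values that are exactly representable in 32 bits, such as `0.5`, survive the round trip unchanged.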

In [None]:
data_fetched_from_device = cuda_data_manager.pull_data_from_device("random_data")

The data fetched from the device matches the data pushed (the small differences are due to type-casting).

In [None]:
data_fetched_from_device

Another integral part of RL is training. We also need to hold the observations, actions and rewards arrays. So, for training, we will wrap the data into a PyTorch tensor.

## Making Training Data Accessible To PyTorch

Note that pushing and pulling data several times between the host and the device causes a lot of communication overhead. So, it's advisable that we push the data from the host to device only once, and then manipulate all the data on the GPU in-place. This is particularly important when data needs to be accessed frequently. A common example is the batch of observations and rewards gathered for each training iteration. 

Fortunately, our framework lets PyTorch access the data we pushed onto the GPU via pointers with minimal overhead. To make data accessible by PyTorch, we set the `torch_accessible` flag to True.

In [None]:
tensor_feed = DataFeed()
tensor_feed.add_data(name="random_tensor", data=random_data)

cuda_data_manager.push_data_to_device(tensor_feed, torch_accessible=True)

In [None]:
tensor_on_device = cuda_data_manager.data_on_device_via_torch("random_tensor")

## Time comparison for data pull (`torch_accessible` True versus False)

In [None]:
large_array = np.random.rand(1000, 1000)

### `torch_accessible=False`

In [None]:
data_feed = DataFeed()
data_feed.add_data(
    name="large_array",
    data=large_array,
)
cuda_data_manager.push_data_to_device(data_feed, torch_accessible=False)

In [None]:
Timer(lambda: cuda_data_manager.pull_data_from_device("large_array")).timeit(
    number=1000
)

### `torch_accessible=True`

In [None]:
data_feed = DataFeed()
data_feed.add_data(
    name="large_array_torch",
    data=large_array,
)
cuda_data_manager.push_data_to_device(data_feed, torch_accessible=True)

In [None]:
Timer(lambda: cuda_data_manager.data_on_device_via_torch("large_array_torch")).timeit(1000)

You can see that the time for accessing torch tensors on the GPU is negligible compared to pulling data arrays back to the host!

Currently, the `DataManager` supports primitive data types, such as ints, floats, lists, and arrays. If you would like to push more sophisticated data structures or types to the GPU, such as dictionaries, you may do so by pushing / pulling each key-value pair as a separate array.
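
One way to apply this to a dictionary is to derive a separate name for every key and push each value on its own. The helper `flatten_for_push` and the naming scheme `<prefix>_<key>` are assumptions made for this sketch, not part of WarpDrive.

```python
def flatten_for_push(prefix, mapping):
    """Map {"key": value} to {"<prefix>_key": value} (hypothetical helper)."""
    return {f"{prefix}_{key}": value for key, value in mapping.items()}


env_config = {"gravity": 9.8, "grid_size": [10, 10]}
flat = flatten_for_push("env_config", env_config)

for entry_name, entry_value in flat.items():
    # On a GPU machine, each entry could be added to a DataFeed here, e.g.
    # data_feed.add_data(name=entry_name, data=entry_value)
    print(entry_name, entry_value)
```

Pulling the dictionary back would then mean pulling each named entry and reassembling the key-value pairs on the host.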

# Code Execution Inside CUDA

Once we push all the relevant data to the GPU, we will need to write functions to manipulate the data. To this end, we will need to write code in Numba, but invoke it from the host. The `FunctionManager` is built to facilitate function initialization on the host and execution on the device. As we mentioned before, all the arrays on the GPU will be modified on the GPU, and in-place. Let's begin by creating a `NumbaFunctionManager` object.

In [None]:
cuda_function_manager = NumbaFunctionManager(
    num_agents=cuda_data_manager.meta_info("n_agents"),
    num_envs=cuda_data_manager.meta_info("n_envs"),
)

## Array manipulation inside Numba

In the previous tutorial, we discussed array indexing and our utility functions that facilitate indexing in CUDA. One great benefit of Numba is its intrinsic syntax for multi-dimensional array indexing. Let's rewrite the same example in Numba this time.

To recap, we write a simple function that increments each element of the pushed data (by an amount that depends on its environment and agent index). We will perform this operation in parallel on the (num_envs) number of GPU blocks and the (num_agents) number of threads within.
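
The block and thread mapping can be emulated sequentially on the host. The sketch below is an illustration only: `emulate_increment` and `threads_per_block` are hypothetical names, and the increment of `env_id + agent_id` matches the kernel defined in the cell further down. The outer loop plays the role of the blocks and the inner loop the role of the threads, which the GPU would run concurrently.

```python
def emulate_increment(data, num_envs, num_agents, threads_per_block):
    """Sequential stand-in for a launch with one block per env, one thread per agent."""
    for env_id in range(num_envs):                  # role of blockIdx.x
        for agent_id in range(threads_per_block):   # role of threadIdx.x
            if agent_id < num_agents:               # surplus threads do nothing
                data[env_id][agent_id] += env_id + agent_id
    return data


num_envs, num_agents = 2, 3
data = [[0.0] * num_agents for _ in range(num_envs)]

# Launch more "threads" than agents to show that the guard keeps writes in bounds.
result = emulate_increment(data, num_envs, num_agents, threads_per_block=4)
print(result)
```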

In general, the operation is (almost) parallel. Going into a bit more detail: CUDA employs a Single Instruction Multiple Thread (SIMT) architecture to manage and execute threads in groups of 32 called warps. So, as long as the number of agents is a multiple of 32, all the threads are utilized; otherwise, a few threads remain idle. For example, if we use $1000$ agents, $24$ threads will remain idle, for a utilization rate of roughly $97.7\%$.
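
The arithmetic behind this example can be checked with a few lines of Python. The function name `warp_utilization` is an assumption for this sketch; it rounds the thread count up to a whole number of warps and reports how many threads are left idle.

```python
import math

WARP_SIZE = 32  # threads per warp on CUDA devices


def warp_utilization(num_threads_needed, warp_size=WARP_SIZE):
    """Return (idle_threads, utilization) after rounding up to whole warps."""
    num_warps = math.ceil(num_threads_needed / warp_size)
    launched = num_warps * warp_size
    idle = launched - num_threads_needed
    return idle, num_threads_needed / launched


idle, utilization = warp_utilization(1000)
print(f"idle threads: {idle}, utilization: {utilization:.2%}")
```

With 1000 agents, 32 warps (1024 threads) are needed, so 24 threads do no useful work.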

In [None]:
source_code = """
import numba.cuda as numba_driver

@numba_driver.jit
def cuda_increment(data, num_agents):
    env_id = numba_driver.blockIdx.x
    agent_id = numba_driver.threadIdx.x
    if agent_id < num_agents:
        increment = env_id + agent_id
        data[env_id, agent_id] += increment
"""

Next, we use the `FunctionManager` API method **import_numba_from_source_code()** to build and load the Numba code.

*Note: WarpDrive does not support loading Numba source code directly from a string. In general, it is standard practice to keep source code in standalone .py files; here, the source code above is saved in example_envs/dummy_env.*

In [None]:
source_code_path = "example_envs.dummy_env.tutorial_basics"
cuda_function_manager.import_numba_from_source_code(
    source_code_path, default_functions_included=False
)
cuda_function_manager.initialize_functions(["cuda_increment"])

We will use the `FunctionManager`'s API method **get_function()** to load the CUDA kernel function and get a handle to invoke it from the host.

In [None]:
increment_function = cuda_function_manager.get_function("cuda_increment")

Now, when invoking the increment function, along with the `data` and `num_agents` arguments, we also need to provide the block and grid arguments. These are attributes of the `FunctionManager`; simply use

- `block=cuda_function_manager.block`, and
- `grid=cuda_function_manager.grid`

Also, since we need to use the `num_agents` parameter, we also need to push it to the device. Instead of using a `DataFeed`, we may also push as follows:

In [None]:
cuda_data_manager.push_data_to_device(
    {
        "num_agents": {
            "data": num_agents,
            "attributes": {
                "save_copy_and_apply_at_reset": False,
                "log_data_across_episode": False,
            },
        }
    }
)

In [None]:
block = cuda_function_manager.block
grid = cuda_function_manager.grid

increment_function[grid, block](
    cuda_data_manager.device_data("random_data"),
    cuda_data_manager.device_data("num_agents"),
)

Below is the original (random) data that we pushed to the GPU:

In [None]:
random_data

and here's the incremented data:

In [None]:
cuda_data_manager.pull_data_from_device("random_data")

As expected, this method incremented each entry at index `(env_id, agent_id)` of the original data by `(env_id + agent_id)`! The differences are below.

In [None]:
cuda_data_manager.pull_data_from_device("random_data") - random_data

And we can invoke the increment function again to increment one more time (also in-place on the GPU), and the differences double.

In [None]:
block = cuda_function_manager.block
grid = cuda_function_manager.grid

increment_function[grid, block](
    cuda_data_manager.device_data("random_data"),
    cuda_data_manager.device_data("num_agents"),
)
cuda_data_manager.pull_data_from_device("random_data") - random_data

# Validating CUDA parallelism

We put all the pieces introduced so far together, and record the times for parallelized operations with different `num_envs` and `num_agents` settings.

In [None]:
def push_random_data_and_increment_timer(
    num_runs=1,
    num_envs=2,
    num_agents=3,
    source_code_path=None,
    episode_length=100,
):

    assert source_code_path is not None

    # Initialize the CUDA data manager
    cuda_data_manager = NumbaDataManager(
        num_agents=num_agents, num_envs=num_envs, episode_length=episode_length
    )

    # Initialize the CUDA function manager
    cuda_function_manager = NumbaFunctionManager(
        num_agents=cuda_data_manager.meta_info("n_agents"),
        num_envs=cuda_data_manager.meta_info("n_envs"),
    )

    # Load source code and initialize function
    cuda_function_manager.import_numba_from_source_code(
        source_code_path, default_functions_included=False
    )
    cuda_function_manager.initialize_functions(["cuda_increment"])
    increment_function = cuda_function_manager.get_function("cuda_increment")

    def push_random_data(num_agents, num_envs):
        # Create random data
        random_data = np.random.rand(num_envs, num_agents)

        # Push data from host to device
        data_feed = DataFeed()
        data_feed.add_data(
            name="random_data",
            data=random_data,
        )
        data_feed.add_data(name="num_agents", data=num_agents)
        cuda_data_manager.push_data_to_device(data_feed)

    def increment_data():
        block = cuda_function_manager.block
        grid = cuda_function_manager.grid

        increment_function[grid, block](
            cuda_data_manager.device_data("random_data"),
            cuda_data_manager.device_data("num_agents"),
        )

    # One-time data push
    data_push_time = Timer(lambda: push_random_data(num_agents, num_envs)).timeit(
        number=1
    )
    # Increment the arrays 'num_runs' times
    program_run_time = Timer(lambda: increment_data()).timeit(number=num_runs)

    return {"data push times": data_push_time, "code run time": program_run_time}

## Record the times for a single data push and 10000 increment kernel calls.

In [None]:
%%capture

num_runs = 10000
times = {}

for scenario in [
    (1, 1),
    (1, 10),
    (1, 100),
    (10, 10),
    (1, 1000),
    (100, 100),
    (1000, 1000),
]:
    num_envs, num_agents = scenario
    times.update(
        {
            f"envs={num_envs}, agents={num_agents}": push_random_data_and_increment_timer(
                num_runs, num_envs, num_agents, source_code_path
            )
        }
    )

In [None]:
print(f"Times for {num_runs} function calls")
print("*" * 40)
for key, value in times.items():
    print(
        f"{key:30}: data push time: {value['data push times']:10.5}s,\t total increment time: {value['code run time']:10.5}s"
    )

As we increase the number of environments and agents, the data size becomes larger, so pushing data becomes slower. But since all the threads operate in parallel, the time taken by the increment function remains about the same!
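
When reading these numbers, keep in mind that `Timer.timeit(number=n)` returns the total time for all `n` calls, so a per-call mean has to be computed by dividing. The sketch below shows this with a cheap stand-in workload; `mean_call_time` is a hypothetical helper, and the lambda merely takes the place of the kernel launch.

```python
from timeit import Timer


def mean_call_time(func, number):
    """Return (total_seconds, mean_seconds_per_call) for `number` calls of `func`."""
    total = Timer(func).timeit(number=number)
    return total, total / number


# Stand-in workload; on a GPU machine this would be the kernel launch instead.
total, mean = mean_call_time(lambda: sum(range(100)), number=1000)
print(f"total: {total:.5f}s, mean per call: {mean:.3e}s")
```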

Also notice that Numba is much slower than PyCUDA in this simple example (it runs at roughly a tenth of the speed). The main reason is that the JIT repeats its runtime compilation every time the function is called. Since the kernel function itself is very lightweight in this example, the compilation time dominates the total time. The gap narrows considerably in real problems, where the kernel function takes much more time to execute and the JIT also helps optimize the kernel execution at runtime.

And that's it! By using building blocks such as the increment function, we can create arbitrarily complex functions in Numba, just as we can in CUDA C. For some comparative examples, please see the example environments that have both Python implementations in `examples/envs` and corresponding CUDA C implementations in `src/envs`.

Below are some useful starting resources for CUDA C programming:

- [CUDA tutorial](https://cuda-tutorial.readthedocs.io/en/latest/)
- [Learn C](https://learnxinyminutes.com/docs/c/)
- [CUDA Quick Reference](http://www.icl.utk.edu/~mgates3/docs/cuda.html)
<!-- - [Thrust](https://developer.nvidia.com/thrust). Note: thrust is a flexible, high-level interface for GPU programming that greatly enhances developer productivity. -->

# Learn More and Explore our Tutorials!

This is the first tutorial on WarpDrive. Next, we suggest you check out our advanced tutorials on [WarpDrive's sampler](https://www.github.com/salesforce/warp-drive/blob/master/tutorials/tutorial-2-warp_drive_sampler.ipynb) and [WarpDrive's reset and log controller](https://www.github.com/salesforce/warp-drive/blob/master/tutorials/tutorial-3-warp_drive_reset_and_log.ipynb).

For your reference, all our tutorials are here:
1. [WarpDrive basics (intro and pycuda)](https://www.github.com/salesforce/warp-drive/blob/master/tutorials/tutorial-1.a-warp_drive_basics.ipynb)
2. [WarpDrive basics (numba)](https://www.github.com/salesforce/warp-drive/blob/master/tutorials/tutorial-1.b-warp_drive_basics.ipynb)
3. [WarpDrive sampler(pycuda)](https://www.github.com/salesforce/warp-drive/blob/master/tutorials/tutorial-2.a-warp_drive_sampler.ipynb)
4. [WarpDrive sampler(numba)](https://www.github.com/salesforce/warp-drive/blob/master/tutorials/tutorial-2.b-warp_drive_sampler.ipynb)
5. [WarpDrive resetter and logger](https://www.github.com/salesforce/warp-drive/blob/master/tutorials/tutorial-3-warp_drive_reset_and_log.ipynb)
6. [Create custom environments (pycuda)](https://www.github.com/salesforce/warp-drive/blob/master/tutorials/tutorial-4.a-create_custom_environments_pycuda.md)
7. [Create custom environments (numba)](https://www.github.com/salesforce/warp-drive/blob/master/tutorials/tutorial-4.b-create_custom_environments_numba.md)
8. [Training with WarpDrive](https://www.github.com/salesforce/warp-drive/blob/master/tutorials/tutorial-5-training_with_warp_drive.ipynb)
9. [Scaling Up training with WarpDrive](https://www.github.com/salesforce/warp-drive/blob/master/tutorials/tutorial-6-scaling_up_training_with_warp_drive.md)
10. [Training with WarpDrive + PyTorch Lightning](https://github.com/salesforce/warp-drive/blob/master/tutorials/tutorial-7-training_with_warp_drive_and_pytorch_lightning.ipynb)