<a href="https://colab.research.google.com/github/dlsyscourse/lecture14/blob/main/14_hardware_acceleration_architecture_overview.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lecture 14: Hardware Acceleration Implementation

In this lecture, we will walk through the backend scaffolding that gives needle hardware acceleration.




## Select a GPU runtime type
In this lecture, we use C++ and CUDA to build accelerated linear algebra libraries. To do so, make sure you select a runtime type with a GPU and rerun the cells if needed:
- Click on the "Runtime" tab
- Click "Change runtime type"
- Select GPU

After you have started the right runtime, run the following command to check whether a GPU is available.

In [1]:
!nvidia-smi

Wed Apr 17 02:36:29 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.72                 Driver Version: 536.45       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA GeForce RTX 4060 ...    On  | 00000000:01:00.0 Off |                  N/A |
| N/A   39C    P4              14W /  55W |      0MiB /  8188MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

## Prepare the codebase

To get started, we can clone the related repo from GitHub. We also install pybind11, which the build depends on.

In [2]:
!python3 -m pip install pybind11


### Build the needle cuda library

We use pybind11 to build a C++/CUDA library for acceleration. Run `make` to build the library.

We can then run the following commands to make the package path available in the Colab environment and in `PYTHONPATH`.

In [3]:
%set_env PYTHONPATH ./python
%set_env NEEDLE_BACKEND nd
import sys
sys.path.append('./python')

env: PYTHONPATH=./python
env: NEEDLE_BACKEND=nd


## Codebase walkthrough


Now click the files panel on the left side. You should see these files:

Python:
- needle/backend_ndarray/ndarray.py
- needle/backend_ndarray/ndarray_backend_numpy.py

C++/CUDA:
- src/ndarray_backend_cpu.cc
- src/ndarray_backend_cuda.cu

The main goal of this lecture is to create an accelerated ndarray library.
As a result, we do not need to deal with needle.Tensor for now and will focus on backend_ndarray's implementation.

After we build up this array library, we can then use it to power backend array computations in needle.


## Creating a CUDA NDArray






In [4]:
from needle import backend_ndarray as nd

Using needle backend


In [5]:
x = nd.NDArray([1, 2, 3], device=nd.cpu())

device:  cpu()
array device:  cpu()


In [6]:
print(nd.cuda())
x = nd.NDArray([1, 2, 3], device=nd.cuda())

cuda()
device:  cuda()
array device:  cuda()


AttributeError: 'NoneType' object has no attribute 'Array'

In [None]:
x.device

In [None]:
x = nd.NDArray([1,2,3], device=nd.cuda())

In [None]:
y = x + 1

In [None]:
y

In [None]:
y = x + x

In [None]:
y

We can create a CUDA NDArray from the data by passing the `device` keyword.

In [None]:
x = nd.NDArray([1, 2, 3], device=nd.cuda())

In [None]:
y = x + 1

In [None]:
x.numpy()

In [None]:
x.device

In [None]:
y = x + 1

In [None]:
y.device

In [None]:
y.numpy()

### Key Data Structures

The key data structures in `backend_ndarray` are:

- NDArray: the container that holds a device-specific ndarray
- BackendDevice: the backend device
    - `mod` holds the module that implements all functions
    - check out `ndarray_backend_numpy.py` for a Python-side reference.
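The role of `mod` can be illustrated with a minimal sketch. The class `SketchDevice` below is hypothetical and is not the needle implementation; it assumes a device simply forwards attribute lookups to its module, and it uses NumPy as a stand-in for the backend module.

```python
import numpy as np


class SketchDevice:
    """Hypothetical stand-in for BackendDevice: a name plus an implementation module."""

    def __init__(self, name, mod):
        self.name = name
        self.mod = mod  # module that implements the array functions, or None

    def __repr__(self):
        return self.name + "()"

    def __getattr__(self, attr):
        # Only called when normal lookup fails, so `name` and `mod` never reach here.
        # Every other attribute is looked up on the implementation module.
        return getattr(self.mod, attr)

    def enabled(self):
        return self.mod is not None


cpu_like = SketchDevice("cpu_numpy", np)  # NumPy plays the role of the backend module
missing = SketchDevice("cuda", None)      # models a backend library that was not built

print(cpu_like, cpu_like.add(1, 2))       # the call is forwarded to np.add
print(missing, missing.enabled())
```

Under this assumption, accessing `missing.Array` raises `AttributeError: 'NoneType' object has no attribute 'Array'`, which is one plausible way an error like the one shown earlier can arise when the CUDA library is unavailable.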



## Trace GPU execution

Now, let us take a look at what happens when we execute the following code:


In [None]:
x = nd.NDArray([1, 2, 3], device=nd.cuda())
y = x + 1

In [None]:
x.device.from_numpy

In [None]:
x = nd.NDArray([1, 2, 3])

In [None]:
x.device.from_numpy

The addition `y = x + 1` follows this trace (shown here for the CPU backend):

backend_ndarray/ndarray.py
- `NDArray.__add__`
- `NDArray.ewise_or_scalar`
- `ndarray_backend_cpu.cc:ScalarAdd`
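The dispatch pattern in this trace can be sketched in plain Python. The names `NumpyModule` and `SketchArray` are hypothetical, and the function signatures (inputs first, preallocated output last) are an assumption for illustration rather than the exact needle API.

```python
import numpy as np


class NumpyModule:
    """Hypothetical backend module: each function writes its result into `out`."""

    @staticmethod
    def ewise_add(a, b, out):
        out[:] = a + b

    @staticmethod
    def scalar_add(a, val, out):
        out[:] = a + val


class SketchArray:
    """Hypothetical stand-in for NDArray holding a flat float32 buffer."""

    def __init__(self, data, mod=NumpyModule):
        self.handle = np.array(data, dtype="float32")
        self.mod = mod

    def ewise_or_scalar(self, other, ewise_func, scalar_func):
        # Allocate the output, then pick the backend function by operand type.
        out = SketchArray(np.empty_like(self.handle), self.mod)
        if isinstance(other, SketchArray):
            ewise_func(self.handle, other.handle, out.handle)  # array + array
        else:
            scalar_func(self.handle, other, out.handle)        # array + scalar
        return out

    def __add__(self, other):
        return self.ewise_or_scalar(other, self.mod.ewise_add, self.mod.scalar_add)


x = SketchArray([1, 2, 3])
print((x + 1).handle)  # routed through scalar_add
print((x + x).handle)  # routed through ewise_add
```

Swapping `NumpyModule` for a module backed by compiled code changes where the arithmetic runs without touching `__add__`, which is the point of routing everything through the device module.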

In [None]:
y.numpy()

The call `y.numpy()` follows this trace:

- `NDArray.numpy`
- `ndarray_backend_cpu.cc:to_numpy`

## Guidelines for Reading C++/CUDA related Files

Read:
- src/ndarray_backend_cpu.cc
- src/ndarray_backend_cuda.cu


Optional:
- CMakeLists.txt: this sets up the build, and you likely do not need to tweak it.







## NDArray Data Structure

Open up `python/needle/backend_ndarray/ndarray.py`.

An NDArray contains the following fields:
- handle: The backend handle to the flat array that stores the data.
- shape: The shape of the NDArray.
- strides: The strides that describe how to reach multi-dimensional elements in the flat array.
- offset: The offset of the first element in the flat array.
- device: The backend device that backs the computation
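How these fields combine can be shown with a short sketch. It assumes strides are counted in elements rather than bytes; the helper names `flat_index` and `compact_strides` are illustrative and not part of needle.

```python
def flat_index(index, strides, offset=0):
    """Position in the flat buffer of the element at multi-dimensional `index`."""
    return offset + sum(i * s for i, s in zip(index, strides))


def compact_strides(shape):
    """Row-major strides (in elements) for a contiguous array of this shape."""
    strides, step = [], 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


strides = compact_strides((2, 3))
print(strides)                      # strides of a contiguous 2x3 array
print(flat_index((1, 2), strides))  # flat position of element [1, 2]
```

For a contiguous 2x3 array the strides are `(3, 1)`, so element `[1, 2]` sits at flat position `1 * 3 + 2 * 1 = 5`.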






## Transformation as Strided Computation

We can use the strides and offset to perform transformations and slicing with zero copy.

- Broadcast: insert strides equal to 0
- Transpose: swap the strides
- Slice: change the offset and shape

For most computations, however, we call `array.compact()` first to get contiguous, aligned memory before running the computation.
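The three transformations can be checked with a small pure-Python sketch. The helper `materialize` is hypothetical: it reads a flat buffer through a given shape, strides (in elements), and offset, and copies the result out in row-major order, which is conceptually what a compact-style copy does.

```python
from itertools import product


def materialize(flat, shape, strides, offset=0):
    """Copy a strided view of `flat` into a new contiguous row-major list."""
    return [
        flat[offset + sum(i * s for i, s in zip(idx, strides))]
        for idx in product(*(range(n) for n in shape))
    ]


flat = [0, 1, 2, 3, 4, 5]

base = materialize(flat, (2, 3), (3, 1))              # plain 2x3 view
transposed = materialize(flat, (3, 2), (1, 3))        # swap shape and strides
sliced = materialize(flat, (2, 2), (3, 1), offset=1)  # base[:, 1:3]
broadcast = materialize(flat, (2, 3), (0, 1))         # row [0, 1, 2] repeated twice

print(base)
print(transposed)
print(sliced)
print(broadcast)
```

None of the views needs its own buffer until `materialize` is called; only the shape, strides, and offset differ between them.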

In [None]:
import numpy as np
x = nd.NDArray([0,1,2,3,4,5], device=nd.cpu_numpy())

In [None]:
x.numpy()

In [None]:
z = nd.NDArray.make(shape=(2, 3),
                strides=(3, 1),
                device=x.device,
                handle=x._handle,
                offset=0)
z

In [None]:
b = nd.NDArray.make(shape=(2, 3, 4),
                    strides=(3, 1, 0),
                    device=z.device,
                    handle=z._handle,
                    offset=0)
b

In [None]:
y = nd.NDArray.make(shape=(3, 2, 2), strides=(2, 1, 0), device=x.device, handle=x._handle, offset=0)
y.numpy()

In [None]:
x = nd.NDArray([1, 2, 3, 4], device=nd.cpu_numpy())

In [None]:
x.numpy()

We can use strides and shape manipulation to create different views of the same array.

In [None]:
y = nd.NDArray.make(shape=(2, 2), strides=(2, 1), device=x.device, handle=x._handle, offset=0)

In [None]:
y.numpy()

In [None]:
z = nd.NDArray.make(shape=(2, 1), strides=(2, 1), device=x.device, handle=x._handle, offset=1)

In [None]:
z.numpy()

## CUDA Acceleration

Now let us open `src/ndarray_backend_cuda.cu` and take a look at the current implementation of the GPU ops.


## Steps for adding a new operator implementation
- Add an implementation in `ndarray_backend_cuda.cu` and expose it via pybind
- Call the operator from `ndarray.py`
- Write test cases

In [None]:
!make

If we run the code block directly, we will see an error, because multiplication is not yet implemented.

In [None]:
x = nd.NDArray([1,2,3], device=nd.cuda())
x * 2

In [None]:
!nvprof python test_mul.py

## Connect back to needle Tensor

So far we have only played with the `backend_ndarray` subpackage, which is a self-contained ndarray implementation within needle.

We can connect the ndarray back to needle as a backend.

In [None]:
import needle as ndl

In [None]:
x = ndl.Tensor([1,2,3], device=nd.cuda(), dtype="float32")
y = ndl.Tensor([2,3,5], device=nd.cuda(), dtype="float32")
z = x + y
z

In [None]:
z.device

In [None]:
type(z.cached_data)

## Write Standalone Python Test Files

Now that needle has additional C++/CUDA libraries, we need to run `make` to rebuild the library after every change. Additionally, because the Colab environment caches the old library, it is inconvenient to use IPython cells to debug the updated library.




In [None]:
!make


We recommend writing separate Python files and invoking them from the command line. Create a new file `tests/mytest.py` and write your local tests there. This is also a common development practice in large projects that involve a Python/C++ FFI.
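As a starting point, such a file might look like the sketch below. The helper `check_scalar_mul` and its parameters are hypothetical; it compares a backend's scalar multiplication against NumPy, and the `__main__` block uses plain NumPy as a stand-in backend so that the file runs even without the compiled library.

```python
# tests/mytest.py (sketch)
import numpy as np


def check_scalar_mul(make_array, to_numpy, shape=(4, 5), scalar=2.0, seed=0):
    """Compare `array * scalar` on some backend against the NumPy reference."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(shape).astype("float32")
    out = to_numpy(make_array(a) * scalar)
    np.testing.assert_allclose(out, a * scalar, rtol=1e-5, atol=1e-5)
    return True


if __name__ == "__main__":
    # Stand-in backend: plain NumPy arrays, so no compiled library is needed.
    check_scalar_mul(lambda a: a.copy(), lambda x: x)
    print("scalar mul check passed")
```

To exercise the CUDA backend instead, one could pass `lambda a: nd.NDArray(a, device=nd.cuda())` and `lambda x: x.numpy()`, assuming `NDArray` accepts a NumPy array as input.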

In [None]:
!python tests/mytest.py

After building the library, you can fully restart the runtime (factory reset runtime) if you want to bring the updated changes back into a Colab notebook. Note that you will need to save your code changes to Drive or a private GitHub repo.