
Based on:

https://docs.cupy.dev/en/stable/user_guide/basic.html

In [1]:
import numpy as np
import cupy as cp

The cupy.ndarray class is at the core of CuPy and is a replacement class for NumPy’s numpy.ndarray.



In [2]:
x_gpu = cp.array([1, 2, 3])

x_gpu above is an instance of cupy.ndarray. As one can see, CuPy’s syntax here is identical to NumPy’s. The main difference between cupy.ndarray and numpy.ndarray is that CuPy arrays are allocated on the current device, which is covered in the Current Device section below.

Most array manipulations also work the same way as in NumPy. Take the Euclidean norm (a.k.a. the L2 norm), for example. NumPy provides the numpy.linalg.norm() function, which calculates it on the CPU.

In [3]:
x_cpu = np.array([1, 2, 3])
l2_cpu = np.linalg.norm(x_cpu)

Using CuPy, we can perform the same calculation on the GPU in a similar way:



In [4]:
x_gpu = cp.array([1, 2, 3])
l2_gpu = cp.linalg.norm(x_gpu)

In [5]:
print(l2_cpu)
print(l2_gpu)

3.7416573867739413
3.7416573867739413


CuPy implements many functions on cupy.ndarray objects. See the CuPy API reference for the supported subset of the NumPy API. Knowledge of NumPy carries over to most CuPy features, so we recommend familiarizing yourself with the NumPy documentation.

# Current Device
CuPy has a concept of a current device, which is the default GPU on which arrays are allocated, manipulated, and used in calculations. Suppose the ID of the current device is 0. In that case, a call such as cp.array([1, 2, 3, 4, 5]) creates the array on GPU 0.

All CuPy operations (except for multi-GPU features and device-to-device copy) are performed on the currently active device.

In general, CuPy functions expect that the array is on the same device as the current one. Passing an array stored on a non-current device may work depending on the hardware configuration but is generally discouraged as it may not be performant.
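
The device selection described above can be sketched as follows. This is a minimal sketch, assuming the cupy.cuda.Device context manager and the device attribute of cupy.ndarray. The guarded import and the names gpu_available and device_id are additions for illustration, so that the cell also runs on a machine without a GPU.

```python
# Guarded import: `gpu_available` is an illustrative flag, not part of CuPy.
try:
    import cupy as cp
    gpu_available = cp.cuda.runtime.getDeviceCount() > 0
except Exception:
    cp = None
    gpu_available = False

device_id = None
if gpu_available:
    # Make GPU 0 the current device for the duration of the block.
    with cp.cuda.Device(0):
        x_on_gpu0 = cp.array([1, 2, 3, 4, 5])
    # Each array records the device that holds its memory.
    device_id = x_on_gpu0.device.id

print("array lives on device:", device_id)
```

On a single-GPU Colab runtime the current device is already 0, so the with block is redundant there; it matters once more than one GPU is visible.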

# Data Transfer
# Move arrays to a device
cupy.asarray() can be used to move a numpy.ndarray, a list, or any object that can be passed to numpy.array() to the current device:

In [6]:
x_cpu = np.array([1, 2, 3])
x_gpu = cp.asarray(x_cpu)  # move the data to the current device.

# Move array from a device to the host
Moving a device array to the host can be done by cupy.asnumpy() as follows:

In [7]:
x_gpu = cp.array([1, 2, 3])  # create an array in the current device
x_cpu = cp.asnumpy(x_gpu)  # move the array to the host.

We can also use cupy.ndarray.get():

In [8]:
x_cpu = x_gpu.get()

# How to write CPU/GPU agnostic code
CuPy’s compatibility with NumPy makes it possible to write CPU/GPU agnostic code. For this purpose, CuPy implements the cupy.get_array_module() function, which returns a reference to cupy if any of its arguments resides on a GPU and to numpy otherwise. Here is an example of a CPU/GPU agnostic function that computes softplus, log(1 + exp(x)), in a numerically stable way using log1p:

In [None]:
# Stable implementation of log(1 + exp(x))
def softplus(x):
    xp = cp.get_array_module(x)  # 'xp' is a standard usage in the community
    print("Using:", xp.__name__)
    return xp.maximum(0, x) + xp.log1p(xp.exp(-abs(x)))
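
The function above is only defined, not called. A possible way to exercise it on host data is sketched below. The fallback get_array_module helper is an assumption added for this sketch so that it also runs where CuPy is not installed; the print call is dropped to keep the output short.

```python
import numpy as np

try:
    import cupy as cp
    get_array_module = cp.get_array_module
except ImportError:
    # Assumption for this sketch: without CuPy, every array is a NumPy array.
    def get_array_module(*args):
        return np

def softplus(x):
    xp = get_array_module(x)
    return xp.maximum(0, x) + xp.log1p(xp.exp(-abs(x)))

x_cpu = np.array([-2.0, 0.0, 2.0])
y_cpu = softplus(x_cpu)            # NumPy in, NumPy out

# Direct formula for comparison; fine for small inputs.
reference = np.log(1.0 + np.exp(x_cpu))

# For a large input the direct formula would overflow in exp(),
# while the stable form returns the input itself.
big = softplus(np.array([1000.0]))
```

Passing a cupy.ndarray instead would make get_array_module return cupy, and the same function body would then run on the GPU.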

When you need to combine CPU and GPU arrays, an explicit data transfer may be required to move them to the same location, either the CPU or the GPU. For this purpose, CuPy provides two sister functions, cupy.asnumpy() and cupy.asarray(). Here is an example that demonstrates the use of both:

In [11]:
x_cpu = np.array([1, 2, 3])
y_cpu = np.array([4, 5, 6])
x_cpu + y_cpu

x_gpu = cp.asarray(x_cpu)
# x_gpu + y_cpu  # mixing a CuPy and a NumPy array raises TypeError (Unsupported type)

result = cp.asnumpy(x_gpu) + y_cpu
print(result)
result = cp.asnumpy(x_gpu) + cp.asnumpy(y_cpu)
print(result)
result = x_gpu + cp.asarray(y_cpu)
print(result)
result = cp.asarray(x_gpu) + cp.asarray(y_cpu)
print(result)

[5 7 9]
[5 7 9]
[5 7 9]
[5 7 9]


The cupy.asnumpy() function returns a NumPy array (an array on the host), whereas cupy.asarray() returns a CuPy array (an array on the current device). Both functions accept arbitrary input: they can be applied to any data, located on either the host or a device, that can be converted to an array.

# User-Defined Kernels
CuPy provides easy ways to define three types of CUDA kernels: elementwise kernels, reduction kernels, and raw kernels. This section describes how to define and call each kind of kernel.

# Basics of elementwise kernels
An elementwise kernel can be defined by the ElementwiseKernel class. The instance of this class defines a CUDA kernel which can be invoked by the __call__ method of this instance.

A definition of an elementwise kernel consists of four parts: an input argument list, an output argument list, a loop body code, and the kernel name. For example, a kernel that computes a squared difference $f(x, y) = (x - y)^2$ is defined as follows:

In [13]:
squared_diff = cp.ElementwiseKernel(
    'float32 x, float32 y',
    'float32 z',
    'z = (x - y) * (x - y)',
    'squared_diff')

The argument lists consist of comma-separated argument definitions. Each argument definition consists of a type specifier and an argument name. Names of NumPy data types can be used as type specifiers.

Note: the names n and i, and any name starting with an underscore _, are reserved for internal use.

The above kernel can be called on either scalars or arrays with broadcasting:

In [15]:
x = cp.arange(10, dtype=np.float32).reshape(2, 5)
y = cp.arange(5, dtype=np.float32)
print(x)
print(y)
print(squared_diff(x, y))

print(squared_diff(x, 5))


[[0. 1. 2. 3. 4.]
 [5. 6. 7. 8. 9.]]
[0. 1. 2. 3. 4.]
[[ 0.  0.  0.  0.  0.]
 [25. 25. 25. 25. 25.]]
[[25. 16.  9.  4.  1.]
 [ 0.  1.  4.  9. 16.]]
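
The broadcasting behaviour shown above follows the NumPy rules, so the results can be cross-checked on the CPU. The following is a NumPy-only reference sketch, not a CuPy kernel; the names ref_xy and ref_x5 are chosen here for illustration.

```python
import numpy as np

x = np.arange(10, dtype=np.float32).reshape(2, 5)
y = np.arange(5, dtype=np.float32)

# y has shape (5,) and is broadcast across both rows of x, shape (2, 5).
ref_xy = (x - y) * (x - y)

# A scalar is broadcast against every element of x.
ref_x5 = (x - np.float32(5)) * (x - np.float32(5))

print(ref_xy)
print(ref_x5)
```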


Output arguments can be explicitly specified (next to the input arguments):

In [16]:
z = cp.empty((2, 5), dtype=np.float32)
print(squared_diff(x, y, z))

[[ 0.  0.  0.  0.  0.]
 [25. 25. 25. 25. 25.]]


# Type-generic kernels
If a type specifier is a single character, it is treated as a type placeholder. Placeholders can be used to define type-generic kernels. For example, the above squared_diff kernel can be made type-generic as follows:

In [17]:
squared_diff_generic = cp.ElementwiseKernel(
    'T x, T y',
    'T z',
    'z = (x - y) * (x - y)',
    'squared_diff_generic')

Type placeholders with the same character in a kernel definition indicate the same type. The actual type of a placeholder is determined by the type of the actual arguments. The ElementwiseKernel class first checks the output arguments and then the input arguments to determine the actual type. If no output arguments are given on kernel invocation, only the input arguments are used to determine the type.

The type placeholder can be used in the loop body code:

In [18]:
squared_diff_generic = cp.ElementwiseKernel(
    'T x, T y',
    'T z',
    '''
        T diff = x - y;
        z = diff * diff;
    ''',
    'squared_diff_generic')

More than one type placeholder can be used in a kernel definition. For example, the above kernel can be further made generic over multiple arguments:



In [19]:
squared_diff_super_generic = cp.ElementwiseKernel(
    'X x, Y y',
    'Z z',
    'z = (x - y) * (x - y)',
    'squared_diff_super_generic')

Note that this kernel requires the output argument to be specified explicitly, because the type Z cannot be determined automatically from the input arguments.
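
A call with three different argument types might look like the sketch below. The dtypes (int32, float32, float64) and the names gpu_available and result are choices made for this illustration, and the guard around the import is added so that the cell is skipped cleanly on a machine without a GPU.

```python
import numpy as np

# Guarded import: `gpu_available` is an illustrative flag, not part of CuPy.
try:
    import cupy as cp
    gpu_available = cp.cuda.runtime.getDeviceCount() > 0
except Exception:
    cp = None
    gpu_available = False

result = None
if gpu_available:
    squared_diff_super_generic = cp.ElementwiseKernel(
        'X x, Y y',
        'Z z',
        'z = (x - y) * (x - y)',
        'squared_diff_super_generic')

    x = cp.arange(5, dtype=np.int32)          # X resolves to int32
    y = cp.full(5, 0.5, dtype=np.float32)     # Y resolves to float32
    z = cp.empty(5, dtype=np.float64)         # Z resolves to float64
    squared_diff_super_generic(x, y, z)       # output passed explicitly
    result = cp.asnumpy(z)

print(result)
```

The output array fixes Z; without it the kernel has no way to pick an output type.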

# Reduction kernels
Reduction kernels can be defined by the ReductionKernel class. We can use it by defining four parts of the kernel code:

1. Identity value: This value is used for the initial value of reduction.

2. Mapping expression: It is used for the pre-processing of each element to be reduced.

3. Reduction expression: It is an operator to reduce the multiple mapped values. The special variables a and b are used for its operands.

4. Post mapping expression: It is used to transform the resulting reduced values. The special variable a is used as its input. Output should be written to the output parameter.

The ReductionKernel class automatically inserts the other code fragments that are required for an efficient and flexible reduction implementation.

For example, the L2 norm along specified axes can be written as follows:

In [20]:
l2norm_kernel = cp.ReductionKernel(
    'T x',  # input params
    'T y',  # output params
    'x * x',  # map
    'a + b',  # reduce
    'y = sqrt(a)',  # post-reduction map
    '0',  # identity value
    'l2norm'  # kernel name
)
x = cp.arange(10, dtype=np.float32).reshape(2, 5)
print(l2norm_kernel(x, axis=1))


[ 5.477226  15.9687195]
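
To see how the four parts fit together, the same computation can be spelled out step by step on the CPU. This is a NumPy-only reference sketch of the map, reduce, and post-map stages, not the code that ReductionKernel generates; the variable names are chosen here for illustration.

```python
import numpy as np

x = np.arange(10, dtype=np.float32).reshape(2, 5)

identity = np.float32(0)   # identity value: '0'
mapped = x * x             # mapping expression: 'x * x'

# Reduction expression 'a + b', applied along axis 1 starting from the identity.
reduced = np.add.reduce(mapped, axis=1, initial=identity)

result = np.sqrt(reduced)  # post mapping expression: 'y = sqrt(a)'
print(result)
```

The result should agree with np.linalg.norm(x, axis=1) up to floating-point rounding.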


# Raw kernels
Raw kernels can be defined by the RawKernel class. With raw kernels, you define kernels directly from raw CUDA source.

A RawKernel object lets you launch the kernel through CUDA’s cuLaunchKernel interface. In other words, you have control over the grid size, block size, shared memory size, and stream.

In [21]:
add_kernel = cp.RawKernel(r'''
extern "C" __global__
void my_add(const float* x1, const float* x2, float* y) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    y[tid] = x1[tid] + x2[tid];
}
''', 'my_add')
x1 = cp.arange(25, dtype=cp.float32).reshape(5, 5)
x2 = cp.arange(25, dtype=cp.float32).reshape(5, 5)
y = cp.zeros((5, 5), dtype=cp.float32)
add_kernel((5,), (5,), (x1, x2, y))  # grid, block and arguments
y


array([[ 0.,  2.,  4.,  6.,  8.],
       [10., 12., 14., 16., 18.],
       [20., 22., 24., 26., 28.],
       [30., 32., 34., 36., 38.],
       [40., 42., 44., 46., 48.]], dtype=float32)
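
The index arithmetic inside my_add can be traced on the CPU. The sketch below is a plain Python simulation of the launch with a grid of 5 blocks and 5 threads per block; the loops only mimic what the GPU does in parallel, and the names grid_dim, block_dim, block_idx, and thread_idx stand in for CUDA's gridDim.x, blockDim.x, blockIdx.x, and threadIdx.x.

```python
import numpy as np

grid_dim, block_dim = 5, 5   # same launch configuration as (5,), (5,)

# The kernel sees flat pointers, so the 5x5 arrays are treated as 25 elements.
x1 = np.arange(25, dtype=np.float32)
x2 = np.arange(25, dtype=np.float32)
y = np.zeros(25, dtype=np.float32)

for block_idx in range(grid_dim):
    for thread_idx in range(block_dim):
        # Global thread index, as computed in the CUDA source.
        tid = block_dim * block_idx + thread_idx
        y[tid] = x1[tid] + x2[tid]

y = y.reshape(5, 5)
print(y)
```

The grid covers exactly 25 threads for 25 elements. Since my_add has no bounds check, a launch with more threads than elements would access memory out of range.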

A very interesting comparison of NumPy and CuPy performance:

https://medium.com/@weidagang/cupy-faster-matrix-operations-with-gpus-in-python-7c9f9b69eb84

The associated Colab notebook:


https://colab.research.google.com/drive/1ytAyGSOKfjQ41V48hanFTH987yCZPGnI?usp=sharing