# Concurrency with Streams

## Learning Objectives

In this lab we will learn how to achieve concurrency in CUDA using CUDA streams. (Concurrency can also be achieved by using multiple GPUs; we will explore this in a later lab.) Topics include:

- CUDA streams: what they are, and how to use them
- Achieving asynchronous H2D/D2H copies and kernels

## Motivation

Recall the three-step processing workflow for CUDA:

![](images/simple_processing_flow.png)

From a timeline perspective, this looks like:

![](images/serial_processing_flow.png)

That means that a substantial portion of the timeline is not spent on compute, and we're not utilizing the GPU efficiently. We would like instead to be able to do:

![](images/concurrent_processing_flow.png)

Then a much higher percentage of the time spent would involve computation.

## Pinned (Non-Pageable) Memory

CUDA enables the allocation of host *pinned memory*. Pinned memory enables:

- Faster host <-> device copies
- Memcopies from CPU to GPU that are asynchronous
- Memcopies from GPU to CPU that are asynchronous
- Direct access within a CUDA kernel
    - For this reason, pinned memory is also called "zero-copy" memory in some sources

It is used by calling the API:

```
cudaMallocHost(&ptr, length);
cudaFreeHost(ptr);
```

CUDA pinned memory is *page-locked* on the host. That means the virtual to physical address translation is fixed; the memory is not subject to normal OS paging. This allows the GPU to directly dereference the pointer in a kernel (since the GPU knows directly where to access it, and can do so while bypassing the CPU). Since virtual memory paging is often important to CPU memory performance, and pinning memory subtracts memory from the pool of pageable memory, one should be cautious in how much pinned memory is allocated.
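As an illustration of direct kernel access to pinned memory, here is a minimal sketch. It makes some assumptions that are not part of this lab's exercise code: it uses `cudaHostAlloc()` with the `cudaHostAllocMapped` flag (rather than `cudaMallocHost()`) to request the device mapping explicitly, and the kernel name `increment` and the `CHECK` macro are hypothetical helpers.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            fprintf(stderr, "CUDA error %s at %s:%d\n",            \
                    cudaGetErrorString(err_), __FILE__, __LINE__); \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// Each thread increments one element that lives in pinned host memory.
__global__ void increment(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    const int n = 1 << 16;
    int *h_data = nullptr;

    // Allocate pinned host memory and ask for it to be mapped into the
    // device address space.
    CHECK(cudaHostAlloc((void **)&h_data, n * sizeof(int), cudaHostAllocMapped));
    for (int i = 0; i < n; i++) h_data[i] = i;

    // Obtain the device-side pointer that refers to the same host memory.
    int *d_alias = nullptr;
    CHECK(cudaHostGetDevicePointer((void **)&d_alias, h_data, 0));

    // No cudaMemcpy: the kernel reads and writes host memory directly.
    increment<<<(n + 255) / 256, 256>>>(d_alias, n);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    int mismatches = 0;
    for (int i = 0; i < n; i++) {
        if (h_data[i] != i + 1) mismatches++;
    }
    printf("zero-copy mismatches: %d\n", mismatches);

    CHECK(cudaFreeHost(h_data));
    return mismatches != 0;
}
```

Every access in the kernel crosses the host-device interconnect, so this pattern tends to suit data that is touched only once; for data that is reused, an explicit copy to device memory is usually the better choice.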

If pageable memory has already been allocated on the CPU, it can be pinned through CUDA APIs[<sup>1</sup>](#footnote1):

```
cudaHostRegister(ptr, length, cudaHostRegisterDefault);
cudaHostUnregister(ptr);
```

Notably, the asynchronous memcopy operation that we are about to discuss, `cudaMemcpyAsync()`, is only fully asynchronous when writing to or reading from a pinned memory buffer. If the destination or source buffer is normal host pageable memory, the CUDA driver cannot complete the operation in a fully asynchronous manner, since it depends on the CPU handling the pageable memory.
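The following is a minimal sketch of pinning an existing allocation and then using it with `cudaMemcpyAsync()`. The buffer names and the `CHECK` macro are hypothetical, and the sketch assumes a platform on which `cudaHostRegister()` is supported.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            fprintf(stderr, "CUDA error %s at %s:%d\n",            \
                    cudaGetErrorString(err_), __FILE__, __LINE__); \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

int main() {
    const size_t n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Ordinary pageable allocations, as an existing code base might have.
    float *h_src = (float *)malloc(bytes);
    float *h_dst = (float *)malloc(bytes);
    if (!h_src || !h_dst) return EXIT_FAILURE;
    for (size_t i = 0; i < n; i++) {
        h_src[i] = 2.0f;
        h_dst[i] = 0.0f;
    }

    float *d_buf = nullptr;
    CHECK(cudaMalloc((void **)&d_buf, bytes));

    // Pin the existing buffers in place (no reallocation, no extra copy).
    CHECK(cudaHostRegister(h_src, bytes, cudaHostRegisterDefault));
    CHECK(cudaHostRegister(h_dst, bytes, cudaHostRegisterDefault));

    // With pinned buffers, these calls may return before the transfers
    // have finished. Both go to the default stream (the trailing 0),
    // so they still execute in order on the GPU.
    CHECK(cudaMemcpyAsync(d_buf, h_src, bytes, cudaMemcpyHostToDevice, 0));
    CHECK(cudaMemcpyAsync(h_dst, d_buf, bytes, cudaMemcpyDeviceToHost, 0));

    // Wait for both copies before touching h_dst on the host.
    CHECK(cudaDeviceSynchronize());

    size_t mismatches = 0;
    for (size_t i = 0; i < n; i++) {
        if (h_dst[i] != 2.0f) mismatches++;
    }
    printf("round-trip mismatches: %zu\n", mismatches);

    // Unpin before handing the memory back to the C allocator.
    CHECK(cudaHostUnregister(h_src));
    CHECK(cudaHostUnregister(h_dst));
    CHECK(cudaFree(d_buf));
    free(h_src);
    free(h_dst);
    return mismatches != 0;
}
```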

## Exercise

Let's play around with pinned memory. Our claim is that for both synchronous and asynchronous copies, it should be faster to copy pinned memory to the device. (There is a tradeoff of course, in the sense that it takes longer to allocate pinned memory than to allocate pageable memory.) Let's verify that this is true. [exercises/pinned.cu](exercises/pinned.cu) currently makes a host to device and then device to host copy of some data, where the host data uses pageable memory. Profile the code to understand how long each of these operations took. Then convert the pageable allocations to pinned allocations (check [solutions/pinned.cu](solutions/pinned.cu) for answers) and re-collect the profile to see whether the memcopy operations got faster. If so, by how much? Does the speedup compensate for the additional time we spent pinning the memory? If not, how many memcopies would we have to do to amortize that cost out? Do these conclusions depend on the value of `N`?

In [None]:
!nvcc -arch=native -o pinned exercises/pinned.cu
!nsys profile --stats=true pinned

## CUDA Streams

When we launch a kernel with

```
kernel<<<blocks, threads>>>();
```

we have already learned that the kernel launches asynchronously with respect to the host, and we need to use (say) `cudaDeviceSynchronize()` to wait until the kernel is completed. Certain other CUDA APIs such as `cudaMemcpy()` are implicitly synchronous with respect to these kernel launches.

CUDA has the concept of *streams*, which are in-order work queues. Items of work (usually kernels or memcopies) submitted to a stream are executed in order, and if two items are submitted to the stream asynchronously, the second one cannot begin execution until the first one completes. Streams also obey the additional rule that two items of work submitted to *different* streams have no ordering prescribed by CUDA.

### Default Stream

CUDA provides a *default stream* to which asynchronous work is issued in the case of asynchronous APIs which do not require a stream. For example, if we do

```
kernel1<<<blocks, threads>>>();
kernel2<<<blocks, threads>>>();
```

both kernels execute in the default stream, and `kernel2` does not start until `kernel1` completes. 

We can intersperse asynchronous memcopies in this sequence of operations, for example:

```
cudaMemcpyAsync(d_ptr1, h_ptr1, length, cudaMemcpyHostToDevice);
kernel1<<<blocks, threads>>>(d_ptr1);
cudaMemcpyAsync(h_ptr1, d_ptr1, length, cudaMemcpyDeviceToHost);

cudaMemcpyAsync(d_ptr2, h_ptr2, length, cudaMemcpyHostToDevice);
kernel2<<<blocks, threads>>>(d_ptr2);
cudaMemcpyAsync(h_ptr2, d_ptr2, length, cudaMemcpyDeviceToHost);

cudaDeviceSynchronize();
```

and these six operations will occur in sequence on the GPU. The default stream has the special rule that it is synchronous with respect to all CUDA work in any stream.

### CUDA APIs and Streams

Many CUDA APIs (including some that we have already seen) explicitly accept a CUDA stream as an optional argument. For example, the extended version of the triple-chevron launch syntax is

```
kernel<<<blocks, threads, smem, stream>>>();
```

where `smem` is the amount of dynamically allocated shared memory to use, and `stream` is the handle for the CUDA stream you want to use.

Similarly, memcopy operations have this as well:

```
cudaMemcpyAsync(dest, src, length, direction, stream);
cudaMemPrefetchAsync(ptr, length, device, stream);
```

The default stream is denoted with the special argument of `0`. That is,

```
kernel<<<blocks, threads, 0, 0>>>();
```

submits the kernel to the default stream.

### Non-Default Streams

CUDA permits you to create streams of your own for asynchronous execution.

```
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

...

cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);
```

After the streams have been created, they can be used in the CUDA APIs:

```
cudaMemcpyAsync(d_ptr1, h_ptr1, length, cudaMemcpyHostToDevice, stream1);
kernel<<<blocks, threads, 0, stream2>>>(d_ptr2);
```

In this case, the asynchronous memcopy may be executed before, after, or concurrently with the kernel.

If you need to synchronize with respect to the work in a given non-default stream, you can use:

```
cudaStreamSynchronize(stream);
```
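Putting these pieces together, a complete program might look like the sketch below. It is only an illustration under stated assumptions: the kernel `fill`, the buffer names, and the `CHECK` macro are hypothetical, and the two operations are deliberately independent so that CUDA is free to overlap them.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            fprintf(stderr, "CUDA error %s at %s:%d\n",            \
                    cudaGetErrorString(err_), __FILE__, __LINE__); \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// Set every element of data to value.
__global__ void fill(int *data, int n, int value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = value;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(int);

    int *h_a = nullptr, *d_a = nullptr, *d_b = nullptr;
    CHECK(cudaMallocHost((void **)&h_a, bytes));  // pinned, so the copy can be async
    CHECK(cudaMalloc((void **)&d_a, bytes));
    CHECK(cudaMalloc((void **)&d_b, bytes));
    for (int i = 0; i < n; i++) h_a[i] = 7;

    cudaStream_t stream1, stream2;
    CHECK(cudaStreamCreate(&stream1));
    CHECK(cudaStreamCreate(&stream2));

    // Independent work in two streams: CUDA prescribes no ordering
    // between the copy and the kernel.
    CHECK(cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, stream1));
    fill<<<(n + 255) / 256, 256, 0, stream2>>>(d_b, n, 1);
    CHECK(cudaGetLastError());

    // Wait for each stream individually.
    CHECK(cudaStreamSynchronize(stream1));
    CHECK(cudaStreamSynchronize(stream2));

    // Verify both results with blocking copies.
    int *h_check = (int *)malloc(bytes);
    if (!h_check) return EXIT_FAILURE;
    int mismatches = 0;
    CHECK(cudaMemcpy(h_check, d_a, bytes, cudaMemcpyDeviceToHost));
    for (int i = 0; i < n; i++) if (h_check[i] != 7) mismatches++;
    CHECK(cudaMemcpy(h_check, d_b, bytes, cudaMemcpyDeviceToHost));
    for (int i = 0; i < n; i++) if (h_check[i] != 1) mismatches++;
    printf("two-stream mismatches: %d\n", mismatches);

    free(h_check);
    CHECK(cudaStreamDestroy(stream1));
    CHECK(cudaStreamDestroy(stream2));
    CHECK(cudaFree(d_a));
    CHECK(cudaFree(d_b));
    CHECK(cudaFreeHost(h_a));
    return mismatches != 0;
}
```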

See below for some examples of possible execution scenarios. In this figure, `K` denotes kernel while `M` denotes memcopy, and the subsequent integer indicates which of two streams that operation has been submitted to (the order in the text corresponds to the order the operations are launched in).

![](images/stream_examples.png)

## Exercise

[exercises/user_stream.cu](exercises/user_stream.cu) is the same code we just ended with above. It should correctly set an array to the value `1`. Refactor this code by launching all the work in a non-default stream, so that the `cudaMemcpy` calls become `cudaMemcpyAsync`, and the kernel launch passes the stream as its fourth argument. Use `cudaStreamSynchronize()` instead of `cudaDeviceSynchronize()` to ensure all work is completed. Verify that the code still works correctly. Look at [solutions/user_stream.cu](solutions/user_stream.cu) if you need a hint.

In [None]:
!nvcc -arch=native -o user_stream exercises/user_stream.cu
!./user_stream

### Vector Processing Example

Suppose we're processing vector data, for example adding one vector to another or doing an axpy operation. When using the default stream, we get:

```
cudaMemcpy(d_x, h_x, size_x, cudaMemcpyHostToDevice);
kernel<<<blocks, threads>>>(d_x, d_y, N);
cudaMemcpy(h_y, d_y, size_y, cudaMemcpyDeviceToHost);
```

![](images/vector_processing_serial.png)

Now imagine that we have an array `cudaStream_t streams[ns]` with `ns` streams, and that the sizes of the arrays are `size_x` and `size_y` respectively. We can decompose the problem into `c` chunks, assigned to the streams round-robin, and then achieve overlap between memcopy and compute, with the kernel for a given stream executing simultaneously with the memcopy on the next stream.

```
for (int i = 0; i < c; i++) {
    size_t offx = (size_x / c) * i;
    size_t offy = (size_y / c) * i;
    cudaMemcpyAsync(d_x + offx, h_x + offx, size_x / c, cudaMemcpyHostToDevice, streams[i % ns]);
    kernel<<<blocks / c, threads, 0, streams[i % ns]>>>(d_x + offx, d_y + offy, N / c);
    cudaMemcpyAsync(h_y + offy, d_y + offy, size_y / c, cudaMemcpyDeviceToHost, streams[i % ns]);
}
```

Here is the two-stream example:

![](images/vector_processing_concurrent.png)
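For reference, here is one way the chunked loop could be fleshed out into a complete program. This is a sketch under several assumptions that do not come from the exercise code: the kernel `add_one` is a hypothetical stand-in for the real computation, `N` is evenly divisible by `c`, offsets are counted in elements while copy sizes are in bytes, and the host buffers are pinned so that the copies can actually overlap with the kernels.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            fprintf(stderr, "CUDA error %s at %s:%d\n",            \
                    cudaGetErrorString(err_), __FILE__, __LINE__); \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// Hypothetical element-wise kernel: y[i] = x[i] + 1.
__global__ void add_one(const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = x[i] + 1.0f;
}

int main() {
    const int N = 1 << 22;       // total elements
    const int c = 8;             // number of chunks (N is divisible by c)
    const int ns = 2;            // number of streams
    const int chunk = N / c;     // elements per chunk
    const size_t chunk_bytes = chunk * sizeof(float);
    const int threads = 256;

    float *h_x = nullptr, *h_y = nullptr, *d_x = nullptr, *d_y = nullptr;
    CHECK(cudaMallocHost((void **)&h_x, N * sizeof(float)));
    CHECK(cudaMallocHost((void **)&h_y, N * sizeof(float)));
    CHECK(cudaMalloc((void **)&d_x, N * sizeof(float)));
    CHECK(cudaMalloc((void **)&d_y, N * sizeof(float)));
    for (int i = 0; i < N; i++) h_x[i] = (float)i;

    cudaStream_t streams[ns];
    for (int s = 0; s < ns; s++) CHECK(cudaStreamCreate(&streams[s]));

    for (int i = 0; i < c; i++) {
        size_t off = (size_t)chunk * i;   // offset in elements, not bytes
        cudaStream_t s = streams[i % ns]; // round-robin stream assignment
        CHECK(cudaMemcpyAsync(d_x + off, h_x + off, chunk_bytes,
                              cudaMemcpyHostToDevice, s));
        add_one<<<(chunk + threads - 1) / threads, threads, 0, s>>>(
            d_x + off, d_y + off, chunk);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(h_y + off, d_y + off, chunk_bytes,
                              cudaMemcpyDeviceToHost, s));
    }

    // Wait for every stream before reading h_y on the host.
    for (int s = 0; s < ns; s++) CHECK(cudaStreamSynchronize(streams[s]));

    int mismatches = 0;
    for (int i = 0; i < N; i++) {
        if (h_y[i] != h_x[i] + 1.0f) mismatches++;
    }
    printf("chunked pipeline mismatches: %d\n", mismatches);

    for (int s = 0; s < ns; s++) CHECK(cudaStreamDestroy(streams[s]));
    CHECK(cudaFree(d_x));
    CHECK(cudaFree(d_y));
    CHECK(cudaFreeHost(h_x));
    CHECK(cudaFreeHost(h_y));
    return mismatches != 0;
}
```

Within each stream the copy-in, kernel, and copy-out for a chunk stay in order, so correctness does not depend on how much overlap the hardware actually achieves.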

The above workflow is also possible with managed memory:

```
for (int i = 0; i < c; i++) {
    size_t offx = (size_x / c) * i;
    size_t offy = (size_y / c) * i;
    cudaMemPrefetchAsync(x + offx, size_x / c, 0, streams[i % ns]);
    kernel<<<blocks / c, threads, 0, streams[i % ns]>>>(x + offx, y + offy, N / c);
    cudaMemPrefetchAsync(y + offy, size_y / c, cudaCpuDeviceId, streams[i % ns]);
}
```

Stream semantics guarantee that the prefetching of the data completes before the kernel begins execution.

A caveat with managed memory is that the API call itself is often much higher latency than `cudaMemcpyAsync()` (since [the operation requires updating CPU and GPU page tables](https://developer.nvidia.com/blog/maximizing-unified-memory-performance-cuda/)). Depending on the length of the copies and the kernels, this can sometimes appear as a latency bubble that disrupts the achievement of fully asynchronous work.

## Interactions between Non-Default Streams and the Default Stream

We said above that two operations in different streams are asynchronous with respect to each other. An exception is the default stream, which is fully synchronous with respect to operations in other streams:

![](images/default_stream.png)

This means that an operation launched on the default stream waits until all previously launched kernels (on any stream) are complete, and any kernel launched afterwards on any stream does not start until that default-stream operation is complete. (It is possible to create non-default streams in such a way that they do not block with respect to the default stream; see [cudaStreamCreateWithFlags()](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html#group__CUDART__STREAM_1gb1e32aff9f59119e4d0a9858991c4ad3). It is also possible to make the default stream a regular stream that doesn't have this special synchronization behavior, using the nvcc flag [--default-stream per-thread](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#default-stream).)

For this reason, the default stream should generally be avoided when you are trying to closely manage scenarios for concurrency.
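If some default-stream work cannot be avoided, a non-blocking stream is one option. The sketch below shows how such a stream might be created with `cudaStreamCreateWithFlags()` and the `cudaStreamNonBlocking` flag; the kernel `fill`, the buffer names, and the `CHECK` macro are hypothetical. Note that a non-blocking stream must be synchronized explicitly, because default-stream operations no longer wait for it.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            fprintf(stderr, "CUDA error %s at %s:%d\n",            \
                    cudaGetErrorString(err_), __FILE__, __LINE__); \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// Set every element of data to value.
__global__ void fill(int *data, int n, int value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = value;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(int);
    const int grid = (n + 255) / 256;

    int *d_a = nullptr, *d_b = nullptr;
    CHECK(cudaMalloc((void **)&d_a, bytes));
    CHECK(cudaMalloc((void **)&d_b, bytes));

    // A stream that does not synchronize implicitly with the default stream.
    cudaStream_t nb_stream;
    CHECK(cudaStreamCreateWithFlags(&nb_stream, cudaStreamNonBlocking));

    unsigned int flags = 0;
    CHECK(cudaStreamGetFlags(nb_stream, &flags));
    printf("non-blocking flag set: %s\n",
           (flags & cudaStreamNonBlocking) ? "yes" : "no");

    // The default-stream kernel and the nb_stream kernel are unordered
    // with respect to each other and may run concurrently.
    fill<<<grid, 256>>>(d_a, n, 1);
    fill<<<grid, 256, 0, nb_stream>>>(d_b, n, 2);
    CHECK(cudaGetLastError());

    // The blocking cudaMemcpy calls below run in the default stream and
    // would NOT wait for nb_stream, so synchronize it explicitly first.
    CHECK(cudaStreamSynchronize(nb_stream));

    int *h_check = (int *)malloc(bytes);
    if (!h_check) return EXIT_FAILURE;
    int mismatches = 0;
    CHECK(cudaMemcpy(h_check, d_a, bytes, cudaMemcpyDeviceToHost));
    for (int i = 0; i < n; i++) if (h_check[i] != 1) mismatches++;
    CHECK(cudaMemcpy(h_check, d_b, bytes, cudaMemcpyDeviceToHost));
    for (int i = 0; i < n; i++) if (h_check[i] != 2) mismatches++;
    printf("non-blocking stream mismatches: %d\n", mismatches);

    free(h_check);
    CHECK(cudaStreamDestroy(nb_stream));
    CHECK(cudaFree(d_a));
    CHECK(cudaFree(d_b));
    return mismatches != 0;
}
```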

## Exercise

The code in [exercises/overlap.cu](exercises/overlap.cu) performs a simple element-wise calculation on a vector. Compile, run, and profile the code as a first step.

In [None]:
!nvcc -arch=native -o overlap exercises/overlap.cu
!nsys profile -f true -o overlap --stats=true overlap

If you open the report (`overlap.qdrep`, or `overlap.nsys-rep` with newer versions of Nsight Systems) in the Nsight Systems UI, you should see a sequential set of operations: copy the data from host to device, run the kernel, then copy the data from device to host.

The code also has a section to do the same operation with streams, but it is currently ifdef'ed out (because it is not completed). Complete that code section, dealing with the `FIXME` locations, then compile with that section enabled (`-DUSE_STREAMS`) and run again. (Check [the solution](solutions/overlap.cu) if you need help.) Does the version with streams complete faster, as we hoped? What can we say based on the stdout profiling report?

In [None]:
!nvcc -arch=native -o overlap_with_streams -DUSE_STREAMS exercises/overlap.cu
!nsys profile -f true -o overlap_with_streams --stats=true overlap_with_streams

Again, inspect the report (overlap_with_streams.qdrep) in the Nsight Systems UI and see if you can visually observe kernels overlapping with memory copies.

## Host Callbacks

CUDA streams are most commonly used for managing asynchronous memcopies and kernels, but it is also possible to insert asynchronous host operations (callbacks) onto a stream. This is useful for performing some work that depends on the outcome of GPU kernels. These callbacks obey stream semantics: they do not begin execution until previous operations on the stream complete. A worker thread is spawned by CUDA to perform the host callback. The API is:

```
// Can pass data to the function
cudaLaunchHostFunc(stream, function, data);
```

A limitation of callbacks is that they may not themselves call into the CUDA API.
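As a minimal sketch of how a host callback might be used, the program below queues a kernel, a device-to-host copy, and then a host function that checks the copied data. The struct `CallbackData`, the function `check_result`, the kernel `set_one`, and the `CHECK` macro are all hypothetical names chosen for this illustration. The callback only touches host memory and makes no CUDA API calls.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            fprintf(stderr, "CUDA error %s at %s:%d\n",            \
                    cudaGetErrorString(err_), __FILE__, __LINE__); \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// Hypothetical payload handed to the callback through the void* argument.
struct CallbackData {
    const int *result;  // pinned host buffer filled by the D2H copy
    int n;              // number of elements
    int errors;         // written by the callback
};

// Host function run by CUDA once earlier work in the stream has finished.
// It must not call into the CUDA API.
void CUDART_CB check_result(void *user_data) {
    CallbackData *d = static_cast<CallbackData *>(user_data);
    d->errors = 0;
    for (int i = 0; i < d->n; i++) {
        if (d->result[i] != 1) d->errors++;
    }
}

// Set every element of data to 1.
__global__ void set_one(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = 1;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(int);

    int *h_out = nullptr, *d_out = nullptr;
    CHECK(cudaMallocHost((void **)&h_out, bytes));
    CHECK(cudaMalloc((void **)&d_out, bytes));

    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    CallbackData data = {h_out, n, -1};

    // Kernel, copy, and callback execute in this order within the stream.
    set_one<<<(n + 255) / 256, 256, 0, stream>>>(d_out, n);
    CHECK(cudaGetLastError());
    CHECK(cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, stream));
    CHECK(cudaLaunchHostFunc(stream, check_result, &data));

    // Synchronizing the stream also waits for the host function.
    CHECK(cudaStreamSynchronize(stream));
    printf("callback found errors: %d\n", data.errors);

    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(d_out));
    CHECK(cudaFreeHost(h_out));
    return data.errors != 0;
}
```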

## Review

In this lab we learned:

- What CUDA streams are and how to use them
- The difference between the default stream and user-created (non-default) streams
- How to write asynchronous workflows that can overlap compute and memcopies

## Further Study

[NVIDIA Developer Blog: Concurrency with Unified Memory](https://devblogs.nvidia.com/maximizing-unified-memory-performance-cuda/)

[CUDA Programming Guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#asynchronous-concurrent-execution)

CUDA Samples: concurrentKernels, simpleStreams, asyncAPI, simpleCallbacks

## Lab Materials

You can download this notebook using the `File > Download as > Notebook (.ipynb)` menu item. Source code files can be downloaded from the `File > Download` menu item after opening them.

## Footnotes

<span id="footnote1">1</span>: The Linux kernel can also page-lock memory with [mlock](https://man7.org/linux/man-pages/man2/mlock.2.html). However, this cannot be used as a replacement for CUDA pinned memory because the pinned pages also need to be registered with the CUDA driver.