# Week 1: CUDA operations in PyTorch. Introduction to benchmarking.
In this seminar, we'll learn a bit more about the things one needs to keep in mind when using GPU computations, both in general and in PyTorch. We'll also see a couple of examples of benchmarking code in Python; here, too, CUDA comes with some caveats.

First, let's import PyTorch and get some information about the currently used device:

In [1]:
import torch

torch.cuda.is_available()

True

In [2]:
torch.cuda.get_device_properties(0)

_CudaDeviceProperties(name='Tesla T4', major=7, minor=5, total_memory=15102MB, multi_processor_count=40)

## Memory allocation
As discussed in the lecture, GPU memory is separate from the CPU memory and needs to be explicitly allocated, triggering a host-device synchronization. PyTorch uses a caching memory allocator to repurpose already available but unused memory fragments.

See this example: we allocate the memory inside the function scope, so the tensor is deleted as soon as the scope is left:

In [3]:
def allocate_empty_tensor(dim_size):
    a = torch.empty(4096, dim_size, dtype=torch.float32, device="cuda")

In [4]:
allocate_empty_tensor(2048)

Printing the allocated memory size gives us an expected number of zero bytes:

In [5]:
torch.cuda.memory_allocated(0)

0

However, the GPU memory is still held by the process: the caching allocator keeps it reserved, as shown by `torch.cuda.memory_reserved` or by `nvidia-smi` in the terminal. As a result, in a shared GPU environment a process can hold a lot of memory that it does not actually use.

In [6]:
torch.cuda.memory_reserved()

33554432

In [7]:
!nvidia-smi

Tue Oct 22 14:49:51 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   53C    P0              25W /  70W |    135MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

Let's clear the cache now and see what happens:

In [8]:
torch.cuda.empty_cache()
torch.cuda.memory_reserved()

0

Note that this operation triggers a CPU-GPU synchronization, and thus using it in your code can significantly hurt the performance. It's almost always better to carefully manage the lifetime of your GPU tensors and avoid excessive allocations than to empty the cache.

Now, let's see how this cache is reused by allocating two tensors in a row: first a larger one, then a smaller one.

In [9]:
allocate_empty_tensor(2048)
torch.cuda.memory_reserved()

33554432

In [12]:
allocate_empty_tensor(1024)
torch.cuda.memory_reserved() / (1<<20)

32.0

As expected, we reuse the cache, since the array fits into the allocated chunk.

However, if we attempt to do this with a larger tensor (3072 elements in the second dimension instead of 2048), we observe something different:

In [16]:
allocate_empty_tensor(3072)
torch.cuda.memory_reserved()/ (1<<20)

80.0

What happened? The chunk of memory that was allocated for a 4096x2048 array did not fit a tensor of size 4096x3072: thus, PyTorch needed to allocate an additional segment of a sufficient size while keeping the previous one allocated (in case the user creates a smaller tensor later at some point).
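The numbers above can be checked with simple arithmetic. The sketch below uses a hypothetical helper, `tensor_mib`, and assumes 4 bytes per `float32` element; it only reproduces the sizes of the two tensors and does not model the allocator itself.

```python
def tensor_mib(rows, cols, bytes_per_element=4):
    """Size of a dense rows x cols tensor in MiB (float32 by default)."""
    return rows * cols * bytes_per_element / (1 << 20)


first_segment = tensor_mib(4096, 2048)   # the tensor from the earlier cells
second_segment = tensor_mib(4096, 3072)  # the larger tensor that did not fit

print(first_segment)                   # 32.0
print(second_segment)                  # 48.0
print(first_segment + second_segment)  # 80.0
```

The sum matches the 80 MiB reported by `torch.cuda.memory_reserved()`: the old 32 MiB segment is kept, and a new 48 MiB segment is added next to it.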

In practice, this means that if your code is dealing with tensors of dynamic size (changing batch sizes, sequence lengths or image resolutions), it is recommended to first warm up the cache by allocating tensors for the largest expected input. Otherwise, in the worst case you allocate a quadratic amount of memory with respect to the largest input size instead of a linear one.
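A minimal sketch of such a warm-up is shown below. The helper `warm_up_allocator` and the shape `(4096, 3072)` are illustrative assumptions, not part of the seminar code; the snippet falls back to the CPU when no GPU is available, in which case the caching behaviour is not exercised.

```python
import torch


def warm_up_allocator(max_shape, dtype=torch.float32, device="cuda"):
    """Allocate and immediately free a tensor of the largest expected shape."""
    largest = torch.empty(*max_shape, dtype=dtype, device=device)
    nbytes = largest.numel() * largest.element_size()
    del largest  # on CUDA, the block goes back to PyTorch's cache, not to the driver
    return nbytes


device = "cuda" if torch.cuda.is_available() else "cpu"
warmed_bytes = warm_up_allocator((4096, 3072), device=device)
print(warmed_bytes / (1 << 20))  # size of the warm-up tensor in MiB: 48.0

if device == "cuda":
    reserved_before = torch.cuda.memory_reserved()
    # Smaller (or equal) tensors are expected to be served from the cached block.
    for dim in (1024, 2048, 3072):
        torch.empty(4096, dim, device=device)
    print(torch.cuda.memory_reserved() == reserved_before)
```

In a real training loop, the warm-up would typically be a full forward and backward pass on the largest expected batch rather than a single `torch.empty` call.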

You can also view more detailed allocation statistics by running `torch.cuda.memory_stats()`:

In [17]:
memory_stats = torch.cuda.memory_stats()
print(memory_stats["active.all.allocated"])
print(memory_stats["active.all.current"])
print(memory_stats["active.all.peak"])
print(memory_stats["reserved_bytes.all.current"])

9
0
1
83886080


In [18]:
torch.cuda.empty_cache()
print(torch.cuda.memory_stats()["reserved_bytes.all.current"])

0


## Benchmarking intro
The simplest way to check the performance impact of any change is to compare the runtime of code with and without it. Here, we will consider a simple way of doing this in Python.

First, let's define two functions that compute a batched version of a dot product for two matrices:

In [19]:
def batched_dot_mul_sum(a, b):
    """Computes batched dot by multiplying and summing"""
    return a.mul(b).sum(-1)


def batched_dot_bmm(a, b):
    """Computes batched dot by reducing to bmm"""
    a = a.reshape(-1, 1, a.shape[-1])
    b = b.reshape(-1, b.shape[-1], 1)
    return torch.bmm(a, b).flatten(-3)


# Input for benchmarking
x = torch.randn(10000, 64)

# Ensure that both functions compute the same output
assert batched_dot_mul_sum(x, x).allclose(batched_dot_bmm(x, x))

To conduct microbenchmarks by running the code hundreds of times and measuring the average execution time, you can use the built-in [timeit](https://docs.python.org/3/library/timeit.html) module. Simply create a Timer object with relevant arguments and call `.timeit()`:

In [21]:
import timeit

t0 = timeit.Timer(
    stmt="batched_dot_mul_sum(x, x)",
    setup="from __main__ import batched_dot_mul_sum",
    globals={"x": x},
)

t1 = timeit.Timer(
    stmt="batched_dot_bmm(x, x)",
    setup="from __main__ import batched_dot_bmm",
    globals={"x": x},
)

print(f"mul_sum(x, x):  {t0.timeit(100) / 100 * 1e6:>5.1f} us")
print(f"bmm(x, x):      {t1.timeit(100) / 100 * 1e6:>5.1f} us")

mul_sum(x, x):  491.0 us
bmm(x, x):      940.7 us


In IPython, there exist line and cell timeit [magics](https://ipython.readthedocs.io/en/stable/interactive/magics.html#cell-magics). They can be more convenient for smaller cases but allow you a bit less control over the setup.

In [22]:
%timeit batched_dot_mul_sum(x, x)

457 µs ± 57.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [None]:
%timeit batched_dot_bmm(x, x)

1.04 ms ± 47.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


[torch.utils.benchmark](https://pytorch.org/docs/stable/benchmark_utils.html) copies the API of timeit while helping the user avoid common mistakes.

First, let's run it on the same code and data:

In [23]:
import torch.utils.benchmark as benchmark

t0 = benchmark.Timer(
    stmt="batched_dot_mul_sum(x, x)",
    setup="from __main__ import batched_dot_mul_sum",
    globals={"x": x},
)

t1 = benchmark.Timer(
    stmt="batched_dot_bmm(x, x)",
    setup="from __main__ import batched_dot_bmm",
    globals={"x": x},
)

print(t0.timeit(100))
print(t1.timeit(100))

<torch.utils.benchmark.utils.common.Measurement object at 0x7f7213deb070>
batched_dot_mul_sum(x, x)
setup: from __main__ import batched_dot_mul_sum
  444.65 us
  1 measurement, 100 runs , 1 thread
<torch.utils.benchmark.utils.common.Measurement object at 0x7f7220fb15a0>
batched_dot_bmm(x, x)
setup: from __main__ import batched_dot_bmm
  900.82 us
  1 measurement, 100 runs , 1 thread
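Instead of fixing the number of runs by hand, `benchmark.Timer` also offers `blocked_autorange`, which keeps measuring until a minimum run time is reached and reports statistics over several blocks. The sketch below is a self-contained illustration under assumed settings: the input size, the inlined statements and `min_run_time=0.2` are arbitrary choices, not values from the seminar.

```python
import torch
import torch.utils.benchmark as benchmark

x = torch.randn(1000, 64)

# Statements to compare; both compute a row-wise dot product of x with itself.
statements = {
    "mul_sum": "x.mul(x).sum(-1)",
    "bmm": "torch.bmm(x.unsqueeze(1), x.unsqueeze(2)).flatten()",
}

medians = {}
for name, stmt in statements.items():
    timer = benchmark.Timer(stmt=stmt, globals={"x": x, "torch": torch})
    # Runs the statement in blocks for at least min_run_time seconds in total.
    measurement = timer.blocked_autorange(min_run_time=0.2)
    medians[name] = measurement.median

for name, median in medians.items():
    print(f"{name}: median {median * 1e6:.1f} us")
```

The median is usually a more robust summary than the mean when a few iterations are disturbed by other processes on the machine.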


In [24]:
# in addition, we can set the number of threads for CPU computations
num_threads = torch.get_num_threads()
print(f"Benchmarking on {num_threads} threads")

t0 = benchmark.Timer(
    stmt="batched_dot_mul_sum(x, x)",
    setup="from __main__ import batched_dot_mul_sum",
    globals={"x": x},
    num_threads=num_threads,
    label="Multithreaded batch dot",
    sub_label="Implemented using mul and sum",
)

t1 = benchmark.Timer(
    stmt="batched_dot_bmm(x, x)",
    setup="from __main__ import batched_dot_bmm",
    globals={"x": x},
    num_threads=num_threads,
    label="Multithreaded batch dot",
    sub_label="Implemented using bmm",
)

print(t0.timeit(100))
print(t1.timeit(100))

Benchmarking on 1 threads
<torch.utils.benchmark.utils.common.Measurement object at 0x7f72e2804d60>
Multithreaded batch dot: Implemented using mul and sum
setup: from __main__ import batched_dot_mul_sum
  429.26 us
  1 measurement, 100 runs , 1 thread
<torch.utils.benchmark.utils.common.Measurement object at 0x7f7220fb8880>
Multithreaded batch dot: Implemented using bmm
setup: from __main__ import batched_dot_bmm
  870.50 us
  1 measurement, 100 runs , 1 thread


In [25]:
# we can change it globally for PyTorch and measure the impact
prev_num_threads = num_threads
torch.set_num_threads(2)

num_threads = torch.get_num_threads()
print(f"Benchmarking on {num_threads} threads")

t0 = benchmark.Timer(
    stmt="batched_dot_mul_sum(x, x)",
    setup="from __main__ import batched_dot_mul_sum",
    globals={"x": x},
    num_threads=num_threads,
    label="Multithreaded batch dot",
    sub_label="Implemented using mul and sum",
)

t1 = benchmark.Timer(
    stmt="batched_dot_bmm(x, x)",
    setup="from __main__ import batched_dot_bmm",
    globals={"x": x},
    num_threads=num_threads,
    label="Multithreaded batch dot",
    sub_label="Implemented using bmm",
)

print(t0.timeit(100))
print(t1.timeit(100))
# in this case, mul_sum gets no speedup (likely due to the overhead), while bmm becomes noticeably faster

torch.set_num_threads(prev_num_threads)

Benchmarking on 2 threads
<torch.utils.benchmark.utils.common.Measurement object at 0x7f72e366ee00>
Multithreaded batch dot: Implemented using mul and sum
setup: from __main__ import batched_dot_mul_sum
  457.31 us
  1 measurement, 100 runs , 2 threads
<torch.utils.benchmark.utils.common.Measurement object at 0x7f72210ff670>
Multithreaded batch dot: Implemented using bmm
setup: from __main__ import batched_dot_bmm
  576.61 us
  1 measurement, 100 runs , 2 threads


In [26]:
# by the way, what CPU do we have?
!lscpu

Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          46 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   2
  On-line CPU(s) list:    0,1
Vendor ID:                GenuineIntel
  Model name:             Intel(R) Xeon(R) CPU @ 2.00GHz
    CPU family:           6
    Model:                85
    Thread(s) per core:   2
    Core(s) per socket:   1
    Socket(s):            1
    Stepping:             3
    BogoMIPS:             4000.29
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 cl
                          flush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc re
                          p_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3
                           fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand
                           hypervisor lahf_lm abm 3dnowprefetch i

## Benchmarking GPU code
As we discussed in the lecture, CUDA kernel execution and PyTorch GPU operations are asynchronous.
This means that it is possible to launch several kernels in a row before receiving results from the first one. As a consequence, naive benchmarking without synchronization is likely to give you unrealistic results.

Let's run the same example, but with slightly larger matrices on the GPU:

In [27]:
import timeit

x = torch.randn(10000, 1024, device="cuda")

t0 = timeit.Timer(
    stmt="batched_dot_mul_sum(x, x)",
    setup="from __main__ import batched_dot_mul_sum",
    globals={"x": x},
)

t1 = timeit.Timer(
    stmt="batched_dot_bmm(x, x)",
    setup="from __main__ import batched_dot_bmm",
    globals={"x": x},
)

# Ran each twice to show difference before/after warmup
print(f"mul_sum(x, x):  {t0.timeit(100) / 100 * 1e6:>5.1f} us")
print(f"mul_sum(x, x):  {t0.timeit(100) / 100 * 1e6:>5.1f} us")
print(f"bmm(x, x):      {t1.timeit(100) / 100 * 1e6:>5.1f} us")
print(f"bmm(x, x):      {t1.timeit(100) / 100 * 1e6:>5.1f} us")

mul_sum(x, x):  773.5 us
mul_sum(x, x):   28.4 us
bmm(x, x):      1565.1 us
bmm(x, x):       27.9 us


First, the first run of timeit is noticeably slower than the second one; this happens because of initializing the CUDA context and loading [cuBLAS](https://docs.nvidia.com/cuda/cublas/index.html), a CUDA library for accelerated linear algebra.

Second, the runtimes of the two methods seem too small: without synchronization, `timeit` mostly measures the time needed to launch the kernels rather than the time needed to execute them.

Let's run the same test with `torch.utils.benchmark` to see a different set of results.

In [28]:
t0 = benchmark.Timer(
    stmt="batched_dot_mul_sum(x, x)",
    setup="from __main__ import batched_dot_mul_sum",
    globals={"x": x},
)

t1 = benchmark.Timer(
    stmt="batched_dot_bmm(x, x)",
    setup="from __main__ import batched_dot_bmm",
    globals={"x": x},
)

# Run only once since benchmark module does warmup for us
print(t0.timeit(100))
print(t1.timeit(100))

<torch.utils.benchmark.utils.common.Measurement object at 0x7f7213dea8f0>
batched_dot_mul_sum(x, x)
setup: from __main__ import batched_dot_mul_sum
  502.46 us
  1 measurement, 100 runs , 1 thread
<torch.utils.benchmark.utils.common.Measurement object at 0x7f72210ff760>
batched_dot_bmm(x, x)
setup: from __main__ import batched_dot_bmm
  243.98 us
  1 measurement, 100 runs , 1 thread


Now we have a more realistic set of measurements, because `torch.utils.benchmark` explicitly triggers the CPU-GPU synchronization and waits for the results after launching the computation. The runtime is comparable to what we had in the previous set of CPU benchmarks, but note that the matrices are now 16 times bigger.

Let's implement microbenchmarking by ourselves using nothing but [`time.perf_counter()`](https://docs.python.org/3/library/time.html#time.perf_counter) and CUDA methods given by PyTorch. First, we'll run `batched_dot_mul_sum` on the GPU with and without synchronization and see how this affects the results:

In [29]:
from time import perf_counter

import numpy as np

execution_times = []

for _ in range(100):
    start_time = perf_counter()
    batched_dot_mul_sum(x, x)
    execution_times.append(perf_counter() - start_time)

np.mean(execution_times)

4.564924999840514e-05

Compare this with the result that does not compute anything:

In [None]:
execution_times = []

for _ in range(100):
    start = perf_counter()
    execution_times.append(perf_counter() - start)

np.mean(execution_times)

1.940499997488132e-07

Finally, let's explicitly call `torch.cuda.synchronize` at the end of each iteration to see the difference.

In [30]:
execution_times = []

for _ in range(100):
    start_time = perf_counter()
    batched_dot_mul_sum(x, x)
    torch.cuda.synchronize()
    execution_times.append(perf_counter() - start_time)

np.mean(execution_times)

0.0005340387799947166
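The pattern above can be wrapped into a small reusable helper. The sketch below is one possible version, not the seminar's own code: `time_fn` is a hypothetical name, the warm-up and iteration counts are arbitrary, and on the GPU it uses `torch.cuda.Event` timestamps instead of `perf_counter`, so that the time is measured on the device side. On a machine without CUDA it falls back to plain wall-clock timing.

```python
from time import perf_counter

import torch


def time_fn(fn, n_warmup=10, n_iters=100):
    """Return the mean time of one fn() call in seconds."""
    for _ in range(n_warmup):  # exclude one-off initialization costs
        fn()

    if torch.cuda.is_available():
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(n_iters):
            fn()
        end.record()
        torch.cuda.synchronize()  # wait until both events have been reached
        return start.elapsed_time(end) / 1e3 / n_iters  # elapsed_time is in ms

    start_time = perf_counter()
    for _ in range(n_iters):
        fn()
    return (perf_counter() - start_time) / n_iters


device = "cuda" if torch.cuda.is_available() else "cpu"
data = torch.randn(1000, 256, device=device)
mean_seconds = time_fn(lambda: data.mul(data).sum(-1))
print(f"{mean_seconds * 1e6:.1f} us per call")
```

Here the whole loop is timed between two events, which corresponds to synchronizing once after all iterations rather than after each one.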

The same thing applies to the actual models. Let's take `Linear` as the simplest example:

In [31]:
x = torch.randn(10000, 512, device="cuda")
linear = torch.nn.Linear(512, 1024, bias=False, device="cuda")
N_ITERS = 200

execution_times = []

for _ in range(N_ITERS):
    start = perf_counter()
    result = linear(x)
    execution_times.append(perf_counter() - start)

np.mean(execution_times)

0.0001960344900015798

In [32]:
execution_times = []

for _ in range(N_ITERS):
    start = perf_counter()
    result = linear(x)
    torch.cuda.synchronize()
    execution_times.append(perf_counter() - start)

np.mean(execution_times)

0.002771046424999213

In [33]:
execution_times = []

start = perf_counter()
for _ in range(N_ITERS):
    result = linear(x)
torch.cuda.synchronize()

(perf_counter() - start) / N_ITERS

0.002707342575000098

Of course, `torch.utils.benchmark` also works:

In [34]:
t0 = benchmark.Timer(stmt="linear(x)", globals={"x": x, "linear": linear})
print(t0.timeit(100))

<torch.utils.benchmark.utils.common.Measurement object at 0x7f7213d03ca0>
linear(x)
  3.00 ms
  1 measurement, 100 runs , 1 thread


## CUDA Streams
To execute several operations concurrently, you may use CUDA streams in PyTorch. Below, you can see an example of their usage.

In [35]:
cuda = torch.device("cuda")
s = torch.cuda.Stream()  # Create a new stream.
A = torch.empty((100, 100), device=cuda).normal_(0.0, 1.0)
with torch.cuda.stream(s):
    # sum() may start execution before normal_() finishes!
    B = torch.sum(A)
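The race mentioned in the comment above can be avoided by making the side stream wait for the stream that produced the input. The sketch below shows one way to do it with `Stream.wait_stream` and `Tensor.record_stream`; the helper name `sum_on_side_stream` is hypothetical, and the demo part only runs when a GPU is available.

```python
import torch


def sum_on_side_stream(a):
    """Sum a CUDA tensor on a separate stream with explicit ordering."""
    side = torch.cuda.Stream()
    current = torch.cuda.current_stream()

    side.wait_stream(current)  # do not start before `a` is fully written
    with torch.cuda.stream(side):
        total = torch.sum(a)
    a.record_stream(side)      # tell the allocator that `a` is used on `side`

    current.wait_stream(side)  # later work on the current stream sees `total`
    total.record_stream(current)
    return total


if torch.cuda.is_available():
    A = torch.empty((100, 100), device="cuda").normal_(0.0, 1.0)
    B = sum_on_side_stream(A)
    print(torch.allclose(B, A.sum()))
```

Both `wait_stream` calls are enqueued asynchronously: they order the work between the streams on the GPU without blocking the CPU.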

In [46]:
s1 = torch.cuda.Stream()
s2 = torch.cuda.Stream()
# Initialize cuda tensors here. E.g.:
A = torch.rand(1000, 1000, device="cuda")
B = torch.rand(1000, 1000, device="cuda")
# Wait for the above tensors to initialize.
torch.cuda.synchronize()

execution_times = []

for _ in range(1000):
    start = perf_counter()
    with torch.cuda.stream(s1):
        C = torch.mm(A, A)
    with torch.cuda.stream(s2):
        D = torch.mm(B, B)
    # Wait for C and D to be computed.
    torch.cuda.synchronize()
    execution_times.append(perf_counter() - start)

np.mean(execution_times)

0.001042307611999604

In [44]:
# next, let's compute C and D sequentially

execution_times = []

for _ in range(1000):
    start = perf_counter()
    C = torch.mm(A, A)
    D = torch.mm(B, B)
    # Wait for C and D to be computed.
    torch.cuda.synchronize()
    execution_times.append(perf_counter() - start)

np.mean(execution_times)
# the runtime is almost the same as with two streams (slightly higher in this run)

0.0010812267739981963

As you can see in this example, the usefulness of streams can be limited: concurrent kernels need to underutilize the GPU and yet take long enough to compute. It might still be a good idea if you are trying to compute some expression and fetch data for the next computation simultaneously (for example, when training examples are large or in case of offloading).
The example below demonstrates how to use streams for that purpose.

In [47]:
compute_stream = torch.cuda.Stream()
h2d_stream = torch.cuda.Stream()

A = torch.rand(10240, 10240, device="cuda")
B = [torch.rand(4096, 4096, device="cpu").pin_memory() for _ in range(100)]

In [48]:
def sequential_execution():
  torch.matmul(A, A)
  for matrix in B:
    matrix.to("cuda", non_blocking=True)

In [49]:
%%timeit
sequential_execution()
torch.cuda.synchronize()

1.03 s ± 7.08 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [50]:
def stream_execution():
  with torch.cuda.stream(compute_stream):
    torch.matmul(A, A)
  with torch.cuda.stream(h2d_stream):
    for matrix in B:
      matrix.to("cuda", non_blocking=True)

In [51]:
%%timeit
stream_execution()
torch.cuda.synchronize()

551 ms ± 88.5 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)


## Debugging asynchronous code
Finding sources of errors in GPU-reliant code in PyTorch can also be difficult. Let's see this on a standard example of an off-by-one error and incorrect index for the embedding layer:

In [52]:
%%writefile incorrect_index.py
import torch
import torch.nn as nn

embedding = nn.Embedding(1024,32).to('cuda')
# 1024 > 1023 (largest index in the created embedding layer)
input = torch.full((1,1),1024,dtype=torch.long, device='cuda')

# out-of-bounds access
embedding_for_index = embedding(input)

result = torch.sigmoid(embedding_for_index)
loss = result.sum()
print(loss.item())
print(loss)

Writing incorrect_index.py


If we attempt to run the code as is, we'll see the error only at the point where we trigger the CPU-GPU synchronization (`loss.item()`). This gives us no clues about what exactly caused the error:

In [53]:
!python incorrect_index.py

../aten/src/ATen/native/cuda/Indexing.cu:1231: indexSelectSmallIndex: block: [0,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1231: indexSelectSmallIndex: block: [0,0,0], thread: [1,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1231: indexSelectSmallIndex: block: [0,0,0], thread: [2,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1231: indexSelectSmallIndex: block: [0,0,0], thread: [3,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1231: indexSelectSmallIndex: block: [0,0,0], thread: [4,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1231: indexSelectSmallIndex: block: [0,0,0], thread: [5,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1231: indexSelectSmallIndex: block: [0,0,0], thread: [6,0,0

(though an experienced person might get a hint from the C++ failed assertions printed before the error)

There are two ways to find the actual line that caused an exception:

* First, you can just move everything to CPU. This is a valid approach, but it involves making changes to your code or input arguments and sometimes can be difficult (for example, if the error occurs after a long chain of operations).

* Second, you may use the `CUDA_LAUNCH_BLOCKING` environment variable when starting your code. This will force synchronization for all GPU-related operations, making your code slower but allowing you to see the exact source of the error.

In [54]:
!CUDA_LAUNCH_BLOCKING=1 python incorrect_index.py

../aten/src/ATen/native/cuda/Indexing.cu:1231: indexSelectSmallIndex: block: [0,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1231: indexSelectSmallIndex: block: [0,0,0], thread: [1,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1231: indexSelectSmallIndex: block: [0,0,0], thread: [2,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1231: indexSelectSmallIndex: block: [0,0,0], thread: [3,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1231: indexSelectSmallIndex: block: [0,0,0], thread: [4,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1231: indexSelectSmallIndex: block: [0,0,0], thread: [5,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1231: indexSelectSmallIndex: block: [0,0,0], thread: [6,0,0
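For this particular class of bugs, a third option while debugging is to validate the indices before the lookup, so that the failure is raised as a regular Python exception on the offending line. The sketch below assumes a hypothetical helper `checked_embedding`; it runs on the CPU here, and on the GPU the `min`/`max` checks would themselves force a synchronization, so it is meant for debugging rather than for production code.

```python
import torch
import torch.nn as nn


def checked_embedding(embedding, indices):
    """Look up embeddings after validating the index range."""
    low, high = int(indices.min()), int(indices.max())
    if low < 0 or high >= embedding.num_embeddings:
        raise IndexError(
            f"indices span [{low}, {high}], "
            f"but the valid range is [0, {embedding.num_embeddings - 1}]"
        )
    return embedding(indices)


embedding = nn.Embedding(1024, 32)
bad_indices = torch.full((1, 1), 1024, dtype=torch.long)

try:
    checked_embedding(embedding, bad_indices)
except IndexError as error:
    message = str(error)
    print(message)  # indices span [1024, 1024], but the valid range is [0, 1023]
```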

## Precision of floating point operations
Let's have a look at this code snippet and its results. In essence, it computes the sixth power of a random matrix (with a fixed seed) on two different devices.

In [55]:
torch.manual_seed(1337)
x = torch.randn(5000, 5000)
torch.use_deterministic_algorithms(False)


def matrix_power(x):
    y = x @ x @ x @ x @ x @ x
    return (y).sum().item()


print(matrix_power(x))
print(matrix_power(x.cuda()))

27654770130944.0
27654807879680.0


In [59]:
x.type()

'torch.FloatTensor'

(in case you are wondering, `torch.use_deterministic_algorithms(True)` won't help)

If we do the same using numpy (in two ways), we also get a different result:

In [60]:
print(matrix_power(x.numpy()))
np.linalg.matrix_power(x.numpy(), 6).sum()

27654749159424.0


27654760000000.0

Takeaway: numerical precision of floating point computations can vary between libraries, environments and devices, and from the user side, it is often hard to resolve this issue altogether. Usually, this happens due to a different summation order in code or due to inherent nondeterminism of hardware.

However, note that the relative error is small (on the order of 1e-6 between the CPU and GPU results above), which makes such discrepancies less of a problem in regular deep learning code.
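Both points can be illustrated with plain Python floats. The snippet below first shows that floating point addition is not associative, so a different summation order can change the result, and then computes the relative difference between the CPU and GPU values printed earlier (the two constants are copied from that output).

```python
# Floating point addition is not associative: the grouping changes the result.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
print(left_to_right == right_to_left)  # False

# Relative difference between the CPU and GPU results from the cell above.
cpu_result = 27654770130944.0
gpu_result = 27654807879680.0
relative_error = abs(cpu_result - gpu_result) / abs(cpu_result)
print(f"{relative_error:.1e}")  # 1.4e-06
```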

## CUDA Graphs

In some situations, the bottleneck of your code might be not the GPU compute performance and not even the memory bandwidth, but the latency of kernel execution. Launching each kernel takes CPU time, and if the kernels themselves run very fast, these launches can become a problem.

Fortunately, [CUDA graphs](https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/) can address this problem. A CUDA graph captures a sequence of specific operations (i.e., kernel launches) that can later be replayed on new inputs with essentially a single launch from the CPU. Below, you can see an example of using graphs with a function that consists of many fast-running operations:

In [61]:
def slow_function(x):
  for _ in range(500):
    y = x*2
    z = torch.sigmoid(y)
    a = z + 5.0
    b = torch.nn.functional.relu(a)
    x = b
  return b

# the code below is a modified version of https://pytorch.org/docs/master/notes/cuda.html#cuda-graph-semantics
g = torch.cuda.CUDAGraph()

# Placeholder input used for capture
static_input = torch.ones((5,), device="cuda")
print(slow_function(static_input))

tensor([6.0000, 6.0000, 6.0000, 6.0000, 6.0000], device='cuda:0')


Below, you can see two ways of running `slow_function`: directly (the standard way) or by capturing it into `func_graph`, copying new inputs into `static_input` and `replay()`ing the graph. As you can see, the second version can be almost 10x faster than the default eager execution.

In [62]:
%%timeit
slow_function(static_input)
torch.cuda.synchronize()

26.2 ms ± 3.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [63]:
func_graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(func_graph):
  static_output = slow_function(static_input)
static_output.zero_()

# Fills the graph's input memory with new data to compute on
static_input.copy_(torch.full((5,), 3, device="cuda"))
func_graph.replay()
print(static_output)

tensor([6.0000, 6.0000, 6.0000, 6.0000, 6.0000], device='cuda:0')


In [None]:
%%timeit
func_graph.replay()
torch.cuda.synchronize()

2.74 ms ± 2.76 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
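The capture-and-replay steps can be bundled into a helper. The sketch below follows the pattern from the PyTorch CUDA graphs notes, including a short warm-up on a side stream before capture; `capture_graph` and `small_ops` are hypothetical names, the warm-up count is arbitrary, and the demo part only runs when a GPU is available.

```python
import torch


def capture_graph(fn, static_input, n_warmup=3):
    """Warm up fn on a side stream, capture it, and return a replay function."""
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        for _ in range(n_warmup):
            fn(static_input)
    torch.cuda.current_stream().wait_stream(side)

    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_output = fn(static_input)

    def replay(new_input):
        static_input.copy_(new_input)  # the graph always reads this buffer
        graph.replay()
        return static_output           # overwritten by the next replay

    return replay


def small_ops(x):
    """Hypothetical stand-in for a chain of many tiny kernels."""
    for _ in range(50):
        x = torch.relu(torch.sigmoid(x * 2) + 5.0)
    return x


if torch.cuda.is_available():
    static_input = torch.ones(5, device="cuda")
    replay = capture_graph(small_ops, static_input)
    new_input = torch.full((5,), 3.0, device="cuda")
    print(torch.allclose(replay(new_input), small_ops(new_input)))
```

Note that the returned tensor is the graph's static output buffer: if the result is needed after another replay, it should be cloned first.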


# Bonus (2/10 points for the first home assignment block)
Solve exercises in the notebook from https://github.com/srush/GPU-Puzzles. You will earn the number of points corresponding to the fraction of completed exercises.