## Summary
CUDA Graphs (proprietary) provide a mechanism to launch multiple GPU operations through a single CPU operation,
which reduces the launch overhead between CUDA kernels. Usage has two phases:
1. Capture: put a CUDA stream in capture mode and "build" a static graph (static shapes and static control flow) on the first run.
2. Replay: execute the captured graph on subsequent runs.

The reduction in launch overhead is more pronounced when the same sequence of operations is repeated many times.
Kernels in a replay also execute slightly faster on the GPU, but eliding CPU overhead is the main benefit.
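
To make the launch-overhead argument concrete, here is a minimal sketch that runs a chain of tiny elementwise kernels both eagerly and as a single graph replay. It assumes PyTorch with a CUDA-capable GPU. The helper names `chain` and `compare_eager_and_replay`, the tensor size, and the iteration counts are illustrative choices, not part of the original write-up. The warmup on a side stream before capture follows the pattern in the PyTorch CUDA notes [2].

```python
import time

import torch


def chain(x, steps=50):
    # Many tiny elementwise kernels, so CPU launch overhead dominates.
    for _ in range(steps):
        x = x * 1.0001 + 0.5
    return x


def compare_eager_and_replay(iters=100):
    """Return (eager_seconds, replay_seconds, outputs_match), or None without CUDA."""
    if not torch.cuda.is_available():
        return None  # nothing to measure without a CUDA device

    static_x = torch.ones(1024, device="cuda")

    # Warm up on a side stream before capture.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            chain(static_x)
    torch.cuda.current_stream().wait_stream(s)

    # Capture: every kernel launched by chain() is recorded into the graph.
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_out = chain(static_x)

    # Eager: one CPU-side launch per kernel, per iteration.
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        eager_out = chain(static_x)
    torch.cuda.synchronize()
    eager_s = time.perf_counter() - t0

    # Replay: one CPU-side launch per iteration for the whole chain.
    t0 = time.perf_counter()
    for _ in range(iters):
        g.replay()
    torch.cuda.synchronize()
    replay_s = time.perf_counter() - t0

    return eager_s, replay_s, torch.allclose(eager_out, static_out)


if __name__ == "__main__":
    print(compare_eager_and_replay())
```

The absolute timings depend on the GPU, driver, and PyTorch version, so treat the numbers as a rough indication rather than a benchmark.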

A CUDA graph can be captured if the captured region doesn't violate any of the following constraints:
* Capture must occur on a non-default stream.
* Ops that synchronize the CPU with the GPU (e.g., `.item()` calls) are prohibited.
* Dynamic control flow and dynamic shapes are prohibited inside the captured region (although code that needs them can be interleaved with graphed sections created by `torch.cuda.make_graphed_callables` [1]).
* CUDA RNG operations are permitted, but they may require extra bookkeeping [2].
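
The interleaving mentioned in the constraints above can be sketched as follows, loosely modeled on the partial-capture example in [1] and [2]. Each module is graphed separately with `torch.cuda.make_graphed_callables`, while the data-dependent branch (including its `.item()` sync) runs eagerly between graphed sections. The layer sizes, the function name `train_with_partial_graphs`, and the random data are assumptions made for illustration, and the sketch needs a CUDA-capable GPU.

```python
import torch


def train_with_partial_graphs(steps=4):
    """Run a few training steps with graphed modules; return losses, or None without CUDA."""
    if not torch.cuda.is_available():
        return None

    N, D_in, H, D_out = 64, 32, 16, 8  # arbitrary illustrative sizes
    module1 = torch.nn.Linear(D_in, H).cuda()
    module2 = torch.nn.Linear(H, D_out).cuda()
    module3 = torch.nn.Linear(H, D_out).cuda()
    loss_fn = torch.nn.MSELoss()
    params = [p for m in (module1, module2, module3) for p in m.parameters()]
    optimizer = torch.optim.SGD(params, lr=0.1)

    # Sample inputs used for capture. Their requires_grad state must match
    # what each callable sees at run time: h stands in for module1's output.
    x = torch.randn(N, D_in, device="cuda")
    h = torch.randn(N, H, device="cuda", requires_grad=True)

    module1 = torch.cuda.make_graphed_callables(module1, (x,))
    module2 = torch.cuda.make_graphed_callables(module2, (h,))
    module3 = torch.cuda.make_graphed_callables(module3, (h,))

    losses = []
    for _ in range(steps):
        data = torch.randn(N, D_in, device="cuda")
        target = torch.randn(N, D_out, device="cuda")
        optimizer.zero_grad(set_to_none=True)

        tmp = module1(data)  # forward runs as a graph

        # Dynamic control flow and a CPU-GPU sync are fine here because
        # this code runs eagerly, outside any captured region.
        if tmp.sum().item() > 0:
            tmp = module2(tmp)  # forward runs as a graph
        else:
            tmp = module3(tmp)  # forward runs as a graph

        loss = loss_fn(tmp, target)
        loss.backward()  # backward of each graphed module runs as a graph
        optimizer.step()
        losses.append(loss.item())
    return losses


if __name__ == "__main__":
    print(train_with_partial_graphs())
```

The trade-off is that only the graphed sections benefit from reduced launch overhead. The loss computation and the optimizer step still launch kernels eagerly in this sketch.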

## Example code [1]
```python
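import torch

# Assumes model, loss_fn, optimizer, N, D_in, D_out, real_inputs, and
# real_targets are already defined, with the model and data on the GPU.
# Placeholders used for capture
static_input = torch.randn(N, D_in, device='cuda')
static_target = torch.randn(N, D_out, device='cuda')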

# capture
g = torch.cuda.CUDAGraph()
# Sets grads to None before capture, so backward() will create
# .grad attributes with allocations from the graph's private pool
optimizer.zero_grad(set_to_none=True)
with torch.cuda.graph(g):
    static_y_pred = model(static_input)
    static_loss = loss_fn(static_y_pred, static_target)
    static_loss.backward()
    optimizer.step()

# replay
for data, target in zip(real_inputs, real_targets):
    static_input.copy_(data)
    static_target.copy_(target)
    # replay() includes forward, backward, and step.
    # You don't even need to call optimizer.zero_grad() between iterations
    # because the captured backward refills static .grad tensors in place.
    g.replay()
    # Params have been updated. static_y_pred, static_loss, and .grad
    # attributes hold values from computing on this iteration's data.
```
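
For readers who want to run the example end to end, here is a self-contained sketch of the same capture-and-replay flow. It adds the pieces the snippet above leaves undefined: a small hypothetical model, an SGD optimizer, random stand-in data, and the warmup iterations on a side stream that the PyTorch CUDA notes [2] run before capture. The sizes and the function name `run_whole_network_capture` are assumptions for illustration, and the sketch needs a CUDA-capable GPU.

```python
import torch


def run_whole_network_capture(num_batches=5):
    """Capture one training step, replay it per batch; return losses, or None without CUDA."""
    if not torch.cuda.is_available():
        return None

    N, D_in, H, D_out = 64, 32, 16, 8  # arbitrary illustrative sizes
    model = torch.nn.Sequential(
        torch.nn.Linear(D_in, H),
        torch.nn.ReLU(),
        torch.nn.Linear(H, D_out),
    ).cuda()
    loss_fn = torch.nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    # Placeholders used for capture; replays read from these same tensors.
    static_input = torch.randn(N, D_in, device="cuda")
    static_target = torch.randn(N, D_out, device="cuda")

    # Warm up on a side stream before capture.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            optimizer.zero_grad(set_to_none=True)
            loss_fn(model(static_input), static_target).backward()
            optimizer.step()
    torch.cuda.current_stream().wait_stream(s)

    # Capture forward, backward, and optimizer step as one graph.
    g = torch.cuda.CUDAGraph()
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.graph(g):
        static_loss = loss_fn(model(static_input), static_target)
        static_loss.backward()
        optimizer.step()

    losses = []
    for _ in range(num_batches):
        # Copy fresh data into the static tensors, then replay the graph.
        static_input.copy_(torch.randn(N, D_in, device="cuda"))
        static_target.copy_(torch.randn(N, D_out, device="cuda"))
        g.replay()
        # Reading the loss happens outside capture, so the sync is allowed.
        losses.append(static_loss.item())
    return losses


if __name__ == "__main__":
    print(run_whole_network_capture())
```

Plain SGD is used here to keep the sketch simple. Other optimizers may need extra configuration before their `step()` can be captured.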

## Reference
[1] https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs

[2] https://docs.pytorch.org/docs/main/notes/cuda.html#constraints