Description
In the following repro, with warmup in both block1 and block2, the latency of adding two CUDA tensors (x + y) is 0.0013 ms. However, if we comment out either block1 or block2, the latency drops to 0.0011 ms. This is unexpected, since there is a torch.cuda.synchronize() after both block1 and block2.
Repro:
import torch

x = torch.randn(1024, device="cuda", requires_grad=False)
y = torch.randn(1024, device="cuda", requires_grad=False)

def fn():
    return x + y

# warmup
############################# <---- Block1: comment out this block to reduce latency from 0.0013 ms to 0.0011 ms
for _ in range(100):
    fn()
############################
torch.cuda.synchronize()
with torch.cuda.stream(torch.cuda.Stream()):
    # warmup
    ############################# <---- Block2: comment out this block to reduce latency from 0.0013 ms to 0.0011 ms
    for _ in range(6):
        fn()
    ############################
    n_repeat = 2000
    torch.cuda.synchronize()
    # construct a cuda graph with `n_repeat` unrolled function calls to minimize
    # host overhead
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        for _ in range(n_repeat):
            fn()
    # measure time and return
    ret = []
    n_retries = 10
    for _ in range(n_retries):
        start_event = torch.cuda.Event(enable_timing=True)
        end_event = torch.cuda.Event(enable_timing=True)
        start_event.record()
        g.replay()
        end_event.record()
        torch.cuda.synchronize()
        measured_latency = start_event.elapsed_time(end_event) / n_repeat
        ret.append(measured_latency)
    compiler_latency = sum(ret) / len(ret)

print("compiler_latency:", compiler_latency)
cc @msaroufim @jerryzh168 @ptrblck @eqy @mcarilli @ezyang @eellison @penguinwu @galv