# Profiling CUDA in Torch

## TL;DR

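When writing a custom CUDA kernel and integrating it into torch code, we mainly care about two questions:
1. How to profile a CUDA kernel. (Profiling also helps us decide whether a custom kernel is worth writing at all: the goal is to hand-write kernels only where they are really needed, and to make them faster than the implementations generated by compiler techniques.)
2. How to use a custom CUDA kernel from torch code.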

For profiling, the main tools are:
- CUDA events
- torch.autograd.profiler
- torch.profiler
- ncu

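For using a custom CUDA kernel in torch, there are two relatively simple approaches:
- Implement the kernel in Triton. It can then be called directly, since a Triton kernel is just a Python function decorated with @triton.jit.
- Use load_inline from torch.utils.cpp_extension.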

## Profiling CUDA Kernel in torch

In [1]:
import torch

### CUDA Event

In [2]:
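def time_pytorch_function(f, x):
    """Return the GPU time of a single call f(x), in milliseconds."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    # Warm up so that one-time costs are not included in the measurement
    for _ in range(10):
        f(x)

    start.record()
    f(x)
    end.record()
    # Kernel launches are asynchronous; wait until the `end` event has completed
    torch.cuda.synchronize()

    return start.elapsed_time(end)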

In [3]:
t = torch.randn((10000, 10000)).cuda()
time_pytorch_function(torch.square, t) # in milliseconds

1.1735039949417114
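
A single timed call can be noisy. As a minimal sketch, the same CUDA-event pattern can be averaged over several calls. The helper name `benchmark` and its `warmup`/`iters` parameters are hypothetical (not part of the original notebook), and the CPU branch exists only so the sketch also runs on a machine without a GPU.

```python
import time

import torch


def benchmark(f, x, warmup=10, iters=20):
    """Return the mean time of f(x) in milliseconds (hypothetical helper)."""
    for _ in range(warmup):
        f(x)

    if x.is_cuda:
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            f(x)
        end.record()
        # Wait for the recorded events before reading the elapsed time
        torch.cuda.synchronize()
        return start.elapsed_time(end) / iters

    # CPU fallback: wall-clock timing, converted to milliseconds
    t0 = time.perf_counter()
    for _ in range(iters):
        f(x)
    return (time.perf_counter() - t0) * 1e3 / iters


device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1024, 1024, device=device)
mean_ms = benchmark(torch.square, x)
print(f"torch.square: {mean_ms:.4f} ms per call on {device}")
```

Averaging inside one pair of events amortizes launch overhead across calls, so the result is closer to steady-state throughput than to single-call latency.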

### torch.autograd.profiler

In [4]:
with torch.autograd.profiler.profile(use_device='cuda') as prof:
    torch.square(t)
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
# The table shows that most of the CUDA time is spent in aten::pow

-------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                     Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
             aten::square         1.29%      15.811us         7.39%      90.492us      90.492us      14.000us         1.12%       1.253ms       1.253ms             1  
                aten::pow         4.33%      52.971us         5.71%      69.973us      69.973us       1.233ms        98.40%       1.239ms       1.239ms             1  
        aten::result_type         0.12%       1.422us         0.12%       1.422us       1.422us       4.000us         0.32%       4.000us       4.000us        
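
The table above suggests that `torch.square` dispatches to `aten::pow`. A small sketch for checking which ATen ops a function calls, assuming a hypothetical helper `profiled_ops` (not in the original notebook). It uses the profiler with default arguments, so it records CPU-side events only and also runs without a GPU.

```python
import torch


def profiled_ops(fn, x):
    """Return {op name: call count} for one call of fn(x) (hypothetical helper)."""
    with torch.autograd.profiler.profile() as prof:
        fn(x)
    return {evt.key: evt.count for evt in prof.key_averages()}


x = torch.randn(256, 256)
ops_square = profiled_ops(torch.square, x)
ops_mul = profiled_ops(lambda a: a * a, x)

print("torch.square ->", sorted(ops_square))
print("a * a        ->", sorted(ops_mul))
```

Comparing the two op lists is a quick way to see that an explicit `a * a` goes through `aten::mul` rather than `aten::pow`, which is useful context before deciding whether a custom kernel is needed.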

### torch.profiler

In [5]:
# ## Default way to use the profiler
# from torch.profiler import profile, ProfilerActivity
# with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
#     for _ in range(10):
#         a = torch.square(torch.randn(10000, 10000).cuda())

# prof.export_chrome_trace("trace.json")

## With warmup and skip
# https://pytorch.org/docs/stable/profiler.html

# Non-default profiler schedule allows user to turn profiler on and off
# on different iterations of the training loop;
# trace_handler is called every time a new trace becomes available
def trace_handler(prof):
    print(prof.key_averages().table(
        sort_by="self_cuda_time_total", row_limit=-1))
    prof.export_chrome_trace("/tmp/test_trace_" + str(prof.step_num) + ".json")

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],

    # In this example with wait=1, warmup=1, active=2, repeat=1,
    # profiler will skip the first step/iteration,
    # start warming up on the second, record
    # the third and the fourth iterations,
    # after which the trace will become available
    # and on_trace_ready (when set) is called;
    # the cycle repeats starting with the next step

    schedule=torch.profiler.schedule(
        wait=1,
        warmup=1,
        active=2,
        repeat=1),
    on_trace_ready=trace_handler
    # on_trace_ready=torch.profiler.tensorboard_trace_handler('./log')
    # used when outputting for tensorboard
    ) as p:
        for _ in range(10):
            torch.square(torch.randn(10000, 10000).cuda())
            # send a signal to the profiler that the next iteration has started
            p.step()

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                          ProfilerStep*         0.00%       0.000us         0.00%       0.000us       0.000us     103.404ms       234.41%     103.404ms      51.702ms             2  
                                            aten::copy_         0.01%      60.255us         3.66%      42.133ms      21.067ms      41.776ms        94.70%      41.776ms      20.888ms             2  
         

### ncu: NVIDIA Nsight Compute


In [None]:
# Run from a shell: collect the full metric set for test.py and write the report to output.ncu-rep
# ncu --set full -o output $(which python) test.py
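
The command above profiles a script named `test.py`, which the notebook does not show. A minimal sketch of what such a script might contain, assuming it simply runs the same `torch.square` workload; the `main` function and the tensor size are illustrative choices, and the CPU fallback is only there so the script can be smoke-tested without a GPU.

```python
# test.py (hypothetical contents, not part of the original notebook)
import torch


def main():
    device = "cuda" if torch.cuda.is_available() else "cpu"
    x = torch.randn(4096, 4096, device=device)

    # A few warmup calls, then the call of interest
    for _ in range(3):
        torch.square(x)
    y = torch.square(x)

    if device == "cuda":
        # Make sure all kernels have finished before the process exits
        torch.cuda.synchronize()
    return y


if __name__ == "__main__":
    main()
```

Keeping the script small helps because ncu replays each kernel several times to collect metrics, so profiling runs much slower than normal execution.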

## Integrating Custom Kernels

### Triton Kernel
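
As a minimal sketch of the Triton approach, the elementwise square used throughout this notebook could be written as below. The names `square_kernel` and `triton_square` and the block size of 1024 are assumptions made for illustration. The wrapper falls back to `torch.square` when Triton or a GPU is unavailable, so the snippet still runs on a CPU-only machine.

```python
import torch

try:
    import triton
    import triton.language as tl

    HAS_TRITON = torch.cuda.is_available()
except ImportError:
    HAS_TRITON = False

if HAS_TRITON:

    @triton.jit
    def square_kernel(x_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        # Each program instance handles one contiguous block of elements
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements  # guard the last, partially filled block
        x = tl.load(x_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x * x, mask=mask)


def triton_square(x):
    """Elementwise square via Triton, with a torch fallback (hypothetical wrapper)."""
    if not (HAS_TRITON and x.is_cuda):
        return torch.square(x)
    x = x.contiguous()
    out = torch.empty_like(x)
    n = x.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    square_kernel[grid](x, out, n, BLOCK_SIZE=1024)
    return out


device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(10_000, device=device)
print(torch.allclose(triton_square(x), x * x))
```

The wrapper can be passed to any of the profiling tools above in place of `torch.square` to compare the two implementations.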

### Load inline
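
A minimal sketch of the load_inline approach, again for an elementwise square. The kernel, the extension name `square_ext`, and the wrapper `inline_square` are assumptions made for illustration. Compilation needs a CUDA toolkit and ninja, so the wrapper falls back to `torch.square` when those (or a GPU) are missing.

```python
import torch
from torch.utils.cpp_extension import CUDA_HOME, is_ninja_available, load_inline

# CUDA source: the kernel plus a host function that launches it
CUDA_SRC = r"""
__global__ void square_kernel(const float* x, float* out, int64_t n) {
    int64_t i = blockIdx.x * (int64_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = x[i] * x[i];
}

torch::Tensor square(torch::Tensor x) {
    TORCH_CHECK(x.is_cuda() && x.is_contiguous() && x.scalar_type() == at::kFloat,
                "expected a contiguous float32 CUDA tensor");
    auto out = torch::empty_like(x);
    const int64_t n = x.numel();
    const int threads = 256;
    const int blocks = (int)((n + threads - 1) / threads);
    square_kernel<<<blocks, threads>>>(x.data_ptr<float>(), out.data_ptr<float>(), n);
    return out;
}
"""

# C++ source: only the declaration, so that a Python binding can be generated
CPP_SRC = "torch::Tensor square(torch::Tensor x);"

_ext = None  # compiled lazily on first use


def inline_square(x):
    """Elementwise square via load_inline, with a torch fallback (hypothetical wrapper)."""
    global _ext
    can_build = x.is_cuda and CUDA_HOME is not None and is_ninja_available()
    if not can_build or x.dtype != torch.float32 or x.numel() == 0:
        return torch.square(x)
    if _ext is None:
        _ext = load_inline(
            name="square_ext",
            cpp_sources=CPP_SRC,
            cuda_sources=CUDA_SRC,
            functions=["square"],
            with_cuda=True,
        )
    return _ext.square(x.contiguous())


device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(10_000, device=device)
print(torch.allclose(inline_square(x), x * x))
```

The first call pays the compilation cost, so it should be treated as warmup and excluded from any timing.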