## CUPTI Counter / FLOPs Analysis

### About

In this demo we use the PyTorch profiler to capture performance characteristics of CUDA kernels. See the Instructions section below for how to collect counters with the PyTorch profiler.

### Motivation and context

Performance counters measured on GPU kernels can provide insight into how to speed up those kernels, and they support roofline analysis and other low-level optimizations. Profiling tools like Nsight Compute can do this interactively, but they do not work well for remote applications, jobs running on a cluster, and similar setups.

The PyTorch profiler has an alternative, lightweight API that uses the [CUPTI Range Profiler API](https://docs.nvidia.com/cupti/r_main.html#r_profiler) to program and measure detailed performance counters on the device. The underlying mechanism is similar to what Nsight Compute uses, but this solution is easier to deploy: for example, the application does not have to be launched under Nsight Compute. It also supports the same list of [performance metrics](https://docs.nvidia.com/cupti/r_main.html#r_profiler) as Nsight Compute. Please see this [PR](https://github.com/pytorch/pytorch/pull/94689) for more details.

Performance measurements are emitted to the trace either per kernel or for the entire performance profiling region.
When the CUPTI Profiler mode is enabled, the PyTorch trace contains the performance measurement values as annotations on the GPU kernel events:
* The events are emitted under a `cuda_profiler_range` category.
* The counter values are stored in the `args` JSON field of each event.

The CPU operators continue to be emitted as usual.
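To make the event layout concrete, here is a minimal sketch of pulling the counter values out of a trace. Only the `cuda_profiler_range` category and the placement of counters under `args` come from the description above; the sample trace, the other event fields (`ph`, `ts`, `dur`), the kernel name, and the helper `counter_events` are hypothetical and follow the usual Chrome trace layout.

```python
import json

# Hypothetical, hand-written excerpt of a trace. Real traces contain many
# more events and fields; the counter values here are made up.
SAMPLE_TRACE = {
    "traceEvents": [
        {"ph": "X", "cat": "cpu_op", "name": "aten::relu_",
         "ts": 100, "dur": 5, "args": {}},
        {"ph": "X", "cat": "cuda_profiler_range", "name": "example_kernel",
         "ts": 110, "dur": 20,
         "args": {"dram__bytes_read.sum": 4096, "dram__bytes_write.sum": 2048}},
    ]
}


def counter_events(trace):
    """Return one dict per counter event: the kernel name plus its counters."""
    return [
        {"name": event.get("name"), **event.get("args", {})}
        for event in trace.get("traceEvents", [])
        if event.get("cat") == "cuda_profiler_range"
    ]


# Round-trip through JSON to mimic reading the trace from a file.
rows = counter_events(json.loads(json.dumps(SAMPLE_TRACE)))
for row in rows:
    print(row)
```

In practice the analysis cells further down do this filtering for you; the sketch is only meant to show where the values live in the raw file.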

### Instructions

#### Collecting the trace with CUPTI Profiler Counters
To collect performance metrics, pass the list of metrics through the experimental config option of the PyTorch profiler. Please see this [PR](https://github.com/pytorch/pytorch/pull/94689) for more details.
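```
import torch

# trace_handler, train_batch and modeldef are defined by the surrounding application.
with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CUDA, torch.profiler.ProfilerActivity.CPU],
    record_shapes=True,
    on_trace_ready=trace_handler,
    experimental_config=torch.profiler._ExperimentalConfig(
        # Counters to collect through the CUPTI Range Profiler.
        profiler_metrics=[
            "kineto__tensor_core_insts",
            "dram__bytes_read.sum",
            "dram__bytes_write.sum",
        ],
        # Emit measurements per kernel instead of for the whole profiling region.
        profiler_measure_per_kernel=True,
    ),
) as prof:
    res = train_batch(modeldef)
    prof.step()
```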
The trace in this example was collected using the param benchmarks, run as follows:
```
# Inside dir "param/train/compute"
> python -m python.pytorch.run_benchmark -c python/examples/pytorch/configs/alex_net.json -p -i 1 -d cuda --cupti-profiler --cupti-profiler-measure-per-kernel
```
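The `trace_handler` callback passed to `on_trace_ready` in the profiler snippet above is not defined in this document. A minimal sketch, assuming you only want to write each trace into a directory that can later be passed as `trace_dir`, could look like the following. The factory `make_trace_handler` and the file naming scheme are hypothetical; `step_num` and `export_chrome_trace` are attributes of the `torch.profiler.profile` object that the callback receives.

```python
import os


def make_trace_handler(out_dir):
    """Build an `on_trace_ready` callback that saves traces under `out_dir`."""
    os.makedirs(out_dir, exist_ok=True)

    def trace_handler(prof):
        # One file per profiler step, so repeated steps do not overwrite each other.
        path = os.path.join(out_dir, f"trace_step{prof.step_num}.json")
        prof.export_chrome_trace(path)
        return path

    return trace_handler
```

With this in place, `on_trace_ready=make_trace_handler("./traces")` would replace `on_trace_ready=trace_handler` in the profiler call.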

#### Trace Analysis

To run this demo notebook on your laptop:
1. Clone the repo: `git clone https://github.com/fairinternal/TraceAnalyzer.git`
1. [Optional and recommended] Set up a venv or conda environment. See the README for details.
1. Set the `trace_dir` parameter in the next cell to the folder containing your collected PyTorch profiler traces.


In [1]:
```
from hta.trace_analysis import TraceAnalysis
from hta.analyzers.counters_analysis import CountersAnalysis, CUDA_SASS_INSTRUCTION_COUNTER_FLOPS
trace_prefix = "~/Work/hta/HolisticTraceAnalysis/"
trace_dir = f"{trace_prefix}/tests/data/cupti_profiler/"
analyzer = TraceAnalysis(trace_dir=trace_dir)
```

```
2023-03-31 11:11:16,106 - hta - trace.py:L374 - INFO - /Users/bcoutinho/Work/hta/HolisticTraceAnalysis/tests/data/cupti_profiler
2023-03-31 11:11:16,125 - hta - trace_file.py:L98 - INFO - Rank to trace file map:
{0: '/Users/bcoutinho/Work/hta/HolisticTraceAnalysis/tests/data/cupti_profiler/benchmark_result_924719_1679055439_trace.json.gz'}
2023-03-31 11:11:16,126 - hta - trace.py:L501 - INFO - ranks=[0]
2023-03-31 11:11:16,184 - hta - trace.py:L113 - INFO - Parsed /Users/bcoutinho/Work/hta/HolisticTraceAnalysis/tests/data/cupti_profiler/benchmark_result_924719_1679055439_trace.json.gz time = 0.01 seconds mem = 3.25 MB
```




In [2]:
```
gpu_kernels = CountersAnalysis.get_counter_data_with_operators(analyzer.t, ranks=[0])[0]
```

In [3]:
```
gpu_kernels.head()
```

| | index | cat | name | pid | tid | ts | dur | `smsp__sass_thread_inst_executed_op_hadd_pred_on.sum` | `smsp__sass_thread_inst_executed_op_hfma_pred_on.sum` | `smsp__sass_thread_inst_executed_op_dadd_pred_on.sum` | ... | `smsp__sass_thread_inst_executed_op_fadd_pred_on.sum` | `smsp__sass_thread_inst_executed_op_dmul_pred_on.sum` | `smsp__sass_thread_inst_executed_op_hmul_pred_on.sum` | index_correlation | iteration | depth | index_runtime | op_stack | top_level_op | bottom_level_op |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4107 | `cuda_profiler_range` | `void at::native::(anonymous namespace)::distri...` | 0 | 0 | 2498229 | 121431 | 0 | 4866048 | 0 | ... | 0 | 0 | 0 | -1 | -1 | 0 | 473 | `[cudaLaunchKernel, aten::uniform_, aten::rand]` | `aten::rand` | `aten::uniform_` |
| 1 | 4108 | `cuda_profiler_range` | `__missing__` | 0 | 0 | 2619660 | 121431 | 0 | 0 | 0 | ... | 0 | 0 | 0 | -1 | -1 | 0 | 3697 | `[cudaLaunchKernel, cudaFuncSetAttribute, cudaF...` | `aten::conv2d` | `aten::convolution` |
| 2 | 4109 | `cuda_profiler_range` | `void at::native::elementwise_kernel<128, 2, at...` | 0 | 0 | 2741091 | 121431 | 0 | 49561600 | 0 | ... | 0 | 0 | 0 | -1 | -1 | 0 | 3699 | `[cudaLaunchKernel, aten::add_, cudaFuncSetAttr...` | `aten::conv2d` | `aten::add_` |
| 3 | 4110 | `cuda_profiler_range` | `void at::native::vectorized_elementwise_kernel...` | 0 | 0 | 2862522 | 121431 | 0 | 0 | 0 | ... | 0 | 0 | 0 | -1 | -1 | 0 | 3701 | `[cudaLaunchKernel, aten::clamp_min_, aten::rel...` | `aten::relu_` | `aten::clamp_min_` |
| 4 | 4111 | `cuda_profiler_range` | `void at::native::(anonymous namespace)::max_po...` | 0 | 0 | 2983953 | 121431 | 0 | 5971968 | 0 | ... | 0 | 0 | 0 | -1 | -1 | 0 | 3711 | `[cudaLaunchKernel, aten::max_pool2d_with_indic...` | `aten::max_pool2d` | `aten::max_pool2d_with_indices` |


In [4]:
```
gpu_kernels["flops"] = 0
for counter, flops in CUDA_SASS_INSTRUCTION_COUNTER_FLOPS.items():
    gpu_kernels["flops"] += gpu_kernels[counter] * flops
```
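The loop above is a weighted sum: each instruction counter is multiplied by the number of floating point operations one such instruction performs, and the products are added up per kernel. The sketch below repeats that arithmetic on a plain dictionary so it can be followed without a trace. The weights (1 for add and multiply, 2 for fused multiply-add) and the counter values are illustrative assumptions, not the contents of `CUDA_SASS_INSTRUCTION_COUNTER_FLOPS`; `EXAMPLE_COUNTER_FLOPS` and `kernel_flops` are hypothetical names.

```python
# Hypothetical weights: FLOPs contributed by one executed instruction of each kind.
# The real mapping lives in CUDA_SASS_INSTRUCTION_COUNTER_FLOPS and may differ.
EXAMPLE_COUNTER_FLOPS = {
    "smsp__sass_thread_inst_executed_op_hadd_pred_on.sum": 1,
    "smsp__sass_thread_inst_executed_op_hmul_pred_on.sum": 1,
    "smsp__sass_thread_inst_executed_op_hfma_pred_on.sum": 2,
}


def kernel_flops(counters, weights=EXAMPLE_COUNTER_FLOPS):
    """Weighted sum of instruction counters; missing counters count as zero."""
    return sum(counters.get(name, 0) * weight for name, weight in weights.items())


# Made-up counter values for a single kernel.
example_kernel = {
    "smsp__sass_thread_inst_executed_op_hadd_pred_on.sum": 1000,
    "smsp__sass_thread_inst_executed_op_hmul_pred_on.sum": 500,
    "smsp__sass_thread_inst_executed_op_hfma_pred_on.sum": 2000,
}
print(kernel_flops(example_kernel))
```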

In [5]:
```
gpu_kernels[["name", "bottom_level_op", "top_level_op", "flops"]]
```

| | name | bottom_level_op | top_level_op | flops |
|---|---|---|---|---|
| 0 | `void at::native::(anonymous namespace)::distri...` | `aten::uniform_` | `aten::rand` | 87195648 |
| 1 | `__missing__` | `aten::convolution` | `aten::conv2d` | 18263449600 |
| 2 | `void at::native::elementwise_kernel<128, 2, at...` | `aten::add_` | `aten::conv2d` | 148684800 |
| 3 | `void at::native::vectorized_elementwise_kernel...` | `aten::clamp_min_` | `aten::relu_` | 0 |
| 4 | `void at::native::(anonymous namespace)::max_po...` | `aten::max_pool2d_with_indices` | `aten::max_pool2d` | 11943936 |
| ... | ... | ... | ... | ... |
| 72 | `ampere_sgemm_32x32_sliced1x4_tn` | `aten::linear` | `aten::linear` | 4298637312 |
| 73 | `void epilogue::impl::globalKernel<float, float...` | `aten::linear` | `aten::linear` | 524288 |
| 74 | `void at::native::vectorized_elementwise_kernel...` | `aten::clamp_min_` | `aten::relu_` | 0 |
| 75 | `ampere_sgemm_32x32_sliced1x4_tn` | `aten::addmm` | `aten::linear` | 1114112000 |
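A natural next step is to roll the per-kernel FLOPs up to the top-level operator. The sketch below does that in plain Python on the nine rows visible in the output above (rows 0 to 4 and 72 to 75), so its totals cover only those rows and not the whole trace. `VISIBLE_ROWS` and `flops_by_top_level_op` are hypothetical names introduced for illustration.

```python
from collections import defaultdict

# (top_level_op, flops) pairs copied from the rows displayed above.
VISIBLE_ROWS = [
    ("aten::rand", 87195648),
    ("aten::conv2d", 18263449600),
    ("aten::conv2d", 148684800),
    ("aten::relu_", 0),
    ("aten::max_pool2d", 11943936),
    ("aten::linear", 4298637312),
    ("aten::linear", 524288),
    ("aten::relu_", 0),
    ("aten::linear", 1114112000),
]


def flops_by_top_level_op(rows):
    """Sum FLOPs per top-level operator, largest total first."""
    totals = defaultdict(int)
    for op, flops in rows:
        totals[op] += flops
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))


totals = flops_by_top_level_op(VISIBLE_ROWS)
for op, flops in totals.items():
    print(f"{op}: {flops}")
```

On the full DataFrame the same aggregation can be written with pandas as `gpu_kernels.groupby("top_level_op")["flops"].sum().sort_values(ascending=False)`.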
