Profiler differentiate runs/traces feature request
Scenarios
Goals
Below are typical scenarios in which data scientists want to compare a run against a baseline after tweaking the model, to see whether the change has an obvious effect.
- Data scientists would like to compare model performance after changing a module’s dimensions, e.g. changing the nn.Linear dimension from 128x100 to 128000x100000.
- Data scientists would like to compare model performance between Apex fp32 and fp16.
- Data scientists would like to compare model performance between ResNet 50 and ResNet 152.
- Data scientists would like to compare model performance after changing a loss function, e.g. replacing a hand-written softmax/cross-entropy with a built-in PyTorch loss function.
- Data scientists would like to compare model performance after switching to optimized operators in PyTorch, e.g. when several operators are fused into a single optimized one.
- Data scientists have two different nn.Module implementations for training the same model and would like to see which one performs better during training.
Non-Goals
- Unusual cases such as comparing a ResNet model with a BERT model. In such cases, the plugin shows only the top-level diff view.
Design
The design covers the six major scenarios listed in the section above.
The plugin UI will have two modes: normal mode and diff (comparison) mode. In diff mode, the UI will look like the following, allowing the user to select the runs to compare.

After users select both baseline and experimental runs and click the diff button, the diff UI will be loaded.
Overview
The overview UI will show summary information such as device and device memory, GPU utilization, memory usage, and step time.
| Category | Device | Memory | GPU Utilization | Avg Memory Usage | Peak Memory Usage | Avg Step Time (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline | Tesla V100-DGXS-32GB | 31.74 GB | 30% | 1.50 GB | 5.00 GB | 9.3 |
| Experimental | Tesla V100-DGXS-32GB | 31.74 GB | 17% | 1.70 GB | 4.90 GB | 11.2 |
| Delta | N/A | N/A | -13% | 200.00 MB | -100.00 MB | 1.9 |
| Delta (%) | N/A | N/A | -43.33% | 13.33% | -2.00% | 20.43% |
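The Delta and Delta (%) rows are the experimental value minus the baseline value, and that difference relative to the baseline. A minimal sketch of the arithmetic follows; the function name `delta` is an illustrative assumption, not part of the plugin.

```python
def delta(baseline, experimental):
    """Return (absolute delta, relative delta in percent) for one metric."""
    diff = experimental - baseline
    # The relative change is undefined when the baseline is zero.
    pct = diff / baseline * 100 if baseline else None
    return diff, pct


# Values taken from the overview table above.
step_delta, step_pct = delta(9.3, 11.2)   # Avg Step Time (ms)
gpu_delta, gpu_pct = delta(30, 17)        # GPU Utilization (%)

print(f"Avg Step Time: {step_delta:.1f} ms ({step_pct:.2f}%)")
print(f"GPU Utilization: {gpu_delta} points ({gpu_pct:.2f}%)")
```

Returning `None` for a zero baseline is one possible convention; the UI could equally render such cells as N/A.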
Diff View
In diff view, we need to split each run into comparable pieces, with each piece aligned on a logical timeline. For example,

The two example runs in the diagrams above can be split this way. Parts that are missing from one of the runs are left out of the comparison.
Note: functional.relu is used for illustration only. It is possible that all functional calls will be missing.
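One way to picture the alignment is as a sequence diff over block names. The sketch below uses the standard library's `difflib.SequenceMatcher` for that; it is an assumption chosen for illustration, since the alignment algorithm has not been decided, and the block names and the `align_by_name` helper are hypothetical.

```python
from difflib import SequenceMatcher


def align_by_name(baseline, experimental):
    """Pair blocks with equal names in execution order.

    Returns (baseline_name, experimental_name) tuples. A block with no
    counterpart in the other run is paired with None and left unmatched.
    """
    matcher = SequenceMatcher(a=baseline, b=experimental, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            pairs.extend(zip(baseline[i1:i2], experimental[j1:j2]))
        else:
            # Blocks present in only one run are kept but not compared.
            pairs.extend((name, None) for name in baseline[i1:i2])
            pairs.extend((None, name) for name in experimental[j1:j2])
    return pairs


# Hypothetical block names; functional.relu appears in the baseline only.
baseline_run = ["dataloader", "model.forward", "functional.relu",
                "loss.backward", "optimizer.step"]
experimental_run = ["dataloader", "model.forward",
                    "loss.backward", "optimizer.step"]

aligned = align_by_name(baseline_run, experimental_run)
for base_name, exp_name in aligned:
    print(base_name, "<->", exp_name)
```

Here functional.relu ends up paired with `None`, which corresponds to leaving the missing part out of the comparison.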
Once the two execution timelines are logically aligned, we can compare the absolute execution time of each logical part, which gives the following chart. The comparison uses critical-path time: in most cases this means CPU time for CPU tasks and GPU time for GPU tasks.

For each part, we can then plot the following difference line, in execution order.
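The per-part difference can be sketched as below. The rule used to pick the critical-path time (device time when the block ran any GPU work, host time otherwise) is a simplifying assumption covering the common case only, and the field names `host_ms` and `device_ms` are hypothetical.

```python
def critical_path_ms(block):
    """Device time if the block ran GPU work, otherwise host time."""
    return block["device_ms"] if block["device_ms"] > 0 else block["host_ms"]


def diff_line(aligned_blocks):
    """Return (name, experimental - baseline) per aligned part, in order."""
    return [
        (name, critical_path_ms(exp) - critical_path_ms(base))
        for name, base, exp in aligned_blocks
    ]


# Hypothetical aligned parts: (name, baseline block, experimental block).
aligned_blocks = [
    ("dataloader",
     {"host_ms": 4.0, "device_ms": 0.0},
     {"host_ms": 5.5, "device_ms": 0.0}),
    ("model.forward",
     {"host_ms": 1.0, "device_ms": 12.0},
     {"host_ms": 1.0, "device_ms": 9.5}),
]

points = diff_line(aligned_blocks)
for name, delta_ms in points:
    print(f"{name}: {delta_ms:+.1f} ms")
```

A positive value means the experimental run spent more time in that part than the baseline.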

Users can zoom into a specific aligned part by clicking it (exit the zoom by clicking a blank region?). For example, the top module's forward can be zoomed into a submodule view, recursively. When the user selects a block, e.g. top module.forward, a detailed comparison view for the selected block is shown. If there are gaps between aligned blocks (e.g. unknown code such as functional calls, or pure CPU code such as time.sleep), a blank block with a name like “unknown” should be inserted to indicate that the time does not belong to any module.
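The gap handling can be sketched as follows. It assumes blocks are given as `(name, start, end)` tuples sorted by start time and not overlapping; the `fill_gaps` helper and the `min_gap` threshold are illustrative assumptions.

```python
def fill_gaps(blocks, run_start, run_end, min_gap=0.0):
    """Insert "unknown" blocks wherever no module accounts for the time.

    blocks: (name, start, end) tuples, sorted by start, non-overlapping.
    """
    filled = []
    cursor = run_start
    for name, start, end in blocks:
        if start - cursor > min_gap:
            # Time between two module blocks that belongs to no module.
            filled.append(("unknown", cursor, start))
        filled.append((name, start, end))
        cursor = max(cursor, end)
    if run_end - cursor > min_gap:
        filled.append(("unknown", cursor, run_end))
    return filled


# Hypothetical timestamps in milliseconds.
blocks = [("model.forward", 0, 10), ("loss.backward", 14, 30)]
timeline = fill_gaps(blocks, run_start=0, run_end=32)
for name, start, end in timeline:
    print(f"{name}: {start}..{end}")
```

A non-zero `min_gap` would suppress very short gaps that are likely to be measurement noise rather than untracked code.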
Note: for simplicity, the diff view is shown only at the nn.Module level rather than at the level of the underlying operators, because the sheer number of operators would distract users.
The diff view covers scenarios 1, 3, 4, and 6.
Operator/Kernel view
The operator/kernel view will show a summary of operators/kernels for the baseline and experimental runs. Each column is sortable and filterable. If the user selects specific blocks, only the stats related to those blocks are shown.
| Operator | Baseline Calls | Exp Calls | Delta Calls | Delta Calls % | Baseline Self Duration | Exp Self Duration | Delta Self Duration | Delta Self Duration % |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| aten::empty | 100 | 150 | 50 | 50.00% | 138 | 140 | 2 | 1.45% |
| aten::zeros | 120 | 141 | 21 | 17.50% | 72 | 100 | 28 | 38.89% |
| aten::zero_ | 411 | 531 | 120 | 29.20% | 53 | 59 | 6 | 11.32% |
| aten::view | 31 | 14 | -17 | -54.84% | 46 | 84 | 38 | 82.61% |
The operator view can be extended with the following columns, each with four sub-columns (baseline, experimental, delta, and delta %, as for the columns above):
- device self-duration
- device total duration
- host self-duration
- host total duration
Kernel view follows the same pattern.
Scenarios 2 and 5 are covered by the operator view/kernel view.
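The operator table can be derived from two per-run summaries. The sketch below assumes each summary is a mapping from operator name to `(calls, self duration)`; that input shape and the helper names are illustrative, not the plugin's actual data model.

```python
def pct(delta, base):
    """Relative change in percent; None when the baseline is zero."""
    return round(delta / base * 100, 2) if base else None


def operator_diff(baseline, experimental):
    """Build one diff row per operator seen in either run."""
    rows = []
    for op in sorted(set(baseline) | set(experimental)):
        # An operator missing from one run counts as zero calls and time.
        base_calls, base_dur = baseline.get(op, (0, 0))
        exp_calls, exp_dur = experimental.get(op, (0, 0))
        rows.append({
            "operator": op,
            "delta_calls": exp_calls - base_calls,
            "delta_calls_pct": pct(exp_calls - base_calls, base_calls),
            "delta_self_duration": exp_dur - base_dur,
            "delta_self_duration_pct": pct(exp_dur - base_dur, base_dur),
        })
    return rows


# Two rows from the table above.
baseline = {"aten::empty": (100, 138), "aten::view": (31, 46)}
experimental = {"aten::empty": (150, 140), "aten::view": (14, 84)}

rows = operator_diff(baseline, experimental)
# Sorting by any column is a plain sort on the row dictionaries.
by_duration = sorted(rows, key=lambda r: r["delta_self_duration"], reverse=True)
for row in by_duration:
    print(row)
```

Restricting the view to selected blocks would amount to building the two input mappings from only the events inside those blocks.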
Work Items
The following changes are needed so that the diff view can align the logical timeline.
- Extend torch.profiler.record_function to support custom metadata (torch/csrc/autograd/record_function_ops.cpp::record_function_enter).
- Capture module parameters and sizes by leveraging record_function.
- Add a top-level module trace using a global hook, as in PR 55354.
- Trace each module in the above hook.
Another approach is to trace every module in nn.Module._call_impl. This requires adding a trace_module flag to nn.Module, set through torch.profiler.profile; nn.Module._call_impl then calls record_function for the current nn.Module.
- Add an nn.Module name to every module in the graph when it is added via nn.Module.add_module or nn.Module.__setattr__.
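The _call_impl approach and the naming work item can be illustrated without PyTorch. In the sketch below, `Module` is a hypothetical stand-in class, not nn.Module: it names each child when the child is assigned as an attribute, and records a named span around forward when a class-level `trace_module` flag is on. The `events` list stands in for what record_function would emit.

```python
import time


class Module:
    """Hypothetical stand-in for nn.Module, for illustration only."""

    trace_module = False  # would be switched on by the profiler
    events = []           # (hierarchical name, start, end) per call
    _call_stack = []      # local names of the modules currently executing

    def __init__(self):
        object.__setattr__(self, "_local_name", type(self).__name__)

    def __setattr__(self, key, value):
        if isinstance(value, Module):
            # Name the child after the attribute it is assigned to.
            object.__setattr__(value, "_local_name", key)
        object.__setattr__(self, key, value)

    def __call__(self, *args):
        if not Module.trace_module:
            return self.forward(*args)
        Module._call_stack.append(self._local_name)
        name = ".".join(Module._call_stack)
        start = time.perf_counter()
        try:
            return self.forward(*args)
        finally:
            Module.events.append((name, start, time.perf_counter()))
            Module._call_stack.pop()


class Linear(Module):
    def forward(self, x):
        return [2 * v for v in x]


class Net(Module):
    def __init__(self):
        super().__init__()
        self.fc1 = Linear()
        self.fc2 = Linear()

    def forward(self, x):
        return self.fc2(self.fc1(x))


model = Net()
Module.trace_module = True
out = model([1, 2])
traced_names = [name for name, _, _ in Module.events]
print(traced_names)
```

Building the name from the call stack gives each span a hierarchical name such as Net.fc1, which is the kind of identifier a name-based alignment of two runs would rely on.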
Open Issues
- The algorithm for aligning the logical timeline has not been decided yet. The simplest approach is to use each module’s hierarchical name to match modules across the two runs/traces.
- How should the user exit the zoom? By clicking on a blank region, or by some other means?