[RFC] Profiler differentiate runs/traces feature request

# Profiler differentiate runs/traces feature request

## Scenarios

### Goals

Here are some typical scenarios in which data scientists tweak a baseline model and want to compare the new run against the baseline to see whether there are obvious changes in the new model.

* Data scientists would like to compare the model performance after changing a module’s dimensions, e.g. changing the nn.Linear dimension from 128x100 to 128000x100000.
* Data scientists would like to compare the model performance between apex fp32 and fp16.
* Data scientists would like to compare the model performance between ResNet 50 and ResNet 152.
* Data scientists would like to compare the model performance after changing a loss function, e.g. replacing a manually written softmax/cross-entropy with a built-in PyTorch loss function.
* Data scientists would like to compare the model performance after choosing optimized operators in PyTorch, e.g. when several operators are fused into one optimized operator.
* Data scientists have two different nn.Module implementations for training the same model and would like to see which one performs better during training.

### Non-Goals

* Unusual cases such as comparing a ResNet model with a BERT model. In such cases, the plugin only shows the top-level diff view.

## Design

The design covers the six scenarios listed in the Goals section above.
The plugin UI has two modes: normal mode and diff (comparison) mode. In diff mode, the UI looks like the following and lets the user select the runs to compare.

![image](https://user-images.githubusercontent.com/817030/124871220-5b33c380-dff6-11eb-92a8-7ff14d15fd78.png)

After users select both baseline and experimental runs and click the diff button, the diff UI will be loaded.

### Overview
The overview UI shows summary information such as device, memory, GPU utilization, memory usage, and step time.

| Category | Device | Memory | GPU Utilization | Avg Memory Usage | Peak Memory Usage | Avg Step Time (ms) |
| -------- | ------ | ------ | --------------- | ---------------- | ----------------- | ------------------ |
| Baseline | Tesla V100-DGXS-32GB | 31.74 GB | 30% | 1.50 GB | 5.00 GB | 9.3 |
| Experimental | Tesla V100-DGXS-32GB | 31.74 GB | 17% | 1.70 GB | 4.90 GB | 11.2 |
| Delta | N/A | N/A | -13% | 200.00 MB | -100.00 MB | 1.9 |
| Delta (%) | N/A | N/A | -43.33% | 13.33% | -2.00% | 20.43% |
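
As a rough illustration of how the Delta and Delta (%) rows could be derived, here is a minimal sketch. The `RunSummary` class, its field names, and the helper functions are hypothetical and not part of the plugin; the sample values are taken from the table above.

```python
from dataclasses import dataclass, fields
from typing import Dict, Optional, Tuple


@dataclass
class RunSummary:
    """Hypothetical container for the numeric overview metrics of one run."""
    gpu_util_pct: float
    avg_mem_gb: float
    peak_mem_gb: float
    avg_step_ms: float


def diff_metric(baseline: float, experimental: float) -> Tuple[float, Optional[float]]:
    """Return (delta, delta in percent of the baseline)."""
    delta = experimental - baseline
    # A relative change is undefined when the baseline is zero.
    delta_pct = None if baseline == 0 else delta / baseline * 100.0
    return delta, delta_pct


def diff_summary(base: RunSummary, exp: RunSummary) -> Dict[str, Tuple[float, Optional[float]]]:
    """Compute delta and delta % for every metric in the summary."""
    return {
        f.name: diff_metric(getattr(base, f.name), getattr(exp, f.name))
        for f in fields(RunSummary)
    }


if __name__ == "__main__":
    baseline = RunSummary(gpu_util_pct=30.0, avg_mem_gb=1.50, peak_mem_gb=5.00, avg_step_ms=9.3)
    experimental = RunSummary(gpu_util_pct=17.0, avg_mem_gb=1.70, peak_mem_gb=4.90, avg_step_ms=11.2)
    for name, (delta, pct) in diff_summary(baseline, experimental).items():
        print(f"{name}: delta={delta:+.2f}, delta%={pct:+.2f}%")
```

Non-numeric columns such as Device would simply be reported as N/A, as in the table.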

### Diff View
In the diff view, we need to split each run into comparable pieces, with each piece aligned on a logical timeline. For example:

![image](https://user-images.githubusercontent.com/817030/124872985-7b648200-dff8-11eb-9f6d-0a44656fd8a3.png)

The two example runs can be split as shown in the diagram above. Parts that are missing from one run are left alone when doing the comparison.
**Note: functional.relu is shown for illustration purposes only. It is possible that all functional calls will be missing.**
After aligning the run execution timelines logically, we can compare the absolute execution time of each logical part, which gives the following chart. Execution times are matched using the critical-path time, which in most cases means CPU time for CPU tasks and GPU time for GPU tasks.

![image](https://user-images.githubusercontent.com/817030/124873065-920ad900-dff8-11eb-962d-2b2b2aac0450.png)

For each part, we can get the following difference line in execution order. 

![image](https://user-images.githubusercontent.com/817030/124873105-a058f500-dff8-11eb-8614-c4c000b45753.png)

Users can zoom in on a specific aligned part by clicking it (exit the zoom by clicking a blank region?). For example, the top module's forward can be zoomed into a submodule view recursively. When the user **selects** a block, e.g. top module.forward, the detailed comparison view for the selected block is shown. If there are gaps between the aligned blocks (e.g. unknown code such as functional calls, or pure CPU code such as time.sleep), **a blank block should be inserted with a name like “unknown”**, meaning that this time does not belong to any module.
Note: **we only show the diff view at the nn.Module level**, not for the underlying operators, for simplicity: there are so many operators that they would distract users.
The diff view covers scenarios 1, 3, 4, and 6.
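
The "unknown" gap blocks described above could be produced roughly as follows. This is only a sketch under stated assumptions: the `Block` tuple, the `fill_gaps` helper, and the `min_gap` threshold are hypothetical names introduced for illustration, and timestamps are assumed to already be on a single critical-path timeline.

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    """Hypothetical timeline block: a module call with start/end timestamps."""
    name: str
    start: float
    end: float


def fill_gaps(parent: Block, children: List[Block], min_gap: float = 0.0) -> List[Block]:
    """Return the children in time order, with an "unknown" block inserted
    wherever the parent's time span is not covered by any child."""
    filled: List[Block] = []
    cursor = parent.start  # end of the time range covered so far
    for child in sorted(children, key=lambda b: b.start):
        if child.start - cursor > min_gap:
            filled.append(Block("unknown", cursor, child.start))
        filled.append(child)
        cursor = max(cursor, child.end)
    if parent.end - cursor > min_gap:
        filled.append(Block("unknown", cursor, parent.end))
    return filled


if __name__ == "__main__":
    top = Block("top_module.forward", 0, 10)
    subs = [Block("conv1", 0, 3), Block("fc", 5, 9)]
    for block in fill_gaps(top, subs):
        print(block)
```

Applying the same function recursively to each child would give the gap blocks for the zoomed-in submodule views.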

### Operator/Kernel view 
The operator/kernel view shows a summary of the operators/kernels for the baseline and experimental runs. Each column is **sortable** and filterable. **If the user selects specific blocks, only the related stats are shown.**

| Operator | Baseline Calls | Exp Calls | Delta Calls | Delta Calls % | Baseline Self Duration | Exp Self Duration | Delta Self Duration | Delta Self Duration % |
| -------- | -------------- | --------- | ----------- | ------------- | ---------------------- | ----------------- | ------------------- | --------------------- |
| aten::empty | 100 | 150 | 50 | 50.00% | 138 | 140 | 2 | 1.45% |
| aten::zeros | 120 | 141 | 21 | 17.50% | 72 | 100 | 28 | 38.89% |
| aten::zero_ | 411 | 531 | 120 | 29.20% | 53 | 59 | 6 | 11.32% |
| aten::view | 31 | 14 | -17 | -54.84% | 46 | 84 | 38 | 82.61% |

We can extend the operator view with the following columns, each of which will have four sub-columns:
* device self-duration
* device total duration
* host self-duration
* host total duration

The kernel view follows the same pattern.
Scenarios 2 and 5 are covered by the operator and kernel views.
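
To make the operator table concrete, the rows could be computed along these lines. This is a sketch only: the input format (operator name mapped to a `(calls, self_duration)` pair), the `diff_operators` function, and the row keys are assumptions made for illustration, and operators missing from one run are treated as having zero calls and zero duration.

```python
from typing import Dict, List, Optional, Tuple

OpStats = Dict[str, Tuple[int, float]]  # operator name -> (calls, self duration)


def _pct(delta: float, base: float) -> Optional[float]:
    """Relative change in percent, or None when the baseline is zero."""
    return None if base == 0 else round(delta / base * 100.0, 2)


def diff_operators(baseline: OpStats, experimental: OpStats,
                   sort_by: str = "delta_self_duration") -> List[dict]:
    """Build one diff row per operator seen in either run, sorted descending."""
    rows = []
    for op in sorted(set(baseline) | set(experimental)):
        b_calls, b_dur = baseline.get(op, (0, 0))
        e_calls, e_dur = experimental.get(op, (0, 0))
        rows.append({
            "operator": op,
            "baseline_calls": b_calls,
            "exp_calls": e_calls,
            "delta_calls": e_calls - b_calls,
            "delta_calls_pct": _pct(e_calls - b_calls, b_calls),
            "baseline_self_duration": b_dur,
            "exp_self_duration": e_dur,
            "delta_self_duration": e_dur - b_dur,
            "delta_self_duration_pct": _pct(e_dur - b_dur, b_dur),
        })
    # Undefined percentages (None) sort last.
    rows.sort(key=lambda r: float("-inf") if r[sort_by] is None else r[sort_by],
              reverse=True)
    return rows


if __name__ == "__main__":
    base = {"aten::empty": (100, 138), "aten::zeros": (120, 72),
            "aten::zero_": (411, 53), "aten::view": (31, 46)}
    exp = {"aten::empty": (150, 140), "aten::zeros": (141, 100),
           "aten::zero_": (531, 59), "aten::view": (14, 84)}
    for row in diff_operators(base, exp):
        print(row)
```

Restricting the input dictionaries to the operators recorded inside a selected block would give the filtered stats mentioned above.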

## Work Items
The following changes are needed so that the diff view feature can align the logical timeline.
* Extend torch.profiler.record_function to support customized metadata (torch/csrc/autograd/record_function_ops.cpp::record_function_enter).
* Capture module parameters and sizes by leveraging record_function.
* Add a top-level module trace by using a global hook, as in PR 55354.
* Trace each module in the above hook.
    Another approach is to trace every module in nn.Module._call_impl. This requires adding trace_module to nn.Module, which is set through torch.profiler.profile. nn.Module._call_impl then calls record_function for the current nn.Module.
* Add the nn.Module name to every module in the graph when it is added via nn.Module.add_module or `nn.Module.__setattr__`.

## Open Issues
* The algorithm for aligning the logical timeline has not been decided yet. The simplest approach is to use each module’s hierarchical name to identify the same module across the two runs/traces.
* How should the user exit the zoom? By clicking on a blank region, or something else?
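
One possible way to sketch the name-based alignment mentioned in the first open issue is shown below. It is not a decided design: it assumes each run has already been reduced to an ordered list of `(hierarchical_name, duration)` pairs, and it uses Python's standard `difflib.SequenceMatcher` as a stand-in for whatever alignment algorithm is eventually chosen. The `align_by_name` function is hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

Call = Tuple[str, float]  # (hierarchical module name, duration)
Aligned = Tuple[str, Optional[float], Optional[float]]  # (name, baseline, experimental)


def align_by_name(baseline: List[Call], experimental: List[Call]) -> List[Aligned]:
    """Align two module call sequences in execution order by hierarchical name.

    Modules present in only one run get None for the other run's duration.
    """
    base_names = [name for name, _ in baseline]
    exp_names = [name for name, _ in experimental]
    matcher = SequenceMatcher(a=base_names, b=exp_names, autojunk=False)

    aligned: List[Aligned] = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Same names in the same order: pair them up one to one.
            for (name, b_time), (_, e_time) in zip(baseline[i1:i2], experimental[j1:j2]):
                aligned.append((name, b_time, e_time))
        else:
            # Unmatched stretch: keep both sides, marked as missing in the other run.
            for name, b_time in baseline[i1:i2]:
                aligned.append((name, b_time, None))
            for name, e_time in experimental[j1:j2]:
                aligned.append((name, None, e_time))
    return aligned


if __name__ == "__main__":
    run_a = [("model.conv1", 5.0), ("model.relu", 1.0), ("model.fc", 3.0)]
    run_b = [("model.conv1", 6.0), ("model.fc", 4.0)]
    for row in align_by_name(run_a, run_b):
        print(row)
```

Matching on names alone would not align renamed or restructured modules, which is part of why the choice of algorithm remains open.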


