
profiler.export_stacks doesn't return stack trace unless experimental_config is provided #100253

Open

NicolayP opened this issue Apr 28, 2023 · 20 comments

Labels: high priority, module: regression (it used to work, and now it doesn't), oncall: profiler (profiler-related issues: cpu, gpu, kineto)

NicolayP commented Apr 28, 2023

🐛 Describe the bug

Since I upgraded torch from 1.13.0+cu117 to 2.0.0+cu117, the following code no longer logs or prints the stack trace.

import torch
from torch.profiler import profile, record_function

a = torch.rand(100, 100)
b = torch.rand(100, 100)

with profile(with_stack=True, profile_memory=True) as prof:
    with record_function("model_inference"):
        torch.add(a, b)

prof.export_stacks("myFile", "self_cpu_time_total")

print(prof.key_averages(group_by_stack_n=5))

For 1.13.0+cu117 the file is written correctly:

test_prof.py(8):_<module>;torch/autograd/profiler.py(478):___init__;<built-in_method_zeros_of_type_object_at_0x7f55776329c0> 45
test_prof.py(8):_<module>;torch/autograd/profiler.py(478):___init__;<built-in_method_zeros_of_type_object_at_0x7f55776329c0> 30
test_prof.py(8):_<module>;torch/autograd/profiler.py(478):___init__;<built-in_method_zeros_of_type_object_at_0x7f55776329c0> 5
test_prof.py(8):_<module>;torch/autograd/profiler.py(487):___enter__;torch/_ops.py(437):___call__;<built-in_method__record_function_enter_of_PyCapsule_object_at_0x7f549a9189f0> 143
test_prof.py(8):_<module>;torch/autograd/profiler.py(487):___enter__;torch/_ops.py(437):___call__;<built-in_method__record_function_enter_of_PyCapsule_object_at_0x7f549a9189f0> 3
test_prof.py(8):_<module>;<built-in_method_add_of_type_object_at_0x7f55776329c0> 85
test_prof.py(8):_<module>;torch/profiler/profiler.py(475):___exit__;torch/profiler/profiler.py(484):_stop;torch/profiler/profiler.py(511):__transit_action;torch/profiler/profiler.py(117):_stop_trace;torch/autograd/profiler.py(223):___exit__;torch/cuda/__init__.py(556):_synchronize;torch/cuda/__init__.py(201):__lazy_init;<built-in_function__cuda_init> 1269
test_prof.py(8):_<module>;torch/profiler/profiler.py(475):___exit__;torch/profiler/profiler.py(484):_stop;torch/profiler/profiler.py(511):__transit_action;torch/profiler/profiler.py(117):_stop_trace;torch/autograd/profiler.py(223):___exit__;torch/cuda/__init__.py(556):_synchronize;torch/cuda/__init__.py(201):__lazy_init;torch/cuda/__init__.py(108):__check_capability;torch/cuda/__init__.py(344):_get_device_capability;torch/cuda/__init__.py(361):_get_device_properties;<built-in_method__get_device_properties_of_PyCapsule_object_at_0x7f5597a3d8d0> 175
test_prof.py(8):_<module>;torch/profiler/profiler.py(475):___exit__;torch/profiler/profiler.py(484):_stop;torch/profiler/profiler.py(511):__transit_action;torch/profiler/profiler.py(117):_stop_trace;torch/autograd/profiler.py(223):___exit__;torch/cuda/__init__.py(556):_synchronize;<built-in_function__cuda_synchronize> 252937

For 2.0.0+cu117, however, I get an empty file. Is there something I'm missing?

Versions

For the env with torch 1.13.0

Collecting environment information...
PyTorch version: 1.13.0+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.10.2
Libc version: glibc-2.27

Python version: 3.9.15 (main, Nov 24 2022, 14:31:59)  [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.4.0-147-generic-x86_64-with-glibc2.27
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce GTX 1650
Nvidia driver version: 525.105.17
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.8.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.8.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.8.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.8.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.8.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.8.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.8.1
/usr/local/cuda-11.7/targets/x86_64-linux/lib/libcudnn.so.8
/usr/local/cuda-11.7/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8
/usr/local/cuda-11.7/targets/x86_64-linux/lib/libcudnn_adv_train.so.8
/usr/local/cuda-11.7/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8
/usr/local/cuda-11.7/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8
/usr/local/cuda-11.7/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8
/usr/local/cuda-11.7/targets/x86_64-linux/lib/libcudnn_ops_train.so.8
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              16
On-line CPU(s) list: 0-15
Thread(s) per core:  2
Core(s) per socket:  8
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               158
Model name:          Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Stepping:            13
CPU MHz:             964.671
CPU max MHz:         5000.0000
CPU min MHz:         800.0000
BogoMIPS:            4800.00
Virtualisation:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            16384K
NUMA node0 CPU(s):   0-15
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d arch_capabilities

Versions of relevant libraries:
[pip3] numpy==1.23.5
[pip3] torch==1.13.0
[pip3] torchaudio==0.13.0
[pip3] torchvision==0.14.0
[conda] cudatoolkit               11.0.3              h88f8997_10    conda-forge
[conda] numpy                     1.23.5                   pypi_0    pypi
[conda] torch                     1.13.0                   pypi_0    pypi
[conda] torchaudio                0.13.0                   pypi_0    pypi
[conda] torchvision               0.14.0                   pypi_0    pypi

For torch 2.0.0:

Collecting environment information...
PyTorch version: 2.0.0+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.26.3
Libc version: glibc-2.27

Python version: 3.9.16 (main, Mar  8 2023, 14:00:05)  [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.4.0-147-generic-x86_64-with-glibc2.27
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce GTX 1650
Nvidia driver version: 525.105.17
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.8.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.8.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.8.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.8.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.8.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.8.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.8.1
/usr/local/cuda-11.7/targets/x86_64-linux/lib/libcudnn.so.8
/usr/local/cuda-11.7/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8
/usr/local/cuda-11.7/targets/x86_64-linux/lib/libcudnn_adv_train.so.8
/usr/local/cuda-11.7/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8
/usr/local/cuda-11.7/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8
/usr/local/cuda-11.7/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8
/usr/local/cuda-11.7/targets/x86_64-linux/lib/libcudnn_ops_train.so.8
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              16
On-line CPU(s) list: 0-15
Thread(s) per core:  2
Core(s) per socket:  8
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               158
Model name:          Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Stepping:            13
CPU MHz:             1672.804
CPU max MHz:         5000.0000
CPU min MHz:         800.0000
BogoMIPS:            4800.00
Virtualisation:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            16384K
NUMA node0 CPU(s):   0-15
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d arch_capabilities

Versions of relevant libraries:
[pip3] numpy==1.24.3
[pip3] torch==2.0.0
[pip3] torchvision==0.15.1
[pip3] triton==2.0.0
[conda] numpy                     1.24.3                   pypi_0    pypi
[conda] torch                     2.0.0                    pypi_0    pypi
[conda] torchvision               0.15.1                   pypi_0    pypi
[conda] triton                    2.0.0                    pypi_0    pypi

cc @ezyang @gchanan @zou3519 @robieta @chaekit @aaronenyeshi @ngimel @nbcsm @guotuofeng @guyang3532 @gaoteng-git @tiffzhaofb @dzhulgakov @davidberard98

Zeyad-Abdelreheem commented

I traced your code and found that the issue is in the profile object (prof), specifically in this line.

That line gives two different results in the two torch versions; it should return the list of events that gets parsed to build the stack trace you need.

It was working fine in torch 1.13.0, but in 2.0.0 it doesn't seem to work; I couldn't find its source code.

In version 2.0.0 the returned list of events can't be parsed by the subsequent functions, so the stack traces are all empty, which is why you get an empty file.

I traced all the other functions and found they are exactly the same between the two versions.
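
To see this from Python, here's a minimal check, as a sketch against the repro above (evt.stack is the Python stack the profiler attaches to each event when with_stack=True):

import torch
from torch.profiler import profile, record_function

a = torch.rand(100, 100)
b = torch.rand(100, 100)

with profile(with_stack=True, profile_memory=True) as prof:
    with record_function("model_inference"):
        torch.add(a, b)

# export_stacks only writes entries for events that carry a stack,
# so if this count is 0 the exported file comes out empty.
events_with_stack = [evt for evt in prof.events() if evt.stack]
print(len(events_with_stack))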

NicolayP (Author) commented

Hi @Zeyad-Abdelreheem, thanks! At least I can rule out a problem on my side if it's reproducible. Any idea who to @ to get help?

Zeyad-Abdelreheem commented

Not sure, probably @albanD or @apaszke.

ngimel added the oncall: profiler label May 1, 2023
albanD (Collaborator) commented May 1, 2023

Hey!
We have a process to triage issues, and automatic ccs are added once the right label is applied by the triager.
Hopefully someone from the profiler team can help; cc @chaekit in particular.

Also marking as tentative high priority.

nihaljn commented May 27, 2023

Any updates on this?

albanD removed the oncall: profiler label Jun 5, 2023
malfet added the module: regression label Jun 5, 2023
malfet (Contributor) commented Jun 5, 2023

Will take an action item to find a PoC and figure out how this missed regression testing during the release.

janeyx99 added the oncall: profiler label and removed the triage review label Jun 5, 2023
davidberard98 self-assigned this Jun 5, 2023
aaronenyeshi (Member) commented Jun 5, 2023

Hi @NicolayP, our profiler team has recorded this issue and will assign a POC to look into it. The profiler underwent a significant rework in the past year, and this regression may have been caused by that. Sorry for the inconvenience.

davidberard98 (Contributor) commented

The bisect is still in progress, but the regression looks likely to come from #86263; I'll take a closer look tomorrow after the bisect finishes.

davidberard98 (Contributor) commented

It looks like you can turn this back on with this:

...
with profile(with_stack=True, profile_memory=True, experimental_config=torch._C._profiler._ExperimentalConfig(verbose=True)) as prof:
...
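
Applied to the repro at the top of the thread, that looks like this (a sketch; the only change from the original code is the extra experimental_config argument):

import torch
from torch.profiler import profile, record_function

a = torch.rand(100, 100)
b = torch.rand(100, 100)

with profile(
    with_stack=True,
    profile_memory=True,
    # verbose=True re-enables the Python stack entries
    experimental_config=torch._C._profiler._ExperimentalConfig(verbose=True),
) as prof:
    with record_function("model_inference"):
        torch.add(a, b)

prof.export_stacks("myFile", "self_cpu_time_total")
print(prof.key_averages(group_by_stack_n=5))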

However, this doesn't show all the same information as it used to; I'll continue investigating.

davidberard98 (Contributor) commented

@nihaljn @NicolayP, for now you can use the experimental_config described above. We'll consider rewriting export_stacks to avoid needing this extra config.

davidberard98 changed the title from "Profiler doesn't return stack trace" to "profiler.export_stacks doesn't return stack trace unless experimental_config is provided" Jun 7, 2023
yundai424 commented

Hi @davidberard98, friendly check-in: do we by any chance have an update on this issue? Thanks!

davidberard98 (Contributor) commented

We don't have any immediate plans to change this behavior; you should use the experimental_config option if you want this feature.

srdecny commented Nov 1, 2023

It seems like with torch==2.1.0+cu118, the suggested experimental_config workaround no longer works and yields an empty file when calling export_stacks.

oliver-batchelor commented

Any update on this? I've been running into the same issue in both 2.0 and 2.1: no stack information.

MilkClouds commented

No stack information on torch 2.2.0, but experimental_config works.

fzyzcjy (Contributor) commented Mar 22, 2024

Hi, are there any updates? This bug seems to still be there with the latest PyTorch.

mx commented Apr 20, 2024

Can confirm that on 2.2.0, even with the experimental_config as described above, output stack traces are empty.

DuaneNielsen commented

This worked for me on PyTorch 2.2.2+cu121:

import torch
from torch.profiler import profile, ProfilerActivity

def trace_handler(p):
    # Print the top stacks sorted by CUDA memory usage, then dump a Chrome trace
    output = p.key_averages(group_by_stack_n=5).table(sort_by="self_cuda_memory_usage", row_limit=20)
    print(output)
    p.export_chrome_trace("./trace_" + str(p.step_num) + ".json")

with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        schedule=torch.profiler.schedule(
            wait=1,
            warmup=1,
            active=2),
        on_trace_ready=trace_handler,
        profile_memory=True,
        with_stack=True,
        # verbose=True is what brings the Python stacks back
        experimental_config=torch._C._profiler._ExperimentalConfig(verbose=True)
) as p:
    # the schedule needs p.step() calls to advance through wait/warmup/active
    for _ in range(4):
        torch.add(torch.rand(100, 100), torch.rand(100, 100))  # workload from the repro above
        p.step()

I have to say, there is a huge difference between knowing you have 3.8 GB allocated from aten::empty and knowing you have 3.8 GB allocated from torch/nn/functional.py(2451): batch_norm.

The latter is a lot more actionable than the former.

botbw commented Apr 30, 2024

It seems experimental_config is still needed when using torch==2.3.0.

Linusnie commented Jun 2, 2024

For me it's not working, with or without the experimental config, on 2.3.0+cu118.

sraikund16 added a commit to sraikund16/pytorch that referenced this issue Jun 4, 2024
Summary:
Many users have been complaining that with_stack does not work on its own, as described in our PyTorch tutorials. If used, it returns an empty Python stack; it only returns a stack if JIT is enabled.
pytorch#89406
pytorch#95238
pytorch#100253
pytorch#117515

Test Plan: Edited the unit test to only contain with_stack.

Differential Revision: D58149768
sraikund16 reopened this Jun 5, 2024