
aten::nonzero calls taking a huge amount of time when using MPS backend vs CPU #124850

Open
theo-costain-arondite opened this issue Apr 24, 2024 · 6 comments
Labels
high priority · module: crash (Problem manifests as a hard crash, as opposed to a RuntimeError) · module: mps (Related to Apple Metal Performance Shaders framework) · module: performance (Issues related to performance, either of kernel code or framework glue) · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments


theo-costain-arondite commented Apr 24, 2024

🐛 Describe the bug

I found that running a torchvision model under MPS backend was extremely slow compared to cpu.
I ran the profiler and found that the vast majority of that time was coming from a small number of calls to aten::nonzero.
Using the repro below with cpu device takes ~1s to run, but switching to mps increases this to ~75s, most of which is spent in aten::nonzero.
I wonder if this might be related to #122916.

repro

import torch
import torchvision.models as models
from torch.profiler import ProfilerActivity, profile, record_function

device = "mps"

inputs = torch.randn(3, 224, 224).to(device)

transform = models.detection.FasterRCNN_ResNet50_FPN_V2_Weights.DEFAULT.transforms()

detection_model = models.detection.fasterrcnn_resnet50_fpn_v2(
    weights=models.detection.FasterRCNN_ResNet50_FPN_V2_Weights.DEFAULT,
).to(device)
detection_model.eval()

with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    with record_function("model_inference"):
        out = detection_model([transform(inputs)])

print(
    prof.key_averages(group_by_input_shape=True).table(
        sort_by="cpu_time_total", row_limit=10
    )
)
prof.export_chrome_trace(f"trace_test_{device}.json")
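
To isolate the op outside the detection model, a microbenchmark along these lines can time `torch.nonzero` directly on a `[1000]` boolean tensor (this is my own sketch, not part of the original report; the CPU branch is only a fallback so the script runs on non-Apple hardware):

```python
import time

import torch

# Pick MPS when available; "mps" is the device this issue is about,
# the CPU branch just keeps the script runnable elsewhere.
device = "mps" if torch.backends.mps.is_available() else "cpu"

# A [1000] bool mask, roughly the input shape dominating the traces above.
x = torch.rand(1000, device=device) > 0.5

torch.nonzero(x)  # warm-up, so one-time kernel setup is not counted

start = time.perf_counter()
for _ in range(10):
    idx = torch.nonzero(x)
if device == "mps":
    torch.mps.synchronize()  # drain queued GPU work before stopping the clock
elapsed = time.perf_counter() - start

print(f"{device}: 10 x nonzero on a [1000] bool tensor took {elapsed:.4f}s")
```

Comparing the printed time with `device = "cpu"` against `device = "mps"` should show the same gap as the full model run, without torchvision in the loop.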

CPU profile results

-------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------  
                                 Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls                                                                      Input Shapes  
-------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------  
                      model_inference         2.25%      26.482ms       100.00%        1.176s        1.176s             1                                                                                []  
                         aten::conv2d         0.00%      14.000us        38.19%     449.105ms     112.276ms             4                         [[1000, 256, 7, 7], [256, 256, 3, 3], [], [], [], [], []]  
                    aten::convolution         0.00%      30.000us        38.19%     449.091ms     112.273ms             4                 [[1000, 256, 7, 7], [256, 256, 3, 3], [], [], [], [], [], [], []]  
                   aten::_convolution         0.00%      33.000us        38.19%     449.061ms     112.265ms             4  [[1000, 256, 7, 7], [256, 256, 3, 3], [], [], [], [], [], [], [], [], [], [], []  
    aten::_nnpack_spatial_convolution        38.18%     448.984ms        38.19%     449.028ms     112.257ms             4                                 [[1000, 256, 7, 7], [256, 256, 3, 3], [], [], []]  
                         aten::conv2d         0.00%       6.000us         9.21%     108.300ms      54.150ms             2                     [[1, 256, 200, 200], [256, 256, 3, 3], [256], [], [], [], []]  
                    aten::convolution         0.00%      15.000us         9.21%     108.294ms      54.147ms             2             [[1, 256, 200, 200], [256, 256, 3, 3], [256], [], [], [], [], [], []]  
                   aten::_convolution         0.00%      13.000us         9.21%     108.279ms      54.139ms             2  [[1, 256, 200, 200], [256, 256, 3, 3], [256], [], [], [], [], [], [], [], [], []  
                    aten::thnn_conv2d         0.00%       4.000us         9.21%     108.266ms      54.133ms             2                         [[1, 256, 200, 200], [256, 256, 3, 3], [], [256], [], []]  
           aten::_slow_conv2d_forward         8.96%     105.342ms         9.21%     108.262ms      54.131ms             2                         [[1, 256, 200, 200], [256, 256, 3, 3], [], [256], [], []]  
-------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------  
Self CPU time total: 1.176s

MPS profile results

--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------  
                            Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls                                                                      Input Shapes  
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------  
                 model_inference        -0.00%   -2260.000us       100.00%       74.706s       74.706s             1                                                                                []  
                     aten::where        -0.00%     -17.000us        99.25%       74.143s       18.536s             4                                                                          [[1000]]  
             aten::nonzero_numpy         0.00%      38.000us        99.25%       74.143s       18.536s             4                                                                          [[1000]]  
                   aten::nonzero        99.24%       74.138s        99.25%       74.143s       18.536s             4                                                                          [[1000]]  
                       aten::cat         0.06%      41.956ms         0.06%      41.956ms       1.998ms            21                                                                          [[], []]  
                       aten::sub         0.05%      34.111ms         0.05%      34.111ms      34.111ms             1                                                    [[3, 224, 224], [3, 1, 1], []]  
                        aten::to        -0.00%     -88.000us         0.04%      32.683ms       1.421ms            23                                                          [[], [], [], [], [], []]  
                  aten::_to_copy         0.00%     676.000us         0.04%      32.681ms       2.723ms            12                                                      [[], [], [], [], [], [], []]  
                     aten::copy_         0.04%      32.618ms         0.04%      32.645ms       2.720ms            12                                                                      [[], [], []]  
                     aten::stack         0.00%      64.000us         0.03%      23.270ms       2.327ms            10                                                                          [[], []]  
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------  
Self CPU time total: 74.706s
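
A note on reading these tables: `aten::where` sits directly above `aten::nonzero` because the one-argument form `torch.where(condition)` is documented to be equivalent to `torch.nonzero(condition, as_tuple=True)`, so the detection model's boolean-mask indexing lands on the slow nonzero path. A quick illustration (mine, not from the thread):

```python
import torch

mask = torch.tensor([True, False, True, True])

# One-argument torch.where dispatches to nonzero, which is why the
# profiler shows aten::nonzero nested under aten::where.
a = torch.where(mask)
b = torch.nonzero(mask, as_tuple=True)

print([t.tolist() for t in a])  # indices of the True entries
assert all(torch.equal(x, y) for x, y in zip(a, b))
```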

P.S. I think running the MPS repro above may have hard-crashed my laptop (it happened while I was writing this issue the first time), but I don't have access to another machine to confirm that this isn't a problem with my machine.

Versions

Collecting environment information...
PyTorch version: 2.2.2
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 14.4.1 (arm64)
GCC version: Could not collect
Clang version: 15.0.0 (clang-1500.3.9.4)
CMake version: Could not collect
Libc version: N/A

Python version: 3.12.2 | packaged by Anaconda, Inc. | (main, Feb 27 2024, 12:57:28) [Clang 14.0.6 ] (64-bit runtime)
Python platform: macOS-14.4.1-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Apple M3 Pro

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.2.2
[pip3] torchvision==0.17.2
[conda] numpy 1.26.4 py312h7f4fdc5_0
[conda] numpy-base 1.26.4 py312he047099_0
[conda] pytorch 2.2.2 py3.12_0 pytorch
[conda] torchvision 0.17.2 py312_cpu pytorch

cc @ezyang @gchanan @zou3519 @kadeng @kulinseth @albanD @malfet @DenisVieriu97 @jhavukainen

@malfet malfet added the module: performance, triaged and module: mps labels Apr 24, 2024
@malfet malfet self-assigned this Apr 24, 2024

malfet commented Apr 24, 2024

Assigning to myself to get a quick repro...

@malfet malfet added the module: crash and high priority labels and removed the triaged label Apr 24, 2024

malfet commented Apr 24, 2024

Marking hi-pri: as of 2.3.0, it just crashes the MPS runtime:

% python bug-124850.py
STAGE:2024-04-24 11:27:36 16333:8699272 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
Error: command buffer exited with error status.
	The Metal Performance Shaders operations encoded on it may not have completed.
	Error: 
	(null)
	Internal Error (0000000e:Internal Error)
	<AGXG14XFamilyCommandBuffer: 0x16c6a8a80>
    label = <none> 
    device = <AGXG14SDevice: 0x162850a00>
        name = Apple M2 Pro 
    commandQueue = <AGXG14XFamilyCommandQueue: 0x167220000>
        label = <none> 
        device = <AGXG14SDevice: 0x162850a00>
            name = Apple M2 Pro 
    retainedReferences = 1
STAGE:2024-04-24 11:27:38 16333:8699272 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-04-24 11:27:38 16333:8699272 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------  
                            Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls                                                                      Input Shapes  
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------  
                 model_inference         0.24%       3.440ms       100.00%        1.437s        1.437s             1                                                                                []  
                     aten::where         0.00%       4.000us        51.07%     734.106ms     183.526ms             4                                                                          [[1000]]  
             aten::nonzero_numpy         0.00%      32.000us        51.07%     734.102ms     183.525ms             4                                                                          [[1000]]  
                   aten::nonzero        50.60%     727.331ms        51.06%     734.028ms     183.507ms             4                                                                          [[1000]]  
                       aten::cat         3.51%      50.495ms         3.51%      50.495ms       2.405ms            21                                                                          [[], []]  
                     aten::clamp         3.50%      50.339ms         3.50%      50.339ms      50.339ms             1                                                                  [[1000], [], []]  
                       aten::sub         3.45%      49.571ms         3.45%      49.571ms      49.571ms             1                                                    [[3, 224, 224], [3, 1, 1], []]  
                     aten::stack         0.00%      62.000us         1.64%      23.551ms       2.355ms            10                                                                          [[], []]  
                        aten::to        -0.01%    -148.000us         1.39%      19.996ms     952.190us            21                                                          [[], [], [], [], [], []]  
                  aten::_to_copy         0.01%     179.000us         1.39%      19.988ms       1.999ms            10                                                      [[], [], [], [], [], [], []]  
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------  
Self CPU time total: 1.437s

Now I wonder whether this is a regression or not...

@malfet malfet added this to the 2.3.1 milestone Apr 24, 2024
theo-costain-arondite (Author) commented

@malfet If it helps, I installed a new env on my machine and tested against 2.1, and got the same results.

PyTorch version: 2.1.0
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 14.4.1 (arm64)
GCC version: Could not collect
Clang version: 15.0.0 (clang-1500.3.9.4)
CMake version: Could not collect
Libc version: N/A

Python version: 3.11.9 (main, Apr 19 2024, 11:43:47) [Clang 14.0.6 ] (64-bit runtime)
Python platform: macOS-14.4.1-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Apple M3 Pro

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.1.0
[pip3] torchvision==0.16.0
[conda] numpy                     1.26.4          py311he598dae_0  
[conda] numpy-base                1.26.4          py311hfbfe69c_0  
[conda] pytorch                   2.1.0                  py3.11_0    pytorch
[conda] torchvision               0.16.0                py311_cpu    pytorch
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------  
                            Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls                                                                      Input Shapes  
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------  
                 model_inference        -0.02%  -12725.000us       100.00%       74.826s       74.826s             1                                                                                []  
                     aten::where         0.00%       2.925ms        99.06%       74.126s       18.531s             4                                                                          [[1000]]  
             aten::nonzero_numpy         0.00%      20.000us        99.06%       74.126s       18.531s             4                                                                          [[1000]]  
                   aten::nonzero        99.06%       74.121s        99.06%       74.126s       18.531s             4                                                                          [[1000]]  
                torchvision::nms         0.12%      86.786ms         0.13%      97.168ms      97.168ms             1                                                           [[4507, 4], [4507], []]  
                       aten::cat         0.05%      39.946ms         0.05%      39.946ms       1.902ms            21                                                                          [[], []]  
                     aten::index         0.04%      31.605ms         0.05%      37.235ms      18.617ms             2                                                                 [[1, 159882], []]  
                       aten::sub         0.05%      36.937ms         0.05%      36.937ms      36.937ms             1                                                    [[3, 224, 224], [3, 1, 1], []]  
                  aten::_to_copy         0.00%     627.000us         0.04%      30.244ms       1.680ms            18                                                      [[], [], [], [], [], [], []]  
                        aten::to         0.00%      93.000us         0.04%      30.234ms       1.315ms            23                                                          [[], [], [], [], [], []]  
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------  
Self CPU time total: 74.826s

@cpuhrsch cpuhrsch added the triaged label and removed the triage review label Apr 29, 2024
kulinseth (Collaborator) commented

nonzero is indeed written in an inefficient way...

kulinseth (Collaborator) commented

There is a change for nonzero which should help:
main...skotapati:pytorch-sidk:dev/skotapati/nonzero_mps
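
Until a fix lands, one user-side workaround (my own sketch, not code from that branch; `nonzero_via_cpu` is a hypothetical helper name) is to route just this op through the CPU, which is also what the eventual fix does on older OSes:

```python
import torch

def nonzero_via_cpu(t: torch.Tensor) -> torch.Tensor:
    # Run nonzero on the CPU, then move the (usually small) index
    # tensor back to the input's original device, e.g. "mps".
    return t.cpu().nonzero().to(t.device)

x = torch.tensor([0.0, 3.0, 0.0, 5.0])
print(nonzero_via_cpu(x).tolist())  # [[1], [3]]
```

The round trip forces a device sync, so it adds per-call overhead, but that is far cheaper than the multi-second `aten::nonzero` calls in the traces above.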

pytorchbot pushed a commit that referenced this issue May 13, 2024
Fixes #124850

Replace the previous MPSGraph nonzero construction with the native nonzero op. For older OSes, fall back to the CPU (the previous implementation was not reliable and was comparable to the CPU in speed).

Pull Request resolved: #125355
Approved by: https://github.com/kulinseth

(cherry picked from commit a40d6df)
huydhn pushed a commit that referenced this issue May 13, 2024
[MPS] Native nonzero implementation (#125355)

Fixes #124850 (cherry picked from commit a40d6df)

Co-authored-by: Denis Vieriu <dvieriu@apple.com>

huydhn commented May 30, 2024

From what I see, the crashing issue is still there in the upcoming 2.3.1 release on my M1 laptop. Reopening this issue for now, and I'll double-check with @malfet.

STAGE:2024-05-29 18:51:13 53295:3191016 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
Error: command buffer exited with error status.
	The Metal Performance Shaders operations encoded on it may not have completed.
	Error:
	(null)
	Internal Error (0000000e:Internal Error)
	<AGXG13XFamilyCommandBuffer: 0x36ae0d100>
    label = <none>
    device = <AGXG13XDevice: 0x1434d4a00>
        name = Apple M1 Pro
    commandQueue = <AGXG13XFamilyCommandQueue: 0x16d02d800>
        label = <none>
        device = <AGXG13XDevice: 0x1434d4a00>
            name = Apple M1 Pro
    retainedReferences = 1
STAGE:2024-05-29 18:51:16 53295:3191016 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-05-29 18:51:16 53295:3191016 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------
                            Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls                                                                      Input Shapes
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------
                 model_inference        -1.77%  -46315.000us       100.00%        2.612s        2.612s             1                                                                                []
                     aten::where         2.16%      56.389ms        66.03%        1.725s     431.137ms             4                                                                          [[1000]]
             aten::nonzero_numpy         0.00%      41.000us        66.03%        1.725s     431.136ms             4                                                                          [[1000]]
                   aten::nonzero        65.76%        1.718s        66.02%        1.724s     431.114ms             4                                                                          [[1000]]
                       aten::cat         2.96%      77.404ms         2.96%      77.404ms       3.686ms            21                                                                          [[], []]
                       aten::sub         2.24%      58.528ms         2.24%      58.528ms      58.528ms             1                                                    [[3, 224, 224], [3, 1, 1], []]
                     aten::stack         0.00%     103.000us         1.25%      32.740ms       3.274ms            10                                                                          [[], []]
                    aten::arange         1.00%      26.230ms         1.01%      26.254ms       2.188ms            12                                                                 [[], [], [], [0]]
                     aten::where         0.00%       6.000us         0.94%      24.461ms      12.230ms             2                                                                          [[4507]]
             aten::nonzero_numpy         0.00%      13.000us         0.94%      24.455ms      12.227ms             2                                                                          [[4507]]
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------
Self CPU time total: 2.612s

@huydhn huydhn reopened this May 30, 2024
@huydhn huydhn modified the milestones: 2.3.1, 2.4.0 May 31, 2024