
aten::nonzero calls taking a huge amount of time when using MPS backend vs CPU #124850

Open
theo-costain-arondite opened this issue Apr 24, 2024 · 6 comments
Labels
high priority · module: crash (Problem manifests as a hard crash, as opposed to a RuntimeError) · module: mps (Related to Apple Metal Performance Shaders framework) · module: performance (Issues related to performance, either of kernel code or framework glue) · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments


theo-costain-arondite commented Apr 24, 2024

🐛 Describe the bug

I found that running a torchvision model under MPS backend was extremely slow compared to cpu.
I ran the profiler and found that the vast majority of that time was coming from a small number of calls to aten::nonzero.
Using the repro below with cpu device takes ~1s to run, but switching to mps increases this to ~75s, most of which is spent in aten::nonzero.
I wonder if this might be related to #122916.

repro

import torch
import torchvision.models as models
from torch.profiler import ProfilerActivity, profile, record_function

device = "mps"

inputs = torch.randn(3, 224, 224).to(device)

transform = models.detection.FasterRCNN_ResNet50_FPN_V2_Weights.DEFAULT.transforms()

detection_model = models.detection.fasterrcnn_resnet50_fpn_v2(
    weights=models.detection.FasterRCNN_ResNet50_FPN_V2_Weights.DEFAULT,
).to(device)
detection_model.eval()

with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    with record_function("model_inference"):
        out = detection_model([transform(inputs)])

print(
    prof.key_averages(group_by_input_shape=True).table(
        sort_by="cpu_time_total", row_limit=10
    )
)
prof.export_chrome_trace(f"trace_test_{device}.json")
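
To isolate the op outside the detection model, a microbenchmark along these lines can time `torch.nonzero` directly on a `[1000]` boolean tensor (this is my own sketch, not part of the original report; the CPU branch is only a fallback so the script runs on non-Apple hardware):

```python
import time

import torch

# Pick MPS when available; "mps" is the device this issue is about,
# the CPU branch just keeps the script runnable elsewhere.
device = "mps" if torch.backends.mps.is_available() else "cpu"

# A [1000] bool mask, roughly the input shape dominating the traces above.
x = torch.rand(1000, device=device) > 0.5

torch.nonzero(x)  # warm-up, so one-time kernel setup is not counted

start = time.perf_counter()
for _ in range(10):
    idx = torch.nonzero(x)
if device == "mps":
    torch.mps.synchronize()  # drain queued GPU work before stopping the clock
elapsed = time.perf_counter() - start

print(f"{device}: 10 x nonzero on a [1000] bool tensor took {elapsed:.4f}s")
```

Comparing the printed time with `device = "cpu"` against `device = "mps"` should show the same gap as the full model run, without torchvision in the loop.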

CPU profile results

-------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------  
                                 Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls                                                                      Input Shapes  
-------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------  
                      model_inference         2.25%      26.482ms       100.00%        1.176s        1.176s             1                                                                                []  
                         aten::conv2d         0.00%      14.000us        38.19%     449.105ms     112.276ms             4                         [[1000, 256, 7, 7], [256, 256, 3, 3], [], [], [], [], []]  
                    aten::convolution         0.00%      30.000us        38.19%     449.091ms     112.273ms             4                 [[1000, 256, 7, 7], [256, 256, 3, 3], [], [], [], [], [], [], []]  
                   aten::_convolution         0.00%      33.000us        38.19%     449.061ms     112.265ms             4  [[1000, 256, 7, 7], [256, 256, 3, 3], [], [], [], [], [], [], [], [], [], [], []  
    aten::_nnpack_spatial_convolution        38.18%     448.984ms        38.19%     449.028ms     112.257ms             4                                 [[1000, 256, 7, 7], [256, 256, 3, 3], [], [], []]  
                         aten::conv2d         0.00%       6.000us         9.21%     108.300ms      54.150ms             2                     [[1, 256, 200, 200], [256, 256, 3, 3], [256], [], [], [], []]  
                    aten::convolution         0.00%      15.000us         9.21%     108.294ms      54.147ms             2             [[1, 256, 200, 200], [256, 256, 3, 3], [256], [], [], [], [], [], []]  
                   aten::_convolution         0.00%      13.000us         9.21%     108.279ms      54.139ms             2  [[1, 256, 200, 200], [256, 256, 3, 3], [256], [], [], [], [], [], [], [], [], []  
                    aten::thnn_conv2d         0.00%       4.000us         9.21%     108.266ms      54.133ms             2                         [[1, 256, 200, 200], [256, 256, 3, 3], [], [256], [], []]  
           aten::_slow_conv2d_forward         8.96%     105.342ms         9.21%     108.262ms      54.131ms             2                         [[1, 256, 200, 200], [256, 256, 3, 3], [], [256], [], []]  
-------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------  
Self CPU time total: 1.176s

MPS profile results

--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------  
                            Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls                                                                      Input Shapes  
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------  
                 model_inference        -0.00%   -2260.000us       100.00%       74.706s       74.706s             1                                                                                []  
                     aten::where        -0.00%     -17.000us        99.25%       74.143s       18.536s             4                                                                          [[1000]]  
             aten::nonzero_numpy         0.00%      38.000us        99.25%       74.143s       18.536s             4                                                                          [[1000]]  
                   aten::nonzero        99.24%       74.138s        99.25%       74.143s       18.536s             4                                                                          [[1000]]  
                       aten::cat         0.06%      41.956ms         0.06%      41.956ms       1.998ms            21                                                                          [[], []]  
                       aten::sub         0.05%      34.111ms         0.05%      34.111ms      34.111ms             1                                                    [[3, 224, 224], [3, 1, 1], []]  
                        aten::to        -0.00%     -88.000us         0.04%      32.683ms       1.421ms            23                                                          [[], [], [], [], [], []]  
                  aten::_to_copy         0.00%     676.000us         0.04%      32.681ms       2.723ms            12                                                      [[], [], [], [], [], [], []]  
                     aten::copy_         0.04%      32.618ms         0.04%      32.645ms       2.720ms            12                                                                      [[], [], []]  
                     aten::stack         0.00%      64.000us         0.03%      23.270ms       2.327ms            10                                                                          [[], []]  
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------  
Self CPU time total: 74.706s
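
A note on reading these tables: `aten::where` sits directly above `aten::nonzero` because the one-argument form `torch.where(condition)` is documented to be equivalent to `torch.nonzero(condition, as_tuple=True)`, so the detection model's boolean-mask indexing lands on the slow nonzero path. A quick illustration (mine, not from the thread):

```python
import torch

mask = torch.tensor([True, False, True, True])

# One-argument torch.where dispatches to nonzero, which is why the
# profiler shows aten::nonzero nested under aten::where.
a = torch.where(mask)
b = torch.nonzero(mask, as_tuple=True)

print([t.tolist() for t in a])  # indices of the True entries
assert all(torch.equal(x, y) for x, y in zip(a, b))
```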

P.S. I think running the MPS repro above may have hard-crashed my laptop (it happened while I was writing this issue the first time), but I don't have access to another machine to confirm that this isn't a problem with my machine.

Versions

Collecting environment information...
PyTorch version: 2.2.2
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 14.4.1 (arm64)
GCC version: Could not collect
Clang version: 15.0.0 (clang-1500.3.9.4)
CMake version: Could not collect
Libc version: N/A

Python version: 3.12.2 | packaged by Anaconda, Inc. | (main, Feb 27 2024, 12:57:28) [Clang 14.0.6 ] (64-bit runtime)
Python platform: macOS-14.4.1-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Apple M3 Pro

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.2.2
[pip3] torchvision==0.17.2
[conda] numpy 1.26.4 py312h7f4fdc5_0
[conda] numpy-base 1.26.4 py312he047099_0
[conda] pytorch 2.2.2 py3.12_0 pytorch
[conda] torchvision 0.17.2 py312_cpu pytorch

cc @ezyang @gchanan @zou3519 @kadeng @kulinseth @albanD @malfet @DenisVieriu97 @jhavukainen

@malfet malfet added the module: performance, triaged and module: mps labels Apr 24, 2024
@malfet malfet self-assigned this Apr 24, 2024

malfet commented Apr 24, 2024

Assigning to myself to get a quick repro...

@malfet malfet added the module: crash and high priority labels and removed the triaged label Apr 24, 2024

malfet commented Apr 24, 2024

Marking hi-pri: as of 2.3.0, it just crashes the MPS runtime:

% python bug-124850.py
STAGE:2024-04-24 11:27:36 16333:8699272 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
Error: command buffer exited with error status.
	The Metal Performance Shaders operations encoded on it may not have completed.
	Error: 
	(null)
	Internal Error (0000000e:Internal Error)
	<AGXG14XFamilyCommandBuffer: 0x16c6a8a80>
    label = <none> 
    device = <AGXG14SDevice: 0x162850a00>
        name = Apple M2 Pro 
    commandQueue = <AGXG14XFamilyCommandQueue: 0x167220000>
        label = <none> 
        device = <AGXG14SDevice: 0x162850a00>
            name = Apple M2 Pro 
    retainedReferences = 1
STAGE:2024-04-24 11:27:38 16333:8699272 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-04-24 11:27:38 16333:8699272 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------  
                            Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls                                                                      Input Shapes  
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------  
                 model_inference         0.24%       3.440ms       100.00%        1.437s        1.437s             1                                                                                []  
                     aten::where         0.00%       4.000us        51.07%     734.106ms     183.526ms             4                                                                          [[1000]]  
             aten::nonzero_numpy         0.00%      32.000us        51.07%     734.102ms     183.525ms             4                                                                          [[1000]]  
                   aten::nonzero        50.60%     727.331ms        51.06%     734.028ms     183.507ms             4                                                                          [[1000]]  
                       aten::cat         3.51%      50.495ms         3.51%      50.495ms       2.405ms            21                                                                          [[], []]  
                     aten::clamp         3.50%      50.339ms         3.50%      50.339ms      50.339ms             1                                                                  [[1000], [], []]  
                       aten::sub         3.45%      49.571ms         3.45%      49.571ms      49.571ms             1                                                    [[3, 224, 224], [3, 1, 1], []]  
                     aten::stack         0.00%      62.000us         1.64%      23.551ms       2.355ms            10                                                                          [[], []]  
                        aten::to        -0.01%    -148.000us         1.39%      19.996ms     952.190us            21                                                          [[], [], [], [], [], []]  
                  aten::_to_copy         0.01%     179.000us         1.39%      19.988ms       1.999ms            10                                                      [[], [], [], [], [], [], []]  
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------  
Self CPU time total: 1.437s

Now I wonder whether this is a regression or not...

@malfet malfet added this to the 2.3.1 milestone Apr 24, 2024
theo-costain-arondite (Author) commented

@malfet If it helps, I installed a new env on my machine and tested against 2.1, and got the same results.

PyTorch version: 2.1.0
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 14.4.1 (arm64)
GCC version: Could not collect
Clang version: 15.0.0 (clang-1500.3.9.4)
CMake version: Could not collect
Libc version: N/A

Python version: 3.11.9 (main, Apr 19 2024, 11:43:47) [Clang 14.0.6 ] (64-bit runtime)
Python platform: macOS-14.4.1-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Apple M3 Pro

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.1.0
[pip3] torchvision==0.16.0
[conda] numpy                     1.26.4          py311he598dae_0  
[conda] numpy-base                1.26.4          py311hfbfe69c_0  
[conda] pytorch                   2.1.0                  py3.11_0    pytorch
[conda] torchvision               0.16.0                py311_cpu    pytorch
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------  
                            Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls                                                                      Input Shapes  
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------  
                 model_inference        -0.02%  -12725.000us       100.00%       74.826s       74.826s             1                                                                                []  
                     aten::where         0.00%       2.925ms        99.06%       74.126s       18.531s             4                                                                          [[1000]]  
             aten::nonzero_numpy         0.00%      20.000us        99.06%       74.126s       18.531s             4                                                                          [[1000]]  
                   aten::nonzero        99.06%       74.121s        99.06%       74.126s       18.531s             4                                                                          [[1000]]  
                torchvision::nms         0.12%      86.786ms         0.13%      97.168ms      97.168ms             1                                                           [[4507, 4], [4507], []]  
                       aten::cat         0.05%      39.946ms         0.05%      39.946ms       1.902ms            21                                                                          [[], []]  
                     aten::index         0.04%      31.605ms         0.05%      37.235ms      18.617ms             2                                                                 [[1, 159882], []]  
                       aten::sub         0.05%      36.937ms         0.05%      36.937ms      36.937ms             1                                                    [[3, 224, 224], [3, 1, 1], []]  
                  aten::_to_copy         0.00%     627.000us         0.04%      30.244ms       1.680ms            18                                                      [[], [], [], [], [], [], []]  
                        aten::to         0.00%      93.000us         0.04%      30.234ms       1.315ms            23                                                          [[], [], [], [], [], []]  
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------  
Self CPU time total: 74.826s

@cpuhrsch cpuhrsch added the triaged label and removed the triage review label Apr 29, 2024
kulinseth (Collaborator) commented

nonzero is indeed written in an inefficient way...

kulinseth (Collaborator) commented

There is a change for nonzero which should help:
main...skotapati:pytorch-sidk:dev/skotapati/nonzero_mps
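
Until a fix lands, one user-side workaround (my own sketch, not code from that branch; `nonzero_via_cpu` is a hypothetical helper name) is to route just this op through the CPU, which is also what the eventual fix does on older OSes:

```python
import torch

def nonzero_via_cpu(t: torch.Tensor) -> torch.Tensor:
    # Run nonzero on the CPU, then move the (usually small) index
    # tensor back to the input's original device, e.g. "mps".
    return t.cpu().nonzero().to(t.device)

x = torch.tensor([0.0, 3.0, 0.0, 5.0])
print(nonzero_via_cpu(x).tolist())  # [[1], [3]]
```

The round trip forces a device sync, so it adds per-call overhead, but that is far cheaper than the multi-second `aten::nonzero` calls in the traces above.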

pytorchbot pushed a commit that referenced this issue May 13, 2024
Fixes #124850

Replace the previous MPSGraph nonzero construction with the native nonzero op. For older OSes, fall back to the CPU (the previous implementation was not reliable and was comparable to the CPU in speed).

Pull Request resolved: #125355
Approved by: https://github.com/kulinseth

(cherry picked from commit a40d6df)
huydhn pushed a commit that referenced this issue May 13, 2024
[MPS] Native nonzero implementation (#125355)

Fixes #124850 (cherry picked from commit a40d6df)

Co-authored-by: Denis Vieriu <dvieriu@apple.com>

huydhn commented May 30, 2024

From what I see, the crashing issue is still there in the upcoming 2.3.1 release on my M1 laptop. Reopening this issue for now, and I'll double-check with @malfet.

STAGE:2024-05-29 18:51:13 53295:3191016 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
Error: command buffer exited with error status.
	The Metal Performance Shaders operations encoded on it may not have completed.
	Error:
	(null)
	Internal Error (0000000e:Internal Error)
	<AGXG13XFamilyCommandBuffer: 0x36ae0d100>
    label = <none>
    device = <AGXG13XDevice: 0x1434d4a00>
        name = Apple M1 Pro
    commandQueue = <AGXG13XFamilyCommandQueue: 0x16d02d800>
        label = <none>
        device = <AGXG13XDevice: 0x1434d4a00>
            name = Apple M1 Pro
    retainedReferences = 1
STAGE:2024-05-29 18:51:16 53295:3191016 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-05-29 18:51:16 53295:3191016 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------
                            Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls                                                                      Input Shapes
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------
                 model_inference        -1.77%  -46315.000us       100.00%        2.612s        2.612s             1                                                                                []
                     aten::where         2.16%      56.389ms        66.03%        1.725s     431.137ms             4                                                                          [[1000]]
             aten::nonzero_numpy         0.00%      41.000us        66.03%        1.725s     431.136ms             4                                                                          [[1000]]
                   aten::nonzero        65.76%        1.718s        66.02%        1.724s     431.114ms             4                                                                          [[1000]]
                       aten::cat         2.96%      77.404ms         2.96%      77.404ms       3.686ms            21                                                                          [[], []]
                       aten::sub         2.24%      58.528ms         2.24%      58.528ms      58.528ms             1                                                    [[3, 224, 224], [3, 1, 1], []]
                     aten::stack         0.00%     103.000us         1.25%      32.740ms       3.274ms            10                                                                          [[], []]
                    aten::arange         1.00%      26.230ms         1.01%      26.254ms       2.188ms            12                                                                 [[], [], [], [0]]
                     aten::where         0.00%       6.000us         0.94%      24.461ms      12.230ms             2                                                                          [[4507]]
             aten::nonzero_numpy         0.00%      13.000us         0.94%      24.455ms      12.227ms             2                                                                          [[4507]]
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------
Self CPU time total: 2.612s

@huydhn huydhn reopened this May 30, 2024
@huydhn huydhn modified the milestones: 2.3.1, 2.4.0 May 31, 2024