
Performance with MPS on AMD GPUs is worse than CPU #78210

Open
lucadiliello opened this issue May 24, 2022 · 8 comments
Labels
module: mps Related to Apple Metal Performance Shaders framework module: performance Issues related to performance, either of kernel code or framework glue triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments


lucadiliello commented May 24, 2022

🐛 Describe the bug

I tried running some experiments on the RX 5300M 4 GB GPU and everything seems to work correctly. The problem is that performance is worse than on the CPU of the same Mac.

To reproduce, just clone the tests in this repo https://github.com/lucadiliello/pytorch-apple-silicon-benchmarks and run either

python tests/transformers_sequence_classification.py --device mps --pre_trained_name bert-base-cased --mode inference --steps 100 --sequence_length 128 --batch_size 16

or

python tests/transformers_sequence_classification.py --device cpu --pre_trained_name bert-base-cased --mode inference --steps 100 --sequence_length 128 --batch_size 16

While the CPU took 143s, with the MPS backend the test completed in 228s. I'm sure the GPU was being used, because I constantly monitored the usage with Activity Monitor.
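For scale, those wall times work out to the following per-sequence throughput (plain arithmetic over the numbers above):

```python
# Throughput implied by the reported timings:
# 100 steps x batch size 16 = 1600 sequences total.
steps, batch_size = 100, 16
sequences = steps * batch_size

cpu_seconds, mps_seconds = 143.0, 228.0
cpu_throughput = sequences / cpu_seconds  # ~11.2 sequences/s
mps_throughput = sequences / mps_seconds  # ~7.0 sequences/s
print(f"cpu: {cpu_throughput:.1f} seq/s, mps: {mps_throughput:.1f} seq/s")
```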

Versions

PyTorch version: 1.13.0.dev20220524
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 12.3.1 (x86_64)
GCC version: Could not collect
Clang version: 13.1.6 (clang-1316.0.21.2.5)
CMake version: Could not collect
Libc version: N/A

Python version: 3.8.12 (default, Oct 12 2021, 06:23:56) [Clang 10.0.0 ] (64-bit runtime)
Python platform: macOS-10.16-x86_64-i386-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.22.4
[pip3] torch==1.13.0.dev20220524
[pip3] torchaudio==0.12.0.dev20220524
[pip3] torchvision==0.13.0.dev20220524
[conda] numpy 1.22.4 pypi_0 pypi
[conda] torch 1.13.0.dev20220524 pypi_0 pypi
[conda] torchaudio 0.12.0.dev20220524 pypi_0 pypi
[conda] torchvision 0.13.0.dev20220524 pypi_0 pypi

cc @VitalyFedyunin @ngimel

@malfet added the module: performance, module: mps, and triaged labels on May 24, 2022

dbl001 commented May 25, 2022

I ran your tests on an Intel iMac 27" 2020 with a 3.8 GHz 8-Core Intel Core i7 and an AMD Radeon Pro 5700 XT 16 GB.
Activity Monitor showed heavy GPU usage during the 'mps' test.

% python tests/transformers_sequence_classification.py --device mps --pre_trained_name bert-base-cased --mode inference --steps 100 --sequence_length 128 --batch_size 16
Downloading: 100%|██████████| 570/570 [00:00<00:00, 209kB/s]
Downloading: 100%|██████████| 213k/213k [00:00<00:00, 997kB/s]
Downloading: 100%|██████████| 436k/436k [00:00<00:00, 1.47MB/s]
Downloading: 100%|██████████| 29.0/29.0 [00:00<00:00, 26.6kB/s]
Downloading: 100%|██████████| 436M/436M [00:40<00:00, 10.7MB/s]
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
INFO:root:Input tensors size:
INFO:root: * input_ids: torch.Size([16, 128])
INFO:root: * attention_mask: torch.Size([16, 128])
INFO:root: * labels: torch.Size([16])
Testing...: 100%|██████████| 100/100 [01:40<00:00,  1.00s/it]
INFO:root:Model bert-base-cased took 100.02 seconds to do 100 steps in inference with batch size 16 on mps.
(base) davidlaxer@x86_64-apple-darwin13 pytorch-apple-silicon-benchmarks % python tests/transformers_sequence_classification.py --device cpu --pre_trained_name bert-base-cased --mode inference --steps 100 --sequence_length 128 --batch_size 16
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
INFO:root:Input tensors size:
INFO:root: * input_ids: torch.Size([16, 128])
INFO:root: * attention_mask: torch.Size([16, 128])
INFO:root: * labels: torch.Size([16])
Testing...: 100%|██████████| 100/100 [01:38<00:00,  1.02it/s]
INFO:root:Model bert-base-cased took 98.11 seconds to do 100 steps in inference with batch size 16 on cpu.
 % python tests/transformers_sequence_classification.py --device mps --pre_trained_name bert-base-cased --mode inference --steps 100 --sequence_length 128 --batch_size 8
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
INFO:root:Input tensors size:
INFO:root: * input_ids: torch.Size([8, 128])
INFO:root: * attention_mask: torch.Size([8, 128])
INFO:root: * labels: torch.Size([8])
Testing...: 100%|██████████| 100/100 [00:51<00:00,  1.94it/s]
INFO:root:Model bert-base-cased took 51.48 seconds to do 100 steps in inference with batch size 8 on mps.
(base) davidlaxer@x86_64-apple-darwin13 pytorch-apple-silicon-benchmarks % 


xsacha commented May 25, 2022

My guess is that the 5300M is just a lot slower than the 5700 XT. In this particular case, the CPU might simply be faster than such a low-end GPU.
That doesn't mean the GPU is useless, since it still offloads work from the CPU.

@lucadiliello (Author)

My guess is that the 5300M is just a lot slower than the 5700 XT. In this particular case, the CPU might simply be faster than such a low-end GPU.
That doesn't mean the GPU is useless, since it still offloads work from the CPU.

That may make sense for the 5300M, but I do not see why the 5700 XT 16 GB is only matching the iMac's CPU.


xsacha commented May 26, 2022

Oh, I see.


dbl001 commented May 26, 2022

batch_size affects performance dramatically...

% python tests/transformers_sequence_classification.py --device mps --pre_trained_name bert-base-cased --mode inference --steps 100 --sequence_length 128 --batch_size 4
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
INFO:root:Input tensors size:
INFO:root: * input_ids: torch.Size([4, 128])
INFO:root: * attention_mask: torch.Size([4, 128])
INFO:root: * labels: torch.Size([4])
Testing...: 100%|██████████| 100/100 [00:28<00:00,  3.51it/s]
INFO:root:Model bert-base-cased took 28.49 seconds to do 100 steps in inference with batch size 4 on mps.
(base) davidlaxer@x86_64-apple-darwin13 pytorch-apple-silicon-benchmarks % python tests/transformers_sequence_classification.py --device mps --pre_trained_name bert-base-cased --mode inference --steps 100 --sequence_length 128 --batch_size 2
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
INFO:root:Input tensors size:
INFO:root: * input_ids: torch.Size([2, 128])
INFO:root: * attention_mask: torch.Size([2, 128])
INFO:root: * labels: torch.Size([2])
Testing...: 100%|██████████| 100/100 [00:16<00:00,  6.11it/s]
INFO:root:Model bert-base-cased took 16.39 seconds to do 100 steps in inference with batch size 2 on mps.
(base) davidlaxer@x86_64-apple-darwin13 pytorch-apple-silicon-benchmarks % python tests/transformers_sequence_classification.py --device mps --pre_trained_name bert-base-cased --mode inference --steps 100 --sequence_length 128 --batch_size 1
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
INFO:root:Input tensors size:
INFO:root: * input_ids: torch.Size([1, 128])
INFO:root: * attention_mask: torch.Size([1, 128])
INFO:root: * labels: torch.Size([1])
Testing...: 100%|██████████| 100/100 [00:09<00:00, 10.37it/s]
INFO:root:Model bert-base-cased took 9.67 seconds to do 100 steps in inference with batch size 1 on mps.
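Reading the MPS step times above against batch size (just arithmetic on the reported numbers) shows the time for 100 steps growing almost linearly with the batch size, i.e. batching buys almost nothing on this backend:

```python
# MPS wall times for 100 steps at each batch size, from the logs above.
mps_seconds = {1: 9.67, 2: 16.39, 4: 28.49, 8: 51.48, 16: 100.02}

baseline = mps_seconds[1]
for bs, secs in mps_seconds.items():
    # With effective batching the ratio would grow much slower than the
    # batch size; here it roughly tracks it (1.00, 1.69, 2.95, 5.32, 10.34).
    print(f"batch {bs:2d}: {secs:6.2f}s -> {secs / baseline:.2f}x the bs=1 time")
```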


xsacha commented May 26, 2022

The time almost scales linearly with batch size, almost as if it were processing each batch one sample at a time.

Edit: is the test just running batch × 100?


dbl001 commented May 26, 2022

CPU

 % python tests/transformers_sequence_classification.py --device cpu --pre_trained_name bert-base-cased --mode inference --steps 100 --sequence_length 128 --batch_size 1
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
INFO:root:Input tensors size:
INFO:root: * input_ids: torch.Size([1, 128])
INFO:root: * attention_mask: torch.Size([1, 128])
INFO:root: * labels: torch.Size([1])
Testing...: 100%|██████████| 100/100 [00:05<00:00, 16.86it/s]
INFO:root:Model bert-base-cased took 5.95 seconds to do 100 steps in inference with batch size 1 on cpu.
 % python tests/transformers_sequence_classification.py --device cpu --pre_trained_name bert-base-cased --mode inference --steps 100 --sequence_length 128 --batch_size 2
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
INFO:root:Input tensors size:
INFO:root: * input_ids: torch.Size([2, 128])
INFO:root: * attention_mask: torch.Size([2, 128])
INFO:root: * labels: torch.Size([2])
Testing...: 100%|██████████| 100/100 [00:10<00:00,  9.12it/s]
INFO:root:Model bert-base-cased took 10.98 seconds to do 100 steps in inference with batch size 2 on cpu.
(base) davidlaxer@x86_64-apple-darwin13 pytorch-apple-silicon-benchmarks % 
(base) davidlaxer@x86_64-apple-darwin13 pytorch-apple-silicon-benchmarks % python tests/transformers_sequence_classification.py --device cpu --pre_trained_name bert-base-cased --mode inference --steps 100 --sequence_length 128 --batch_size 4
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
INFO:root:Input tensors size:
INFO:root: * input_ids: torch.Size([4, 128])
INFO:root: * attention_mask: torch.Size([4, 128])
INFO:root: * labels: torch.Size([4])
Testing...: 100%|██████████| 100/100 [00:22<00:00,  4.46it/s]
INFO:root:Model bert-base-cased took 22.45 seconds to do 100 steps in inference with batch size 4 on cpu.
(base) davidlaxer@x86_64-apple-darwin13 pytorch-apple-silicon-benchmarks % python tests/transformers_sequence_classification.py --device cpu --pre_trained_name bert-base-cased --mode inference --steps 100 --sequence_length 128 --batch_size 8
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
INFO:root:Input tensors size:
INFO:root: * input_ids: torch.Size([8, 128])
INFO:root: * attention_mask: torch.Size([8, 128])
INFO:root: * labels: torch.Size([8])
Testing...: 100%|██████████| 100/100 [00:43<00:00,  2.31it/s]
INFO:root:Model bert-base-cased took 43.39 seconds to do 100 steps in inference with batch size 8 on cpu.
(base) davidlaxer@x86_64-apple-darwin13 pytorch-apple-silicon-benchmarks % 
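Putting the CPU and MPS runs from this machine side by side (the batch-16 numbers come from the earlier comment; plain arithmetic otherwise): the CPU wins at every batch size, though the gap narrows as the batch grows:

```python
# 100-step wall times (seconds) reported in this thread for the 5700 XT iMac.
cpu_seconds = {1: 5.95, 2: 10.98, 4: 22.45, 8: 43.39, 16: 98.11}
mps_seconds = {1: 9.67, 2: 16.39, 4: 28.49, 8: 51.48, 16: 100.02}

for bs in cpu_seconds:
    # Ratio > 1.0 means MPS was slower than the CPU at this batch size.
    ratio = mps_seconds[bs] / cpu_seconds[bs]
    print(f"batch {bs:2d}: mps/cpu = {ratio:.2f}x")
```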

@james-brown-upfeat

Did this ever get any more attention?

I have a Mac with a 2.9 GHz 6-Core Intel Core i9 CPU and a Radeon Pro 560X 4 GB GPU. When I run google/pix2struct-ocrvqa-base on a random image, it takes 38.959s on the CPU, but it seems to run forever on the GPU.

The model is roughly 1 GB, so I think it should theoretically fit fully in VRAM.

I did confirm GPU usage sits near 100% the whole time for the 560X using the Activity Monitor GPU view.

I did not force the Mac to use the Intel UHD Graphics 630 1536 MB for my monitors, so the mouse got laggy while the model was running; maybe the window manager used up some VRAM?

Here's some sample code:

import torch
from PIL import Image
from transformers import Pix2StructProcessor, Pix2StructForConditionalGeneration

question = "What is shown in the image?"  # placeholder question
image = Image.open("example.png")         # placeholder input image
model_name = "google/pix2struct-ocrvqa-base"

processor = Pix2StructProcessor.from_pretrained(model_name)
inputs = processor(
    images=image,
    text=question,
    return_tensors="pt",
)
model = Pix2StructForConditionalGeneration.from_pretrained(model_name)

# Move the model and inputs to the MPS device when it is available
if torch.backends.mps.is_available() and torch.backends.mps.is_built():
    model = model.to("mps")
    inputs = inputs.to("mps")

predictions = model.generate(
    **inputs,
    max_length=256,
)
result = processor.decode(predictions[0])
print(result)

Also I'm very new at this stuff, so I wouldn't be surprised if I'm missing some significant settings I should be using. I just did what https://huggingface.co/google/pix2struct-ai2d-base said to do (the ocrvqa model says to follow those instructions).

I don't necessarily need a solution, I just want to provide another data point. Thanks!
