
Run semi-structured sparse benchmarks on consumer hardware #174

Open
jcaip opened this issue Apr 25, 2024 · 12 comments
jcaip (Contributor) commented Apr 25, 2024

2:4 sparsity is only supported on Ampere and newer GPUs. So far we've only run benchmarks on A100s, but Phil (@philipbutler) has access to consumer GPUs that could take advantage of sparse acceleration as well.

Steps to get numbers:

  1. install pytorch pip nightlies from here
  2. verify that your consumer GPU supports semi-structured sparsity
import torch
from torch.sparse import to_sparse_semi_structured
to_sparse_semi_structured(torch.ones(256, 256).half().cuda())
  3. Clone pytorch and get the benchmark script:
  4. Run benchmarks. For now, let's see if the nvidia-fixed-mn / nvidia-fixed-k benchmarks still show speedups.
python benchmarks/sparse/benchmark_semi_structured_sparsity.py  --mode nvidia-fixed-k --dtype bfloat16 --backend cutlass
python benchmarks/sparse/benchmark_semi_structured_sparsity.py  --mode nvidia-fixed-mn --dtype bfloat16 --backend cutlass
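Before running the steps above, it may help to sanity-check the hardware requirement directly: 2:4 sparsity needs Ampere (compute capability 8.0) or newer. A minimal, torch-free sketch - `supports_semi_structured` is a hypothetical helper, not a torch API; on a real machine the `(major, minor)` tuple would come from `torch.cuda.get_device_capability()`:

```python
# Minimal sketch of the Ampere+ requirement stated above.
# supports_semi_structured is a hypothetical helper, not part of torch;
# on a CUDA machine the (major, minor) tuple would come from
# torch.cuda.get_device_capability().
def supports_semi_structured(capability):
    """True if a (major, minor) compute capability can use 2:4 sparsity."""
    return capability >= (8, 0)

print(supports_semi_structured((8, 0)))  # A100 (sm_80): True
print(supports_semi_structured((8, 9)))  # 4070 Ti Super (sm_89): True
print(supports_semi_structured((7, 5)))  # Turing (sm_75): False
```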

Afterwards, it would be great to get benchmarks for the ViT-B shapes found here: https://github.com/pytorch/ao/blob/main/benchmarks/sam_vit_b_shapes.csv
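One way to drive the ViT-B benchmarks would be to read `(m, n, k)` triples out of that CSV. A hedged sketch - the column names and the sample rows below are assumptions for illustration, not the real file's contents:

```python
import csv
import io

# Hedged sketch: parse (m, n, k) shape triples from a CSV like the linked
# sam_vit_b_shapes.csv. The column names and sample rows here are
# assumptions for illustration, not the real file's contents.
def load_shapes(text):
    return [tuple(int(row[c]) for c in ("m", "n", "k"))
            for row in csv.DictReader(io.StringIO(text))]

sample = "m,n,k\n32768,768,3072\n32768,2304,768\n"
for m, n, k in load_shapes(sample):
    # each triple would be passed to the benchmark script as one GEMM shape
    print(m, n, k)
```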

philipbutler commented Apr 25, 2024

I had to set up this PC from scratch with a clean Python install, and noticed that neither pandas nor tqdm is in requirements.txt.

philipbutler commented Apr 25, 2024

The benchmark commands above should use --dtype bf16, not --dtype bfloat16.

philipbutler commented Apr 25, 2024

Ran into RuntimeError: sparse_semi_structured_mad_op : CUTLASS not supported

Consider adding "install CUDA 12.1" and the CUTLASS Quickstart to the steps.
Running through it now!

(I'm confused rn)

@philipbutler

Actually, @jcaip, does it make sense that to_sparse_semi_structured(torch.ones(256, 256).half().cuda()) works, but running the first benchmark script shows RuntimeError: sparse_semi_structured_mad_op : CUTLASS not supported ?
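One detail worth noting while debugging this: converting a tensor with to_sparse_semi_structured only compresses it into the 2:4 layout, while the CUTLASS kernel named in the error is exercised later, by the actual sparse matmul - so a successful conversion doesn't prove the kernel path works. For readers unfamiliar with the pattern, here is a torch-free sketch of what "2:4" means (at most 2 nonzeros in every contiguous group of 4 elements along a row); the checker is illustrative, not a torch API:

```python
# Hedged sketch: the 2:4 ("semi-structured") constraint says that in every
# contiguous group of 4 elements along a row, at most 2 may be nonzero.
# is_24_sparse is an illustrative helper, not part of torch.
def is_24_sparse(row):
    return all(sum(1 for x in row[i:i + 4] if x != 0) <= 2
               for i in range(0, len(row), 4))

print(is_24_sparse([1, 0, 2, 0, 0, 3, 0, 4]))  # True: 2 nonzeros per group
print(is_24_sparse([1, 1, 1, 0]))              # False: 3 nonzeros in a group
```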

jcaip (Contributor, Author) commented Apr 26, 2024

That's strange to me, @philipbutler - let me think about it for a bit.

Can you open PowerShell, run nvidia-smi, and screenshot the results?

@philipbutler

@jcaip
[screenshot of nvidia-smi output]

jcaip (Contributor, Author) commented Apr 26, 2024

@philipbutler as a sanity check - can you run using the 2.3 release instead of the nightlies?

I think this might be an issue with Windows, but I'm not sure.
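Since the platform question keeps coming up, a small environment-report helper could make these reports uniform across machines. A sketch - `env_report` is a hypothetical helper, not part of any project script; it only touches torch if torch is importable, so it runs anywhere:

```python
import platform
import sys

# Hedged sketch: gather the environment details useful for triaging
# platform-specific failures like the one above. env_report is a
# hypothetical helper; the torch fields are filled in only when torch
# is importable.
def env_report():
    info = {"python": sys.version.split()[0], "os": platform.system()}
    try:
        import torch
        info["torch"] = torch.__version__
        info["cuda"] = torch.version.cuda
        info["cuda_available"] = torch.cuda.is_available()
    except ImportError:
        info["torch"] = None
    return info

print(env_report())
```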

philipbutler commented Apr 26, 2024

@jcaip Just to make this as easy as possible for future benchmarking, step 2 should say:

import torch
from torch.sparse import to_sparse_semi_structured
to_sparse_semi_structured(torch.ones(256, 256).half().cuda())

@philipbutler

> @philipbutler as a sanity check - can you run using the 2.3 release instead of the nightlies?
>
> I think this might be an issue with Windows, but I'm not sure.

@jcaip Same error with the 2.3 release

gau-nernst (Collaborator) commented Apr 27, 2024

4070 Ti Super, running Ubuntu 22.04.
torch==2.4.0.dev20240426+cu121
bfloat16, cutlass

Fixed k

|    | m     | n     | k     | sparse_latency (ms) | dense_latency (ms) | speedup (d/s) |
|----|-------|-------|-------|---------------------|--------------------|---------------|
| 0  | 3072  | 3072  | 10240 | 1.10574             | 2.131              | 1.92722       |
| 1  | 4096  | 4096  | 10240 | 1.9605              | 3.73044            | 1.9028        |
| 2  | 5120  | 5120  | 10240 | 3.12083             | 6.10269            | 1.95547       |
| 3  | 6144  | 6144  | 10240 | 4.74411             | 8.79509            | 1.8539        |
| 4  | 7168  | 7168  | 10240 | 7.29741             | 11.9486            | 1.63738       |
| 5  | 8192  | 8192  | 10240 | 10.6073             | 15.4296            | 1.45462       |
| 6  | 9216  | 9216  | 10240 | 13.6835             | 19.1741            | 1.40125       |
| 7  | 10240 | 10240 | 10240 | 16.8367             | 23.4461            | 1.39256       |
| 8  | 11264 | 11264 | 10240 | 20.37               | 28.2801            | 1.38832       |
| 9  | 12288 | 12288 | 10240 | 24.1402             | 33.545             | 1.38959       |
| 10 | 13312 | 13312 | 10240 | 28.4292             | 39.2493            | 1.3806        |
| 11 | 14336 | 14336 | 10240 | 32.851              | 45.5614            | 1.38691       |
| 12 | 15360 | 15360 | 10240 | 37.7906             | 54.6426            | 1.44593       |
| 13 | 16384 | 16384 | 10240 | 42.789              | 63.5041            | 1.48412       |
| 14 | 17408 | 17408 | 10240 | 48.5377             | 69.684             | 1.43567       |
| 15 | 18432 | 18432 | 10240 | 54.2561             | 77.7116            | 1.43231       |
| 16 | 19456 | 19456 | 10240 | 60.3411             | 85.183             | 1.41169       |
| 17 | 20480 | 20480 | 10240 | 66.7151             | 97.5466            | 1.46214       |

Fixed mn

|    | m     | n     | k     | sparse_latency (ms) | dense_latency (ms) | speedup (d/s) |
|----|-------|-------|-------|---------------------|--------------------|---------------|
| 0  | 10240 | 10240 | 2560  | 3.12135             | 6.23817            | 1.99855       |
| 1  | 10240 | 10240 | 3840  | 4.59394             | 9.28166            | 2.02041       |
| 2  | 10240 | 10240 | 5120  | 7.15086             | 12.251             | 1.71322       |
| 3  | 10240 | 10240 | 6400  | 10.5324             | 14.7059            | 1.39625       |
| 4  | 10240 | 10240 | 7680  | 13.0499             | 18.0573            | 1.38372       |
| 5  | 10240 | 10240 | 8960  | 15.3995             | 20.6897            | 1.34353       |
| 6  | 10240 | 10240 | 10240 | 16.8406             | 23.4697            | 1.39364       |
| 7  | 10240 | 10240 | 11520 | 19.2673             | 26.2984            | 1.36493       |
| 8  | 10240 | 10240 | 12800 | 20.9322             | 29.0503            | 1.38782       |
| 9  | 10240 | 10240 | 14080 | 23.14               | 31.9612            | 1.38121       |
| 10 | 10240 | 10240 | 15360 | 25.6844             | 34.6865            | 1.35049       |
| 11 | 10240 | 10240 | 16640 | 26.2421             | 37.4893            | 1.42859       |
| 12 | 10240 | 10240 | 17920 | 30.1967             | 40.3297            | 1.33556       |
| 13 | 10240 | 10240 | 19200 | 32.4673             | 43.1666            | 1.32954       |
| 14 | 10240 | 10240 | 20480 | 33.5382             | 46.002             | 1.37163       |

SAM ViT-B shapes

|   | m     | n    | k    | sparse_latency (ms) | dense_latency (ms) | speedup (d/s) |
|---|-------|------|------|---------------------|--------------------|---------------|
| 0 | 32768 | 768  | 3072 | 1.22253             | 1.7901             | 1.46426       |
| 1 | 32768 | 2304 | 768  | 0.787232            | 1.33425            | 1.69486       |
| 2 | 32768 | 3072 | 768  | 1.04701             | 1.74003            | 1.66191       |
| 3 | 32768 | 768  | 768  | 0.271155            | 0.437884           | 1.61488       |
| 4 | 39200 | 2304 | 768  | 0.948154            | 1.5765             | 1.66271       |
| 5 | 39200 | 768  | 768  | 0.324627            | 0.510302           | 1.57196       |

I omitted some redundant columns from the saved CSV file; the correct and contiguous columns are all True.

@msaroufim (Member)

Nice work @gau-nernst - pretty cool to see results that are uniformly faster.
@philipbutler I would highly recommend using WSL or dual booting (I personally dual boot); getting Windows and CUDA to work together is just not worth it.

jcaip (Contributor, Author) commented Apr 30, 2024

@gau-nernst 💯 Thanks for running these - that's awesome! For others reading: I'd like to collect these, along with our A100 results, somewhere. So please contribute and I'll collate everything into a nice doc. We could also collect block-sparse microbenchmarks; I know @cpuhrsch is interested in those.

@philipbutler Thank you for giving it a shot - your edits were super helpful too :) . Yeah, I agree with Mark that dual-booting Linux is probably the easiest solution. But could you open an issue in pytorch for tracking purposes (feel free to tag me) about the lack of Windows support for semi-structured sparsity?
