
Run semi-structured sparse benchmarks on consumer hardware #174

Open
jcaip opened this issue Apr 25, 2024 · 12 comments
jcaip (Contributor) commented Apr 25, 2024

2:4 sparsity is only supported on Ampere and newer GPUs. So far we've only run benchmarks on A100s, but Phil (@philipbutler) has access to consumer GPUs that could take advantage of sparse acceleration as well.

Steps to get numbers:

  1. install pytorch pip nightlies from here
  2. verify that your consumer GPU supports semi-structured sparsity
import torch
from torch.sparse import to_sparse_semi_structured
to_sparse_semi_structured(torch.ones(256, 256).half().cuda())
  3. Clone pytorch and get the benchmark script:
  4. Run benchmarks. For now, let's see if the nvidia-fixed-mn / nvidia-fixed-k benchmarks still show speedups.
python benchmarks/sparse/benchmark_semi_structured_sparsity.py  --mode nvidia-fixed-k --dtype bfloat16 --backend cutlass
python benchmarks/sparse/benchmark_semi_structured_sparsity.py  --mode nvidia-fixed-mn --dtype bfloat16 --backend cutlass
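Before running the steps above, it may help to sanity-check the hardware requirement directly: 2:4 sparsity needs Ampere (compute capability 8.0) or newer. A minimal, torch-free sketch - `supports_semi_structured` is a hypothetical helper, not a torch API; on a real machine the `(major, minor)` tuple would come from `torch.cuda.get_device_capability()`:

```python
# Minimal sketch of the Ampere+ requirement stated above.
# supports_semi_structured is a hypothetical helper, not part of torch;
# on a CUDA machine the (major, minor) tuple would come from
# torch.cuda.get_device_capability().
def supports_semi_structured(capability):
    """True if a (major, minor) compute capability can use 2:4 sparsity."""
    return capability >= (8, 0)

print(supports_semi_structured((8, 0)))  # A100 (sm_80): True
print(supports_semi_structured((8, 9)))  # 4070 Ti Super (sm_89): True
print(supports_semi_structured((7, 5)))  # Turing (sm_75): False
```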

Afterwards, it would be great to get benchmarks for the ViT-B shapes found here: https://github.com/pytorch/ao/blob/main/benchmarks/sam_vit_b_shapes.csv
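One way to drive the ViT-B benchmarks would be to read `(m, n, k)` triples out of that CSV. A hedged sketch - the column names and the sample rows below are assumptions for illustration, not the real file's contents:

```python
import csv
import io

# Hedged sketch: parse (m, n, k) shape triples from a CSV like the linked
# sam_vit_b_shapes.csv. The column names and sample rows here are
# assumptions for illustration, not the real file's contents.
def load_shapes(text):
    return [tuple(int(row[c]) for c in ("m", "n", "k"))
            for row in csv.DictReader(io.StringIO(text))]

sample = "m,n,k\n32768,768,3072\n32768,2304,768\n"
for m, n, k in load_shapes(sample):
    # each triple would be passed to the benchmark script as one GEMM shape
    print(m, n, k)
```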

philipbutler commented Apr 25, 2024

I had to set up this PC from scratch with a clean Python install, and noticed that neither pandas nor tqdm is in requirements.txt.

philipbutler commented Apr 25, 2024

The benchmark commands above should use --dtype bf16, not --dtype bfloat16.

philipbutler commented Apr 25, 2024

Ran into RuntimeError: sparse_semi_structured_mad_op : CUTLASS not supported

Consider adding "install CUDA 12.1" and the CUTLASS Quickstart to the steps.
Running through it now!

(I'm confused rn)

@philipbutler

Actually, @jcaip, does it make sense that to_sparse_semi_structured(torch.ones(256, 256).half().cuda()) works, but running the first benchmark script shows RuntimeError: sparse_semi_structured_mad_op : CUTLASS not supported ?
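One detail worth noting while debugging this: converting a tensor with to_sparse_semi_structured only compresses it into the 2:4 layout, while the CUTLASS kernel named in the error is exercised later, by the actual sparse matmul - so a successful conversion doesn't prove the kernel path works. For readers unfamiliar with the pattern, here is a torch-free sketch of what "2:4" means (at most 2 nonzeros in every contiguous group of 4 elements along a row); the checker is illustrative, not a torch API:

```python
# Hedged sketch: the 2:4 ("semi-structured") constraint says that in every
# contiguous group of 4 elements along a row, at most 2 may be nonzero.
# is_24_sparse is an illustrative helper, not part of torch.
def is_24_sparse(row):
    return all(sum(1 for x in row[i:i + 4] if x != 0) <= 2
               for i in range(0, len(row), 4))

print(is_24_sparse([1, 0, 2, 0, 0, 3, 0, 4]))  # True: 2 nonzeros per group
print(is_24_sparse([1, 1, 1, 0]))              # False: 3 nonzeros in a group
```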

jcaip (Contributor, Author) commented Apr 26, 2024

That's strange to me, @philipbutler - let me think about it for a bit.

Can you open PowerShell, run nvidia-smi, and screenshot the results?

@philipbutler

@jcaip
[screenshot of nvidia-smi output]

jcaip (Contributor, Author) commented Apr 26, 2024

@philipbutler as a sanity check - can you run using the 2.3 release instead of the nightlies?

I think this might be an issue with Windows, but I'm not sure.
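Since the platform question keeps coming up, a small environment-report helper could make these reports uniform across machines. A sketch - `env_report` is a hypothetical helper, not part of any project script; it only touches torch if torch is importable, so it runs anywhere:

```python
import platform
import sys

# Hedged sketch: gather the environment details useful for triaging
# platform-specific failures like the one above. env_report is a
# hypothetical helper; the torch fields are filled in only when torch
# is importable.
def env_report():
    info = {"python": sys.version.split()[0], "os": platform.system()}
    try:
        import torch
        info["torch"] = torch.__version__
        info["cuda"] = torch.version.cuda
        info["cuda_available"] = torch.cuda.is_available()
    except ImportError:
        info["torch"] = None
    return info

print(env_report())
```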

philipbutler commented Apr 26, 2024

@jcaip Just to make this as easy as possible for future benchmarking, step 2 should say:

import torch
from torch.sparse import to_sparse_semi_structured
to_sparse_semi_structured(torch.ones(256, 256).half().cuda())

@philipbutler

> @philipbutler as a sanity check - can you run using the 2.3 release instead of the nightlies?
>
> I think this might be an issue with Windows, but I'm not sure.

@jcaip Same error with the 2.3 release

gau-nernst (Collaborator) commented Apr 27, 2024

4070 Ti Super, running Ubuntu 22.04.
torch==2.4.0.dev20240426+cu121
bfloat16, cutlass

Fixed k

|    | m     | n     | k     | sparse_latency (ms) | dense_latency (ms) | speedup (d/s) |
|----|-------|-------|-------|---------------------|--------------------|---------------|
| 0  | 3072  | 3072  | 10240 | 1.10574             | 2.131              | 1.92722       |
| 1  | 4096  | 4096  | 10240 | 1.9605              | 3.73044            | 1.9028        |
| 2  | 5120  | 5120  | 10240 | 3.12083             | 6.10269            | 1.95547       |
| 3  | 6144  | 6144  | 10240 | 4.74411             | 8.79509            | 1.8539        |
| 4  | 7168  | 7168  | 10240 | 7.29741             | 11.9486            | 1.63738       |
| 5  | 8192  | 8192  | 10240 | 10.6073             | 15.4296            | 1.45462       |
| 6  | 9216  | 9216  | 10240 | 13.6835             | 19.1741            | 1.40125       |
| 7  | 10240 | 10240 | 10240 | 16.8367             | 23.4461            | 1.39256       |
| 8  | 11264 | 11264 | 10240 | 20.37               | 28.2801            | 1.38832       |
| 9  | 12288 | 12288 | 10240 | 24.1402             | 33.545             | 1.38959       |
| 10 | 13312 | 13312 | 10240 | 28.4292             | 39.2493            | 1.3806        |
| 11 | 14336 | 14336 | 10240 | 32.851              | 45.5614            | 1.38691       |
| 12 | 15360 | 15360 | 10240 | 37.7906             | 54.6426            | 1.44593       |
| 13 | 16384 | 16384 | 10240 | 42.789              | 63.5041            | 1.48412       |
| 14 | 17408 | 17408 | 10240 | 48.5377             | 69.684             | 1.43567       |
| 15 | 18432 | 18432 | 10240 | 54.2561             | 77.7116            | 1.43231       |
| 16 | 19456 | 19456 | 10240 | 60.3411             | 85.183             | 1.41169       |
| 17 | 20480 | 20480 | 10240 | 66.7151             | 97.5466            | 1.46214       |

Fixed mn

|    | m     | n     | k     | sparse_latency (ms) | dense_latency (ms) | speedup (d/s) |
|----|-------|-------|-------|---------------------|--------------------|---------------|
| 0  | 10240 | 10240 | 2560  | 3.12135             | 6.23817            | 1.99855       |
| 1  | 10240 | 10240 | 3840  | 4.59394             | 9.28166            | 2.02041       |
| 2  | 10240 | 10240 | 5120  | 7.15086             | 12.251             | 1.71322       |
| 3  | 10240 | 10240 | 6400  | 10.5324             | 14.7059            | 1.39625       |
| 4  | 10240 | 10240 | 7680  | 13.0499             | 18.0573            | 1.38372       |
| 5  | 10240 | 10240 | 8960  | 15.3995             | 20.6897            | 1.34353       |
| 6  | 10240 | 10240 | 10240 | 16.8406             | 23.4697            | 1.39364       |
| 7  | 10240 | 10240 | 11520 | 19.2673             | 26.2984            | 1.36493       |
| 8  | 10240 | 10240 | 12800 | 20.9322             | 29.0503            | 1.38782       |
| 9  | 10240 | 10240 | 14080 | 23.14               | 31.9612            | 1.38121       |
| 10 | 10240 | 10240 | 15360 | 25.6844             | 34.6865            | 1.35049       |
| 11 | 10240 | 10240 | 16640 | 26.2421             | 37.4893            | 1.42859       |
| 12 | 10240 | 10240 | 17920 | 30.1967             | 40.3297            | 1.33556       |
| 13 | 10240 | 10240 | 19200 | 32.4673             | 43.1666            | 1.32954       |
| 14 | 10240 | 10240 | 20480 | 33.5382             | 46.002             | 1.37163       |

SAM ViT-B shapes

|   | m     | n    | k    | sparse_latency (ms) | dense_latency (ms) | speedup (d/s) |
|---|-------|------|------|---------------------|--------------------|---------------|
| 0 | 32768 | 768  | 3072 | 1.22253             | 1.7901             | 1.46426       |
| 1 | 32768 | 2304 | 768  | 0.787232            | 1.33425            | 1.69486       |
| 2 | 32768 | 3072 | 768  | 1.04701             | 1.74003            | 1.66191       |
| 3 | 32768 | 768  | 768  | 0.271155            | 0.437884           | 1.61488       |
| 4 | 39200 | 2304 | 768  | 0.948154            | 1.5765             | 1.66271       |
| 5 | 39200 | 768  | 768  | 0.324627            | 0.510302           | 1.57196       |

I omitted some redundant columns from the saved CSV file; the correct and contiguous columns are all True.

@msaroufim (Member)

Nice work @gau-nernst - pretty cool to see results that are uniformly faster.
@philipbutler I would highly recommend using WSL or dual booting (I personally dual boot); getting Windows and CUDA to work together is just not worth it.

jcaip (Contributor, Author) commented Apr 30, 2024

@gau-nernst 💯 Thanks for running these - that's awesome! For others reading: I'd like to collect these, along with our A100 results, somewhere. So please contribute and I'll collate everything into a nice doc. We could also collect block-sparse microbenchmarks; I know @cpuhrsch is interested in those.

@philipbutler Thank you for giving it a shot - your edits were super helpful too :) . Yeah, I agree with Mark that dual-booting Linux is probably the easiest solution. But could you open an issue in pytorch for tracking purposes (feel free to tag me) about the lack of Windows support for semi-structured sparsity?
