
Improve bsr @ strided performance in baddmm for bfloat16/half with Triton kernels. #88078

Closed · wants to merge 116 commits

Conversation

nikitaved (Collaborator) commented Oct 31, 2022

As per title.

Additionally, we introduce support for:

  • Rectangular block sizes whose dimensions are powers of 2 and at least 16 (a limitation of triton's dot).
  • Batched inputs, with broadcasting over the batch dimensions of either argument.
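
For illustration only (not part of the diff), a rough usage sketch of the targeted operation. It assumes torch.matmul accepts the BSR operand on CUDA and that such a configuration dispatches to the new Triton kernel; the shapes and names are made up.

```python
import torch

# Rectangular, power-of-2 block sizes of at least 16 in each dimension,
# per the constraint above (a limitation of triton.language.dot).
dtype = torch.bfloat16
lhs = torch.randn(256, 256, device="cuda", dtype=dtype).to_sparse_bsr(blocksize=(32, 16))
rhs = torch.randn(256, 128, device="cuda", dtype=dtype)  # strided (dense) operand

out = lhs @ rhs  # bsr @ strided matmul in bfloat16
```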

cc @ngimel @alexsamardzic @pearu @cpuhrsch @amjames @bhosmer @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @desertfire @VitalyFedyunin

pytorch-bot bot commented Oct 31, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/88078

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 5835817:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: sparse label Oct 31, 2022
@nikitaved nikitaved added the module: performance, module: sparse, and module: half labels and removed the release notes: sparse label Oct 31, 2022
@pytorch-bot pytorch-bot bot added the release notes: sparse label Oct 31, 2022
Comment on lines 187 to 222
batch_idx, row_idx = nnz_per_row.nonzero(as_tuple=True)

nikitaved (Collaborator, Author):

nonzero could be removed when the number of nonzero rows is high; skipping empty rows could then be delegated to the kernel.

Contributor:

nonzero is also one of those ops that will cause a sync if I'm not mistaken.

nikitaved (Collaborator, Author):

Exactly, so it makes sense to avoid using it in some circumstances, which also addresses your comments below.
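
For context, a small illustration of the sync concern raised above (not taken from the PR; the tensor names are made up):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# torch.nonzero has a data-dependent output shape, so the returned index tensors
# cannot be sized without copying a count back to the host: an implicit
# device-to-host synchronization on CUDA.
nnz_per_row = torch.randint(0, 3, (4, 64), device=device)
batch_idx, row_idx = nnz_per_row.nonzero(as_tuple=True)  # blocks the host when on CUDA

# The alternative discussed above: launch the kernel over all rows and let it
# detect and skip empty rows itself, so the host never waits for the count.
```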

import triton.language as tl


def compressed_indices_to_plain_indices(cidx, pidx):
Contributor:

nit: is this unused?

nikitaved (Collaborator, Author):

It is not used here anymore, right, but it is a very nice utility for other sparse kernels we might want to have in Triton. We can remove it for now...
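
For reference, a hedged sketch of what such a utility could do: expand compressed (CSR/BSR-style) row pointers into one plain row index per stored entry. The name matches the reviewed signature, but the body below is an assumption, not the PR's implementation.

```python
import torch

def compressed_indices_to_plain_indices(cidx: torch.Tensor, pidx: torch.Tensor) -> torch.Tensor:
    # cidx: compressed indices of length n_rows + 1 (e.g. crow_indices).
    # pidx: plain indices of the other dimension (e.g. col_indices), one per entry.
    counts = cidx[1:] - cidx[:-1]                             # entries per compressed row
    rows = torch.arange(cidx.numel() - 1, device=cidx.device)
    plain = rows.repeat_interleave(counts)                    # one row index per stored entry
    assert plain.numel() == pidx.numel()                      # must align with the plain indices
    return plain

# Example: a 3x4 CSR matrix with 4 stored entries.
crow = torch.tensor([0, 1, 1, 4])
col = torch.tensor([2, 0, 1, 3])
print(compressed_indices_to_plain_indices(crow, col))         # tensor([0, 2, 2, 2])
```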

@nikitaved nikitaved force-pushed the nikitaved/triton_bsr_dense_mm branch from 8a0f7d0 to 5dee33f Compare November 4, 2022 14:53
@nikitaved nikitaved changed the title Improve bsr @ strided performance in addmm for bfloat16/half with Triton kernels. Improve bsr @ strided performance in baddmm for bfloat16/half with Triton kernels. Nov 4, 2022
Resolved review threads (outdated): test/test_sparse_csr.py, torch/sparse/triton_ops/triton_bsr_dense_mm.py
@nikitaved nikitaved reopened this Jan 26, 2023
malfet (Contributor) commented Jan 26, 2023

@nikitaved it would be good to update the PR description to explain what was fixed since the last revert.
Also, I wonder why the "revertedx2" label is not there...

@malfet malfet requested a review from cpuhrsch January 26, 2023 17:05
nikitaved (Collaborator, Author) commented Jan 26, 2023

@malfet, nothing was really fixed; we just disabled the tests for CUDA 11.6, where the kernel was hanging. But new tests showed up that scan native functions, including dummy entries meant to be overwritten with Triton implementations, and this is where things started breaking...
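
For context, a hedged sketch of gating a Triton-backed test on the CUDA toolkit version (not the PR's exact change; the test class and name are made up):

```python
import unittest
import torch

_cuda = torch.version.cuda  # e.g. "11.6", or None on CPU-only builds
_is_cuda_11_6 = _cuda is not None and _cuda.startswith("11.6")

@unittest.skipIf(_is_cuda_11_6, "Triton bsr @ dense kernel hangs on CUDA 11.6")
class TestTritonBsrDenseMM(unittest.TestCase):
    def test_matmul(self):
        ...
```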

test/test_sparse_csr.py Outdated Show resolved Hide resolved
nikitaved (Collaborator, Author) commented Jan 31, 2023

From the offline discussion with @cpuhrsch: we decided to remove all the C++ hooks for now. The test_decomp issues are real (they come from introducing a native function), but they only manifest after the tests have been spinning for a very long time. We will try to investigate these issues in a follow-up PR.

nikitaved (Collaborator, Author):
All right, CUDA 11.6 is deprecated. Let's spin this one again.

nikitaved (Collaborator, Author):
Closing in favor of #94823 for more granular issue control.

@nikitaved nikitaved closed this Feb 14, 2023
Labels: ciflow/inductor, ciflow/trunk, Merged, module: dynamo, module: half, module: performance, module: sparse, open source, release notes: sparse, Reverted, triaged