Add scatter_mm and bsr_scatter_mm operations. #110396
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/110396
Note: Links to docs will display an error until the docs builds have been completed. ✅ No failures as of commit 2e9e52b with merge base 57c7aa1. This comment was automatically generated by Dr. CI and updates every 15 minutes.
This PR introduces a `scatter_mm` operation (computing `mm` over arbitrary pairs of tensors given in batches of tensors), which is used to implement `bsr_scatter_mm`, an equivalent of `bsr_dense_mm` (the `mm` operation on BSR and strided tensors). The implementation is provided both in Triton (when tensor dimensions are multiples of 16) and in PyTorch (otherwise).

The figures below illustrate the performance differences between `bsr_scatter_mm` and `bsr_dense_mm` (GPU: `NVIDIA GeForce RTX 2060 SUPER`). The first figure shows the performance equilibrium point, that is, the BSR tensor sparsity at which `bsr_scatter_mm` or `bsr_dense_mm` matches the performance of `torch.matmul`. The second figure shows the speedup of `bsr_scatter_mm` over `bsr_dense_mm` at these equilibrium points.

<img src="https://github.com/pytorch/pytorch/assets/402156/526d182e-937f-4812-a6c4-904f52d6d5ab" width="48%"> <img src="https://github.com/pytorch/pytorch/assets/402156/ccb606ab-1f3f-4133-887c-b56285f4f168" width="48%">

The same figures for the GPU card `NVIDIA A100-SXM4-80GB`:

<img src="https://github.com/pytorch/pytorch/assets/402156/25466f1d-df34-4d1c-a975-afb478e4d9f0" width="48%"> <img src="https://github.com/pytorch/pytorch/assets/402156/6ada91f0-a20f-4f0d-8a48-1f4ccc60d08e" width="48%">

In sum:
- `bsr_scatter_mm` is about 2x faster than `bsr_dense_mm` for small block sizes of 16 and 32 and large tensors [GPU: `NVIDIA GeForce RTX 2060 SUPER`].
- `bsr_scatter_mm` is up to 2x faster than `bsr_dense_mm` for a small block size of 16 and large tensors [GPU: `NVIDIA A100-SXM4-80GB`].
- `bsr_dense_mm` is up to 20% faster than `bsr_scatter_mm` for block sizes of 64 or larger [GPU: `NVIDIA GeForce RTX 2060 SUPER`].
- However, `bsr_dense_mm` fails with an `OutOfResources` exception for block sizes of 256 or larger, whereas `bsr_scatter_mm` succeeds.

cc @alexsamardzic @nikitaved @cpuhrsch @amjames @bhosmer
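To make the semantics concrete, below is a minimal plain-PyTorch sketch of what a `scatter_mm`-style computation does: for each output slot, it accumulates the products of selected (left, right) pairs drawn from two batches of matrices. The function name, argument layout, and index encoding are illustrative assumptions, not the exact interface added in `torch/sparse/_triton_ops.py` by this PR.

```python
import torch

def scatter_mm_reference(blocks, others, pq_offsets, pq):
    """Reference semantics (hypothetical layout):
    out[i] = sum of blocks[pq[t, 0]] @ others[pq[t, 1]]
             for t in [pq_offsets[i], pq_offsets[i + 1]).

    blocks:     (P, Ms, Ks) batch of left operands
    others:     (Q, Ks, Ns) batch of right operands
    pq_offsets: (R + 1,) int tensor delimiting each output's slice of pairs
    pq:         (T, 2) int tensor of (block index, other index) pairs
    """
    P, Ms, Ks = blocks.shape
    Q, _, Ns = others.shape
    assert int(pq[:, 0].max()) < P and int(pq[:, 1].max()) < Q
    R = pq_offsets.numel() - 1
    out = torch.zeros(R, Ms, Ns, dtype=blocks.dtype, device=blocks.device)
    for i in range(R):
        for t in range(int(pq_offsets[i]), int(pq_offsets[i + 1])):
            p, q = int(pq[t, 0]), int(pq[t, 1])
            out[i] += blocks[p] @ others[q]
    return out

# Tiny usage example with made-up indices: output 0 accumulates two products,
# output 1 accumulates one.
blocks = torch.randn(3, 16, 16)
others = torch.randn(4, 16, 32)
pq = torch.tensor([[0, 1], [2, 3], [1, 0]])
pq_offsets = torch.tensor([0, 2, 3])
print(scatter_mm_reference(blocks, others, pq_offsets, pq).shape)  # torch.Size([2, 16, 32])
```

A BSR-times-dense product maps onto this pattern naturally: each nonzero block of the BSR operand plays the role of a `blocks` entry, each tile of the dense operand plays the role of an `others` entry, and the BSR indices determine which pairs contribute to which output tile.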
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
# below.
device_name = torch.cuda.get_device_name()
is_A100 = 'A100' in device_name
if (M, K, N) == (256,) * 3:
I guess you essentially want to store a decision tree here?
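For illustration, here is a minimal sketch of what such a stored decision could look like: a lookup table of tuned meta parameters keyed by device class and problem shape, with a default fallback. All names and values below are made up for the example and are not the ones used in this PR.

```python
import torch

# Hypothetical table of tuned Triton meta parameters, keyed by
# (is_A100, M, K, N); the values are placeholders, not measured optima.
_TUNED_META = {
    (True, 256, 256, 256): {"GROUP_SIZE": 4, "TILE_M": 64, "TILE_N": 64, "num_stages": 4},
    (False, 256, 256, 256): {"GROUP_SIZE": 2, "TILE_M": 32, "TILE_N": 32, "num_stages": 2},
}
_DEFAULT_META = {"GROUP_SIZE": 1, "TILE_M": 16, "TILE_N": 16, "num_stages": 1}

def get_meta(M, K, N):
    """Return tuned meta parameters for this device and shape, or a default."""
    device_name = torch.cuda.get_device_name() if torch.cuda.is_available() else ""
    is_A100 = "A100" in device_name
    return _TUNED_META.get((is_A100, M, K, N), _DEFAULT_META)
```

A flat dictionary like this is the simplest form of such a "decision tree"; nearest-shape matching or per-architecture subtables could be layered on top without changing the call site.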
Pull Request resolved: #111470 Approved by: https://github.com/cpuhrsch ghstack dependencies: #110396
… is 16. (#111489) Pull Request resolved: #111489 Approved by: https://github.com/cpuhrsch ghstack dependencies: #110396, #111470
As in the title. The figures below illustrate the performance difference between `bsr_dense_mm` with optimized meta parameters and `bsr_dense_mm` with default meta parameters (GPU: NVIDIA A100-SXM4-80GB). The first figure shows the performance equilibrium point, that is, the BSR tensor sparsity at which `bsr_dense_mm` has the same performance as `torch.matmul`. The second figure shows the speedup from using optimized meta parameters in `bsr_dense_mm` at its performance equilibrium points, relative to `bsr_dense_mm` with default meta parameters. In sum, this PR speeds up `bsr_dense_mm` by about 50%, depending on the BSR tensor shape and block size, and lowers the performance equilibrium points of BSR tensor sparsity for matmul operations against strided tensors.

<img src="https://github.com/pytorch/pytorch/assets/402156/6fe9d35f-dd21-4aa0-bb01-6ee257254453" width="48%"> <img src="https://github.com/pytorch/pytorch/assets/402156/506921c6-3770-4209-ad3d-498d2ae4989d" width="48%">

Pull Request resolved: #111760 Approved by: https://github.com/cpuhrsch ghstack dependencies: #110396, #111470, #111489
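As a rough sketch of the benchmarking methodology behind figures like these, one can time `torch.matmul` on a dense operand against the Triton BSR kernel at a fixed sparsity and block size. The import path `torch.sparse._triton_ops.bsr_dense_mm` comes from this PR stack, but it is a private API whose exact signature may change, so treat the call below as an assumption to verify; a CUDA build of PyTorch with Triton is required.

```python
import torch
from torch.sparse._triton_ops import bsr_dense_mm  # private API, assumed call convention
from torch.utils.benchmark import Timer

def time_ms(stmt, **globals_):
    """Median wall time of `stmt` in milliseconds."""
    return Timer(stmt=stmt, globals=globals_).blocked_autorange().median * 1e3

M = K = N = 4096
blocksize = 32
dense_lhs = torch.randn(M, K, dtype=torch.float16, device="cuda")
rhs = torch.randn(K, N, dtype=torch.float16, device="cuda")

# Zero out roughly 90% of the blocks of a copy, then convert it to BSR.
block_mask = torch.rand(M // blocksize, K // blocksize, device="cuda") < 0.9
elem_mask = block_mask.repeat_interleave(blocksize, 0).repeat_interleave(blocksize, 1)
sparse_lhs = dense_lhs.clone()
sparse_lhs[elem_mask] = 0
bsr_lhs = sparse_lhs.to_sparse_bsr((blocksize, blocksize))

t_dense = time_ms("a @ b", a=dense_lhs, b=rhs)
t_bsr = time_ms("mm(a, b)", mm=bsr_dense_mm, a=bsr_lhs, b=rhs)
print(f"dense: {t_dense:.3f} ms  bsr_dense_mm: {t_bsr:.3f} ms  speedup: {t_dense / t_bsr:.2f}x")
```

Sweeping the sparsity in such a script until the two timings cross is essentially how a performance equilibrium point is located; repeating the sweep with default and with tuned meta parameters (via whatever knob the kernel exposes) yields the speedup curves shown above.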
pytorch#111796) Pull Request resolved: pytorch#111796 Approved by: https://github.com/cpuhrsch ghstack dependencies: pytorch#110396, pytorch#111470, pytorch#111489, pytorch#111760