Use more performant bsr_scatter_mm within bsr_dense_mm when blocksize is 16. #111489

pearu · 2023-10-18T18:02:48Z

Stack from ghstack (oldest at bottom):

… is 16. [ghstack-poisoned]
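For orientation: `bsr_dense_mm` multiplies a BSR (block sparse row) tensor by a strided tensor. Below is a minimal pure-Python sketch of that product. It is not the Triton kernel and not the `bsr_scatter_mm` path this PR switches to. The name `bsr_dense_mm_reference` and the tiny 2x2 blocks are assumptions made for illustration (the PR itself targets blocksize 16).

```python
def bsr_dense_mm_reference(crow_indices, col_indices, values, dense, blocksize):
    """Multiply a BSR matrix by a dense matrix using plain Python lists.

    Hypothetical reference helper, not part of PyTorch.
    Block row i owns the stored blocks crow_indices[i]:crow_indices[i + 1];
    stored block k sits in block column col_indices[k] and values[k] is a
    bm x bk nested list.
    """
    bm, bk = blocksize
    n_block_rows = len(crow_indices) - 1
    n_cols = len(dense[0])
    out = [[0.0] * n_cols for _ in range(n_block_rows * bm)]
    for bi in range(n_block_rows):
        # Visit only the blocks that are actually stored in this block row.
        for k in range(crow_indices[bi], crow_indices[bi + 1]):
            bj = col_indices[k]
            block = values[k]
            for r in range(bm):
                out_row = out[bi * bm + r]
                for c in range(bk):
                    a = block[r][c]
                    if a == 0:
                        continue  # zero entry inside a stored block
                    dense_row = dense[bj * bk + c]
                    for j in range(n_cols):
                        out_row[j] += a * dense_row[j]
    return out


# 4x4 matrix with 2x2 blocks; only blocks (0, 1) and (1, 0) are stored.
crow_indices = [0, 1, 2]
col_indices = [1, 0]
values = [
    [[1.0, 2.0], [3.0, 4.0]],  # block row 0, block column 1
    [[5.0, 0.0], [0.0, 6.0]],  # block row 1, block column 0
]
dense = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
result = bsr_dense_mm_reference(crow_indices, col_indices, values, dense, (2, 2))
```

Only stored blocks contribute work, which is why the run time of a BSR product falls as sparsity rises.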

pytorch-bot · 2023-10-18T18:02:54Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/111489


✅ No Failures

As of commit 4d751ed with merge base 57c7aa1:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

… is 16. ghstack-source-id: 7c28a7bf9c372e2c3f3a47e2b9491f839306a5ac Pull Request resolved: #111489

…n blocksize is 16." [ghstack-poisoned]

As in the title. The figures below illustrate the performance differences between bsr_dense_mm with optimized parameters and bsr_dense_mm with default parameters (GPU: NVIDIA A100-SXM4-80GB). The first figure shows the performance equilibrium point, that is, the BSR tensor sparsity at which bsr_dense_mm has the same performance characteristics as torch.matmul. The second figure shows the speedups from using optimized meta parameters in bsr_dense_mm, measured at its performance equilibrium points, relative to bsr_dense_mm with default meta parameters. In sum, this PR speeds up `bsr_dense_mm` by about 50%, depending on the BSR tensor shape and blocksize, and lowers the performance equilibrium points between BSR tensors and strided tensors for matmul operations. <img src="https://github.com/pytorch/pytorch/assets/402156/6fe9d35f-dd21-4aa0-bb01-6ee257254453" width="48%"> <img src="https://github.com/pytorch/pytorch/assets/402156/506921c6-3770-4209-ad3d-498d2ae4989d" width="48%"> Pull Request resolved: #111760 Approved by: https://github.com/cpuhrsch ghstack dependencies: #110396, #111470, #111489
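To make the "performance equilibrium point" concrete, here is a minimal sketch that bisects for the sparsity at which a sparse kernel's run time matches a dense baseline. The function `equilibrium_sparsity` and both cost models are hypothetical, and the millisecond figures are made up for illustration; they are not measurements from this PR.

```python
def equilibrium_sparsity(time_sparse, time_dense, lo=0.0, hi=1.0, tol=1e-6):
    """Bisect for the sparsity where time_sparse(s) equals time_dense.

    Assumes time_sparse is monotonically decreasing in sparsity.
    Returns None if the sparse path is still slower at sparsity `hi`.
    """
    if time_sparse(lo) <= time_dense:
        return lo  # sparse path is already at least as fast at `lo`
    if time_sparse(hi) > time_dense:
        return None  # sparse path never catches up within [lo, hi]
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if time_sparse(mid) > time_dense:
            lo = mid  # still slower than dense: equilibrium is further right
        else:
            hi = mid
    return (lo + hi) / 2


# Made-up cost models in milliseconds, NOT data from this PR.
DENSE_MS = 1.0


def default_params_ms(sparsity):
    # Hypothetical: fixed overhead plus work proportional to stored blocks.
    return 0.1 + 3.0 * (1.0 - sparsity)


def tuned_params_ms(sparsity):
    # Hypothetical: same overhead, cheaper per-block work after tuning.
    return 0.1 + 2.0 * (1.0 - sparsity)


eq_default = equilibrium_sparsity(default_params_ms, DENSE_MS)
eq_tuned = equilibrium_sparsity(tuned_params_ms, DENSE_MS)
```

Under these assumed models the tuned kernel reaches parity with the dense baseline at a lower sparsity than the default one, which is the sense in which faster meta parameters lower the equilibrium point.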

#111796) Pull Request resolved: #111796 Approved by: https://github.com/cpuhrsch ghstack dependencies: #110396, #111470, #111489, #111760

… is 16. (pytorch#111489) Pull Request resolved: pytorch#111489 Approved by: https://github.com/cpuhrsch ghstack dependencies: pytorch#110396, pytorch#111470



Use more performant bsr_scatter_mm within bsr_dense_mm when blocksize…

1ad319e

… is 16. [ghstack-poisoned]

This was referenced Oct 18, 2023

Add scatter_mm and bsr_scatter_mm operations. #110396

Closed

Use lru_cache to cache indices data for bsr_scatter_mm. #111470

Closed

pytorch-bot bot added the release notes: sparse release notes category label Oct 18, 2023

pearu added a commit that referenced this pull request Oct 18, 2023

Use more performant bsr_scatter_mm within bsr_dense_mm when blocksize…

7d95766

… is 16. ghstack-source-id: 7c28a7bf9c372e2c3f3a47e2b9491f839306a5ac Pull Request resolved: #111489

pytorchbot added the open source label Oct 18, 2023

pearu mentioned this pull request Oct 22, 2023

Add NVIDIA A100 optimized meta parameters to bsr_dense_mm #111760

Closed

Update on "Use more performant bsr_scatter_mm within bsr_dense_mm whe…

4d751ed

…n blocksize is 16." [ghstack-poisoned]

pearu mentioned this pull request Oct 23, 2023

Add batched dimensions support to the second operand of bsr_scatter_mm #111796

Closed

pearu requested review from cpuhrsch and amjames October 23, 2023 17:11

cpuhrsch approved these changes Oct 23, 2023


pytorchmergebot added the Merged label Oct 23, 2023

pytorchmergebot closed this in f3d08ab Oct 23, 2023

facebook-github-bot deleted the gh/pearu/122/head branch October 27, 2023 14:25
