
Conversation


@pearu pearu commented Nov 23, 2023

As in the title: this PR eliminates an unnecessary copy in CUDA `addmm` when one operand is a sparse compressed block (BSR) tensor.

As a result, `nn.linear(<strided tensor>, <BSR tensor>, bias=<strided tensor>)` performance increases as follows (`float16`, NVIDIA A100-SXM4-80GB):

  • 256x256 weights: 14–27% speed-up
  • 512x512 weights: 9–25% speed-up
  • 1024x1024 weights: 5–20% speed-up
  • 2048x2048 weights: 3–16% speed-up
  • 4092x4092 weights: 2–9% speed-up

Stack from ghstack (oldest at bottom):
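For context, here is a minimal sketch of the call pattern this PR accelerates — a linear layer whose weight has been converted to BSR (sparse compressed block) layout. The shapes, blocksize, and tolerance below are illustrative only, not the benchmark configuration from the description:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Dense weight converted to BSR (sparse compressed block) layout.
# Weight dims must be divisible by the blocksize.
w_dense = torch.randn(8, 8)
w_bsr = w_dense.to_sparse_bsr(blocksize=(4, 4))

x = torch.randn(2, 8)     # strided input
bias = torch.randn(8)     # strided bias

# linear(<strided>, <BSR>, bias=<strided>) -- on CUDA this dispatches
# to addmm with a sparse compressed block operand, the path this PR
# speeds up by avoiding a copy.
y = F.linear(x, w_bsr, bias)

# The sparse path matches the dense computation.
assert torch.allclose(y, F.linear(x, w_dense, bias), atol=1e-5)
```

The same code runs on CPU; the copy elimination in this PR applies only to the CUDA dispatch.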

@pytorch-bot pytorch-bot bot added the release notes: sparse label Nov 23, 2023

pytorch-bot bot commented Nov 23, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/114484

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 5cb2195 with merge base 56a95af:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pearu pearu self-assigned this Nov 23, 2023
@pearu pearu added the module: performance, topic: not user facing, and open source labels Nov 23, 2023
@pearu pearu changed the title Eliminate unnecessary copy in addmm with sparse compressed block operand Eliminate unnecessary copy in CUDA addmm with sparse compressed block operand Nov 23, 2023
@pearu pearu requested review from amjames and cpuhrsch November 23, 2023 21:39
pearu added a commit that referenced this pull request Nov 24, 2023

pearu commented Nov 28, 2023

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Nov 28, 2023
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.
