Improve MX4 perf Pt. 1 #2709

spcyppt · 2024-06-10T21:23:18Z

Summary:
This kernel is compute-bound when the input tensor is very large. Several factors contribute; the main optimization in this diff is replacing a two-step computation with a single fused call: instead of computing `x * pow(2, y)`, we call `scalbn(x, y)`, which significantly reduces the number of cycles the computation needs.
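A minimal host-side C++ sketch of the substitution, assuming a `float` input and an `int` exponent. `scale_with_pow` and `scale_with_scalbn` are hypothetical names for illustration, not the FBGEMM kernel code. Both compute x · 2^y, but `pow` is a general real-valued power function, while `scalbn` (the standard C/C++ function; `scalbnf` in CUDA device code) adjusts the exponent of x directly.

```cpp
#include <cmath>

// Hypothetical "before" form: materialize 2^y with a general-purpose
// power function (float overload), then multiply by x.
inline float scale_with_pow(float x, int y) {
    return x * std::pow(2.0f, static_cast<float>(y));
}

// Hypothetical "after" form: scalbn computes x * 2^y in one call by
// adjusting the exponent of x, without materializing 2^y separately.
inline float scale_with_scalbn(float x, int y) {
    return std::scalbn(x, y);
}
```

For results in the normal float range both forms are exact and agree. They can diverge when 2^y itself over- or underflows in float even though x · 2^y would not (for example x = 0.5, y = 128), so extreme exponents are worth checking if bit-identical output matters.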

Improve MX4 kernel performance:
- replace `pow` with `scalbn`
- remove `flush_fp32_subnorms`, which is currently unused (note: we will add support for it later)
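The diff does not show what `flush_fp32_subnorms` did. Judging only from its name, it would replace FP32 subnormal values with zero. The sketch below is a hypothetical illustration of that idea (`flush_subnormal_to_zero` is an invented name), not the removed FBGEMM helper.

```cpp
#include <cmath>

// Hypothetical helper: flush an FP32 subnormal to a zero of the same sign.
// Normal values, zeros, infinities, and NaNs pass through unchanged.
inline float flush_subnormal_to_zero(float x) {
    return std::fpclassify(x) == FP_SUBNORMAL ? std::copysign(0.0f, x) : x;
}
```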

Performance improvement:
- quantize: 39M cycles -> 24M cycles
- dequantize: 28M cycles -> 10M cycles
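Expressed as ratios, the reported cycle counts correspond to the following (computed only from the numbers above, no additional measurements):

```latex
\begin{aligned}
\text{quantize:}   \quad & \tfrac{39\,\mathrm{M}}{24\,\mathrm{M}} \approx 1.6\times\ \text{speedup}, \quad 1 - \tfrac{24}{39} \approx 38\%\ \text{fewer cycles} \\
\text{dequantize:} \quad & \tfrac{28\,\mathrm{M}}{10\,\mathrm{M}} = 2.8\times\ \text{speedup}, \quad 1 - \tfrac{10}{28} \approx 64\%\ \text{fewer cycles}
\end{aligned}
```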

Reviewed By: sryap

Differential Revision: D58296469

facebook-github-bot · 2024-06-10T21:23:27Z

This pull request was exported from Phabricator. Differential Revision: D58296469

netlify · 2024-06-10T21:23:34Z

✅ Deploy Preview for pytorch-fbgemm-docs ready!

Name	Link
🔨 Latest commit	`d24b546`
🔍 Latest deploy log	https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/66679897d76b7300080d13d4
😎 Deploy Preview	https://deploy-preview-2709--pytorch-fbgemm-docs.netlify.app


facebook-github-bot · 2024-06-11T00:14:34Z

This pull request was exported from Phabricator. Differential Revision: D58296469


facebook-github-bot · 2024-06-11T00:21:32Z

This pull request was exported from Phabricator. Differential Revision: D58296469

facebook-github-bot · 2024-06-11T00:39:31Z

This pull request has been merged in d8900ae.

facebook-github-bot added the cla signed label Jun 10, 2024

facebook-github-bot added the fb-exported label Jun 10, 2024

spcyppt force-pushed the export-D58296469 branch from d5448a1 to 488cde7 on June 11, 2024 00:14

spcyppt force-pushed the export-D58296469 branch from 488cde7 to d24b546 on June 11, 2024 00:21

facebook-github-bot closed this in d8900ae Jun 11, 2024

facebook-github-bot added the Merged label Jun 11, 2024
