Switch to Triton FP8 Quantization in EMU1.6 #2688

Closed
wants to merge 1 commit

Conversation

jwfromm (Contributor) commented Jun 5, 2024

Summary:
For some reason, the CUDA `quantize_fp8_per_row` kernel is very slow in EMU. Switching to the functionally equivalent Triton kernel yields excellent speedups from FP8. In eager mode, I'm seeing a 20% end-to-end speedup while still getting correct outputs.
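
For reference, row-wise FP8 quantization conceptually reduces to computing one scale per row and casting. The sketch below is only illustrative plain PyTorch (the function name and eps handling are mine, not the FBGEMM kernel's), but it shows what the CUDA and Triton kernels both compute:

```python
import torch

# Illustrative sketch of row-wise FP8 quantization, not the FBGEMM kernel.
# Assumes a PyTorch build that exposes the float8_e4m3fn dtype (2.1+).
def quantize_fp8_per_row_reference(x: torch.Tensor, eps: float = 1e-12):
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    # One scale per row so that the row's max magnitude maps to the FP8 max.
    row_max = x.abs().amax(dim=-1, keepdim=True).clamp_min(eps)
    scale = row_max / fp8_max
    xq = (x / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return xq, scale.squeeze(-1)
```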

Eager:
BF16: 19702.10ms
FP8 Triton Quant: 16466.97ms

Compiled:
FP8 Native Quant: 14605.18ms
FP8 Triton Quant: 16043.92ms
BF16: 18030.98ms

We see that quantizing in native PyTorch helps quite a bit when torch.compile is used. I added an option to choose which quantization function is used, defaulting to Triton when torch.compile is off and to native torch when it is on. This gives us the best performance in either case.
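
As a rough sketch of that selection logic (the helper names here are illustrative, not the actual EMU/FBGEMM API, and I'm assuming `torch.compiler.is_compiling()` is available, i.e. a recent PyTorch):

```python
from typing import Optional

import torch

def quantize_activation(x: torch.Tensor, use_triton: Optional[bool] = None):
    # Default: Triton quantization in eager mode, native torch under
    # torch.compile, matching the timings above.
    if use_triton is None:
        use_triton = not torch.compiler.is_compiling()
    if use_triton:
        return triton_quantize_fp8_row(x)  # hypothetical Triton kernel wrapper
    return quantize_fp8_per_row_reference(x)  # native-torch path (sketch above)
```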

Reviewed By: jiawenliu64

Differential Revision: D58167756

@facebook-github-bot (Contributor) commented

This pull request was exported from Phabricator. Differential Revision: D58167756


netlify bot commented Jun 5, 2024

Deploy Preview for pytorch-fbgemm-docs ready!

| Name | Link |
|------|------|
| 🔨 Latest commit | 6c1f6df |
| 🔍 Latest deploy log | https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/6661e03494075400089aaa5f |
| 😎 Deploy Preview | https://deploy-preview-2688--pytorch-fbgemm-docs.netlify.app |

jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request Jun 5, 2024

@facebook-github-bot (Contributor) commented

This pull request has been merged in f7666ed.
