Switch to Triton FP8 Quantization in EMU1.6 #2688

Closed
wants to merge 1 commit

Conversation

jwfromm (Contributor) commented Jun 5, 2024

Summary:
For some reason, the CUDA `quantize_fp8_per_row` kernel is very slow in EMU. Switching to the functionally equivalent Triton kernel yields excellent speedups from FP8. In eager mode, I'm seeing a 20% end-to-end speedup while still getting correct outputs.
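
For reference, row-wise FP8 quantization conceptually reduces to computing one scale per row and casting. The sketch below is only illustrative plain PyTorch (the function name and eps handling are mine, not the FBGEMM kernel's), but it shows what the CUDA and Triton kernels both compute:

```python
import torch

# Illustrative sketch of row-wise FP8 quantization, not the FBGEMM kernel.
# Assumes a PyTorch build that exposes the float8_e4m3fn dtype (2.1+).
def quantize_fp8_per_row_reference(x: torch.Tensor, eps: float = 1e-12):
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    # One scale per row so that the row's max magnitude maps to the FP8 max.
    row_max = x.abs().amax(dim=-1, keepdim=True).clamp_min(eps)
    scale = row_max / fp8_max
    xq = (x / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return xq, scale.squeeze(-1)
```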

Eager:
BF16: 19702.10ms
FP8 Triton Quant: 16466.97ms

Compiled:
FP8 Native Quant: 14605.18ms
FP8 Triton Quant: 16043.92ms
BF16: 18030.98ms

We see that quantizing in native PyTorch helps quite a bit when torch.compile is used. I added an option to choose which quantization function is used, defaulting to Triton when torch.compile is off and to native torch when it is on. This gives us the best performance in either case.
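
As a rough sketch of that selection logic (the helper names here are illustrative, not the actual EMU/FBGEMM API, and I'm assuming `torch.compiler.is_compiling()` is available, i.e. a recent PyTorch):

```python
from typing import Optional

import torch

def quantize_activation(x: torch.Tensor, use_triton: Optional[bool] = None):
    # Default: Triton quantization in eager mode, native torch under
    # torch.compile, matching the timings above.
    if use_triton is None:
        use_triton = not torch.compiler.is_compiling()
    if use_triton:
        return triton_quantize_fp8_row(x)  # hypothetical Triton kernel wrapper
    return quantize_fp8_per_row_reference(x)  # native-torch path (sketch above)
```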

Reviewed By: jiawenliu64

Differential Revision: D58167756

@facebook-github-bot (Contributor) commented

This pull request was exported from Phabricator. Differential Revision: D58167756


netlify bot commented Jun 5, 2024

Deploy Preview for pytorch-fbgemm-docs ready!

| Name | Link |
|------|------|
| 🔨 Latest commit | 6c1f6df |
| 🔍 Latest deploy log | https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/6661e03494075400089aaa5f |
| 😎 Deploy Preview | https://deploy-preview-2688--pytorch-fbgemm-docs.netlify.app |

jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request Jun 5, 2024

@facebook-github-bot (Contributor) commented

This pull request has been merged in f7666ed.
