
MX4 quantization #2659

Closed
wants to merge 1 commit into from

Conversation

spcyppt
Contributor

@spcyppt spcyppt commented Jun 1, 2024

Summary:
Implement MX4 quantization-dequantization ops

Usage:
**Quantization:**
```
quantized_output = torch.ops.fbgemm.quantize_mx_cuda(
            A,
            split_sizes,
            scale_bits=8,
            ebits=2,
            mbits=3,
            max_norm=6.0,
            mx_group_size=32,
        )
```
where
`A` is a 1-D input tensor, and
`split_sizes` is a list of ints containing the number of elements in each rank, e.g., `split_sizes = [1024, 2048]` for 2 ranks. Note that each value needs to be a multiple of `mx_group_size`.
The output is guaranteed to be 16-byte aligned; any remainder smaller than 16 bytes is padded up to 16 bytes.
Given that `mx_group_size` is 32, each split size should be a multiple of 32 x 16 = 512 for best performance.
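
As a minimal sketch only (this helper is hypothetical and not part of FBGEMM), the multiple-of-group-size requirement can be met by rounding each per-rank count up; the input tensor itself would also need padding to the new total length:
```
# Hypothetical helper (not part of FBGEMM): round each per-rank element count
# up to the nearest multiple of the MX group size, as required by the
# split_sizes argument of quantize_mx_cuda. The caller must also pad the
# input tensor to sum(padded sizes).
def pad_split_sizes(split_sizes, group_size=32):
    return [((n + group_size - 1) // group_size) * group_size for n in split_sizes]

# e.g., pad_split_sizes([1000, 2040]) -> [1024, 2048]
```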

**Dequantization:**
```
dequantized_output = torch.ops.fbgemm.dequantize_mx_cuda(
            quantized_output,
            split_sizes,
            mx_group_size=32,
        )
```
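
For reference, a minimal end-to-end sketch, assuming `fbgemm_gpu` is installed with CUDA support; the shapes and the error check are illustrative only, not guarantees about the ops' exact output layout:
```
import torch
import fbgemm_gpu  # noqa: F401  # registers the torch.ops.fbgemm operators

# Two ranks; each split size is a multiple of mx_group_size (32).
split_sizes = [1024, 2048]
A = torch.randn(sum(split_sizes), device="cuda")

quantized = torch.ops.fbgemm.quantize_mx_cuda(
    A,
    split_sizes,
    scale_bits=8,
    ebits=2,
    mbits=3,
    max_norm=6.0,
    mx_group_size=32,
)
dequantized = torch.ops.fbgemm.dequantize_mx_cuda(
    quantized,
    split_sizes,
    mx_group_size=32,
)

# MX4 quantization is lossy, so expect a small non-zero round-trip error.
print(A.shape, dequantized.shape)
print((dequantized.float() - A).abs().max())
```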

Reviewed By: sryap

Differential Revision: D57145102
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D57145102


netlify bot commented Jun 1, 2024

Deploy Preview for pytorch-fbgemm-docs ready!

| Name | Link |
| --- | --- |
| 🔨 Latest commit | 345e654 |
| 🔍 Latest deploy log | https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/665a69e13eeda70008ec9d61 |
| 😎 Deploy Preview | https://deploy-preview-2659--pytorch-fbgemm-docs.netlify.app |

@facebook-github-bot
Contributor

This pull request has been merged in eb7ccec.
