Fixing quantize in int4 mode #159

Open · wants to merge 1 commit into main
Conversation

Artyom17 (Contributor)
Int4 quantization requires a CUDA device; however, in the current implementation the --device parameter was unconditionally overridden with 'cpu'.
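The one-commit diff isn't shown in this view; below is a minimal sketch of the reported behavior and the shape of the fix, with hypothetical names rather than the actual gpt-fast quantize.py code.

```python
# Hypothetical sketch of the bug described above; not the actual
# gpt-fast quantize.py code.
import torch

def quantize(mode: str, device: str) -> None:
    # Reported bug: whatever was passed via --device got discarded by an
    # unconditional reassignment, equivalent to:
    #   device = "cpu"
    # The fix is simply to stop doing that, since int4 packing needs CUDA.
    if mode == "int4" and (device != "cuda" or not torch.cuda.is_available()):
        raise ValueError("int4 quantization requires a CUDA device")
    print(f"quantizing in {mode} mode on {device}")

quantize("int8", "cpu")   # fine: int8 can be quantized on CPU
```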

@facebook-github-bot added the CLA Signed label on Apr 19, 2024
Artyom17 (Contributor, Author)

@HDCharles ?

Chillee (Contributor) commented on Apr 21, 2024

Actually, I think I was the one who added this haha. For things like int8 quantization, you often don't want to materialize your entire model onto GPU before doing the quantization.
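
A sketch of the ordering Chillee is describing, under stated assumptions (illustrative code, not gpt-fast's actual int8 path): quantize while the full-precision weights still live in host RAM, and move only the int8 result to the GPU.

```python
# Illustrative weight-only int8 quantization done on CPU first; only the
# ~4x smaller int8 tensor (plus scales) is ever materialized on the GPU.
import torch

weight = torch.randn(4096, 4096)  # full-precision weight, in host RAM

# Symmetric per-output-channel int8 quantization, computed on CPU.
scales = weight.abs().amax(dim=1, keepdim=True) / 127.0
weight_int8 = torch.clamp(torch.round(weight / scales), -128, 127).to(torch.int8)

device = "cuda" if torch.cuda.is_available() else "cpu"
weight_int8, scales = weight_int8.to(device), scales.to(device)

x = torch.randn(1, 4096, device=device)
y = x @ (weight_int8.to(x.dtype) * scales).t()  # dequantize on the fly
print(y.shape)  # torch.Size([1, 4096])
```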

Artyom17 (Contributor, Author) commented on Apr 22, 2024

> Actually, I think I was the one who added this haha. For things like int8 quantization, you often don't want to materialize your entire model onto GPU before doing the quantization.

The issue is that if I quantize the CPU version, it doesn't really work on GPU later. Not sure why, but that's what I got on an H100: only the GPU-quantized version works. Either way, it is a bug: if you want to quantize on CPU by default, I think it would be better to set the default of the --device parameter to 'cpu' instead of overriding it.
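
Put differently, the suggestion is the difference between a default and an override; a hypothetical two-liner, not code from this PR:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--device", default="cpu")  # proposed: overridable default
args = parser.parse_args(["--device", "cuda"])
assert args.device == "cuda"   # the user's choice survives

args.device = "cpu"            # current behavior: unconditional override
assert args.device == "cpu"    # ...which clobbers --device no matter what
```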

jerryzh168 (Contributor)

> Actually, I think I was the one who added this haha. For things like int8 quantization, you often don't want to materialize your entire model onto GPU before doing the quantization.
>
> The issue is that if I quantize the CPU version, it doesn't really work on GPU later. Not sure why, but that's what I got on an H100: only the GPU-quantized version works. Either way, it is a bug: if you want to quantize on CPU by default, I think it would be better to set the default of the --device parameter to 'cpu' instead of overriding it.

This is probably related to packing; there is a silent numerical error right now if we use the packed weight on CPU vs. CUDA:

(Pdb) linear_forward_int4(torch.eye(4096, 4096, dtype=torch.bfloat16, device="cuda"), weight_int4pack.to("cuda"), scales_and_zeros.to("cuda"), out_features, self.groupsize)[:3,:3]
tensor([[-0.0048, -0.0957, -0.0757],
        [ 0.0243, -0.0211, -0.0081],
        [ 0.0194, -0.0398, -0.0081]], device='cuda:0', dtype=torch.bfloat16)
(Pdb) linear_forward_int4(torch.eye(4096, 4096, dtype=torch.bfloat16, device="cpu"), weight_int4pack.to("cpu"), scales_and_zeros.to("cpu"), out_features, self.groupsize)[:3,:3]
tensor([[-4.8218e-03,  1.6235e-02,  1.9043e-02],
        [-1.4526e-02, -2.1118e-02, -8.0566e-03],
        [ 3.0518e-05, -2.4414e-03,  5.4932e-03]], dtype=torch.bfloat16)
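
One way to make that mismatch fail loudly instead of silently (a hypothetical helper; `linear_forward_int4`, `weight_int4pack`, `scales_and_zeros`, `out_features`, and `groupsize` are assumed in scope as in the pdb session above):

```python
import torch

def check_int4pack_device_consistency(weight_int4pack, scales_and_zeros,
                                      out_features, groupsize, k=4096):
    # Run the same int4 matmul against an identity input on both devices.
    outs = {}
    for device in ("cpu", "cuda"):
        eye = torch.eye(k, k, dtype=torch.bfloat16, device=device)
        outs[device] = linear_forward_int4(  # assumed in scope (gpt-fast quantize.py)
            eye,
            weight_int4pack.to(device),
            scales_and_zeros.to(device),
            out_features,
            groupsize,
        ).cpu()
    # If the packed layout were device-independent, these would match;
    # today this assertion fails, which is the silent error made loud.
    torch.testing.assert_close(outs["cpu"], outs["cuda"], rtol=1e-2, atol=1e-2)
```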

cc @HDCharles
