
Floatpoint(8,23) flips the input values #48

Open
ASHWIN2605 opened this issue Jul 8, 2021 · 4 comments


@ASHWIN2605

Hi,

I have tried the following code:

import torch
from qtorch.quant import float_quantize

a = torch.tensor([3.0])
out = float_quantize(a, 8, 23, "nearest")

The output is printed as -3.0.

This happens only when the rounding mode is "nearest". I am not able to understand why this is happening. Can you please explain why, as I am missing something here?

@Tiiiger
Owner

Tiiiger commented Jul 9, 2021

what is printed out when you don't use nearest rounding?

@ASHWIN2605
Author

When I use stochastic rounding, the same input number is printed.

@Tiiiger
Owner

Tiiiger commented Jul 13, 2021

hi @ASHWIN2605

Good catch, I think this is an edge case. I'll look into the code soon.

But 8 bits of exponent and 23 bits of mantissa is already the standard fp32 format, so I don't think you want to quantize to it anyway.

@wassimseif

Hello,

This is from the round_bitwise function in quant_cpu.cpp.
Specifically, rand_prob = 1 << (23 - man_bits - 1); when man_bits = 23, this becomes rand_prob = 1 << -1;, i.e. a shift by a negative count, which is undefined behavior in C++.
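
A minimal sketch of the failure and one possible guard, assuming the shift is used to build a rounding mask as in the snippet above; the round_mask helper and the guard below are only illustrative, not the library's actual code or fix. On common targets such as x86 the shift count is masked to 5 bits, so 1 << -1 effectively evaluates to 1 << 31 = 0x80000000, which is exactly the sign bit of a 32-bit float and would explain the flipped sign:

#include <cstdio>

// Illustrative helper (not in quant_cpu.cpp): reproduces the shift used in
// round_bitwise and guards the man_bits == 23 edge case.
unsigned int round_mask(int man_bits) {
  int shift = 23 - man_bits - 1;
  if (shift < 0) {
    // With a full 23-bit mantissa there is nothing to round off, so return 0
    // instead of performing a negative (undefined) shift.
    return 0;
  }
  return 1u << shift;
}

int main() {
  std::printf("man_bits = 22 -> mask = 0x%08x\n", round_mask(22)); // 0x00000001
  std::printf("man_bits = 23 -> mask = 0x%08x\n", round_mask(23)); // 0x00000000 (guarded)
  return 0;
}

If that mask is then added into the float's bit pattern for round-to-nearest, as the snippet suggests, 0x40400000 (3.0) plus 0x80000000 gives 0xC0400000, i.e. -3.0, matching the output reported above.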
