
Floatpoint(8,23) flips the input values #48

Open
ASHWIN2605 opened this issue Jul 8, 2021 · 4 comments


@ASHWIN2605

Hi,

I have tried the following code:

import torch
from qtorch.quant import float_quantize

a = torch.tensor([3.0])
out = float_quantize(a, 8, 23, "nearest")

The output is printed as -3.0.

This happens only when the rounding mode is "nearest". I am not able to understand why this is happening. Can you please explain why, as I am missing something here?

@Tiiiger
Owner

Tiiiger commented Jul 9, 2021

what is printed out when you don't use nearest rounding?

@ASHWIN2605
Author

When I use stochastic rounding, the same input number is printed.

@Tiiiger
Owner

Tiiiger commented Jul 13, 2021

hi @ASHWIN2605

Good catch, I think this is an edge case. I'll look into the code soon.

But 8 bits of exponent and 23 bits of mantissa is already the standard fp32 format, so I don't think you want to quantize to it anyway.

@wassimseif

Hello,

This is from the round_bitwise function in quant_cpu.cpp.
Specifically, rand_prob = 1 << (23 - man_bits - 1); when man_bits = 23, this becomes rand_prob = 1 << -1;, i.e. a shift by a negative count, which is undefined behavior in C++.
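
A minimal sketch of the failure and one possible guard, assuming the shift is used to build a rounding mask as in the snippet above; the round_mask helper and the guard below are only illustrative, not the library's actual code or fix. On common targets such as x86 the shift count is masked to 5 bits, so 1 << -1 effectively evaluates to 1 << 31 = 0x80000000, which is exactly the sign bit of a 32-bit float and would explain the flipped sign:

#include <cstdio>

// Illustrative helper (not in quant_cpu.cpp): reproduces the shift used in
// round_bitwise and guards the man_bits == 23 edge case.
unsigned int round_mask(int man_bits) {
  int shift = 23 - man_bits - 1;
  if (shift < 0) {
    // With a full 23-bit mantissa there is nothing to round off, so return 0
    // instead of performing a negative (undefined) shift.
    return 0;
  }
  return 1u << shift;
}

int main() {
  std::printf("man_bits = 22 -> mask = 0x%08x\n", round_mask(22)); // 0x00000001
  std::printf("man_bits = 23 -> mask = 0x%08x\n", round_mask(23)); // 0x00000000 (guarded)
  return 0;
}

If that mask is then added into the float's bit pattern for round-to-nearest, as the snippet suggests, 0x40400000 (3.0) plus 0x80000000 gives 0xC0400000, i.e. -3.0, matching the output reported above.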
