
Issues with error bounds #7

Open
mrmbernardi opened this issue Nov 16, 2023 · 3 comments

@mrmbernardi

mrmbernardi commented Nov 16, 2023

https://github.com/shubhamchandak94/LFZip/blob/14b2e9f79b1d9bc3f60d5cb69e20eb1529c709b4/src/nlms_compress.py#L85C18-L85C18

The value np.finfo(np.float32).resolution is used to reduce the error bound, but it's not fit for this purpose. That value is a fixed constant (10**-precision, i.e. ~1e-6 for float32) approximating the spacing between adjacent float32 values near 1, rounded to the nearest power of 10. It does not reflect the actual spacing at magnitudes away from 1.

If we take a large error bound, say 100, and subtract this value, the resulting new error bound when interpreted as float32 rounds back to the original (the gap between adjacent float32 values near 100 is ~7.6e-6, far larger than 1e-6). Conversely, if we take a small error bound, the new error bound ends up many representable steps below the next smallest bound. The correct function to use here would be np.spacing:

https://numpy.org/doc/stable/reference/generated/numpy.spacing.html

This returns the precise spacing between a value and the next one away from zero (despite what the documentation says).
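
A quick sketch of the difference (the variable names here are just for illustration):

```python
import numpy as np

res = np.finfo(np.float32).resolution  # fixed constant, ~1e-6

# For a large bound, subtracting `resolution` is lost to rounding:
bound = np.float32(100.0)
assert np.float32(bound - res) == bound  # unchanged

# For a small bound, it overshoots by a huge number of representable steps
# (the float32 spacing near 1e-5 is only ~9e-13):
small = np.float32(1e-5)
adjusted = np.float32(small - res)
assert adjusted < small - 1000 * np.spacing(small)

# np.spacing gives the exact step to the adjacent representable value,
# so subtracting it always moves the bound by exactly one step:
assert np.float32(bound - np.spacing(bound)) < bound
```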

@shubhamchandak94
Owner

We do check later if the precision issue exists between original and reconstruction, and if so we write the original stream data directly to avoid breaking the contract. You can see that in both the C++ and the Python versions.

With that in mind, I think it's fine from a correctness perspective even if the resolution-based adjustment yields an effective bound of 0. Do you believe there is a specific case where this would lead to issues?

@mrmbernardi
Author

One problem here is that you arbitrarily limit the error bound to be greater than 1e-6. This is very large considering that the smallest normal float32 is 1.1754944e-38. If I'm not mistaken, ~40% of the representable float32 values have smaller magnitudes than 1e-6.
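
For reference, that fraction can be sanity-checked by reading positive float32 bit patterns as integers (a quick sketch, not LFZip code):

```python
import numpy as np

# For positive IEEE-754 floats, the bit pattern read as an integer
# equals the count of positive float32 values below that value
# (subnormals included), so the fraction is a ratio of bit patterns.
below = np.array(1e-6, np.float32).view(np.int32)
total = np.array(np.inf, np.float32).view(np.int32)  # all positive finite floats
print(below / total)  # roughly 0.42
```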

There's also the fact that if you set an error bound very close to but greater than 1e-6, then the adjusted value will be very small, even subnormal or zero, and this would make compression fail with all values being written to the outlier stream.

This could be solved by using np.spacing to make the adjustment and then ensuring that the adjusted value is still significantly large (e.g. still a normal value).
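
A sketch of what that could look like (tighten_bound is a hypothetical helper, not LFZip's actual code):

```python
import numpy as np

def tighten_bound(bound):
    """Step the bound down by exactly one representable float32 value,
    refusing bounds so small that the result would not be normal."""
    b = np.float32(bound)
    adjusted = np.float32(b - np.spacing(b))
    if adjusted < np.finfo(np.float32).tiny:  # smallest normal float32
        raise ValueError("error bound too small to adjust safely")
    return adjusted

# The current scheme subtracts the fixed ~1e-6 `resolution` instead.
# A bound just above the 1e-6 floor then collapses to ~1e-12, six
# orders of magnitude smaller, so nearly every value would be flagged
# as an outlier; the spacing-based version barely moves the bound:
res = np.finfo(np.float32).resolution
b = np.float32(1.000001e-6)
collapsed = np.float32(b - res)
assert collapsed < 1e-9
assert tighten_bound(b) > 9e-7
```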

Alternatively, you could simply not make any adjustment at all. As you point out, you're already checking the reconstructed values against the original error bound, so you could do away with the adjustment and it would not break the contract.

@shubhamchandak94
Owner

Got it, thanks for pointing this out and explaining the considerations. I will get back to fixing it in a month or so since I'll need to run some experiments to make sure this does give benefits in some scenarios.
