Endianness was incorrectly assumed for GroupWord #20

Merged 2 commits on Oct 28, 2021

Changes from 1 commit
src/swisstable_group_query/no_simd.rs (1 addition, 1 deletion)

@@ -29,7 +29,7 @@ impl GroupQuery
  // has pretty much the same effect as a hash collision, something
  // that we need to deal with in any case anyway.

- let group = GroupWord::from_le_bytes(*group);
+ let group = GroupWord::from_ne_bytes(*group);

Member:
Shouldn't the place where the original bytes are written be changed from native-endian to little-endian instead? This change asserts that the serialization format is endianness-dependent, when I think it should use a fixed endianness.
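
For illustration, a minimal standalone sketch of the distinction being raised here (this is not odht's serialization code): a format that writes and reads with a fixed endianness round-trips on every target, while a native-endian write only agrees with a fixed-endian read on matching targets.

```rust
fn main() {
    let value: u64 = 0x0102_0304_0506_0708;

    // Fixed-endianness format: the byte layout is the same on every CPU,
    // so writer and reader agree regardless of their native endianness.
    let on_disk = value.to_le_bytes();
    assert_eq!(u64::from_le_bytes(on_disk), value);

    // Native-endian write: the byte layout depends on the writing CPU, so a
    // reader using from_le_bytes only recovers `value` on little-endian targets.
    let native = value.to_ne_bytes();
    let _maybe_flipped = u64::from_le_bytes(native);
}
```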

Contributor (Author):
As far as I can tell from some smaller tests, this does not change how the hash table is serialized. That said, there may be a better solution; as far as I can tell, this issue comes from some assumptions about the layout of arrays in memory. The simplest failing test is probably this one: https://github.com/rust-lang/odht/blob/main/src/swisstable_group_query/mod.rs#L58..L73
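
The layout assumption in question can be checked in isolation. A small standalone sketch (illustrative only, not the test linked above): from_le_bytes always places element 0 of the array in the least significant byte, while from_ne_bytes only does so on little-endian CPUs.

```rust
fn main() {
    let group: [u8; 8] = [0x80, 0, 0, 0, 0, 0, 0, 0];

    // from_le_bytes pins the layout: group[0] lands in the low byte on every
    // target, so bit positions map back to array indexes the same way everywhere.
    let word = u64::from_le_bytes(group);
    assert_eq!(word & 0xFF, 0x80);
    assert_eq!(word.trailing_zeros() / 8, 0);

    // from_ne_bytes gives the same result only on little-endian CPUs; on a
    // big-endian target, group[0] would land in the most significant byte.
    let _native = u64::from_ne_bytes(group);
}
```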

Member:
I believe the underlying format is byte oriented, so endianness doesn't matter in the serialization format.

However, this GroupWord is batching the u8 control words to try to scan efficiently for matches or empties. It's using the bit-oriented trailing_zeros() to produce byte-oriented usize indexes into the control words. It makes sense that we would want from_le_bytes for that, to ensure low bytes are loaded at the correct "trailing" end. So I don't really understand what's happening here, or why from_ne_bytes fixes anything.
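
A worked example of that byte-index mapping, as a standalone SWAR sketch in the spirit of the no_simd code (the mask construction mirrors the diff context below, but this is written from scratch, not copied from odht):

```rust
fn repeat(byte: u8) -> u64 {
    u64::from_ne_bytes([byte; 8])
}

fn main() {
    // Eight control bytes with a single match for h2 = 0x2A at index 5.
    let mut group = [0u8; 8];
    group[5] = 0x2A;

    // Little-endian load: group[i] occupies bits 8*i..8*i + 8 on every target.
    let word = u64::from_le_bytes(group);

    // SWAR zero-byte detection: after the XOR, a lane is zero iff it matched
    // h2, and the classic (x - 1) & !x & 0x80 trick sets that lane's high bit.
    let cmp = word ^ repeat(0x2A);
    let eq_mask = cmp.wrapping_sub(repeat(0x01)) & !cmp & repeat(0x80);

    // The bit-oriented trailing_zeros() now recovers the byte-oriented index.
    assert_eq!(eq_mask.trailing_zeros() / 8, 5);
}
```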

Member:
Aha! The values eq_mask and empty_mask also have a to_le() in their construction, so the bytes are effectively flipped twice. We can flip either with from_le_bytes(..) or with these to_le() calls, but we should not do both.

let cmp = group ^ repeat(h2);
let high_bit_greater_than_128 = (!cmp) & repeat(0x80);
let high_bit_greater_than_128_or_zero = cmp.wrapping_sub(repeat(0x01));
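
That double flip can be confirmed in isolation. A quick standalone check (illustrative, not odht code): interpreting bytes as little-endian and then calling to_le() swaps twice on big-endian targets, which collapses to a plain native-endian load.

```rust
fn main() {
    let bytes = [1u8, 2, 3, 4, 5, 6, 7, 8];

    // On a big-endian CPU, from_le_bytes swaps once and to_le() swaps again;
    // on a little-endian CPU, both are no-ops. Either way the two flips cancel,
    // which is why the pair behaves like a single native-endian load.
    assert_eq!(
        u64::from_le_bytes(bytes).to_le(),
        u64::from_ne_bytes(bytes)
    );
}
```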