Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bincode doesn't generate stable serializations of HashMap #230

Closed
pedrocr opened this issue Mar 4, 2018 · 13 comments
Closed

bincode doesn't generate stable serializations of HashMap #230

pedrocr opened this issue Mar 4, 2018 · 13 comments

Comments

@pedrocr
Copy link

@pedrocr pedrocr commented Mar 4, 2018

When serializing a struct with a HashMap it's easy to end up with two values that are equal to each other but serialize to different output. An example that fails sometimes:

#[macro_use] extern crate serde_derive;
extern crate bincode;

use std::collections::HashMap;

#[derive(Debug, PartialEq, Serialize, Deserialize)]
struct MyStruct {
    map: HashMap<u64, u64>,
}

fn main() {
    let mut map = HashMap::new();
    map.insert(0, 0);
    map.insert(1, 1);

    let s1 = MyStruct{map,};
    let e1: Vec<u8> = bincode::serialize(&s1).unwrap();
    let s2: MyStruct = bincode::deserialize(&e1).unwrap();
    let e2: Vec<u8> = bincode::serialize(&s2).unwrap();
    
    println!("struct1 is {:?}", s1);
    println!("struct2 is {:?}", s2);

    assert_eq!(s1, s2);
    assert_eq!(e1, e2);
}

This is most likely because rust is seeding the hash with different values on different runs and bincode serializes in hash order and not key order. This makes it hard to use bincode to hash a struct for deduplication/content addressing purposes.

@TyOverby
Copy link
Collaborator

@TyOverby TyOverby commented Mar 4, 2018

Yeah, this is mostly a limitation of Rust Hashmaps: Iteration order is not guaranteed. Have you tried using a HashMap with a Hasher other than the default?

If that doesn't work, I'm afraid that your only other option will be to pull the items out into a vec, sort them, and then serialize the vec.

@pedrocr
Copy link
Author

@pedrocr pedrocr commented Mar 4, 2018

For now I've worked around this by using the indexmap crate. Wouldn't it be possible to do a .keys().sort() before serialization to always serialize in the same order? I guess that may need to be done in serde.

@TyOverby
Copy link
Collaborator

@TyOverby TyOverby commented Mar 5, 2018

That would have to be done in serde.

Have you tried using other hashers? Is using BTreeMap an option? I think those have guaranteed ordering.

@TyOverby
Copy link
Collaborator

@TyOverby TyOverby commented Mar 5, 2018

I'm going to close this issue because I don't think that there's anything that can be done on the bincode side. But feel free to use this issue for more communication; I'm interested in seeing your use-case succeed.

@TyOverby TyOverby closed this Mar 5, 2018
@pedrocr
Copy link
Author

@pedrocr pedrocr commented Mar 5, 2018

I solved this for now with BTreeMap indeed:

pedrocr/syncer@d5cfd3a

This is not ideal because it imposes the sorting requirement on all operations when I only really care about it for serialization. Another solution would be to provide my own serialization implementation:

https://stackoverflow.com/questions/42723065/how-to-sort-hashmap-keys-when-serializing-with-serde#42723390

That would still impose a penalty on serialization. An even better solution may be to figure out how to seed the HashMap RandomState with a fixed value from my struct and that way I get no performance penalty and stable serialization. Right now that's only an optimization for directories with lots of files so it's not a priority.

Thanks for the help!

@TyOverby
Copy link
Collaborator

@TyOverby TyOverby commented Mar 5, 2018

Have you considered linked-hash-map? It is "sorted" by order of insertion and should have minimal runtime overhead.

@pedrocr
Copy link
Author

@pedrocr pedrocr commented Mar 5, 2018

I used indexmap which provides the same guarantee but it's not enough, which is why I moved to BTreeMap. I need the guarantee that no matter the order of operations if the end result is equal then the serialization is equal. Just guaranteeing insertion order doesn't do that because there are several orderings of operations that end up with the same outcome.

@TyOverby
Copy link
Collaborator

@TyOverby TyOverby commented Mar 6, 2018

Yeah, if you need this behavior, I think that sorting (or using a pre-sorted container) is the only way to make this work.

@pedrocr
Copy link
Author

@pedrocr pedrocr commented Mar 6, 2018

The ideal solution is to make HashMap use a stable seed based on info in the struct itself. The current rust stdlib doesn't make that easy to do so I'll just ignore the issue for now and use BTreeMap.

@TyOverby
Copy link
Collaborator

@TyOverby TyOverby commented Mar 6, 2018

last-thought: have you tried using https://doc.servo.org/fnv/struct.FnvHasher.html? I read through the source and there's no randomization of the seed.

@pedrocr
Copy link
Author

@pedrocr pedrocr commented Mar 6, 2018

Its disadvantages are that it performs badly on larger inputs, and provides no protection against collision attacks

The problem with the default hash isn't that it's seeded, it's that I can't just provide the seed myself and that you're now supposed to use DefaultHasher which is not guaranteed to be the same forever. Something like metrohash may be a better solution as it takes a seed value:

https://docs.rs/metrohash/1.0.5/metrohash/struct.MetroHash64.html#method.with_seed

@TyOverby
Copy link
Collaborator

@TyOverby TyOverby commented Mar 6, 2018

That looks like it might work!

However, I'm not sure that the stdlib guarantees iteration order even when the seed is constant. I think rust uses robbinhood hashing, and from what I remember of that algorithm, it does really weird stuff with array location and item placement.

@pedrocr
Copy link
Author

@pedrocr pedrocr commented Mar 6, 2018

That might be a problem indeed and even trickier to detect. In most cases it will work the same for the same size of hash table and yet if on one iteration you do a lot of inserts and removes and on another iteration you don't it may happen that one resizes the table and the other doesn't and the iteration order is different. So you're right, it's probably not a good solution, good catch! Just doing the key sorting is probably the better idea then. The slowdown is probably not even large enough to care about. BTreeMap it is.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
2 participants
You can’t perform that action at this time.