Skip to content
This repository has been archived by the owner on Mar 19, 2022. It is now read-only.

Implement checksums #31

Closed
aboodman opened this issue Dec 3, 2019 · 3 comments
Closed

Implement checksums #31

aboodman opened this issue Dec 3, 2019 · 3 comments
Assignees

Comments

@aboodman
Copy link
Contributor

aboodman commented Dec 3, 2019

Right now the code is using the Noms hash as a checksum, which means that (a) the client and server need to agree on a (potentially old) version of Noms to use for the hash, and also that (b) we can't efficiently have the checksum be over a different serialization than the internal schema.

Originally I thought we'd use LtHash as an incremental hash but @phritz points out it is overpowered for what we need, because it is actually a fully homomorphic hash. I hadn't realized that.

Ideally we could find a simple hash to use here which is incremental. Maybe even an old fashioned checksum is sufficient? It doesn't need to be secure.

All this is doing is what a classic checksum does -- it's a sanity check that sync is working properly.

@aboodman aboodman changed the title Use lthash for the checksum Implement checksums Feb 19, 2020
@phritz
Copy link
Contributor

phritz commented Feb 22, 2020

ok here's what i think about this, feedback welcome

requirements for checksumming in r:

  • works at the level of the fundamental data abstraction, which is a key value store
  • widely available implementation
  • fast; on the order of 1s on a mobile device for an entire 20MB dataset
  • incremental in the sense that if the state changes computing the new checksum should take O(size of data that changed)
  • collision resistent in a non cryptographic sense; this thing will be used as a sanity check so we dont care if adversaries can construct preimages or collisions, we just care that the probability that any checksum() of any two maps is equal is low
  • not too much overhead, maybe order 10ish bytes per entry in the map (10k entries implies 100KBish overhead)

non-requirements:

  • cryptographic guarantees. i dont think we care about first or second preimage attacks for example. this is a checksum.

proposal:

  1. choose a hash function, call it h. crc32 is super fast, we could use that.
  2. let the hash of an entry in the map be the hash of the key bytes concatenated with the utf8 bytes representing the value, so for example: h(entry) = h("k=" || entry.key || "v=" || entry.value)
  3. let the checksum of the map be the XOR of the hashes of all the entries
    • let mc be the map checksum, initialized to zeros
    • on insertion: mc = mc XOR h(newEntry)
    • on deletion: mc = mc XOR h(oldEntry)
    • on modification: mc = mc XOR h(oldEntry) XOR h(newEntry)

how's this all sound?

@aboodman
Copy link
Contributor Author

Sounds great. Yeah this is all much simpler.

I still think that the serialization should be <num-entries> || <key-len> || <key-bytes> || <val-len> || <val-bytes> ... even though we don't need to do this strictly speaking because we don't care about cryptographic guarantees. It just seems more correct.

@aboodman
Copy link
Contributor Author

I believe this is complete, right @phritz

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants