This repository has been archived by the owner on Sep 3, 2021. It is now read-only.

DBReadError(MapLoadError(CorruptChunk(Corrupt("missing key")))) #351

Open
phritz opened this issue May 19, 2021 · 5 comments

phritz commented May 19, 2021

https://rocicorp.slack.com/archives/C01JJGGS6CU/p1621426062298400

UnhandledRejection
Non-Error promise rejection captured with value: DBReadError(MapLoadError(CorruptChunk(Corrupt("missing key"))))
Pull returned: PullFailed(FetchFailed(RequestTimeout(TimeoutError { _private: () })))
logger: console
arguments: ["Pull returned: PullFailed(FetchFailed(RequestTimeout(TimeoutError { _private: () })))"]

phritz commented May 19, 2021

[screenshot]


phritz commented May 19, 2021

FYI: 11 occurrences across 8 users.


phritz commented May 19, 2021

In debugging this I discovered a separate annoyance: #354

phritz self-assigned this May 19, 2021

phritz commented May 20, 2021

The line throwing the error is here:

    return Err(LoadError::Corrupt("missing key"));

The key in the leafentry proto is None. This happens when we do an open-transaction and read the main head: the main head chunk is corrupt in this way. However, here's where we create the proto, and it does not look possible for it to write None:

    key: Some(builder.create_vector(e.key)),

I can't find anywhere else where we construct this proto (other than tests). I also don't see how there could be a replicache-level bug in how we read the proto, which is here:

    let root = leaf::get_root_as_leaf(chunk.data());

We're just iterating the entries in the proto; there's literally nothing else going on.
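
To make the failure mode concrete, here is a rough sketch of the read path described above. Only leaf::get_root_as_leaf and the Corrupt("missing key") line are taken from the code linked above; the Chunk and Entry types, the accessor names, and the surrounding loop are assumptions for illustration:

    // Sketch only: get_root_as_leaf and the "missing key" error are from the
    // repo; the types and accessor names here are assumptions.
    fn load_entries(chunk: &Chunk) -> Result<Vec<Entry>, LoadError> {
        let root = leaf::get_root_as_leaf(chunk.data());
        let mut entries = Vec::new();
        if let Some(leaf_entries) = root.entries() {
            for e in leaf_entries.iter() {
                // The failing check: flatbuffers fields are optional, so a
                // chunk can parse successfully and still yield key == None.
                let key = e.key().ok_or(LoadError::Corrupt("missing key"))?;
                let val = e.val().ok_or(LoadError::Corrupt("missing val"))?;
                entries.push(Entry { key: key.to_vec(), val: val.to_vec() });
            }
        }
        Ok(entries)
    }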

I don't see a pattern in what happens in the logs just before this error is hit, other than pushes and pulls completing just before. The 18 occurrences of the error were not limited to one user; they were spread across 14 users.

I'm wondering if it really is the chunk's bytes being corrupted somehow. But that's a bit of a stretch: the data would have to be corrupted in such a way that it still parses correctly as a proto. There are no other map-load or corrupt-chunk errors, only this one. If the chunk were being corrupted with random data, I would expect it to fail to parse at all at least some of the time. But we don't see that. Perhaps the data is being partially written? Or partially overwritten?

Something I did notice is that 18 out of 18 occurrences of this error are on Chrome Mobile 91.0.4472, which I think is a newish version (89% Chrome Mobile 91.0.4472, the rest Chrome Mobile WebView 91.0.4472). @arv @aboodman is there a clue in that, maybe? It seems a pretty clear indicator of... something.

As for what to do next, I'm open to suggestions, but I'm thinking:

  1. Improve the logging/error so that we get the chunk hash and bytes when this happens, then get it into users' hands if we can (see the sketch after this list).
  2. Go through the flatbuffers bug reports and see if anything jumps out.
  3. Carefully read the memstore and prolly-map code to see if anything jumps out. For example, I can imagine that if a map entry gets aliased and is accessed without synchronization, we could read a partially written value. (But Rust should make this hard, so....)
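
A minimal sketch of what item 1 could look like, assuming the Corrupt variant is widened from &'static str to String and that Chunk exposes hash() and data() as used above (hex::encode is from the hex crate; both are assumptions):

    // Hypothetical: include the chunk hash and raw bytes in the error so a
    // report from the field gives us something to reproduce with. Assumes
    // LoadError::Corrupt is changed to hold a String.
    return Err(LoadError::Corrupt(format!(
        "missing key in chunk {}; bytes: {}",
        chunk.hash(),
        hex::encode(chunk.data()),
    )));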


phritz commented May 20, 2021

Suggestion from Aaron, which I think is good: try to craft the minimal byte array that yields this error.
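
One way to approach that, sketched below: start from a known-good serialized leaf, mutate one byte at a time, and keep any candidate that still parses but fails with Corrupt("missing key"). In a flatbuffer, zeroing the right vtable slot can make an optional field read back as None while the rest of the buffer stays intact, which would match the symptom. Everything here except the error type is hypothetical; try_load stands in for whatever function wraps leaf::get_root_as_leaf:

    // Hypothetical harness: flip single bytes in a good buffer and look for
    // the exact "missing key" failure. Returns the first corrupting mutation.
    fn find_corrupting_byte(good: &[u8]) -> Option<Vec<u8>> {
        for i in 0..good.len() {
            for b in [0x00u8, 0xff] {
                let mut candidate = good.to_vec();
                candidate[i] = b;
                if matches!(
                    try_load(&candidate),
                    Err(LoadError::Corrupt("missing key"))
                ) {
                    return Some(candidate);
                }
            }
        }
        None
    }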
