
Fixity checks sometimes failing in non-reproducible ways #464

Open
jrochkind opened this issue Nov 4, 2019 · 7 comments

Comments

@jrochkind (Collaborator) commented Nov 4, 2019

Our fixity checks sometimes fail; when we notice, we run a fixity check again, and it always passes.

Most recent example on prod server https://app.honeybadger.io/projects/53196/faults/56777728

We don't really know why this is. Should investigate and fix.

We could just have the software automatically run a second fixity check after any failure, to see if it fails the second time; if not, ignore the failure. That may or may not resolve things and stop the failure reports.
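
For what it's worth, that "recheck once before alerting" behavior is small to sketch. The block and the names in the usage comment are hypothetical stand-ins, not our actual check code:

```ruby
# Minimal sketch of "recheck once before reporting a failure".
# `check` is any block that returns true when the fixity check passes.
def passes_with_one_retry?(&check)
  return true if check.call

  # Non-reproducible failures seem to pass on a second run, so only treat it
  # as a real failure if it fails twice in a row.
  check.call
end

# Hypothetical usage:
# raise FixityError unless passes_with_one_retry? { recomputed_sha512 == expected_sha512 }
```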

One hypothesis is that S3 is sometimes returning an error or dropping the connection early, and instead of surfacing as a network-related error that gets raised as such, it somehow turns into a failed fixity check.

I suppose another hypothesis would be that S3 sometimes really does return a file with corrupt bytes... but that seems unlikely.

It could also be a bug in shrine/down that somehow corrupts the bytes, via a race condition that only sometimes appears.

This is all difficult to debug.

@jrochkind changed the title from "Fixity checks sometimes erroneously failing" to "Fixity checks sometimes failing in non-reproducible ways" on Nov 4, 2019
@eddierubeiz (Contributor) commented Dec 17, 2019

I also like the idea of temporarily storing the number of bytes run through the fixity check mechanism, then discarding that number immediately if the check succeeds.
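
A minimal sketch of that byte-counting idea, assuming we can hand the checksum routine any IO-like object for the streamed file (the method name here is made up, not an existing API):

```ruby
require "digest"
require "stringio"

# Compute the checksum while counting how many bytes actually passed through it.
def checksum_with_byte_count(io, chunk_size: 64 * 1024)
  digest     = Digest::SHA512.new
  bytes_seen = 0

  while (chunk = io.read(chunk_size))
    digest.update(chunk)
    bytes_seen += chunk.bytesize
  end

  { checksum: digest.hexdigest, bytes_seen: bytes_seen }
end

# Toy example with an in-memory IO; in the real check the io would be the
# streamed S3 download.
result = checksum_with_byte_count(StringIO.new("hello world"))
# On a failed check, result[:bytes_seen] gets compared against the stored file
# size; on a passing check it's simply discarded.
```

On a red check, a byte count that differs from the stored size would immediately point at truncation or a substituted response body.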

@eddierubeiz (Contributor) commented Dec 18, 2019

We just got another one of these today -- https://app.honeybadger.io/projects/58989/faults/58717322
As usual, rechecking fixed the problem.

@jrochkind (Collaborator, Author) commented Dec 18, 2019

When a fixity failure is recorded… we could preserve the whole file that the checksum was computed from? Email it to us, attach it to the Honeybadger error, or something.

It’s a LOT of bytes though. It could also just save it to disk somewhere; even though we don’t rely on a persistent file system, it’ll probably stay there long enough for us to grab it.

That would help debug why it’s calculating a wrong checksum. Was the file truncated early? Was the file actually an S3 error message? Etc. We could diff it with the actual file, etc.
Immediately re-running the check might mean we never see these “false positives”, but I really want to know why they are happening, not just cover them up.

Although actually, to do that we would need to change our use of the shrine APIs, because currently we stream the file, so there's no way to go back and look at it again after computing the checksum! Ah, I see why you say number of bytes: we could keep that sum as we go. Or, temporarily, until we diagnose this problem, we could write all the bytes to a temporary location on disk as we go and delete them if the check comes back green. I dunno, just brainstorming.
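
A rough sketch of that tee-to-disk version, assuming we get a streamed IO and an expected checksum; the paths and names here are illustrative, not what our code actually does:

```ruby
require "digest"
require "securerandom"
require "tmpdir"

# Stream the file through the digest while also writing every chunk to a
# scratch file on local disk; keep the scratch file only if the check fails.
def checksum_preserving_bytes(io, expected_checksum, chunk_size: 64 * 1024)
  digest       = Digest::SHA512.new
  capture_path = File.join(Dir.tmpdir, "fixity-capture-#{SecureRandom.hex(8)}")

  File.open(capture_path, "wb") do |capture|
    while (chunk = io.read(chunk_size))
      digest.update(chunk)
      capture.write(chunk)
    end
  end

  if digest.hexdigest == expected_checksum
    File.delete(capture_path)                   # green: discard the captured bytes
    { ok: true }
  else
    { ok: false, captured_path: capture_path }  # red: keep the exact bytes we checksummed
  end
end
```

Even on a non-persistent filesystem, that captured file should usually survive long enough to diff against the canonical copy in S3.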

@eddierubeiz (Contributor) commented Dec 18, 2019

If the computed checksum is different from the file's checksum, and the number of bytes checked is also different from the length of the file, then we are likely checking either an incomplete file or something that is not the file at all. What we do in that case is a separate question, but I suspect this is what's going on in all of our currently "failed" fixity checks.

@jrochkind (Collaborator, Author) commented Dec 18, 2019

Right, but actually having the file to look at, not just the number of bytes, will tell us more than "likely checking either not the complete file, or something that is not the file": it tells us which of those it was, and if it wasn't the file, what the content actually was. Just knowing it's one of those (or not) won't give us much to go on for next steps, while having the actual content that was checked will.

@eddierubeiz (Contributor) commented Dec 18, 2019

Hm. If all else fails, we could even keep a buffer containing (a very small) part of the file in memory before sending it to the fixity check mechanism. Again, discard if green. If red, we have a "black box": a small chunk of the last data that was checked by the fixity check mechanism, which should be the end of the file.
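
Something like this, keeping only the trailing N bytes of whatever went through the digest (the 64KB size is an arbitrary guess, just meant to be large enough to recognize an S3 error body):

```ruby
require "digest"

TAIL_BYTES = 64 * 1024  # arbitrary "black box" size

# Compute the checksum while holding on to only the last TAIL_BYTES of input.
def checksum_with_tail(io, chunk_size: 64 * 1024)
  digest = Digest::SHA512.new
  tail   = (+"").force_encoding(Encoding::BINARY)

  while (chunk = io.read(chunk_size))
    digest.update(chunk)
    tail << chunk
    tail = tail[-TAIL_BYTES..] if tail.bytesize > TAIL_BYTES
  end

  { checksum: digest.hexdigest, tail: tail }
end

# On a red check we'd attach `tail` (or a hexdump of it) to the Honeybadger
# report; on green we just drop it.
```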

@eddierubeiz (Contributor) commented Dec 18, 2019

So the black box would let us rule out a couple of likely scenarios: if it contains the last n bytes of an AWS error message, we can at least google that; if it contains n bytes from somewhere inside the file, then the file is being truncated; etc.
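
A rough sketch of that triage, assuming we still have a known-good copy of the file on disk and the captured tail bytes in hand (S3 error responses are XML bodies containing `<Error>`/`<Code>` elements, which is what the first branch looks for):

```ruby
# Classify the "black box" contents; `tail` is the captured trailing bytes,
# `good_path` a local copy of the file as it should be.
def triage_tail(tail, good_path)
  good = File.binread(good_path)

  if tail.include?("<Error>") || tail.include?("<Code>")
    :looks_like_s3_error_body   # we checksummed an XML error response, not the file
  elsif good.end_with?(tail)
    :matches_end_of_file        # tail is right; the problem is earlier in the stream
  elsif good.include?(tail)
    :matches_middle_of_file     # stream was probably truncated before the real end
  else
    :unknown_content
  end
end
```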
