WT-8747 Reading between commit_timestamp and durable_timestamp can produce inconsistency #7485
Conversation
autobuild: Can one of the admins verify this patch?
This version produces WT_ROLLBACK if you read at the wrong time; that's not necessarily the best behavior, as retrying won't succeed. Also, it probably ought to stop prohibiting the read once stable moves forward, and this version doesn't. And I think it's also necessary to check on-disk values as well, since they might be evicted without being fully stable yet. So I've marked this a draft, but it's probably good enough to run preliminary tests to see if it blows up the world.

Update by KB:
1. Don't need to reject reads once stable reaches durable. (Whether or not the value is checkpointed; once it's stable nobody can commit and become durable before it does, so it's ok for them to read it.) This limits the possible problems to things happening shortly after commits.
2. Do need to check on-disk values; they can easily get evicted before they're stable.
Either move reads that fall between the commit and durable timestamps to the durable timestamp or later, or assert that reading in between fails, or both.
(and if anyone's wondering, it does explode the world)
A whole pile of these failed, but fortunately they all have pretty much the same structure and needed the same fix.
Don't bother running tests on this push (dcf4fdf); there's still a bit missing (on-disk values with nondurable stop times), which causes at least one known failure, and fixing it may introduce more.
(This makes reads of nondurable removes fail once they've been evicted.) Also add a specific test for this issue that, among other things, makes sure the restrictions don't interfere with RTS.
There's still a way to get inconsistent behavior: read from past the initial commit's durable timestamp, but commit with no timestamp, which becomes durable immediately, then checkpoint before stable reaches the initial commit's durable timestamp and crash.

I'm inclined to think that reading with a read timestamp and then committing without any timestamp (in one transaction) isn't valid, though it's not prohibited and nothing in the docs says not to. Unfortunately, you can do the same thing by reading with no timestamp and committing with no timestamp, which is harder to argue about and doesn't seem to directly violate any of the strictures about reading timestamped tables with no timestamp.

Not sure what to do about either of these cases, but the changes here do at least fix the fully timestamped part of the problem. (And they should be ready to go now.)
This makes sense to me, but I'm not a great reviewer for it. @kommiharibabu, if you think we should have an additional reviewer on this, please select someone.
src/docs/timestamp-prepare.dox
Outdated
reading its writes in a second transaction and then committing such that the
second transaction becomes durable before the first can produce data
inconsistency. To help prevent this, reads between the commit and durable
timestamp will fail with ::WT_ROLLBACK if the global stable timestamp has not
As per the recent conversation with @keithbostic, MongoDB read operations don't expect a WT_ROLLBACK error. MongoDB may not hit this scenario, but in case it happens, can it cause some problem? How about returning WT_NOTFOUND instead if this is the only update for the key, or returning a historical version?
Based on this change the entire approach could change. @keithbostic, can you provide your views on the same?
Interesting... I'd prefer to go with rollback if it's possible. My concern is that notfound could be interpreted by the caller as the row physically isn't there, which I think could itself lead to interesting and hard-to-debug failures.

I agree the list of failures in the patch build was unexpected. I guess my plan is to make this our next SERVER ticket once the WT-8745/SERVER-63166 pair is resolved, and understand the failures. Hopefully they're fixable, but if not, maybe notfound is our fallback position. I'm going to run a patch build with notfound, and see if that makes a difference.

Let me know if I'm misunderstanding the implications of returning notfound, please.
Returning the previous version (or WT_NOTFOUND if there is no previous version) is mostly equivalent to rounding the commit timestamp up to the durable timestamp. I suspect that's not what we want, since there's then no point distinguishing the commit timestamp from the durable timestamp.
(I say "mostly equivalent" because if this applies only when stable hasn't moved up to the durable timestamp yet, then it's equivalent to that only before that point. But changing what you see from one value to another when reading in that interval, depending on where stable is, seems like a bad plan also.)
Always returning WT_NOTFOUND in that interval seems bad because it's providing wrong data rather than no data, so I'd rather not go that way if possible.
There's another completely different approach, which is that what we're doing here is trying to avoid generating dependent transactions. But dependent transactions aren't catastrophic and this is a minor manifestation of them; we could just track the dependence. I'll make a top-level post about this so it doesn't get buried.
Thanks for the details @keithbostic and @sauclovian-wt.
- Returning WT_ROLLBACK has problems with MDB as they don't handle it. If MDB supports handling WT_ROLLBACK for all the read operations it shouldn't be a problem.
- Returning WT_NOTFOUND has problems as described by David.
I am thinking of another return type, WT_PREPARE_CONFLICT, instead of either WT_ROLLBACK or WT_NOTFOUND, as this issue occurs only with prepared transactions. I believe that MDB should already have the code to handle the WT_PREPARE_CONFLICT error for read operations.
That's an interesting idea -- thank you, @kommiharibabu.
Let's run a patch check, and see if that passes the MDB server tests.
MDB patch build has a lot of failures. I checked one failure, where dbcheck failed with a conflict operation; I suspect it is due to our WT_ROLLBACK error.
Patch build returning notfound is running. Update: it's a little better, but not significantly; still a large number of failures.
What we're trying to do with this patch is avoid generating dependent transactions: basically, if txn1 commits and txn2 reads what it wrote before it becomes stable, then txn2 depends on txn1 in the sense that it must not become durable before txn1. The current patchset prevents this by not allowing txn2 to read txn1's commits until either they're stable or we're guaranteed via other constraints that txn2 cannot become durable before txn1.

We could instead just remember the dependency and fail txn2 (or round up its durable timestamp) if it tries to commit too early. Unlike typical dependent-transaction scenarios, where txn2 is prevented from /committing/ before txn1 (which means that if txn1 aborts, txn2 must also abort, which requires a lot of machinery to handle), all we really need to do here is track txn2's earliest legal durable timestamp based on what it reads. This requires code at all the same places as the current patchset, but instead of a visibility check we just track the maximum durable timestamp we've seen, and then at commit time check it against the proposed durable (or commit) timestamp and the value of stable at that point.

Downside: we'd need to round up rather than fail, because failure at that point will panic, though it's possible this could be worked around by checks at prepare time. Upside: it handles the cases where txn2 commits without a timestamp. The other downside is that it's kind of against the system architecture (which isn't set up to track reads in detail), so the changes will likely be kinda messy.
Separately, I saw something in an old Jira ticket yesterday that suggests that MongoDB uses all_durable to pick new commit timestamps. If that's true (and it's always true rather than mostly true) then they won't ever actually hit this problem and it might make sense to give them a flag to silence the checks. (Nope, that's too optimistic; all_durable plus a little isn't necessarily ahead of txn1's durable. Never mind this bit.)
(Don't bother testing what I just pushed; I just want a non-broken changeset for archival before I start on a completely different approach to the problem.)
Instead of prohibiting reads of nondurable data, track the durable timestamps of data read by a transaction and prohibit committing with a durable timestamp (or commit timestamp for non-prepared transactions) earlier than that. Keep the test changes; most of them are about only reading nondurable data when we explicitly mean to. It seems best to avoid doing that otherwise.
Don't read g.ts_stable until it's actually been set by set_stable(); otherwise we can try to set the oldest timestamp after the stable timestamp.
…edtiger#7577) The cache size is now set to 1GB instead of 100MB by default.
Add more documentation around the ignore_prepare configuration.
Here is the different approach. However, the conclusion for the time being is that this is also reasonably likely to break MongoDB, and everyone's tired of thinking about it. So the plan for now is to close WT-8747 by documenting the behavior (and merging the test changes that avoid reading nondurable data except on purpose) in a separate pull request, then archiving this one and maybe coming back to it in a while.
Update: we closed WT-8747 (via another pull request) by documenting the issue as a hazard. That includes the test cleanup, so I've merged that here. I opened a new ticket, WT-8881, to track the fact that we should really eventually figure out a way to prohibit these commits; it points back here for archival purposes since these changes are probably still useful.

Also, I've ascertained that it is potentially necessary to track the durable timestamps of values read from history, which the code here so far doesn't do, and the new test in here doesn't check. Checking it is a bit of a nuisance, unfortunately. I'm going to close this pull request.
Prohibit reading for updates that are committed but not stable.