Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WT-8747 Reading between commit_timestamp and durable_timestamp can produce inconsistency #7485

Closed

Conversation

sauclovian-wt
Copy link
Contributor

Prohibit reading for updates that are committed but not stable.

@lukech
Copy link
Contributor

lukech commented Jan 29, 2022

autobuild: Can one of the admins verify this patch?

@sauclovian-wt sauclovian-wt marked this pull request as draft January 29, 2022 10:14
@sauclovian-wt
Copy link
Contributor Author

sauclovian-wt commented Jan 29, 2022

This version produces WT_ROLLBACK if you read at the wrong time; that's not necessarily the best behavior, as retrying won't succeed. Also, it probably ought to stop prohibiting the read once stable moves forward, and this version doesn't. And I think it's also necessary to check on-disk values as well, since they might be evicted without being fully stable yet. So I've marked this a draft, but it's probably good enough to run preliminary tests to see if it blows up the world.

Update by KB:
After discussion, the plan is to return WT_ROLLBACK, but include stable in the check so it won't repeatedly fail on the same timestamp.

1. Don't need to reject reads once stable reaches durable. (Whether or
not the value is checkpointed; once it's stable nobody can commit and
become durable before it does, so it's ok for them to read it.) This
limits the possible problems to things happening shortly after
commits.

2. Do need to check on-disk values; they can easily get evicted before
they're stable.
Either move reads from between commit and durable to durable or after,
or assert that reading in between fails, or both.
@sauclovian-wt
Copy link
Contributor Author

(and if anyone's wondering, it does explode the world)

Whole pile of these failed, but fortunately they all have pretty much
the same structure and needed the same fix.
@sauclovian-wt
Copy link
Contributor Author

Don't bother running tests on this push (dcf4fdf); there's still a bit missing (on-disk values with nondurable stop times) and that causes at least one known failure and fixing it may introduce more.

(this makes reads of nondurable removes fail once they've been evicted)

Add a specific test for this issue, that among other things makes sure
that the restrictions don't interfere with RTS.
@sauclovian-wt
Copy link
Contributor Author

There's still a way to get inconsistent behavior: read from past the initial commit's durable timestamp, but commit with no timestamp, which becomes durable immediately, then checkpoint before stable reaches the initial commit's durable timestamp and crash. I'm inclined to think that reading with a read timestamp and then committing without any timestamp (in one transaction) isn't valid, though it's not prohibited and nothing in the docs says not to. Unfortunately, you can do the same thing by reading with no timestamp and committing with no timestamp, which is harder to argue about and doesn't seem to directly violate any of the strictures about reading timestamped tables with no timestamp.

Not sure what to do about either of these cases, but the changes here do at least fix the fully timestamped part of the problem. (And they should be ready to go now.)

Copy link
Contributor

@keithbostic keithbostic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me, but I'm not a great reviewer for it. @kommiharibabu, if you think we should have an additional reviewer on this, please select someone.

src/include/txn.h Outdated Show resolved Hide resolved
src/include/txn_inline.h Outdated Show resolved Hide resolved
@keithbostic keithbostic marked this pull request as ready for review February 3, 2022 21:53
@keithbostic
Copy link
Contributor

reading its writes in a second transaction and then committing such that the
second transaction becomes durable before the first can produce data
inconsistency. To help prevent this, reads between the commit and durable
timestamp will fail with ::WT_ROLLBACK if the global stable timestamp has not
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per the recent conversation with @keithbostic, MongoDB read operations don't expect a WT_ROLLBACK error. MongoDB may not hit this scenario, but in case it happens can it cause some problem? How about returning a WT_NOTFOUND instead if this is the only update for the key or returning a historical version?

Based on this change, the entire approach can change, @keithbostic can you provide your views on the same?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting... I'd prefer to go with rollback if it's possible. My concern is that notfound could be interpreted by the caller as the row physically isn't there which I think could itself lead to interesting and hard-to-debug failures.

I agree the list of failures in the patch build was unexpected. I guess my plan is to make this our next SERVER ticket once the WT-8745/SERVER-63166 pair is resolved, and understand the failures. Hopefully they're fixable, but if not, maybe notfound is our fallback position. I'm going to run a patch build with notfound, and see if that makes a difference.

Let me know if I'm misunderstanding the implications of returning notfound, please.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returning the previous version (or WT_NOTFOUND if there is no previous version) is mostly equivalent to rounding the commit timestamp up to the durable timestamp. I suspect that's not what we want, since there's then no point distinguishing the commit timestamp from the durable timestamp.

(I say "mostly equivalent" because if this applies only when stable hasn't moved up to the durable timestamp yet, then it's equivalent to that only before that point. But changing what you see from one value to another when reading in that interval, depending on where stable is, seems like a bad plan also.)

Always returning WT_NOTFOUND in that interval seems bad because it's providing wrong data rather than no data, so I'd rather not go that way if possible.

There's another completely different approach, which is that what we're doing here is trying to avoid generating dependent transactions. But dependent transactions aren't catastrophic and this is a minor manifestation of them; we could just track the dependence. I'll make a top-level post about this so it doesn't get buried.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the details @keithbostic and @sauclovian-wt.

  1. Returning WT_ROLLBACK has problems with MDB as they don't handle it. If MDB supports handling WT_ROLLBACK for all the read operations it shouldn't be a problem.
  2. Returning WT_NOTFOUND has problems as described by David.

I am thinking of another return type WT_PREPARE_CONFLICT instead of either WT_ROLLBACK or WT_NOTFOUND as this issue occurs only with prepared transactions. I believe that MDB should have the code to handle the WT_PREPATE_CONFLICT error for reading operations.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's an interesting idea -- thank you, @kommiharibabu.

Let's run a patch check, and see if that passes the MDB server tests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kommiharibabu
Copy link
Contributor

MDB patch build has a lot of failures, I checked one failure where the dbcheck failed with conflict operation. I suspect it is due to our WT_ROLLBACK error.

@keithbostic
Copy link
Contributor

keithbostic commented Feb 4, 2022

Patch build returning notfound instead of rollback: https://evergreen.mongodb.com/version/61fd6248a4cf472d4bc3243c?redirect_spruce_users=true

Update: it's a little better, but not significantly, still a large number of failures.

@sauclovian-wt
Copy link
Contributor Author

What we're trying to do with this patch is avoid generating dependent transactions: basically, if txn1 commits and txn2 reads what it wrote before it becomes stable, then txn2 depends on txn1 in the sense that it must not become durable before txn1. The current patchset prevents this by not allowing txn2 to read txn1's commits until either they're stable or we're guaranteed that txn2 cannot become durable before txn1 via other constraints.

We could instead just remember the dependency and fail txn2 or round up its durable timestamp if it tries to commit too early. Unlike typical dependent transaction scenarios where txn2 is prevented from /committing/ before txn1 (which means that if txn1 aborts, txn2 also must abort, which requires a lot of machinery to handle) all we really need to do to handle this is to track txn2's earliest legal durable timestamp depending on what it reads. This requires code at all the same places as the current patchset, but instead of a visibility check we just track the maximum durable timestamp that we've seen, and then at commit time check it against the proposed durable (or commit) timestamp and the value of stable at that point. Downside: we'd need to round up rather than fail, because failure at that point will panic, but it's possible that this could be worked around by checks at prepare time. Upside: it handles the cases where txn2 commits without a timestamp.

Other downside though is that it's kind of against the system architecture (which isn't set up to track reads in detail) so the changes will likely be kinda messy.

@sauclovian-wt
Copy link
Contributor Author

sauclovian-wt commented Feb 5, 2022

Separately, I saw something in an old Jira ticket yesterday that suggests that MongoDB uses all_durable to pick new commit timestamps. If that's true (and it's always true rather than mostly true) then they won't ever actually hit this problem and it might make sense to give them a flag to silence the checks.

(nope, that's too optimistic; all_durable plus a little isn't necessarily ahead of txn1's durable. never mind this bit)

@keithbostic
Copy link
Contributor

@sauclovian-wt
Copy link
Contributor Author

(Don't bother testing what I just pushed; just want a non-broken changeset for archival before I start on a completely different approach to the problem.)

sauclovian-wt and others added 9 commits February 24, 2022 02:23
Instead of prohibiting reads of nondurable data, track the durable
timestamps of data read by a transaction and prohibit committing with
a durable timestamp (or commit timestamp for non-prepared
transactions) earlier than that.

Keep the test changes; most of them are about only reading nondurable
data when we explicitly mean to. It seems best to avoid doing that
otherwise.
Don't read g.ts_stable until it's actually been set by set_stable(), otherwise we can try and set the oldest timestamp after the stable timestamp.
…edtiger#7577)

The cache size is now set to 1GB instead of 100MB by default.
Add more documentation around the ignore_prepare configuration.
@sauclovian-wt
Copy link
Contributor Author

Here is the different approach.

However, the conclusion for the time being is that this is also reasonably likely to break MongoDB and everyone's tired of thinking about it. So the plan for now is to close WT-8747 by documenting the behavior (and merging the test changes that avoid reading nondurable data except on purpose) with a separate pull request, then archiving this one and maybe coming back to it in a while.

@sauclovian-wt sauclovian-wt marked this pull request as draft February 25, 2022 02:43
@sauclovian-wt
Copy link
Contributor Author

Update: we closed WT-8747 (via another pull request) by documenting the issue as a hazard. That includes the test cleanup, so I've merged that here. I opened a new ticket WT-8881 to track the fact that we should really eventually figure out a way to prohibit these commits; it points back here for archival purposes since these changes are probably still useful.

Also I've ascertained that it is potentially necessary to track the durable timestamps of values read from history, which the code here so far doesn't do, and the new test in here doesn't check. Checking it is a bit of a nuisance, unfortunately.

I'm going to close this pull request.

@keithbostic keithbostic deleted the wt-8747-read-nondurable branch February 26, 2022 15:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants