Skip to content

Commit

Permalink
WT-2316: WT_CURSOR.prev out-of-order returns, fix a split race.
Browse files Browse the repository at this point in the history
Simplify the __page_descend() function, it only needs a WT_PAGE, it
doesn't need the WT_REF for the page, hopefully that makes it easier
to understand.

Add comments to each of the races we have to address, try and describe
the case each check is intended to fix.
  • Loading branch information
keithbostic committed Jan 11, 2016
1 parent 70c8b78 commit f1977bb
Show file tree
Hide file tree
Showing 2 changed files with 146 additions and 29 deletions.
118 changes: 99 additions & 19 deletions src/btree/bt_walk.c
Original file line number Diff line number Diff line change
Expand Up @@ -118,10 +118,42 @@ __page_ascend(WT_SESSION_IMPL *session,
__ref_index_slot(session, parent_ref, pindexp, slotp);

/*
* When internal pages split, the WT_REF structures being moved
* are updated first. If the WT_REF we started with references
* the same page as we found on our search of the parent, there
* is a consistent view.
* There's a split race when a cursor moving forwards through
* the tree ascends the tree. If we're splitting an internal
* page into its parent, we move the WT_REF structures and
* then update the parent's page index before updating the split
* page's page index, and it's not an atomic update. A thread
* can read the split page's original page index and then read
* the parent page's replacement index.
*
* This can create a race for next-cursor movements.
*
* For example, imagine an internal page with 3 child pages,
* with the namespaces a-f, g-h and i-j; the first child page
* splits. The parent starts out with the following page-index:
*
* | ... | a | g | i | ... |
*
* which changes to this:
*
* | ... | a | c | e | g | i | ... |
*
* The split page starts out with the following page-index:
*
* | a | b | c | d | e | f |
*
* Imagine a cursor finishing the 'f' part of the namespace that
* starts its ascent to the parent's 'a' slot. Then the page
* splits and the parent page's page index is replaced. If the
* cursor then searches the parent's replacement page index for
* the 'a' slot, it finds it and then increments to the slot
* after the 'a' slot, the 'c' slot, and then it incorrectly
* repeats its traversal of part of the namespace.
*
* This function takes a WT_REF argument which is the page from
* which we start our ascent. If the parent's slot we find in
* our search doesn't point to the same page as that initial
* WT_REF, there's a race and we start over again.
*/
if (ref->home == parent_ref->page)
break;
Expand All @@ -136,7 +168,7 @@ __page_ascend(WT_SESSION_IMPL *session,
*/
static void
__page_descend(WT_SESSION_IMPL *session,
WT_REF *ref, WT_PAGE_INDEX **pindexp, uint32_t *slotp, bool prev)
WT_PAGE *page, WT_PAGE_INDEX **pindexp, uint32_t *slotp, bool prev)

This comment has been minimized.

Copy link
@erbenmo

erbenmo Jan 19, 2016

Minor: update the comment in line 176 since we are not passing in a ref now as input.

This comment has been minimized.

Copy link
@keithbostic

keithbostic Jan 21, 2016

Author Contributor

@erbenmo, thanks!

I'll get that fixed.

{
WT_PAGE_INDEX *pindex;

Expand All @@ -145,23 +177,71 @@ __page_descend(WT_SESSION_IMPL *session,
* have a hazard pointer.
*/
for (;; __wt_yield()) {
WT_INTL_INDEX_GET(session, ref->page, pindex);
WT_INTL_INDEX_GET(session, page, pindex);
*slotp = prev ? pindex->entries - 1 : 0;

/*
* When internal pages split, the WT_REF structures being moved
* are updated first. If the last WT_REF on the page into which
* we're descending doesn't reference the page as its home, the
* page has split and a "previous" cursor moving to the last
* slot on the page would be in the wrong part of the namespace.
* The test isn't necessary for a "next" cursor because we do
* right-hand splits on internal pages and the first part of the
* page's namespace won't have changed as part of the split.
* Rather than testing the direction boolean, do the test the
* "previous" cursor movement requires, it's not wrong for the
* "next" cursor movement.
* There's a split race when a cursor moving backwards through
* the tree descends the tree. If we're splitting an internal
* page into its parent, we move the WT_REF structures and
* update the parent's page index before updating the split
* page's page index, and it's not an atomic update. A thread
* can read the parent page's replacement page index and then
* read the split page's original index.
*
* This can create a race for previous-cursor movements.
*
* For example, imagine an internal page with 3 child pages,
* with the namespaces a-f, g-h and i-j; the first child page
* splits. The parent starts out with the following page-index:
*
* | ... | a | g | i | ... |
*
* The split page starts out with the following page-index:
*
* | a | b | c | d | e | f |
*
* The first step is to move the c-f ranges into a new subtree,
* so, for example we might have two new internal pages 'c' and
* 'e', where the new 'c' page references the c-d namespace and
* the new 'e' page references the e-f namespace. The top of the
* subtree references the parent page, but until the parent's
* page index is updated, any threads in the subtree won't be
* able to ascend out of the subtree. However, once the parent
* page's page index is updated to this:
*
* | ... | a | c | e | g | i | ... |
*
* threads in the subtree can ascend into the parent. Imagine a
* cursor in the c-d part of the namespace that ascends to the
* parent's 'c' slot. It would then decrement to the slot before
* the 'c' slot, the 'a' slot.
*
* The previous-cursor movement selects the last slot in the 'a'
* page; if the split page's page-index hasn't been updated yet,
* it will select the 'f' slot, which is incorrect. Once the
* split page's page index is updated to this:
*
* | a | b |
*
* the previous-cursor movement will select the 'b' slot, which
* is correct.
*
* This function takes an argument which is the internal page
* from which we're descending. If the last slot on the page no
* longer points to the current page as its "home", the page is
* being split and part of its namespace moved. We have the
* correct page and we don't have to move, all we have to do is
* wait until the split page's page index is updated.
*
* No test is necessary for a next-cursor movement because we
* do right-hand splits on internal pages and the initial part
* of the page's namespace won't change as part of a split.
* Instead of testing the direction boolean, do the test the
* previous cursor movement requires in all cases, even though
* it will always succeed for a next-cursor movement.
*/
if (pindex->index[*slotp]->home == ref->page)
if (pindex->index[*slotp]->home == page)
break;
}
*pindexp = pindex;
Expand Down Expand Up @@ -482,7 +562,7 @@ descend: couple = ref;
empty_internal = true;

__page_descend(
session, ref, &pindex, &slot, prev);
session, ref->page, &pindex, &slot, prev);
} else {
/*
* Optionally skip leaf pages, the second half.
Expand Down
57 changes: 47 additions & 10 deletions src/include/btree.i
Original file line number Diff line number Diff line change
Expand Up @@ -1457,17 +1457,54 @@ __wt_split_intl_race(
*
* There's a page-split race when we walk the tree: if we're splitting
* an internal page into its parent, we update the parent's page index
* and then update the page being split, and it's not an atomic update.
* A thread could read the parent page's original page index, and then
* read the page's replacement index. Because internal page splits work
* by replacing the original page with the initial part of the original
* page, the result of this race is we will have a key that's past the
* end of the current page, and the parent's page index will have moved.
* before updating the split page's page index, and it's not an atomic
* update. A thread can read the parent page's original page index and
* then read the split page's replacement index.
*
* It's also possible a thread could read the parent page's replacement
* page index, and then read the page's original index. Because internal
* splits work by truncating the original page, the original page's old
* content is compatible, this isn't a problem and we ignore this race.
* Because internal page splits work by truncating the original page to
* the initial part of the original page, the result of this race is we
* will have a search key that points past the end of the current page.
* This is only an issue when we search past the end of the page, if we
* find a WT_REF in the page with the namespace we're searching for, we
* don't care if the WT_REF moved or not while we were searching, we
* have the correct page.
*
* For example, imagine an internal page with 3 child pages, with the
* namespaces a-f, g-h and i-j; the first child page splits. The parent
* starts out with the following page-index:
*
* | ... | a | g | i | ... |
*
* which changes to this:
*
* | ... | a | c | e | g | i | ... |
*
* The child starts out with the following page-index:
*
* | a | b | c | d | e | f |
*
* which changes to this:
*
* | a | b |
*
* The thread searches the original parent page index for the key "cat",
* it couples to the "a" child page; if it uses the replacement child
* page index, it will search past the end of the page and couple to the
* "b" page, which is wrong.
*
* To detect the problem, we remember the parent page's page index used
* to descend the tree. Whenever we search past the end of a page, we
* check to see if the parent's page index has changed since our use of
* it during descent. As the problem only appears if we read the split
* page's replacement index, the parent page's index must already have
* changed, ensuring we detect the problem.
*
* It's possible for the opposite race to happen (a thread could read
* the parent page's replacement page index and then read the split
* page's original index). This isn't a problem because internal splits
* work by truncating the split page, so the split page search is for
* content the split page retains after the split, and we ignore this
* race.
*/
WT_INTL_INDEX_GET(session, parent, pindex);
return (pindex != saved_pindex);
Expand Down

0 comments on commit f1977bb

Please sign in to comment.