vinyl: restart read iterator in case L0 is changed
Data reads in vinyl are known to yield in case of disk access, which opens
a window for modifications of the in-memory level. Imagine the following
scenario: right before data selection, a tuple is inserted into the space.
It passes the first stage of the commit procedure, i.e. it is prepared to
be committed but has not yet reached the WAL. Meanwhile, an iterator starts
to read the same key. At this moment the prepared statement is already
inserted into the in-memory tree and is therefore visible to the read
iterator. So the read iterator fetches this statement and proceeds to the
disk scan. In turn, the disk scan yields, and at this moment the WAL fails
to write the statement to disk. Next, two cases are possible:
1. The WAL thread has enough time to complete the rollback procedure.
2. The WAL thread fails to finish the rollback in this time gap.

In the first case the read iterator should skip the statement: the version
of the in-memory tree has diverged from the iterator's, so we fall back
into the iterator restoration procedure. The mem iterator may become
invalid, so the only choice is to restart the whole 'advance' routine.
Let's not try to restore it and instead always restart the iteration cycle
if the L0 level has changed during a yield.

In the second case nothing has changed for the read iterator, so it simply
returns the prepared statement (and this is considered OK).

Closes #3395
Korablev77 committed Jun 23, 2020
1 parent f84cb1a commit 83462a5
Showing 4 changed files with 363 additions and 55 deletions.
59 changes: 5 additions & 54 deletions src/box/vy_read_iterator.c
@@ -378,58 +378,6 @@ vy_read_iterator_scan_disk(struct vy_read_iterator *itr, uint32_t disk_src,
return 0;
}

/**
* Restore the position of the active in-memory tree iterator
* after a yield caused by a disk read and update 'next'
* if necessary.
*/
static NODISCARD int
vy_read_iterator_restore_mem(struct vy_read_iterator *itr,
struct vy_entry *next)
{
int rc;
int cmp;
struct vy_read_src *src = &itr->src[itr->mem_src];

rc = vy_mem_iterator_restore(&src->mem_iterator,
itr->last, &src->history);
if (rc < 0)
return -1; /* memory allocation error */
if (rc == 0)
return 0; /* nothing changed */

struct vy_entry entry = vy_history_last_stmt(&src->history);
cmp = vy_read_iterator_cmp_stmt(itr, entry, *next);
if (cmp > 0) {
/*
* Memory trees are append-only so if the
* source is not on top of the heap after
* restoration, it was not before.
*/
assert(src->front_id < itr->front_id);
return 0;
}
if (cmp < 0) {
/*
* The new statement precedes the current
* candidate for the next key.
*/
*next = entry;
itr->front_id++;
} else {
/*
* The new statement updates the next key.
* Make sure we don't read the old value
* from the cache while applying UPSERTs.
*/
struct vy_read_src *cache_src = &itr->src[itr->cache_src];
if (cache_src->front_id == itr->front_id)
vy_history_cleanup(&cache_src->history);
}
src->front_id = itr->front_id;
return 0;
}

static void
vy_read_iterator_restore(struct vy_read_iterator *itr);

@@ -529,8 +477,11 @@ vy_read_iterator_advance(struct vy_read_iterator *itr)
* as it is owned exclusively by the current fiber so the only
* source to check is the active in-memory tree.
*/
if (vy_read_iterator_restore_mem(itr, &next) != 0)
return -1;
struct vy_mem_iterator *mem_itr = &itr->src[itr->mem_src].mem_iterator;
if (mem_itr->version != mem_itr->mem->version) {
vy_read_iterator_restore(itr);
goto restart;
}
/*
* Scan the next range in case we transgressed the current
* range's boundaries.
232 changes: 232 additions & 0 deletions test/vinyl/gh-3395-read-prepared-uncommitted.result
@@ -0,0 +1,232 @@
-- test-run result file version 2
-- Let's test the following case: right before data selection a tuple
-- is inserted into the space. It passes the first stage of the commit
-- procedure, i.e. it is prepared to be committed but has not yet
-- reached the WAL. Meanwhile we start reading the same key. At this
-- moment the prepared statement is already inserted into the
-- in-memory tree. So the read iterator fetches this statement and
-- proceeds to the disk scan. In turn, the disk scan yields, and at
-- this moment the WAL fails to write the statement to disk, so it is
-- rolled back. The read iterator must recognize this situation and
-- handle it properly.
--
test_run = require('test_run').new()
| ---
| ...
fiber = require('fiber')
| ---
| ...
errinj = box.error.injection
| ---
| ...

s = box.schema.create_space('test', {engine = 'vinyl'})
| ---
| ...
pk = s:create_index('pk')
| ---
| ...
sk = s:create_index('sk', {parts = {{2, 'unsigned'}}, unique = false})
| ---
| ...
s:replace{3, 2}
| ---
| - [3, 2]
| ...
s:replace{2, 2}
| ---
| - [2, 2]
| ...
box.snapshot()
| ---
| - ok
| ...

c = fiber.channel(1)
| ---
| ...

function do_write() s:replace{1, 2} end
| ---
| ...
function init_read() end
| ---
| ...
function do_read() local ret = sk:select{2} c:put(ret) end
| ---
| ...

-- Since we have tuples stored on disk, the read procedure may
-- yield, opening a window for the WAL thread to commit or roll back
-- statements. In our case, WAL_WRITE will lead to a rollback of the
-- {1, 2} statement. Note that the race condition may result in two
-- possible scenarios:
-- 1. The WAL thread has time to roll back the statement. In this case
-- {1, 2} will be deleted from the mem (L0 lsm level) and we'll fall
-- back into read iterator restoration (since the rollback bumps the
-- mem's version, but the iterator's version remains unchanged).
-- 2. The WAL thread doesn't keep up with the rollback/commit. Thus,
-- the state of the mem doesn't change and the statement is returned
-- in the result set (i.e. a dirty read takes place).
--
test_run:cmd("setopt delimiter ';'");
| ---
| - true
| ...
-- is_tx_faster_than_wal determines whether the WAL thread has time
-- to finish its routine or not. In the first case we add an extra
-- time gap to make sure that the WAL thread has finished its work
-- and the statement is rolled back.
--
function read_prepared_with_delay(is_tx_faster_than_wal)
errinj.set("ERRINJ_WAL_DELAY", true)
fiber.create(do_write, s)
init_read()
errinj.set("ERRINJ_VY_READ_PAGE_DELAY", true)
fiber.create(do_read, sk, c)
errinj.set("ERRINJ_WAL_WRITE", true)
if is_tx_faster_than_wal then
errinj.set("ERRINJ_RELAY_FASTER_THAN_TX", true)
end
errinj.set("ERRINJ_WAL_DELAY", false)
fiber.sleep(0.1)
errinj.set("ERRINJ_VY_READ_PAGE_DELAY", false)
local res = c:get()
errinj.set("ERRINJ_WAL_WRITE", false)
if is_tx_faster_than_wal then
errinj.set("ERRINJ_RELAY_FASTER_THAN_TX", false)
end
return res
end;
| ---
| ...

test_run:cmd("setopt delimiter ''");
| ---
| - true
| ...

-- 1. The prepared tuple is invisible to the read iterator since the
-- WAL has enough time and the rollback procedure is finished.
--
read_prepared_with_delay(false)
| ---
| - - [2, 2]
| - [3, 2]
| ...
-- 2. The tuple is not rolled back, so it is visible to all transactions.
--
read_prepared_with_delay(true)
| ---
| - - [1, 2]
| - [2, 2]
| - [3, 2]
| ...

-- Give WAL thread time to catch up.
--
fiber.sleep(0.1)
| ---
| ...

sk:select{2}
| ---
| - - [2, 2]
| - [3, 2]
| ...

s:drop()
| ---
| ...

-- A bit more sophisticated case: the tuple to be inserted is
-- not the first one in the result set.
--
s = box.schema.create_space('test', {engine = 'vinyl'})
| ---
| ...
pk = s:create_index('pk')
| ---
| ...
-- Set page_size to the minimum so that accessing each tuple results
-- in a page cache miss and disk access - we need this to make the
-- VY_READ_PAGE_DELAY injection fire.
--
sk = s:create_index('sk', {parts = {{2, 'unsigned'}}, unique = false, page_size = 5})
| ---
| ...

s:replace{1, 10}
| ---
| - [1, 10]
| ...
s:replace{2, 20}
| ---
| - [2, 20]
| ...
s:replace{4, 20}
| ---
| - [4, 20]
| ...
s:replace{5, 30}
| ---
| - [5, 30]
| ...
box.snapshot()
| ---
| - ok
| ...

gen = nil
| ---
| ...
param = nil
| ---
| ...
state = nil
| ---
| ...
function do_write() s:replace{3, 20} end
| ---
| ...
function init_read() gen, param, state = sk:pairs({20}, {iterator = box.index.EQ}) gen(param, state) end
| ---
| ...
function do_read() local _, ret = gen(param, state) c:put(ret) end
| ---
| ...

read_prepared_with_delay(false)
| ---
| - [4, 20]
| ...
-- All the same, but test the second scenario (the WAL thread has not
-- finished yet and the tuple is visible).
--
read_prepared_with_delay(true)
| ---
| - [3, 20]
| ...
-- Give WAL thread time to catch up.
--
fiber.sleep(0.1)
| ---
| ...
-- The read view is aborted due to the rollback of the prepared statement.
--
gen(param, state)
| ---
| - error: The read view is aborted
| ...

fiber.sleep(0.1)
| ---
| ...
sk:select{20}
| ---
| - - [2, 20]
| - [4, 20]
| ...

s:drop()
| ---
| ...
