Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stale tuples in secondary index under certain conditions #3607

Closed
locker opened this issue Aug 3, 2018 · 1 comment
Closed

Stale tuples in secondary index under certain conditions #3607

locker opened this issue Aug 3, 2018 · 1 comment
Assignees
Labels
bug Something isn't working vinyl
Milestone

Comments

@locker
Copy link
Member

locker commented Aug 3, 2018

Branch 1.9 commit f29466c (1.10 and 2.0 are affected as well)

yaml = require('yaml')

box.cfg{log_level = 4}

s = box.schema.space.create('test', {engine = 'vinyl'})
s:create_index('pk')
s:create_index('sk', {parts = {2, 'unsigned'}})

s:insert{1, 10}
box.snapshot()

s:update(1, {{'=', 2, 10}})
s:delete(1)
box.snapshot()

s:insert{1, 20}

print(yaml.encode(s.index.sk:select()))

os.exit(0)

The script produces the following output:

2018-08-03 14:55:33.261 [13023] main/101/test.lua C> Tarantool 1.9.1-59-gf29466cd
2018-08-03 14:55:33.262 [13023] main/101/test.lua C> log level 4
---
- [1, 20]
- [1, 20]
...

[1, 20] is printed twice, because [1, 10] wasn't removed from the secondary index.

@locker locker added bug Something isn't working vinyl labels Aug 3, 2018
@locker locker self-assigned this Aug 3, 2018
@kyukhin kyukhin added this to the 1.9.2 milestone Aug 3, 2018
locker added a commit that referenced this issue Aug 10, 2018
index.update() looks up the old tuple in the primary index, applies
update operations to it, then writes a DELETE statement to secondary
indexes to delete the old tuple and a REPLACE statement to all indexes
to insert the new tuple. It also sets a column mask for both DELETE and
REPLACE statements. The column mask is a bit mask which has a bit set if
the corresponding field is updated by update operations. It is used by
the write iterator for two purposes. First, the write iterator skips
REPLACE statements that don't update key fields. Second, the write
iterator turns a REPLACE that has a column mask that intersects with key
fields into an INSERT (so that it can get annihilated with a DELETE when
the time comes). The latter is correct, because if an update() does
update secondary key fields, then it must have deleted the old tuple and
hence the new tuple is unique in terms of extended key (merged primary
and secondary key parts, i.e. cmp_def).

The problem is that a bit may be set in a column mask even if the
corresponding field does not actually get updated. For example, consider
the following example.

  s = box.schema.space.create('test', {engine = 'vinyl'})
  s:create_index('pk')
  s:create_index('sk', {parts = {2, 'unsigned'}})
  s:insert{1, 10}
  box.snapshot()
  s:update(1, {{'=', 2, 10}})

The update() doesn't modify the secondary key field so it only writes
REPLACE{1, 10} to the secondary index (actually it writes DELETE{1, 10}
too, but it gets overwritten by the REPLACE). However, the REPLACE has
column mask that says that update() does modify the key field, because a
column mask is generated solely from update operations, before applying
them. As a result, the write iterator will not skip this REPLACE on
dump. This won't have any serious consequences, because this is a mere
optimization. What is worse, the write iterator will also turn the
REPLACE into an INSERT, which is absolutely wrong as the REPLACE is
preceded by INSERT{1, 10}. If the tuple gets deleted, the DELETE
statement and the INSERT created by the write iterator from the REPLACE
will get annihilated, leaving the old INSERT{1, 10} visible.

The issue may result in invalid select() output as demonstrated in the
issue description. It may also result in crashes, because the tuple
cache is very sensible to invalid select() output.

To fix this issue let's clear key bits in the column mask if we detect
that an update() doesn't actually update secondary key fields although
the column mask says it does. We do that in vy_tx_set(), if we see that
the statement overwritten by a REPLACE is a DELETE: the point is a
DELETE may be written to a secondary index only if the tuple it is
supposed to delete actually exists (we can't generate a surrogate DELETE
otherwise); so if a REPLACE overwrites a DELETE, there must be the same
REPLACE in the index and hence we can clear key bits in the REPLACE
column mask.

Closes #3607
locker added a commit that referenced this issue Aug 10, 2018
index.update() looks up the old tuple in the primary index, applies
update operations to it, then writes a DELETE statement to secondary
indexes to delete the old tuple and a REPLACE statement to all indexes
to insert the new tuple. It also sets a column mask for both DELETE and
REPLACE statements. The column mask is a bit mask which has a bit set if
the corresponding field is updated by update operations. It is used by
the write iterator for two purposes. First, the write iterator skips
REPLACE statements that don't update key fields. Second, the write
iterator turns a REPLACE that has a column mask that intersects with key
fields into an INSERT (so that it can get annihilated with a DELETE when
the time comes). The latter is correct, because if an update() does
update secondary key fields, then it must have deleted the old tuple and
hence the new tuple is unique in terms of extended key (merged primary
and secondary key parts, i.e. cmp_def).

The problem is that a bit may be set in a column mask even if the
corresponding field does not actually get updated. For example, consider
the following example.

  s = box.schema.space.create('test', {engine = 'vinyl'})
  s:create_index('pk')
  s:create_index('sk', {parts = {2, 'unsigned'}})
  s:insert{1, 10}
  box.snapshot()
  s:update(1, {{'=', 2, 10}})

The update() doesn't modify the secondary key field so it only writes
REPLACE{1, 10} to the secondary index (actually it writes DELETE{1, 10}
too, but it gets overwritten by the REPLACE). However, the REPLACE has
column mask that says that update() does modify the key field, because a
column mask is generated solely from update operations, before applying
them. As a result, the write iterator will not skip this REPLACE on
dump. This won't have any serious consequences, because this is a mere
optimization. What is worse, the write iterator will also turn the
REPLACE into an INSERT, which is absolutely wrong as the REPLACE is
preceded by INSERT{1, 10}. If the tuple gets deleted, the DELETE
statement and the INSERT created by the write iterator from the REPLACE
will get annihilated, leaving the old INSERT{1, 10} visible.

The issue may result in invalid select() output as demonstrated in the
issue description. It may also result in crashes, because the tuple
cache is very sensible to invalid select() output.

To fix this issue let's clear key bits in the column mask if we detect
that an update() doesn't actually update secondary key fields although
the column mask says it does.

Closes #3607
locker added a commit that referenced this issue Aug 10, 2018
index.update() looks up the old tuple in the primary index, applies
update operations to it, then writes a DELETE statement to secondary
indexes to delete the old tuple and a REPLACE statement to all indexes
to insert the new tuple. It also sets a column mask for both DELETE and
REPLACE statements. The column mask is a bit mask which has a bit set if
the corresponding field is updated by update operations. It is used by
the write iterator for two purposes. First, the write iterator skips
REPLACE statements that don't update key fields. Second, the write
iterator turns a REPLACE that has a column mask that intersects with key
fields into an INSERT (so that it can get annihilated with a DELETE when
the time comes). The latter is correct, because if an update() does
update secondary key fields, then it must have deleted the old tuple and
hence the new tuple is unique in terms of extended key (merged primary
and secondary key parts, i.e. cmp_def).

The problem is that a bit may be set in a column mask even if the
corresponding field does not actually get updated. For example, consider
the following example.

  s = box.schema.space.create('test', {engine = 'vinyl'})
  s:create_index('pk')
  s:create_index('sk', {parts = {2, 'unsigned'}})
  s:insert{1, 10}
  box.snapshot()
  s:update(1, {{'=', 2, 10}})

The update() doesn't modify the secondary key field so it only writes
REPLACE{1, 10} to the secondary index (actually it writes DELETE{1, 10}
too, but it gets overwritten by the REPLACE). However, the REPLACE has
column mask that says that update() does modify the key field, because a
column mask is generated solely from update operations, before applying
them. As a result, the write iterator will not skip this REPLACE on
dump. This won't have any serious consequences, because this is a mere
optimization. What is worse, the write iterator will also turn the
REPLACE into an INSERT, which is absolutely wrong as the REPLACE is
preceded by INSERT{1, 10}. If the tuple gets deleted, the DELETE
statement and the INSERT created by the write iterator from the REPLACE
will get annihilated, leaving the old INSERT{1, 10} visible.

The issue may result in invalid select() output as demonstrated in the
issue description. It may also result in crashes, because the tuple
cache is very sensible to invalid select() output.

To fix this issue let's clear key bits in the column mask if we detect
that an update() doesn't actually update secondary key fields although
the column mask says it does.

Closes #3607
@locker
Copy link
Member Author

locker commented Aug 10, 2018

Fixed by e72867c

@locker locker closed this as completed Aug 10, 2018
locker added a commit that referenced this issue May 24, 2019
If an UPDATE request doesn't touch key parts of a secondary index, we
don't need to write it to the index memory level or dump it to disk, as
this would only increase IO load. Historically, we use column mask set
by the UPDATE operation to skip secondary indexes that are not affected
by the operation on commit. However, there's a problem here: the column
mask isn't precise - it may have a bit set even if the corresponding
column doesn't really get updated, e.g. consider {'+', 2, 0}. Not taking
this into account may result in appearance of phantom tuples on disk as
the write iterator assumes that statements that have no effect aren't
written to secondary indexes (this is needed to apply INSERT+DELETE
"annihilation" optimization). We fixed that by clearing column mask bits
in vy_tx_set in case we detect that the key isn't changed, for more
details see #3607 and commit e72867c ("vinyl: fix appearance of
phantom tuple in secondary index after update"). It was rather an ugly
hack, but it worked.

However, it turned out that apart from looking hackish this code has
a nasty bug that may lead to tuples missing from secondary indexes.
Consider the following example:

  s = box.schema.space.create('test', {engine = 'vinyl'})
  s:create_index('pk')
  s:create_index('sk', {parts = {2, 'unsigned'}})
  s:insert{1, 1, 1}

  box.begin()
  s:update(1, {{'=', 2, 2}})
  s:update(1, {{'=', 3, 2}})
  box.commit()

The first update operation writes DELETE{1,1} and REPLACE{2,1} to the
secondary index write set. The second update replaces REPLACE{2,1} with
DELETE{2,1} and then with REPLACE{2,1}. When replacing DELETE{2,1} with
REPLACE{2,1} in the write set, we assume that the update doesn't modify
secondary index key parts and clear the column mask so as not to commit
a pointless request, see vy_tx_set. As a result, we skip the first
update too and get key {2,1} missing from the secondary index.

Actually, it was a dumb idea to use column mask to skip statements in
the first place, as there's a much easier way to filter out statements
that have no effect for secondary indexes. The thing is every DELETE
statement inserted into a secondary index write set acts as a "single
DELETE", i.e. there's exactly one older statement it is supposed to
purge. This is, because in contrast to the primary index we don't write
DELETE statements blindly - we always look up the tuple overwritten in
the primary index first. This means that REPLACE+DELETE for the same key
is basically a no-op and can be safely skip. Moreover, DELETE+REPLACE
can be treated as no-op, too, because secondary indexes don't store full
tuples hence all REPLACE statements for the same key are equivalent.
By marking such pair of statements as no-op in vy_tx_set, we guarantee
that no-op statements don't make it to secondary index memory or disk
levels.

Closes #4242
locker added a commit that referenced this issue May 27, 2019
If an UPDATE request doesn't touch key parts of a secondary index, we
don't need to re-index it in the in-memory secondary index, as this
would only increase IO load. Historically, we use column mask set by the
UPDATE operation to skip secondary indexes that are not affected by the
operation on commit. However, there's a problem here: the column mask
isn't precise - it may have a bit set even if the corresponding column
value isn't changed by the update operation, e.g. consider {'+', 2, 0}.
Not taking this into account may result in appearance of phantom tuples
on disk as the write iterator assumes that statements that have no
effect aren't written to secondary indexes (this is needed to apply
INSERT+DELETE "annihilation" optimization). We fixed that by clearing
column mask bits in vy_tx_set in case we detect that the key isn't
changed, for more details see #3607 and commit e72867c ("vinyl: fix
appearance of phantom tuple in secondary index after update"). It was
rather an ugly hack, but it worked.

However, it turned out that apart from looking hackish this code has
a nasty bug that may lead to tuples missing from secondary indexes.
Consider the following example:

  s = box.schema.space.create('test', {engine = 'vinyl'})
  s:create_index('pk')
  s:create_index('sk', {parts = {2, 'unsigned'}})
  s:insert{1, 1, 1}

  box.begin()
  s:update(1, {{'=', 2, 2}})
  s:update(1, {{'=', 3, 2}})
  box.commit()

The first update operation writes DELETE{1,1} and REPLACE{2,1} to the
secondary index write set. The second update replaces REPLACE{2,1} with
DELETE{2,1} and then with REPLACE{2,1}. When replacing DELETE{2,1} with
REPLACE{2,1} in the write set, we assume that the update doesn't modify
secondary index key parts and clear the column mask so as not to commit
a pointless request, see vy_tx_set. As a result, we skip the first
update too and get key {2,1} missing from the secondary index.

Actually, it was a dumb idea to use column mask to skip statements in
the first place, as there's a much easier way to filter out statements
that have no effect for secondary indexes. The thing is every DELETE
statement inserted into a secondary index write set acts as a "single
DELETE", i.e. there's exactly one older statement it is supposed to
purge. This is, because in contrast to the primary index we don't write
DELETE statements blindly - we always look up the tuple overwritten in
the primary index first. This means that REPLACE+DELETE for the same key
is basically a no-op and can be safely skip. Moreover, DELETE+REPLACE
can be treated as no-op, too, because secondary indexes don't store full
tuples hence all REPLACE statements for the same key are equivalent.
By marking both statements as no-op in vy_tx_set, we guarantee that
no-op statements don't make it to secondary index memory or disk levels.

Closes #4242
locker added a commit that referenced this issue May 27, 2019
If an UPDATE request doesn't touch key parts of a secondary index, we
don't need to re-index it in the in-memory secondary index, as this
would only increase IO load. Historically, we use column mask set by the
UPDATE operation to skip secondary indexes that are not affected by the
operation on commit. However, there's a problem here: the column mask
isn't precise - it may have a bit set even if the corresponding column
value isn't changed by the update operation, e.g. consider {'+', 2, 0}.
Not taking this into account may result in appearance of phantom tuples
on disk as the write iterator assumes that statements that have no
effect aren't written to secondary indexes (this is needed to apply
INSERT+DELETE "annihilation" optimization). We fixed that by clearing
column mask bits in vy_tx_set in case we detect that the key isn't
changed, for more details see #3607 and commit e72867c ("vinyl: fix
appearance of phantom tuple in secondary index after update"). It was
rather an ugly hack, but it worked.

However, it turned out that apart from looking hackish this code has
a nasty bug that may lead to tuples missing from secondary indexes.
Consider the following example:

  s = box.schema.space.create('test', {engine = 'vinyl'})
  s:create_index('pk')
  s:create_index('sk', {parts = {2, 'unsigned'}})
  s:insert{1, 1, 1}

  box.begin()
  s:update(1, {{'=', 2, 2}})
  s:update(1, {{'=', 3, 2}})
  box.commit()

The first update operation writes DELETE{1,1} and REPLACE{2,1} to the
secondary index write set. The second update replaces REPLACE{2,1} with
DELETE{2,1} and then with REPLACE{2,1}. When replacing DELETE{2,1} with
REPLACE{2,1} in the write set, we assume that the update doesn't modify
secondary index key parts and clear the column mask so as not to commit
a pointless request, see vy_tx_set. As a result, we skip the first
update too and get key {2,1} missing from the secondary index.

Actually, it was a dumb idea to use column mask to skip statements in
the first place, as there's a much easier way to filter out statements
that have no effect for secondary indexes. The thing is every DELETE
statement inserted into a secondary index write set acts as a "single
DELETE", i.e. there's exactly one older statement it is supposed to
purge. This is, because in contrast to the primary index we don't write
DELETE statements blindly - we always look up the tuple overwritten in
the primary index first. This means that REPLACE+DELETE for the same key
is basically a no-op and can be safely skip. Moreover, DELETE+REPLACE
can be treated as no-op, too, because secondary indexes don't store full
tuples hence all REPLACE statements for the same key are equivalent.
By marking both statements as no-op in vy_tx_set, we guarantee that
no-op statements don't make it to secondary index memory or disk levels.

Closes #4242

(cherry picked from commit 69aee6f)
locker added a commit that referenced this issue May 27, 2019
If an UPDATE request doesn't touch key parts of a secondary index, we
don't need to re-index it in the in-memory secondary index, as this
would only increase IO load. Historically, we use column mask set by the
UPDATE operation to skip secondary indexes that are not affected by the
operation on commit. However, there's a problem here: the column mask
isn't precise - it may have a bit set even if the corresponding column
value isn't changed by the update operation, e.g. consider {'+', 2, 0}.
Not taking this into account may result in appearance of phantom tuples
on disk as the write iterator assumes that statements that have no
effect aren't written to secondary indexes (this is needed to apply
INSERT+DELETE "annihilation" optimization). We fixed that by clearing
column mask bits in vy_tx_set in case we detect that the key isn't
changed, for more details see #3607 and commit e72867c ("vinyl: fix
appearance of phantom tuple in secondary index after update"). It was
rather an ugly hack, but it worked.

However, it turned out that apart from looking hackish this code has
a nasty bug that may lead to tuples missing from secondary indexes.
Consider the following example:

  s = box.schema.space.create('test', {engine = 'vinyl'})
  s:create_index('pk')
  s:create_index('sk', {parts = {2, 'unsigned'}})
  s:insert{1, 1, 1}

  box.begin()
  s:update(1, {{'=', 2, 2}})
  s:update(1, {{'=', 3, 2}})
  box.commit()

The first update operation writes DELETE{1,1} and REPLACE{2,1} to the
secondary index write set. The second update replaces REPLACE{2,1} with
DELETE{2,1} and then with REPLACE{2,1}. When replacing DELETE{2,1} with
REPLACE{2,1} in the write set, we assume that the update doesn't modify
secondary index key parts and clear the column mask so as not to commit
a pointless request, see vy_tx_set. As a result, we skip the first
update too and get key {2,1} missing from the secondary index.

Actually, it was a dumb idea to use column mask to skip statements in
the first place, as there's a much easier way to filter out statements
that have no effect for secondary indexes. The thing is every DELETE
statement inserted into a secondary index write set acts as a "single
DELETE", i.e. there's exactly one older statement it is supposed to
purge. This is, because in contrast to the primary index we don't write
DELETE statements blindly - we always look up the tuple overwritten in
the primary index first. This means that REPLACE+DELETE for the same key
is basically a no-op and can be safely skip. Moreover, DELETE+REPLACE
can be treated as no-op, too, because secondary indexes don't store full
tuples hence all REPLACE statements for the same key are equivalent.
By marking both statements as no-op in vy_tx_set, we guarantee that
no-op statements don't make it to secondary index memory or disk levels.

Closes #4242

(cherry picked from commit 69aee6f)
avtikhon pushed a commit that referenced this issue May 29, 2019
If an UPDATE request doesn't touch key parts of a secondary index, we
don't need to re-index it in the in-memory secondary index, as this
would only increase IO load. Historically, we use column mask set by the
UPDATE operation to skip secondary indexes that are not affected by the
operation on commit. However, there's a problem here: the column mask
isn't precise - it may have a bit set even if the corresponding column
value isn't changed by the update operation, e.g. consider {'+', 2, 0}.
Not taking this into account may result in appearance of phantom tuples
on disk as the write iterator assumes that statements that have no
effect aren't written to secondary indexes (this is needed to apply
INSERT+DELETE "annihilation" optimization). We fixed that by clearing
column mask bits in vy_tx_set in case we detect that the key isn't
changed, for more details see #3607 and commit e72867c ("vinyl: fix
appearance of phantom tuple in secondary index after update"). It was
rather an ugly hack, but it worked.

However, it turned out that apart from looking hackish this code has
a nasty bug that may lead to tuples missing from secondary indexes.
Consider the following example:

  s = box.schema.space.create('test', {engine = 'vinyl'})
  s:create_index('pk')
  s:create_index('sk', {parts = {2, 'unsigned'}})
  s:insert{1, 1, 1}

  box.begin()
  s:update(1, {{'=', 2, 2}})
  s:update(1, {{'=', 3, 2}})
  box.commit()

The first update operation writes DELETE{1,1} and REPLACE{2,1} to the
secondary index write set. The second update replaces REPLACE{2,1} with
DELETE{2,1} and then with REPLACE{2,1}. When replacing DELETE{2,1} with
REPLACE{2,1} in the write set, we assume that the update doesn't modify
secondary index key parts and clear the column mask so as not to commit
a pointless request, see vy_tx_set. As a result, we skip the first
update too and get key {2,1} missing from the secondary index.

Actually, it was a dumb idea to use column mask to skip statements in
the first place, as there's a much easier way to filter out statements
that have no effect for secondary indexes. The thing is every DELETE
statement inserted into a secondary index write set acts as a "single
DELETE", i.e. there's exactly one older statement it is supposed to
purge. This is, because in contrast to the primary index we don't write
DELETE statements blindly - we always look up the tuple overwritten in
the primary index first. This means that REPLACE+DELETE for the same key
is basically a no-op and can be safely skip. Moreover, DELETE+REPLACE
can be treated as no-op, too, because secondary indexes don't store full
tuples hence all REPLACE statements for the same key are equivalent.
By marking both statements as no-op in vy_tx_set, we guarantee that
no-op statements don't make it to secondary index memory or disk levels.

Closes #4242
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working vinyl
Projects
None yet
Development

No branches or pull requests

2 participants