appending_hash<row> ignores cells after first null #4567

Closed
pdziepak opened this issue Jun 17, 2019 · 31 comments · Fixed by #7203
Comments

@pdziepak
Contributor

Scylla version (or git commit hash): d773e4b

appending_hash<row> is supposed to update the provided hasher with information from the selected columns. However, the loop that is supposed to do that ends early if a requested column doesn't exist. As a result, a row hash will not include any information about the cells after the first missing one.

        for (auto id : columns) {
            const cell_and_hash* cell_and_hash = cells.find_cell_and_hash(id);
            if (!cell_and_hash) {
                return; // <- should be continue
            }
            auto&& def = s.column_at(kind, id);
            // add current cell to the hasher
@pdziepak pdziepak added the bug label Jun 17, 2019
@pdziepak pdziepak self-assigned this Jun 17, 2019
@slivne
Contributor

slivne commented Jun 18, 2019

@pdziepak I need a bit more context about this - is this a bug in computing digests? One that can fail to identify that two rows are different, because the early return makes the hashes match?

Since when do we have this?

Example

Schema {PK PRIMARY KEY, A INT, B INT, C INT}

Node 1: PK=1, A=1, C=3
Node 2: PK=1, A=1, C=2

How did you find this bug - code inspection / unit tests?

@pdziepak
Contributor Author

d773e4b is the commit that introduced the bug (for the aficionados of archaeology: in earlier versions of that patch there was a lambda, so the return behaved like continue; later the lambda became a loop, but the return stayed a return).

The user-visible problem is exactly as in your example. If the queried columns are A, B, and C, then nodes 1 and 2 will produce the same digest, since only column A will be taken into consideration.

I spotted this by inspecting the code while trying to find the source of an unrelated bug.
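
To make the failure mode concrete, here is a toy sketch (standalone C++, not Scylla code) that mimics the loop in appending_hash<row>. With the buggy early return, the two rows from the example above produce identical digests; with the missing cell simply skipped, they differ.

#include <cstdint>
#include <functional>
#include <iostream>
#include <optional>
#include <vector>

using cell = std::optional<int>;

// Stand-in for appending_hash<row>: walk the requested columns and feed
// the present cells into a simple hasher.
static uint64_t hash_row(const std::vector<cell>& cells, bool buggy_early_return) {
    uint64_t h = 0;
    for (const cell& c : cells) {
        if (!c) {
            if (buggy_early_return) {
                return h;   // the bug: everything after the first null is ignored
            }
            continue;       // the fix: just skip the missing cell
        }
        h = h * 1000003 + std::hash<int>{}(*c);   // stand-in for feed_hash()
    }
    return h;
}

int main() {
    std::vector<cell> node1 = {1, std::nullopt, 3};   // PK=1, A=1, C=3
    std::vector<cell> node2 = {1, std::nullopt, 2};   // PK=1, A=1, C=2
    std::cout << (hash_row(node1, true)  == hash_row(node2, true))  << '\n';  // 1: digests collide
    std::cout << (hash_row(node1, false) == hash_row(node2, false)) << '\n';  // 0: digests differ
}

(The fix that was eventually merged goes one step further and also feeds a marker value for the missing cell - see the null_hash_value line in the snippet quoted later in this thread - so that rows differing only in which column is null do not collide either.)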

pdziepak added a commit to pdziepak/scylla that referenced this issue Jun 18, 2019
Currently, appending_hash<row> returns as soon as it encounters the first
missing cell (i.e. a null). This is incorrect, since any present cells
after that null will not contribute to the hash. The right behaviour is
to simply ignore missing cells.

Fixes scylladb#4567.
pdziepak added a commit to pdziepak/scylla that referenced this issue Jun 18, 2019
@slivne slivne added this to the 3.2 milestone Jul 11, 2019
@pdziepak pdziepak removed their assignment Aug 1, 2019
@slivne slivne modified the milestones: 3.2, 3.4 Feb 13, 2020
@slivne slivne modified the milestones: 4.0, 4.1 Mar 24, 2020
@slivne slivne modified the milestones: 4.1, 4.2 Jun 1, 2020
@gleb-cloudius
Contributor

We need to fix this asap. This is the root cause for #7116.

@eliransin
Contributor

@psarna this is a blocker for Jepsen. Please be advised there is a PR in place with requested changes on it (by pdziepak)

@avikivity
Member

Perhaps we can use a new hash function enum (with the same hash function) to describe the behavioral change.

@gleb-cloudius
Contributor

@avikivity this is a regression. All the problems you are trying to avoid with feature bits or a new hash function would have been present when the regression was introduced. It caused so little mayhem that we did not even notice it. The same will be true for fixing the regression.

@avikivity
Member

@gleb-cloudius we don't know that it did not cause damage. Also, we have more users now, so there is more potential for problems.

@gleb-cloudius
Contributor

@avikivity we know because otherwise we would have noticed the bug. The "more users" argument is hard to dispute, but it is not as if we had a small number of users two years ago either. Read repair for the affected rows will not kill anyone, and if repair uses the same hashing function, read repair is likely to happen anyway, because the data will be different after two years of being neglected. A differing schema digest would be a more serious problem, but can it actually differ?

@avikivity
Member

More users = more chances the problem will surface. I don't want to risk ruining someone's day during an upgrade, which is already stressful.

I think it can happen during schema digest. Why wouldn't it?

@gleb-cloudius
Contributor

Read repair will not ruin anyone's day. If it does, they would have had a bad day regardless, because the data has been out of sync after we neglected to repair it for two years.

For the problem to happen, a cell has to be missing. Is that possible for the schema tables? Maybe it is, and we got lucky and nobody hit the problem when upgrading to the broken version.

@tgrabiec
Contributor

tgrabiec commented Sep 8, 2020

Note that nodetool repair is not prone to this bug; it doesn't use this function.

@gleb-cloudius
Contributor

gleb-cloudius commented Sep 8, 2020 via email

@tgrabiec
Contributor

tgrabiec commented Sep 8, 2020

Schema digest is also not prone.

psarna pushed a commit to psarna/scylla that referenced this issue Sep 8, 2020
psarna pushed a commit to psarna/scylla that referenced this issue Sep 8, 2020
@psarna
Contributor

psarna commented Sep 8, 2020

I have a partially working work-in-progress series based on Paweł's commits (https://github.com/psarna/scylla/tree/fix_ignoring_cells_after_null_in_appending_hash) which adds a cluster feature for this, but it's quite tough to decide where to check for this feature bit...

The code in this branch is ugly and doesn't work well with our test suite, but I don't have any clear idea how to approach it, so suggestions are welcome. To sum up - the cluster feature check needs to be placed somewhere, and I don't see a good place for it. In my patch I placed the check inside mutation_partition.cc and used an awful global variable access to check if the cluster supports the new feature (which is terrible, but also common practice). It doesn't work, because in our test suite this global variable is not always initialized, so the code crashes. It would be better to pass a parameter up the call chain, but this will in turn spoil quite a few interfaces, starting with mutation_partition::query_compacted and going up through mutation::query. I don't see any other choice but to add this ugly parameter up the call chain, but I won't manage to do this today anyway, so I'm posting this cry for help in case somebody has a better idea.
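
For illustration, a minimal sketch of the "pass a parameter through the call chain" option (hypothetical names and heavily simplified signatures; the real mutation::query / mutation_partition::query_compacted interfaces take many more arguments): the caller consults the cluster feature once, and the digest code never reads a global.

#include <cstdint>
#include <optional>
#include <vector>

enum class digest_mode {
    legacy,         // old behaviour, kept until every node advertises the feature
    with_null_fix,  // new behaviour: a null no longer terminates the row hash
};

struct hasher {
    uint64_t h = 0;
    void feed(uint64_t v) { h = h * 1000003 + v; }
};

// Stand-in for appending_hash<row>: the mode arrives as an explicit parameter.
void hash_row(hasher& h, const std::vector<std::optional<uint64_t>>& cells, digest_mode mode) {
    for (const auto& c : cells) {
        if (!c) {
            if (mode == digest_mode::legacy) {
                return;    // reproduce the old digests bit-for-bit
            }
            continue;      // fixed behaviour
        }
        h.feed(*c);
    }
}

// Stand-in for mutation_partition::query_compacted / mutation::query:
// every layer just forwards the mode it was given.
void query_compacted(hasher& h, const std::vector<std::optional<uint64_t>>& cells, digest_mode mode) {
    hash_row(h, cells, mode);
}

int main() {
    bool cluster_supports_null_fix = true;   // would come from the gossip feature service
    digest_mode mode = cluster_supports_null_fix ? digest_mode::with_null_fix
                                                 : digest_mode::legacy;
    hasher h;
    query_compacted(h, {1, std::nullopt, 3}, mode);
}

The cost is exactly the interface churn described above; the benefit is that tests can pick the mode explicitly instead of depending on a global being initialized.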

/cc @avikivity

@avikivity
Member

I suggested before modeling it as a hash function change: add digest_algorithm::xxHash_with_null_fix, then specialize on the hasher.

I see there is also a digester class that type-erases the algorithm, so maybe not so easy to specialize.
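
A rough sketch of that alternative (simplified, with stand-in types - the real digest_algorithm enum and the digester type erasure live elsewhere in the tree): the legacy behaviour is carried by the hasher's type, so the row-hashing code can select it at compile time instead of taking a runtime flag.

#include <cstdint>
#include <optional>
#include <type_traits>
#include <vector>

struct xx_hasher {
    uint64_t h = 0;
    void update(uint64_t v) { h = h * 1000003 + v; }
};

// Same underlying hash function; the type only tags the old row-level semantics.
struct legacy_xx_hasher_without_null_digest : xx_hasher {};

template <typename Hasher>
void hash_row(Hasher& h, const std::vector<std::optional<uint64_t>>& cells) {
    for (const auto& c : cells) {
        if (!c) {
            if constexpr (std::is_same_v<Hasher, legacy_xx_hasher_without_null_digest>) {
                return;    // legacy digests, selected purely by the hasher type
            } else {
                continue;  // fixed behaviour for the new algorithm
            }
        }
        h.update(*c);
    }
}

int main() {
    xx_hasher fixed;
    hash_row(fixed, {1, std::nullopt, 3});    // new behaviour
    legacy_xx_hasher_without_null_digest legacy;
    hash_row(legacy, {1, std::nullopt, 3});   // old behaviour
}

The type erasure is what makes this awkward in practice: once the algorithm is hidden behind the digester interface, the concrete hasher type is no longer visible at the point where the row is hashed.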

avikivity added a commit that referenced this issue Sep 10, 2020
"
This series fixes a bug in `appending_hash<row>` that caused it to ignore any cells after the first NULL. It also adds a cluster feature which starts using the new hashing only after the whole cluster is aware of it. The series comes with tests, which reproduce the issue.

Fixes #4567
Based on #4574
"

* psarna-fix_ignoring_cells_after_null_in_appending_hash:
  test: extend mutation_test for NULL values
  tests/mutation: add reproducer for #4567
  gms: add a cluster feature for fixed hashing
  digest: add null values to row digest
  mutation_partition: fix formatting
  appending_hash<row>: make publicly visible
psarna pushed a commit to psarna/scylla that referenced this issue Sep 10, 2020
avikivity added a commit that referenced this issue Sep 10, 2020
"
This series fixes a bug in `appending_hash<row>` that caused it to ignore any cells after the first NULL. It also adds a cluster feature which starts using the new hashing only after the whole cluster is aware of it. The series comes with tests, which reproduce the issue.

Fixes #4567
Based on #4574
"

* psarna-fix_ignoring_cells_after_null_in_appending_hash:
  test: extend mutation_test for NULL values
  tests/mutation: add reproducer for #4567
  gms: add a cluster feature for fixed hashing
  digest: add null values to row digest
  mutation_partition: fix formatting
  appending_hash<row>: make publicly visible
@avikivity
Member

@psarna how does this interact with the cached hashes? I think it should not, since the cached hashes are on a cell level. My concern is that hashes generated with one hash function will be used with another hash function.

Are deleted cells and never-written cells treated the same way?

@psarna
Contributor

psarna commented Sep 14, 2020

No idea, I'll browse the code and try to verify

@psarna
Contributor

psarna commented Sep 14, 2020

From what I see here, deleted and never-written cells are not treated the same way. Hashes are cached for atomic_cell, which boils down to using atomic_cell_view, which has special cases for cells that are deleted (assuming that I'm reading our IMR code correctly, which is a very generous assumption):

template<>
struct appending_hash<atomic_cell_view> {
    template<typename Hasher>
    void operator()(Hasher& h, atomic_cell_view cell, const column_definition& cdef) const {
        feed_hash(h, cell.is_live());
        feed_hash(h, cell.timestamp());
        if (cell.is_live()) {
            if (cdef.is_counter()) {
                counter_cell_view::with_linearized(cell, [&] (counter_cell_view ccv) {
                    ::feed_hash(h, ccv);
                });
                return;
            }
            if (cell.is_live_and_has_ttl()) {
                feed_hash(h, cell.expiry());
                feed_hash(h, cell.ttl());
            }
            feed_hash(h, cell.value());
        } else {
            feed_hash(h, cell.deletion_time());
        }
    }
};

@avikivity
Member

template<typename Hasher>
void appending_hash<row>::operator()(Hasher& h, const row& cells, const schema& s, column_kind kind, const query::column_id_vector& columns, max_timestamp& max_ts) const {
    for (auto id : columns) {
        const cell_and_hash* cell_and_hash = cells.find_cell_and_hash(id);
        if (!cell_and_hash) {
            feed_hash(h, appending_hash<row>::null_hash_value);
            continue;
        }
        auto&& def = s.column_at(kind, id);
        if (def.is_atomic()) {
            max_ts.update(cell_and_hash->cell.as_atomic_cell(def).timestamp());
            if constexpr (query::using_hash_of_hash_v<Hasher>) {
                if (cell_and_hash->hash) {
                    feed_hash(h, *cell_and_hash->hash);
                } else {
                    query::default_hasher cellh;

Is it correct to use default_hasher here? Shouldn't we use Hasher?

Even more in this variant:

void appending_hash<row>::operator()<legacy_xx_hasher_without_null_digest>(legacy_xx_hasher_without_null_digest& h, const row& cells, const schema& s, column_kind kind, const query::column_id_vector& columns, max_timestamp& max_ts) const {

xxhash is compatible with legacy_xx_hasher_without_null_digest below the row level, though, so it's just a code wart, not a correctness problem.
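
For readers following along, a minimal sketch of the hash-of-hash idea being discussed (stand-in types; the real logic is in the appending_hash<row> snippet quoted above): each cell contributes a per-cell digest that can be cached on the cell, and that per-cell digest always comes from one fixed hasher, which is why the cached value stays valid regardless of which row-level hasher later consumes it.

#include <cstdint>
#include <iostream>
#include <optional>

// A cell remembers the digest computed for it, so later digest queries can
// feed the cached value instead of re-hashing the cell contents.
struct cell_and_cached_hash {
    uint64_t value;
    std::optional<uint64_t> hash;
};

// Stand-in for query::default_hasher: the one fixed function used for
// per-cell digests, regardless of which row-level hasher asks for them.
static uint64_t default_cell_hash(uint64_t v) {
    return v * 0x9E3779B97F4A7C15ull;
}

struct row_hasher {
    uint64_t h = 0;
    void feed(uint64_t v) { h = h * 1000003 + v; }
};

// Hash-of-hash: the row hasher consumes per-cell digests, not raw cell bytes.
void feed_cell(row_hasher& h, cell_and_cached_hash& c) {
    if (!c.hash) {
        c.hash = default_cell_hash(c.value);   // compute once, cache on the cell
    }
    h.feed(*c.hash);
}

int main() {
    cell_and_cached_hash c{42, std::nullopt};
    row_hasher h;
    feed_cell(h, c);   // computes and caches the per-cell digest
    feed_cell(h, c);   // reuses the cached digest
    std::cout << h.h << '\n';
}

Under this scheme the row-level behavioural change (return vs. continue on nulls) never touches the cached per-cell digests, which matches the observation above that the two hashers are compatible below the row level.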

@avikivity
Member

From what I see here, deleted and never-written cells are not treated the same way.

And that's correct, since we'd want deleted cells with different deletion times to trigger a read repair.

This can cause an unnecessary read repair if a tombstone is hashed in one replica and nothing in another, because the tombstone was expired and garbage collected. But I see nothing we can do about it.
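
To spell that out, a tiny sketch (stand-in types, with an invented marker value) of why the two cases digest differently: a missing cell contributes only a null marker, while a dead cell contributes its liveness flag, timestamp and deletion time, mirroring the appending_hash<atomic_cell_view> branch quoted above. A replica that has already garbage-collected an expired tombstone will therefore disagree with one that still stores it.

#include <cstdint>
#include <iostream>

struct hasher {
    uint64_t h = 0;
    void feed(uint64_t v) { h = h * 1000003 + v; }
};

constexpr uint64_t null_marker = 0xdead;   // stand-in for appending_hash<row>::null_hash_value

// Never-written (or already GC'd) cell: only the null marker is fed.
void feed_missing(hasher& h) {
    h.feed(null_marker);
}

// Dead cell (tombstone): liveness, timestamp and deletion time are fed.
void feed_tombstone(hasher& h, uint64_t timestamp, uint64_t deletion_time) {
    h.feed(0 /* is_live = false */);
    h.feed(timestamp);
    h.feed(deletion_time);
}

int main() {
    hasher replica_a, replica_b;
    feed_tombstone(replica_a, 10, 100);   // tombstone still present on replica A
    feed_missing(replica_b);              // tombstone already expired and GC'd on replica B
    std::cout << (replica_a.h != replica_b.h) << '\n';   // 1: digests differ, read repair follows
}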

@psarna
Contributor

psarna commented Sep 14, 2020

template<typename Hasher>
void appending_hash<row>::operator()(Hasher& h, const row& cells, const schema& s, column_kind kind, const query::column_id_vector& columns, max_timestamp& max_ts) const {
    for (auto id : columns) {
        const cell_and_hash* cell_and_hash = cells.find_cell_and_hash(id);
        if (!cell_and_hash) {
            feed_hash(h, appending_hash<row>::null_hash_value);
            continue;
        }
        auto&& def = s.column_at(kind, id);
        if (def.is_atomic()) {
            max_ts.update(cell_and_hash->cell.as_atomic_cell(def).timestamp());
            if constexpr (query::using_hash_of_hash_v<Hasher>) {
                if (cell_and_hash->hash) {
                    feed_hash(h, *cell_and_hash->hash);
                } else {
                    query::default_hasher cellh;

Is it correct to use default_hasher here? Shouldn't we use Hasher?

Even more in this variant:

void appending_hash<row>::operator()<legacy_xx_hasher_without_null_digest>(legacy_xx_hasher_without_null_digest& h, const row& cells, const schema& s, column_kind kind, const query::column_id_vector& columns, max_timestamp& max_ts) const {

xxhash is compatible with legacy_xx_hasher_without_null_digest below the row level, though, so it's just a code wart, not a correctness problem.

Sounds right. I'll send a patch to remove the wart(s).

ManManson pushed a commit to ManManson/scylla that referenced this issue Sep 15, 2020
syuu1228 pushed a commit to syuu1228/scylla that referenced this issue Sep 16, 2020
@avikivity
Member

@roydahan we need feedback on this fix, especially during upgrades.

@roydahan

@avikivity I'll need more information about this.
What is your concern in terms of rolling upgrades?
(The current upgrade tests we have in master and 4.2.rc4, which AFAIU already has the fix, pass.)

@avikivity
Member

Suppose you have a row with columns pk, ck, c1, c2, c3.
If there is a NULL in the middle (for example, c1=1, c3=3, but c2 has no value), then the patch changes the way such rows are compared when doing quorum reads (previously, we would treat (1, NULL, 2) the same as (1, NULL, 3)).

The risk is that during the upgrade we'd see a spike in read repairs.

So we need a test that runs a QUORUM read workload before, during, and after the upgrade, and checks for read repair. There can be a few read repairs initiated after the upgrade, but there should not be any during the upgrade or a large number afterwards. The test should run on a few million rows, say reading 1000 rows/sec.

It can be manual, no need to repeat it.

@roydahan

roydahan commented Sep 30, 2020

Looking at every read_repair metric we have, I don't see any read repairs occurring - not before, during, or after the upgrade.

[screenshot: monitoring graph of read repair metrics, 2020-09-30]

Example of data in my table:

cassandra@cqlsh> SELECT * FROM  ks.targettable limit 10;

 pk                                   | ck       | c1                                   | c2   | c3  | c4   | c5   | c6
--------------------------------------+----------+--------------------------------------+------+-----+------+------+-----------
 093f8e10-1dde-11b2-da01-7977122bdf87 | 16382367 | 13ad0af0-1dd2-11b2-2a71-a5bc8c82f897 | null | 650 | null | null | 646.18738
 4b0dea10-1df9-11b2-da01-7977122bdf87 |    95964 | 13cfae20-1dd2-11b2-2a71-a5bc8c82f897 | null | 843 | null | null | 945.32086
 4b0dea10-1df9-11b2-da01-7977122bdf87 |  8391994 | 13f2c680-1dd2-11b2-2a71-a5bc8c82f897 | null | 470 | null | null | 768.49689
 4b0dea10-1df9-11b2-da01-7977122bdf87 | 13446650 | 141542a0-1dd2-11b2-2a71-a5bc8c82f897 | null | 936 | null | null | 233.95412
 4b0dea10-1df9-11b2-da01-7977122bdf87 | 18609679 | 14019390-1dd2-11b2-2a71-a5bc8c82f897 | null | 715 | null | null |   7.00867
 b6587d30-1dea-11b2-da01-7977122bdf87 |  3352719 | 13db1fd0-1dd2-11b2-2a71-a5bc8c82f897 | null | 797 | null | null | 697.15063
 b6587d30-1dea-11b2-da01-7977122bdf87 |  6253728 | 13c43c70-1dd2-11b2-2a71-a5bc8c82f897 | null | 732 | null | null | 390.12354
 b6587d30-1dea-11b2-da01-7977122bdf87 |  9661279 | 13b21400-1dd2-11b2-2a71-a5bc8c82f897 | null | 179 | null | null | 814.90497
 b6587d30-1dea-11b2-da01-7977122bdf87 | 10944590 | 13c4d8b0-1dd2-11b2-2a71-a5bc8c82f897 | null | 225 | null | null | 974.78217
 c5cb3980-1dde-11b2-da01-7977122bdf87 | 15431964 | 13ac95c0-1dd2-11b2-2a71-a5bc8c82f897 | null |  89 | null | null |  896.2901

I have around 15K op/s during the insert part and 15K op/s during the reads.
[screenshot: monitoring graph of throughput, ~15K op/s]

The test scenario is:

  1. A 4-node cluster of 4.1.7 (without the above fix).
  2. Insert 60M rows of the form (pk, ck, c1, NULL, c3, NULL, NULL, c6).
  3. Start reading using both:
    •  cql: select * from targettable where pk = ? and ck = ?
      
    •  cql: select * from targettable where pk = ?
      
  4. During reads:
    a. Upgrade node1 to master with the above fix.
    b. Roll back node1 (just because it's already part of the scenario I have).
    c. Upgrade all nodes one by one, nodes 1-4.
  5. Continue reading for a while.

@kostja
Contributor

kostja commented Oct 1, 2020

Please backport to 4.2

tgrabiec pushed a commit that referenced this issue Oct 1, 2020
"
This series fixes a bug in `appending_hash<row>` that caused it to ignore any cells after the first NULL. It also adds a cluster feature which starts using the new hashing only after the whole cluster is aware of it. The series comes with tests, which reproduce the issue.

Fixes #4567
Based on #4574
"

* psarna-fix_ignoring_cells_after_null_in_appending_hash:
  test: extend mutation_test for NULL values
  tests/mutation: add reproducer for #4567
  gms: add a cluster feature for fixed hashing
  digest: add null values to row digest
  mutation_partition: fix formatting
  appending_hash<row>: make publicly visible

(cherry picked from commit 0e03c97)
@tgrabiec
Contributor

tgrabiec commented Oct 2, 2020

@roydahan Did you finish testing this?

@roydahan

roydahan commented Oct 4, 2020

I've done the test Avi suggested, but I can't really verify that I had the case where the replicas differ from each other after the NULL.
In a later test I used CL=ONE and a higher load during the insert, and saw some read repairs after the full cluster upgrade (not a lot).
But again, I can't really tell whether a row was missing completely or whether it was caused by the issue above.

In any case, it looks like it doesn't break anything.

@slivne
Contributor

slivne commented Oct 4, 2020

Backported to 4.2

Let's wait with additional backports until it matures in the field.

avikivity added a commit that referenced this issue Nov 4, 2020
"
This series fixes a bug in `appending_hash<row>` that caused it to ignore any cells after the first NULL. It also adds a cluster feature which starts using the new hashing only after the whole cluster is aware of it. The series comes with tests, which reproduce the issue.

Fixes #4567
Based on #4574
"

* psarna-fix_ignoring_cells_after_null_in_appending_hash:
  test: extend mutation_test for NULL values
  tests/mutation: add reproducer for #4567
  gms: add a cluster feature for fixed hashing
  digest: add null values to row digest
  mutation_partition: fix formatting
  appending_hash<row>: make publicly visible

(cherry picked from commit 0e03c97)
@avikivity
Member

Backported to 4.1.

elcallio pushed a commit to elcallio/scylla that referenced this issue Jan 12, 2021
* 'master' of github.com:scylladb/scylla: (28 commits)
  scylla-gdb.py: add scylla schema command
  scylla-gdb.py: add pretty-printer for bytes
  scylla-gdb.py: don't use the string display hint for UUIDs
  alternator test: fix two tests that failed in HTTPS mode
  alternator test: add reproducing tests for several issues
  test: extend mutation_test for NULL values
  tests/mutation: add reproducer for scylladb#4567
  gms: add a cluster feature for fixed hashing
  digest: add null values to row digest
  mutation_partition: fix formatting
  appending_hash<row>: make publicly visible
  tools: toolchain: update for gnutls-3.6.15
  storage_service: Fix use-after-free when calculating effective ownership
  cql3: Fix NULL reference in get_column_defs_for_filtering
  storage_service: Fix a TOKENS update race for replace operation
  Add support passing python3 dependencies from main repo to scylla-python3 script
  Update seastar submodule
  dist/common/scripts: abort scylla_prepare with better error message
  clustering_row: Do not re-implement deletable_row
  deletable_row: Do not mess with clustering_row
  ...
asias pushed a commit to asias/scylla that referenced this issue Apr 15, 2021
"
This series fixes a bug in `appending_hash<row>` that caused it to ignore any cells after the first NULL. It also adds a cluster feature which starts using the new hashing only after the whole cluster is aware of it. The series comes with tests, which reproduce the issue.

Fixes scylladb#4567
Based on scylladb#4574
"

* psarna-fix_ignoring_cells_after_null_in_appending_hash:
  test: extend mutation_test for NULL values
  tests/mutation: add reproducer for scylladb#4567
  gms: add a cluster feature for fixed hashing
  digest: add null values to row digest
  mutation_partition: fix formatting
  appending_hash<row>: make publicly visible

(cherry picked from commit 0e03c97)