
Missing key in cassandra-harry verification while doing rolling restart of the cluster #10598

Closed

fruch opened this issue May 18, 2022 · 63 comments · Fixed by #16003

Comments

@fruch
Contributor

fruch commented May 18, 2022

Installation details

Kernel Version: 5.13.0-1021-aws
Scylla version (or git commit hash): 5.0~rc3-20220406.f92622e0d with build-id 2b79c4744216b294fdbd2f277940044c899156ea
Cluster size: 6 nodes (i3.large)

Scylla Nodes used in this run:

  • longevity-harry-2h-test-har-db-node-5ad966f5-9 (13.51.238.118 | 10.0.2.230) (shards: 2)
  • longevity-harry-2h-test-har-db-node-5ad966f5-8 (16.16.27.133 | 10.0.2.177) (shards: 2)
  • longevity-harry-2h-test-har-db-node-5ad966f5-7 (13.53.112.117 | 10.0.1.21) (shards: 2)
  • longevity-harry-2h-test-har-db-node-5ad966f5-6 (13.51.234.38 | 10.0.1.59) (shards: 2)
  • longevity-harry-2h-test-har-db-node-5ad966f5-5 (13.50.16.149 | 10.0.0.74) (shards: 2)
  • longevity-harry-2h-test-har-db-node-5ad966f5-4 (13.49.245.114 | 10.0.3.165) (shards: 2)
  • longevity-harry-2h-test-har-db-node-5ad966f5-3 (13.49.18.135 | 10.0.3.102) (shards: 2)
  • longevity-harry-2h-test-har-db-node-5ad966f5-2 (16.171.54.36 | 10.0.0.136) (shards: 2)
  • longevity-harry-2h-test-har-db-node-5ad966f5-1 (16.171.52.196 | 10.0.3.210) (shards: 2)

OS / Image: ami-0e4ae5e4a139c50f3 (aws: eu-north-1)

Test: longevity-harry-2h-test
Test id: 5ad966f5-8ded-437d-9224-7601b95777f1
Test name: scylla-staging/fruch/longevity-harry-2h-test
Test config file(s):

Issue description

During RollingRestartCluster, cassandra-harry fails its verification, finding only 19 keys out of the 134 expected ones.
It runs the query six times and gets the same results every time.

This is the query, a prepared statement that runs with ConsistencyLevel.QUORUM:

SELECT pk0000, pk0001, pk0002, ck0000, ck0001, static0000, static0001, static0002, static0003, static0004, regular0000, regular0001, regular0002, regular0003, writetime(static0000), writetime(static0001), writetime(static0002), writetime(static0003), writetime(static0004), writetime(regular0000), writetime(regular0001), writetime(regular0002), writetime(regular0003) FROM harry.table0 WHERE pk0000 = ? AND pk0001 = ? AND pk0002 = ? ORDER BY ck0000 DESC, ck0001 DESC;
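
For reference, a minimal sketch of issuing such a prepared statement at QUORUM with the Python driver; cassandra-harry itself is a Java tool, and the contact point, trimmed column list and bind values below are purely illustrative.

    # Hypothetical sketch: run the verification query as a prepared statement at QUORUM.
    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster

    session = Cluster(["10.0.2.230"]).connect()  # any node of the test cluster

    prepared = session.prepare(
        "SELECT pk0000, pk0001, pk0002, ck0000, ck0001, regular0000, writetime(regular0000) "
        "FROM harry.table0 WHERE pk0000 = ? AND pk0001 = ? AND pk0002 = ? "
        "ORDER BY ck0000 DESC, ck0001 DESC")
    prepared.consistency_level = ConsistencyLevel.QUORUM

    # Placeholder partition key values; the real ones come from harry's generators.
    for row in session.execute(prepared, (0, 1, 2)):
        print(row)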

From the cassandra-harry log, "Partition state" is the expected result and "Observed state" is the result returned by the query (note that the expected state is listed in reverse order relative to the observed one):

java.lang.RuntimeException: java.util.concurrent.ExecutionException: harry.model.Model$ValidationException: Expected results to have the same number of results, but expected result iterator has more results.

Partition state:
...
133 lines like the following
...
RowState{cd=9101507508570737439, vds=[216457386, -9223372036854775808, 2148363779, -9223372036854775808], lts=[2379820, -9223372036854775808, 2379820, -9223372036854775808], clustering=[1092247935856, 2.1378701E-38], values=[216457386, null, -2146603517, null]}

Observed state:
resultSetRow(2952445832411809646L, 9101507508570737439L, staticValues(13953956969167411L,185L,31129L,215L,148L), slts(2379990L,2379990L,2379990L,2379990L,2379990L), values(216457386L,-9223372036854775808L,2148363779L,-9223372036854775808L), lts(2379820L,-9223372036854775808L,2379820L,-9223372036854775808L), clustering=[1092247935856, 2.1378701E-38], values=[216457386, null, -2146603517, null], statics=[9.77609217999303E-308, -71, 31129, -41, -108])
...
18 lines like that

With nemesis running, this happens in 3 out of 4 runs.

In steady state (without nemesis), cassandra-harry finishes successfully on 5.0~rc3.

On master we are hitting a coredump even without nemesis, so we can't compare (#10553).

  • Restore Monitor Stack command: $ hydra investigate show-monitor 5ad966f5-8ded-437d-9224-7601b95777f1
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 5ad966f5-8ded-437d-9224-7601b95777f1

Logs:

Jenkins job URL

@fruch
Contributor Author

fruch commented May 18, 2022

More occurrences:

Installation details

Kernel Version: 5.13.0-1021-aws
Scylla version (or git commit hash): 5.0~rc3-20220406.f92622e0d with build-id 2b79c4744216b294fdbd2f277940044c899156ea
Cluster size: 6 nodes (i3.large)
OS / Image: ami-0e4ae5e4a139c50f3 (aws: eu-north-1)

Test: longevity-harry-2h-test
Test id: a4bdcc57-c0d5-4e25-8ab7-56a32f1d6a40

  • Restore Monitor Stack command: $ hydra investigate show-monitor a4bdcc57-c0d5-4e25-8ab7-56a32f1d6a40
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs a4bdcc57-c0d5-4e25-8ab7-56a32f1d6a40

Logs:

Jenkins job URL

@slivne
Contributor

slivne commented May 19, 2022

Can we try and run this on 4.6, to see if this is a regression?

@slivne
Contributor

slivne commented May 19, 2022

@fruch ^^

@slivne
Contributor

slivne commented May 19, 2022

@eliransin try to find someone that can help look at this

@slivne slivne added this to the 5.0 milestone May 22, 2022
@slivne slivne removed the triage/oss label May 22, 2022
@fruch
Contributor Author

fruch commented May 22, 2022

@slivne @eliransin

It also happens on 4.6.3, so it's not a regression:

Kernel Version: 5.11.0-1022-aws
Scylla version (or git commit hash): 4.6.3-20220414.8bf149fdd with build-id 8d16d8972498cc769071ff25309b009eb77bf77a
Cluster size: 6 nodes (i3.large)

OS / Image: ami-010abbe835b052d37 (aws: eu-north-1)

Test: longevity-harry-2h-test
Test id: fc555304-c2e8-4e9d-8c0c-9d0e8f37260d
Test name: scylla-staging/fruch/longevity-harry-2h-test
Test config file(s):

  • longevity-harry-2h.yaml

  • Restore Monitor Stack command: $ hydra investigate show-monitor fc555304-c2e8-4e9d-8c0c-9d0e8f37260d

  • Restore monitor on AWS instance using Jenkins job

  • Show all stored logs command: $ hydra investigate show-logs fc555304-c2e8-4e9d-8c0c-9d0e8f37260d

Logs:

Jenkins job URL

fruch added a commit to fruch/scylla-cluster-tests that referenced this issue May 25, 2022
now we've proven the nemesis mostly works,
besides scylladb/scylladb#10598 that's still under
investigation

This reverts commit 820d561.
roydahan pushed a commit to scylladb/scylla-cluster-tests that referenced this issue May 26, 2022
now we've proven the nemesis mostly works,
besides scylladb/scylladb#10598 that's still under
investigation

This reverts commit 820d561.
fruch added a commit to scylladb/scylla-cluster-tests that referenced this issue May 26, 2022
now we've proven the nemesis mostly works,
besides scylladb/scylladb#10598 that's still under
investigation

This reverts commit 820d561.

(cherry picked from commit 09a0292)
@slivne slivne modified the milestones: 5.0, 5.1 May 30, 2022
@soyacz
Contributor

soyacz commented Jun 2, 2022

The same happened during the destroy_data_then_repair disruption. More test details:

Installation details

Kernel Version: 5.13.0-1025-aws
Scylla version (or git commit hash): 5.0~rc6-20220523.338edcc02 with build-id 60217f35371db2b1283e0c5bc67a7f5604768d41
Cluster size: 6 nodes (i3.large)

Scylla Nodes used in this run:

  • longevity-harry-2h-5-0-db-node-73b768fb-7 (54.75.1.119 | 10.0.3.90) (shards: 2)
  • longevity-harry-2h-5-0-db-node-73b768fb-6 (34.243.250.142 | 10.0.2.86) (shards: 2)
  • longevity-harry-2h-5-0-db-node-73b768fb-5 (52.215.207.187 | 10.0.2.6) (shards: 2)
  • longevity-harry-2h-5-0-db-node-73b768fb-4 (34.240.29.125 | 10.0.2.162) (shards: 2)
  • longevity-harry-2h-5-0-db-node-73b768fb-3 (54.74.134.253 | 10.0.0.27) (shards: 2)
  • longevity-harry-2h-5-0-db-node-73b768fb-2 (34.240.157.146 | 10.0.0.119) (shards: 2)
  • longevity-harry-2h-5-0-db-node-73b768fb-1 (34.250.15.164 | 10.0.3.220) (shards: 2)

OS / Image: ami-07b5745a1e6de34ce (aws: eu-west-1)

Test: longevity-harry-2h-test
Test id: 73b768fb-6b82-4e6e-ba97-408f305b5b1b
Test name: scylla-5.0/longevity/longevity-harry-2h-test
Test config file(s):

  • longevity-harry-2h.yaml

  • Restore Monitor Stack command: $ hydra investigate show-monitor 73b768fb-6b82-4e6e-ba97-408f305b5b1b

  • Restore monitor on AWS instance using Jenkins job

  • Show all stored logs command: $ hydra investigate show-logs 73b768fb-6b82-4e6e-ba97-408f305b5b1b

Logs:

Jenkins job URL

@roydahan roydahan added the high label Jun 27, 2022
@roydahan

@eliransin will someone look at it?
I'm setting the high label since otherwise no one will ever fix it.
And if no one is going to fix it, we can just stop running the test.

@eliransin
Contributor

@roydahan does it happen all the time? Can I have a reproducer job so I can debug?
Did we, or can we, run it against Cassandra to rule out a bug in cassandra-harry? If the same failure occurs
in Cassandra, it would mean one of two things:

  1. A bug in cassandra-harry
  2. A bug in the Cassandra HL design which we replicated (at least to some extent) - I think this is unlikely.

@fruch
Contributor Author

fruch commented Jul 17, 2022

@roydahan does it happen all the time?
Can I have a reproducer job so I can debug?

you can use https://jenkins.scylladb.com/view/master/job/scylla-master/job/reproducers/job/longevity-harry-2h-test/, it's a 2.5h run.

Did we, or can we, run it against Cassandra to rule out a bug in cassandra-harry?

we don't have an SCT setup to run Cassandra with our nemesis

If the same failure occurs in Cassandra, it would mean one of two things:

  1. A bug in cassandra-harry
  2. A bug in the Cassandra HL design which we replicated (at least to some extent) - I think this is unlikely.

Again, just to add: in steady state, without restarting nodes, cassandra-harry verification works correctly.
So we can only look at the queries the tool is doing: it performs inserts, then scans the last 100 items it inserted and compares them to what it has in memory (very similar to Gemini, but without an oracle).
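
Below is a rough sketch (not the actual cassandra-harry code, which is Java) of the verification pattern described above: insert rows at QUORUM, keep an in-memory model, then read the newest N items back in reverse clustering order and compare them against the model. The keyspace, table and value generator are simplified, hypothetical stand-ins.

    # Hypothetical sketch of a harry-style write/verify loop using the Python driver.
    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect()
    session.execute("CREATE KEYSPACE IF NOT EXISTS harness WITH replication = "
                    "{'class': 'SimpleStrategy', 'replication_factor': 3}")
    session.execute("CREATE TABLE IF NOT EXISTS harness.t "
                    "(pk int, ck int, v int, PRIMARY KEY (pk, ck))")

    insert = session.prepare("INSERT INTO harness.t (pk, ck, v) VALUES (?, ?, ?)")
    select = session.prepare("SELECT ck, v FROM harness.t WHERE pk = ? "
                             "ORDER BY ck DESC LIMIT ?")
    insert.consistency_level = select.consistency_level = ConsistencyLevel.QUORUM

    N, pk, model = 100, 0, {}
    for ck in range(1000):
        v = ck * 7                       # deterministic stand-in for harry's value model
        session.execute(insert, (pk, ck, v))
        model[ck] = v

    expected = sorted(model.items(), reverse=True)[:N]   # newest N, in reverse order
    observed = [(r.ck, r.v) for r in session.execute(select, (pk, N))]
    assert observed == expected, f"expected {len(expected)} rows, got {len(observed)}"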

@eliransin
Contributor

Where is the cassandra-harry log (in which tar file)?

@fruch
Contributor Author

fruch commented Jul 27, 2022

Where is the cassandra-harry log (in which tar file)?

in the loader-set

@DoronArazii

@eliransin are we planning to fix this for 5.1?

@eliransin
Contributor

I am not sure we will have the bandwidth for it for 5.1

@DoronArazii

Pushing to 5.2 as we don't have the bandwidth to deal with it now.
Per @fruch it is not a regression

/Cc @roydahan @slivne

@DoronArazii DoronArazii removed this from the 5.1 milestone Sep 20, 2022
@roydahan

roydahan commented Nov 6, 2023

@bhalevy based on this comment, I believe it's somewhere in your team's domain.

@bhalevy
Member

bhalevy commented Nov 6, 2023

@michoecho can you take it from here please?

@bhalevy
Member

bhalevy commented Nov 6, 2023

It would be really useful to have cassandra-harry available also for dtests, and to reproduce this issue with one, so we can add such dtests to our regression test suite.

@michoecho
Contributor

@michoecho can you take it from here please?

Looking.

@michoecho
Contributor

michoecho commented Nov 7, 2023

Apparently reconciliation of range tombstones during reverse queries is very broken.

Minimal reproducer:

  1. Start a 2-node cluster.
  2. Run:

     create keyspace ks with replication = {'class': 'SimpleStrategy', 'replication_factor': 2};
     create table ks.t (pk int, ck int, primary key (pk, ck));
     insert into ks.t(pk, ck) values(0, 0);

  3. Stop a node.
  4. Run:

     delete from ks.t where pk = 0 and ck > -1 and ck < 1;

  5. Restart the stopped node.
  6. Run a reverse query and observe wrong results:

     cqlsh> consistency all;
     Consistency level set to ALL.
     cqlsh> select * from ks.t where pk = 0 order by ck desc;

      pk | ck
     ----+----
       0 |  0

The state of mutation fragments on both nodes is exactly as expected. cl=one queries and non-reverse cl=all queries give expected results. It's the reverse query reconciliation that's broken.
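
In the spirit of the earlier request to cover this with a dtest, here is a dtest-style sketch of the same reproducer. The stop_node()/start_node() helpers are placeholders for whatever the test harness provides (e.g. ccm or SCT); the rest uses the Python driver directly.

    # Hypothetical sketch: range delete while a replica is down, then a reverse
    # read at CL=ALL after the replica comes back.
    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    def reproduce(node1_ip, node2_ip, stop_node, start_node):
        session = Cluster([node1_ip]).connect()
        session.execute("CREATE KEYSPACE ks WITH replication = "
                        "{'class': 'SimpleStrategy', 'replication_factor': 2}")
        session.execute("CREATE TABLE ks.t (pk int, ck int, PRIMARY KEY (pk, ck))")
        session.execute("INSERT INTO ks.t (pk, ck) VALUES (0, 0)")

        stop_node(node2_ip)   # the range delete lands while one replica is down
        session.execute("DELETE FROM ks.t WHERE pk = 0 AND ck > -1 AND ck < 1")
        start_node(node2_ip)

        reverse_read = SimpleStatement(
            "SELECT * FROM ks.t WHERE pk = 0 ORDER BY ck DESC",
            consistency_level=ConsistencyLevel.ALL)
        rows = list(session.execute(reverse_read))
        # With the bug present the deleted row (0, 0) resurrects in the reverse
        # read; correct reconciliation returns no rows.
        assert rows == [], f"deleted row resurrected: {rows}"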

@bhalevy
Member

bhalevy commented Nov 7, 2023

@michoecho thanks for reproducing and closing in on the root cause.

Cc @denesb, @tgrabiec your insight and review later on would be appreciated

@michoecho
Contributor

Fix:

diff --git a/mutation/mutation_partition.cc b/mutation/mutation_partition.cc
index 8bdd6086e3..9dbb0bb07e 100644
--- a/mutation/mutation_partition.cc
+++ b/mutation/mutation_partition.cc
@@ -2242,7 +2242,7 @@ stop_iteration reconcilable_result_builder::consume(range_tombstone&& rt) {
 }

 stop_iteration reconcilable_result_builder::consume(range_tombstone_change&& rtc) {
-    if (auto rt_opt = _rt_assembler.consume(_schema, std::move(rtc))) {
+    if (auto rt_opt = _rt_assembler.consume(_reversed ? *_schema.make_reversed() : _schema, std::move(rtc))) {
         return consume(std::move(*rt_opt));
     }
     return stop_iteration::no;

As far as I understand, this means that range tombstones of the form ((prefix, a); (prefix, b...)) are never emitted from a reverse mutation_query. In other words: if there is at least one replica which didn't replicate a range tombstone (of such form), the reconciled result will be as if the delete never happened.

(Tombstones of the form ((prefix, a...); (prefix, +inf)), ((prefix, -inf); (prefix, b...)) and ((prefix, -inf); (prefix, +inf)) were sent correctly, though, because their bounds are ordered the same regardless of whether you use the reverse or the non-reverse comparator.)

This is quite a terrible bug. It means that a workload with range deletes and reverse reads can't be trusted.

This seems to be a Scylla 5.1 regression. (4629f7d).
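
To make the bound-ordering issue concrete, here is a toy illustration in plain Python (not Scylla code): a reverse query streams the tombstone's bounds in query (reversed) order, so validating them with the forward table comparator, as the pre-fix code effectively did, sees "start after end" and the tombstone gets dropped instead of being emitted.

    # Toy model of forward vs. reversed clustering comparison.
    def forward_cmp(a, b):
        # table-schema clustering order
        return (a > b) - (a < b)

    def reverse_cmp(a, b):
        # query-schema order for an ORDER BY ck DESC read
        return forward_cmp(b, a)

    # DELETE ... WHERE ck > -1 AND ck < 1, streamed by a reverse read:
    # the bound near ck=1 comes first, the bound near ck=-1 comes last.
    start, end = 1, -1

    assert reverse_cmp(start, end) < 0   # valid range under the query (reversed) schema
    assert forward_cmp(start, end) > 0   # under the table schema the range looks inverted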

@michoecho michoecho added backport/5.2 Issues that should be backported to 5.2 branch once they'll be fixed area/reverse queries Requires-Backport-to-5.1 backport/5.4 Issues that should be backported to 5.4 branch once they'll be fixed area/range deletes labels Nov 7, 2023
@denesb
Contributor

denesb commented Nov 7, 2023

@michoecho please prepare a PR.

@mykaul
Contributor

mykaul commented Dec 14, 2023

@scylladb/scylla-maint - please backport to relevant (5.2, 5.4) branches.

denesb added a commit that referenced this issue Dec 14, 2023
…ies' from Michał Chojnowski

reconcilable_result_builder passes range tombstone changes to _rt_assembler
using table schema, not query schema.
This means that a tombstone with bounds (a; b), where a < b in query schema
but a > b in table schema, will not be emitted from mutation_query.

This is a very serious bug, because it means that such tombstones in reverse
queries are not reconciled with data from other replicas.
If *any* queried replica has a row, but not the range tombstone which deleted
the row, the reconciled result will contain the deleted row.

In particular, range deletes performed while a replica is down will not
later be visible to reverse queries which select this replica, regardless of the
consistency level.

As far as I can see, this doesn't result in any persistent data loss.
Only in that some data might appear resurrected to reverse queries,
until the relevant range tombstone is fully repaired.

This series fixes the bug and adds a minimal reproducer test.

Fixes #10598

Closes #16003

* github.com:scylladb/scylladb:
  mutation_query_test: test that range tombstones are sent in reverse queries
  mutation_query: properly send range tombstones in reverse queries

(cherry picked from commit 65e42e4)
@denesb
Contributor

denesb commented Dec 14, 2023

Backport to 5.4 and 5.2 queued.

@denesb denesb removed Backport candidate backport/5.2 Issues that should be backported to 5.2 branch once they'll be fixed backport/5.4 Issues that should be backported to 5.4 branch once they'll be fixed labels Dec 14, 2023