Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

validation of cql create statement options should happen when it is created, not when executed #14710

Closed
bhalevy opened this issue Jul 15, 2023 · 22 comments · Fixed by #15091
Closed
Assignees
Labels
area/raft symptom/ci stability Issues that failed in ScyllaDB CI - tests and framework tests/dtest
Milestone

Comments

@bhalevy
Copy link
Member

bhalevy commented Jul 15, 2023

The dtest fails consistently since https://jenkins.scylladb.com/job/scylla-master/job/dtest-daily-release-with-raft/148/testReport/nodetool_additional_test/TestNodetool/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split004___test_resetlocalschema_api_issue_7811_SizeTieredCompactionStrategy_/

AssertionError: Got unexpected exception: ('Unable to complete the operation against any hosts', {<Host: 127.0.40.1:9042 datacenter1>: <Error from server: code=0000 [Server error] message="Raft instance is stopped, reason: "background error, std::_Nested_exception<raft::state_machine_error> (State machine error at raft/raft.hh:299): exceptions::syntax_exception (Invalid long value NOPE for 'min_sstable_size')"">})
assert "Can't find a column family with UUID" in '(\'Unable to complete the operation against any hosts\', {<Host: 127.0.40.1:9042 datacenter1>: <Error from server: co...chine error at raft/raft.hh:299): exceptions::syntax_exception (Invalid long value NOPE for \'min_sstable_size\')"">})'
 +  where '(\'Unable to complete the operation against any hosts\', {<Host: 127.0.40.1:9042 datacenter1>: <Error from server: co...chine error at raft/raft.hh:299): exceptions::syntax_exception (Invalid long value NOPE for \'min_sstable_size\')"">})' = str(NoHostAvailable('Unable to complete the operation against any hosts', {<Host: 127.0.40.1:9042 datacenter1>: <Error fro... machine error at raft/raft.hh:299): exceptions::syntax_exception (Invalid long value NOPE for 'min_sstable_size')"">}))
@bhalevy bhalevy added tests/dtest area/raft triage/master Looking for assignee symptom/ci stability Issues that failed in ScyllaDB CI - tests and framework labels Jul 15, 2023
@DoronArazii DoronArazii added this to the 5.4 milestone Jul 16, 2023
@DoronArazii
Copy link

@kostja please have a look

@kostja
Copy link
Contributor

kostja commented Jul 17, 2023

@bhalevy why is this test not part of gating?

@mykaul
Copy link
Contributor

mykaul commented Jul 17, 2023

@bhalevy why is this test not part of gating?

Most likely since no one added it to gating.

@kostja
Copy link
Contributor

kostja commented Jul 17, 2023

@bhalevy Raft schema state machine is stopped in the test, thanks to this exception: std::_Nested_exception<raft::state_machine_error> (State machine error at raft/raft.hh:299): exceptions::syntax_exception (Invalid long value NOPE for 'min_sstable_size'

The underlying problem is that min_sstable_size presence is not checked during statemenet, prepare, only checked during execute/apply, which is too late. If we can't apply a committed Raft command, the schema state machine must be stopped.

I don't know why it didn't fail before - this is the puzzling part. There were changes to option validation for keyspaces recently (e.g. #13986) but nothing relevant to this exception I can find.

The test is added in 2022. Not added to the gating. Could have been done in test.py just as well, already then.

@kostja kostja assigned bhalevy and unassigned kostja Jul 17, 2023
@bhalevy
Copy link
Member Author

bhalevy commented Jul 17, 2023

@bhalevy why is this test not part of gating?

Most likely since no one added it to gating.

@fruch it looks like nodetool_additional_test is indeed not part of next_gating.
Can you please add it after this issue is fixed?

However, @kostja, even if it was, it fails like this only when raft is enabled.
So we test all next_gating tests both with raft consistent topology changes and without?

@bhalevy
Copy link
Member Author

bhalevy commented Jul 17, 2023

@bhalevy Raft schema state machine is stopped in the test, thanks to this exception: std::_Nested_exception<raft::state_machine_error> (State machine error at raft/raft.hh:299): exceptions::syntax_exception (Invalid long value NOPE for 'min_sstable_size'

The source location is useless here since this is the file:line numbers of state_machine_error::state_machine_error, not where it is called.

The underlying problem is that min_sstable_size presence is not checked during statemenet, prepare, only checked during execute/apply, which is too late. If we can't apply a committed Raft command, the schema state machine must be stopped.

@kostja the test tries to create the table twice.
First time is with the botched compaction option and it fails sort of as expected.

https://jenkins.scylladb.com/job/scylla-master/job/dtest-daily-release-with-raft/197/artifact/logs-full.release.004/dtest-gw0.log

02:19:02,592 779     nodetool_additional_test       DEBUG    nodetool_additional_test.py:1267 | test_resetlocalschema_api_issue_7811[SizeTieredCompactionStrategy]: Creating ks.cf with invalid compaction_options: {'class': 'SizeTieredCompactionStrategy', 'min_sstable_size': 'NOPE'}
02:19:02,596 779     cassandra.cluster              WARNING  libevreactor.py     :370  | test_resetlocalschema_api_issue_7811[SizeTieredCompactionStrategy]: Host 127.0.3.1:9042 error: Server error.
02:19:02,596 779     nodetool_additional_test       DEBUG    nodetool_additional_test.py:1273 | test_resetlocalschema_api_issue_7811[SizeTieredCompactionStrategy]: Got exception: ('Unable to complete the operation against any hosts', {<Host: 127.0.3.1:9042 datacenter1>: <Error from server: code=0000 [Server error] message="Raft instance is stopped, reason: "background error, std::_Nested_exception<raft::state_machine_error> (State machine error at raft/server.cc:1206): exceptions::syntax_exception (Invalid long value NOPE for 'min_sstable_size')"">})

But then the test retries creating the table, not with default options, but it got the same exception
which seems to indicate the raft group0 didn't recover properly from the previous error.

02:19:03,385 779     nodetool_additional_test       DEBUG    nodetool_additional_test.py:1280 | test_resetlocalschema_api_issue_7811[SizeTieredCompactionStrategy]: Recreating ks.cf
02:19:03,386 779     cassandra.cluster              WARNING  libevreactor.py     :370  | test_resetlocalschema_api_issue_7811[SizeTieredCompactionStrategy]: Host 127.0.3.1:9042 error: Server error.
02:19:03,387 779     nodetool_additional_test       DEBUG    nodetool_additional_test.py:1287 | test_resetlocalschema_api_issue_7811[SizeTieredCompactionStrategy]: Got exception: ('Unable to complete the operation against any hosts', {<Host: 127.0.3.1:9042 datacenter1>: <Error from server: code=0000 [Server error] message="Raft instance is stopped, reason: "background error, std::_Nested_exception<raft::state_machine_error> (State machine error at raft/server.cc:1206): exceptions::syntax_exception (Invalid long value NOPE for 'min_sstable_size')"">})

I don't know why it didn't fail before - this is the puzzling part. There were changes to option validation for keyspaces recently (e.g. #13986) but nothing relevant to this exception I can find.

The test is added in 2022. Not added to the gating. Could have been done in test.py just as well, already then.

Can your team please bisect the branch and see what triggers the failure?

@bhalevy
Copy link
Member Author

bhalevy commented Jul 17, 2023

https://jenkins.scylladb.com/job/scylla-master/job/dtest-daily-release-with-raft/197/artifact/logs-full.release.004/1689560346681_nodetool_additional_test.py%3A%3ATestNodetool%3A%3Atest_resetlocalschema_api_issue_7811%5BSizeTieredCompactionStrategy%5D/node1.log shows the first failure:

INFO  2023-07-17 02:19:02,380 [shard 0] migration_manager - Create new Keyspace: KSMetaData{name=ks, strategyClass=org.apache.cassandra.locator.SimpleStrategy, strategyOptions={replication_factor=2}, cfMetaData={}, durable_writes=1, userTypes=org.apache.cassandra.config.UTMetaData@0x6000019e2ce0}
INFO  2023-07-17 02:19:02,381 [shard 0] schema_tables - Creating keyspace ks
INFO  2023-07-17 02:19:02,386 [shard 0] schema_tables - Schema version changed to 4e532d8f-546f-31f4-a347-4c491abdb056
INFO  2023-07-17 02:19:02,593 [shard 0] migration_manager - Create new ColumnFamily: org.apache.cassandra.config.CFMetaData@0x6000017e3800[cfId=4bf92b10-2448-11ee-ba27-c53a7f15efe2,ksName=ks,cfName=cf,cfType=Standard,comparator=org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.UTF8Type,org.apache.cassandra.db.marshal.UTF8Type),comment=test cf,readRepairChance=0,dcLocalReadRepairChance=0,tombstoneGcOptions={"mode":"timeout","propagation_delay_in_seconds":"3600"},gcGraceSeconds=864000,keyValidator=org.apache.cassandra.db.marshal.UTF8Type,minCompactionThreshold=4,maxCompactionThreshold=32,columnMetadata=[ColumnDefinition{name=key, type=org.apache.cassandra.db.marshal.UTF8Type, kind=PARTITION_KEY, componentIndex=0, droppedAt=-9223372036854775808}, ColumnDefinition{name=c, type=org.apache.cassandra.db.marshal.UTF8Type, kind=CLUSTERING_COLUMN, componentIndex=0, droppedAt=-9223372036854775808}, ColumnDefinition{name=v, type=org.apache.cassandra.db.marshal.UTF8Type, kind=REGULAR, componentIndex=null, droppedAt=-9223372036854775808}],compactionStrategyClass=class org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy,compactionStrategyOptions={min_sstable_size=NOPE, enabled=true},compressionParameters={},bloomFilterFpChance=0.01,memtableFlushPeriod=0,caching={"keys":"ALL","rows_per_partition":"ALL"},cdc={},defaultTimeToLive=0,minIndexInterval=128,maxIndexInterval=2048,speculativeRetry=99.0PERCENTILE,triggers=[],isDense=false,version=4bf92b11-2448-11ee-ba27-c53a7f15efe2,droppedColumns={},collections={},indices={}]
INFO  2023-07-17 02:19:02,595 [shard 0] schema_tables - Creating ks.cf id=4bf92b10-2448-11ee-ba27-c53a7f15efe2 version=11913792-fac3-3a4d-96d8-440842fbf213
ERROR 2023-07-17 02:19:02,596 [shard 0] raft - [e24850cc-eb05-41e4-9cda-39811ccf3556] applier fiber stopped because of the error: std::_Nested_exception<raft::state_machine_error> (State machine error at raft/server.cc:1206): exceptions::syntax_exception (Invalid long value NOPE for 'min_sstable_size')

https://jenkins.scylladb.com/job/scylla-master/job/dtest-daily-release-with-raft/197/artifact/logs-full.release.004/1689560346681_nodetool_additional_test.py%3A%3ATestNodetool%3A%3Atest_resetlocalschema_api_issue_7811%5BSizeTieredCompactionStrategy%5D/node2.log

INFO  2023-07-17 02:19:02,413 [shard 0] schema_tables - Creating keyspace ks
INFO  2023-07-17 02:19:02,418 [shard 0] schema_tables - Schema version changed to 4e532d8f-546f-31f4-a347-4c491abdb056

@kostja
Copy link
Contributor

kostja commented Jul 17, 2023

@bhalevy I wrote earlier: the check for options can't happen during statement execute, it must happen during statement prepare. Right now it happens in execute, in the compaction_strategy class, when the command is already committed to the raft log.

If you're looking for the offending patch, I guess gleb's refactors for schema commitlog commitguard could be the trigger. But not the root cause.

@kostja
Copy link
Contributor

kostja commented Jul 17, 2023

The sequence of events is:

  • the statement is validated, written to Raft log on the majority; the command is committed;
  • the application of the command fails, raft state machine stops
  • next command fails to execute

@bhalevy
Copy link
Member Author

bhalevy commented Jul 17, 2023

Currently, check_restricted_table_properties is called from create_table_statement::execute.
@Deexie can we move options validation to prepare time as part of #13956?

@kostja
Copy link
Contributor

kostja commented Jul 17, 2023

@bhalevy Raft schema state machine is stopped in the test, thanks to this exception: std::_Nested_exception<raft::state_machine_error> (State machine error at raft/raft.hh:299): exceptions::syntax_exception (Invalid long value NOPE for 'min_sstable_size'

The source location is useless here since this is the file:line numbers of state_machine_error::state_machine_error, not where it is called.

The underlying problem is that min_sstable_size presence is not checked during statemenet, prepare, only checked during execute/apply, which is too late. If we can't apply a committed Raft command, the schema state machine must be stopped.

@kostja the test tries to create the table twice. First time is with the botched compaction option and it fails sort of as expected.

https://jenkins.scylladb.com/job/scylla-master/job/dtest-daily-release-with-raft/197/artifact/logs-full.release.004/dtest-gw0.log

02:19:02,592 779     nodetool_additional_test       DEBUG    nodetool_additional_test.py:1267 | test_resetlocalschema_api_issue_7811[SizeTieredCompactionStrategy]: Creating ks.cf with invalid compaction_options: {'class': 'SizeTieredCompactionStrategy', 'min_sstable_size': 'NOPE'}
02:19:02,596 779     cassandra.cluster              WARNING  libevreactor.py     :370  | test_resetlocalschema_api_issue_7811[SizeTieredCompactionStrategy]: Host 127.0.3.1:9042 error: Server error.
02:19:02,596 779     nodetool_additional_test       DEBUG    nodetool_additional_test.py:1273 | test_resetlocalschema_api_issue_7811[SizeTieredCompactionStrategy]: Got exception: ('Unable to complete the operation against any hosts', {<Host: 127.0.3.1:9042 datacenter1>: <Error from server: code=0000 [Server error] message="Raft instance is stopped, reason: "background error, std::_Nested_exception<raft::state_machine_error> (State machine error at raft/server.cc:1206): exceptions::syntax_exception (Invalid long value NOPE for 'min_sstable_size')"">})

But then the test retries creating the table, not with default options, but it got the same exception which seems to indicate the raft group0 didn't recover properly from the previous error.

02:19:03,385 779     nodetool_additional_test       DEBUG    nodetool_additional_test.py:1280 | test_resetlocalschema_api_issue_7811[SizeTieredCompactionStrategy]: Recreating ks.cf
02:19:03,386 779     cassandra.cluster              WARNING  libevreactor.py     :370  | test_resetlocalschema_api_issue_7811[SizeTieredCompactionStrategy]: Host 127.0.3.1:9042 error: Server error.
02:19:03,387 779     nodetool_additional_test       DEBUG    nodetool_additional_test.py:1287 | test_resetlocalschema_api_issue_7811[SizeTieredCompactionStrategy]: Got exception: ('Unable to complete the operation against any hosts', {<Host: 127.0.3.1:9042 datacenter1>: <Error from server: code=0000 [Server error] message="Raft instance is stopped, reason: "background error, std::_Nested_exception<raft::state_machine_error> (State machine error at raft/server.cc:1206): exceptions::syntax_exception (Invalid long value NOPE for 'min_sstable_size')"">})

I don't know why it didn't fail before - this is the puzzling part. There were changes to option validation for keyspaces recently (e.g. #13986) but nothing relevant to this exception I can find.
The test is added in 2022. Not added to the gating. Could have been done in test.py just as well, already then.

Can your team please bisect the branch and see what triggers the failure?

Bisecting dtest is a sad effort. What needs to be done, is adding a test.py test case, bisecting on it, and adding a prepare-time validation for column family/keyspace options.

@bhalevy bhalevy changed the title test_resetlocalschema_api_issue_7811 fails with raft enabled validation of cql create statement options should happen when it is created, not when executed Jul 18, 2023
@bhalevy
Copy link
Member Author

bhalevy commented Jul 18, 2023

It seems like the issue hit with raft enabled is just another manifestation of #7811
(maybe we should close this one as duplicate)

@avikivity
Copy link
Member

No vulnerable branches, not backporting.

@bhalevy
Copy link
Member Author

bhalevy commented Dec 4, 2023

No vulnerable branches, not backporting.

@avikivity the issue is also hit sporadically in older branches.
See #7811 (comment) for example.

@bhalevy
Copy link
Member Author

bhalevy commented Dec 22, 2023

No vulnerable branches, not backporting.

@avikivity the issue is also hit sporadically in older branches. See #7811 (comment) for example.

@avikivity @denesb 5.2 / 2023.1 are vulnerable too, please backport.

@denesb
Copy link
Contributor

denesb commented Dec 22, 2023

@avikivity @denesb 5.2 / 2023.1 are vulnerable too, please backport.

Is this fixing a regression? Or is this something that was always broken? In the latter case, all branches are vulnerable.
I'm not saying we shouldn't backport, unless it is a regression, just trying to understand the situation.

@bhalevy
Copy link
Member Author

bhalevy commented Dec 22, 2023

It's not a regression AFAIK, but it is causing CI stability issues.
Also, since it's user-facing and increases safety of the CQL interface I still think it should be backported, even though it doesn't fix a regression.

@denesb
Copy link
Contributor

denesb commented Dec 22, 2023

After looking at the PR fixing this, it seems it fixes a regression, in releases that have consistent schema changes. Those releases will choke on an invalid create/alter statement.

@denesb
Copy link
Contributor

denesb commented Dec 22, 2023

5.4 already has the fix.
The fix doesn't apply to 5.2 cleanly, @Deexie please provide a backport PR against 5.2.

@bhalevy
Copy link
Member Author

bhalevy commented Dec 25, 2023

Fix is required in 5.2 on top of bfd8401 (5.2.11)

@denesb
Copy link
Contributor

denesb commented Jan 4, 2024

@Deexie ping for backport PR.

denesb added a commit that referenced this issue Jan 12, 2024
…' from Aleksandra Martyniuk

Table properties validation is performed on statement execution.
Thus, when one attempts to create a table with invalid options,
an incorrect command gets committed in Raft. But then its
application fails, leading to a raft machine being stopped.

Check table properties when create and alter statements are prepared.

Fixes: #14710.

Closes #16750

* github.com:scylladb/scylladb:
  cql3: statements: delete execute override
  cql3: statements: call check_restricted_table_properties in prepare
  cql3: statements: pass data_dictionary::database to check_restricted_table_properties
@denesb denesb removed Backport candidate backport/5.2 Issues that should be backported to 5.2 branch once they'll be fixed labels Jan 12, 2024
@denesb
Copy link
Contributor

denesb commented Jan 12, 2024

Backported to 5.2.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/raft symptom/ci stability Issues that failed in ScyllaDB CI - tests and framework tests/dtest
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants