gms,service: add a feature to protect the usage of allow_mutation_read_page_without_live_row #15890

Closed

Conversation

denesb
Contributor

@denesb denesb commented Oct 31, 2023

allow_mutation_read_page_without_live_row is a new option in the partition_slice::option option set. In a mixed cluster, old nodes may not know this new option, so its usage must be protected by a cluster feature. This patch does just that.

Fixes: #15795
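
As a rough illustration of the gating described above, here is a minimal, self-contained sketch. It is not the actual patch: `option`, `option_set`, `cluster_features` and `make_read_options` are simplified, hypothetical stand-ins, and the only name taken from this thread is the feature `empty_replica_mutation_pages` (seen in the build log). The assumption is that the feature flag only becomes true once every node in the cluster understands the new option.

```cpp
#include <cstdint>

// Hypothetical, reduced stand-in for partition_slice::option.
// The real option set has more members; the indices here are illustrative.
enum class option : unsigned {
    reversed = 0,
    allow_short_read = 1,
    allow_mutation_read_page_without_live_row = 2, // the new option
};

// Simplified bit set over `option` (a stand-in for the real enum_set).
class option_set {
    std::uint64_t _mask = 0;

    static std::uint64_t bit(option o) {
        return std::uint64_t(1) << static_cast<unsigned>(o);
    }
public:
    void set(option o) { _mask |= bit(o); }
    bool contains(option o) const { return (_mask & bit(o)) != 0; }
    std::uint64_t mask() const { return _mask; }
};

// Hypothetical view of cluster features, modelled as plain booleans.
// Assumption: a flag is true only when all nodes support the feature.
struct cluster_features {
    bool empty_replica_mutation_pages = false;
};

// Build the options for a read. The new option is only set when the
// cluster feature is enabled, so old nodes never receive the unknown bit.
option_set make_read_options(const cluster_features& features) {
    option_set opts;
    opts.set(option::allow_short_read);
    if (features.empty_replica_mutation_pages) {
        opts.set(option::allow_mutation_read_page_without_live_row);
    }
    return opts;
}
```

With `empty_replica_mutation_pages` false (a mixed cluster), `make_read_options` leaves the new bit unset; once the feature is enabled it sets the bit alongside the existing options.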

@scylladb-promoter
Contributor

🔴 CI State: FAILURE

❌ - Build

Build Failure:

[2023-10-31T07:05:05.653Z] FAILED: build/debug/service/storage_proxy.o 
[2023-10-31T07:05:05.653Z] clang++ -MD -MT build/debug/service/storage_proxy.o -MF build/debug/service/storage_proxy.o.d -I/jenkins/workspace/scylla-master/scylla-ci/scylla/seastar/include -I/jenkins/workspace/scylla-master/scylla-ci/scylla/build/debug/seastar/gen/include -U_FORTIFY_SOURCE -Werror=unused-result -fstack-clash-protection -fsanitize=address -fsanitize=undefined -fno-sanitize=vptr -DSEASTAR_API_LEVEL=7 -DSEASTAR_BUILD_SHARED_LIBS -DSEASTAR_SSTRING -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_DEBUG -DSEASTAR_DEFAULT_ALLOCATOR -DSEASTAR_SHUFFLE_TASK_QUEUE -DSEASTAR_DEBUG_SHARED_PTR -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_TYPE_ERASE_MORE -DBOOST_NO_CXX98_FUNCTION_BASE -DFMT_SHARED -I/usr/include/p11-kit-1   -ffile-prefix-map=/jenkins/workspace/scylla-master/scylla-ci/scylla=. -march=westmere -DDEBUG -DSANITIZE -DDEBUG_LSA_SANITIZER -DSCYLLA_ENABLE_ERROR_INJECTION -Og -DSCYLLA_BUILD_MODE=debug -g -gz -iquote. -iquote build/debug/gen --std=gnu++20  -ffile-prefix-map=/jenkins/workspace/scylla-master/scylla-ci/scylla=. -march=westmere -DBOOST_TEST_DYN_LINK   -DNOMINMAX -DNOMINMAX  -fvisibility=hidden -Wall -Werror -Wimplicit-fallthrough -Wno-mismatched-tags -Wno-c++11-narrowing -Wno-overloaded-virtual -Wno-unused-command-line-argument -Wno-unsupported-friend -Wno-implicit-int-float-conversion -Wno-psabi -Wno-narrowing -Wno-error=deprecated-declarations -DXXH_PRIVATE_API -DSEASTAR_TESTING_MAIN -DFMT_DEPRECATED_OSTREAM  -c -o build/debug/service/storage_proxy.o service/storage_proxy.cc
[2023-10-31T07:05:05.653Z] service/storage_proxy.cc:4984:13: error: use of undeclared identifier features; did you mean gms::feature?
[2023-10-31T07:05:05.654Z]         if (features().empty_replica_mutation_pages) {
[2023-10-31T07:05:05.654Z]             ^
[2023-10-31T07:05:05.654Z] ./db/system_keyspace.hh:58:11: note: gms::feature declared here
[2023-10-31T07:05:05.654Z]     class feature;
[2023-10-31T07:05:05.654Z]           ^
[2023-10-31T07:05:05.654Z] 1 error generated.


@denesb denesb force-pushed the allow-empty-pages-cluster-feature branch from 77137be to 8e15394 Compare October 31, 2023 07:25
@denesb denesb requested a review from avikivity October 31, 2023 09:07
@avikivity
Member

Aha, I remember that we sent the new bit without a feature thinking it will be ignored, but it's caught.
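
To make the "it's caught" part concrete, here is a minimal sketch of how a receiver that validates an incoming bit mask would reject a bit it does not know. This is an assumption-laden model, not the real enum_set code: `bad_mask_error` and `checked_mask` are hypothetical names, and only the error text is taken from the linked issue title.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical error type; the message mirrors the bad_enum_set_mask text
// quoted in the linked issue.
struct bad_mask_error : std::invalid_argument {
    bad_mask_error()
        : std::invalid_argument("Bit mask contains invalid enumeration indices.") {}
};

// Validate a mask received over the wire against the number of options
// the receiving node knows about. In this sketch known_options must be < 64.
inline std::uint64_t checked_mask(std::uint64_t received, unsigned known_options) {
    const std::uint64_t valid = (std::uint64_t(1) << known_options) - 1;
    if ((received & ~valid) != 0) {
        // An unknown bit is not silently ignored; the whole mask is rejected.
        throw bad_mask_error();
    }
    return received;
}
```

Under this model, an old node that knows two options throws when a new node sends a mask with the third bit set, while the same mask is accepted by a node that knows three options.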

@denesb denesb force-pushed the allow-empty-pages-cluster-feature branch from 8e15394 to 317fce1 Compare October 31, 2023 14:14
@denesb
Contributor Author

denesb commented Oct 31, 2023

I don't understand all these failures on such a simple patch. Rebased.

@avikivity
Member

I don't understand all these failures on such a simple patch. Rebased.

Ran on m5ad.8xlarge, so should be okay disk-wise.

@avikivity
Member

Actually it ran on many instance types, but all have nvme as far as I can tell.

@avikivity
Member

@fruch is there any way to tell if the mount procedure worked correctly?

@fruch
Contributor

fruch commented Oct 31, 2023

Now, we are not collecting logs from that stage.

Anyhow, those failures are too reproducible; they don't fit the pattern we've seen when the disk wasn't configured.

@avikivity
Member

yes, looks like a genuine bug uncovered by CI (like it's supposed to)

@denesb
Contributor Author

denesb commented Nov 1, 2023

Looks like a genuine bug related to view schemas. Nodes are crashing due to uninitialized schema: #15904

@denesb
Contributor Author

denesb commented Nov 1, 2023

All the failing tests I checked so far are related to schema problems.

@denesb denesb force-pushed the allow-empty-pages-cluster-feature branch from 317fce1 to 7af279a Compare November 3, 2023 06:55
@denesb
Contributor Author

denesb commented Nov 3, 2023

DO NOT MERGE!!!!

I pushed a version which has the if condition commented out, to see if that is what triggers all these failures. The patch now effectively just adds a feature flag.

@kbr-scylla
Contributor

I looked at the test_topology_ops failure and created #15935.

There are a couple of bugs that have already been detected by the test. For now I'll disable the part of the test which is detecting these issues (the one which is performing CQL writes) until we fix all known bugs.

@denesb denesb force-pushed the allow-empty-pages-cluster-feature branch 2 times, most recently from e09527e to ea9c8c7 Compare November 3, 2023 10:37
@denesb
Contributor Author

denesb commented Nov 3, 2023

I looked at the test_topology_ops failure and created #15935.

There are a couple of bugs that have already been detected by the test. For now I'll disable the part of the test which is detecting these issues (the one which is performing CQL writes) until we fix all known bugs.

Thanks. Why is this particular PR triggering this so reliably? I also saw this test fail on other occasions, but here it is very consistent.

@denesb
Contributor Author

denesb commented Nov 3, 2023

Thanks. Why is this particular PR triggering this so reliably? I also saw this test fail on other occasions, but here it is very consistent.

Actually, it is not that consistent.

@denesb denesb force-pushed the allow-empty-pages-cluster-feature branch from ea9c8c7 to b5a4ad8 Compare November 7, 2023 11:08
@denesb
Contributor Author

denesb commented Nov 7, 2023

I have no other choice but to dig into these failures. Seems like there is a real problem behind all this.

@kbr-scylla
Contributor

#15924

Rebase -- these background writes are disabled on master (6cc5bca)

@denesb
Contributor Author

denesb commented Nov 8, 2023

test_tracing_shutdown 🔍

#16004 with a twist. Writes to system_traces fail to look up the table by version (because the versions differ), so they call back to the originating node, which is shutting down and has already unregistered its RPC verbs. Not sure if this deserves a separate bug or not.

@denesb denesb force-pushed the allow-empty-pages-cluster-feature branch from b5a4ad8 to 1b87d9c Compare November 8, 2023 14:47
@denesb
Contributor Author

denesb commented Nov 8, 2023

Looks like almost all failed tests are related to #16004.

@denesb
Contributor Author

denesb commented Nov 8, 2023

#15924

Rebase -- these background writes are disabled on master (6cc5bca)

Rebased.

…d_page_without_live_row

allow_mutation_read_page_without_live_row is a new option in the
partition_slice::option option set. In a mixed cluster, old nodes
may not know this new option, so its usage must be protected by a
cluster feature. This patch does just that.

Fixes: scylladb#15795
@denesb denesb force-pushed the allow-empty-pages-cluster-feature branch from 1b87d9c to 903c2a1 Compare November 16, 2023 15:51
@denesb
Contributor Author

denesb commented Nov 16, 2023

Rebased on top of 2238144 (the fix for #16004).

@bhalevy
Member

bhalevy commented Nov 16, 2023

Rebased on top of 2238144 (the fix for #16004).

Cc @mykaul

@scylladb-promoter
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - Unit Tests
✅ - Sanity Tests


@bhalevy
Member

bhalevy commented Nov 16, 2023

@kbr-scylla / @avikivity do you need to re-review or can this PR be merged as is?

Contributor

@kbr-scylla kbr-scylla left a comment

I'd like at least one other person who knows this code, preferably someone who reviewed the previous PR (which extended the enum_set), to review this one before I merge.

@denesb
Contributor Author

denesb commented Nov 17, 2023

I'd like at least one other person who knows this code, preferably someone who reviewed the previous PR (which extended the enum_set), to review this one before I merge.

That person is @tgrabiec.

denesb added a commit that referenced this pull request Nov 21, 2023
…d_page_without_live_row

allow_mutation_read_page_without_live_row is a new option in the
partition_slice::option option set. In a mixed cluster, old nodes
may not know this new option, so its usage must be protected by a
cluster feature. This patch does just that.

Fixes: #15795

Closes #15890

(cherry picked from commit f539612)
Development

Successfully merging this pull request may close these issues.

read load failing after one node upgrade [bad_enum_set_mask (Bit mask contains invalid enumeration indices.)]
7 participants