Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Backport 6.0] tablets: alter keyspace #18982

Closed
wants to merge 18 commits into from

Conversation

mergify[bot]
Copy link

@mergify mergify bot commented May 29, 2024

This change supports changing replication factor in tablets-enabled keyspaces.
This covers both increasing and decreasing the number of tablets replicas through
first building topology mutations (alter_keyspace_statement.cc) and then
tablets/topology/schema mutations (topology_coordinator.cc).
For the limitations of the current solution, please see the docs changes attached to this PR.

Fixes: #16129

(cherry picked from commit 65deddd)

(cherry picked from commit c888945)

(cherry picked from commit cb40f13)

(cherry picked from commit 0600897)

(cherry picked from commit c174eee)

(cherry picked from commit 6fd0a49)

(cherry picked from commit 59d3fd6)

(cherry picked from commit 80ed442)

(cherry picked from commit 7081215)

(cherry picked from commit fbd75c5)

(cherry picked from commit b875151)

(cherry picked from commit 951915e)

(cherry picked from commit ec5708b)

(cherry picked from commit 39181c4)

(cherry picked from commit 1b913dd)

(cherry picked from commit 2567e30)

(cherry picked from commit 6e0e267)

(cherry picked from commit 66f6001)

Refs #16723

KrzaQ and others added 18 commits May 29, 2024 20:33
This is needed, because the same name cannot be used for 2 separate
entities, because we're getting double-metrics-registration error, thus
the names have to be configurable, not hardcoded.

(cherry picked from commit 65deddd)
Note we're suppressing a UBSanitizer overflow error in UTs. That's
because our linter complains about a possible overflow, which never
happens, but tests are still failing because of it.

(cherry picked from commit c888945)
Query processor needs to access storage service to check if global
topology request is still ongoing and to be able to wait until it
completes.

(cherry picked from commit cb40f13)
Allows executing combined topology & schema mutations under a single RAFT command

(cherry picked from commit 0600897)
It will be used when processing ALTER KS statement, but also to
create a separate processing path for a KS with tablets (as opposed to
a vnode KS).

(cherry picked from commit c174eee)
With current implementation only 1 global topo req can be executed at a
time, so when ALTER KS is executed, we'll have to check if any other
global topo req is ongoing and fail the req if that's the case.

(cherry picked from commit 6fd0a49)
…rocess alter ks global topo req

Because ALTER KS will result in creating a global topo req, we'll have
to pass the req data to topology coordinator's state machine, and the
easiest way to do it is through sytem.topology table, which is going to
be extended with 3 extra columns carrying all the data required to
execute ALTER KS from within topology coordinator.

(cherry picked from commit 59d3fd6)
…ifferent raft commands

Since ALTER KS requires creating topology_change raft command, some
functions need to be extended to handle it. RAFT commands are recognized
by types, so some functions are just going to be parameterized by type,
i.e. made into templates.
These templates are instantiated already, so that only 1 instances of
each template exists across the whole code base, to avoid compiling it
in each translation unit.

(cherry picked from commit 7081215)
This commit adds support for executing ALTER KS for keyspaces with
tablets and utilizes all the previous commits.
The ALTER KS is handled in alter_keyspace_statement, where a global
topology request in generated with data attached to system.topology
table. Then, once topology state machine is ready, it starts to handle
this global topology event, which results in producing mutations
required to change the schema of the keyspace, delete the
system.topology's global req, produce tablets mutations and additional
mutations for a table tracking the lifetime of the whole req. Tracking
the lifetime is necessary to not return the control to the user too
early, so the query processor only returns the response while the
mutations are sent.

(cherry picked from commit fbd75c5)

# Conflicts:
#	service/topology_coordinator.cc
This patch removes the support for the "wildcard" replication_factor
option for ALTER KEYSPACE when the keyspace supports tablets.
It will still be supported for CREATE KEYSPACE so that a user doesn't
have to know all datacenter names when creating the keyspace,
but ALTER KEYSPACE will require that and the user will have to
specify the exact change in replication factors they wish to make by
explicitly specifying the datacenter names.
Expanding the replication_factor option in the ALTER case is
unintuitive and it's a trap many users fell into.

See #8881, #15391, #16115

(cherry picked from commit b875151)
…than 1

We want to ensure that when the replication factor
of a keyspace changes, it changes by at most 1 per DC
if it uses tablets. The rationale for that is to make
sure that the old and new quorums overlap by at least
one node.

After these changes, attempts to change the RF of
a keyspace in any DC by more than 1 will fail.

(cherry picked from commit 951915e)
This commit adds a test verifying that we can only
change the RF of a keyspace for any DC by at most 1
when using tablets.

Fixes #18029

(cherry picked from commit ec5708b)
Up until now we waited until mutations are in place and then returned
directly to the caller of the ALTER statement, but that doesn't imply
that tablets were deleted/created, so we must wait until the whole
processing is done and return only then.

(cherry picked from commit 39181c4)
Now the scailing works and test must check it does

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 2567e30)
When the test changes RF from 2 to 3, the extra node executes "rebuild"
transition which means that it streams tablets replicas from two other
peers. When doing it, the node receives two sets of sstables with
mutations from the given tablet. The test part that checks if the extra
node received the mutations notices two mutation fragments on the new
replica and errorneously fails by seeing, that RF=3 is not equal to the
number of mutations found, which is 4.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 6e0e267)

# Conflicts:
#	test/topology_custom/test_tablets.py
The check is performed by selecting from mutation_fragments(table), but
it's known that this query crashes Scylla when there's no tablet replica
on that node.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 66f6001)

# Conflicts:
#	test/topology_custom/test_tablets.py
@mergify mergify bot requested a review from nyh as a code owner May 29, 2024 20:33
@mergify mergify bot added the conflicts label May 29, 2024
@mergify mergify bot requested a review from tgrabiec as a code owner May 29, 2024 20:33
Copy link
Author

mergify bot commented May 29, 2024

Cherry-pick of fbd75c5 has failed:

On branch mergify/copy/branch-6.0/pr-16723
Your branch is ahead of 'origin/branch-6.0' by 9 commits.
  (use "git push" to publish your local commits)

You are currently cherry-picking commit fbd75c5c06.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
	modified:   cql3/query_processor.cc
	modified:   cql3/statements/alter_keyspace_statement.cc
	modified:   cql3/statements/ks_prop_defs.cc
	modified:   cql3/statements/ks_prop_defs.hh
	modified:   cql3/statements/schema_altering_statement.hh
	modified:   locator/abstract_replication_strategy.hh
	modified:   service/storage_service.hh
	modified:   service/topology_mutation.cc
	modified:   service/topology_mutation.hh
	modified:   test/cql-pytest/test_tablets.py

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	both modified:   service/topology_coordinator.cc

Cherry-pick of 6e0e267 has failed:

On branch mergify/copy/branch-6.0/pr-16723
Your branch is ahead of 'origin/branch-6.0' by 16 commits.
  (use "git push" to publish your local commits)

You are currently cherry-picking commit 6e0e2674f0.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	both modified:   test/topology_custom/test_tablets.py

no changes added to commit (use "git add" and/or "git commit -a")

Cherry-pick of 66f6001 has failed:

On branch mergify/copy/branch-6.0/pr-16723
Your branch is ahead of 'origin/branch-6.0' by 17 commits.
  (use "git push" to publish your local commits)

You are currently cherry-picking commit 66f6001c77.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	both modified:   test/topology_custom/test_tablets.py

no changes added to commit (use "git add" and/or "git commit -a")

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally

@mergify mergify bot marked this pull request as draft May 29, 2024 20:34
@scylladb-promoter
Copy link
Contributor

Docs Preview 📖

Docs Preview for this pull request is available here

Changed Files:

Note: This preview will be available for 30 days and will be automatically deleted after that period. You can manually trigger a new build by committing changes.

@ptrsmrn ptrsmrn requested review from xemul May 29, 2024 20:45
@ptrsmrn
Copy link
Contributor

ptrsmrn commented May 29, 2024

@xemul I don't have write permissions to the repo and even though I resolved merge conflicts, I still cannot push them (by git push -f), so I pushed them to my fork https://github.com/ptrsmrn/scylladb/tree/mergify/copy/branch-6.0/pr-16723 instead. Should I close this PR and opened another one from my fork to the 6.0 branch?
cc @bhalevy

@ptrsmrn
Copy link
Contributor

ptrsmrn commented May 29, 2024

@xemul I don't have write permissions to the repo and even though I resolved merge conflicts, I still cannot push them (by git push -f), so I pushed them to my fork ptrsmrn/scylladb@mergify/copy/branch-6.0/pr-16723 instead. Should I close this PR and opened another one from my fork to the 6.0 branch? cc @bhalevy

I opened another backport PR in hopes we'll get CI results in the morning - #18985

@ptrsmrn
Copy link
Contributor

ptrsmrn commented May 30, 2024

The changes have been already backported, per #16723 (comment)

@ptrsmrn ptrsmrn closed this May 30, 2024
@mergify mergify bot deleted the mergify/copy/branch-6.0/pr-16723 branch May 30, 2024 06:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants