Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Online DDL: support migration cut-over backoff and forced cut-over #14546

Merged
merged 39 commits into from Dec 13, 2023

Conversation

shlomi-noach
Copy link
Contributor

@shlomi-noach shlomi-noach commented Nov 19, 2023

Description

Closes #14530

This PR implements two functionalities relating to schema migration cut-over, relevant to ALTER TABLE in vitess strategy only. Both related to a scenario where the cut-over times out due to either excessive load on the migrated table, or due to some lock being placed on the table. The two functionalities are:

  1. A backoff mechanism: if a cut-over fails, the next attempt is done not before 1min has passed; the next one not before additional 5min have passed, the next is 10min, 30min and from that point cut-overs are only attempted at 30min intervals. This is to avoid a scenario where the database, that is already is under heavy load, needs to cope with frequently recurring cut-over attempts, which themselves put additional locks on tables.
  2. The complementing functionality: a force cut-over mechanism where, one the executer begins cut-over, it brutally kills any queries using the migrated table, as well as terminating any transactions that are holding locks on the table (by terminating their connections). This has high likelyhood to pave the way for the cut-over to succeed.

Backoff

The backoff mechanism is implemented as-is, and is not configurable.

Force cut-over

Forced cut-over can be controlled in these ways.

--force-cut-over-after DDL strategy flag

Example:

vtctldclient ApplySchema --ddl-strategy="vitess --force-cut-over-after=1h" --sql "alter table corder force" commerce

It's possible to preconfigure the maximum duration where we allow cut-overs to fail/timeout due to pending queries/transactions. --force-cut-over-after, if nonzero, applies starting the first cut-over attempt.

In the above example, --force-cut-over-after is set to 1 hour. The migration may run for as long as it needs, say 5 hours. Starting the first cut-over attempt, the clock starts ticking 1 hour. The cut-over may be successful, in which case all's well and nothing further happens. Or it may fail, in which case the backoff mechanism kicks in. The next attempt is done within 1m, then 5m, etc. But if these all keep failing, then 1h since the very first failed attempt, irrespective of backoff, and within a 1min resolution, the scheduler runs a cut-over with query&transaction termination. This is highly likely to succeed. But if it fails, then it continues to attempt forced cut-overs every minute.

ALTER VITESS_MIGRATION ... FORCE_CUTOVER ...

We introduce a new syntax:

ALTER VITESS_MIGRATION '9748c3b7_7fdb_11eb_ac2c_f875a4d24e90' FORCE_CUTOVER;
ALTER VITESS_MIGRATION FORCE_CUTOVER ALL;

The former forces cut-over for a specific migration, the latter for all pending migrations (queued, ready, running). All these command do is set the new schema_migrations.force_cutover column value to 1, much like ALTER VITESS_MIGRATION ... COMPLETE ....

The scheduler picks up this force_cutover column value on its next review of running migrations. If it's 1, then any backoff state is ignored, and the scheduler attempts a forced cut-over, terminating queries and transactions.

vtctldclient OnlineDDL force-cutover

These matching options are added to vtctldclient OnlineDDL command:

vtctldclient OnlineDDL force-cutover commerce 3ced2e5b_86b3_11ee_acea_0a43f95f28a3
vtctldclient OnlineDDL force-cutover commerce all

Notes

Only works on MySQL 8.0 (requires performance_schema.data_locks table. 5.7 does not provide reliable information).

To note the obvious, a forced cut-over is only relevant when the migration is actually eligible for cut-over. If the migration is incomplete, then it won't attempt a cut-over. Issuing a ALTER VITESS_MIGRATION ... FORCE_CUTOVER ... will set the column to 1, but it will only become relevant when the migration becomes ready to cut-over. Also, if the migration runs with --postpone-completion, then it will not be eligible to cut-over ; the user will first need to issue a ALTER VITESS_MIGRATION ... COMPLETE .... It's fine if the user runs a ALTER VITESS_MIGRATION ... FORCE_CUTOVER ... first, but this will only have the effect of setting force_cutover=1; it will not COMPLETE the migration.

Added unit and endtoend tests.

Documentation: vitessio/website#1641.

Related Issue(s)

Checklist

  • "Backport to:" labels have been added if this change should be back-ported
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on the CI
  • Documentation was added or is not required

Deployment Notes

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
…irst invocation

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
…actions holding a lock on a table

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
…illQueriesOnTable: kill queries on a table, and kill connections with transactions holding locks on table

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
…r and Online DDL executor

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
…and the effect on 'force_cutover' column'

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
…ransaction holding lock

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
…t, ForceCutOverSchemaMigrationResponse

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach shlomi-noach added Type: Enhancement Logical improvement (somewhere between a bug and feature) Component: Online DDL Online DDL (vitess/native/gh-ost/pt-osc) labels Nov 19, 2023
@shlomi-noach
Copy link
Contributor Author

Documentation PR: vitessio/website#1641

@shlomi-noach shlomi-noach removed the NeedsWebsiteDocsUpdate What it says label Nov 20, 2023
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
…tess into onlineddl-cutover-backoff

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Copy link
Contributor

@mattlord mattlord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! My only major concern is that we have no tests for the KILL portion, which is particularly sensitive. Am I missing them? It would be ideally covered by unit tests with a lot of test cases with mock mysql results. But we probably don't have the framework in place for that so it may be difficult. Let me know what you think.

go/cmd/vtctldclient/command/onlineddl.go Outdated Show resolved Hide resolved
go/cmd/vtctldclient/command/onlineddl.go Outdated Show resolved Hide resolved
go/cmd/vtctldclient/command/onlineddl.go Show resolved Hide resolved
require.NotNil(t, rs)
for _, row := range rs.Named().Rows {
message := row.AsString("message", "")
if strings.Contains(message, messageSubstring) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason not to make it case insensitive?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder why we should? If the test knows what to expect, then it should expect the exact string.

go/vt/schema/ddl_strategy_test.go Outdated Show resolved Hide resolved
@@ -673,6 +673,37 @@ func (s *VtctldServer) CleanupSchemaMigration(ctx context.Context, req *vtctldat
return resp, nil
}

// ForceCutOverSchemaMigration is part of the vtctlservicepb.VtctldServer interface.
func (s *VtctldServer) ForceCutOverSchemaMigration(ctx context.Context, req *vtctldatapb.ForceCutOverSchemaMigrationRequest) (resp *vtctldatapb.ForceCutOverSchemaMigrationResponse, err error) {
span, ctx := trace.NewSpan(ctx, "VtctldServer.ForceCutOverSchemaMigrationResponse")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should just be the function name, not the response part.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't follow, can you please explain?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The trace span name should be the function name:

span, ctx := trace.NewSpan(ctx, "VtctldServer.ForceCutOverSchemaMigration")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the last thing and it's minor. So I will approve and you can change this whenever you like.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

go/vt/vttablet/onlineddl/executor.go Outdated Show resolved Hide resolved
@@ -765,8 +766,93 @@ func (e *Executor) terminateVReplMigration(ctx context.Context, uuid string) err
return nil
}

func (e *Executor) killQueriesOnTable(ctx context.Context, tableName string) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't really covered by any tests, is it? This feels like the most critical aspect as we're killing things off in the production system.

Things like this are actually easier in unit tests as you can mock the mysql query responses.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is covered in https://github.com/vitessio/vitess/pull/14546/files#diff-e1236744a601a269891381b0ed09b7151d86f5a9b001d8c4e954211c2ae04a8d ; there's a test that holds an open transaction, we attempt a completion, the migration does not complete; we then force cut-over, and we see that the migration completes.

go/vt/vttablet/onlineddl/executor.go Outdated Show resolved Hide resolved
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach
Copy link
Contributor Author

My only major concern is that we have no tests for the KILL portion, which is particularly sensitive. Am I missing them?

This is covered in https://github.com/vitessio/vitess/pull/14546/files#diff-e1236744a601a269891381b0ed09b7151d86f5a9b001d8c4e954211c2ae04a8d ; there's a test that holds an open transaction, we attempt a completion, the migration does not complete; we then force cut-over, and we see that the migration completes.

shlomi-noach and others added 5 commits November 30, 2023 08:42
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Co-authored-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@@ -673,6 +673,37 @@ func (s *VtctldServer) CleanupSchemaMigration(ctx context.Context, req *vtctldat
return resp, nil
}

// ForceCutOverSchemaMigration is part of the vtctlservicepb.VtctldServer interface.
func (s *VtctldServer) ForceCutOverSchemaMigration(ctx context.Context, req *vtctldatapb.ForceCutOverSchemaMigrationRequest) (resp *vtctldatapb.ForceCutOverSchemaMigrationResponse, err error) {
span, ctx := trace.NewSpan(ctx, "VtctldServer.ForceCutOverSchemaMigrationResponse")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mattlord is saying this, essentially

Suggested change
span, ctx := trace.NewSpan(ctx, "VtctldServer.ForceCutOverSchemaMigrationResponse")
span, ctx := trace.NewSpan(ctx, "VtctldServer.ForceCutOverSchemaMigration")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed!

return nil, err
}

log.Info("Calling ApplySchema to force cut-over migration")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we include the UUID in this log?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -66,6 +66,8 @@ service Vtctld {
rpc CleanupSchemaMigration(vtctldata.CleanupSchemaMigrationRequest) returns (vtctldata.CleanupSchemaMigrationResponse) {};
// CompleteSchemaMigration completes one or all migrations executed with --postpone-completion.
rpc CompleteSchemaMigration(vtctldata.CompleteSchemaMigrationRequest) returns (vtctldata.CompleteSchemaMigrationResponse) {};
// ForceCutOverSchemaMigration marks a schema migration for forced cut-over.
rpc ForceCutOverSchemaMigration(vtctldata.ForceCutOverSchemaMigrationRequest) returns (vtctldata.ForceCutOverSchemaMigrationResponse) {};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: move this down below FindAllShardsInKeyspace (to keep them tidy)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@@ -459,6 +458,15 @@ message CompleteSchemaMigrationResponse {
map<string, uint64> rows_affected_by_shard = 1;
}

message ForceCutOverSchemaMigrationRequest {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: move these down below FindAllShardsInKeyspaceRequest/Response (to keep them tidy)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach shlomi-noach merged commit 3b58bee into vitessio:main Dec 13, 2023
119 checks passed
@shlomi-noach shlomi-noach deleted the onlineddl-cutover-backoff branch December 13, 2023 06:56
ejortegau pushed a commit to slackhq/vitess that referenced this pull request Dec 13, 2023
…itessio#14546)

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Co-authored-by: Matt Lord <mattalord@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: Online DDL Online DDL (vitess/native/gh-ost/pt-osc) Type: Enhancement Logical improvement (somewhere between a bug and feature)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature Request: Online DDL cut-over backoff + forced completion
3 participants