New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Experimental: automated, scheduled, dependency free online DDL via gh-ost/pt-online-schema-change #6547
Conversation
Zero dependencies doesn't mean zero configuration. What's the throttle replication lag? I'm used to as low as |
I'm very excited to see this type of automation. Solving problems everyone deals with in a Vitess-native way goes a long way towards driving mass adoption. At the same time, it's somewhat disappointing that it is using |
👋 it's always best to be explicit. I'm not sure if your impression is that I'm fighting a religious war or am just too obsessed with my own creation 😄 , this isn't the case. FWIW
I'm sorry to hear that, and apologize if I've alienated you in any way. I'm not sure my view on MySQL foreign keys should be alienating people and I'm dumbfounded that this is the case. |
Apologies, my comment wasn't meant to be a personal attack or to suggest that you have personally alienated me. Your tools are awesome, and my "I get why" comment was just me acknowledging that you wrote it and thus are able to move the quickest with it, not to mention that it probably has been the most requested integration. As for FKs, I wasn't trying to say that you have alienated anyone personally with your views, just recognizing that I'm aware of them from prior posts. Given the history of gh-ost where you were working at specific companies that didn't use FKs, it makes total sense to not deal with the extra complexity that they bring to DB tooling. In Vitess by contrast, where one of the main areas of focus is full MySQL compatibility, we're trying to support the majority of workloads, many of which include FKs, so it'd be great to support them eventually, whether that is achieved via gh-ost, pt-osc, vreplication, or something else. Again I'm sorry for coming across negatively. I've personally interacted with you building the Orchestrator integration with the Vitess helm charts and have always been impressed by your knowledge and willingness to help. I was super excited when I found out you were going to Planetscale. As I mentioned originally, I love that you are doing the work to add this level of automation, taking advantage of the control plane that doesn't exist in vanilla MySQL, and will really help to drive adoption of Vitess. I look forward to continued interaction with you and want you to know that I hold you in the highest regard. |
😍 |
@derekperkins Thank you for your kind message ❤️ and I also reflect that I may take some words to present differently than intended, as I'm not a native English speaker and I can mis-parse things. I also very much enjoyed working with you in our Regarding foreign keys, there's two ways forward:
|
Thanks so much for this work! I am incredibly excited about the prospect of online schema change as first-class feature supported in Vitess. We have been using
Here are something of the things we are working on adding -
My only concern with the current proposal is the overhead with using the topo server for co-ordination of schema changes. |
@ameetkotian adressing some of the bullet points:
Yes. As mentioned above, first iteration will not support concurrent migration+reshard operation, but that should be solved in future iterations. The current PR as it is still does not address the topic of resharding. With regard to failovers, again current PR does not address it, but the idea is that in the short term we will identify a failover and restart the migration. Possibly, and only where
I see that more as an external migration tracking/management system ownership. At least for now, the purpose of the PR is to provide the mechanics for online schema changes.
Agreed
Agreed. i suspect
At this time I see this at a higher level than
I'd like to point you to this experimental PR, checksumming data on the fly. I haven't yet tested it in production.
👍 This is on my agenda. |
Possible syntax change:
|
Recent commit, c68d438, changes syntax to
and also breaks |
The WIP on VExec will mostly eliminate that. We will only write to global |
I have a POC for
|
|
Shlomi I am soooo excited about this!
I can provide some details around this.
|
The current implementation, by the way, is to run gh-ost directly on the master server via its tablet. I consider to keep it that way, and use the replicas only for throttling. Since vitess requires ROW binlog format in the first place, this should be a safe decision. As for replicas taken down for backup or for other reasons, I wish to use freno as the all-knowing throttling service, and that’s in the mid-term run. In the short term, I still need to figure it out... |
I like this syntax choice a lot, it's very readable. Are there any configuration options we might want to set in SQL? I'm not sure if it makes sense, but maybe these could be pseudo function calls,
I'm glad that there's a viable path to support them down the road. For the reasons you've laid out in other comments, I'd prefer to see support in |
Yeah, I suspect we'd need to support some config via SQL; in particular, I'm looking at what's an acceptable replication lag.
|
On the topic of handling failures:
|
@rohit-nayak-ps I've now pushed my changes to Notable changes:
|
re: |
Migration options now available: alter with_ghost table my_table ... -- no options
alter with_ghost '--max-lag-millis=1500' table my_table ...
alter with_pt '--max-lag 1.5s --null-to-not-null' table my_table ... |
It's now possible to
syntax subject to change:
|
Suggestions from Andrew Mason in Vitess slack: I'm thinking a simple solution would be to add two flags to vttablet that is like: Another thought with respect to not passing |
…into online-ddl Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Incorporated #6815, where the throttler is disabled, by default. |
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
The test
The watched path is |
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Found it! |
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Upon migration completion (whether successful or failed), online-ddl executor renames away the artifacts. This uses some logic from #6719 :
|
…endtoend tests Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
I'm ready to have this PR merged. It now supports throttling and table lifecycle. I have not made changes to the |
OMG 🎉 |
Pointing out that the |
This checks if a vtgate is currently filtering keyspaces before requesting the TopoServer. This is necessary because a TopoServer can't be accessed in those cases as the filtered Topo in those cases could make it unsafe to make writes since all reads would be returning a subset of the actual topo data. The only use of the requested topoServer that I found was in the DDL handling path and was introduced in vitessio#6547. This is deployed on dev but should get testing (endtoend or unit, unclear on best path atm) before going upstream.
This checks if a vtgate is currently filtering keyspaces before requesting the TopoServer. This is necessary because a TopoServer can't be accessed in those cases as the filtered Topo in those cases could make it unsafe to make writes since all reads would be returning a subset of the actual topo data. The only use of the requested topoServer that I found was in the DDL handling path and was introduced in vitessio#6547. This is deployed on dev but should get testing (endtoend or unit, unclear on best path atm) before going upstream.
This checks if a vtgate is currently filtering keyspaces before requesting the TopoServer. This is necessary because a TopoServer can't be accessed in those cases as the filtered Topo in those cases could make it unsafe to make writes since all reads would be returning a subset of the actual topo data. The only use of the requested topoServer that I found was in the DDL handling path and was introduced in vitessio#6547. This is deployed on dev but should get testing (endtoend or unit, unclear on best path atm) before going upstream. # Conflicts: # go/vt/vtgate/vcursor_impl.go Signed-off-by: Richard Bailey <rbailey@slack-corp.com>
This checks if a vtgate is currently filtering keyspaces before requesting the TopoServer. This is necessary because a TopoServer can't be accessed in those cases as the filtered Topo in those cases could make it unsafe to make writes since all reads would be returning a subset of the actual topo data. The only use of the requested topoServer that I found was in the DDL handling path and was introduced in vitessio#6547. This is deployed on dev but should get testing (endtoend or unit, unclear on best path atm) before going upstream. # Conflicts: # go/vt/vtgate/vcursor_impl.go Signed-off-by: Richard Bailey <rbailey@slack-corp.com>
This PR (work in progress) introduces zero dependency online schema changes with
gh-ost
/pt-online-schema-change
.UPDATE: this comment edited to reflect support for
pt-online-schema-change
. Originally this PR only supportedgh-ost
. Mostly whenever you seegh-ost
, considerpt-online-schema-change
to apply, as well.TL;DR
User will issue:
or
and vitess will schedule an online schema change operation to run on all relevant shards, then proceed to apply the change via
gh-ost
on all shards.While this PR is WIP, this flow works. More breakdown to follow, indicating what's been done and what's still missing.
The ALTER TABLE problem
First, to iterate the problem: schema changes have always been a problem with MySQL; a straight
ALTER
is a blocking operation; aONLINE ALTER
is only "online" on the master/primary, but is effectively blocking on replicas. Online schema change tools likept-online-schema-change
andgh-ost
overcome these limitations by emulating anALTER
on a "ghost" table, which is populated from the original table, then swapped in its space.For disclosure, I authored
gh-ost
's code as part of the database infrastructure team at GitHub.Traditionally, online schema changes are considered to be "risky". Trigger based migrations add significant load onto the master server, and their cut-over phase is known to be a dangerous point.
gh-ost
was created at GitHub to address these concerns, and successfully eliminated concerns for operational risks: withgh-ost
the load on the master is low, and well controlled, and the cut-over phase is known to cause no locking issues.gh-ost
comes with different risks: it applies data changes programmatically, thus the issue of data integrity is of utmost importance. Another note of concern is data traffic: going out from MySQL intogh-ost
and back into MySQL (as opposed to all-in MySQL inpt-online-schema-change
).This way or the other, running an online schema change is typically a manual operation. A human being will schedule the migration, kick it running, monitor it, possibly cut-over. In a sharded environment, a developer's request to
ALTER TABLE
explodes ton
different migrations, each needs to be scheduled, kicked, monitored & tracked.Sharded environments are obviously common for
vitess
users and so these users feel the pain more than others.Schema migration cycle & steps
Schema management is a process that begins with the user designing a schema change, and ends with the schema being applied in production. This is a breakdown of schema management steps as I know them:
ALTER TABLE
orpt-online-schema-change
orgh-ost
command)What we propose to address
Vitess's architecture uniquely positions it to be able to automate away much of the process. Specifically:
ALTER TABLE
statement into agh-ost
invocation is super useful if done by vitess, since vitess can not only validate schema/params, but also can provide credentials, identify a throttle-control replica, can instructgh-ost
on how to communicate progress via hooks, etc.vitess
just knows where the table is located. It knows if the schema is sharded. It knows who the shards are, who the shards masters are. It knows where to rungh-ost
. Last,vitess
can tell us which replicas we can use for throttling.vttablet
is the ideal entity to run a migration; can read instructions fromtopo
server and can write progress totopo
server.vitess
is aware of possible master failovers and can request a re-execute is a migration is so interrupted mid process.vtctld
API can offer endpoints to track status of a migration (e.g. "in progress on-80
, in queue on80-
"). It may offer progress pct and ETA.gh-ost
, the cut-over phase is safe to automate away.vttablet
is in an excellent position to automate that away.What this PR does, and what we expect to achieve
The guideline for this PR is: zero added dependencies; everything must be automatically and implicitly available via a normal
vitess
installation.A breakdown:
User facing
This PR enables the user to run an online schema migration (aka online DDL) via:
vtgate
: the user connects tovitess
with their standard MySQL client, and issues aALTER WITH 'gh-ost' TABLE ...
statement. Notice this isn't a valid MySQL syntax -- it's a hint forvitess
that we want to run this migration online.vitess
still supports synchronous, "normal"ALTER TABLE
statements, which IMO should be discouraged.vtctl
: the user runsvtctl ApplySchema -sql "alter with _gh-ost' table ..."
.The response, in both cases, is a migration ID, or a job ID, if you will. Consider the following examples.
via
vtgate
:via
vtctl
:In both cases, a UUID is returned, which can be used for tracking (WIP) the progress of the migration across shards.
Parser
Vitess' parser now accepts
ALTER WITH 'gh-ost' TABLE
andALTER WITH 'pt-osc' TABLE
syntax. We're still to determine if this is the exact syntax we want to go with.Topo
Whether submitted by
vtgate
orvtctl
, we don't immediately run the migration. As mentioned before, we may wish to postpone the migration. Perhaps the relevant servers are already running a migration.Instead, we write the migration request into global
topo
, e.g.:/vitess/global/schema-migration/requests/90c5afd4-da38-11ea-a3ff-f875a4d24e90
Once we create the request in
topo
, we immediately return the generated UUID/migration ID (90c5afd4-da38-11ea-a3ff-f875a4d24e90
in the above example) to the user.vtctld
vtctld
gets a conceptual "upgrade" with this PR. It is no longer a reactive service.vtctld
now actively monitors newschema-migration/requests
intopo
.When it sees such a request, it evaluates what are the relevantn
shards.With current implementaiton, it writesn
"job" entries, one per shard. e.g./vitess/global/schema-migration/jobs/commerce/-80/ce45b84a-da2d-11ea-b490-f875a4d24e90
and/vitess/global/schema-migration/jobs/commerce/80-/ce45b84a-da2d-11ea-b490-f875a4d24e90
for a keyspace with two shards; or just/vitess/global/schema-migration/jobs/commerce/0/1dd17132-da23-11ea-a3d2-f875a4d24e90
for a keyspace with one shard.DONE: WIP: we will investigate use of new
VExec
to actually distribute the jobs tovttablet
.what
vtctld
does now, is, once it sees a migration request, it pushes a VExec request for that migration. If the VExec request succeeds, that means all shards have been notified, andvtctld
can stow away the migration request (work is complete as far asvtctld
is concerned). If VExec returns with an error, that means at least one shard did not get the request, andvtctld
will keep retrying pushing this request.vttablet
This is where most of the action takes place.
vttablet
runs a migration service which continuously probes for, schedules, and executes migrations.DONE:
With current implementation, tablets which havetablet_type=MASTER
continuously probe for new entries. We look to replace this withVExec
.migration requests are pushed via
VExec
; the request includes theINSERT IGNORE
query that persists the migration in_vt.schema_migrations
. The tablet no longer reads from, nor writes to, Global Topo.A new table is introduced:
_vt.schema_migrations
, which is howvttablet
manages and tracks its own migrations.vttablet
will only run a single migration at a time.vttablet
will see if there's an unhandled migration requests. It will queue it.vttablet
will make a migrationready
if there's no running migration and no other migration is marked asready
.vttablet
will run aready
migration. This is really the interesting part, with lots of goodies:vttablet
will evaluate thegh-ost ...
command to run. It will obviously populate--alter=... --database=...
.vttablet
creates a temp directory where it generates a script to rungh-ost
.vttablet
creates a hooks path and auto-generates hook files. The hooks will interact withvttablet
vttablet
has an API endpoint by which the hooks can communicategh-ost
's status (started/running/success/failure) withvttablet
.vttablet
providesgh-ost
with--hooks-hint
which is the migration's UUID.vttablet
automatically generates agh-ost
user on the MySQL server, with a random password. The password is never persisted and does not appear onps
. It is written to, and loaded from, an environment variable.vttablet
grants the properprivileges
on the newly created accountvttablet
will destroy the account once migration completes.vitess
repo includes agh-ost
binary. We requiregh-ost
fromopenark/gh-ost
as opposed togithub/gh-ost
because we've had to make some special adjustments togh-ost
s oas to support this flow. I do not have direct ownership togithub/gh-ost
and cannot enforce those changes upstream, though I have made the contribution requestss upstream.make build
automatically appendsgh-ost
binary, compressed, tovttablet
binary, via Ricebox.vttablet
, upon startup, auto extractsgh-ost
binary into/tmp/vt-gh-ost
. Please note that the user does not need to install gh-ost.vttablet
to report back the job as complete/failed. We look to useVExec
. TBD.Tracking breakdown
OnlineDDL
struct, defines a migration request and its statusALTER WITH 'gh-ost' TABLE
andALTER WITH 'pt-osc' TABLE
syntaxtopo
)vtctl
to skip "big changes" check when-online_schema_change
is giventablet_executor
to submit an online DDL request totopo
as opposed to running it on tabletsvtctld
runs a daemon to monitor for, and review migration requestsvtctld
evaluates which shards are affected_vt.schema_migrations
backend table to support migration automation (on each shard))vttablet
validates MySQL connection and variablesvttablet
creates migration commandvttablet
creates hooksvttablet
provides HTTP API for hooks to report their status backvttablet
createsgh-ost
user with random passwordvttablet
destroysgh-ost
user upon completiongh-ost
embedded invttablet
binary and auto-extracted byvttablet
vttablet
runs adry-run
executionvttablet
runs a--execute
(actual) executionvttablet
supports aCancel
request (not used yet) to abort migrationvttablet
as a state machine to work throught the migration stepsgh-ost
migration requests, suceessful and failed migrationsVExec
to apply migrations onto tabletsVExec
to control migrations (abort, retry)vttablet
to heuristically check for available disk spacegh-ost
logs if necessaryALTER WITH 'gh-ost' TABLE...
andALTER WITH 'pt-osc' TABLE
syntax make sense? Other?throttle by replicawait for replica to catch up with new credentials before starting the migrationpt-online-schema-change
bundled insidevttablet
binarypt-online-schema-change
define foreign key flags for- user can define as runtime flagspt-online-schema-change
executionvttablet
itself crashespt-online-schema-change
passwords are in cleartext. Can we avoid that?vtctl ApplySchema
use sameWITH 'gh-ost'
andWITH 'pt-osc'
query hints as invtgate
.gh-ost
andpt-online-schema-change
pathspt-osc
triggers after migration failurept-osc
triggers on migration cancellation (overlaps with previous bullet, but has stronger guarantee)pt-osc
triggers from stale/zombiept-osc
migrationvtctl OnlineDDL
command for simple visibility and manipulation. See Experimental: automated, scheduled, dependency free online DDL via gh-ost/pt-online-schema-change #6547 (comment)artifacts
column, suggesting which tables need to be cleaned up after migrationQuite likely more entries to be added.
Further reading, resources, acknowledgements
We're obviously using gh-ost. I use my own
openark/gh-ost
since I have no ownership of the original https://github.com/github/gh-ost.gh-ost
was/is developed by GitHub 2016-2020.pt-online-schema-change
is part of the popular Percona ToolkitThe schema migratoin scheduling and tracking work is based on my previous work at GitHub. The implementation in this PR is new and rewritten, but based on concepts that have matured on my work on
skeefree
. Consider these resources:Also:
Initial incarnation of this PR: planetscale#67; some useful comments on that PR.
Call for feedback
We're looking for community's feedback on the above suggestions/flow. Thank you for taking the time to read and respond!