Experimental: automated, scheduled, dependency free online DDL via gh-ost/pt-online-schema-change #6547

Merged
merged 191 commits into from Oct 5, 2020

shlomi-noach
Contributor

@shlomi-noach shlomi-noach commented Aug 9, 2020

This PR (work in progress) introduces zero dependency online schema changes with gh-ost/pt-online-schema-change.

UPDATE: this comment edited to reflect support for pt-online-schema-change. Originally this PR only supported gh-ost. Mostly whenever you see gh-ost, consider pt-online-schema-change to apply, as well.

TL;DR

User will issue:

alter with 'gh-ost' table example modify id bigint not null;

alter with 'pt-osc' table example modify id bigint not null

or

$ vtctl -topo_implementation etcd2 -topo_global_server_address localhost:2379 -topo_global_root /vitess/global \
    ApplySchema -sql "alter with 'gh-ost' table example modify id bigint unsigned not null" commerce

$ vtctl -topo_implementation etcd2 -topo_global_server_address localhost:2379 -topo_global_root /vitess/global \
    ApplySchema -sql "alter with 'pt-osc' table example modify id bigint unsigned not null" commerce

and vitess will schedule an online schema change operation to run on all relevant shards, then proceed to apply the change via gh-ost on all shards.

While this PR is WIP, this flow works. More breakdown to follow, indicating what's been done and what's still missing.

The ALTER TABLE problem

First, to reiterate the problem: schema changes have always been a problem with MySQL; a straight ALTER is a blocking operation; an ONLINE ALTER is only "online" on the master/primary, but is effectively blocking on replicas. Online schema change tools like pt-online-schema-change and gh-ost overcome these limitations by emulating an ALTER on a "ghost" table, which is populated from the original table, then swapped in its place.

For disclosure, I authored gh-ost's code as part of the database infrastructure team at GitHub.

Traditionally, online schema changes are considered to be "risky". Trigger based migrations add significant load onto the master server, and their cut-over phase is known to be a dangerous point. gh-ost was created at GitHub to address these concerns, and successfully eliminated concerns for operational risks: with gh-ost the load on the master is low, and well controlled, and the cut-over phase is known to cause no locking issues. gh-ost comes with different risks: it applies data changes programmatically, thus the issue of data integrity is of utmost importance. Another note of concern is data traffic: going out from MySQL into gh-ost and back into MySQL (as opposed to all-in MySQL in pt-online-schema-change).

This way or the other, running an online schema change is typically a manual operation. A human being will schedule the migration, kick it running, monitor it, possibly cut-over. In a sharded environment, a developer's request to ALTER TABLE explodes to n different migrations, each needs to be scheduled, kicked, monitored & tracked.

Sharded environments are obviously common for vitess users and so these users feel the pain more than others.

Schema migration cycle & steps

Schema management is a process that begins with the user designing a schema change, and ends with the schema being applied in production. This is a breakdown of schema management steps as I know them:

  1. Design code
  2. Publish changes (pull request)
  3. Review
  4. Formalize migration command (the specific ALTER TABLE or pt-online-schema-change or gh-ost command)
  5. Locate: where in production should this migration run?
  6. Schedule
  7. Execute
  8. Audit/monitor
  9. Cut-over/complete
  10. Cleanup
  11. Notify user
  12. Deploy & merge

What we propose to address

Vitess's architecture uniquely positions it to be able to automate away much of the process. Specifically:

  • Formalize migration command: turning an ALTER TABLE statement into a gh-ost invocation is super useful if done by vitess, since vitess can not only validate schema/params, but also can provide credentials, identify a throttle-control replica, can instruct gh-ost on how to communicate progress via hooks, etc.
  • Locate: given schema/table, vitess just knows where the table is located. It knows if the schema is sharded. It knows who the shards are, who the shards' masters are. It knows where to run gh-ost. Last, vitess can tell us which replicas we can use for throttling.
  • Schedule: vitess is again in a unique position to schedule migrations. The fact someone asks for a migration to run does not mean the migration should start right away. For example, a shard may already be running an earlier migration. Running two migrations at a time is less than ideal, and it's best to wait out the first migration before beginning the second. A scheduling mechanism is useful both for running the migrations in optimal order/sequence and for providing feedback to the user ("your migration is on hold because this and that", or "your migration is 2nd in queue to run")
  • Execute: vttablet is the ideal entity to run a migration; it can read instructions from topo server and can write progress to topo server. vitess is aware of possible master failovers and can request a re-execute if a migration is so interrupted mid process.
  • Audit/monitor: vtctld API can offer endpoints to track status of a migration (e.g. "in progress on -80, in queue on 80-"). It may offer progress pct and ETA.
  • cut-over/complete: in my experience with gh-ost, the cut-over phase is safe to automate away.
  • cleanup: the old table needs to be dropped; vttablet is in an excellent position to automate that away.

What this PR does, and what we expect to achieve

The guideline for this PR is: zero added dependencies; everything must be automatically and implicitly available via a normal vitess installation.

A breakdown:

User facing

This PR enables the user to run an online schema migration (aka online DDL) via:

  • vtgate: the user connects to vitess with their standard MySQL client, and issues an ALTER WITH 'gh-ost' TABLE ... statement. Notice this isn't valid MySQL syntax -- it's a hint for vitess that we want to run this migration online. vitess still supports synchronous, "normal" ALTER TABLE statements, which IMO should be discouraged.
  • vtctl: the user runs vtctl ApplySchema -sql "alter with 'gh-ost' table ...".

The response, in both cases, is a migration ID, or a job ID, if you will. Consider the following examples.

via vtgate:

mysql> create table example(id int auto_increment primary key, name tinytext);

mysql> show create table example \G

CREATE TABLE `example` (
  `id` int NOT NULL AUTO_INCREMENT,
  `name` tinytext,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

mysql> alter with 'gh-ost' table example modify id bigint not null, add column status int, add key status_dx(status);
+--------------------------------------+
| uuid                                 |
+--------------------------------------+
| 211febfa-da2d-11ea-b490-f875a4d24e90 |
+--------------------------------------+

-- <wait...>

mysql> show create table example \G

CREATE TABLE `example` (
  `id` bigint NOT NULL,
  `name` tinytext,
  `status` int DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `status_dx` (`status`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

via vtctl:

$ mysql -e "show create table example\G"

CREATE TABLE `example` (
  `id` bigint NOT NULL,
  `name` tinytext,
  `status` int DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `status_dx` (`status`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci


$ vtctl -topo_implementation etcd2 -topo_global_server_address localhost:2379 -topo_global_root /vitess/global \
    ApplySchema -sql "alter with 'gh-ost'  table example modify id bigint unsigned not null" commerce
8ec347e1-da2e-11ea-892d-f875a4d24e90


$ mysql -e "show create table example\G"

CREATE TABLE `example` (
  `id` bigint unsigned NOT NULL,
  `name` tinytext,
  `status` int DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `status_dx` (`status`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

In both cases, a UUID is returned, which can be used for tracking (WIP) the progress of the migration across shards.

Parser

Vitess' parser now accepts ALTER WITH 'gh-ost' TABLE and ALTER WITH 'pt-osc' TABLE syntax. We're still to determine if this is the exact syntax we want to go with.

Topo

Whether submitted by vtgate or vtctl, we don't immediately run the migration. As mentioned before, we may wish to postpone the migration. Perhaps the relevant servers are already running a migration.

Instead, we write the migration request into global topo, e.g.:

  • key: /vitess/global/schema-migration/requests/90c5afd4-da38-11ea-a3ff-f875a4d24e90
  • content:
{"keyspace":"commerce","table":"example","sql":"alter table example modify id bigint not null","uuid":"90c5afd4-da38-11ea-a3ff-f875a4d24e90","online":true,"time_created":1596701930662801294,"status":"requested"}

Once we create the request in topo, we immediately return the generated UUID/migration ID (90c5afd4-da38-11ea-a3ff-f875a4d24e90 in the above example) to the user.
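
As a rough illustration of the request document above, here is a minimal Go sketch. The JSON keys and the key path are taken from the example; the type name `MigrationRequest` and the helpers `requestKey` and `encode` are hypothetical and are not claimed to match the PR's actual code (the PR's own struct is called OnlineDDL).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// MigrationRequest mirrors the JSON document shown above. The JSON keys come
// from that example; the Go type and field names are illustrative assumptions.
type MigrationRequest struct {
	Keyspace    string `json:"keyspace"`
	Table       string `json:"table"`
	SQL         string `json:"sql"`
	UUID        string `json:"uuid"`
	Online      bool   `json:"online"`
	TimeCreated int64  `json:"time_created"` // looks like nanoseconds since the Unix epoch
	Status      string `json:"status"`
}

// requestKey builds the global topo key for a request (hypothetical helper).
func requestKey(globalRoot, uuid string) string {
	return globalRoot + "/schema-migration/requests/" + uuid
}

// encode serializes the request the way it would be stored under requestKey.
func encode(r MigrationRequest) (string, error) {
	b, err := json.Marshal(r)
	return string(b), err
}

func main() {
	req := MigrationRequest{
		Keyspace:    "commerce",
		Table:       "example",
		SQL:         "alter table example modify id bigint not null",
		UUID:        "90c5afd4-da38-11ea-a3ff-f875a4d24e90",
		Online:      true,
		TimeCreated: 1596701930662801294,
		Status:      "requested",
	}
	content, err := encode(req)
	if err != nil {
		panic(err)
	}
	fmt.Println("key:    ", requestKey("/vitess/global", req.UUID))
	fmt.Println("content:", content)
}
```

Note that the stored SQL is the plain ALTER TABLE statement; the online hint travels separately in the `online` field.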

vtctld

vtctld gets a conceptual "upgrade" with this PR. It is no longer a reactive service. vtctld now actively monitors new schema-migration/requests in topo.

When it sees such a request, it evaluates what are the relevant n shards.

With current implementation, it writes n "job" entries, one per shard, e.g.

  • /vitess/global/schema-migration/jobs/commerce/-80/ce45b84a-da2d-11ea-b490-f875a4d24e90 and
    /vitess/global/schema-migration/jobs/commerce/80-/ce45b84a-da2d-11ea-b490-f875a4d24e90 for a keyspace with two shards; or just
  • /vitess/global/schema-migration/jobs/commerce/0/1dd17132-da23-11ea-a3d2-f875a4d24e90 for a keyspace with one shard.

DONE: WIP: we will investigate use of new VExec to actually distribute the jobs to vttablet.

What vtctld does now is, once it sees a migration request, it pushes a VExec request for that migration. If the VExec request succeeds, that means all shards have been notified, and vtctld can stow away the migration request (work is complete as far as vtctld is concerned). If VExec returns with an error, that means at least one shard did not get the request, and vtctld will keep retrying pushing this request.
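
The push-and-retry behaviour described above could be sketched roughly as follows. This is only an illustration: `pushFunc` stands in for whatever call delivers the request to all shards (VExec in this PR), and `reviewPending` and `errNotDelivered` are hypothetical names, not the PR's code.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotDelivered models "at least one shard did not get the request".
var errNotDelivered = errors.New("at least one shard did not get the request")

// pushFunc is a stand-in for the call that pushes a migration to all shards.
type pushFunc func(uuid string) error

// reviewPending makes one pass over the pending migration requests.
// Requests that were pushed successfully are dropped (stowed away);
// requests that failed are returned so the next pass retries them.
func reviewPending(pending []string, push pushFunc) []string {
	var remaining []string
	for _, uuid := range pending {
		if err := push(uuid); err != nil {
			remaining = append(remaining, uuid)
		}
	}
	return remaining
}

func main() {
	attempts := map[string]int{}
	// A fake push that fails on the first attempt for migration "b".
	push := func(uuid string) error {
		attempts[uuid]++
		if uuid == "b" && attempts[uuid] < 2 {
			return errNotDelivered
		}
		return nil
	}

	pending := []string{"a", "b"}
	for pass := 1; len(pending) > 0; pass++ {
		pending = reviewPending(pending, push)
		fmt.Printf("pass %d: %d request(s) still pending\n", pass, len(pending))
	}
}
```

Because the tablet-side INSERT IGNORE is idempotent, re-pushing a request to shards that already received it should be harmless, which is what makes a simple retry loop like this viable.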

vttablet

This is where most of the action takes place.

vttablet runs a migration service which continuously probes for, schedules, and executes migrations.

DONE: With current implementation, tablets which have tablet_type=MASTER continuously probe for new entries. We look to replace this with VExec.

Migration requests are pushed via VExec; the request includes the INSERT IGNORE query that persists the migration in _vt.schema_migrations. The tablet no longer reads from, nor writes to, Global Topo.

A new table is introduced: _vt.schema_migrations, which is how vttablet manages and tracks its own migrations.

vttablet will only run a single migration at a time.

vttablet will see if there are unhandled migration requests. It will queue them.

vttablet will make a migration ready if there's no running migration and no other migration is marked as ready.
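
The queue/ready/running rules above can be sketched as a small state machine. This is a minimal illustration under assumptions: the `Migration` type, the `tick` function and the `requested`/`complete` status names are hypothetical; only the rules themselves (queue unhandled requests, promote to ready when nothing is running or ready, run one migration at a time) come from the text.

```go
package main

import "fmt"

type Status string

const (
	StatusRequested Status = "requested" // not yet handled by the tablet
	StatusQueued    Status = "queued"
	StatusReady     Status = "ready"
	StatusRunning   Status = "running"
	StatusComplete  Status = "complete"
)

type Migration struct {
	UUID   string
	Status Status
}

// count returns how many migrations are in the given status.
func count(ms []*Migration, s Status) int {
	n := 0
	for _, m := range ms {
		if m.Status == s {
			n++
		}
	}
	return n
}

// tick performs one scheduling pass over the tablet's migrations.
func tick(ms []*Migration) {
	// 1. Queue every unhandled request.
	for _, m := range ms {
		if m.Status == StatusRequested {
			m.Status = StatusQueued
		}
	}
	// 2. Promote one queued migration, only if none is running or ready.
	if count(ms, StatusRunning) == 0 && count(ms, StatusReady) == 0 {
		for _, m := range ms {
			if m.Status == StatusQueued {
				m.Status = StatusReady
				break
			}
		}
	}
	// 3. Run a ready migration; at most one migration runs at a time.
	if count(ms, StatusRunning) == 0 {
		for _, m := range ms {
			if m.Status == StatusReady {
				m.Status = StatusRunning
				break
			}
		}
	}
}

func main() {
	a := &Migration{UUID: "a", Status: StatusRequested}
	b := &Migration{UUID: "b", Status: StatusRequested}
	ms := []*Migration{a, b}

	tick(ms)
	fmt.Println(a.Status, b.Status) // a runs, b waits in the queue

	a.Status = StatusComplete // pretend the first migration finished
	tick(ms)
	fmt.Println(a.Status, b.Status) // now b runs
}
```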

vttablet will run a ready migration. This is really the interesting part, with lots of goodies:

  • vttablet will evaluate the gh-ost ... command to run. It will obviously populate --alter=... --database=....
  • vttablet creates a temp directory where it generates a script to run gh-ost.
  • vttablet creates a hooks path and auto-generates hook files. The hooks will interact with vttablet
  • vttablet has an API endpoint by which the hooks can communicate gh-ost's status (started/running/success/failure) with vttablet.
  • vttablet provides gh-ost with --hooks-hint which is the migration's UUID.
  • vttablet automatically generates a gh-ost user on the MySQL server, with a random password. The password is never persisted and does not appear on ps. It is written to, and loaded from, an environment variable.
  • vttablet grants the proper privileges on the newly created account
  • vttablet will destroy the account once migration completes.
  • vitess repo includes a gh-ost binary. We require gh-ost from openark/gh-ost as opposed to github/gh-ost because we've had to make some special adjustments to gh-ost so as to support this flow. I do not have direct ownership of github/gh-ost and cannot enforce those changes upstream, though I have made the contribution requests upstream.
  • make build automatically appends gh-ost binary, compressed, to vttablet binary, via Ricebox.
  • vttablet, upon startup, auto extracts gh-ost binary into /tmp/vt-gh-ost. Please note that the user does not need to install gh-ost.
  • WIP: vttablet to report back the job as complete/failed. We look to use VExec. TBD.
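
To make a few of the bullets above concrete, here is a hedged Go sketch of building the gh-ost argument list and a one-off random password. It only uses gh-ost flags that exist in gh-ost (`--database`, `--table`, `--alter`, `--hooks-path`, `--hooks-hint`, `--execute`). The helper names, the example hooks path and the environment variable name `ONLINE_DDL_PASSWORD` are assumptions for illustration, not necessarily what this PR uses.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// randomPassword returns a random password of 2*n hex characters.
func randomPassword(n int) (string, error) {
	b := make([]byte, n)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

// ghostArgs builds the gh-ost command line for one migration. The password is
// deliberately not part of the arguments, so it cannot show up in ps output.
// Without --execute, gh-ost performs a dry run.
func ghostArgs(database, table, alter, hooksPath, uuid string, execute bool) []string {
	args := []string{
		"--database=" + database,
		"--table=" + table,
		"--alter=" + alter,
		"--hooks-path=" + hooksPath,
		"--hooks-hint=" + uuid, // hooks receive the migration UUID as their hint
	}
	if execute {
		args = append(args, "--execute")
	}
	return args
}

// passwordEnv returns the environment entry carrying the password to the
// wrapper script. The variable name is a hypothetical placeholder.
func passwordEnv(password string) string {
	return "ONLINE_DDL_PASSWORD=" + password
}

func main() {
	password, err := randomPassword(16)
	if err != nil {
		panic(err)
	}
	uuid := "211febfa-da2d-11ea-b490-f875a4d24e90"
	hooks := "/tmp/online-ddl/" + uuid + "/hooks" // illustrative path

	fmt.Println("dry run:", ghostArgs("commerce", "example", "modify id bigint not null", hooks, uuid, false))
	fmt.Println("execute:", ghostArgs("commerce", "example", "modify id bigint not null", hooks, uuid, true))
	fmt.Println("password length:", len(password), "env entry length:", len(passwordEnv(password)))
}
```

Running the dry run first and the `--execute` run second matches the two execution steps listed in the tracking breakdown.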

Tracking breakdown

  • New OnlineDDL struct, defines a migration request and its status
  • Parser supports ALTER WITH 'gh-ost' TABLE and ALTER WITH 'pt-osc' TABLE syntax
  • builder and analyzer to create an Online DDL plan (write to topo)
  • vtctl to skip "big changes" check when -online_schema_change is given
  • tablet_executor to submit an online DDL request to topo as opposed to running it on tablets
  • vtctld runs a daemon to monitor for, and review migration requests
  • vtctld evaluates which shards are affected
  • _vt.schema_migrations backend table to support migration automation (on each shard)
  • vttablet validates MySQL connection and variables
  • vttablet creates migration command
  • vttablet creates hooks
  • vttablet provides HTTP API for hooks to report their status back
  • vttablet creates gh-ost user with random password
  • vttablet destroys gh-ost user upon completion
  • gh-ost embedded in vttablet binary and auto-extracted by vttablet
  • vttablet runs a dry-run execution
  • vttablet runs a --execute (actual) execution
  • vttablet supports a Cancel request (not used yet) to abort migration
  • vttablet as a state machine to work through the migration steps
  • counters for gh-ost migration requests, successful and failed migrations
  • use of VExec to apply migrations onto tablets
  • use of VExec to control migrations (abort, retry)
  • consider flow for retries
  • identify a reparent operation that runs during a migration, probably auto-restart the migration
  • vttablet to heuristically check for available disk space
  • tracking, auditing of all migrations
  • getting gh-ost logs if necessary
  • what's the best way to suggest we want an online migration? Does current ALTER WITH 'gh-ost' TABLE... and ALTER WITH 'pt-osc' TABLE syntax make sense? Other?
  • For first iteration, migrations and Reshard operations should be mutually exclusive. Can't run both at the same time. Next iterations will remove this constraint.
  • throttle by replica
  • wait for replica to catch up with new credentials before starting the migration
  • Use vttablet throttler
  • pt-online-schema-change bundled inside vttablet binary
  • support pt-online-schema-change
  • define foreign key flags for pt-online-schema-change execution - user can define as runtime flags
  • cleanup online-ddl directory after success
  • control throttling
  • control termination (panic abort)
  • control termination (panic abort) even after vttablet itself crashes
  • pt-online-schema-change passwords are in cleartext. Can we avoid that?
  • vtctl ApplySchema use same WITH 'gh-ost' and WITH 'pt-osc' query hints as in vtgate.
  • support override of gh-ost and pt-online-schema-change paths
  • cleanup pt-osc triggers after migration failure
  • forcibly remove pt-osc triggers on migration cancellation (overlaps with previous bullet, but has stronger guarantee)
  • cleanup pt-osc triggers from stale/zombie pt-osc migration
  • vtctl OnlineDDL command for simple visibility and manipulation. See #6547 (comment)
  • end to end tests
  • populate artifacts column, suggesting which tables need to be cleaned up after migration

Quite likely more entries to be added.

Further reading, resources, acknowledgements

We're obviously using gh-ost. I use my own openark/gh-ost since I have no ownership of the original https://github.com/github/gh-ost. gh-ost was/is developed by GitHub 2016-2020.

pt-online-schema-change is part of the popular Percona Toolkit

The schema migration scheduling and tracking work is based on my previous work at GitHub. The implementation in this PR is new and rewritten, but based on concepts that have matured on my work on skeefree. Consider these resources:

Also:

  • An early presentation on gh-ost

Initial incarnation of this PR: planetscale#67; some useful comments on that PR.

Call for feedback

We're looking for community's feedback on the above suggestions/flow. Thank you for taking the time to read and respond!

@shlomi-noach
Contributor Author

Zero dependencies doesn't mean zero configuration. What's the throttle replication lag? I'm used to as low as 1s, but some setups cannot meet that constraint. How do we let the user configure that?

@derekperkins
Member

I'm very excited to see this type of automation. Solving problems everyone deals with in a Vitess-native way goes a long way towards driving mass adoption. At the same time, it's somewhat disappointing that it is using gh-ost. I get why it is, but given that gh-ost doesn't support foreign keys, and I know your personal views about them, that alienates us and many/most MySQL users who do use them.

@shlomi-noach
Contributor Author

I get why it is

👋 it's always best to be explicit. I'm not sure if your impression is that I'm fighting a religious war or am just too obsessed with my own creation 😄 , this isn't the case. FWIW pt-online-schema-change is also based on my own creation, so my bias is towards both.

pt-online-schema-change can potentially also be supported. I began with gh-ost as this is the tool I'm most comfortable with and have fluency in its code. My Perl foo skills are very low, and hacking the pt-online-schema-change hooks will be more difficult for me.

your personal views about them, that alienates us and many/most MySQL users who do use them.

I'm sorry to hear that, and apologize if I've alienated you in any way. I'm not sure my view on MySQL foreign keys should be alienating people and I'm dumbfounded that this is the case.

@derekperkins
Member

derekperkins commented Aug 10, 2020

Apologies, my comment wasn't meant to be a personal attack or to suggest that you have personally alienated me. Your tools are awesome, and my "I get why" comment was just me acknowledging that you wrote it and thus are able to move the quickest with it, not to mention that it probably has been the most requested integration.

As for FKs, I wasn't trying to say that you have alienated anyone personally with your views, just recognizing that I'm aware of them from prior posts. Given the history of gh-ost where you were working at specific companies that didn't use FKs, it makes total sense to not deal with the extra complexity that they bring to DB tooling. In Vitess by contrast, where one of the main areas of focus is full MySQL compatibility, we're trying to support the majority of workloads, many of which include FKs, so it'd be great to support them eventually, whether that is achieved via gh-ost, pt-osc, vreplication, or something else.

Again I'm sorry for coming across negatively. I've personally interacted with you building the Orchestrator integration with the Vitess helm charts and have always been impressed by your knowledge and willingness to help. I was super excited when I found out you were going to Planetscale. As I mentioned originally, I love that you are doing the work to add this level of automation, taking advantage of the control plane that doesn't exist in vanilla MySQL, and will really help to drive adoption of Vitess. I look forward to continued interaction with you and want you to know that I hold you in the highest regard.

@mattlord
Contributor

😍

@shlomi-noach
Contributor Author

@derekperkins Thank you for your kind message ❤️ and I also reflect that I may take some words to present differently than intended, as I'm not a native English speaker and I can mis-parse things. I also very much enjoyed working with you in our orchestrator/helm interactions.

Regarding foreign keys, there's two ways forward:

  • support pt-online-schema-change. This should be doable, though I expect not zero-dependency (the user will have to make sure to have Perl and dependent packages)
  • add support for foreign keys in gh-ost. This should also be doable. At the time I laid the plan for what a community contribution might look like.

@ameetkotian
Contributor

Thanks so much for this work! I am incredibly excited about the prospect of online schema change as first-class feature supported in Vitess.

We have been using gh-ost for Vitess schema changes for about ~18 months now. The largest keyspace has more than a thousand shards. We had to build a schema change service outside of Vitess to handle distributed schema changes. The same service uses pt-osc to execute schema changes for non-Vitess cluster. Here are some of the problems that we needed to solve to make gh-ost work for Vitess. It is not intended to be a requirements list but something that will help you find more datapoints for this RFC.

  • Handle operations like tablet replacement, primary failovers, shard splits, mysql downtime due to backups, deployments, and upgrades during ongoing schema changes
  • Tracking owners for schema changes i.e., who triggered the ALTER statement
  • Handling retries in case of failures
  • Ability to trigger gh-ost's builtin throttling
  • Ability to send out notifications on job success/failures.

Here are some of the things we are working on adding -

  • Integrate gh-ost test on replica feature
  • Integrate table consistency checks before cutover.
  • Safe execution of DROP table - workflow should rename the tables and execute the drop at a later point for ease of rollback.
  • Version controlled schemas
  • Integrating pt-archiver

My only concern with the current proposal is the overhead with using the topo server for co-ordination of schema changes.

@shlomi-noach
Contributor Author

shlomi-noach commented Aug 11, 2020

@ameetkotian addressing some of the bullet points:

Handle operations like tablet replacement, primary failovers, shard splits, mysql downtime due to backups, deployments, and upgrades during ongoing schema changes

Yes. As mentioned above, first iteration will not support concurrent migration+reshard operation, but that should be solved in future iterations. The current PR as it is still does not address the topic of resharding. With regard to failovers, again current PR does not address it, but the idea is that in the short term we will identify a failover and restart the migration. Possibly, and only where gh-ost is involved, we could finally work towards resurrection. I'd say that's in the long future.
I have no insight yet into backups/deployments/upgrades.

Tracking owners for schema changes i.e., who triggered the ALTER statement

I see that more as an external migration tracking/management system ownership. At least for now, the purpose of the PR is to provide the mechanics for online schema changes.

Handling retries in case of failures

Agreed

Ability to trigger gh-ost's builtin throttling

Agreed. I suspect VExec will make that a simple operation. VExec is a recent addition in vitess, at this time only used in vreplication, where the user can interact with internals of vitess using SQL statements; consider virtual tables like MySQL's INFORMATION_SCHEMA, only those are also updatable. More to come I hope.

Ability to send out notifications on job success/failures.

At this time I see this at a higher level than vitess.

Integrate table consistency checks before cutover.

I'd like to point you to this experimental PR, checksumming data on the fly. I haven't yet tested it in production.

Safe execution of DROP table - workflow should rename the tables and execute the drop at a later point for ease of rollback.

👍 This is on my agenda.

@shlomi-noach
Contributor Author

Possible syntax change:

  • ALTER WITH_GHOST TABLE... to use gh-ost
  • ALTER WITH_PT TABLE... to use pt-online-schema-change

@shlomi-noach
Contributor Author

Recent commit, c68d438, changes syntax to

  • ALTER WITH_GHOST TABLE..., which runs gh-ost
  • ALTER WITH_PT TABLE..., which does nothing at this time

and also breaks vtctl ApplySchema. The problem with ApplySchema is that if we say alter with_ghost table... then the pre-flight test fails, since with_ghost is not valid MySQL syntax. Need to look into that.

@shlomi-noach
Contributor Author

My only concern with the current proposal is the overhead with using the topo server for co-ordination of schema changes.

The WIP on VExec will mostly eliminate that. We will only write to global topo upon ALTER TABLE request. So one write per migration request, and then I believe another write once the migration is fully complete on all shards, and TBD what kind of write, if any, should one or more migrations fail.

@shlomi-noach
Contributor Author

I have a POC for pt-online-schema-change. Problems I have:

  • pt-osc doesn't report periodic status, ie. does not say "I'm healthy" throughout the migration. This is workable, but I feel blind.
  • user will have to make sure they have Perl::DBI and Perl::DBD::MySQL installed; so this isn't a zero dependency setup. Again, workable.
  • still perl-foo-ing the plugins

@shlomi-noach
Contributor Author

pt-online-schema-change now supported via ALTER WITH_PT TABLE ...

pt-online-schema-change seems to require credentials in cleartext: either on the command line, or in .cnf file; I can't seem to hide the password in an environment file.

@shlomi-noach shlomi-noach changed the title WIP: automated, scheduled, dependency free online DDL via gh-ost WIP: automated, scheduled, dependency free online DDL via gh-ost/pt-online-schema-chane Aug 11, 2020
@shlomi-noach shlomi-noach changed the title WIP: automated, scheduled, dependency free online DDL via gh-ost/pt-online-schema-chane WIP: automated, scheduled, dependency free online DDL via gh-ost/pt-online-schema-change Aug 11, 2020
@ajm188
Contributor

ajm188 commented Aug 11, 2020

Shlomi I am soooo excited about this!

I have no insight yet into backups/deployments/upgrades.

I can provide some details around this.

  • Backups: when running with the builtinbackupengine, vttablet will do a shutdown of mysqld to copy all the ibd files.
    • If this happens on the tablet gh-ost is streaming, then (I'd guess, but we could test), gh-ost will get into a bad state not being able to connect to the replica for up to several hours.
    • If it happens on a tablet that gh-ost is using for throttle-control-replicas (aside: is every replica in the shard going to be watched for replication lag?), then the alter will be throttled for the duration of the backup (again, several hours)
    • Suggestion, just my opinion 😄 : I think what I want here is that the vitess/gh-ost integration won't pick the backup tablet (if a backup is ongoing), and, the builtinbackup code won't pick the gh-ost tablet (if a schema change is ongoing), and will tell gh-ost to stop checking replication lag on the backup tablet
  • Deployments (vttablet): After restarting all replicas, eventually you need to restart the primary, which is safest to do by PlannedReparent-ing to another tablet in the shard and then restarting the former primary. We've seen problems with the _ghc table heartbeats being written to a replica cause errant transactions, so if that reparent happens while a gh-ost process is running, we run into trouble.
    • I don't have any good suggestions here, except maybe make PlannedReparent refuse to operate if a schema change is happening? That feels it could cause more problems than it solves, though.
  • Upgrades (I'm assuming Ameet meant mysql upgrades): same problem as deployments with respect to reparents, but also there's another concern I have about gh-ost losing connection to the replica it's streaming from for however long mysqld ends up being down during the upgrade.

@shlomi-noach
Contributor Author

The current implementation, by the way, is to run gh-ost directly on the master server via its tablet. I consider to keep it that way, and use the replicas only for throttling. Since vitess requires ROW binlog format in the first place, this should be a safe decision.

As for replicas taken down for backup or for other reasons, I wish to use freno as the all-knowing throttling service, and that’s in the mid-term run. In the short term, I still need to figure it out...

@derekperkins
Member

ALTER WITH_GHOST TABLE / ALTER WITH_PT TABLE

I like this syntax choice a lot, it's very readable. Are there any configuration options we might want to set in SQL? I'm not sure if it makes sense, but maybe these could be pseudo function calls, WITH_GHOST(a, b, c), though adding more parameters would be hard to keep backwards compatible. Maybe using comments?

Regarding foreign keys, there's two ways forward...

I'm glad that there's a viable path to support them down the road. For the reasons you've laid out in other comments, I'd prefer to see support in gh-ost eventually because it seems to be the superior tool, and maybe we'll be able to contribute towards it.

@shlomi-noach
Contributor Author

Are there any configuration options we might want to set in SQL?

Yeah, I suspect we'd need to support some config via SQL; in particular, I'm looking at what's an acceptable replication lag.
I'm still digging the parser, but I'd suspect the following format may work:

ALTER WITH_GHOST MIGRATION_LAG_SECONDS=1.0 TABLE ...

@shlomi-noach
Contributor Author

On the topic of handling failures:

  • A new VExec interface allows the user to retry a migration via SQL: something like vtctl VExec commerce.91b5c953-e1e2-11ea-a097-f875a4d24e90 "update _vt.schema_migrations set migration_status='retry'"
  • If vttablet itself fails during the migration, then it's possible that we never mark the migration as failed.
    • in the case of gh-ost, this is identifiable, because gh-ost sends keepalive updates via on-status hook. Thus, we can heuristically claim that if no sign of life was seen from a migration in past 10 minutes, the migration must be failed.
    • with pt-online-schema-change this is complicated, and I don't have an immediate solution yet; probably need to track the unix process ID. I'm open to suggestions.
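
The keepalive heuristic for gh-ost described above can be sketched as a simple staleness check. The 10 minute threshold is the one mentioned in the bullet; the names `staleThreshold` and `isStale` are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// staleThreshold is the "no sign of life" window mentioned above.
const staleThreshold = 10 * time.Minute

// isStale reports whether a running migration should be considered failed,
// given how long ago its last keepalive (on-status hook report) was seen.
func isStale(sinceLastKeepalive time.Duration) bool {
	return sinceLastKeepalive > staleThreshold
}

func main() {
	now := time.Now()
	lastSeen := now.Add(-12 * time.Minute) // pretend the last hook report was 12 minutes ago

	fmt.Println("stale after 12m:", isStale(now.Sub(lastSeen)))
	fmt.Println("stale after 3m: ", isStale(3*time.Minute))
}
```

A check like this only works for a tool that reports periodically, which is why pt-online-schema-change needs a different approach (for example tracking the process, as suggested above).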

@shlomi-noach
Contributor Author

@rohit-nayak-ps I've now pushed my changes to VExec. I don't have a separate PR just for the VExec changes, because I worked the refactor based on the changes I needed in schema_migrations and what I found in common with vreplication.

Notable changes:

  • obviously vexec.go, vexec_plan.go.
    • I introduced a vexecPlanner and inverted the query analysis logic (first we identify affected table, then we select the planner, which then analyzes the query)
    • the planner uses vexecPlannerParams, a set of hints to instruct the planner on how to analyze/refactor the query: name of workflow column in backend table; set of mutable or immutable columns in UPDATE queries, etc.
    • VExec is now much more stateful, and keeps query, parsed statement, refactored query, planner, etc.
  • Introducing new VExecRequest. It's a combination of workflow+keyspace+query. I think VReplicationExec can be made redundant.
  • Introducing new VExec() gRPC call, which takes a VExecRequest and returns a VExecResponse. Again, I think vreplication logic can use this.
  • on the tablet side, a VExec interceptor gets the query
    • and constructs a TabletVExec entity, which analyzes the query. It parses some useful information, like the set of columns changed in an UPDATE query, or columns with literal values on the WHERE clause. This entity is given to the final code (the "engine" or however it is referenced) that implements the vexec logic on the tablet side. It will use TabletVExec to make sanity checks, validate the query, and finally return a result set.
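
To illustrate the kind of hints described for vexecPlannerParams, here is a rough sketch of a check on the columns changed by an UPDATE. The struct, its field names and the validateUpdate method are hypothetical stand-ins for illustration only; they are not the types in vexec_plan.go, and the column name migration_uuid is an assumption.

```go
package main

import "fmt"

// plannerParams is a hypothetical stand-in for the planner hints described
// above. All field names are assumptions made for this sketch.
type plannerParams struct {
	workflowColumn   string          // column that identifies the workflow
	immutableColumns map[string]bool // columns an UPDATE may never change
	mutableColumns   map[string]bool // if non-empty, the only columns an UPDATE may change
}

// validateUpdate checks the columns changed by an UPDATE against the hints.
// It returns an error naming the first offending column, or nil.
func (p plannerParams) validateUpdate(updated []string) error {
	for _, col := range updated {
		if p.immutableColumns[col] {
			return fmt.Errorf("column %q is immutable", col)
		}
		if len(p.mutableColumns) > 0 && !p.mutableColumns[col] {
			return fmt.Errorf("column %q is not in the mutable set", col)
		}
	}
	return nil
}

func main() {
	params := plannerParams{
		workflowColumn: "migration_uuid", // assumed column name
		mutableColumns: map[string]bool{"migration_status": true},
	}
	fmt.Println(params.validateUpdate([]string{"migration_status"}))
	fmt.Println(params.validateUpdate([]string{"migration_uuid"}))
}
```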

@shlomi-noach
Contributor Author

re: pt-online-schema-change, I will add a replication-lag plugin that re-implements the original lag check and also reports to vttablet every minute.

@shlomi-noach
Contributor Author

Migration options now available:

alter with_ghost table my_table ... -- no options
alter with_ghost '--max-lag-millis=1500' table my_table ...
alter with_pt '--max-lag 1.5s --null-to-not-null' table my_table ...
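
As an illustration of how the three forms above break down into a strategy and a list of tool options, here is a small sketch. It uses a regular expression purely for demonstration; the PR itself handles this in the SQL parser, and parseStrategy is a hypothetical helper, not a function from the code base.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// alterWithRe matches "alter with_ghost|with_pt ['options'] table ...".
// This regexp is a stand-in for the real grammar and only covers the
// forms shown above.
var alterWithRe = regexp.MustCompile(`(?i)^\s*alter\s+with_(ghost|pt)(?:\s+'([^']*)')?\s+table\s`)

// parseStrategy extracts the strategy ("ghost" or "pt") and the
// whitespace-separated options from an ALTER statement.
// ok is false when the statement does not use the WITH_... syntax.
func parseStrategy(query string) (strategy string, options []string, ok bool) {
	m := alterWithRe.FindStringSubmatch(query)
	if m == nil {
		return "", nil, false
	}
	return strings.ToLower(m[1]), strings.Fields(m[2]), true
}

func main() {
	queries := []string{
		"alter with_ghost table my_table drop column c",
		"alter with_ghost '--max-lag-millis=1500' table my_table drop column c",
		"alter with_pt '--max-lag 1.5s --null-to-not-null' table my_table drop column c",
		"alter table my_table drop column c",
	}
	for _, q := range queries {
		strategy, options, ok := parseStrategy(q)
		fmt.Println(ok, strategy, options)
	}
}
```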

@shlomi-noach
Contributor Author

It's now possible to retry or cancel a migration.

  • retry is only possible if the migration is failed or cancelled
  • cancel is only possible if the migration is queued, ready or running (in the last case we interrupt the migration and cause it to fail)

syntax subject to change:

  • vtctl VExec commerce.5fe35da1-e2be-11ea-aa07-f875a4d24e90 "update _vt.schema_migrations set migration_status='retry' "
  • vtctl VExec commerce.5fe35da1-e2be-11ea-aa07-f875a4d24e90 "update _vt.schema_migrations set migration_status='cancel' "
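
The retry and cancel rules above amount to a small state check. A minimal sketch follows, assuming the statuses are stored as plain strings; the type, the constant names and the exact status values (including "ready" and "complete") are assumptions for illustration.

```go
package main

import "fmt"

// migrationStatus is a hypothetical type; the status strings below are
// assumptions and may not match the values stored in _vt.schema_migrations.
type migrationStatus string

const (
	statusQueued    migrationStatus = "queued"
	statusReady     migrationStatus = "ready"
	statusRunning   migrationStatus = "running"
	statusComplete  migrationStatus = "complete"
	statusFailed    migrationStatus = "failed"
	statusCancelled migrationStatus = "cancelled"
)

// canRetry reports whether a migration in status s may be retried.
func canRetry(s migrationStatus) bool {
	return s == statusFailed || s == statusCancelled
}

// canCancel reports whether a migration in status s may be cancelled.
// Cancelling a running migration interrupts it and causes it to fail.
func canCancel(s migrationStatus) bool {
	switch s {
	case statusQueued, statusReady, statusRunning:
		return true
	}
	return false
}

func main() {
	all := []migrationStatus{
		statusQueued, statusReady, statusRunning,
		statusComplete, statusFailed, statusCancelled,
	}
	for _, s := range all {
		fmt.Printf("%-10s retry=%v cancel=%v\n", s, canRetry(s), canCancel(s))
	}
}
```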

@shlomi-noach
Contributor Author

Suggestions from Andrew Mason in Vitess slack:

I'm thinking a simple solution would be to add two flags to vttablet, like:
-gh-ost-path
-pt-osc-path
Then if one of those isn't set, vttablet can use the baked-in binary for that strategy; otherwise it can locate the binary where I told it to look.

Another thought with respect to not passing -drop-old-table, is that gh-ost could use -force-table-names to include the migration UUID in the gh-ost table names
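
A rough sketch of the flag fallback suggested above, using Go's standard flag package. The flag names come from the suggestion; the bundled paths and the helper names are hypothetical and only serve the example.

```go
package main

import (
	"flag"
	"fmt"
)

// Hypothetical locations of the binaries shipped with vttablet.
const (
	bundledGhost = "/tmp/vt-gh-ost"
	bundledPtOsc = "/tmp/vt-pt-online-schema-change"
)

// resolveBinaryPath returns the operator-supplied path when set,
// and otherwise falls back to the bundled binary.
func resolveBinaryPath(flagValue, bundledPath string) string {
	if flagValue != "" {
		return flagValue
	}
	return bundledPath
}

// parsePaths parses the two suggested flags from args and resolves
// each one against its bundled default.
func parsePaths(args []string) (ghost, ptosc string, err error) {
	fs := flag.NewFlagSet("vttablet", flag.ContinueOnError)
	ghostFlag := fs.String("gh-ost-path", "", "override path to the gh-ost binary")
	ptoscFlag := fs.String("pt-osc-path", "", "override path to the pt-online-schema-change binary")
	if err := fs.Parse(args); err != nil {
		return "", "", err
	}
	return resolveBinaryPath(*ghostFlag, bundledGhost), resolveBinaryPath(*ptoscFlag, bundledPtOsc), nil
}

func main() {
	// Only gh-ost is overridden; pt-osc falls back to the bundled binary.
	ghost, ptosc, err := parsePaths([]string{"-gh-ost-path=/usr/local/bin/gh-ost"})
	fmt.Println(ghost, ptosc, err)
}
```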

…into online-ddl

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach
Contributor Author

Incorporated #6815, where the throttler is disabled by default.

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach
Contributor Author

The test TestBackupMysqlctld/TestMasterReplicaSameBackup (endtoend shard 21) keeps failing; vtctlclient exits with error status 1. I'm able to reproduce locally. Testing locally, I see:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0xd4017d]

goroutine 183 [running]:
vitess.io/vitess/go/vt/topo/etcd2topo.(*Server).Watch.func1(0xc000da7680, 0xc0010ee870, 0xc0010dcc30, 0xc0011c8c80, 0x11, 0xc0010e00d8, 0xc0010e00d0, 0x2733420, 0xc001270140, 0xc00003a7a0)
	/home/shlomi/dev/github/planetscale/vitess/go/vt/topo/etcd2topo/watch.go:93 +0x77d
created by vitess.io/vitess/go/vt/topo/etcd2topo.(*Server).Watch
	/home/shlomi/dev/github/planetscale/vitess/go/vt/topo/etcd2topo/watch.go:68 +0x5c7

The watched path is /zone1/SrvVSchema. I'm still unsure why this error happens on this PR and not on other PRs. Looking for things that could cause lock/timeouts; can't see an issue with Open()/Close() on tablet server; any topo entry I Lock I then unlock. Still digging.

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach
Contributor Author

Found it! endtoend (21) is now fixed in df502e9. The problem was with an unqualified query, called by initSchema(), called by Open(). I'm not sure why this would only fail in TestBackupMain/TestMasterReplicaSameBackup and not in TestBackupMain/TestReplicaBackup or TestBackupMain/TestRdonlyBackup or TestBackupMain/TestMasterBackup.

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach
Contributor Author

Upon migration completion (whether successful or failed), the online-ddl executor renames away the artifacts. This uses some logic from #6719:

  • An artifact table, if found (e.g. _mytable_old or _c07abcac_06cb_11eb_ac94_f875a4d24e90_20201005082948_del) is RENAMEd to e.g. _vt_PURGE_0b19830706ca11ebaf6bf875a4d24e90_20201005051720
  • When Managed DROP TABLE #6719 is merged, this will take the table through the garbage collection lifecycle
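
For illustration, the purge name in the example above looks like a UUID with its dashes removed followed by a yyyymmddHHMMSS timestamp. A sketch that builds such a name and the matching RENAME statement follows; the helper names are hypothetical, and both the dashed input form of the UUID and the use of UTC are assumptions, not taken from #6719.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// purgeTableName builds a name of the form _vt_PURGE_<uuid>_<timestamp>,
// where the UUID has its dashes removed and the timestamp is already
// formatted as yyyymmddHHMMSS.
func purgeTableName(uuid, timestamp string) string {
	compact := strings.ReplaceAll(uuid, "-", "")
	return fmt.Sprintf("_vt_PURGE_%s_%s", compact, timestamp)
}

// renameStatement returns the RENAME TABLE statement that moves an
// artifact table out of the way.
func renameStatement(from, to string) string {
	return fmt.Sprintf("RENAME TABLE `%s` TO `%s`", from, to)
}

func main() {
	// UTC is an assumption; the layout string is Go's reference time
	// written as yyyymmddHHMMSS.
	ts := time.Date(2020, 10, 5, 5, 17, 20, 0, time.UTC).Format("20060102150405")
	name := purgeTableName("0b198307-06ca-11eb-af6b-f875a4d24e90", ts)
	fmt.Println(renameStatement("_mytable_old", name))
}
```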

…endtoend tests

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach
Contributor Author

I'm ready to have this PR merged. It now supports throttling and table lifecycle.

I have not made changes to the ALTER syntax. See #6782; we can iterate in a followup PR.

@deepthi deepthi merged commit dd6ecae into vitessio:master Oct 5, 2020
@shlomi-noach shlomi-noach deleted the online-ddl branch October 6, 2020 06:05
@shlomi-noach
Contributor Author

OMG 🎉

@shlomi-noach
Contributor Author

Pointing out that the ALTER WITH... syntax is still subject to change.

@askdba askdba added this to the v8.0 milestone Oct 12, 2020
setassociative added a commit to tinyspeck/vitess that referenced this pull request Mar 15, 2021
This checks if a vtgate is currently filtering keyspaces before requesting the TopoServer. This is necessary because a TopoServer can't be accessed in those cases as the filtered Topo in those cases could make it unsafe to make writes since all reads would be returning a subset of the actual topo data.

The only use of the requested topoServer that I found was in the DDL handling path and was introduced in vitessio#6547.

This is deployed on dev but should get testing (endtoend or unit, unclear on best path atm) before going upstream.
setassociative added a commit to tinyspeck/vitess that referenced this pull request Mar 17, 2021
setassociative added a commit to tinyspeck/vitess that referenced this pull request Apr 15, 2021
# Conflicts:
#	go/vt/vtgate/vcursor_impl.go

Signed-off-by: Richard Bailey <rbailey@slack-corp.com>
deepthi pushed a commit to planetscale/vitess that referenced this pull request Apr 22, 2021
# Conflicts:
#	go/vt/vtgate/vcursor_impl.go

Signed-off-by: Richard Bailey <rbailey@slack-corp.com>
Labels
Type: Enhancement Logical improvement (somewhere between a bug and feature)
10 participants