Race condition when executing multiple copies of sqitch on the same DB #579

neilmayhew opened this issue Jun 22, 2021 · 5 comments

Our app is deployed in multiple identical containers to form a load-balanced cluster. Each container needs to ensure that DB migrations are run before the app starts up, so we do this in the entrypoint script. It's not possible to predict which container will run first, so they all do it. The sqitch documentation suggests that this is a safe thing to do because of the locking done by begin_work. However, it doesn't work properly.

I'm able to reproduce the problem in development by running sqitch deploy from two different terminals. If the sqitch in the first terminal has already started on the plan, the second one says

Change "App::Sqitch::Plan::Change=HASH(0x4f543f0)" has already been deployed

and exits, even though not all of the changes have been deployed yet. (And in production the app then starts up and fails immediately because it doesn't match the DB schema.)

Once I improved the message to show a meaningful ID I was able to see what's happening. The second sqitch figures out which changes remain to be deployed, and does sanity checking on them in check_deploy_dependencies. However, by this time the first sqitch has deployed another change and the second sqitch aborts because it's trying to deploy an already-deployed change.

In production we don't hit this all the time because we're typically deploying only a single new change to an existing DB, and the chance of overlap is low. However, in our testing environments we start with an empty DB and apply all the migrations since the beginning of time in a big block. This triggers the problem almost every time.

I was surprised that the second sqitch doesn't just block until the first one is finished deploying all the changes, but I see that the locking happens around each individual change and not the entire series. As a result, querying and deploying can end up interleaved. This seems like a design error to me, and I feel that sqitch should take the locks before it does anything at all, and not release them until it's ready to exit.

What do you think? I'm interested in submitting a PR for this, but I don't want to spend time on it if it's unlikely to be accepted.

theory commented Jun 27, 2021

Yeah, I think what happens is that it calls begin_work before each change, rather than once for all changes. I have to admit it never occurred to me that deployment would be automated across a number of services restarting. In my mind, the deployment of the database is separate from the services that use it, but obviously the database is not always compatible across versions of the services that use it. So the process is typically:

  1. Shut down services
  2. Update the database (once in one place)
  3. Upgrade the services
  4. Start the upgraded services

I think if you want the service restarts to run the deployments themselves, you'll want to create a lock to prevent them from stomping on each other. If you're using PostgreSQL, for example, you can use an advisory lock: call pg_advisory_lock() before calling Sqitch, and whichever instance of the service gets the lock first does the deployment. The others will wait, and when the lock is freed, each will acquire it in turn, run sqitch deploy, and find the database up-to-date.
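
A minimal sketch of that pattern for a container entrypoint, assuming a Postgres connection URI in $DB_URI and an arbitrary lock key (neither the variable nor the key is anything Sqitch itself defines):

    #!/bin/sh
    # Every container runs this. pg_advisory_lock() blocks until the lock is
    # free, so only one instance runs the deploy at a time; the lock is tied
    # to the psql session and is released automatically when psql exits.
    # 8675309 is an arbitrary application-chosen lock key.
    psql "$DB_URI" --set ON_ERROR_STOP=1 \
        -c 'SELECT pg_advisory_lock(8675309)' \
        -c '\! sqitch deploy'

If the winning instance dies while holding the lock, the lock is released when its connection drops, so a waiting instance can take over.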

I think you're right that this pattern could be extended to Sqitch itself. I imagine a lock_project() method or some such, implemented separately for each engine, and called at the start of work in the deploy, revert, rebase, and checkout commands. Each engine would have to implement the lock-or-wait in its own way, of course. It's probably a good idea to have in place for safety reasons, so yes, I would be happy to see it and likely to accept it.

But as a practice, I would also recommend developing patterns that reduce the risk.

neilmayhew commented Jul 30, 2021

Thanks for these thoughts. Unfortunately, shutting down all services doesn't work well for our situation, because we want to do rolling upgrades of the load-balanced services (i.e. blue-green deployment) to eliminate downtime, which is very important to the business. Most DB migrations can be done in a backwards-compatible way (e.g. adding a column) so the old services can coexist with the new, and in the great majority of service upgrades there aren't any DB migrations anyway.

For now, I've taken the path of using a DB server lock. I wrote a simple utility that runs a command as a subprocess after taking a DB lock whose name is given on the command line. I use this to run sqitch, and thus guarantee that only one copy is executing at a time. This is working well in practice, but of course it would be preferable to have sqitch handle this itself (perhaps selected by a command-line option).
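
For illustration only, a wrapper along those lines might look roughly like this; with-db-lock, $DB_URI, and the hashtext()-based key are hypothetical choices, not the actual utility:

    #!/bin/sh
    # with-db-lock: hypothetical wrapper; takes a server-side lock named on
    # the command line, then runs the given command while the lock is held.
    #   Usage: with-db-lock <lock-name> '<command ...>'
    set -eu
    lock_name=$1
    cmd=$2
    # hashtext() maps the name to an integer key for pg_advisory_lock(); the
    # lock belongs to this psql session, so it is held while the command runs
    # and released when psql exits. (A real utility would also propagate the
    # wrapped command's exit status, which \! does not.)
    psql "$DB_URI" --set ON_ERROR_STOP=1 \
        -c "SELECT pg_advisory_lock(hashtext('$lock_name'))" \
        -c "\\! $cmd"

The entrypoint would then invoke something like: with-db-lock sqitch 'sqitch deploy'.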

Since this is an acceptable solution for us we're choosing not to contribute a PR, but I strongly recommend that you keep in mind the need for automated, rolling deployments as you continue to develop sqitch.

@theory theory self-assigned this Aug 1, 2021
theory added a commit that referenced this issue Sep 5, 2021
Was showing a change object memory address, which is not helpful. Instead show
the change name with tags and its ID. Should be relatively rare, appearing
only when more than one Sqitch attempts to deploy multiple changes at the same
time. See #579 for details.
theory added a commit that referenced this issue Sep 5, 2021
Called at the start of deploy() and revert(). The default implementation does
nothing, but on MySQL and Postgres, it acquires an advisory lock not released
till the end of the session. This is to ensure that only one instance of
Sqitch is making changes at a time. Raises an error for the user to tell them
that another instance is at work. Not implemented for the other engines as I
could not quickly find a way to create session-level locks for them. See #579
for details.
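
For reference, the fail-fast, session-level locks described there correspond roughly to calls like the following (the key and lock name are placeholders, not necessarily what the commit uses; within Sqitch the statement runs on the same connection as the deploy, so the lock is held for the whole session rather than for a one-shot command):

    # Postgres: returns true if the lock was acquired, false if another
    # session already holds it
    psql "$DB_URI" -c 'SELECT pg_try_advisory_lock(8675309)'

    # MySQL: a zero timeout means "don't wait"; returns 1 if acquired,
    # 0 if another session holds it (connection options elided)
    mysql --execute "SELECT GET_LOCK('sqitch', 0)"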

theory commented Sep 5, 2021

Have a look at d9acbec, @neilmayhew. I think this will do the trick for MySQL and PostgreSQL, except for the fact that they raise an error when another instance is running. For your use case, however, I assume it'd make more sense to have Sqitch wait for the locks, yes? I'll have to think about how to show a message to say that it's waiting, and only throw an error on a timeout (on MySQL).
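
For what it's worth, both engines can express a bounded wait fairly directly; a rough sketch of the underlying calls (the lock name, key, and 60-second timeout here are arbitrary):

    # MySQL: GET_LOCK() takes a wait timeout in seconds;
    # returns 1 if acquired, 0 if it timed out
    mysql --execute "SELECT GET_LOCK('sqitch', 60)"

    # Postgres: pg_advisory_lock() waits indefinitely, but lock_timeout bounds
    # the wait; the statement errors out if the lock isn't acquired in time
    psql "$DB_URI" -c "SET lock_timeout = '60s'; SELECT pg_advisory_lock(8675309)"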

theory commented Sep 26, 2021

See also #591, which you'll want to use to set longer wait timeouts if you have long deploys.

@theory theory closed this as completed Sep 26, 2021

theory commented Sep 26, 2021

@neilmayhew would you be able to test these two PRs? If not, would you be able to test if I made a prerelease? Thanks!
