Race condition when executing multiple copies of sqitch on the same DB #579

neilmayhew opened this issue Jun 22, 2021 · 5 comments

Our app is deployed in multiple identical containers to form a load-balanced cluster. Each container needs to ensure that DB migrations are run before the app starts up, so we do this in the entrypoint script. It's not possible to predict which container will run first, so they all do it. The sqitch documentation suggests that this is a safe thing to do because of the locking done by begin_work. However, it doesn't work properly.

I'm able to reproduce the problem in development by running sqitch deploy from two different terminals. If the sqitch in the first terminal has already started on the plan, the second one says

Change "App::Sqitch::Plan::Change=HASH(0x4f543f0)" has already been deployed

and exits, even though not all of the changes have been deployed yet. (And in production the app then starts up and fails immediately because it doesn't match the DB schema.)

Once I improved the message to show a meaningful ID I was able to see what's happening. The second sqitch figures out which changes remain to be deployed, and does sanity checking on them in check_deploy_dependencies. However, by this time the first sqitch has deployed another change and the second sqitch aborts because it's trying to deploy an already-deployed change.

In production we don't hit this all the time because we're typically deploying only a single new change to an existing DB, and the chance of overlap is low. However, in our testing environments we start with an empty DB and apply all the migrations since the beginning of time in a big block. This triggers the problem almost every time.

I was surprised that the second sqitch doesn't just block until the first one is finished deploying all the changes, but I see that the locking happens around each individual change and not the entire series. As a result, querying and deploying can end up interleaved. This seems like a design error to me, and I feel that sqitch should take the locks before it does anything at all, and not release them until it's ready to exit.

What do you think? I'm interested in submitting a PR for this, but I don't want to spend time on it if it's unlikely to be accepted.

theory commented Jun 27, 2021

Yeah, I think what happens is that it calls begin_work before each change, rather than once for all changes. I have to admit it never occurred to me that deployment would be automated across a number of services restarting. In my mind, the deployment of the database is separate from the services that use it, but obviously the database is not always compatible across versions of the services that use it. So the process is typically:

  1. Shut down services
  2. Update the database (once in one place)
  3. Upgrade the services
  4. Start the upgraded services

I think if you want the service restarts to run the deployments themselves, you'll want to create a lock to prevent them from stomping on each other. If you're using PostgreSQL, for example, you can use an advisory lock: call pg_advisory_lock() before calling Sqitch, and whichever instance of the service gets the lock first does the deployment. The others will wait, and when the lock is freed, each will acquire it in turn, run sqitch deploy, and find the database up-to-date.
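
A minimal sketch of that pattern for a container entrypoint, assuming a Postgres connection URI in $DB_URI and an arbitrary lock key (neither the variable nor the key is anything Sqitch itself defines):

    #!/bin/sh
    # Every container runs this. pg_advisory_lock() blocks until the lock is
    # free, so only one instance runs the deploy at a time; the lock is tied
    # to the psql session and is released automatically when psql exits.
    # 8675309 is an arbitrary application-chosen lock key.
    psql "$DB_URI" --set ON_ERROR_STOP=1 \
        -c 'SELECT pg_advisory_lock(8675309)' \
        -c '\! sqitch deploy'

If the winning instance dies while holding the lock, the lock is released when its connection drops, so a waiting instance can take over.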

I think you're right that this pattern could be extended to Sqitch itself. I imagine a lock_project() method or some such, implemented separately for each engine, and called at the start of work in the deploy, revert, rebase, and checkout commands. Each engine would have to implement the lock-or-wait in its own way, of course. It's probably a good idea to have in place for safety reasons, so yes, I would be happy to see it and likely to accept it.

But as a practice, I would also recommend developing patterns that reduce the risk.

neilmayhew commented Jul 30, 2021

Thanks for these thoughts. Unfortunately, shutting down all services doesn't work well for our situation, because we want to do rolling upgrades of the load-balanced services (i.e. blue-green deployment) to eliminate downtime, which is very important to the business. Most DB migrations can be done in a backwards-compatible way (e.g. adding a column) so the old services can coexist with the new, and in the great majority of service upgrades there aren't any DB migrations anyway.

For now, I've taken the path of using a DB server lock. I wrote a simple utility that runs a command as a subprocess after taking a DB lock whose name is given on the command line. I use this to run sqitch, and thus guarantee that only one copy is executing at a time. This is working well in practice, but of course it would be preferable to have sqitch handle this itself (perhaps selected by a command-line option).
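
For illustration only, a wrapper along those lines might look roughly like this; with-db-lock, $DB_URI, and the hashtext()-based key are hypothetical choices, not the actual utility:

    #!/bin/sh
    # with-db-lock: hypothetical wrapper; takes a server-side lock named on
    # the command line, then runs the given command while the lock is held.
    #   Usage: with-db-lock <lock-name> '<command ...>'
    set -eu
    lock_name=$1
    cmd=$2
    # hashtext() maps the name to an integer key for pg_advisory_lock(); the
    # lock belongs to this psql session, so it is held while the command runs
    # and released when psql exits. (A real utility would also propagate the
    # wrapped command's exit status, which \! does not.)
    psql "$DB_URI" --set ON_ERROR_STOP=1 \
        -c "SELECT pg_advisory_lock(hashtext('$lock_name'))" \
        -c "\\! $cmd"

The entrypoint would then invoke something like: with-db-lock sqitch 'sqitch deploy'.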

Since this is an acceptable solution for us we're choosing not to contribute a PR, but I strongly recommend that you keep in mind the need for automated, rolling deployments as you continue to develop sqitch.

@theory theory self-assigned this Aug 1, 2021
theory added a commit that referenced this issue Sep 5, 2021
Was showing a change object memory address, which is not helpful. Instead show
the change name with tags and its ID. Should be relatively rare, appearing
only when more than one Sqitch attempts to deploy multiple changes at the same
time. See #579 for details.
theory added a commit that referenced this issue Sep 5, 2021
Called at the start of deploy() and revert(). The default implementation does
nothing, but on MySQL and Postgres, it acquires an advisory lock not released
till the end of the session. This is to ensure that only one instance of
Sqitch is making changes at a time. Raises an error for the user to tell them
that another instance is at work. Not implemented for the other engines as I
could not quickly find a way to create session-level locks for them. See #579
for details.
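
For reference, the fail-fast, session-level locks described there correspond roughly to calls like the following (the key and lock name are placeholders, not necessarily what the commit uses; within Sqitch the statement runs on the same connection as the deploy, so the lock is held for the whole session rather than for a one-shot command):

    # Postgres: returns true if the lock was acquired, false if another
    # session already holds it
    psql "$DB_URI" -c 'SELECT pg_try_advisory_lock(8675309)'

    # MySQL: a zero timeout means "don't wait"; returns 1 if acquired,
    # 0 if another session holds it (connection options elided)
    mysql --execute "SELECT GET_LOCK('sqitch', 0)"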

theory commented Sep 5, 2021

Have a look at d9acbec, @neilmayhew. I think this will do the trick for MySQL and PostgreSQL, except for the fact that they raise an error when another instance is running. For your use case, however, I assume it'd make more sense to have Sqitch wait for the locks, yes? I'll have to think about how to show a message to say that it's waiting, and only throw an error on a timeout (on MySQL).
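
For what it's worth, both engines can express a bounded wait fairly directly; a rough sketch of the underlying calls (the lock name, key, and 60-second timeout here are arbitrary):

    # MySQL: GET_LOCK() takes a wait timeout in seconds;
    # returns 1 if acquired, 0 if it timed out
    mysql --execute "SELECT GET_LOCK('sqitch', 60)"

    # Postgres: pg_advisory_lock() waits indefinitely, but lock_timeout bounds
    # the wait; the statement errors out if the lock isn't acquired in time
    psql "$DB_URI" -c "SET lock_timeout = '60s'; SELECT pg_advisory_lock(8675309)"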

theory commented Sep 26, 2021

See also #591, which you'll want to use to set longer wait timeouts if you have long deploys.

@theory theory closed this as completed Sep 26, 2021

theory commented Sep 26, 2021

@neilmayhew would you be able to test these two PRs? If not, would you be able to test if I made a prerelease? Thanks!
