Race condition when executing multiple copies of sqitch on the same DB #579
Yeah, I think what happens is that it calls

I think if you want to have the service restarts run the deployments themselves on restart, you might want to create a lock to prevent them from stomping on each other. If you're using PostgreSQL, for example, you can use an advisory lock, so whichever instance of the service gets the lock first does the deployment. If you use

I think you're right that this pattern could be extended to Sqitch itself. I imagine a

But as a practice, I would also recommend developing patterns that reduce the risk.
Thanks for these thoughts. Unfortunately, shutting down all services doesn't work well for our situation, because we want to do rolling upgrades of the load-balanced services (i.e. blue-green deployment) to eliminate downtime (which is very important to the business). Most DB migrations can be done in a backwards-compatible way (e.g. adding a column) so the old services can coexist with the new, and in the great majority of service upgrades there aren't any DB migrations anyway.

For now, I've taken the path of using a DB server lock. I wrote a simple utility that runs a command as a subprocess after taking a lock named on the command line. I use this to run sqitch, and thus guarantee that only one copy is executing at a time. This is working well in practice, but of course it would be preferable to have sqitch handle this itself (perhaps selected by a command-line option).

Since this is an acceptable solution for us we're choosing not to contribute a PR, but I strongly recommend that you keep in mind the need for automated, rolling deployments as you continue to develop sqitch.
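A minimal sketch of that kind of lock-and-run wrapper, assuming PostgreSQL and psycopg2; the lock-name hashing, connection string, and function names are illustrative rather than the actual utility described above:

```python
#!/usr/bin/env python3
"""Hypothetical sketch: hold a named advisory lock while running a command.

Assumes PostgreSQL and psycopg2; not the actual utility described above.
"""
import subprocess
import sys
import zlib

import psycopg2


def run_with_lock(dsn, lock_name, command):
    # Derive a 32-bit integer key from the lock name, since
    # pg_advisory_lock() takes integer keys rather than strings.
    key = zlib.crc32(lock_name.encode("utf-8"))
    conn = psycopg2.connect(dsn)
    conn.autocommit = True  # avoid holding an idle transaction open while the command runs
    try:
        with conn.cursor() as cur:
            # Blocks until the session-level lock is available, so only one
            # container runs the command at a time.
            cur.execute("SELECT pg_advisory_lock(%s)", (key,))
            return subprocess.run(command).returncode
    finally:
        # Closing the connection ends the session and releases the lock.
        conn.close()


if __name__ == "__main__":
    # Usage: with_lock.py <lock-name> <command> [args...]
    sys.exit(run_with_lock("dbname=app", sys.argv[1], sys.argv[2:]))
```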
Was showing a change object memory address, which is not helpful. Instead show the change name with tags and its ID. Should be relatively rare, appearing only when more than one Sqitch attempts to deploy multiple changes at the same time. See #579 for details.
Called at the start of deploy() and revert(). The default implementation does nothing, but on MySQL and Postgres it acquires an advisory lock that is not released until the end of the session. This ensures that only one instance of Sqitch is making changes at a time. Raises an error to tell the user that another instance is at work. Not implemented for the other engines, as I could not quickly find a way to create session-level locks for them. See #579 for details.
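For illustration only (the real implementation lives in the Perl engine classes), the session-level try-locks those two engines expose look roughly like this; the lock key, lock name, and error text are placeholders, not Sqitch's actual values:

```python
# Illustrative sketch of engine-specific, session-level try-locks; the key,
# lock name, and error message are placeholders, not Sqitch's actual values.

def try_lock_postgres(cur):
    # pg_try_advisory_lock() returns immediately: true if this session got the
    # lock, false if another session already holds it. The lock persists until
    # released or the session ends.
    cur.execute("SELECT pg_try_advisory_lock(75474063)")
    return cur.fetchone()[0]

def try_lock_mysql(cur):
    # GET_LOCK(name, 0) returns 1 if acquired, 0 on timeout; the lock is held
    # until released or the session ends.
    cur.execute("SELECT GET_LOCK('sqitch-deploy', 0)")
    return cur.fetchone()[0] == 1

def ensure_lock(cur, try_lock):
    if not try_lock(cur):
        raise RuntimeError("Another instance of Sqitch is working on this database")
```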
Have a look at d9acbec, @neilmayhew. I think this will do the trick for MySQL and PostgreSQL, except for the fact that they raise an error when another instance is running. For your use case, however, I assume it'd make more sense to have Sqitch wait for the locks, yes? I'll have to think about how to show a message to say that it's waiting, and only throw an error on a timeout (on MySQL).
See also #591, which you'll want to use to set longer wait timeouts if you have long deploys.
@neilmayhew would you be able to test these two PRs? If not, would you be able to test a prerelease if I made one? Thanks!
Our app is deployed in multiple identical containers to form a load-balanced cluster. Each container needs to ensure that DB migrations are run before the app starts up, so we do this in the entrypoint script. It's not possible to predict which container will run first, so they all do it. The sqitch documentation suggests that this is a safe thing to do because of the locking done by `begin_work`. However, it doesn't work properly.

I'm able to reproduce the problem in development by running `sqitch deploy` from two different terminals. If the sqitch in the first terminal has already started on the plan, the second one prints an error and exits, even though not all of the changes have been deployed yet. (And in production the app then starts up and fails immediately because it doesn't match the DB schema.)
Once I improved the message to show a meaningful ID I was able to see what's happening. The second sqitch figures out which changes remain to be deployed, and does sanity checking on them in `check_deploy_dependencies`. However, by this time the first sqitch has deployed another change, and the second sqitch aborts because it's trying to deploy an already-deployed change.

In production we don't hit this all the time because we're typically deploying only a single new change to an existing DB, and the chance of overlap is low. However, in our testing environments we start with an empty DB and apply all the migrations since the beginning of time in one big block. This triggers the problem almost every time.
I was surprised that the second sqitch doesn't just block until the first one is finished deploying all the changes, but I see that the locking happens around each individual change and not the entire series. As a result, querying and deploying can end up interleaved. This seems like a design error to me, and I feel that sqitch should take the locks before it does anything at all, and not release them until it's ready to exit.
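In rough pseudocode (Python; the engine object, its lock() helper, and deploy_change() are hypothetical names, not Sqitch's actual API), the difference between the two designs is just the scope of the lock:

```python
# Hypothetical sketch of the two locking designs discussed above; `engine`,
# `lock()`, and `deploy_change()` are illustrative, not Sqitch's actual API.

def deploy_with_per_change_lock(engine, changes):
    # Current behavior (simplified): the lock is held only around each
    # individual change, so two instances can interleave planning and
    # deploying and trip over each other's work.
    for change in changes:
        with engine.lock():
            engine.deploy_change(change)

def deploy_with_whole_run_lock(engine, changes):
    # Proposed behavior: take the lock before doing anything at all and hold
    # it for the whole run, so a second instance blocks (or fails fast)
    # before it even plans which changes remain.
    with engine.lock():
        for change in changes:
            engine.deploy_change(change)
```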
What do you think? I'm interested in submitting a PR for this, but I don't want to spend time on it if it's unlikely to be accepted.