[Bug]: multiple concurrent deployments causing crashes #1067
Comments
Hi, I want to work on this issue. Can I get more information about where and why this error is happening? Please also provide any telemetry logs that might be useful in debugging this issue. Thanks!
Hey @that-ambuj, thank you for showing interest in this issue! We believe the problem is that when a new deployment is started, the project container (deployer) starts building it. If another deployment is started before the first one has finished, both enter the building state, and both write log output to the deployer's sqlite database concurrently. This can cause the sqlite database to lock up and the deployment to crash.

As described in the issue, we have a solution for this on a feature branch, so I'm not sure how worthwhile it is to resolve this on main currently (it could also be quite complicated). However, some time may pass before we are ready to merge the fix on the feature branch, so it's worth considering solving it now. If we decide it's best to wait on this issue, I'll try to find some other issues you might work on, if you're interested! What are your thoughts, @chesedo and @iulianbarbu?
So, is this issue happening because the same user is trying to deploy the same project in parallel (with or without knowing it), or is it because multiple users are trying to deploy to shuttle at the same time? Basically, is the sqlite file stored on a user's local machine/docker container or on shuttle's servers?
@that-ambuj In this context, the sqlite db is local in the project container. |
Thanks! Understood. So if I want to work on this issue, I should use the feat/shuttle-runtime-scaling branch, right?
I think we'll have to solve the problem of multiple concurrent builds on the new builder system too. My thinking is that we can approach it in a few ways:
Not really; I would say we need to start off from the current codebase (main).
Hey @iulianbarbu, thanks a lot for making this issue clearer for me. I have started looking for the cause of this bug; it would require changing a few lines in deployer/src/lib.rs (lines 51 to 59 in 609411c).
Essentially, it would require checking the state of the last deployment and asking the user to take the necessary action. If my idea is correct, I think I should start working on implementing option 1.
Hi @jonaro00 and @iulianbarbu, I have made some changes at https://github.com/that-ambuj/shuttle/tree/fix/concurrent-builds, but the tests are a bit inconsistent: they mostly pass but sometimes fail, one test in particular. Can you tell me more about what that test is supposed to do, so I can fix it?
I recently discovered a line in the codebase that is supposed to stop deployments running in the background that are in a certain state (line 46 in 38f42bd):

shuttle/deployer/src/persistence/mod.rs, lines 307 to 318 in 38f42bd
@that-ambuj That function is only called on deployer startup, and does not prevent multiple deployments from entering building. |
@jonaro00 I want to achieve the former case, to avoid locking the sqlite database for the time being until we implement concurrent builds in the queue. For that reason we will need the queue. I think what allows the concurrent-access failures on the sqlite database is a misconfiguration of WAL (write-ahead logging).
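For reference, these are the two pragmas usually involved in "database is locked" problems; where exactly they would be applied in the deployer's connection setup is an assumption, so treat this as a sketch:

```sql
-- WAL lets readers proceed while one writer holds the lock; the journal mode
-- is persistent in the database file, so it only needs to be set once.
PRAGMA journal_mode = WAL;

-- busy_timeout is per-connection: every connection the deployer opens must
-- set it, or a second writer fails immediately instead of waiting its turn.
PRAGMA busy_timeout = 5000; -- milliseconds
```

Note that even with WAL, sqlite allows only one writer at a time; WAL removes reader/writer contention, while the busy timeout makes a second writer wait rather than error out.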
What happened?
When deploying a service while a previous deployment is still loading/building, the new deployment and/or the project container may crash. This error presents itself in different ways, and we haven't been able to nail down the exact cause of each crash, but common symptoms are a sqlite "database is locked" error on failed deploys and the project container entering an errored state that requires restarting the project.
This is a tracking issue for this bug, which we aim to resolve in our refactor on the feat/shuttle-runtime-scaling branch. We will split the building step into a separate service, which will also hash the source code of each deploy; if the next deploy's source is identical, it will simply skip building it.
Version
v0.20.0
Which operating system(s) are you seeing the problem on?
In deployment
Which CPU architectures are you seeing the problem on?
In deployment
Relevant log output
No response