
[Bug]: multiple concurrent deployments causing crashes #1067

Open
1 task done
oddgrd opened this issue Jul 3, 2023 · 11 comments
Labels
S-Accepted This will be worked on T-Bug Something isn't working

Comments

@oddgrd
Contributor

oddgrd commented Jul 3, 2023

What happened?

When deploying a service while a previous deployment is still loading/building, the new deployment and/or the project container may crash. This error presents itself in different ways and we haven't been able to exactly nail down the cause of each crash, but some common ones are a sqlite "database is locked" error for failed deploys and the project container going into an errored state which requires restarting the project.

This is a tracking issue for this bug, which we aim to resolve in the refactor on the feat/shuttle-runtime-scaling branch. There, we will split the building step into a separate service, which will also hash the source code of each deploy; if the next deploy's hash is identical, it will skip the build entirely.
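The deduplication step described above can be sketched in a few lines. `BuildCache` and `should_build` are hypothetical names for illustration only, and a real builder would hash the archived source tree rather than an in-memory byte slice:

```rust
// Minimal sketch of build deduplication by source hash.
// Assumption: `BuildCache`/`should_build` are illustrative names, not Shuttle's API.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

struct BuildCache {
    last_hash: Option<u64>,
}

impl BuildCache {
    fn new() -> Self {
        Self { last_hash: None }
    }

    /// Returns true if the source differs from the last deploy we saw,
    /// i.e. a build should actually be started.
    fn should_build(&mut self, source: &[u8]) -> bool {
        let mut hasher = DefaultHasher::new();
        source.hash(&mut hasher);
        let hash = hasher.finish();
        let changed = self.last_hash != Some(hash);
        self.last_hash = Some(hash);
        changed
    }
}

fn main() {
    let mut cache = BuildCache::new();
    assert!(cache.should_build(b"fn main() {}")); // first deploy: build
    assert!(!cache.should_build(b"fn main() {}")); // identical source: skip
    assert!(cache.should_build(b"fn main() { run() }")); // changed: build
    println!("ok");
}
```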

Version

v0.20.0

Which operating system(s) are you seeing the problem on?

In deployment

Which CPU architectures are you seeing the problem on?

In deployment

Relevant log output

No response

Duplicate declaration

  • I have searched the issues and there are none like this.
@oddgrd oddgrd added T-Bug Something isn't working A-deployer labels Jul 3, 2023
@that-ambuj

Hi, I want to work on this issue. Can I get more information about where and why this error is happening? Please also provide any telemetry logs that might be useful in debugging this issue. Thanks!

@oddgrd
Contributor Author

oddgrd commented Aug 3, 2023

Hey @that-ambuj, thank you for showing interest in this issue! So we believe the problem is that when a new deployment is started, the project container (deployer) will start building it. Then, if another is started before the first one is finished, they will both enter the building state, and they will both write log output to the deployer sqlite database concurrently. This can lead to the sqlite database locking up and the deployment crashing.
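One common way to rule out this class of lock-up is to serialize all writes through a single owner. The sketch below shows the pattern with std threads and an mpsc channel; a `Vec` stands in for the sqlite logs table, and `run_serialized` is a hypothetical helper, not deployer code:

```rust
// Sketch: funnel all log writes from concurrent deployments through one
// consumer thread, so only a single thread ever touches the "database".
use std::sync::mpsc;
use std::thread;

fn run_serialized(producers: usize, msgs_each: usize) -> usize {
    let (tx, rx) = mpsc::channel::<String>();

    // The single writer owns the store (a Vec here); in deployer this
    // would be the sqlite connection performing the INSERTs.
    let writer = thread::spawn(move || {
        let mut store: Vec<String> = Vec::new();
        for line in rx {
            store.push(line);
        }
        store.len()
    });

    // Each "deployment" produces its log lines concurrently.
    let handles: Vec<_> = (0..producers)
        .map(|id| {
            let tx = tx.clone();
            thread::spawn(move || {
                for i in 0..msgs_each {
                    tx.send(format!("deploy {id}: log line {i}")).unwrap();
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    drop(tx); // close the channel so the writer's loop terminates

    writer.join().unwrap()
}

fn main() {
    // Two concurrent "deployments" each emit 100 log lines; all 200
    // arrive through one writer, so no two writes race on the store.
    assert_eq!(run_serialized(2, 100), 200);
    println!("ok");
}
```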

As described in the issue we have a solution for this on a feature branch, so I'm not sure how worthwhile it is to resolve this on main currently (it could be quite complicated too). Some time may pass before we are ready to merge the fix on the feature branch however, so it's worth considering solving it now.

If we do decide it's best to wait on this issue, I'll try to find some other issues you might work on, if you're interested!

What are your thoughts on this @chesedo and @iulianbarbu?

@that-ambuj

So, is this issue happening because the same user is trying to deploy the same project in parallel (knowingly or not), or because multiple users are trying to deploy to Shuttle at the same time?

Basically, is the sqlite file stored on a user's local machine/docker container, or on Shuttle's servers?

@jonaro00
Member

jonaro00 commented Aug 5, 2023

@that-ambuj In this context, the sqlite db is local to the project container. deploy -> Ctrl-C -> deploy will make two deployments enter the building state at the same time.

@that-ambuj

that-ambuj commented Aug 5, 2023

Thanks! Understood. So if I want to work on this issue, I should use the feat/shuttle-runtime-scaling branch, right?

@iulianbarbu
Contributor

iulianbarbu commented Aug 7, 2023

> Hey @that-ambuj, thank you for showing interest in this issue! So we believe the problem is that when a new deployment is started, the project container (deployer) will start building it. Then, if another is started before the first one is finished, they will both enter the building state, and they will both write log output to the deployer sqlite database concurrently. This can lead to the sqlite database locking up and the deployment crashing.
>
> As described in the issue we have a solution for this on a feature branch, so I'm not sure how worthwhile it is to resolve this on main currently (it could be quite complicated too). Some time may pass before we are ready to merge the fix on the feature branch however, so it's worth considering solving it now.
>
> If we do decide it's best to wait on this issue, I'll try to find some other issues you might work on, if you're interested!
>
> What are your thoughts on this @chesedo and @iulianbarbu?

I think we'll have to solve the problem of multiple concurrent builds on the new builder system too. My thinking is that we can approach it in a few ways:

  1. For now: support only one build per project at any time. This means we should respond with an error whenever a deploy request comes in while a deployment is already in the building state. We should also add an option to deploy with force, which will cancel the current build/deployment and start a new one for the latest deploy request.
  2. For later, and questionable at the same time: support multiple concurrent builds/deployments for the same service. Each deployment would have a unique URL and be kept alive as long as the user wants, with the option of being shut down on request. This would definitely be helpful for deployment previews of a Shuttle-based service, a feature likely of interest to teams that collaborate on a codebase and need previews for testing purposes.
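Option 1 above amounts to a small state guard in front of the deploy endpoint. A minimal sketch, with `DeployState`, `DeployResponse`, and `handle_deploy` as illustrative names rather than Shuttle's actual API:

```rust
// Sketch of option 1: reject a deploy while a build is in flight,
// unless `force` is set, in which case the current build is cancelled
// (cancellation itself is elided here) and the new deploy is accepted.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DeployState {
    Idle,
    Building,
    Running,
}

#[derive(Debug, PartialEq)]
enum DeployResponse {
    Accepted,
    RejectedAlreadyBuilding,
    CancelledPreviousAndAccepted,
}

fn handle_deploy(state: &mut DeployState, force: bool) -> DeployResponse {
    match (*state, force) {
        // A build is already running and the user didn't force: refuse.
        (DeployState::Building, false) => DeployResponse::RejectedAlreadyBuilding,
        // Forced: the in-flight build would be cancelled, then we build anew.
        (DeployState::Building, true) => {
            *state = DeployState::Building;
            DeployResponse::CancelledPreviousAndAccepted
        }
        // No build in flight: accept and enter the building state.
        _ => {
            *state = DeployState::Building;
            DeployResponse::Accepted
        }
    }
}

fn main() {
    let mut state = DeployState::Idle;
    assert_eq!(handle_deploy(&mut state, false), DeployResponse::Accepted);
    assert_eq!(handle_deploy(&mut state, false), DeployResponse::RejectedAlreadyBuilding);
    assert_eq!(handle_deploy(&mut state, true), DeployResponse::CancelledPreviousAndAccepted);
    println!("ok");
}
```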

> So if I want to work on this issue, I should use the feat/shuttle-runtime-scaling branch, right?

Not really; I would say we need to start from the current codebase (main branch), at least for 1). We could start working on this, assuming my thoughts above don't miss anything.

@that-ambuj

that-ambuj commented Aug 7, 2023

Hey @iulianbarbu, thanks a lot for making this issue clearer for me. I have started looking for the cause of this bug; it would require changing a few lines at deployer/src/lib.rs:50.

// Make sure we don't stop the last running deploy. This works because they are returned in descending order.
let project_id = Ulid::from_string(args.project_id.as_str())
    .expect("to have a valid ULID as project_id arg");
for existing_deployment in runnable_deployments.into_iter().skip(1) {
    persistence
        .stop_running_deployment(existing_deployment)
        .await
        .unwrap();
}

Essentially, it would require checking the state of the last deployment and asking the user to take the necessary action. If my idea is correct, I think I should start working on implementing 1.

@jonaro00 jonaro00 added S-Accepted This will be worked on and removed A-deployer labels Aug 17, 2023
@that-ambuj

that-ambuj commented Aug 18, 2023

Hi @jonaro00 and @iulianbarbu, I have made some changes at https://github.com/that-ambuj/shuttle/tree/fix/concurrent-builds but the tests are a bit inconsistent.

They sometimes fail and most of the time pass; most notably, deployment::deploy_layer::tests::deployment_to_be_queued is failing. I'm not sure my changes are causing these tests to fail, because they fail in a similar fashion on the main branch. Also, during manual testing by deploying, the pre-deployment tests fail, but my guess is that's because of my changes. This makes it harder for me to reproduce and confirm that the issue is fixed.

Can you tell me more about what that test is supposed to do, so I can fix it?

@that-ambuj

I recently discovered a line in the codebase that is supposed to stop deployments running in the background that are in the Building, Loading, Queued, or Built state. The git blame goes back 7 months, so maybe this issue is unnecessary?

persistence.cleanup_invalid_states().await.unwrap();

pub async fn cleanup_invalid_states(&self) -> Result<()> {
    sqlx::query("UPDATE deployments SET state = ? WHERE state IN(?, ?, ?, ?)")
        .bind(State::Stopped)
        .bind(State::Queued)
        .bind(State::Built)
        .bind(State::Building)
        .bind(State::Loading)
        .execute(&self.pool)
        .await?;
    Ok(())
}

@jonaro00
Member

@that-ambuj That function is only called on deployer startup, and does not prevent multiple deployments from entering building.
@iulianbarbu Do you want the deployment to not enter the queue at all or make it wait in the queue? (If the former, there would be no need for a queue 😝)

@that-ambuj

@jonaro00 I want to achieve the former case to avoid locking the sqlite database for the time being, until we implement concurrent builds in the queue. For that reason we will still need the queue.

I think what's preventing concurrent access to the sqlite database is a misconfiguration of WAL.
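If a WAL misconfiguration is indeed the culprit, the usual sqlx-side remedy is to enable WAL and a busy timeout when opening the pool, so a contending writer waits on the lock instead of failing immediately with "database is locked". A sketch only, with an illustrative database path, not deployer's actual configuration:

```rust
// Sketch (assumed sqlx API, illustrative path): open the deployer pool
// with WAL journaling and a busy timeout.
use std::str::FromStr;
use std::time::Duration;

use sqlx::sqlite::{SqliteConnectOptions, SqliteJournalMode, SqlitePoolOptions};

async fn open_pool() -> Result<sqlx::SqlitePool, sqlx::Error> {
    let options = SqliteConnectOptions::from_str("sqlite:///opt/shuttle/deployer.sqlite")?
        // WAL: readers no longer block the single writer.
        .journal_mode(SqliteJournalMode::Wal)
        // Retry on a held write lock for up to 5s before erroring.
        .busy_timeout(Duration::from_secs(5));

    SqlitePoolOptions::new().connect_with(options).await
}
```

With these options a second writer retries for up to the timeout rather than surfacing the lock error straight away, though it still serializes writes rather than making them truly concurrent.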

@oddgrd oddgrd mentioned this issue Sep 14, 2023