
[Bug]: multiple concurrent deployments causing crashes #1067

Open
1 task done
oddgrd opened this issue Jul 3, 2023 · 11 comments
Labels
S-Accepted This will be worked on T-Bug Something isn't working

Comments

@oddgrd
Contributor

oddgrd commented Jul 3, 2023

What happened?

When deploying a service while a previous deployment is still loading/building, the new deployment and/or the project container may crash. This error presents itself in different ways and we haven't been able to exactly nail down the cause of each crash, but some common ones are a sqlite "database is locked" error for failed deploys and the project container going into an errored state which requires restarting the project.

This is a tracking issue for this bug, which we aim to resolve in the refactor on the feat/shuttle-runtime-scaling branch. There, we will split the building step into a separate service, which will also hash the source code of each deploy; if the next deploy's hash is identical, it will skip the build entirely.
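The deduplication step described above can be sketched in a few lines. `BuildCache` and `should_build` are hypothetical names for illustration only, and a real builder would hash the archived source tree rather than an in-memory byte slice:

```rust
// Minimal sketch of build deduplication by source hash.
// Assumption: `BuildCache`/`should_build` are illustrative names, not Shuttle's API.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

struct BuildCache {
    last_hash: Option<u64>,
}

impl BuildCache {
    fn new() -> Self {
        Self { last_hash: None }
    }

    /// Returns true if the source differs from the last deploy we saw,
    /// i.e. a build should actually be started.
    fn should_build(&mut self, source: &[u8]) -> bool {
        let mut hasher = DefaultHasher::new();
        source.hash(&mut hasher);
        let hash = hasher.finish();
        let changed = self.last_hash != Some(hash);
        self.last_hash = Some(hash);
        changed
    }
}

fn main() {
    let mut cache = BuildCache::new();
    assert!(cache.should_build(b"fn main() {}")); // first deploy: build
    assert!(!cache.should_build(b"fn main() {}")); // identical source: skip
    assert!(cache.should_build(b"fn main() { run() }")); // changed: build
    println!("ok");
}
```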

Version

v0.20.0

Which operating system(s) are you seeing the problem on?

In deployment

Which CPU architectures are you seeing the problem on?

In deployment

Relevant log output

No response

Duplicate declaration

  • I have searched the issues and there are none like this.
@oddgrd oddgrd added T-Bug Something isn't working A-deployer labels Jul 3, 2023
@that-ambuj

Hi, I want to work on this issue. Can I get more information about where and why this error is happening? Please also provide any telemetry logs that might be useful in debugging this issue. Thanks!

@oddgrd
Contributor Author

oddgrd commented Aug 3, 2023

Hey @that-ambuj, thank you for showing interest in this issue! So we believe the problem is that when a new deployment is started, the project container (deployer) will start building it. Then, if another is started before the first one is finished, they will both enter the building state, and they will both write log output to the deployer sqlite database concurrently. This can lead to the sqlite database locking up and the deployment crashing.
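One common way to rule out this class of lock-up is to serialize all writes through a single owner. The sketch below shows the pattern with std threads and an mpsc channel; a `Vec` stands in for the sqlite logs table, and `run_serialized` is a hypothetical helper, not deployer code:

```rust
// Sketch: funnel all log writes from concurrent deployments through one
// consumer thread, so only a single thread ever touches the "database".
use std::sync::mpsc;
use std::thread;

fn run_serialized(producers: usize, msgs_each: usize) -> usize {
    let (tx, rx) = mpsc::channel::<String>();

    // The single writer owns the store (a Vec here); in deployer this
    // would be the sqlite connection performing the INSERTs.
    let writer = thread::spawn(move || {
        let mut store: Vec<String> = Vec::new();
        for line in rx {
            store.push(line);
        }
        store.len()
    });

    // Each "deployment" produces its log lines concurrently.
    let handles: Vec<_> = (0..producers)
        .map(|id| {
            let tx = tx.clone();
            thread::spawn(move || {
                for i in 0..msgs_each {
                    tx.send(format!("deploy {id}: log line {i}")).unwrap();
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    drop(tx); // close the channel so the writer's loop terminates

    writer.join().unwrap()
}

fn main() {
    // Two concurrent "deployments" each emit 100 log lines; all 200
    // arrive through one writer, so no two writes race on the store.
    assert_eq!(run_serialized(2, 100), 200);
    println!("ok");
}
```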

As described in the issue we have a solution for this on a feature branch, so I'm not sure how worthwhile it is to resolve this on main currently (it could be quite complicated too). Some time may pass before we are ready to merge the fix on the feature branch however, so it's worth considering solving it now.

If we do decide it's best to wait on this issue, I'll try to find some other issues you might work on, if you're interested!

What are your thoughts on this @chesedo and @iulianbarbu?

@that-ambuj

So, is this issue happening because the same user is trying to deploy the same project in parallel (knowingly or not), or because multiple users are trying to deploy to Shuttle at the same time?

Basically, is the sqlite file stored on a user's local machine/docker container, or on Shuttle's servers?

@jonaro00
Member

jonaro00 commented Aug 5, 2023

@that-ambuj In this context, the sqlite db is local to the project container. deploy -> Ctrl-C -> deploy will make two deployments enter the building state at the same time.

@that-ambuj

that-ambuj commented Aug 5, 2023

Thanks! Understood. So if I want to work on this issue, I should use the feat/shuttle-runtime-scaling branch, right?

@iulianbarbu
Contributor

iulianbarbu commented Aug 7, 2023

> Hey @that-ambuj, thank you for showing interest in this issue! So we believe the problem is that when a new deployment is started, the project container (deployer) will start building it. Then, if another is started before the first one is finished, they will both enter the building state, and they will both write log output to the deployer sqlite database concurrently. This can lead to the sqlite database locking up and the deployment crashing.
>
> As described in the issue we have a solution for this on a feature branch, so I'm not sure how worthwhile it is to resolve this on main currently (it could be quite complicated too). Some time may pass before we are ready to merge the fix on the feature branch however, so it's worth considering solving it now.
>
> If we do decide it's best to wait on this issue, I'll try to find some other issues you might work on, if you're interested!
>
> What are your thoughts on this @chesedo and @iulianbarbu?

I think we'll have to solve the problem of multiple concurrent builds on the new builder system too. My thinking is that we can approach it in a few ways:

  1. For now: support only one build per project at any time. This means we should respond with an error whenever a deploy request comes in while a deployment is already in the building state. We should also add an option to deploy with force, which will cancel the current build/deployment and start a new one for the latest deploy request.
  2. For later, and questionable at the same time: support multiple concurrent builds/deployments for the same service. Each deployment would have a unique URL and be kept alive as long as the user wants, with the option of being shut down on request. This would definitely be helpful for deployment previews of a Shuttle-based service, a feature likely of interest to teams that collaborate on a codebase and need previews for testing purposes.
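Option 1 above amounts to a small state guard in front of the deploy endpoint. A minimal sketch, with `DeployState`, `DeployResponse`, and `handle_deploy` as illustrative names rather than Shuttle's actual API:

```rust
// Sketch of option 1: reject a deploy while a build is in flight,
// unless `force` is set, in which case the current build is cancelled
// (cancellation itself is elided here) and the new deploy is accepted.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DeployState {
    Idle,
    Building,
    Running,
}

#[derive(Debug, PartialEq)]
enum DeployResponse {
    Accepted,
    RejectedAlreadyBuilding,
    CancelledPreviousAndAccepted,
}

fn handle_deploy(state: &mut DeployState, force: bool) -> DeployResponse {
    match (*state, force) {
        // A build is already running and the user didn't force: refuse.
        (DeployState::Building, false) => DeployResponse::RejectedAlreadyBuilding,
        // Forced: the in-flight build would be cancelled, then we build anew.
        (DeployState::Building, true) => {
            *state = DeployState::Building;
            DeployResponse::CancelledPreviousAndAccepted
        }
        // No build in flight: accept and enter the building state.
        _ => {
            *state = DeployState::Building;
            DeployResponse::Accepted
        }
    }
}

fn main() {
    let mut state = DeployState::Idle;
    assert_eq!(handle_deploy(&mut state, false), DeployResponse::Accepted);
    assert_eq!(handle_deploy(&mut state, false), DeployResponse::RejectedAlreadyBuilding);
    assert_eq!(handle_deploy(&mut state, true), DeployResponse::CancelledPreviousAndAccepted);
    println!("ok");
}
```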

> So if I want to work on this issue, I should use the feat/shuttle-runtime-scaling branch, right?

Not really; I would say we need to start from the current codebase (main branch), at least for 1). We could start working on this, assuming my thoughts above don't miss anything.

@that-ambuj

that-ambuj commented Aug 7, 2023

Hey @iulianbarbu, thanks a lot for making this issue clearer for me. I have started looking for the cause of this bug; it would require changing a few lines at deployer/src/lib.rs:50.

// Make sure we don't stop the last running deploy. This works because they are returned in descending order.
let project_id = Ulid::from_string(args.project_id.as_str())
    .expect("to have a valid ULID as project_id arg");
for existing_deployment in runnable_deployments.into_iter().skip(1) {
    persistence
        .stop_running_deployment(existing_deployment)
        .await
        .unwrap();
}

Essentially, it would require checking the state of the last deployment and asking the user to take the necessary action. If my idea is correct, I think I should start working on implementing 1.

@jonaro00 jonaro00 added S-Accepted This will be worked on and removed A-deployer labels Aug 17, 2023
@that-ambuj

that-ambuj commented Aug 18, 2023

Hi @jonaro00 and @iulianbarbu, I have made some changes at https://github.com/that-ambuj/shuttle/tree/fix/concurrent-builds but the tests are a bit inconsistent.

They sometimes fail and most of the time pass; most notably, deployment::deploy_layer::tests::deployment_to_be_queued is failing. I'm not sure my changes are causing these tests to fail, because they fail in a similar fashion on the main branch. Also, during manual testing by deploying, the pre-deployment tests fail, but my guess is that's because of my changes. This makes it harder for me to reproduce and confirm that the issue is fixed.

Can you tell me more about what that test is supposed to do, so I can fix it?

@that-ambuj

I recently discovered a line in the codebase that is supposed to stop deployments running in the background that are in the Building, Loading, Queued, or Built state. The git blame goes back 7 months, so maybe this issue is unnecessary?

persistence.cleanup_invalid_states().await.unwrap();

pub async fn cleanup_invalid_states(&self) -> Result<()> {
    sqlx::query("UPDATE deployments SET state = ? WHERE state IN(?, ?, ?, ?)")
        .bind(State::Stopped)
        .bind(State::Queued)
        .bind(State::Built)
        .bind(State::Building)
        .bind(State::Loading)
        .execute(&self.pool)
        .await?;
    Ok(())
}

@jonaro00
Member

@that-ambuj That function is only called on deployer startup, and does not prevent multiple deployments from entering building.
@iulianbarbu Do you want the deployment to not enter the queue at all or make it wait in the queue? (If the former, there would be no need for a queue 😝)

@that-ambuj

@jonaro00 I want to achieve the former case to avoid locking the sqlite database for the time being, until we implement concurrent builds in the queue. For that reason we will still need the queue.

I think what's preventing concurrent access to the sqlite database is a misconfiguration of WAL.
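If a WAL misconfiguration is indeed the culprit, the usual sqlx-side remedy is to enable WAL and a busy timeout when opening the pool, so a contending writer waits on the lock instead of failing immediately with "database is locked". A sketch only, with an illustrative database path, not deployer's actual configuration:

```rust
// Sketch (assumed sqlx API, illustrative path): open the deployer pool
// with WAL journaling and a busy timeout.
use std::str::FromStr;
use std::time::Duration;

use sqlx::sqlite::{SqliteConnectOptions, SqliteJournalMode, SqlitePoolOptions};

async fn open_pool() -> Result<sqlx::SqlitePool, sqlx::Error> {
    let options = SqliteConnectOptions::from_str("sqlite:///opt/shuttle/deployer.sqlite")?
        // WAL: readers no longer block the single writer.
        .journal_mode(SqliteJournalMode::Wal)
        // Retry on a held write lock for up to 5s before erroring.
        .busy_timeout(Duration::from_secs(5));

    SqlitePoolOptions::new().connect_with(options).await
}
```

With these options a second writer retries for up to the timeout rather than surfacing the lock error straight away, though it still serializes writes rather than making them truly concurrent.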

@oddgrd oddgrd mentioned this issue Sep 14, 2023