
Attempt at fixing sporadic failures of shuttle-deployer #980

Merged
7 commits merged into main on Jun 27, 2023

Conversation

Kazy
Member

@Kazy Kazy commented Jun 7, 2023

Description of change

The shuttle-deployer tests sometimes fail when unrelated changes are made. This shows up both in CI and when working locally (we hit it while working on the pagination).

It's quite hard to reproduce, but if we look at CircleCI errors (https://app.circleci.com/pipelines/github/shuttle-hq/shuttle/3134/workflows/c461d70f-616c-416a-a1e5-148ca2afa854/jobs/59077, https://app.circleci.com/pipelines/github/shuttle-hq/shuttle/3134/workflows/bd9ea5b5-c6bf-4262-bc58-2d82ce416341/jobs/59040, https://app.circleci.com/pipelines/github/shuttle-hq/shuttle/3109/workflows/aeefba11-ae65-48b7-b522-30b7931e32a9/jobs/58417, among others), we often get this error:

2023-06-05T08:35:40.737556Z ERROR builder:build_failed{error=InputOutput(Custom { kind: Other, error: "background task failed" }) id=572c7682-bc23-48c9-b4a8-97850da7938e state=Crashed}: shuttle_deployer::deployment::queue: service build encountered an error error=Internal I/O error: background task failed error.sources=[background task failed]

And right before it, this line:

2023-06-05T08:35:40.736962Z  INFO builder:handle{id=572c7682-bc23-48c9-b4a8-97850da7938e state=Building}: shuttle_deployer::deployment::queue: Moving built executable

In particular, right after the log line above, the code calls store_executable, which in turn calls tokio::fs::rename. This, combined with the error message (background task failed), led me to this issue. One comment mentions the following:

I eventually figured out this was a case of not keeping the top-level task waiting on the spawned tasks, resulting in the runtime shutting down and those tasks failing midway through (with the above error).

In the deployer, and more precisely in this code path, tasks are spawned without being awaited in two places (the fix is sketched after this list):

  • Once when we spawn Queued::handle (first commit) in queue::task, where we pull deployments to build and run.
  • Once when we spawn queue::task, which is done when building a DeploymentManager.
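The fix follows the pattern sketched below. This is only a minimal illustration, not the actual deployer code (the channel payload and the build body are placeholders): builds are spawned onto a tokio::task::JoinSet that the queue task drains before returning, so the runtime can't shut down while a build is still moving the executable.

use tokio::{sync::mpsc, task::JoinSet};

// Minimal stand-in for the real queue::task loop.
async fn queue_task(mut recv: mpsc::Receiver<u32>) {
    let mut tasks = JoinSet::new();

    while let Some(id) = recv.recv().await {
        // Spawn onto the set instead of using tokio::spawn, so the handle is kept.
        tasks.spawn(async move {
            // ... build the deployment, move the built executable, etc.
            println!("built deployment {id}");
        });
    }

    // Drain the set before returning so no build is dropped midway
    // through when the runtime shuts down.
    while let Some(res) = tasks.join_next().await {
        if let Err(err) = res {
            eprintln!("builder task failed: {err}");
        }
    }
}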

Another kind of bug shows up in the test suite (the panic output below is interleaved because several tests failed concurrently):

thread 'thread 'deployment::deploy_layer::tests::deployment_bind_panicdeployment::deploy_layer::tests::deployment_main_panic' panicked at '' panicked at 'states should go into 'Crashed' panicking in bind: [
    StateLog {
        id: 9e2ab92a-34d4-4439-835b-c911aab1e5ba,
        state: Queued,
    },
    StateLog {
        id: 9e2ab92a-34d4-4439-835b-c911aab1e5ba,
        state: Building,
    },
    StateLog {
        id: 9e2ab92a-34d4-4439-835b-c911aab1e5ba,
        state: Built,
    },
    StateLog {
        id: 9e2ab92a-34d4-4439-835b-c911aab1e5ba,
        state: Loading,
    },
]states should go into 'Crashed' when panicking in main: [
    StateLog {
        id: 7c030971-9f73-4ea2-9a4f-2f8c44194c2d,
        state: Queued,
    },
    StateLog {
        id: 7c030971-9f73-4ea2-9a4f-2f8c44194c2d,
        state: Building,
    },
    StateLog {
        id: 7c030971-9f73-4ea2-9a4f-2f8c44194c2d,
        state: Built,
    },
    StateLog {
        id: 7c030971-9f73-4ea2-9a4f-2f8c44194c2d,
        state: Loading,
    },
]', ', thread 'deployment::deploy_layer::tests::deployment_to_be_queueddeployer/src/deployment/deploy_layer.rs' panicked at ':states should go into 'Running' for a valid service: [
    StateLog {
        id: eca96eb6-c63d-42d3-a18e-c584f2b4309c,
        state: Queued,
    },
    StateLog {
        id: eca96eb6-c63d-42d3-a18e-c584f2b4309c,
        state: Building,
    },
    StateLog {
        id: eca96eb6-c63d-42d3-a18e-c584f2b4309c,
        state: Built,
    },
    StateLog {
        id: eca96eb6-c63d-42d3-a18e-c584f2b4309c,
        state: Loading,
    },
]deployer/src/deployment/deploy_layer.rs', deployer/src/deployment/deploy_layer.rs:754:797::1717

Sometimes it was just a timeout where the deployment seemed to have stalled; sometimes it actually resulted in a crash (state: Crashed) instead. This was particularly painful to debug because the test would hang for several minutes even when it was already clear that the states had diverged. I've added a fix so the test returns early as soon as we know the states can no longer match (sketched below).
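The early return boils down to something like the following sketch. It only illustrates the idea; the channel type, timeout, and state representation are placeholders, not the actual test helper: compare the states received so far against the expected sequence and stop waiting as soon as they diverge.

use std::time::Duration;

// Illustrative only: compare received states against the expected sequence
// and stop waiting as soon as they can no longer match.
async fn wait_for_states(
    mut recorder: tokio::sync::mpsc::Receiver<&'static str>,
    expected: &[&str],
) -> Result<(), String> {
    let mut seen = Vec::new();
    loop {
        match tokio::time::timeout(Duration::from_secs(120), recorder.recv()).await {
            Ok(Some(state)) => {
                seen.push(state);
                // Abort early if the prefix already diverged instead of
                // hanging until the timeout fires.
                if !expected.starts_with(&seen) {
                    return Err(format!("states diverged: {seen:?} vs {expected:?}"));
                }
                if seen.len() == expected.len() {
                    return Ok(());
                }
            }
            Ok(None) => return Err("state channel closed early".into()),
            Err(_) => return Err(format!("timed out, saw {seen:?}")),
        }
    }
}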

I also got this kind of error:

thread 'tokio-runtime-worker' panicked at 'message from tonic stream: Status { code: Unknown, message: "error reading a body from connection: stream closed because of a broken pipe", source: Some(hyper::Error(Body, Error { kind: Io(Custom { kind: BrokenPipe, error: "stream closed because of a broken pipe" }) })) }', deployer/src/deployment/run.rs:386:49

I assume that, similarly to the first issue, we weren't properly waiting on the returned JoinHandle, causing the stream to be closed while a task was still running; this time it happened in the deployer's run.rs. Awaiting those tasks fixed both the tonic stream error and the deployments whose states hung or crashed.
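As an illustration of that fix (the function names here are placeholders, not the actual run.rs code), the change is essentially to keep the JoinHandle returned by tokio::spawn and await it instead of dropping it:

use tokio::task::JoinHandle;

// Placeholder for a task that keeps reading from a gRPC stream.
fn spawn_stream_reader() -> JoinHandle<()> {
    tokio::spawn(async {
        // ... read messages from the tonic stream ...
    })
}

async fn run() {
    let reader = spawn_stream_reader();

    // ... do the rest of the run-phase work ...

    // Await the handle instead of detaching it, so the stream isn't
    // torn down while the task is still running.
    if let Err(err) = reader.await {
        eprintln!("stream reader task failed: {err}");
    }
}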

How has this been tested? (if applicable)

I managed to sometimes (though not often) reproduce the errors locally, which I no longer seem able to do with these changes. I ran the following command, which before my fixes exhibited some of the issues within a few iterations:

export RUST_BACKTRACE=full
for i in {0..20}; do
    rm -rf deployer/**/target
    rm -rf /run/user/1000/shuttle_run_test*
    if ! cargo test -p shuttle-deployer -- --nocapture; then
        break
    fi
    sleep 5
done

Now I want to try with CircleCI :)

@Kazy Kazy force-pushed the attempt-fix-ci-deployer branch from f6fd6cc to 5107821 on June 7, 2023 at 14:47
@Kazy Kazy changed the title Attempt at fixing spurious failures of shuttle-deployer Attempt at fixing sporadic failures of shuttle-deployer Jun 7, 2023
@oddgrd oddgrd added the B-Shuttle Batch, B-M, and A-deployer labels on Jun 7, 2023
Contributor

@iulianbarbu iulianbarbu left a comment


Related to the comment you've found:

I eventually figured out this was a case of not keeping the top-level task waiting on the spawned tasks, resulting in the runtime shutting down and those tasks failing midway through (with the above error).

This is something we can definitely do from a completeness point of view, and it's a great catch. Just a few questions left; let me know if you can clarify them.

storage_manager,
}
(
set,
Contributor

@iulianbarbu iulianbarbu Jun 23, 2023


Can we attach this to the deployment manager?

Member Author


If we attach it to the deployment manager, then we have to wrap it in an Arc and a Mutex. I tried to avoid that (for no particular reason actually, now that I think about it), but it might be better to wrap it and keep the JoinSet with the manager, roughly like the sketch below.
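Roughly, that would look like this (the field and method names are just illustrative, not the actual struct):

use std::sync::Arc;
use tokio::{sync::Mutex, task::JoinSet};

// Rough shape only; the real DeploymentManager has more fields.
struct DeploymentManager {
    // Shared handle to the spawned tasks so the manager can await
    // (or abort) them on shutdown.
    task_set: Arc<Mutex<JoinSet<()>>>,
}

impl DeploymentManager {
    async fn wait_for_tasks(&self) {
        let mut set = self.task_set.lock().await;
        // Drain the set, ignoring individual task results here.
        while set.join_next().await.is_some() {}
    }
}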

deployer/src/deployment/queue.rs (review thread, outdated, resolved)
deployer/src/deployment/deploy_layer.rs (review thread, resolved)
Contributor

@chesedo chesedo left a comment


Thanks for this update @Kazy! And sorry for the delay; we wanted to doubly confirm the impact of the JoinSet, and I'm happy with the tests I've run. Just two questions, but my tests have shown that we might not have to worry about them anyway.

@@ -52,7 +55,7 @@ pub async fn task(
     let storage_manager = storage_manager.clone();
     let queue_client = queue_client.clone();
 
-    tokio::spawn(async move {
+    tasks.spawn(async move {
Contributor


The while loop this spawn is in has a 'static lifetime. So I'm wondering if an overflow will eventually happen if tasks are only inserted into the set and are never waited for?

Member Author

@Kazy Kazy Jun 23, 2023


Very good point, I think this can happen, yes. If the queue never ends (e.g. in the case of a long-running process), the task set will never be awaited, and memory will keep growing with each deployment.

I wonder how this could be fixed. I see two solutions; there may be simpler ones:

  1. Spawn a thread responsible for running join_next, plus a signal sent once we're done with the queue to tell that thread that the next time join_next returns None (i.e. the set is empty), it must return.
  2. Instead of a while loop on recv.recv(), use a select! between recv.recv(), tasks.join_next(), and an else branch for when both return None.

I think the second one is the cleanest; a rough sketch is below. Let me know what you think!
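A rough sketch of option 2 (the channel payload and task body are placeholders, not the real queue code):

use tokio::{sync::mpsc, task::JoinSet};

async fn task(mut recv: mpsc::Receiver<u32>) {
    let mut tasks: JoinSet<()> = JoinSet::new();

    loop {
        tokio::select! {
            // New work: spawn it onto the set so the handle is kept.
            Some(id) = recv.recv() => {
                tasks.spawn(async move {
                    // ... handle the queued deployment ...
                    let _ = id;
                });
            }
            // Reap finished tasks as we go, so the set doesn't grow forever.
            Some(res) = tasks.join_next() => {
                if let Err(err) = res {
                    eprintln!("task failed: {err}");
                }
            }
            // Both recv() and join_next() returned None: the channel is
            // closed and the set is empty, so we're done.
            else => break,
        }
    }
}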

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, we would use something like option 2. However, this doesn't seem to crash anything in our stress testing, so we are thinking of merging it without the extra handling for now. There is also a hard limit on the container memory, which would most probably crash the user's deployer but not affect the rest of the system. We will also move away from this one-deployer-per-user-project architecture soon, so it is not critical to address this.

@@ -83,15 +88,15 @@ pub async fn task(
     };
     let runtime_manager = runtime_manager.clone();
 
-    tokio::spawn(async move {
+    set.spawn(async move {
Contributor


The same goes for this while loop, which also has a 'static lifetime.

Contributor

@iulianbarbu iulianbarbu left a comment


The outstanding commits are not critical; we can merge this as is.


@iulianbarbu
Contributor

@Kazy, if you get some spare time to address this, let us know: #980 (comment).

@Kazy Kazy requested review from iulianbarbu and chesedo June 26, 2023 12:14
@Kazy
Member Author

Kazy commented Jun 26, 2023

@iulianbarbu this has been addressed :)

@oddgrd oddgrd merged commit 9aef803 into shuttle-hq:main Jun 27, 2023
28 of 29 checks passed
@Kazy Kazy deleted the attempt-fix-ci-deployer branch June 27, 2023 09:45
AlphaKeks pushed a commit to AlphaKeks/shuttle that referenced this pull request Jul 21, 2023
…#980)

* feat(deployer): use joinset to await builder tasks on shutdown

* feat(deployer): use joinset for the DeploymentManager as well

* test(deployment): when testing tests, abort early when different

* test(deployment): use test_states in deployment_to_be_queued

Instead of relying on a one second sleep.

* fix(deployment): properly await spawned tasks

* ref(deployer): move join set of DeploymentManager into struct

* fix(deployer): use tokio::select! to await tasks set in deploy/run queue