
Attempt at fixing sporadic failures of shuttle-deployer #980

Merged
7 commits merged into main on Jun 27, 2023

Conversation

Kazy
Member

@Kazy Kazy commented Jun 7, 2023

Description of change

The shuttle-deployer tests sometimes fail when unrelated changes are made. This shows up both in CI and when working locally (we hit it while working on the pagination).

It's quite hard to reproduce, but if we look at CircleCI errors (https://app.circleci.com/pipelines/github/shuttle-hq/shuttle/3134/workflows/c461d70f-616c-416a-a1e5-148ca2afa854/jobs/59077, https://app.circleci.com/pipelines/github/shuttle-hq/shuttle/3134/workflows/bd9ea5b5-c6bf-4262-bc58-2d82ce416341/jobs/59040, https://app.circleci.com/pipelines/github/shuttle-hq/shuttle/3109/workflows/aeefba11-ae65-48b7-b522-30b7931e32a9/jobs/58417, among others), we often get this error:

2023-06-05T08:35:40.737556Z ERROR builder:build_failed{error=InputOutput(Custom { kind: Other, error: "background task failed" }) id=572c7682-bc23-48c9-b4a8-97850da7938e state=Crashed}: shuttle_deployer::deployment::queue: service build encountered an error error=Internal I/O error: background task failed error.sources=[background task failed]

And right before it, this line:

2023-06-05T08:35:40.736962Z  INFO builder:handle{id=572c7682-bc23-48c9-b4a8-97850da7938e state=Building}: shuttle_deployer::deployment::queue: Moving built executable

In particular, right after the log line above, the code calls store_executable, which in turn calls tokio::fs::rename. This, combined with the error message (background task failed), led me to this issue. One comment mentions the following:

I eventually figured out this was a case of not keeping the top-level task waiting on the spawned tasks, resulting in the runtime shutting down and those tasks failing midway through (with the above error).

In the deployer, and more precisely in this code path, tasks are spawned without being awaited in two places (the fix is sketched after this list):

  • Once when we spawn Queued::handle (first commit) in queue::task, where we pull deployments to build and run.
  • Once when we spawn queue::task, which is done when building a DeploymentManager.
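The fix follows the pattern sketched below. This is only a minimal illustration, not the actual deployer code (the channel payload and the build body are placeholders): builds are spawned onto a tokio::task::JoinSet that the queue task drains before returning, so the runtime can't shut down while a build is still moving the executable.

use tokio::{sync::mpsc, task::JoinSet};

// Minimal stand-in for the real queue::task loop.
async fn queue_task(mut recv: mpsc::Receiver<u32>) {
    let mut tasks = JoinSet::new();

    while let Some(id) = recv.recv().await {
        // Spawn onto the set instead of using tokio::spawn, so the handle is kept.
        tasks.spawn(async move {
            // ... build the deployment, move the built executable, etc.
            println!("built deployment {id}");
        });
    }

    // Drain the set before returning so no build is dropped midway
    // through when the runtime shuts down.
    while let Some(res) = tasks.join_next().await {
        if let Err(err) = res {
            eprintln!("builder task failed: {err}");
        }
    }
}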

Another kind of bug shows up in the test suite (the panic output below is interleaved because several tests failed concurrently):

thread 'thread 'deployment::deploy_layer::tests::deployment_bind_panicdeployment::deploy_layer::tests::deployment_main_panic' panicked at '' panicked at 'states should go into 'Crashed' panicking in bind: [
    StateLog {
        id: 9e2ab92a-34d4-4439-835b-c911aab1e5ba,
        state: Queued,
    },
    StateLog {
        id: 9e2ab92a-34d4-4439-835b-c911aab1e5ba,
        state: Building,
    },
    StateLog {
        id: 9e2ab92a-34d4-4439-835b-c911aab1e5ba,
        state: Built,
    },
    StateLog {
        id: 9e2ab92a-34d4-4439-835b-c911aab1e5ba,
        state: Loading,
    },
]states should go into 'Crashed' when panicking in main: [
    StateLog {
        id: 7c030971-9f73-4ea2-9a4f-2f8c44194c2d,
        state: Queued,
    },
    StateLog {
        id: 7c030971-9f73-4ea2-9a4f-2f8c44194c2d,
        state: Building,
    },
    StateLog {
        id: 7c030971-9f73-4ea2-9a4f-2f8c44194c2d,
        state: Built,
    },
    StateLog {
        id: 7c030971-9f73-4ea2-9a4f-2f8c44194c2d,
        state: Loading,
    },
]', ', thread 'deployment::deploy_layer::tests::deployment_to_be_queueddeployer/src/deployment/deploy_layer.rs' panicked at ':states should go into 'Running' for a valid service: [
    StateLog {
        id: eca96eb6-c63d-42d3-a18e-c584f2b4309c,
        state: Queued,
    },
    StateLog {
        id: eca96eb6-c63d-42d3-a18e-c584f2b4309c,
        state: Building,
    },
    StateLog {
        id: eca96eb6-c63d-42d3-a18e-c584f2b4309c,
        state: Built,
    },
    StateLog {
        id: eca96eb6-c63d-42d3-a18e-c584f2b4309c,
        state: Loading,
    },
]deployer/src/deployment/deploy_layer.rs', deployer/src/deployment/deploy_layer.rs:754:797::1717

Sometimes it was just a timeout where the deployment seemed to have stalled; sometimes it actually resulted in a crash (state: Crashed) instead. This was particularly painful to debug because the test would hang for several minutes even when it was already clear that the states had diverged. I've added a fix so the test returns early as soon as we know the states can no longer match (sketched below).
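The early return boils down to something like the following sketch. It only illustrates the idea; the channel type, timeout, and state representation are placeholders, not the actual test helper: compare the states received so far against the expected sequence and stop waiting as soon as they diverge.

use std::time::Duration;

// Illustrative only: compare received states against the expected sequence
// and stop waiting as soon as they can no longer match.
async fn wait_for_states(
    mut recorder: tokio::sync::mpsc::Receiver<&'static str>,
    expected: &[&str],
) -> Result<(), String> {
    let mut seen = Vec::new();
    loop {
        match tokio::time::timeout(Duration::from_secs(120), recorder.recv()).await {
            Ok(Some(state)) => {
                seen.push(state);
                // Abort early if the prefix already diverged instead of
                // hanging until the timeout fires.
                if !expected.starts_with(&seen) {
                    return Err(format!("states diverged: {seen:?} vs {expected:?}"));
                }
                if seen.len() == expected.len() {
                    return Ok(());
                }
            }
            Ok(None) => return Err("state channel closed early".into()),
            Err(_) => return Err(format!("timed out, saw {seen:?}")),
        }
    }
}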

I also got this kind of error:

thread 'tokio-runtime-worker' panicked at 'message from tonic stream: Status { code: Unknown, message: "error reading a body from connection: stream closed because of a broken pipe", source: Some(hyper::Error(Body, Error { kind: Io(Custom { kind: BrokenPipe, error: "stream closed because of a broken pipe" }) })) }', deployer/src/deployment/run.rs:386:49

I assume that, similarly to the first issue, we weren't properly waiting on the returned JoinHandle, causing the stream to be closed while a task was still running; this time it happened in the deployer's run.rs. Awaiting those tasks fixed both the tonic stream error and the deployments whose states hung or crashed.
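As an illustration of that fix (the function names here are placeholders, not the actual run.rs code), the change is essentially to keep the JoinHandle returned by tokio::spawn and await it instead of dropping it:

use tokio::task::JoinHandle;

// Placeholder for a task that keeps reading from a gRPC stream.
fn spawn_stream_reader() -> JoinHandle<()> {
    tokio::spawn(async {
        // ... read messages from the tonic stream ...
    })
}

async fn run() {
    let reader = spawn_stream_reader();

    // ... do the rest of the run-phase work ...

    // Await the handle instead of detaching it, so the stream isn't
    // torn down while the task is still running.
    if let Err(err) = reader.await {
        eprintln!("stream reader task failed: {err}");
    }
}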

How has this been tested? (if applicable)

I managed to sometimes (though not often) reproduce the errors locally, which I no longer seem able to do with these changes. I ran the following command, which before my fixes exhibited some of the issues within a few iterations:

export RUST_BACKTRACE=full
for i in {0..20}; do
    rm -rf deployer/**/target
    rm -rf /run/user/1000/shuttle_run_test*
    if ! cargo test -p shuttle-deployer -- --nocapture; then
        break
    fi
    sleep 5
done

Now I want to try with CircleCI :)

@Kazy Kazy force-pushed the attempt-fix-ci-deployer branch from f6fd6cc to 5107821 on June 7, 2023 at 14:47
@Kazy Kazy changed the title Attempt at fixing spurious failures of shuttle-deployer Attempt at fixing sporadic failures of shuttle-deployer Jun 7, 2023
@oddgrd oddgrd added the B-Shuttle Batch, B-M, and A-deployer labels on Jun 7, 2023
Contributor

@iulianbarbu iulianbarbu left a comment


Related to the comment you've found:

I eventually figured out this was a case of not keeping the top-level task waiting on the spawned tasks, resulting in the runtime shutting down and those tasks failing midway through (with the above error).

This is something we can definitely do from a completeness point of view, and it's a great catch. Just a few questions left; let me know if you can clarify them.

storage_manager,
}
(
set,
Contributor

@iulianbarbu iulianbarbu Jun 23, 2023


Can we attach this to the deployment manager?

Member Author


If we attach it to the deployment manager, then we have to wrap it in an Arc and a Mutex. I tried to avoid that (for no particular reason actually, now that I think about it), but it might be better to wrap it and keep the JoinSet with the manager, roughly like the sketch below.
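Roughly, that would look like this (the field and method names are just illustrative, not the actual struct):

use std::sync::Arc;
use tokio::{sync::Mutex, task::JoinSet};

// Rough shape only; the real DeploymentManager has more fields.
struct DeploymentManager {
    // Shared handle to the spawned tasks so the manager can await
    // (or abort) them on shutdown.
    task_set: Arc<Mutex<JoinSet<()>>>,
}

impl DeploymentManager {
    async fn wait_for_tasks(&self) {
        let mut set = self.task_set.lock().await;
        // Drain the set, ignoring individual task results here.
        while set.join_next().await.is_some() {}
    }
}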

deployer/src/deployment/queue.rs (review thread, outdated, resolved)
deployer/src/deployment/deploy_layer.rs (review thread, resolved)
Contributor

@chesedo chesedo left a comment


Thanks for this update @Kazy! And sorry for the delay; we wanted to doubly confirm the impact of the JoinSet, and I'm happy with the tests I've run. Just two questions, but my tests have shown that we might not have to worry about them anyway.

@@ -52,7 +55,7 @@ pub async fn task(
     let storage_manager = storage_manager.clone();
     let queue_client = queue_client.clone();
 
-    tokio::spawn(async move {
+    tasks.spawn(async move {
Contributor


The while loop this spawn is in has a 'static lifetime. So I'm wondering if an overflow will eventually happen if tasks are only inserted into the set and are never waited for?

Member Author

@Kazy Kazy Jun 23, 2023


Very good point, I think this can happen, yes. If the queue never ends (e.g. in the case of a long-running process), the task set will never be awaited, and memory will keep growing with each deployment.

I wonder how this could be fixed. I see two solutions; there may be simpler ones:

  1. Spawn a thread responsible for running join_next, plus a signal sent once we're done with the queue to tell that thread that the next time join_next returns None (i.e. the set is empty), it must return.
  2. Instead of a while loop on recv.recv(), use a select! between recv.recv(), tasks.join_next(), and an else branch for when both return None.

I think the second one is the cleanest; a rough sketch is below. Let me know what you think!
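A rough sketch of option 2 (the channel payload and task body are placeholders, not the real queue code):

use tokio::{sync::mpsc, task::JoinSet};

async fn task(mut recv: mpsc::Receiver<u32>) {
    let mut tasks: JoinSet<()> = JoinSet::new();

    loop {
        tokio::select! {
            // New work: spawn it onto the set so the handle is kept.
            Some(id) = recv.recv() => {
                tasks.spawn(async move {
                    // ... handle the queued deployment ...
                    let _ = id;
                });
            }
            // Reap finished tasks as we go, so the set doesn't grow forever.
            Some(res) = tasks.join_next() => {
                if let Err(err) = res {
                    eprintln!("task failed: {err}");
                }
            }
            // Both recv() and join_next() returned None: the channel is
            // closed and the set is empty, so we're done.
            else => break,
        }
    }
}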

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, we would use something like option 2. However, this doesn't seem to crash anything in our stress testing, so we are thinking of merging it without the extra handling for now. There is also a hard limit on the container memory, which would most probably crash the user's deployer but not affect the rest of the system. We will also move away from this one-deployer-per-user-project architecture soon, so it is not critical to address this.

@@ -83,15 +88,15 @@ pub async fn task(
     };
     let runtime_manager = runtime_manager.clone();
 
-    tokio::spawn(async move {
+    set.spawn(async move {
Contributor


The same goes for this while loop, which also has a 'static lifetime.

Contributor

@iulianbarbu iulianbarbu left a comment


The outstanding commits are not critical; we can merge this as is.


@iulianbarbu
Contributor

@Kazy, if you get some spare time to address this, let us know: #980 (comment).

@Kazy Kazy requested review from iulianbarbu and chesedo June 26, 2023 12:14
@Kazy
Member Author

Kazy commented Jun 26, 2023

@iulianbarbu this has been addressed :)

@oddgrd oddgrd merged commit 9aef803 into shuttle-hq:main Jun 27, 2023
28 of 29 checks passed
@Kazy Kazy deleted the attempt-fix-ci-deployer branch June 27, 2023 09:45
AlphaKeks pushed a commit to AlphaKeks/shuttle that referenced this pull request Jul 21, 2023
…#980)

* feat(deployer): use joinset to await builder tasks on shutdown

* feat(deployer): use joinset for the DeploymentManager as well

* test(deployment): when testing tests, abort early when different

* test(deployment): use test_states in deployment_to_be_queued

Instead of relying on a one second sleep.

* fix(deployment): properly await spawned tasks

* ref(deployer): move join set of DeploymentManager into struct

* fix(deployer): use tokio::select! to await tasks set in deploy/run queue