Attempt at fixing sporadic failures of shuttle-deployer
#980
Conversation
Force-pushed from f6fd6cc to 5107821
Related to the comment you've found:
> I eventually figured out this was a case of not keeping the top-level task waiting on the spawned tasks, resulting in the runtime shutting down and those tasks failing midway through (with the above error).
This is something we can definitely do from a completeness point of view, and this is a great catch. Just a few questions left, let me know if you can clarify them.
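As an illustration of that failure mode, here is a minimal, hypothetical reproduction (none of these names come from the deployer itself): once the top-level future returns, the runtime shuts down and any still-running spawned tasks are aborted.

```rust
use std::time::Duration;

#[tokio::main]
async fn main() {
    // Simulates a background step such as a file rename during a build.
    let _handle = tokio::spawn(async {
        tokio::time::sleep(Duration::from_secs(5)).await;
        println!("task finished"); // never reached
    });

    // `main` returns immediately, the runtime is dropped, and the spawned
    // task is aborted mid-flight. Awaiting `_handle` (or keeping all the
    // handles and awaiting them) keeps the runtime alive until it's done.
}
```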
deployer/src/deployment/mod.rs (outdated)

```rust
    storage_manager,
}
(
    set,
```
Can we attach this to the deployment manager?
If we attach it to the deployer, then we have to wrap it in an `Arc` and a `Mutex`. I tried to avoid that (no particular reason actually, now that I think about it), but it might be better to wrap it and keep the `JoinSet` with the manager.
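As a rough sketch of that trade-off (the type and field names here are illustrative, not the deployer's actual ones), keeping the `JoinSet` on a cloneable manager would look something like this:

```rust
use std::sync::Arc;
use tokio::{sync::Mutex, task::JoinSet};

// Hypothetical shape: the set sits behind Arc<Mutex<...>> because the
// manager is cloned across tasks.
#[derive(Clone)]
struct DeploymentManager {
    tasks: Arc<Mutex<JoinSet<()>>>,
}

impl DeploymentManager {
    async fn spawn_task(&self) {
        // `JoinSet::spawn` returns an `AbortHandle`, unused here.
        let _ = self.tasks.lock().await.spawn(async {
            // deployment work goes here
        });
    }

    // On shutdown, drain the set so nothing is dropped mid-flight.
    async fn shutdown(&self) {
        let mut tasks = self.tasks.lock().await;
        while let Some(res) = tasks.join_next().await {
            if let Err(err) = res {
                eprintln!("background task failed: {err}");
            }
        }
    }
}
```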
Thanks for this update @Kazy! And sorry for the delay, we wanted to doubly confirm the impact of the `JoinSet`, and I'm happy with the tests I've run. Just two questions, but my tests have shown that we might not have to worry about them anyway.
deployer/src/deployment/queue.rs (outdated)

```diff
@@ -52,7 +55,7 @@ pub async fn task(
     let storage_manager = storage_manager.clone();
     let queue_client = queue_client.clone();

-    tokio::spawn(async move {
+    tasks.spawn(async move {
```
The while loop this spawn is in has a `'static` lifetime. So I'm wondering if an overflow will eventually happen if tasks are only inserted into the set and are never waited for?
Very good point, I think this can happen, yes. If the queue never ends (e.g. in the case of a long-running process), the task set will never be awaited, and the memory will keep growing with each deployment.

I wonder how this could be fixed. I see two solutions, though I don't know if there are simpler ones:

1. Spawn a thread that will be responsible for running `join_next`, plus a signal once we're done with the queue to notify that thread that the next time `join_next` returns `None` (i.e. that the set is empty), it must return.
2. Instead of doing a while loop on `recv.recv()`, use a `select!` between `recv.recv()`, `tasks.join_next()`, and the `else` case where both returned `None`.

I think the second one is the cleanest, let me know what you think!
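For reference, a sketch of what option 2 could look like (the channel and job types here are placeholders, not the deployer's actual ones):

```rust
use tokio::{sync::mpsc, task::JoinSet};

// Drain the channel and reap finished tasks in the same loop, exiting
// once the channel is closed and the set is empty.
async fn task(mut recv: mpsc::Receiver<u32>) {
    let mut tasks: JoinSet<()> = JoinSet::new();

    loop {
        tokio::select! {
            Some(job) = recv.recv() => {
                tasks.spawn(async move {
                    // handle the job
                    let _ = job;
                });
            }
            Some(res) = tasks.join_next() => {
                // A completed task is removed from the set here, so memory
                // no longer grows with each deployment.
                if let Err(err) = res {
                    eprintln!("task panicked: {err}");
                }
            }
            else => break, // channel closed and set empty
        }
    }
}
```

The `else` branch only runs once both `recv.recv()` and `tasks.join_next()` return `None`, so the loop exits cleanly while still reaping every finished task along the way.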
Ideally, we would use something like option 2. However, this doesn't seem to crash anything in our stress testing, so we're thinking about merging it without the extra handling for now. There is also a hard limit imposed on the container memory, which will most probably crash the user's deployer but not affect the rest of the system. Also, we will move away from this one-deployer-per-user-project architecture soon, so it is not critical to address this.
deployer/src/deployment/run.rs (outdated)

```diff
@@ -83,15 +88,15 @@ pub async fn task(
     };
     let runtime_manager = runtime_manager.clone();

-    tokio::spawn(async move {
+    set.spawn(async move {
```
The same goes for this while loop, which also has a `'static` lifetime.
The outstanding commits are not critical, we can merge it as is.
@Kazy, if you got spare time to address this let us know: #980 (comment).

@iulianbarbu this has been addressed :)
Force-pushed from b989adc to 6d5db2b
Attempt at fixing sporadic failures of shuttle-deployer (#980)

* feat(deployer): use joinset to await builder tasks on shutdown
* feat(deployer): use joinset for the DeploymentManager as well
* test(deployment): when testing tests, abort early when different
* test(deployment): use test_states in deployment_to_be_queued

  Instead of relying on a one second sleep.

* fix(deployment): properly await spawned tasks
* ref(deployer): move join set of DeploymentManager into struct
* fix(deployer): use tokio::select! to await tasks set in deploy/run queue
Description of change
The `shuttle-deployer` tests sometimes fail when unrelated changes are made. This can be seen both during CI and when working locally (we ran into it while working on the pagination). It's quite hard to reproduce, but if we look at the CircleCI errors (https://app.circleci.com/pipelines/github/shuttle-hq/shuttle/3134/workflows/c461d70f-616c-416a-a1e5-148ca2afa854/jobs/59077, https://app.circleci.com/pipelines/github/shuttle-hq/shuttle/3134/workflows/bd9ea5b5-c6bf-4262-bc58-2d82ce416341/jobs/59040, https://app.circleci.com/pipelines/github/shuttle-hq/shuttle/3109/workflows/aeefba11-ae65-48b7-b522-30b7931e32a9/jobs/58417, among others), we often get this error:
And right before it, this line:
In particular, right after the log line above, the code calls `store_executable`, which in turn calls `tokio::fs::rename`. This, in combination with the error message (`background task failed`), led me to this issue. One comment mentions the following:

In the deployer, and more precisely in this code path, this is done twice:
1. `Queued::handle` (first commit) in `queue::task`, where we pull deployments to build and run.
2. `queue::task` itself, which is spawned when building a `DeploymentManager`.
Another kind of bug happens in the test suite:

Sometimes it was just a timeout where the deployment seemed to have stalled; sometimes it actually resulted in a crash (`state: Crashed`) instead. This was particularly painful when debugging, since it would hang for several minutes even though it was sometimes clear that the states had diverged. I've added a fix for it to return early when we can.

I also got this kind of error:
I assume that, similarly to the first issue, we're not properly waiting on the returned `JoinHandle`, causing the stream to be closed while a task is still running. This time it happened in the `run.rs` file of the deployer. The fix resolved both the `tonic` stream error and the deployments that either hung or crashed.

How has this been tested? (if applicable)
I managed to reproduce the errors locally sometimes (but not often), which I can no longer do with these changes. I ran the following command, which before my fixes exhibited some of the issues within a few iterations:
Now I want to try with CircleCI :)