
Increase close timeout of Nanny in LocalCUDACluster #1260

Merged
merged 2 commits into rapidsai:branch-23.12 on Oct 12, 2023

Conversation

pentschev
Member

Tests in CI have been failing more often, but those errors can't be reproduced locally. This is possibly related to `Nanny`'s internal mechanism for establishing the timeout used to kill processes: under higher load on the CI servers, tasks take longer, and since the kill timeout takes into account the time already elapsed, it ends up drastically reduced, leaving little time to actually shut down processes. It is also not possible to programmatically set a different timeout with Distributed's existing API, which currently calls `close()` without arguments in `SpecCluster._correct_state_internal()`.

Given the limitations described above, this change adds a new class whose sole purpose is to override the timeout of the `Nanny.close()` method with an increased value; the new class is then used when launching `LocalCUDACluster` via the `worker_class` argument.
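For illustration, a minimal sketch of the approach described above, assuming a subclass of `Nanny` whose only difference is the default `close()` timeout. The class name, the timeout value, and the exact `close()` signature are assumptions for this sketch, not necessarily what was merged:

```python
from distributed import Nanny

from dask_cuda import LocalCUDACluster


class IncreasedCloseTimeoutNanny(Nanny):
    """Hypothetical Nanny subclass whose close() defaults to a larger timeout.

    SpecCluster._correct_state_internal() calls close() without arguments,
    so raising the default is the only way to increase the timeout.
    """

    async def close(self, timeout=120.0, **kwargs):
        # Delegate to the regular Nanny.close(); only the default timeout
        # (assumed here to be 120 seconds) differs.
        return await super().close(timeout=timeout, **kwargs)


if __name__ == "__main__":
    # Pass the subclass to LocalCUDACluster via the worker_class argument.
    cluster = LocalCUDACluster(worker_class=IncreasedCloseTimeoutNanny)
```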

@pentschev pentschev added bug Something isn't working 3 - Ready for Review Ready for review by team non-breaking Non-breaking change labels Oct 12, 2023
@pentschev pentschev requested a review from a team as a code owner October 12, 2023 13:01
@github-actions github-actions bot added the python python code needed label Oct 12, 2023
@quasiben
Member

/merge

@rapids-bot rapids-bot bot merged commit 48de0c5 into rapidsai:branch-23.12 Oct 12, 2023
24 checks passed
@pentschev
Member Author

Thanks @quasiben !

@csadorf

csadorf commented Oct 30, 2023

I am probably misunderstanding the underlying issue, but is there a risk that with this patch the tests no longer adequately capture how users would actually use the software, thus reducing the utility of the tests?

@pentschev
Member Author

I am probably misunderstanding the underlying issue, but is there a risk that with this patch the tests no longer adequately capture how users would actually use the software, thus reducing the utility of the tests?

It is possible that this may occur in real use cases, but I've run the failing Dask-CUDA tests locally thousands of times over a period of many weeks/months and couldn't reproduce the issue even once, which leads me to believe it is due to high load in CI. In any case, these errors only occur while the cluster is shutting down, so even if the problem manifests for users it would only be problematic if they rely on the process' exit code. That is indeed not great, but shutdown is unfortunately a difficult issue to resolve properly in Distributed.
