
[tune] Add timeout to retry_fn to catch hanging syncs #28155

Merged
krfricke merged 13 commits into ray-project:master from tune/sync-timeout on Sep 2, 2022

Conversation

krfricke (Contributor) commented Aug 29, 2022

Signed-off-by: Kai Fricke <kai@anyscale.com>

Why are these changes needed?

Syncing sometimes hangs in pyarrow for unknown reasons. We should introduce a timeout for these syncing operations.
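For illustration, here is a rough sketch of the idea (hypothetical helper name, not the actual Ray code; the real change goes into retry_fn in python/ray/tune/utils/util.py): run the sync call in a daemon thread and stop waiting after a timeout.

import threading

def run_with_timeout(sync_fn, timeout: float = 1800.0) -> bool:
    # Hypothetical helper: run sync_fn in a daemon thread and give up
    # waiting after `timeout` seconds.
    thread = threading.Thread(target=sync_fn, daemon=True)
    thread.start()
    thread.join(timeout=timeout)
    # If the thread is still alive, the sync (e.g. a pyarrow upload) is
    # considered hung; being a daemon, it will not block interpreter shutdown.
    return not thread.is_alive()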

Todo:

  • Add end-to-end trainable test
  • Throw descriptive error on timeout

Related issue number

Closes #26802

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Kai Fricke <kai@anyscale.com>
Kai Fricke added 4 commits August 29, 2022 17:01
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
@krfricke krfricke marked this pull request as ready for review August 30, 2022 00:39
@krfricke krfricke requested a review from gjoliver August 30, 2022 00:40
Signed-off-by: Kai Fricke <kai@anyscale.com>
python/ray/tune/execution/trial_runner.py (outdated; review thread resolved)
proc = threading.Thread(target=_retry_fn)
proc.daemon = True
proc.start()
proc.join(timeout=timeout)
Member commented:

Now that you have a thread, imagine eventually we checkpoint on the side while the training just keeps going 🤯 😄

One nit: I also think the timeout should be per retry (so the join here would use timeout=num_retries * timeout); otherwise the effective timeout depends on how many retries are set. Although, admittedly, num_retries is not even configurable.

krfricke (Contributor, author) replied:

Yeah, I thought about this, but I kept a global timeout because it is a) simpler/cleaner to implement and b) we basically want to define a maximum time we are willing to block training, so I think we should be fine with this. Let me know if you prefer per-retry.

krfricke (Contributor, author) followed up:

Ah, actually, as discussed, let's have a timeout per retry. Otherwise, if the first sync hangs, we will not try again. Updated the PR.
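As a rough illustration of the per-retry behavior agreed on above (names and defaults are made up for the sketch, not the exact Ray implementation), each attempt gets its own daemon thread and its own timeout:

import threading
import time

def retry_fn_sketch(fn, num_retries: int = 3, timeout: float = 1800.0) -> bool:
    # Illustrative only: each attempt runs in its own daemon thread with its
    # own timeout, so a hanging first sync still leaves room for later attempts.
    for _ in range(num_retries):
        errors = []

        def _target(bucket=errors):
            # Bind this attempt's error list as a default argument so a stale,
            # timed-out thread cannot append to a later attempt's list.
            try:
                fn()
            except Exception as exc:
                bucket.append(exc)

        thread = threading.Thread(target=_target, daemon=True)
        thread.start()
        thread.join(timeout=timeout)
        if thread.is_alive() or errors:
            # Timed out or raised an error: back off briefly, then try again.
            time.sleep(1)
            continue
        return True
    return False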

Kai Fricke added 3 commits August 30, 2022 11:29
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
@krfricke krfricke requested a review from gjoliver August 30, 2022 22:01
gjoliver (Member) left a review comment:

cool man. 2 nits, but let's give this a try.
feel free to merge after you address the minor comments.

python/ray/tune/utils/util.py (outdated; review thread resolved)
python/ray/tune/trainable/trainable.py (review thread resolved)
Kai Fricke added 4 commits August 31, 2022 11:53
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
@krfricke krfricke merged commit 3590a86 into ray-project:master Sep 2, 2022
@krfricke krfricke deleted the tune/sync-timeout branch September 2, 2022 11:52
@richardliaw changed the title from "[tune] Add timeout ro retry_fn to catch hanging syncs" to "[tune] Add timeout to retry_fn to catch hanging syncs" on Sep 2, 2022
kira-lin pushed a commit to kira-lin/ray that referenced this pull request Sep 8, 2022

Syncing sometimes hangs in pyarrow for unknown reasons. We should introduce a timeout for these syncing operations.

Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Zhi Lin <zl1nn@outlook.com>
ilee300a pushed a commit to ilee300a/ray that referenced this pull request Sep 12, 2022

Syncing sometimes hangs in pyarrow for unknown reasons. We should introduce a timeout for these syncing operations.

Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: ilee300a <ilee300@anyscale.com>
matthewdeng pushed a commit that referenced this pull request Sep 15, 2022
Syncing sometimes hangs in pyarrow for unknown reasons. We should introduce a timeout for these syncing operations.

Signed-off-by: Kai Fricke <kai@anyscale.com>
krfricke pushed a commit that referenced this pull request Dec 7, 2022
#28155 introduced a sync timeout for trainable checkpoint syncing to the cloud, in the case that the sync operation (default is with pyarrow) hangs. This PR adds a similar timeout for experiment checkpoint cloud syncing.

Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
…(ray-project#30855)

ray-project#28155 introduced a sync timeout for trainable checkpoint syncing to the cloud, in the case that the sync operation (default is with pyarrow) hangs. This PR adds a similar timeout for experiment checkpoint cloud syncing.

Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Capiru pushed a commit to Capiru/ray that referenced this pull request Dec 21, 2022
…(ray-project#30855)

ray-project#28155 introduced a sync timeout for trainable checkpoint syncing to the cloud, in the case that the sync operation (default is with pyarrow) hangs. This PR adds a similar timeout for experiment checkpoint cloud syncing.

Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Capiru <gabriel_s_prado@hotmail.com>
Capiru pushed a commit to Capiru/ray that referenced this pull request Dec 22, 2022
…(ray-project#30855)

ray-project#28155 introduced a sync timeout for trainable checkpoint syncing to the cloud, in the case that the sync operation (default is with pyarrow) hangs. This PR adds a similar timeout for experiment checkpoint cloud syncing.

Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Capiru <gabriel_s_prado@hotmail.com>
tamohannes pushed a commit to ju2ez/ray that referenced this pull request Jan 16, 2023
…(ray-project#30855)

ray-project#28155 introduced a sync timeout for trainable checkpoint syncing to the cloud, in the case that the sync operation (default is with pyarrow) hangs. This PR adds a similar timeout for experiment checkpoint cloud syncing.

Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: tmynn <hovhannes.tamoyan@gmail.com>
tamohannes pushed a commit to ju2ez/ray that referenced this pull request Jan 25, 2023
…(ray-project#30855)

ray-project#28155 introduced a sync timeout for trainable checkpoint syncing to the cloud, in the case that the sync operation (default is with pyarrow) hangs. This PR adds a similar timeout for experiment checkpoint cloud syncing.

Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: tmynn <hovhannes.tamoyan@gmail.com>