Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[air] Catch generic exceptions from Syncer.wait #37181

Merged
merged 3 commits into from
Jul 7, 2023

Conversation

justinvyu
Copy link
Contributor

@justinvyu justinvyu commented Jul 7, 2023

Why are these changes needed?

#36993 changed Syncer.wait to raise the syncing error rather than wrap in a TuneError in all cases except for a TimeoutError. This was changed due to the TuneError being unnecessary and causing the error logging to be confusing.

However, we do catch a TuneError specifically in many parts of the code that call Syncer.wait, so these cases now hard fail rather than soft failing with a warning log. This PR also fixes experiment syncing up and down to handle syncing errors in a standard way.

Here's the current handling of syncing errors to cloud:

  1. Driver background syncing to and from cloud: Soft fail on TimeoutError --> Soft fail on general exceptions (this is the change from the PR)
  2. Trainable checkpoint/artifact syncing: Soft fail on all syncing errors after retrying 2 times. (no change)

Related issue number

Closes #37170

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@woshiyyya
Copy link
Member

woshiyyya commented Jul 7, 2023

Generally looks good to me. One question:

https://github.com/ray-project/ray/pull/36993/files#diff-337c0b1697a4bb23bf5245df62440258c93879ef90e260279504b5845e5fbee8R622

Why do we start to raise general Exception instead of wrapping it into a TuneError?

@justinvyu
Copy link
Contributor Author

@woshiyyya Just updated the PR description with some explanation.

@krfricke krfricke merged commit 43f717e into ray-project:master Jul 7, 2023
61 of 65 checks passed
@justinvyu justinvyu deleted the air/catch_syncer_wait_fix branch July 7, 2023 21:19
SongGuyang pushed a commit to alipay/ant-ray that referenced this pull request Jul 12, 2023
ray-project#36993 changed `Syncer.wait` to raise the syncing error rather than wrap in a `TuneError` in all cases except for a `TimeoutError`. This was changed due to the `TuneError` being unnecessary and causing the error logging to be confusing.

However, we do catch a `TuneError` specifically in many parts of the code that call `Syncer.wait`, so these cases now hard fail rather than soft failing with a warning log. This PR also fixes experiment syncing up and down to handle syncing errors in a standard way.

Here's the current handling of syncing errors to cloud:
1. Driver background syncing to and from cloud: Soft fail on `TimeoutError` --> Soft fail on general exceptions (this is the change from the PR)
2. Trainable checkpoint/artifact syncing: Soft fail on all syncing errors after retrying 2 times. (no change)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: 久龙 <guyang.sgy@antfin.com>
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
ray-project#36993 changed `Syncer.wait` to raise the syncing error rather than wrap in a `TuneError` in all cases except for a `TimeoutError`. This was changed due to the `TuneError` being unnecessary and causing the error logging to be confusing.

However, we do catch a `TuneError` specifically in many parts of the code that call `Syncer.wait`, so these cases now hard fail rather than soft failing with a warning log. This PR also fixes experiment syncing up and down to handle syncing errors in a standard way.

Here's the current handling of syncing errors to cloud:
1. Driver background syncing to and from cloud: Soft fail on `TimeoutError` --> Soft fail on general exceptions (this is the change from the PR)
2. Trainable checkpoint/artifact syncing: Soft fail on all syncing errors after retrying 2 times. (no change)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: harborn <gangsheng.wu@intel.com>
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
ray-project#36993 changed `Syncer.wait` to raise the syncing error rather than wrap in a `TuneError` in all cases except for a `TimeoutError`. This was changed due to the `TuneError` being unnecessary and causing the error logging to be confusing.

However, we do catch a `TuneError` specifically in many parts of the code that call `Syncer.wait`, so these cases now hard fail rather than soft failing with a warning log. This PR also fixes experiment syncing up and down to handle syncing errors in a standard way.

Here's the current handling of syncing errors to cloud:
1. Driver background syncing to and from cloud: Soft fail on `TimeoutError` --> Soft fail on general exceptions (this is the change from the PR)
2. Trainable checkpoint/artifact syncing: Soft fail on all syncing errors after retrying 2 times. (no change)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
ray-project#36993 changed `Syncer.wait` to raise the syncing error rather than wrap in a `TuneError` in all cases except for a `TimeoutError`. This was changed due to the `TuneError` being unnecessary and causing the error logging to be confusing.

However, we do catch a `TuneError` specifically in many parts of the code that call `Syncer.wait`, so these cases now hard fail rather than soft failing with a warning log. This PR also fixes experiment syncing up and down to handle syncing errors in a standard way.

Here's the current handling of syncing errors to cloud:
1. Driver background syncing to and from cloud: Soft fail on `TimeoutError` --> Soft fail on general exceptions (this is the change from the PR)
2. Trainable checkpoint/artifact syncing: Soft fail on all syncing errors after retrying 2 times. (no change)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Release test tune_cloud_durable_upload.aws failed
3 participants