
[air/output] Include traceback as part of the warning log for syncing failures #36993

Merged
merged 8 commits into ray-project:master on Jul 5, 2023

Conversation

justinvyu
Contributor

@justinvyu commented Jun 30, 2023

Why are these changes needed?

This PR improves observability into errors that occur during syncing by including the traceback, which can help track down the underlying cause of the error. When used with `wait_or_retry`, the full traceback is only printed after all retry attempts have failed.
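
As a rough sketch, this is the retry-then-log pattern described above (illustrative only, not the actual Ray Tune implementation; `sync_fn`, `max_retries`, and `backoff_s` are assumed names):

```python
import logging
import time
import traceback

logger = logging.getLogger(__name__)


def sync_with_retries(sync_fn, max_retries: int = 3, backoff_s: float = 1.0):
    """Run ``sync_fn``, retrying on failure.

    Each failed attempt logs a short warning; the full traceback is only
    surfaced once every attempt (1 initial try + ``max_retries`` retries)
    has failed.
    """
    last_error_traceback = None
    for attempt in range(max_retries + 1):
        try:
            return sync_fn()
        except Exception as e:
            # Keep the formatted traceback around for the final error message.
            last_error_traceback = traceback.format_exc()
            remaining = max_retries - attempt
            if remaining > 0:
                logger.warning(
                    "The latest sync operation failed with the following "
                    f"error: {e!r}\nRetrying {remaining} more time(s) after "
                    f"sleeping for {backoff_s} seconds..."
                )
                time.sleep(backoff_s)
    raise RuntimeError(
        f"Failed sync even after {max_retries} retries. The latest sync "
        f"failed with the following error:\n{last_error_traceback}"
    )
```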

Example

Before:

(train_fn pid=44915) The latest sync operation failed with the following error: Sync process failed: [Errno 22] The provided token is malformed or otherwise invalid.
(train_fn pid=44915) Retrying after sleeping for 1.0 seconds...
(train_fn pid=44915) The latest sync operation failed with the following error: Sync process failed: [Errno 22] The provided token is malformed or otherwise invalid.
(train_fn pid=44915) Retrying after sleeping for 1.0 seconds...
(train_fn pid=44915) The latest sync operation failed with the following error: Sync process failed: [Errno 22] The provided token is malformed or otherwise invalid.
(train_fn pid=44915) Retrying after sleeping for 1.0 seconds...
Training saved a checkpoint for iteration 1 at: /Users/justin/ray_results/train_fn_2023-06-30_14-11-35/train_fn_b34ef_00000_0_2023-06-30_14-11-38/checkpoint_000000

Training completed after 1 iterations at 2023-06-30 14:11:50. Total running time: 12s

2023-06-30 14:11:50,669 WARNING tune.py:1122 -- Trial Runner checkpointing failed: Sync process failed: [Errno 22] The provided token is malformed or otherwise invalid.

After:

(train_fn pid=67722) The latest sync operation failed with the following error: OSError(22, 'The provided token is malformed or otherwise invalid.')
(train_fn pid=67722) Retrying 3 more time(s) after sleeping for 1.0 seconds...
(train_fn pid=67722) The latest sync operation failed with the following error: OSError(22, 'The provided token is malformed or otherwise invalid.')
(train_fn pid=67722) Retrying 2 more time(s) after sleeping for 1.0 seconds...
(train_fn pid=67722) The latest sync operation failed with the following error: OSError(22, 'The provided token is malformed or otherwise invalid.')
(train_fn pid=67722) Retrying 1 more time(s) after sleeping for 1.0 seconds...
Training saved a checkpoint for iteration 1 at: /Users/justin/ray_results/train_fn_2023-06-30_15-33-03/train_fn_14f9a_00000_0_2023-06-30_15-33-06/checkpoint_000000

Training completed after 1 iterations at 2023-06-30 15:33:21. Total running time: 14s

2023-06-30 15:33:21,617 WARNING tune.py:1122 -- Trial Runner checkpointing failed: [Errno 22] The provided token is malformed or otherwise invalid.

(train_fn pid=67722) Could not upload checkpoint to s3://checkpoint-bucket/train_fn_2023-06-30_15-33-03/train_fn_14f9a_00000_0_2023-06-30_15-33-06/checkpoint_000000 even after 3 retries. Please check if the credentials expired and that the remote filesystem is supported. For large checkpoints or artifacts, consider increasing `SyncConfig(sync_timeout)` (current value: 1800 seconds). See the full error details below:
(train_fn pid=67722) Failed sync even after 3 retries. The latest sync failed with the following error:
(train_fn pid=67722) Traceback (most recent call last):
(train_fn pid=67722)   File "/Users/justin/miniforge3/envs/ray_dev_py38/lib/python3.8/site-packages/s3fs/core.py", line 112, in _error_wrapper
(train_fn pid=67722)     return await func(*args, **kwargs)
(train_fn pid=67722)   File "/Users/justin/miniforge3/envs/ray_dev_py38/lib/python3.8/site-packages/aiobotocore/client.py", line 358, in _make_api_call
(train_fn pid=67722)     raise error_class(parsed_response, operation_name)
(train_fn pid=67722) botocore.exceptions.ClientError: An error occurred (InvalidToken) when calling the PutObject operation: The provided token is malformed or otherwise invalid.
(train_fn pid=67722) 
(train_fn pid=67722) The above exception was the direct cause of the following exception:
(train_fn pid=67722) 
(train_fn pid=67722) Traceback (most recent call last):
(train_fn pid=67722)   File "/Users/justin/Developer/justinvyu-dev/python/ray/tune/syncer.py", line 465, in wait_or_retry
(train_fn pid=67722)     self.wait()
(train_fn pid=67722)   File "/Users/justin/Developer/justinvyu-dev/python/ray/tune/syncer.py", line 622, in wait
(train_fn pid=67722)     raise e
(train_fn pid=67722)   File "/Users/justin/Developer/justinvyu-dev/python/ray/tune/syncer.py", line 620, in wait
(train_fn pid=67722)     self._sync_process.wait(timeout=self.sync_timeout)
(train_fn pid=67722)   File "/Users/justin/Developer/justinvyu-dev/python/ray/tune/syncer.py", line 298, in wait
(train_fn pid=67722)     raise exception
(train_fn pid=67722)   File "/Users/justin/Developer/justinvyu-dev/python/ray/tune/syncer.py", line 261, in entrypoint
(train_fn pid=67722)     result = self._fn(*args, **kwargs)
(train_fn pid=67722)   File "/Users/justin/Developer/justinvyu-dev/python/ray/air/_internal/remote_storage.py", line 527, in upload_to_uri
(train_fn pid=67722)     _pyarrow_fs_copy_files(local_path, bucket_path, destination_filesystem=fs)
(train_fn pid=67722)   File "/Users/justin/Developer/justinvyu-dev/python/ray/air/_internal/remote_storage.py", line 93, in _pyarrow_fs_copy_files
(train_fn pid=67722)     return pyarrow.fs.copy_files(
(train_fn pid=67722)   File "/Users/justin/miniforge3/envs/ray_dev_py38/lib/python3.8/site-packages/pyarrow/fs.py", line 267, in copy_files
(train_fn pid=67722)     _copy_files_selector(source_fs, source_sel,
(train_fn pid=67722)   File "pyarrow/_fs.pyx", line 1619, in pyarrow._fs._copy_files_selector
(train_fn pid=67722)   File "/Users/justin/miniforge3/envs/ray_dev_py38/lib/python3.8/site-packages/fsspec/spec.py", line 1770, in close
(train_fn pid=67722)     self.flush(force=True)
(train_fn pid=67722)   File "/Users/justin/miniforge3/envs/ray_dev_py38/lib/python3.8/site-packages/fsspec/spec.py", line 1641, in flush
(train_fn pid=67722)     if self._upload_chunk(final=force) is not False:
(train_fn pid=67722)   File "/Users/justin/miniforge3/envs/ray_dev_py38/lib/python3.8/site-packages/s3fs/core.py", line 2208, in _upload_chunk
(train_fn pid=67722)     self.commit()
(train_fn pid=67722)   File "/Users/justin/miniforge3/envs/ray_dev_py38/lib/python3.8/site-packages/s3fs/core.py", line 2223, in commit
(train_fn pid=67722)     write_result = self._call_s3(
(train_fn pid=67722)   File "/Users/justin/miniforge3/envs/ray_dev_py38/lib/python3.8/site-packages/s3fs/core.py", line 2075, in _call_s3
(train_fn pid=67722)     return self.fs.call_s3(method, self.s3_additional_kwargs, *kwarglist, **kwargs)
(train_fn pid=67722)   File "/Users/justin/miniforge3/envs/ray_dev_py38/lib/python3.8/site-packages/fsspec/asyn.py", line 114, in wrapper
(train_fn pid=67722)     return sync(self.loop, func, *args, **kwargs)
(train_fn pid=67722)   File "/Users/justin/miniforge3/envs/ray_dev_py38/lib/python3.8/site-packages/fsspec/asyn.py", line 99, in sync
(train_fn pid=67722)     raise return_result
(train_fn pid=67722)   File "/Users/justin/miniforge3/envs/ray_dev_py38/lib/python3.8/site-packages/fsspec/asyn.py", line 54, in _runner
(train_fn pid=67722)     result[0] = await coro
(train_fn pid=67722)   File "/Users/justin/miniforge3/envs/ray_dev_py38/lib/python3.8/site-packages/s3fs/core.py", line 339, in _call_s3
(train_fn pid=67722)     return await _error_wrapper(
(train_fn pid=67722)   File "/Users/justin/miniforge3/envs/ray_dev_py38/lib/python3.8/site-packages/s3fs/core.py", line 139, in _error_wrapper
(train_fn pid=67722)     raise err
(train_fn pid=67722) OSError: [Errno 22] The provided token is malformed or otherwise invalid.
(train_fn pid=67722) 

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@justinvyu added the tests-ok label (The tagger certifies test failures are unrelated and assumes personal liability.) on Jul 3, 2023
@krfricke (Contributor) left a comment


One nit for the max_retries case

-last_error = None
-for _ in range(max_retries):
+last_error_traceback = None
+for i in range(max_retries + 1):

So this makes sense semantically (we try 4 times --> thus 3 retries), but let's set the default max_retries to 2 then.
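
For reference, a small illustration of the attempt/retry arithmetic discussed above (a hypothetical helper, not code from this PR): `range(max_retries + 1)` makes one initial attempt plus `max_retries` retries, so a default of 2 would keep three total attempts, matching the old `range(max_retries)` loop with its default of 3.

```python
def total_attempts(max_retries: int) -> int:
    # One initial attempt plus `max_retries` retries.
    return max_retries + 1


assert total_attempts(3) == 4  # the new loop with the old default of 3
assert total_attempts(2) == 3  # the suggested default keeps three attempts total
```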

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@krfricke merged commit 9ec342c into ray-project:master on Jul 5, 2023
59 of 63 checks passed
krfricke pushed a commit that referenced this pull request Jul 7, 2023
#36993 changed `Syncer.wait` to raise the syncing error rather than wrapping it in a `TuneError` in all cases except for a `TimeoutError`. This was changed because the `TuneError` wrapper was unnecessary and made the error logging confusing.

However, we do catch a `TuneError` specifically in many parts of the code that call `Syncer.wait`, so these cases now hard-fail rather than soft-failing with a warning log. This PR also fixes experiment syncing (both up and down) to handle syncing errors in a standard way.

Here's the current handling of syncing errors to cloud:
1. Driver background syncing to and from cloud: Soft fail on `TimeoutError` --> Soft fail on general exceptions (this is the change from the PR)
2. Trainable checkpoint/artifact syncing: Soft fail on all syncing errors after retrying 2 times. (no change)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
SongGuyang pushed a commit to alipay/ant-ray that referenced this pull request Jul 12, 2023
… failures (ray-project#36993)

This PR improves observability into errors that occur during syncing by including the traceback, which can help track down the underlying cause of the error. When used with `wait_or_retry`, the full traceback is only printed after all retry attempts have failed.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: 久龙 <guyang.sgy@antfin.com>
SongGuyang pushed a commit to alipay/ant-ray that referenced this pull request Jul 12, 2023
ray-project#36993 changed `Syncer.wait` to raise the syncing error rather than wrapping it in a `TuneError` in all cases except for a `TimeoutError`. This was changed because the `TuneError` wrapper was unnecessary and made the error logging confusing.

However, we do catch a `TuneError` specifically in many parts of the code that call `Syncer.wait`, so these cases now hard-fail rather than soft-failing with a warning log. This PR also fixes experiment syncing (both up and down) to handle syncing errors in a standard way.

Here's the current handling of syncing errors to cloud:
1. Driver background syncing to and from cloud: Soft fail on `TimeoutError` --> Soft fail on general exceptions (this is the change from the PR)
2. Trainable checkpoint/artifact syncing: Soft fail on all syncing errors after retrying 2 times. (no change)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: 久龙 <guyang.sgy@antfin.com>
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
… failures (ray-project#36993)

This PR improves observability into errors that occur during syncing by including the traceback, which can help track down the underlying cause of the error. When used with `wait_or_retry`, the full traceback is only printed after all retry attempts have failed.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: harborn <gangsheng.wu@intel.com>
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
ray-project#36993 changed `Syncer.wait` to raise the syncing error rather than wrapping it in a `TuneError` in all cases except for a `TimeoutError`. This was changed because the `TuneError` wrapper was unnecessary and made the error logging confusing.

However, we do catch a `TuneError` specifically in many parts of the code that call `Syncer.wait`, so these cases now hard-fail rather than soft-failing with a warning log. This PR also fixes experiment syncing (both up and down) to handle syncing errors in a standard way.

Here's the current handling of syncing errors to cloud:
1. Driver background syncing to and from cloud: Soft fail on `TimeoutError` --> Soft fail on general exceptions (this is the change from the PR)
2. Trainable checkpoint/artifact syncing: Soft fail on all syncing errors after retrying 2 times. (no change)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: harborn <gangsheng.wu@intel.com>
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
… failures (ray-project#36993)

This PR improves observability into errors that occur during syncing by including the traceback, which can help track down the underlying cause of the error. When used with `wait_or_retry`, the full traceback is only printed after all retry attempts have failed.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
ray-project#36993 changed `Syncer.wait` to raise the syncing error rather than wrapping it in a `TuneError` in all cases except for a `TimeoutError`. This was changed because the `TuneError` wrapper was unnecessary and made the error logging confusing.

However, we do catch a `TuneError` specifically in many parts of the code that call `Syncer.wait`, so these cases now hard-fail rather than soft-failing with a warning log. This PR also fixes experiment syncing (both up and down) to handle syncing errors in a standard way.

Here's the current handling of syncing errors to cloud:
1. Driver background syncing to and from cloud: Soft fail on `TimeoutError` --> Soft fail on general exceptions (this is the change from the PR)
2. Trainable checkpoint/artifact syncing: Soft fail on all syncing errors after retrying 2 times. (no change)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
… failures (ray-project#36993)

This PR improves observability into errors that occur during syncing by including the traceback, which can help track down the underlying cause of the error. When used with `wait_or_retry`, the full traceback is only printed after all retry attempts have failed.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
ray-project#36993 changed `Syncer.wait` to raise the syncing error rather than wrapping it in a `TuneError` in all cases except for a `TimeoutError`. This was changed because the `TuneError` wrapper was unnecessary and made the error logging confusing.

However, we do catch a `TuneError` specifically in many parts of the code that call `Syncer.wait`, so these cases now hard-fail rather than soft-failing with a warning log. This PR also fixes experiment syncing (both up and down) to handle syncing errors in a standard way.

Here's the current handling of syncing errors to cloud:
1. Driver background syncing to and from cloud: Soft fail on `TimeoutError` --> Soft fail on general exceptions (this is the change from the PR)
2. Trainable checkpoint/artifact syncing: Soft fail on all syncing errors after retrying 2 times. (no change)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>