
[air/output] Include traceback as part of the warning log for syncing failures #36993

Merged
merged 8 commits into ray-project:master on Jul 5, 2023

Conversation

justinvyu
Contributor

@justinvyu commented Jun 30, 2023

Why are these changes needed?

This PR improves observability into errors that occur during syncing by including the traceback, which can help track down the underlying cause of the error. When used with `wait_or_retry`, the full traceback is only printed after all retry attempts have failed.
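
As a rough sketch, this is the retry-then-log pattern described above (illustrative only, not the actual Ray Tune implementation; `sync_fn`, `max_retries`, and `backoff_s` are assumed names):

```python
import logging
import time
import traceback

logger = logging.getLogger(__name__)


def sync_with_retries(sync_fn, max_retries: int = 3, backoff_s: float = 1.0):
    """Run ``sync_fn``, retrying on failure.

    Each failed attempt logs a short warning; the full traceback is only
    surfaced once every attempt (1 initial try + ``max_retries`` retries)
    has failed.
    """
    last_error_traceback = None
    for attempt in range(max_retries + 1):
        try:
            return sync_fn()
        except Exception as e:
            # Keep the formatted traceback around for the final error message.
            last_error_traceback = traceback.format_exc()
            remaining = max_retries - attempt
            if remaining > 0:
                logger.warning(
                    "The latest sync operation failed with the following "
                    f"error: {e!r}\nRetrying {remaining} more time(s) after "
                    f"sleeping for {backoff_s} seconds..."
                )
                time.sleep(backoff_s)
    raise RuntimeError(
        f"Failed sync even after {max_retries} retries. The latest sync "
        f"failed with the following error:\n{last_error_traceback}"
    )
```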

Example

Before:

(train_fn pid=44915) The latest sync operation failed with the following error: Sync process failed: [Errno 22] The provided token is malformed or otherwise invalid.
(train_fn pid=44915) Retrying after sleeping for 1.0 seconds...
(train_fn pid=44915) The latest sync operation failed with the following error: Sync process failed: [Errno 22] The provided token is malformed or otherwise invalid.
(train_fn pid=44915) Retrying after sleeping for 1.0 seconds...
(train_fn pid=44915) The latest sync operation failed with the following error: Sync process failed: [Errno 22] The provided token is malformed or otherwise invalid.
(train_fn pid=44915) Retrying after sleeping for 1.0 seconds...
Training saved a checkpoint for iteration 1 at: /Users/justin/ray_results/train_fn_2023-06-30_14-11-35/train_fn_b34ef_00000_0_2023-06-30_14-11-38/checkpoint_000000

Training completed after 1 iterations at 2023-06-30 14:11:50. Total running time: 12s

2023-06-30 14:11:50,669 WARNING tune.py:1122 -- Trial Runner checkpointing failed: Sync process failed: [Errno 22] The provided token is malformed or otherwise invalid.

After:

(train_fn pid=67722) The latest sync operation failed with the following error: OSError(22, 'The provided token is malformed or otherwise invalid.')
(train_fn pid=67722) Retrying 3 more time(s) after sleeping for 1.0 seconds...
(train_fn pid=67722) The latest sync operation failed with the following error: OSError(22, 'The provided token is malformed or otherwise invalid.')
(train_fn pid=67722) Retrying 2 more time(s) after sleeping for 1.0 seconds...
(train_fn pid=67722) The latest sync operation failed with the following error: OSError(22, 'The provided token is malformed or otherwise invalid.')
(train_fn pid=67722) Retrying 1 more time(s) after sleeping for 1.0 seconds...
Training saved a checkpoint for iteration 1 at: /Users/justin/ray_results/train_fn_2023-06-30_15-33-03/train_fn_14f9a_00000_0_2023-06-30_15-33-06/checkpoint_000000

Training completed after 1 iterations at 2023-06-30 15:33:21. Total running time: 14s

2023-06-30 15:33:21,617 WARNING tune.py:1122 -- Trial Runner checkpointing failed: [Errno 22] The provided token is malformed or otherwise invalid.

(train_fn pid=67722) Could not upload checkpoint to s3://checkpoint-bucket/train_fn_2023-06-30_15-33-03/train_fn_14f9a_00000_0_2023-06-30_15-33-06/checkpoint_000000 even after 3 retries. Please check if the credentials expired and that the remote filesystem is supported. For large checkpoints or artifacts, consider increasing `SyncConfig(sync_timeout)` (current value: 1800 seconds). See the full error details below:
(train_fn pid=67722) Failed sync even after 3 retries. The latest sync failed with the following error:
(train_fn pid=67722) Traceback (most recent call last):
(train_fn pid=67722)   File "/Users/justin/miniforge3/envs/ray_dev_py38/lib/python3.8/site-packages/s3fs/core.py", line 112, in _error_wrapper
(train_fn pid=67722)     return await func(*args, **kwargs)
(train_fn pid=67722)   File "/Users/justin/miniforge3/envs/ray_dev_py38/lib/python3.8/site-packages/aiobotocore/client.py", line 358, in _make_api_call
(train_fn pid=67722)     raise error_class(parsed_response, operation_name)
(train_fn pid=67722) botocore.exceptions.ClientError: An error occurred (InvalidToken) when calling the PutObject operation: The provided token is malformed or otherwise invalid.
(train_fn pid=67722) 
(train_fn pid=67722) The above exception was the direct cause of the following exception:
(train_fn pid=67722) 
(train_fn pid=67722) Traceback (most recent call last):
(train_fn pid=67722)   File "/Users/justin/Developer/justinvyu-dev/python/ray/tune/syncer.py", line 465, in wait_or_retry
(train_fn pid=67722)     self.wait()
(train_fn pid=67722)   File "/Users/justin/Developer/justinvyu-dev/python/ray/tune/syncer.py", line 622, in wait
(train_fn pid=67722)     raise e
(train_fn pid=67722)   File "/Users/justin/Developer/justinvyu-dev/python/ray/tune/syncer.py", line 620, in wait
(train_fn pid=67722)     self._sync_process.wait(timeout=self.sync_timeout)
(train_fn pid=67722)   File "/Users/justin/Developer/justinvyu-dev/python/ray/tune/syncer.py", line 298, in wait
(train_fn pid=67722)     raise exception
(train_fn pid=67722)   File "/Users/justin/Developer/justinvyu-dev/python/ray/tune/syncer.py", line 261, in entrypoint
(train_fn pid=67722)     result = self._fn(*args, **kwargs)
(train_fn pid=67722)   File "/Users/justin/Developer/justinvyu-dev/python/ray/air/_internal/remote_storage.py", line 527, in upload_to_uri
(train_fn pid=67722)     _pyarrow_fs_copy_files(local_path, bucket_path, destination_filesystem=fs)
(train_fn pid=67722)   File "/Users/justin/Developer/justinvyu-dev/python/ray/air/_internal/remote_storage.py", line 93, in _pyarrow_fs_copy_files
(train_fn pid=67722)     return pyarrow.fs.copy_files(
(train_fn pid=67722)   File "/Users/justin/miniforge3/envs/ray_dev_py38/lib/python3.8/site-packages/pyarrow/fs.py", line 267, in copy_files
(train_fn pid=67722)     _copy_files_selector(source_fs, source_sel,
(train_fn pid=67722)   File "pyarrow/_fs.pyx", line 1619, in pyarrow._fs._copy_files_selector
(train_fn pid=67722)   File "/Users/justin/miniforge3/envs/ray_dev_py38/lib/python3.8/site-packages/fsspec/spec.py", line 1770, in close
(train_fn pid=67722)     self.flush(force=True)
(train_fn pid=67722)   File "/Users/justin/miniforge3/envs/ray_dev_py38/lib/python3.8/site-packages/fsspec/spec.py", line 1641, in flush
(train_fn pid=67722)     if self._upload_chunk(final=force) is not False:
(train_fn pid=67722)   File "/Users/justin/miniforge3/envs/ray_dev_py38/lib/python3.8/site-packages/s3fs/core.py", line 2208, in _upload_chunk
(train_fn pid=67722)     self.commit()
(train_fn pid=67722)   File "/Users/justin/miniforge3/envs/ray_dev_py38/lib/python3.8/site-packages/s3fs/core.py", line 2223, in commit
(train_fn pid=67722)     write_result = self._call_s3(
(train_fn pid=67722)   File "/Users/justin/miniforge3/envs/ray_dev_py38/lib/python3.8/site-packages/s3fs/core.py", line 2075, in _call_s3
(train_fn pid=67722)     return self.fs.call_s3(method, self.s3_additional_kwargs, *kwarglist, **kwargs)
(train_fn pid=67722)   File "/Users/justin/miniforge3/envs/ray_dev_py38/lib/python3.8/site-packages/fsspec/asyn.py", line 114, in wrapper
(train_fn pid=67722)     return sync(self.loop, func, *args, **kwargs)
(train_fn pid=67722)   File "/Users/justin/miniforge3/envs/ray_dev_py38/lib/python3.8/site-packages/fsspec/asyn.py", line 99, in sync
(train_fn pid=67722)     raise return_result
(train_fn pid=67722)   File "/Users/justin/miniforge3/envs/ray_dev_py38/lib/python3.8/site-packages/fsspec/asyn.py", line 54, in _runner
(train_fn pid=67722)     result[0] = await coro
(train_fn pid=67722)   File "/Users/justin/miniforge3/envs/ray_dev_py38/lib/python3.8/site-packages/s3fs/core.py", line 339, in _call_s3
(train_fn pid=67722)     return await _error_wrapper(
(train_fn pid=67722)   File "/Users/justin/miniforge3/envs/ray_dev_py38/lib/python3.8/site-packages/s3fs/core.py", line 139, in _error_wrapper
(train_fn pid=67722)     raise err
(train_fn pid=67722) OSError: [Errno 22] The provided token is malformed or otherwise invalid.
(train_fn pid=67722) 

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@justinvyu added the tests-ok label (The tagger certifies test failures are unrelated and assumes personal liability.) on Jul 3, 2023
@krfricke (Contributor) left a comment


One nit for the max_retries case

-last_error = None
-for _ in range(max_retries):
+last_error_traceback = None
+for i in range(max_retries + 1):

So this makes sense semantically (we try 4 times --> thus 3 retries), but let's set the default max_retries to 2 then.
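
For reference, a small illustration of the attempt/retry arithmetic discussed above (a hypothetical helper, not code from this PR): `range(max_retries + 1)` makes one initial attempt plus `max_retries` retries, so a default of 2 would keep three total attempts, matching the old `range(max_retries)` loop with its default of 3.

```python
def total_attempts(max_retries: int) -> int:
    # One initial attempt plus `max_retries` retries.
    return max_retries + 1


assert total_attempts(3) == 4  # the new loop with the old default of 3
assert total_attempts(2) == 3  # the suggested default keeps three attempts total
```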

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@krfricke merged commit 9ec342c into ray-project:master on Jul 5, 2023
59 of 63 checks passed
krfricke pushed a commit that referenced this pull request Jul 7, 2023
#36993 changed `Syncer.wait` to raise the syncing error rather than wrapping it in a `TuneError` in all cases except for a `TimeoutError`. This was changed because the `TuneError` wrapper was unnecessary and made the error logging confusing.

However, we do catch a `TuneError` specifically in many parts of the code that call `Syncer.wait`, so these cases now hard-fail rather than soft-failing with a warning log. This PR also fixes experiment syncing (both up and down) to handle syncing errors in a standard way.

Here's the current handling of syncing errors to cloud:
1. Driver background syncing to and from cloud: Soft fail on `TimeoutError` --> Soft fail on general exceptions (this is the change from the PR)
2. Trainable checkpoint/artifact syncing: Soft fail on all syncing errors after retrying 2 times. (no change)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
SongGuyang pushed a commit to alipay/ant-ray that referenced this pull request Jul 12, 2023
… failures (ray-project#36993)

This PR improves observability into errors that occur during syncing by including the traceback, which can help track down the underlying cause of the error. When used with `wait_or_retry`, the full traceback is only printed after all retry attempts have failed.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: 久龙 <guyang.sgy@antfin.com>
SongGuyang pushed a commit to alipay/ant-ray that referenced this pull request Jul 12, 2023
ray-project#36993 changed `Syncer.wait` to raise the syncing error rather than wrapping it in a `TuneError` in all cases except for a `TimeoutError`. This was changed because the `TuneError` wrapper was unnecessary and made the error logging confusing.

However, we do catch a `TuneError` specifically in many parts of the code that call `Syncer.wait`, so these cases now hard-fail rather than soft-failing with a warning log. This PR also fixes experiment syncing (both up and down) to handle syncing errors in a standard way.

Here's the current handling of syncing errors to cloud:
1. Driver background syncing to and from cloud: Soft fail on `TimeoutError` --> Soft fail on general exceptions (this is the change from the PR)
2. Trainable checkpoint/artifact syncing: Soft fail on all syncing errors after retrying 2 times. (no change)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: 久龙 <guyang.sgy@antfin.com>
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
… failures (ray-project#36993)

This PR improves observability into errors that occur during syncing by including the traceback, which can help track down the underlying cause of the error. When used with `wait_or_retry`, the full traceback is only printed after all retry attempts have failed.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: harborn <gangsheng.wu@intel.com>
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
ray-project#36993 changed `Syncer.wait` to raise the syncing error rather than wrapping it in a `TuneError` in all cases except for a `TimeoutError`. This was changed because the `TuneError` wrapper was unnecessary and made the error logging confusing.

However, we do catch a `TuneError` specifically in many parts of the code that call `Syncer.wait`, so these cases now hard-fail rather than soft-failing with a warning log. This PR also fixes experiment syncing (both up and down) to handle syncing errors in a standard way.

Here's the current handling of syncing errors to cloud:
1. Driver background syncing to and from cloud: Soft fail on `TimeoutError` --> Soft fail on general exceptions (this is the change from the PR)
2. Trainable checkpoint/artifact syncing: Soft fail on all syncing errors after retrying 2 times. (no change)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: harborn <gangsheng.wu@intel.com>
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
… failures (ray-project#36993)

This PR improves observability into errors that occur during syncing by including the traceback, which can help track down the underlying cause of the error. When used with `wait_or_retry`, the full traceback is only printed after all retry attempts have failed.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
ray-project#36993 changed `Syncer.wait` to raise the syncing error rather than wrapping it in a `TuneError` in all cases except for a `TimeoutError`. This was changed because the `TuneError` wrapper was unnecessary and made the error logging confusing.

However, we do catch a `TuneError` specifically in many parts of the code that call `Syncer.wait`, so these cases now hard-fail rather than soft-failing with a warning log. This PR also fixes experiment syncing (both up and down) to handle syncing errors in a standard way.

Here's the current handling of syncing errors to cloud:
1. Driver background syncing to and from cloud: Soft fail on `TimeoutError` --> Soft fail on general exceptions (this is the change from the PR)
2. Trainable checkpoint/artifact syncing: Soft fail on all syncing errors after retrying 2 times. (no change)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
… failures (ray-project#36993)

This PR improves observability into errors that occur during syncing by including the traceback, which can help track down the underlying cause of the error. When used with `wait_or_retry`, the full traceback is only printed after all retry attempts have failed.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
ray-project#36993 changed `Syncer.wait` to raise the syncing error rather than wrapping it in a `TuneError` in all cases except for a `TimeoutError`. This was changed because the `TuneError` wrapper was unnecessary and made the error logging confusing.

However, we do catch a `TuneError` specifically in many parts of the code that call `Syncer.wait`, so these cases now hard-fail rather than soft-failing with a warning log. This PR also fixes experiment syncing (both up and down) to handle syncing errors in a standard way.

Here's the current handling of syncing errors to cloud:
1. Driver background syncing to and from cloud: Soft fail on `TimeoutError` --> Soft fail on general exceptions (this is the change from the PR)
2. Trainable checkpoint/artifact syncing: Soft fail on all syncing errors after retrying 2 times. (no change)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>