[air/output] Add callback hook for trial recovery, only print error table at end #37572
Signed-off-by: Kai Fricke <kai@anyscale.com>
# Conflicts: # python/ray/tune/experimental/output.py
Looks great! A few questions:
On end, we get the error table.
I should have clarified this - the regular error stacktrace is still streamed from the training worker to stderr!
New errors are automatically appended to error.txt, so all previously caught errors are kept.
Thanks! I am a fan of the footnote :)
Thanks Kai. LGTM!
…able at end (ray-project#37572) This changes the context-aware output handler so that trial errors are immediately reported with their respective error files. The error table is only printed at the end. This introduces a new callback hook, `on_trial_recover` which is required so error files are also available in the immediate output when a trial has a transient failure. Signed-off-by: Kai Fricke <kai@anyscale.com> Signed-off-by: NripeshN <nn2012@hw.ac.uk>
Why are these changes needed?
Includes #37571
This changes the context-aware output handler so that trial errors are immediately reported with their respective error files. The error table is only printed at the end.
This introduces a new callback hook, `on_trial_recover`, which is required so error files are also available in the immediate output when a trial has a transient failure.
Two alternatives here are to either rely on the trial error handling to output the error file, or to not output the error file and wait for a full failure (`on_trial_error`) or experiment end to show error files for transient errors.
During training:
Only at end:
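To illustrate the behavior described above, here is a minimal standalone sketch of a callback that reports each error immediately and defers the summary table to the end of the experiment. It does not import Ray: `FakeTrial` and `ImmediateErrorReporter` are hypothetical names, the hook signatures are assumed to mirror the existing `ray.tune.Callback` hooks, and the message format is made up for the example. It is not the output handler's actual implementation.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class FakeTrial:
    """Hypothetical stand-in for a Tune trial (not Ray's Trial class)."""
    trial_id: str
    error_file: Optional[str] = None


class ImmediateErrorReporter:
    """Standalone sketch; a real version would subclass ray.tune.Callback."""

    def __init__(self) -> None:
        self.lines: List[str] = []           # output emitted so far
        self.failures: Counter = Counter()   # trial_id -> number of errors seen
        self.error_files: Dict[str, Optional[str]] = {}  # trial_id -> error file

    def _record(self, trial: FakeTrial, note: str) -> None:
        # Report right away, and remember the trial for the final table.
        self.failures[trial.trial_id] += 1
        self.error_files[trial.trial_id] = trial.error_file
        self.lines.append(
            f"Trial {trial.trial_id} {note}. Error file: {trial.error_file}"
        )

    def on_trial_recover(self, iteration, trials, trial, **info):
        # Transient failure: the trial is going to be retried.
        self._record(trial, "errored and will be retried")

    def on_trial_error(self, iteration, trials, trial, **info):
        # Full failure: the trial is not retried.
        self._record(trial, "errored")

    def on_experiment_end(self, trials, **info):
        # The error table is only emitted once, at the end.
        if not self.failures:
            return
        self.lines.append("Trial errors:")
        for trial_id, count in self.failures.items():
            self.lines.append(
                f"  {trial_id}: {count} failure(s), {self.error_files[trial_id]}"
            )


# Simulate one transient failure followed by a full failure.
trial = FakeTrial("trial_0", error_file="/tmp/trial_0/error.txt")
reporter = ImmediateErrorReporter()
reporter.on_trial_recover(1, [trial], trial)
reporter.on_trial_error(2, [trial], trial)
reporter.on_experiment_end([trial])
print("\n".join(reporter.lines))
```

Without a separate recover hook, the first (transient) failure in this simulation would produce no immediate line, which is the gap `on_trial_recover` is meant to close.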
Related issue number
Closes #36854
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.