[tune] move logger and syncer handling to callbacks #11699
Conversation
@@ -594,6 +441,8 @@ def _create_logger(self, config, logger_creator=None):
            self._result_logger = logger_creator(config)
            self._logdir = self._result_logger.logdir
        else:
            from ray.tune.logger import UnifiedLogger
Local import to avoid all kinds of circular import problems. Also, it seems like trainable-specific loggers are not really used? I couldn't find a case in the Ray code where it has not been instantiated with a noop logger. Should we still keep this for backwards compatibility with custom trainable loggers? Or might this be used by RLlib trainables? cc @ericl
Ah, Trainable-specific loggers are used in RLlib, so I think we need to keep this.
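The local-import trick discussed above can be shown with a small, self-contained sketch (the module here is a generic stand-in, not Ray's actual modules): deferring the import into the function body means it is resolved at call time, after both modules have finished loading, which is what breaks an import cycle.

```python
def pretty(obj):
    # Deferred (local) import: `json` is resolved when pretty() is first
    # called, not when this module is imported. If this module and the
    # imported one referenced each other at top level, moving the import
    # here would break the circular dependency.
    import json
    return json.dumps(obj, indent=2)


print(pretty({"a": 1}))
```

The trade-off is a tiny per-call lookup cost (negligible after the first call, since the module is cached in `sys.modules`) in exchange for a simpler import graph.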
@@ -618,6 +436,8 @@ def add_trial(self, trial):
        self.trial_executor.try_checkpoint_metadata(trial)

    def debug_string(self, delim="\n"):
        from ray.tune.progress_reporter import trial_progress_str
Avoid circular imports. Also makes sense here because debug_string is just here for, well, debugging.
Nice work! Left a couple of comments, mainly concerned with:
- backwards compat
- proper functionality

Also, I'm not sure if we are flushing logs properly - would be great to verify!
@@ -103,7 +103,7 @@ You can also obtain profiling information:

.. code-block:: python

    >>> from ray.tune.logger import pretty_print
(Backwards compat) Can we keep this for this PR?
@@ -799,7 +799,7 @@ def on_trial_result(self, trial_runner: "trial_runner.TrialRunner",
        if reset_successful:
            trial_executor.restore(trial, checkpoint, block=True)
        else:
            trial_executor.stop_trial(trial, stop_logger=False)
Should we be keeping this here?
@@ -724,6 +544,7 @@ def _process_trial(self, trial):
        """
        try:
            result = self.trial_executor.fetch_result(trial)
            result.update(trial_id=trial.trial_id)
This code is getting super complicated :) we should document this (in a separate PR...)
Actually, another major concern I have is that this PR has a lot of moving parts. Is it possible to split this PR up? One strawman would be to do this in 3 parts:
Especially the event loop and other weird places (like the syncer, flushing, and closing/opening loggers) - it's a bit hard to reason about all of these changes at once. Splitting it up will reduce the chance of needing to revert or of breaking functionality.

That might be a good solution. I'll address your comments in this PR first and then create sub-PRs.
Why are these changes needed?
In order to prepare for instantiating loggers (instead of passing logger classes), we need to refactor the logger interface. This PR introduces an `ExperimentLogger` class which will replace the current (per-trial) `Logger` classes. An `ExperimentLogger` lives on the driver and handles logging for all trials. In this first PR, it serves as a per-trial wrapper of `Logger` objects, but the default loggers will be refactored into their own `ExperimentLogger` classes in a follow-up PR.

Currently, log and checkpoint syncing lives in the `UnifiedLogger` class. This has to be refactored as well, since the `UnifiedLogger` will be deprecated. It also makes more sense to separate syncing from logging. Thus, a `SyncerCallback` is introduced that takes care of syncing.

By default, default loggers and a default syncer are created.
This is a draft PR for testing and development.
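The syncing side can be sketched similarly (again, the names are assumptions for illustration, not the merged API): the trial runner dispatches trial events to registered callbacks, and a syncer callback reacts to results instead of `UnifiedLogger` performing the syncing itself.

```python
class SyncerCallbackSketch:
    """Illustrative syncer callback; records which trials it would sync."""

    def __init__(self):
        self.synced = []

    def on_trial_result(self, trial_id, result):
        # In the real PR this hook would trigger log/checkpoint syncing
        # for the trial's directory; here we only record the event.
        self.synced.append(trial_id)


def dispatch_trial_result(callbacks, trial_id, result):
    # A trial-runner-style event loop would invoke this for each result,
    # notifying every registered callback in order.
    for callback in callbacks:
        callback.on_trial_result(trial_id, result)
```

Keeping syncing in a callback like this is what lets it be separated from logging, as the description argues.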
Todos for this PR (non-exhaustive):

Todos for follow-up PR:
- Refactor default `Logger` classes to new `ExperimentLogger` classes
- `ExperimentLogger` classes

Breaking changes:
The `Experiment.loggers` spec is now deprecated. Loggers cannot be passed to `Experiment` objects, as they live as callbacks in the trial runner; per-trial or per-experiment specific loggers are thus not supported anymore.

Related issue number
Checks
- I've run `scripts/format.sh` to lint the changes in this PR.