[tune] move logger and syncer handling to callbacks #11699
Conversation
@@ -594,6 +441,8 @@ def _create_logger(self, config, logger_creator=None):
            self._result_logger = logger_creator(config)
            self._logdir = self._result_logger.logdir
        else:
            from ray.tune.logger import UnifiedLogger
Local import to avoid all kinds of circular import problems. Also, it seems like trainable-specific loggers are not really used? I couldn't find a case in the Ray code where it has not been instantiated with a noop logger. Should we still keep this for backwards compatibility with custom trainable loggers? Or might this be used by RLlib trainables? cc @ericl
Ah, Trainable-specific loggers are used in RLlib, so I think we need to keep this.
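The local-import trick discussed above can be shown with a small, self-contained sketch (the module here is a generic stand-in, not Ray's actual modules): deferring the import into the function body means it is resolved at call time, after both modules have finished loading, which is what breaks an import cycle.

```python
def pretty(obj):
    # Deferred (local) import: `json` is resolved when pretty() is first
    # called, not when this module is imported. If this module and the
    # imported one referenced each other at top level, moving the import
    # here would break the circular dependency.
    import json
    return json.dumps(obj, indent=2)


print(pretty({"a": 1}))
```

The trade-off is a tiny per-call lookup cost (negligible after the first call, since the module is cached in `sys.modules`) in exchange for a simpler import graph.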
@@ -618,6 +436,8 @@ def add_trial(self, trial):
        self.trial_executor.try_checkpoint_metadata(trial)

    def debug_string(self, delim="\n"):
        from ray.tune.progress_reporter import trial_progress_str
Avoid circular imports. Also makes sense here because debug_string is just here for, well, debugging.
Nice work! Left a couple of comments, mainly concerned with:
- backwards compat
- proper functionality

Also, I'm not sure if we are flushing logs properly - would be great to verify!
@@ -103,7 +103,7 @@ You can also obtain profiling information:

.. code-block:: python

    >>> from ray.tune.logger import pretty_print
(Backwards compat) Can we keep this for this PR?
@@ -799,7 +799,7 @@ def on_trial_result(self, trial_runner: "trial_runner.TrialRunner",
        if reset_successful:
            trial_executor.restore(trial, checkpoint, block=True)
        else:
            trial_executor.stop_trial(trial, stop_logger=False)
Should we be keeping this here?
@@ -724,6 +544,7 @@ def _process_trial(self, trial):
        """
        try:
            result = self.trial_executor.fetch_result(trial)
            result.update(trial_id=trial.trial_id)
This code is getting super complicated :) we should document this (in a separate PR...)
Actually, another major concern I have is that this PR has a lot of moving parts. Is it possible to split this PR up? One strawman would be to do this in 3 parts:
Especially the event loop and other weird places (like the syncer, flushing, and closing/opening loggers) - it's a bit hard to reason about all of these changes at once. Splitting it up will reduce the chance of needing to revert or of breaking functionality.

That might be a good solution. I'll address your comments in this PR first and then create sub-PRs.
Why are these changes needed?
In order to prepare for instantiating loggers (instead of passing logger classes), we need to refactor the logger interface. This PR introduces an `ExperimentLogger` class which will replace the current (per-trial) `Logger` classes. An `ExperimentLogger` lives on the driver and handles logging for all trials. In this first PR, it serves as a per-trial wrapper of `Logger` objects, but the default loggers will be refactored into their own `ExperimentLogger` classes in a follow-up PR.

Currently, log and checkpoint syncing lives in the `UnifiedLogger` class. This has to be refactored as well, since the `UnifiedLogger` will be deprecated. It also makes more sense to separate syncing from logging. Thus, a `SyncerCallback` is introduced that takes care of syncing.

By default, default loggers and a default syncer are created.
This is a draft PR for testing and development.
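The syncing side can be sketched similarly (again, the names are assumptions for illustration, not the merged API): the trial runner dispatches trial events to registered callbacks, and a syncer callback reacts to results instead of `UnifiedLogger` performing the syncing itself.

```python
class SyncerCallbackSketch:
    """Illustrative syncer callback; records which trials it would sync."""

    def __init__(self):
        self.synced = []

    def on_trial_result(self, trial_id, result):
        # In the real PR this hook would trigger log/checkpoint syncing
        # for the trial's directory; here we only record the event.
        self.synced.append(trial_id)


def dispatch_trial_result(callbacks, trial_id, result):
    # A trial-runner-style event loop would invoke this for each result,
    # notifying every registered callback in order.
    for callback in callbacks:
        callback.on_trial_result(trial_id, result)
```

Keeping syncing in a callback like this is what lets it be separated from logging, as the description argues.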
Todos for this PR (non-exhaustive):

Todos for follow-up PR:
- Refactor default `Logger` classes to new `ExperimentLogger` classes
- `ExperimentLogger` classes

Breaking changes:
The `Experiment.loggers` spec is now deprecated. Loggers cannot be passed to `Experiment` objects, as they live as callbacks in the trial runner; per-trial or per-experiment specific loggers are thus not supported anymore.

Related issue number
Checks
- I've run `scripts/format.sh` to lint the changes in this PR.