[train/tune] Refactor trial metadata organization #38165

krfricke · 2023-08-07T10:39:40Z

Why are these changes needed?

The Trial object currently keeps properties with different scopes:

1. Static properties that are set **on init**
2. Static properties that are set on init but can be overwritten **on restore**
3. **Temporary** properties that are not saved, e.g. the trial location
4. **Run metadata** that is updated during training, such as the last result and the available checkpoints

This PR refactors the `Trial` class to explicitly capture 3) and 4) in separate classes.

Specifically, it introduces a `_TemporaryTrialState` class that contains temporary properties, and a `_TrainingRunMetadata` class that contains run metadata such as the last result, error files, and available checkpoints.
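The split described above can be pictured with a minimal sketch. The two class names come from this PR, but everything else is an assumption made for illustration: the field names, the `SketchTrial` stand-in, and the `__getstate__` behavior are simplified and are not the actual Ray implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class _TemporaryTrialState:
    """Properties that only matter for the live process and are never saved."""

    location: Optional[str] = None   # hypothetical field: where the trial runs
    ray_actor: Optional[Any] = None  # handle to the remote trainable, if any


@dataclass
class _TrainingRunMetadata:
    """Properties that are updated during training and kept across restores."""

    last_result: Dict[str, Any] = field(default_factory=dict)
    error_filename: Optional[str] = None             # hypothetical field name
    checkpoints: List[str] = field(default_factory=list)


class SketchTrial:
    """Illustrative stand-in for Trial; not the real class."""

    def __init__(self, trial_id: str, config: Dict[str, Any]):
        # Static properties that are set on init.
        self.trial_id = trial_id
        self.config = config
        self.status = "PENDING"
        # Scoped containers instead of loose attributes on the trial.
        self.temporary_state = _TemporaryTrialState()
        self.run_metadata = _TrainingRunMetadata()

    def __getstate__(self) -> Dict[str, Any]:
        # Assumption: temporary state is reset whenever the trial is serialized.
        state = self.__dict__.copy()
        state["temporary_state"] = _TemporaryTrialState()
        return state
```

With this layout, code that previously read an attribute directly on the trial would go through the container instead, for example `trial.temporary_state.ray_actor`.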

It also changes the way experiment checkpoints are saved. Specifically, we save the trial state (which contains the static properties as well as select runtime metadata, e.g. the trial status) separately from the run metadata. This allows us to split these two sources of information in the future.
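As a rough sketch of what saving the two pieces separately could look like, the snippet below writes the trial state and the run metadata as two independent JSON blobs. The function names, the dictionary layout, and the use of plain `json` are assumptions for illustration; the PR itself does not show the serialization code.

```python
import json
from typing import Any, Dict, Tuple


def split_trial_checkpoint(
    static: Dict[str, Any], status: str, run_metadata: Dict[str, Any]
) -> Tuple[str, str]:
    """Serialize trial state and run metadata as two independent JSON blobs."""
    # Trial state: static properties plus select runtime metadata (the status).
    trial_state = {**static, "status": status}
    return (
        json.dumps(trial_state, sort_keys=True),
        json.dumps(run_metadata, sort_keys=True),
    )


def restore_trial_checkpoint(state_json: str, metadata_json: str) -> Dict[str, Any]:
    """Recombine the two blobs into a single dictionary on restore."""
    restored = json.loads(state_json)
    restored["run_metadata"] = json.loads(metadata_json)
    return restored
```

Because the two blobs never reference each other, they could later be written to different files or locations without changing the restore logic.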

The changes in this PR mean that loading experiment checkpoints from runs before this change will not be possible. However, this is true for any change to the `Trial` class. Backwards compatibility for resuming experiments is only guaranteed on a patch-level basis, so these changes should be fine.

A few other improvements have been made (`Trial.runner` is now `Trial.temporary_state.ray_actor`), and a lot of the changes in this PR are changes to tests.

Related issue number

Checks

- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

Signed-off-by: Kai Fricke <kai@anyscale.com>


xwjiang2010 · 2023-08-09T23:55:30Z

python/ray/tune/trainable/metadata.py

+from ray.tune.utils.serialization import TuneFunctionEncoder, TuneFunctionDecoder
+
+
+class _TrainingRunMetadata:


This is just a lump of some of the fields under Trial, right? And do we still think everything under Trial is the driver's view of things?
I feel like putting this under the trainable folder is kinda misleading. Or are we planning to have the trainable take charge of this eventually?

Talked offline: the idea is to eventually have trainables take care of managing this data (low priority for 2.7).

xwjiang2010

Stamp. Waiting for tests to pass.



Kai Fricke added 18 commits August 4, 2023 12:33

trial state / ray actor

a0b416e

Signed-off-by: Kai Fricke <kai@anyscale.com>

docstring

1eefd8a

Signed-off-by: Kai Fricke <kai@anyscale.com>

Merge branch 'master' into tune/trial-in-dir

5191055

# Conflicts: # python/ray/tune/experiment/trial.py

use runtime metadata

40df921

Signed-off-by: Kai Fricke <kai@anyscale.com>

fix restoration

640a9e1

Signed-off-by: Kai Fricke <kai@anyscale.com>

get_runner_ip

e280407

Signed-off-by: Kai Fricke <kai@anyscale.com>

rename

ab85ec7

Signed-off-by: Kai Fricke <kai@anyscale.com>

last result time

01232c8

Signed-off-by: Kai Fricke <kai@anyscale.com>

move into separate file

9d9765e

Signed-off-by: Kai Fricke <kai@anyscale.com>

getstate/setstate

1beace4

Signed-off-by: Kai Fricke <kai@anyscale.com>

Fix tests

d512698

Signed-off-by: Kai Fricke <kai@anyscale.com>

Merge remote-tracking branch 'upstream/master' into tune/trial-in-dir

67f0857

Fix property access

53c4b76

Signed-off-by: Kai Fricke <kai@anyscale.com>

clear cache on status update

abb7a20

Signed-off-by: Kai Fricke <kai@anyscale.com>

Merge remote-tracking branch 'upstream/master' into tune/trial-in-dir

c6afef0

fix test

de1d32d

Signed-off-by: Kai Fricke <kai@anyscale.com>

result grid

0be5afd

Signed-off-by: Kai Fricke <kai@anyscale.com>

rename

7ee3fff

Signed-off-by: Kai Fricke <kai@anyscale.com>

krfricke marked this pull request as ready for review August 8, 2023 16:30

krfricke requested a review from xwjiang2010 August 8, 2023 16:31

krfricke assigned xwjiang2010 Aug 8, 2023

Kai Fricke added 2 commits August 8, 2023 19:31

fix trial getstate

40b6037

Signed-off-by: Kai Fricke <kai@anyscale.com>

Merge branch 'master' into tune/trial-in-dir

20afcc2

# Conflicts: # python/ray/tune/experiment/trial.py # python/ray/tune/result_grid.py

xwjiang2010 reviewed Aug 9, 2023


xwjiang2010 approved these changes Aug 10, 2023


Kai Fricke added 5 commits August 11, 2023 10:12

Merge branch 'master' into tune/trial-in-dir

7345d09

# Conflicts: # python/ray/tune/analysis/experiment_analysis.py

checkpoint config

a25491b

Signed-off-by: Kai Fricke <kai@anyscale.com>

last result

7822868

Signed-off-by: Kai Fricke <kai@anyscale.com>

fix pickle/unpickle

a27cc22

Signed-off-by: Kai Fricke <kai@anyscale.com>

Merge branch 'master' into tune/trial-in-dir

378d7a6

Kai Fricke added 3 commits August 14, 2023 10:21

merge conflicts

dd0bd22

Signed-off-by: Kai Fricke <kai@anyscale.com>

fix tests

ab65eed

Signed-off-by: Kai Fricke <kai@anyscale.com>

restoring_from

0e89900

Signed-off-by: Kai Fricke <kai@anyscale.com>

krfricke merged commit 1797e81 into ray-project:master Aug 14, 2023
64 of 71 checks passed

krfricke deleted the tune/trial-in-dir branch August 14, 2023 14:05

matthewdeng mentioned this pull request Aug 16, 2023

Release test tune_cloud_ssh_sync.aws failed #38494

Closed

justinvyu mentioned this pull request Aug 21, 2023

Release test tune_cloud_durable_upload.aws failed #38493

Closed
