Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[train/tune] Refactor trial metadata organization #38165

Merged
merged 28 commits into from
Aug 14, 2023

Conversation

krfricke
Copy link
Contributor

@krfricke krfricke commented Aug 7, 2023

Why are these changes needed?

The Trial object currently keeps properties with different scopes:

  1. Static properties that are set on init
  2. Static properties that are set on init but can be overwritten on restore
  3. Temporary properties that are not saved, e.g. the trial location,
  4. run metadata that is updated during training, such as the last result, the available checkpoints, etc.

This PR refactors the Trial class to explicitly capture 3) and 4) in sub classes.

Specifically, it introduces a _TemporaryTrialState class that contains temporary properties, and a _TrainingRunMetadata class that contains run metadata such as the last result, error files, and available checkpoints.

It also changes the way experiment checkpoints are saved. Specifically, we save the trial state (which contains the static properties as well as select runtime metadata (e.g. trial status) separately from the run metadata. This allows us to split these two sources of information in the future.

The changes in this PR mean that loading experiment checkpoints from runs before this change will not be possible. However, this is true for any changes to the Trial class. Support for backwards compatibility to resume experiments is only guaranteed on a patch level basis, so these changes should be fine.

A few other improvements have been made (Trial.runner is now Trial.temporary_state.ray_actor), and a lot of the changes in this PR are changes to tests.

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Kai Fricke added 18 commits August 4, 2023 12:33
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
# Conflicts:
#	python/ray/tune/experiment/trial.py
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
@krfricke krfricke marked this pull request as ready for review August 8, 2023 16:30
Kai Fricke added 2 commits August 8, 2023 19:31
Signed-off-by: Kai Fricke <kai@anyscale.com>
# Conflicts:
#	python/ray/tune/experiment/trial.py
#	python/ray/tune/result_grid.py
from ray.tune.utils.serialization import TuneFunctionEncoder, TuneFunctionDecoder


class _TrainingRunMetadata:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is just a lump of some of the fields under Trial right? And do we still think everything under Trial is driver's view of things?
I feel like putting this under trainable folder is kinda misleading or are we planning to have trainable take charge of this eventually?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Talked offline, the idea is to eventually have trainables taking care of the management of these data (low priority for 2.7)

Copy link
Contributor

@xwjiang2010 xwjiang2010 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stamp. waiting for tests to pass..

Kai Fricke added 5 commits August 11, 2023 10:12
# Conflicts:
#	python/ray/tune/analysis/experiment_analysis.py
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Kai Fricke added 3 commits August 14, 2023 10:21
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
@krfricke krfricke merged commit 1797e81 into ray-project:master Aug 14, 2023
64 of 71 checks passed
@krfricke krfricke deleted the tune/trial-in-dir branch August 14, 2023 14:05
NripeshN pushed a commit to NripeshN/ray that referenced this pull request Aug 15, 2023
The Trial object currently keeps properties with different scopes:

1. Static properties that are set **on init**
2. Static properties that are set on init but can be overwritten **on restore**
3. **Temporary** properties that are not saved, e.g. the trial location,
4. **run metadata** that is updated during training, such as the last result, the available checkpoints, etc.

This PR refactors the Trial class to explicitly capture 3) and 4) in sub classes.

Specifically, it introduces a `_TemporaryTrialState` class that contains temporary properties, and a `_TrainingRunMetadata` class that contains run metadata such as the last result, error files, and available checkpoints.

It also changes the way experiment checkpoints are saved. Specifically, we save the trial state (which contains the static properties as well as select runtime metadata (e.g. trial status) separately from the run metadata. This allows us to split these two sources of information in the future.

The changes in this PR mean that loading experiment checkpoints from runs before this change will not be possible. However, this is true for any changes to the `Trial` class. Support for backwards compatibility to resume experiments is only guaranteed on a patch level basis, so these changes should be fine.

A few other improvements have been made (`Trial.runner` is now `Trial.temporary_state.ray_actor`), and a lot of the changes in this PR are changes to tests.

Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: NripeshN <nn2012@hw.ac.uk>
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
The Trial object currently keeps properties with different scopes:

1. Static properties that are set **on init**
2. Static properties that are set on init but can be overwritten **on restore**
3. **Temporary** properties that are not saved, e.g. the trial location,
4. **run metadata** that is updated during training, such as the last result, the available checkpoints, etc.

This PR refactors the Trial class to explicitly capture 3) and 4) in sub classes.

Specifically, it introduces a `_TemporaryTrialState` class that contains temporary properties, and a `_TrainingRunMetadata` class that contains run metadata such as the last result, error files, and available checkpoints.

It also changes the way experiment checkpoints are saved. Specifically, we save the trial state (which contains the static properties as well as select runtime metadata (e.g. trial status) separately from the run metadata. This allows us to split these two sources of information in the future.

The changes in this PR mean that loading experiment checkpoints from runs before this change will not be possible. However, this is true for any changes to the `Trial` class. Support for backwards compatibility to resume experiments is only guaranteed on a patch level basis, so these changes should be fine.

A few other improvements have been made (`Trial.runner` is now `Trial.temporary_state.ray_actor`), and a lot of the changes in this PR are changes to tests.

Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: harborn <gangsheng.wu@intel.com>
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
The Trial object currently keeps properties with different scopes:

1. Static properties that are set **on init**
2. Static properties that are set on init but can be overwritten **on restore**
3. **Temporary** properties that are not saved, e.g. the trial location,
4. **run metadata** that is updated during training, such as the last result, the available checkpoints, etc.

This PR refactors the Trial class to explicitly capture 3) and 4) in sub classes.

Specifically, it introduces a `_TemporaryTrialState` class that contains temporary properties, and a `_TrainingRunMetadata` class that contains run metadata such as the last result, error files, and available checkpoints.

It also changes the way experiment checkpoints are saved. Specifically, we save the trial state (which contains the static properties as well as select runtime metadata (e.g. trial status) separately from the run metadata. This allows us to split these two sources of information in the future.

The changes in this PR mean that loading experiment checkpoints from runs before this change will not be possible. However, this is true for any changes to the `Trial` class. Support for backwards compatibility to resume experiments is only guaranteed on a patch level basis, so these changes should be fine.

A few other improvements have been made (`Trial.runner` is now `Trial.temporary_state.ray_actor`), and a lot of the changes in this PR are changes to tests.

Signed-off-by: Kai Fricke <kai@anyscale.com>
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
The Trial object currently keeps properties with different scopes:

1. Static properties that are set **on init**
2. Static properties that are set on init but can be overwritten **on restore**
3. **Temporary** properties that are not saved, e.g. the trial location,
4. **run metadata** that is updated during training, such as the last result, the available checkpoints, etc.

This PR refactors the Trial class to explicitly capture 3) and 4) in sub classes.

Specifically, it introduces a `_TemporaryTrialState` class that contains temporary properties, and a `_TrainingRunMetadata` class that contains run metadata such as the last result, error files, and available checkpoints.

It also changes the way experiment checkpoints are saved. Specifically, we save the trial state (which contains the static properties as well as select runtime metadata (e.g. trial status) separately from the run metadata. This allows us to split these two sources of information in the future.

The changes in this PR mean that loading experiment checkpoints from runs before this change will not be possible. However, this is true for any changes to the `Trial` class. Support for backwards compatibility to resume experiments is only guaranteed on a patch level basis, so these changes should be fine.

A few other improvements have been made (`Trial.runner` is now `Trial.temporary_state.ray_actor`), and a lot of the changes in this PR are changes to tests.

Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
The Trial object currently keeps properties with different scopes:

1. Static properties that are set **on init**
2. Static properties that are set on init but can be overwritten **on restore**
3. **Temporary** properties that are not saved, e.g. the trial location,
4. **run metadata** that is updated during training, such as the last result, the available checkpoints, etc.

This PR refactors the Trial class to explicitly capture 3) and 4) in sub classes.

Specifically, it introduces a `_TemporaryTrialState` class that contains temporary properties, and a `_TrainingRunMetadata` class that contains run metadata such as the last result, error files, and available checkpoints.

It also changes the way experiment checkpoints are saved. Specifically, we save the trial state (which contains the static properties as well as select runtime metadata (e.g. trial status) separately from the run metadata. This allows us to split these two sources of information in the future.

The changes in this PR mean that loading experiment checkpoints from runs before this change will not be possible. However, this is true for any changes to the `Trial` class. Support for backwards compatibility to resume experiments is only guaranteed on a patch level basis, so these changes should be fine.

A few other improvements have been made (`Trial.runner` is now `Trial.temporary_state.ray_actor`), and a lot of the changes in this PR are changes to tests.

Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Victor <vctr.y.m@example.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants