Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AIR] More checkpoint configurability, Result extension #25943

Merged
merged 33 commits into from
Jun 29, 2022

Conversation

Yard1
Copy link
Member

@Yard1 Yard1 commented Jun 20, 2022

Why are these changes needed?

This PR:

  • Allows the user to set keep_checkpoints_num and checkpoint_score_attr in RunConfig using the CheckpointStrategy dataclass
  • Adds two new fields to the Result object - best_checkpoints - a list of saved best checkpoints as determined by CheckpointingConfig.

Related issue number

Closes #24868

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@Yard1 Yard1 added this to the Ray AIR milestone Jun 20, 2022
@Yard1 Yard1 mentioned this pull request Jun 21, 2022
6 tasks
python/ray/air/config.py Outdated Show resolved Hide resolved
python/ray/air/result.py Outdated Show resolved Hide resolved
python/ray/air/result.py Show resolved Hide resolved
python/ray/air/config.py Outdated Show resolved Hide resolved
python/ray/air/result.py Outdated Show resolved Hide resolved
Copy link
Contributor

@amogkam amogkam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Yard1!

I'm wondering if we can simplify the API though. As a user all I need is to

  1. Get all reported metrics ordered by iteration
  2. Get all checkpoint paths and its corresponding metrics

If I have these then it's pretty easy for me to calculate the best metric and best checkpoint based on whatever attribute I want without needing to be concerned about checkpoint config or tune config.

If the resulting dataframe contains checkpoint paths, would it be sufficient to expose that?

python/ray/air/config.py Outdated Show resolved Hide resolved
python/ray/tune/result_grid.py Outdated Show resolved Hide resolved
python/ray/air/result.py Outdated Show resolved Hide resolved
python/ray/air/result.py Outdated Show resolved Hide resolved
python/ray/air/result.py Outdated Show resolved Hide resolved
@Yard1
Copy link
Member Author

Yard1 commented Jun 21, 2022

@amogkam ok, let me see

@Yard1
Copy link
Member Author

Yard1 commented Jun 21, 2022

@amogkam @xwjiang2010 simplified, PTAL

@xwjiang2010 There was an oversight in function runner, where the duplicate result would get checkpointed again. I have disabled that in order for checkpoint_history to have the correct number of checkpoints, but let's see if it doesn't break CI somewhere else.

@Yard1 Yard1 requested a review from ericl as a code owner June 24, 2022 21:33
python/ray/train/__init__.py Outdated Show resolved Hide resolved
Copy link
Contributor

@krfricke krfricke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost there

python/ray/util/ml_utils/checkpoint_manager.py Outdated Show resolved Hide resolved
python/ray/util/ml_utils/checkpoint_manager.py Outdated Show resolved Hide resolved
f.write(json.dumps(dict(x=config["x"], step=i)))
tune.report(x=config["x"], step=i)

analysis = tune.run(f, config={"x": tune.grid_search([1, 3])})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

out of scope here but we should really find a way to construct experiment checkpoints without always running full runs for testing.

@Yard1 Yard1 requested a review from krfricke June 28, 2022 18:48
Copy link
Contributor

@krfricke krfricke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks!

@krfricke krfricke merged commit dc7ed08 into ray-project:master Jun 29, 2022
@Yard1 Yard1 deleted the more_checkpoint_configurability branch June 29, 2022 15:41
amogkam pushed a commit that referenced this pull request Jul 7, 2022
Uses the new AIR Train API for examples and tests.

The `Result` object gets a new attribute - `log_dir`, pointing to the Trial's `logdir` allowing users to access tensorboard logs and artifacts of other loggers.

This PR only deals with "low hanging fruit" - tests that need substantial rewriting or Train user guide are not touched. Those will be updated in followup PRs.

Tests and examples that concern deprecated features or which are duplicated in AIR have been removed or disabled.

Requires #25943 to be merged in first
edoakes pushed a commit to edoakes/ray that referenced this pull request Jul 14, 2022
…t#25943)

This PR:
* Allows the user to set `keep_checkpoints_num` and `checkpoint_score_attr` in `RunConfig` using the `CheckpointStrategy` dataclass
* Adds two new fields to the `Result` object - `best_checkpoints` - a list of saved best checkpoints as determined by `CheckpointingConfig`.

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
edoakes pushed a commit to edoakes/ray that referenced this pull request Jul 14, 2022
Uses the new AIR Train API for examples and tests.

The `Result` object gets a new attribute - `log_dir`, pointing to the Trial's `logdir` allowing users to access tensorboard logs and artifacts of other loggers.

This PR only deals with "low hanging fruit" - tests that need substantial rewriting or Train user guide are not touched. Those will be updated in followup PRs.

Tests and examples that concern deprecated features or which are duplicated in AIR have been removed or disabled.

Requires ray-project#25943 to be merged in first

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
nikitavemuri pushed a commit to nikitavemuri/ray that referenced this pull request Jul 15, 2022
Uses the new AIR Train API for examples and tests.

The `Result` object gets a new attribute - `log_dir`, pointing to the Trial's `logdir` allowing users to access tensorboard logs and artifacts of other loggers.

This PR only deals with "low hanging fruit" - tests that need substantial rewriting or Train user guide are not touched. Those will be updated in followup PRs.

Tests and examples that concern deprecated features or which are duplicated in AIR have been removed or disabled.

Requires ray-project#25943 to be merged in first

Signed-off-by: Nikita Vemuri <nikitavemuri@gmail.com>
Catch-Bull pushed a commit to alipay/ant-ray that referenced this pull request Jul 18, 2022
Uses the new AIR Train API for examples and tests.

The `Result` object gets a new attribute - `log_dir`, pointing to the Trial's `logdir` allowing users to access tensorboard logs and artifacts of other loggers.

This PR only deals with "low hanging fruit" - tests that need substantial rewriting or Train user guide are not touched. Those will be updated in followup PRs.

Tests and examples that concern deprecated features or which are duplicated in AIR have been removed or disabled.

Requires ray-project#25943 to be merged in first

Signed-off-by: Catch-Bull <burglarralgrub@gmail.com>
Catch-Bull pushed a commit to alipay/ant-ray that referenced this pull request Jul 18, 2022
Uses the new AIR Train API for examples and tests.

The `Result` object gets a new attribute - `log_dir`, pointing to the Trial's `logdir` allowing users to access tensorboard logs and artifacts of other loggers.

This PR only deals with "low hanging fruit" - tests that need substantial rewriting or Train user guide are not touched. Those will be updated in followup PRs.

Tests and examples that concern deprecated features or which are duplicated in AIR have been removed or disabled.

Requires ray-project#25943 to be merged in first

Signed-off-by: Catch-Bull <burglarralgrub@gmail.com>
klwuibm pushed a commit to yuanchi2807/ray that referenced this pull request Jul 27, 2022
…t#25943)

This PR:
* Allows the user to set `keep_checkpoints_num` and `checkpoint_score_attr` in `RunConfig` using the `CheckpointStrategy` dataclass
* Adds two new fields to the `Result` object - `best_checkpoints` - a list of saved best checkpoints as determined by `CheckpointingConfig`.

Signed-off-by: klwuibm <kwu888@gmail.com>
klwuibm pushed a commit to yuanchi2807/ray that referenced this pull request Jul 27, 2022
Uses the new AIR Train API for examples and tests.

The `Result` object gets a new attribute - `log_dir`, pointing to the Trial's `logdir` allowing users to access tensorboard logs and artifacts of other loggers.

This PR only deals with "low hanging fruit" - tests that need substantial rewriting or Train user guide are not touched. Those will be updated in followup PRs.

Tests and examples that concern deprecated features or which are duplicated in AIR have been removed or disabled.

Requires ray-project#25943 to be merged in first

Signed-off-by: klwuibm <kwu888@gmail.com>
franklsf95 pushed a commit to franklsf95/ray that referenced this pull request Aug 2, 2022
…t#25943)

This PR:
* Allows the user to set `keep_checkpoints_num` and `checkpoint_score_attr` in `RunConfig` using the `CheckpointStrategy` dataclass
* Adds two new fields to the `Result` object - `best_checkpoints` - a list of saved best checkpoints as determined by `CheckpointingConfig`.

Signed-off-by: Frank Luan <lsf@berkeley.edu>
franklsf95 pushed a commit to franklsf95/ray that referenced this pull request Aug 2, 2022
Uses the new AIR Train API for examples and tests.

The `Result` object gets a new attribute - `log_dir`, pointing to the Trial's `logdir` allowing users to access tensorboard logs and artifacts of other loggers.

This PR only deals with "low hanging fruit" - tests that need substantial rewriting or Train user guide are not touched. Those will be updated in followup PRs.

Tests and examples that concern deprecated features or which are duplicated in AIR have been removed or disabled.

Requires ray-project#25943 to be merged in first

Signed-off-by: Frank Luan <lsf@berkeley.edu>
gramhagen pushed a commit to gramhagen/ray that referenced this pull request Aug 15, 2022
Uses the new AIR Train API for examples and tests.

The `Result` object gets a new attribute - `log_dir`, pointing to the Trial's `logdir` allowing users to access tensorboard logs and artifacts of other loggers.

This PR only deals with "low hanging fruit" - tests that need substantial rewriting or Train user guide are not touched. Those will be updated in followup PRs.

Tests and examples that concern deprecated features or which are duplicated in AIR have been removed or disabled.

Requires ray-project#25943 to be merged in first

Signed-off-by: Scott Graham <scgraham@microsoft.com>
gramhagen pushed a commit to gramhagen/ray that referenced this pull request Aug 15, 2022
Uses the new AIR Train API for examples and tests.

The `Result` object gets a new attribute - `log_dir`, pointing to the Trial's `logdir` allowing users to access tensorboard logs and artifacts of other loggers.

This PR only deals with "low hanging fruit" - tests that need substantial rewriting or Train user guide are not touched. Those will be updated in followup PRs.

Tests and examples that concern deprecated features or which are duplicated in AIR have been removed or disabled.

Requires ray-project#25943 to be merged in first
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022
Uses the new AIR Train API for examples and tests.

The `Result` object gets a new attribute - `log_dir`, pointing to the Trial's `logdir` allowing users to access tensorboard logs and artifacts of other loggers.

This PR only deals with "low hanging fruit" - tests that need substantial rewriting or Train user guide are not touched. Those will be updated in followup PRs.

Tests and examples that concern deprecated features or which are duplicated in AIR have been removed or disabled.

Requires ray-project#25943 to be merged in first

Signed-off-by: Stefan van der Kleij <s.vanderkleij@viroteq.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[AIR] More checkpoint configurability
5 participants