Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Train][Tune] Support reading train result from cloud storage #40622

Merged
merged 12 commits into from
Nov 17, 2023

Conversation

ahmed-mahran
Copy link
Contributor

@ahmed-mahran ahmed-mahran commented Oct 24, 2023

This is to support reading Result object from remote/cloud storage systems like s3 and gcs.

Why are these changes needed?

Result can already be stored in remote/cloud storage. There should be a functionality to read it back.

result = Result.from_path("s3://storage_path/experiment_dir/trial_dir/", storage_filesystem=...)
# result.best_checkpoints

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

This is to support reading Result object from remote storage systems like s3 and gcs.
@ahmed-mahran
Copy link
Contributor Author

@woshiyyya would you take a look please

@ahmed-mahran
Copy link
Contributor Author

@justinvyu @krfricke

Signed-off-by: Ahmed Mahran <ahmahran@gmail.com>
@justinvyu
Copy link
Contributor

Thanks, this is looking really good! I'll do a review sometime later today.

@justinvyu justinvyu self-assigned this Nov 7, 2023
Copy link
Contributor

@justinvyu justinvyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the great PR 🤩

I have a few comments, and I can help you out with it.

python/ray/air/result.py Outdated Show resolved Hide resolved
python/ray/air/result.py Outdated Show resolved Hide resolved
python/ray/air/result.py Outdated Show resolved Hide resolved
python/ray/air/result.py Outdated Show resolved Hide resolved
python/ray/air/result.py Outdated Show resolved Hide resolved
python/ray/air/result.py Outdated Show resolved Hide resolved
python/ray/air/result.py Outdated Show resolved Hide resolved
python/ray/air/tests/test_result.py Outdated Show resolved Hide resolved
python/ray/air/result.py Outdated Show resolved Hide resolved
@ahmed-mahran
Copy link
Contributor Author

Thanks @justinvyu for the thorough review! I'll try to address your comments early next week.

ahmed-mahran added a commit to CodecLondon/ray that referenced this pull request Nov 13, 2023
ahmed-mahran added a commit to CodecLondon/ray that referenced this pull request Nov 13, 2023
ahmed-mahran added a commit to CodecLondon/ray that referenced this pull request Nov 13, 2023
@ahmed-mahran
Copy link
Contributor Author

@justinvyu, I think I've addressed your feedback.

Copy link
Contributor

@justinvyu justinvyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Apart from needing to fix the lint errors, I just have some small comments.

python/ray/tune/result_grid.py Outdated Show resolved Hide resolved
python/ray/air/result.py Outdated Show resolved Hide resolved
python/ray/air/result.py Outdated Show resolved Hide resolved
Comment on lines 310 to 316
def _read_file_as_str(
storage_filesystem: pyarrow.fs.FileSystem,
storage_path: str,
compression: Optional[str] = "detect",
buffer_size: Optional[int] = None,
encoding: Optional[str] = "utf-8",
) -> str:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm okay with adding this helper, but also it might be clearer to just do the filesystem operations directly? We can add this in the future if it turns out we need to read a file as text directly from cloud very often.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it would be cleaner and clearer to do it this way especially the logic is used twice (for json and csv). I was going to make it inner function in from_path function. I can move as a static method in Result like the old _validate_trial_dir.

Copy link
Contributor

@justinvyu justinvyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work! Approved and pending tests passing. 🙌

@justinvyu
Copy link
Contributor

Ok, the mock_s3_bucket_uri fixture is actually not in the conftest for ray.air.tests. Let's move this unit test to ray.train.tests instead and add the conftest as a dependency.

  1. Move
    py_test(
    to https://github.com/ray-project/ray/blob/master/python/ray/train/BUILD (sorted by alphabetical order)
  2. Add conftest as a dependency:
py_test(
    name = "test_result",
    size = "medium",
    srcs = ["tests/test_result.py"],
    tags = ["team:ml", "exclusive"],
    deps = [":ml_lib", ":conftest"]
)

ahmed-mahran added a commit to CodecLondon/ray that referenced this pull request Nov 16, 2023
ray-project#40622 (comment)

The test requires mock_s3_bucket_uri fixture which is in conftest for ray.train.tests
ray-project#40622 (comment)

The test requires mock_s3_bucket_uri fixture which is in conftest for ray.train.tests
@justinvyu
Copy link
Contributor

Sorry, it is :train_lib instead of :ml_lib 😅

@justinvyu justinvyu merged commit 7c7eaf2 into ray-project:master Nov 17, 2023
17 of 20 checks passed
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Nov 29, 2023
…oject#40622)

This is to support reading Result object from remote/cloud storage systems like s3 and gcs.

---------

Signed-off-by: Ahmed Mahran <ahmahran@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants