
[part 1/2] [train] Add metadata argument to Trainer #38481

Merged
11 commits merged into ray-project:master on Aug 18, 2023

Conversation

@ericl (Contributor) commented Aug 15, 2023

Why are these changes needed?

This implements the feature itself; I'll add the docs in part 2. I'm splitting out part 2 because merging part 1 on its own already unblocks other issues.

Related issue number

Part of #38288

Signed-off-by: Eric Liang <ekhliang@gmail.com>
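
For context, a minimal sketch of the API this PR adds. The metadata constructor argument and train.get_context().get_metadata() come straight from the diff below; the TorchTrainer/ScalingConfig scaffolding around them is illustrative rather than quoted from this PR.

from ray import train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    # Each worker can read back the metadata passed to the Trainer.
    print("Session metadata", train.get_context().get_metadata())

trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=2),
    metadata={"data_version": 2, "preprocessor": "standard_scaler"},  # must be JSON-serializable
)
trainer.fit()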
@@ -69,6 +71,8 @@ def make_train_loop(
    def train_loop_per_worker():
        import pandas as pd

        print("Session metadata", train.get_context().get_metadata())
@ericl (author):
This is just updating the example class to use the new metadata stuff.

Signed-off-by: Eric Liang <ekhliang@gmail.com>

def train_func(config):
    assert metadata, metadata
    # Propagate user metadata from the Trainer constructor.
@ericl (author):
I'm not very happy about this hack, but it's much cleaner than trying to propagate this dict through all the Tune function-wrapper layers.
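
For what it's worth, the underlying pattern here is plain closure capture: the inner function sees metadata from its enclosing scope, so none of the intermediate wrapper layers need a metadata parameter threaded through them. A toy sketch with illustrative names, not the actual implementation:

def make_train_fn(metadata):
    # `metadata` is captured by the closure, so the layers that later
    # call `train_fn(config)` never have to pass it along.
    def train_fn(config):
        assert metadata, metadata  # visible here without being a parameter
        return metadata
    return train_fn

fn = make_train_fn({"owner": "ericl"})
assert fn({}) == {"owner": "ericl"}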

try:
    # Round-trip through JSON to verify the metadata is JSON-serializable.
    self.metadata = json.loads(json.dumps(self.metadata))
except Exception as e:
    raise ValueError(
        # The diff is truncated here; this message is a paraphrase.
        f"Trainer metadata must be JSON-serializable: {e}"
    ) from e
A reviewer (Contributor):
Very nice!
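
To illustrate what the round-trip check buys (values here are hypothetical): serializing through JSON both validates the metadata and normalizes it to plain JSON types, so anything non-serializable fails fast in the constructor rather than later at checkpoint time.

import json

ok = {"epochs": 10, "tags": ["baseline"]}
assert json.loads(json.dumps(ok)) == ok  # serializable: round-trips unchanged

bad = {"model": object()}  # an arbitrary Python object is not JSON-serializable
try:
    json.loads(json.dumps(bad))
except TypeError as e:
    print("rejected:", e)  # "Object of type object is not JSON serializable"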

@pcmoritz (Contributor) left a comment:
Probably @justinvyu can comment more on the implementation (though it looks very simple) -- the API and tests look great to me :)

@justinvyu (Contributor) left a comment:
Thanks, generally looks good to me. Just one suggestion about the multi-rank metadata setting and some nits.

Resolved review threads (outdated):
python/ray/train/_internal/session.py
python/ray/train/base_trainer.py (4 threads)
Comment on lines +562 to +571
# Set additional user metadata from the Trainer.
if persisted_checkpoint and self.metadata:
    user_metadata = persisted_checkpoint.get_metadata()
    for k, v in self.metadata.items():
        # Update keys not already set by the user. This gives user-set keys
        # precedence over keys set at the Trainer level.
        if k not in user_metadata:
            user_metadata[k] = v
    persisted_checkpoint.set_metadata(user_metadata)
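
A toy illustration of the precedence rule above, with hypothetical values: keys the user already set on the checkpoint win over Trainer-level keys.

trainer_metadata = {"data_version": 2, "lr": 0.1}  # from the Trainer constructor
user_metadata = {"lr": 0.01}                       # already set on the checkpoint by the user

for k, v in trainer_metadata.items():
    if k not in user_metadata:
        user_metadata[k] = v

print(user_metadata)  # {'lr': 0.01, 'data_version': 2} -- the user's 'lr' is kept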

@justinvyu (Contributor) commented Aug 16, 2023:

We will be setting the metadata many times here, once for each worker. Can we guard this with a rank check (e.g. only set metadata on rank 0 worker) and only set metadata once?

The other caveat here is that other trainers (xgb, lgbm, sklearn) don't have Train workers calling train.report, so we can't access train.get_context().get_world_rank() there.
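
For concreteness, the suggested guard would look roughly like this: a sketch of a variant of the block above, subject to the caveat that non-data-parallel trainers can't call get_world_rank().

# Hypothetical rank-gated variant of the block quoted above.
if persisted_checkpoint and self.metadata:
    if train.get_context().get_world_rank() == 0:
        # Merge and set the metadata once, from the rank-0 worker only.
        user_metadata = persisted_checkpoint.get_metadata()
        for k, v in self.metadata.items():
            if k not in user_metadata:
                user_metadata[k] = v
        persisted_checkpoint.set_metadata(user_metadata)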

@ericl (author):

Hmm, I feel like this would be more brittle, given that we now support runs where rank 0 doesn't report. I don't think the performance impact here is measurable.

@justinvyu (Contributor) commented Aug 17, 2023:

OK, not too big of a deal.

Signed-off-by: Eric Liang <ekhliang@gmail.com> (three follow-up commits)
@justinvyu (Contributor) left a comment:

Thanks, lgtm!

@ericl merged commit a4b1340 into ray-project:master on Aug 18, 2023
41 of 46 checks passed
arvind-chandra pushed a commit to lmco/ray that referenced this pull request on Aug 31, 2023 (Signed-off-by: e428265 <arvind.chandramouli@lmco.com>)
vymao pushed a commit to vymao/ray that referenced this pull request on Oct 11, 2023

3 participants