[part 1/2] [train] Add metadata argument to Trainer #38481
Signed-off-by: Eric Liang <ekhliang@gmail.com>
python/ray/air/util/check_ingest.py
@@ -69,6 +71,8 @@ def make_train_loop(
    def train_loop_per_worker():
        import pandas as pd

        print("Session metadata", train.get_context().get_metadata())
This just updates the example class to use the new metadata API.
def train_func(config):
    assert metadata, metadata
    # Propagate user metadata from the Trainer constructor.
I'm not very happy about this hack, but it's much cleaner than trying to propagate this dict through all the tune function wrapper layers.
python/ray/train/base_trainer.py
try:
    self.metadata = json.loads(json.dumps(self.metadata))
except Exception as e:
    raise ValueError(
Very nice!
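For readers following the thread, the round-trip check in the hunk above can be sketched in isolation. This is a minimal standalone illustration, not the PR's code: the helper name `validate_metadata` and the error message are hypothetical, and only the `json.loads(json.dumps(...))` idiom is taken from the diff.

```python
import json


def validate_metadata(metadata):
    """Return a JSON-normalized copy of metadata, or raise ValueError.

    Hypothetical helper; the name and error text are illustrative only.
    """
    try:
        # Round-tripping through JSON checks serializability and also
        # returns a deep copy detached from the caller's object.
        return json.loads(json.dumps(metadata))
    except Exception as e:
        raise ValueError("metadata must be JSON-serializable") from e
```

One side effect of this idiom worth knowing: values are normalized to JSON types, so a tuple comes back as a list, while a non-serializable value such as a set is rejected.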
Probably @justinvyu can comment more on the implementation (though it looks very simple) -- the API and tests look great to me :)
Thanks, generally looks good to me. Just one suggestion about the multi-rank metadata setting and some nits.
# Set additional user metadata from the Trainer.
if persisted_checkpoint and self.metadata:
    user_metadata = persisted_checkpoint.get_metadata()
    for k, v in self.metadata.items():
        # Update keys not already set by the user. This gives user-set keys
        # precedence over keys set at the Trainer level.
        if k not in user_metadata:
            user_metadata[k] = v
    persisted_checkpoint.set_metadata(user_metadata)
We will be setting the metadata many times here, once for each worker. Can we guard this with a rank check (e.g. only set metadata on rank 0 worker) and only set metadata once?
The other caveat here is that other trainers (xgb, lgbm, sklearn) don't have Train workers calling train.report, so we can't access train.get_context().get_world_rank() there.
Hmm, I feel like this would be more brittle given we are now supporting not reporting from rank 0. I don't think the performance impact here is measurable.
Ok, not too big of a deal
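The precedence rule discussed in this thread can be sketched with plain dicts. This is an illustration under stated assumptions, not the PR's implementation: `merge_trainer_metadata` is a hypothetical name, and the checkpoint object is replaced by an ordinary dict.

```python
def merge_trainer_metadata(checkpoint_metadata, trainer_metadata):
    """Merge Trainer-level metadata into checkpoint metadata.

    Hypothetical helper: keys already present in the checkpoint metadata
    (set by the user) win over keys set at the Trainer level.
    """
    merged = dict(checkpoint_metadata)  # copy so the input is not mutated
    for k, v in trainer_metadata.items():
        if k not in merged:
            merged[k] = v
    return merged
```

Because existing keys are never overwritten, applying the merge a second time with the same Trainer metadata returns the same dict. That only speaks to the merged result being stable across repeated calls; it says nothing about the cost or concurrency of repeated writes to storage.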
Thanks, lgtm!
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
Signed-off-by: Victor <vctr.y.m@example.com>
Why are these changes needed?
This implements the metadata argument for Trainer. In part 2, I'll add some docs. I'm splitting out part 2 since merging just part 1 will unblock other issues.
Related issue number
Part of #38288