-
Notifications
You must be signed in to change notification settings - Fork 361
Refactor profiler in trainers #1833
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
84d842c
to
3d41873
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@parambole @Rohan-Bierneni do you know what is causing this message in maxtext_jax_ai_image.Dockerfile
? I am seeing it in other PRs also so it is not specific to this PR
Default value for global ARG results in an empty or invalid base image name
InvalidDefaultArgInFrom: Default value for ARG $JAX_AI_IMAGE_BASEIMAGE results in empty or invalid base image name
More info: https://docs.docker.com/go/dockerfile/rule/invalid-default-arg-in-from/
This warning appears because there is no default value defined for |
As @SurbhiJainUSC pointed out, since the JAX AI Image Dockerfile uses a base image that is specified via a --build-arg, when this is not passed to the docker build command this error would be thrown. We could set default value for the arg JAX_AI_IMAGE_BASEIMAGE but it could become an old image for the default value if it doesn't get updated periodically. |
Thanks @Rohan-Bierneni and @SurbhiJainUSC. @Rohan-Bierneni Do you know how we can get the warning to stop showing up on unrelated PRs? |
I believe the warning comes from the tests that run on every PR. This dockerfile is used to build the docker image for every PR to run the tests. So, I don't think so we can stop this warning unless we assign any default value as @Rohan-Bierneni mentioned. |
@SurbhiJainUSC @bvandermoon To summarize, the issue stems from the JAX AI Image Dockerfile requiring a specific docker image to be specified that gets used as a base-image. However, Github has a "beta" feature called "unchanged files with check annotations" that flags this Dockerfile since there is a --build-arg which doesn't have a default value i.e $JAX_AI_BASE_IMAGE. We can either do follow the Github Annotations rules and set a default value so that this Dockerfile doesn't get flagged, or we can disable the "Unchanged files with Check Annotations" feature. I found a discussion post on how to do this. It would require a change in the workflow file: https://github.com/orgs/community/discussions/8379#discussioncomment-11943317 @bvandermoon which option would be best here. And if going with option 1 what default value for the JAX AI Image would be best? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Just one question which is outside of the scope of this PR - would it be cleaner to just use the config values directly for fields that directly set an equivalent field? Not sure how much value there is in a line like this:
self.profile_cleanly = config.profile_cleanly
Let me know if there is some convention here for not wanting to set the full config as a self variable or something
Agree with you both that we don't want to include a default since it will just go out of date. Is there a way to set an ignore flag like we do for linting? maxtext/MaxText/checkpointing.py Line 40 in f827a03
If not, I guess we can disable the feature in the workflow. Hopefully there is a way to just ignore though |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! one more step closer to clean trainer files!
Description
This PR refactors profiler code that we have in all trainers.
Tests
Verified that the profile is uploaded successfully on TensorBoard:

Checklist
Before submitting this PR, please make sure (put X in square brackets):