Skip to content

Refactor profiler in trainers #1833

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Conversation

SurbhiJainUSC
Copy link
Collaborator

@SurbhiJainUSC SurbhiJainUSC commented Jun 14, 2025

Description

This PR refactors profiler code that we have in all trainers.

Tests

Verified that the profile is uploaded successfully on TensorBoard:
image

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed.

Copy link
Collaborator

@bvandermoon bvandermoon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@parambole @Rohan-Bierneni do you know what is causing this message in maxtext_jax_ai_image.Dockerfile? I am seeing it in other PRs also so it is not specific to this PR

Default value for global ARG results in an empty or invalid base image name

InvalidDefaultArgInFrom: Default value for ARG $JAX_AI_IMAGE_BASEIMAGE results in empty or invalid base image name
More info: https://docs.docker.com/go/dockerfile/rule/invalid-default-arg-in-from/

@SurbhiJainUSC
Copy link
Collaborator Author

@parambole @Rohan-Bierneni do you know what is causing this message in maxtext_jax_ai_image.Dockerfile? I am seeing it in other PRs also so it is not specific to this PR

Default value for global ARG results in an empty or invalid base image name

InvalidDefaultArgInFrom: Default value for ARG $JAX_AI_IMAGE_BASEIMAGE results in empty or invalid base image name
More info: https://docs.docker.com/go/dockerfile/rule/invalid-default-arg-in-from/

This warning appears because there is no default value defined for JAX_AI_IMAGE_BASEIMAGE. If the intent is to always explicitly provide the value for JAX_AI_IMAGE_BASEIMAGE while building the docker, then Dockerfile is technically correct and we can ignore the warning.

@Rohan-Bierneni
Copy link
Collaborator

@parambole @Rohan-Bierneni do you know what is causing this message in maxtext_jax_ai_image.Dockerfile? I am seeing it in other PRs also so it is not specific to this PR

Default value for global ARG results in an empty or invalid base image name

InvalidDefaultArgInFrom: Default value for ARG $JAX_AI_IMAGE_BASEIMAGE results in empty or invalid base image name
More info: https://docs.docker.com/go/dockerfile/rule/invalid-default-arg-in-from/

As @SurbhiJainUSC pointed out, since the JAX AI Image Dockerfile uses a base image that is specified via a --build-arg, when this is not passed to the docker build command this error would be thrown. We could set default value for the arg JAX_AI_IMAGE_BASEIMAGE but it could become an old image for the default value if it doesn't get updated periodically.

@bvandermoon
Copy link
Collaborator

bvandermoon commented Jun 16, 2025

@parambole @Rohan-Bierneni do you know what is causing this message in maxtext_jax_ai_image.Dockerfile? I am seeing it in other PRs also so it is not specific to this PR

Default value for global ARG results in an empty or invalid base image name

InvalidDefaultArgInFrom: Default value for ARG $JAX_AI_IMAGE_BASEIMAGE results in empty or invalid base image name
More info: https://docs.docker.com/go/dockerfile/rule/invalid-default-arg-in-from/

As @SurbhiJainUSC pointed out, since the JAX AI Image Dockerfile uses a base image that is specified via a --build-arg, when this is not passed to the docker build command this error would be thrown. We could set default value for the arg JAX_AI_IMAGE_BASEIMAGE but it could become an old image for the default value if it doesn't get updated periodically.

Thanks @Rohan-Bierneni and @SurbhiJainUSC. @Rohan-Bierneni Do you know how we can get the warning to stop showing up on unrelated PRs?

@SurbhiJainUSC
Copy link
Collaborator Author

@parambole @Rohan-Bierneni do you know what is causing this message in maxtext_jax_ai_image.Dockerfile? I am seeing it in other PRs also so it is not specific to this PR

Default value for global ARG results in an empty or invalid base image name

InvalidDefaultArgInFrom: Default value for ARG $JAX_AI_IMAGE_BASEIMAGE results in empty or invalid base image name
More info: https://docs.docker.com/go/dockerfile/rule/invalid-default-arg-in-from/

As @SurbhiJainUSC pointed out, since the JAX AI Image Dockerfile uses a base image that is specified via a --build-arg, when this is not passed to the docker build command this error would be thrown. We could set default value for the arg JAX_AI_IMAGE_BASEIMAGE but it could become an old image for the default value if it doesn't get updated periodically.

Thanks @Rohan-Bierneni and @SurbhiJainUSC. @Rohan-Bierneni Do you know how we can get the warning to stop showing up on unrelated PRs?

I believe the warning comes from the tests that run on every PR. This dockerfile is used to build the docker image for every PR to run the tests. So, I don't think so we can stop this warning unless we assign any default value as @Rohan-Bierneni mentioned.

@Rohan-Bierneni
Copy link
Collaborator

@parambole @Rohan-Bierneni do you know what is causing this message in maxtext_jax_ai_image.Dockerfile? I am seeing it in other PRs also so it is not specific to this PR

Default value for global ARG results in an empty or invalid base image name

InvalidDefaultArgInFrom: Default value for ARG $JAX_AI_IMAGE_BASEIMAGE results in empty or invalid base image name
More info: https://docs.docker.com/go/dockerfile/rule/invalid-default-arg-in-from/

As @SurbhiJainUSC pointed out, since the JAX AI Image Dockerfile uses a base image that is specified via a --build-arg, when this is not passed to the docker build command this error would be thrown. We could set default value for the arg JAX_AI_IMAGE_BASEIMAGE but it could become an old image for the default value if it doesn't get updated periodically.

Thanks @Rohan-Bierneni and @SurbhiJainUSC. @Rohan-Bierneni Do you know how we can get the warning to stop showing up on unrelated PRs?

I believe the warning comes from the tests that run on every PR. This dockerfile is used to build the docker image for every PR to run the tests. So, I don't think so we can stop this warning unless we assign any default value as @Rohan-Bierneni mentioned.

@SurbhiJainUSC @bvandermoon To summarize, the issue stems from the JAX AI Image Dockerfile requiring a specific docker image to be specified that gets used as a base-image. However, Github has a "beta" feature called "unchanged files with check annotations" that flags this Dockerfile since there is a --build-arg which doesn't have a default value i.e $JAX_AI_BASE_IMAGE.

We can either do follow the Github Annotations rules and set a default value so that this Dockerfile doesn't get flagged, or we can disable the "Unchanged files with Check Annotations" feature. I found a discussion post on how to do this. It would require a change in the workflow file: https://github.com/orgs/community/discussions/8379#discussioncomment-11943317

@bvandermoon which option would be best here. And if going with option 1 what default value for the JAX AI Image would be best?

Copy link
Collaborator

@bvandermoon bvandermoon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Just one question which is outside of the scope of this PR - would it be cleaner to just use the config values directly for fields that directly set an equivalent field? Not sure how much value there is in a line like this:

self.profile_cleanly = config.profile_cleanly

Let me know if there is some convention here for not wanting to set the full config as a self variable or something

@bvandermoon
Copy link
Collaborator

bvandermoon commented Jun 18, 2025

@parambole @Rohan-Bierneni do you know what is causing this message in maxtext_jax_ai_image.Dockerfile? I am seeing it in other PRs also so it is not specific to this PR

Default value for global ARG results in an empty or invalid base image name

InvalidDefaultArgInFrom: Default value for ARG $JAX_AI_IMAGE_BASEIMAGE results in empty or invalid base image name
More info: https://docs.docker.com/go/dockerfile/rule/invalid-default-arg-in-from/

As @SurbhiJainUSC pointed out, since the JAX AI Image Dockerfile uses a base image that is specified via a --build-arg, when this is not passed to the docker build command this error would be thrown. We could set default value for the arg JAX_AI_IMAGE_BASEIMAGE but it could become an old image for the default value if it doesn't get updated periodically.

Thanks @Rohan-Bierneni and @SurbhiJainUSC. @Rohan-Bierneni Do you know how we can get the warning to stop showing up on unrelated PRs?

I believe the warning comes from the tests that run on every PR. This dockerfile is used to build the docker image for every PR to run the tests. So, I don't think so we can stop this warning unless we assign any default value as @Rohan-Bierneni mentioned.

@SurbhiJainUSC @bvandermoon To summarize, the issue stems from the JAX AI Image Dockerfile requiring a specific docker image to be specified that gets used as a base-image. However, Github has a "beta" feature called "unchanged files with check annotations" that flags this Dockerfile since there is a --build-arg which doesn't have a default value i.e $JAX_AI_BASE_IMAGE.

We can either do follow the Github Annotations rules and set a default value so that this Dockerfile doesn't get flagged, or we can disable the "Unchanged files with Check Annotations" feature. I found a discussion post on how to do this. It would require a change in the workflow file: https://github.com/orgs/community/discussions/8379#discussioncomment-11943317

@bvandermoon which option would be best here. And if going with option 1 what default value for the JAX AI Image would be best?

Agree with you both that we don't want to include a default since it will just go out of date.

Is there a way to set an ignore flag like we do for linting?

# pylint: disable=too-many-positional-arguments

If not, I guess we can disable the feature in the workflow. Hopefully there is a way to just ignore though

Copy link
Collaborator

@khatwanimohit khatwanimohit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! one more step closer to clean trainer files!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants