Refactor: Recording and logging training and evaluation metrics in all trainers #1815
Conversation
# Write train config params, num model params, and XLA flags to tensorboard
max_utils.add_text_to_summary_writer("num_model_parameters", str(num_model_parameters), writer)
max_utils.add_text_to_summary_writer("libtpu_init_args", os.environ["LIBTPU_INIT_ARGS"], writer)
maxtext_utils.add_config_to_summary_writer(config, writer)
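For context, here is a plausible minimal sketch of what a helper like `add_text_to_summary_writer` could look like, assuming it simply wraps the summary writer's `add_text` on the lead host; the real `max_utils` implementation may differ.

```python
import jax

def add_text_to_summary_writer(key, value, summary_writer):
  """Record a string value under `key`, only on the lead host."""
  if jax.process_index() == 0 and summary_writer is not None:
    summary_writer.add_text(key, value)
```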
In order to verify that the logs are identical to the version prior to refactoring, were you able to write out the logs and then diff them?
Yes, I have verified that the logs are the same on TensorBoard before and after the refactoring.
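For reference, one way to make this comparison programmatic is to load the scalar events from both run directories and diff them. Here is a minimal sketch using TensorBoard's `EventAccumulator`; the log directory paths are placeholders:

```python
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

def load_scalars(logdir):
  """Load every scalar series from a TensorBoard log directory."""
  acc = EventAccumulator(logdir)
  acc.Reload()
  return {tag: [(e.step, e.value) for e in acc.Scalars(tag)]
          for tag in acc.Tags()["scalars"]}

before = load_scalars("logs/before_refactor")  # placeholder paths
after = load_scalars("logs/after_refactor")
assert before.keys() == after.keys(), "scalar tag sets differ"
for tag in before:
  assert before[tag] == after[tag], f"values differ for tag {tag}"
```

(Exact float equality is reasonable here only because the two runs are expected to produce identical logs.)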
@@ -143,18 +131,7 @@ def train_loop(config, recorder, state=None):
last_profiling_step = prof.finished_initial_profile_step

example_batch = None
last_step_completion = datetime.datetime.now()
But previously `last_step_completion` was calculated here, after the `write_setup_info_to_tensorboard` work was done. Am I reading this right? In fact, I think it should be calculated a bit later, right before the train step begins.
It is used to calculate `step_time_delta`. `step_time_delta` for step=n is (time when train step n is completed) - (time when train step n-1 is completed). For step>1, calculating `last_step_completion` in `metric_logger.record_train_metrics()` makes sense because it is called just after the training step is completed. However, it is unclear to me what the right spot is for calculating `last_step_completion` for the first step. Also, how do we interpret `step_time_delta` for step=0? What is your opinion, @A9isha?
After looking into it more, I have moved the calculation of `last_step_completion` for the first step to just before we start the training run. This aligns with what you were suggesting previously.
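To make the timing concrete, here is a runnable toy sketch of the bookkeeping described above; `train_step` and `record_train_metrics` are stand-ins, not the actual MaxText functions:

```python
import datetime
import time

def train_step(step):
  """Stand-in for the real train step; just burns a little time."""
  time.sleep(0.01)
  return {"loss": 0.0}

def record_train_metrics(metrics, step, step_time_delta):
  print(f"step={step} step_time_delta={step_time_delta} metrics={metrics}")

# Initialized just before the training run starts, so step_time_delta for the
# first step measures time from the start of the run to that step's completion.
last_step_completion = datetime.datetime.now()

for step in range(3):
  metrics = train_step(step)
  now = datetime.datetime.now()
  # step_time_delta for step n = completion(n) - completion(n-1).
  step_time_delta = now - last_step_completion
  last_step_completion = now
  record_train_metrics(metrics, step, step_time_delta)
```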
Here are the training logs.
Description
This PR removes `writer` from the list of values returned by `setup_mesh_and_model`. This simplifies `setup_mesh_and_model`, as it is unnecessary to create a TensorBoard summary writer object in that method. Additionally, this PR refactors the metrics-related code in all trainers (train.py, sft_trainer.py, grpo_trainer.py, elastic_train.py).

Thanks to @richjames0 for contributing!
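Illustratively, the shape of the change is sketched below; the exact names and ordering of the values returned by `setup_mesh_and_model` are assumed for this sketch and may differ in MaxText:

```python
def setup_mesh_and_model(config):
  """Stub standing in for the real function, which builds the mesh, model,
  optimizer, etc. Post-refactor, it no longer creates or returns a writer."""
  init_rng = checkpoint_manager = mesh = model = learning_rate_schedule = tx = None
  return init_rng, checkpoint_manager, mesh, model, learning_rate_schedule, tx

# Before this PR (assumed ordering): the function also created a TensorBoard
# summary writer and returned it as an extra element of the tuple:
#   init_rng, writer, checkpoint_manager, mesh, ... = setup_mesh_and_model(config)
# After this PR: callers unpack one fewer value, and the metrics-logging code
# owns the summary writer instead.
init_rng, checkpoint_manager, mesh, model, learning_rate_schedule, tx = (
    setup_mesh_and_model(config={}))
```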
Tests
E2E testing with train.py and sft_trainer.py: verified that the metrics are correctly uploaded to TensorBoard.
Checklist
Before submitting this PR, please make sure (put X in square brackets):