@cota cota commented Dec 20, 2023

This commit can be seen as a partial revert of 63455e0cd "Unify the way in which result files are dumped" (#6162). In that commit we missed that `experiment_cfg` does not have the `accelerator_model` record. Thus, when a benchmark fails we do not include that record in the JSONL file, and resuming a run then doesn't work because the failing entry is not recognized (note that when checking whether to resume, we compare the JSONL entry against `benchmark_experiment`, which does have `accelerator_model`).

We could fix this in two ways: (1) always save `benchmark_experiment`, not only on success, or (2) add `accelerator_model` to `experiment_config`.

I've chosen to go with (1), since that's what we were doing before 63455e0.

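For illustration, here is a minimal Python sketch of the failure mode and of fix (1). The function names, dict layout, and JSONL schema are assumptions for this sketch, not the actual pytorch/xla benchmark runner API:

```python
import json

# Hypothetical sketch of the resume check described above; names and the
# JSONL layout are illustrative, not the real benchmark runner code.

def load_completed_experiments(jsonl_path):
    """Collect the "experiment" records already dumped to the JSONL file."""
    completed = []
    with open(jsonl_path) as f:
        for line in f:
            completed.append(json.loads(line)["experiment"])
    return completed


def should_skip(benchmark_experiment, completed):
    # benchmark_experiment includes "accelerator_model". A failed entry
    # dumped from experiment_cfg lacks that key, so the dict equality
    # below never matches it and the failing benchmark is re-run.
    return benchmark_experiment in completed


def dump_result(jsonl_path, benchmark_experiment, metrics):
    # Fix (1): always dump benchmark_experiment, on success *and* on
    # failure, so resume comparisons see records with identical keys.
    with open(jsonl_path, "a") as f:
        f.write(json.dumps({"experiment": benchmark_experiment,
                            "metrics": metrics}) + "\n")
```

Under these assumptions, a failure-path record dumped from `experiment_cfg` can never compare equal in `should_skip`, which is why dumping `benchmark_experiment` on both paths fixes resuming.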
@cota cota requested a review from golechwierowicz December 20, 2023 05:08
@golechwierowicz golechwierowicz left a comment

Good call!

@cota cota merged commit 86636ba into pytorch:master Dec 20, 2023
@cota cota deleted the cfg branch December 20, 2023 17:40
mbzomowski pushed a commit to mbzomowski-test-org/xla that referenced this pull request Jan 3, 2024
golechwierowicz pushed a commit that referenced this pull request Jan 12, 2024
bhavya01 pushed a commit that referenced this pull request Apr 22, 2024