@cota cota commented Dec 20, 2023

This commit can be seen as a partial revert of 63455e0cd "Unify the way in which result files are dumped" (#6162). In that commit we missed that `experiment_cfg` does not have the `accelerator_model` record. Thus, when a benchmark fails we do not include that record in the JSONL file, and resuming a run then doesn't work because the failing entry is not recognized (note that when checking whether to resume, we compare the JSONL entry against `benchmark_experiment`, which does have `accelerator_model`).

We could fix this in two ways: (1) always save `benchmark_experiment`, not only on success, or (2) add `accelerator_model` to `experiment_config`.

I've chosen to go with (1), since that's what we were doing before 63455e0.

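For illustration, here is a minimal Python sketch of the failure mode and of fix (1). The function names, dict layout, and JSONL schema are assumptions for this sketch, not the actual pytorch/xla benchmark runner API:

```python
import json

# Hypothetical sketch of the resume check described above; names and the
# JSONL layout are illustrative, not the real benchmark runner code.

def load_completed_experiments(jsonl_path):
    """Collect the "experiment" records already dumped to the JSONL file."""
    completed = []
    with open(jsonl_path) as f:
        for line in f:
            completed.append(json.loads(line)["experiment"])
    return completed


def should_skip(benchmark_experiment, completed):
    # benchmark_experiment includes "accelerator_model". A failed entry
    # dumped from experiment_cfg lacks that key, so the dict equality
    # below never matches it and the failing benchmark is re-run.
    return benchmark_experiment in completed


def dump_result(jsonl_path, benchmark_experiment, metrics):
    # Fix (1): always dump benchmark_experiment, on success *and* on
    # failure, so resume comparisons see records with identical keys.
    with open(jsonl_path, "a") as f:
        f.write(json.dumps({"experiment": benchmark_experiment,
                            "metrics": metrics}) + "\n")
```

Under these assumptions, a failure-path record dumped from `experiment_cfg` can never compare equal in `should_skip`, which is why dumping `benchmark_experiment` on both paths fixes resuming.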
@cota cota requested a review from golechwierowicz December 20, 2023 05:08
@golechwierowicz golechwierowicz left a comment

Good call!

@cota cota merged commit 86636ba into pytorch:master Dec 20, 2023
@cota cota deleted the cfg branch December 20, 2023 17:40
mbzomowski pushed a commit to mbzomowski-test-org/xla that referenced this pull request Jan 3, 2024
golechwierowicz pushed a commit that referenced this pull request Jan 12, 2024
bhavya01 pushed a commit that referenced this pull request Apr 22, 2024