Resume with single arg #457

alexstoken · 2020-07-20T23:14:09Z

Based on the conversation about --resume functionality in #104.

This PR aims to reduce confusion and user error when resuming training, covering both exceptions and undetected errors that cause poor training performance. It does so by restricting --resume to its intended usage.

There are two use cases:

Search for the most recent run/exp* directory, and resume training from there.

python train.py --resume

Resume from a specific unfinished training run. Useful when multiple training runs have been interrupted.

python train.py --resume runs/exp*
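The two invocations above can be sketched as follows. This is a minimal illustration, not the PR's actual code: `get_latest_run()` is named in the commits, but its signature here, the helper `resolve_resume_dir()`, and the use of modification time to decide which run is "most recent" are all assumptions.

```python
import argparse
import glob
import os


def get_latest_run(search_dir="runs"):
    """Return the most recently modified exp* directory under search_dir, or '' if none exists."""
    # Assumption: "most recent" means newest modification time.
    run_dirs = [d for d in glob.glob(os.path.join(search_dir, "exp*")) if os.path.isdir(d)]
    return max(run_dirs, key=os.path.getmtime) if run_dirs else ""


def resolve_resume_dir(argv, search_dir="runs"):
    """Map both --resume forms to a run directory ('' if not resuming or nothing was found)."""
    parser = argparse.ArgumentParser()
    # nargs="?" accepts a bare `--resume` (stored as const=True) and `--resume <dir>`.
    parser.add_argument("--resume", nargs="?", const=True, default=False)
    opt = parser.parse_args(argv)
    if opt.resume is True:  # bare flag: search for the newest run
        return get_latest_run(search_dir)
    return opt.resume or ""  # explicit directory, or '' when --resume is absent
```

With this layout, `resolve_resume_dir(["--resume"])` searches `runs/`, while `resolve_resume_dir(["--resume", "runs/exp3"])` returns the given directory unchanged.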

Other additions:

Warns users when they attempt to resume an already completed run.
Checks the resume directory for the three files needed to resume: opt.yaml, hyp.yaml, and last*.pt.
Ignores all other args if --resume is used. This ensures users do not accidentally interfere with their training scheme.
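The file check described in this list could look roughly like the sketch below. The function name `check_resume_dir()`, the error messages, and the recursive search for `last*.pt` (so that a checkpoint in a subfolder is also found) are assumptions made for illustration; only the three required file names come from the description above.

```python
import glob
import os


def check_resume_dir(run_dir):
    """Return the newest last*.pt under run_dir, raising FileNotFoundError if a required file is missing."""
    if not os.path.isdir(run_dir):
        raise FileNotFoundError(f"Resume directory not found: {run_dir}")
    for name in ("opt.yaml", "hyp.yaml"):
        if not os.path.isfile(os.path.join(run_dir, name)):
            raise FileNotFoundError(f"{name} not found in {run_dir}")
    # Glob last*.pt so that named runs (e.g. last_<name>.pt) are matched as well.
    # Assumption: the checkpoint may sit in run_dir itself or in a subfolder.
    checkpoints = glob.glob(os.path.join(run_dir, "**", "last*.pt"), recursive=True)
    if not checkpoints:
        raise FileNotFoundError(f"last*.pt not found in {run_dir}")
    return max(checkpoints, key=os.path.getmtime)
```

Raising `FileNotFoundError` rather than printing a warning stops the run before any training starts with an incomplete configuration.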

Here is a colab notebook with some examples.

Happy to adjust implementation further based on discussion.

🛠️ PR Summary

Made with ❤️ by Ultralytics Actions

🌟 Summary

Enhanced --resume behavior with checks for completed runs and improved configuration restoration.

📊 Key Changes

🚫 If an attempt is made to resume a finished training run, the run's directory is removed and a warning is issued.
✏️ The help text for the --resume argument now specifies that it should point to an experiment directory, not a .pt file.
🔄 The resumption method in train.py was reworked to:
- Ensure the provided --resume path points to a valid directory containing necessary files (e.g., opt.yaml, hyp.yaml, and weights).
- Load and apply options (opt) and hyperparameters (hyp) from the original run.
- Set the path to the weights file from the last checkpoint of the original run.
📁 The updated behavior creates a better link between the resumed run and its parent, eliminating confusion and potential mistakes when resuming training.
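A rough sketch of the restore step follows, under stated assumptions: the function `restore_run_config()` is hypothetical, and the saved options and hyperparameters are assumed to have already been parsed from opt.yaml and hyp.yaml into plain dicts (for example with PyYAML's `yaml.safe_load`). It is not the code from train.py.

```python
import argparse


def restore_run_config(saved_opt, saved_hyp, hyp, run_dir, weights_path):
    """Rebuild opt and hyp for a resumed run from the values saved by the original run."""
    # Start from the saved options only, so any other command-line args are ignored.
    opt = argparse.Namespace(**saved_opt)
    opt.resume = run_dir        # point back at the run being resumed
    opt.weights = weights_path  # continue from the original run's last checkpoint
    hyp.update(saved_hyp)       # saved hyperparameters override the defaults
    return opt, hyp


opt, hyp = restore_run_config(
    saved_opt={"epochs": 50, "batch_size": 16, "weights": "yolov5s.pt"},  # illustrative values
    saved_hyp={"lr0": 0.01},
    hyp={"lr0": 0.02, "momentum": 0.9},
    run_dir="runs/exp1",
    weights_path="runs/exp1/weights/last.pt",
)
```

Building the namespace from the saved dict, instead of patching the parsed command line, is one simple way to guarantee that stray arguments cannot change the resumed run.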

🎯 Purpose & Impact

The update prevents users from inadvertently trying to resume a run that has already concluded, which could lead to data loss.
By automating the lifting of configurations from previous runs, this PR simplifies the process of resuming, making it more error-proof and user-friendly.
Users benefit from a more robust and intuitive resume functionality, increasing the effectiveness of iterative training sessions and saving time during the machine learning model development lifecycle.

github-actions · 2020-08-20T00:32:33Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

alexstoken · 2020-08-20T15:40:14Z

Solved by #756.

alexstoken added 22 commits July 17, 2020 14:57

Change get_latest_run() to search for newest dir in /runs

640e203

Change docstring to reflect search for dirs

73965f4

Add opt file loader to replace args when opt.resume chosen

0b88350

Remove logic to set opt.weights, since it's now done in resume section

c9982c2

Add try except to catch a FileNotFound if opt.yaml isnt in runs/exp* folder

f279dde

Change help statement to pass a directory to recent experiment instead of to last.pt

ecc7422

Add checks for last_run_dir and try/except for loading opt.yaml

5ce66a2

Add exceptions when files are not found. Update hyp dict with hyp from run to resume from.

e3ec60c

Glob last*.pt to ensure named runs are found

18925c5

Set opt.resume attribute to point to last_run_dir

1a65229

Remove "Warning" tag when raising specific errors

56f440b

Add warning to users who attempt to resume a completed run

441471b

Remove experiment dir for runs resumed from a competed training run

a375298

Raise FileNotFound instead of just printing warning

57cf0ea

Fix FileNotFound error raising

92d5f52

Comment resume setup section

1a7ce31

Seperate steps to restore opt, hyp, weights

4057e64

Change FileNotFoundError messages to be uniform

8e27e1f

Fix typo in error message

9de3400

Remove duplicate error for FileNotFound for opt.yaml

0ed27e6

Clarify opt print statement

248308d

Merge branch 'master' into resume_single_cmd

cc7b8f9

github-actions bot added the Stale label Aug 20, 2020

alexstoken closed this Aug 20, 2020
