ultralytics 8.1.41
DDP resume untrained-checkpoint fix
#9453
Signed-off-by: Glenn Jocher <glenn.jocher@ultralytics.com>
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #9453 +/- ##
==========================================
- Coverage 76.66% 76.63% -0.04%
==========================================
Files 120 120
Lines 15175 15174 -1
==========================================
- Hits 11634 11628 -6
- Misses 3541 3546 +5
@Laughing-q I'm trying to solve a DDP training bug that occurs when starting from a partially completed checkpoint trained on a different dataset (with a different class count) in #9419, but I'm having a problem. This PR fixes the reported bug, but it breaks DDP resume. I can't figure out how to do both.
Both should work correctly, but I can't figure it out right now.
@Laughing-q this PR also resolves #9329, where a user asks to change the device on resume.
@glenn-jocher ok, found the reason. Let me check what the best solution is here.
@glenn-jocher DDP resume is broken because of ultralytics/ultralytics/engine/trainer.py Line 650 in ea52750:
resume in the overrides we read from the trainer when generating the DDP script therefore remains False (ultralytics/ultralytics/utils/dist.py Line 31 in ea52750).
This PR added an explicit condition to launch resume_training only if self.args.resume is explicitly set, and that is what broke DDP resume. Fixed in 97a2478.
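The failure mode described above can be illustrated with a minimal toy sketch. This is not the Ultralytics code: `ToyTrainer`, `check_resume`, `ddp_child_overrides`, `child_resumes`, and the `sync_overrides` switch are all hypothetical names, and the sketch assumes only what the comment states, namely that the resume flag lands in the trainer's effective args while the DDP workers are rebuilt from the overrides.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTrainer:
    """Hypothetical stand-in for a trainer; not the Ultralytics class."""

    overrides: dict
    args: dict = field(init=False)

    def __post_init__(self):
        # Effective args start from defaults, updated by the user overrides.
        self.args = {"resume": False, **self.overrides}

    def check_resume(self, sync_overrides: bool) -> None:
        # Assumption: a resumable checkpoint was found, so the effective
        # args are flagged for resume.
        self.args["resume"] = True
        if sync_overrides:
            # Fixed variant: mirror the flag into overrides as well, so that
            # anything built from overrides sees it.
            self.overrides["resume"] = True

    def ddp_child_overrides(self) -> dict:
        # The (toy) DDP launcher hands workers the overrides, not the args.
        return dict(self.overrides)


def child_resumes(parent: ToyTrainer) -> bool:
    # A worker rebuilds its trainer from the parent's overrides and resumes
    # only if resume is explicitly set there.
    child = ToyTrainer(overrides=parent.ddp_child_overrides())
    return bool(child.args["resume"])


broken = ToyTrainer(overrides={"model": "last.pt"})
broken.check_resume(sync_overrides=False)
print(broken.args["resume"], child_resumes(broken))

fixed = ToyTrainer(overrides={"model": "last.pt"})
fixed.check_resume(sync_overrides=True)
print(fixed.args["resume"], child_resumes(fixed))
```

In the first case the parent believes it is resuming while its workers do not. Mirroring the flag into the overrides is one way to keep them consistent, and the actual change in 97a2478 may differ.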
@Laughing-q I'm seeing a strange error when resuming. Resume works correctly, but then I get an error I've never seen before after a few successful resume epochs:
To reproduce on the server:
EDIT: Ah wait, this must be because of the recent FP16 optimizer checkpointing. I noticed that the
@Laughing-q ok I think I fixed this in c8523ea The
@Laughing-q merging this PR as 8.1.41 to get these fixes out while we figure out the LR-resume bug some more.
…#9453)
Signed-off-by: Glenn Jocher <glenn.jocher@ultralytics.com>
Co-authored-by: UltralyticsAssistant <web@ultralytics.com>
Co-authored-by: Laughing-q <1185102784@qq.com>
May resolve #9419
🛠️ PR Summary
Made with ❤️ by Ultralytics Actions
🌟 Summary
Enhanced the training process with improved fault tolerance and customization options.
📊 Key Changes
self.resume is true, allowing more control over training resumption.
🎯 Purpose & Impact
In summary, these changes make the training process more robust, customizable, and user-friendly, particularly in distributed environments or when resuming after interruption.
🛠️ PR Summary
Made with ❤️ by Ultralytics Actions
🌟 Summary
Improved training resumption logic and package update checks in Ultralytics YOLO!
📊 Key Changes
🎯 Purpose & Impact
Together, these changes aim to enhance the usability, stability, and flexibility of Ultralytics YOLO, making advanced computer vision more accessible and efficient.