
ultralytics 8.1.41 DDP resume untrained-checkpoint fix #9453

Merged
merged 14 commits into main from ddp-resume on Apr 1, 2024

Conversation

glenn-jocher
Member

@glenn-jocher glenn-jocher commented Mar 31, 2024

May resolve #9419

πŸ› οΈ PR Summary

Made with ❀️ by Ultralytics Actions

🌟 Summary

Enhanced the training process with improved fault tolerance and customization options. πŸš€

πŸ“Š Key Changes

  • Specified the "backend" explicitly in the distributed training setup.
  • Added a condition so training is resumed only if self.resume is true, giving more control over training resumption (see the sketch below).
  • Expanded the arguments that can be updated when resuming training, adding "device" alongside "imgsz" and "batch" to allow resuming with reduced memory usage.
  • Simplified assertions and logging related to resuming training, making the process more straightforward and less error-prone.
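
A minimal sketch of the first two points (illustrative only, not the exact Ultralytics trainer code; the helper names and the timeout value are assumptions):

from datetime import timedelta

import torch.distributed as dist


def ddp_setup(rank: int, world_size: int):
    # Choose the backend explicitly instead of relying on the default, and give
    # the process group a generous timeout for fault tolerance.
    backend = "nccl" if dist.is_nccl_available() else "gloo"
    dist.init_process_group(backend=backend, timeout=timedelta(hours=3), rank=rank, world_size=world_size)


def maybe_resume(trainer, ckpt):
    # Only enter the resume path when resuming was actually requested, so a
    # checkpoint passed merely as a starting model does not trigger it.
    if trainer.args.resume:
        trainer.resume_training(ckpt)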

🎯 Purpose & Impact

  • Better Fault Tolerance: By enforcing a timeout in distributed training setups and allowing more flexible memory usage adjustments, the changes aim to reduce crash instances due to network issues or CUDA out-of-memory (OOM) errors. πŸ›‘οΈ
  • Enhanced Customization: The addition of "device" to adjustable parameters when resuming training provides users more control over their training environment, potentially mitigating memory constraints. πŸ”§
  • Streamlined Training Resume Process: These updates make resuming training more intuitive by removing unnecessary checks and clarifying logging messages. This leads to a smoother experience for users restarting training sessions. πŸ”„

In summary, these changes make the training process more robust, customizable, and user-friendly, particularly in distributed environments or when resuming after interruption.

πŸ› οΈ PR Summary

Made with ❀️ by Ultralytics Actions

🌟 Summary

Improved training resumption logic and package update checks in Ultralytics YOLO πŸš€!

πŸ“Š Key Changes

  • πŸ”Ό Updated the version to 8.1.41.
  • 🌐 Enhanced Distributed Data Parallel (DDP) setup by specifying the backend directly.
  • πŸ”„ Improved the training resume feature to better handle model, device, and memory configurations.
  • ⏭ Added robustness to auto-updating dependencies by allowing retries.
  • πŸ›  Fixed optimizer to properly handle non-float32 tensors in conversion for mixed precision training.

🎯 Purpose & Impact

  • Version Bump: Keeps users on the latest, most stable release. πŸ†•
  • DDP Setup: Ensures more reliable multi-GPU training by explicitly setting the backend, potentially improving performance and reducing bugs. βœ”οΈ
  • Resuming Training: Makes it easier for users to pause and resume their training processes, even on different devices or with adjusted memory parameters, preventing loss of progress and improving user experience. πŸ’Ύ
  • Dependency Updates: Increases the success rate of automatic dependency updates, reducing setup time and improving user experience. ⬆️
  • Optimizer Tweak: Ensures more stable and accurate training in mixed precision, enhancing performance and reducing memory usage. πŸš€

Together, these changes aim to enhance the usability, stability, and flexibility of Ultralytics YOLO, making advanced computer vision more accessible and efficient.

glenn-jocher and others added 2 commits March 31, 2024 22:18

codecov bot commented Mar 31, 2024

Codecov Report

Attention: Patch coverage is 87.50000% with 1 line in your changes missing coverage. Please review.

Project coverage is 76.63%. Comparing base (2cee889) to head (e37336f).

Files | Patch % | Lines
ultralytics/engine/trainer.py | 80.00% | 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9453      +/-   ##
==========================================
- Coverage   76.66%   76.63%   -0.04%     
==========================================
  Files         120      120              
  Lines       15175    15174       -1     
==========================================
- Hits        11634    11628       -6     
- Misses       3541     3546       +5     
Flag | Coverage Δ
Benchmarks | 36.30% <25.00%> (+<0.01%) ⬆️
GPU | 38.23% <25.00%> (-0.11%) ⬇️
Tests | 71.89% <87.50%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.

β˜” View full report in Codecov by Sentry.
πŸ“’ Have feedback on the report? Share it here.

@Burhan-Q Burhan-Q added the enhancement New feature or request label Mar 31, 2024
@glenn-jocher
Member Author

@Laughing-q I'm trying to solve a DDP training bug in #9419 that occurs when starting training from a partially completed checkpoint from a different dataset (with different class counts), but I'm having a problem.

This PR fixes the bug report, but it breaks DDP resume. I can't figure out how to do both.

  • To test the PR, train COCO8, crash training after a few epochs, and then try to train VOC.yaml with DDP from the partially trained COCO8 model.
  • To test DDP resume, just train COCO8 with DDP, crash it, then resume it (both scenarios are illustrated below).

Both should work correctly but I can't figure it out right now.
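
A rough Python-API illustration of the two scenarios (the checkpoint path and device indices are hypothetical placeholders):

from ultralytics import YOLO

# Scenario 1: a fresh DDP run on VOC, initialized from the partially trained COCO8 checkpoint.
# The checkpoint is only a weight initialization here, so the resume path must NOT trigger.
model = YOLO("runs/detect/train/weights/last.pt")  # hypothetical path to the crashed COCO8 run
model.train(data="VOC.yaml", device=[0, 1])

# Scenario 2: resume the same interrupted COCO8 DDP run where it left off.
model = YOLO("runs/detect/train/weights/last.pt")
model.train(resume=True, device=[0, 1])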

@glenn-jocher glenn-jocher added the TODO Items that needs completing label Mar 31, 2024
@glenn-jocher glenn-jocher linked an issue Mar 31, 2024 that may be closed by this pull request
@glenn-jocher glenn-jocher mentioned this pull request Mar 31, 2024
@glenn-jocher
Member Author

glenn-jocher commented Mar 31, 2024

@Laughing-q this PR also resolves #9329, where a user asked to change the device on resume.

(Screenshot attached: 2024-03-31 22:53)

@glenn-jocher glenn-jocher marked this pull request as draft March 31, 2024 20:59
@Laughing-q
Member

@glenn-jocher OK, found the reason. Let me check what the best solution is here.

@Laughing-q
Member

Laughing-q commented Apr 1, 2024

@glenn-jocher DDP resuming is broken because self.args.resume is set to False when loading args from the last checkpoint:

self.args = get_cfg(ckpt_args)

so resume in the overrides we read from the trainer when generating the DDP script remains False:
overrides = vars(trainer.args)

And this PR added an explicit condition to launch resume_training only if self.args.resume is set, which broke DDP resuming.
Fixed in 97a2478
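
A minimal, hypothetical reconstruction of that flow (simplified names, not the actual commit): the checkpoint's saved args overwrite the live args, so the resume flag has to be restored before the DDP workers read the overrides.

from types import SimpleNamespace


def reload_args_keeping_resume(args: SimpleNamespace, ckpt_args: dict) -> SimpleNamespace:
    # Stand-in for `self.args = get_cfg(ckpt_args)`: the checkpoint was saved with
    # resume=False, so that assignment silently drops the user's resume request.
    resume_requested = args.resume
    args = SimpleNamespace(**ckpt_args)
    if resume_requested:
        # Restore the flag so the overrides read from vars(trainer.args) when
        # generating the DDP script still carry resume=True.
        args.resume = True
    return args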

@Laughing-q Laughing-q marked this pull request as ready for review April 1, 2024 03:34
@glenn-jocher
Member Author

glenn-jocher commented Apr 1, 2024

@Laughing-q I'm seeing a strange error when resuming. Resume works correctly, but then I get an error I've never seen before after a few successful resume epochs:

RuntimeError: Tensors of the same index must be on the same device and the same dtype except step tensors that can be CPU and float32 notwithstanding

(Screenshot attached: 2024-04-01 13:19)

To reproduce on the server:

yolo train device=6,7
# CTRL+C after 10 epochs
yolo train device=6,7 model=/usr/src/ultralytics/runs/detect/train/weights/last.pt resume

EDIT: Ah wait, this must be because of the recent FP16 optimizer checkpointing. I noticed that the step keys in the optimizer stay at FP16 while all the other tensors return to FP32 on resume. Maybe I need to explicitly cast the entire optimizer to FP32 on resume, or prevent the step keys from converting to FP16.

@glenn-jocher
Member Author

@Laughing-q OK, I think I fixed this in c8523ea.

The step keys in the optimizer now stay as FP32, and resume works fully. The step keys are very lightweight, so they will not increase the checkpoint size.

(Screenshot attached: 2024-04-01 13:26)
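
A minimal sketch of the approach (an assumed helper, not the actual c8523ea change): when casting the optimizer state to FP16 for a smaller checkpoint, the "step" entries are skipped so they remain float32, which is what the fused/foreach optimizer paths expect on resume.

import copy

import torch


def optimizer_state_to_fp16(optimizer: torch.optim.Optimizer) -> dict:
    # Return an FP16 copy of the optimizer state for checkpointing, leaving the
    # lightweight "step" counters in float32 as described above.
    state_dict = copy.deepcopy(optimizer.state_dict())
    for param_state in state_dict["state"].values():
        for key, value in param_state.items():
            if key != "step" and isinstance(value, torch.Tensor) and value.dtype is torch.float32:
                param_state[key] = value.half()  # exp_avg, exp_avg_sq, etc. -> FP16
    return state_dict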

@glenn-jocher glenn-jocher removed the TODO Items that needs completing label Apr 1, 2024
@glenn-jocher glenn-jocher changed the title DDP resume untrained checkpoint fix ultralytics 8.1.41 DDP resume untrained-checkpoint fix Apr 1, 2024
@glenn-jocher
Member Author

@Laughing-q merging this PR as 8.1.41 to get these fixes out while we figure out the LR-resume bug some more.

@glenn-jocher glenn-jocher merged commit 959acf6 into main Apr 1, 2024
13 checks passed
@glenn-jocher glenn-jocher deleted the ddp-resume branch April 1, 2024 17:46
hmurari pushed a commit to hmurari/ultralytics that referenced this pull request Apr 17, 2024
Labels
enhancement New feature or request
Development

Successfully merging this pull request may close these issues.

size mismatch error when finetuning a model with multiprocessing
Changing device with resume
4 participants