
ultralytics 8.1.41 DDP resume untrained-checkpoint fix #9453

Merged
merged 14 commits into main from ddp-resume on Apr 1, 2024

Conversation

glenn-jocher
Member

@glenn-jocher glenn-jocher commented Mar 31, 2024

May resolve #9419

πŸ› οΈ PR Summary

Made with ❀️ by Ultralytics Actions

🌟 Summary

Enhanced the training process with improved fault tolerance and customization options. πŸš€

πŸ“Š Key Changes

  • Specified the "backend" explicitly in the distributed training setup.
  • Added a condition so training is resumed only if self.resume is true, giving more control over training resumption (see the sketch below).
  • Expanded the arguments that can be updated when resuming training, adding "device" alongside "imgsz" and "batch" to allow resuming with reduced memory usage.
  • Simplified assertions and logging related to resuming training, making the process more straightforward and less error-prone.
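
A minimal sketch of the first two points (illustrative only, not the exact Ultralytics trainer code; the helper names and the timeout value are assumptions):

from datetime import timedelta

import torch.distributed as dist


def ddp_setup(rank: int, world_size: int):
    # Choose the backend explicitly instead of relying on the default, and give
    # the process group a generous timeout for fault tolerance.
    backend = "nccl" if dist.is_nccl_available() else "gloo"
    dist.init_process_group(backend=backend, timeout=timedelta(hours=3), rank=rank, world_size=world_size)


def maybe_resume(trainer, ckpt):
    # Only enter the resume path when resuming was actually requested, so a
    # checkpoint passed merely as a starting model does not trigger it.
    if trainer.args.resume:
        trainer.resume_training(ckpt)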

🎯 Purpose & Impact

  • Better Fault Tolerance: By enforcing a timeout in distributed training setups and allowing more flexible memory usage adjustments, the changes aim to reduce crash instances due to network issues or CUDA out-of-memory (OOM) errors. πŸ›‘οΈ
  • Enhanced Customization: The addition of "device" to adjustable parameters when resuming training provides users more control over their training environment, potentially mitigating memory constraints. πŸ”§
  • Streamlined Training Resume Process: These updates make resuming training more intuitive by removing unnecessary checks and clarifying logging messages. This leads to a smoother experience for users restarting training sessions. πŸ”„

In summary, these changes make the training process more robust, customizable, and user-friendly, particularly in distributed environments or when resuming after interruption.

πŸ› οΈ PR Summary

Made with ❀️ by Ultralytics Actions

🌟 Summary

Improved training resumption logic and package update checks in Ultralytics YOLO πŸš€!

πŸ“Š Key Changes

  • πŸ”Ό Updated the version to 8.1.41.
  • 🌐 Enhanced Distributed Data Parallel (DDP) setup by specifying the backend directly.
  • πŸ”„ Improved the training resume feature to better handle model, device, and memory configurations.
  • ⏭ Added robustness to auto-updating dependencies by allowing retries.
  • πŸ›  Fixed optimizer to properly handle non-float32 tensors in conversion for mixed precision training.

🎯 Purpose & Impact

  • Version Bump: Keeps users on the latest, most stable release. πŸ†•
  • DDP Setup: Ensures more reliable multi-GPU training by explicitly setting the backend, potentially improving performance and reducing bugs. βœ”οΈ
  • Resuming Training: Makes it easier for users to pause and resume their training processes, even on different devices or with adjusted memory parameters, preventing loss of progress and improving user experience. πŸ’Ύ
  • Dependency Updates: Increases the success rate of automatic dependency updates, reducing setup time and improving user experience. ⬆️
  • Optimizer Tweak: Ensures more stable and accurate training in mixed precision, enhancing performance and reducing memory usage. πŸš€

Together, these changes aim to enhance the usability, stability, and flexibility of Ultralytics YOLO, making advanced computer vision more accessible and efficient.

glenn-jocher and others added 2 commits March 31, 2024 22:18

codecov bot commented Mar 31, 2024

Codecov Report

Attention: Patch coverage is 87.50000% with 1 line in your changes missing coverage. Please review.

Project coverage is 76.63%. Comparing base (2cee889) to head (e37336f).

Files | Patch % | Lines
ultralytics/engine/trainer.py | 80.00% | 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9453      +/-   ##
==========================================
- Coverage   76.66%   76.63%   -0.04%     
==========================================
  Files         120      120              
  Lines       15175    15174       -1     
==========================================
- Hits        11634    11628       -6     
- Misses       3541     3546       +5     
Flag | Coverage Δ
Benchmarks | 36.30% <25.00%> (+<0.01%) ⬆️
GPU | 38.23% <25.00%> (-0.11%) ⬇️
Tests | 71.89% <87.50%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.

β˜” View full report in Codecov by Sentry.
πŸ“’ Have feedback on the report? Share it here.

@Burhan-Q Burhan-Q added the enhancement New feature or request label Mar 31, 2024
@glenn-jocher
Member Author

@Laughing-q I'm trying to solve a DDP training bug in #9419 that occurs when starting training from a partially completed checkpoint from a different dataset (with different class counts), but I'm having a problem.

This PR fixes the bug report, but it breaks DDP resume. I can't figure out how to do both.

  • To test the PR, train COCO8, crash training after a few epochs, and then try to train VOC.yaml with DDP from the partially trained COCO8 model.
  • To test DDP resume, just train COCO8 with DDP, crash it, then resume it (both scenarios are illustrated below).

Both should work correctly but I can't figure it out right now.
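
A rough Python-API illustration of the two scenarios (the checkpoint path and device indices are hypothetical placeholders):

from ultralytics import YOLO

# Scenario 1: a fresh DDP run on VOC, initialized from the partially trained COCO8 checkpoint.
# The checkpoint is only a weight initialization here, so the resume path must NOT trigger.
model = YOLO("runs/detect/train/weights/last.pt")  # hypothetical path to the crashed COCO8 run
model.train(data="VOC.yaml", device=[0, 1])

# Scenario 2: resume the same interrupted COCO8 DDP run where it left off.
model = YOLO("runs/detect/train/weights/last.pt")
model.train(resume=True, device=[0, 1])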

@glenn-jocher glenn-jocher added the TODO Items that needs completing label Mar 31, 2024
@glenn-jocher glenn-jocher linked an issue Mar 31, 2024 that may be closed by this pull request
@glenn-jocher glenn-jocher mentioned this pull request Mar 31, 2024
@glenn-jocher
Member Author

glenn-jocher commented Mar 31, 2024

@Laughing-q this PR also resolves #9329, where a user asked to change the device on resume.

(Screenshot attached: 2024-03-31 22:53)

@glenn-jocher glenn-jocher marked this pull request as draft March 31, 2024 20:59
@Laughing-q
Member

@glenn-jocher OK, found the reason. Let me check what the best solution is here.

@Laughing-q
Member

Laughing-q commented Apr 1, 2024

@glenn-jocher DDP resuming is broken because self.args.resume is set to False when loading args from the last checkpoint:

self.args = get_cfg(ckpt_args)

so resume in the overrides we read from the trainer when generating the DDP script remains False:
overrides = vars(trainer.args)

And this PR added an explicit condition to launch resume_training only if self.args.resume is set, which broke DDP resuming.
Fixed in 97a2478
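
A minimal, hypothetical reconstruction of that flow (simplified names, not the actual commit): the checkpoint's saved args overwrite the live args, so the resume flag has to be restored before the DDP workers read the overrides.

from types import SimpleNamespace


def reload_args_keeping_resume(args: SimpleNamespace, ckpt_args: dict) -> SimpleNamespace:
    # Stand-in for `self.args = get_cfg(ckpt_args)`: the checkpoint was saved with
    # resume=False, so that assignment silently drops the user's resume request.
    resume_requested = args.resume
    args = SimpleNamespace(**ckpt_args)
    if resume_requested:
        # Restore the flag so the overrides read from vars(trainer.args) when
        # generating the DDP script still carry resume=True.
        args.resume = True
    return args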

@Laughing-q Laughing-q marked this pull request as ready for review April 1, 2024 03:34
@glenn-jocher
Member Author

glenn-jocher commented Apr 1, 2024

@Laughing-q I'm seeing a strange error when resuming. Resume works correctly, but then I get an error I've never seen before after a few successful resume epochs:

RuntimeError: Tensors of the same index must be on the same device and the same dtype except step tensors that can be CPU and float32 notwithstanding

(Screenshot attached: 2024-04-01 13:19)

To reproduce on the server:

yolo train device=6,7
# CTRL+C after 10 epochs
yolo train device=6,7 model=/usr/src/ultralytics/runs/detect/train/weights/last.pt resume

EDIT: Ah wait, this must be because of the recent FP16 optimizer checkpointing. I noticed that the step keys in the optimizer stay at FP16 while all the other tensors return to FP32 on resume. Maybe I need to explicitly cast the entire optimizer to FP32 on resume, or prevent the step keys from converting to FP16.

@glenn-jocher
Member Author

@Laughing-q OK, I think I fixed this in c8523ea.

The step keys in the optimizer now stay as FP32, and resume works fully. The step keys are very lightweight, so they will not increase the checkpoint size.

(Screenshot attached: 2024-04-01 13:26)
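
A minimal sketch of the approach (an assumed helper, not the actual c8523ea change): when casting the optimizer state to FP16 for a smaller checkpoint, the "step" entries are skipped so they remain float32, which is what the fused/foreach optimizer paths expect on resume.

import copy

import torch


def optimizer_state_to_fp16(optimizer: torch.optim.Optimizer) -> dict:
    # Return an FP16 copy of the optimizer state for checkpointing, leaving the
    # lightweight "step" counters in float32 as described above.
    state_dict = copy.deepcopy(optimizer.state_dict())
    for param_state in state_dict["state"].values():
        for key, value in param_state.items():
            if key != "step" and isinstance(value, torch.Tensor) and value.dtype is torch.float32:
                param_state[key] = value.half()  # exp_avg, exp_avg_sq, etc. -> FP16
    return state_dict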

@glenn-jocher glenn-jocher removed the TODO Items that needs completing label Apr 1, 2024
@glenn-jocher glenn-jocher changed the title DDP resume untrained checkpoint fix ultralytics 8.1.41 DDP resume untrained-checkpoint fix Apr 1, 2024
@glenn-jocher
Member Author

@Laughing-q merging this PR as 8.1.41 to get these fixes out while we figure out the LR-resume bug some more.

@glenn-jocher glenn-jocher merged commit 959acf6 into main Apr 1, 2024
13 checks passed
@glenn-jocher glenn-jocher deleted the ddp-resume branch April 1, 2024 17:46
hmurari pushed a commit to hmurari/ultralytics that referenced this pull request Apr 17, 2024
Labels
enhancement New feature or request
Development

Successfully merging this pull request may close these issues.

size mismatch error when finetuning a model with multiprocessing
Changing device with resume
4 participants