Stuck at 100% Optimizing weights #630

Closed
1 task done
sdy623 opened this issue Apr 9, 2024 · 6 comments
Labels
bug Something isn't working fixed Bug is resolved

Comments

@sdy623

sdy623 commented Apr 9, 2024

Search before asking

  • I have searched the HUB issues and found no similar bug report.

HUB Component

Training

Bug

I used my own agent to train the model, but I can't find the trained model on the HUB webpage.
Another user reported a similar problem, but in their case there was no results.csv file, whereas I can find mine.
[screenshot]
Here is my training log:

Ultralytics HUB: Uploading checkpoint https://hub.ultralytics.com/models/GxzAXECMi65QweQlpTjs

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
     97/100      17.4G      1.365     0.5464      1.235        101        640: 100%|██████████| 264/264 [01:42<00:00,  2
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 29/29 [00:11<
                   all        747      49603      0.948      0.942      0.976      0.667

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
     98/100      17.3G      1.363     0.5444      1.235        139        640: 100%|██████████| 264/264 [01:40<00:00,  2
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 29/29 [00:11<
                   all        747      49603      0.949      0.941      0.977      0.667

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
     99/100      17.3G       1.36     0.5425      1.232        137        640: 100%|██████████| 264/264 [01:40<00:00,  2
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 29/29 [00:11<
                   all        747      49603      0.949      0.941      0.977      0.668

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
    100/100      17.3G      1.349     0.5367       1.22        156        640: 100%|██████████| 264/264 [01:41<00:00,  2
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 29/29 [00:11<
                   all        747      49603       0.95      0.941      0.977      0.668

100 epochs completed in 3.364 hours.
Optimizer stripped from runs/detect/train12/weights/last.pt, 311.6MB
Optimizer stripped from runs/detect/train12/weights/best.pt, 311.6MB

Validating runs/detect/train12/weights/best.pt...
Ultralytics YOLOv8.1.45 🚀 Python-3.10.12 torch-2.1.0a0+32f93b1 CUDA:0 (B1.gpu.large, 24118MiB)
YOLOv5x6u summary (fused): 463 layers, 155375236 parameters, 0 gradients, 250.3 GFLOPs
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 29/29 [00:42<
                   all        747      49603       0.95      0.941      0.977      0.668
Speed: 0.1ms preprocess, 8.5ms inference, 0.0ms loss, 1.8ms postprocess per image
Results saved to runs/detect/train12
Ultralytics HUB: Syncing final model...
100%|██████████| 297M/297M [02:44<00:00, 1.89MB/s]
Ultralytics HUB: Done ✅
Ultralytics HUB: View model at https://hub.ultralytics.com/models/GxzAXECMi65QweQlpTjs 🚀

After the "View model at https://hub.ultralytics.com/models/GxzAXECMi65QweQlpTjs 🚀" message, the training process exits, but the status shows "Optimizing weights". After a while it stops, the agent becomes disconnected, and I can't find the trained model in HUB.
[screenshot]

Could someone help me? I would really appreciate it.
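Since the report notes that results.csv was still written locally, a quick local check can confirm that all 100 epochs were recorded even though the HUB model is missing. This is a minimal stdlib-only Python sketch, assuming the Ultralytics layout where results.csv sits in the run directory (runs/detect/train12, per the log) with whitespace-padded columns, a 1-based `epoch` column, and a `metrics/mAP50-95(B)` column. The helpers `last_epoch_row` and `epochs_completed` are hypothetical, and the column names are assumptions rather than a documented API.

```python
import csv
import io
from pathlib import Path


def last_epoch_row(csv_text: str) -> dict:
    """Return the final data row of a results.csv as a dict.

    Header names and values are stripped of surrounding whitespace because
    the columns may be padded for alignment (assumption about the format).
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = [name.strip() for name in next(reader)]
    last = None
    for row in reader:
        if row:  # skip blank lines
            last = [value.strip() for value in row]
    if last is None:
        raise ValueError("results.csv contains no data rows")
    return dict(zip(header, last))


def epochs_completed(csv_text: str) -> int:
    """Number of epochs recorded, assuming a 1-based 'epoch' column."""
    return int(float(last_epoch_row(csv_text)["epoch"]))


if __name__ == "__main__":
    # Run directory taken from the training log; adjust as needed.
    path = Path("runs/detect/train12/results.csv")
    if path.exists():
        text = path.read_text()
        row = last_epoch_row(text)
        print(f"epochs recorded: {epochs_completed(text)}")
        # Assumed column name; prints None if it is absent.
        print(f"final mAP50-95: {row.get('metrics/mAP50-95(B)')}")
    else:
        print(f"{path} not found")
```

If the last row shows epoch 100, training itself finished and only the final upload to HUB was affected; the local best.pt and last.pt listed in the log remain usable.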

Environment

  • Training agent: Docker container
  • Kernel version: 5.15.146
  • Memory: 24GB
  • GPU Memory: 24GB
  • Python: 3.10.12
  • CUDA: 12.2.140
  • torch: 2.1.0a0+32f93b1
  • ultralytics: 8.1.45

Minimal Reproducible Example

  1. Log in to HUB
  2. Choose the 'Bring your own agent' option to train the model
  3. Run the model on my training agent
  4. Wait for training to end
  5. The problem appears.

Additional

No response

@sdy623 sdy623 added the bug Something isn't working label Apr 9, 2024

github-actions bot commented Apr 9, 2024

👋 Hello @sdy623, thank you for raising an issue about Ultralytics HUB 🚀! Please visit our HUB Docs to learn more:

  • Quickstart. Start training and deploying YOLO models with HUB in seconds.
  • Datasets: Preparing and Uploading. Learn how to prepare and upload your datasets to HUB in YOLO format.
  • Projects: Creating and Managing. Group your models into projects for improved organization.
  • Models: Training and Exporting. Train YOLOv5 and YOLOv8 models on your custom datasets and export them to various formats for deployment.
  • Integrations. Explore different integration options for your trained models, such as TensorFlow, ONNX, OpenVINO, CoreML, and PaddlePaddle.
  • Ultralytics HUB App. Learn about the Ultralytics App for iOS and Android, which allows you to run models directly on your mobile device.
    • iOS. Learn about YOLO CoreML models accelerated on Apple's Neural Engine on iPhones and iPads.
    • Android. Explore TFLite acceleration on mobile devices.
  • Inference API. Understand how to use the Inference API for running your trained models in the cloud to generate predictions.

If this is a 🐛 Bug Report, please provide screenshots and steps to reproduce your problem to help us get started working on a fix.

If this is a ❓ Question, please provide as much information as possible, including dataset, model, environment details etc. so that we might provide the most helpful response.

We try to respond to all issues as promptly as possible. Thank you for your patience!

@sdy623
Author

sdy623 commented Apr 9, 2024

[screenshot]

@ultralytics ultralytics deleted a comment from pderrenger Apr 10, 2024
@sergiuwaxmann sergiuwaxmann self-assigned this Apr 10, 2024
@sergiuwaxmann
Member

@sdy623 Thank you for bringing this to our attention.
It appears that the upload of the final weights failed. Our team is currently investigating the issue to identify and resolve the underlying cause. I will keep you updated on our progress.
Your patience and understanding are greatly appreciated.

@sergiuwaxmann
Member

Hello @sdy623!
Great news! Our team has released a fix for the issue you reported. You should no longer experience this problem in new Cloud Training sessions.
Thanks for your patience!

@sergiuwaxmann sergiuwaxmann added the fixed Bug is resolved label Apr 22, 2024
@sdy623
Author

sdy623 commented Apr 22, 2024

Thank you very much. How can I deploy the fixed version?

@sdy623 sdy623 closed this as completed Apr 22, 2024
@sergiuwaxmann
Member

@sdy623 When using Ultralytics HUB, the system automatically utilizes the latest version. For local training, please ensure you are using the most recent ultralytics version (8.2.2).
Unfortunately, the recent fix does not apply to models trained on earlier versions, so you will need to retrain your model with the latest version. We sincerely apologize for the inconvenience this causes.
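For local training, the maintainer's advice comes down to confirming that the installed ultralytics package is at least 8.2.2. The following is a minimal stdlib-only Python sketch of that check. `parse_release` and `is_recent_enough` are hypothetical helpers, and the parser keeps only the leading numeric components (so pre-release suffixes such as `rc1` are ignored), which is a simplification rather than full PEP 440 version comparison.

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum release named in the maintainer's reply.
MIN_VERSION = (8, 2, 2)


def parse_release(ver: str) -> tuple:
    """Turn a version string like '8.1.45' into (8, 1, 45).

    Only leading digits of each dot-separated piece are kept; parsing
    stops at the first piece with no leading digit (simplifying assumption).
    """
    parts = []
    for piece in ver.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def is_recent_enough(ver: str, minimum: tuple = MIN_VERSION) -> bool:
    """Return True if `ver` is at least `minimum` (tuple comparison)."""
    return parse_release(ver) >= minimum


if __name__ == "__main__":
    try:
        installed = version("ultralytics")
    except PackageNotFoundError:
        print("ultralytics is not installed")
    else:
        status = "OK" if is_recent_enough(installed) else "upgrade needed"
        print(f"ultralytics {installed}: {status}")
```

On the reporter's environment (ultralytics 8.1.45) this would report that an upgrade is needed; `pip install -U ultralytics` installs the latest release, after which the model still has to be retrained, as noted in the reply.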
