Epochs remaining gone negative and stuck on optimising weights #622

Closed
1 task done
tlong123 opened this issue Mar 26, 2024 · 13 comments
Assignees
Labels
bug (Something isn't working), fixed (Bug is resolved)

Comments

@tlong123

tlong123 commented Mar 26, 2024

Search before asking

  • I have searched the HUB issues and found no similar bug report.

HUB Component

Models, Training

Bug

I set up a model to train for 100 epochs. I left it running, and when I came back the next day it said "disconnected", checkpointed at the 99th epoch. I clicked resume training, but now I'm stuck at 100% on "Optimising weights", with -6 epochs remaining, and the time estimate is stuck saying "Estimating...".

I'm still being billed for this time, and the number of negative epochs keeps increasing. Is this expected, or has something gone wrong with the model? I'm training it on your cloud. (It has just changed to -7 epochs remaining now.)
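For illustration only, here is a minimal sketch (not HUB's actual code) of how a "remaining epochs" display can go negative: if a resumed run keeps incrementing its epoch counter past the configured total, a naive unclamped subtraction produces the numbers reported above.

```python
# Hypothetical sketch of a "remaining epochs" display bug.
# None of these names are HUB internals; they are illustrative only.
def epochs_remaining(total_epochs: int, completed_epochs: int) -> int:
    # A naive display subtracts with no clamp at zero and no stop check.
    return total_epochs - completed_epochs

# Training checkpointed at epoch 99 of 100, then resumed; if the stop
# condition is never re-checked, the counter keeps climbing past 100.
print(epochs_remaining(100, 99))   # 1 epoch left at the checkpoint
print(epochs_remaining(100, 106))  # -6, matching the reported display
print(epochs_remaining(100, 107))  # -7 one epoch later
```

Clamping with `max(0, ...)` would fix the display, but the underlying problem here is that the resumed run never stops.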

Environment

No response

Minimal Reproducible Example

No response

Additional

No response

tlong123 added the bug label Mar 26, 2024

👋 Hello @tlong123, thank you for raising an issue about Ultralytics HUB 🚀! Please visit our HUB Docs to learn more:

  • Quickstart. Start training and deploying YOLO models with HUB in seconds.
  • Datasets: Preparing and Uploading. Learn how to prepare and upload your datasets to HUB in YOLO format.
  • Projects: Creating and Managing. Group your models into projects for improved organization.
  • Models: Training and Exporting. Train YOLOv5 and YOLOv8 models on your custom datasets and export them to various formats for deployment.
  • Integrations. Explore different integration options for your trained models, such as TensorFlow, ONNX, OpenVINO, CoreML, and PaddlePaddle.
  • Ultralytics HUB App. Learn about the Ultralytics App for iOS and Android, which allows you to run models directly on your mobile device.
    • iOS. Learn about YOLO CoreML models accelerated on Apple's Neural Engine on iPhones and iPads.
    • Android. Explore TFLite acceleration on mobile devices.
  • Inference API. Understand how to use the Inference API for running your trained models in the cloud to generate predictions.

If this is a 🐛 Bug Report, please provide screenshots and steps to reproduce your problem to help us get started working on a fix.

If this is a ❓ Question, please provide as much information as possible, including dataset, model, environment details etc. so that we might provide the most helpful response.

We try to respond to all issues as promptly as possible. Thank you for your patience!

@tlong123
Author

Additionally, other than deleting the model I can't see any way to stop it from actively costing me money in its current state, and I don't want to delete it as I've paid money for it to be trained!

@UltralyticsAssistant
Member

@tlong123 hello! First off, thanks for reaching out and detailing the issue you're facing with the training process on Ultralytics HUB. 🌟 It sounds like you've stumbled upon a rare glitch, with the training session not properly concluding and going into negative epochs; this is indeed not expected behavior.

Rest assured, we prioritize both the performance and the billing concerns of our users. Here are a couple of steps you can take:

  1. If you haven't already, please attempt to manually stop the training session. While you've mentioned an issue with stopping the model without deleting it, there should be a stop or pause option available in the UI for your training session.
  2. Regarding the unexpected billing, we absolutely understand the importance of fair billing practices. We recommend reaching out to our support through the official communication channels mentioned in our documentation. They will be able to look into your billing details and make necessary adjustments based on the glitch.

We're here to ensure a smooth and efficient training experience. Your detailed report is incredibly valuable for us to improve and rectify such issues. If anything else comes up or if you have further questions, don't hesitate to update the issue or reach out to our support team.

Thanks for your patience and understanding. 👍

@tlong123
Author

[Screenshot attached: ultralytics_glitch]
I've attached a screenshot of where it's at now. It eventually disconnected by itself.

@sergiuwaxmann sergiuwaxmann self-assigned this Mar 27, 2024
@sergiuwaxmann
Member

Hello @tlong123!
First of all, I am sorry for the inconvenience. This must be an issue on our end.
Can you share your model ID (it is in the model page URL) so we can investigate this and prevent this from happening in the future? Also, I will refund the credits used for training this model.

@tlong123
Author

Yeah, sure. It's: y9dPahdYO4ShpAfpD6pG

@sergiuwaxmann
Member

@tlong123 I have refunded the credits used for training this model.
Our team is investigating the cause of the issue with your training and exploring ways to prevent such incidents in the future. I will keep you updated on our progress.

@tlong123
Author

Thanks Sergiu! I should mention here that I've just had the same issue with another model I tried to train: the model ID is PmUMb8RaZufqpap3APXi
[Screenshot attached: Screenshot 2024-03-28 134453]

@Burhan-Q
Member

@tlong123 I'm trying to help troubleshoot the issue and wanted to ask for more information about your configuration. If you could share:

  • Model (YOLOv8n | s | m | ....)
  • Task (detect, pose, etc.)
  • Training settings (imgsz, batch, etc.)

I can better help out by testing something that more closely resembles your setup.

@Burhan-Q
Member

Looks like most of the info is actually in the screenshots you shared, so I'll test with those. Anything else you can recall that's not shown in the screenshots would be helpful to know.

@tlong123
Author

Hi Burhan! The only other things I can think of are that they were trained on the Ultralytics cloud, and both disconnected at some point before 100 epochs, so I had to click resume training.
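Both failures above followed the same resume-after-disconnect path, so as a hedged sketch (names and logic are illustrative, not HUB internals), a resume loop that re-checks the epoch budget on every iteration cannot overshoot the configured total:

```python
# Hypothetical sketch of overshoot-safe resume logic; illustrative only.
def resume_training(completed: int, total: int) -> list[int]:
    """Resume from `completed` and return the epochs actually run."""
    epochs_run = []
    epoch = completed
    while epoch < total:  # re-check the budget before every resumed epoch
        epoch += 1
        epochs_run.append(epoch)
    return epochs_run

print(resume_training(99, 100))   # [100]: exactly one epoch left, then stop
print(resume_training(100, 100))  # []: nothing to do; never goes negative
```

The key design choice is checking `epoch < total` at the top of the loop rather than trusting a remaining-epochs counter computed once at resume time.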

@sergiuwaxmann
Member

@tlong123 I have refunded the credits used for training the second model.
Our team is investigating the cause of the issue with your trainings and exploring ways to prevent such incidents in the future. I will keep you updated on our progress.

Please accept our apologies for the inconvenience caused.

@sergiuwaxmann
Member

Hello @tlong123!
Great news! Our team has released a fix for the issue you reported. You should no longer experience this problem in new Cloud Training sessions.
Thanks for your patience!

sergiuwaxmann added the fixed label Apr 22, 2024

4 participants