Epochs remaining gone negative and stuck on optimising weights #622

Closed
1 task done
tlong123 opened this issue Mar 26, 2024 · 13 comments
Assignees
Labels
bug (Something isn't working), fixed (Bug is resolved)

Comments

@tlong123

tlong123 commented Mar 26, 2024

Search before asking

  • I have searched the HUB issues and found no similar bug report.

HUB Component

Models, Training

Bug

I set up a model to train for 100 epochs. I left it running, and when I came back the next day it said "disconnected", checkpointed at the 99th epoch. I clicked resume training, but now I'm stuck at 100% on "Optimising weights", with -6 epochs remaining, and the time estimate is stuck saying "Estimating...".

I'm still being billed for this time, and the number of negative epochs keeps increasing. Is this expected, or has something gone wrong with the model? I'm training it on your cloud. (It has just changed to -7 epochs remaining now.)
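For illustration only, here is a minimal sketch (not HUB's actual code) of how a "remaining epochs" display can go negative: if a resumed run keeps incrementing its epoch counter past the configured total, a naive unclamped subtraction produces the numbers reported above.

```python
# Hypothetical sketch of a "remaining epochs" display bug.
# None of these names are HUB internals; they are illustrative only.
def epochs_remaining(total_epochs: int, completed_epochs: int) -> int:
    # A naive display subtracts with no clamp at zero and no stop check.
    return total_epochs - completed_epochs

# Training checkpointed at epoch 99 of 100, then resumed; if the stop
# condition is never re-checked, the counter keeps climbing past 100.
print(epochs_remaining(100, 99))   # 1 epoch left at the checkpoint
print(epochs_remaining(100, 106))  # -6, matching the reported display
print(epochs_remaining(100, 107))  # -7 one epoch later
```

Clamping with `max(0, ...)` would fix the display, but the underlying problem here is that the resumed run never stops.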

Environment

No response

Minimal Reproducible Example

No response

Additional

No response

tlong123 added the bug label Mar 26, 2024

👋 Hello @tlong123, thank you for raising an issue about Ultralytics HUB 🚀! Please visit our HUB Docs to learn more:

  • Quickstart. Start training and deploying YOLO models with HUB in seconds.
  • Datasets: Preparing and Uploading. Learn how to prepare and upload your datasets to HUB in YOLO format.
  • Projects: Creating and Managing. Group your models into projects for improved organization.
  • Models: Training and Exporting. Train YOLOv5 and YOLOv8 models on your custom datasets and export them to various formats for deployment.
  • Integrations. Explore different integration options for your trained models, such as TensorFlow, ONNX, OpenVINO, CoreML, and PaddlePaddle.
  • Ultralytics HUB App. Learn about the Ultralytics App for iOS and Android, which allows you to run models directly on your mobile device.
    • iOS. Learn about YOLO CoreML models accelerated on Apple's Neural Engine on iPhones and iPads.
    • Android. Explore TFLite acceleration on mobile devices.
  • Inference API. Understand how to use the Inference API for running your trained models in the cloud to generate predictions.

If this is a 🐛 Bug Report, please provide screenshots and steps to reproduce your problem to help us get started working on a fix.

If this is a ❓ Question, please provide as much information as possible, including dataset, model, environment details etc. so that we might provide the most helpful response.

We try to respond to all issues as promptly as possible. Thank you for your patience!

@tlong123
Author

Additionally, other than deleting the model I can't see any way to stop it from actively costing me money in its current state, and I don't want to delete it as I've paid money for it to be trained!

@UltralyticsAssistant
Member

@tlong123 hello! First off, thanks for reaching out and detailing the issue you're facing with the training process on Ultralytics HUB. 🌟 It sounds like you've stumbled upon a rare glitch, with the training session not properly concluding and going into negative epochs; this is indeed not expected behavior.

Rest assured, we prioritize both the performance and the billing concerns of our users. Here are a couple of steps you can take:

  1. If you haven't already, please attempt to manually stop the training session. While you've mentioned an issue with stopping the model without deleting it, there should be a stop or pause option available in the UI for your training session.
  2. Regarding the unexpected billing, we absolutely understand the importance of fair billing practices. We recommend reaching out to our support through the official communication channels mentioned in our documentation. They will be able to look into your billing details and make necessary adjustments based on the glitch.

We're here to ensure a smooth and efficient training experience. Your detailed report is incredibly valuable for us to improve and rectify such issues. If anything else comes up or if you have further questions, don't hesitate to update the issue or reach out to our support team.

Thanks for your patience and understanding. 👍

@tlong123
Author

[Screenshot attached: ultralytics_glitch]
I've attached a screenshot of where it's at now. It eventually disconnected by itself.

@sergiuwaxmann sergiuwaxmann self-assigned this Mar 27, 2024
@sergiuwaxmann
Member

Hello @tlong123!
First of all, I am sorry for the inconvenience. This must be an issue on our end.
Can you share your model ID (it is in the model page URL) so we can investigate this and prevent this from happening in the future? Also, I will refund the credits used for training this model.

@tlong123
Author

Yeah, sure. It's: y9dPahdYO4ShpAfpD6pG

@sergiuwaxmann
Member

@tlong123 I have refunded the credits used for training this model.
Our team is investigating the cause of the issue with your training and exploring ways to prevent such incidents in the future. I will keep you updated on our progress.

@tlong123
Author

Thanks Sergiu! I should mention here that I've just had the same issue with another model I tried to train: the model ID is PmUMb8RaZufqpap3APXi
[Screenshot attached: Screenshot 2024-03-28 134453]

@Burhan-Q
Member

@tlong123 I'm trying to help troubleshoot the issue and wanted to ask for more information about your configuration. If you could share:

  • Model (YOLOv8n | s | m | ....)
  • Task (detect, pose, etc.)
  • Training settings (imgsz, batch, etc.)

I can better help out by testing something that more closely resembles your setup.

@Burhan-Q
Member

Looks like most of the info is actually in the screenshots you shared, so I'll test with those. Anything else you can recall that's not shown in the screenshots would be helpful to know.

@tlong123
Author

Hi Burhan! The only other things I can think of are that they were trained on the Ultralytics cloud, and both disconnected at some point before 100 epochs, so I had to click resume training.
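Both failures above followed the same resume-after-disconnect path, so as a hedged sketch (names and logic are illustrative, not HUB internals), a resume loop that re-checks the epoch budget on every iteration cannot overshoot the configured total:

```python
# Hypothetical sketch of overshoot-safe resume logic; illustrative only.
def resume_training(completed: int, total: int) -> list[int]:
    """Resume from `completed` and return the epochs actually run."""
    epochs_run = []
    epoch = completed
    while epoch < total:  # re-check the budget before every resumed epoch
        epoch += 1
        epochs_run.append(epoch)
    return epochs_run

print(resume_training(99, 100))   # [100]: exactly one epoch left, then stop
print(resume_training(100, 100))  # []: nothing to do; never goes negative
```

The key design choice is checking `epoch < total` at the top of the loop rather than trusting a remaining-epochs counter computed once at resume time.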

@sergiuwaxmann
Member

@tlong123 I have refunded the credits used for training the second model.
Our team is investigating the cause of the issue with your trainings and exploring ways to prevent such incidents in the future. I will keep you updated on our progress.

Please accept our apologies for the inconvenience caused.

@sergiuwaxmann
Member

Hello @tlong123!
Great news! Our team has released a fix for the issue you reported. You should no longer experience this problem in new Cloud Training sessions.
Thanks for your patience!

sergiuwaxmann added the fixed label Apr 22, 2024

4 participants