Model not getting trained on single GPU #32

Open
aryanmangal769 opened this issue Aug 28, 2023 · 5 comments

Comments

@aryanmangal769

When I try to train on a single GPU, the error keeps increasing and I cannot see any good results even by the 38th epoch.

train_class_error starts at 97.88, and from the 19th to the 37th epoch it is consistently 100. Can you help debug this?

Please let me know if you need any more information.

@yrcong
Owner

yrcong commented Oct 20, 2023

We train the model for 150 epochs; the 38th epoch might still be in the warm-up phase. Maybe you can try loading some pretrained weights to accelerate the training?
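
A minimal sketch of what warm-starting from pretrained weights could look like in plain PyTorch; the checkpoint filename, the "model" key, and the `model` variable are assumptions and may differ from this repo's actual checkpoint layout:

```python
# Hypothetical example: initialize training from a pretrained checkpoint.
# "pretrained.pth" and the "model" key are assumptions about the checkpoint layout.
import torch

checkpoint = torch.load("pretrained.pth", map_location="cpu")
state_dict = checkpoint.get("model", checkpoint)  # some checkpoints nest weights under "model"

# `model` is the network you have already constructed; strict=False tolerates
# missing/extra keys (e.g. heads with a different number of classes).
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print("missing keys:", missing)
print("unexpected keys:", unexpected)
```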

@qqxqqbot

@aryanmangal769 How do I train the model on one GPU?

@qqxqqbot

Answering my own question: I added os.environ['MASTER_PORT'] = '8889' in main.py.
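
For context, a sketch of that workaround: fixing the rendezvous variables before torch.distributed initializes. MASTER_ADDR is an extra assumption here (single-node training), and the port value is arbitrary as long as the port is free:

```python
# Set the rendezvous address/port before init_process_group() runs in main.py.
import os

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")  # assumption: single-node run
os.environ["MASTER_PORT"] = "8889"                 # any free port works
```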

@yrcong
Owner

yrcong commented Apr 22, 2024

It is not related to the port. Please set --nproc_per_node=1.
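
Assuming a DETR-style distributed entry point, a single-GPU run would be launched with one process per node; flags other than --nproc_per_node, and the dataset/output arguments of main.py, are assumptions omitted here:

```bash
# --nproc_per_node=1 spawns a single process, i.e. one GPU.
python -m torch.distributed.launch --nproc_per_node=1 --use_env main.py
```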

@AlphaGoooo

I trained for 70 epochs, but the results are still bad, including the errors and the loss. The loss stays around 33–34; is this normal, or has something gone wrong?
[screenshot of training log attached]
