
Training on A100 #53

Open
mathematiguy opened this issue Dec 13, 2022 · 0 comments


@mathematiguy

Hi there,

I'm training the model on an 80 GB A100 GPU and I'm having trouble replicating the claim that the model trains for 200K steps in under 5 hours. So far I'm using the flags given in the README, but I'm wondering whether you used any others to make training that fast on a single GPU, such as increasing the batch size. My GPU utilisation looks high, so I feel like it should be converging faster.

In my case training seems to be taking twice as long: it looks like it will converge in around 10 hours for E2E. Since ROCStory takes four times as many steps (800K), I'm guessing that will take about two days, which seems like a lot of extra time.
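
For context, this is the back-of-the-envelope extrapolation behind those numbers; a minimal sketch, where the throughput value is an approximation from my run and not a figure from the README:

```python
# Rough extrapolation from observed throughput.
# steps_per_sec is illustrative (roughly what I'm seeing), not from the README.
e2e_steps = 200_000      # E2E training steps
roc_steps = 800_000      # ROCStory training steps (4x E2E)
steps_per_sec = 5.5      # approximate throughput on the 80 GB A100

e2e_hours = e2e_steps / steps_per_sec / 3600
roc_hours = roc_steps / steps_per_sec / 3600
print(f"E2E: ~{e2e_hours:.1f} h, ROCStory: ~{roc_hours:.1f} h (~{roc_hours / 24:.1f} days)")
```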

I understand you are very busy, so if you have time to respond that would be great. Otherwise I will just put up with the extra time for now.
