Distributed Training For Cifar10 Example #2336
This PR adds a user guide for using distributed training on GKE with the Cifar10 MLKit example. The workflow depends on the custom trainer executor implemented in #2248.
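As context for readers (not part of the PR itself), here is a minimal sketch of how a custom trainer executor is typically wired into a TFX `Trainer` component via `custom_executor_spec`, roughly following the TFX 0.2x-era API. The executor import path, module file name, and upstream components are hypothetical placeholders, not the actual names from #2248.

```python
from tfx.components import Trainer
from tfx.components.base import executor_spec
from tfx.proto import trainer_pb2

# Hypothetical import path; the real distributed trainer executor is the
# one implemented in #2248.
from distributed_trainer import executor as distributed_executor

# example_gen and schema_gen are assumed upstream pipeline components.
trainer = Trainer(
    module_file='cifar10_utils.py',  # hypothetical CIFAR-10 util module
    # Swap in the custom executor that launches multi-worker training on GKE.
    custom_executor_spec=executor_spec.ExecutorClassSpec(
        distributed_executor.Executor),
    examples=example_gen.outputs['examples'],
    schema=schema_gen.outputs['schema'],
    train_args=trainer_pb2.TrainArgs(num_steps=1000),
    eval_args=trainer_pb2.EvalArgs(num_steps=500),
)
```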
Conversation
```
@@ -0,0 +1,430 @@
# Lint as: python2, python3
```
Is there a way we can do this without duplicating the util module?
I think there are around 4 places and about 30 lines in total that differ from the original util module, which is a non-trivial change and is why I duplicated the module. That said, it should be possible to document all of these changes in the README.
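To give thread readers a feel for the kind of divergence being discussed (a sketch, not the actual diff): in a multi-worker variant, model construction and compilation typically move inside a `tf.distribute` strategy scope, assuming TF 2.3+ where `MultiWorkerMirroredStrategy` is no longer experimental.

```python
import tensorflow as tf

def _build_keras_model() -> tf.keras.Model:
  # Stand-in for the CIFAR-10 model defined in the util module.
  model = tf.keras.Sequential([
      tf.keras.layers.Conv2D(32, 3, activation='relu',
                             input_shape=(32, 32, 3)),
      tf.keras.layers.GlobalAveragePooling2D(),
      tf.keras.layers.Dense(10),
  ])
  model.compile(
      optimizer='adam',
      loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
      metrics=['sparse_categorical_accuracy'])
  return model

# Single-worker version builds the model directly:
#   model = _build_keras_model()
#
# The multi-worker version instead creates the model (and its variables)
# inside a distribution strategy scope so they are replicated across
# workers. Cluster topology is read from the TF_CONFIG environment variable.
strategy = tf.distribute.MultiWorkerMirroredStrategy()
with strategy.scope():
  model = _build_keras_model()
```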
- Single worker training with GPU (i.e. NVIDIA K80)

If configured correctly, the above two configurations should yield a 60% to 70% speed up. If you need an even higher speed up, consider using multi-worker training with 1 GPU per node. However, the cost efficiency will be lower due to observed
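As background for the multi-worker setup mentioned in the snippet above (not from the PR itself): multi-worker TensorFlow training is coordinated through the `TF_CONFIG` environment variable, which the orchestrator sets per pod. A hypothetical example for two workers with 1 GPU each; the host names are placeholders.

```python
import json
import os

# Hypothetical TF_CONFIG for a 2-worker cluster (1 GPU per node). In
# practice the orchestrator (e.g. the GKE job spec) sets this for each
# pod; only the 'index' field differs between workers.
os.environ['TF_CONFIG'] = json.dumps({
    'cluster': {
        'worker': ['cifar10-worker-0:2222', 'cifar10-worker-1:2222'],
    },
    'task': {'type': 'worker', 'index': 0},
})
```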
How much of a speedup was observed? Could you include it here as well?
The largest speedup observed was about 80% compared to the base single-node case. I've added this data point here as well. FYI, these are just summaries of the experimentation data shared in the internal document.
This PR is stale because it has been open 30 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.