@Eric-Le-Ge
Contributor

This PR adds a user guide on using distributed training on GKE for the Cifar10 MLKit example. The workflow depends on the custom trainer executor implemented in #2248.

@Eric-Le-Ge changed the title from "Documentation For Distributed Training On Cifar10" to "Documentation For Distributed Training For Cifar10 Example" on Aug 15, 2020
@Eric-Le-Ge changed the title from "Documentation For Distributed Training For Cifar10 Example" to "Distributed Training For Cifar10 Example" on Aug 15, 2020
@Eric-Le-Ge
Contributor Author

@charlesccychen @chuanyu

@@ -0,0 +1,430 @@
# Lint as: python2, python3
Contributor

Is there a way we can do this without duplicating the util module?

Contributor Author

@Eric-Le-Ge Aug 27, 2020

I think about 4 places, roughly 30 lines in total, differ from the original util module. That is a non-trivial change, which is why I duplicated the module. That said, it should be possible to incorporate all of these changes in the README.

- Single worker training with GPU (i.e. NVIDIA K80)

If configured correctly, the above two configurations should yield a 60% to 70% speed up. If you need an even higher speed up,
consider using multi-worker training with 1 GPU per node. However, the cost efficiency will be lower due to observed

How much speedup was observed? Could you include that figure here as well?

Contributor Author

The largest speedup observed is about 80% relative to the baseline single-node case. I've added this data point here as well. FYI, these are just summaries of the experimental data shared in the internal document.

@github-actions
Contributor

This PR is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days

@github-actions
Contributor

github-actions bot commented Nov 9, 2020

This PR is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days

@github-actions bot added the stale label on Nov 9, 2020
@github-actions bot closed this on Nov 15, 2020

4 participants