Distributed Training For Cifar10 Example #2336
This PR adds a user guide for using distributed training on GKE with the Cifar10 MLKit example. The workflow depends on the custom trainer executor implemented in #2248.
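As context for readers (not part of the PR itself), here is a minimal sketch of how a custom trainer executor is typically wired into a TFX `Trainer` component via `custom_executor_spec`, roughly following the TFX 0.2x-era API. The executor import path, module file name, and upstream components are hypothetical placeholders, not the actual names from #2248.

```python
from tfx.components import Trainer
from tfx.components.base import executor_spec
from tfx.proto import trainer_pb2

# Hypothetical import path; the real distributed trainer executor is the
# one implemented in #2248.
from distributed_trainer import executor as distributed_executor

# example_gen and schema_gen are assumed upstream pipeline components.
trainer = Trainer(
    module_file='cifar10_utils.py',  # hypothetical CIFAR-10 util module
    # Swap in the custom executor that launches multi-worker training on GKE.
    custom_executor_spec=executor_spec.ExecutorClassSpec(
        distributed_executor.Executor),
    examples=example_gen.outputs['examples'],
    schema=schema_gen.outputs['schema'],
    train_args=trainer_pb2.TrainArgs(num_steps=1000),
    eval_args=trainer_pb2.EvalArgs(num_steps=500),
)
```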
Conversation
```
@@ -0,0 +1,430 @@
# Lint as: python2, python3
```
Is there a way we can do this without duplicating the util module?
I think there are around 4 places and about 30 lines in total that differ from the original util module, which is a non-trivial change and is why I duplicated the module. That said, it should be possible to document all of these changes in the README.
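To give thread readers a feel for the kind of divergence being discussed (a sketch, not the actual diff): in a multi-worker variant, model construction and compilation typically move inside a `tf.distribute` strategy scope, assuming TF 2.3+ where `MultiWorkerMirroredStrategy` is no longer experimental.

```python
import tensorflow as tf

def _build_keras_model() -> tf.keras.Model:
  # Stand-in for the CIFAR-10 model defined in the util module.
  model = tf.keras.Sequential([
      tf.keras.layers.Conv2D(32, 3, activation='relu',
                             input_shape=(32, 32, 3)),
      tf.keras.layers.GlobalAveragePooling2D(),
      tf.keras.layers.Dense(10),
  ])
  model.compile(
      optimizer='adam',
      loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
      metrics=['sparse_categorical_accuracy'])
  return model

# Single-worker version builds the model directly:
#   model = _build_keras_model()
#
# The multi-worker version instead creates the model (and its variables)
# inside a distribution strategy scope so they are replicated across
# workers. Cluster topology is read from the TF_CONFIG environment variable.
strategy = tf.distribute.MultiWorkerMirroredStrategy()
with strategy.scope():
  model = _build_keras_model()
```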
- Single worker training with GPU (i.e. NVIDIA K80)

If configured correctly, the above two configurations should yield a 60% to 70% speed up. If you need an even higher speed up, consider using multi-worker training with 1 GPU per node. However, the cost efficiency will be lower due to observed
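As background for the multi-worker setup mentioned in the snippet above (not from the PR itself): multi-worker TensorFlow training is coordinated through the `TF_CONFIG` environment variable, which the orchestrator sets per pod. A hypothetical example for two workers with 1 GPU each; the host names are placeholders.

```python
import json
import os

# Hypothetical TF_CONFIG for a 2-worker cluster (1 GPU per node). In
# practice the orchestrator (e.g. the GKE job spec) sets this for each
# pod; only the 'index' field differs between workers.
os.environ['TF_CONFIG'] = json.dumps({
    'cluster': {
        'worker': ['cifar10-worker-0:2222', 'cifar10-worker-1:2222'],
    },
    'task': {'type': 'worker', 'index': 0},
})
```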
How much of a speedup was observed? Could you include it here as well?
The largest speedup observed was about 80% compared to the base single-node case. I've added this data point here as well. FYI, these are just summaries of the experimentation data shared in the internal document.
This PR is stale because it has been open 30 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.