Update the example for spark-tensorflow-distributor #166

liangz1 · 2020-07-15T15:24:25Z

This PR fixes the data downloading issue in the example code.

Reproduce: On a cluster with multiple GPUs per worker node and spark.task.resource.gpu.amount set to 1, running the original example triggers a data downloading error.

Cause: Multiple tasks run on the same worker node, and each task tries to write the downloaded data to the same path. The concurrent writes corrupt the file.

Fix: Randomize the file path so that each task downloads to its own location.
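
A minimal sketch of the idea, not the exact change in this PR: each task builds its own file name from a random UUID, so two tasks on the same worker never write to the same file. The helper name `unique_download_name` and the default file name `mnist.npz` are assumptions made for illustration.

```python
import uuid


def unique_download_name(filename="mnist.npz"):
    """Return a file name that differs on every call (hypothetical helper)."""
    # uuid4 is random, so concurrent tasks on one worker get distinct names
    # and cannot overwrite each other's partially written download.
    return f"{uuid.uuid4().hex}-{filename}"


# Each task calls the helper inside its own training function.
name_a = unique_download_name()
name_b = unique_download_name()
print(name_a != name_b)
```

Inside the training function, the generated name could then be passed to the dataset loader, for example as the `path` argument of `tf.keras.datasets.mnist.load_data`, assuming the example loads MNIST that way. The trade-off is that every task downloads its own copy of the data instead of sharing one cached file.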

liangz1 added 2 commits July 13, 2020 07:54

fix example

6085a1d

fix example

28264e0

googlebot added the cla: yes label Jul 15, 2020

jhseu approved these changes Jul 15, 2020

jhseu merged commit 8d96a9f into tensorflow:master Jul 15, 2020
