Conversation

@dratini6
Contributor

This PR provides a fix for the issue described here: #275

@yahoocla

Thank you for submitting this pull request, however I do not see a valid CLA on file for you. Before we can merge this request please visit https://yahoocla.herokuapp.com/ and agree to the terms. Thanks! 😄

@junshi15
Contributor

LGTM.
@dratini6 can you sign the CLA?
@leewyang can you take a look as well? Basically, if `num_gpus` is zero, it is routed to the CPU path.

@leewyang
Contributor

leewyang commented May 1, 2018

I don't think that's sufficient... unless you're setting `num_gpus=0` on the PS nodes somehow...

FWIW, I tried a variant using `and ctx.job_name != 'ps'`, which mostly seems to work, although the PS node still reports an error:

2018-05-01 16:21:19.553707: E tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_NO_DEVICE

... but it seemed to be harmless, since the job completed fine.

@junshi15
Contributor

junshi15 commented May 1, 2018

I was not clear earlier.

This patch addresses the scenario where the user sets `num_gpus` to zero: the PS should not request a GPU, even on a GPU cluster. We run pure CPU jobs on a GPU cluster here, so we set `num_gpus = 0` as an input to the Spark job, via the following call:
TFNode.start_cluster_server()
The original logic, even with `num_gpus == 0`, still grabs one GPU for the PS, which is undesirable.

This patch, however, does not prevent the PS from grabbing a GPU if `num_gpus > 0`. We are not concerned with that case and do not try to address it with this patch.
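For illustration, the routing described above can be sketched as follows. This is a minimal hypothetical sketch, not the actual TensorFlowOnSpark code: the `select_device` helper and the device strings are assumptions made for this example.

```python
import os

def select_device(job_name, num_gpus):
    """Hypothetical sketch of per-node device selection.

    When num_gpus is 0, hide all CUDA devices so a GPU-enabled
    TensorFlow build falls back to CPU; this mirrors the fix where
    the PS no longer grabs a GPU when num_gpus == 0.
    """
    if num_gpus == 0:
        # Empty CUDA_VISIBLE_DEVICES makes CUDA report no devices,
        # which triggers the (harmless) CUDA_ERROR_NO_DEVICE message
        # and forces the CPU path.
        os.environ["CUDA_VISIBLE_DEVICES"] = ""
        return "/cpu:0"
    # num_gpus > 0: the GPU path is unchanged by this patch.
    return "/gpu:0"

print(select_device("ps", 0))      # → /cpu:0
print(select_device("worker", 1))  # → /gpu:0
```

With this routing, a PS started with `num_gpus=0` never claims a GPU, while the `num_gpus > 0` behavior stays as before.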

@junshi15
Contributor

junshi15 commented May 1, 2018

BTW, `CUDA_ERROR_NO_DEVICE` is harmless. Since the build is GPU-based, TensorFlow checks for CUDA devices, but we set `num_gpus = 0`, which sets `CUDA_VISIBLE_DEVICES=''` and causes the error message. TensorFlow then falls back to CPU mode, which is expected.

@dratini6
Contributor Author

dratini6 commented May 1, 2018

+1 to @junshi15's comment. When we call `TFNode.start_cluster_server(ctx, num_gpus=0)`, the current logic requests 1 GPU for the PS instead of 0. With this PR, the PS gets a CPU device in that call.

@leewyang
Contributor

leewyang commented May 1, 2018

K, makes sense. The PR is still failing the CLA check though; can you sign the CLA?

@dratini6
Contributor Author

dratini6 commented May 1, 2018

@leewyang @junshi15 I just signed the CLA. Thanks!

@leewyang
Contributor

leewyang commented May 1, 2018

👍
