Conversation

@dratini6
Contributor

This PR provides a fix for the issue described here: #275

@yahoocla

Thank you for submitting this pull request, however I do not see a valid CLA on file for you. Before we can merge this request please visit https://yahoocla.herokuapp.com/ and agree to the terms. Thanks! 😄

@junshi15
Contributor

LGTM.
@dratini6 can you sign the CLA?
@leewyang can you take a look as well? Basically, if `num_gpus` is zero, it is routed to the CPU path.

@leewyang
Contributor

leewyang commented May 1, 2018

I don't think that's sufficient... unless you're setting `num_gpus=0` on the PS nodes somehow...

FWIW, I tried a variant using `and ctx.job_name != 'ps'`, which mostly seems to work, although the PS node still reports an error:

2018-05-01 16:21:19.553707: E tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_NO_DEVICE

... but it seemed to be harmless, since the job completed fine.

@junshi15
Contributor

junshi15 commented May 1, 2018

I was not clear earlier.

This patch addresses the scenario where the user sets `num_gpus` to zero: the PS should not request a GPU, even on a GPU cluster. We run pure CPU jobs on a GPU cluster here, so we set `num_gpus = 0` as an input to the Spark job, via the following call:
TFNode.start_cluster_server()
The original logic, even with `num_gpus == 0`, still grabs one GPU for the PS, which is undesirable.

This patch, however, does not prevent the PS from grabbing a GPU if `num_gpus > 0`. We are not concerned with that case and do not try to address it with this patch.
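For illustration, the routing described above can be sketched as follows. This is a minimal hypothetical sketch, not the actual TensorFlowOnSpark code: the `select_device` helper and the device strings are assumptions made for this example.

```python
import os

def select_device(job_name, num_gpus):
    """Hypothetical sketch of per-node device selection.

    When num_gpus is 0, hide all CUDA devices so a GPU-enabled
    TensorFlow build falls back to CPU; this mirrors the fix where
    the PS no longer grabs a GPU when num_gpus == 0.
    """
    if num_gpus == 0:
        # Empty CUDA_VISIBLE_DEVICES makes CUDA report no devices,
        # which triggers the (harmless) CUDA_ERROR_NO_DEVICE message
        # and forces the CPU path.
        os.environ["CUDA_VISIBLE_DEVICES"] = ""
        return "/cpu:0"
    # num_gpus > 0: the GPU path is unchanged by this patch.
    return "/gpu:0"

print(select_device("ps", 0))      # → /cpu:0
print(select_device("worker", 1))  # → /gpu:0
```

With this routing, a PS started with `num_gpus=0` never claims a GPU, while the `num_gpus > 0` behavior stays as before.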

@junshi15
Contributor

junshi15 commented May 1, 2018

BTW, `CUDA_ERROR_NO_DEVICE` is harmless. Since the build is GPU-based, TensorFlow checks for CUDA devices, but we set `num_gpus = 0`, which sets `CUDA_VISIBLE_DEVICES=''` and causes the error message. TensorFlow then falls back to CPU mode, which is expected.

@dratini6
Contributor Author

dratini6 commented May 1, 2018

+1 to @junshi15's comment. When we call `TFNode.start_cluster_server(ctx, num_gpus=0)`, the current logic requests 1 GPU for the PS instead of 0. With this PR, the PS gets a CPU device in that call.

@leewyang
Contributor

leewyang commented May 1, 2018

K, makes sense. The PR is still failing the CLA check though; can you sign the CLA?

@dratini6
Contributor Author

dratini6 commented May 1, 2018

@leewyang @junshi15 I just signed the CLA. Thanks!

@leewyang
Contributor

leewyang commented May 1, 2018

👍
