NotFoundError: No CPU devices are available in this process #509

Closed
ghost opened this issue Jan 2, 2021 · 1 comment

Comments

ghost commented Jan 2, 2021

Hello!
I built TF 1.13.2 from source and am trying to run two-node distributed training on a single machine (1 CPU, 1 GPU) with a CPU-only parameter server. However, an error occurred when I ran the following command:

nohup python tf_cnn_benchmarks.py --data_format=NHWC --num_gpus=0 --batch_size=8 --model=vgg16 --data_name=imagenet --variable_update=parameter_server --ps_hosts='localhost:2222' --worker_hosts='localhost:2223' --job_name='ps' --task_index=0 --local_parameter_device=cpu --device=cpu > ps.log &

Any help would be appreciated. Thanks!

Environment:
OS: Ubuntu 18.04
GCC version: 4.8
CUDA and NCCL version: cuda 10.0
Framework version: TF 1.13

Here is my ps.log:


2021-01-02 14:56:01.077977: E tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:466] Not found: No CPU devices are available in this process
Traceback (most recent call last):
  File "tf_cnn_benchmarks.py", line 73, in <module>
    app.run(main)  # Raises error on invalid flags, unlike tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/usr/local/lib/python2.7/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "tf_cnn_benchmarks.py", line 63, in main
    bench = benchmark_cnn.BenchmarkCNN(params)
  File "/home/cluster/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 1454, in __init__
    params, create_config_proto(params))
  File "/home/cluster/benchmarks/scripts/tf_cnn_benchmarks/platforms/default/util.py", line 42, in get_cluster_manager
    return cnn_util.GrpcClusterManager(params, config_proto)
  File "/home/cluster/benchmarks/scripts/tf_cnn_benchmarks/cnn_util.py", line 246, in __init__
    protocol=params.server_protocol)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/server_lib.py", line 148, in __init__
    self._server = c_api.TF_NewServer(self._server_def.SerializeToString())
tensorflow.python.framework.errors_impl.NotFoundError: No CPU devices are available in this process

reedwm (Member) commented Jan 5, 2021

Unfortunately tf_cnn_benchmarks is unmaintained, so we do not plan on addressing issues. I recommend using the official models instead.

If you do want to keep using tf_cnn_benchmarks, try removing the --num_gpus=0 parameter, as IIRC you don't want that set when using CPUs (even if no GPUs are used). Also use the branch cnn_tf_v1.13_compatible if you aren't already doing so.
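
For reference, the suggestion above could look roughly like the following. This is a sketch rather than a verified fix: it is the parameter-server command from the report with only --num_gpus=0 dropped. The PS_ARGS variable and the added 2>&1 redirect are illustrative choices that are not in the original command. One plausible explanation, which is an unconfirmed assumption here, is that with --device=cpu the script uses the --num_gpus value as the CPU device count, so 0 leaves the server with no CPU device.

```shell
# Assumption: run from scripts/tf_cnn_benchmarks with the suggested branch
# already checked out, e.g.:  git checkout cnn_tf_v1.13_compatible

# Flags from the original report, with --num_gpus=0 removed.
# PS_ARGS is a hypothetical helper variable, used only for readability.
PS_ARGS="--data_format=NHWC --batch_size=8 --model=vgg16 --data_name=imagenet \
  --variable_update=parameter_server \
  --ps_hosts=localhost:2222 --worker_hosts=localhost:2223 \
  --job_name=ps --task_index=0 \
  --local_parameter_device=cpu --device=cpu"

# Start the parameter server in the background; 2>&1 also captures stderr,
# where TensorFlow writes errors such as the one in ps.log above.
nohup python tf_cnn_benchmarks.py $PS_ARGS > ps.log 2>&1 &
```

If the server still fails to start, ps.log should show whether the same NotFoundError is raised.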

ghost closed this as completed May 4, 2021