Can't use GPU in the local mode. #34
Comments
Thanks for the detailed description. We will fix this problem ASAP. Sorry for your trouble.
Enhancement required: change the way of detecting available GPUs.
@jiaxiang-wu Thanks for the reply and explanation. Now I realize why the test passed on a machine with 4 GPUs but failed on this machine with one GPU. I will follow your update and test later.
@jiaxiang-wu In addition, please notify me once the problem is fixed. Thanks.
@howtocodewang No problem. We will update this issue after the fix is done, and also send you an e-mail as notification.
Will be glad to send a PR fixing this. I would like a clarification on the new policy, though.
@KranthiGV In my opinion, an idle GPU, as the authors define it, is a GPU with no processes running on it at all, not even as a display device. That means that if you have only one GPU in your PC, it will be handling display tasks at the same time as you run a GPU-based algorithm, so it is not idle. This is why I hit this issue on a PC with a single GPU but ran the demo successfully on a PC with 4 GPUs.
@howtocodewang Yes, you are right.
I'll go ahead with implementing the 1st policy.
@KranthiGV Great, looking forward to your contribution.
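For concreteness, the idle-GPU policy discussed above can be sketched in Python. This is only an illustration, not PocketFlow's actual `utils/get_idle_gpus.py`: it treats a GPU as idle only when `nvidia-smi` reports (almost) no memory in use, so a GPU driving a display (like the Xorg process in this issue) is excluded. The helper names and the 64 MiB threshold are my own assumptions.

```python
import subprocess

def parse_idle_gpus(smi_output, mem_threshold_mib=64):
    """Return indices of GPUs whose reported memory usage is below a small
    threshold. A GPU driving a display (e.g. the Xorg process in this
    issue) typically has a few hundred MiB in use, so it does not count
    as idle. Expects lines like '0, 244' (index, memory.used in MiB)."""
    idle = []
    for line in smi_output.strip().splitlines():
        index, mem_used = (field.strip() for field in line.split(','))
        if int(mem_used) < mem_threshold_mib:
            idle.append(int(index))
    return idle

def get_idle_gpus(nb_gpus_needed):
    """Query nvidia-smi and fail loudly when too few GPUs are idle,
    mirroring the ValueError reported in this issue."""
    output = subprocess.check_output(
        ['nvidia-smi', '--query-gpu=index,memory.used',
         '--format=csv,noheader,nounits'],
        universal_newlines=True)
    idle_gpus = parse_idle_gpus(output)
    if len(idle_gpus) < nb_gpus_needed:
        raise ValueError('not enough idle GPUs; idle GPUs are: {}'.format(idle_gpus))
    return idle_gpus[:nb_gpus_needed]
```

On the single-GPU machine in this report, `nvidia-smi` shows 244 MiB used by Xorg, so a check like `parse_idle_gpus("0, 244")` returns an empty list and the launcher aborts, while on a 4-GPU machine the non-display GPUs pass the check.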
@jiaxiang-wu |
@KranthiGV |
Describe the bug
My environment settings are:
To Reproduce
Steps to reproduce the behavior:
`./scripts/run_local.sh nets/resnet_at_cifar10_run.py`
```
(pruning_tf) daisy@deep-learning:~/Pruning_and_Compression/PocketFlow$ ./scripts/run_local.sh nets/resnet_at_cifar10_run.py
Python script: nets/resnet_at_cifar10_run.py
# of GPUs: 1
extra arguments: --model_http_url https://api.ai.tencent.com/pocketflow --data_dir_local /home/daisy/Pruning_and_Compression/PocketFlow/data/cifar-10-batches-bin
Traceback (most recent call last):
  File "utils/get_idle_gpus.py", line 54, in <module>
    raise ValueError('not enough idle GPUs; idle GPUs are: {}'.format(idle_gpus))
ValueError: not enough idle GPUs; idle GPUs are: []
'nets/resnet_at_cifar10_run.py' -> 'main.py'
multi-GPU training disabled
[WARNING] TF-Plus & Horovod cannot be imported; multi-GPU training is unsupported
INFO:tensorflow:FLAGS:
INFO:tensorflow:data_disk: local
INFO:tensorflow:data_hdfs_host: None
INFO:tensorflow:data_dir_local: /home/daisy/Pruning_and_Compression/PocketFlow/data/cifar-10-batches-bin
INFO:tensorflow:data_dir_hdfs: None
INFO:tensorflow:cycle_length: 4
INFO:tensorflow:nb_threads: 8
INFO:tensorflow:buffer_size: 1024
INFO:tensorflow:prefetch_size: 8
INFO:tensorflow:nb_classes: 10
INFO:tensorflow:nb_smpls_train: 50000
INFO:tensorflow:nb_smpls_val: 5000
INFO:tensorflow:nb_smpls_eval: 10000
INFO:tensorflow:batch_size: 128
INFO:tensorflow:batch_size_eval: 100
INFO:tensorflow:resnet_size: 20
INFO:tensorflow:lrn_rate_init: 0.1
INFO:tensorflow:batch_size_norm: 128.0
INFO:tensorflow:momentum: 0.9
INFO:tensorflow:loss_w_dcy: 0.0002
INFO:tensorflow:model_http_url: https://api.ai.tencent.com/pocketflow
INFO:tensorflow:summ_step: 100
INFO:tensorflow:save_step: 10000
INFO:tensorflow:save_path: ./models/model.ckpt
INFO:tensorflow:save_path_eval: ./models_eval/model.ckpt
INFO:tensorflow:enbl_dst: False
INFO:tensorflow:enbl_warm_start: False
INFO:tensorflow:loss_w_dst: 4.0
INFO:tensorflow:tempr_dst: 4.0
INFO:tensorflow:save_path_dst: ./models_dst/model.ckpt
INFO:tensorflow:nb_epochs_rat: 1.0
INFO:tensorflow:ddpg_actor_depth: 2
INFO:tensorflow:ddpg_actor_width: 64
INFO:tensorflow:ddpg_critic_depth: 2
INFO:tensorflow:ddpg_critic_width: 64
INFO:tensorflow:ddpg_noise_type: param
INFO:tensorflow:ddpg_noise_prtl: tdecy
INFO:tensorflow:ddpg_noise_std_init: 1.0
INFO:tensorflow:ddpg_noise_dst_finl: 0.01
INFO:tensorflow:ddpg_noise_adpt_rat: 1.03
INFO:tensorflow:ddpg_noise_std_finl: 1e-05
INFO:tensorflow:ddpg_rms_eps: 0.0001
INFO:tensorflow:ddpg_tau: 0.01
INFO:tensorflow:ddpg_gamma: 0.9
INFO:tensorflow:ddpg_lrn_rate: 0.001
INFO:tensorflow:ddpg_loss_w_dcy: 0.0
INFO:tensorflow:ddpg_record_step: 1
INFO:tensorflow:ddpg_batch_size: 64
INFO:tensorflow:ddpg_enbl_bsln_func: True
INFO:tensorflow:ddpg_bsln_decy_rate: 0.95
INFO:tensorflow:ws_save_path: ./models_ws/model.ckpt
INFO:tensorflow:ws_prune_ratio: 0.75
INFO:tensorflow:ws_prune_ratio_prtl: optimal
INFO:tensorflow:ws_nb_rlouts: 200
INFO:tensorflow:ws_nb_rlouts_min: 50
INFO:tensorflow:ws_reward_type: single-obj
INFO:tensorflow:ws_lrn_rate_rg: 0.03
INFO:tensorflow:ws_nb_iters_rg: 20
INFO:tensorflow:ws_lrn_rate_ft: 0.0003
INFO:tensorflow:ws_nb_iters_ft: 400
INFO:tensorflow:ws_nb_iters_feval: 25
INFO:tensorflow:ws_prune_ratio_exp: 3.0
INFO:tensorflow:ws_iter_ratio_beg: 0.1
INFO:tensorflow:ws_iter_ratio_end: 0.5
INFO:tensorflow:ws_mask_update_step: 500.0
INFO:tensorflow:cp_lasso: True
INFO:tensorflow:cp_quadruple: False
INFO:tensorflow:cp_reward_policy: accuracy
INFO:tensorflow:cp_nb_points_per_layer: 10
INFO:tensorflow:cp_nb_batches: 60
INFO:tensorflow:cp_prune_option: auto
INFO:tensorflow:cp_prune_list_file: ratio.list
INFO:tensorflow:cp_best_path: ./models/best_model.ckpt
INFO:tensorflow:cp_original_path: ./models/original_model.ckpt
INFO:tensorflow:cp_preserve_ratio: 0.5
INFO:tensorflow:cp_uniform_preserve_ratio: 0.6
INFO:tensorflow:cp_noise_tolerance: 0.15
INFO:tensorflow:cp_lrn_rate_ft: 0.0001
INFO:tensorflow:cp_nb_iters_ft_ratio: 0.2
INFO:tensorflow:cp_finetune: False
INFO:tensorflow:cp_retrain: False
INFO:tensorflow:cp_list_group: 1000
INFO:tensorflow:cp_nb_rlouts: 200
INFO:tensorflow:cp_nb_rlouts_min: 50
INFO:tensorflow:dcp_save_path: ./models_dcp/model.ckpt
INFO:tensorflow:dcp_save_path_eval: ./models_dcp_eval/model.ckpt
INFO:tensorflow:dcp_prune_ratio: 0.5
INFO:tensorflow:dcp_nb_stages: 3
INFO:tensorflow:dcp_lrn_rate_adam: 0.001
INFO:tensorflow:dcp_nb_iters_block: 10000
INFO:tensorflow:dcp_nb_iters_layer: 500
INFO:tensorflow:uql_equivalent_bits: 4
INFO:tensorflow:uql_nb_rlouts: 200
INFO:tensorflow:uql_w_bit_min: 2
INFO:tensorflow:uql_w_bit_max: 8
INFO:tensorflow:uql_tune_layerwise_steps: 100
INFO:tensorflow:uql_tune_global_steps: 2000
INFO:tensorflow:uql_tune_save_path: ./rl_tune_models/model.ckpt
INFO:tensorflow:uql_tune_disp_steps: 300
INFO:tensorflow:uql_enbl_random_layers: True
INFO:tensorflow:uql_enbl_rl_agent: False
INFO:tensorflow:uql_enbl_rl_global_tune: True
INFO:tensorflow:uql_enbl_rl_layerwise_tune: False
INFO:tensorflow:uql_weight_bits: 4
INFO:tensorflow:uql_activation_bits: 32
INFO:tensorflow:uql_use_buckets: False
INFO:tensorflow:uql_bucket_size: 256
INFO:tensorflow:uql_quant_epochs: 60
INFO:tensorflow:uql_save_quant_model_path: ./uql_quant_models/uql_quant_model.ckpt
INFO:tensorflow:uql_quantize_all_layers: False
INFO:tensorflow:uql_bucket_type: channel
INFO:tensorflow:uqtf_save_path: ./models_uqtf/model.ckpt
INFO:tensorflow:uqtf_save_path_eval: ./models_uqtf_eval/model.ckpt
INFO:tensorflow:uqtf_weight_bits: 8
INFO:tensorflow:uqtf_activation_bits: 8
INFO:tensorflow:uqtf_quant_delay: 0
INFO:tensorflow:uqtf_freeze_bn_delay: None
INFO:tensorflow:uqtf_lrn_rate_dcy: 0.01
INFO:tensorflow:nuql_equivalent_bits: 4
INFO:tensorflow:nuql_nb_rlouts: 200
INFO:tensorflow:nuql_w_bit_min: 2
INFO:tensorflow:nuql_w_bit_max: 8
INFO:tensorflow:nuql_tune_layerwise_steps: 100
INFO:tensorflow:nuql_tune_global_steps: 2101
INFO:tensorflow:nuql_tune_save_path: ./rl_tune_models/model.ckpt
INFO:tensorflow:nuql_tune_disp_steps: 300
INFO:tensorflow:nuql_enbl_random_layers: True
INFO:tensorflow:nuql_enbl_rl_agent: False
INFO:tensorflow:nuql_enbl_rl_global_tune: True
INFO:tensorflow:nuql_enbl_rl_layerwise_tune: False
INFO:tensorflow:nuql_init_style: quantile
INFO:tensorflow:nuql_opt_mode: weights
INFO:tensorflow:nuql_weight_bits: 4
INFO:tensorflow:nuql_activation_bits: 32
INFO:tensorflow:nuql_use_buckets: False
INFO:tensorflow:nuql_bucket_size: 256
INFO:tensorflow:nuql_quant_epochs: 60
INFO:tensorflow:nuql_save_quant_model_path: ./nuql_quant_models/model.ckpt
INFO:tensorflow:nuql_quantize_all_layers: False
INFO:tensorflow:nuql_bucket_type: split
INFO:tensorflow:log_dir: ./logs
INFO:tensorflow:enbl_multi_gpu: False
INFO:tensorflow:learner: full-prec
INFO:tensorflow:exec_mode: train
INFO:tensorflow:debug: False
INFO:tensorflow:h: False
INFO:tensorflow:help: False
INFO:tensorflow:helpfull: False
INFO:tensorflow:helpshort: False
2018-11-08 09:10:59.811925: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-11-08 09:10:59.814345: E tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2018-11-08 09:10:59.814367: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:163] retrieving CUDA diagnostic information for host: deep-learning
2018-11-08 09:10:59.814374: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:170] hostname: deep-learning
2018-11-08 09:10:59.814396: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:194] libcuda reported version is: 390.12.0
2018-11-08 09:10:59.814417: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:198] kernel reported version is: 390.12.0
2018-11-08 09:10:59.814424: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:305] kernel version seems to match DSO: 390.12.0
INFO:tensorflow:iter #100: lr = 1.0000e-01 | loss = 1.7772e+00 | accuracy = 3.7500e-01 | speed = 95.59 pics / sec
```
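A plausible chain of events connecting the `ValueError` from `get_idle_gpus.py` and the later `CUDA_ERROR_NO_DEVICE` (this is my reading of the logs, not confirmed against `run_local.sh`): the launcher found no idle GPUs and exposed an empty `CUDA_VISIBLE_DEVICES` to the training process, which hides every device from the CUDA runtime. A minimal sketch of how that variable is interpreted (the helper name is hypothetical):

```python
def visible_gpu_ids(env_value):
    """Interpret a CUDA_VISIBLE_DEVICES value the way the CUDA runtime
    does: unset -> all devices visible; empty string -> no devices
    visible; otherwise a comma-separated list of device IDs."""
    if env_value is None:  # variable not set at all
        return 'all'
    return [tok.strip() for tok in env_value.split(',') if tok.strip()]

# With no idle GPUs found, the training process ends up with no visible
# devices, so TensorFlow reports CUDA_ERROR_NO_DEVICE even though
# nvidia-smi can see the GTX 980 Ti.
print(visible_gpu_ids(''))   # -> []
print(visible_gpu_ids('0'))  # -> ['0']
```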
Expected behavior
First, I found that the training speed was too slow, which made me suspect the GPU was not being used. Then I noticed
`failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected`
in the printed log. So I checked the GPU status by running `nvidia-smi` in the terminal and got the following output:

```
(pruning_tf) daisy@deep-learning:~$ nvidia-smi
Thu Nov 8 09:25:12 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.12 Driver Version: 390.12 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 980 Ti Off | 00000000:01:00.0 On | N/A |
| 30% 67C P0 68W / 250W | 244MiB / 6080MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1091 G /usr/lib/xorg/Xorg 242MiB |
+-----------------------------------------------------------------------------+
```

which proved that I had installed the right CUDA version and NVIDIA driver version.
Last, I tried to import the tensorflow module in Python, and it did not report any error about CUDA. It printed:

```
(pruning_tf) daisy@deep-learning:~$ python
Python 3.6.5 |Anaconda custom (64-bit)| (default, Apr 29 2018, 16:14:56)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
```
I don't think the warning above is the key reason. The slow speed and low GPU utilization come from the GPU not being used at all.
Can anyone help me solve this problem? Thanks a lot!