Segmentation fault #18

Closed
as754770178 opened this issue Nov 6, 2018 · 7 comments

@as754770178

python3.6 cuda9 cudnn7 tensorflow:1.10.0

`(tf-1.10-cp3) [zgz@localhost PocketFlow]$ ./scripts/run_local.sh nets/resnet_at_cifar10_run.py
Python script: nets/resnet_at_cifar10_run.py
# of GPUs: 1
extra arguments: --model_http_url https://api.ai.tencent.com/pocketflow --data_dir_local /home/zgz/project/data_set/cifar-10-batches-py
Traceback (most recent call last):
  File "utils/get_idle_gpus.py", line 54, in <module>
    raise ValueError('not enough idle GPUs; idle GPUs are: {}'.format(idle_gpus))
ValueError: not enough idle GPUs; idle GPUs are: []
‘nets/resnet_at_cifar10_run.py’ -> ‘main.py’
multi-GPU training disabled
[WARNING] TF-Plus & Horovod cannot be imported; multi-GPU training is unsupported
INFO:tensorflow:FLAGS:
INFO:tensorflow:data_disk: local
INFO:tensorflow:data_hdfs_host: None
INFO:tensorflow:data_dir_local: /home/zgz/project/data_set/cifar-10-batches-py
INFO:tensorflow:data_dir_hdfs: None
INFO:tensorflow:cycle_length: 4
INFO:tensorflow:nb_threads: 8
INFO:tensorflow:buffer_size: 1024
INFO:tensorflow:prefetch_size: 8
INFO:tensorflow:nb_classes: 10
INFO:tensorflow:nb_smpls_train: 50000
INFO:tensorflow:nb_smpls_val: 5000
INFO:tensorflow:nb_smpls_eval: 10000
INFO:tensorflow:batch_size: 128
INFO:tensorflow:batch_size_eval: 100
INFO:tensorflow:resnet_size: 20
INFO:tensorflow:lrn_rate_init: 0.1
INFO:tensorflow:batch_size_norm: 128.0
INFO:tensorflow:momentum: 0.9
INFO:tensorflow:loss_w_dcy: 0.0002
INFO:tensorflow:model_http_url: https://api.ai.tencent.com/pocketflow
INFO:tensorflow:summ_step: 100
INFO:tensorflow:save_step: 10000
INFO:tensorflow:save_path: ./models/model.ckpt
INFO:tensorflow:save_path_eval: ./models_eval/model.ckpt
INFO:tensorflow:enbl_dst: False
INFO:tensorflow:enbl_warm_start: False
INFO:tensorflow:loss_w_dst: 4.0
INFO:tensorflow:tempr_dst: 4.0
INFO:tensorflow:save_path_dst: ./models_dst/model.ckpt
INFO:tensorflow:nb_epochs_rat: 1.0
INFO:tensorflow:ddpg_actor_depth: 2
INFO:tensorflow:ddpg_actor_width: 64
INFO:tensorflow:ddpg_critic_depth: 2
INFO:tensorflow:ddpg_critic_width: 64
INFO:tensorflow:ddpg_noise_type: param
INFO:tensorflow:ddpg_noise_prtl: tdecy
INFO:tensorflow:ddpg_noise_std_init: 1.0
INFO:tensorflow:ddpg_noise_dst_finl: 0.01
INFO:tensorflow:ddpg_noise_adpt_rat: 1.03
INFO:tensorflow:ddpg_noise_std_finl: 1e-05
INFO:tensorflow:ddpg_rms_eps: 0.0001
INFO:tensorflow:ddpg_tau: 0.01
INFO:tensorflow:ddpg_gamma: 0.9
INFO:tensorflow:ddpg_lrn_rate: 0.001
INFO:tensorflow:ddpg_loss_w_dcy: 0.0
INFO:tensorflow:ddpg_record_step: 1
INFO:tensorflow:ddpg_batch_size: 64
INFO:tensorflow:ddpg_enbl_bsln_func: True
INFO:tensorflow:ddpg_bsln_decy_rate: 0.95
INFO:tensorflow:ws_save_path: ./models_ws/model.ckpt
INFO:tensorflow:ws_prune_ratio: 0.75
INFO:tensorflow:ws_prune_ratio_prtl: optimal
INFO:tensorflow:ws_nb_rlouts: 200
INFO:tensorflow:ws_nb_rlouts_min: 50
INFO:tensorflow:ws_reward_type: single-obj
INFO:tensorflow:ws_lrn_rate_rg: 0.03
INFO:tensorflow:ws_nb_iters_rg: 20
INFO:tensorflow:ws_lrn_rate_ft: 0.0003
INFO:tensorflow:ws_nb_iters_ft: 400
INFO:tensorflow:ws_nb_iters_feval: 25
INFO:tensorflow:ws_prune_ratio_exp: 3.0
INFO:tensorflow:ws_iter_ratio_beg: 0.1
INFO:tensorflow:ws_iter_ratio_end: 0.5
INFO:tensorflow:ws_mask_update_step: 500.0
INFO:tensorflow:cp_lasso: True
INFO:tensorflow:cp_quadruple: False
INFO:tensorflow:cp_reward_policy: accuracy
INFO:tensorflow:cp_nb_points_per_layer: 10
INFO:tensorflow:cp_nb_batches: 60
INFO:tensorflow:cp_prune_option: auto
INFO:tensorflow:cp_prune_list_file: ratio.list
INFO:tensorflow:cp_best_path: ./models/best_model.ckpt
INFO:tensorflow:cp_original_path: ./models/original_model.ckpt
INFO:tensorflow:cp_preserve_ratio: 0.5
INFO:tensorflow:cp_uniform_preserve_ratio: 0.6
INFO:tensorflow:cp_noise_tolerance: 0.15
INFO:tensorflow:cp_lrn_rate_ft: 0.0001
INFO:tensorflow:cp_nb_iters_ft_ratio: 0.2
INFO:tensorflow:cp_finetune: False
INFO:tensorflow:cp_retrain: False
INFO:tensorflow:cp_list_group: 1000
INFO:tensorflow:cp_nb_rlouts: 200
INFO:tensorflow:cp_nb_rlouts_min: 50
INFO:tensorflow:dcp_save_path: ./models_dcp/model.ckpt
INFO:tensorflow:dcp_save_path_eval: ./models_dcp_eval/model.ckpt
INFO:tensorflow:dcp_prune_ratio: 0.5
INFO:tensorflow:dcp_nb_stages: 3
INFO:tensorflow:dcp_lrn_rate_adam: 0.001
INFO:tensorflow:dcp_nb_iters_block: 10000
INFO:tensorflow:dcp_nb_iters_layer: 500
INFO:tensorflow:uql_equivalent_bits: 4
INFO:tensorflow:uql_nb_rlouts: 200
INFO:tensorflow:uql_w_bit_min: 2
INFO:tensorflow:uql_w_bit_max: 8
INFO:tensorflow:uql_tune_layerwise_steps: 100
INFO:tensorflow:uql_tune_global_steps: 2000
INFO:tensorflow:uql_tune_save_path: ./rl_tune_models/model.ckpt
INFO:tensorflow:uql_tune_disp_steps: 300
INFO:tensorflow:uql_enbl_random_layers: True
INFO:tensorflow:uql_enbl_rl_agent: False
INFO:tensorflow:uql_enbl_rl_global_tune: True
INFO:tensorflow:uql_enbl_rl_layerwise_tune: False
INFO:tensorflow:uql_weight_bits: 4
INFO:tensorflow:uql_activation_bits: 32
INFO:tensorflow:uql_use_buckets: False
INFO:tensorflow:uql_bucket_size: 256
INFO:tensorflow:uql_quant_epochs: 60
INFO:tensorflow:uql_save_quant_model_path: ./uql_quant_models/uql_quant_model.ckpt
INFO:tensorflow:uql_quantize_all_layers: False
INFO:tensorflow:uql_bucket_type: channel
INFO:tensorflow:uqtf_save_path: ./models_uqtf/model.ckpt
INFO:tensorflow:uqtf_save_path_eval: ./models_uqtf_eval/model.ckpt
INFO:tensorflow:uqtf_weight_bits: 8
INFO:tensorflow:uqtf_activation_bits: 8
INFO:tensorflow:uqtf_quant_delay: 0
INFO:tensorflow:uqtf_freeze_bn_delay: None
INFO:tensorflow:uqtf_lrn_rate_dcy: 0.01
INFO:tensorflow:nuql_equivalent_bits: 4
INFO:tensorflow:nuql_nb_rlouts: 200
INFO:tensorflow:nuql_w_bit_min: 2
INFO:tensorflow:nuql_w_bit_max: 8
INFO:tensorflow:nuql_tune_layerwise_steps: 100
INFO:tensorflow:nuql_tune_global_steps: 2101
INFO:tensorflow:nuql_tune_save_path: ./rl_tune_models/model.ckpt
INFO:tensorflow:nuql_tune_disp_steps: 300
INFO:tensorflow:nuql_enbl_random_layers: True
INFO:tensorflow:nuql_enbl_rl_agent: False
INFO:tensorflow:nuql_enbl_rl_global_tune: True
INFO:tensorflow:nuql_enbl_rl_layerwise_tune: False
INFO:tensorflow:nuql_init_style: quantile
INFO:tensorflow:nuql_opt_mode: weights
INFO:tensorflow:nuql_weight_bits: 4
INFO:tensorflow:nuql_activation_bits: 32
INFO:tensorflow:nuql_use_buckets: False
INFO:tensorflow:nuql_bucket_size: 256
INFO:tensorflow:nuql_quant_epochs: 60
INFO:tensorflow:nuql_save_quant_model_path: ./nuql_quant_models/model.ckpt
INFO:tensorflow:nuql_quantize_all_layers: False
INFO:tensorflow:nuql_bucket_type: split
INFO:tensorflow:log_dir: ./logs
INFO:tensorflow:enbl_multi_gpu: False
INFO:tensorflow:learner: full-prec
INFO:tensorflow:exec_mode: train
INFO:tensorflow:debug: False
INFO:tensorflow:h: False
INFO:tensorflow:help: False
INFO:tensorflow:helpfull: False
INFO:tensorflow:helpshort: False
2018-11-06 12:51:30.550266: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-11-06 12:51:30.582314: E tensorflow/stream_executor/cuda/cuda_driver.cc:397] failed call to cuInit: CUDA_ERROR_NO_DEVICE
2018-11-06 12:51:30.582432: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:163] retrieving CUDA diagnostic information for host: localhost.localdomain
2018-11-06 12:51:30.582455: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:170] hostname: localhost.localdomain
2018-11-06 12:51:30.582543: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:194] libcuda reported version is: 396.26.0
2018-11-06 12:51:30.582644: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:198] kernel reported version is: 396.26.0
2018-11-06 12:51:30.582667: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:305] kernel version seems to match DSO: 396.26.0
./scripts/run_local.sh: line 45: 43704 Segmentation fault (core dumped) python main.py ${extra_args}
`

path.conf
`# data files
data_hdfs_host = None
data_dir_local_cifar10 = /home/zgz/project/data_set/cifar-10-batches-py
data_dir_hdfs_cifar10 = None
data_dir_seven_cifar10 = None
data_dir_docker_cifar10 = /opt/ml/data # DO NOT EDIT
data_dir_local_ilsvrc12 = None
data_dir_hdfs_ilsvrc12 = None
data_dir_seven_ilsvrc12 = None
data_dir_docker_ilsvrc12 = /opt/ml/data # DO NOT EDIT

# model files
model_http_url = https://api.ai.tencent.com/pocketflow
`

@jiaxiang-wu
Contributor

It seems that you are using the Python version of CIFAR-10 data set.

INFO:tensorflow:data_dir_local: /home/zgz/project/data_set/cifar-10-batches-py

Please use the binary version instead. Same issue as this one.
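A quick way to tell the two CIFAR-10 variants apart is to look at the file names in the data directory. The sketch below is not part of PocketFlow; `detect_cifar10_format` is a hypothetical helper that relies only on the file names shipped in the official archives (`data_batch_1.bin` ... `test_batch.bin` for the binary version, the same names without `.bin` for the Python version).

```python
import os

# File names shipped in the two official CIFAR-10 archives.
BINARY_FILES = ["data_batch_%d.bin" % i for i in range(1, 6)] + ["test_batch.bin"]
PYTHON_FILES = ["data_batch_%d" % i for i in range(1, 6)] + ["test_batch"]


def detect_cifar10_format(data_dir):
    """Return 'binary', 'python', or 'unknown' based on the files present.

    Hypothetical helper for illustration; it only inspects file names and
    never opens the data files themselves.
    """
    names = set(os.listdir(data_dir))
    if all(name in names for name in BINARY_FILES):
        return "binary"
    if all(name in names for name in PYTHON_FILES):
        return "python"
    return "unknown"
```

Calling `detect_cifar10_format` on the directory passed as `--data_dir_local` before training should return `'binary'`; a directory such as `cifar-10-batches-py` would be expected to return `'python'`.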

@as754770178
Author

Traceback (most recent call last):
  File "utils/get_idle_gpus.py", line 54, in <module>
    raise ValueError('not enough idle GPUs; idle GPUs are: {}'.format(idle_gpus))
ValueError: not enough idle GPUs; idle GPUs are: []
‘nets/resnet_at_cifar10_run.py’ -> ‘main.py’

and

2018-11-06 14:54:44.429596: E tensorflow/stream_executor/cuda/cuda_driver.cc:397] failed call to cuInit: CUDA_ERROR_NO_DEVICE

Why are these two errors reported?

@jiaxiang-wu
Contributor

jiaxiang-wu commented Nov 6, 2018

It seems PocketFlow failed to find an idle GPU device. Can you post the result of nvidia-smi?

$ nvidia-smi

@as754770178
Author

`(tf-1.10-cp3) [zgz@localhost models]$ nvidia-smi
Tue Nov 6 15:05:06 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.26                 Driver Version: 396.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:06:00.0 Off |                    0 |
| N/A   81C    P0    64W / 149W |    110MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 00000000:07:00.0 Off |                    0 |
| N/A   61C    P0    73W / 149W |    110MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 00000000:85:00.0 Off |                    0 |
| N/A   79C    P0    61W / 149W |    110MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 00000000:86:00.0 Off |                    0 |
| N/A   59C    P0    71W / 149W |    110MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     17631      C   ...gz/anaconda2/envs/tf-1.8-cp3/bin/python    99MiB |
|    1     17631      C   ...gz/anaconda2/envs/tf-1.8-cp3/bin/python    99MiB |
|    2     17631      C   ...gz/anaconda2/envs/tf-1.8-cp3/bin/python    99MiB |
|    3     17631      C   ...gz/anaconda2/envs/tf-1.8-cp3/bin/python    99MiB |
+-----------------------------------------------------------------------------+`

`>>> from tensorflow.python.client import device_lib
>>> print(device_lib.list_local_devices())
2018-11-06 12:46:55.594800: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-11-06 12:46:55.832609: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:06:00.0
totalMemory: 11.17GiB freeMemory: 11.00GiB
2018-11-06 12:46:56.019314: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 1 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:07:00.0
totalMemory: 11.17GiB freeMemory: 11.00GiB
2018-11-06 12:46:56.188525: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 2 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:85:00.0
totalMemory: 11.17GiB freeMemory: 11.00GiB
2018-11-06 12:46:56.388330: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 3 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:86:00.0
totalMemory: 11.17GiB freeMemory: 11.00GiB
2018-11-06 12:46:56.388891: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0, 1, 2, 3
2018-11-06 12:46:57.834106: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-06 12:46:57.834160: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0 1 2 3
2018-11-06 12:46:57.834170: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N Y N N
2018-11-06 12:46:57.834178: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 1: Y N N N
2018-11-06 12:46:57.834184: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 2: N N N Y
2018-11-06 12:46:57.834190: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 3: N N Y N
2018-11-06 12:46:57.835321: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/device:GPU:0 with 10662 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:06:00.0, compute capability: 3.7)
2018-11-06 12:46:57.955038: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/device:GPU:1 with 10662 MB memory) -> physical GPU (device: 1, name: Tesla K80, pci bus id: 0000:07:00.0, compute capability: 3.7)
2018-11-06 12:46:58.078889: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/device:GPU:2 with 10662 MB memory) -> physical GPU (device: 2, name: Tesla K80, pci bus id: 0000:85:00.0, compute capability: 3.7)
2018-11-06 12:46:58.196860: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/device:GPU:3 with 10662 MB memory) -> physical GPU (device: 3, name: Tesla K80, pci bus id: 0000:86:00.0, compute capability: 3.7)
`

@jiaxiang-wu
Contributor

In utils/get_idle_gpus.py, a GPU device is treated as idle only if no process is running on it. According to your nvidia-smi output, each of the four GPUs has a process running on it, so utils/get_idle_gpus.py cannot find an idle one.
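The "no process means idle" rule can be sketched as follows. This is not the actual content of utils/get_idle_gpus.py; `find_idle_gpus`, `parse_csv_rows`, and `query_nvidia_smi` are hypothetical names, and the sketch assumes the CSV query interface of nvidia-smi (`--query-gpu=index,uuid` and `--query-compute-apps=gpu_uuid,pid`).

```python
import subprocess


def parse_csv_rows(text):
    """Split header-less nvidia-smi CSV output into lists of stripped fields."""
    return [[field.strip() for field in line.split(",")]
            for line in text.splitlines() if line.strip()]


def find_idle_gpus(gpu_csv, apps_csv):
    """Return the indices of GPUs that have no compute process attached.

    gpu_csv:  text produced by --query-gpu=index,uuid
    apps_csv: text produced by --query-compute-apps=gpu_uuid,pid
    """
    busy_uuids = {row[0] for row in parse_csv_rows(apps_csv)}
    return [int(index) for index, uuid in parse_csv_rows(gpu_csv)
            if uuid not in busy_uuids]


def query_nvidia_smi(query_flag):
    """Run nvidia-smi with a single query flag and return its CSV output."""
    cmd = ["nvidia-smi", query_flag, "--format=csv,noheader"]
    return subprocess.check_output(cmd, universal_newlines=True)
```

On a machine with nvidia-smi installed, `find_idle_gpus(query_nvidia_smi("--query-gpu=index,uuid"), query_nvidia_smi("--query-compute-apps=gpu_uuid,pid"))` would list the idle devices. With one process attached to every GPU, as in the output above, the result is an empty list, which matches the `idle GPUs are: []` message.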

To temporarily override this, you may skip calling utils/get_idle_gpus.py and manually specify an idle GPU in scripts/run_local.sh.
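One common way to pin a job to a specific GPU is the `CUDA_VISIBLE_DEVICES` environment variable. The sketch below assumes that this is also the mechanism scripts/run_local.sh relies on, which the thread does not confirm; `build_env` and `run_on_gpus` are hypothetical helpers, not PocketFlow functions.

```python
import os
import subprocess
import sys


def build_env(gpu_ids, base_env=None):
    """Return a copy of the environment with CUDA_VISIBLE_DEVICES set.

    gpu_ids is a list of device indices chosen by hand, e.g. [0].
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


def run_on_gpus(script, gpu_ids, extra_args=()):
    """Launch `script` with only the chosen GPUs visible to CUDA."""
    cmd = [sys.executable, script] + list(extra_args)
    return subprocess.call(cmd, env=build_env(gpu_ids))
```

For example, `run_on_gpus("main.py", [0])` would start the training script with only GPU 0 visible. Note that an empty `CUDA_VISIBLE_DEVICES` value hides every GPU from CUDA, which may explain the `CUDA_ERROR_NO_DEVICE` message in the log if the script exported an empty device list after the idle-GPU lookup failed.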

@as754770178
Author

OK, thanks

@jiaxiang-wu
Contributor

Closing this issue. Reopen it if there are any further questions.
