Segmentation fault #18

as754770178 · 2018-11-06T04:05:17Z

Python 3.6, CUDA 9, cuDNN 7, TensorFlow 1.10.0

`(tf-1.10-cp3) [zgz@localhost PocketFlow]$ ./scripts/run_local.sh nets/resnet_at_cifar10_run.py
Python script: nets/resnet_at_cifar10_run.py

# of GPUs: 1

extra arguments: --model_http_url https://api.ai.tencent.com/pocketflow --data_dir_local /home/zgz/project/data_set/cifar-10-batches-py
Traceback (most recent call last):
File "utils/get_idle_gpus.py", line 54, in <module>
raise ValueError('not enough idle GPUs; idle GPUs are: {}'.format(idle_gpus))
ValueError: not enough idle GPUs; idle GPUs are: []
‘nets/resnet_at_cifar10_run.py’ -> ‘main.py’
multi-GPU training disabled
[WARNING] TF-Plus & Horovod cannot be imported; multi-GPU training is unsupported
INFO:tensorflow:FLAGS:
INFO:tensorflow:data_disk: local
INFO:tensorflow:data_hdfs_host: None
INFO:tensorflow:data_dir_local: /home/zgz/project/data_set/cifar-10-batches-py
INFO:tensorflow:data_dir_hdfs: None
INFO:tensorflow:cycle_length: 4
INFO:tensorflow:nb_threads: 8
INFO:tensorflow:buffer_size: 1024
INFO:tensorflow:prefetch_size: 8
INFO:tensorflow:nb_classes: 10
INFO:tensorflow:nb_smpls_train: 50000
INFO:tensorflow:nb_smpls_val: 5000
INFO:tensorflow:nb_smpls_eval: 10000
INFO:tensorflow:batch_size: 128
INFO:tensorflow:batch_size_eval: 100
INFO:tensorflow:resnet_size: 20
INFO:tensorflow:lrn_rate_init: 0.1
INFO:tensorflow:batch_size_norm: 128.0
INFO:tensorflow:momentum: 0.9
INFO:tensorflow:loss_w_dcy: 0.0002
INFO:tensorflow:model_http_url: https://api.ai.tencent.com/pocketflow
INFO:tensorflow:summ_step: 100
INFO:tensorflow:save_step: 10000
INFO:tensorflow:save_path: ./models/model.ckpt
INFO:tensorflow:save_path_eval: ./models_eval/model.ckpt
INFO:tensorflow:enbl_dst: False
INFO:tensorflow:enbl_warm_start: False
INFO:tensorflow:loss_w_dst: 4.0
INFO:tensorflow:tempr_dst: 4.0
INFO:tensorflow:save_path_dst: ./models_dst/model.ckpt
INFO:tensorflow:nb_epochs_rat: 1.0
INFO:tensorflow:ddpg_actor_depth: 2
INFO:tensorflow:ddpg_actor_width: 64
INFO:tensorflow:ddpg_critic_depth: 2
INFO:tensorflow:ddpg_critic_width: 64
INFO:tensorflow:ddpg_noise_type: param
INFO:tensorflow:ddpg_noise_prtl: tdecy
INFO:tensorflow:ddpg_noise_std_init: 1.0
INFO:tensorflow:ddpg_noise_dst_finl: 0.01
INFO:tensorflow:ddpg_noise_adpt_rat: 1.03
INFO:tensorflow:ddpg_noise_std_finl: 1e-05
INFO:tensorflow:ddpg_rms_eps: 0.0001
INFO:tensorflow:ddpg_tau: 0.01
INFO:tensorflow:ddpg_gamma: 0.9
INFO:tensorflow:ddpg_lrn_rate: 0.001
INFO:tensorflow:ddpg_loss_w_dcy: 0.0
INFO:tensorflow:ddpg_record_step: 1
INFO:tensorflow:ddpg_batch_size: 64
INFO:tensorflow:ddpg_enbl_bsln_func: True
INFO:tensorflow:ddpg_bsln_decy_rate: 0.95
INFO:tensorflow:ws_save_path: ./models_ws/model.ckpt
INFO:tensorflow:ws_prune_ratio: 0.75
INFO:tensorflow:ws_prune_ratio_prtl: optimal
INFO:tensorflow:ws_nb_rlouts: 200
INFO:tensorflow:ws_nb_rlouts_min: 50
INFO:tensorflow:ws_reward_type: single-obj
INFO:tensorflow:ws_lrn_rate_rg: 0.03
INFO:tensorflow:ws_nb_iters_rg: 20
INFO:tensorflow:ws_lrn_rate_ft: 0.0003
INFO:tensorflow:ws_nb_iters_ft: 400
INFO:tensorflow:ws_nb_iters_feval: 25
INFO:tensorflow:ws_prune_ratio_exp: 3.0
INFO:tensorflow:ws_iter_ratio_beg: 0.1
INFO:tensorflow:ws_iter_ratio_end: 0.5
INFO:tensorflow:ws_mask_update_step: 500.0
INFO:tensorflow:cp_lasso: True
INFO:tensorflow:cp_quadruple: False
INFO:tensorflow:cp_reward_policy: accuracy
INFO:tensorflow:cp_nb_points_per_layer: 10
INFO:tensorflow:cp_nb_batches: 60
INFO:tensorflow:cp_prune_option: auto
INFO:tensorflow:cp_prune_list_file: ratio.list
INFO:tensorflow:cp_best_path: ./models/best_model.ckpt
INFO:tensorflow:cp_original_path: ./models/original_model.ckpt
INFO:tensorflow:cp_preserve_ratio: 0.5
INFO:tensorflow:cp_uniform_preserve_ratio: 0.6
INFO:tensorflow:cp_noise_tolerance: 0.15
INFO:tensorflow:cp_lrn_rate_ft: 0.0001
INFO:tensorflow:cp_nb_iters_ft_ratio: 0.2
INFO:tensorflow:cp_finetune: False
INFO:tensorflow:cp_retrain: False
INFO:tensorflow:cp_list_group: 1000
INFO:tensorflow:cp_nb_rlouts: 200
INFO:tensorflow:cp_nb_rlouts_min: 50
INFO:tensorflow:dcp_save_path: ./models_dcp/model.ckpt
INFO:tensorflow:dcp_save_path_eval: ./models_dcp_eval/model.ckpt
INFO:tensorflow:dcp_prune_ratio: 0.5
INFO:tensorflow:dcp_nb_stages: 3
INFO:tensorflow:dcp_lrn_rate_adam: 0.001
INFO:tensorflow:dcp_nb_iters_block: 10000
INFO:tensorflow:dcp_nb_iters_layer: 500
INFO:tensorflow:uql_equivalent_bits: 4
INFO:tensorflow:uql_nb_rlouts: 200
INFO:tensorflow:uql_w_bit_min: 2
INFO:tensorflow:uql_w_bit_max: 8
INFO:tensorflow:uql_tune_layerwise_steps: 100
INFO:tensorflow:uql_tune_global_steps: 2000
INFO:tensorflow:uql_tune_save_path: ./rl_tune_models/model.ckpt
INFO:tensorflow:uql_tune_disp_steps: 300
INFO:tensorflow:uql_enbl_random_layers: True
INFO:tensorflow:uql_enbl_rl_agent: False
INFO:tensorflow:uql_enbl_rl_global_tune: True
INFO:tensorflow:uql_enbl_rl_layerwise_tune: False
INFO:tensorflow:uql_weight_bits: 4
INFO:tensorflow:uql_activation_bits: 32
INFO:tensorflow:uql_use_buckets: False
INFO:tensorflow:uql_bucket_size: 256
INFO:tensorflow:uql_quant_epochs: 60
INFO:tensorflow:uql_save_quant_model_path: ./uql_quant_models/uql_quant_model.ckpt
INFO:tensorflow:uql_quantize_all_layers: False
INFO:tensorflow:uql_bucket_type: channel
INFO:tensorflow:uqtf_save_path: ./models_uqtf/model.ckpt
INFO:tensorflow:uqtf_save_path_eval: ./models_uqtf_eval/model.ckpt
INFO:tensorflow:uqtf_weight_bits: 8
INFO:tensorflow:uqtf_activation_bits: 8
INFO:tensorflow:uqtf_quant_delay: 0
INFO:tensorflow:uqtf_freeze_bn_delay: None
INFO:tensorflow:uqtf_lrn_rate_dcy: 0.01
INFO:tensorflow:nuql_equivalent_bits: 4
INFO:tensorflow:nuql_nb_rlouts: 200
INFO:tensorflow:nuql_w_bit_min: 2
INFO:tensorflow:nuql_w_bit_max: 8
INFO:tensorflow:nuql_tune_layerwise_steps: 100
INFO:tensorflow:nuql_tune_global_steps: 2101
INFO:tensorflow:nuql_tune_save_path: ./rl_tune_models/model.ckpt
INFO:tensorflow:nuql_tune_disp_steps: 300
INFO:tensorflow:nuql_enbl_random_layers: True
INFO:tensorflow:nuql_enbl_rl_agent: False
INFO:tensorflow:nuql_enbl_rl_global_tune: True
INFO:tensorflow:nuql_enbl_rl_layerwise_tune: False
INFO:tensorflow:nuql_init_style: quantile
INFO:tensorflow:nuql_opt_mode: weights
INFO:tensorflow:nuql_weight_bits: 4
INFO:tensorflow:nuql_activation_bits: 32
INFO:tensorflow:nuql_use_buckets: False
INFO:tensorflow:nuql_bucket_size: 256
INFO:tensorflow:nuql_quant_epochs: 60
INFO:tensorflow:nuql_save_quant_model_path: ./nuql_quant_models/model.ckpt
INFO:tensorflow:nuql_quantize_all_layers: False
INFO:tensorflow:nuql_bucket_type: split
INFO:tensorflow:log_dir: ./logs
INFO:tensorflow:enbl_multi_gpu: False
INFO:tensorflow:learner: full-prec
INFO:tensorflow:exec_mode: train
INFO:tensorflow:debug: False
INFO:tensorflow:h: False
INFO:tensorflow:help: False
INFO:tensorflow:helpfull: False
INFO:tensorflow:helpshort: False
2018-11-06 12:51:30.550266: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-11-06 12:51:30.582314: E tensorflow/stream_executor/cuda/cuda_driver.cc:397] failed call to cuInit: CUDA_ERROR_NO_DEVICE
2018-11-06 12:51:30.582432: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:163] retrieving CUDA diagnostic information for host: localhost.localdomain
2018-11-06 12:51:30.582455: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:170] hostname: localhost.localdomain
2018-11-06 12:51:30.582543: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:194] libcuda reported version is: 396.26.0
2018-11-06 12:51:30.582644: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:198] kernel reported version is: 396.26.0
2018-11-06 12:51:30.582667: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:305] kernel version seems to match DSO: 396.26.0
./scripts/run_local.sh: line 45: 43704 Segmentation fault (core dumped) python main.py ${extra_args}
`

path.conf
`# data files
data_hdfs_host = None
data_dir_local_cifar10 = /home/zgz/project/data_set/cifar-10-batches-py
data_dir_hdfs_cifar10 = None
data_dir_seven_cifar10 = None
data_dir_docker_cifar10 = /opt/ml/data # DO NOT EDIT
data_dir_local_ilsvrc12 = None
data_dir_hdfs_ilsvrc12 = None
data_dir_seven_ilsvrc12 = None
data_dir_docker_ilsvrc12 = /opt/ml/data # DO NOT EDIT

# model files

model_http_url = https://api.ai.tencent.com/pocketflow
`

jiaxiang-wu · 2018-11-06T04:25:19Z

It seems that you are using the Python version of the CIFAR-10 data set:

INFO:tensorflow:data_dir_local: /home/zgz/project/data_set/cifar-10-batches-py

Please use the binary version instead. Same issue as this one.
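
To check which version of the data set a directory holds, a minimal sketch along these lines may help. The file names come from the two official CIFAR-10 archives; the function name `detect_cifar10_format` is hypothetical and is not part of PocketFlow.

```python
import os

# File names shipped in the official CIFAR-10 archives.
PYTHON_FILES = {"data_batch_1", "test_batch", "batches.meta"}
BINARY_FILES = {"data_batch_1.bin", "test_batch.bin", "batches.meta.txt"}


def detect_cifar10_format(data_dir):
    """Return 'binary', 'python', or 'unknown' based on the files in data_dir."""
    names = set(os.listdir(data_dir))
    if BINARY_FILES <= names:
        return "binary"
    if PYTHON_FILES <= names:
        return "python"
    return "unknown"
```

Calling it on the directory from the log above (`/home/zgz/project/data_set/cifar-10-batches-py`) would be expected to report `python`, which is the version to replace with the binary one.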

as754770178 · 2018-11-06T05:55:19Z

Traceback (most recent call last):
File "utils/get_idle_gpus.py", line 54, in <module>
raise ValueError('not enough idle GPUs; idle GPUs are: {}'.format(idle_gpus))
ValueError: not enough idle GPUs; idle GPUs are: []
‘nets/resnet_at_cifar10_run.py’ -> ‘main.py’

and

2018-11-06 14:54:44.429596: E tensorflow/stream_executor/cuda/cuda_driver.cc:397] failed call to cuInit: CUDA_ERROR_NO_DEVICE

Why are these two errors reported?

jiaxiang-wu · 2018-11-06T06:01:10Z

It seems PocketFlow failed to find an idle GPU device. Can you post the result of nvidia-smi?

$ nvidia-smi

as754770178 · 2018-11-06T06:03:57Z

`
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     17631      C   ...gz/anaconda2/envs/tf-1.8-cp3/bin/python    99MiB |
|    1     17631      C   ...gz/anaconda2/envs/tf-1.8-cp3/bin/python    99MiB |
|    2     17631      C   ...gz/anaconda2/envs/tf-1.8-cp3/bin/python    99MiB |
|    3     17631      C   ...gz/anaconda2/envs/tf-1.8-cp3/bin/python    99MiB |
+-----------------------------------------------------------------------------+
`

`>>> from tensorflow.python.client import device_lib
>>> print(device_lib.list_local_devices())
2018-11-06 12:46:55.594800: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-11-06 12:46:55.832609: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:06:00.0
totalMemory: 11.17GiB freeMemory: 11.00GiB
2018-11-06 12:46:56.019314: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 1 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:07:00.0
totalMemory: 11.17GiB freeMemory: 11.00GiB
2018-11-06 12:46:56.188525: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 2 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:85:00.0
totalMemory: 11.17GiB freeMemory: 11.00GiB
2018-11-06 12:46:56.388330: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 3 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:86:00.0
totalMemory: 11.17GiB freeMemory: 11.00GiB
2018-11-06 12:46:56.388891: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0, 1, 2, 3
2018-11-06 12:46:57.834106: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-06 12:46:57.834160: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0 1 2 3
2018-11-06 12:46:57.834170: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N Y N N
2018-11-06 12:46:57.834178: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 1: Y N N N
2018-11-06 12:46:57.834184: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 2: N N N Y
2018-11-06 12:46:57.834190: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 3: N N Y N
2018-11-06 12:46:57.835321: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/device:GPU:0 with 10662 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:06:00.0, compute capability: 3.7)
2018-11-06 12:46:57.955038: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/device:GPU:1 with 10662 MB memory) -> physical GPU (device: 1, name: Tesla K80, pci bus id: 0000:07:00.0, compute capability: 3.7)
2018-11-06 12:46:58.078889: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/device:GPU:2 with 10662 MB memory) -> physical GPU (device: 2, name: Tesla K80, pci bus id: 0000:85:00.0, compute capability: 3.7)
2018-11-06 12:46:58.196860: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/device:GPU:3 with 10662 MB memory) -> physical GPU (device: 3, name: Tesla K80, pci bus id: 0000:86:00.0, compute capability: 3.7)
`

jiaxiang-wu · 2018-11-06T06:07:56Z

In utils/get_idle_gpus.py, a GPU device is treated as idle only if no process is running on it. According to your nvidia-smi output, each of these four GPUs has a process running, so utils/get_idle_gpus.py cannot find an idle one.
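
The "no process means idle" rule can be sketched as follows. This is not the actual code of utils/get_idle_gpus.py; it is an illustration that assumes the GPU list and the process list are obtained from `nvidia-smi` in CSV form, and the names `run_nvidia_smi` and `find_idle_gpus` are hypothetical.

```python
import subprocess


def run_nvidia_smi(query_flag):
    """Run nvidia-smi with one query flag and return its CSV output as text."""
    cmd = ["nvidia-smi", query_flag, "--format=csv,noheader"]
    return subprocess.check_output(cmd).decode("utf-8")


def find_idle_gpus(gpu_csv, apps_csv):
    """Return indices of GPUs that have no compute process attached.

    gpu_csv:  output of --query-gpu=index,uuid (one "index, uuid" row per GPU)
    apps_csv: output of --query-compute-apps=gpu_uuid,pid (one row per process)
    """
    # Collect the UUID of every GPU that hosts at least one process.
    busy_uuids = set()
    for line in apps_csv.strip().splitlines():
        if line.strip():
            busy_uuids.add(line.split(",")[0].strip())

    # A GPU is idle only if its UUID never appears in the process list.
    idle = []
    for line in gpu_csv.strip().splitlines():
        if not line.strip():
            continue
        index, uuid = [field.strip() for field in line.split(",")]
        if uuid not in busy_uuids:
            idle.append(int(index))
    return idle
```

On a machine with `nvidia-smi` available, `find_idle_gpus(run_nvidia_smi("--query-gpu=index,uuid"), run_nvidia_smi("--query-compute-apps=gpu_uuid,pid"))` would return an empty list for the setup shown above, since all four GPUs host process 17631.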

To temporarily override this, you may skip calling utils/get_idle_gpus.py and manually specify an idle GPU in scripts/run_local.sh.
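
One way to pin a GPU by hand is the standard `CUDA_VISIBLE_DEVICES` environment variable. The sketch below does this from Python and assumes it runs before TensorFlow initializes CUDA; the helper name `pin_gpu` is hypothetical, and the equivalent change in scripts/run_local.sh would be to export the same variable instead of calling utils/get_idle_gpus.py.

```python
import os


def pin_gpu(gpu_id):
    """Restrict CUDA (and therefore TensorFlow) to a single physical GPU.

    Call this before TensorFlow creates its first session or queries devices;
    changing the variable afterwards has no effect on the running process.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return os.environ["CUDA_VISIBLE_DEVICES"]
```

For example, `pin_gpu(0)` makes only physical GPU 0 visible, which TensorFlow then addresses as `/device:GPU:0`. In the setup above every GPU already hosts a 99MiB process, so whichever device is chosen will be shared with it.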

as754770178 · 2018-11-06T06:10:13Z

OK, thanks

jiaxiang-wu · 2018-11-06T06:38:17Z

Closing this issue. Reopen it if you have any further questions.

jiaxiang-wu closed this as completed Nov 6, 2018

GoldenSpark mentioned this issue Nov 20, 2018

DisChnPrunedLearner with resnet18 on ImageNet can't converge in local mode #85

Closed
