To run PocketFlow in the local mode, e.g. to train a full-precision ResNet-20 model for the CIFAR-10 #3

molyswu · 2018-11-03T06:50:41Z

No description provided.

jiaxiang-wu · 2018-11-03T08:05:29Z

Hi, which Python version are you using? Python 2.x or Python 3.x?
PocketFlow is developed under Python 3.6.

molyswu · 2018-11-03T08:50:35Z

Running with Python 3 gives:
tensorflow.python.framework.errors_impl.OutOfRangeError: End of sequence
[[Node: data/IteratorGetNext = IteratorGetNextoutput_shapes=[[?,32,32,3], [?,10]], output_types=[DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
[[Node: model/gradients/model/resnet_model/Mean_grad/Shape-0-2-VecPermuteNCHWToNHWC-LayoutOptimizer/_53 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_672_m...tOptimizer", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]

tensorflow.python.framework.errors_impl.OutOfRangeError: End of sequence
[[Node: data/IteratorGetNext = IteratorGetNextoutput_shapes=[[?,32,32,3], [?,10]], output_types=[DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
[[Node: model/gradients/model/resnet_model/Mean_grad/Shape-0-2-VecPermuteNCHWToNHWC-LayoutOptimizer/_53 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_672_m...tOptimizer", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]

Caused by op 'data/IteratorGetNext', defined at:
File "main.py", line 69, in <module>
tf.app.run()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "main.py", line 51, in main
learner = create_learner(sm_writer, model_helper)
File "/home/../../GPD/PocketFlow/learners/learner_utils.py", line 44, in create_learner
learner = FullPrecLearner(sm_writer, model_helper)
File "/home/../../GPD/PocketFlow/learners/full_precision/learner.py", line 55, in __init__
self.__build(is_train=True)
File "/home/../../GPD/PocketFlow/learners/full_precision/learner.py", line 118, in __build
images, labels = iterator.get_next()

OutOfRangeError (see above for traceback): End of sequence
[[Node: data/IteratorGetNext = IteratorGetNextoutput_shapes=[[?,32,32,3], [?,10]], output_types=[DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
[[Node: model/gradients/model/resnet_model/Mean_grad/Shape-0-2-VecPermuteNCHWToNHWC-LayoutOptimizer/_53 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_672_m...tOptimizer", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]

jiaxiang-wu · 2018-11-03T09:13:26Z

Hi, could you please post the execution command and the full log, especially the part listing the FLAGS values?
BTW, is the data directory path correctly configured? Are the data files in place?

molyswu · 2018-11-03T09:51:09Z

Hi,
log:
Traceback (most recent call last):
File "utils/get_path_args.py", line 23, in <module>
assert len(sys.argv) == 4
AssertionError
Python script:

# of GPUs: 1

extra arguments:
cp: missing destination file operand after 'main.py'
Try 'cp --help' for more information.
multi-GPU training disabled
[WARNING] TF-Plus & Horovod cannot be imported; multi-GPU training is unsupported
INFO:tensorflow:FLAGS:
INFO:tensorflow:batch_size_eval: 100
INFO:tensorflow:enbl_multi_gpu: False
INFO:tensorflow:cp_prune_option: auto
INFO:tensorflow:dcp_lrn_rate_adam: 0.001
INFO:tensorflow:nuql_weight_bits: 4
INFO:tensorflow:ddpg_record_step: 1
INFO:tensorflow:ws_iter_ratio_beg: 0.1
INFO:tensorflow:uql_enbl_rl_global_tune: True
INFO:tensorflow:ddpg_noise_adpt_rat: 1.03
INFO:tensorflow:cp_noise_tolerance: 0.15
INFO:tensorflow:cp_list_group: 1000
INFO:tensorflow:ddpg_critic_depth: 2
INFO:tensorflow:cp_quadruple: False
INFO:tensorflow:save_step: 10000
INFO:tensorflow:ddpg_noise_prtl: tdecy
INFO:tensorflow:cp_retrain: False
INFO:tensorflow:ws_nb_iters_ft: 400
INFO:tensorflow:nuql_enbl_random_layers: True
INFO:tensorflow:ws_nb_iters_rg: 20
INFO:tensorflow:enbl_dst: False
INFO:tensorflow:data_disk: local
INFO:tensorflow:nuql_use_buckets: False
INFO:tensorflow:ddpg_gamma: 0.9
INFO:tensorflow:ddpg_noise_std_finl: 1e-05
INFO:tensorflow:uqtf_activation_bits: 8
INFO:tensorflow:ddpg_enbl_bsln_func: True
INFO:tensorflow:buffer_size: 1024
INFO:tensorflow:uql_bucket_type: channel
INFO:tensorflow:cp_lasso: True
INFO:tensorflow:uql_equivalent_bits: 4
INFO:tensorflow:ddpg_tau: 0.01
INFO:tensorflow:nuql_bucket_type: split
INFO:tensorflow:nb_smpls_eval: 10000
INFO:tensorflow:data_dir_local: /home/yjzx/Downloads/wts/GPD/Project/data
INFO:tensorflow:uql_save_quant_model_path: ./uql_quant_models/uql_quant_model.ckpt
INFO:tensorflow:save_path: ./models/model.ckpt
INFO:tensorflow:uql_w_bit_min: 2
INFO:tensorflow:uql_enbl_random_layers: True
INFO:tensorflow:nb_smpls_train: 50000
INFO:tensorflow:loss_w_dcy: 0.0002
INFO:tensorflow:nb_threads: 8
INFO:tensorflow:ddpg_noise_std_init: 1.0
INFO:tensorflow:nuql_save_quant_model_path: ./nuql_quant_models/model.ckpt
INFO:tensorflow:ws_prune_ratio_prtl: optimal
INFO:tensorflow:lrn_rate_init: 0.1
INFO:tensorflow:nuql_tune_layerwise_steps: 100
INFO:tensorflow:ddpg_batch_size: 64
INFO:tensorflow:ddpg_loss_w_dcy: 0.0
INFO:tensorflow:ddpg_bsln_decy_rate: 0.95
INFO:tensorflow:dcp_save_path: ./models_dcp/model.ckpt
INFO:tensorflow:dcp_nb_iters_block: 10000
INFO:tensorflow:cp_reward_policy: accuracy
INFO:tensorflow:batch_size: 68
INFO:tensorflow:enbl_warm_start: False
INFO:tensorflow:model_http_url: None
INFO:tensorflow:uql_w_bit_max: 8
INFO:tensorflow:summ_step: 100
INFO:tensorflow:cp_original_path: ./models/original_model.ckpt
INFO:tensorflow:nb_classes: 10
INFO:tensorflow:log_dir: ./logs
INFO:tensorflow:cycle_length: 4
INFO:tensorflow:nb_epochs_rat: 1.0
INFO:tensorflow:nuql_enbl_rl_agent: False
INFO:tensorflow:cp_nb_iters_ft_ratio: 0.2
INFO:tensorflow:nuql_enbl_rl_layerwise_tune: False
INFO:tensorflow:ws_save_path: ./models_ws/model.ckpt
INFO:tensorflow:dcp_save_path_eval: ./models_dcp_eval/model.ckpt
INFO:tensorflow:data_hdfs_host: None
INFO:tensorflow:nuql_opt_mode: weights
INFO:tensorflow:cp_finetune: False
INFO:tensorflow:nuql_activation_bits: 32
INFO:tensorflow:exec_mode: train
INFO:tensorflow:uql_enbl_rl_layerwise_tune: False
INFO:tensorflow:dcp_nb_iters_layer: 500
INFO:tensorflow:ws_nb_rlouts: 200
INFO:tensorflow:uqtf_lrn_rate_dcy: 0.01
INFO:tensorflow:resnet_size: 20
INFO:tensorflow:save_path_eval: ./models_eval/model.ckpt
INFO:tensorflow:nuql_nb_rlouts: 200
INFO:tensorflow:tempr_dst: 4.0
INFO:tensorflow:cp_nb_batches: 60
INFO:tensorflow:ws_iter_ratio_end: 0.5
INFO:tensorflow:momentum: 0.9
INFO:tensorflow:uql_tune_global_steps: 2000
INFO:tensorflow:uqtf_weight_bits: 8
INFO:tensorflow:uql_tune_save_path: ./rl_tune_models/model.ckpt
INFO:tensorflow:nuql_init_style: quantile
INFO:tensorflow:ddpg_noise_type: param
INFO:tensorflow:uql_activation_bits: 32
INFO:tensorflow:nuql_quant_epochs: 60
INFO:tensorflow:ws_prune_ratio: 0.75
INFO:tensorflow:uql_bucket_size: 256
INFO:tensorflow:cp_preserve_ratio: 0.5
INFO:tensorflow:ws_nb_rlouts_min: 50
INFO:tensorflow:ddpg_noise_dst_finl: 0.01
INFO:tensorflow:cp_nb_rlouts: 200
INFO:tensorflow:data_dir_hdfs: None
INFO:tensorflow:helpfull: False
INFO:tensorflow:nuql_quantize_all_layers: False
INFO:tensorflow:nb_smpls_val: 5000
INFO:tensorflow:nuql_w_bit_max: 8
INFO:tensorflow:uql_quant_epochs: 60
INFO:tensorflow:nuql_equivalent_bits: 4
INFO:tensorflow:ddpg_critic_width: 64
INFO:tensorflow:nuql_tune_save_path: ./rl_tune_models/model.ckpt
INFO:tensorflow:helpshort: False
INFO:tensorflow:cp_uniform_preserve_ratio: 0.6
INFO:tensorflow:uql_use_buckets: False
INFO:tensorflow:ws_mask_update_step: 500.0
INFO:tensorflow:batch_size_norm: 128.0
INFO:tensorflow:cp_best_path: ./models/best_model.ckpt
INFO:tensorflow:ddpg_actor_depth: 2
INFO:tensorflow:ddpg_lrn_rate: 0.001
INFO:tensorflow:cp_prune_list_file: ratio.list
INFO:tensorflow:uql_weight_bits: 4
INFO:tensorflow:save_path_dst: ./models_dst/model.ckpt
INFO:tensorflow:ddpg_actor_width: 64
INFO:tensorflow:cp_lrn_rate_ft: 0.0001
INFO:tensorflow:nuql_w_bit_min: 2
INFO:tensorflow:h: False
INFO:tensorflow:nuql_tune_disp_steps: 300
INFO:tensorflow:cp_nb_points_per_layer: 10
INFO:tensorflow:prefetch_size: 8
INFO:tensorflow:uqtf_save_path: ./models_uqtf/model.ckpt
INFO:tensorflow:help: False
INFO:tensorflow:uql_enbl_rl_agent: False
INFO:tensorflow:nuql_bucket_size: 256
INFO:tensorflow:uql_quantize_all_layers: False
INFO:tensorflow:ddpg_rms_eps: 0.0001
INFO:tensorflow:ws_reward_type: single-obj
INFO:tensorflow:debug: False
INFO:tensorflow:uqtf_quant_delay: 0
INFO:tensorflow:dcp_prune_ratio: 0.5
INFO:tensorflow:nuql_enbl_rl_global_tune: True
INFO:tensorflow:dcp_nb_stages: 3
INFO:tensorflow:ws_prune_ratio_exp: 3.0
INFO:tensorflow:cp_nb_rlouts_min: 50
INFO:tensorflow:uql_tune_layerwise_steps: 100
INFO:tensorflow:nuql_tune_global_steps: 2101
INFO:tensorflow:uqtf_freeze_bn_delay: None
INFO:tensorflow:uqtf_save_path_eval: ./models_uqtf_eval/model.ckpt
INFO:tensorflow:ws_nb_iters_feval: 25
INFO:tensorflow:ws_lrn_rate_rg: 0.03
INFO:tensorflow:uql_tune_disp_steps: 300
INFO:tensorflow:learner: full-prec
INFO:tensorflow:ws_lrn_rate_ft: 0.0003
INFO:tensorflow:loss_w_dst: 4.0
INFO:tensorflow:uql_nb_rlouts: 200
2018-11-03 16:56:45.387781: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2018-11-03 16:56:45.556702: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.62
pciBusID: 0000:17:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2018-11-03 16:56:45.556733: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-11-03 16:56:45.726320: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-03 16:56:45.726359: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0
2018-11-03 16:56:45.726367: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N
2018-11-03 16:56:45.726530: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10411 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:17:00.0, compute capability: 6.1)
2018-11-03 16:56:47.026253: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-11-03 16:56:47.026315: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-03 16:56:47.026323: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0
2018-11-03 16:56:47.026332: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N
2018-11-03 16:56:47.026425: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10411 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:17:00.0, compute capability: 6.1)
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1322, in _do_call
return fn(*args)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1307, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.OutOfRangeError: End of sequence
[[Node: data/IteratorGetNext = IteratorGetNextoutput_shapes=[[?,32,32,3], [?,10]], output_types=[DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
[[Node: model/gradients/model/resnet_model/Mean_grad/Shape-0-2-VecPermuteNCHWToNHWC-LayoutOptimizer/_53 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_672_m...tOptimizer", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "main.py", line 69, in <module>
tf.app.run()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "main.py", line 55, in main
learner.train()
File "/home/yjzx/Downloads/wts/GPD/PocketFlow/learners/full_precision/learner.py", line 71, in train
self.sess_train.run(self.train_op)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 900, in run
run_metadata_ptr)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1135, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run
run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.OutOfRangeError: End of sequence
[[Node: data/IteratorGetNext = IteratorGetNextoutput_shapes=[[?,32,32,3], [?,10]], output_types=[DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
[[Node: model/gradients/model/resnet_model/Mean_grad/Shape-0-2-VecPermuteNCHWToNHWC-LayoutOptimizer/_53 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_672_m...tOptimizer", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]

Caused by op 'data/IteratorGetNext', defined at:
File "main.py", line 69, in <module>
tf.app.run()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "main.py", line 51, in main
learner = create_learner(sm_writer, model_helper)
File "/home/yjzx/Downloads/wts/GPD/PocketFlow/learners/learner_utils.py", line 44, in create_learner
learner = FullPrecLearner(sm_writer, model_helper)
File "/home/yjzx/Downloads/wts/GPD/PocketFlow/learners/full_precision/learner.py", line 55, in __init__
self.__build(is_train=True)
File "/home/yjzx/Downloads/wts/GPD/PocketFlow/learners/full_precision/learner.py", line 118, in __build
images, labels = iterator.get_next()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/data/ops/iterator_ops.py", line 370, in get_next
name=name)), self._output_types,
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_dataset_ops.py", line 1466, in iterator_get_next
output_shapes=output_shapes, name=name)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3392, in create_op
op_def=op_def)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1718, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

OutOfRangeError (see above for traceback): End of sequence
[[Node: data/IteratorGetNext = IteratorGetNextoutput_shapes=[[?,32,32,3], [?,10]], output_types=[DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
[[Node: model/gradients/model/resnet_model/Mean_grad/Shape-0-2-VecPermuteNCHWToNHWC-LayoutOptimizer/_53 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_672_m...tOptimizer", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]

jiaxiang-wu · 2018-11-03T10:25:26Z

How did you start the program? What command are you using?

jiaxiang-wu · 2018-11-03T10:27:44Z

Can you list files under this directory?
/home/yjzx/Downloads/wts/GPD/Project/data

molyswu · 2018-11-04T02:34:51Z

My command: ./scripts/run_local.sh nets/resnet_at_cifar10.py

Thank you for your help!

jiaxiang-wu · 2018-11-04T02:51:40Z

Can you list files under this directory?
/home/yjzx/Downloads/wts/GPD/Project/data

It seems that the program cannot find those data files.
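One way to answer the question above without typing the listing by hand is a small script like the following. It is only a sketch: `list_data_files` is a hypothetical helper, not part of PocketFlow, and it simply walks whatever directory you pass in (for example, the `data_dir_local` value from the FLAGS log). It uses `str.format` so that it also runs on Python 3.5.

```python
import os


def list_data_files(data_dir):
    """Return a sorted list of (relative_path, size_in_bytes) for every file under data_dir."""
    if not os.path.isdir(data_dir):
        # A missing directory is a different problem from an empty one, so report it explicitly.
        raise FileNotFoundError("data directory does not exist: {}".format(data_dir))
    entries = []
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            full_path = os.path.join(root, name)
            rel_path = os.path.relpath(full_path, data_dir)
            entries.append((rel_path, os.path.getsize(full_path)))
    return sorted(entries)


def format_listing(entries):
    """Render the entries as one 'size  path' line per file."""
    return "\n".join("{:>12d}  {}".format(size, path) for path, size in entries)
```

For example, `print(format_listing(list_data_files("/home/yjzx/Downloads/wts/GPD/Project/data")))` prints one line per file. An empty result would be consistent with the guess that the input pipeline has nothing to read, although that is not confirmed in this thread.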

jiaxiang-wu · 2018-11-12T01:32:33Z

It has been one week since the last activity, so I am closing the issue. Reopen it if there are any further questions.

rwagner1 · 2019-01-03T14:42:57Z

Hi all,
I had the same issue and resolved it by using the binary version of CIFAR-10 rather than the Python one: use the "cifar-10-batches-bin" dataset, not "cifar-10-batches-py".

Hope this helps.
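A quick way to tell which of the two CIFAR-10 variants a directory holds is sketched below. The file names and the 3073-byte record size come from the standard CIFAR-10 distribution; `detect_cifar10_format` and `has_expected_bin_size` are hypothetical helpers, and whether PocketFlow expects the files directly inside `data_dir_local` or in a subdirectory is an assumption you should check against your setup.

```python
import os

# File names used by the two official CIFAR-10 archives.
BIN_FILES = ["data_batch_{}.bin".format(i) for i in range(1, 6)] + ["test_batch.bin"]
PY_FILES = ["data_batch_{}".format(i) for i in range(1, 6)] + ["test_batch"]

# Each binary file holds 10000 records of 1 label byte + 32*32*3 pixel bytes.
BIN_FILE_SIZE = 10000 * (1 + 32 * 32 * 3)


def detect_cifar10_format(data_dir):
    """Return 'bin', 'py', or 'unknown' depending on which batch files are present."""
    names = set(os.listdir(data_dir))
    if all(name in names for name in BIN_FILES):
        return "bin"
    if all(name in names for name in PY_FILES):
        return "py"
    return "unknown"


def has_expected_bin_size(path):
    """Check that a binary batch file has the size of a complete download."""
    return os.path.getsize(path) == BIN_FILE_SIZE
```

If `detect_cifar10_format` returns `"py"` or `"unknown"` for your data directory, that matches the situation described in this comment; a `"bin"` result with a wrong file size would instead point to a truncated download.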

zhoushuairan · 2019-08-12T06:10:01Z

Can anyone help me solve this problem? Sincere thanks.
File "video_to_cu_depth.py", line 121, in <module>
assert len(sys.argv) == 5
AssertionError
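For reference, `sys.argv[0]` is the script name, so an assertion of the form `len(sys.argv) == 5` only passes when the script is started with exactly four command-line arguments. Which four arguments `video_to_cu_depth.py` expects is not stated in this thread. The sketch below uses a hypothetical helper, `check_arg_count`, to show how such a check can report what went wrong instead of raising a bare `AssertionError`.

```python
import sys


def check_arg_count(argv, expected_args, usage):
    """Return the arguments after the script name, or exit with a usage message."""
    args = argv[1:]  # argv[0] is the script name itself
    if len(args) != expected_args:
        raise SystemExit(
            "expected {} arguments, got {}\nusage: {}".format(expected_args, len(args), usage)
        )
    return args
```

A script would call it as `args = check_arg_count(sys.argv, 4, "python script.py ARG1 ARG2 ARG3 ARG4")`, where the usage string is a placeholder for the real argument names.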

jiaxiang-wu closed this as completed Nov 12, 2018

GoldenSpark mentioned this issue Nov 21, 2018

DisChnPrunedLearner with resnet18 on ImageNet can't converge in local mode #85

Closed
