Can't load model in GCS directly #13

Closed
yokinglou opened this issue Jun 21, 2019 · 11 comments

Comments

@yokinglou

yokinglou commented Jun 21, 2019

When I tried to run the model on TPU, I replaced ${LARGE_DIR} with a "gs://..." path, but it raised an IOError:
Traceback (most recent call last):
  File "run_classifier.py", line 903, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "run_classifier.py", line 722, in main
    sp.Load(FLAGS.spiece_model_file)
  File "/usr/local/lib/python2.7/dist-packages/sentencepiece.py", line 118, in Load
    return _sentencepiece.SentencePieceProcessor_Load(self, filename)
IOError: Not found: "gs://ykproject/pre-trained/xlnet_cased_L-24_H-1024_A-16/spiece.model": No such file or directory Error #2

Does this mean sp.Load() doesn't support loading files from GCS, so I should change the code? Or is there something else I should do?

@zihangdai
Owner

That's an inconvenience of GCS. The SentencePiece "Load" function can only access "local" files (only tf.gfile can directly access Google Storage). I recommend keeping a local copy of the downloaded model directory.
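
If keeping a permanent local copy is awkward, one possible workaround is to copy spiece.model down from GCS just before loading it. The sketch below is not part of the XLNet code: `localize`, `copy_fn` and `cache_dir` are hypothetical names, and the default copy step assumes a TensorFlow version that provides tf.io.gfile.copy (on older 1.x releases tf.gfile.Copy plays the same role).

```python
import os
import tempfile


def localize(path, copy_fn=None, cache_dir=None):
    """Return a local path for `path`, copying it down first if it is on GCS.

    `copy_fn(src, dst)` is a hypothetical hook for the copy step. When it is
    omitted, tf.io.gfile.copy is used, since it understands gs:// URLs.
    """
    if not path.startswith("gs://"):
        return path  # already local, nothing to do

    if copy_fn is None:
        import tensorflow as tf  # imported lazily so local paths need no TF

        def copy_fn(src, dst):
            tf.io.gfile.copy(src, dst, overwrite=True)

    # Copy into a scratch directory, keeping the original file name.
    cache_dir = cache_dir or tempfile.mkdtemp(prefix="spiece_")
    local_path = os.path.join(cache_dir, os.path.basename(path))
    copy_fn(path, local_path)
    return local_path
```

With this helper, the call in run_classifier.py could become `sp.Load(localize(FLAGS.spiece_model_file))`, so that SentencePiece only ever sees a local file.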

@ymcui
Contributor

ymcui commented Jun 21, 2019

You should load the sentencepiece model from a local directory instead of GCS. For example, in Colab, you should manually upload spiece.model to your notebook.

@yokinglou
Author

> That's an inconvenience of GCS. The sentence piece "Load" function can only access "local" files (only tf.gfile can directly access Google storage). I recommend keeping a local copy of the downloaded model directory.

I have tried using the local files, but that produced another error:

InvalidArgumentError (see above for traceback): Unsuccessful TensorSliceReader constructor: Failed to get matching files on pre-trained/xlnet_cased_L-24_H-1024_A-16/xlnet_model.ckpt: Unimplemented: File system scheme '[local]' not implemented (file: 'pre-trained/xlnet_cased_L-24_H-1024_A-16/xlnet_model.ckpt')
[[node checkpoint_initializer_170 (defined at /home/henrylk127/xlnet/model_utils.py:76) ]]

I ran the project from the Google Cloud Platform console.

@ymcui
Contributor

ymcui commented Jun 21, 2019

maybe you should post your shell script here

@yokinglou
Author

> maybe you should post your shell script here

For example:
python run_classifier.py \
  --do_train=True \
  --do_eval=True \
  --task_name=qnli \
  --data_dir=./glue/glue_data/QNLI \
  --output_dir=proc_data/qnli \
  --model_dir=exp/qnli \
  --uncased=False \
  --spiece_model_file=pre-trained/xlnet_cased_L-24_H-1024_A-16/spiece.model \
  --model_config_path=pre-trained/xlnet_cased_L-24_H-1024_A-16/xlnet_config.json \
  --init_checkpoint=pre-trained/xlnet_cased_L-24_H-1024_A-16/xlnet_model.ckpt \
  --max_seq_length=512 \
  --train_batch_size=8 \
  --num_hosts=1 \
  --num_core_per_host=8 \
  --learning_rate=5e-5 \
  --train_steps=1200 \
  --warmup_steps=120 \
  --save_steps=600 \
  --is_regression=True \
  --use_tpu=True \
  --tpu=henrylk127

@ymcui
Contributor

ymcui commented Jun 21, 2019

How about adding ./ in front of each file or directory path?
For example, change
spiece_model_file=pre-trained/xlnet_cased_L-24_H-1024_A-16/spiece.model
to
spiece_model_file=./pre-trained/xlnet_cased_L-24_H-1024_A-16/spiece.model

@yokinglou
Author

It still shows the same error.

InvalidArgumentError (see above for traceback): Unsuccessful TensorSliceReader constructor: Failed to get matching files on ./pre-trained/xlnet_cased_L-24_H-1024_A-16/xlnet_model.ckpt: Unimplemented: File system scheme '[local]' not implemented (file: './pre-trained/xlnet_cased_L-24_H-1024_A-16/xlnet_model.ckpt')
[[node checkpoint_initializer_170 (defined at /home/henrylk127/xlnet/model_utils.py:76) ]]

The solution I found by searching is to use a GCS path instead of the local path. But after switching to GCS, the error I mentioned earlier appears.
I just ran the program in the Google Cloud Platform terminal. Maybe the VM does not support loading local files?
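
A likely reading of the '[local]' scheme error, though the thread does not spell it out, is that the TPU workers run on a separate host and cannot see the VM's disk, so the checkpoint has to live on GCS. The sketch below shows one way to push a local checkpoint there. The function names and the `gs://my-bucket/...` destination are hypothetical, and the copy step assumes tf.io.gfile.copy is available in the installed TensorFlow.

```python
import glob
import os


def plan_checkpoint_upload(local_prefix, gcs_dir):
    """List (local_file, gcs_destination) pairs for one checkpoint prefix.

    A TensorFlow checkpoint prefix such as xlnet_model.ckpt names several
    files (.index, .meta, .data-*), so every file sharing the prefix is listed.
    """
    pairs = []
    for src in sorted(glob.glob(glob.escape(local_prefix) + ".*")):
        dst = gcs_dir.rstrip("/") + "/" + os.path.basename(src)
        pairs.append((src, dst))
    return pairs


def upload_checkpoint(local_prefix, gcs_dir):
    """Copy the checkpoint files to GCS and return the new checkpoint prefix."""
    import tensorflow as tf  # lazy import: planning alone needs no TF

    for src, dst in plan_checkpoint_upload(local_prefix, gcs_dir):
        tf.io.gfile.copy(src, dst, overwrite=True)
    return gcs_dir.rstrip("/") + "/" + os.path.basename(local_prefix)
```

The value returned by `upload_checkpoint` (for example a hypothetical `gs://my-bucket/xlnet/xlnet_model.ckpt`) is what would then be passed as `--init_checkpoint`.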

@ymcui
Contributor

ymcui commented Jun 21, 2019

In my setup, I keep spiece.model on the local file system and load the XLNet weights/config from GCS.

@yokinglou
Author

After changing 'output_dir', 'model_dir' and 'init_checkpoint' to GCS paths and keeping the other paths local, it works.
However, I encountered a new error: OutOfRangeError.
The log is shown below:
'I0621 14:16:36.503958 140182961767872 tpu_estimator.py:504] Init TPU system
I0621 14:16:40.712445 140182961767872 tpu_estimator.py:510] Initialized TPU in 4 seconds
I0621 14:16:40.713411 140181362325248 tpu_estimator.py:463] Starting infeed thread controller.
I0621 14:16:40.713979 140181327095552 tpu_estimator.py:482] Starting outfeed thread controller.
I0621 14:16:42.156764 140182961767872 tpu_estimator.py:536] Enqueue next (600) batch(es) of data to infeed.
I0621 14:16:42.157500 140182961767872 tpu_estimator.py:540] Dequeue next (600) batch(es) of data from outfeed.
I0621 14:16:51.020121 140181362325248 error_handling.py:70] Error recorded from infeed: End of sequence
[[node input_pipeline_task0/while/IteratorGetNext (defined at /usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py:1112) ]]

Caused by op u'input_pipeline_task0/while/IteratorGetNext', defined at:
File "run_classifier.py", line 903, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "run_classifier.py", line 767, in main
estimator.train(input_fn=train_input_fn, max_steps=FLAGS.train_steps)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2452, in train
saving_listeners=saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 358, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1124, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1154, in _train_model_default
features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2251, in _call_model_fn
config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1112, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2547, in _model_fn
input_holders.generate_infeed_enqueue_ops_and_dequeue_fn())
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1167, in generate_infeed_enqueue_ops_and_dequeue_fn
self._invoke_input_fn_and_record_structure())
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1271, in _invoke_input_fn_and_record_structure
wrap_fn(device=host_device, op_fn=enqueue_ops_fn))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2945, in _wrap_computation_in_while_loop
parallel_iterations=1)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 3556, in while_loop
return_same_structure)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 3087, in BuildLoop
pred, body, original_loop_vars, loop_vars, shape_invariants)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 3022, in _BuildLoop
body_result = body(*packed_vars_for_body)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2934, in computation
with ops.control_dependencies(op_fn()):
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 859, in enqueue_ops_fn
features, labels = inputs.features_and_labels() # Calls get_next()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 3127, in features_and_labels
return _Inputs._parse_inputs(self._iterator.get_next())
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/data/ops/iterator_ops.py", line 414, in get_next
output_shapes=self._structure._flat_shapes, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_dataset_ops.py", line 1685, in iterator_get_next
output_shapes=output_shapes, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
self._traceback = tf_stack.extract_stack()

OutOfRangeError (see above for traceback): End of sequence
[[node input_pipeline_task0/while/IteratorGetNext (defined at /usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py:1112) ]]
'
Does this mean that I need to request more compute resources?

@zihangdai
Owner

As the example script scripts/tpu_squad_large.sh shows:

# Local path for model config & sentence-piece model
  --model_config_path=${INIT_CKPT_DIR}/xlnet_config.json \
  --spiece_model_file=${INIT_CKPT_DIR}/spiece.model \
# Google storage path for `init_checkpoint`, processed data dir `output_dir` and `model_dir`
  --output_dir=${GS_PROC_DATA_DIR} \
  --init_checkpoint=${GS_INIT_CKPT_DIR}/xlnet_model.ckpt \
  --model_dir=${GS_MODEL_DIR} \
# Local path for raw input data
  --train_file=${SQUAD_DIR}/train-v2.0.json \
  --predict_file=${SQUAD_DIR}/dev-v2.0.json \
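
The local/GCS split above can be turned into a quick sanity check to run before launching a TPU job. This is only a sketch based on what the thread reports working. `check_tpu_paths` is a hypothetical helper, not something the repository provides, and it takes a plain dict rather than the real FLAGS object.

```python
def check_tpu_paths(flags):
    """Return a list of human-readable problems with the path flags.

    `flags` is a plain dict of flag name -> value. The split mirrors the
    thread: SentencePiece reads locally, TPU-side paths live on GCS.
    """
    must_be_local = ["spiece_model_file"]
    must_be_gcs = ["init_checkpoint", "output_dir", "model_dir"]

    problems = []
    for name in must_be_local:
        value = flags.get(name)
        if value and value.startswith("gs://"):
            problems.append("%s must be a local path, got %s" % (name, value))
    if flags.get("use_tpu"):
        for name in must_be_gcs:
            value = flags.get(name)
            if value and not value.startswith("gs://"):
                problems.append(
                    "%s must be a gs:// path on TPU, got %s" % (name, value))
    return problems
```

Run against the flags posted earlier in the thread, this flags `init_checkpoint`, `output_dir` and `model_dir` as needing GCS paths, which matches the change that eventually got the job running.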

@yokinglou
Author

Thanks for your help. There are still some problems. If it still doesn't work, I will open another issue.
