
error in pretrain an XLNet . train_gpu.py. #111

Closed
3NFBAGDU opened this issue Jul 3, 2019 · 9 comments

Comments


3NFBAGDU commented Jul 3, 2019

I have successfully run

spm_train \
  --input=$INPUT \
  --model_prefix=sp10m.cased.v3 \
  --vocab_size=32000 \
  --character_coverage=0.99995 \
  --model_type=unigram \
  --control_symbols=<cls>,<sep>,<pad>,<mask>,<eod> \
  --user_defined_symbols=<eop>,.,(,),",-,–,£,€ \
  --shuffle_input_sentence \
  --input_sentence_size=10000000

and

python data_utils.py \
  --bsz_per_host=32 \
  --num_core_per_host=16 \
  --seq_len=512 \
  --reuse_len=256 \
  --input_glob=*.txt \
  --save_dir=${SAVE_DIR} \
  --num_passes=20 \
  --bi_data=True \
  --sp_path=spiece.model \
  --mask_alpha=6 \
  --mask_beta=1 \
  --num_predict=85

Now when I run train_gpu.py:

sudo python3 train_gpu.py \
  --record_info_dir=/home/ubuntu/xlnet/training/tfrecords \
  --train_batch_size=2048 \
  --seq_len=512 \
  --reuse_len=256 \
  --mem_len=384 \
  --perm_size=256 \
  --n_layer=24 \
  --d_model=1024 \
  --d_embed=1024 \
  --n_head=16 \
  --d_head=64 \
  --d_inner=4096 \
  --untie_r=True \
  --mask_alpha=6 \
  --mask_beta=1 \
  --num_predict=85 \
  --model_dir=/home/ubuntu/axalimodeli

I get this error:

I0702 11:32:34.041983 139935611332352 train_gpu.py:319] n_token 32000
I0702 11:32:34.042275 139935611332352 data_utils.py:795] Use the following tfrecord dirs: ['/home/ubuntu/xlnet/training/tfrecords']
I0702 11:32:34.042413 139935611332352 data_utils.py:799] [0] Record glob: /home/ubuntu/xlnet/training/tfrecords/record_info-train-*.bsz-2048.seqlen-512.reuse-256.bi.alpha-6.beta-1.fnp-85.json
I0702 11:32:34.042960 139935611332352 data_utils.py:803] [0] Num of record info path: 0
I0702 11:32:34.043075 139935611332352 data_utils.py:836] [Dir 0] Number of chosen batches: 0
I0702 11:32:34.043182 139935611332352 data_utils.py:838] [Dir 0] Number of chosen files: 0
I0702 11:32:34.043281 139935611332352 data_utils.py:839] []
I0702 11:32:34.043379 139935611332352 data_utils.py:846] Total number of batches: 0
I0702 11:32:34.043897 139935611332352 data_utils.py:848] Total number of files: 0
I0702 11:32:34.044010 139935611332352 data_utils.py:849] []
I0702 11:32:34.044113 139935611332352 train_gpu.py:204] num of batches 0
I0702 11:32:34.044229 139935611332352 data_utils.py:555] Host 0 handles 0 files
Traceback (most recent call last):
  File "train_gpu.py", line 328, in <module>
    tf.compat.v1.app.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "train_gpu.py", line 324, in main
    train("/gpu:0")
  File "train_gpu.py", line 212, in train
    train_set = train_input_fn(params)
  File "/home/ubuntu/xlnet/data_utils.py", line 868, in input_fn
    num_predict=num_predict)
  File "/home/ubuntu/xlnet/data_utils.py", line 757, in get_dataset
    bsz_per_core=bsz_per_core)
  File "/home/ubuntu/xlnet/data_utils.py", line 566, in parse_files_to_dataset
    dataset = tf.data.TFRecordDataset(dataset)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/data/ops/readers.py", line 335, in __init__
    filenames, compression_type, buffer_size, num_parallel_reads)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/data/ops/readers.py", line 295, in __init__
    filenames = _create_or_validate_filenames_dataset(filenames)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/data/ops/readers.py", line 50, in _create_or_validate_filenames_dataset
    "`filenames` must be a `tf.data.Dataset` of `tf.string` elements.")
TypeError: `filenames` must be a `tf.data.Dataset` of `tf.string` elements.

Can you help me solve this? data_utils.py wrote the tfrecord files to this directory: '/home/ubuntu/xlnet/training/tfrecords'.
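The log above already shows the root cause: the record-info glob matched 0 files, so train_gpu.py ends up passing an empty file list to tf.data.TFRecordDataset, which raises the TypeError. A quick diagnostic sketch (the path and flag values below are copied from the log in this thread and are specific to this setup):

```python
import glob
import os

# The exact glob pattern train_gpu.py reports in the log above.
record_info_dir = "/home/ubuntu/xlnet/training/tfrecords"
pattern = os.path.join(
    record_info_dir,
    "record_info-train-*.bsz-2048.seqlen-512.reuse-256"
    ".bi.alpha-6.beta-1.fnp-85.json")

matches = glob.glob(pattern)
print("record-info files found:", len(matches))
# Zero matches means train_gpu.py has no tfrecords to read; the resulting
# empty filename list is what triggers the TypeError further down.
```

If this prints 0 while the directory does contain JSON files, the flag values baked into the filenames (bsz, seqlen, reuse, alpha, beta, fnp) differ from the ones passed to train_gpu.py.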

3NFBAGDU changed the title from "error train_gpu.py" to "error in pretrain an XLNet . train_gpu.py." Jul 3, 2019
graykode (Contributor) commented Jul 4, 2019

This is because your record-info file (like xlnet/proc_data/example/tfrecords/*.json) didn't match the train_gpu.py options. You should use the same options for data_utils.py and train_gpu.py.

3NFBAGDU (Author) commented Jul 4, 2019

> This is because your record-info file didn't match the train_gpu.py options. You should use the same options for data_utils.py and train_gpu.py.

sudo python3 data_utils.py --bsz_per_host=32 --num_core_per_host=16 --seq_len=512 --reuse_len=256 --input_glob=books-sentences.txt --save_dir=lado.txt --num_passes=20 --bi_data=True --sp_path=/home/ubuntu/giorgi/sp10m.cased.v3.model --mask_alpha=6 --mask_beta=1 --num_predict=85

sudo python3 train_gpu.py --record_info_dir=/home/ubuntu/xlnet/training/tfrecords --train_batch_size=2048 --seq_len=512 --reuse_len=256 --mem_len=384 --perm_size=256 --n_layer=24 --d_model=1024 --d_embed=1024 --n_head=16 --d_head=64 --d_inner=4096 --untie_r=True --mask_alpha=6 --mask_beta=1 --num_predict=85

--seq_len and --reuse_len are the same in both.
What parameter should I change in data_utils.py?
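One way to see the mismatch is to diff the shared flags of the two commands above. A minimal sketch (the flag lists are copied from this thread; parse_flags is a hypothetical helper, not part of the XLNet repo):

```python
# Parse "--key=value" tokens from a command line into a dict.
def parse_flags(cmd):
    flags = {}
    for tok in cmd.split():
        if tok.startswith("--") and "=" in tok:
            key, val = tok[2:].split("=", 1)
            flags[key] = val
    return flags

# Abbreviated copies of the two commands posted above.
prep = parse_flags(
    "data_utils.py --bsz_per_host=32 --seq_len=512 --reuse_len=256 "
    "--bi_data=True --mask_alpha=6 --mask_beta=1 --num_predict=85")
train = parse_flags(
    "train_gpu.py --train_batch_size=2048 --seq_len=512 --reuse_len=256 "
    "--mask_alpha=6 --mask_beta=1 --num_predict=85")

# The batch size is baked into the record-info filename, so bsz_per_host at
# preprocessing time must equal train_batch_size at training time.
print("preproc batch size:", prep["bsz_per_host"])       # 32
print("train batch size:  ", train["train_batch_size"])  # 2048 -- mismatch
```

Here seq_len and reuse_len do agree, but the batch sizes (32 vs 2048) do not, which is why the record-info glob finds nothing.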

SuMarsss commented Jul 4, 2019

> What parameter should I change in data_utils.py?

The parameter train_batch_size should be 32, which is equal to bsz_per_host.

3NFBAGDU (Author) commented Jul 4, 2019

> What parameter should I change in data_utils.py?

> The parameter train_batch_size should be 32, which is equal to bsz_per_host.

I have changed train_batch_size=32, but I got the same error: TypeError: `filenames` must be a `tf.data.Dataset` of `tf.string` elements.

volker42maru commented

Had the same issue. You have to set the uncased flag in either data_utils.py (preproc) or train.py, because the flag has different default values in these files.

3NFBAGDU (Author) commented Jul 8, 2019

> Had the same issue. You have to set the uncased flag in either data_utils.py (preproc) or train.py, because the flag has different default values in these files.

sudo python3 train_gpu.py --record_info_dir=fix3/tfrecords --train_batch_size=32 --seq_len=512 --reuse_len=256 --mem_len=384 --perm_size=256 --n_layer=24 --d_model=1024 --d_embed=1024 --n_head=16 --d_head=64 --d_inner=4096 --untie_r=True --model_dir='my_model' --uncased=False

sudo python3 data_utils.py --bsz_per_host=32 --num_core_per_host=16 --seq_len=512 --reuse_len=256 --input_glob=books-sentences.txt --save_dir=fix3 --num_passes=20 --bi_data=True --sp_path=/home/ubuntu/giorgi/sp10m.cased.v3.model --mask_alpha=6 --mask_beta=1 --num_predict=85 --uncased=False

Same error: TypeError: `filenames` must be a `tf.data.Dataset` of `tf.string` elements.

volker42maru commented

Check if the name of your generated .json file includes the same param values as you use in train.py. In your case probably: record_info-train-0-0.bsz-32.seqlen-512.reuse-256.bi.alpha-6.beta-1.fnp-85.json
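The naming scheme can be sketched from the filenames visible in this thread. This is a reconstruction of the pattern, not the exact code from data_utils.py, and expected_record_info_name is a hypothetical helper:

```python
# Sketch of the record-info naming scheme as it appears in this thread:
# record_info-train-<task>-<pass>.bsz-<B>.seqlen-<S>.reuse-<R>.<bi|uni>.
#   alpha-<A>.beta-<b>.fnp-<N>.json
def expected_record_info_name(bsz, seq_len, reuse_len, bi_data,
                              mask_alpha, mask_beta, num_predict,
                              task=0, pass_id=0):
    bi = "bi" if bi_data else "uni"
    return ("record_info-train-{}-{}.bsz-{}.seqlen-{}.reuse-{}.{}."
            "alpha-{}.beta-{}.fnp-{}.json").format(
                task, pass_id, bsz, seq_len, reuse_len, bi,
                mask_alpha, mask_beta, num_predict)

# With the flags used in this thread, train_gpu.py should be looking for:
print(expected_record_info_name(32, 512, 256, True, 6, 1, 85))
# record_info-train-0-0.bsz-32.seqlen-512.reuse-256.bi.alpha-6.beta-1.fnp-85.json
```

Every value in that name must match the corresponding train_gpu.py flag, or the glob finds nothing and training fails with the TypeError above.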

3NFBAGDU closed this as completed Jul 9, 2019
rkcalnode commented

I had the same problem, and trying the suggestion below solved the error.

> Had the same issue. You have to set the uncased flag in either data_utils.py (preproc) or train.py, because the flag has different default values in these files.

But another assertion error occurred in modeling.py (line 236):
assert bsz%2 == 0

The result of printing bsz is below. A Tensor was passed, but is that expected?

Inserted code in modeling.py (line 233)

    print('***** bsz ***** : {}'.format(bsz))
    print('***** type of bsz ***** : {}'.format(type(bsz)))
    print('***** bsz%2 ***** : {}'.format(bsz%2))
...
    if bsz is not None:

Stdout

***** bsz ***** : Tensor("model/transformer/strided_slice:0", shape=(), dtype=int32, device=/gpu:0)
***** type of bsz ***** : <class 'tensorflow.python.framework.ops.Tensor'>
***** bsz%2 ***** : Tensor("model/transformer/mod:0", shape=(), dtype=int32, device=/gpu:0)

hockeybro12 commented

@rkcalnode That assertion is buggy, so if your batch size is divisible by 2 you can just comment it out.
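The stdout above explains why the assert misfires: in TF 1.x graph mode, `bsz % 2` is a symbolic Tensor, and Tensor did not overload `==` as a value comparison, so `bsz % 2 == 0` falls back to object identity and is False regardless of the actual batch size. A TF-free mock illustrating the failure mode (FakeTensor is a hypothetical stand-in for tf.Tensor, not real TF code):

```python
# Mock of a symbolic tensor: arithmetic builds new nodes instead of numbers,
# and there is no __eq__ override, so `fake == 0` compares object identity.
class FakeTensor:
    def __init__(self, name):
        self.name = name

    def __mod__(self, other):
        # Like graph-mode TF, `%` returns a new symbolic node.
        return FakeTensor("{}_mod".format(self.name))

bsz = FakeTensor("model/transformer/strided_slice")
print(bsz % 2 == 0)  # False, regardless of the actual batch size
```

So the assertion trips whenever bsz reaches it as a graph-mode Tensor; commenting it out (given an even batch size) sidesteps the problem, as suggested above.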

hisashi-ito added a commit to hisashi-ito/xlnet that referenced this issue Oct 7, 2019