
error in pretrain an XLNet . train_gpu.py. #111

Closed
3NFBAGDU opened this issue Jul 3, 2019 · 9 comments

Comments


3NFBAGDU commented Jul 3, 2019

I have successfully run

spm_train \
  --input=$INPUT \
  --model_prefix=sp10m.cased.v3 \
  --vocab_size=32000 \
  --character_coverage=0.99995 \
  --model_type=unigram \
  --control_symbols=<cls>,<sep>,<pad>,<mask>,<eod> \
  --user_defined_symbols=<eop>,.,(,),",-,–,£,€ \
  --shuffle_input_sentence \
  --input_sentence_size=10000000

and

python data_utils.py \
  --bsz_per_host=32 \
  --num_core_per_host=16 \
  --seq_len=512 \
  --reuse_len=256 \
  --input_glob=*.txt \
  --save_dir=${SAVE_DIR} \
  --num_passes=20 \
  --bi_data=True \
  --sp_path=spiece.model \
  --mask_alpha=6 \
  --mask_beta=1 \
  --num_predict=85

Now when I run train_gpu.py:

sudo python3 train_gpu.py \
  --record_info_dir=/home/ubuntu/xlnet/training/tfrecords \
  --train_batch_size=2048 \
  --seq_len=512 \
  --reuse_len=256 \
  --mem_len=384 \
  --perm_size=256 \
  --n_layer=24 \
  --d_model=1024 \
  --d_embed=1024 \
  --n_head=16 \
  --d_head=64 \
  --d_inner=4096 \
  --untie_r=True \
  --mask_alpha=6 \
  --mask_beta=1 \
  --num_predict=85 \
  --model_dir=/home/ubuntu/axalimodeli

I get this error:

I0702 11:32:34.041983 139935611332352 train_gpu.py:319] n_token 32000
I0702 11:32:34.042275 139935611332352 data_utils.py:795] Use the following tfrecord dirs: ['/home/ubuntu/xlnet/training/tfrecords']
I0702 11:32:34.042413 139935611332352 data_utils.py:799] [0] Record glob: /home/ubuntu/xlnet/training/tfrecords/record_info-train-*.bsz-2048.seqlen-512.reuse-256.bi.alpha-6.beta-1.fnp-85.json
I0702 11:32:34.042960 139935611332352 data_utils.py:803] [0] Num of record info path: 0
I0702 11:32:34.043075 139935611332352 data_utils.py:836] [Dir 0] Number of chosen batches: 0
I0702 11:32:34.043182 139935611332352 data_utils.py:838] [Dir 0] Number of chosen files: 0
I0702 11:32:34.043281 139935611332352 data_utils.py:839] []
I0702 11:32:34.043379 139935611332352 data_utils.py:846] Total number of batches: 0
I0702 11:32:34.043897 139935611332352 data_utils.py:848] Total number of files: 0
I0702 11:32:34.044010 139935611332352 data_utils.py:849] []
I0702 11:32:34.044113 139935611332352 train_gpu.py:204] num of batches 0
I0702 11:32:34.044229 139935611332352 data_utils.py:555] Host 0 handles 0 files
Traceback (most recent call last):
  File "train_gpu.py", line 328, in <module>
    tf.compat.v1.app.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "train_gpu.py", line 324, in main
    train("/gpu:0")
  File "train_gpu.py", line 212, in train
    train_set = train_input_fn(params)
  File "/home/ubuntu/xlnet/data_utils.py", line 868, in input_fn
    num_predict=num_predict)
  File "/home/ubuntu/xlnet/data_utils.py", line 757, in get_dataset
    bsz_per_core=bsz_per_core)
  File "/home/ubuntu/xlnet/data_utils.py", line 566, in parse_files_to_dataset
    dataset = tf.data.TFRecordDataset(dataset)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/data/ops/readers.py", line 335, in __init__
    filenames, compression_type, buffer_size, num_parallel_reads)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/data/ops/readers.py", line 295, in __init__
    filenames = _create_or_validate_filenames_dataset(filenames)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/data/ops/readers.py", line 50, in _create_or_validate_filenames_dataset
    "`filenames` must be a `tf.data.Dataset` of `tf.string` elements.")
TypeError: `filenames` must be a `tf.data.Dataset` of `tf.string` elements.

Can you help me solve this? data_utils.py wrote the tfrecord files to this directory: '/home/ubuntu/xlnet/training/tfrecords'.
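The log above already shows the root cause: the record-info glob matched 0 files, so train_gpu.py ends up passing an empty file list to tf.data.TFRecordDataset, which raises the TypeError. A quick diagnostic sketch (the path and flag values below are copied from the log in this thread and are specific to this setup):

```python
import glob
import os

# The exact glob pattern train_gpu.py reports in the log above.
record_info_dir = "/home/ubuntu/xlnet/training/tfrecords"
pattern = os.path.join(
    record_info_dir,
    "record_info-train-*.bsz-2048.seqlen-512.reuse-256"
    ".bi.alpha-6.beta-1.fnp-85.json")

matches = glob.glob(pattern)
print("record-info files found:", len(matches))
# Zero matches means train_gpu.py has no tfrecords to read; the resulting
# empty filename list is what triggers the TypeError further down.
```

If this prints 0 while the directory does contain JSON files, the flag values baked into the filenames (bsz, seqlen, reuse, alpha, beta, fnp) differ from the ones passed to train_gpu.py.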

3NFBAGDU changed the title from "error train_gpu.py" to "error in pretrain an XLNet . train_gpu.py." Jul 3, 2019
graykode (Contributor) commented Jul 4, 2019

This is because your record-info file (like xlnet/proc_data/example/tfrecords/*.json) didn't match the train_gpu.py options. You should use the same options for data_utils.py and train_gpu.py.

3NFBAGDU (Author) commented Jul 4, 2019

> This is because your record-info file didn't match the train_gpu.py options. You should use the same options for data_utils.py and train_gpu.py.

sudo python3 data_utils.py --bsz_per_host=32 --num_core_per_host=16 --seq_len=512 --reuse_len=256 --input_glob=books-sentences.txt --save_dir=lado.txt --num_passes=20 --bi_data=True --sp_path=/home/ubuntu/giorgi/sp10m.cased.v3.model --mask_alpha=6 --mask_beta=1 --num_predict=85

sudo python3 train_gpu.py --record_info_dir=/home/ubuntu/xlnet/training/tfrecords --train_batch_size=2048 --seq_len=512 --reuse_len=256 --mem_len=384 --perm_size=256 --n_layer=24 --d_model=1024 --d_embed=1024 --n_head=16 --d_head=64 --d_inner=4096 --untie_r=True --mask_alpha=6 --mask_beta=1 --num_predict=85

--seq_len and --reuse_len are the same in both.
What parameter should I change in data_utils.py?
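One way to see the mismatch is to diff the shared flags of the two commands above. A minimal sketch (the flag lists are copied from this thread; parse_flags is a hypothetical helper, not part of the XLNet repo):

```python
# Parse "--key=value" tokens from a command line into a dict.
def parse_flags(cmd):
    flags = {}
    for tok in cmd.split():
        if tok.startswith("--") and "=" in tok:
            key, val = tok[2:].split("=", 1)
            flags[key] = val
    return flags

# Abbreviated copies of the two commands posted above.
prep = parse_flags(
    "data_utils.py --bsz_per_host=32 --seq_len=512 --reuse_len=256 "
    "--bi_data=True --mask_alpha=6 --mask_beta=1 --num_predict=85")
train = parse_flags(
    "train_gpu.py --train_batch_size=2048 --seq_len=512 --reuse_len=256 "
    "--mask_alpha=6 --mask_beta=1 --num_predict=85")

# The batch size is baked into the record-info filename, so bsz_per_host at
# preprocessing time must equal train_batch_size at training time.
print("preproc batch size:", prep["bsz_per_host"])       # 32
print("train batch size:  ", train["train_batch_size"])  # 2048 -- mismatch
```

Here seq_len and reuse_len do agree, but the batch sizes (32 vs 2048) do not, which is why the record-info glob finds nothing.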

SuMarsss commented Jul 4, 2019

> What parameter should I change in data_utils.py?

The parameter train_batch_size should be 32, which is equal to bsz_per_host.

3NFBAGDU (Author) commented Jul 4, 2019

> What parameter should I change in data_utils.py?

> The parameter train_batch_size should be 32, which is equal to bsz_per_host.

I have changed train_batch_size=32, but I got the same error: TypeError: `filenames` must be a `tf.data.Dataset` of `tf.string` elements.

volker42maru commented

Had the same issue. You have to set the uncased flag in either data_utils.py (preproc) or train.py, because the flag has different default values in these files.

3NFBAGDU (Author) commented Jul 8, 2019

> Had the same issue. You have to set the uncased flag in either data_utils.py (preproc) or train.py, because the flag has different default values in these files.

sudo python3 train_gpu.py --record_info_dir=fix3/tfrecords --train_batch_size=32 --seq_len=512 --reuse_len=256 --mem_len=384 --perm_size=256 --n_layer=24 --d_model=1024 --d_embed=1024 --n_head=16 --d_head=64 --d_inner=4096 --untie_r=True --model_dir='my_model' --uncased=False

sudo python3 data_utils.py --bsz_per_host=32 --num_core_per_host=16 --seq_len=512 --reuse_len=256 --input_glob=books-sentences.txt --save_dir=fix3 --num_passes=20 --bi_data=True --sp_path=/home/ubuntu/giorgi/sp10m.cased.v3.model --mask_alpha=6 --mask_beta=1 --num_predict=85 --uncased=False

Same error: TypeError: `filenames` must be a `tf.data.Dataset` of `tf.string` elements.

volker42maru commented

Check if the name of your generated .json file includes the same param values as you use in train.py. In your case probably: record_info-train-0-0.bsz-32.seqlen-512.reuse-256.bi.alpha-6.beta-1.fnp-85.json
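The naming scheme can be sketched from the filenames visible in this thread. This is a reconstruction of the pattern, not the exact code from data_utils.py, and expected_record_info_name is a hypothetical helper:

```python
# Sketch of the record-info naming scheme as it appears in this thread:
# record_info-train-<task>-<pass>.bsz-<B>.seqlen-<S>.reuse-<R>.<bi|uni>.
#   alpha-<A>.beta-<b>.fnp-<N>.json
def expected_record_info_name(bsz, seq_len, reuse_len, bi_data,
                              mask_alpha, mask_beta, num_predict,
                              task=0, pass_id=0):
    bi = "bi" if bi_data else "uni"
    return ("record_info-train-{}-{}.bsz-{}.seqlen-{}.reuse-{}.{}."
            "alpha-{}.beta-{}.fnp-{}.json").format(
                task, pass_id, bsz, seq_len, reuse_len, bi,
                mask_alpha, mask_beta, num_predict)

# With the flags used in this thread, train_gpu.py should be looking for:
print(expected_record_info_name(32, 512, 256, True, 6, 1, 85))
# record_info-train-0-0.bsz-32.seqlen-512.reuse-256.bi.alpha-6.beta-1.fnp-85.json
```

Every value in that name must match the corresponding train_gpu.py flag, or the glob finds nothing and training fails with the TypeError above.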

3NFBAGDU closed this as completed Jul 9, 2019
rkcalnode commented

I had the same problem, and trying the suggestion below solved the error.

> Had the same issue. You have to set the uncased flag in either data_utils.py (preproc) or train.py, because the flag has different default values in these files.

But another assertion error occurred in modeling.py (line 236):
assert bsz%2 == 0

The result of printing bsz is below. A Tensor was passed, but is that expected?

Inserted code in modeling.py (line 233)

    print('***** bsz ***** : {}'.format(bsz))
    print('***** type of bsz ***** : {}'.format(type(bsz)))
    print('***** bsz%2 ***** : {}'.format(bsz%2))
...
    if bsz is not None:

Stdout

***** bsz ***** : Tensor("model/transformer/strided_slice:0", shape=(), dtype=int32, device=/gpu:0)
***** type of bsz ***** : <class 'tensorflow.python.framework.ops.Tensor'>
***** bsz%2 ***** : Tensor("model/transformer/mod:0", shape=(), dtype=int32, device=/gpu:0)

hockeybro12 commented

@rkcalnode That assertion is buggy, so if your batch size is divisible by 2 you can just comment it out.
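The stdout above explains why the assert misfires: in TF 1.x graph mode, `bsz % 2` is a symbolic Tensor, and Tensor did not overload `==` as a value comparison, so `bsz % 2 == 0` falls back to object identity and is False regardless of the actual batch size. A TF-free mock illustrating the failure mode (FakeTensor is a hypothetical stand-in for tf.Tensor, not real TF code):

```python
# Mock of a symbolic tensor: arithmetic builds new nodes instead of numbers,
# and there is no __eq__ override, so `fake == 0` compares object identity.
class FakeTensor:
    def __init__(self, name):
        self.name = name

    def __mod__(self, other):
        # Like graph-mode TF, `%` returns a new symbolic node.
        return FakeTensor("{}_mod".format(self.name))

bsz = FakeTensor("model/transformer/strided_slice")
print(bsz % 2 == 0)  # False, regardless of the actual batch size
```

So the assertion trips whenever bsz reaches it as a graph-mode Tensor; commenting it out (given an even batch size) sidesteps the problem, as suggested above.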

hisashi-ito added a commit to hisashi-ito/xlnet that referenced this issue Oct 7, 2019