
Error when I train with train_gpu.py. (tf_records) #85

Closed
rainmaker712 opened this issue Jun 28, 2019 · 10 comments

@rainmaker712

Hi, thanks for your contribution.

I was trying to preprocess my own data and train on my own GPU machine, but after creating the tfrecords from the wiki data and trying to run training, it fails with "TypeError: filenames must be a tf.data.Dataset of tf.string elements."

It seems like a simple directory issue, or the tfrecords were not created correctly, since the logs show the number of record info paths is zero.

Has anyone seen a similar issue? I'd appreciate any help.

I0629 08:26:34.599769 4393649600 tf_logging.py:115] n_token 32000
I0629 08:26:34.600133 4393649600 tf_logging.py:115] Use the following tfrecord dirs: ['data_out3/tfrecords']
I0629 08:26:34.600275 4393649600 tf_logging.py:115] [0] Record glob: data_out3/tfrecords/record_info-train-*.bsz-16.seqlen-128.reuse-64.bi.alpha-6.beta-1.fnp-85.json
I0629 08:26:34.600965 4393649600 tf_logging.py:115] [0] Num of record info path: 0
I0629 08:26:34.601068 4393649600 tf_logging.py:115] [Dir 0] Number of chosen batches: 0
I0629 08:26:34.601134 4393649600 tf_logging.py:115] [Dir 0] Number of chosen files: 0
I0629 08:26:34.601197 4393649600 tf_logging.py:115] []
I0629 08:26:34.601253 4393649600 tf_logging.py:115] Total number of batches: 0
I0629 08:26:34.601778 4393649600 tf_logging.py:115] Total number of files: 0
I0629 08:26:34.601840 4393649600 tf_logging.py:115] []
I0629 08:26:34.601900 4393649600 tf_logging.py:115] num of batches 0
I0629 08:26:34.601970 4393649600 tf_logging.py:115] Host 0 handles 0 files
Traceback (most recent call last):
  File "train_gpu.py", line 328, in <module>
    tf.app.run()
  File "/Users/user/tf110/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "train_gpu.py", line 324, in main
    train("/gpu:0")
  File "train_gpu.py", line 212, in train
    train_set = train_input_fn(params)
  File "/Users/user/xlnet/data_utils.py", line 868, in input_fn
    num_predict=num_predict)
  File "/Users/user/xlnet/data_utils.py", line 757, in get_dataset
    bsz_per_core=bsz_per_core)
  File "/Users/user/xlnet/data_utils.py", line 566, in parse_files_to_dataset
    dataset = tf.data.TFRecordDataset(dataset)
  File "/Users/user/tf110/lib/python3.6/site-packages/tensorflow/python/data/ops/readers.py", line 194, in __init__
    "`filenames` must be a `tf.data.Dataset` of `tf.string` elements.")
TypeError: `filenames` must be a `tf.data.Dataset` of `tf.string` elements.
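
For context, here is a minimal sketch of why an empty file list surfaces as this exact TypeError (assuming TF 1.x behavior; parse_files_to_dataset wraps the filenames in a tf.data.Dataset before handing it to TFRecordDataset):

import tensorflow as tf

# With zero record info paths, parse_files_to_dataset builds a dataset from
# an empty list. An empty Python list converts to a float32 tensor by default,
# so the resulting dataset's element type is not tf.string...
dataset = tf.data.Dataset.from_tensor_slices([])

# ...and TFRecordDataset rejects any filename dataset whose elements are not
# tf.string, raising the TypeError shown above.
dataset = tf.data.TFRecordDataset(dataset)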

@kimiyoung
Collaborator

I0629 08:26:34.600275 4393649600 tf_logging.py:115] [0] Record glob: data_out3/tfrecords/record_info-train-*.bsz-16.seqlen-128.reuse-64.bi.alpha-6.beta-1.fnp-85.json
I0629 08:26:34.600965 4393649600 tf_logging.py:115] [0] Num of record info path: 0

Please double check the directory data_out3/tfrecords.
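
A quick sanity check (the pattern below is copied from your log) is to run the same glob the code uses and see whether it matches anything:

import tensorflow as tf

# Hypothetical check script: the pattern is the exact glob from the log above.
pattern = ("data_out3/tfrecords/record_info-train-*.bsz-16.seqlen-128"
           ".reuse-64.bi.alpha-6.beta-1.fnp-85.json")
print(tf.gfile.Glob(pattern))  # should print at least one record_info JSON path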

@rainmaker712
Author

I0629 08:26:34.600275 4393649600 tf_logging.py:115] [0] Record glob: data_out3/tfrecords/record_info-train-*.bsz-16.seqlen-128.reuse-64.bi.alpha-6.beta-1.fnp-85.json
I0629 08:26:34.600965 4393649600 tf_logging.py:115] [0] Num of record info path: 0

Please double check the directory data_out3/tfrecords.

@kimiyoung, thanks for your response.

The tfrecords are in that exact location, but the same message comes out again.

I0701 13:58:11.560843 4427961792 train_gpu.py:317] n_token 32000
I0701 13:58:11.561153 4427961792 data_utils.py:801] Use the following tfrecord dirs: ['data_out_test/tfrecords']
I0701 13:58:11.561454 4427961792 data_utils.py:805] [0] Record glob: data_out_test/tfrecords/record_info-train-*.bsz-16.seqlen-128.reuse-64.bi.alpha-6.beta-1.fnp-85.json
I0701 13:58:11.562546 4427961792 data_utils.py:809] []
I0701 13:58:11.562644 4427961792 data_utils.py:812] [0] Num of record info path: 0
I0701 13:58:11.562748 4427961792 data_utils.py:845] [Dir 0] Number of chosen batches: 0
I0701 13:58:11.562829 4427961792 data_utils.py:847] [Dir 0] Number of chosen files: 0
I0701 13:58:11.562886 4427961792 data_utils.py:848] []
I0701 13:58:11.562942 4427961792 data_utils.py:855] Total number of batches: 0
I0701 13:58:11.563116 4427961792 data_utils.py:857] Total number of files: 0
I0701 13:58:11.563173 4427961792 data_utils.py:858] []
I0701 13:58:11.563229 4427961792 train_gpu.py:202] num of batches 0
I0701 13:58:11.563291 4427961792 data_utils.py:864] {'batch_size': 16}

[screenshot]

I might have made a small mistake somewhere, or could preprocessing done in a wrong way have a similar effect? (There was no error during preprocessing, though.)

@kimiyoung
Collaborator

look at the batch size...

@rainmaker712
Author

rainmaker712 commented Jul 1, 2019

look at the batch size...

Thanks! Even though that was not the main issue (the screenshot above was from an older version of the tfrecords), it gave me the hint I needed to solve this (I will share the fix later).
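
For anyone hitting this later: the record_info filename encodes the flags that were used when the tfrecords were created, so the flags passed to train_gpu.py have to reproduce the same glob. Going by format_filename in data_utils.py, the pattern from the log above appears to break down as:

# record_info-train-*.bsz-16.seqlen-128.reuse-64.bi.alpha-6.beta-1.fnp-85.json
#   bsz-16     <-> --train_batch_size=16
#   seqlen-128 <-> --seq_len=128
#   reuse-64   <-> --reuse_len=64
#   bi         <-> --bi_data=True
#   alpha-6    <-> --mask_alpha=6
#   beta-1     <-> --mask_beta=1
#   fnp-85     <-> --num_predict=85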

@3NFBAGDU

3NFBAGDU commented Jul 2, 2019

I have the same error. I ran sudo python3 train_gpu.py --record_info_dir=/home/ubuntu/xlnet/training/tfrecords --train_batch_size=2048 --seq_len=512 --reuse_len=256 --mem_len=384 --perm_size=256 --n_layer=24 --d_model=1024 --d_embed=1024 --n_head=16 --d_head=64 --d_inner=4096 --untie_r=True --mask_alpha=6 --mask_beta=1 --num_predict=85 --model_dir=/home/ubuntu/axalimodeli and got the following error:

/usr/local/lib/python3.5/dist-packages/tensorflow-plugins
/home/ubuntu/.local/lib/python3.5/site-packages/tensorflow-plugins
/usr/lib/python3/dist-packages/tensorflow-plugins
/usr/lib/python3.5/dist-packages/tensorflow-plugins
I0702 11:32:34.041983 139935611332352 train_gpu.py:319] n_token 32000
I0702 11:32:34.042275 139935611332352 data_utils.py:795] Use the following tfrecord dirs: ['/home/ubuntu/xlnet/training/tfrecords']
I0702 11:32:34.042413 139935611332352 data_utils.py:799] [0] Record glob: /home/ubuntu/xlnet/training/tfrecords/record_info-train-*.bsz-2048.seqlen-512.reuse-256.bi.alpha-6.beta-1.fnp-85.json
I0702 11:32:34.042960 139935611332352 data_utils.py:803] [0] Num of record info path: 0
I0702 11:32:34.043075 139935611332352 data_utils.py:836] [Dir 0] Number of chosen batches: 0
I0702 11:32:34.043182 139935611332352 data_utils.py:838] [Dir 0] Number of chosen files: 0
I0702 11:32:34.043281 139935611332352 data_utils.py:839] []
I0702 11:32:34.043379 139935611332352 data_utils.py:846] Total number of batches: 0
I0702 11:32:34.043897 139935611332352 data_utils.py:848] Total number of files: 0
I0702 11:32:34.044010 139935611332352 data_utils.py:849] []
I0702 11:32:34.044113 139935611332352 train_gpu.py:204] num of batches 0
I0702 11:32:34.044229 139935611332352 data_utils.py:555] Host 0 handles 0 files
Traceback (most recent call last):
  File "train_gpu.py", line 328, in <module>
    tf.compat.v1.app.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "train_gpu.py", line 324, in main
    train("/gpu:0")
  File "train_gpu.py", line 212, in train
    train_set = train_input_fn(params)
  File "/home/ubuntu/xlnet/data_utils.py", line 868, in input_fn
    num_predict=num_predict)
  File "/home/ubuntu/xlnet/data_utils.py", line 757, in get_dataset
    bsz_per_core=bsz_per_core)
  File "/home/ubuntu/xlnet/data_utils.py", line 566, in parse_files_to_dataset
    dataset = tf.data.TFRecordDataset(dataset)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/data/ops/readers.py", line 335, in __init__
    filenames, compression_type, buffer_size, num_parallel_reads)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/data/ops/readers.py", line 295, in __init__
    filenames = _create_or_validate_filenames_dataset(filenames)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/data/ops/readers.py", line 50, in _create_or_validate_filenames_dataset
    "`filenames` must be a `tf.data.Dataset` of `tf.string` elements.")
TypeError: `filenames` must be a `tf.data.Dataset` of `tf.string` elements.

Can you help me?

@rainmaker712
Author

I think it's just a simple path issue.

An easy workaround is to add 2~3 lines of code to data_utils.py, as shown in the snippet below; you just change the path to your own tfrecords directory.

@3NFBAGDU

[screenshot of the modified data_utils.py; the full snippet is posted further down]

@3NFBAGDU

@AIscientist Thank you, I have done it.
I have pretrained an XLNet model with train_gpu.py.
Now I want to make sentence embeddings with my pretrained model. Is it possible to do that?
For example, I want to know how 'how are you?' is represented as a sentence vector, something like:

model = load(my_model)
sentence_vector = model.predict('how are you?')

@rainmaker712
Author

rainmaker712 commented Jul 11, 2019

@3NFBAGDU Yes, you can do that.
To do that, you convert 'how are you?' into model inputs (token ids), and
if you look at the instructions in README.md, there is a section called "Custom Usage of XLNet."
There, you can either use

# Get a summary of the sequence using the last hidden state
summary = xlnet_model.get_pooled_out(summary_type="last")

or

# Get a sequence output
seq_out = xlnet_model.get_sequence_output()

to get your embeddings.
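
For a fuller picture, here is a minimal sketch adapted from that README section (as in the README, FLAGS and the input tensors input_ids, seg_ids, and input_mask are assumed to be set up elsewhere):

import xlnet

# XLNetConfig contains hyperparameters that are specific to a model checkpoint.
xlnet_config = xlnet.XLNetConfig(json_path=FLAGS.model_config_path)

# RunConfig contains hyperparameters that could be different between
# pretraining and finetuning.
run_config = xlnet.create_run_config(is_training=True, is_finetune=True, FLAGS=FLAGS)

# Construct an XLNet model
xlnet_model = xlnet.XLNetModel(
    xlnet_config=xlnet_config,
    run_config=run_config,
    input_ids=input_ids,
    seg_ids=seg_ids,
    input_mask=input_mask)

# Get a summary of the sequence using the last hidden state
# (a fixed-size vector per sequence, i.e. your sentence embedding)
summary = xlnet_model.get_pooled_out(summary_type="last")

# Get a sequence output (one hidden state per token)
seq_out = xlnet_model.get_sequence_output()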

@abhi060698

I'm still struggling with this issue. If possible, could you post your data_utils.py or email it to me, please? @3NFBAGDU @AIscientist

@rainmaker712
Author

@abhi060698
I only changed this part of data_utils.py:

def get_input_fn(
    tfrecord_dir,
    split,
    bsz_per_host,
    seq_len,
    reuse_len,
    bi_data,
    num_hosts=1,
    num_core_per_host=1,
    perm_size=None,
    mask_alpha=None,
    mask_beta=None,
    uncased=False,
    num_passes=None,
    use_bfloat16=False,
    num_predict=None):

  # Merge all record infos into a single one
  record_glob_base = format_filename(
      prefix="record_info-{}-*".format(split),
      bsz_per_host=bsz_per_host,
      seq_len=seq_len,
      bi_data=bi_data,
      suffix="json",
      mask_alpha=mask_alpha,
      mask_beta=mask_beta,
      reuse_len=reuse_len,
      uncased=uncased,
      fixed_num_predict=num_predict)

  record_info = {"num_batch": 0, "filenames": []}

  tfrecord_dirs = tfrecord_dir.split(",")
  tf.logging.info("Use the following tfrecord dirs: %s", tfrecord_dirs)

  for idx, record_dir in enumerate(tfrecord_dirs):
    record_glob = os.path.join(record_dir, record_glob_base)
    tf.logging.info("[%d] Record glob: %s", idx, record_glob)

    record_paths = sorted(tf.gfile.Glob(record_glob))

    ## File load error -> manual change:
    ## bypass the glob above and pick up the record_info JSONs directly
    ## from a hard-coded tfrecords path.
    import glob
    # path = 'xlnet_cased_KR_3/tfrecords/'
    path = 'xlnet_cased_JP_12/tfrecords/'
    # path = 'xlnet_cased_JP_3/tfrecords/'
    record_paths = [f for f in glob.glob(path + "*.json", recursive=True)]
    tf.logging.info(record_paths)

    tf.logging.info("[%d] Num of record info path: %d",
                    idx, len(record_paths))

    cur_record_info = {"num_batch": 0, "filenames": []}

    for record_info_path in record_paths:
      if num_passes is not None:
        record_info_name = os.path.basename(record_info_path)
        fields = record_info_name.split(".")[0].split("-")
        pass_id = int(fields[-1])
        if len(fields) == 5 and pass_id >= num_passes:
          tf.logging.info("Skip pass %d: %s", pass_id, record_info_name)
          continue

      with tf.gfile.Open(record_info_path, "r") as fp:
        info = json.load(fp)
        if num_passes is not None:
          eff_num_passes = min(num_passes, len(info["filenames"]))
          ratio = eff_num_passes / len(info["filenames"])
          cur_record_info["num_batch"] += int(info["num_batch"] * ratio)
          cur_record_info["filenames"] += info["filenames"][:eff_num_passes]
        else:
          cur_record_info["num_batch"] += info["num_batch"]
          cur_record_info["filenames"] += info["filenames"]

    # overwrite directory for `cur_record_info`
    new_filenames = []
    for filename in cur_record_info["filenames"]:
      basename = os.path.basename(filename)
      new_filename = os.path.join(record_dir, basename)
      new_filenames.append(new_filename)
    cur_record_info["filenames"] = new_filenames

    tf.logging.info("[Dir %d] Number of chosen batches: %s",
                    idx, cur_record_info["num_batch"])
    tf.logging.info("[Dir %d] Number of chosen files: %s",
                    idx, len(cur_record_info["filenames"]))
    tf.logging.info(cur_record_info["filenames"])

    # add `cur_record_info` to global `record_info`
    record_info["num_batch"] += cur_record_info["num_batch"]
    record_info["filenames"] += cur_record_info["filenames"]

  tf.logging.info("Total number of batches: %d",
                  record_info["num_batch"])
  tf.logging.info("Total number of files: %d",
                  len(record_info["filenames"]))
  tf.logging.info(record_info["filenames"])
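
Note that this hard-codes the tfrecords path and skips the glob built by format_filename, so you lose the check that the preprocessing flags match the training flags; make sure the JSONs in that directory really were created with the settings you pass to train_gpu.py. In most cases the cleaner fix is to make --record_info_dir and the batch-size/sequence-length flags match what was used when creating the data, so the original glob matches on its own.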
