
Error when I train with train_gpu.py. (tf_records) #85

Closed
rainmaker712 opened this issue Jun 28, 2019 · 10 comments

@rainmaker712

Hi, thanks for your contribution.

I was trying to preprocess my own data and train on my own GPU machine, but after creating the tfrecords from the wiki data and trying to run training, it fails with "TypeError: filenames must be a tf.data.Dataset of tf.string elements."

It seems like a simple directory issue, or the tfrecords were not created correctly, since the logs show the number of record info paths is zero.

Has anyone seen a similar issue? I'd appreciate any help.

I0629 08:26:34.599769 4393649600 tf_logging.py:115] n_token 32000
I0629 08:26:34.600133 4393649600 tf_logging.py:115] Use the following tfrecord dirs: ['data_out3/tfrecords']
I0629 08:26:34.600275 4393649600 tf_logging.py:115] [0] Record glob: data_out3/tfrecords/record_info-train-*.bsz-16.seqlen-128.reuse-64.bi.alpha-6.beta-1.fnp-85.json
I0629 08:26:34.600965 4393649600 tf_logging.py:115] [0] Num of record info path: 0
I0629 08:26:34.601068 4393649600 tf_logging.py:115] [Dir 0] Number of chosen batches: 0
I0629 08:26:34.601134 4393649600 tf_logging.py:115] [Dir 0] Number of chosen files: 0
I0629 08:26:34.601197 4393649600 tf_logging.py:115] []
I0629 08:26:34.601253 4393649600 tf_logging.py:115] Total number of batches: 0
I0629 08:26:34.601778 4393649600 tf_logging.py:115] Total number of files: 0
I0629 08:26:34.601840 4393649600 tf_logging.py:115] []
I0629 08:26:34.601900 4393649600 tf_logging.py:115] num of batches 0
I0629 08:26:34.601970 4393649600 tf_logging.py:115] Host 0 handles 0 files
Traceback (most recent call last):
  File "train_gpu.py", line 328, in <module>
    tf.app.run()
  File "/Users/user/tf110/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "train_gpu.py", line 324, in main
    train("/gpu:0")
  File "train_gpu.py", line 212, in train
    train_set = train_input_fn(params)
  File "/Users/user/xlnet/data_utils.py", line 868, in input_fn
    num_predict=num_predict)
  File "/Users/user/xlnet/data_utils.py", line 757, in get_dataset
    bsz_per_core=bsz_per_core)
  File "/Users/user/xlnet/data_utils.py", line 566, in parse_files_to_dataset
    dataset = tf.data.TFRecordDataset(dataset)
  File "/Users/user/tf110/lib/python3.6/site-packages/tensorflow/python/data/ops/readers.py", line 194, in __init__
    "`filenames` must be a `tf.data.Dataset` of `tf.string` elements.")
TypeError: `filenames` must be a `tf.data.Dataset` of `tf.string` elements.
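
For context, here is a minimal sketch of why an empty file list surfaces as this exact TypeError (assuming TF 1.x behavior; parse_files_to_dataset wraps the filenames in a tf.data.Dataset before handing it to TFRecordDataset):

import tensorflow as tf

# With zero record info paths, parse_files_to_dataset builds a dataset from
# an empty list. An empty Python list converts to a float32 tensor by default,
# so the resulting dataset's element type is not tf.string...
dataset = tf.data.Dataset.from_tensor_slices([])

# ...and TFRecordDataset rejects any filename dataset whose elements are not
# tf.string, raising the TypeError shown above.
dataset = tf.data.TFRecordDataset(dataset)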

@kimiyoung
Collaborator

I0629 08:26:34.600275 4393649600 tf_logging.py:115] [0] Record glob: data_out3/tfrecords/record_info-train-*.bsz-16.seqlen-128.reuse-64.bi.alpha-6.beta-1.fnp-85.json
I0629 08:26:34.600965 4393649600 tf_logging.py:115] [0] Num of record info path: 0

Please double check the directory data_out3/tfrecords.
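
A quick sanity check (the pattern below is copied from your log) is to run the same glob the code uses and see whether it matches anything:

import tensorflow as tf

# Hypothetical check script: the pattern is the exact glob from the log above.
pattern = ("data_out3/tfrecords/record_info-train-*.bsz-16.seqlen-128"
           ".reuse-64.bi.alpha-6.beta-1.fnp-85.json")
print(tf.gfile.Glob(pattern))  # should print at least one record_info JSON path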

@rainmaker712
Author

I0629 08:26:34.600275 4393649600 tf_logging.py:115] [0] Record glob: data_out3/tfrecords/record_info-train-*.bsz-16.seqlen-128.reuse-64.bi.alpha-6.beta-1.fnp-85.json
I0629 08:26:34.600965 4393649600 tf_logging.py:115] [0] Num of record info path: 0

Please double check the directory data_out3/tfrecords.

@kimiyoung, thanks for your response.

The tfrecords are in that exact location, but the same message comes out again.

I0701 13:58:11.560843 4427961792 train_gpu.py:317] n_token 32000
I0701 13:58:11.561153 4427961792 data_utils.py:801] Use the following tfrecord dirs: ['data_out_test/tfrecords']
I0701 13:58:11.561454 4427961792 data_utils.py:805] [0] Record glob: data_out_test/tfrecords/record_info-train-*.bsz-16.seqlen-128.reuse-64.bi.alpha-6.beta-1.fnp-85.json
I0701 13:58:11.562546 4427961792 data_utils.py:809] []
I0701 13:58:11.562644 4427961792 data_utils.py:812] [0] Num of record info path: 0
I0701 13:58:11.562748 4427961792 data_utils.py:845] [Dir 0] Number of chosen batches: 0
I0701 13:58:11.562829 4427961792 data_utils.py:847] [Dir 0] Number of chosen files: 0
I0701 13:58:11.562886 4427961792 data_utils.py:848] []
I0701 13:58:11.562942 4427961792 data_utils.py:855] Total number of batches: 0
I0701 13:58:11.563116 4427961792 data_utils.py:857] Total number of files: 0
I0701 13:58:11.563173 4427961792 data_utils.py:858] []
I0701 13:58:11.563229 4427961792 train_gpu.py:202] num of batches 0
I0701 13:58:11.563291 4427961792 data_utils.py:864] {'batch_size': 16}

[screenshot]

I might have made a small mistake somewhere, or could preprocessing done in a wrong way have a similar effect? (There was no error during preprocessing, though.)

@kimiyoung
Collaborator

look at the batch size...

@rainmaker712
Author

rainmaker712 commented Jul 1, 2019

look at the batch size...

Thanks! Even though that was not the main issue (the screenshot above was from an older version of the tfrecords), it gave me the hint I needed to solve this (I will share the fix later).
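
For anyone hitting this later: the record_info filename encodes the flags that were used when the tfrecords were created, so the flags passed to train_gpu.py have to reproduce the same glob. Going by format_filename in data_utils.py, the pattern from the log above appears to break down as:

# record_info-train-*.bsz-16.seqlen-128.reuse-64.bi.alpha-6.beta-1.fnp-85.json
#   bsz-16     <-> --train_batch_size=16
#   seqlen-128 <-> --seq_len=128
#   reuse-64   <-> --reuse_len=64
#   bi         <-> --bi_data=True
#   alpha-6    <-> --mask_alpha=6
#   beta-1     <-> --mask_beta=1
#   fnp-85     <-> --num_predict=85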

@3NFBAGDU

3NFBAGDU commented Jul 2, 2019

I have the same error. I ran sudo python3 train_gpu.py --record_info_dir=/home/ubuntu/xlnet/training/tfrecords --train_batch_size=2048 --seq_len=512 --reuse_len=256 --mem_len=384 --perm_size=256 --n_layer=24 --d_model=1024 --d_embed=1024 --n_head=16 --d_head=64 --d_inner=4096 --untie_r=True --mask_alpha=6 --mask_beta=1 --num_predict=85 --model_dir=/home/ubuntu/axalimodeli and got the following error:

/usr/local/lib/python3.5/dist-packages/tensorflow-plugins
/home/ubuntu/.local/lib/python3.5/site-packages/tensorflow-plugins
/usr/lib/python3/dist-packages/tensorflow-plugins
/usr/lib/python3.5/dist-packages/tensorflow-plugins
I0702 11:32:34.041983 139935611332352 train_gpu.py:319] n_token 32000
I0702 11:32:34.042275 139935611332352 data_utils.py:795] Use the following tfrecord dirs: ['/home/ubuntu/xlnet/training/tfrecords']
I0702 11:32:34.042413 139935611332352 data_utils.py:799] [0] Record glob: /home/ubuntu/xlnet/training/tfrecords/record_info-train-*.bsz-2048.seqlen-512.reuse-256.bi.alpha-6.beta-1.fnp-85.json
I0702 11:32:34.042960 139935611332352 data_utils.py:803] [0] Num of record info path: 0
I0702 11:32:34.043075 139935611332352 data_utils.py:836] [Dir 0] Number of chosen batches: 0
I0702 11:32:34.043182 139935611332352 data_utils.py:838] [Dir 0] Number of chosen files: 0
I0702 11:32:34.043281 139935611332352 data_utils.py:839] []
I0702 11:32:34.043379 139935611332352 data_utils.py:846] Total number of batches: 0
I0702 11:32:34.043897 139935611332352 data_utils.py:848] Total number of files: 0
I0702 11:32:34.044010 139935611332352 data_utils.py:849] []
I0702 11:32:34.044113 139935611332352 train_gpu.py:204] num of batches 0
I0702 11:32:34.044229 139935611332352 data_utils.py:555] Host 0 handles 0 files
Traceback (most recent call last):
  File "train_gpu.py", line 328, in <module>
    tf.compat.v1.app.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "train_gpu.py", line 324, in main
    train("/gpu:0")
  File "train_gpu.py", line 212, in train
    train_set = train_input_fn(params)
  File "/home/ubuntu/xlnet/data_utils.py", line 868, in input_fn
    num_predict=num_predict)
  File "/home/ubuntu/xlnet/data_utils.py", line 757, in get_dataset
    bsz_per_core=bsz_per_core)
  File "/home/ubuntu/xlnet/data_utils.py", line 566, in parse_files_to_dataset
    dataset = tf.data.TFRecordDataset(dataset)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/data/ops/readers.py", line 335, in __init__
    filenames, compression_type, buffer_size, num_parallel_reads)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/data/ops/readers.py", line 295, in __init__
    filenames = _create_or_validate_filenames_dataset(filenames)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/data/ops/readers.py", line 50, in _create_or_validate_filenames_dataset
    "`filenames` must be a `tf.data.Dataset` of `tf.string` elements.")
TypeError: `filenames` must be a `tf.data.Dataset` of `tf.string` elements.

Can you help me?

@rainmaker712
Author

I think it's just a simple path issue.

An easy workaround is to add 2~3 lines of code to data_utils.py, as shown in the snippet below; you just change the path to your own tfrecords directory.

@3NFBAGDU

[screenshot of the modified data_utils.py; the full snippet is posted further down]

@3NFBAGDU

@AIscientist Thank you, I have done it.
I have pretrained an XLNet model with train_gpu.py.
Now I want to make sentence embeddings with my pretrained model. Is it possible to do that?
For example, I want to know how 'how are you?' is represented as a sentence vector, something like:

model = load(my_model)
sentence_vector = model.predict('how are you?')

@rainmaker712
Author

rainmaker712 commented Jul 11, 2019

@3NFBAGDU Yes, you can do that.
To do that, you convert 'how are you?' into model inputs (token ids), and
if you look at the instructions in README.md, there is a section called "Custom Usage of XLNet."
There, you can either use

# Get a summary of the sequence using the last hidden state
summary = xlnet_model.get_pooled_out(summary_type="last")

or

# Get a sequence output
seq_out = xlnet_model.get_sequence_output()

to get your embeddings.
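
For a fuller picture, here is a minimal sketch adapted from that README section (as in the README, FLAGS and the input tensors input_ids, seg_ids, and input_mask are assumed to be set up elsewhere):

import xlnet

# XLNetConfig contains hyperparameters that are specific to a model checkpoint.
xlnet_config = xlnet.XLNetConfig(json_path=FLAGS.model_config_path)

# RunConfig contains hyperparameters that could be different between
# pretraining and finetuning.
run_config = xlnet.create_run_config(is_training=True, is_finetune=True, FLAGS=FLAGS)

# Construct an XLNet model
xlnet_model = xlnet.XLNetModel(
    xlnet_config=xlnet_config,
    run_config=run_config,
    input_ids=input_ids,
    seg_ids=seg_ids,
    input_mask=input_mask)

# Get a summary of the sequence using the last hidden state
# (a fixed-size vector per sequence, i.e. your sentence embedding)
summary = xlnet_model.get_pooled_out(summary_type="last")

# Get a sequence output (one hidden state per token)
seq_out = xlnet_model.get_sequence_output()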

@abhi060698

I'm still struggling with this issue. If possible, could you post your data_utils.py or email it to me, please? @3NFBAGDU @AIscientist

@rainmaker712
Author

@abhi060698
I only changed this part of data_utils.py:

def get_input_fn(
    tfrecord_dir,
    split,
    bsz_per_host,
    seq_len,
    reuse_len,
    bi_data,
    num_hosts=1,
    num_core_per_host=1,
    perm_size=None,
    mask_alpha=None,
    mask_beta=None,
    uncased=False,
    num_passes=None,
    use_bfloat16=False,
    num_predict=None):

  # Merge all record infos into a single one
  record_glob_base = format_filename(
      prefix="record_info-{}-*".format(split),
      bsz_per_host=bsz_per_host,
      seq_len=seq_len,
      bi_data=bi_data,
      suffix="json",
      mask_alpha=mask_alpha,
      mask_beta=mask_beta,
      reuse_len=reuse_len,
      uncased=uncased,
      fixed_num_predict=num_predict)

  record_info = {"num_batch": 0, "filenames": []}

  tfrecord_dirs = tfrecord_dir.split(",")
  tf.logging.info("Use the following tfrecord dirs: %s", tfrecord_dirs)

  for idx, record_dir in enumerate(tfrecord_dirs):
    record_glob = os.path.join(record_dir, record_glob_base)
    tf.logging.info("[%d] Record glob: %s", idx, record_glob)

    record_paths = sorted(tf.gfile.Glob(record_glob))

    ## File load error -> manual change:
    ## bypass the glob above and pick up the record_info JSONs directly
    ## from a hard-coded tfrecords path.
    import glob
    # path = 'xlnet_cased_KR_3/tfrecords/'
    path = 'xlnet_cased_JP_12/tfrecords/'
    # path = 'xlnet_cased_JP_3/tfrecords/'
    record_paths = [f for f in glob.glob(path + "*.json", recursive=True)]
    tf.logging.info(record_paths)

    tf.logging.info("[%d] Num of record info path: %d",
                    idx, len(record_paths))

    cur_record_info = {"num_batch": 0, "filenames": []}

    for record_info_path in record_paths:
      if num_passes is not None:
        record_info_name = os.path.basename(record_info_path)
        fields = record_info_name.split(".")[0].split("-")
        pass_id = int(fields[-1])
        if len(fields) == 5 and pass_id >= num_passes:
          tf.logging.info("Skip pass %d: %s", pass_id, record_info_name)
          continue

      with tf.gfile.Open(record_info_path, "r") as fp:
        info = json.load(fp)
        if num_passes is not None:
          eff_num_passes = min(num_passes, len(info["filenames"]))
          ratio = eff_num_passes / len(info["filenames"])
          cur_record_info["num_batch"] += int(info["num_batch"] * ratio)
          cur_record_info["filenames"] += info["filenames"][:eff_num_passes]
        else:
          cur_record_info["num_batch"] += info["num_batch"]
          cur_record_info["filenames"] += info["filenames"]

    # overwrite directory for `cur_record_info`
    new_filenames = []
    for filename in cur_record_info["filenames"]:
      basename = os.path.basename(filename)
      new_filename = os.path.join(record_dir, basename)
      new_filenames.append(new_filename)
    cur_record_info["filenames"] = new_filenames

    tf.logging.info("[Dir %d] Number of chosen batches: %s",
                    idx, cur_record_info["num_batch"])
    tf.logging.info("[Dir %d] Number of chosen files: %s",
                    idx, len(cur_record_info["filenames"]))
    tf.logging.info(cur_record_info["filenames"])

    # add `cur_record_info` to global `record_info`
    record_info["num_batch"] += cur_record_info["num_batch"]
    record_info["filenames"] += cur_record_info["filenames"]

  tf.logging.info("Total number of batches: %d",
                  record_info["num_batch"])
  tf.logging.info("Total number of files: %d",
                  len(record_info["filenames"]))
  tf.logging.info(record_info["filenames"])
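
Note that this hard-codes the tfrecords path and skips the glob built by format_filename, so you lose the check that the preprocessing flags match the training flags; make sure the JSONs in that directory really were created with the settings you pass to train_gpu.py. In most cases the cleaner fix is to make --record_info_dir and the batch-size/sequence-length flags match what was used when creating the data, so the original glob matches on its own.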
