This repository has been archived by the owner on Jul 7, 2023. It is now read-only.

--decode_to_file does not create output file #48

Closed
mehmedes opened this issue Jun 25, 2017 · 7 comments

Comments

@mehmedes

During inference, I'm not able to create a file containing the inference output.
I've tried --decode_to_file, but no output file is created.

@lukaszkaiser
Contributor

It should be sufficient to just use --decode_from_file=path; it creates a file named path.decodes..., where "..." includes the model name and so on. Did you try that?

@mehmedes
Author

mehmedes commented Jun 25, 2017

Actually, I did.
This is my decoding command:

PROBLEM=wmt_ende_bpe32k
MODEL=transformer
HPARAMS=transformer_base

DATA_DIR=$HOME/t2t_data
TMP_DIR=/tmp/t2t_datagen
TRAIN_DIR=$HOME/t2t_train/$PROBLEM/$MODEL-$HPARAMS


BEAM_SIZE=4
ALPHA=0.6

t2t-trainer \
  --data_dir=$DATA_DIR \
  --problems=$PROBLEM \
  --model=$MODEL \
  --hparams_set=$HPARAMS \
  --output_dir=$TRAIN_DIR \
  --train_steps=0 \
  --eval_steps=0 \
  --decode_beam_size=$BEAM_SIZE \
  --decode_alpha=$ALPHA \
  --decode_from_file /.../newsdev2016.tok.bpe.32000.en

But I couldn't find any output file anywhere.

@mehmedes
Author

Now, I see. It's in the tmp folder. Thanks!

@mehmedes
Author

While translating a file, I got the following error. What could this be related to?

File "/usr/local/bin/t2t-trainer", line 83, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/usr/local/bin/t2t-trainer", line 79, in main
    schedule=FLAGS.schedule)
  File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/utils/trainer_utils.py", line 240, in run
    run_locally(exp_fn(output_dir))
  File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/utils/trainer_utils.py", line 543, in run_locally
    decode_from_file(estimator, FLAGS.decode_from_file)
  File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/utils/trainer_utils.py", line 645, in decode_from_file
    result_iter = estimator.predict(input_fn=input_fn.next, as_iterable=True)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 289, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 590, in predict
    as_iterable=as_iterable)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 883, in _infer_model
    features = self._get_features_from_input_fn(input_fn)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 863, in _get_features_from_input_fn
    result = input_fn()
  File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/utils/trainer_utils.py", line 725, in _decode_batch_input_fn
    input_ids = vocabulary.encode(inputs)
  File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/data_generators/text_encoder.py", line 132, in encode
    ret = [self._token_to_id[tok] for tok in sentence.strip().split()]
KeyError: '\xc3\x88'
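The last frame of the traceback shows why: text_encoder.py looks each token up directly in the vocabulary table, so any out-of-vocabulary token raises a KeyError. A minimal sketch with a hypothetical toy vocabulary (the real table is self._token_to_id in TokenTextEncoder):

```python
# Stand-in for self._token_to_id; the real one is loaded from the vocab file.
token_to_id = {"hello": 0, "world": 1}

sentence = "hello È"  # "È" is not in the vocabulary
try:
    ids = [token_to_id[tok] for tok in sentence.strip().split()]
except KeyError as e:
    print("KeyError:", e)
```

Any character or subword missing from the vocab file triggers the same failure.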

@mehmedes
Author

Oh, that's just a character issue while translating a word containing the character "È"!

@cshanbo
Contributor

cshanbo commented Jun 26, 2017

Hi all,
I met the same issues.

  1. Indeed, the output ends up in a file in the TMP directory, but --decode_to_file doesn't work.
  2. The KeyError: '\xc3\x88' exception is raised there.

I think this is caused by OOV tokens, as @mehmedes said. I'm not sure whether the vocab.bpe.32000 in this dataset covers the pre-processed training data for the English-to-German translation task.

I used my own data, whose training set contains some words outside the vocabulary. I instead rewrote the code to something like:

def encode(self, sentence):
    """Converts a space-separated string of tokens to a list of ids."""
    ret = [self._token_to_id.get(tok, self._token_to_id['UNK'])
           for tok in sentence.strip().split()]
    return ret[::-1] if self._reverse else ret

where I can ensure the UNK symbol is in my vocabulary.
I'm testing this setting to see if it works.
Any advice would be helpful.

Thank you
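For illustration, the fallback above can be checked standalone with a hypothetical toy vocabulary that contains an explicit 'UNK' entry (names here are stand-ins, not tensor2tensor's):

```python
# Toy vocabulary with an explicit UNK entry.
token_to_id = {"hello": 0, "world": 1, "UNK": 2}

def encode(sentence, reverse=False):
    """Map tokens to ids, falling back to the UNK id for OOV tokens."""
    ids = [token_to_id.get(tok, token_to_id["UNK"])
           for tok in sentence.strip().split()]
    return ids[::-1] if reverse else ids

print(encode("hello È world"))  # → [0, 2, 1]; the OOV "È" maps to UNK
```

Note this only avoids the crash; translation quality for OOV-heavy input still depends on the vocabulary covering the data.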

@lukaszkaiser
Contributor

Does this only happen with BPE, or with the standard "tokens_32k" vocabulary too? We don't have a built-in tokenizer for BPE; it was used only for papers, to make perplexities comparable with other papers. It cannot be detokenized, so I believe it's better to use our own tokenizer. Or is the problem the same either way?
