Data download corrupted when running demo #23
Description
When running the demo from the README (the English-to-German translation model using the Transformer model from Attention Is All You Need on WMT data), the data download produces a corrupted archive.
This eventually causes the tokenizer to run into errors.
`PROBLEM=wmt_ende_tokens_32k
MODEL=transformer
HPARAMS=transformer_base
DATA_DIR=$HOME/t2t_data
TMP_DIR=/tmp/t2t_datagen
TRAIN_DIR=$HOME/t2t_train/$PROBLEM/$MODEL-$HPARAMS
mkdir -p $DATA_DIR $TMP_DIR $TRAIN_DIR
# Generate data
t2t-datagen \
  --data_dir=$DATA_DIR \
  --tmp_dir=$TMP_DIR \
  --num_shards=100 \
  --problem=$PROBLEM`
The output of the previous data generation commands:
INFO:tensorflow:Generating training data for wmt_ende_tokens_32k.
INFO:tensorflow:Downloading http://data.statmt.org/wmt16/translation-task/training-parallel-nc-v11.tgz to /tmp/t2t_datagen/training-parallel-nc-v11.tgz
INFO:tensorflow:Succesfully downloaded training-parallel-nc-v11.tgz, 75178032 bytes.
INFO:tensorflow:Reading file: training-parallel-nc-v11/news-commentary-v11.de-en.en
INFO:tensorflow:Reading file: training-parallel-nc-v11/news-commentary-v11.de-en.de
INFO:tensorflow:Downloading http://www.statmt.org/wmt13/training-parallel-commoncrawl.tgz to /tmp/t2t_datagen/training-parallel-commoncrawl.tgz
At this point the download hangs, even though the data appears to have been downloaded successfully (I checked in /tmp/t2t_datagen), so I abort with CTRL-C. Running the command again gives the following error:
Traceback (most recent call last):
File "/usr/local/bin/t2t-datagen", line 361, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/usr/local/bin/t2t-datagen", line 344, in main
training_gen(), FLAGS.problem + UNSHUFFLED_SUFFIX + "-train",
File "/usr/local/bin/t2t-datagen", line 140, in <lambda>
lambda: wmt.ende_wordpiece_token_generator(FLAGS.tmp_dir, True, 2**15),
File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/data_generators/wmt.py", line 224, in ende_wordpiece_token_generator
tmp_dir, "tokens.vocab.%d" % vocab_size, vocab_size)
File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/data_generators/generator_utils.py", line 220, in get_or_generate_vocab
corpus_tar.extractall(tmp_dir)
File "/usr/lib/python2.7/tarfile.py", line 2079, in extractall
self.extract(tarinfo, path)
File "/usr/lib/python2.7/tarfile.py", line 2116, in extract
self._extract_member(tarinfo, os.path.join(path, tarinfo.name))
File "/usr/lib/python2.7/tarfile.py", line 2192, in _extract_member
self.makefile(tarinfo, targetpath)
File "/usr/lib/python2.7/tarfile.py", line 2233, in makefile
copyfileobj(source, target)
File "/usr/lib/python2.7/tarfile.py", line 266, in copyfileobj
shutil.copyfileobj(src, dst)
File "/usr/lib/python2.7/shutil.py", line 49, in copyfileobj
buf = fsrc.read(length)
File "/usr/lib/python2.7/tarfile.py", line 831, in read
buf += self.fileobj.read(size - len(buf))
File "/usr/lib/python2.7/tarfile.py", line 743, in read
return self.readnormal(size)
File "/usr/lib/python2.7/tarfile.py", line 758, in readnormal
return self.__read(size)
File "/usr/lib/python2.7/tarfile.py", line 748, in __read
buf = self.fileobj.read(size)
File "/usr/lib/python2.7/gzip.py", line 268, in read
self._read(readsize)
File "/usr/lib/python2.7/gzip.py", line 315, in _read
self._read_eof()
File "/usr/lib/python2.7/gzip.py", line 354, in _read_eof
hex(self.crc)))
IOError: CRC check failed 0x75d9e49c != 0xd122220fL
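The traceback shows the failure surfacing only while `tarfile` is partway through extraction. A minimal sketch of checking an archive before handing it to the extractor is given below. It is written for Python 3 (the report uses Python 2.7), and `gzip_is_intact` is a hypothetical helper, not part of tensor2tensor. It simply decompresses the whole stream, which makes the gzip module verify the CRC stored at the end of the file.

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1 << 20):
    """Return True if the gzip stream at `path` decompresses without errors.

    Hypothetical helper for illustration; not part of tensor2tensor.
    """
    try:
        with gzip.open(path, "rb") as stream:
            # Reading to the end forces the CRC/length check in the gzip trailer.
            while stream.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # OSError: bad header or CRC mismatch; EOFError: truncated file;
        # zlib.error: damaged compressed data.
        return False
    return True
```

For example, `gzip_is_intact("/tmp/t2t_datagen/training-parallel-commoncrawl.tgz")` could be called before re-running `t2t-datagen`. If it returns False, deleting the file should trigger a fresh download, assuming the tool only skips files that already exist.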
One workaround might be to manually download that archive from http://www.statmt.org/wmt13/training-parallel-commoncrawl.tgz and place it in /tmp/t2t_datagen (the TMP_DIR above). When I tried this, the download was extremely slow: approx. 2 hours for 876 MB.
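One plausible reading of the failure (an assumption, not confirmed in this report) is that the aborted run left an incomplete file under its final name, which the next run then treated as a finished download. A minimal Python 3 sketch of a manual download that avoids this is shown below. `download_atomically` and the `.incomplete` suffix are hypothetical names chosen for illustration. The data is written to a temporary name and renamed only after the transfer finishes and the size matches the `Content-Length` header, when the server sends one.

```python
import os
import shutil
import urllib.request


def download_atomically(url, dest_path):
    """Fetch `url` into `dest_path`, exposing the file only once complete.

    Hypothetical helper for illustration; not part of tensor2tensor.
    """
    partial_path = dest_path + ".incomplete"  # hypothetical temporary name
    with urllib.request.urlopen(url) as response, open(partial_path, "wb") as out:
        expected = response.headers.get("Content-Length")
        shutil.copyfileobj(response, out)
    written = os.path.getsize(partial_path)
    if expected is not None and int(expected) != written:
        os.remove(partial_path)
        raise IOError("expected %s bytes, got %d" % (expected, written))
    # The rename happens only after a complete transfer, so an interrupted
    # run never leaves a partial file under the final name.
    os.replace(partial_path, dest_path)
    return written
```

If the script is interrupted with CTRL-C, only the `.incomplete` file is left behind, so a later run would not mistake it for the finished archive.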
Results of manual download:
Working:
INFO:tensorflow:Not downloading, file already found: /tmp/t2t_datagen/training-parallel-commoncrawl.tgz
INFO:tensorflow:Reading file: commoncrawl.de-en.en
INFO:tensorflow:Reading file: commoncrawl.de-en.de
INFO:tensorflow:Reading file: commoncrawl.fr-en.en
INFO:tensorflow:Reading file: commoncrawl.fr-en.fr
Next in line for (hopefully not too slow) download:
INFO:tensorflow:Downloading http://www.statmt.org/wmt13/training-parallel-europarl-v7.tgz to /tmp/t2t_datagen/training-parallel-europarl-v7.tgz
I am posting the above as an FYI, or as a possible issue to be resolved.