This repository was archived by the owner on Jul 7, 2023. It is now read-only.

Data download corrupted when running demo #23

@Jordy-VL


When running the demo from the README (the English-to-German translation model using the Transformer model from Attention Is All You Need on WMT data), the data download produces a corrupted archive.
This eventually causes the tokenizer to run into errors.

```
PROBLEM=wmt_ende_tokens_32k
MODEL=transformer
HPARAMS=transformer_base

DATA_DIR=$HOME/t2t_data
TMP_DIR=/tmp/t2t_datagen
TRAIN_DIR=$HOME/t2t_train/$PROBLEM/$MODEL-$HPARAMS

mkdir -p $DATA_DIR $TMP_DIR $TRAIN_DIR

# Generate data
t2t-datagen \
  --data_dir=$DATA_DIR \
  --tmp_dir=$TMP_DIR \
  --num_shards=100
```

The output of the previous data generation commands:

INFO:tensorflow:Generating training data for wmt_ende_tokens_32k.
INFO:tensorflow:Downloading http://data.statmt.org/wmt16/translation-task/training-parallel-nc-v11.tgz to /tmp/t2t_datagen/training-parallel-nc-v11.tgz
INFO:tensorflow:Succesfully downloaded training-parallel-nc-v11.tgz, 75178032 bytes.
INFO:tensorflow:Reading file: training-parallel-nc-v11/news-commentary-v11.de-en.en
INFO:tensorflow:Reading file: training-parallel-nc-v11/news-commentary-v11.de-en.de
INFO:tensorflow:Downloading http://www.statmt.org/wmt13/training-parallel-commoncrawl.tgz to /tmp/t2t_datagen/training-parallel-commoncrawl.tgz

At this point the download hangs, even though the data appears to have been downloaded successfully (I checked in /tmp/t2t_datagen), so I abort with CTRL-C. Running the command again gives the following error:

Traceback (most recent call last):
File "/usr/local/bin/t2t-datagen", line 361, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/usr/local/bin/t2t-datagen", line 344, in main
training_gen(), FLAGS.problem + UNSHUFFLED_SUFFIX + "-train",
File "/usr/local/bin/t2t-datagen", line 140, in <lambda>
lambda: wmt.ende_wordpiece_token_generator(FLAGS.tmp_dir, True, 2**15),
File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/data_generators/wmt.py", line 224, in ende_wordpiece_token_generator
tmp_dir, "tokens.vocab.%d" % vocab_size, vocab_size)
File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/data_generators/generator_utils.py", line 220, in get_or_generate_vocab
corpus_tar.extractall(tmp_dir)
File "/usr/lib/python2.7/tarfile.py", line 2079, in extractall
self.extract(tarinfo, path)
File "/usr/lib/python2.7/tarfile.py", line 2116, in extract
self._extract_member(tarinfo, os.path.join(path, tarinfo.name))
File "/usr/lib/python2.7/tarfile.py", line 2192, in _extract_member
self.makefile(tarinfo, targetpath)
File "/usr/lib/python2.7/tarfile.py", line 2233, in makefile
copyfileobj(source, target)
File "/usr/lib/python2.7/tarfile.py", line 266, in copyfileobj
shutil.copyfileobj(src, dst)
File "/usr/lib/python2.7/shutil.py", line 49, in copyfileobj
buf = fsrc.read(length)
File "/usr/lib/python2.7/tarfile.py", line 831, in read
buf += self.fileobj.read(size - len(buf))
File "/usr/lib/python2.7/tarfile.py", line 743, in read
return self.readnormal(size)
File "/usr/lib/python2.7/tarfile.py", line 758, in readnormal
return self.__read(size)
File "/usr/lib/python2.7/tarfile.py", line 748, in __read
buf = self.fileobj.read(size)
File "/usr/lib/python2.7/gzip.py", line 268, in read
self._read(readsize)
File "/usr/lib/python2.7/gzip.py", line 315, in _read
self._read_eof()
File "/usr/lib/python2.7/gzip.py", line 354, in _read_eof
hex(self.crc)))
IOError: CRC check failed 0x75d9e49c != 0xd122220fL
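The CRC failure above is what a truncated or otherwise damaged gzip stream produces when it is read to the end. If the archive left behind by the aborted run is the culprit (an assumption, not something the logs confirm), it can be checked before rerunning `t2t-datagen`. The following is a minimal sketch in Python 3; the helper names `gzip_is_intact` and `remove_if_corrupt` are hypothetical and are not part of tensor2tensor.

```python
import gzip
import os
import zlib


def gzip_is_intact(path, chunk_size=1 << 20):
    """Return True if the gzip stream at `path` decompresses to the end.

    Reading the whole stream forces the gzip module to verify the CRC
    and length stored in the trailer, which is the check that failed
    in the traceback above.
    """
    try:
        with gzip.open(path, "rb") as f:
            while f.read(chunk_size):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        # OSError covers a bad header or CRC mismatch, EOFError a
        # truncated file, zlib.error corrupt compressed data.
        return False


def remove_if_corrupt(path):
    """Delete `path` if it exists and fails the integrity check.

    Returns True if the file was removed, so that a later run can
    download it again instead of reusing the damaged copy.
    """
    if os.path.exists(path) and not gzip_is_intact(path):
        os.remove(path)
        return True
    return False
```

For the case in this issue, one would call `remove_if_corrupt("/tmp/t2t_datagen/training-parallel-commoncrawl.tgz")` before running the data generation again. Note that the check only validates the compression layer; it does not confirm that the archive matches what the server holds.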

One strategy might be to manually download the archive from http://www.statmt.org/wmt13/training-parallel-commoncrawl.tgz and place it in /tmp/t2t_datagen. When I tried this, the download was extremely slow: approximately 2 hours for 876 MB.

Results of the manual download (working):

INFO:tensorflow:Not downloading, file already found: /tmp/t2t_datagen/training-parallel-commoncrawl.tgz
INFO:tensorflow:Reading file: commoncrawl.de-en.en
INFO:tensorflow:Reading file: commoncrawl.de-en.de
INFO:tensorflow:Reading file: commoncrawl.fr-en.en
INFO:tensorflow:Reading file: commoncrawl.fr-en.fr

Next in line for (hopefully not too slow) download:

INFO:tensorflow:Downloading http://www.statmt.org/wmt13/training-parallel-europarl-v7.tgz to /tmp/t2t_datagen/training-parallel-europarl-v7.tgz

I am posting the above as an FYI, or as a possible issue to be resolved.
