Squashed commit of the following:
commit 35a586676036f627bffd0d3c753c6cd0a70d63cf
Author: ZheyuYe <zheyu.ye1995@gmail.com>
Date:   Fri Jul 17 10:10:14 2020 +0800

    Squashed commit of the following:

    commit 673344d
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Wed Jul 15 22:43:07 2020 +0800

        CharTokenizer

    commit 8dabfd6
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Wed Jul 15 15:47:24 2020 +0800

        lowercase

    commit f5c94a6
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Tue Jul 14 17:45:28 2020 +0800

        test

    commit dc55fc9
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Tue Jul 14 05:45:01 2020 +0800

        tiny update on run_squad

    commit 4defc7a
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Mon Jul 13 23:18:08 2020 +0800

        update testings

    commit 2719e81
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Mon Jul 13 23:08:32 2020 +0800

        re-upload xlmr

    commit cd0509d
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Mon Jul 13 22:30:47 2020 +0800

        fix get_pretrained

    commit 8ed8a72
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Mon Jul 13 22:28:13 2020 +0800

        re-upload roberta

    commit 5811d40
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Mon Jul 13 18:27:23 2020 +0800

        update

    commit 44a09a3
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Sat Jul 11 15:06:33 2020 +0800

        fix

    commit 4074a26
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Fri Jul 10 16:08:49 2020 +0800

        inference without horovod

    commit 31cb953
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Thu Jul 9 18:41:55 2020 +0800

        update

    commit 838be2a
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Thu Jul 9 15:14:39 2020 +0800

        horovod for squad

    commit 1d374a2
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Thu Jul 9 12:09:19 2020 +0800

        fix

    commit e4fba39
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Thu Jul 9 10:35:08 2020 +0800

        remove multiply_grads

    commit 007f07e
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Tue Jul 7 11:26:38 2020 +0800

        multiply_grads

    commit b8c85bb
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Mon Jul 6 12:28:56 2020 +0800

        fix ModelForQABasic

    commit 0e13a58
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Sat Jul 4 18:42:12 2020 +0800

        clip_grad_global_norm with zeros max_grad_norm

    commit bd270f2
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Fri Jul 3 20:21:31 2020 +0800

        fix roberta

    commit 4fc564c
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Fri Jul 3 19:36:08 2020 +0800

        update hyper-parameters of adamw

    commit 59cffbf
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Fri Jul 3 16:25:46 2020 +0800

        try

    commit a84f782
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Thu Jul 2 20:39:03 2020 +0800

        fix mobilebert

    commit 4bc3a96
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Thu Jul 2 11:14:39 2020 +0800

        layer-wise decay

    commit 07186d5
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Thu Jul 2 02:14:43 2020 +0800

        revise

    commit a5a6475
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Wed Jul 1 19:50:20 2020 +0800

        topk

    commit 34ee884
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Wed Jul 1 19:25:09 2020 +0800

        index_update

    commit 74178e2
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Wed Jul 1 00:48:32 2020 +0800

        rename

    commit fa011aa
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Tue Jun 30 23:40:28 2020 +0800

        update

    commit 402d625
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Tue Jun 30 21:40:30 2020 +0800

        multiprocessing for wiki

    commit ddbde75
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Tue Jun 30 20:41:35 2020 +0800

        fix bookcorpus

    commit 6cc5ccd
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Tue Jun 30 16:39:12 2020 +0800

        fix wiki

    commit 9773efd
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Tue Jun 30 15:52:13 2020 +0800

        fix openwebtext

    commit 1fb8eb8
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Mon Jun 29 19:51:25 2020 +0800

        upload gluon_electra_small_owt

    commit ca83fac
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Mon Jun 29 18:09:48 2020 +0800

        revise train_transformer

    commit 1450f5c
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Mon Jun 29 18:07:04 2020 +0800

        revise

    commit b460bbe
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Mon Jun 29 17:24:00 2020 +0800

        repeat for pretraining

    commit 8ee381b
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Mon Jun 29 17:06:43 2020 +0800

        repeat

    commit aea936f
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Mon Jun 29 16:39:22 2020 +0800

        fix mobilebert

    commit eead164
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Sun Jun 28 18:44:28 2020 +0800

        fix

    commit 8645115
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Sun Jun 28 17:27:43 2020 +0800

        update

    commit 2b7f7a3
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Sun Jun 28 17:18:00 2020 +0800

        fix roberta

    commit 86702fe
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Sun Jun 28 16:27:43 2020 +0800

        use_segmentation

    commit 6d03d7a
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Sun Jun 28 15:52:40 2020 +0800

        fix

    commit 5c0ca43
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Sun Jun 28 15:49:48 2020 +0800

        fix token_ids

    commit ff7aae8
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Sun Jun 28 13:56:07 2020 +0800

        fix xlmr

    commit 2070b86
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Sun Jun 28 13:54:26 2020 +0800

        fix roberta

commit 70a1887
Author: Leonard Lausen <lausen@amazon.com>
Date:   Fri Jul 17 00:07:08 2020 +0000

    Update for Block API (dmlc#1261)

    - Remove params and prefix arguments for MXNet 2 and update
      parameter sharing implementation
    - Remove Block.name_scope() for MXNet 2
    - Remove self.params.get() and self.params.get_constant()
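
    A minimal sketch (not taken from this commit) of what the migration above means for user-defined blocks, assuming the MXNet 2 Gluon interface it describes: no `prefix`/`params` constructor arguments, parameters created as `gluon.Parameter` attributes instead of `self.params.get()`, and `share_parameters()` for sharing. `ToyDense` and its shapes are hypothetical.

    ```python
    import mxnet as mx
    from mxnet import gluon

    mx.npx.set_np()  # numpy-array semantics used with MXNet 2


    class ToyDense(gluon.Block):
        """Hypothetical block written against the MXNet 2 Gluon API."""

        def __init__(self, units, in_units):
            super().__init__()  # MXNet 2: no prefix=... / params=... arguments
            # Parameters are plain attributes instead of self.params.get(...)
            self.weight = gluon.Parameter('weight', shape=(units, in_units))
            self.bias = gluon.Parameter('bias', shape=(units,))

        def forward(self, x):
            return mx.np.dot(x, self.weight.data().T) + self.bias.data()


    layer_a = ToyDense(8, in_units=16)
    layer_b = ToyDense(8, in_units=16)
    # share_parameters() replaces the old params=... constructor argument
    layer_b.share_parameters(layer_a.collect_params())
    layer_a.initialize()
    out = layer_a(mx.np.ones((2, 16)))
    ```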

commit ea9152b
Author: Xingjian Shi <xshiab@connect.ust.hk>
Date:   Thu Jul 16 15:42:04 2020 -0700

    Fixes to make the CI more stable (dmlc#1265)

    * Some fixes to make the CI more stable

    * add retries

    * Update tokenizers.py

commit a646c34
Author: ht <wawawa@akane.waseda.jp>
Date:   Sun Jul 12 02:49:53 2020 +0800

    [FEATURE] update backtranslation and add multinomial sampler (dmlc#1259)

    * back translation bash

    * split "lang-pair" para in clean_tok_para_corpus

    * added clean_tok_mono_corpus

    * fix

    * add num_process para

    * fix

    * fix

    * add yml

    * rm yml

    * update cfg name

    * update evaluate

    * added max_update / save_interval_update params

    * fix

    * fix

    * multi gpu inference

    * fix

    * update

    * update multi gpu inference

    * fix

    * fix

    * split evaluate and parallel infer

    * fix

    * test

    * fix

    * update

    * add comments

    * fix

    * remove todo comment

    * revert remove todo comment

    * raw lines remove duplicated '\n'

    * update multinomial sampler

    * fix

    * fix

    * fix

    * fix

    * sampling

    * update script

    * fix

    * add test_case with k > 1 in topk sampling

    * fix multinomial sampler

    * update docs

    * comments situation eos_id = None

    * fix

    Co-authored-by: Hu <huta@a483e74650ff.ant.amazon.com>
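
    As context for the "multinomial sampler" added above, a generic sketch (plain NumPy, not the GluonNLP sampler API) of the difference between multinomial sampling over the full next-token distribution and top-k sampling restricted to the k best-scoring tokens; the toy logits are made up for illustration.

    ```python
    import numpy as np


    def multinomial_sample(logits, temperature=1.0, rng=None):
        """Draw one token id from the full softmax distribution."""
        rng = rng or np.random.default_rng()
        scaled = logits / temperature
        probs = np.exp(scaled - scaled.max())  # subtract max for numerical stability
        probs /= probs.sum()
        return int(rng.choice(len(logits), p=probs))


    def topk_sample(logits, k=2, rng=None):
        """Sample only among the k highest-scoring tokens, renormalized."""
        rng = rng or np.random.default_rng()
        top = np.argsort(logits)[-k:]  # indices of the k largest logits
        probs = np.exp(logits[top] - logits[top].max())
        probs /= probs.sum()
        return int(top[rng.choice(k, p=probs)])


    logits = np.array([2.0, 1.0, 0.5, -1.0])  # toy next-token scores
    print(multinomial_sample(logits), topk_sample(logits, k=2))
    ```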

commit 83e1f13
Author: Leonard Lausen <lausen@amazon.com>
Date:   Thu Jul 9 20:57:55 2020 -0700

    Use Amazon S3 Transfer Acceleration (dmlc#1260)

commit cd48efd
Author: Leonard Lausen <lausen@amazon.com>
Date:   Tue Jul 7 17:39:42 2020 -0700

    Update codecov action to handle different OS and Python versions (dmlc#1254)

    codecov/codecov-action#80 (comment)

commit 689eba9
Author: Sheng Zha <szha@users.noreply.github.com>
Date:   Tue Jul 7 09:55:34 2020 -0700

    [CI] AWS batch job tool for GluonNLP (Part I) (dmlc#1251)

    * AWS batch job tool for GluonNLP

    * limit range

    Co-authored-by: Xingjian Shi <xshiab@connect.ust.hk>

commit e06ff01
Author: Leonard Lausen <lausen@amazon.com>
Date:   Tue Jul 7 08:36:24 2020 -0700

    Pin mxnet version range on CI (dmlc#1257)
zheyuye committed Jul 17, 2020
1 parent 673344d commit 647b4ef
Showing 56 changed files with 2,435 additions and 1,844 deletions.
6 changes: 3 additions & 3 deletions .github/workflows/unittests.yml
@@ -33,14 +33,14 @@ jobs:
 - name: Install Other Dependencies
 run: |
 python -m pip install --user --upgrade pip
-python -m pip install --user setuptools pytest pytest-cov
+python -m pip install --user setuptools pytest pytest-cov contextvars
 python -m pip install --upgrade cython
-python -m pip install --pre --user mxnet>=2.0.0b20200604 -f https://dist.mxnet.io/python
+python -m pip install --pre --user "mxnet>=2.0.0b20200716" -f https://dist.mxnet.io/python
 python -m pip install --user -e .[extras]
 - name: Test project
 run: |
 python -m pytest --cov=./ --cov-report=xml --durations=50 tests/
 - name: Upload coverage to Codecov
-uses: codecov/codecov-action@v1
+uses: codecov/codecov-action@v1.0.10
 with:
 env_vars: OS,PYTHON
4 changes: 2 additions & 2 deletions README.md
@@ -21,10 +21,10 @@ First of all, install the latest MXNet. You may use the following commands:
 ```bash
 
 # Install the version with CUDA 10.1
-pip install -U --pre mxnet-cu101>=2.0.0b20200604 -f https://dist.mxnet.io/python
+pip install -U --pre mxnet-cu101>=2.0.0b20200716 -f https://dist.mxnet.io/python
 
 # Install the cpu-only version
-pip install -U --pre mxnet>=2.0.0b20200604 -f https://dist.mxnet.io/python
+pip install -U --pre mxnet>=2.0.0b20200716 -f https://dist.mxnet.io/python
 ```


7 changes: 3 additions & 4 deletions scripts/conversion_toolkits/convert_electra.py
@@ -265,11 +265,11 @@ def convert_tf_model(model_dir, save_dir, test_conversion, model_size, gpu, elec
 assert_allclose(tf_params[k], backbone_params[k])
 
 # Build gluon model and initialize
-gluon_model = ElectraModel.from_cfg(cfg, prefix='electra_')
+gluon_model = ElectraModel.from_cfg(cfg)
 gluon_model.initialize(ctx=ctx)
 gluon_model.hybridize()
 
-gluon_disc_model = ElectraDiscriminator(cfg, prefix='electra_')
+gluon_disc_model = ElectraDiscriminator(cfg)
 gluon_disc_model.initialize(ctx=ctx)
 gluon_disc_model.hybridize()
 
@@ -283,8 +283,7 @@ def convert_tf_model(model_dir, save_dir, test_conversion, model_size, gpu, elec
 word_embed_params=word_embed_params,
 token_type_embed_params=token_type_embed_params,
 token_pos_embed_params=token_pos_embed_params,
-embed_layer_norm_params=embed_layer_norm_params,
-prefix='generator_')
+embed_layer_norm_params=embed_layer_norm_params)
 gluon_gen_model.initialize(ctx=ctx)
 gluon_gen_model.hybridize()

2 changes: 1 addition & 1 deletion scripts/conversion_toolkits/convert_mobilebert.py
@@ -270,7 +270,7 @@ def convert_tf_model(model_dir, save_dir, test_conversion, gpu, mobilebert_dir):
 gluon_model.initialize(ctx=ctx)
 gluon_model.hybridize()
 
-gluon_pretrain_model = MobileBertForPretrain(cfg, prefix='')
+gluon_pretrain_model = MobileBertForPretrain(cfg)
 gluon_pretrain_model.initialize(ctx=ctx)
 gluon_pretrain_model.hybridize()

2 changes: 1 addition & 1 deletion scripts/conversion_toolkits/convert_tf_hub_model.py
@@ -358,7 +358,7 @@ def convert_tf_model(hub_model_dir, save_dir, test_conversion, model_type, gpu):
 gluon_model = PretrainedModel.from_cfg(cfg, prefix='', use_pooler=True)
 gluon_model.initialize(ctx=ctx)
 gluon_model.hybridize()
-gluon_mlm_model = PretrainedMLMModel(backbone_cfg=cfg, prefix='')
+gluon_mlm_model = PretrainedMLMModel(backbone_cfg=cfg)
 gluon_mlm_model.initialize(ctx=ctx)
 gluon_mlm_model.hybridize()

4 changes: 2 additions & 2 deletions scripts/datasets/language_modeling/prepare_lm.py
@@ -50,9 +50,9 @@
 # The original address of Google One Billion Word dataset is
 # http://www.statmt.org/lm-benchmark/1-billion-word-language-modeling-benchmark-r13output.tar.gz
 # We uploaded the file to S3 to accelerate the speed
-'gbw': 'https://gluonnlp-numpy-data.s3-us-west-2.amazonaws.com/datasets/language_modeling/1-billion-word-language-modeling-benchmark-r13output.tar.gz',
+'gbw': 'https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/datasets/language_modeling/1-billion-word-language-modeling-benchmark-r13output.tar.gz',
 # The data is obtained from https://raw.githubusercontent.com/rafaljozefowicz/lm/master/1b_word_vocab.txt
-'gbw_vocab': 'https://gluonnlp-numpy-data.s3-us-west-2.amazonaws.com/datasets/language_modeling/1b_word_vocab.txt'
+'gbw_vocab': 'https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/datasets/language_modeling/1b_word_vocab.txt'
 }


2 changes: 1 addition & 1 deletion scripts/datasets/machine_translation/prepare_wmt.py
@@ -235,7 +235,7 @@
 # For the CWMT dataset, you can also download them from the official location: http://nlp.nju.edu.cn/cwmt-wmt/
 # Currently, this version is processed via https://gist.github.com/sxjscience/54bedd68ce3fb69b3b1b264377efb5a5
 'cwmt': {
-'url': 'https://gluonnlp-numpy-data.s3-us-west-2.amazonaws.com/wmt/cwmt.tar.gz',
+'url': 'https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/wmt/cwmt.tar.gz',
 'zh-en': {
 'en': 'cwmt/cwmt-zh-en.en',
 'zh': 'cwmt/cwmt-zh-en.zh'
9 changes: 6 additions & 3 deletions scripts/datasets/machine_translation/wmt2014_ende.sh
@@ -18,15 +18,17 @@ sacrebleu -t wmt14 -l ${SRC}-${TGT} --echo ref > ${SAVE_PATH}/test.raw.${TGT}
 
 # Clean and tokenize the training + dev corpus
 cd ${SAVE_PATH}
-nlp_preprocess clean_tok_para_corpus --lang-pair ${SRC}-${TGT} \
+nlp_preprocess clean_tok_para_corpus --src-lang ${SRC} \
+--tgt-lang ${TGT} \
 --src-corpus train.raw.${SRC} \
 --tgt-corpus train.raw.${TGT} \
 --min-num-words 1 \
 --max-num-words 100 \
 --src-save-path train.tok.${SRC} \
 --tgt-save-path train.tok.${TGT}
 
-nlp_preprocess clean_tok_para_corpus --lang-pair ${SRC}-${TGT} \
+nlp_preprocess clean_tok_para_corpus --src-lang ${SRC} \
+--tgt-lang ${TGT} \
 --src-corpus dev.raw.${SRC} \
 --tgt-corpus dev.raw.${TGT} \
 --min-num-words 1 \
@@ -35,7 +37,8 @@ nlp_preprocess clean_tok_para_corpus --lang-pair ${SRC}-${TGT} \
 --tgt-save-path dev.tok.${TGT}
 
 # For test corpus, we will just tokenize the data
-nlp_preprocess clean_tok_para_corpus --lang-pair ${SRC}-${TGT} \
+nlp_preprocess clean_tok_para_corpus --src-lang ${SRC} \
+--tgt-lang ${TGT} \
 --src-corpus test.raw.${SRC} \
 --tgt-corpus test.raw.${TGT} \
 --src-save-path test.tok.${SRC} \
9 changes: 6 additions & 3 deletions scripts/datasets/machine_translation/wmt2017_zhen.sh
@@ -18,7 +18,8 @@ sacrebleu -t wmt17 -l ${SRC}-${TGT} --echo ref > ${SAVE_PATH}/test.raw.${TGT}
 
 # Clean and tokenize the training + dev corpus
 cd ${SAVE_PATH}
-nlp_preprocess clean_tok_para_corpus --lang-pair ${SRC}-${TGT} \
+nlp_preprocess clean_tok_para_corpus --src-lang ${SRC} \
+--tgt-lang ${TGT} \
 --src-corpus train.raw.${SRC} \
 --tgt-corpus train.raw.${TGT} \
 --src-tokenizer jieba \
@@ -29,7 +30,8 @@ nlp_preprocess clean_tok_para_corpus --lang-pair ${SRC}-${TGT} \
 --src-save-path train.tok.${SRC} \
 --tgt-save-path train.tok.${TGT}
 
-nlp_preprocess clean_tok_para_corpus --lang-pair ${SRC}-${TGT} \
+nlp_preprocess clean_tok_para_corpus --src-lang ${SRC} \
+--tgt-lang ${TGT} \
 --src-corpus dev.raw.${SRC} \
 --tgt-corpus dev.raw.${TGT} \
 --src-tokenizer jieba \
@@ -41,7 +43,8 @@ nlp_preprocess clean_tok_para_corpus --lang-pair ${SRC}-${TGT} \
 --tgt-save-path dev.tok.${TGT}
 
 # For test corpus, we will just tokenize the data
-nlp_preprocess clean_tok_para_corpus --lang-pair ${SRC}-${TGT} \
+nlp_preprocess clean_tok_para_corpus --src-lang ${SRC} \
+--tgt-lang ${TGT} \
 --src-corpus test.raw.${SRC} \
 --tgt-corpus test.raw.${TGT} \
 --src-tokenizer jieba \
2 changes: 1 addition & 1 deletion scripts/datasets/pretrain_corpus/prepare_bookcorpus.py
@@ -34,7 +34,7 @@
 
 _URLS = {
 'gutenberg':
-'https://gluonnlp-numpy-data.s3-us-west-2.amazonaws.com/pretrain_corpus/Gutenberg.zip',
+'https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/pretrain_corpus/Gutenberg.zip',
 }


2 changes: 1 addition & 1 deletion scripts/datasets/url_checksums/book_corpus.txt
@@ -1 +1 @@
-https://gluonnlp-numpy-data.s3-us-west-2.amazonaws.com/pretrain_corpus/Gutenberg.zip 91e842dc3671ed5a917b7ff6a60f5f87397780e2 461506225
+https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/pretrain_corpus/Gutenberg.zip 91e842dc3671ed5a917b7ff6a60f5f87397780e2 461506225
4 changes: 2 additions & 2 deletions scripts/datasets/url_checksums/language_model.txt
@@ -2,5 +2,5 @@ https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip 3c914d1
 https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip 0aec09a7537b58d4bb65362fee27650eeaba625a 190229076
 http://mattmahoney.net/dc/enwik8.zip d856b1ccd937c51aeb9c342e47666fb8c38e7e72 36445475
 http://mattmahoney.net/dc/text8.zip 6c70299b93b7e1f927b42cd8f6ac1a31547c7a2e 31344016
-https://gluonnlp-numpy-data.s3-us-west-2.amazonaws.com/datasets/language_modeling/1-billion-word-language-modeling-benchmark-r13output.tar.gz 4df859766482e12264a5a9d9fb7f0e276020447d 1792209805
-https://gluonnlp-numpy-data.s3-us-west-2.amazonaws.com/datasets/language_modeling/1b_word_vocab.txt aa2322a3da82ef628011336c9b5c6059e4f56c3f 9507106
+https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/datasets/language_modeling/1-billion-word-language-modeling-benchmark-r13output.tar.gz 4df859766482e12264a5a9d9fb7f0e276020447d 1792209805
+https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/datasets/language_modeling/1b_word_vocab.txt aa2322a3da82ef628011336c9b5c6059e4f56c3f 9507106
12 changes: 6 additions & 6 deletions scripts/datasets/url_checksums/wmt.txt
@@ -34,16 +34,16 @@ https://stuncorpusprod.blob.core.windows.net/corpusfiles/UNv1.0.en-ru.tar.gz.01
 https://stuncorpusprod.blob.core.windows.net/corpusfiles/UNv1.0.en-ru.tar.gz.02 bf6b18a33c8cafa6889fd463fa8a2850d8877d35 306221588
 https://stuncorpusprod.blob.core.windows.net/corpusfiles/UNv1.0.en-zh.tar.gz.00 1bec5f10297512183e483fdd4984d207700657d1 1073741824
 https://stuncorpusprod.blob.core.windows.net/corpusfiles/UNv1.0.en-zh.tar.gz.01 15df2968bc69ef7662cf3029282bbb62cbf107b1 312943879
-https://gluonnlp-numpy-data.s3-us-west-2.amazonaws.com/wmt/cwmt/parallel/casia2015.zip b432394685e4c53797e1ac86851f8a013aef27a2 98159063
-https://gluonnlp-numpy-data.s3-us-west-2.amazonaws.com/wmt/cwmt/parallel/casict2011.zip 769a9a86c24e9507dbf520b950b9026120cb041e 166957775
-https://gluonnlp-numpy-data.s3-us-west-2.amazonaws.com/wmt/cwmt/parallel/datum2015.zip 6d94cc8d296dd4268ed0a10fa3a419267280363e 100118018
-https://gluonnlp-numpy-data.s3-us-west-2.amazonaws.com/wmt/cwmt/parallel/datum2017.zip 480fa06760b2dbe7c9a9bd7c3fd5e5b22b860a45 37389573
-https://gluonnlp-numpy-data.s3-us-west-2.amazonaws.com/wmt/cwmt/parallel/neu2017.zip 532b56ba62f6cffccdc85f4316468873ca739bd1 148681171
+https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/wmt/cwmt/parallel/casia2015.zip b432394685e4c53797e1ac86851f8a013aef27a2 98159063
+https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/wmt/cwmt/parallel/casict2011.zip 769a9a86c24e9507dbf520b950b9026120cb041e 166957775
+https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/wmt/cwmt/parallel/datum2015.zip 6d94cc8d296dd4268ed0a10fa3a419267280363e 100118018
+https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/wmt/cwmt/parallel/datum2017.zip 480fa06760b2dbe7c9a9bd7c3fd5e5b22b860a45 37389573
+https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/wmt/cwmt/parallel/neu2017.zip 532b56ba62f6cffccdc85f4316468873ca739bd1 148681171
 http://data.statmt.org/wmt17/translation-task/rapid2016.tgz 8b173ce0bc77f2a1a57c8134143e3b5ae228a6e2 163416042
 https://s3-eu-west-1.amazonaws.com/tilde-model/rapid2019.de-en.zip aafe431338abb98fc20951b2d6011223a1b91311 111888392
 http://data.statmt.org/wmt19/translation-task/dev.tgz 451ce2cae815c8392212ccb3f54f5dcddb9b2b9e 38654961
 http://data.statmt.org/wmt19/translation-task/test.tgz ce02a36fb2cd41abfa19d36eb8c8d50241ed3346 3533424
-https://gluonnlp-numpy-data.s3-us-west-2.amazonaws.com/wmt/cwmt.tar.gz 88c2f4295169e9f0a9834bf8bff87e3fd4c04055 709032378
+https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/wmt/cwmt.tar.gz 88c2f4295169e9f0a9834bf8bff87e3fd4c04055 709032378
 http://data.statmt.org/news-crawl/de/news.2007.de.shuffled.deduped.gz 9d746b9df345f764e6e615119113c70e3fb0858c 90104365
 http://data.statmt.org/news-crawl/de/news.2008.de.shuffled.deduped.gz 185a24e8833844486aee16cb5decf9a64da1c101 308205291
 http://data.statmt.org/news-crawl/de/news.2009.de.shuffled.deduped.gz 9f7645fc6467de88f4205d94f483194838bad8ce 317590378