Squashed commit of the following:
commit 35a586676036f627bffd0d3c753c6cd0a70d63cf
Author: ZheyuYe <zheyu.ye1995@gmail.com>
Date:   Fri Jul 17 10:10:14 2020 +0800

    Squashed commit of the following:

    commit 673344d
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Wed Jul 15 22:43:07 2020 +0800

        CharTokenizer

    commit 8dabfd6
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Wed Jul 15 15:47:24 2020 +0800

        lowercase

    commit f5c94a6
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Tue Jul 14 17:45:28 2020 +0800

        test

    commit dc55fc9
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Tue Jul 14 05:45:01 2020 +0800

        tiny update on run_squad

    commit 4defc7a
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Mon Jul 13 23:18:08 2020 +0800

        update testings

    commit 2719e81
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Mon Jul 13 23:08:32 2020 +0800

        re-upload xlmr

    commit cd0509d
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Mon Jul 13 22:30:47 2020 +0800

        fix get_pretrained

    commit 8ed8a72
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Mon Jul 13 22:28:13 2020 +0800

        re-upload roberta

    commit 5811d40
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Mon Jul 13 18:27:23 2020 +0800

        update

    commit 44a09a3
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Sat Jul 11 15:06:33 2020 +0800

        fix

    commit 4074a26
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Fri Jul 10 16:08:49 2020 +0800

        inference without horovod

    commit 31cb953
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Thu Jul 9 18:41:55 2020 +0800

        update

    commit 838be2a
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Thu Jul 9 15:14:39 2020 +0800

        horovod for squad

    commit 1d374a2
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Thu Jul 9 12:09:19 2020 +0800

        fix

    commit e4fba39
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Thu Jul 9 10:35:08 2020 +0800

        remove multiply_grads

    commit 007f07e
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Tue Jul 7 11:26:38 2020 +0800

        multiply_grads

    commit b8c85bb
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Mon Jul 6 12:28:56 2020 +0800

        fix ModelForQABasic

    commit 0e13a58
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Sat Jul 4 18:42:12 2020 +0800

        clip_grad_global_norm with zeros max_grad_norm

    commit bd270f2
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Fri Jul 3 20:21:31 2020 +0800

        fix roberta

    commit 4fc564c
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Fri Jul 3 19:36:08 2020 +0800

        update hyper-parameters of adamw

    commit 59cffbf
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Fri Jul 3 16:25:46 2020 +0800

        try

    commit a84f782
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Thu Jul 2 20:39:03 2020 +0800

        fix mobilebert

    commit 4bc3a96
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Thu Jul 2 11:14:39 2020 +0800

        layer-wise decay

    commit 07186d5
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Thu Jul 2 02:14:43 2020 +0800

        revise

    commit a5a6475
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Wed Jul 1 19:50:20 2020 +0800

        topk

    commit 34ee884
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Wed Jul 1 19:25:09 2020 +0800

        index_update

    commit 74178e2
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Wed Jul 1 00:48:32 2020 +0800

        rename

    commit fa011aa
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Tue Jun 30 23:40:28 2020 +0800

        update

    commit 402d625
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Tue Jun 30 21:40:30 2020 +0800

        multiprocessing for wiki

    commit ddbde75
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Tue Jun 30 20:41:35 2020 +0800

        fix bookcorpus

    commit 6cc5ccd
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Tue Jun 30 16:39:12 2020 +0800

        fix wiki

    commit 9773efd
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Tue Jun 30 15:52:13 2020 +0800

        fix openwebtext

    commit 1fb8eb8
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Mon Jun 29 19:51:25 2020 +0800

        upload gluon_electra_small_owt

    commit ca83fac
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Mon Jun 29 18:09:48 2020 +0800

        revise train_transformer

    commit 1450f5c
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Mon Jun 29 18:07:04 2020 +0800

        revise

    commit b460bbe
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Mon Jun 29 17:24:00 2020 +0800

        repeat for pretraining

    commit 8ee381b
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Mon Jun 29 17:06:43 2020 +0800

        repeat

    commit aea936f
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Mon Jun 29 16:39:22 2020 +0800

        fix mobilebert

    commit eead164
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Sun Jun 28 18:44:28 2020 +0800

        fix

    commit 8645115
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Sun Jun 28 17:27:43 2020 +0800

        update

    commit 2b7f7a3
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Sun Jun 28 17:18:00 2020 +0800

        fix roberta

    commit 86702fe
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Sun Jun 28 16:27:43 2020 +0800

        use_segmentation

    commit 6d03d7a
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Sun Jun 28 15:52:40 2020 +0800

        fix

    commit 5c0ca43
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Sun Jun 28 15:49:48 2020 +0800

        fix token_ids

    commit ff7aae8
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Sun Jun 28 13:56:07 2020 +0800

        fix xlmr

    commit 2070b86
    Author: ZheyuYe <zheyu.ye1995@gmail.com>
    Date:   Sun Jun 28 13:54:26 2020 +0800

        fix roberta

commit 70a1887
Author: Leonard Lausen <lausen@amazon.com>
Date:   Fri Jul 17 00:07:08 2020 +0000

    Update for Block API (dmlc#1261)

    - Remove params and prefix arguments for MXNet 2 and update
      parameter sharing implementation
    - Remove Block.name_scope() for MXNet 2
    - Remove self.params.get() and self.params.get_constant()
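
    A minimal sketch (not taken from this commit) of what the migration above means for user-defined blocks, assuming the MXNet 2 Gluon interface it describes: no `prefix`/`params` constructor arguments, parameters created as `gluon.Parameter` attributes instead of `self.params.get()`, and `share_parameters()` for sharing. `ToyDense` and its shapes are hypothetical.

    ```python
    import mxnet as mx
    from mxnet import gluon

    mx.npx.set_np()  # numpy-array semantics used with MXNet 2


    class ToyDense(gluon.Block):
        """Hypothetical block written against the MXNet 2 Gluon API."""

        def __init__(self, units, in_units):
            super().__init__()  # MXNet 2: no prefix=... / params=... arguments
            # Parameters are plain attributes instead of self.params.get(...)
            self.weight = gluon.Parameter('weight', shape=(units, in_units))
            self.bias = gluon.Parameter('bias', shape=(units,))

        def forward(self, x):
            return mx.np.dot(x, self.weight.data().T) + self.bias.data()


    layer_a = ToyDense(8, in_units=16)
    layer_b = ToyDense(8, in_units=16)
    # share_parameters() replaces the old params=... constructor argument
    layer_b.share_parameters(layer_a.collect_params())
    layer_a.initialize()
    out = layer_a(mx.np.ones((2, 16)))
    ```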

commit ea9152b
Author: Xingjian Shi <xshiab@connect.ust.hk>
Date:   Thu Jul 16 15:42:04 2020 -0700

    Fixes to make the CI more stable (dmlc#1265)

    * Some fixes to make the CI more stable

    * add retries

    * Update tokenizers.py

commit a646c34
Author: ht <wawawa@akane.waseda.jp>
Date:   Sun Jul 12 02:49:53 2020 +0800

    [FEATURE] update backtranslation and add multinomial sampler (dmlc#1259)

    * back translation bash

    * split "lang-pair" para in clean_tok_para_corpus

    * added clean_tok_mono_corpus

    * fix

    * add num_process para

    * fix

    * fix

    * add yml

    * rm yml

    * update cfg name

    * update evaluate

    * added max_update / save_interval_update params

    * fix

    * fix

    * multi gpu inference

    * fix

    * update

    * update multi gpu inference

    * fix

    * fix

    * split evaluate and parallel infer

    * fix

    * test

    * fix

    * update

    * add comments

    * fix

    * remove todo comment

    * revert remove todo comment

    * raw lines remove duplicated '\n'

    * update multinomial sampler

    * fix

    * fix

    * fix

    * fix

    * sampling

    * update script

    * fix

    * add test_case with k > 1 in topk sampling

    * fix multinomial sampler

    * update docs

    * comments situation eos_id = None

    * fix

    Co-authored-by: Hu <huta@a483e74650ff.ant.amazon.com>
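
    As context for the "multinomial sampler" added above, a generic sketch (plain NumPy, not the GluonNLP sampler API) of the difference between multinomial sampling over the full next-token distribution and top-k sampling restricted to the k best-scoring tokens; the toy logits are made up for illustration.

    ```python
    import numpy as np


    def multinomial_sample(logits, temperature=1.0, rng=None):
        """Draw one token id from the full softmax distribution."""
        rng = rng or np.random.default_rng()
        scaled = logits / temperature
        probs = np.exp(scaled - scaled.max())  # subtract max for numerical stability
        probs /= probs.sum()
        return int(rng.choice(len(logits), p=probs))


    def topk_sample(logits, k=2, rng=None):
        """Sample only among the k highest-scoring tokens, renormalized."""
        rng = rng or np.random.default_rng()
        top = np.argsort(logits)[-k:]  # indices of the k largest logits
        probs = np.exp(logits[top] - logits[top].max())
        probs /= probs.sum()
        return int(top[rng.choice(k, p=probs)])


    logits = np.array([2.0, 1.0, 0.5, -1.0])  # toy next-token scores
    print(multinomial_sample(logits), topk_sample(logits, k=2))
    ```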

commit 83e1f13
Author: Leonard Lausen <lausen@amazon.com>
Date:   Thu Jul 9 20:57:55 2020 -0700

    Use Amazon S3 Transfer Acceleration (dmlc#1260)

commit cd48efd
Author: Leonard Lausen <lausen@amazon.com>
Date:   Tue Jul 7 17:39:42 2020 -0700

    Update codecov action to handle different OS and Python versions (dmlc#1254)

    codecov/codecov-action#80 (comment)

commit 689eba9
Author: Sheng Zha <szha@users.noreply.github.com>
Date:   Tue Jul 7 09:55:34 2020 -0700

    [CI] AWS batch job tool for GluonNLP (Part I) (dmlc#1251)

    * AWS batch job tool for GluonNLP

    * limit range

    Co-authored-by: Xingjian Shi <xshiab@connect.ust.hk>

commit e06ff01
Author: Leonard Lausen <lausen@amazon.com>
Date:   Tue Jul 7 08:36:24 2020 -0700

    Pin mxnet version range on CI (dmlc#1257)
zheyuye committed Jul 17, 2020
1 parent 673344d commit 647b4ef
Showing 56 changed files with 2,435 additions and 1,844 deletions.
6 changes: 3 additions & 3 deletions .github/workflows/unittests.yml
@@ -33,14 +33,14 @@ jobs:
 - name: Install Other Dependencies
 run: |
 python -m pip install --user --upgrade pip
-python -m pip install --user setuptools pytest pytest-cov
+python -m pip install --user setuptools pytest pytest-cov contextvars
 python -m pip install --upgrade cython
-python -m pip install --pre --user mxnet>=2.0.0b20200604 -f https://dist.mxnet.io/python
+python -m pip install --pre --user "mxnet>=2.0.0b20200716" -f https://dist.mxnet.io/python
 python -m pip install --user -e .[extras]
 - name: Test project
 run: |
 python -m pytest --cov=./ --cov-report=xml --durations=50 tests/
 - name: Upload coverage to Codecov
-uses: codecov/codecov-action@v1
+uses: codecov/codecov-action@v1.0.10
 with:
 env_vars: OS,PYTHON
4 changes: 2 additions & 2 deletions README.md
@@ -21,10 +21,10 @@ First of all, install the latest MXNet. You may use the following commands:
 ```bash
 
 # Install the version with CUDA 10.1
-pip install -U --pre mxnet-cu101>=2.0.0b20200604 -f https://dist.mxnet.io/python
+pip install -U --pre mxnet-cu101>=2.0.0b20200716 -f https://dist.mxnet.io/python
 
 # Install the cpu-only version
-pip install -U --pre mxnet>=2.0.0b20200604 -f https://dist.mxnet.io/python
+pip install -U --pre mxnet>=2.0.0b20200716 -f https://dist.mxnet.io/python
 ```


7 changes: 3 additions & 4 deletions scripts/conversion_toolkits/convert_electra.py
@@ -265,11 +265,11 @@ def convert_tf_model(model_dir, save_dir, test_conversion, model_size, gpu, elec
 assert_allclose(tf_params[k], backbone_params[k])
 
 # Build gluon model and initialize
-gluon_model = ElectraModel.from_cfg(cfg, prefix='electra_')
+gluon_model = ElectraModel.from_cfg(cfg)
 gluon_model.initialize(ctx=ctx)
 gluon_model.hybridize()
 
-gluon_disc_model = ElectraDiscriminator(cfg, prefix='electra_')
+gluon_disc_model = ElectraDiscriminator(cfg)
 gluon_disc_model.initialize(ctx=ctx)
 gluon_disc_model.hybridize()
 
@@ -283,8 +283,7 @@ def convert_tf_model(model_dir, save_dir, test_conversion, model_size, gpu, elec
 word_embed_params=word_embed_params,
 token_type_embed_params=token_type_embed_params,
 token_pos_embed_params=token_pos_embed_params,
-embed_layer_norm_params=embed_layer_norm_params,
-prefix='generator_')
+embed_layer_norm_params=embed_layer_norm_params)
 gluon_gen_model.initialize(ctx=ctx)
 gluon_gen_model.hybridize()

2 changes: 1 addition & 1 deletion scripts/conversion_toolkits/convert_mobilebert.py
@@ -270,7 +270,7 @@ def convert_tf_model(model_dir, save_dir, test_conversion, gpu, mobilebert_dir):
 gluon_model.initialize(ctx=ctx)
 gluon_model.hybridize()
 
-gluon_pretrain_model = MobileBertForPretrain(cfg, prefix='')
+gluon_pretrain_model = MobileBertForPretrain(cfg)
 gluon_pretrain_model.initialize(ctx=ctx)
 gluon_pretrain_model.hybridize()

2 changes: 1 addition & 1 deletion scripts/conversion_toolkits/convert_tf_hub_model.py
@@ -358,7 +358,7 @@ def convert_tf_model(hub_model_dir, save_dir, test_conversion, model_type, gpu):
 gluon_model = PretrainedModel.from_cfg(cfg, prefix='', use_pooler=True)
 gluon_model.initialize(ctx=ctx)
 gluon_model.hybridize()
-gluon_mlm_model = PretrainedMLMModel(backbone_cfg=cfg, prefix='')
+gluon_mlm_model = PretrainedMLMModel(backbone_cfg=cfg)
 gluon_mlm_model.initialize(ctx=ctx)
 gluon_mlm_model.hybridize()

4 changes: 2 additions & 2 deletions scripts/datasets/language_modeling/prepare_lm.py
@@ -50,9 +50,9 @@
 # The original address of Google One Billion Word dataset is
 # http://www.statmt.org/lm-benchmark/1-billion-word-language-modeling-benchmark-r13output.tar.gz
 # We uploaded the file to S3 to accelerate the speed
-'gbw': 'https://gluonnlp-numpy-data.s3-us-west-2.amazonaws.com/datasets/language_modeling/1-billion-word-language-modeling-benchmark-r13output.tar.gz',
+'gbw': 'https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/datasets/language_modeling/1-billion-word-language-modeling-benchmark-r13output.tar.gz',
 # The data is obtained from https://raw.githubusercontent.com/rafaljozefowicz/lm/master/1b_word_vocab.txt
-'gbw_vocab': 'https://gluonnlp-numpy-data.s3-us-west-2.amazonaws.com/datasets/language_modeling/1b_word_vocab.txt'
+'gbw_vocab': 'https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/datasets/language_modeling/1b_word_vocab.txt'
 }


2 changes: 1 addition & 1 deletion scripts/datasets/machine_translation/prepare_wmt.py
@@ -235,7 +235,7 @@
 # For the CWMT dataset, you can also download them from the official location: http://nlp.nju.edu.cn/cwmt-wmt/
 # Currently, this version is processed via https://gist.github.com/sxjscience/54bedd68ce3fb69b3b1b264377efb5a5
 'cwmt': {
-'url': 'https://gluonnlp-numpy-data.s3-us-west-2.amazonaws.com/wmt/cwmt.tar.gz',
+'url': 'https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/wmt/cwmt.tar.gz',
 'zh-en': {
 'en': 'cwmt/cwmt-zh-en.en',
 'zh': 'cwmt/cwmt-zh-en.zh'
9 changes: 6 additions & 3 deletions scripts/datasets/machine_translation/wmt2014_ende.sh
@@ -18,15 +18,17 @@ sacrebleu -t wmt14 -l ${SRC}-${TGT} --echo ref > ${SAVE_PATH}/test.raw.${TGT}
 
 # Clean and tokenize the training + dev corpus
 cd ${SAVE_PATH}
-nlp_preprocess clean_tok_para_corpus --lang-pair ${SRC}-${TGT} \
+nlp_preprocess clean_tok_para_corpus --src-lang ${SRC} \
+--tgt-lang ${TGT} \
 --src-corpus train.raw.${SRC} \
 --tgt-corpus train.raw.${TGT} \
 --min-num-words 1 \
 --max-num-words 100 \
 --src-save-path train.tok.${SRC} \
 --tgt-save-path train.tok.${TGT}
 
-nlp_preprocess clean_tok_para_corpus --lang-pair ${SRC}-${TGT} \
+nlp_preprocess clean_tok_para_corpus --src-lang ${SRC} \
+--tgt-lang ${TGT} \
 --src-corpus dev.raw.${SRC} \
 --tgt-corpus dev.raw.${TGT} \
 --min-num-words 1 \
@@ -35,7 +37,8 @@ nlp_preprocess clean_tok_para_corpus --lang-pair ${SRC}-${TGT} \
 --tgt-save-path dev.tok.${TGT}
 
 # For test corpus, we will just tokenize the data
-nlp_preprocess clean_tok_para_corpus --lang-pair ${SRC}-${TGT} \
+nlp_preprocess clean_tok_para_corpus --src-lang ${SRC} \
+--tgt-lang ${TGT} \
 --src-corpus test.raw.${SRC} \
 --tgt-corpus test.raw.${TGT} \
 --src-save-path test.tok.${SRC} \
9 changes: 6 additions & 3 deletions scripts/datasets/machine_translation/wmt2017_zhen.sh
@@ -18,7 +18,8 @@ sacrebleu -t wmt17 -l ${SRC}-${TGT} --echo ref > ${SAVE_PATH}/test.raw.${TGT}
 
 # Clean and tokenize the training + dev corpus
 cd ${SAVE_PATH}
-nlp_preprocess clean_tok_para_corpus --lang-pair ${SRC}-${TGT} \
+nlp_preprocess clean_tok_para_corpus --src-lang ${SRC} \
+--tgt-lang ${TGT} \
 --src-corpus train.raw.${SRC} \
 --tgt-corpus train.raw.${TGT} \
 --src-tokenizer jieba \
@@ -29,7 +30,8 @@ nlp_preprocess clean_tok_para_corpus --lang-pair ${SRC}-${TGT} \
 --src-save-path train.tok.${SRC} \
 --tgt-save-path train.tok.${TGT}
 
-nlp_preprocess clean_tok_para_corpus --lang-pair ${SRC}-${TGT} \
+nlp_preprocess clean_tok_para_corpus --src-lang ${SRC} \
+--tgt-lang ${TGT} \
 --src-corpus dev.raw.${SRC} \
 --tgt-corpus dev.raw.${TGT} \
 --src-tokenizer jieba \
@@ -41,7 +43,8 @@ nlp_preprocess clean_tok_para_corpus --lang-pair ${SRC}-${TGT} \
 --tgt-save-path dev.tok.${TGT}
 
 # For test corpus, we will just tokenize the data
-nlp_preprocess clean_tok_para_corpus --lang-pair ${SRC}-${TGT} \
+nlp_preprocess clean_tok_para_corpus --src-lang ${SRC} \
+--tgt-lang ${TGT} \
 --src-corpus test.raw.${SRC} \
 --tgt-corpus test.raw.${TGT} \
 --src-tokenizer jieba \
2 changes: 1 addition & 1 deletion scripts/datasets/pretrain_corpus/prepare_bookcorpus.py
@@ -34,7 +34,7 @@
 
 _URLS = {
 'gutenberg':
-'https://gluonnlp-numpy-data.s3-us-west-2.amazonaws.com/pretrain_corpus/Gutenberg.zip',
+'https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/pretrain_corpus/Gutenberg.zip',
 }


2 changes: 1 addition & 1 deletion scripts/datasets/url_checksums/book_corpus.txt
@@ -1 +1 @@
-https://gluonnlp-numpy-data.s3-us-west-2.amazonaws.com/pretrain_corpus/Gutenberg.zip 91e842dc3671ed5a917b7ff6a60f5f87397780e2 461506225
+https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/pretrain_corpus/Gutenberg.zip 91e842dc3671ed5a917b7ff6a60f5f87397780e2 461506225
4 changes: 2 additions & 2 deletions scripts/datasets/url_checksums/language_model.txt
@@ -2,5 +2,5 @@ https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip 3c914d1
 https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip 0aec09a7537b58d4bb65362fee27650eeaba625a 190229076
 http://mattmahoney.net/dc/enwik8.zip d856b1ccd937c51aeb9c342e47666fb8c38e7e72 36445475
 http://mattmahoney.net/dc/text8.zip 6c70299b93b7e1f927b42cd8f6ac1a31547c7a2e 31344016
-https://gluonnlp-numpy-data.s3-us-west-2.amazonaws.com/datasets/language_modeling/1-billion-word-language-modeling-benchmark-r13output.tar.gz 4df859766482e12264a5a9d9fb7f0e276020447d 1792209805
-https://gluonnlp-numpy-data.s3-us-west-2.amazonaws.com/datasets/language_modeling/1b_word_vocab.txt aa2322a3da82ef628011336c9b5c6059e4f56c3f 9507106
+https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/datasets/language_modeling/1-billion-word-language-modeling-benchmark-r13output.tar.gz 4df859766482e12264a5a9d9fb7f0e276020447d 1792209805
+https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/datasets/language_modeling/1b_word_vocab.txt aa2322a3da82ef628011336c9b5c6059e4f56c3f 9507106
12 changes: 6 additions & 6 deletions scripts/datasets/url_checksums/wmt.txt
@@ -34,16 +34,16 @@ https://stuncorpusprod.blob.core.windows.net/corpusfiles/UNv1.0.en-ru.tar.gz.01
 https://stuncorpusprod.blob.core.windows.net/corpusfiles/UNv1.0.en-ru.tar.gz.02 bf6b18a33c8cafa6889fd463fa8a2850d8877d35 306221588
 https://stuncorpusprod.blob.core.windows.net/corpusfiles/UNv1.0.en-zh.tar.gz.00 1bec5f10297512183e483fdd4984d207700657d1 1073741824
 https://stuncorpusprod.blob.core.windows.net/corpusfiles/UNv1.0.en-zh.tar.gz.01 15df2968bc69ef7662cf3029282bbb62cbf107b1 312943879
-https://gluonnlp-numpy-data.s3-us-west-2.amazonaws.com/wmt/cwmt/parallel/casia2015.zip b432394685e4c53797e1ac86851f8a013aef27a2 98159063
-https://gluonnlp-numpy-data.s3-us-west-2.amazonaws.com/wmt/cwmt/parallel/casict2011.zip 769a9a86c24e9507dbf520b950b9026120cb041e 166957775
-https://gluonnlp-numpy-data.s3-us-west-2.amazonaws.com/wmt/cwmt/parallel/datum2015.zip 6d94cc8d296dd4268ed0a10fa3a419267280363e 100118018
-https://gluonnlp-numpy-data.s3-us-west-2.amazonaws.com/wmt/cwmt/parallel/datum2017.zip 480fa06760b2dbe7c9a9bd7c3fd5e5b22b860a45 37389573
-https://gluonnlp-numpy-data.s3-us-west-2.amazonaws.com/wmt/cwmt/parallel/neu2017.zip 532b56ba62f6cffccdc85f4316468873ca739bd1 148681171
+https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/wmt/cwmt/parallel/casia2015.zip b432394685e4c53797e1ac86851f8a013aef27a2 98159063
+https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/wmt/cwmt/parallel/casict2011.zip 769a9a86c24e9507dbf520b950b9026120cb041e 166957775
+https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/wmt/cwmt/parallel/datum2015.zip 6d94cc8d296dd4268ed0a10fa3a419267280363e 100118018
+https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/wmt/cwmt/parallel/datum2017.zip 480fa06760b2dbe7c9a9bd7c3fd5e5b22b860a45 37389573
+https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/wmt/cwmt/parallel/neu2017.zip 532b56ba62f6cffccdc85f4316468873ca739bd1 148681171
 http://data.statmt.org/wmt17/translation-task/rapid2016.tgz 8b173ce0bc77f2a1a57c8134143e3b5ae228a6e2 163416042
 https://s3-eu-west-1.amazonaws.com/tilde-model/rapid2019.de-en.zip aafe431338abb98fc20951b2d6011223a1b91311 111888392
 http://data.statmt.org/wmt19/translation-task/dev.tgz 451ce2cae815c8392212ccb3f54f5dcddb9b2b9e 38654961
 http://data.statmt.org/wmt19/translation-task/test.tgz ce02a36fb2cd41abfa19d36eb8c8d50241ed3346 3533424
-https://gluonnlp-numpy-data.s3-us-west-2.amazonaws.com/wmt/cwmt.tar.gz 88c2f4295169e9f0a9834bf8bff87e3fd4c04055 709032378
+https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/wmt/cwmt.tar.gz 88c2f4295169e9f0a9834bf8bff87e3fd4c04055 709032378
 http://data.statmt.org/news-crawl/de/news.2007.de.shuffled.deduped.gz 9d746b9df345f764e6e615119113c70e3fb0858c 90104365
 http://data.statmt.org/news-crawl/de/news.2008.de.shuffled.deduped.gz 185a24e8833844486aee16cb5decf9a64da1c101 308205291
 http://data.statmt.org/news-crawl/de/news.2009.de.shuffled.deduped.gz 9f7645fc6467de88f4205d94f483194838bad8ce 317590378