Remove Apex, upgrade torch #2

Status: Open. Wants to merge 4 commits into base: master.
7 changes: 6 additions & 1 deletion CHANGES.md
@@ -1,5 +1,10 @@
# v0.2 : 20210505

# Changes:
1. update torch to 1.8
1. remove apex and instead use torch's native AMP (see the sketch below)
1. `unmass-prep` updated to include --para_train
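
For context, a minimal sketch of what replacing apex with torch's native AMP typically looks like in a training step (the model, optimizer, and data below are placeholders, not taken from the unmass code):

```python
import torch
from torch.cuda.amp import GradScaler, autocast

model = torch.nn.Linear(512, 512).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scaler = GradScaler()

for step in range(100):
    x = torch.randn(32, 512, device="cuda")
    optimizer.zero_grad()
    with autocast():                   # forward pass runs in mixed precision
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()      # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)             # unscales gradients, then optimizer.step()
    scaler.update()                    # adjust the loss scale for the next step
```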

# v0.1

1. python module. `pip install` -able
1. Train without peeking into held out `test` set. Use test set when all the training is finished
128 changes: 23 additions & 105 deletions docs/README.adoc
@@ -1,4 +1,6 @@
= UnMass v0.2

Code: https://github.com/thammegowda/unmass

https://arxiv.org/pdf/1905.02450.pdf[MASS] is a novel pre-training method for sequence to sequence based language generation tasks.
It randomly masks a sentence fragment in the encoder, and then predicts it in the decoder.
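
A toy illustration of that masking scheme, on plain string tokens (a conceptual sketch only, not the project's implementation):

[source,python]
----
import random

MASK = "[MASK]"

def mass_mask(tokens, mask_ratio=0.5):
    # pick a contiguous fragment covering roughly mask_ratio of the sentence
    n = max(1, int(len(tokens) * mask_ratio))
    start = random.randint(0, len(tokens) - n)
    fragment = tokens[start:start + n]                             # decoder target
    enc_input = tokens[:start] + [MASK] * n + tokens[start + n:]   # encoder input
    return enc_input, fragment

enc_input, target = mass_mask("we choose to go to the moon".split())
print(enc_input)  # e.g. ['we', 'choose', '[MASK]', '[MASK]', '[MASK]', 'the', 'moon']
print(target)     # e.g. ['to', 'go', 'to']
----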
@@ -13,8 +15,10 @@ Credits: the original developers/researchers:

facebookresearch/XLM
|---microsoft/MASS
|---<this> thammegowda/unmass

== Older Versions
* link:v0.1.html[v0.1]

== Unsupervised MASS (UnMASS)

@@ -48,57 +52,27 @@ pip install --editable .

# install it from pypi https://pypi.org/project/unmass/
pip install unmass

----
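
To confirm the installation worked, a minimal check (assumes only that the `unmass` package is importable):

[source,python]
----
# prints the location the installed package was loaded from
import unmass
print("unmass loaded from:", unmass.__file__)
----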

Most dependencies are installed automatically from pip. The exception is apex, whose installation is a bit more involved, so it has to be installed manually.
Before installing apex, make sure that:

. The environment variable `CUDA_HOME` is set and `$CUDA_HOME/bin/nvcc` is a valid path
. The CUDA toolkit version is consistent
.. e.g. if `nvcc --version` reports `10.1`, then `python -c 'import torch; print(torch.version.cuda)'` should report the same version
. You have a suitable version of `gcc`; see `gcc --version`. (In my trial and error, gcc >= 4.9 and gcc <= 8.x worked; see the quick check below.)
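
A quick programmatic version of these checks (assumes `CUDA_HOME` is set and the usual toolkit layout; not part of unmass itself):

[source,python]
----
import os
import subprocess
import torch

cuda_home = os.environ["CUDA_HOME"]
# toolkit compiler version; should match the version torch was built against
subprocess.run([os.path.join(cuda_home, "bin", "nvcc"), "--version"], check=True)
print("torch built against CUDA:", torch.version.cuda)
subprocess.run(["gcc", "--version"], check=True)  # gcc >= 4.9 and <= 8.x worked
----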

Once you have met the above requirements, do the following:
[source,bash]
----
$ git clone https://github.com/NVIDIA/apex
$ cd apex
$ pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
----

You should see the message `Successfully installed apex-0.1` if the installation succeeded.
Otherwise, you are on your own to fix the installation (and please update this documentation).

== Data Ready

We use the same BPE codes and vocabulary as XLM. Here we take English-French as an example.


*Using XLM tools and prepared vocabs:*
----
cd MASS

wget https://dl.fbaipublicfiles.com/XLM/codes_enfr
wget https://dl.fbaipublicfiles.com/XLM/vocab_enfr

./get-data-nmt.sh --src en --tgt fr --reload_codes codes_enfr --reload_vocab vocab_enfr
----

*Preparing from scratch:*
----
MONO=/path/to/monolingual
PARA_VAL=/path/to/parallel/validation

unmass-prep --src de --tgt en --data runs/001-ende \
--mono $MONO --para_val $PARA_VAL

# this looks for $MONO.{src,tgt} and $PARA_VAL.{src,tgt}
----
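
A small helper, based only on the comment above, to verify that the expected `.src`/`.tgt` files are present before running `unmass-prep` (paths are placeholders):

[source,python]
----
from pathlib import Path

MONO = "/path/to/monolingual"
PARA_VAL = "/path/to/parallel/validation"

for prefix in (MONO, PARA_VAL):
    for ext in ("src", "tgt"):
        path = Path(f"{prefix}.{ext}")
        print(path, "OK" if path.exists() else "MISSING")
----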

== Pre-training:

[source,bash]
----
python -m unmass.train --exp_path ./runs/001-ende/pretrain \
--data_path ./runs/001-ende/prepared \
--lgs 'en-fr' \
--mass_steps 'en,fr' \
--encoder_only false \
@@ -115,11 +89,11 @@ python -m unmass.train --exp_name unmass-enfr \
--eval_bleu true \
--word_mass 0.5 \
--min_len 5 \
----


During the pre-training process, even without any back-translation, you can observe the model achieving some initial BLEU scores:
----
epoch -> 4
valid_fr-en_mt_bleu -> 10.55
valid_en-fr_mt_bleu -> 7.81
@@ -132,11 +106,10 @@ After pre-training, we use back-translation to fine-tune the pre-trained model on unsupervised machine translation.

[source,bash]
----
MODEL=./runs/001-ende/pretrain/checkpoint.pth

python -m unmass.train --exp_path ./runs/001-ende/finetune \
--data_path ./runs/001-ende/prepared \
--lgs 'en-fr' \
--bt_steps 'en-fr-en,fr-en-fr' \
--encoder_only false \
@@ -156,63 +129,6 @@ python -m unmass.train --exp_name unmass-enfr-unmt \
--reload_model "$MODEL,$MODEL" \
----

We also provide a demo of using the MASS pre-trained model on the WMT16 en-ro bilingual dataset, together with pre-trained and fine-tuned models:

|===
| Model | Ro-En BLEU (with BT)

| Baseline | 34.0
| XLM | 38.5
| https://modelrelease.blob.core.windows.net/mass/mass_mt_enro_1024.pth[MASS] | 39.1
|===


Download the dataset with the commands below:

----
wget https://dl.fbaipublicfiles.com/XLM/codes_enro
wget https://dl.fbaipublicfiles.com/XLM/vocab_enro

./get-data-bilingual-enro-nmt.sh --src en --tgt ro --reload_codes codes_enro --reload_vocab vocab_enro
----

After downloading the MASS pre-trained model from the above link, use the following command to fine-tune:

[source,bash]
----
MODEL=mass_enro_1024.pth

python -m unmass.train \
--exp_name unsupMT_enro \
--data_path ./data/processed/en-ro \
--lgs 'en-ro' \
--bt_steps 'en-ro-en,ro-en-ro' \
--encoder_only false \
--mt_steps 'en-ro,ro-en' \
--emb_dim 1024 \
--n_layers 6 \
--n_heads 8 \
--dropout 0.1 \
--attention_dropout 0.1 \
--gelu_activation true \
--tokens_per_batch 2000 \
--batch_size 32 \
--bptt 256 \
--optimizer adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001 \
--epoch_size 200000 \
--max_epoch 50 \
--eval_bleu true \
--reload_model "$MODEL,$MODEL"
----

=== Training Details

`MASS-base-uncased` uses 32x NVIDIA 32GB V100 GPUs and trains on Wikipedia + BookCorpus (16GB) for 20 epochs (float32); the batch size is simulated as 4096.
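
One common way such a large batch is simulated on smaller per-GPU batches is gradient accumulation; a generic sketch (not the project's training loop):

[source,python]
----
import torch

model = torch.nn.Linear(256, 256)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
accum_steps = 8                        # effective batch = accum_steps * micro-batch size

optimizer.zero_grad()
for step in range(64):
    x = torch.randn(16, 256)           # micro-batch of 16
    loss = model(x).pow(2).mean() / accum_steps   # scale so gradients average out
    loss.backward()                    # gradients accumulate across micro-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()               # one update per simulated large batch
        optimizer.zero_grad()
----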

== Other questions

Q1: When I run this program on multiple GPUs or multiple nodes, the program reports errors like `ModuleNotFoundError: No module named 'mass'`.

A1: This seems to be a bug in Python's `multiprocessing/spawn.py`; a direct workaround is to move these files into the corresponding folders under fairseq. Do not forget to modify the import paths in the code.

== Reference

If you find MASS useful in your work, you can cite the paper as below:
@@ -224,3 +140,5 @@
pages={5926--5936},
year={2019}
}

