Remove Apex, upgrade torch #2

Status: Open. Wants to merge 4 commits into base: master.
7 changes: 6 additions & 1 deletion CHANGES.md
@@ -1,5 +1,10 @@
# v0.2 : 20210505

# Changes:
1. update torch to 1.8
1. remove apex and instead use torch's native AMP (see the sketch below)
1. `unmass-prep` updated to include --para_train
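
For context, a minimal sketch of what replacing apex with torch's native AMP typically looks like in a training step (the model, optimizer, and data below are placeholders, not taken from the unmass code):

```python
import torch
from torch.cuda.amp import GradScaler, autocast

model = torch.nn.Linear(512, 512).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scaler = GradScaler()

for step in range(100):
    x = torch.randn(32, 512, device="cuda")
    optimizer.zero_grad()
    with autocast():                   # forward pass runs in mixed precision
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()      # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)             # unscales gradients, then optimizer.step()
    scaler.update()                    # adjust the loss scale for the next step
```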

# v0.1

1. python module. `pip install` -able
1. Train without peeking into held out `test` set. Use test set when all the training is finished
128 changes: 23 additions & 105 deletions docs/README.adoc
@@ -1,4 +1,6 @@
= UnMass v0.2

Code: https://github.com/thammegowda/unmass

https://arxiv.org/pdf/1905.02450.pdf[MASS] is a novel pre-training method for sequence to sequence based language generation tasks.
It randomly masks a sentence fragment in the encoder, and then predicts it in the decoder.
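
A toy illustration of that masking scheme, on plain string tokens (a conceptual sketch only, not the project's implementation):

[source,python]
----
import random

MASK = "[MASK]"

def mass_mask(tokens, mask_ratio=0.5):
    # pick a contiguous fragment covering roughly mask_ratio of the sentence
    n = max(1, int(len(tokens) * mask_ratio))
    start = random.randint(0, len(tokens) - n)
    fragment = tokens[start:start + n]                             # decoder target
    enc_input = tokens[:start] + [MASK] * n + tokens[start + n:]   # encoder input
    return enc_input, fragment

enc_input, target = mass_mask("we choose to go to the moon".split())
print(enc_input)  # e.g. ['we', 'choose', '[MASK]', '[MASK]', '[MASK]', 'the', 'moon']
print(target)     # e.g. ['to', 'go', 'to']
----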
@@ -13,8 +15,10 @@ Credits: the original developers/researchers:

facebookresearch/XLM
|---microsoft/MASS
|---<this> thammegowda/unmass

== Older Versions
* link:v0.1.html[v0.1]

== Unsupervised MASS (UnMASS)

@@ -48,57 +52,27 @@ pip install --editable .

# install it from pypi https://pypi.org/project/unmass/
pip install unmass

----
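
To confirm the installation worked, a minimal check (assumes only that the `unmass` package is importable):

[source,python]
----
# prints the location the installed package was loaded from
import unmass
print("unmass loaded from:", unmass.__file__)
----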

Most dependencies are installed automatically from pip. The exception is apex, whose installation is a bit more involved, so it has to be installed manually.
Before installing apex, make sure that:

. The environment variable `CUDA_HOME` is set and `$CUDA_HOME/bin/nvcc` is a valid path
. The CUDA toolkit version is consistent
.. e.g. if `nvcc --version` reports `10.1`, then `python -c 'import torch; print(torch.version.cuda)'` should report the same version
. You have a suitable version of `gcc`; see `gcc --version`. (In my trial and error, gcc >= 4.9 and gcc <= 8.x worked; see the quick check below.)
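
A quick programmatic version of these checks (assumes `CUDA_HOME` is set and the usual toolkit layout; not part of unmass itself):

[source,python]
----
import os
import subprocess
import torch

cuda_home = os.environ["CUDA_HOME"]
# toolkit compiler version; should match the version torch was built against
subprocess.run([os.path.join(cuda_home, "bin", "nvcc"), "--version"], check=True)
print("torch built against CUDA:", torch.version.cuda)
subprocess.run(["gcc", "--version"], check=True)  # gcc >= 4.9 and <= 8.x worked
----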

Once you have met the above requirements, do the following:
[source,bash]
----
$ git clone https://github.com/NVIDIA/apex
$ cd apex
$ pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
----

You should see the message `Successfully installed apex-0.1` if the installation succeeded.
Otherwise, you are on your own to fix the installation (and please update this documentation).

== Data Ready

We use the same BPE codes and vocabulary as XLM. Here we take English-French as an example.


*Using XLM tools and prepared vocabs:*
----
cd MASS

wget https://dl.fbaipublicfiles.com/XLM/codes_enfr
wget https://dl.fbaipublicfiles.com/XLM/vocab_enfr

./get-data-nmt.sh --src en --tgt fr --reload_codes codes_enfr --reload_vocab vocab_enfr
----

*Preparing from scratch:*
----
MONO=/path/to/monolingual
PARA_VAL=/path/to/parallel/validation

unmass-prep --src de --tgt en --data runs/001-ende \
--mono $MONO --para_val $PARA_VAL

# this looks for $MONO.{src,tgt} and $PARA_VAL.{src,tgt}
----
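
A small helper, based only on the comment above, to verify that the expected `.src`/`.tgt` files are present before running `unmass-prep` (paths are placeholders):

[source,python]
----
from pathlib import Path

MONO = "/path/to/monolingual"
PARA_VAL = "/path/to/parallel/validation"

for prefix in (MONO, PARA_VAL):
    for ext in ("src", "tgt"):
        path = Path(f"{prefix}.{ext}")
        print(path, "OK" if path.exists() else "MISSING")
----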

== Pre-training:

[source,bash]
----
python -m unmass.train --exp_path ./runs/001-ende/pretrain \
--data_path ./runs/001-ende/prepared \
--lgs 'en-fr' \
--mass_steps 'en,fr' \
--encoder_only false \
@@ -115,11 +89,11 @@ python -m unmass.train --exp_name unmass-enfr \
--eval_bleu true \
--word_mass 0.5 \
--min_len 5 \
----


During the pre-training process, even without any back-translation, you can observe the model achieving some initial BLEU scores:
----
epoch -> 4
valid_fr-en_mt_bleu -> 10.55
valid_en-fr_mt_bleu -> 7.81
@@ -132,11 +106,10 @@ After pre-training, we use back-translation to fine-tune the pre-trained model on unsupervised machine translation.

[source,bash]
----
MODEL=./runs/001-ende/pretrain/checkpoint.pth

python -m unmass.train --exp_path ./runs/001-ende/finetune \
--data_path ./runs/001-ende/prepared \
--lgs 'en-fr' \
--bt_steps 'en-fr-en,fr-en-fr' \
--encoder_only false \
@@ -156,63 +129,6 @@ python -m unmass.train --exp_name unmass-enfr-unmt \
--reload_model "$MODEL,$MODEL" \
----

We also provide a demo of using the MASS pre-trained model on the WMT16 en-ro bilingual dataset, together with pre-trained and fine-tuned models:

|===
| Model | Ro-En BLEU (with BT)

| Baseline | 34.0
| XLM | 38.5
| https://modelrelease.blob.core.windows.net/mass/mass_mt_enro_1024.pth[MASS] | 39.1
|===


Download the dataset with the commands below:

----
wget https://dl.fbaipublicfiles.com/XLM/codes_enro
wget https://dl.fbaipublicfiles.com/XLM/vocab_enro

./get-data-bilingual-enro-nmt.sh --src en --tgt ro --reload_codes codes_enro --reload_vocab vocab_enro
----

After downloading the MASS pre-trained model from the above link, use the following command to fine-tune:

[source,bash]
----
MODEL=mass_enro_1024.pth

python -m unmass.train \
--exp_name unsupMT_enro \
--data_path ./data/processed/en-ro \
--lgs 'en-ro' \
--bt_steps 'en-ro-en,ro-en-ro' \
--encoder_only false \
--mt_steps 'en-ro,ro-en' \
--emb_dim 1024 \
--n_layers 6 \
--n_heads 8 \
--dropout 0.1 \
--attention_dropout 0.1 \
--gelu_activation true \
--tokens_per_batch 2000 \
--batch_size 32 \
--bptt 256 \
--optimizer adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001 \
--epoch_size 200000 \
--max_epoch 50 \
--eval_bleu true \
--reload_model "$MODEL,$MODEL"
----

=== Training Details

`MASS-base-uncased` uses 32x NVIDIA 32GB V100 GPUs and trains on Wikipedia + BookCorpus (16GB) for 20 epochs (float32); the batch size is simulated as 4096.
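
One common way such a large batch is simulated on smaller per-GPU batches is gradient accumulation; a generic sketch (not the project's training loop):

[source,python]
----
import torch

model = torch.nn.Linear(256, 256)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
accum_steps = 8                        # effective batch = accum_steps * micro-batch size

optimizer.zero_grad()
for step in range(64):
    x = torch.randn(16, 256)           # micro-batch of 16
    loss = model(x).pow(2).mean() / accum_steps   # scale so gradients average out
    loss.backward()                    # gradients accumulate across micro-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()               # one update per simulated large batch
        optimizer.zero_grad()
----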

== Other questions

Q1: When I run this program on multiple GPUs or multiple nodes, the program reports errors like `ModuleNotFoundError: No module named 'mass'`.

A1: This seems to be a bug in Python's `multiprocessing/spawn.py`; a direct workaround is to move these files into the corresponding folders under fairseq. Do not forget to modify the import paths in the code.

== Reference

If you find MASS useful in your work, you can cite the paper as below:
@@ -224,3 +140,5 @@
pages={5926--5936},
year={2019}
}

