
##Machine Translator

In this notebook, we are going to build a working machine translation (MT) system. Instead of writing any
Python code to do that, we’ll make the most of existing MT frameworks. A number of
open source frameworks make it easier to build MT systems, including [Moses](http://www.statmt.org/moses/) for statistical MT (SMT) and [OpenNMT](http://opennmt.net/) for neural MT (NMT).

We will use [Fairseq](https://github.com/pytorch/fairseq), an NMT
toolkit developed by Facebook that is becoming more and more popular among NLP
practitioners.

The following aspects make Fairseq a good choice for developing
an NMT system quickly:

- it is a modern framework that comes with a number of predefined state-of-the-art NMT models that you can use out of the box;
- it is very extensible, meaning you can quickly implement your own model by following its API;
- it is very fast, supporting multi-GPU and distributed training by default.

Thanks to its powerful models, you can build a decent-quality NMT system within a couple of hours.

##Setup

In [1]:
!pip -q install fairseq



Let's download and unpack the dataset:

In [2]:
%%shell

mkdir -p data/mt
wget https://realworldnlpbook.s3.amazonaws.com/data/mt/tatoeba.eng_spa.zip
unzip tatoeba.eng_spa.zip -d data/mt

--2021-11-16 06:27:51--  https://realworldnlpbook.s3.amazonaws.com/data/mt/tatoeba.eng_spa.zip
Resolving realworldnlpbook.s3.amazonaws.com (realworldnlpbook.s3.amazonaws.com)... 52.216.225.240
Connecting to realworldnlpbook.s3.amazonaws.com (realworldnlpbook.s3.amazonaws.com)|52.216.225.240|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19148555 (18M) [application/zip]
Saving to: ‘tatoeba.eng_spa.zip’


2021-11-16 06:27:52 (44.4 MB/s) - ‘tatoeba.eng_spa.zip’ saved [19148555/19148555]

Archive:  tatoeba.eng_spa.zip
  inflating: data/mt/tatoeba.eng_spa.train.tok.en  
  inflating: data/mt/tatoeba.eng_spa.train.tok.es  
  inflating: data/mt/tatoeba.eng_spa.train.tsv  
  inflating: data/mt/tatoeba.eng_spa.valid.tok.en  
  inflating: data/mt/tatoeba.eng_spa.valid.tok.es  
  inflating: data/mt/tatoeba.eng_spa.valid.tsv  
  inflating: data/mt/tatoeba.eng_spa.tsv  
  inflating: data/mt/tatoeba.eng_spa.test.tsv  
  inflating: data/mt/tatoeba.eng_spa.test.tok.en  
  inf



The corpus consists of approximately 200,000 English sentences and their Spanish translations.
I have already formatted the dataset so that you can use it without worrying about obtaining the data, tokenizing the text, and so on. The dataset is
already split into train, validation, and test subsets.

##Preparing the datasets

MT systems (both SMT and NMT) are machine learning
models and are therefore trained from data. The development process of an MT system looks similar to that of any other modern NLP system.

First, the training portion of the parallel corpus is preprocessed and used to train a set of NMT model candidates. 

Next, the validation portion is used to choose the best-performing model
out of all the candidates. This process is called model selection.

Finally, the best model is tested on the test portion of the dataset to obtain
evaluation metrics, which reflect how good the model is.
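As a rough illustration of the model selection step, the sketch below picks the candidate with the lowest validation loss. The candidate names and loss values are made-up placeholders, not results from this notebook.

```python
# Hypothetical validation losses for three model candidates (made-up numbers).
valid_loss = {"candidate_a": 4.1, "candidate_b": 3.6, "candidate_c": 3.9}


def select_best(valid_scores):
    """Model selection: return the name of the candidate with the lowest validation loss."""
    return min(valid_scores, key=valid_scores.get)


best = select_best(valid_loss)
print(best)
```

Only the selected model would then be evaluated on the test portion, so that the test metrics are not used for choosing among candidates.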

<img src='https://github.com/rahiakela/natural-language-processing-research-and-practice/blob/main/real-world-natural-language-processing/6-sequence-to-sequence-models/images/1.png?raw=1' width='800'/>

The first step in MT development is preprocessing the dataset. But before preprocessing, you need to convert the dataset into an easy-to-use format, which is usually plain text in NLP.

In practice, the raw data for training MT systems come in many different
formats, for example, plain text files (if you are lucky), XML formats of proprietary
software, PDF files, and database records. Your first job is to format the raw files so
that source sentences and their target translations are aligned sentence by sentence.

The resulting file is often a TSV file where each line is a tab-separated sentence pair, which looks like the following:

```
Let's try something.                  Permíteme intentarlo.
Muiriel is 20 now.                    Ahora, Muiriel tiene 20 años.
I just don't know what to say.        No sé qué decir.
You are in my way.                    Estás en mi camino.
Sometimes he can be a strange guy.    A veces él puede ser un chico raro.
…
```
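A file in this format can be read with a few lines of Python. The following is a minimal sketch, assuming one sentence pair per line with the English sentence in the first column and the Spanish sentence in the second; the helper name `read_sentence_pairs` is made up for illustration.

```python
import csv
import io


def read_sentence_pairs(tsv_file):
    """Yield (source, target) tuples from a tab-separated file object."""
    # QUOTE_NONE keeps apostrophes and quotation marks in the text untouched.
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 2:
            continue  # skip lines that are not a clean sentence pair
        yield row[0].strip(), row[1].strip()


# A small in-memory sample standing in for a real TSV file.
sample = io.StringIO(
    "Let's try something.\tPermíteme intentarlo.\n"
    "You are in my way.\tEstás en mi camino.\n"
)
pairs = list(read_sentence_pairs(sample))
print(pairs[0])
```

For a file on disk, you would pass `open(path, encoding="utf-8", newline="")` instead of the `io.StringIO` object.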

After the translations are aligned, the parallel corpus is fed into the preprocessing
pipeline. Specific operations applied in this process differ from application to application,
and from language to language, but the following steps are most common:

- Filtering
- Cleaning
- Tokenization
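To make these steps concrete, here is a simplified sketch of such a pipeline. It is not the pipeline that was used to prepare the Tatoeba files: the regex tokenizer, the length limits, and the order of the steps are assumptions chosen for illustration, and real systems typically use a dedicated tokenizer.

```python
import re
import unicodedata


def clean(text):
    """Cleaning: normalize Unicode and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Tokenization: split punctuation off into separate tokens (crude regex)."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)


def keep(src_tokens, tgt_tokens, max_len=50, max_ratio=3.0):
    """Filtering: drop empty, overly long, or badly length-mismatched pairs."""
    if not src_tokens or not tgt_tokens:
        return False
    if len(src_tokens) > max_len or len(tgt_tokens) > max_len:
        return False
    ratio = len(src_tokens) / len(tgt_tokens)
    return 1 / max_ratio <= ratio <= max_ratio


def preprocess(pairs):
    """Apply cleaning, tokenization, and filtering to (source, target) pairs."""
    for src, tgt in pairs:
        src_tok, tgt_tok = tokenize(clean(src)), tokenize(clean(tgt))
        if keep(src_tok, tgt_tok):
            yield " ".join(src_tok), " ".join(tgt_tok)


result = list(preprocess([("Hello ,  world!", "¡Hola, mundo!"), ("", "x")]))
print(result)
```

The output is space-separated tokens, which is the form the `.tok` files in this dataset use.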

The Tatoeba dataset you downloaded and unpacked earlier has already gone
through this entire preprocessing pipeline. Now you are ready to hand the dataset over to Fairseq.

The first step is to tell Fairseq to convert the input files to the binary format so that the training script can read them easily, as follows:

In [None]:
!fairseq-preprocess --source-lang es --target-lang en \
      --trainpref data/mt/tatoeba.eng_spa.train.tok \
      --validpref data/mt/tatoeba.eng_spa.valid.tok \
      --testpref data/mt/tatoeba.eng_spa.test.tok \
      --destdir data/mt-bin \
      --thresholdsrc 3 \
      --thresholdtgt 3

When this succeeds, you should see the message `Wrote preprocessed data to data/mt-bin` in the output.

You should also find the following group of files under the `data/mt-bin` directory:

In [7]:
!ls data/mt-bin/

dict.en.txt	   test.es-en.en.idx   train.es-en.en.idx  valid.es-en.en.idx
dict.es.txt	   test.es-en.es.bin   train.es-en.es.bin  valid.es-en.es.bin
preprocess.log	   test.es-en.es.idx   train.es-en.es.idx  valid.es-en.es.idx
test.es-en.en.bin  train.es-en.en.bin  valid.es-en.en.bin


One of the key functionalities of this preprocessing step is to build the vocabulary (called the dictionary in Fairseq), which is a mapping from vocabulary items (usually words) to their IDs. 

Notice the two dictionary files in the directory, `dict.en.txt` and
`dict.es.txt`. MT deals with two languages, so the system needs to maintain two
mappings, one for each language.
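The idea behind such a dictionary, together with the frequency thresholds passed to `fairseq-preprocess` above, can be sketched in plain Python. This is an illustration of the concept, not Fairseq's own implementation; the function names and the list of special symbols are assumptions made for the example.

```python
from collections import Counter


def build_dictionary(tokenized_sentences, threshold=3,
                     specials=("<s>", "<pad>", "</s>", "<unk>")):
    """Map each token seen at least `threshold` times to an integer ID."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent.split())
    vocab = {sym: i for i, sym in enumerate(specials)}
    # Assign IDs by descending frequency, breaking ties alphabetically.
    for tok, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if n >= threshold:
            vocab[tok] = len(vocab)
    return vocab


def encode(sentence, vocab):
    """Convert a tokenized sentence to IDs; rare or unseen tokens become <unk>."""
    unk = vocab["<unk>"]
    return [vocab.get(tok, unk) for tok in sentence.split()]


vocab = build_dictionary(["a b", "a b", "a c"], threshold=2)
print(encode("a c b", vocab))
```

In this toy example, `c` appears only once, so it falls below the threshold and is mapped to the unknown-word ID.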

##Training the model

Now that the training data is converted into the binary format, you are ready to train the MT model.

At this point, you need to know only that you are training a model on the
data stored in the directory given as the first argument (`data/mt-bin`), using an LSTM architecture (`--arch lstm`) with a handful of other hyperparameters, and saving the results in `data/mt-ckpt` (`ckpt` is short for “checkpoint”).

Invoke the `fairseq-train` command with the directory where the
binary files are located, along with several hyperparameters, as shown next:

In [8]:
!fairseq-train data/mt-bin --arch lstm \
    --share-decoder-input-output-embed \
    --optimizer adam \
    --lr 1.0e-3 \
    --max-tokens 4096 \
    --save-dir data/mt-ckpt

2021-11-16 06:41:09 | INFO | fairseq_cli.train | Namespace(adam_betas='(0.9, 0.999)', adam_eps=1e-08, adaptive_softmax_cutoff='10000,50000,200000', all_gather_list_size=16384, arch='lstm', batch_size=None, batch_size_valid=None, best_checkpoint_metric='loss', bf16=False, bpe=None, broadcast_buffers=False, bucket_cap_mb=25, checkpoint_shard_count=1, checkpoint_suffix='', clip_norm=0.0, cpu=False, criterion='cross_entropy', curriculum=0, data='data/mt-bin', data_buffer_size=10, dataset_impl=None, ddp_backend='c10d', decoder_attention='1', decoder_dropout_in=0.1, decoder_dropout_out=0.1, decoder_embed_dim=512, decoder_embed_path=None, decoder_freeze_embed=False, decoder_hidden_size=512, decoder_layers=1, decoder_out_embed_dim=512, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_num_procs=1, distributed_port=-1, distributed_rank=0, distributed_world_size=1, distributed_wrapper='DDP', dropout=0.1, empty

When you run this command, the output shows two types of progress bars
alternately: one for training and another for validation.

Each epoch consists of two
stages: training and validation. An epoch, a concept used in machine learning, means one pass through the entire training data.

In the training stage, the loss is calculated on the training data, and the model parameters are then adjusted so that the new set of parameters lowers the loss.

In the validation stage, the model parameters
are fixed, and a separate dataset (the validation set) is used to measure how well the model performs on data it was not trained on.
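The two stages can be illustrated with a toy example that has nothing to do with translation: fitting a single weight `w` so that `w * x` approximates `y`. The data, learning rate, and function names below are made up for illustration; Fairseq's actual training loop is far more involved.

```python
def mse(w, data):
    """Mean squared error of the one-parameter model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def train_one_epoch(w, data, lr=0.01):
    """Training stage: one pass over the data, updating w after each example."""
    for x, y in data:
        grad = 2 * (w * x - y) * x  # derivative of the squared error w.r.t. w
        w -= lr * grad
    return w


# Toy data following y = 2x (made up for illustration).
train_set = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
valid_set = [(4.0, 8.0), (5.0, 10.0)]

w = 0.0
history = []
for epoch in range(1, 6):
    w = train_one_epoch(w, train_set)   # training stage: parameters change
    valid_loss = mse(w, valid_set)      # validation stage: parameters stay fixed
    history.append((epoch, mse(w, train_set), valid_loss))
    print(f"epoch {epoch}: train loss {history[-1][1]:.4f}, valid loss {valid_loss:.4f}")
```

Note that the validation stage only measures the loss; it never updates `w`.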

<img src='https://github.com/rahiakela/natural-language-processing-research-and-practice/blob/main/real-world-natural-language-processing/6-sequence-to-sequence-models/images/2.png?raw=1' width='800'/>

As training continues, the training loss becomes smaller and smaller and gradually
approaches zero, because this is exactly what we told the optimizer to do: decrease the
loss as much as possible. Checking whether the training loss decreases steadily epoch
after epoch is a good “sanity check” that your model and the training pipeline are working as expected.

The validation loss, on the other hand, goes down for the first several
epochs, but after a certain point it gradually goes back up, forming a U-shaped
curve, which is a typical sign of overfitting. After several epochs of training, your model fits
the training set so well that it begins to lose its ability to generalize to the validation set.

If you see your validation loss starting to creep up, there’s little point in keeping the training process running, because chances are your model has already overfitted to the data to some extent. A common practice in such a situation, called early stopping, is to terminate the training.

Specifically, if your validation loss has not improved for a certain
number of epochs, you stop the training and use the model from the point when the
validation loss was the lowest. The number of epochs you wait before terminating the training is called patience.

In practice, the metric you care about the most (such as
BLEU) is often used for early stopping instead of the validation loss.
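The logic of early stopping with patience can be sketched as follows. This is a generic illustration rather than Fairseq's implementation, and the validation losses are made-up numbers, not values from this training run.

```python
def early_stopping_epoch(valid_losses, patience=2):
    """Return (best_epoch, stop_epoch), both 1-indexed.

    Training stops once the validation loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(valid_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(valid_losses)


# Made-up U-shaped validation losses, one value per epoch.
losses = [5.0, 4.0, 3.5, 3.6, 3.7, 3.8]
best_epoch, stop_epoch = early_stopping_epoch(losses, patience=2)
print(best_epoch, stop_epoch)
```

With these numbers, the loss is lowest at epoch 3, and training stops at epoch 5 after two epochs without improvement. The model you keep is the one saved at the best epoch, not the one from the last epoch.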

The graph indicates that the validation loss is lowest around epoch 8, so you can stop the `fairseq-train` command (by pressing `Ctrl + C`) after around 10 epochs; otherwise, the command would keep running indefinitely. Fairseq automatically saves the best model parameters (in terms of the validation loss) to the `checkpoint_best.pt` file.



##Running the translator

After the model is trained, you can invoke the `fairseq-interactive` command to
run your MT model on any input in an interactive way.

After you see the prompt `Type the input sentence and press return`, try typing
(or copying and pasting) the following Spanish sentences one by one. They are already tokenized (punctuation is separated by spaces) to match the tokenized training data:

```
Buenos días !
¡ Hola !
¿ Dónde está el baño ?
¿ Hay habitaciones libres ?
¿ Acepta tarjeta de crédito ?
La cuenta , por favor .
```

You can run the command by specifying the binary file location and the model parameter file as follows:

In [10]:
!fairseq-interactive data/mt-bin \
    --path data/mt-ckpt/checkpoint_best.pt \
    --beam 5 \
    --source-lang es \
    --target-lang en

2021-11-16 07:04:39 | INFO | fairseq_cli.interactive | Namespace(all_gather_list_size=16384, batch_size=1, batch_size_valid=None, beam=5, bf16=False, bpe=None, broadcast_buffers=False, bucket_cap_mb=25, buffer_size=1, checkpoint_shard_count=1, checkpoint_suffix='', constraints=None, cpu=False, criterion='cross_entropy', curriculum=0, data='data/mt-bin', data_buffer_size=10, dataset_impl=None, ddp_backend='c10d', decoding_format=None, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_num_procs=1, distributed_port=-1, distributed_rank=0, distributed_world_size=1, distributed_wrapper='DDP', diverse_beam_groups=-1, diverse_beam_strength=0.5, diversity_rate=-1.0, empty_cache_freq=0, eval_bleu=False, eval_bleu_args=None, eval_bleu_detok='space', eval_bleu_detok_args=None, eval_bleu_print_samples=False, eval_bleu_remove_bpe=None, eval_tokenized_bleu=False, fast_stat_sync=False, find_unused_parameters=False,

Most of the output sentences are almost perfect, except the fourth one (which I would translate as "Are there free rooms?"). Even considering that these sentences are all simple examples you can find in any Spanish travel phrasebook, this is not a bad start for a system built within an hour!