# OpenNMT

The instructions and code are adapted from the [Neural Machine Translation (NMT) tutorial with OpenNMT-py](https://github.com/ymoslem/OpenNMT-Tutorial),
combining its two Jupyter notebooks into one with minor tweaks.

### Data Gathering and Processing

In [1]:
# Create a directory and clone the GitHub MT-Preparation repository
!mkdir -p nmt
%cd nmt
# Caution: this deletes everything in the nmt directory to start from a clean state
!rm -rf *
!git clone https://github.com/ymoslem/MT-Preparation.git

/home/jovyan/nmt
Cloning into 'MT-Preparation'...
remote: Enumerating objects: 239, done.
remote: Counting objects: 100% (239/239), done.
remote: Compressing objects: 100% (133/133), done.
remote: Total 239 (delta 119), reused 186 (delta 94), pack-reused 0
Receiving objects: 100% (239/239), 61.56 KiB | 132.00 KiB/s, done.
Resolving deltas: 100% (119/119), done.


In [2]:
# Install the requirements
!pip3 install -r MT-Preparation/requirements.txt



### Datasets

Example datasets:

    EN-AR: https://object.pouta.csc.fi/OPUS-UN/v20090831/moses/ar-en.txt.zip
    EN-ES: https://object.pouta.csc.fi/OPUS-UN/v20090831/moses/en-es.txt.zip
    EN-FR: https://object.pouta.csc.fi/OPUS-UN/v20090831/moses/en-fr.txt.zip
    EN-RU: https://object.pouta.csc.fi/OPUS-UN/v20090831/moses/en-ru.txt.zip
    EN-ZH: https://object.pouta.csc.fi/OPUS-UN/v20090831/moses/en-zh.txt.zip

In [3]:
# Download and unzip a dataset
# !wget https://object.pouta.csc.fi/OPUS-UN/v20090831/moses/en-fr.txt.zip
# !unzip en-fr.txt.zip
!wget https://object.pouta.csc.fi/OPUS-UN/v20090831/moses/en-zh.txt.zip
!unzip en-zh.txt.zip

--2023-08-22 07:26:27--  https://object.pouta.csc.fi/OPUS-UN/v20090831/moses/en-zh.txt.zip
Resolving object.pouta.csc.fi (object.pouta.csc.fi)... 86.50.254.19, 86.50.254.18
Connecting to object.pouta.csc.fi (object.pouta.csc.fi)|86.50.254.19|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9020487 (8.6M) [application/zip]
Saving to: ‘en-zh.txt.zip’


2023-08-22 07:30:16 (39.4 KB/s) - ‘en-zh.txt.zip’ saved [9020487/9020487]

Archive:  en-zh.txt.zip
  inflating: UN.en-zh.en             
  inflating: UN.en-zh.zh             
  inflating: README                  
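
A parallel corpus in Moses format stores one segment per line, with line N of the source file corresponding to line N of the target file. Before filtering, it can be worth confirming that both extracted files have the same number of lines. The following is a minimal sketch of such a check; the helper names `count_lines` and `is_line_aligned` are hypothetical and are not part of MT-Preparation.

```python
from pathlib import Path


def count_lines(path):
    """Count the lines of a UTF-8 text file without loading it into memory."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def is_line_aligned(source_path, target_path):
    """Return True when both files of a parallel corpus have the same line count."""
    return count_lines(source_path) == count_lines(target_path)


# File names as extracted from en-zh.txt.zip; the check is skipped if they are absent
source_file, target_file = Path("UN.en-zh.zh"), Path("UN.en-zh.en")
if source_file.exists() and target_file.exists():
    print("Line-aligned:", is_line_aligned(source_file, target_file))
```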


In [4]:
# Filter the dataset
# Arguments: source file, target file, source language, target language
# !python3 MT-Preparation/filtering/filter.py UN.en-fr.fr UN.en-fr.en fr en

!python3 MT-Preparation/filtering/filter.py UN.en-zh.zh UN.en-zh.en zh en

Dataframe shape (rows, columns): (74067, 2)
--- Rows with Empty Cells Deleted	--> Rows: 74067
--- Duplicates Deleted			--> Rows: 62179
--- Source-Copied Rows Deleted		--> Rows: 62145
--- Too Long Source/Target Deleted	--> Rows: 3308
--- HTML Removed			--> Rows: 3308
--- Rows will remain in true-cased	--> Rows: 3308
--- Rows with Empty Cells Deleted	--> Rows: 3308
--- Rows Shuffled			--> Rows: 3308
--- Source Saved: UN.en-zh.zh-filtered.zh
--- Target Saved: UN.en-zh.en-filtered.en
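
The stages listed in the log above (empty cells, duplicates, source-copied rows, overly long segments) can be sketched in plain Python as follows. This is an illustration of the idea only, not the implementation in `filter.py`: the function name `filter_parallel` and the `max_length` threshold are assumptions, and the real script applies its own length rules, HTML removal, and shuffling.

```python
def filter_parallel(pairs, max_length=200):
    """Filter (source, target) pairs, mirroring the order of the log stages.

    max_length is a hypothetical character limit chosen for illustration.
    """
    # 1. Strip whitespace and drop pairs where either side is empty
    pairs = [(src.strip(), tgt.strip()) for src, tgt in pairs]
    pairs = [(src, tgt) for src, tgt in pairs if src and tgt]

    # 2. Drop exact duplicate pairs, keeping the first occurrence
    pairs = list(dict.fromkeys(pairs))

    # 3. Drop "source-copied" pairs, where the target is identical to the source
    pairs = [(src, tgt) for src, tgt in pairs if src != tgt]

    # 4. Drop pairs where either side exceeds the length limit
    return [(src, tgt) for src, tgt in pairs
            if len(src) <= max_length and len(tgt) <= max_length]


sample = [
    ("同意", "Consent"),
    ("同意", "Consent"),        # duplicate
    ("", "Empty source"),       # empty cell
    ("Table 1", "Table 1"),     # source copied to target
    ("x" * 300, "Too long"),    # exceeds max_length
]
print(filter_parallel(sample))
```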


### Tokenization / Sub-wording

In [5]:
!ls MT-Preparation/subwording/

1-train_bpe.py	1-train_unigram.py  2-subword.py  3-desubword.py


In [6]:
# Train a SentencePiece model for subword tokenization
# !python3 MT-Preparation/subwording/1-train_unigram.py UN.en-fr.fr-filtered.fr UN.en-fr.en-filtered.en
!python3 MT-Preparation/subwording/1-train_unigram.py UN.en-zh.zh-filtered.zh UN.en-zh.en-filtered.en

sentencepiece_trainer.cc(177) LOG(INFO) Running command: --input=UN.en-zh.zh-filtered.zh --model_prefix=source --vocab_size=50000 --hard_vocab_limit=false --split_digits=true
sentencepiece_trainer.cc(77) LOG(INFO) Starts training with : 
trainer_spec {
  input: UN.en-zh.zh-filtered.zh
  input_format: 
  model_prefix: source
  model_type: UNIGRAM
  vocab_size: 50000
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 1
  pretokenization_delimiter: 
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  hard_vocab_limit: 0
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
 

In [7]:
!ls

en-zh.txt.zip	source.model  target.vocab	       UN.en-zh.zh
MT-Preparation	source.vocab  UN.en-zh.en	       UN.en-zh.zh-filtered.zh
README		target.model  UN.en-zh.en-filtered.en


In [8]:
# Subword the dataset
# !python3 MT-Preparation/subwording/2-subword.py source.model target.model UN.en-fr.fr-filtered.fr UN.en-fr.en-filtered.en
!python3 MT-Preparation/subwording/2-subword.py source.model target.model UN.en-zh.zh-filtered.zh UN.en-zh.en-filtered.en

Source Model: source.model
Target Model: target.model
Source Dataset: UN.en-zh.zh-filtered.zh
Target Dataset: UN.en-zh.en-filtered.en
Done subwording the source file! Output: UN.en-zh.zh-filtered.zh.subword
Done subwording the target file! Output: UN.en-zh.en-filtered.en.subword


In [9]:
# First 3 lines before subwording
# !head -n 3 UN.en-fr.fr-filtered.fr && echo "-----" && head -n 3 UN.en-fr.en-filtered.en
!head -n 3 UN.en-zh.zh-filtered.zh && echo "-----" && head -n 3 UN.en-zh.en-filtered.en

同意
第61/212号决议
表28.24
-----
Consent
RESOLUTION 61/212
Table 28.24


In [10]:
# First 3 lines after subwording
# !head -n 3 UN.en-fr.fr-filtered.fr.subword && echo "---" && head -n 3 UN.en-fr.en-filtered.en.subword
!head -n 3 UN.en-zh.zh-filtered.zh.subword && echo "---" && head -n 3 UN.en-zh.en-filtered.en.subword

▁ 同意
▁第 6 1 / 2 1 2 号决议
▁表 2 8 . 2 4
---
▁Cons ent
▁RESOLUTION ▁ 6 1 / 2 1 2
▁T able ▁ 2 8 . 2 4
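
In the subworded output, `▁` (U+2581) marks where a space stood in the original text, and the spaces between pieces are only separators. Reversing the process therefore amounts to joining the pieces and turning each `▁` back into a space. The sketch below shows this on the lines printed above; `desubword_line` is a hypothetical helper, whereas `3-desubword.py` uses the trained SentencePiece model to decode.

```python
def desubword_line(line):
    """Join SentencePiece pieces and restore spaces from the U+2581 marker."""
    pieces = line.split()            # spaces between pieces are separators only
    text = "".join(pieces)           # glue the pieces back together
    return text.replace("\u2581", " ").strip()


for subworded in ("▁RESOLUTION ▁ 6 1 / 2 1 2", "▁第 6 1 / 2 1 2 号决议", "▁T able ▁ 2 8 . 2 4"):
    print(desubword_line(subworded))
```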


### Data Splitting

We usually split our dataset into 3 portions:

    1. training dataset - used for training the model;
    2. development dataset - used to run regular validations during the training to help improve the model parameters; and
    3. testing dataset - a holdout dataset used after the model finishes training to finally evaluate the model on unseen data.
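
The three-way split can be sketched as follows. This is a minimal illustration under stated assumptions, not the logic of `train_dev_test_split.py`: the function name `split_parallel`, the shuffling, and the `seed` parameter are all hypothetical. The key point is that source and target are split with the same indices so that the pairs stay aligned.

```python
import random


def split_parallel(source, target, dev_size, test_size, seed=0):
    """Split aligned source/target lists into train, dev, and test portions."""
    if len(source) != len(target):
        raise ValueError("source and target must have the same number of segments")
    if dev_size + test_size >= len(source):
        raise ValueError("dev and test sizes leave no data for training")

    # Shuffle indices once and reuse them for both sides to keep pairs aligned
    indices = list(range(len(source)))
    random.Random(seed).shuffle(indices)

    portions = {
        "dev": indices[:dev_size],
        "test": indices[dev_size:dev_size + test_size],
        "train": indices[dev_size + test_size:],
    }
    return {
        name: ([source[i] for i in idx], [target[i] for i in idx])
        for name, idx in portions.items()
    }


# 3308 segments with 1000 each for dev and test leave 1308 for training
src = [f"src {i}" for i in range(3308)]
tgt = [f"tgt {i}" for i in range(3308)]
splits = split_parallel(src, tgt, dev_size=1000, test_size=1000)
print({name: len(pair[0]) for name, pair in splits.items()})
```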

In [11]:
# Split the dataset into training set, development set, and test set
# Development and test sets should be between 1000 and 5000 segments (here we chose 1000 each, as only 3308 segments remain after filtering)
# !python3 MT-Preparation/train_dev_split/train_dev_test_split.py 2000 2000 UN.en-fr.fr-filtered.fr.subword UN.en-fr.en-filtered.en.subword
!python3 MT-Preparation/train_dev_split/train_dev_test_split.py 1000 1000 UN.en-zh.zh-filtered.zh.subword UN.en-zh.en-filtered.en.subword

Dataframe shape: (3308, 2)
--- Empty Cells Deleted --> Rows: 3308
--- Wrote Files
Done!
Output files
UN.en-zh.zh-filtered.zh.subword.train
UN.en-zh.en-filtered.en.subword.train
UN.en-zh.zh-filtered.zh.subword.dev
UN.en-zh.en-filtered.en.subword.dev
UN.en-zh.zh-filtered.zh.subword.test
UN.en-zh.en-filtered.en.subword.test


In [12]:
# Line count for the subworded train, dev, and test datasets
!wc -l *.subword.*

  1000 UN.en-zh.en-filtered.en.subword.dev
  1000 UN.en-zh.en-filtered.en.subword.test
  1308 UN.en-zh.en-filtered.en.subword.train
  1000 UN.en-zh.zh-filtered.zh.subword.dev
  1000 UN.en-zh.zh-filtered.zh.subword.test
  1308 UN.en-zh.zh-filtered.zh.subword.train
  6616 total


In [13]:
# Check the first and last line from each dataset

# -------------------------------------------
# Change this cell to print your name
!echo -e "My name is: FirstName SecondName \n"
# -------------------------------------------

!echo "---First line---"
!head -n 1 *.{train,dev,test}

!echo -e "\n---Last line---"
!tail -n 1 *.{train,dev,test}

My name is: FirstName SecondName 

---First line---
==> UN.en-zh.en-filtered.en.subword.train <==
▁Cons ent

==> UN.en-zh.zh-filtered.zh.subword.train <==
▁ 同意

==> UN.en-zh.en-filtered.en.subword.dev <==
▁RESOLUTION ▁ 6 1 / 8 7

==> UN.en-zh.zh-filtered.zh.subword.dev <==
▁第 6 1 / 8 7 号决议

==> UN.en-zh.en-filtered.en.subword.test <==
▁Abst aining : ▁Canada

==> UN.en-zh.zh-filtered.zh.subword.test <==
▁ 弃 权 : 加拿大

---Last line---
==> UN.en-zh.en-filtered.en.subword.train <==
▁VI . ▁Treat ment ▁of ▁ vic ti ms

==> UN.en-zh.zh-filtered.zh.subword.train <==
▁六 . ▁ 受 害 人的 待 遇

==> UN.en-zh.en-filtered.en.subword.dev <==
▁RESOLUTION ▁ 5 9 / 1 6 5

==> UN.en-zh.zh-filtered.zh.subword.dev <==
▁第 5 9 / 1 6 5 号决议

==> UN.en-zh.en-filtered.en.subword.test <==
▁ 6 0 / 2 3 8 . ▁Human ▁resources ▁management

==> UN.en-zh.zh-filtered.zh.subword.test <==
▁ 6 0 / 2 3 8 . ▁人力资源 管理


In [14]:
# Install OpenNMT-py 3.x
!pip3 install OpenNMT-py



In [15]:
# Open the folder where you saved your prepared datasets from the first exercise
!ls

en-zh.txt.zip			 UN.en-zh.en-filtered.en.subword.dev
MT-Preparation			 UN.en-zh.en-filtered.en.subword.test
README				 UN.en-zh.en-filtered.en.subword.train
source.model			 UN.en-zh.zh
source.vocab			 UN.en-zh.zh-filtered.zh
target.model			 UN.en-zh.zh-filtered.zh.subword
target.vocab			 UN.en-zh.zh-filtered.zh.subword.dev
UN.en-zh.en			 UN.en-zh.zh-filtered.zh.subword.test
UN.en-zh.en-filtered.en		 UN.en-zh.zh-filtered.zh.subword.train
UN.en-zh.en-filtered.en.subword


In [16]:
pwd

'/home/jovyan/nmt'

### Create the Training Configuration File

The following config file matches most of the recommended values for the Transformer model [Vaswani et al., 2017](https://arxiv.org/abs/1706.03762). As the current dataset is small, we reduced the following values:

    train_steps - for datasets with a few million sentences, consider a value between 100000 and 200000, or more. Enabling the early_stopping option stops the training when validation no longer shows considerable improvement.
    valid_steps - 10000 can be a good value if train_steps is large enough.
    warmup_steps - must be less than train_steps. Try values of 4000 and 8000.
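
To see why `warmup_steps` must be smaller than `train_steps`, it helps to look at the "noam" schedule from Vaswani et al. (2017): the learning rate rises linearly until `warmup_steps` and then decays with the inverse square root of the step. The sketch below uses the values from this tutorial's configuration (`learning_rate: 2`, `hidden_size: 512`, `warmup_steps: 1000`). The function `noam_learning_rate` is a hypothetical helper written from the paper's formula; treating `learning_rate` as a plain multiplier is an assumption about how OpenNMT-py applies it.

```python
def noam_learning_rate(step, learning_rate=2.0, model_size=512, warmup_steps=1000):
    """Learning rate at a given step (step >= 1) under the Noam schedule."""
    scale = model_size ** -0.5
    # Linear warmup while step < warmup_steps, inverse square root decay afterwards
    return learning_rate * scale * min(step ** -0.5, step * warmup_steps ** -1.5)


for step in (100, 500, 1000, 2000, 3000):
    print(step, f"{noam_learning_rate(step):.6f}")
```

If `warmup_steps` were equal to or larger than `train_steps`, training would end while the rate is still climbing and the decay phase would never be reached.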

In [17]:
# Create the YAML configuration file
# On a regular machine, you can create it manually or with nano
# Note here we are using some smaller values because the dataset is small
# For larger datasets, consider increasing: train_steps, valid_steps, warmup_steps, save_checkpoint_steps, keep_checkpoint

config = '''# config.yaml


## Where the samples will be written
save_data: run

# Training files
data:
    corpus_1:
        path_src: UN.en-zh.zh-filtered.zh.subword.train
        path_tgt: UN.en-zh.en-filtered.en.subword.train
        transforms: [filtertoolong]
    valid:
        path_src: UN.en-zh.zh-filtered.zh.subword.dev
        path_tgt: UN.en-zh.en-filtered.en.subword.dev
        transforms: [filtertoolong]

# Vocabulary files, generated by onmt_build_vocab
src_vocab: run/source.vocab
tgt_vocab: run/target.vocab

# Vocabulary size - should be the same as in SentencePiece
src_vocab_size: 50000
tgt_vocab_size: 50000

# Filter out source/target longer than n if [filtertoolong] enabled
src_seq_length: 150
tgt_seq_length: 150

# Tokenization options
src_subword_model: source.model
tgt_subword_model: target.model

# Where to save the log file and the output models/checkpoints
log_file: train.log
save_model: models/model.zhen

# Stop training if it does not improve after n validations
early_stopping: 4

# Default: 5000 - Save a model checkpoint for each n
save_checkpoint_steps: 1000

# To save space, limit checkpoints to last n
# keep_checkpoint: 3

seed: 3435

# Default: 100000 - Train the model to max n steps 
# Increase to 200000 or more for large datasets
# For fine-tuning, add up the required steps to the original steps
train_steps: 3000

# Default: 10000 - Run validation after n steps
valid_steps: 1000

# Default: 4000 - for large datasets, try up to 8000
warmup_steps: 1000
report_every: 100

# Number of GPUs, and IDs of GPUs
world_size: 1
gpu_ranks: [0]

# Batching
bucket_size: 262144
num_workers: 0  # Default: 2, set to 0 when RAM out of memory
batch_type: "tokens"
batch_size: 4096   # Tokens per batch, change when CUDA out of memory
valid_batch_size: 2048
max_generator_batches: 2
accum_count: [4]
accum_steps: [0]

# Optimization
model_dtype: "fp16"
optim: "adam"
learning_rate: 2
# warmup_steps: 8000
decay_method: "noam"
adam_beta2: 0.998
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"

# Model
encoder_type: transformer
decoder_type: transformer
position_encoding: true
enc_layers: 6
dec_layers: 6
heads: 8
hidden_size: 512
word_vec_size: 512
transformer_ff: 2048
dropout_steps: [0]
dropout: [0.1]
attention_dropout: [0.1]
'''

with open("config.yaml", "w") as config_yaml:
  config_yaml.write(config)
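
In a long hand-written configuration it is easy to repeat a key by accident, and many YAML loaders accept repeated keys without complaint, typically keeping only the last value. The following sketch flags repeated top-level keys using a simple line scan. It is not a YAML parser: `duplicated_top_level_keys` is a hypothetical helper that ignores nested keys and assumes one `key: value` per unindented line.

```python
from collections import Counter


def duplicated_top_level_keys(yaml_text):
    """Return the sorted top-level keys that appear more than once."""
    keys = []
    for line in yaml_text.splitlines():
        # Skip blank lines, comments, and indented (nested) lines
        if not line.strip() or line.lstrip().startswith("#") or line[0].isspace():
            continue
        key, separator, _ = line.partition(":")
        if separator:
            keys.append(key.strip())
    return sorted(key for key, count in Counter(keys).items() if count > 1)


sample_text = "a: 1\nb: 2\na: 3\n# a: 4\nnested:\n    a: 5\n"
print(duplicated_top_level_keys(sample_text))
```

In the notebook, the same function can be called on the `config` string before writing it to `config.yaml`.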

In [18]:
# [Optional] Check the content of the configuration file
!cat config.yaml


### Build Vocabulary

In [19]:
# Find the number of CPUs/cores on the machine
!nproc --all

16


In [20]:
# Build Vocabulary

# -config: path to your config.yaml file
# -n_sample: use -1 to build vocabulary on all the segments in the training dataset
# -num_threads: change it to match the number of CPUs to run it faster

!onmt_build_vocab -config config.yaml -n_sample -1 -num_threads 4

Corpus corpus_1's weight should be given. We default it to 1 for you.
[2023-08-22 07:32:55,693 INFO] Counter vocab from -1 samples.
[2023-08-22 07:32:55,693 INFO] n_sample=-1: Build vocab on full datasets.
[2023-08-22 07:32:55,746 INFO] Counters src: 616
[2023-08-22 07:32:55,747 INFO] Counters tgt: 573
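
Conceptually, building the vocabulary means counting how often each subword token occurs in the training files. The counts in the log are far below the configured `src_vocab_size` and `tgt_vocab_size` of 50000, presumably because only 1308 short training segments are available. The sketch below shows the counting idea on two subworded lines from this tutorial; `count_tokens` is a hypothetical helper, and the tab-separated token and count printed here are an assumption about the vocabulary file layout rather than a confirmed format.

```python
from collections import Counter


def count_tokens(lines):
    """Count whitespace-separated subword tokens across all lines."""
    counter = Counter()
    for line in lines:
        counter.update(line.split())
    return counter


sample_lines = ["▁RESOLUTION ▁ 6 1 / 2 1 2", "▁T able ▁ 2 8 . 2 4"]
counts = count_tokens(sample_lines)
for token, count in counts.most_common(3):
    print(f"{token}\t{count}")
print("Distinct tokens:", len(counts))
```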


In [21]:
# Check if the GPU is active
!nvidia-smi -L

GPU 0: GRID V100-16C (UUID: GPU-87e669c5-28e4-11b2-bca1-dacc96a729c6)


In [22]:
# Check if the GPU is visible to PyTorch

import torch

print(torch.cuda.is_available())

# Query device details only when a GPU is present; these calls fail otherwise
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
    free_bytes, total_bytes = torch.cuda.mem_get_info(0)
    print("Free GPU memory:", free_bytes/1024**2, "out of:", total_bytes/1024**2)

True
GRID V100-16C
Free GPU memory: 14599.296875 out of: 16384.0


### Training

In [23]:
!rm -rf models/

In [24]:
# Train the NMT model
!onmt_train -config config.yaml

[2023-08-22 06:45:48,548 INFO] Parsed 2 corpora from -data.
[2023-08-22 06:45:48,548 INFO] Get special vocabs from Transforms: {'src': [], 'tgt': []}.
[2023-08-22 06:45:48,555 INFO] The first 10 tokens of the vocabs are:['<unk>', '<blank>', '<s>', '</s>', '/', '▁第', '5', '号决议', '1', '6']
[2023-08-22 06:45:48,555 INFO] The decoder start token is: <s>
[2023-08-22 06:45:48,555 INFO] Building model...
[2023-08-22 06:45:49,007 INFO] Switching model to float32 for amp/apex_amp
[2023-08-22 06:45:49,008 INFO] Non quantized layer compute is fp16
[2023-08-22 06:45:50,134 INFO] NMTModel(
  (encoder): TransformerEncoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(720, 512, padding_idx=1)
        )
        (pe): PositionalEncoding()
      )
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): ModuleList(
      (0): TransformerEncoderLayer(
        (self_attn): MultiHeadedAttention(
          (linear_

### Translation

Translation Options:

    -model - specify the last model checkpoint name; try testing the quality of multiple checkpoints
    -src - the subworded test dataset, source file
    -output - give any file name to the new translation output file
    -gpu - GPU ID, usually 0 if you have one GPU. If omitted, translation runs on the CPU, which is slower.
    -min_length - [optional] to avoid empty translations
    -verbose - [optional] if you want to print translations
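
When comparing several checkpoints, it can be convenient to assemble the command from these options programmatically. The sketch below only builds the argument list from the flags listed above and prints it; `build_translate_command` is a hypothetical helper, not part of OpenNMT-py.

```python
import shlex


def build_translate_command(model, src, output, gpu=None, min_length=None, verbose=False):
    """Build an onmt_translate argument list from the options described above."""
    command = ["onmt_translate", "-model", model, "-src", src, "-output", output]
    if gpu is not None:
        command += ["-gpu", str(gpu)]          # omit to translate on the CPU
    if min_length is not None:
        command += ["-min_length", str(min_length)]
    if verbose:
        command.append("-verbose")
    return command


command = build_translate_command(
    model="models/model.zhen_step_3000.pt",
    src="UN.en-zh.zh-filtered.zh.subword.test",
    output="UN.en.translated",
    gpu=0,
    min_length=1,
)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` to run the translation outside a notebook.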

In [25]:
# Translate the "subworded" source file of the test dataset
# Change the model name, if needed.
# !onmt_translate -model models/model.fren_step_3000.pt -src UN.en-fr.fr-filtered.fr.subword.test -output UN.en.translated -gpu 0 -min_length 1
!onmt_translate -model models/model.zhen_step_3000.pt -src UN.en-zh.zh-filtered.zh.subword.test -output UN.en.translated -gpu 0 -min_length 1

[2023-08-22 07:23:33,901 INFO] Loading checkpoint from models/model.zhen_step_3000.pt
[2023-08-22 07:23:34,424 INFO] Loading data into the model
[2023-08-22 07:23:45,289 INFO] PRED SCORE: -0.2534, PRED PPL: 1.29 NB SENTENCES: 1000
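
The two numbers in the log line above are related: assuming `PRED SCORE` is the average log-likelihood per predicted token, the perplexity is its negated exponential. The short check below is consistent with the reported values.

```python
import math

pred_score = -0.2534                 # PRED SCORE from the log above
perplexity = math.exp(-pred_score)   # perplexity = exp(-average log-likelihood)
print(f"{perplexity:.2f}")           # prints 1.29, matching PRED PPL
```

A perplexity close to 1 means the model is very confident in its own output. It says nothing about whether that output matches the reference, which is what the evaluation step measures.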


In [26]:
# Check the first 5 lines of the translation file
!head -n 5 UN.en.translated

▁RESOLUTION ▁ 5 7 / 1 3 3
▁RESOLUTION ▁ 5 8 / 2 4 1
▁RESOLUTION ▁ 5 5 / 1 2 1
▁RESOLUTION ▁ 5 5 / 1 9
▁RESOLUTION ▁ 6 0 / 1 4 0


In [27]:
# If needed install/update sentencepiece
!pip3 install --upgrade -q sentencepiece

# Desubword the translation file
!python3 MT-Preparation/subwording/3-desubword.py target.model UN.en.translated

Done desubwording! Output: UN.en.translated.desubword


In [39]:
# Check the first 5 lines of the desubworded translation file
!head -n 5 UN.en.translated.desubword

RESOLUTION 62/224
RESOLUTION 55/172
17. Urges States:
RESOLUTION 61/146
RESOLUTION 62/227


In [28]:
# Desubword the target file (reference) of the test dataset
# Note: You might as well have split files *before* subwording during dataset preparation, 
# but sometimes datasets have tokenization issues, so this way you are sure the file is really untokenized.
!python3 MT-Preparation/subwording/3-desubword.py target.model UN.en-zh.en-filtered.en.subword.test

Done desubwording! Output: UN.en-zh.en-filtered.en.subword.test.desubword


In [29]:
# Check the first 5 lines of the desubworded reference
!head -n 5 UN.en-zh.en-filtered.en.subword.test.desubword

RESOLUTION 57/173
RESOLUTION 58/240
RESOLUTION 55/152
RESOLUTION 55/19
RESOLUTION 60/104


### MT Evaluation

There are several MT evaluation metrics, including BLEU, TER, METEOR, COMET, and BERTScore.

Here we are using BLEU. Files must be detokenized/desubworded beforehand.
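
BLEU is built on clipped n-gram precision: the share of n-grams in the MT output that also occur in the reference, with each n-gram counted at most as often as it appears in the reference. The sketch below computes this one ingredient on whitespace tokens. It is not a BLEU implementation: the full metric combines precisions for several n-gram orders with a brevity penalty, and sacreBLEU applies its own tokenization, so these numbers will not reproduce its score. The helper names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hypothesis, reference, n=1):
    """Share of hypothesis n-grams found in the reference, with clipped counts."""
    hyp_counts = ngrams(hypothesis.split(), n)
    ref_counts = ngrams(reference.split(), n)
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    # Each n-gram is credited at most as many times as it occurs in the reference
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched / total


print(clipped_precision("RESOLUTION 57/133", "RESOLUTION 57/173", n=1))
print(clipped_precision("RESOLUTION 57/133", "RESOLUTION 57/173", n=2))
```

With whitespace tokens, a single wrong digit in the resolution number makes the whole token a mismatch, which is one reason scores on this kind of data can look low even when translations are nearly right.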

In [30]:
# Download the BLEU script
!wget https://raw.githubusercontent.com/ymoslem/MT-Evaluation/main/BLEU/compute-bleu.py

--2023-08-22 07:24:29--  https://raw.githubusercontent.com/ymoslem/MT-Evaluation/main/BLEU/compute-bleu.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 957 [text/plain]
Saving to: ‘compute-bleu.py’


2023-08-22 07:24:30 (15.4 MB/s) - ‘compute-bleu.py’ saved [957/957]



In [31]:
# Install sacrebleu
!pip3 install sacrebleu



In [32]:
# Evaluate the desubworded translation
# Arguments: reference file, MT output file
!python3 compute-bleu.py UN.en-zh.en-filtered.en.subword.test.desubword UN.en.translated.desubword

Reference 1st sentence: RESOLUTION 57/173
MTed 1st sentence: RESOLUTION 57/133
BLEU:  22.1631887230471
