# OpenNMT

The instructions and code are taken from [Neural Machine Translation (NMT) tutorial with OpenNMT-py](https://github.com/ymoslem/OpenNMT-Tutorial) by combining
the two Jupyter notebooks together with a little bit tweaking.

### Data Gathering and Processing

In [1]:
# Create a directory and clone the Github MT-Preparation repository
!mkdir -p nmt
%cd nmt
!git clone https://github.com/ymoslem/MT-Preparation.git

/home/jovyan/nmt
Cloning into 'MT-Preparation'...
remote: Enumerating objects: 227, done.[K
remote: Counting objects: 100% (227/227), done.[K
remote: Compressing objects: 100% (124/124), done.[K
remote: Total 227 (delta 115), reused 186 (delta 94), pack-reused 0[K
Receiving objects: 100% (227/227), 54.89 KiB | 213.00 KiB/s, done.
Resolving deltas: 100% (115/115), done.


In [4]:
# Install the requirements
!pip3 install -r MT-Preparation/requirements.txt



### Datasets

Example datasets:

    EN-AR: https://object.pouta.csc.fi/OPUS-UN/v20090831/moses/ar-en.txt.zip
    EN-ES: https://object.pouta.csc.fi/OPUS-UN/v20090831/moses/en-es.txt.zip
    EN-FR: https://object.pouta.csc.fi/OPUS-UN/v20090831/moses/en-fr.txt.zip
    EN-RU: https://object.pouta.csc.fi/OPUS-UN/v20090831/moses/en-ru.txt.zip
    EN-ZH: https://object.pouta.csc.fi/OPUS-UN/v20090831/moses/en-zh.txt.zip

In [8]:
# Download and unzip a dataset
!wget https://object.pouta.csc.fi/OPUS-UN/v20090831/moses/en-fr.txt.zip
!unzip en-fr.txt.zip

--2023-06-30 23:52:35--  https://object.pouta.csc.fi/OPUS-UN/v20090831/moses/en-fr.txt.zip
Resolving object.pouta.csc.fi (object.pouta.csc.fi)... 86.50.254.18, 86.50.254.19
Connecting to object.pouta.csc.fi (object.pouta.csc.fi)|86.50.254.18|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10014972 (9.6M) [application/zip]
Saving to: ‘en-fr.txt.zip’


2023-06-30 23:52:51 (705 KB/s) - ‘en-fr.txt.zip’ saved [10014972/10014972]

Archive:  en-fr.txt.zip
  inflating: UN.en-fr.en             
  inflating: UN.en-fr.fr             
  inflating: README                  


In [7]:
import os
os.environ['HTTP_PROXY'] = 'http://proxy.vmware.com:3128'
os.environ['HTTPS_PROXY'] = 'http://proxy.vmware.com:3128'

In [9]:
# Filter the dataset
# Arguments: source file, target file, source language, target language
!python3 MT-Preparation/filtering/filter.py UN.en-fr.fr UN.en-fr.en fr en

Dataframe shape (rows, columns): (74067, 2)
--- Rows with Empty Cells Deleted	--> Rows: 74067
--- Duplicates Deleted			--> Rows: 60662
--- Source-Copied Rows Deleted		--> Rows: 60476
--- Too Long Source/Target Deleted	--> Rows: 59719
--- HTML Removed			--> Rows: 59719
--- Rows will remain in true-cased	--> Rows: 59719
--- Rows with Empty Cells Deleted	--> Rows: 59719
--- Rows Shuffled			--> Rows: 59719
--- Source Saved: UN.en-fr.fr-filtered.fr
--- Target Saved: UN.en-fr.en-filtered.en


### Tokenization / Sub-wording

In [10]:
!ls MT-Preparation/subwording/

1-train_bpe.py	1-train_unigram.py  2-subword.py  3-desubword.py


In [11]:
# Train a SentencePiece model for subword tokenization
!python3 MT-Preparation/subwording/1-train_unigram.py UN.en-fr.fr-filtered.fr UN.en-fr.en-filtered.en

sentencepiece_trainer.cc(177) LOG(INFO) Running command: --input=UN.en-fr.fr-filtered.fr --model_prefix=source --vocab_size=50000 --hard_vocab_limit=false --split_digits=true
sentencepiece_trainer.cc(77) LOG(INFO) Starts training with : 
trainer_spec {
  input: UN.en-fr.fr-filtered.fr
  input_format: 
  model_prefix: source
  model_type: UNIGRAM
  vocab_size: 50000
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 1
  pretokenization_delimiter: 
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  hard_vocab_limit: 0
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
 

In [12]:
!ls

MT-Preparation	UN.en-fr.en-filtered.en  en-fr.txt.zip	target.model
README		UN.en-fr.fr		 source.model	target.vocab
UN.en-fr.en	UN.en-fr.fr-filtered.fr  source.vocab


In [13]:
# Subword the dataset
!python3 MT-Preparation/subwording/2-subword.py source.model target.model UN.en-fr.fr-filtered.fr UN.en-fr.en-filtered.en

Source Model: source.model
Target Model: target.model
Source Dataset: UN.en-fr.fr-filtered.fr
Target Dataset: UN.en-fr.en-filtered.en
Done subwording the source file! Output: UN.en-fr.fr-filtered.fr.subword
Done subwording the target file! Output: UN.en-fr.en-filtered.en.subword


In [14]:
# First 3 lines before subwording
!head -n 3 UN.en-fr.fr-filtered.fr && echo "-----" && head -n 3 UN.en-fr.en-filtered.en

Notant qu'elle a adopté à sa dixième session extraordinaire des principes directeurs essentiels pour progresser sur la voie du désarmement général et completRésolution S-10/2.,
l) Le rapport du Secrétaire général intitulé « Plan de campagne pour la mise en œuvre de la Déclaration du Millénaire »A/56/326 ; voir également le rapport du Secrétaire général sur l'application de la Déclaration du Millénaire adoptée par l'Organisation des Nations Unies (A/58/323), par. 23., en particulier ses paragraphes 56 à 61,
Considérant que tous les États, notamment ceux qui sont particulièrement avancés dans le domaine spatial, doivent s'employer activement à empêcher une course aux armements dans l'espace, condition essentielle pour promouvoir et renforcer la coopération internationale touchant l'exploration et l'utilisation de l'espace à des fins pacifiques,
-----
Noting that essential guidelines for progress towards general and complete disarmament were adopted at the tenth special session of the Gen

In [15]:
# First 3 lines after subwording
!head -n 3 UN.en-fr.fr-filtered.fr.subword && echo "---" && head -n 3 UN.en-fr.en-filtered.en.subword

▁Notant ▁qu ' elle ▁a ▁adopté ▁à ▁sa ▁dixième ▁session ▁extraordinaire ▁des ▁principes ▁directeurs ▁essentiels ▁pour ▁progresser ▁sur ▁la ▁voie ▁du ▁désarmement ▁général ▁et ▁complet Résolution ▁S - 1 0 / 2 .,
▁l ) ▁Le ▁rapport ▁du ▁Secrétaire ▁général ▁intitulé ▁« ▁Plan ▁de ▁campagne ▁pour ▁la ▁mise ▁en ▁œuvre ▁de ▁la ▁Déclaration ▁du ▁Millénaire ▁» A / 5 6 / 3 2 6 ▁; ▁voir ▁également ▁le ▁rapport ▁du ▁Secrétaire ▁général ▁sur ▁l ' application ▁de ▁la ▁Déclaration ▁du ▁Millénaire ▁adoptée ▁par ▁l ' Organisation ▁des ▁Nations ▁Unies ▁( A / 5 8 / 3 2 3 ), ▁par . ▁ 2 3 ., ▁en ▁particulier ▁ses ▁paragraphe s ▁ 5 6 ▁à ▁ 6 1 ,
▁Considérant ▁que ▁tous ▁les ▁États , ▁notamment ▁ceux ▁qui ▁sont ▁ particulièrement ▁avancés ▁d ans ▁le ▁domaine ▁spatial , ▁doivent ▁s ' employer ▁activement ▁à ▁empêcher ▁une ▁cours e ▁aux ▁armements ▁d ans ▁l ' espace , ▁condition ▁essentielle ▁pour ▁promouvoir ▁et ▁renforcer ▁la ▁coopération ▁internationale ▁touchant ▁l ' exploration ▁et ▁l ' utilisation ▁de ▁l '

### Data Splitting

We usually split our dataset into 3 portions:

    1. training dataset - used for training the model;
    2. development dataset - used to run regular validations during the training to help improve the model parameters; and
    3. testing dataset - a holdout dataset used after the model finishes training to finally evaluate the model on unseen data.

In [16]:
# Split the dataset into training set, development set, and test set
# Development and test sets should be between 1000 and 5000 segments (here we chose 2000)
!python3 MT-Preparation/train_dev_split/train_dev_test_split.py 2000 2000 UN.en-fr.fr-filtered.fr.subword UN.en-fr.en-filtered.en.subword

Dataframe shape: (59719, 2)
--- Empty Cells Deleted --> Rows: 59719
--- Wrote Files
Done!
Output files
UN.en-fr.fr-filtered.fr.subword.train
UN.en-fr.en-filtered.en.subword.train
UN.en-fr.fr-filtered.fr.subword.dev
UN.en-fr.en-filtered.en.subword.dev
UN.en-fr.fr-filtered.fr.subword.test
UN.en-fr.en-filtered.en.subword.test


In [17]:
# Line count for the subworded train, dev, test datatest
!wc -l *.subword.*

    2000 UN.en-fr.en-filtered.en.subword.dev
    2000 UN.en-fr.en-filtered.en.subword.test
   55719 UN.en-fr.en-filtered.en.subword.train
    2000 UN.en-fr.fr-filtered.fr.subword.dev
    2000 UN.en-fr.fr-filtered.fr.subword.test
   55719 UN.en-fr.fr-filtered.fr.subword.train
  119438 total


In [18]:
# Check the first and last line from each dataset

# -------------------------------------------
# Change this cell to print your name
!echo -e "My name is: FirstName SecondName \n"
# -------------------------------------------

!echo "---First line---"
!head -n 1 *.{train,dev,test}

!echo -e "\n---Last line---"
!tail -n 1 *.{train,dev,test}

My name is: FirstName SecondName 

---First line---
==> UN.en-fr.en-filtered.en.subword.train <==
▁( l ) ▁The ▁report ▁of ▁the ▁Secretary - General ▁entitled ▁" R oad ▁map ▁towards ▁implementation ▁of ▁the ▁Unit ed ▁Nations ▁Millennium ▁Declaration ", A / 5 6 / 3 2 6 ; ▁see ▁also ▁the ▁report ▁of ▁the ▁Secretary - General ▁on ▁the ▁implementation ▁of ▁the ▁Unit ed ▁Nations ▁Millennium ▁Declaration ▁( A / 5 8 / 3 2 3 ), ▁para . ▁ 2 3 . ▁in ▁particular ▁paragraphs ▁ 5 6 ▁to ▁ 6 1 ▁thereof ,

==> UN.en-fr.fr-filtered.fr.subword.train <==
▁l ) ▁Le ▁rapport ▁du ▁Secrétaire ▁général ▁intitulé ▁« ▁Plan ▁de ▁campagne ▁pour ▁la ▁mise ▁en ▁œuvre ▁de ▁la ▁Déclaration ▁du ▁Millénaire ▁» A / 5 6 / 3 2 6 ▁; ▁voir ▁également ▁le ▁rapport ▁du ▁Secrétaire ▁général ▁sur ▁l ' application ▁de ▁la ▁Déclaration ▁du ▁Millénaire ▁adoptée ▁par ▁l ' Organisation ▁des ▁Nations ▁Unies ▁( A / 5 8 / 3 2 3 ), ▁par . ▁ 2 3 ., ▁en ▁particulier ▁ses ▁paragraphe s ▁ 5 6 ▁à ▁ 6 1 ,

==> UN.en-fr.en-filtered.en.subword.de

In [19]:
# Install OpenNMT-py 3.x
!pip3 install OpenNMT-py

Collecting OpenNMT-py
  Downloading OpenNMT_py-3.3-py3-none-any.whl (242 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m242.9/242.9 kB[0m [31m518.1 kB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
Collecting configargparse (from OpenNMT-py)
  Downloading ConfigArgParse-1.5.5-py3-none-any.whl (25 kB)
Collecting waitress (from OpenNMT-py)
  Downloading waitress-2.1.2-py3-none-any.whl (57 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m57.7/57.7 kB[0m [31m5.5 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting pyonmttok<2,>=1.35 (from OpenNMT-py)
  Downloading pyonmttok-1.37.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.0 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m17.0/17.0 MB[0m [31m3.3 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0mm
Collecting sacrebleu (from OpenNMT-py)
  Using cached sacrebleu-2.3.1-py3-none-any.whl (118 kB)
Collecting rapidfuzz (from OpenNMT-py)
  Downloading rapidfuzz-3.1.1-cp39-cp39-ma

In [20]:
# Open the folder where you saved your prepapred datasets from the first exercise
!ls

MT-Preparation			       UN.en-fr.fr-filtered.fr.subword
README				       UN.en-fr.fr-filtered.fr.subword.dev
UN.en-fr.en			       UN.en-fr.fr-filtered.fr.subword.test
UN.en-fr.en-filtered.en		       UN.en-fr.fr-filtered.fr.subword.train
UN.en-fr.en-filtered.en.subword        en-fr.txt.zip
UN.en-fr.en-filtered.en.subword.dev    source.model
UN.en-fr.en-filtered.en.subword.test   source.vocab
UN.en-fr.en-filtered.en.subword.train  target.model
UN.en-fr.fr			       target.vocab
UN.en-fr.fr-filtered.fr


In [21]:
pwd

'/home/jovyan/nmt'

### Create the Training Configuration File

The following config file matches most of the recommended values for the Transformer model [Vaswani et al., 2017](https://arxiv.org/abs/1706.03762). As the current dataset is small, we reduced the following values:

    train_steps - for datasets with a few millions of sentences, consider using a value between 100000 and 200000, or more! Enabling the option early_stopping can help stop the training when there is no considerable improvement.
    valid_steps - 10000 can be good if the value train_steps is big enough.
    warmup_steps - obviously, its value must be less than train_steps. Try 4000 and 8000 values.

In [22]:
# Create the YAML configuration file
# On a regular machine, you can create it manually or with nano
# Note here we are using some smaller values because the dataset is small
# For larger datasets, consider increasing: train_steps, valid_steps, warmup_steps, save_checkpoint_steps, keep_checkpoint

config = '''# config.yaml


## Where the samples will be written
save_data: run

# Training files
data:
    corpus_1:
        path_src: UN.en-fr.fr-filtered.fr.subword.train
        path_tgt: UN.en-fr.en-filtered.en.subword.train
        transforms: [filtertoolong]
    valid:
        path_src: UN.en-fr.fr-filtered.fr.subword.dev
        path_tgt: UN.en-fr.en-filtered.en.subword.dev
        transforms: [filtertoolong]

# Vocabulary files, generated by onmt_build_vocab
src_vocab: run/source.vocab
tgt_vocab: run/target.vocab

# Vocabulary size - should be the same as in sentence piece
src_vocab_size: 50000
tgt_vocab_size: 50000

# Filter out source/target longer than n if [filtertoolong] enabled
src_seq_length: 150
src_seq_length: 150

# Tokenization options
src_subword_model: source.model
tgt_subword_model: target.model

# Where to save the log file and the output models/checkpoints
log_file: train.log
save_model: models/model.fren

# Stop training if it does not imporve after n validations
early_stopping: 4

# Default: 5000 - Save a model checkpoint for each n
save_checkpoint_steps: 1000

# To save space, limit checkpoints to last n
# keep_checkpoint: 3

seed: 3435

# Default: 100000 - Train the model to max n steps 
# Increase to 200000 or more for large datasets
# For fine-tuning, add up the required steps to the original steps
train_steps: 3000

# Default: 10000 - Run validation after n steps
valid_steps: 1000

# Default: 4000 - for large datasets, try up to 8000
warmup_steps: 1000
report_every: 100

# Number of GPUs, and IDs of GPUs
world_size: 1
gpu_ranks: [0]

# Batching
bucket_size: 262144
num_workers: 0  # Default: 2, set to 0 when RAM out of memory
batch_type: "tokens"
batch_size: 4096   # Tokens per batch, change when CUDA out of memory
valid_batch_size: 2048
max_generator_batches: 2
accum_count: [4]
accum_steps: [0]

# Optimization
model_dtype: "fp16"
optim: "adam"
learning_rate: 2
# warmup_steps: 8000
decay_method: "noam"
adam_beta2: 0.998
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"

# Model
encoder_type: transformer
decoder_type: transformer
position_encoding: true
enc_layers: 6
dec_layers: 6
heads: 8
hidden_size: 512
word_vec_size: 512
transformer_ff: 2048
dropout_steps: [0]
dropout: [0.1]
attention_dropout: [0.1]
'''

with open("config.yaml", "w+") as config_yaml:
  config_yaml.write(config)

In [23]:
# [Optional] Check the content of the configuration file
!cat config.yaml

# config.yaml


## Where the samples will be written
save_data: run

# Training files
data:
    corpus_1:
        path_src: UN.en-fr.fr-filtered.fr.subword.train
        path_tgt: UN.en-fr.en-filtered.en.subword.train
        transforms: [filtertoolong]
    valid:
        path_src: UN.en-fr.fr-filtered.fr.subword.dev
        path_tgt: UN.en-fr.en-filtered.en.subword.dev
        transforms: [filtertoolong]

# Vocabulary files, generated by onmt_build_vocab
src_vocab: run/source.vocab
tgt_vocab: run/target.vocab

# Vocabulary size - should be the same as in sentence piece
src_vocab_size: 50000
tgt_vocab_size: 50000

# Filter out source/target longer than n if [filtertoolong] enabled
src_seq_length: 150
src_seq_length: 150

# Tokenization options
src_subword_model: source.model
tgt_subword_model: target.model

# Where to save the log file and the output models/checkpoints
log_file: train.log
save_model: models/model.fren

# Stop training if it does not imporve after n validations
early_st

### Build Vocabulary

In [24]:
# Find the number of CPUs/cores on the machine
!nproc --all

16


In [25]:
# Build Vocabulary

# -config: path to your config.yaml file
# -n_sample: use -1 to build vocabulary on all the segment in the training dataset
# -num_threads: change it to match the number of CPUs to run it faster

!onmt_build_vocab -config config.yaml -n_sample -1 -num_threads 4

Corpus corpus_1's weight should be given. We default it to 1 for you.
[2023-07-01 00:04:41,904 INFO] Counter vocab from -1 samples.
[2023-07-01 00:04:41,904 INFO] n_sample=-1: Build vocab on full datasets.
[2023-07-01 00:04:43,343 INFO] * Transform statistics for corpus_1(25.00%):
			* FilterTooLongStats(filtered=1104)

[2023-07-01 00:04:43,351 INFO] * Transform statistics for corpus_1(25.00%):
			* FilterTooLongStats(filtered=1040)

[2023-07-01 00:04:43,354 INFO] * Transform statistics for corpus_1(25.00%):
			* FilterTooLongStats(filtered=1035)

[2023-07-01 00:04:43,384 INFO] * Transform statistics for corpus_1(25.00%):
			* FilterTooLongStats(filtered=1025)

[2023-07-01 00:04:43,439 INFO] Counters src: 14695
[2023-07-01 00:04:43,439 INFO] Counters tgt: 11879


In [26]:
# Check if the GPU is active
!nvidia-smi -L

GPU 0: GRID V100-16C (UUID: GPU-87e669c5-28e4-11b2-bca1-dacc96a729c6)


In [27]:
# Check if the GPU is visable to PyTorch

import torch

print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))

gpu_memory = torch.cuda.mem_get_info(0)
print("Free GPU memory:", gpu_memory[0]/1024**2, "out of:", gpu_memory[1]/1024**2)

True
GRID V100-16C
Free GPU memory: 14907.546875 out of: 16384.0


### Training

In [28]:
!rm -rf models/

In [29]:
# Train the NMT model
!onmt_train -config config.yaml

[2023-07-01 00:07:22,793 INFO] Parsed 2 corpora from -data.
[2023-07-01 00:07:22,793 INFO] Get special vocabs from Transforms: {'src': [], 'tgt': []}.
[2023-07-01 00:07:22,873 INFO] The first 10 tokens of the vocabs are:['<unk>', '<blank>', '<s>', '</s>', '▁de', ',', "'", '▁et', '▁', '▁la']
[2023-07-01 00:07:22,874 INFO] The decoder start token is: <s>
[2023-07-01 00:07:22,874 INFO] Building model...
[2023-07-01 00:07:23,557 INFO] Switching model to float32 for amp/apex_amp
[2023-07-01 00:07:23,557 INFO] Non quantized layer compute is fp16
[2023-07-01 00:07:25,009 INFO] NMTModel(
  (encoder): TransformerEncoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(14704, 512, padding_idx=1)
        )
        (pe): PositionalEncoding()
      )
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): ModuleList(
      (0-5): 6 x TransformerEncoderLayer(
        (self_attn): MultiHeadedAttention(
       

### Translation

Translation Options:

    -model - specify the last model checkpoint name; try testing the quality of multiple checkpoints
    -src - the subworded test dataset, source file
    -output - give any file name to the new translation output file
    -gpu - GPU ID, usually 0 if you have one GPU. Otherwise, it will translate on CPU, which would be slower.
    -min_length - [optional] to avoid empty translations
    -verbose - [optional] if you want to print translations

In [30]:
# Translate the "subworded" source file of the test dataset
# Change the model name, if needed.
!onmt_translate -model models/model.fren_step_3000.pt -src UN.en-fr.fr-filtered.fr.subword.test -output UN.en.translated -gpu 0 -min_length 1

[2023-07-01 00:33:32,612 INFO] Loading checkpoint from models/model.fren_step_3000.pt
[2023-07-01 00:33:33,294 INFO] Loading data into the model
[2023-07-01 00:36:04,753 INFO] PRED SCORE: -0.2071, PRED PPL: 1.23 NB SENTENCES: 2000


In [31]:
# Check the first 5 lines of the translation file
!head -n 5 UN.en.translated

▁Recalling ▁also ▁Security ▁Council ▁resolution ▁ 1 4 0 ▁( 2 0 0 2 ) ▁of ▁ 1 7 ▁May ▁ 2 0 0 2 , ▁by ▁which ▁the ▁Council ▁established ▁the ▁Unit ed ▁Nations ▁Mission ▁of ▁Support ▁in ▁East ▁Timor ▁for ▁an ▁initial ▁period ▁of ▁twelve ▁months ▁as ▁from ▁ 2 0 ▁May ▁ 2 0 0 2 , ▁and ▁the ▁subsequent ▁resolution ▁ 1 8 0 3 ▁( 2 0 0 3 ) ▁of ▁ 1 9 ▁May ▁ 2 0 0 3 , ▁by ▁which ▁the ▁Council ▁extended ▁the ▁mandate ▁of ▁the ▁Mission ▁unti l ▁ 2 0 ▁May ▁ 2 0 0 4 ,
▁Recalling ▁further ▁the ▁ 2 0 0 5 ▁World ▁Summit ▁Outcome , See ▁resolution ▁ 6 0 / 1 . ▁and ▁all ▁relevant ▁General ▁Assembly ▁resolutions , ▁in ▁particular ▁th ose ▁that ▁have ▁taken ▁place ▁in ▁the ▁economic , ▁social ▁and ▁related ▁fields , ▁including ▁its ▁resolution ▁ 6 0 / 2 6 5 ▁of ▁ 3 0 ▁June ▁ 2 0 0 6 , ▁entitled ▁" Implementation ▁of ▁the ▁outcome ▁of ▁the ▁ 2 0 0 5 ▁World ▁Summit , ▁including ▁the ▁Millennium ▁Development ▁Goals ▁and ▁the ▁other ▁internationally ▁agreed ▁development ▁goals ,
▁ 1 6 . ▁Welcomes ▁the ▁efforts ▁

In [32]:
# If needed install/update sentencepiece
!pip3 install --upgrade -q sentencepiece

# Desubword the translation file
!python3 MT-Preparation/subwording/3-desubword.py target.model UN.en.translated

Done desubwording! Output: UN.en.translated.desubword


In [33]:
# Check the first 5 lines of the desubworded translation file
!head -n 5 UN.en.translated.desubword

Recalling also Security Council resolution 140 (2002) of 17 May 2002, by which the Council established the United Nations Mission of Support in East Timor for an initial period of twelve months as from 20 May 2002, and the subsequent resolution 1803 (2003) of 19 May 2003, by which the Council extended the mandate of the Mission until 20 May 2004,
Recalling further the 2005 World Summit Outcome,See resolution 60/1. and all relevant General Assembly resolutions, in particular those that have taken place in the economic, social and related fields, including its resolution 60/265 of 30 June 2006, entitled "Implementation of the outcome of the 2005 World Summit, including the Millennium Development Goals and the other internationally agreed development goals,
16. Welcomes the efforts undertaken so far to enhance the security consciousness within the United Nations system, and requests the Secretary-General to continue to take the necessary measures in this regard, including by further devel

In [34]:
# Desubword the target file (reference) of the test dataset
# Note: You might as well have split files *before* subwording during dataset preperation, 
# but sometimes datasets have tokeniztion issues, so this way you are sure the file is really untokenized.
!python3 MT-Preparation/subwording/3-desubword.py target.model UN.en-fr.en-filtered.en.subword.test

Done desubwording! Output: UN.en-fr.en-filtered.en.subword.test.desubword


In [35]:
# Check the first 5 lines of the desubworded reference
!head -n 5 UN.en-fr.en-filtered.en.subword.test.desubword

Recalling also Security Council resolution 1410 (2002) of 17 May 2002, by which the Council established the United Nations Mission of Support in East Timor as of 20 May 2002 for an initial period of twelve months, and its subsequent resolution 1480 (2003) of 19 May 2003, by which the Council extended the mandate of the Mission until 20 May 2004,
Recalling further the 2005 World Summit OutcomeSee resolution 60/1. and all relevant General Assembly resolutions, in particular those that have built upon the 2005 World Summit Outcome, in the economic, social and related fields, including General Assembly resolution 60/265 of 30 June 2006 on follow-up to the development outcome of the 2005 World Summit, including the Millennium Development Goals and the other internationally agreed development goals,
16. Welcomes ongoing efforts to promote and enhance the security consciousness within the organizational culture of the United Nations system, and requests the Secretary-General to continue to ta

### MT Evaluation

There are several MT Evaluation metrics such as BLEU, TER, METEOR, COMET, BERTScore, among others.

Here we are using BLEU. Files must be detokenized/desubworded beforehand.

In [36]:
# Download the BLEU script
!wget https://raw.githubusercontent.com/ymoslem/MT-Evaluation/main/BLEU/compute-bleu.py

--2023-07-01 00:38:27--  https://raw.githubusercontent.com/ymoslem/MT-Evaluation/main/BLEU/compute-bleu.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 957 [text/plain]
Saving to: ‘compute-bleu.py’


2023-07-01 00:38:29 (19.2 MB/s) - ‘compute-bleu.py’ saved [957/957]



In [37]:
# Install sacrebleu
!pip3 install sacrebleu



In [38]:
# Evaluate the translation (without subwording)
!python3 compute-bleu.py UN.en-fr.en-filtered.en.subword.test.desubword UN.en.translated.desubword

Reference 1st sentence: Recalling also Security Council resolution 1410 (2002) of 17 May 2002, by which the Council established the United Nations Mission of Support in East Timor as of 20 May 2002 for an initial period of twelve months, and its subsequent resolution 1480 (2003) of 19 May 2003, by which the Council extended the mandate of the Mission until 20 May 2004,
MTed 1st sentence: Recalling also Security Council resolution 140 (2002) of 17 May 2002, by which the Council established the United Nations Mission of Support in East Timor for an initial period of twelve months as from 20 May 2002, and the subsequent resolution 1803 (2003) of 19 May 2003, by which the Council extended the mandate of the Mission until 20 May 2004,
BLEU:  66.52958587194361
