# Neural Machine Translation (NMT)

## Dependencies and initial setup

In [1]:
import requests as req

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
%cd /content/drive/MyDrive/nmt0/

/content/drive/MyDrive/nmt0


In [4]:
!git clone https://github.com/ymoslem/MT-Preparation.git

fatal: destination path 'MT-Preparation' already exists and is not an empty directory.


In [5]:
!pip3 install -r MT-Preparation/requirements.txt

Collecting sentencepiece (from -r MT-Preparation/requirements.txt (line 3))
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.99


In [40]:
!pip3 install --upgrade -q sentencepiece

In [6]:
!pip install OpenNMT-py

Collecting OpenNMT-py
  Downloading OpenNMT_py-3.4.3-py3-none-any.whl (257 kB)
Collecting configargparse (from OpenNMT-py)
  Downloading ConfigArgParse-1.7-py3-none-any.whl (25 kB)
Collecting ctranslate2<4,>=3.17 (from OpenNMT-py)
  Downloading ctranslate2-3.22.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (36.8 MB)
Collecting waitress (from OpenNMT-py)
  Downloading waitress-2.1.2-py3-none-any.whl (57 kB)

## Development

In [7]:
def get_corpus_files(base_url, files):
    """Download each file under base_url and return (filename, text) pairs."""
    corpus = []
    for file in files:
        response = req.get(base_url + file)
        response.raise_for_status()  # fail early instead of saving an HTTP error page
        corpus.append((file, response.text))

    return corpus


def get_parallel_corpus(corpus_info: dict):
    """Fetch the dev/train (and test, for lang2) files of a parallel corpus."""
    base_url = "https://raw.githubusercontent.com/"
    base_url += "AmericasNLP/americasnlp2021/main/data/" + corpus_info["name"] + "/"

    # The lang1 test file is downloaded later from a separate test_data/ path.
    files1 = [f"{file}.{corpus_info['lang1_code']}" for file in ["dev", "train"]]
    files2 = [f"{file}.{corpus_info['lang2_code']}" for file in ["dev", "train", "test"]]
    lang1 = get_corpus_files(base_url, files1)
    lang2 = get_corpus_files(base_url, files2)

    return (lang1, lang2)

In [8]:
corpus_info = {
    "name": "guarani-spanish",
    "lang1_code": "gn",
    "lang2_code": "es"
}

We download the Guarani-Spanish parallel corpus, which provides:

- `train.gn` and `dev.gn` for Guarani.
- `train.es`, `dev.es`, and `test.es` for Spanish.
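
The mapping from corpus description to download URLs can be sketched as a pure function, which is easy to check before any request is made. This is only an illustration: `corpus_urls` and `SPLITS` are hypothetical names that are not part of the notebook, and the URL layout is copied from `get_parallel_corpus`.

```python
# Hypothetical helper: lists the raw GitHub URLs that the notebook downloads.
BASE_URL = "https://raw.githubusercontent.com/AmericasNLP/americasnlp2021/main/data/"

# Splits fetched per language slot, mirroring get_parallel_corpus.
SPLITS = {"lang1_code": ["dev", "train"], "lang2_code": ["dev", "train", "test"]}


def corpus_urls(corpus_info: dict) -> dict:
    """Return a {filename: url} mapping for every file of the parallel corpus."""
    urls = {}
    for code_key, splits in SPLITS.items():
        lang = corpus_info[code_key]
        for split in splits:
            filename = f"{split}.{lang}"
            urls[filename] = f"{BASE_URL}{corpus_info['name']}/{filename}"
    return urls


if __name__ == "__main__":
    info = {"name": "guarani-spanish", "lang1_code": "gn", "lang2_code": "es"}
    for name, url in corpus_urls(info).items():
        print(name, url)
```

Keeping URL construction separate from the download makes it simple to inspect which files will be requested for another language pair.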

In [9]:
guarani, spanish = get_parallel_corpus(corpus_info)

To use MT-Preparation, the corpora must first be saved to files.

In [10]:
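def write_corpus(lang_corpus):
    # Each item is a (filename, text) pair; write UTF-8 explicitly so that
    # accented and Guarani characters do not depend on the platform default.
    for name, corpus in lang_corpus:
        with open(name, "w", encoding="utf-8") as f:
            f.write(corpus)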

In [11]:
write_corpus(guarani)
write_corpus(spanish)

In [12]:
!ls

dev.es	dev.gn	MT-Preparation	practica  test.es  train.es  train.gn


Filtering is applied to both the `train` and `dev` files of each language.

In [13]:
!python3 MT-Preparation/filtering/filter.py train.es train.gn es gn

Dataframe shape (rows, columns): (26032, 2)
--- Rows with Empty Cells Deleted	--> Rows: 26032
--- Duplicates Deleted			--> Rows: 14500
--- Source-Copied Rows Deleted		--> Rows: 14500
--- Too Long Source/Target Deleted	--> Rows: 13463
--- HTML Removed			--> Rows: 13463
--- Rows will remain in true-cased	--> Rows: 13463
--- Rows with Empty Cells Deleted	--> Rows: 13463
--- Rows Shuffled			--> Rows: 13463
--- Source Saved: train.es-filtered.es
--- Target Saved: train.gn-filtered.gn


In [14]:
!python3 MT-Preparation/filtering/filter.py dev.es dev.gn es gn

Dataframe shape (rows, columns): (995, 2)
--- Rows with Empty Cells Deleted	--> Rows: 994
--- Duplicates Deleted			--> Rows: 994
--- Source-Copied Rows Deleted		--> Rows: 994
--- Too Long Source/Target Deleted	--> Rows: 865
--- HTML Removed			--> Rows: 865
--- Rows will remain in true-cased	--> Rows: 865
--- Rows with Empty Cells Deleted	--> Rows: 865
--- Rows Shuffled			--> Rows: 865
--- Source Saved: dev.es-filtered.es
--- Target Saved: dev.gn-filtered.gn
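
The log above names the kinds of rules the filter applies: empty cells, duplicates, source-copied rows, and overly long rows. The sketch below mimics those rules on an in-memory list of sentence pairs. It is not MT-Preparation's actual code; `filter_pairs`, `max_words`, and `max_ratio`, as well as their default values, are assumptions made for illustration.

```python
def filter_pairs(pairs, max_words=200, max_ratio=2.0):
    """Keep only (source, target) pairs that pass a few simple sanity rules.

    The thresholds are hypothetical and not taken from MT-Preparation.
    """
    seen = set()
    kept = []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:          # a side is empty
            continue
        if src == tgt:                  # source copied into the target
            continue
        if (src, tgt) in seen:          # exact duplicate pair
            continue
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if max(n_src, n_tgt) > max_words:   # too long in absolute terms
            continue
        if n_src > max_ratio * n_tgt or n_tgt > max_ratio * n_src:
            continue                    # one side is far longer than the other
        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept


if __name__ == "__main__":
    sample = [("Hola", "Mba'éichapa"), ("Hola", "Mba'éichapa"), ("", "x")]
    print(filter_pairs(sample))
```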


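The next step is to create the subwords for all the generated files: a SentencePiece unigram model is trained on the filtered training data of each side and then applied to the train and dev files.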

In [15]:
!python3 MT-Preparation/subwording/1-train_unigram.py train.es-filtered.es train.gn-filtered.gn

sentencepiece_trainer.cc(177) LOG(INFO) Running command: --input=train.es-filtered.es --model_prefix=source --vocab_size=50000 --hard_vocab_limit=false --split_digits=true
sentencepiece_trainer.cc(77) LOG(INFO) Starts training with : 
trainer_spec {
  input: train.es-filtered.es
  input_format: 
  model_prefix: source
  model_type: UNIGRAM
  vocab_size: 50000
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 1
  pretokenization_delimiter: 
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  hard_vocab_limit: 0
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
  eos_i

In [16]:
!python3 MT-Preparation/subwording/2-subword.py source.model target.model train.es-filtered.es train.gn-filtered.gn

Source Model: source.model
Target Model: target.model
Source Dataset: train.es-filtered.es
Target Dataset: train.gn-filtered.gn
Done subwording the source file! Output: train.es-filtered.es.subword
Done subwording the target file! Output: train.gn-filtered.gn.subword


In [19]:
!python3 MT-Preparation/subwording/2-subword.py source.model target.model dev.es-filtered.es dev.gn-filtered.gn

Source Model: source.model
Target Model: target.model
Source Dataset: dev.es-filtered.es
Target Dataset: dev.gn-filtered.gn
Done subwording the source file! Output: dev.es-filtered.es.subword
Done subwording the target file! Output: dev.gn-filtered.gn.subword


In [20]:
!ls

config.yaml		    dev.gn-filtered.gn		source.vocab  train.es-filtered.es
dev.es			    dev.gn-filtered.gn.subword	target.model  train.es-filtered.es.subword
dev.es-filtered.es	    MT-Preparation		target.vocab  train.gn
dev.es-filtered.es.subword  practica			test.es       train.gn-filtered.gn
dev.gn			    source.model		train.es      train.gn-filtered.gn.subword


We can now build the vocabulary. The resulting files are `source.onmt.vocab` and `target.onmt.vocab`.
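
Conceptually, building a vocabulary from already-subworded text amounts to counting whitespace-separated tokens. The sketch below shows that idea only. It does not reproduce `onmt_build_vocab`; the function names are hypothetical, and the tab-separated `token<TAB>count` line layout is an assumption about the output format.

```python
from collections import Counter


def count_subwords(lines):
    """Count whitespace-separated subword tokens over an iterable of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def format_vocab(counts):
    """Render counts as 'token<TAB>count' lines, most frequent first (assumed layout)."""
    return [f"{token}\t{n}" for token, n in counts.most_common()]


if __name__ == "__main__":
    # "\u2581" is the word-boundary marker used by SentencePiece pieces.
    lines = ["\u2581Che \u2581ro hayhu", "\u2581Che \u2581ko"]
    print("\n".join(format_vocab(count_subwords(lines))))
```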

In [21]:
# Create the configuration file
# Small values are used because the corpus is limited
# For large datasets these values should be increased:
# train_steps, valid_steps, warmup_steps, save_checkpoint_steps, keep_checkpoint
SRC_DATA_NAME = "es-filtered.es.subword"
TARGET_DATA_NAME = "gn-filtered.gn.subword"


In [24]:
config = f'''# config.yaml

## Where the samples will be written
save_data: run

# Training file paths
# (subword tokenization already applied)
data:
    corpus_1:
        path_src: train.{SRC_DATA_NAME}
        path_tgt: train.{TARGET_DATA_NAME}
        transforms: [filtertoolong]
    valid:
        path_src: dev.{SRC_DATA_NAME}
        path_tgt: dev.{TARGET_DATA_NAME}
        transforms: [filtertoolong]

# Vocabularies (will be generated by `onmt_build_vocab`)
src_vocab: source.onmt.vocab
tgt_vocab: target.onmt.vocab

# Vocabulary size
# (must match the parameter used by the subword tokenization algorithm)
src_vocab_size: 50000
tgt_vocab_size: 50000

# Filter out sentences longer than n tokens
# (only takes effect if [filtertoolong] is present)
src_seq_length: 150
tgt_seq_length: 150

# Tokenizadores
src_subword_model: source.model
tgt_subword_model: target.model

# Files where logs and model checkpoints will be saved
log_file: train.log
save_model: models/model.enes

# Early-stopping condition: stop if there is no significant improvement
# after n validations
early_stopping: 4

# Save a model checkpoint every n steps
save_checkpoint_steps: 1000

# Keep the last n checkpoints
keep_checkpoint: 3

# Reproducibility
seed: 3435

# Train the model for at most n steps
# Default: 100,000
train_steps: 3000

# Run validation on the dev set (dev.*) every n steps
# Default: 10,000
valid_steps: 1000

warmup_steps: 1000
report_every: 100

# Number of GPUs and their ids
world_size: 1
gpu_ranks: [0]

# Batching
bucket_size: 262144
num_workers: 0
batch_type: "tokens"
batch_size: 4096
valid_batch_size: 2048
max_generator_batches: 2
accum_count: [4]
accum_steps: [0]

# Optimizer configuration
model_dtype: "fp16"
optim: "adam"
learning_rate: 2
# warmup_steps: 8000
decay_method: "noam"
adam_beta2: 0.998
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"

# Model configuration
encoder_type: transformer
decoder_type: transformer
position_encoding: true
enc_layers: 6
dec_layers: 6
heads: 8
hidden_size: 512
word_vec_size: 512
transformer_ff: 2048
dropout_steps: [0]
dropout: [0.1]
attention_dropout: [0.1]
'''

with open("/content/drive/MyDrive/nmt0/config.yaml", "w+") as config_yaml:
  config_yaml.write(config)
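
To get a feel for what `decay_method: "noam"` does with `learning_rate: 2`, `hidden_size: 512`, and `warmup_steps: 1000`, the schedule can be sketched directly. The function below follows the standard Noam formula from the Transformer paper; `noam_lr` is a hypothetical name, and OpenNMT-py's own implementation may differ in details.

```python
def noam_lr(step, learning_rate=2.0, hidden_size=512, warmup_steps=1000):
    """Noam schedule: linear warmup, then inverse-square-root decay.

    Defaults mirror the values in config.yaml; the function is illustrative.
    """
    step = max(step, 1)  # avoid 0 ** -0.5 at the very first step
    scale = min(step ** -0.5, step * warmup_steps ** -1.5)
    return learning_rate * hidden_size ** -0.5 * scale


if __name__ == "__main__":
    for step in (100, 500, 1000, 2000, 3000):
        print(step, f"{noam_lr(step):.6f}")
```

Under this formula the rate rises until `warmup_steps` and decays afterwards, so with `train_steps: 3000` a third of the run is spent warming up.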

In [25]:
%%time
!onmt_build_vocab -config config.yaml -n_sample -1 -num_threads 2

2023-11-27 01:06:33.525538: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-27 01:06:33.525597: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-27 01:06:33.525636: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-11-27 01:06:33.533879: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-11-27 01:06:35.797415: I tensorflow/c

In [26]:
%%time
!onmt_train -config config.yaml

2023-11-27 01:12:41.372146: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-27 01:12:41.372201: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-27 01:12:41.372240: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-11-27 01:12:41.380413: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-11-27 01:12:43.651180: I tensorflow/c

Next, the test set is translated. First, the target-language test corpus has to be downloaded.

In [33]:
target_code = "gn"
response = req.get(f"https://raw.githubusercontent.com/AmericasNLP/americasnlp2021/main/test_data/test.{target_code}")
with open(f"test.{target_code}", "w") as f:
  f.write(response.text)

In [34]:
!ls test*

test.es  test.gn


In [35]:
!python3 MT-Preparation/filtering/filter.py test.es test.gn es gn

Dataframe shape (rows, columns): (1003, 2)
--- Rows with Empty Cells Deleted	--> Rows: 1003
--- Duplicates Deleted			--> Rows: 1003
--- Source-Copied Rows Deleted		--> Rows: 1003
--- Too Long Source/Target Deleted	--> Rows: 868
--- HTML Removed			--> Rows: 868
--- Rows will remain in true-cased	--> Rows: 868
--- Rows with Empty Cells Deleted	--> Rows: 868
--- Rows Shuffled			--> Rows: 868
--- Source Saved: test.es-filtered.es
--- Target Saved: test.gn-filtered.gn


In [38]:
!python3 MT-Preparation/subwording/2-subword.py source.model target.model test.es-filtered.es test.gn-filtered.gn

Source Model: source.model
Target Model: target.model
Source Dataset: test.es-filtered.es
Target Dataset: test.gn-filtered.gn
Done subwording the source file! Output: test.es-filtered.es.subword
Done subwording the target file! Output: test.gn-filtered.gn.subword


In [39]:
%%time
!onmt_translate -model models/model.enes_step_3000.pt -src test.es-filtered.es.subword -output gn.practice.translated -gpu 0 -min_length 1

2023-11-27 03:06:15.205536: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-27 03:06:15.205602: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-27 03:06:15.205643: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-11-27 03:06:15.213460: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-11-27 03:06:19.254548: I tensorflow/c

The final translation is obtained by desubwording the output.
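
Desubwording reverses the SentencePiece segmentation: pieces are concatenated and the word-boundary marker `▁` (U+2581) becomes a space. The sketch below does this with plain string operations. It is an approximation for illustration; `desubword_line` is a hypothetical name, and the `3-desubword.py` script relies on the trained model rather than on string replacement.

```python
WORD_BOUNDARY = "\u2581"  # marker that SentencePiece places at the start of a word


def desubword_line(line: str) -> str:
    """Join space-separated pieces and turn boundary markers back into spaces."""
    joined = "".join(line.split())
    return joined.replace(WORD_BOUNDARY, " ").strip()


if __name__ == "__main__":
    print(desubword_line("\u2581Che \u2581ro hayhu"))
```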

In [41]:
!python3 MT-Preparation/subwording/3-desubword.py target.model gn.practice.translated

Done desubwording! Output: gn.practice.translated.desubword


## Evaluation

For evaluation, we use the scripts provided by the shared task.

In [42]:
!git clone https://github.com/AmericasNLP/americasnlp2021

Cloning into 'americasnlp2021'...
remote: Enumerating objects: 469, done.
remote: Counting objects: 100% (136/136), done.
remote: Compressing objects: 100% (49/49), done.
remote: Total 469 (delta 89), reused 99 (delta 87), pack-reused 333
Receiving objects: 100% (469/469), 37.37 MiB | 12.05 MiB/s, done.
Resolving deltas: 100% (218/218), done.
Updating files: 100% (146/146), done.


In [52]:
!python3 americasnlp2021/evaluate.py --sys gn.practice.translated.desubword --ref test.es-filtered.es

[]
868
#### Score Report ####
chrF2 = 11.89
BLEU = 0.26


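The results were as follows. Note that the evaluation command above passes the Spanish file `test.es-filtered.es` as `--ref` while the system output is Guarani, so the practice scores are computed against the Spanish side rather than a Guarani reference (`test.gn-filtered.gn` would be the matching file):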

| Model     | BLEU  | ChrF (0-1) |
|-----------|-------|------------|
| Baseline  | 3.26  | 0.22       |
| Practice  | 0.26  | 0.12       |


## Extra

**How does ChrF differ from BLEU?**

BLEU counts matching n-grams at the word level, while ChrF does so at the character level, which benefits languages with very rich morphology. Another difference is how short outputs are handled: BLEU is precision-based and applies an explicit brevity penalty to translations shorter than the reference, whereas ChrF is an F-score that combines character n-gram precision and recall.
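
The length penalty mentioned above can be made concrete. The sketch below implements BLEU's brevity penalty in its standard formulation, where `hyp_len` and `ref_len` are token counts; the function name is hypothetical and the code is not taken from the shared task's evaluation script.

```python
import math


def brevity_penalty(hyp_len: int, ref_len: int) -> float:
    """Standard BLEU brevity penalty: 1 if the hypothesis is long enough,
    otherwise exp(1 - ref_len / hyp_len)."""
    if hyp_len == 0:
        return 0.0
    if hyp_len >= ref_len:
        return 1.0
    return math.exp(1 - ref_len / hyp_len)


if __name__ == "__main__":
    for hyp_len in (10, 8, 5):
        print(hyp_len, round(brevity_penalty(hyp_len, 10), 4))
```

The penalty multiplies the n-gram precision term, so a hypothesis half as long as the reference keeps only about 37% of its precision score.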

**Why is it relevant to use evaluation metrics other than BLEU?**

Because of the diversity of languages. One case is morphology: BLEU treats forms such as *dormí* and *dormía* as entirely different tokens, even though they are close in meaning. Another is synonymy, which BLEU does not capture well: *Yo tomé una pluma*, *Yo agarré una pluma*, and *Yo cogí un bolígrafo* are very close in meaning, but a word-level comparison does not reflect that.
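
The *dormí* / *dormía* case can be checked with a small experiment. The sketch below compares word-level overlap with a simplified character n-gram F-score in the spirit of ChrF. It is an illustration under stated assumptions, not the shared task's metric: `simple_chrf` and `word_precision` are hypothetical names, spaces are ignored, and n-gram orders are averaged uniformly.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring spaces."""
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def simple_chrf(hyp: str, ref: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified ChrF: average n-gram precision and recall, then F-beta."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        if not h or not r:  # text too short for this n-gram order
            continue
        overlap = sum((h & r).values())
        precisions.append(overlap / sum(h.values()))
        recalls.append(overlap / sum(r.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


def word_precision(hyp: str, ref: str) -> float:
    """Fraction of hypothesis words that also appear in the reference."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    return sum((h & r).values()) / max(sum(h.values()), 1)


if __name__ == "__main__":
    print("word level:", word_precision("dormí", "dormía"))
    print("char level:", round(simple_chrf("dormí", "dormía"), 3))
```

At the word level the two forms share nothing, while at the character level most n-grams match, which is the behaviour the paragraph above describes.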

