<a href="https://colab.research.google.com/github/qu1r0ra/philippine-machine-translation/blob/feat%2Fnmt-model-training/notebooks/02b_modeling_nmt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Modeling in Neural Machine Translation (NMT)

## Notes

The author who trained the Transformer model lacked the computational resources to train it locally, so he trained it on Google Colab.

Unfortunately, Colab's free plan was not enough either, because its free GPUs are rate-limited, so the author purchased compute units to access those GPUs. You may also need Colab compute units to replicate the results of this notebook.

## Environment Setup

First, let us uninstall Colab's preinstalled `torch`, `torchvision`, and `torchaudio` packages.

We need to downgrade them as the `OpenNMT-py` version that the authors used for training requires older versions.

**NOTE:** You only need to run this the first time.

In [1]:
!pip uninstall torch torchvision torchaudio -y

Found existing installation: torch 2.1.2
Uninstalling torch-2.1.2:
  Successfully uninstalled torch-2.1.2

Next, let us install `condacolab` so we can install the older package versions via `conda`.

The authors first tried installing the older versions with `pip`, but those versions were not listed there and could not be installed.

In [2]:
!pip install -q condacolab
import condacolab

condacolab.install()

✨🍰✨ Everything looks OK!


Now we can install the older versions required by our `OpenNMT-py` version.

In [None]:
!conda install --quiet pytorch==2.1.1 torchvision==0.16.1 torchaudio==2.1.1 \
    pytorch-cuda=11.8 -c pytorch -c nvidia

Streaming output truncated to the last 5000 lines.


ClobberError: The package 'conda-forge/linux-64::numpy-2.3.4-py311h2e04523_0' cannot be installed due to a
path collision for 'lib/python3.11/site-packages/numpy/lib/__pycache__/_nanfunctions_impl.cpython-311.pyc'.
This path already exists in the target prefix, and it won't be removed
by an uninstall action in this transaction. The path is one that conda
doesn't recognize. It may have been created by another package manager.


ClobberError: The package 'conda-forge/linux-64::numpy-2.3.4-py311h2e04523_0' cannot be installed due to a
path collision for 'lib/python3.11/site-packages/numpy/lib/__pycache__/_npyio_impl.cpython-311.pyc'.
This path already exists in the target prefix, and it won't be removed
by an uninstall action in this transaction. The path is one that conda
doesn't recognize. It may have been created by another package manager.


ClobberError: The package 'conda-forge/linux-64::numpy-2.3.4-py311h2e04523_0' 

In [4]:
!pip install OpenNMT-py==3.4.3

Collecting torch<2.2,>=2.0.1 (from OpenNMT-py==3.4.3)
  Using cached torch-2.1.2-cp311-cp311-manylinux1_x86_64.whl.metadata (25 kB)
Using cached torch-2.1.2-cp311-cp311-manylinux1_x86_64.whl (670.2 MB)
Installing collected packages: torch
Successfully installed torch-2.1.2
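
Before moving on, it can help to confirm which versions actually ended up installed. The following is a minimal sketch, not part of the original workflow: it checks `torch` against the `torch<2.2,>=2.0.1` range that `pip` printed above, and it assumes NumPy should be a 1.x release, based on the NumPy warning that appears in the `onmt_build_vocab` output later in this notebook. The helper names (`version_tuple`, `torch_ok`, `numpy_ok`, `installed_version`) are hypothetical.

```python
import re
from importlib import metadata


def version_tuple(version):
    """Return the leading numeric components of a version string as ints."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    # A missing patch component (e.g. "2.3") is treated as 0.
    return tuple(int(part) for part in match.groups(default="0"))


def torch_ok(version):
    """True if the version satisfies torch>=2.0.1,<2.2 (the range pip printed)."""
    return (2, 0, 1) <= version_tuple(version) < (2, 2, 0)


def numpy_ok(version):
    """True for NumPy 1.x (an assumption drawn from the NumPy 2 warning)."""
    return version_tuple(version) < (2, 0, 0)


def installed_version(package):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name, check in (("torch", torch_ok), ("numpy", numpy_ok)):
    found = installed_version(name)
    if found is None:
        print(f"{name}: not installed")
    else:
        status = "ok" if check(found) else "outside the expected range"
        print(f"{name} {found}: {status}")
```

If NumPy is reported as outside the expected range, the warning text itself suggests downgrading to `numpy<2`.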


Lastly, let us create the folders where we will be uploading our data.

In [5]:
!mkdir -p data outputs

## Data Upload

Next, we need to upload the necessary files to train our model.

At this point, it is assumed that:
1. You have cloned the GitHub repository https://github.com/qu1r0ra/philippine-machine-translation.
2. You have gone through `00_setup.ipynb` and `01b_preprocessing_nmt.ipynb` in the repository.

To prepare the data needed for this notebook, simply follow these instructions:
1. Upload `config.yaml` from the repository to the Colab filesystem (`/content`). You are in the right directory if you can also see the file `condacolab_install.log`.
2. Upload `train.src`, `train.tgt`, `valid.src`, and `valid.tgt` from `data/processed/` in the repository to the `data/` folder in Colab.
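
Once the files are uploaded, a quick sanity check can catch a missing or mismatched upload before training starts. The sketch below is an optional addition, not a step from the repository. It assumes the four files sit in `data/` as described above and that each line of a `.src` file is paired with the same line of the matching `.tgt` file. The helper names `count_lines` and `check_parallel` are hypothetical.

```python
from pathlib import Path


def count_lines(path):
    """Count the lines in a UTF-8 text file."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_parallel(data_dir="data", splits=("train", "valid")):
    """Return {split: line_count} after verifying each .src/.tgt pair lines up."""
    counts = {}
    for split in splits:
        src = Path(data_dir) / f"{split}.src"
        tgt = Path(data_dir) / f"{split}.tgt"
        for path in (src, tgt):
            if not path.is_file():
                raise FileNotFoundError(f"missing upload: {path}")
        n_src, n_tgt = count_lines(src), count_lines(tgt)
        if n_src != n_tgt:
            raise ValueError(
                f"{split}: {n_src} source lines vs {n_tgt} target lines"
            )
        counts[split] = n_src
    return counts


# In Colab, after uploading the files:
# print(check_parallel())
```

A `ValueError` here would mean the source and target files are out of step, which is worth fixing before any training run.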

Let us check the first few lines of our training and validation data.

In [6]:
!head -n 5 data/train.src

ug ang bisan unsa nga walay kapay ug himbis ayaw ninyo kaona hugaw kini alang kaninyo
kay ang akong mga adlaw nangahanaw sama sa aso ug ang akong kabukogan nagdilaab sama sa hudno
busa karon inigabot niining sulat diha kanimo sa nakita nimo nga ang mga anak sa inyong agalon uban kaninyo ug anaa kaninyo ang mga karwahi ug mga kabayo ug ang mga kinutaang siyudad ug mga hinagiban
ang mga sulat gipadala pinaagi sa mga sinugo ngadto sa tanang mga lalawigan sa hari aron sa paglaglag sa pagpatay ug sa pagpuo sa tanang mga judio batanon ug tigulang mga babaye ug mga kabataan nga gagmay sulod sa usa ka adlaw sa ikanapulo ug tulo ka adlaw sa ikanapulo ug duha nga bulan nga mao ang bulan sa adar ug sa pagkuha sa tanan nila nga mga kabtangan
ug ug kini kini mipadayon ngadto sa habagatan sa tungasan sa akrabim subay sa zin ug mitungas dapit sa habagatan sa kadesbarnea agi sa hesron ngadto sa adar ug miliko paingon sa karka


In [7]:
!head -n 5 data/train.tgt

pero no comeréis lo que no tiene aletas y escama os será inmundo
porque mis días se desvanecen como el humo y mis huesos cual tizón están quemados
inmediatamente que lleguen estas cartas a vosotros como tenéis a los hijos de vuestro señor y también tenéis carros y gente de a caballo la ciudad fortificada y las armas
y se enviaron las cartas por medio de correos a todas las provincias del rey con la orden de destruir matar y aniquilar a todos los judíos jóvenes y ancianos niños y mujeres y de apoderarse de sus bienes en un mismo día en el día trece del mes duodécimo que es el mes de adar
luego salía hacia el sur de la subida de acrabim pasaba hacia zin y subía por el sur hasta cadesbarnea pasando por hezrón subía hacia adar y daba vuelta a carca


In [8]:
!head -n 5 data/valid.src

unya ang paraon mipatawag kang moises ug miingon panglakaw kamo magalagad aron kamo sa ginoo ang inyong mga kabataan makauban kaninyo apan kinahanglang ibilin ninyo ang inyong mga karnero ug ang inyong mga baka
kini mao ang ihalad ninyo sa ginoo sa inyong gitakda nga mga kasaulogan dugang sa mga saad ug sa inyong mga halad nga kinabubuton alang sa inyong mga nga ug alang sunogon sa mga inyong halad nga pagkaon ug alang sa inyong mga halad nga ilimnon ug alang sa inyong inyong halad mga sa pakigdait
ug ang iyang panon sumala sa pagihap
busa si pablo gipagikan dayon sa mga kaigsoonan sa paglakaw padulong sa dagat apan si silas ug si timoteo nagpabilin didto
si jesus miingon kaniya sa pagkatinuod sultihan ko ikaw niini gayong gabhiona sa dili pa motuktugaok ang manok ilimod nimo ako sa makatulo


In [9]:
!head -n 5 data/valid.tgt

entonces el faraón hizo llamar a moisés y dijo id servid a jehová que solamente queden vuestras ovejas y vuestras vacas vayan también vuestros niños con vosotros
estas cosas ofreceréis a jehová en vuestras fiestas solemnes además de vuestros votos y ofrendas voluntarias de vuestros holocaustos ofrendas y libaciones y de vuestras ofrendas de paz
su cuerpo de ejército según el censo sesenta y dos mil setecientos
entonces los hermanos hicieron que pablo saliera inmediatamente en dirección al mar pero silas y timoteo se quedaron allí
jesús le dijo de cierto te digo que esta noche antes que el gallo cante me negarás tres veces


(will soon write stuff about the data)

## Training a Small Transformer with OpenNMT-py

(insert additional explanation)

The advantage of using `OpenNMT-py` is that we only need to configure the parameters and hyperparameters of the model in a `config.yaml` file. After doing so, training can be run with one line.

Normally, we would need to code the architecture and mechanisms of the model ourselves in deep learning frameworks like `PyTorch` and `TensorFlow`. As an aside, `OpenNMT` has implementations in both of these frameworks, but the authors chose the PyTorch implementation, which is named `OpenNMT-py`.
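
To give a sense of what such a file contains, here is an illustrative sketch of a small Transformer configuration, wrapped in Python so it can be written out from a notebook cell. It is not the repository's `config.yaml`: every value and output path below is hypothetical, and the option names are the ones commonly shown in the OpenNMT-py documentation, so check them against the installed version before relying on them. The helper `write_example_config` is also hypothetical, and it writes to a separate file so the real `config.yaml` is never overwritten.

```python
from pathlib import Path

# Hypothetical values throughout; the repository's real config.yaml may differ.
EXAMPLE_CONFIG = """\
# Where onmt_build_vocab writes its vocabulary files
save_data: outputs/run
src_vocab: outputs/run/vocab.src
tgt_vocab: outputs/run/vocab.tgt

# Parallel corpora uploaded to data/
data:
    corpus_1:
        path_src: data/train.src
        path_tgt: data/train.tgt
    valid:
        path_src: data/valid.src
        path_tgt: data/valid.tgt

# Small Transformer (illustrative sizes)
encoder_type: transformer
decoder_type: transformer
position_encoding: true
layers: 3
heads: 4
hidden_size: 256
word_vec_size: 256
transformer_ff: 1024
dropout: [0.1]

# Optimization and checkpointing (illustrative)
optim: adam
learning_rate: 2.0
decay_method: noam
warmup_steps: 4000
batch_type: tokens
batch_size: 2048
train_steps: 20000
valid_steps: 1000
save_checkpoint_steps: 1000
save_model: outputs/model

# Train on a single GPU
world_size: 1
gpu_ranks: [0]
"""


def write_example_config(path="config.example.yaml"):
    """Write the illustrative config to its own file and return the path."""
    target = Path(path)
    target.write_text(EXAMPLE_CONFIG, encoding="utf-8")
    return target
```

Calling `write_example_config()` produces `config.example.yaml` for reading alongside the real configuration. The commands in this notebook still use `config.yaml` from the repository.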

In [10]:
!onmt_build_vocab -config config.yaml


A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.3.4 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):
  File "/usr/local/bin/onmt_build_vocab", line 5, in <module>
    from onmt.bin.build_vocab import main
  File "/usr/local/lib/python3.11/site-packages/onmt/__init__.py", line 2, in <module>
    import onmt.inputters
  File "/usr/local/lib/python3.11/site-packages/onmt/inputters/__init__.py", line 7, in <module>
    from onmt.inputters.text_utils import text_sort_key, process, numericalize, tensorify
  File "/usr/local/lib/python3.11/site-packages/onmt/inputters/text_utils.py", line 1, in <module>
    import torch
  File "

In [11]:
!onmt_train -config config.yaml


A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.3.4 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):
  File "/usr/local/bin/onmt_train", line 5, in <module>
    from onmt.bin.train import main
  File "/usr/local/lib/python3.11/site-packages/onmt/__init__.py", line 2, in <module>
    import onmt.inputters
  File "/usr/local/lib/python3.11/site-packages/onmt/inputters/__init__.py", line 7, in <module>
    from onmt.inputters.text_utils import text_sort_key, process, numericalize, tensorify
  File "/usr/local/lib/python3.11/site-packages/onmt/inputters/text_utils.py", line 1, in <module>
    import torch
  File "/usr/local/l

This is literally it. Now we just have to wait until the model finishes training.

Note that training even a small Transformer like this one can take anywhere from several hours to several days, depending on factors such as:
- the size of your dataset
- the parameters and hyperparameters you set in `config.yaml`
- the power of your NVIDIA GPU (yes, you specifically need a GPU made by NVIDIA)
- etc.
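
Since an NVIDIA GPU is required, it is worth confirming that the Colab runtime actually exposes one before starting a long run. The sketch below is one possible check, assuming the `nvidia-smi` tool that ships with the NVIDIA driver is on the `PATH`. The helper name `visible_nvidia_gpus` is hypothetical.

```python
import shutil
import subprocess


def visible_nvidia_gpus():
    """Return GPU names reported by nvidia-smi, or an empty list if none."""
    if shutil.which("nvidia-smi") is None:
        return []  # driver tools are missing, so no NVIDIA GPU is usable
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            check=False,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return []
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = visible_nvidia_gpus()
print("NVIDIA GPUs:", ", ".join(gpus) if gpus else "none detected")
```

If nothing is detected in Colab, the runtime type is probably set to CPU and needs to be switched to a GPU runtime.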

In the case of CJ, the author who trained this model via Google Colab, training took (insert time) on a (insert GPU) GPU provided by Google Colab.