# philippine-machine-translation

An exploration of statistical machine translation for a subset of (mostly) Philippine languages. Created for NLP1000 (Introduction to Natural Language Processing).

## Notes and Prerequisites

This project is divided into 6 notebooks:
1. Setup (you are here!)
2. Preprocessing
3. Modeling
4. Evaluation
5. Analysis
6. Conclusion

Execute the notebooks in the order listed above, and the cells within each notebook from top to bottom, to achieve the desired results.

These notebooks assume that you have cloned the [GitHub repository](https://github.com/qu1r0ra/philippine-machine-translation) that contains them, since it also holds the project dependencies. If you have not, please clone it and run `uv sync` at the repository root to install the necessary third-party dependencies.

Most of the code used throughout the notebooks is abstracted away into the repository along with other files. This is a deliberate decision made by the authors to keep the notebooks as clean and readable as possible. In a sense, the notebooks can be thought of as the project's 'presentation layer', which uses functions, classes, and other entities contained in Python modules developed by the authors.

## Environment Setup

Let us begin by setting up our notebook environment.

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
%matplotlib notebook

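The commands above are IPython magic commands that made the authors' lives easier during development. `%autoreload 2` reloads imported modules before each cell runs, so edits to the `src` modules take effect without restarting the kernel, and the `%matplotlib` commands select how plots are rendered in the notebook. You should try them too.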

Next, let us import the necessary libraries and functions.

In [2]:
import pandas as pd

from src.config import RAW_DIR
from src.utils import extract_archives

[CONFIG] Directories ensured and random seed set.


Imported modules prefixed with `src` are developed by the authors. The rest are third-party.
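The `[CONFIG]` message printed on import suggests that `src.config` creates the project directories and fixes a random seed as a side effect of being imported. Its source is not reproduced in this notebook, so the following is only a minimal sketch of that pattern under stated assumptions: the function name `ensure_dirs_and_seed`, the `data/raw` layout, and the seed value `42` are all hypothetical, and only Python's built-in `random` module is seeded.

```python
import random
import tempfile
from pathlib import Path


def ensure_dirs_and_seed(directories, seed):
    """Create each directory if it is missing, then seed Python's RNG.

    Hypothetical helper; the real logic in src.config may differ.
    """
    for directory in directories:
        directory.mkdir(parents=True, exist_ok=True)
    random.seed(seed)


# Demonstrate in a throwaway location so nothing is written to the project.
with tempfile.TemporaryDirectory() as tmp:
    raw_dir = Path(tmp) / "data" / "raw"      # hypothetical layout
    ensure_dirs_and_seed([raw_dir], seed=42)  # hypothetical seed value
    created = raw_dir.is_dir()
```

Doing this once in a shared configuration module means every notebook that imports it sees the same directories and the same seed.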

The data that our project will be using is in `data\raw\parallel_corpora_by_verse.zip`. It is currently compressed as a `.zip` file to save space. Let us first extract it.

In [3]:
extract_archives(RAW_DIR, overwrite=True)

Found 1 archive(s) in 'C:\Users\qu1r0ra\Documents\GitHub\philippine-machine-translation\data\raw':
Overwriting existing folder 'parallel_corpora_by_verse' ...
Extracting 'parallel_corpora_by_verse.zip' → 'C:\Users\qu1r0ra\Documents\GitHub\philippine-machine-translation\data\raw\parallel_corpora_by_verse/' ...
Done extracting 'parallel_corpora_by_verse.zip'
All archives processed.
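The implementation of `extract_archives` lives in `src.utils` and is not shown in this notebook. As a rough illustration of what such a helper might do, here is a minimal sketch that uses only the standard library. It assumes that only `.zip` archives are handled and that each archive is unpacked into a folder named after it; the name `extract_zip_archives` is hypothetical, and the authors' function may behave differently.

```python
import shutil
import zipfile
from pathlib import Path


def extract_zip_archives(directory: Path, overwrite: bool = False) -> list[Path]:
    """Extract every .zip in `directory` into a folder named after the archive.

    Hypothetical stand-in for src.utils.extract_archives.
    Returns the list of folders that were (re)created.
    """
    extracted = []
    for archive in sorted(directory.glob("*.zip")):
        target = directory / archive.stem
        if target.exists():
            if not overwrite:
                continue  # keep the existing folder untouched
            shutil.rmtree(target)  # start from a clean folder
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
        extracted.append(target)
    return extracted
```

With `overwrite=True`, as in the cell above, an existing output folder is removed before extraction, so stale files from an earlier run cannot linger.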


We should now have 7 `.csv` files under the same directory (inside the extracted `parallel_corpora_by_verse` folder), one for each parallel corpus. Let us confirm this and view the first few lines of one of the files.

In [4]:
# Search the raw data directory recursively for the extracted CSV files.
csv_files = list(RAW_DIR.glob("**/*.csv"))

if csv_files:
    print(f"Found {len(csv_files)} CSV files.")
    print(f"Showing first few lines of '{csv_files[0].name}':")
    df = pd.read_csv(csv_files[0])
    display(df.head())  # `display` is provided by the notebook environment
else:
    print("No CSV files found in extracted data.")

Found 7 CSV files. 
Showing first few lines of 'cebuano_spanish.csv':


| Unnamed: 0 | usfm | book | verse | chapter | language1 | language2 |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 1CH.1.1 | 1CH | 1 | 1 | Si Adan, si Set, si Enos, | Adán, Set, Enós, |
| 1 | 1CH.1.2 | 1CH | 2 | 1 | si Kenan, si Mahalalel, si Jared, | Cainán, Mahalaleel, Jared, |
| 2 | 1CH.1.3 | 1CH | 3 | 1 | si Enoc, si Metusela, si Lamec, | Enoc, Matusalén, Lamec, |
| 3 | 1CH.1.4 | 1CH | 4 | 1 | si Noe, si Sem, si Ham ug si Jafet. | Noé, Sem, Cam y Jafet. |
| 4 | 1CH.1.5 | 1CH | 5 | 1 | Ang mga anak nga lalaki ni Jafet: si Gomer, si... | Los hijos de Jafet: Gomer, Magog, Madai, Javán... |
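
In the preview, each `usfm` code appears to be the `book`, `chapter`, and `verse` values joined by dots, and each row pairs one verse in `language1` with its counterpart in `language2`. A quick consistency check along those lines might look like the sketch below. It assumes the pattern seen in these few rows holds for the whole file, uses the standard `csv` module on an inline copy of the first four rows (with the unnamed index column left out) so that it runs outside the notebook, and the function name `find_inconsistent_rows` is hypothetical.

```python
import csv
import io

# First four rows of the preview above, reproduced as CSV text.
SAMPLE = '''usfm,book,verse,chapter,language1,language2
1CH.1.1,1CH,1,1,"Si Adan, si Set, si Enos,","Adán, Set, Enós,"
1CH.1.2,1CH,2,1,"si Kenan, si Mahalalel, si Jared,","Cainán, Mahalaleel, Jared,"
1CH.1.3,1CH,3,1,"si Enoc, si Metusela, si Lamec,","Enoc, Matusalén, Lamec,"
1CH.1.4,1CH,4,1,"si Noe, si Sem, si Ham ug si Jafet.","Noé, Sem, Cam y Jafet."
'''


def find_inconsistent_rows(rows):
    """Return rows whose usfm code disagrees with book/chapter/verse,
    or whose text is empty on either side of the pair."""
    bad = []
    for row in rows:
        expected = f"{row['book']}.{row['chapter']}.{row['verse']}"
        has_text = row["language1"].strip() and row["language2"].strip()
        if row["usfm"] != expected or not has_text:
            bad.append(row)
    return bad


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
problems = find_inconsistent_rows(rows)
print(f"{len(rows)} rows checked, {len(problems)} inconsistent")
# prints: 4 rows checked, 0 inconsistent
```

The same check could be pointed at a full file by passing an open file handle to `csv.DictReader` instead of the inline sample.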


At this point, we can proceed with **preprocessing** knowing our environment has been set up and our data has been properly extracted.