## Preprocessing

### Setup

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [12]:
from pathlib import Path
import os
import nltk
import multiprocessing

import pandas as pd
from src.config import RAW_DIR, PROCESSED_DIR
from src.preprocessing import preprocess_corpus, build_word_classes

First, let us download the NLTK Punkt tokenizer data (the `punkt` and `punkt_tab` packages). We need it to tokenize our corpora during preprocessing.

In [3]:
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\qu1r0ra\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\qu1r0ra\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [4]:
# Use the repository root as the working directory when launched from notebooks/
if Path.cwd().name == "notebooks":
    os.chdir("..")

%pwd

'C:\\Users\\qu1r0ra\\Documents\\GitHub\\philippine-machine-translation'
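The check above only handles the case where the notebook is started directly inside `notebooks/`. As a rough alternative, the sketch below walks up the directory tree until it finds a folder that contains a `notebooks` directory. The helper name `find_project_root` and the choice of `notebooks` as the marker are assumptions made for illustration, not part of the project code.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "notebooks") -> Path:
    """Return `start` or its nearest ancestor that contains `marker`.

    Hypothetical helper: the marker name is an assumption based on the
    repository layout implied by the cell above.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No ancestor of {start} contains {marker!r}")
```

It could then be used as `os.chdir(find_project_root(Path.cwd()))`, which also works when the notebook lives in a subfolder of `notebooks/`.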

### Load

In [5]:
csv_files = list(RAW_DIR.glob("**/*.csv"))
if not csv_files:
    raise FileNotFoundError("No CSV files found in RAW_DIR.")

print(f"Found {len(csv_files)} CSV files.")

df = pd.read_csv(csv_files[0])
print(f"Loaded: {csv_files[0].name}")
df.head()

Found 7 CSV files.
Loaded: cebuano_spanish.csv


Unnamed: 0,usfm,book,verse,chapter,language1,language2
0,1CH.1.1,1CH,1,1,"Si Adan, si Set, si Enos,","Adán, Set, Enós,"
1,1CH.1.2,1CH,2,1,"si Kenan, si Mahalalel, si Jared,","Cainán, Mahalaleel, Jared,"
2,1CH.1.3,1CH,3,1,"si Enoc, si Metusela, si Lamec,","Enoc, Matusalén, Lamec,"
3,1CH.1.4,1CH,4,1,"si Noe, si Sem, si Ham ug si Jafet.","Noé, Sem, Cam y Jafet."
4,1CH.1.5,1CH,5,1,"Ang mga anak nga lalaki ni Jafet: si Gomer, si...","Los hijos de Jafet: Gomer, Magog, Madai, Javán..."
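Note that `Path.glob` does not guarantee any particular order, so `csv_files[0]` may point to a different corpus on another machine. A minimal sketch of a more predictable selection is shown below; `select_csv` is a hypothetical helper, and picking a file by its name is an assumption about how one might want to choose among the seven corpora.

```python
from pathlib import Path
from typing import Optional


def select_csv(raw_dir: Path, name: Optional[str] = None) -> Path:
    """Pick a CSV under `raw_dir`: by file name if given, else the first in sorted order."""
    files = sorted(raw_dir.glob("**/*.csv"))  # sorting makes the choice reproducible
    if not files:
        raise FileNotFoundError(f"No CSV files found in {raw_dir}.")
    if name is None:
        return files[0]
    for path in files:
        if path.name == name:
            return path
    raise FileNotFoundError(f"{name} not found under {raw_dir}.")
```

For example, `pd.read_csv(select_csv(RAW_DIR, "cebuano_spanish.csv"))` would load the same file as above regardless of the order in which the filesystem lists it.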


In [6]:
df_preprocessed = preprocess_corpus(df)


[Preprocessing] Cleaning and tokenizing columns: language1, language2
[Preprocessing] 31,105 valid sentence pairs remaining after cleaning.
[Preprocessing] Done.


In [7]:
processed_path = PROCESSED_DIR / "preprocessed.csv"
df_preprocessed.to_csv(processed_path, index=False)
print(f"Preprocessed data saved to {processed_path}")

Preprocessed data saved to C:\Users\qu1r0ra\Documents\GitHub\philippine-machine-translation\data\processed\preprocessed.csv


In [8]:
df_preprocessed.head()

Unnamed: 0,language1,language2,src_tokens,tgt_tokens
0,"Si Adan, si Set, si Enos,","Adán, Set, Enós,","[si, adan, si, set, si, enos]","[adán, set, enós]"
1,"si Kenan, si Mahalalel, si Jared,","Cainán, Mahalaleel, Jared,","[si, kenan, si, mahalalel, si, jared]","[cainán, mahalaleel, jared]"
2,"si Enoc, si Metusela, si Lamec,","Enoc, Matusalén, Lamec,","[si, enoc, si, metusela, si, lamec]","[enoc, matusalén, lamec]"
3,"si Noe, si Sem, si Ham ug si Jafet.","Noé, Sem, Cam y Jafet.","[si, noe, si, sem, si, ham, ug, si, jafet]","[noé, sem, cam, y, jafet]"
4,"Ang mga anak nga lalaki ni Jafet: si Gomer, si...","Los hijos de Jafet: Gomer, Magog, Madai, Javán...","[ang, mga, anak, nga, lalaki, ni, jafet, si, g...","[los, hijos, de, jafet, gomer, magog, madai, j..."
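The token columns above are lowercased and free of punctuation. The internals of `preprocess_corpus` are not shown in this notebook, so the following is only a rough sketch of a comparable cleaning step. It uses a regular expression instead of the Punkt tokenizer, and the names `simple_tokenize` and `clean_pairs` are hypothetical.

```python
import re
import unicodedata


def simple_tokenize(text: str) -> list[str]:
    """Lowercase the text and keep runs of word characters (punctuation is dropped)."""
    # NFC keeps accented letters such as "á" as single characters.
    text = unicodedata.normalize("NFC", text).lower()
    return re.findall(r"\w+", text)


def clean_pairs(pairs):
    """Tokenize (source, target) pairs and drop pairs where either side ends up empty."""
    cleaned = []
    for src, tgt in pairs:
        src_tokens, tgt_tokens = simple_tokenize(src), simple_tokenize(tgt)
        if src_tokens and tgt_tokens:
            cleaned.append((src_tokens, tgt_tokens))
    return cleaned
```

On the first row shown above, `simple_tokenize("Si Adan, si Set, si Enos,")` gives `['si', 'adan', 'si', 'set', 'si', 'enos']`, which matches the `src_tokens` column. The real pipeline may treat digits, hyphens, or apostrophes differently.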


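Let us build word classes using `FastText`, a word embedding model that composes each word vector from character n-grams. Because inflected and rare forms share subword units with related words, it is well suited to morphologically rich languages such as Filipino and other Philippine languages. The word vectors are then clustered with k-means, and each word receives its cluster ID as its class.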

First, we need to determine how many logical cores our computer has so we can speed up training.

In [13]:
print(multiprocessing.cpu_count())

8


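Next, set `LOKY_MAX_CPU_COUNT` below to the number of cores you want to use, then run the cell. Using the value printed above (all logical cores) gives the fastest training; pick a smaller number if you want to keep the machine responsive for other work. The variable is read by `loky` (the process backend bundled with joblib, which scikit-learn uses) when it counts the available CPUs, and its value must be a string.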

In [14]:
os.environ["LOKY_MAX_CPU_COUNT"] = "8"
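If you would rather not hard-code the count, the value can be derived at runtime. This is a small sketch that assumes you want to use every logical core; the fallback to `1` covers the case where `os.cpu_count()` returns `None`.

```python
import os

# os.cpu_count() may return None when the count cannot be determined.
n_cores = os.cpu_count() or 1

# Environment variables must be strings, hence the str() conversion.
os.environ["LOKY_MAX_CPU_COUNT"] = str(n_cores)
```

To leave one core free, `str(max(1, n_cores - 1))` could be used instead.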

In [16]:
%%time
word2class = build_word_classes(df_preprocessed)


[FastText] Training on 62,210 sentences...
[Clustering] Running KMeans on 19,612 word vectors (100 dims) ...
[Clustering] Done — created 100 clusters.
[Save] Word classes saved to C:\Users\qu1r0ra\Documents\GitHub\philippine-machine-translation\data\processed\word_classes.json
CPU times: total: 4min 47s
Wall time: 1min 1s
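The log shows that the 62,210 training sentences equal twice the 31,105 sentence pairs, which suggests that both language columns are pooled before training, although the notebook does not confirm this. To illustrate only the clustering step, the toy sketch below skips FastText entirely and runs a plain-Python k-means on four hand-made 2-D vectors. The vectors, the `kmeans` function, and the naive initialisation are all made up for illustration; `build_word_classes` operates on 100-dimensional vectors and 100 clusters.

```python
from math import dist


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; returns one cluster index per point."""
    # Naive deterministic initialisation: the first k points become centroids.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(axis) / len(members) for axis in zip(*members)]
    return labels


# Made-up toy vectors, not real FastText output.
toy_vectors = {
    "sa": (0.9, 0.1),
    "el": (0.1, 0.9),
    "ang": (1.0, 0.0),
    "la": (0.0, 1.0),
}
words = list(toy_vectors)
labels = kmeans([toy_vectors[w] for w in words], k=2)

# Same "c<cluster id>" label format as the word classes in this notebook.
toy_word2class = {word: f"c{label}" for word, label in zip(words, labels)}
```

With these toy vectors, `sa` and `ang` end up in one class and `el` and `la` in the other.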


Let us inspect a few word-to-class mappings.

In [17]:
list(word2class.items())[:20]

[('sa', 'c48'),
 ('ug', 'c48'),
 ('ang', 'c48'),
 ('de', 'c0'),
 ('nga', 'c48'),
 ('y', 'c80'),
 ('mga', 'c48'),
 ('a', 'c78'),
 ('que', 'c19'),
 ('la', 'c77'),
 ('el', 'c40'),
 ('los', 'c43'),
 ('en', 'c0'),
 ('si', 'c72'),
 ('iyang', 'c1'),
 ('ka', 'c41'),
 ('dios', 'c20'),
 ('ni', 'c88'),
 ('siya', 'c60'),
 ('no', 'c36')]
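Reading the mapping word by word makes it hard to judge cluster quality. A small sketch of the inverse view, grouping words by class, is shown below. The `sample` dictionary copies a few entries from the output above, and `group_by_class` is a hypothetical helper name.

```python
from collections import defaultdict

# A few entries copied from the output above.
sample = {"sa": "c48", "ug": "c48", "ang": "c48", "de": "c0", "nga": "c48", "en": "c0"}


def group_by_class(word2class):
    """Invert a word -> class mapping into class -> list of words."""
    class2words = defaultdict(list)
    for word, cls in word2class.items():
        class2words[cls].append(word)
    return dict(class2words)


class2words = group_by_class(sample)
```

Here `class2words["c48"]` holds the Cebuano function words `sa`, `ug`, `ang`, and `nga`, while `class2words["c0"]` holds the Spanish prepositions `de` and `en`. Passing the full `word2class` dictionary gives the same view for all 100 clusters.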

With preprocessing finished, we can proceed with **modeling**.