<a href="https://colab.research.google.com/github/qu1r0ra/philippine-language-clustering/blob/main/philippine_language_clustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Repository Setup

First, we clone the GitHub repository containing the code this notebook relies on.

We keep most of the code in a repository, rather than in the notebook itself, so that the project can follow standard software engineering practices and stay clean and maintainable.

Maintaining a GitHub repository alongside a Jupyter notebook (as opposed to relying solely on the notebook) allows for better code organization and version control. Code for high-level tasks (e.g., preprocessing data, computing distances) can be refactored into separate files, making it easier to debug and modify. Because the repository uses Git, these scripts are under version control by default.

In a sense, this notebook serves as a 'presentation layer' that simply calls the functions and classes defined in the repository's Python scripts.

In [1]:
import os

repo_url = "https://github.com/qu1r0ra/philippine-language-clustering.git"
repo_dir = "philippine-language-clustering"
branch = "main"

# Change to the working directory in Google Colab: /content
%cd /content

# Remove previous clone
if os.path.exists(repo_dir):
  print(f"\nRemoving old repo folder '{repo_dir}' to ensure a clean clone...")
  !rm -rf {repo_dir}
else:
  print(f"\nNo repo folder '{repo_dir}' found.")

# Clone the specified branch
print(f"\nCloning branch {branch} from repository...")
!git clone --branch {branch} --single-branch {repo_url} {repo_dir}

# Move into the cloned directory
%cd {repo_dir}

# Install or upgrade uv
print("\nInstalling/Upgrading uv...")
!pip install --upgrade uv --quiet

# Install dependencies using uv
if os.path.exists("pyproject.toml"):
  print("\nInstalling dependencies with uv...")
  !uv sync --quiet
else:
  print("\nNo pyproject.toml found — skipping uv install.")

print("\nSetup complete!")

/content

Removing old repo folder 'philippine-language-clustering' to ensure a clean clone...

Cloning branch main from repository...
Cloning into 'philippine-language-clustering'...
remote: Enumerating objects: 79, done.
remote: Counting objects: 100% (29/29), done.
remote: Compressing objects: 100% (23/23), done.
remote: Total 79 (delta 19), reused 8 (delta 6), pack-reused 50 (from 1)
Receiving objects: 100% (79/79), 307.08 KiB | 7.31 MiB/s, done.
Resolving deltas: 100% (32/32), done.
/content/philippine-language-clustering

Installing/Upgrading uv...

Installing dependencies with uv...

Setup complete!


## Data Setup

1. Create a `data` folder in `/content`, with a `zipped` folder inside it.
2. Download `segmented.zip` from [this Google Drive folder](https://drive.google.com/drive/folders/1hKNLOayge7rvjx5LNMzHrMoGJY7gTeAT?usp=sharing) and upload it to `data/zipped`.
3. Run the cell below to extract the segmented data into `data/unzipped`.
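
Before extracting, it can help to confirm that the upload landed where the cell expects it. The sketch below is a hypothetical helper (`find_archives` is not part of the repository) that lists the valid `.zip` files in a folder and fails early if there are none. It assumes the `/content/data/zipped` layout from the steps above.

```python
import os
import zipfile


def find_archives(zip_dir):
    """Return sorted paths of valid .zip files in zip_dir (hypothetical helper)."""
    if not os.path.isdir(zip_dir):
        raise FileNotFoundError(f"Folder not found: {zip_dir}")
    archives = []
    for name in sorted(os.listdir(zip_dir)):
        path = os.path.join(zip_dir, name)
        # is_zipfile inspects the file's contents, not just its extension
        if name.endswith(".zip") and zipfile.is_zipfile(path):
            archives.append(path)
    if not archives:
        raise FileNotFoundError(f"No valid .zip archives in {zip_dir}")
    return archives
```

Calling `find_archives("/content/data/zipped")` first would surface a missing or corrupted upload as an explicit error rather than as an empty `unzipped` folder.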

In [2]:
import zipfile

data_root = "/content/data"
zip_dir = os.path.join(data_root, "zipped")
unzipped_dir = os.path.join(data_root, "unzipped")

os.makedirs(zip_dir, exist_ok=True)
os.makedirs(unzipped_dir, exist_ok=True)

# Loop through all zip files and extract them to the unzipped folder
for zip_file in os.listdir(zip_dir):
    if zip_file.endswith(".zip"):
        zip_path = os.path.join(zip_dir, zip_file)
        extract_folder_name = os.path.splitext(zip_file)[0]
        extract_path = os.path.join(unzipped_dir, extract_folder_name)
        os.makedirs(extract_path, exist_ok=True)

        print(f"\nExtracting {zip_file} → {extract_folder_name}/ ...")
        with zipfile.ZipFile(zip_path, "r") as zf:
            zf.extractall(extract_path)
        print("\nDone!")

print("\nAll archives extracted.")
!ls -R {unzipped_dir}



Extracting segmented.zip → segmented/ ...

Done!

All archives extracted.
/content/data/unzipped:
segmented

/content/data/unzipped/segmented:
by_sentence  by_verse

/content/data/unzipped/segmented/by_sentence:
Bikolano  Chavacano  Isnag   Kalinga	Kapampangan   Sambal   Tagalog	Yakan
Cebuano   Ilokano    Ivatan  Kankanaey	Pangasinense  Spanish  Tausug	Yami

/content/data/unzipped/segmented/by_sentence/Bikolano:
MBBBIK92_1CH_raw.csv  MBBBIK92_ECC_raw.csv  MBBBIK92_LUK_raw.csv
MBBBIK92_1CH_raw.txt  MBBBIK92_ECC_raw.txt  MBBBIK92_LUK_raw.txt
MBBBIK92_1CO_raw.csv  MBBBIK92_EPH_raw.csv  MBBBIK92_MAL_raw.csv
MBBBIK92_1CO_raw.txt  MBBBIK92_EPH_raw.txt  MBBBIK92_MAL_raw.txt
MBBBIK92_1JN_raw.csv  MBBBIK92_EST_raw.csv  MBBBIK92_MAT_raw.csv
MBBBIK92_1JN_raw.txt  MBBBIK92_EST_raw.txt  MBBBIK92_MAT_raw.txt
MBBBIK92_1KI_raw.csv  MBBBIK92_EXO_raw.csv  MBBBIK92_MIC_raw.csv
MBBBIK92_1KI_raw.txt  MBBBIK92_EXO_raw.txt  MBBBIK92_MIC_raw.txt
MBBBIK92_1MA_raw.csv  MBBBIK92_EZK_raw.csv  MBBBIK92_MRK_raw.cs

## Cleaning and Preprocessing

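For each language folder under `segmented/by_sentence`, we load the corpus into a `LanguageData` object (imported from `src.cleaning_preprocessing`) and print its summary.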
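
The summary printed by the cell below reports a sentence count, an average word length, and an average sentence length for each language. As a rough illustration of what such figures mean, the sketch below computes them for a small list of sentences. It assumes whitespace tokenization, word length measured in characters, and sentence length measured in words. `corpus_stats` is a hypothetical function, and `LanguageData.summary()` in the repository may compute these values differently.

```python
def corpus_stats(sentences):
    """Compute simple corpus statistics (hypothetical; assumes whitespace tokens)."""
    tokenized = [s.split() for s in sentences]
    words = [w for tokens in tokenized for w in tokens]
    return {
        "n_sentences": len(sentences),
        # Mean number of characters per word
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        # Mean number of words per sentence
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
    }


stats = corpus_stats(["ang bata ay masaya", "salamat po"])
print(stats)
# {'n_sentences': 2, 'avg_word_length': 4.0, 'avg_sentence_length': 3.0}
```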

In [3]:
from src.cleaning_preprocessing import LanguageData

data_dir = os.path.join(unzipped_dir, "segmented/by_sentence")
languages = os.listdir(data_dir)

corpora = {}

for lang in languages:
  lang_path = os.path.join(data_dir, lang)
  lang_obj = LanguageData(lang, lang_path).load()
  corpora[lang] = lang_obj

  lang_obj.summary()
  print("-" * 40)


Language: Spanish
No. of sentences: 34020
Avg. word length: 4.30
Avg. sentence length: 20.61
----------------------------------------
Language: Isnag
No. of sentences: 19884
Avg. word length: 4.29
Avg. sentence length: 15.74
----------------------------------------
Language: Kalinga
No. of sentences: 12050
Avg. word length: 5.04
Avg. sentence length: 15.79
----------------------------------------
Language: Cebuano
No. of sentences: 35191
Avg. word length: 4.64
Avg. sentence length: 23.23
----------------------------------------
Language: Chavacano
No. of sentences: 11889
Avg. word length: 4.03
Avg. sentence length: 23.56
----------------------------------------
Language: Kankanaey
No. of sentences: 20479
Avg. word length: 4.88
Avg. sentence length: 15.82
----------------------------------------
Language: Tagalog
No. of sentences: 51287
Avg. word length: 4.89
Avg. sentence length: 15.46
----------------------------------------
Language: Tausug
No. of sentences: 16055
Avg. word length: 4

## Feature Engineering
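For each corpus, we count character n-grams and word n-grams with `LanguageFeatureExtractor`. The sliders set `n` for each (3 for characters and 1 for words in this run).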
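
For reference, a character n-gram is a run of `n` consecutive characters (spaces included), and a word n-gram is a run of `n` consecutive words. The sketch below counts both with `collections.Counter`. It is a minimal illustration that assumes lowercasing and whitespace tokenization; the function names are hypothetical, and `LanguageFeatureExtractor` may normalize or tokenize text differently.

```python
from collections import Counter


def char_ngrams(sentences, n=3):
    """Count overlapping character n-grams, spaces included (hypothetical sketch)."""
    counts = Counter()
    for sentence in sentences:
        text = sentence.lower()
        # Slide a window of width n over the sentence; short sentences yield nothing
        counts.update(text[i:i + n] for i in range(len(text) - n + 1))
    return counts


def word_ngrams(sentences, n=1):
    """Count overlapping word n-grams, joined with a space (hypothetical sketch)."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        counts.update(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts


sample = ["de la casa", "de la ciudad"]
print(char_ngrams(sample, n=3).most_common(3))
print(word_ngrams(sample, n=1).most_common(3))
```

Because spaces are kept, `' de'` and `'de '` count as distinct character n-grams, which is consistent with the leading and trailing spaces visible in the output below.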

In [4]:
from src.feature_engineering import LanguageFeatureExtractor

char_ngram_n = 3 # @param {"type":"slider","min":1,"max":10,"step":1}
word_ngram_n = 1 # @param {"type":"slider","min":1,"max":5,"step":1}

feature_extractors = {}

for lang, lang_data in corpora.items():
  feat_extractor = LanguageFeatureExtractor(lang_data)
  feat_extractor.char_ngram(n=char_ngram_n)
  feat_extractor.word_ngram(n=word_ngram_n)
  feature_extractors[lang] = feat_extractor

  feat_extractor.summary()
  print("-" * 40)


Language: Spanish
Avg. word length: 4.30
Avg. sentence length: 20.61

Top 20 character n-grams:
 de: 58441
os : 55816
de : 45352
 y : 29402
el : 28550
que: 27385
 la: 25809
ue : 25064
 qu: 24545
as : 23965
es : 23142
los: 21895
 lo: 21776
la : 21657
 en: 21379
 co: 21337
 el: 21126
 a : 20165
s d: 19057
en : 18796

Top 20 word n-grams:
de: 42778
y: 32071
a: 20507
que: 20060
la: 19185
el: 18171
los: 17232
en: 15086
no: 8087
su: 7468
se: 7412
jehová: 6836
por: 6765
del: 6267
con: 5859
las: 5851
para: 5831
lo: 5463
sus: 5006
al: 4856
----------------------------------------
Language: Isnag
Avg. word length: 4.29
Avg. sentence length: 15.74

Top 20 character n-grams:
ay : 28539
 na: 26916
ya : 24217
 ka: 23068
a n: 21696
tu : 21187
nga: 20310
 ng: 20142
 da: 19608
án : 19131
n n: 18012
ga : 15659
an : 14109
na : 14000
 ma: 13661
 ki: 12834
a k: 12735
da : 11974
a a: 10918
_ay: 10475

Top 20 word n-grams:
ay: 19795
nga: 14587
na: 13840
ya: 9865
da: 8877
se: 7268
nu: 6466
nán: 6084
tu: 5747


## Clustering
Continue here!!!