# Clustering Pipelines - Colab Runner
Run the repository's clustering pipelines inside Google Colab or any standard Jupyter runtime. Execute the cells in order, customizing the configuration blocks as needed.


## (Optional) Clone the repository when starting from a blank Colab runtime
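If you opened this notebook directly from GitHub, the Python modules and datasets are not present yet. Fill in `REPO_URL` with your clone URL and run the next cell to fetch the repo into `/content/clustering`. The cell deletes any existing `/content/clustering` directory before cloning, so copy out anything you want to keep first.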


In [None]:
import os
import shutil
import subprocess
from pathlib import Path

REPO_URL = ""  # e.g. "https://github.com/<user>/clustering.git"
TARGET_DIR = Path("/content/clustering")

if REPO_URL:
    if TARGET_DIR.exists():
        print(f"Removing existing directory: {TARGET_DIR}")
        shutil.rmtree(TARGET_DIR)
    print(f"Cloning {REPO_URL} -> {TARGET_DIR}")
    subprocess.run(["git", "clone", REPO_URL, str(TARGET_DIR)], check=True)
    os.chdir(TARGET_DIR)
    print(f"Changed working directory to {TARGET_DIR}")
else:
    print("Skipping clone. Set REPO_URL above when running in a fresh Colab session.")


## Install Python dependencies
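This project needs only pandas, scikit-learn, and matplotlib. Keep this cell even if your runtime already has them: pip skips packages that are already installed. The command pins no versions, so it confirms the packages are present but does not by itself guarantee the versions the repo expects.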


In [None]:
!pip install --quiet pandas scikit-learn matplotlib
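Because the install command pins no versions, it can help to record what the runtime actually ended up with. The following is a minimal sketch using the standard-library `importlib.metadata`; the helper name `installed_versions` is hypothetical and not part of the repository.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    found = {}
    for pkg in packages:
        try:
            found[pkg] = version(pkg)
        except PackageNotFoundError:
            found[pkg] = None
    return found


# Report the three packages this notebook installs.
for pkg, ver in installed_versions(["pandas", "scikit-learn", "matplotlib"]).items():
    print(f"{pkg}: {ver or 'not installed'}")
```

Comparing this output between runtimes is one way to spot version drift when results differ.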


## Configure dataset and pipelines to run
Set `dataset_metadata_path` to any of the JSON files under `datasets/`. Update `pipelines_to_run` with one or more of `"knn"`, `"kmeans"`, or `"dbscan"`.


In [None]:
from pathlib import Path
import sys

WORK_DIR = Path.cwd()
SRC_DIR = WORK_DIR / "src"
if not SRC_DIR.exists():
    raise FileNotFoundError("Could not find the 'src' directory. Ensure the notebook runs from the repository root or clone it in the previous cell.")

if str(SRC_DIR) not in sys.path:
    sys.path.append(str(SRC_DIR))

from common.dataset_metadata import load_dataset_metadata

dataset_metadata_path = WORK_DIR / "datasets" / "raw" / "bank-full.json"
pipelines_to_run = ["knn", "kmeans", "dbscan"]

dataset_meta = load_dataset_metadata(dataset_metadata_path, base_dir=WORK_DIR)
print(f"Dataset: {dataset_meta.name}")
print(f"Data file: {dataset_meta.dataset_path}")
print(f"Selected pipelines: {pipelines_to_run}")
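To see which metadata files are available before choosing `dataset_metadata_path`, you can list the JSON files under `datasets/`. This is a sketch that assumes every `.json` file below that directory is a dataset metadata file; `list_metadata_files` is a hypothetical helper, not a function from the repository.

```python
from pathlib import Path


def list_metadata_files(datasets_dir):
    """Return every .json file below datasets_dir, sorted; [] if the directory is missing."""
    datasets_dir = Path(datasets_dir)
    if not datasets_dir.is_dir():
        return []
    return sorted(datasets_dir.rglob("*.json"))


# Assumes the notebook's working directory is the repository root.
for path in list_metadata_files(Path.cwd() / "datasets"):
    print(path)
```

Any path printed here can be assigned to `dataset_metadata_path` in the configuration cell.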


## Run the selected pipelines
This reuses the project's existing pipeline runners (identical to invoking `python src/main.py`). Results and trained models are written under `results/<pipeline>/<config>/`.


In [None]:
from main import run_knn, run_kmeans, run_dbscan

runners = {
    "knn": run_knn,
    "kmeans": run_kmeans,
    "dbscan": run_dbscan,
}

for name in pipelines_to_run:
    if name not in runners:
        raise ValueError(f"Unknown pipeline requested: {name}")

print("\n=== Starting pipeline runs ===")
for name in pipelines_to_run:
    print(f"\n>>> Running {name.upper()} pipeline")
    runners[name](WORK_DIR, dataset_meta)
print("\nAll requested pipelines completed.")
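The loop above stops at the first pipeline that raises. If you would rather let the remaining pipelines run and review failures afterwards, a wrapper along these lines is one option. It is a sketch only: `run_with_timing` is a hypothetical helper, and the demo uses stand-in callables rather than the repository's runners.

```python
import time


def run_with_timing(runners, names, *args):
    """Call runners[name](*args) for each name; return {name: (status, seconds)}."""
    report = {}
    for name in names:
        start = time.perf_counter()
        try:
            runners[name](*args)
            status = "ok"
        except Exception as exc:  # record the failure and continue with the next runner
            status = f"failed: {exc}"
        report[name] = (status, time.perf_counter() - start)
    return report


# Demo with stand-in runners; the second one fails on purpose.
demo_runners = {"fast": lambda: None, "broken": lambda: 1 / 0}
for name, (status, seconds) in run_with_timing(demo_runners, ["fast", "broken"]).items():
    print(f"{name}: {status} ({seconds:.3f}s)")
```

In this notebook the equivalent call would be `run_with_timing(runners, pipelines_to_run, WORK_DIR, dataset_meta)`, which passes the same two arguments the original loop does.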


## Inspect generated artifacts
Each pipeline stores `metadata.json` and `model.pkl` under the results directory. Use the helper below to list what was produced.


In [None]:
from pathlib import Path

results_root = WORK_DIR / "results"
if not results_root.exists():
    raise FileNotFoundError("No results directory found yet. Run at least one pipeline first.")

for metadata_path in sorted(results_root.glob("*/**/metadata.json")):
    rel = metadata_path.relative_to(WORK_DIR)
    print(f"✔ {rel}")
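To look inside the artifacts rather than just list them, you can load each `metadata.json` and show its top-level keys. The schema of these files is not documented in this notebook, so the sketch below assumes only that each file is valid JSON; `summarize_metadata` is a hypothetical helper.

```python
import json
from pathlib import Path


def summarize_metadata(results_root):
    """Map each metadata.json (path relative to results_root) to its sorted top-level keys."""
    results_root = Path(results_root)
    summary = {}
    for path in sorted(results_root.glob("*/**/metadata.json")):
        with path.open(encoding="utf-8") as fh:
            data = json.load(fh)
        # Files whose top level is not a JSON object get an empty key list.
        keys = sorted(data) if isinstance(data, dict) else []
        summary[path.relative_to(results_root).as_posix()] = keys
    return summary


# Assumes the notebook's working directory is the repository root.
for rel_path, keys in summarize_metadata(Path.cwd() / "results").items():
    print(f"{rel_path}: {keys}")
```

The glob pattern matches the one used in the listing cell, so both cells cover the same set of files.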
