## Milestone II — Build Catalog, Train Lightweight Models, Export Bundle

Outputs created by this notebook

- data/site_items.json – item catalog for browse/search in the Flask UI

- model/ensemble.pkl – equal-weight ensemble (Count+LR, TF-IDF+LR, TF-IDF→SVD+LR)

- model/manifest.json – small human/debug summary for the web app



Why these?

- Fast to train & load, zero external downloads, portable across machines.

- Meets the Milestone II requirement: automatically suggest a “Recommended” label for new reviews.

In [67]:
# === Cell 1. Imports, paths, reproducibility ===
import os, re, json, warnings, random
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, Any, List

import numpy as np
import pandas as pd

from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
import joblib

warnings.filterwarnings("ignore")

# Resolve project paths so this notebook works from project root or /notebooks
NB_DIR = Path.cwd().resolve()
CANDIDATES = [
    NB_DIR / "data" / "assignment3_II.csv",
    NB_DIR.parent / "data" / "assignment3_II.csv",
    NB_DIR.parents[1] / "data" / "assignment3_II.csv",
]
DATA_CSV = next((p for p in CANDIDATES if p.exists()), None)
if DATA_CSV is None:
    raise FileNotFoundError("Place 'assignment3_II.csv' under project_root/data/")

PROJECT_ROOT = DATA_CSV.parent.parent
DATA_DIR  = PROJECT_ROOT / "data"
MODEL_DIR = PROJECT_ROOT / "model"
DATA_DIR.mkdir(parents=True, exist_ok=True)
MODEL_DIR.mkdir(parents=True, exist_ok=True)

CATALOG_JSON  = DATA_DIR  / "site_items.json"
BUNDLE_PKL    = MODEL_DIR / "ensemble.pkl"
MANIFEST_JSON = MODEL_DIR / "manifest.json"

# Reproducibility
SEED = 42
random.seed(SEED); np.random.seed(SEED)

print("PROJECT_ROOT:", PROJECT_ROOT)
print("DATA_CSV    :", DATA_CSV)
print("DATA_DIR    :", DATA_DIR)
print("MODEL_DIR   :", MODEL_DIR)

PROJECT_ROOT: /Users/mac/Desktop/dem-web
DATA_CSV    : /Users/mac/Desktop/dem-web/data/assignment3_II.csv
DATA_DIR    : /Users/mac/Desktop/dem-web/data
MODEL_DIR   : /Users/mac/Desktop/dem-web/model


## Load CSV & normalize columns

We keep both display fields (Clothes Title/Description) and the original review fields for modeling.

- Fill missing strings with "", coerce numeric columns safely.

- Accept either Recommended IND or Recommended as the label column.

In [68]:
# === Cell 2. Load CSV & normalize columns ===
df = pd.read_csv(DATA_CSV)

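# Add display fields if missing (fall back to the legacy columns).
# Note: df.get("Title", "") would return a plain str when the column is absent, and str has no .fillna
if "Clothes Title" not in df.columns:
    df["Clothes Title"] = df["Title"].fillna("") if "Title" in df.columns else ""
if "Clothes Description" not in df.columns:
    df["Clothes Description"] = df["Review Text"].fillna("") if "Review Text" in df.columns else ""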

# Ensure these exist
need_cols = [
    "Clothing ID", "Clothes Title", "Clothes Description",
    "Rating", "Division Name", "Department Name", "Class Name",
    "Review Text", "Title"
]
for c in need_cols:
    if c not in df.columns:
        df[c] = ""

# Fill NaNs
df = df.fillna({
    "Clothing ID": 0,
    "Clothes Title": "",
    "Clothes Description": "",
    "Rating": 0,
    "Division Name": "",
    "Department Name": "",
    "Class Name": "",
    "Review Text": "",
    "Title": "",
})

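def to_int_safe(x, default=0):
    # int() raises on non-numeric strings, None, NaN and inf; fall back to the default
    try:
        return int(x)
    except (TypeError, ValueError, OverflowError):
        return default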

df["Clothing ID"] = df["Clothing ID"].apply(to_int_safe)
df["Rating"]      = df["Rating"].apply(to_int_safe)

# Determine label column
label_col = None
for cand in ["Recommended IND", "Recommended"]:
    if cand in df.columns:
        label_col = cand
        break
if label_col is None:
    raise ValueError("No label column found. Expect 'Recommended IND' or 'Recommended'.")

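# Clean label: coerce to numeric, then drop rows whose label is missing or not 0/1
# (filling missing labels with 0 would silently mark them as "not recommended")
df[label_col] = pd.to_numeric(df[label_col], errors="coerce")
df = df[df[label_col].isin([0, 1])].copy()
df[label_col] = df[label_col].astype(int)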

print("Rows:", len(df))
print("Columns:", list(df.columns))
print("Label column:", label_col, "| Positives:", int(df[label_col].sum()))

Rows: 19662
Columns: ['Clothing ID', 'Age', 'Title', 'Review Text', 'Rating', 'Recommended IND', 'Positive Feedback Count', 'Division Name', 'Department Name', 'Class Name', 'Clothes Title', 'Clothes Description']
Label column: Recommended IND | Positives: 16087


## Plural-aware normalizer (for keyword search)

We want “dress” and “dresses” to match the same items without heavy stemmers.

- Lowercase, strip punctuation.

- Rule-of-thumb plural reducer (common English endings).

- Keep alphabetic tokens only (one internal hyphen or apostrophe is allowed); digits are dropped.

In [69]:
# === Cell 3. Lightweight plural-aware normalizer ===
_word_re = re.compile(r"[A-Za-z]+(?:[-'][A-Za-z]+)?")

def normalize_for_search(text: str) -> str:
    text = text.lower()
    tokens = _word_re.findall(text)

    def reduce_plural(w: str) -> str:
        if len(w) <= 3: return w
        # common endings
        for suf, rep in [("ies","y"), ("sses","ss"), ("xes","x"), ("zes","z")]:
            if w.endswith(suf): return w[:-len(suf)] + rep
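        # Strip a remaining trailing "s" (but not "ss"), so "-es" endings become "-e".
        # A separate "es" rule placed after an "s" rule can never fire.
        if w.endswith("s") and not w.endswith("ss"):
            return w[:-1]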
        return w

    return " ".join(reduce_plural(t) for t in tokens)

## Build search catalog for Milestone II UI

We write a lightweight catalog, data/site_items.json, that the Flask app can load to render previews and detail pages:

- id, clothes_title, clothes_desc, rating, division, department, class

- preview (short teaser)

- search_text: normalized string used by the keyword search (plural-aware reducer)



If Clothes Title / Clothes Description are empty, we fall back to Title / Review Text. The preview is the description truncated to 160 characters.
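For orientation, one catalog entry might look like the record below. This is only a sketch: the values are invented for illustration and are not taken from the dataset, and `make_preview` is a stand-in name for the notebook's preview helper. Only the field names and the 160-character truncation rule come from this notebook.

```python
import json

def make_preview(desc: str, limit: int = 160) -> str:
    """Trim whitespace and truncate to `limit` characters, ending in '...' when shortened."""
    desc = (desc or "").strip()
    return desc if len(desc) <= limit else desc[:limit - 3] + "..."

# Hypothetical record: values are invented, field names match site_items.json.
item = {
    "id": 1078,
    "clothes_title": "Floral Wrap Dress",
    "clothes_desc": "A lightweight wrap dress with a floral print.",
    "rating": 4,
    "division": "General",
    "department": "Dresses",
    "class": "Dresses",
}
item["preview"] = make_preview(item["clothes_desc"])
# Title, description, division, department and class after lowercasing and plural reduction
item["search_text"] = (
    "floral wrap dress a lightweight wrap dress with a floral print general dress dress"
)

print(json.dumps(item, indent=2))
print(len(make_preview("x" * 200)))  # 160: long descriptions are cut to the limit
```

Storing `search_text` pre-normalized means the web app only has to normalize the query at request time, not every catalog entry.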

In [70]:
# === Cell 4. Build & save catalog ===
def pick_first_per_id(frame: pd.DataFrame) -> pd.DataFrame:
    # If multiple rows per Clothing ID, pick the longest review text row as the representative
    frame = frame.copy()
    frame["__len"] = frame["Review Text"].str.len().fillna(0)
    rep = (
        frame.sort_values(["Clothing ID","__len"], ascending=[True, False])
             .drop_duplicates(subset=["Clothing ID"], keep="first")
             .drop(columns="__len")
    )
    return rep

rep = pick_first_per_id(df)

def mk_preview(desc: str) -> str:
    desc = (desc or "").strip()
    if len(desc) > 160:
        return desc[:157] + "..."
    return desc

items = []
for _, r in rep.iterrows():
    cid   = int(r.get("Clothing ID", 0))
    title = (r.get("Clothes Title") or r.get("Title") or "").strip()
    desc  = (r.get("Clothes Description") or r.get("Review Text") or "").strip()
    rating = int(r.get("Rating", 0))
    div   = (r.get("Division Name") or "").strip()
    dept  = (r.get("Department Name") or "").strip()
    cls   = (r.get("Class Name") or "").strip()

    search_text = normalize_for_search(" ".join([
        title, desc, div, dept, cls
    ]))

    items.append({
        "id": cid,
        "clothes_title": title,
        "clothes_desc": desc,
        "rating": rating,
        "division": div,
        "department": dept,
        "class": cls,
        "preview": mk_preview(desc if desc else (r.get("Review Text") or "")[:120]),
        "search_text": search_text
    })

with open(CATALOG_JSON, "w", encoding="utf-8") as f:
    json.dump(items, f, ensure_ascii=False, indent=2)

print(f"Wrote catalog: {CATALOG_JSON} | items: {len(items)}")
print("Search normalizer test:", normalize_for_search("dress / dresses"), "→ expect same token twice")

Wrote catalog: /Users/mac/Desktop/dem-web/data/site_items.json | items: 1095
Search normalizer test: dress dress → expect same token twice


## Modeling corpus & label

We will:

- Use Review Text as the core signal (best generalization).

- Also build a Title+Review corpus (corpus_both) in case a combined variant is needed later; it is not trained or exported in this notebook.

- Drop completely empty reviews.

In [71]:
# === Cell 5. Build corpus & labels ===
df["Review Text"] = df["Review Text"].fillna("").astype(str)
df["Title"] = df["Title"].fillna("").astype(str)

mask_nonempty = df["Review Text"].str.strip().str.len() > 0
dfe = df[mask_nonempty].copy()

corpus_review = dfe["Review Text"].tolist()
corpus_both   = (dfe["Title"].str.cat(dfe["Review Text"], sep=" . ").tolist())
y             = dfe[label_col].astype(int).to_numpy()

print("Samples:", len(y), "| positives:", int(y.sum()))
print("Example:", corpus_review[0][:120] if len(corpus_review)>0 else "(empty)")

Samples: 19662 | positives: 16087
Example: I had such high hopes for this dress and really wanted it to work for me. i initially ordered the petite small (my usual


## Define lightweight pipelines

We use three compact models:

1) Count + LogisticRegression

2) TF-IDF + LogisticRegression

3) TF-IDF → TruncatedSVD + Standardize + LogisticRegression (LSA style)



Notes:

- Automatic feature limits (max_features) to avoid memory spikes.

- Dynamic min_df to avoid “no terms remain” errors on small subsets.

- Safe SVD dimension computed from a probe TF-IDF fit.
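The two safety rules in the notes reduce to a little arithmetic, sketched below. `safe_svd_dim` is a hypothetical name (the notebook computes this inline), and the feature counts are made-up inputs; only the 19662 document count comes from this dataset.

```python
def choose_min_df(n_docs: int) -> int:
    """Lower the document-frequency floor as the corpus shrinks."""
    if n_docs >= 5000:
        return 3
    if n_docs >= 1000:
        return 2
    return 1

def safe_svd_dim(n_features: int, requested=None, cap: int = 256) -> int:
    """Clamp the SVD dimension to [2, cap] and below the TF-IDF vocabulary size."""
    max_dim = max(2, min(cap, n_features - 1))
    return max_dim if requested is None else min(requested, max_dim)

# (n_docs, n_features) pairs; the feature counts are illustrative assumptions
for n_docs, n_features in [(19662, 30000), (1200, 180), (40, 3)]:
    print(n_docs, choose_min_df(n_docs), safe_svd_dim(n_features))
# 19662 3 256
# 1200 2 179
# 40 1 2
```

With a large vocabulary the cap of 256 wins; with a tiny one the dimension falls back to one less than the number of features, never below 2.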

In [72]:
# === Cell 6. Pipelines (with safe parameters for small/large data) ===
def choose_min_df(n_docs: int) -> int:
    # Keep at least a few terms EVEN in very small CV folds
    if n_docs >= 5000: return 3
    if n_docs >= 1000: return 2
    return 1

def make_count_lr(n_docs: int, max_features=30000) -> Pipeline:
    mindf = choose_min_df(n_docs)
    return Pipeline([
        ("vec", CountVectorizer(max_features=max_features, min_df=mindf, ngram_range=(1,2))),
        ("clf", LogisticRegression(max_iter=2000, n_jobs=None))
    ])

def make_tfidf_lr(n_docs: int, max_features=30000) -> Pipeline:
    mindf = choose_min_df(n_docs)
    return Pipeline([
        ("vec", TfidfVectorizer(max_features=max_features, min_df=mindf, ngram_range=(1,2))),
        ("clf", LogisticRegression(max_iter=2000, n_jobs=None))
    ])

def make_tfidf_svd_lr(n_docs: int, svd_dim: int = None, max_features=30000) -> Pipeline:
    mindf = choose_min_df(n_docs)
    # Probe TF-IDF to pick a safe SVD dim
    probe = TfidfVectorizer(max_features=max_features, min_df=mindf, ngram_range=(1,2))
    Xp = probe.fit_transform(corpus_review)  # global probe is fine for dimension decisions
    max_dim = max(2, min(256, Xp.shape[1]-1))
    dim = max_dim if svd_dim is None else min(svd_dim, max_dim)
    return Pipeline([
        ("vec", TfidfVectorizer(max_features=max_features, min_df=mindf, ngram_range=(1,2))),
        ("svd", TruncatedSVD(n_components=dim, random_state=SEED)),
        ("sc",  StandardScaler(with_mean=False)),
        ("clf", LogisticRegression(max_iter=2000, n_jobs=None))
    ])

## Cross-validated sanity check (small & fast)

- Dynamic n_splits respects the minority class size to avoid errors on small subsets.

- We report Accuracy, F1, ROC-AUC.
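As a rough illustration of the dynamic fold count, the rule is "at most 5 folds, never more than the minority class size, never fewer than 2". `safe_n_splits` is a hypothetical helper name; the first label list mirrors this dataset's counts (16087 positives out of 19662), the second is an invented small subset.

```python
from collections import Counter

def safe_n_splits(labels, max_splits: int = 5) -> int:
    """Number of stratified folds bounded by the minority class size."""
    counts = Counter(labels)
    min_class = min(counts.get(0, 0), counts.get(1, 0))
    return max(2, min(max_splits, min_class))

full = [1] * 16087 + [0] * 3575   # class counts reported by this notebook
tiny = [1] * 10 + [0] * 3         # hypothetical small subset

print(safe_n_splits(full))  # 5
print(safe_n_splits(tiny))  # 3
```

Note that the floor of 2 applies even when the minority class has fewer than two samples, a case the notebook does not handle specially.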

In [73]:
# === Cell 7. Quick CV sanity check ===
from sklearn.model_selection import cross_validate  # scores every metric in one pass over the folds

def eval_model(name: str, pipe: Pipeline, X: List[str], y: np.ndarray, seed: int = SEED) -> Dict[str, float]:
    # Pick a safe number of folds
    min_class = min((y==0).sum(), (y==1).sum())
    n_splits = max(2, min(5, int(min_class)))  # at least 2, at most 5, not exceeding minority count
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    metrics = ("accuracy", "f1", "roc_auc")
    # A single cross_validate call fits each fold once instead of once per metric
    cv_out = cross_validate(pipe, X, y, cv=cv, scoring=metrics)
    scores = {m: float(cv_out[f"test_{m}"].mean()) for m in metrics}
    print(f"[{name}] Acc={scores['accuracy']:.3f} F1={scores['f1']:.3f} AUC={scores['roc_auc']:.3f}")
    return {"name": name, **scores}

results = []
results.append(eval_model("Count+LR (review)",         make_count_lr(len(corpus_review)),    corpus_review, y))
results.append(eval_model("TFIDF+LR (review)",         make_tfidf_lr(len(corpus_review)),    corpus_review, y))
results.append(eval_model("TFIDF→SVD+LR (review)",     make_tfidf_svd_lr(len(corpus_review)), corpus_review, y))

pd.DataFrame(results).sort_values("roc_auc", ascending=False)

[Count+LR (review)] Acc=0.896 F1=0.937 AUC=0.928
[TFIDF+LR (review)] Acc=0.884 F1=0.932 AUC=0.944
[TFIDF→SVD+LR (review)] Acc=0.891 F1=0.935 AUC=0.932


|   | name | accuracy | f1 | roc_auc |
|---|------|----------|----|---------|
| 1 | TFIDF+LR (review) | 0.884447 | 0.932459 | 0.94393 |
| 2 | TFIDF→SVD+LR (review) | 0.890856 | 0.934698 | 0.93244 |
| 0 | Count+LR (review) | 0.895535 | 0.937097 | 0.928375 |


## Train final models on all data & export bundle

We train three pipelines on all review texts and export an equal-weight ensemble:

- ensemble.pkl: contains models + predict/predict_proba logic

- manifest.json: quick summary for your /metrics page



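The Flask app can joblib.load("model/ensemble.pkl") and call:

bundle.predict_proba([text]) or bundle.predict([text]).

Because pickle stores the EnsembleBundle class by reference, the class definition must be importable in the Flask process under the same module path it had when the bundle was saved (here the notebook's __main__); otherwise loading fails.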
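The averaging behind predict_proba and predict can be worked through by hand. The sketch below uses invented per-model probabilities for a single review and a hypothetical helper name, `ensemble_proba`; it mirrors the weighted mean and the 0.5 threshold described above without loading any model.

```python
def ensemble_proba(per_model: dict, weights: dict) -> float:
    """Weighted mean of each model's class-1 probability."""
    total = sum(weights.get(name, 0.0) * p for name, p in per_model.items())
    wsum = sum(weights.values()) or 1.0   # guard against an all-zero weight dict
    return total / wsum

# Hypothetical class-1 probabilities from the three pipelines for one review
probs = {"count_lr": 0.9, "tfidf_lr": 0.6, "svd_lr": 0.3}

equal = {"count_lr": 1.0, "tfidf_lr": 1.0, "svd_lr": 1.0}
p_equal = ensemble_proba(probs, equal)
print(f"{p_equal:.3f}", int(p_equal >= 0.5))    # 0.600 1

# Doubling one weight shifts the average toward that model
skewed = {"count_lr": 2.0, "tfidf_lr": 1.0, "svd_lr": 1.0}
p_skewed = ensemble_proba(probs, skewed)
print(f"{p_skewed:.3f}", int(p_skewed >= 0.5))  # 0.675 1
```

Because the sum is divided by the total weight, the weights do not need to add up to 1; the exported bundle uses 1.0 for each model.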

In [74]:
# === Cell 8. Train final models & export ===
@dataclass
class EnsembleBundle:
    name: str
    models: Dict[str, Any]          # {"count_lr": pipe, "tfidf_lr": pipe, "svd_lr": pipe}
    weights: Dict[str, float]       # {"count_lr": 1/3, ...}
    tokenizer_version: str = "regex+light"
    notes: str = "Milestone II export; all models trained on Review Text."

    def _probas(self, texts: List[str]) -> Dict[str, np.ndarray]:
        out = {}
        for k, m in self.models.items():
            p = m.predict_proba(texts)[:, 1]  # class 1 prob
            out[k] = p
        return out

    def predict_proba(self, texts: List[str]) -> np.ndarray:
        probs = self._probas(texts)
        s = np.zeros(len(texts), dtype=float)
        for k, p in probs.items():
            w = self.weights.get(k, 0.0)
            s += w * p
        wsum = sum(self.weights.values()) or 1.0
        return s / wsum

    def predict(self, texts: List[str], threshold: float = 0.5) -> np.ndarray:
        return (self.predict_proba(texts) >= threshold).astype(int)

# Train three models on full corpus
model_A = make_count_lr(len(corpus_review))
model_B = make_tfidf_lr(len(corpus_review))
model_C = make_tfidf_svd_lr(len(corpus_review))

model_A.fit(corpus_review, y)
model_B.fit(corpus_review, y)
model_C.fit(corpus_review, y)

weights = {"count_lr": 1.0, "tfidf_lr": 1.0, "svd_lr": 1.0}
bundle = EnsembleBundle(
    name="Count+TFIDF+LSA LR (equal-weight ensemble)",
    models={"count_lr": model_A, "tfidf_lr": model_B, "svd_lr": model_C},
    weights=weights,
)

joblib.dump(bundle, BUNDLE_PKL)

manifest = {
    "bundle_name": bundle.name,
    "weights": bundle.weights,
    "tokenizer": bundle.tokenizer_version,
    "notes": bundle.notes,
    "samples": len(corpus_review),
    "positives": int(y.sum()),
    "data_csv": str(DATA_CSV)
}
with open(MANIFEST_JSON, "w", encoding="utf-8") as f:
    json.dump(manifest, f, indent=2)

print(f"Saved model bundle → {BUNDLE_PKL}")
print(f"Saved manifest     → {MANIFEST_JSON}")

Saved model bundle → /Users/mac/Desktop/dem-web/model/ensemble.pkl
Saved manifest     → /Users/mac/Desktop/dem-web/model/manifest.json


## What to wire in Flask (Milestone II)

- Search: load data/site_items.json; normalize the user query with the same normalize_for_search, rank items whose search_text contains any or most of the query tokens, show previews, and link to detail pages.

- New Review: take title + review_text, call bundle.predict_proba([review_text]) (the models were trained on Review Text only) and display the suggested Recommended label (allow override); on submit, append to your storage and show the new review page.

- Metrics page: read model/manifest.json and show weights, bundle name, notes, sample counts.
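The search step could be sketched as follows. This is one possible reading of "contains any/most query tokens", not the app's actual code: it matches whole tokens and ranks by the number of distinct query tokens found. `search_items` and `min_hits` are hypothetical names, the three items are invented, and the normalizer is a compact copy of the one in Cell 3 so the block runs on its own.

```python
import re

_WORD_RE = re.compile(r"[A-Za-z]+(?:[-'][A-Za-z]+)?")

def normalize_for_search(text: str) -> str:
    """Compact copy of the Cell 3 normalizer: lowercase, letters only, light plural reduction."""
    def reduce_plural(w: str) -> str:
        if len(w) <= 3:
            return w
        for suf, rep in (("ies", "y"), ("sses", "ss"), ("xes", "x"), ("zes", "z")):
            if w.endswith(suf):
                return w[:-len(suf)] + rep
        if w.endswith("s") and not w.endswith("ss"):
            return w[:-1]
        return w
    return " ".join(reduce_plural(t) for t in _WORD_RE.findall(text.lower()))

def search_items(items, query: str, min_hits: int = 1):
    """Return items ordered by how many distinct query tokens occur in search_text."""
    q_tokens = set(normalize_for_search(query).split())
    scored = []
    for item in items:
        hits = len(q_tokens & set(item["search_text"].split()))
        if hits >= min_hits:
            scored.append((hits, item))
    # Python's sort is stable, so ties keep their catalog order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored]

# Invented mini-catalog; real entries come from data/site_items.json
items = [
    {"id": 1, "search_text": normalize_for_search("Summer dresses, floral")},
    {"id": 2, "search_text": normalize_for_search("Denim jacket")},
    {"id": 3, "search_text": normalize_for_search("Red floral dress")},
]

print([it["id"] for it in search_items(items, "red dresses")])  # [3, 1]
```

Raising `min_hits` to the number of query tokens turns the "any token" match into an "all tokens" match.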



This notebook already created everything your app reads at runtime:

- data/site_items.json