
 # Milestone II — Why these 3 models and how they work

 We keep one familiar baseline (LR) and add two models with different inductive biases so that their errors are less correlated.
 Together the trio aims for strong accuracy, probability outputs (needed for the UI), and robustness to noisy or misspelled text.

 - 3 models, 3 views of the text: a word-level linear model (LR), a word-level Naive Bayes model (CNB), and a character-level margin model (SVM).

 ## 1) TF-IDF + Logistic Regression (LR)
 Word n-grams + LR. Fast, smooth probabilities, interpretable features. Misses character-level clues (typos, subwords); the decision boundary is linear.

 ## 2) TF-IDF + Complement Naive Bayes (CNB)
 Word n-grams + CNB. Strong on short/imbalanced data, trains very fast. The feature-independence assumption limits the decision boundaries it can express.

 ## 3) Char TF-IDF (3–5) + Calibrated Linear SVM
 Character n-grams capture typos/subword patterns. `LinearSVC` gives strong margins but no `predict_proba`; sigmoid calibration maps its margins to probabilities.

 **Safety knobs**: dynamic `min_df` and capped `max_features` (word models), `class_weight="balanced"` (LR/SVM), SVM probability calibration.
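
To make the character-level view concrete, the sketch below compares a correctly spelled word with a typo. It is plain Python that only approximates a word-boundary-aware character analyzer (pad the word with a space, take 3–5 character windows); the helper names `char_ngrams` and `jaccard` are illustrative and not part of the notebook.

```python
def char_ngrams(word: str, n_min: int = 3, n_max: int = 5) -> set:
    """Character n-grams of one word, padded with a space on each side.

    Rough approximation of a word-boundary-aware character analyzer; this is
    an illustration, not scikit-learn's implementation.
    """
    padded = f" {word.lower()} "
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams.add(padded[i:i + n])
    return grams


def jaccard(a: set, b: set) -> float:
    """Share of items the two sets have in common (0 = disjoint, 1 = identical)."""
    return len(a & b) / len(a | b) if (a or b) else 0.0


correct, typo = "comfortable", "comfortible"

# Word-level view: the two tokens are simply different vocabulary entries.
word_overlap = jaccard({correct}, {typo})

# Character-level view: every n-gram that does not span the misspelled
# letter is shared between the two spellings.
char_overlap = jaccard(char_ngrams(correct), char_ngrams(typo))

print(f"word-level overlap: {word_overlap:.2f}")
print(f"char-level overlap: {char_overlap:.2f}")
```

A word-level vectorizer sees no common feature between the two spellings, while the character-level one still sees a sizeable shared set, which is the property model 3 relies on.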


## === Cell 1. Imports, paths, reproducibility ===

In [44]:

import json, warnings, random, re
from pathlib import Path
from typing import List

import numpy as np
import pandas as pd

from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import ComplementNB
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import VotingClassifier
import joblib

warnings.filterwarnings("ignore")

# Resolve dataset no matter where the notebook runs
NB_DIR = Path.cwd().resolve()
CANDIDATES = [
    NB_DIR / "data" / "assignment3_II.csv",
    NB_DIR.parent / "data" / "assignment3_II.csv",
    NB_DIR.parents[1] / "data" / "assignment3_II.csv",
]
DATA_CSV = next((p for p in CANDIDATES if p.exists()), None)
if DATA_CSV is None:
    raise FileNotFoundError("Place 'assignment3_II.csv' under project_root/data/")

PROJECT_ROOT = DATA_CSV.parent.parent
DATA_DIR  = PROJECT_ROOT / "data"
MODEL_DIR = PROJECT_ROOT / "model"
DATA_DIR.mkdir(parents=True, exist_ok=True)
MODEL_DIR.mkdir(parents=True, exist_ok=True)

ENSEMBLE_PKL   = MODEL_DIR / "ensemble_soft.pkl"
MANIFEST_JSON  = MODEL_DIR / "manifest.json"

SEED = 42
random.seed(SEED); np.random.seed(SEED)

print("PROJECT_ROOT:", PROJECT_ROOT)
print("DATA_CSV    :", DATA_CSV)
print("MODEL_DIR   :", MODEL_DIR)

PROJECT_ROOT: /Users/mac/Desktop/dem-web
DATA_CSV    : /Users/mac/Desktop/dem-web/data/assignment3_II.csv
MODEL_DIR   : /Users/mac/Desktop/dem-web/model


## === Cell 2. Load CSV & normalize columns ===
 - Accept either `Recommended IND` or `Recommended` as label.
 - Minimal cleaning; keep "not/never/no" in vocabulary.

In [45]:

df = pd.read_csv(DATA_CSV)

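# Display fields (fallbacks). Build an empty Series when neither column exists,
# because df.get(..., "") would return a plain str, which has no .fillna().
def _col_or(df_, primary, fallback):
    for c in (primary, fallback):
        if c in df_.columns:
            return df_[c].fillna("")
    return pd.Series("", index=df_.index)

df["Clothes Title"] = _col_or(df, "Clothes Title", "Title")
df["Clothes Description"] = _col_or(df, "Clothes Description", "Review Text")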

# Ensure required columns exist
for c in [
    "Clothing ID","Clothes Title","Clothes Description","Rating",
    "Division Name","Department Name","Class Name","Review Text","Title"
]:
    if c not in df.columns:
        df[c] = ""

# Types
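def to_int_safe(x, default=0):
    """Cast to int; fall back to `default` for NaN, inf, None, or non-numeric strings."""
    try:
        return int(x)
    except (TypeError, ValueError, OverflowError):
        return default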

df["Clothing ID"] = df["Clothing ID"].apply(to_int_safe)
df["Rating"]      = pd.to_numeric(df["Rating"], errors="coerce").fillna(0).astype(int)
df["Review Text"] = df["Review Text"].fillna("").astype(str)
df["Title"]       = df["Title"].fillna("").astype(str)

# Label column
label_col = "Recommended IND" if "Recommended IND" in df.columns else (
            "Recommended" if "Recommended" in df.columns else None)
if label_col is None:
    raise ValueError("No label found. Expect 'Recommended IND' or 'Recommended'.")

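# NOTE: fillna(0) maps missing or unparseable labels to 0 ("not recommended").
# Drop those rows instead if the CSV can contain unlabeled reviews.
df[label_col] = pd.to_numeric(df[label_col], errors="coerce").fillna(0).astype(int)
df = df[df[label_col].isin([0, 1])].copy()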

print("Rows:", len(df))
print("Label:", label_col, "| Positives:", int(df[label_col].sum()))

Rows: 19662
Label: Recommended IND | Positives: 16087


## === Cell 3. Light text cleaning for modeling ===

In [46]:
def clean_text(text: str) -> str:
    """Lowercase, keep only letters, digits, whitespace and apostrophes, collapse spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s']", " ", text)
    return re.sub(r"\s+", " ", text).strip()

df["Review Text"] = df["Review Text"].apply(clean_text)
df["Title"]       = df["Title"].apply(clean_text)

mask_nonempty = df["Review Text"].str.len() > 0
df = df[mask_nonempty].copy()

print("Remaining samples:", len(df))

Remaining samples: 19662
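
As a quick check of what the cleaning step keeps and drops, the snippet below restates `clean_text` from the cell above and runs it on two made-up strings (the inputs are invented for illustration and are not taken from the dataset).

```python
import re

def clean_text(text: str) -> str:
    # Same three steps as the notebook cell: lowercase, replace everything
    # except letters, digits, whitespace and apostrophes, collapse whitespace.
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s']", " ", text)
    return re.sub(r"\s+", " ", text).strip()

samples = ["Didn't fit!!", "Runs small :( size-4"]
cleaned = [clean_text(s) for s in samples]

for raw, out in zip(samples, cleaned):
    print(f"{raw!r} -> {out!r}")
```

Apostrophes survive, so contractions such as "didn't" stay intact. All other punctuation becomes whitespace, so hyphenated tokens like "size-4" are split in two.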


## === Cell 4. Build corpus & labels ===

In [47]:
X = df["Review Text"].tolist()
y = df[label_col].astype(int).to_numpy()

print(f"Samples: {len(y)} | Positives: {int(y.sum())}")
print("Example:", (X[0][:120] + "...") if X else "(empty)")

Samples: 19662 | Positives: 16087
Example: i had such high hopes for this dress and really wanted it to work for me i initially ordered the petite small my usual s...


## === Cell 5. Define the 3 pipelines ===
 - Word TF-IDF (1–3) + LR (balanced).
 - Word TF-IDF (1–3) + Complement NB.
 - Char TF-IDF (3–5) + LinearSVC (calibrated to get `predict_proba`).


In [48]:

STOP_WORDS_KEEP_NEG = sorted(list(ENGLISH_STOP_WORDS - {"no", "not", "never"}))

def _choose_min_df(n_docs: int) -> int:
    # Small corpora need min_df=1 to keep enough features
    if n_docs >= 5000: return 3
    if n_docs >= 1000: return 2
    return 1

def make_tfidf_lr(n_docs: int, max_features: int = 40_000) -> Pipeline:
    """Word TF-IDF 1–3 + Logistic Regression (balanced)."""
    return Pipeline([
        ("tfidf", TfidfVectorizer(
            ngram_range=(1, 3),
            max_features=max_features,
            min_df=_choose_min_df(n_docs),
            stop_words=STOP_WORDS_KEEP_NEG
        )),
        ("clf", LogisticRegression(
            max_iter=2000,
            class_weight="balanced",
            solver="liblinear",
            random_state=SEED
        ))
    ])

def make_tfidf_cnb(n_docs: int, max_features: int = 40_000, alpha: float = 0.5) -> Pipeline:
    """Word TF-IDF 1–3 + Complement NB."""
    return Pipeline([
        ("tfidf", TfidfVectorizer(
            ngram_range=(1, 3),
            max_features=max_features,
            min_df=_choose_min_df(n_docs),
            stop_words=STOP_WORDS_KEEP_NEG
        )),
        ("clf", ComplementNB(alpha=alpha))
    ])

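def make_char_svm(n_docs: int, C: float = 0.5) -> Pipeline:
    """Char TF-IDF 3–5 + LinearSVC (calibrated for probabilities).

    `n_docs` is currently unused here: `min_df` is fixed at 2 and there is no
    `max_features` cap, unlike the two word-level pipelines.
    """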
    vec = TfidfVectorizer(
        analyzer="char_wb",
        ngram_range=(3, 5),
        sublinear_tf=True,
        min_df=2
    )
    svm = LinearSVC(C=C, class_weight="balanced", random_state=SEED)
    cal = CalibratedClassifierCV(svm, method="sigmoid", cv=3)
    return Pipeline([("vec", vec), ("clf", cal)])
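
`LinearSVC` only exposes signed margins through `decision_function`, so `CalibratedClassifierCV(method="sigmoid")` fits a logistic curve that turns a margin into a probability. The sketch below shows only the shape of that mapping, P = 1 / (1 + exp(A·f + B)). The values of `A` and `B` are hypothetical placeholders; in the pipeline they are learned from held-out folds (`cv=3`).

```python
import math

def platt_probability(margin: float, A: float, B: float) -> float:
    """Sigmoid (Platt) mapping from an SVM margin to P(positive)."""
    return 1.0 / (1.0 + math.exp(A * margin + B))

# Hypothetical parameters for illustration only. A is negative so that
# larger margins map to larger probabilities.
A, B = -2.0, 0.0

for margin in (-1.5, -0.5, 0.0, 0.5, 1.5):
    p = platt_probability(margin, A, B)
    print(f"margin={margin:+.1f} -> P(positive)={p:.3f}")
```

The mapping is monotonic, so calibration changes the probability scale but not the ranking of reviews by margin.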

## === Cell 6. Build X (corpus) and y (labels) ===

In [49]:
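# Cell 3 already reduced "Review Text" to lowercase letters, digits, spaces and
# apostrophes, so this second pass changes nothing on the current data. It only
# has an effect (keeping . , ! ' ? -) if Cell 3 is skipped.
def _clean_text(s: str) -> str:
    s = (s or "").lower()
    s = re.sub(r"[^a-z0-9\s.,!'?-]", " ", s)
    s = re.sub(r"\s+", " ", s).strip()
    return s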

dfe = df.copy()
dfe["Review Text"] = dfe["Review Text"].fillna("").astype(str).map(_clean_text)
y_all = pd.to_numeric(dfe[label_col], errors="coerce").fillna(0).astype(int)

mask = (dfe["Review Text"].str.len() > 0) & (y_all.isin([0, 1]))
dfe = dfe.loc[mask].reset_index(drop=True)

# Final training arrays
X = dfe["Review Text"].tolist()     # corpus (list[str])
y = dfe[label_col].astype(int).to_numpy()

print(f"[Cell 6] Samples kept: {len(y)} | Positives: {int(y.sum())} | Negatives: {len(y) - int(y.sum())}")
print("Example:", (X[0][:120] + "...") if len(X) else "(empty)")

[Cell 6] Samples kept: 19662 | Positives: 16087 | Negatives: 3575
Example: i had such high hopes for this dress and really wanted it to work for me i initially ordered the petite small my usual s...
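
With 16087 positives and 3575 negatives the classes are split roughly 4.5:1, which is why the LR and SVM pipelines use `class_weight="balanced"`. scikit-learn documents that setting as `n_samples / (n_classes * class_count)`. The arithmetic below applies the formula to the counts printed above; the helper name `balanced_weights` is illustrative.

```python
def balanced_weights(counts: dict) -> dict:
    """Per-class weights using n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}

# Class counts reported by the cell above.
counts = {0: 3575, 1: 16087}
weights = balanced_weights(counts)

for label, w in weights.items():
    print(f"class {label}: weight {w:.3f}")

# Share of reviews in the majority class, i.e. the accuracy of always
# predicting "recommended"; useful context for the CV accuracy figures.
majority_share = max(counts.values()) / sum(counts.values())
print(f"majority-class share: {majority_share:.3f}")
```

Under this weighting each class contributes the same total weight (half of `n_samples`), so the minority "not recommended" reviews are not drowned out during training.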


## === Cell 7. Quick CV sanity check (Acc/F1/AUC) ===



In [50]:
def _safe_cv(y_vec: np.ndarray, seed: int = SEED) -> StratifiedKFold:
    min_class = int(min((y_vec == 0).sum(), (y_vec == 1).sum()))
    n_splits = max(2, min(5, min_class))
    return StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)

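def _eval_model(name: str, pipe: Pipeline, X_list, y_vec):
    """Cross-validate `pipe` and return its mean accuracy, F1, and ROC AUC."""
    cv = _safe_cv(y_vec)
    metrics = {}
    # One cross_val_score call per metric: simple, but each call refits the
    # pipeline on every fold, so the total cost is three full CV runs.
    for metric in ("accuracy", "f1", "roc_auc"):
        metrics[metric] = cross_val_score(pipe, X_list, y_vec, cv=cv, scoring=metric).mean()
    print(f"[{name}]  Acc={metrics['accuracy']:.3f}  F1={metrics['f1']:.3f}  AUC={metrics['roc_auc']:.3f}")
    return {"name": name, **metrics}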

results = []
results.append(_eval_model("TFIDF+LR (1–3)",  make_tfidf_lr(len(X)),  X, y))
results.append(_eval_model("TFIDF+CNB (1–3)", make_tfidf_cnb(len(X)), X, y))
results.append(_eval_model("Char TFIDF+SVM (3–5)", make_char_svm(len(X)), X, y))

pd.DataFrame(results).sort_values("roc_auc", ascending=False)

[TFIDF+LR (1–3)]  Acc=0.879  F1=0.923  AUC=0.934
[TFIDF+CNB (1–3)]  Acc=0.892  F1=0.935  AUC=0.929
[Char TFIDF+SVM (3–5)]  Acc=0.892  F1=0.936  AUC=0.935


| | name | accuracy | f1 | roc_auc |
|---|---|---|---|---|
| 2 | Char TFIDF+SVM (3–5) | 0.892279 | 0.935727 | 0.934982 |
| 0 | TFIDF+LR (1–3) | 0.878853 | 0.923357 | 0.934167 |
| 1 | TFIDF+CNB (1–3) | 0.892127 | 0.934993 | 0.92908 |
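
`_eval_model` calls `cross_val_score` once per metric, so every pipeline is refit for each of the three metrics. One possible alternative, sketched below on a tiny invented corpus, is `cross_validate`, which accepts several scorers and fits each fold once. The toy texts and the name `toy_pipe` are made up for the example, and the numbers it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline

# Tiny invented corpus, just large enough for a 2-fold stratified split.
texts = [
    "love this dress", "great fit and soft fabric", "perfect and comfortable",
    "beautiful color love it", "fits great very happy", "soft and flattering",
    "terrible fit", "itchy and cheap fabric", "awful quality returned it",
    "too small and uncomfortable", "cheap looking not happy", "poor quality itchy",
]
labels = np.array([1] * 6 + [0] * 6)

toy_pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)

# One call, one fit per fold, three metrics.
scores = cross_validate(toy_pipe, texts, labels, cv=cv,
                        scoring=("accuracy", "f1", "roc_auc"))

# Results come back under "test_<metric>" keys, one value per fold.
summary = {m: float(scores[f"test_{m}"].mean()) for m in ("accuracy", "f1", "roc_auc")}
print(summary)
```

Applied to the notebook's pipelines, this would cut the number of fits per model from three CV runs to one while reporting the same three metrics.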


## === Cell 8. Train on all data & export artifacts ===

In [51]:
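# Build the base pipelines unfitted: VotingClassifier.fit() clones and fits each
# estimator itself, so fitting them here first would only double the training time.
mdl_lr  = make_tfidf_lr(len(X))
mdl_cnb = make_tfidf_cnb(len(X))
mdl_svm = make_char_svm(len(X))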

# Soft-voting ensemble with equal weights
ensemble = VotingClassifier(
    estimators=[
        ("tfidf_lr",  mdl_lr),
        ("tfidf_cnb", mdl_cnb),
        ("char_svm",  mdl_svm),
    ],
    voting="soft",
    weights=[1.0, 1.0, 1.0],
    n_jobs=None
).fit(X, y)

# Save model + manifest
joblib.dump(ensemble, ENSEMBLE_PKL)

manifest = {
    "bundle_name": "VotingClassifier: LR + CNB + CharSVM (soft)",
    "weights": {"tfidf_lr": 1.0, "tfidf_cnb": 1.0, "char_svm": 1.0},
    "notes": "Pure sklearn object (no custom class) — fixes EnsembleBundle pickle issue.",
    "word_ngram": "(1,3)",
    "char_ngram": "(3,5)",
    "samples": int(len(X)),
    "positives": int(y.sum()),
    "data_csv": str(DATA_CSV)
}
with open(MANIFEST_JSON, "w", encoding="utf-8") as f:
    json.dump(manifest, f, indent=2)

print(f"Saved → {ENSEMBLE_PKL}")
print(f"Saved → {MANIFEST_JSON}")

Saved → /Users/mac/Desktop/dem-web/model/ensemble_soft.pkl
Saved → /Users/mac/Desktop/dem-web/model/manifest.json
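
Soft voting averages the class probabilities of the base models (weighted, here with equal weights) and picks the class with the highest mean. The sketch below reproduces that arithmetic in plain Python; the three probability rows are invented numbers, not outputs of the trained models.

```python
def soft_vote(probas, weights):
    """Weighted average of per-model [P(class 0), P(class 1)] rows."""
    total = sum(weights)
    n_classes = len(probas[0])
    return [
        sum(w * p[k] for w, p in zip(weights, probas)) / total
        for k in range(n_classes)
    ]

# Hypothetical outputs for one review from the three base models.
probas = [
    [0.30, 0.70],  # tfidf_lr
    [0.45, 0.55],  # tfidf_cnb
    [0.60, 0.40],  # char_svm
]
weights = [1.0, 1.0, 1.0]

avg = soft_vote(probas, weights)
predicted = max(range(len(avg)), key=avg.__getitem__)
print(f"averaged probabilities: {[round(p, 3) for p in avg]} -> class {predicted}")
```

Here one model leans negative, but the averaged probability still favors class 1. Changing `weights` shifts how much each model's opinion counts.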


## === Quick test (same object Flask will load) ===
Uses `predict_proba` of the VotingClassifier.

In [52]:
EXAMPLES = [
    "This dress is beautiful and I love it!",
    "This product is terrible and very uncomfortable to wear.",
    "So difficult and itchy.",
    "Perfect fit, high quality material.",
    "Cheap fabric, looks awful."
]

def show_preds(texts: List[str], model, thr: float = 0.5):
    for t in texts:
        p = float(model.predict_proba([t])[0,1])
        label = "Positive ✅" if p >= thr else "Negative ❌"
        print(f"TEXT: {t}\n  P(positive)={p:.3f} → {label}\n")

ens = joblib.load(ENSEMBLE_PKL)  # simulate Flask runtime load
show_preds(EXAMPLES, ens)

TEXT: This dress is beautiful and I love it!
  P(positive)=0.902 → Positive ✅

TEXT: This product is terrible and very uncomfortable to wear.
  P(positive)=0.197 → Negative ❌

TEXT: So difficult and itchy.
  P(positive)=0.498 → Negative ❌

TEXT: Perfect fit, high quality material.
  P(positive)=0.792 → Positive ✅

TEXT: Cheap fabric, looks awful.
  P(positive)=0.022 → Negative ❌



## How Flask will use it
 - At runtime: `ens = joblib.load('model/ensemble_soft.pkl')`
 - Predict: `p = float(ens.predict_proba([user_text])[0, 1])`
 - No custom classes needed → no `EnsembleBundle` import issues.
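
The runtime steps listed above can be wrapped in one small function that a route handler calls. This is a framework-free sketch: `score_text`, its return shape, and the `StubModel` stand-in (used so the snippet runs without the pickle) are assumptions for illustration, not code from the project.

```python
from typing import Any, Dict

def score_text(model: Any, user_text: str, thr: float = 0.5) -> Dict[str, Any]:
    """Return P(positive) and a thresholded label for one review.

    `model` is anything with a scikit-learn style `predict_proba`, such as
    the object returned by joblib.load('model/ensemble_soft.pkl').
    """
    text = (user_text or "").strip()
    if not text:
        # Avoid scoring an empty string; let the caller decide how to respond.
        return {"ok": False, "error": "empty text"}
    proba = float(model.predict_proba([text])[0][1])
    return {"ok": True, "p_positive": proba, "recommended": proba >= thr}


class StubModel:
    """Stand-in for the loaded ensemble, so the sketch runs without the pickle."""

    def predict_proba(self, texts):
        # Fixed, made-up probabilities: [P(class 0), P(class 1)] per input.
        return [[0.2, 0.8] for _ in texts]


print(score_text(StubModel(), "Perfect fit, high quality material."))
print(score_text(StubModel(), "   "))
```

Because the pickle stores scikit-learn objects, the serving environment should use the same scikit-learn version as the training notebook; pickles are not guaranteed to load correctly across versions.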

In [53]:
import joblib
model = joblib.load(ENSEMBLE_PKL)
print(type(model))

<class 'sklearn.ensemble._voting.VotingClassifier'>
