# Milestone II — Why these 3 models and how they work

- This project keeps **one familiar baseline** from Milestone I and adds **two models with different inductive biases** so their errors are less correlated. The trio gives you strong accuracy, calibrated probabilities (for UI), and robustness to noisy, short, or misspelled text.


- With three models, each one focuses on **different cues** (word-level linear weights, the Naive Bayes bias, character-level margins).  


## 1) TF-IDF + Logistic Regression (LR)  
This model uses word n-grams with TF-IDF weighting and Logistic Regression for linear separation. Strengths: fast, smooth probabilities, interpretable features. Weakness: misses character-level signals and limited by linearity.  
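
To make the TF-IDF weighting concrete, here is a minimal pure-Python sketch of the smoothed idf formula that scikit-learn applies by default, followed by L2 normalisation. The toy documents and the helper names `smooth_idf` and `tfidf_vector` are illustrative assumptions, not part of the notebook's pipeline.

```python
import math
from collections import Counter

def smooth_idf(docs):
    """idf(t) = ln((1 + n) / (1 + df(t))) + 1 (smoothed idf)."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc.split()))
    return {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}

def tfidf_vector(doc, idf):
    """Raw term counts times idf, then L2-normalised."""
    counts = Counter(doc.split())
    raw = {t: c * idf[t] for t, c in counts.items() if t in idf}
    norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
    return {t: v / norm for t, v in raw.items()}

# Toy corpus (hypothetical, for illustration only).
toy_docs = ["dress is itchy", "dress is lovely", "dress runs small"]
idf = smooth_idf(toy_docs)
vec = tfidf_vector("dress is itchy", idf)
print(sorted(vec.items(), key=lambda kv: -kv[1]))
```

The term that appears in every document ("dress") gets the lowest weight, while the rare term ("itchy") dominates the vector, which is the signal Logistic Regression then separates linearly.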

## 2) TF-IDF + Complement Naive Bayes (CNB)  
Also uses TF-IDF word n-grams but applies Complement NB weighting. Strong for imbalanced or short data and trains extremely fast. Weakness: simplistic independence assumption and less flexible decision boundaries compared to LR/SVM.  

## 3) Char TF-IDF (3–5) + Calibrated Linear SVM  
Builds features from character n-grams to capture typos, concatenated words, and subword patterns. Linear SVM gives strong margins; calibration ensures usable probabilities. Strengths: handles noisy text and unseen words. Weakness: large feature space and extra compute for calibration.  
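
The sketch below illustrates why character n-grams tolerate typos. It approximates the `char_wb` analyzer by padding each word with a space and slicing 3 to 5 character windows; the helper names `char_wb_ngrams` and `jaccard` are assumptions for illustration, and the real vectorizer may differ in edge cases.

```python
def char_wb_ngrams(text, n_min=3, n_max=5):
    """Character n-grams taken inside each space-padded word."""
    grams = set()
    for word in text.lower().split():
        padded = f" {word} "
        for n in range(n_min, n_max + 1):
            for i in range(len(padded) - n + 1):
                grams.add(padded[i:i + n])
    return grams

def jaccard(a, b):
    """Overlap between two gram sets, from 0 (disjoint) to 1 (identical)."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

clean = char_wb_ngrams("itchy")
typo = char_wb_ngrams("itchey")   # misspelling
shared = clean & typo
print(sorted(shared), round(jaccard(clean, typo), 2))
```

As whole-word tokens, "itchy" and "itchey" share nothing, but their character grams still overlap on fragments such as "itc", "tch", and "itch", so the SVM can reuse what it learned from the correctly spelled word.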

## Design choices that support reliability
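- **Word n-grams (1–3)** on LR/CNB: capture unigrams (“itchy”), bigrams (“runs small”), trigrams (“poor quality fabric”). Stop words are dropped before n-grams are formed, so a phrase like “hard to wear” is seen as “hard wear”.  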
- **Character n-grams (3–5)** on SVM: capture morphology/typos (“itch”, “tchy”, “awf”, “ful”, “awful”).  
- **`class_weight="balanced"`** (LR/SVM): offsets label skew so minority examples matter.  
- **`min_df` dynamic**: avoids “no terms remain” in small CV folds.  
- **`max_features` cap**: protects memory while keeping the most informative grams.  
- **Stopwords = scikit-learn's `ENGLISH_STOP_WORDS` minus `no`, `not`, `never`**: keeps the pipeline simple and portable while leaving negation words in the vocabulary, so sentiment flips such as “not comfortable” stay visible to the word n-grams.
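
To see what `class_weight="balanced"` does with this dataset's skew, the sketch below applies the balanced heuristic, `n_samples / (n_classes * count[c])`, to the label counts printed in Cell 2 (19662 rows, 16087 positives). The helper name `balanced_class_weights` is an assumption for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """weight[c] = n_samples / (n_classes * count[c])."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}

# Label counts reported in Cell 2 of this notebook.
labels = [1] * 16087 + [0] * (19662 - 16087)
weights = balanced_class_weights(labels)
print({c: round(w, 2) for c, w in sorted(weights.items())})
```

Each "not recommended" review ends up weighted roughly 4.5 times as heavily as a "recommended" one, which is why LR and the SVM do not simply default to the majority class.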

---


## === Cell 1. Imports, paths, reproducibility ===

In [12]:

import os, re, json, warnings, random
from pathlib import Path
from typing import List, Dict, Any

import numpy as np
import pandas as pd

from dataclasses import dataclass
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import ComplementNB
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import VotingClassifier
import joblib

warnings.filterwarnings("ignore")

# Resolve project paths so the notebook works from project root or /notebooks
NB_DIR = Path.cwd().resolve()
CANDIDATES = [
    NB_DIR / "data" / "assignment3_II.csv",
    NB_DIR.parent / "data" / "assignment3_II.csv",
    NB_DIR.parents[1] / "data" / "assignment3_II.csv",
]
DATA_CSV = next((p for p in CANDIDATES if p.exists()), None)
if DATA_CSV is None:
    raise FileNotFoundError("Place 'assignment3_II.csv' under project_root/data/")

PROJECT_ROOT = DATA_CSV.parent.parent
DATA_DIR  = PROJECT_ROOT / "data"
MODEL_DIR = PROJECT_ROOT / "model"
DATA_DIR.mkdir(parents=True, exist_ok=True)
MODEL_DIR.mkdir(parents=True, exist_ok=True)

CATALOG_JSON   = DATA_DIR  / "site_items.json"
ENSEMBLE_PKL   = MODEL_DIR / "ensemble_soft.pkl"
MANIFEST_JSON  = MODEL_DIR / "manifest.json"

# Reproducibility
SEED = 42
random.seed(SEED); np.random.seed(SEED)

print("PROJECT_ROOT:", PROJECT_ROOT)
print("DATA_CSV    :", DATA_CSV)
print("DATA_DIR    :", DATA_DIR)
print("MODEL_DIR   :", MODEL_DIR)

PROJECT_ROOT: /Users/mac/Desktop/dem-web
DATA_CSV    : /Users/mac/Desktop/dem-web/data/assignment3_II.csv
DATA_DIR    : /Users/mac/Desktop/dem-web/data
MODEL_DIR   : /Users/mac/Desktop/dem-web/model


## === Cell 2. Load CSV & normalize columns ===

- Accept either `Recommended IND` or `Recommended` as the label.
- Fill missing strings with `""` and coerce numerics safely.
- Ensure **display columns** exist: `Clothes Title`, `Clothes Description`.

In [13]:

df = pd.read_csv(DATA_CSV)

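# Ensure display fields exist (fall back to legacy columns when they are present).
# Note: df.get("Title", "") would return a plain str when the column is missing,
# and str has no .fillna, so check for the column explicitly.
if "Clothes Title" not in df.columns:
    df["Clothes Title"] = df["Title"].fillna("") if "Title" in df.columns else ""
if "Clothes Description" not in df.columns:
    df["Clothes Description"] = df["Review Text"].fillna("") if "Review Text" in df.columns else ""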

# Make sure these columns exist to avoid key errors later
need_cols = [
    "Clothing ID", "Clothes Title", "Clothes Description",
    "Rating", "Division Name", "Department Name", "Class Name",
    "Review Text", "Title"
]
for c in need_cols:
    if c not in df.columns:
        df[c] = ""

# Fill NaNs
df = df.fillna({
    "Clothing ID": 0,
    "Clothes Title": "",
    "Clothes Description": "",
    "Rating": 0,
    "Division Name": "",
    "Department Name": "",
    "Class Name": "",
    "Review Text": "",
    "Title": "",
})

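# Coerce types
def to_int_safe(x, default=0):
    """Return int(x), or `default` when x is missing or not numeric."""
    try:
        return int(x)
    except (TypeError, ValueError, OverflowError):
        return default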

df["Clothing ID"] = df["Clothing ID"].apply(to_int_safe)
df["Rating"]      = df["Rating"].apply(to_int_safe)

# Determine label column
label_col = None
for cand in ["Recommended IND", "Recommended"]:
    if cand in df.columns:
        label_col = cand
        break
if label_col is None:
    raise ValueError("No label column found. Expect 'Recommended IND' or 'Recommended'.")

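# Clean label & drop invalid (unparseable labels become NaN and are dropped,
# rather than being silently counted as 0 / "not recommended")
df[label_col] = pd.to_numeric(df[label_col], errors="coerce")
df = df[df[label_col].isin([0, 1])].copy()
df[label_col] = df[label_col].astype(int)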

print("Rows:", len(df))
print("Columns:", list(df.columns))
print("Label column:", label_col, "| Positives:", int(df[label_col].sum()))

Rows: 19662
Columns: ['Clothing ID', 'Age', 'Title', 'Review Text', 'Rating', 'Recommended IND', 'Positive Feedback Count', 'Division Name', 'Department Name', 'Class Name', 'Clothes Title', 'Clothes Description']
Label column: Recommended IND | Positives: 16087


## 2.1 Text preprocessing

In [14]:

import re

def clean_text(text: str) -> str:
    # Lowercase
    text = text.lower()
    # Remove non-letter characters (keep spaces)
    text = re.sub(r"[^a-z\s]", " ", text)
    # Collapse multiple spaces
    text = re.sub(r"\s+", " ", text).strip()
    return text

# Apply preprocessing to Review Text & Title
df["Review Text"] = df["Review Text"].astype(str).apply(clean_text)
df["Title"]       = df["Title"].astype(str).apply(clean_text)

# Drop empty reviews after cleaning
mask_nonempty = df["Review Text"].str.strip().str.len() > 0
df = df[mask_nonempty].copy()

print("Remaining samples after cleaning:", len(df))

Remaining samples after cleaning: 19662



## === Cell 3. Plural-aware normalizer for search ===

To support keyword search in the UI, we normalize both the query and the item text:

- Lowercase, keep word characters.
- Simple plural reduction (e.g., *dresses* → *dress*).

In [15]:

_word_re = re.compile(r"[A-Za-z]+(?:[-'][A-Za-z]+)?")

def normalize_for_search(text: str) -> str:
    text = (text or "").lower()
    tokens = _word_re.findall(text)

    def reduce_plural(w: str) -> str:
        if len(w) <= 3: return w
        for suf, rep in [("ies","y"), ("sses","ss"), ("xes","x"), ("zes","z")]:
            if w.endswith(suf): return w[:-len(suf)] + rep
        for suf in ("es","s"):
            if w.endswith(suf) and not w.endswith("ss"):
                return w[:-len(suf)]
        return w

    return " ".join(reduce_plural(t) for t in tokens)

print("Test:", normalize_for_search("Dress / dresses — Boxes, foxes, classes"))

Test: dress dress box fox class
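
A search helper could compare normalized tokens on both sides, as in the sketch below. The plural rules are repeated from the cell above so the snippet runs on its own; `normalize_tokens` and `matches_query` are hypothetical names, not functions the project is known to define.

```python
import re

_WORD_RE = re.compile(r"[A-Za-z]+(?:[-'][A-Za-z]+)?")

def _reduce_plural(w):
    # Same rules as the normalizer cell above.
    if len(w) <= 3:
        return w
    for suf, rep in (("ies", "y"), ("sses", "ss"), ("xes", "x"), ("zes", "z")):
        if w.endswith(suf):
            return w[:-len(suf)] + rep
    for suf in ("es", "s"):
        if w.endswith(suf) and not w.endswith("ss"):
            return w[:-len(suf)]
    return w

def normalize_tokens(text):
    """Lowercase, tokenize, and reduce plurals."""
    return [_reduce_plural(t) for t in _WORD_RE.findall((text or "").lower())]

def matches_query(query, item_text):
    """Hypothetical helper: True when every normalized query token occurs in the item text."""
    q = set(normalize_tokens(query))
    return bool(q) and q <= set(normalize_tokens(item_text))

print(matches_query("dresses", "Floral Maxi Dress"))
print(matches_query("shoes", "Running Shoe"))
```

One limitation of the simple suffix rules: "shoes" is reduced to "sho" while "shoe" is left unchanged, so that pair does not match. If such words matter for the catalog, an exception list or a proper stemmer may be worth considering.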


## Cell 5. Build modeling corpus & label

Model on **Review Text**, the main text signal; drop blank reviews.

- `corpus_review` — list of review texts
- `y` — binary labels (0/1)

In [16]:
# === Cell 5. Build corpus & labels (Milestone II) ===
# Uses Review Text as the main signal. Drops empty reviews. Label from Recommended IND/Recommended.

# Ensure string types
df["Review Text"] = df["Review Text"].fillna("").astype(str)
df["Title"] = df["Title"].fillna("").astype(str)

# Drop empty reviews
mask_nonempty = df["Review Text"].str.strip().str.len() > 0
dfe = df[mask_nonempty].copy()

# X, y
corpus_review = dfe["Review Text"].tolist()
# Optional: title + review variant if you want later
corpus_title_review = dfe["Title"].str.cat(dfe["Review Text"], sep=" . ").tolist()
y = dfe[label_col].astype(int).to_numpy()

print(f"Samples: {len(y)}  |  Positives: {int(y.sum())}")
print("Example review:", (corpus_review[0][:120] + "...") if len(corpus_review) else "(empty)")

Samples: 19662  |  Positives: 16087
Example review: i had such high hopes for this dress and really wanted it to work for me i initially ordered the petite small my usual s...


## Cell 6. Pipelines (3 diverse models)

### Why these three?

1) **TF-IDF + Logistic Regression (LR)**  
   - Strengths: strong, smooth probabilities; linear on word n-grams; interpretable; fast.  
   - Weaknesses: purely linear; might miss character-level patterns.

2) **TF-IDF + Complement Naive Bayes (CNB)**  
   - Strengths: different inductive bias; robust for small/imbalanced datasets; extremely fast training.  
   - Weaknesses: conditional independence assumption; decision boundaries can be less flexible.

3) **Character TF-IDF (3–5) + Linear SVM (Calibrated)**  
   - Strengths: char n-grams capture typos/joins; SVM margins are strong; calibrated to get `predict_proba` for UI/ensemble.  
   - Weaknesses: larger feature spaces; calibration adds compute.

### Safety knobs
- Dynamic `min_df` on the word models (LR/CNB) to avoid “no terms remain” on small folds; the char SVM uses a fixed `min_df=2`.  
- Capped `max_features` to control memory.  
- `class_weight="balanced"` on LR & SVM.  
- SVM wrapped in `CalibratedClassifierCV(method="sigmoid", cv=5)` to provide probabilities.
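
For intuition on the last knob: sigmoid (Platt) calibration maps an SVM margin `f` to a probability through `1 / (1 + exp(a * f + b))`, where `a` and `b` are fitted on held-out folds. The sketch below uses made-up parameter values purely to show the shape of the mapping; it is not how the fitting itself is done.

```python
import math

def platt_probability(score, a, b):
    """Platt scaling: P(y=1 | score) = 1 / (1 + exp(a * score + b))."""
    return 1.0 / (1.0 + math.exp(a * score + b))

# Hypothetical parameters; in practice they are estimated from held-out data.
A, B = -2.0, 0.3
margins = [-1.5, 0.0, 1.5]          # signed distances from the SVM hyperplane
probs = [platt_probability(m, A, B) for m in margins]
for m, p in zip(margins, probs):
    print(f"margin={m:+.1f} -> P(positive)={p:.3f}")
```

With a negative `a`, larger margins map to higher probabilities, so the ranking produced by the SVM is preserved while the output becomes usable in a soft vote.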

In [17]:
# === Cell 6. Pipelines (3 diverse models, library stopwords) ===
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS
from sklearn.naive_bayes import ComplementNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.pipeline import Pipeline

# Start from sklearn's list and KEEP negations so sentiment flips remain informative
STOP_WORDS_LIST = sorted(list(ENGLISH_STOP_WORDS - {"no", "not", "never"}))

def choose_min_df(n_docs: int) -> int:
    """Dynamic min_df to avoid 'no terms remain' on small folds."""
    if n_docs >= 5000: return 3
    if n_docs >= 1000: return 2
    return 1

def make_tfidf_lr(n_docs: int, max_features: int = 40000) -> Pipeline:
    """Word TF-IDF (1–3)-gram + Logistic Regression (balanced)."""
    mindf = choose_min_df(n_docs)
    return Pipeline([
        ("tfidf", TfidfVectorizer(
            ngram_range=(1, 3),
            max_features=max_features,
            min_df=mindf,
            stop_words=STOP_WORDS_LIST
        )),
        ("clf", LogisticRegression(
            max_iter=3000,
            class_weight="balanced",
            solver="liblinear",
            random_state=SEED
        ))
    ])

def make_tfidf_cnb(n_docs: int, max_features: int = 40000, alpha: float = 0.5) -> Pipeline:
    """Word TF-IDF (1–3)-gram + Complement Naive Bayes."""
    mindf = choose_min_df(n_docs)
    return Pipeline([
        ("tfidf", TfidfVectorizer(
            ngram_range=(1, 3),
            max_features=max_features,
            min_df=mindf,
            stop_words=STOP_WORDS_LIST
        )),
        ("clf", ComplementNB(alpha=alpha))
    ])

def make_char_svm(n_docs: int, C: float = 0.5) -> Pipeline:
    """Char TF-IDF (3–5)-gram + calibrated Linear SVM (balanced).

    `n_docs` is accepted only to match the other factories; min_df is fixed at 2 here.
    """
    vec = TfidfVectorizer(
        analyzer="char_wb",        # better for short texts & word boundaries
        ngram_range=(3, 5),
        sublinear_tf=True,
        min_df=2                   # drop ultra-rare shards
    )
    svm = LinearSVC(C=C, class_weight="balanced", random_state=SEED)
    clf = CalibratedClassifierCV(svm, method="sigmoid", cv=5)  # more stable calibration
    return Pipeline([("vec", vec), ("clf", clf)])

## Cell 7. Quick cross-validated sanity check

- Use **StratifiedKFold** with a safe number of splits (≤ minority class size).
- Report **Accuracy**, **F1**, **ROC-AUC** for each of the three models (the soft-voting ensemble is not cross-validated in this cell).

In [18]:
# === Cell 7. Cross-validated sanity check (Acc/F1/AUC) ===
from sklearn.model_selection import StratifiedKFold, cross_val_score
import numpy as np
import pandas as pd

def safe_cv(y: np.ndarray, seed: int = SEED) -> StratifiedKFold:
    """Pick a safe number of folds given class counts."""
    min_class = int(min((y == 0).sum(), (y == 1).sum()))
    n_splits = max(2, min(5, min_class))  # at least 2, at most 5, not exceeding minority count
    return StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)

def eval_model(name: str, pipe: Pipeline, X, y, seed: int = SEED):
    cv = safe_cv(y, seed)
    metrics = {}
    for metric in ("accuracy", "f1", "roc_auc"):
        metrics[metric] = cross_val_score(pipe, X, y, cv=cv, scoring=metric).mean()
    print(f"[{name}]  Acc={metrics['accuracy']:.3f}  F1={metrics['f1']:.3f}  AUC={metrics['roc_auc']:.3f}")
    return {"name": name, **metrics}

results = []
results.append(eval_model("TFIDF+LR (word 1-3)", make_tfidf_lr(len(corpus_review)), corpus_review, y))
results.append(eval_model("TFIDF+CNB (word 1-3)", make_tfidf_cnb(len(corpus_review)), corpus_review, y))
results.append(eval_model("Char TFIDF+SVM (3-5)", make_char_svm(len(corpus_review)), corpus_review, y))

pd.DataFrame(results).sort_values("roc_auc", ascending=False)

[TFIDF+LR (word 1-3)]  Acc=0.878  F1=0.923  AUC=0.934
[TFIDF+CNB (word 1-3)]  Acc=0.892  F1=0.935  AUC=0.929
[Char TFIDF+SVM (3-5)]  Acc=0.891  F1=0.935  AUC=0.934


| | name | accuracy | f1 | roc_auc |
|---|---|---|---|---|
| 2 | Char TFIDF+SVM (3-5) | 0.890804 | 0.934793 | 0.934143 |
| 0 | TFIDF+LR (word 1-3) | 0.878497 | 0.923079 | 0.934087 |
| 1 | TFIDF+CNB (word 1-3) | 0.892331 | 0.935087 | 0.929135 |
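
Because about 82% of the labels are positive, it helps to read these scores against a trivial reference. The sketch below computes accuracy and F1 for a classifier that labels every review positive, using the sample counts printed earlier (19662 samples, 16087 positives); `majority_baseline` is an illustrative helper, not part of the notebook.

```python
def majority_baseline(n_total, n_positive):
    """Accuracy and F1 of a classifier that labels every review positive."""
    tp, fp, fn = n_positive, n_total - n_positive, 0
    accuracy = tp / n_total
    f1 = 2 * tp / (2 * tp + fp + fn)
    return accuracy, f1

base_acc, base_f1 = majority_baseline(19662, 16087)
print(f"always-positive baseline: Acc={base_acc:.3f}  F1={base_f1:.3f}")
# prints: always-positive baseline: Acc=0.818  F1=0.900
```

Against this reference, the accuracy and F1 gains of the three models are real but modest, while ROC-AUC (0.5 for any constant scorer) separates them from the baseline much more clearly.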


## Cell 8. Train on all data & export artifacts

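Train the three pipelines on the full corpus and wrap them in a **soft-voting ensemble**.

Export:
- `model/ensemble_soft.pkl`: an `EnsembleBundle` dataclass (defined in the cell below) holding the three fitted pipelines and their weights. It is a custom class, not a scikit-learn `VotingClassifier`, so any process that loads the pickle must be able to import `EnsembleBundle`.
- `model/manifest.json`: metadata for your UI’s **/metrics** page.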
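
The soft vote itself is just a weighted mean of the three models' positive-class probabilities. The sketch below reproduces that arithmetic with the per-model values printed for "So difficult and itchy." in the quick test of Cell 9; `soft_vote` is an illustrative stand-in, not the exported class.

```python
def soft_vote(probs, weights):
    """Weighted mean of per-model positive-class probabilities."""
    total = sum(weights.get(k, 0.0) for k in probs) or 1.0
    return sum(weights.get(k, 0.0) * p for k, p in probs.items()) / total

equal = {"tfidf_lr": 1.0, "tfidf_cnb": 1.0, "char_svm": 1.0}
# Per-model outputs reported in Cell 9 for "So difficult and itchy."
probs = {"tfidf_lr": 0.276, "tfidf_cnb": 0.471, "char_svm": 0.759}
p_ens = soft_vote(probs, equal)
print(f"ensemble P(positive) = {p_ens:.3f}")
```

With equal weights the result lands just above 0.5 even though two of the three models vote negative, which shows how one confident model can tip a soft vote. Adjusting the weights or the decision threshold is the natural lever if that behaviour is unwanted.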

In [19]:
# === Cell 8. Train final models on all data & export bundle ===


@dataclass
class EnsembleBundle:
    """Soft-vote ensemble that averages model probabilities."""
    name: str
    models: Dict[str, Any]          # keys: "tfidf_lr", "tfidf_cnb", "char_svm"
    weights: Dict[str, float]       # e.g., {"tfidf_lr": 1.0, "tfidf_cnb": 1.0, "char_svm": 1.0}
    notes: str = "Milestone II ensemble: word TF-IDF+LR, word TF-IDF+CNB, char TF-IDF+Calibrated SVM."
    stopwords: str = "sklearn.ENGLISH_STOP_WORDS minus {no,not,never}"
    ngram_word: str = "(1,3)"
    ngram_char: str = "(3,5)"

    def _probas(self, texts: List[str]) -> Dict[str, np.ndarray]:
        out = {}
        for k, m in self.models.items():
            p = m.predict_proba(texts)[:, 1]  # probability of positive class
            out[k] = p
        return out

    def predict_proba(self, texts: List[str]) -> np.ndarray:
        probs = self._probas(texts)
        s = np.zeros(len(texts), dtype=float)
        wsum = float(sum(self.weights.values())) or 1.0
        for k, p in probs.items():
            s += self.weights.get(k, 0.0) * p
        return s / wsum

    def predict(self, texts: List[str], threshold: float = 0.5) -> np.ndarray:
        return (self.predict_proba(texts) >= threshold).astype(int)

# Train three models on the full corpus
mdl_lr  = make_tfidf_lr(len(corpus_review))
mdl_cnb = make_tfidf_cnb(len(corpus_review))
mdl_svm = make_char_svm(len(corpus_review))

mdl_lr.fit(corpus_review, y)
mdl_cnb.fit(corpus_review, y)
mdl_svm.fit(corpus_review, y)

weights = {"tfidf_lr": 1.0, "tfidf_cnb": 1.0, "char_svm": 1.0}
bundle = EnsembleBundle(
    name="Milestone II: TFIDF+LR / TFIDF+CNB / CharTFIDF+SVM (soft-vote, equal weights)",
    models={"tfidf_lr": mdl_lr, "tfidf_cnb": mdl_cnb, "char_svm": mdl_svm},
    weights=weights
)

joblib.dump(bundle, ENSEMBLE_PKL)

manifest = {
    "bundle_name": bundle.name,
    "weights": bundle.weights,
    "notes": bundle.notes,
    "stopwords": bundle.stopwords,
    "word_ngram": bundle.ngram_word,
    "char_ngram": bundle.ngram_char,
    "samples": int(len(corpus_review)),
    "positives": int(y.sum()),
    "data_csv": str(DATA_CSV)
}
with open(MANIFEST_JSON, "w", encoding="utf-8") as f:
    json.dump(manifest, f, indent=2)

print(f"Saved model bundle → {ENSEMBLE_PKL}")
print(f"Saved manifest     → {MANIFEST_JSON}")

Saved model bundle → /Users/mac/Desktop/dem-web/model/ensemble_soft.pkl
Saved manifest     → /Users/mac/Desktop/dem-web/model/manifest.json


## Cell 9. Quick test of trained models

- Pass a few example texts through each individual model and the ensemble.
- Show probability of "Positive" (recommended).
- Also display the final predicted label (Positive / Negative) using threshold=0.5.

In [20]:
EXAMPLES = [
    "This dress is beautiful and I love it!",
    "This product is terrible and very uncomfortable to wear.",
    "So difficult and itchy.",
    "Perfect fit, high quality material.",
    "Cheap fabric, looks awful."
]

def predict_with_all(text: str, bundle, threshold: float = 0.5):
    print(f"\nTEXT: {text}\n" + "-"*60)
    for name, model in bundle.models.items():
        p = model.predict_proba([text])[0,1]
        label = "Positive ✅" if p >= threshold else "Negative ❌"
        print(f"[{name}]  P(positive) = {p:.3f} → {label}")
    
    # Ensemble prediction
    p_ens = bundle.predict_proba([text])[0]
    label_ens = "Positive ✅" if p_ens >= threshold else "Negative ❌"
    print(f"[ensemble_soft]  P(positive) = {p_ens:.3f} → {label_ens}")

# Run on sample texts
for t in EXAMPLES:
    predict_with_all(t, bundle)


TEXT: This dress is beautiful and I love it!
------------------------------------------------------------
[tfidf_lr]  P(positive) = 0.932 → Positive ✅
[tfidf_cnb]  P(positive) = 0.779 → Positive ✅
[char_svm]  P(positive) = 0.994 → Positive ✅
[ensemble_soft]  P(positive) = 0.902 → Positive ✅

TEXT: This product is terrible and very uncomfortable to wear.
------------------------------------------------------------
[tfidf_lr]  P(positive) = 0.147 → Negative ❌
[tfidf_cnb]  P(positive) = 0.140 → Negative ❌
[char_svm]  P(positive) = 0.292 → Negative ❌
[ensemble_soft]  P(positive) = 0.193 → Negative ❌

TEXT: So difficult and itchy.
------------------------------------------------------------
[tfidf_lr]  P(positive) = 0.276 → Negative ❌
[tfidf_cnb]  P(positive) = 0.471 → Negative ❌
[char_svm]  P(positive) = 0.759 → Positive ✅
[ensemble_soft]  P(positive) = 0.502 → Positive ✅

TEXT: Perfect fit, high quality material.
------------------------------------------------------------
[tfidf_lr]  P(

## How Flask should use these artifacts

At **runtime** (Flask app):

- Load the ensemble once and reuse:
  ```python
  import joblib, json
  ENSEMBLE_PKL = "model/ensemble_soft.pkl"
  MANIFEST_JSON = "model/manifest.json"
  ensemble = joblib.load(ENSEMBLE_PKL)  # EnsembleBundle; its class must be importable here
  # predict_proba returns a 1-D array of P(positive), one value per input text:
  p = float(ensemble.predict_proba([user_text])[0])