# Phase 3: Advanced & Optimized Modeling

## 1. Objective
We aim to build a high-performance three-class sentiment classifier (negative, neutral, positive) using **Review Text**, **Brand**, and **Category**.

### Professional Elevation Enhancements:
1. **Parallelism**: Using `n_jobs=-1` for multi-core execution.
2. **Dimensionality Reduction**: Using `TruncatedSVD` (Latent Semantic Analysis) to condense text features.
3. **Fast Gradient Boosting**: Using XGBoost with `tree_method='hist'` for rapid training.
4. **K-Fold Cross-Validation**: Checking that performance holds across different data subsets.
5. **SMOTE Oversampling**: Generating synthetic minority-class samples to counter severe class imbalance.
6. **Deep Error Analysis**: Inspecting specific misclassifications and bias.

## 2. Setup and Data Loading

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import joblib
from tqdm.auto import tqdm

from sklearn.model_selection import train_test_split, GridSearchCV, cross_validate
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import classification_report, confusion_matrix, f1_score

import xgboost as xgb
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE

tqdm.pandas()

# Load cleaned data
data_path = os.path.join('..', 'data', 'interim', 'cleaned_amazon.csv')
df = pd.read_csv(data_path)

# Target Binning (Neg: 0, Neu: 1, Pos: 2)
df['sentiment'] = df['reviews.rating'].map({1: 0, 2: 0, 3: 1, 4: 2, 5: 2})

# Clean metadata and text; also drop rows whose rating did not map to a class
df = df.dropna(subset=['cleaned_text', 'brand', 'categories', 'sentiment'])
df = df[df['cleaned_text'].str.strip().astype(bool)].astype({'sentiment': int})

print(f"Final Dataset Shape: {df.shape}")
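
The binning step can be checked in isolation on a few rows. The sketch below uses a hypothetical toy frame (not the Amazon data) to show that ratings outside 1 to 5, or missing ratings, map to `NaN` and therefore have to be dropped before the column can serve as an integer label.

```python
import pandas as pd

# Toy ratings (hypothetical), including a missing value and an out-of-range one.
toy = pd.DataFrame({"reviews.rating": [1, 2, 3, 4, 5, None, 0]})

rating_to_sentiment = {1: 0, 2: 0, 3: 1, 4: 2, 5: 2}
toy["sentiment"] = toy["reviews.rating"].map(rating_to_sentiment)

# Ratings that are missing or not in the mapping come back as NaN.
n_unmapped = int(toy["sentiment"].isna().sum())

# Drop the unmapped rows, then cast the label to int.
labeled = toy.dropna(subset=["sentiment"]).astype({"sentiment": int})

print(f"Unmapped rows: {n_unmapped}")
print(labeled["sentiment"].value_counts().sort_index())
```

Printing the class counts right after binning is also a quick way to see how imbalanced the target is before any modeling.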

## 3. Data Splitting (Stratified)
We maintain an 80/20 split, stratifying on `sentiment` so that the train and test sets keep the same class ratios.

In [None]:
X = df[['cleaned_text', 'brand', 'categories']]
y = df['sentiment']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Train size: {len(X_train)}, Test size: {len(X_test)}")
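
To see what `stratify` buys us, here is a small sketch on synthetic labels (the 10/10/80 class mix is an assumption chosen for illustration, not the real distribution). It compares class proportions in the two halves of a stratified split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 10% negative, 10% neutral, 80% positive.
y_toy = np.array([0] * 100 + [1] * 100 + [2] * 800)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Class proportions in each split; with stratification they should match.
train_ratio = np.bincount(y_tr) / len(y_tr)
test_ratio = np.bincount(y_te) / len(y_te)

print("Train ratios:", train_ratio)
print("Test ratios: ", test_ratio)
```

Without `stratify`, a rare class can end up noticeably over- or under-represented in the test set, which makes macro-averaged metrics unstable.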

## 4. Feature Engineering: Pipelines
We define preprocessors for the text column (TF-IDF followed by SVD and scaling) and for the categorical columns `brand` and `categories` (one-hot encoding).

In [None]:
# Text sub-pipeline
text_transformer = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000, ngram_range=(1, 2), stop_words='english', min_df=5)),
    ('svd', TruncatedSVD(n_components=100, random_state=42)),
    ('scaler', StandardScaler()) # Added for LogReg stability
])

# Main preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('text', text_transformer, 'cleaned_text'),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['brand', 'categories'])
    ]
)

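# Preprocessor for Naive Bayes (no SVD/scaler): MultinomialNB requires
# non-negative features, and SVD or standardization can produce negative values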
nb_preprocessor = ColumnTransformer(
    transformers=[
        ('text', TfidfVectorizer(max_features=5000, ngram_range=(1, 2), stop_words='english', min_df=5), 'cleaned_text'),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['brand', 'categories'])
    ]
)
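
The effect of the TF-IDF + SVD step is easier to see on a tiny corpus. The sketch below uses six made-up sentences and 3 components (both are illustrative assumptions; the notebook uses 5000 features and 100 components) to show the sparse matrix being condensed into a small dense one.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus standing in for cleaned review text.
docs = [
    "battery life is great",
    "battery died after a week",
    "great screen and great sound",
    "screen cracked after a week",
    "sound is terrible",
    "terrible battery and terrible screen",
]

# Step 1: sparse TF-IDF matrix, one column per vocabulary term.
X_sparse = TfidfVectorizer().fit_transform(docs)

# Step 2: project onto a few latent components (Latent Semantic Analysis).
svd = TruncatedSVD(n_components=3, random_state=42)
X_dense = svd.fit_transform(X_sparse)

retained = svd.explained_variance_ratio_.sum()
print(f"TF-IDF shape: {X_sparse.shape}, SVD shape: {X_dense.shape}")
print(f"Variance retained by 3 components: {retained:.2%}")
```

On the real data, the same `explained_variance_ratio_` check gives a rough sense of whether 100 components keep enough of the signal.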

## 5. Model 1: Multinomial Naive Bayes (Baseline)

In [None]:
nb_pipeline = Pipeline([('preprocessor', nb_preprocessor), ('clf', MultinomialNB())])
nb_pipeline.fit(X_train, y_train)
print("Naive Bayes trained.")
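
The reason the baseline skips SVD and scaling can be demonstrated directly. This is a minimal sketch on a made-up 4x2 count matrix: `MultinomialNB` fits on non-negative input but rejects standardized input, because centering produces negative values.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import StandardScaler

# Hypothetical count-like features and labels.
X_counts = np.array([[3, 0], [2, 1], [0, 4], [1, 3]], dtype=float)
y_toy = np.array([0, 0, 1, 1])

# Non-negative input: fits without complaint.
MultinomialNB().fit(X_counts, y_toy)

# Standardizing centers each column, so some entries become negative.
X_scaled = StandardScaler().fit_transform(X_counts)

try:
    MultinomialNB().fit(X_scaled, y_toy)
    rejected = False
except ValueError as err:
    rejected = True
    print(f"MultinomialNB refused scaled input: {err}")
```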

## 6. Model 2: Logistic Regression (Weighted)
Professional Benchmark: `class_weight='balanced'` weights each class inversely to its frequency, so errors on minority classes cost more.

In [None]:
lr_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('clf', LogisticRegression(class_weight='balanced', max_iter=2000, n_jobs=-1, random_state=42))
])
lr_pipeline.fit(X_train, y_train)
print("Logistic Regression (Weighted) trained.")
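
For reference, the weights behind `class_weight='balanced'` follow the formula `n_samples / (n_classes * count_per_class)`. The sketch below checks that on synthetic labels (the 5/5/90 mix is an assumption for illustration).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: two rare classes and one dominant class.
y_toy = np.array([0] * 50 + [1] * 50 + [2] * 900)
classes = np.unique(y_toy)

# Weights as scikit-learn computes them for class_weight='balanced'.
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

# The same weights from the formula n_samples / (n_classes * class_count).
manual = len(y_toy) / (len(classes) * np.bincount(y_toy))

for cls, w in zip(classes, weights):
    print(f"class {cls}: weight {w:.3f}")
```

Rare classes receive weights well above 1 and the dominant class a weight below 1, which is what shifts the decision boundary toward the minority classes.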

## 7. Model 3: XGBoost (Optimized)
Performance focus: `tree_method='hist'` (histogram-based split finding) and `n_jobs=-1` (all CPU cores).

In [None]:
xgb_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('clf', xgb.XGBClassifier(tree_method='hist', n_jobs=-1, random_state=42, eval_metric='mlogloss'))
])
xgb_pipeline.fit(X_train, y_train)
print("XGBoost trained.")

## 8. Professional Elevation: SMOTE Resampling
Comparing class weighting against the **Synthetic Minority Over-sampling Technique (SMOTE)**.

In [None]:
# Note: imblearn's Pipeline applies SMOTE only during fit, so resampling never touches validation folds or test data
smote_pipeline = ImbPipeline([
    ('preprocessor', preprocessor),
    ('smote', SMOTE(random_state=42)),
    ('clf', LogisticRegression(max_iter=2000, n_jobs=-1, random_state=42))
])

print("Training SMOTE Pipeline (this may take longer)...")
smote_pipeline.fit(X_train, y_train)
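
The core idea behind SMOTE is interpolation: a synthetic point is placed somewhere on the line segment between a minority sample and one of its nearest minority neighbours. The sketch below is a simplified illustration of that idea, not imblearn's implementation; the helper `smote_like_samples` and the toy points are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_like_samples(X_minority, n_new, k=3, seed=42):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbours: with no duplicate rows, each point's
    # closest neighbour is itself (column 0), which we skip below.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, neighbour_idx = nn.kneighbors(X_minority)

    # Pick a random base point and one of its k true neighbours.
    base = rng.integers(0, len(X_minority), size=n_new)
    picked = neighbour_idx[base, rng.integers(1, k + 1, size=n_new)]

    # Move a random fraction of the way from the base point to the neighbour.
    gap = rng.random((n_new, 1))
    return X_minority[base] + gap * (X_minority[picked] - X_minority[base])


# Hypothetical minority-class points in 2-D.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
X_new = smote_like_samples(X_min, n_new=20)
print(X_new[:5])
```

Because every synthetic point is a convex combination of two real ones, the new samples stay inside the region the minority class already occupies. In this notebook that region is the SVD-reduced feature space produced by the preprocessor.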

## 9. Professional Elevation: 5-Fold Cross-Validation
Verifying that our results are not just a "lucky split".

In [None]:
print("Running 5-Fold Cross-Validation on Logistic Regression...")
cv_results = cross_validate(lr_pipeline, X_train, y_train, cv=5, scoring='f1_macro')
print(f"Mean F1 (Macro): {cv_results['test_score'].mean():.4f} (+/- {cv_results['test_score'].std()*2:.4f})")
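
A possible extension is to score several metrics per fold and compare train against validation scores. The sketch below does this on synthetic data from `make_classification` (an assumption; it stands in for the real features) with an explicit `StratifiedKFold`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic three-class data with an imbalance similar in spirit to reviews.
X_toy, y_toy = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3,
    weights=[0.1, 0.1, 0.8], random_state=42,
)

# Explicit stratified folds so each fold keeps the class ratios.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_validate(
    LogisticRegression(class_weight="balanced", max_iter=1000),
    X_toy, y_toy, cv=cv,
    scoring=["f1_macro", "balanced_accuracy"],
    return_train_score=True,
)

test_f1 = scores["test_f1_macro"]
train_f1 = scores["train_f1_macro"]
print(f"Validation F1 (macro): {test_f1.mean():.4f} (+/- {test_f1.std() * 2:.4f})")
print(f"Train minus validation gap: {train_f1.mean() - test_f1.mean():.4f}")
```

A large gap between train and validation F1 would point to overfitting, which a single mean score cannot show.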

## 10. Deep Evaluation & Model Comparison

In [None]:
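def evaluate_all(pipelines):
    """Score each fitted pipeline on the held-out test set (uses the global X_test / y_test)."""
    results = []
    for name, pipe in pipelines.items():
        y_pred = pipe.predict(X_test)
        results.append({
            "Model": name,
            "F1 (Macro)": f1_score(y_test, y_pred, average='macro'),
            # Full per-class metrics, kept as a nested dict for later inspection
            "Report": classification_report(y_test, y_pred, output_dict=True)
        })
    return pd.DataFrame(results)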

models_dict = {
    "Naive Bayes": nb_pipeline,
    "LogReg (Weighted)": lr_pipeline,
    "LogReg (SMOTE)": smote_pipeline,
    "XGBoost": xgb_pipeline
}

leaderboard = evaluate_all(models_dict)
display(leaderboard.sort_values("F1 (Macro)", ascending=False))
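
The nested report dictionary is hard to read in a table. One option, sketched here on made-up predictions, is to pull out the per-class F1 scores so that weak minority-class performance is visible next to the macro average. The `label_names` mapping is an assumption based on the binning used earlier.

```python
import pandas as pd
from sklearn.metrics import classification_report

# Hypothetical labels and predictions (0 = Neg, 1 = Neu, 2 = Pos).
y_true = [0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 2, 1, 2, 2, 2, 2, 2]

report = classification_report(y_true, y_pred, output_dict=True, zero_division=0)

# With output_dict=True the per-class entries are keyed by the label as a string.
label_names = {"0": "Negative", "1": "Neutral", "2": "Positive"}
per_class_f1 = pd.Series(
    {name: report[key]["f1-score"] for key, name in label_names.items()}
)

print(per_class_f1.round(3))
print(f"Macro F1: {report['macro avg']['f1-score']:.3f}")
```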

## 11. Error Discovery: Misclassification Audit
Inspecting the cases where the model failed most significantly.

In [None]:
# Assumes Logistic Regression is the best performer; substitute the leaderboard's top model if it differs
y_pred_final = lr_pipeline.predict(X_test)
test_analysis = X_test.copy()
test_analysis['actual'] = y_test
test_analysis['pred'] = y_pred_final

# Find 'Severe' Errors: Actual Negative predicted as Positive
severe_errors = test_analysis[(test_analysis['actual'] == 0) & (test_analysis['pred'] == 2)]

print(f"Number of Severe Errors (Neg -> Pos): {len(severe_errors)}")
print("\n--- Sample Severe Errors ---")
display(severe_errors[['cleaned_text', 'brand']].head(10))
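
The same severe-error count can be read off a confusion matrix, which also shows the opposite mistake (positive predicted as negative). A minimal sketch on made-up labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions (0 = Neg, 1 = Neu, 2 = Pos).
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])
y_pred = np.array([0, 2, 2, 1, 2, 2, 2, 2, 0, 2])

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])

neg_as_pos = cm[0, 2]  # actual Negative predicted as Positive
pos_as_neg = cm[2, 0]  # actual Positive predicted as Negative

print(cm)
print(f"Neg -> Pos: {neg_as_pos}, Pos -> Neg: {pos_as_neg}")
```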

## 12. Bias Audit: Performance by Brand
Is our model biased towards certain brands?

In [None]:
test_analysis['correct'] = test_analysis['actual'] == test_analysis['pred']
brand_accuracy = test_analysis.groupby('brand')['correct'].mean().sort_values()

plt.figure(figsize=(10, 5))
brand_accuracy.head(10).plot(kind='barh', color='salmon')
plt.title("Accuracy by Brand (Bottom 10)")
plt.xlabel("Accuracy Rate")
plt.show()
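
Per-brand accuracy can be misleading when a brand has only a handful of test reviews. The sketch below, on a made-up frame, reports the review count alongside accuracy and filters on a minimum count. The `min_reviews` threshold is a hypothetical value to tune for the real data.

```python
import pandas as pd

# Hypothetical per-review correctness flags for three brands.
toy = pd.DataFrame({
    "brand": ["A", "A", "A", "A", "B", "B", "B", "C"],
    "correct": [True, True, False, True, False, False, True, False],
})

# Accuracy and support (number of test reviews) per brand.
by_brand = toy.groupby("brand")["correct"].agg(accuracy="mean", n_reviews="size")

# Keep only brands with enough reviews for the accuracy to be meaningful.
min_reviews = 3  # hypothetical threshold
reliable = by_brand[by_brand["n_reviews"] >= min_reviews].sort_values("accuracy")

print(reliable)
```

Here brand C is dropped: its 0% accuracy comes from a single review and says little about bias.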