# Phase 4: A/B Testing & Model Improvement

In this phase, we run an offline A/B comparison: our best Phase 2 model (Control) and a proposed improved version (Treatment) are evaluated on the same held-out test set. We define three discrete and three continuous metrics so that both label quality and probability quality are covered.

## 1. Metrics Definition

We selected the following metrics to evaluate the models:

### Discrete Metrics (Classification Quality)
1. **Accuracy**: The ratio of correctly predicted observations to total observations.
2. **Macro F1-Score**: The harmonic mean of precision and recall, averaged across classes (crucial for our imbalanced data).
3. **Macro Precision**: The fraction of predictions for a class that are actually correct, computed per class and then averaged across classes.
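
To make the difference between Accuracy and the macro-averaged metrics concrete, the following is a minimal from-scratch sketch on made-up toy labels (the data is hypothetical and not taken from our dataset). Undefined precision is set to 0 here, which is an assumption chosen to mirror the usual library default.

```python
def per_class_precision_recall_f1(y_true, y_pred):
    """Return {label: (precision, recall, f1)} computed from raw counts."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # Assumption: undefined ratios (zero denominator) are scored as 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_average(scores, index):
    """Unweighted mean of one metric (0=precision, 1=recall, 2=F1) over classes."""
    return sum(s[index] for s in scores.values()) / len(scores)


# Hypothetical toy data: 18 common-class samples, 2 rare-class samples,
# and a model that never predicts the rare class.
y_true = ["greeting"] * 18 + ["book_hotel"] * 2
y_pred = ["greeting"] * 20

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
scores = per_class_precision_recall_f1(y_true, y_pred)
macro_precision = macro_average(scores, 0)
macro_f1 = macro_average(scores, 2)

print(f"accuracy={accuracy:.4f} macro_precision={macro_precision:.4f} macro_f1={macro_f1:.4f}")
```

In this toy case Accuracy stays at 0.90 while Macro F1 falls to about 0.47, because the missed rare class counts as much as the common one in the macro average.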

### Continuous Metrics (Probabilistic & Operational)
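1. **Log Loss (Cross-Entropy)**: The average negative log-probability the model assigns to the true class. It rewards confident correct predictions and penalizes confident wrong ones heavily. Lower is better.
2. **ROC-AUC Score**: Area Under the Receiver Operating Characteristic Curve, computed one-vs-rest and macro-averaged across classes. Measures how well the classifier ranks the true class above the others. Higher is better.
3. **Inference Latency (ms)**: The average time per sample, computed as the wall-clock time for scoring the full test batch (one `predict` and one `predict_proba` call) divided by the number of test samples. Lower is better.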
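
As an illustration of why Log Loss carries information that Accuracy does not, here is a small from-scratch sketch. The probability rows are invented for the example, and the clipping constant `eps` is an assumption added to avoid `log(0)`.

```python
import math


def log_loss_from_scratch(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class index."""
    total = 0.0
    for true_idx, row in zip(y_true, probs):
        p = min(max(row[true_idx], eps), 1 - eps)  # clip to avoid log(0)
        total += -math.log(p)
    return total / len(y_true)


y_true = [0, 1, 2]

# All three predictions correct, with high confidence
confident_right = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9]]
# All three predictions still correct, but barely
hesitant_right = [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3], [0.3, 0.3, 0.4]]
# Last prediction is wrong and confident
confident_wrong = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.9, 0.05, 0.05]]

loss_confident_right = log_loss_from_scratch(y_true, confident_right)
loss_hesitant_right = log_loss_from_scratch(y_true, hesitant_right)
loss_confident_wrong = log_loss_from_scratch(y_true, confident_wrong)

print(loss_confident_right, loss_hesitant_right, loss_confident_wrong)
```

The first two cases have identical Accuracy (all correct) but very different Log Loss, and a single confident mistake in the third case dominates the average.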

In [1]:
import time
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display, Markdown

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    accuracy_score, 
    f1_score, 
    precision_score, 
    log_loss, 
    roc_auc_score
)
from sklearn.preprocessing import LabelEncoder

sns.set_theme(style="whitegrid")

## 2. Data Loading

In [2]:
train_path = "Dataset/phase2_outputs/conversation2_train.csv"
test_path = "Dataset/phase2_outputs/conversation2_test.csv"

try:
    train_df = pd.read_csv(train_path)
    test_df = pd.read_csv(test_path)
except FileNotFoundError:
    raise FileNotFoundError("Phase 2 outputs not found. Please run Phase 2 notebook first.")

# Ensure string types
X_train = train_df['utterance'].astype(str)
y_train = train_df['intent']
X_test = test_df['utterance'].astype(str)
y_test = test_df['intent']

# Label Encoding for Log Loss / AUC calculation
le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)
y_test_enc = le.transform(y_test)
labels = le.classes_

print(f"Training Samples: {len(X_train)}")
print(f"Test Samples: {len(X_test)}")

Training Samples: 959
Test Samples: 240


## 3. Model A: Control (Previous Best)

Our previous best model was the **Linear SVM**. However, `LinearSVC` only produces decision scores and does not expose `predict_proba`, so it cannot be scored on Log Loss or ROC-AUC as-is. To enable a fair A/B test on the continuous metrics, we wrap the Linear SVM in `CalibratedClassifierCV`, whose default sigmoid method is Platt scaling.
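
For intuition, Platt scaling fits a sigmoid that maps a raw decision score to a probability. The sketch below is a simplified binary version fitted with plain gradient descent; the scores, labels, learning rate, and epoch count are all hypothetical, and it is not the routine the library runs internally.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_platt(scores, labels, lr=0.1, epochs=2000):
    """Fit p = sigmoid(a * score + b) by gradient descent on log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # gradient of log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


# Hypothetical SVM margins (signed distances to the boundary) and binary labels.
# The classes overlap slightly, so the fitted slope stays finite.
scores = [-2.1, -1.3, -0.4, 0.2, 0.3, 0.9, 1.8, 2.5]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

a, b = fit_platt(scores, labels)
calibrated = [sigmoid(a * s + b) for s in scores]

for s, p in zip(scores, calibrated):
    print(f"score={s:+.1f} -> p(positive)={p:.3f}")
```

The mapping is monotonic, so calibration changes the probabilities (and therefore Log Loss) without reordering samples by score.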

In [3]:
# Define Model A Pipeline
model_a = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', ngram_range=(1, 2))),
    ('clf', CalibratedClassifierCV(LinearSVC(class_weight='balanced', random_state=42)))
])

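# Train (perf_counter is a monotonic, high-resolution clock suited to short intervals)
start_train_a = time.perf_counter()
model_a.fit(X_train, y_train_enc)
train_time_a = time.perf_counter() - start_train_a

# Predict & measure latency; the timed window covers both predict and
# predict_proba over the whole test batch
start_inf_a = time.perf_counter()
preds_a = model_a.predict(X_test)
probs_a = model_a.predict_proba(X_test)
inf_time_a = (time.perf_counter() - start_inf_a) / len(X_test) * 1000  # ms per sample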

# Calculate Metrics
metrics_a = {
    "Accuracy": accuracy_score(y_test_enc, preds_a),
    "Macro F1": f1_score(y_test_enc, preds_a, average='macro'),
    "Macro Precision": precision_score(y_test_enc, preds_a, average='macro'),
    "Log Loss": log_loss(y_test_enc, probs_a),
    "ROC AUC": roc_auc_score(y_test_enc, probs_a, multi_class='ovr', average='macro'),
    "Latency (ms)": inf_time_a
}

display(pd.DataFrame([metrics_a], index=["Model A (Control)"]))



| Model | Accuracy | Macro F1 | Macro Precision | Log Loss | ROC AUC | Latency (ms) |
|---|---|---|---|---|---|---|
| Model A (Control) | 0.9625 | 0.968881 | 0.971984 | 0.151515 | 0.996458 | 0.031528 |
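
The latency figure comes from a single timed pass over the test batch, which can be noisy at sub-millisecond scale. One possible way to stabilize it is to repeat the measurement and take the median, as sketched below. The helper name, the `repeats` and `warmup` values, and the stand-in predictor are all assumptions for illustration; in the notebook, `model_a.predict` could be passed instead.

```python
import statistics
import time


def per_sample_latency_ms(predict_fn, samples, repeats=20, warmup=2):
    """Median per-sample latency in milliseconds over repeated batch calls."""
    for _ in range(warmup):
        predict_fn(samples)  # untimed runs to warm caches
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        predict_fn(samples)
        elapsed = time.perf_counter() - start
        timings.append(elapsed / len(samples) * 1000)
    return statistics.median(timings)


# Stand-in for a real model's predict method (hypothetical)
def dummy_predict(batch):
    return [len(text) % 3 for text in batch]


latency = per_sample_latency_ms(dummy_predict, ["hello there"] * 240)
print(f"median latency: {latency:.6f} ms per sample")
```

Note that this still measures batch throughput divided by batch size. True single-request latency would require timing calls with one sample at a time.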


## 4. Model B: Treatment (Improved Strategy)

### Improvement Hypothesis
1. **Data Augmentation**: The `book_hotel` intent is severely underrepresented (2 samples). We hypothesize that adding synthetic examples will improve the model's ability to recognize this intent.
2. **Algorithm Enhancement (Ensemble)**: We replace the single SVM with a soft-voting **VotingClassifier** that averages the predicted probabilities of the calibrated SVM and a Logistic Regression. Logistic Regression optimizes log loss directly and so tends to produce better-calibrated probabilities, which could improve Log Loss and AUC.
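
Soft voting can be sketched in a few lines: average each model's class-probability vector and take the argmax. The probability values below are invented for the example, and the function is an illustrative stand-in rather than the library implementation.

```python
def soft_vote(prob_rows_per_model, weights=None):
    """Average class probabilities across models, then pick the argmax per sample."""
    n_models = len(prob_rows_per_model)
    weights = weights or [1.0] * n_models
    total_weight = sum(weights)
    n_samples = len(prob_rows_per_model[0])
    n_classes = len(prob_rows_per_model[0][0])

    averaged, predictions = [], []
    for i in range(n_samples):
        row = [
            sum(w * model[i][c] for w, model in zip(weights, prob_rows_per_model))
            / total_weight
            for c in range(n_classes)
        ]
        averaged.append(row)
        predictions.append(max(range(n_classes), key=row.__getitem__))
    return averaged, predictions


# Hypothetical probabilities for two samples and three classes
svm_probs = [[0.70, 0.20, 0.10], [0.40, 0.45, 0.15]]
lr_probs = [[0.50, 0.30, 0.20], [0.55, 0.30, 0.15]]

averaged, predictions = soft_vote([svm_probs, lr_probs])
print(averaged, predictions)
```

For the second sample, the SVM alone would pick class 1, but the averaged probabilities favor class 0. An ensemble member can therefore overturn a decision the other would have gotten right, which is one way ensembling can hurt as well as help.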

In [4]:
# 1. Data Augmentation
new_samples = [
    ("I want to book a hotel room", "book_hotel"),
    ("Reserve a suite for me", "book_hotel"),
    ("I need a reservation at the Hilton", "book_hotel"),
    ("Can you find me a place to stay?", "book_hotel"),
    ("I'm looking for accommodation", "book_hotel"),
    ("Book a double room for 2 nights", "book_hotel"),
    ("I'd like to make a hotel reservation", "book_hotel"),
    ("Find me a hotel in Paris", "book_hotel"),
    ("Reserve a room", "book_hotel"),
    ("I need to book a place", "book_hotel")
]
new_df = pd.DataFrame(new_samples, columns=['utterance', 'intent'])
train_df_aug = pd.concat([train_df, new_df], ignore_index=True)

# Encode augmented labels with the already-fitted encoder (no refit needed: no new classes are introduced)
X_train_aug = train_df_aug['utterance'].astype(str)
y_train_aug = train_df_aug['intent']
y_train_aug_enc = le.transform(y_train_aug)

print(f"Added {len(new_df)} samples. New training size: {len(X_train_aug)}")

# 2. Algorithm: Voting Classifier (SVM + LR)
clf1 = CalibratedClassifierCV(LinearSVC(class_weight='balanced', random_state=42))
clf2 = LogisticRegression(class_weight='balanced', random_state=42, max_iter=1000)
voting_clf = VotingClassifier(estimators=[('svm', clf1), ('lr', clf2)], voting='soft')

model_b = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', ngram_range=(1, 2), min_df=1)),
    ('clf', voting_clf)
])

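# Train (perf_counter is a monotonic, high-resolution clock suited to short intervals)
start_train_b = time.perf_counter()
model_b.fit(X_train_aug, y_train_aug_enc)
train_time_b = time.perf_counter() - start_train_b

# Predict & measure latency; the timed window covers both predict and
# predict_proba over the whole test batch
start_inf_b = time.perf_counter()
preds_b = model_b.predict(X_test)
probs_b = model_b.predict_proba(X_test)
inf_time_b = (time.perf_counter() - start_inf_b) / len(X_test) * 1000  # ms per sample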

# Calculate Metrics
metrics_b = {
    "Accuracy": accuracy_score(y_test_enc, preds_b),
    "Macro F1": f1_score(y_test_enc, preds_b, average='macro'),
    "Macro Precision": precision_score(y_test_enc, preds_b, average='macro'),
    "Log Loss": log_loss(y_test_enc, probs_b),
    "ROC AUC": roc_auc_score(y_test_enc, probs_b, multi_class='ovr', average='macro'),
    "Latency (ms)": inf_time_b
}

display(pd.DataFrame([metrics_b], index=["Model B (Treatment)"]))

Added 10 samples. New training size: 969


| Model | Accuracy | Macro F1 | Macro Precision | Log Loss | ROC AUC | Latency (ms) |
|---|---|---|---|---|---|---|
| Model B (Treatment) | 0.941667 | 0.822735 | 0.809911 | 0.209622 | 0.99666 | 0.035373 |


## 5. Comparison & Findings

We compare the Control (A) and Treatment (B) across all metrics.

In [5]:
comparison = pd.DataFrame([metrics_a, metrics_b], index=["Model A (Control)", "Model B (Treatment)"])
comparison["F1 Diff"] = comparison["Macro F1"] - comparison["Macro F1"].iloc[0]
comparison["LogLoss Diff"] = comparison["Log Loss"] - comparison["Log Loss"].iloc[0]

display(comparison.style.format("{:.4f}").background_gradient(cmap="RdYlGn", subset=["Macro F1", "ROC AUC"]))
display(comparison.style.format("{:.4f}").background_gradient(cmap="RdYlGn_r", subset=["Log Loss", "Latency (ms)"]))

| Model | Accuracy | Macro F1 | Macro Precision | Log Loss | ROC AUC | Latency (ms) | F1 Diff | LogLoss Diff |
|---|---|---|---|---|---|---|---|---|
| Model A (Control) | 0.9625 | 0.9689 | 0.9720 | 0.1515 | 0.9965 | 0.0315 | 0.0000 | 0.0000 |
| Model B (Treatment) | 0.9417 | 0.8227 | 0.8099 | 0.2096 | 0.9967 | 0.0354 | -0.1461 | 0.0581 |




### Findings

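1. **Discrete Metrics Decline**: Macro F1 fell from 0.9689 to 0.8227 (-0.1461) and Macro Precision from 0.9720 to 0.8099, while Accuracy fell only from 0.9625 to 0.9417 (231 vs. 226 correct out of 240). A macro-averaged drop this much larger than the accuracy drop points to one or more low-support classes degrading sharply, since macro averaging weights every class equally. Because augmentation and ensembling were introduced together, this run cannot tell which change is responsible: either the **Voting Classifier** (specifically the Logistic Regression component) or the synthetic data itself may have introduced confusion with other classes (e.g., `greeting` vs. `book_hotel`). The original SVM (Model A) appears more robust to the specific noise in the zero-shot-labeled test set.
2. **Log Loss Degradation**: Log Loss rose from 0.1515 to 0.2096 (+0.0581), which is worse. This indicates that the ensemble was *less confident* in its correct predictions or *more confident* in its wrong ones compared to the calibrated SVM alone.
3. **AUC Stability**: ROC-AUC is effectively unchanged (0.9965 vs. 0.9967), indicating that both models rank intents very well, even though Model B's top-1 decisions (which determine F1) were worse.
4. **Latency**: Model B is slightly slower (0.0354 ms vs. 0.0315 ms per sample), which is consistent with the overhead of running two classifiers (SVM + LR) and the voting mechanism. Each figure comes from a single timed run, so a difference this small may be within timing noise.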
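
With only 240 test samples, the accuracy gap amounts to five predictions, so it is worth checking whether the difference is distinguishable from chance. One common choice for paired classifier outputs is an exact McNemar test, sketched below with the standard library. The prediction lists here are hypothetical placeholders; in the notebook, `y_test_enc`, `preds_a`, and `preds_b` could be passed instead.

```python
from math import comb


def mcnemar_exact(y_true, preds_a, preds_b):
    """Two-sided exact McNemar test on paired predictions.

    Returns (a_only, b_only, p_value), where a_only counts samples that only
    model A classified correctly and b_only counts the reverse.
    """
    triples = list(zip(y_true, preds_a, preds_b))
    a_only = sum(a == t and b != t for t, a, b in triples)
    b_only = sum(a != t and b == t for t, a, b in triples)
    n = a_only + b_only
    if n == 0:
        return a_only, b_only, 1.0  # models never disagree on correctness
    k = min(a_only, b_only)
    # Binomial(n, 0.5) tail probability, doubled for a two-sided test
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return a_only, b_only, min(1.0, 2 * tail)


# Hypothetical paired predictions, NOT the notebook's actual outputs
y_true = [0, 0, 1, 1, 2, 2, 0, 1, 2, 0]
preds_a = [0, 0, 1, 1, 2, 2, 0, 1, 2, 1]
preds_b = [0, 1, 1, 0, 2, 1, 0, 1, 2, 1]

a_only, b_only, p_value = mcnemar_exact(y_true, preds_a, preds_b)
print(f"A-only correct: {a_only}, B-only correct: {b_only}, p = {p_value:.3f}")
```

The test only looks at samples where exactly one model is correct, so it addresses Accuracy rather than Macro F1. A bootstrap over test samples would be one way to put an interval on the macro metrics.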

**Conclusion**: For this specific dataset, the **Control Model (Calibrated Linear SVM)** is superior on every metric except ROC-AUC, where the two models are effectively tied. The attempt to improve via simple augmentation and ensembling failed to outperform the baseline, likely due to the small dataset size and the potential quality mismatch between synthetic data and the zero-shot labels in the test set.
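
Since Model B changed both the training data and the algorithm, a natural follow-up would be a 2x2 ablation that scores every combination of the two changes. The sketch below shows one way to organize it. The `train_and_score` callable is hypothetical and would need to be written around the pipelines above, and the placeholder scorer returns made-up numbers purely to show the call shape.

```python
from itertools import product


def run_ablation(train_and_score):
    """Score all four combinations of (augment, ensemble).

    train_and_score(augment, ensemble) is a caller-supplied function that
    trains the matching pipeline and returns one metric, e.g. macro F1.
    """
    results = {
        (augment, ensemble): train_and_score(augment, ensemble)
        for augment, ensemble in product([False, True], repeat=2)
    }
    # Main effect: mean score with the change minus mean score without it
    effect_augment = (
        (results[(True, False)] + results[(True, True)]) / 2
        - (results[(False, False)] + results[(False, True)]) / 2
    )
    effect_ensemble = (
        (results[(False, True)] + results[(True, True)]) / 2
        - (results[(False, False)] + results[(True, False)]) / 2
    )
    return results, effect_augment, effect_ensemble


# Placeholder scorer with made-up numbers, NOT measured results
def fake_scorer(augment, ensemble):
    return 0.90 + (0.02 if augment else 0.0) - (0.05 if ensemble else 0.0)


results, effect_augment, effect_ensemble = run_ablation(fake_scorer)
print(results)
print(f"augmentation effect: {effect_augment:+.3f}, ensemble effect: {effect_ensemble:+.3f}")
```

In this design, the `(False, False)` cell corresponds to Model A and the `(True, True)` cell to Model B. The two remaining cells are the runs missing from the current experiment.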