# Comparing Machine Learning Models in Production

This lecture demonstrates various strategies for comparing the performance of different sklearn models in a production environment. We'll cover the following approaches:

1. A/B Testing
2. G-Test
3. Multi-armed Bandit
4. Defining a metric-only dataset

We'll use a simple dataset and two different sklearn models to illustrate these concepts.

In [None]:
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, log_loss
import matplotlib.pyplot as plt
from scipy.stats import chi2_contingency
import seaborn as sns

# Set random seed for reproducibility
np.random.seed(42)

## Data Generation and Model Training

First, let's generate a synthetic dataset and train two different models: Logistic Regression and Random Forest.

In [None]:
# Generate synthetic data
X, y = make_classification(n_samples=10000, n_features=20, n_classes=2, random_state=42)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Logistic Regression model
lr_model = LogisticRegression(random_state=42)
lr_model.fit(X_train, y_train)

# Train Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Make predictions
lr_preds = lr_model.predict(X_test)
rf_preds = rf_model.predict(X_test)

print(f"Logistic Regression Accuracy: {accuracy_score(y_test, lr_preds):.4f}")
print(f"Random Forest Accuracy: {accuracy_score(y_test, rf_preds):.4f}")

## 1. A/B Testing

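A/B testing compares two versions of a model to determine which one performs better. In production, each incoming request is randomly routed to one of the two versions and the resulting metric is compared across the two groups. Here we approximate that comparison offline by scoring the Logistic Regression model (A) and the Random Forest model (B) on the same held-out test set.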

In [None]:
def ab_test(model_a_preds, model_b_preds, true_labels):
    model_a_accuracy = accuracy_score(true_labels, model_a_preds)
    model_b_accuracy = accuracy_score(true_labels, model_b_preds)
    
    print(f"Model A Accuracy: {model_a_accuracy:.4f}")
    print(f"Model B Accuracy: {model_b_accuracy:.4f}")
    
    if model_a_accuracy > model_b_accuracy:
        print("Model A (Logistic Regression) performs better.")
    elif model_b_accuracy > model_a_accuracy:
        print("Model B (Random Forest) performs better.")
    else:
        print("Both models perform equally.")

ab_test(lr_preds, rf_preds, y_test)
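
The comparison above scores both models on the same test examples. In a live A/B test, each request is served by only one model, so each arm's accuracy is estimated from its own share of traffic. The sketch below illustrates that routing under stated assumptions: `simulate_ab_split`, the 50/50 split, and the synthetic correctness arrays are hypothetical and not part of the lecture's pipeline.

```python
import numpy as np

def simulate_ab_split(correct_a, correct_b, traffic_share_a=0.5, seed=0):
    """Route each request to exactly one model and report per-arm accuracy.

    correct_a / correct_b are boolean arrays saying whether each model
    would have answered request i correctly (hypothetical inputs).
    """
    correct_a = np.asarray(correct_a, dtype=bool)
    correct_b = np.asarray(correct_b, dtype=bool)
    rng = np.random.default_rng(seed)
    # True -> request is served by model A, False -> served by model B
    to_a = rng.random(correct_a.shape[0]) < traffic_share_a
    n_a, n_b = int(to_a.sum()), int((~to_a).sum())
    return {
        "n_a": n_a,
        "n_b": n_b,
        "acc_a": float(correct_a[to_a].mean()) if n_a else float("nan"),
        "acc_b": float(correct_b[~to_a].mean()) if n_b else float("nan"),
    }

# Illustrative outcomes only: model A is right ~85% of the time, model B ~90%
rng = np.random.default_rng(42)
correct_a = rng.random(5000) < 0.85
correct_b = rng.random(5000) < 0.90
result = simulate_ab_split(correct_a, correct_b)
print(result)
```

With the models trained earlier, the two inputs could be `lr_preds == y_test` and `rf_preds == y_test`.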

## 2. G-Test

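The G-test is a likelihood-ratio test of independence for count data. It serves the same purpose as Pearson's chi-squared test and uses the same chi-squared reference distribution. Both are large-sample approximations, so for small counts an exact test is the safer choice. Here we apply it to a contingency table built from the models' prediction outcomes.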

In [None]:
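def g_test(model_a_preds, model_b_preds, true_labels):
    # Count correct and incorrect predictions for each model
    n = len(true_labels)
    a_correct = int(np.sum(model_a_preds == true_labels))
    b_correct = int(np.sum(model_b_preds == true_labels))
    
    # Create contingency table: one row per model, one column per outcome
    contingency_table = pd.DataFrame(
        [[a_correct, n - a_correct],
         [b_correct, n - b_correct]],
        index=pd.Index(['Model A', 'Model B'], name='Model'),
        columns=pd.Index(['Correct', 'Incorrect'], name='Prediction'),
    )
    
    # Perform G-test (log-likelihood ratio). The test treats the two rows as
    # independent samples, as in production traffic split between models; here
    # both models score the same examples, so read the p-value as approximate.
    g_stat, p_value, dof, expected = chi2_contingency(contingency_table, lambda_="log-likelihood")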
    
    print("Contingency Table:")
    print(contingency_table)
    print(f"\nG-statistic: {g_stat:.4f}")
    print(f"p-value: {p_value:.4f}")
    
    if p_value < 0.05:
        print("There is a significant difference between the models.")
    else:
        print("There is no significant difference between the models.")

g_test(lr_preds, rf_preds, y_test)
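
The G-statistic is defined as G = 2 * sum(O * ln(O / E)), where O are the observed counts and E the counts expected under independence. As a minimal sketch, the function below computes it by hand and checks the result against SciPy. The name `g_statistic` and the counts in `table` are hypothetical, and Yates' continuity correction is switched off so that the two values are directly comparable.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

def g_statistic(observed):
    """Compute G = 2 * sum(O * ln(O / E)) for a contingency table."""
    observed = np.asarray(observed, dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    # Expected counts under independence of rows and columns
    expected = row_totals * col_totals / observed.sum()
    # Cells with zero observations contribute nothing to the sum
    mask = observed > 0
    g = 2.0 * np.sum(observed[mask] * np.log(observed[mask] / expected[mask]))
    dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
    p_value = chi2.sf(g, dof)
    return g, p_value, dof

# Hypothetical counts: rows are models, columns are (correct, incorrect)
table = [[2550, 450], [2640, 360]]
g, p, dof = g_statistic(table)
g_scipy, p_scipy, _, _ = chi2_contingency(
    table, correction=False, lambda_="log-likelihood"
)
print(f"manual G = {g:.4f}, scipy G = {g_scipy:.4f}, p = {p:.4g}")
```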

## 3. Multi-armed Bandit

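The multi-armed bandit approach dynamically allocates traffic to the best-performing option while continuing to explore the others. We'll implement a simple epsilon-greedy strategy: with probability epsilon a model is picked at random (explore); otherwise the model with the higher observed accuracy so far is used (exploit).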

In [None]:
def multi_armed_bandit(model_a, model_b, X, y, n_rounds=1000, epsilon=0.1):
    model_a_correct = 0
    model_b_correct = 0
    model_a_count = 0
    model_b_count = 0
    
    for _ in range(n_rounds):
        if np.random.random() < epsilon:  # Explore
            chosen_model = np.random.choice(['A', 'B'])
        else:  # Exploit
            model_a_rate = model_a_correct / model_a_count if model_a_count > 0 else 0
            model_b_rate = model_b_correct / model_b_count if model_b_count > 0 else 0
            chosen_model = 'A' if model_a_rate >= model_b_rate else 'B'
        
        # Randomly select a sample
        idx = np.random.randint(0, len(X))
        x, true_y = X[idx:idx+1], y[idx]
        
        if chosen_model == 'A':
            pred = model_a.predict(x)[0]
            model_a_count += 1
            if pred == true_y:
                model_a_correct += 1
        else:
            pred = model_b.predict(x)[0]
            model_b_count += 1
            if pred == true_y:
                model_b_correct += 1
    
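    # Guard against division by zero in case a model was never chosen
    model_a_acc = model_a_correct / model_a_count if model_a_count > 0 else float('nan')
    model_b_acc = model_b_correct / model_b_count if model_b_count > 0 else float('nan')
    print(f"Model A (Logistic Regression) accuracy: {model_a_acc:.4f}")
    print(f"Model B (Random Forest) accuracy: {model_b_acc:.4f}")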
    print(f"Model A was chosen {model_a_count} times")
    print(f"Model B was chosen {model_b_count} times")

multi_armed_bandit(lr_model, rf_model, X_test, y_test)
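
Epsilon-greedy allocation is easier to see when the true success rates are known. The following sketch replaces the models with simulated Bernoulli arms. The name `epsilon_greedy`, the success probabilities, and the rule of trying each arm once before exploiting are assumptions made for illustration, and the last of these differs slightly from the function above.

```python
import numpy as np

def epsilon_greedy(success_probs, n_rounds=2000, epsilon=0.1, seed=0):
    """Run epsilon-greedy over Bernoulli arms; return pull counts and observed rates."""
    rng = np.random.default_rng(seed)
    n_arms = len(success_probs)
    pulls = np.zeros(n_arms, dtype=int)
    wins = np.zeros(n_arms, dtype=int)
    for _ in range(n_rounds):
        untried = np.flatnonzero(pulls == 0)
        if untried.size > 0:
            arm = int(untried[0])                # try every arm once first
        elif rng.random() < epsilon:
            arm = int(rng.integers(n_arms))      # explore: uniform random arm
        else:
            arm = int(np.argmax(wins / pulls))   # exploit: best observed rate
        reward = rng.random() < success_probs[arm]
        pulls[arm] += 1
        wins[arm] += int(reward)
    return pulls, wins / pulls

# Hypothetical arms: arm 0 succeeds 70% of the time, arm 1 succeeds 90%
pulls, rates = epsilon_greedy([0.70, 0.90], n_rounds=5000)
print("pulls per arm:", pulls)
print("observed success rates:", np.round(rates, 3))
```

Most of the traffic should end up on the arm with the higher success rate, while the epsilon share keeps the estimate for the weaker arm up to date.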

## 4. Defining a Metric-only Dataset

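Repeatedly choosing between models on the same test set gradually tunes those choices to that particular set. Defining a separate dataset that is used solely for the final metric comparison helps prevent this overfitting to the test set and gives a more robust comparison between models, provided the data comes from the same distribution the models were trained on.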

In [None]:
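# Build the metric dataset from held-out data drawn from the same distribution as the
# training data. Calling make_classification with a different random_state would define
# a different feature-label relationship, so the trained models could not generalize to it.
# For simplicity we take half of the test split; in practice, reserve this set up front.
_, X_metric, _, y_metric = train_test_split(X_test, y_test, test_size=0.5, random_state=100)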

def evaluate_on_metric_dataset(model_a, model_b, X, y):
    model_a_preds = model_a.predict(X)
    model_b_preds = model_b.predict(X)
    
    model_a_accuracy = accuracy_score(y, model_a_preds)
    model_b_accuracy = accuracy_score(y, model_b_preds)
    
    model_a_log_loss = log_loss(y, model_a.predict_proba(X))
    model_b_log_loss = log_loss(y, model_b.predict_proba(X))
    
    print(f"Model A (Logistic Regression) Accuracy: {model_a_accuracy:.4f}")
    print(f"Model B (Random Forest) Accuracy: {model_b_accuracy:.4f}")
    print(f"Model A (Logistic Regression) Log Loss: {model_a_log_loss:.4f}")
    print(f"Model B (Random Forest) Log Loss: {model_b_log_loss:.4f}")
    
    # Visualize the results
    metrics = ['Accuracy', 'Log Loss (lower is better)']
    model_a_scores = [model_a_accuracy, model_a_log_loss]
    model_b_scores = [model_b_accuracy, model_b_log_loss]
    
    x = np.arange(len(metrics))
    width = 0.35
    
    fig, ax = plt.subplots(figsize=(10, 6))
    rects1 = ax.bar(x - width/2, model_a_scores, width, label='Model A (Logistic Regression)')
    rects2 = ax.bar(x + width/2, model_b_scores, width, label='Model B (Random Forest)')
    
    ax.set_ylabel('Scores')
    ax.set_title('Model Comparison on Metric Dataset')
    ax.set_xticks(x)
    ax.set_xticklabels(metrics)
    ax.legend()
    
    ax.bar_label(rects1, padding=3)
    ax.bar_label(rects2, padding=3)
    
    fig.tight_layout()
    plt.show()

evaluate_on_metric_dataset(lr_model, rf_model, X_metric, y_metric)
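
A metric-only set is most useful when it comes from the same data as the training and test sets and is set aside before any model is evaluated. A minimal sketch of such a three-way split follows. The variable names and the 60/20/20 proportions are hypothetical choices, not part of the pipeline above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X_all, y_all = make_classification(
    n_samples=10000, n_features=20, n_classes=2, random_state=42
)

# Step 1: reserve 20% as the metric-only set; it stays untouched during development
X_dev, X_holdout, y_dev, y_holdout = train_test_split(
    X_all, y_all, test_size=0.2, stratify=y_all, random_state=42
)

# Step 2: split the remaining 80% into training (60% overall) and test (20% overall)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_dev, y_dev, test_size=0.25, stratify=y_dev, random_state=42
)

print(len(X_tr), len(X_te), len(X_holdout))
```

Because all three parts are slices of one dataset, a model fitted on `X_tr` can be scored meaningfully on both `X_te` and `X_holdout`.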

## Conclusion

In this lecture, we've explored four different strategies for comparing machine learning models in a production environment:

1. A/B Testing: A straightforward comparison of model performance on a test set.
2. G-Test: A statistical test to determine if there's a significant difference between model performances.
3. Multi-armed Bandit: A dynamic approach that balances exploration and exploitation to find the best-performing model.
4. Metric-only Dataset: Using a separate dataset for model evaluation to prevent overfitting to the test set.

Each approach has its strengths and is suitable for different scenarios:

- A/B Testing is simple and easy to implement, but a raw accuracy comparison does not tell you whether the observed difference is statistically significant.
- The G-Test provides a statistical foundation for comparing models, but it relies on a large-sample approximation and becomes unreliable when cell counts are small.
- The Multi-armed Bandit approach is adaptive and can be useful in online learning scenarios where you want to balance exploration and exploitation.
- Using a Metric-only Dataset provides a more robust evaluation but requires additional data and may not capture real-time changes in data distribution.

When choosing a method for model comparison in production, consider factors such as:
- The amount of data available
- The cost of making incorrect predictions
- The need for real-time adaptation
- The stability of the data distribution over time
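
How much data is enough depends on how small a difference you need to detect. As a rough guide, the standard normal-approximation formula for comparing two proportions gives the number of labelled requests each model needs. The helper `samples_per_arm`, the significance level, the power, and the example accuracies below are hypothetical choices for illustration.

```python
import math
from scipy.stats import norm

def samples_per_arm(p_a, p_b, alpha=0.05, power=0.8):
    """Approximate samples needed per model to tell accuracy p_a from p_b
    with a two-sided two-proportion z-test (normal approximation)."""
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for the significance level
    z_beta = norm.ppf(power)            # quantile matching the desired power
    variance = p_a * (1 - p_a) + p_b * (1 - p_b)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p_a - p_b) ** 2)

n_two_points = samples_per_arm(0.85, 0.87)
n_one_point = samples_per_arm(0.85, 0.86)
print(f"Detecting 85% vs 87%: about {n_two_points} samples per model")
print(f"Detecting 85% vs 86%: about {n_one_point} samples per model")
```

Halving the difference to be detected roughly quadruples the required sample size, which is why small accuracy gaps are hard to confirm on limited traffic.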

Let's now explore some additional considerations and advanced techniques for model comparison in production environments.