# 1. Business Understanding

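1. What relevant key metrics are provided to evaluate the CTA combinations? And which CTA Copy and CTA Placement did best/worst based on the key metrics? - The primary metric is click-through rate (CTR), computed as the mean of `clickedCTA` for each copy/placement combination, because it directly measures how often a user who is shown a CTA clicks it. The other key metrics are `submittedForm`, `scheduledAppointment`, and `revenue`, which show whether a combination also drives deeper engagement and business value rather than clicks alone. Based on the results below, "Get Pre-Approved for a Mortgage in 5 Minutes" placed at the Top did best on CTR, form submissions, and scheduled appointments, while "Access Your Personalized Mortgage Rates Now" placed at the Bottom had the lowest CTR and form-submission rate. Revenue follows a different pattern: the highest mean revenue came from "First Time? We've Made it Easy to Find the Best Mortgage Rate" in the Middle, and the lowest from "Get Pre-Approved for a Mortgage in 5 Minutes" in the Middle.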

## Loading Data

In [195]:
import pandas as pd
import numpy as np

train_df = pd.read_csv('train.csv')

## Computing Metrics

In [196]:
metrics = train_df.groupby(['ctaCopy', 'ctaPlacement']).agg({
    'clickedCTA': 'mean',
    'submittedForm': 'mean',
    'scheduledAppointment': 'mean',
    'revenue': 'mean'
}).reset_index()

## Displaying Results

In [197]:
print(metrics[['ctaCopy', 'ctaPlacement', 'clickedCTA', 'submittedForm', 'scheduledAppointment', 'revenue']].to_string(index=False))

                                                      ctaCopy ctaPlacement  clickedCTA  submittedForm  scheduledAppointment    revenue
                  Access Your Personalized Mortgage Rates Now       Bottom    0.134821       0.117001              0.051751 218.982609
                  Access Your Personalized Mortgage Rates Now       Middle    0.161462       0.126901              0.050671 225.461812
                  Access Your Personalized Mortgage Rates Now          Top    0.186482       0.150752              0.054631 221.869852
First Time? We've Made it Easy to Find the Best Mortgage Rate       Bottom    0.153092       0.135631              0.056881 226.882911
First Time? We've Made it Easy to Find the Best Mortgage Rate       Middle    0.169922       0.135811              0.053191 226.945854
First Time? We've Made it Easy to Find the Best Mortgage Rate          Top    0.198452       0.159032              0.054541 225.280528
                 Get Pre-Approved for a Mortgage in 5 M

## Best Performing Combinations

In [198]:
best_clicked = metrics.loc[metrics['clickedCTA'].idxmax()]
best_submitted = metrics.loc[metrics['submittedForm'].idxmax()]
best_appointment = metrics.loc[metrics['scheduledAppointment'].idxmax()]
best_revenue = metrics.loc[metrics['revenue'].idxmax()]

print(f"Highest clickedCTA: {best_clicked['ctaCopy']} - {best_clicked['ctaPlacement']} ({best_clicked['clickedCTA']:.4f})")
print(f"Highest submittedForm: {best_submitted['ctaCopy']} - {best_submitted['ctaPlacement']} ({best_submitted['submittedForm']:.4f})")
print(f"Highest scheduledAppointment: {best_appointment['ctaCopy']} - {best_appointment['ctaPlacement']} ({best_appointment['scheduledAppointment']:.4f})")
print(f"Highest Revenue: {best_revenue['ctaCopy']} - {best_revenue['ctaPlacement']} (${best_revenue['revenue']:.2f})")

Highest clickedCTA: Get Pre-Approved for a Mortgage in 5 Minutes - Top (0.2118)
Highest submittedForm: Get Pre-Approved for a Mortgage in 5 Minutes - Top (0.1909)
Highest scheduledAppointment: Get Pre-Approved for a Mortgage in 5 Minutes - Top (0.0603)
Highest Revenue: First Time? We've Made it Easy to Find the Best Mortgage Rate - Middle ($226.95)


## Worst Performing Combinations

In [199]:
worst_clicked = metrics.loc[metrics['clickedCTA'].idxmin()]
worst_submitted = metrics.loc[metrics['submittedForm'].idxmin()]
worst_appointment = metrics.loc[metrics['scheduledAppointment'].idxmin()]
worst_revenue = metrics.loc[metrics['revenue'].idxmin()]

print(f"Lowest clickedCTA: {worst_clicked['ctaCopy']} - {worst_clicked['ctaPlacement']} ({worst_clicked['clickedCTA']:.4f})")
print(f"Lowest submittedForm: {worst_submitted['ctaCopy']} - {worst_submitted['ctaPlacement']} ({worst_submitted['submittedForm']:.4f})")
print(f"Lowest scheduledAppointment: {worst_appointment['ctaCopy']} - {worst_appointment['ctaPlacement']} ({worst_appointment['scheduledAppointment']:.4f})")
print(f"Lowest Revenue: {worst_revenue['ctaCopy']} - {worst_revenue['ctaPlacement']} (${worst_revenue['revenue']:.2f})")

Lowest clickedCTA: Access Your Personalized Mortgage Rates Now - Bottom (0.1348)
Lowest submittedForm: Access Your Personalized Mortgage Rates Now - Bottom (0.1170)
Lowest scheduledAppointment: Access Your Personalized Mortgage Rates Now - Middle (0.0507)
Lowest Revenue: Get Pre-Approved for a Mortgage in 5 Minutes - Middle ($203.10)
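
Because every figure above is a group mean, it can help to check how many sessions sit behind each combination before reading small gaps as real differences. The following is a minimal sketch on an invented frame; the `toy` data and the `ctr` and `n_sessions` column names are assumptions for illustration and are not part of the notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for train_df; the values are invented.
toy = pd.DataFrame({
    "ctaCopy": ["A", "A", "A", "B", "B", "B"],
    "ctaPlacement": ["Top", "Top", "Bottom", "Top", "Bottom", "Bottom"],
    "clickedCTA": [1, 0, 0, 1, 1, 0],
})

# Named aggregation: mean click rate plus the number of sessions per group.
summary = (
    toy.groupby(["ctaCopy", "ctaPlacement"])
       .agg(ctr=("clickedCTA", "mean"), n_sessions=("clickedCTA", "size"))
       .reset_index()
)
print(summary.to_string(index=False))
```

Applied to `train_df`, the same aggregation would show whether the best and worst combinations are backed by comparable sample sizes.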


2. Which groups of people tend to be more correlated or less correlated with our key metrics?

3. What ways can you manipulate the columns/dataset to create features that increase predictive power towards our key metric?

4. Besides Log Loss, what other metrics will you use to evaluate the model's performance, and why?
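
The baseline model in Section 3 reports ROC-AUC and Brier score alongside log loss. As a rough illustration of what each one measures, here is a small sketch on invented labels and probabilities (`y_true` and `y_prob` are made up for this example only).

```python
import numpy as np
from sklearn.metrics import log_loss, roc_auc_score, brier_score_loss

# Invented labels and predicted click probabilities, for illustration only.
y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8])

# Log loss: penalizes confident but wrong probabilities heavily.
ll = log_loss(y_true, y_prob)
# ROC-AUC: how well the scores rank positives above negatives.
auc = roc_auc_score(y_true, y_prob)
# Brier score: mean squared error between probability and outcome.
brier = brier_score_loss(y_true, y_prob)

print(f"log loss={ll:.4f}  roc_auc={auc:.4f}  brier={brier:.4f}")
```

ROC-AUC depends only on the ordering of the scores, while log loss and Brier score also reward well-calibrated probabilities, so the three can disagree about which model is better.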

# 2. Exploratory Data Analysis

# 3. Baseline Model

## Imports

In [201]:
import pandas as pd
import numpy as np
import os
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, roc_auc_score, brier_score_loss
import warnings
warnings.filterwarnings('ignore')

## Data Loading Function

In [202]:
def load_data():
    """Load train and test data from current directory."""
    train_path = 'train.csv'
    test_path = 'test.csv'
    
    if not os.path.exists(train_path):
        raise FileNotFoundError(f"Training file '{train_path}' not found")
    if not os.path.exists(test_path):
        raise FileNotFoundError(f"Test file '{test_path}' not found")
    
    train_df = pd.read_csv(train_path)
    test_df = pd.read_csv(test_path)
    
    print(f"Loaded train data: {train_df.shape}")
    print(f"Loaded test data: {test_df.shape}")
    
    return train_df, test_df

## Pipeline Building Function

In [203]:
def build_pipeline(categorical_features, numeric_features):
    """Build the preprocessing and modeling pipeline."""
    
    categorical_transformer = Pipeline([
        ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
    ])
    
    numeric_transformer = Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ])
    
    transformers = []
    if categorical_features:
        transformers.append(('cat', categorical_transformer, categorical_features))
    if numeric_features:
        transformers.append(('num', numeric_transformer, numeric_features))
    
    preprocessor = ColumnTransformer(
        transformers=transformers,
        remainder='drop'
    )
    
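    # n_jobs and class_weight=None are dropped: class_weight=None is the default,
    # and n_jobs only parallelizes one-vs-rest multiclass fits, not this binary task.
    pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', LogisticRegression(max_iter=2000))
    ])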
    
    return pipeline

## Evaluation Function

In [204]:
def evaluate(y_true, y_pred_proba):
    """Calculate and print evaluation metrics."""
    logloss = log_loss(y_true, y_pred_proba)
    roc_auc = roc_auc_score(y_true, y_pred_proba)
    brier = brier_score_loss(y_true, y_pred_proba)
    
    print(f"Log Loss: {logloss:.6f} | ROC-AUC: {roc_auc:.6f} | Brier Score: {brier:.6f}")
    return logloss, roc_auc, brier

## Prediction and Saving Function

In [205]:
def predict_and_save(pipeline, X_test, test_df_original, name="Vineet_Burugu"):
    """Generate predictions and save to CSV."""
    os.makedirs('./outputs', exist_ok=True)
    output_path = f'./outputs/{name}_predictions.csv'
    
    pr_CTA = pipeline.predict_proba(X_test)[:, 1]
    predictions_df = pd.DataFrame({
        'userId': test_df_original['userId'].values,
        'pr_CTA': pr_CTA
    })
    
    predictions_df.to_csv(output_path, index=False)
    print(f"Predictions saved to: {output_path}")
    return predictions_df

## Data Preparation

In [206]:
train_df, test_df = load_data()

feature_cols = [
    'ctaCopy', 'ctaPlacement', 'sessionReferrer', 'browser', 
    'deviceType', 'estimatedAnnualIncome', 'estimatedPropertyType', 
    'visitCount', 'pageURL', 'scrollDepth', 'editorialSnippet'
]

available_features = [col for col in feature_cols if col in train_df.columns]

X_train = train_df[available_features].copy()
y_train = train_df['clickedCTA'].copy()
X_test = test_df[available_features].copy()

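# Replace the raw snippet text with its character length as a simple numeric feature.
# Caveat: depending on the pandas version, astype(str) can turn a missing snippet
# into the string 'nan', which would then be counted as length 3.
if 'editorialSnippet' in X_train.columns:
    X_train['editorialSnippet'] = X_train['editorialSnippet'].astype(str).str.len()
    X_test['editorialSnippet'] = X_test['editorialSnippet'].astype(str).str.len()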

categorical_features = [f for f in ['ctaCopy', 'ctaPlacement', 'sessionReferrer', 'browser', 
                                    'deviceType', 'estimatedPropertyType', 'pageURL'] 
                        if f in available_features]

numeric_features = [f for f in ['estimatedAnnualIncome', 'visitCount', 'scrollDepth'] 
                    if f in available_features]

if 'editorialSnippet' in available_features:
    numeric_features.append('editorialSnippet')

Loaded train data: (100000, 18)
Loaded test data: (20000, 17)
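
The snippet-length feature above relies on `astype(str)`, which handles missing values differently across pandas versions. The sketch below, on an invented series, contrasts it with filling missing values first; whether `train_df` actually contains missing snippets is not shown in this notebook, so treat this as a check to run rather than a confirmed problem.

```python
import numpy as np
import pandas as pd

# Invented snippets; the second one is missing.
snippets = pd.Series(["Rates fell this week", np.nan, ""])

# On pandas 2.x and earlier, astype(str) turns NaN into the literal string
# "nan" (length 3); newer versions may keep the value missing instead.
naive_len = snippets.astype(str).str.len()

# Filling missing values first gives a length of 0 for a missing snippet
# regardless of the pandas version.
filled_len = snippets.fillna("").str.len()

print(naive_len.tolist())
print(filled_len.tolist())
```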


## Train/Validation Split

In [207]:
X_train_split, X_val_split, y_train_split, y_val_split = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42, stratify=y_train
)

## Model Training and Validation

In [208]:
pipeline = build_pipeline(categorical_features, numeric_features)
pipeline.fit(X_train_split, y_train_split)

y_val_pred_proba = pipeline.predict_proba(X_val_split)[:, 1]
evaluate(y_val_split, y_val_pred_proba)

Log Loss: 0.422296 | ROC-AUC: 0.700934 | Brier Score: 0.133458


(0.4222961053243016, 0.7009335232790259, 0.13345798393221112)
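
A log loss value is easier to interpret next to a trivial reference point. One common choice is to predict the training click rate for every validation row. The helper below is a hypothetical sketch (`constant_baseline_log_loss` is not part of the notebook) shown on invented labels.

```python
import numpy as np
from sklearn.metrics import log_loss

def constant_baseline_log_loss(y_train, y_val):
    """Log loss from predicting the training click rate for every validation row."""
    base_rate = float(np.mean(y_train))
    y_pred = np.full(len(y_val), base_rate)
    return log_loss(y_val, y_pred, labels=[0, 1])

# Invented labels for illustration; in the notebook these would be
# y_train_split and y_val_split.
y_tr = np.array([0, 0, 0, 1, 0, 1, 0, 0, 0, 0])
y_va = np.array([0, 1, 0, 0, 0])
baseline = constant_baseline_log_loss(y_tr, y_va)
print(f"Baseline log loss: {baseline:.4f}")
```

A model is only adding value to the extent that its validation log loss falls below this constant-rate figure.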

## Final Model Training and Prediction

In [209]:
pipeline.fit(X_train, y_train)
predictions_df = predict_and_save(pipeline, X_test, test_df, name="Vineet_Burugu")
print(f"Prediction range: [{predictions_df['pr_CTA'].min():.6f}, {predictions_df['pr_CTA'].max():.6f}]")

Predictions saved to: ./outputs/Vineet_Burugu_predictions.csv
Prediction range: [0.006222, 0.612568]


# 4. Iteration 1: Feature Engineering

## Model Comparison: Logistic Regression vs HistGradientBoosting

In [211]:
from sklearn.ensemble import HistGradientBoostingClassifier

def build_tree_pipeline(categorical_features, numeric_features):
    """Build pipeline for HistGradientBoostingClassifier (no scaling needed)."""
    categorical_transformer = Pipeline([
        ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
    ])
    
    numeric_transformer = Pipeline([
        ('imputer', SimpleImputer(strategy='median'))
    ])
    
    transformers = []
    if categorical_features:
        transformers.append(('cat', categorical_transformer, categorical_features))
    if numeric_features:
        transformers.append(('num', numeric_transformer, numeric_features))
    
    preprocessor = ColumnTransformer(
        transformers=transformers,
        remainder='drop'
    )
    
    pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', HistGradientBoostingClassifier(
            max_iter=200,
            early_stopping=True,
            validation_fraction=0.1,
            n_iter_no_change=10,
            random_state=42,
            class_weight=None
        ))
    ])
    
    return pipeline

## Training and Comparing Both Models

In [212]:
logistic_pipeline = build_pipeline(categorical_features, numeric_features)
tree_pipeline = build_tree_pipeline(categorical_features, numeric_features)

print("Training Logistic Regression...")
logistic_pipeline.fit(X_train_split, y_train_split)

print("Training HistGradientBoostingClassifier...")
tree_pipeline.fit(X_train_split, y_train_split)

print("\n" + "="*60)
print("VALIDATION METRICS COMPARISON")
print("="*60)

y_val_logistic = logistic_pipeline.predict_proba(X_val_split)[:, 1]
y_val_tree = tree_pipeline.predict_proba(X_val_split)[:, 1]

print("\nLogistic Regression:")
log_loss_lr, roc_auc_lr, brier_lr = evaluate(y_val_split, y_val_logistic)

print("\nHistGradientBoostingClassifier:")
log_loss_tree, roc_auc_tree, brier_tree = evaluate(y_val_split, y_val_tree)

print("\n" + "="*60)
if log_loss_lr < log_loss_tree:
    winner = "Logistic Regression"
    winner_pipeline = logistic_pipeline
    print(f"Winner: {winner} (Log Loss: {log_loss_lr:.6f})")
else:
    winner = "HistGradientBoostingClassifier"
    winner_pipeline = tree_pipeline
    print(f"Winner: {winner} (Log Loss: {log_loss_tree:.6f})")
print("="*60)

Training Logistic Regression...
Training HistGradientBoostingClassifier...

VALIDATION METRICS COMPARISON

Logistic Regression:
Log Loss: 0.422296 | ROC-AUC: 0.700934 | Brier Score: 0.133458

HistGradientBoostingClassifier:
Log Loss: 0.385112 | ROC-AUC: 0.761368 | Brier Score: 0.125093

Winner: HistGradientBoostingClassifier (Log Loss: 0.385112)


# 5. Iteration 2: Model Improvement