# 1. Business Understanding

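1. What relevant key metrics are provided to evaluate the CTA combinations? And which CTA Copy and CTA Placement did best/worst based on the key metrics?

The primary metric is click-through rate (CTR), computed as the mean of `clickedCTA` for each combination: it measures directly how often a user who saw a given copy and placement clicked the CTA. The secondary metrics are `submittedForm`, `scheduledAppointment`, and `revenue`, which describe what happens after the click and so indicate whether the clicks a combination attracts are valuable. In the results below, "Get Pre-Approved for a Mortgage in 5 Minutes" at the Top placement is best on CTR (0.2118), form submissions (0.1909), and scheduled appointments (0.0603). "Access Your Personalized Mortgage Rates Now" at the Bottom placement is worst on CTR (0.1348) and form submissions (0.1170), and the same copy at the Middle placement is worst on scheduled appointments (0.0507). Revenue ranks differently: "First Time? We've Made it Easy to Find the Best Mortgage Rate" at Middle has the highest mean revenue ($226.95), while "Get Pre-Approved for a Mortgage in 5 Minutes" at Middle has the lowest ($203.10).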

In [85]:
## Loading Data

In [86]:
import pandas as pd
import numpy as np

train_df = pd.read_csv('train.csv')

## Computing Metrics

Averaging each metric within every CTA copy and placement combination (for a 0/1 column such as `clickedCTA`, the mean is the rate):

In [87]:
metrics = train_df.groupby(['ctaCopy', 'ctaPlacement']).agg({
    'clickedCTA': 'mean',
    'submittedForm': 'mean',
    'scheduledAppointment': 'mean',
    'revenue': 'mean'
}).reset_index()

## Displaying Results

In [88]:
print("Metrics for each CTA combination:\n")
print(metrics[['ctaCopy', 'ctaPlacement', 'clickedCTA', 'submittedForm', 'scheduledAppointment', 'revenue']].to_string(index=False))

Metrics for each CTA combination:

                                                      ctaCopy ctaPlacement  clickedCTA  submittedForm  scheduledAppointment    revenue
                  Access Your Personalized Mortgage Rates Now       Bottom    0.134821       0.117001              0.051751 218.982609
                  Access Your Personalized Mortgage Rates Now       Middle    0.161462       0.126901              0.050671 225.461812
                  Access Your Personalized Mortgage Rates Now          Top    0.186482       0.150752              0.054631 221.869852
First Time? We've Made it Easy to Find the Best Mortgage Rate       Bottom    0.153092       0.135631              0.056881 226.882911
First Time? We've Made it Easy to Find the Best Mortgage Rate       Middle    0.169922       0.135811              0.053191 226.945854
First Time? We've Made it Easy to Find the Best Mortgage Rate          Top    0.198452       0.159032              0.054541 225.280528
                 Get

## Best Performing Combinations

In [89]:
best_clicked_idx = metrics['clickedCTA'].idxmax()
best_submitted_idx = metrics['submittedForm'].idxmax()
best_appointment_idx = metrics['scheduledAppointment'].idxmax()
best_revenue_idx = metrics['revenue'].idxmax()

best_clicked = metrics.loc[best_clicked_idx]
best_submitted = metrics.loc[best_submitted_idx]
best_appointment = metrics.loc[best_appointment_idx]
best_revenue = metrics.loc[best_revenue_idx]

print(f"\nHighest clickedCTA: {best_clicked['ctaCopy']} - {best_clicked['ctaPlacement']}")
print(f"clickedCTA: {best_clicked['clickedCTA']:.4f}")

print(f"\nHighest submittedForm: {best_submitted['ctaCopy']} - {best_submitted['ctaPlacement']}")
print(f"submittedForm: {best_submitted['submittedForm']:.4f}")

print(f"\nHighest scheduledAppointment: {best_appointment['ctaCopy']} - {best_appointment['ctaPlacement']}")
print(f"scheduledAppointment: {best_appointment['scheduledAppointment']:.4f}")

print(f"\nHighest Revenue: {best_revenue['ctaCopy']} - {best_revenue['ctaPlacement']}")
print(f"Revenue: ${best_revenue['revenue']:.2f}")


Highest clickedCTA: Get Pre-Approved for a Mortgage in 5 Minutes - Top
clickedCTA: 0.2118

Highest submittedForm: Get Pre-Approved for a Mortgage in 5 Minutes - Top
submittedForm: 0.1909

Highest scheduledAppointment: Get Pre-Approved for a Mortgage in 5 Minutes - Top
scheduledAppointment: 0.0603

Highest Revenue: First Time? We've Made it Easy to Find the Best Mortgage Rate - Middle
Revenue: $226.95


## Worst Performing Combinations

In [90]:
worst_clicked_idx = metrics['clickedCTA'].idxmin()
worst_submitted_idx = metrics['submittedForm'].idxmin()
worst_appointment_idx = metrics['scheduledAppointment'].idxmin()
worst_revenue_idx = metrics['revenue'].idxmin()

worst_clicked = metrics.loc[worst_clicked_idx]
worst_submitted = metrics.loc[worst_submitted_idx]
worst_appointment = metrics.loc[worst_appointment_idx]
worst_revenue = metrics.loc[worst_revenue_idx]

print(f"\nLowest clickedCTA: {worst_clicked['ctaCopy']} - {worst_clicked['ctaPlacement']}")
print(f"clickedCTA: {worst_clicked['clickedCTA']:.4f}")

print(f"\nLowest submittedForm: {worst_submitted['ctaCopy']} - {worst_submitted['ctaPlacement']}")
print(f"submittedForm: {worst_submitted['submittedForm']:.4f}")

print(f"\nLowest scheduledAppointment: {worst_appointment['ctaCopy']} - {worst_appointment['ctaPlacement']}")
print(f"scheduledAppointment: {worst_appointment['scheduledAppointment']:.4f}")

print(f"\nLowest Revenue: {worst_revenue['ctaCopy']} - {worst_revenue['ctaPlacement']}")
print(f"Revenue: ${worst_revenue['revenue']:.2f}")


Lowest clickedCTA: Access Your Personalized Mortgage Rates Now - Bottom
clickedCTA: 0.1348

Lowest submittedForm: Access Your Personalized Mortgage Rates Now - Bottom
submittedForm: 0.1170

Lowest scheduledAppointment: Access Your Personalized Mortgage Rates Now - Middle
scheduledAppointment: 0.0507

Lowest Revenue: Get Pre-Approved for a Mortgage in 5 Minutes - Middle
Revenue: $203.10


2. Which groups of people tend to be more correlated or less correlated with our key metrics?
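
One way to approach this question is to compare the click rate within each group against the overall click rate. The sketch below is illustrative only: `toy_df` is invented data standing in for `train.csv`, and `rate_by_group` is a hypothetical helper, not part of the notebook. It assumes the grouping column (here `deviceType`) is categorical and that `clickedCTA` is coded 0/1.

```python
import pandas as pd

# Toy stand-in for train.csv; these rows are invented for illustration only.
toy_df = pd.DataFrame({
    "deviceType": ["Mobile", "Mobile", "Desktop", "Desktop", "Desktop", "Tablet"],
    "browser": ["Chrome", "Safari", "Chrome", "Chrome", "Firefox", "Safari"],
    "clickedCTA": [1, 0, 1, 1, 0, 0],
})

def rate_by_group(df, group_col, target_col="clickedCTA"):
    """Hypothetical helper: per-group target rate, group size, and lift vs. overall."""
    overall = df[target_col].mean()
    summary = (
        df.groupby(group_col)[target_col]
        .agg(rate="mean", n="size")
        .reset_index()
    )
    # Lift > 1 means the group clicks more often than the average user.
    summary["lift"] = summary["rate"] / overall
    return summary.sort_values("rate", ascending=False).reset_index(drop=True)

device_summary = rate_by_group(toy_df, "deviceType")
print(device_summary)
```

Reporting the group size `n` next to each rate helps avoid over-reading rates that come from only a handful of rows.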

3. What ways can you manipulate the columns/dataset to create features that increase predictive power towards our key metric?
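
A few candidate transformations can be sketched as follows. This is a minimal example on invented rows, not the notebook's actual feature set: `add_features` and the new column names (`copy_x_placement`, `log_income`, `snippet_len`) are hypothetical, and whether any of them improves predictive power would need to be checked on the validation split.

```python
import numpy as np
import pandas as pd

# Invented rows using column names from the notebook's feature list.
toy_df = pd.DataFrame({
    "ctaCopy": ["Copy A", "Copy B", "Copy A"],
    "ctaPlacement": ["Top", "Bottom", "Middle"],
    "estimatedAnnualIncome": [50000.0, np.nan, 120000.0],
    "editorialSnippet": ["Rates fell this week.", None, "Tips for first-time buyers"],
})

def add_features(df):
    """Hypothetical helper: return a copy of df with three engineered columns."""
    out = df.copy()
    # Interaction term: lets a linear model fit a separate effect per copy/placement pair.
    out["copy_x_placement"] = out["ctaCopy"] + " | " + out["ctaPlacement"]
    # log1p compresses a right-skewed income scale; NaN stays NaN for later imputation.
    out["log_income"] = np.log1p(out["estimatedAnnualIncome"])
    # Snippet length in characters, counting a missing snippet as length 0.
    out["snippet_len"] = out["editorialSnippet"].fillna("").str.len()
    return out

features = add_features(toy_df)
print(features[["copy_x_placement", "log_income", "snippet_len"]])
```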

4. Besides Log Loss, what other metrics will you use to evaluate the model's performance, and why?
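
The `evaluate` function later in this notebook reports ROC-AUC and Brier score alongside Log Loss. ROC-AUC measures how well the predictions rank clickers above non-clickers, while Brier score, like Log Loss, rewards well-calibrated probabilities. The sketch below uses invented labels and probabilities to show the three metrics side by side, plus a comparison that is an addition here rather than something the notebook does: the Log Loss of always predicting the base rate, which gives a floor any useful model should beat.

```python
import numpy as np
from sklearn.metrics import log_loss, roc_auc_score, brier_score_loss

# Invented labels and predicted probabilities, for illustration only.
y_true = np.array([0, 0, 0, 1, 0, 1, 0, 0, 1, 0])
y_prob = np.array([0.1, 0.2, 0.1, 0.7, 0.3, 0.6, 0.2, 0.1, 0.4, 0.3])

# Constant baseline: predict the observed click rate for every row.
base_rate = y_true.mean()
baseline_prob = np.full(len(y_true), base_rate)

scores = {
    "log_loss": log_loss(y_true, y_prob),
    "baseline_log_loss": log_loss(y_true, baseline_prob),
    "roc_auc": roc_auc_score(y_true, y_prob),    # ranking quality
    "brier": brier_score_loss(y_true, y_prob),   # mean squared error of the probabilities
}

for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```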

# 2. Exploratory Data Analysis

# 3. Baseline Model

In [91]:
## Imports

In [92]:
import pandas as pd
import numpy as np
import os
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, roc_auc_score, brier_score_loss
import warnings
warnings.filterwarnings('ignore')

## Data Loading Function

In [None]:
def load_data():
    """Load train and test data from current directory."""
    train_path = 'train.csv'
    test_path = 'test.csv'
    
    if not os.path.exists(train_path):
        raise FileNotFoundError(f"Training file '{train_path}' not found")
    if not os.path.exists(test_path):
        raise FileNotFoundError(f"Test file '{test_path}' not found")
    
    train_df = pd.read_csv(train_path)
    test_df = pd.read_csv(test_path)
    
    print(f"Loaded train data: {train_df.shape}")
    print(f"Loaded test data: {test_df.shape}")
    
    return train_df, test_df

## Pipeline Building Function

In [94]:
def build_pipeline(categorical_features, numeric_features):
    """Build the preprocessing and modeling pipeline."""
    
    categorical_transformer = Pipeline([
        ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
    ])
    
    numeric_transformer = Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ])
    
    transformers = []
    if categorical_features:
        transformers.append(('cat', categorical_transformer, categorical_features))
    if numeric_features:
        transformers.append(('num', numeric_transformer, numeric_features))
    
    preprocessor = ColumnTransformer(
        transformers=transformers,
        remainder='drop'
    )
    
    pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', LogisticRegression(max_iter=2000, n_jobs=-1, class_weight=None))
    ])
    
    return pipeline

## Evaluation Function

In [95]:
def evaluate(y_true, y_pred_proba):
    """Calculate and print evaluation metrics."""
    logloss = log_loss(y_true, y_pred_proba)
    roc_auc = roc_auc_score(y_true, y_pred_proba)
    brier = brier_score_loss(y_true, y_pred_proba)
    
    print("\nValidation Metrics:")
    print(f"  Log Loss: {logloss:.6f}")
    print(f"  ROC-AUC: {roc_auc:.6f}")
    print(f"  Brier Score: {brier:.6f}")
    
    return logloss, roc_auc, brier

## Prediction and Saving Function

In [96]:
def predict_and_save(pipeline, X_test, test_df_original, name="Vineet_Burugu"):
    """Generate predictions and save to CSV."""
    outputs_dir = './outputs'
    os.makedirs(outputs_dir, exist_ok=True)
    
    output_path = os.path.join(outputs_dir, f'{name}_predictions.csv')
    
    pr_CTA = pipeline.predict_proba(X_test)[:, 1]
    
    predictions_df = pd.DataFrame({
        'userId': test_df_original['userId'].values,
        'pr_CTA': pr_CTA
    })
    
    predictions_df.to_csv(output_path, index=False)
    print(f"\nPredictions saved to: {output_path}")
    
    return predictions_df

## Data Preparation

In [97]:
train_df, test_df = load_data()

target_col = 'clickedCTA'
if target_col not in train_df.columns:
    raise ValueError(f"Target column '{target_col}' not found in training data")

feature_cols = [
    'ctaCopy', 'ctaPlacement', 'sessionReferrer', 'browser', 
    'deviceType', 'estimatedAnnualIncome', 'estimatedPropertyType', 
    'visitCount', 'pageURL', 'scrollDepth', 'editorialSnippet'
]

available_features = [col for col in feature_cols if col in train_df.columns]
missing_features = [col for col in feature_cols if col not in train_df.columns]

if missing_features:
    print(f"Warning: Missing features: {missing_features}")

X_train = train_df[available_features].copy()
y_train = train_df[target_col].copy()
X_test = test_df[available_features].copy()

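# Use snippet length as a numeric feature. Missing snippets count as length 0;
# astype(str) alone would turn NaN into the 3-character string 'nan'.
if 'editorialSnippet' in X_train.columns:
    X_train['editorialSnippet'] = X_train['editorialSnippet'].fillna('').astype(str).str.len()
if 'editorialSnippet' in X_test.columns:
    X_test['editorialSnippet'] = X_test['editorialSnippet'].fillna('').astype(str).str.len()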

categorical_features = [
    'ctaCopy', 'ctaPlacement', 'sessionReferrer', 'browser', 
    'deviceType', 'estimatedPropertyType', 'pageURL'
]
categorical_features = [f for f in categorical_features if f in available_features]

numeric_features = [
    'estimatedAnnualIncome', 'visitCount', 'scrollDepth'
]
numeric_features = [f for f in numeric_features if f in available_features]

if 'editorialSnippet' in available_features:
    numeric_features.append('editorialSnippet')

FileNotFoundError: Data directory './data' not found. Please create it and add train.csv and test.csv

## Train/Validation Split

In [None]:
X_train_split, X_val_split, y_train_split, y_val_split = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42, stratify=y_train
)

## Model Training and Validation

In [None]:
pipeline = build_pipeline(categorical_features, numeric_features)

print("Training model on training split...")
pipeline.fit(X_train_split, y_train_split)

print("Evaluating on validation split...")
y_val_pred_proba = pipeline.predict_proba(X_val_split)[:, 1]
evaluate(y_val_split, y_val_pred_proba)

## Final Model Training and Prediction

In [None]:
print("Retraining on full training data...")
pipeline.fit(X_train, y_train)

predictions_df = predict_and_save(pipeline, X_test, test_df, name="Vineet_Burugu")

print(f"\nGenerated {len(predictions_df)} predictions")
print(f"Prediction range: [{predictions_df['pr_CTA'].min():.6f}, {predictions_df['pr_CTA'].max():.6f}]")

# 4. Iteration 1: Feature Engineering

# 5. Iteration 2: Model Improvement

# 6. Final Model Selection

# 7. Test Predictions