# 1. Business Understanding

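1. What relevant key metrics are provided to evaluate the CTA combinations? And which CTA Copy and CTA Placement did best/worst based on the key metrics? - The primary metric is click-through rate (CTR), computed here as the mean of `clickedCTA` for each combination: a higher CTR means a larger share of the users shown that CTA clicked it. The other key metrics are `submittedForm`, `scheduledAppointment`, and `revenue`, which measure what happens after the click and therefore show whether a combination drives valuable actions rather than clicks alone. In the results below, "Get Pre-Approved for a Mortgage in 5 Minutes" at the Top placement has the highest click, form-submission, and appointment rates, while "Access Your Personalized Mortgage Rates Now" at the Bottom placement has the lowest click and form-submission rates. Mean revenue does not follow the same ranking: it is highest for "First Time? We've Made it Easy to Find the Best Mortgage Rate" at Middle and lowest for "Get Pre-Approved for a Mortgage in 5 Minutes" at Middle.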

In [194]:
## Loading Data

In [195]:
import pandas as pd
import numpy as np

train_df = pd.read_csv('train.csv')

## Computing Metrics

In [196]:
metrics = train_df.groupby(['ctaCopy', 'ctaPlacement']).agg({
    'clickedCTA': 'mean',
    'submittedForm': 'mean',
    'scheduledAppointment': 'mean',
    'revenue': 'mean'
}).reset_index()

## Displaying Results

In [197]:
print(metrics[['ctaCopy', 'ctaPlacement', 'clickedCTA', 'submittedForm', 'scheduledAppointment', 'revenue']].to_string(index=False))

                                                      ctaCopy ctaPlacement  clickedCTA  submittedForm  scheduledAppointment    revenue
                  Access Your Personalized Mortgage Rates Now       Bottom    0.134821       0.117001              0.051751 218.982609
                  Access Your Personalized Mortgage Rates Now       Middle    0.161462       0.126901              0.050671 225.461812
                  Access Your Personalized Mortgage Rates Now          Top    0.186482       0.150752              0.054631 221.869852
First Time? We've Made it Easy to Find the Best Mortgage Rate       Bottom    0.153092       0.135631              0.056881 226.882911
First Time? We've Made it Easy to Find the Best Mortgage Rate       Middle    0.169922       0.135811              0.053191 226.945854
First Time? We've Made it Easy to Find the Best Mortgage Rate          Top    0.198452       0.159032              0.054541 225.280528
                 Get Pre-Approved for a Mortgage in 5 M

## Best Performing Combinations

In [198]:
best_clicked = metrics.loc[metrics['clickedCTA'].idxmax()]
best_submitted = metrics.loc[metrics['submittedForm'].idxmax()]
best_appointment = metrics.loc[metrics['scheduledAppointment'].idxmax()]
best_revenue = metrics.loc[metrics['revenue'].idxmax()]

print(f"Highest clickedCTA: {best_clicked['ctaCopy']} - {best_clicked['ctaPlacement']} ({best_clicked['clickedCTA']:.4f})")
print(f"Highest submittedForm: {best_submitted['ctaCopy']} - {best_submitted['ctaPlacement']} ({best_submitted['submittedForm']:.4f})")
print(f"Highest scheduledAppointment: {best_appointment['ctaCopy']} - {best_appointment['ctaPlacement']} ({best_appointment['scheduledAppointment']:.4f})")
print(f"Highest Revenue: {best_revenue['ctaCopy']} - {best_revenue['ctaPlacement']} (${best_revenue['revenue']:.2f})")

Highest clickedCTA: Get Pre-Approved for a Mortgage in 5 Minutes - Top (0.2118)
Highest submittedForm: Get Pre-Approved for a Mortgage in 5 Minutes - Top (0.1909)
Highest scheduledAppointment: Get Pre-Approved for a Mortgage in 5 Minutes - Top (0.0603)
Highest Revenue: First Time? We've Made it Easy to Find the Best Mortgage Rate - Middle ($226.95)


## Worst Performing Combinations

In [199]:
worst_clicked = metrics.loc[metrics['clickedCTA'].idxmin()]
worst_submitted = metrics.loc[metrics['submittedForm'].idxmin()]
worst_appointment = metrics.loc[metrics['scheduledAppointment'].idxmin()]
worst_revenue = metrics.loc[metrics['revenue'].idxmin()]

print(f"Lowest clickedCTA: {worst_clicked['ctaCopy']} - {worst_clicked['ctaPlacement']} ({worst_clicked['clickedCTA']:.4f})")
print(f"Lowest submittedForm: {worst_submitted['ctaCopy']} - {worst_submitted['ctaPlacement']} ({worst_submitted['submittedForm']:.4f})")
print(f"Lowest scheduledAppointment: {worst_appointment['ctaCopy']} - {worst_appointment['ctaPlacement']} ({worst_appointment['scheduledAppointment']:.4f})")
print(f"Lowest Revenue: {worst_revenue['ctaCopy']} - {worst_revenue['ctaPlacement']} (${worst_revenue['revenue']:.2f})")

Lowest clickedCTA: Access Your Personalized Mortgage Rates Now - Bottom (0.1348)
Lowest submittedForm: Access Your Personalized Mortgage Rates Now - Bottom (0.1170)
Lowest scheduledAppointment: Access Your Personalized Mortgage Rates Now - Middle (0.0507)
Lowest Revenue: Get Pre-Approved for a Mortgage in 5 Minutes - Middle ($203.10)
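
The rankings above compare point estimates only. The following is a minimal sketch, assuming a hypothetical per-group sample size (the group counts are not shown in this notebook), of how a Wilson score interval could indicate whether the best and worst click-through rates are distinguishable. `wilson_interval` and `n_per_group` are illustrative names, not part of the original analysis.

```python
import math

def wilson_interval(p_hat, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# CTRs are taken from the printed results above.
# n_per_group is a HYPOTHETICAL sample size; replace it with the real group counts.
n_per_group = 10_000
best_ctr, worst_ctr = 0.2118, 0.1348

best_lo, best_hi = wilson_interval(best_ctr, n_per_group)
worst_lo, worst_hi = wilson_interval(worst_ctr, n_per_group)

# Relative lift of the best combination over the worst one.
relative_lift = (best_ctr - worst_ctr) / worst_ctr

print(f"Best CTR interval:  [{best_lo:.4f}, {best_hi:.4f}]")
print(f"Worst CTR interval: [{worst_lo:.4f}, {worst_hi:.4f}]")
print(f"Relative lift of best over worst: {relative_lift:.1%}")
print("Intervals overlap:", best_lo <= worst_hi)
```

In the notebook itself, the real counts could be obtained by adding a `'size'` aggregation to the `groupby` call, so the intervals reflect the actual number of impressions per combination.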


2. Which groups of people tend to be more correlated or less correlated with our key metrics?

3. What ways can you manipulate the columns/dataset to create features that increase predictive power towards our key metric?

4. Besides Log Loss, what other metrics will you use to evaluate the model's performance, and why?

# 2. Exploratory Data Analysis

# 3. Baseline Model

In [200]:
## Imports

In [201]:
import pandas as pd
import numpy as np
import os
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, roc_auc_score, brier_score_loss
import warnings
warnings.filterwarnings('ignore')

## Data Loading Function

In [202]:
def load_data():
    """Load train and test data from current directory."""
    train_path = 'train.csv'
    test_path = 'test.csv'
    
    if not os.path.exists(train_path):
        raise FileNotFoundError(f"Training file '{train_path}' not found")
    if not os.path.exists(test_path):
        raise FileNotFoundError(f"Test file '{test_path}' not found")
    
    train_df = pd.read_csv(train_path)
    test_df = pd.read_csv(test_path)
    
    print(f"Loaded train data: {train_df.shape}")
    print(f"Loaded test data: {test_df.shape}")
    
    return train_df, test_df

## Pipeline Building Function

In [203]:
def build_pipeline(categorical_features, numeric_features):
    """Build the preprocessing and modeling pipeline."""
    
    categorical_transformer = Pipeline([
        ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
    ])
    
    numeric_transformer = Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ])
    
    transformers = []
    if categorical_features:
        transformers.append(('cat', categorical_transformer, categorical_features))
    if numeric_features:
        transformers.append(('num', numeric_transformer, numeric_features))
    
    preprocessor = ColumnTransformer(
        transformers=transformers,
        remainder='drop'
    )
    
    pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', LogisticRegression(max_iter=2000, n_jobs=-1, class_weight=None))
    ])
    
    return pipeline

## Evaluation Function

In [204]:
def evaluate(y_true, y_pred_proba):
    """Calculate and print evaluation metrics."""
    logloss = log_loss(y_true, y_pred_proba)
    roc_auc = roc_auc_score(y_true, y_pred_proba)
    brier = brier_score_loss(y_true, y_pred_proba)
    
    print(f"Log Loss: {logloss:.6f} | ROC-AUC: {roc_auc:.6f} | Brier Score: {brier:.6f}")
    return logloss, roc_auc, brier
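
To make the two probability-based metrics in `evaluate` concrete, here is a small pure-Python sketch of what log loss and the Brier score compute. The helper names `manual_log_loss` and `manual_brier` and the toy values are illustrative assumptions; the notebook itself relies on the scikit-learn functions.

```python
import math

def manual_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def manual_brier(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

# Toy labels and predicted probabilities (illustrative, not from the dataset).
y_toy = [1, 0, 0, 1]
p_toy = [0.8, 0.2, 0.4, 0.6]

toy_log_loss = manual_log_loss(y_toy, p_toy)
toy_brier = manual_brier(y_toy, p_toy)

print(f"Log loss: {toy_log_loss:.4f}")
print(f"Brier:    {toy_brier:.4f}")
```

Both metrics reward well-calibrated probabilities, but log loss penalizes confident mistakes far more heavily than the Brier score, which is bounded between 0 and 1.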

## Prediction and Saving Function

In [205]:
def predict_and_save(pipeline, X_test, test_df_original, name="Vineet_Burugu"):
    """Generate predictions and save to CSV."""
    os.makedirs('./outputs', exist_ok=True)
    output_path = f'./outputs/{name}_predictions.csv'
    
    pr_CTA = pipeline.predict_proba(X_test)[:, 1]
    predictions_df = pd.DataFrame({
        'userId': test_df_original['userId'].values,
        'pr_CTA': pr_CTA
    })
    
    predictions_df.to_csv(output_path, index=False)
    print(f"Predictions saved to: {output_path}")
    return predictions_df

## Data Preparation

In [206]:
train_df, test_df = load_data()

feature_cols = [
    'ctaCopy', 'ctaPlacement', 'sessionReferrer', 'browser', 
    'deviceType', 'estimatedAnnualIncome', 'estimatedPropertyType', 
    'visitCount', 'pageURL', 'scrollDepth', 'editorialSnippet'
]

available_features = [col for col in feature_cols if col in train_df.columns]

X_train = train_df[available_features].copy()
y_train = train_df['clickedCTA'].copy()
X_test = test_df[available_features].copy()

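# Replace the raw snippet text with its character length so it can be used as a numeric feature.
# Note: astype(str) turns missing values into the string 'nan', so they receive a length of 3.
if 'editorialSnippet' in X_train.columns:
    X_train['editorialSnippet'] = X_train['editorialSnippet'].astype(str).str.len()
    X_test['editorialSnippet'] = X_test['editorialSnippet'].astype(str).str.len()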

categorical_features = [f for f in ['ctaCopy', 'ctaPlacement', 'sessionReferrer', 'browser', 
                                    'deviceType', 'estimatedPropertyType', 'pageURL'] 
                        if f in available_features]

numeric_features = [f for f in ['estimatedAnnualIncome', 'visitCount', 'scrollDepth'] 
                    if f in available_features]

if 'editorialSnippet' in available_features:
    numeric_features.append('editorialSnippet')

Loaded train data: (100000, 18)
Loaded test data: (20000, 17)


## Train/Validation Split

In [207]:
X_train_split, X_val_split, y_train_split, y_val_split = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42, stratify=y_train
)
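
Passing `stratify=y_train` keeps the share of clicks the same in the training and validation parts, which matters when the positive class is the minority. The sketch below illustrates the idea in pure Python; it is not scikit-learn's implementation, and `stratified_split` and the toy labels are assumptions made for the example.

```python
import random

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, val_idx) with every class sampled at the same rate."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * test_size)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Toy labels with a 20% positive rate (illustrative, not the real data).
labels = [1] * 200 + [0] * 800
train_idx, val_idx = stratified_split(labels)

def positive_rate(idx):
    """Share of positive labels among the given row indices."""
    return sum(labels[i] for i in idx) / len(idx)

print(f"Train positive rate: {positive_rate(train_idx):.3f}")
print(f"Val positive rate:   {positive_rate(val_idx):.3f}")
```

Because each class is split separately, both parts end up with the same positive rate as the full data, so validation metrics are not distorted by an unlucky draw.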

## Model Training and Validation

In [208]:
pipeline = build_pipeline(categorical_features, numeric_features)
pipeline.fit(X_train_split, y_train_split)

y_val_pred_proba = pipeline.predict_proba(X_val_split)[:, 1]
evaluate(y_val_split, y_val_pred_proba)

Log Loss: 0.422296 | ROC-AUC: 0.700934 | Brier Score: 0.133458


(0.4222961053243016, 0.7009335232790259, 0.13345798393221112)
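
A useful reference point for the validation log loss is a constant predictor that always outputs the positive base rate; its log loss equals the binary entropy of that rate. The sketch below assumes a hypothetical base rate of 0.17, chosen only because the group CTRs reported earlier range from roughly 0.13 to 0.21. In the notebook, the actual value would come from `y_val_split.mean()`. The function name is illustrative.

```python
import math

def constant_predictor_log_loss(base_rate):
    """Log loss of always predicting `base_rate` when that is the true positive rate."""
    return -(base_rate * math.log(base_rate)
             + (1 - base_rate) * math.log(1 - base_rate))

base_rate = 0.17           # HYPOTHETICAL; use y_val_split.mean() for the real value
model_log_loss = 0.422296  # validation log loss reported above

baseline = constant_predictor_log_loss(base_rate)

print(f"Constant-predictor log loss: {baseline:.4f}")
print(f"Model log loss:              {model_log_loss:.4f}")
print("Model beats baseline:", model_log_loss < baseline)
```

A model whose log loss is not clearly below this baseline has learned little beyond the overall click rate, so the gap between the two numbers is more informative than the raw log loss alone.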

## Final Model Training and Prediction

In [209]:
pipeline.fit(X_train, y_train)
predictions_df = predict_and_save(pipeline, X_test, test_df, name="Vineet_Burugu")
print(f"Prediction range: [{predictions_df['pr_CTA'].min():.6f}, {predictions_df['pr_CTA'].max():.6f}]")

Predictions saved to: ./outputs/Vineet_Burugu_predictions.csv
Prediction range: [0.006222, 0.612568]


# 4. Iteration 1: Feature Engineering

In [210]:
## Model Comparison: Logistic Regression vs HistGradientBoosting

In [211]:
from sklearn.ensemble import HistGradientBoostingClassifier

def build_tree_pipeline(categorical_features, numeric_features):
    """Build pipeline for HistGradientBoostingClassifier (no scaling needed)."""
    categorical_transformer = Pipeline([
        ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
    ])
    
    numeric_transformer = Pipeline([
        ('imputer', SimpleImputer(strategy='median'))
    ])
    
    transformers = []
    if categorical_features:
        transformers.append(('cat', categorical_transformer, categorical_features))
    if numeric_features:
        transformers.append(('num', numeric_transformer, numeric_features))
    
    preprocessor = ColumnTransformer(
        transformers=transformers,
        remainder='drop'
    )
    
    pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', HistGradientBoostingClassifier(
            max_iter=200,
            early_stopping=True,
            validation_fraction=0.1,
            n_iter_no_change=10,
            random_state=42,
            class_weight=None
        ))
    ])
    
    return pipeline

## Training and Comparing Both Models

In [212]:
logistic_pipeline = build_pipeline(categorical_features, numeric_features)
tree_pipeline = build_tree_pipeline(categorical_features, numeric_features)

print("Training Logistic Regression...")
logistic_pipeline.fit(X_train_split, y_train_split)

print("Training HistGradientBoostingClassifier...")
tree_pipeline.fit(X_train_split, y_train_split)

print("\n" + "="*60)
print("VALIDATION METRICS COMPARISON")
print("="*60)

y_val_logistic = logistic_pipeline.predict_proba(X_val_split)[:, 1]
y_val_tree = tree_pipeline.predict_proba(X_val_split)[:, 1]

print("\nLogistic Regression:")
log_loss_lr, roc_auc_lr, brier_lr = evaluate(y_val_split, y_val_logistic)

print("\nHistGradientBoostingClassifier:")
log_loss_tree, roc_auc_tree, brier_tree = evaluate(y_val_split, y_val_tree)

print("\n" + "="*60)
if log_loss_lr < log_loss_tree:
    winner = "Logistic Regression"
    winner_pipeline = logistic_pipeline
    print(f"Winner: {winner} (Log Loss: {log_loss_lr:.6f})")
else:
    winner = "HistGradientBoostingClassifier"
    winner_pipeline = tree_pipeline
    print(f"Winner: {winner} (Log Loss: {log_loss_tree:.6f})")
print("="*60)

Training Logistic Regression...
Training HistGradientBoostingClassifier...

VALIDATION METRICS COMPARISON

Logistic Regression:
Log Loss: 0.422296 | ROC-AUC: 0.700934 | Brier Score: 0.133458

HistGradientBoostingClassifier:
Log Loss: 0.385112 | ROC-AUC: 0.761368 | Brier Score: 0.125093

Winner: HistGradientBoostingClassifier (Log Loss: 0.385112)


# 5. Iteration 2: Model Improvement

# 7. Select a Machine Learning Algorithm

Choose the most suitable algorithm based on the problem type, data characteristics, and performance requirements.

We'll compare multiple algorithms:
- **Linear Models**: Logistic Regression (fast, interpretable)
- **Tree-based**: Random Forest, HistGradientBoostingClassifier (good for non-linear patterns)
- **Distance-based**: KNN (simple, can capture local patterns)
- **Neural Networks**: MLPClassifier (can capture complex interactions)

Considerations:
- **Speed**: Linear models are fastest, neural networks slowest
- **Accuracy**: Tree-based models often perform well on tabular data
- **Interpretability**: Linear models are most interpretable
- **Scalability**: Tree-based models scale well to large datasets
- **Dataset size**: With 100K samples, we can use more complex models

In [None]:
## Import Additional Models
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
import time

print("Imported additional models for comparison")

In [None]:
## Build Pipeline Functions for Different Models

def build_rf_pipeline(categorical_features, numeric_features):
    """Random Forest pipeline."""
    categorical_transformer = Pipeline([
        ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
    ])
    
    numeric_transformer = Pipeline([
        ('imputer', SimpleImputer(strategy='median'))
    ])
    
    transformers = []
    if categorical_features:
        transformers.append(('cat', categorical_transformer, categorical_features))
    if numeric_features:
        transformers.append(('num', numeric_transformer, numeric_features))
    
    preprocessor = ColumnTransformer(transformers=transformers, remainder='drop')
    
    pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1))
    ])
    return pipeline

def build_knn_pipeline(categorical_features, numeric_features):
    """KNN pipeline (requires scaling)."""
    categorical_transformer = Pipeline([
        ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
    ])
    
    numeric_transformer = Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ])
    
    transformers = []
    if categorical_features:
        transformers.append(('cat', categorical_transformer, categorical_features))
    if numeric_features:
        transformers.append(('num', numeric_transformer, numeric_features))
    
    preprocessor = ColumnTransformer(transformers=transformers, remainder='drop')
    
    pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', KNeighborsClassifier(n_neighbors=5, n_jobs=-1))
    ])
    return pipeline

def build_mlp_pipeline(categorical_features, numeric_features):
    """Neural Network pipeline."""
    categorical_transformer = Pipeline([
        ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
    ])
    
    numeric_transformer = Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ])
    
    transformers = []
    if categorical_features:
        transformers.append(('cat', categorical_transformer, categorical_features))
    if numeric_features:
        transformers.append(('num', numeric_transformer, numeric_features))
    
    preprocessor = ColumnTransformer(transformers=transformers, remainder='drop')
    
    pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=500, 
                                     random_state=42, early_stopping=True, validation_fraction=0.1))
    ])
    return pipeline

print("Pipeline functions created for all models")

# 8. Train the Model

Train all selected algorithms on the prepared training data with engineered features. Monitor training progress and detect issues early.

In [None]:
## Train All Models with Engineered Features

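# Use the engineered features from section 5; create_engineered_features,
# categorical_features_eng, and numeric_features_eng must already be defined there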
X_train_eng = create_engineered_features(X_train)
X_test_eng = create_engineered_features(X_test)
X_train_split_eng = create_engineered_features(X_train_split)
X_val_split_eng = create_engineered_features(X_val_split)

# Build all pipelines
models = {
    'Logistic Regression': build_pipeline(categorical_features_eng, numeric_features_eng),
    'HistGradientBoosting': build_tree_pipeline(categorical_features_eng, numeric_features_eng),
    'Random Forest': build_rf_pipeline(categorical_features_eng, numeric_features_eng),
    'KNN': build_knn_pipeline(categorical_features_eng, numeric_features_eng),
    'Neural Network (MLP)': build_mlp_pipeline(categorical_features_eng, numeric_features_eng)
}

# Train all models and track training time
training_results = {}
print("="*70)
print("TRAINING ALL MODELS")
print("="*70)

for name, pipeline in models.items():
    print(f"\nTraining {name}...")
    start_time = time.time()
    try:
        pipeline.fit(X_train_split_eng, y_train_split)
        train_time = time.time() - start_time
        training_results[name] = {
            'pipeline': pipeline,
            'train_time': train_time,
            'status': 'success'
        }
        print(f"  ✓ Completed in {train_time:.2f} seconds")
    except Exception as e:
        training_results[name] = {
            'pipeline': None,
            'train_time': None,
            'status': f'failed: {str(e)}'
        }
        print(f"  ✗ Failed: {str(e)}")

print("\n" + "="*70)

# 9. Evaluate Model Performance

Test all trained models on unseen validation data to measure generalization and reliability. Use multiple metrics and compare training vs validation results.

In [None]:
## Comprehensive Model Evaluation

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

def comprehensive_evaluate(y_true, y_pred_proba, y_pred=None):
    """Calculate comprehensive evaluation metrics."""
    if y_pred is None:
        y_pred = (y_pred_proba >= 0.5).astype(int)
    
    metrics = {
        'log_loss': log_loss(y_true, y_pred_proba),
        'roc_auc': roc_auc_score(y_true, y_pred_proba),
        'brier_score': brier_score_loss(y_true, y_pred_proba),
        'accuracy': accuracy_score(y_true, y_pred),
        'precision': precision_score(y_true, y_pred, zero_division=0),
        'recall': recall_score(y_true, y_pred, zero_division=0),
        'f1_score': f1_score(y_true, y_pred, zero_division=0)
    }
    return metrics

# Evaluate all models
evaluation_results = {}

print("="*70)
print("MODEL EVALUATION ON VALIDATION SET")
print("="*70)

for name, result in training_results.items():
    if result['status'] == 'success':
        pipeline = result['pipeline']
        try:
            y_val_pred_proba = pipeline.predict_proba(X_val_split_eng)[:, 1]
            y_val_pred = pipeline.predict(X_val_split_eng)
            
            # Get training metrics for comparison
            y_train_pred_proba = pipeline.predict_proba(X_train_split_eng)[:, 1]
            y_train_pred = pipeline.predict(X_train_split_eng)
            
            val_metrics = comprehensive_evaluate(y_val_split, y_val_pred_proba, y_val_pred)
            train_metrics = comprehensive_evaluate(y_train_split, y_train_pred_proba, y_train_pred)
            
            evaluation_results[name] = {
                'val_metrics': val_metrics,
                'train_metrics': train_metrics,
                'train_time': result['train_time']
            }
        except Exception as e:
            evaluation_results[name] = {'error': str(e)}

# Display results in a table
print("\n" + "="*70)
print("VALIDATION METRICS SUMMARY")
print("="*70)

results_df = []
for name, result in evaluation_results.items():
    if 'error' not in result:
        metrics = result['val_metrics']
        results_df.append({
            'Model': name,
            'Log Loss': f"{metrics['log_loss']:.6f}",
            'ROC-AUC': f"{metrics['roc_auc']:.6f}",
            'Brier Score': f"{metrics['brier_score']:.6f}",
            'Accuracy': f"{metrics['accuracy']:.4f}",
            'Precision': f"{metrics['precision']:.4f}",
            'Recall': f"{metrics['recall']:.4f}",
            'F1-Score': f"{metrics['f1_score']:.4f}",
            'Train Time (s)': f"{result['train_time']:.2f}"
        })

if results_df:
    results_table = pd.DataFrame(results_df)
    print(results_table.to_string(index=False))
    
    # Identify best model by log loss (primary metric)
    best_model_name = min(evaluation_results.keys(), 
                         key=lambda x: evaluation_results[x]['val_metrics']['log_loss'] 
                         if 'error' not in evaluation_results[x] else float('inf'))
    best_metrics = evaluation_results[best_model_name]['val_metrics']
    
    print("\n" + "="*70)
    print(f"BEST MODEL: {best_model_name}")
    print("="*70)
    print(f"Log Loss: {best_metrics['log_loss']:.6f}")
    print(f"ROC-AUC: {best_metrics['roc_auc']:.6f}")
    print(f"Brier Score: {best_metrics['brier_score']:.6f}")
    print(f"Accuracy: {best_metrics['accuracy']:.4f}")
    print(f"F1-Score: {best_metrics['f1_score']:.4f}")
    print("="*70)

In [None]:
## Check for Overfitting/Underfitting

print("\n" + "="*70)
print("TRAINING vs VALIDATION COMPARISON (Overfitting Check)")
print("="*70)

comparison_df = []
for name, result in evaluation_results.items():
    if 'error' not in result:
        train_ll = result['train_metrics']['log_loss']
        val_ll = result['val_metrics']['log_loss']
        diff = train_ll - val_ll
        
        comparison_df.append({
            'Model': name,
            'Train Log Loss': f"{train_ll:.6f}",
            'Val Log Loss': f"{val_ll:.6f}",
            'Difference': f"{diff:.6f}",
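            # Train loss well below validation loss points to overfitting; the reverse
            # usually reflects split noise or regularization rather than underfitting
            'Status': 'Overfitting' if diff < -0.01 else 'Val better than train' if diff > 0.01 else 'Good'
        })

if comparison_df:
    comparison_table = pd.DataFrame(comparison_df)
    print(comparison_table.to_string(index=False))
    print("\nNote: Negative difference (train < val) suggests overfitting; underfitting shows up as high loss on both sets, not as a positive difference")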

In [None]:
## Confusion Matrix for Best Model

print("\n" + "="*70)
print(f"CONFUSION MATRIX: {best_model_name}")
print("="*70)

best_pipeline = training_results[best_model_name]['pipeline']
y_val_pred_best = best_pipeline.predict(X_val_split_eng)
cm = confusion_matrix(y_val_split, y_val_pred_best)

print("\nConfusion Matrix:")
print(f"                Predicted")
print(f"              Negative  Positive")
print(f"Actual Negative   {cm[0,0]:6d}   {cm[0,1]:6d}")
print(f"        Positive   {cm[1,0]:6d}   {cm[1,1]:6d}")

tn, fp, fn, tp = cm.ravel()
print(f"\nTrue Negatives: {tn}")
print(f"False Positives: {fp}")
print(f"False Negatives: {fn}")
print(f"True Positives: {tp}")
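
The threshold-based metrics reported earlier follow directly from these four counts. The sketch below uses hypothetical counts for an imbalanced validation set to show how accuracy can look high while recall stays low at the default 0.5 threshold. `classification_summary` and the numbers are assumptions for illustration, not results from this notebook.

```python
def classification_summary(tn, fp, fn, tp):
    """Derive threshold-based metrics from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}

# HYPOTHETICAL counts for an imbalanced validation set of 20,000 rows.
summary = classification_summary(tn=16_300, fp=300, fn=3_000, tp=400)

for name, value in summary.items():
    print(f"{name:>9}: {value:.4f}")
```

This is one reason log loss, ROC-AUC, and the Brier score are used as the primary metrics here: they evaluate the predicted probabilities directly and do not depend on a single cutoff.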

# 10. Hyperparameter Tuning

Optimize the hyperparameters of the top models to lower validation log loss and improve generalization. Use randomized search for the larger tree-based search spaces and grid search for Logistic Regression, both with 3-fold cross-validation.
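
To see why randomized search is attractive for the larger search spaces, the sketch below counts how many model fits an exhaustive grid search would need for the HistGradientBoosting search space used in the tuning cell, compared with sampling 20 candidates. It uses only the standard library to mimic the sampling and is not how `RandomizedSearchCV` is implemented internally.

```python
import itertools
import random

# Same HistGradientBoosting search space as in the tuning cell.
param_grid = {
    "classifier__max_iter": [200, 300, 500],
    "classifier__max_depth": [5, 10, 15, None],
    "classifier__learning_rate": [0.01, 0.05, 0.1],
    "classifier__min_samples_leaf": [10, 20, 30],
}

# Enumerate every combination an exhaustive grid search would try.
keys = list(param_grid)
all_combinations = [dict(zip(keys, values))
                    for values in itertools.product(*param_grid.values())]

cv_folds = 3
n_iter = 20
rng = random.Random(42)
sampled = rng.sample(all_combinations, n_iter)  # sampling without replacement

grid_fits = len(all_combinations) * cv_folds
random_fits = len(sampled) * cv_folds

print(f"Grid search fits:   {grid_fits}")
print(f"Random search fits: {random_fits}")
print("Example candidate:", sampled[0])
```

The trade-off is coverage: randomized search evaluates only a subset of the grid, so the best combination may be missed, but on a 100K-row dataset the reduction in fits is usually worth it for a first pass.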

In [None]:
## Hyperparameter Tuning for Top Models

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Select top 2-3 models for hyperparameter tuning based on validation log loss
sorted_models = sorted(evaluation_results.items(), 
                      key=lambda x: x[1]['val_metrics']['log_loss'] 
                      if 'error' not in x[1] else float('inf'))

top_models_for_tuning = [name for name, _ in sorted_models[:3] if 'error' not in evaluation_results[name]]
print(f"Selected models for hyperparameter tuning: {top_models_for_tuning}")

tuned_results = {}

In [None]:
## Hyperparameter Tuning: HistGradientBoostingClassifier

if 'HistGradientBoosting' in top_models_for_tuning:
    print("\n" + "="*70)
    print("HYPERPARAMETER TUNING: HistGradientBoostingClassifier")
    print("="*70)
    
    # Build base pipeline
    base_pipeline = build_tree_pipeline(categorical_features_eng, numeric_features_eng)
    
    # Define parameter grid
    param_grid = {
        'classifier__max_iter': [200, 300, 500],
        'classifier__max_depth': [5, 10, 15, None],
        'classifier__learning_rate': [0.01, 0.05, 0.1],
        'classifier__min_samples_leaf': [10, 20, 30]
    }
    
    # Use RandomizedSearchCV for faster search (sample 20 combinations)
    print("Running RandomizedSearchCV (20 iterations)...")
    grid_search = RandomizedSearchCV(
        base_pipeline,
        param_grid,
        n_iter=20,
        cv=3,
        scoring='neg_log_loss',
        n_jobs=-1,
        random_state=42,
        verbose=1
    )
    
    start_time = time.time()
    grid_search.fit(X_train_split_eng, y_train_split)
    tuning_time = time.time() - start_time
    
    print(f"\nTuning completed in {tuning_time:.2f} seconds")
    print(f"\nBest parameters: {grid_search.best_params_}")
    print(f"Best CV score (neg_log_loss): {grid_search.best_score_:.6f}")
    
    # Evaluate on validation set
    best_hgb = grid_search.best_estimator_
    y_val_pred_proba_tuned = best_hgb.predict_proba(X_val_split_eng)[:, 1]
    tuned_metrics = comprehensive_evaluate(y_val_split, y_val_pred_proba_tuned)
    
    tuned_results['HistGradientBoosting'] = {
        'best_params': grid_search.best_params_,
        'best_cv_score': grid_search.best_score_,
        'val_metrics': tuned_metrics,
        'pipeline': best_hgb,
        'tuning_time': tuning_time
    }
    
    print(f"\nValidation Log Loss (tuned): {tuned_metrics['log_loss']:.6f}")
    print(f"Validation Log Loss (original): {evaluation_results['HistGradientBoosting']['val_metrics']['log_loss']:.6f}")
    improvement = evaluation_results['HistGradientBoosting']['val_metrics']['log_loss'] - tuned_metrics['log_loss']
    print(f"Improvement: {improvement:+.6f}")

In [None]:
## Hyperparameter Tuning: Random Forest

if 'Random Forest' in top_models_for_tuning:
    print("\n" + "="*70)
    print("HYPERPARAMETER TUNING: Random Forest")
    print("="*70)
    
    base_pipeline = build_rf_pipeline(categorical_features_eng, numeric_features_eng)
    
    param_grid = {
        'classifier__n_estimators': [100, 200, 300],
        'classifier__max_depth': [10, 20, 30, None],
        'classifier__min_samples_split': [2, 5, 10],
        'classifier__min_samples_leaf': [1, 2, 4]
    }
    
    print("Running RandomizedSearchCV (15 iterations)...")
    grid_search = RandomizedSearchCV(
        base_pipeline,
        param_grid,
        n_iter=15,
        cv=3,
        scoring='neg_log_loss',
        n_jobs=-1,
        random_state=42,
        verbose=1
    )
    
    start_time = time.time()
    grid_search.fit(X_train_split_eng, y_train_split)
    tuning_time = time.time() - start_time
    
    print(f"\nTuning completed in {tuning_time:.2f} seconds")
    print(f"\nBest parameters: {grid_search.best_params_}")
    print(f"Best CV score (neg_log_loss): {grid_search.best_score_:.6f}")
    
    best_rf = grid_search.best_estimator_
    y_val_pred_proba_tuned = best_rf.predict_proba(X_val_split_eng)[:, 1]
    tuned_metrics = comprehensive_evaluate(y_val_split, y_val_pred_proba_tuned)
    
    tuned_results['Random Forest'] = {
        'best_params': grid_search.best_params_,
        'best_cv_score': grid_search.best_score_,
        'val_metrics': tuned_metrics,
        'pipeline': best_rf,
        'tuning_time': tuning_time
    }
    
    print(f"\nValidation Log Loss (tuned): {tuned_metrics['log_loss']:.6f}")
    print(f"Validation Log Loss (original): {evaluation_results['Random Forest']['val_metrics']['log_loss']:.6f}")
    improvement = evaluation_results['Random Forest']['val_metrics']['log_loss'] - tuned_metrics['log_loss']
    print(f"Improvement: {improvement:+.6f}")

In [None]:
## Hyperparameter Tuning: Logistic Regression

if 'Logistic Regression' in top_models_for_tuning:
    print("\n" + "="*70)
    print("HYPERPARAMETER TUNING: Logistic Regression")
    print("="*70)
    
    base_pipeline = build_pipeline(categorical_features_eng, numeric_features_eng)
    
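    # lbfgs supports only the l2 penalty, so each solver gets its own valid sub-grid
    c_values = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
    param_grid = [
        {'classifier__solver': ['liblinear'],
         'classifier__penalty': ['l1', 'l2'],
         'classifier__C': c_values},
        {'classifier__solver': ['lbfgs'],
         'classifier__penalty': ['l2'],
         'classifier__C': c_values}
    ]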
    
    print("Running GridSearchCV...")
    grid_search = GridSearchCV(
        base_pipeline,
        param_grid,
        cv=3,
        scoring='neg_log_loss',
        n_jobs=-1,
        verbose=1
    )
    
    start_time = time.time()
    grid_search.fit(X_train_split_eng, y_train_split)
    tuning_time = time.time() - start_time
    
    print(f"\nTuning completed in {tuning_time:.2f} seconds")
    print(f"\nBest parameters: {grid_search.best_params_}")
    print(f"Best CV score (neg_log_loss): {grid_search.best_score_:.6f}")
    
    best_lr = grid_search.best_estimator_
    y_val_pred_proba_tuned = best_lr.predict_proba(X_val_split_eng)[:, 1]
    tuned_metrics = comprehensive_evaluate(y_val_split, y_val_pred_proba_tuned)
    
    tuned_results['Logistic Regression'] = {
        'best_params': grid_search.best_params_,
        'best_cv_score': grid_search.best_score_,
        'val_metrics': tuned_metrics,
        'pipeline': best_lr,
        'tuning_time': tuning_time
    }
    
    print(f"\nValidation Log Loss (tuned): {tuned_metrics['log_loss']:.6f}")
    print(f"Validation Log Loss (original): {evaluation_results['Logistic Regression']['val_metrics']['log_loss']:.6f}")
    improvement = evaluation_results['Logistic Regression']['val_metrics']['log_loss'] - tuned_metrics['log_loss']
    print(f"Improvement: {improvement:+.6f}")

In [None]:
## Compare All Models (Original vs Tuned)

print("\n" + "="*70)
print("FINAL MODEL COMPARISON: ALL MODELS")
print("="*70)

# Combine original and tuned results
all_final_results = []

# Add original models
for name, result in evaluation_results.items():
    if 'error' not in result:
        all_final_results.append({
            'Model': name,
            'Type': 'Original',
            'Log Loss': result['val_metrics']['log_loss'],
            'ROC-AUC': result['val_metrics']['roc_auc'],
            'F1-Score': result['val_metrics']['f1_score'],
            'Train Time (s)': result['train_time']
        })

# Add tuned models
for name, result in tuned_results.items():
    all_final_results.append({
        'Model': name,
        'Type': 'Tuned',
        'Log Loss': result['val_metrics']['log_loss'],
        'ROC-AUC': result['val_metrics']['roc_auc'],
        'F1-Score': result['val_metrics']['f1_score'],
        'Train Time (s)': result['tuning_time']
    })

final_comparison_df = pd.DataFrame(all_final_results)
final_comparison_df = final_comparison_df.sort_values('Log Loss')

print("\n" + final_comparison_df.to_string(index=False))

# Identify overall best model
best_final_model = final_comparison_df.iloc[0]
print("\n" + "="*70)
print("🏆 OVERALL BEST MODEL")
print("="*70)
print(f"Model: {best_final_model['Model']} ({best_final_model['Type']})")
print(f"Log Loss: {best_final_model['Log Loss']:.6f}")
print(f"ROC-AUC: {best_final_model['ROC-AUC']:.6f}")
print(f"F1-Score: {best_final_model['F1-Score']:.6f}")
print("="*70)

# Store the best pipeline
if best_final_model['Type'] == 'Tuned':
    best_final_pipeline = tuned_results[best_final_model['Model']]['pipeline']
else:
    best_final_pipeline = training_results[best_final_model['Model']]['pipeline']

# Final Model Selection and Test Predictions

Train the best model on the full training dataset and generate final predictions.

In [None]:
## Final Model Training and Prediction

print(f"\nRefitting best model ({best_final_model['Model']}) on full training data...")
best_final_pipeline.fit(X_train_eng, y_train)

print("Generating final predictions...")
predictions_df = predict_and_save(best_final_pipeline, X_test_eng, test_df, name="Vineet_Burugu")
print(f"Prediction range: [{predictions_df['pr_CTA'].min():.6f}, {predictions_df['pr_CTA'].max():.6f}]")
print(f"Mean prediction: {predictions_df['pr_CTA'].mean():.6f}")
print(f"Std prediction: {predictions_df['pr_CTA'].std():.6f}")

print("\n" + "="*70)
print("✅ FINAL PREDICTIONS SAVED")
print("="*70)