# CrabNet Hyperparameter Dataset Analysis

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sparks-baird/matsci-opt-benchmarks/blob/copilot/fix-50/notebooks/crabnet_hyperparameter/2.0-analysis-crabnet-dataset.ipynb)

This notebook analyzes the CrabNet Hyperparameter dataset from Zenodo (DOI: 10.5281/zenodo.7694268).
We train various scikit-learn models with and without "rank" variables to investigate
surprising near-perfect parity plot results mentioned in the issue.

The dataset contains 173,219 hyperparameter combinations from CrabNet training experiments,
including performance metrics (MAE, RMSE, runtime) and their corresponding rank variables.

## Models to evaluate:
1. Random Forest Regressor (RFR)
2. Histogram Gradient Boosting
3. Support Vector Regression (SVR)
4. Ridge Regression
5. Gaussian Process Regression (GPR) with Automatic Relevance Determination (ARD)


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import urllib.request
import warnings
warnings.filterwarnings("ignore")

# sklearn imports
from sklearn.ensemble import RandomForestRegressor, HistGradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.linear_model import Ridge
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern, WhiteKernel, ConstantKernel as C
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Set random seed for reproducibility
np.random.seed(42)

## Load Dataset

Load the CrabNet hyperparameter dataset from Zenodo (DOI: 10.5281/zenodo.7694268).
The dataset contains 173,219 hyperparameter combinations and their corresponding performance metrics.

In [None]:
def download_crabnet_dataset():
    """
    Download the CrabNet hyperparameter dataset from Zenodo if it is not already present
    and return the local file path.
    """
    
    # Define paths
    data_dir = Path("../../data/processed/crabnet_hyperparameter")
    data_dir.mkdir(parents=True, exist_ok=True)
    
    file_path = data_dir / "sobol_regression.csv"
    
    if not file_path.exists():
        print("Downloading CrabNet dataset from Zenodo...")
        url = "https://zenodo.org/api/records/7694268/files/sobol_regression.csv/content"
        urllib.request.urlretrieve(url, file_path)
        print(f"Downloaded dataset to {file_path}")
    else:
        print(f"Dataset already exists at {file_path}")
    
    return file_path

def load_crabnet_dataset():
    """
    Load and preprocess the CrabNet hyperparameter dataset
    """
    # Download if necessary
    file_path = download_crabnet_dataset()
    
    # Load the dataset
    print("Loading CrabNet dataset...")
    df = pd.read_csv(file_path)
    
    # Drop non-hyperparameter columns that aren't useful for our analysis
    columns_to_drop = ['_id', 'session_id', 'timestamp', 'model_size']
    df = df.drop(columns=[col for col in columns_to_drop if col in df.columns])
    
    print(f"Dataset shape: {df.shape}")
    print(f"Columns: {list(df.columns)}")
    
    return df

# Load the dataset
df = load_crabnet_dataset()

In [None]:
# Display dataset info
print("Dataset Info:")
print(df.info())
print("\nFirst few rows:")
df.head()

In [None]:
# Display summary statistics
print("Target variable statistics:")
print(df[['mae', 'rmse', 'runtime']].describe())

# Plot target distributions
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
df['mae'].hist(bins=50, ax=axes[0], alpha=0.7)
axes[0].set_title('MAE Distribution')
axes[0].set_xlabel('MAE')

df['rmse'].hist(bins=50, ax=axes[1], alpha=0.7)
axes[1].set_title('RMSE Distribution')
axes[1].set_xlabel('RMSE')

df['runtime'].hist(bins=50, ax=axes[2], alpha=0.7)
axes[2].set_title('Runtime Distribution')
axes[2].set_xlabel('Runtime (s)')

plt.tight_layout()
plt.show()

## Define Feature Sets

We'll create two feature sets:
1. Features without rank variables (original hyperparameters only)
2. Features with rank variables (the hyperparameters plus rank columns such as `mae_rank`, which also carry the noise in the measured metrics)
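
To make the second feature set concrete, here is a minimal sketch of why a rank column carries target information. It assumes, hypothetically, that `mae_rank` is obtained by ranking `mae` (here with pandas `Series.rank`); the dataset's actual ranking procedure is not documented in this notebook, and the values below are synthetic.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the 'mae' column (illustrative values only)
rng = np.random.default_rng(0)
toy = pd.DataFrame({"mae": rng.lognormal(mean=0.0, sigma=0.5, size=1000)})

# Assumption: the rank column is computed by ranking the target itself
toy["mae_rank"] = toy["mae"].rank(method="average")

# A rank is a monotonic function of the value it ranks: sorting by mae
# also sorts mae_rank, even though the linear (Pearson) correlation is below 1
is_monotonic = toy.sort_values("mae")["mae_rank"].is_monotonic_increasing
pearson = toy["mae"].corr(toy["mae_rank"])
print(f"Monotonic: {is_monotonic}, Pearson correlation: {pearson:.4f}")
```

If the real rank columns behave this way, any model that can represent a monotonic map from `mae_rank` to `mae` has access to the target's ordering through its inputs.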

In [None]:
# Define feature sets based on actual dataset columns
print("Available columns in dataset:")
print(list(df.columns))
print()

# Identify categorical columns that need encoding
categorical_columns = ['criterion', 'elem_prop', 'hardware']
print(f"Categorical columns to encode: {categorical_columns}")
print(f"Unique values in 'criterion': {df['criterion'].unique()}")
print(f"Unique values in 'elem_prop': {df['elem_prop'].unique()}")
print(f"Unique values in 'hardware': {df['hardware'].unique()}")
print()

# Create dummy variables for categorical columns
df_encoded = pd.get_dummies(df, columns=categorical_columns, prefix=categorical_columns)
print(f"Dataset shape after encoding: {df_encoded.shape}")

# Define numerical hyperparameter features
numerical_features = [
    'N', 'alpha', 'd_model', 'dim_feedforward', 'dropout', 'emb_scaler',
    'eps', 'epochs_step', 'fudge', 'heads', 'k', 'lr', 'pe_resolution',
    'ple_resolution', 'pos_scaler', 'weight_decay', 'batch_size',
    'out_hidden4', 'betas1', 'betas2', 'bias', 'train_frac'
]

# Add encoded categorical features
categorical_encoded_features = [col for col in df_encoded.columns if any(col.startswith(prefix + '_') for prefix in categorical_columns)]
print(f"Encoded categorical features: {categorical_encoded_features}")

# All hyperparameter features (numerical + categorical)
hyperparameter_features = numerical_features + categorical_encoded_features

rank_features = ['mae_rank', 'rmse_rank', 'runtime_rank']

# Features without rank (clean hyperparameters)
features_without_rank = hyperparameter_features

# Features with rank (includes noise)
features_with_mae_rank = hyperparameter_features + ['mae_rank']
features_with_all_ranks = hyperparameter_features + rank_features

print(f"Total hyperparameter features: {len(hyperparameter_features)}")
print(f"Features without rank: {len(features_without_rank)}")
print(f"Features with MAE rank: {len(features_with_mae_rank)}")
print(f"Features with all ranks: {len(features_with_all_ranks)}")
print()
print("All hyperparameter features:")
for i, feat in enumerate(hyperparameter_features):
    print(f"{i+1:2d}. {feat}")
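
As a small illustration of the encoding step, the sketch below uses a made-up frame (the category levels `"A"` and `"B"` are placeholders, not the dataset's actual values). `pd.get_dummies` replaces each categorical column with one indicator column per level, named `<prefix>_<level>`, which is the naming the `startswith(prefix + '_')` filter relies on.

```python
import pandas as pd

toy = pd.DataFrame({
    "lr": [0.1, 0.2, 0.3],
    "criterion": ["A", "B", "A"],  # placeholder category levels
})
categorical_columns = ["criterion"]

# One indicator column per level; untouched columns keep their position
toy_encoded = pd.get_dummies(toy, columns=categorical_columns, prefix=categorical_columns)

# Same filter as in the notebook: pick up the generated indicator columns
encoded_cols = [
    col for col in toy_encoded.columns
    if any(col.startswith(prefix + "_") for prefix in categorical_columns)
]
print(list(toy_encoded.columns))
print(encoded_cols)
```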

## Prepare Data for Training

In [None]:
# Target variable
target = 'mae'
y = df_encoded[target].values

# Prepare feature matrices using the encoded dataframe
X_without_rank = df_encoded[features_without_rank].values
X_with_mae_rank = df_encoded[features_with_mae_rank].values
X_with_all_ranks = df_encoded[features_with_all_ranks].values

print(f"Target shape: {y.shape}")
print(f"X_without_rank shape: {X_without_rank.shape}")
print(f"X_with_mae_rank shape: {X_with_mae_rank.shape}")
print(f"X_with_all_ranks shape: {X_with_all_ranks.shape}")

# Create train/test splits for all feature sets
X_without_rank_train, X_without_rank_test, y_train, y_test = train_test_split(
    X_without_rank, y, test_size=0.2, random_state=42
)

X_with_mae_rank_train, X_with_mae_rank_test, _, _ = train_test_split(
    X_with_mae_rank, y, test_size=0.2, random_state=42
)

X_with_all_ranks_train, X_with_all_ranks_test, _, _ = train_test_split(
    X_with_all_ranks, y, test_size=0.2, random_state=42
)

print(f"Training set size: {len(y_train)}")
print(f"Test set size: {len(y_test)}")
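
The three splits above stay aligned only because every call receives arrays of the same length and the same `random_state`. As a sketch of an alternative that makes the alignment explicit, `train_test_split` accepts several arrays in one call and applies the same shuffle to all of them. The arrays here are synthetic stand-ins for the notebook's feature matrices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 50
X_a = rng.normal(size=(n, 3))                     # stand-in for X_without_rank
X_b = np.hstack([X_a, rng.normal(size=(n, 1))])   # stand-in for X_with_mae_rank
y = rng.normal(size=n)

# One call, one shuffle: all outputs share the same row order
X_a_tr, X_a_te, X_b_tr, X_b_te, y_tr, y_te = train_test_split(
    X_a, X_b, y, test_size=0.2, random_state=42
)

# The first three columns of X_b are X_a, so aligned splits must agree there
aligned = np.array_equal(X_a_te, X_b_te[:, :3])
print(f"Test rows aligned: {aligned}")
```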

## Model Training and Evaluation Functions

In [None]:
def evaluate_model(model, X_train, X_test, y_train, y_test, model_name):
    """
    Train and evaluate a model, return metrics and predictions
    """
    # Train model
    model.fit(X_train, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test)
    
    # Calculate metrics
    r2 = r2_score(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    
    results = {
        'model': model_name,
        'r2': r2,
        'mse': mse,
        'mae': mae,
        'rmse': rmse,
        'y_test': y_test,
        'y_pred': y_pred
    }
    
    print(f"{model_name} - R²: {r2:.4f}, MAE: {mae:.4f}, RMSE: {rmse:.4f}")
    
    return results

def plot_parity(results, title_suffix=""):
    """
    Create parity plot for model results
    """
    y_test = results['y_test']
    y_pred = results['y_pred']
    r2 = results['r2']
    model_name = results['model']
    
    fig = plt.figure(figsize=(6, 6))
    plt.scatter(y_test, y_pred, alpha=0.6, s=20)
    
    # Plot perfect prediction line
    min_val = min(min(y_test), min(y_pred))
    max_val = max(max(y_test), max(y_pred))
    plt.plot([min_val, max_val], [min_val, max_val], 'r--', 
             label=f'Perfect fit\nR² = {r2:.3f}', linewidth=2)
    
    plt.xlabel('True MAE')
    plt.ylabel('Predicted MAE')
    plt.title(f'Parity Plot - {model_name}{title_suffix}')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
    
    # Return the figure created above; plt.gcf() after plt.show() can return a new, empty figure
    return fig
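
For reference, the metrics computed in `evaluate_model` follow the standard definitions, with $y_i$ the true test values, $\hat{y}_i$ the predictions, $\bar{y}$ the mean of the true test values, and $n$ the number of test samples. Note that the target here is itself CrabNet's MAE, so the `mae` metric below is the surrogate model's error in predicting that target.

$$
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert,
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
$$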

## 1. Random Forest Regressor (SKIPPED ON FULL DATASET - TOO SLOW)

In [None]:
print("=" * 50)
print("RANDOM FOREST REGRESSOR (SKIPPING FULL DATASET)")
print("=" * 50)
print("Skipping Random Forest on full dataset due to performance constraints.")
print("Random Forest will be evaluated on small subsets below.")

# Create placeholder results for summary
rf_results_without_rank = {'model': 'Random Forest (without rank)', 'r2': np.nan, 'mae': np.nan, 'rmse': np.nan, 'y_test': [], 'y_pred': []}
rf_results_with_mae_rank = {'model': 'Random Forest (with MAE rank)', 'r2': np.nan, 'mae': np.nan, 'rmse': np.nan, 'y_test': [], 'y_pred': []}
rf_results_with_all_ranks = {'model': 'Random Forest (with all ranks)', 'r2': np.nan, 'mae': np.nan, 'rmse': np.nan, 'y_test': [], 'y_pred': []}

In [None]:
print("Random Forest parity plots will be shown for small subsets below.")

## 2. Histogram Gradient Boosting Regressor

In [None]:
print("=" * 50)
print("HISTOGRAM GRADIENT BOOSTING REGRESSOR")
print("=" * 50)

# Without rank variables
print("\n1. Without rank variables:")
hgb_without_rank = HistGradientBoostingRegressor(random_state=42)
hgb_results_without_rank = evaluate_model(
    hgb_without_rank, X_without_rank_train, X_without_rank_test, 
    y_train, y_test, "Hist Gradient Boosting (without rank)"
)

# With MAE rank variable
print("\n2. With MAE rank variable:")
hgb_with_mae_rank = HistGradientBoostingRegressor(random_state=42)
hgb_results_with_mae_rank = evaluate_model(
    hgb_with_mae_rank, X_with_mae_rank_train, X_with_mae_rank_test, 
    y_train, y_test, "Hist Gradient Boosting (with MAE rank)"
)

# With all rank variables
print("\n3. With all rank variables:")
hgb_with_all_ranks = HistGradientBoostingRegressor(random_state=42)
hgb_results_with_all_ranks = evaluate_model(
    hgb_with_all_ranks, X_with_all_ranks_train, X_with_all_ranks_test, 
    y_train, y_test, "Hist Gradient Boosting (with all ranks)"
)

In [None]:
# Plot parity plots for Histogram Gradient Boosting
plot_parity(hgb_results_without_rank, " (without rank)")
plot_parity(hgb_results_with_mae_rank, " (with MAE rank)")
plot_parity(hgb_results_with_all_ranks, " (with all ranks)")

## 3. Support Vector Regression (SVR) (SKIPPED ON FULL DATASET - TOO SLOW)

In [None]:
print("=" * 50)
print("SUPPORT VECTOR REGRESSION (SKIPPING FULL DATASET)")
print("=" * 50)
print("Skipping SVR on full dataset due to performance constraints.")
print("SVR will be evaluated on small subsets below.")

# Create placeholder results for summary
svr_results_without_rank = {'model': 'SVR (without rank)', 'r2': np.nan, 'mae': np.nan, 'rmse': np.nan, 'y_test': [], 'y_pred': []}
svr_results_with_mae_rank = {'model': 'SVR (with MAE rank)', 'r2': np.nan, 'mae': np.nan, 'rmse': np.nan, 'y_test': [], 'y_pred': []}
svr_results_with_all_ranks = {'model': 'SVR (with all ranks)', 'r2': np.nan, 'mae': np.nan, 'rmse': np.nan, 'y_test': [], 'y_pred': []}

In [None]:
print("SVR parity plots will be shown for small subsets below.")

## 4. Ridge Regression

In [None]:
print("=" * 50)
print("RIDGE REGRESSION")
print("=" * 50)

# Without rank variables
print("\n1. Without rank variables:")
ridge_without_rank = Pipeline([
    ('scaler', StandardScaler()),
    ('ridge', Ridge(alpha=1.0, random_state=42))
])
ridge_results_without_rank = evaluate_model(
    ridge_without_rank, X_without_rank_train, X_without_rank_test, 
    y_train, y_test, "Ridge (without rank)"
)

# With MAE rank variable
print("\n2. With MAE rank variable:")
ridge_with_mae_rank = Pipeline([
    ('scaler', StandardScaler()),
    ('ridge', Ridge(alpha=1.0, random_state=42))
])
ridge_results_with_mae_rank = evaluate_model(
    ridge_with_mae_rank, X_with_mae_rank_train, X_with_mae_rank_test, 
    y_train, y_test, "Ridge (with MAE rank)"
)

# With all rank variables
print("\n3. With all rank variables:")
ridge_with_all_ranks = Pipeline([
    ('scaler', StandardScaler()),
    ('ridge', Ridge(alpha=1.0, random_state=42))
])
ridge_results_with_all_ranks = evaluate_model(
    ridge_with_all_ranks, X_with_all_ranks_train, X_with_all_ranks_test, 
    y_train, y_test, "Ridge (with all ranks)"
)

In [None]:
# Plot parity plots for Ridge Regression
plot_parity(ridge_results_without_rank, " (without rank)")
plot_parity(ridge_results_with_mae_rank, " (with MAE rank)")
plot_parity(ridge_results_with_all_ranks, " (with all ranks)")

## 5. Gaussian Process Regression (GPR) (SKIPPED ON FULL DATASET - TOO SLOW)

In [None]:
print("=" * 50)
print("GAUSSIAN PROCESS REGRESSION (SKIPPING FULL DATASET)")
print("=" * 50)
print("Skipping GPR on full dataset due to performance constraints.")
print("GPR will be evaluated on small subsets below (limited to ~100 points).")

# Create placeholder results for summary
gpr_results_without_rank = {'model': 'GPR with ARD (without rank)', 'r2': np.nan, 'mae': np.nan, 'rmse': np.nan, 'y_test': [], 'y_pred': []}
gpr_results_with_mae_rank = {'model': 'GPR with ARD (with MAE rank)', 'r2': np.nan, 'mae': np.nan, 'rmse': np.nan, 'y_test': [], 'y_pred': []}
gpr_results_with_all_ranks = {'model': 'GPR with ARD (with all ranks)', 'r2': np.nan, 'mae': np.nan, 'rmse': np.nan, 'y_test': [], 'y_pred': []}

In [None]:
print("GPR parity plots will be shown for small subsets below.")
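
For context on the "ARD" part of the GPR model: passing one length scale per feature makes the RBF kernel anisotropic, so the optimizer can assign each input its own relevance. The sketch below fits such a kernel to synthetic data in which only the first feature matters; the data, sizes, and seed are assumptions for illustration. Exact GPR training cost grows roughly cubically with the number of training points, which is presumably why it is skipped on the full dataset.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel, ConstantKernel as C

rng = np.random.default_rng(0)
n_features = 3
X = rng.uniform(-1.0, 1.0, size=(40, n_features))
# Synthetic target that depends only on the first feature
y = np.sin(3.0 * X[:, 0]) + 0.05 * rng.normal(size=40)

# A list of length scales (one per feature) gives an ARD-style RBF kernel
kernel = (
    C(1.0, (1e-3, 1e3))
    * RBF(length_scale=[1.0] * n_features, length_scale_bounds=(1e-3, 1e3))
    + WhiteKernel()
)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True, random_state=0)
gpr.fit(X, y)

# kernel_ has the form (Constant * RBF) + White; the RBF factor holds the fitted scales
length_scales = gpr.kernel_.k1.k2.length_scale
print("Fitted length scales per feature:", length_scales)
```

A large fitted length scale for a feature suggests the model treats it as nearly irrelevant, though the exact values depend on the optimizer and the data.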

## Investigation of Strange Behavior from Original Code

Now let's investigate the surprising results mentioned in the issue, where Ridge and GPR
gave near-perfect parity plots on small subsets of the data. We'll follow Runze's original
approach more closely and check for potential data leakage.

In [None]:
print("=" * 80)
print("REPRODUCING RUNZE'S ORIGINAL APPROACH - INVESTIGATING STRANGE BEHAVIOR")
print("=" * 80)

# Let's first check what rank variables exist and how they relate to the target
print("\nAnalyzing rank variables and potential data leakage:")
print(f"Target variable 'mae' range: {df['mae'].min():.4f} to {df['mae'].max():.4f}")
if 'mae_rank' in df.columns:
    print(f"MAE rank variable range: {df['mae_rank'].min()} to {df['mae_rank'].max()}")
    print(f"Correlation between mae and mae_rank: {df['mae'].corr(df['mae_rank']):.4f}")
if 'rmse_rank' in df.columns:
    print(f"RMSE rank variable range: {df['rmse_rank'].min()} to {df['rmse_rank'].max()}")
    print(f"Correlation between mae and rmse_rank: {df['mae'].corr(df['rmse_rank']):.4f}")
if 'runtime_rank' in df.columns:
    print(f"Runtime rank variable range: {df['runtime_rank'].min()} to {df['runtime_rank'].max()}")
    print(f"Correlation between mae and runtime_rank: {df['mae'].corr(df['runtime_rank']):.4f}")

# Show the relationship between target and rank variables
if 'mae_rank' in df.columns:
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    
    # Scatter plot
    axes[0].scatter(df['mae'], df['mae_rank'], alpha=0.5, s=1)
    axes[0].set_xlabel('MAE (target)')
    axes[0].set_ylabel('MAE Rank')
    axes[0].set_title('MAE vs MAE Rank (Potential Data Leakage)')
    axes[0].grid(True, alpha=0.3)
    
    # Check if rank is just percentile ranking
    mae_sorted = df['mae'].sort_values()
    percentile_ranks = np.arange(1, len(mae_sorted) + 1) / len(mae_sorted) * 100
    
    axes[1].scatter(mae_sorted.values[:1000], percentile_ranks[:1000], alpha=0.5, s=1, label='Expected percentile rank')
    # Get corresponding mae_rank values for the same indices
    mae_rank_for_sorted = df.loc[mae_sorted.index[:1000], 'mae_rank'].values
    axes[1].scatter(mae_sorted.values[:1000], mae_rank_for_sorted, alpha=0.5, s=1, color='red', label='Actual mae_rank')
    axes[1].set_xlabel('MAE (sorted)')
    axes[1].set_ylabel('Rank')
    axes[1].set_title('Rank vs Sorted MAE (first 1000 points)')
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    print(f"\nDATA LEAKAGE ANALYSIS:")
    print(f"If mae_rank is computed from mae, this creates direct information leakage!")
    print(f"The rank variables provide information about the target variable ordering.")

In [None]:
# Define helper functions for Runze's approach
def create_ard_kernel(n_features):
    """Create ARD kernel for GPR"""
    return C(1.0, (1e-3, 1e3)) * RBF(length_scale=[1.0]*n_features, length_scale_bounds=(1e-3, 1e3)) + WhiteKernel()

def evaluate_model_detailed(model, X_train, X_test, y_train, y_test, model_name):
    """
    Train and evaluate a model with more detailed output
    """
    print(f"\nTraining {model_name}...")
    print(f"Training data shape: {X_train.shape}, Test data shape: {X_test.shape}")
    
    # Train model
    model.fit(X_train, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test)
    
    # Calculate metrics
    r2 = r2_score(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    
    print(f"Results: R² = {r2:.4f}, MAE = {mae:.4f}, RMSE = {rmse:.4f}")
    
    # Check for suspiciously perfect fits
    if r2 > 0.99:
        print(f"⚠️  WARNING: Suspiciously high R² ({r2:.4f}) - possible overfitting or data leakage!")
    
    # Show some prediction details
    print(f"Prediction range: {y_pred.min():.4f} to {y_pred.max():.4f}")
    print(f"True values range: {y_test.min():.4f} to {y_test.max():.4f}")
    
    results = {
        'model': model_name,
        'r2': r2,
        'mse': mse,
        'mae': mae,
        'rmse': rmse,
        'y_test': y_test,
        'y_pred': y_pred
    }
    
    return results

def plot_parity_detailed(results, title_suffix=""):
    """
    Create detailed parity plot
    """
    y_test = results['y_test']
    y_pred = results['y_pred']
    r2 = results['r2']
    model_name = results['model']
    
    fig = plt.figure(figsize=(8, 6))
    plt.scatter(y_test, y_pred, alpha=0.7, s=30)
    
    # Plot perfect prediction line
    min_val = min(min(y_test), min(y_pred))
    max_val = max(max(y_test), max(y_pred))
    plt.plot([min_val, max_val], [min_val, max_val], 'r--', 
             linewidth=2, label='Perfect fit')
    
    plt.xlabel('True MAE', fontsize=12)
    plt.ylabel('Predicted MAE', fontsize=12)
    
    # Flag suspiciously perfect fits in the title
    if r2 > 0.99:
        plt.title(f'⚠️  SUSPICIOUS: {model_name}{title_suffix}\nR² = {r2:.4f} (TOO PERFECT!)', 
                 fontsize=14, color='red')
    else:
        plt.title(f'Parity Plot - {model_name}{title_suffix}\nR² = {r2:.4f}', fontsize=14)
    
    # Add text box with metrics
    textstr = f'R² = {r2:.4f}\nMAE = {results["mae"]:.4f}\nRMSE = {results["rmse"]:.4f}'
    props = dict(boxstyle='round', facecolor='wheat', alpha=0.5)
    plt.text(0.05, 0.95, textstr, transform=plt.gca().transAxes, fontsize=10,
             verticalalignment='top', bbox=props)
    
    # Show the 'Perfect fit' label, away from the metrics box
    plt.legend(loc='lower right')
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
    
    # Return the figure created above; plt.gcf() after plt.show() can return a new, empty figure
    return fig

### Reproducing Runze's Sampling Approach

Let's replicate the sampling fractions used in Runze's code to reproduce the strange behavior.
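
Before running the models, it helps to work out how small these subsets are. The sketch below assumes that `DataFrame.sample(frac=...)` draws `round(frac * n_rows)` rows and that a float `test_size` in `train_test_split` is rounded up; it uses the dataset size stated earlier and the 22 entries of the `numerical_features` list. The variable names are illustrative.

```python
import math

n_rows = 173_219   # dataset size stated above
n_numerical = 22   # length of the numerical_features list

sizes = {}
for frac in (0.0005, 0.00005):
    # Assumption: pandas draws round(frac * n_rows) rows
    n_subset = round(frac * n_rows)
    if n_subset >= 10:
        n_test = math.ceil(0.2 * n_subset)  # float test_size is rounded up
    else:
        n_test = max(1, n_subset // 5)      # the notebook's fallback for tiny subsets
    n_train = n_subset - n_test
    sizes[frac] = (n_subset, n_train, n_test)
    print(f"frac={frac}: {n_subset} rows -> {n_train} train / {n_test} test")

print(f"Numerical features alone: {n_numerical}")
```

Under these assumptions the smaller subset leaves 8 training rows for at least 22 input features and a single test row, so any score computed on it should be read with caution.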

In [None]:
# Reproduce Runze's sampling approach
print("=" * 60)
print("REPRODUCING RUNZE'S SAMPLING FRACTIONS")
print("=" * 60)

# Test different fractions as in Runze's original code
sample_fractions = [
    (0.0005, "87 samples (approx)"),   # st_00005 = sobol_reg.sample(frac=0.0005, random_state=42)
    (0.00005, "9 samples (approx)"),    # st_000005 = sobol_reg.sample(frac=0.00005, random_state=42)
]

for frac, description in sample_fractions:
    print(f"\n{'='*30} SAMPLE FRACTION: {frac} ({description}) {'='*30}")
    
    # Create subset using same random state as Runze
    subset_df = df_encoded.sample(frac=frac, random_state=42)
    actual_size = len(subset_df)
    
    print(f"Actual subset size: {actual_size} samples")
    
    if actual_size < 5:
        print("Subset too small for meaningful train/test split. Skipping.")
        continue
    
    # Prepare subset data
    y_subset = subset_df[target].values
    X_subset_without_rank = subset_df[features_without_rank].values
    
    # Check if rank variables exist and prepare data accordingly
    has_rank_vars = any(col in subset_df.columns for col in rank_features)
    if has_rank_vars:
        available_rank_features = [col for col in rank_features if col in subset_df.columns]
        X_subset_with_rank = subset_df[features_without_rank + available_rank_features].values
        print(f"Available rank features: {available_rank_features}")
    else:
        print("No rank variables found in subset.")
        X_subset_with_rank = None
    
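    # Split subset: 20% test for 10+ samples, otherwise an absolute count of test samples.
    # With a single test sample (e.g. the 9-sample subset), R² is undefined and r2_score returns NaN.
    test_size = 0.2 if actual_size >= 10 else max(1, actual_size // 5)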
    X_train_sub, X_test_sub, y_train_sub, y_test_sub = train_test_split(
        X_subset_without_rank, y_subset, test_size=test_size, random_state=42
    )
    
    if has_rank_vars and X_subset_with_rank is not None:
        X_train_sub_rank, X_test_sub_rank, _, _ = train_test_split(
            X_subset_with_rank, y_subset, test_size=test_size, random_state=42
        )
    
    print(f"Training size: {len(y_train_sub)}, Test size: {len(y_test_sub)}")
    print(f"Target range in subset: {y_subset.min():.4f} to {y_subset.max():.4f}")
    
    # Check rank variables in this subset
    if has_rank_vars:
        for rank_col in available_rank_features:
            if rank_col in subset_df.columns:
                rank_vals = subset_df[rank_col].values
                print(f"{rank_col} range in subset: {rank_vals.min()} to {rank_vals.max()}")
                # Check correlation in small subset
                corr = np.corrcoef(y_subset, rank_vals)[0, 1]
                print(f"Correlation between mae and {rank_col} in subset: {corr:.4f}")
    
    print("\n--- RIDGE REGRESSION ---")
    
    # Ridge without rank
    ridge_sub_without = Pipeline([
        ('scaler', StandardScaler()),
        ('ridge', Ridge(alpha=1.0, random_state=42))
    ])
    ridge_sub_results_without = evaluate_model_detailed(
        ridge_sub_without, X_train_sub, X_test_sub, y_train_sub, y_test_sub, 
        f"Ridge (subset {actual_size}, no rank)"
    )
    
    # Ridge with rank (if available)
    if has_rank_vars and X_subset_with_rank is not None:
        ridge_sub_with = Pipeline([
            ('scaler', StandardScaler()),
            ('ridge', Ridge(alpha=1.0, random_state=42))
        ])
        ridge_sub_results_with = evaluate_model_detailed(
            ridge_sub_with, X_train_sub_rank, X_test_sub_rank, y_train_sub, y_test_sub, 
            f"Ridge (subset {actual_size}, with rank)"
        )
    
    # GPR (only for very small subsets)
    if actual_size <= 100:  # Limit GPR to avoid long computation times
        print("\n--- GAUSSIAN PROCESS REGRESSION ---")
        
        # GPR without rank
        kernel_sub_without = create_ard_kernel(len(features_without_rank))
        gpr_sub_without = Pipeline([
            ('scaler', StandardScaler()),
            ('gpr', GaussianProcessRegressor(kernel=kernel_sub_without, normalize_y=True, alpha=1e-3, random_state=42))
        ])
        gpr_sub_results_without = evaluate_model_detailed(
            gpr_sub_without, X_train_sub, X_test_sub, y_train_sub, y_test_sub, 
            f"GPR (subset {actual_size}, no rank)"
        )
        
        # GPR with rank (if available)
        if has_rank_vars and X_subset_with_rank is not None:
            kernel_sub_with = create_ard_kernel(X_subset_with_rank.shape[1])
            gpr_sub_with = Pipeline([
                ('scaler', StandardScaler()),
                ('gpr', GaussianProcessRegressor(kernel=kernel_sub_with, normalize_y=True, alpha=1e-3, random_state=42))
            ])
            gpr_sub_results_with = evaluate_model_detailed(
                gpr_sub_with, X_train_sub_rank, X_test_sub_rank, y_train_sub, y_test_sub, 
                f"GPR (subset {actual_size}, with rank)"
            )
    
    # Random Forest for comparison
    print("\n--- RANDOM FOREST (for comparison) ---")
    rf_sub_without = RandomForestRegressor(n_estimators=50, random_state=42)  # Fewer trees for speed
    rf_sub_results_without = evaluate_model_detailed(
        rf_sub_without, X_train_sub, X_test_sub, y_train_sub, y_test_sub, 
        f"RF (subset {actual_size}, no rank)"
    )
    
    if has_rank_vars and X_subset_with_rank is not None:
        rf_sub_with = RandomForestRegressor(n_estimators=50, random_state=42)
        rf_sub_results_with = evaluate_model_detailed(
            rf_sub_with, X_train_sub_rank, X_test_sub_rank, y_train_sub, y_test_sub, 
            f"RF (subset {actual_size}, with rank)"
        )
    
    # Plot parity plots for this subset
    print("\n--- PARITY PLOTS ---")
    plot_parity_detailed(ridge_sub_results_without, f" (subset {actual_size})")
    
    if has_rank_vars and X_subset_with_rank is not None:
        plot_parity_detailed(ridge_sub_results_with, f" (subset {actual_size})")
    
    if actual_size <= 100:
        plot_parity_detailed(gpr_sub_results_without, f" (subset {actual_size})")
        if has_rank_vars and X_subset_with_rank is not None:
            plot_parity_detailed(gpr_sub_results_with, f" (subset {actual_size})")
    
    plot_parity_detailed(rf_sub_results_without, f" (subset {actual_size})")
    if has_rank_vars and X_subset_with_rank is not None:
        plot_parity_detailed(rf_sub_results_with, f" (subset {actual_size})")
    
    print("\n" + "="*100)
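
One caveat when reading the results for the smallest subset: R² is not defined for a single test sample, because the total sum of squares around the test mean is zero. The hand-rolled function below is a simplified illustration, not scikit-learn's implementation; `r2_score` itself returns NaN (with a warning) when given fewer than two samples, and this notebook suppresses warnings.

```python
import math

def r2_manual(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot; returns NaN when SS_tot is zero (simplified)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return float("nan") if ss_tot == 0 else 1.0 - ss_res / ss_tot

single = r2_manual([0.40], [0.55])                  # one test sample: undefined
two_points = r2_manual([0.40, 0.60], [0.45, 0.55])  # two samples: well defined
print(f"Single sample: {single}, two samples: {two_points:.2f}")

# A NaN score never satisfies a threshold check such as r2 > 0.99
print(single > 0.99)
```

The `r2 > 0.99` warning in `evaluate_model_detailed` therefore cannot fire for a one-sample test set, whatever the quality of the prediction.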

## Summary and Comparison

In [None]:
# Compile all results (only from models that were actually run)
all_results = [
    hgb_results_without_rank, hgb_results_with_mae_rank, hgb_results_with_all_ranks,
    ridge_results_without_rank, ridge_results_with_mae_rank, ridge_results_with_all_ranks,
]

# Create summary dataframe (excluding models that were skipped)
summary_data = []
for result in all_results:
    if not np.isnan(result['r2']):  # Only include results that were actually computed
        summary_data.append({
            'Model': result['model'],
            'R²': result['r2'],
            'MAE': result['mae'],
            'RMSE': result['rmse']
        })

if summary_data:  # Only create summary if we have results
    summary_df = pd.DataFrame(summary_data)
    
    print("PERFORMANCE SUMMARY (Full Dataset - Only Computed Models):")
    print("=" * 80)
    print(summary_df.to_string(index=False, float_format='%.4f'))
    
    # Plot comparison
    fig, axes = plt.subplots(1, 3, figsize=(18, 5))
    
    # R² comparison
    summary_df.plot(x='Model', y='R²', kind='bar', ax=axes[0], rot=45)
    axes[0].set_title('R² Comparison')
    axes[0].set_ylabel('R² Score')
    axes[0].tick_params(axis='x', rotation=45)
    
    # MAE comparison
    summary_df.plot(x='Model', y='MAE', kind='bar', ax=axes[1], rot=45, color='orange')
    axes[1].set_title('MAE Comparison')
    axes[1].set_ylabel('Mean Absolute Error')
    axes[1].tick_params(axis='x', rotation=45)
    
    # RMSE comparison
    summary_df.plot(x='Model', y='RMSE', kind='bar', ax=axes[2], rot=45, color='green')
    axes[2].set_title('RMSE Comparison')
    axes[2].set_ylabel('Root Mean Squared Error')
    axes[2].tick_params(axis='x', rotation=45)
    
    plt.tight_layout()
    plt.show()
else:
    print("No models were evaluated on the full dataset due to performance constraints.")
    print("See the small subset analysis above for model comparisons.")

## Additional Investigation: Replicating Runze's Exact Approach

Let's try to replicate Runze's exact approach more closely to understand the strange behavior.

In [None]:
print("=" * 80)
print("ADDITIONAL INVESTIGATION: REPLICATING RUNZE'S EXACT APPROACH")
print("=" * 80)

# Try to replicate the exact conditions from Runze's code
# Let's examine what happens when we have rank variables vs not

# Create a very small subset in the spirit of Runze's st_000005 (9 samples),
# but with 20 samples so the train/test split has enough points
tiny_subset = df_encoded.sample(n=20, random_state=42)

print(f"Tiny subset size: {len(tiny_subset)}")
print(f"Target range in tiny subset: {tiny_subset['mae'].min():.4f} to {tiny_subset['mae'].max():.4f}")

# Check rank variables in tiny subset
rank_cols_in_data = [col for col in rank_features if col in tiny_subset.columns]
if rank_cols_in_data:
    print(f"\nRank variables found: {rank_cols_in_data}")
    for rank_col in rank_cols_in_data:
        rank_vals = tiny_subset[rank_col].values
        target_vals = tiny_subset['mae'].values
        print(f"{rank_col} range: {rank_vals.min()} to {rank_vals.max()}")
        corr = np.corrcoef(target_vals, rank_vals)[0, 1] if len(set(rank_vals)) > 1 else np.nan
        print(f"Correlation between mae and {rank_col}: {corr:.4f}")
        
        # Show the actual values to understand the relationship
        print(f"\nActual values in tiny subset:")
        for i in range(min(10, len(tiny_subset))):
            print(f"Sample {i}: mae={target_vals[i]:.4f}, {rank_col}={rank_vals[i]}")
else:
    print("No rank variables found in the dataset.")
    print("This might explain why we're not seeing the strange behavior.")
    print("The original dataset might have had rank variables that created data leakage.")

# Prepare data
y_tiny = tiny_subset['mae'].values
X_tiny_without_rank = tiny_subset[features_without_rank].values

if rank_cols_in_data:
    features_with_available_ranks = features_without_rank + rank_cols_in_data
    X_tiny_with_rank = tiny_subset[features_with_available_ranks].values
else:
    X_tiny_with_rank = None

# Split: hold out one fifth of the tiny subset (4 of 20 samples) for testing
test_size = max(1, len(tiny_subset) // 5)
X_train_tiny, X_test_tiny, y_train_tiny, y_test_tiny = train_test_split(
    X_tiny_without_rank, y_tiny, test_size=test_size, random_state=42
)

print(f"\nTiny dataset split: {len(y_train_tiny)} train, {len(y_test_tiny)} test")

if X_tiny_with_rank is not None:
    X_train_tiny_rank, X_test_tiny_rank, _, _ = train_test_split(
        X_tiny_with_rank, y_tiny, test_size=test_size, random_state=42
    )

# Test Ridge regression (which showed strange behavior in Runze's code)
print("\n--- RIDGE REGRESSION ON TINY DATASET ---")

ridge_tiny_without = Pipeline([
    ('scaler', StandardScaler()),
    ('ridge', Ridge(alpha=1.0, random_state=42))
])

ridge_tiny_results_without = evaluate_model_detailed(
    ridge_tiny_without, X_train_tiny, X_test_tiny, y_train_tiny, y_test_tiny, 
    "Ridge (tiny dataset, no rank)"
)

if X_tiny_with_rank is not None:
    ridge_tiny_with = Pipeline([
        ('scaler', StandardScaler()),
        ('ridge', Ridge(alpha=1.0, random_state=42))
    ])
    
    ridge_tiny_results_with = evaluate_model_detailed(
        ridge_tiny_with, X_train_tiny_rank, X_test_tiny_rank, y_train_tiny, y_test_tiny, 
        "Ridge (tiny dataset, with rank)"
    )

# Test GPR (which also showed strange behavior)
print("\n--- GPR ON TINY DATASET ---")

kernel_tiny = create_ard_kernel(len(features_without_rank))
gpr_tiny_without = Pipeline([
    ('scaler', StandardScaler()),
    ('gpr', GaussianProcessRegressor(kernel=kernel_tiny, normalize_y=True, alpha=1e-3, random_state=42))
])

gpr_tiny_results_without = evaluate_model_detailed(
    gpr_tiny_without, X_train_tiny, X_test_tiny, y_train_tiny, y_test_tiny, 
    "GPR (tiny dataset, no rank)"
)

if X_tiny_with_rank is not None:
    kernel_tiny_with = create_ard_kernel(X_tiny_with_rank.shape[1])
    gpr_tiny_with = Pipeline([
        ('scaler', StandardScaler()),
        ('gpr', GaussianProcessRegressor(kernel=kernel_tiny_with, normalize_y=True, alpha=1e-3, random_state=42))
    ])
    
    gpr_tiny_results_with = evaluate_model_detailed(
        gpr_tiny_with, X_train_tiny_rank, X_test_tiny_rank, y_train_tiny, y_test_tiny, 
        "GPR (tiny dataset, with rank)"
    )

# Plot results
print("\n--- PARITY PLOTS FOR TINY DATASET ---")
plot_parity_detailed(ridge_tiny_results_without, " (tiny dataset)")
if X_tiny_with_rank is not None:
    plot_parity_detailed(ridge_tiny_results_with, " (tiny dataset)")

plot_parity_detailed(gpr_tiny_results_without, " (tiny dataset)")
if X_tiny_with_rank is not None:
    plot_parity_detailed(gpr_tiny_results_with, " (tiny dataset)")

### Analysis of Why Strange Behavior Might Not Be Reproduced

If we're not seeing the near-perfect parity plots that Runze observed, here are possible reasons:

1. **Missing Rank Variables**: The current dataset might not have the rank variables that created data leakage in Runze's analysis
2. **Different Data Processing**: Our preprocessing (encoding categorical variables) might be different from Runze's approach
3. **Different Sampling**: Even with the same random seed, the exact samples might differ due to data ordering
4. **Model Hyperparameters**: Slight differences in model configuration could affect overfitting behavior

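The most likely explanation is that **rank variables create information leakage**. A rank such as `mae_rank` is derived from the target itself and is a monotonic function of it, so a model that receives the rank only has to learn a one-dimensional rank-to-target mapping. On small datasets this yields artificially perfect parity plots that say nothing about how well the hyperparameters themselves predict performance.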
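
The leakage mechanism can be illustrated on synthetic data. The sketch below is not the notebook's dataset or Runze's code: the features are random noise with no relationship to the target, and `y_rank` is a stand-in for a column like `mae_rank`. Under those assumptions, adding the rank column alone should be enough to move a scaled Ridge model from no predictive skill to a high held-out R².

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_samples = 200

# Synthetic stand-ins (not the CrabNet data): features carry no signal about y.
X = rng.normal(size=(n_samples, 5))
y = rng.normal(size=n_samples)

# Rank of the target (1 = smallest), computed from y itself, so it leaks y.
y_rank = y.argsort().argsort() + 1


def holdout_r2(features):
    """Fit a scaled Ridge model on 80% of the rows; return R^2 on the other 20%."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, y, test_size=0.2, random_state=42
    )
    model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
    model.fit(X_tr, y_tr)
    return r2_score(y_te, model.predict(X_te))


r2_without_rank = holdout_r2(X)
r2_with_rank = holdout_r2(np.column_stack([X, y_rank]))

print(f"R^2 without rank feature: {r2_without_rank:.3f}")
print(f"R^2 with rank feature:    {r2_with_rank:.3f}")
```

Because the features are pure noise, any skill in the second model comes entirely from the rank column.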

## Analysis and Insights

### Key Findings:

1. **Performance Constraints**: RFR, SVR, and GPR are too computationally expensive to run on the full dataset (173k+ samples), so we limited evaluation to small subsets and faster models (Histogram Gradient Boosting and Ridge Regression) on the full dataset.

2. **Impact of Rank Variables**: The rank variables (especially `mae_rank`) can improve apparent model performance, but only because they are derived from the target variable itself. They encode the target's ordering, including its noise, rather than any independent information about the hyperparameters.

3. **Model Behavior with Rank Variables**:
   - **Histogram Gradient Boosting**: Tree-based models can benefit from rank variables as they capture non-linear relationships.
   - **Ridge Regression**: Linear models like Ridge may show dramatic improvement with rank variables, especially on small datasets.
   - **GPR with ARD**: Gaussian processes can adaptively weight features, so rank variables might lead to overfitting on small datasets.
   - **Random Forest and SVR**: These models also benefit from additional rank information when evaluated on small subsets.

4. **Small Dataset Effects**: The near-perfect parity plots mentioned in the issue likely occur because:
   - Small datasets are easier to overfit
   - Rank variables provide direct information about target variable ordering
   - Models with high capacity (like GPR) can memorize small datasets
   - The relationship between rank variables and targets creates **information leakage**

5. **Data Leakage Investigation**: Our analysis suggests that rank variables (if present) would create information leakage by providing direct information about the relative ordering of target values. This explains the suspiciously perfect parity plots on small datasets.
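
One way to screen for this kind of leakage before modeling is to check how strongly each candidate feature tracks the target's ordering. The helper below is a hypothetical sketch, not part of the notebook: `flag_leaky_features`, the 0.99 threshold, and the `batch_size` column in the demo frame are illustrative assumptions, and it expects numeric columns.

```python
import numpy as np
import pandas as pd


def flag_leaky_features(df, target, threshold=0.99):
    """Return columns whose |Spearman correlation| with `target` exceeds `threshold`."""
    target_ranks = df[target].rank()
    flagged = []
    for col in df.columns.drop(target):
        # Spearman correlation is the Pearson correlation of the ranks.
        rho = df[col].rank().corr(target_ranks)
        if abs(rho) > threshold:
            flagged.append(col)
    return flagged


# Synthetic demo frame (hypothetical columns, not the Zenodo data).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "batch_size": rng.integers(32, 512, size=100),
    "mae": rng.gamma(shape=2.0, scale=0.1, size=100),
})
demo["mae_rank"] = demo["mae"].rank()

leaky = flag_leaky_features(demo, target="mae")
print(leaky)
```

A column flagged here is not necessarily leaked, but a near-perfect rank correlation with the target is a strong hint that it was computed from the target and should be excluded from the feature set.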

### Why Strange Behavior Might Not Be Fully Reproduced:

- **Dataset Differences**: The current dataset processing might differ from Runze's original setup
- **Missing Rank Variables**: The specific rank variables that caused data leakage might not be present in our current feature set
- **Preprocessing Differences**: Our categorical encoding approach might affect the results

### Recommendations:

1. **For Production Use**: Use models without rank variables for fair evaluation of hyperparameter optimization
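2. **For Surrogate Modeling**: Rank variables might be acceptable if the goal is to predict relative performance, but only when the ranks are known at prediction time, which is not the case for hyperparameter combinations that have not been evaluated yet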
3. **Cross-Validation**: Use proper cross-validation to avoid overfitting, especially with small datasets
4. **Feature Importance**: Analyze feature importance to understand which hyperparameters are most influential
5. **Computational Constraints**: For large datasets, prioritize faster models like Histogram Gradient Boosting over slower ones like GPR and SVR
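
The cross-validation recommendation could look like the following minimal sketch. It uses synthetic data in place of a small hyperparameter subset, and the fold count and Ridge settings are illustrative choices rather than values taken from the notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for a small hyperparameter subset (no rank columns).
X = rng.normal(size=(60, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=60)

# The scaler lives inside the pipeline so it is refit on each training fold.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="r2")

print(f"R^2 per fold: {np.round(scores, 3)}")
print(f"Mean +/- std: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the spread across folds, rather than a single train/test split with one or two test samples, gives a more honest picture of how stable a score on a tiny subset really is.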

## Conclusion

This analysis indicates that rank variables, when present, can strongly affect apparent model performance. The near-perfect parity plots observed with Ridge regression and GPR on small subsets are likely due to:

1. **Information Leakage**: Rank variables provide direct information about target variable ordering
2. **Overfitting**: Small datasets are susceptible to overfitting, especially with high-capacity models
3. **Model Capacity**: GPR can interpolate small datasets almost exactly, and even a regularized linear model fits well when a rank feature tracks the target closely

For practical hyperparameter optimization, we recommend training models on the original hyperparameters only, without rank variables, to ensure fair evaluation and avoid data leakage.