This file was one of the first approaches we took, and a naive rude awakening to the processing power needed. We wanted to just get a rough idea of how the different models performed against each other. This was also before we began to bucket breeds.

In this file, we run 6 different models, and do hyperparameter tuning on each. Naively, we ran all 6 models tuning in one cell, which took several hours before crashing. Part of the results were saved though, and we used the resulting parameters for Neural Nets on the experimentMLPNeuralNet file.

This was where we started to separate out our experiments, but it did give us some insight into general performance of the different models.

In [1]:
import re
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier, HistGradientBoostingClassifier
from sklearn.neural_network import MLPClassifier

def age_to_weeks(age_str):
    if pd.isnull(age_str):
        return np.nan
    match = re.match(r'(\d+)\s+(\w+)', age_str)
    if match:
        num, unit = int(match.group(1)), match.group(2).lower()
        if 'day' in unit:
            return num // 7
        elif 'week' in unit:
            return num
        elif 'month' in unit:
            return num * 4
        elif 'year' in unit:
            return num * 52
    return np.nan

Just gaining information about the dataset here.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the full dataset
df_full = pd.read_csv("train.csv")

# Choose a sample size (e.g., 40,000 rows)
sample_size = 15000

# Use train_test_split with the 'stratify' parameter to get a representative sample.
# Here, we use the Outcome Type column as the stratification key.
sample_df, _ = train_test_split(
    df_full,
    train_size=sample_size,
    stratify=df_full["Outcome Type"],
    # random_state=42 # may want this to change so we get a different random sample everytime.
)

# Optional: Compare the distributions in the sample vs. full dataset
print("Sample distribution of Outcome Type:")
print(sample_df["Outcome Type"].value_counts(normalize=True))

print("\nOverall distribution of Outcome Type:")
print(df_full["Outcome Type"].value_counts(normalize=True))


Sample distribution of Outcome Type:
Adoption           0.495200
Transfer           0.315075
Return to Owner    0.149325
Euthanasia         0.031025
Died               0.009375
Name: Outcome Type, dtype: float64

Overall distribution of Outcome Type:
Adoption           0.495191
Transfer           0.315086
Return to Owner    0.149329
Euthanasia         0.031028
Died               0.009365
Name: Outcome Type, dtype: float64


Unused

In [None]:

# === Helper Functions ===
def simplify_color(color_str):
    # Example function to simplify color names.
    # Customize this as needed.
    if pd.isna(color_str):
        return "Unknown"
    return color_str.split('/')[0].strip()

def age_to_weeks(age_str):
    # Example conversion: assumes age is a string like "2 years" or "6 months"
    # Customize this conversion logic depending on your input format.
    if pd.isna(age_str):
        return np.nan
    tokens = age_str.split()
    if 'year' in tokens[1]:
        return float(tokens[0]) * 52
    elif 'month' in tokens[1]:
        return float(tokens[0]) * 4
    elif 'week' in tokens[1]:
        return float(tokens[0])
    else:
        return np.nan



# === 2. Target & ID Setup ===
target_col = "Outcome Type"
id_col = "Id"
drop_cols = ['Found Location', 'Date of Birth', 'Name', target_col, id_col, 'Outcome Time']
X = sample_df.drop(columns=drop_cols, errors='ignore')
y = train_df[target_col]


Early preprocessing attempts, unused after.

In [None]:
# === Load Data ===
# train_df = pd.read_csv("train.csv")
train_df = pd.read_csv("train.csv").sample(n=40000, random_state=42)

# print(train_df[['Intake Time', 'Outcome Time']].head(10))


# === Target & ID ===
target_col = "Outcome Type"
id_col = "Id"

# Drop high-leakage or irrelevant features
drop_cols = ['Found Location', 'Date of Birth', 'Name', target_col, id_col]
X = train_df.drop(columns=drop_cols, errors='ignore')
y = train_df[target_col]

# === Encode Target ===
le_y = LabelEncoder()
y_encoded = le_y.fit_transform(y)

# === Convert age to weeks ===
X['Age in Weeks'] = X['Age upon Intake'].apply(age_to_weeks)
X.drop(columns=['Age upon Intake'], inplace=True)

# === Add new column: total time in shelter === 
# datetime_format = "%m/%d/%Y %I:%M:%S %p"

# X['Intake Time'] = pd.to_datetime(X['Intake Time'], format=datetime_format, errors='coerce')
# X['Outcome Time'] = pd.to_datetime(X['Outcome Time'], format=datetime_format, errors='coerce')

# X['Time in Shelter (Days)'] = (X['Outcome Time'] - X['Intake Time']).dt.total_seconds() / (60*60*24)

# X = X.drop(columns=['Intake Time', 'Outcome Time'])

# Fill numeric columns with median, and object columns with "Unknown"
filledNumericVals = 0
filledCatVals = 0
for col in X.columns:
    if X[col].dtype == 'object':
        X[col] = X[col].fillna("Unknown")
        filledCatVals += 1
    else:
        X[col] = X[col].fillna(X[col].median())
        filledNumericVals += 1

print(f'fillednumeric: {filledNumericVals}, filledCat: {filledCatVals}')

# === One-hot encode categoricals ===
# X = pd.get_dummies(X)
categorical_cols = X.select_dtypes(include='object').columns
X = pd.get_dummies(X, columns=categorical_cols, drop_first=True)


# === Ensure all features are numeric ===
X = X.astype(float)

# print(train_df.head())
X.head()

fillednumeric: 2, filledCat: 6


Unnamed: 0,Age in Weeks,Time in Shelter (Days),Intake Type_Euthanasia Request,Intake Type_Owner Surrender,Intake Type_Public Assist,Intake Type_Stray,Intake Condition_Behavior,Intake Condition_Congenital,Intake Condition_Feral,Intake Condition_Injured,...,Color_White/Yellow Brindle,Color_Yellow,Color_Yellow Brindle,Color_Yellow Brindle/Black,Color_Yellow Brindle/White,Color_Yellow/Black,Color_Yellow/Gray,Color_Yellow/Orange,Color_Yellow/Tan,Color_Yellow/White
88566,156.0,10.875694,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
88441,52.0,2.7,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
60549,416.0,9.667361,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
46272,104.0,8.998611,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
963,0.0,0.050694,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [12]:
!pip install tqdm.notebook

Defaulting to user installation because normal site-packages is not writeable


ERROR: Could not find a version that satisfies the requirement tqdm.notebook (from versions: none)
ERROR: No matching distribution found for tqdm.notebook


Preprocessing that we would come to use in the other models, included age engineering and color bucketing.

In [1]:
# PREPROCESSING

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier  # For KNN
import lightgbm as lgb
import warnings
import xgboost as xgb
from tqdm.notebook import tqdm
warnings.filterwarnings('ignore')

# === Helper Functions ===
def simplify_color(color_str):
    """
    Map raw color strings to a smaller set of standardized categories.
    This function checks for the presence of common color keywords.
    """
    if pd.isna(color_str):
        return "unknown"
    color_str = color_str.lower()
    if "black" in color_str:
        return "black"
    elif "brown" in color_str:
        return "brown"
    elif "white" in color_str:
        return "white"
    elif "tan" in color_str or "gold" in color_str:
        return "tan_gold"
    elif "grey" in color_str or "gray" in color_str:
        return "gray"
    else:
        return "other"

def age_to_weeks(age_str):
    """
    Convert an age string into estimated weeks.
    e.g., "2 years" becomes 104 weeks and "6 months" becomes about 24 weeks.
    """
    if pd.isna(age_str):
        return np.nan
    tokens = age_str.split()
    if len(tokens) < 2:
        return np.nan
    if 'year' in tokens[1]:
        return float(tokens[0]) * 52
    elif 'month' in tokens[1]:
        return float(tokens[0]) * 4
    elif 'week' in tokens[1]:
        return float(tokens[0])
    else:
        return np.nan

# === 1. Load and Subsample Data ===
# For prototyping, we select a stratified random sample of ~40,000 rows
# to maintain the same target distribution as the full dataset.
df_full = pd.read_csv("train.csv")
# Option A: Using groupby with sample (ensures stratification)
# sample_fraction = 40000 / len(df_full)
# sample_df = df_full.groupby("Outcome Type", group_keys=False).apply(lambda x: x.sample(frac=sample_fraction, random_state=42))
# Option B (alternative): Using train_test_split's stratify option:
sample_df, _ = train_test_split(df_full, train_size=12000, stratify=df_full["Outcome Type"], random_state=42)

# === 2. Target & ID Setup ===
target_col = "Outcome Type"
id_col = "Id"
# Drop columns that are either high-leakage or not available in the test set.
# Here, we drop Outcome Time as well since it isn't present in the test data.
drop_cols = ['Found Location', 'Date of Birth', 'Name', target_col, id_col, 'Outcome Time']
X = sample_df.drop(columns=drop_cols, errors='ignore')
y = sample_df[target_col]

# === 3. Encode the Target Variable ===
# Although there are only five outcomes, encoding ensures compatibility with all models.
le_y = LabelEncoder()
y_encoded = le_y.fit_transform(y)

# === 4. Feature Engineering ===
# Simplify the 'Color' feature into standardized categories
X['Color Category'] = X['Color'].apply(simplify_color)
X = X.drop(columns=['Color'])

# Convert Age to Weeks
X['Age in Weeks'] = X['Age upon Intake'].apply(age_to_weeks)
X.drop(columns=['Age upon Intake'], inplace=True)

# Process Intake Time if necessary. Since Outcome Time is not used and not in test data,
# we drop Intake Time as well (or extract features if desired).
datetime_format = "%m/%d/%Y %I:%M:%S %p"
X['Intake Time'] = pd.to_datetime(X['Intake Time'], format=datetime_format, errors='coerce')
# If you wish to derive time-based features (like hour or day of week), do it here.
# For now, we drop it to ensure consistency with the test set.
X = X.drop(columns=['Intake Time'])

# Fill missing values: For object columns, fill with "Unknown"; for numeric, fill with median.
for col in X.columns:
    if X[col].dtype == 'object':
        X[col] = X[col].fillna("Unknown")
    else:
        X[col] = X[col].fillna(X[col].median())

# === 5. Encode Categorical Variables ===
categorical_cols = X.select_dtypes(include='object').columns
print("categorical cols", categorical_cols)
X = pd.get_dummies(X, columns=categorical_cols, drop_first=True)

# Ensure all features are numeric
X = X.astype(float)

print("Processed feature sample:")
X.head()



categorical cols Index(['Intake Type', 'Intake Condition', 'Animal Type', 'Sex upon Intake',
       'Breed', 'Color Category'],
      dtype='object')
Processed feature sample:


Unnamed: 0,Age in Weeks,Intake Type_Euthanasia Request,Intake Type_Owner Surrender,Intake Type_Public Assist,Intake Type_Stray,Intake Condition_Behavior,Intake Condition_Feral,Intake Condition_Injured,Intake Condition_Med Attn,Intake Condition_Medical,...,Breed_Yorkshire Terrier/Chihuahua Shorthair,Breed_Yorkshire Terrier/Maltese,Breed_Yorkshire Terrier/Miniature Poodle,Breed_Yorkshire Terrier/Norfolk Terrier,Breed_Yorkshire Terrier/Rat Terrier,Color Category_brown,Color Category_gray,Color Category_other,Color Category_tan_gold,Color Category_white
42893,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
45707,8.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
68267,52.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
39689,28.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
81220,52.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


Defining the parameters that would be tested on each model and running the tuning to find the best parameters.

In [None]:
# === 6. Split Data into Training and Validation Sets ===
X_train, X_val, y_train, y_val = train_test_split(X, y_encoded, test_size=0.2, 
                                                  random_state=42, stratify=y_encoded)
print(f"Train set shape: {X_train.shape}, Validation set shape: {X_val.shape}")

# === 7. Define Models and Parameter Grids ===
# Each model is mapped to a tuple: (estimator instance, parameter grid dictionary)

# Logistic Regression:
# - C: Inverse regularization strength (smaller means more regularization).
# - penalty: Regularization norm. ('l2' is common)
# - solver: The optimization algorithm.
lr_param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'penalty': ['l2'],
    'solver': ['lbfgs']
}

# Random Forest:
# - n_estimators: Number of trees.
# - max_depth: Maximum depth of each tree.
# - min_samples_split: Minimum number of samples required to split an internal node.
# - min_samples_leaf: Minimum samples required to be a leaf.
# - max_features: Number of features to consider when splitting.
rf_param_grid = {
    'n_estimators': [100, 200, 500],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2']
}

# LightGBM:
# - num_leaves: Maximum number of leaves per tree (controls model complexity).
# - max_depth: Maximum tree depth (-1 indicates no limit).
# - learning_rate: Shrinkage factor for boosting (lower values slow training but can improve performance).
# - n_estimators: Number of boosting rounds (trees).
# - min_child_samples: Minimum number of samples per leaf.
# - subsample: Fraction of data used per boosting round.
# - colsample_bytree: Fraction of features used per tree.
lgb_param_grid = {
    'num_leaves': [31, 50, 70],
    'max_depth': [5, 10, 15, -1],
    'learning_rate': [0.1, 0.01, 0.001],
    'n_estimators': [100, 200, 500],
    'min_child_samples': [5, 10, 20],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0]
}

# XGBoost:
# - max_depth: Maximum depth of a tree.
# - learning_rate: Step size shrinkage.
# - n_estimators: Number of boosting rounds.
# - subsample: Fraction of training instances used for growing trees.
# - colsample_bytree: Fraction of features used per tree.
# - gamma: Minimum loss reduction required to make a split.
xgb_param_grid = {
    'max_depth': [3, 6, 10],
    'learning_rate': [0.1, 0.01, 0.001],
    'n_estimators': [100, 200, 500],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'gamma': [0, 1, 5]
}

# MLP Neural Network (scikit-learn's implementation):
# - hidden_layer_sizes: Tuple representing the number of neurons per hidden layer.
# - activation: Activation function for the hidden layers.
# - solver: The optimization algorithm (adam is commonly used).
# - alpha: L2 regularization (penalty term).
# - learning_rate_init: Initial learning rate.
# - max_iter: Maximum number of training iterations (epochs).
mlp_param_grid = {
    'hidden_layer_sizes': [(50,), (100,), (100, 50)],
    'activation': ['relu', 'tanh'],
    'solver': ['adam'],
    'alpha': [0.0001, 0.001, 0.01],
    'learning_rate_init': [0.001, 0.01],
    'max_iter': [200, 300]
}

# K-Nearest Neighbors (KNN):
# - n_neighbors: Number of neighbors to use for predictions.
# - weights: The weight function used in prediction ('uniform' or 'distance').
# - p: Power parameter for the Minkowski metric (p=1: Manhattan, p=2: Euclidean).
knn_param_grid = {
    'n_neighbors': [3, 13, 23, 53, 103],
    'weights': ['uniform', 'distance'],
    'p': [1, 2]
}

# Create the dictionary of models (without the TF model)
models = {
    'Logistic Regression': (LogisticRegression(max_iter=200, n_jobs=-1, random_state=42), lr_param_grid),
    'Random Forest': (RandomForestClassifier(random_state=42, n_jobs=-1), rf_param_grid),
    'LightGBM': (lgb.LGBMClassifier(random_state=42, n_jobs=-1), lgb_param_grid),
    'XGBoost': (xgb.XGBClassifier(random_state=42, n_jobs=-1, use_label_encoder=False, eval_metric='mlogloss'), xgb_param_grid),
    'MLP Neural Net': (MLPClassifier(random_state=42), mlp_param_grid),
    'KNN': (KNeighborsClassifier(), knn_param_grid)
}

# === 8. Hyperparameter Tuning with a Progress Bar (Outer Loop) ===
results = {}
# Loop through each model using tqdm to show progress on the outer loop (i.e. over models).
for model_name, (estimator, param_grid) in tqdm(models.items(), desc="Tuning Models"):
    print(f"\n--- Tuning {model_name} ---")
    search = RandomizedSearchCV(estimator=estimator,
                                param_distributions=param_grid,
                                n_iter=20,  # Evaluate 20 random parameter combinations for this model
                                scoring='accuracy',
                                cv=3,       # Use 3-fold cross-validation for speed
                                verbose=1,
                                random_state=42,
                                n_jobs=-1)
    search.fit(X_train, y_train)
    
    # Print details of each candidate that was tested
    print(f"\nCandidate hyperparameter settings and CV scores for {model_name}:")
    for i, (params, score) in enumerate(zip(search.cv_results_['params'], 
                                              search.cv_results_['mean_test_score'])):
        print(f"Candidate {i+1}: Parameters: {params}, Mean CV Score: {score:.4f}")
    
    results[model_name] = {
        'best_params': search.best_params_,
        'best_score': search.best_score_,
        'validation_score': search.score(X_val, y_val)
    }
    print(f"\n{model_name} best params: {search.best_params_}")
    print(f"{model_name} best CV accuracy: {search.best_score_:.4f}")
    print(f"{model_name} validation accuracy: {search.score(X_val, y_val):.4f}")

# Display a summary of the tuning results.
print("\nSummary of Tuning Results:")
for model_name, res in results.items():
    print(f"{model_name}: CV Acc = {res['best_score']:.4f}, Validation Acc = {res['validation_score']:.4f}")


Train set shape: (32000, 1576), Validation set shape: (8000, 1576)


Tuning Models:   0%|          | 0/6 [00:00<?, ?it/s]


--- Tuning Logistic Regression ---
Fitting 3 folds for each of 5 candidates, totalling 15 fits

Candidate hyperparameter settings and CV scores for Logistic Regression:
Candidate 1: Parameters: {'solver': 'lbfgs', 'penalty': 'l2', 'C': 0.01}, Mean CV Score: 0.5839
Candidate 2: Parameters: {'solver': 'lbfgs', 'penalty': 'l2', 'C': 0.1}, Mean CV Score: 0.5879
Candidate 3: Parameters: {'solver': 'lbfgs', 'penalty': 'l2', 'C': 1}, Mean CV Score: 0.5856
Candidate 4: Parameters: {'solver': 'lbfgs', 'penalty': 'l2', 'C': 10}, Mean CV Score: 0.5895
Candidate 5: Parameters: {'solver': 'lbfgs', 'penalty': 'l2', 'C': 100}, Mean CV Score: 0.5839

Logistic Regression best params: {'solver': 'lbfgs', 'penalty': 'l2', 'C': 10}
Logistic Regression best CV accuracy: 0.5895
Logistic Regression validation accuracy: 0.5998

--- Tuning Random Forest ---
Fitting 3 folds for each of 20 candidates, totalling 60 fits

Candidate hyperparameter settings and CV scores for Random Forest:
Candidate 1: Parameters: 

NOTE: THE SECTION BELOW THIS IS COPY PASTED RESULTS OF THE TUNING! 


My computer crashed trying to run the section above. I needed somewhere to store the results because I was worried that we'd lose them when we tried to reset the kernel.


------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------


--- Tuning Logistic Regression ---
Fitting 3 folds for each of 5 candidates, totalling 15 fits

Candidate hyperparameter settings and CV scores for Logistic Regression:
Candidate 1: Parameters: {'solver': 'lbfgs', 'penalty': 'l2', 'C': 0.01}, Mean CV Score: 0.5839
Candidate 2: Parameters: {'solver': 'lbfgs', 'penalty': 'l2', 'C': 0.1}, Mean CV Score: 0.5879
Candidate 3: Parameters: {'solver': 'lbfgs', 'penalty': 'l2', 'C': 1}, Mean CV Score: 0.5856
Candidate 4: Parameters: {'solver': 'lbfgs', 'penalty': 'l2', 'C': 10}, Mean CV Score: 0.5895
Candidate 5: Parameters: {'solver': 'lbfgs', 'penalty': 'l2', 'C': 100}, Mean CV Score: 0.5839

Logistic Regression best params: {'solver': 'lbfgs', 'penalty': 'l2', 'C': 10}
Logistic Regression best CV accuracy: 0.5895
Logistic Regression validation accuracy: 0.5998

--- Tuning Random Forest ---
Fitting 3 folds for each of 20 candidates, totalling 60 fits

Candidate hyperparameter settings and CV scores for Random Forest:
Candidate 1: Parameters: {'n_estimators': 100, 'min_samples_split': 5, 'min_samples_leaf': 2, 'max_features': 'log2', 'max_depth': 30}, Mean CV Score: 0.5089
Candidate 2: Parameters: {'n_estimators': 100, 'min_samples_split': 10, 'min_samples_leaf': 4, 'max_features': 'log2', 'max_depth': 30}, Mean CV Score: 0.4952
Candidate 3: Parameters: {'n_estimators': 100, 'min_samples_split': 5, 'min_samples_leaf': 1, 'max_features': 'log2', 'max_depth': 20}, Mean CV Score: 0.5262
Candidate 4: Parameters: {'n_estimators': 100, 'min_samples_split': 10, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': 30}, Mean CV Score: 0.5951
Candidate 5: Parameters: {'n_estimators': 100, 'min_samples_split': 10, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': None}, Mean CV Score: 0.6089
Candidate 6: Parameters: {'n_estimators': 100, 'min_samples_split': 5, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 20}, Mean CV Score: 0.5908
Candidate 7: Parameters: {'n_estimators': 500, 'min_samples_split': 2, 'min_samples_leaf': 4, 'max_features': 'sqrt', 'max_depth': 30}, Mean CV Score: 0.5713
Candidate 8: Parameters: {'n_estimators': 200, 'min_samples_split': 2, 'min_samples_leaf': 4, 'max_features': 'sqrt', 'max_depth': 10}, Mean CV Score: 0.5326
Candidate 9: Parameters: {'n_estimators': 200, 'min_samples_split': 10, 'min_samples_leaf': 2, 'max_features': 'log2', 'max_depth': 30}, Mean CV Score: 0.5083
Candidate 10: Parameters: {'n_estimators': 200, 'min_samples_split': 5, 'min_samples_leaf': 1, 'max_features': 'log2', 'max_depth': 20}, Mean CV Score: 0.5266
Candidate 11: Parameters: {'n_estimators': 100, 'min_samples_split': 10, 'min_samples_leaf': 4, 'max_features': 'log2', 'max_depth': 20}, Mean CV Score: 0.4952
Candidate 12: Parameters: {'n_estimators': 100, 'min_samples_split': 5, 'min_samples_leaf': 1, 'max_features': 'log2', 'max_depth': None}, Mean CV Score: 0.5932
Candidate 13: Parameters: {'n_estimators': 200, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 'log2', 'max_depth': 30}, Mean CV Score: 0.5434
Candidate 14: Parameters: {'n_estimators': 100, 'min_samples_split': 2, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': None}, Mean CV Score: 0.6077
Candidate 15: Parameters: {'n_estimators': 200, 'min_samples_split': 5, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': 10}, Mean CV Score: 0.5356
Candidate 16: Parameters: {'n_estimators': 500, 'min_samples_split': 5, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': 30}, Mean CV Score: 0.5952
Candidate 17: Parameters: {'n_estimators': 200, 'min_samples_split': 2, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': 30}, Mean CV Score: 0.5953
Candidate 18: Parameters: {'n_estimators': 100, 'min_samples_split': 2, 'min_samples_leaf': 4, 'max_features': 'sqrt', 'max_depth': None}, Mean CV Score: 0.5805
Candidate 19: Parameters: {'n_estimators': 200, 'min_samples_split': 2, 'min_samples_leaf': 4, 'max_features': 'log2', 'max_depth': 20}, Mean CV Score: 0.4952
Candidate 20: Parameters: {'n_estimators': 100, 'min_samples_split': 5, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': 10}, Mean CV Score: 0.5347

Random Forest best params: {'n_estimators': 100, 'min_samples_split': 10, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': None}
Random Forest best CV accuracy: 0.6089
Random Forest validation accuracy: 0.6191

--- Tuning LightGBM ---
Fitting 3 folds for each of 20 candidates, totalling 60 fits
[LightGBM] [Warning] Found whitespace in feature_names, replace with underlines
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.021993 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 501
[LightGBM] [Info] Number of data points in the train set: 32000, number of used features: 233
[LightGBM] [Info] Start training from score -0.702819
[LightGBM] [Info] Start training from score -4.669709
[LightGBM] [Info] Start training from score -3.472761
[LightGBM] [Info] Start training from score -1.901505
[LightGBM] [Info] Start training from score -1.154984

Candidate hyperparameter settings and CV scores for LightGBM:
Candidate 1: Parameters: {'subsample': 1.0, 'num_leaves': 50, 'n_estimators': 500, 'min_child_samples': 10, 'max_depth': 15, 'learning_rate': 0.001, 'colsample_bytree': 0.6}, Mean CV Score: 0.5701
Candidate 2: Parameters: {'subsample': 0.8, 'num_leaves': 70, 'n_estimators': 500, 'min_child_samples': 20, 'max_depth': -1, 'learning_rate': 0.1, 'colsample_bytree': 0.8}, Mean CV Score: 0.5969
Candidate 3: Parameters: {'subsample': 1.0, 'num_leaves': 50, 'n_estimators': 500, 'min_child_samples': 20, 'max_depth': 10, 'learning_rate': 0.1, 'colsample_bytree': 0.8}, Mean CV Score: 0.6077
Candidate 4: Parameters: {'subsample': 0.6, 'num_leaves': 70, 'n_estimators': 200, 'min_child_samples': 10, 'max_depth': 10, 'learning_rate': 0.1, 'colsample_bytree': 0.8}, Mean CV Score: 0.6156
Candidate 5: Parameters: {'subsample': 0.6, 'num_leaves': 31, 'n_estimators': 500, 'min_child_samples': 5, 'max_depth': 5, 'learning_rate': 0.001, 'colsample_bytree': 0.8}, Mean CV Score: 0.5652
Candidate 6: Parameters: {'subsample': 0.6, 'num_leaves': 31, 'n_estimators': 200, 'min_child_samples': 20, 'max_depth': 15, 'learning_rate': 0.1, 'colsample_bytree': 1.0}, Mean CV Score: 0.6171
Candidate 7: Parameters: {'subsample': 0.8, 'num_leaves': 70, 'n_estimators': 100, 'min_child_samples': 20, 'max_depth': 10, 'learning_rate': 0.01, 'colsample_bytree': 0.6}, Mean CV Score: 0.6129
Candidate 8: Parameters: {'subsample': 1.0, 'num_leaves': 50, 'n_estimators': 500, 'min_child_samples': 5, 'max_depth': -1, 'learning_rate': 0.1, 'colsample_bytree': 0.8}, Mean CV Score: 0.5971
Candidate 9: Parameters: {'subsample': 0.6, 'num_leaves': 70, 'n_estimators': 100, 'min_child_samples': 5, 'max_depth': 5, 'learning_rate': 0.01, 'colsample_bytree': 0.6}, Mean CV Score: 0.6018
Candidate 10: Parameters: {'subsample': 0.6, 'num_leaves': 70, 'n_estimators': 500, 'min_child_samples': 5, 'max_depth': 15, 'learning_rate': 0.01, 'colsample_bytree': 0.8}, Mean CV Score: 0.6226
Candidate 11: Parameters: {'subsample': 1.0, 'num_leaves': 31, 'n_estimators': 100, 'min_child_samples': 10, 'max_depth': 15, 'learning_rate': 0.1, 'colsample_bytree': 1.0}, Mean CV Score: 0.6219
Candidate 12: Parameters: {'subsample': 0.8, 'num_leaves': 50, 'n_estimators': 500, 'min_child_samples': 10, 'max_depth': 10, 'learning_rate': 0.1, 'colsample_bytree': 0.6}, Mean CV Score: 0.6087
Candidate 13: Parameters: {'subsample': 1.0, 'num_leaves': 31, 'n_estimators': 200, 'min_child_samples': 20, 'max_depth': 5, 'learning_rate': 0.001, 'colsample_bytree': 0.8}, Mean CV Score: 0.5250
Candidate 14: Parameters: {'subsample': 0.8, 'num_leaves': 50, 'n_estimators': 200, 'min_child_samples': 10, 'max_depth': 10, 'learning_rate': 0.001, 'colsample_bytree': 0.6}, Mean CV Score: 0.5194
Candidate 15: Parameters: {'subsample': 0.6, 'num_leaves': 70, 'n_estimators': 200, 'min_child_samples': 10, 'max_depth': 10, 'learning_rate': 0.01, 'colsample_bytree': 1.0}, Mean CV Score: 0.6227
Candidate 16: Parameters: {'subsample': 0.6, 'num_leaves': 50, 'n_estimators': 100, 'min_child_samples': 20, 'max_depth': 15, 'learning_rate': 0.01, 'colsample_bytree': 0.8}, Mean CV Score: 0.6176
Candidate 17: Parameters: {'subsample': 0.6, 'num_leaves': 31, 'n_estimators': 500, 'min_child_samples': 5, 'max_depth': -1, 'learning_rate': 0.001, 'colsample_bytree': 1.0}, Mean CV Score: 0.5900
Candidate 18: Parameters: {'subsample': 0.6, 'num_leaves': 50, 'n_estimators': 100, 'min_child_samples': 5, 'max_depth': 15, 'learning_rate': 0.01, 'colsample_bytree': 1.0}, Mean CV Score: 0.6173
Candidate 19: Parameters: {'subsample': 0.6, 'num_leaves': 31, 'n_estimators': 100, 'min_child_samples': 5, 'max_depth': -1, 'learning_rate': 0.1, 'colsample_bytree': 0.8}, Mean CV Score: 0.6215
Candidate 20: Parameters: {'subsample': 0.8, 'num_leaves': 31, 'n_estimators': 200, 'min_child_samples': 20, 'max_depth': -1, 'learning_rate': 0.001, 'colsample_bytree': 0.6}, Mean CV Score: 0.5187

LightGBM best params: {'subsample': 0.6, 'num_leaves': 70, 'n_estimators': 200, 'min_child_samples': 10, 'max_depth': 10, 'learning_rate': 0.01, 'colsample_bytree': 1.0}
LightGBM best CV accuracy: 0.6227
LightGBM validation accuracy: 0.6266

--- Tuning XGBoost ---
Fitting 3 folds for each of 20 candidates, totalling 60 fits

Candidate hyperparameter settings and CV scores for XGBoost:
Candidate 1: Parameters: {'subsample': 0.6, 'n_estimators': 100, 'max_depth': 6, 'learning_rate': 0.001, 'gamma': 5, 'colsample_bytree': 0.8}, Mean CV Score: 0.6145
Candidate 2: Parameters: {'subsample': 0.8, 'n_estimators': 200, 'max_depth': 6, 'learning_rate': 0.001, 'gamma': 1, 'colsample_bytree': 0.6}, Mean CV Score: 0.6132
Candidate 3: Parameters: {'subsample': 1.0, 'n_estimators': 200, 'max_depth': 3, 'learning_rate': 0.001, 'gamma': 0, 'colsample_bytree': 0.8}, Mean CV Score: 0.5737
Candidate 4: Parameters: {'subsample': 0.8, 'n_estimators': 200, 'max_depth': 3, 'learning_rate': 0.01, 'gamma': 1, 'colsample_bytree': 0.8}, Mean CV Score: 0.6016
Candidate 5: Parameters: {'subsample': 1.0, 'n_estimators': 100, 'max_depth': 3, 'learning_rate': 0.01, 'gamma': 0, 'colsample_bytree': 1.0}, Mean CV Score: 0.5824
Candidate 6: Parameters: {'subsample': 1.0, 'n_estimators': 200, 'max_depth': 10, 'learning_rate': 0.1, 'gamma': 0, 'colsample_bytree': 0.8}, Mean CV Score: 0.6197
Candidate 7: Parameters: {'subsample': 0.6, 'n_estimators': 100, 'max_depth': 10, 'learning_rate': 0.001, 'gamma': 0, 'colsample_bytree': 0.6}, Mean CV Score: 0.6176
Candidate 8: Parameters: {'subsample': 0.6, 'n_estimators': 100, 'max_depth': 3, 'learning_rate': 0.1, 'gamma': 1, 'colsample_bytree': 0.6}, Mean CV Score: 0.6199
Candidate 9: Parameters: {'subsample': 0.8, 'n_estimators': 500, 'max_depth': 10, 'learning_rate': 0.01, 'gamma': 1, 'colsample_bytree': 0.6}, Mean CV Score: 0.6248
Candidate 10: Parameters: {'subsample': 0.6, 'n_estimators': 100, 'max_depth': 10, 'learning_rate': 0.1, 'gamma': 1, 'colsample_bytree': 0.8}, Mean CV Score: 0.6254
Candidate 11: Parameters: {'subsample': 1.0, 'n_estimators': 100, 'max_depth': 10, 'learning_rate': 0.1, 'gamma': 1, 'colsample_bytree': 0.8}, Mean CV Score: 0.6248
Candidate 12: Parameters: {'subsample': 0.8, 'n_estimators': 500, 'max_depth': 3, 'learning_rate': 0.001, 'gamma': 5, 'colsample_bytree': 1.0}, Mean CV Score: 0.5738
Candidate 13: Parameters: {'subsample': 0.6, 'n_estimators': 500, 'max_depth': 10, 'learning_rate': 0.001, 'gamma': 0, 'colsample_bytree': 0.6}, Mean CV Score: 0.6207
Candidate 14: Parameters: {'subsample': 0.8, 'n_estimators': 100, 'max_depth': 3, 'learning_rate': 0.01, 'gamma': 5, 'colsample_bytree': 0.8}, Mean CV Score: 0.5823
Candidate 15: Parameters: {'subsample': 0.6, 'n_estimators': 500, 'max_depth': 6, 'learning_rate': 0.01, 'gamma': 5, 'colsample_bytree': 0.6}, Mean CV Score: 0.6198
Candidate 16: Parameters: {'subsample': 1.0, 'n_estimators': 200, 'max_depth': 3, 'learning_rate': 0.1, 'gamma': 1, 'colsample_bytree': 0.6}, Mean CV Score: 0.6184
Candidate 17: Parameters: {'subsample': 1.0, 'n_estimators': 200, 'max_depth': 3, 'learning_rate': 0.001, 'gamma': 5, 'colsample_bytree': 0.8}, Mean CV Score: 0.5736
Candidate 18: Parameters: {'subsample': 0.6, 'n_estimators': 500, 'max_depth': 3, 'learning_rate': 0.01, 'gamma': 0, 'colsample_bytree': 0.6}, Mean CV Score: 0.6124
Candidate 19: Parameters: {'subsample': 1.0, 'n_estimators': 100, 'max_depth': 10, 'learning_rate': 0.1, 'gamma': 1, 'colsample_bytree': 1.0}, Mean CV Score: 0.6238
Candidate 20: Parameters: {'subsample': 0.8, 'n_estimators': 100, 'max_depth': 6, 'learning_rate': 0.01, 'gamma': 1, 'colsample_bytree': 0.6}, Mean CV Score: 0.6159

XGBoost best params: {'subsample': 0.6, 'n_estimators': 100, 'max_depth': 10, 'learning_rate': 0.1, 'gamma': 1, 'colsample_bytree': 0.8}
XGBoost best CV accuracy: 0.6254
XGBoost validation accuracy: 0.6304

--- Tuning MLP Neural Net ---
Fitting 3 folds for each of 20 candidates, totalling 60 fits

Candidate hyperparameter settings and CV scores for MLP Neural Net:
Candidate 1: Parameters: {'solver': 'adam', 'max_iter': 200, 'learning_rate_init': 0.001, 'hidden_layer_sizes': (100,), 'alpha': 0.0001, 'activation': 'relu'}, Mean CV Score: 0.5992
Candidate 2: Parameters: {'solver': 'adam', 'max_iter': 200, 'learning_rate_init': 0.01, 'hidden_layer_sizes': (50,), 'alpha': 0.01, 'activation': 'tanh'}, Mean CV Score: 0.5925
Candidate 3: Parameters: {'solver': 'adam', 'max_iter': 200, 'learning_rate_init': 0.01, 'hidden_layer_sizes': (100,), 'alpha': 0.001, 'activation': 'relu'}, Mean CV Score: 0.5916
Candidate 4: Parameters: {'solver': 'adam', 'max_iter': 200, 'learning_rate_init': 0.001, 'hidden_layer_sizes': (50,), 'alpha': 0.0001, 'activation': 'relu'}, Mean CV Score: 0.5975
Candidate 5: Parameters: {'solver': 'adam', 'max_iter': 200, 'learning_rate_init': 0.001, 'hidden_layer_sizes': (100,), 'alpha': 0.01, 'activation': 'relu'}, Mean CV Score: 0.6035
Candidate 6: Parameters: {'solver': 'adam', 'max_iter': 200, 'learning_rate_init': 0.01, 'hidden_layer_sizes': (50,), 'alpha': 0.001, 'activation': 'tanh'}, Mean CV Score: 0.5903
Candidate 7: Parameters: {'solver': 'adam', 'max_iter': 200, 'learning_rate_init': 0.01, 'hidden_layer_sizes': (100, 50), 'alpha': 0.0001, 'activation': 'relu'}, Mean CV Score: 0.5894
Candidate 8: Parameters: {'solver': 'adam', 'max_iter': 200, 'learning_rate_init': 0.01, 'hidden_layer_sizes': (100, 50), 'alpha': 0.01, 'activation': 'relu'}, Mean CV Score: 0.5960
Candidate 9: Parameters: {'solver': 'adam', 'max_iter': 200, 'learning_rate_init': 0.001, 'hidden_layer_sizes': (50,), 'alpha': 0.001, 'activation': 'relu'}, Mean CV Score: 0.6076
Candidate 10: Parameters: {'solver': 'adam', 'max_iter': 200, 'learning_rate_init': 0.01, 'hidden_layer_sizes': (100,), 'alpha': 0.001, 'activation': 'tanh'}, Mean CV Score: 0.5855
Candidate 11: Parameters: {'solver': 'adam', 'max_iter': 300, 'learning_rate_init': 0.01, 'hidden_layer_sizes': (100, 50), 'alpha': 0.0001, 'activation': 'tanh'}, Mean CV Score: 0.5837
Candidate 12: Parameters: {'solver': 'adam', 'max_iter': 300, 'learning_rate_init': 0.01, 'hidden_layer_sizes': (100,), 'alpha': 0.01, 'activation': 'relu'}, Mean CV Score: 0.6050
Candidate 13: Parameters: {'solver': 'adam', 'max_iter': 300, 'learning_rate_init': 0.001, 'hidden_layer_sizes': (100, 50), 'alpha': 0.0001, 'activation': 'relu'}, Mean CV Score: 0.5834
Candidate 14: Parameters: {'solver': 'adam', 'max_iter': 300, 'learning_rate_init': 0.001, 'hidden_layer_sizes': (100, 50), 'alpha': 0.0001, 'activation': 'tanh'}, Mean CV Score: 0.5838
Candidate 15: Parameters: {'solver': 'adam', 'max_iter': 300, 'learning_rate_init': 0.001, 'hidden_layer_sizes': (100,), 'alpha': 0.0001, 'activation': 'relu'}, Mean CV Score: 0.5886
Candidate 16: Parameters: {'solver': 'adam', 'max_iter': 200, 'learning_rate_init': 0.01, 'hidden_layer_sizes': (100, 50), 'alpha': 0.001, 'activation': 'relu'}, Mean CV Score: 0.6013
Candidate 17: Parameters: {'solver': 'adam', 'max_iter': 200, 'learning_rate_init': 0.001, 'hidden_layer_sizes': (100, 50), 'alpha': 0.001, 'activation': 'tanh'}, Mean CV Score: 0.5929
Candidate 18: Parameters: {'solver': 'adam', 'max_iter': 300, 'learning_rate_init': 0.001, 'hidden_layer_sizes': (50,), 'alpha': 0.001, 'activation': 'tanh'}, Mean CV Score: 0.5948
Candidate 19: Parameters: {'solver': 'adam', 'max_iter': 300, 'learning_rate_init': 0.001, 'hidden_layer_sizes': (100, 50), 'alpha': 0.01, 'activation': 'relu'}, Mean CV Score: 0.5907
Candidate 20: Parameters: {'solver': 'adam', 'max_iter': 300, 'learning_rate_init': 0.01, 'hidden_layer_sizes': (50,), 'alpha': 0.0001, 'activation': 'tanh'}, Mean CV Score: 0.5894

MLP Neural Net best params: {'solver': 'adam', 'max_iter': 200, 'learning_rate_init': 0.001, 'hidden_layer_sizes': (50,), 'alpha': 0.001, 'activation': 'relu'}
MLP Neural Net best CV accuracy: 0.6076
MLP Neural Net validation accuracy: 0.6106

--- Tuning KNN ---
Fitting 3 folds for each of 20 candidates, totalling 60 fits


In [2]:
!pip install catboost

Defaulting to user installation because normal site-packages is not writeable
Collecting catboost
  Downloading catboost-1.2.8-cp39-cp39-win_amd64.whl (102.5 MB)
Collecting graphviz
  Downloading graphviz-0.20.3-py3-none-any.whl (47 kB)
Installing collected packages: graphviz, catboost
Successfully installed catboost-1.2.8 graphviz-0.20.3


------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------

The code below was eventually moved to experimentEnsembleBoostingCat once we realized the file was getting too long. We wanted to separate 
out the different experiments and models.

In [3]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import balanced_accuracy_score
import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostClassifier
from tqdm.notebook import tqdm
import warnings
warnings.filterwarnings('ignore')

# === 1. Load & Preprocess ===
# df = pd.read_csv("train.csv")
# ... your existing preprocessing steps here ...
# (drop high-leakage cols, simplify color, age_to_weeks, one-hot encode, etc.)
# Result: X (features), y (string labels)
le = LabelEncoder()
y_enc = le.fit_transform(y)

# === 2. Train/Val Split & 10% Coarse Subsample ===
X_train_full, X_val, y_train_full, y_val = train_test_split(
    X, y_enc, test_size=0.2, stratify=y_enc, random_state=42
)
# 10% stratified subsample for coarse tuning
X_coarse, _, y_coarse, _ = train_test_split(
    X_train_full, y_train_full,
    train_size=0.1, stratify=y_train_full, random_state=42
)

# === 3. Define Boosting Models & Broad Param Grids ===
boosters = {
    'LightGBM': (
        lgb.LGBMClassifier(random_state=42, n_jobs=-1),
        {
            'num_leaves':       [31, 63, 127],
            'max_depth':        [5, 10, 20, -1],
            'learning_rate':    [0.3, 0.1, 0.01],
            'n_estimators':     [50, 100, 200],
            'subsample':        [0.5, 0.8, 1.0],
            'colsample_bytree': [0.5, 0.8, 1.0]
        }
    ),
    'XGBoost': (
        xgb.XGBClassifier(random_state=42, use_label_encoder=False, eval_metric='mlogloss', n_jobs=-1),
        {
            'max_depth':        [3, 6, 10],
            'learning_rate':    [0.3, 0.1, 0.01],
            'n_estimators':     [50, 100, 200],
            'subsample':        [0.5, 0.8, 1.0],
            'colsample_bytree': [0.5, 0.8, 1.0],
            'gamma':            [0, 1, 5]
        }
    ),
    'CatBoost': (
        CatBoostClassifier(random_state=42, verbose=0),  # silent
        {
            'depth':            [4, 6, 10],
            'learning_rate':    [0.3, 0.1, 0.01],
            'iterations':       [100, 200, 500],
            'l2_leaf_reg':      [1, 3, 10]
        }
    )
}

coarse_best = {}
print("=== Stage 1: Coarse tuning on 10% subset ===")
for name, (est, grid) in tqdm(boosters.items(), desc="Coarse tuning"):
    rs = RandomizedSearchCV(est, grid,
                            n_iter=20,
                            scoring='balanced_accuracy',
                            cv=3, n_jobs=-1, random_state=42, verbose=0)
    rs.fit(X_coarse, y_coarse)
    coarse_best[name] = rs.best_params_
    print(f"{name} coarse best: {rs.best_params_}  →  CV balanced_accuracy={rs.best_score_:.4f}")

# === 4. Build Narrow Grids Around Best Coarse Params ===
fine_params = {}
for name, bp in coarse_best.items():
    if name == 'LightGBM':
        fine_params[name] = {
            'num_leaves':       [max(5, bp['num_leaves']-16), bp['num_leaves'], bp['num_leaves']+16],
            'learning_rate':    [bp['learning_rate']/2, bp['learning_rate'], bp['learning_rate']*2],
            'n_estimators':     [bp['n_estimators']-50, bp['n_estimators'], bp['n_estimators']+50],
            'subsample':        [max(0.5, bp['subsample']-0.2), bp['subsample'], min(1.0, bp['subsample']+0.2)],
            'colsample_bytree': [max(0.5, bp['colsample_bytree']-0.2), bp['colsample_bytree'], min(1.0, bp['colsample_bytree']+0.2)],
            'max_depth':        [bp['max_depth']] if bp['max_depth'] != -1 else [-1, 10, 20]
        }
    elif name == 'XGBoost':
        fine_params[name] = {
            'max_depth':        [max(1, bp['max_depth']-3), bp['max_depth'], bp['max_depth']+3],
            'learning_rate':    [bp['learning_rate']/2, bp['learning_rate'], bp['learning_rate']*2],
            'n_estimators':     [bp['n_estimators']-50, bp['n_estimators'], bp['n_estimators']+50],
            'subsample':        [max(0.5, bp['subsample']-0.2), bp['subsample'], min(1.0, bp['subsample']+0.2)],
            'colsample_bytree': [max(0.5, bp['colsample_bytree']-0.2), bp['colsample_bytree'], min(1.0, bp['colsample_bytree']+0.2)],
            'gamma':            [0, bp['gamma'], bp['gamma']*2]
        }
    else:  # CatBoost
        fine_params[name] = {
            'depth':        [max(1, bp['depth']-2), bp['depth'], bp['depth']+2],
            'learning_rate':[bp['learning_rate']/2, bp['learning_rate'], bp['learning_rate']*2],
            'iterations':   [bp['iterations']-100, bp['iterations'], bp['iterations']+100],
            'l2_leaf_reg':  [max(1, bp['l2_leaf_reg']-2), bp['l2_leaf_reg'], bp['l2_leaf_reg']+2]
        }

# === 5. Fine Tuning on Full Training Set ===
final_models = {}
print("\n=== Stage 2: Fine tuning on full training data ===")
for name, (est, _) in tqdm(boosters.items(), desc="Fine tuning"):
    grid = fine_params[name]
    rs = RandomizedSearchCV(est, grid,
                            n_iter=15,
                            scoring='balanced_accuracy',
                            cv=5, n_jobs=-1, random_state=42, verbose=0)
    rs.fit(X_train_full, y_train_full)
    final_models[name] = rs.best_estimator_
    print(f"{name} fine best: {rs.best_params_}  →  CV balanced_accuracy={rs.best_score_:.4f}")

# === 6. Ensemble & Evaluate ===
# Predict probabilities on validation set and average across models
proba = np.stack([m.predict_proba(X_val) for m in final_models.values()], axis=2)
avg_proba = proba.mean(axis=2)
y_pred = np.argmax(avg_proba, axis=1)

print("\nEnsemble balanced accuracy:", 
      balanced_accuracy_score(y_val, y_pred))


=== Stage 1: Coarse tuning on 10% subset ===


Coarse tuning:   0%|          | 0/3 [00:00<?, ?it/s]

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000452 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 75
[LightGBM] [Info] Number of data points in the train set: 960, number of used features: 23
[LightGBM] [Info] Start training from score -0.703618
[LightGBM] [Info] Start training from score -4.669709
[LightGBM] [Info] Start training from score -3.465736
[LightGBM] [Info] Start training from score -1.904089
[LightGBM] [Info] Start training from score -1.153200
LightGBM coarse best: {'subsample': 1.0, 'num_leaves': 31, 'n_estimators': 100, 'max_depth': 5, 'learning_rate': 0.1, 'colsample_bytree': 1.0}  →  CV balanced_accuracy=0.4170
XGBoost coarse best: {'subsample': 1.0, 'n_estimators': 100, 'max_depth': 10, 'learning_rate': 0.3, 'gamma': 0, 'colsample_bytree': 0.8}  →  CV balanced_accuracy=0.3762
CatBoost coarse best: {'learning_rate

Fine tuning:   0%|          | 0/3 [00:00<?, ?it/s]

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000586 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 172
[LightGBM] [Info] Number of data points in the train set: 9600, number of used features: 70
[LightGBM] [Info] Start training from score -0.702987
[LightGBM] [Info] Start training from score -4.669709
[LightGBM] [Info] Start training from score -3.472425
[LightGBM] [Info] Start training from score -1.901295
[LightGBM] [Info] Start training from score -1.154852
LightGBM fine best: {'subsample': 1.0, 'num_leaves': 47, 'n_estimators': 100, 'max_depth': 5, 'learning_rate': 0.2, 'colsample_bytree': 1.0}  →  CV balanced_accuracy=0.3835
XGBoost fine best: {'subsample': 1.0, 'n_estimators': 50, 'max_depth': 7, 'learning_rate': 0.3, 'gamma': 0, 'colsample_bytree': 1.0}  →  CV balanced_accuracy=0.3891
CatBoost fine best: {'learning_rate': 0.3