This was another boosting-ensemble experiment, using three models: LightGBM, XGBoost, and CatBoost. We first ran a coarse hyperparameter search on a small subset of the data to find a rough region for each model, then ran a finer search on the larger training set, with grids centered on the best values from the coarse stage. Finally, we combined the three models by soft voting over their predicted class probabilities, starting with a simple average and then moving to an average weighted by each model's validation score. As in most of our other experiments, preprocessing consisted of color bucketing, breed bucketing (by size), and converting age to weeks.


In [2]:
# PREPROCESSING

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.preprocessing import LabelEncoder
import lightgbm as lgb
import warnings
import xgboost as xgb
from tqdm.notebook import tqdm
warnings.filterwarnings('ignore')

# === Helper Functions ===
def simplify_color(color_str):
    """
    Map raw color strings to a smaller set of standardized categories.
    This function checks for the presence of common color keywords.
    """
    if pd.isna(color_str):
        return "unknown"
    color_str = color_str.lower()
    if "black" in color_str:
        return "black"
    elif "brown" in color_str:
        return "brown"
    elif "white" in color_str:
        return "white"
    elif "tan" in color_str or "gold" in color_str:
        return "tan_gold"
    elif "grey" in color_str or "gray" in color_str:
        return "gray"
    else:
        return "other"

def age_to_weeks(age_str):
    """
    Convert an age string into estimated weeks.
    e.g., "2 years" becomes 104 weeks and "6 months" becomes 24 weeks
    (a month is approximated as 4 weeks). Any other unit, such as days,
    returns NaN.
    """
    if pd.isna(age_str):
        return np.nan
    tokens = age_str.split()
    if len(tokens) < 2:
        return np.nan
    if 'year' in tokens[1]:
        return float(tokens[0]) * 52
    elif 'month' in tokens[1]:
        return float(tokens[0]) * 4
    elif 'week' in tokens[1]:
        return float(tokens[0])
    else:
        return np.nan
    

breed_size = {
    # SMALL (<20 lbs avg)
    'chihuahua':        'small',
    'pembroke welsh corgi':'small',
    'pug':               'small',
    'yorkshire terrier':'small',
    'dachshund':         'small',
    'pomeranian':        'small',
    'papillon':          'small',
    'shih tzu':          'small',
    'maltese':           'small',
    'rat terrier':       'small',
    'jack russell terrier':'small',
    'west highland white terrier':'small',
    # MEDIUM (20–50 lbs avg)
    'border collie':     'medium',
    'australian cattle dog':'medium',
    'beagle':            'medium',
    'boston terrier':    'medium',
    'cocker spaniel':    'medium',
    'cairn terrier':     'medium',
    'bichon frise':      'medium',
    'siberian husky':    'medium',
    # LARGE (50+ lbs avg)
    'labrador retriever':'large',
    'golden retriever':  'large',
    'german shepherd':   'large',
    'rottweiler':        'large',
    'boxer':             'large',
    'great dane':        'large',
    'mastiff':           'large',
    'newfoundland':      'large',
    'bernese mountain dog':'large',
    'great pyrenees':    'large',
    'alaskan malamute':  'large',
    'cane corso':        'large',
    'doberman pinscher': 'large',
    'chow chow':         'large',
    # …add more breeds as needed…
}

# Fallback keyword sets for rare or unlisted breeds and mixes
small_keys = ['chihuahua','toy','pomeranian','papillon','yorkshire','pug']
large_keys = ['mastiff','wolfhound','dane','newfoundland','retriever','shepherd','rottweiler','boxer','bulldog','malamute']

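def size_from_breed(breed):
    """
    Bucket a breed string into 'small', 'medium', or 'large'.
    A mix takes the largest size among its components; a component that
    matches neither the lookup table nor a keyword defaults to 'medium'.
    """
    b = breed.lower().replace('mix','').strip()
    # split on slashes (and commas)
    components = [c.strip() for c in b.replace('/',',').split(',')]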
    sizes = []
    for comp in components:
        # exact lookup?
        if comp in breed_size:
            sizes.append(breed_size[comp])
        else:
            # fallback to keyword scan
            if any(k in comp for k in small_keys):
                sizes.append('small')
            elif any(k in comp for k in large_keys):
                sizes.append('large')
            else:
                sizes.append('medium')
    # mixture takes the LARGEST of its parents
    if 'large'  in sizes: return 'large'
    if 'medium' in sizes: return 'medium'
    return 'small'

Preprocessing
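
The `age_to_weeks` helper above treats a month as 4 weeks and returns NaN for any unit other than years, months, or weeks. If a closer conversion were wanted, a variant along the following lines could be tried. This is only a sketch: the name `age_to_weeks_precise` and the `WEEKS_PER_UNIT` table are hypothetical, and swapping it in would change the feature values shown in this notebook.

```python
import math

# Hypothetical conversion table (an assumption, not part of the original notebook).
WEEKS_PER_UNIT = {
    "day": 1 / 7,
    "week": 1.0,
    "month": 52 / 12,   # about 4.33 weeks instead of a flat 4
    "year": 52.0,
}

def age_to_weeks_precise(age_str):
    """Parse strings such as '6 months' or '3 days' into weeks; NaN if unparseable."""
    if not isinstance(age_str, str):
        return math.nan
    tokens = age_str.split()
    if len(tokens) < 2:
        return math.nan
    try:
        count = float(tokens[0])
    except ValueError:
        return math.nan
    # Normalize 'Years' / 'year' / 'years' to the singular lowercase key.
    unit = tokens[1].lower().rstrip("s")
    factor = WEEKS_PER_UNIT.get(unit)
    return count * factor if factor is not None else math.nan

for example in ["2 years", "6 months", "3 days", "unknown"]:
    print(example, "->", age_to_weeks_precise(example))
```

Using a lookup table keeps the unit handling in one place, so adding or changing a unit does not require another `elif` branch.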


In [3]:


# === 1. Load and Subsample Data ===
# For prototyping, draw a stratified random sample of 12,000 rows that keeps
# the same target distribution as the full dataset.
# NOTE: sample_df is not used below; the features are built from df_full.
df_full = pd.read_csv("train.csv")
# Option A (unused): groupby + sample, sized here for ~40,000 rows
# sample_fraction = 40000 / len(df_full)
# sample_df = df_full.groupby("Outcome Type", group_keys=False).apply(lambda x: x.sample(frac=sample_fraction, random_state=42))
# Option B: train_test_split with its stratify option, sized for 12,000 rows
sample_df, _ = train_test_split(df_full, train_size=12000, stratify=df_full["Outcome Type"], random_state=42)

# === 2. Target & ID Setup ===
target_col = "Outcome Type"
id_col = "Id"
# Drop columns that are either high-leakage or not available in the test set.
# Here, we drop Outcome Time as well since it isn't present in the test data.
drop_cols = ['Found Location', 'Date of Birth', 'Name', target_col, id_col, 'Outcome Time']
X = df_full.drop(columns=drop_cols, errors='ignore')
y = df_full[target_col]

# === 3. Encode the Target Variable ===
# Although there are only five outcomes, encoding ensures compatibility with all models.
le_y = LabelEncoder()
y_encoded = le_y.fit_transform(y)

# === 4. Feature Engineering ===
# Simplify the 'Color' feature into standardized categories
X['Color Category'] = X['Color'].apply(simplify_color)
X = X.drop(columns=['Color'])

# Convert Age to Weeks
X['Age in Weeks'] = X['Age upon Intake'].apply(age_to_weeks)
X.drop(columns=['Age upon Intake'], inplace=True)

# Intake Time: parsed here so that time-based features (such as hour or day of
# week) could be derived, but none are extracted in this experiment, so the
# column is dropped.
datetime_format = "%m/%d/%Y %I:%M:%S %p"
X['Intake Time'] = pd.to_datetime(X['Intake Time'], format=datetime_format, errors='coerce')
X = X.drop(columns=['Intake Time'])

# Bucket each breed into a size category
X['Size Category'] = X['Breed'].apply(size_from_breed)

# Quick sanity check: number of distinct breed strings per size category
print(X[['Breed','Size Category']]
      .drop_duplicates()
      .groupby('Size Category')
      .size())
X = X.drop(columns=['Breed'])

# Fill missing values: For object columns, fill with "Unknown"; for numeric, fill with median.
for col in X.columns:
    if X[col].dtype == 'object':
        X[col] = X[col].fillna("Unknown")
    else:
        X[col] = X[col].fillna(X[col].median())



# === 5. Encode Categorical Variables ===
categorical_cols = X.select_dtypes(include='object').columns
print("categorical cols", categorical_cols)
X = pd.get_dummies(X, columns=categorical_cols, drop_first=True)


# print(X["Intake Type"].unique())
# print(X["Intake Condition"].unique())
# print(X["Animal Type"].unique())
# print(X["Breed"].unique())
# print(X["Color Category"].unique())
# Ensure all features are numeric
X = X.astype(float)

print("Processed feature sample:")

X.head()



Size Category
large      811
medium    1497
small      132
dtype: int64
categorical cols Index(['Intake Type', 'Intake Condition', 'Animal Type', 'Sex upon Intake',
       'Color Category', 'Size Category'],
      dtype='object')
Processed feature sample:


Unnamed: 0,Age in Weeks,Intake Type_Euthanasia Request,Intake Type_Owner Surrender,Intake Type_Public Assist,Intake Type_Stray,Intake Type_Wildlife,Intake Condition_Agonal,Intake Condition_Behavior,Intake Condition_Congenital,Intake Condition_Feral,...,Sex upon Intake_Neutered Male,Sex upon Intake_Spayed Female,Sex upon Intake_Unknown,Color Category_brown,Color Category_gray,Color Category_other,Color Category_tan_gold,Color Category_white,Size Category_medium,Size Category_small
0,416.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
1,44.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
2,104.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
3,104.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,312.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
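
The comments in the cell above mention keeping the features consistent with the test set. One-hot encoding is a common place for that to break, because `pd.get_dummies` only creates columns for the categories it sees. A minimal sketch of one way to align a test frame to the training columns follows; the toy frames and the "Abandoned" category are made up for illustration and do not come from `train.csv`.

```python
import pandas as pd

# Toy stand-ins for the real train/test frames (illustrative values only).
train_raw = pd.DataFrame({
    "Intake Type": ["Stray", "Owner Surrender", "Wildlife"],
    "Age in Weeks": [52.0, 104.0, 8.0],
})
test_raw = pd.DataFrame({
    "Intake Type": ["Stray", "Abandoned"],   # "Abandoned" never appears in training
    "Age in Weeks": [26.0, 12.0],
})

# Training side: same call pattern as the notebook (drop_first=True).
train_X = pd.get_dummies(train_raw, columns=["Intake Type"], drop_first=True).astype(float)

# Test side: encode every category, then keep exactly the training columns.
# drop_first is skipped here because the "first" category may differ in the test set.
test_X = pd.get_dummies(test_raw, columns=["Intake Type"]).astype(float)
test_X = test_X.reindex(columns=train_X.columns, fill_value=0.0)

print(list(train_X.columns))
print(test_X)
```

After the `reindex`, categories that only appear in the test set are dropped and training categories missing from the test set become all-zero columns, so both frames share the same column order.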


Training the three models
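
The fine grids in the cell below are built by stepping one notch either side of each coarse winner. A minimal sketch of that idea, with the candidates clipped to valid bounds and de-duplicated, might look like this. The helper name `neighbours`, the bounds, and the `coarse_best` values are illustrative assumptions, not results from this notebook's run.

```python
def neighbours(best, step, low, high):
    """Return sorted, de-duplicated candidates around `best`, clipped to [low, high]."""
    candidates = {min(high, max(low, best + delta)) for delta in (-step, 0, step)}
    return sorted(candidates)

# Hypothetical coarse-search winners (illustrative values only).
coarse_best = {"n_estimators": 50, "max_depth": 10, "subsample": 1.0}

# Narrow grid for the second search stage; bounds are assumptions.
fine_grid = {
    "n_estimators": neighbours(coarse_best["n_estimators"], step=50, low=10, high=1000),
    "max_depth": neighbours(coarse_best["max_depth"], step=3, low=1, high=16),
    "subsample": neighbours(coarse_best["subsample"], step=0.2, low=0.5, high=1.0),
}
print(fine_grid)
```

With the coarse grids used here, the smallest `n_estimators` option is 50, so an unclipped `best - 50` step would propose 0 trees; a lower bound avoids that. De-duplicating also stops the random search from spending iterations on repeated values when a winner already sits at a bound.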

In [4]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import balanced_accuracy_score
import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostClassifier
from tqdm.notebook import tqdm
import warnings
warnings.filterwarnings('ignore')

# === 1. Encode Labels ===
# X (features) and y (string labels) come from the preprocessing cell above
# (high-leakage columns dropped, colors simplified, age in weeks, one-hot encoding).
le = LabelEncoder()
y_enc = le.fit_transform(y)

# === 2. Train/Val Split & 10% Coarse Subsample ===
X_train_full, X_val, y_train_full, y_val = train_test_split(
    X, y_enc, test_size=0.2, stratify=y_enc, random_state=42
)
# 10% stratified subsample for coarse tuning
X_coarse, _, y_coarse, _ = train_test_split(
    X_train_full, y_train_full,
    train_size=0.1, stratify=y_train_full, random_state=42
)

# === 3. Define Boosting Models & Broad Param Grids ===
boosters = {
    'LightGBM': (
        lgb.LGBMClassifier(random_state=42, n_jobs=-1),
        {
            'num_leaves':       [31, 63, 127],
            'max_depth':        [5, 10, 20, -1],
            'learning_rate':    [0.3, 0.1, 0.01],
            'n_estimators':     [50, 100, 200],
            'subsample':        [0.5, 0.8, 1.0],  # LightGBM ignores this unless subsample_freq > 0
            'colsample_bytree': [0.5, 0.8, 1.0]
        }
    ),
    'XGBoost': (
        xgb.XGBClassifier(random_state=42, use_label_encoder=False, eval_metric='mlogloss', n_jobs=-1),
        {
            'max_depth':        [3, 6, 10],
            'learning_rate':    [0.3, 0.1, 0.01],
            'n_estimators':     [50, 100, 200],
            'subsample':        [0.5, 0.8, 1.0],
            'colsample_bytree': [0.5, 0.8, 1.0],
            'gamma':            [0, 1, 5]
        }
    ),
    'CatBoost': (
        CatBoostClassifier(random_state=42, verbose=0),  # silent
        {
            'depth':            [4, 6, 10],
            'learning_rate':    [0.3, 0.1, 0.01],
            'iterations':       [100, 200, 500],
            'l2_leaf_reg':      [1, 3, 10]
        }
    )
}

coarse_best = {}
print("=== Stage 1: Coarse tuning on 10% subset ===")
for name, (est, grid) in tqdm(boosters.items(), desc="Coarse tuning"):
    rs = RandomizedSearchCV(est, grid,
                            n_iter=20,
                            scoring='balanced_accuracy',
                            cv=3, n_jobs=-1, random_state=42, verbose=0)
    rs.fit(X_coarse, y_coarse)
    coarse_best[name] = rs.best_params_
    print(f"{name} coarse best: {rs.best_params_}  →  CV balanced_accuracy={rs.best_score_:.4f}")

# === 4. Build Narrow Grids Around Best Coarse Params ===
fine_params = {}
for name, bp in coarse_best.items():
    if name == 'LightGBM':
        fine_params[name] = {
            'num_leaves':       [max(5, bp['num_leaves']-16), bp['num_leaves'], bp['num_leaves']+16],
            'learning_rate':    [bp['learning_rate']/2, bp['learning_rate'], bp['learning_rate']*2],
            'n_estimators':     [max(10, bp['n_estimators']-50), bp['n_estimators'], bp['n_estimators']+50],
            'subsample':        [max(0.5, bp['subsample']-0.2), bp['subsample'], min(1.0, bp['subsample']+0.2)],
            'colsample_bytree': [max(0.5, bp['colsample_bytree']-0.2), bp['colsample_bytree'], min(1.0, bp['colsample_bytree']+0.2)],
            'max_depth':        [bp['max_depth']] if bp['max_depth'] != -1 else [-1, 10, 20]
        }
    elif name == 'XGBoost':
        fine_params[name] = {
            'max_depth':        [max(1, bp['max_depth']-3), bp['max_depth'], bp['max_depth']+3],
            'learning_rate':    [bp['learning_rate']/2, bp['learning_rate'], bp['learning_rate']*2],
            'n_estimators':     [max(10, bp['n_estimators']-50), bp['n_estimators'], bp['n_estimators']+50],
            'subsample':        [max(0.5, bp['subsample']-0.2), bp['subsample'], min(1.0, bp['subsample']+0.2)],
            'colsample_bytree': [max(0.5, bp['colsample_bytree']-0.2), bp['colsample_bytree'], min(1.0, bp['colsample_bytree']+0.2)],
            'gamma':            [0, bp['gamma'], bp['gamma']*2]
        }
    else:  # CatBoost
        fine_params[name] = {
            'depth':        [max(1, bp['depth']-2), bp['depth'], bp['depth']+2],
            'learning_rate':[bp['learning_rate']/2, bp['learning_rate'], bp['learning_rate']*2],
            'iterations':   [max(50, bp['iterations']-100), bp['iterations'], bp['iterations']+100],
            'l2_leaf_reg':  [max(1, bp['l2_leaf_reg']-2), bp['l2_leaf_reg'], bp['l2_leaf_reg']+2]
        }

# === 5. Fine Tuning on Full Training Set ===
final_models = {}
print("\n=== Stage 2: Fine tuning on full training data ===")
for name, (est, _) in tqdm(boosters.items(), desc="Fine tuning"):
    grid = fine_params[name]
    rs = RandomizedSearchCV(est, grid,
                            n_iter=15,
                            scoring='balanced_accuracy',
                            cv=5, n_jobs=-1, random_state=42, verbose=0)
    rs.fit(X_train_full, y_train_full)
    final_models[name] = rs.best_estimator_
    print(f"{name} fine best: {rs.best_params_}  →  CV balanced_accuracy={rs.best_score_:.4f}")

# === 6a. Simple-Average Ensemble (first version, kept for reference) ===
# Predict probabilities on the validation set and average across models
# proba = np.stack([m.predict_proba(X_val) for m in final_models.values()], axis=2)
# avg_proba = proba.mean(axis=2)
# y_pred = np.argmax(avg_proba, axis=1)
#
# print("\nEnsemble balanced accuracy:",
#       balanced_accuracy_score(y_val, y_pred))

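# === 6. Weighted Ensemble & Evaluate ===
# NOTE: the weights are derived from the same validation set that is used for
# the final score below, so the reported ensemble score may be slightly optimistic.
# 1) Compute individual validation balanced-accuracy scores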
validation_scores = {}
for name, model in final_models.items():
    y_pred_i = model.predict(X_val)
    score_i = balanced_accuracy_score(y_val, y_pred_i)
    validation_scores[name] = score_i
    print(f"{name} validation balanced_accuracy: {score_i:.4f}")

# 2) Normalize to get weights that sum to 1
total_score = sum(validation_scores.values())
weights = { name: score / total_score for name, score in validation_scores.items() }
print("\nEnsemble weights:")
for name, w in weights.items():
    print(f"  {name}: {w:.3f}")

# 3) Gather predict_proba arrays in the same order
names = list(final_models.keys())
probas = [ final_models[name].predict_proba(X_val) for name in names ]

# 4) Stack into shape (n_samples, n_classes, n_models)
proba_stack = np.stack(probas, axis=2)

# 5) Compute weighted average across the 3rd axis
weights_arr = np.array([weights[name] for name in names])
weighted_proba = np.tensordot(proba_stack, weights_arr, axes=([2],[0]))

# 6) Final prediction & evaluation
y_pred_ensemble = np.argmax(weighted_proba, axis=1)
print("\nWeighted-ensemble balanced accuracy:",
      balanced_accuracy_score(y_val, y_pred_ensemble))



=== Stage 1: Coarse tuning on 10% subset ===


Coarse tuning:   0%|          | 0/3 [00:00<?, ?it/s]

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001658 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 77
[LightGBM] [Info] Number of data points in the train set: 8892, number of used features: 22
[LightGBM] [Info] Start training from score -0.702866
[LightGBM] [Info] Start training from score -4.674067
[LightGBM] [Info] Start training from score -3.472506
[LightGBM] [Info] Start training from score -1.901478
[LightGBM] [Info] Start training from score -1.154819
LightGBM coarse best: {'subsample': 1.0, 'num_leaves': 31, 'n_estimators': 100, 'max_depth': 5, 'learning_rate': 0.1, 'colsample_bytree': 1.0}  →  CV balanced_accuracy=0.3932
XGBoost coarse best: {'subsample': 0.5, 'n_estimators': 50, 'max_depth': 10, 'learning_rate': 0.3, 'gamma': 1, 'colsample_bytree': 0.8}  →  CV balanced_accuracy=0.4024
CatBoost coarse best: {'learning_rate

Fine tuning:   0%|          | 0/3 [00:00<?, ?it/s]

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.009388 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 92
[LightGBM] [Info] Number of data points in the train set: 88925, number of used features: 28
[LightGBM] [Info] Start training from score -0.702809
[LightGBM] [Info] Start training from score -4.670515
[LightGBM] [Info] Start training from score -3.472925
[LightGBM] [Info] Start training from score -1.901609
[LightGBM] [Info] Start training from score -1.154910
LightGBM fine best: {'subsample': 1.0, 'num_leaves': 47, 'n_estimators': 100, 'max_depth': 5, 'learning_rate': 0.2, 'colsample_bytree': 1.0}  →  CV balanced_accuracy=0.3946
XGBoost fine best: {'subsample': 0.7, 'n_estimators': 50, 'max_depth': 7, 'learning_rate': 0.6, 'gamma': 0, 'colsample_bytree': 0.8}  →  CV balanced_accuracy=0.3966
CatBoost fine best: {'learning_rate': 0.3
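
The weighted vote in step 6 of the training cell reduces to a weighted average of each model's class probabilities followed by an argmax. The sketch below isolates that step on toy numbers so the effect of the weights is easy to see. The function name `weighted_soft_vote` and all probability values are made up for illustration; they are not outputs of the models above.

```python
import numpy as np

def weighted_soft_vote(probas, weights):
    """Blend per-model probability arrays of shape (n_samples, n_classes)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                              # normalize so the weights sum to 1
    stacked = np.stack(probas, axis=0)           # (n_models, n_samples, n_classes)
    blended = np.tensordot(w, stacked, axes=1)   # (n_samples, n_classes)
    return blended, blended.argmax(axis=1)

# Toy probabilities from three hypothetical models, two samples, two classes.
p_a = np.array([[0.6, 0.4], [0.3, 0.7]])
p_b = np.array([[0.2, 0.8], [0.4, 0.6]])
p_c = np.array([[0.5, 0.5], [0.9, 0.1]])

equal_proba, equal_pred = weighted_soft_vote([p_a, p_b, p_c], [1, 1, 1])
skewed_proba, skewed_pred = weighted_soft_vote([p_a, p_b, p_c], [4, 1, 1])

print("equal weights :", equal_pred)
print("model A x4    :", skewed_pred)
```

Equal weights reproduce the simple average. Note that weights obtained by normalizing scores that sit close together (for example 0.39, 0.40, and 0.41 normalize to roughly 0.325, 0.333, and 0.342) stay near one third each, so such a weighted vote will usually behave almost like the plain average.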