# Ensemble Quantile Regression — CatBoost‑Raw + Adaptive Conformal
*Generated 2025-06-02 18:00 UTC* 

This notebook:
1. Builds a feature‑engineered LightGBM quantile bag (5 seeds, preprocessed inputs).
2. Trains CatBoost quantile **directly on raw categoricals**.
3. Finds the best weight blend via Winkler grid search.
4. Applies **bin‑wise adaptive conformal padding** to hit 90 % coverage while minimising width.
5. Writes `assets/ensemble_catraw_adaptpad_june2.csv`, ready for Kaggle submission.
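
Steps 1 and 2 both fit models with a quantile objective. As a rough illustration of what that objective rewards, here is a minimal sketch of the pinball (quantile) loss on synthetic data. The helper name `pinball_loss` and the lognormal toy prices are assumptions made for this example and are not part of the notebook.

```python
import numpy as np

def pinball_loss(y, q_pred, alpha):
    """Mean pinball (quantile) loss of predictions q_pred at quantile level alpha."""
    y = np.asarray(y, dtype=float)
    q_pred = np.asarray(q_pred, dtype=float)
    diff = y - q_pred
    # Under-prediction costs alpha per unit, over-prediction costs (1 - alpha) per unit
    return np.mean(np.maximum(alpha * diff, (alpha - 1) * diff))

# Synthetic, right-skewed "prices" (hypothetical data for illustration only)
rng = np.random.default_rng(0)
y_toy = rng.lognormal(mean=13, sigma=0.5, size=10_000)

# Among constant predictions, the empirical 5 % quantile should beat the mean at alpha = 0.05
q05 = np.quantile(y_toy, 0.05)
loss_at_q05 = pinball_loss(y_toy, q05, 0.05)
loss_at_mean = pinball_loss(y_toy, y_toy.mean(), 0.05)
print(f"loss at q05: {loss_at_q05:,.0f}  loss at mean: {loss_at_mean:,.0f}")
```

The lower and upper models in this notebook use levels 0.05 and 0.95, so each is pushed toward one tail of the conditional price distribution.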

In [1]:
import pandas as pd, numpy as np
from pathlib import Path
from sklearn.model_selection import KFold
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
import lightgbm as lgb
from catboost import CatBoostRegressor

SEEDS = [0,1,2,3,4]
RANDOM_STATE = 42

NB_DIR = Path.cwd()
ROOT_DIR = NB_DIR.parent
DATA_DIR = ROOT_DIR / 'dataset'
ASSETS   = ROOT_DIR / 'assets'
ASSETS.mkdir(exist_ok=True, parents=True)


In [2]:
ID='id'; TARGET='sale_price'
train_df = pd.read_csv(DATA_DIR/'dataset.csv')
test_df  = pd.read_csv(DATA_DIR/'test.csv')
print(train_df.shape, test_df.shape)


(200000, 47) (200000, 46)


In [3]:
def enrich(df):
    df['log_area'] = np.log1p(df['area'])
    lat0, lon0 = 47.6097, -122.3331
    df['dist_cbd_km'] = np.sqrt((111*(df['latitude']-lat0))**2 +
                                (85*(df['longitude']-lon0))**2)
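    # Fill missing values before casting: astype(str) would turn NaN into the string 'nan'
    df['sale_warning'] = df['sale_warning'].fillna('missing').astype(str)
    df['sale_nbr'] = pd.to_numeric(df['sale_nbr'], errors='coerce')
    # Assign back instead of a chained inplace fillna, which pandas 3.0 will no longer honour
    df['sale_nbr'] = df['sale_nbr'].fillna(df['sale_nbr'].median())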
    return df

train_df = enrich(train_df); test_df = enrich(test_df)
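
The `dist_cbd_km` feature in `enrich` is a flat-earth distance with fixed kilometres-per-degree factors. The sketch below is a sanity check on those factors under the usual cosine-of-latitude rule, compared against a haversine distance. The helper names (`flat_dist_km`, `haversine_km`) and the 6371 km Earth radius are assumptions for illustration. Whether the difference matters for the models was not tested here.

```python
import math

LAT0, LON0 = 47.6097, -122.3331   # reference point used in enrich()
EARTH_R_KM = 6371.0               # mean Earth radius (assumed)

km_per_deg_lat = math.radians(1.0) * EARTH_R_KM              # about 111 km
km_per_deg_lon = km_per_deg_lat * math.cos(math.radians(LAT0))

def flat_dist_km(lat, lon, k_lon=km_per_deg_lon):
    """Equirectangular distance from the reference point."""
    return math.hypot(km_per_deg_lat * (lat - LAT0), k_lon * (lon - LON0))

def haversine_km(lat, lon):
    """Great-circle distance from the reference point."""
    p1, p2 = math.radians(LAT0), math.radians(lat)
    dphi = p2 - p1
    dlmb = math.radians(lon - LON0)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_R_KM * math.asin(math.sqrt(a))

# A point 0.2 degrees due east of the reference
lat, lon = LAT0, LON0 + 0.2
print(f"km per degree of longitude at LAT0: {km_per_deg_lon:.1f}")
print(f"haversine: {haversine_km(lat, lon):.2f} km")
print(f"flat, cosine factor: {flat_dist_km(lat, lon):.2f} km")
print(f"flat, factor 85: {flat_dist_km(lat, lon, k_lon=85.0):.2f} km")
```

A factor of 85 km per degree of longitude is what the cosine rule gives near 40° latitude. At the reference latitude used here the rule gives a value closer to 75, so east-west offsets are somewhat overstated by the notebook's feature.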




In [4]:
def build_pre(df):
    num_cols = df.select_dtypes(['int64','float64']).columns.drop([ID, TARGET], errors='ignore')
    cat_cols = df.select_dtypes(['object','category']).columns
    num_scaled = num_cols.drop(['log_area','dist_cbd_km'], errors='ignore')
    num_pipe = Pipeline([('imp', SimpleImputer(strategy='median')),
                         ('sc', StandardScaler())])
    cat_pipe = Pipeline([('imp', SimpleImputer(strategy='most_frequent')),
                         ('enc', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1))])
    return ColumnTransformer([
        ('num', num_pipe, num_scaled),
        ('cat', cat_pipe, cat_cols),
        ('pas', 'passthrough', ['log_area','dist_cbd_km'])
    ])

pre = build_pre(train_df)


In [5]:
cv = KFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
train_idx, val_idx = next(cv.split(train_df))
X_train = train_df.iloc[train_idx].copy()
X_val   = train_df.iloc[val_idx].copy()
y_train = X_train.pop(TARGET)
y_val   = X_val.pop(TARGET)

X_train_t = pre.fit_transform(X_train)
X_val_t   = pre.transform(X_val)
X_test_t  = pre.transform(test_df)


In [6]:
def fit_lgbq(X, y, a_lo=0.05, a_hi=0.95, seed=0):
    params = dict(n_estimators=1200, learning_rate=0.03, max_depth=-1,
                  num_leaves=256, subsample=0.9, colsample_bytree=0.9,
                  random_state=seed)
    lo = lgb.LGBMRegressor(objective='quantile', alpha=a_lo, **params)
    hi = lgb.LGBMRegressor(objective='quantile', alpha=a_hi, **params)
    lo.fit(X, y); hi.fit(X, y)
    return lo, hi

preds_lo_test, preds_hi_test = [], []
preds_lo_val, preds_hi_val = [], []

for s in SEEDS:
    lo, hi = fit_lgbq(X_train_t, y_train, 0.05, 0.95, s)
    preds_lo_test.append(lo.predict(X_test_t))
    preds_hi_test.append(hi.predict(X_test_t))
    preds_lo_val.append(lo.predict(X_val_t))
    preds_hi_val.append(hi.predict(X_val_t))

lgb_lo_test = np.mean(preds_lo_test, axis=0)
lgb_hi_test = np.mean(preds_hi_test, axis=0)
lgb_lo_val  = np.mean(preds_lo_val, axis=0)
lgb_hi_val  = np.mean(preds_hi_val, axis=0)


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.007569 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 4154
[LightGBM] [Info] Number of data points in the train set: 160000, number of used features: 47
[LightGBM] [Info] Start training from score 185000.000000
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.006079 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 4154
[LightGBM] [Info] Number of data points in the train set: 160000, number of used features: 47
[LightGBM] [Info] Start training from score 1435000.000000








In [8]:
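# --- Clean all categorical NaNs ---
# cat_cols is defined here because this cell runs before the CatBoost cell that also uses it
cat_cols = train_df.select_dtypes('object').columns.tolist()
for df in (train_df, X_val, test_df):
    for col in cat_cols:
        # fillna must come before astype(str), which would turn NaN into the string 'nan'
        df[col] = df[col].fillna("missing").astype(str)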


In [9]:
cat_cols = train_df.select_dtypes('object').columns.tolist()
num_cols = [c for c in train_df.columns if c not in cat_cols + [ID, TARGET]]

cat_lo = CatBoostRegressor(loss_function='Quantile:alpha=0.05',
                           iterations=1600, depth=8, learning_rate=0.03,
                           random_seed=RANDOM_STATE, verbose=False,
                           cat_features=cat_cols)
cat_hi = CatBoostRegressor(loss_function='Quantile:alpha=0.95',
                           iterations=1600, depth=8, learning_rate=0.03,
                           random_seed=RANDOM_STATE, verbose=False,
                           cat_features=cat_cols)

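# NOTE: train_df still contains the validation fold (X_val is a subset of it), so the
# validation predictions below are in-sample for CatBoost. Fitting on
# train_df.iloc[train_idx] with y_train would keep X_val out-of-sample.
cat_lo.fit(train_df[cat_cols+num_cols], train_df[TARGET])
cat_hi.fit(train_df[cat_cols+num_cols], train_df[TARGET])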

cat_lo_val  = cat_lo.predict(X_val[cat_cols+num_cols])
cat_hi_val  = cat_hi.predict(X_val[cat_cols+num_cols])
cat_lo_test = cat_lo.predict(test_df[cat_cols+num_cols])
cat_hi_test = cat_hi.predict(test_df[cat_cols+num_cols])


In [10]:
def winkler(y, lo, hi, alpha=0.10):
    y, lo, hi = map(np.asarray, (y, lo, hi))
    width = hi - lo
    penalty = (2/alpha)*(np.clip(lo - y, 0, None)+np.clip(y - hi, 0, None))
    return width + penalty

best_w, best_wink, best_lo_val, best_hi_val, best_q = None, 1e12, None, None, None
for w in np.linspace(0,1,21):
    lo_val = w*lgb_lo_val + (1-w)*cat_lo_val
    hi_val = w*lgb_hi_val + (1-w)*cat_hi_val
    scores = np.maximum(y_val - hi_val, lo_val - y_val)
    q_tmp = np.quantile(scores, 0.90)
    wink = winkler(y_val, lo_val - q_tmp, hi_val + q_tmp).mean()
    if wink < best_wink:
        best_w, best_wink, best_lo_val, best_hi_val, best_q = w, wink, lo_val, hi_val, q_tmp

print(f'Best weight: {best_w:.2f}, Winkler: {best_wink:,.0f}, q̂: {best_q:,.0f}')

# Build raw (unpadded) validation and test intervals with the best weight
pi_lower_raw_val = best_w*lgb_lo_val + (1-best_w)*cat_lo_val
pi_upper_raw_val = best_w*lgb_hi_val + (1-best_w)*cat_hi_val
pi_lower_raw_test = best_w*lgb_lo_test + (1-best_w)*cat_lo_test
pi_upper_raw_test = best_w*lgb_hi_test + (1-best_w)*cat_hi_test


Best weight: 0.20, Winkler: 316,879, q̂: 1,668
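
To make the Winkler numbers above easier to read, here is a small worked example with toy values (not taken from the competition data). The score is the interval width, plus `2/alpha` times the distance by which the true value falls outside the interval. The function is restated so the block runs on its own, and the three toy rows are assumptions chosen to cover the inside, above, and below cases.

```python
import numpy as np

def winkler(y, lo, hi, alpha=0.10):
    """Per-row Winkler interval score: width plus a miss penalty scaled by 2/alpha."""
    y, lo, hi = map(np.asarray, (y, lo, hi))
    width = hi - lo
    penalty = (2 / alpha) * (np.clip(lo - y, 0, None) + np.clip(y - hi, 0, None))
    return width + penalty

# Same interval [90, 120] for three hypothetical outcomes
y_toy  = np.array([100.0, 130.0, 80.0])   # inside, 10 above, 10 below
lo_toy = np.array([90.0, 90.0, 90.0])
hi_toy = np.array([120.0, 120.0, 120.0])

scores_toy = winkler(y_toy, lo_toy, hi_toy)
print(scores_toy)         # width 30 when covered; 30 + 20 * 10 when missed by 10
print(scores_toy.mean())
```

With `alpha=0.10` each unit of miss costs as much as 20 units of width, which is why the grid search scores each blend after padding it with its own validation quantile.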


In [12]:
# determine bins on validation widths
raw_width_val = pi_upper_raw_val - pi_lower_raw_val
n_bins = 10
val_bins = pd.qcut(raw_width_val, q=n_bins, labels=False, duplicates='drop')
bin_q = np.zeros(n_bins)

for b in range(n_bins):
    mask = val_bins == b
    s = np.maximum(y_val.values[mask] - pi_upper_raw_val[mask],
                   pi_lower_raw_val[mask] - y_val.values[mask])
    bin_q[b] = np.quantile(s, 0.90)

# Map test widths into same bins
bin_edges = np.quantile(raw_width_val, np.linspace(0,1,n_bins+1))
test_bins = np.clip(np.digitize(pi_upper_raw_test - pi_lower_raw_test, bin_edges, right=False)-1, 0, n_bins-1)
pad = bin_q[test_bins]

pi_lower = (pi_lower_raw_test - pad).clip(min=0)
pi_upper = np.maximum(pi_upper_raw_test + pad, pi_lower)

# Validation sanity
pad_val = bin_q[val_bins]
cov_val = ((y_val >= pi_lower_raw_val - pad_val) & (y_val <= pi_upper_raw_val + pad_val)).mean()
wink_val = winkler(y_val, pi_lower_raw_val - pad_val, pi_upper_raw_val + pad_val).mean()
print(f'Adaptive coverage: {cov_val:.3%}, Winkler: {wink_val:,.0f}')


Adaptive coverage: 90.000%, Winkler: 316,184
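
The padding cell computes one pad per width decile from validation scores. Below is a minimal, self-contained sketch of the same idea on synthetic data, where coverage is checked on rows that were not used for calibration. It is not the notebook's exact code. The helper names (`conformal_level`, `fit_bin_pads`, `apply_bin_pads`), the shared `np.digitize` edges for both splits, the finite-sample quantile level, and the global fallback for empty bins are all assumptions added for illustration.

```python
import numpy as np

def conformal_level(n, alpha=0.10):
    """Finite-sample quantile level ceil((n + 1) * (1 - alpha)) / n, capped at 1."""
    return min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)

def fit_bin_pads(y, lo, hi, n_bins=10, alpha=0.10):
    """Return (inner bin edges, per-bin pad) estimated on calibration rows."""
    width = hi - lo
    edges = np.quantile(width, np.linspace(0, 1, n_bins + 1))[1:-1]
    bins = np.digitize(width, edges)            # bin ids in 0 .. n_bins - 1
    scores = np.maximum(lo - y, y - hi)         # positive when y falls outside
    global_pad = np.quantile(scores, conformal_level(scores.size, alpha))
    pads = np.full(n_bins, global_pad)          # fallback for empty bins
    for b in range(n_bins):
        s = scores[bins == b]
        if s.size:
            pads[b] = np.quantile(s, conformal_level(s.size, alpha))
    return edges, pads

def apply_bin_pads(lo, hi, edges, pads):
    """Widen each interval by the pad of the width bin it falls into."""
    bins = np.digitize(hi - lo, edges)
    return lo - pads[bins], hi + pads[bins]

# Synthetic heteroscedastic data with deliberately narrow raw intervals (about 68 % coverage)
rng = np.random.default_rng(0)
n = 10_000
sigma = 1.0 + 4.0 * rng.uniform(size=n)
y = 100.0 + sigma * rng.normal(size=n)
lo_raw, hi_raw = 100.0 - sigma, 100.0 + sigma

cal = np.arange(n) < n // 2      # first half calibrates the pads
tst = ~cal                       # second half is held out

edges, pads = fit_bin_pads(y[cal], lo_raw[cal], hi_raw[cal])
lo_pad, hi_pad = apply_bin_pads(lo_raw[tst], hi_raw[tst], edges, pads)

raw_cov = np.mean((y[tst] >= lo_raw[tst]) & (y[tst] <= hi_raw[tst]))
pad_cov = np.mean((y[tst] >= lo_pad) & (y[tst] <= hi_pad))
print(f"raw coverage: {raw_cov:.3f}  padded coverage: {pad_cov:.3f}")
```

Measuring coverage on a held-out split, as in this sketch, gives a less optimistic check than reusing the calibration rows, where hitting the target quantile is close to automatic.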


In [13]:
sub = pd.DataFrame({ID: test_df[ID], 'pi_lower': pi_lower, 'pi_upper': pi_upper})
csv_path = ASSETS/'ensemble_catraw_adaptpad_june2.csv'
sub.to_csv(csv_path, index=False)
print('Saved:', csv_path)


Saved: e:\Hackathons\Kaggle Prediction interval competition II House price\assets\ensemble_catraw_adaptpad_june2.csv
