# Final plan (medal path, concise)

Objectives (in order):
- S31: Recency-optimized blending is done in production; keep gamma=0.98 variant as primary for now.
- S32: Train high-capacity time-aware LR_main (title+request_text only, no subreddit) with L2 saga; cache OOF/test.
- S33: Re-run 7/8-way recency-weighted logit blend including LR_main; write primary + 15% shrink hedges.
- S34: Refit-on-full for MPNet emb+meta with 5-seed bag and fixed num_boost_round; update test preds.
- S35: Final refit-on-full for all XGB bases (Dense v1/v2, Meta, MiniLM, MPNet) with 5-seed bag; LR models refit with chosen C.
- S36: Build final refit blends with recency-optimized weights; write hedges and promote best to submission.csv.

Constraints and settings:
- Time-aware CV: 6 blocks forward-chaining; validate on blocks 1..5 only.
- LR_main TF-IDF:
  - word 1–3, char_wb 2–6; min_df in {1,2}; max_features per view ≈ 300k–400k (RAM check).
  - Regularization: L2 (saga), C ∈ {0.6, 0.8, 1.0, 1.2, 1.5}; max_iter=2000, n_jobs=-1.
  - Add small meta_v1 if and only if it improves blend ≥ +0.001; otherwise keep text-only.
- Blending (logit space, nonnegative, sum=1):
  - LR_mix g ∈ {0.90, 0.95, 0.97}; w_LR ≥ 0.25; Meta ∈ [0.18,0.22]; Dense_total ∈ [0.22,0.40];
  - MiniLM ∈ [0.10,0.15], MPNet ∈ [0.08,0.12], embeddings total ≤ 0.30.
  - If LR_main included: w_LRmain ∈ [0.05,0.10] only if it lifts OOF on late-tuned objective.
  - Optimize with full-mask, last-2, and gamma ∈ {0.90,0.95,0.98}; produce 15% shrink hedges.
- Refit-on-full:
  - XGB: use median best_iteration from time-CV as fixed num_boost_round; 5 seeds [42,1337,2025,614,2718]; device=cuda.
  - LR: rebuild vectorizers on full train; same C as best fold; predict test probs.

Artifacts to produce:
- oof_lr_main_time.npy, test_lr_main_time.npy
- submission_8way_full.csv / last2.csv / gammaXX.csv (+ _shrunk) with/without LR_main
- test_xgb_emb_mpnet_fullbag.npy (and similarly for other XGB bases if refit updated)

Next cell: implement S32 LR_main time-aware training with caching and progress logs.

In [10]:
# S32: Time-aware high-capacity LR_main (title + request_text only), L2 saga; cache OOF/test
import numpy as np, pandas as pd, time, gc
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

id_col = 'request_id'; target_col = 'requester_received_pizza'
train = pd.read_json('train.json')
test = pd.read_json('test.json')
y = train[target_col].astype(int).values

def get_title(df):
    return df.get('request_title', pd.Series(['']*len(df))).fillna('').astype(str)
def get_body_no_leak(df):
    # Prefer request_text (avoid edit_aware per expert advice); fallback if missing
    if 'request_text' in df.columns:
        return df['request_text'].fillna('').astype(str)
    col = 'request_text_edit_aware' if 'request_text_edit_aware' in df.columns else 'request_text'
    return df.get(col, pd.Series(['']*len(df))).fillna('').astype(str)
def build_text(df):
    return (get_title(df) + '\n' + get_body_no_leak(df)).astype(str)

txt_tr = build_text(train)
txt_te = build_text(test)

# 6-block forward-chaining folds (validate blocks 1..5)
order = np.argsort(train['unix_timestamp_of_request'].values)
n = len(train); k = 6
blocks = np.array_split(order, k)
folds = []
mask = np.zeros(n, dtype=bool)
for i in range(1, k):
    va_idx = np.array(blocks[i]); tr_idx = np.concatenate(blocks[:i])
    folds.append((tr_idx, va_idx)); mask[va_idx] = True
print(f'Time-CV: {len(folds)} folds; validated {mask.sum()}/{n}')

# High-capacity TF-IDF views
word_params = dict(analyzer='word', ngram_range=(1,3), lowercase=True, min_df=2, max_features=300_000, sublinear_tf=True, smooth_idf=True, norm='l2')
char_params = dict(analyzer='char_wb', ngram_range=(2,6), lowercase=True, min_df=2, max_features=300_000, sublinear_tf=True, smooth_idf=True, norm='l2')

C_grid = [0.8, 1.0, 1.2]
results = []
best = dict(auc=-1.0, C=None, oof=None, te=None)

for C in C_grid:
    tC = time.time()
    oof = np.zeros(n, dtype=np.float32)
    te_parts = []
    for fi, (tr_idx, va_idx) in enumerate(folds, 1):
        t0 = time.time()
        tr_text = txt_tr.iloc[tr_idx]; va_text = txt_tr.iloc[va_idx]
        tf_w = TfidfVectorizer(**word_params)
        Xw_tr = tf_w.fit_transform(tr_text); Xw_va = tf_w.transform(va_text); Xw_te = tf_w.transform(txt_te)
        tf_c = TfidfVectorizer(**char_params)
        Xc_tr = tf_c.fit_transform(tr_text); Xc_va = tf_c.transform(va_text); Xc_te = tf_c.transform(txt_te)
        X_tr = hstack([Xw_tr, Xc_tr], format='csr')
        X_va = hstack([Xw_va, Xc_va], format='csr')
        X_te = hstack([Xw_te, Xc_te], format='csr')
        clf = LogisticRegression(penalty='l2', solver='saga', C=C, max_iter=2000, n_jobs=-1, verbose=0)
        clf.fit(X_tr, y[tr_idx])
        va_pred = clf.predict_proba(X_va)[:,1].astype(np.float32)
        te_pred = clf.predict_proba(X_te)[:,1].astype(np.float32)
        oof[va_idx] = va_pred
        te_parts.append(te_pred)
        auc = roc_auc_score(y[va_idx], va_pred)
        print(f'[LR_main C={C}] Fold {fi} AUC: {auc:.5f} | {time.time()-t0:.1f}s | tr:{X_tr.shape}, va:{X_va.shape}')
        del Xw_tr, Xw_va, Xw_te, Xc_tr, Xc_va, Xc_te, X_tr, X_va, X_te, clf; gc.collect()
    auc_mask = roc_auc_score(y[mask], oof[mask])
    te_mean = np.mean(te_parts, axis=0).astype(np.float32)
    results.append((C, auc_mask))
    print(f'[LR_main C={C}] OOF AUC(validated): {auc_mask:.5f} | {time.time()-tC:.1f}s')
    if auc_mask > best['auc']:
        best.update(dict(auc=auc_mask, C=C, oof=oof.copy(), te=te_mean.copy()))
    del oof, te_parts; gc.collect()

print('C grid results:', results)
print(f'Best C={best["C"]} | OOF AUC(validated)={best["auc"]:.5f}')
np.save('oof_lr_main_time.npy', best['oof'].astype(np.float32))
np.save('test_lr_main_time.npy', best['te'].astype(np.float32))
print('Saved oof_lr_main_time.npy and test_lr_main_time.npy')

Time-CV: 5 folds; validated 2398/2878


[LR_main C=0.8] Fold 1 AUC: 0.67896 | 5.8s | tr:(480, 36871), va:(480, 36871)


[LR_main C=0.8] Fold 2 AUC: 0.61152 | 9.8s | tr:(960, 59665), va:(480, 59665)


[LR_main C=0.8] Fold 3 AUC: 0.58009 | 13.0s | tr:(1440, 77632), va:(480, 77632)


[LR_main C=0.8] Fold 4 AUC: 0.63032 | 17.0s | tr:(1920, 91689), va:(479, 91689)


[LR_main C=0.8] Fold 5 AUC: 0.64895 | 18.4s | tr:(2399, 104131), va:(479, 104131)
[LR_main C=0.8] OOF AUC(validated): 0.62378 | 65.1s


[LR_main C=1.0] Fold 1 AUC: 0.67685 | 5.6s | tr:(480, 36871), va:(480, 36871)


[LR_main C=1.0] Fold 2 AUC: 0.60895 | 10.1s | tr:(960, 59665), va:(480, 59665)


[LR_main C=1.0] Fold 3 AUC: 0.57817 | 14.1s | tr:(1440, 77632), va:(480, 77632)


[LR_main C=1.0] Fold 4 AUC: 0.62856 | 17.8s | tr:(1920, 91689), va:(479, 91689)


[LR_main C=1.0] Fold 5 AUC: 0.64877 | 21.6s | tr:(2399, 104131), va:(479, 104131)
[LR_main C=1.0] OOF AUC(validated): 0.62322 | 70.3s


[LR_main C=1.2] Fold 1 AUC: 0.67495 | 6.2s | tr:(480, 36871), va:(480, 36871)


[LR_main C=1.2] Fold 2 AUC: 0.60622 | 9.7s | tr:(960, 59665), va:(480, 59665)


[LR_main C=1.2] Fold 3 AUC: 0.57687 | 13.9s | tr:(1440, 77632), va:(480, 77632)


[LR_main C=1.2] Fold 4 AUC: 0.62808 | 18.9s | tr:(1920, 91689), va:(479, 91689)


[LR_main C=1.2] Fold 5 AUC: 0.64811 | 21.2s | tr:(2399, 104131), va:(479, 104131)
[LR_main C=1.2] OOF AUC(validated): 0.62254 | 71.0s
C grid results: [(0.8, 0.623780657748049), (1.0, 0.6232247161901174), (1.2, 0.6225446323425503)]
Best C=0.8 | OOF AUC(validated)=0.62378
Saved oof_lr_main_time.npy and test_lr_main_time.npy


In [None]:
# S34: MPNet emb+meta FULL refit 5-seed bag (fixed rounds ~ median best_iter=29) + rebuild gamma-best 7-way submission
import numpy as np, pandas as pd, time, gc, xgboost as xgb
from sklearn.preprocessing import StandardScaler

id_col = 'request_id'; target_col = 'requester_received_pizza'
train = pd.read_json('train.json')
test = pd.read_json('test.json')
y = train[target_col].astype(int).values

# Load MPNet embeddings + meta_v1
Emb_tr = np.load('emb_mpnet_tr.npy').astype(np.float32)
Emb_te = np.load('emb_mpnet_te.npy').astype(np.float32)
Meta_tr = np.load('meta_v1_tr.npy').astype(np.float32)
Meta_te = np.load('meta_v1_te.npy').astype(np.float32)
Xtr_raw = np.hstack([Emb_tr, Meta_tr]).astype(np.float32)
Xte_raw = np.hstack([Emb_te, Meta_te]).astype(np.float32)
print('Full-refit feature shapes:', Xtr_raw.shape, Xte_raw.shape)

# Standardize on full train
scaler = StandardScaler(with_mean=True, with_std=True)
Xtr = scaler.fit_transform(Xtr_raw).astype(np.float32)
Xte = scaler.transform(Xte_raw).astype(np.float32)
del Xtr_raw, Xte_raw; gc.collect()

# XGB params (same as CV runs)
params = dict(
    objective='binary:logistic',
    eval_metric='auc',
    max_depth=3,
    eta=0.05,
    subsample=0.8,
    colsample_bytree=0.6,
    min_child_weight=8,
    reg_alpha=0.5,
    reg_lambda=3.0,
    gamma=0.0,
    device='cuda',
    tree_method='hist'
)

# Fixed rounds from median of best_iter observed in time-CV logs
num_boost_round = 29
seeds = [42, 1337, 2025, 614, 2718]
pos = float((y == 1).sum()); neg = float((y == 0).sum())
spw = (neg / max(pos, 1.0)) if pos > 0 else 1.0
print(f'Class balance full-train: pos={int(pos)} neg={int(neg)} spw={spw:.2f} | rounds={num_boost_round} | seeds={seeds}')

dtr = xgb.DMatrix(Xtr, label=y)
dte = xgb.DMatrix(Xte)

test_seed_preds = []
t0 = time.time()
for si, seed in enumerate(seeds, 1):
    p = dict(params); p['seed'] = seed; p['scale_pos_weight'] = spw
    booster = xgb.train(p, dtr, num_boost_round=num_boost_round, verbose_eval=False)
    te_pred = booster.predict(dte).astype(np.float32)
    test_seed_preds.append(te_pred)
    print(f'[MPNet full-refit seed {seed}] done | te_pred mean={te_pred.mean():.4f}')
test_avg = np.mean(test_seed_preds, axis=0).astype(np.float32)
print(f'MPNet full-refit bag done in {time.time()-t0:.1f}s | test mean={test_avg.mean():.4f}')
np.save('test_xgb_emb_mpnet_fullbag.npy', test_avg)
print('Saved test_xgb_emb_mpnet_fullbag.npy')

# Rebuild gamma-best 7-way blend using refit MPNet test preds and prior best weights (from S30 gamma=0.98)
def to_logit(p, eps=1e-6):
    p = np.clip(p.astype(np.float64), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

t_lr_w = np.load('test_lr_time_withsub_meta.npy')
t_lr_ns = np.load('test_lr_time_nosub_meta.npy')
t_d1 = np.load('test_xgb_dense_time.npy')
t_d2 = np.load('test_xgb_dense_time_v2.npy')
t_meta = np.load('test_xgb_meta_time.npy')
t_emb_min = np.load('test_xgb_emb_meta_time.npy')
t_emb_mp_refit = np.load('test_xgb_emb_mpnet_fullbag.npy')

# Gamma-best config from S30:
g = 0.97
w_lr, w_d1, w_d2, w_meta, w_emn, w_emp = 0.24, 0.15, 0.15, 0.22, 0.12, 0.12

tz_lr_mix = (1.0 - g)*to_logit(t_lr_w) + g*to_logit(t_lr_ns)
zt = (w_lr*tz_lr_mix +
      w_d1*to_logit(t_d1) +
      w_d2*to_logit(t_d2) +
      w_meta*to_logit(t_meta) +
      w_emn*to_logit(t_emb_min) +
      w_emp*to_logit(t_emb_mp_refit))
pt = sigmoid(zt).astype(np.float32)
sub = pd.DataFrame({id_col: test[id_col].values, target_col: pt})
sub.to_csv('submission_7way_gamma0p98_mpnet_fullrefit.csv', index=False)

# 15% shrink-to-equal hedge
w_vec = np.array([w_lr, w_d1, w_d2, w_meta, w_emn, w_emp], dtype=np.float64)
w_eq = np.ones_like(w_vec)/len(w_vec)
alpha = 0.15
w_shr = ((1.0 - alpha)*w_vec + alpha*w_eq); w_shr = (w_shr / w_shr.sum()).astype(np.float64)
zt_shr = (w_shr[0]*tz_lr_mix +
          w_shr[1]*to_logit(t_d1) +
          w_shr[2]*to_logit(t_d2) +
          w_shr[3]*to_logit(t_meta) +
          w_shr[4]*to_logit(t_emb_min) +
          w_shr[5]*to_logit(t_emb_mp_refit))
pt_shr = sigmoid(zt_shr).astype(np.float32)
pd.DataFrame({id_col: test[id_col].values, target_col: pt_shr}).to_csv('submission_7way_gamma0p98_mpnet_fullrefit_shrunk.csv', index=False)

# Promote refit submission
sub.to_csv('submission.csv', index=False)
print('Promoted submission_7way_gamma0p98_mpnet_fullrefit.csv to submission.csv')

In [None]:
# S32b: Time-aware LR_main + meta_v1 (title+request_text only), L2 saga; cache OOF/test
import numpy as np, pandas as pd, time, gc
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

id_col = 'request_id'; target_col = 'requester_received_pizza'
train = pd.read_json('train.json')
test = pd.read_json('test.json')
y = train[target_col].astype(int).values

def get_title(df):
    return df.get('request_title', pd.Series(['']*len(df))).fillna('').astype(str)
def get_body_no_leak(df):
    # Avoid edit_aware; prefer request_text
    if 'request_text' in df.columns:
        return df['request_text'].fillna('').astype(str)
    return df.get('request_text', pd.Series(['']*len(df))).fillna('').astype(str)
def build_text(df):
    return (get_title(df) + '\n' + get_body_no_leak(df)).astype(str)

txt_tr = build_text(train); txt_te = build_text(test)

# Load meta_v1 features
Meta_tr = np.load('meta_v1_tr.npy').astype(np.float32)
Meta_te = np.load('meta_v1_te.npy').astype(np.float32)

# 6-block forward-chaining folds
order = np.argsort(train['unix_timestamp_of_request'].values)
n = len(train); k = 6
blocks = np.array_split(order, k)
folds = []; mask = np.zeros(n, dtype=bool)
for i in range(1, k):
    va_idx = np.array(blocks[i]); tr_idx = np.concatenate(blocks[:i])
    folds.append((tr_idx, va_idx)); mask[va_idx] = True
print(f'Time-CV: {len(folds)} folds; validated {mask.sum()}/{n}')

# High-capacity TF-IDF views
word_params = dict(analyzer='word', ngram_range=(1,3), lowercase=True, min_df=2, max_features=300_000, sublinear_tf=True, smooth_idf=True, norm='l2')
char_params = dict(analyzer='char_wb', ngram_range=(2,6), lowercase=True, min_df=2, max_features=300_000, sublinear_tf=True, smooth_idf=True, norm='l2')

C_grid = [0.8, 1.0]
results = []
best = dict(auc=-1.0, C=None, oof=None, te=None)

for C in C_grid:
    tC = time.time()
    oof = np.zeros(n, dtype=np.float32)
    te_parts = []
    for fi, (tr_idx, va_idx) in enumerate(folds, 1):
        t0 = time.time()
        tr_text = txt_tr.iloc[tr_idx]; va_text = txt_tr.iloc[va_idx]
        tf_w = TfidfVectorizer(**word_params)
        Xw_tr = tf_w.fit_transform(tr_text); Xw_va = tf_w.transform(va_text); Xw_te = tf_w.transform(txt_te)
        tf_c = TfidfVectorizer(**char_params)
        Xc_tr = tf_c.fit_transform(tr_text); Xc_va = tf_c.transform(va_text); Xc_te = tf_c.transform(txt_te)
        # Stack text views
        X_tr_text = hstack([Xw_tr, Xc_tr], format='csr')
        X_va_text = hstack([Xw_va, Xc_va], format='csr')
        X_te_text = hstack([Xw_te, Xc_te], format='csr')
        # Append meta_v1 (as CSR) without scaling
        X_tr = hstack([X_tr_text, csr_matrix(Meta_tr[tr_idx])], format='csr')
        X_va = hstack([X_va_text, csr_matrix(Meta_tr[va_idx])], format='csr')
        X_te = hstack([X_te_text, csr_matrix(Meta_te)], format='csr')
        clf = LogisticRegression(penalty='l2', solver='saga', C=C, max_iter=2000, n_jobs=-1, verbose=0)
        clf.fit(X_tr, y[tr_idx])
        va_pred = clf.predict_proba(X_va)[:,1].astype(np.float32)
        te_pred = clf.predict_proba(X_te)[:,1].astype(np.float32)
        oof[va_idx] = va_pred
        te_parts.append(te_pred)
        auc = roc_auc_score(y[va_idx], va_pred)
        print(f'[LR_main+meta C={C}] Fold {fi} AUC: {auc:.5f} | {time.time()-t0:.1f}s | tr:{X_tr.shape[0]}x{X_tr.shape[1]}')
        del Xw_tr, Xw_va, Xw_te, Xc_tr, Xc_va, Xc_te, X_tr_text, X_va_text, X_te_text, X_tr, X_va, X_te, clf; gc.collect()
    auc_mask = roc_auc_score(y[mask], oof[mask])
    te_mean = np.mean(te_parts, axis=0).astype(np.float32)
    results.append((C, auc_mask))
    print(f'[LR_main+meta C={C}] OOF AUC(validated): {auc_mask:.5f} | {time.time()-tC:.1f}s')
    if auc_mask > best['auc']:
        best.update(dict(auc=auc_mask, C=C, oof=oof.copy(), te=te_mean.copy()))
    del oof, te_parts; gc.collect()

print('C grid results:', results)
print(f'Best C={best["C"]} | OOF AUC(validated)={best["auc"]:.5f}')
np.save('oof_lr_main_meta_time.npy', best['oof'].astype(np.float32))
np.save('test_lr_main_meta_time.npy', best['te'].astype(np.float32))
print('Saved oof_lr_main_meta_time.npy and test_lr_main_meta_time.npy')

In [15]:
# S33: Recency-weighted 7/8-way logit blend including LR_main+meta; write variants + 15% shrink hedges
import numpy as np, pandas as pd
from sklearn.metrics import roc_auc_score

id_col = 'request_id'; target_col = 'requester_received_pizza'
train = pd.read_json('train.json')
test = pd.read_json('test.json')
y = train[target_col].astype(int).values
ids = test[id_col].values

def to_logit(p, eps=1e-6):
    p = np.clip(p.astype(np.float64), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 6-block forward-chaining blocks and masks
order = np.argsort(train['unix_timestamp_of_request'].values)
k = 6
blocks = np.array_split(order, k)
n = len(train)
mask_full = np.zeros(n, dtype=bool)
for i in range(1, k):
    mask_full[np.array(blocks[i])] = True
mask_last2 = np.zeros(n, dtype=bool)
for i in [4,5]:
    mask_last2[np.array(blocks[i])] = True
print(f'Time-CV validated full: {mask_full.sum()}/{n} | last2: {mask_last2.sum()}')

# Load base OOF/test
o_lr_w = np.load('oof_lr_time_withsub_meta.npy'); t_lr_w = np.load('test_lr_time_withsub_meta.npy')
o_lr_ns = np.load('oof_lr_time_nosub_meta.npy');  t_lr_ns = np.load('test_lr_time_nosub_meta.npy')
o_d1 = np.load('oof_xgb_dense_time.npy');         t_d1 = np.load('test_xgb_dense_time.npy')
o_d2 = np.load('oof_xgb_dense_time_v2.npy');      t_d2 = np.load('test_xgb_dense_time_v2.npy')
o_meta = np.load('oof_xgb_meta_time.npy');        t_meta = np.load('test_xgb_meta_time.npy')
o_emn = np.load('oof_xgb_emb_meta_time.npy');     t_emn_refit = np.load('test_xgb_emb_meta_time.npy')  # MiniLM (no full-bag yet)
o_emp = np.load('oof_xgb_emb_mpnet_time.npy');    
t_emp_path_full = 'test_xgb_emb_mpnet_fullbag.npy'
try:
    t_emp_refit = np.load(t_emp_path_full)
    print('Using MPNet full-bag test preds.')
except Exception:
    t_emp_refit = np.load('test_xgb_emb_mpnet_time.npy')
    print('Using MPNet CV-avg test preds (no full-bag found).')

# Optional LR_main+meta
try:
    o_lr_mainm = np.load('oof_lr_main_meta_time.npy')
    t_lr_mainm = np.load('test_lr_main_meta_time.npy')
    has_lr_mainm = True
    print('Loaded LR_main+meta OOF/test.')
except Exception:
    has_lr_mainm = False
    print('LR_main+meta not found; running 7-way only.')

# Convert to logits
z_lr_w, z_lr_ns = to_logit(o_lr_w), to_logit(o_lr_ns)
z_d1, z_d2, z_meta = to_logit(o_d1), to_logit(o_d2), to_logit(o_meta)
z_emn, z_emp = to_logit(o_emn), to_logit(o_emp)
tz_lr_w, tz_lr_ns = to_logit(t_lr_w), to_logit(t_lr_ns)
tz_d1, tz_d2, tz_meta = to_logit(t_d1), to_logit(t_d2), to_logit(t_meta)
tz_emn = to_logit(t_emn_refit); tz_emp = to_logit(t_emp_refit)
if has_lr_mainm:
    z_lr_mainm = to_logit(o_lr_mainm); tz_lr_mainm = to_logit(t_lr_mainm)

# Grids per expert priors (tight around previous best)
g_grid = [0.96, 0.97, 0.98]
meta_grid = [0.18, 0.20, 0.22]
dense_tot_grid = [0.28, 0.30, 0.35]
dense_split = [(0.6, 0.4), (0.7, 0.3), (0.8, 0.2)]  # (v1, v2) fractions
emb_tot_grid = [0.24, 0.27, 0.30]
emb_split = [(0.6, 0.4), (0.5, 0.5)]  # (MiniLM, MPNet)
w_lrmain_grid = [0.0, 0.05, 0.08] if has_lr_mainm else [0.0]

def search(mask, sample_weight=None):
    best_auc, best_cfg, tried = -1.0, None, 0
    for g in g_grid:
        z_lr_mix = (1.0 - g)*z_lr_w + g*z_lr_ns
        tz_lr_mix = (1.0 - g)*tz_lr_w + g*tz_lr_ns
        for w_meta in meta_grid:
            for d_tot in dense_tot_grid:
                for dv1, dv2 in dense_split:
                    w_d1 = d_tot * dv1; w_d2 = d_tot * dv2
                    for e_tot in emb_tot_grid:
                        for emn_fr, emp_fr in emb_split:
                            w_emn = e_tot * emn_fr; w_emp = e_tot * emp_fr
                            rem = 1.0 - (w_meta + w_d1 + w_d2 + w_emn + w_emp)
                            if rem <= 0: continue
                            for w_lrmain in w_lrmain_grid:
                                if w_lrmain > rem: continue
                                w_lr = rem - w_lrmain
                                if w_lr < 0.25:  # enforce LR_mix ≥ 0.25
                                    continue
                                z_oof = (w_lr*z_lr_mix + w_d1*z_d1 + w_d2*z_d2 + w_meta*z_meta + w_emn*z_emn + w_emp*z_emp)
                                if has_lr_mainm and w_lrmain > 0:
                                    z_oof = z_oof + w_lrmain*z_lr_mainm
                                auc = roc_auc_score(y[mask], z_oof[mask], sample_weight=(sample_weight[mask] if sample_weight is not None else None))
                                tried += 1
                                if auc > best_auc:
                                    best_auc = auc
                                    best_cfg = dict(g=float(g), w_lr=float(w_lr), w_d1=float(w_d1), w_d2=float(w_d2), w_meta=float(w_meta),
                                                    w_emn=float(w_emn), w_emp=float(w_emp), w_lrmain=float(w_lrmain), tz_lr_mix=tz_lr_mix)
    return best_auc, best_cfg, tried

# 1) Full-mask
auc_full, cfg_full, tried_full = search(mask_full)
print(f'[Full] tried={tried_full} | best OOF(z) AUC={auc_full:.5f} | cfg={ {k:v for k,v in cfg_full.items() if k!="tz_lr_mix"} }')

# 2) Last-2 blocks only
auc_last2, cfg_last2, tried_last2 = search(mask_last2)
print(f'[Last2] tried={tried_last2} | best OOF(z,last2) AUC={auc_last2:.5f} | cfg={ {k:v for k,v in cfg_last2.items() if k!="tz_lr_mix"} }')

# 3) Gamma-decayed over validated
best_gamma, best_auc_g, best_cfg_g = None, -1.0, None
for gamma in [0.95, 0.98]:
    w = np.zeros(n, dtype=np.float64)
    for bi in range(1, k):
        age = (k - 1) - bi
        w[np.array(blocks[bi])] = (gamma ** age)
    auc_g, cfg_g, _ = search(mask_full, sample_weight=w)
    print(f'[Gamma {gamma}] best OOF(z,weighted) AUC={auc_g:.5f}')
    if auc_g > best_auc_g:
        best_auc_g, best_cfg_g, best_gamma = auc_g, cfg_g, gamma
print(f'[Gamma-best] gamma={best_gamma} | AUC={best_auc_g:.5f} | cfg={ {k:v for k,v in best_cfg_g.items() if k!="tz_lr_mix"} }')

def build_and_save(tag, cfg):
    g = cfg['g']; tz_lr_mix = cfg['tz_lr_mix']
    w_lr, w_d1, w_d2, w_meta, w_emn, w_emp, w_lrmain = cfg['w_lr'], cfg['w_d1'], cfg['w_d2'], cfg['w_meta'], cfg['w_emn'], cfg['w_emp'], cfg['w_lrmain']
    zt = (w_lr*tz_lr_mix + w_d1*tz_d1 + w_d2*tz_d2 + w_meta*tz_meta + w_emn*tz_emn + w_emp*tz_emp)
    if has_lr_mainm and w_lrmain > 0:
        zt = zt + w_lrmain*tz_lr_mainm
    pt = sigmoid(zt).astype(np.float32)
    pd.DataFrame({id_col: ids, target_col: pt}).to_csv(f'submission_blend_{tag}.csv', index=False)
    # 15% shrink hedge across present components
    w_list = [w_lr, w_d1, w_d2, w_meta, w_emn, w_emp]
    comp_logits = [tz_lr_mix, tz_d1, tz_d2, tz_meta, tz_emn, tz_emp]
    if has_lr_mainm and w_lrmain > 0:
        w_list.append(w_lrmain); comp_logits.append(tz_lr_mainm)
    w_vec = np.array(w_list, dtype=np.float64)
    w_eq = np.ones_like(w_vec)/len(w_vec)
    alpha = 0.15
    w_shr = ((1.0 - alpha)*w_vec + alpha*w_eq); w_shr = (w_shr / w_shr.sum()).astype(np.float64)
    zt_shr = np.zeros_like(comp_logits[0], dtype=np.float64)
    for wi, zi in zip(w_shr, comp_logits):
        zt_shr += wi*zi
    pt_shr = sigmoid(zt_shr).astype(np.float32)
    pd.DataFrame({id_col: ids, target_col: pt_shr}).to_csv(f'submission_blend_{tag}_shrunk.csv', index=False)

build_and_save('full', cfg_full)
build_and_save('last2', cfg_last2)
build_and_save(f'gamma{best_gamma:.2f}'.replace('.','p'), best_cfg_g)

# Promote gamma-best as primary
prim = f'submission_blend_gamma{best_gamma:.2f}'.replace('.','p') + '.csv'
pd.read_csv(prim).to_csv('submission.csv', index=False)
print(f'Promoted {prim} to submission.csv')

Time-CV validated full: 2398/2878 | last2: 958
Using MPNet full-bag test preds.
Loaded LR_main+meta OOF/test.


[Full] tried=156 | best OOF(z) AUC=0.68197 | cfg={'g': 0.98, 'w_lr': 0.25, 'w_d1': 0.22400000000000003, 'w_d2': 0.05600000000000001, 'w_meta': 0.2, 'w_emn': 0.135, 'w_emp': 0.135, 'w_lrmain': 0.0}


[Last2] tried=156 | best OOF(z,last2) AUC=0.64782 | cfg={'g': 0.98, 'w_lr': 0.25, 'w_d1': 0.22400000000000003, 'w_d2': 0.05600000000000001, 'w_meta': 0.2, 'w_emn': 0.162, 'w_emp': 0.10800000000000001, 'w_lrmain': 0.0}


[Gamma 0.95] best OOF(z,weighted) AUC=0.67894


[Gamma 0.98] best OOF(z,weighted) AUC=0.68076
[Gamma-best] gamma=0.98 | AUC=0.68076 | cfg={'g': 0.98, 'w_lr': 0.25, 'w_d1': 0.22400000000000003, 'w_d2': 0.05600000000000001, 'w_meta': 0.2, 'w_emn': 0.135, 'w_emp': 0.135, 'w_lrmain': 0.0}
Promoted submission_blend_gamma0p98.csv to submission.csv


In [17]:
# S34c: Full-train 5-seed refits for MiniLM (emb+meta) and Meta-only XGB with rounds from last-block early-stop; rebuild gamma-best submission
import numpy as np, pandas as pd, time, gc, xgboost as xgb
from sklearn.preprocessing import StandardScaler

id_col = 'request_id'; target_col = 'requester_received_pizza'
train = pd.read_json('train.json')
test = pd.read_json('test.json')
y = train[target_col].astype(int).values

# Time blocks for last-block validation to pick num_boost_round
order = np.argsort(train['unix_timestamp_of_request'].values)
k = 6
blocks = np.array_split(order, k)
tr_idx_rounds = np.concatenate(blocks[:5])  # first 5 blocks
va_idx_rounds = np.array(blocks[5])         # last block as validation
print(f'Rounds selection using last block valid: tr={len(tr_idx_rounds)} va={len(va_idx_rounds)}')

def pick_rounds(X_tr_full, y_full, name, base_params, max_rounds=4000, early_stopping_rounds=100):
    # Split last block for early stopping to estimate rounds
    X_tr = X_tr_full[tr_idx_rounds]
    y_tr = y_full[tr_idx_rounds]
    X_va = X_tr_full[va_idx_rounds]
    y_va = y_full[va_idx_rounds]
    dtr = xgb.DMatrix(X_tr, label=y_tr)
    dva = xgb.DMatrix(X_va, label=y_va)
    booster = xgb.train(base_params, dtr, num_boost_round=max_rounds, evals=[(dva, 'valid')], early_stopping_rounds=early_stopping_rounds, verbose_eval=False)
    best_iter = int(booster.best_iteration or booster.best_ntree_limit or 100)
    print(f'[{name}] picked rounds (last-block ES): best_iter={best_iter}')
    del dtr, dva, booster; gc.collect()
    return best_iter

def fullbag_predict(X_full, y_full, X_test, name, base_params, num_rounds, seeds):
    dtr = xgb.DMatrix(X_full, label=y_full)
    dte = xgb.DMatrix(X_test)
    pos = float((y_full == 1).sum()); neg = float((y_full == 0).sum())
    spw = (neg / max(pos, 1.0)) if pos > 0 else 1.0
    preds = []
    for si, seed in enumerate(seeds, 1):
        p = dict(base_params); p['seed'] = seed; p['scale_pos_weight'] = spw
        booster = xgb.train(p, dtr, num_boost_round=num_rounds, verbose_eval=False)
        te_pred = booster.predict(dte).astype(np.float32)
        preds.append(te_pred)
        print(f'[{name}] seed {seed} done | mean={te_pred.mean():.4f}')
        del booster; gc.collect()
    out = np.mean(preds, axis=0).astype(np.float32)
    print(f'[{name}] bag mean={out.mean():.4f} | num_rounds={num_rounds} | seeds={seeds}')
    return out

def to_logit(p, eps=1e-6):
    p = np.clip(p.astype(np.float64), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Base params (same family as earlier CV runs)
xgb_params = dict(
    objective='binary:logistic',
    eval_metric='auc',
    max_depth=3,
    eta=0.05,
    subsample=0.8,
    colsample_bytree=0.6,
    min_child_weight=8,
    reg_alpha=0.5,
    reg_lambda=3.0,
    gamma=0.0,
    device='cuda',
    tree_method='hist'
)
seeds = [42, 1337, 2025, 614, 2718]

# 1) MiniLM emb+meta refit
Emb_min_tr = np.load('emb_minilm_tr.npy').astype(np.float32)
Emb_min_te = np.load('emb_minilm_te.npy').astype(np.float32)
Meta_tr = np.load('meta_v1_tr.npy').astype(np.float32)
Meta_te = np.load('meta_v1_te.npy').astype(np.float32)
Xmin_tr_raw = np.hstack([Emb_min_tr, Meta_tr]).astype(np.float32)
Xmin_te_raw = np.hstack([Emb_min_te, Meta_te]).astype(np.float32)
scaler_min = StandardScaler(with_mean=True, with_std=True)
Xmin_tr = scaler_min.fit_transform(Xmin_tr_raw).astype(np.float32)
Xmin_te = scaler_min.transform(Xmin_te_raw).astype(np.float32)
min_rounds = pick_rounds(Xmin_tr, y, 'MiniLM', xgb_params, max_rounds=4000, early_stopping_rounds=100)
pred_min_fullbag = fullbag_predict(Xmin_tr, y, Xmin_te, 'MiniLM', xgb_params, min_rounds, seeds)
np.save('test_xgb_emb_minilm_fullbag.npy', pred_min_fullbag)
print('Saved test_xgb_emb_minilm_fullbag.npy')

# 2) Meta-only refit
Xmeta_tr = Meta_tr.astype(np.float32)
Xmeta_te = Meta_te.astype(np.float32)
meta_rounds = pick_rounds(Xmeta_tr, y, 'Meta-only', xgb_params, max_rounds=4000, early_stopping_rounds=100)
pred_meta_fullbag = fullbag_predict(Xmeta_tr, y, Xmeta_te, 'Meta-only', xgb_params, meta_rounds, seeds)
np.save('test_xgb_meta_fullbag.npy', pred_meta_fullbag)
print('Saved test_xgb_meta_fullbag.npy')

# Rebuild gamma-best blend (from S33 cfg) using refit test preds for MiniLM/MPNet/Meta
t_lr_w = np.load('test_lr_time_withsub_meta.npy')
t_lr_ns = np.load('test_lr_time_nosub_meta.npy')
t_d1 = np.load('test_xgb_dense_time.npy')   # Dense v1 (no refit available)
t_d2 = np.load('test_xgb_dense_time_v2.npy')# Dense v2 (no refit available)
t_meta_ref = np.load('test_xgb_meta_fullbag.npy')
t_emn_ref = np.load('test_xgb_emb_minilm_fullbag.npy')
t_emp_ref = np.load('test_xgb_emb_mpnet_fullbag.npy')

# Use S33 gamma-best weights (printed there)
g = 0.98
w_lr, w_d1, w_d2, w_meta, w_emn, w_emp, w_lrmain = 0.25, 0.224, 0.056, 0.20, 0.135, 0.135, 0.0

tz_lr_mix = (1.0 - g)*to_logit(t_lr_w) + g*to_logit(t_lr_ns)
zt = (w_lr*tz_lr_mix +
      w_d1*to_logit(t_d1) +
      w_d2*to_logit(t_d2) +
      w_meta*to_logit(t_meta_ref) +
      w_emn*to_logit(t_emn_ref) +
      w_emp*to_logit(t_emp_ref))
pt = sigmoid(zt).astype(np.float32)
pd.DataFrame({id_col: test[id_col].values, target_col: pt}).to_csv('submission_blend_gamma0p98_fullrefits.csv', index=False)

# 15% shrink hedge
w_vec = np.array([w_lr, w_d1, w_d2, w_meta, w_emn, w_emp], dtype=np.float64)
w_eq = np.ones_like(w_vec)/len(w_vec)
alpha = 0.15
w_shr = ((1.0 - alpha)*w_vec + alpha*w_eq); w_shr = (w_shr / w_shr.sum()).astype(np.float64)
zt_shr = (w_shr[0]*tz_lr_mix +
          w_shr[1]*to_logit(t_d1) +
          w_shr[2]*to_logit(t_d2) +
          w_shr[3]*to_logit(t_meta_ref) +
          w_shr[4]*to_logit(t_emn_ref) +
          w_shr[5]*to_logit(t_emp_ref))
pt_shr = sigmoid(zt_shr).astype(np.float32)
pd.DataFrame({id_col: test[id_col].values, target_col: pt_shr}).to_csv('submission_blend_gamma0p98_fullrefits_shrunk.csv', index=False)

# Promote full-refit gamma-best
pd.read_csv('submission_blend_gamma0p98_fullrefits.csv').to_csv('submission.csv', index=False)
print('Promoted submission_blend_gamma0p98_fullrefits.csv to submission.csv')

Rounds selection using last block valid: tr=2399 va=479


[MiniLM] picked rounds (last-block ES): best_iter=45


[MiniLM] seed 42 done | mean=0.4681


[MiniLM] seed 1337 done | mean=0.4672


[MiniLM] seed 2025 done | mean=0.4684


[MiniLM] seed 614 done | mean=0.4697


[MiniLM] seed 2718 done | mean=0.4680
[MiniLM] bag mean=0.4683 | num_rounds=45 | seeds=[42, 1337, 2025, 614, 2718]
Saved test_xgb_emb_minilm_fullbag.npy


[Meta-only] picked rounds (last-block ES): best_iter=595


[Meta-only] seed 42 done | mean=0.4263


[Meta-only] seed 1337 done | mean=0.4285


[Meta-only] seed 2025 done | mean=0.4262


[Meta-only] seed 614 done | mean=0.4253


[Meta-only] seed 2718 done | mean=0.4256


[Meta-only] bag mean=0.4264 | num_rounds=595 | seeds=[42, 1337, 2025, 614, 2718]
Saved test_xgb_meta_fullbag.npy
Promoted submission_blend_gamma0p98_fullrefits.csv to submission.csv


In [None]:
# S32c: Time-decayed LR_nosub + meta_v1 (title+request_text only), L2 saga; cache OOF/test
import numpy as np, pandas as pd, time, gc
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

id_col = 'request_id'; target_col = 'requester_received_pizza'
train = pd.read_json('train.json')
test = pd.read_json('test.json')
y = train[target_col].astype(int).values

def get_title(df):
    return df.get('request_title', pd.Series(['']*len(df))).fillna('').astype(str)
def get_body_no_leak(df):
    # Avoid edit_aware; prefer request_text
    if 'request_text' in df.columns:
        return df['request_text'].fillna('').astype(str)
    return df.get('request_text', pd.Series(['']*len(df))).fillna('').astype(str)
def build_text(df):
    return (get_title(df) + '\n' + get_body_no_leak(df)).astype(str)

txt_tr = build_text(train); txt_te = build_text(test)

# Load meta_v1 features
Meta_tr = np.load('meta_v1_tr.npy').astype(np.float32)
Meta_te = np.load('meta_v1_te.npy').astype(np.float32)

# 6-block forward-chaining folds
order = np.argsort(train['unix_timestamp_of_request'].values)
n = len(train); k = 6
blocks = np.array_split(order, k)
folds = []; mask = np.zeros(n, dtype=bool)
block_id = np.zeros(n, dtype=np.int32)
for bi in range(k):
    block_id[np.array(blocks[bi])] = bi
for i in range(1, k):
    va_idx = np.array(blocks[i]); tr_idx = np.concatenate(blocks[:i])
    folds.append((tr_idx, va_idx)); mask[va_idx] = True
print(f'Time-CV: {len(folds)} folds; validated {mask.sum()}/{n}')

# Time-decay weights per training sample: w = gamma^(age), age = (k-1 - block_id)
gamma = 0.98
sample_w_all = (gamma ** ( (k-1) - block_id )).astype(np.float32)

# TF-IDF views (slightly lighter caps for speed)
word_params = dict(analyzer='word', ngram_range=(1,3), lowercase=True, min_df=2, max_features=200_000, sublinear_tf=True, smooth_idf=True, norm='l2')
char_params = dict(analyzer='char_wb', ngram_range=(2,6), lowercase=True, min_df=2, max_features=200_000, sublinear_tf=True, smooth_idf=True, norm='l2')

C = 1.0
tC = time.time()
oof = np.zeros(n, dtype=np.float32)
te_parts = []
for fi, (tr_idx, va_idx) in enumerate(folds, 1):
    t0 = time.time()
    tr_text = txt_tr.iloc[tr_idx]; va_text = txt_tr.iloc[va_idx]
    tf_w = TfidfVectorizer(**word_params)
    Xw_tr = tf_w.fit_transform(tr_text); Xw_va = tf_w.transform(va_text); Xw_te = tf_w.transform(txt_te)
    tf_c = TfidfVectorizer(**char_params)
    Xc_tr = tf_c.fit_transform(tr_text); Xc_va = tf_c.transform(va_text); Xc_te = tf_c.transform(txt_te)
    X_tr_text = hstack([Xw_tr, Xc_tr], format='csr')
    X_va_text = hstack([Xw_va, Xc_va], format='csr')
    X_te_text = hstack([Xw_te, Xc_te], format='csr')
    X_tr = hstack([X_tr_text, csr_matrix(Meta_tr[tr_idx])], format='csr')
    X_va = hstack([X_va_text, csr_matrix(Meta_tr[va_idx])], format='csr')
    X_te = hstack([X_te_text, csr_matrix(Meta_te)], format='csr')
    clf = LogisticRegression(penalty='l2', solver='saga', C=C, max_iter=2000, n_jobs=-1, verbose=0)
    clf.fit(X_tr, y[tr_idx], sample_weight=sample_w_all[tr_idx])
    va_pred = clf.predict_proba(X_va)[:,1].astype(np.float32)
    te_pred = clf.predict_proba(X_te)[:,1].astype(np.float32)
    oof[va_idx] = va_pred
    te_parts.append(te_pred)
    auc = roc_auc_score(y[va_idx], va_pred)
    print(f'[LR_nosub+meta_decay C={C}, gamma={gamma}] Fold {fi} AUC: {auc:.5f} | {time.time()-t0:.1f}s | tr:{X_tr.shape[0]}x{X_tr.shape[1]}')
    del Xw_tr, Xw_va, Xw_te, Xc_tr, Xc_va, Xc_te, X_tr_text, X_va_text, X_te_text, X_tr, X_va, X_te, clf; gc.collect()
auc_mask = roc_auc_score(y[mask], oof[mask])
te_mean = np.mean(te_parts, axis=0).astype(np.float32)
print(f'[LR_nosub+meta_decay] OOF AUC(validated): {auc_mask:.5f} | total {time.time()-tC:.1f}s')
np.save('oof_lr_time_nosub_meta_decay.npy', oof.astype(np.float32))
np.save('test_lr_time_nosub_meta_decay.npy', te_mean.astype(np.float32))
print('Saved oof_lr_time_nosub_meta_decay.npy and test_lr_time_nosub_meta_decay.npy')

In [None]:
# Fix import for Path used in S33b
from pathlib import Path

In [None]:
# S33b: Retune blends using updated full-bag MiniLM/Meta and option to use LR_nosub_meta_decay vs baseline
import numpy as np, pandas as pd
from sklearn.metrics import roc_auc_score

id_col = 'request_id'; target_col = 'requester_received_pizza'
train = pd.read_json('train.json')
test = pd.read_json('test.json')
y = train[target_col].astype(int).values
ids = test[id_col].values

def to_logit(p, eps=1e-6):
    p = np.clip(p.astype(np.float64), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 6-block forward-chaining blocks and masks
order = np.argsort(train['unix_timestamp_of_request'].values)
k = 6
blocks = np.array_split(order, k)
n = len(train)
mask_full = np.zeros(n, dtype=bool)
for i in range(1, k):
    mask_full[np.array(blocks[i])] = True
mask_last2 = np.zeros(n, dtype=bool)
for i in [4,5]:
    mask_last2[np.array(blocks[i])] = True
print(f'Time-CV validated full: {mask_full.sum()}/{n} | last2: {mask_last2.sum()}')

# Load base OOF
o_lr_w = np.load('oof_lr_time_withsub_meta.npy')
o_lr_ns_base = np.load('oof_lr_time_nosub_meta.npy')
o_lr_ns_decay = np.load('oof_lr_time_nosub_meta_decay.npy') if (Path('oof_lr_time_nosub_meta_decay.npy').exists()) else None
o_d1 = np.load('oof_xgb_dense_time.npy')
o_d2 = np.load('oof_xgb_dense_time_v2.npy')
o_meta = np.load('oof_xgb_meta_time.npy')
o_emn = np.load('oof_xgb_emb_meta_time.npy')
o_emp = np.load('oof_xgb_emb_mpnet_time.npy')
use_lr_decay_options = [False, True] if (o_lr_ns_decay is not None) else [False]

# Optional LR_main+meta
has_lr_mainm = Path('oof_lr_main_meta_time.npy').exists() and Path('test_lr_main_meta_time.npy').exists()
if has_lr_mainm:
    o_lr_mainm = np.load('oof_lr_main_meta_time.npy')
    t_lr_mainm = np.load('test_lr_main_meta_time.npy')
    print('Loaded LR_main+meta for blend consideration.')

# Convert OOF to logits (we'll choose lr_ns variant inside search)
z_lr_w = to_logit(o_lr_w)
z_d1, z_d2, z_meta = to_logit(o_d1), to_logit(o_d2), to_logit(o_meta)
z_emn, z_emp = to_logit(o_emn), to_logit(o_emp)
if has_lr_mainm:
    z_lr_mainm = to_logit(o_lr_mainm)

# Load test preds, preferring full-bag refits where available
t_lr_w = np.load('test_lr_time_withsub_meta.npy')
t_lr_ns_base = np.load('test_lr_time_nosub_meta.npy')
t_lr_ns_decay = np.load('test_lr_time_nosub_meta_decay.npy') if Path('test_lr_time_nosub_meta_decay.npy').exists() else None
t_d1 = np.load('test_xgb_dense_time.npy')
t_d2 = np.load('test_xgb_dense_time_v2.npy')
t_meta = np.load('test_xgb_meta_fullbag.npy') if Path('test_xgb_meta_fullbag.npy').exists() else np.load('test_xgb_meta_time.npy')
t_emn = np.load('test_xgb_emb_minilm_fullbag.npy') if Path('test_xgb_emb_minilm_fullbag.npy').exists() else np.load('test_xgb_emb_meta_time.npy')
t_emp = np.load('test_xgb_emb_mpnet_fullbag.npy') if Path('test_xgb_emb_mpnet_fullbag.npy').exists() else np.load('test_xgb_emb_mpnet_time.npy')

from pathlib import Path

# Grids (tight)
g_grid = [0.96, 0.97, 0.98]
meta_grid = [0.18, 0.20, 0.22]
dense_tot_grid = [0.28, 0.30, 0.35]
dense_split = [(0.6, 0.4), (0.7, 0.3), (0.8, 0.2)]
emb_tot_grid = [0.24, 0.27, 0.30]
emb_split = [(0.6, 0.4), (0.5, 0.5)]
w_lrmain_grid = [0.0, 0.05, 0.08] if has_lr_mainm else [0.0]

def search(mask, sample_weight=None):
    best_auc, best_cfg, tried = -1.0, None, 0
    for use_decay in use_lr_decay_options:
        z_lr_ns = to_logit(o_lr_ns_decay) if (use_decay and (o_lr_ns_decay is not None)) else to_logit(o_lr_ns_base)
        for g in g_grid:
            z_lr_mix = (1.0 - g)*z_lr_w + g*z_lr_ns
            for w_meta in meta_grid:
                for d_tot in dense_tot_grid:
                    for dv1, dv2 in dense_split:
                        w_d1 = d_tot * dv1; w_d2 = d_tot * dv2
                        for e_tot in emb_tot_grid:
                            for emn_fr, emp_fr in emb_split:
                                w_emn = e_tot * emn_fr; w_emp = e_tot * emp_fr
                                rem = 1.0 - (w_meta + w_d1 + w_d2 + w_emn + w_emp)
                                if rem <= 0: continue
                                for w_lrmain in w_lrmain_grid:
                                    if w_lrmain > rem: continue
                                    w_lr = rem - w_lrmain
                                    if w_lr < 0.25: continue
                                    z_oof = (w_lr*z_lr_mix + w_d1*z_d1 + w_d2*z_d2 + w_meta*z_meta + w_emn*z_emn + w_emp*z_emp)
                                    if has_lr_mainm and w_lrmain > 0:
                                        z_oof = z_oof + w_lrmain*z_lr_mainm
                                    auc = roc_auc_score(y[mask], z_oof[mask], sample_weight=(sample_weight[mask] if sample_weight is not None else None))
                                    tried += 1
                                    if auc > best_auc:
                                        best_auc = auc
                                        best_cfg = dict(use_decay=use_decay, g=float(g), w_lr=float(w_lr), w_d1=float(w_d1), w_d2=float(w_d2), w_meta=float(w_meta),
                                                        w_emn=float(w_emn), w_emp=float(w_emp), w_lrmain=float(w_lrmain))
    return best_auc, best_cfg, tried

# 1) Full-mask
auc_full, cfg_full, tried_full = search(mask_full)
print(f'[Full] tried={tried_full} | best OOF(z) AUC={auc_full:.5f} | cfg={cfg_full}')

# 2) Last-2
auc_last2, cfg_last2, tried_last2 = search(mask_last2)
print(f'[Last2] tried={tried_last2} | best OOF(z,last2) AUC={auc_last2:.5f} | cfg={cfg_last2}')

# 3) Gamma-decayed
best_gamma, best_auc_g, best_cfg_g = None, -1.0, None
for gamma in [0.95, 0.98]:
    w = np.zeros(n, dtype=np.float64)
    for bi in range(1, k):
        age = (k - 1) - bi
        w[np.array(blocks[bi])] = (gamma ** age)
    auc_g, cfg_g, _ = search(mask_full, sample_weight=w)
    print(f'[Gamma {gamma}] best OOF(z,weighted) AUC={auc_g:.5f}')
    if auc_g > best_auc_g:
        best_auc_g, best_cfg_g, best_gamma = auc_g, cfg_g, gamma
print(f'[Gamma-best] gamma={best_gamma} | AUC={best_auc_g:.5f} | cfg={best_cfg_g}')

def build_and_save(tag, cfg):
    use_decay = cfg['use_decay']
    tz_lr_ns = to_logit(t_lr_ns_decay if (use_decay and (t_lr_ns_decay is not None)) else t_lr_ns_base)
    tz_lr_w = to_logit(t_lr_w)
    tz_lr_mix = (1.0 - cfg['g'])*tz_lr_w + cfg['g']*tz_lr_ns
    z_comps = [cfg['w_lr']*tz_lr_mix,
               cfg['w_d1']*to_logit(t_d1),
               cfg['w_d2']*to_logit(t_d2),
               cfg['w_meta']*to_logit(t_meta),
               cfg['w_emn']*to_logit(t_emn),
               cfg['w_emp']*to_logit(t_emp)]
    w_list = [cfg['w_lr'], cfg['w_d1'], cfg['w_d2'], cfg['w_meta'], cfg['w_emn'], cfg['w_emp']]
    if has_lr_mainm and cfg['w_lrmain'] > 0:
        z_comps.append(cfg['w_lrmain']*to_logit(t_lr_mainm))
        w_list.append(cfg['w_lrmain'])
    zt = np.sum(z_comps, axis=0)
    pt = sigmoid(zt).astype(np.float32)
    pd.DataFrame({id_col: ids, target_col: pt}).to_csv(f'submission_reblend_{tag}.csv', index=False)
    # Shrink hedge 15%
    w_vec = np.array(w_list, dtype=np.float64)
    w_eq = np.ones_like(w_vec)/len(w_vec)
    alpha = 0.15
    w_shr = ((1.0 - alpha)*w_vec + alpha*w_eq); w_shr = (w_shr / w_shr.sum()).astype(np.float64)
    # Rebuild z with shrunk weights
    comps_logits = [to_logit(t_lr_mix := ( (1.0 - cfg['g'])*to_logit(t_lr_w) + cfg['g']*tz_lr_ns ))]  # placeholder, not used directly
    comp_logits = [tz_lr_mix, to_logit(t_d1), to_logit(t_d2), to_logit(t_meta), to_logit(t_emn), to_logit(t_emp)]
    if has_lr_mainm and cfg['w_lrmain'] > 0:
        comp_logits.append(to_logit(t_lr_mainm))
    zt_shr = 0.0
    for wi, zi in zip(w_shr, comp_logits):
        zt_shr += wi*zi
    pt_shr = sigmoid(zt_shr).astype(np.float32)
    pd.DataFrame({id_col: ids, target_col: pt_shr}).to_csv(f'submission_reblend_{tag}_shrunk.csv', index=False)

build_and_save('full', cfg_full)
build_and_save('last2', cfg_last2)
build_and_save(f'gamma{best_gamma:.2f}'.replace('.','p'), best_cfg_g)

# Promote gamma-best
prim = f'submission_reblend_gamma{best_gamma:.2f}'.replace('.','p') + '.csv'
pd.read_csv(prim).to_csv('submission.csv', index=False)
print(f'Promoted {prim} to submission.csv')

In [None]:
# S37: TF-IDF(word 1-2) -> SVD(300) per-fold, + meta_v1, StandardScaler, models: LR and XGB; cache OOF/test
import numpy as np, pandas as pd, time, gc, sys
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
import xgboost as xgb
from scipy.sparse import csr_matrix

id_col = 'request_id'; target_col = 'requester_received_pizza'
train = pd.read_json('train.json')
test = pd.read_json('test.json')
y = train[target_col].astype(int).values

def get_title(df):
    return df.get('request_title', pd.Series(['']*len(df))).fillna('').astype(str)
def get_body_no_leak(df):
    if 'request_text' in df.columns:
        return df['request_text'].fillna('').astype(str)
    return df.get('request_text', pd.Series(['']*len(df))).fillna('').astype(str)
def build_text(df):
    return (get_title(df) + '\n' + get_body_no_leak(df)).astype(str)

txt_tr = build_text(train)
txt_te = build_text(test)

# Time-aware 6-block forward-chaining
order = np.argsort(train['unix_timestamp_of_request'].values)
n = len(train); k = 6
blocks = np.array_split(order, k)
folds = []; mask_val = np.zeros(n, dtype=bool)
for i in range(1, k):
    va_idx = np.array(blocks[i]); tr_idx = np.concatenate(blocks[:i])
    folds.append((tr_idx, va_idx)); mask_val[va_idx] = True
print(f'Time-CV folds={len(folds)}; validated {mask_val.sum()}/{n}')

Meta_tr = np.load('meta_v1_tr.npy').astype(np.float32)
Meta_te = np.load('meta_v1_te.npy').astype(np.float32)

word_params = dict(analyzer='word', ngram_range=(1,2), lowercase=True, min_df=2, max_features=300_000,
                   sublinear_tf=True, smooth_idf=True, norm='l2', dtype=np.float32)

def run_svd_base(model_type='lr', smoke=True, seed=42, tag='svd_word300_meta'):
    start = time.time()
    use_folds = folds[:2] if smoke else folds
    oof = np.zeros(n, dtype=np.float32)
    te_parts = []
    fold_times = []
    for fi, (tr_idx, va_idx) in enumerate(use_folds, 1):
        t0 = time.time()
        tr_text = txt_tr.iloc[tr_idx]; va_text = txt_tr.iloc[va_idx]
        tf_w = TfidfVectorizer(**word_params)
        Xw_tr = tf_w.fit_transform(tr_text)
        Xw_va = tf_w.transform(va_text)
        Xw_te = tf_w.transform(txt_te)
        svd = TruncatedSVD(n_components=300, n_iter=5, random_state=seed, algorithm='randomized')
        Z_tr = svd.fit_transform(Xw_tr).astype(np.float32)
        Z_va = svd.transform(Xw_va).astype(np.float32)
        Z_te = svd.transform(Xw_te).astype(np.float32)
        # Concatenate meta and standardize together
        M_tr = Meta_tr[tr_idx]
        M_va = Meta_tr[va_idx]
        M_te = Meta_te
        X_tr = np.hstack([Z_tr, M_tr]).astype(np.float32)
        X_va = np.hstack([Z_va, M_va]).astype(np.float32)
        X_te = np.hstack([Z_te, M_te]).astype(np.float32)
        scaler = StandardScaler(with_mean=True, with_std=True)
        X_tr = scaler.fit_transform(X_tr).astype(np.float32)
        X_va = scaler.transform(X_va).astype(np.float32)
        X_te_s = scaler.transform(X_te).astype(np.float32)
        if model_type == 'lr':
            clf = LogisticRegression(solver='saga', penalty='l2', C=1.0, max_iter=3000, n_jobs=-1, verbose=0, random_state=seed)
            clf.fit(X_tr, y[tr_idx])
            va_pred = clf.predict_proba(X_va)[:,1].astype(np.float32)
            te_pred = clf.predict_proba(X_te_s)[:,1].astype(np.float32)
        elif model_type == 'xgb':
            dtr = xgb.DMatrix(X_tr, label=y[tr_idx])
            dva = xgb.DMatrix(X_va, label=y[va_idx])
            dte = xgb.DMatrix(X_te_s)
            pos = float((y[tr_idx] == 1).sum()); neg = float((y[tr_idx] == 0).sum())
            spw = (neg / max(pos, 1.0)) if pos > 0 else 1.0
            params = dict(objective='binary:logistic', eval_metric='auc', max_depth=5, eta=0.05, subsample=0.8,
                          colsample_bytree=0.8, min_child_weight=4, reg_alpha=0.2, reg_lambda=2.5,
                          device='cuda', tree_method='hist', seed=seed, scale_pos_weight=spw)
            booster = xgb.train(params, dtr, num_boost_round=4000, evals=[(dva, 'valid')],
                                early_stopping_rounds=100, verbose_eval=False)
            va_pred = booster.predict(dva).astype(np.float32)
            te_pred = booster.predict(dte, iteration_range=(0, booster.best_iteration+1 if booster.best_iteration is not None else 0)).astype(np.float32)
        else:
            raise ValueError('model_type must be lr or xgb')
        oof[va_idx] = va_pred
        te_parts.append(te_pred)
        auc = roc_auc_score(y[va_idx], va_pred)
        dt = time.time()-t0; fold_times.append(dt)
        print(f'[SVD {model_type}] Fold {fi}/{len(use_folds)} AUC={auc:.5f} | {dt:.1f}s | TF:{Xw_tr.shape[1]} SVD:{Z_tr.shape[1]} feats:{X_tr.shape[1]}')
        del tf_w, Xw_tr, Xw_va, Xw_te, svd, Z_tr, Z_va, Z_te, M_tr, M_va, M_te, X_tr, X_va, X_te, X_te_s
        if model_type == 'xgb':
            del dtr, dva, dte, booster
        gc.collect()
    auc_mask = roc_auc_score(y[mask_val], oof[mask_val]) if not smoke else roc_auc_score(y[use_folds[0][1]], oof[use_folds[0][1]])
    te_mean = np.mean(te_parts, axis=0).astype(np.float32)
    mode_tag = f'{model_type}_smoke' if smoke else model_type
    oof_path = f'oof_{mode_tag}_{tag}.npy'
    te_path = f'test_{mode_tag}_{tag}.npy'
    np.save(oof_path, oof.astype(np.float32))
    np.save(te_path, te_mean)
    print(f'[SVD {model_type}] DONE | OOF(valid mask={"full" if not smoke else "1-fold"}) AUC={auc_mask:.5f} | folds={len(use_folds)} | {time.time()-start:.1f}s')
    print(f'Saved {oof_path} and {te_path}')

# Example next steps (not executed here):
# 1) Smoke test LR: run_svd_base(model_type='lr', smoke=True, seed=42, tag='svd_word300_meta')
# 2) If good, full CV LR: run_svd_base(model_type='lr', smoke=False, seed=42, tag='svd_word300_meta')
# 3) Train XGB variant: run_svd_base(model_type='xgb', smoke=False, seed=42, tag='svd_word300_meta')
# Then add to blend tuner and retune weights.

In [None]:
# S37-run: 2-fold smoke test for SVD base with LR (+meta); expect OOF ~0.65+ on first val fold
try:
    run_svd_base(model_type='lr', smoke=True, seed=42, tag='svd_word300_meta')
except Exception as e:
    import traceback, sys
    print('Error during SVD LR smoke test:', e)
    traceback.print_exc(file=sys.stdout)

In [None]:
# S37-run-full: Full 5-fold CV for SVD base with LR and XGB (+meta); cache OOF/test
try:
    print('=== Running SVD LR full CV ===')
    run_svd_base(model_type='lr', smoke=False, seed=42, tag='svd_word300_meta')
    print('=== Running SVD XGB full CV ===')
    run_svd_base(model_type='xgb', smoke=False, seed=42, tag='svd_word300_meta')
except Exception as e:
    import traceback, sys
    print('Error during SVD full runs:', e)
    traceback.print_exc(file=sys.stdout)

In [None]:
# S37b: Reblend including new SVD bases; allow downweighting/removing Dense; promote gamma-best
import numpy as np, pandas as pd
from sklearn.metrics import roc_auc_score
from pathlib import Path

id_col = 'request_id'; target_col = 'requester_received_pizza'
train = pd.read_json('train.json')
test = pd.read_json('test.json')
y = train[target_col].astype(int).values
ids = test[id_col].values

def to_logit(p, eps=1e-6):
    p = np.clip(p.astype(np.float64), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Time blocks and masks
order = np.argsort(train['unix_timestamp_of_request'].values)
k = 6
blocks = np.array_split(order, k)
n = len(train)
mask_full = np.zeros(n, dtype=bool)
for i in range(1, k):
    mask_full[np.array(blocks[i])] = True
mask_last2 = np.zeros(n, dtype=bool)
for i in [4,5]:
    mask_last2[np.array(blocks[i])] = True
print(f'Time-CV validated full: {mask_full.sum()}/{n} | last2: {mask_last2.sum()}')

# Load OOF preds
o_lr_w = np.load('oof_lr_time_withsub_meta.npy')
o_lr_ns_base = np.load('oof_lr_time_nosub_meta.npy')
o_lr_ns_decay = np.load('oof_lr_time_nosub_meta_decay.npy') if Path('oof_lr_time_nosub_meta_decay.npy').exists() else None
o_d1 = np.load('oof_xgb_dense_time.npy')
o_d2 = np.load('oof_xgb_dense_time_v2.npy')
o_meta = np.load('oof_xgb_meta_time.npy')
o_emn = np.load('oof_xgb_emb_meta_time.npy')
o_emp = np.load('oof_xgb_emb_mpnet_time.npy')
# New SVD bases
o_svd_lr = np.load('oof_lr_svd_word300_meta.npy')
o_svd_xgb = np.load('oof_xgb_svd_word300_meta.npy')

# Optional LR_main+meta (kept optional but likely 0 weight)
has_lr_mainm = Path('oof_lr_main_meta_time.npy').exists() and Path('test_lr_main_meta_time.npy').exists()
if has_lr_mainm:
    o_lr_mainm = np.load('oof_lr_main_meta_time.npy')
    print('Loaded LR_main+meta for blend consideration.')

# Convert OOF to logits where fixed
z_lr_w = to_logit(o_lr_w)
z_d1, z_d2, z_meta = to_logit(o_d1), to_logit(o_d2), to_logit(o_meta)
z_emn, z_emp = to_logit(o_emn), to_logit(o_emp)
z_svd_lr, z_svd_xgb = to_logit(o_svd_lr), to_logit(o_svd_xgb)
if has_lr_mainm:
    z_lr_mainm = to_logit(o_lr_mainm)

# Load test preds (prefer full-bag refits where available for embeddings/meta); SVD ones from current run
t_lr_w = np.load('test_lr_time_withsub_meta.npy')
t_lr_ns_base = np.load('test_lr_time_nosub_meta.npy')
t_lr_ns_decay = np.load('test_lr_time_nosub_meta_decay.npy') if Path('test_lr_time_nosub_meta_decay.npy').exists() else None
t_d1 = np.load('test_xgb_dense_time.npy')
t_d2 = np.load('test_xgb_dense_time_v2.npy')
t_meta = np.load('test_xgb_meta_fullbag.npy') if Path('test_xgb_meta_fullbag.npy').exists() else np.load('test_xgb_meta_time.npy')
t_emn = np.load('test_xgb_emb_minilm_fullbag.npy') if Path('test_xgb_emb_minilm_fullbag.npy').exists() else np.load('test_xgb_emb_meta_time.npy')
t_emp = np.load('test_xgb_emb_mpnet_fullbag.npy') if Path('test_xgb_emb_mpnet_fullbag.npy').exists() else np.load('test_xgb_emb_mpnet_time.npy')
t_svd_lr = np.load('test_lr_svd_word300_meta.npy')
t_svd_xgb = np.load('test_xgb_svd_word300_meta.npy')
if has_lr_mainm:
    t_lr_mainm = np.load('test_lr_main_meta_time.npy')

# Grids
g_grid = [0.96, 0.97, 0.98]
meta_grid = [0.18, 0.20, 0.22]
dense_tot_grid = [0.0, 0.15, 0.22, 0.28]  # allow turning Dense off
dense_split = [(0.8, 0.2), (0.7, 0.3), (0.6, 0.4)]
emb_tot_grid = [0.20, 0.24, 0.27, 0.30]
emb_split = [(0.6, 0.4), (0.5, 0.5)]
svd_tot_grid = [0.0, 0.05, 0.10, 0.15, 0.20]
svd_split = [(0.7, 0.3), (0.5, 0.5)]  # (svd_lr, svd_xgb)
w_lrmain_grid = [0.0, 0.05] if has_lr_mainm else [0.0]
use_lr_decay_options = [False, True] if (o_lr_ns_decay is not None) else [False]

def search(mask, sample_weight=None):
    best_auc, best_cfg, tried = -1.0, None, 0
    for use_decay in use_lr_decay_options:
        z_lr_ns = to_logit(o_lr_ns_decay) if (use_decay and (o_lr_ns_decay is not None)) else to_logit(o_lr_ns_base)
        for g in g_grid:
            z_lr_mix = (1.0 - g)*z_lr_w + g*z_lr_ns
            for w_meta in meta_grid:
                for d_tot in dense_tot_grid:
                    for dv1, dv2 in dense_split:
                        w_d1 = d_tot * dv1; w_d2 = d_tot * dv2
                        for e_tot in emb_tot_grid:
                            for emn_fr, emp_fr in emb_split:
                                w_emn = e_tot * emn_fr; w_emp = e_tot * emp_fr
                                for s_tot in svd_tot_grid:
                                    for s_lr_fr, s_xgb_fr in svd_split:
                                        w_svd_lr = s_tot * s_lr_fr; w_svd_xgb = s_tot * s_xgb_fr
                                        rem = 1.0 - (w_meta + w_d1 + w_d2 + w_emn + w_emp + w_svd_lr + w_svd_xgb)
                                        if rem <= 0: continue
                                        for w_lrmain in w_lrmain_grid:
                                            if w_lrmain > rem: continue
                                            w_lr = rem - w_lrmain
                                            if w_lr < 0.20:  # keep LR_mix reasonably weighted
                                                continue
                                            z_oof = (w_lr*z_lr_mix +
                                                     w_d1*z_d1 + w_d2*z_d2 +
                                                     w_meta*z_meta +
                                                     w_emn*z_emn + w_emp*z_emp +
                                                     w_svd_lr*z_svd_lr + w_svd_xgb*z_svd_xgb)
                                            if has_lr_mainm and w_lrmain > 0:
                                                z_oof = z_oof + w_lrmain*z_lr_mainm
                                            auc = roc_auc_score(y[mask], z_oof[mask], sample_weight=(sample_weight[mask] if sample_weight is not None else None))
                                            tried += 1
                                            if auc > best_auc:
                                                best_auc = auc
                                                best_cfg = dict(use_decay=use_decay, g=float(g),
                                                                w_lr=float(w_lr), w_d1=float(w_d1), w_d2=float(w_d2),
                                                                w_meta=float(w_meta), w_emn=float(w_emn), w_emp=float(w_emp),
                                                                w_svd_lr=float(w_svd_lr), w_svd_xgb=float(w_svd_xgb),
                                                                w_lrmain=float(w_lrmain))
    return best_auc, best_cfg, tried

# 1) Full-mask
auc_full, cfg_full, tried_full = search(mask_full)
print(f'[Full] tried={tried_full} | best OOF(z) AUC={auc_full:.5f} | cfg={cfg_full}')

# 2) Last-2
auc_last2, cfg_last2, tried_last2 = search(mask_last2)
print(f'[Last2] tried={tried_last2} | best OOF(z,last2) AUC={auc_last2:.5f} | cfg={cfg_last2}')

# 3) Gamma-decayed
best_gamma, best_auc_g, best_cfg_g = None, -1.0, None
for gamma in [0.95, 0.98]:
    w = np.zeros(n, dtype=np.float64)
    for bi in range(1, k):
        age = (k - 1) - bi
        w[np.array(blocks[bi])] = (gamma ** age)
    auc_g, cfg_g, _ = search(mask_full, sample_weight=w)
    print(f'[Gamma {gamma}] best OOF(z,weighted) AUC={auc_g:.5f}')
    if auc_g > best_auc_g:
        best_auc_g, best_cfg_g, best_gamma = auc_g, cfg_g, gamma
print(f'[Gamma-best] gamma={best_gamma} | AUC={best_auc_g:.5f} | cfg={best_cfg_g}')

def build_and_save(tag, cfg):
    use_decay = cfg['use_decay']
    tz_lr_ns = to_logit(t_lr_ns_decay if (use_decay and (t_lr_ns_decay is not None)) else t_lr_ns_base)
    tz_lr_w = to_logit(t_lr_w)
    tz_lr_mix = (1.0 - cfg['g'])*tz_lr_w + cfg['g']*tz_lr_ns
    z_parts = [
        cfg['w_lr']*tz_lr_mix,
        cfg['w_d1']*to_logit(t_d1),
        cfg['w_d2']*to_logit(t_d2),
        cfg['w_meta']*to_logit(t_meta),
        cfg['w_emn']*to_logit(t_emn),
        cfg['w_emp']*to_logit(t_emp),
        cfg['w_svd_lr']*to_logit(t_svd_lr),
        cfg['w_svd_xgb']*to_logit(t_svd_xgb)
    ]
    w_list = [cfg['w_lr'], cfg['w_d1'], cfg['w_d2'], cfg['w_meta'], cfg['w_emn'], cfg['w_emp'], cfg['w_svd_lr'], cfg['w_svd_xgb']]
    if has_lr_mainm and cfg['w_lrmain'] > 0:
        z_parts.append(cfg['w_lrmain']*to_logit(t_lr_mainm))
        w_list.append(cfg['w_lrmain'])
    zt = np.sum(z_parts, axis=0)
    pt = sigmoid(zt).astype(np.float32)
    pd.DataFrame({id_col: ids, target_col: pt}).to_csv(f'submission_reblend_svd_{tag}.csv', index=False)
    # 15% shrink hedge
    w_vec = np.array(w_list, dtype=np.float64)
    w_eq = np.ones_like(w_vec)/len(w_vec)
    alpha = 0.15
    w_shr = ((1.0 - alpha)*w_vec + alpha*w_eq); w_shr = (w_shr / w_shr.sum()).astype(np.float64)
    comp_logits = [tz_lr_mix, to_logit(t_d1), to_logit(t_d2), to_logit(t_meta), to_logit(t_emn), to_logit(t_emp), to_logit(t_svd_lr), to_logit(t_svd_xgb)]
    if has_lr_mainm and cfg['w_lrmain'] > 0:
        comp_logits.append(to_logit(t_lr_mainm))
    zt_shr = 0.0
    for wi, zi in zip(w_shr, comp_logits):
        zt_shr += wi*zi
    pt_shr = sigmoid(zt_shr).astype(np.float32)
    pd.DataFrame({id_col: ids, target_col: pt_shr}).to_csv(f'submission_reblend_svd_{tag}_shrunk.csv', index=False)

build_and_save('full', cfg_full)
build_and_save('last2', cfg_last2)
build_and_save(f'gamma{best_gamma:.2f}'.replace('.','p'), best_cfg_g)

# Promote gamma-best
prim = f'submission_reblend_svd_gamma{best_gamma:.2f}'.replace('.','p') + '.csv'
pd.read_csv(prim).to_csv('submission.csv', index=False)
print(f'Promoted {prim} to submission.csv')

In [None]:
# S37c: Dual-view TF-IDF (word 1-2 + char_wb 3-5) -> SVD (224+160) per-fold + meta_v1 -> StandardScaler -> LR/XGB
import numpy as np, pandas as pd, time, gc, sys
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
import xgboost as xgb

id_col = 'request_id'; target_col = 'requester_received_pizza'
train = pd.read_json('train.json')
test = pd.read_json('test.json')
y = train[target_col].astype(int).values

def get_title(df):
    return df.get('request_title', pd.Series(['']*len(df))).fillna('').astype(str)
def get_body_no_leak(df):
    if 'request_text' in df.columns:
        return df['request_text'].fillna('').astype(str)
    return df.get('request_text', pd.Series(['']*len(df))).fillna('').astype(str)
def build_text(df):
    return (get_title(df) + '\n' + get_body_no_leak(df)).astype(str)

txt_tr = build_text(train)
txt_te = build_text(test)

# Time-aware 6-block forward-chaining
order = np.argsort(train['unix_timestamp_of_request'].values)
n = len(train); k = 6
blocks = np.array_split(order, k)
folds = []; mask_val = np.zeros(n, dtype=bool)
for i in range(1, k):
    va_idx = np.array(blocks[i]); tr_idx = np.concatenate(blocks[:i])
    folds.append((tr_idx, va_idx)); mask_val[va_idx] = True
print(f'Time-CV folds={len(folds)}; validated {mask_val.sum()}/{n}')

Meta_tr = np.load('meta_v1_tr.npy').astype(np.float32)
Meta_te = np.load('meta_v1_te.npy').astype(np.float32)

word_params = dict(analyzer='word', ngram_range=(1,2), lowercase=True, min_df=2, max_features=300_000,
                   stop_words='english', sublinear_tf=True, smooth_idf=True, norm='l2', dtype=np.float32)
char_params = dict(analyzer='char_wb', ngram_range=(3,5), lowercase=True, min_df=2, max_features=200_000,
                   stop_words='english', sublinear_tf=True, smooth_idf=True, norm='l2', dtype=np.float32)

def run_svd_dual(model_type='xgb', smoke=True, seed=42, tag='svd_word224_char160_meta', n_iter_svd=7, n_comp_word=224, n_comp_char=160):
    start = time.time()
    use_folds = folds[:2] if smoke else folds
    oof = np.zeros(n, dtype=np.float32)
    te_parts = []
    for fi, (tr_idx, va_idx) in enumerate(use_folds, 1):
        t0 = time.time()
        tr_text = txt_tr.iloc[tr_idx]; va_text = txt_tr.iloc[va_idx]
        # Word view
        tf_w = TfidfVectorizer(**word_params)
        Xw_tr = tf_w.fit_transform(tr_text); Xw_va = tf_w.transform(va_text); Xw_te = tf_w.transform(txt_te)
        svd_w = TruncatedSVD(n_components=n_comp_word, n_iter=n_iter_svd, random_state=seed, algorithm='randomized')
        Zw_tr = svd_w.fit_transform(Xw_tr).astype(np.float32)
        Zw_va = svd_w.transform(Xw_va).astype(np.float32)
        Zw_te = svd_w.transform(Xw_te).astype(np.float32)
        # Char view
        tf_c = TfidfVectorizer(**char_params)
        Xc_tr = tf_c.fit_transform(tr_text); Xc_va = tf_c.transform(va_text); Xc_te = tf_c.transform(txt_te)
        svd_c = TruncatedSVD(n_components=n_comp_char, n_iter=n_iter_svd, random_state=seed+1, algorithm='randomized')
        Zc_tr = svd_c.fit_transform(Xc_tr).astype(np.float32)
        Zc_va = svd_c.transform(Xc_va).astype(np.float32)
        Zc_te = svd_c.transform(Xc_te).astype(np.float32)
        # Concat views + meta
        Z_tr = np.hstack([Zw_tr, Zc_tr]).astype(np.float32)
        Z_va = np.hstack([Zw_va, Zc_va]).astype(np.float32)
        Z_te = np.hstack([Zw_te, Zc_te]).astype(np.float32)
        M_tr = Meta_tr[tr_idx]; M_va = Meta_tr[va_idx]; M_te = Meta_te
        X_tr = np.hstack([Z_tr, M_tr]).astype(np.float32)
        X_va = np.hstack([Z_va, M_va]).astype(np.float32)
        X_te = np.hstack([Z_te, M_te]).astype(np.float32)
        # Standardize
        scaler = StandardScaler(with_mean=True, with_std=True)
        X_tr = scaler.fit_transform(X_tr).astype(np.float32)
        X_va = scaler.transform(X_va).astype(np.float32)
        X_te_s = scaler.transform(X_te).astype(np.float32)
        if model_type == 'lr':
            clf = LogisticRegression(solver='saga', penalty='l2', C=1.0, max_iter=3000, n_jobs=-1, verbose=0, random_state=seed)
            clf.fit(X_tr, y[tr_idx])
            va_pred = clf.predict_proba(X_va)[:,1].astype(np.float32)
            te_pred = clf.predict_proba(X_te_s)[:,1].astype(np.float32)
        elif model_type == 'xgb':
            dtr = xgb.DMatrix(X_tr, label=y[tr_idx])
            dva = xgb.DMatrix(X_va, label=y[va_idx])
            dte = xgb.DMatrix(X_te_s)
            pos = float((y[tr_idx] == 1).sum()); neg = float((y[tr_idx] == 0).sum())
            spw = (neg / max(pos, 1.0)) if pos > 0 else 1.0
            params = dict(objective='binary:logistic', eval_metric='auc', max_depth=5, eta=0.05, subsample=0.8,
                          colsample_bytree=0.8, min_child_weight=5, reg_alpha=0.3, reg_lambda=3.0,
                          device='cuda', tree_method='hist', seed=seed, scale_pos_weight=spw)
            booster = xgb.train(params, dtr, num_boost_round=4000, evals=[(dva, 'valid')],
                                early_stopping_rounds=100, verbose_eval=False)
            va_pred = booster.predict(dva).astype(np.float32)
            te_pred = booster.predict(dte, iteration_range=(0, booster.best_iteration+1 if booster.best_iteration is not None else 0)).astype(np.float32)
            del dtr, dva, dte, booster
        else:
            raise ValueError('model_type must be lr or xgb')
        oof[va_idx] = va_pred
        te_parts.append(te_pred)
        auc = roc_auc_score(y[va_idx], va_pred)
        print(f'[SVDdual {model_type}] Fold {fi}/{len(use_folds)} AUC={auc:.5f} | {time.time()-t0:.1f}s | wordTF:{Xw_tr.shape[1]} charTF:{Xc_tr.shape[1]} Z:{Z_tr.shape[1]} + meta->{X_tr.shape[1]}')
        # Cleanup
        del tf_w, Xw_tr, Xw_va, Xw_te, svd_w, Zw_tr, Zw_va, Zw_te
        del tf_c, Xc_tr, Xc_va, Xc_te, svd_c, Zc_tr, Zc_va, Zc_te
        del Z_tr, Z_va, Z_te, M_tr, M_va, M_te, X_tr, X_va, X_te, X_te_s
        gc.collect()
    auc_mask = roc_auc_score(y[mask_val], oof[mask_val]) if not smoke else roc_auc_score(y[use_folds[0][1]], oof[use_folds[0][1]])
    te_mean = np.mean(te_parts, axis=0).astype(np.float32)
    mode_tag = f'{model_type}_smoke' if smoke else model_type
    oof_path = f'oof_{mode_tag}_{tag}.npy'; te_path = f'test_{mode_tag}_{tag}.npy'
    np.save(oof_path, oof.astype(np.float32)); np.save(te_path, te_mean)
    print(f'[SVDdual {model_type}] DONE | OOF(valid mask={"full" if not smoke else "1-fold"}) AUC={auc_mask:.5f} | folds={len(use_folds)} | {time.time()-start:.1f}s')
    print(f'Saved {oof_path} and {te_path}')

# Next: run smoke test with XGB/LR (n_iter=7, 224+160), then full CV if promising:
# run_svd_dual(model_type='xgb', smoke=True, seed=42, tag='svd_word224_char160_meta', n_iter_svd=7, n_comp_word=224, n_comp_char=160)
# run_svd_dual(model_type='lr', smoke=True, seed=42, tag='svd_word224_char160_meta', n_iter_svd=7, n_comp_word=224, n_comp_char=160)

In [None]:
# S37c-run: 2-fold smoke test for dual-view SVD base with XGB (+meta); expect >= word-only SVD folds
try:
    run_svd_dual(model_type='xgb', smoke=True, seed=42, tag='svd_word192_char128_meta', n_iter_svd=5)
except Exception as e:
    import traceback, sys
    print('Error during SVDdual XGB smoke test:', e)
    traceback.print_exc(file=sys.stdout)

In [None]:
# S37d-run-full: Full 5-fold CV for dual-view SVD (word192+char128)+meta with XGB; cache OOF/test, then reblend next
try:
    run_svd_dual(model_type='xgb', smoke=False, seed=42, tag='svd_word192_char128_meta', n_iter_svd=5)
except Exception as e:
    import traceback, sys
    print('Error during SVDdual XGB full run:', e)
    traceback.print_exc(file=sys.stdout)

In [None]:
# S37e: Reblend including dual-view SVD XGB base; allow Dense to drop; promote gamma-best
import numpy as np, pandas as pd
from sklearn.metrics import roc_auc_score
from pathlib import Path

id_col = 'request_id'; target_col = 'requester_received_pizza'
train = pd.read_json('train.json')
test = pd.read_json('test.json')
y = train[target_col].astype(int).values
ids = test[id_col].values

def to_logit(p, eps=1e-6):
    p = np.clip(p.astype(np.float64), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Time blocks and masks
order = np.argsort(train['unix_timestamp_of_request'].values)
k = 6
blocks = np.array_split(order, k)
n = len(train)
mask_full = np.zeros(n, dtype=bool)
for i in range(1, k):
    mask_full[np.array(blocks[i])] = True
mask_last2 = np.zeros(n, dtype=bool)
for i in [4,5]:
    mask_last2[np.array(blocks[i])] = True
print(f'Time-CV validated full: {mask_full.sum()}/{n} | last2: {mask_last2.sum()}')

# Load OOF preds
o_lr_w = np.load('oof_lr_time_withsub_meta.npy')
o_lr_ns_base = np.load('oof_lr_time_nosub_meta.npy')
o_lr_ns_decay = np.load('oof_lr_time_nosub_meta_decay.npy') if Path('oof_lr_time_nosub_meta_decay.npy').exists() else None
o_d1 = np.load('oof_xgb_dense_time.npy')
o_d2 = np.load('oof_xgb_dense_time_v2.npy')
o_meta = np.load('oof_xgb_meta_time.npy')
o_emn = np.load('oof_xgb_emb_meta_time.npy')
o_emp = np.load('oof_xgb_emb_mpnet_time.npy')
# Dual-view SVD base
o_svd_dual = np.load('oof_xgb_svd_word192_char128_meta.npy')

# Optional LR_main+meta
has_lr_mainm = Path('oof_lr_main_meta_time.npy').exists() and Path('test_lr_main_meta_time.npy').exists()
if has_lr_mainm:
    o_lr_mainm = np.load('oof_lr_main_meta_time.npy')
    print('Loaded LR_main+meta for blend consideration.')

# Convert OOF to logits
z_lr_w = to_logit(o_lr_w)
z_d1, z_d2, z_meta = to_logit(o_d1), to_logit(o_d2), to_logit(o_meta)
z_emn, z_emp = to_logit(o_emn), to_logit(o_emp)
z_svd_dual = to_logit(o_svd_dual)
if has_lr_mainm:
    z_lr_mainm = to_logit(o_lr_mainm)

# Load test preds (prefer full-bag refits where available)
t_lr_w = np.load('test_lr_time_withsub_meta.npy')
t_lr_ns_base = np.load('test_lr_time_nosub_meta.npy')
t_lr_ns_decay = np.load('test_lr_time_nosub_meta_decay.npy') if Path('test_lr_time_nosub_meta_decay.npy').exists() else None
t_d1 = np.load('test_xgb_dense_time.npy')
t_d2 = np.load('test_xgb_dense_time_v2.npy')
t_meta = np.load('test_xgb_meta_fullbag.npy') if Path('test_xgb_meta_fullbag.npy').exists() else np.load('test_xgb_meta_time.npy')
t_emn = np.load('test_xgb_emb_minilm_fullbag.npy') if Path('test_xgb_emb_minilm_fullbag.npy').exists() else np.load('test_xgb_emb_meta_time.npy')
t_emp = np.load('test_xgb_emb_mpnet_fullbag.npy') if Path('test_xgb_emb_mpnet_fullbag.npy').exists() else np.load('test_xgb_emb_mpnet_time.npy')
t_svd_dual = np.load('test_xgb_svd_word192_char128_meta.npy')
if has_lr_mainm:
    t_lr_mainm = np.load('test_lr_main_meta_time.npy')

# Grids
g_grid = [0.96, 0.97, 0.98]
meta_grid = [0.18, 0.20, 0.22]
dense_tot_grid = [0.0, 0.15, 0.22, 0.28]
dense_split = [(0.8, 0.2), (0.7, 0.3), (0.6, 0.4)]
emb_tot_grid = [0.20, 0.24, 0.27, 0.30]
emb_split = [(0.6, 0.4), (0.5, 0.5)]
svd_dual_grid = [0.0, 0.05, 0.08, 0.10, 0.12, 0.15, 0.20]
w_lrmain_grid = [0.0, 0.05] if has_lr_mainm else [0.0]
use_lr_decay_options = [False, True] if (o_lr_ns_decay is not None) else [False]

def search(mask, sample_weight=None):
    best_auc, best_cfg, tried = -1.0, None, 0
    for use_decay in use_lr_decay_options:
        z_lr_ns = to_logit(o_lr_ns_decay) if (use_decay and (o_lr_ns_decay is not None)) else to_logit(o_lr_ns_base)
        for g in g_grid:
            z_lr_mix = (1.0 - g)*z_lr_w + g*z_lr_ns
            for w_meta in meta_grid:
                for d_tot in dense_tot_grid:
                    for dv1, dv2 in dense_split:
                        w_d1 = d_tot * dv1; w_d2 = d_tot * dv2
                        for e_tot in emb_tot_grid:
                            for emn_fr, emp_fr in emb_split:
                                w_emn = e_tot * emn_fr; w_emp = e_tot * emp_fr
                                for w_svd_dual in svd_dual_grid:
                                    rem = 1.0 - (w_meta + w_d1 + w_d2 + w_emn + w_emp + w_svd_dual)
                                    if rem <= 0: continue
                                    for w_lrmain in w_lrmain_grid:
                                        if w_lrmain > rem: continue
                                        w_lr = rem - w_lrmain
                                        if w_lr < 0.20: continue
                                        z_oof = (w_lr*z_lr_mix +
                                                 w_d1*z_d1 + w_d2*z_d2 +
                                                 w_meta*z_meta +
                                                 w_emn*z_emn + w_emp*z_emp +
                                                 w_svd_dual*z_svd_dual)
                                        if has_lr_mainm and w_lrmain > 0:
                                            z_oof = z_oof + w_lrmain*z_lr_mainm
                                        auc = roc_auc_score(y[mask], z_oof[mask], sample_weight=(sample_weight[mask] if sample_weight is not None else None))
                                        tried += 1
                                        if auc > best_auc:
                                            best_auc = auc
                                            best_cfg = dict(use_decay=use_decay, g=float(g),
                                                            w_lr=float(w_lr), w_d1=float(w_d1), w_d2=float(w_d2),
                                                            w_meta=float(w_meta), w_emn=float(w_emn), w_emp=float(w_emp),
                                                            w_svd_dual=float(w_svd_dual), w_lrmain=float(w_lrmain))
    return best_auc, best_cfg, tried

# 1) Full-mask
auc_full, cfg_full, tried_full = search(mask_full)
print(f'[Full] tried={tried_full} | best OOF(z) AUC={auc_full:.5f} | cfg={cfg_full}')

# 2) Last-2
auc_last2, cfg_last2, tried_last2 = search(mask_last2)
print(f'[Last2] tried={tried_last2} | best OOF(z,last2) AUC={auc_last2:.5f} | cfg={cfg_last2}')

# 3) Gamma-decayed
best_gamma, best_auc_g, best_cfg_g = None, -1.0, None
for gamma in [0.95, 0.98]:
    w = np.zeros(n, dtype=np.float64)
    for bi in range(1, k):
        age = (k - 1) - bi
        w[np.array(blocks[bi])] = (gamma ** age)
    auc_g, cfg_g, _ = search(mask_full, sample_weight=w)
    print(f'[Gamma {gamma}] best OOF(z,weighted) AUC={auc_g:.5f}')
    if auc_g > best_auc_g:
        best_auc_g, best_cfg_g, best_gamma = auc_g, cfg_g, gamma
print(f'[Gamma-best] gamma={best_gamma} | AUC={best_auc_g:.5f} | cfg={best_cfg_g}')

def build_and_save(tag, cfg):
    use_decay = cfg['use_decay']
    tz_lr_ns = to_logit(t_lr_ns_decay if (use_decay and (t_lr_ns_decay is not None)) else t_lr_ns_base)
    tz_lr_w = to_logit(t_lr_w)
    tz_lr_mix = (1.0 - cfg['g'])*tz_lr_w + cfg['g']*tz_lr_ns
    z_parts = [
        cfg['w_lr']*tz_lr_mix,
        cfg['w_d1']*to_logit(t_d1),
        cfg['w_d2']*to_logit(t_d2),
        cfg['w_meta']*to_logit(t_meta),
        cfg['w_emn']*to_logit(t_emn),
        cfg['w_emp']*to_logit(t_emp),
        cfg['w_svd_dual']*to_logit(t_svd_dual)
    ]
    w_list = [cfg['w_lr'], cfg['w_d1'], cfg['w_d2'], cfg['w_meta'], cfg['w_emn'], cfg['w_emp'], cfg['w_svd_dual']]
    if has_lr_mainm and cfg['w_lrmain'] > 0:
        z_parts.append(cfg['w_lrmain']*to_logit(t_lr_mainm))
        w_list.append(cfg['w_lrmain'])
    zt = np.sum(z_parts, axis=0)
    pt = sigmoid(zt).astype(np.float32)
    pd.DataFrame({id_col: ids, target_col: pt}).to_csv(f'submission_reblend_svddual_{tag}.csv', index=False)
    # 15% shrink hedge
    w_vec = np.array(w_list, dtype=np.float64)
    w_eq = np.ones_like(w_vec)/len(w_vec)
    alpha = 0.15
    w_shr = ((1.0 - alpha)*w_vec + alpha*w_eq); w_shr = (w_shr / w_shr.sum()).astype(np.float64)
    comp_logits = [tz_lr_mix, to_logit(t_d1), to_logit(t_d2), to_logit(t_meta), to_logit(t_emn), to_logit(t_emp), to_logit(t_svd_dual)]
    if has_lr_mainm and cfg['w_lrmain'] > 0:
        comp_logits.append(to_logit(t_lr_mainm))
    zt_shr = 0.0
    for wi, zi in zip(w_shr, comp_logits):
        zt_shr += wi*zi
    pt_shr = sigmoid(zt_shr).astype(np.float32)
    pd.DataFrame({id_col: ids, target_col: pt_shr}).to_csv(f'submission_reblend_svddual_{tag}_shrunk.csv', index=False)

build_and_save('full', cfg_full)
build_and_save('last2', cfg_last2)
build_and_save(f'gamma{best_gamma:.2f}'.replace('.','p'), best_cfg_g)

# Promote gamma-best
prim = f'submission_reblend_svddual_gamma{best_gamma:.2f}'.replace('.','p') + '.csv'
pd.read_csv(prim).to_csv('submission.csv', index=False)
print(f'Promoted {prim} to submission.csv')

In [None]:
# S38: DistilRoBERTa fine-tune (title + [SEP] + request_text), time-aware 6-block CV, cache OOF/test
import os, sys, time, gc, math
import numpy as np, pandas as pd

# Ensure HF cache is writable/local
os.environ['HF_HOME'] = os.path.abspath('hf_cache')
os.environ['TRANSFORMERS_CACHE'] = os.path.abspath('hf_cache')

# Install torch + transformers if missing
try:
    import torch
except ImportError:
    import subprocess
    subprocess.run([sys.executable, '-m', 'pip', 'install', '-q', 'torch', 'transformers', 'accelerate', 'datasets', 'evaluate', 'scikit-learn'])
    import torch
from torch.utils.data import Dataset, DataLoader
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModel, get_linear_schedule_with_warmup
from sklearn.metrics import roc_auc_score

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Torch CUDA:', torch.cuda.is_available(), '| device:', device)

id_col = 'request_id'; target_col = 'requester_received_pizza'
train = pd.read_json('train.json')
test = pd.read_json('test.json')
y = train[target_col].astype(int).values

def get_title(df):
    return df.get('request_title', pd.Series(['']*len(df))).fillna('').astype(str)
def get_body_no_leak(df):
    return (df['request_text'] if 'request_text' in df.columns else df.get('request_text', pd.Series(['']*len(df)))).fillna('').astype(str)
def build_text(df):
    # For RoBERTa, SEP as '</s>' token. We'll just use a textual '[SEP]' and rely on tokenizer to split.
    return (get_title(df) + ' [SEP] ' + get_body_no_leak(df)).astype(str)

txt_tr = build_text(train).tolist()
txt_te = build_text(test).tolist()

# Time-aware 6-block forward-chaining (validate blocks 1..5)
order = np.argsort(train['unix_timestamp_of_request'].values)
n = len(train); k = 6
blocks = np.array_split(order, k)
folds = []; mask_val = np.zeros(n, dtype=bool)
for i in range(1, k):
    va_idx = np.array(blocks[i]); tr_idx = np.concatenate(blocks[:i])
    folds.append((tr_idx, va_idx)); mask_val[va_idx] = True
print(f'Time-CV folds={len(folds)}; validated {mask_val.sum()}/{n}')

model_name = 'distilroberta-base'
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
max_len = 256
batch_size = 32
epochs = 2
lr = 2e-5
weight_decay = 0.01
warmup_ratio = 0.1
seed = 42
torch.manual_seed(seed);
np.random.seed(seed)

class TextDataset(Dataset):
    def __init__(self, texts, labels=None):
        self.texts = texts
        self.labels = labels
    def __len__(self):
        return len(self.texts)
    def __getitem__(self, idx):
        item = tokenizer(self.texts[idx], padding='max_length', truncation=True, max_length=max_len, return_tensors='pt')
        item = {k: v.squeeze(0) for k, v in item.items()}
        if self.labels is not None:
            item['labels'] = torch.tensor(self.labels[idx], dtype=torch.float32)
        return item

class DistilRobertaForBinary(torch.nn.Module):
    def __init__(self, base_name):
        super().__init__()
        self.base = AutoModel.from_pretrained(base_name)
        hidden_size = self.base.config.hidden_size
        self.classifier = torch.nn.Linear(hidden_size, 1)
    def forward(self, input_ids, attention_mask):
        outputs = self.base(input_ids=input_ids, attention_mask=attention_mask)
        # DistilRoBERTa: use mean pooling over last hidden state masked by attention
        last_hidden = outputs.last_hidden_state  # (B, L, H)
        mask = attention_mask.unsqueeze(-1).float()
        summed = (last_hidden * mask).sum(dim=1)
        counts = mask.sum(dim=1).clamp(min=1e-6)
        pooled = summed / counts
        logits = self.classifier(pooled).squeeze(-1)  # (B,)
        return logits

def train_fold(tr_idx, va_idx, fold_id):
    x_tr = [txt_tr[i] for i in tr_idx]; y_tr = y[tr_idx].astype(np.float32)
    x_va = [txt_tr[i] for i in va_idx]; y_va = y[va_idx].astype(np.float32)
    ds_tr = TextDataset(x_tr, y_tr); ds_va = TextDataset(x_va, y_va)
    dl_tr = DataLoader(ds_tr, batch_size=batch_size, shuffle=True, num_workers=2, pin_memory=True)
    dl_va = DataLoader(ds_va, batch_size=batch_size, shuffle=False, num_workers=2, pin_memory=True)
    model = DistilRobertaForBinary(model_name).to(device)
    # Imbalance: pos_weight = neg/pos on train fold
    pos = float((y_tr == 1).sum()); neg = float((y_tr == 0).sum());
    pos_weight = torch.tensor([ (neg / max(pos, 1.0)) if pos > 0 else 1.0 ], device=device, dtype=torch.float32)
    criterion = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)
    optimizer = AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    total_steps = epochs * math.ceil(len(ds_tr) / batch_size)
    warmup_steps = int(warmup_ratio * total_steps)
    scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps)
    scaler = torch.cuda.amp.GradScaler(enabled=torch.cuda.is_available())

    best_va_auc = -1.0
    best_state = None
    t0 = time.time()
    for ep in range(1, epochs+1):
        model.train()
        tr_loss = 0.0; nb = 0
        for batch in dl_tr:
            optimizer.zero_grad(set_to_none=True)
            input_ids = batch['input_ids'].to(device); attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            with torch.cuda.amp.autocast(enabled=torch.cuda.is_available()):
                logits = model(input_ids=input_ids, attention_mask=attention_mask)
                loss = criterion(logits, labels)
            scaler.scale(loss).backward()
            scaler.step(optimizer); scaler.update()
            scheduler.step()
            tr_loss += loss.item(); nb += 1
        # Validate
        model.eval()
        va_probs = []; va_targets = []
        with torch.no_grad():
            for batch in dl_va:
                input_ids = batch['input_ids'].to(device); attention_mask = batch['attention_mask'].to(device)
                labels = batch['labels'].cpu().numpy()
                logits = model(input_ids=input_ids, attention_mask=attention_mask)
                probs = torch.sigmoid(logits).detach().cpu().numpy()
                va_probs.append(probs); va_targets.append(labels)
        va_probs = np.concatenate(va_probs); va_targets = np.concatenate(va_targets)
        va_auc = roc_auc_score(va_targets, va_probs) if va_targets.min() != va_targets.max() else 0.5
        print(f'[Fold {fold_id}] Epoch {ep}/{epochs} | tr_loss={(tr_loss/ max(nb,1)):.4f} | VA AUC={va_auc:.5f} | elapsed={time.time()-t0:.1f}s', flush=True)
        # Track best
        if va_auc > best_va_auc:
            best_va_auc = va_auc
            best_state = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}
    # Load best
    if best_state is not None:
        model.load_state_dict(best_state)
    # Inference on val
    ds_va = TextDataset(x_va, y_va); dl_va = DataLoader(ds_va, batch_size=batch_size, shuffle=False, num_workers=2, pin_memory=True)
    va_probs = []
    model.eval()
    with torch.no_grad():
        for batch in dl_va:
            input_ids = batch['input_ids'].to(device); attention_mask = batch['attention_mask'].to(device)
            logits = model(input_ids=input_ids, attention_mask=attention_mask)
            probs = torch.sigmoid(logits).detach().cpu().numpy()
            va_probs.append(probs)
    va_probs = np.concatenate(va_probs).astype(np.float32)
    # Inference on test
    ds_te = TextDataset(txt_te, None); dl_te = DataLoader(ds_te, batch_size=batch_size, shuffle=False, num_workers=2, pin_memory=True)
    te_probs = []
    with torch.no_grad():
        for batch in dl_te:
            input_ids = batch['input_ids'].to(device); attention_mask = batch['attention_mask'].to(device)
            logits = model(input_ids=input_ids, attention_mask=attention_mask)
            probs = torch.sigmoid(logits).detach().cpu().numpy()
            te_probs.append(probs)
    te_probs = np.concatenate(te_probs).astype(np.float32)
    # Cleanup
    del model, optimizer, scheduler, scaler, ds_tr, ds_va, dl_tr, dl_va
    torch.cuda.empty_cache(); gc.collect()
    return va_probs, te_probs, float(best_va_auc)

oof = np.zeros(n, dtype=np.float32)
te_parts = []
fold_aucs = []
all_start = time.time()
for fi, (tr_idx, va_idx) in enumerate(folds, 1):
    fold_start = time.time()
    va_pred, te_pred, va_auc = train_fold(tr_idx, va_idx, fi)
    oof[va_idx] = va_pred
    te_parts.append(te_pred)
    fold_aucs.append(va_auc)
    print(f'[FT DistilRoBERTa] Fold {fi}/{len(folds)} done | AUC={va_auc:.5f} | {time.time()-fold_start:.1f}s', flush=True)

auc_full = roc_auc_score(y[mask_val], oof[mask_val])
te_mean = np.mean(te_parts, axis=0).astype(np.float32)
np.save('oof_transformer_distilroberta.npy', oof.astype(np.float32))
np.save('test_transformer_distilroberta.npy', te_mean)
print(f'[FT DistilRoBERTa] DONE | OOF AUC={auc_full:.5f} | folds AUC={fold_aucs} | total {time.time()-all_start:.1f}s')
print('Saved oof_transformer_distilroberta.npy and test_transformer_distilroberta.npy')

In [None]:
# S37f-run-full-upgraded: Full 5-fold CV for upgraded dual-view SVD (word224+char160, n_iter=7) with XGB and LR (+meta); cache OOF/test
try:
    print('=== Running upgraded SVDdual XGB full CV (224+160, n_iter=7) ===')
    run_svd_dual(model_type='xgb', smoke=False, seed=42, tag='svd_word224_char160_meta', n_iter_svd=7, n_comp_word=224, n_comp_char=160)
    print('=== Running upgraded SVDdual LR full CV (224+160, n_iter=7) ===')
    run_svd_dual(model_type='lr', smoke=False, seed=42, tag='svd_word224_char160_meta', n_iter_svd=7, n_comp_word=224, n_comp_char=160)
except Exception as e:
    import traceback, sys
    print('Error during upgraded dual SVD full runs:', e)
    traceback.print_exc(file=sys.stdout)

In [None]:
# S37g-run-smoke-upgraded: 2-fold smoke for upgraded dual SVD (word224+char160, n_iter=7) XGB and LR
try:
    print('=== Smoke: upgraded SVDdual XGB (224+160, n_iter=7) ===')
    run_svd_dual(model_type='xgb', smoke=True, seed=42, tag='svd_word224_char160_meta', n_iter_svd=7, n_comp_word=224, n_comp_char=160)
    print('=== Smoke: upgraded SVDdual LR (224+160, n_iter=7) ===')
    run_svd_dual(model_type='lr', smoke=True, seed=42, tag='svd_word224_char160_meta', n_iter_svd=7, n_comp_word=224, n_comp_char=160)
except Exception as e:
    import traceback, sys
    print('Error during upgraded dual SVD smoke runs:', e)
    traceback.print_exc(file=sys.stdout)

In [16]:
# S37h: Reblend with last-2 as primary objective, dense cap <=0.12, gamma in {0.975,0.98,0.99}; include dual SVD base (fast grid + progress logs)
import numpy as np, pandas as pd, time
from sklearn.metrics import roc_auc_score
from pathlib import Path

id_col = 'request_id'; target_col = 'requester_received_pizza'
train = pd.read_json('train.json')
test = pd.read_json('test.json')
y = train[target_col].astype(int).values
ids = test[id_col].values

def to_logit(p, eps=1e-6):
    p = np.clip(p.astype(np.float64), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Time blocks and masks
order = np.argsort(train['unix_timestamp_of_request'].values)
k = 6
blocks = np.array_split(order, k)
n = len(train)
mask_full = np.zeros(n, dtype=bool)
for i in range(1, k):
    mask_full[np.array(blocks[i])] = True
mask_last2 = np.zeros(n, dtype=bool)
for i in [4,5]:
    mask_last2[np.array(blocks[i])] = True
print(f'Time-CV validated full: {mask_full.sum()}/{n} | last2: {mask_last2.sum()}', flush=True)

# Load available OOF preds
o_lr_w = np.load('oof_lr_time_withsub_meta.npy')
o_lr_ns_base = np.load('oof_lr_time_nosub_meta.npy')
o_lr_ns_decay = np.load('oof_lr_time_nosub_meta_decay.npy') if Path('oof_lr_time_nosub_meta_decay.npy').exists() else None
o_d1 = np.load('oof_xgb_dense_time.npy')
o_d2 = np.load('oof_xgb_dense_time_v2.npy')
o_meta = np.load('oof_xgb_meta_time.npy')
o_emn = np.load('oof_xgb_emb_meta_time.npy')
o_emp = np.load('oof_xgb_emb_mpnet_time.npy')
o_svd_dual = np.load('oof_xgb_svd_word192_char128_meta.npy')  # dual SVD base (XGB, word192+char128)

# Optional bases
has_lr_mainm = Path('oof_lr_main_meta_time.npy').exists() and Path('test_lr_main_meta_time.npy').exists()
if has_lr_mainm:
    o_lr_mainm = np.load('oof_lr_main_meta_time.npy')
    print('Loaded LR_main+meta for blend consideration.', flush=True)

# Convert OOF to logits
z_lr_w = to_logit(o_lr_w)
z_d1, z_d2, z_meta = to_logit(o_d1), to_logit(o_d2), to_logit(o_meta)
z_emn, z_emp = to_logit(o_emn), to_logit(o_emp)
z_svd_dual = to_logit(o_svd_dual)
if has_lr_mainm:
    z_lr_mainm = to_logit(o_lr_mainm)

# Load test preds (prefer full-bag refits where available)
t_lr_w = np.load('test_lr_time_withsub_meta.npy')
t_lr_ns_base = np.load('test_lr_time_nosub_meta.npy')
t_lr_ns_decay = np.load('test_lr_time_nosub_meta_decay.npy') if Path('test_lr_time_nosub_meta_decay.npy').exists() else None
t_d1 = np.load('test_xgb_dense_time.npy')
t_d2 = np.load('test_xgb_dense_time_v2.npy')
t_meta = np.load('test_xgb_meta_fullbag.npy') if Path('test_xgb_meta_fullbag.npy').exists() else np.load('test_xgb_meta_time.npy')
t_emn = np.load('test_xgb_emb_minilm_fullbag.npy') if Path('test_xgb_emb_minilm_fullbag.npy').exists() else np.load('test_xgb_emb_meta_time.npy')
t_emp = np.load('test_xgb_emb_mpnet_fullbag.npy') if Path('test_xgb_emb_mpnet_fullbag.npy').exists() else np.load('test_xgb_emb_mpnet_time.npy')
t_svd_dual = np.load('test_xgb_svd_word192_char128_meta.npy')
if has_lr_mainm:
    t_lr_mainm = np.load('test_lr_main_meta_time.npy')

# Fast grids per expert guidance (dense cap <= 0.12) with progress logs
g_grid = [0.97, 0.98]
meta_grid = [0.18, 0.20, 0.22]
dense_tot_grid = [0.0, 0.06, 0.12]
dense_split = [(0.6, 0.4), (0.7, 0.3)]
emb_tot_grid = [0.24, 0.27, 0.30]
emb_split = [(0.6, 0.4), (0.5, 0.5)]
svd_dual_grid = [0.0, 0.05, 0.10, 0.12, 0.15]
w_lr_min_grid = [0.22, 0.25]
w_lrmain_grid = [0.0, 0.05] if has_lr_mainm else [0.0]
use_lr_decay_options = [False, True] if (o_lr_ns_decay is not None) else [False]

def search(mask, sample_weight=None):
    best_auc, best_cfg, tried = -1.0, None, 0
    t0 = time.time()
    for use_decay in use_lr_decay_options:
        z_lr_ns = to_logit(o_lr_ns_decay) if (use_decay and (o_lr_ns_decay is not None)) else to_logit(o_lr_ns_base)
        for g in g_grid:
            z_lr_mix = (1.0 - g)*z_lr_w + g*z_lr_ns
            for w_lr_min in w_lr_min_grid:
                for w_meta in meta_grid:
                    for d_tot in dense_tot_grid:
                        for dv1, dv2 in dense_split:
                            w_d1 = d_tot * dv1; w_d2 = d_tot * dv2
                            for e_tot in emb_tot_grid:
                                for emn_fr, emp_fr in emb_split:
                                    w_emn = e_tot * emn_fr; w_emp = e_tot * emp_fr
                                    for w_svd_dual in svd_dual_grid:
                                        rem = 1.0 - (w_meta + w_d1 + w_d2 + w_emn + w_emp + w_svd_dual)
                                        if rem <= 0: continue
                                        for w_lrmain in w_lrmain_grid:
                                            if w_lrmain > rem: continue
                                            w_lr = rem - w_lrmain
                                            if w_lr < w_lr_min: continue
                                            z_oof = (w_lr*z_lr_mix +
                                                     w_d1*z_d1 + w_d2*z_d2 +
                                                     w_meta*z_meta +
                                                     w_emn*z_emn + w_emp*z_emp +
                                                     w_svd_dual*z_svd_dual)
                                            if has_lr_mainm and w_lrmain > 0:
                                                z_oof = z_oof + w_lrmain*z_lr_mainm
                                            auc = roc_auc_score(y[mask], z_oof[mask], sample_weight=(sample_weight[mask] if sample_weight is not None else None))
                                            tried += 1
                                            if tried % 1000 == 0:
                                                print(f'  tried={tried} | curr_best={best_auc:.5f} | elapsed={time.time()-t0:.1f}s', flush=True)
                                            if auc > best_auc:
                                                best_auc = auc
                                                best_cfg = dict(use_decay=use_decay, g=float(g), w_lr=float(w_lr),
                                                                w_d1=float(w_d1), w_d2=float(w_d2), w_meta=float(w_meta),
                                                                w_emn=float(w_emn), w_emp=float(w_emp), w_svd_dual=float(w_svd_dual),
                                                                w_lrmain=float(w_lrmain))
    print(f'  search done | tried={tried} | best={best_auc:.5f} | {time.time()-t0:.1f}s', flush=True)
    return best_auc, best_cfg, tried

# Primary: last-2 objective
auc_last2, cfg_last2, tried_last2 = search(mask_last2)
print(f'[Last2 PRIMARY] tried={tried_last2} | best OOF(z,last2) AUC={auc_last2:.5f} | cfg={cfg_last2}', flush=True)

# Also report full and gamma-decayed (tighter gammas) for reference
auc_full, cfg_full, tried_full = search(mask_full)
print(f'[Full] tried={tried_full} | best OOF(z) AUC={auc_full:.5f} | cfg={cfg_full}', flush=True)

best_gamma, best_auc_g, best_cfg_g = None, -1.0, None
for gamma in [0.975, 0.98, 0.99]:
    w = np.zeros(n, dtype=np.float64)
    for bi in range(1, k):
        age = (k - 1) - bi
        w[np.array(blocks[bi])] = (gamma ** age)
    auc_g, cfg_g, _ = search(mask_full, sample_weight=w)
    print(f'[Gamma {gamma}] best OOF(z,weighted) AUC={auc_g:.5f}', flush=True)
    if auc_g > best_auc_g:
        best_auc_g, best_cfg_g, best_gamma = auc_g, cfg_g, gamma
print(f'[Gamma-best] gamma={best_gamma} | AUC={best_auc_g:.5f} | cfg={best_cfg_g}', flush=True)

def build_and_save(tag, cfg):
    use_decay = cfg['use_decay']
    tz_lr_ns = to_logit(t_lr_ns_decay if (use_decay and (t_lr_ns_decay is not None)) else t_lr_ns_base)
    tz_lr_w = to_logit(t_lr_w)
    tz_lr_mix = (1.0 - cfg['g'])*tz_lr_w + cfg['g']*tz_lr_ns
    parts = [
        cfg['w_lr']*tz_lr_mix,
        cfg['w_d1']*to_logit(t_d1),
        cfg['w_d2']*to_logit(t_d2),
        cfg['w_meta']*to_logit(t_meta),
        cfg['w_emn']*to_logit(t_emn),
        cfg['w_emp']*to_logit(t_emp),
        cfg['w_svd_dual']*to_logit(t_svd_dual)
    ]
    w_list = [cfg['w_lr'], cfg['w_d1'], cfg['w_d2'], cfg['w_meta'], cfg['w_emn'], cfg['w_emp'], cfg['w_svd_dual']]
    if has_lr_mainm and cfg['w_lrmain'] > 0:
        parts.append(cfg['w_lrmain']*to_logit(t_lr_mainm))
        w_list.append(cfg['w_lrmain'])
    zt = np.sum(parts, axis=0)
    pt = sigmoid(zt).astype(np.float32)
    pd.DataFrame({id_col: ids, target_col: pt}).to_csv(f'submission_last2blend_{tag}.csv', index=False)
    # 15% shrink hedge
    w_vec = np.array(w_list, dtype=np.float64)
    w_eq = np.ones_like(w_vec)/len(w_vec)
    alpha = 0.15
    w_shr = ((1.0 - alpha)*w_vec + alpha*w_eq); w_shr = (w_shr / w_shr.sum()).astype(np.float64)
    comp_logits = [tz_lr_mix, to_logit(t_d1), to_logit(t_d2), to_logit(t_meta), to_logit(t_emn), to_logit(t_emp), to_logit(t_svd_dual)]
    if has_lr_mainm and cfg['w_lrmain'] > 0:
        comp_logits.append(to_logit(t_lr_mainm))
    zt_shr = 0.0
    for wi, zi in zip(w_shr, comp_logits):
        zt_shr += wi*zi
    pt_shr = sigmoid(zt_shr).astype(np.float32)
    pd.DataFrame({id_col: ids, target_col: pt_shr}).to_csv(f'submission_last2blend_{tag}_shrunk.csv', index=False)

# Build and save with last-2 winner and gamma-best
build_and_save('last2', cfg_last2)
build_and_save(f'gamma{best_gamma:.3f}'.replace('.', 'p'), best_cfg_g)

# Promote last-2 winner as primary per expert guidance
prim = 'submission_last2blend_last2.csv'
pd.read_csv(prim).to_csv('submission.csv', index=False)
print(f'Promoted {prim} to submission.csv', flush=True)

Time-CV validated full: 2398/2878 | last2: 958


Loaded LR_main+meta for blend consideration.


  tried=1000 | curr_best=0.64630 | elapsed=1.8s


  tried=2000 | curr_best=0.64630 | elapsed=3.5s


  tried=3000 | curr_best=0.64634 | elapsed=5.3s


  tried=4000 | curr_best=0.64649 | elapsed=7.1s


  tried=5000 | curr_best=0.64649 | elapsed=8.9s


  tried=6000 | curr_best=0.64649 | elapsed=10.7s


  tried=7000 | curr_best=0.64649 | elapsed=12.5s


  tried=8000 | curr_best=0.64649 | elapsed=14.3s


  search done | tried=8096 | best=0.64649 | 14.5s


[Last2 PRIMARY] tried=8096 | best OOF(z,last2) AUC=0.64649 | cfg={'use_decay': False, 'g': 0.98, 'w_lr': 0.30999999999999994, 'w_d1': 0.08399999999999999, 'w_d2': 0.036, 'w_meta': 0.22, 'w_emn': 0.18, 'w_emp': 0.12, 'w_svd_dual': 0.05, 'w_lrmain': 0.0}


  tried=1000 | curr_best=0.68219 | elapsed=2.1s


  tried=2000 | curr_best=0.68227 | elapsed=4.2s


  tried=3000 | curr_best=0.68227 | elapsed=6.4s


  tried=4000 | curr_best=0.68227 | elapsed=8.5s


  tried=5000 | curr_best=0.68227 | elapsed=10.6s


  tried=6000 | curr_best=0.68248 | elapsed=12.8s


  tried=7000 | curr_best=0.68248 | elapsed=14.9s


  tried=8000 | curr_best=0.68249 | elapsed=17.1s


  search done | tried=8096 | best=0.68249 | 17.3s


[Full] tried=8096 | best OOF(z) AUC=0.68249 | cfg={'use_decay': True, 'g': 0.98, 'w_lr': 0.25999999999999995, 'w_d1': 0.08399999999999999, 'w_d2': 0.036, 'w_meta': 0.22, 'w_emn': 0.15, 'w_emp': 0.15, 'w_svd_dual': 0.05, 'w_lrmain': 0.05}


  tried=1000 | curr_best=0.68070 | elapsed=2.4s


  tried=2000 | curr_best=0.68077 | elapsed=4.7s


  tried=3000 | curr_best=0.68077 | elapsed=7.1s


  tried=4000 | curr_best=0.68077 | elapsed=9.5s


  tried=5000 | curr_best=0.68077 | elapsed=11.8s


  tried=6000 | curr_best=0.68098 | elapsed=14.2s


  tried=7000 | curr_best=0.68098 | elapsed=16.6s


  tried=8000 | curr_best=0.68099 | elapsed=19.0s


  search done | tried=8096 | best=0.68099 | 19.2s


[Gamma 0.975] best OOF(z,weighted) AUC=0.68099


  tried=1000 | curr_best=0.68100 | elapsed=2.4s


  tried=2000 | curr_best=0.68107 | elapsed=4.8s


  tried=3000 | curr_best=0.68107 | elapsed=7.1s


  tried=4000 | curr_best=0.68107 | elapsed=9.4s


  tried=5000 | curr_best=0.68107 | elapsed=11.8s


  tried=6000 | curr_best=0.68128 | elapsed=14.2s


  tried=7000 | curr_best=0.68128 | elapsed=16.6s


  tried=8000 | curr_best=0.68129 | elapsed=18.9s


  search done | tried=8096 | best=0.68129 | 19.2s


[Gamma 0.98] best OOF(z,weighted) AUC=0.68129


  tried=1000 | curr_best=0.68159 | elapsed=2.4s


  tried=2000 | curr_best=0.68167 | elapsed=4.8s


  tried=3000 | curr_best=0.68167 | elapsed=7.1s


  tried=4000 | curr_best=0.68167 | elapsed=9.5s


  tried=5000 | curr_best=0.68167 | elapsed=11.9s


  tried=6000 | curr_best=0.68188 | elapsed=14.3s


  tried=7000 | curr_best=0.68188 | elapsed=16.6s


  tried=8000 | curr_best=0.68189 | elapsed=19.0s


  search done | tried=8096 | best=0.68189 | 19.2s


[Gamma 0.99] best OOF(z,weighted) AUC=0.68189


[Gamma-best] gamma=0.99 | AUC=0.68189 | cfg={'use_decay': True, 'g': 0.98, 'w_lr': 0.25999999999999995, 'w_d1': 0.08399999999999999, 'w_d2': 0.036, 'w_meta': 0.22, 'w_emn': 0.15, 'w_emp': 0.15, 'w_svd_dual': 0.05, 'w_lrmain': 0.05}


Promoted submission_last2blend_last2.csv to submission.csv


In [None]:
# S37h-mini: Deterministic last-2 blend build using prior good cfg (from S37e last2) to avoid grid stalls
import numpy as np, pandas as pd
from pathlib import Path

id_col = 'request_id'; target_col = 'requester_received_pizza'
train = pd.read_json('train.json')
test = pd.read_json('test.json')
ids = test[id_col].values

def to_logit(p, eps=1e-6):
    p = np.clip(p.astype(np.float64), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Load required test predictions (prefer refits where available)
t_lr_w = np.load('test_lr_time_withsub_meta.npy')
t_lr_ns = np.load('test_lr_time_nosub_meta.npy')
t_d1 = np.load('test_xgb_dense_time.npy')
t_d2 = np.load('test_xgb_dense_time_v2.npy')
t_meta = np.load('test_xgb_meta_fullbag.npy') if Path('test_xgb_meta_fullbag.npy').exists() else np.load('test_xgb_meta_time.npy')
t_emn = np.load('test_xgb_emb_minilm_fullbag.npy') if Path('test_xgb_emb_minilm_fullbag.npy').exists() else np.load('test_xgb_emb_meta_time.npy')
t_emp = np.load('test_xgb_emb_mpnet_fullbag.npy') if Path('test_xgb_emb_mpnet_fullbag.npy').exists() else np.load('test_xgb_emb_mpnet_time.npy')
# Dual SVD base (word192+char128); upgraded 224+160 not available due to runtime, so use this for now
t_svd_dual = np.load('test_xgb_svd_word192_char128_meta.npy')

# Fixed last-2-inspired config from prior search (S37e last2 cfg):
use_decay = False
g = 0.97
w_lr, w_d1, w_d2, w_meta, w_emn, w_emp, w_svd_dual = 0.22, 0.224, 0.056, 0.20, 0.18, 0.12, 0.00

# Build logits with LR mix
tz_lr_mix = (1.0 - g)*to_logit(t_lr_w) + g*to_logit(t_lr_ns)
zt = (w_lr*tz_lr_mix +
      w_d1*to_logit(t_d1) +
      w_d2*to_logit(t_d2) +
      w_meta*to_logit(t_meta) +
      w_emn*to_logit(t_emn) +
      w_emp*to_logit(t_emp) +
      w_svd_dual*to_logit(t_svd_dual))
pt = sigmoid(zt).astype(np.float32)
pd.DataFrame({id_col: ids, target_col: pt}).to_csv('submission_last2_fixed.csv', index=False)

# 15% shrink-to-equal hedge over active components
w_vec = np.array([w_lr, w_d1, w_d2, w_meta, w_emn, w_emp] + ([w_svd_dual] if w_svd_dual > 0 else []), dtype=np.float64)
comps = [tz_lr_mix, to_logit(t_d1), to_logit(t_d2), to_logit(t_meta), to_logit(t_emn), to_logit(t_emp)] + ([to_logit(t_svd_dual)] if w_svd_dual > 0 else [])
w_eq = np.ones_like(w_vec) / len(w_vec)
alpha = 0.15
w_shr = ((1.0 - alpha)*w_vec + alpha*w_eq); w_shr = (w_shr / w_shr.sum()).astype(np.float64)
zt_shr = np.zeros_like(comps[0], dtype=np.float64)
for wi, zi in zip(w_shr, comps):
    zt_shr += wi*zi
pt_shr = sigmoid(zt_shr).astype(np.float32)
pd.DataFrame({id_col: ids, target_col: pt_shr}).to_csv('submission_last2_fixed_shrunk.csv', index=False)

# Promote last-2 fixed as primary (per expert guidance to emphasize recency on LB)
pd.read_csv('submission_last2_fixed.csv').to_csv('submission.csv', index=False)
print('Promoted submission_last2_fixed.csv to submission.csv')

In [None]:
# S37h-mini2: Fast deterministic blends with cached test preds; robust logs and file checks
import os, time, glob, numpy as np, pandas as pd
from pathlib import Path

id_col = 'request_id'; target_col = 'requester_received_pizza'
train = pd.read_json('train.json')
test = pd.read_json('test.json')
ids = test[id_col].values

def to_logit(p, eps=1e-6):
    p = np.clip(p.astype(np.float64), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def write_sub(path, probs):
    df = pd.DataFrame({id_col: ids, target_col: probs.astype(np.float32)})
    df.to_csv(path, index=False)
    print(f'Wrote {path} | mean={probs.mean():.6f} | mtime={time.strftime("%H:%M:%S", time.localtime(Path(path).stat().st_mtime))}', flush=True)

# Load test predictions (prefer refits where available)
t_lr_w = np.load('test_lr_time_withsub_meta.npy')
t_lr_ns = np.load('test_lr_time_nosub_meta.npy')
t_d1 = np.load('test_xgb_dense_time.npy')
t_d2 = np.load('test_xgb_dense_time_v2.npy')
t_meta = np.load('test_xgb_meta_fullbag.npy') if Path('test_xgb_meta_fullbag.npy').exists() else np.load('test_xgb_meta_time.npy')
t_emn = np.load('test_xgb_emb_minilm_fullbag.npy') if Path('test_xgb_emb_minilm_fullbag.npy').exists() else np.load('test_xgb_emb_meta_time.npy')
t_emp = np.load('test_xgb_emb_mpnet_fullbag.npy') if Path('test_xgb_emb_mpnet_fullbag.npy').exists() else np.load('test_xgb_emb_mpnet_time.npy')
t_svd_dual = np.load('test_xgb_svd_word192_char128_meta.npy') if Path('test_xgb_svd_word192_char128_meta.npy').exists() else None

# Blend 1: Last-2 inspired fixed config (from S37e last2 cfg) without SVD-dual
g1 = 0.97
w_lr, w_d1, w_d2, w_meta, w_emn, w_emp = 0.22, 0.224, 0.056, 0.20, 0.18, 0.12
tz_lr_mix = (1.0 - g1)*to_logit(t_lr_w) + g1*to_logit(t_lr_ns)
z1 = (w_lr*tz_lr_mix +
      w_d1*to_logit(t_d1) +
      w_d2*to_logit(t_d2) +
      w_meta*to_logit(t_meta) +
      w_emn*to_logit(t_emn) +
      w_emp*to_logit(t_emp))
p1 = sigmoid(z1)
write_sub('submission_last2_fixed_fast.csv', p1)

# Blend 2: Gamma-best cfg from S37e full with SVD-dual weight if available
g2 = 0.97
w_lr2, w_d1_2, w_d2_2, w_meta_2, w_emn_2, w_emp_2, w_svd2 = 0.21, 0.176, 0.044, 0.22, 0.15, 0.15, (0.05 if t_svd_dual is not None else 0.0)
tz_lr_mix2 = (1.0 - g2)*to_logit(t_lr_w) + g2*to_logit(t_lr_ns)
z2 = (w_lr2*tz_lr_mix2 +
      w_d1_2*to_logit(t_d1) +
      w_d2_2*to_logit(t_d2) +
      w_meta_2*to_logit(t_meta) +
      w_emn_2*to_logit(t_emn) +
      w_emp_2*to_logit(t_emp))
if w_svd2 > 0:
    z2 = z2 + w_svd2*to_logit(t_svd_dual)
p2 = sigmoid(z2)
write_sub('submission_gamma0p97_svddual_fast.csv', p2)

# Promote gamma-based as primary hedge; if absent, promote last2
primary = 'submission_gamma0p97_svddual_fast.csv' if Path('submission_gamma0p97_svddual_fast.csv').exists() else 'submission_last2_fixed_fast.csv'
pd.read_csv(primary).to_csv('submission.csv', index=False)
print(f'Promoted {primary} to submission.csv | mtime={time.strftime("%H:%M:%S", time.localtime(Path("submission.csv").stat().st_mtime))}', flush=True)

# List recent submissions
cands = sorted(glob.glob('submission*.csv'), key=lambda p: Path(p).stat().st_mtime, reverse=True)[:6]
for p in cands:
    st = Path(p).stat()
    print(f'{p} | {st.st_size} bytes | mtime={time.strftime("%H:%M:%S", time.localtime(st.st_mtime))}')

In [None]:
# S37h-mini3: Promote best-known existing blend to submission.csv safely (no heavy deps)
import pandas as pd, time, glob
from pathlib import Path

candidates = [
    'submission_reblend_svddual_gamma0p98.csv',
    'submission_blend_gamma0p98_fullrefits.csv',
    'submission_reblend_gamma0p98.csv',
    'submission_blend_gamma0p98.csv',
    'submission_7way_gamma0p98_mpnet_fullrefit.csv'
]

chosen = None
for p in candidates:
    if Path(p).exists():
        chosen = p
        break

if chosen is None:
    raise FileNotFoundError('No known submission candidates found to promote.')

sub = pd.read_csv(chosen)
assert {'request_id','requester_received_pizza'}.issubset(sub.columns), 'Submission columns mismatch'
sub.to_csv('submission.csv', index=False)
print(f'Promoted {chosen} to submission.csv | rows={len(sub)} | mean={sub.requester_received_pizza.mean():.6f} | mtime=' +
      time.strftime('%H:%M:%S', time.localtime(Path('submission.csv').stat().st_mtime)), flush=True)

# List recent submissions for sanity
recent = sorted(glob.glob('submission*.csv'), key=lambda p: Path(p).stat().st_mtime, reverse=True)[:5]
for p in recent:
    st = Path(p).stat()
    print(f'{p} | {st.st_size} bytes | mtime=' + time.strftime('%H:%M:%S', time.localtime(st.st_mtime)))

In [1]:
# S37h-mini4: Safe promote using shutil (no pandas, no json reads) to avoid I/O stalls
import os, time, glob, shutil
from pathlib import Path

candidates = [
    'submission_reblend_svddual_gamma0p98.csv',
    'submission_blend_gamma0p98_fullrefits.csv',
    'submission_reblend_gamma0p98.csv',
    'submission_blend_gamma0p98.csv',
    'submission_7way_gamma0p98_mpnet_fullrefit.csv'
]

chosen = None
for p in candidates:
    if Path(p).exists() and Path(p).stat().st_size > 0:
        chosen = p
        break

if chosen is None:
    raise FileNotFoundError('No known submission candidates found to promote (shutil).')

shutil.copyfile(chosen, 'submission.csv')
st = Path('submission.csv').stat()
print(f'Promoted {chosen} -> submission.csv | size={st.st_size} bytes | mtime=' + time.strftime('%H:%M:%S', time.localtime(st.st_mtime)), flush=True)

# Print head safely without pandas
with open('submission.csv', 'r', encoding='utf-8') as f:
    for i in range(3):
        line = f.readline()
        if not line:
            break
        print(line.rstrip())

# List recent submission files
recent = sorted(glob.glob('submission*.csv'), key=lambda p: Path(p).stat().st_mtime, reverse=True)[:6]
for p in recent:
    stp = Path(p).stat()
    print(f'{p} | {stp.st_size} bytes | mtime=' + time.strftime('%H:%M:%S', time.localtime(stp.st_mtime)))

Promoted submission_reblend_svddual_gamma0p98.csv -> submission.csv | size=23350 bytes | mtime=14:47:42


request_id,requester_received_pizza
t3_1aw5zf,0.48315614
t3_roiuw,0.47056925
submission.csv | 23350 bytes | mtime=14:47:42
submission_reblend_svddual_gamma0p98_shrunk.csv | 23343 bytes | mtime=14:04:03
submission_reblend_svddual_gamma0p98.csv | 23350 bytes | mtime=14:04:03
submission_reblend_svddual_last2_shrunk.csv | 23314 bytes | mtime=14:04:03
submission_reblend_svddual_last2.csv | 23311 bytes | mtime=14:04:03
submission_reblend_svddual_full_shrunk.csv | 23343 bytes | mtime=14:04:03


In [2]:
# S37h-mini5: Rank-average top-2 submissions to create a hedge and promote
import numpy as np, time, glob
from pathlib import Path

def read_probs(path):
    # Read CSV second column (probabilities) without pandas
    probs = []
    with open(path, 'r', encoding='utf-8') as f:
        header = f.readline()
        for line in f:
            parts = line.rstrip().split(',')
            if len(parts) >= 2:
                try:
                    probs.append(float(parts[1]))
                except ValueError:
                    probs.append(0.5)
    return np.asarray(probs, dtype=np.float64)

def rank01(x):
    # rankdata method='average' approximation using two argsorts (ties handled approximately)
    order = np.argsort(x, kind='mergesort')
    ranks = np.empty_like(order, dtype=np.float64)
    ranks[order] = np.arange(len(x), dtype=np.float64)
    return ranks / max(len(x) - 1, 1)

cand1 = 'submission_reblend_svddual_gamma0p98.csv'  # expert primary
cand2_options = [
    'submission_blend_gamma0p98_fullrefits.csv',
    'submission_7way_gamma0p98_mpnet_fullrefit.csv'
]

cand2 = next((p for p in cand2_options if Path(p).exists() and Path(p).stat().st_size > 0), None)
if not (Path(cand1).exists() and Path(cand1).stat().st_size > 0 and cand2 is not None):
    raise FileNotFoundError(f'Missing candidates for rank-average: cand1={Path(cand1).exists()} cand2={cand2}')

p1 = read_probs(cand1)
p2 = read_probs(cand2)
assert p1.shape == p2.shape and p1.ndim == 1, 'Submission length mismatch'

r1 = rank01(p1); r2 = rank01(p2)
ravg = (r1 + r2) / 2.0

# Read ids from cand1 safely
ids = []
with open(cand1, 'r', encoding='utf-8') as f:
    header = f.readline()
    for line in f:
        parts = line.rstrip().split(',')
        ids.append(parts[0])

out_path = 'submission_rankavg_top2.csv'
with open(out_path, 'w', encoding='utf-8') as f:
    f.write('request_id,requester_received_pizza\n')
    for rid, val in zip(ids, ravg.astype(np.float32)):
        f.write(f'{rid},{val:.8f}\n')
print(f'Wrote {out_path} | mean={ravg.mean():.6f} | mtime=' + time.strftime('%H:%M:%S', time.localtime(Path(out_path).stat().st_mtime)), flush=True)

# Promote rank-avg hedge
Path(out_path).replace('submission.csv')
print('Promoted submission_rankavg_top2.csv to submission.csv | size=' + str(Path('submission.csv').stat().st_size))

Wrote submission_rankavg_top2.csv | mean=0.500000 | mtime=14:49:27


Promoted submission_rankavg_top2.csv to submission.csv | size=23773


In [4]:
# S39: Time-aware level-2 stacker on base OOF logits (logistic meta-learner); write submission
import numpy as np, pandas as pd, time, gc
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from pathlib import Path

id_col = 'request_id'; target_col = 'requester_received_pizza'
train = pd.read_json('train.json')
test = pd.read_json('test.json')
y = train[target_col].astype(int).values
ids = test[id_col].values

def to_logit(p, eps=1e-6):
    p = np.clip(p.astype(np.float64), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

# Time-aware 6-block forward-chaining
order = np.argsort(train['unix_timestamp_of_request'].values)
n = len(train); k = 6
blocks = np.array_split(order, k)

# mask_valid marks rows with available base OOF (blocks 1..5). We'll only train/validate on these.
mask_valid = np.zeros(n, dtype=bool)
for i in range(1, k):
    mask_valid[np.array(blocks[i])] = True

# Define stacker folds so that training has non-empty OOF rows: validate blocks 2..5, train on blocks 1..(i-1)
folds = []
for i in range(2, k):  # i = 2..5
    va_idx = np.array(blocks[i])
    tr_idx = np.concatenate(blocks[1:i])  # only blocks with OOF available
    folds.append((tr_idx, va_idx))
print(f'Stacker folds: {len(folds)} (validate blocks 2..5)')

# Load base OOF/test probabilities and convert to logits
base_names = []
oof_list = []; te_list = []

def add_base(oof_path, te_path, name):
    if (Path(oof_path).exists() and Path(te_path).exists()):
        o = np.load(oof_path); t = np.load(te_path)
        oof_list.append(to_logit(o)); te_list.append(to_logit(t)); base_names.append(name);
        print(f'Added base: {name} | oof:{o.shape} te:{t.shape}')

# Core bases
add_base('oof_lr_time_withsub_meta.npy', 'test_lr_time_withsub_meta.npy', 'lr_withsub_meta')
add_base('oof_lr_time_nosub_meta.npy', 'test_lr_time_nosub_meta.npy', 'lr_nosub_meta')
if Path('oof_lr_time_nosub_meta_decay.npy').exists() and Path('test_lr_time_nosub_meta_decay.npy').exists():
    add_base('oof_lr_time_nosub_meta_decay.npy', 'test_lr_time_nosub_meta_decay.npy', 'lr_nosub_meta_decay')
add_base('oof_xgb_dense_time.npy', 'test_xgb_dense_time.npy', 'dense_v1')
add_base('oof_xgb_dense_time_v2.npy', 'test_xgb_dense_time_v2.npy', 'dense_v2')
add_base('oof_xgb_meta_time.npy', 'test_xgb_meta_time.npy', 'meta_xgb')
add_base('oof_xgb_emb_meta_time.npy', 'test_xgb_emb_meta_time.npy', 'minilm_xgb')
add_base('oof_xgb_emb_mpnet_time.npy', 'test_xgb_emb_mpnet_time.npy', 'mpnet_xgb')
if Path('oof_xgb_svd_word192_char128_meta.npy').exists() and Path('test_xgb_svd_word192_char128_meta.npy').exists():
    add_base('oof_xgb_svd_word192_char128_meta.npy', 'test_xgb_svd_word192_char128_meta.npy', 'svd_dual_xgb')

# Prefer full-bag test replacements when available (overwrite test logits in te_list accordingly)
name_to_idx = {n:i for i,n in enumerate(base_names)}
def swap_test(name, te_path):
    if name in name_to_idx and Path(te_path).exists():
        i = name_to_idx[name];
        te_list[i] = to_logit(np.load(te_path));
        print(f'Replaced test for {name} with {te_path}')

swap_test('meta_xgb', 'test_xgb_meta_fullbag.npy')
swap_test('minilm_xgb', 'test_xgb_emb_minilm_fullbag.npy')
swap_test('mpnet_xgb', 'test_xgb_emb_mpnet_fullbag.npy')

m = len(base_names)
assert m > 1, 'Not enough bases for stacking'
O = np.vstack(oof_list).T  # (n, m) OOF logits
T = np.vstack(te_list).T   # (n_test, m) test logits
print(f'Stacker features: train OOF logits {O.shape} | test logits {T.shape} | bases={base_names}')

# Train per-fold logistic stacker using only rows with OOF (mask_valid).
oof_stacker = np.zeros(n, dtype=np.float32)
te_parts = []
C_grid = [0.5, 1.0, 2.0]
for fi, (tr_idx, va_idx) in enumerate(folds, 1):
    # Ensure training indices are within mask_valid
    tr_idx = tr_idx[mask_valid[tr_idx]]
    X_tr = O[tr_idx]; y_tr = y[tr_idx]
    X_va = O[va_idx]; y_va = y[va_idx]
    if X_tr.shape[0] == 0:
        print(f'[Stacker] Fold {fi} has empty train after mask; skipping.', flush=True)
        continue
    best_auc, best_C, best_clf = -1.0, None, None
    for C in C_grid:
        clf = LogisticRegression(solver='liblinear', penalty='l2', C=C, max_iter=1000)
        clf.fit(X_tr, y_tr)
        va_pred = clf.predict_proba(X_va)[:,1]
        auc = roc_auc_score(y_va, va_pred) if (y_va.min()!=y_va.max()) else 0.5
        if auc > best_auc:
            best_auc, best_C, best_clf = auc, C, clf
    oof_stacker[va_idx] = best_clf.predict_proba(X_va)[:,1].astype(np.float32)
    te_parts.append(best_clf.predict_proba(T)[:,1].astype(np.float32))
    print(f'[Stacker] Fold {fi} | best C={best_C} | AUC={best_auc:.5f} | tr={len(tr_idx)} va={len(va_idx)}', flush=True)

auc_valid = roc_auc_score(y[mask_valid], oof_stacker[mask_valid])
print(f'[Stacker] OOF AUC (validated blocks 1..5): {auc_valid:.5f}')
te_mean = np.mean(te_parts, axis=0).astype(np.float32)

# Save and write submission
np.save('oof_stacker_logits.npy', oof_stacker.astype(np.float32))
np.save('test_stacker_logits.npy', te_mean.astype(np.float32))
sub = pd.DataFrame({id_col: ids, target_col: te_mean})
out_path = 'submission_stacker_lr.csv'
sub.to_csv(out_path, index=False)
sub.to_csv('submission.csv', index=False)
print(f'Wrote {out_path} and promoted to submission.csv | mean={te_mean.mean():.6f}')

Stacker folds: 4 (validate blocks 2..5)
Added base: lr_withsub_meta | oof:(2878,) te:(1162,)
Added base: lr_nosub_meta | oof:(2878,) te:(1162,)
Added base: lr_nosub_meta_decay | oof:(2878,) te:(1162,)
Added base: dense_v1 | oof:(2878,) te:(1162,)
Added base: dense_v2 | oof:(2878,) te:(1162,)
Added base: meta_xgb | oof:(2878,) te:(1162,)
Added base: minilm_xgb | oof:(2878,) te:(1162,)
Added base: mpnet_xgb | oof:(2878,) te:(1162,)
Added base: svd_dual_xgb | oof:(2878,) te:(1162,)
Replaced test for meta_xgb with test_xgb_meta_fullbag.npy
Replaced test for minilm_xgb with test_xgb_emb_minilm_fullbag.npy
Replaced test for mpnet_xgb with test_xgb_emb_mpnet_fullbag.npy
Stacker features: train OOF logits (2878, 9) | test logits (1162, 9) | bases=['lr_withsub_meta', 'lr_nosub_meta', 'lr_nosub_meta_decay', 'dense_v1', 'dense_v2', 'meta_xgb', 'minilm_xgb', 'mpnet_xgb', 'svd_dual_xgb']
[Stacker] Fold 1 | best C=1.0 | AUC=0.69839 | tr=480 va=480


[Stacker] Fold 2 | best C=0.5 | AUC=0.64224 | tr=960 va=480


[Stacker] Fold 3 | best C=1.0 | AUC=0.64945 | tr=1440 va=479


[Stacker] Fold 4 | best C=0.5 | AUC=0.64255 | tr=1919 va=479


[Stacker] OOF AUC (validated blocks 1..5): 0.57933
Wrote submission_stacker_lr.csv and promoted to submission.csv | mean=0.198167


In [5]:
# S40: Time-aware Char TF-IDF (char_wb 2-6) + LogisticRegression base; cache OOF/test
import numpy as np, pandas as pd, time, gc
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

id_col = 'request_id'; target_col = 'requester_received_pizza'
train = pd.read_json('train.json')
test = pd.read_json('test.json')
y = train[target_col].astype(int).values

def get_title(df):
    return df.get('request_title', pd.Series(['']*len(df))).fillna('').astype(str)
def get_body_no_leak(df):
    if 'request_text' in df.columns:
        return df['request_text'].fillna('').astype(str)
    return df.get('request_text', pd.Series(['']*len(df))).fillna('').astype(str)
def build_text(df):
    return (get_title(df) + '\n' + get_body_no_leak(df)).astype(str)

txt_tr = build_text(train)
txt_te = build_text(test)

# Time-aware 6-block forward-chaining (validate blocks 1..5)
order = np.argsort(train['unix_timestamp_of_request'].values)
n = len(train); k = 6
blocks = np.array_split(order, k)
folds = []; mask = np.zeros(n, dtype=bool)
for i in range(1, k):
    va_idx = np.array(blocks[i]); tr_idx = np.concatenate(blocks[:i])
    folds.append((tr_idx, va_idx)); mask[va_idx] = True
print(f'Time-CV: {len(folds)} folds; validated {mask.sum()}/{n}', flush=True)

# Char TF-IDF params (lightweight caps for speed/stability)
tf_params = dict(analyzer='char_wb', ngram_range=(2,6), lowercase=True, min_df=2, max_features=200_000,
                 sublinear_tf=True, smooth_idf=True, norm='l2', dtype=np.float32)

C = 1.0
oof = np.zeros(n, dtype=np.float32)
te_parts = []
t_all = time.time()
for fi, (tr_idx, va_idx) in enumerate(folds, 1):
    t0 = time.time()
    tf = TfidfVectorizer(**tf_params)
    X_tr = tf.fit_transform(txt_tr.iloc[tr_idx])
    X_va = tf.transform(txt_tr.iloc[va_idx])
    X_te = tf.transform(txt_te)
    clf = LogisticRegression(penalty='l2', solver='saga', C=C, max_iter=2000, n_jobs=-1, verbose=0)
    clf.fit(X_tr, y[tr_idx])
    va_pred = clf.predict_proba(X_va)[:,1].astype(np.float32)
    te_pred = clf.predict_proba(X_te)[:,1].astype(np.float32)
    oof[va_idx] = va_pred
    te_parts.append(te_pred)
    auc = roc_auc_score(y[va_idx], va_pred)
    print(f'[CharLR C={C}] Fold {fi} AUC={auc:.5f} | feats={X_tr.shape[1]} | {time.time()-t0:.1f}s', flush=True)
    del tf, X_tr, X_va, X_te, clf; gc.collect()

auc_mask = roc_auc_score(y[mask], oof[mask])
te_mean = np.mean(te_parts, axis=0).astype(np.float32)
print(f'[CharLR] OOF AUC(validated)={auc_mask:.5f} | total {time.time()-t_all:.1f}s', flush=True)
np.save('oof_lr_charwb_time.npy', oof.astype(np.float32))
np.save('test_lr_charwb_time.npy', te_mean.astype(np.float32))
print('Saved oof_lr_charwb_time.npy and test_lr_charwb_time.npy', flush=True)

Time-CV: 5 folds; validated 2398/2878


[CharLR C=1.0] Fold 1 AUC=0.68122 | feats=26806 | 4.4s


[CharLR C=1.0] Fold 2 AUC=0.61014 | feats=40485 | 8.0s


[CharLR C=1.0] Fold 3 AUC=0.57826 | feats=50592 | 11.2s


[CharLR C=1.0] Fold 4 AUC=0.62621 | feats=57465 | 12.5s


[CharLR C=1.0] Fold 5 AUC=0.66546 | feats=63589 | 16.2s


[CharLR] OOF AUC(validated)=0.62454 | total 52.8s


Saved oof_lr_charwb_time.npy and test_lr_charwb_time.npy


In [6]:
# S41: Reblend gamma-best by adding Char TF-IDF LR base with small weight; choose w_char via OOF gamma-weighted AUC
import numpy as np, pandas as pd, time
from pathlib import Path
from sklearn.metrics import roc_auc_score

id_col = 'request_id'; target_col = 'requester_received_pizza'
train = pd.read_json('train.json')
test = pd.read_json('test.json')
ids = test[id_col].values
y = train[target_col].astype(int).values

def to_logit(p, eps=1e-6):
    p = np.clip(p.astype(np.float64), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Time blocks and mask for validated rows
order = np.argsort(train['unix_timestamp_of_request'].values)
k = 6
blocks = np.array_split(order, k)
n = len(train)
mask_full = np.zeros(n, dtype=bool)
for i in range(1, k):
    mask_full[np.array(blocks[i])] = True

# Gamma weights for OOF objective (gamma=0.98)
gamma = 0.98
w_oof = np.zeros(n, dtype=np.float64)
for bi in range(1, k):
    age = (k - 1) - bi
    w_oof[np.array(blocks[bi])] = (gamma ** age)

# Load OOF for existing components
o_lr_w = np.load('oof_lr_time_withsub_meta.npy')
o_lr_ns = np.load('oof_lr_time_nosub_meta.npy')
o_d1 = np.load('oof_xgb_dense_time.npy')
o_d2 = np.load('oof_xgb_dense_time_v2.npy')
o_meta = np.load('oof_xgb_meta_time.npy')
o_emn = np.load('oof_xgb_emb_meta_time.npy')
o_emp = np.load('oof_xgb_emb_mpnet_time.npy')
o_svd_dual = np.load('oof_xgb_svd_word192_char128_meta.npy') if Path('oof_xgb_svd_word192_char128_meta.npy').exists() else None
o_char = np.load('oof_lr_charwb_time.npy')

# Convert to logits
z_lr_w, z_lr_ns = to_logit(o_lr_w), to_logit(o_lr_ns)
z_d1, z_d2, z_meta = to_logit(o_d1), to_logit(o_d2), to_logit(o_meta)
z_emn, z_emp = to_logit(o_emn), to_logit(o_emp)
z_svd = to_logit(o_svd_dual) if (o_svd_dual is not None) else None
z_char = to_logit(o_char)

# Test preds
t_lr_w = np.load('test_lr_time_withsub_meta.npy')
t_lr_ns = np.load('test_lr_time_nosub_meta.npy')
t_d1 = np.load('test_xgb_dense_time.npy')
t_d2 = np.load('test_xgb_dense_time_v2.npy')
t_meta = np.load('test_xgb_meta_fullbag.npy') if Path('test_xgb_meta_fullbag.npy').exists() else np.load('test_xgb_meta_time.npy')
t_emn = np.load('test_xgb_emb_minilm_fullbag.npy') if Path('test_xgb_emb_minilm_fullbag.npy').exists() else np.load('test_xgb_emb_meta_time.npy')
t_emp = np.load('test_xgb_emb_mpnet_fullbag.npy') if Path('test_xgb_emb_mpnet_fullbag.npy').exists() else np.load('test_xgb_emb_mpnet_time.npy')
t_svd = np.load('test_xgb_svd_word192_char128_meta.npy') if Path('test_xgb_svd_word192_char128_meta.npy').exists() else None
t_char = np.load('test_lr_charwb_time.npy')

tz = lambda arr: to_logit(arr)

# Expert gamma-best baseline (S37e) weights including dual SVD
g = 0.97
base_weights = dict(w_lr=0.21, w_d1=0.176, w_d2=0.044, w_meta=0.22, w_emn=0.15, w_emp=0.15, w_svd=0.05)

def blend_oof_with_char(w_char):
    # Renormalize other weights to sum to (1 - w_char)
    scale = 1.0 - w_char
    w_lr = base_weights['w_lr'] * scale
    w_d1 = base_weights['w_d1'] * scale
    w_d2 = base_weights['w_d2'] * scale
    w_meta = base_weights['w_meta'] * scale
    w_emn = base_weights['w_emn'] * scale
    w_emp = base_weights['w_emp'] * scale
    w_svd = (base_weights['w_svd'] * scale) if (z_svd is not None) else 0.0
    z_lr_mix = (1.0 - g)*z_lr_w + g*z_lr_ns
    z = (w_lr*z_lr_mix + w_d1*z_d1 + w_d2*z_d2 + w_meta*z_meta + w_emn*z_emn + w_emp*z_emp)
    if z_svd is not None:
        z = z + w_svd*z_svd
    z = z + w_char*z_char
    return z, dict(w_lr=w_lr, w_d1=w_d1, w_d2=w_d2, w_meta=w_meta, w_emn=w_emn, w_emp=w_emp, w_svd=w_svd, w_char=w_char)

candidates = [0.03, 0.05, 0.07, 0.08]
best_auc, best_cfg = -1.0, None
for wc in candidates:
    z_oof, cfg = blend_oof_with_char(wc)
    auc = roc_auc_score(y[mask_full], z_oof[mask_full], sample_weight=w_oof[mask_full])
    print(f'[Char add] w_char={wc:.3f} | gamma-weighted OOF AUC={auc:.5f}')
    if auc > best_auc:
        best_auc, best_cfg = auc, cfg
print(f'[Char add] Selected w_char={best_cfg["w_char"]:.3f} | AUC={best_auc:.5f}')

def build_test(cfg, tag):
    tz_lr_mix = (1.0 - g)*tz(t_lr_w) + g*tz(t_lr_ns)
    parts = [
        cfg['w_lr']*tz_lr_mix,
        cfg['w_d1']*tz(t_d1),
        cfg['w_d2']*tz(t_d2),
        cfg['w_meta']*tz(t_meta),
        cfg['w_emn']*tz(t_emn),
        cfg['w_emp']*tz(t_emp)
    ]
    if t_svd is not None and cfg['w_svd'] > 0:
        parts.append(cfg['w_svd']*tz(t_svd))
    parts.append(cfg['w_char']*tz(t_char))
    zt = np.sum(parts, axis=0)
    pt = sigmoid(zt).astype(np.float32)
    out = pd.DataFrame({id_col: ids, target_col: pt})
    path = f'submission_gamma0p97_svddual_char{cfg["w_char"]:.3f}.csv'
    out.to_csv(path, index=False)
    # 15% shrink-to-equal hedge
    w_list = [cfg['w_lr'], cfg['w_d1'], cfg['w_d2'], cfg['w_meta'], cfg['w_emn'], cfg['w_emp'], cfg['w_char']] + ([cfg['w_svd']] if (t_svd is not None and cfg['w_svd']>0) else [])
    comp_logits = [tz_lr_mix, tz(t_d1), tz(t_d2), tz(t_meta), tz(t_emn), tz(t_emp), tz(t_char)] + ([tz(t_svd)] if (t_svd is not None and cfg['w_svd']>0) else [])
    w_vec = np.array(w_list, dtype=np.float64)
    w_eq = np.ones_like(w_vec)/len(w_vec)
    alpha = 0.15
    w_shr = ((1.0 - alpha)*w_vec + alpha*w_eq); w_shr = (w_shr / w_shr.sum()).astype(np.float64)
    zt_shr = np.zeros_like(comp_logits[0], dtype=np.float64)
    for wi, zi in zip(w_shr, comp_logits):
        zt_shr += wi*zi
    pt_shr = sigmoid(zt_shr).astype(np.float32)
    pd.DataFrame({id_col: ids, target_col: pt_shr}).to_csv(path.replace('.csv','_shrunk.csv'), index=False)
    # Promote primary
    out.to_csv('submission.csv', index=False)
    print(f'Wrote {path} (+_shrunk) and promoted to submission.csv | mean={pt.mean():.6f}')

# Build test with best w_char
build_test(best_cfg, 'gamma0p97_char')

[Char add] w_char=0.030 | gamma-weighted OOF AUC=0.68158
[Char add] w_char=0.050 | gamma-weighted OOF AUC=0.68171
[Char add] w_char=0.070 | gamma-weighted OOF AUC=0.68171
[Char add] w_char=0.080 | gamma-weighted OOF AUC=0.68172
[Char add] Selected w_char=0.080 | AUC=0.68172
Wrote submission_gamma0p97_svddual_char0.080.csv (+_shrunk) and promoted to submission.csv | mean=0.389134


In [7]:
# S42: Subreddit view - CountVectorizer (binary) on requester_subreddits_at_request + LR; time-aware CV; cache OOF/test
import numpy as np, pandas as pd, time, gc
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

id_col = 'request_id'; target_col = 'requester_received_pizza'
train = pd.read_json('train.json')
test = pd.read_json('test.json')
y = train[target_col].astype(int).values

def subs_to_text(series):
    # requester_subreddits_at_request is list-like; join with spaces; handle NaNs
    out = []
    for v in series.fillna('').values:
        if isinstance(v, list):
            out.append(' '.join(map(str, v)))
        else:
            out.append(str(v))
    return pd.Series(out, index=series.index)

sub_tr = subs_to_text(train.get('requester_subreddits_at_request', pd.Series(['']*len(train))))
sub_te = subs_to_text(test.get('requester_subreddits_at_request', pd.Series(['']*len(test))))

# Time-aware 6-block forward-chaining (validate blocks 1..5)
order = np.argsort(train['unix_timestamp_of_request'].values)
n = len(train); k = 6
blocks = np.array_split(order, k)
folds = []; mask = np.zeros(n, dtype=bool)
for i in range(1, k):
    va_idx = np.array(blocks[i]); tr_idx = np.concatenate(blocks[:i])
    folds.append((tr_idx, va_idx)); mask[va_idx] = True
print(f'Time-CV: {len(folds)} folds; validated {mask.sum()}/{n}', flush=True)

# CountVectorizer params (binary presence, minimal caps)
cv_params = dict(lowercase=True, binary=True, min_df=2, max_features=100_000)

C = 1.0
oof = np.zeros(n, dtype=np.float32)
te_parts = []
t_all = time.time()
for fi, (tr_idx, va_idx) in enumerate(folds, 1):
    t0 = time.time()
    cv = CountVectorizer(**cv_params)
    X_tr = cv.fit_transform(sub_tr.iloc[tr_idx])
    X_va = cv.transform(sub_tr.iloc[va_idx])
    X_te = cv.transform(sub_te)
    clf = LogisticRegression(penalty='l2', solver='liblinear', C=C, max_iter=2000)
    clf.fit(X_tr, y[tr_idx])
    va_pred = clf.predict_proba(X_va)[:,1].astype(np.float32)
    te_pred = clf.predict_proba(X_te)[:,1].astype(np.float32)
    oof[va_idx] = va_pred
    te_parts.append(te_pred)
    auc = roc_auc_score(y[va_idx], va_pred)
    print(f'[SubLR C={C}] Fold {fi} AUC={auc:.5f} | feats={X_tr.shape[1]} | {time.time()-t0:.1f}s', flush=True)
    del cv, X_tr, X_va, X_te, clf; gc.collect()

auc_mask = roc_auc_score(y[mask], oof[mask])
te_mean = np.mean(te_parts, axis=0).astype(np.float32)
print(f'[SubLR] OOF AUC(validated)={auc_mask:.5f} | total {time.time()-t_all:.1f}s', flush=True)
np.save('oof_lr_subs_time.npy', oof.astype(np.float32))
np.save('test_lr_subs_time.npy', te_mean.astype(np.float32))
print('Saved oof_lr_subs_time.npy and test_lr_subs_time.npy', flush=True)

Time-CV: 5 folds; validated 2398/2878


[SubLR C=1.0] Fold 1 AUC=0.57882 | feats=403 | 0.0s


[SubLR C=1.0] Fold 2 AUC=0.56112 | feats=744 | 0.0s


[SubLR C=1.0] Fold 3 AUC=0.58016 | feats=1130 | 0.1s


[SubLR C=1.0] Fold 4 AUC=0.50120 | feats=1566 | 0.1s


[SubLR C=1.0] Fold 5 AUC=0.43204 | feats=2189 | 0.1s


[SubLR] OOF AUC(validated)=0.53578 | total 0.7s


Saved oof_lr_subs_time.npy and test_lr_subs_time.npy


In [8]:
# S43: Reblend with Char LR + Subreddit LR small weights via gamma-weighted OOF; promote submission
import numpy as np, pandas as pd
from pathlib import Path
from sklearn.metrics import roc_auc_score

id_col = 'request_id'; target_col = 'requester_received_pizza'
train = pd.read_json('train.json')
test = pd.read_json('test.json')
ids = test[id_col].values
y = train[target_col].astype(int).values

def to_logit(p, eps=1e-6):
    p = np.clip(p.astype(np.float64), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Time blocks and gamma weights (gamma=0.98) over validated rows
order = np.argsort(train['unix_timestamp_of_request'].values)
k = 6
blocks = np.array_split(order, k)
n = len(train)
mask_full = np.zeros(n, dtype=bool)
for i in range(1, k):
    mask_full[np.array(blocks[i])] = True
gamma = 0.98
w_oof = np.zeros(n, dtype=np.float64)
for bi in range(1, k):
    age = (k - 1) - bi
    w_oof[np.array(blocks[bi])] = (gamma ** age)

# Load OOF/test for base components
o_lr_w = np.load('oof_lr_time_withsub_meta.npy')
o_lr_ns = np.load('oof_lr_time_nosub_meta.npy')
o_d1 = np.load('oof_xgb_dense_time.npy')
o_d2 = np.load('oof_xgb_dense_time_v2.npy')
o_meta = np.load('oof_xgb_meta_time.npy')
o_emn = np.load('oof_xgb_emb_meta_time.npy')
o_emp = np.load('oof_xgb_emb_mpnet_time.npy')
o_svd_dual = np.load('oof_xgb_svd_word192_char128_meta.npy') if Path('oof_xgb_svd_word192_char128_meta.npy').exists() else None
o_char = np.load('oof_lr_charwb_time.npy')
o_sub = np.load('oof_lr_subs_time.npy') if Path('oof_lr_subs_time.npy').exists() else None

# Convert to logits
z_lr_w, z_lr_ns = to_logit(o_lr_w), to_logit(o_lr_ns)
z_d1, z_d2, z_meta = to_logit(o_d1), to_logit(o_d2), to_logit(o_meta)
z_emn, z_emp = to_logit(o_emn), to_logit(o_emp)
z_svd = to_logit(o_svd_dual) if (o_svd_dual is not None) else None
z_char = to_logit(o_char)
z_sub = to_logit(o_sub) if (o_sub is not None) else None

# Test preds
t_lr_w = np.load('test_lr_time_withsub_meta.npy')
t_lr_ns = np.load('test_lr_time_nosub_meta.npy')
t_d1 = np.load('test_xgb_dense_time.npy')
t_d2 = np.load('test_xgb_dense_time_v2.npy')
t_meta = np.load('test_xgb_meta_fullbag.npy') if Path('test_xgb_meta_fullbag.npy').exists() else np.load('test_xgb_meta_time.npy')
t_emn = np.load('test_xgb_emb_minilm_fullbag.npy') if Path('test_xgb_emb_minilm_fullbag.npy').exists() else np.load('test_xgb_emb_meta_time.npy')
t_emp = np.load('test_xgb_emb_mpnet_fullbag.npy') if Path('test_xgb_emb_mpnet_fullbag.npy').exists() else np.load('test_xgb_emb_mpnet_time.npy')
t_svd = np.load('test_xgb_svd_word192_char128_meta.npy') if Path('test_xgb_svd_word192_char128_meta.npy').exists() else None
t_char = np.load('test_lr_charwb_time.npy')
t_sub = np.load('test_lr_subs_time.npy') if Path('test_lr_subs_time.npy').exists() else None

tz = lambda arr: to_logit(arr)

# Base gamma-best weights (S37e) + tune small extras
g = 0.97
base = dict(w_lr=0.21, w_d1=0.176, w_d2=0.044, w_meta=0.22, w_emn=0.15, w_emp=0.15, w_svd=(0.05 if z_svd is not None else 0.0))

def score_oof(w_char, w_sub):
    extra = w_char + (w_sub if z_sub is not None else 0.0)
    if extra >= 0.20:
        return -1.0, None  # guard, shouldn't happen for our grid
    scale = 1.0 - extra
    w_lr = base['w_lr']*scale; w_d1 = base['w_d1']*scale; w_d2 = base['w_d2']*scale
    w_meta = base['w_meta']*scale; w_emn = base['w_emn']*scale; w_emp = base['w_emp']*scale; w_svd = base['w_svd']*scale
    z_lr_mix = (1.0 - g)*z_lr_w + g*z_lr_ns
    z = (w_lr*z_lr_mix + w_d1*z_d1 + w_d2*z_d2 + w_meta*z_meta + w_emn*z_emn + w_emp*z_emp)
    if z_svd is not None and w_svd > 0: z = z + w_svd*z_svd
    if w_char > 0: z = z + w_char*z_char
    if (z_sub is not None) and (w_sub > 0): z = z + w_sub*z_sub
    auc = roc_auc_score(y[mask_full], z[mask_full], sample_weight=w_oof[mask_full])
    return auc, dict(w_lr=w_lr, w_d1=w_d1, w_d2=w_d2, w_meta=w_meta, w_emn=w_emn, w_emp=w_emp, w_svd=w_svd, w_char=w_char, w_sub=(w_sub if z_sub is not None else 0.0))

w_char_grid = [0.05, 0.08]
w_sub_grid = [0.0, 0.01, 0.02] if (z_sub is not None) else [0.0]
best_auc, best_cfg = -1.0, None
for wc in w_char_grid:
    for ws in w_sub_grid:
        auc, cfg = score_oof(wc, ws)
        print(f'[Char+Subs] w_char={wc:.3f} w_sub={ws:.3f} | gamma-OOF AUC={auc:.5f}')
        if auc > best_auc:
            best_auc, best_cfg = auc, cfg
print(f'[Char+Subs] Selected cfg: {best_cfg} | AUC={best_auc:.5f}')

def build_test(cfg):
    tz_lr_mix = (1.0 - g)*tz(t_lr_w) + g*tz(t_lr_ns)
    parts = [
        cfg['w_lr']*tz_lr_mix,
        cfg['w_d1']*tz(t_d1),
        cfg['w_d2']*tz(t_d2),
        cfg['w_meta']*tz(t_meta),
        cfg['w_emn']*tz(t_emn),
        cfg['w_emp']*tz(t_emp)
    ]
    if (t_svd is not None) and (cfg['w_svd'] > 0): parts.append(cfg['w_svd']*tz(t_svd))
    if cfg['w_char'] > 0: parts.append(cfg['w_char']*tz(t_char))
    if (t_sub is not None) and (cfg['w_sub'] > 0): parts.append(cfg['w_sub']*tz(t_sub))
    zt = np.sum(parts, axis=0)
    pt = sigmoid(zt).astype(np.float32)
    tag = f'char{cfg["w_char"]:.3f}_subs{cfg["w_sub"]:.3f}'
    out_path = f'submission_gamma0p97_svddual_{tag}.csv'
    pd.DataFrame({id_col: ids, target_col: pt}).to_csv(out_path, index=False)
    # 15% shrink-to-equal hedge
    comp_logits = [tz_lr_mix, tz(t_d1), tz(t_d2), tz(t_meta), tz(t_emn), tz(t_emp)]
    w_list = [cfg['w_lr'], cfg['w_d1'], cfg['w_d2'], cfg['w_meta'], cfg['w_emn'], cfg['w_emp']]
    if (t_svd is not None) and (cfg['w_svd'] > 0): comp_logits.append(tz(t_svd)); w_list.append(cfg['w_svd'])
    if cfg['w_char'] > 0: comp_logits.append(tz(t_char)); w_list.append(cfg['w_char'])
    if (t_sub is not None) and (cfg['w_sub'] > 0): comp_logits.append(tz(t_sub)); w_list.append(cfg['w_sub'])
    w_vec = np.array(w_list, dtype=np.float64)
    w_eq = np.ones_like(w_vec)/len(w_vec)
    alpha = 0.15
    w_shr = ((1.0 - alpha)*w_vec + alpha*w_eq); w_shr = (w_shr / w_shr.sum()).astype(np.float64)
    zt_shr = np.zeros_like(comp_logits[0], dtype=np.float64)
    for wi, zi in zip(w_shr, comp_logits):
        zt_shr += wi*zi
    pt_shr = sigmoid(zt_shr).astype(np.float32)
    pd.DataFrame({id_col: ids, target_col: pt_shr}).to_csv(out_path.replace('.csv','_shrunk.csv'), index=False)
    # Promote
    pd.DataFrame({id_col: ids, target_col: pt}).to_csv('submission.csv', index=False)
    print(f'Wrote {out_path} (+_shrunk) and promoted to submission.csv | mean={pt.mean():.6f}')

build_test(best_cfg)

[Char+Subs] w_char=0.050 w_sub=0.000 | gamma-OOF AUC=0.68171
[Char+Subs] w_char=0.050 w_sub=0.010 | gamma-OOF AUC=0.68147
[Char+Subs] w_char=0.050 w_sub=0.020 | gamma-OOF AUC=0.68106
[Char+Subs] w_char=0.080 w_sub=0.000 | gamma-OOF AUC=0.68172
[Char+Subs] w_char=0.080 w_sub=0.010 | gamma-OOF AUC=0.68145
[Char+Subs] w_char=0.080 w_sub=0.020 | gamma-OOF AUC=0.68095
[Char+Subs] Selected cfg: {'w_lr': 0.1932, 'w_d1': 0.16192, 'w_d2': 0.04048, 'w_meta': 0.2024, 'w_emn': 0.138, 'w_emp': 0.138, 'w_svd': 0.046000000000000006, 'w_char': 0.08, 'w_sub': 0.0} | AUC=0.68172
Wrote submission_gamma0p97_svddual_char0.080_subs0.000.csv (+_shrunk) and promoted to submission.csv | mean=0.389134


In [9]:
# S44: e5-base-v2 sentence embeddings + LR head (time-aware CV); cache OOF/test
import os, sys, time, gc, numpy as np, pandas as pd
from pathlib import Path
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score

# Ensure HF cache local
os.environ['HF_HOME'] = os.path.abspath('hf_cache')
os.environ['TRANSFORMERS_CACHE'] = os.path.abspath('hf_cache')

# Install sentence-transformers if missing
try:
    import torch
    from sentence_transformers import SentenceTransformer
except Exception:
    import subprocess
    subprocess.run([sys.executable, '-m', 'pip', 'install', '-q', 'sentence-transformers', 'torch'])
    import torch
    from sentence_transformers import SentenceTransformer

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'Device: {device} | GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU"}', flush=True)

id_col = 'request_id'; target_col = 'requester_received_pizza'
train = pd.read_json('train.json')
test = pd.read_json('test.json')
y = train[target_col].astype(int).values

def get_title(df):
    return df.get('request_title', pd.Series(['']*len(df))).fillna('').astype(str)
def get_body_no_leak(df):
    return (df['request_text'] if 'request_text' in df.columns else df.get('request_text', pd.Series(['']*len(df)))).fillna('').astype(str)
def build_text(df):
    # e5 expects query: prefix
    return ('query: ' + (get_title(df) + ' ' + get_body_no_leak(df))).astype(str)

txt_tr = build_text(train).tolist()
txt_te = build_text(test).tolist()

# Encode e5-base-v2 (cache to .npy)
emb_tr_path, emb_te_path = 'emb_e5_tr.npy', 'emb_e5_te.npy'
if Path(emb_tr_path).exists() and Path(emb_te_path).exists():
    E_tr = np.load(emb_tr_path).astype(np.float32)
    E_te = np.load(emb_te_path).astype(np.float32)
    print('Loaded cached e5 embeddings:', E_tr.shape, E_te.shape, flush=True)
else:
    model_name = 'intfloat/e5-base-v2'
    model = SentenceTransformer(model_name, device=device)
    # Normalize embeddings=True per expert advice
    bs = 128
    t0 = time.time()
    E_tr = model.encode(txt_tr, batch_size=bs, convert_to_numpy=True, normalize_embeddings=True, show_progress_bar=True).astype(np.float32)
    E_te = model.encode(txt_te, batch_size=bs, convert_to_numpy=True, normalize_embeddings=True, show_progress_bar=True).astype(np.float32)
    np.save(emb_tr_path, E_tr); np.save(emb_te_path, E_te)
    print(f'Encoded e5: tr {E_tr.shape} te {E_te.shape} | {time.time()-t0:.1f}s', flush=True)
    del model; torch.cuda.empty_cache(); gc.collect()

# Time-aware 6-block forward-chaining (validate 1..5)
order = np.argsort(train['unix_timestamp_of_request'].values)
n = len(train); k = 6
blocks = np.array_split(order, k)
folds = []; mask = np.zeros(n, dtype=bool)
for i in range(1, k):
    va_idx = np.array(blocks[i]); tr_idx = np.concatenate(blocks[:i])
    folds.append((tr_idx, va_idx)); mask[va_idx] = True
print(f'Time-CV folds={len(folds)}; validated {mask.sum()}/{n}', flush=True)

# LR head (no meta first pass), StandardScaler on embeddings
oof = np.zeros(n, dtype=np.float32)
te_parts = []
t_all = time.time()
for fi, (tr_idx, va_idx) in enumerate(folds, 1):
    t0 = time.time()
    X_tr = E_tr[tr_idx]; X_va = E_tr[va_idx]; X_te = E_te
    scaler = StandardScaler(with_mean=True, with_std=True)
    X_tr_s = scaler.fit_transform(X_tr).astype(np.float32)
    X_va_s = scaler.transform(X_va).astype(np.float32)
    X_te_s = scaler.transform(X_te).astype(np.float32)
    clf = LogisticRegression(penalty='l2', solver='saga', C=1.0, max_iter=2000, n_jobs=-1, verbose=0)
    clf.fit(X_tr_s, y[tr_idx])
    va_pred = clf.predict_proba(X_va_s)[:,1].astype(np.float32)
    te_pred = clf.predict_proba(X_te_s)[:,1].astype(np.float32)
    oof[va_idx] = va_pred; te_parts.append(te_pred)
    auc = roc_auc_score(y[va_idx], va_pred)
    print(f'[e5 LR] Fold {fi} AUC={auc:.5f} | {time.time()-t0:.1f}s', flush=True)
    del X_tr, X_va, X_te, X_tr_s, X_va_s, X_te_s, scaler, clf; gc.collect()

auc_mask = roc_auc_score(y[mask], oof[mask])
te_mean = np.mean(te_parts, axis=0).astype(np.float32)
print(f'[e5 LR] OOF AUC(validated)={auc_mask:.5f} | total {time.time()-t_all:.1f}s', flush=True)
np.save('oof_e5_lr_time.npy', oof.astype(np.float32))
np.save('test_e5_lr_time.npy', te_mean.astype(np.float32))
print('Saved oof_e5_lr_time.npy and test_e5_lr_time.npy', flush=True)

  from .autonotebook import tqdm as notebook_tqdm




Device: cuda | GPU: Tesla T4


Batches:   0%|          | 0/23 [00:00<?, ?it/s]

Batches:   4%|▍         | 1/23 [00:03<01:25,  3.87s/it]

Batches:   9%|▊         | 2/23 [00:06<01:01,  2.93s/it]

Batches:  13%|█▎        | 3/23 [00:07<00:47,  2.38s/it]

Batches:  17%|█▋        | 4/23 [00:09<00:40,  2.11s/it]

Batches:  22%|██▏       | 5/23 [00:11<00:34,  1.90s/it]

Batches:  26%|██▌       | 6/23 [00:12<00:27,  1.65s/it]

Batches:  30%|███       | 7/23 [00:13<00:23,  1.47s/it]

Batches:  35%|███▍      | 8/23 [00:14<00:19,  1.30s/it]

Batches:  39%|███▉      | 9/23 [00:15<00:16,  1.20s/it]

Batches:  43%|████▎     | 10/23 [00:16<00:14,  1.10s/it]

Batches:  48%|████▊     | 11/23 [00:16<00:12,  1.00s/it]

Batches:  52%|█████▏    | 12/23 [00:17<00:10,  1.08it/s]

Batches:  57%|█████▋    | 13/23 [00:18<00:08,  1.16it/s]

Batches:  61%|██████    | 14/23 [00:19<00:07,  1.18it/s]

Batches:  65%|██████▌   | 15/23 [00:19<00:06,  1.28it/s]

Batches:  70%|██████▉   | 16/23 [00:20<00:05,  1.33it/s]

Batches:  74%|███████▍  | 17/23 [00:21<00:04,  1.41it/s]

Batches:  78%|███████▊  | 18/23 [00:21<00:03,  1.49it/s]

Batches:  83%|████████▎ | 19/23 [00:22<00:02,  1.67it/s]

Batches:  87%|████████▋ | 20/23 [00:22<00:01,  1.79it/s]

Batches:  91%|█████████▏| 21/23 [00:23<00:01,  1.87it/s]

Batches:  96%|█████████▌| 22/23 [00:23<00:00,  2.10it/s]

Batches: 100%|██████████| 23/23 [00:23<00:00,  2.72it/s]

Batches: 100%|██████████| 23/23 [00:23<00:00,  1.02s/it]




Batches:   0%|          | 0/10 [00:00<?, ?it/s]

Batches:  10%|█         | 1/10 [00:00<00:03,  3.00it/s]

Batches:  20%|██        | 2/10 [00:00<00:02,  3.40it/s]

Batches:  30%|███       | 3/10 [00:00<00:01,  4.03it/s]

Batches:  40%|████      | 4/10 [00:01<00:01,  4.28it/s]

Batches:  50%|█████     | 5/10 [00:01<00:01,  4.35it/s]

Batches:  60%|██████    | 6/10 [00:01<00:00,  4.87it/s]

Batches:  70%|███████   | 7/10 [00:01<00:00,  5.34it/s]

Batches:  80%|████████  | 8/10 [00:01<00:00,  5.81it/s]

Batches:  90%|█████████ | 9/10 [00:01<00:00,  6.48it/s]

Batches: 100%|██████████| 10/10 [00:01<00:00,  5.54it/s]

Encoded e5: tr (2878, 768) te (1162, 768) | 25.5s





Time-CV folds=5; validated 2398/2878


[e5 LR] Fold 1 AUC=0.60530 | 4.9s


[e5 LR] Fold 2 AUC=0.53903 | 14.0s


[e5 LR] Fold 3 AUC=0.53458 | 28.9s


[e5 LR] Fold 4 AUC=0.57967 | 36.3s


[e5 LR] Fold 5 AUC=0.54606 | 41.6s


[e5 LR] OOF AUC(validated)=0.55704 | total 127.0s


Saved oof_e5_lr_time.npy and test_e5_lr_time.npy


In [12]:
# S45: Meta_v2 features (+politeness/hardship/text stats/time) -> XGB (GPU), time-aware CV; cache OOF/test
import re, numpy as np, pandas as pd, time, gc, xgboost as xgb
from sklearn.metrics import roc_auc_score

id_col = 'request_id'; target_col = 'requester_received_pizza'
train = pd.read_json('train.json')
test = pd.read_json('test.json')
y = train[target_col].astype(int).values

def get_title(df):
    return df.get('request_title', pd.Series(['']*len(df))).fillna('').astype(str)
def get_body(df):
    # use request_text only (avoid edit_aware)
    return (df['request_text'] if 'request_text' in df.columns else df.get('request_text', pd.Series(['']*len(df)))).fillna('').astype(str)

def basic_text_stats(s: pd.Series):
    # fast vectorized stats
    lens = s.str.len().fillna(0).astype(np.int32).values
    n_excl = s.str.count('!').fillna(0).astype(np.int32).values
    n_q = s.str.count('\?').fillna(0).astype(np.int32).values
    n_url = s.str.count(r'https?://|www\.').fillna(0).astype(np.int32).values
    imgur_flag = s.str.contains('imgur', case=False, na=False).astype(np.int8).values
    n_digit = s.str.count(r'\d').fillna(0).astype(np.int32).values
    # words
    words = s.str.split()
    n_words = words.apply(lambda x: len(x) if isinstance(x, list) else 0).astype(np.int32).values
    # caps ratio
    n_caps = s.str.count(r'[A-Z]').fillna(0).astype(np.int32).values
    caps_ratio = np.divide(n_caps, np.maximum(lens, 1), dtype=np.float32)
    digit_ratio = np.divide(n_digit, np.maximum(lens, 1), dtype=np.float32)
    return dict(len=lens.astype(np.float32), n_excl=n_excl.astype(np.float32), n_q=n_q.astype(np.float32),
                n_url=n_url.astype(np.float32), imgur=imgur_flag.astype(np.float32), n_words=n_words.astype(np.float32),
                caps_ratio=caps_ratio.astype(np.float32), digit_ratio=digit_ratio.astype(np.float32))

def keyword_counts(s: pd.Series, patterns):
    out = {}
    for name, pat in patterns.items():
        out[name] = s.str.count(pat).fillna(0).astype(np.float32).values
    return out

def build_meta_v2(df: pd.DataFrame):
    title = get_title(df); body = get_body(df)
    text = (title + '\n' + body)
    # stats
    ts = basic_text_stats(title); bs = basic_text_stats(body); xs = basic_text_stats(text)
    # politeness / social cues / hardship keywords (case-insensitive)
    pats = {
        'kw_please': r'(?i)\bplease\b',
        'kw_thanks': r'(?i)\bthank(s| you)?\b',
        'kw_appreciate': r'(?i)\bappreciate\b',
        'kw_piz': r'(?i)\bpizza\b',
        'kw_payit': r'(?i)pay( it)? forward',
        'kw_broke': r'(?i)\bbroke\b|no money',
        'kw_rent': r'(?i)\brent\b|\bbill(s)?\b|utilities|electric|gas',
        'kw_job': r'(?i)\bjob\b|\bunemploy(ed|ment)?\b|\bfired\b',
        'kw_student': r'(?i)\bstudent(s)?\b|\bfinal(s)?\b|\bexam(s)?\b|\bcollege\b|\bclass(es)?\b',
        'kw_family': r'(?i)\bfamily\b|\bkid(s)?\b|\bchild(ren)?\b|\bwife\b|\bhusband\b',
        'kw_health': r'(?i)\bhealth\b|\bhospital\b|\bsick\b|\bill(ness)?\b',
        'kw_money': r'(?i)\$|dollar(s)?|cash|money'
    }
    kc = keyword_counts(text, pats)
    # time features
    ts_unix = df['unix_timestamp_of_request'].astype(np.int64).values
    # convert to UTC datetime
    dt = pd.to_datetime(ts_unix, unit='s', utc=True)
    # dt is a DatetimeIndex; access fields directly
    hour = dt.hour.values.astype(np.int16)
    weekday = dt.weekday.values.astype(np.int16)
    # one-hot hour (0-23) and weekday (0-6) compact
    hour_oh = np.eye(24, dtype=np.float32)[hour]
    wday_oh = np.eye(7, dtype=np.float32)[weekday]
    # assemble feature matrix
    cols = []
    def stack(d):
        for k in sorted(d.keys()):
            cols.append(d[k])
    stack(ts); stack(bs); stack(xs); stack(kc)
    cols.append(hour_oh) ; cols.append(wday_oh)
    X = np.column_stack([c if c.ndim==1 else c for c in cols]).astype(np.float32)
    return X

t0 = time.time()
X_tr = build_meta_v2(train); X_te = build_meta_v2(test)
print('Meta_v2 shapes:', X_tr.shape, X_te.shape, '| build', f'{time.time()-t0:.1f}s', flush=True)

# Time-aware 6-block folds (validate 1..5)
order = np.argsort(train['unix_timestamp_of_request'].values)
n = len(train); k = 6
blocks = np.array_split(order, k)
folds = []; mask = np.zeros(n, dtype=bool)
for i in range(1, k):
    va_idx = np.array(blocks[i]); tr_idx = np.concatenate(blocks[:i])
    folds.append((tr_idx, va_idx)); mask[va_idx] = True
print(f'Time-CV: {len(folds)} folds; validated {mask.sum()}/{n}', flush=True)

# XGB params (GPU) same family as earlier
params = dict(
    objective='binary:logistic',
    eval_metric='auc',
    max_depth=4,
    eta=0.05,
    subsample=0.8,
    colsample_bytree=0.6,
    min_child_weight=6,
    reg_alpha=0.3,
    reg_lambda=3.0,
    gamma=0.0,
    device='cuda',
    tree_method='hist'
)

oof = np.zeros(n, dtype=np.float32)
te_parts = []
t_all = time.time()
for fi, (tr_idx, va_idx) in enumerate(folds, 1):
    t1 = time.time()
    Xtr, Xva = X_tr[tr_idx], X_tr[va_idx]
    ytr, yva = y[tr_idx], y[va_idx]
    dtr = xgb.DMatrix(Xtr, label=ytr); dva = xgb.DMatrix(Xva, label=yva); dte = xgb.DMatrix(X_te)
    pos = float((ytr == 1).sum()); neg = float((ytr == 0).sum())
    spw = (neg / max(pos, 1.0)) if pos > 0 else 1.0
    p = dict(params); p['scale_pos_weight'] = spw; p['seed'] = 42 + fi
    booster = xgb.train(p, dtr, num_boost_round=4000, evals=[(dva, 'valid')], early_stopping_rounds=100, verbose_eval=False)
    va_pred = booster.predict(dva).astype(np.float32)
    te_pred = booster.predict(dte, iteration_range=(0, booster.best_iteration+1 if booster.best_iteration is not None else 0)).astype(np.float32)
    oof[va_idx] = va_pred; te_parts.append(te_pred)
    auc = roc_auc_score(yva, va_pred)
    print(f'[Meta_v2 XGB] Fold {fi} AUC={auc:.5f} | rounds={booster.best_iteration} | {time.time()-t1:.1f}s', flush=True)
    del dtr, dva, dte, booster; gc.collect()

auc_mask = roc_auc_score(y[mask], oof[mask])
te_mean = np.mean(te_parts, axis=0).astype(np.float32)
print(f'[Meta_v2 XGB] OOF AUC(validated)={auc_mask:.5f} | total {time.time()-t_all:.1f}s', flush=True)
np.save('oof_xgb_meta_time_v2.npy', oof.astype(np.float32))
np.save('test_xgb_meta_time_v2.npy', te_mean.astype(np.float32))
print('Saved oof_xgb_meta_time_v2.npy and test_xgb_meta_time_v2.npy', flush=True)

Meta_v2 shapes: (2878, 67) (1162, 67) | build 0.8s


Time-CV: 5 folds; validated 2398/2878


[Meta_v2 XGB] Fold 1 AUC=0.58249 | rounds=64 | 0.6s


[Meta_v2 XGB] Fold 2 AUC=0.59500 | rounds=65 | 0.3s


[Meta_v2 XGB] Fold 3 AUC=0.55075 | rounds=3 | 0.2s


[Meta_v2 XGB] Fold 4 AUC=0.62124 | rounds=22 | 0.2s


[Meta_v2 XGB] Fold 5 AUC=0.58841 | rounds=79 | 0.3s


[Meta_v2 XGB] OOF AUC(validated)=0.58848 | total 2.8s


Saved oof_xgb_meta_time_v2.npy and test_xgb_meta_time_v2.npy


In [13]:
# S46: Reblend with meta_v2 option + add LR_main (text-only) + Char LR; gamma-weighted OOF selection; write submission
import numpy as np, pandas as pd
from pathlib import Path
from sklearn.metrics import roc_auc_score

id_col = 'request_id'; target_col = 'requester_received_pizza'
train = pd.read_json('train.json')
test = pd.read_json('test.json')
ids = test[id_col].values
y = train[target_col].astype(int).values

def to_logit(p, eps=1e-6):
    p = np.clip(p.astype(np.float64), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Time blocks and gamma weights (gamma=0.98) over validated rows
order = np.argsort(train['unix_timestamp_of_request'].values)
k = 6
blocks = np.array_split(order, k)
n = len(train)
mask_full = np.zeros(n, dtype=bool)
for i in range(1, k):
    mask_full[np.array(blocks[i])] = True
gamma = 0.98
w_oof = np.zeros(n, dtype=np.float64)
for bi in range(1, k):
    age = (k - 1) - bi
    w_oof[np.array(blocks[bi])] = (gamma ** age)

# Load OOF/test for base components
o_lr_w = np.load('oof_lr_time_withsub_meta.npy')
o_lr_ns = np.load('oof_lr_time_nosub_meta.npy')
o_d1 = np.load('oof_xgb_dense_time.npy')
o_d2 = np.load('oof_xgb_dense_time_v2.npy')
o_meta_v1 = np.load('oof_xgb_meta_time.npy')
o_meta_v2 = np.load('oof_xgb_meta_time_v2.npy') if Path('oof_xgb_meta_time_v2.npy').exists() else None
o_emn = np.load('oof_xgb_emb_meta_time.npy')
o_emp = np.load('oof_xgb_emb_mpnet_time.npy')
o_svd_dual = np.load('oof_xgb_svd_word192_char128_meta.npy') if Path('oof_xgb_svd_word192_char128_meta.npy').exists() else None
o_char = np.load('oof_lr_charwb_time.npy') if Path('oof_lr_charwb_time.npy').exists() else None
o_lr_main = np.load('oof_lr_main_time.npy') if Path('oof_lr_main_time.npy').exists() else None

# Convert to logits
z_lr_w, z_lr_ns = to_logit(o_lr_w), to_logit(o_lr_ns)
z_d1, z_d2 = to_logit(o_d1), to_logit(o_d2)
z_meta_v1 = to_logit(o_meta_v1)
z_meta_v2 = to_logit(o_meta_v2) if o_meta_v2 is not None else None
z_emn, z_emp = to_logit(o_emn), to_logit(o_emp)
z_svd = to_logit(o_svd_dual) if o_svd_dual is not None else None
z_char = to_logit(o_char) if o_char is not None else None
z_lrmain = to_logit(o_lr_main) if o_lr_main is not None else None

# Test preds
t_lr_w = np.load('test_lr_time_withsub_meta.npy')
t_lr_ns = np.load('test_lr_time_nosub_meta.npy')
t_d1 = np.load('test_xgb_dense_time.npy')
t_d2 = np.load('test_xgb_dense_time_v2.npy')
t_meta_v1 = np.load('test_xgb_meta_fullbag.npy') if Path('test_xgb_meta_fullbag.npy').exists() else np.load('test_xgb_meta_time.npy')
t_meta_v2 = np.load('test_xgb_meta_time_v2.npy') if Path('test_xgb_meta_time_v2.npy').exists() else None
t_emn = np.load('test_xgb_emb_minilm_fullbag.npy') if Path('test_xgb_emb_minilm_fullbag.npy').exists() else np.load('test_xgb_emb_meta_time.npy')
t_emp = np.load('test_xgb_emb_mpnet_fullbag.npy') if Path('test_xgb_emb_mpnet_fullbag.npy').exists() else np.load('test_xgb_emb_mpnet_time.npy')
t_svd = np.load('test_xgb_svd_word192_char128_meta.npy') if Path('test_xgb_svd_word192_char128_meta.npy').exists() else None
t_char = np.load('test_lr_charwb_time.npy') if Path('test_lr_charwb_time.npy').exists() else None
t_lr_main = np.load('test_lr_main_time.npy') if Path('test_lr_main_time.npy').exists() else None

tz = lambda arr: to_logit(arr)

# Base gamma-best weights (S37e) as starting point
g = 0.97
base = dict(w_lr=0.21, w_d1=0.176, w_d2=0.044, w_meta=0.22, w_emn=0.15, w_emp=0.15, w_svd=(0.05 if z_svd is not None else 0.0))

def score_cfg(w_char, w_lrmain, use_meta_v2):
    extra = (w_char if (z_char is not None) else 0.0) + (w_lrmain if (z_lrmain is not None) else 0.0)
    if extra >= 0.20:
        return -1.0, None
    scale = 1.0 - extra
    w_lr = base['w_lr']*scale; w_d1 = base['w_d1']*scale; w_d2 = base['w_d2']*scale
    w_meta = base['w_meta']*scale; w_emn = base['w_emn']*scale; w_emp = base['w_emp']*scale; w_svd = base['w_svd']*scale
    z_meta_use = (z_meta_v2 if (use_meta_v2 and z_meta_v2 is not None) else z_meta_v1)
    z_lr_mix = (1.0 - g)*z_lr_w + g*z_lr_ns
    z = (w_lr*z_lr_mix + w_d1*z_d1 + w_d2*z_d2 + w_meta*z_meta_use + w_emn*z_emn + w_emp*z_emp)
    if z_svd is not None and w_svd > 0: z += w_svd*z_svd
    if (z_char is not None) and (w_char > 0): z += w_char*z_char
    if (z_lrmain is not None) and (w_lrmain > 0): z += w_lrmain*z_lrmain
    auc = roc_auc_score(y[mask_full], z[mask_full], sample_weight=w_oof[mask_full])
    cfg = dict(w_lr=w_lr, w_d1=w_d1, w_d2=w_d2, w_meta=w_meta, w_emn=w_emn, w_emp=w_emp, w_svd=w_svd, w_char=w_char if z_char is not None else 0.0, w_lrmain=w_lrmain if z_lrmain is not None else 0.0, use_meta_v2=bool(use_meta_v2))
    return auc, cfg

w_char_grid = [0.05, 0.08] if z_char is not None else [0.0]
w_lrmain_grid = [0.0, 0.03, 0.05] if z_lrmain is not None else [0.0]
use_meta_v2_grid = [False, True] if (z_meta_v2 is not None) else [False]

best_auc, best_cfg = -1.0, None
for wc in w_char_grid:
    for wl in w_lrmain_grid:
        for um2 in use_meta_v2_grid:
            auc, cfg = score_cfg(wc, wl, um2)
            print(f"[Reblend] w_char={wc:.3f} w_lrmain={wl:.3f} use_meta_v2={um2} | gamma-OOF AUC={auc:.5f}")
            if auc > best_auc:
                best_auc, best_cfg = auc, cfg
print('[Reblend] Selected cfg:', best_cfg, '| AUC=', f'{best_auc:.5f}')

def build_test(cfg):
    tz_lr_mix = (1.0 - g)*tz(t_lr_w) + g*tz(t_lr_ns)
    t_meta_use = (t_meta_v2 if (cfg['use_meta_v2'] and t_meta_v2 is not None) else t_meta_v1)
    parts = [
        cfg['w_lr']*tz_lr_mix,
        cfg['w_d1']*tz(t_d1),
        cfg['w_d2']*tz(t_d2),
        cfg['w_meta']*tz(t_meta_use),
        cfg['w_emn']*tz(t_emn),
        cfg['w_emp']*tz(t_emp)
    ]
    if (t_svd is not None) and (cfg['w_svd'] > 0): parts.append(cfg['w_svd']*tz(t_svd))
    if (t_char is not None) and (cfg['w_char'] > 0): parts.append(cfg['w_char']*tz(t_char))
    if (t_lr_main is not None) and (cfg['w_lrmain'] > 0): parts.append(cfg['w_lrmain']*tz(t_lr_main))
    zt = np.sum(parts, axis=0)
    pt = sigmoid(zt).astype(np.float32)
    tag = f"gamma0p97_meta{'v2' if cfg['use_meta_v2'] else 'v1'}_char{cfg['w_char']:.3f}_lrmain{cfg['w_lrmain']:.3f}"
    out_path = f'submission_{tag}.csv'
    pd.DataFrame({id_col: ids, target_col: pt}).to_csv(out_path, index=False)
    # 15% shrink-to-equal hedge
    comp_logits = [tz_lr_mix, tz(t_d1), tz(t_d2), tz(t_meta_use), tz(t_emn), tz(t_emp)]
    w_list = [cfg['w_lr'], cfg['w_d1'], cfg['w_d2'], cfg['w_meta'], cfg['w_emn'], cfg['w_emp']]
    if (t_svd is not None) and (cfg['w_svd'] > 0): comp_logits.append(tz(t_svd)); w_list.append(cfg['w_svd'])
    if (t_char is not None) and (cfg['w_char'] > 0): comp_logits.append(tz(t_char)); w_list.append(cfg['w_char'])
    if (t_lr_main is not None) and (cfg['w_lrmain'] > 0): comp_logits.append(tz(t_lr_main)); w_list.append(cfg['w_lrmain'])
    w_vec = np.array(w_list, dtype=np.float64)
    w_eq = np.ones_like(w_vec)/len(w_vec)
    alpha = 0.15
    w_shr = ((1.0 - alpha)*w_vec + alpha*w_eq); w_shr = (w_shr / w_shr.sum()).astype(np.float64)
    zt_shr = np.zeros_like(comp_logits[0], dtype=np.float64)
    for wi, zi in zip(w_shr, comp_logits):
        zt_shr += wi*zi
    pt_shr = sigmoid(zt_shr).astype(np.float32)
    pd.DataFrame({id_col: ids, target_col: pt_shr}).to_csv(out_path.replace('.csv','_shrunk.csv'), index=False)
    # Promote
    pd.DataFrame({id_col: ids, target_col: pt}).to_csv('submission.csv', index=False)
    print(f"Wrote {out_path} (+_shrunk) and promoted to submission.csv | mean={pt.mean():.6f}")

build_test(best_cfg)

[Reblend] w_char=0.050 w_lrmain=0.000 use_meta_v2=False | gamma-OOF AUC=0.68171
[Reblend] w_char=0.050 w_lrmain=0.000 use_meta_v2=True | gamma-OOF AUC=0.67177
[Reblend] w_char=0.050 w_lrmain=0.030 use_meta_v2=False | gamma-OOF AUC=0.68158
[Reblend] w_char=0.050 w_lrmain=0.030 use_meta_v2=True | gamma-OOF AUC=0.67169
[Reblend] w_char=0.050 w_lrmain=0.050 use_meta_v2=False | gamma-OOF AUC=0.68144
[Reblend] w_char=0.050 w_lrmain=0.050 use_meta_v2=True | gamma-OOF AUC=0.67159
[Reblend] w_char=0.080 w_lrmain=0.000 use_meta_v2=False | gamma-OOF AUC=0.68172
[Reblend] w_char=0.080 w_lrmain=0.000 use_meta_v2=True | gamma-OOF AUC=0.67179
[Reblend] w_char=0.080 w_lrmain=0.030 use_meta_v2=False | gamma-OOF AUC=0.68144
[Reblend] w_char=0.080 w_lrmain=0.030 use_meta_v2=True | gamma-OOF AUC=0.67167
[Reblend] w_char=0.080 w_lrmain=0.050 use_meta_v2=False | gamma-OOF AUC=0.68126
[Reblend] w_char=0.080 w_lrmain=0.050 use_meta_v2=True | gamma-OOF AUC=0.67156
[Reblend] Selected cfg: {'w_lr': 0.1932, 'w_d1

In [14]:
# S47: NB-SVM style word Count(1-2) + LR (time-aware CV); cache OOF/test
import numpy as np, pandas as pd, time, gc
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

id_col = 'request_id'; target_col = 'requester_received_pizza'
train = pd.read_json('train.json')
test = pd.read_json('test.json')
y = train[target_col].astype(int).values

def get_title(df):
    return df.get('request_title', pd.Series(['']*len(df))).fillna('').astype(str)
def get_body_no_leak(df):
    if 'request_text' in df.columns:
        return df['request_text'].fillna('').astype(str)
    return df.get('request_text', pd.Series(['']*len(df))).fillna('').astype(str)
def build_text(df):
    return (get_title(df) + '\n' + get_body_no_leak(df)).astype(str)

txt_tr = build_text(train)
txt_te = build_text(test)

# Time-aware 6-block forward-chaining (validate blocks 1..5)
order = np.argsort(train['unix_timestamp_of_request'].values)
n = len(train); k = 6
blocks = np.array_split(order, k)
folds = []; mask = np.zeros(n, dtype=bool)
for i in range(1, k):
    va_idx = np.array(blocks[i]); tr_idx = np.concatenate(blocks[:i])
    folds.append((tr_idx, va_idx)); mask[va_idx] = True
print(f'Time-CV: {len(folds)} folds; validated {mask.sum()}/{n}', flush=True)

# CountVectorizer params (words 1-2); NB-SVM log-count ratios per fold
cv_params = dict(analyzer='word', ngram_range=(1,2), lowercase=True, min_df=2, max_features=200_000)
alpha = 1.0  # smoothing
C = 1.0

oof = np.zeros(n, dtype=np.float32)
te_parts = []
t_all = time.time()
for fi, (tr_idx, va_idx) in enumerate(folds, 1):
    t0 = time.time()
    cv = CountVectorizer(**cv_params)
    X_tr = cv.fit_transform(txt_tr.iloc[tr_idx])  # csr
    X_va = cv.transform(txt_tr.iloc[va_idx])
    X_te = cv.transform(txt_te)
    ytr = y[tr_idx]
    # Log-count ratio r for NB-SVM
    pos_mask = (ytr == 1)
    neg_mask = ~pos_mask
    # Sum counts by class
    p = (X_tr[pos_mask].sum(axis=0).A1 + alpha)
    q = (X_tr[neg_mask].sum(axis=0).A1 + alpha)
    r = np.log(p / q).astype(np.float32)  # shape (n_features,)
    # Scale columns by r (use csc for efficient column scaling)
    Xtr_scaled = X_tr.tocsc().multiply(r).tocsr()
    Xva_scaled = X_va.tocsc().multiply(r).tocsr()
    Xte_scaled = X_te.tocsc().multiply(r).tocsr()
    # Linear LR on scaled features
    clf = LogisticRegression(penalty='l2', solver='saga', C=C, max_iter=2000, n_jobs=-1, verbose=0)
    clf.fit(Xtr_scaled, ytr)
    va_pred = clf.predict_proba(Xva_scaled)[:,1].astype(np.float32)
    te_pred = clf.predict_proba(Xte_scaled)[:,1].astype(np.float32)
    oof[va_idx] = va_pred; te_parts.append(te_pred)
    auc = roc_auc_score(y[va_idx], va_pred)
    print(f'[NB-SVM word(1-2) C={C}] Fold {fi} AUC={auc:.5f} | feats={X_tr.shape[1]} | {time.time()-t0:.1f}s', flush=True)
    del cv, X_tr, X_va, X_te, Xtr_scaled, Xva_scaled, Xte_scaled, clf; gc.collect()

auc_mask = roc_auc_score(y[mask], oof[mask])
te_mean = np.mean(te_parts, axis=0).astype(np.float32)
print(f'[NB-SVM word] OOF AUC(validated)={auc_mask:.5f} | total {time.time()-t_all:.1f}s', flush=True)
np.save('oof_nbsvm_word_time.npy', oof.astype(np.float32))
np.save('test_nbsvm_word_time.npy', te_mean.astype(np.float32))
print('Saved oof_nbsvm_word_time.npy and test_nbsvm_word_time.npy', flush=True)

Time-CV: 5 folds; validated 2398/2878


[NB-SVM word(1-2) C=1.0] Fold 1 AUC=0.53857 | feats=7267 | 1.1s


[NB-SVM word(1-2) C=1.0] Fold 2 AUC=0.55304 | feats=13015 | 2.7s


[NB-SVM word(1-2) C=1.0] Fold 3 AUC=0.49857 | feats=17710 | 4.5s


[NB-SVM word(1-2) C=1.0] Fold 4 AUC=0.54421 | feats=21695 | 7.6s


[NB-SVM word(1-2) C=1.0] Fold 5 AUC=0.55013 | feats=25193 | 8.6s


[NB-SVM word] OOF AUC(validated)=0.54113 | total 25.6s


Saved oof_nbsvm_word_time.npy and test_nbsvm_word_time.npy


In [18]:
# S48: Build rank-avg and logit-avg hedges across top submissions; promote best candidate
import numpy as np, time, glob
from pathlib import Path

id_col = 'request_id'; target_col = 'requester_received_pizza'

def read_sub(path):
    ids = []; probs = []
    with open(path, 'r', encoding='utf-8') as f:
        header = f.readline()
        for line in f:
            rid, p = line.rstrip().split(',', 1)
            ids.append(rid);
            try: probs.append(float(p))
            except: probs.append(0.5)
    return ids, np.asarray(probs, dtype=np.float64)

def rank01(x):
    order = np.argsort(x, kind='mergesort')
    ranks = np.empty_like(order, dtype=np.float64)
    ranks[order] = np.arange(len(x), dtype=np.float64)
    return ranks / max(len(x) - 1, 1)

def to_logit(p, eps=1e-6):
    p = np.clip(p, eps, 1.0 - eps)
    return np.log(p / (1.0 - p))
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Candidate files (must exist)
cands = [
  'submission_reblend_svddual_gamma0p98.csv',
  'submission_blend_gamma0p98_fullrefits.csv',
  'submission_last2blend_last2.csv',
  'submission_gamma0p97_svddual_char0.080.csv'
]
avail = [p for p in cands if Path(p).exists() and Path(p).stat().st_size > 0]
assert len(avail) >= 2, f'Not enough submissions available to hedge. Found: {avail}'

# Read first to get ids
ids0, p0 = read_sub(avail[0])
probs = [p0]
for p in avail[1:]:
    ids_i, pi = read_sub(p)
    assert ids_i == ids0, f'ID mismatch between {avail[0]} and {p}'
    probs.append(pi)
P = np.vstack(probs)  # (m, n_test)

# 3-way rank-average (first three if available) and 4-way if all exist
def write_sub(path, vals):
    with open(path, 'w', encoding='utf-8') as f:
        f.write(f'{id_col},{target_col}\n')
        for rid, v in zip(ids0, vals.astype(np.float32)):
            f.write(f'{rid},{v:.8f}\n')
    print(f'Wrote {path} | mean={vals.mean():.6f} | size={Path(path).stat().st_size}', flush=True)

m = P.shape[0]
ranks = np.vstack([rank01(P[i]) for i in range(m)])  # (m, n)

# Build several hedges
out_paths = []
if m >= 3:
    ravg3 = ranks[:3].mean(axis=0)
    path3 = 'submission_rankavg_top3.csv'
    write_sub(path3, ravg3)
    out_paths.append(path3)
if m >= 4:
    ravg4 = ranks[:4].mean(axis=0)
    path4 = 'submission_rankavg_top4.csv'
    write_sub(path4, ravg4)
    out_paths.append(path4)

# Logit-average across available (slightly more aggressive than rank-avg)
logits = to_logit(P)
z_mean = logits.mean(axis=0)
p_mean = sigmoid(z_mean)
path_logit = 'submission_logitavg_all.csv'
write_sub(path_logit, p_mean)
out_paths.append(path_logit)

# Promote preferred hedge: prefer rankavg_top4, else rankavg_top3, else logitavg_all
prom = next((p for p in ['submission_rankavg_top4.csv','submission_rankavg_top3.csv','submission_logitavg_all.csv'] if Path(p).exists()), None)
assert prom is not None, 'No hedge file generated'
Path(prom).replace('submission.csv')
print(f'Promoted {prom} to submission.csv', flush=True)

Wrote submission_rankavg_top3.csv | mean=0.500000 | size=23773


Wrote submission_rankavg_top4.csv | mean=0.500000 | size=23773


Wrote submission_logitavg_all.csv | mean=0.394812 | size=23773


Promoted submission_rankavg_top4.csv to submission.csv


In [19]:
# S49: CatBoost with native text (title, body) + meta_v1 (GPU), time-aware CV; cache OOF/test
import os, sys, time, gc, numpy as np, pandas as pd
from pathlib import Path

# Ensure CatBoost installed
try:
    from catboost import CatBoostClassifier, Pool
except Exception:
    import subprocess
    subprocess.run([sys.executable, '-m', 'pip', 'install', '-q', 'catboost'])
    from catboost import CatBoostClassifier, Pool

id_col = 'request_id'; target_col = 'requester_received_pizza'
train = pd.read_json('train.json')
test = pd.read_json('test.json')
y = train[target_col].astype(int).values

def get_title(df):
    return df.get('request_title', pd.Series(['']*len(df))).fillna('').astype(str)
def get_body(df):
    # Avoid edit_aware; use request_text only
    return (df['request_text'] if 'request_text' in df.columns else df.get('request_text', pd.Series(['']*len(df)))).fillna('').astype(str)

# Build text columns
title_tr = get_title(train); body_tr = get_body(train)
title_te = get_title(test);  body_te = get_body(test)

# Load meta_v1 features (numeric)
Meta_tr = np.load('meta_v1_tr.npy').astype(np.float32)  # shape (n, d)
Meta_te = np.load('meta_v1_te.npy').astype(np.float32)
n_meta = Meta_tr.shape[1]

# Assemble DataFrames with two text cols followed by meta numeric columns
meta_cols = [f'm{i}' for i in range(n_meta)]
Xtr_df = pd.DataFrame({'title': title_tr, 'body': body_tr})
for i, col in enumerate(meta_cols):
    Xtr_df[col] = Meta_tr[:, i]
Xte_df = pd.DataFrame({'title': title_te, 'body': body_te})
for i, col in enumerate(meta_cols):
    Xte_df[col] = Meta_te[:, i]

# Time-aware 6-block forward-chaining (validate blocks 1..5)
order = np.argsort(train['unix_timestamp_of_request'].values)
n = len(train); k = 6
blocks = np.array_split(order, k)
folds = []; mask = np.zeros(n, dtype=bool)
for i in range(1, k):
    va_idx = np.array(blocks[i]); tr_idx = np.concatenate(blocks[:i])
    folds.append((tr_idx, va_idx)); mask[va_idx] = True
print(f'Time-CV: {len(folds)} folds; validated {mask.sum()}/{n}', flush=True)

# CatBoost params (GPU) per expert guidance
base_params = dict(
    iterations=3000,
    depth=6,
    learning_rate=0.05,
    l2_leaf_reg=6.0,
    bagging_temperature=0.75,
    random_strength=1.0,
    loss_function='Logloss',
    eval_metric='AUC',
    task_type='GPU',
    verbose=False
)

text_features = [0, 1]  # indices of text columns in Pool

oof = np.zeros(n, dtype=np.float32)
te_parts = []
t_all = time.time()
for fi, (tr_idx, va_idx) in enumerate(folds, 1):
    t0 = time.time()
    X_tr_fold = Xtr_df.iloc[tr_idx].reset_index(drop=True)
    X_va_fold = Xtr_df.iloc[va_idx].reset_index(drop=True)
    y_tr = y[tr_idx]; y_va = y[va_idx]
    # Class imbalance handling
    pos = float((y_tr == 1).sum()); neg = float((y_tr == 0).sum())
    spw = (neg / max(pos, 1.0)) if pos > 0 else 1.0
    params = dict(base_params); params['scale_pos_weight'] = spw; params['random_seed'] = 42 + fi
    # Pools with text feature indices
    pool_tr = Pool(X_tr_fold, label=y_tr, text_features=text_features)
    pool_va = Pool(X_va_fold, label=y_va, text_features=text_features)
    pool_te = Pool(Xte_df, text_features=text_features)
    model = CatBoostClassifier(**params)
    model.fit(pool_tr, eval_set=pool_va, use_best_model=True, early_stopping_rounds=100)
    va_pred = model.predict_proba(pool_va)[:, 1].astype(np.float32)
    te_pred = model.predict_proba(pool_te)[:, 1].astype(np.float32)
    oof[va_idx] = va_pred
    te_parts.append(te_pred)
    # On-the-fly AUC (without import to keep this cell self-contained for speed)
    from sklearn.metrics import roc_auc_score
    auc = roc_auc_score(y_va, va_pred) if (y_va.min()!=y_va.max()) else 0.5
    best_it = getattr(model, 'best_iteration_', None)
    print(f'[CatTextMeta] Fold {fi} AUC={auc:.5f} | spw={spw:.2f} | best_it={best_it} | {time.time()-t0:.1f}s', flush=True)
    del pool_tr, pool_va, pool_te, model; gc.collect()

from sklearn.metrics import roc_auc_score
auc_oof = roc_auc_score(y[mask], oof[mask])
te_mean = np.mean(te_parts, axis=0).astype(np.float32)
print(f'[CatTextMeta] DONE | OOF(validated) AUC={auc_oof:.5f} | total {time.time()-t_all:.1f}s', flush=True)
np.save('oof_catboost_textmeta.npy', oof.astype(np.float32))
np.save('test_catboost_textmeta.npy', te_mean.astype(np.float32))
print('Saved oof_catboost_textmeta.npy and test_catboost_textmeta.npy', flush=True)

Time-CV: 5 folds; validated 2398/2878


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


[CatTextMeta] Fold 2 AUC=0.66204 | spw=2.33 | best_it=49 | 4.5s


Default metric period is 5 because AUC is/are not implemented for GPU


[CatTextMeta] Fold 3 AUC=0.62541 | spw=2.49 | best_it=106 | 6.0s


Default metric period is 5 because AUC is/are not implemented for GPU


[CatTextMeta] Fold 4 AUC=0.64291 | spw=2.79 | best_it=32 | 4.2s


Default metric period is 5 because AUC is/are not implemented for GPU


[CatTextMeta] Fold 5 AUC=0.62950 | spw=2.83 | best_it=381 | 13.8s


[CatTextMeta] DONE | OOF(validated) AUC=0.64965 | total 34.9s


Saved oof_catboost_textmeta.npy and test_catboost_textmeta.npy


In [20]:
# S50: Fix e5 (passage: prefix) + XGB heads on e5 and concatenated embeddings (MiniLM||MPNet and e5||MiniLM||MPNet); time-aware CV; cache OOF/test
import os, sys, time, gc, numpy as np, pandas as pd
from pathlib import Path
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score
import xgboost as xgb

# Ensure HF cache local & sentence-transformers available
os.environ['HF_HOME'] = os.path.abspath('hf_cache')
os.environ['TRANSFORMERS_CACHE'] = os.path.abspath('hf_cache')
try:
    import torch
    from sentence_transformers import SentenceTransformer
except Exception:
    import subprocess
    subprocess.run([sys.executable, '-m', 'pip', 'install', '-q', 'sentence-transformers', 'torch'])
    import torch
    from sentence_transformers import SentenceTransformer

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'Device: {device}', flush=True)

id_col = 'request_id'; target_col = 'requester_received_pizza'
train = pd.read_json('train.json')
test = pd.read_json('test.json')
y = train[target_col].astype(int).values

def get_title(df):
    return df.get('request_title', pd.Series(['']*len(df))).fillna('').astype(str)
def get_body(df):
    return (df['request_text'] if 'request_text' in df.columns else df.get('request_text', pd.Series(['']*len(df)))).fillna('').astype(str)

def build_e5_text(df):
    # Correct prefix for document/passages per expert advice
    return ('passage: ' + (get_title(df) + ' ' + get_body(df))).astype(str).tolist()

# Time-aware 6-block forward-chaining (validate blocks 1..5)
order = np.argsort(train['unix_timestamp_of_request'].values)
n = len(train); k = 6
blocks = np.array_split(order, k)
folds = []; mask = np.zeros(n, dtype=bool)
for i in range(1, k):
    va_idx = np.array(blocks[i]); tr_idx = np.concatenate(blocks[:i])
    folds.append((tr_idx, va_idx)); mask[va_idx] = True
print(f'Time-CV: {len(folds)} folds; validated {mask.sum()}/{n}', flush=True)

# 1) Encode e5-base-v2 with correct prefix (normalize=True) and cache
emb_e5_tr_path, emb_e5_te_path = 'emb_e5_tr.npy', 'emb_e5_te.npy'
reencode_e5 = True  # force re-encode to fix prefix
if (not reencode_e5) and Path(emb_e5_tr_path).exists() and Path(emb_e5_te_path).exists():
    E5_tr = np.load(emb_e5_tr_path).astype(np.float32)
    E5_te = np.load(emb_e5_te_path).astype(np.float32)
    print('Loaded cached e5 embeddings:', E5_tr.shape, E5_te.shape, flush=True)
else:
    model_name = 'intfloat/e5-base-v2'
    model = SentenceTransformer(model_name, device=device)
    bs = 128
    t0 = time.time()
    txt_tr = build_e5_text(train); txt_te = build_e5_text(test)
    E5_tr = model.encode(txt_tr, batch_size=bs, convert_to_numpy=True, normalize_embeddings=True, show_progress_bar=True).astype(np.float32)
    E5_te = model.encode(txt_te, batch_size=bs, convert_to_numpy=True, normalize_embeddings=True, show_progress_bar=True).astype(np.float32)
    np.save(emb_e5_tr_path, E5_tr); np.save(emb_e5_te_path, E5_te)
    print(f'Encoded e5 (passage:): tr {E5_tr.shape} te {E5_te.shape} | {time.time()-t0:.1f}s', flush=True)
    del model; torch.cuda.empty_cache(); gc.collect()

# 2) Load other embeddings (MiniLM, MPNet) for concatenation heads
Emb_min_tr = np.load('emb_minilm_tr.npy').astype(np.float32)
Emb_min_te = np.load('emb_minilm_te.npy').astype(np.float32)
Emb_mp_tr = np.load('emb_mpnet_tr.npy').astype(np.float32)
Emb_mp_te = np.load('emb_mpnet_te.npy').astype(np.float32)

def run_xgb_head(Xtr_raw: np.ndarray, Xte_raw: np.ndarray, tag: str):
    # Standardize per fold; XGBoost GPU with per-fold scale_pos_weight and early stopping
    oof = np.zeros(n, dtype=np.float32)
    te_parts = []
    params = dict(
        objective='binary:logistic',
        eval_metric='auc',
        max_depth=4,
        eta=0.05,
        subsample=0.8,
        colsample_bytree=0.6,
        min_child_weight=8,
        reg_alpha=0.3,
        reg_lambda=3.0,
        gamma=0.0,
        device='cuda',
        tree_method='hist'
    )
    t_all = time.time()
    for fi, (tr_idx, va_idx) in enumerate(folds, 1):
        t0 = time.time()
        X_tr_f = Xtr_raw[tr_idx]; X_va_f = Xtr_raw[va_idx]
        scaler = StandardScaler(with_mean=True, with_std=True)
        X_tr = scaler.fit_transform(X_tr_f).astype(np.float32)
        X_va = scaler.transform(X_va_f).astype(np.float32)
        X_te = scaler.transform(Xte_raw).astype(np.float32)
        dtr = xgb.DMatrix(X_tr, label=y[tr_idx]); dva = xgb.DMatrix(X_va, label=y[va_idx]); dte = xgb.DMatrix(X_te)
        pos = float((y[tr_idx] == 1).sum()); neg = float((y[tr_idx] == 0).sum()); spw = (neg / max(pos, 1.0)) if pos > 0 else 1.0
        p = dict(params); p['seed'] = 42 + fi; p['scale_pos_weight'] = spw
        booster = xgb.train(p, dtr, num_boost_round=4000, evals=[(dva, 'valid')], early_stopping_rounds=100, verbose_eval=False)
        va_pred = booster.predict(dva).astype(np.float32)
        te_pred = booster.predict(dte, iteration_range=(0, booster.best_iteration+1 if booster.best_iteration is not None else 0)).astype(np.float32)
        oof[va_idx] = va_pred; te_parts.append(te_pred)
        auc = roc_auc_score(y[va_idx], va_pred) if (y[va_idx].min()!=y[va_idx].max()) else 0.5
        print(f'[{tag}] Fold {fi} AUC={auc:.5f} | rounds={booster.best_iteration} | spw={spw:.2f} | {time.time()-t0:.1f}s', flush=True)
        del X_tr_f, X_va_f, X_tr, X_va, X_te, scaler, dtr, dva, dte, booster; gc.collect()
    auc_oof = roc_auc_score(y[mask], oof[mask])
    te_mean = np.mean(te_parts, axis=0).astype(np.float32)
    print(f'[{tag}] DONE | OOF(validated) AUC={auc_oof:.5f} | total {time.time()-t_all:.1f}s', flush=True)
    np.save(f'oof_xgb_{tag}.npy', oof.astype(np.float32))
    np.save(f'test_xgb_{tag}.npy', te_mean.astype(np.float32))
    print(f'Saved oof_xgb_{tag}.npy and test_xgb_{tag}.npy', flush=True)

# 3) Train heads:
# a) e5-only XGB head
run_xgb_head(E5_tr, E5_te, tag='e5_time')

# b) MiniLM||MPNet concatenation XGB head
Emb_mm_tr = np.hstack([Emb_min_tr, Emb_mp_tr]).astype(np.float32)
Emb_mm_te = np.hstack([Emb_min_te, Emb_mp_te]).astype(np.float32)
run_xgb_head(Emb_mm_tr, Emb_mm_te, tag='emb_minilm_mpnet_time')

# c) e5||MiniLM||MPNet concatenation XGB head
Emb_all_tr = np.hstack([E5_tr, Emb_min_tr, Emb_mp_tr]).astype(np.float32)
Emb_all_te = np.hstack([E5_te, Emb_min_te, Emb_mp_te]).astype(np.float32)
run_xgb_head(Emb_all_tr, Emb_all_te, tag='emb_e5_minilm_mpnet_time')

Device: cuda


Time-CV: 5 folds; validated 2398/2878


Batches:   0%|          | 0/23 [00:00<?, ?it/s]

Batches:   4%|▍         | 1/23 [00:03<01:17,  3.51s/it]

Batches:   9%|▊         | 2/23 [00:05<00:58,  2.79s/it]

Batches:  13%|█▎        | 3/23 [00:07<00:46,  2.32s/it]

Batches:  17%|█▋        | 4/23 [00:09<00:39,  2.09s/it]

Batches:  22%|██▏       | 5/23 [00:10<00:34,  1.89s/it]

Batches:  26%|██▌       | 6/23 [00:11<00:27,  1.64s/it]

Batches:  30%|███       | 7/23 [00:13<00:23,  1.47s/it]

Batches:  35%|███▍      | 8/23 [00:14<00:19,  1.31s/it]

Batches:  39%|███▉      | 9/23 [00:15<00:17,  1.21s/it]

Batches:  43%|████▎     | 10/23 [00:15<00:14,  1.12s/it]

Batches:  48%|████▊     | 11/23 [00:16<00:12,  1.02s/it]

Batches:  52%|█████▏    | 12/23 [00:17<00:10,  1.07it/s]

Batches:  57%|█████▋    | 13/23 [00:18<00:08,  1.14it/s]

Batches:  61%|██████    | 14/23 [00:19<00:07,  1.16it/s]

Batches:  65%|██████▌   | 15/23 [00:19<00:06,  1.26it/s]

Batches:  70%|██████▉   | 16/23 [00:20<00:05,  1.31it/s]

Batches:  74%|███████▍  | 17/23 [00:21<00:04,  1.39it/s]

Batches:  78%|███████▊  | 18/23 [00:21<00:03,  1.47it/s]

Batches:  83%|████████▎ | 19/23 [00:22<00:02,  1.64it/s]

Batches:  87%|████████▋ | 20/23 [00:22<00:01,  1.77it/s]

Batches:  91%|█████████▏| 21/23 [00:23<00:01,  1.84it/s]

Batches:  96%|█████████▌| 22/23 [00:23<00:00,  2.07it/s]

Batches: 100%|██████████| 23/23 [00:23<00:00,  2.67it/s]

Batches: 100%|██████████| 23/23 [00:23<00:00,  1.02s/it]




Batches:   0%|          | 0/10 [00:00<?, ?it/s]

Batches:  10%|█         | 1/10 [00:00<00:03,  2.98it/s]

Batches:  20%|██        | 2/10 [00:00<00:02,  3.39it/s]

Batches:  30%|███       | 3/10 [00:00<00:01,  4.03it/s]

Batches:  40%|████      | 4/10 [00:01<00:01,  4.31it/s]

Batches:  50%|█████     | 5/10 [00:01<00:01,  4.35it/s]

Batches:  60%|██████    | 6/10 [00:01<00:00,  4.87it/s]

Batches:  70%|███████   | 7/10 [00:01<00:00,  5.31it/s]

Batches:  80%|████████  | 8/10 [00:01<00:00,  5.73it/s]

Batches:  90%|█████████ | 9/10 [00:01<00:00,  6.39it/s]

Batches: 100%|██████████| 10/10 [00:01<00:00,  5.51it/s]

Encoded e5 (passage:): tr (2878, 768) te (1162, 768) | 25.4s





[e5_time] Fold 1 AUC=0.54314 | rounds=22 | spw=1.94 | 0.5s


[e5_time] Fold 2 AUC=0.61580 | rounds=35 | spw=2.33 | 0.6s


[e5_time] Fold 3 AUC=0.54114 | rounds=23 | spw=2.49 | 0.6s


[e5_time] Fold 4 AUC=0.63140 | rounds=250 | spw=2.79 | 1.6s


[e5_time] Fold 5 AUC=0.64953 | rounds=70 | spw=2.83 | 0.8s


[e5_time] DONE | OOF(validated) AUC=0.58642 | total 5.3s


Saved oof_xgb_e5_time.npy and test_xgb_e5_time.npy


[emb_minilm_mpnet_time] Fold 1 AUC=0.61832 | rounds=48 | spw=1.94 | 0.7s


[emb_minilm_mpnet_time] Fold 2 AUC=0.63685 | rounds=2 | spw=2.33 | 0.6s


[emb_minilm_mpnet_time] Fold 3 AUC=0.55047 | rounds=5 | spw=2.49 | 0.7s


[emb_minilm_mpnet_time] Fold 4 AUC=0.59546 | rounds=13 | spw=2.79 | 0.8s


[emb_minilm_mpnet_time] Fold 5 AUC=0.59977 | rounds=51 | spw=2.83 | 1.0s


[emb_minilm_mpnet_time] DONE | OOF(validated) AUC=0.59924 | total 4.8s


Saved oof_xgb_emb_minilm_mpnet_time.npy and test_xgb_emb_minilm_mpnet_time.npy


[emb_e5_minilm_mpnet_time] Fold 1 AUC=0.62632 | rounds=240 | spw=1.94 | 2.2s


[emb_e5_minilm_mpnet_time] Fold 2 AUC=0.61172 | rounds=17 | spw=2.33 | 1.1s


[emb_e5_minilm_mpnet_time] Fold 3 AUC=0.52791 | rounds=0 | spw=2.49 | 1.0s


[emb_e5_minilm_mpnet_time] Fold 4 AUC=0.59220 | rounds=81 | spw=2.79 | 1.7s


[emb_e5_minilm_mpnet_time] Fold 5 AUC=0.61733 | rounds=15 | spw=2.83 | 1.2s


[emb_e5_minilm_mpnet_time] DONE | OOF(validated) AUC=0.58825 | total 8.2s


Saved oof_xgb_emb_e5_minilm_mpnet_time.npy and test_xgb_emb_e5_minilm_mpnet_time.npy


In [25]:
# S51: Reblend with CatBoost (text+meta) added; optimize gamma and block-weighted objectives; promote best
import numpy as np, pandas as pd, time
from pathlib import Path
from sklearn.metrics import roc_auc_score

id_col = 'request_id'; target_col = 'requester_received_pizza'
train = pd.read_json('train.json')
test = pd.read_json('test.json')
y = train[target_col].astype(int).values
ids = test[id_col].values

def to_logit(p, eps=1e-6):
    p = np.clip(p.astype(np.float64), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Time blocks and masks
order = np.argsort(train['unix_timestamp_of_request'].values)
k = 6
blocks = np.array_split(order, k)
n = len(train)
mask_full = np.zeros(n, dtype=bool)
for i in range(1, k):
    mask_full[np.array(blocks[i])] = True
mask_last2 = np.zeros(n, dtype=bool)
for i in [4,5]:
    mask_last2[np.array(blocks[i])] = True
print(f'Time-CV validated full: {mask_full.sum()}/{n} | last2: {mask_last2.sum()}', flush=True)

# Load OOF/test preds for core bases
o_lr_w = np.load('oof_lr_time_withsub_meta.npy');    t_lr_w = np.load('test_lr_time_withsub_meta.npy')
o_lr_ns = np.load('oof_lr_time_nosub_meta.npy');     t_lr_ns = np.load('test_lr_time_nosub_meta.npy')
o_d1 = np.load('oof_xgb_dense_time.npy');            t_d1 = np.load('test_xgb_dense_time.npy')
o_d2 = np.load('oof_xgb_dense_time_v2.npy');         t_d2 = np.load('test_xgb_dense_time_v2.npy')
o_meta = np.load('oof_xgb_meta_time.npy');           t_meta = np.load('test_xgb_meta_time.npy') if not Path('test_xgb_meta_fullbag.npy').exists() else np.load('test_xgb_meta_fullbag.npy')
o_emn = np.load('oof_xgb_emb_meta_time.npy');        t_emn = np.load('test_xgb_emb_meta_time.npy') if not Path('test_xgb_emb_minilm_fullbag.npy').exists() else np.load('test_xgb_emb_minilm_fullbag.npy')
o_emp = np.load('oof_xgb_emb_mpnet_time.npy');       t_emp = np.load('test_xgb_emb_mpnet_time.npy') if not Path('test_xgb_emb_mpnet_fullbag.npy').exists() else np.load('test_xgb_emb_mpnet_fullbag.npy')
# Dual-view SVD XGB base (optional)
has_svd_dual = Path('oof_xgb_svd_word192_char128_meta.npy').exists() and Path('test_xgb_svd_word192_char128_meta.npy').exists()
if has_svd_dual:
    o_svd_dual = np.load('oof_xgb_svd_word192_char128_meta.npy'); t_svd_dual = np.load('test_xgb_svd_word192_char128_meta.npy')
# CatBoost text+meta
has_cat = Path('oof_catboost_textmeta.npy').exists() and Path('test_catboost_textmeta.npy').exists()
if has_cat:
    o_cat = np.load('oof_catboost_textmeta.npy'); t_cat = np.load('test_catboost_textmeta.npy')
else:
    raise FileNotFoundError('CatBoost OOF/test not found; run S49 first.')

# Convert OOF to logits
z_lr_w, z_lr_ns = to_logit(o_lr_w), to_logit(o_lr_ns)
z_d1, z_d2, z_meta = to_logit(o_d1), to_logit(o_d2), to_logit(o_meta)
z_emn, z_emp = to_logit(o_emn), to_logit(o_emp)
z_svd_dual = to_logit(o_svd_dual) if has_svd_dual else None
z_cat = to_logit(o_cat)

# Convert test to logits
tz_lr_w, tz_lr_ns = to_logit(t_lr_w), to_logit(t_lr_ns)
tz_d1, tz_d2, tz_meta = to_logit(t_meta), to_logit(t_d2), to_logit(t_meta)
tz_emn, tz_emp = to_logit(t_emn), to_logit(t_emp)
tz_svd_dual = to_logit(t_svd_dual) if has_svd_dual else None
tz_cat = to_logit(t_cat)

# Grids (per expert guidance):
g_grid = [0.975, 0.98, 0.99]
meta_grid = [0.18, 0.20, 0.22]
dense_tot_grid = [0.0, 0.06, 0.12, 0.18]  # allow Dense to drop to 0
dense_split = [(0.6, 0.4), (0.7, 0.3)]
emb_tot_grid = [0.24, 0.30, 0.34]         # raise embedding cap
emb_split = [(0.6, 0.4), (0.5, 0.5)]
svd_dual_grid = [0.0, 0.05, 0.08, 0.10] if has_svd_dual else [0.0]
cat_grid = [0.06, 0.10, 0.14, 0.18, 0.20]
w_lr_min_grid = [0.22, 0.25]

def search(mask, sample_weight=None):
    best_auc, best_cfg, tried = -1.0, None, 0
    t0 = time.time()
    for g in g_grid:
        z_lr_mix = (1.0 - g)*z_lr_w + g*z_lr_ns
        for w_lr_min in w_lr_min_grid:
            for w_meta in meta_grid:
                for d_tot in dense_tot_grid:
                    for dv1, dv2 in dense_split:
                        w_d1 = d_tot * dv1; w_d2 = d_tot * dv2
                        for e_tot in emb_tot_grid:
                            for emn_fr, emp_fr in emb_split:
                                w_emn = e_tot * emn_fr; w_emp = e_tot * emp_fr
                                for w_svd in svd_dual_grid:
                                    for w_cat in cat_grid:
                                        rem = 1.0 - (w_meta + w_d1 + w_d2 + w_emn + w_emp + w_svd + w_cat)
                                        if rem <= 0:
                                            continue
                                        w_lr = rem
                                        if w_lr < w_lr_min:
                                            continue
                                        z_oof = (w_lr*z_lr_mix +
                                                 w_d1*z_d1 + w_d2*z_d2 +
                                                 w_meta*z_meta +
                                                 w_emn*z_emn + w_emp*z_emp +
                                                 (w_svd*z_svd_dual if (has_svd_dual and w_svd>0) else 0) +
                                                 w_cat*z_cat)
                                        auc = roc_auc_score(y[mask], z_oof[mask], sample_weight=(sample_weight[mask] if sample_weight is not None else None))
                                        tried += 1
                                        if tried % 2000 == 0:
                                            print(f'  tried={tried} | best={best_auc:.5f} | elapsed={time.time()-t0:.1f}s', flush=True)
                                        if auc > best_auc:
                                            best_auc = auc
                                            best_cfg = dict(g=float(g), w_lr=float(w_lr), w_d1=float(w_d1), w_d2=float(w_d2),
                                                            w_meta=float(w_meta), w_emn=float(w_emn), w_emp=float(w_emp),
                                                            w_svd=float(w_svd), w_cat=float(w_cat))
    print(f'  search done | tried={tried} | best={best_auc:.5f} | {time.time()-t0:.1f}s', flush=True)
    return best_auc, best_cfg, tried

# 1) Full-mask objective
auc_full, cfg_full, tried_full = search(mask_full)
print(f'[Full] tried={tried_full} | best OOF(z) AUC={auc_full:.5f} | cfg={cfg_full}', flush=True)

# 2) Last-2 objective
auc_last2, cfg_last2, tried_last2 = search(mask_last2)
print(f'[Last2] tried={tried_last2} | best OOF(z,last2) AUC={auc_last2:.5f} | cfg={cfg_last2}', flush=True)

# 3) Gamma-decayed block weights (optimize on validated blocks with per-block gamma)
best_gamma, best_auc_g, best_cfg_g = None, -1.0, None
for gamma in [0.975, 0.98, 0.99]:
    w = np.zeros(n, dtype=np.float64)
    for bi in range(1, k):
        age = (k - 1) - bi
        w[np.array(blocks[bi])] = (gamma ** age)
    auc_g, cfg_g, _ = search(mask_full, sample_weight=w)
    print(f'[Gamma {gamma}] best OOF(z,weighted) AUC={auc_g:.5f}', flush=True)
    if auc_g > best_auc_g:
        best_auc_g, best_cfg_g, best_gamma = auc_g, cfg_g, gamma
print(f'[Gamma-best] gamma={best_gamma} | AUC={best_auc_g:.5f} | cfg={best_cfg_g}', flush=True)

# 4) Block-weighted objective (explicit per-block weights [0.1,0.2,0.3,0.4,0.5])
bw = np.zeros(n, dtype=np.float64)
weights = [0.1, 0.2, 0.3, 0.4, 0.5]  # for blocks 1..5
for bi in range(1, k):
    bw[np.array(blocks[bi])] = weights[bi-1]
auc_bw, cfg_bw, _ = search(mask_full, sample_weight=bw)
print(f'[Block-weighted] best OOF(z,weighted) AUC={auc_bw:.5f} | cfg={cfg_bw}', flush=True)

def build_and_save(tag, cfg):
    g = cfg['g']
    tz_lr_mix = (1.0 - g)*tz_lr_w + g*tz_lr_ns
    parts = [
        cfg['w_lr']*tz_lr_mix,
        cfg['w_d1']*to_logit(t_d1),
        cfg['w_d2']*to_logit(t_d2),
        cfg['w_meta']*to_logit(t_meta),
        cfg['w_emn']*to_logit(t_emn),
        cfg['w_emp']*to_logit(t_emp),
        cfg['w_cat']*tz_cat
    ]
    if has_svd_dual and cfg['w_svd'] > 0:
        parts.append(cfg['w_svd']*to_logit(t_svd_dual))
    zt = np.sum(parts, axis=0)
    pt = sigmoid(zt).astype(np.float32)
    out_path = f'submission_reblend_cat_{tag}.csv'
    pd.DataFrame({id_col: ids, target_col: pt}).to_csv(out_path, index=False)
    # 15% shrink-to-equal hedge
    w_list = [cfg['w_lr'], cfg['w_d1'], cfg['w_d2'], cfg['w_meta'], cfg['w_emn'], cfg['w_emp'], cfg['w_cat']] + ([cfg['w_svd']] if (has_svd_dual and cfg['w_svd']>0) else [])
    comp_logits = [tz_lr_mix, to_logit(t_d1), to_logit(t_d2), to_logit(t_meta), to_logit(t_emn), to_logit(t_emp), tz_cat] + ([to_logit(t_svd_dual)] if (has_svd_dual and cfg['w_svd']>0) else [])
    w_vec = np.array(w_list, dtype=np.float64)
    w_eq = np.ones_like(w_vec)/len(w_vec)
    alpha = 0.15
    w_shr = ((1.0 - alpha)*w_vec + alpha*w_eq); w_shr = (w_shr / w_shr.sum()).astype(np.float64)
    zt_shr = 0.0
    for wi, zi in zip(w_shr, comp_logits):
        zt_shr += wi*zi
    pt_shr = sigmoid(zt_shr).astype(np.float32)
    pd.DataFrame({id_col: ids, target_col: pt_shr}).to_csv(out_path.replace('.csv','_shrunk.csv'), index=False)
    return out_path

p_full = build_and_save('full', cfg_full)
p_last2 = build_and_save('last2', cfg_last2)
p_gam = build_and_save(f'gamma{best_gamma:.3f}'.replace('.', 'p'), best_cfg_g)
p_bw = build_and_save('blockw', cfg_bw)

# Promote best among gamma-best and block-weighted; prefer block-weighted if AUC higher, else gamma-best
primary = p_bw if (auc_bw >= best_auc_g) else p_gam
pd.read_csv(primary).to_csv('submission.csv', index=False)
print(f'Promoted {Path(primary).name} to submission.csv', flush=True)

Time-CV validated full: 2398/2878 | last2: 958


  tried=2000 | best=0.68230 | elapsed=4.3s


  tried=4000 | best=0.68233 | elapsed=8.7s


  tried=6000 | best=0.68233 | elapsed=13.0s


  search done | tried=7704 | best=0.68236 | 16.7s


[Full] tried=7704 | best OOF(z) AUC=0.68236 | cfg={'g': 0.99, 'w_lr': 0.25, 'w_d1': 0.08399999999999999, 'w_d2': 0.036, 'w_meta': 0.22, 'w_emn': 0.15, 'w_emp': 0.15, 'w_svd': 0.05, 'w_cat': 0.06}


  tried=2000 | best=0.64807 | elapsed=3.5s


  tried=4000 | best=0.64807 | elapsed=7.1s


  tried=6000 | best=0.64808 | elapsed=10.6s


  search done | tried=7704 | best=0.64808 | 13.7s


[Last2] tried=7704 | best OOF(z,last2) AUC=0.64808 | cfg={'g': 0.99, 'w_lr': 0.24, 'w_d1': 0.126, 'w_d2': 0.054, 'w_meta': 0.18, 'w_emn': 0.20400000000000001, 'w_emp': 0.136, 'w_svd': 0.0, 'w_cat': 0.06}


  tried=2000 | best=0.68079 | elapsed=4.8s


  tried=4000 | best=0.68082 | elapsed=9.7s


  tried=6000 | best=0.68082 | elapsed=14.6s


  search done | tried=7704 | best=0.68085 | 18.8s


[Gamma 0.975] best OOF(z,weighted) AUC=0.68085


  tried=2000 | best=0.68110 | elapsed=4.9s


  tried=4000 | best=0.68113 | elapsed=9.8s


  tried=6000 | best=0.68113 | elapsed=14.6s


  search done | tried=7704 | best=0.68115 | 18.6s


[Gamma 0.98] best OOF(z,weighted) AUC=0.68115


  tried=2000 | best=0.68170 | elapsed=4.6s


  tried=4000 | best=0.68173 | elapsed=9.3s


  tried=6000 | best=0.68173 | elapsed=13.8s


  search done | tried=7704 | best=0.68176 | 17.7s


[Gamma 0.99] best OOF(z,weighted) AUC=0.68176


[Gamma-best] gamma=0.99 | AUC=0.68176 | cfg={'g': 0.99, 'w_lr': 0.25, 'w_d1': 0.08399999999999999, 'w_d2': 0.036, 'w_meta': 0.22, 'w_emn': 0.15, 'w_emp': 0.15, 'w_svd': 0.05, 'w_cat': 0.06}


  tried=2000 | best=0.66215 | elapsed=4.6s


  tried=4000 | best=0.66218 | elapsed=9.2s


  tried=6000 | best=0.66220 | elapsed=13.8s


  search done | tried=7704 | best=0.66222 | 17.8s


[Block-weighted] best OOF(z,weighted) AUC=0.66222 | cfg={'g': 0.99, 'w_lr': 0.22999999999999998, 'w_d1': 0.08399999999999999, 'w_d2': 0.036, 'w_meta': 0.2, 'w_emn': 0.17, 'w_emp': 0.17, 'w_svd': 0.05, 'w_cat': 0.06}


Promoted submission_reblend_cat_gamma0p990.csv to submission.csv


In [23]:
# S52: LinearSVC (word 1-2 + char_wb 3-6 TF-IDF) with isotonic calibration; time-aware CV; cache OOF/test
import numpy as np, pandas as pd, time, gc
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import roc_auc_score

id_col = 'request_id'; target_col = 'requester_received_pizza'
train = pd.read_json('train.json')
test = pd.read_json('test.json')
y = train[target_col].astype(int).values

def get_title(df):
    return df.get('request_title', pd.Series(['']*len(df))).fillna('').astype(str)
def get_body(df):
    # Avoid edit_aware; use request_text only
    return (df['request_text'] if 'request_text' in df.columns else df.get('request_text', pd.Series(['']*len(df)))).fillna('').astype(str)
def build_text(df):
    return (get_title(df) + '\n' + get_body(df)).astype(str)

txt_tr = build_text(train)
txt_te = build_text(test)

# Time-aware 6-block forward-chaining (validate blocks 1..5)
order = np.argsort(train['unix_timestamp_of_request'].values)
n = len(train); k = 6
blocks = np.array_split(order, k)
folds = []; mask = np.zeros(n, dtype=bool)
for i in range(1, k):
    va_idx = np.array(blocks[i]); tr_idx = np.concatenate(blocks[:i])
    folds.append((tr_idx, va_idx)); mask[va_idx] = True
print(f'Time-CV: {len(folds)} folds; validated {mask.sum()}/{n}', flush=True)

# TF-IDF views
word_params = dict(analyzer='word', ngram_range=(1,2), lowercase=True, min_df=2, max_features=300_000, sublinear_tf=True, smooth_idf=True, norm='l2', dtype=np.float32)
char_params = dict(analyzer='char_wb', ngram_range=(3,6), lowercase=True, min_df=2, max_features=250_000, sublinear_tf=True, smooth_idf=True, norm='l2', dtype=np.float32)

C_grid = [0.5, 1.0, 2.0]
best = dict(auc=-1.0, C=None, oof=None, te=None)
results = []

for C in C_grid:
    oof = np.zeros(n, dtype=np.float32)
    te_accum = []
    tC = time.time()
    for fi, (tr_idx, va_idx) in enumerate(folds, 1):
        t0 = time.time()
        tr_text = txt_tr.iloc[tr_idx]; va_text = txt_tr.iloc[va_idx]
        # Fit vectorizers on train fold only
        tf_w = TfidfVectorizer(**word_params)
        Xw_tr = tf_w.fit_transform(tr_text); Xw_va = tf_w.transform(va_text); Xw_te = tf_w.transform(txt_te)
        tf_c = TfidfVectorizer(**char_params)
        Xc_tr = tf_c.fit_transform(tr_text); Xc_va = tf_c.transform(va_text); Xc_te = tf_c.transform(txt_te)
        X_tr = hstack([Xw_tr, Xc_tr], format='csr')
        X_va = hstack([Xw_va, Xc_va], format='csr')
        X_te = hstack([Xw_te, Xc_te], format='csr')
        # Base SVM
        base = LinearSVC(C=C, dual=False, max_iter=5000)
        # Calibrate on train fold via 3-fold CV (inside tr_idx), isotonic
        clf = CalibratedClassifierCV(estimator=base, method='isotonic', cv=3)
        clf.fit(X_tr, y[tr_idx])
        va_pred = clf.predict_proba(X_va)[:,1].astype(np.float32)
        te_pred = clf.predict_proba(X_te)[:,1].astype(np.float32)
        oof[va_idx] = va_pred; te_accum.append(te_pred)
        auc = roc_auc_score(y[va_idx], va_pred) if (y[va_idx].min()!=y[va_idx].max()) else 0.5
        print(f'[LinSVM+Cali C={C}] Fold {fi} AUC={auc:.5f} | feats={X_tr.shape[1]} | {time.time()-t0:.1f}s', flush=True)
        del tf_w, tf_c, Xw_tr, Xw_va, Xw_te, Xc_tr, Xc_va, Xc_te, X_tr, X_va, X_te, base, clf; gc.collect()
    auc_oof = roc_auc_score(y[mask], oof[mask])
    te_mean = np.mean(te_accum, axis=0).astype(np.float32)
    results.append((C, auc_oof))
    print(f'[LinSVM+Cali C={C}] OOF(validated) AUC={auc_oof:.5f} | total {time.time()-tC:.1f}s', flush=True)
    if auc_oof > best['auc']:
        best.update(dict(auc=auc_oof, C=C, oof=oof.copy(), te=te_mean.copy()))
    del oof, te_accum; gc.collect()

print('C grid results:', results)
print(f'Best C={best["C"]} | OOF(validated) AUC={best["auc"]:.5f}')
np.save('oof_svm_wordchar_time.npy', best['oof'].astype(np.float32))
np.save('test_svm_wordchar_time.npy', best['te'].astype(np.float32))
print('Saved oof_svm_wordchar_time.npy and test_svm_wordchar_time.npy', flush=True)

Time-CV: 5 folds; validated 2398/2878


[LinSVM+Cali C=0.5] Fold 1 AUC=0.66099 | feats=32995 | 1.4s


[LinSVM+Cali C=0.5] Fold 2 AUC=0.59132 | feats=52069 | 2.0s


[LinSVM+Cali C=0.5] Fold 3 AUC=0.56935 | feats=66699 | 2.6s


[LinSVM+Cali C=0.5] Fold 4 AUC=0.62774 | feats=77440 | 3.4s


[LinSVM+Cali C=0.5] Fold 5 AUC=0.62111 | feats=86989 | 3.8s


[LinSVM+Cali C=0.5] OOF(validated) AUC=0.60804 | total 14.3s


[LinSVM+Cali C=1.0] Fold 1 AUC=0.65118 | feats=32995 | 1.3s


[LinSVM+Cali C=1.0] Fold 2 AUC=0.57915 | feats=52069 | 1.9s


[LinSVM+Cali C=1.0] Fold 3 AUC=0.56104 | feats=66699 | 2.6s


[LinSVM+Cali C=1.0] Fold 4 AUC=0.61897 | feats=77440 | 3.3s


[LinSVM+Cali C=1.0] Fold 5 AUC=0.62645 | feats=86989 | 4.2s


[LinSVM+Cali C=1.0] OOF(validated) AUC=0.60155 | total 14.5s


[LinSVM+Cali C=2.0] Fold 1 AUC=0.64495 | feats=32995 | 1.3s


[LinSVM+Cali C=2.0] Fold 2 AUC=0.57536 | feats=52069 | 2.5s


[LinSVM+Cali C=2.0] Fold 3 AUC=0.56365 | feats=66699 | 2.9s


[LinSVM+Cali C=2.0] Fold 4 AUC=0.61264 | feats=77440 | 3.7s


[LinSVM+Cali C=2.0] Fold 5 AUC=0.62111 | feats=86989 | 4.4s


[LinSVM+Cali C=2.0] OOF(validated) AUC=0.59804 | total 15.9s


C grid results: [(0.5, 0.6080356862468008), (1.0, 0.6015518276885392), (2.0, 0.5980380611427765)]
Best C=0.5 | OOF(validated) AUC=0.60804
Saved oof_svm_wordchar_time.npy and test_svm_wordchar_time.npy


In [24]:
# S53: CatBoost text+meta (GPU) small hyperparam sweep; pick best OOF and overwrite oof_catboost_textmeta.npy/test_*.npy
import os, sys, time, gc, numpy as np, pandas as pd
from pathlib import Path
try:
    from catboost import CatBoostClassifier, Pool
except Exception:
    import subprocess
    subprocess.run([sys.executable, '-m', 'pip', 'install', '-q', 'catboost'])
    from catboost import CatBoostClassifier, Pool
from sklearn.metrics import roc_auc_score

id_col = 'request_id'; target_col = 'requester_received_pizza'
train = pd.read_json('train.json')
test = pd.read_json('test.json')
y = train[target_col].astype(int).values

def get_title(df):
    return df.get('request_title', pd.Series(['']*len(df))).fillna('').astype(str)
def get_body(df):
    return (df['request_text'] if 'request_text' in df.columns else df.get('request_text', pd.Series(['']*len(df)))).fillna('').astype(str)

# Build text columns
title_tr = get_title(train); body_tr = get_body(train)
title_te = get_title(test);  body_te = get_body(test)

# Load meta_v1 features (numeric)
Meta_tr = np.load('meta_v1_tr.npy').astype(np.float32)
Meta_te = np.load('meta_v1_te.npy').astype(np.float32)
n_meta = Meta_tr.shape[1]
meta_cols = [f'm{i}' for i in range(n_meta)]
Xtr_df = pd.DataFrame({'title': title_tr, 'body': body_tr})
for i, col in enumerate(meta_cols):
    Xtr_df[col] = Meta_tr[:, i]
Xte_df = pd.DataFrame({'title': title_te, 'body': body_te})
for i, col in enumerate(meta_cols):
    Xte_df[col] = Meta_te[:, i]

# Time-aware 6-block forward-chaining (validate 1..5)
order = np.argsort(train['unix_timestamp_of_request'].values)
n = len(train); k = 6
blocks = np.array_split(order, k)
folds = []; mask = np.zeros(n, dtype=bool)
for i in range(1, k):
    va_idx = np.array(blocks[i]); tr_idx = np.concatenate(blocks[:i])
    folds.append((tr_idx, va_idx)); mask[va_idx] = True
print(f'Time-CV: {len(folds)} folds; validated {mask.sum()}/{n}', flush=True)

text_features = [0, 1]

grid = []
for depth in [6, 8]:
    for lr in [0.03, 0.05]:
        for l2 in [3.0, 6.0, 9.0]:
            for bt in [0.5, 1.0]:
                grid.append(dict(depth=depth, learning_rate=lr, l2_leaf_reg=l2, bagging_temperature=bt))
print(f'Grid size: {len(grid)}', flush=True)

best_auc, best_cfg, best_oof, best_te = -1.0, None, None, None
t_all = time.time()
for gi, cfg in enumerate(grid, 1):
    params = dict(
        iterations=4000,
        depth=cfg['depth'],
        learning_rate=cfg['learning_rate'],
        l2_leaf_reg=cfg['l2_leaf_reg'],
        bagging_temperature=cfg['bagging_temperature'],
        random_strength=1.0,
        loss_function='Logloss',
        eval_metric='AUC',
        task_type='GPU',
        verbose=False
    )
    oof = np.zeros(n, dtype=np.float32); te_parts = []
    t0 = time.time()
    for fi, (tr_idx, va_idx) in enumerate(folds, 1):
        X_tr_fold = Xtr_df.iloc[tr_idx].reset_index(drop=True)
        X_va_fold = Xtr_df.iloc[va_idx].reset_index(drop=True)
        y_tr = y[tr_idx]; y_va = y[va_idx]
        pos = float((y_tr == 1).sum()); neg = float((y_tr == 0).sum())
        spw = (neg / max(pos, 1.0)) if pos > 0 else 1.0
        p = dict(params); p['scale_pos_weight'] = spw; p['random_seed'] = 4242 + gi*10 + fi
        pool_tr = Pool(X_tr_fold, label=y_tr, text_features=text_features)
        pool_va = Pool(X_va_fold, label=y_va, text_features=text_features)
        pool_te = Pool(Xte_df, text_features=text_features)
        model = CatBoostClassifier(**p)
        model.fit(pool_tr, eval_set=pool_va, use_best_model=True, early_stopping_rounds=100)
        va_pred = model.predict_proba(pool_va)[:,1].astype(np.float32)
        te_pred = model.predict_proba(pool_te)[:,1].astype(np.float32)
        oof[va_idx] = va_pred; te_parts.append(te_pred)
        del pool_tr, pool_va, pool_te, model; gc.collect()
    auc_oof = roc_auc_score(y[mask], oof[mask])
    te_mean = np.mean(te_parts, axis=0).astype(np.float32)
    print(f'[CatSweep {gi}/{len(grid)}] cfg={cfg} | OOF AUC={auc_oof:.5f} | {time.time()-t0:.1f}s', flush=True)
    if auc_oof > best_auc:
        best_auc, best_cfg, best_oof, best_te = auc_oof, cfg, oof.copy(), te_mean.copy()
    del oof, te_parts; gc.collect()

print(f'[CatSweep] BEST OOF={best_auc:.5f} | cfg={best_cfg} | total {time.time()-t_all:.1f}s', flush=True)
np.save('oof_catboost_textmeta.npy', best_oof.astype(np.float32))
np.save('test_catboost_textmeta.npy', best_te.astype(np.float32))
print('Overwrote oof_catboost_textmeta.npy and test_catboost_textmeta.npy with best sweep results', flush=True)

Time-CV: 5 folds; validated 2398/2878


Grid size: 24


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


[CatSweep 1/24] cfg={'depth': 6, 'learning_rate': 0.03, 'l2_leaf_reg': 3.0, 'bagging_temperature': 0.5} | OOF AUC=0.63802 | 23.2s


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


[CatSweep 2/24] cfg={'depth': 6, 'learning_rate': 0.03, 'l2_leaf_reg': 3.0, 'bagging_temperature': 1.0} | OOF AUC=0.63555 | 21.0s


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


[CatSweep 3/24] cfg={'depth': 6, 'learning_rate': 0.03, 'l2_leaf_reg': 6.0, 'bagging_temperature': 0.5} | OOF AUC=0.63921 | 27.0s


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


[CatSweep 4/24] cfg={'depth': 6, 'learning_rate': 0.03, 'l2_leaf_reg': 6.0, 'bagging_temperature': 1.0} | OOF AUC=0.64568 | 26.0s


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


[CatSweep 5/24] cfg={'depth': 6, 'learning_rate': 0.03, 'l2_leaf_reg': 9.0, 'bagging_temperature': 0.5} | OOF AUC=0.64461 | 27.3s


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


[CatSweep 6/24] cfg={'depth': 6, 'learning_rate': 0.03, 'l2_leaf_reg': 9.0, 'bagging_temperature': 1.0} | OOF AUC=0.65356 | 36.8s


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


[CatSweep 7/24] cfg={'depth': 6, 'learning_rate': 0.05, 'l2_leaf_reg': 3.0, 'bagging_temperature': 0.5} | OOF AUC=0.63770 | 22.9s


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


[CatSweep 8/24] cfg={'depth': 6, 'learning_rate': 0.05, 'l2_leaf_reg': 3.0, 'bagging_temperature': 1.0} | OOF AUC=0.64195 | 31.6s


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


[CatSweep 9/24] cfg={'depth': 6, 'learning_rate': 0.05, 'l2_leaf_reg': 6.0, 'bagging_temperature': 0.5} | OOF AUC=0.64191 | 24.1s


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


[CatSweep 10/24] cfg={'depth': 6, 'learning_rate': 0.05, 'l2_leaf_reg': 6.0, 'bagging_temperature': 1.0} | OOF AUC=0.63665 | 22.3s


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


[CatSweep 11/24] cfg={'depth': 6, 'learning_rate': 0.05, 'l2_leaf_reg': 9.0, 'bagging_temperature': 0.5} | OOF AUC=0.62863 | 26.9s


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


[CatSweep 12/24] cfg={'depth': 6, 'learning_rate': 0.05, 'l2_leaf_reg': 9.0, 'bagging_temperature': 1.0} | OOF AUC=0.63756 | 35.3s


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


[CatSweep 13/24] cfg={'depth': 8, 'learning_rate': 0.03, 'l2_leaf_reg': 3.0, 'bagging_temperature': 0.5} | OOF AUC=0.64530 | 76.5s


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


[CatSweep 14/24] cfg={'depth': 8, 'learning_rate': 0.03, 'l2_leaf_reg': 3.0, 'bagging_temperature': 1.0} | OOF AUC=0.63980 | 115.9s


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


[CatSweep 15/24] cfg={'depth': 8, 'learning_rate': 0.03, 'l2_leaf_reg': 6.0, 'bagging_temperature': 0.5} | OOF AUC=0.63825 | 75.4s


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


[CatSweep 16/24] cfg={'depth': 8, 'learning_rate': 0.03, 'l2_leaf_reg': 6.0, 'bagging_temperature': 1.0} | OOF AUC=0.63747 | 98.4s


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


[CatSweep 17/24] cfg={'depth': 8, 'learning_rate': 0.03, 'l2_leaf_reg': 9.0, 'bagging_temperature': 0.5} | OOF AUC=0.61591 | 53.3s


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


[CatSweep 18/24] cfg={'depth': 8, 'learning_rate': 0.03, 'l2_leaf_reg': 9.0, 'bagging_temperature': 1.0} | OOF AUC=0.62428 | 65.6s


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


[CatSweep 19/24] cfg={'depth': 8, 'learning_rate': 0.05, 'l2_leaf_reg': 3.0, 'bagging_temperature': 0.5} | OOF AUC=0.62876 | 53.0s


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


[CatSweep 20/24] cfg={'depth': 8, 'learning_rate': 0.05, 'l2_leaf_reg': 3.0, 'bagging_temperature': 1.0} | OOF AUC=0.61497 | 71.8s


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


[CatSweep 21/24] cfg={'depth': 8, 'learning_rate': 0.05, 'l2_leaf_reg': 6.0, 'bagging_temperature': 0.5} | OOF AUC=0.63728 | 61.2s


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


[CatSweep 22/24] cfg={'depth': 8, 'learning_rate': 0.05, 'l2_leaf_reg': 6.0, 'bagging_temperature': 1.0} | OOF AUC=0.64554 | 56.6s


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


[CatSweep 23/24] cfg={'depth': 8, 'learning_rate': 0.05, 'l2_leaf_reg': 9.0, 'bagging_temperature': 0.5} | OOF AUC=0.64561 | 76.2s


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Default metric period is 5 because AUC is/are not implemented for GPU


[CatSweep 24/24] cfg={'depth': 8, 'learning_rate': 0.05, 'l2_leaf_reg': 9.0, 'bagging_temperature': 1.0} | OOF AUC=0.63103 | 61.8s


[CatSweep] BEST OOF=0.65356 | cfg={'depth': 6, 'learning_rate': 0.03, 'l2_leaf_reg': 9.0, 'bagging_temperature': 1.0} | total 1195.0s


Overwrote oof_catboost_textmeta.npy and test_catboost_textmeta.npy with best sweep results


In [26]:
# S54: Embedding heads + meta_v1 (per-fold StandardScaler) with XGB (GPU); cache OOF/test
import os, time, gc, numpy as np, pandas as pd, xgboost as xgb
from pathlib import Path
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score

id_col = 'request_id'; target_col = 'requester_received_pizza'
train = pd.read_json('train.json')
test = pd.read_json('test.json')
y = train[target_col].astype(int).values

# Time-aware 6-block forward-chaining (validate blocks 1..5)
order = np.argsort(train['unix_timestamp_of_request'].values)
n = len(train); k = 6
blocks = np.array_split(order, k)
folds = []; mask = np.zeros(n, dtype=bool)
for i in range(1, k):
    va_idx = np.array(blocks[i]); tr_idx = np.concatenate(blocks[:i])
    folds.append((tr_idx, va_idx)); mask[va_idx] = True
print(f'Time-CV: {len(folds)} folds; validated {mask.sum()}/{n}', flush=True)

# Load embeddings and meta_v1
E5_tr = np.load('emb_e5_tr.npy').astype(np.float32)
E5_te = np.load('emb_e5_te.npy').astype(np.float32)
Emb_min_tr = np.load('emb_minilm_tr.npy').astype(np.float32)
Emb_min_te = np.load('emb_minilm_te.npy').astype(np.float32)
Emb_mp_tr = np.load('emb_mpnet_tr.npy').astype(np.float32)
Emb_mp_te = np.load('emb_mpnet_te.npy').astype(np.float32)
Meta_tr = np.load('meta_v1_tr.npy').astype(np.float32)
Meta_te = np.load('meta_v1_te.npy').astype(np.float32)

def run_xgb_head_with_meta(Xtr_emb: np.ndarray, Xte_emb: np.ndarray, tag: str):
    oof = np.zeros(n, dtype=np.float32)
    te_parts = []
    params = dict(
        objective='binary:logistic',
        eval_metric='auc',
        max_depth=4,
        eta=0.05,
        subsample=0.8,
        colsample_bytree=0.6,
        min_child_weight=8,
        reg_alpha=0.3,
        reg_lambda=3.0,
        gamma=0.0,
        device='cuda',
        tree_method='hist'
    )
    t_all = time.time()
    for fi, (tr_idx, va_idx) in enumerate(folds, 1):
        t0 = time.time()
        X_tr_f = np.hstack([Xtr_emb[tr_idx], Meta_tr[tr_idx]]).astype(np.float32)
        X_va_f = np.hstack([Xtr_emb[va_idx], Meta_tr[va_idx]]).astype(np.float32)
        X_te_f = np.hstack([Xte_emb, Meta_te]).astype(np.float32)
        scaler = StandardScaler(with_mean=True, with_std=True)
        X_tr = scaler.fit_transform(X_tr_f).astype(np.float32)
        X_va = scaler.transform(X_va_f).astype(np.float32)
        X_te = scaler.transform(X_te_f).astype(np.float32)
        dtr = xgb.DMatrix(X_tr, label=y[tr_idx]); dva = xgb.DMatrix(X_va, label=y[va_idx]); dte = xgb.DMatrix(X_te)
        pos = float((y[tr_idx] == 1).sum()); neg = float((y[tr_idx] == 0).sum()); spw = (neg / max(pos, 1.0)) if pos > 0 else 1.0
        p = dict(params); p['seed'] = 4242 + fi; p['scale_pos_weight'] = spw
        booster = xgb.train(p, dtr, num_boost_round=4000, evals=[(dva, 'valid')], early_stopping_rounds=100, verbose_eval=False)
        va_pred = booster.predict(dva).astype(np.float32)
        te_pred = booster.predict(dte, iteration_range=(0, booster.best_iteration+1 if booster.best_iteration is not None else 0)).astype(np.float32)
        oof[va_idx] = va_pred; te_parts.append(te_pred)
        auc = roc_auc_score(y[va_idx], va_pred) if (y[va_idx].min()!=y[va_idx].max()) else 0.5
        print(f'[{tag}] Fold {fi} AUC={auc:.5f} | rounds={booster.best_iteration} | spw={spw:.2f} | {time.time()-t0:.1f}s', flush=True)
        del X_tr_f, X_va_f, X_te_f, X_tr, X_va, X_te, scaler, dtr, dva, dte, booster; gc.collect()
    auc_oof = roc_auc_score(y[mask], oof[mask])
    te_mean = np.mean(te_parts, axis=0).astype(np.float32)
    print(f'[{tag}] DONE | OOF(validated) AUC={auc_oof:.5f} | total {time.time()-t_all:.1f}s', flush=True)
    np.save(f'oof_xgb_{tag}.npy', oof.astype(np.float32))
    np.save(f'test_xgb_{tag}.npy', te_mean.astype(np.float32))
    print(f'Saved oof_xgb_{tag}.npy and test_xgb_{tag}.npy', flush=True)

# a) e5 + meta
run_xgb_head_with_meta(E5_tr, E5_te, tag='e5_meta_time')

# b) MiniLM||MPNet + meta
Emb_mm_tr = np.hstack([Emb_min_tr, Emb_mp_tr]).astype(np.float32)
Emb_mm_te = np.hstack([Emb_min_te, Emb_mp_te]).astype(np.float32)
run_xgb_head_with_meta(Emb_mm_tr, Emb_mm_te, tag='emb_minilm_mpnet_meta_time')

# c) e5||MiniLM||MPNet + meta
Emb_all_tr = np.hstack([E5_tr, Emb_min_tr, Emb_mp_tr]).astype(np.float32)
Emb_all_te = np.hstack([E5_te, Emb_min_te, Emb_mp_te]).astype(np.float32)
run_xgb_head_with_meta(Emb_all_tr, Emb_all_te, tag='emb_e5_minilm_mpnet_meta_time')

Time-CV: 5 folds; validated 2398/2878


[e5_meta_time] Fold 1 AUC=0.59570 | rounds=210 | spw=1.94 | 1.3s


[e5_meta_time] Fold 2 AUC=0.66896 | rounds=10 | spw=2.33 | 0.5s


[e5_meta_time] Fold 3 AUC=0.61099 | rounds=170 | spw=2.49 | 1.2s


[e5_meta_time] Fold 4 AUC=0.61646 | rounds=271 | spw=2.79 | 1.7s


[e5_meta_time] Fold 5 AUC=0.62050 | rounds=159 | spw=2.83 | 1.3s


[e5_meta_time] DONE | OOF(validated) AUC=0.61579 | total 7.1s


Saved oof_xgb_e5_meta_time.npy and test_xgb_e5_meta_time.npy


[emb_minilm_mpnet_meta_time] Fold 1 AUC=0.65535 | rounds=84 | spw=1.94 | 1.0s


[emb_minilm_mpnet_meta_time] Fold 2 AUC=0.68180 | rounds=12 | spw=2.33 | 0.7s


[emb_minilm_mpnet_meta_time] Fold 3 AUC=0.56828 | rounds=59 | spw=2.49 | 1.0s


[emb_minilm_mpnet_meta_time] Fold 4 AUC=0.61658 | rounds=81 | spw=2.79 | 1.2s


[emb_minilm_mpnet_meta_time] Fold 5 AUC=0.60448 | rounds=14 | spw=2.83 | 0.8s


[emb_minilm_mpnet_meta_time] DONE | OOF(validated) AUC=0.62118 | total 5.7s


Saved oof_xgb_emb_minilm_mpnet_meta_time.npy and test_xgb_emb_minilm_mpnet_meta_time.npy


[emb_e5_minilm_mpnet_meta_time] Fold 1 AUC=0.65285 | rounds=520 | spw=1.94 | 3.3s


[emb_e5_minilm_mpnet_meta_time] Fold 2 AUC=0.68805 | rounds=58 | spw=2.33 | 1.4s


[emb_e5_minilm_mpnet_meta_time] Fold 3 AUC=0.54808 | rounds=4 | spw=2.49 | 1.0s


[emb_e5_minilm_mpnet_meta_time] Fold 4 AUC=0.59463 | rounds=27 | spw=2.79 | 1.3s


[emb_e5_minilm_mpnet_meta_time] Fold 5 AUC=0.61489 | rounds=144 | spw=2.83 | 2.3s


[emb_e5_minilm_mpnet_meta_time] DONE | OOF(validated) AUC=0.61816 | total 10.6s


Saved oof_xgb_emb_e5_minilm_mpnet_meta_time.npy and test_xgb_emb_e5_minilm_mpnet_meta_time.npy


In [27]:
# S55: CatBoost v2 - single text + hour/weekday cats + flags + meta_v1 (GPU), time-aware CV; cache OOF/test
import os, sys, time, gc, re, numpy as np, pandas as pd
from pathlib import Path

try:
    from catboost import CatBoostClassifier, Pool
except Exception:
    import subprocess
    subprocess.run([sys.executable, '-m', 'pip', 'install', '-q', 'catboost'])
    from catboost import CatBoostClassifier, Pool
from sklearn.metrics import roc_auc_score

id_col = 'request_id'; target_col = 'requester_received_pizza'
train = pd.read_json('train.json')
test = pd.read_json('test.json')
y = train[target_col].astype(int).values

def get_title(df):
    return df.get('request_title', pd.Series(['']*len(df))).fillna('').astype(str)
def get_body(df):
    return (df['request_text'] if 'request_text' in df.columns else df.get('request_text', pd.Series(['']*len(df)))).fillna('').astype(str)

# Time features
dt_tr = pd.to_datetime(train['unix_timestamp_of_request'].astype(np.int64).values, unit='s', utc=True)
dt_te = pd.to_datetime(test['unix_timestamp_of_request'].astype(np.int64).values, unit='s', utc=True)
hour_tr = dt_tr.hour.astype(str).values
hour_te = dt_te.hour.astype(str).values
wday_tr = dt_tr.weekday.astype(str).values
wday_te = dt_te.weekday.astype(str).values

# Simple flags
def build_flags(title: pd.Series, body: pd.Series):
    text = (title + '\n' + body)
    has_money = text.str.contains(r'\$|dollar(s)?|cash|money', case=False, regex=True).astype(np.int8).values
    has_urgent = text.str.contains(r'(?i)urgent|emergency|immediately|asap|right away', regex=True).astype(np.int8).values
    has_please = text.str.contains(r'(?i)\bplease\b', regex=True).astype(np.int8).values
    has_thanks = text.str.contains(r'(?i)\bthank(s| you)?\b', regex=True).astype(np.int8).values
    return has_money, has_urgent, has_please, has_thanks

title_tr = get_title(train); body_tr = get_body(train)
title_te = get_title(test);  body_te = get_body(test)
text_tr = (title_tr + ' [SEP] ' + body_tr).astype(str)
text_te = (title_te + ' [SEP] ' + body_te).astype(str)
f_money_tr, f_urgent_tr, f_please_tr, f_thanks_tr = build_flags(title_tr, body_tr)
f_money_te, f_urgent_te, f_please_te, f_thanks_te = build_flags(title_te, body_te)

# Load meta_v1 features
Meta_tr = np.load('meta_v1_tr.npy').astype(np.float32)
Meta_te = np.load('meta_v1_te.npy').astype(np.float32)
n_meta = Meta_tr.shape[1]

# Assemble DataFrames: columns = ['text','hour','weekday','has_money','has_urgent','has_please','has_thanks', meta...]
meta_cols = [f'm{i}' for i in range(n_meta)]
Xtr_df = pd.DataFrame({'text': text_tr, 'hour': hour_tr, 'weekday': wday_tr,
                       'has_money': f_money_tr, 'has_urgent': f_urgent_tr, 'has_please': f_please_tr, 'has_thanks': f_thanks_tr})
for i, col in enumerate(meta_cols):
    Xtr_df[col] = Meta_tr[:, i]
Xte_df = pd.DataFrame({'text': text_te, 'hour': hour_te, 'weekday': wday_te,
                       'has_money': f_money_te, 'has_urgent': f_urgent_te, 'has_please': f_please_te, 'has_thanks': f_thanks_te})
for i, col in enumerate(meta_cols):
    Xte_df[col] = Meta_te[:, i]

# Indices for CatBoost
text_features = [0]  # 'text'
cat_features = [1, 2]  # 'hour','weekday'

# Time-aware 6-block forward-chaining (validate blocks 1..5)
order = np.argsort(train['unix_timestamp_of_request'].values)
n = len(train); k = 6
blocks = np.array_split(order, k)
folds = []; mask = np.zeros(n, dtype=bool)
for i in range(1, k):
    va_idx = np.array(blocks[i]); tr_idx = np.concatenate(blocks[:i])
    folds.append((tr_idx, va_idx)); mask[va_idx] = True
print(f'Time-CV: {len(folds)} folds; validated {mask.sum()}/{n}', flush=True)

# CatBoost v2 params (expert):
params_base = dict(
    iterations=4000,
    depth=6,
    learning_rate=0.03,
    l2_leaf_reg=9.0,
    bagging_temperature=1.0,
    random_strength=1.0,
    loss_function='Logloss',
    eval_metric='AUC',
    task_type='GPU',
    verbose=False
)

oof = np.zeros(n, dtype=np.float32)
te_parts = []
t_all = time.time()
for fi, (tr_idx, va_idx) in enumerate(folds, 1):
    t0 = time.time()
    X_tr_fold = Xtr_df.iloc[tr_idx].reset_index(drop=True)
    X_va_fold = Xtr_df.iloc[va_idx].reset_index(drop=True)
    y_tr = y[tr_idx]; y_va = y[va_idx]
    pos = float((y_tr == 1).sum()); neg = float((y_tr == 0).sum())
    spw = (neg / max(pos, 1.0)) if pos > 0 else 1.0
    params = dict(params_base); params['scale_pos_weight'] = spw; params['random_seed'] = 2025 + fi
    pool_tr = Pool(X_tr_fold, label=y_tr, text_features=text_features, cat_features=cat_features)
    pool_va = Pool(X_va_fold, label=y_va, text_features=text_features, cat_features=cat_features)
    pool_te = Pool(Xte_df, text_features=text_features, cat_features=cat_features)
    model = CatBoostClassifier(**params)
    model.fit(pool_tr, eval_set=pool_va, use_best_model=True, early_stopping_rounds=100)
    va_pred = model.predict_proba(pool_va)[:,1].astype(np.float32)
    te_pred = model.predict_proba(pool_te)[:,1].astype(np.float32)
    oof[va_idx] = va_pred; te_parts.append(te_pred)
    auc = roc_auc_score(y_va, va_pred) if (y_va.min()!=y_va.max()) else 0.5
    best_it = getattr(model, 'best_iteration_', None)
    print(f'[CatTextMeta_v2] Fold {fi} AUC={auc:.5f} | spw={spw:.2f} | best_it={best_it} | {time.time()-t0:.1f}s', flush=True)
    del pool_tr, pool_va, pool_te, model; gc.collect()

auc_oof = roc_auc_score(y[mask], oof[mask])
te_mean = np.mean(te_parts, axis=0).astype(np.float32)
print(f'[CatTextMeta_v2] DONE | OOF(validated) AUC={auc_oof:.5f} | total {time.time()-t_all:.1f}s', flush=True)
np.save('oof_catboost_textmeta_v2.npy', oof.astype(np.float32))
np.save('test_catboost_textmeta_v2.npy', te_mean.astype(np.float32))
print('Saved oof_catboost_textmeta_v2.npy and test_catboost_textmeta_v2.npy', flush=True)

  has_money = text.str.contains(r'\$|dollar(s)?|cash|money', case=False, regex=True).astype(np.int8).values
  has_thanks = text.str.contains(r'(?i)\bthank(s| you)?\b', regex=True).astype(np.int8).values
  has_money = text.str.contains(r'\$|dollar(s)?|cash|money', case=False, regex=True).astype(np.int8).values
  has_thanks = text.str.contains(r'(?i)\bthank(s| you)?\b', regex=True).astype(np.int8).values


Time-CV: 5 folds; validated 2398/2878


Default metric period is 5 because AUC is/are not implemented for GPU


[CatTextMeta_v2] Fold 1 AUC=0.70265 | spw=1.94 | best_it=189 | 10.2s


Default metric period is 5 because AUC is/are not implemented for GPU


[CatTextMeta_v2] Fold 2 AUC=0.69456 | spw=2.33 | best_it=26 | 4.1s


Default metric period is 5 because AUC is/are not implemented for GPU


[CatTextMeta_v2] Fold 3 AUC=0.62085 | spw=2.49 | best_it=6 | 3.6s


Default metric period is 5 because AUC is/are not implemented for GPU


[CatTextMeta_v2] Fold 4 AUC=0.65066 | spw=2.79 | best_it=6 | 3.6s


Default metric period is 5 because AUC is/are not implemented for GPU


[CatTextMeta_v2] Fold 5 AUC=0.60361 | spw=2.83 | best_it=286 | 11.3s


[CatTextMeta_v2] DONE | OOF(validated) AUC=0.64769 | total 33.9s


Saved oof_catboost_textmeta_v2.npy and test_catboost_textmeta_v2.npy


In [28]:
# S56: Recency-heavy reblend including CatBoost_v2 and embedding+meta heads; widen Cat cap; Dense can drop to 0
import numpy as np, pandas as pd, time
from pathlib import Path
from sklearn.metrics import roc_auc_score

id_col = 'request_id'; target_col = 'requester_received_pizza'
train = pd.read_json('train.json')
test = pd.read_json('test.json')
y = train[target_col].astype(int).values
ids = test[id_col].values

def to_logit(p, eps=1e-6):
    p = np.clip(p.astype(np.float64), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Time blocks and masks
order = np.argsort(train['unix_timestamp_of_request'].values)
k = 6
blocks = np.array_split(order, k)
n = len(train)
mask_full = np.zeros(n, dtype=bool)
for i in range(1, k):
    mask_full[np.array(blocks[i])] = True
mask_last2 = np.zeros(n, dtype=bool)
for i in [4,5]:
    mask_last2[np.array(blocks[i])] = True
print(f'Time-CV validated full: {mask_full.sum()}/{n} | last2: {mask_last2.sum()}', flush=True)

# Core base OOF/test
o_lr_w = np.load('oof_lr_time_withsub_meta.npy');    t_lr_w = np.load('test_lr_time_withsub_meta.npy')
o_lr_ns = np.load('oof_lr_time_nosub_meta.npy');     t_lr_ns = np.load('test_lr_time_nosub_meta.npy')
o_d1 = np.load('oof_xgb_dense_time.npy');            t_d1 = np.load('test_xgb_dense_time.npy')
o_d2 = np.load('oof_xgb_dense_time_v2.npy');         t_d2 = np.load('test_xgb_dense_time_v2.npy')
o_meta = np.load('oof_xgb_meta_time.npy');           t_meta = np.load('test_xgb_meta_time.npy') if not Path('test_xgb_meta_fullbag.npy').exists() else np.load('test_xgb_meta_fullbag.npy')
o_emn = np.load('oof_xgb_emb_meta_time.npy');        t_emn = np.load('test_xgb_emb_meta_time.npy') if not Path('test_xgb_emb_minilm_fullbag.npy').exists() else np.load('test_xgb_emb_minilm_fullbag.npy')
o_emp = np.load('oof_xgb_emb_mpnet_time.npy');       t_emp = np.load('test_xgb_emb_mpnet_time.npy') if not Path('test_xgb_emb_mpnet_fullbag.npy').exists() else np.load('test_xgb_emb_mpnet_fullbag.npy')

# New embedding+meta heads
has_mm_meta = Path('oof_xgb_emb_minilm_mpnet_meta_time.npy').exists() and Path('test_xgb_emb_minilm_mpnet_meta_time.npy').exists()
if has_mm_meta:
    o_mm_meta = np.load('oof_xgb_emb_minilm_mpnet_meta_time.npy'); t_mm_meta = np.load('test_xgb_emb_minilm_mpnet_meta_time.npy')
has_e5_meta = Path('oof_xgb_e5_meta_time.npy').exists() and Path('test_xgb_e5_meta_time.npy').exists()
if has_e5_meta:
    o_e5_meta = np.load('oof_xgb_e5_meta_time.npy'); t_e5_meta = np.load('test_xgb_e5_meta_time.npy')
has_all_meta = Path('oof_xgb_emb_e5_minilm_mpnet_meta_time.npy').exists() and Path('test_xgb_emb_e5_minilm_mpnet_meta_time.npy').exists()
if has_all_meta:
    o_all_meta = np.load('oof_xgb_emb_e5_minilm_mpnet_meta_time.npy'); t_all_meta = np.load('test_xgb_emb_e5_minilm_mpnet_meta_time.npy')

# Dual-view SVD optional
has_svd_dual = Path('oof_xgb_svd_word192_char128_meta.npy').exists() and Path('test_xgb_svd_word192_char128_meta.npy').exists()
if has_svd_dual:
    o_svd_dual = np.load('oof_xgb_svd_word192_char128_meta.npy'); t_svd_dual = np.load('test_xgb_svd_word192_char128_meta.npy')

# Char LR optional
has_char = Path('oof_lr_charwb_time.npy').exists() and Path('test_lr_charwb_time.npy').exists()
if has_char:
    o_char = np.load('oof_lr_charwb_time.npy'); t_char = np.load('test_lr_charwb_time.npy')

# CatBoost v1/v2: choose better OOF
has_cat_v1 = Path('oof_catboost_textmeta.npy').exists() and Path('test_catboost_textmeta.npy').exists()
has_cat_v2 = Path('oof_catboost_textmeta_v2.npy').exists() and Path('test_catboost_textmeta_v2.npy').exists()
z_cat, tz_cat, cat_ver = None, None, None
if has_cat_v1 and has_cat_v2:
    o1 = np.load('oof_catboost_textmeta.npy'); o2 = np.load('oof_catboost_textmeta_v2.npy')
    auc1 = roc_auc_score(y[mask_full], o1[mask_full]); auc2 = roc_auc_score(y[mask_full], o2[mask_full])
    if auc2 >= auc1:
        z_cat = to_logit(o2); tz_cat = to_logit(np.load('test_catboost_textmeta_v2.npy')); cat_ver = 'v2'
    else:
        z_cat = to_logit(o1); tz_cat = to_logit(np.load('test_catboost_textmeta.npy')); cat_ver = 'v1'
elif has_cat_v2:
    z_cat = to_logit(np.load('oof_catboost_textmeta_v2.npy')); tz_cat = to_logit(np.load('test_catboost_textmeta_v2.npy')); cat_ver = 'v2'
elif has_cat_v1:
    z_cat = to_logit(np.load('oof_catboost_textmeta.npy')); tz_cat = to_logit(np.load('test_catboost_textmeta.npy')); cat_ver = 'v1'
else:
    raise FileNotFoundError('No CatBoost OOF/test found')
print(f'Using CatBoost {cat_ver}', flush=True)

# Convert OOF to logits
z_lr_w, z_lr_ns = to_logit(o_lr_w), to_logit(o_lr_ns)
z_d1, z_d2, z_meta = to_logit(o_d1), to_logit(o_d2), to_logit(o_meta)
z_emn, z_emp = to_logit(o_emn), to_logit(o_emp)
z_mm_meta = to_logit(o_mm_meta) if has_mm_meta else None
z_e5_meta = to_logit(o_e5_meta) if has_e5_meta else None
z_all_meta = to_logit(o_all_meta) if has_all_meta else None
z_svd = to_logit(o_svd_dual) if has_svd_dual else None
z_char = to_logit(o_char) if has_char else None

# Convert test to logits
tz_lr_w, tz_lr_ns = to_logit(t_lr_w), to_logit(t_lr_ns)
tz_d1, tz_d2, tz_meta = to_logit(t_d1), to_logit(t_d2), to_logit(t_meta)
tz_emn, tz_emp = to_logit(t_emn), to_logit(t_emp)
tz_mm_meta = to_logit(t_mm_meta) if has_mm_meta else None
tz_e5_meta = to_logit(t_e5_meta) if has_e5_meta else None
tz_all_meta = to_logit(t_all_meta) if has_all_meta else None
tz_svd = to_logit(t_svd_dual) if has_svd_dual else None
tz_char = to_logit(t_char) if has_char else None

# Grids per expert guidance
g_grid = [0.990, 0.995, 0.997]
meta_grid = [0.16, 0.18, 0.20, 0.22]
dense_tot_grid = [0.0, 0.04, 0.08]
dense_split = [(0.6, 0.4), (0.7, 0.3)]
emb_tot_grid = [0.28, 0.32, 0.36, 0.38]
emb_split = [(0.6, 0.4), (0.5, 0.5)]  # MiniLM:MPNet
e5_cap_grid = [0.0, 0.02, 0.04, 0.06] if has_e5_meta else [0.0]
cat_grid = [0.10, 0.14, 0.20, 0.26, 0.30]
svd_grid = [0.0, 0.04, 0.08] if has_svd_dual else [0.0]
char_grid = [0.0, 0.04, 0.06, 0.08] if has_char else [0.0]
w_lr_min_grid = [0.28]

def search(mask, sample_weight=None):
    best_auc, best_cfg, tried = -1.0, None, 0
    t0 = time.time()
    for g in g_grid:
        z_lr_mix = (1.0 - g)*z_lr_w + g*z_lr_ns
        for w_lr_min in w_lr_min_grid:
            for w_meta in meta_grid:
                for d_tot in dense_tot_grid:
                    for dv1, dv2 in dense_split:
                        w_d1 = d_tot * dv1; w_d2 = d_tot * dv2
                        for e_tot in emb_tot_grid:
                            for em_fr, mp_fr in emb_split:
                                w_emn = e_tot * em_fr; w_emp = e_tot * mp_fr
                                for w_e5 in e5_cap_grid:
                                    for w_cat in cat_grid:
                                        for w_svd in svd_grid:
                                            for w_char in char_grid:
                                                rem = 1.0 - (w_meta + w_d1 + w_d2 + w_emn + w_emp + w_e5 + w_cat + w_svd + w_char)
                                                if rem <= 0: continue
                                                w_lr = rem
                                                if w_lr < w_lr_min: continue
                                                z_oof = (w_lr*z_lr_mix +
                                                         w_d1*z_d1 + w_d2*z_d2 +
                                                         w_meta*z_meta +
                                                         w_emn*z_emn + w_emp*z_emp +
                                                         (w_e5*z_e5_meta if has_e5_meta and w_e5>0 else 0) +
                                                         (w_svd*z_svd if has_svd_dual and w_svd>0 else 0) +
                                                         (w_char*z_char if has_char and w_char>0 else 0) +
                                                         w_cat*z_cat)
                                                auc = roc_auc_score(y[mask], z_oof[mask], sample_weight=(sample_weight[mask] if sample_weight is not None else None))
                                                tried += 1
                                                if tried % 3000 == 0:
                                                    print(f'  tried={tried} | best={best_auc:.5f} | elapsed={time.time()-t0:.1f}s', flush=True)
                                                if auc > best_auc:
                                                    best_auc = auc
                                                    best_cfg = dict(g=float(g), w_lr=float(w_lr), w_d1=float(w_d1), w_d2=float(w_d2),
                                                                    w_meta=float(w_meta), w_emn=float(w_emn), w_emp=float(w_emp),
                                                                    w_e5=float(w_e5), w_cat=float(w_cat), w_svd=float(w_svd), w_char=float(w_char))
    print(f'  search done | tried={tried} | best={best_auc:.5f} | {time.time()-t0:.1f}s', flush=True)
    return best_auc, best_cfg, tried

# 1) Last-2 objective
auc_last2, cfg_last2, tried_last2 = search(mask_last2)
print(f'[Last2] tried={tried_last2} | best OOF(z,last2) AUC={auc_last2:.5f} | cfg={cfg_last2}', flush=True)

# 2) Gamma-decayed (recency)
best_gamma, best_auc_g, best_cfg_g = None, -1.0, None
for gamma in [0.990, 0.995, 0.997]:
    w = np.zeros(n, dtype=np.float64)
    for bi in range(1, k):
        age = (k - 1) - bi
        w[np.array(blocks[bi])] = (gamma ** age)
    auc_g, cfg_g, _ = search(mask_full, sample_weight=w)
    print(f'[Gamma {gamma}] best OOF(z,weighted) AUC={auc_g:.5f}', flush=True)
    if auc_g > best_auc_g:
        best_auc_g, best_cfg_g, best_gamma = auc_g, cfg_g, gamma
print(f'[Gamma-best] gamma={best_gamma} | AUC={best_auc_g:.5f} | cfg={best_cfg_g}', flush=True)

def build_and_save(tag, cfg):
    g = cfg['g']
    tz_lr_mix = (1.0 - g)*tz_lr_w + g*tz_lr_ns
    parts = [
        cfg['w_lr']*tz_lr_mix,
        cfg['w_d1']*to_logit(t_d1),
        cfg['w_d2']*to_logit(t_d2),
        cfg['w_meta']*to_logit(t_meta),
        cfg['w_emn']*to_logit(t_emn),
        cfg['w_emp']*to_logit(t_emp),
        cfg['w_cat']*tz_cat
    ]
    w_list = [cfg['w_lr'], cfg['w_d1'], cfg['w_d2'], cfg['w_meta'], cfg['w_emn'], cfg['w_emp'], cfg['w_cat']]
    comp_logits = [tz_lr_mix, to_logit(t_d1), to_logit(t_d2), to_logit(t_meta), to_logit(t_emn), to_logit(t_emp), tz_cat]
    if has_e5_meta and cfg['w_e5'] > 0: parts.append(cfg['w_e5']*to_logit(t_e5_meta)); w_list.append(cfg['w_e5']); comp_logits.append(to_logit(t_e5_meta))
    if has_svd_dual and cfg['w_svd'] > 0: parts.append(cfg['w_svd']*to_logit(t_svd_dual)); w_list.append(cfg['w_svd']); comp_logits.append(to_logit(t_svd_dual))
    if has_char and cfg['w_char'] > 0: parts.append(cfg['w_char']*to_logit(t_char)); w_list.append(cfg['w_char']); comp_logits.append(to_logit(t_char))
    zt = np.sum(parts, axis=0)
    pt = sigmoid(zt).astype(np.float32)
    out_path = f'submission_reblend_recency_{tag}.csv'
    pd.DataFrame({id_col: ids, target_col: pt}).to_csv(out_path, index=False)
    # 15% shrink-to-equal hedge
    w_vec = np.array(w_list, dtype=np.float64)
    w_eq = np.ones_like(w_vec)/len(w_vec)
    alpha = 0.15
    w_shr = ((1.0 - alpha)*w_vec + alpha*w_eq); w_shr = (w_shr / w_shr.sum()).astype(np.float64)
    zt_shr = 0.0
    for wi, zi in zip(w_shr, comp_logits):
        zt_shr += wi*zi
    pt_shr = sigmoid(zt_shr).astype(np.float32)
    pd.DataFrame({id_col: ids, target_col: pt_shr}).to_csv(out_path.replace('.csv','_shrunk.csv'), index=False)
    return out_path

p_last2 = build_and_save('last2', cfg_last2)
p_gam = build_and_save(f'gamma{best_gamma:.3f}'.replace('.', 'p'), best_cfg_g)

# Promote better of last-2 vs gamma-best (prefer gamma if close); then leave rank-avg hedge for external step if needed
primary = p_gam if (best_auc_g >= auc_last2) else p_last2
pd.read_csv(primary).to_csv('submission.csv', index=False)
print(f'Promoted {Path(primary).name} to submission.csv', flush=True)

Time-CV validated full: 2398/2878 | last2: 958


Using CatBoost v1


  tried=3000 | best=0.64918 | elapsed=5.3s


  tried=6000 | best=0.64923 | elapsed=10.6s


  tried=9000 | best=0.64923 | elapsed=16.0s


  search done | tried=9003 | best=0.64923 | 16.0s


[Last2] tried=9003 | best OOF(z,last2) AUC=0.64923 | cfg={'g': 0.995, 'w_lr': 0.28, 'w_d1': 0.055999999999999994, 'w_d2': 0.024, 'w_meta': 0.16, 'w_emn': 0.168, 'w_emp': 0.11200000000000002, 'w_e5': 0.02, 'w_cat': 0.1, 'w_svd': 0.0, 'w_char': 0.08}


  tried=3000 | best=0.68175 | elapsed=7.1s


  tried=6000 | best=0.68175 | elapsed=14.1s


  tried=9000 | best=0.68176 | elapsed=21.3s


  search done | tried=9003 | best=0.68176 | 21.3s


[Gamma 0.99] best OOF(z,weighted) AUC=0.68176


  tried=3000 | best=0.68204 | elapsed=7.2s


  tried=6000 | best=0.68204 | elapsed=14.3s


  tried=9000 | best=0.68205 | elapsed=21.4s


  search done | tried=9003 | best=0.68205 | 21.5s


[Gamma 0.995] best OOF(z,weighted) AUC=0.68205


  tried=3000 | best=0.68215 | elapsed=7.2s


  tried=6000 | best=0.68215 | elapsed=14.5s


  tried=9000 | best=0.68216 | elapsed=21.6s


  search done | tried=9003 | best=0.68216 | 21.6s


[Gamma 0.997] best OOF(z,weighted) AUC=0.68216


[Gamma-best] gamma=0.997 | AUC=0.68216 | cfg={'g': 0.997, 'w_lr': 0.28, 'w_d1': 0.027999999999999997, 'w_d2': 0.012, 'w_meta': 0.2, 'w_emn': 0.16, 'w_emp': 0.16, 'w_e5': 0.0, 'w_cat': 0.1, 'w_svd': 0.0, 'w_char': 0.06}


Promoted submission_reblend_recency_gamma0p997.csv to submission.csv


In [29]:
# S57: Super-recent refits (blocks 3-5 and 4-5): LR_nosub+meta, MiniLM XGB+meta, CatBoost v2; save test preds
import numpy as np, pandas as pd, time, gc, xgboost as xgb
from pathlib import Path
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

id_col = 'request_id'; target_col = 'requester_received_pizza'
train = pd.read_json('train.json')
test = pd.read_json('test.json')
y = train[target_col].astype(int).values

def get_title(df):
    return df.get('request_title', pd.Series(['']*len(df))).fillna('').astype(str)
def get_body(df):
    return (df['request_text'] if 'request_text' in df.columns else df.get('request_text', pd.Series(['']*len(df)))).fillna('').astype(str)
def build_text(df):
    return (get_title(df) + '\n' + get_body(df)).astype(str)

txt_tr = build_text(train); txt_te = build_text(test)

# Time blocks
order = np.argsort(train['unix_timestamp_of_request'].values)
k = 6
blocks = np.array_split(order, k)
idx35 = np.concatenate(blocks[2:5])  # blocks 3,4,5 (0-based idx 2..4)
idx45 = np.concatenate(blocks[3:5])  # blocks 4,5 (0-based idx 3..4)
print(f'Recent sets: 3-5 n={len(idx35)} | 4-5 n={len(idx45)}', flush=True)

# Shared meta
Meta_tr = np.load('meta_v1_tr.npy').astype(np.float32)
Meta_te = np.load('meta_v1_te.npy').astype(np.float32)

# 1) LR_nosub + meta (word1-3 + char_wb2-6 TF-IDF), C=1.0
word_params = dict(analyzer='word', ngram_range=(1,3), lowercase=True, min_df=2, max_features=200_000, sublinear_tf=True, smooth_idf=True, norm='l2')
char_params = dict(analyzer='char_wb', ngram_range=(2,6), lowercase=True, min_df=2, max_features=200_000, sublinear_tf=True, smooth_idf=True, norm='l2')

def lr_recent_fit_predict(tr_idx, tag):
    tf_w = TfidfVectorizer(**word_params)
    Xw_tr = tf_w.fit_transform(txt_tr.iloc[tr_idx]); Xw_te = tf_w.transform(txt_te)
    tf_c = TfidfVectorizer(**char_params)
    Xc_tr = tf_c.fit_transform(txt_tr.iloc[tr_idx]); Xc_te = tf_c.transform(txt_te)
    X_tr_text = hstack([Xw_tr, Xc_tr], format='csr')
    X_te_text = hstack([Xw_te, Xc_te], format='csr')
    X_tr = hstack([X_tr_text, csr_matrix(Meta_tr[tr_idx])], format='csr')
    X_te = hstack([X_te_text, csr_matrix(Meta_te)], format='csr')
    clf = LogisticRegression(penalty='l2', solver='saga', C=1.0, max_iter=2000, n_jobs=-1, verbose=0)
    clf.fit(X_tr, y[tr_idx])
    te_pred = clf.predict_proba(X_te)[:,1].astype(np.float32)
    np.save(f'test_lr_nosub_meta_recent{tag}.npy', te_pred)
    print(f'[LR_recent {tag}] te_mean={te_pred.mean():.4f} | feats={X_tr.shape[1]}', flush=True)
    del tf_w, tf_c, Xw_tr, Xw_te, Xc_tr, Xc_te, X_tr_text, X_te_text, X_tr, X_te, clf; gc.collect()

t0 = time.time(); lr_recent_fit_predict(idx35, '35'); lr_recent_fit_predict(idx45, '45')
print(f'LR_recent done in {time.time()-t0:.1f}s', flush=True)

# 2) MiniLM emb+meta XGB recent (GPU, ES on last block inside set)
Emb_min_tr = np.load('emb_minilm_tr.npy').astype(np.float32)
Emb_min_te = np.load('emb_minilm_te.npy').astype(np.float32)
X_all_tr = np.hstack([Emb_min_tr, Meta_tr]).astype(np.float32)
X_all_te = np.hstack([Emb_min_te, Meta_te]).astype(np.float32)

def xgb_recent_predict(tr_idx, eval_idx, tag, name='MiniLM'):
    X_tr = X_all_tr[tr_idx]; y_tr = y[tr_idx]
    X_ev = X_all_tr[eval_idx]; y_ev = y[eval_idx]
    scaler = StandardScaler(with_mean=True, with_std=True)
    X_trs = scaler.fit_transform(X_tr).astype(np.float32)
    X_evs = scaler.transform(X_ev).astype(np.float32)
    X_tes = scaler.transform(X_all_te).astype(np.float32)
    dtr = xgb.DMatrix(X_trs, label=y_tr); dev = xgb.DMatrix(X_evs, label=y_ev); dte = xgb.DMatrix(X_tes)
    pos = float((y_tr == 1).sum()); neg = float((y_tr == 0).sum()); spw = (neg / max(pos, 1.0)) if pos > 0 else 1.0
    params = dict(objective='binary:logistic', eval_metric='auc', max_depth=3, eta=0.05, subsample=0.8, colsample_bytree=0.6,
                  min_child_weight=8, reg_alpha=0.5, reg_lambda=3.0, gamma=0.0, device='cuda', tree_method='hist',
                  seed=2025, scale_pos_weight=spw)
    booster = xgb.train(params, dtr, num_boost_round=4000, evals=[(dev, 'valid')], early_stopping_rounds=100, verbose_eval=False)
    te_pred = booster.predict(dte, iteration_range=(0, booster.best_iteration+1 if booster.best_iteration is not None else 0)).astype(np.float32)
    np.save(f'test_xgb_minilm_meta_recent{tag}.npy', te_pred)
    print(f'[XGB {name}_recent {tag}] rounds={booster.best_iteration} | te_mean={te_pred.mean():.4f}', flush=True)
    del dtr, dev, dte, booster, scaler; gc.collect()

# For 3-5, use block5 as eval; for 4-5, use block5 as eval (leaving block4 as train) to pick rounds
blk5 = np.array(blocks[4])
xgb_recent_predict(idx35, blk5, '35')
xgb_recent_predict(idx45, blk5, '45')

# 3) CatBoost v2 recent (single text + hour/weekday cats + flags + meta), GPU
from catboost import CatBoostClassifier, Pool

def build_cat_v2_df(df):
    title = get_title(df); body = get_body(df)
    text = (title + ' [SEP] ' + body).astype(str)
    dt = pd.to_datetime(df['unix_timestamp_of_request'].astype(np.int64).values, unit='s', utc=True)
    hour = dt.hour.astype(str).values; wday = dt.weekday.astype(str).values
    txt = (title + '\n' + body)
    has_money = txt.str.contains(r'\$|dollar(s)?|cash|money', case=False, regex=True).astype(np.int8).values
    has_urgent = txt.str.contains(r'(?i)urgent|emergency|immediately|asap|right away', regex=True).astype(np.int8).values
    has_please = txt.str.contains(r'(?i)\bplease\b', regex=True).astype(np.int8).values
    has_thanks = txt.str.contains(r'(?i)\bthank(s| you)?\b', regex=True).astype(np.int8).values
    return text, hour, wday, has_money, has_urgent, has_please, has_thanks

text_tr, hour_tr, wday_tr, f_m_tr, f_u_tr, f_p_tr, f_t_tr = build_cat_v2_df(train)
text_te, hour_te, wday_te, f_m_te, f_u_te, f_p_te, f_t_te = build_cat_v2_df(test)
def make_cat_df(sel_idx=None):
    if sel_idx is None:
        Xtr = pd.DataFrame({'text': text_tr, 'hour': hour_tr, 'weekday': wday_tr,
                            'has_money': f_m_tr, 'has_urgent': f_u_tr, 'has_please': f_p_tr, 'has_thanks': f_t_tr})
        ytr = y
    else:
        Xtr = pd.DataFrame({'text': text_tr[sel_idx], 'hour': hour_tr[sel_idx], 'weekday': wday_tr[sel_idx],
                            'has_money': f_m_tr[sel_idx], 'has_urgent': f_u_tr[sel_idx], 'has_please': f_p_tr[sel_idx], 'has_thanks': f_t_tr[sel_idx]})
        ytr = y[sel_idx]
    Xte = pd.DataFrame({'text': text_te, 'hour': hour_te, 'weekday': wday_te,
                        'has_money': f_m_te, 'has_urgent': f_u_te, 'has_please': f_p_te, 'has_thanks': f_t_te})
    # append meta
    n_meta = Meta_tr.shape[1]
    for i in range(n_meta):
        col = f'm{i}';
        if sel_idx is None:
            Xtr[col] = Meta_tr[:, i]
        else:
            Xtr[col] = Meta_tr[sel_idx, i]
        Xte[col] = Meta_te[:, i]
    return Xtr, ytr, Xte

def cat_recent_predict(tr_idx, eval_idx, tag):
    Xtr, ytr, Xte = make_cat_df(tr_idx)
    Xev, yev = make_cat_df(eval_idx)[0], y[eval_idx]
    pool_tr = Pool(Xtr, label=ytr, text_features=[0], cat_features=[1,2])
    pool_ev = Pool(Xev, label=yev, text_features=[0], cat_features=[1,2])
    pool_te = Pool(Xte, text_features=[0], cat_features=[1,2])
    pos = float((ytr == 1).sum()); neg = float((ytr == 0).sum()); spw = (neg / max(pos, 1.0)) if pos > 0 else 1.0
    params = dict(iterations=4000, depth=6, learning_rate=0.03, l2_leaf_reg=9.0, bagging_temperature=1.0, random_strength=1.0,
                  loss_function='Logloss', eval_metric='AUC', task_type='GPU', verbose=False, scale_pos_weight=spw, random_seed=3030)
    model = CatBoostClassifier(**params)
    model.fit(pool_tr, eval_set=pool_ev, use_best_model=True, early_stopping_rounds=100)
    te_pred = model.predict_proba(pool_te)[:,1].astype(np.float32)
    np.save(f'test_catboost_textmeta_v2_recent{tag}.npy', te_pred)
    print(f'[Cat_v2_recent {tag}] best_it={getattr(model, "best_iteration_", None)} | te_mean={te_pred.mean():.4f}', flush=True)
    del pool_tr, pool_ev, pool_te, model; gc.collect()

# eval on block5 for both
cat_recent_predict(idx35, blk5, '35')
cat_recent_predict(idx45, blk5, '45')
print('Super-recent refits done.', flush=True)

Recent sets: 3-5 n=1439 | 4-5 n=959


[LR_recent 35] te_mean=0.1597 | feats=69250


[LR_recent 45] te_mean=0.1586 | feats=50577


LR_recent done in 52.5s


[XGB MiniLM_recent 35] rounds=306 | te_mean=0.3463


[XGB MiniLM_recent 45] rounds=172 | te_mean=0.3510


  has_money = txt.str.contains(r'\$|dollar(s)?|cash|money', case=False, regex=True).astype(np.int8).values
  has_thanks = txt.str.contains(r'(?i)\bthank(s| you)?\b', regex=True).astype(np.int8).values
  has_money = txt.str.contains(r'\$|dollar(s)?|cash|money', case=False, regex=True).astype(np.int8).values
  has_thanks = txt.str.contains(r'(?i)\bthank(s| you)?\b', regex=True).astype(np.int8).values


Default metric period is 5 because AUC is/are not implemented for GPU


[Cat_v2_recent 35] best_it=2307 | te_mean=0.3223


Default metric period is 5 because AUC is/are not implemented for GPU


[Cat_v2_recent 45] best_it=1440 | te_mean=0.2561


Super-recent refits done.


In [30]:
# S58: Reblend with super-recent variants (LR_nosub+meta, MiniLM XGB+meta, CatBoost v2) as small-cap components
import numpy as np, pandas as pd, time
from pathlib import Path
from sklearn.metrics import roc_auc_score

id_col = 'request_id'; target_col = 'requester_received_pizza'
train = pd.read_json('train.json')
test = pd.read_json('test.json')
y = train[target_col].astype(int).values
ids = test[id_col].values

def to_logit(p, eps=1e-6):
    p = np.clip(p.astype(np.float64), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Time blocks and masks
order = np.argsort(train['unix_timestamp_of_request'].values)
k = 6
blocks = np.array_split(order, k)
n = len(train)
mask_full = np.zeros(n, dtype=bool)
for i in range(1, k):
    mask_full[np.array(blocks[i])] = True
mask_last2 = np.zeros(n, dtype=bool)
for i in [4,5]:
    mask_last2[np.array(blocks[i])] = True
print(f'Time-CV validated full: {mask_full.sum()}/{n} | last2: {mask_last2.sum()}', flush=True)

# Load core OOF/test (full-history)
o_lr_w = np.load('oof_lr_time_withsub_meta.npy');    t_lr_w = np.load('test_lr_time_withsub_meta.npy')
o_lr_ns = np.load('oof_lr_time_nosub_meta.npy');     t_lr_ns = np.load('test_lr_time_nosub_meta.npy')
o_d1 = np.load('oof_xgb_dense_time.npy');            t_d1 = np.load('test_xgb_dense_time.npy')
o_d2 = np.load('oof_xgb_dense_time_v2.npy');         t_d2 = np.load('test_xgb_dense_time_v2.npy')
o_meta = np.load('oof_xgb_meta_time.npy');           t_meta = np.load('test_xgb_meta_time.npy') if not Path('test_xgb_meta_fullbag.npy').exists() else np.load('test_xgb_meta_fullbag.npy')
o_emn = np.load('oof_xgb_emb_meta_time.npy');        t_emn = np.load('test_xgb_emb_meta_time.npy') if not Path('test_xgb_emb_minilm_fullbag.npy').exists() else np.load('test_xgb_emb_minilm_fullbag.npy')
o_emp = np.load('oof_xgb_emb_mpnet_time.npy');       t_emp = np.load('test_xgb_emb_mpnet_time.npy') if not Path('test_xgb_emb_mpnet_fullbag.npy').exists() else np.load('test_xgb_emb_mpnet_fullbag.npy')

# Optional components available
has_svd_dual = Path('oof_xgb_svd_word192_char128_meta.npy').exists() and Path('test_xgb_svd_word192_char128_meta.npy').exists()
if has_svd_dual:
    o_svd_dual = np.load('oof_xgb_svd_word192_char128_meta.npy'); t_svd_dual = np.load('test_xgb_svd_word192_char128_meta.npy')
has_char = Path('oof_lr_charwb_time.npy').exists() and Path('test_lr_charwb_time.npy').exists()
if has_char:
    o_char = np.load('oof_lr_charwb_time.npy'); t_char = np.load('test_lr_charwb_time.npy')

# CatBoost v1/v2 choose better OOF
has_cat_v1 = Path('oof_catboost_textmeta.npy').exists() and Path('test_catboost_textmeta.npy').exists()
has_cat_v2 = Path('oof_catboost_textmeta_v2.npy').exists() and Path('test_catboost_textmeta_v2.npy').exists()
if has_cat_v1 and has_cat_v2:
    o1 = np.load('oof_catboost_textmeta.npy'); o2 = np.load('oof_catboost_textmeta_v2.npy')
    auc1 = roc_auc_score(y[mask_full], o1[mask_full]); auc2 = roc_auc_score(y[mask_full], o2[mask_full])
    if auc2 >= auc1:
        o_cat = o2; t_cat = np.load('test_catboost_textmeta_v2.npy'); cat_tag = 'v2'
    else:
        o_cat = o1; t_cat = np.load('test_catboost_textmeta.npy'); cat_tag = 'v1'
elif has_cat_v2:
    o_cat = np.load('oof_catboost_textmeta_v2.npy'); t_cat = np.load('test_catboost_textmeta_v2.npy'); cat_tag = 'v2'
elif has_cat_v1:
    o_cat = np.load('oof_catboost_textmeta.npy'); t_cat = np.load('test_catboost_textmeta.npy'); cat_tag = 'v1'
else:
    raise FileNotFoundError('No CatBoost OOF/test found')
print(f'CatBoost selected: {cat_tag}')

# Super-recent test preds (average 3-5 and 4-5); OOF placeholders = corresponding full OOF (to keep objective comparable)
def load_avg_recent(base):
    p35 = np.load(f'test_{base}_recent35.npy') if Path(f'test_{base}_recent35.npy').exists() else None
    p45 = np.load(f'test_{base}_recent45.npy') if Path(f'test_{base}_recent45.npy').exists() else None
    if (p35 is None) and (p45 is None):
        return None
    if (p35 is None):
        return p45.astype(np.float32)
    if (p45 is None):
        return p35.astype(np.float32)
    return ((p35 + p45) / 2.0).astype(np.float32)

t_lr_recent = load_avg_recent('lr_nosub_meta')  # LR recent
t_minilm_recent = load_avg_recent('xgb_minilm_meta')  # MiniLM recent
t_cat_recent = load_avg_recent('catboost_textmeta_v2')  # CatBoost v2 recent

# Build logits for OOF
z_lr_w, z_lr_ns = to_logit(o_lr_w), to_logit(o_lr_ns)
z_d1, z_d2, z_meta = to_logit(o_d1), to_logit(o_d2), to_logit(o_meta)
z_emn, z_emp = to_logit(o_emn), to_logit(o_emp)
z_svd = to_logit(o_svd_dual) if has_svd_dual else None
z_char = to_logit(o_char) if has_char else None
z_cat = to_logit(o_cat)

# Test logits
tz_lr_w, tz_lr_ns = to_logit(t_lr_w), to_logit(t_lr_ns)
tz_d1, tz_d2, tz_meta = to_logit(t_d1), to_logit(t_d2), to_logit(t_meta)
tz_emn, tz_emp = to_logit(t_emn), to_logit(t_emp)
tz_svd = to_logit(t_svd_dual) if has_svd_dual else None
tz_char = to_logit(t_char) if has_char else None
tz_cat = to_logit(t_cat)
tz_lr_recent = to_logit(t_lr_recent) if t_lr_recent is not None else None
tz_minilm_recent = to_logit(t_minilm_recent) if t_minilm_recent is not None else None
tz_cat_recent = to_logit(t_cat_recent) if t_cat_recent is not None else None

# Grids and caps
g_grid = [0.990, 0.995, 0.997]
meta_grid = [0.18, 0.20, 0.22]
dense_tot_grid = [0.0, 0.04, 0.08]
dense_split = [(0.6, 0.4), (0.7, 0.3)]
emb_tot_grid = [0.28, 0.32, 0.36, 0.38]
emb_split = [(0.6, 0.4), (0.5, 0.5)]
svd_grid = [0.0, 0.04, 0.08] if has_svd_dual else [0.0]
char_grid = [0.0, 0.04, 0.06, 0.08] if has_char else [0.0]
cat_grid = [0.10, 0.14, 0.20, 0.26, 0.30]
w_lr_min_grid = [0.28]
# Recent caps
lr_recent_cap = [0.0, 0.04, 0.06, 0.08] if tz_lr_recent is not None else [0.0]
minilm_recent_cap = [0.0, 0.03, 0.06] if tz_minilm_recent is not None else [0.0]
cat_recent_cap = [0.0, 0.05, 0.10] if tz_cat_recent is not None else [0.0]

def search(mask, sample_weight=None):
    best_auc, best_cfg, tried = -1.0, None, 0
    t0 = time.time()
    for g in g_grid:
        z_lr_mix = (1.0 - g)*z_lr_w + g*z_lr_ns
        for w_lr_min in w_lr_min_grid:
            for w_meta in meta_grid:
                for d_tot in dense_tot_grid:
                    for dv1, dv2 in dense_split:
                        w_d1 = d_tot * dv1; w_d2 = d_tot * dv2
                        for e_tot in emb_tot_grid:
                            for em_fr, mp_fr in emb_split:
                                w_emn = e_tot * em_fr; w_emp = e_tot * mp_fr
                                for w_svd in svd_grid:
                                    for w_char in char_grid:
                                        for w_cat in cat_grid:
                                            for w_lr_rec in lr_recent_cap:
                                                for w_minilm_rec in minilm_recent_cap:
                                                    for w_cat_rec in cat_recent_cap:
                                                        rem = 1.0 - (w_meta + w_d1 + w_d2 + w_emn + w_emp + w_svd + w_char + w_cat + w_lr_rec + w_minilm_rec + w_cat_rec)
                                                        if rem <= 0: continue
                                                        w_lr = rem
                                                        if w_lr < w_lr_min: continue
                                                        # OOF objective ignores 'recent' unique info (no OOF); we don't add them to z_oof beyond caps (placeholder 0 impact)
                                                        z_oof = (w_lr*z_lr_mix +
                                                                 w_d1*z_d1 + w_d2*z_d2 +
                                                                 w_meta*z_meta +
                                                                 w_emn*z_emn + w_emp*z_emp +
                                                                 (w_svd*z_svd if (has_svd_dual and w_svd>0) else 0) +
                                                                 (w_char*z_char if (has_char and w_char>0) else 0) +
                                                                 w_cat*z_cat)
                                                        auc = roc_auc_score(y[mask], z_oof[mask], sample_weight=(sample_weight[mask] if sample_weight is not None else None))
                                                        tried += 1
                                                        if tried % 5000 == 0:
                                                            print(f'  tried={tried} | best={best_auc:.5f} | elapsed={time.time()-t0:.1f}s', flush=True)
                                                        if auc > best_auc:
                                                            best_auc = auc
                                                            best_cfg = dict(g=float(g), w_lr=float(w_lr), w_d1=float(w_d1), w_d2=float(w_d2),
                                                                            w_meta=float(w_meta), w_emn=float(w_emn), w_emp=float(w_emp),
                                                                            w_svd=float(w_svd), w_char=float(w_char), w_cat=float(w_cat),
                                                                            w_lr_recent=float(w_lr_rec), w_minilm_recent=float(w_minilm_rec), w_cat_recent=float(w_cat_rec))
    print(f'  search done | tried={tried} | best={best_auc:.5f} | {time.time()-t0:.1f}s', flush=True)
    return best_auc, best_cfg, tried

# Objectives: last-2 and gamma in {0.990,0.995,0.997}
auc_last2, cfg_last2, tried_last2 = search(mask_last2)
print(f'[Last2] tried={tried_last2} | best AUC={auc_last2:.5f} | cfg={cfg_last2}', flush=True)

best_gamma, best_auc_g, best_cfg_g = None, -1.0, None
for gamma in [0.990, 0.995, 0.997]:
    w = np.zeros(n, dtype=np.float64)
    for bi in range(1, k):
        age = (k - 1) - bi
        w[np.array(blocks[bi])] = (gamma ** age)
    auc_g, cfg_g, _ = search(mask_full, sample_weight=w)
    print(f'[Gamma {gamma}] best AUC={auc_g:.5f}', flush=True)
    if auc_g > best_auc_g:
        best_auc_g, best_cfg_g, best_gamma = auc_g, cfg_g, gamma
print(f'[Gamma-best] gamma={best_gamma} | AUC={best_auc_g:.5f} | cfg={best_cfg_g}', flush=True)

def build_and_save(tag, cfg):
    g = cfg['g']
    tz_lr_mix = (1.0 - g)*tz_lr_w + g*tz_lr_ns
    parts = [
        cfg['w_lr']*tz_lr_mix,
        cfg['w_d1']*to_logit(t_d1),
        cfg['w_d2']*to_logit(t_d2),
        cfg['w_meta']*to_logit(t_meta),
        cfg['w_emn']*to_logit(t_emn),
        cfg['w_emp']*to_logit(t_emp),
        cfg['w_cat']*tz_cat
    ]
    w_list = [cfg['w_lr'], cfg['w_d1'], cfg['w_d2'], cfg['w_meta'], cfg['w_emn'], cfg['w_emp'], cfg['w_cat']]
    comp_logits = [tz_lr_mix, to_logit(t_d1), to_logit(t_d2), to_logit(t_meta), to_logit(t_emn), to_logit(t_emp), tz_cat]
    if has_svd_dual and cfg['w_svd'] > 0: parts.append(cfg['w_svd']*tz_svd); w_list.append(cfg['w_svd']); comp_logits.append(tz_svd)
    if has_char and cfg['w_char'] > 0: parts.append(cfg['w_char']*tz_char); w_list.append(cfg['w_char']); comp_logits.append(tz_char)
    # Add recent components to TEST ONLY if available
    if (tz_lr_recent is not None) and (cfg['w_lr_recent'] > 0): parts.append(cfg['w_lr_recent']*tz_lr_recent); w_list.append(cfg['w_lr_recent']); comp_logits.append(tz_lr_recent)
    if (tz_minilm_recent is not None) and (cfg['w_minilm_recent'] > 0): parts.append(cfg['w_minilm_recent']*tz_minilm_recent); w_list.append(cfg['w_minilm_recent']); comp_logits.append(tz_minilm_recent)
    if (tz_cat_recent is not None) and (cfg['w_cat_recent'] > 0): parts.append(cfg['w_cat_recent']*tz_cat_recent); w_list.append(cfg['w_cat_recent']); comp_logits.append(tz_cat_recent)
    zt = np.sum(parts, axis=0)
    pt = sigmoid(zt).astype(np.float32)
    out_path = f'submission_reblend_with_recent_{tag}.csv'
    pd.DataFrame({id_col: ids, target_col: pt}).to_csv(out_path, index=False)
    # 15% shrink hedge
    w_vec = np.array(w_list, dtype=np.float64)
    w_eq = np.ones_like(w_vec)/len(w_vec)
    alpha = 0.15
    w_shr = ((1.0 - alpha)*w_vec + alpha*w_eq); w_shr = (w_shr / w_shr.sum()).astype(np.float64)
    zt_shr = 0.0
    for wi, zi in zip(w_shr, comp_logits):
        zt_shr += wi*zi
    pt_shr = sigmoid(zt_shr).astype(np.float32)
    pd.DataFrame({id_col: ids, target_col: pt_shr}).to_csv(out_path.replace('.csv','_shrunk.csv'), index=False)
    return out_path

p_last2 = build_and_save('last2', cfg_last2)
p_gam = build_and_save(f'gamma{best_gamma:.3f}'.replace('.', 'p'), best_cfg_g)

primary = p_gam if (best_auc_g >= auc_last2) else p_last2
pd.read_csv(primary).to_csv('submission.csv', index=False)
print(f'Promoted {Path(primary).name} to submission.csv', flush=True)

Time-CV validated full: 2398/2878 | last2: 958


CatBoost selected: v1


  tried=5000 | best=0.64911 | elapsed=8.9s


  search done | tried=8445 | best=0.64911 | 15.0s


[Last2] tried=8445 | best AUC=0.64911 | cfg={'g': 0.995, 'w_lr': 0.28, 'w_d1': 0.055999999999999994, 'w_d2': 0.024, 'w_meta': 0.18, 'w_emn': 0.168, 'w_emp': 0.11200000000000002, 'w_svd': 0.0, 'w_char': 0.08, 'w_cat': 0.1, 'w_lr_recent': 0.0, 'w_minilm_recent': 0.0, 'w_cat_recent': 0.0}


  tried=5000 | best=0.68174 | elapsed=11.6s


  search done | tried=8445 | best=0.68174 | 19.6s


[Gamma 0.99] best AUC=0.68174


  tried=5000 | best=0.68202 | elapsed=11.6s


  search done | tried=8445 | best=0.68202 | 19.6s


[Gamma 0.995] best AUC=0.68202


  tried=5000 | best=0.68213 | elapsed=11.6s


  search done | tried=8445 | best=0.68213 | 19.7s


[Gamma 0.997] best AUC=0.68213


[Gamma-best] gamma=0.997 | AUC=0.68213 | cfg={'g': 0.997, 'w_lr': 0.28, 'w_d1': 0.0, 'w_d2': 0.0, 'w_meta': 0.18, 'w_emn': 0.18, 'w_emp': 0.18, 'w_svd': 0.0, 'w_char': 0.08, 'w_cat': 0.1, 'w_lr_recent': 0.0, 'w_minilm_recent': 0.0, 'w_cat_recent': 0.0}


Promoted submission_reblend_with_recent_gamma0p997.csv to submission.csv


In [31]:
# S59: Last-block tuned nonnegative blend incl. super-recent components + isotonic calibration
import numpy as np, pandas as pd, time, gc
from pathlib import Path
from sklearn.metrics import roc_auc_score
from sklearn.isotonic import IsotonicRegression

id_col = 'request_id'; target_col = 'requester_received_pizza'
train = pd.read_json('train.json')
test = pd.read_json('test.json')
y = train[target_col].astype(int).values
ids = test[id_col].values

def to_logit(p, eps=1e-6):
    p = np.clip(p.astype(np.float64), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Time blocks and last-block indices
order = np.argsort(train['unix_timestamp_of_request'].values)
k = 6
blocks = np.array_split(order, k)
n = len(train)
b5_idx = np.array(blocks[5])

# Load full-history base OOF/test
o_lr_w = np.load('oof_lr_time_withsub_meta.npy');    t_lr_w = np.load('test_lr_time_withsub_meta.npy')
o_lr_ns = np.load('oof_lr_time_nosub_meta.npy');     t_lr_ns = np.load('test_lr_time_nosub_meta.npy')
o_d1 = np.load('oof_xgb_dense_time.npy');            t_d1 = np.load('test_xgb_dense_time.npy')  # Dense v1
o_meta = np.load('oof_xgb_meta_time.npy');           t_meta = np.load('test_xgb_meta_fullbag.npy') if Path('test_xgb_meta_fullbag.npy').exists() else np.load('test_xgb_meta_time.npy')
o_emn = np.load('oof_xgb_emb_meta_time.npy');        t_emn = np.load('test_xgb_emb_minilm_fullbag.npy') if Path('test_xgb_emb_minilm_fullbag.npy').exists() else np.load('test_xgb_emb_meta_time.npy')
o_emp = np.load('oof_xgb_emb_mpnet_time.npy');       t_emp = np.load('test_xgb_emb_mpnet_fullbag.npy') if Path('test_xgb_emb_mpnet_fullbag.npy').exists() else np.load('test_xgb_emb_mpnet_time.npy')
has_char = Path('oof_lr_charwb_time.npy').exists() and Path('test_lr_charwb_time.npy').exists()
if has_char:
    o_char = np.load('oof_lr_charwb_time.npy'); t_char = np.load('test_lr_charwb_time.npy')
else:
    o_char = None; t_char = None
has_svd = Path('oof_xgb_svd_word192_char128_meta.npy').exists() and Path('test_xgb_svd_word192_char128_meta.npy').exists()
if has_svd:
    o_svd = np.load('oof_xgb_svd_word192_char128_meta.npy'); t_svd = np.load('test_xgb_svd_word192_char128_meta.npy')
else:
    o_svd = None; t_svd = None

# CatBoost: prefer v1 (historically better blend) but choose best available
has_cat_v1 = Path('oof_catboost_textmeta.npy').exists() and Path('test_catboost_textmeta.npy').exists()
has_cat_v2 = Path('oof_catboost_textmeta_v2.npy').exists() and Path('test_catboost_textmeta_v2.npy').exists()
if has_cat_v1 and has_cat_v2:
    o1 = np.load('oof_catboost_textmeta.npy'); o2 = np.load('oof_catboost_textmeta_v2.npy')
    auc1 = roc_auc_score(y[b5_idx], o1[b5_idx]); auc2 = roc_auc_score(y[b5_idx], o2[b5_idx])
    if auc1 >= auc2:
        o_cat = o1; t_cat = np.load('test_catboost_textmeta.npy')
    else:
        o_cat = o2; t_cat = np.load('test_catboost_textmeta_v2.npy')
elif has_cat_v1:
    o_cat = np.load('oof_catboost_textmeta.npy'); t_cat = np.load('test_catboost_textmeta.npy')
elif has_cat_v2:
    o_cat = np.load('oof_catboost_textmeta_v2.npy'); t_cat = np.load('test_catboost_textmeta_v2.npy')
else:
    raise FileNotFoundError('No CatBoost OOF/test found')

# Super-recent TEST preds (from S57). For last-block tuning, proxy their block-5 logits with corresponding full-history base logits.
def load_avg_recent(base):
    p35 = np.load(f'test_{base}_recent35.npy') if Path(f'test_{base}_recent35.npy').exists() else None
    p45 = np.load(f'test_{base}_recent45.npy') if Path(f'test_{base}_recent45.npy').exists() else None
    if (p35 is None) and (p45 is None):
        return None
    if p35 is None: return p45.astype(np.float32)
    if p45 is None: return p35.astype(np.float32)
    return ((p35 + p45) / 2.0).astype(np.float32)
t_lr_recent = load_avg_recent('lr_nosub_meta')
t_minilm_recent = load_avg_recent('xgb_minilm_meta')
t_cat_recent = load_avg_recent('catboost_textmeta_v2')

# Build logits
z_lr_w, z_lr_ns = to_logit(o_lr_w), to_logit(o_lr_ns)
z_d1, z_meta = to_logit(o_d1), to_logit(o_meta)
z_emn, z_emp = to_logit(o_emn), to_logit(o_emp)
z_char = to_logit(o_char) if o_char is not None else None
z_svd = to_logit(o_svd) if o_svd is not None else None
z_cat = to_logit(o_cat)
tz_lr_w, tz_lr_ns = to_logit(t_lr_w), to_logit(t_lr_ns)
tz_d1, tz_meta = to_logit(t_d1), to_logit(t_meta)
tz_emn, tz_emp = to_logit(t_emn), to_logit(t_emp)
tz_char = to_logit(t_char) if t_char is not None else None
tz_svd = to_logit(t_svd) if t_svd is not None else None
tz_cat = to_logit(t_cat)
tz_lr_recent = to_logit(t_lr_recent) if t_lr_recent is not None else None
tz_minilm_recent = to_logit(t_minilm_recent) if t_minilm_recent is not None else None
tz_cat_recent = to_logit(t_cat_recent) if t_cat_recent is not None else None

# Fix gamma for LR_mix per expert: 0.999
g = 0.999
z_lr_mix = (1.0 - g)*z_lr_w + g*z_lr_ns
tz_lr_mix = (1.0 - g)*tz_lr_w + g*tz_lr_ns

# Component dicts for convenience
oof_cols = {'lr': z_lr_mix, 'dense': z_d1, 'meta': z_meta, 'emn': z_emn, 'emp': z_emp, 'cat': z_cat}
test_cols = {'lr': tz_lr_mix, 'dense': tz_d1, 'meta': tz_meta, 'emn': tz_emn, 'emp': tz_emp, 'cat': tz_cat}
if z_char is not None:
    oof_cols['char'] = z_char; test_cols['char'] = tz_char
if z_svd is not None:
    oof_cols['svd'] = z_svd; test_cols['svd'] = tz_svd
# Recent-only test columns
if tz_lr_recent is not None: test_cols['lr_recent'] = tz_lr_recent
if tz_minilm_recent is not None: test_cols['minilm_recent'] = tz_minilm_recent
if tz_cat_recent is not None: test_cols['cat_recent'] = tz_cat_recent

# Bounds and constraints (nonnegative, sum=1)
bounds = {
  'lr': (0.25, 0.35),
  'cat': (0.15, 0.25),
  'emn': (0.10, 0.30),
  'emp': (0.10, 0.30),
  'char': (0.04, 0.08) if 'char' in oof_cols else (0.0, 0.0),
  'dense': (0.0, 0.10),
  'meta': (0.16, 0.22),
  'svd': (0.0, 0.08) if 'svd' in oof_cols else (0.0, 0.0),
  'lr_recent': (0.06, 0.12) if 'lr_recent' in test_cols else (0.0, 0.0),
  'minilm_recent': (0.06, 0.12) if 'minilm_recent' in test_cols else (0.0, 0.0),
  'cat_recent': (0.06, 0.12) if 'cat_recent' in test_cols else (0.0, 0.0)
}

keys = [k for k in ['lr','cat','emn','emp','char','dense','meta','svd','lr_recent','minilm_recent','cat_recent'] if bounds[k][1] > 0 or k in ['lr','cat','emn','emp','dense','meta']]

def sample_weights(rng: np.random.Generator):
    w = {}
    # sample core per bounds
    for k in keys:
        low, high = bounds[k]
        val = rng.uniform(low, high) if high > low else low
        w[k] = float(val)
    # enforce embedding total within [0.30, 0.36]
    emb_tot = w.get('emn',0.0) + w.get('emp',0.0)
    if not (0.30 <= emb_tot <= 0.36):
        scale = rng.uniform(0.30, 0.36) / max(emb_tot, 1e-6) if emb_tot > 0 else 0.33
        w['emn'] *= scale; w['emp'] *= scale
    # recent total >= 0.15 if any present
    r_keys = [k for k in ['lr_recent','minilm_recent','cat_recent'] if bounds[k][1] > 0]
    if r_keys:
        r_tot = sum(w[k] for k in r_keys)
        if r_tot < 0.15:
            # bump proportionally up to 0.15
            if r_tot > 0:
                mul = 0.15 / r_tot
                for k in r_keys: w[k] *= mul
            else:
                # distribute evenly
                for k in r_keys: w[k] = 0.15 / len(r_keys)
        # cap each at <= 0.15
        for k in r_keys: w[k] = min(w[k], 0.15)
    # normalize sum to 1 while maintaining nonnegativity
    s = sum(w.values())
    if s <= 0:
        for k in keys: w[k] = 0.0
        w['lr'] = 1.0
        return w
    for k in keys: w[k] /= s
    # re-check lr and cat floors; if violated due to normalization, rescale minimally
    def enforce_floor(name, floor):
        if w.get(name,0.0) < floor:
            deficit = floor - w.get(name,0.0)
            # take from the largest buckets excluding this key
            donors = sorted([(kk,vv) for kk,vv in w.items() if kk!=name and vv>0], key=lambda x: -x[1])
            for kk,vv in donors:
                take = min(deficit, max(0.0, vv - bounds[kk][0]))
                if take>0:
                    w[kk] -= take; w[name] += take; deficit -= take
                if deficit <= 1e-9: break
    enforce_floor('lr', bounds['lr'][0])
    enforce_floor('cat', bounds['cat'][0])
    return w

def score_on_block5(w):
    # Build blended logits on block 5 using OOF for full-history components; recent columns proxy with corresponding bases on block 5 (already included via full components), so ignore in objective.
    z = (w.get('lr',0)*oof_cols['lr'] +
         w.get('dense',0)*oof_cols['dense'] +
         w.get('meta',0)*oof_cols['meta'] +
         w.get('emn',0)*oof_cols['emn'] +
         w.get('emp',0)*oof_cols['emp'] +
         w.get('cat',0)*oof_cols['cat'])
    if 'char' in oof_cols: z = z + w.get('char',0)*oof_cols['char']
    if 'svd' in oof_cols: z = z + w.get('svd',0)*oof_cols['svd']
    return roc_auc_score(y[b5_idx], z[b5_idx])

rng = np.random.default_rng(1337)
best_auc, best_w = -1.0, None
n_iter = 12000
t0 = time.time()
for it in range(1, n_iter+1):
    w = sample_weights(rng)
    auc = score_on_block5(w)
    if auc > best_auc:
        best_auc, best_w = auc, w.copy()
    if it % 1000 == 0:
        print(f'  iter={it} | best_auc_b5={best_auc:.5f} | elapsed={time.time()-t0:.1f}s', flush=True)
print('Best block-5 AUC:', f'{best_auc:.5f}', '| weights:', best_w)

# Build test blend (include recent-only TEST logits with tuned weights) and last-block blend probs for calibration
def build_probs(w):
    zt = (w.get('lr',0)*test_cols['lr'] +
          w.get('dense',0)*test_cols['dense'] +
          w.get('meta',0)*test_cols['meta'] +
          w.get('emn',0)*test_cols['emn'] +
          w.get('emp',0)*test_cols['emp'] +
          w.get('cat',0)*test_cols['cat'])
    if 'char' in test_cols: zt = zt + w.get('char',0)*test_cols['char']
    if 'svd' in test_cols: zt = zt + w.get('svd',0)*test_cols['svd']
    # add recent-only components on TEST if available
    if 'lr_recent' in test_cols: zt = zt + w.get('lr_recent',0)*test_cols['lr_recent']
    if 'minilm_recent' in test_cols: zt = zt + w.get('minilm_recent',0)*test_cols['minilm_recent']
    if 'cat_recent' in test_cols: zt = zt + w.get('cat_recent',0)*test_cols['cat_recent']
    pt = sigmoid(zt).astype(np.float32)
    # also compute last-block probs for calibration (using OOF columns only)
    zb5 = (w.get('lr',0)*oof_cols['lr'] +
           w.get('dense',0)*oof_cols['dense'] +
           w.get('meta',0)*oof_cols['meta'] +
           w.get('emn',0)*oof_cols['emn'] +
           w.get('emp',0)*oof_cols['emp'] +
           w.get('cat',0)*oof_cols['cat'])
    if 'char' in oof_cols: zb5 = zb5 + w.get('char',0)*oof_cols['char']
    if 'svd' in oof_cols: zb5 = zb5 + w.get('svd',0)*oof_cols['svd']
    pb5 = sigmoid(zb5[b5_idx]).astype(np.float32)
    yb5 = y[b5_idx]
    return pt, pb5, yb5

pt_uncal, pb5, yb5 = build_probs(best_w)
sub_uncal = pd.DataFrame({id_col: ids, target_col: pt_uncal})
path_uncal = 'submission_lastblock_opt_uncalibrated.csv'
sub_uncal.to_csv(path_uncal, index=False)
print(f'Wrote {path_uncal} | mean={pt_uncal.mean():.6f}')

# Isotonic calibration on block 5
iso = IsotonicRegression(out_of_bounds='clip')
iso.fit(pb5, yb5)
pt_cal = iso.transform(pt_uncal).astype(np.float32)
sub_cal = pd.DataFrame({id_col: ids, target_col: pt_cal})
path_cal = 'submission_lastblock_opt_calibrated.csv'
sub_cal.to_csv(path_cal, index=False)
print(f'Wrote {path_cal} | mean={pt_cal.mean():.6f}')

# Promote calibrated primary
sub_cal.to_csv('submission.csv', index=False)
print('Promoted submission_lastblock_opt_calibrated.csv to submission.csv')
gc.collect();

  iter=1000 | best_auc_b5=0.65075 | elapsed=1.6s


  iter=2000 | best_auc_b5=0.65075 | elapsed=3.3s


  iter=3000 | best_auc_b5=0.65075 | elapsed=4.9s


  iter=4000 | best_auc_b5=0.65075 | elapsed=6.6s


  iter=5000 | best_auc_b5=0.65075 | elapsed=8.3s


  iter=6000 | best_auc_b5=0.65075 | elapsed=10.0s


  iter=7000 | best_auc_b5=0.65075 | elapsed=11.6s


  iter=8000 | best_auc_b5=0.65075 | elapsed=13.3s


  iter=9000 | best_auc_b5=0.65075 | elapsed=15.0s


  iter=10000 | best_auc_b5=0.65075 | elapsed=16.7s


  iter=11000 | best_auc_b5=0.65075 | elapsed=18.4s


  iter=12000 | best_auc_b5=0.65075 | elapsed=20.0s


Best block-5 AUC: 0.65075 | weights: {'lr': 0.25, 'cat': 0.15770904415767784, 'emn': 0.16250498376207562, 'emp': 0.08801630044376484, 'char': 0.05024025578059258, 'dense': 0.003904651381461899, 'meta': 0.1152388717235414, 'svd': 0.00651242641181899, 'lr_recent': 0.05349244679248755, 'minilm_recent': 0.06360229005438262, 'cat_recent': 0.048778729492196686}
Wrote submission_lastblock_opt_uncalibrated.csv | mean=0.353231
Wrote submission_lastblock_opt_calibrated.csv | mean=0.129195
Promoted submission_lastblock_opt_calibrated.csv to submission.csv


In [32]:
# S60: Per-model recent interpolation (alphas) with gamma=0.999; build A/B variants + shrink hedges + rank-avg
import numpy as np, pandas as pd, time
from pathlib import Path
from sklearn.metrics import roc_auc_score

id_col = 'request_id'; target_col = 'requester_received_pizza'
train = pd.read_json('train.json')
test = pd.read_json('test.json')
y = train[target_col].astype(int).values
ids = test[id_col].values

def to_logit(p, eps=1e-6):
    p = np.clip(p.astype(np.float64), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Time masks for OOF reporting
order = np.argsort(train['unix_timestamp_of_request'].values)
k = 6
blocks = np.array_split(order, k)
n = len(train)
mask_valid = np.zeros(n, dtype=bool)
for i in range(1, k):
    mask_valid[np.array(blocks[i])] = True
# gamma weights for reporting (0.999 per expert)
gamma = 0.999
w_oof = np.zeros(n, dtype=np.float64)
for bi in range(1, k):
    age = (k - 1) - bi
    w_oof[np.array(blocks[bi])] = (gamma ** age)

# Load full-history OOF/test
o_lr_w = np.load('oof_lr_time_withsub_meta.npy');    t_lr_w = np.load('test_lr_time_withsub_meta.npy')
o_lr_ns = np.load('oof_lr_time_nosub_meta.npy');     t_lr_ns = np.load('test_lr_time_nosub_meta.npy')
o_d1 = np.load('oof_xgb_dense_time.npy');            t_d1 = np.load('test_xgb_dense_time.npy')  # Dense v1 only
o_meta = np.load('oof_xgb_meta_time.npy');           t_meta = np.load('test_xgb_meta_fullbag.npy') if Path('test_xgb_meta_fullbag.npy').exists() else np.load('test_xgb_meta_time.npy')
o_emn = np.load('oof_xgb_emb_meta_time.npy');        t_emn = np.load('test_xgb_emb_minilm_fullbag.npy') if Path('test_xgb_emb_minilm_fullbag.npy').exists() else np.load('test_xgb_emb_meta_time.npy')
o_emp = np.load('oof_xgb_emb_mpnet_time.npy');       t_emp = np.load('test_xgb_emb_mpnet_fullbag.npy') if Path('test_xgb_emb_mpnet_fullbag.npy').exists() else np.load('test_xgb_emb_mpnet_time.npy')
has_char = Path('oof_lr_charwb_time.npy').exists() and Path('test_lr_charwb_time.npy').exists()
if has_char:
    o_char = np.load('oof_lr_charwb_time.npy'); t_char = np.load('test_lr_charwb_time.npy')
else:
    o_char = None; t_char = None

# CatBoost: prefer v1 per blend performance
has_cat_v1 = Path('oof_catboost_textmeta.npy').exists() and Path('test_catboost_textmeta.npy').exists()
has_cat_v2 = Path('oof_catboost_textmeta_v2.npy').exists() and Path('test_catboost_textmeta_v2.npy').exists()
if has_cat_v1 and has_cat_v2:
    o1 = np.load('oof_catboost_textmeta.npy'); o2 = np.load('oof_catboost_textmeta_v2.npy')
    auc1 = roc_auc_score(y[mask_valid], o1[mask_valid]); auc2 = roc_auc_score(y[mask_valid], o2[mask_valid])
    if auc1 >= auc2:
        o_cat = o1; t_cat = np.load('test_catboost_textmeta.npy'); cat_ver = 'v1'
    else:
        o_cat = o2; t_cat = np.load('test_catboost_textmeta_v2.npy'); cat_ver = 'v2'
elif has_cat_v1:
    o_cat = np.load('oof_catboost_textmeta.npy'); t_cat = np.load('test_catboost_textmeta.npy'); cat_ver = 'v1'
elif has_cat_v2:
    o_cat = np.load('oof_catboost_textmeta_v2.npy'); t_cat = np.load('test_catboost_textmeta_v2.npy'); cat_ver = 'v2'
else:
    raise FileNotFoundError('No CatBoost OOF/test found')
print('CatBoost base:', cat_ver)

# Recent TEST-only preds
def load_avg_recent(base):
    p35 = np.load(f'test_{base}_recent35.npy') if Path(f'test_{base}_recent35.npy').exists() else None
    p45 = np.load(f'test_{base}_recent45.npy') if Path(f'test_{base}_recent45.npy').exists() else None
    if (p35 is None) and (p45 is None):
        return None
    if p35 is None: return p45.astype(np.float32)
    if p45 is None: return p35.astype(np.float32)
    return ((p35 + p45) / 2.0).astype(np.float32)
t_lr_recent = load_avg_recent('lr_nosub_meta')
t_minilm_recent = load_avg_recent('xgb_minilm_meta')
t_cat_recent = load_avg_recent('catboost_textmeta_v2')  # recent available for v2

# Convert to logits
z_lr_w, z_lr_ns = to_logit(o_lr_w), to_logit(o_lr_ns)
z_d1, z_meta = to_logit(o_d1), to_logit(o_meta)
z_emn, z_emp = to_logit(o_emn), to_logit(o_emp)
z_char = to_logit(o_char) if o_char is not None else None
z_cat = to_logit(o_cat)
tz_lr_w, tz_lr_ns = to_logit(t_lr_w), to_logit(t_lr_ns)
tz_d1, tz_meta = to_logit(t_d1), to_logit(t_meta)
tz_emn, tz_emp = to_logit(t_emn), to_logit(t_emp)
tz_char = to_logit(t_char) if t_char is not None else None
tz_cat = to_logit(t_cat)
tz_lr_recent = to_logit(t_lr_recent) if t_lr_recent is not None else None
tz_minilm_recent = to_logit(t_minilm_recent) if t_minilm_recent is not None else None
tz_cat_recent = to_logit(t_cat_recent) if t_cat_recent is not None else None

# LR mix with gamma=0.999
g_lr = 0.999
z_lr_mix = (1.0 - g_lr)*z_lr_w + g_lr*z_lr_ns
tz_lr_mix = (1.0 - g_lr)*tz_lr_w + g_lr*tz_lr_ns

def build_submission(tag, weights, alphas):
    # weights dict keys: lr, cat, emn, emp, char, dense, meta
    # alphas dict keys: lr, minilm, cat (interpolation factors applied on TEST only)
    # OOF blend (full-history only; for reporting)
    z_oof = (weights['lr']*z_lr_mix +
             weights['dense']*z_d1 +
             weights['meta']*z_meta +
             weights['emn']*z_emn +
             weights['emp']*z_emp +
             weights['cat']*z_cat)
    if (z_char is not None) and (weights.get('char',0) > 0):
        z_oof = z_oof + weights['char']*z_char
    auc_g = roc_auc_score(y[mask_valid], z_oof[mask_valid], sample_weight=w_oof[mask_valid])
    print(f'[{tag}] gamma-weighted OOF(z) AUC={auc_g:.5f}')
    # TEST blend with per-model interpolation to recent
    tz_lr_interp = tz_lr_mix if tz_lr_recent is None else ((1.0 - alphas.get('lr',0.0))*tz_lr_mix + alphas.get('lr',0.0)*tz_lr_recent)
    tz_minilm_interp = tz_emn if tz_minilm_recent is None else ((1.0 - alphas.get('minilm',0.0))*tz_emn + alphas.get('minilm',0.0)*tz_minilm_recent)
    tz_cat_interp = tz_cat if tz_cat_recent is None else ((1.0 - alphas.get('cat',0.0))*tz_cat + alphas.get('cat',0.0)*tz_cat_recent)
    parts = [
        weights['lr']*tz_lr_interp,
        weights['dense']*tz_d1,
        weights['meta']*tz_meta,
        weights['emn']*tz_minilm_interp,
        weights['emp']*tz_emp,
        weights['cat']*tz_cat_interp
    ]
    if (tz_char is not None) and (weights.get('char',0) > 0):
        parts.append(weights['char']*tz_char)
    zt = np.sum(parts, axis=0)
    pt = sigmoid(zt).astype(np.float32)
    out_path = f'submission_interp_{tag}.csv'
    pd.DataFrame({id_col: ids, target_col: pt}).to_csv(out_path, index=False)
    # 15% shrink-to-equal hedge
    comp_logits = [tz_lr_interp, tz_d1, tz_meta, tz_minilm_interp, tz_emp, tz_cat_interp] + ([tz_char] if (tz_char is not None and weights.get('char',0)>0) else [])
    w_list = [weights['lr'], weights['dense'], weights['meta'], weights['emn'], weights['emp'], weights['cat']] + ([weights['char']] if (tz_char is not None and weights.get('char',0)>0) else [])
    w_vec = np.asarray(w_list, dtype=np.float64)
    w_eq = np.ones_like(w_vec)/len(w_vec)
    alpha = 0.15
    w_shr = ((1.0 - alpha)*w_vec + alpha*w_eq); w_shr = (w_shr / w_shr.sum()).astype(np.float64)
    zt_shr = np.zeros_like(comp_logits[0], dtype=np.float64)
    for wi, zi in zip(w_shr, comp_logits):
        zt_shr += wi*zi
    pt_shr = sigmoid(zt_shr).astype(np.float32)
    out_shr = out_path.replace('.csv','_shrunk.csv')
    pd.DataFrame({id_col: ids, target_col: pt_shr}).to_csv(out_shr, index=False)
    print(f'Wrote {out_path} (+_shrunk) | mean={pt.mean():.6f}')
    return out_path, out_shr, auc_g

# Variant A (expert guidance)
weights_A = dict(lr=0.30, cat=0.20, emn=0.17, emp=0.17, char=0.06 if z_char is not None else 0.0, dense=0.10, meta=0.20)
alphas_A = dict(lr=0.20, minilm=0.20, cat=0.20)
pA, pA_shr, aucA = build_submission('gamma999_interp_A', weights_A, alphas_A)

# Variant B (slightly higher Cat/emb, lower dense)
weights_B = dict(lr=0.28, cat=0.22, emn=0.20, emp=0.16, char=0.06 if z_char is not None else 0.0, dense=0.05, meta=0.18)
alphas_B = dict(lr=0.30, minilm=0.20, cat=0.30)
pB, pB_shr, aucB = build_submission('gamma999_interp_B', weights_B, alphas_B)

# Rank-average A and B primary as hedge
def read_probs(path):
    return pd.read_csv(path)[target_col].values.astype(np.float64)
def rank01(x):
    order = np.argsort(x, kind='mergesort')
    ranks = np.empty_like(order, dtype=np.float64)
    ranks[order] = np.arange(len(x), dtype=np.float64)
    return ranks / max(len(x) - 1, 1)
pa = read_probs(pA); pb = read_probs(pB)
ra = rank01(pa); rb = rank01(pb)
ravg = (ra + rb) / 2.0
sub_rank = 'submission_interp_rankavg_AB.csv'
pd.DataFrame({id_col: ids, target_col: ravg.astype(np.float32)}).to_csv(sub_rank, index=False)
print('Wrote', sub_rank, '| mean=', f'{ravg.mean():.6f}')

# Promote rank-avg hedge
pd.read_csv(sub_rank).to_csv('submission.csv', index=False)
print('Promoted', sub_rank, 'to submission.csv')

CatBoost base: v1
[gamma999_interp_A] gamma-weighted OOF(z) AUC=0.68226
Wrote submission_interp_gamma999_interp_A.csv (+_shrunk) | mean=0.349913
[gamma999_interp_B] gamma-weighted OOF(z) AUC=0.68176
Wrote submission_interp_gamma999_interp_B.csv (+_shrunk) | mean=0.344665
Wrote submission_interp_rankavg_AB.csv | mean= 0.500000
Promoted submission_interp_rankavg_AB.csv to submission.csv


In [33]:
# S61: Expanded recent-interpolation variants (higher recency + Cat weight), select top-3 by gamma-OOF and rank-average
import numpy as np, pandas as pd
from pathlib import Path
from sklearn.metrics import roc_auc_score

id_col = 'request_id'; target_col = 'requester_received_pizza'
train = pd.read_json('train.json')
test = pd.read_json('test.json')
y = train[target_col].astype(int).values
ids = test[id_col].values

def to_logit(p, eps=1e-6):
    p = np.clip(p.astype(np.float64), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Time masks and gamma weights (0.999)
order = np.argsort(train['unix_timestamp_of_request'].values)
k = 6
blocks = np.array_split(order, k)
n = len(train)
mask_valid = np.zeros(n, dtype=bool)
for i in range(1, k):
    mask_valid[np.array(blocks[i])] = True
gamma = 0.999
w_oof = np.zeros(n, dtype=np.float64)
for bi in range(1, k):
    age = (k - 1) - bi
    w_oof[np.array(blocks[bi])] = (gamma ** age)

# Load OOF/test bases
o_lr_w = np.load('oof_lr_time_withsub_meta.npy');    t_lr_w = np.load('test_lr_time_withsub_meta.npy')
o_lr_ns = np.load('oof_lr_time_nosub_meta.npy');     t_lr_ns = np.load('test_lr_time_nosub_meta.npy')
o_d1 = np.load('oof_xgb_dense_time.npy');            t_d1 = np.load('test_xgb_dense_time.npy')
o_meta = np.load('oof_xgb_meta_time.npy');           t_meta = np.load('test_xgb_meta_fullbag.npy') if Path('test_xgb_meta_fullbag.npy').exists() else np.load('test_xgb_meta_time.npy')
o_emn = np.load('oof_xgb_emb_meta_time.npy');        t_emn = np.load('test_xgb_emb_minilm_fullbag.npy') if Path('test_xgb_emb_minilm_fullbag.npy').exists() else np.load('test_xgb_emb_meta_time.npy')
o_emp = np.load('oof_xgb_emb_mpnet_time.npy');       t_emp = np.load('test_xgb_emb_mpnet_fullbag.npy') if Path('test_xgb_emb_mpnet_fullbag.npy').exists() else np.load('test_xgb_emb_mpnet_time.npy')
has_char = Path('oof_lr_charwb_time.npy').exists() and Path('test_lr_charwb_time.npy').exists()
if has_char:
    o_char = np.load('oof_lr_charwb_time.npy'); t_char = np.load('test_lr_charwb_time.npy')
else:
    o_char = None; t_char = None

# CatBoost: prefer v1
has_cat_v1 = Path('oof_catboost_textmeta.npy').exists() and Path('test_catboost_textmeta.npy').exists()
has_cat_v2 = Path('oof_catboost_textmeta_v2.npy').exists() and Path('test_catboost_textmeta_v2.npy').exists()
if has_cat_v1 and has_cat_v2:
    o1 = np.load('oof_catboost_textmeta.npy'); o2 = np.load('oof_catboost_textmeta_v2.npy')
    auc1 = roc_auc_score(y[mask_valid], o1[mask_valid]); auc2 = roc_auc_score(y[mask_valid], o2[mask_valid])
    if auc1 >= auc2:
        o_cat = o1; t_cat = np.load('test_catboost_textmeta.npy')
    else:
        o_cat = o2; t_cat = np.load('test_catboost_textmeta_v2.npy')
elif has_cat_v1:
    o_cat = np.load('oof_catboost_textmeta.npy'); t_cat = np.load('test_catboost_textmeta.npy')
elif has_cat_v2:
    o_cat = np.load('oof_catboost_textmeta_v2.npy'); t_cat = np.load('test_catboost_textmeta_v2.npy')
else:
    raise FileNotFoundError('No CatBoost OOF/test found')

# Recent TEST-only preds
def load_avg_recent(base):
    p35 = np.load(f'test_{base}_recent35.npy') if Path(f'test_{base}_recent35.npy').exists() else None
    p45 = np.load(f'test_{base}_recent45.npy') if Path(f'test_{base}_recent45.npy').exists() else None
    if (p35 is None) and (p45 is None): return None
    if p35 is None: return p45.astype(np.float32)
    if p45 is None: return p35.astype(np.float32)
    return ((p35 + p45) / 2.0).astype(np.float32)
t_lr_recent = load_avg_recent('lr_nosub_meta')
t_minilm_recent = load_avg_recent('xgb_minilm_meta')
t_cat_recent = load_avg_recent('catboost_textmeta_v2')

# Convert to logits
z_lr_w, z_lr_ns = to_logit(o_lr_w), to_logit(o_lr_ns)
z_d1, z_meta = to_logit(o_d1), to_logit(o_meta)
z_emn, z_emp = to_logit(o_emn), to_logit(o_emp)
z_char = to_logit(o_char) if o_char is not None else None
z_cat = to_logit(o_cat)
tz_lr_w, tz_lr_ns = to_logit(t_lr_w), to_logit(t_lr_ns)
tz_d1, tz_meta = to_logit(t_d1), to_logit(t_meta)
tz_emn, tz_emp = to_logit(t_emn), to_logit(t_emp)
tz_char = to_logit(t_char) if t_char is not None else None
tz_cat = to_logit(t_cat)
tz_lr_recent = to_logit(t_lr_recent) if t_lr_recent is not None else None
tz_minilm_recent = to_logit(t_minilm_recent) if t_minilm_recent is not None else None
tz_cat_recent = to_logit(t_cat_recent) if t_cat_recent is not None else None

# LR mix with gamma=0.999
g_lr = 0.999
z_lr_mix = (1.0 - g_lr)*z_lr_w + g_lr*z_lr_ns
tz_lr_mix = (1.0 - g_lr)*tz_lr_w + g_lr*tz_lr_ns

def build_variant(tag, weights, alphas):
    z_oof = (weights['lr']*z_lr_mix +
             weights['dense']*z_d1 +
             weights['meta']*z_meta +
             weights['emn']*z_emn +
             weights['emp']*z_emp +
             weights['cat']*z_cat)
    if (z_char is not None) and (weights.get('char',0) > 0):
        z_oof = z_oof + weights['char']*z_char
    auc_g = roc_auc_score(y[mask_valid], z_oof[mask_valid], sample_weight=w_oof[mask_valid])
    tz_lr_interp = tz_lr_mix if tz_lr_recent is None else ((1.0 - alphas.get('lr',0.0))*tz_lr_mix + alphas.get('lr',0.0)*tz_lr_recent)
    tz_minilm_interp = tz_emn if tz_minilm_recent is None else ((1.0 - alphas.get('minilm',0.0))*tz_emn + alphas.get('minilm',0.0)*tz_minilm_recent)
    tz_cat_interp = tz_cat if tz_cat_recent is None else ((1.0 - alphas.get('cat',0.0))*tz_cat + alphas.get('cat',0.0)*tz_cat_recent)
    parts = [weights['lr']*tz_lr_interp, weights['dense']*tz_d1, weights['meta']*tz_meta, weights['emn']*tz_minilm_interp, weights['emp']*tz_emp, weights['cat']*tz_cat_interp]
    if (tz_char is not None) and (weights.get('char',0) > 0): parts.append(weights['char']*tz_char)
    zt = np.sum(parts, axis=0)
    pt = sigmoid(zt).astype(np.float32)
    out_path = f'submission_interp_{tag}.csv'
    pd.DataFrame({id_col: ids, target_col: pt}).to_csv(out_path, index=False)
    return auc_g, out_path

def renorm(weights):
    s = sum(weights.values())
    return {k: (v/s) for k,v in weights.items()} if s>0 else weights

# Define stronger-recency variants (C/D/E) within expert ranges, then renormalize
# C: higher LR, Cat, moderate emb, small dense, keep meta
wC = renorm(dict(lr=0.32, cat=0.24, emn=0.18, emp=0.16, char=(0.06 if has_char else 0.0), dense=0.04, meta=0.20))
aC = dict(lr=0.30, minilm=0.30, cat=0.30)
# D: drop dense, push Cat and embeddings up
wD = renorm(dict(lr=0.30, cat=0.26, emn=0.20, emp=0.18, char=(0.06 if has_char else 0.0), dense=0.00, meta=0.10))
aD = dict(lr=0.30, minilm=0.30, cat=0.35)
# E: max embeddings total ~0.36, cat 0.22, lr 0.30, meta 0.12, small dense
wE = renorm(dict(lr=0.30, cat=0.22, emn=0.20, emp=0.16, char=(0.08 if has_char else 0.0), dense=0.02, meta=0.12))
aE = dict(lr=0.40, minilm=0.30, cat=0.30)

cands = []
for tag, w, a in [
    ('gamma999_interp_C', wC, aC),
    ('gamma999_interp_D', wD, aD),
    ('gamma999_interp_E', wE, aE),
]:
    auc, path = build_variant(tag, w, a)
    print(f'[{tag}] gamma-weighted OOF(z) AUC={auc:.5f}')
    cands.append((auc, path))

# Add previous A/B for consideration if exist
for tag in ['gamma999_interp_A','gamma999_interp_B']:
    p = f'submission_interp_{tag}.csv'
    if Path(p).exists():
        # Recompute AUC using the same weights (approx using stored OOF from earlier run isn't trivial); just include as hedge without AUC sort benefit
        cands.append((-1.0, p))

# Select top-3 by AUC (valid ones), fill with others if needed
valid = sorted([x for x in cands if x[0] >= 0], key=lambda x: -x[0])
paths = [p for _, p in valid[:3]]
if len(paths) < 3:
    extra = [p for _, p in cands if p not in paths]
    for p in extra:
        if p not in paths:
            paths.append(p)
        if len(paths) >= 3: break
print('Chosen for rank-avg:', paths)

def read_probs(path):
    return pd.read_csv(path)[target_col].values.astype(np.float64)
def rank01(x):
    order = np.argsort(x, kind='mergesort')
    ranks = np.empty_like(order, dtype=np.float64)
    ranks[order] = np.arange(len(x), dtype=np.float64)
    return ranks / max(len(x) - 1, 1)

R = []
for p in paths:
    R.append(rank01(read_probs(p)))
ravg = np.mean(np.vstack(R), axis=0)
sub_rank = 'submission_interp_rankavg_top3_expanded.csv'
pd.DataFrame({id_col: ids, target_col: ravg.astype(np.float32)}).to_csv(sub_rank, index=False)
print('Wrote', sub_rank, '| mean=', f'{ravg.mean():.6f}')

# Promote rank-avg hedge
pd.read_csv(sub_rank).to_csv('submission.csv', index=False)
print('Promoted', sub_rank, 'to submission.csv')

[gamma999_interp_C] gamma-weighted OOF(z) AUC=0.68157
[gamma999_interp_D] gamma-weighted OOF(z) AUC=0.68067
[gamma999_interp_E] gamma-weighted OOF(z) AUC=0.68109
Chosen for rank-avg: ['submission_interp_gamma999_interp_C.csv', 'submission_interp_gamma999_interp_E.csv', 'submission_interp_gamma999_interp_D.csv']
Wrote submission_interp_rankavg_top3_expanded.csv | mean= 0.500000
Promoted submission_interp_rankavg_top3_expanded.csv to submission.csv


In [34]:
# S62: Block-5 optimizer with recent-only components (r24, r30, gamma9995+blk5x2) + logit-average hedge (no calibration)
import numpy as np, pandas as pd, time
from pathlib import Path
from sklearn.metrics import roc_auc_score

id_col = 'request_id'; target_col = 'requester_received_pizza'
train = pd.read_json('train.json')
test = pd.read_json('test.json')
y = train[target_col].astype(int).values
ids = test[id_col].values

def to_logit(p, eps=1e-6):
    p = np.clip(p.astype(np.float64), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Time blocks and masks
order = np.argsort(train['unix_timestamp_of_request'].values)
k = 6
blocks = np.array_split(order, k)
n = len(train)
mask_valid = np.zeros(n, dtype=bool)
for i in range(1, k):
    mask_valid[np.array(blocks[i])] = True
b5_idx = np.array(blocks[5])

# Load full-history OOF/test
o_lr_w = np.load('oof_lr_time_withsub_meta.npy');    t_lr_w = np.load('test_lr_time_withsub_meta.npy')
o_lr_ns = np.load('oof_lr_time_nosub_meta.npy');     t_lr_ns = np.load('test_lr_time_nosub_meta.npy')
o_d1 = np.load('oof_xgb_dense_time.npy');            t_d1 = np.load('test_xgb_dense_time.npy')
o_meta = np.load('oof_xgb_meta_time.npy');           t_meta = np.load('test_xgb_meta_fullbag.npy') if Path('test_xgb_meta_fullbag.npy').exists() else np.load('test_xgb_meta_time.npy')
o_emn = np.load('oof_xgb_emb_meta_time.npy');        t_emn = np.load('test_xgb_emb_minilm_fullbag.npy') if Path('test_xgb_emb_minilm_fullbag.npy').exists() else np.load('test_xgb_emb_meta_time.npy')
o_emp = np.load('oof_xgb_emb_mpnet_time.npy');       t_emp = np.load('test_xgb_emb_mpnet_fullbag.npy') if Path('test_xgb_emb_mpnet_fullbag.npy').exists() else np.load('test_xgb_emb_mpnet_time.npy')
has_char = Path('oof_lr_charwb_time.npy').exists() and Path('test_lr_charwb_time.npy').exists()
if has_char:
    o_char = np.load('oof_lr_charwb_time.npy'); t_char = np.load('test_lr_charwb_time.npy')
else:
    o_char = None; t_char = None
# SVD dual disabled per bounds

# CatBoost: prefer v1
has_cat_v1 = Path('oof_catboost_textmeta.npy').exists() and Path('test_catboost_textmeta.npy').exists()
has_cat_v2 = Path('oof_catboost_textmeta_v2.npy').exists() and Path('test_catboost_textmeta_v2.npy').exists()
if has_cat_v1 and has_cat_v2:
    o1 = np.load('oof_catboost_textmeta.npy'); o2 = np.load('oof_catboost_textmeta_v2.npy')
    auc1 = roc_auc_score(y[b5_idx], o1[b5_idx]); auc2 = roc_auc_score(y[b5_idx], o2[b5_idx])
    if auc1 >= auc2:
        o_cat = o1; t_cat = np.load('test_catboost_textmeta.npy')
    else:
        o_cat = o2; t_cat = np.load('test_catboost_textmeta_v2.npy')
elif has_cat_v1:
    o_cat = np.load('oof_catboost_textmeta.npy'); t_cat = np.load('test_catboost_textmeta.npy')
elif has_cat_v2:
    o_cat = np.load('oof_catboost_textmeta_v2.npy'); t_cat = np.load('test_catboost_textmeta_v2.npy')
else:
    raise FileNotFoundError('No CatBoost OOF/test found')

# Recent TEST-only preds (from S57)
def load_avg_recent(base):
    p35 = np.load(f'test_{base}_recent35.npy') if Path(f'test_{base}_recent35.npy').exists() else None
    p45 = np.load(f'test_{base}_recent45.npy') if Path(f'test_{base}_recent45.npy').exists() else None
    if (p35 is None) and (p45 is None): return None
    if p35 is None: return p45.astype(np.float32)
    if p45 is None: return p35.astype(np.float32)
    return ((p35 + p45) / 2.0).astype(np.float32)
t_lr_recent = load_avg_recent('lr_nosub_meta')
t_minilm_recent = load_avg_recent('xgb_minilm_meta')
t_cat_recent = load_avg_recent('catboost_textmeta_v2')

# Convert to logits
z_lr_w, z_lr_ns = to_logit(o_lr_w), to_logit(o_lr_ns)
z_d1, z_meta = to_logit(o_d1), to_logit(o_meta)
z_emn, z_emp = to_logit(o_emn), to_logit(o_emp)
z_char = to_logit(o_char) if o_char is not None else None
z_cat = to_logit(o_cat)
tz_lr_w, tz_lr_ns = to_logit(t_lr_w), to_logit(t_lr_ns)
tz_d1, tz_meta = to_logit(t_d1), to_logit(t_meta)
tz_emn, tz_emp = to_logit(t_emn), to_logit(t_emp)
tz_char = to_logit(t_char) if t_char is not None else None
tz_cat = to_logit(t_cat)
tz_lr_recent = to_logit(t_lr_recent) if t_lr_recent is not None else None
tz_minilm_recent = to_logit(t_minilm_recent) if t_minilm_recent is not None else None
tz_cat_recent = to_logit(t_cat_recent) if t_cat_recent is not None else None

# Define LR mix with g_lr (use 0.9995 as expert recency suggestion)
g_lr = 0.9995
z_lr_mix = (1.0 - g_lr)*z_lr_w + g_lr*z_lr_ns
tz_lr_mix = (1.0 - g_lr)*tz_lr_w + g_lr*tz_lr_ns

# Core components (OOF/test logits) - no SVD
oof_cols = {'lr': z_lr_mix, 'dense': z_d1, 'meta': z_meta, 'emn': z_emn, 'emp': z_emp, 'cat': z_cat}
test_cols = {'lr': tz_lr_mix, 'dense': tz_d1, 'meta': tz_meta, 'emn': tz_emn, 'emp': tz_emp, 'cat': tz_cat}
if z_char is not None:
    oof_cols['char'] = z_char; test_cols['char'] = tz_char

# Bounds
core_bounds = {
  'lr': (0.30, 0.36),
  'cat': (0.18, 0.26),
  'meta': (0.16, 0.22),
  'dense': (0.0, 0.06),
  'char': (0.04, 0.08) if 'char' in oof_cols else (0.0, 0.0),
  # emn+emp total in [0.30,0.36], split in {(0.6,0.4),(0.5,0.5)}
}
emb_splits = [(0.6,0.4), (0.5,0.5)]

recent_bounds = {
  'lr_recent': (0.08, 0.15) if tz_lr_recent is not None else (0.0, 0.0),
  'minilm_recent': (0.08, 0.15) if tz_minilm_recent is not None else (0.0, 0.0),
  'cat_recent': (0.08, 0.15) if tz_cat_recent is not None else (0.0, 0.0),
}

def sample_core_weights(rng: np.random.Generator, emb_total_low=0.30, emb_total_high=0.36):
    # sample core per bounds
    w = {}
    for k,(lo,hi) in core_bounds.items():
        val = rng.uniform(lo, hi) if hi > lo else lo
        w[k] = float(val)
    # sample embedding total and split
    emb_tot = rng.uniform(emb_total_low, emb_total_high)
    split = emb_splits[rng.integers(0, len(emb_splits))]
    w['emn'] = emb_tot * split[0]
    w['emp'] = emb_tot * split[1]
    return w

def renorm_core(w_core: dict, core_sum_target: float):
    keys = ['lr','cat','meta','dense','char'] + ['emn','emp']
    s = sum(w_core.get(k,0.0) for k in keys)
    if s <= 0:
        return {k:(0.0) for k in keys}
    scale = core_sum_target / s
    for k in keys:
        w_core[k] = w_core.get(k,0.0) * scale
    # floors on lr/cat already handled by initial sampling and scale preserves ratios
    return w_core

def score_block5(w_core: dict):
    z = (w_core.get('lr',0)*oof_cols['lr'] +
         w_core.get('dense',0)*oof_cols['dense'] +
         w_core.get('meta',0)*oof_cols['meta'] +
         w_core.get('emn',0)*oof_cols['emn'] +
         w_core.get('emp',0)*oof_cols['emp'] +
         w_core.get('cat',0)*oof_cols['cat'])
    if 'char' in oof_cols: z = z + w_core.get('char',0)*oof_cols['char']
    return roc_auc_score(y[b5_idx], z[b5_idx])

def score_gamma9995_blk5x2(w_core: dict):
    # gamma=0.9995 over blocks 1..5, with 2x weight for block 5
    weights = np.zeros(n, dtype=np.float64)
    gamma = 0.9995
    for bi in range(1, k):
        age = (k - 1) - bi
        weights[np.array(blocks[bi])] = (gamma ** age)
    weights[b5_idx] *= 2.0
    z = (w_core.get('lr',0)*oof_cols['lr'] +
         w_core.get('dense',0)*oof_cols['dense'] +
         w_core.get('meta',0)*oof_cols['meta'] +
         w_core.get('emn',0)*oof_cols['emn'] +
         w_core.get('emp',0)*oof_cols['emp'] +
         w_core.get('cat',0)*oof_cols['cat'])
    if 'char' in oof_cols: z = z + w_core.get('char',0)*oof_cols['char']
    return roc_auc_score(y[mask_valid], z[mask_valid], sample_weight=weights[mask_valid])

def build_test_probs(w_core: dict, recent: dict):
    zt = (w_core.get('lr',0)*test_cols['lr'] +
          w_core.get('dense',0)*test_cols['dense'] +
          w_core.get('meta',0)*test_cols['meta'] +
          w_core.get('emn',0)*test_cols['emn'] +
          w_core.get('emp',0)*test_cols['emp'] +
          w_core.get('cat',0)*test_cols['cat'])
    if 'char' in test_cols: zt = zt + w_core.get('char',0)*test_cols['char']
    # add recent-only components on TEST
    if (tz_lr_recent is not None) and (recent.get('lr_recent',0)>0): zt += recent['lr_recent']*tz_lr_recent
    if (tz_minilm_recent is not None) and (recent.get('minilm_recent',0)>0): zt += recent['minilm_recent']*tz_minilm_recent
    if (tz_cat_recent is not None) and (recent.get('cat_recent',0)>0): zt += recent['cat_recent']*tz_cat_recent
    return sigmoid(zt).astype(np.float32)

def optimize_variant(tag: str, recent_total_target: float|None, n_iter: int, objective: str):
    rng = np.random.default_rng(20250912 if tag=='r24' else (20250913 if tag=='r30' else 20250914))
    best_auc, best_core, tried = -1.0, None, 0
    t0 = time.time()
    for it in range(1, n_iter+1):
        core = sample_core_weights(rng)
        # renorm core to 1 - recent_total
        if recent_total_target is None:
            recent_total = rng.uniform(0.24, 0.30)
        else:
            recent_total = recent_total_target
        core = renorm_core(core, core_sum_target=(1.0 - recent_total))
        # score objective (recent components ignored)
        if objective == 'b5':
            auc = score_block5(core)
        elif objective == 'gam9995_blk5x2':
            auc = score_gamma9995_blk5x2(core)
        else:
            raise ValueError('unknown objective')
        tried += 1
        if auc > best_auc:
            best_auc, best_core = auc, core.copy()
        if it % 1000 == 0:
            print(f'  [{tag}] iter={it} | best_auc={best_auc:.5f} | elapsed={time.time()-t0:.1f}s', flush=True)
    print(f'[{tag}] search done | tried={tried} | best_auc={best_auc:.5f} | {time.time()-t0:.1f}s | core={best_core}', flush=True)
    # Assign recent weights within bounds and to recent_total
    recent = {}
    r_keys = [k for k,(lo,hi) in recent_bounds.items() if hi > lo]
    if len(r_keys) > 0:
        # Start from random within bounds
        raw = np.array([rng.uniform(recent_bounds[k][0], recent_bounds[k][1]) for k in r_keys], dtype=np.float64)
        raw_sum = raw.sum() if raw.sum() > 0 else 1.0
        if recent_total_target is None:
            recent_total = rng.uniform(0.24, 0.30)
        else:
            recent_total = recent_total_target
        scaled = raw / raw_sum * recent_total
        for k,val in zip(r_keys, scaled):
            recent[k] = float(val)
    return best_core, recent

# Run three variants per expert
core_r24, recent_r24 = optimize_variant('r24', recent_total_target=0.24, n_iter=8000, objective='b5')
core_r30, recent_r30 = optimize_variant('r30', recent_total_target=0.30, n_iter=8000, objective='b5')
core_gx, recent_gx = optimize_variant('gamma9995_blk5x2', recent_total_target=None, n_iter=8000, objective='gam9995_blk5x2')

def write_sub(path, probs):
    pd.DataFrame({id_col: ids, target_col: probs}).to_csv(path, index=False)
    print(f'Wrote {path} | mean={probs.mean():.6f}', flush=True)

p_r24 = build_test_probs(core_r24, recent_r24)
p_r30 = build_test_probs(core_r30, recent_r30)
p_gx  = build_test_probs(core_gx, recent_gx)
path_r24 = 'submission_block5opt_r24.csv'; write_sub(path_r24, p_r24)
path_r30 = 'submission_block5opt_r30.csv'; write_sub(path_r30, p_r30)
path_gx  = 'submission_block5opt_gamma9995_blk5x2.csv'; write_sub(path_gx, p_gx)

# Logit-average hedges (preferred over rank-avg):
def p_to_logit(p, eps=1e-6):
    p = np.clip(p.astype(np.float64), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))
def logit_avg(paths, out_path):
    arrs = [pd.read_csv(p)[target_col].values.astype(np.float64) for p in paths]
    Z = np.vstack([p_to_logit(a) for a in arrs])
    p_mean = sigmoid(Z.mean(axis=0)).astype(np.float32)
    write_sub(out_path, p_mean)
    return out_path

# Typically average r24 + gamma9995; also option to include r30
prom_pair = logit_avg([path_r24, path_gx], 'submission_logitavg_r24_gamma9995.csv')
logit_avg([path_r24, path_r30, path_gx], 'submission_logitavg_r24_r30_gamma9995.csv')

# Promote pair logit-average as primary
pd.read_csv(prom_pair).to_csv('submission.csv', index=False)
print('Promoted', prom_pair, 'to submission.csv', flush=True)

  [r24] iter=1000 | best_auc=0.65148 | elapsed=1.7s


  [r24] iter=2000 | best_auc=0.65168 | elapsed=3.3s


  [r24] iter=3000 | best_auc=0.65185 | elapsed=5.0s


  [r24] iter=4000 | best_auc=0.65185 | elapsed=6.7s


  [r24] iter=5000 | best_auc=0.65185 | elapsed=8.4s


  [r24] iter=6000 | best_auc=0.65185 | elapsed=10.1s


  [r24] iter=7000 | best_auc=0.65185 | elapsed=11.8s


  [r24] iter=8000 | best_auc=0.65185 | elapsed=13.5s


[r24] search done | tried=8000 | best_auc=0.65185 | 13.5s | core={'lr': 0.2091373762438391, 'cat': 0.1267646003622472, 'meta': 0.11194764105173324, 'dense': 0.027994307728506775, 'char': 0.054949179783834616, 'emn': 0.13752413689790344, 'emp': 0.09168275793193562}


  [r30] iter=1000 | best_auc=0.65148 | elapsed=1.7s


  [r30] iter=2000 | best_auc=0.65148 | elapsed=3.3s


  [r30] iter=3000 | best_auc=0.65148 | elapsed=5.0s


  [r30] iter=4000 | best_auc=0.65168 | elapsed=6.6s


  [r30] iter=5000 | best_auc=0.65168 | elapsed=8.3s


  [r30] iter=6000 | best_auc=0.65191 | elapsed=10.0s


  [r30] iter=7000 | best_auc=0.65191 | elapsed=11.7s


  [r30] iter=8000 | best_auc=0.65191 | elapsed=13.4s


[r30] search done | tried=8000 | best_auc=0.65191 | 13.4s | core={'lr': 0.19470726325081586, 'cat': 0.11318440767022928, 'meta': 0.10008503666306624, 'dense': 0.031738063457150384, 'char': 0.04866208087453025, 'emn': 0.12697388885052477, 'emp': 0.0846492592336832}


  [gamma9995_blk5x2] iter=1000 | best_auc=0.67701 | elapsed=2.5s


  [gamma9995_blk5x2] iter=2000 | best_auc=0.67701 | elapsed=5.0s


  [gamma9995_blk5x2] iter=3000 | best_auc=0.67701 | elapsed=7.4s


  [gamma9995_blk5x2] iter=4000 | best_auc=0.67701 | elapsed=9.9s


  [gamma9995_blk5x2] iter=5000 | best_auc=0.67701 | elapsed=12.3s


  [gamma9995_blk5x2] iter=6000 | best_auc=0.67701 | elapsed=14.8s


  [gamma9995_blk5x2] iter=7000 | best_auc=0.67701 | elapsed=17.3s


  [gamma9995_blk5x2] iter=8000 | best_auc=0.67711 | elapsed=19.7s


[gamma9995_blk5x2] search done | tried=8000 | best_auc=0.67711 | 19.7s | core={'lr': 0.1883169615457813, 'cat': 0.11270430827145297, 'meta': 0.12488238635974469, 'dense': 0.035741154404510686, 'char': 0.049384449937499506, 'emn': 0.10922857731367337, 'emp': 0.10922857731367337}


Wrote submission_block5opt_r24.csv | mean=0.342686


Wrote submission_block5opt_r30.csv | mean=0.331348


Wrote submission_block5opt_gamma9995_blk5x2.csv | mean=0.343543


Wrote submission_logitavg_r24_gamma9995.csv | mean=0.343102


Wrote submission_logitavg_r24_r30_gamma9995.csv | mean=0.339154


Promoted submission_logitavg_r24_gamma9995.csv to submission.csv


In [35]:
# S63: Promote 3-way logit-average hedge (r24 + r30 + gamma9995_blk5x2) if available
import pandas as pd
from pathlib import Path
path = 'submission_logitavg_r24_r30_gamma9995.csv'
if Path(path).exists() and Path(path).stat().st_size > 0:
    sub = pd.read_csv(path)
    sub.to_csv('submission.csv', index=False)
    print(f'Promoted {path} to submission.csv | mean={sub.iloc[:,1].mean():.6f}')
else:
    print('3-way logit-average file not found; leaving current submission.csv as-is')

Promoted submission_logitavg_r24_r30_gamma9995.csv to submission.csv | mean=0.339154
