# Random Acts of Pizza: Medal Plan

Objective: Earn a medal with a strong, reliable text + meta-feature baseline, robust CV, and fast iteration.

Data:
- train.json, test.json. Target: requester_received_pizza (bool).
- Text fields: request_title, request_text (or request_text_edit_aware), and their concatenation. Meta: requester_account_age_in_days_at_request, requester_days_since_first_post_on_raop_at_request, requester_number_of_comments_at_request, requester_number_of_comments_in_raop_at_request, requester_number_of_posts_at_request, requester_number_of_posts_on_raop_at_request, requester_upvotes_plus_downvotes_at_request, requester_upvotes_minus_downvotes_at_request, etc. IDs: request_id.

Metric: ROC AUC (maximize).

Validation:
- Stratified KFold (n_splits=5, shuffle, fixed seed).
- Compute the fold split once, save it, and reuse it for every model; fit all transforms inside each fold to prevent leakage.
- Report OOF AUC mean/std; log per-fold times.
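
The "compute once, reuse everywhere" fold source could be sketched as follows. The helper names (`make_folds`, `save_folds`, `load_folds`), the `folds.json` file name, and the toy target are assumptions for illustration, not part of the notebook.

```python
import json
import os
import tempfile

import numpy as np
from sklearn.model_selection import StratifiedKFold


def make_folds(y, n_splits=5, seed=42):
    """Assign every row a validation fold id (0..n_splits-1), stratified on y."""
    fold_id = np.full(len(y), -1, dtype=int)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for k, (_, va_idx) in enumerate(skf.split(np.zeros(len(y)), y)):
        fold_id[va_idx] = k
    return fold_id


def save_folds(fold_id, path):
    """Persist fold ids as a plain JSON list so any later run can reload them."""
    with open(path, "w") as f:
        json.dump([int(k) for k in fold_id], f)


def load_folds(path):
    with open(path) as f:
        return np.array(json.load(f), dtype=int)


# Imbalanced toy target (illustrative only, not the real data)
y_demo = np.array([0, 0, 0, 1] * 25)
folds_path = os.path.join(tempfile.gettempdir(), "folds.json")  # hypothetical location
fold_id = make_folds(y_demo)
save_folds(fold_id, folds_path)
reloaded = load_folds(folds_path)
```

Storing one fold id per row (rather than index pairs) keeps the file small and makes it easy to attach as a column to the training frame.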

Baseline v1 (fast):
- Text preprocessing: concatenate title + request_text.
- TF-IDF word + char n-grams (word 1-2, char 3-5), sublinear_tf, min_df tuned lightly; limit features via max_features (e.g., 200k).
- Linear Model: LogisticRegression (liblinear/saga) or LinearSVC with calibrated probabilities; start with LogisticRegression(saga, class_weight='balanced', max_iter=4000).
- Blend word and char TF-IDF by hstack.
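
As a rough sketch of the baseline above, word and char TF-IDF can be combined with scikit-learn's `FeatureUnion`, which stacks the two sparse blocks side by side (the same effect as `hstack`). The function name `make_text_baseline` and the toy documents and labels are made up for illustration; `min_df` is left at its default because the toy corpus is tiny.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline


def make_text_baseline(max_features=200_000):
    """Word (1-2) + char (3-5) TF-IDF feeding a saga logistic regression."""
    word = TfidfVectorizer(analyzer="word", ngram_range=(1, 2),
                           sublinear_tf=True, max_features=max_features)
    char = TfidfVectorizer(analyzer="char", ngram_range=(3, 5),
                           sublinear_tf=True, max_features=max_features)
    feats = FeatureUnion([("word", word), ("char", char)])
    clf = LogisticRegression(solver="saga", class_weight="balanced",
                             max_iter=4000, random_state=42)
    return Pipeline([("tfidf", feats), ("lr", clf)])


# Toy documents and labels (hypothetical, only to show the call pattern)
docs = [
    "please help, student out of money until payday",
    "will pay it forward next week, thanks",
    "pizza",
    "hungry",
    "lost my job and the kids are hungry",
    "just want pizza lol",
]
labels = [1, 1, 0, 0, 1, 0]

model = make_text_baseline().fit(docs, labels)
baseline_proba = model.predict_proba(docs)[:, 1]
```

Wrapping the vectorizers and the classifier in one `Pipeline` means a cross-validation loop refits the vocabulary on each training split automatically.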

Meta Features v1:
- Numeric: account age, activity counts, karma-like features, raop history, hour/day from unix_timestamp if present, text length metrics (len, word count, title len), punctuation/exclamation count, presence of keywords (e.g., "pizza", "thanks").
- Binary flags: includes images, has_url, has_code, mentions_student/job/money? (simple keyword lists).

Pipeline:
- Build a single ColumnTransformer: TF-IDF on text, StandardScaler on numeric meta, passthrough for binary keyword flags.
- Model: LogisticRegression or XGBoost (gpu) on sparse; start with LR for speed/strong baseline on text.
- Cache vectorizers and sparse matrices to .npz for reuse; cache OOF/test preds .npy.
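
One way the single-ColumnTransformer idea might look is sketched below. The column lists are trimmed to one numeric column and one flag, `kw_student` is assumed to be a precomputed 0/1 column, and the toy frame is invented for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

TEXT_COL = "_combined_text"
NUM_COLS = ["requester_account_age_in_days_at_request"]
FLAG_COLS = ["kw_student"]  # hypothetical precomputed 0/1 keyword flag


def make_pipeline_v1():
    pre = ColumnTransformer([
        # A string (not list) selector hands the vectorizer a 1-D column of documents
        ("text", TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True), TEXT_COL),
        ("num", StandardScaler(), NUM_COLS),
        ("flags", "passthrough", FLAG_COLS),
    ])
    return Pipeline([("pre", pre), ("lr", LogisticRegression(max_iter=1000))])


# Toy frame (made-up rows, only to show the wiring)
toy = pd.DataFrame({
    TEXT_COL: ["student needs pizza", "lost my job this week",
               "pizza for the kids please", "just hungry tonight"],
    NUM_COLS[0]: [10.0, 400.0, 35.0, 0.0],
    FLAG_COLS[0]: [1, 0, 0, 0],
})
toy_y = [1, 0, 1, 0]

pipe = make_pipeline_v1().fit(toy, toy_y)
toy_proba = pipe.predict_proba(toy)[:, 1]
```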

Iteration Plan:
1) Implement data loader + EDA-lite (shapes, target rate, nulls).
2) Baseline TF-IDF+LR 5-fold; get OOF AUC; produce submission.csv.
3) Add meta features; re-evaluate.
4) Try LinearSVC + CalibratedClassifierCV; compare.
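5) Try XGBoost on sparse CSR with GPU histogram training (tree_method='gpu_hist', or tree_method='hist' with device='cuda' on XGBoost 2.0+); tune shallow depth/eta; compare.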
6) Simple blend of top-2 via weighted average on OOF to set weights; apply to test.
7) Error analysis on OOF bins; targeted keyword features.
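
For step 7, a minimal sketch of "error analysis on OOF bins" is to cut the OOF predictions into quantile bins and compare the mean prediction with the observed positive rate in each bin. The function name `oof_bin_report` and the synthetic data are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def oof_bin_report(y_true, oof_pred, n_bins=5):
    """Per-quantile-bin row count, mean prediction, observed positive rate, and their gap."""
    df = pd.DataFrame({"y": y_true, "p": oof_pred})
    df["bin"] = pd.qcut(df["p"], q=n_bins, labels=False, duplicates="drop")
    rep = df.groupby("bin").agg(n=("y", "size"),
                                mean_pred=("p", "mean"),
                                pos_rate=("y", "mean"))
    rep["gap"] = rep["pos_rate"] - rep["mean_pred"]
    return rep


# Synthetic, well-calibrated predictions (not real OOF output)
rng = np.random.default_rng(0)
p_demo = rng.uniform(size=1000)
y_demo_bins = (rng.uniform(size=1000) < p_demo).astype(int)
report = oof_bin_report(y_demo_bins, p_demo)
```

Bins with a large `gap` are natural places to read individual requests and look for missing keyword features.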

Efficiency:
- Log fold indices and elapsed times.
- Use subsample (e.g., 1000 rows, 2 folds) for smoke tests.
- Save folds and results; one change per run.
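
A smoke-test subsample can be drawn per class so the positive rate stays close to the full data. The sketch below uses only NumPy; `smoke_subsample` and the toy target are hypothetical names and data.

```python
import numpy as np


def smoke_subsample(y, n_rows=1000, seed=0):
    """Return sorted row positions for a class-stratified subsample of about n_rows."""
    y = np.asarray(y)
    if n_rows >= len(y):
        return np.arange(len(y))
    rng = np.random.default_rng(seed)
    frac = n_rows / len(y)
    keep = []
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        k = max(1, int(round(frac * len(idx))))  # keep at least one row per class
        keep.append(rng.choice(idx, size=k, replace=False))
    return np.sort(np.concatenate(keep))


# Toy target with a 25% positive rate (illustrative only)
y_smoke = np.array([1] * 700 + [0] * 2100)
smoke_idx = smoke_subsample(y_smoke, n_rows=1000)
```

The returned positions can be passed to `.iloc` on the training frame, then run through the same CV code with `n_splits=2`.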

Deliverables:
- submission.csv with columns: request_id, requester_received_pizza.
- Reproducible notebook with saved CV.
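
Before uploading, the submission frame can be sanity-checked with a small helper like the one below. `check_submission` is a hypothetical name and the ids are made up. ROC AUC only depends on ranking, so the [0, 1] range check is a guard against obvious mistakes rather than a metric requirement.

```python
import pandas as pd


def check_submission(sub, test_ids, id_col="request_id",
                     target_col="requester_received_pizza"):
    """Return a list of human-readable problems; an empty list means the frame looks valid."""
    if list(sub.columns) != [id_col, target_col]:
        return [f"columns are {list(sub.columns)}, expected {[id_col, target_col]}"]
    problems = []
    if sub[id_col].duplicated().any():
        problems.append("duplicate ids")
    if set(sub[id_col]) != set(test_ids):
        problems.append("ids do not match the test set")
    preds = pd.to_numeric(sub[target_col], errors="coerce")
    if preds.isna().any():
        problems.append("missing or non-numeric predictions")
    elif not preds.between(0.0, 1.0).all():
        problems.append("predictions outside [0, 1]")
    return problems


good = pd.DataFrame({"request_id": ["t3_a", "t3_b"],
                     "requester_received_pizza": [0.2, 0.7]})
bad = pd.DataFrame({"request_id": ["t3_a", "t3_a"],
                    "requester_received_pizza": [0.2, 1.7]})
good_report = check_submission(good, ["t3_a", "t3_b"])
bad_report = check_submission(bad, ["t3_a", "t3_b"])
```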

Next: Implement data loading, schema print, and CV scaffolding.

In [1]:
# Data loading and schema/EDA scaffold
import json, os, time, sys, gc, math, re
import numpy as np
import pandas as pd
from collections import Counter

t0 = time.time()
print('Reading train/test JSON...')
train_path = 'train.json'
test_path = 'test.json'
train = pd.read_json(train_path)
test = pd.read_json(test_path)
print(f'Loaded train: {train.shape}, test: {test.shape}')

# Peek columns and target
print('\nTrain columns:', len(train.columns))
print(sorted(train.columns.tolist())[:40], '...')
print('\nTest columns:', len(test.columns))
print(sorted(test.columns.tolist())[:40], '...')

target_col = 'requester_received_pizza'
id_col = 'request_id'
assert target_col in train.columns, f'Missing target {target_col}'
assert id_col in train.columns and id_col in test.columns, 'Missing request_id in data'

# Target stats
print('\nTarget distribution:')
print(train[target_col].value_counts(dropna=False).rename('count'))
pos_rate = train[target_col].mean()
print(f'Positive rate: {pos_rate:.4f}')

# Identify likely text fields and username for grouping
possible_text_cols = [c for c in ['request_text_edit_aware','request_text','request_title','title'] if c in train.columns]
group_col = 'requester_username' if 'requester_username' in train.columns else None
print('\nText columns found:', possible_text_cols)
print('Group column:', group_col)

# Check overlap of users between train/test to assess leakage risk and choose CV
if group_col is not None:
    tr_users = set(train[group_col].dropna().astype(str))
    te_users = set(test[group_col].dropna().astype(str))
    inter = tr_users & te_users
    print(f'Unique users train: {len(tr_users)}, test: {len(te_users)}, overlap: {len(inter)}')
else:
    print('No requester_username available; will use StratifiedKFold.')

# List features ending with _at_request to whitelist meta
request_time_feats = sorted([c for c in train.columns if c.endswith('_at_request')])
print(f'\nRequest-time features (count={len(request_time_feats)}):')
print(request_time_feats[:30], '...')

# Known leaky fields to drop if present
leaky_prefixes = ['giver_', 'post_was_']
leaky_suffixes = ['_at_retrieval']
leaky_cols = [c for c in train.columns if any(c.startswith(p) for p in leaky_prefixes) or any(c.endswith(s) for s in leaky_suffixes)]
if leaky_cols:
    print('Potentially leaky columns found (will exclude):', leaky_cols)
else:
    print('No obvious leaky columns found by prefix/suffix scan.')

# Basic nulls and sample rows
print('\nNull rates (top 20 by null %):')
nulls = train.isnull().mean().sort_values(ascending=False)
print(nulls.head(20))

print('\nSample rows (id, title snippet):')
title_col = next((c for c in ('request_title', 'title') if c in train.columns), None)
if title_col:
    print(train[[id_col, title_col]].head(3))
else:
    print(train[[id_col]].head(3))

# Confirm submission format
sub_example = pd.read_csv('sampleSubmission.csv')
print('\nSample submission head:')
print(sub_example.head())
assert list(sub_example.columns) == [id_col, target_col], 'Submission columns mismatch'

elapsed = time.time() - t0
print(f'\nData load + EDA scaffold done in {elapsed:.2f}s')

# Prepare combined text preview function (no fitting yet)
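def make_text(df: pd.DataFrame) -> pd.Series:
    parts = []
    # The title column is 'request_title' in this dataset; accept 'title' as a fallback
    title_col = next((c for c in ('request_title', 'title') if c in df.columns), None)
    if title_col is not None:
        parts.append(df[title_col].fillna(''))
    if 'request_text_edit_aware' in df.columns:
        parts.append(df['request_text_edit_aware'].fillna(''))
    elif 'request_text' in df.columns:
        parts.append(df['request_text'].fillna(''))
    if parts:
        return (parts[0].astype(str) + ' \n ' + parts[1].astype(str)) if len(parts) > 1 else parts[0].astype(str)
    # Keep the caller's index so the result aligns on assignment
    return pd.Series([''] * len(df), index=df.index)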

train['_combined_text'] = make_text(train)
test['_combined_text'] = make_text(test)
print('\nCombined text examples:')
print(train['_combined_text'].head(2).tolist())

Reading train/test JSON...
Loaded train: (2878, 32), test: (1162, 17)

Train columns: 32
['giver_username_if_known', 'number_of_downvotes_of_request_at_retrieval', 'number_of_upvotes_of_request_at_retrieval', 'post_was_edited', 'request_id', 'request_number_of_comments_at_retrieval', 'request_text', 'request_text_edit_aware', 'request_title', 'requester_account_age_in_days_at_request', 'requester_account_age_in_days_at_retrieval', 'requester_days_since_first_post_on_raop_at_request', 'requester_days_since_first_post_on_raop_at_retrieval', 'requester_number_of_comments_at_request', 'requester_number_of_comments_at_retrieval', 'requester_number_of_comments_in_raop_at_request', 'requester_number_of_comments_in_raop_at_retrieval', 'requester_number_of_posts_at_request', 'requester_number_of_posts_at_retrieval', 'requester_number_of_posts_on_raop_at_request', 'requester_number_of_posts_on_raop_at_retrieval', 'requester_number_of_subreddits_at_request', 'requester_received_pizza', 'requester

In [8]:
# TF-IDF + Meta + Logistic Regression CV and Submission (grid: C x class_weight)
import time
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from scipy.sparse import hstack, csr_matrix

def safe_log1p_signed(arr):
    a = pd.to_numeric(arr, errors='coerce').astype(float)
    return np.sign(a) * np.log1p(np.abs(a))

def extract_subreddit_text(series):
    def joiner(x):
        if isinstance(x, (list, tuple)):
            return ' '.join([str(s).lower() for s in x])
        return ''
    return series.apply(joiner)

# Ensure VADER is available and create a global analyzer
try:
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
    _vader = SentimentIntensityAnalyzer()
except ImportError:
    # Install on demand, then retry the import below; fail loudly if pip fails
    import subprocess, sys as _sys
    subprocess.run([_sys.executable, '-m', 'pip', 'install', '-q', 'vaderSentiment'], check=True)
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
    _vader = SentimentIntensityAnalyzer()

def build_meta_features(df: pd.DataFrame) -> pd.DataFrame:
    df2 = pd.DataFrame(index=df.index)
    base = [
        'requester_account_age_in_days_at_request',
        'requester_days_since_first_post_on_raop_at_request',
        'requester_number_of_comments_at_request',
        'requester_number_of_comments_in_raop_at_request',
        'requester_number_of_posts_at_request',
        'requester_number_of_posts_on_raop_at_request',
        'requester_number_of_subreddits_at_request',
        'requester_upvotes_minus_downvotes_at_request',
        'requester_upvotes_plus_downvotes_at_request',
    ]
    for c in base:
        if c in df.columns:
            df2[c] = pd.to_numeric(df[c], errors='coerce')
    if 'unix_timestamp_of_request' in df.columns:
        ts = pd.to_datetime(df['unix_timestamp_of_request'], unit='s', errors='coerce')
        df2['req_hour'] = ts.dt.hour.fillna(0).astype(np.int16)
        df2['req_wday'] = ts.dt.weekday.fillna(0).astype(np.int16)
        df2['req_is_weekend'] = df2['req_wday'].isin([5,6]).astype(np.int8)
        df2['req_month'] = ts.dt.month.fillna(0).astype(np.int16)
        month = df2['req_month'].fillna(0).astype(int)
        season = pd.Series(np.zeros(len(df2), dtype=np.int16), index=df2.index)
        season[(month==12)|(month<=2)] = 1
        season[(month>=3)&(month<=5)] = 2
        season[(month>=6)&(month<=8)] = 3
        season[(month>=9)&(month<=11)] = 4
        df2['req_season'] = season.astype(np.int16)
    txt = df['_combined_text'].fillna('').astype(str)
    df2['text_len'] = txt.str.len().astype(np.int32)
    raw_wc = txt.str.split().map(len)
    df2['word_count'] = raw_wc.astype(np.int32)
    # Length buckets (simple bins)
    df2['wc_bin_small'] = (raw_wc < 200).astype(np.int8)
    df2['wc_bin_medium'] = ((raw_wc >= 200) & (raw_wc <= 600)).astype(np.int8)
    df2['wc_bin_large'] = (raw_wc > 600).astype(np.int8)
    df2['exclaim_count'] = txt.str.count('!').astype(np.int16)
    df2['question_count'] = txt.str.count(r'\?').astype(np.int16)  # raw string avoids an invalid-escape warning
    upper_ratio = txt.map(lambda s: (sum(ch.isupper() for ch in s) / max(1, len(s)))).astype(np.float32)
    df2['upper_ratio'] = upper_ratio.clip(0, 0.7)
    # URL/image flags
    df2['has_url'] = txt.str.contains('http://', regex=False) | txt.str.contains('https://', regex=False)
    df2['has_url'] = df2['has_url'].astype(np.int8)
    df2['url_count'] = txt.str.count('http://') + txt.str.count('https://')
    df2['url_count'] = df2['url_count'].astype(np.int16)
    df2['has_image'] = txt.str.contains('imgur', case=False, regex=False) | txt.str.contains('.jpg', case=False, regex=False) | txt.str.contains('.jpeg', case=False, regex=False) | txt.str.contains('.png', case=False, regex=False) | txt.str.contains('.gif', case=False, regex=False)
    df2['has_image'] = df2['has_image'].astype(np.int8)
    # Sentiment (VADER compound)
    try:
        df2['sent_compound'] = txt.map(lambda s: _vader.polarity_scores(s)['compound']).astype(np.float32)
    except Exception:
        df2['sent_compound'] = 0.0
    lexicons = {
        'kw_student': ['student','college','university','school','tuition'],
        'kw_job': ['job','unemployed','laid off','hired','interview'],
        'kw_money': ['rent','bill','bills','broke','money','paycheck','payment'],
        'kw_family': ['family','kids','child','children','wife','husband','mom','dad'],
        'kw_emergency': ['emergency','medical','hospital','doctor'],
        'kw_gratitude': ['please','thank','thanks','appreciate','grateful'],
        'kw_reciprocity': ['pay it forward','return the favor','pay it back','return favor'],
        'kw_pizza': ['pizza','pepperoni','cheese','dominos','pizza hut','papa john']
    }
    low_txt = txt.str.lower()
    for name, toks in lexicons.items():
        pat = '|'.join([re.escape(t) for t in toks])
        df2[name] = low_txt.str.contains(pat).astype(np.int8)
    if 'requester_subreddits_at_request' in df.columns:
        subs = df['requester_subreddits_at_request']
        df2['subs_len'] = subs.apply(lambda x: len(x) if isinstance(x, (list, tuple)) else 0).astype(np.int32)
        df2['subs_has_raop'] = subs.apply(lambda x: any(isinstance(s, str) and 'random_acts_of_pizza' in s.lower() for s in x) if isinstance(x, (list, tuple)) else False).astype(np.int8)
    if 'requester_number_of_posts_on_raop_at_request' in df2.columns:
        df2['has_raop_post_hist'] = (pd.to_numeric(df['requester_number_of_posts_on_raop_at_request'], errors='coerce') > 0).astype(np.int8)
    if 'requester_number_of_comments_in_raop_at_request' in df2.columns:
        df2['has_raop_comment_hist'] = (pd.to_numeric(df['requester_number_of_comments_in_raop_at_request'], errors='coerce') > 0).astype(np.int8)
    age = pd.to_numeric(df.get('requester_account_age_in_days_at_request', np.nan), errors='coerce')
    comments = pd.to_numeric(df.get('requester_number_of_comments_at_request', np.nan), errors='coerce')
    posts = pd.to_numeric(df.get('requester_number_of_posts_at_request', np.nan), errors='coerce')
    df2['comments_per_day'] = comments / np.maximum(1.0, age)
    df2['posts_per_day'] = posts / np.maximum(1.0, age)
    # Account age buckets (one-hot): [0-30], (30-90], (90-365], >365 days
    age_days = pd.to_numeric(df.get('requester_account_age_in_days_at_request', np.nan), errors='coerce').fillna(0).astype(float)
    df2['age_bin_0_30'] = (age_days <= 30).astype(np.int8)
    df2['age_bin_30_90'] = ((age_days > 30) & (age_days <= 90)).astype(np.int8)
    df2['age_bin_90_365'] = ((age_days > 90) & (age_days <= 365)).astype(np.int8)
    df2['age_bin_365p'] = (age_days > 365).astype(np.int8)
    for c in ['requester_number_of_comments_at_request','requester_number_of_posts_at_request','requester_number_of_comments_in_raop_at_request','requester_number_of_posts_on_raop_at_request','requester_upvotes_plus_downvotes_at_request','text_len','word_count','exclaim_count','question_count','url_count','subs_len','comments_per_day','posts_per_day']:
        if c in df2.columns:
            df2[c] = np.log1p(pd.to_numeric(df2[c], errors='coerce').clip(lower=0))
    if 'requester_upvotes_minus_downvotes_at_request' in df2.columns:
        df2['requester_upvotes_minus_downvotes_at_request'] = safe_log1p_signed(df2['requester_upvotes_minus_downvotes_at_request'])
    df2 = df2.apply(pd.to_numeric, errors='coerce')
    df2 = df2.replace([np.inf, -np.inf], np.nan)
    return df2

def rebuild_combined_text(df: pd.DataFrame) -> pd.Series:
    parts = []
    if 'request_title' in df.columns:
        parts.append(df['request_title'].fillna(''))
    elif 'title' in df.columns:
        parts.append(df['title'].fillna(''))
    if 'request_text_edit_aware' in df.columns:
        parts.append(df['request_text_edit_aware'].fillna(''))
    elif 'request_text' in df.columns:
        parts.append(df['request_text'].fillna(''))
    if parts:
        return (parts[0].astype(str) + ' \n ' + parts[1].astype(str)) if len(parts) > 1 else parts[0].astype(str)
    # Keep the caller's index so the result aligns on assignment
    return pd.Series([''] * len(df), index=df.index)

# Rebuild combined text
train['_combined_text'] = rebuild_combined_text(train)
test['_combined_text'] = rebuild_combined_text(test)

y = train[target_col].astype(int).values
n_splits = 5
cv = list(StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42).split(train, y))
print(f'Prepared {n_splits}-fold StratifiedKFold CV (shuffled).')

# Vectorizer parameters (reduced capacity, no stopwords, min_df=3)
word_params = dict(analyzer='word', ngram_range=(1,2), lowercase=True, min_df=3, max_features=50000, sublinear_tf=True, smooth_idf=True, norm='l2')
char_params = dict(analyzer='char_wb', ngram_range=(3,5), lowercase=True, min_df=3, max_features=50000, sublinear_tf=True, smooth_idf=True, norm='l2')

# Precompute test meta once
meta_te_full = build_meta_features(test).replace([np.inf, -np.inf], np.nan).fillna(0).astype(np.float32)
te_text = test['_combined_text'].astype(str)

# Prepare OOF/test containers for grid
Cs = [0.5, 1.0, 2.0]
class_weights = [None, 'balanced']
oof_by_cfg = {(C,cw): np.zeros(len(train), dtype=np.float32) for C in Cs for cw in class_weights}
test_preds_by_cfg = {(C,cw): [] for C in Cs for cw in class_weights}

for fold, (tr_idx, va_idx) in enumerate(cv):
    t_fold = time.time()
    print(f'Fold {fold+1}/{n_splits} - train {len(tr_idx)} va {len(va_idx)}')
    sys.stdout.flush()
    # StratifiedKFold yields positional indices, so index with .iloc rather than label-based .loc
    tr_text = train['_combined_text'].iloc[tr_idx].astype(str)
    va_text = train['_combined_text'].iloc[va_idx].astype(str)

    # Vectorize train split only
    tfidf_w = TfidfVectorizer(**word_params)
    Xw_tr = tfidf_w.fit_transform(tr_text)
    Xw_va = tfidf_w.transform(va_text)
    Xw_te = tfidf_w.transform(te_text)

    tfidf_c = TfidfVectorizer(**char_params)
    Xc_tr = tfidf_c.fit_transform(tr_text)
    Xc_va = tfidf_c.transform(va_text)
    Xc_te = tfidf_c.transform(te_text)

    # Meta features
    meta_tr = build_meta_features(train.iloc[tr_idx])
    meta_va = build_meta_features(train.iloc[va_idx])
    meta_tr = meta_tr.replace([np.inf, -np.inf], np.nan).fillna(0).astype(np.float32)
    meta_va = meta_va.replace([np.inf, -np.inf], np.nan).fillna(0).astype(np.float32)
    scaler = StandardScaler(with_mean=False)
    Xm_tr = scaler.fit_transform(meta_tr)
    Xm_va = scaler.transform(meta_va)
    Xm_te = scaler.transform(meta_te_full)

    # Stack sparse matrices
    X_tr = hstack([Xw_tr, Xc_tr, Xm_tr], format='csr')
    X_va = hstack([Xw_va, Xc_va, Xm_va], format='csr')
    X_te = hstack([Xw_te, Xc_te, Xm_te], format='csr')

    # Train/eval per config
    for C in Cs:
        for cw in class_weights:
            clf = LogisticRegression(solver='saga', penalty='l2', C=C, max_iter=4000, n_jobs=-1, class_weight=cw, random_state=42, verbose=0)
            clf.fit(X_tr, y[tr_idx])
            va_pred = clf.predict_proba(X_va)[:,1].astype(np.float32)
            oof_by_cfg[(C,cw)][va_idx] = va_pred
            te_pred = clf.predict_proba(X_te)[:,1].astype(np.float32)
            test_preds_by_cfg[(C,cw)].append(te_pred)

    print(f'Fold {fold+1} done in {time.time()-t_fold:.1f}s')
    sys.stdout.flush()
    # Cleanup per fold
    del Xw_tr, Xw_va, Xw_te, Xc_tr, Xc_va, Xc_te, Xm_tr, Xm_va, Xm_te, X_tr, X_va, X_te
    gc.collect()

# Evaluate OOF per config and pick best
auc_per_cfg = {}
for key in oof_by_cfg:
    auc = roc_auc_score(y, oof_by_cfg[key])
    auc_per_cfg[key] = auc
    C, cw = key
    print(f'C={C}, class_weight={cw} OOF AUC: {auc:.5f}')
best_cfg = max(auc_per_cfg, key=auc_per_cfg.get)
print(f'Best cfg: C={best_cfg[0]}, class_weight={best_cfg[1]} with OOF AUC {auc_per_cfg[best_cfg]:.5f}')

# Average test preds for best cfg
best_test_stack = np.mean(test_preds_by_cfg[best_cfg], axis=0).astype(np.float32)

# Build submission
sub = pd.DataFrame({id_col: test[id_col].values, target_col: best_test_stack})
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv; head:')
print(sub.head())

Prepared 5-fold StratifiedKFold CV (shuffled).
Fold 1/5 - train 2302 va 576
Fold 1 done in 270.9s
Fold 2/5 - train 2302 va 576
Fold 2 done in 271.3s
Fold 3/5 - train 2302 va 576
Fold 3 done in 369.2s
Fold 4/5 - train 2303 va 575
Fold 4 done in 286.9s
Fold 5/5 - train 2303 va 575
Fold 5 done in 384.8s
C=0.5, class_weight=None OOF AUC: 0.67348
C=0.5, class_weight=balanced OOF AUC: 0.67366
C=1.0, class_weight=None OOF AUC: 0.67357
C=1.0, class_weight=balanced OOF AUC: 0.67261
C=2.0, class_weight=None OOF AUC: 0.67009
C=2.0, class_weight=balanced OOF AUC: 0.66902
Best cfg: C=0.5, class_weight=balanced with OOF AUC 0.67366
Saved submission.csv; head:
  request_id  requester_received_pizza
0  t3_1aw5zf                  0.492938
1   t3_roiuw                  0.341476
2   t3_mjnbq                  0.313049
3   t3_t8wd1                  0.514664
4  t3_1m4zxu                  0.515675
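
The plan calls for reporting the AUC mean/std across folds, while the cell above prints only the pooled OOF AUC. A small helper along these lines could fill that gap; `fold_auc_summary` is a hypothetical name, and the synthetic arrays stand in for `y`, an entry of `oof_by_cfg`, and `cv`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def fold_auc_summary(y, oof, folds):
    """Per-fold validation AUCs plus their mean and standard deviation.

    folds: iterable of (train_idx, valid_idx) pairs, as produced by StratifiedKFold.split.
    """
    aucs = np.array([roc_auc_score(y[va], oof[va]) for _, va in folds])
    return aucs, float(aucs.mean()), float(aucs.std())


# Synthetic stand-ins (not real notebook output)
rng = np.random.default_rng(0)
y_syn = np.tile([0, 1], 250)
oof_syn = 0.3 * y_syn + rng.uniform(size=len(y_syn))
all_idx = np.arange(len(y_syn))
folds_syn = [(np.setdiff1d(all_idx, va), va)
             for va in np.array_split(rng.permutation(len(y_syn)), 5)]

fold_aucs, auc_mean, auc_std = fold_auc_summary(y_syn, oof_syn, folds_syn)
print(f"AUC per fold: {np.round(fold_aucs, 4)} | mean {auc_mean:.4f} | std {auc_std:.4f}")
```

The fold-to-fold spread gives a rough noise floor for judging whether a difference between two configurations is meaningful.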


In [10]:
# Alternate model: TF-IDF (50k+50k) + Meta with LR and Calibrated LinearSVC; add NB-SVM and OOF blend
import time, gc, sys
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import roc_auc_score
from scipy.sparse import hstack

assert '_combined_text' in train.columns and '_combined_text' in test.columns, 'Run data load cell first'

# Vectorizer params matching the earlier, better-scoring run
word_params = dict(analyzer='word', ngram_range=(1,2), lowercase=True, min_df=3, max_features=50000, sublinear_tf=True, smooth_idf=True, norm='l2')
char_params = dict(analyzer='char_wb', ngram_range=(3,5), lowercase=True, min_df=3, max_features=50000, sublinear_tf=True, smooth_idf=True, norm='l2')

y = train[target_col].astype(int).values
n_splits = 5
cv = list(StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42).split(train, y))
print(f'Prepared {n_splits}-fold StratifiedKFold CV (shuffled).')

# Precompute test meta
meta_te_full = build_meta_features(test).replace([np.inf, -np.inf], np.nan).fillna(0).astype(np.float32)
te_text = test['_combined_text'].astype(str)

oof_lr = np.zeros(len(train), dtype=np.float32)
oof_svc = np.zeros(len(train), dtype=np.float32)
oof_nbsvm = np.zeros(len(train), dtype=np.float32)
test_lr_folds, test_svc_folds, test_nbsvm_folds = [], [], []

for fold, (tr_idx, va_idx) in enumerate(cv):
    t0 = time.time()
    print(f'Fold {fold+1}/{n_splits} - train {len(tr_idx)} va {len(va_idx)}')
    # StratifiedKFold yields positional indices, so index with .iloc rather than label-based .loc
    tr_text = train['_combined_text'].iloc[tr_idx].astype(str)
    va_text = train['_combined_text'].iloc[va_idx].astype(str)

    # Vectorizers fit on train split only
    tfidf_w = TfidfVectorizer(**word_params)
    Xw_tr = tfidf_w.fit_transform(tr_text)
    Xw_va = tfidf_w.transform(va_text)
    Xw_te = tfidf_w.transform(te_text)

    tfidf_c = TfidfVectorizer(**char_params)
    Xc_tr = tfidf_c.fit_transform(tr_text)
    Xc_va = tfidf_c.transform(va_text)
    Xc_te = tfidf_c.transform(te_text)

    # Meta features
    meta_tr = build_meta_features(train.iloc[tr_idx]).replace([np.inf, -np.inf], np.nan).fillna(0).astype(np.float32)
    meta_va = build_meta_features(train.iloc[va_idx]).replace([np.inf, -np.inf], np.nan).fillna(0).astype(np.float32)
    scaler = StandardScaler(with_mean=False)
    Xm_tr = scaler.fit_transform(meta_tr)
    Xm_va = scaler.transform(meta_va)
    Xm_te = scaler.transform(meta_te_full)

    # Stack
    X_tr = hstack([Xw_tr, Xc_tr, Xm_tr], format='csr')
    X_va = hstack([Xw_va, Xc_va, Xm_va], format='csr')
    X_te = hstack([Xw_te, Xc_te, Xm_te], format='csr')

    # Model 1: Logistic Regression (class_weight=None per expert advice)
    lr = LogisticRegression(solver='saga', penalty='l2', C=1.0, max_iter=4000, n_jobs=-1, class_weight=None, random_state=42, verbose=0)
    lr.fit(X_tr, y[tr_idx])
    va_lr = lr.predict_proba(X_va)[:,1].astype(np.float32)
    oof_lr[va_idx] = va_lr
    te_lr = lr.predict_proba(X_te)[:,1].astype(np.float32)
    test_lr_folds.append(te_lr)

    # Model 2: LinearSVC + calibration (sigmoid, cv=3)
    svc_base = LinearSVC(C=1.0, max_iter=5000, dual=True, random_state=42)
    svc = CalibratedClassifierCV(svc_base, method='sigmoid', cv=3)
    svc.fit(X_tr, y[tr_idx])
    va_svc = svc.predict_proba(X_va)[:,1].astype(np.float32)
    oof_svc[va_idx] = va_svc
    te_svc = svc.predict_proba(X_te)[:,1].astype(np.float32)
    test_svc_folds.append(te_svc)

    # Model 3: NB-SVM-style model on word (1,2) counts only: log-count-ratio scaling with LR as the classifier
    cv_counts = CountVectorizer(analyzer='word', ngram_range=(1,2), lowercase=True, min_df=3, max_features=50000)
    Xc_tr_counts = cv_counts.fit_transform(tr_text)
    Xc_va_counts = cv_counts.transform(va_text)
    Xc_te_counts = cv_counts.transform(te_text)
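    # Compute log-count ratios from smoothed class count sums
    # (note: the count vectors are not L1-normalized here, unlike the standard NB-SVM ratio)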
    y_tr = y[tr_idx]
    alpha = 1.0
    pos_mask = (y_tr == 1)
    neg_mask = (y_tr == 0)
    p_pos = Xc_tr_counts[pos_mask].sum(axis=0) + alpha
    p_neg = Xc_tr_counts[neg_mask].sum(axis=0) + alpha
    r = np.log((p_pos / p_neg)).A1  # 1D array
    # Transform
    Xnb_tr = Xc_tr_counts.multiply(r)
    Xnb_va = Xc_va_counts.multiply(r)
    Xnb_te = Xc_te_counts.multiply(r)
    nb_lr = LogisticRegression(solver='liblinear', penalty='l2', C=0.5, max_iter=2000, class_weight=None, random_state=42)
    nb_lr.fit(Xnb_tr, y_tr)
    va_nb = nb_lr.predict_proba(Xnb_va)[:,1].astype(np.float32)
    oof_nbsvm[va_idx] = va_nb
    te_nb = nb_lr.predict_proba(Xnb_te)[:,1].astype(np.float32)
    test_nbsvm_folds.append(te_nb)

    auc_lr = roc_auc_score(y[va_idx], va_lr)
    auc_svc = roc_auc_score(y[va_idx], va_svc)
    auc_nb = roc_auc_score(y[va_idx], va_nb)
    print(f'Fold {fold+1} AUCs | LR: {auc_lr:.5f} | SVC: {auc_svc:.5f} | NB-SVM: {auc_nb:.5f} | time {time.time()-t0:.1f}s')
    sys.stdout.flush()

    del Xw_tr, Xw_va, Xw_te, Xc_tr, Xc_va, Xc_te, Xm_tr, Xm_va, Xm_te, X_tr, X_va, X_te, lr, svc, cv_counts, Xc_tr_counts, Xc_va_counts, Xc_te_counts, nb_lr, Xnb_tr, Xnb_va, Xnb_te
    gc.collect()

# OOF scores
auc_lr_oof = roc_auc_score(y, oof_lr)
auc_svc_oof = roc_auc_score(y, oof_svc)
auc_nb_oof = roc_auc_score(y, oof_nbsvm)
print(f'OOF AUC | LR: {auc_lr_oof:.5f} | SVC: {auc_svc_oof:.5f} | NB-SVM: {auc_nb_oof:.5f}')

# 3-model blend weight search on OOF (w1,w2,w3 sum=1; step=0.02)
best_auc, best_w = -1.0, (1.0, 0.0, 0.0)
grid = np.linspace(0.0, 1.0, 51)
for w1 in grid:
    for w2 in grid:
        if w1 + w2 > 1.0:
            continue
        w3 = 1.0 - w1 - w2
        blend = w1*oof_lr + w2*oof_svc + w3*oof_nbsvm
        auc = roc_auc_score(y, blend)
        if auc > best_auc:
            best_auc, best_w = auc, (float(w1), float(w2), float(w3))
print(f'Best blend w(LR,SVC,NB)={best_w} OOF AUC: {best_auc:.5f}')

# Build test preds using best weights
test_lr = np.mean(test_lr_folds, axis=0).astype(np.float32)
test_svc = np.mean(test_svc_folds, axis=0).astype(np.float32)
test_nb = np.mean(test_nbsvm_folds, axis=0).astype(np.float32)
test_pred = best_w[0]*test_lr + best_w[1]*test_svc + best_w[2]*test_nb

sub = pd.DataFrame({id_col: test[id_col].values, target_col: test_pred})
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv; head:')
print(sub.head())

Prepared 5-fold StratifiedKFold CV (shuffled).
Fold 1/5 - train 2302 va 576
Fold 1 AUCs | LR: 0.67317 | SVC: 0.64932 | NB-SVM: 0.57840 | time 73.6s
Fold 2/5 - train 2302 va 576
Fold 2 AUCs | LR: 0.68112 | SVC: 0.66406 | NB-SVM: 0.57874 | time 74.0s
Fold 3/5 - train 2302 va 576
Fold 3 AUCs | LR: 0.68007 | SVC: 0.66934 | NB-SVM: 0.57948 | time 101.8s
Fold 4/5 - train 2303 va 575
Fold 4 AUCs | LR: 0.65351 | SVC: 0.63831 | NB-SVM: 0.51811 | time 79.4s
Fold 5/5 - train 2303 va 575
Fold 5 AUCs | LR: 0.68030 | SVC: 0.65694 | NB-SVM: 0.58832 | time 92.9s
OOF AUC | LR: 0.67357 | SVC: 0.65506 | NB-SVM: 0.56847
Best blend w(LR,SVC,NB)=(1.0, 0.0, 0.0) OOF AUC: 0.67357
Saved submission.csv; head:
  request_id  requester_received_pizza
0  t3_1aw5zf                  0.271767
1   t3_roiuw                  0.159797
2   t3_mjnbq                  0.145530
3   t3_t8wd1                  0.303928
4  t3_1m4zxu                  0.274813
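
The log-count ratio in the cell above divides the smoothed class count vectors directly. The usual NB-SVM formulation first L1-normalizes each vector, which matters when classes are imbalanced, since otherwise the larger class tends to push most ratios negative. A minimal dense-array sketch of the normalized ratio follows; `log_count_ratio` and the toy matrix are illustrative assumptions, not the notebook's code.

```python
import numpy as np


def log_count_ratio(X, y, alpha=1.0):
    """Normalized NB log-count ratio: log((p/||p||_1) / (q/||q||_1)).

    p and q are the alpha-smoothed feature count sums over positive and negative rows.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    p = alpha + X[y == 1].sum(axis=0)
    q = alpha + X[y == 0].sum(axis=0)
    return np.log((p / p.sum()) / (q / q.sum()))


# Toy counts: feature 0 appears only in the positive row, feature 1 only in negatives
X_toy = np.array([[2, 0],
                  [0, 2],
                  [0, 2],
                  [0, 2]])
y_toy = np.array([1, 0, 0, 0])
r_toy = log_count_ratio(X_toy, y_toy)
```

Here p = [3, 1] and q = [1, 7], so the normalized ratios are log(0.75/0.125) and log(0.25/0.875): positive for the feature tied to the positive class and negative for the other.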


In [11]:
# Strong LR baseline: split title/body TF-IDF, upweight title, char_wb(3,6), meta with age buckets; grid C
import time, gc, sys
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from scipy.sparse import hstack

assert 'request_title' in train.columns and ('request_text_edit_aware' in train.columns or 'request_text' in train.columns)

def get_title_series(df):
    return df['request_title'].fillna('').astype(str) if 'request_title' in df.columns else pd.Series(['']*len(df), index=df.index)

def get_body_series(df):
    if 'request_text_edit_aware' in df.columns:
        return df['request_text_edit_aware'].fillna('').astype(str)
    elif 'request_text' in df.columns:
        return df['request_text'].fillna('').astype(str)
    else:
        return pd.Series(['']*len(df), index=df.index)

# Params per expert prescription
title_word_params = dict(analyzer='word', ngram_range=(1,2), lowercase=True, min_df=3, max_features=30000, sublinear_tf=True, smooth_idf=True, norm='l2')
body_word_params  = dict(analyzer='word', ngram_range=(1,2), lowercase=True, min_df=3, max_features=70000, sublinear_tf=True, smooth_idf=True, norm='l2')
char_params       = dict(analyzer='char_wb', ngram_range=(3,6), lowercase=True, min_df=3, max_features=50000, sublinear_tf=True, smooth_idf=True, norm='l2')

y = train[target_col].astype(int).values
n_splits = 5
cv = list(StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42).split(train, y))
print(f'Prepared {n_splits}-fold StratifiedKFold CV (shuffled).')

# Precompute test texts and meta
te_title = get_title_series(test)
te_body = get_body_series(test)
te_concat = (te_title + ' ' + te_body).astype(str)
meta_te_full = build_meta_features(test).replace([np.inf, -np.inf], np.nan).fillna(0).astype(np.float32)

Cs = [0.5, 1.0, 2.0, 4.0]
oof_by_C = {C: np.zeros(len(train), dtype=np.float32) for C in Cs}
test_preds_by_C = {C: [] for C in Cs}

for fold, (tr_idx, va_idx) in enumerate(cv):
    t0 = time.time()
    print(f'Fold {fold+1}/{n_splits} - train {len(tr_idx)} va {len(va_idx)}')
    sys.stdout.flush()

    # StratifiedKFold yields positional indices, so index with .iloc rather than label-based .loc
    tr_title = get_title_series(train.iloc[tr_idx])
    va_title = get_title_series(train.iloc[va_idx])
    tr_body  = get_body_series(train.iloc[tr_idx])
    va_body  = get_body_series(train.iloc[va_idx])
    tr_concat = (tr_title + ' ' + tr_body).astype(str)
    va_concat = (va_title + ' ' + va_body).astype(str)

    # Fit vectorizers on train split only
    tfidf_title = TfidfVectorizer(**title_word_params)
    Xt_tr = tfidf_title.fit_transform(tr_title)
    Xt_va = tfidf_title.transform(va_title)
    Xt_te = tfidf_title.transform(te_title)

    tfidf_body = TfidfVectorizer(**body_word_params)
    Xb_tr = tfidf_body.fit_transform(tr_body)
    Xb_va = tfidf_body.transform(va_body)
    Xb_te = tfidf_body.transform(te_body)

    tfidf_char = TfidfVectorizer(**char_params)
    Xc_tr = tfidf_char.fit_transform(tr_concat)
    Xc_va = tfidf_char.transform(va_concat)
    Xc_te = tfidf_char.transform(te_concat)

    # Meta features per fold
    meta_tr = build_meta_features(train.iloc[tr_idx]).replace([np.inf, -np.inf], np.nan).fillna(0).astype(np.float32)
    meta_va = build_meta_features(train.iloc[va_idx]).replace([np.inf, -np.inf], np.nan).fillna(0).astype(np.float32)
    scaler = StandardScaler(with_mean=False)
    Xm_tr = scaler.fit_transform(meta_tr)
    Xm_va = scaler.transform(meta_va)
    Xm_te = scaler.transform(meta_te_full)

    # Upweight title by 2.0 before stacking
    Xt_tr = Xt_tr.multiply(2.0)
    Xt_va = Xt_va.multiply(2.0)
    Xt_te = Xt_te.multiply(2.0)

    X_tr = hstack([Xt_tr, Xb_tr, Xc_tr, Xm_tr], format='csr')
    X_va = hstack([Xt_va, Xb_va, Xc_va, Xm_va], format='csr')
    X_te = hstack([Xt_te, Xb_te, Xc_te, Xm_te], format='csr')

    # Train LR for each C; class_weight=None
    for C in Cs:
        lr = LogisticRegression(solver='saga', penalty='l2', C=C, class_weight=None, max_iter=5000, n_jobs=-1, random_state=42, verbose=0)
        lr.fit(X_tr, y[tr_idx])
        va_pred = lr.predict_proba(X_va)[:,1].astype(np.float32)
        oof_by_C[C][va_idx] = va_pred
        te_pred = lr.predict_proba(X_te)[:,1].astype(np.float32)
        test_preds_by_C[C].append(te_pred)

    # Per-fold diagnostics at C=1.0
    fold_auc = roc_auc_score(y[va_idx], oof_by_C[1.0][va_idx])
    print(f'Fold {fold+1} done in {time.time()-t0:.1f}s | AUC@C=1.0: {fold_auc:.5f}')
    sys.stdout.flush()

    del Xt_tr, Xt_va, Xt_te, Xb_tr, Xb_va, Xb_te, Xc_tr, Xc_va, Xc_te, Xm_tr, Xm_va, Xm_te, X_tr, X_va, X_te, lr
    gc.collect()

# Evaluate OOF per C and pick best
auc_per_C = {}
for C in Cs:
    auc = roc_auc_score(y, oof_by_C[C])
    auc_per_C[C] = auc
    print(f'C={C} OOF AUC: {auc:.5f}')
best_C = max(auc_per_C, key=auc_per_C.get)
print(f'Best C: {best_C} with OOF AUC {auc_per_C[best_C]:.5f}')

# Average test preds for best C
best_test = np.mean(test_preds_by_C[best_C], axis=0).astype(np.float32)

sub = pd.DataFrame({id_col: test[id_col].values, target_col: best_test})
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv; head:')
print(sub.head())

Prepared 5-fold StratifiedKFold CV (shuffled).
Fold 1/5 - train 2302 va 576
Fold 1 done in 228.5s | AUC@C=1.0: 0.67845
Fold 2/5 - train 2302 va 576
Fold 2 done in 238.2s | AUC@C=1.0: 0.68102
Fold 3/5 - train 2302 va 576
Fold 3 done in 300.9s | AUC@C=1.0: 0.68588
Fold 4/5 - train 2303 va 575
Fold 4 done in 231.2s | AUC@C=1.0: 0.66409
Fold 5/5 - train 2303 va 575
Fold 5 done in 278.8s | AUC@C=1.0: 0.64261
C=0.5 OOF AUC: 0.67283
C=1.0 OOF AUC: 0.66993
C=2.0 OOF AUC: 0.66629
C=4.0 OOF AUC: 0.66307
Best C: 0.5 with OOF AUC 0.67283
Saved submission.csv; head:
  request_id  requester_received_pizza
0  t3_1aw5zf                  0.223368
1   t3_roiuw                  0.154929
2   t3_mjnbq                  0.110799
3   t3_t8wd1                  0.213140
4  t3_1m4zxu                  0.211933
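
The plan calls for reporting OOF AUC mean/std, while the cell above prints only the pooled OOF AUC per C and the per-fold AUC at C=1.0. A minimal sketch of a helper that reports both views is given below. The name `summarize_fold_auc` and the toy arrays are assumptions for illustration, not part of the notebook; `folds` is expected in the same `(tr_idx, va_idx)` form as `cv`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def summarize_fold_auc(y_true, oof_pred, folds):
    """Return (pooled_auc, mean_fold_auc, std_fold_auc) for one OOF vector.

    Hypothetical helper: `folds` is a list of (train_idx, valid_idx) pairs.
    """
    y_true = np.asarray(y_true)
    oof_pred = np.asarray(oof_pred)
    fold_aucs = [roc_auc_score(y_true[va_idx], oof_pred[va_idx]) for _, va_idx in folds]
    pooled = roc_auc_score(y_true, oof_pred)
    return pooled, float(np.mean(fold_aucs)), float(np.std(fold_aucs))

# Toy demonstration on synthetic labels and scores (not RAOP data).
rng = np.random.default_rng(0)
y_demo = np.tile([0, 1], 100)                  # 200 alternating labels
oof_demo = 0.3 * y_demo + rng.random(200)      # weakly informative scores
folds_demo = [(None, np.arange(i, 200, 5)) for i in range(5)]

pooled, mean_auc, std_auc = summarize_fold_auc(y_demo, oof_demo, folds_demo)
print(f'pooled={pooled:.4f} mean={mean_auc:.4f} std={std_auc:.4f}')
```

In the notebook this could be called as `summarize_fold_auc(y, oof_by_C[C], cv)` for each C. For the run above, the five fold AUCs at C=1.0 average to roughly 0.6704 against a pooled OOF AUC of 0.66993, so the two numbers can differ slightly.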


In [12]:
# Two-view LR: word-only (title*2 + body) and char-only; OOF blend
import time, gc, sys, numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from scipy.sparse import hstack

def get_title_series(df):
    return df['request_title'].fillna('').astype(str) if 'request_title' in df.columns else pd.Series(['']*len(df), index=df.index)

def get_body_series(df):
    if 'request_text_edit_aware' in df.columns:
        return df['request_text_edit_aware'].fillna('').astype(str)
    elif 'request_text' in df.columns:
        return df['request_text'].fillna('').astype(str)
    else:
        return pd.Series(['']*len(df), index=df.index)

# Vectorizer params
title_word_params = dict(analyzer='word', ngram_range=(1,2), lowercase=True, min_df=3, max_features=30000, sublinear_tf=True, smooth_idf=True, norm='l2')
body_word_params  = dict(analyzer='word', ngram_range=(1,2), lowercase=True, min_df=3, max_features=70000, sublinear_tf=True, smooth_idf=True, norm='l2')
char_params       = dict(analyzer='char_wb', ngram_range=(3,6), lowercase=True, min_df=3, max_features=50000, sublinear_tf=True, smooth_idf=True, norm='l2')

y = train[target_col].astype(int).values
n_splits = 5
cv = list(StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42).split(train, y))
print(f'Prepared {n_splits}-fold StratifiedKFold CV (shuffled).')

te_title = get_title_series(test)
te_body = get_body_series(test)
te_concat = (te_title + ' ' + te_body).astype(str)

Cs = [0.5, 1.0, 2.0]
best_auc_overall = -1.0
best_C_word = None
best_C_char = None
best_w = 0.0
best_test_pred = None

# We'll evaluate grid over C for word and char separately and blend OOF with weight w
for Cw in Cs:
    for Cc in Cs:
        oof_word = np.zeros(len(train), dtype=np.float32)
        oof_char = np.zeros(len(train), dtype=np.float32)
        te_word_folds, te_char_folds = [], []

        for fold, (tr_idx, va_idx) in enumerate(cv):
            t0 = time.time()
            tr_title = get_title_series(train.iloc[tr_idx])
            va_title = get_title_series(train.iloc[va_idx])
            tr_body  = get_body_series(train.iloc[tr_idx])
            va_body  = get_body_series(train.iloc[va_idx])
            tr_concat = (tr_title + ' ' + tr_body).astype(str)
            va_concat = (va_title + ' ' + va_body).astype(str)

            # Word view
            tfidf_title = TfidfVectorizer(**title_word_params)
            Xt_tr = tfidf_title.fit_transform(tr_title)
            Xt_va = tfidf_title.transform(va_title)
            Xt_te = tfidf_title.transform(te_title)
            tfidf_body = TfidfVectorizer(**body_word_params)
            Xb_tr = tfidf_body.fit_transform(tr_body)
            Xb_va = tfidf_body.transform(va_body)
            Xb_te = tfidf_body.transform(te_body)
            Xt_tr = Xt_tr.multiply(2.0); Xt_va = Xt_va.multiply(2.0); Xt_te = Xt_te.multiply(2.0)
            Xw_tr = hstack([Xt_tr, Xb_tr], format='csr')
            Xw_va = hstack([Xt_va, Xb_va], format='csr')
            Xw_te = hstack([Xt_te, Xb_te], format='csr')
            lr_w = LogisticRegression(solver='saga', penalty='l2', C=Cw, class_weight=None, max_iter=5000, n_jobs=-1, random_state=42, verbose=0)
            lr_w.fit(Xw_tr, y[tr_idx])
            oof_word[va_idx] = lr_w.predict_proba(Xw_va)[:,1].astype(np.float32)
            te_word_folds.append(lr_w.predict_proba(Xw_te)[:,1].astype(np.float32))

            # Char view
            tfidf_char = TfidfVectorizer(**char_params)
            Xc_tr = tfidf_char.fit_transform(tr_concat)
            Xc_va = tfidf_char.transform(va_concat)
            Xc_te = tfidf_char.transform(te_concat)
            lr_c = LogisticRegression(solver='saga', penalty='l2', C=Cc, class_weight=None, max_iter=5000, n_jobs=-1, random_state=42, verbose=0)
            lr_c.fit(Xc_tr, y[tr_idx])
            oof_char[va_idx] = lr_c.predict_proba(Xc_va)[:,1].astype(np.float32)
            te_char_folds.append(lr_c.predict_proba(Xc_te)[:,1].astype(np.float32))

            print(f'Fold {fold+1}/{n_splits} Cw={Cw} Cc={Cc} done in {time.time()-t0:.1f}s')
            sys.stdout.flush()
            del Xt_tr, Xt_va, Xt_te, Xb_tr, Xb_va, Xb_te, Xw_tr, Xw_va, Xw_te, lr_w, tfidf_title, tfidf_body, Xc_tr, Xc_va, Xc_te, lr_c, tfidf_char
            gc.collect()

        # Blend weight search on pooled OOF (w is tuned on the same OOF it is scored on, so the AUC is slightly optimistic)
        local_best_auc, local_best_w = -1.0, 0.0
        for w in np.linspace(0.0, 1.0, 101):
            blend = w*oof_word + (1.0-w)*oof_char
            auc = roc_auc_score(y, blend)
            if auc > local_best_auc:
                local_best_auc, local_best_w = auc, float(w)
        print(f'Cw={Cw} Cc={Cc} best blend w={local_best_w:.2f} OOF AUC: {local_best_auc:.5f}')

        if local_best_auc > best_auc_overall:
            best_auc_overall = local_best_auc
            best_C_word, best_C_char, best_w = Cw, Cc, local_best_w
            te_word = np.mean(te_word_folds, axis=0).astype(np.float32)
            te_char = np.mean(te_char_folds, axis=0).astype(np.float32)
            best_test_pred = best_w*te_word + (1.0-best_w)*te_char

print(f'Best setting: Cw={best_C_word}, Cc={best_C_char}, w={best_w:.2f} with OOF AUC {best_auc_overall:.5f}')
sub = pd.DataFrame({id_col: test[id_col].values, target_col: best_test_pred})
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv; head:')
print(sub.head())

Prepared 5-fold StratifiedKFold CV (shuffled).
Fold 1/5 Cw=0.5 Cc=0.5 done in 10.3s
Fold 2/5 Cw=0.5 Cc=0.5 done in 10.2s
Fold 3/5 Cw=0.5 Cc=0.5 done in 10.3s
Fold 4/5 Cw=0.5 Cc=0.5 done in 9.2s
Fold 5/5 Cw=0.5 Cc=0.5 done in 10.1s
Cw=0.5 Cc=0.5 best blend w=0.20 OOF AUC: 0.64917
Fold 1/5 Cw=0.5 Cc=1.0 done in 11.2s
Fold 2/5 Cw=0.5 Cc=1.0 done in 11.1s
Fold 3/5 Cw=0.5 Cc=1.0 done in 11.2s
Fold 4/5 Cw=0.5 Cc=1.0 done in 10.3s
Fold 5/5 Cw=0.5 Cc=1.0 done in 11.0s
Cw=0.5 Cc=1.0 best blend w=0.28 OOF AUC: 0.64521
Fold 1/5 Cw=0.5 Cc=2.0 done in 12.6s
Fold 2/5 Cw=0.5 Cc=2.0 done in 12.5s
Fold 3/5 Cw=0.5 Cc=2.0 done in 12.8s
Fold 4/5 Cw=0.5 Cc=2.0 done in 11.9s
Fold 5/5 Cw=0.5 Cc=2.0 done in 12.5s
Cw=0.5 Cc=2.0 best blend w=0.37 OOF AUC: 0.63815
Fold 1/5 Cw=1.0 Cc=0.5 done in 10.5s
Fold 2/5 Cw=1.0 Cc=0.5 done in 10.4s
Fold 3/5 Cw=1.0 Cc=0.5 done in 10.6s
Fold 4/5 Cw=1.0 Cc=0.5 done in 9.7s
Fold 5/5 Cw=1.0 Cc=0.5 done in 10.6s
Cw=1.0 Cc=0.5 best blend w=0.17 OOF AUC: 0.64890
Fold 1/5 Cw=1.0 Cc=1.0 done in 11.5s
Fold 2/5 Cw=1.0 Cc=1.0 done in 11.5s
Fold 3/5 Cw=1.0 Cc=1.0 done in 11.4s
Fold 4/5 Cw=1.0 Cc=1.0 done in 10.2s
Fold 5/5 Cw=1.0 Cc=1.0 done in 11.2s
Cw=1.0 Cc=1.0 best blend w=0.23 OOF AUC: 0.64471
Fold 1/5 Cw=1.0 Cc=2.0 done in 13.0s
Fold 2/5 Cw=1.0 Cc=2.0 done in 12.6s
Fold 3/5 Cw=1.0 Cc=2.0 done in 12.7s
Fold 4/5 Cw=1.0 Cc=2.0 done in 11.8s
Fold 5/5 Cw=1.0 Cc=2.0 done in 12.5s
Cw=1.0 Cc=2.0 best blend w=0.30 OOF AUC: 0.63743
Fold 1/5 Cw=2.0 Cc=0.5 done in 10.8s
Fold 2/5 Cw=2.0 Cc=0.5 done in 10.7s
Fold 3/5 Cw=2.0 Cc=0.5 done in 10.8s
Fold 4/5 Cw=2.0 Cc=0.5 done in 9.7s
Fold 5/5 Cw=2.0 Cc=0.5 done in 10.4s
Cw=2.0 Cc=0.5 best blend w=0.12 OOF AUC: 0.64832
Fold 1/5 Cw=2.0 Cc=1.0 done in 11.4s
Fold 2/5 Cw=2.0 Cc=1.0 done in 11.4s
Fold 3/5 Cw=2.0 Cc=1.0 done in 11.6s
Fold 4/5 Cw=2.0 Cc=1.0 done in 10.4s
Fold 5/5 Cw=2.0 Cc=1.0 done in 11.4s
Cw=2.0 Cc=1.0 best blend w=0.18 OOF AUC: 0.64385
Fold 1/5 Cw=2.0 Cc=2.0 done in 13.2s
Fold 2/5 Cw=2.0 Cc=2.0 done in 12.8s
Fold 3/5 Cw=2.0 Cc=2.0 done in 13.3s
Fold 4/5 Cw=2.0 Cc=2.0 done in 12.8s
Fold 5/5 Cw=2.0 Cc=2.0 done in 13.2s
Cw=2.0 Cc=2.0 best blend w=0.24 OOF AUC: 0.63643
Best setting: Cw=0.5, Cc=0.5, w=0.20 with OOF AUC 0.64917
Saved submission.csv; head:
  request_id  requester_received_pizza
0  t3_1aw5zf                  0.256258
1   t3_roiuw                  0.245750
2   t3_mjnbq                  0.225583
3   t3_t8wd1                  0.235965
4  t3_1m4zxu                  0.203067
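
In the two-view cell above, the word model depends only on Cw and the char model only on Cc, yet both are refit for all nine (Cw, Cc) pairs. A possible alternative, sketched below under the assumption that one OOF vector per view and per C has already been cached, is to run the (Cw, Cc, w) search purely over those cached vectors. The function name `best_two_view_blend` and the synthetic inputs are hypothetical.

```python
import itertools
import numpy as np
from sklearn.metrics import roc_auc_score

def best_two_view_blend(y_true, oof_word_by_C, oof_char_by_C, weights=None):
    """Search (Cw, Cc, w) over cached OOF vectors; return (auc, Cw, Cc, w).

    Hypothetical helper: each dict maps a C value to a full-length OOF vector.
    """
    if weights is None:
        weights = np.linspace(0.0, 1.0, 101)
    best = (-1.0, None, None, 0.0)
    pairs = itertools.product(oof_word_by_C.items(), oof_char_by_C.items())
    for (Cw, oof_w), (Cc, oof_c) in pairs:
        for w in weights:
            auc = roc_auc_score(y_true, w * oof_w + (1.0 - w) * oof_c)
            if auc > best[0]:
                best = (auc, Cw, Cc, float(w))
    return best

# Synthetic stand-ins for the cached OOF vectors (not RAOP data).
rng = np.random.default_rng(0)
y_demo = np.tile([0, 1], 150)
Cs_demo = [0.5, 1.0, 2.0]
oof_word_demo = {C: 0.2 * y_demo + rng.random(300) for C in Cs_demo}
oof_char_demo = {C: 0.3 * y_demo + rng.random(300) for C in Cs_demo}

best_auc, best_Cw, best_Cc, best_w_demo = best_two_view_blend(y_demo, oof_word_demo, oof_char_demo)
print(f'Cw={best_Cw} Cc={best_Cc} w={best_w_demo:.2f} AUC={best_auc:.5f}')
```

With three C values per view this needs six model fits per fold instead of eighteen, and the blend search itself stays cheap because it only calls `roc_auc_score`.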


In [13]:
# Text-only LR: title*2 + body word TF-IDF and char_wb(3,6); grid C; no meta
import time, gc, sys, numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from scipy.sparse import hstack

def _get_title(df):
    return df['request_title'].fillna('').astype(str) if 'request_title' in df.columns else pd.Series(['']*len(df), index=df.index)

def _get_body(df):
    if 'request_text_edit_aware' in df.columns:
        return df['request_text_edit_aware'].fillna('').astype(str)
    elif 'request_text' in df.columns:
        return df['request_text'].fillna('').astype(str)
    else:
        return pd.Series(['']*len(df), index=df.index)

# Vectorizer params
title_word_params = dict(analyzer='word', ngram_range=(1,2), lowercase=True, min_df=3, max_features=30000, sublinear_tf=True, smooth_idf=True, norm='l2')
body_word_params  = dict(analyzer='word', ngram_range=(1,2), lowercase=True, min_df=3, max_features=70000, sublinear_tf=True, smooth_idf=True, norm='l2')
char_params       = dict(analyzer='char_wb', ngram_range=(3,6), lowercase=True, min_df=3, max_features=50000, sublinear_tf=True, smooth_idf=True, norm='l2')

y = train[target_col].astype(int).values
n_splits = 5
cv = list(StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42).split(train, y))
print(f'Prepared {n_splits}-fold StratifiedKFold CV (shuffled).')

te_title = _get_title(test)
te_body = _get_body(test)
te_concat = (te_title + ' ' + te_body).astype(str)

Cs = [0.5, 1.0, 2.0, 4.0]
oof_by_C = {C: np.zeros(len(train), dtype=np.float32) for C in Cs}
test_preds_by_C = {C: [] for C in Cs}

for fold, (tr_idx, va_idx) in enumerate(cv):
    t0 = time.time()
    print(f'Fold {fold+1}/{n_splits} - train {len(tr_idx)} va {len(va_idx)}')
    sys.stdout.flush()

    tr_title = _get_title(train.iloc[tr_idx])
    va_title = _get_title(train.iloc[va_idx])
    tr_body  = _get_body(train.iloc[tr_idx])
    va_body  = _get_body(train.iloc[va_idx])
    tr_concat = (tr_title + ' ' + tr_body).astype(str)
    va_concat = (va_title + ' ' + va_body).astype(str)

    # Fit vectorizers on train split only
    tfidf_title = TfidfVectorizer(**title_word_params)
    Xt_tr = tfidf_title.fit_transform(tr_title)
    Xt_va = tfidf_title.transform(va_title)
    Xt_te = tfidf_title.transform(te_title)

    tfidf_body = TfidfVectorizer(**body_word_params)
    Xb_tr = tfidf_body.fit_transform(tr_body)
    Xb_va = tfidf_body.transform(va_body)
    Xb_te = tfidf_body.transform(te_body)

    tfidf_char = TfidfVectorizer(**char_params)
    Xc_tr = tfidf_char.fit_transform(tr_concat)
    Xc_va = tfidf_char.transform(va_concat)
    Xc_te = tfidf_char.transform(te_concat)

    # Upweight title by 2.0 and stack (no meta)
    Xt_tr = Xt_tr.multiply(2.0); Xt_va = Xt_va.multiply(2.0); Xt_te = Xt_te.multiply(2.0)
    X_tr = hstack([Xt_tr, Xb_tr, Xc_tr], format='csr')
    X_va = hstack([Xt_va, Xb_va, Xc_va], format='csr')
    X_te = hstack([Xt_te, Xb_te, Xc_te], format='csr')

    for C in Cs:
        lr = LogisticRegression(solver='saga', penalty='l2', C=C, class_weight=None, max_iter=5000, n_jobs=-1, random_state=42, verbose=0)
        lr.fit(X_tr, y[tr_idx])
        oof_by_C[C][va_idx] = lr.predict_proba(X_va)[:,1].astype(np.float32)
        test_preds_by_C[C].append(lr.predict_proba(X_te)[:,1].astype(np.float32))

    print(f'Fold {fold+1} done in {time.time()-t0:.1f}s')
    sys.stdout.flush()
    del Xt_tr, Xt_va, Xt_te, Xb_tr, Xb_va, Xb_te, Xc_tr, Xc_va, Xc_te, X_tr, X_va, X_te, lr, tfidf_title, tfidf_body, tfidf_char
    gc.collect()

# Evaluate OOF per C and pick best
auc_per_C = {}
for C in Cs:
    auc = roc_auc_score(y, oof_by_C[C])
    auc_per_C[C] = auc
    print(f'C={C} OOF AUC: {auc:.5f}')
best_C = max(auc_per_C, key=auc_per_C.get)
print(f'Best C: {best_C} with OOF AUC {auc_per_C[best_C]:.5f}')

# Average test preds for best C
best_test = np.mean(test_preds_by_C[best_C], axis=0).astype(np.float32)
sub = pd.DataFrame({id_col: test[id_col].values, target_col: best_test})
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv; head:')
print(sub.head())

Prepared 5-fold StratifiedKFold CV (shuffled).
Fold 1/5 - train 2302 va 576
Fold 1 done in 70.0s
Fold 2/5 - train 2302 va 576
Fold 2 done in 66.5s
Fold 3/5 - train 2302 va 576
Fold 3 done in 65.5s
Fold 4/5 - train 2303 va 575
Fold 4 done in 66.4s
Fold 5/5 - train 2303 va 575
Fold 5 done in 65.6s
C=0.5 OOF AUC: 0.62205
C=1.0 OOF AUC: 0.61834
C=2.0 OOF AUC: 0.61370
C=4.0 OOF AUC: 0.60944
Best C: 0.5 with OOF AUC 0.62205
Saved submission.csv; head:
  request_id  requester_received_pizza
0  t3_1aw5zf                  0.205925
1   t3_roiuw                  0.243738
2   t3_mjnbq                  0.193898
3   t3_t8wd1                  0.156261
4  t3_1m4zxu                  0.154038
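
The plan mentions building a single ColumnTransformer, whereas the cells above fit each vectorizer by hand inside the fold loop. The sketch below shows one way the same per-fold fitting could be delegated to scikit-learn, so that `cross_val_predict` refits the vectorizers on each training split. It is an illustration on an invented toy frame, with simplified vectorizer settings, and the char view is applied to the body column only rather than to the concatenated title and body used in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline

# Toy frame standing in for `train`: real column names, invented rows.
toy = pd.DataFrame({
    'request_title': ['need pizza tonight', 'hungry student', 'lost my job', 'pizza for kids'] * 10,
    'request_text': ['will pay it forward', 'broke until friday', 'rent is due soon', 'thanks in advance'] * 10,
})
y_toy = np.array([0, 1, 0, 1] * 10)

# A string column selector hands each vectorizer a 1-D series of documents.
features = ColumnTransformer(
    [
        ('title', TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True), 'request_title'),
        ('body', TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True), 'request_text'),
        ('char', TfidfVectorizer(analyzer='char_wb', ngram_range=(3, 6), sublinear_tf=True), 'request_text'),
    ],
    transformer_weights={'title': 2.0},  # mirrors the title*2 upweighting
)
pipe = make_pipeline(features, LogisticRegression(C=0.5, max_iter=1000))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# The whole pipeline, vectorizers included, is refit on every training split.
oof_toy = cross_val_predict(pipe, toy, y_toy, cv=skf, method='predict_proba')[:, 1]
print(oof_toy[:4])
```

The trade-off is that the vectorizers are refit for every hyperparameter value, whereas the manual loop above builds the matrices once per fold and reuses them across the C grid.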


In [14]:
# Pivot run: cleaned combined text + high-cap TF-IDF (word, char), subreddit TF-IDF, minimal meta; LR(saga) C grid
import time, gc, sys, re, numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from scipy.sparse import hstack, csr_matrix

def combine_raw_text(df):
    title = df['request_title'].fillna('').astype(str) if 'request_title' in df.columns else pd.Series(['']*len(df), index=df.index)
    if 'request_text_edit_aware' in df.columns:
        body = df['request_text_edit_aware'].fillna('').astype(str)
    elif 'request_text' in df.columns:
        body = df['request_text'].fillna('').astype(str)
    else:
        body = pd.Series(['']*len(df), index=df.index)
    return (title + '\n' + body).astype(str)

url_pat = re.compile(r'https?://\S+|www\.\S+')
num_pat = re.compile(r'\d+')
repeat_pat = re.compile(r'(.)\1{3,}')
def clean_text_series(s: pd.Series) -> pd.Series:
    def _clean(t: str) -> str:
        t = t.lower()
        t = url_pat.sub(' url_token ', t)
        t = num_pat.sub(' num_token ', t)
        t = repeat_pat.sub(lambda m: m.group(1)*3, t)
        return t
    return s.fillna('').astype(str).map(_clean)

def build_subreddit_text(df: pd.DataFrame) -> pd.Series:
    if 'requester_subreddits_at_request' not in df.columns:
        return pd.Series(['']*len(df), index=df.index)
    def joiner(x):
        if isinstance(x, (list, tuple)):
            return ' '.join([str(s).lower() for s in x if isinstance(s, str)])
        return ''
    return df['requester_subreddits_at_request'].apply(joiner)

def safe_log1p_signed(arr):
    a = pd.to_numeric(arr, errors='coerce').astype(float)
    return np.sign(a) * np.log1p(np.abs(a))

def build_meta_minimal(df: pd.DataFrame) -> pd.DataFrame:
    out = pd.DataFrame(index=df.index)
    # Core numerics
    cols = [
        'requester_account_age_in_days_at_request',
        'requester_days_since_first_post_on_raop_at_request',
        'requester_number_of_comments_at_request',
        'requester_number_of_posts_at_request',
        'requester_number_of_comments_in_raop_at_request',
        'requester_number_of_posts_on_raop_at_request',
        'requester_upvotes_plus_downvotes_at_request',
        'requester_upvotes_minus_downvotes_at_request',
    ]
    for c in cols:
        if c in df.columns:
            out[c] = pd.to_numeric(df[c], errors='coerce')
        else:
            out[c] = 0.0
    # Time features
    if 'unix_timestamp_of_request' in df.columns:
        ts = pd.to_datetime(df['unix_timestamp_of_request'], unit='s', errors='coerce')
    elif 'unix_timestamp_of_request_utc' in df.columns:
        ts = pd.to_datetime(df['unix_timestamp_of_request_utc'], unit='s', errors='coerce')
    else:
        ts = pd.Series(pd.NaT, index=df.index)
    out['req_hour'] = ts.dt.hour.fillna(0).astype(np.int16)
    out['req_wday'] = ts.dt.weekday.fillna(0).astype(np.int16)
    out['req_is_weekend'] = out['req_wday'].isin([5,6]).astype(np.int8)
    # Text-derived simple counts (computed here from the raw, uncleaned title + body of df)
    combo = combine_raw_text(df)
    combo_len = combo.str.len().astype(np.int32)
    wc = combo.str.split().map(len)
    out['text_len'] = combo_len
    out['word_count'] = wc.astype(np.int32)
    out['exclaim_count'] = combo.str.count('!').astype(np.int16)
    out['question_count'] = combo.str.count(r'\?').astype(np.int16)  # raw string avoids the invalid '\?' escape warning
    out['upper_ratio'] = combo.map(lambda s: (sum(ch.isupper() for ch in s)/max(1, len(s)))).astype(np.float32).clip(0, 0.7)
    # Derived rates
    # Read from `out`: the loop above guarantees these raw columns exist even if df lacks them
    age = out['requester_account_age_in_days_at_request']
    comments = out['requester_number_of_comments_at_request']
    posts = out['requester_number_of_posts_at_request']
    out['comments_per_day'] = comments / np.maximum(1.0, age)
    out['posts_per_day'] = posts / np.maximum(1.0, age)
    # Age bins
    age_days = age.fillna(0).astype(float)
    out['age_bin_0_30'] = (age_days <= 30).astype(np.int8)
    out['age_bin_30_90'] = ((age_days > 30) & (age_days <= 90)).astype(np.int8)
    out['age_bin_90_365'] = ((age_days > 90) & (age_days <= 365)).astype(np.int8)
    out['age_bin_365p'] = (age_days > 365).astype(np.int8)
    # RAOP history flags
    out['has_raop_post_hist'] = (pd.to_numeric(df.get('requester_number_of_posts_on_raop_at_request', 0), errors='coerce') > 0).astype(np.int8)
    out['has_raop_comment_hist'] = (pd.to_numeric(df.get('requester_number_of_comments_in_raop_at_request', 0), errors='coerce') > 0).astype(np.int8)
    # Transforms
    for c in ['requester_number_of_comments_at_request','requester_number_of_posts_at_request','requester_number_of_comments_in_raop_at_request','requester_number_of_posts_on_raop_at_request','requester_upvotes_plus_downvotes_at_request','text_len','word_count','exclaim_count','question_count','comments_per_day','posts_per_day']:
        if c in out.columns:
            out[c] = np.log1p(pd.to_numeric(out[c], errors='coerce').clip(lower=0))
    if 'requester_upvotes_minus_downvotes_at_request' in out.columns:
        out['requester_upvotes_minus_downvotes_at_request'] = safe_log1p_signed(out['requester_upvotes_minus_downvotes_at_request'])
    out = out.apply(pd.to_numeric, errors='coerce').replace([np.inf, -np.inf], np.nan).fillna(0).astype(np.float32)
    return out

# Vectorizer configurations
word_params = dict(analyzer='word', ngram_range=(1,2), lowercase=False, min_df=3, max_df=0.95, max_features=100000, sublinear_tf=True, smooth_idf=True, norm='l2')
char_params = dict(analyzer='char_wb', ngram_range=(3,6), lowercase=False, min_df=2, max_features=120000, sublinear_tf=True, smooth_idf=True, norm='l2')
subs_params = dict(analyzer='word', ngram_range=(1,1), lowercase=True, min_df=2, max_features=10000, sublinear_tf=True, smooth_idf=True, norm='l2')

y = train[target_col].astype(int).values
n_splits = 5
cv = list(StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42).split(train, y))
print(f'Prepared {n_splits}-fold StratifiedKFold CV (shuffled).')

# Precompute raw sources for test
raw_te_text = combine_raw_text(test)
clean_te_text = clean_text_series(raw_te_text)
subs_te_text = build_subreddit_text(test)
meta_te_full = build_meta_minimal(test)

Cs = [0.5, 1.0, 2.0]
oof_by_C = {C: np.zeros(len(train), dtype=np.float32) for C in Cs}
test_preds_by_C = {C: [] for C in Cs}

for fold, (tr_idx, va_idx) in enumerate(cv):
    t0 = time.time()
    print(f'Fold {fold+1}/{n_splits} - train {len(tr_idx)} va {len(va_idx)}')
    sys.stdout.flush()
    # tr_idx/va_idx are positional indices from StratifiedKFold, so index with .iloc
    tr_text_raw = combine_raw_text(train.iloc[tr_idx])
    va_text_raw = combine_raw_text(train.iloc[va_idx])
    tr_text = clean_text_series(tr_text_raw)
    va_text = clean_text_series(va_text_raw)
    tr_subs = build_subreddit_text(train.iloc[tr_idx])
    va_subs = build_subreddit_text(train.iloc[va_idx])

    # Fit vectorizers on train split only
    tfidf_w = TfidfVectorizer(**word_params)
    Xw_tr = tfidf_w.fit_transform(tr_text)
    Xw_va = tfidf_w.transform(va_text)
    Xw_te = tfidf_w.transform(clean_te_text)

    tfidf_c = TfidfVectorizer(**char_params)
    Xc_tr = tfidf_c.fit_transform(tr_text)
    Xc_va = tfidf_c.transform(va_text)
    Xc_te = tfidf_c.transform(clean_te_text)

    tfidf_s = TfidfVectorizer(**subs_params)
    Xs_tr = tfidf_s.fit_transform(tr_subs)
    Xs_va = tfidf_s.transform(va_subs)
    Xs_te = tfidf_s.transform(subs_te_text)

    # Minimal meta per fold
    meta_tr = build_meta_minimal(train.iloc[tr_idx])
    meta_va = build_meta_minimal(train.iloc[va_idx])
    scaler = StandardScaler(with_mean=False)
    Xm_tr = scaler.fit_transform(meta_tr)
    Xm_va = scaler.transform(meta_va)
    Xm_te = scaler.transform(meta_te_full)

    # Stack
    X_tr = hstack([Xw_tr, Xc_tr, Xs_tr, Xm_tr], format='csr')
    X_va = hstack([Xw_va, Xc_va, Xs_va, Xm_va], format='csr')
    X_te = hstack([Xw_te, Xc_te, Xs_te, Xm_te], format='csr')

    for C in Cs:
        lr = LogisticRegression(solver='saga', penalty='l2', C=C, class_weight=None, max_iter=5000, n_jobs=-1, random_state=42, verbose=0)
        lr.fit(X_tr, y[tr_idx])
        va_pred = lr.predict_proba(X_va)[:,1].astype(np.float32)
        oof_by_C[C][va_idx] = va_pred
        te_pred = lr.predict_proba(X_te)[:,1].astype(np.float32)
        test_preds_by_C[C].append(te_pred)

    fold_auc = roc_auc_score(y[va_idx], oof_by_C[1.0][va_idx])
    print(f'Fold {fold+1} done in {time.time()-t0:.1f}s | AUC@C=1.0: {fold_auc:.5f}')
    sys.stdout.flush()

    del Xw_tr, Xw_va, Xw_te, Xc_tr, Xc_va, Xc_te, Xs_tr, Xs_va, Xs_te, Xm_tr, Xm_va, Xm_te, X_tr, X_va, X_te, lr
    gc.collect()

# Evaluate OOF per C and pick best
auc_per_C = {}
for C in Cs:
    auc = roc_auc_score(y, oof_by_C[C])
    auc_per_C[C] = auc
    print(f'C={C} OOF AUC: {auc:.5f}')
best_C = max(auc_per_C, key=auc_per_C.get)
print(f'Best C: {best_C} with OOF AUC {auc_per_C[best_C]:.5f}')

# Average test preds for best C
best_test = np.mean(test_preds_by_C[best_C], axis=0).astype(np.float32)
sub = pd.DataFrame({id_col: test[id_col].values, target_col: best_test})
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv; head:')
print(sub.head())

Prepared 5-fold StratifiedKFold CV (shuffled).
Fold 1/5 - train 2302 va 576
Fold 1 done in 92.3s | AUC@C=1.0: 0.67984
Fold 2/5 - train 2302 va 576
Fold 2 done in 87.7s | AUC@C=1.0: 0.67359
Fold 3/5 - train 2302 va 576
Fold 3 done in 91.9s | AUC@C=1.0: 0.69048
Fold 4/5 - train 2303 va 575
Fold 4 done in 82.6s | AUC@C=1.0: 0.64510
Fold 5/5 - train 2303 va 575
Fold 5 done in 88.3s | AUC@C=1.0: 0.69493
C=0.5 OOF AUC: 0.67860
C=1.0 OOF AUC: 0.67634
C=2.0 OOF AUC: 0.67050
Best C: 0.5 with OOF AUC 0.67860
Saved submission.csv; head:
  request_id  requester_received_pizza
0  t3_1aw5zf                  0.280687
1   t3_roiuw                  0.177380
2   t3_mjnbq                  0.194645
3   t3_t8wd1                  0.227205
4  t3_1m4zxu                  0.280125
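
To make the text normalisation in the pivot cell concrete, the sketch below replays the same three regexes and the signed log transform on a made-up sample. The patterns are copied from the cell above; `clean_one`, `signed_log1p` and the sample string are illustrative names and data, not part of the notebook.

```python
import re
import numpy as np

url_pat = re.compile(r'https?://\S+|www\.\S+')
num_pat = re.compile(r'\d+')
repeat_pat = re.compile(r'(.)\1{3,}')

def clean_one(t: str) -> str:
    """Single-string version of the per-row cleaner used in the pivot cell."""
    t = t.lower()
    t = url_pat.sub(' url_token ', t)                 # URLs become one shared token
    t = num_pat.sub(' num_token ', t)                 # digit runs become one shared token
    t = repeat_pat.sub(lambda m: m.group(1) * 3, t)   # 4+ repeats of a char are cut to 3
    return t

sample = 'Check http://example.com NOW!!!!! I need 20 dollars sooooo bad'
tokens = clean_one(sample).split()
print(tokens)
# ['check', 'url_token', 'now!!!', 'i', 'need', 'num_token', 'dollars', 'sooo', 'bad']

def signed_log1p(a):
    """sign(a) * log(1 + |a|): compresses magnitude while keeping the sign."""
    a = np.asarray(a, dtype=float)
    return np.sign(a) * np.log1p(np.abs(a))

vals = signed_log1p([-9.0, 0.0, 9.0])
print(vals)  # approximately [-2.3026, 0.0, 2.3026]
```

Because the cleaner already lowercases, setting `lowercase=False` in `word_params` and `char_params` does not change the vocabulary; it only skips a redundant pass inside the vectorizer.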


In [15]:
# Add calibrated RidgeClassifier and blend with LR on pivot features
import time, gc, sys, numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import roc_auc_score
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import hstack

# Reuse helpers and vectorizer params defined in the pivot cell (In [14]): combine_raw_text, clean_text_series,
# build_subreddit_text, build_meta_minimal, word_params, char_params, subs_params.

assert 'combine_raw_text' in globals() and 'clean_text_series' in globals() and 'build_subreddit_text' in globals() and 'build_meta_minimal' in globals(), 'Run pivot cell first'
assert 'word_params' in globals() and 'char_params' in globals() and 'subs_params' in globals(), 'Run pivot cell first'

y = train[target_col].astype(int).values
n_splits = 5
cv = list(StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42).split(train, y))
print(f'Prepared {n_splits}-fold StratifiedKFold CV (shuffled).')

# Precompute test sources via same preprocessing as pivot
raw_te_text = combine_raw_text(test)
clean_te_text = clean_text_series(raw_te_text)
subs_te_text = build_subreddit_text(test)
meta_te_full = build_meta_minimal(test)

oof_lr = np.zeros(len(train), dtype=np.float32)
oof_rc = np.zeros(len(train), dtype=np.float32)
test_lr_folds, test_rc_folds = [], []

LR_C = 0.5
ridge_alphas = [0.5, 1.0, 2.0]

for fold, (tr_idx, va_idx) in enumerate(cv):
    t0 = time.time()
    print(f'Fold {fold+1}/{n_splits} - train {len(tr_idx)} va {len(va_idx)}')
    sys.stdout.flush()

    tr_text_raw = combine_raw_text(train.iloc[tr_idx])
    va_text_raw = combine_raw_text(train.iloc[va_idx])
    tr_text = clean_text_series(tr_text_raw)
    va_text = clean_text_series(va_text_raw)
    tr_subs = build_subreddit_text(train.iloc[tr_idx])
    va_subs = build_subreddit_text(train.iloc[va_idx])

    # Fit vectorizers on train split only
    tfidf_w = TfidfVectorizer(**word_params)
    Xw_tr = tfidf_w.fit_transform(tr_text)
    Xw_va = tfidf_w.transform(va_text)
    Xw_te = tfidf_w.transform(clean_te_text)

    tfidf_c = TfidfVectorizer(**char_params)
    Xc_tr = tfidf_c.fit_transform(tr_text)
    Xc_va = tfidf_c.transform(va_text)
    Xc_te = tfidf_c.transform(clean_te_text)

    tfidf_s = TfidfVectorizer(**subs_params)
    Xs_tr = tfidf_s.fit_transform(tr_subs)
    Xs_va = tfidf_s.transform(va_subs)
    Xs_te = tfidf_s.transform(subs_te_text)

    meta_tr = build_meta_minimal(train.iloc[tr_idx])
    meta_va = build_meta_minimal(train.iloc[va_idx])
    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler(with_mean=False)
    Xm_tr = scaler.fit_transform(meta_tr)
    Xm_va = scaler.transform(meta_va)
    Xm_te = scaler.transform(meta_te_full)

    # Stack
    X_tr = hstack([Xw_tr, Xc_tr, Xs_tr, Xm_tr], format='csr')
    X_va = hstack([Xw_va, Xc_va, Xs_va, Xm_va], format='csr')
    X_te = hstack([Xw_te, Xc_te, Xs_te, Xm_te], format='csr')

    # Model 1: Logistic Regression (best C from pivot: 0.5), class_weight=None
    lr = LogisticRegression(solver='saga', penalty='l2', C=LR_C, class_weight=None, max_iter=6000, n_jobs=-1, random_state=42, verbose=0)
    lr.fit(X_tr, y[tr_idx])
    va_lr = lr.predict_proba(X_va)[:,1].astype(np.float32)
    oof_lr[va_idx] = va_lr
    te_lr = lr.predict_proba(X_te)[:,1].astype(np.float32)
    test_lr_folds.append(te_lr)

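    # Model 2: RidgeClassifier with sigmoid calibration; select alpha by val AUC per fold
    # NOTE: alpha is chosen on the same validation fold that is written to oof_rc, so the Ridge OOF AUC is optimistically biased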
    best_rc_auc, best_rc_te, best_rc_va = -1.0, None, None
    for alpha in ridge_alphas:
        rc_base = RidgeClassifier(alpha=alpha, random_state=42)
        rc_cal = CalibratedClassifierCV(rc_base, method='sigmoid', cv=3)
        rc_cal.fit(X_tr, y[tr_idx])
        va_rc = rc_cal.predict_proba(X_va)[:,1].astype(np.float32)
        auc_rc = roc_auc_score(y[va_idx], va_rc)
        if auc_rc > best_rc_auc:
            best_rc_auc = auc_rc
            best_rc_va = va_rc
            best_rc_te = rc_cal.predict_proba(X_te)[:,1].astype(np.float32)
    oof_rc[va_idx] = best_rc_va
    test_rc_folds.append(best_rc_te)

    auc_lr = roc_auc_score(y[va_idx], va_lr)
    print(f'Fold {fold+1} AUCs | LR(C={LR_C}): {auc_lr:.5f} | Best RidgeCal: {best_rc_auc:.5f} | time {time.time()-t0:.1f}s')
    sys.stdout.flush()

    del Xw_tr, Xw_va, Xw_te, Xc_tr, Xc_va, Xc_te, Xs_tr, Xs_va, Xs_te, Xm_tr, Xm_va, Xm_te, X_tr, X_va, X_te, lr, rc_base, rc_cal
    gc.collect()

# OOF scores
auc_lr_oof = roc_auc_score(y, oof_lr)
auc_rc_oof = roc_auc_score(y, oof_rc)
print(f'OOF AUC | LR: {auc_lr_oof:.5f} | RidgeCal: {auc_rc_oof:.5f}')

# Blend weight search
best_w, best_auc = 0.0, -1.0
for w in np.linspace(0.0, 1.0, 101):
    blend = (1-w)*oof_lr + w*oof_rc
    auc = roc_auc_score(y, blend)
    if auc > best_auc:
        best_auc, best_w = auc, float(w)
print(f'Best blend w(Ridge)={best_w:.2f} OOF AUC: {best_auc:.5f}')

# Build test preds using best weight
test_lr = np.mean(test_lr_folds, axis=0).astype(np.float32)
test_rc = np.mean(test_rc_folds, axis=0).astype(np.float32)
test_pred = (1-best_w)*test_lr + best_w*test_rc

sub = pd.DataFrame({id_col: test[id_col].values, target_col: test_pred})
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv; head:')
print(sub.head())

Prepared 5-fold StratifiedKFold CV (shuffled).
Fold 1/5 - train 2302 va 576
Fold 1 AUCs | LR(C=0.5): 0.68299 | Best RidgeCal: 0.66707 | time 28.5s
Fold 2/5 - train 2302 va 576
Fold 2 AUCs | LR(C=0.5): 0.66915 | Best RidgeCal: 0.68084 | time 26.6s
Fold 3/5 - train 2302 va 576
Fold 3 AUCs | LR(C=0.5): 0.69675 | Best RidgeCal: 0.68388 | time 28.6s
Fold 4/5 - train 2303 va 575
Fold 4 AUCs | LR(C=0.5): 0.64766 | Best RidgeCal: 0.63491 | time 25.6s
Fold 5/5 - train 2303 va 575
Fold 5 AUCs | LR(C=0.5): 0.69875 | Best RidgeCal: 0.67991 | time 27.8s
OOF AUC | LR: 0.67860 | RidgeCal: 0.66922
Best blend w(Ridge)=0.00 OOF AUC: 0.67860
Saved submission.csv; head:
  request_id  requester_received_pizza
0  t3_1aw5zf                  0.280687
1   t3_roiuw                  0.177380
2   t3_mjnbq                  0.194645
3   t3_t8wd1                  0.227205
4  t3_1m4zxu                  0.280125
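
The Ridge branch above picks alpha separately in each fold using that fold's validation AUC, then stores the winning predictions in `oof_rc`. A sketch of an alternative that mirrors how C is chosen in the LR cells follows: keep one OOF vector per alpha and select a single alpha on the pooled OOF AUC. It runs on synthetic data from `make_classification` and, as a simplifying assumption, scores with `decision_function` rather than calibrated probabilities, since AUC depends only on ranking (pooling uncalibrated scores across folds is itself an approximation).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the stacked feature matrix (not RAOP data).
X_demo, y_demo = make_classification(n_samples=400, n_features=30, n_informative=5, random_state=0)
folds = list(StratifiedKFold(n_splits=5, shuffle=True, random_state=42).split(X_demo, y_demo))

alphas = [0.5, 1.0, 2.0]
oof_by_alpha = {a: np.zeros(len(y_demo)) for a in alphas}

for tr_idx, va_idx in folds:
    for a in alphas:
        rc = RidgeClassifier(alpha=a).fit(X_demo[tr_idx], y_demo[tr_idx])
        # Raw margins suffice for AUC; every alpha keeps its own full OOF vector.
        oof_by_alpha[a][va_idx] = rc.decision_function(X_demo[va_idx])

# One global alpha chosen on pooled OOF, never per fold.
auc_by_alpha = {a: roc_auc_score(y_demo, oof_by_alpha[a]) for a in alphas}
best_alpha = max(auc_by_alpha, key=auc_by_alpha.get)
print(f'best alpha={best_alpha} OOF AUC={auc_by_alpha[best_alpha]:.5f}')
```

If the margins are later blended with LR probabilities, they would need to be brought onto a comparable scale first, for example by calibration or rank transformation.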


In [19]:
# XGBoost on pivot features + blend with LR (OOF-driven)
import time, gc, sys, numpy as np, pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from scipy.sparse import hstack

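try:
    import xgboost as xgb
except ImportError:
    # Install on demand; check=True surfaces a failed install instead of a second ImportError
    import subprocess, sys as _sys
    subprocess.run([_sys.executable, '-m', 'pip', 'install', '-q', 'xgboost'], check=True)
    import xgboost as xgb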

assert 'combine_raw_text' in globals() and 'clean_text_series' in globals() and 'build_subreddit_text' in globals() and 'build_meta_minimal' in globals(), 'Run pivot cell first'
assert 'word_params' in globals() and 'char_params' in globals() and 'subs_params' in globals(), 'Run pivot cell first'

y = train[target_col].astype(int).values
n_splits = 5
cv = list(StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42).split(train, y))
print(f'Prepared {n_splits}-fold StratifiedKFold CV (shuffled).')

# Precompute test sources via same preprocessing as pivot
raw_te_text = combine_raw_text(test)
clean_te_text = clean_text_series(raw_te_text)
subs_te_text = build_subreddit_text(test)
meta_te_full = build_meta_minimal(test)

oof_lr = np.zeros(len(train), dtype=np.float32)
oof_xgb = np.zeros(len(train), dtype=np.float32)
test_lr_folds, test_xgb_folds = [], []

# LR baseline params (from pivot best)
from sklearn.linear_model import LogisticRegression
LR_C = 0.5

# Prefer GPU if available
use_gpu = True

# XGB params (regularized; no early stopping due to API limitations)
xgb_params = dict(
    objective='binary:logistic',
    eval_metric='auc',
    max_depth=3,
    learning_rate=0.05,
    n_estimators=1200,
    subsample=0.7,
    colsample_bytree=0.5,
    min_child_weight=10,
    reg_alpha=1.0,
    reg_lambda=1.8,
    gamma=0.1,
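    tree_method='hist',  # 'gpu_hist' is deprecated in XGBoost >= 2.0; the GPU is selected via `device`
    device='cuda' if use_gpu else 'cpu',  # replaces `predictor`, which XGBoost >= 2.0 reports as unused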
    random_state=42,
    n_jobs=-1
)

for fold, (tr_idx, va_idx) in enumerate(cv):
    t0 = time.time()
    print(f'Fold {fold+1}/{n_splits} - train {len(tr_idx)} va {len(va_idx)}')
    sys.stdout.flush()

    tr_text_raw = combine_raw_text(train.loc[tr_idx])
    va_text_raw = combine_raw_text(train.loc[va_idx])
    tr_text = clean_text_series(tr_text_raw)
    va_text = clean_text_series(va_text_raw)
    tr_subs = build_subreddit_text(train.loc[tr_idx])
    va_subs = build_subreddit_text(train.loc[va_idx])

    # Vectorizers on train split only (reuse pivot params)
    tfidf_w = TfidfVectorizer(**word_params)
    Xw_tr = tfidf_w.fit_transform(tr_text)
    Xw_va = tfidf_w.transform(va_text)
    Xw_te = tfidf_w.transform(clean_te_text)

    tfidf_c = TfidfVectorizer(**char_params)
    Xc_tr = tfidf_c.fit_transform(tr_text)
    Xc_va = tfidf_c.transform(va_text)
    Xc_te = tfidf_c.transform(clean_te_text)

    tfidf_s = TfidfVectorizer(**subs_params)
    Xs_tr = tfidf_s.fit_transform(tr_subs)
    Xs_va = tfidf_s.transform(va_subs)
    Xs_te = tfidf_s.transform(subs_te_text)

    meta_tr = build_meta_minimal(train.loc[tr_idx])
    meta_va = build_meta_minimal(train.loc[va_idx])
    scaler = StandardScaler(with_mean=False)
    Xm_tr = scaler.fit_transform(meta_tr)
    Xm_va = scaler.transform(meta_va)
    Xm_te = scaler.transform(meta_te_full)

    # Stack
    X_tr = hstack([Xw_tr, Xc_tr, Xs_tr, Xm_tr], format='csr')
    X_va = hstack([Xw_va, Xc_va, Xs_va, Xm_va], format='csr')
    X_te = hstack([Xw_te, Xc_te, Xs_te, Xm_te], format='csr')

    # Model 1: LR
    lr = LogisticRegression(solver='saga', penalty='l2', C=LR_C, class_weight=None, max_iter=6000, n_jobs=-1, random_state=42, verbose=0)
    lr.fit(X_tr, y[tr_idx])
    va_lr = lr.predict_proba(X_va)[:,1].astype(np.float32)
    oof_lr[va_idx] = va_lr
    te_lr = lr.predict_proba(X_te)[:,1].astype(np.float32)
    test_lr_folds.append(te_lr)

    # Model 2: XGBoost with a fixed number of rounds (no early stopping, for version compatibility).
    # No eval_set is passed: without early stopping it only adds per-round evaluation cost.
    xgb_clf = xgb.XGBClassifier(**xgb_params)
    xgb_clf.fit(X_tr, y[tr_idx], verbose=False)
    va_xgb = xgb_clf.predict_proba(X_va)[:,1].astype(np.float32)
    oof_xgb[va_idx] = va_xgb
    te_xgb = xgb_clf.predict_proba(X_te)[:,1].astype(np.float32)
    test_xgb_folds.append(te_xgb)

    auc_lr = roc_auc_score(y[va_idx], va_lr)
    auc_x = roc_auc_score(y[va_idx], va_xgb)
    print(f'Fold {fold+1} AUCs | LR: {auc_lr:.5f} | XGB: {auc_x:.5f} | time {time.time()-t0:.1f}s')
    sys.stdout.flush()

    del Xw_tr, Xw_va, Xw_te, Xc_tr, Xc_va, Xc_te, Xs_tr, Xs_va, Xs_te, Xm_tr, Xm_va, Xm_te, X_tr, X_va, X_te, lr, xgb_clf
    gc.collect()

# OOF scores
auc_lr_oof = roc_auc_score(y, oof_lr)
auc_xgb_oof = roc_auc_score(y, oof_xgb)
print(f'OOF AUC | LR: {auc_lr_oof:.5f} | XGB: {auc_xgb_oof:.5f}')

# Blend LR + XGB by OOF weight grid
best_w, best_auc = 0.0, -1.0
for w in np.linspace(0.0, 1.0, 101):
    blend = (1-w)*oof_lr + w*oof_xgb
    auc = roc_auc_score(y, blend)
    if auc > best_auc:
        best_auc, best_w = auc, float(w)
print(f'Best blend w(XGB)={best_w:.2f} OOF AUC: {best_auc:.5f}')

# Build test preds using best weight
test_lr = np.mean(test_lr_folds, axis=0).astype(np.float32)
test_xg = np.mean(test_xgb_folds, axis=0).astype(np.float32)
test_pred = (1-best_w)*test_lr + best_w*test_xg

sub = pd.DataFrame({id_col: test[id_col].values, target_col: test_pred})
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv; head:')
print(sub.head())

Prepared 5-fold StratifiedKFold CV (shuffled).
Fold 1/5 - train 2302 va 576



    E.g. tree_method = "hist", device = "cuda"

Parameters: { "predictor" } are not used.



Fold 1 AUCs | LR: 0.68299 | XGB: 0.62196 | time 45.6s



    E.g. tree_method = "hist", device = "cuda"

Potential solutions:
- Use a data structure that matches the device ordinal in the booster.
- Set the device for booster before call to inplace_predict.




Fold 2/5 - train 2302 va 576
Fold 2 AUCs | LR: 0.66915 | XGB: 0.60482 | time 42.1s
Fold 3/5 - train 2302 va 576
Fold 3 AUCs | LR: 0.69675 | XGB: 0.63090 | time 43.3s
Fold 4/5 - train 2303 va 575
Fold 4 AUCs | LR: 0.64766 | XGB: 0.58929 | time 40.3s
Fold 5/5 - train 2303 va 575
Fold 5 AUCs | LR: 0.69875 | XGB: 0.61814 | time 41.4s

OOF AUC | LR: 0.67860 | XGB: 0.61282


Best blend w(XGB)=0.00 OOF AUC: 0.67860


Saved submission.csv; head:
  request_id  requester_received_pizza
0  t3_1aw5zf                  0.280687
1   t3_roiuw                  0.177380
2   t3_mjnbq                  0.194645
3   t3_t8wd1                  0.227205
4  t3_1m4zxu                  0.280125
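
The warning fragments in the log above (`E.g. tree_method = "hist", device = "cuda"` and `Parameters: { "predictor" } are not used.`) suggest that the installed XGBoost expects the `device` parameter rather than GPU-specific tree-method and predictor names. A small sketch of choosing the keys by version; the helper name `gpu_params_for` is hypothetical, and the sketch assumes the change took effect at major version 2:

```python
def gpu_params_for(xgb_version: str, use_gpu: bool = True) -> dict:
    """Hypothetical helper: pick tree-method/device keys for a given XGBoost version string."""
    major = int(xgb_version.split('.')[0])
    if major >= 2:
        # XGBoost >= 2.0: a single 'hist' tree method, with hardware chosen via `device`
        return {'tree_method': 'hist', 'device': 'cuda' if use_gpu else 'cpu'}
    # Older releases: the GPU is selected through the tree-method and predictor names
    if use_gpu:
        return {'tree_method': 'gpu_hist', 'predictor': 'gpu_predictor'}
    return {'tree_method': 'hist', 'predictor': 'auto'}

print(gpu_params_for('2.0.3'))
print(gpu_params_for('1.7.6'))
```

In the notebook, the returned dict could be merged into the parameter dict with `xgb.__version__` as the version argument, so the same cell runs on both older and newer installs.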


In [20]:
# Enhanced meta features + Pivot TF-IDF LR (target: push OOF >= 0.692)
import time, gc, sys, re, numpy as np, pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from scipy.sparse import hstack

assert 'combine_raw_text' in globals() and 'clean_text_series' in globals() and 'build_subreddit_text' in globals() and 'build_meta_minimal' in globals(), 'Run pivot cell 7 first'
assert 'word_params' in globals() and 'char_params' in globals() and 'subs_params' in globals(), 'Run pivot cell 7 first'

def build_meta_enhanced(df: pd.DataFrame) -> pd.DataFrame:
    out = build_meta_minimal(df).copy()  # start from minimal core
    # Text for regex-based features
    # Prefer the edit-aware body when present; fall back to request_text, then to empty strings
    empty = pd.Series([''] * len(df), index=df.index)
    title = df['request_title'].fillna('').astype(str) if 'request_title' in df.columns else empty
    if 'request_text_edit_aware' in df.columns:
        body = df['request_text_edit_aware'].fillna('').astype(str)
    elif 'request_text' in df.columns:
        body = df['request_text'].fillna('').astype(str)
    else:
        body = empty
    txt = (title + ' ' + body).str.lower()
    # Reciprocity / return-the-favor
    # Non-capturing groups (?:...) avoid the pandas "match groups" UserWarning from str.contains
    out['f_reciprocity'] = txt.str.contains(r'\b(?:pay it forward|return the favor|pay you back|pay it back)\b', regex=True).astype(np.int8)
    # Payday / timeframe
    out['f_payday'] = txt.str.contains(r'\b(?:payday|paycheck|get paid|next check)\b', regex=True).astype(np.int8)
    out['f_day_words'] = txt.str.contains(r'\b(?:monday|tuesday|wednesday|thursday|friday|tomorrow|tonight|weekend)\b', regex=True).astype(np.int8)
    # Concrete amount ($ or N dollars)
    out['f_amount'] = txt.str.contains(r'\$\s?\d+|\b\d+\s*dollars?\b', regex=True).astype(np.int8)
    # Gratitude density
    wc = np.maximum(1, out.get('word_count', (title + ' ' + body).str.split().map(len)).astype(np.int32))
    grat_cnt = txt.str.count(r'\b(thank|thanks|thank you|appreciate|grateful)\b')
    out['grat_density'] = (grat_cnt / wc).astype(np.float32).clip(0, 0.3)
    # Pronoun ratios
    fp_cnt = txt.str.count(r'\b(i|me|my|we|our|us)\b')
    you_cnt = txt.str.count(r'\b(you|your)\b')
    out['first_person_ratio'] = (fp_cnt / wc).astype(np.float32).clip(0, 1.0)
    out['you_ratio'] = (you_cnt / wc).astype(np.float32).clip(0, 1.0)
    # Time-of-day bins + sin/cos hour
    if 'unix_timestamp_of_request' in df.columns:
        ts = pd.to_datetime(df['unix_timestamp_of_request'], unit='s', errors='coerce')
    elif 'unix_timestamp_of_request_utc' in df.columns:
        ts = pd.to_datetime(df['unix_timestamp_of_request_utc'], unit='s', errors='coerce')
    else:
        ts = pd.Series(pd.NaT, index=df.index)
    hour = ts.dt.hour.fillna(0).astype(int)
    out['tod_morning'] = ((hour>=6)&(hour<=11)).astype(np.int8)
    out['tod_afternoon'] = ((hour>=12)&(hour<=17)).astype(np.int8)
    out['tod_evening'] = ((hour>=18)&(hour<=23)).astype(np.int8)
    out['tod_night'] = ((hour>=0)&(hour<=5)).astype(np.int8)
    out['hour_sin'] = np.sin(2*np.pi*hour/24).astype(np.float32)
    out['hour_cos'] = np.cos(2*np.pi*hour/24).astype(np.float32)
    # Mentions kids
    out['mentions_kids'] = txt.str.contains(r'\b(?:kids?|children|son|daughter|baby)\b', regex=True).astype(np.int8)
    # Title features
    tl = title.str.len().astype(np.int32)
    twc = title.str.split().map(len).astype(np.int32).replace(0, 1)
    out['title_word_count'] = np.log1p(twc).astype(np.float32)
    out['title_has_question'] = title.str.contains(r'\?', regex=True).astype(np.int8)
    out['title_has_please'] = title.str.contains(r'\bplease\b', case=False, regex=True).astype(np.int8)
    # Avg word length
    total_chars = (title + ' ' + body).str.replace(r'\s+', '', regex=True).str.len().astype(np.int32)
    out['avg_word_len'] = (total_chars / wc).astype(np.float32).replace([np.inf, -np.inf], 0).clip(0, 15)
    # requester_number_of_subreddits_at_request transforms
    if 'requester_number_of_subreddits_at_request' in df.columns:
        val = pd.to_numeric(df['requester_number_of_subreddits_at_request'], errors='coerce').fillna(0)
        out['subs_count_log1p'] = np.log1p(val).astype(np.float32)
        out['subs_high_bin'] = (val >= 50).astype(np.int8)
    return out.replace([np.inf, -np.inf], np.nan).fillna(0).astype(np.float32)

y = train[target_col].astype(int).values
n_splits = 5
cv = list(StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42).split(train, y))
print(f'Prepared {n_splits}-fold StratifiedKFold CV (shuffled).')

# Precompute test sources via same preprocessing as pivot
raw_te_text = combine_raw_text(test)
clean_te_text = clean_text_series(raw_te_text)
subs_te_text = build_subreddit_text(test)
meta_te_full = build_meta_enhanced(test)

Cs = [0.3, 0.5, 0.8]
oof_by_C = {C: np.zeros(len(train), dtype=np.float32) for C in Cs}
test_preds_by_C = {C: [] for C in Cs}

for fold, (tr_idx, va_idx) in enumerate(cv):
    t0 = time.time()
    print(f'Fold {fold+1}/{n_splits} - train {len(tr_idx)} va {len(va_idx)}')
    sys.stdout.flush()

    tr_text_raw = combine_raw_text(train.loc[tr_idx])
    va_text_raw = combine_raw_text(train.loc[va_idx])
    tr_text = clean_text_series(tr_text_raw)
    va_text = clean_text_series(va_text_raw)
    tr_subs = build_subreddit_text(train.loc[tr_idx])
    va_subs = build_subreddit_text(train.loc[va_idx])

    # Fit vectorizers on train split only (pivot params)
    tfidf_w = TfidfVectorizer(**word_params)
    Xw_tr = tfidf_w.fit_transform(tr_text)
    Xw_va = tfidf_w.transform(va_text)
    Xw_te = tfidf_w.transform(clean_te_text)

    tfidf_c = TfidfVectorizer(**char_params)
    Xc_tr = tfidf_c.fit_transform(tr_text)
    Xc_va = tfidf_c.transform(va_text)
    Xc_te = tfidf_c.transform(clean_te_text)

    tfidf_s = TfidfVectorizer(**subs_params)
    Xs_tr = tfidf_s.fit_transform(tr_subs)
    Xs_va = tfidf_s.transform(va_subs)
    Xs_te = tfidf_s.transform(subs_te_text)

    meta_tr = build_meta_enhanced(train.loc[tr_idx])
    meta_va = build_meta_enhanced(train.loc[va_idx])
    scaler = StandardScaler(with_mean=False)
    Xm_tr = scaler.fit_transform(meta_tr)
    Xm_va = scaler.transform(meta_va)
    Xm_te = scaler.transform(meta_te_full)

    # Stack
    X_tr = hstack([Xw_tr, Xc_tr, Xs_tr, Xm_tr], format='csr')
    X_va = hstack([Xw_va, Xc_va, Xs_va, Xm_va], format='csr')
    X_te = hstack([Xw_te, Xc_te, Xs_te, Xm_te], format='csr')

    for C in Cs:
        lr = LogisticRegression(solver='saga', penalty='l2', C=C, class_weight=None, max_iter=6000, n_jobs=-1, random_state=42, verbose=0)
        lr.fit(X_tr, y[tr_idx])
        va_pred = lr.predict_proba(X_va)[:,1].astype(np.float32)
        oof_by_C[C][va_idx] = va_pred
        te_pred = lr.predict_proba(X_te)[:,1].astype(np.float32)
        test_preds_by_C[C].append(te_pred)

    fold_auc = roc_auc_score(y[va_idx], oof_by_C[0.5][va_idx])
    print(f'Fold {fold+1} done in {time.time()-t0:.1f}s | AUC@C=0.5: {fold_auc:.5f}')
    sys.stdout.flush()

    del Xw_tr, Xw_va, Xw_te, Xc_tr, Xc_va, Xc_te, Xs_tr, Xs_va, Xs_te, Xm_tr, Xm_va, Xm_te, X_tr, X_va, X_te, lr
    gc.collect()

# Evaluate OOF per C and pick best
auc_per_C = {}
for C in Cs:
    auc = roc_auc_score(y, oof_by_C[C])
    auc_per_C[C] = auc
    print(f'C={C} OOF AUC: {auc:.5f}')
best_C = max(auc_per_C, key=auc_per_C.get)
print(f'Best C: {best_C} with OOF AUC {auc_per_C[best_C]:.5f}')

# Average test preds for best C
best_test = np.mean(test_preds_by_C[best_C], axis=0).astype(np.float32)
sub = pd.DataFrame({id_col: test[id_col].values, target_col: best_test})
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv; head:')
print(sub.head())

Prepared 5-fold StratifiedKFold CV (shuffled).


Fold 1/5 - train 2302 va 576
Fold 1 done in 81.9s | AUC@C=0.5: 0.69215
Fold 2/5 - train 2302 va 576
Fold 2 done in 78.3s | AUC@C=0.5: 0.67601
Fold 3/5 - train 2302 va 576
Fold 3 done in 78.9s | AUC@C=0.5: 0.69970
Fold 4/5 - train 2303 va 575
Fold 4 done in 73.8s | AUC@C=0.5: 0.63280
Fold 5/5 - train 2303 va 575
Fold 5 done in 78.5s | AUC@C=0.5: 0.68046


C=0.3 OOF AUC: 0.67501
C=0.5 OOF AUC: 0.67569
C=0.8 OOF AUC: 0.67463
Best C: 0.5 with OOF AUC 0.67569
Saved submission.csv; head:
  request_id  requester_received_pizza
0  t3_1aw5zf                  0.292437
1   t3_roiuw                  0.178725
2   t3_mjnbq                  0.216726
3   t3_t8wd1                  0.227825
4  t3_1m4zxu                  0.388530
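
When a pattern passed to `Series.str.contains` has capturing groups, pandas emits a UserWarning about match groups. Non-capturing groups `(?:...)` match the same text without it. A quick check using only the standard `re` module, with the payday pattern as the example; the sample strings are illustrative and not taken from the dataset:

```python
import re

capturing = r'\b(payday|paycheck|get paid|next check)\b'
non_capturing = r'\b(?:payday|paycheck|get paid|next check)\b'

# Illustrative strings, not taken from the dataset
samples = ['i get paid on friday', 'no money until my next check', 'just hungry tonight']

# The number of capturing groups is what distinguishes the two patterns
print(re.compile(capturing).groups, re.compile(non_capturing).groups)

for s in samples:
    hit = bool(re.search(non_capturing, s))
    same = hit == bool(re.search(capturing, s))
    print(f'{s!r}: match={hit} same_as_capturing={same}')
```

Since both forms flag the same rows, switching to the non-capturing form should leave the binary features unchanged while keeping the fold logs free of repeated warning text.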


In [29]:
# Dense view: SVD(word, char, subs) + enhanced meta -> XGBoost (DMatrix, device=cuda) with early stopping; seed-bag; build OOF and submission
import time, gc, sys, numpy as np, pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import roc_auc_score
from scipy.sparse import hstack

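try:
    import xgboost as xgb
except ImportError:
    # Install on demand; check=True surfaces a failed install instead of a second ImportError
    import subprocess, sys as _sys
    subprocess.run([_sys.executable, '-m', 'pip', 'install', '-q', 'xgboost'], check=True)
    import xgboost as xgb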

assert 'combine_raw_text' in globals() and 'clean_text_series' in globals() and 'build_subreddit_text' in globals(), 'Run pivot cell 7 first'
assert 'word_params' in globals() and 'char_params' in globals() and 'subs_params' in globals(), 'Run pivot cell 7 first'
assert 'build_meta_enhanced' in globals(), 'Run cell 10 to define build_meta_enhanced'

# SVD component sizes (per expert advice)
svd_word_n = 250
svd_char_n = 250
svd_subs_n = 80

y = train[target_col].astype(int).values
n_splits = 5
cv = list(StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42).split(train, y))
print(f'Prepared {n_splits}-fold StratifiedKFold CV (shuffled).')

# Precompute test sources
raw_te_text = combine_raw_text(test)
clean_te_text = clean_text_series(raw_te_text)
subs_te_text = build_subreddit_text(test)
meta_te = build_meta_enhanced(test).astype(np.float32)

# Containers for seed-bagging (expanded seeds for more stability)
seeds = [42, 2025, 7, 11, 77]
oof_bag = np.zeros((len(train), len(seeds)), dtype=np.float32)
test_bag = np.zeros((len(test), len(seeds)), dtype=np.float32)

# Fixed XGB params (dense view, slightly deeper trees)
base_params = dict(
    objective='binary:logistic',
    eval_metric='auc',
    max_depth=4,
    eta=0.05,
    subsample=0.8,
    colsample_bytree=0.7,
    min_child_weight=4,
    reg_alpha=0.5,
    reg_lambda=2.0,
    gamma=0.0,
    device='cuda'
)

for si, seed in enumerate(seeds):
    print(f'=== Seed {seed} / {len(seeds)} ===')
    params = dict(base_params)
    params['seed'] = seed
    oof = np.zeros(len(train), dtype=np.float32)
    test_fold_preds = []

    for fold, (tr_idx, va_idx) in enumerate(cv):
        t0 = time.time()
        print(f'Seed {seed} | Fold {fold+1}/{n_splits} - train {len(tr_idx)} va {len(va_idx)}')
        sys.stdout.flush()

        # Text prep
        tr_text_raw = combine_raw_text(train.loc[tr_idx])
        va_text_raw = combine_raw_text(train.loc[va_idx])
        tr_text = clean_text_series(tr_text_raw)
        va_text = clean_text_series(va_text_raw)
        tr_subs = build_subreddit_text(train.loc[tr_idx])
        va_subs = build_subreddit_text(train.loc[va_idx])

        # TF-IDF (fit on train split)
        tfidf_w = TfidfVectorizer(**word_params)
        Xw_tr = tfidf_w.fit_transform(tr_text)
        Xw_va = tfidf_w.transform(va_text)
        Xw_te = tfidf_w.transform(clean_te_text)

        tfidf_c = TfidfVectorizer(**char_params)
        Xc_tr = tfidf_c.fit_transform(tr_text)
        Xc_va = tfidf_c.transform(va_text)
        Xc_te = tfidf_c.transform(clean_te_text)

        tfidf_s = TfidfVectorizer(**subs_params)
        Xs_tr = tfidf_s.fit_transform(tr_subs)
        Xs_va = tfidf_s.transform(va_subs)
        Xs_te = tfidf_s.transform(subs_te_text)

        # SVD to dense
        svd_w = TruncatedSVD(n_components=svd_word_n, random_state=seed)
        Zw_tr = svd_w.fit_transform(Xw_tr)
        Zw_va = svd_w.transform(Xw_va)
        Zw_te = svd_w.transform(Xw_te)

        svd_c = TruncatedSVD(n_components=svd_char_n, random_state=seed)
        Zc_tr = svd_c.fit_transform(Xc_tr)
        Zc_va = svd_c.transform(Xc_va)
        Zc_te = svd_c.transform(Xc_te)

        svd_s = TruncatedSVD(n_components=svd_subs_n, random_state=seed)
        Zs_tr = svd_s.fit_transform(Xs_tr)
        Zs_va = svd_s.transform(Xs_va)
        Zs_te = svd_s.transform(Xs_te)

        # Enhanced meta (dense)
        meta_tr = build_meta_enhanced(train.loc[tr_idx]).astype(np.float32)
        meta_va = build_meta_enhanced(train.loc[va_idx]).astype(np.float32)

        # Concatenate dense blocks
        Xtr_dense = np.hstack([Zw_tr, Zc_tr, Zs_tr, meta_tr.values]).astype(np.float32)
        Xva_dense = np.hstack([Zw_va, Zc_va, Zs_va, meta_va.values]).astype(np.float32)
        Xte_dense = np.hstack([Zw_te, Zc_te, Zs_te, meta_te.values]).astype(np.float32)

        # Scale dense features
        scaler = StandardScaler(with_mean=True, with_std=True)
        Xtr_d = scaler.fit_transform(Xtr_dense)
        Xva_d = scaler.transform(Xva_dense)
        Xte_d = scaler.transform(Xte_dense)

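        # XGB training via DMatrix with early stopping.
        # The last entry in `evals` (the validation fold) drives stopping, so the fold AUC below is slightly optimistic.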
        dtrain = xgb.DMatrix(Xtr_d, label=y[tr_idx])
        dvalid = xgb.DMatrix(Xva_d, label=y[va_idx])
        dtest  = xgb.DMatrix(Xte_d)
        evals = [(dtrain, 'train'), (dvalid, 'valid')]
        booster = xgb.train(
            params,
            dtrain,
            num_boost_round=4000,
            evals=evals,
            early_stopping_rounds=200,
            verbose_eval=False
        )
        va_pred = booster.predict(dvalid, iteration_range=(0, booster.best_iteration+1)).astype(np.float32)
        te_pred = booster.predict(dtest, iteration_range=(0, booster.best_iteration+1)).astype(np.float32)
        oof[va_idx] = va_pred
        test_fold_preds.append(te_pred)

        auc = roc_auc_score(y[va_idx], va_pred)
        print(f'Seed {seed} | Fold {fold+1} AUC: {auc:.5f} | best_iter={booster.best_iteration} | time {time.time()-t0:.1f}s')
        sys.stdout.flush()

        del Xw_tr, Xw_va, Xw_te, Xc_tr, Xc_va, Xc_te, Xs_tr, Xs_va, Xs_te, Zw_tr, Zw_va, Zw_te, Zc_tr, Zc_va, Zc_te, Zs_tr, Zs_va, Zs_te, meta_tr, meta_va, Xtr_dense, Xva_dense, Xte_dense, Xtr_d, Xva_d, Xte_d, dtrain, dvalid, dtest, booster, tfidf_w, tfidf_c, tfidf_s, svd_w, svd_c, svd_s, scaler
        gc.collect()

    # Store bagged predictions
    oof_bag[:, si] = oof
    test_bag[:, si] = np.mean(test_fold_preds, axis=0).astype(np.float32)
    auc_seed = roc_auc_score(y, oof)
    print(f'Seed {seed} OOF AUC: {auc_seed:.5f}')
    sys.stdout.flush()

# Average across seeds
oof_mean = oof_bag.mean(axis=1)
test_mean = test_bag.mean(axis=1)
auc_oof = roc_auc_score(y, oof_mean)
print(f'Bagged XGB (SVD+dense) OOF AUC: {auc_oof:.5f}')

# Save submission from dense XGB alone (will blend with LR later if needed)
sub = pd.DataFrame({id_col: test[id_col].values, target_col: test_mean.astype(np.float32)})
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv; head:')
print(sub.head())

Prepared 5-fold StratifiedKFold CV (shuffled).


=== Seed 42 / 5 ===
Seed 42 | Fold 1/5 - train 2302 va 576


  out['f_reciprocity'] = txt.str.contains(r'\b(pay it forward|return the favor|pay you back|pay it back)\b', regex=True).astype(np.int8)
  out['f_payday'] = txt.str.contains(r'\b(payday|paycheck|get paid|next check)\b', regex=True).astype(np.int8)
  out['f_day_words'] = txt.str.contains(r'\b(monday|tuesday|wednesday|thursday|friday|tomorrow|tonight|weekend)\b', regex=True).astype(np.int8)
  out['f_amount'] = txt.str.contains(r'(\$\s?\d+)|(\b\d+\s*dollars?\b)', regex=True).astype(np.int8)
  out['mentions_kids'] = txt.str.contains(r'\b(kids?|children|son|daughter|baby)\b', regex=True).astype(np.int8)


  out['f_reciprocity'] = txt.str.contains(r'\b(pay it forward|return the favor|pay you back|pay it back)\b', regex=True).astype(np.int8)
  out['f_payday'] = txt.str.contains(r'\b(payday|paycheck|get paid|next check)\b', regex=True).astype(np.int8)
  out['f_day_words'] = txt.str.contains(r'\b(monday|tuesday|wednesday|thursday|friday|tomorrow|tonight|weekend)\b', regex=True).astype(np.int8)
  out['f_amount'] = txt.str.contains(r'(\$\s?\d+)|(\b\d+\s*dollars?\b)', regex=True).astype(np.int8)
  out['mentions_kids'] = txt.str.contains(r'\b(kids?|children|son|daughter|baby)\b', regex=True).astype(np.int8)


  out['f_reciprocity'] = txt.str.contains(r'\b(pay it forward|return the favor|pay you back|pay it back)\b', regex=True).astype(np.int8)
  out['f_payday'] = txt.str.contains(r'\b(payday|paycheck|get paid|next check)\b', regex=True).astype(np.int8)
  out['f_day_words'] = txt.str.contains(r'\b(monday|tuesday|wednesday|thursday|friday|tomorrow|tonight|weekend)\b', regex=True).astype(np.int8)
  out['f_amount'] = txt.str.contains(r'(\$\s?\d+)|(\b\d+\s*dollars?\b)', regex=True).astype(np.int8)
  out['mentions_kids'] = txt.str.contains(r'\b(kids?|children|son|daughter|baby)\b', regex=True).astype(np.int8)


Seed 42 | Fold 1 AUC: 0.68139 | best_iter=98 | time 16.6s


Seed 42 | Fold 2/5 - train 2302 va 576


Seed 42 | Fold 2 AUC: 0.64557 | best_iter=291 | time 17.5s


Seed 42 | Fold 3/5 - train 2302 va 576


Seed 42 | Fold 3 AUC: 0.67763 | best_iter=58 | time 16.8s


Seed 42 | Fold 4/5 - train 2303 va 575


Seed 42 | Fold 4 AUC: 0.63253 | best_iter=49 | time 16.9s


Seed 42 | Fold 5/5 - train 2303 va 575


Seed 42 | Fold 5 AUC: 0.70090 | best_iter=58 | time 16.9s


Seed 42 OOF AUC: 0.65929


=== Seed 2025 / 5 ===
Seed 2025 | Fold 1/5 - train 2302 va 576


Seed 2025 | Fold 1 AUC: 0.66244 | best_iter=91 | time 16.6s


Seed 2025 | Fold 2/5 - train 2302 va 576


Seed 2025 | Fold 2 AUC: 0.65729 | best_iter=40 | time 16.7s


Seed 2025 | Fold 3/5 - train 2302 va 576


Seed 2025 | Fold 3 AUC: 0.69851 | best_iter=228 | time 19.7s


Seed 2025 | Fold 4/5 - train 2303 va 575


Seed 2025 | Fold 4 AUC: 0.62073 | best_iter=126 | time 16.7s


Seed 2025 | Fold 5/5 - train 2303 va 575


Seed 2025 | Fold 5 AUC: 0.70964 | best_iter=21 | time 16.4s


Seed 2025 OOF AUC: 0.65686


=== Seed 7 / 5 ===
Seed 7 | Fold 1/5 - train 2302 va 576


Seed 7 | Fold 1 AUC: 0.67655 | best_iter=196 | time 17.2s


Seed 7 | Fold 2/5 - train 2302 va 576


Seed 7 | Fold 2 AUC: 0.66259 | best_iter=19 | time 16.7s


Seed 7 | Fold 3/5 - train 2302 va 576


Seed 7 | Fold 3 AUC: 0.68034 | best_iter=99 | time 16.7s


Seed 7 | Fold 4/5 - train 2303 va 575


Seed 7 | Fold 4 AUC: 0.63693 | best_iter=211 | time 17.3s


Seed 7 | Fold 5/5 - train 2303 va 575


Seed 7 | Fold 5 AUC: 0.70461 | best_iter=161 | time 17.0s


Seed 7 OOF AUC: 0.66390


=== Seed 11 / 5 ===
Seed 11 | Fold 1/5 - train 2302 va 576


Seed 11 | Fold 1 AUC: 0.64743 | best_iter=86 | time 16.7s


Seed 11 | Fold 2/5 - train 2302 va 576


Seed 11 | Fold 2 AUC: 0.66429 | best_iter=61 | time 16.4s


Seed 11 | Fold 3/5 - train 2302 va 576


Seed 11 | Fold 3 AUC: 0.65905 | best_iter=80 | time 16.5s


Seed 11 | Fold 4/5 - train 2303 va 575


Seed 11 | Fold 4 AUC: 0.62500 | best_iter=624 | time 18.9s


Seed 11 | Fold 5/5 - train 2303 va 575


Seed 11 | Fold 5 AUC: 0.71021 | best_iter=8 | time 16.3s


Seed 11 OOF AUC: 0.63608


=== Seed 77 / 5 ===
Seed 77 | Fold 1/5 - train 2302 va 576


Seed 77 | Fold 1 AUC: 0.67007 | best_iter=88 | time 16.6s


Seed 77 | Fold 2/5 - train 2302 va 576


Seed 77 | Fold 2 AUC: 0.65960 | best_iter=55 | time 16.4s


Seed 77 | Fold 3/5 - train 2302 va 576


Seed 77 | Fold 3 AUC: 0.69084 | best_iter=47 | time 16.6s


Seed 77 | Fold 4/5 - train 2303 va 575


Seed 77 | Fold 4 AUC: 0.64407 | best_iter=196 | time 17.6s


Seed 77 | Fold 5/5 - train 2303 va 575


Seed 77 | Fold 5 AUC: 0.70803 | best_iter=10 | time 16.7s


Seed 77 OOF AUC: 0.65565


Bagged XGB (SVD+dense) OOF AUC: 0.66977
Saved submission.csv; head:
  request_id  requester_received_pizza
0  t3_1aw5zf                  0.350501
1   t3_roiuw                  0.218400
2   t3_mjnbq                  0.233026
3   t3_t8wd1                  0.197108
4  t3_1m4zxu                  0.148176
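
The log above reports one OOF AUC per seed and then a single "Bagged" OOF AUC, which suggests the per-seed OOF prediction vectors are averaged before scoring. A minimal sketch of that idea follows, assuming plain mean averaging; `bag_oof`, `y_demo`, and `oof_demo` are hypothetical names, and the data is synthetic rather than the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def bag_oof(oof_by_seed):
    """Average per-seed OOF prediction vectors into one bagged vector."""
    stacked = np.vstack(list(oof_by_seed.values()))  # shape: (n_seeds, n_samples)
    return stacked.mean(axis=0)


# Synthetic stand-in data (assumption): a weak shared signal plus independent
# per-seed noise, so averaging across seeds reduces the noise.
rng = np.random.default_rng(0)
n_demo = 500
y_demo = rng.integers(0, 2, size=n_demo)
signal = 0.3 * y_demo
seeds = (42, 2025, 7, 11, 77)
oof_demo = {s: signal + rng.normal(0.0, 0.5, size=n_demo) for s in seeds}

# Score each seed on its own, then score the bagged vector.
per_seed_auc = {s: roc_auc_score(y_demo, p) for s, p in oof_demo.items()}
bagged = bag_oof(oof_demo)
bagged_auc = roc_auc_score(y_demo, bagged)

for s, auc in per_seed_auc.items():
    print(f"Seed {s} OOF AUC: {auc:.5f}")
print(f"Bagged OOF AUC: {bagged_auc:.5f}")
```

Because AUC depends only on ranking, the bagged score is not the mean of the per-seed scores; it can exceed every individual seed when their errors are only partly correlated, which is consistent with the pattern in the log.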


In [22]:
# Blend pivot LR (sparse) with bagged XGB (SVD+dense) using OOF to tune weight; write submission
import time, gc, sys, numpy as np, pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from scipy.sparse import hstack

assert 'combine_raw_text' in globals() and 'clean_text_series' in globals() and 'build_subreddit_text' in globals() and 'build_meta_minimal' in globals(), 'Run pivot cell 7 first'
assert 'word_params' in globals() and 'char_params' in globals() and 'subs_params' in globals(), 'Run pivot cell 7 first'
assert 'oof_mean' in globals() and 'test_mean' in globals(), 'Run cell 11 (dense XGB) to get oof_mean/test_mean'

y = train[target_col].astype(int).values
n_splits = 5
cv = list(StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42).split(train, y))
print(f'Prepared {n_splits}-fold StratifiedKFold CV (shuffled).')

# Precompute test sources via same preprocessing as pivot
raw_te_text = combine_raw_text(test)
clean_te_text = clean_text_series(raw_te_text)
subs_te_text = build_subreddit_text(test)
meta_te_full = build_meta_minimal(test)

# Containers for LR (pivot) OOF/test
oof_lr = np.zeros(len(train), dtype=np.float32)
test_lr_folds = []
LR_C = 0.5

for fold, (tr_idx, va_idx) in enumerate(cv):
    t0 = time.time()
    print(f'Fold {fold+1}/{n_splits} - train {len(tr_idx)} va {len(va_idx)}')
    sys.stdout.flush()

    # tr_idx/va_idx from StratifiedKFold are positional, so index with .iloc
    tr_text_raw = combine_raw_text(train.iloc[tr_idx])
    va_text_raw = combine_raw_text(train.iloc[va_idx])
    tr_text = clean_text_series(tr_text_raw)
    va_text = clean_text_series(va_text_raw)
    tr_subs = build_subreddit_text(train.iloc[tr_idx])
    va_subs = build_subreddit_text(train.iloc[va_idx])

    tfidf_w = TfidfVectorizer(**word_params)
    Xw_tr = tfidf_w.fit_transform(tr_text)
    Xw_va = tfidf_w.transform(va_text)
    Xw_te = tfidf_w.transform(clean_te_text)

    tfidf_c = TfidfVectorizer(**char_params)
    Xc_tr = tfidf_c.fit_transform(tr_text)
    Xc_va = tfidf_c.transform(va_text)
    Xc_te = tfidf_c.transform(clean_te_text)

    tfidf_s = TfidfVectorizer(**subs_params)
    Xs_tr = tfidf_s.fit_transform(tr_subs)
    Xs_va = tfidf_s.transform(va_subs)
    Xs_te = tfidf_s.transform(subs_te_text)

    meta_tr = build_meta_minimal(train.iloc[tr_idx])  # positional fold indices
    meta_va = build_meta_minimal(train.iloc[va_idx])
    scaler = StandardScaler(with_mean=False)
    Xm_tr = scaler.fit_transform(meta_tr)
    Xm_va = scaler.transform(meta_va)
    Xm_te = scaler.transform(meta_te_full)

    X_tr = hstack([Xw_tr, Xc_tr, Xs_tr, Xm_tr], format='csr')
    X_va = hstack([Xw_va, Xc_va, Xs_va, Xm_va], format='csr')
    X_te = hstack([Xw_te, Xc_te, Xs_te, Xm_te], format='csr')

    lr = LogisticRegression(solver='saga', penalty='l2', C=LR_C, class_weight=None, max_iter=6000, n_jobs=-1, random_state=42, verbose=0)
    lr.fit(X_tr, y[tr_idx])
    va_lr = lr.predict_proba(X_va)[:,1].astype(np.float32)
    oof_lr[va_idx] = va_lr
    te_lr = lr.predict_proba(X_te)[:,1].astype(np.float32)
    test_lr_folds.append(te_lr)

    auc = roc_auc_score(y[va_idx], va_lr)
    print(f'Fold {fold+1} LR AUC: {auc:.5f} | time {time.time()-t0:.1f}s')
    sys.stdout.flush()

    del Xw_tr, Xw_va, Xw_te, Xc_tr, Xc_va, Xc_te, Xs_tr, Xs_va, Xs_te, Xm_tr, Xm_va, Xm_te, X_tr, X_va, X_te, lr
    gc.collect()

# OOF diagnostics
auc_lr_oof = roc_auc_score(y, oof_lr)
print(f'LR (pivot) OOF AUC: {auc_lr_oof:.5f}')
auc_xgb_dense_oof = roc_auc_score(y, oof_mean)
print(f'Dense XGB OOF AUC: {auc_xgb_dense_oof:.5f}')

# Blend weight search with fine grid
best_auc, best_w = -1.0, 0.0
for w in np.linspace(0.0, 1.0, 201):
    blend = (1.0 - w)*oof_lr + w*oof_mean
    auc = roc_auc_score(y, blend)
    if auc > best_auc:
        best_auc, best_w = auc, float(w)
print(f'Best blend w(XGB_dense)={best_w:.3f} OOF AUC: {best_auc:.5f}')

# Build blended test predictions
test_lr = np.mean(test_lr_folds, axis=0).astype(np.float32)
test_pred = (1.0 - best_w)*test_lr + best_w*test_mean.astype(np.float32)
sub = pd.DataFrame({id_col: test[id_col].values, target_col: test_pred})
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv; head:')
print(sub.head())

# Cache OOF/test preds for reuse (later blending cells load these files)
np.save('oof_lr_pivot.npy', oof_lr)
np.save('oof_xgb_dense.npy', oof_mean)
np.save('test_lr_pivot.npy', test_lr)
np.save('test_xgb_dense.npy', test_mean.astype(np.float32))

Prepared 5-fold StratifiedKFold CV (shuffled).
Fold 1/5 - train 2302 va 576


Fold 1 LR AUC: 0.68299 | time 21.6s


Fold 2/5 - train 2302 va 576


Fold 2 LR AUC: 0.66915 | time 20.6s


Fold 3/5 - train 2302 va 576


Fold 3 LR AUC: 0.69675 | time 22.0s


Fold 4/5 - train 2303 va 575


Fold 4 LR AUC: 0.64766 | time 19.4s


Fold 5/5 - train 2303 va 575


Fold 5 LR AUC: 0.69875 | time 20.5s


LR (pivot) OOF AUC: 0.67860
Dense XGB OOF AUC: 0.67561


Best blend w(XGB_dense)=0.510 OOF AUC: 0.68999
Saved submission.csv; head:
  request_id  requester_received_pizza
0  t3_1aw5zf                  0.319026
1   t3_roiuw                  0.200970
2   t3_mjnbq                  0.210793
3   t3_t8wd1                  0.202225
4  t3_1m4zxu                  0.209998
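
The two-model weight search above can be factored into a small helper. The following is a minimal sketch, not the notebook's own code: `best_blend_weight` and the synthetic `y_demo`, `pred_a`, `pred_b` arrays are hypothetical names used only for illustration, standing in for `y`, `oof_lr` and `oof_mean`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def best_blend_weight(y_true, pred_a, pred_b, n_grid=201):
    """Return (w, auc) maximizing the AUC of (1-w)*pred_a + w*pred_b on an even grid over [0, 1]."""
    best_w, best_auc = 0.0, -1.0
    for w in np.linspace(0.0, 1.0, n_grid):
        auc = roc_auc_score(y_true, (1.0 - w) * pred_a + w * pred_b)
        if auc > best_auc:
            best_w, best_auc = float(w), float(auc)
    return best_w, best_auc

# Synthetic stand-ins (assumption): two noisy views of the same underlying signal
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=500)
signal = y_demo + rng.normal(0.0, 1.0, size=500)
pred_a = signal + rng.normal(0.0, 1.0, size=500)
pred_b = signal + rng.normal(0.0, 1.0, size=500)

w_demo, auc_demo = best_blend_weight(y_demo, pred_a, pred_b)
print(f'best w={w_demo:.3f} blended AUC={auc_demo:.5f}')
```

Because the grid includes w=0 and w=1, the blended OOF AUC can never be lower than the better single model. The weight is tuned on the same OOF predictions it is scored on, so the reported blend AUC is likely to be slightly optimistic.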


In [23]:
# Elastic-net LR on pivot sparse features; blend with dense XGB OOF (aim >= 0.692)
import time, gc, sys, numpy as np, pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from scipy.sparse import hstack

assert 'combine_raw_text' in globals() and 'clean_text_series' in globals() and 'build_subreddit_text' in globals() and 'build_meta_minimal' in globals(), 'Run pivot cell 7 first'
assert 'word_params' in globals() and 'char_params' in globals() and 'subs_params' in globals(), 'Run pivot cell 7 first'
assert 'oof_mean' in globals() and 'test_mean' in globals(), 'Run cell 11 (dense XGB) first'

y = train[target_col].astype(int).values
n_splits = 5
cv = list(StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42).split(train, y))
print(f'Prepared {n_splits}-fold StratifiedKFold CV (shuffled).')

# Precompute test sources via same preprocessing as pivot
raw_te_text = combine_raw_text(test)
clean_te_text = clean_text_series(raw_te_text)
subs_te_text = build_subreddit_text(test)
meta_te_full = build_meta_minimal(test)

# EN grid
Cs = [0.2, 0.5, 1.0]
l1s = [0.1, 0.3, 0.5]
oof_grid = {(C,l1): np.zeros(len(train), dtype=np.float32) for C in Cs for l1 in l1s}
test_grid = {(C,l1): [] for C in Cs for l1 in l1s}

for fold, (tr_idx, va_idx) in enumerate(cv):
    t0 = time.time()
    print(f'Fold {fold+1}/{n_splits} - train {len(tr_idx)} va {len(va_idx)}')
    sys.stdout.flush()

    # tr_idx/va_idx from StratifiedKFold are positional, so index with .iloc
    tr_text_raw = combine_raw_text(train.iloc[tr_idx])
    va_text_raw = combine_raw_text(train.iloc[va_idx])
    tr_text = clean_text_series(tr_text_raw)
    va_text = clean_text_series(va_text_raw)
    tr_subs = build_subreddit_text(train.iloc[tr_idx])
    va_subs = build_subreddit_text(train.iloc[va_idx])

    tfidf_w = TfidfVectorizer(**word_params)
    Xw_tr = tfidf_w.fit_transform(tr_text)
    Xw_va = tfidf_w.transform(va_text)
    Xw_te = tfidf_w.transform(clean_te_text)

    tfidf_c = TfidfVectorizer(**char_params)
    Xc_tr = tfidf_c.fit_transform(tr_text)
    Xc_va = tfidf_c.transform(va_text)
    Xc_te = tfidf_c.transform(clean_te_text)

    tfidf_s = TfidfVectorizer(**subs_params)
    Xs_tr = tfidf_s.fit_transform(tr_subs)
    Xs_va = tfidf_s.transform(va_subs)
    Xs_te = tfidf_s.transform(subs_te_text)

    meta_tr = build_meta_minimal(train.iloc[tr_idx])  # positional fold indices
    meta_va = build_meta_minimal(train.iloc[va_idx])
    scaler = StandardScaler(with_mean=False)
    Xm_tr = scaler.fit_transform(meta_tr)
    Xm_va = scaler.transform(meta_va)
    Xm_te = scaler.transform(meta_te_full)

    X_tr = hstack([Xw_tr, Xc_tr, Xs_tr, Xm_tr], format='csr')
    X_va = hstack([Xw_va, Xc_va, Xs_va, Xm_va], format='csr')
    X_te = hstack([Xw_te, Xc_te, Xs_te, Xm_te], format='csr')

    for C in Cs:
        for l1 in l1s:
            clf = LogisticRegression(solver='saga', penalty='elasticnet', l1_ratio=l1, C=C, class_weight=None, max_iter=6000, n_jobs=-1, random_state=42, verbose=0)
            clf.fit(X_tr, y[tr_idx])
            va_pred = clf.predict_proba(X_va)[:,1].astype(np.float32)
            oof_grid[(C,l1)][va_idx] = va_pred
            te_pred = clf.predict_proba(X_te)[:,1].astype(np.float32)
            test_grid[(C,l1)].append(te_pred)

    ref_auc = roc_auc_score(y[va_idx], oof_grid[(0.5,0.3)][va_idx]) if (0.5,0.3) in oof_grid else roc_auc_score(y[va_idx], list(oof_grid.values())[0][va_idx])
    print(f'Fold {fold+1} done in {time.time()-t0:.1f}s | AUC@C=0.5,l1=0.3: {ref_auc:.5f}')
    sys.stdout.flush()

    del Xw_tr, Xw_va, Xw_te, Xc_tr, Xc_va, Xc_te, Xs_tr, Xs_va, Xs_te, Xm_tr, Xm_va, Xm_te, X_tr, X_va, X_te, clf
    gc.collect()

# Select best EN config
auc_per = {}
for key, oof in oof_grid.items():
    auc = roc_auc_score(y, oof)
    auc_per[key] = auc
    print(f'EN C={key[0]} l1={key[1]} OOF AUC: {auc:.5f}')
best_key = max(auc_per, key=auc_per.get)
print(f'Best EN: C={best_key[0]} l1={best_key[1]} with OOF {auc_per[best_key]:.5f}')

# Build EN test preds
test_en = np.mean(test_grid[best_key], axis=0).astype(np.float32)
oof_en = oof_grid[best_key]

# Blend EN with dense XGB OOF to tune weight
best_auc, best_w = -1.0, 0.0
for w in np.linspace(0.0, 1.0, 201):
    blend = (1.0 - w)*oof_en + w*oof_mean
    auc = roc_auc_score(y, blend)
    if auc > best_auc:
        best_auc, best_w = auc, float(w)
print(f'Best EN+XGBdense w={best_w:.3f} OOF AUC: {best_auc:.5f}')

# Build blended test predictions and save submission
test_pred = (1.0 - best_w)*test_en + best_w*test_mean.astype(np.float32)
sub = pd.DataFrame({id_col: test[id_col].values, target_col: test_pred})
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv; head:')
print(sub.head())

# Cache EN OOF/test
np.save('oof_en_pivot.npy', oof_en)
np.save('test_en_pivot.npy', test_en)

Prepared 5-fold StratifiedKFold CV (shuffled).
Fold 1/5 - train 2302 va 576


Fold 1 done in 496.2s | AUC@C=0.5,l1=0.3: 0.65938


Fold 2/5 - train 2302 va 576


KeyboardInterrupt: 

In [24]:
# Meta-only XGB (GPU, early stopping, seed bag) + 3-way blend with LR(pivot) and SVD+XGB(dense)
import time, gc, sys, numpy as np, pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

try:
    import xgboost as xgb
except ImportError:
    # Install on demand, fail loudly if pip fails, then retry the import
    import subprocess
    subprocess.run([sys.executable, '-m', 'pip', 'install', '-q', 'xgboost'], check=True)
    import xgboost as xgb

assert 'build_meta_enhanced' in globals(), 'Run cell 10 to define build_meta_enhanced'

# Load or access OOF/test for existing models
def _safe_load(name, fallback=None):
    import os
    if os.path.exists(name):
        return np.load(name)
    return fallback

oof_lr = globals().get('oof_lr', _safe_load('oof_lr_pivot.npy'))
test_lr = globals().get('test_lr', _safe_load('test_lr_pivot.npy'))
oof_dense = globals().get('oof_mean', _safe_load('oof_xgb_dense.npy'))
test_dense = globals().get('test_mean', _safe_load('test_xgb_dense.npy'))
assert oof_lr is not None and oof_dense is not None, 'Missing base OOF preds; run cells 11 and 12'

y = train[target_col].astype(int).values
n_splits = 5
cv = list(StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42).split(train, y))
print(f'Prepared {n_splits}-fold StratifiedKFold CV (shuffled).')

# Precompute meta features
meta_tr_full = build_meta_enhanced(train).astype(np.float32)
meta_te_full = build_meta_enhanced(test).astype(np.float32)

seeds = [42, 2025, 7]
oof_meta_bag = np.zeros((len(train), len(seeds)), dtype=np.float32)
test_meta_bag = np.zeros((len(test), len(seeds)), dtype=np.float32)

# XGB params for dense/meta-only view
base_params = dict(
    objective='binary:logistic',
    eval_metric='auc',
    max_depth=5,
    eta=0.04,
    subsample=0.7,
    colsample_bytree=0.7,
    min_child_weight=4,
    reg_alpha=1.0,
    reg_lambda=2.0,
    gamma=0.0,
    device='cuda'
)

for si, seed in enumerate(seeds):
    params = dict(base_params); params['seed'] = seed
    oof_meta = np.zeros(len(train), dtype=np.float32)
    test_fold_preds = []
    for fold, (tr_idx, va_idx) in enumerate(cv):
        t0 = time.time()
        Xtr = meta_tr_full.iloc[tr_idx].values
        Xva = meta_tr_full.iloc[va_idx].values
        Xte = meta_te_full.values
        dtrain = xgb.DMatrix(Xtr, label=y[tr_idx])
        dvalid = xgb.DMatrix(Xva, label=y[va_idx])
        dtest  = xgb.DMatrix(Xte)
        booster = xgb.train(
            params, dtrain, num_boost_round=2000, evals=[(dtrain,'train'),(dvalid,'valid')],
            early_stopping_rounds=150, verbose_eval=False
        )
        va_pred = booster.predict(dvalid, iteration_range=(0, booster.best_iteration+1)).astype(np.float32)
        te_pred = booster.predict(dtest, iteration_range=(0, booster.best_iteration+1)).astype(np.float32)
        oof_meta[va_idx] = va_pred
        test_fold_preds.append(te_pred)
        auc = roc_auc_score(y[va_idx], va_pred)
        print(f'Seed {seed} | Fold {fold+1} AUC: {auc:.5f} | best_iter={booster.best_iteration} | {time.time()-t0:.1f}s')
        del Xtr, Xva, Xte, dtrain, dvalid, dtest, booster
        gc.collect()
    oof_meta_bag[:, si] = oof_meta
    test_meta_bag[:, si] = np.mean(test_fold_preds, axis=0).astype(np.float32)
    print(f'Seed {seed} OOF AUC (meta-only): {roc_auc_score(y, oof_meta):.5f}')

oof_meta = oof_meta_bag.mean(axis=1)
test_meta = test_meta_bag.mean(axis=1)
print(f'Bagged meta-only XGB OOF AUC: {roc_auc_score(y, oof_meta):.5f}')

# 3-way blend on OOF: w1*LR + w2*DenseXGB + w3*MetaXGB, w sum=1
best_auc, best_weights = -1.0, (1.0, 0.0, 0.0)
grid = np.linspace(0.0, 1.0, 51)
for w1 in grid:
    for w2 in grid:
        if w1 + w2 > 1.0:
            continue
        w3 = 1.0 - w1 - w2
        blend = w1*oof_lr + w2*oof_dense + w3*oof_meta
        auc = roc_auc_score(y, blend)
        if auc > best_auc:
            best_auc, best_weights = auc, (float(w1), float(w2), float(w3))
print(f'Best 3-way weights (LR, DenseXGB, MetaXGB)={best_weights} OOF AUC: {best_auc:.5f}')

# Build and save submission
if test_lr is None:
    test_lr = _safe_load('test_lr_pivot.npy')
if test_dense is None:
    test_dense = _safe_load('test_xgb_dense.npy')
assert test_lr is not None and test_dense is not None, 'Missing base test preds; rerun cells 11 and 12'
w1, w2, w3 = best_weights
test_blend = (w1*test_lr + w2*test_dense + w3*test_meta).astype(np.float32)
sub = pd.DataFrame({id_col: test[id_col].values, target_col: test_blend})
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv; head:')
print(sub.head())

# Cache meta-only OOF/test
np.save('oof_xgb_meta.npy', oof_meta)
np.save('test_xgb_meta.npy', test_meta)

Prepared 5-fold StratifiedKFold CV (shuffled).


Seed 42 | Fold 1 AUC: 0.67961 | best_iter=120 | 0.8s


Seed 42 | Fold 2 AUC: 0.66006 | best_iter=21 | 0.5s


Seed 42 | Fold 3 AUC: 0.70261 | best_iter=97 | 0.7s


Seed 42 | Fold 4 AUC: 0.60509 | best_iter=58 | 0.6s


Seed 42 | Fold 5 AUC: 0.68038 | best_iter=35 | 0.5s
Seed 42 OOF AUC (meta-only): 0.65991


Seed 2025 | Fold 1 AUC: 0.67210 | best_iter=81 | 0.6s


Seed 2025 | Fold 2 AUC: 0.65923 | best_iter=139 | 0.7s


Seed 2025 | Fold 3 AUC: 0.69231 | best_iter=94 | 0.7s


Seed 2025 | Fold 4 AUC: 0.60857 | best_iter=23 | 0.5s


Seed 2025 | Fold 5 AUC: 0.68572 | best_iter=39 | 0.5s
Seed 2025 OOF AUC (meta-only): 0.66264


Seed 7 | Fold 1 AUC: 0.67170 | best_iter=176 | 0.8s


Seed 7 | Fold 2 AUC: 0.66174 | best_iter=93 | 0.6s


Seed 7 | Fold 3 AUC: 0.69701 | best_iter=95 | 0.6s


Seed 7 | Fold 4 AUC: 0.61009 | best_iter=53 | 0.5s


Seed 7 | Fold 5 AUC: 0.67805 | best_iter=80 | 0.6s
Seed 7 OOF AUC (meta-only): 0.66368
Bagged meta-only XGB OOF AUC: 0.66619


Best 3-way weights (LR, DenseXGB, MetaXGB)=(0.34, 0.42, 0.23999999999999994) OOF AUC: 0.69182
Saved submission.csv; head:
  request_id  requester_received_pizza
0  t3_1aw5zf                  0.330313
1   t3_roiuw                  0.214858
2   t3_mjnbq                  0.215433
3   t3_t8wd1                  0.206821
4  t3_1m4zxu                  0.225135
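
The printed weight `0.23999999999999994` comes from computing `w3 = 1.0 - w1 - w2` in floating point. One way to avoid that drift is to enumerate the simplex with integer steps and divide once. This is a sketch under that assumption; `simplex_weights` is a hypothetical helper, not part of the notebook.

```python
def simplex_weights(n_steps):
    """Yield (w1, w2, w3), each a multiple of 1/n_steps, with w1 + w2 + w3 == 1."""
    for i in range(n_steps + 1):
        for j in range(n_steps + 1 - i):
            k = n_steps - i - j  # remaining mass goes to the third model
            yield i / n_steps, j / n_steps, k / n_steps

# n_steps=50 corresponds to the 0.02 grid used in the coarse 3-way search
coarse = list(simplex_weights(50))
print(len(coarse))  # 1326
print((0.34, 0.42, 0.24) in coarse)  # True
```

Each triple can be plugged into the same `w1*oof_lr + w2*oof_dense + w3*oof_meta` expression. With `n_steps=100` the generator yields 5151 triples, the size of a 0.01-step simplex grid.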


In [26]:
# Local fine-grid 3-way blend around coarse best (w1=0.34, w2=0.42, w3=0.24) with 0.001 step; build submission
import numpy as np, pandas as pd, time
from sklearn.metrics import roc_auc_score

# Load cached OOF/test preds
oof_lr = np.load('oof_lr_pivot.npy')
oof_dense = np.load('oof_xgb_dense.npy')
oof_meta = np.load('oof_xgb_meta.npy')
test_lr = np.load('test_lr_pivot.npy')
test_dense = np.load('test_xgb_dense.npy')
test_meta = np.load('test_xgb_meta.npy')

y = train[target_col].astype(int).values

# Coarse best from prior step
w1_c, w2_c = 0.34, 0.42
step = 0.001
w1_min, w1_max = 0.26, 0.42
w2_min, w2_max = 0.34, 0.50

best_auc, best_w = -1.0, (w1_c, w2_c, 1.0 - w1_c - w2_c)
t0 = time.time()
w1_grid = np.arange(w1_min, w1_max + 1e-12, step)
w2_grid = np.arange(w2_min, w2_max + 1e-12, step)
cnt = 0
for w1 in w1_grid:
    for w2 in w2_grid:
        w3 = 1.0 - w1 - w2
        if w3 < 0.0 or w3 > 1.0:
            continue
        blend = w1*oof_lr + w2*oof_dense + w3*oof_meta
        auc = roc_auc_score(y, blend)
        cnt += 1
        if auc > best_auc:
            best_auc, best_w = auc, (float(w1), float(w2), float(w3))
print(f'Local fine-grid ({cnt} combos) best weights (LR, DenseXGB, MetaXGB)={best_w} OOF AUC: {best_auc:.5f} | search {time.time()-t0:.1f}s')

# Build submission with best weights
w1, w2, w3 = best_w
test_blend = (w1*test_lr + w2*test_dense + w3*test_meta).astype(np.float32)
sub = pd.DataFrame({id_col: test[id_col].values, target_col: test_blend})
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv; head:')
print(sub.head())

Local fine-grid (25921 combos) best weights (LR, DenseXGB, MetaXGB)=(0.3420000000000001, 0.4180000000000001, 0.23999999999999982) OOF AUC: 0.69184 | search 51.9s
Saved submission.csv; head:
  request_id  requester_received_pizza
0  t3_1aw5zf                  0.330162
1   t3_roiuw                  0.214765
2   t3_mjnbq                  0.215370
3   t3_t8wd1                  0.206919
4  t3_1m4zxu                  0.225410


In [27]:
# Rank-averaged 3-way blend (robust for ROC AUC); global 0.01 simplex grid; build submission
import numpy as np, pandas as pd, time
from sklearn.metrics import roc_auc_score

def rank_normalize(arr):
    # Ordinal rank scaled to [0,1]; ties are broken by argsort order, not averaged
    order = np.argsort(arr)
    ranks = np.empty_like(order, dtype=np.float64)
    ranks[order] = np.arange(len(arr), dtype=np.float64)
    return ranks / (len(arr) - 1.0)

# Load cached OOF/test preds
oof_lr = np.load('oof_lr_pivot.npy')
oof_dense = np.load('oof_xgb_dense.npy')
oof_meta = np.load('oof_xgb_meta.npy')
test_lr = np.load('test_lr_pivot.npy')
test_dense = np.load('test_xgb_dense.npy')
test_meta = np.load('test_xgb_meta.npy')

y = train[target_col].astype(int).values

# Rank-normalize
o1 = rank_normalize(oof_lr); o2 = rank_normalize(oof_dense); o3 = rank_normalize(oof_meta)
t1 = rank_normalize(test_lr); t2 = rank_normalize(test_dense); t3 = rank_normalize(test_meta)

# Global simplex search with step=0.01
best_auc, best_w = -1.0, (1.0, 0.0, 0.0)
grid = np.linspace(0.0, 1.0, 101)
t0 = time.time()
cnt = 0
for w1 in grid:
    for w2 in grid:
        if w1 + w2 > 1.0:
            continue
        w3 = 1.0 - w1 - w2
        blend = w1*o1 + w2*o2 + w3*o3
        auc = roc_auc_score(y, blend)
        cnt += 1
        if auc > best_auc:
            best_auc, best_w = auc, (float(w1), float(w2), float(w3))
print(f'Rank blend ({cnt} combos) best weights (LR, DenseXGB, MetaXGB)={best_w} OOF AUC: {best_auc:.5f} | search {time.time()-t0:.1f}s')

# Build submission with best rank-averaged weights
w1, w2, w3 = best_w
test_rank_blend = (w1*t1 + w2*t2 + w3*t3).astype(np.float32)
sub = pd.DataFrame({id_col: test[id_col].values, target_col: test_rank_blend})
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv; head:')
print(sub.head())

Rank blend (5151 combos) best weights (LR, DenseXGB, MetaXGB)=(0.43, 0.34, 0.23000000000000004) OOF AUC: 0.69104 | search 10.3s
Saved submission.csv; head:
  request_id  requester_received_pizza
0  t3_1aw5zf                  0.799543
1   t3_roiuw                  0.489905
2   t3_mjnbq                  0.488717
3   t3_t8wd1                  0.472972
4  t3_1m4zxu                  0.532257
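
The `rank_normalize` helper above assigns distinct ranks to equal scores. If tied predictions should share a rank, a tie-aware variant can be sketched with `scipy.stats.rankdata`. `rank_normalize_tied` is a hypothetical name, and this variant was not used to produce the logged result, so its blend AUC may differ slightly.

```python
import numpy as np
from scipy.stats import rankdata

def rank_normalize_tied(arr):
    """Map scores to [0, 1] by rank; tied scores receive the average of their ranks."""
    arr = np.asarray(arr, dtype=np.float64)
    if arr.size < 2:
        return np.zeros_like(arr)
    ranks = rankdata(arr, method='average')  # ranks run from 1 to n
    return (ranks - 1.0) / (arr.size - 1.0)

demo = np.array([0.2, 0.2, 0.5, 0.9])
demo_ranks = rank_normalize_tied(demo)
print(demo_ranks)  # the two tied 0.2 scores get the same normalized rank
```

For continuous model outputs ties are rare, so the two versions usually agree. The difference matters mainly for tree models with few distinct leaf values.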


In [28]:
# Ultra-local fine-grid refinement around best prob-blend (w1=0.342, w2=0.418); step=0.0002
import numpy as np, pandas as pd, time
from sklearn.metrics import roc_auc_score

# Load cached OOF/test preds
oof_lr = np.load('oof_lr_pivot.npy')
oof_dense = np.load('oof_xgb_dense.npy')
oof_meta = np.load('oof_xgb_meta.npy')
test_lr = np.load('test_lr_pivot.npy')
test_dense = np.load('test_xgb_dense.npy')
test_meta = np.load('test_xgb_meta.npy')

y = train[target_col].astype(int).values

# Center around previous best (0.342, 0.418); +/- 0.01 window with 0.0002 step
w1_c, w2_c = 0.342, 0.418
step = 0.0002
w1_min, w1_max = w1_c - 0.010, w1_c + 0.010
w2_min, w2_max = w2_c - 0.010, w2_c + 0.010

best_auc, best_w = -1.0, (w1_c, w2_c, 1.0 - w1_c - w2_c)
t0 = time.time()
w1_grid = np.arange(w1_min, w1_max + 1e-12, step)
w2_grid = np.arange(w2_min, w2_max + 1e-12, step)
cnt = 0
for w1 in w1_grid:
    for w2 in w2_grid:
        w3 = 1.0 - w1 - w2
        if w3 < 0.0 or w3 > 1.0:
            continue
        blend = w1*oof_lr + w2*oof_dense + w3*oof_meta
        auc = roc_auc_score(y, blend)
        cnt += 1
        if auc > best_auc:
            best_auc, best_w = auc, (float(w1), float(w2), float(w3))
print(f'Ultra-local fine-grid ({cnt} combos) best weights (LR, DenseXGB, MetaXGB)={best_w} OOF AUC: {best_auc:.5f} | search {time.time()-t0:.1f}s')

# Build submission with best weights
w1, w2, w3 = best_w
test_blend = (w1*test_lr + w2*test_dense + w3*test_meta).astype(np.float32)
sub = pd.DataFrame({id_col: test[id_col].values, target_col: test_blend})
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv; head:')
print(sub.head())

Ultra-local fine-grid (10201 combos) best weights (LR, DenseXGB, MetaXGB)=(0.341399999999999, 0.40919999999999984, 0.24940000000000112) OOF AUC: 0.69185 | search 20.4s
Saved submission.csv; head:
  request_id  requester_received_pizza
0  t3_1aw5zf                  0.330208
1   t3_roiuw                  0.215065
2   t3_mjnbq                  0.215385
3   t3_t8wd1                  0.207358
4  t3_1m4zxu                  0.226728


In [30]:
# Move 1: Advanced weight optimization using cached strong OOF/test (3-way blend)
import numpy as np, pandas as pd, time, sys, gc
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

print('Loading cached OOF/test predictions (LR Pivot, Dense XGB 3-seed, Meta XGB 3-seed)...')
oof_lr = np.load('oof_lr_pivot.npy')
oof_dense = np.load('oof_xgb_dense.npy')
oof_meta = np.load('oof_xgb_meta.npy')
test_lr = np.load('test_lr_pivot.npy')
test_dense = np.load('test_xgb_dense.npy')
test_meta = np.load('test_xgb_meta.npy')

y = train['requester_received_pizza'].astype(int).values
n = len(y)
assert oof_lr.shape[0] == n and oof_dense.shape[0] == n and oof_meta.shape[0] == n, 'OOF length mismatch'

def eval_auc(w1, w2):
    w3 = 1.0 - w1 - w2
    if w3 < 0 or w3 > 1:
        return -1.0
    blend = w1*oof_lr + w2*oof_dense + w3*oof_meta
    return roc_auc_score(y, blend)

# Start around prior best (from Cell 17): (w1, w2, w3) = (0.3414, 0.4092, 0.2494)
w1_c, w2_c = 0.3414, 0.4092

# Coarse-to-fine local search around the prior best (best_auc/best_w are shared across passes)
best_auc, best_w = -1.0, (w1_c, w2_c, 1.0 - w1_c - w2_c)
def grid_search(w1_center, w2_center, window, step):
    global best_auc, best_w
    w1_min, w1_max = w1_center - window, w1_center + window
    w2_min, w2_max = w2_center - window, w2_center + window
    w1_grid = np.arange(max(0.0, w1_min), min(1.0, w1_max) + 1e-12, step)
    w2_grid = np.arange(max(0.0, w2_min), min(1.0, w2_max) + 1e-12, step)
    t0 = time.time()
    cnt = 0
    for w1 in w1_grid:
        for w2 in w2_grid:
            if w1 + w2 > 1.0:
                continue
            auc = eval_auc(w1, w2)
            cnt += 1
            if auc > best_auc:
                best_auc = auc
                best_w = (float(w1), float(w2), float(1.0 - w1 - w2))
    print(f'Grid window={window} step={step} tried {cnt} combos | best weights={best_w} OOF AUC: {best_auc:.5f} | {time.time()-t0:.1f}s')

# Pass 1: broader window, coarse step
grid_search(w1_c, w2_c, window=0.03, step=0.002)
# Pass 2: refine around current best
grid_search(best_w[0], best_w[1], window=0.01, step=0.001)
# Pass 3: ultra-fine around current best
grid_search(best_w[0], best_w[1], window=0.004, step=0.0005)

global_best_auc, global_best_w = best_auc, best_w
print(f'Global best after coarse-to-fine: weights={global_best_w} OOF AUC: {global_best_auc:.5f}')

# Per-fold weight optimization with averaged weights
cv = list(StratifiedKFold(n_splits=5, shuffle=True, random_state=42).split(np.arange(n), y))

def per_fold_averaged_weights(step=0.02):
    w_list = []
    oof_blend = np.zeros(n, dtype=np.float32)
    for k, (tr_idx, va_idx) in enumerate(cv):
        y_tr = y[tr_idx]
        # OOF preds restricted to training part
        a1, a2, a3 = oof_lr[tr_idx], oof_dense[tr_idx], oof_meta[tr_idx]
        best_auc_k, best_w_k = -1.0, (1.0, 0.0, 0.0)
        grid = np.linspace(0.0, 1.0, int(1.0/step)+1)
        for w1 in grid:
            for w2 in grid:
                if w1 + w2 > 1.0:
                    continue
                w3 = 1.0 - w1 - w2
                blend_tr = w1*a1 + w2*a2 + w3*a3
                auc_k = roc_auc_score(y_tr, blend_tr)
                if auc_k > best_auc_k:
                    best_auc_k = auc_k
                    best_w_k = (float(w1), float(w2), float(w3))
        w_list.append(best_w_k)
        # Apply to validation fold
        b1, b2, b3 = oof_lr[va_idx], oof_dense[va_idx], oof_meta[va_idx]
        w1k, w2k, w3k = best_w_k
        oof_blend[va_idx] = (w1k*b1 + w2k*b2 + w3k*b3).astype(np.float32)
        print(f'Fold {k+1} best fold-weights={best_w_k} (train-fold AUC={best_auc_k:.5f})')
    # Average weights
    w_arr = np.array(w_list)
    w_avg = w_arr.mean(axis=0)
    # Renormalize to simplex in case of tiny drift
    s = w_avg.sum()
    if s <= 0:
        w_avg = np.array([1.0, 0.0, 0.0], dtype=np.float64)
    else:
        w_avg = w_avg / s
    auc_oof_avg = roc_auc_score(y, oof_blend)
    print(f'Per-fold averaged weights={tuple(w_avg.tolist())} | blended OOF (using per-fold weights): {auc_oof_avg:.5f}')
    # Also evaluate using the averaged weights applied globally (for stability)
    w1a, w2a, w3a = w_avg.tolist()
    auc_global_avg = eval_auc(w1a, w2a)
    print(f'Applying averaged weights globally gives OOF AUC: {auc_global_avg:.5f}')
    return tuple(w_avg.tolist()), auc_oof_avg, auc_global_avg

w_avg, auc_pf_oof, auc_pf_global = per_fold_averaged_weights(step=0.02)

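# Choose the best approach. Note: 'per_fold_oof' is scored with fold-specific weights,
# while w_avg (the averaged weights) is what would actually be applied to test.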
cand = [
    ('global', global_best_auc, global_best_w),
    ('per_fold_oof', auc_pf_oof, w_avg),
    ('per_fold_global', auc_pf_global, w_avg)
]
cand.sort(key=lambda x: x[1], reverse=True)
choice_name, choice_auc, choice_w = cand[0]
print(f'Chosen blend: {choice_name} with weights={choice_w} OOF AUC: {choice_auc:.5f}')

# Build submission with chosen weights
w1, w2, w3 = choice_w
test_blend = (w1*test_lr + w2*test_dense + w3*test_meta).astype(np.float32)
sub = pd.DataFrame({id_col: test[id_col].values, target_col: test_blend})
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv; head:')
print(sub.head())

Loading cached OOF/test predictions (LR Pivot, Dense XGB 3-seed, Meta XGB 3-seed)...


Grid window=0.03 step=0.002 tried 961 combos | best weights=(0.34140000000000004, 0.4092, 0.24939999999999996) OOF AUC: 0.69185 | 1.8s


Grid window=0.01 step=0.001 tried 441 combos | best weights=(0.34140000000000004, 0.4092, 0.24939999999999996) OOF AUC: 0.69185 | 0.8s


Grid window=0.004 step=0.0005 tried 289 combos | best weights=(0.34140000000000004, 0.4092, 0.24939999999999996) OOF AUC: 0.69185 | 0.6s
Global best after coarse-to-fine: weights=(0.34140000000000004, 0.4092, 0.24939999999999996) OOF AUC: 0.69185


Fold 1 best fold-weights=(0.34, 0.46, 0.1999999999999999) (train-fold AUC=0.69164)


Fold 2 best fold-weights=(0.34, 0.46, 0.1999999999999999) (train-fold AUC=0.69532)


Fold 3 best fold-weights=(0.36, 0.42, 0.22000000000000003) (train-fold AUC=0.68401)


Fold 4 best fold-weights=(0.28, 0.46, 0.25999999999999995) (train-fold AUC=0.70388)


Fold 5 best fold-weights=(0.4, 0.34, 0.25999999999999995) (train-fold AUC=0.68499)
Per-fold averaged weights=(0.34400000000000003, 0.42800000000000005, 0.22799999999999998) | blended OOF (using per-fold weights): 0.69104
Applying averaged weights globally gives OOF AUC: 0.69176
Chosen blend: global with weights=(0.34140000000000004, 0.4092, 0.24939999999999996) OOF AUC: 0.69185
Saved submission.csv; head:
  request_id  requester_received_pizza
0  t3_1aw5zf                  0.330208
1   t3_roiuw                  0.215065
2   t3_mjnbq                  0.215385
3   t3_t8wd1                  0.207358
4  t3_1m4zxu                  0.226728


In [31]:
# Move 2: Simple level-2 stacking (LogisticRegression) on 3 base OOF preds; fold-safe; bag test
import numpy as np, pandas as pd, time, sys, gc
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression

print('Loading cached base OOF/test predictions...')
oof_lr = np.load('oof_lr_pivot.npy')
oof_dense = np.load('oof_xgb_dense.npy')
oof_meta = np.load('oof_xgb_meta.npy')
test_lr = np.load('test_lr_pivot.npy')
test_dense = np.load('test_xgb_dense.npy')
test_meta = np.load('test_xgb_meta.npy')

y = train[target_col].astype(int).values
n = len(y)
X_base = np.vstack([oof_lr, oof_dense, oof_meta]).T.astype(np.float32)
X_test = np.vstack([test_lr, test_dense, test_meta]).T.astype(np.float32)
assert X_base.shape == (n, 3) and X_test.shape[1] == 3, 'Shape mismatch in base features'

cv = list(StratifiedKFold(n_splits=5, shuffle=True, random_state=42).split(np.arange(n), y))
oof_meta_lr = np.zeros(n, dtype=np.float32)
test_bag = []

t0 = time.time()
for k, (tr_idx, va_idx) in enumerate(cv):
    Xtr, ytr = X_base[tr_idx], y[tr_idx]
    Xva, yva = X_base[va_idx], y[va_idx]
    clf = LogisticRegression(solver='lbfgs', C=1.0, penalty='l2', max_iter=1000, n_jobs=-1, random_state=42)
    clf.fit(Xtr, ytr)
    va_pred = clf.predict_proba(Xva)[:,1].astype(np.float32)
    oof_meta_lr[va_idx] = va_pred
    test_pred = clf.predict_proba(X_test)[:,1].astype(np.float32)
    test_bag.append(test_pred)
    fold_auc = roc_auc_score(yva, va_pred)
    print(f'Fold {k+1}/5 meta AUC: {fold_auc:.5f}')
    sys.stdout.flush()

auc_oof = roc_auc_score(y, oof_meta_lr)
print(f'Stacker OOF AUC (LR on 3 base preds): {auc_oof:.5f} | time {time.time()-t0:.1f}s')

# Build test via averaging fold meta-model predictions
test_stack = np.mean(test_bag, axis=0).astype(np.float32)
sub = pd.DataFrame({id_col: test[id_col].values, target_col: test_stack})
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv; head:')
print(sub.head())

Loading cached base OOF/test predictions...


Fold 1/5 meta AUC: 0.69457


Fold 2/5 meta AUC: 0.67785


Fold 3/5 meta AUC: 0.72156


Fold 4/5 meta AUC: 0.64475


Fold 5/5 meta AUC: 0.72157


In [32]:
# Optional Move: Per-model Platt calibration (fold-safe) + refined weight search using cached OOF/test
import numpy as np, pandas as pd, time, sys, gc
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression

print('Loading cached base OOF/test predictions...')
o1 = np.load('oof_lr_pivot.npy')
o2 = np.load('oof_xgb_dense.npy')
o3 = np.load('oof_xgb_meta.npy')
t1 = np.load('test_lr_pivot.npy')
t2 = np.load('test_xgb_dense.npy')
t3 = np.load('test_xgb_meta.npy')
y = train[target_col].astype(int).values
n = len(y)

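def platt_calibrate_foldwise(oof_pred, test_pred, y, seed=42):
    # Fit a 1-D logistic (Platt-style) calibrator on each training split; apply it to the held-out split and to test.
    n = len(y)  # local row count, so the function does not depend on the notebook-level `n`
    cv = list(StratifiedKFold(n_splits=5, shuffle=True, random_state=seed).split(np.arange(n), y))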
    oof_cal = np.zeros(n, dtype=np.float32)
    test_cal_folds = []
    for k, (tr_idx, va_idx) in enumerate(cv):
        Xtr = oof_pred[tr_idx].reshape(-1,1).astype(np.float32)
        ytr = y[tr_idx]
        Xva = oof_pred[va_idx].reshape(-1,1).astype(np.float32)
        Xt  = test_pred.reshape(-1,1).astype(np.float32)
        clf = LogisticRegression(solver='lbfgs', C=1.0, penalty='l2', max_iter=1000, n_jobs=-1, random_state=seed)
        clf.fit(Xtr, ytr)
        oof_cal[va_idx] = clf.predict_proba(Xva)[:,1].astype(np.float32)
        test_cal_folds.append(clf.predict_proba(Xt)[:,1].astype(np.float32))
    test_cal = np.mean(test_cal_folds, axis=0).astype(np.float32)
    return oof_cal, test_cal

t0 = time.time()
o1c, t1c = platt_calibrate_foldwise(o1, t1, y, seed=42)
o2c, t2c = platt_calibrate_foldwise(o2, t2, y, seed=42)
o3c, t3c = platt_calibrate_foldwise(o3, t3, y, seed=42)
print(f'Calibrated single-model OOF AUCs | LR:{roc_auc_score(y,o1c):.5f} Dense:{roc_auc_score(y,o2c):.5f} Meta:{roc_auc_score(y,o3c):.5f} | {time.time()-t0:.1f}s')

def eval_auc(w1, w2, a1, a2, a3):
    w3 = 1.0 - w1 - w2
    if w3 < 0 or w3 > 1:
        return -1.0
    blend = w1*a1 + w2*a2 + w3*a3
    return roc_auc_score(y, blend)

# Start from prior best weights
w1_c, w2_c = 0.3414, 0.4092
best_auc, best_w = -1.0, (w1_c, w2_c, 1.0 - w1_c - w2_c)

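def grid_search(a1, a2, a3, w1_center, w2_center, window, step):
    # Scan a (w1, w2) window around the given center; w3 is implied as 1 - w1 - w2.
    # Results are returned rather than stored in globals.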
    w1_min, w1_max = w1_center - window, w1_center + window
    w2_min, w2_max = w2_center - window, w2_center + window
    w1_grid = np.arange(max(0.0, w1_min), min(1.0, w1_max) + 1e-12, step)
    w2_grid = np.arange(max(0.0, w2_min), min(1.0, w2_max) + 1e-12, step)
    t0 = time.time(); cnt = 0
    local_best_auc, local_best_w = -1.0, None
    for w1 in w1_grid:
        for w2 in w2_grid:
            if w1 + w2 > 1.0:
                continue
            auc = eval_auc(w1, w2, a1, a2, a3)
            cnt += 1
            if auc > local_best_auc:
                local_best_auc = auc
                local_best_w = (float(w1), float(w2), float(1.0 - w1 - w2))
    print(f'Calib grid window={window} step={step} tried {cnt} | best={local_best_w} OOF AUC: {local_best_auc:.5f} | {time.time()-t0:.1f}s')
    return local_best_auc, local_best_w

# Run coarse-to-fine on calibrated OOF; each pass re-centers on the previous pass's best weights
auc1, bw_pass1 = grid_search(o1c, o2c, o3c, w1_c, w2_c, window=0.03, step=0.002)
auc2, bw_pass2 = grid_search(o1c, o2c, o3c, bw_pass1[0], bw_pass1[1], window=0.01, step=0.001)
auc3, bw_pass3 = grid_search(o1c, o2c, o3c, bw_pass2[0], bw_pass2[1], window=0.004, step=0.0005)
final_auc, final_w = auc3, bw_pass3
print(f'Final calibrated best weights={final_w} OOF AUC: {final_auc:.5f}')

# Build calibrated-blend submission
w1b, w2b, w3b = final_w
test_blend = (w1b*t1c + w2b*t2c + w3b*t3c).astype(np.float32)
sub = pd.DataFrame({id_col: test[id_col].values, target_col: test_blend})
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv; head:')
print(sub.head())

Loading cached base OOF/test predictions...


Calibrated single-model OOF AUCs | LR:0.67810 Dense:0.67329 Meta:0.66537 | 13.8s


Calib grid window=0.03 step=0.002 tried 961 | best=(0.36740000000000006, 0.3832, 0.24939999999999996) OOF AUC: 0.69032 | 2.0s


Calib grid window=0.01 step=0.001 tried 441 | best=(0.37640000000000007, 0.3732, 0.25039999999999996) OOF AUC: 0.69035 | 0.9s


Calib grid window=0.004 step=0.0005 tried 289 | best=(0.37590000000000007, 0.3737, 0.2503999999999999) OOF AUC: 0.69035 | 0.6s
Final calibrated best weights=(0.37590000000000007, 0.3737, 0.2503999999999999) OOF AUC: 0.69035
Saved submission.csv; head:
  request_id  requester_received_pizza
0  t3_1aw5zf                  0.325981
1   t3_roiuw                  0.219363
2   t3_mjnbq                  0.218908
3   t3_t8wd1                  0.212368
4  t3_1m4zxu                  0.231804
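
A note on what calibration can and cannot do for AUC. A single strictly increasing map applied to every row leaves AUC unchanged, while fold-wise calibrators (one per fold, as in `platt_calibrate_foldwise`) apply slightly different maps to different rows and can therefore move the pooled OOF AUC. The sketch below illustrates both effects on synthetic scores. The data, the exaggerated `shift`, and all variable names are assumptions for illustration, not the notebook's cached predictions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 2000
y_demo = rng.integers(0, 2, size=n)            # synthetic labels (assumption)
score = y_demo + rng.normal(0.0, 1.0, size=n)  # synthetic informative scores

auc_raw = roc_auc_score(y_demo, score)

# One global, strictly increasing map (a Platt-style sigmoid): ranks are preserved.
global_cal = 1.0 / (1.0 + np.exp(-(1.7 * score - 0.4)))
auc_global = roc_auc_score(y_demo, global_cal)

# Two different maps on two halves of the rows (an exaggerated stand-in for
# per-fold calibrators): cross-half ordering changes, so the pooled AUC moves.
half = np.arange(n) < n // 2
shift = 10.0  # deliberately huge so the effect is visible
per_half = np.where(half, score + shift, score)
auc_per_half = roc_auc_score(y_demo, per_half)

print(f'raw={auc_raw:.4f} global={auc_global:.4f} per-half={auc_per_half:.4f}')
```

Real per-fold Platt fits differ far less than this, so any change in pooled OOF AUC would likely be small, but it is not guaranteed to be zero.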


In [33]:
# Restore best 3-way prob-blend submission using cached strong OOF/test and best weights from Cell 18/17
import numpy as np, pandas as pd
best_w = (0.3414, 0.4092, 0.2494)  # (LR, DenseXGB 3-seed, MetaXGB 3-seed)
oof_lr = np.load('oof_lr_pivot.npy')  # OOF arrays are not used below; loaded only as a sanity check that the cache exists
oof_dense = np.load('oof_xgb_dense.npy')
oof_meta = np.load('oof_xgb_meta.npy')
test_lr = np.load('test_lr_pivot.npy')
test_dense = np.load('test_xgb_dense.npy')
test_meta = np.load('test_xgb_meta.npy')
w1, w2, w3 = best_w
test_blend = (w1*test_lr + w2*test_dense + w3*test_meta).astype(np.float32)
sub = pd.DataFrame({id_col: test[id_col].values, target_col: test_blend})
sub.to_csv('submission.csv', index=False)
print('Rewrote submission.csv with best prob-blend weights', best_w)
print(sub.head())

Rewrote submission.csv with best prob-blend weights (0.3414, 0.4092, 0.2494)
  request_id  requester_received_pizza
0  t3_1aw5zf                  0.330208
1   t3_roiuw                  0.215065
2   t3_mjnbq                  0.215385
3   t3_t8wd1                  0.207358
4  t3_1m4zxu                  0.226728


In [34]:
# Move 2b: Fast level-2 stacking with Ridge regression on 3 base OOF preds (fold-safe); build submission only if it improves
import numpy as np, pandas as pd, time, sys
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import Ridge

print('Loading cached base OOF/test predictions...')
oof_lr = np.load('oof_lr_pivot.npy')
oof_dense = np.load('oof_xgb_dense.npy')
oof_meta = np.load('oof_xgb_meta.npy')
test_lr = np.load('test_lr_pivot.npy')
test_dense = np.load('test_xgb_dense.npy')
test_meta = np.load('test_xgb_meta.npy')
y = train[target_col].astype(int).values

X_base = np.vstack([oof_lr, oof_dense, oof_meta]).T.astype(np.float32)
X_test = np.vstack([test_lr, test_dense, test_meta]).T.astype(np.float32)
assert X_base.shape[1] == 3 and X_test.shape[1] == 3

cv = list(StratifiedKFold(n_splits=5, shuffle=True, random_state=42).split(np.arange(len(y)), y))
oof_stack = np.zeros(len(y), dtype=np.float32)
test_bag = []

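# Note: the best alpha is picked per fold by that fold's validation AUC, so the reported OOF AUC is slightly optimistic.
alphas = [0.5, 1.0, 2.0, 5.0]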
t0 = time.time()
for k, (tr_idx, va_idx) in enumerate(cv):
    Xtr, ytr = X_base[tr_idx], y[tr_idx]
    Xva, yva = X_base[va_idx], y[va_idx]
    best_auc_k, best_pred_va, best_pred_te = -1.0, None, None
    for a in alphas:
        mdl = Ridge(alpha=a, random_state=42)
        mdl.fit(Xtr, ytr)
        va_pred = mdl.predict(Xva).astype(np.float32)
        # Ridge outputs are unbounded, which is fine for AUC (rank-based).
        # No clipping happens here; the submission step below rank-normalizes to [0,1] instead.
        auc_k = roc_auc_score(yva, va_pred)
        if auc_k > best_auc_k:
            best_auc_k = auc_k
            best_pred_va = va_pred
            best_pred_te = mdl.predict(X_test).astype(np.float32)
    oof_stack[va_idx] = best_pred_va
    test_bag.append(best_pred_te)
    print(f'Fold {k+1}/5 ridge-stack best AUC: {best_auc_k:.5f}')
    sys.stdout.flush()

auc_oof = roc_auc_score(y, oof_stack)
print(f'Ridge stacker OOF AUC: {auc_oof:.5f} | time {time.time()-t0:.1f}s')

test_stack = np.mean(test_bag, axis=0).astype(np.float32)
# Only write submission if this matches or beats our best OOF 0.69185
if auc_oof >= 0.69185:
    # Map to [0,1] via rank normalization to avoid scale issues while preserving order
    order = np.argsort(test_stack)
    ranks = np.empty_like(order, dtype=np.float64); ranks[order] = np.arange(len(test_stack), dtype=np.float64)
    test_scores = (ranks / (len(test_stack) - 1.0)).astype(np.float32)
    sub = pd.DataFrame({id_col: test[id_col].values, target_col: test_scores})
    sub.to_csv('submission.csv', index=False)
    print('Saved submission.csv from ridge stacker; head:')
    print(sub.head())
else:
    print('Ridge stacker did not beat best OOF 0.69185; keeping existing submission.csv')

Loading cached base OOF/test predictions...
Fold 1/5 ridge-stack best AUC: 0.69471


Fold 2/5 ridge-stack best AUC: 0.67771


Fold 3/5 ridge-stack best AUC: 0.72172


Fold 4/5 ridge-stack best AUC: 0.64525


Fold 5/5 ridge-stack best AUC: 0.72245


Ridge stacker OOF AUC: 0.69059 | time 0.1s
Ridge stacker did not beat best OOF 0.69185; keeping existing submission.csv
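
The rank normalization in the submission branch above can be written more compactly with a double `argsort`. This is a minimal sketch under stated assumptions: `rank_normalize` is a hypothetical helper name, it uses a stable sort (the cell uses NumPy's default), and it adds a guard for inputs shorter than two elements, where the cell's formula would divide by zero.

```python
import numpy as np

def rank_normalize(scores):
    """Map scores to [0, 1] by rank; ties get distinct ranks in stable argsort order."""
    scores = np.asarray(scores)
    if len(scores) < 2:
        # Guard (assumption): a single score has no meaningful rank scale
        return np.zeros(len(scores), dtype=np.float32)
    # argsort of argsort yields each element's 0-based rank
    ranks = np.argsort(np.argsort(scores, kind='stable'), kind='stable')
    return (ranks / (len(scores) - 1.0)).astype(np.float32)

demo = np.array([0.3, -1.2, 2.5, 0.9])
print(rank_normalize(demo))
```

Because the mapping depends only on order, it leaves AUC unchanged whenever scores are distinct, but it discards any probability meaning the scores had.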


In [35]:
# Move 1c: Logit-space blending (coarse-to-fine) using cached OOF/test; may slightly improve blend AUC
import numpy as np, pandas as pd, time
from sklearn.metrics import roc_auc_score

def to_logit(p, eps=1e-6):
    p = np.clip(p.astype(np.float64), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print('Loading cached OOF/test predictions...')
o1 = np.load('oof_lr_pivot.npy')
o2 = np.load('oof_xgb_dense.npy')
o3 = np.load('oof_xgb_meta.npy')
t1 = np.load('test_lr_pivot.npy')
t2 = np.load('test_xgb_dense.npy')
t3 = np.load('test_xgb_meta.npy')
y = train[target_col].astype(int).values

# Logit-transform
z1, z2, z3 = to_logit(o1), to_logit(o2), to_logit(o3)
tz1, tz2, tz3 = to_logit(t1), to_logit(t2), to_logit(t3)

def eval_auc_logit(w1, w2):
    w3 = 1.0 - w1 - w2
    if w3 < 0 or w3 > 1:
        return -1.0
    z = w1*z1 + w2*z2 + w3*z3
    # AUC is rank-based and sigmoid(z) is a monotonic transform of z, so score z directly
    return roc_auc_score(y, z)

# Start at prior prob-best
w1_c, w2_c = 0.3414, 0.4092
best_auc, best_w = -1.0, (w1_c, w2_c, 1.0 - w1_c - w2_c)

def grid_search_logit(w1_center, w2_center, window, step):
    global best_auc, best_w
    w1_min, w1_max = max(0.0, w1_center - window), min(1.0, w1_center + window)
    w2_min, w2_max = max(0.0, w2_center - window), min(1.0, w2_center + window)
    w1_grid = np.arange(w1_min, w1_max + 1e-12, step)
    w2_grid = np.arange(w2_min, w2_max + 1e-12, step)
    t0 = time.time(); cnt = 0
    for w1 in w1_grid:
        for w2 in w2_grid:
            if w1 + w2 > 1.0:
                continue
            auc = eval_auc_logit(w1, w2)
            cnt += 1
            if auc > best_auc:
                best_auc = auc
                best_w = (float(w1), float(w2), float(1.0 - w1 - w2))
    print(f'Logit grid window={window} step={step} tried {cnt} | best={best_w} OOF AUC(z): {best_auc:.5f}')

# Coarse-to-fine
grid_search_logit(w1_c, w2_c, window=0.05, step=0.005)
grid_search_logit(best_w[0], best_w[1], window=0.02, step=0.001)
grid_search_logit(best_w[0], best_w[1], window=0.008, step=0.0005)
print(f'Final logit-blend weights={best_w} | OOF AUC(z): {best_auc:.5f}')

# Build test with logit blending (convert back to probs for submission)
w1b, w2b, w3b = best_w
zt = w1b*tz1 + w2b*tz2 + w3b*tz3
pt = sigmoid(zt).astype(np.float32)
sub = pd.DataFrame({id_col: test[id_col].values, target_col: pt})
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv (logit blend); head:')
print(sub.head())

Loading cached OOF/test predictions...


Logit grid window=0.05 step=0.005 tried 441 | best=(0.35140000000000005, 0.41420000000000007, 0.2343999999999999) OOF AUC(z): 0.69199


Logit grid window=0.02 step=0.001 tried 1681 | best=(0.35040000000000004, 0.4192000000000001, 0.23039999999999988) OOF AUC(z): 0.69201


Logit grid window=0.008 step=0.0005 tried 1089 | best=(0.35040000000000004, 0.4192000000000001, 0.23039999999999988) OOF AUC(z): 0.69201
Final logit-blend weights=(0.35040000000000004, 0.4192000000000001, 0.23039999999999988) | OOF AUC(z): 0.69201
Saved submission.csv (logit blend); head:
  request_id  requester_received_pizza
0  t3_1aw5zf                  0.328460
1   t3_roiuw                  0.212609
2   t3_mjnbq                  0.214714
3   t3_t8wd1                  0.205759
4  t3_1m4zxu                  0.216342
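
For intuition on why logit-space weights differ from probability-space weights: a weighted average of logits is the same as a weighted geometric mean of odds, which reacts more strongly to confident predictions than an arithmetic mean of probabilities. The sketch below checks this identity on made-up numbers. The three prediction vectors and the rounded weights are hypothetical and are not taken from the cached files.

```python
import numpy as np

def to_logit(p, eps=1e-6):
    p = np.clip(np.asarray(p, dtype=np.float64), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical predictions from three models for two rows
p1 = np.array([0.90, 0.20])
p2 = np.array([0.60, 0.25])
p3 = np.array([0.10, 0.30])
w = np.array([0.35, 0.42, 0.23])  # rounded weights, for illustration only

# Arithmetic mean of probabilities
prob_blend = w[0] * p1 + w[1] * p2 + w[2] * p3

# Weighted mean of logits, mapped back to a probability
logit_blend = sigmoid(w[0] * to_logit(p1) + w[1] * to_logit(p2) + w[2] * to_logit(p3))

# Same quantity via odds: weighted geometric mean of p / (1 - p)
odds = (p1 / (1 - p1)) ** w[0] * (p2 / (1 - p2)) ** w[1] * (p3 / (1 - p3)) ** w[2]
via_odds = odds / (1.0 + odds)

print(prob_blend, logit_blend, via_odds)
```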


In [36]:
# Ultra-fine logit blend refinement around (0.3504, 0.4192)
import numpy as np, pandas as pd, time
from sklearn.metrics import roc_auc_score

def to_logit(p, eps=1e-6):
    p = np.clip(p.astype(np.float64), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

o1 = np.load('oof_lr_pivot.npy'); o2 = np.load('oof_xgb_dense.npy'); o3 = np.load('oof_xgb_meta.npy')
t1 = np.load('test_lr_pivot.npy'); t2 = np.load('test_xgb_dense.npy'); t3 = np.load('test_xgb_meta.npy')
y = train[target_col].astype(int).values

z1, z2, z3 = to_logit(o1), to_logit(o2), to_logit(o3)
tz1, tz2, tz3 = to_logit(t1), to_logit(t2), to_logit(t3)

def eval_auc_logit(w1, w2):
    w3 = 1.0 - w1 - w2
    if w3 < 0 or w3 > 1:
        return -1.0
    z = w1*z1 + w2*z2 + w3*z3
    return roc_auc_score(y, z)

w1_c, w2_c = 0.3504, 0.4192
best_auc, best_w = -1.0, (w1_c, w2_c, 1.0 - w1_c - w2_c)

def refine(center_w1, center_w2, window, step):
    global best_auc, best_w
    w1_min, w1_max = max(0.0, center_w1 - window), min(1.0, center_w1 + window)
    w2_min, w2_max = max(0.0, center_w2 - window), min(1.0, center_w2 + window)
    w1_grid = np.arange(w1_min, w1_max + 1e-12, step)
    w2_grid = np.arange(w2_min, w2_max + 1e-12, step)
    t0 = time.time(); cnt = 0
    for w1 in w1_grid:
        for w2 in w2_grid:
            if w1 + w2 > 1.0:
                continue
            auc = eval_auc_logit(w1, w2)
            cnt += 1
            if auc > best_auc:
                best_auc = auc
                best_w = (float(w1), float(w2), float(1.0 - w1 - w2))
    print(f'Refine window={window} step={step} tried {cnt} | best={best_w} OOF AUC(z): {best_auc:.5f} | {time.time()-t0:.1f}s')

# Two refinement passes
refine(w1_c, w2_c, window=0.006, step=0.0002)
refine(best_w[0], best_w[1], window=0.003, step=0.0001)
print(f'Ultra-fine final logit weights={best_w} | OOF AUC(z): {best_auc:.5f}')

# Build submission with refined logit blend
w1b, w2b, w3b = best_w
zt = w1b*tz1 + w2b*tz2 + w3b*tz3
pt = sigmoid(zt).astype(np.float32)
sub = pd.DataFrame({id_col: test[id_col].values, target_col: pt})
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv (ultra-fine logit blend); head:')
print(sub.head())

Refine window=0.006 step=0.0002 tried 3721 | best=(0.35019999999999935, 0.41999999999999926, 0.22980000000000134) OOF AUC(z): 0.69201 | 7.4s


Refine window=0.003 step=0.0001 tried 3721 | best=(0.350099999999999, 0.41979999999999895, 0.23010000000000208) OOF AUC(z): 0.69201 | 7.5s
Ultra-fine final logit weights=(0.350099999999999, 0.41979999999999895, 0.23010000000000208) | OOF AUC(z): 0.69201
Saved submission.csv (ultra-fine logit blend); head:
  request_id  requester_received_pizza
0  t3_1aw5zf                  0.328483
1   t3_roiuw                  0.212616
2   t3_mjnbq                  0.214724
3   t3_t8wd1                  0.205729
4  t3_1m4zxu                  0.216253
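
Both refinement passes report the same OOF AUC to five decimals (0.69201), while per-fold AUCs elsewhere in this notebook span roughly 0.645 to 0.722. One way to judge whether such a weight tweak is distinguishable from noise is a paired bootstrap of the AUC difference between two blends. The following is a sketch only: `paired_bootstrap_auc_diff` is a hypothetical helper, and the data are synthetic stand-ins rather than the cached OOF arrays.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def paired_bootstrap_auc_diff(y_true, pred_a, pred_b, n_boot=200, seed=0):
    """AUC(pred_a) - AUC(pred_b) on class-stratified bootstrap resamples."""
    y_true = np.asarray(y_true)
    pred_a = np.asarray(pred_a)
    pred_b = np.asarray(pred_b)
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y_true == 1)
    neg = np.flatnonzero(y_true == 0)
    diffs = np.empty(n_boot, dtype=np.float64)
    for b in range(n_boot):
        # Resample positives and negatives separately so both classes are always present
        idx = np.concatenate([rng.choice(pos, size=len(pos)), rng.choice(neg, size=len(neg))])
        diffs[b] = roc_auc_score(y_true[idx], pred_a[idx]) - roc_auc_score(y_true[idx], pred_b[idx])
    return diffs

# Synthetic stand-ins for two nearly identical blends (assumption)
rng = np.random.default_rng(1)
y_demo = (rng.random(1000) < 0.25).astype(int)
blend_a = y_demo * 0.7 + rng.normal(size=1000)
blend_b = blend_a + rng.normal(scale=0.01, size=1000)  # tiny perturbation

d = paired_bootstrap_auc_diff(y_demo, blend_a, blend_b)
print(f'mean diff={d.mean():+.5f}  std={d.std():.5f}')
```

If the spread of the differences is much larger than the observed gain, the finer grid is probably fitting OOF noise rather than finding a better blend.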


In [37]:
# Move 1d: Logit scaling (per-model temperature) + weight search to squeeze +0.0001–0.0003
import numpy as np, pandas as pd, time
from sklearn.metrics import roc_auc_score

def to_logit(p, eps=1e-6):
    p = np.clip(p.astype(np.float64), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print('Loading cached OOF/test predictions...')
o1 = np.load('oof_lr_pivot.npy')
o2 = np.load('oof_xgb_dense.npy')
o3 = np.load('oof_xgb_meta.npy')
t1 = np.load('test_lr_pivot.npy')
t2 = np.load('test_xgb_dense.npy')
t3 = np.load('test_xgb_meta.npy')
y = train[target_col].astype(int).values

# Base logits
z1_base, z2_base, z3_base = to_logit(o1), to_logit(o2), to_logit(o3)
tz1_base, tz2_base, tz3_base = to_logit(t1), to_logit(t2), to_logit(t3)

def eval_auc_logit_combo(s1, s2, s3, w1c, w2c):
    z1 = s1 * z1_base; z2 = s2 * z2_base; z3 = s3 * z3_base
    def auc_for_weights(w1, w2):
        w3 = 1.0 - w1 - w2
        if w3 < 0 or w3 > 1:
            return -1.0
        z = w1*z1 + w2*z2 + w3*z3
        return roc_auc_score(y, z)
    # Coarse-to-fine around provided center weights
    best_auc, best_w = -1.0, (w1c, w2c, 1.0 - w1c - w2c)
    def grid_pass(center, window, step):
        nonlocal best_auc, best_w
        cw1, cw2 = center
        w1_min, w1_max = max(0.0, cw1 - window), min(1.0, cw1 + window)
        w2_min, w2_max = max(0.0, cw2 - window), min(1.0, cw2 + window)
        w1_grid = np.arange(w1_min, w1_max + 1e-12, step)
        w2_grid = np.arange(w2_min, w2_max + 1e-12, step)
        for w1 in w1_grid:
            for w2 in w2_grid:
                if w1 + w2 > 1.0: continue
                auc = auc_for_weights(w1, w2)
                if auc > best_auc:
                    best_auc = auc; best_w = (float(w1), float(w2), float(1.0 - w1 - w2))
        return (best_w[0], best_w[1])
    c = (w1c, w2c)
    c = grid_pass(c, window=0.02, step=0.001)
    c = grid_pass(c, window=0.008, step=0.0005)
    c = grid_pass(c, window=0.004, step=0.0002)
    return best_auc, best_w

# Center from best logit blend found earlier
w1_center, w2_center = 0.3504, 0.4192

# Search small temperature scales per model
scales = [0.85, 0.9, 0.95, 1.0, 1.05, 1.1, 1.15]
best_global_auc, best_cfg = -1.0, None
t0 = time.time(); combos = 0
for s1 in scales:
    for s2 in scales:
        for s3 in scales:
            auc, w = eval_auc_logit_combo(s1, s2, s3, w1_center, w2_center)
            combos += 1
            if auc > best_global_auc:
                best_global_auc, best_cfg = auc, (s1, s2, s3, w)
print(f'Tried {combos} temperature triples; best AUC(z): {best_global_auc:.5f} with scales={best_cfg[:3]} weights={best_cfg[3]} | {time.time()-t0:.1f}s')

# Build the submission only if the scaled blend beats the earlier logit blend (OOF AUC 0.69201)
if best_global_auc > 0.69201:
    s1, s2, s3, (bw1, bw2, bw3) = best_cfg
    tz1 = s1 * tz1_base; tz2 = s2 * tz2_base; tz3 = s3 * tz3_base
    zt = bw1*tz1 + bw2*tz2 + bw3*tz3
    pt = sigmoid(zt).astype(np.float32)
    sub = pd.DataFrame({id_col: test[id_col].values, target_col: pt})
    sub.to_csv('submission.csv', index=False)
    print('Saved submission.csv (logit scale+blend); head:')
    print(sub.head())
else:
    print('Logit scale+blend did not beat 0.69201; keeping existing submission.csv')

Loading cached OOF/test predictions...


KeyboardInterrupt: 
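
Two observations on the interrupted cell. First, the cost: 7^3 = 343 scale triples, each running three grid passes of about 1681 + 1089 + 1681 weight pairs, comes to roughly 1.5 million AUC evaluations. Second, for a rank metric the scales add little: `w_i * s_i` acts as an effective weight, and dividing by the positive sum of those products does not change the ordering. The sketch below checks that equivalence on synthetic logits; all data and the specific `w` and `s` values are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 1000
y_demo = rng.integers(0, 2, size=n)                        # synthetic labels
z = [y_demo * 0.5 + rng.normal(size=n) for _ in range(3)]  # synthetic per-model logits

w = np.array([0.35, 0.42, 0.23])  # blend weights (sum to 1)
s = np.array([0.85, 1.10, 1.15])  # per-model temperature scales

# Blend with both scales and weights, as in the interrupted search
z_scaled = sum(w[i] * s[i] * z[i] for i in range(3))

# Fold the scales into the weights and renormalize onto the simplex
w_eff = w * s / np.sum(w * s)
z_plain = sum(w_eff[i] * z[i] for i in range(3))

auc_scaled = roc_auc_score(y_demo, z_scaled)
auc_plain = roc_auc_score(y_demo, z_plain)
print(w_eff, auc_scaled, auc_plain)
```

In other words, the temperature grid mostly widens the effective weight window around the center; a plain weight search over a somewhat larger window should reach the same blends at a fraction of the cost.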

In [38]:
# Move 2c: Simple level-2 stacking with LogisticRegression (liblinear) on 3 base OOF preds; fold-safe; bag test
import numpy as np, pandas as pd, time, sys
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression

print('Loading cached base OOF/test predictions...')
oof_lr = np.load('oof_lr_pivot.npy')
oof_dense = np.load('oof_xgb_dense.npy')
oof_meta = np.load('oof_xgb_meta.npy')
test_lr = np.load('test_lr_pivot.npy')
test_dense = np.load('test_xgb_dense.npy')
test_meta = np.load('test_xgb_meta.npy')

y = train[target_col].astype(int).values
n = len(y)
X_base = np.vstack([oof_lr, oof_dense, oof_meta]).T.astype(np.float32)
X_test = np.vstack([test_lr, test_dense, test_meta]).T.astype(np.float32)
assert X_base.shape == (n, 3) and X_test.shape[1] == 3, 'Shape mismatch in base features'

cv = list(StratifiedKFold(n_splits=5, shuffle=True, random_state=42).split(np.arange(n), y))
oof_meta_lr = np.zeros(n, dtype=np.float32)
test_bag = []

t0 = time.time()
for k, (tr_idx, va_idx) in enumerate(cv):
    Xtr, ytr = X_base[tr_idx], y[tr_idx]
    Xva, yva = X_base[va_idx], y[va_idx]
    # Use liblinear for reliability on tiny 3-dim data
    clf = LogisticRegression(solver='liblinear', C=1.0, penalty='l2', max_iter=1000, random_state=42)
    clf.fit(Xtr, ytr)
    va_pred = clf.predict_proba(Xva)[:,1].astype(np.float32)
    oof_meta_lr[va_idx] = va_pred
    test_pred = clf.predict_proba(X_test)[:,1].astype(np.float32)
    test_bag.append(test_pred)
    fold_auc = roc_auc_score(yva, va_pred)
    print(f'Fold {k+1}/5 meta AUC: {fold_auc:.5f}')
    sys.stdout.flush()

auc_oof = roc_auc_score(y, oof_meta_lr)
print(f'Stacker OOF AUC (LR-liblinear on 3 base preds): {auc_oof:.5f} | time {time.time()-t0:.1f}s')

# Build test via averaging fold meta-model predictions
test_stack = np.mean(test_bag, axis=0).astype(np.float32)

# Only write submission if this beats our current best OOF (0.69201, logit blend).
# The bronze threshold (0.69210) is above that, so a single check covers both cases.
if auc_oof > 0.69201:
    sub = pd.DataFrame({id_col: test[id_col].values, target_col: test_stack})
    sub.to_csv('submission.csv', index=False)
    print('Saved submission.csv from LR-liblinear stacker; head:')
    print(sub.head())
else:
    print('Stacker did not beat best OOF 0.69201; keeping existing submission.csv')

Loading cached base OOF/test predictions...
Fold 1/5 meta AUC: 0.69473


Fold 2/5 meta AUC: 0.67761


Fold 3/5 meta AUC: 0.72151


Fold 4/5 meta AUC: 0.64499


Fold 5/5 meta AUC: 0.72185


Stacker OOF AUC (LR-liblinear on 3 base preds): 0.69015 | time 0.0s
Stacker did not beat best OOF 0.69201; keeping existing submission.csv
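
The fold-safe OOF part of the stacking loop above can also be expressed with scikit-learn's `cross_val_predict`, which predicts each row with the model that did not train on it. The sketch below compares it against a manual loop on synthetic data. `X_demo`, `y_demo`, and `cv_demo` are stand-ins for `X_base`, `y`, and `cv`, and it uses the default lbfgs solver rather than liblinear. Test-set bagging is not covered by `cross_val_predict` and would still need the explicit loop.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

rng = np.random.default_rng(0)
n = 600
y_demo = rng.integers(0, 2, size=n)
# Three synthetic "base model" columns standing in for X_base (assumption)
X_demo = np.column_stack([y_demo * 0.6 + rng.normal(size=n) for _ in range(3)])

cv_demo = list(StratifiedKFold(n_splits=5, shuffle=True, random_state=42).split(X_demo, y_demo))
stacker = LogisticRegression(C=1.0, max_iter=1000)  # default lbfgs solver

# Library route: out-of-fold probabilities in one call
oof_cv = cross_val_predict(stacker, X_demo, y_demo, cv=cv_demo, method='predict_proba')[:, 1]

# Manual loop, mirroring the structure of the cell above
oof_manual = np.zeros(n)
for tr_idx, va_idx in cv_demo:
    fold_clf = LogisticRegression(C=1.0, max_iter=1000).fit(X_demo[tr_idx], y_demo[tr_idx])
    oof_manual[va_idx] = fold_clf.predict_proba(X_demo[va_idx])[:, 1]

print(np.allclose(oof_cv, oof_manual))
```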


In [39]:
# Dense XGB v2: smaller SVD (200/200/60) + 3 seeds; blend with LR+Meta using cached OOF; aim to exceed 0.69210
import time, gc, sys, numpy as np, pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import roc_auc_score
from scipy.sparse import hstack

try:
    import xgboost as xgb
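except ImportError:
    # Install on demand; check=True surfaces a failed install instead of a confusing second ImportError
    import subprocess, sys as _sys
    subprocess.run([_sys.executable, '-m', 'pip', 'install', '-q', 'xgboost'], check=True)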
    import xgboost as xgb

assert 'combine_raw_text' in globals() and 'clean_text_series' in globals() and 'build_subreddit_text' in globals(), 'Run pivot cell 7 first'
assert 'word_params' in globals() and 'char_params' in globals() and 'subs_params' in globals(), 'Run pivot cell 7 first'
assert 'build_meta_enhanced' in globals(), 'Run cell 10 to define build_meta_enhanced'

# SVD component sizes (smaller per expert coach recommendation)
svd_word_n = 200
svd_char_n = 200
svd_subs_n = 60

y = train[target_col].astype(int).values
n_splits = 5
cv = list(StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42).split(train, y))
print(f'Prepared {n_splits}-fold StratifiedKFold CV (shuffled).')

# Precompute test sources
raw_te_text = combine_raw_text(test)
clean_te_text = clean_text_series(raw_te_text)
subs_te_text = build_subreddit_text(test)
meta_te = build_meta_enhanced(test).astype(np.float32)

# Containers for seed-bagging
seeds = [42, 2025, 7]
oof_bag = np.zeros((len(train), len(seeds)), dtype=np.float32)
test_bag = np.zeros((len(test), len(seeds)), dtype=np.float32)

# XGB params (slightly regularized, device=cuda, early stopping)
base_params = dict(
    objective='binary:logistic',
    eval_metric='auc',
    max_depth=4,
    eta=0.05,
    subsample=0.8,
    colsample_bytree=0.7,
    min_child_weight=6,
    reg_alpha=1.0,
    reg_lambda=2.5,
    gamma=0.0,
    device='cuda'
)

for si, seed in enumerate(seeds):
    print(f'=== Seed {seed} / {len(seeds)} ===')
    params = dict(base_params); params['seed'] = seed
    oof = np.zeros(len(train), dtype=np.float32)
    test_fold_preds = []

    for fold, (tr_idx, va_idx) in enumerate(cv):
        t0 = time.time()
        print(f'Seed {seed} | Fold {fold+1}/{n_splits} - train {len(tr_idx)} va {len(va_idx)}')
        sys.stdout.flush()

        # Text prep (tr_idx/va_idx are positional indices from StratifiedKFold, so select with .iloc)
        tr_text_raw = combine_raw_text(train.iloc[tr_idx])
        va_text_raw = combine_raw_text(train.iloc[va_idx])
        tr_text = clean_text_series(tr_text_raw)
        va_text = clean_text_series(va_text_raw)
        tr_subs = build_subreddit_text(train.iloc[tr_idx])
        va_subs = build_subreddit_text(train.iloc[va_idx])

        # TF-IDF (fit on train split)
        tfidf_w = TfidfVectorizer(**word_params)
        Xw_tr = tfidf_w.fit_transform(tr_text)
        Xw_va = tfidf_w.transform(va_text)
        Xw_te = tfidf_w.transform(clean_te_text)

        tfidf_c = TfidfVectorizer(**char_params)
        Xc_tr = tfidf_c.fit_transform(tr_text)
        Xc_va = tfidf_c.transform(va_text)
        Xc_te = tfidf_c.transform(clean_te_text)

        tfidf_s = TfidfVectorizer(**subs_params)
        Xs_tr = tfidf_s.fit_transform(tr_subs)
        Xs_va = tfidf_s.transform(va_subs)
        Xs_te = tfidf_s.transform(subs_te_text)

        # SVD to dense
        svd_w = TruncatedSVD(n_components=svd_word_n, random_state=seed)
        Zw_tr = svd_w.fit_transform(Xw_tr)
        Zw_va = svd_w.transform(Xw_va)
        Zw_te = svd_w.transform(Xw_te)

        svd_c = TruncatedSVD(n_components=svd_char_n, random_state=seed)
        Zc_tr = svd_c.fit_transform(Xc_tr)
        Zc_va = svd_c.transform(Xc_va)
        Zc_te = svd_c.transform(Xc_te)

        svd_s = TruncatedSVD(n_components=svd_subs_n, random_state=seed)
        Zs_tr = svd_s.fit_transform(Xs_tr)
        Zs_va = svd_s.transform(Xs_va)
        Zs_te = svd_s.transform(Xs_te)

        # Enhanced meta (dense); positional fold indices, so select with .iloc
        meta_tr = build_meta_enhanced(train.iloc[tr_idx]).astype(np.float32)
        meta_va = build_meta_enhanced(train.iloc[va_idx]).astype(np.float32)

        # Concatenate dense blocks
        Xtr_dense = np.hstack([Zw_tr, Zc_tr, Zs_tr, meta_tr.values]).astype(np.float32)
        Xva_dense = np.hstack([Zw_va, Zc_va, Zs_va, meta_va.values]).astype(np.float32)
        Xte_dense = np.hstack([Zw_te, Zc_te, Zs_te, meta_te.values]).astype(np.float32)

        # Scale dense features
        scaler = StandardScaler(with_mean=True, with_std=True)
        Xtr_d = scaler.fit_transform(Xtr_dense)
        Xva_d = scaler.transform(Xva_dense)
        Xte_d = scaler.transform(Xte_dense)

        # XGB training via DMatrix with early stopping
        dtrain = xgb.DMatrix(Xtr_d, label=y[tr_idx])
        dvalid = xgb.DMatrix(Xva_d, label=y[va_idx])
        dtest  = xgb.DMatrix(Xte_d)
        evals = [(dtrain, 'train'), (dvalid, 'valid')]
        booster = xgb.train(
            params,
            dtrain,
            num_boost_round=4000,
            evals=evals,
            early_stopping_rounds=200,
            verbose_eval=False
        )
        va_pred = booster.predict(dvalid, iteration_range=(0, booster.best_iteration+1)).astype(np.float32)
        te_pred = booster.predict(dtest, iteration_range=(0, booster.best_iteration+1)).astype(np.float32)
        oof[va_idx] = va_pred
        test_fold_preds.append(te_pred)

        auc = roc_auc_score(y[va_idx], va_pred)
        print(f'Seed {seed} | Fold {fold+1} AUC: {auc:.5f} | best_iter={booster.best_iteration} | time {time.time()-t0:.1f}s')
        sys.stdout.flush()

        del Xw_tr, Xw_va, Xw_te, Xc_tr, Xc_va, Xc_te, Xs_tr, Xs_va, Xs_te, Zw_tr, Zw_va, Zw_te, Zc_tr, Zc_va, Zc_te, Zs_tr, Zs_va, Zs_te, meta_tr, meta_va, Xtr_dense, Xva_dense, Xte_dense, Xtr_d, Xva_d, Xte_d, dtrain, dvalid, dtest, booster, tfidf_w, tfidf_c, tfidf_s, svd_w, svd_c, svd_s, scaler
        gc.collect()

    # Store bagged predictions
    oof_bag[:, si] = oof
    test_bag[:, si] = np.mean(test_fold_preds, axis=0).astype(np.float32)
    auc_seed = roc_auc_score(y, oof)
    print(f'Seed {seed} OOF AUC: {auc_seed:.5f}')
    sys.stdout.flush()

# Average across seeds
oof_dense_v2 = oof_bag.mean(axis=1)
test_dense_v2 = test_bag.mean(axis=1)
auc_oof_v2 = roc_auc_score(y, oof_dense_v2)
print(f'Bagged XGB (SVD 200/200/60 + enhanced meta) OOF AUC: {auc_oof_v2:.5f}')

# Cache v2 OOF/test
np.save('oof_xgb_dense_v2.npy', oof_dense_v2.astype(np.float32))
np.save('test_xgb_dense_v2.npy', test_dense_v2.astype(np.float32))

# Blend v2 dense with cached LR + Meta OOF to tune weights (prob space)
oof_lr = np.load('oof_lr_pivot.npy')
oof_meta = np.load('oof_xgb_meta.npy')
test_lr = np.load('test_lr_pivot.npy')
test_meta = np.load('test_xgb_meta.npy')

def eval_auc_prob(w1, w2):
    w3 = 1.0 - w1 - w2
    if w3 < 0 or w3 > 1: return -1.0
    blend = w1*oof_lr + w2*oof_dense_v2 + w3*oof_meta
    return roc_auc_score(y, blend)

# Coarse-to-fine around prior best center (0.3414, 0.4092)
w1_c, w2_c = 0.3414, 0.4092
best_auc, best_w = -1.0, (w1_c, w2_c, 1.0 - w1_c - w2_c)
def grid_pass(center_w1, center_w2, window, step):
    global best_auc, best_w
    w1_min, w1_max = max(0.0, center_w1 - window), min(1.0, center_w1 + window)
    w2_min, w2_max = max(0.0, center_w2 - window), min(1.0, center_w2 + window)
    w1_grid = np.arange(w1_min, w1_max + 1e-12, step)
    w2_grid = np.arange(w2_min, w2_max + 1e-12, step)
    t0s = time.time(); cnt = 0
    for w1 in w1_grid:
        for w2 in w2_grid:
            if w1 + w2 > 1.0: continue
            auc = eval_auc_prob(w1, w2)
            cnt += 1
            if auc > best_auc:
                best_auc = auc; best_w = (float(w1), float(w2), float(1.0 - w1 - w2))
    print(f'Blend grid window={window} step={step} tried {cnt} | best={best_w} OOF AUC: {best_auc:.5f} | {time.time()-t0s:.1f}s')
    return best_w[0], best_w[1]

w1a, w2a = grid_pass(w1_c, w2_c, window=0.06, step=0.002)
w1b, w2b = grid_pass(w1a, w2a, window=0.02, step=0.001)
print(f'Final prob-blend weights(v2)={best_w} | OOF AUC: {best_auc:.5f}')

# If improved vs 0.69201, write submission with v2 dense (the 0.69210 target is above that, so one check suffices)
if best_auc > 0.69201:
    w1, w2, w3 = best_w
    test_blend = (w1*test_lr + w2*test_dense_v2 + w3*test_meta).astype(np.float32)
    sub = pd.DataFrame({id_col: test[id_col].values, target_col: test_blend})
    sub.to_csv('submission.csv', index=False)
    print('Saved submission.csv (3-way blend with dense v2); head:')
    print(sub.head())
else:
    print('Dense v2 blend did not beat 0.69201; keeping current best submission.')

Prepared 5-fold StratifiedKFold CV (shuffled).


=== Seed 42 / 3 ===
Seed 42 | Fold 1/5 - train 2302 va 576


  out['f_reciprocity'] = txt.str.contains(r'\b(pay it forward|return the favor|pay you back|pay it back)\b', regex=True).astype(np.int8)
  out['f_payday'] = txt.str.contains(r'\b(payday|paycheck|get paid|next check)\b', regex=True).astype(np.int8)
  out['f_day_words'] = txt.str.contains(r'\b(monday|tuesday|wednesday|thursday|friday|tomorrow|tonight|weekend)\b', regex=True).astype(np.int8)
  out['f_amount'] = txt.str.contains(r'(\$\s?\d+)|(\b\d+\s*dollars?\b)', regex=True).astype(np.int8)
  out['mentions_kids'] = txt.str.contains(r'\b(kids?|children|son|daughter|baby)\b', regex=True).astype(np.int8)




Seed 42 | Fold 1 AUC: 0.68480 | best_iter=125 | time 14.5s


Seed 42 | Fold 2/5 - train 2302 va 576



Seed 42 | Fold 2 AUC: 0.65718 | best_iter=108 | time 14.6s


Seed 42 | Fold 3/5 - train 2302 va 576


  out['f_reciprocity'] = txt.str.contains(r'\b(pay it forward|return the favor|pay you back|pay it back)\b', regex=True).astype(np.int8)
  out['f_payday'] = txt.str.contains(r'\b(payday|paycheck|get paid|next check)\b', regex=True).astype(np.int8)
  out['f_day_words'] = txt.str.contains(r'\b(monday|tuesday|wednesday|thursday|friday|tomorrow|tonight|weekend)\b', regex=True).astype(np.int8)
  out['f_amount'] = txt.str.contains(r'(\$\s?\d+)|(\b\d+\s*dollars?\b)', regex=True).astype(np.int8)


  out['mentions_kids'] = txt.str.contains(r'\b(kids?|children|son|daughter|baby)\b', regex=True).astype(np.int8)
  out['f_reciprocity'] = txt.str.contains(r'\b(pay it forward|return the favor|pay you back|pay it back)\b', regex=True).astype(np.int8)
  out['f_payday'] = txt.str.contains(r'\b(payday|paycheck|get paid|next check)\b', regex=True).astype(np.int8)
  out['f_day_words'] = txt.str.contains(r'\b(monday|tuesday|wednesday|thursday|friday|tomorrow|tonight|weekend)\b', regex=True).astype(np.int8)
  out['f_amount'] = txt.str.contains(r'(\$\s?\d+)|(\b\d+\s*dollars?\b)', regex=True).astype(np.int8)
  out['mentions_kids'] = txt.str.contains(r'\b(kids?|children|son|daughter|baby)\b', regex=True).astype(np.int8)


Seed 42 | Fold 3 AUC: 0.68315 | best_iter=49 | time 14.3s


Seed 42 | Fold 4/5 - train 2303 va 575


  out['f_reciprocity'] = txt.str.contains(r'\b(pay it forward|return the favor|pay you back|pay it back)\b', regex=True).astype(np.int8)
  out['f_payday'] = txt.str.contains(r'\b(payday|paycheck|get paid|next check)\b', regex=True).astype(np.int8)
  out['f_day_words'] = txt.str.contains(r'\b(monday|tuesday|wednesday|thursday|friday|tomorrow|tonight|weekend)\b', regex=True).astype(np.int8)
  out['f_amount'] = txt.str.contains(r'(\$\s?\d+)|(\b\d+\s*dollars?\b)', regex=True).astype(np.int8)
  out['mentions_kids'] = txt.str.contains(r'\b(kids?|children|son|daughter|baby)\b', regex=True).astype(np.int8)


  out['f_reciprocity'] = txt.str.contains(r'\b(pay it forward|return the favor|pay you back|pay it back)\b', regex=True).astype(np.int8)
  out['f_payday'] = txt.str.contains(r'\b(payday|paycheck|get paid|next check)\b', regex=True).astype(np.int8)
  out['f_day_words'] = txt.str.contains(r'\b(monday|tuesday|wednesday|thursday|friday|tomorrow|tonight|weekend)\b', regex=True).astype(np.int8)
  out['f_amount'] = txt.str.contains(r'(\$\s?\d+)|(\b\d+\s*dollars?\b)', regex=True).astype(np.int8)
  out['mentions_kids'] = txt.str.contains(r'\b(kids?|children|son|daughter|baby)\b', regex=True).astype(np.int8)


Seed 42 | Fold 4 AUC: 0.64337 | best_iter=296 | time 15.1s


Seed 42 | Fold 5/5 - train 2303 va 575


  out['f_reciprocity'] = txt.str.contains(r'\b(pay it forward|return the favor|pay you back|pay it back)\b', regex=True).astype(np.int8)
  out['f_payday'] = txt.str.contains(r'\b(payday|paycheck|get paid|next check)\b', regex=True).astype(np.int8)
  out['f_day_words'] = txt.str.contains(r'\b(monday|tuesday|wednesday|thursday|friday|tomorrow|tonight|weekend)\b', regex=True).astype(np.int8)
  out['f_amount'] = txt.str.contains(r'(\$\s?\d+)|(\b\d+\s*dollars?\b)', regex=True).astype(np.int8)
  out['mentions_kids'] = txt.str.contains(r'\b(kids?|children|son|daughter|baby)\b', regex=True).astype(np.int8)


  out['f_reciprocity'] = txt.str.contains(r'\b(pay it forward|return the favor|pay you back|pay it back)\b', regex=True).astype(np.int8)
  out['f_payday'] = txt.str.contains(r'\b(payday|paycheck|get paid|next check)\b', regex=True).astype(np.int8)
  out['f_day_words'] = txt.str.contains(r'\b(monday|tuesday|wednesday|thursday|friday|tomorrow|tonight|weekend)\b', regex=True).astype(np.int8)
  out['f_amount'] = txt.str.contains(r'(\$\s?\d+)|(\b\d+\s*dollars?\b)', regex=True).astype(np.int8)
  out['mentions_kids'] = txt.str.contains(r'\b(kids?|children|son|daughter|baby)\b', regex=True).astype(np.int8)


Seed 42 | Fold 5 AUC: 0.69807 | best_iter=120 | time 14.2s


Seed 42 OOF AUC: 0.66976


=== Seed 2025 / 3 ===
Seed 2025 | Fold 1/5 - train 2302 va 576
Seed 2025 | Fold 1 AUC: 0.65807 | best_iter=66 | time 13.8s
Seed 2025 | Fold 2/5 - train 2302 va 576
Seed 2025 | Fold 2 AUC: 0.66279 | best_iter=641 | time 15.9s
Seed 2025 | Fold 3/5 - train 2302 va 576
Seed 2025 | Fold 3 AUC: 0.67685 | best_iter=45 | time 13.9s
Seed 2025 | Fold 4/5 - train 2303 va 575
Seed 2025 | Fold 4 AUC: 0.63028 | best_iter=74 | time 13.9s
Seed 2025 | Fold 5/5 - train 2303 va 575
Seed 2025 | Fold 5 AUC: 0.71652 | best_iter=61 | time 13.8s
Seed 2025 OOF AUC: 0.65146


=== Seed 7 / 3 ===
Seed 7 | Fold 1/5 - train 2302 va 576
Seed 7 | Fold 1 AUC: 0.66718 | best_iter=80 | time 14.0s
Seed 7 | Fold 2/5 - train 2302 va 576
Seed 7 | Fold 2 AUC: 0.65757 | best_iter=15 | time 13.8s
Seed 7 | Fold 3/5 - train 2302 va 576
Seed 7 | Fold 3 AUC: 0.68273 | best_iter=26 | time 13.9s
Seed 7 | Fold 4/5 - train 2303 va 575
Seed 7 | Fold 4 AUC: 0.62948 | best_iter=189 | time 14.5s
Seed 7 | Fold 5/5 - train 2303 va 575
Seed 7 | Fold 5 AUC: 0.70696 | best_iter=56 | time 14.2s
Seed 7 OOF AUC: 0.65787


Bagged XGB (SVD 200/200/60 + enhanced meta) OOF AUC: 0.67490


Blend grid window=0.06 step=0.002 tried 3721 | best=(0.37940000000000007, 0.40120000000000006, 0.21939999999999987) OOF AUC: 0.69116 | 7.0s


Blend grid window=0.02 step=0.001 tried 1681 | best=(0.38040000000000007, 0.40020000000000006, 0.21939999999999987) OOF AUC: 0.69116 | 3.2s
Final prob-blend weights(v2)=(0.38040000000000007, 0.40020000000000006, 0.21939999999999987) | OOF AUC: 0.69116
Dense v2 blend did not beat 0.69201; keeping current best submission.
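
To put the "did not beat 0.69201" decision in context, it may help to compare the blend gap with the fold-to-fold spread of the bagged XGB model. The sketch below uses only the per-fold AUCs and the two blend scores logged above; `summarize_folds` is a hypothetical helper, not something defined in the notebook, and the XGB fold spread is only a rough yardstick for blend-level noise.

```python
import numpy as np

# Per-fold validation AUCs copied from the bagged XGB log above.
fold_auc = {
    42:   [0.68480, 0.65718, 0.68315, 0.64337, 0.69807],
    2025: [0.65807, 0.66279, 0.67685, 0.63028, 0.71652],
    7:    [0.66718, 0.65757, 0.68273, 0.62948, 0.70696],
}

def summarize_folds(aucs):
    """Hypothetical helper: mean and sample std (ddof=1) of per-fold AUCs."""
    a = np.asarray(aucs, dtype=np.float64)
    return float(a.mean()), float(a.std(ddof=1))

summary = {seed: summarize_folds(v) for seed, v in fold_auc.items()}

# Incumbent best OOF AUC minus the dense v2 blend OOF AUC (both from the log).
blend_gap = 0.69201 - 0.69116

for seed, (m, s) in summary.items():
    print(f"seed {seed}: fold AUC mean={m:.5f} std={s:.5f} | gap/std={blend_gap / s:.3f}")
```

If the gap is a small fraction of the fold standard deviation, differences of this size between blends are probably within noise, which argues for keeping the simpler incumbent.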


In [40]:
# Quick checks: logit geometric mean and tiny temperature sweep with small-window weight search
import numpy as np, pandas as pd, time
from sklearn.metrics import roc_auc_score

def to_logit(p, eps=1e-6):
    p = np.clip(p.astype(np.float64), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print('Loading cached OOF/test predictions (3-seed dense)...')
o1 = np.load('oof_lr_pivot.npy')
o2 = np.load('oof_xgb_dense.npy')
o3 = np.load('oof_xgb_meta.npy')
t1 = np.load('test_lr_pivot.npy')
t2 = np.load('test_xgb_dense.npy')
t3 = np.load('test_xgb_meta.npy')
y = train[target_col].astype(int).values

# Logits
z1, z2, z3 = to_logit(o1), to_logit(o2), to_logit(o3)
tz1, tz2, tz3 = to_logit(t1), to_logit(t2), to_logit(t3)

# 1) Equal-weight logit average (equivalent to the geometric mean of the odds) -> check OOF
z_avg = (z1 + z2 + z3) / 3.0
auc_geom = roc_auc_score(y, z_avg)
print(f'Equal-weight logit geometric mean OOF AUC(z): {auc_geom:.5f}')
if auc_geom >= 0.69210:
    pt = sigmoid((tz1 + tz2 + tz3) / 3.0).astype(np.float32)
    sub = pd.DataFrame({id_col: test[id_col].values, target_col: pt})
    sub.to_csv('submission.csv', index=False)
    print('Saved submission.csv from logit geometric mean; head:')
    print(sub.head())
else:
    # 2) Tiny temperature sweep {0.95,1.00,1.05} with local weight search around prior best logit weights
    scales = [0.95, 1.00, 1.05]
    # Prior best logit weights center from cell 24
    w1c, w2c = 0.3501, 0.4198
    window, step = 0.015, 0.001
    best_auc, best_cfg = -1.0, None
    t0 = time.time(); tried = 0
    for s1 in scales:
        for s2 in scales:
            for s3 in scales:
                zz1, zz2, zz3 = s1*z1, s2*z2, s3*z3
                w1_min, w1_max = max(0.0, w1c - window), min(1.0, w1c + window)
                w2_min, w2_max = max(0.0, w2c - window), min(1.0, w2c + window)
                w1_grid = np.arange(w1_min, w1_max + 1e-12, step)
                w2_grid = np.arange(w2_min, w2_max + 1e-12, step)
                local_best_auc, local_best_w = -1.0, None
                for w1 in w1_grid:
                    for w2 in w2_grid:
                        if w1 + w2 > 1.0: continue
                        w3 = 1.0 - w1 - w2
                        z = w1*zz1 + w2*zz2 + w3*zz3
                        auc = roc_auc_score(y, z)
                        tried += 1
                        if auc > local_best_auc:
                            local_best_auc, local_best_w = auc, (float(w1), float(w2), float(w3))
                if local_best_auc > best_auc:
                    best_auc, best_cfg = local_best_auc, (s1, s2, s3, local_best_w)
    print(f'Tiny temp sweep tried {tried} combos | best AUC(z): {best_auc:.5f} with scales={best_cfg[:3]} weights={best_cfg[3]} | {time.time()-t0:.1f}s')
    # If improved to threshold, write submission
    if best_auc >= 0.69210:
        s1, s2, s3, (bw1, bw2, bw3) = best_cfg
        pt = sigmoid(bw1*(s1*tz1) + bw2*(s2*tz2) + bw3*(s3*tz3)).astype(np.float32)
        sub = pd.DataFrame({id_col: test[id_col].values, target_col: pt})
        sub.to_csv('submission.csv', index=False)
        print('Saved submission.csv from tiny temperature + logit blend; head:')
        print(sub.head())
    else:
        print('No improvement beyond 0.69210; keeping current best submission.')

Loading cached OOF/test predictions (3-seed dense)...
Equal-weight logit geometric mean OOF AUC(z): 0.69148


Tiny temp sweep tried 25947 combos | best AUC(z): 0.69202 with scales=(0.95, 0.95, 1.05) weights=(0.35810000000000003, 0.4288, 0.2130999999999999) | 50.0s
No improvement beyond 0.69210; keeping current best submission.
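
A note on why the temperature sweep moved the score so little: for a linear blend of logits, per-model scales fold into the weights. The blend `sum(w_i * s_i * z_i)` ranks rows exactly like the blend with weights `w_i * s_i / sum(w_j * s_j)`, because AUC depends only on ordering and dividing by a positive constant preserves it. The sketch below checks this on synthetic logits (an assumption standing in for the cached OOF arrays); `effective_weights` is a hypothetical helper.

```python
import numpy as np

def effective_weights(weights, scales):
    """Fold per-model temperature scales into blend weights that sum to 1."""
    ws = np.asarray(weights, dtype=np.float64) * np.asarray(scales, dtype=np.float64)
    return ws / ws.sum()

# Best config reported by the tiny temperature sweep above (weights rounded).
w = np.array([0.3581, 0.4288, 0.2131])
s = np.array([0.95, 0.95, 1.05])
eff = effective_weights(w, s)

# Synthetic stand-in for three models' OOF logits (not the real predictions).
rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 3))

z_scaled = Z @ (w * s)   # scaled logits blended with the searched weights
z_equiv = Z @ eff        # unscaled logits blended with the folded weights

# Identical ordering implies identical ROC AUC for any label vector.
same_order = np.array_equal(
    np.argsort(z_scaled, kind="stable"), np.argsort(z_equiv, kind="stable")
)
print("effective weights:", np.round(eff, 4), "| same ranking:", same_order)
```

Here the folded weights come out near (0.350, 0.419, 0.230), very close to the prior center (0.3501, 0.4198, 0.2301), which suggests the sweep mostly re-explored the same neighborhood of the weight grid.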


In [41]:
# Add a diverse sparse LR variant (word 1-3 + char_wb 2-5, min_df=2, no URL/NUM tokenization) and 4-way logit blend
import time, gc, sys, numpy as np, pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from scipy.sparse import hstack

def to_logit(p, eps=1e-6):
    p = np.clip(p.astype(np.float64), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

assert 'combine_raw_text' in globals(), 'Run pivot cell 7 first for combine_raw_text'

# Alternate text: simple lowercase without url/num replacement for diversity
def simple_clean_series(s: pd.Series) -> pd.Series:
    return s.fillna('').astype(str).str.lower()

# Vectorizer params for diverse LR
word_params_alt = dict(analyzer='word', ngram_range=(1,3), lowercase=True, min_df=2, max_features=80000, sublinear_tf=True, smooth_idf=True, norm='l2')
char_params_alt = dict(analyzer='char_wb', ngram_range=(2,5), lowercase=True, min_df=2, max_features=100000, sublinear_tf=True, smooth_idf=True, norm='l2')

y = train[target_col].astype(int).values
n_splits = 5
cv = list(StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42).split(train, y))
print(f'Prepared {n_splits}-fold StratifiedKFold CV (shuffled).')

# Precompute texts
tr_text_full = combine_raw_text(train)
te_text_full = combine_raw_text(test)
tr_text_alt = simple_clean_series(tr_text_full)
te_text_alt = simple_clean_series(te_text_full)

# Train alternate LR (no subreddit vectorizer to save time; include minimal meta for stability)
def build_meta_min_fast(df):
    return build_meta_minimal(df).astype(np.float32) if 'build_meta_minimal' in globals() else pd.DataFrame(index=df.index)

oof_lr_alt = np.zeros(len(train), dtype=np.float32)
test_lr_alt_folds = []
LR_C_alt = 0.5

for fold, (tr_idx, va_idx) in enumerate(cv):
    t0 = time.time()
    print(f'Alt LR | Fold {fold+1}/{n_splits} - train {len(tr_idx)} va {len(va_idx)}')
    sys.stdout.flush()
    tr_txt = tr_text_alt.iloc[tr_idx]
    va_txt = tr_text_alt.iloc[va_idx]

    tfidf_w = TfidfVectorizer(**word_params_alt)
    Xw_tr = tfidf_w.fit_transform(tr_txt)
    Xw_va = tfidf_w.transform(va_txt)
    Xw_te = tfidf_w.transform(te_text_alt)

    tfidf_c = TfidfVectorizer(**char_params_alt)
    Xc_tr = tfidf_c.fit_transform(tr_txt)
    Xc_va = tfidf_c.transform(va_txt)
    Xc_te = tfidf_c.transform(te_text_alt)

    # Minimal meta
    # tr_idx/va_idx are positional (from StratifiedKFold), so index with .iloc like the text above
    meta_tr = build_meta_min_fast(train.iloc[tr_idx])
    meta_va = build_meta_min_fast(train.iloc[va_idx])
    meta_te = build_meta_min_fast(test)
    scaler = StandardScaler(with_mean=False)
    Xm_tr = scaler.fit_transform(meta_tr)
    Xm_va = scaler.transform(meta_va)
    Xm_te = scaler.transform(meta_te)

    X_tr = hstack([Xw_tr, Xc_tr, Xm_tr], format='csr')
    X_va = hstack([Xw_va, Xc_va, Xm_va], format='csr')
    X_te = hstack([Xw_te, Xc_te, Xm_te], format='csr')

    lr_alt = LogisticRegression(solver='saga', penalty='l2', C=LR_C_alt, class_weight=None, max_iter=6000, n_jobs=-1, random_state=42, verbose=0)
    lr_alt.fit(X_tr, y[tr_idx])
    va_pred = lr_alt.predict_proba(X_va)[:,1].astype(np.float32)
    oof_lr_alt[va_idx] = va_pred
    te_pred = lr_alt.predict_proba(X_te)[:,1].astype(np.float32)
    test_lr_alt_folds.append(te_pred)

    auc = roc_auc_score(y[va_idx], va_pred)
    print(f'Alt LR | Fold {fold+1} AUC: {auc:.5f} | time {time.time()-t0:.1f}s')
    sys.stdout.flush()
    del Xw_tr, Xw_va, Xw_te, Xc_tr, Xc_va, Xc_te, Xm_tr, Xm_va, Xm_te, X_tr, X_va, X_te, lr_alt, tfidf_w, tfidf_c, scaler
    gc.collect()

auc_alt = roc_auc_score(y, oof_lr_alt)
print(f'Alt LR OOF AUC: {auc_alt:.5f}')
test_lr_alt = np.mean(test_lr_alt_folds, axis=0).astype(np.float32)
np.save('oof_lr_alt.npy', oof_lr_alt)
np.save('test_lr_alt.npy', test_lr_alt)

# 4-way logit blend with tiny weight for alt (0..0.15), local search for others around prior best
o1 = np.load('oof_lr_pivot.npy')
o2 = np.load('oof_xgb_dense.npy')
o3 = np.load('oof_xgb_meta.npy')
o4 = oof_lr_alt
t1 = np.load('test_lr_pivot.npy')
t2 = np.load('test_xgb_dense.npy')
t3 = np.load('test_xgb_meta.npy')
t4 = test_lr_alt

z1, z2, z3, z4 = to_logit(o1), to_logit(o2), to_logit(o3), to_logit(o4)
tz1, tz2, tz3, tz4 = to_logit(t1), to_logit(t2), to_logit(t3), to_logit(t4)

# Prior best 3-way logit center
base_w1, base_w2, base_w3 = 0.3501, 0.4198, 0.2301
best_auc, best_w = -1.0, (base_w1, base_w2, base_w3, 0.0)
t0 = time.time(); tried = 0
for w4 in np.arange(0.0, 0.15001, 0.005):
    rem = 1.0 - w4
    cw1, cw2 = base_w1 * rem, base_w2 * rem
    w1_min, w1_max = max(0.0, cw1 - 0.02*rem), min(rem, cw1 + 0.02*rem)
    w2_min, w2_max = max(0.0, cw2 - 0.02*rem), min(rem, cw2 + 0.02*rem)
    w1_grid = np.arange(w1_min, w1_max + 1e-12, 0.001)
    w2_grid = np.arange(w2_min, w2_max + 1e-12, 0.001)
    for w1 in w1_grid:
        for w2 in w2_grid:
            w3 = rem - w1 - w2
            if w3 < 0.0:
                continue
            z = w1*z1 + w2*z2 + w3*z3 + w4*z4
            auc = roc_auc_score(y, z)
            tried += 1
            if auc > best_auc:
                best_auc = auc; best_w = (float(w1/rem) if rem>0 else 0.0, float(w2/rem) if rem>0 else 0.0, float(w3/rem) if rem>0 else 0.0, float(w4))
print(f'4-way logit blend tried {tried} combos | best normalized weights (LR,Dense,Meta,Alt)={best_w} | OOF AUC(z): {best_auc:.5f} | {time.time()-t0:.1f}s')

# Build submission if threshold met
w1n, w2n, w3n, w4n = best_w
if best_auc >= 0.69210:
    zt = (w1n*tz1 + w2n*tz2 + w3n*tz3) * (1.0 - w4n) + w4n*tz4
    pt = sigmoid(zt).astype(np.float32)
    sub = pd.DataFrame({id_col: test[id_col].values, target_col: pt})
    sub.to_csv('submission.csv', index=False)
    print('Saved submission.csv (4-way logit blend incl. alt LR); head:')
    print(sub.head())
else:
    print('4-way blend did not reach 0.69210; keeping current best submission.')

Prepared 5-fold StratifiedKFold CV (shuffled).
Alt LR | Fold 1/5 - train 2302 va 576
Alt LR | Fold 1 AUC: 0.67561 | time 22.4s
Alt LR | Fold 2/5 - train 2302 va 576
Alt LR | Fold 2 AUC: 0.66758 | time 20.9s
Alt LR | Fold 3/5 - train 2302 va 576
Alt LR | Fold 3 AUC: 0.68796 | time 22.2s
Alt LR | Fold 4/5 - train 2303 va 575
Alt LR | Fold 4 AUC: 0.64426 | time 19.3s
Alt LR | Fold 5/5 - train 2303 va 575
Alt LR | Fold 5 AUC: 0.69882 | time 21.7s
Alt LR OOF AUC: 0.67415


4-way logit blend tried 43956 combos | best normalized weights (LR,Dense,Meta,Alt)=(0.33115820105820104, 0.41990582010582017, 0.24893597883597882, 0.055) | OOF AUC(z): 0.69208 | 88.0s
4-way blend did not reach 0.69210; keeping current best submission.
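
With tens of thousands of weight combinations scored on the same OOF rows, the best OOF AUC is an optimistic estimate. One possible sanity check is to pick the weights on one half of the rows and score them on the other half. The sketch below is a minimal version under stated assumptions: the data is synthetic, `auc_no_ties` is a tie-free stand-in for `roc_auc_score`, and `holdout_check` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def auc_no_ties(y, score):
    """Rank-based AUC (Mann-Whitney U); assumes continuous scores without ties."""
    order = np.argsort(score, kind="stable")
    ranks = np.empty(len(score), dtype=np.float64)
    ranks[order] = np.arange(1, len(score) + 1)
    n_pos = int(y.sum())
    n_neg = len(y) - n_pos
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

def holdout_check(Z, y, candidates, base_w, seed=0):
    """Select weights on half A, then compare selected vs. baseline on half B."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(y))
    a, b = perm[: len(y) // 2], perm[len(y) // 2:]
    aucs_a = [auc_no_ties(y[a], Z[a] @ w) for w in candidates]
    picked = candidates[int(np.argmax(aucs_a))]
    return picked, auc_no_ties(y[b], Z[b] @ picked), auc_no_ties(y[b], Z[b] @ base_w)

# Synthetic stand-in for four models' OOF logits (not the real predictions).
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=2000)
Z = rng.normal(size=(2000, 4)) + 0.6 * y[:, None]

# Baseline: prior 3-way center with zero weight on the fourth model.
base_w = np.array([0.3501, 0.4198, 0.2301, 0.0])
# Candidates: shrink the 3-way center and give the remainder to model 4.
candidates = [np.append(base_w[:3] * (1.0 - w4), w4)
              for w4 in np.arange(0.0, 0.1501, 0.005)]

picked, auc_picked_b, auc_base_b = holdout_check(Z, y, candidates, base_w)
print(f"picked w4={picked[3]:.3f} | half-B AUC picked={auc_picked_b:.5f} base={auc_base_b:.5f}")
```

If the weights picked on half A do not beat the baseline on half B across several seeds, an OOF gain on the order of 1e-4 is unlikely to carry over to the test set.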


In [42]:
# Logit-space blend using Dense XGB v2 (oof_xgb_dense_v2.npy) + tiny temperature sweep; aim to exceed 0.69210
import numpy as np, pandas as pd, time
from sklearn.metrics import roc_auc_score

def to_logit(p, eps=1e-6):
    p = np.clip(p.astype(np.float64), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print('Loading cached OOF/test predictions with dense v2...')
o1 = np.load('oof_lr_pivot.npy')
o2 = np.load('oof_xgb_dense_v2.npy')
o3 = np.load('oof_xgb_meta.npy')
t1 = np.load('test_lr_pivot.npy')
t2 = np.load('test_xgb_dense_v2.npy')
t3 = np.load('test_xgb_meta.npy')
y = train[target_col].astype(int).values

# Logits
z1, z2, z3 = to_logit(o1), to_logit(o2), to_logit(o3)
tz1, tz2, tz3 = to_logit(t1), to_logit(t2), to_logit(t3)

def eval_auc_logit(w1, w2):
    w3 = 1.0 - w1 - w2
    if w3 < 0 or w3 > 1:
        return -1.0
    z = w1*z1 + w2*z2 + w3*z3
    return roc_auc_score(y, z)

# Start near prior best logit center from v1
w1_c, w2_c = 0.3501, 0.4198
best_auc, best_w = -1.0, (w1_c, w2_c, 1.0 - w1_c - w2_c)

def grid_search_logit(w1_center, w2_center, window, step):
    global best_auc, best_w
    w1_min, w1_max = max(0.0, w1_center - window), min(1.0, w1_center + window)
    w2_min, w2_max = max(0.0, w2_center - window), min(1.0, w2_center + window)
    w1_grid = np.arange(w1_min, w1_max + 1e-12, step)
    w2_grid = np.arange(w2_min, w2_max + 1e-12, step)
    t0 = time.time(); cnt = 0
    for w1 in w1_grid:
        for w2 in w2_grid:
            if w1 + w2 > 1.0:
                continue
            auc = eval_auc_logit(w1, w2)
            cnt += 1
            if auc > best_auc:
                best_auc = auc
                best_w = (float(w1), float(w2), float(1.0 - w1 - w2))
    print(f'v2 logit grid window={window} step={step} tried {cnt} | best={best_w} OOF AUC(z): {best_auc:.5f}')

# Coarse-to-fine around the center
grid_search_logit(w1_c, w2_c, window=0.05, step=0.005)
grid_search_logit(best_w[0], best_w[1], window=0.02, step=0.001)
grid_search_logit(best_w[0], best_w[1], window=0.008, step=0.0005)
print(f'v2 logit-blend preliminary weights={best_w} | OOF AUC(z): {best_auc:.5f}')

# Tiny temperature sweep around 1.0 for v2 (fast)
scales = [0.95, 1.00, 1.05]
w1c, w2c = best_w[0], best_w[1]
window, step = 0.015, 0.001
best_auc_t, best_cfg = best_auc, (1.0, 1.0, 1.0, best_w)
t0 = time.time(); tried = 0
for s1 in scales:
    for s2 in scales:
        for s3 in scales:
            zz1, zz2, zz3 = s1*z1, s2*z2, s3*z3
            w1_min, w1_max = max(0.0, w1c - window), min(1.0, w1c + window)
            w2_min, w2_max = max(0.0, w2c - window), min(1.0, w2c + window)
            w1_grid = np.arange(w1_min, w1_max + 1e-12, step)
            w2_grid = np.arange(w2_min, w2_max + 1e-12, step)
            local_best_auc, local_best_w = -1.0, None
            for w1 in w1_grid:
                for w2 in w2_grid:
                    if w1 + w2 > 1.0: continue
                    w3 = 1.0 - w1 - w2
                    z = w1*zz1 + w2*zz2 + w3*zz3
                    auc = roc_auc_score(y, z)
                    tried += 1
                    if auc > local_best_auc:
                        local_best_auc, local_best_w = auc, (float(w1), float(w2), float(w3))
            if local_best_auc > best_auc_t:
                best_auc_t, best_cfg = local_best_auc, (s1, s2, s3, local_best_w)
print(f'v2 tiny temp sweep tried {tried} combos | best AUC(z): {best_auc_t:.5f} with scales={best_cfg[:3]} weights={best_cfg[3]} | {time.time()-t0:.1f}s')

# Choose the best between plain v2 logit-blend and temp-swept v2
use_auc = best_auc_t if best_auc_t > best_auc else best_auc
use_w = best_cfg[3] if best_auc_t > best_auc else best_w
use_scales = best_cfg[:3] if best_auc_t > best_auc else (1.0, 1.0, 1.0)
print(f'Chosen v2 config: scales={use_scales} weights={use_w} OOF AUC(z): {use_auc:.5f}')

# If threshold reached, write submission with dense v2 logits
if use_auc >= 0.69210:
    s1, s2, s3 = use_scales
    w1b, w2b, w3b = use_w
    pt = sigmoid(w1b*(s1*tz1) + w2b*(s2*tz2) + w3b*(s3*tz3)).astype(np.float32)
    sub = pd.DataFrame({id_col: test[id_col].values, target_col: pt})
    sub.to_csv('submission.csv', index=False)
    print('Saved submission.csv (v2 logit blend); head:')
    print(sub.head())
else:
    print('v2 logit blend < 0.69210; keeping current best submission.')

Loading cached OOF/test predictions with dense v2...


v2 logit grid window=0.05 step=0.005 tried 441 | best=(0.33010000000000006, 0.41980000000000006, 0.2500999999999999) OOF AUC(z): 0.69153


v2 logit grid window=0.02 step=0.001 tried 1681 | best=(0.33510000000000006, 0.42280000000000006, 0.24209999999999987) OOF AUC(z): 0.69155


v2 logit grid window=0.008 step=0.0005 tried 1089 | best=(0.33460000000000006, 0.42280000000000006, 0.24259999999999993) OOF AUC(z): 0.69155
v2 logit-blend preliminary weights=(0.33460000000000006, 0.42280000000000006, 0.24259999999999993) | OOF AUC(z): 0.69155


v2 tiny temp sweep tried 25947 combos | best AUC(z): 0.69155 with scales=(1.0, 1.05, 1.05) weights=(0.3466000000000001, 0.41480000000000006, 0.23859999999999992) | 51.2s
Chosen v2 config: scales=(1.0, 1.05, 1.05) weights=(0.3466000000000001, 0.41480000000000006, 0.23859999999999992) OOF AUC(z): 0.69155
v2 logit blend < 0.69210; keeping current best submission.
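
The temperature sweep above multiplies each model's logit by a scale before weighting, so each term is `w_i * s_i * z_i`. ROC AUC depends only on the ordering of the blended score, so a (scales, weights) pair is equivalent to one set of effective weights renormalized to sum to 1. A minimal sketch of that collapse, using the chosen config from the log; the helper name `effective_weights` is illustrative and not part of the notebook:

```python
def effective_weights(scales, weights):
    """Collapse per-model temperature scales into normalized blend weights.

    The blended logit is sum(w_i * s_i * z_i). Dividing every coefficient by
    their (positive) sum rescales the blend but leaves its ranking, and hence
    its ROC AUC, unchanged.
    """
    raw = [s * w for s, w in zip(scales, weights)]
    total = sum(raw)
    if total <= 0:
        raise ValueError("coefficients must have a positive sum")
    return [r / total for r in raw]


# Chosen v2 config reported in the output above
scales = (1.0, 1.05, 1.05)
weights = (0.3466, 0.4148, 0.2386)
eff = effective_weights(scales, weights)
print([round(e, 4) for e in eff])
```

The result lands within roughly 0.001 of the plain grid optimum (0.3346, 0.4228, 0.2426). That is consistent with the sweep reporting the same OOF AUC to five decimals: the scales mostly re-express weights the plain grid could already reach.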


In [43]:
# 4-way logit blend: LR + Dense v1 + Dense v2 + Meta (split dense weight); aim to exceed 0.69210
import numpy as np, pandas as pd, time
from sklearn.metrics import roc_auc_score

def to_logit(p, eps=1e-6):
    p = np.clip(p.astype(np.float64), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print('Loading cached OOF/test predictions (LR, Dense v1, Dense v2, Meta)...')
o1 = np.load('oof_lr_pivot.npy')
o2 = np.load('oof_xgb_dense.npy')
o2b = np.load('oof_xgb_dense_v2.npy')
o3 = np.load('oof_xgb_meta.npy')
t1 = np.load('test_lr_pivot.npy')
t2 = np.load('test_xgb_dense.npy')
t2b = np.load('test_xgb_dense_v2.npy')
t3 = np.load('test_xgb_meta.npy')
y = train[target_col].astype(int).values

# Logits
z1, z2, z2b, z3 = to_logit(o1), to_logit(o2), to_logit(o2b), to_logit(o3)
tz1, tz2, tz2b, tz3 = to_logit(t1), to_logit(t2), to_logit(t2b), to_logit(t3)

# Start near prior best 3-way logit weights (v1): (0.3501, 0.4198, 0.2301)
w1c, wdc = 0.3501, 0.4198  # LR, total dense
w3c = 1.0 - w1c - wdc
w1_min, w1_max = max(0.0, w1c - 0.02), min(1.0, w1c + 0.02)
wd_min, wd_max = max(0.0, wdc - 0.02), min(1.0, wdc + 0.02)
step = 0.001

best_auc, best_cfg = -1.0, None
t0 = time.time(); cnt = 0
alphas = np.arange(0.0, 0.4001, 0.05)  # fraction of total dense weight assigned to v2
w1_grid = np.arange(w1_min, w1_max + 1e-12, step)
wd_grid = np.arange(wd_min, wd_max + 1e-12, step)
for w1 in w1_grid:
    for wd in wd_grid:
        w3 = 1.0 - w1 - wd
        if w3 < 0.0 or w3 > 1.0:
            continue
        for a in alphas:
            w2b = wd * a
            w2 = wd - w2b
            z = w1*z1 + w2*z2 + w2b*z2b + w3*z3
            auc = roc_auc_score(y, z)
            cnt += 1
            if auc > best_auc:
                best_auc = auc; best_cfg = (float(w1), float(w2), float(w2b), float(w3))
print(f'4-way split-dense search tried {cnt} combos | best weights (LR, DenseV1, DenseV2, Meta)={best_cfg} OOF AUC(z): {best_auc:.5f} | {time.time()-t0:.1f}s')

# Optional micro-refine around best
w1b, w2b1, w2b2, w3b = best_cfg
def refine(w1c, w2c1, w2c2, w3c, window=0.006, step=0.0005):
    best_local_auc, best_local_w = -1.0, None
    # Reparam via (w1, wd, alpha), with wd = w2c1+w2c2, alpha = w2c2/wd (guard wd=0)
    wd_c = w2c1 + w2c2
    alpha_c = (w2c2 / wd_c) if wd_c > 0 else 0.0
    w1_min = max(0.0, w1c - window); w1_max = min(1.0, w1c + window)
    wd_min = max(0.0, wd_c - window); wd_max = min(1.0, wd_c + window)
    a_min = max(0.0, alpha_c - 0.05); a_max = min(0.6, alpha_c + 0.05)
    w1_grid = np.arange(w1_min, w1_max + 1e-12, step)
    wd_grid = np.arange(wd_min, wd_max + 1e-12, step)
    a_grid = np.arange(a_min, a_max + 1e-12, step*5)
    t1s = time.time(); cnt2 = 0
    for w1 in w1_grid:
        for wd in wd_grid:
            w3 = 1.0 - w1 - wd
            if w3 < 0.0 or w3 > 1.0:
                continue
            for a in a_grid:
                w2v2 = wd * a
                w2v1 = wd - w2v2
                z = w1*z1 + w2v1*z2 + w2v2*z2b + w3*z3
                auc = roc_auc_score(y, z)
                cnt2 += 1
                if auc > best_local_auc:
                    best_local_auc = auc; best_local_w = (float(w1), float(w2v1), float(w2v2), float(w3))
    print(f'Refine ({cnt2} combos) best weights={best_local_w} OOF AUC(z): {best_local_auc:.5f} | {time.time()-t1s:.1f}s')
    return best_local_auc, best_local_w

auc_ref, w_ref = refine(w1b, w2b1, w2b2, w3b, window=0.006, step=0.0005)
use_auc = auc_ref if auc_ref > best_auc else best_auc
use_w = w_ref if auc_ref > best_auc else best_cfg
print(f'Chosen 4-way split-dense config: weights={use_w} OOF AUC(z): {use_auc:.5f}')

# If threshold reached, write submission
if use_auc >= 0.69210:
    w1f, w2f1, w2f2, w3f = use_w
    zt = w1f*tz1 + w2f1*tz2 + w2f2*tz2b + w3f*tz3
    pt = sigmoid(zt).astype(np.float32)
    sub = pd.DataFrame({id_col: test[id_col].values, target_col: pt})
    sub.to_csv('submission.csv', index=False)
    print('Saved submission.csv (4-way split-dense logit blend); head:')
    print(sub.head())
else:
    print('4-way split-dense logit blend < 0.69210; keeping current best submission.')

Loading cached OOF/test predictions (LR, Dense v1, Dense v2, Meta)...


4-way split-dense search tried 15129 combos | best weights (LR, DenseV1, DenseV2, Meta)=(0.3441, 0.28002, 0.15078000000000003, 0.2250999999999999) OOF AUC(z): 0.69240 | 29.5s


Refine (25625 combos) best weights=(0.3381, 0.26678599999999997, 0.16351400000000005, 0.23159999999999992) OOF AUC(z): 0.69242 | 49.9s
Chosen 4-way split-dense config: weights=(0.3381, 0.26678599999999997, 0.16351400000000005, 0.23159999999999992) OOF AUC(z): 0.69242
Saved submission.csv (4-way split-dense logit blend); head:
  request_id  requester_received_pizza
0  t3_1aw5zf                  0.331806
1   t3_roiuw                  0.217080
2   t3_mjnbq                  0.213391
3   t3_t8wd1                  0.209035
4  t3_1m4zxu                  0.215080
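
The split-dense search above walks a three-parameter grid (LR weight, total dense weight, and the fraction of dense weight given to v2) and derives four weights from it. A minimal sketch of that mapping and its inverse; the function names are illustrative, and the check uses the coarse-search optimum from the log:

```python
def split_dense_weights(w_lr, w_dense, alpha):
    """Map (LR weight, total dense weight, v2 fraction) to four blend weights.

    Returns (LR, Dense v1, Dense v2, Meta). Meta takes whatever is left, so the
    four weights always sum to 1.
    """
    w_v2 = w_dense * alpha
    w_v1 = w_dense - w_v2
    w_meta = 1.0 - w_lr - w_dense
    if w_meta < 0.0 or not 0.0 <= alpha <= 1.0:
        raise ValueError("parameters fall outside the simplex")
    return (w_lr, w_v1, w_v2, w_meta)


def to_split_params(weights):
    """Inverse mapping: recover (w_lr, w_dense, alpha) from four weights."""
    w_lr, w_v1, w_v2, _ = weights
    w_dense = w_v1 + w_v2
    alpha = w_v2 / w_dense if w_dense > 0 else 0.0
    return (w_lr, w_dense, alpha)


# Coarse-search optimum from the output above: total dense 0.4308, alpha 0.35
w = split_dense_weights(0.3441, 0.4308, 0.35)
print(w)
print(to_split_params(w))
```

The refine step in the cell searches the same three parameters on a finer grid around this point.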


In [44]:
# Per-fold logit standardization + per-fold weight optimization (4-way: LR, Dense v1, Dense v2, Meta); average test over folds
import numpy as np, pandas as pd, time, sys
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def to_logit(p, eps=1e-6):
    p = np.clip(p.astype(np.float64), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print('Loading cached OOF/test predictions for 4-way (LR, Dense v1, Dense v2, Meta)...')
o1 = np.load('oof_lr_pivot.npy')
o2 = np.load('oof_xgb_dense.npy')
o2b = np.load('oof_xgb_dense_v2.npy')
o3 = np.load('oof_xgb_meta.npy')
t1 = np.load('test_lr_pivot.npy')
t2 = np.load('test_xgb_dense.npy')
t2b = np.load('test_xgb_dense_v2.npy')
t3 = np.load('test_xgb_meta.npy')
y = train['requester_received_pizza'].astype(int).values
n = len(y)

# Convert to logits once
z1_all, z2_all, z2b_all, z3_all = to_logit(o1), to_logit(o2), to_logit(o2b), to_logit(o3)
tz1_all, tz2_all, tz2b_all, tz3_all = to_logit(t1), to_logit(t2), to_logit(t2b), to_logit(t3)

# Global best 4-way logit weights found earlier (Cell 31):
w_best_global = (0.3381, 0.266786, 0.163514, 0.2316)  # (LR, Dense v1, Dense v2, Meta)
print('Starting per-fold search around global best weights:', w_best_global)

cv = list(StratifiedKFold(n_splits=5, shuffle=True, random_state=42).split(np.arange(n), y))
oof_blend = np.zeros(n, dtype=np.float64)
test_fold_preds = []

def zscore(arr, mean, std):
    # Guard against zero variance so the division is always defined
    std = np.where(std == 0, 1.0, std)
    return (arr - mean) / std

t_start = time.time()
for k, (tr_idx, va_idx) in enumerate(cv):
    t0 = time.time()
    # Per-fold standardization (z-score) using train indices stats

    z1_tr, z2_tr, z2b_tr, z3_tr = z1_all[tr_idx], z2_all[tr_idx], z2b_all[tr_idx], z3_all[tr_idx]
    z1_va, z2_va, z2b_va, z3_va = z1_all[va_idx], z2_all[va_idx], z2b_all[va_idx], z3_all[va_idx]

    m1, s1 = z1_tr.mean(), z1_tr.std(ddof=0)
    m2, s2 = z2_tr.mean(), z2_tr.std(ddof=0)
    m2b, s2b = z2b_tr.mean(), z2b_tr.std(ddof=0)
    m3, s3 = z3_tr.mean(), z3_tr.std(ddof=0)

    z1_tr_s = zscore(z1_tr, m1, s1); z1_va_s = zscore(z1_va, m1, s1); tz1_s = zscore(tz1_all, m1, s1)
    z2_tr_s = zscore(z2_tr, m2, s2); z2_va_s = zscore(z2_va, m2, s2); tz2_s = zscore(tz2_all, m2, s2)
    z2b_tr_s = zscore(z2b_tr, m2b, s2b); z2b_va_s = zscore(z2b_va, m2b, s2b); tz2b_s = zscore(tz2b_all, m2b, s2b)
    z3_tr_s = zscore(z3_tr, m3, s3); z3_va_s = zscore(z3_va, m3, s3); tz3_s = zscore(tz3_all, m3, s3)

    # Per-fold tiny grid search around global best (window=0.02, step=0.001); ensure non-neg and sum=1
    base_w1, base_w2, base_w2b, base_w3 = w_best_global
    w1_min, w1_max = max(0.0, base_w1 - 0.02), min(1.0, base_w1 + 0.02)
    w2_min, w2_max = max(0.0, base_w2 - 0.02), min(1.0, base_w2 + 0.02)
    w2b_min, w2b_max = max(0.0, base_w2b - 0.02), min(1.0, base_w2b + 0.02)
    step = 0.001
    w1_grid = np.arange(w1_min, w1_max + 1e-12, step)
    w2_grid = np.arange(w2_min, w2_max + 1e-12, step)
    w2b_grid = np.arange(w2b_min, w2b_max + 1e-12, step)

    best_auc_k, best_w_k = -1.0, w_best_global
    y_tr = y[tr_idx]
    tried = 0
    for w1 in w1_grid:
        for w2 in w2_grid:
            for w2b in w2b_grid:
                w3 = 1.0 - (w1 + w2 + w2b)
                if w3 < 0.0 or w3 > 1.0:
                    continue
                z_tr_combo = w1*z1_tr_s + w2*z2_tr_s + w2b*z2b_tr_s + w3*z3_tr_s
                auc_k = roc_auc_score(y_tr, z_tr_combo)
                tried += 1
                if auc_k > best_auc_k:
                    best_auc_k = auc_k; best_w_k = (float(w1), float(w2), float(w2b), float(w3))
    # Apply to validation and test (fold-normalized)
    w1k, w2k, w2bk, w3k = best_w_k
    z_va_combo = w1k*z1_va_s + w2k*z2_va_s + w2bk*z2b_va_s + w3k*z3_va_s
    z_te_combo = w1k*tz1_s + w2k*tz2_s + w2bk*tz2b_s + w3k*tz3_s
    oof_blend[va_idx] = z_va_combo
    test_fold_preds.append(z_te_combo.astype(np.float64))
    print(f'Fold {k+1}/5 | tried={tried} | best fold-train AUC(z)={best_auc_k:.5f} | weights={best_w_k} | time {time.time()-t0:.1f}s')
    sys.stdout.flush()

# Evaluate OOF AUC using z (monotonic wrt prob) and also prob for completeness
oof_auc_z = roc_auc_score(y, oof_blend)
oof_auc_p = roc_auc_score(y, sigmoid(oof_blend))
print(f'Per-fold standardized 4-way blend OOF AUC(z)={oof_auc_z:.5f} | AUC(prob)={oof_auc_p:.5f} | total {time.time()-t_start:.1f}s')

# Average test logits across folds, convert to prob, and write submission
test_mean_z = np.mean(test_fold_preds, axis=0)
pt = sigmoid(test_mean_z).astype(np.float32)
sub = pd.DataFrame({id_col: test[id_col].values, target_col: pt})
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv (per-fold z-std 4-way blend); head:')
print(sub.head())

Loading cached OOF/test predictions for 4-way (LR, Dense v1, Dense v2, Meta)...
Starting per-fold search around global best weights: (0.3381, 0.266786, 0.163514, 0.2316)


Fold 1/5 | tried=68921 | best fold-train AUC(z)=0.69202 | weights=(0.3461, 0.2857860000000001, 0.143514, 0.2245999999999998) | time 128.0s


Fold 2/5 | tried=68921 | best fold-train AUC(z)=0.69544 | weights=(0.35810000000000003, 0.25178600000000007, 0.17251400000000003, 0.2175999999999998) | time 123.4s


Fold 3/5 | tried=68921 | best fold-train AUC(z)=0.68510 | weights=(0.35810000000000003, 0.24778600000000003, 0.18351400000000004, 0.2105999999999999) | time 123.0s


Fold 4/5 | tried=68921 | best fold-train AUC(z)=0.70456 | weights=(0.3541, 0.2597860000000001, 0.16351400000000002, 0.2225999999999998) | time 124.2s


Fold 5/5 | tried=68921 | best fold-train AUC(z)=0.68544 | weights=(0.3241, 0.24678600000000003, 0.147514, 0.28159999999999996) | time 124.1s


Per-fold standardized 4-way blend OOF AUC(z)=0.69143 | AUC(prob)=0.69143 | total 622.8s
Saved submission.csv (per-fold z-std 4-way blend); head:
  request_id  requester_received_pizza
0  t3_1aw5zf                  0.716866
1   t3_roiuw                  0.498893
2   t3_mjnbq                  0.485197
3   t3_t8wd1                  0.468800
4  t3_1m4zxu                  0.477995
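
The submission head above sits near 0.5, where the unstandardized blend produced values near 0.21. That shift is expected: z-scoring each model's logits removes their mean, so the blended score is centered on zero and `sigmoid` maps it to about 0.5. For a single model the ordering is unchanged; the blend's ranking changes only because standardizing rescales each model's contribution. A small sketch with made-up logits (the values are illustrative, not taken from the cached predictions):

```python
import math
import statistics


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def zscore(values, mean, std):
    """Standardize with externally supplied stats (train-fold stats in the cell)."""
    std = std if std != 0 else 1.0
    return [(v - mean) / std for v in values]


# Hypothetical logits for one model; a mean near -1.3 is a probability near 0.21
logits = [-2.1, -1.6, -1.3, -1.2, -0.9, -0.7]
mean = statistics.mean(logits)
std = statistics.pstdev(logits)  # population std, matching ddof=0 in the cell

raw_probs = [sigmoid(z) for z in logits]
std_probs = [sigmoid(z) for z in zscore(logits, mean, std)]

print(round(statistics.mean(raw_probs), 3), round(statistics.mean(std_probs), 3))
```

Since the metric is ROC AUC, the level shift does not affect scoring, but these probabilities should not be read as calibrated.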


In [56]:
# Weight bagging of nearby 4-way logit optima + constrained pruning checks; write best submission
import numpy as np, pandas as pd, time, sys
from sklearn.metrics import roc_auc_score

def to_logit(p, eps=1e-6):
    p = np.clip(p.astype(np.float64), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print('Loading cached OOF/test predictions (LR, Dense v1, Dense v2, Meta)...')
o1 = np.load('oof_lr_pivot.npy'); t1 = np.load('test_lr_pivot.npy')
o2 = np.load('oof_xgb_dense.npy'); t2 = np.load('test_xgb_dense.npy')
o2b = np.load('oof_xgb_dense_v2.npy'); t2b = np.load('test_xgb_dense_v2.npy')
o3 = np.load('oof_xgb_meta.npy'); t3 = np.load('test_xgb_meta.npy')
y = train['requester_received_pizza'].astype(int).values

# Convert to logits
z1, z2, z2b, z3 = to_logit(o1), to_logit(o2), to_logit(o2b), to_logit(o3)
tz1, tz2, tz2b, tz3 = to_logit(t1), to_logit(t2), to_logit(t2b), to_logit(t3)

# Two nearby strong 4-way configs from Cell 31
w_ref = (0.3381, 0.266786, 0.163514, 0.2316)  # best refine
w_alt = (0.3441, 0.28002, 0.15078, 0.22510)  # pre-refine top

def auc_for_weights(w):
    w1, w2, w2b, w3 = w
    z = w1*z1 + w2*z2 + w2b*z2b + w3*z3
    return roc_auc_score(y, z)

def test_probs_for_weights(w):
    w1, w2, w2b, w3 = w
    zt = w1*tz1 + w2*tz2 + w2b*tz2b + w3*tz3
    return sigmoid(zt).astype(np.float32)

auc_ref = auc_for_weights(w_ref); auc_alt = auc_for_weights(w_alt)
print(f'Config REF OOF AUC(z)={auc_ref:.5f} weights={w_ref}')
print(f'Config ALT OOF AUC(z)={auc_alt:.5f} weights={w_alt}')

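# Bag the two configs. OOF is scored on averaged logits; the test branch further down averages probabilities instead,
# so the OOF AUC for 'bag2' is close to, but not exactly, the score of the predictions that would be written.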
z_ref = w_ref[0]*z1 + w_ref[1]*z2 + w_ref[2]*z2b + w_ref[3]*z3
z_alt = w_alt[0]*z1 + w_alt[1]*z2 + w_alt[2]*z2b + w_alt[3]*z3
z_bag = 0.5*(z_ref + z_alt)
auc_bag = roc_auc_score(y, z_bag)
print(f'Bagged (2-config) OOF AUC(z)={auc_bag:.5f}')

# Constrained pruning checks around w_ref:
def constrained_search(prune='meta', window=0.01, step=0.001):
    base = w_ref
    best_auc, best_w = -1.0, None
    w1c, w2c, w2bc, w3c = base
    if prune == 'meta':
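        # set w3=0; search w1, w2, w2b with 0 < sum <= 1 (weights are not renormalized,
        # which is harmless here because AUC is invariant to positive rescaling of z)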
        w1_grid = np.arange(max(0, w1c-window), min(1, w1c+window)+1e-12, step)
        w2_grid = np.arange(max(0, w2c-window), min(1, w2c+window)+1e-12, step)
        w2b_grid= np.arange(max(0, w2bc-window),min(1, w2bc+window)+1e-12, step)
        tried=0
        for w1v in w1_grid:
            for w2v in w2_grid:
                for w2bv in w2b_grid:
                    s = w1v + w2v + w2bv
                    if s <= 0 or s > 1.0: continue
                    z = w1v*z1 + w2v*z2 + w2bv*z2b  # w3=0
                    auc = roc_auc_score(y, z)
                    tried += 1
                    if auc > best_auc:
                        best_auc, best_w = auc, (float(w1v), float(w2v), float(w2bv), 0.0)
        print(f'Prune meta tried={tried} best OOF AUC(z)={best_auc:.5f} weights={best_w}')
        return best_auc, best_w
    elif prune == 'dense_v2':
        # set w2b=0, search w1,w2,w3 sum=1
        w1_grid = np.arange(max(0, w1c-window), min(1, w1c+window)+1e-12, step)
        w2_grid = np.arange(max(0, w2c-window), min(1, w2c+window)+1e-12, step)
        tried=0
        for w1v in w1_grid:
            for w2v in w2_grid:
                w3v = 1.0 - w1v - w2v
                if w3v < 0 or w3v > 1: continue
                z = w1v*z1 + w2v*z2 + w3v*z3  # w2b=0
                auc = roc_auc_score(y, z)
                tried += 1
                if auc > best_auc:
                    best_auc, best_w = auc, (float(w1v), float(w2v), 0.0, float(w3v))
        print(f'Prune dense_v2 tried={tried} best OOF AUC(z)={best_auc:.5f} weights={best_w}')
        return best_auc, best_w
    else:
        return -1.0, None

auc_prune_meta, w_prune_meta = constrained_search('meta', window=0.01, step=0.001)
auc_prune_dv2,  w_prune_dv2  = constrained_search('dense_v2', window=0.01, step=0.001)

# Choose best among ref, alt, bag, prune variants
candidates = [
    ('ref', auc_ref, w_ref),
    ('alt', auc_alt, w_alt),
    ('bag2', auc_bag, None),
    ('prune_meta', auc_prune_meta, w_prune_meta),
    ('prune_dense_v2', auc_prune_dv2, w_prune_dv2),
]
candidates.sort(key=lambda x: x[1], reverse=True)
print('Candidates ranked (name, OOF AUC(z)):', [(n, round(a,5)) for n,a,_ in candidates])

# Build submission for the top candidate
top_name, top_auc, top_w = candidates[0]
if top_name == 'bag2':
    pt = 0.5*sigmoid(w_ref[0]*tz1 + w_ref[1]*tz2 + w_ref[2]*tz2b + w_ref[3]*tz3) + \
         0.5*sigmoid(w_alt[0]*tz1 + w_alt[1]*tz2 + w_alt[2]*tz2b + w_alt[3]*tz3)
    pt = pt.astype(np.float32)
    sub = pd.DataFrame({id_col: test[id_col].values, target_col: pt})
    sub.to_csv('submission.csv', index=False)
    print(f'Saved submission.csv (bagged two 4-way configs); top OOF AUC(z)={top_auc:.5f}; head:')
    print(sub.head())
else:
    w1, w2, w2b, w3 = top_w
    zt = w1*tz1 + w2*tz2 + w2b*tz2b + w3*tz3
    pt = sigmoid(zt).astype(np.float32)
    sub = pd.DataFrame({id_col: test[id_col].values, target_col: pt})
    sub.to_csv('submission.csv', index=False)
    print(f'Saved submission.csv ({top_name}); OOF AUC(z)={top_auc:.5f}; weights={top_w}; head:')
    print(sub.head())

Loading cached OOF/test predictions (LR, Dense v1, Dense v2, Meta)...
Config REF OOF AUC(z)=0.69242 weights=(0.3381, 0.266786, 0.163514, 0.2316)
Config ALT OOF AUC(z)=0.69240 weights=(0.3441, 0.28002, 0.15078, 0.2251)
Bagged (2-config) OOF AUC(z)=0.69238


Prune meta tried=9261 best OOF AUC(z)=0.69076 weights=(0.3441, 0.258786, 0.160514, 0.0)


Prune dense_v2 tried=441 best OOF AUC(z)=0.69107 weights=(0.3471, 0.27678600000000003, 0.0, 0.376114)
Candidates ranked (name, OOF AUC(z)): [('ref', 0.69242), ('alt', 0.6924), ('bag2', 0.69238), ('prune_dense_v2', 0.69107), ('prune_meta', 0.69076)]
Saved submission.csv (ref); OOF AUC(z)=0.69242; weights=(0.3381, 0.266786, 0.163514, 0.2316); head:
  request_id  requester_received_pizza
0  t3_1aw5zf                  0.331806
1   t3_roiuw                  0.217080
2   t3_mjnbq                  0.213391
3   t3_t8wd1                  0.209035
4  t3_1m4zxu                  0.215080
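
The `bag2` candidate above is scored on OOF by averaging the two blended logits, while its test predictions average the two probabilities. These are not the same quantity: averaging the logits of two linear blends is identical to one blend with the averaged weights, whereas averaging probabilities is a different, nonlinear combination. A minimal sketch on hypothetical per-model logits for a single row (the logit values are made up for illustration; the weights are the two configs from the cell):

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def blend(weights, logits):
    """Weighted sum of per-model logits for one row."""
    return sum(w * z for w, z in zip(weights, logits))


w_ref = (0.3381, 0.266786, 0.163514, 0.2316)
w_alt = (0.3441, 0.28002, 0.15078, 0.22510)
w_mean = tuple(0.5 * (a + b) for a, b in zip(w_ref, w_alt))

# Hypothetical logits for one row: (LR, Dense v1, Dense v2, Meta)
row = (-0.4, -2.0, -1.1, 0.9)

z_logit_avg = 0.5 * (blend(w_ref, row) + blend(w_alt, row))
p_prob_avg = 0.5 * (sigmoid(blend(w_ref, row)) + sigmoid(blend(w_alt, row)))

print(z_logit_avg, blend(w_mean, row))    # equal up to rounding
print(sigmoid(z_logit_avg), p_prob_avg)   # close, but not identical in general
```

With configs this close the gap is tiny, but using the same averaging on OOF and test keeps the reported score describing the file that gets written.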


In [46]:
# Add fast NB counts model and 5-way logit blend with small NB weight cap
import numpy as np, pandas as pd, time, sys, gc
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score

def to_logit(p, eps=1e-6):
    p = np.clip(p.astype(np.float64), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

assert 'combine_raw_text' in globals(), 'Need combine_raw_text from earlier pivot cell'

print('Training fast MultinomialNB on word(1,2) counts (min_df=3, max_features=40000)...')
y = train['requester_received_pizza'].astype(int).values
texts_tr = combine_raw_text(train).fillna('').astype(str)
texts_te = combine_raw_text(test).fillna('').astype(str)

# CV for OOF NB preds
cv = list(StratifiedKFold(n_splits=5, shuffle=True, random_state=42).split(np.arange(len(y)), y))
oof_nb = np.zeros(len(y), dtype=np.float32)
te_nb_folds = []
alphas = [0.5, 1.0, 2.0]
vec = CountVectorizer(analyzer='word', ngram_range=(1,2), lowercase=True, min_df=3, max_features=40000, binary=False)

for k, (tr_idx, va_idx) in enumerate(cv):
    t0 = time.time()
    tr_txt = texts_tr.iloc[tr_idx]
    va_txt = texts_tr.iloc[va_idx]
    Xtr = vec.fit_transform(tr_txt)
    Xva = vec.transform(va_txt)
    Xte = vec.transform(texts_te)
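    # NOTE: alpha is selected per fold on the validation AUC, and the winning validation
    # predictions become the OOF values, so oof_nb (and any blend tuned on it) is optimistic.
    best_auc_k, best_va, best_te = -1.0, None, None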
    for a in alphas:
        nb = MultinomialNB(alpha=a)
        nb.fit(Xtr, y[tr_idx])
        va_p = nb.predict_proba(Xva)[:,1].astype(np.float32)
        auc_k = roc_auc_score(y[va_idx], va_p)
        if auc_k > best_auc_k:
            best_auc_k = auc_k
            best_va = va_p
            best_te = nb.predict_proba(Xte)[:,1].astype(np.float32)
    oof_nb[va_idx] = best_va
    te_nb_folds.append(best_te)
    print(f'Fold {k+1}/5 NB best AUC: {best_auc_k:.5f} | time {time.time()-t0:.1f}s')
    sys.stdout.flush()
    del Xtr, Xva, Xte, nb
    gc.collect()

auc_nb = roc_auc_score(y, oof_nb)
print(f'NB counts OOF AUC: {auc_nb:.5f}')
test_nb = np.mean(te_nb_folds, axis=0).astype(np.float32)
np.save('oof_nb_counts.npy', oof_nb)
np.save('test_nb_counts.npy', test_nb)

# 5-way logit blend: base 4-way weights from Cell 31; small NB weight cap
print('5-way logit blend search with small NB cap...')
o1 = np.load('oof_lr_pivot.npy')
o2 = np.load('oof_xgb_dense.npy')
o2b = np.load('oof_xgb_dense_v2.npy')
o3 = np.load('oof_xgb_meta.npy')
o4 = oof_nb
t1 = np.load('test_lr_pivot.npy')
t2 = np.load('test_xgb_dense.npy')
t2b = np.load('test_xgb_dense_v2.npy')
t3 = np.load('test_xgb_meta.npy')
t4 = test_nb

z1, z2, z2b, z3, z4 = to_logit(o1), to_logit(o2), to_logit(o2b), to_logit(o3), to_logit(o4)
tz1, tz2, tz2b, tz3, tz4 = to_logit(t1), to_logit(t2), to_logit(t2b), to_logit(t3), to_logit(t4)

# Base best 4-way weights from Cell 31 refine:
w_lr0, w_d10, w_d20, w_meta0 = 0.3381, 0.266786, 0.163514, 0.2316

# Grids
w1_grid = np.arange(max(0.0, w_lr0 - 0.01), min(1.0, w_lr0 + 0.01) + 1e-12, 0.001)
w2_grid = np.arange(max(0.0, w_d10 - 0.01), min(1.0, w_d10 + 0.01) + 1e-12, 0.001)
w2b_grid= np.arange(max(0.0, w_d20 - 0.01), min(1.0, w_d20 + 0.01) + 1e-12, 0.001)
w4_grid = np.arange(0.0, 0.08001, 0.005)  # NB cap 8%

best_auc, best_w = -1.0, None
t0 = time.time(); tried = 0
for w4 in w4_grid:
    rem = 1.0 - w4
    for w1 in w1_grid:
        for w2 in w2_grid:
            for w2b in w2b_grid:
                s = w1 + w2 + w2b
                if s > rem: continue
                w3 = rem - s
                z = w1*z1 + w2*z2 + w2b*z2b + w3*z3 + w4*z4
                auc = roc_auc_score(y, z)
                tried += 1
                if auc > best_auc:
                    best_auc = auc
                    # store normalized weights (sum to 1 across all 5)
                    best_w = (float(w1), float(w2), float(w2b), float(w3), float(w4))
print(f'5-way search tried {tried} combos | best OOF AUC(z)={best_auc:.5f} | weights(LR,D1,D2,Meta,NB)={best_w} | {time.time()-t0:.1f}s')

if best_w is not None and best_auc >= 0.69210:
    w1, w2, w2b, w3, w4 = best_w
    zt = w1*tz1 + w2*tz2 + w2b*tz2b + w3*tz3 + w4*tz4
    pt = sigmoid(zt).astype(np.float32)
    sub = pd.DataFrame({id_col: test[id_col].values, target_col: pt})
    sub.to_csv('submission.csv', index=False)
    print('Saved submission.csv (5-way logit blend with NB); head:')
    print(sub.head())
else:
    print('5-way blend did not reach threshold; keeping current best submission.')

Training fast MultinomialNB on word(1,2) counts (min_df=3, max_features=40000)...


Fold 1/5 NB best AUC: 0.57436 | time 0.6s


Fold 2/5 NB best AUC: 0.56471 | time 0.6s


Fold 3/5 NB best AUC: 0.55392 | time 0.6s


Fold 4/5 NB best AUC: 0.54240 | time 0.6s


Fold 5/5 NB best AUC: 0.57940 | time 0.6s


NB counts OOF AUC: 0.56205
5-way logit blend search with small NB cap...


5-way search tried 157437 combos | best OOF AUC(z)=0.69600 | weights(LR,D1,D2,Meta,NB)=(0.3281, 0.259786, 0.156514, 0.23560000000000003, 0.02) | 313.6s
Saved submission.csv (5-way logit blend with NB); head:
  request_id  requester_received_pizza
0  t3_1aw5zf                  0.294011
1   t3_roiuw                  0.217789
2   t3_mjnbq                  0.192705
3   t3_t8wd1                  0.188907
4  t3_1m4zxu                  0.206851
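
In the NB loop above, `alpha` is chosen per fold by AUC on the same validation rows that are then stored in `oof_nb`. Picking the best of several candidates on the scoring data can only raise the per-fold score, so the OOF estimate leans optimistic, even for predictors with no signal. A dependency-free sketch of that effect on pure-noise scores (everything here is simulated; `auc` is a small pairwise implementation written for the example, not scikit-learn's):

```python
import random


def auc(labels, scores):
    """Pairwise ROC AUC: P(score_pos > score_neg), ties counted as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


rng = random.Random(0)
n_folds, fold_size, n_candidates = 5, 40, 3

picked, averages = [], []
for _ in range(n_folds):
    # Balanced labels per fold; candidate scores are pure noise
    labels = [1] * (fold_size // 2) + [0] * (fold_size // 2)
    fold_aucs = []
    for _ in range(n_candidates):
        scores = [rng.random() for _ in labels]
        fold_aucs.append(auc(labels, scores))
    picked.append(max(fold_aucs))                    # selection on validation
    averages.append(sum(fold_aucs) / n_candidates)   # average over candidates

print("mean AUC when picking the best per fold:", sum(picked) / n_folds)
print("mean AUC over all candidates:", sum(averages) / n_folds)
```

One hedge against this is to fix `alpha` once for all folds, or to choose it on an inner split of the training rows only.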


In [58]:
# Per-fold logit weight optimization without standardization (4-way); average test over folds
import numpy as np, pandas as pd, time, sys
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def to_logit(p, eps=1e-6):
    p = np.clip(p.astype(np.float64), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print('Loading cached OOF/test predictions (LR, Dense v1, Dense v2, Meta)...')
o1 = np.load('oof_lr_pivot.npy')
o2 = np.load('oof_xgb_dense.npy')
o2b = np.load('oof_xgb_dense_v2.npy')
o3 = np.load('oof_xgb_meta.npy')
t1 = np.load('test_lr_pivot.npy')
t2 = np.load('test_xgb_dense.npy')
t2b = np.load('test_xgb_dense_v2.npy')
t3 = np.load('test_xgb_meta.npy')
y = train['requester_received_pizza'].astype(int).values
n = len(y)

# Convert to logits once
z1_all, z2_all, z2b_all, z3_all = to_logit(o1), to_logit(o2), to_logit(o2b), to_logit(o3)
tz1_all, tz2_all, tz2b_all, tz3_all = to_logit(t1), to_logit(t2), to_logit(t2b), to_logit(t3)

# Global best 4-way logit weights (from Cell 31 refine)
w_best_global = (0.3381, 0.266786, 0.163514, 0.2316)  # (LR, Dense v1, Dense v2, Meta)
print('Per-fold search around global best weights:', w_best_global)

cv = list(StratifiedKFold(n_splits=5, shuffle=True, random_state=42).split(np.arange(n), y))
oof_blend = np.zeros(n, dtype=np.float64)
test_fold_logits = []

t_start = time.time()
for k, (tr_idx, va_idx) in enumerate(cv):
    t0 = time.time()
    z1_tr, z2_tr, z2b_tr, z3_tr = z1_all[tr_idx], z2_all[tr_idx], z2b_all[tr_idx], z3_all[tr_idx]
    z1_va, z2_va, z2b_va, z3_va = z1_all[va_idx], z2_all[va_idx], z2b_all[va_idx], z3_all[va_idx]
    tz1, tz2, tz2b, tz3 = tz1_all, tz2_all, tz2b_all, tz3_all

    # Tiny window around global best (no standardization), step 0.001
    base_w1, base_w2, base_w2b, base_w3 = w_best_global
    w1_min, w1_max = max(0.0, base_w1 - 0.01), min(1.0, base_w1 + 0.01)
    w2_min, w2_max = max(0.0, base_w2 - 0.01), min(1.0, base_w2 + 0.01)
    w2b_min, w2b_max = max(0.0, base_w2b - 0.01), min(1.0, base_w2b + 0.01)
    step = 0.001
    w1_grid = np.arange(w1_min, w1_max + 1e-12, step)
    w2_grid = np.arange(w2_min, w2_max + 1e-12, step)
    w2b_grid = np.arange(w2b_min, w2b_max + 1e-12, step)

    best_auc_k, best_w_k = -1.0, w_best_global
    y_tr = y[tr_idx]
    tried = 0
    for w1 in w1_grid:
        for w2 in w2_grid:
            for w2b in w2b_grid:
                w3 = 1.0 - (w1 + w2 + w2b)
                if w3 < 0.0 or w3 > 1.0:
                    continue
                z_tr_combo = w1*z1_tr + w2*z2_tr + w2b*z2b_tr + w3*z3_tr
                auc_k = roc_auc_score(y_tr, z_tr_combo)
                tried += 1
                if auc_k > best_auc_k:
                    best_auc_k = auc_k; best_w_k = (float(w1), float(w2), float(w2b), float(w3))

    w1k, w2k, w2bk, w3k = best_w_k
    z_va_combo = w1k*z1_va + w2k*z2_va + w2bk*z2b_va + w3k*z3_va
    z_te_combo = w1k*tz1 + w2k*tz2 + w2bk*tz2b + w3k*tz3
    oof_blend[va_idx] = z_va_combo
    test_fold_logits.append(z_te_combo.astype(np.float64))
    print(f'Fold {k+1}/5 | tried={tried} | best fold-train AUC(z)={best_auc_k:.5f} | weights={best_w_k} | time {time.time()-t0:.1f}s')
    sys.stdout.flush()

# Evaluate OOF
oof_auc_z = roc_auc_score(y, oof_blend)
oof_auc_p = roc_auc_score(y, sigmoid(oof_blend))
print(f'Per-fold 4-way (no z-std) OOF AUC(z)={oof_auc_z:.5f} | AUC(prob)={oof_auc_p:.5f} | total {time.time()-t_start:.1f}s')

# Average test logits across folds
test_mean_z = np.mean(test_fold_logits, axis=0)
pt = sigmoid(test_mean_z).astype(np.float32)
sub = pd.DataFrame({id_col: test[id_col].values, target_col: pt})
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv (per-fold 4-way no-zstd); head:')
print(sub.head())

Loading cached OOF/test predictions (LR, Dense v1, Dense v2, Meta)...
Per-fold search around global best weights: (0.3381, 0.266786, 0.163514, 0.2316)


Fold 1/5 | tried=9261 | best fold-train AUC(z)=0.69213 | weights=(0.3451, 0.27578600000000003, 0.160514, 0.2185999999999999) | time 16.9s


Fold 2/5 | tried=9261 | best fold-train AUC(z)=0.69554 | weights=(0.3451, 0.27578600000000003, 0.159514, 0.21960000000000002) | time 17.1s


Fold 3/5 | tried=9261 | best fold-train AUC(z)=0.68520 | weights=(0.3441, 0.263786, 0.173514, 0.2185999999999999) | time 17.3s


Fold 4/5 | tried=9261 | best fold-train AUC(z)=0.70448 | weights=(0.3291, 0.27478600000000003, 0.159514, 0.23659999999999992) | time 16.9s


Fold 5/5 | tried=9261 | best fold-train AUC(z)=0.68542 | weights=(0.3341, 0.265786, 0.167514, 0.23259999999999992) | time 17.0s


Per-fold 4-way (no z-std) OOF AUC(z)=0.69225 | AUC(prob)=0.69225 | total 85.3s
Saved submission.csv (per-fold 4-way no-zstd); head:
  request_id  requester_received_pizza
0  t3_1aw5zf                  0.331706
1   t3_roiuw                  0.216851
2   t3_mjnbq                  0.213342
3   t3_t8wd1                  0.208793
4  t3_1m4zxu                  0.214304
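
Because every fold applies its weights to the same unstandardized test logits, averaging the five test blends is the same as blending once with the mean of the per-fold weights. A short sketch computing that effective weight vector from the fold weights printed above (rounded to six decimals here):

```python
# Per-fold best weights (LR, Dense v1, Dense v2, Meta), copied from the log above
fold_weights = [
    (0.3451, 0.275786, 0.160514, 0.2186),
    (0.3451, 0.275786, 0.159514, 0.2196),
    (0.3441, 0.263786, 0.173514, 0.2186),
    (0.3291, 0.274786, 0.159514, 0.2366),
    (0.3341, 0.265786, 0.167514, 0.2326),
]

n_folds = len(fold_weights)
# Column-wise mean: the single weight vector the averaged test logits correspond to
effective = tuple(sum(col) / n_folds for col in zip(*fold_weights))

# Largest per-model deviation of any fold from the global optimum used as center
w_global = (0.3381, 0.266786, 0.163514, 0.2316)
max_shift = max(abs(w - g) for fw in fold_weights for w, g in zip(fw, w_global))

print([round(w, 6) for w in effective])
print(round(max_shift, 4))
```

Fold 3's Dense v2 weight (0.173514) sits on the edge of the ±0.01 search window, so a wider window might move it further.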


In [48]:
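# Tiny 4-way logit temperature scaling + narrow weight refinement around best ref weights.
# The grid contains the reference config (all scales 1.0, w_ref), so the best OOF AUC cannot fall below it;
# the submission is therefore written unconditionally, with no threshold check.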
import numpy as np, pandas as pd, time, sys
from sklearn.metrics import roc_auc_score

def to_logit(p, eps=1e-6):
    p = np.clip(p.astype(np.float64), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print('Loading cached OOF/test predictions (LR, Dense v1, Dense v2, Meta)...')
o1 = np.load('oof_lr_pivot.npy')
o2 = np.load('oof_xgb_dense.npy')
o2b = np.load('oof_xgb_dense_v2.npy')
o3 = np.load('oof_xgb_meta.npy')
t1 = np.load('test_lr_pivot.npy')
t2 = np.load('test_xgb_dense.npy')
t2b = np.load('test_xgb_dense_v2.npy')
t3 = np.load('test_xgb_meta.npy')
y = train['requester_received_pizza'].astype(int).values

# Base logits
z1b, z2b_, z2bb, z3b = to_logit(o1), to_logit(o2), to_logit(o2b), to_logit(o3)
tz1b, tz2b_, tz2bb, tz3b = to_logit(t1), to_logit(t2), to_logit(t2b), to_logit(t3)

# Reference best 4-way weights from split-dense refine
w_ref = (0.3381, 0.266786, 0.163514, 0.2316)  # (LR, Dense v1, Dense v2, Meta)

scales = [0.95, 1.00, 1.05]
w_window = 0.01
w_step = 0.001

best_auc, best_cfg = -1.0, None
t0 = time.time(); tried = 0
for s1 in scales:
    for s2 in scales:
        for s2v in scales:
            for s3 in scales:
                z1 = s1 * z1b; z2 = s2 * z2b_; z2v2 = s2v * z2bb; z3 = s3 * z3b
                w1c, w2c, w2vc, w3c = w_ref
                # Search over w1, w2, w2v in a tiny window; w3 = 1 - (w1+w2+w2v)
                w1_min, w1_max = max(0.0, w1c - w_window), min(1.0, w1c + w_window)
                w2_min, w2_max = max(0.0, w2c - w_window), min(1.0, w2c + w_window)
                w2v_min, w2v_max = max(0.0, w2vc - w_window), min(1.0, w2vc + w_window)
                w1_grid = np.arange(w1_min, w1_max + 1e-12, w_step)
                w2_grid = np.arange(w2_min, w2_max + 1e-12, w_step)
                w2v_grid= np.arange(w2v_min, w2v_max + 1e-12, w_step)
                local_best_auc, local_best_w = -1.0, None
                for w1 in w1_grid:
                    for w2 in w2_grid:
                        for w2v_w in w2v_grid:
                            w3 = 1.0 - (w1 + w2 + w2v_w)
                            if w3 < 0.0 or w3 > 1.0:
                                continue
                            z = w1*z1 + w2*z2 + w2v_w*z2v2 + w3*z3
                            auc = roc_auc_score(y, z)
                            tried += 1
                            if auc > local_best_auc:
                                local_best_auc = auc
                                local_best_w = (float(w1), float(w2), float(w2v_w), float(w3))
                if local_best_auc > best_auc:
                    best_auc = local_best_auc
                    best_cfg = ((s1, s2, s2v, s3), local_best_w)
print(f'Tiny 4-way temp+weights tried ~{tried} combos | best OOF AUC(z)={best_auc:.5f} | scales={best_cfg[0]} | weights={best_cfg[1]} | {time.time()-t0:.1f}s')

# Build submission with best config
s1, s2, s2v, s3 = best_cfg[0]
bw1, bw2, bw2v, bw3 = best_cfg[1]
zt = bw1*(s1*tz1b) + bw2*(s2*tz2b_) + bw2v*(s2v*tz2bb) + bw3*(s3*tz3b)
pt = sigmoid(zt).astype(np.float32)
sub = pd.DataFrame({id_col: test[id_col].values, target_col: pt})
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv (4-way tiny temp-scaled logit blend); head:')
print(sub.head())

Loading cached OOF/test predictions (LR, Dense v1, Dense v2, Meta)...


KeyboardInterrupt: 
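
The cell above was interrupted before it printed a result. Counting how many `roc_auc_score` calls the nested loops request helps explain why. The sketch below rebuilds the same grids (three scales per model, weights within ±`w_window` of `w_ref`) and counts the combinations without scoring any of them. `count_grid` is a hypothetical helper for illustration, not part of the notebook.

```python
import numpy as np

def count_grid(w_ref, scales, w_window, w_step):
    """Count the scale/weight combinations the nested search would score (hypothetical helper)."""
    grids = []
    for wc in w_ref[:3]:  # the fourth weight is implied by 1 - (sum of the first three)
        lo, hi = max(0.0, wc - w_window), min(1.0, wc + w_window)
        grids.append(np.arange(lo, hi + 1e-12, w_step))
    # Enumerate every (w1, w2, w2v) triple and keep those whose implied w3 lies in [0, 1]
    w1, w2, w2v = np.meshgrid(*grids, indexing='ij')
    w3 = 1.0 - (w1 + w2 + w2v)
    valid = int(((w3 >= 0.0) & (w3 <= 1.0)).sum())
    # Four independent scale loops multiply the weight grid
    return valid * len(scales) ** 4

w_ref = (0.3381, 0.266786, 0.163514, 0.2316)
scales = [0.95, 1.00, 1.05]
n_wide = count_grid(w_ref, scales, w_window=0.01, w_step=0.001)
n_light = count_grid(w_ref, scales, w_window=0.006, w_step=0.002)
print(f'window=0.01  step=0.001 -> {n_wide} AUC evaluations')
print(f'window=0.006 step=0.002 -> {n_light} AUC evaluations')
```

The wide grid asks for 27 times as many evaluations as the lighter variant run in In [54], which logs 27783 combos in 53.3s. Assuming a similar per-evaluation cost, the interrupted cell would have needed roughly 24 minutes.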

In [49]:
# Ablation: LR pivot without subreddit TF-IDF; re-blend with Dense v1 + Meta in logit space
import time, gc, sys, numpy as np, pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from scipy.sparse import hstack

assert 'combine_raw_text' in globals() and 'clean_text_series' in globals() and 'build_meta_minimal' in globals(), 'Run pivot cell 7 first'
assert 'word_params' in globals() and 'char_params' in globals(), 'Run pivot cell 7 first (params)'

def to_logit(p, eps=1e-6):
    p = np.clip(p.astype(np.float64), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y = train[target_col].astype(int).values
n_splits = 5
cv = list(StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42).split(train, y))
print(f'Prepared {n_splits}-fold StratifiedKFold CV (shuffled).')

# Precompute test sources via same preprocessing as pivot
clean_te_text = clean_text_series(combine_raw_text(test))
meta_te_full = build_meta_minimal(test)

# Containers for LR (no-subreddit) OOF/test
oof_lr_ns = np.zeros(len(train), dtype=np.float32)
test_lr_ns_folds = []
LR_C = 0.5

for fold, (tr_idx, va_idx) in enumerate(cv):
    t0 = time.time()
    print(f'Fold {fold+1}/{n_splits} - train {len(tr_idx)} va {len(va_idx)}')
    sys.stdout.flush()

    tr_text = clean_text_series(combine_raw_text(train.loc[tr_idx]))
    va_text = clean_text_series(combine_raw_text(train.loc[va_idx]))

    # Fit only word + char TF-IDF (no subreddit vectorizer)
    tfidf_w = TfidfVectorizer(**word_params)
    Xw_tr = tfidf_w.fit_transform(tr_text)
    Xw_va = tfidf_w.transform(va_text)
    Xw_te = tfidf_w.transform(clean_te_text)

    tfidf_c = TfidfVectorizer(**char_params)
    Xc_tr = tfidf_c.fit_transform(tr_text)
    Xc_va = tfidf_c.transform(va_text)
    Xc_te = tfidf_c.transform(clean_te_text)

    # Minimal meta
    meta_tr = build_meta_minimal(train.loc[tr_idx])
    meta_va = build_meta_minimal(train.loc[va_idx])
    scaler = StandardScaler(with_mean=False)
    Xm_tr = scaler.fit_transform(meta_tr)
    Xm_va = scaler.transform(meta_va)
    Xm_te = scaler.transform(meta_te_full)

    # Stack
    X_tr = hstack([Xw_tr, Xc_tr, Xm_tr], format='csr')
    X_va = hstack([Xw_va, Xc_va, Xm_va], format='csr')
    X_te = hstack([Xw_te, Xc_te, Xm_te], format='csr')

    lr = LogisticRegression(solver='saga', penalty='l2', C=LR_C, class_weight=None, max_iter=6000, n_jobs=-1, random_state=42, verbose=0)
    lr.fit(X_tr, y[tr_idx])
    va_lr = lr.predict_proba(X_va)[:,1].astype(np.float32)
    oof_lr_ns[va_idx] = va_lr
    te_lr = lr.predict_proba(X_te)[:,1].astype(np.float32)
    test_lr_ns_folds.append(te_lr)

    auc = roc_auc_score(y[va_idx], va_lr)
    print(f'Fold {fold+1} LR(no-subs) AUC: {auc:.5f} | time {time.time()-t0:.1f}s')
    sys.stdout.flush()

    del Xw_tr, Xw_va, Xw_te, Xc_tr, Xc_va, Xc_te, Xm_tr, Xm_va, Xm_te, X_tr, X_va, X_te, lr, tfidf_w, tfidf_c, scaler
    gc.collect()

auc_ns = roc_auc_score(y, oof_lr_ns)
print(f'LR(no-subreddit) OOF AUC: {auc_ns:.5f}')
test_lr_ns = np.mean(test_lr_ns_folds, axis=0).astype(np.float32)
np.save('oof_lr_nosub.npy', oof_lr_ns)
np.save('test_lr_nosub.npy', test_lr_ns)

# Re-blend with Dense v1 and Meta in logit space
print('Loading cached Dense v1 and Meta OOF/test...')
o_dense = np.load('oof_xgb_dense.npy')
o_meta = np.load('oof_xgb_meta.npy')
t_dense = np.load('test_xgb_dense.npy')
t_meta = np.load('test_xgb_meta.npy')

# Convert to logits
z_lr = to_logit(oof_lr_ns); z_d = to_logit(o_dense); z_m = to_logit(o_meta)
tz_lr = to_logit(test_lr_ns); tz_d = to_logit(t_dense); tz_m = to_logit(t_meta)

def eval_auc_logit(w1, w2):
    w3 = 1.0 - w1 - w2
    if w3 < 0 or w3 > 1: return -1.0
    z = w1*z_lr + w2*z_d + w3*z_m
    return roc_auc_score(y, z)

# Start around prior best (0.3504, 0.4192) but allow adaptation
w1_c, w2_c = 0.3504, 0.4192
best_auc, best_w = -1.0, (w1_c, w2_c, 1.0 - w1_c - w2_c)

def grid_search_logit(w1_center, w2_center, window, step):
    global best_auc, best_w
    w1_min, w1_max = max(0.0, w1_center - window), min(1.0, w1_center + window)
    w2_min, w2_max = max(0.0, w2_center - window), min(1.0, w2_center + window)
    w1_grid = np.arange(w1_min, w1_max + 1e-12, step)
    w2_grid = np.arange(w2_min, w2_max + 1e-12, step)
    t0s = time.time(); cnt = 0
    for w1 in w1_grid:
        for w2 in w2_grid:
            if w1 + w2 > 1.0: continue
            auc = eval_auc_logit(w1, w2)
            cnt += 1
            if auc > best_auc:
                best_auc = auc; best_w = (float(w1), float(w2), float(1.0 - w1 - w2))
    print(f'Logit grid window={window} step={step} tried {cnt} | best={best_w} OOF AUC(z): {best_auc:.5f} | {time.time()-t0s:.1f}s')

grid_search_logit(w1_c, w2_c, window=0.06, step=0.002)
grid_search_logit(best_w[0], best_w[1], window=0.02, step=0.001)
grid_search_logit(best_w[0], best_w[1], window=0.008, step=0.0005)
print(f'Final logit weights (LR_noSubs, Dense, Meta)={best_w} | OOF AUC(z): {best_auc:.5f}')

# Build submission from LR(no-subs) + Dense + Meta
w1b, w2b, w3b = best_w
pt = sigmoid(w1b*tz_lr + w2b*tz_d + w3b*tz_m).astype(np.float32)
sub = pd.DataFrame({id_col: test[id_col].values, target_col: pt})
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv (LR_noSubs + Dense + Meta logit blend); head:')
print(sub.head())

Prepared 5-fold StratifiedKFold CV (shuffled).
Fold 1/5 - train 2302 va 576


Fold 1 LR(no-subs) AUC: 0.67816 | time 20.2s


Fold 2/5 - train 2302 va 576


Fold 2 LR(no-subs) AUC: 0.66422 | time 18.7s


Fold 3/5 - train 2302 va 576


Fold 3 LR(no-subs) AUC: 0.68971 | time 19.5s


Fold 4/5 - train 2303 va 575


Fold 4 LR(no-subs) AUC: 0.64905 | time 18.0s


Fold 5/5 - train 2303 va 575


Fold 5 LR(no-subs) AUC: 0.70163 | time 20.2s


LR(no-subreddit) OOF AUC: 0.67573
Loading cached Dense v1 and Meta OOF/test...


Logit grid window=0.06 step=0.002 tried 3721 | best=(0.3044, 0.4372000000000001, 0.2583999999999999) OOF AUC(z): 0.69127 | 7.4s


Logit grid window=0.02 step=0.001 tried 1681 | best=(0.2994, 0.4282000000000001, 0.2723999999999999) OOF AUC(z): 0.69128 | 3.2s


Logit grid window=0.008 step=0.0005 tried 1089 | best=(0.2999, 0.4277000000000001, 0.27239999999999986) OOF AUC(z): 0.69128 | 2.1s
Final logit weights (LR_noSubs, Dense, Meta)=(0.2999, 0.4277000000000001, 0.27239999999999986) | OOF AUC(z): 0.69128
Saved submission.csv (LR_noSubs + Dense + Meta logit blend); head:
  request_id  requester_received_pizza
0  t3_1aw5zf                  0.336776
1   t3_roiuw                  0.215132
2   t3_mjnbq                  0.205367
3   t3_t8wd1                  0.210039
4  t3_1m4zxu                  0.218076


In [52]:
# Per-fold 3-way logit weight optimization (LR, Dense v1, Meta) without standardization; average test logits across folds
import numpy as np, pandas as pd, time, sys
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def to_logit(p, eps=1e-6):
    p = np.clip(p.astype(np.float64), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print('Loading cached OOF/test predictions (LR, Dense v1, Meta)...')
o1 = np.load('oof_lr_pivot.npy')
o2 = np.load('oof_xgb_dense.npy')
o3 = np.load('oof_xgb_meta.npy')
t1 = np.load('test_lr_pivot.npy')
t2 = np.load('test_xgb_dense.npy')
t3 = np.load('test_xgb_meta.npy')
y = train['requester_received_pizza'].astype(int).values
n = len(y)

# Convert to logits once
z1_all, z2_all, z3_all = to_logit(o1), to_logit(o2), to_logit(o3)
tz1_all, tz2_all, tz3_all = to_logit(t1), to_logit(t2), to_logit(t3)

# Start around prior best 3-way logit weights (from Cell 23/24):
w_center = (0.3501, 0.4198, 0.2301)  # (LR, Dense v1, Meta)
window = 0.015
step = 0.001
print('Per-fold search around center weights:', w_center, 'window=', window, 'step=', step)

cv = list(StratifiedKFold(n_splits=5, shuffle=True, random_state=42).split(np.arange(n), y))
oof_blend = np.zeros(n, dtype=np.float64)
test_fold_logits = []

t_start = time.time()
for k, (tr_idx, va_idx) in enumerate(cv):
    t0 = time.time()
    z1_tr, z2_tr, z3_tr = z1_all[tr_idx], z2_all[tr_idx], z3_all[tr_idx]
    z1_va, z2_va, z3_va = z1_all[va_idx], z2_all[va_idx], z3_all[va_idx]
    tz1, tz2, tz3 = tz1_all, tz2_all, tz3_all

    base_w1, base_w2, base_w3 = w_center
    w1_min, w1_max = max(0.0, base_w1 - window), min(1.0, base_w1 + window)
    w2_min, w2_max = max(0.0, base_w2 - window), min(1.0, base_w2 + window)
    w1_grid = np.arange(w1_min, w1_max + 1e-12, step)
    w2_grid = np.arange(w2_min, w2_max + 1e-12, step)

    best_auc_k, best_w_k = -1.0, w_center
    y_tr = y[tr_idx]
    tried = 0
    for w1 in w1_grid:
        for w2 in w2_grid:
            w3 = 1.0 - (w1 + w2)
            if w3 < 0.0 or w3 > 1.0:
                continue
            z_tr_combo = w1*z1_tr + w2*z2_tr + w3*z3_tr
            auc_k = roc_auc_score(y_tr, z_tr_combo)
            tried += 1
            if auc_k > best_auc_k:
                best_auc_k = auc_k; best_w_k = (float(w1), float(w2), float(w3))

    w1k, w2k, w3k = best_w_k
    z_va_combo = w1k*z1_va + w2k*z2_va + w3k*z3_va
    z_te_combo = w1k*tz1 + w2k*tz2 + w3k*tz3
    oof_blend[va_idx] = z_va_combo
    test_fold_logits.append(z_te_combo.astype(np.float64))
    print(f'Fold {k+1}/5 | tried={tried} | best fold-train AUC(z)={best_auc_k:.5f} | weights={best_w_k} | time {time.time()-t0:.1f}s')
    sys.stdout.flush()

# Evaluate OOF AUC
oof_auc_z = roc_auc_score(y, oof_blend)
oof_auc_p = roc_auc_score(y, sigmoid(oof_blend))
print(f'Per-fold 3-way (no z-std) OOF AUC(z)={oof_auc_z:.5f} | AUC(prob)={oof_auc_p:.5f} | total {time.time()-t_start:.1f}s')

# Average test logits across folds, convert to prob, and write submission
test_mean_z = np.mean(test_fold_logits, axis=0)
pt = sigmoid(test_mean_z).astype(np.float32)
sub = pd.DataFrame({id_col: test[id_col].values, target_col: pt})
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv (per-fold 3-way no-zstd); head:')
print(sub.head())

Loading cached OOF/test predictions (LR, Dense v1, Meta)...
Per-fold search around center weights: (0.3501, 0.4198, 0.2301) window= 0.015 step= 0.001


Fold 1/5 | tried=961 | best fold-train AUC(z)=0.69180 | weights=(0.3501, 0.4198, 0.23009999999999997) | time 1.7s


Fold 2/5 | tried=961 | best fold-train AUC(z)=0.69551 | weights=(0.3501, 0.4198, 0.23009999999999997) | time 1.7s


Fold 3/5 | tried=961 | best fold-train AUC(z)=0.68416 | weights=(0.36510000000000004, 0.4068, 0.22809999999999997) | time 1.8s


Fold 4/5 | tried=961 | best fold-train AUC(z)=0.70382 | weights=(0.3411, 0.4198, 0.23909999999999998) | time 1.7s


Fold 5/5 | tried=961 | best fold-train AUC(z)=0.68526 | weights=(0.3351, 0.4058, 0.2591) | time 1.7s


Per-fold 3-way (no z-std) OOF AUC(z)=0.69182 | AUC(prob)=0.69182 | total 8.6s
Saved submission.csv (per-fold 3-way no-zstd); head:
  request_id  requester_received_pizza
0  t3_1aw5zf                  0.328622
1   t3_roiuw                  0.212896
2   t3_mjnbq                  0.214778
3   t3_t8wd1                  0.206003
4  t3_1m4zxu                  0.217101
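
The per-fold search above picks blend weights on each fold's training rows only and then applies them to that fold's held-out rows, so the blended OOF score is not tuned on the rows it is evaluated on. Since the same loop is repeated in several cells, a small helper can make the pattern easier to reuse. The following is a minimal sketch under stated assumptions: `per_fold_blend` is a hypothetical name, the logits are synthetic (made up), and the grid is coarser than the notebook's to keep it fast.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def per_fold_blend(Z, y, w_center, window=0.015, step=0.005, seed=42):
    """Pick 3-way weights on each fold's training rows, apply them to its held-out rows.

    Z is an (n, 3) array of OOF logits. Hypothetical helper for illustration.
    """
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    oof = np.zeros(len(y), dtype=np.float64)
    fold_weights = []
    w1c, w2c, _ = w_center
    w1_grid = np.arange(max(0.0, w1c - window), min(1.0, w1c + window) + 1e-12, step)
    w2_grid = np.arange(max(0.0, w2c - window), min(1.0, w2c + window) + 1e-12, step)
    for tr_idx, va_idx in cv.split(Z, y):
        best_auc, best_w = -1.0, tuple(w_center)
        for w1 in w1_grid:
            for w2 in w2_grid:
                w3 = 1.0 - (w1 + w2)
                if w3 < 0.0 or w3 > 1.0:
                    continue
                auc = roc_auc_score(y[tr_idx], Z[tr_idx] @ np.array([w1, w2, w3]))
                if auc > best_auc:
                    best_auc, best_w = auc, (float(w1), float(w2), float(w3))
        # Weights were chosen without looking at the held-out labels of this fold
        oof[va_idx] = Z[va_idx] @ np.array(best_w)
        fold_weights.append(best_w)
    return oof, fold_weights

# Synthetic stand-in data (made up): three noisy views of one latent score
rng = np.random.default_rng(0)
n = 1000
latent = rng.normal(size=n)
y = (latent + rng.normal(scale=2.0, size=n) > 0.8).astype(int)
Z = np.column_stack([latent + rng.normal(scale=s, size=n) for s in (1.0, 1.2, 1.5)])

oof, fold_weights = per_fold_blend(Z, y, w_center=(0.35, 0.42, 0.23))
print(f'per-fold blended OOF AUC: {roc_auc_score(y, oof):.5f}')
for k, w in enumerate(fold_weights, 1):
    print(f'fold {k}: weights={w}')
```

With the notebook's cached arrays, `Z` would be `np.column_stack([z1_all, z2_all, z3_all])` and `w_center` the same center as in the cell above.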


In [51]:
# Bagging two blends: global 4-way ref logit blend + per-fold 3-way logit blend; optimize alpha on OOF and average test logits
import numpy as np, pandas as pd, time, sys
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

def to_logit(p, eps=1e-6):
    p = np.clip(p.astype(np.float64), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print('Loading cached base OOF/test preds...')
o1 = np.load('oof_lr_pivot.npy'); t1 = np.load('test_lr_pivot.npy')
o2 = np.load('oof_xgb_dense.npy'); t2 = np.load('test_xgb_dense.npy')
o2b= np.load('oof_xgb_dense_v2.npy'); t2b= np.load('test_xgb_dense_v2.npy')
o3 = np.load('oof_xgb_meta.npy'); t3 = np.load('test_xgb_meta.npy')
y = train['requester_received_pizza'].astype(int).values
n = len(y)

# Base logits
z1_all, z2_all, z2b_all, z3_all = to_logit(o1), to_logit(o2), to_logit(o2b), to_logit(o3)
tz1_all, tz2_all, tz2b_all, tz3_all = to_logit(t1), to_logit(t2), to_logit(t2b), to_logit(t3)

# 1) Global 4-way ref logit blend (from Cell 31 refine)
w4_ref = (0.3381, 0.266786, 0.163514, 0.2316)
z_ref4 = w4_ref[0]*z1_all + w4_ref[1]*z2_all + w4_ref[2]*z2b_all + w4_ref[3]*z3_all
tz_ref4 = w4_ref[0]*tz1_all + w4_ref[1]*tz2_all + w4_ref[2]*tz2b_all + w4_ref[3]*tz3_all
auc_ref4 = roc_auc_score(y, z_ref4)
print(f'Ref 4-way logit blend OOF AUC(z): {auc_ref4:.5f}')

# 2) Per-fold 3-way logit blend (LR, Dense v1, Meta) without standardization; rebuild OOF/test logits
cv = list(StratifiedKFold(n_splits=5, shuffle=True, random_state=42).split(np.arange(n), y))
w_center = (0.3501, 0.4198, 0.2301)  # center from prior best
window, step = 0.015, 0.001
z_pf3 = np.zeros(n, dtype=np.float64)
test_fold_logits = []
t0 = time.time()
for k, (tr_idx, va_idx) in enumerate(cv):
    z1_tr, z2_tr, z3_tr = z1_all[tr_idx], z2_all[tr_idx], z3_all[tr_idx]
    z1_va, z2_va, z3_va = z1_all[va_idx], z2_all[va_idx], z3_all[va_idx]
    tz1, tz2, tz3 = tz1_all, tz2_all, tz3_all
    w1c, w2c, _ = w_center
    w1_min, w1_max = max(0.0, w1c - window), min(1.0, w1c + window)
    w2_min, w2_max = max(0.0, w2c - window), min(1.0, w2c + window)
    w1_grid = np.arange(w1_min, w1_max + 1e-12, step)
    w2_grid = np.arange(w2_min, w2_max + 1e-12, step)
    best_auc_k, best_w_k = -1.0, w_center
    y_tr = y[tr_idx]
    for w1 in w1_grid:
        for w2 in w2_grid:
            w3 = 1.0 - (w1 + w2)
            if w3 < 0.0 or w3 > 1.0: continue
            z_tr_combo = w1*z1_tr + w2*z2_tr + w3*z3_tr
            auc_k = roc_auc_score(y_tr, z_tr_combo)
            if auc_k > best_auc_k:
                best_auc_k, best_w_k = auc_k, (float(w1), float(w2), float(w3))
    w1k, w2k, w3k = best_w_k
    z_pf3[va_idx] = w1k*z1_va + w2k*z2_va + w3k*z3_va
    test_fold_logits.append((w1k*tz1 + w2k*tz2 + w3k*tz3).astype(np.float64))
    print(f'Fold {k+1}/5 per-fold 3-way best weights={best_w_k} | time fold')
auc_pf3 = roc_auc_score(y, z_pf3)
print(f'Per-fold 3-way logit blend OOF AUC(z): {auc_pf3:.5f} | time {time.time()-t0:.1f}s')
tz_pf3 = np.mean(test_fold_logits, axis=0)

# 3) Bag the two blends in logit space with alpha tuned on OOF
best_auc, best_alpha = -1.0, 0.0
for a in np.linspace(0.0, 1.0, 101):
    z_mix = (1.0 - a)*z_ref4 + a*z_pf3
    auc = roc_auc_score(y, z_mix)
    if auc > best_auc:
        best_auc, best_alpha = auc, float(a)
print(f'Alpha search best alpha={best_alpha:.3f} | OOF AUC(z)={best_auc:.5f}')

# Build submission from mixed test logits
tz_mix = (1.0 - best_alpha)*tz_ref4 + best_alpha*tz_pf3
pt = sigmoid(tz_mix).astype(np.float32)
sub = pd.DataFrame({id_col: test[id_col].values, target_col: pt})
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv (bagged 4-way ref + per-fold 3-way, logit space); head:')
print(sub.head())

Loading cached base OOF/test preds...
Ref 4-way logit blend OOF AUC(z): 0.69242


Fold 1/5 per-fold 3-way best weights=(0.3501, 0.4198, 0.23009999999999997) | time fold


Fold 2/5 per-fold 3-way best weights=(0.3501, 0.4198, 0.23009999999999997) | time fold


Fold 3/5 per-fold 3-way best weights=(0.36510000000000004, 0.4068, 0.22809999999999997) | time fold


Fold 4/5 per-fold 3-way best weights=(0.3411, 0.4198, 0.23909999999999998) | time fold


Fold 5/5 per-fold 3-way best weights=(0.3351, 0.4058, 0.2591) | time fold
Per-fold 3-way logit blend OOF AUC(z): 0.69182 | time 8.5s


Alpha search best alpha=0.000 | OOF AUC(z)=0.69242
Saved submission.csv (bagged 4-way ref + per-fold 3-way, logit space); head:
  request_id  requester_received_pizza
0  t3_1aw5zf                  0.331806
1   t3_roiuw                  0.217080
2   t3_mjnbq                  0.213391
3   t3_t8wd1                  0.209035
4  t3_1m4zxu                  0.215080


In [53]:
# Bag 4-way global-ref and per-fold 4-way (no z-std) blends in logit space; optimize alpha on OOF; write submission
import numpy as np, pandas as pd, time, sys
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

def to_logit(p, eps=1e-6):
    p = np.clip(p.astype(np.float64), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print('Loading cached OOF/test preds for 4-way (LR, Dense v1, Dense v2, Meta)...')
o1 = np.load('oof_lr_pivot.npy'); t1 = np.load('test_lr_pivot.npy')
o2 = np.load('oof_xgb_dense.npy'); t2 = np.load('test_xgb_dense.npy')
o2b= np.load('oof_xgb_dense_v2.npy'); t2b= np.load('test_xgb_dense_v2.npy')
o3 = np.load('oof_xgb_meta.npy'); t3 = np.load('test_xgb_meta.npy')
y = train['requester_received_pizza'].astype(int).values
n = len(y)

# Logits
z1_all, z2_all, z2b_all, z3_all = to_logit(o1), to_logit(o2), to_logit(o2b), to_logit(o3)
tz1_all, tz2_all, tz2b_all, tz3_all = to_logit(t1), to_logit(t2), to_logit(t2b), to_logit(t3)

# 4-way global reference blend (best from Cell 31 refine)
w_ref = (0.3381, 0.266786, 0.163514, 0.2316)
z_ref4 = w_ref[0]*z1_all + w_ref[1]*z2_all + w_ref[2]*z2b_all + w_ref[3]*z3_all
tz_ref4 = w_ref[0]*tz1_all + w_ref[1]*tz2_all + w_ref[2]*tz2b_all + w_ref[3]*tz3_all
auc_ref4 = roc_auc_score(y, z_ref4)
print(f'Ref 4-way OOF AUC(z): {auc_ref4:.5f}')

# Rebuild fast per-fold 4-way (no z-std) with a tiny window (±0.01, step=0.002) for speed
cv = list(StratifiedKFold(n_splits=5, shuffle=True, random_state=42).split(np.arange(n), y))
base_w = w_ref
w1c, w2c, w2bc, w3c = base_w
w1_grid = np.arange(max(0.0, w1c-0.01), min(1.0, w1c+0.01)+1e-12, 0.002)
w2_grid = np.arange(max(0.0, w2c-0.01), min(1.0, w2c+0.01)+1e-12, 0.002)
w2b_grid= np.arange(max(0.0, w2bc-0.01),min(1.0, w2bc+0.01)+1e-12, 0.002)
z_pf4 = np.zeros(n, dtype=np.float64)
test_fold_logits = []
t0 = time.time()
for k, (tr_idx, va_idx) in enumerate(cv):
    z1_tr, z2_tr, z2b_tr, z3_tr = z1_all[tr_idx], z2_all[tr_idx], z2b_all[tr_idx], z3_all[tr_idx]
    z1_va, z2_va, z2b_va, z3_va = z1_all[va_idx], z2_all[va_idx], z2b_all[va_idx], z3_all[va_idx]
    y_tr = y[tr_idx]
    best_auc_k, best_w_k = -1.0, base_w
    for w1 in w1_grid:
        for w2 in w2_grid:
            for w2b in w2b_grid:
                w3 = 1.0 - (w1 + w2 + w2b)
                if w3 < 0.0 or w3 > 1.0: continue
                z_tr = w1*z1_tr + w2*z2_tr + w2b*z2b_tr + w3*z3_tr
                auc_k = roc_auc_score(y_tr, z_tr)
                if auc_k > best_auc_k:
                    best_auc_k = auc_k; best_w_k = (float(w1), float(w2), float(w2b), float(w3))
    w1k, w2k, w2bk, w3k = best_w_k
    z_pf4[va_idx] = w1k*z1_va + w2k*z2_va + w2bk*z2b_va + w3k*z3_va
    tz_fold = w1k*tz1_all + w2k*tz2_all + w2bk*tz2b_all + w3k*tz3_all
    test_fold_logits.append(tz_fold.astype(np.float64))
    print(f'Fold {k+1}/5 per-fold 4-way best weights={best_w_k} | time {time.time()-t0:.1f}s'); sys.stdout.flush()
auc_pf4 = roc_auc_score(y, z_pf4)
print(f'Per-fold 4-way (no z-std) OOF AUC(z): {auc_pf4:.5f} | total {time.time()-t0:.1f}s')
tz_pf4 = np.mean(test_fold_logits, axis=0)

# Alpha bagging of z_ref4 and z_pf4
best_auc, best_alpha = -1.0, 0.0
for a in np.linspace(0.0, 1.0, 101):
    z_mix = (1.0 - a)*z_ref4 + a*z_pf4
    auc = roc_auc_score(y, z_mix)
    if auc > best_auc:
        best_auc, best_alpha = auc, float(a)
print(f'Alpha bag best alpha={best_alpha:.3f} | OOF AUC(z)={best_auc:.5f}')

# Build submission from mixed test logits
tz_mix = (1.0 - best_alpha)*tz_ref4 + best_alpha*tz_pf4
pt = sigmoid(tz_mix).astype(np.float32)
sub = pd.DataFrame({id_col: test[id_col].values, target_col: pt})
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv (bagged 4-way ref + per-fold 4-way, logit space); head:')
print(sub.head())

Loading cached OOF/test preds for 4-way (LR, Dense v1, Dense v2, Meta)...
Ref 4-way OOF AUC(z): 0.69242


Fold 1/5 per-fold 4-way best weights=(0.3441, 0.27678600000000003, 0.157514, 0.22160000000000002) | time 2.4s


Fold 2/5 per-fold 4-way best weights=(0.3441, 0.27678600000000003, 0.159514, 0.21960000000000002) | time 4.7s


Fold 3/5 per-fold 4-way best weights=(0.3281, 0.260786, 0.173514, 0.23760000000000003) | time 7.1s


Fold 4/5 per-fold 4-way best weights=(0.3281, 0.27678600000000003, 0.157514, 0.23760000000000003) | time 9.4s


Fold 5/5 per-fold 4-way best weights=(0.3281, 0.264786, 0.167514, 0.23960000000000004) | time 11.8s


Per-fold 4-way (no z-std) OOF AUC(z): 0.69221 | total 11.8s
Alpha bag best alpha=0.010 | OOF AUC(z)=0.69242
Saved submission.csv (bagged 4-way ref + per-fold 4-way, logit space); head:
  request_id  requester_received_pizza
0  t3_1aw5zf                  0.331808
1   t3_roiuw                  0.217081
2   t3_mjnbq                  0.213392
3   t3_t8wd1                  0.209032
4  t3_1m4zxu                  0.215074


In [54]:
# Lightweight tiny temperature scaling for 4-way logit blend with very narrow weight window
import numpy as np, pandas as pd, time
from sklearn.metrics import roc_auc_score

def to_logit(p, eps=1e-6):
    p = np.clip(p.astype(np.float64), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print('Loading cached OOF/test predictions (LR, Dense v1, Dense v2, Meta)...')
o1 = np.load('oof_lr_pivot.npy'); t1 = np.load('test_lr_pivot.npy')
o2 = np.load('oof_xgb_dense.npy'); t2 = np.load('test_xgb_dense.npy')
o2b= np.load('oof_xgb_dense_v2.npy'); t2b= np.load('test_xgb_dense_v2.npy')
o3 = np.load('oof_xgb_meta.npy'); t3 = np.load('test_xgb_meta.npy')
y = train['requester_received_pizza'].astype(int).values

z1b, z2b, z2bb, z3b = to_logit(o1), to_logit(o2), to_logit(o2b), to_logit(o3)
tz1b, tz2b, tz2bb, tz3b = to_logit(t1), to_logit(t2), to_logit(t2b), to_logit(t3)

# Base best 4-way weights
w_ref = (0.3381, 0.266786, 0.163514, 0.2316)

scales = [0.95, 1.00, 1.05]
w_window = 0.006  # tighter window
w_step = 0.002    # coarser step

best_auc, best_cfg = -1.0, None
t0 = time.time(); tried = 0
for s1 in scales:
    for s2 in scales:
        for s2v in scales:
            for s3 in scales:
                z1 = s1*z1b; z2 = s2*z2b; z2v2 = s2v*z2bb; z3 = s3*z3b
                w1c, w2c, w2vc, w3c = w_ref
                w1_min, w1_max = max(0.0, w1c - w_window), min(1.0, w1c + w_window)
                w2_min, w2_max = max(0.0, w2c - w_window), min(1.0, w2c + w_window)
                w2v_min, w2v_max = max(0.0, w2vc - w_window), min(1.0, w2vc + w_window)
                for w1 in np.arange(w1_min, w1_max + 1e-12, w_step):
                    for w2 in np.arange(w2_min, w2_max + 1e-12, w_step):
                        for w2v_w in np.arange(w2v_min, w2v_max + 1e-12, w_step):
                            w3 = 1.0 - (w1 + w2 + w2v_w)
                            if w3 < 0.0 or w3 > 1.0:
                                continue
                            z = w1*z1 + w2*z2 + w2v_w*z2v2 + w3*z3
                            auc = roc_auc_score(y, z)
                            tried += 1
                            if auc > best_auc:
                                best_auc = auc
                                best_cfg = ((s1, s2, s2v, s3), (float(w1), float(w2), float(w2v_w), float(w3)))
print(f'Light temp+weights tried {tried} combos | best OOF AUC(z)={best_auc:.5f} | scales={best_cfg[0]} | weights={best_cfg[1]} | {time.time()-t0:.1f}s')

# Build submission with best config
s1, s2, s2v, s3 = best_cfg[0]
bw1, bw2, bw2v, bw3 = best_cfg[1]
zt = bw1*(s1*tz1b) + bw2*(s2*tz2b) + bw2v*(s2v*tz2bb) + bw3*(s3*tz3b)
pt = sigmoid(zt).astype(np.float32)
sub = pd.DataFrame({id_col: test[id_col].values, target_col: pt})
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv (light 4-way temp-scaled logit blend); head:')
print(sub.head())

Loading cached OOF/test predictions (LR, Dense v1, Dense v2, Meta)...


Light temp+weights tried 27783 combos | best OOF AUC(z)=0.69242 | scales=(0.95, 0.95, 0.95, 0.95) | weights=(0.3381, 0.266786, 0.163514, 0.23160000000000003) | 53.3s
Saved submission.csv (light 4-way temp-scaled logit blend); head:
  request_id  requester_received_pizza
0  t3_1aw5zf                  0.339611
1   t3_roiuw                  0.228178
2   t3_mjnbq                  0.224545
3   t3_t8wd1                  0.220249
4  t3_1m4zxu                  0.226209
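
AUC depends only on the ordering of the scores. If every model's logits are multiplied by the same positive factor, the weighted blend is simply rescaled and its ranking is unchanged. That would be consistent with the result above, where the search returns scales=(0.95, 0.95, 0.95, 0.95) with the unchanged reference weights and the same 0.69242 as the reference blend: the first uniform-scale grid point ties with the unscaled blend and the strict `>` comparison keeps it (an interpretation of the log, not something it states). The sketch below checks the invariance on synthetic logits (made up).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 500
y = rng.integers(0, 2, size=n)
# Synthetic logits for four models (made up): label signal plus independent noise
Z = np.column_stack([0.8 * y + rng.normal(size=n) for _ in range(4)])
w = np.array([0.3381, 0.266786, 0.163514, 0.2316])  # reference weights from the cell above

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z_blend = Z @ w
auc_base = roc_auc_score(y, z_blend)
auc_scaled = roc_auc_score(y, (0.95 * Z) @ w)   # same scale applied to every model
auc_prob = roc_auc_score(y, sigmoid(z_blend))   # monotone transform of the blend
print(f'base={auc_base:.5f} | uniform 0.95 scale={auc_scaled:.5f} | prob={auc_prob:.5f}')
```

The same argument covers the identical AUC(z) and AUC(prob) values in the logs. Uniform scaling does shift the submitted probabilities, as the head of this submission shows relative to the earlier ones, but only a change in ranking can move the metric.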


In [55]:
# Blend-of-blends: 3-way logit blend of Ref4, PerFold4(no-zstd), Ref3-way; fine grid around (0.35,0.42,0.23)
import numpy as np, pandas as pd, time, sys
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

def to_logit(p, eps=1e-6):
    p = np.clip(p.astype(np.float64), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print('Loading cached OOF/test preds...')
o1 = np.load('oof_lr_pivot.npy'); t1 = np.load('test_lr_pivot.npy')
o2 = np.load('oof_xgb_dense.npy'); t2 = np.load('test_xgb_dense.npy')
o2b= np.load('oof_xgb_dense_v2.npy'); t2b= np.load('test_xgb_dense_v2.npy')
o3 = np.load('oof_xgb_meta.npy'); t3 = np.load('test_xgb_meta.npy')
y = train['requester_received_pizza'].astype(int).values
n = len(y)

# Base logits
z1_all, z2_all, z2b_all, z3_all = to_logit(o1), to_logit(o2), to_logit(o2b), to_logit(o3)
tz1_all, tz2_all, tz2b_all, tz3_all = to_logit(t1), to_logit(t2), to_logit(t2b), to_logit(t3)

# A) Ref4: global 4-way split-dense best
w4_ref = (0.3381, 0.266786, 0.163514, 0.2316)
z_ref4 = w4_ref[0]*z1_all + w4_ref[1]*z2_all + w4_ref[2]*z2b_all + w4_ref[3]*z3_all
tz_ref4 = w4_ref[0]*tz1_all + w4_ref[1]*tz2_all + w4_ref[2]*tz2b_all + w4_ref[3]*tz3_all
print(f'Ref4 OOF AUC(z): {roc_auc_score(y, z_ref4):.5f}')

# B) PerFold4 (no z-std) fast rebuild with tiny window (±0.01, step=0.002)
cv = list(StratifiedKFold(n_splits=5, shuffle=True, random_state=42).split(np.arange(n), y))
w1c, w2c, w2bc, w3c = w4_ref
w1_grid = np.arange(max(0.0, w1c-0.01), min(1.0, w1c+0.01)+1e-12, 0.002)
w2_grid = np.arange(max(0.0, w2c-0.01), min(1.0, w2c+0.01)+1e-12, 0.002)
w2b_grid= np.arange(max(0.0, w2bc-0.01),min(1.0, w2bc+0.01)+1e-12, 0.002)
z_pf4 = np.zeros(n, dtype=np.float64)
test_fold_logits = []
t0 = time.time()
for k, (tr_idx, va_idx) in enumerate(cv):
    z1_tr, z2_tr, z2b_tr, z3_tr = z1_all[tr_idx], z2_all[tr_idx], z2b_all[tr_idx], z3_all[tr_idx]
    z1_va, z2_va, z2b_va, z3_va = z1_all[va_idx], z2_all[va_idx], z2b_all[va_idx], z3_all[va_idx]
    y_tr = y[tr_idx]
    best_auc_k, best_w_k = -1.0, w4_ref
    for w1 in w1_grid:
        for w2 in w2_grid:
            for w2b in w2b_grid:
                w3 = 1.0 - (w1 + w2 + w2b)
                if w3 < 0.0 or w3 > 1.0: continue
                z_tr = w1*z1_tr + w2*z2_tr + w2b*z2b_tr + w3*z3_tr
                auc_k = roc_auc_score(y_tr, z_tr)
                if auc_k > best_auc_k:
                    best_auc_k = auc_k; best_w_k = (float(w1), float(w2), float(w2b), float(w3))
    w1k, w2k, w2bk, w3k = best_w_k
    z_pf4[va_idx] = w1k*z1_va + w2k*z2_va + w2bk*z2b_va + w3k*z3_va
    tz_fold = w1k*tz1_all + w2k*tz2_all + w2bk*tz2b_all + w3k*tz3_all
    test_fold_logits.append(tz_fold.astype(np.float64))
    print(f'PerFold4 Fold {k+1} best_w={best_w_k}')
auc_pf4 = roc_auc_score(y, z_pf4)
tz_pf4 = np.mean(test_fold_logits, axis=0)
print(f'PerFold4 OOF AUC(z): {auc_pf4:.5f} | {time.time()-t0:.1f}s')

# C) Ref3-way logit blend (LR, Dense v1, Meta) with best weights from cell 24
w3_ref = (0.3501, 0.4198, 0.2301)  # from ultra-fine logit best vicinity
z_ref3 = w3_ref[0]*z1_all + w3_ref[1]*z2_all + w3_ref[2]*z3_all
tz_ref3 = w3_ref[0]*tz1_all + w3_ref[1]*tz2_all + w3_ref[2]*tz3_all
print(f'Ref3 OOF AUC(z): {roc_auc_score(y, z_ref3):.5f}')

# 3-way blend of z_ref4, z_pf4, z_ref3 in logit space; fine grid around (0.35,0.42,0.23)
wA_c, wB_c = 0.35, 0.42  # C=1-wA-wB
window, step = 0.02, 0.001
wA_min, wA_max = max(0.0, wA_c-window), min(1.0, wA_c+window)
wB_min, wB_max = max(0.0, wB_c-window), min(1.0, wB_c+window)
wA_grid = np.arange(wA_min, wA_max + 1e-12, step)
wB_grid = np.arange(wB_min, wB_max + 1e-12, step)
best_auc, best_w = -1.0, (wA_c, wB_c, 1.0 - wA_c - wB_c)
t1s = time.time(); tried = 0
for wA in wA_grid:
    for wB in wB_grid:
        wC = 1.0 - wA - wB
        if wC < 0.0 or wC > 1.0: continue
        z_mix = wA*z_ref4 + wB*z_pf4 + wC*z_ref3
        auc = roc_auc_score(y, z_mix)
        tried += 1
        if auc > best_auc:
            best_auc, best_w = auc, (float(wA), float(wB), float(wC))
print(f'Blend-of-blends tried {tried} combos | best weights(Ref4,PerFold4,Ref3)={best_w} OOF AUC(z): {best_auc:.5f} | {time.time()-t1s:.1f}s')

# Build submission with best weights
wA, wB, wC = best_w
tz_mix = wA*tz_ref4 + wB*tz_pf4 + wC*tz_ref3
pt = sigmoid(tz_mix).astype(np.float32)
sub = pd.DataFrame({id_col: test[id_col].values, target_col: pt})
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv (blend-of-blends 3-way logit); head:')
print(sub.head())

Loading cached OOF/test preds...
Ref4 OOF AUC(z): 0.69242


PerFold4 Fold 1 best_w=(0.3441, 0.27678600000000003, 0.157514, 0.22160000000000002)


PerFold4 Fold 2 best_w=(0.3441, 0.27678600000000003, 0.159514, 0.21960000000000002)


PerFold4 Fold 3 best_w=(0.3281, 0.260786, 0.173514, 0.23760000000000003)


PerFold4 Fold 4 best_w=(0.3281, 0.27678600000000003, 0.157514, 0.23760000000000003)


PerFold4 Fold 5 best_w=(0.3281, 0.264786, 0.167514, 0.23960000000000004)
PerFold4 OOF AUC(z): 0.69221 | 11.6s
Ref3 OOF AUC(z): 0.69201


Blend-of-blends tried 1681 combos | best weights(Ref4,PerFold4,Ref3)=(0.366, 0.432, 0.202) OOF AUC(z): 0.69229 | 3.3s
Saved submission.csv (blend-of-blends 3-way logit); head:
  request_id  requester_received_pizza
0  t3_1aw5zf                  0.331250
1   t3_roiuw                  0.216240
2   t3_mjnbq                  0.213711
3   t3_t8wd1                  0.208273
4  t3_1m4zxu                  0.215067


In [57]:
# Meta-flag tweak (+throwaway, +military) only for dense XGB v1; retrain single-seed and re-run 4-way split-dense logit blend
import time, gc, sys, re, numpy as np, pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import roc_auc_score
from scipy.sparse import hstack

try:
    import xgboost as xgb
except Exception:
    import subprocess, sys as _sys
    # check=True surfaces a failed install here instead of at the import below
    subprocess.run([_sys.executable, '-m', 'pip', 'install', '-q', 'xgboost'], check=True)
    import xgboost as xgb

assert 'combine_raw_text' in globals() and 'clean_text_series' in globals() and 'build_subreddit_text' in globals(), 'Run pivot cell 7 first'
assert 'word_params' in globals() and 'char_params' in globals() and 'subs_params' in globals(), 'Run pivot cell 7 first'
assert 'build_meta_enhanced' in globals(), 'Run cell 10 first'

def build_meta_enhanced_plus(df: pd.DataFrame) -> pd.DataFrame:
    out = build_meta_enhanced(df).copy()
    # f_throwaway_user from username
    if 'requester_username' in df.columns:
        un = df['requester_username'].fillna('').astype(str)
        # Non-capturing group: a capturing group makes str.contains emit a match-group UserWarning
        out['f_throwaway_user'] = un.str.contains(r'(?:throwaway|temp|anon)', case=False, regex=True).astype(np.int8)
    else:
        out['f_throwaway_user'] = 0
    # f_military from title+body
    empty = pd.Series([''] * len(df), index=df.index)
    title = df['request_title'].fillna('').astype(str) if 'request_title' in df.columns else empty
    # Prefer the edit-aware body, fall back to the raw body, then to empty strings
    body_col = next((c for c in ('request_text_edit_aware', 'request_text') if c in df.columns), None)
    body = df[body_col].fillna('').astype(str) if body_col is not None else empty
    txt = (title + ' ' + body).str.lower()
    out['f_military'] = txt.str.contains(r'\b(?:military|army|navy|marine|air\s*force|veteran|deployment)\b', regex=True).astype(np.int8)
    return out.astype(np.float32)

# 1) Retrain Dense XGB v1 (single seed=42) with enhanced_plus meta; SVD dims same as v1 (250/250/80)
svd_word_n, svd_char_n, svd_subs_n = 250, 250, 80
y = train[target_col].astype(int).values
cv = list(StratifiedKFold(n_splits=5, shuffle=True, random_state=42).split(train, y))
print('Prepared 5-fold CV; retraining Dense v1 (single seed=42) with meta+flags')

raw_te_text = combine_raw_text(test)
clean_te_text = clean_text_series(raw_te_text)
subs_te_text = build_subreddit_text(test)
meta_te_plus = build_meta_enhanced_plus(test).astype(np.float32)

seed = 42
oof_dense_mf = np.zeros(len(train), dtype=np.float32)
test_dense_mf_folds = []

base_params = dict(
    objective='binary:logistic',
    eval_metric='auc',
    max_depth=4,
    eta=0.05,
    subsample=0.8,
    colsample_bytree=0.7,
    min_child_weight=4,
    reg_alpha=0.5,
    reg_lambda=2.0,
    gamma=0.0,
    device='cuda',
    seed=seed
)

for fold, (tr_idx, va_idx) in enumerate(cv):
    t0 = time.time()
    print(f'Seed {seed} | Fold {fold+1}/5 - train {len(tr_idx)} va {len(va_idx)}')
    # cv yields positional indices, so index with .iloc (.loc only matches on a default RangeIndex)
    tr_text_raw = combine_raw_text(train.iloc[tr_idx])
    va_text_raw = combine_raw_text(train.iloc[va_idx])
    tr_text = clean_text_series(tr_text_raw)
    va_text = clean_text_series(va_text_raw)
    tr_subs = build_subreddit_text(train.iloc[tr_idx])
    va_subs = build_subreddit_text(train.iloc[va_idx])

    tfidf_w = TfidfVectorizer(**word_params)
    Xw_tr = tfidf_w.fit_transform(tr_text); Xw_va = tfidf_w.transform(va_text); Xw_te = tfidf_w.transform(clean_te_text)
    tfidf_c = TfidfVectorizer(**char_params)
    Xc_tr = tfidf_c.fit_transform(tr_text); Xc_va = tfidf_c.transform(va_text); Xc_te = tfidf_c.transform(clean_te_text)
    tfidf_s = TfidfVectorizer(**subs_params)
    Xs_tr = tfidf_s.fit_transform(tr_subs); Xs_va = tfidf_s.transform(va_subs); Xs_te = tfidf_s.transform(subs_te_text)

    svd_w = TruncatedSVD(n_components=svd_word_n, random_state=seed)
    Zw_tr = svd_w.fit_transform(Xw_tr); Zw_va = svd_w.transform(Xw_va); Zw_te = svd_w.transform(Xw_te)
    svd_c = TruncatedSVD(n_components=svd_char_n, random_state=seed)
    Zc_tr = svd_c.fit_transform(Xc_tr); Zc_va = svd_c.transform(Xc_va); Zc_te = svd_c.transform(Xc_te)
    svd_s = TruncatedSVD(n_components=svd_subs_n, random_state=seed)
    Zs_tr = svd_s.fit_transform(Xs_tr); Zs_va = svd_s.transform(Xs_va); Zs_te = svd_s.transform(Xs_te)

    meta_tr_plus = build_meta_enhanced_plus(train.iloc[tr_idx]).astype(np.float32)
    meta_va_plus = build_meta_enhanced_plus(train.iloc[va_idx]).astype(np.float32)

    Xtr_dense = np.hstack([Zw_tr, Zc_tr, Zs_tr, meta_tr_plus.values]).astype(np.float32)
    Xva_dense = np.hstack([Zw_va, Zc_va, Zs_va, meta_va_plus.values]).astype(np.float32)
    Xte_dense = np.hstack([Zw_te, Zc_te, Zs_te, meta_te_plus.values]).astype(np.float32)

    scaler = StandardScaler(with_mean=True, with_std=True)
    Xtr_d = scaler.fit_transform(Xtr_dense)
    Xva_d = scaler.transform(Xva_dense)
    Xte_d = scaler.transform(Xte_dense)

    dtrain = xgb.DMatrix(Xtr_d, label=y[tr_idx])
    dvalid = xgb.DMatrix(Xva_d, label=y[va_idx])
    dtest  = xgb.DMatrix(Xte_d)
    booster = xgb.train(base_params, dtrain, num_boost_round=4000, evals=[(dtrain,'train'),(dvalid,'valid')], early_stopping_rounds=200, verbose_eval=False)
    va_pred = booster.predict(dvalid, iteration_range=(0, booster.best_iteration+1)).astype(np.float32)
    te_pred = booster.predict(dtest, iteration_range=(0, booster.best_iteration+1)).astype(np.float32)
    oof_dense_mf[va_idx] = va_pred
    test_dense_mf_folds.append(te_pred)
    auc = roc_auc_score(y[va_idx], va_pred)
    print(f'Fold {fold+1} AUC: {auc:.5f} | best_iter={booster.best_iteration} | time {time.time()-t0:.1f}s')
    del Xw_tr, Xw_va, Xw_te, Xc_tr, Xc_va, Xc_te, Xs_tr, Xs_va, Xs_te, Zw_tr, Zw_va, Zw_te, Zc_tr, Zc_va, Zc_te, Zs_tr, Zs_va, Zs_te, meta_tr_plus, meta_va_plus, Xtr_dense, Xva_dense, Xte_dense, Xtr_d, Xva_d, Xte_d, dtrain, dvalid, dtest, booster, tfidf_w, tfidf_c, tfidf_s, svd_w, svd_c, svd_s, scaler
    gc.collect()

auc_oof_mf = roc_auc_score(y, oof_dense_mf)
print(f'Dense XGB v1 (meta+flags) single-seed OOF AUC: {auc_oof_mf:.5f}')
test_dense_mf = np.mean(test_dense_mf_folds, axis=0).astype(np.float32)
np.save('oof_xgb_dense_mf.npy', oof_dense_mf.astype(np.float32))
np.save('test_xgb_dense_mf.npy', test_dense_mf)

# 2) 4-way split-dense logit blend using LR + Dense v1 (meta+flags) + Dense v2 (old) + Meta
def to_logit(p, eps=1e-6):
    p = np.clip(p.astype(np.float64), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

o1 = np.load('oof_lr_pivot.npy')
o2 = np.load('oof_xgb_dense_mf.npy')  # new dense v1 with flags
o2b = np.load('oof_xgb_dense_v2.npy')
o3 = np.load('oof_xgb_meta.npy')
t1 = np.load('test_lr_pivot.npy')
t2 = np.load('test_xgb_dense_mf.npy')
t2b = np.load('test_xgb_dense_v2.npy')
t3 = np.load('test_xgb_meta.npy')
z1, z2, z2b, z3 = to_logit(o1), to_logit(o2), to_logit(o2b), to_logit(o3)
tz1, tz2, tz2b, tz3 = to_logit(t1), to_logit(t2), to_logit(t2b), to_logit(t3)
y = train[target_col].astype(int).values

# Start near prior best 3-way logit (v1 baseline): (0.3501, 0.4198, 0.2301). We'll split dense between v1/v2 via alpha.
w1c, wdc = 0.3501, 0.4198
w3c = 1.0 - w1c - wdc
w1_min, w1_max = max(0.0, w1c - 0.02), min(1.0, w1c + 0.02)
wd_min, wd_max = max(0.0, wdc - 0.02), min(1.0, wdc + 0.02)
alphas = np.arange(0.0, 0.4001, 0.05)
step = 0.001
w1_grid = np.arange(w1_min, w1_max + 1e-12, step)
wd_grid = np.arange(wd_min, wd_max + 1e-12, step)
best_auc, best_cfg = -1.0, None
t0 = time.time(); cnt = 0
for w1 in w1_grid:
    for wd in wd_grid:
        w3 = 1.0 - w1 - wd
        if w3 < 0.0 or w3 > 1.0: continue
        for a in alphas:
            w2b = wd * a
            w2 = wd - w2b
            z = w1*z1 + w2*z2 + w2b*z2b + w3*z3
            auc = roc_auc_score(y, z)
            cnt += 1
            if auc > best_auc:
                best_auc = auc; best_cfg = (float(w1), float(w2), float(w2b), float(w3))
print(f'4-way split-dense (dense_v1=flags) tried {cnt} combos | best weights={best_cfg} OOF AUC(z): {best_auc:.5f} | {time.time()-t0:.1f}s')

# Optional micro-refine around best
w1b, w2b1, w2b2, w3b = best_cfg
def refine(w1c, w2c1, w2c2, w3c, window=0.006, step=0.0005):
    best_local_auc, best_local_w = -1.0, None
    wd_c = w2c1 + w2c2
    alpha_c = (w2c2 / wd_c) if wd_c > 0 else 0.0
    w1_min = max(0.0, w1c - window); w1_max = min(1.0, w1c + window)
    wd_min = max(0.0, wd_c - window); wd_max = min(1.0, wd_c + window)
    a_min = max(0.0, alpha_c - 0.05); a_max = min(0.6, alpha_c + 0.05)
    w1_grid = np.arange(w1_min, w1_max + 1e-12, step)
    wd_grid = np.arange(wd_min, wd_max + 1e-12, step)
    a_grid = np.arange(a_min, a_max + 1e-12, step*5)
    t1s = time.time(); cnt2 = 0
    for w1 in w1_grid:
        for wd in wd_grid:
            w3 = 1.0 - w1 - wd
            if w3 < 0.0 or w3 > 1.0: continue
            for a in a_grid:
                w2v2 = wd * a; w2v1 = wd - w2v2
                z = w1*z1 + w2v1*z2 + w2v2*z2b + w3*z3
                auc = roc_auc_score(y, z); cnt2 += 1
                if auc > best_local_auc:
                    best_local_auc, best_local_w = auc, (float(w1), float(w2v1), float(w2v2), float(w3))
    print(f'Refine ({cnt2} combos) best weights={best_local_w} OOF AUC(z): {best_local_auc:.5f} | {time.time()-t1s:.1f}s')
    return best_local_auc, best_local_w

auc_ref, w_ref = refine(w1b, w2b1, w2b2, w3b, window=0.006, step=0.0005)
use_auc = auc_ref if auc_ref > best_auc else best_auc
use_w = w_ref if auc_ref > best_auc else best_cfg
print(f'Chosen 4-way (dense_v1=flags) weights={use_w} OOF AUC(z): {use_auc:.5f}')

# Build submission from the chosen weights (unconditional: no OOF threshold is checked, and submission.csv is overwritten)
w1f, w2f1, w2f2, w3f = use_w
zt = w1f*tz1 + w2f1*tz2 + w2f2*tz2b + w3f*tz3
pt = sigmoid(zt).astype(np.float32)
sub = pd.DataFrame({id_col: test[id_col].values, target_col: pt})
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv (4-way with Dense v1 meta+flags); head:')
print(sub.head())

Prepared 5-fold CV; retraining Dense v1 (single seed=42) with meta+flags


  out['f_reciprocity'] = txt.str.contains(r'\b(pay it forward|return the favor|pay you back|pay it back)\b', regex=True).astype(np.int8)
  out['f_payday'] = txt.str.contains(r'\b(payday|paycheck|get paid|next check)\b', regex=True).astype(np.int8)
  out['f_day_words'] = txt.str.contains(r'\b(monday|tuesday|wednesday|thursday|friday|tomorrow|tonight|weekend)\b', regex=True).astype(np.int8)
  out['f_amount'] = txt.str.contains(r'(\$\s?\d+)|(\b\d+\s*dollars?\b)', regex=True).astype(np.int8)
  out['mentions_kids'] = txt.str.contains(r'\b(kids?|children|son|daughter|baby)\b', regex=True).astype(np.int8)
  out['f_throwaway_user'] = un.str.contains(r'(throwaway|temp|anon)', case=False, regex=True).astype(np.int8)
  out['f_military'] = txt.str.contains(r'\b(military|army|navy|marine|air\s*force|veteran|deployment)\b', regex=True).astype(np.int8)


Seed 42 | Fold 1/5 - train 2302 va 576



Fold 1 AUC: 0.67162 | best_iter=150 | time 16.9s
Seed 42 | Fold 2/5 - train 2302 va 576




Fold 2 AUC: 0.64378 | best_iter=256 | time 17.3s
Seed 42 | Fold 3/5 - train 2302 va 576



Fold 3 AUC: 0.67763 | best_iter=58 | time 16.5s
Seed 42 | Fold 4/5 - train 2303 va 575



Fold 4 AUC: 0.63492 | best_iter=55 | time 16.5s
Seed 42 | Fold 5/5 - train 2303 va 575



Fold 5 AUC: 0.70090 | best_iter=58 | time 16.4s
Dense XGB v1 (meta+flags) single-seed OOF AUC: 0.65899


4-way split-dense (dense_v1=flags) tried 15129 combos | best weights=(0.3521, 0.23987999999999998, 0.15992, 0.24809999999999993) OOF AUC(z): 0.69014 | 29.7s


Refine (25625 combos) best weights=(0.3516, 0.21658999999999995, 0.17721000000000003, 0.2546) OOF AUC(z): 0.69045 | 50.3s
Chosen 4-way (dense_v1=flags) weights=(0.3516, 0.21658999999999995, 0.17721000000000003, 0.2546) OOF AUC(z): 0.69045
Saved submission.csv (4-way with Dense v1 meta+flags); head:
  request_id  requester_received_pizza
0  t3_1aw5zf                  0.337412
1   t3_roiuw                  0.219214
2   t3_mjnbq                  0.206461
3   t3_t8wd1                  0.213465
4  t3_1m4zxu                  0.224252
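
The `out[...] = ...str.contains(...)` lines echoed in the output above look like the source-line part of the pandas warning raised when a `str.contains` pattern has capturing groups. A small sketch, assuming that is the cause: switching to a non-capturing group `(?:...)` keeps the match semantics while avoiding the warning. The usernames below are made up for illustration.

```python
import warnings
import pandas as pd

# Hypothetical usernames; None stands in for a missing value
names = pd.Series(['throwaway123', 'pizza_fan', 'AnonEater', None]).fillna('').astype(str)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    # (?:...) groups the alternatives without creating a capture group
    flag = names.str.contains(r'(?:throwaway|temp|anon)', case=False, regex=True).astype('int8')

# Collect only warnings that mention match groups
group_warnings = [w for w in caught if 'match groups' in str(w.message)]
print(flag.tolist(), '| match-group warnings:', len(group_warnings))
```

The same change applies to every keyword flag in `build_meta_enhanced` that wraps its alternatives in plain parentheses.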


In [None]:
# 5-way logit blend: LR Pivot + Dense v1 + Dense v2 + Meta + LR_alt (cap small weight); cached logits only
import numpy as np, pandas as pd, time
from sklearn.metrics import roc_auc_score

def to_logit(p, eps=1e-6):
    p = np.clip(p.astype(np.float64), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print('Loading cached OOF/test predictions (LR, Dense v1, Dense v2, Meta, LR_alt)...')
o1 = np.load('oof_lr_pivot.npy'); t1 = np.load('test_lr_pivot.npy')
o2 = np.load('oof_xgb_dense.npy'); t2 = np.load('test_xgb_dense.npy')
o2b= np.load('oof_xgb_dense_v2.npy'); t2b= np.load('test_xgb_dense_v2.npy')
o3 = np.load('oof_xgb_meta.npy'); t3 = np.load('test_xgb_meta.npy')
o4 = np.load('oof_lr_alt.npy'); t4 = np.load('test_lr_alt.npy')
y = train['requester_received_pizza'].astype(int).values

# Convert to logits
z1, z2, z2b, z3, z4 = to_logit(o1), to_logit(o2), to_logit(o2b), to_logit(o3), to_logit(o4)
tz1, tz2, tz2b, tz3, tz4 = to_logit(t1), to_logit(t2), to_logit(t2b), to_logit(t3), to_logit(t4)

# Reference 4-way weights (from Cell 31 refine)
w_lr_ref, w_d1_ref, w_d2_ref, w_meta_ref = 0.3381, 0.266786, 0.163514, 0.2316
wd_ref = w_d1_ref + w_d2_ref

# Grids: small cap for LR_alt; tight windows around ref
w_alt_grid = np.arange(0.0, 0.1001, 0.005)
w1_grid   = np.arange(max(0.0, w_lr_ref - 0.01), min(1.0, w_lr_ref + 0.01) + 1e-12, 0.001)
wd_grid   = np.arange(max(0.0, wd_ref - 0.01), min(1.0, wd_ref + 0.01) + 1e-12, 0.001)
alpha_grid= np.arange(0.0, 0.4001, 0.05)  # fraction of wd to Dense v2

best_auc, best_w = -1.0, None
t0 = time.time(); tried = 0
for w_alt in w_alt_grid:
    rem = 1.0 - w_alt
    for w1 in w1_grid:
        for wd in wd_grid:
            # remaining for meta
            w3 = rem - w1 - wd
            if w3 < 0.0 or w3 > 1.0:
                continue
            for a in alpha_grid:
                w2b = wd * a
                w2 = wd - w2b
                z = w1*z1 + w2*z2 + w2b*z2b + w3*z3 + w_alt*z4
                auc = roc_auc_score(y, z)
                tried += 1
                if auc > best_auc:
                    best_auc = auc
                    # store full 5-way weights
                    best_w = (float(w1), float(w2), float(w2b), float(w3), float(w_alt))
print(f'5-way (LR, D1, D2, Meta, LR_alt) tried {tried} combos | best OOF AUC(z)={best_auc:.5f} | weights={best_w} | {time.time()-t0:.1f}s')

# Build submission from best 5-way config (even if OOF ~ same; diversity may help LB)
# Distinct names so the final weights do not shadow the loop variable w2b
w1f, w2f, w2bf, w3f, w4f = best_w
zt = w1f*tz1 + w2f*tz2 + w2bf*tz2b + w3f*tz3 + w4f*tz4
pt = sigmoid(zt).astype(np.float32)
sub = pd.DataFrame({id_col: test[id_col].values, target_col: pt})
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv (5-way with small LR_alt cap); head:')
print(sub.head())
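
The cell above adds LR_alt on the premise that diversity may help even when OOF AUC barely moves. One rough way to check that premise is to look at how correlated the models' logits are, since a near-duplicate model adds little to a blend. The sketch below is an assumption-laden illustration: `logit_corr` is a hypothetical helper and the logits are synthetic, not the cached `z1`..`z4` arrays.

```python
import numpy as np

def logit_corr(named_logits):
    """Pearson correlation matrix between models' logit vectors."""
    names = list(named_logits)
    mat = np.vstack([np.asarray(named_logits[k], dtype=np.float64) for k in names])
    return names, np.corrcoef(mat)

# Synthetic stand-ins: two models sharing a signal plus one unrelated model
rng = np.random.default_rng(42)
base = rng.normal(size=1000)
demo = {
    'lr': base + rng.normal(scale=0.3, size=1000),
    'lr_alt': base + rng.normal(scale=0.3, size=1000),
    'meta': rng.normal(size=1000),
}

names_demo, corr_demo = logit_corr(demo)
for name, row in zip(names_demo, corr_demo):
    print(f'{name:>7}', np.round(row, 3))
```

In the notebook this could be called as `logit_corr({'lr': z1, 'd1': z2, 'd2': z2b, 'meta': z3, 'lr_alt': z4})`. A very high correlation between `lr` and `lr_alt` would suggest the small weight cap is doing little beyond averaging noise.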