# Random Acts of Pizza — Plan

Objective: Achieve medal-level AUC-ROC via robust text modeling with disciplined CV and fast iteration.

Workflow
- Environment sanity: verify GPU; if unavailable, exit early.
- Data audit:
  - Inspect train.json/test.json: schema, text fields, metadata, target.
  - Identify potential features: request_text, title, gratitude, politeness markers, narrative structure, user history proxies if present (e.g., account age, karma fields), timing features.
- Validation:
  - Stratified K-fold (e.g., 5 folds) on target; ensure transformations fit within folds.
  - Save folds to disk for reuse.
  - Track OOF AUC; cache OOF/test logits for blending.
- Baselines (fast):
  1) TF-IDF (word+char) -> Logistic Regression (liblinear/saga) with class_weight='balanced'.
  2) TF-IDF -> Linear SVM (Calibrated) if needed.
  3) Naive Bayes-SVM style (NB-SVM) linear model.
- Feature Engineering v1:
  - Text cleaning: lower, basic normalization; preserve punctuation for char n-grams.
  - Separate fields: title vs body; combine with weighted concatenation.
  - Meta features: text length, word counts, uppercase ratio, sentiment (VADER), presence of images/links, mention of money/"student", location, reciprocity cues, offer-of-return, gratitude terms, politeness markers.
  - Temporal if available (weekday/hour).
- Models for improvement:
  - Linear baseline tuning (C, n-gram ranges, min_df).
  - XGBoost/CatBoost on meta + dense text embeddings (e.g., SIF average of word vectors); use GPU if helpful.
  - Light neural option (e.g., DistilBERT fine-tune with early stopping) only if quick, and only after a strong CV baseline is established.
- Ensembling:
  - Blend diverse linear models (word vs char TF-IDF, different seeds/params).
  - Simple weighted average based on OOF.
- Error analysis:
  - Bucket by confidence, length, presence of key phrases; iterate features.
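
The validation bullets above call for folds that are built once and reused. A minimal sketch of one way to do that follows; the helper `make_or_load_folds`, the cache file name `folds.npy`, and the toy target are all assumptions for illustration and do not appear in the notebook.

```python
import os
import tempfile

import numpy as np
from sklearn.model_selection import StratifiedKFold


def make_or_load_folds(y, path, n_splits=5, seed=42):
    """Return an array mapping each row to its validation fold (0..n_splits-1).

    Hypothetical helper: the assignment is computed once, saved to `path`,
    and reloaded on later calls so every model sees identical splits.
    """
    if os.path.exists(path):
        fold_id = np.load(path)
        if len(fold_id) == len(y):
            return fold_id
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    fold_id = np.full(len(y), -1, dtype=np.int64)
    # split() only needs the row count from X, so a placeholder array is enough
    for k, (_, val_idx) in enumerate(skf.split(np.zeros(len(y)), y)):
        fold_id[val_idx] = k
    np.save(path, fold_id)
    return fold_id


# Toy target with roughly a 25% positive rate, standing in for the real labels
rng = np.random.default_rng(0)
y_demo = (rng.random(200) < 0.25).astype(int)
demo_path = os.path.join(tempfile.mkdtemp(), "folds.npy")

folds_first = make_or_load_folds(y_demo, demo_path)
folds_again = make_or_load_folds(y_demo, demo_path)  # served from disk
```

Looping over `fold_id == k` then gives the validation mask for fold `k`, and OOF arrays cached by different models line up row for row.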

Milestones & Expert Checkpoints
1) Plan review (this).
2) Data schema + baseline TF-IDF-LR OOF AUC and first submission.
3) Feature engineering v1 + tuned linear models.
4) Small ensemble of best models.
5) Optional transformer fine-tune if time permits.

Next actions
- Run environment check (nvidia-smi), load data, inspect columns/target distribution.
- Implement deterministic folds and TF-IDF+LR baseline with solid logging and OOF caching.
- Request expert review after baseline results.

In [1]:
import os, sys, json, time, subprocess, shutil, math, re
import numpy as np
import pandas as pd

print('=== GPU CHECK (nvidia-smi) ===', flush=True)
try:
    out = subprocess.run(['bash','-lc','nvidia-smi || true'], capture_output=True, text=True)
    print(out.stdout)
except Exception as e:
    print('nvidia-smi failed:', e)

def load_df(path):
    # Try robust JSON loading via pandas; fallback to json.load + DataFrame
    try:
        df = pd.read_json(path)
        return df
    except ValueError:
        with open(path, 'r') as f:
            data = json.load(f)
        return pd.DataFrame(data)

t0 = time.time()
train_path = 'train.json'
test_path = 'test.json'
print('Loading train/test...', flush=True)
train = load_df(train_path)
test = load_df(test_path)
print(f'train shape: {train.shape}; test shape: {test.shape}', flush=True)

# Inspect columns and infer key fields
print('\nTrain columns:', list(train.columns))
print('Test  columns:', list(test.columns))

id_col_candidates = [c for c in train.columns if c.lower() in ('request_id','id')]
target_candidates = [c for c in train.columns if c.lower() in ('requester_received_pizza','target','label','outcome')]
text_candidates = [c for c in train.columns if 'text' in c.lower() or 'title' in c.lower()]
time_candidates = [c for c in train.columns if 'time' in c.lower() or 'created' in c.lower() or 'timestamp' in c.lower()]

print('\nID candidates:', id_col_candidates)
print('Target candidates:', target_candidates)
print('Text candidates:', text_candidates[:10])
print('Time candidates:', time_candidates[:10])

id_col = id_col_candidates[0] if id_col_candidates else None
target_col = target_candidates[0] if target_candidates else None
print(f'Chosen id_col={id_col}, target_col={target_col}')

if target_col is None:
    # Fallback guess for RAOP
    if 'requester_received_pizza' in train.columns:
        target_col = 'requester_received_pizza'
        print('Fallback target_col=requester_received_pizza')

if id_col is None:
    if 'request_id' in train.columns:
        id_col = 'request_id'
        print('Fallback id_col=request_id')

# Basic target distribution
if target_col in train.columns:
    y = train[target_col].astype(int)  # works for both bool and 0/1 targets
    pos_rate = y.mean()
    print(f'Target positive rate: {pos_rate:.4f} ({y.sum()}/{len(y)})')

# Prefer edit-aware text if present
body_fields_order = [
    'request_text_edit_aware',
    'request_text',
    'request_text_edit_aware_unnormalized',
]
title_fields = [
    'request_title', 'title'
]
body_col = next((c for c in body_fields_order if c in train.columns), None)
title_col = next((c for c in title_fields if c in train.columns), None)
print(f'Selected title_col={title_col}, body_col={body_col}')

# Quick sanity on missingness
if title_col:
    print('Title nulls:', train[title_col].isna().sum(), '/', len(train))
if body_col:
    print('Body nulls:', train[body_col].isna().sum(), '/', len(train))

# Verify sample submission format expectations
if os.path.exists('sampleSubmission.csv'):
    ss = pd.read_csv('sampleSubmission.csv')
    print('\nSampleSubmission head:')
    print(ss.head())
    print('SampleSubmission columns:', list(ss.columns))

print('\nEnvironment/data audit complete in %.2fs' % (time.time()-t0), flush=True)

=== GPU CHECK (nvidia-smi) ===
Mon Sep 29 00:22:15 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.06             Driver Version: 550.144.06     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A10-24Q                 On  |   00000002:00:00.0 Off |                    0 |
| N/A   N/A    P0             N/A /  N/A  |     182MiB /  24512MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

train shape: (2878, 32); test shape: (1162, 17)

Train columns: ['giver_username_if_known', 'number_of_downvotes_of_request_at_retrieval', 'number_of_upvotes_of_request_at_retrieval', 'post_was_edited', 'request_id', 'request_number_of_comments_at_retrieval', 'request_text', 'request_text_edit_aware', 'request_title', 'requester_account_age_in_days_at_request', 'requester_account_age_in_days_at_retrieval', 'requester_days_since_first_post_on_raop_at_request', 'requester_days_since_first_post_on_raop_at_retrieval', 'requester_number_of_comments_at_request', 'requester_number_of_comments_at_retrieval', 'requester_number_of_comments_in_raop_at_request', 'requester_number_of_comments_in_raop_at_retrieval', 'requester_number_of_posts_at_request', 'requester_number_of_posts_at_retrieval', 'requester_number_of_posts_on_raop_at_request', 'requester_number_of_posts_on_raop_at_retrieval', 'requester_number_of_subreddits_at_request', 'requester_received_pizza', 'requester_subreddits_at_request', 'requester_upvotes_minus_downvotes_at_request', 

In [2]:
import time
from typing import Tuple
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

# Build full_text field (lowercase, keep punctuation)
def build_text(df: pd.DataFrame, title_col: str, body_col: str) -> pd.Series:
    title = df[title_col].fillna("").astype(str)
    body = df[body_col].fillna("").astype(str)
    full_text = (title + " [SEP] " + body).str.lower()
    return full_text

full_text_train = build_text(train, title_col, body_col)
full_text_test = build_text(test, title_col, body_col)

y = train[target_col].astype(int).values

def get_vectorizers() -> Tuple[TfidfVectorizer, TfidfVectorizer]:
    word_vec = TfidfVectorizer(ngram_range=(1,2), min_df=3, max_df=0.9, max_features=50000,
                               sublinear_tf=True, dtype=np.float32, lowercase=False)
    char_vec = TfidfVectorizer(analyzer='char_wb', ngram_range=(3,5), min_df=2, max_features=200000,
                               sublinear_tf=True, dtype=np.float32, lowercase=False)
    return word_vec, char_vec

def hstack_features(word_X, char_X):
    return sparse.hstack([word_X, char_X], format='csr')

def fit_lr(X, y):
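    # Despite the name, this only builds the (unfitted) model; the caller invokes .fit(). X and y are unused.
    # saga handles large sparse matrices; max_iter is raised for convergence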
    return LogisticRegression(penalty='l2', C=4.0, solver='saga', max_iter=4000, n_jobs=-1, verbose=0)

print('=== CV: TF-IDF(word+char) -> Logistic Regression ===', flush=True)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
oof = np.zeros(len(train), dtype=np.float32)
fold_times = []

for fold, (trn_idx, val_idx) in enumerate(skf.split(full_text_train, y), 1):
    t0 = time.time()
    print(f'Fold {fold} start: train {len(trn_idx)} | val {len(val_idx)}', flush=True)
    X_tr_text = full_text_train.iloc[trn_idx]
    X_va_text = full_text_train.iloc[val_idx]

    word_vec, char_vec = get_vectorizers()
    Xw_tr = word_vec.fit_transform(X_tr_text)
    Xc_tr = char_vec.fit_transform(X_tr_text)
    X_tr = hstack_features(Xw_tr, Xc_tr)

    Xw_va = word_vec.transform(X_va_text)
    Xc_va = char_vec.transform(X_va_text)
    X_va = hstack_features(Xw_va, Xc_va)

    clf = fit_lr(X_tr, y[trn_idx])
    clf.fit(X_tr, y[trn_idx])
    oof[val_idx] = clf.predict_proba(X_va)[:,1]
    auc = roc_auc_score(y[val_idx], oof[val_idx])
    dt = time.time()-t0
    fold_times.append(dt)
    print(f'Fold {fold} AUC: {auc:.5f} | time: {dt:.1f}s', flush=True)

oof_auc = roc_auc_score(y, oof)
print(f'OOF AUC: {oof_auc:.5f} | mean fold time: {np.mean(fold_times):.1f}s', flush=True)

# Fit final model on full train (refit vectorizers on full train text), then predict test
print('Fitting full model on all training data...', flush=True)
word_vec_full, char_vec_full = get_vectorizers()
Xw_full = word_vec_full.fit_transform(full_text_train)
Xc_full = char_vec_full.fit_transform(full_text_train)
X_full = hstack_features(Xw_full, Xc_full)
clf_full = fit_lr(X_full, y)
clf_full.fit(X_full, y)

Xw_test = word_vec_full.transform(full_text_test)
Xc_test = char_vec_full.transform(full_text_test)
X_test = hstack_features(Xw_test, Xc_test)
test_pred = clf_full.predict_proba(X_test)[:,1].astype(np.float32)

# Write submission
sub = pd.DataFrame({
    'request_id': test['request_id'],
    'requester_received_pizza': test_pred
})
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv:', sub.shape, 'min/max:', float(test_pred.min()), float(test_pred.max()))

# Cache OOF/test for future blending
np.save('oof_lr_wordchar.npy', oof)
np.save('test_lr_wordchar.npy', test_pred)
print('Cached OOF and test predictions to .npy files.')

# Quick sanity checks
assert sub.shape[0] == len(test), 'Submission row count mismatch'
assert 0.0 <= test_pred.min() and test_pred.max() <= 1.0, 'Predictions out of [0,1]'
print('Baseline complete.')

=== CV: TF-IDF(word+char) -> Logistic Regression ===
Fold 1 start: train 2302 | val 576
Fold 1 AUC: 0.61902 | time: 9.0s
Fold 2 start: train 2302 | val 576
Fold 2 AUC: 0.64731 | time: 8.7s
Fold 3 start: train 2302 | val 576
Fold 3 AUC: 0.62645 | time: 9.2s
Fold 4 start: train 2303 | val 575
Fold 4 AUC: 0.59316 | time: 9.0s
Fold 5 start: train 2303 | val 575
Fold 5 AUC: 0.61375 | time: 8.5s
OOF AUC: 0.62018 | mean fold time: 8.9s
Fitting full model on all training data...
Saved submission.csv: (1162, 2) min/max: 0.008493145927786827 0.8981638550758362
Cached OOF and test predictions to .npy files.
Baseline complete.
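
The manual fold loop above refits both vectorizers inside every fold. A compact way to get the same leakage-safe behaviour, sketched here on a toy corpus, is to wrap the steps in a scikit-learn `Pipeline` and let `cross_val_predict` do the per-fold fitting. The corpus, the labels, `min_df=1`, and the default `lbfgs` solver are placeholders for illustration, not the notebook's data or settings.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import FeatureUnion, Pipeline

# Toy corpus (placeholder data): 20 "positive" and 20 "negative" requests
pos_docs = [f"student broke until payday {i} will pay it forward thanks" for i in range(20)]
neg_docs = [f"pizza now {i} give me food" for i in range(20)]
texts = np.array(pos_docs + neg_docs)
labels = np.array([1] * 20 + [0] * 20)

# Word and char TF-IDF side by side, equivalent to hstack-ing the two matrices
features = FeatureUnion([
    ("word", TfidfVectorizer(ngram_range=(1, 2), min_df=1, sublinear_tf=True)),
    ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5), min_df=1, sublinear_tf=True)),
])
model = Pipeline([
    ("tfidf", features),
    ("lr", LogisticRegression(C=4.0, max_iter=1000)),
])

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Each fold clones the pipeline, so vocabularies and IDF weights come from training rows only
oof_demo = cross_val_predict(model, texts, labels, cv=skf, method="predict_proba")[:, 1]
print(f"toy OOF AUC: {roc_auc_score(labels, oof_demo):.3f}")
```

The explicit loop remains useful when per-fold timing and logging matter, as in the cell above; the pipeline form mainly removes the chance of fitting a transformer on validation rows by accident.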


In [3]:
# Improved TF-IDF + LR baseline per expert advice
import time
from typing import Tuple
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

def build_text_upweighted(df: pd.DataFrame, title_col: str, body_col: str) -> pd.Series:
    title = df[title_col].fillna("").astype(str)
    body = df[body_col].fillna("").astype(str)
    # Duplicate title once to upweight it; no manual lowercasing (vectorizers will lowercase)
    return title + " " + title + " [SEP] " + body

full_text_train2 = build_text_upweighted(train, title_col, body_col)
full_text_test2 = build_text_upweighted(test, title_col, body_col)
y2 = train[target_col].astype(int).values

def get_vecs_improved() -> Tuple[TfidfVectorizer, TfidfVectorizer]:
    word_vec = TfidfVectorizer(ngram_range=(1,2), min_df=3, max_df=0.95, max_features=150000,
                               sublinear_tf=True, lowercase=True, dtype=np.float32)
    char_vec = TfidfVectorizer(analyzer='char', ngram_range=(3,6), min_df=2, max_features=300000,
                               sublinear_tf=True, lowercase=True, dtype=np.float32)
    return word_vec, char_vec

def hstack_features(word_X, char_X):
    return sparse.hstack([word_X, char_X], format='csr')

def run_cv_lr(C: float, cls_weight=None):
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    oof = np.zeros(len(full_text_train2), dtype=np.float32)
    times = []
    for fold, (trn_idx, val_idx) in enumerate(skf.split(full_text_train2, y2), 1):
        t0 = time.time()
        Xtr_text = full_text_train2.iloc[trn_idx]
        Xva_text = full_text_train2.iloc[val_idx]
        wv, cv = get_vecs_improved()
        Xw_tr = wv.fit_transform(Xtr_text)
        Xc_tr = cv.fit_transform(Xtr_text)
        X_tr = hstack_features(Xw_tr, Xc_tr)
        Xw_va = wv.transform(Xva_text)
        Xc_va = cv.transform(Xva_text)
        X_va = hstack_features(Xw_va, Xc_va)
        solver = 'saga'
        clf = LogisticRegression(penalty='l2', C=C, solver=solver, max_iter=3000, n_jobs=-1, class_weight=cls_weight, verbose=0)
        clf.fit(X_tr, y2[trn_idx])
        oof[val_idx] = clf.predict_proba(X_va)[:,1]
        times.append(time.time()-t0)
        print(f'  Fold {fold} done in {times[-1]:.1f}s', flush=True)
    auc = roc_auc_score(y2, oof)
    return auc, oof

print('=== Improved LR baseline grid ===', flush=True)
best = (-1, None, None)  # (auc, C, class_weight)
best_oof = None
for C in (2.0, 4.0, 8.0):
    for cw in (None, 'balanced'):
        print(f'Trying C={C}, class_weight={cw}', flush=True)
        auc, oof_preds = run_cv_lr(C, cw)
        print(f'  OOF AUC: {auc:.5f}', flush=True)
        if auc > best[0]:
            best = (auc, C, cw)
            best_oof = oof_preds

print(f'Best OOF AUC: {best[0]:.5f} with C={best[1]} class_weight={best[2]}', flush=True)
np.save('oof_lr_improved.npy', best_oof)

# Fit final model on full training with best params
wv_full, cv_full = get_vecs_improved()
Xw_full = wv_full.fit_transform(full_text_train2)
Xc_full = cv_full.fit_transform(full_text_train2)
X_full = hstack_features(Xw_full, Xc_full)
solver = 'saga'
clf = LogisticRegression(penalty='l2', C=best[1], solver=solver, max_iter=3000, n_jobs=-1, class_weight=best[2], verbose=0)
clf.fit(X_full, y2)
Xw_test = wv_full.transform(full_text_test2)
Xc_test = cv_full.transform(full_text_test2)
X_test = hstack_features(Xw_test, Xc_test)
test_pred2 = clf.predict_proba(X_test)[:,1].astype(np.float32)
np.save('test_lr_improved.npy', test_pred2)

sub2 = pd.DataFrame({'request_id': test['request_id'], 'requester_received_pizza': test_pred2})
sub2.to_csv('submission_lr_improved.csv', index=False)
print('Saved submission_lr_improved.csv', sub2.shape, 'min/max:', float(test_pred2.min()), float(test_pred2.max()))

=== Improved LR baseline grid ===
Trying C=2.0, class_weight=None
  Fold 1 done in 18.7s
  Fold 2 done in 17.6s
  Fold 3 done in 18.6s
  Fold 4 done in 18.9s
  Fold 5 done in 17.4s
  OOF AUC: 0.63452
Trying C=2.0, class_weight=balanced
  Fold 1 done in 15.9s
  Fold 2 done in 14.1s
  Fold 3 done in 15.5s
  Fold 4 done in 14.1s
  Fold 5 done in 15.6s
  OOF AUC: 0.63389
Trying C=4.0, class_weight=None
  Fold 1 done in 20.6s
  Fold 2 done in 20.0s
  Fold 3 done in 21.7s
  Fold 4 done in 21.2s
  Fold 5 done in 20.3s
  OOF AUC: 0.62974
Trying C=4.0, class_weight=balanced
  Fold 1 done in 18.0s
  Fold 2 done in 18.3s
  Fold 3 done in 16.6s
  Fold 4 done in 20.1s
  Fold 5 done in 16.5s
  OOF AUC: 0.62920
Trying C=8.0, class_weight=None
  Fold 1 done in 23.8s
  Fold 2 done in 24.2s
  Fold 3 done in 25.6s
  Fold 4 done in 25.7s
  Fold 5 done in 24.1s
  OOF AUC: 0.62554
Trying C=8.0, class_weight=balanced
  Fold 1 done in 20.7s
  Fold 2 done in 22.0s
  Fold 3 done in 22.0s
  Fold 4 done in 23.0s
  Fold 5 done in 21.8s
  OOF AUC: 0.62519
Best OOF AUC: 0.63452 with C=2.0 class_weight=None
Saved submission_lr_improved.csv (1162, 2) min/max: 0.02404523827135563 0.7660126686096191
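
The grid's OOF AUCs differ by less than 0.01 across all six settings. For scale, the per-fold AUCs logged for the first baseline can be summarised as below. This is a rough sanity check on how much weight small grid differences deserve, not a formal significance test, and the fold values come from the earlier baseline configuration rather than from the grid runs themselves.

```python
import numpy as np

# Per-fold AUCs copied from the first baseline's log output
fold_aucs = np.array([0.61902, 0.64731, 0.62645, 0.59316, 0.61375])
# OOF AUCs copied from the grid log: (C, class_weight) -> AUC
grid_aucs = {
    (2.0, None): 0.63452, (2.0, "balanced"): 0.63389,
    (4.0, None): 0.62974, (4.0, "balanced"): 0.62920,
    (8.0, None): 0.62554, (8.0, "balanced"): 0.62519,
}

fold_mean = fold_aucs.mean()
fold_std = fold_aucs.std(ddof=1)  # sample standard deviation across the 5 folds
grid_range = max(grid_aucs.values()) - min(grid_aucs.values())

print(f"fold AUC mean={fold_mean:.4f}, std={fold_std:.4f}")
print(f"grid AUC range={grid_range:.4f}")
```

The spread between folds is larger than the whole range of the grid, so the ranking of neighbouring settings (for example `class_weight=None` versus `'balanced'` at `C=2.0`) should be read cautiously.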


In [4]:
# NB-SVM (word 1-2) + OOF-weighted blend with improved LR
import time
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

def nbs_log_count_ratio(X, ybin, alpha=0.25):
    # ybin: 1 for positive, 0 for negative
    pos = X[ybin == 1].sum(axis=0) + alpha
    neg = X[ybin == 0].sum(axis=0) + alpha
    # Normalize to probabilities
    pos = np.asarray(pos).ravel()
    neg = np.asarray(neg).ravel()
    pos = pos / pos.sum()
    neg = neg / neg.sum()
    r = np.log(pos) - np.log(neg)
    return r

def nbs_transform(X, r):
    return X.multiply(r)

def run_nbsvm_cv(text_series, y, alpha=0.25, C=4.0, min_df=2, max_features=100000):
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    oof = np.zeros(len(text_series), dtype=np.float32)
    times = []
    for fold, (trn_idx, val_idx) in enumerate(skf.split(text_series, y), 1):
        t0 = time.time()
        Xtr_text = text_series.iloc[trn_idx]
        Xva_text = text_series.iloc[val_idx]
        ytr = y[trn_idx]
        vec = CountVectorizer(ngram_range=(1,2), min_df=min_df, max_features=max_features, lowercase=True, dtype=np.float32)
        X_tr = vec.fit_transform(Xtr_text)
        X_va = vec.transform(Xva_text)
        r = nbs_log_count_ratio(X_tr, ytr, alpha=alpha)
        X_tr_nb = nbs_transform(X_tr, r)
        X_va_nb = nbs_transform(X_va, r)
        clf = LogisticRegression(C=C, solver='liblinear', max_iter=2000)
        clf.fit(X_tr_nb, ytr)
        oof[val_idx] = clf.predict_proba(X_va_nb)[:,1]
        dt = time.time() - t0
        times.append(dt)
        print(f'  NB-SVM fold {fold} done in {dt:.1f}s', flush=True)
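    auc = roc_auc_score(y, oof)
    # vec, r and clf come from the last fold only; refit on the full training set before predicting test
    return auc, oof, vec, r, clf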

print('=== NB-SVM (word 1-2) CV ===', flush=True)
ybin = train[target_col].astype(int).values
nb_auc, nb_oof, nb_vec, nb_r, nb_clf = run_nbsvm_cv(full_text_train2, ybin, alpha=0.25, C=4.0, min_df=2, max_features=100000)
print(f'NB-SVM OOF AUC: {nb_auc:.5f}', flush=True)
np.save('oof_nbsvm.npy', nb_oof)

# Fit NB-SVM on full train and predict test
X_full_nb = nb_vec.fit_transform(full_text_train2)
nb_r_full = nbs_log_count_ratio(X_full_nb, ybin, alpha=0.25)
X_full_nb_tr = nbs_transform(X_full_nb, nb_r_full)
nb_full_clf = LogisticRegression(C=4.0, solver='liblinear', max_iter=2000)
nb_full_clf.fit(X_full_nb_tr, ybin)
X_test_nb = nb_vec.transform(full_text_test2)
X_test_nb_tr = nbs_transform(X_test_nb, nb_r_full)
test_nb = nb_full_clf.predict_proba(X_test_nb_tr)[:,1].astype(np.float32)
np.save('test_nbsvm.npy', test_nb)

# Blend NB-SVM with improved LR using OOF to choose weight
lr_oof = np.load('oof_lr_improved.npy')
weights = np.linspace(0.0, 1.0, 21)
best_w, best_auc = None, -1.0
for w in weights:
    blend = w*lr_oof + (1.0-w)*nb_oof
    auc = roc_auc_score(ybin, blend)
    if auc > best_auc:
        best_auc, best_w = auc, w
print(f'Best text-blend OOF AUC: {best_auc:.5f} at w(LR)={best_w:.2f}', flush=True)

test_lr = np.load('test_lr_improved.npy')
test_blend = best_w*test_lr + (1.0-best_w)*test_nb
sub_blend = pd.DataFrame({'request_id': test['request_id'], 'requester_received_pizza': test_blend})
sub_blend.to_csv('submission_text_blend.csv', index=False)
print('Saved submission_text_blend.csv', sub_blend.shape, 'min/max:', float(test_blend.min()), float(test_blend.max()))

=== NB-SVM (word 1-2) CV ===
  NB-SVM fold 1 done in 0.4s
  NB-SVM fold 2 done in 0.4s
  NB-SVM fold 3 done in 0.4s
  NB-SVM fold 4 done in 0.4s
  NB-SVM fold 5 done in 0.4s
NB-SVM OOF AUC: 0.56533
Best text-blend OOF AUC: 0.63452 at w(LR)=1.00
Saved submission_text_blend.csv (1162, 2) min/max: 0.02404523827135563 0.7660126686096191
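
To make the log-count ratio used in the NB-SVM cell concrete, here is a small worked example on a made-up count matrix. The function repeats the cell's formula under a different name (`log_count_ratio`); the toy matrix, labels, and `alpha=1.0` are assumptions chosen so the numbers are easy to follow.

```python
import numpy as np
from scipy import sparse


def log_count_ratio(X, ybin, alpha=1.0):
    """Smoothed, L1-normalised log-count ratio per feature (same formula as the cell above)."""
    pos = np.asarray(X[ybin == 1].sum(axis=0)).ravel() + alpha
    neg = np.asarray(X[ybin == 0].sum(axis=0)).ravel() + alpha
    return np.log(pos / pos.sum()) - np.log(neg / neg.sum())


# Toy count matrix: columns are three made-up n-grams, rows are four documents
X_toy = sparse.csr_matrix(np.array([
    [2, 0, 1],   # positive
    [1, 0, 0],   # positive
    [0, 3, 1],   # negative
    [0, 1, 1],   # negative
], dtype=np.float64))
y_toy = np.array([1, 1, 0, 0])

r_toy = log_count_ratio(X_toy, y_toy, alpha=1.0)
# Scale every column by its ratio; convert back to CSR for downstream estimators
X_scaled = X_toy.multiply(r_toy).tocsr()

print("r =", np.round(r_toy, 3))
```

The first n-gram appears only in positive documents and gets a positive ratio, the second appears only in negatives and gets a negative one, and the third is spread across both classes and lands near zero. The classifier is then trained on the scaled matrix rather than on the raw counts.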


In [8]:
# Meta + text features (leakage-safe) with LR + RAOP lexicons
import re, time
from datetime import datetime, timezone
from scipy import sparse
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

def count_urls(s: str) -> int:
    return len(re.findall(r'https?://\S+', s))

def has_imgur(s: str) -> int:
    return 1 if re.search(r'imgur\.com', s, flags=re.IGNORECASE) else 0

def count_digits(s: str) -> int:
    return sum(ch.isdigit() for ch in s)

def dollar_flag(s: str) -> int:
    return 1 if ('$' in s) or re.search(r'\b\d+\s*(dollars|bucks)\b', s, flags=re.IGNORECASE) else 0

def caps_ratio(s: str) -> float:
    if not s:
        return 0.0
    letters = [c for c in s if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c.isupper() for c in letters) / max(1, len(letters))

def word_count(s: str) -> int:
    return len(s.split()) if s else 0

def exclam_count(s: str) -> int:
    return s.count('!') if s else 0

def question_count(s: str) -> int:
    return s.count('?') if s else 0

def parse_subreddit_count(x) -> int:
    # requester_subreddits_at_request is a list; fall back to 0 otherwise
    if isinstance(x, list):
        return len(x)
    return 0

def safe_log1p_signed(x):
    # signed log1p for values that can be negative
    return np.sign(x) * np.log1p(np.abs(x))

LEX_PATTERNS = {
    # Politeness / gratitude
    'lex_please': r'\bplease\b',
    'lex_thanks': r'\b(thank you|thanks in advance|thanks|thank|tia)\b',
    'lex_appreciate': r'\b(appreciate|appreciated)\b',
    # Reciprocity / willingness / repay
    'lex_pay_it_forward': r'\b(pay it forward|return the favor)\b',
    'lex_repay': r'\b(repay|pay you back|pay back)\b',
    'lex_willing': r'\b(willing to|i\'ll|i will|i can)\b',
    'lex_karma': r'\bkarma\b',
    # Evidence / credibility
    'lex_evidence': r'\b(proof|receipt|photo|picture|pic|verify|verification|evidence)\b',
    'lex_imgur_word': r'\bimgur\b',
    # Hardship / need
    'lex_student': r'\b(student|college|university|finals|exam|midterm)\b',
    'lex_jobloss': r'\b(unemployed|laid off|lost my job|between jobs|job hunt)\b',
    'lex_broke': r'\b(broke)\b',
    'lex_rent_bills': r'\b(rent|bill|bills|utilities|electric|gas|water|paycheck)\b',
    'lex_family': r'\b(family|kids?|children|baby|pregnant|son|daughter|wife|husband)\b',
    'lex_hungry': r'\b(hungry|starving|no food)\b|\bfood (stamps|pantry)\b',
    'lex_struggling': r'\b(desperate|struggling)\b',
    # Urgency / help
    'lex_urgency': r'\b(tonight|today|tomorrow|asap|urgent)\b',
    'lex_help': r'\bhelp\b',
}

def add_lexicons(df: pd.DataFrame, title_col: str, body_col: str) -> pd.DataFrame:
    txt = (df[title_col].fillna('').astype(str) + ' ' + df[body_col].fillna('').astype(str)).str.lower()
    out = pd.DataFrame(index=df.index)
    for name, pat in LEX_PATTERNS.items():
        out[name] = txt.str.count(pat, flags=re.IGNORECASE).astype(float)
    return out

def build_meta(df: pd.DataFrame) -> pd.DataFrame:
    title = df[title_col].fillna('').astype(str)
    body = df[body_col].fillna('').astype(str)
    ts = pd.to_numeric(df['unix_timestamp_of_request'], errors='coerce')
    dt = pd.to_datetime(ts, unit='s', utc=True)
    out = pd.DataFrame(index=df.index)
    # user/account features (at_request only)
    keep_cols = [
        'requester_account_age_in_days_at_request',
        'requester_days_since_first_post_on_raop_at_request',
        'requester_number_of_comments_at_request',
        'requester_number_of_comments_in_raop_at_request',
        'requester_number_of_posts_at_request',
        'requester_number_of_posts_on_raop_at_request',
        'requester_number_of_subreddits_at_request',
        'requester_upvotes_minus_downvotes_at_request',
        'requester_upvotes_plus_downvotes_at_request',
    ]
    for c in keep_cols:
        if c in df.columns:
            out[c] = pd.to_numeric(df[c], errors='coerce')
        else:
            out[c] = np.nan
    # subreddit count
    if 'requester_subreddits_at_request' in df.columns:
        out['requester_subreddits_count'] = df['requester_subreddits_at_request'].apply(parse_subreddit_count).astype(float)
    else:
        out['requester_subreddits_count'] = np.nan
    # temporal
    out['hour'] = dt.dt.hour.astype(float)
    out['weekday'] = dt.dt.weekday.astype(float)
    out['is_weekend'] = dt.dt.weekday.isin([5,6]).astype(float)
    # text stats
    out['title_len_chars'] = title.str.len().astype(float)
    out['title_len_words'] = title.apply(word_count).astype(float)
    out['body_len_chars'] = body.str.len().astype(float)
    out['body_len_words'] = body.apply(word_count).astype(float)
    out['url_count'] = body.apply(count_urls).astype(float)
    out['has_imgur'] = body.apply(has_imgur).astype(float)
    out['digits_count'] = body.apply(count_digits).astype(float)
    out['dollar_flag'] = body.apply(dollar_flag).astype(float)
    out['caps_ratio'] = body.apply(caps_ratio).astype(float)
    out['exclam_count'] = body.apply(exclam_count).astype(float)
    out['question_count'] = body.apply(question_count).astype(float)
    # lexicon counts
    lex = add_lexicons(df, title_col, body_col)
    out = pd.concat([out, lex], axis=1)
    # Fill NaNs and infs
    out = out.replace([np.inf, -np.inf], np.nan).fillna(0.0)
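    # Apply log transforms safely: log1p for nonnegative columns (heavy-tailed counts; binary flags and
    # hour/weekday are included too, where log1p is only a monotone rescale), signed log1p for possible negatives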
    nonneg_cols = [
        'requester_account_age_in_days_at_request',
        'requester_days_since_first_post_on_raop_at_request',
        'requester_number_of_comments_at_request',
        'requester_number_of_comments_in_raop_at_request',
        'requester_number_of_posts_at_request',
        'requester_number_of_posts_on_raop_at_request',
        'requester_number_of_subreddits_at_request',
        'requester_upvotes_plus_downvotes_at_request',
        'requester_subreddits_count',
        'hour','weekday','is_weekend',
        'title_len_chars','title_len_words','body_len_chars','body_len_words',
        'url_count','has_imgur','digits_count','dollar_flag','exclam_count','question_count',
        # lexicon features (all counts, nonnegative)
        *list(LEX_PATTERNS.keys()),
    ]
    for c in nonneg_cols:
        if c in out.columns:
            out[c] = np.log1p(np.clip(out[c], a_min=0.0, a_max=None))
    # Signed log for karma difference (can be negative)
    if 'requester_upvotes_minus_downvotes_at_request' in out.columns:
        out['requester_upvotes_minus_downvotes_at_request'] = safe_log1p_signed(out['requester_upvotes_minus_downvotes_at_request'].values)
    # caps_ratio already in [0,1]; keep as-is
    out = out.replace([np.inf, -np.inf], 0.0).fillna(0.0)
    return out

meta_train = build_meta(train)
meta_test = build_meta(test)
print('Meta shapes:', meta_train.shape, meta_test.shape, flush=True)

# Vectorizers for text (reuse improved settings) - smaller to speed up if needed
def get_vecs_meta():
    wv = TfidfVectorizer(ngram_range=(1,2), min_df=3, max_df=0.95, max_features=100000,
                         sublinear_tf=True, lowercase=True, dtype=np.float32)
    cv = TfidfVectorizer(analyzer='char', ngram_range=(3,6), min_df=2, max_features=200000,
                         sublinear_tf=True, lowercase=True, dtype=np.float32)
    return wv, cv

y = train[target_col].astype(int).values
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
oof_meta = np.zeros(len(train), dtype=np.float32)

def hstack3(X1, X2, X3):
    return sparse.hstack([X1, X2, X3], format='csr')

print('=== CV: Text (word+char) + Scaled Meta(+lexicons) -> LR (C=2, cw=balanced) ===', flush=True)
# Per expert advice, run a single config (C=2.0, class_weight='balanced') to save time;
# note the text-only grid marginally preferred class_weight=None (0.63452 vs 0.63389)
oof_tmp = np.zeros(len(train), dtype=np.float32)
times = []
for fold, (trn_idx, val_idx) in enumerate(skf.split(full_text_train2, y), 1):
    t0 = time.time()
    Xtr_text = full_text_train2.iloc[trn_idx]
    Xva_text = full_text_train2.iloc[val_idx]
    wv, cv = get_vecs_meta()
    Xw_tr = wv.fit_transform(Xtr_text)
    Xc_tr = cv.fit_transform(Xtr_text)
    Xw_va = wv.transform(Xva_text)
    Xc_va = cv.transform(Xva_text)
    # Scale meta within fold
    scaler = StandardScaler(with_mean=False)
    Xm_tr = scaler.fit_transform(meta_train.iloc[trn_idx])
    Xm_va = scaler.transform(meta_train.iloc[val_idx])
    X_tr = hstack3(Xw_tr, Xc_tr, Xm_tr)
    X_va = hstack3(Xw_va, Xc_va, Xm_va)
    clf = LogisticRegression(penalty='l2', C=2.0, solver='saga', max_iter=3000, n_jobs=-1, class_weight='balanced', verbose=0)
    clf.fit(X_tr, y[trn_idx])
    oof_tmp[val_idx] = clf.predict_proba(X_va)[:,1]
    times.append(time.time()-t0)
    print(f'  fold {fold} in {times[-1]:.1f}s', flush=True)
auc = roc_auc_score(y, oof_tmp)
print(f'OOF AUC (meta+text+lexicons): {auc:.5f}', flush=True)
oof_meta = oof_tmp.copy()
np.save('oof_lr_text_meta.npy', oof_meta)

# Fit final with best params and predict test
wv_full, cv_full = get_vecs_meta()
Xw_full = wv_full.fit_transform(full_text_train2)
Xc_full = cv_full.fit_transform(full_text_train2)
scaler_full = StandardScaler(with_mean=False)
Xm_full = scaler_full.fit_transform(meta_train)
X_full = hstack3(Xw_full, Xc_full, Xm_full)
clf_full = LogisticRegression(penalty='l2', C=2.0, solver='saga', max_iter=3000, n_jobs=-1, class_weight='balanced', verbose=0)
clf_full.fit(X_full, y)
Xw_test = wv_full.transform(full_text_test2)
Xc_test = cv_full.transform(full_text_test2)
Xm_test = scaler_full.transform(meta_test)
X_test = hstack3(Xw_test, Xc_test, Xm_test)
test_pred_meta = clf_full.predict_proba(X_test)[:,1].astype(np.float32)
np.save('test_lr_text_meta.npy', test_pred_meta)
sub_meta = pd.DataFrame({'request_id': test['request_id'], 'requester_received_pizza': test_pred_meta})
sub_meta.to_csv('submission_text_meta.csv', index=False)
print('Saved submission_text_meta.csv', sub_meta.shape, 'min/max:', float(test_pred_meta.min()), float(test_pred_meta.max()))

Meta shapes: (2878, 42) (1162, 42)
=== CV: Text (word+char) + Scaled Meta(+lexicons) -> LR (C=2, cw=balanced) ===
  fold 1 in 54.9s
  fold 2 in 62.3s
  fold 3 in 59.0s
  fold 4 in 55.1s
  fold 5 in 58.5s
OOF AUC (meta+text+lexicons): 0.68297
Saved submission_text_meta.csv (1162, 2) min/max: 0.024701131507754326 0.970310389995575
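
The karma difference can be negative, which is why the cell above uses a signed variant of `log1p` for it. A short illustration of that transform on made-up values (the name `signed_log1p` and the sample array are placeholders):

```python
import numpy as np


def signed_log1p(x):
    """Compress magnitude with log1p while preserving sign (same formula as safe_log1p_signed)."""
    x = np.asarray(x, dtype=np.float64)
    return np.sign(x) * np.log1p(np.abs(x))


karma_demo = np.array([-500.0, -3.0, 0.0, 3.0, 500.0, 50000.0])  # made-up values
compressed = signed_log1p(karma_demo)
for raw, val in zip(karma_demo, compressed):
    print(f"{raw:>9.0f} -> {val:+.3f}")
```

The transform is odd-symmetric and monotone, so ordering is preserved while extreme values are pulled in before scaling.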


In [7]:
# Blend meta+text LR with improved text-only LR using OOF to pick weight, then write submission.csv
import numpy as np, pandas as pd
from sklearn.metrics import roc_auc_score

oof_meta = np.load('oof_lr_text_meta.npy')
oof_lr = np.load('oof_lr_improved.npy')
y_blend = train[target_col].astype(int).values

weights = np.linspace(0.0, 1.0, 41)  # finer grid
best_w, best_auc = None, -1.0
for w in weights:
    blend = w*oof_meta + (1.0-w)*oof_lr
    auc = roc_auc_score(y_blend, blend)
    if auc > best_auc:
        best_auc, best_w = auc, w
print(f'Best OOF AUC blend(meta,text-only)={best_auc:.5f} at w(meta)={best_w:.2f}', flush=True)

test_meta = np.load('test_lr_text_meta.npy')
test_lr = np.load('test_lr_improved.npy')
test_blend = best_w*test_meta + (1.0-best_w)*test_lr

sub = pd.DataFrame({'request_id': test['request_id'], 'requester_received_pizza': test_blend.astype(np.float32)})
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv from blended model:', sub.shape, 'min/max:', float(test_blend.min()), float(test_blend.max()))

Best OOF AUC blend(meta,text-only)=0.68381 at w(meta)=1.00
Saved submission.csv from blended model: (1162, 2) min/max: 0.032168105244636536 0.9459234476089478
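
The same grid search over a two-model blend weight appears in several cells. A minimal sketch of it as a reusable helper follows; `best_blend_weight` and the synthetic arrays are hypothetical names introduced here for illustration. Note that a weight selected on the same OOF predictions it is scored on gives an optimistic AUC estimate.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def best_blend_weight(y_true, oof_a, oof_b, n_grid=41):
    """Return (weight on oof_a, OOF AUC) maximizing AUC of w*oof_a + (1-w)*oof_b."""
    best_w, best_auc = 0.0, -1.0
    for w in np.linspace(0.0, 1.0, n_grid):
        auc = roc_auc_score(y_true, w * oof_a + (1.0 - w) * oof_b)
        if auc > best_auc:
            best_auc, best_w = auc, float(w)
    return best_w, best_auc

# Synthetic demo: one informative score and one pure-noise score.
rng = np.random.default_rng(0)
y_syn = rng.integers(0, 2, size=500)
signal = y_syn + rng.normal(0.0, 1.0, size=500)
noise = rng.normal(0.0, 1.0, size=500)

w_syn, auc_syn = best_blend_weight(y_syn, signal, noise)
print(f'best w(signal)={w_syn:.3f}, blended AUC={auc_syn:.5f}')
```

In the notebook this would be called as `best_blend_weight(y_blend, oof_meta, oof_lr)`, with the returned weight applied to the matching test predictions.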


In [9]:
# Fix NB-SVM normalization, then train a small meta stacker (meta+lexicons+OOFs) and blend
import time
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import HistGradientBoostingClassifier

def nbsvm_oof_fixed(text_series, y, alpha=1.0, C=2.0, min_df=2, max_features=120000):
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    oof = np.zeros(len(text_series), dtype=np.float32)
    for fold, (trn_idx, val_idx) in enumerate(skf.split(text_series, y), 1):
        t0 = time.time()
        Xtr_text = text_series.iloc[trn_idx]
        Xva_text = text_series.iloc[val_idx]
        ytr = y[trn_idx]
        vec = CountVectorizer(ngram_range=(1,2), min_df=min_df, max_features=max_features, lowercase=True, dtype=np.float32)
        X_tr = vec.fit_transform(Xtr_text)
        X_va = vec.transform(Xva_text)
        # Log-count ratio r from per-class mean counts (class sums divided by class size), smoothed by alpha
        num_pos = (ytr == 1).sum()
        num_neg = (ytr == 0).sum()
        pos_counts = X_tr[ytr == 1].sum(axis=0).A1 / max(1, num_pos)
        neg_counts = X_tr[ytr == 0].sum(axis=0).A1 / max(1, num_neg)
        r = np.log((pos_counts + alpha) / (neg_counts + alpha))
        X_tr_nb = X_tr.multiply(r)
        X_va_nb = X_va.multiply(r)
        clf = LogisticRegression(C=C, solver='liblinear', max_iter=2000)
        clf.fit(X_tr_nb, ytr)
        oof[val_idx] = clf.predict_proba(X_va_nb)[:,1]
        print(f'  NB-SVM(fixed) fold {fold} in {time.time()-t0:.1f}s', flush=True)
    auc = roc_auc_score(y, oof)
    return auc, oof, vec, r, clf

print('=== NB-SVM fixed normalization CV ===', flush=True)
ybin = train[target_col].astype(int).values
nb_auc2, nb_oof2, nb_vec2, nb_r2, nb_clf2 = nbsvm_oof_fixed(full_text_train2, ybin, alpha=1.0, C=2.0, min_df=2, max_features=120000)
print(f'NB-SVM fixed OOF AUC: {nb_auc2:.5f}', flush=True)
np.save('oof_nbsvm_fixed.npy', nb_oof2)

# Fit NB-SVM fixed on full train and predict test
X_full_nb2 = nb_vec2.fit_transform(full_text_train2)
num_pos_full = (ybin == 1).sum()
num_neg_full = (ybin == 0).sum()
pos_full = X_full_nb2[ybin == 1].sum(axis=0).A1 / max(1, num_pos_full)
neg_full = X_full_nb2[ybin == 0].sum(axis=0).A1 / max(1, num_neg_full)
r_full = np.log((pos_full + 1.0) / (neg_full + 1.0))
X_full_nb2_tr = X_full_nb2.multiply(r_full)
nb_full2 = LogisticRegression(C=2.0, solver='liblinear', max_iter=2000)
nb_full2.fit(X_full_nb2_tr, ybin)
X_test_nb2 = nb_vec2.transform(full_text_test2)
X_test_nb2_tr = X_test_nb2.multiply(r_full)
test_nb2 = nb_full2.predict_proba(X_test_nb2_tr)[:,1].astype(np.float32)
np.save('test_nbsvm_fixed.npy', test_nb2)

# Build stacker features: dense meta (with lexicons) + OOF columns from base models
oof_meta_lr = np.load('oof_lr_text_meta.npy')
oof_lr_only = np.load('oof_lr_improved.npy')
oof_nb_fixed = np.load('oof_nbsvm_fixed.npy')

X_stacker = np.hstack([meta_train.values,
                        oof_meta_lr.reshape(-1,1),
                        oof_lr_only.reshape(-1,1),
                        oof_nb_fixed.reshape(-1,1)])
y_stacker = ybin

print('=== Stacker CV (HistGradientBoosting) on meta+lexicons+OOFs ===', flush=True)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
oof_stack = np.zeros(len(train), dtype=np.float32)
for fold, (trn_idx, val_idx) in enumerate(skf.split(X_stacker, y_stacker), 1):
    X_tr, X_va = X_stacker[trn_idx], X_stacker[val_idx]
    y_tr, y_va = y_stacker[trn_idx], y_stacker[val_idx]
    clf = HistGradientBoostingClassifier(max_depth=3, learning_rate=0.075, max_iter=600,
                                         early_stopping=True, validation_fraction=0.1,
                                         random_state=42)
    t0 = time.time()
    clf.fit(X_tr, y_tr)
    oof_stack[val_idx] = clf.predict_proba(X_va)[:,1]
    print(f'  Stacker fold {fold} in {time.time()-t0:.1f}s', flush=True)
auc_stack = roc_auc_score(y_stacker, oof_stack)
print(f'Stacker OOF AUC: {auc_stack:.5f}', flush=True)
np.save('oof_stacker.npy', oof_stack)

# Fit stacker on full train and predict test
test_meta_lr = np.load('test_lr_text_meta.npy')
test_lr_only = np.load('test_lr_improved.npy')
test_nb_fixed = np.load('test_nbsvm_fixed.npy')
X_test_stacker = np.hstack([meta_test.values,
                             test_meta_lr.reshape(-1,1),
                             test_lr_only.reshape(-1,1),
                             test_nb_fixed.reshape(-1,1)])
final_stacker = HistGradientBoostingClassifier(max_depth=3, learning_rate=0.075, max_iter=600,
                                               early_stopping=True, validation_fraction=0.1,
                                               random_state=42)
final_stacker.fit(X_stacker, y_stacker)
test_stack = final_stacker.predict_proba(X_test_stacker)[:,1].astype(np.float32)
np.save('test_stacker.npy', test_stack)

# Blend LR(meta+text+lex) with stacker using OOF to choose weight
weights = np.linspace(0.0, 1.0, 41)
best_w, best_auc = None, -1.0
for w in weights:
    blend = w*oof_meta_lr + (1.0-w)*oof_stack
    auc = roc_auc_score(ybin, blend)
    if auc > best_auc:
        best_auc, best_w = auc, w
print(f'Best OOF AUC blend(LR_meta_text, stacker)={best_auc:.5f} at w(LR)={best_w:.2f}', flush=True)
test_blend = best_w*test_meta_lr + (1.0-best_w)*test_stack
sub = pd.DataFrame({'request_id': test['request_id'], 'requester_received_pizza': test_blend})
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv (stack blend):', sub.shape, 'min/max:', float(test_blend.min()), float(test_blend.max()))

=== NB-SVM fixed normalization CV ===
  NB-SVM(fixed) fold 1 in 0.4s
  NB-SVM(fixed) fold 2 in 0.4s
  NB-SVM(fixed) fold 3 in 0.4s
  NB-SVM(fixed) fold 4 in 0.4s
  NB-SVM(fixed) fold 5 in 0.4s
NB-SVM fixed OOF AUC: 0.59933
=== Stacker CV (HistGradientBoosting) on meta+lexicons+OOFs ===
  Stacker fold 1 in 0.0s
  Stacker fold 2 in 0.0s
  Stacker fold 3 in 0.0s
  Stacker fold 4 in 0.0s
  Stacker fold 5 in 0.0s
Stacker OOF AUC: 0.66893
Best OOF AUC blend(LR_meta_text, stacker)=0.68297 at w(LR)=1.00
Saved submission.csv (stack blend): (1162, 2) min/max: 0.024701131507754326 0.970310389995575
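
In the NB-SVM cell, class count sums are divided by the number of documents before `alpha` is added. With `alpha=1.0`, the smoothing term can dominate those small per-document averages and pull `r` toward zero for most n-grams, which may contribute to the weak 0.59933 OOF AUC. A commonly used alternative adds `alpha` to the raw class counts and then normalizes each class vector by its L1 norm. The sketch below illustrates that variant on toy counts; `nb_log_count_ratio` and the toy arrays are hypothetical names, and this is not the notebook's code.

```python
import numpy as np
from scipy import sparse

def nb_log_count_ratio(X, y, alpha=1.0):
    """r = log((p / ||p||_1) / (q / ||q||_1)), with p, q the alpha-smoothed raw class counts."""
    X = sparse.csr_matrix(X)
    y = np.asarray(y)
    p = alpha + np.asarray(X[y == 1].sum(axis=0)).ravel()  # positive-class counts
    q = alpha + np.asarray(X[y == 0].sum(axis=0)).ravel()  # negative-class counts
    return np.log((p / p.sum()) / (q / q.sum()))

# Toy counts: feature 0 occurs only in positives, feature 1 only in negatives,
# feature 2 equally in both.
X_toy = sparse.csr_matrix(np.array([[3, 0, 1],
                                    [2, 0, 1],
                                    [0, 4, 1],
                                    [0, 1, 1]], dtype=np.float32))
y_toy = np.array([1, 1, 0, 0])

r_toy = nb_log_count_ratio(X_toy, y_toy)
print(r_toy)  # positive for feature 0, negative for feature 1, about zero for feature 2
```

As in the notebook, `r` must be computed inside each training fold and then applied to the validation matrix with `X.multiply(r)`.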


In [10]:
# Add subreddit bag (TF-IDF of requester_subreddits_at_request) to LR meta+text pipeline
import os, re, time  # os is needed for the os.path.exists checks in the blend step below
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

def build_subreddit_text(df: pd.DataFrame) -> pd.Series:
    if 'requester_subreddits_at_request' not in df.columns:
        return pd.Series(['']*len(df))
    def to_line(x):
        if isinstance(x, list):
            # join subreddit names with space; ensure strings and lowercase
            return ' '.join([str(s).lower() for s in x if isinstance(s, str)])
        return ''
    return df['requester_subreddits_at_request'].apply(to_line)

sub_train_txt = build_subreddit_text(train)
sub_test_txt = build_subreddit_text(test)

def get_text_vecs():
    wv = TfidfVectorizer(ngram_range=(1,2), min_df=3, max_df=0.95, max_features=100000,
                         sublinear_tf=True, lowercase=True, dtype=np.float32)
    cv = TfidfVectorizer(analyzer='char', ngram_range=(3,6), min_df=2, max_features=200000,
                         sublinear_tf=True, lowercase=True, dtype=np.float32)
    return wv, cv

def get_sub_vec():
    # Compact subreddit vocabulary
    return TfidfVectorizer(ngram_range=(1,1), min_df=3, max_features=600,
                           lowercase=True, token_pattern=r'[^\s]+' , dtype=np.float32)

def hstack4(X1, X2, X3, X4):
    return sparse.hstack([X1, X2, X3, X4], format='csr')

y = train[target_col].astype(int).values
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
oof_sr = np.zeros(len(train), dtype=np.float32)

print('=== CV: Text(word+char) + Scaled Meta(+lex) + Subreddit TF-IDF -> LR (C=2, cw=balanced) ===', flush=True)
times = []
for fold, (trn_idx, val_idx) in enumerate(skf.split(full_text_train2, y), 1):
    t0 = time.time()
    # Text
    wv, cv = get_text_vecs()
    Xw_tr = wv.fit_transform(full_text_train2.iloc[trn_idx])
    Xc_tr = cv.fit_transform(full_text_train2.iloc[trn_idx])
    Xw_va = wv.transform(full_text_train2.iloc[val_idx])
    Xc_va = cv.transform(full_text_train2.iloc[val_idx])
    # Meta (already built in previous cell as meta_train/meta_test); scale in-fold
    scaler = StandardScaler(with_mean=False)
    Xm_tr = scaler.fit_transform(meta_train.iloc[trn_idx])
    Xm_va = scaler.transform(meta_train.iloc[val_idx])
    # Subreddits TF-IDF
    sv = get_sub_vec()
    Xs_tr = sv.fit_transform(sub_train_txt.iloc[trn_idx])
    Xs_va = sv.transform(sub_train_txt.iloc[val_idx])
    # Stack
    X_tr = hstack4(Xw_tr, Xc_tr, Xm_tr, Xs_tr)
    X_va = hstack4(Xw_va, Xc_va, Xm_va, Xs_va)
    clf = LogisticRegression(penalty='l2', C=2.0, solver='saga', max_iter=3000, n_jobs=-1, class_weight='balanced', verbose=0)
    clf.fit(X_tr, y[trn_idx])
    oof_sr[val_idx] = clf.predict_proba(X_va)[:,1]
    times.append(time.time()-t0)
    print(f'  fold {fold} in {times[-1]:.1f}s', flush=True)
auc_sr = roc_auc_score(y, oof_sr)
print(f'OOF AUC (meta+text+lex+subs): {auc_sr:.5f}', flush=True)
np.save('oof_lr_text_meta_subs.npy', oof_sr)

# Fit final and predict test
wv_full, cv_full = get_text_vecs()
Xw_full = wv_full.fit_transform(full_text_train2)
Xc_full = cv_full.fit_transform(full_text_train2)
scaler_full = StandardScaler(with_mean=False)
Xm_full = scaler_full.fit_transform(meta_train)
sv_full = get_sub_vec()
Xs_full = sv_full.fit_transform(sub_train_txt)
X_full = hstack4(Xw_full, Xc_full, Xm_full, Xs_full)
clf_full = LogisticRegression(penalty='l2', C=2.0, solver='saga', max_iter=3000, n_jobs=-1, class_weight='balanced', verbose=0)
clf_full.fit(X_full, y)
Xw_test = wv_full.transform(full_text_test2)
Xc_test = cv_full.transform(full_text_test2)
Xm_test = scaler_full.transform(meta_test)
Xs_test = sv_full.transform(sub_test_txt)
X_test = hstack4(Xw_test, Xc_test, Xm_test, Xs_test)
test_pred_sr = clf_full.predict_proba(X_test)[:,1].astype(np.float32)
np.save('test_lr_text_meta_subs.npy', test_pred_sr)
sub_sr = pd.DataFrame({'request_id': test['request_id'], 'requester_received_pizza': test_pred_sr})
sub_sr.to_csv('submission_text_meta_subs.csv', index=False)
print('Saved submission_text_meta_subs.csv', sub_sr.shape, 'min/max:', float(test_pred_sr.min()), float(test_pred_sr.max()))

# Optional: OOF-weighted blend among (meta+text+lex) and (meta+text+lex+subs)
oof_base = np.load('oof_lr_text_meta.npy') if os.path.exists('oof_lr_text_meta.npy') else oof_sr
best_w, best_auc = 1.0, auc_sr
for w in np.linspace(0.0,1.0,21):
    blend = w*oof_sr + (1.0-w)*oof_base
    a = roc_auc_score(y, blend)
    if a > best_auc:
        best_auc, best_w = a, w
print(f'Best OOF blend(base vs subs) AUC: {best_auc:.5f} at w(subs)={best_w:.2f}', flush=True)
test_base = np.load('test_lr_text_meta.npy') if os.path.exists('test_lr_text_meta.npy') else test_pred_sr
test_blend = best_w*test_pred_sr + (1.0-best_w)*test_base
pd.DataFrame({'request_id': test['request_id'], 'requester_received_pizza': test_blend}).to_csv('submission.csv', index=False)
print('Updated submission.csv from base/subs blend.')

=== CV: Text(word+char) + Scaled Meta(+lex) + Subreddit TF-IDF -> LR (C=2, cw=balanced) ===
  fold 1 in 48.4s
  fold 2 in 58.1s
  fold 3 in 56.2s
  fold 4 in 47.8s
  fold 5 in 51.8s
OOF AUC (meta+text+lex+subs): 0.67327
Saved submission_text_meta_subs.csv (1162, 2) min/max: 0.016643734648823738 0.9696571230888367
Best OOF blend(base vs subs) AUC: 0.68409 at w(subs)=0.20
Updated submission.csv from base/subs blend.
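
To make the subreddit-bag step concrete, here is a small sketch of how list-valued subreddit fields become whitespace-joined lines and then one TF-IDF column per subreddit. The toy frame and `min_df=1` are assumptions for illustration only (the cell above uses `min_df=3` and `max_features=600`); `to_line` mirrors the helper defined in the cell.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical rows: two users with history, one missing value, one empty list.
toy = pd.DataFrame({'requester_subreddits_at_request': [
    ['AskReddit', 'Random_Acts_Of_Pizza'],
    ['AskReddit', 'pics'],
    np.nan,
    [],
]})

def to_line(x):
    # Lists become space-joined lowercase names; anything else becomes ''.
    if isinstance(x, list):
        return ' '.join(str(s).lower() for s in x if isinstance(s, str))
    return ''

toy_txt = toy['requester_subreddits_at_request'].apply(to_line)

# token_pattern=r'[^\s]+' treats every whitespace-delimited name as one token.
toy_vec = TfidfVectorizer(ngram_range=(1, 1), min_df=1, lowercase=True,
                          token_pattern=r'[^\s]+')
X_sub_toy = toy_vec.fit_transform(toy_txt)
toy_vocab = sorted(toy_vec.vocabulary_)
print(toy_vocab, X_sub_toy.shape)
```

Rows with no subreddit history end up as all-zero vectors, so the linear model falls back on the text and meta blocks for those users.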


In [11]:
# Try char_wb instead of char for the text+meta(+lex) LR (single best config) to seek OOF lift
import time
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

def get_vecs_meta_charwb():
    wv = TfidfVectorizer(ngram_range=(1,2), min_df=3, max_df=0.95, max_features=100000,
                         sublinear_tf=True, lowercase=True, dtype=np.float32)
    # switch to char_wb and keep (3,6); sometimes improves RAOP
    cv = TfidfVectorizer(analyzer='char_wb', ngram_range=(3,6), min_df=2, max_features=200000,
                         sublinear_tf=True, lowercase=True, dtype=np.float32)
    return wv, cv

def hstack3(X1, X2, X3):
    return sparse.hstack([X1, X2, X3], format='csr')

y = train[target_col].astype(int).values
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
oof_cwb = np.zeros(len(train), dtype=np.float32)

print('=== CV: word(1-2) TF-IDF + char_wb(3-6) + Scaled Meta(+lex) -> LR (C=2, cw=balanced) ===', flush=True)
times = []
for fold, (trn_idx, val_idx) in enumerate(skf.split(full_text_train2, y), 1):
    t0 = time.time()
    Xtr_text = full_text_train2.iloc[trn_idx]
    Xva_text = full_text_train2.iloc[val_idx]
    wv, cv = get_vecs_meta_charwb()
    Xw_tr = wv.fit_transform(Xtr_text)
    Xc_tr = cv.fit_transform(Xtr_text)
    Xw_va = wv.transform(Xva_text)
    Xc_va = cv.transform(Xva_text)
    scaler = StandardScaler(with_mean=False)
    Xm_tr = scaler.fit_transform(meta_train.iloc[trn_idx])
    Xm_va = scaler.transform(meta_train.iloc[val_idx])
    X_tr = hstack3(Xw_tr, Xc_tr, Xm_tr)
    X_va = hstack3(Xw_va, Xc_va, Xm_va)
    clf = LogisticRegression(penalty='l2', C=2.0, solver='saga', max_iter=3000, n_jobs=-1, class_weight='balanced', verbose=0)
    clf.fit(X_tr, y[trn_idx])
    oof_cwb[val_idx] = clf.predict_proba(X_va)[:,1]
    times.append(time.time()-t0)
    print(f'  fold {fold} in {times[-1]:.1f}s', flush=True)
auc_cwb = roc_auc_score(y, oof_cwb)
print(f'OOF AUC (char_wb variant): {auc_cwb:.5f}', flush=True)
np.save('oof_lr_text_meta_charwb.npy', oof_cwb)

# Fit final and predict test with char_wb if it helps; else keep prior best
wv_full, cv_full = get_vecs_meta_charwb()
Xw_full = wv_full.fit_transform(full_text_train2)
Xc_full = cv_full.fit_transform(full_text_train2)
scaler_full = StandardScaler(with_mean=False)
Xm_full = scaler_full.fit_transform(meta_train)
X_full = hstack3(Xw_full, Xc_full, Xm_full)
clf_full = LogisticRegression(penalty='l2', C=2.0, solver='saga', max_iter=3000, n_jobs=-1, class_weight='balanced', verbose=0)
clf_full.fit(X_full, y)
Xw_test = wv_full.transform(full_text_test2)
Xc_test = cv_full.transform(full_text_test2)
Xm_test = scaler_full.transform(meta_test)
X_test = hstack3(Xw_test, Xc_test, Xm_test)
test_pred_cwb = clf_full.predict_proba(X_test)[:,1].astype(np.float32)
np.save('test_lr_text_meta_charwb.npy', test_pred_cwb)

# Choose best among (meta+text+lex char) vs (char_wb) by OOF and update submission if improved
try:
    oof_base = np.load('oof_lr_text_meta.npy')
    test_base = np.load('test_lr_text_meta.npy')
    base_auc = roc_auc_score(y, oof_base)
except Exception:
    oof_base = oof_cwb
    test_base = test_pred_cwb
    base_auc = auc_cwb

if auc_cwb > base_auc:
    print(f'char_wb improved OOF from {base_auc:.5f} to {auc_cwb:.5f}; updating submission.csv')
    pd.DataFrame({'request_id': test['request_id'], 'requester_received_pizza': test_pred_cwb}).to_csv('submission.csv', index=False)
else:
    print(f'char_wb did not improve OOF ({auc_cwb:.5f} <= {base_auc:.5f}); keeping current submission.csv')

=== CV: word(1-2) TF-IDF + char_wb(3-6) + Scaled Meta(+lex) -> LR (C=2, cw=balanced) ===
  fold 1 in 29.1s
  fold 2 in 33.1s
  fold 3 in 33.9s
  fold 4 in 29.4s
  fold 5 in 30.9s
OOF AUC (char_wb variant): 0.68276
char_wb did not improve OOF (0.68276 <= 0.68297); keeping current submission.csv
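
The difference between the two analyzers compared above is easiest to see on a tiny string. This sketch, using a hypothetical two-word input and trigrams only, lists which n-grams `analyzer='char'` produces that `analyzer='char_wb'` does not: `char` slides across word boundaries, while `char_wb` pads each word with spaces and never spans two words.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ['no pizza']  # hypothetical example string

char_vocab = set(
    CountVectorizer(analyzer='char', ngram_range=(3, 3)).fit(text).vocabulary_
)
char_wb_vocab = set(
    CountVectorizer(analyzer='char_wb', ngram_range=(3, 3)).fit(text).vocabulary_
)

# n-grams that cross the space between the two words exist only under 'char'
crossing = sorted(char_vocab - char_wb_vocab)
# n-grams anchored on a padded word edge exist only under 'char_wb'
edge_only = sorted(char_wb_vocab - char_vocab)
print('char only:   ', crossing)
print('char_wb only:', edge_only)
```

Because `char_wb` drops the cross-word n-grams, it yields a smaller feature space, which is consistent with the shorter fold times logged above, while the OOF AUC stayed essentially unchanged.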


In [53]:
# CatBoost TEXT+META with StratifiedGroupKFold by requester_username; seed-bag; OOF blend with LR (no custom text_processing to avoid param errors)
import os, time, numpy as np, pandas as pd
from sklearn.model_selection import StratifiedGroupKFold
from sklearn.metrics import roc_auc_score
import catboost as cb

assert 'meta_train' in globals() and 'meta_test' in globals(), 'Build meta_train/meta_test first'
assert 'train' in globals() and 'test' in globals() and 'title_col' in globals() and 'body_col' in globals()

y = train['requester_received_pizza'].astype(int).values
groups = train['requester_username'].fillna('').astype(str).values

# Fill missing text before casting: astype(str) first would turn NaN into the literal string 'nan'
df_tr = pd.concat([
    pd.DataFrame({'t_title': train[title_col].fillna('').astype(str),
                  't_body':  train[body_col].fillna('').astype(str)}).reset_index(drop=True),
    meta_train.reset_index(drop=True)
], axis=1)
df_te = pd.concat([
    pd.DataFrame({'t_title': test[title_col].fillna('').astype(str),
                  't_body':  test[body_col].fillna('').astype(str)}).reset_index(drop=True),
    meta_test.reset_index(drop=True)
], axis=1)
text_idx = [0, 1]

base_params = dict(
    loss_function='Logloss',
    eval_metric='AUC',
    depth=5,
    learning_rate=0.04,
    l2_leaf_reg=4.0,
    iterations=3000,
    early_stopping_rounds=200,
    random_strength=1.0,
    auto_class_weights='Balanced',
    border_count=128,
    verbose=100
)
try:
    from catboost.utils import get_gpu_device_count
    if get_gpu_device_count() > 0:
        base_params['task_type'] = 'GPU'
        print('CatBoost task_type: GPU', flush=True)
    else:
        print('CatBoost task_type: CPU', flush=True)
except Exception:
    pass

def fit_cb_seed(seed):
    params = dict(base_params, random_seed=int(seed))
    skf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
    oof = np.zeros(len(df_tr), dtype=np.float32)
    test_fold = []
    for fold, (trn_idx, val_idx) in enumerate(skf.split(df_tr, y, groups), 1):
        t0 = time.time()
        tr_pool = cb.Pool(df_tr.iloc[trn_idx], label=y[trn_idx], text_features=text_idx)
        va_pool = cb.Pool(df_tr.iloc[val_idx], label=y[val_idx], text_features=text_idx)
        model = cb.CatBoostClassifier(**params)
        model.fit(tr_pool, eval_set=va_pool, use_best_model=True)
        oof[val_idx] = model.predict_proba(va_pool)[:,1]
        if fold == 1:
            te_pool = cb.Pool(df_te, text_features=text_idx)
        test_fold.append(model.predict_proba(te_pool)[:,1].astype(np.float32))
        auc = roc_auc_score(y[val_idx], oof[val_idx])
        print(f'Fold {fold} seed {seed}: AUC {auc:.5f} | {time.time()-t0:.1f}s', flush=True)
    return oof, np.mean(np.vstack(test_fold), axis=0).astype(np.float32)

seeds = [42, 1337, 7]
oofs, tests = [], []
for s in seeds:
    oof_s, test_s = fit_cb_seed(s)
    oofs.append(oof_s); tests.append(test_s)
oof_cb = np.mean(np.vstack(oofs), axis=0).astype(np.float32)
test_cb = np.mean(np.vstack(tests), axis=0).astype(np.float32)
auc_cb = roc_auc_score(y, oof_cb)
print('CatBoost grouped bag OOF AUC:', f'{auc_cb:.5f}')
np.save('oof_catboost_text_meta_g.npy', oof_cb)
np.save('test_catboost_text_meta_g.npy', test_cb)

# Blend with best LR leg (prefer grouped LR if present; else fallback)
lr_oof_path = 'oof_lr_meta_g.npy' if os.path.exists('oof_lr_meta_g.npy') else ('oof_lr_text_meta.npy' if os.path.exists('oof_lr_text_meta.npy') else None)
lr_te_path  = 'test_lr_meta_g.npy' if os.path.exists('test_lr_meta_g.npy') else ('test_lr_text_meta.npy' if os.path.exists('test_lr_text_meta.npy') else None)

if lr_oof_path and lr_te_path:
    oof_lr = np.load(lr_oof_path); test_lr = np.load(lr_te_path)
    ws = np.linspace(0.0, 1.0, 41)
    best_w, best_auc = 0.0, roc_auc_score(y, oof_lr)
    for w in ws:
        a = roc_auc_score(y, w*oof_cb + (1.0-w)*oof_lr)
        if a > best_auc:
            best_auc, best_w = a, w
    # Shrink the OOF-optimal weight 15% of the way toward an equal blend (0.5)
    w_cb = 0.85*best_w + 0.15*0.5
    print(f'Best OOF blend AUC: {best_auc:.5f} at w(CB)={best_w:.2f} -> using w(CB)={w_cb:.2f}')
    test_blend = w_cb*test_cb + (1.0-w_cb)*test_lr
    pd.DataFrame({'request_id': test['request_id'], 'requester_received_pizza': test_blend}).to_csv('submission.csv', index=False)
else:
    pd.DataFrame({'request_id': test['request_id'], 'requester_received_pizza': test_cb}).to_csv('submission.csv', index=False)
print('Saved submission.csv (CatBoost blend or pure).')

CatBoost task_type: GPU


Default metric period is 5 because AUC is/are not implemented for GPU


0:	test: 0.5680620	best: 0.5680620 (0)	total: 17.3ms	remaining: 52s


100:	test: 0.6745113	best: 0.6776460 (82)	total: 1.7s	remaining: 48.7s


200:	test: 0.6676311	best: 0.6776460 (82)	total: 3.34s	remaining: 46.5s


bestTest = 0.6776459813
bestIteration = 82
Shrink model to first 83 iterations.
Fold 1 seed 42: AUC 0.67765 | 5.3s


Default metric period is 5 because AUC is/are not implemented for GPU


0:	test: 0.6073511	best: 0.6073511 (0)	total: 17.5ms	remaining: 52.5s


100:	test: 0.6956515	best: 0.7059892 (77)	total: 1.68s	remaining: 48.3s


200:	test: 0.6893278	best: 0.7059892 (77)	total: 3.33s	remaining: 46.4s


bestTest = 0.705989182
bestIteration = 77
Shrink model to first 78 iterations.
Fold 2 seed 42: AUC 0.70599 | 5.7s


Default metric period is 5 because AUC is/are not implemented for GPU


0:	test: 0.5904411	best: 0.5904411 (0)	total: 17.9ms	remaining: 53.6s


100:	test: 0.6587330	best: 0.6603368 (95)	total: 1.69s	remaining: 48.4s


200:	test: 0.6662069	best: 0.6662069 (200)	total: 3.33s	remaining: 46.3s


300:	test: 0.6755573	best: 0.6756375 (291)	total: 4.98s	remaining: 44.7s


400:	test: 0.6813793	best: 0.6858541 (370)	total: 6.63s	remaining: 43s


500:	test: 0.6774659	best: 0.6858541 (370)	total: 8.27s	remaining: 41.2s


bestTest = 0.6858540773
bestIteration = 370
Shrink model to first 371 iterations.
Fold 3 seed 42: AUC 0.68585 | 10.6s


Default metric period is 5 because AUC is/are not implemented for GPU


0:	test: 0.6104848	best: 0.6104848 (0)	total: 18.6ms	remaining: 55.8s


100:	test: 0.6786738	best: 0.6791282 (99)	total: 1.68s	remaining: 48.2s


200:	test: 0.6790583	best: 0.6859969 (152)	total: 3.32s	remaining: 46.2s


300:	test: 0.6855600	best: 0.6874475 (252)	total: 4.96s	remaining: 44.5s


400:	test: 0.6865562	best: 0.6874475 (252)	total: 6.59s	remaining: 42.7s


500:	test: 0.6895099	best: 0.6895099 (500)	total: 8.21s	remaining: 41s


600:	test: 0.6891254	best: 0.6907159 (518)	total: 9.85s	remaining: 39.3s


700:	test: 0.6893701	best: 0.6907159 (518)	total: 11.5s	remaining: 37.6s


bestTest = 0.690715909
bestIteration = 518
Shrink model to first 519 iterations.
Fold 4 seed 42: AUC 0.69072 | 13.0s


Default metric period is 5 because AUC is/are not implemented for GPU


0:	test: 0.6003522	best: 0.6003522 (0)	total: 17.8ms	remaining: 53.4s


100:	test: 0.7230480	best: 0.7300459 (63)	total: 1.69s	remaining: 48.4s


200:	test: 0.7326217	best: 0.7347940 (199)	total: 3.33s	remaining: 46.3s


300:	test: 0.7326837	best: 0.7351508 (255)	total: 4.97s	remaining: 44.6s


400:	test: 0.7221015	best: 0.7351508 (255)	total: 6.62s	remaining: 42.9s


bestTest = 0.7351508141
bestIteration = 255
Shrink model to first 256 iterations.
Fold 5 seed 42: AUC 0.73515 | 8.7s


Default metric period is 5 because AUC is/are not implemented for GPU


0:	test: 0.5502106	best: 0.5502106 (0)	total: 18.1ms	remaining: 54.4s


100:	test: 0.6681295	best: 0.6717786 (80)	total: 1.67s	remaining: 47.9s


200:	test: 0.6753312	best: 0.6785462 (128)	total: 3.31s	remaining: 46.2s


300:	test: 0.6767618	best: 0.6785462 (128)	total: 4.95s	remaining: 44.4s


400:	test: 0.6810861	best: 0.6816165 (367)	total: 6.58s	remaining: 42.7s


500:	test: 0.6812789	best: 0.6828061 (412)	total: 8.22s	remaining: 41s


600:	test: 0.6786909	best: 0.6828061 (412)	total: 9.85s	remaining: 39.3s
bestTest = 0.6828060746
bestIteration = 412
Shrink model to first 413 iterations.


Fold 1 seed 1337: AUC 0.68281 | 10.7s


Default metric period is 5 because AUC is/are not implemented for GPU


0:	test: 0.6074944	best: 0.6074944 (0)	total: 17.2ms	remaining: 51.6s


100:	test: 0.6927684	best: 0.6976426 (56)	total: 1.69s	remaining: 48.5s


200:	test: 0.6868907	best: 0.6976426 (56)	total: 3.34s	remaining: 46.5s


bestTest = 0.6976425648
bestIteration = 56
Shrink model to first 57 iterations.
Fold 2 seed 1337: AUC 0.69764 | 5.4s


Default metric period is 5 because AUC is/are not implemented for GPU


0:	test: 0.5734803	best: 0.5734803 (0)	total: 17.2ms	remaining: 51.6s


100:	test: 0.6666560	best: 0.6702967 (66)	total: 1.67s	remaining: 47.9s


200:	test: 0.6661428	best: 0.6702967 (66)	total: 3.3s	remaining: 46s


300:	test: 0.6731676	best: 0.6772414 (289)	total: 4.93s	remaining: 44.2s


400:	test: 0.6786367	best: 0.6812029 (370)	total: 6.57s	remaining: 42.6s


500:	test: 0.6785726	best: 0.6812029 (370)	total: 8.21s	remaining: 41s


bestTest = 0.6812028885
bestIteration = 370
Shrink model to first 371 iterations.
Fold 3 seed 1337: AUC 0.68120 | 10.4s


Default metric period is 5 because AUC is/are not implemented for GPU


0:	test: 0.5925790	best: 0.5925790 (0)	total: 17.4ms	remaining: 52.3s


100:	test: 0.6838472	best: 0.6838472 (100)	total: 1.68s	remaining: 48.2s


200:	test: 0.6951552	best: 0.6953999 (196)	total: 3.33s	remaining: 46.3s


300:	test: 0.6966408	best: 0.6966408 (300)	total: 4.96s	remaining: 44.4s


400:	test: 0.6954873	best: 0.6987556 (362)	total: 6.59s	remaining: 42.7s


500:	test: 0.6878146	best: 0.6987556 (362)	total: 8.23s	remaining: 41s


bestTest = 0.6987556219
bestIteration = 362
Shrink model to first 363 iterations.
Fold 4 seed 1337: AUC 0.69876 | 10.4s


Default metric period is 5 because AUC is/are not implemented for GPU


0:	test: 0.6270016	best: 0.6270016 (0)	total: 17.1ms	remaining: 51.2s


100:	test: 0.7286339	best: 0.7294874 (48)	total: 1.7s	remaining: 48.8s


200:	test: 0.7216981	best: 0.7294874 (48)	total: 3.36s	remaining: 46.9s


bestTest = 0.7294873595
bestIteration = 48
Shrink model to first 49 iterations.
Fold 5 seed 1337: AUC 0.72949 | 5.3s


Default metric period is 5 because AUC is/are not implemented for GPU


0:	test: 0.5514162	best: 0.5514162 (0)	total: 17.5ms	remaining: 52.5s


100:	test: 0.6634838	best: 0.6678884 (61)	total: 1.67s	remaining: 47.9s


200:	test: 0.6733860	best: 0.6750739 (180)	total: 3.3s	remaining: 46s


300:	test: 0.6699299	best: 0.6750739 (180)	total: 4.92s	remaining: 44.1s


bestTest = 0.6750739217
bestIteration = 180
Shrink model to first 181 iterations.
Fold 1 seed 7: AUC 0.67507 | 6.9s


Default metric period is 5 because AUC is/are not implemented for GPU


0:	test: 0.6092864	best: 0.6092864 (0)	total: 17.7ms	remaining: 53s


100:	test: 0.6741797	best: 0.6843581 (64)	total: 1.68s	remaining: 48.2s


200:	test: 0.6609589	best: 0.6843581 (64)	total: 3.32s	remaining: 46.2s


bestTest = 0.6843580604
bestIteration = 64
Shrink model to first 65 iterations.
Fold 2 seed 7: AUC 0.68436 | 5.5s


Default metric period is 5 because AUC is/are not implemented for GPU


0:	test: 0.6095349	best: 0.6095349 (0)	total: 19.3ms	remaining: 57.9s


100:	test: 0.6627426	best: 0.6676503 (45)	total: 1.67s	remaining: 48s


200:	test: 0.6754290	best: 0.6770489 (192)	total: 3.31s	remaining: 46s


300:	test: 0.6859984	best: 0.6859984 (300)	total: 4.95s	remaining: 44.4s


400:	test: 0.6902165	best: 0.6904411 (393)	total: 6.59s	remaining: 42.7s


500:	test: 0.6882117	best: 0.6927506 (426)	total: 8.25s	remaining: 41.2s


600:	test: 0.6888533	best: 0.6927506 (426)	total: 9.89s	remaining: 39.5s


bestTest = 0.6927505732
bestIteration = 426
Shrink model to first 427 iterations.
Fold 3 seed 7: AUC 0.69275 | 11.4s


Default metric period is 5 because AUC is/are not implemented for GPU


0:	test: 0.5922819	best: 0.5922819 (0)	total: 17.7ms	remaining: 53s


100:	test: 0.6898595	best: 0.6898595 (100)	total: 1.69s	remaining: 48.4s


200:	test: 0.6949804	best: 0.6956446 (186)	total: 3.33s	remaining: 46.3s


300:	test: 0.6973923	best: 0.7007830 (253)	total: 4.96s	remaining: 44.5s


400:	test: 0.6976895	best: 0.7007830 (253)	total: 6.6s	remaining: 42.8s


bestTest = 0.7007830143
bestIteration = 253
Shrink model to first 254 iterations.
Fold 4 seed 7: AUC 0.70078 | 8.6s


Default metric period is 5 because AUC is/are not implemented for GPU


0:	test: 0.6159462	best: 0.6159462 (0)	total: 17.2ms	remaining: 51.6s


100:	test: 0.7264151	best: 0.7266168 (99)	total: 1.67s	remaining: 48s


200:	test: 0.7195569	best: 0.7309459 (122)	total: 3.31s	remaining: 46s


300:	test: 0.7117211	best: 0.7309459 (122)	total: 4.94s	remaining: 44.3s


bestTest = 0.7309458852
bestIteration = 122
Shrink model to first 123 iterations.
Fold 5 seed 7: AUC 0.73095 | 6.4s


CatBoost grouped bag OOF AUC: 0.70023
Best OOF blend AUC: 0.70702 at w(CB)=0.80 -> using w(CB)=0.76
Saved submission.csv (CatBoost blend or pure).


In [54]:
# Quick re-blend: use best OOF weight (no shrink) and finer grid; write submission.csv
import os, numpy as np, pandas as pd
from sklearn.metrics import roc_auc_score

assert 'train' in globals() and 'test' in globals(), 'Run earlier cells to load train/test'
y = train['requester_received_pizza'].astype(int).values

cb_oof_path = 'oof_catboost_text_meta_g.npy'
cb_te_path  = 'test_catboost_text_meta_g.npy'
lr_oof_path = 'oof_lr_meta_g.npy' if os.path.exists('oof_lr_meta_g.npy') else ('oof_lr_text_meta.npy' if os.path.exists('oof_lr_text_meta.npy') else None)
lr_te_path  = 'test_lr_meta_g.npy' if os.path.exists('test_lr_meta_g.npy') else ('test_lr_text_meta.npy' if os.path.exists('test_lr_text_meta.npy') else None)
assert os.path.exists(cb_oof_path) and os.path.exists(cb_te_path), 'Missing CatBoost OOF/test npy artifacts'
assert lr_oof_path and lr_te_path, 'Missing LR OOF/test npy artifacts'

oof_cb = np.load(cb_oof_path).astype(np.float32)
test_cb = np.load(cb_te_path).astype(np.float32)
oof_lr = np.load(lr_oof_path).astype(np.float32)
test_lr = np.load(lr_te_path).astype(np.float32)

# Fine search around 0.70-0.90 and global coarse backup
ws_fine = np.arange(0.70, 0.901, 0.01)
ws_coarse = np.linspace(0.0, 1.0, 41)
ws = np.unique(np.concatenate([ws_coarse, ws_fine]))
best_w, best_auc = 0.0, roc_auc_score(y, oof_lr)
for w in ws:
    a = roc_auc_score(y, w*oof_cb + (1.0-w)*oof_lr)
    if a > best_auc:
        best_auc, best_w = a, float(w)
print(f'REBLEND Best OOF blend(LR,CB) AUC: {best_auc:.5f} at w(CB)={best_w:.2f}', flush=True)

test_blend = best_w*test_cb + (1.0-best_w)*test_lr
sub = pd.DataFrame({'request_id': test['request_id'], 'requester_received_pizza': test_blend})
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv with w(CB)=', best_w)

REBLEND Best OOF blend(LR,CB) AUC: 0.70702 at w(CB)=0.79


Saved submission.csv with w(CB)= 0.79
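
For reference, the earlier CatBoost cell did not use the OOF-optimal weight directly but pulled it toward an equal blend with `0.85*best_w + 0.15*0.5`, whereas this re-blend uses the raw optimum. The arithmetic can be sketched as follows; `shrink_weight` and its parameter names are hypothetical.

```python
def shrink_weight(best_w, prior=0.5, strength=0.15):
    """Move best_w a fraction `strength` of the way toward `prior`."""
    return (1.0 - strength) * best_w + strength * prior

# OOF optimum reported by the CatBoost cell: w(CB) = 0.80
w_shrunk = shrink_weight(0.80)
print(f'shrunk w(CB) = {w_shrunk:.3f}')  # 0.85*0.80 + 0.15*0.50 = 0.755
```

Shrinking trades a little OOF AUC for a weight that depends less on the exact grid optimum; whether that helps on the leaderboard is an empirical question the logs here do not settle.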


In [51]:
# CatBoost TEXT+META: title/body as TEXT + meta(+lex) numeric; 5-fold CV; OOF-optimal blend
import os, time, numpy as np, pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

try:
    import catboost as cb
except Exception:
    import sys, subprocess
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'catboost', '-q'], check=True)
    import catboost as cb

assert 'meta_train' in globals() and 'meta_test' in globals(), 'Run meta/meta+lex cell first'
assert title_col in train.columns and body_col in train.columns and target_col in train.columns, 'Missing columns'

y_cb = train[target_col].astype(int).values
# Fill NaN before casting: astype(str) first would turn missing values into the literal 'nan'
df_tr = pd.concat([pd.DataFrame({'t_title': train[title_col].fillna('').astype(str),
                                 't_body':  train[body_col].fillna('').astype(str)}),
                   meta_train.reset_index(drop=True)], axis=1)
df_te = pd.concat([pd.DataFrame({'t_title': test[title_col].fillna('').astype(str),
                                 't_body':  test[body_col].fillna('').astype(str)}),
                   meta_test.reset_index(drop=True)], axis=1)
text_features_idx = [0, 1]

params = dict(loss_function='Logloss', eval_metric='AUC', depth=6, learning_rate=0.05, l2_leaf_reg=5.0,
              iterations=2000, early_stopping_rounds=200, random_seed=42, verbose=False)
try:
    from catboost.utils import get_gpu_device_count
    params['task_type'] = 'GPU' if get_gpu_device_count() > 0 else 'CPU'
except Exception:
    params['task_type'] = 'CPU'
print('CatBoost task_type:', params['task_type'], flush=True)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
oof_cb = np.zeros(len(df_tr), dtype=np.float32)
test_fold_preds = []

for fold, (trn_idx, val_idx) in enumerate(skf.split(df_tr, y_cb), 1):
    t0 = time.time()
    tr_pool = cb.Pool(df_tr.iloc[trn_idx], label=y_cb[trn_idx], text_features=text_features_idx)
    va_pool = cb.Pool(df_tr.iloc[val_idx], label=y_cb[val_idx], text_features=text_features_idx)
    model = cb.CatBoostClassifier(**params)
    model.fit(tr_pool, eval_set=va_pool, use_best_model=True)
    oof_cb[val_idx] = model.predict_proba(va_pool)[:,1].astype(np.float32)
    if fold == 1:
        te_pool = cb.Pool(df_te, text_features=text_features_idx)
    test_fold_preds.append(model.predict_proba(te_pool)[:,1].astype(np.float32))
    print(f'Fold {fold}: AUC {roc_auc_score(y_cb[val_idx], oof_cb[val_idx]):.5f} | {time.time()-t0:.1f}s', flush=True)

oof_auc_cb = roc_auc_score(y_cb, oof_cb)
print('CatBoost OOF AUC:', round(oof_auc_cb, 6))
np.save('oof_catboost_text_meta.npy', oof_cb)
test_cb = np.mean(np.vstack(test_fold_preds), axis=0).astype(np.float32)
np.save('test_catboost_text_meta.npy', test_cb)

# OOF-based blend with current best LR(meta+text/lex) if present
if os.path.exists('oof_lr_text_meta.npy') and os.path.exists('test_lr_text_meta.npy'):
    oof_lr = np.load('oof_lr_text_meta.npy')
    test_lr = np.load('test_lr_text_meta.npy')
    weights = np.linspace(0.0, 1.0, 41)
    best_w, best_auc = 0.0, roc_auc_score(y_cb, oof_lr)
    for w in weights:
        a = roc_auc_score(y_cb, w*oof_cb + (1.0-w)*oof_lr)
        if a > best_auc:
            best_auc, best_w = a, w
    print(f'Best OOF blend(LR,CatBoost) AUC: {best_auc:.5f} at w(CB)={best_w:.2f}', flush=True)
    test_blend = best_w*test_cb + (1.0-best_w)*test_lr
    pd.DataFrame({'request_id': test['request_id'], 'requester_received_pizza': test_blend}).to_csv('submission.csv', index=False)
    print('Saved submission.csv (LR+CatBoost blend)')
else:
    pd.DataFrame({'request_id': test['request_id'], 'requester_received_pizza': test_cb}).to_csv('submission.csv', index=False)
    print('Saved submission.csv (CatBoost only)')

CatBoost task_type: GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Fold 1: AUC 0.69379 | 8.2s


Default metric period is 5 because AUC is/are not implemented for GPU


Fold 2: AUC 0.69379 | 16.3s


Default metric period is 5 because AUC is/are not implemented for GPU


Fold 3: AUC 0.71684 | 15.7s


Default metric period is 5 because AUC is/are not implemented for GPU


Fold 4: AUC 0.65514 | 9.8s


Default metric period is 5 because AUC is/are not implemented for GPU


Fold 5: AUC 0.71988 | 7.9s


CatBoost OOF AUC: 0.691649
Best OOF blend(LR,CatBoost) AUC: 0.70600 at w(CB)=0.70


Saved submission.csv (LR+CatBoost blend)
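
A quick sanity check on how much to read into the blend numbers: the sketch below takes the per-fold AUCs and the two blend scores straight from the logs in this notebook and compares the fold-to-fold spread with the gain from the finer weight grid. The variable names are illustrative, and the two blend scores may have been computed from different saved artifacts, so treat the comparison as rough.

```python
from statistics import mean, stdev

# Per-fold validation AUCs copied from the CatBoost log above
fold_aucs = [0.69379, 0.69379, 0.71684, 0.65514, 0.71988]

# Blend OOF AUCs copied from the logs: 41-point grid vs. the finer re-blend grid
blend_coarse = 0.70600
blend_fine = 0.70702

fold_mean = mean(fold_aucs)
fold_sd = stdev(fold_aucs)  # sample standard deviation across the 5 folds
gain = blend_fine - blend_coarse

print(f'fold mean={fold_mean:.5f}  fold sd={fold_sd:.5f}  fine-grid gain={gain:.5f}')
print('gain smaller than one fold sd:', gain < fold_sd)
```

The mean of the fold AUCs need not equal the pooled OOF AUC (0.691649 in the log), since the pooled score ranks predictions across folds. If the gain from the finer grid is well inside the fold spread, as it appears to be here, the exact blend weight is probably not worth further tuning.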


In [None]:
# CatBoost TEXT+META model: title/body as TEXT + meta(+lex) numeric; 5-fold CV; OOF-optimal blend with LR(meta+text)
import os, time, numpy as np, pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

try:
    import catboost as cb
except Exception:
    import sys, subprocess
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'catboost', '-q'], check=True)
    import catboost as cb

assert 'meta_train' in globals() and 'meta_test' in globals(), 'Build meta_train/meta_test first'
assert title_col in train.columns and body_col in train.columns and target_col in train.columns, 'Missing columns'

# Assemble frames: first two columns are TEXT, rest are numeric
y_cb = train[target_col].astype(int).values
# Fill NaN before casting: astype(str) first would turn missing values into the literal 'nan'
df_tr = pd.concat([pd.DataFrame({'t_title': train[title_col].fillna('').astype(str),
                                 't_body':  train[body_col].fillna('').astype(str)}),
                   meta_train.reset_index(drop=True)], axis=1)
df_te = pd.concat([pd.DataFrame({'t_title': test[title_col].fillna('').astype(str),
                                 't_body':  test[body_col].fillna('').astype(str)}),
                   meta_test.reset_index(drop=True)], axis=1)
text_features_idx = [0, 1]

# Params; prefer GPU if available
params = dict(loss_function='Logloss', eval_metric='AUC',
              depth=6, learning_rate=0.05, l2_leaf_reg=5.0,
              iterations=2000, early_stopping_rounds=200,
              random_seed=42, verbose=False)
try:
    from catboost.utils import get_gpu_device_count
    params['task_type'] = 'GPU' if get_gpu_device_count() > 0 else 'CPU'
except Exception:
    params['task_type'] = 'CPU'
print('CatBoost task_type:', params['task_type'], flush=True)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
oof_cb = np.zeros(len(df_tr), dtype=np.float32)
test_fold_preds = []

for fold, (trn_idx, val_idx) in enumerate(skf.split(df_tr, y_cb), 1):
    t0 = time.time()
    tr_pool = cb.Pool(df_tr.iloc[trn_idx], label=y_cb[trn_idx], text_features=text_features_idx)
    va_pool = cb.Pool(df_tr.iloc[val_idx], label=y_cb[val_idx], text_features=text_features_idx)
    model = cb.CatBoostClassifier(**params)
    model.fit(tr_pool, eval_set=va_pool, use_best_model=True)
    oof_cb[val_idx] = model.predict_proba(va_pool)[:,1].astype(np.float32)
    if fold == 1:
        te_pool = cb.Pool(df_te, text_features=text_features_idx)
    test_fold_preds.append(model.predict_proba(te_pool)[:,1].astype(np.float32))
    print(f'Fold {fold}: AUC {roc_auc_score(y_cb[val_idx], oof_cb[val_idx]):.5f} | {time.time()-t0:.1f}s', flush=True)

oof_auc_cb = roc_auc_score(y_cb, oof_cb)
print('CatBoost OOF AUC:', round(oof_auc_cb, 6))
np.save('oof_catboost_text_meta.npy', oof_cb)
test_cb = np.mean(np.vstack(test_fold_preds), axis=0).astype(np.float32)
np.save('test_catboost_text_meta.npy', test_cb)

# Blend with current best LR(meta+text) using OOF to pick weight
if os.path.exists('oof_lr_text_meta.npy') and os.path.exists('test_lr_text_meta.npy'):
    oof_lr = np.load('oof_lr_text_meta.npy')
    test_lr = np.load('test_lr_text_meta.npy')
    weights = np.linspace(0.0, 1.0, 41)
    best_w, best_auc = 0.0, roc_auc_score(y_cb, oof_lr)
    for w in weights:
        a = roc_auc_score(y_cb, w*oof_cb + (1.0-w)*oof_lr)
        if a > best_auc:
            best_auc, best_w = a, w
    print(f'Best OOF blend(LR,CatBoost) AUC: {best_auc:.5f} at w(CB)={best_w:.2f}', flush=True)
    test_blend = best_w*test_cb + (1.0-best_w)*test_lr
    pd.DataFrame({'request_id': test['request_id'], 'requester_received_pizza': test_blend}).to_csv('submission.csv', index=False)
    print('Saved submission.csv (LR+CatBoost blend)')
else:
    pd.DataFrame({'request_id': test['request_id'], 'requester_received_pizza': test_cb}).to_csv('submission.csv', index=False)
    print('Saved submission.csv (CatBoost only)')

In [48]:
# CatBoost TEXT+META(+lex) 5-fold CV; GPU if available; OOF-optimal blend with LR(meta+text)
import os, time, numpy as np, pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

try:
    import catboost as cb
except Exception:
    import sys, subprocess
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'catboost', '-q'], check=True)
    import catboost as cb

assert 'meta_train' in globals() and 'meta_test' in globals(), 'Run meta feature cell first'
assert title_col in train.columns and body_col in train.columns and target_col in train.columns, 'Missing columns'

# Assemble DataFrames for CatBoost: first two columns are TEXT, rest are numeric meta(+lex) features
# Fill NaN before casting: astype(str) first would turn missing values into the literal 'nan'
df_tr = pd.concat([pd.DataFrame({'t_title': train[title_col].fillna('').astype(str),
                                 't_body':  train[body_col].fillna('').astype(str)}),
                   meta_train.reset_index(drop=True)], axis=1)
df_te = pd.concat([pd.DataFrame({'t_title': test[title_col].fillna('').astype(str),
                                 't_body':  test[body_col].fillna('').astype(str)}),
                   meta_test.reset_index(drop=True)], axis=1)
y_cb = train[target_col].astype(int).values
text_features_idx = [0, 1]

# CatBoost params; prefer GPU
params = dict(
    loss_function='Logloss',
    eval_metric='AUC',
    depth=6,
    learning_rate=0.05,
    l2_leaf_reg=5.0,
    iterations=2000,
    early_stopping_rounds=200,
    random_seed=42,
    verbose=False
)
try:
    from catboost.utils import get_gpu_device_count
    params['task_type'] = 'GPU' if get_gpu_device_count() > 0 else 'CPU'
except Exception:
    params['task_type'] = 'CPU'
print('CatBoost task_type:', params['task_type'], flush=True)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
oof_cb = np.zeros(len(df_tr), dtype=np.float32)
test_fold_preds = []

for fold, (trn_idx, val_idx) in enumerate(skf.split(df_tr, y_cb), 1):
    t0 = time.time()
    tr_pool = cb.Pool(df_tr.iloc[trn_idx], label=y_cb[trn_idx], text_features=text_features_idx)
    va_pool = cb.Pool(df_tr.iloc[val_idx], label=y_cb[val_idx], text_features=text_features_idx)
    model = cb.CatBoostClassifier(**params)
    model.fit(tr_pool, eval_set=va_pool, use_best_model=True)
    oof_cb[val_idx] = model.predict_proba(va_pool)[:,1].astype(np.float32)
    if fold == 1:
        te_pool = cb.Pool(df_te, text_features=text_features_idx)
    test_fold_preds.append(model.predict_proba(te_pool)[:,1].astype(np.float32))
    print(f'Fold {fold}: AUC {roc_auc_score(y_cb[val_idx], oof_cb[val_idx]):.5f} | {time.time()-t0:.1f}s', flush=True)

oof_auc_cb = roc_auc_score(y_cb, oof_cb)
print('CatBoost OOF AUC:', round(oof_auc_cb, 6))
np.save('oof_catboost_text_meta.npy', oof_cb)
test_cb = np.mean(np.vstack(test_fold_preds), axis=0).astype(np.float32)
np.save('test_catboost_text_meta.npy', test_cb)

# Blend with our best LR(meta+text[+lex/subs]) using OOF to pick weight
if os.path.exists('oof_lr_text_meta.npy') and os.path.exists('test_lr_text_meta.npy'):
    oof_lr = np.load('oof_lr_text_meta.npy')
    test_lr = np.load('test_lr_text_meta.npy')
    weights = np.linspace(0.0, 1.0, 41)
    best_w, best_auc = 0.0, roc_auc_score(y_cb, oof_lr)
    for w in weights:
        a = roc_auc_score(y_cb, w*oof_cb + (1.0-w)*oof_lr)
        if a > best_auc:
            best_auc, best_w = a, w
    print(f'Best OOF blend(LR(meta+text),CatBoost) AUC: {best_auc:.5f} at w(CB)={best_w:.2f}', flush=True)
    test_blend = best_w*test_cb + (1.0-best_w)*test_lr
    pd.DataFrame({'request_id': test['request_id'], 'requester_received_pizza': test_blend}).to_csv('submission.csv', index=False)
    print('Saved submission.csv (LR+CatBoost blend)')
else:
    pd.DataFrame({'request_id': test['request_id'], 'requester_received_pizza': test_cb}).to_csv('submission.csv', index=False)
    print('Saved submission.csv (CatBoost only)')

In [None]:
# CatBoost TEXT+META model: title/body as TEXT + meta(+lex); 5-fold CV; OOF-optimal blend with LR(meta+text)
import os, time, numpy as np, pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

try:
    import catboost as cb
except Exception:
    import sys, subprocess
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'catboost', '-q'], check=True)
    import catboost as cb

assert 'meta_train' in globals() and 'meta_test' in globals(), 'Run meta feature cell first'
assert title_col in train.columns and body_col in train.columns and target_col in train.columns, 'Missing columns'

# Assemble frames: first two columns are TEXT, rest are numeric
y_cb = train[target_col].astype(int).values
# Fill NaN before casting: astype(str) first would turn missing values into the literal 'nan'
df_tr = pd.concat([pd.DataFrame({'t_title': train[title_col].fillna('').astype(str),
                                 't_body':  train[body_col].fillna('').astype(str)}),
                   meta_train.reset_index(drop=True)], axis=1)
df_te = pd.concat([pd.DataFrame({'t_title': test[title_col].fillna('').astype(str),
                                 't_body':  test[body_col].fillna('').astype(str)}),
                   meta_test.reset_index(drop=True)], axis=1)
text_features_idx = [0, 1]

# Params; prefer GPU if available
params = dict(
    loss_function='Logloss',
    eval_metric='AUC',
    depth=6,
    learning_rate=0.05,
    l2_leaf_reg=5.0,
    iterations=2000,
    early_stopping_rounds=200,
    random_seed=42,
    verbose=False
)
try:
    from catboost.utils import get_gpu_device_count
    params['task_type'] = 'GPU' if get_gpu_device_count() > 0 else 'CPU'
except Exception:
    params['task_type'] = 'CPU'
print('CatBoost task_type:', params['task_type'], flush=True)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
oof_cb = np.zeros(len(df_tr), dtype=np.float32)
test_fold_preds = []

for fold, (trn_idx, val_idx) in enumerate(skf.split(df_tr, y_cb), 1):
    t0 = time.time()
    tr_pool = cb.Pool(df_tr.iloc[trn_idx], label=y_cb[trn_idx], text_features=text_features_idx)
    va_pool = cb.Pool(df_tr.iloc[val_idx], label=y_cb[val_idx], text_features=text_features_idx)
    model = cb.CatBoostClassifier(**params)
    model.fit(tr_pool, eval_set=va_pool, use_best_model=True)
    oof_cb[val_idx] = model.predict_proba(va_pool)[:,1].astype(np.float32)
    if fold == 1:
        te_pool = cb.Pool(df_te, text_features=text_features_idx)
    test_fold_preds.append(model.predict_proba(te_pool)[:,1].astype(np.float32))
    print(f'Fold {fold}: AUC {roc_auc_score(y_cb[val_idx], oof_cb[val_idx]):.5f} | {time.time()-t0:.1f}s', flush=True)

oof_auc_cb = roc_auc_score(y_cb, oof_cb)
print('CatBoost OOF AUC:', round(oof_auc_cb, 6))
np.save('oof_catboost_text_meta.npy', oof_cb)
test_cb = np.mean(np.vstack(test_fold_preds), axis=0).astype(np.float32)
np.save('test_catboost_text_meta.npy', test_cb)

# Blend with best LR(meta+text[+lex/subs]) using OOF to pick weight
if os.path.exists('oof_lr_text_meta.npy') and os.path.exists('test_lr_text_meta.npy'):
    oof_lr = np.load('oof_lr_text_meta.npy')
    test_lr = np.load('test_lr_text_meta.npy')
    weights = np.linspace(0.0, 1.0, 41)
    best_w, best_auc = 0.0, roc_auc_score(y_cb, oof_lr)
    for w in weights:
        a = roc_auc_score(y_cb, w*oof_cb + (1.0-w)*oof_lr)
        if a > best_auc:
            best_auc, best_w = a, w
    print(f'Best OOF blend(LR,CatBoost) AUC: {best_auc:.5f} at w(CB)={best_w:.2f}', flush=True)
    test_blend = best_w*test_cb + (1.0-best_w)*test_lr
    pd.DataFrame({'request_id': test['request_id'], 'requester_received_pizza': test_blend}).to_csv('submission.csv', index=False)
    print('Saved submission.csv (LR+CatBoost blend)')
else:
    pd.DataFrame({'request_id': test['request_id'], 'requester_received_pizza': test_cb}).to_csv('submission.csv', index=False)
    print('Saved submission.csv (CatBoost only)')

In [None]:
# CatBoost TEXT+META(+lex) 5-fold CV (GPU if available) + OOF-optimal blend with LR(meta+text)
import os, time, numpy as np, pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

try:
    import catboost as cb
except Exception:
    import sys, subprocess
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'catboost', '-q'], check=True)
    import catboost as cb

assert 'meta_train' in globals() and 'meta_test' in globals(), 'Build meta_train/meta_test first'
assert title_col in train.columns and body_col in train.columns and target_col in train.columns, 'Missing columns'

y_cb = train[target_col].astype(int).values
# Fill NaN before casting: astype(str) first would turn missing values into the literal 'nan'
df_tr = pd.concat([pd.DataFrame({'t_title': train[title_col].fillna('').astype(str),
                                 't_body':  train[body_col].fillna('').astype(str)}),
                   meta_train.reset_index(drop=True)], axis=1)
df_te = pd.concat([pd.DataFrame({'t_title': test[title_col].fillna('').astype(str),
                                 't_body':  test[body_col].fillna('').astype(str)}),
                   meta_test.reset_index(drop=True)], axis=1)
text_features_idx = [0, 1]

params = dict(
    loss_function='Logloss',
    eval_metric='AUC',
    depth=6,
    learning_rate=0.05,
    l2_leaf_reg=5.0,
    iterations=2000,
    early_stopping_rounds=200,
    random_seed=42,
    verbose=False
)
try:
    from catboost.utils import get_gpu_device_count
    params['task_type'] = 'GPU' if get_gpu_device_count() > 0 else 'CPU'
except Exception:
    params['task_type'] = 'CPU'
print('CatBoost task_type:', params['task_type'], flush=True)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
oof_cb = np.zeros(len(df_tr), dtype=np.float32)
test_fold_preds = []

for fold, (trn_idx, val_idx) in enumerate(skf.split(df_tr, y_cb), 1):
    t0 = time.time()
    tr_pool = cb.Pool(df_tr.iloc[trn_idx], label=y_cb[trn_idx], text_features=text_features_idx)
    va_pool = cb.Pool(df_tr.iloc[val_idx], label=y_cb[val_idx], text_features=text_features_idx)
    model = cb.CatBoostClassifier(**params)
    model.fit(tr_pool, eval_set=va_pool, use_best_model=True)
    oof_cb[val_idx] = model.predict_proba(va_pool)[:,1].astype(np.float32)
    if fold == 1:
        te_pool = cb.Pool(df_te, text_features=text_features_idx)
    test_fold_preds.append(model.predict_proba(te_pool)[:,1].astype(np.float32))
    print(f'Fold {fold}: AUC {roc_auc_score(y_cb[val_idx], oof_cb[val_idx]):.5f} | {time.time()-t0:.1f}s', flush=True)

oof_auc_cb = roc_auc_score(y_cb, oof_cb)
print('CatBoost OOF AUC:', round(oof_auc_cb, 6))
np.save('oof_catboost_text_meta.npy', oof_cb)
test_cb = np.mean(np.vstack(test_fold_preds), axis=0).astype(np.float32)
np.save('test_catboost_text_meta.npy', test_cb)

# OOF-optimal blend with our LR(meta+text) if available
if os.path.exists('oof_lr_text_meta.npy') and os.path.exists('test_lr_text_meta.npy'):
    oof_lr = np.load('oof_lr_text_meta.npy')
    test_lr = np.load('test_lr_text_meta.npy')
    weights = np.linspace(0.0, 1.0, 41)
    best_w, best_auc = 0.0, roc_auc_score(y_cb, oof_lr)
    for w in weights:
        a = roc_auc_score(y_cb, w*oof_cb + (1.0-w)*oof_lr)
        if a > best_auc:
            best_auc, best_w = a, w
    print(f'Best OOF blend(LR(meta+text),CatBoost) AUC: {best_auc:.5f} at w(CB)={best_w:.2f}', flush=True)
    test_blend = best_w*test_cb + (1.0-best_w)*test_lr
    pd.DataFrame({'request_id': test['request_id'], 'requester_received_pizza': test_blend}).to_csv('submission.csv', index=False)
    print('Saved submission.csv (LR+CatBoost blend)')
else:
    pd.DataFrame({'request_id': test['request_id'], 'requester_received_pizza': test_cb}).to_csv('submission.csv', index=False)
    print('Saved submission.csv (CatBoost only)')

In [None]:
# CatBoost: native TEXT (title/body) + numeric meta(+lexicons); 5-fold CV; OOF-weighted blend with LR(meta+text)
import os, time, numpy as np, pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

try:
    import catboost as cb
except Exception:
    import sys, subprocess
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'catboost', '-q'], check=True)
    import catboost as cb

assert 'meta_train' in globals() and 'meta_test' in globals(), 'Build meta_train/meta_test first'
assert title_col in train.columns and body_col in train.columns and target_col in train.columns, 'Missing columns'

# Assemble frames: first two columns are TEXT, rest are numeric
y_cb = train[target_col].astype(int).values
# Fill NaN before casting: astype(str) first would turn missing values into the literal 'nan'
df_tr = pd.concat([pd.DataFrame({'t_title': train[title_col].fillna('').astype(str),
                                 't_body':  train[body_col].fillna('').astype(str)}),
                   meta_train.reset_index(drop=True)], axis=1)
df_te = pd.concat([pd.DataFrame({'t_title': test[title_col].fillna('').astype(str),
                                 't_body':  test[body_col].fillna('').astype(str)}),
                   meta_test.reset_index(drop=True)], axis=1)
text_features_idx = [0, 1]

# CatBoost params; prefer GPU if available
params = dict(
    loss_function='Logloss',
    eval_metric='AUC',
    depth=6,
    learning_rate=0.05,
    l2_leaf_reg=5.0,
    iterations=2000,
    early_stopping_rounds=200,
    random_seed=42,
    verbose=False
)
try:
    from catboost.utils import get_gpu_device_count
    params['task_type'] = 'GPU' if get_gpu_device_count() > 0 else 'CPU'
except Exception:
    params['task_type'] = 'CPU'
print('CatBoost task_type:', params['task_type'], flush=True)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
oof_cb = np.zeros(len(df_tr), dtype=np.float32)
test_fold_preds = []

for fold, (trn_idx, val_idx) in enumerate(skf.split(df_tr, y_cb), 1):
    t0 = time.time()
    tr_pool = cb.Pool(df_tr.iloc[trn_idx], label=y_cb[trn_idx], text_features=text_features_idx)
    va_pool = cb.Pool(df_tr.iloc[val_idx], label=y_cb[val_idx], text_features=text_features_idx)
    model = cb.CatBoostClassifier(**params)
    model.fit(tr_pool, eval_set=va_pool, use_best_model=True)
    oof_cb[val_idx] = model.predict_proba(va_pool)[:,1].astype(np.float32)
    if fold == 1:
        te_pool = cb.Pool(df_te, text_features=text_features_idx)
    test_fold_preds.append(model.predict_proba(te_pool)[:,1].astype(np.float32))
    auc = roc_auc_score(y_cb[val_idx], oof_cb[val_idx])
    print(f'Fold {fold}: AUC {auc:.5f} | {time.time()-t0:.1f}s', flush=True)

oof_auc_cb = roc_auc_score(y_cb, oof_cb)
print('CatBoost OOF AUC:', round(oof_auc_cb, 6))
np.save('oof_catboost_text_meta.npy', oof_cb)
test_cb = np.mean(np.vstack(test_fold_preds), axis=0).astype(np.float32)
np.save('test_catboost_text_meta.npy', test_cb)

# Blend with best LR(meta+text) using OOF to pick weight
if os.path.exists('oof_lr_text_meta.npy') and os.path.exists('test_lr_text_meta.npy'):
    oof_lr = np.load('oof_lr_text_meta.npy')
    test_lr = np.load('test_lr_text_meta.npy')
    weights = np.linspace(0.0, 1.0, 41)
    best_w, best_auc = 0.0, roc_auc_score(y_cb, oof_lr)
    for w in weights:
        a = roc_auc_score(y_cb, w*oof_cb + (1.0-w)*oof_lr)
        if a > best_auc:
            best_auc, best_w = a, w
    print(f'Best OOF blend(LR(meta+text),CatBoost) AUC: {best_auc:.5f} at w(CB)={best_w:.2f}', flush=True)
    test_blend = best_w*test_cb + (1.0-best_w)*test_lr
    pd.DataFrame({'request_id': test['request_id'], 'requester_received_pizza': test_blend}).to_csv('submission.csv', index=False)
    print('Saved submission.csv (LR+CatBoost blend)')
else:
    pd.DataFrame({'request_id': test['request_id'], 'requester_received_pizza': test_cb}).to_csv('submission.csv', index=False)
    print('Saved submission.csv (CatBoost only)')

In [None]:
# CatBoost TEXT+META model (GPU if available): 5-fold CV, save OOF/test, OOF-optimal blend with LR(meta+text)
import os, time, numpy as np, pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

try:
    import catboost as cb
except Exception:
    import sys, subprocess
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'catboost', '-q'], check=True)
    import catboost as cb

assert 'meta_train' in globals() and 'meta_test' in globals(), 'Run meta feature cell first'
assert title_col in train.columns and body_col in train.columns and target_col in train.columns, 'Missing columns'

# Assemble frames: first two columns are TEXT, rest are numeric (meta+lex)
y_cb = train[target_col].astype(int).values
# Fill NaN before casting: astype(str) first would turn missing values into the literal 'nan'
df_tr = pd.concat([pd.DataFrame({'t_title': train[title_col].fillna('').astype(str),
                                 't_body':  train[body_col].fillna('').astype(str)}),
                   meta_train.reset_index(drop=True)], axis=1)
df_te = pd.concat([pd.DataFrame({'t_title': test[title_col].fillna('').astype(str),
                                 't_body':  test[body_col].fillna('').astype(str)}),
                   meta_test.reset_index(drop=True)], axis=1)
text_features_idx = [0, 1]

# CatBoost params (fast/robust); prefer GPU
params = dict(
    loss_function='Logloss',
    eval_metric='AUC',
    depth=6,
    learning_rate=0.05,
    l2_leaf_reg=5.0,
    iterations=1200,
    early_stopping_rounds=100,
    random_seed=42,
    verbose=False
)
try:
    from catboost.utils import get_gpu_device_count
    params['task_type'] = 'GPU' if get_gpu_device_count() > 0 else 'CPU'
except Exception:
    params['task_type'] = 'CPU'
print('CatBoost task_type:', params['task_type'], flush=True)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
oof_cb = np.zeros(len(df_tr), dtype=np.float32)
test_fold_preds = []

for fold, (trn_idx, val_idx) in enumerate(skf.split(df_tr, y_cb), 1):
    t0 = time.time()
    tr_pool = cb.Pool(df_tr.iloc[trn_idx], label=y_cb[trn_idx], text_features=text_features_idx)
    va_pool = cb.Pool(df_tr.iloc[val_idx], label=y_cb[val_idx], text_features=text_features_idx)
    model = cb.CatBoostClassifier(**params)
    model.fit(tr_pool, eval_set=va_pool, use_best_model=True)
    oof_cb[val_idx] = model.predict_proba(va_pool)[:,1].astype(np.float32)
    if fold == 1:
        te_pool = cb.Pool(df_te, text_features=text_features_idx)
    test_fold_preds.append(model.predict_proba(te_pool)[:,1].astype(np.float32))
    print(f'Fold {fold}: AUC {roc_auc_score(y_cb[val_idx], oof_cb[val_idx]):.5f} | {time.time()-t0:.1f}s', flush=True)

oof_auc_cb = roc_auc_score(y_cb, oof_cb)
print('CatBoost OOF AUC:', round(oof_auc_cb, 6))
np.save('oof_catboost_text_meta.npy', oof_cb)
test_cb = np.mean(np.vstack(test_fold_preds), axis=0).astype(np.float32)
np.save('test_catboost_text_meta.npy', test_cb)

# Blend with current best LR(meta+text) using OOF to pick weight
if os.path.exists('oof_lr_text_meta.npy') and os.path.exists('test_lr_text_meta.npy'):
    oof_lr = np.load('oof_lr_text_meta.npy')
    test_lr = np.load('test_lr_text_meta.npy')
    weights = np.linspace(0.0, 1.0, 41)
    best_w, best_auc = 0.0, roc_auc_score(y_cb, oof_lr)
    for w in weights:
        a = roc_auc_score(y_cb, w*oof_cb + (1.0-w)*oof_lr)
        if a > best_auc:
            best_auc, best_w = a, w
    print(f'Best OOF blend(LR,CatBoost) AUC: {best_auc:.5f} at w(CB)={best_w:.2f}', flush=True)
    test_blend = best_w*test_cb + (1.0-best_w)*test_lr
    pd.DataFrame({'request_id': test['request_id'], 'requester_received_pizza': test_blend}).to_csv('submission.csv', index=False)
    print('Saved submission.csv (LR+CatBoost blend)')
else:
    pd.DataFrame({'request_id': test['request_id'], 'requester_received_pizza': test_cb}).to_csv('submission.csv', index=False)
    print('Saved submission.csv (CatBoost only)')

In [None]:
# CatBoost TEXT + META (+lexicons) 5-fold CV; GPU if available; OOF-optimal blend with LR(meta+text)
import os, time, numpy as np, pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

try:
    import catboost as cb
except Exception:
    import sys, subprocess
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'catboost', '-q'], check=True)
    import catboost as cb

assert 'meta_train' in globals() and 'meta_test' in globals(), 'Run meta feature cell first'
assert title_col in train.columns and body_col in train.columns and target_col in train.columns, 'Missing columns'

# Assemble frames: first two columns are TEXT, rest are numeric
y_cb = train[target_col].astype(int).values
# Fill NaN before casting: astype(str) first would turn missing values into the literal 'nan'
df_tr = pd.concat([pd.DataFrame({'t_title': train[title_col].fillna('').astype(str),
                                 't_body':  train[body_col].fillna('').astype(str)}),
                   meta_train.reset_index(drop=True)], axis=1)
df_te = pd.concat([pd.DataFrame({'t_title': test[title_col].fillna('').astype(str),
                                 't_body':  test[body_col].fillna('').astype(str)}),
                   meta_test.reset_index(drop=True)], axis=1)
text_features_idx = [0, 1]

# Params; prefer GPU
params = dict(loss_function='Logloss', eval_metric='AUC',
              depth=6, learning_rate=0.05, l2_leaf_reg=5.0,
              iterations=2000, early_stopping_rounds=200,
              random_seed=42, verbose=False)
try:
    from catboost.utils import get_gpu_device_count
    params['task_type'] = 'GPU' if get_gpu_device_count() > 0 else 'CPU'
except Exception:
    params['task_type'] = 'CPU'
print('CatBoost task_type:', params['task_type'], flush=True)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
oof_cb = np.zeros(len(df_tr), dtype=np.float32)
test_fold_preds = []

for fold, (trn_idx, val_idx) in enumerate(skf.split(df_tr, y_cb), 1):
    t0 = time.time()
    tr_pool = cb.Pool(df_tr.iloc[trn_idx], label=y_cb[trn_idx], text_features=text_features_idx)
    va_pool = cb.Pool(df_tr.iloc[val_idx], label=y_cb[val_idx], text_features=text_features_idx)
    model = cb.CatBoostClassifier(**params)
    model.fit(tr_pool, eval_set=va_pool, use_best_model=True)
    oof_cb[val_idx] = model.predict_proba(va_pool)[:,1].astype(np.float32)
    if fold == 1:
        te_pool = cb.Pool(df_te, text_features=text_features_idx)
    test_fold_preds.append(model.predict_proba(te_pool)[:,1].astype(np.float32))
    print(f'Fold {fold}: AUC {roc_auc_score(y_cb[val_idx], oof_cb[val_idx]):.5f} | {time.time()-t0:.1f}s', flush=True)

oof_auc_cb = roc_auc_score(y_cb, oof_cb)
print('CatBoost OOF AUC:', round(oof_auc_cb, 6))
np.save('oof_catboost_text_meta.npy', oof_cb)
test_cb = np.mean(np.vstack(test_fold_preds), axis=0).astype(np.float32)
np.save('test_catboost_text_meta.npy', test_cb)

# Blend with best LR(meta+text[+lex/subs]) using OOF to pick weight
if os.path.exists('oof_lr_text_meta.npy') and os.path.exists('test_lr_text_meta.npy'):
    oof_lr = np.load('oof_lr_text_meta.npy')
    test_lr = np.load('test_lr_text_meta.npy')
    weights = np.linspace(0.0, 1.0, 41)
    best_w, best_auc = 0.0, roc_auc_score(y_cb, oof_lr)
    for w in weights:
        a = roc_auc_score(y_cb, w*oof_cb + (1.0-w)*oof_lr)
        if a > best_auc:
            best_auc, best_w = a, w
    print(f'Best OOF blend(LR(meta+text),CatBoost) AUC: {best_auc:.5f} at w(CB)={best_w:.2f}', flush=True)
    test_blend = best_w*test_cb + (1.0-best_w)*test_lr
    pd.DataFrame({'request_id': test['request_id'], 'requester_received_pizza': test_blend}).to_csv('submission.csv', index=False)
    print('Saved submission.csv (LR+CatBoost blend)')
else:
    pd.DataFrame({'request_id': test['request_id'], 'requester_received_pizza': test_cb}).to_csv('submission.csv', index=False)
    print('Saved submission.csv (CatBoost only)')

In [47]:
# CatBoost TEXT + META(+lex) 5-fold CV; prefer GPU; OOF-optimal blend with LR(meta+text); write submission.csv
import os, time, numpy as np, pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

try:
    import catboost as cb
except Exception:
    import sys, subprocess
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'catboost', '-q'], check=True)
    import catboost as cb

assert title_col in train.columns and body_col in train.columns, 'Missing title/body columns'
assert target_col in train.columns, 'Missing target_col'
assert 'meta_train' in globals() and 'meta_test' in globals(), 'Run meta feature cell first'

# Assemble frames for CatBoost: first two columns are TEXT, rest numeric meta(+lex)
# fillna before astype(str): casting first can turn missing text into the literal 'nan'
df_tr = pd.concat([pd.DataFrame({'t_title': train[title_col].fillna('').astype(str),
                                 't_body':  train[body_col].fillna('').astype(str)}),
                   meta_train.reset_index(drop=True)], axis=1)
df_te = pd.concat([pd.DataFrame({'t_title': test[title_col].fillna('').astype(str),
                                 't_body':  test[body_col].fillna('').astype(str)}),
                   meta_test.reset_index(drop=True)], axis=1)
y_cb = train[target_col].astype(int).values
text_features_idx = [0, 1]

# Params; use GPU if available
params = dict(loss_function='Logloss', eval_metric='AUC',
              depth=6, learning_rate=0.05, l2_leaf_reg=5.0,
              iterations=2000, early_stopping_rounds=200,
              random_seed=42, verbose=False)
try:
    from catboost.utils import get_gpu_device_count
    params['task_type'] = 'GPU' if get_gpu_device_count() > 0 else 'CPU'
except Exception:
    params['task_type'] = 'CPU'
print('CatBoost task_type:', params['task_type'], flush=True)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
oof_cb = np.zeros(len(df_tr), dtype=np.float32)
test_fold_preds = []

for fold, (trn_idx, val_idx) in enumerate(skf.split(df_tr, y_cb), 1):
    t0 = time.time()
    tr_pool = cb.Pool(df_tr.iloc[trn_idx], label=y_cb[trn_idx], text_features=text_features_idx)
    va_pool = cb.Pool(df_tr.iloc[val_idx], label=y_cb[val_idx], text_features=text_features_idx)
    model = cb.CatBoostClassifier(**params)
    model.fit(tr_pool, eval_set=va_pool, use_best_model=True)
    oof_cb[val_idx] = model.predict_proba(va_pool)[:,1].astype(np.float32)
    if fold == 1:
        te_pool = cb.Pool(df_te, text_features=text_features_idx)
    test_fold_preds.append(model.predict_proba(te_pool)[:,1].astype(np.float32))
    print(f'Fold {fold}: AUC {roc_auc_score(y_cb[val_idx], oof_cb[val_idx]):.5f} | {time.time()-t0:.1f}s', flush=True)

oof_auc_cb = roc_auc_score(y_cb, oof_cb)
print('CatBoost OOF AUC:', round(oof_auc_cb, 6))
np.save('oof_catboost_text_meta.npy', oof_cb)
test_cb = np.mean(np.vstack(test_fold_preds), axis=0).astype(np.float32)
np.save('test_catboost_text_meta.npy', test_cb)

# Blend with current best LR(meta+text[+lex/subs]) using OOF to pick weight
if os.path.exists('oof_lr_text_meta.npy') and os.path.exists('test_lr_text_meta.npy'):
    oof_lr = np.load('oof_lr_text_meta.npy')
    test_lr = np.load('test_lr_text_meta.npy')
    weights = np.linspace(0.0, 1.0, 41)
    best_w, best_auc = 0.0, roc_auc_score(y_cb, oof_lr)
    for w in weights:
        a = roc_auc_score(y_cb, w*oof_cb + (1.0-w)*oof_lr)
        if a > best_auc:
            best_auc, best_w = a, w
    print(f'Best OOF blend(LR(meta+text),CatBoost) AUC: {best_auc:.5f} at w(CB)={best_w:.2f}', flush=True)
    test_blend = best_w*test_cb + (1.0-best_w)*test_lr
    pd.DataFrame({'request_id': test['request_id'], 'requester_received_pizza': test_blend}).to_csv('submission.csv', index=False)
    print('Saved submission.csv (LR+CatBoost blend)')
else:
    pd.DataFrame({'request_id': test['request_id'], 'requester_received_pizza': test_cb}).to_csv('submission.csv', index=False)
    print('Saved submission.csv (CatBoost only)')

CatBoost task_type: GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Fold 1: AUC 0.69379 | 8.3s


Default metric period is 5 because AUC is/are not implemented for GPU


Fold 2: AUC 0.69379 | 16.3s


Default metric period is 5 because AUC is/are not implemented for GPU


Fold 3: AUC 0.71684 | 15.8s


Default metric period is 5 because AUC is/are not implemented for GPU


Fold 4: AUC 0.65514 | 9.8s


Default metric period is 5 because AUC is/are not implemented for GPU


Fold 5: AUC 0.71988 | 8.0s


CatBoost OOF AUC: 0.691649
Best OOF blend(LR,CatBoost) AUC: 0.70600 at w(CB)=0.70


Saved submission.csv (LR+CatBoost blend)
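
The blend step in the cell above scans a 41-point grid of CatBoost weights and keeps the weight with the best OOF AUC, starting from the LR-only baseline. A minimal sketch of that search follows, assuming two aligned OOF arrays. The helper names `pairwise_auc` and `best_blend_weight` and the synthetic data are hypothetical, and `pairwise_auc` is a small numpy stand-in for `roc_auc_score`, not what the notebook calls.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """AUC as P(score_pos > score_neg), ties counted as 0.5 (small-data stand-in for roc_auc_score)."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=np.float64)
    diff = scores[y_true][:, None] - scores[~y_true][None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

def best_blend_weight(y_true, oof_a, oof_b, grid=None):
    """Return (weight on oof_a, blended AUC); the search starts from the oof_b-only baseline."""
    grid = np.linspace(0.0, 1.0, 41) if grid is None else grid
    best_w, best_auc = 0.0, pairwise_auc(y_true, oof_b)
    for w in grid:
        auc = pairwise_auc(y_true, w * oof_a + (1.0 - w) * oof_b)
        if auc > best_auc:  # strict '>' keeps the smaller weight on ties
            best_auc, best_w = auc, float(w)
    return best_w, best_auc

# Synthetic stand-ins for oof_cb / oof_lr (not the notebook's real predictions)
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=200)
oof_a_demo = 0.3 * y_demo + rng.random(200)
oof_b_demo = 0.3 * y_demo + rng.random(200)
w_demo, auc_demo = best_blend_weight(y_demo, oof_a_demo, oof_b_demo)
print(f'best w(a)={w_demo:.3f} | blended AUC={auc_demo:.5f}')
```

Because the weight is chosen on the same OOF predictions that are then reported, the blended OOF AUC is likely to be slightly optimistic; with a single scalar weight the effect is usually small.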


In [None]:
# CatBoost TEXT+META: use title/body as TEXT features + meta(+lex) numeric; 5-fold CV; OOF-optimal blend
import os, time, numpy as np, pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

try:
    import catboost as cb
except Exception:
    import sys, subprocess
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'catboost', '-q'], check=True)
    import catboost as cb

assert 'meta_train' in globals() and 'meta_test' in globals(), 'Build meta_train/meta_test first'
assert title_col in train.columns and body_col in train.columns and target_col in train.columns, 'Missing columns'

# Assemble frames: first two columns are TEXT, rest are numeric meta(+lexicons) already built
y_cb = train[target_col].astype(int).values
# fillna before astype(str): casting first can turn missing text into the literal 'nan'
df_tr = pd.concat([pd.DataFrame({'t_title': train[title_col].fillna('').astype(str),
                                 't_body':  train[body_col].fillna('').astype(str)}),
                   meta_train.reset_index(drop=True)], axis=1)
df_te = pd.concat([pd.DataFrame({'t_title': test[title_col].fillna('').astype(str),
                                 't_body':  test[body_col].fillna('').astype(str)}),
                   meta_test.reset_index(drop=True)], axis=1)
text_features_idx = [0, 1]

# CatBoost params (fast, robust); prefer GPU if available
params = dict(
    loss_function='Logloss',
    eval_metric='AUC',
    depth=6,
    learning_rate=0.05,
    l2_leaf_reg=5.0,
    iterations=2000,
    early_stopping_rounds=200,
    random_seed=42,
    verbose=False
)
try:
    from catboost.utils import get_gpu_device_count
    params['task_type'] = 'GPU' if get_gpu_device_count() > 0 else 'CPU'
except Exception:
    params['task_type'] = 'CPU'
print('CatBoost task_type:', params['task_type'], flush=True)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
oof_cb = np.zeros(len(df_tr), dtype=np.float32)
test_fold_preds = []

for fold, (trn_idx, val_idx) in enumerate(skf.split(df_tr, y_cb), 1):
    t0 = time.time()
    tr_pool = cb.Pool(df_tr.iloc[trn_idx], label=y_cb[trn_idx], text_features=text_features_idx)
    va_pool = cb.Pool(df_tr.iloc[val_idx], label=y_cb[val_idx], text_features=text_features_idx)
    model = cb.CatBoostClassifier(**params)
    model.fit(tr_pool, eval_set=va_pool, use_best_model=True)
    oof_cb[val_idx] = model.predict_proba(va_pool)[:,1].astype(np.float32)
    if fold == 1:
        te_pool = cb.Pool(df_te, text_features=text_features_idx)
    test_fold_preds.append(model.predict_proba(te_pool)[:,1].astype(np.float32))
    auc = roc_auc_score(y_cb[val_idx], oof_cb[val_idx])
    print(f'Fold {fold}: AUC {auc:.5f} | {time.time()-t0:.1f}s', flush=True)

oof_auc_cb = roc_auc_score(y_cb, oof_cb)
print('CatBoost OOF AUC:', round(oof_auc_cb, 6))
np.save('oof_catboost_text_meta.npy', oof_cb)
test_cb = np.mean(np.vstack(test_fold_preds), axis=0).astype(np.float32)
np.save('test_catboost_text_meta.npy', test_cb)

# Blend with best LR(meta+text[+lex/subs]) using OOF to pick weight
if os.path.exists('oof_lr_text_meta.npy') and os.path.exists('test_lr_text_meta.npy'):
    oof_lr = np.load('oof_lr_text_meta.npy')
    test_lr = np.load('test_lr_text_meta.npy')
    weights = np.linspace(0.0, 1.0, 41)
    best_w, best_auc = 0.0, roc_auc_score(y_cb, oof_lr)
    for w in weights:
        a = roc_auc_score(y_cb, w*oof_cb + (1.0-w)*oof_lr)
        if a > best_auc:
            best_auc, best_w = a, w
    print(f'Best OOF blend(LR(meta+text),CatBoost) AUC: {best_auc:.5f} at w(CB)={best_w:.2f}', flush=True)
    test_blend = best_w*test_cb + (1.0-best_w)*test_lr
    pd.DataFrame({'request_id': test['request_id'], 'requester_received_pizza': test_blend}).to_csv('submission.csv', index=False)
    print('Saved submission.csv (LR+CatBoost blend)')
else:
    pd.DataFrame({'request_id': test['request_id'], 'requester_received_pizza': test_cb}).to_csv('submission.csv', index=False)
    print('Saved submission.csv (CatBoost only)')
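
When title/body columns are prepared as CatBoost text features, missing values should be filled before the cast to `str`. On many pandas versions `astype(str)` turns a missing value into the literal string `'nan'`, after which `fillna('')` has nothing left to fill. A minimal sketch on a toy column follows; the helper name `clean_text_column` and the sample values are hypothetical.

```python
import pandas as pd

def clean_text_column(series: pd.Series) -> pd.Series:
    """Fill missing values first, then cast to str, so gaps become '' rather than 'nan'."""
    return series.fillna('').astype(str)

# Toy column (hypothetical values) with one missing title
raw_titles = pd.Series(['Need pizza', None, 'Hungry student'])
clean_titles = clean_text_column(raw_titles)
print(clean_titles.tolist())
```

The same ordering is what the later text-only cell uses (`fillna('').astype(str)`), so applying it here keeps the text inputs consistent across models.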

In [49]:
# CatBoost TEXT+META model: title/body as TEXT + meta(+lex) numeric; 5-fold CV; OOF-optimal blend with LR(meta+text)
import os, time, numpy as np, pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

try:
    import catboost as cb
except Exception:
    import sys, subprocess
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'catboost', '-q'], check=True)
    import catboost as cb

assert 'meta_train' in globals() and 'meta_test' in globals(), 'Build meta_train/meta_test first'
assert title_col in train.columns and body_col in train.columns and target_col in train.columns, 'Missing columns'

# Assemble frames: first two columns are TEXT, rest are numeric
y_cb = train[target_col].astype(int).values
# fillna before astype(str): casting first can turn missing text into the literal 'nan'
df_tr = pd.concat([pd.DataFrame({'t_title': train[title_col].fillna('').astype(str),
                                 't_body':  train[body_col].fillna('').astype(str)}),
                   meta_train.reset_index(drop=True)], axis=1)
df_te = pd.concat([pd.DataFrame({'t_title': test[title_col].fillna('').astype(str),
                                 't_body':  test[body_col].fillna('').astype(str)}),
                   meta_test.reset_index(drop=True)], axis=1)
text_features_idx = [0, 1]

# Params; prefer GPU if available
params = dict(loss_function='Logloss', eval_metric='AUC',
              depth=6, learning_rate=0.05, l2_leaf_reg=5.0,
              iterations=2000, early_stopping_rounds=200,
              random_seed=42, verbose=False)
try:
    from catboost.utils import get_gpu_device_count
    params['task_type'] = 'GPU' if get_gpu_device_count() > 0 else 'CPU'
except Exception:
    params['task_type'] = 'CPU'
print('CatBoost task_type:', params['task_type'], flush=True)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
oof_cb = np.zeros(len(df_tr), dtype=np.float32)
test_fold_preds = []

for fold, (trn_idx, val_idx) in enumerate(skf.split(df_tr, y_cb), 1):
    t0 = time.time()
    tr_pool = cb.Pool(df_tr.iloc[trn_idx], label=y_cb[trn_idx], text_features=text_features_idx)
    va_pool = cb.Pool(df_tr.iloc[val_idx], label=y_cb[val_idx], text_features=text_features_idx)
    model = cb.CatBoostClassifier(**params)
    model.fit(tr_pool, eval_set=va_pool, use_best_model=True)
    oof_cb[val_idx] = model.predict_proba(va_pool)[:,1].astype(np.float32)
    if fold == 1:
        te_pool = cb.Pool(df_te, text_features=text_features_idx)
    test_fold_preds.append(model.predict_proba(te_pool)[:,1].astype(np.float32))
    print(f'Fold {fold}: AUC {roc_auc_score(y_cb[val_idx], oof_cb[val_idx]):.5f} | {time.time()-t0:.1f}s', flush=True)

oof_auc_cb = roc_auc_score(y_cb, oof_cb)
print('CatBoost OOF AUC:', round(oof_auc_cb, 6))
np.save('oof_catboost_text_meta.npy', oof_cb)
test_cb = np.mean(np.vstack(test_fold_preds), axis=0).astype(np.float32)
np.save('test_catboost_text_meta.npy', test_cb)

# Blend with best LR(meta+text[+lex/subs]) using OOF to pick weight
if os.path.exists('oof_lr_text_meta.npy') and os.path.exists('test_lr_text_meta.npy'):
    oof_lr = np.load('oof_lr_text_meta.npy')
    test_lr = np.load('test_lr_text_meta.npy')
    weights = np.linspace(0.0, 1.0, 41)
    best_w, best_auc = 0.0, roc_auc_score(y_cb, oof_lr)
    for w in weights:
        a = roc_auc_score(y_cb, w*oof_cb + (1.0-w)*oof_lr)
        if a > best_auc:
            best_auc, best_w = a, w
    print(f'Best OOF blend(LR,CatBoost) AUC: {best_auc:.5f} at w(CB)={best_w:.2f}', flush=True)
    test_blend = best_w*test_cb + (1.0-best_w)*test_lr
    pd.DataFrame({'request_id': test['request_id'], 'requester_received_pizza': test_blend}).to_csv('submission.csv', index=False)
    print('Saved submission.csv (LR+CatBoost blend)')
else:
    pd.DataFrame({'request_id': test['request_id'], 'requester_received_pizza': test_cb}).to_csv('submission.csv', index=False)
    print('Saved submission.csv (CatBoost only)')

In [46]:
# CatBoost TEXT+META: title/body as TEXT + meta(+lex) numeric; 5-fold CV; OOF-optimal blend; save submission.csv
import os, time, numpy as np, pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

try:
    import catboost as cb
except Exception:
    import sys, subprocess
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'catboost', '-q'], check=True)
    import catboost as cb

assert 'meta_train' in globals() and 'meta_test' in globals(), 'Run meta feature cell first'
assert title_col in train.columns and body_col in train.columns and target_col in train.columns, 'Missing columns'

# Build frames: first two columns are TEXT, remainder numeric
# fillna before astype(str): casting first can turn missing text into the literal 'nan'
df_tr = pd.concat([pd.DataFrame({'t_title': train[title_col].fillna('').astype(str),
                                 't_body':  train[body_col].fillna('').astype(str)}),
                   meta_train.reset_index(drop=True)], axis=1)
df_te = pd.concat([pd.DataFrame({'t_title': test[title_col].fillna('').astype(str),
                                 't_body':  test[body_col].fillna('').astype(str)}),
                   meta_test.reset_index(drop=True)], axis=1)
y_cb = train[target_col].astype(int).values
text_features_idx = [0, 1]

# Params; prefer GPU
params = dict(loss_function='Logloss', eval_metric='AUC',
              depth=6, learning_rate=0.05, l2_leaf_reg=5.0,
              iterations=2000, early_stopping_rounds=200,
              random_seed=42, verbose=False)
try:
    from catboost.utils import get_gpu_device_count
    params['task_type'] = 'GPU' if get_gpu_device_count() > 0 else 'CPU'
except Exception:
    params['task_type'] = 'CPU'
print('CatBoost task_type:', params['task_type'], flush=True)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
oof_cb = np.zeros(len(df_tr), dtype=np.float32)
test_fold_preds = []

for fold, (trn_idx, val_idx) in enumerate(skf.split(df_tr, y_cb), 1):
    t0 = time.time()
    tr_pool = cb.Pool(df_tr.iloc[trn_idx], label=y_cb[trn_idx], text_features=text_features_idx)
    va_pool = cb.Pool(df_tr.iloc[val_idx], label=y_cb[val_idx], text_features=text_features_idx)
    model = cb.CatBoostClassifier(**params)
    model.fit(tr_pool, eval_set=va_pool, use_best_model=True)
    oof_cb[val_idx] = model.predict_proba(va_pool)[:,1].astype(np.float32)
    if fold == 1:
        te_pool = cb.Pool(df_te, text_features=text_features_idx)
    test_fold_preds.append(model.predict_proba(te_pool)[:,1].astype(np.float32))
    print(f'Fold {fold}: AUC {roc_auc_score(y_cb[val_idx], oof_cb[val_idx]):.5f} | {time.time()-t0:.1f}s', flush=True)

oof_auc_cb = roc_auc_score(y_cb, oof_cb)
print('CatBoost OOF AUC:', round(oof_auc_cb, 6))
np.save('oof_catboost_text_meta.npy', oof_cb)
test_cb = np.mean(np.vstack(test_fold_preds), axis=0).astype(np.float32)
np.save('test_catboost_text_meta.npy', test_cb)

# Blend with best LR(meta+text[+lex/subs]) via OOF
if os.path.exists('oof_lr_text_meta.npy') and os.path.exists('test_lr_text_meta.npy'):
    oof_lr = np.load('oof_lr_text_meta.npy')
    test_lr = np.load('test_lr_text_meta.npy')
    weights = np.linspace(0.0, 1.0, 41)
    best_w, best_auc = 0.0, roc_auc_score(y_cb, oof_lr)
    for w in weights:
        a = roc_auc_score(y_cb, w*oof_cb + (1.0-w)*oof_lr)
        if a > best_auc:
            best_auc, best_w = a, w
    print(f'Best OOF blend(LR(meta+text),CatBoost) AUC: {best_auc:.5f} at w(CB)={best_w:.2f}', flush=True)
    test_blend = best_w*test_cb + (1.0-best_w)*test_lr
    pd.DataFrame({'request_id': test['request_id'], 'requester_received_pizza': test_blend}).to_csv('submission.csv', index=False)
    print('Saved submission.csv (LR+CatBoost blend)')
else:
    pd.DataFrame({'request_id': test['request_id'], 'requester_received_pizza': test_cb}).to_csv('submission.csv', index=False)
    print('Saved submission.csv (CatBoost only)')

CatBoost task_type: GPU


Default metric period is 5 because AUC is/are not implemented for GPU


Fold 1: AUC 0.69379 | 8.3s


Default metric period is 5 because AUC is/are not implemented for GPU


Fold 2: AUC 0.69379 | 16.4s


Default metric period is 5 because AUC is/are not implemented for GPU


Fold 3: AUC 0.71684 | 15.6s


Default metric period is 5 because AUC is/are not implemented for GPU


Fold 4: AUC 0.65514 | 9.9s


Default metric period is 5 because AUC is/are not implemented for GPU


Fold 5: AUC 0.71988 | 8.0s


CatBoost OOF AUC: 0.691649
Best OOF blend(LR(meta+text),CatBoost) AUC: 0.70600 at w(CB)=0.70


Saved submission.csv (LR+CatBoost blend)
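
The log above reports five per-fold AUCs and one pooled OOF AUC. The pooled figure ranks all out-of-fold predictions together, so it need not equal the mean of the fold AUCs. A small sketch that summarizes the logged values, with the numbers copied from the output above and nothing recomputed from data:

```python
from statistics import mean, pstdev

# Per-fold validation AUCs copied from the log above
fold_aucs = [0.69379, 0.69379, 0.71684, 0.65514, 0.71988]
pooled_oof_auc = 0.691649  # 'CatBoost OOF AUC' line from the same log

fold_mean = mean(fold_aucs)
fold_std = pstdev(fold_aucs)  # population std over the 5 folds
fold_range = max(fold_aucs) - min(fold_aucs)

print(f'mean fold AUC : {fold_mean:.6f}')
print(f'std fold AUC  : {fold_std:.6f}')
print(f'fold range    : {fold_range:.5f}')
print(f'mean - pooled : {fold_mean - pooled_oof_auc:.6f}')
```

A fold range of this size on roughly 2.9k training rows suggests that small OOF gains should be treated cautiously. Folds 1 and 2 also print the same AUC to five decimals, which is possible but unusual enough to be worth a second look.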


In [14]:
# CatBoost TEXT model (CPU) with StratifiedGroupKFold and OOF-weighted blend with LR(meta+text+subs)
import sys, subprocess, time, numpy as np, pandas as pd, os, re
from sklearn.model_selection import StratifiedGroupKFold
from sklearn.metrics import roc_auc_score

# Pin a stable CatBoost version (avoid potential segfaults), then import
try:
    import catboost as cb
    _ver = getattr(cb, '__version__', 'unknown')
    print('Existing catboost version:', _ver, flush=True)
except Exception:
    _ver = None

# Reinstall only when the pinned version is not already present. importlib.reload generally
# does not re-initialize an already-loaded compiled extension, so restart the kernel if the
# version changed while catboost was imported.
if _ver != '1.2.5':
    print('Installing stable CatBoost 1.2.5 (CPU)...', flush=True)
    subprocess.run([sys.executable, '-m', 'pip', 'install', '--quiet', '--no-cache-dir', '--upgrade', 'catboost==1.2.5'], check=True)
    import importlib, catboost as cb
    importlib.reload(cb)
    print('CatBoost version now:', cb.__version__, flush=True)

# --- Ensure data and key columns are available (make cell self-contained) ---
def load_df(path):
    try:
        return pd.read_json(path)
    except ValueError:
        import json as _json
        with open(path, 'r') as f:
            return pd.DataFrame(_json.load(f))

if 'train' not in globals() or 'test' not in globals():
    train = load_df('train.json')
    test = load_df('test.json')

id_col = 'request_id' if 'request_id' in train.columns else train.columns[0]
target_col = 'requester_received_pizza' if 'requester_received_pizza' in train.columns else [c for c in train.columns if c.lower() in ('target','label')][0]
body_fields_order = ['request_text_edit_aware','request_text','request_text_edit_aware_unnormalized']
title_fields = ['request_title','title']
body_col = next((c for c in body_fields_order if c in train.columns), None)
title_col = next((c for c in title_fields if c in train.columns), None)

# --- Build CatBoost text-only input (avoid mixed-type pool to improve stability) ---
def join_subs_for_cat(df: pd.DataFrame) -> pd.Series:
    if 'requester_subreddits_at_request' not in df.columns:
        return pd.Series(['']*len(df))
    def to_line(x):
        if isinstance(x, list):
            return ' '.join([str(s) for s in x if isinstance(s, str)])
        return ''
    return df['requester_subreddits_at_request'].apply(to_line)

cat_text_train = pd.DataFrame({
    'cb_title': (train[title_col].fillna('').astype(str) + ' ' + train[title_col].fillna('').astype(str)),  # upweight title
    'cb_body': train[body_col].fillna('').astype(str),
    'cb_subs': join_subs_for_cat(train)
})
cat_text_test = pd.DataFrame({
    'cb_title': (test[title_col].fillna('').astype(str) + ' ' + test[title_col].fillna('').astype(str)),
    'cb_body': test[body_col].fillna('').astype(str),
    'cb_subs': join_subs_for_cat(test)
})

X_cat = cat_text_train.reset_index(drop=True)
X_cat_test = cat_text_test.reset_index(drop=True)
text_feature_indices = [0, 1, 2]
y_cb = train[target_col].astype(int).values
groups = train['requester_username'].fillna('').astype(str).values if 'requester_username' in train.columns else np.arange(len(train))

print('CatBoost TEXT-only features:', X_cat.shape, 'Test:', X_cat_test.shape, flush=True)

# Two CatBoost configs (A and B) per expert advice, simplified text_processing (use defaults for stability)
common_params = dict(
    task_type='CPU',
    eval_metric='AUC',
    loss_function='Logloss',
    auto_class_weights='Balanced',
    early_stopping_rounds=100,
    bootstrap_type='Bayesian',
    random_seed=42,
    verbose=100,
    allow_writing_files=False,
    thread_count=-1
)

# Config A
params_A = dict(common_params, **{
    'depth': 6,
    'learning_rate': 0.04,
    'l2_leaf_reg': 4,
    'iterations': 1000,  # slightly reduced to improve stability/speed
    'bagging_temperature': 0.2,
    'rsm': 0.8,
})

# Config B (regularized)
params_B = dict(common_params, **{
    'depth': 5,
    'learning_rate': 0.05,
    'l2_leaf_reg': 6,
    'iterations': 1200,
    'bagging_temperature': 0.5,
    'rsm': 0.9,
})

def run_cb_cv_textonly(params: dict, name: str):
    skf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
    oof = np.zeros(len(X_cat), dtype=np.float32)
    test_preds = []
    for fold, (trn_idx, val_idx) in enumerate(skf.split(X_cat, y_cb, groups), 1):
        t0 = time.time()
        train_pool = cb.Pool(data=X_cat.iloc[trn_idx], label=y_cb[trn_idx], text_features=text_feature_indices)
        valid_pool = cb.Pool(data=X_cat.iloc[val_idx], label=y_cb[val_idx], text_features=text_feature_indices)
        model = cb.CatBoostClassifier(**params)
        model.fit(train_pool, eval_set=valid_pool, use_best_model=True)
        oof[val_idx] = model.predict_proba(valid_pool)[:,1]
        test_pool = cb.Pool(data=X_cat_test, text_features=text_feature_indices)
        test_preds.append(model.predict_proba(test_pool)[:,1].astype(np.float32))
        print(f'[CB {name}] Fold {fold} done in {time.time()-t0:.1f}s | val AUC={roc_auc_score(y_cb[val_idx], oof[val_idx]):.5f}', flush=True)
    auc = roc_auc_score(y_cb, oof)
    test_mean = np.mean(np.vstack(test_preds), axis=0).astype(np.float32)
    print(f'[CB {name}] OOF AUC: {auc:.5f}', flush=True)
    return oof, test_mean, auc

oof_A, test_A, auc_A = run_cb_cv_textonly(params_A, 'A')
oof_B, test_B, auc_B = run_cb_cv_textonly(params_B, 'B')

# Choose best CB by OOF; if close (<=0.001), average them for stability
if abs(auc_A - auc_B) <= 0.001:
    oof_cat = 0.5*(oof_A + oof_B)
    test_cat = 0.5*(test_A + test_B)
    auc_cat = roc_auc_score(y_cb, oof_cat)
    chosen = 'Averaged A+B'
else:
    if auc_A >= auc_B:
        oof_cat, test_cat, auc_cat, chosen = oof_A, test_A, auc_A, 'A'
    else:
        oof_cat, test_cat, auc_cat, chosen = oof_B, test_B, auc_B, 'B'
print(f'Chosen CB variant: {chosen} | OOF AUC: {auc_cat:.5f}', flush=True)
np.save('oof_catboost.npy', oof_cat)
np.save('test_catboost.npy', test_cat)

# Build best LR base OOF via blending base and subs (recompute best weight) for fair comparison
oof_lr_base = np.load('oof_lr_text_meta.npy') if os.path.exists('oof_lr_text_meta.npy') else None
oof_lr_subs = np.load('oof_lr_text_meta_subs.npy') if os.path.exists('oof_lr_text_meta_subs.npy') else None
if oof_lr_base is not None and oof_lr_subs is not None:
    best_w_bs, best_auc_bs = 0.0, -1.0
    for w in np.linspace(0.0, 1.0, 21):
        blend_bs = (1.0-w)*oof_lr_base + w*oof_lr_subs
        a = roc_auc_score(y_cb, blend_bs)
        if a > best_auc_bs:
            best_auc_bs, best_w_bs = a, w
    print(f'Best OOF AUC (LR base vs subs): {best_auc_bs:.5f} at w(subs)={best_w_bs:.2f}', flush=True)
    oof_lr_best = (1.0-best_w_bs)*oof_lr_base + best_w_bs*oof_lr_subs
    test_lr_base = np.load('test_lr_text_meta.npy') if os.path.exists('test_lr_text_meta.npy') else None
    test_lr_subs = np.load('test_lr_text_meta_subs.npy') if os.path.exists('test_lr_text_meta_subs.npy') else None
    if test_lr_base is not None and test_lr_subs is not None:
        test_lr_best = (1.0-best_w_bs)*test_lr_base + best_w_bs*test_lr_subs
    else:
        test_lr_best = test_lr_base if test_lr_base is not None else test_cat
else:
    if os.path.exists('oof_lr_text_meta.npy'):
        oof_lr_best = np.load('oof_lr_text_meta.npy')
        test_lr_best = np.load('test_lr_text_meta.npy')
    else:
        oof_lr_best = oof_cat
        test_lr_best = test_cat

# Blend CatBoost with LR-best using OOF to choose CatBoost weight in [0.10, 0.70]
best_w, best_auc = None, -1.0
for w in np.linspace(0.10, 0.70, 13):  # 0.10, 0.15, ..., 0.70 inclusive; avoids float-step endpoint drift
    blend = w*oof_cat + (1.0-w)*oof_lr_best
    a = roc_auc_score(y_cb, blend)
    if a > best_auc:
        best_auc, best_w = a, w
print(f'Best OOF AUC blend(CB, LR-best)={best_auc:.5f} at w(CB)={best_w:.2f}', flush=True)
test_blend = best_w*test_cat + (1.0-best_w)*test_lr_best
pd.DataFrame({'request_id': test['request_id'], 'requester_received_pizza': test_blend.astype(np.float32)}).to_csv('submission.csv', index=False)
print('Saved submission.csv (CatBoost blend). Rows:', len(test_blend))

CatBoost features: (2878, 45) Test: (1162, 45)


KernelDied: Kernel died unexpectedly.
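
The cell above reinstalls CatBoost on every run before importing it, and the kernel then died. The log does not establish the cause, but replacing a compiled package under a live kernel is one thing worth ruling out. A minimal sketch of checking the installed version first, using only the standard library; the helper name `needs_install` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

def needs_install(package: str, wanted: str) -> bool:
    """True when `package` is missing or its installed version differs from `wanted`."""
    try:
        return version(package) != wanted
    except PackageNotFoundError:
        return True

# Hypothetical usage: decide before touching pip, and restart the kernel after any install.
missing_demo = needs_install('surely-not-an-installed-package-xyz', '1.0.0')
print('install needed:', missing_demo)
```

In the notebook this would gate the `pip install catboost==1.2.5` call, so that a kernel which already has the pinned version never triggers a reinstall.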

In [15]:
# XGBoost on TF-IDF SVD (word+char) + meta(+lexicons), StratifiedGroupKFold; blend with LR-best
import sys, subprocess, time, numpy as np, pandas as pd, os, re
from sklearn.model_selection import StratifiedGroupKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score

# Ensure xgboost installed
try:
    import xgboost as xgb
except Exception:
    print('Installing xgboost...', flush=True)
    subprocess.run([sys.executable, '-m', 'pip', 'install', '--quiet', 'xgboost==2.1.1'], check=True)
    import xgboost as xgb

# --- Make cell self-contained: load data, define key columns, texts, and meta/lexicons ---
def load_df(path):
    try:
        return pd.read_json(path)
    except ValueError:
        import json as _json
        with open(path, 'r') as f:
            return pd.DataFrame(_json.load(f))

if 'train' not in globals() or 'test' not in globals():
    train = load_df('train.json')
    test = load_df('test.json')

target_col = 'requester_received_pizza' if 'requester_received_pizza' in train.columns else [c for c in train.columns if c.lower() in ('target','label')][0]
body_fields_order = ['request_text_edit_aware','request_text','request_text_edit_aware_unnormalized']
title_fields = ['request_title','title']
body_col = next((c for c in body_fields_order if c in train.columns), None)
title_col = next((c for c in title_fields if c in train.columns), None)

def build_text_upweighted(df: pd.DataFrame, title_col: str, body_col: str) -> pd.Series:
    title = df[title_col].fillna('').astype(str)
    body = df[body_col].fillna('').astype(str)
    return title + ' ' + title + ' [SEP] ' + body

full_text_train2 = build_text_upweighted(train, title_col, body_col)
full_text_test2 = build_text_upweighted(test, title_col, body_col)

# Meta/lexicon builders (leakage-safe; mirror prior cell) 
def count_urls(s: str) -> int: return len(re.findall(r'https?://\S+', s or ''))
def has_imgur(s: str) -> int: return 1 if re.search(r'imgur\.com', s or '', flags=re.IGNORECASE) else 0
def count_digits(s: str) -> int: return sum(ch.isdigit() for ch in (s or ''))
def dollar_flag(s: str) -> int: return 1 if ('$' in (s or '')) or re.search(r'\b\d+\s*(dollars|bucks)\b', s or '', flags=re.IGNORECASE) else 0
def caps_ratio(s: str) -> float:
    if not s: return 0.0
    letters = [c for c in s if c.isalpha()]
    return (sum(c.isupper() for c in letters) / max(1, len(letters))) if letters else 0.0
def word_count(s: str) -> int: return len((s or '').split())
def exclam_count(s: str) -> int: return (s or '').count('!')
def question_count(s: str) -> int: return (s or '').count('?')
def parse_subreddit_count(x) -> int: return len(x) if isinstance(x, list) else 0
def safe_log1p_signed(x): return np.sign(x) * np.log1p(np.abs(x))

LEX_PATTERNS = {
    'lex_please': r'\bplease\b',
    'lex_thanks': r'\b(thank you|thanks in advance|thanks|thank|tia)\b',
    'lex_appreciate': r'\b(appreciate|appreciated)\b',
    'lex_pay_it_forward': r'\b(pay it forward|return the favor)\b',
    'lex_repay': r'\b(repay|pay you back|pay back)\b',
    'lex_willing': r"\b(willing to|i\'ll|i will|i can)\b",
    'lex_karma': r'\bkarma\b',
    'lex_evidence': r'\b(proof|receipt|photo|picture|pic|verify|verification|evidence)\b',
    'lex_imgur_word': r'\bimgur\b',
    'lex_student': r'\b(student|college|university|finals|exam|midterm)\b',
    'lex_jobloss': r'\b(unemployed|laid off|lost my job|between jobs|job hunt)\b',
    'lex_broke': r'\b(broke)\b',
    'lex_rent_bills': r'\b(rent|bill|bills|utilities|electric|gas|water|paycheck)\b',
    'lex_family': r'\b(family|kids?|children|baby|pregnant|son|daughter|wife|husband)\b',
    'lex_hungry': r'\b(hungry|starving|no food)\b|\bfood (stamps|pantry)\b',
    'lex_struggling': r'\b(desperate|struggling)\b',
    'lex_urgency': r'\b(tonight|today|tomorrow|asap|urgent)\b',
    'lex_help': r'\bhelp\b',
}

def add_lexicons(df: pd.DataFrame, title_col: str, body_col: str) -> pd.DataFrame:
    txt = (df[title_col].fillna('').astype(str) + ' ' + df[body_col].fillna('').astype(str)).str.lower()
    out = pd.DataFrame(index=df.index)
    for name, pat in LEX_PATTERNS.items():
        out[name] = txt.str.count(pat, flags=re.IGNORECASE).astype(float)
    return out

def build_meta(df: pd.DataFrame) -> pd.DataFrame:
    title = df[title_col].fillna('').astype(str)
    body = df[body_col].fillna('').astype(str)
    ts = pd.to_numeric(df['unix_timestamp_of_request'], errors='coerce')
    dt = pd.to_datetime(ts, unit='s', utc=True)
    out = pd.DataFrame(index=df.index)
    keep_cols = [
        'requester_account_age_in_days_at_request',
        'requester_days_since_first_post_on_raop_at_request',
        'requester_number_of_comments_at_request',
        'requester_number_of_comments_in_raop_at_request',
        'requester_number_of_posts_at_request',
        'requester_number_of_posts_on_raop_at_request',
        'requester_number_of_subreddits_at_request',
        'requester_upvotes_minus_downvotes_at_request',
        'requester_upvotes_plus_downvotes_at_request',
    ]
    for c in keep_cols:
        out[c] = pd.to_numeric(df[c], errors='coerce') if c in df.columns else 0.0
    out['requester_subreddits_count'] = df['requester_subreddits_at_request'].apply(parse_subreddit_count).astype(float) if 'requester_subreddits_at_request' in df.columns else 0.0
    out['hour'] = dt.dt.hour.astype(float)
    out['weekday'] = dt.dt.weekday.astype(float)
    out['is_weekend'] = dt.dt.weekday.isin([5,6]).astype(float)
    out['title_len_chars'] = title.str.len().astype(float)
    out['title_len_words'] = title.apply(word_count).astype(float)
    out['body_len_chars'] = body.str.len().astype(float)
    out['body_len_words'] = body.apply(word_count).astype(float)
    out['url_count'] = body.apply(count_urls).astype(float)
    out['has_imgur'] = body.apply(has_imgur).astype(float)
    out['digits_count'] = body.apply(count_digits).astype(float)
    out['dollar_flag'] = body.apply(dollar_flag).astype(float)
    out['caps_ratio'] = body.apply(caps_ratio).astype(float)
    out['exclam_count'] = body.apply(exclam_count).astype(float)
    out['question_count'] = body.apply(question_count).astype(float)
    lex = add_lexicons(df, title_col, body_col)
    out = pd.concat([out, lex], axis=1)
    out = out.replace([np.inf, -np.inf], np.nan).fillna(0.0)
    nonneg_cols = [
        'requester_account_age_in_days_at_request','requester_days_since_first_post_on_raop_at_request',
        'requester_number_of_comments_at_request','requester_number_of_comments_in_raop_at_request',
        'requester_number_of_posts_at_request','requester_number_of_posts_on_raop_at_request',
        'requester_number_of_subreddits_at_request','requester_upvotes_plus_downvotes_at_request',
        'requester_subreddits_count','hour','weekday','is_weekend',
        'title_len_chars','title_len_words','body_len_chars','body_len_words',
        'url_count','has_imgur','digits_count','dollar_flag','exclam_count','question_count',
        *list(LEX_PATTERNS.keys()),
    ]
    for c in nonneg_cols:
        if c in out.columns:
            out[c] = np.log1p(np.clip(out[c], a_min=0.0, a_max=None))
    if 'requester_upvotes_minus_downvotes_at_request' in out.columns:
        out['requester_upvotes_minus_downvotes_at_request'] = safe_log1p_signed(out['requester_upvotes_minus_downvotes_at_request'].values)
    out = out.replace([np.inf, -np.inf], 0.0).fillna(0.0)
    return out

meta_train = build_meta(train)
meta_test = build_meta(test)

# === XGBoost SVD pipeline ===
def get_vecs_for_svd():
    wv = TfidfVectorizer(ngram_range=(1,2), min_df=3, max_df=0.95, max_features=200000,
                         sublinear_tf=True, lowercase=True, dtype=np.float32)
    cv = TfidfVectorizer(analyzer='char', ngram_range=(3,6), min_df=2, max_features=300000,
                         sublinear_tf=True, lowercase=True, dtype=np.float32)
    return wv, cv

def build_fold_features(tr_text, va_text, tr_meta, va_meta, n_comp_word=200, n_comp_char=200):
    wv, cv = get_vecs_for_svd()
    Xw_tr = wv.fit_transform(tr_text)
    Xc_tr = cv.fit_transform(tr_text)
    Xw_va = wv.transform(va_text)
    Xc_va = cv.transform(va_text)
    svd_w = TruncatedSVD(n_components=n_comp_word, random_state=42)
    svd_c = TruncatedSVD(n_components=n_comp_char, random_state=42)
    Zw_tr = svd_w.fit_transform(Xw_tr).astype(np.float32)
    Zc_tr = svd_c.fit_transform(Xc_tr).astype(np.float32)
    Zw_va = svd_w.transform(Xw_va).astype(np.float32)
    Zc_va = svd_c.transform(Xc_va).astype(np.float32)
    scaler = StandardScaler(with_mean=True, with_std=True)
    M_tr = scaler.fit_transform(tr_meta).astype(np.float32)
    M_va = scaler.transform(va_meta).astype(np.float32)
    X_tr = np.hstack([Zw_tr, Zc_tr, M_tr]).astype(np.float32)
    X_va = np.hstack([Zw_va, Zc_va, M_va]).astype(np.float32)
    return X_tr, X_va, (wv, cv, svd_w, svd_c, scaler)

def build_test_features(text_train, text_test, transformers, meta_train, meta_test):
    # Transform the test set with the transformers already fitted on this fold's training split.
    # Do not refit here: refitting on the full training set would change the TF-IDF vocabulary
    # and the SVD basis, so the test columns would no longer match the features the fold's
    # booster was trained on. text_train and meta_train are kept for call compatibility only.
    wv, cv, svd_w, svd_c, scaler = transformers
    Xw_te = wv.transform(text_test)
    Xc_te = cv.transform(text_test)
    Zw_te = svd_w.transform(Xw_te).astype(np.float32)
    Zc_te = svd_c.transform(Xc_te).astype(np.float32)
    M_te = scaler.transform(meta_test).astype(np.float32)
    X_te = np.hstack([Zw_te, Zc_te, M_te]).astype(np.float32)
    return X_te

y = train[target_col].astype(int).values
groups = train['requester_username'].fillna('').astype(str).values if 'requester_username' in train.columns else np.arange(len(train))
skf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)

print('=== CV: XGBoost on (SVD-word,char) + meta ===', flush=True)
oof_xgb = np.zeros(len(train), dtype=np.float32)
test_preds = []
fold_times = []

params = dict(
    objective='binary:logistic',
    eval_metric='auc',
    tree_method='hist',
    max_depth=5,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_alpha=0.0,
    reg_lambda=1.0,
    max_bin=256,
    min_child_weight=1.0,
    random_state=42
)

for fold, (trn_idx, val_idx) in enumerate(skf.split(full_text_train2, y, groups), 1):
    t0 = time.time()
    X_tr, X_va, pipes = build_fold_features(
        full_text_train2.iloc[trn_idx],
        full_text_train2.iloc[val_idx],
        meta_train.iloc[trn_idx].values,
        meta_train.iloc[val_idx].values,
        n_comp_word=200, n_comp_char=200
    )
    dtr = xgb.DMatrix(X_tr, label=y[trn_idx])
    dva = xgb.DMatrix(X_va, label=y[val_idx])
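    # Early stopping monitors the same validation fold that is scored for OOF,
    # so the per-fold and OOF AUCs are optimistic to some degree.
    booster = xgb.train(params, dtr, num_boost_round=2000, evals=[(dva, 'valid')],
                        verbose_eval=False,
                        early_stopping_rounds=100)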
    oof_xgb[val_idx] = booster.predict(dva, iteration_range=(0, booster.best_iteration+1))
    # Build test features for this fold using the same pipelines fit pattern
    Xt = build_test_features(full_text_train2, full_text_test2, pipes, meta_train.values, meta_test.values)
    dte = xgb.DMatrix(Xt)
    test_preds.append(booster.predict(dte, iteration_range=(0, booster.best_iteration+1)).astype(np.float32))
    dt = time.time()-t0
    fold_times.append(dt)
    print(f'  Fold {fold} AUC={roc_auc_score(y[val_idx], oof_xgb[val_idx]):.5f} | iters={booster.best_iteration+1} | {dt:.1f}s', flush=True)

auc_xgb = roc_auc_score(y, oof_xgb)
print(f'OOF AUC (XGB SVD+meta): {auc_xgb:.5f} | mean fold time: {np.mean(fold_times):.1f}s', flush=True)
test_xgb = np.mean(np.vstack(test_preds), axis=0).astype(np.float32)
np.save('oof_xgb_svd_meta.npy', oof_xgb)
np.save('test_xgb_svd_meta.npy', test_xgb)

# Blend with LR-best (base vs subs optimized) using OOF to choose weight
oof_lr_base = np.load('oof_lr_text_meta.npy') if os.path.exists('oof_lr_text_meta.npy') else None
oof_lr_subs = np.load('oof_lr_text_meta_subs.npy') if os.path.exists('oof_lr_text_meta_subs.npy') else None
if oof_lr_base is not None and oof_lr_subs is not None:
    best_w_bs, best_auc_bs = 0.0, -1.0
    for w in np.linspace(0.0, 1.0, 21):
        blend_bs = (1.0-w)*oof_lr_base + w*oof_lr_subs
        a = roc_auc_score(y, blend_bs)
        if a > best_auc_bs:
            best_auc_bs, best_w_bs = a, w
    oof_lr_best = (1.0-best_w_bs)*oof_lr_base + best_w_bs*oof_lr_subs
    test_lr_base = np.load('test_lr_text_meta.npy') if os.path.exists('test_lr_text_meta.npy') else None
    test_lr_subs = np.load('test_lr_text_meta_subs.npy') if os.path.exists('test_lr_text_meta_subs.npy') else None
    test_lr_best = (1.0-best_w_bs)*(test_lr_base if test_lr_base is not None else test_xgb) + best_w_bs*(test_lr_subs if test_lr_subs is not None else test_xgb)
else:
    oof_lr_best = oof_xgb
    test_lr_best = test_xgb

best_w, best_auc = None, -1.0
for w in np.arange(0.10, 0.75, 0.05):
    blend = w*oof_xgb + (1.0-w)*oof_lr_best
    a = roc_auc_score(y, blend)
    if a > best_auc:
        best_auc, best_w = a, w
print(f'Best OOF AUC blend(XGB, LR-best)={best_auc:.5f} at w(XGB)={best_w:.2f}', flush=True)
test_blend = best_w*test_xgb + (1.0-best_w)*test_lr_best
pd.DataFrame({'request_id': test['request_id'], 'requester_received_pizza': test_blend}).to_csv('submission.csv', index=False)
print('Saved submission.csv (XGB blend). Rows:', len(test_blend))

=== CV: XGBoost on (SVD-word,char) + meta ===
  Fold 1 AUC=0.64836 | iters=241 | 24.8s
  Fold 2 AUC=0.66104 | iters=156 | 23.8s
  Fold 3 AUC=0.67990 | iters=31 | 22.7s
  Fold 4 AUC=0.66364 | iters=197 | 23.7s
  Fold 5 AUC=0.68882 | iters=218 | 23.8s
OOF AUC (XGB SVD+meta): 0.66059 | mean fold time: 23.7s
Best OOF AUC blend(XGB, LR-best)=0.69127 at w(XGB)=0.45
Saved submission.csv (XGB blend). Rows: 1162
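
The two-model blend above scans a grid of weights and keeps the one with the highest OOF AUC. The following is a minimal, dependency-free sketch of that search on toy data. The helper names (`auc`, `best_blend_weight`) and the toy arrays are hypothetical and are not part of the notebook; the grid only mirrors the 0.10 to 0.70 range used above. Because the weight is chosen on the same OOF predictions it is scored on, the resulting AUC should be read as an upper estimate.

```python
from typing import List, Sequence, Tuple


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Rank-based ROC AUC (Mann-Whitney U); ties receive their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied scores.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def best_blend_weight(
    y: Sequence[int],
    oof_a: Sequence[float],
    oof_b: Sequence[float],
    grid: Sequence[float],
) -> Tuple[float, float]:
    """Return (weight on model A, OOF AUC) for the best weight in the grid."""
    best_w, best_auc = grid[0], -1.0
    for w in grid:
        blend = [w * a + (1.0 - w) * b for a, b in zip(oof_a, oof_b)]
        score = auc(y, blend)
        if score > best_auc:
            best_w, best_auc = w, score
    return best_w, best_auc


# Toy OOF predictions (hypothetical values, for illustration only).
y_toy: List[int] = [0, 0, 1, 1, 0, 1]
oof_a: List[float] = [0.2, 0.6, 0.5, 0.9, 0.1, 0.4]
oof_b: List[float] = [0.3, 0.2, 0.4, 0.6, 0.5, 0.7]
grid: List[float] = [round(0.10 + 0.05 * k, 2) for k in range(13)]  # 0.10 .. 0.70

w_best, auc_best = best_blend_weight(y_toy, oof_a, oof_b, grid)
print(f"best w(A)={w_best:.2f} OOF AUC={auc_best:.5f}")
```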


In [17]:
# Grouped CV unification for LR models + NNLS 3-way blend with XGB (grouped)
import os, sys, time, json, re, numpy as np, pandas as pd
from scipy import sparse
from sklearn.model_selection import StratifiedGroupKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from scipy.optimize import nnls

def load_df(path):
    try:
        return pd.read_json(path)
    except ValueError:
        with open(path, 'r') as f:
            return pd.DataFrame(json.load(f))

# Load data
train = load_df('train.json')
test = load_df('test.json')
target_col = 'requester_received_pizza'
body_fields_order = ['request_text_edit_aware','request_text','request_text_edit_aware_unnormalized']
title_fields = ['request_title','title']
body_col = next((c for c in body_fields_order if c in train.columns), None)
title_col = next((c for c in title_fields if c in train.columns), None)

def build_text_upweighted(df, title_col, body_col):
    t = df[title_col].fillna('').astype(str)
    b = df[body_col].fillna('').astype(str)
    return (t + ' ' + t + ' [SEP] ' + b)

full_text_train2 = build_text_upweighted(train, title_col, body_col)
full_text_test2 = build_text_upweighted(test, title_col, body_col)
y = train[target_col].astype(int).values
groups = train['requester_username'].fillna('').astype(str).values if 'requester_username' in train.columns else np.arange(len(train))

# Meta/lexicons (same as earlier, leakage-safe)
def count_urls(s: str) -> int: return len(re.findall(r'https?://\S+', s or ''))
def has_imgur(s: str) -> int: return 1 if re.search(r'imgur\.com', s or '', flags=re.IGNORECASE) else 0
def count_digits(s: str) -> int: return sum(ch.isdigit() for ch in (s or ''))
def dollar_flag(s: str) -> int: return 1 if ('$' in (s or '')) or re.search(r'\b\d+\s*(dollars|bucks)\b', s or '', flags=re.IGNORECASE) else 0
def caps_ratio(s: str) -> float:
    if not s: return 0.0
    letters = [c for c in s if c.isalpha()]
    return (sum(c.isupper() for c in letters) / max(1, len(letters))) if letters else 0.0
def word_count(s: str) -> int: return len((s or '').split())
def exclam_count(s: str) -> int: return (s or '').count('!')
def question_count(s: str) -> int: return (s or '').count('?')
def parse_subreddit_count(x) -> int: return len(x) if isinstance(x, list) else 0
def safe_log1p_signed(x): return np.sign(x) * np.log1p(np.abs(x))
LEX_PATTERNS = {
    'lex_please': r'\bplease\b',
    'lex_thanks': r'\b(thank you|thanks in advance|thanks|thank|tia)\b',
    'lex_appreciate': r'\b(appreciate|appreciated)\b',
    'lex_pay_it_forward': r'\b(pay it forward|return the favor)\b',
    'lex_repay': r'\b(repay|pay you back|pay back)\b',
    'lex_willing': r"\b(willing to|i\'ll|i will|i can)\b",
    'lex_karma': r'\bkarma\b',
    'lex_evidence': r'\b(proof|receipt|photo|picture|pic|verify|verification|evidence)\b',
    'lex_imgur_word': r'\bimgur\b',
    'lex_student': r'\b(student|college|university|finals|exam|midterm)\b',
    'lex_jobloss': r'\b(unemployed|laid off|lost my job|between jobs|job hunt)\b',
    'lex_broke': r'\b(broke)\b',
    'lex_rent_bills': r'\b(rent|bill|bills|utilities|electric|gas|water|paycheck)\b',
    'lex_family': r'\b(family|kids?|children|baby|pregnant|son|daughter|wife|husband)\b',
    'lex_hungry': r'\b(hungry|starving|no food)\b|\bfood (stamps|pantry)\b',
    'lex_struggling': r'\b(desperate|struggling)\b',
    'lex_urgency': r'\b(tonight|today|tomorrow|asap|urgent)\b',
    'lex_help': r'\bhelp\b',
}
def add_lexicons(df, title_col, body_col):
    txt = (df[title_col].fillna('').astype(str) + ' ' + df[body_col].fillna('').astype(str)).str.lower()
    out = pd.DataFrame(index=df.index)
    for name, pat in LEX_PATTERNS.items():
        out[name] = txt.str.count(pat, flags=re.IGNORECASE).astype(float)
    return out
def build_meta(df):
    title = df[title_col].fillna('').astype(str)
    body = df[body_col].fillna('').astype(str)
    ts = pd.to_numeric(df['unix_timestamp_of_request'], errors='coerce')
    dt = pd.to_datetime(ts, unit='s', utc=True)
    out = pd.DataFrame(index=df.index)
    keep_cols = [
        'requester_account_age_in_days_at_request',
        'requester_days_since_first_post_on_raop_at_request',
        'requester_number_of_comments_at_request',
        'requester_number_of_comments_in_raop_at_request',
        'requester_number_of_posts_at_request',
        'requester_number_of_posts_on_raop_at_request',
        'requester_number_of_subreddits_at_request',
        'requester_upvotes_minus_downvotes_at_request',
        'requester_upvotes_plus_downvotes_at_request',
    ]
    for c in keep_cols:
        out[c] = pd.to_numeric(df[c], errors='coerce') if c in df.columns else 0.0
    out['requester_subreddits_count'] = df['requester_subreddits_at_request'].apply(parse_subreddit_count).astype(float) if 'requester_subreddits_at_request' in df.columns else 0.0
    out['hour'] = dt.dt.hour.astype(float)
    out['weekday'] = dt.dt.weekday.astype(float)
    out['is_weekend'] = dt.dt.weekday.isin([5,6]).astype(float)
    out['title_len_chars'] = title.str.len().astype(float)
    out['title_len_words'] = title.apply(word_count).astype(float)
    out['body_len_chars'] = body.str.len().astype(float)
    out['body_len_words'] = body.apply(word_count).astype(float)
    out['url_count'] = body.apply(count_urls).astype(float)
    out['has_imgur'] = body.apply(has_imgur).astype(float)
    out['digits_count'] = body.apply(count_digits).astype(float)
    out['dollar_flag'] = body.apply(dollar_flag).astype(float)
    out['caps_ratio'] = body.apply(caps_ratio).astype(float)
    out['exclam_count'] = body.apply(exclam_count).astype(float)
    out['question_count'] = body.apply(question_count).astype(float)
    lex = add_lexicons(df, title_col, body_col)
    out = pd.concat([out, lex], axis=1).replace([np.inf,-np.inf], np.nan).fillna(0.0)
    nonneg_cols = [
        'requester_account_age_in_days_at_request','requester_days_since_first_post_on_raop_at_request',
        'requester_number_of_comments_at_request','requester_number_of_comments_in_raop_at_request',
        'requester_number_of_posts_at_request','requester_number_of_posts_on_raop_at_request',
        'requester_number_of_subreddits_at_request','requester_upvotes_plus_downvotes_at_request',
        'requester_subreddits_count','hour','weekday','is_weekend',
        'title_len_chars','title_len_words','body_len_chars','body_len_words',
        'url_count','has_imgur','digits_count','dollar_flag','exclam_count','question_count',
        *list(LEX_PATTERNS.keys()),
    ]
    for c in nonneg_cols:
        if c in out.columns:
            out[c] = np.log1p(np.clip(out[c], a_min=0.0, a_max=None))
    if 'requester_upvotes_minus_downvotes_at_request' in out.columns:
        out['requester_upvotes_minus_downvotes_at_request'] = safe_log1p_signed(out['requester_upvotes_minus_downvotes_at_request'].values)
    return out.replace([np.inf,-np.inf],0.0).fillna(0.0)

meta_train = build_meta(train)
meta_test = build_meta(test)
print('Meta shapes:', meta_train.shape, meta_test.shape, flush=True)

def build_subreddit_text(df: pd.DataFrame) -> pd.Series:
    if 'requester_subreddits_at_request' not in df.columns:
        return pd.Series(['']*len(df), index=df.index)
    def to_line(x):
        if isinstance(x, list):
            return ' '.join([str(s).lower() for s in x if isinstance(s, str)])
        return ''
    return df['requester_subreddits_at_request'].apply(to_line)

sub_train_txt = build_subreddit_text(train)
sub_test_txt = build_subreddit_text(test)

# Freeze grouped folds and reuse
sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
folds = list(sgkf.split(np.zeros(len(y)), y, groups))
print('Prepared grouped folds:', [(len(tr), len(va)) for tr,va in folds], flush=True)

def get_vecs_meta():
    wv = TfidfVectorizer(ngram_range=(1,2), min_df=3, max_df=0.95, max_features=100000,
                         sublinear_tf=True, lowercase=True, dtype=np.float32)
    cv = TfidfVectorizer(analyzer='char', ngram_range=(3,6), min_df=2, max_features=200000,
                         sublinear_tf=True, lowercase=True, dtype=np.float32)
    return wv, cv
def get_sub_vec():
    return TfidfVectorizer(ngram_range=(1,1), min_df=3, max_features=600, lowercase=True, token_pattern=r'[^\s]+', dtype=np.float32)

def run_lr_grouped(include_subs: bool):
    oof = np.zeros(len(train), dtype=np.float32)
    test_preds = []
    for i, (trn_idx, val_idx) in enumerate(folds, 1):
        t0 = time.time()
        # Text vecs
        wv, cv = get_vecs_meta()
        Xw_tr = wv.fit_transform(full_text_train2.iloc[trn_idx])
        Xc_tr = cv.fit_transform(full_text_train2.iloc[trn_idx])
        Xw_va = wv.transform(full_text_train2.iloc[val_idx])
        Xc_va = cv.transform(full_text_train2.iloc[val_idx])
        # Meta scaler
        scaler = StandardScaler(with_mean=False)
        Xm_tr = scaler.fit_transform(meta_train.iloc[trn_idx])
        Xm_va = scaler.transform(meta_train.iloc[val_idx])
        if include_subs:
            sv = get_sub_vec()
            Xs_tr = sv.fit_transform(sub_train_txt.iloc[trn_idx])
            Xs_va = sv.transform(sub_train_txt.iloc[val_idx])
            X_tr = sparse.hstack([Xw_tr, Xc_tr, Xm_tr, Xs_tr], format='csr')
            X_va = sparse.hstack([Xw_va, Xc_va, Xm_va, Xs_va], format='csr')
        else:
            X_tr = sparse.hstack([Xw_tr, Xc_tr, Xm_tr], format='csr')
            X_va = sparse.hstack([Xw_va, Xc_va, Xm_va], format='csr')
        clf = LogisticRegression(penalty='l2', C=2.0, solver='saga', max_iter=3000, n_jobs=-1, class_weight='balanced', verbose=0)
        clf.fit(X_tr, y[trn_idx])
        oof[val_idx] = clf.predict_proba(X_va)[:,1]
        # Test via fold pipelines
        Xw_te = wv.transform(full_text_test2)
        Xc_te = cv.transform(full_text_test2)
        Xm_te = scaler.transform(meta_test)
        if include_subs:
            Xs_te = sv.transform(sub_test_txt)
            X_te = sparse.hstack([Xw_te, Xc_te, Xm_te, Xs_te], format='csr')
        else:
            X_te = sparse.hstack([Xw_te, Xc_te, Xm_te], format='csr')
        test_preds.append(clf.predict_proba(X_te)[:,1].astype(np.float32))
        print(f'  LR {"+subs" if include_subs else "+meta"} fold {i} AUC={roc_auc_score(y[val_idx], oof[val_idx]):.5f} in {time.time()-t0:.1f}s', flush=True)
    test_mean = np.mean(np.vstack(test_preds), axis=0).astype(np.float32)
    auc = roc_auc_score(y, oof)
    return oof, test_mean, auc

print('=== Refit LR meta+text+lex with grouped folds ===', flush=True)
oof_lr_meta_g, test_lr_meta_g, auc_lr_meta_g = run_lr_grouped(include_subs=False)
print(f'OOF AUC (LR meta+text+lex, grouped): {auc_lr_meta_g:.5f}', flush=True)
np.save('oof_lr_meta_g.npy', oof_lr_meta_g); np.save('test_lr_meta_g.npy', test_lr_meta_g)

print('=== Refit LR meta+text+lex+subs with grouped folds ===', flush=True)
oof_lr_subs_g, test_lr_subs_g, auc_lr_subs_g = run_lr_grouped(include_subs=True)
print(f'OOF AUC (LR meta+text+lex+subs, grouped): {auc_lr_subs_g:.5f}', flush=True)
np.save('oof_lr_subs_g.npy', oof_lr_subs_g); np.save('test_lr_subs_g.npy', test_lr_subs_g)

# Load XGB grouped (from prior cell 11) and assemble NNLS 3-way blend
oof_xgb_g = np.load('oof_xgb_svd_meta.npy') if os.path.exists('oof_xgb_svd_meta.npy') else None
test_xgb_g = np.load('test_xgb_svd_meta.npy') if os.path.exists('test_xgb_svd_meta.npy') else None
if oof_xgb_g is None or test_xgb_g is None:
    print('XGB group OOF/Test not found; skipping XGB in blend.')
    P = np.vstack([oof_lr_meta_g, oof_lr_subs_g]).T
    T = np.vstack([test_lr_meta_g, test_lr_subs_g]).T
else:
    P = np.vstack([oof_lr_meta_g, oof_lr_subs_g, oof_xgb_g]).T
    T = np.vstack([test_lr_meta_g, test_lr_subs_g, test_xgb_g]).T

w, _ = nnls(P, y.astype(float))
w = w / w.sum() if w.sum() > 0 else w
blend_oof = P @ w
blend_auc = roc_auc_score(y, blend_oof)
print('NNLS weights:', w, flush=True)
print(f'Blended OOF AUC (grouped, NNLS): {blend_auc:.5f}', flush=True)

# T is (n_samples, n_models). Multiply directly by weights to get test preds (n_samples,).
blend_test = T @ w
sub = pd.DataFrame({'request_id': test['request_id'], 'requester_received_pizza': blend_test.astype(np.float32)})
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv (NNLS 3-way grouped blend). Rows:', len(sub))

Meta shapes: (2878, 42) (1162, 42)
Prepared grouped folds: [(2302, 576), (2302, 576), (2303, 575), (2303, 575), (2302, 576)]
=== Refit LR meta+text+lex with grouped folds ===
  LR +meta fold 1 AUC=0.64905 in 65.7s
  LR +meta fold 2 AUC=0.67542 in 63.2s
  LR +meta fold 3 AUC=0.65399 in 60.9s
  LR +meta fold 4 AUC=0.67675 in 58.7s
  LR +meta fold 5 AUC=0.69296 in 54.7s
OOF AUC (LR meta+text+lex, grouped): 0.67012
=== Refit LR meta+text+lex+subs with grouped folds ===
  LR +subs fold 1 AUC=0.62822 in 63.0s
  LR +subs fold 2 AUC=0.66985 in 55.1s
  LR +subs fold 3 AUC=0.64885 in 51.7s
  LR +subs fold 4 AUC=0.67411 in 56.5s
  LR +subs fold 5 AUC=0.69863 in 55.5s
OOF AUC (LR meta+text+lex+subs, grouped): 0.66413
NNLS weights: [0.31847639 0.21452477 0.46699884]
Blended OOF AUC (grouped, NNLS): 0.68219
Saved submission.csv (NNLS 3-way grouped blend). Rows: 1162
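
`build_meta` compresses heavy-tailed counts with `log1p` after clipping at zero, and applies a sign-preserving variant to the upvotes-minus-downvotes column, which can be negative. The scalar sketch below illustrates both transforms using only the standard library. The function names `log1p_nonneg` and `signed_log1p` are hypothetical stand-ins for the vectorised NumPy expressions in the cell above.

```python
import math


def log1p_nonneg(x: float) -> float:
    """log(1 + x) after clipping negatives to zero, as done for count-like columns."""
    return math.log1p(max(x, 0.0))


def signed_log1p(x: float) -> float:
    """sign(x) * log(1 + |x|): compresses magnitude while keeping the sign."""
    return math.copysign(math.log1p(abs(x)), x)


# Hypothetical raw values for illustration.
for raw in (-250.0, -1.0, 0.0, 1.0, 250.0):
    print(f"raw={raw:8.1f}  log1p_nonneg={log1p_nonneg(raw):.4f}  signed_log1p={signed_log1p(raw):+.4f}")
```

The signed variant is odd-symmetric and monotone, so a net-negative vote balance stays distinguishable from a net-positive one of the same magnitude, whereas the clipped variant maps every negative value to zero.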


In [18]:
# Step 3A: CatBoost TEXT-only smoke test (1-fold) on grouped folds
import sys, subprocess, time, numpy as np, pandas as pd, os
from sklearn.model_selection import StratifiedGroupKFold
from sklearn.metrics import roc_auc_score

# Use the installed CatBoost if present; otherwise install a pinned version (1.2.5)
try:
    import catboost as cb
    print('CatBoost existing:', getattr(cb, '__version__', 'unknown'), flush=True)
except Exception:
    subprocess.run([sys.executable, '-m', 'pip', 'install', '--quiet', '--no-cache-dir', 'catboost==1.2.5'], check=True)
    import catboost as cb
    print('CatBoost installed:', cb.__version__, flush=True)

def load_df(path):
    try:
        return pd.read_json(path)
    except ValueError:
        import json as _json
        with open(path, 'r') as f:
            return pd.DataFrame(_json.load(f))

train = load_df('train.json')
test = load_df('test.json')
target_col = 'requester_received_pizza'
title_col = 'request_title' if 'request_title' in train.columns else 'title'
body_col = 'request_text_edit_aware' if 'request_text_edit_aware' in train.columns else 'request_text'

def join_subs_for_cat(df: pd.DataFrame) -> pd.Series:
    if 'requester_subreddits_at_request' not in df.columns:
        return pd.Series(['']*len(df), index=df.index)
    def to_line(x):
        if isinstance(x, list):
            return ' '.join([str(s) for s in x if isinstance(s, str)])
        return ''
    return df['requester_subreddits_at_request'].apply(to_line)

# Text-only features (3 columns) with title duplicated to upweight
X_text = pd.DataFrame({
    'cb_title': (train[title_col].fillna('').astype(str) + ' ' + train[title_col].fillna('').astype(str)),
    'cb_body': train[body_col].fillna('').astype(str),
    'cb_subs': join_subs_for_cat(train)
})
X_text_test = pd.DataFrame({
    'cb_title': (test[title_col].fillna('').astype(str) + ' ' + test[title_col].fillna('').astype(str)),
    'cb_body': test[body_col].fillna('').astype(str),
    'cb_subs': join_subs_for_cat(test)
})
y = train[target_col].astype(int).values
groups = train['requester_username'].fillna('').astype(str).values if 'requester_username' in train.columns else np.arange(len(train))
text_feature_indices = [0,1,2]
print('CB text-only shapes:', X_text.shape, X_text_test.shape, flush=True)

# Freeze grouped folds and take the first fold for a smoke test
sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
folds = list(sgkf.split(X_text, y, groups))
trn_idx, val_idx = folds[0]
print('Smoke test using fold 1:', len(trn_idx), len(val_idx), flush=True)

train_pool = cb.Pool(data=X_text.iloc[trn_idx], label=y[trn_idx], text_features=text_feature_indices)
valid_pool = cb.Pool(data=X_text.iloc[val_idx], label=y[val_idx], text_features=text_feature_indices)
test_pool = cb.Pool(data=X_text_test, text_features=text_feature_indices)

# Safe CPU params (Config B from expert advice)
params = dict(
    task_type='CPU',
    eval_metric='AUC',
    loss_function='Logloss',
    auto_class_weights='Balanced',
    early_stopping_rounds=100,
    bootstrap_type='Bayesian',
    bagging_temperature=0.5,
    rsm=0.9,
    depth=5,
    learning_rate=0.05,
    l2_leaf_reg=6,
    iterations=600,  # shorter for smoke test
    random_seed=42,
    verbose=100,
    allow_writing_files=False,
    thread_count=-1
)

t0 = time.time()
model = cb.CatBoostClassifier(**params)
model.fit(train_pool, eval_set=valid_pool, use_best_model=True)
oof_fold = model.predict_proba(valid_pool)[:,1]
auc_fold = roc_auc_score(y[val_idx], oof_fold)
print(f'[CB smoke] Fold1 val AUC={auc_fold:.5f} | elapsed {time.time()-t0:.1f}s', flush=True)

# If stable, produce provisional test preds for this fold (will average across folds in full run)
test_pred_fold1 = model.predict_proba(test_pool)[:,1].astype(np.float32)
np.save('cb_smoke_oof_fold1.npy', oof_fold)
np.save('cb_smoke_test_fold1.npy', test_pred_fold1)
print('Saved cb_smoke_oof_fold1.npy and cb_smoke_test_fold1.npy', flush=True)

CatBoost existing: 1.2.8
CB text-only shapes: (2878, 3) (1162, 3)
Smoke test using fold 1: 2302 576
0:	test: 0.5682790	best: 0.5682790 (0)	total: 86.3ms	remaining: 51.7s
100:	test: 0.5884131	best: 0.5884131 (100)	total: 2.39s	remaining: 11.8s
200:	test: 0.6149531	best: 0.6149531 (200)	total: 4.64s	remaining: 9.22s
300:	test: 0.6099216	best: 0.6158050 (201)	total: 6.92s	remaining: 6.87s
Stopped by overfitting detector  (100 iterations wait)

bestTest = 0.6158050412
bestIteration = 201

Shrink model to first 202 iterations.
[CB smoke] Fold1 val AUC=0.61581 | elapsed 7.1s
Saved cb_smoke_oof_fold1.npy and cb_smoke_test_fold1.npy
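
The grouped folds are meant to keep every `requester_username` on one side of each split. A quick way to confirm that is to intersect the group sets of the train and validation indices for each fold. The sketch below is a hypothetical helper (`check_group_disjoint` is not defined in the notebook) run on toy folds. It works on any indexable `groups` sequence, including the NumPy array built above. Note that the notebook fills missing usernames with `''`, so all such rows share one group.

```python
from typing import Hashable, List, Sequence, Tuple

Fold = Tuple[Sequence[int], Sequence[int]]


def check_group_disjoint(folds: Sequence[Fold], groups: Sequence[Hashable]) -> None:
    """Raise ValueError if any group appears in both the train and validation part of a fold."""
    for k, (trn_idx, val_idx) in enumerate(folds, 1):
        trn_groups = {groups[i] for i in trn_idx}
        val_groups = {groups[i] for i in val_idx}
        shared = trn_groups & val_groups
        if shared:
            raise ValueError(f"fold {k}: groups on both sides of the split: {sorted(map(str, shared))}")


# Toy example (hypothetical usernames): rows 0 and 1 belong to the same requester.
toy_groups: List[str] = ["alice", "alice", "bob", "carol", "dave", "dave"]
toy_folds: List[Fold] = [
    ([2, 3, 4, 5], [0, 1]),  # alice held out together
    ([0, 1, 2, 3], [4, 5]),  # dave held out together
    ([0, 1, 4, 5], [2, 3]),  # bob and carol held out
]
check_group_disjoint(toy_folds, toy_groups)
print("no group overlap in", len(toy_folds), "folds")
```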


In [19]:
# Step 3A (full): CatBoost TEXT-only 5-fold grouped CV + 4-way NNLS blend
import sys, subprocess, time, numpy as np, pandas as pd, os
from sklearn.model_selection import StratifiedGroupKFold
from sklearn.metrics import roc_auc_score
from scipy.optimize import nnls

try:
    import catboost as cb
    print('CatBoost version:', getattr(cb, '__version__', 'unknown'), flush=True)
except Exception:
    subprocess.run([sys.executable, '-m', 'pip', 'install', '--quiet', '--no-cache-dir', 'catboost==1.2.5'], check=True)
    import catboost as cb
    print('CatBoost installed:', cb.__version__, flush=True)

def load_df(path):
    try:
        return pd.read_json(path)
    except ValueError:
        import json as _json
        with open(path, 'r') as f:
            return pd.DataFrame(_json.load(f))

train = load_df('train.json')
test = load_df('test.json')
target_col = 'requester_received_pizza'
title_col = 'request_title' if 'request_title' in train.columns else 'title'
body_col = 'request_text_edit_aware' if 'request_text_edit_aware' in train.columns else 'request_text'

def join_subs_for_cat(df: pd.DataFrame) -> pd.Series:
    if 'requester_subreddits_at_request' not in df.columns:
        return pd.Series(['']*len(df), index=df.index)
    def to_line(x):
        if isinstance(x, list):
            return ' '.join([str(s) for s in x if isinstance(s, str)])
        return ''
    return df['requester_subreddits_at_request'].apply(to_line)

# 3 text columns (title duplicated to upweight)
X_text = pd.DataFrame({
    'cb_title': (train[title_col].fillna('').astype(str) + ' ' + train[title_col].fillna('').astype(str)),
    'cb_body': train[body_col].fillna('').astype(str),
    'cb_subs': join_subs_for_cat(train)
})
X_text_test = pd.DataFrame({
    'cb_title': (test[title_col].fillna('').astype(str) + ' ' + test[title_col].fillna('').astype(str)),
    'cb_body': test[body_col].fillna('').astype(str),
    'cb_subs': join_subs_for_cat(test)
})
y = train[target_col].astype(int).values
groups = train['requester_username'].fillna('').astype(str).values if 'requester_username' in train.columns else np.arange(len(train))
text_feature_indices = [0,1,2]
print('CB text-only shapes:', X_text.shape, X_text_test.shape, flush=True)

# Freeze folds (must match other models):
sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
folds = list(sgkf.split(X_text, y, groups))

params = dict(
    task_type='CPU',
    eval_metric='AUC',
    loss_function='Logloss',
    auto_class_weights='Balanced',
    early_stopping_rounds=100,
    bootstrap_type='Bayesian',
    bagging_temperature=0.5,
    rsm=0.9,
    depth=5,
    learning_rate=0.05,
    l2_leaf_reg=6,
    iterations=1200,
    random_seed=42,
    verbose=100,
    allow_writing_files=False,
    thread_count=-1
)

oof_cb = np.zeros(len(train), dtype=np.float32)
test_preds = []
for i, (trn_idx, val_idx) in enumerate(folds, 1):
    t0 = time.time()
    train_pool = cb.Pool(data=X_text.iloc[trn_idx], label=y[trn_idx], text_features=text_feature_indices)
    valid_pool = cb.Pool(data=X_text.iloc[val_idx], label=y[val_idx], text_features=text_feature_indices)
    test_pool = cb.Pool(data=X_text_test, text_features=text_feature_indices)
    model = cb.CatBoostClassifier(**params)
    model.fit(train_pool, eval_set=valid_pool, use_best_model=True)
    oof_cb[val_idx] = model.predict_proba(valid_pool)[:,1]
    test_preds.append(model.predict_proba(test_pool)[:,1].astype(np.float32))
    print(f'  CB fold {i} AUC={roc_auc_score(y[val_idx], oof_cb[val_idx]):.5f} in {time.time()-t0:.1f}s', flush=True)

auc_cb = roc_auc_score(y, oof_cb)
test_cb = np.mean(np.vstack(test_preds), axis=0).astype(np.float32)
print(f'OOF AUC (CB text-only, grouped): {auc_cb:.5f}', flush=True)
np.save('oof_cat_text_g.npy', oof_cb)
np.save('test_cat_text_g.npy', test_cb)

# 4-way NNLS blend with grouped models if available
oof_lr_meta_g = np.load('oof_lr_meta_g.npy') if os.path.exists('oof_lr_meta_g.npy') else None
oof_lr_subs_g = np.load('oof_lr_subs_g.npy') if os.path.exists('oof_lr_subs_g.npy') else None
oof_xgb_g = np.load('oof_xgb_svd_meta.npy') if os.path.exists('oof_xgb_svd_meta.npy') else None
test_lr_meta_g = np.load('test_lr_meta_g.npy') if os.path.exists('test_lr_meta_g.npy') else None
test_lr_subs_g = np.load('test_lr_subs_g.npy') if os.path.exists('test_lr_subs_g.npy') else None
test_xgb_g = np.load('test_xgb_svd_meta.npy') if os.path.exists('test_xgb_svd_meta.npy') else None

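# Keep a model only when both its OOF and test predictions exist, so the columns of P and T stay aligned
pairs = [(o, t) for o, t in [(oof_lr_meta_g, test_lr_meta_g), (oof_lr_subs_g, test_lr_subs_g),
                             (oof_xgb_g, test_xgb_g), (oof_cb, test_cb)]
         if o is not None and t is not None]
parts_oof = [o for o, _ in pairs]
parts_test = [t for _, t in pairs]
P = np.vstack(parts_oof).T
T = np.vstack(parts_test).T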
w, _ = nnls(P, y.astype(float))
w = w / w.sum() if w.sum() > 0 else w
blend_oof = P @ w
blend_auc = roc_auc_score(y, blend_oof)
print('NNLS weights (4-way):', w, flush=True)
print(f'Blended OOF AUC (grouped, NNLS 4-way): {blend_auc:.5f}', flush=True)

blend_test = T @ w
pd.DataFrame({'request_id': test['request_id'], 'requester_received_pizza': blend_test.astype(np.float32)}).to_csv('submission.csv', index=False)
print('Saved submission.csv (NNLS 4-way grouped blend). Rows:', len(blend_test))

CatBoost version: 1.2.8
CB text-only shapes: (2878, 3) (1162, 3)
0:	test: 0.5682790	best: 0.5682790 (0)	total: 24.9ms	remaining: 29.9s
100:	test: 0.5884131	best: 0.5884131 (100)	total: 2.35s	remaining: 25.6s
200:	test: 0.6149531	best: 0.6149531 (200)	total: 4.62s	remaining: 23s
300:	test: 0.6099216	best: 0.6158050 (201)	total: 6.92s	remaining: 20.7s
Stopped by overfitting detector  (100 iterations wait)

bestTest = 0.6158050412
bestIteration = 201

Shrink model to first 202 iterations.
  CB fold 1 AUC=0.61581 in 7.1s
0:	test: 0.5548821	best: 0.5548821 (0)	total: 21.3ms	remaining: 25.5s
100:	test: 0.6285441	best: 0.6442657 (38)	total: 2.35s	remaining: 25.6s
Stopped by overfitting detector  (100 iterations wait)

bestTest = 0.6442656897
bestIteration = 38

Shrink model to first 39 iterations.
  CB fold 2 AUC=0.64427 in 3.3s
0:	test: 0.5471933	best: 0.5471933 (0)	total: 20.4ms	remaining: 24.5s
100:	test: 0.6064956	best: 0.6074098 (94)	total: 2.28s	remaining: 24.8s
200:	test: 0.6246672	best: 0.6252606 (168)	total: 4.61s	remaining: 22.9s
300:	test: 0.6210746	best: 0.6316921 (215)	total: 6.96s	remaining: 20.8s
Stopped by overfitting detector  (100 iterations wait)

bestTest = 0.6316920609
bestIteration = 215

Shrink model to first 216 iterations.
  CB fold 3 AUC=0.63169 in 7.5s
0:	test: 0.5549409	best: 0.5549409 (0)	total: 21.5ms	remaining: 25.8s
100:	test: 0.6149678	best: 0.6214870 (86)	total: 2.32s	remaining: 25.2s
Stopped by overfitting detector  (100 iterations wait)

bestTest = 0.6214869966
bestIteration = 86

Shrink model to first 87 iterations.
  CB fold 4 AUC=0.62149 in 4.4s
0:	test: 0.5545711	best: 0.5545711 (0)	total: 22.5ms	remaining: 27s
100:	test: 0.6426964	best: 0.6568086 (22)	total: 2.34s	remaining: 25.5s
Stopped by overfitting detector  (100 iterations wait)

bestTest = 0.6568085899
bestIteration = 22

Shrink model to first 23 iterations.
  CB fold 5 AUC=0.65681 in 3.0s
OOF AUC (CB text-only, grouped): 0.62283
NNLS weights (4-way): [0.27539628 0.21164104 0.4244068  0.08855588]
Blended OOF AUC (grouped, NNLS 4-way): 0.68318
Saved submission.csv (NNLS 4-way grouped blend). Rows: 1162
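
The blend divides the NNLS weights by their sum before scoring. AUC depends only on the ordering of the scores, so dividing every blended score by the same positive constant leaves the OOF AUC unchanged; the normalisation only brings the blended values back to roughly the scale of the inputs. The sketch below checks this on toy data with a pairwise AUC. All names and numbers are hypothetical, and `raw_w` merely stands in for an un-normalised NNLS solution.

```python
from typing import List, Sequence


def pairwise_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def blend(columns: Sequence[Sequence[float]], weights: Sequence[float]) -> List[float]:
    """Weighted sum of per-model prediction columns, row by row."""
    n_rows = len(columns[0])
    return [sum(w * col[i] for w, col in zip(weights, columns)) for i in range(n_rows)]


# Toy labels and three hypothetical model OOF columns.
y_toy = [0, 1, 0, 1, 1, 0]
m1 = [0.10, 0.80, 0.30, 0.60, 0.70, 0.20]
m2 = [0.20, 0.50, 0.40, 0.90, 0.30, 0.10]
m3 = [0.30, 0.70, 0.60, 0.40, 0.90, 0.20]

raw_w = [0.9, 0.0, 1.5]                     # stand-in for un-normalised NNLS weights
norm_w = [w / sum(raw_w) for w in raw_w]    # same direction, sums to 1

auc_raw = pairwise_auc(y_toy, blend([m1, m2, m3], raw_w))
auc_norm = pairwise_auc(y_toy, blend([m1, m2, m3], norm_w))
print(f"AUC raw={auc_raw:.5f} normalised={auc_norm:.5f}")
```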


In [20]:
# Step 3B: XGBoost SVD bump (word=300, char=400, +subreddit SVD=50) with grouped CV + NNLS blend
import sys, subprocess, time, numpy as np, pandas as pd, os, re
from sklearn.model_selection import StratifiedGroupKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score
from scipy.optimize import nnls

# Ensure xgboost installed
try:
    import xgboost as xgb
except Exception:
    print('Installing xgboost...', flush=True)
    subprocess.run([sys.executable, '-m', 'pip', 'install', '--quiet', 'xgboost==2.1.1'], check=True)
    import xgboost as xgb

def load_df(path):
    try:
        return pd.read_json(path)
    except ValueError:
        import json as _json
        with open(path, 'r') as f:
            return pd.DataFrame(_json.load(f))

train = load_df('train.json')
test = load_df('test.json')
target_col = 'requester_received_pizza'
title_col = 'request_title' if 'request_title' in train.columns else 'title'
body_col = 'request_text_edit_aware' if 'request_text_edit_aware' in train.columns else 'request_text'

def build_text_upweighted(df: pd.DataFrame, title_col: str, body_col: str) -> pd.Series:
    t = df[title_col].fillna('').astype(str)
    b = df[body_col].fillna('').astype(str)
    return t + ' ' + t + ' [SEP] ' + b

full_text_train2 = build_text_upweighted(train, title_col, body_col)
full_text_test2 = build_text_upweighted(test, title_col, body_col)
y = train[target_col].astype(int).values
groups = train['requester_username'].fillna('').astype(str).values if 'requester_username' in train.columns else np.arange(len(train))

# Meta/lexicon builders (same leakage-safe set used earlier)
def count_urls(s: str) -> int: return len(re.findall(r'https?://\S+', s or ''))
def has_imgur(s: str) -> int: return 1 if re.search(r'imgur\.com', s or '', flags=re.IGNORECASE) else 0
def count_digits(s: str) -> int: return sum(ch.isdigit() for ch in (s or ''))
def dollar_flag(s: str) -> int: return 1 if ('$' in (s or '')) or re.search(r'\b\d+\s*(dollars|bucks)\b', s or '', flags=re.IGNORECASE) else 0
def caps_ratio(s: str) -> float:
    if not s: return 0.0
    letters = [c for c in s if c.isalpha()]
    return (sum(c.isupper() for c in letters) / max(1, len(letters))) if letters else 0.0
def word_count(s: str) -> int: return len((s or '').split())
def exclam_count(s: str) -> int: return (s or '').count('!')
def question_count(s: str) -> int: return (s or '').count('?')
def parse_subreddit_count(x) -> int: return len(x) if isinstance(x, list) else 0
def safe_log1p_signed(x): return np.sign(x) * np.log1p(np.abs(x))

LEX_PATTERNS = {
    'lex_please': r'\bplease\b',
    'lex_thanks': r'\b(thank you|thanks in advance|thanks|thank|tia)\b',
    'lex_appreciate': r'\b(appreciate|appreciated)\b',
    'lex_pay_it_forward': r'\b(pay it forward|return the favor)\b',
    'lex_repay': r'\b(repay|pay you back|pay back)\b',
    'lex_willing': r"\b(willing to|i\'ll|i will|i can)\b",
    'lex_karma': r'\bkarma\b',
    'lex_evidence': r'\b(proof|receipt|photo|picture|pic|verify|verification|evidence)\b',
    'lex_imgur_word': r'\bimgur\b',
    'lex_student': r'\b(student|college|university|finals|exam|midterm)\b',
    'lex_jobloss': r'\b(unemployed|laid off|lost my job|between jobs|job hunt)\b',
    'lex_broke': r'\b(broke)\b',
    'lex_rent_bills': r'\b(rent|bill|bills|utilities|electric|gas|water|paycheck)\b',
    'lex_family': r'\b(family|kids?|children|baby|pregnant|son|daughter|wife|husband)\b',
    'lex_hungry': r'\b(hungry|starving|no food)\b|\bfood (stamps|pantry)\b',
    'lex_struggling': r'\b(desperate|struggling)\b',
    'lex_urgency': r'\b(tonight|today|tomorrow|asap|urgent)\b',
    'lex_help': r'\bhelp\b',
}

def add_lexicons(df: pd.DataFrame, title_col: str, body_col: str) -> pd.DataFrame:
    txt = (df[title_col].fillna('').astype(str) + ' ' + df[body_col].fillna('').astype(str)).str.lower()
    out = pd.DataFrame(index=df.index)
    for name, pat in LEX_PATTERNS.items():
        out[name] = txt.str.count(pat, flags=re.IGNORECASE).astype(float)
    return out

def build_meta(df: pd.DataFrame) -> pd.DataFrame:
    title = df[title_col].fillna('').astype(str)
    body = df[body_col].fillna('').astype(str)
    ts = pd.to_numeric(df['unix_timestamp_of_request'], errors='coerce')
    dt = pd.to_datetime(ts, unit='s', utc=True)
    out = pd.DataFrame(index=df.index)
    keep_cols = [
        'requester_account_age_in_days_at_request',
        'requester_days_since_first_post_on_raop_at_request',
        'requester_number_of_comments_at_request',
        'requester_number_of_comments_in_raop_at_request',
        'requester_number_of_posts_at_request',
        'requester_number_of_posts_on_raop_at_request',
        'requester_number_of_subreddits_at_request',
        'requester_upvotes_minus_downvotes_at_request',
        'requester_upvotes_plus_downvotes_at_request',
    ]
    for c in keep_cols:
        out[c] = pd.to_numeric(df[c], errors='coerce') if c in df.columns else 0.0
    out['requester_subreddits_count'] = df['requester_subreddits_at_request'].apply(parse_subreddit_count).astype(float) if 'requester_subreddits_at_request' in df.columns else 0.0
    out['hour'] = dt.dt.hour.astype(float)
    out['weekday'] = dt.dt.weekday.astype(float)
    out['is_weekend'] = dt.dt.weekday.isin([5,6]).astype(float)
    out['title_len_chars'] = title.str.len().astype(float)
    out['title_len_words'] = title.apply(word_count).astype(float)
    out['body_len_chars'] = body.str.len().astype(float)
    out['body_len_words'] = body.apply(word_count).astype(float)
    out['url_count'] = body.apply(count_urls).astype(float)
    out['has_imgur'] = body.apply(has_imgur).astype(float)
    out['digits_count'] = body.apply(count_digits).astype(float)
    out['dollar_flag'] = body.apply(dollar_flag).astype(float)
    out['caps_ratio'] = body.apply(caps_ratio).astype(float)
    out['exclam_count'] = body.apply(exclam_count).astype(float)
    out['question_count'] = body.apply(question_count).astype(float)
    lex = add_lexicons(df, title_col, body_col)
    out = pd.concat([out, lex], axis=1)
    out = out.replace([np.inf, -np.inf], np.nan).fillna(0.0)
    nonneg_cols = [
        'requester_account_age_in_days_at_request','requester_days_since_first_post_on_raop_at_request',
        'requester_number_of_comments_at_request','requester_number_of_comments_in_raop_at_request',
        'requester_number_of_posts_at_request','requester_number_of_posts_on_raop_at_request',
        'requester_number_of_subreddits_at_request','requester_upvotes_plus_downvotes_at_request',
        'requester_subreddits_count','hour','weekday','is_weekend',
        'title_len_chars','title_len_words','body_len_chars','body_len_words',
        'url_count','has_imgur','digits_count','dollar_flag','exclam_count','question_count',
        *list(LEX_PATTERNS.keys()),
    ]
    for c in nonneg_cols:
        if c in out.columns:
            out[c] = np.log1p(np.clip(out[c], a_min=0.0, a_max=None))
    if 'requester_upvotes_minus_downvotes_at_request' in out.columns:
        out['requester_upvotes_minus_downvotes_at_request'] = safe_log1p_signed(out['requester_upvotes_minus_downvotes_at_request'].values)
    out = out.replace([np.inf, -np.inf], 0.0).fillna(0.0)
    return out

meta_train = build_meta(train)
meta_test = build_meta(test)

def build_subreddit_text(df: pd.DataFrame) -> pd.Series:
    if 'requester_subreddits_at_request' not in df.columns:
        return pd.Series([''] * len(df), index=df.index)  # keep the index aligned with df
    def to_line(x):
        if isinstance(x, list):
            return ' '.join([str(s).lower() for s in x if isinstance(s, str)])
        return ''
    return df['requester_subreddits_at_request'].apply(to_line)

sub_train_txt = build_subreddit_text(train)
sub_test_txt = build_subreddit_text(test)

def get_vecs_for_svd():
    wv = TfidfVectorizer(ngram_range=(1,2), min_df=3, max_df=0.95, max_features=250000,
                         sublinear_tf=True, lowercase=True, dtype=np.float32)
    cv = TfidfVectorizer(analyzer='char', ngram_range=(3,6), min_df=2, max_features=400000,
                         sublinear_tf=True, lowercase=True, dtype=np.float32)
    return wv, cv

def get_sub_vec():
    return TfidfVectorizer(ngram_range=(1,1), min_df=3, max_features=600, lowercase=True, token_pattern=r'[^\s]+', dtype=np.float32)

def build_fold_features(tr_text, va_text, tr_meta, va_meta, tr_sub, va_sub,
                        n_comp_word=300, n_comp_char=400, n_comp_sub=50):
    # Fit vectorizers on train fold only
    wv, cv = get_vecs_for_svd()
    sv = get_sub_vec()
    Xw_tr = wv.fit_transform(tr_text)
    Xc_tr = cv.fit_transform(tr_text)
    Xs_tr = sv.fit_transform(tr_sub)
    Xw_va = wv.transform(va_text)
    Xc_va = cv.transform(va_text)
    Xs_va = sv.transform(va_sub)
    # SVD reductions
    svd_w = TruncatedSVD(n_components=n_comp_word, random_state=42)
    svd_c = TruncatedSVD(n_components=n_comp_char, random_state=42)
    svd_s = TruncatedSVD(n_components=n_comp_sub, random_state=42)
    Zw_tr = svd_w.fit_transform(Xw_tr).astype(np.float32)
    Zc_tr = svd_c.fit_transform(Xc_tr).astype(np.float32)
    Zs_tr = svd_s.fit_transform(Xs_tr).astype(np.float32)
    Zw_va = svd_w.transform(Xw_va).astype(np.float32)
    Zc_va = svd_c.transform(Xc_va).astype(np.float32)
    Zs_va = svd_s.transform(Xs_va).astype(np.float32)
    # Scale meta
    scaler = StandardScaler(with_mean=True, with_std=True)
    M_tr = scaler.fit_transform(tr_meta).astype(np.float32)
    M_va = scaler.transform(va_meta).astype(np.float32)
    X_tr = np.hstack([Zw_tr, Zc_tr, Zs_tr, M_tr]).astype(np.float32)
    X_va = np.hstack([Zw_va, Zc_va, Zs_va, M_va]).astype(np.float32)
    return X_tr, X_va, (wv, cv, sv, svd_w, svd_c, svd_s, scaler)

def build_test_features(text_train, text_test, sub_train, sub_test, transformers, meta_train, meta_test):
    # Project test rows with the fold-fitted transformers, without refitting.
    # Refitting TF-IDF/SVD on the full training set changes the vocabulary and the SVD basis,
    # so the fold-trained booster would receive features it was not trained on.
    # The *_train arguments are unused and kept only so existing call sites still work.
    wv, cv, sv, svd_w, svd_c, svd_s, scaler = transformers
    Zw_te = svd_w.transform(wv.transform(text_test)).astype(np.float32)
    Zc_te = svd_c.transform(cv.transform(text_test)).astype(np.float32)
    Zs_te = svd_s.transform(sv.transform(sub_test)).astype(np.float32)
    M_te = scaler.transform(meta_test).astype(np.float32)
    X_te = np.hstack([Zw_te, Zc_te, Zs_te, M_te]).astype(np.float32)
    return X_te

sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
folds = list(sgkf.split(full_text_train2, y, groups))

print('=== CV: XGB on SVD(word=300,char=400,subs=50) + meta ===', flush=True)
oof_xgb_bump = np.zeros(len(train), dtype=np.float32)
test_preds = []
fold_times = []

params = dict(
    objective='binary:logistic',
    eval_metric='auc',
    tree_method='hist',
    max_depth=4,
    learning_rate=0.045,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_alpha=0.0,
    reg_lambda=1.5,
    max_bin=256,
    min_child_weight=1.0,
    random_state=42
)

for fold, (trn_idx, val_idx) in enumerate(folds, 1):
    t0 = time.time()
    X_tr, X_va, pipes = build_fold_features(
        full_text_train2.iloc[trn_idx],
        full_text_train2.iloc[val_idx],
        meta_train.iloc[trn_idx].values,
        meta_train.iloc[val_idx].values,
        sub_train_txt.iloc[trn_idx],
        sub_train_txt.iloc[val_idx],
        n_comp_word=300, n_comp_char=400, n_comp_sub=50
    )
    dtr = xgb.DMatrix(X_tr, label=y[trn_idx])
    dva = xgb.DMatrix(X_va, label=y[val_idx])
    booster = xgb.train(params, dtr, num_boost_round=2000, evals=[(dva, 'valid')],
                        verbose_eval=False,
                        early_stopping_rounds=100)
    oof_xgb_bump[val_idx] = booster.predict(dva, iteration_range=(0, booster.best_iteration+1))
    Xt = build_test_features(full_text_train2, full_text_test2, sub_train_txt, sub_test_txt, pipes, meta_train.values, meta_test.values)
    dte = xgb.DMatrix(Xt)
    test_preds.append(booster.predict(dte, iteration_range=(0, booster.best_iteration+1)).astype(np.float32))
    dt = time.time()-t0
    fold_times.append(dt)
    print(f'  Fold {fold} AUC={roc_auc_score(y[val_idx], oof_xgb_bump[val_idx]):.5f} | iters={booster.best_iteration+1} | {dt:.1f}s', flush=True)

auc_xgb_bump = roc_auc_score(y, oof_xgb_bump)
print(f'OOF AUC (XGB SVD-bump+subs+meta): {auc_xgb_bump:.5f} | mean fold time: {np.mean(fold_times):.1f}s', flush=True)
test_xgb_bump = np.mean(np.vstack(test_preds), axis=0).astype(np.float32)
np.save('oof_xgb_svd_meta_subs_g.npy', oof_xgb_bump)
np.save('test_xgb_svd_meta_subs_g.npy', test_xgb_bump)

# NNLS blend with grouped models (LR_meta_g, LR_subs_g, CB_text_g, XGB_bump)
oof_lr_meta_g = np.load('oof_lr_meta_g.npy') if os.path.exists('oof_lr_meta_g.npy') else None
oof_lr_subs_g = np.load('oof_lr_subs_g.npy') if os.path.exists('oof_lr_subs_g.npy') else None
oof_cb_g = np.load('oof_cat_text_g.npy') if os.path.exists('oof_cat_text_g.npy') else None
test_lr_meta_g = np.load('test_lr_meta_g.npy') if os.path.exists('test_lr_meta_g.npy') else None
test_lr_subs_g = np.load('test_lr_subs_g.npy') if os.path.exists('test_lr_subs_g.npy') else None
test_cb_g = np.load('test_cat_text_g.npy') if os.path.exists('test_cat_text_g.npy') else None

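# Keep a model only when both its OOF and test arrays exist, so the columns of P and T stay aligned
pairs = [(o, t) for o, t in [(oof_lr_meta_g, test_lr_meta_g), (oof_lr_subs_g, test_lr_subs_g),
                             (oof_cb_g, test_cb_g), (oof_xgb_bump, test_xgb_bump)]
         if o is not None and t is not None]
parts_oof = [o for o, _ in pairs]
parts_test = [t for _, t in pairs]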
P = np.vstack(parts_oof).T
T = np.vstack(parts_test).T
w, _ = nnls(P, y.astype(float))
w = w / w.sum() if w.sum() > 0 else w
blend_oof = P @ w
blend_auc = roc_auc_score(y, blend_oof)
print('NNLS weights (with XGB bump):', w, flush=True)
print(f'Blended OOF AUC (grouped, NNLS): {blend_auc:.5f}', flush=True)

blend_test = T @ w
pd.DataFrame({'request_id': test['request_id'], 'requester_received_pizza': blend_test.astype(np.float32)}).to_csv('submission.csv', index=False)
print('Saved submission.csv (NNLS with XGB bump). Rows:', len(blend_test))

=== CV: XGB on SVD(word=300,char=400,subs=50) + meta ===
  Fold 1 AUC=0.63497 | iters=135 | 35.3s
  Fold 2 AUC=0.65671 | iters=146 | 35.8s
  Fold 3 AUC=0.66933 | iters=176 | 36.6s
  Fold 4 AUC=0.65017 | iters=58 | 34.9s
  Fold 5 AUC=0.70044 | iters=12 | 35.4s
OOF AUC (XGB SVD-bump+subs+meta): 0.64826 | mean fold time: 35.6s
NNLS weights (with XGB bump): [0.36479612 0.17490624 0.01748015 0.4428175 ]
Blended OOF AUC (grouped, NNLS): 0.67828
Saved submission.csv (NNLS with XGB bump). Rows: 1162
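
A note on test-time features for the SVD model: a booster trained on fold-fitted SVD features is only meaningful on rows projected with that same fitted SVD. The sketch below, on a synthetic dense matrix standing in for TF-IDF (all shapes and names here are illustrative assumptions), shows that refitting `TruncatedSVD` on a different row set generally yields different coordinates for the same rows.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
X_all = rng.random((200, 30))   # synthetic stand-in for a TF-IDF matrix
X_fold = X_all[:160]            # rows of one hypothetical training fold
X_new = rng.random((5, 30))     # hypothetical test rows

# Same hyperparameters and seed, but fitted on different row sets
svd_fold = TruncatedSVD(n_components=5, random_state=42).fit(X_fold)
svd_full = TruncatedSVD(n_components=5, random_state=42).fit(X_all)

Z_fold = svd_fold.transform(X_new)
Z_full = svd_full.transform(X_new)

# Largest coordinate-wise disagreement between the two projections
max_gap = float(np.abs(Z_fold - Z_full).max())
print('max |Z_fold - Z_full| =', max_gap)
```

The same reasoning applies to the TF-IDF vocabulary and the meta scaler: whatever was fitted for a fold's training rows is what should transform the rows scored by that fold's model.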


In [21]:
# Step 3C: Sentence-Transformer embeddings (all-MiniLM-L6-v2) + meta -> XGBoost (grouped CV) + NNLS blend
import sys, subprocess, time, os, re, numpy as np, pandas as pd
from sklearn.model_selection import StratifiedGroupKFold
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score
from scipy.optimize import nnls

# Install sentence-transformers (CPU) and xgboost if missing
try:
    from sentence_transformers import SentenceTransformer
except Exception:
    subprocess.run([sys.executable, '-m', 'pip', 'install', '--quiet', 'sentence-transformers==2.7.0'], check=True)
    from sentence_transformers import SentenceTransformer
try:
    import xgboost as xgb
except Exception:
    subprocess.run([sys.executable, '-m', 'pip', 'install', '--quiet', 'xgboost==2.1.1'], check=True)
    import xgboost as xgb

def load_df(path):
    try:
        return pd.read_json(path)
    except ValueError:
        import json as _json
        with open(path, 'r') as f:
            return pd.DataFrame(_json.load(f))

train = load_df('train.json')
test = load_df('test.json')
target_col = 'requester_received_pizza'
title_col = 'request_title' if 'request_title' in train.columns else 'title'
body_col = 'request_text_edit_aware' if 'request_text_edit_aware' in train.columns else 'request_text'

def build_text_upweighted(df: pd.DataFrame, title_col: str, body_col: str) -> pd.Series:
    t = df[title_col].fillna('').astype(str)
    b = df[body_col].fillna('').astype(str)
    return (t + ' ' + t + ' [SEP] ' + b)

text_tr = build_text_upweighted(train, title_col, body_col).tolist()
text_te = build_text_upweighted(test, title_col, body_col).tolist()
y = train[target_col].astype(int).values
groups = train['requester_username'].fillna('').astype(str).values if 'requester_username' in train.columns else np.arange(len(train))

# Build leakage-safe meta (reuse prior definitions)
def count_urls(s: str) -> int: return len(re.findall(r'https?://\S+', s or ''))
def has_imgur(s: str) -> int: return 1 if re.search(r'imgur\.com', s or '', flags=re.IGNORECASE) else 0
def count_digits(s: str) -> int: return sum(ch.isdigit() for ch in (s or ''))
def dollar_flag(s: str) -> int: return 1 if ('$' in (s or '')) or re.search(r'\b\d+\s*(dollars|bucks)\b', s or '', flags=re.IGNORECASE) else 0
def caps_ratio(s: str) -> float:
    if not s: return 0.0
    letters = [c for c in s if c.isalpha()]
    return (sum(c.isupper() for c in letters) / max(1, len(letters))) if letters else 0.0
def word_count(s: str) -> int: return len((s or '').split())
def exclam_count(s: str) -> int: return (s or '').count('!')
def question_count(s: str) -> int: return (s or '').count('?')
def parse_subreddit_count(x) -> int: return len(x) if isinstance(x, list) else 0
def safe_log1p_signed(x): return np.sign(x) * np.log1p(np.abs(x))
LEX_PATTERNS = {
    'lex_please': r'\bplease\b',
    'lex_thanks': r'\b(thank you|thanks in advance|thanks|thank|tia)\b',
    'lex_appreciate': r'\b(appreciate|appreciated)\b',
    'lex_pay_it_forward': r'\b(pay it forward|return the favor)\b',
    'lex_repay': r'\b(repay|pay you back|pay back)\b',
    'lex_willing': r"\b(willing to|i\'ll|i will|i can)\b",
    'lex_karma': r'\bkarma\b',
    'lex_evidence': r'\b(proof|receipt|photo|picture|pic|verify|verification|evidence)\b',
    'lex_imgur_word': r'\bimgur\b',
    'lex_student': r'\b(student|college|university|finals|exam|midterm)\b',
    'lex_jobloss': r'\b(unemployed|laid off|lost my job|between jobs|job hunt)\b',
    'lex_broke': r'\b(broke)\b',
    'lex_rent_bills': r'\b(rent|bill|bills|utilities|electric|gas|water|paycheck)\b',
    'lex_family': r'\b(family|kids?|children|baby|pregnant|son|daughter|wife|husband)\b',
    'lex_hungry': r'\b(hungry|starving|no food)\b|\bfood (stamps|pantry)\b',
    'lex_struggling': r'\b(desperate|struggling)\b',
    'lex_urgency': r'\b(tonight|today|tomorrow|asap|urgent)\b',
    'lex_help': r'\bhelp\b',
}
def add_lexicons(df: pd.DataFrame, title_col: str, body_col: str) -> pd.DataFrame:
    txt = (df[title_col].fillna('').astype(str) + ' ' + df[body_col].fillna('').astype(str)).str.lower()
    out = pd.DataFrame(index=df.index)
    for name, pat in LEX_PATTERNS.items():
        out[name] = txt.str.count(pat, flags=re.IGNORECASE).astype(float)
    return out
def build_meta(df: pd.DataFrame) -> pd.DataFrame:
    title = df[title_col].fillna('').astype(str)
    body = df[body_col].fillna('').astype(str)
    ts = pd.to_numeric(df['unix_timestamp_of_request'], errors='coerce')
    dt = pd.to_datetime(ts, unit='s', utc=True)
    out = pd.DataFrame(index=df.index)
    keep_cols = [
        'requester_account_age_in_days_at_request',
        'requester_days_since_first_post_on_raop_at_request',
        'requester_number_of_comments_at_request',
        'requester_number_of_comments_in_raop_at_request',
        'requester_number_of_posts_at_request',
        'requester_number_of_posts_on_raop_at_request',
        'requester_number_of_subreddits_at_request',
        'requester_upvotes_minus_downvotes_at_request',
        'requester_upvotes_plus_downvotes_at_request',
    ]
    for c in keep_cols:
        out[c] = pd.to_numeric(df[c], errors='coerce') if c in df.columns else 0.0
    out['requester_subreddits_count'] = df['requester_subreddits_at_request'].apply(parse_subreddit_count).astype(float) if 'requester_subreddits_at_request' in df.columns else 0.0
    out['hour'] = dt.dt.hour.astype(float)
    out['weekday'] = dt.dt.weekday.astype(float)
    out['is_weekend'] = dt.dt.weekday.isin([5,6]).astype(float)
    out['title_len_chars'] = title.str.len().astype(float)
    out['title_len_words'] = title.apply(word_count).astype(float)
    out['body_len_chars'] = body.str.len().astype(float)
    out['body_len_words'] = body.apply(word_count).astype(float)
    out['url_count'] = body.apply(count_urls).astype(float)
    out['has_imgur'] = body.apply(has_imgur).astype(float)
    out['digits_count'] = body.apply(count_digits).astype(float)
    out['dollar_flag'] = body.apply(dollar_flag).astype(float)
    out['caps_ratio'] = body.apply(caps_ratio).astype(float)
    out['exclam_count'] = body.apply(exclam_count).astype(float)
    out['question_count'] = body.apply(question_count).astype(float)
    lex = add_lexicons(df, title_col, body_col)
    out = pd.concat([out, lex], axis=1)
    out = out.replace([np.inf, -np.inf], np.nan).fillna(0.0)
    nonneg_cols = [
        'requester_account_age_in_days_at_request','requester_days_since_first_post_on_raop_at_request',
        'requester_number_of_comments_at_request','requester_number_of_comments_in_raop_at_request',
        'requester_number_of_posts_at_request','requester_number_of_posts_on_raop_at_request',
        'requester_number_of_subreddits_at_request','requester_upvotes_plus_downvotes_at_request',
        'requester_subreddits_count','hour','weekday','is_weekend',
        'title_len_chars','title_len_words','body_len_chars','body_len_words',
        'url_count','has_imgur','digits_count','dollar_flag','exclam_count','question_count',
        *list(LEX_PATTERNS.keys()),
    ]
    for c in nonneg_cols:
        if c in out.columns:
            out[c] = np.log1p(np.clip(out[c], a_min=0.0, a_max=None))
    if 'requester_upvotes_minus_downvotes_at_request' in out.columns:
        out['requester_upvotes_minus_downvotes_at_request'] = safe_log1p_signed(out['requester_upvotes_minus_downvotes_at_request'].values)
    out = out.replace([np.inf, -np.inf], 0.0).fillna(0.0)
    return out

meta_train = build_meta(train)
meta_test = build_meta(test)
print('Meta shapes:', meta_train.shape, meta_test.shape, flush=True)

# Encode sentence embeddings (CPU)
print('Encoding sentence embeddings (all-MiniLM-L6-v2)...', flush=True)
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2', device='cpu')
emb_tr = model.encode(text_tr, batch_size=64, show_progress_bar=False, normalize_embeddings=True)
emb_te = model.encode(text_te, batch_size=64, show_progress_bar=False, normalize_embeddings=True)
emb_tr = emb_tr.astype(np.float32)
emb_te = emb_te.astype(np.float32)
print('Embeddings shape:', emb_tr.shape, emb_te.shape, flush=True)

# Grouped CV with XGBoost on [embeddings + scaled meta]
sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
folds = list(sgkf.split(np.zeros(len(y)), y, groups))
oof_st = np.zeros(len(train), dtype=np.float32)
test_preds = []
times = []
params = dict(
    objective='binary:logistic',
    eval_metric='auc',
    tree_method='hist',
    max_depth=4,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_lambda=1.0,
    min_child_weight=1.0,
    max_bin=256,
    random_state=42
)
for i, (trn_idx, val_idx) in enumerate(folds, 1):
    t0 = time.time()
    scaler = StandardScaler(with_mean=True, with_std=True)
    M_tr = scaler.fit_transform(meta_train.iloc[trn_idx].values).astype(np.float32)
    M_va = scaler.transform(meta_train.iloc[val_idx].values).astype(np.float32)
    X_tr = np.hstack([emb_tr[trn_idx], M_tr]).astype(np.float32)
    X_va = np.hstack([emb_tr[val_idx], M_va]).astype(np.float32)
    dtr = xgb.DMatrix(X_tr, label=y[trn_idx])
    dva = xgb.DMatrix(X_va, label=y[val_idx])
    booster = xgb.train(params, dtr, num_boost_round=2000, evals=[(dva,'valid')],
                        verbose_eval=False, early_stopping_rounds=100)
    oof_st[val_idx] = booster.predict(dva, iteration_range=(0, booster.best_iteration+1))
    # Test build
    M_te = scaler.transform(meta_test.values).astype(np.float32)
    Xt = np.hstack([emb_te, M_te]).astype(np.float32)
    dte = xgb.DMatrix(Xt)
    test_preds.append(booster.predict(dte, iteration_range=(0, booster.best_iteration+1)).astype(np.float32))
    times.append(time.time()-t0)
    print(f'  Fold {i} AUC={roc_auc_score(y[val_idx], oof_st[val_idx]):.5f} | iters={booster.best_iteration+1} | {times[-1]:.1f}s', flush=True)

auc_st = roc_auc_score(y, oof_st)
print(f'OOF AUC (ST-emb + meta XGB): {auc_st:.5f} | mean fold time: {np.mean(times):.1f}s', flush=True)
test_st = np.mean(np.vstack(test_preds), axis=0).astype(np.float32)
np.save('oof_st_embed_g.npy', oof_st)
np.save('test_st_embed_g.npy', test_st)

# NNLS blend including ST model
oof_lr_meta_g = np.load('oof_lr_meta_g.npy') if os.path.exists('oof_lr_meta_g.npy') else None
oof_lr_subs_g = np.load('oof_lr_subs_g.npy') if os.path.exists('oof_lr_subs_g.npy') else None
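# NOTE: unlike the other inputs, this file name has no '_g' suffix; confirm it holds grouped-CV OOF predictions before trusting the blended OOF AUC
oof_xgb_g = np.load('oof_xgb_svd_meta.npy') if os.path.exists('oof_xgb_svd_meta.npy') else None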
oof_cb_g = np.load('oof_cat_text_g.npy') if os.path.exists('oof_cat_text_g.npy') else None

test_lr_meta_g = np.load('test_lr_meta_g.npy') if os.path.exists('test_lr_meta_g.npy') else None
test_lr_subs_g = np.load('test_lr_subs_g.npy') if os.path.exists('test_lr_subs_g.npy') else None
test_xgb_g = np.load('test_xgb_svd_meta.npy') if os.path.exists('test_xgb_svd_meta.npy') else None
test_cb_g = np.load('test_cat_text_g.npy') if os.path.exists('test_cat_text_g.npy') else None

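# Keep a model only when both its OOF and test arrays exist, so the columns of P and T stay aligned
pairs = [(o, t) for o, t in [(oof_lr_meta_g, test_lr_meta_g), (oof_lr_subs_g, test_lr_subs_g),
                             (oof_xgb_g, test_xgb_g), (oof_cb_g, test_cb_g), (oof_st, test_st)]
         if o is not None and t is not None]
parts_oof = [o for o, _ in pairs]
parts_test = [t for _, t in pairs]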
P = np.vstack(parts_oof).T
T = np.vstack(parts_test).T
w, _ = nnls(P, y.astype(float))
w = w / w.sum() if w.sum() > 0 else w
blend_oof = P @ w
blend_auc = roc_auc_score(y, blend_oof)
print('NNLS weights (with ST):', w, flush=True)
print(f'Blended OOF AUC (grouped, NNLS + ST): {blend_auc:.5f}', flush=True)

blend_test = T @ w
pd.DataFrame({'request_id': test['request_id'], 'requester_received_pizza': blend_test.astype(np.float32)}).to_csv('submission.csv', index=False)
print('Saved submission.csv (NNLS with ST). Rows:', len(blend_test))

Meta shapes: (2878, 42) (1162, 42)
Encoding sentence embeddings (all-MiniLM-L6-v2)...
Embeddings shape: (2878, 384) (1162, 384)
  Fold 1 AUC=0.65715 | iters=165 | 1.3s
  Fold 2 AUC=0.66457 | iters=68 | 0.8s
  Fold 3 AUC=0.66515 | iters=71 | 0.8s
  Fold 4 AUC=0.64501 | iters=241 | 1.5s
  Fold 5 AUC=0.70300 | iters=25 | 0.6s
OOF AUC (ST-emb + meta XGB): 0.65947 | mean fold time: 1.0s
NNLS weights (with ST): [0.19340665 0.18262475 0.25767387 0.         0.36629473]
Blended OOF AUC (grouped, NNLS + ST): 0.68873
Saved submission.csv (NNLS with ST). Rows: 1162


In [22]:
# Step 3D: Sentence-Transformer embeddings (multi-qa-mpnet-base-dot-v1) + meta -> XGBoost (grouped CV) + NNLS blend
import sys, subprocess, time, os, re, numpy as np, pandas as pd
from sklearn.model_selection import StratifiedGroupKFold
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score
from scipy.optimize import nnls

# Install sentence-transformers and xgboost if missing
try:
    from sentence_transformers import SentenceTransformer
except Exception:
    subprocess.run([sys.executable, '-m', 'pip', 'install', '--quiet', 'sentence-transformers==2.7.0'], check=True)
    from sentence_transformers import SentenceTransformer
try:
    import xgboost as xgb
except Exception:
    subprocess.run([sys.executable, '-m', 'pip', 'install', '--quiet', 'xgboost==2.1.1'], check=True)
    import xgboost as xgb

def load_df(path):
    try:
        return pd.read_json(path)
    except ValueError:
        import json as _json
        with open(path, 'r') as f:
            return pd.DataFrame(_json.load(f))

train = load_df('train.json')
test = load_df('test.json')
target_col = 'requester_received_pizza'
title_col = 'request_title' if 'request_title' in train.columns else 'title'
body_col = 'request_text_edit_aware' if 'request_text_edit_aware' in train.columns else 'request_text'

def build_text_upweighted(df: pd.DataFrame, title_col: str, body_col: str) -> pd.Series:
    t = df[title_col].fillna('').astype(str)
    b = df[body_col].fillna('').astype(str)
    return (t + ' ' + t + ' [SEP] ' + b)

text_tr = build_text_upweighted(train, title_col, body_col).tolist()
text_te = build_text_upweighted(test, title_col, body_col).tolist()
y = train[target_col].astype(int).values
groups = train['requester_username'].fillna('').astype(str).values if 'requester_username' in train.columns else np.arange(len(train))

# Build leakage-safe meta (reuse prior definitions)
def count_urls(s: str) -> int: return len(re.findall(r'https?://\S+', s or ''))
def has_imgur(s: str) -> int: return 1 if re.search(r'imgur\.com', s or '', flags=re.IGNORECASE) else 0
def count_digits(s: str) -> int: return sum(ch.isdigit() for ch in (s or ''))
def dollar_flag(s: str) -> int: return 1 if ('$' in (s or '')) or re.search(r'\b\d+\s*(dollars|bucks)\b', s or '', flags=re.IGNORECASE) else 0
def caps_ratio(s: str) -> float:
    if not s: return 0.0
    letters = [c for c in s if c.isalpha()]
    return (sum(c.isupper() for c in letters) / max(1, len(letters))) if letters else 0.0
def word_count(s: str) -> int: return len((s or '').split())
def exclam_count(s: str) -> int: return (s or '').count('!')
def question_count(s: str) -> int: return (s or '').count('?')
def parse_subreddit_count(x) -> int: return len(x) if isinstance(x, list) else 0
def safe_log1p_signed(x): return np.sign(x) * np.log1p(np.abs(x))
LEX_PATTERNS = {
    'lex_please': r'\bplease\b',
    'lex_thanks': r'\b(thank you|thanks in advance|thanks|thank|tia)\b',
    'lex_appreciate': r'\b(appreciate|appreciated)\b',
    'lex_pay_it_forward': r'\b(pay it forward|return the favor)\b',
    'lex_repay': r'\b(repay|pay you back|pay back)\b',
    'lex_willing': r"\b(willing to|i\'ll|i will|i can)\b",
    'lex_karma': r'\bkarma\b',
    'lex_evidence': r'\b(proof|receipt|photo|picture|pic|verify|verification|evidence)\b',
    'lex_imgur_word': r'\bimgur\b',
    'lex_student': r'\b(student|college|university|finals|exam|midterm)\b',
    'lex_jobloss': r'\b(unemployed|laid off|lost my job|between jobs|job hunt)\b',
    'lex_broke': r'\b(broke)\b',
    'lex_rent_bills': r'\b(rent|bill|bills|utilities|electric|gas|water|paycheck)\b',
    'lex_family': r'\b(family|kids?|children|baby|pregnant|son|daughter|wife|husband)\b',
    'lex_hungry': r'\b(hungry|starving|no food)\b|\bfood (stamps|pantry)\b',
    'lex_struggling': r'\b(desperate|struggling)\b',
    'lex_urgency': r'\b(tonight|today|tomorrow|asap|urgent)\b',
    'lex_help': r'\bhelp\b',
}
def add_lexicons(df: pd.DataFrame, title_col: str, body_col: str) -> pd.DataFrame:
    txt = (df[title_col].fillna('').astype(str) + ' ' + df[body_col].fillna('').astype(str)).str.lower()
    out = pd.DataFrame(index=df.index)
    for name, pat in LEX_PATTERNS.items():
        out[name] = txt.str.count(pat, flags=re.IGNORECASE).astype(float)
    return out
def build_meta(df: pd.DataFrame) -> pd.DataFrame:
    title = df[title_col].fillna('').astype(str)
    body = df[body_col].fillna('').astype(str)
    ts = pd.to_numeric(df['unix_timestamp_of_request'], errors='coerce')
    dt = pd.to_datetime(ts, unit='s', utc=True)
    out = pd.DataFrame(index=df.index)
    keep_cols = [
        'requester_account_age_in_days_at_request',
        'requester_days_since_first_post_on_raop_at_request',
        'requester_number_of_comments_at_request',
        'requester_number_of_comments_in_raop_at_request',
        'requester_number_of_posts_at_request',
        'requester_number_of_posts_on_raop_at_request',
        'requester_number_of_subreddits_at_request',
        'requester_upvotes_minus_downvotes_at_request',
        'requester_upvotes_plus_downvotes_at_request',
    ]
    for c in keep_cols:
        out[c] = pd.to_numeric(df[c], errors='coerce') if c in df.columns else 0.0
    out['requester_subreddits_count'] = df['requester_subreddits_at_request'].apply(parse_subreddit_count).astype(float) if 'requester_subreddits_at_request' in df.columns else 0.0
    out['hour'] = dt.dt.hour.astype(float)
    out['weekday'] = dt.dt.weekday.astype(float)
    out['is_weekend'] = dt.dt.weekday.isin([5,6]).astype(float)
    out['title_len_chars'] = title.str.len().astype(float)
    out['title_len_words'] = title.apply(word_count).astype(float)
    out['body_len_chars'] = body.str.len().astype(float)
    out['body_len_words'] = body.apply(word_count).astype(float)
    out['url_count'] = body.apply(count_urls).astype(float)
    out['has_imgur'] = body.apply(has_imgur).astype(float)
    out['digits_count'] = body.apply(count_digits).astype(float)
    out['dollar_flag'] = body.apply(dollar_flag).astype(float)
    out['caps_ratio'] = body.apply(caps_ratio).astype(float)
    out['exclam_count'] = body.apply(exclam_count).astype(float)
    out['question_count'] = body.apply(question_count).astype(float)
    lex = add_lexicons(df, title_col, body_col)
    out = pd.concat([out, lex], axis=1)
    out = out.replace([np.inf, -np.inf], np.nan).fillna(0.0)
    nonneg_cols = [
        'requester_account_age_in_days_at_request','requester_days_since_first_post_on_raop_at_request',
        'requester_number_of_comments_at_request','requester_number_of_comments_in_raop_at_request',
        'requester_number_of_posts_at_request','requester_number_of_posts_on_raop_at_request',
        'requester_number_of_subreddits_at_request','requester_upvotes_plus_downvotes_at_request',
        'requester_subreddits_count','hour','weekday','is_weekend',
        'title_len_chars','title_len_words','body_len_chars','body_len_words',
        'url_count','has_imgur','digits_count','dollar_flag','exclam_count','question_count',
        *list(LEX_PATTERNS.keys()),
    ]
    for c in nonneg_cols:
        if c in out.columns:
            out[c] = np.log1p(np.clip(out[c], a_min=0.0, a_max=None))
    if 'requester_upvotes_minus_downvotes_at_request' in out.columns:
        out['requester_upvotes_minus_downvotes_at_request'] = safe_log1p_signed(out['requester_upvotes_minus_downvotes_at_request'].values)
    out = out.replace([np.inf, -np.inf], 0.0).fillna(0.0)
    return out

meta_train = build_meta(train)
meta_test = build_meta(test)
print('Meta shapes:', meta_train.shape, meta_test.shape, flush=True)

# Encode sentence embeddings (MPNet base, CPU encode for stability; normalize embeddings)
print('Encoding sentence embeddings (multi-qa-mpnet-base-dot-v1)...', flush=True)
model = SentenceTransformer('sentence-transformers/multi-qa-mpnet-base-dot-v1', device='cpu')
emb_tr = model.encode(text_tr, batch_size=64, show_progress_bar=False, normalize_embeddings=True).astype(np.float32)
emb_te = model.encode(text_te, batch_size=64, show_progress_bar=False, normalize_embeddings=True).astype(np.float32)
print('Embeddings shape:', emb_tr.shape, emb_te.shape, flush=True)

# Grouped CV with XGBoost on [embeddings + scaled meta]
sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
folds = list(sgkf.split(np.zeros(len(y)), y, groups))
oof_mp = np.zeros(len(train), dtype=np.float32)
test_preds = []
fold_times = []
params = dict(
    objective='binary:logistic',
    eval_metric='auc',
    tree_method='hist',
    max_depth=4,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_lambda=1.0,
    min_child_weight=1.0,
    max_bin=256,
    random_state=42
)
for i, (trn_idx, val_idx) in enumerate(folds, 1):
    t0 = time.time()
    scaler = StandardScaler(with_mean=True, with_std=True)
    M_tr = scaler.fit_transform(meta_train.iloc[trn_idx].values).astype(np.float32)
    M_va = scaler.transform(meta_train.iloc[val_idx].values).astype(np.float32)
    X_tr = np.hstack([emb_tr[trn_idx], M_tr]).astype(np.float32)
    X_va = np.hstack([emb_tr[val_idx], M_va]).astype(np.float32)
    dtr = xgb.DMatrix(X_tr, label=y[trn_idx])
    dva = xgb.DMatrix(X_va, label=y[val_idx])
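    # Early stopping monitors the validation fold, so the per-fold and OOF AUCs below are somewhat optimistic.
    booster = xgb.train(params, dtr, num_boost_round=2000, evals=[(dva,'valid')],
                        verbose_eval=False, early_stopping_rounds=100)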
    oof_mp[val_idx] = booster.predict(dva, iteration_range=(0, booster.best_iteration+1))
    M_te = scaler.transform(meta_test.values).astype(np.float32)
    Xt = np.hstack([emb_te, M_te]).astype(np.float32)
    dte = xgb.DMatrix(Xt)
    test_preds.append(booster.predict(dte, iteration_range=(0, booster.best_iteration+1)).astype(np.float32))
    dt = time.time()-t0
    fold_times.append(dt)
    print(f'  Fold {i} AUC={roc_auc_score(y[val_idx], oof_mp[val_idx]):.5f} | iters={booster.best_iteration+1} | {dt:.1f}s', flush=True)

auc_mp = roc_auc_score(y, oof_mp)
print(f'OOF AUC (MPNet-emb + meta XGB): {auc_mp:.5f} | mean fold time: {np.mean(fold_times):.1f}s', flush=True)
test_mp = np.mean(np.vstack(test_preds), axis=0).astype(np.float32)
np.save('oof_mpnet_embed_g.npy', oof_mp)
np.save('test_mpnet_embed_g.npy', test_mp)

# NNLS blend including MPNet model
oof_lr_meta_g = np.load('oof_lr_meta_g.npy') if os.path.exists('oof_lr_meta_g.npy') else None
oof_lr_subs_g = np.load('oof_lr_subs_g.npy') if os.path.exists('oof_lr_subs_g.npy') else None
oof_xgb_g = np.load('oof_xgb_svd_meta.npy') if os.path.exists('oof_xgb_svd_meta.npy') else None
oof_cb_g = np.load('oof_cat_text_g.npy') if os.path.exists('oof_cat_text_g.npy') else None
oof_st_g = np.load('oof_st_embed_g.npy') if os.path.exists('oof_st_embed_g.npy') else None

test_lr_meta_g = np.load('test_lr_meta_g.npy') if os.path.exists('test_lr_meta_g.npy') else None
test_lr_subs_g = np.load('test_lr_subs_g.npy') if os.path.exists('test_lr_subs_g.npy') else None
test_xgb_g = np.load('test_xgb_svd_meta.npy') if os.path.exists('test_xgb_svd_meta.npy') else None
test_cb_g = np.load('test_cat_text_g.npy') if os.path.exists('test_cat_text_g.npy') else None
test_st_g = np.load('test_st_embed_g.npy') if os.path.exists('test_st_embed_g.npy') else None

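# Keep a model only when both its OOF and test arrays exist, so the columns of P and T stay aligned
pairs = [(o, t) for o, t in [
    (oof_lr_meta_g, test_lr_meta_g), (oof_lr_subs_g, test_lr_subs_g), (oof_xgb_g, test_xgb_g),
    (oof_cb_g, test_cb_g), (oof_st_g, test_st_g), (oof_mp, test_mp),
] if o is not None and t is not None]
parts_oof = [o for o, _ in pairs]
parts_test = [t for _, t in pairs]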
P = np.vstack(parts_oof).T
T = np.vstack(parts_test).T
w, _ = nnls(P, y.astype(float))
w = w / w.sum() if w.sum() > 0 else w
blend_oof = P @ w
blend_auc = roc_auc_score(y, blend_oof)
print('NNLS weights (with MPNet):', w, flush=True)
print(f'Blended OOF AUC (grouped, NNLS + MPNet): {blend_auc:.5f}', flush=True)

blend_test = T @ w
pd.DataFrame({'request_id': test['request_id'], 'requester_received_pizza': blend_test.astype(np.float32)}).to_csv('submission.csv', index=False)
print('Saved submission.csv (NNLS with MPNet). Rows:', len(blend_test))

Meta shapes: (2878, 42) (1162, 42)
Encoding sentence embeddings (multi-qa-mpnet-base-dot-v1)...
You try to use a model that was created with version 3.0.0.dev0, however, your version is 2.7.0. This might cause unexpected behavior or errors. In that case, try to update to the latest version.
Embeddings shape: (2878, 768) (1162, 768)
  Fold 1 AUC=0.66107 | iters=49 | 1.2s
  Fold 2 AUC=0.66956 | iters=41 | 1.2s
  Fold 3 AUC=0.63849 | iters=31 | 1.3s
  Fold 4 AUC=0.66784 | iters=76 | 1.6s
  Fold 5 AUC=0.68682 | iters=127 | 1.8s
OOF AUC (MPNet-emb + meta XGB): 0.66457 | mean fold time: 1.4s
NNLS weights (with MPNet): [0.1455568  0.17538714 0.17781834 0.         0.2169544  0.28428332]
Blended OOF AUC (grouped, NNLS + MPNet): 0.69224
Saved submission.csv (NNLS with MPNet). Rows: 1162
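The blended OOF AUC above is scored on the same rows that were used to fit the NNLS weights, so it may be slightly optimistic. Below is a minimal sketch of a fold-wise estimate. It runs on synthetic predictions, uses plain shuffled folds rather than the notebook's grouped folds, and the helper names (`auc_score`, `normalized_nnls`, `cv_nnls_auc`) are hypothetical rather than part of the notebook.

```python
import numpy as np
from scipy.optimize import nnls

def auc_score(y_true, score):
    """AUC via the Mann-Whitney U statistic; tied scores get average ranks."""
    y_true = np.asarray(y_true)
    score = np.asarray(score, dtype=np.float64)
    order = np.argsort(score, kind='mergesort')
    _, first, counts = np.unique(score[order], return_index=True, return_counts=True)
    avg = first + (counts + 1) / 2.0          # 1-based average rank per tied group
    ranks = np.empty(len(score), dtype=np.float64)
    ranks[order] = np.repeat(avg, counts)
    n_pos = int((y_true == 1).sum())
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

def normalized_nnls(P, y):
    """NNLS weights rescaled to sum to 1 (equal weights if NNLS returns all zeros)."""
    w, _ = nnls(P, np.asarray(y, dtype=np.float64))
    total = w.sum()
    return w / total if total > 0 else np.full(P.shape[1], 1.0 / P.shape[1])

def cv_nnls_auc(P, y, n_splits=5, seed=42):
    """Fit weights on K-1 folds, blend the held-out fold, return the pooled AUC."""
    idx = np.random.RandomState(seed).permutation(len(y))
    blended = np.zeros(len(y), dtype=np.float64)
    for val_idx in np.array_split(idx, n_splits):
        trn_mask = np.ones(len(y), dtype=bool)
        trn_mask[val_idx] = False
        w = normalized_nnls(P[trn_mask], y[trn_mask])
        blended[val_idx] = P[val_idx] @ w
    return auc_score(y, blended)

# Synthetic stand-in for the (n_samples, n_models) OOF matrix
rng = np.random.RandomState(0)
n = 1000
y_demo = (rng.rand(n) < 0.25).astype(int)
signal = y_demo + rng.normal(0, 1.5, n)
P_demo = np.column_stack([
    1 / (1 + np.exp(-(signal + rng.normal(0, 1.0, n)))) for _ in range(4)
])
in_sample = auc_score(y_demo, P_demo @ normalized_nnls(P_demo, y_demo))
held_out = cv_nnls_auc(P_demo, y_demo)
print(f'in-sample AUC={in_sample:.4f} | fold-wise AUC={held_out:.4f}')
```

To apply this to the real blend, `P` and `y` would replace the synthetic arrays, and the split would ideally reuse the `StratifiedGroupKFold` folds so that no requester appears on both sides.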


In [23]:
# Step 3E: Rank-then-NNLS blend (robust AUC ensembling) over grouped OOFs
import numpy as np, pandas as pd, os
from sklearn.metrics import roc_auc_score
from scipy.optimize import nnls

train = pd.read_json('train.json')
test = pd.read_json('test.json')
y = train['requester_received_pizza'].astype(int).values

# Load available grouped OOF/test predictions (same set as 3D)
paths = {
    'lr_meta_g': ('oof_lr_meta_g.npy','test_lr_meta_g.npy'),
    'lr_subs_g': ('oof_lr_subs_g.npy','test_lr_subs_g.npy'),
    'xgb_svd_meta': ('oof_xgb_svd_meta.npy','test_xgb_svd_meta.npy'),
    'cb_text_g': ('oof_cat_text_g.npy','test_cat_text_g.npy'),
    'st_embed_g': ('oof_st_embed_g.npy','test_st_embed_g.npy'),
    'mpnet_embed_g': ('oof_mpnet_embed_g.npy','test_mpnet_embed_g.npy'),
}
oof_list, test_list, names = [], [], []
for name, (poof, ptest) in paths.items():
    if os.path.exists(poof) and os.path.exists(ptest):
        oof_list.append(np.load(poof))
        test_list.append(np.load(ptest))
        names.append(name)

assert len(oof_list) >= 3, f'Not enough models found for rank-NNLS, got {len(oof_list)}'
P = np.vstack(oof_list).T  # (n_samples, n_models)
T = np.vstack(test_list).T

def rank01(a):
    # rank to [0,1]; ordinal ranks from a single stable argsort (ties are broken by position, not averaged)
    order = a.argsort(kind='mergesort')
    ranks = np.empty_like(order, dtype=np.float64)
    ranks[order] = np.arange(len(a), dtype=np.float64)
    return ranks / max(1, len(a)-1)

# Rank-transform OOF and test per model independently
P_rank = np.apply_along_axis(rank01, 0, P)
T_rank = np.apply_along_axis(rank01, 0, T)

# NNLS on ranked OOF
w, _ = nnls(P_rank, y.astype(float))
w = w / w.sum() if w.sum() > 0 else w
blend_oof_rank = P_rank @ w
auc_rank = roc_auc_score(y, blend_oof_rank)
print('Models in rank-NNLS:', names)
print('Rank-NNLS weights:', w)
print(f'Rank-blended OOF AUC: {auc_rank:.5f}')

# Build test submission with rank blend
blend_test_rank = T_rank @ w
pd.DataFrame({'request_id': test['request_id'], 'requester_received_pizza': blend_test_rank.astype(np.float32)}).to_csv('submission.csv', index=False)
print('Saved submission.csv (Rank-NNLS). Rows:', len(blend_test_rank))

Models in rank-NNLS: ['lr_meta_g', 'lr_subs_g', 'xgb_svd_meta', 'cb_text_g', 'st_embed_g', 'mpnet_embed_g']
Rank-NNLS weights: [0.16470432 0.16774478 0.14264587 0.06567989 0.20591968 0.25330546]
Rank-blended OOF AUC: 0.69506
Saved submission.csv (Rank-NNLS). Rows: 1162
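`rank01` above assigns ordinal ranks from a stable sort, so identical predictions receive different ranks depending on row order. Below is a minimal NumPy-only sketch of a tie-aware variant that gives tied scores their average rank. `rank01_avg` is a hypothetical name, not part of the notebook.

```python
import numpy as np

def rank01_avg(a):
    """Scale average-tie ranks to [0, 1]; tied values share one rank."""
    a = np.asarray(a, dtype=np.float64)
    n = len(a)
    if n <= 1:
        return np.zeros(n, dtype=np.float64)
    order = np.argsort(a, kind='mergesort')
    _, first, counts = np.unique(a[order], return_index=True, return_counts=True)
    # 0-based average rank of each tied group, repeated for every member of the group
    avg = first + (counts - 1) / 2.0
    ranks = np.empty(n, dtype=np.float64)
    ranks[order] = np.repeat(avg, counts)
    return ranks / (n - 1)

scores = np.array([0.2, 0.7, 0.2, 0.9])
print(rank01_avg(scores))  # the two 0.2 entries share the same rank
```

With continuous float32 model outputs exact ties are probably rare, so this is unlikely to change the blend much. It matters mainly for models that emit only a few distinct values.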


In [24]:
# Step 3F: Seed-bag XGB on ST embeddings (MiniLM + MPNet) with grouped CV, then rank-NNLS blend
import sys, subprocess, time, os, re, numpy as np, pandas as pd
from sklearn.model_selection import StratifiedGroupKFold
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score
from scipy.optimize import nnls

try:
    from sentence_transformers import SentenceTransformer
except Exception:
    subprocess.run([sys.executable, '-m', 'pip', 'install', '--quiet', 'sentence-transformers==2.7.0'], check=True)
    from sentence_transformers import SentenceTransformer
try:
    import xgboost as xgb
except Exception:
    subprocess.run([sys.executable, '-m', 'pip', 'install', '--quiet', 'xgboost==2.1.1'], check=True)
    import xgboost as xgb

def load_df(path):
    try:
        return pd.read_json(path)
    except ValueError:
        import json as _json
        with open(path, 'r') as f:
            return pd.DataFrame(_json.load(f))

train = load_df('train.json')
test = load_df('test.json')
target_col = 'requester_received_pizza'
title_col = 'request_title' if 'request_title' in train.columns else 'title'
body_col = 'request_text_edit_aware' if 'request_text_edit_aware' in train.columns else 'request_text'

def build_text_upweighted(df: pd.DataFrame, title_col: str, body_col: str) -> pd.Series:
    t = df[title_col].fillna('').astype(str)
    b = df[body_col].fillna('').astype(str)
    return (t + ' ' + t + ' [SEP] ' + b)

text_tr = build_text_upweighted(train, title_col, body_col).tolist()
text_te = build_text_upweighted(test, title_col, body_col).tolist()
y = train[target_col].astype(int).values
groups = train['requester_username'].fillna('').astype(str).values if 'requester_username' in train.columns else np.arange(len(train))

# Meta/lexicons (leakage-safe, same as prior)
def count_urls(s: str) -> int: return len(re.findall(r'https?://\S+', s or ''))
def has_imgur(s: str) -> int: return 1 if re.search(r'imgur\.com', s or '', flags=re.IGNORECASE) else 0
def count_digits(s: str) -> int: return sum(ch.isdigit() for ch in (s or ''))
def dollar_flag(s: str) -> int: return 1 if ('$' in (s or '')) or re.search(r'\b\d+\s*(dollars|bucks)\b', s or '', flags=re.IGNORECASE) else 0
def caps_ratio(s: str) -> float:
    if not s: return 0.0
    letters = [c for c in s if c.isalpha()]
    return (sum(c.isupper() for c in letters) / max(1, len(letters))) if letters else 0.0
def word_count(s: str) -> int: return len((s or '').split())
def exclam_count(s: str) -> int: return (s or '').count('!')
def question_count(s: str) -> int: return (s or '').count('?')
def parse_subreddit_count(x) -> int: return len(x) if isinstance(x, list) else 0
def safe_log1p_signed(x): return np.sign(x) * np.log1p(np.abs(x))
LEX_PATTERNS = {
    'lex_please': r'\bplease\b',
    'lex_thanks': r'\b(thank you|thanks in advance|thanks|thank|tia)\b',
    'lex_appreciate': r'\b(appreciate|appreciated)\b',
    'lex_pay_it_forward': r'\b(pay it forward|return the favor)\b',
    'lex_repay': r'\b(repay|pay you back|pay back)\b',
    'lex_willing': r"\b(willing to|i\'ll|i will|i can)\b",
    'lex_karma': r'\bkarma\b',
    'lex_evidence': r'\b(proof|receipt|photo|picture|pic|verify|verification|evidence)\b',
    'lex_imgur_word': r'\bimgur\b',
    'lex_student': r'\b(student|college|university|finals|exam|midterm)\b',
    'lex_jobloss': r'\b(unemployed|laid off|lost my job|between jobs|job hunt)\b',
    'lex_broke': r'\b(broke)\b',
    'lex_rent_bills': r'\b(rent|bill|bills|utilities|electric|gas|water|paycheck)\b',
    'lex_family': r'\b(family|kids?|children|baby|pregnant|son|daughter|wife|husband)\b',
    'lex_hungry': r'\b(hungry|starving|no food)\b|\bfood (stamps|pantry)\b',
    'lex_struggling': r'\b(desperate|struggling)\b',
    'lex_urgency': r'\b(tonight|today|tomorrow|asap|urgent)\b',
    'lex_help': r'\bhelp\b',
}
def add_lexicons(df: pd.DataFrame, title_col: str, body_col: str) -> pd.DataFrame:
    txt = (df[title_col].fillna('').astype(str) + ' ' + df[body_col].fillna('').astype(str)).str.lower()
    out = pd.DataFrame(index=df.index)
    for name, pat in LEX_PATTERNS.items():
        out[name] = txt.str.count(pat, flags=re.IGNORECASE).astype(float)
    return out
def build_meta(df: pd.DataFrame) -> pd.DataFrame:
    title = df[title_col].fillna('').astype(str)
    body = df[body_col].fillna('').astype(str)
    ts = pd.to_numeric(df['unix_timestamp_of_request'], errors='coerce')
    dt = pd.to_datetime(ts, unit='s', utc=True)
    out = pd.DataFrame(index=df.index)
    keep_cols = [
        'requester_account_age_in_days_at_request',
        'requester_days_since_first_post_on_raop_at_request',
        'requester_number_of_comments_at_request',
        'requester_number_of_comments_in_raop_at_request',
        'requester_number_of_posts_at_request',
        'requester_number_of_posts_on_raop_at_request',
        'requester_number_of_subreddits_at_request',
        'requester_upvotes_minus_downvotes_at_request',
        'requester_upvotes_plus_downvotes_at_request',
    ]
    for c in keep_cols:
        out[c] = pd.to_numeric(df[c], errors='coerce') if c in df.columns else 0.0
    out['requester_subreddits_count'] = df['requester_subreddits_at_request'].apply(parse_subreddit_count).astype(float) if 'requester_subreddits_at_request' in df.columns else 0.0
    out['hour'] = dt.dt.hour.astype(float)
    out['weekday'] = dt.dt.weekday.astype(float)
    out['is_weekend'] = dt.dt.weekday.isin([5,6]).astype(float)
    out['title_len_chars'] = title.str.len().astype(float)
    out['title_len_words'] = title.apply(word_count).astype(float)
    out['body_len_chars'] = body.str.len().astype(float)
    out['body_len_words'] = body.apply(word_count).astype(float)
    out['url_count'] = body.apply(count_urls).astype(float)
    out['has_imgur'] = body.apply(has_imgur).astype(float)
    out['digits_count'] = body.apply(count_digits).astype(float)
    out['dollar_flag'] = body.apply(dollar_flag).astype(float)
    out['caps_ratio'] = body.apply(caps_ratio).astype(float)
    out['exclam_count'] = body.apply(exclam_count).astype(float)
    out['question_count'] = body.apply(question_count).astype(float)
    lex = add_lexicons(df, title_col, body_col)
    out = pd.concat([out, lex], axis=1).replace([np.inf, -np.inf], np.nan).fillna(0.0)
    nonneg_cols = [
        'requester_account_age_in_days_at_request','requester_days_since_first_post_on_raop_at_request',
        'requester_number_of_comments_at_request','requester_number_of_comments_in_raop_at_request',
        'requester_number_of_posts_at_request','requester_number_of_posts_on_raop_at_request',
        'requester_number_of_subreddits_at_request','requester_upvotes_plus_downvotes_at_request',
        'requester_subreddits_count','hour','weekday','is_weekend',
        'title_len_chars','title_len_words','body_len_chars','body_len_words',
        'url_count','has_imgur','digits_count','dollar_flag','exclam_count','question_count',
        *list(LEX_PATTERNS.keys()),
    ]
    for c in nonneg_cols:
        if c in out.columns:
            out[c] = np.log1p(np.clip(out[c], a_min=0.0, a_max=None))
    if 'requester_upvotes_minus_downvotes_at_request' in out.columns:
        out['requester_upvotes_minus_downvotes_at_request'] = safe_log1p_signed(out['requester_upvotes_minus_downvotes_at_request'].values)
    return out.replace([np.inf,-np.inf],0.0).fillna(0.0)

meta_train = build_meta(train)
meta_test = build_meta(test)
print('Meta shapes:', meta_train.shape, meta_test.shape, flush=True)

sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
folds = list(sgkf.split(np.zeros(len(y)), y, groups))

def train_xgb_bag(emb_tr, emb_te, seeds=(7,13,29), jitter=True, name='model'):
    oof_bag = np.zeros(len(train), dtype=np.float32)
    test_bag_per_seed = []
    for si, seed in enumerate(seeds, 1):
        # Jitter params slightly per seed: max_depth in {4, 5}, subsample in [0.75, 0.85),
        # colsample_bytree in [0.75, 0.90), learning_rate in [0.045, 0.055)
        rs = np.random.RandomState(seed)
        md = (4 + int(rs.rand() < 0.5)) if jitter else 4
        subs = (0.75 + rs.rand()*0.10) if jitter else 0.80
        cols = (0.75 + rs.rand()*0.15) if jitter else 0.80
        lr = (0.045 + rs.rand()*0.010) if jitter else 0.05
        params = dict(
            objective='binary:logistic',
            eval_metric='auc',
            tree_method='hist',
            max_depth=int(md),
            learning_rate=float(lr),
            subsample=float(subs),
            colsample_bytree=float(cols),
            reg_lambda=1.0,
            min_child_weight=1.0,
            max_bin=256,
            random_state=int(seed)
        )
        oof_seed = np.zeros(len(train), dtype=np.float32)
        test_preds = []
        t_se = time.time()
        for i, (trn_idx, val_idx) in enumerate(folds, 1):
            t0 = time.time()
            scaler = StandardScaler(with_mean=True, with_std=True)
            M_tr = scaler.fit_transform(meta_train.iloc[trn_idx].values).astype(np.float32)
            M_va = scaler.transform(meta_train.iloc[val_idx].values).astype(np.float32)
            X_tr = np.hstack([emb_tr[trn_idx], M_tr]).astype(np.float32)
            X_va = np.hstack([emb_tr[val_idx], M_va]).astype(np.float32)
            dtr = xgb.DMatrix(X_tr, label=y[trn_idx])
            dva = xgb.DMatrix(X_va, label=y[val_idx])
            # Early stopping monitors the validation fold, so the AUCs reported here are somewhat optimistic.
            booster = xgb.train(params, dtr, num_boost_round=2000, evals=[(dva,'valid')],
                                verbose_eval=False, early_stopping_rounds=100)
            oof_seed[val_idx] = booster.predict(dva, iteration_range=(0, booster.best_iteration+1))
            M_te = scaler.transform(meta_test.values).astype(np.float32)
            Xt = np.hstack([emb_te, M_te]).astype(np.float32)
            dte = xgb.DMatrix(Xt)
            test_preds.append(booster.predict(dte, iteration_range=(0, booster.best_iteration+1)).astype(np.float32))
            print(f'    [{name} seed {seed}] Fold {i} AUC={roc_auc_score(y[val_idx], oof_seed[val_idx]):.5f} | iters={booster.best_iteration+1} | {time.time()-t0:.1f}s', flush=True)
        oof_bag += oof_seed / len(seeds)
        test_bag_per_seed.append(np.mean(np.vstack(test_preds), axis=0).astype(np.float32))
        print(f'  [{name}] Seed {seed} OOF AUC={roc_auc_score(y, oof_seed):.5f} | elapsed {time.time()-t_se:.1f}s', flush=True)
    test_bag = np.mean(np.vstack(test_bag_per_seed), axis=0).astype(np.float32)
    auc_bag = roc_auc_score(y, oof_bag)
    print(f'[{name}] Bagged OOF AUC: {auc_bag:.5f}', flush=True)
    return oof_bag, test_bag

# Encode MiniLM and MPNet (normalized embeddings) - CPU for stability; dataset is small
print('Encoding MiniLM (all-MiniLM-L6-v2)...', flush=True)
minilm = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2', device='cpu')
emb_minilm_tr = minilm.encode(text_tr, batch_size=64, show_progress_bar=False, normalize_embeddings=True).astype(np.float32)
emb_minilm_te = minilm.encode(text_te, batch_size=64, show_progress_bar=False, normalize_embeddings=True).astype(np.float32)
print('MiniLM shapes:', emb_minilm_tr.shape, emb_minilm_te.shape, flush=True)

print('Encoding MPNet (multi-qa-mpnet-base-dot-v1)...', flush=True)
mpnet = SentenceTransformer('sentence-transformers/multi-qa-mpnet-base-dot-v1', device='cpu')
emb_mpnet_tr = mpnet.encode(text_tr, batch_size=64, show_progress_bar=False, normalize_embeddings=True).astype(np.float32)
emb_mpnet_te = mpnet.encode(text_te, batch_size=64, show_progress_bar=False, normalize_embeddings=True).astype(np.float32)
print('MPNet shapes:', emb_mpnet_tr.shape, emb_mpnet_te.shape, flush=True)

seeds = [7, 13, 29]
print('=== Seed bagging XGB on MiniLM embeddings ===', flush=True)
oof_st_bag, test_st_bag = train_xgb_bag(emb_minilm_tr, emb_minilm_te, seeds=seeds, jitter=True, name='MiniLM')
np.save('oof_st_embed_bag.npy', oof_st_bag); np.save('test_st_embed_bag.npy', test_st_bag)

print('=== Seed bagging XGB on MPNet embeddings ===', flush=True)
oof_mp_bag, test_mp_bag = train_xgb_bag(emb_mpnet_tr, emb_mpnet_te, seeds=seeds, jitter=True, name='MPNet')
np.save('oof_mpnet_embed_bag.npy', oof_mp_bag); np.save('test_mpnet_embed_bag.npy', test_mp_bag)

# Rank-then-NNLS including bagged ST models
paths = {
    'lr_meta_g': ('oof_lr_meta_g.npy','test_lr_meta_g.npy'),
    'lr_subs_g': ('oof_lr_subs_g.npy','test_lr_subs_g.npy'),
    'xgb_svd_meta': ('oof_xgb_svd_meta.npy','test_xgb_svd_meta.npy'),
    'cb_text_g': ('oof_cat_text_g.npy','test_cat_text_g.npy'),
    'st_embed_bag': ('oof_st_embed_bag.npy','test_st_embed_bag.npy'),
    'mpnet_embed_bag': ('oof_mpnet_embed_bag.npy','test_mpnet_embed_bag.npy'),
}
oof_list, test_list, names = [], [], []
for name, (poof, ptest) in paths.items():
    if os.path.exists(poof) and os.path.exists(ptest):
        oof_list.append(np.load(poof))
        test_list.append(np.load(ptest))
        names.append(name)
assert len(oof_list) >= 3, f'Not enough models found for rank-NNLS, got {len(oof_list)}'
P = np.vstack(oof_list).T
T = np.vstack(test_list).T
def rank01(a):
    order = a.argsort(kind='mergesort')
    ranks = np.empty_like(order, dtype=np.float64)
    ranks[order] = np.arange(len(a), dtype=np.float64)
    return ranks / max(1, len(a)-1)
P_rank = np.apply_along_axis(rank01, 0, P)
T_rank = np.apply_along_axis(rank01, 0, T)
w, _ = nnls(P_rank, y.astype(float))
w = w / w.sum() if w.sum() > 0 else w
blend_oof_rank = P_rank @ w
auc_rank = roc_auc_score(y, blend_oof_rank)
print('Models in rank-NNLS (bagged):', names)
print('Rank-NNLS weights (bagged):', w)
print(f'Rank-blended OOF AUC (bagged): {auc_rank:.5f}')
blend_test_rank = T_rank @ w
pd.DataFrame({'request_id': test['request_id'], 'requester_received_pizza': blend_test_rank.astype(np.float32)}).to_csv('submission.csv', index=False)
print('Saved submission.csv (Rank-NNLS with bagged ST). Rows:', len(blend_test_rank))

Meta shapes: (2878, 42) (1162, 42)
Encoding MiniLM (all-MiniLM-L6-v2)...
MiniLM shapes: (2878, 384) (1162, 384)
Encoding MPNet (multi-qa-mpnet-base-dot-v1)...
You try to use a model that was created with version 3.0.0.dev0, however, your version is 2.7.0. This might cause unexpected behavior or errors. In that case, try to update to the latest version.
MPNet shapes: (2878, 768) (1162, 768)
=== Seed bagging XGB on MiniLM embeddings ===
    [MiniLM seed 7] Fold 1 AUC=0.65329 | iters=31 | 0.8s
    [MiniLM seed 7] Fold 2 AUC=0.65777 | iters=53 | 1.0s
    [MiniLM seed 7] Fold 3 AUC=0.63962 | iters=69 | 1.1s
    [MiniLM seed 7] Fold 4 AUC=0.66336 | iters=91 | 1.2s
    [MiniLM seed 7] Fold 5 AUC=0.69458 | iters=33 | 0.8s
  [MiniLM] Seed 7 OOF AUC=0.66008 | elapsed 4.9s
    [MiniLM seed 13] Fold 1 AUC=0.64450 | iters=193 | 1.3s
    [MiniLM seed 13] Fold 2 AUC=0.67770 | iters=44 | 0.7s
    [MiniLM seed 13] Fold 3 AUC=0.65054 | iters=87 | 1.0s
    [MiniLM seed 13] Fold 4 AUC=0.63736 | iters=40 | 1.2s
    [MiniLM seed 13] Fold 5 AUC=0.68210 | iters=36 | 0.6s
  [MiniLM] Seed 13 OOF AUC=0.65290 | elapsed 4.8s
    [MiniLM seed 29] Fold 1 AUC=0.64283 | iters=22 | 0.5s
    [MiniLM seed 29] Fold 2 AUC=0.66292 | iters=19 | 0.5s
    [MiniLM seed 29] Fold 3 AUC=0.65113 | iters=321 | 1.8s
    [MiniLM seed 29] Fold 4 AUC=0.64253 | iters=357 | 2.0s
    [MiniLM seed 29] Fold 5 AUC=0.67641 | iters=25 | 0.6s
  [MiniLM] Seed 29 OOF AUC=0.64027 | elapsed 5.5s
[MiniLM] Bagged OOF AUC: 0.66566
=== Seed bagging XGB on MPNet embeddings ===
    [MPNet seed 7] Fold 1 AUC=0.65987 | iters=87 | 2.5s
    [MPNet seed 7] Fold 2 AUC=0.69404 | iters=42 | 2.0s
    [MPNet seed 7] Fold 3 AUC=0.63298 | iters=73 | 2.4s
    [MPNet seed 7] Fold 4 AUC=0.66399 | iters=54 | 2.1s
    [MPNet seed 7] Fold 5 AUC=0.67982 | iters=67 | 2.3s
  [MPNet] Seed 7 OOF AUC=0.66370 | elapsed 11.3s
    [MPNet seed 13] Fold 1 AUC=0.67413 | iters=120 | 2.1s
    [MPNet seed 13] Fold 2 AUC=0.67233 | iters=45 | 1.4s
    [MPNet seed 13] Fold 3 AUC=0.63740 | iters=109 | 2.0s
    [MPNet seed 13] Fold 4 AUC=0.66546 | iters=53 | 1.5s
    [MPNet seed 13] Fold 5 AUC=0.69143 | iters=92 | 1.9s
  [MPNet] Seed 13 OOF AUC=0.66589 | elapsed 8.8s
    [MPNet seed 29] Fold 1 AUC=0.67146 | iters=53 | 1.7s
    [MPNet seed 29] Fold 2 AUC=0.67579 | iters=55 | 1.5s
    [MPNet seed 29] Fold 3 AUC=0.64472 | iters=42 | 1.4s
    [MPNet seed 29] Fold 4 AUC=0.66022 | iters=60 | 1.5s
    [MPNet seed 29] Fold 5 AUC=0.69020 | iters=69 | 1.6s
  [MPNet] Seed 29 OOF AUC=0.66906 | elapsed 7.7s
[MPNet] Bagged OOF AUC: 0.67582
Models in rank-NNLS (bagged): ['lr_meta_g', 'lr_subs_g', 'xgb_svd_meta', 'cb_text_g', 'st_embed_bag', 'mpnet_embed_bag']
Rank-NNLS weights (bagged): [0.1359503  0.17626838 0.15552755 0.04538354 0.16800549 0.31886475]
Rank-blended OOF AUC (bagged): 0.69644
Saved submission.csv (Rank-NNLS with bagged ST). Rows: 1162


In [25]:
# Step 3G: Grouped rank-stacking (Logistic Regression on OOF ranks) and submission
import numpy as np, pandas as pd, os, time
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedGroupKFold

train = pd.read_json('train.json')
test = pd.read_json('test.json')
y = train['requester_received_pizza'].astype(int).values
groups = train['requester_username'].fillna('').astype(str).values if 'requester_username' in train.columns else np.arange(len(train))

# Use the strongest, diverse grouped models (bagged ST + MPNet, LR_meta_g, LR_subs_g, XGB_svd_meta, CB_text_g)
paths = {
    'lr_meta_g': ('oof_lr_meta_g.npy','test_lr_meta_g.npy'),
    'lr_subs_g': ('oof_lr_subs_g.npy','test_lr_subs_g.npy'),
    'xgb_svd_meta': ('oof_xgb_svd_meta.npy','test_xgb_svd_meta.npy'),
    'cb_text_g': ('oof_cat_text_g.npy','test_cat_text_g.npy'),
    'st_embed_bag': ('oof_st_embed_bag.npy','test_st_embed_bag.npy'),
    'mpnet_embed_bag': ('oof_mpnet_embed_bag.npy','test_mpnet_embed_bag.npy'),
}
oof_list, test_list, names = [], [], []
for name, (poof, ptest) in paths.items():
    if os.path.exists(poof) and os.path.exists(ptest):
        oof_list.append(np.load(poof))
        test_list.append(np.load(ptest))
        names.append(name)

assert len(oof_list) >= 3, f'Need >=3 models for stacking; found {len(oof_list)}'
P = np.vstack(oof_list).T  # (n_samples, n_models)
T = np.vstack(test_list).T  # (n_test, n_models)

def rank01(a):
    order = a.argsort(kind='mergesort')
    ranks = np.empty_like(order, dtype=np.float64)
    ranks[order] = np.arange(len(a), dtype=np.float64)
    return ranks / max(1, len(a)-1)

P_rank = np.apply_along_axis(rank01, 0, P)
T_rank = np.apply_along_axis(rank01, 0, T)

# Grouped CV logistic stacker on ranks
sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
oof_stack = np.zeros(len(train), dtype=np.float32)
test_stack_preds = []
fold_times = []
for i, (trn_idx, val_idx) in enumerate(sgkf.split(P_rank, y, groups), 1):
    t0 = time.time()
    X_tr, X_va = P_rank[trn_idx], P_rank[val_idx]
    y_tr, y_va = y[trn_idx], y[val_idx]
    clf = LogisticRegression(C=1.0, solver='liblinear', max_iter=2000, class_weight=None)
    clf.fit(X_tr, y_tr)
    oof_stack[val_idx] = clf.predict_proba(X_va)[:,1]
    test_stack_preds.append(clf.predict_proba(T_rank)[:,1].astype(np.float32))
    fold_times.append(time.time()-t0)
    print(f'  Stacker fold {i} AUC={roc_auc_score(y_va, oof_stack[val_idx]):.5f} | {fold_times[-1]:.2f}s', flush=True)

auc_stack = roc_auc_score(y, oof_stack)
print('Models in stack:', names)
print(f'Rank-Logistic stacker OOF AUC: {auc_stack:.5f} | mean fold time: {np.mean(fold_times):.2f}s')

# Average test predictions across folds
test_stack = np.mean(np.vstack(test_stack_preds), axis=0).astype(np.float32)

# Save and submit
np.save('oof_rank_stacker.npy', oof_stack)
np.save('test_rank_stacker.npy', test_stack)
pd.DataFrame({'request_id': test['request_id'], 'requester_received_pizza': test_stack}).to_csv('submission.csv', index=False)
print('Saved submission.csv (rank-logistic stacker). Rows:', len(test_stack))

  Stacker fold 1 AUC=0.67975 | 0.00s
  Stacker fold 2 AUC=0.69998 | 0.00s
  Stacker fold 3 AUC=0.67407 | 0.00s
  Stacker fold 4 AUC=0.69447 | 0.00s
  Stacker fold 5 AUC=0.72669 | 0.00s
Models in stack: ['lr_meta_g', 'lr_subs_g', 'xgb_svd_meta', 'cb_text_g', 'st_embed_bag', 'mpnet_embed_bag']
Rank-Logistic stacker OOF AUC: 0.69346 | mean fold time: 0.00s
Saved submission.csv (rank-logistic stacker). Rows: 1162
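The stacker is fit on ranks computed over the 2878 training rows and then applied to ranks computed separately over the 1162 test rows. One alternative is to map each model's test scores through the empirical CDF of that model's OOF scores, so both sets are expressed on the same reference scale. The sketch below assumes that OOF and test scores from a given model are comparable, which may not hold exactly because the test scores are fold averages. `ecdf_transform` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def ecdf_transform(reference, values):
    """Fraction of reference scores that are <= each value, in [0, 1]."""
    ref_sorted = np.sort(np.asarray(reference, dtype=np.float64))
    pos = np.searchsorted(ref_sorted, np.asarray(values, dtype=np.float64), side='right')
    return pos / len(ref_sorted)

# Toy scores standing in for one model's OOF and test predictions
oof_demo = np.array([0.10, 0.40, 0.35, 0.80])
test_demo = np.array([0.05, 0.35, 0.90])
print(ecdf_transform(oof_demo, test_demo))
```

Applied per column, this would replace `T_rank` with values read off the OOF distribution. Whether it helps the leaderboard score is an empirical question that this sketch does not answer.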


In [26]:
# Step 3H: Enrich meta with time features (month/quarter/days_since_start/relative_position) and retrain bagged ST-XGB; rank-NNLS blend
import sys, subprocess, time, os, re, numpy as np, pandas as pd
from sklearn.model_selection import StratifiedGroupKFold
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score
from scipy.optimize import nnls

try:
    from sentence_transformers import SentenceTransformer
except Exception:
    subprocess.run([sys.executable, '-m', 'pip', 'install', '--quiet', 'sentence-transformers==2.7.0'], check=True)
    from sentence_transformers import SentenceTransformer
try:
    import xgboost as xgb
except Exception:
    subprocess.run([sys.executable, '-m', 'pip', 'install', '--quiet', 'xgboost==2.1.1'], check=True)
    import xgboost as xgb

def load_df(path):
    try:
        return pd.read_json(path)
    except ValueError:
        import json as _json
        with open(path, 'r') as f:
            return pd.DataFrame(_json.load(f))

train = load_df('train.json')
test = load_df('test.json')
target_col = 'requester_received_pizza'
title_col = 'request_title' if 'request_title' in train.columns else 'title'
body_col = 'request_text_edit_aware' if 'request_text_edit_aware' in train.columns else 'request_text'

def build_text_upweighted(df: pd.DataFrame, title_col: str, body_col: str) -> pd.Series:
    t = df[title_col].fillna('').astype(str)
    b = df[body_col].fillna('').astype(str)
    return (t + ' ' + t + ' [SEP] ' + b)

text_tr = build_text_upweighted(train, title_col, body_col).tolist()
text_te = build_text_upweighted(test, title_col, body_col).tolist()
y = train[target_col].astype(int).values
groups = train['requester_username'].fillna('').astype(str).values if 'requester_username' in train.columns else np.arange(len(train))

# Meta/lexicons + extra time features
def count_urls(s: str) -> int: return len(re.findall(r'https?://\S+', s or ''))
def has_imgur(s: str) -> int: return 1 if re.search(r'imgur\.com', s or '', flags=re.IGNORECASE) else 0
def count_digits(s: str) -> int: return sum(ch.isdigit() for ch in (s or ''))
def dollar_flag(s: str) -> int: return 1 if ('$' in (s or '')) or re.search(r'\b\d+\s*(dollars|bucks)\b', s or '', flags=re.IGNORECASE) else 0
def caps_ratio(s: str) -> float:
    if not s: return 0.0
    letters = [c for c in s if c.isalpha()]
    return (sum(c.isupper() for c in letters) / max(1, len(letters))) if letters else 0.0
def word_count(s: str) -> int: return len((s or '').split())
def exclam_count(s: str) -> int: return (s or '').count('!')
def question_count(s: str) -> int: return (s or '').count('?')
def parse_subreddit_count(x) -> int: return len(x) if isinstance(x, list) else 0
def safe_log1p_signed(x): return np.sign(x) * np.log1p(np.abs(x))
LEX_PATTERNS = {
    'lex_please': r'\bplease\b',
    'lex_thanks': r'\b(thank you|thanks in advance|thanks|thank|tia)\b',
    'lex_appreciate': r'\b(appreciate|appreciated)\b',
    'lex_pay_it_forward': r'\b(pay it forward|return the favor)\b',
    'lex_repay': r'\b(repay|pay you back|pay back)\b',
    'lex_willing': r"\b(willing to|i'll|i will|i can)\b",
    'lex_karma': r'\bkarma\b',
    'lex_evidence': r'\b(proof|receipt|photo|picture|pic|verify|verification|evidence)\b',
    'lex_imgur_word': r'\bimgur\b',
    'lex_student': r'\b(student|college|university|finals|exam|midterm)\b',
    'lex_jobloss': r'\b(unemployed|laid off|lost my job|between jobs|job hunt)\b',
    'lex_broke': r'\b(broke)\b',
    'lex_rent_bills': r'\b(rent|bill|bills|utilities|electric|gas|water|paycheck)\b',
    'lex_family': r'\b(family|kids?|children|baby|pregnant|son|daughter|wife|husband)\b',
    'lex_hungry': r'\b(hungry|starving|no food)\b|\bfood (stamps|pantry)\b',
    'lex_struggling': r'\b(desperate|struggling)\b',
    'lex_urgency': r'\b(tonight|today|tomorrow|asap|urgent)\b',
    'lex_help': r'\bhelp\b',
}
def add_lexicons(df: pd.DataFrame, title_col: str, body_col: str) -> pd.DataFrame:
    txt = (df[title_col].fillna('').astype(str) + ' ' + df[body_col].fillna('').astype(str)).str.lower()
    out = pd.DataFrame(index=df.index)
    for name, pat in LEX_PATTERNS.items():
        out[name] = txt.str.count(pat, flags=re.IGNORECASE).astype(float)
    return out

def build_meta_enriched(df: pd.DataFrame) -> pd.DataFrame:
    title = df[title_col].fillna('').astype(str)
    body = df[body_col].fillna('').astype(str)
    ts = pd.to_numeric(df['unix_timestamp_of_request'], errors='coerce')
    dt = pd.to_datetime(ts, unit='s', utc=True)
    out = pd.DataFrame(index=df.index)
    keep_cols = [
        'requester_account_age_in_days_at_request',
        'requester_days_since_first_post_on_raop_at_request',
        'requester_number_of_comments_at_request',
        'requester_number_of_comments_in_raop_at_request',
        'requester_number_of_posts_at_request',
        'requester_number_of_posts_on_raop_at_request',
        'requester_number_of_subreddits_at_request',
        'requester_upvotes_minus_downvotes_at_request',
        'requester_upvotes_plus_downvotes_at_request',
    ]
    for c in keep_cols:
        out[c] = pd.to_numeric(df[c], errors='coerce') if c in df.columns else 0.0
    out['requester_subreddits_count'] = df['requester_subreddits_at_request'].apply(parse_subreddit_count).astype(float) if 'requester_subreddits_at_request' in df.columns else 0.0
    # existing temporal
    out['hour'] = dt.dt.hour.astype(float)
    out['weekday'] = dt.dt.weekday.astype(float)
    out['is_weekend'] = dt.dt.weekday.isin([5,6]).astype(float)
    # new temporal features
    out['month'] = dt.dt.month.astype(float)
    out['quarter'] = dt.dt.quarter.astype(float)
    # days since start and relative position
    # NOTE: base_ts and the rank below are computed from whichever frame is passed in,
    # so train and test each get their own origin and their own 0..1 scale.
    base_ts = np.nanmin(ts.values)
    out['days_since_start'] = ((ts - base_ts) / 86400.0).astype(float)
    # relative position rank 0..1
    order = np.argsort(ts.values)
    rel = np.empty_like(order, dtype=np.float64)
    rel[order] = np.arange(len(order), dtype=np.float64)
    out['relative_position'] = (rel / max(1, len(order)-1)).astype(float)
    # basic text stats
    out['title_len_chars'] = title.str.len().astype(float)
    out['title_len_words'] = title.apply(word_count).astype(float)
    out['body_len_chars'] = body.str.len().astype(float)
    out['body_len_words'] = body.apply(word_count).astype(float)
    out['url_count'] = body.apply(count_urls).astype(float)
    out['has_imgur'] = body.apply(has_imgur).astype(float)
    out['digits_count'] = body.apply(count_digits).astype(float)
    out['dollar_flag'] = body.apply(dollar_flag).astype(float)
    out['caps_ratio'] = body.apply(caps_ratio).astype(float)
    out['exclam_count'] = body.apply(exclam_count).astype(float)
    out['question_count'] = body.apply(question_count).astype(float)
    # lexicons
    lex = add_lexicons(df, title_col, body_col)
    out = pd.concat([out, lex], axis=1)
    # transforms
    out = out.replace([np.inf, -np.inf], np.nan).fillna(0.0)
    nonneg_cols = [
        'requester_account_age_in_days_at_request','requester_days_since_first_post_on_raop_at_request',
        'requester_number_of_comments_at_request','requester_number_of_comments_in_raop_at_request',
        'requester_number_of_posts_at_request','requester_number_of_posts_on_raop_at_request',
        'requester_number_of_subreddits_at_request','requester_upvotes_plus_downvotes_at_request',
        'requester_subreddits_count','hour','weekday','is_weekend','month','quarter','days_since_start','relative_position',
        'title_len_chars','title_len_words','body_len_chars','body_len_words',
        'url_count','has_imgur','digits_count','dollar_flag','exclam_count','question_count',
        *list(LEX_PATTERNS.keys()),
    ]
    for c in nonneg_cols:
        if c in out.columns:
            out[c] = np.log1p(np.clip(out[c], a_min=0.0, a_max=None))
    if 'requester_upvotes_minus_downvotes_at_request' in out.columns:
        out['requester_upvotes_minus_downvotes_at_request'] = safe_log1p_signed(out['requester_upvotes_minus_downvotes_at_request'].values)
    return out.replace([np.inf,-np.inf],0.0).fillna(0.0)

meta_train = build_meta_enriched(train)
meta_test = build_meta_enriched(test)
print('Meta(enriched) shapes:', meta_train.shape, meta_test.shape, flush=True)

sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
folds = list(sgkf.split(np.zeros(len(y)), y, groups))

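def train_xgb_bag(emb_tr, emb_te, seeds=(7,13,29), jitter=True, name='model'):
    """Train XGBoost on [embeddings | scaled meta] over the shared folds, once per seed.

    Returns (oof_bag, test_bag): seed-averaged OOF and test predictions.
    jitter=True draws max_depth, subsample, colsample_bytree and learning_rate per seed.
    """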
    oof_bag = np.zeros(len(train), dtype=np.float32)
    test_bag_per_seed = []
    for seed in seeds:
        rs = np.random.RandomState(seed)
        md = 4 + (rs.rand() < 0.5) if jitter else 4
        subs = 0.75 + rs.rand()*0.10 if jitter else 0.80
        cols = 0.75 + rs.rand()*0.15 if jitter else 0.80
        lr = 0.045 + rs.rand()*0.010 if jitter else 0.05
        params = dict(
            objective='binary:logistic',
            eval_metric='auc',
            tree_method='hist',
            max_depth=int(md),
            learning_rate=float(lr),
            subsample=float(subs),
            colsample_bytree=float(cols),
            reg_lambda=1.0,
            min_child_weight=1.0,
            max_bin=256,
            random_state=int(seed)
        )
        oof_seed = np.zeros(len(train), dtype=np.float32)
        test_preds = []
        for i, (trn_idx, val_idx) in enumerate(folds, 1):
            t0 = time.time()
            scaler = StandardScaler(with_mean=True, with_std=True)
            M_tr = scaler.fit_transform(meta_train.iloc[trn_idx].values).astype(np.float32)
            M_va = scaler.transform(meta_train.iloc[val_idx].values).astype(np.float32)
            X_tr = np.hstack([emb_tr[trn_idx], M_tr]).astype(np.float32)
            X_va = np.hstack([emb_tr[val_idx], M_va]).astype(np.float32)
            dtr = xgb.DMatrix(X_tr, label=y[trn_idx])
            dva = xgb.DMatrix(X_va, label=y[val_idx])
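            # NOTE: early stopping monitors the same fold that is scored for OOF,
            # so the fold and OOF AUCs are slightly optimistic.
            booster = xgb.train(params, dtr, num_boost_round=2000, evals=[(dva,'valid')], verbose_eval=False, early_stopping_rounds=100)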
            oof_seed[val_idx] = booster.predict(dva, iteration_range=(0, booster.best_iteration+1))
            M_te = scaler.transform(meta_test.values).astype(np.float32)
            Xt = np.hstack([emb_te, M_te]).astype(np.float32)
            dte = xgb.DMatrix(Xt)
            test_preds.append(booster.predict(dte, iteration_range=(0, booster.best_iteration+1)).astype(np.float32))
            print(f'    [{name} seed {seed}] Fold {i} AUC={roc_auc_score(y[val_idx], oof_seed[val_idx]):.5f} | {time.time()-t0:.1f}s', flush=True)
        oof_bag += oof_seed / len(seeds)
        test_bag_per_seed.append(np.mean(np.vstack(test_preds), axis=0).astype(np.float32))
        print(f'  [{name}] Seed {seed} OOF AUC={roc_auc_score(y, oof_seed):.5f}', flush=True)
    test_bag = np.mean(np.vstack(test_bag_per_seed), axis=0).astype(np.float32)
    auc_bag = roc_auc_score(y, oof_bag)
    print(f'[{name}] Bagged OOF AUC: {auc_bag:.5f}', flush=True)
    return oof_bag, test_bag

print('Encoding MiniLM (all-MiniLM-L6-v2)...', flush=True)
minilm = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2', device='cpu')
emb_minilm_tr = minilm.encode(text_tr, batch_size=64, show_progress_bar=False, normalize_embeddings=True).astype(np.float32)
emb_minilm_te = minilm.encode(text_te, batch_size=64, show_progress_bar=False, normalize_embeddings=True).astype(np.float32)
print('MiniLM shapes:', emb_minilm_tr.shape, emb_minilm_te.shape, flush=True)

print('Encoding MPNet (multi-qa-mpnet-base-dot-v1)...', flush=True)
mpnet = SentenceTransformer('sentence-transformers/multi-qa-mpnet-base-dot-v1', device='cpu')
emb_mpnet_tr = mpnet.encode(text_tr, batch_size=64, show_progress_bar=False, normalize_embeddings=True).astype(np.float32)
emb_mpnet_te = mpnet.encode(text_te, batch_size=64, show_progress_bar=False, normalize_embeddings=True).astype(np.float32)
print('MPNet shapes:', emb_mpnet_tr.shape, emb_mpnet_te.shape, flush=True)

seeds = [7, 13, 29]
print('=== Bagging XGB on MiniLM with enriched meta ===', flush=True)
oof_st_bag2, test_st_bag2 = train_xgb_bag(emb_minilm_tr, emb_minilm_te, seeds=seeds, jitter=True, name='MiniLM+time')
np.save('oof_st_embed_bag_v2.npy', oof_st_bag2); np.save('test_st_embed_bag_v2.npy', test_st_bag2)

print('=== Bagging XGB on MPNet with enriched meta ===', flush=True)
oof_mp_bag2, test_mp_bag2 = train_xgb_bag(emb_mpnet_tr, emb_mpnet_te, seeds=seeds, jitter=True, name='MPNet+time')
np.save('oof_mpnet_embed_bag_v2.npy', oof_mp_bag2); np.save('test_mpnet_embed_bag_v2.npy', test_mp_bag2)

# Rank-then-NNLS including enriched bagged models
paths = {
    'lr_meta_g': ('oof_lr_meta_g.npy','test_lr_meta_g.npy'),
    'lr_subs_g': ('oof_lr_subs_g.npy','test_lr_subs_g.npy'),
    'xgb_svd_meta': ('oof_xgb_svd_meta.npy','test_xgb_svd_meta.npy'),
    'cb_text_g': ('oof_cat_text_g.npy','test_cat_text_g.npy'),
    'st_embed_bag_v2': ('oof_st_embed_bag_v2.npy','test_st_embed_bag_v2.npy'),
    'mpnet_embed_bag_v2': ('oof_mpnet_embed_bag_v2.npy','test_mpnet_embed_bag_v2.npy'),
}
oof_list, test_list, names = [], [], []
for name, (poof, ptest) in paths.items():
    if os.path.exists(poof) and os.path.exists(ptest):
        oof_list.append(np.load(poof))
        test_list.append(np.load(ptest))
        names.append(name)
assert len(oof_list) >= 3, 'Not enough models for rank-NNLS'
P = np.vstack(oof_list).T
T = np.vstack(test_list).T
def rank01(a):
    order = a.argsort(kind='mergesort')
    ranks = np.empty_like(order, dtype=np.float64)
    ranks[order] = np.arange(len(a), dtype=np.float64)
    return ranks / max(1, len(a)-1)
P_rank = np.apply_along_axis(rank01, 0, P)
T_rank = np.apply_along_axis(rank01, 0, T)
w, _ = nnls(P_rank, y.astype(float))
w = w / w.sum() if w.sum() > 0 else w
blend_oof_rank = P_rank @ w
auc_rank = roc_auc_score(y, blend_oof_rank)
print('Models in rank-NNLS (enriched):', names)
print('Rank-NNLS weights (enriched):', w)
print(f'Rank-blended OOF AUC (enriched): {auc_rank:.5f}')
blend_test_rank = T_rank @ w
pd.DataFrame({'request_id': test['request_id'], 'requester_received_pizza': blend_test_rank.astype(np.float32)}).to_csv('submission.csv', index=False)
print('Saved submission.csv (Rank-NNLS with enriched bagged ST). Rows:', len(blend_test_rank))

Meta(enriched) shapes: (2878, 46) (1162, 46)


Encoding MiniLM (all-MiniLM-L6-v2)...


MiniLM shapes: (2878, 384) (1162, 384)


Encoding MPNet (multi-qa-mpnet-base-dot-v1)...


You try to use a model that was created with version 3.0.0.dev0, however, your version is 2.7.0. This might cause unexpected behavior or errors. In that case, try to update to the latest version.





MPNet shapes: (2878, 768) (1162, 768)


=== Bagging XGB on MiniLM with enriched meta ===


    [MiniLM+time seed 7] Fold 1 AUC=0.66065 | 1.2s


    [MiniLM+time seed 7] Fold 2 AUC=0.67639 | 0.8s


    [MiniLM+time seed 7] Fold 3 AUC=0.67094 | 1.1s


    [MiniLM+time seed 7] Fold 4 AUC=0.68056 | 1.6s


    [MiniLM+time seed 7] Fold 5 AUC=0.70922 | 0.9s


  [MiniLM+time] Seed 7 OOF AUC=0.67355


    [MiniLM+time seed 13] Fold 1 AUC=0.67356 | 1.1s


    [MiniLM+time seed 13] Fold 2 AUC=0.68896 | 0.8s


    [MiniLM+time seed 13] Fold 3 AUC=0.68510 | 1.0s


    [MiniLM+time seed 13] Fold 4 AUC=0.67203 | 1.9s


    [MiniLM+time seed 13] Fold 5 AUC=0.70925 | 0.7s


  [MiniLM+time] Seed 13 OOF AUC=0.67955


    [MiniLM+time seed 29] Fold 1 AUC=0.65940 | 1.2s


    [MiniLM+time seed 29] Fold 2 AUC=0.70464 | 0.7s


    [MiniLM+time seed 29] Fold 3 AUC=0.67585 | 1.7s


    [MiniLM+time seed 29] Fold 4 AUC=0.65632 | 0.9s


    [MiniLM+time seed 29] Fold 5 AUC=0.72977 | 0.8s


  [MiniLM+time] Seed 29 OOF AUC=0.67563


[MiniLM+time] Bagged OOF AUC: 0.68704


=== Bagging XGB on MPNet with enriched meta ===


    [MPNet+time seed 7] Fold 1 AUC=0.67895 | 2.5s


    [MPNet+time seed 7] Fold 2 AUC=0.69270 | 2.7s


    [MPNet+time seed 7] Fold 3 AUC=0.65411 | 2.9s


    [MPNet+time seed 7] Fold 4 AUC=0.66384 | 2.7s


    [MPNet+time seed 7] Fold 5 AUC=0.70348 | 2.2s


  [MPNet+time] Seed 7 OOF AUC=0.67563


    [MPNet+time seed 13] Fold 1 AUC=0.68036 | 1.6s


    [MPNet+time seed 13] Fold 2 AUC=0.69828 | 1.5s


    [MPNet+time seed 13] Fold 3 AUC=0.64112 | 1.3s


    [MPNet+time seed 13] Fold 4 AUC=0.66988 | 1.5s


    [MPNet+time seed 13] Fold 5 AUC=0.72449 | 1.6s


  [MPNet+time] Seed 13 OOF AUC=0.68266


    [MPNet+time seed 29] Fold 1 AUC=0.67726 | 1.4s


    [MPNet+time seed 29] Fold 2 AUC=0.70656 | 3.2s


    [MPNet+time seed 29] Fold 3 AUC=0.64816 | 1.3s


    [MPNet+time seed 29] Fold 4 AUC=0.65756 | 1.6s


    [MPNet+time seed 29] Fold 5 AUC=0.71933 | 1.1s


  [MPNet+time] Seed 29 OOF AUC=0.67267


[MPNet+time] Bagged OOF AUC: 0.69010


Models in rank-NNLS (enriched): ['lr_meta_g', 'lr_subs_g', 'xgb_svd_meta', 'cb_text_g', 'st_embed_bag_v2', 'mpnet_embed_bag_v2']
Rank-NNLS weights (enriched): [0.15353256 0.10420435 0.09925398 0.         0.29518131 0.3478278 ]
Rank-blended OOF AUC (enriched): 0.70824
Saved submission.csv (Rank-NNLS with enriched bagged ST). Rows: 1162
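
The blend in the cell above first rank-normalizes each model's predictions to [0, 1] and then takes a nonnegative weighted sum; in the log, `cb_text_g` received weight 0 and so drops out of the blend. A dependency-free sketch of the rank-and-blend step follows. The predictions and weights are hypothetical and hard-coded; in the notebook the weights come from `scipy.optimize.nnls`, which this sketch does not reproduce.

```python
def rank01(values):
    """Map values to ranks scaled to [0, 1]; ties keep input order (stable sort)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = float(r)
    denom = max(1, len(values) - 1)
    return [r / denom for r in ranks]

def blend(columns, weights):
    """Weighted sum of rank-normalized prediction columns (weights rescaled to sum to 1)."""
    total = sum(weights)
    w = [x / total for x in weights] if total > 0 else list(weights)
    ranked = [rank01(c) for c in columns]
    n = len(columns[0])
    return [sum(w[j] * ranked[j][i] for j in range(len(columns))) for i in range(n)]

# Hypothetical predictions on very different scales
model_a = [0.10, 0.90, 0.40, 0.60]   # probabilities
model_b = [-3.0, 2.5, 0.5, -1.0]     # raw margins
model_c = [0.5, 0.5, 0.5, 0.5]       # uninformative; given weight 0 below

blended = blend([model_a, model_b, model_c], [0.6, 0.4, 0.0])
print(blended)
```

Ranking first puts probabilities and raw margins on a common scale, so the weights express relative trust in each model's ordering rather than compensating for scale.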


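
The Step 4 cell that follows pools DeBERTa token states with an attention-mask-weighted mean before the linear head. A plain-Python sketch of that pooling on hypothetical values (the helper name `masked_mean_pool` is illustrative; the notebook does this with torch tensors and clamps the count instead of using `max`):

```python
def masked_mean_pool(hidden, mask):
    """Average token vectors where mask == 1; padding rows are ignored."""
    dim = len(hidden[0])
    summed = [0.0] * dim
    count = 0
    for vec, m in zip(hidden, mask):
        if m:
            count += 1
            for d in range(dim):
                summed[d] += vec[d]
    count = max(count, 1)  # guard against an all-padding sequence
    return [s / count for s in summed]

# Hypothetical sequence: two real tokens followed by one padding token
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean_pool(hidden, mask)

# Unmasked mean for comparison: the padding row leaks into the average
naive = [sum(v[d] for v in hidden) / len(hidden) for d in range(2)]

print(pooled, naive)
```

Without the mask, padded batches would pull short requests toward whatever the backbone emits at padding positions.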
In [29]:
# Step 4: DeBERTa-v3-small fine-tune (mean-pool head) with StratifiedGroupKFold, add to rank-NNLS
import os, sys, time, json, math, numpy as np, pandas as pd, subprocess, shutil
from pathlib import Path
from sklearn.model_selection import StratifiedGroupKFold
from sklearn.metrics import roc_auc_score

# ================= Install GPU torch + transformers (CUDA 12.1) =================
try:
    import torch
    cuda_ok = torch.cuda.is_available() and str(getattr(torch.version, 'cuda', '0')).startswith('12.1')
except Exception:
    cuda_ok = False
if not cuda_ok:
    # Clean any prior torch installs
    for pkg in ("torch","torchvision","torchaudio"):
        subprocess.run([sys.executable, "-m", "pip", "uninstall", "-y", pkg], check=False)
    # Install exact cu121 stack
    subprocess.run([sys.executable, "-m", "pip", "install",
                    "--index-url", "https://download.pytorch.org/whl/cu121",
                    "--extra-index-url", "https://pypi.org/simple",
                    "torch==2.4.1", "torchvision==0.19.1", "torchaudio==2.4.1"], check=True)
    # Transformers + accelerate
    subprocess.run([sys.executable, "-m", "pip", "install",
                    "transformers==4.44.2", "accelerate==0.34.2", "scikit-learn==1.4.2"], check=True)
# Ensure sentencepiece is available for DeBERTa-v3 tokenizer
try:
    import sentencepiece  # noqa: F401
except Exception:
    subprocess.run([sys.executable, "-m", "pip", "install", "sentencepiece"], check=True)

import torch
from torch import nn
from torch.utils.data import Dataset
from transformers import AutoModel, DataCollatorWithPadding, Trainer, TrainingArguments
from transformers.models.deberta_v2 import DebertaV2Tokenizer

print('torch:', torch.__version__, 'cuda:', getattr(torch.version, 'cuda', None), 'cuda_available:', torch.cuda.is_available(), flush=True)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# ================= Data =================
def load_df(path):
    try:
        return pd.read_json(path)
    except ValueError:
        with open(path, 'r') as f:
            return pd.DataFrame(json.load(f))

train = load_df('train.json')
test = load_df('test.json')
target_col = 'requester_received_pizza'
title_col = 'request_title' if 'request_title' in train.columns else 'title'
body_col = 'request_text_edit_aware' if 'request_text_edit_aware' in train.columns else 'request_text'
y = train[target_col].astype(int).values
groups = train['requester_username'].fillna('').astype(str).values if 'requester_username' in train.columns else np.arange(len(train))

def build_text(df):
    t = df[title_col].fillna('').astype(str)
    b = df[body_col].fillna('').astype(str)
    return (t + ' [SEP] ' + t + ' [SEP] ' + b).tolist()

texts_tr = build_text(train)
texts_te = build_text(test)
print('Data sizes:', len(texts_tr), len(texts_te), flush=True)

# ================= Tokenizer (force slow) =================
model_name = 'microsoft/deberta-v3-small'
tok = DebertaV2Tokenizer.from_pretrained(model_name, do_lower_case=False)
max_len = 384

class TxtDataset(Dataset):
    def __init__(self, texts, labels=None):
        self.texts = texts
        self.labels = labels
    def __len__(self):
        return len(self.texts)
    def __getitem__(self, idx):
        enc = tok(self.texts[idx], truncation=True, max_length=max_len, return_tensors=None)
        if self.labels is not None:
            enc['labels'] = float(self.labels[idx])
        return enc

ds_all = TxtDataset(texts_tr, y)
ds_test = TxtDataset(texts_te, None)
collate = DataCollatorWithPadding(tokenizer=tok, pad_to_multiple_of=8)  # helps fp16

# ================= Mean-pool head =================
class MeanPoolHead(nn.Module):
    def __init__(self, in_dim, p=0.2):
        super().__init__()
        self.ln = nn.LayerNorm(in_dim)
        self.drop = nn.Dropout(p)
        self.fc = nn.Linear(in_dim, 1)
    def forward(self, last_hidden_state, attention_mask):
        # Masked mean over the sequence axis: padding positions contribute nothing
        mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
        summed = (last_hidden_state * mask).sum(dim=1)
        counts = mask.sum(dim=1).clamp(min=1e-6)  # avoid division by zero
        mean = summed / counts
        x = self.ln(mean)
        x = self.drop(x)
        return self.fc(x).squeeze(-1)  # one logit per example

class MeanPoolDeberta(nn.Module):
    def __init__(self, name):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(name)
        hidden = self.backbone.config.hidden_size
        self.head = MeanPoolHead(hidden, p=0.2)
    def forward(self, input_ids=None, attention_mask=None):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        logits = self.head(out.last_hidden_state, attention_mask)
        return logits

# ================= Trainer utils =================

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    if isinstance(logits, tuple):
        logits = logits[0]
    probs = 1/(1+np.exp(-logits))
    try:
        auc = roc_auc_score(labels, probs)
    except Exception:
        auc = 0.5
    return {'roc_auc': float(auc)}

class WrapModel(nn.Module):
    def __init__(self, base):
        super().__init__()
        self.base = base
        self.loss_fn = nn.BCEWithLogitsLoss()
    def forward(self, input_ids=None, attention_mask=None, labels=None):
        logits = self.base(input_ids=input_ids, attention_mask=attention_mask)
        if labels is not None:
            loss = self.loss_fn(logits.view(-1), labels.float().view(-1))
            return {'loss': loss, 'logits': logits.detach()}
        return {'logits': logits}

# ================= 5-fold StratifiedGroupKFold =================
sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
folds = list(sgkf.split(np.zeros(len(y)), y, groups))
oof = np.zeros(len(y), dtype=np.float32)
test_preds_folds = []
fold_times = []

for fi, (trn_idx, val_idx) in enumerate(folds, 1):
    t0 = time.time()
    train_texts = [texts_tr[i] for i in trn_idx]
    val_texts = [texts_tr[i] for i in val_idx]
    train_labels = y[trn_idx]
    val_labels = y[val_idx]
    ds_tr = TxtDataset(train_texts, train_labels)
    ds_va = TxtDataset(val_texts, val_labels)
    model = MeanPoolDeberta(model_name)
    model.to(device)
    wrap = WrapModel(model)
    args = TrainingArguments(
        output_dir=f"debv3_fold{fi}",
        num_train_epochs=3,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=64,
        gradient_accumulation_steps=2,
        learning_rate=2e-5,
        weight_decay=0.01,
        warmup_ratio=0.06,
        fp16=True,
        logging_steps=50,
        evaluation_strategy='epoch',
        save_strategy='epoch',
        load_best_model_at_end=True,
        metric_for_best_model='roc_auc',
        greater_is_better=True,
        report_to=[],
        dataloader_num_workers=2
    )
    trainer = Trainer(
        model=wrap,
        args=args,
        train_dataset=ds_tr,
        eval_dataset=ds_va,
        tokenizer=tok,
        data_collator=collate,
        compute_metrics=compute_metrics
    )
    trainer.train()
    # Val preds
    val_outputs = trainer.predict(ds_va)
    val_logits = val_outputs.predictions if not isinstance(val_outputs.predictions, tuple) else val_outputs.predictions[0]
    val_probs = 1/(1+np.exp(-val_logits))
    oof[val_idx] = val_probs.astype(np.float32)
    fold_auc = roc_auc_score(val_labels, oof[val_idx])
    # Test preds
    te_outputs = trainer.predict(ds_test)
    te_logits = te_outputs.predictions if not isinstance(te_outputs.predictions, tuple) else te_outputs.predictions[0]
    te_probs = 1/(1+np.exp(-te_logits))
    test_preds_folds.append(te_probs.astype(np.float32))
    dt = time.time()-t0
    fold_times.append(dt)
    print(f"  DeBERTa fold {fi} AUC={fold_auc:.5f} | {dt/60:.1f} min", flush=True)

auc_oof = roc_auc_score(y, oof)
print(f'DeBERTa-v3-small OOF AUC: {auc_oof:.5f} | mean fold time: {np.mean(fold_times)/60:.1f} min', flush=True)
test_mean = np.mean(np.vstack(test_preds_folds), axis=0).astype(np.float32)
np.save('oof_deb_v3_small.npy', oof)
np.save('test_deb_v3_small.npy', test_mean)

# ================= Rank-then-NNLS including DeBERTa =================
from scipy.optimize import nnls
def rank01(a):
    order = a.argsort(kind='mergesort')
    ranks = np.empty_like(order, dtype=np.float64)
    ranks[order] = np.arange(len(a), dtype=np.float64)
    return ranks / max(1, len(a)-1)

paths = {
    'lr_meta_g': ('oof_lr_meta_g.npy','test_lr_meta_g.npy'),
    'lr_subs_g': ('oof_lr_subs_g.npy','test_lr_subs_g.npy'),
    'xgb_svd_meta': ('oof_xgb_svd_meta.npy','test_xgb_svd_meta.npy'),
    'cb_text_g': ('oof_cat_text_g.npy','test_cat_text_g.npy'),
    'st_embed_bag_v2': ('oof_st_embed_bag_v2.npy','test_st_embed_bag_v2.npy') if os.path.exists('oof_st_embed_bag_v2.npy') else ('oof_st_embed_bag.npy','test_st_embed_bag.npy'),
    'mpnet_embed_bag_v2': ('oof_mpnet_embed_bag_v2.npy','test_mpnet_embed_bag_v2.npy') if os.path.exists('oof_mpnet_embed_bag_v2.npy') else ('oof_mpnet_embed_bag.npy','test_mpnet_embed_bag.npy'),
    'deb_v3_small': ('oof_deb_v3_small.npy','test_deb_v3_small.npy'),
}
oof_list, test_list, names = [], [], []
for name, (poof, ptest) in paths.items():
    if os.path.exists(poof) and os.path.exists(ptest):
        oof_list.append(np.load(poof))
        test_list.append(np.load(ptest))
        names.append(name)
assert len(oof_list) >= 3, f'Not enough models for rank-NNLS, got {len(oof_list)}'
P = np.vstack(oof_list).T
T = np.vstack(test_list).T
P_rank = np.apply_along_axis(rank01, 0, P)
T_rank = np.apply_along_axis(rank01, 0, T)
w, _ = nnls(P_rank, y.astype(float))
w = w / w.sum() if w.sum() > 0 else w
blend_oof_rank = P_rank @ w
auc_rank = roc_auc_score(y, blend_oof_rank)
print('Rank-NNLS models:', names)
print('Rank-NNLS weights:', w)
print(f'Rank-blended OOF AUC (with DeBERTa): {auc_rank:.5f}')
blend_test_rank = T_rank @ w
pd.DataFrame({'request_id': test['request_id'], 'requester_received_pizza': blend_test_rank.astype(np.float32)}).to_csv('submission.csv', index=False)
print('Saved submission.csv (Rank-NNLS + DeBERTa). Rows:', len(blend_test_rank))

Found existing installation: torch 2.8.0


Uninstalling torch-2.8.0:
  Successfully uninstalled torch-2.8.0


Found existing installation: torchvision 0.19.1+cu121
Uninstalling torchvision-0.19.1+cu121:
  Successfully uninstalled torchvision-0.19.1+cu121


Found existing installation: torchaudio 2.4.1+cu121
Uninstalling torchaudio-2.4.1+cu121:
  Successfully uninstalled torchaudio-2.4.1+cu121


Looking in indexes: https://download.pytorch.org/whl/cu121, https://pypi.org/simple


Installing collected packages: mpmath, typing-extensions, sympy, pillow, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufft-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, numpy, networkx, MarkupSafe, fsspec, filelock, triton, nvidia-cusparse-cu12, nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, torch, torchvision, torchaudio


Successfully installed MarkupSafe-3.0.3 filelock-3.19.1 fsspec-2025.9.0 jinja2-3.1.6 mpmath-1.3.0 networkx-3.5 numpy-1.26.4 nvidia-cublas-cu12-12.1.3.1 nvidia-cuda-cupti-cu12-12.1.105 nvidia-cuda-nvrtc-cu12-12.1.105 nvidia-cuda-runtime-cu12-12.1.105 nvidia-cudnn-cu12-9.1.0.70 nvidia-cufft-cu12-11.0.2.54 nvidia-curand-cu12-10.3.2.106 nvidia-cusolver-cu12-11.4.5.107 nvidia-cusparse-cu12-12.1.0.106 nvidia-nccl-cu12-2.20.5 nvidia-nvjitlink-cu12-12.9.86 nvidia-nvtx-cu12-12.1.105 pillow-11.3.0 sympy-1.14.0 torch-2.4.1+cu121 torchaudio-2.4.1+cu121 torchvision-0.19.1+cu121 triton-3.0.0 typing-extensions-4.15.0




Installing collected packages: nvidia-cusparselt-cu12, mpmath, urllib3, typing-extensions, tqdm, threadpoolctl, sympy, setuptools, safetensors, regex, pyyaml, psutil, packaging, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufile-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, numpy, networkx, MarkupSafe, joblib, idna, hf-xet, fsspec, filelock, charset_normalizer, certifi, triton, scipy, requests, nvidia-cusparse-cu12, nvidia-cufft-cu12, nvidia-cudnn-cu12, jinja2, scikit-learn, nvidia-cusolver-cu12, huggingface-hub, torch, tokenizers, transformers, accelerate


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchvision 0.19.1+cu121 requires torch==2.4.1, but you have torch 2.8.0 which is incompatible.
torchaudio 2.4.1+cu121 requires torch==2.4.1, but you have torch 2.8.0 which is incompatible.


Successfully installed MarkupSafe-3.0.3 accelerate-0.34.2 certifi-2025.8.3 charset_normalizer-3.4.3 filelock-3.19.1 fsspec-2025.9.0 hf-xet-1.1.10 huggingface-hub-0.35.1 idna-3.10 jinja2-3.1.6 joblib-1.5.2 mpmath-1.3.0 networkx-3.5 numpy-1.26.4 nvidia-cublas-cu12-12.8.4.1 nvidia-cuda-cupti-cu12-12.8.90 nvidia-cuda-nvrtc-cu12-12.8.93 nvidia-cuda-runtime-cu12-12.8.90 nvidia-cudnn-cu12-9.10.2.21 nvidia-cufft-cu12-11.3.3.83 nvidia-cufile-cu12-1.13.1.3 nvidia-curand-cu12-10.3.9.90 nvidia-cusolver-cu12-11.7.3.90 nvidia-cusparse-cu12-12.5.8.93 nvidia-cusparselt-cu12-0.7.1 nvidia-nccl-cu12-2.27.3 nvidia-nvjitlink-cu12-12.8.93 nvidia-nvtx-cu12-12.8.90 packaging-25.0 psutil-7.1.0 pyyaml-6.0.3 regex-2025.9.18 requests-2.32.5 safetensors-0.6.2 scikit-learn-1.4.2 scipy-1.16.2 setuptools-80.9.0 sympy-1.14.0 threadpoolctl-3.6.0 tokenizers-0.19.1 torch-2.8.0 tqdm-4.67.1 transformers-4.44.2 triton-3.4.0 typing-extensions-4.15.0 urllib3-2.5.0


torch: 2.8.0+cu128 cuda: 12.8 cuda_available: True
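
The resolver error above reports that torchvision 0.19.1+cu121 and torchaudio 2.4.1+cu121 both pin torch==2.4.1, while the second install pulled in torch 2.8.0. A minimal sketch of a post-install check that surfaces this kind of mismatch is given below. The helper names `installed_version` and `find_pin_conflicts`, and the `(owner, dependency, version)` tuple format, are assumptions made for illustration and are not part of the notebook.

```python
from importlib import metadata

def installed_version(name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None

def find_pin_conflicts(pins, installed):
    """Report exact-version pins that the installed set does not satisfy.

    pins: list of (owner, dependency, exact_version) tuples (hypothetical format).
    installed: dict mapping package name -> installed version string.
    Local build tags such as '+cu121' are ignored when comparing.
    """
    conflicts = []
    for owner, dep, wanted in pins:
        have = installed.get(dep)
        have_base = have.split('+')[0] if have else None
        if have_base != wanted:
            conflicts.append(f'{owner} requires {dep}=={wanted}, found {have}')
    return conflicts

# Pins and versions copied from the log above.
pins = [('torchvision', 'torch', '2.4.1'), ('torchaudio', 'torch', '2.4.1')]
logged = {'torch': '2.8.0+cu128'}
for msg in find_pin_conflicts(pins, logged):
    print(msg)
```

In a live session the `logged` dict could instead be built from `installed_version('torch')`. The cells that follow only use torch through sentence-transformers, so the torchvision/torchaudio mismatch may never be exercised, but that is an inference from the code shown here rather than something the log confirms.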


Data sizes: 2878 1162


ImportError: 
DebertaV2Tokenizer requires the SentencePiece library but it was not found in your environment. Check out the instructions on the
installation page of its repo: https://github.com/google/sentencepiece#installation and follow the ones
that match your environment. Please note that you may need to restart your runtime after installation.
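
The ImportError above comes from loading a DeBERTa-v2 tokenizer without the SentencePiece library installed. One possible guard, following the same try-import-then-pip pattern the later cells use for sentence-transformers and xgboost, is sketched here. `ensure_module` is a hypothetical helper, and as the error message notes, a runtime restart may still be needed after installation.

```python
import importlib
import importlib.util
import subprocess
import sys

def ensure_module(module_name, pip_spec=None, install=True):
    """Return True if module_name is importable, optionally installing it first."""
    if importlib.util.find_spec(module_name) is not None:
        return True
    if not install:
        return False
    # Install into the interpreter that is running this notebook.
    subprocess.run(
        [sys.executable, '-m', 'pip', 'install', '--quiet', pip_spec or module_name],
        check=True,
    )
    importlib.invalidate_caches()
    return importlib.util.find_spec(module_name) is not None

# Assumed usage before constructing the tokenizer:
# ensure_module('sentencepiece')
```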


In [30]:
# Step 4A: Add intfloat/e5-base-v2 embeddings + meta (XGB, grouped CV) and rank-NNLS blend
import sys, subprocess, time, os, re, numpy as np, pandas as pd
from sklearn.model_selection import StratifiedGroupKFold
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score
from scipy.optimize import nnls

# sentence-transformers + xgboost
try:
    from sentence_transformers import SentenceTransformer
except Exception:
    subprocess.run([sys.executable, '-m', 'pip', 'install', '--quiet', 'sentence-transformers==2.7.0'], check=True)
    from sentence_transformers import SentenceTransformer
try:
    import xgboost as xgb
except Exception:
    subprocess.run([sys.executable, '-m', 'pip', 'install', '--quiet', 'xgboost==2.1.1'], check=True)
    import xgboost as xgb

def load_df(path):
    try:
        return pd.read_json(path)
    except ValueError:
        import json as _json
        with open(path, 'r') as f:
            return pd.DataFrame(_json.load(f))

train = load_df('train.json')
test = load_df('test.json')
target_col = 'requester_received_pizza'
title_col = 'request_title' if 'request_title' in train.columns else 'title'
body_col = 'request_text_edit_aware' if 'request_text_edit_aware' in train.columns else 'request_text'

def build_text_upweighted(df: pd.DataFrame, title_col: str, body_col: str) -> pd.Series:
    t = df[title_col].fillna('').astype(str)
    b = df[body_col].fillna('').astype(str)
    return (t + ' [SEP] ' + t + ' [SEP] ' + b)

text_tr = build_text_upweighted(train, title_col, body_col).tolist()
text_te = build_text_upweighted(test, title_col, body_col).tolist()
y = train[target_col].astype(int).values
groups = train['requester_username'].fillna('').astype(str).values if 'requester_username' in train.columns else np.arange(len(train))

# Meta (reuse leakage-safe features, with time enrichments already validated)
def count_urls(s: str) -> int: return len(re.findall(r'https?://\S+', s or ''))
def has_imgur(s: str) -> int: return 1 if re.search(r'imgur\.com', s or '', flags=re.IGNORECASE) else 0
def count_digits(s: str) -> int: return sum(ch.isdigit() for ch in (s or ''))
def dollar_flag(s: str) -> int: return 1 if ('$' in (s or '')) or re.search(r'\b\d+\s*(dollars|bucks)\b', s or '', flags=re.IGNORECASE) else 0
def caps_ratio(s: str) -> float:
    if not s: return 0.0
    letters = [c for c in s if c.isalpha()]
    return (sum(c.isupper() for c in letters) / max(1, len(letters))) if letters else 0.0
def word_count(s: str) -> int: return len((s or '').split())
def exclam_count(s: str) -> int: return (s or '').count('!')
def question_count(s: str) -> int: return (s or '').count('?')
def parse_subreddit_count(x) -> int: return len(x) if isinstance(x, list) else 0
def safe_log1p_signed(x): return np.sign(x) * np.log1p(np.abs(x))
LEX_PATTERNS = {
    'lex_please': r'\bplease\b',
    'lex_thanks': r'\b(thank you|thanks in advance|thanks|thank|tia)\b',
    'lex_appreciate': r'\b(appreciate|appreciated)\b',
    'lex_pay_it_forward': r'\b(pay it forward|return the favor)\b',
    'lex_repay': r'\b(repay|pay you back|pay back)\b',
    'lex_willing': r"\b(willing to|i\'ll|i will|i can)\b",
    'lex_karma': r'\bkarma\b',
    'lex_evidence': r'\b(proof|receipt|photo|picture|pic|verify|verification|evidence)\b',
    'lex_imgur_word': r'\bimgur\b',
    'lex_student': r'\b(student|college|university|finals|exam|midterm)\b',
    'lex_jobloss': r'\b(unemployed|laid off|lost my job|between jobs|job hunt)\b',
    'lex_broke': r'\b(broke)\b',
    'lex_rent_bills': r'\b(rent|bill|bills|utilities|electric|gas|water|paycheck)\b',
    'lex_family': r'\b(family|kids?|children|baby|pregnant|son|daughter|wife|husband)\b',
    'lex_hungry': r'\b(hungry|starving|no food)\b|\bfood (stamps|pantry)\b',
    'lex_struggling': r'\b(desperate|struggling)\b',
    'lex_urgency': r'\b(tonight|today|tomorrow|asap|urgent)\b',
    'lex_help': r'\bhelp\b',
}
def add_lexicons(df: pd.DataFrame, title_col: str, body_col: str) -> pd.DataFrame:
    txt = (df[title_col].fillna('').astype(str) + ' ' + df[body_col].fillna('').astype(str)).str.lower()
    out = pd.DataFrame(index=df.index)
    for name, pat in LEX_PATTERNS.items():
        out[name] = txt.str.count(pat, flags=re.IGNORECASE).astype(float)
    return out
def build_meta_enriched(df: pd.DataFrame) -> pd.DataFrame:
    title = df[title_col].fillna('').astype(str)
    body = df[body_col].fillna('').astype(str)
    ts = pd.to_numeric(df['unix_timestamp_of_request'], errors='coerce')
    dt = pd.to_datetime(ts, unit='s', utc=True)
    out = pd.DataFrame(index=df.index)
    keep_cols = [
        'requester_account_age_in_days_at_request',
        'requester_days_since_first_post_on_raop_at_request',
        'requester_number_of_comments_at_request',
        'requester_number_of_comments_in_raop_at_request',
        'requester_number_of_posts_at_request',
        'requester_number_of_posts_on_raop_at_request',
        'requester_number_of_subreddits_at_request',
        'requester_upvotes_minus_downvotes_at_request',
        'requester_upvotes_plus_downvotes_at_request',
    ]
    for c in keep_cols:
        out[c] = pd.to_numeric(df[c], errors='coerce') if c in df.columns else 0.0
    out['requester_subreddits_count'] = df['requester_subreddits_at_request'].apply(parse_subreddit_count).astype(float) if 'requester_subreddits_at_request' in df.columns else 0.0
    out['hour'] = dt.dt.hour.astype(float)
    out['weekday'] = dt.dt.weekday.astype(float)
    out['is_weekend'] = dt.dt.weekday.isin([5,6]).astype(float)
    out['month'] = dt.dt.month.astype(float)
    out['quarter'] = dt.dt.quarter.astype(float)
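    # NOTE: base_ts and the rank below are computed within the frame passed in, so
    # train and test each get their own origin and scale for these two features.
    base_ts = np.nanmin(ts.values)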
    out['days_since_start'] = ((ts - base_ts) / 86400.0).astype(float)
    order = np.argsort(ts.values)
    rel = np.empty_like(order, dtype=np.float64)
    rel[order] = np.arange(len(order), dtype=np.float64)
    out['relative_position'] = (rel / max(1, len(order)-1)).astype(float)
    out['title_len_chars'] = title.str.len().astype(float)
    out['title_len_words'] = title.apply(word_count).astype(float)
    out['body_len_chars'] = body.str.len().astype(float)
    out['body_len_words'] = body.apply(word_count).astype(float)
    out['url_count'] = body.apply(count_urls).astype(float)
    out['has_imgur'] = body.apply(has_imgur).astype(float)
    out['digits_count'] = body.apply(count_digits).astype(float)
    out['dollar_flag'] = body.apply(dollar_flag).astype(float)
    out['caps_ratio'] = body.apply(caps_ratio).astype(float)
    out['exclam_count'] = body.apply(exclam_count).astype(float)
    out['question_count'] = body.apply(question_count).astype(float)
    lex = add_lexicons(df, title_col, body_col)
    out = pd.concat([out, lex], axis=1).replace([np.inf, -np.inf], np.nan).fillna(0.0)
    nonneg_cols = [
        'requester_account_age_in_days_at_request','requester_days_since_first_post_on_raop_at_request',
        'requester_number_of_comments_at_request','requester_number_of_comments_in_raop_at_request',
        'requester_number_of_posts_at_request','requester_number_of_posts_on_raop_at_request',
        'requester_number_of_subreddits_at_request','requester_upvotes_plus_downvotes_at_request',
        'requester_subreddits_count','hour','weekday','is_weekend','month','quarter','days_since_start','relative_position',
        'title_len_chars','title_len_words','body_len_chars','body_len_words',
        'url_count','has_imgur','digits_count','dollar_flag','exclam_count','question_count',
        *list(LEX_PATTERNS.keys()),
    ]
    for c in nonneg_cols:
        if c in out.columns:
            out[c] = np.log1p(np.clip(out[c], a_min=0.0, a_max=None))
    if 'requester_upvotes_minus_downvotes_at_request' in out.columns:
        out['requester_upvotes_minus_downvotes_at_request'] = safe_log1p_signed(out['requester_upvotes_minus_downvotes_at_request'].values)
    return out.replace([np.inf,-np.inf],0.0).fillna(0.0)

meta_train = build_meta_enriched(train)
meta_test = build_meta_enriched(test)
print('Meta(e5) shapes:', meta_train.shape, meta_test.shape, flush=True)

# Encode E5-base-v2 embeddings on CPU with L2-normalized outputs. Each input carries the 'query: ' prefix that E5 models expect.
print('Encoding e5-base-v2 embeddings...', flush=True)
e5 = SentenceTransformer('intfloat/e5-base-v2', device='cpu')
e5_tr = e5.encode([f'query: {s}' for s in text_tr], batch_size=128, show_progress_bar=False, normalize_embeddings=True).astype(np.float32)
e5_te = e5.encode([f'query: {s}' for s in text_te], batch_size=128, show_progress_bar=False, normalize_embeddings=True).astype(np.float32)
print('E5 shapes:', e5_tr.shape, e5_te.shape, flush=True)

# Grouped CV XGB on [e5 emb + scaled meta]
sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
folds = list(sgkf.split(np.zeros(len(y)), y, groups))
oof_e5 = np.zeros(len(train), dtype=np.float32)
test_preds = []
times = []
params = dict(
    objective='binary:logistic',
    eval_metric='auc',
    tree_method='hist',
    max_depth=4,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_lambda=1.0,
    min_child_weight=1.0,
    max_bin=256,
    random_state=42
)
for i, (trn_idx, val_idx) in enumerate(folds, 1):
    t0 = time.time()
    scaler = StandardScaler(with_mean=True, with_std=True)
    M_tr = scaler.fit_transform(meta_train.iloc[trn_idx].values).astype(np.float32)
    M_va = scaler.transform(meta_train.iloc[val_idx].values).astype(np.float32)
    X_tr = np.hstack([e5_tr[trn_idx], M_tr]).astype(np.float32)
    X_va = np.hstack([e5_tr[val_idx], M_va]).astype(np.float32)
    dtr = xgb.DMatrix(X_tr, label=y[trn_idx])
    dva = xgb.DMatrix(X_va, label=y[val_idx])
    # Early stopping monitors the validation fold itself, so the per-fold AUC is mildly optimistic.
    booster = xgb.train(params, dtr, num_boost_round=2000, evals=[(dva,'valid')], verbose_eval=False, early_stopping_rounds=100)
    oof_e5[val_idx] = booster.predict(dva, iteration_range=(0, booster.best_iteration+1))
    M_te = scaler.transform(meta_test.values).astype(np.float32)
    Xt = np.hstack([e5_te, M_te]).astype(np.float32)
    dte = xgb.DMatrix(Xt)
    test_preds.append(booster.predict(dte, iteration_range=(0, booster.best_iteration+1)).astype(np.float32))
    times.append(time.time()-t0)
    print(f'  E5 fold {i} AUC={roc_auc_score(y[val_idx], oof_e5[val_idx]):.5f} | iters={booster.best_iteration+1} | {times[-1]:.1f}s', flush=True)

auc_e5 = roc_auc_score(y, oof_e5)
print(f'OOF AUC (E5 + meta XGB): {auc_e5:.5f} | mean fold time: {np.mean(times):.1f}s', flush=True)
test_e5 = np.mean(np.vstack(test_preds), axis=0).astype(np.float32)
np.save('oof_e5_embed_g.npy', oof_e5)
np.save('test_e5_embed_g.npy', test_e5)

# Rank-then-NNLS including E5 and strongest legs (prefer enriched bagged ST if present)
paths = {
    'lr_meta_g': ('oof_lr_meta_g.npy','test_lr_meta_g.npy'),
    'lr_subs_g': ('oof_lr_subs_g.npy','test_lr_subs_g.npy'),
    'xgb_svd_meta': ('oof_xgb_svd_meta.npy','test_xgb_svd_meta.npy'),
    'cb_text_g': ('oof_cat_text_g.npy','test_cat_text_g.npy'),
    'st_embed_bag_v2': ('oof_st_embed_bag_v2.npy','test_st_embed_bag_v2.npy') if os.path.exists('oof_st_embed_bag_v2.npy') else ('oof_st_embed_bag.npy','test_st_embed_bag.npy'),
    'mpnet_embed_bag_v2': ('oof_mpnet_embed_bag_v2.npy','test_mpnet_embed_bag_v2.npy') if os.path.exists('oof_mpnet_embed_bag_v2.npy') else ('oof_mpnet_embed_bag.npy','test_mpnet_embed_bag.npy'),
    'e5_embed_g': ('oof_e5_embed_g.npy','test_e5_embed_g.npy'),
}
oof_list, test_list, names = [], [], []
for name, (poof, ptest) in paths.items():
    if os.path.exists(poof) and os.path.exists(ptest):
        oof_list.append(np.load(poof))
        test_list.append(np.load(ptest))
        names.append(name)
assert len(oof_list) >= 3, 'Not enough models for rank-NNLS'
P = np.vstack(oof_list).T
T = np.vstack(test_list).T
def rank01(a):
    order = a.argsort(kind='mergesort')
    ranks = np.empty_like(order, dtype=np.float64)
    ranks[order] = np.arange(len(a), dtype=np.float64)
    return ranks / max(1, len(a)-1)
P_rank = np.apply_along_axis(rank01, 0, P)
T_rank = np.apply_along_axis(rank01, 0, T)
w, _ = nnls(P_rank, y.astype(float))
w = w / w.sum() if w.sum() > 0 else w
blend_oof_rank = P_rank @ w
auc_rank = roc_auc_score(y, blend_oof_rank)
print('Rank-NNLS models (+E5):', names)
print('Rank-NNLS weights (+E5):', w)
print(f'Rank-blended OOF AUC (+E5): {auc_rank:.5f}')
blend_test_rank = T_rank @ w
pd.DataFrame({'request_id': test['request_id'], 'requester_received_pizza': blend_test_rank.astype(np.float32)}).to_csv('submission.csv', index=False)
print('Saved submission.csv (Rank-NNLS + E5). Rows:', len(blend_test_rank))

Meta(e5) shapes: (2878, 46) (1162, 46)


Encoding e5-base-v2 embeddings...


E5 shapes: (2878, 768) (1162, 768)


  E5 fold 1 AUC=0.68649 | iters=56 | 1.7s


  E5 fold 2 AUC=0.67281 | iters=23 | 1.6s


  E5 fold 3 AUC=0.64393 | iters=29 | 1.2s


  E5 fold 4 AUC=0.68757 | iters=123 | 2.0s


  E5 fold 5 AUC=0.71050 | iters=18 | 1.1s


OOF AUC (E5 + meta XGB): 0.67627 | mean fold time: 1.5s


Rank-NNLS models (+E5): ['lr_meta_g', 'lr_subs_g', 'xgb_svd_meta', 'cb_text_g', 'st_embed_bag_v2', 'mpnet_embed_bag_v2', 'e5_embed_g']
Rank-NNLS weights (+E5): [0.14791451 0.09863673 0.08117717 0.         0.24403993 0.31331505
 0.11491661]
Rank-blended OOF AUC (+E5): 0.70912
Saved submission.csv (Rank-NNLS + E5). Rows: 1162
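
The rank-then-NNLS step used in the cell above can be tried in isolation on synthetic data. The sketch below reuses the cell's `rank01` logic and `scipy.optimize.nnls`; the toy predictions, the seed, and the wrapper name `rank_nnls_weights` are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import nnls

def rank01(a):
    """Map scores to ranks scaled into [0, 1]; ties are broken by position (stable sort)."""
    a = np.asarray(a)
    order = a.argsort(kind='mergesort')
    ranks = np.empty(len(a), dtype=np.float64)
    ranks[order] = np.arange(len(a), dtype=np.float64)
    return ranks / max(1, len(a) - 1)

def rank_nnls_weights(P, y):
    """Rank-transform each model column, then fit non-negative weights against y."""
    P_rank = np.apply_along_axis(rank01, 0, P)
    w, _ = nnls(P_rank, np.asarray(y, dtype=float))
    w = w / w.sum() if w.sum() > 0 else w
    return w, P_rank

# Toy stand-ins for OOF predictions from three models.
rng = np.random.default_rng(0)
y_toy = rng.integers(0, 2, size=200)
P_toy = np.column_stack([
    y_toy + rng.normal(0, 1.0, 200),  # informative leg
    y_toy + rng.normal(0, 2.0, 200),  # noisier leg
    rng.normal(0, 1.0, 200),          # pure-noise leg
])
w_toy, P_rank_toy = rank_nnls_weights(P_toy, y_toy)
print('weights:', w_toy)
```

Ranking each column first puts models with different score scales (probabilities, margins) on a common footing, which suits AUC because AUC depends only on ordering. NNLS keeps every weight at or above zero, so a leg can be zeroed out, as `cb_text_g` is in the output above, but never subtracted.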


In [44]:
# Time-aware, group-purged holdout rank-NNLS blender with shrink; prune weak legs; write submission
import numpy as np, pandas as pd, os, time
from sklearn.metrics import roc_auc_score
from scipy.optimize import nnls

train = pd.read_json('train.json')
test = pd.read_json('test.json')
y = train['requester_received_pizza'].astype(int).values
users = train['requester_username'].fillna('').astype(str).values
ts = pd.to_numeric(train['unix_timestamp_of_request'], errors='coerce').values

# Define last-20% time holdout with 5-day purge gap and group purge
q = 0.80
cutoff = np.nanquantile(ts, q)
gap_days = 5
gap_sec = gap_days * 86400.0
train_mask = ts < (cutoff - gap_sec)
val_mask = ts >= cutoff

# Group purge: drop any user appearing on both sides
users_train = set(users[train_mask])
users_val = set(users[val_mask])
overlap = users_train.intersection(users_val)
if overlap:
    drop_overlap = np.isin(users, list(overlap))
    train_mask = train_mask & (~drop_overlap)
    val_mask = val_mask & (~drop_overlap)

idx_tr = np.where(train_mask)[0]
idx_va = np.where(val_mask)[0]
print(f'Holdout split: train={idx_tr.size}, valid={idx_va.size}, overlap_users={len(overlap)}', flush=True)

# Load candidate models' OOF and test predictions
cands = {
    'lr_meta_g': ('oof_lr_meta_g.npy','test_lr_meta_g.npy'),
    'xgb_svd_meta': ('oof_xgb_svd_meta.npy','test_xgb_svd_meta.npy'),
    'st_embed_bag_v2': ('oof_st_embed_bag_v2.npy','test_st_embed_bag_v2.npy') if os.path.exists('oof_st_embed_bag_v2.npy') else ('oof_st_embed_bag.npy','test_st_embed_bag.npy'),
    'mpnet_embed_bag_v2': ('oof_mpnet_embed_bag_v2.npy','test_mpnet_embed_bag_v2.npy') if os.path.exists('oof_mpnet_embed_bag_v2.npy') else ('oof_mpnet_embed_bag.npy','test_mpnet_embed_bag.npy'),
    'e5_embed_g': ('oof_e5_embed_g.npy','test_e5_embed_g.npy'),
    # Optional leg; will keep only if it earns weight
    'lr_subs_g': ('oof_lr_subs_g.npy','test_lr_subs_g.npy'),
    # Explicitly prune weak CatBoost text-only per advice
    # 'cb_text_g': ('oof_cat_text_g.npy','test_cat_text_g.npy'),
}

names, OOFs, TESTs = [], [], []
for name, (poof, ptest) in list(cands.items()):
    if os.path.exists(poof) and os.path.exists(ptest):
        OOFs.append(np.load(poof))
        TESTs.append(np.load(ptest))
        names.append(name)
    else:
        print(f'Missing predictions for {name}; skipping')

assert len(OOFs) >= 3, f'Need >=3 legs, found {len(OOFs)}'
P = np.vstack(OOFs).T  # (n_train, n_models)
T = np.vstack(TESTs).T # (n_test, n_models)

def rank01(a):
    order = a.argsort(kind='mergesort')
    r = np.empty_like(order, dtype=np.float64)
    r[order] = np.arange(len(a), dtype=np.float64)
    return r / max(1, len(a)-1)

def fit_rank_nnls_with_shrink(P_mat, y_vec, idx_valid, shrink=0.20, prune_thr=0.05):
    P_hold = P_mat[idx_valid]
    # Rank-transform columns on the holdout subset
    P_rank = np.apply_along_axis(rank01, 0, P_hold)
    w, _ = nnls(P_rank, y_vec[idx_valid].astype(float))
    if w.sum() > 0: w = w / w.sum()
    # Shrink toward uniform: w := (1-shrink)*w + shrink*(1/M)
    M = w.size
    w = (1.0 - shrink)*w + shrink*(1.0/M)
    # Prune tiny weights and renormalize
    mask_keep = w >= prune_thr
    if not mask_keep.all():
        w = w * mask_keep.astype(float)
    if w.sum() > 0: w = w / w.sum()
    auc = roc_auc_score(y_vec[idx_valid], P_rank @ w)
    return w, auc, mask_keep

# Try with and without lr_subs_g; choose higher holdout AUC
try:
    subs_idx = names.index('lr_subs_g')
    keep_mask_with = np.ones(len(names), dtype=bool)
    keep_mask_wo = np.ones(len(names), dtype=bool); keep_mask_wo[subs_idx] = False
    options = [('with_subs', keep_mask_with), ('no_subs', keep_mask_wo)]
except ValueError:
    options = [('no_subs', np.ones(len(names), dtype=bool))]

best = None
for tag, kmask in options:
    P_sel = P[:, kmask]
    w, auc, kept = fit_rank_nnls_with_shrink(P_sel, y, idx_va, shrink=0.20, prune_thr=0.05)
    kept_names = np.array(names)[kmask][kept].tolist()
    print(f'Holdout AUC [{tag}] on ranks: {auc:.5f} | legs={np.array(names)[kmask].tolist()}')
    print(f'  -> After shrink/prune (thr=0.05): kept={kept_names} | weights={w}', flush=True)
    if (best is None) or (auc > best[0]):
        # Lift weights back to option space and only place kept weights to avoid shape mismatch
        w_full = np.zeros(kmask.sum(), dtype=float)
        w_kept = w[kept]  # only the kept weights
        w_full[kept] = w_kept
        best = (auc, w_full, kmask, kept)

auc_hold, w_best_optspace, keep_mask_opt, kept_opt = best
# Project selected weights to global model list
w_global = np.zeros(len(names), dtype=float)
w_global[keep_mask_opt] = w_best_optspace
sel_names = np.array(names)[keep_mask_opt][kept_opt].tolist()
w_final = w_global[w_global > 0]
print('Chosen legs:', sel_names)
print('Chosen weights (post-shrink+prune):', w_final)

# Apply weights to test (rank-transform per model on test independently) to build submission
T_sel = T[:, w_global > 0]
T_rank = np.apply_along_axis(rank01, 0, T_sel)
test_blend = (T_rank @ (w_final / w_final.sum())).astype(np.float32) if w_final.sum() > 0 else T_rank.mean(axis=1).astype(np.float32)

# Recompute the blended AUC on the holdout rows for reference. The weights were fit on these same rows, so this is an in-sample score for the blender.
P_sel = P[:, w_global > 0]
P_hold_rank = np.apply_along_axis(rank01, 0, P_sel[idx_va])
oof_auc_ref = roc_auc_score(y[idx_va], P_hold_rank @ (w_final / w_final.sum())) if w_final.sum() > 0 else roc_auc_score(y[idx_va], P_hold_rank.mean(axis=1))
print(f'Final holdout AUC (ref): {oof_auc_ref:.5f}')

# Save the time-holdout-weighted blend under its own name and also as the primary submission.csv
sub = pd.DataFrame({'request_id': test['request_id'], 'requester_received_pizza': test_blend})
sub.to_csv('submission_time_holdout.csv', index=False)
sub.to_csv('submission.csv', index=False)
print('Saved submission_time_holdout.csv and updated submission.csv. Rows:', len(sub))

Holdout split: train=2289, valid=576, overlap_users=0
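
The split reported above combines a time cutoff, a purge gap, and a user-level purge. A small sketch of the same logic on toy data follows. The function name `purged_time_holdout` and the ten-row example are assumptions for illustration, while the quantile, gap, and overlap handling mirror the cell.

```python
import numpy as np

def purged_time_holdout(ts, users, q=0.80, gap_days=5):
    """Split by time at quantile q, leave a gap before the cutoff, drop shared users."""
    ts = np.asarray(ts, dtype=float)
    users = np.asarray(users)
    cutoff = np.nanquantile(ts, q)
    train_mask = ts < (cutoff - gap_days * 86400.0)
    val_mask = ts >= cutoff
    # Users seen on both sides are removed from both sides.
    overlap = set(users[train_mask]) & set(users[val_mask])
    if overlap:
        drop = np.isin(users, list(overlap))
        train_mask &= ~drop
        val_mask &= ~drop
    return np.where(train_mask)[0], np.where(val_mask)[0], overlap

day = 86400.0
ts_toy = np.arange(10) * 3 * day  # one request every 3 days
users_toy = np.array(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'a', 'i'])
idx_tr, idx_va, overlap = purged_time_holdout(ts_toy, users_toy)
print(idx_tr, idx_va, overlap)
```

In the toy data, rows 6 and 7 fall inside the 5-day gap and belong to neither side, and user `a` appears both early and late, so both of that user's rows are dropped.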


Holdout AUC [with_subs] on ranks: 0.70366 | legs=['lr_meta_g', 'xgb_svd_meta', 'st_embed_bag_v2', 'mpnet_embed_bag_v2', 'e5_embed_g', 'lr_subs_g']
  -> After shrink/prune (thr=0.05): kept=['lr_meta_g', 'st_embed_bag_v2', 'mpnet_embed_bag_v2', 'e5_embed_g', 'lr_subs_g'] | weights=[0.27434644 0.         0.0840204  0.22384182 0.27597431 0.14181703]


Holdout AUC [no_subs] on ranks: 0.70299 | legs=['lr_meta_g', 'xgb_svd_meta', 'st_embed_bag_v2', 'mpnet_embed_bag_v2', 'e5_embed_g']
  -> After shrink/prune (thr=0.05): kept=['lr_meta_g', 'st_embed_bag_v2', 'mpnet_embed_bag_v2', 'e5_embed_g'] | weights=[0.38039541 0.         0.09559816 0.24101888 0.28298755]


Chosen legs: ['lr_meta_g', 'st_embed_bag_v2', 'mpnet_embed_bag_v2', 'e5_embed_g', 'lr_subs_g']
Chosen weights (post-shrink+prune): [0.27434644 0.0840204  0.22384182 0.27597431 0.14181703]
Final holdout AUC (ref): 0.70366
Saved submission_time_holdout.csv and updated submission.csv. Rows: 1162
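
The shrink-then-prune arithmetic inside `fit_rank_nnls_with_shrink` is easier to follow on a hand-sized example. The sketch below isolates that step, with the same defaults as the cell (`shrink=0.20`, `prune_thr=0.05`). The function name `shrink_and_prune` and the input weights are made up for illustration.

```python
import numpy as np

def shrink_and_prune(w, shrink=0.20, prune_thr=0.05):
    """Shrink weights toward uniform, zero out those below prune_thr, renormalize."""
    w = np.asarray(w, dtype=float)
    if w.sum() > 0:
        w = w / w.sum()
    # Every leg receives a floor of shrink / M before pruning.
    w = (1.0 - shrink) * w + shrink / w.size
    keep = w >= prune_thr
    w = w * keep
    if w.sum() > 0:
        w = w / w.sum()
    return w, keep

w_raw = np.array([0.6, 0.4, 0.0, 0.0, 0.0])  # hypothetical raw NNLS weights, M = 5
w_out, keep = shrink_and_prune(w_raw)
print(w_out, keep)
```

With M = 5 the uniform floor is 0.20 / 5 = 0.04, which is below the 0.05 threshold, so a leg with zero NNLS weight is pruned after shrinking. With M = 6 the floor is about 0.033. This appears consistent with `xgb_svd_meta` ending at weight 0 in both options in the output above.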


In [42]:
# Seed-bag E5-base-v2 embeddings + enriched meta -> XGB (grouped CV); overwrite e5 OOF/test and re-use time-holdout blender
import sys, subprocess, time, os, re, numpy as np, pandas as pd
from sklearn.model_selection import StratifiedGroupKFold
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score

try:
    from sentence_transformers import SentenceTransformer
except Exception:
    subprocess.run([sys.executable, '-m', 'pip', 'install', '--quiet', 'sentence-transformers==2.7.0'], check=True)
    from sentence_transformers import SentenceTransformer
try:
    import xgboost as xgb
except Exception:
    subprocess.run([sys.executable, '-m', 'pip', 'install', '--quiet', 'xgboost==2.1.1'], check=True)
    import xgboost as xgb

def load_df(path):
    try:
        return pd.read_json(path)
    except ValueError:
        import json as _json
        with open(path, 'r') as f:
            return pd.DataFrame(_json.load(f))

train = load_df('train.json')
test = load_df('test.json')
target_col = 'requester_received_pizza'
title_col = 'request_title' if 'request_title' in train.columns else 'title'
body_col = 'request_text_edit_aware' if 'request_text_edit_aware' in train.columns else 'request_text'

def build_text_upweighted(df: pd.DataFrame, title_col: str, body_col: str) -> pd.Series:
    t = df[title_col].fillna('').astype(str)
    b = df[body_col].fillna('').astype(str)
    return (t + ' [SEP] ' + t + ' [SEP] ' + b)

text_tr = build_text_upweighted(train, title_col, body_col).tolist()
text_te = build_text_upweighted(test, title_col, body_col).tolist()
y = train[target_col].astype(int).values
groups = train['requester_username'].fillna('').astype(str).values if 'requester_username' in train.columns else np.arange(len(train))

# Enriched meta builder (same as in cells 21/23)
def count_urls(s: str) -> int: return len(re.findall(r'https?://\S+', s or ''))
def has_imgur(s: str) -> int: return 1 if re.search(r'imgur\.com', s or '', flags=re.IGNORECASE) else 0
def count_digits(s: str) -> int: return sum(ch.isdigit() for ch in (s or ''))
def dollar_flag(s: str) -> int: return 1 if ('$' in (s or '')) or re.search(r'\b\d+\s*(dollars|bucks)\b', s or '', flags=re.IGNORECASE) else 0
def caps_ratio(s: str) -> float:
    if not s: return 0.0
    letters = [c for c in s if c.isalpha()]
    return (sum(c.isupper() for c in letters) / max(1, len(letters))) if letters else 0.0
def word_count(s: str) -> int: return len((s or '').split())
def exclam_count(s: str) -> int: return (s or '').count('!')
def question_count(s: str) -> int: return (s or '').count('?')
def parse_subreddit_count(x) -> int: return len(x) if isinstance(x, list) else 0
def safe_log1p_signed(x): return np.sign(x) * np.log1p(np.abs(x))
LEX_PATTERNS = {
    'lex_please': r'\bplease\b',
    'lex_thanks': r'\b(thank you|thanks in advance|thanks|thank|tia)\b',
    'lex_appreciate': r'\b(appreciate|appreciated)\b',
    'lex_pay_it_forward': r'\b(pay it forward|return the favor)\b',
    'lex_repay': r'\b(repay|pay you back|pay back)\b',
    'lex_willing': r"\b(willing to|i\'ll|i will|i can)\b",
    'lex_karma': r'\bkarma\b',
    'lex_evidence': r'\b(proof|receipt|photo|picture|pic|verify|verification|evidence)\b',
    'lex_imgur_word': r'\bimgur\b',
    'lex_student': r'\b(student|college|university|finals|exam|midterm)\b',
    'lex_jobloss': r'\b(unemployed|laid off|lost my job|between jobs|job hunt)\b',
    'lex_broke': r'\b(broke)\b',
    'lex_rent_bills': r'\b(rent|bill|bills|utilities|electric|gas|water|paycheck)\b',
    'lex_family': r'\b(family|kids?|children|baby|pregnant|son|daughter|wife|husband)\b',
    'lex_hungry': r'\b(hungry|starving|no food)\b|\bfood (stamps|pantry)\b',
    'lex_struggling': r'\b(desperate|struggling)\b',
    'lex_urgency': r'\b(tonight|today|tomorrow|asap|urgent)\b',
    'lex_help': r'\bhelp\b',
}
def add_lexicons(df: pd.DataFrame, title_col: str, body_col: str) -> pd.DataFrame:
    txt = (df[title_col].fillna('').astype(str) + ' ' + df[body_col].fillna('').astype(str)).str.lower()
    out = pd.DataFrame(index=df.index)
    for name, pat in LEX_PATTERNS.items():
        out[name] = txt.str.count(pat, flags=re.IGNORECASE).astype(float)
    return out
def build_meta_enriched(df: pd.DataFrame) -> pd.DataFrame:
    title = df[title_col].fillna('').astype(str)
    body = df[body_col].fillna('').astype(str)
    ts = pd.to_numeric(df['unix_timestamp_of_request'], errors='coerce')
    dt = pd.to_datetime(ts, unit='s', utc=True)
    out = pd.DataFrame(index=df.index)
    keep_cols = [
        'requester_account_age_in_days_at_request',
        'requester_days_since_first_post_on_raop_at_request',
        'requester_number_of_comments_at_request',
        'requester_number_of_comments_in_raop_at_request',
        'requester_number_of_posts_at_request',
        'requester_number_of_posts_on_raop_at_request',
        'requester_number_of_subreddits_at_request',
        'requester_upvotes_minus_downvotes_at_request',
        'requester_upvotes_plus_downvotes_at_request',
    ]
    for c in keep_cols:
        out[c] = pd.to_numeric(df[c], errors='coerce') if c in df.columns else 0.0
    out['requester_subreddits_count'] = df['requester_subreddits_at_request'].apply(parse_subreddit_count).astype(float) if 'requester_subreddits_at_request' in df.columns else 0.0
    out['hour'] = dt.dt.hour.astype(float)
    out['weekday'] = dt.dt.weekday.astype(float)
    out['is_weekend'] = dt.dt.weekday.isin([5,6]).astype(float)
    out['month'] = dt.dt.month.astype(float)
    out['quarter'] = dt.dt.quarter.astype(float)
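    # Note: base_ts and the rank below are computed within the frame passed in, so
    # train and test values of days_since_start / relative_position are not on a shared scale.
    base_ts = np.nanmin(ts.values)
    out['days_since_start'] = ((ts - base_ts) / 86400.0).astype(float)
    order = np.argsort(ts.values)
    rel = np.empty_like(order, dtype=np.float64)
    rel[order] = np.arange(len(order), dtype=np.float64)
    out['relative_position'] = (rel / max(1, len(order)-1)).astype(float)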
    out['title_len_chars'] = title.str.len().astype(float)
    out['title_len_words'] = title.apply(word_count).astype(float)
    out['body_len_chars'] = body.str.len().astype(float)
    out['body_len_words'] = body.apply(word_count).astype(float)
    out['url_count'] = body.apply(count_urls).astype(float)
    out['has_imgur'] = body.apply(has_imgur).astype(float)
    out['digits_count'] = body.apply(count_digits).astype(float)
    out['dollar_flag'] = body.apply(dollar_flag).astype(float)
    out['caps_ratio'] = body.apply(caps_ratio).astype(float)
    out['exclam_count'] = body.apply(exclam_count).astype(float)
    out['question_count'] = body.apply(question_count).astype(float)
    lex = add_lexicons(df, title_col, body_col)
    out = pd.concat([out, lex], axis=1).replace([np.inf, -np.inf], np.nan).fillna(0.0)
    nonneg_cols = [
        'requester_account_age_in_days_at_request','requester_days_since_first_post_on_raop_at_request',
        'requester_number_of_comments_at_request','requester_number_of_comments_in_raop_at_request',
        'requester_number_of_posts_at_request','requester_number_of_posts_on_raop_at_request',
        'requester_number_of_subreddits_at_request','requester_upvotes_plus_downvotes_at_request',
        'requester_subreddits_count','hour','weekday','is_weekend','month','quarter','days_since_start','relative_position',
        'title_len_chars','title_len_words','body_len_chars','body_len_words',
        'url_count','has_imgur','digits_count','dollar_flag','exclam_count','question_count',
        *list(LEX_PATTERNS.keys()),
    ]
    for c in nonneg_cols:
        if c in out.columns:
            out[c] = np.log1p(np.clip(out[c], a_min=0.0, a_max=None))
    if 'requester_upvotes_minus_downvotes_at_request' in out.columns:
        out['requester_upvotes_minus_downvotes_at_request'] = safe_log1p_signed(out['requester_upvotes_minus_downvotes_at_request'].values)
    return out.replace([np.inf,-np.inf],0.0).fillna(0.0)

meta_train = build_meta_enriched(train)
meta_test = build_meta_enriched(test)
print('Meta(enriched for E5 bag) shapes:', meta_train.shape, meta_test.shape, flush=True)

# Encode E5 embeddings (normalized, CPU), with 'query:' prefix
print('Encoding E5-base-v2 (for bagging)...', flush=True)
e5 = SentenceTransformer('intfloat/e5-base-v2', device='cpu')
emb_tr = e5.encode([f'query: {s}' for s in text_tr], batch_size=128, show_progress_bar=False, normalize_embeddings=True).astype(np.float32)
emb_te = e5.encode([f'query: {s}' for s in text_te], batch_size=128, show_progress_bar=False, normalize_embeddings=True).astype(np.float32)
print('E5 emb shapes:', emb_tr.shape, emb_te.shape, flush=True)

sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
folds = list(sgkf.split(np.zeros(len(y)), y, groups))

def bag_xgb_on_emb(emb_tr, emb_te, seeds=(7,13,29), jitter=True):
    oof_bag = np.zeros(len(train), dtype=np.float32)
    test_bag = []
    for seed in seeds:
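        rs = np.random.RandomState(seed)
        # Per-seed hyperparameter jitter: depth in {4, 5}, subsample in [0.75, 0.85),
        # colsample_bytree in [0.75, 0.90), learning rate in [0.045, 0.055).
        md = (4 + int(rs.rand() < 0.5)) if jitter else 4
        subs = (0.75 + rs.rand()*0.10) if jitter else 0.80
        cols = (0.75 + rs.rand()*0.15) if jitter else 0.80
        lr = (0.045 + rs.rand()*0.010) if jitter else 0.05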
        params = dict(
            objective='binary:logistic', eval_metric='auc', tree_method='hist',
            max_depth=int(md), learning_rate=float(lr), subsample=float(subs), colsample_bytree=float(cols),
            reg_lambda=1.0, min_child_weight=1.0, max_bin=256, random_state=int(seed)
        )
        oof_seed = np.zeros(len(train), dtype=np.float32)
        test_preds = []
        t_se = time.time()
        for i, (trn_idx, val_idx) in enumerate(folds, 1):
            t0 = time.time()
            scaler = StandardScaler(with_mean=True, with_std=True)
            M_tr = scaler.fit_transform(meta_train.iloc[trn_idx].values).astype(np.float32)
            M_va = scaler.transform(meta_train.iloc[val_idx].values).astype(np.float32)
            X_tr = np.hstack([emb_tr[trn_idx], M_tr]).astype(np.float32)
            X_va = np.hstack([emb_tr[val_idx], M_va]).astype(np.float32)
            dtr = xgb.DMatrix(X_tr, label=y[trn_idx])
            dva = xgb.DMatrix(X_va, label=y[val_idx])
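            # Early stopping monitors the same fold that is scored for OOF, so the
            # per-fold and OOF AUCs printed below are optimistically biased.
            booster = xgb.train(params, dtr, num_boost_round=2000, evals=[(dva,'valid')], verbose_eval=False, early_stopping_rounds=100)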
            oof_seed[val_idx] = booster.predict(dva, iteration_range=(0, booster.best_iteration+1))
            M_te = scaler.transform(meta_test.values).astype(np.float32)
            Xt = np.hstack([emb_te, M_te]).astype(np.float32)
            dte = xgb.DMatrix(Xt)
            test_preds.append(booster.predict(dte, iteration_range=(0, booster.best_iteration+1)).astype(np.float32))
            print(f'  [E5 seed {seed}] Fold {i} AUC={roc_auc_score(y[val_idx], oof_seed[val_idx]):.5f} | iters={booster.best_iteration+1} | {time.time()-t0:.1f}s', flush=True)
        print(f'[E5] Seed {seed} OOF AUC={roc_auc_score(y, oof_seed):.5f} | elapsed {time.time()-t_se:.1f}s', flush=True)
        oof_bag += oof_seed / len(seeds)
        test_bag.append(np.mean(np.vstack(test_preds), axis=0).astype(np.float32))
    test_mean = np.mean(np.vstack(test_bag), axis=0).astype(np.float32)
    auc_bag = roc_auc_score(y, oof_bag)
    print(f'[E5] Bagged OOF AUC: {auc_bag:.5f}', flush=True)
    return oof_bag, test_mean

oof_e5_bag, test_e5_bag = bag_xgb_on_emb(emb_tr, emb_te, seeds=(7,13,29), jitter=True)

# Overwrite the single-seed files used by blender to pick up bagged predictions
np.save('oof_e5_embed_g.npy', oof_e5_bag)
np.save('test_e5_embed_g.npy', test_e5_bag)
print('Saved bagged E5 OOF/test to oof_e5_embed_g.npy / test_e5_embed_g.npy')

Meta(enriched for E5 bag) shapes: (2878, 46) (1162, 46)
Encoding E5-base-v2 (for bagging)...
E5 emb shapes: (2878, 768) (1162, 768)
  [E5 seed 7] Fold 1 AUC=0.68401 | iters=73 | 2.8s
  [E5 seed 7] Fold 2 AUC=0.66958 | iters=84 | 2.5s
  [E5 seed 7] Fold 3 AUC=0.66898 | iters=201 | 4.3s
  [E5 seed 7] Fold 4 AUC=0.66938 | iters=75 | 2.5s
  [E5 seed 7] Fold 5 AUC=0.70980 | iters=74 | 2.4s
[E5] Seed 7 OOF AUC=0.67663 | elapsed 14.6s
  [E5 seed 13] Fold 1 AUC=0.68546 | iters=138 | 2.1s
  [E5 seed 13] Fold 2 AUC=0.67963 | iters=44 | 1.3s
  [E5 seed 13] Fold 3 AUC=0.67323 | iters=77 | 1.6s
  [E5 seed 13] Fold 4 AUC=0.67222 | iters=35 | 1.2s
  [E5 seed 13] Fold 5 AUC=0.71243 | iters=72 | 1.6s
[E5] Seed 13 OOF AUC=0.67970 | elapsed 7.8s
  [E5 seed 29] Fold 1 AUC=0.69703 | iters=94 | 1.7s
  [E5 seed 29] Fold 2 AUC=0.69184 | iters=76 | 1.6s
  [E5 seed 29] Fold 3 AUC=0.67099 | iters=5 | 1.0s
  [E5 seed 29] Fold 4 AUC=0.68467 | iters=105 | 1.9s
  [E5 seed 29] Fold 5 AUC=0.71146 | iters=48 | 1.4s
[E5] Seed 29 OOF AUC=0.68382 | elapsed 7.5s
[E5] Bagged OOF AUC: 0.69692
Saved bagged E5 OOF/test to oof_e5_embed_g.npy / test_e5_embed_g.npy
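
The bagged OOF AUC (0.69692) is higher than any single seed's (0.67663 to 0.68382), which is the usual effect of averaging predictions whose errors are only partly correlated. The following is a minimal synthetic sketch using NumPy only; the signal and noise model is an assumption made for illustration and does not use the notebook's data. `pairwise_auc` is a hypothetical helper, not a function from the notebook.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """AUC as P(score_pos > score_neg) + 0.5 * P(tie), over all positive/negative pairs."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=np.float64)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return float(greater + 0.5 * ties)

rng = np.random.RandomState(0)
n = 1000
y = (rng.rand(n) < 0.25).astype(int)        # synthetic labels, roughly 25% positives
signal = y + rng.normal(0.0, 1.5, size=n)   # shared, imperfect signal
# Three "seeds": the same signal plus independent noise per seed.
seed_preds = [signal + rng.normal(0.0, 1.5, size=n) for _ in range(3)]
bagged = np.mean(seed_preds, axis=0)        # simple average, as in oof_bag

for i, p in enumerate(seed_preds, 1):
    print(f'seed {i} AUC = {pairwise_auc(y, p):.4f}')
print(f'bagged AUC = {pairwise_auc(y, bagged):.4f}')
```

Averaging only cancels the seed-specific noise; error shared by every seed (here, the noise inside `signal`) is untouched, so the gain from adding more seeds flattens out.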


In [39]:
# Forward-chaining, group-purged rank-NNLS blender (3 chains, no shrink); save alternate submission
import numpy as np, pandas as pd, os
from sklearn.metrics import roc_auc_score
from scipy.optimize import nnls

train = pd.read_json('train.json')
test = pd.read_json('test.json')
y = train['requester_received_pizza'].astype(int).values
users = train['requester_username'].fillna('').astype(str).values
ts = pd.to_numeric(train['unix_timestamp_of_request'], errors='coerce').values

# Candidate legs (CatBoost dropped); lr_subs_g is optional
cands = {
    'lr_meta_g': ('oof_lr_meta_g.npy','test_lr_meta_g.npy'),
    'xgb_svd_meta': ('oof_xgb_svd_meta.npy','test_xgb_svd_meta.npy'),
    'st_embed_bag_v2': ('oof_st_embed_bag_v2.npy','test_st_embed_bag_v2.npy') if os.path.exists('oof_st_embed_bag_v2.npy') else ('oof_st_embed_bag.npy','test_st_embed_bag.npy'),
    'mpnet_embed_bag_v2': ('oof_mpnet_embed_bag_v2.npy','test_mpnet_embed_bag_v2.npy') if os.path.exists('oof_mpnet_embed_bag_v2.npy') else ('oof_mpnet_embed_bag.npy','test_mpnet_embed_bag.npy'),
    'e5_embed_g': ('oof_e5_embed_g.npy','test_e5_embed_g.npy'),
    'lr_subs_g': ('oof_lr_subs_g.npy','test_lr_subs_g.npy'),
}

names, OOFs, TESTs = [], [], []
for name, (poof, ptest) in list(cands.items()):
    if os.path.exists(poof) and os.path.exists(ptest):
        OOFs.append(np.load(poof))
        TESTs.append(np.load(ptest))
        names.append(name)
    else:
        print(f'Missing predictions for {name}; skipping')
assert len(OOFs) >= 3, f'Need >=3 legs, found {len(OOFs)}'
P = np.vstack(OOFs).T
T = np.vstack(TESTs).T

def rank01(a):
    order = a.argsort(kind='mergesort')
    r = np.empty_like(order, dtype=np.float64)
    r[order] = np.arange(len(a), dtype=np.float64)
    return r / max(1, len(a)-1)

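def fit_rank_nnls(P_mat, y_vec, idx_valid):
    # Fit non-negative weights on rank-transformed predictions for the idx_valid rows.
    # The returned AUC is scored on those same rows, so it is an in-sample figure.
    P_hold = P_mat[idx_valid]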
    P_rank = np.apply_along_axis(rank01, 0, P_hold)
    w, _ = nnls(P_rank, y_vec[idx_valid].astype(float))
    if w.sum() > 0: w = w / w.sum()
    auc = roc_auc_score(y_vec[idx_valid], P_rank @ w) if idx_valid.size else np.nan
    return w, auc

def group_purged_mask(idx_train_cond, idx_valid_cond):
    tr_mask = idx_train_cond.copy()
    va_mask = idx_valid_cond.copy()
    users_tr = set(users[tr_mask])
    users_va = set(users[va_mask])
    overlap = users_tr.intersection(users_va)
    if overlap:
        drop = np.isin(users, list(overlap))
        tr_mask = tr_mask & (~drop)
        va_mask = va_mask & (~drop)
    return tr_mask, va_mask

# Define forward chains with a 5-day purge gap after each cut: [0-60 -> 60-80], [0-80 -> 80-90], [0-90 -> 90-100]
qs = np.quantile(ts[~np.isnan(ts)], [0.6, 0.8, 0.9])
q60, q80, q90 = qs[0], qs[1], qs[2]
gap_sec = 5*86400.0
chains = [
    (ts < q60, (ts >= q60 + gap_sec) & (ts < q80)),
    (ts < q80, (ts >= q80 + gap_sec) & (ts < q90)),
    (ts < q90, ts >= q90 + gap_sec),
]

weights = []
aucs = []
leg_masks = []

for ci, (tr_cond, va_cond) in enumerate(chains, 1):
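    # Only the validation window is used below: the legs are precomputed OOF predictions,
    # so nothing is refit on the chain's training window.
    _, va_mask = group_purged_mask(tr_cond, va_cond)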
    idx_va = np.where(va_mask)[0]
    print(f'Chain {ci}: valid size={idx_va.size}', flush=True)
    # With and without lr_subs_g
    try:
        subs_idx = names.index('lr_subs_g')
        keep_with = np.ones(len(names), dtype=bool)
        keep_wo = np.ones(len(names), dtype=bool); keep_wo[subs_idx] = False
        opts = [('with_subs', keep_with), ('no_subs', keep_wo)]
    except ValueError:
        opts = [('no_subs', np.ones(len(names), dtype=bool))]
    best = None
    for tag, kmask in opts:
        w, auc = fit_rank_nnls(P[:, kmask], y, idx_va)
        print(f'  Chain {ci} {tag}: AUC={auc:.5f} legs={np.array(names)[kmask].tolist()} w={w}', flush=True)
        if (best is None) or (auc > best[0]):
            best = (auc, w, kmask, tag)
    auc_hold, w_best, kmask_best, tag = best
    weights.append(w_best)
    aucs.append(auc_hold)
    leg_masks.append(kmask_best)

# Align masks across chains (use intersection of kept legs); average weights over chains on common legs
keep_all = np.ones(len(names), dtype=bool)
for km in leg_masks:
    keep_all = keep_all & km
if not keep_all.any():
    # Fallback: use the mask from the chain with the best AUC
    keep_all = leg_masks[int(np.nanargmax(aucs))]
sel_names = np.array(names)[keep_all].tolist()

W_mat = []
for w, km in zip(weights, leg_masks):
    # project weights to common leg set by zeroing dropped legs and renormalizing
    w_full = np.zeros(len(names), dtype=np.float64)
    w_full[km] = w
    w_common = w_full[keep_all]
    if w_common.sum() > 0: w_common = w_common / w_common.sum()
    W_mat.append(w_common)
W_mat = np.vstack(W_mat) if len(W_mat) else np.zeros((0, keep_all.sum()), dtype=np.float64)
w_avg = W_mat.mean(axis=0) if W_mat.size else np.ones(keep_all.sum())/keep_all.sum()
if w_avg.sum() > 0: w_avg = w_avg / w_avg.sum()

print('Chosen legs (forward-chain):', sel_names)
print('Per-chain AUCs:', [float(a) for a in aucs])
print('Averaged weights (no shrink):', w_avg)

# Build test submission using averaged weights on rank-transformed test for selected legs
T_sel = T[:, keep_all]
def apply_rank(mat): return np.apply_along_axis(rank01, 0, mat)
T_rank = apply_rank(T_sel)
test_blend_fc = (T_rank @ w_avg).astype(np.float32)

# Save
sub_fc = pd.DataFrame({'request_id': test['request_id'], 'requester_received_pizza': test_blend_fc})
sub_fc.to_csv('submission_time_forward_chain.csv', index=False)
sub_fc.to_csv('submission.csv', index=False)
print('Saved submission_time_forward_chain.csv and updated submission.csv. Rows:', len(sub_fc))

Chain 1: valid size=562
  Chain 1 with_subs: AUC=0.66664 legs=['lr_meta_g', 'xgb_svd_meta', 'st_embed_bag_v2', 'mpnet_embed_bag_v2', 'e5_embed_g', 'lr_subs_g'] w=[0.21159857 0.19811641 0.17801016 0.17003187 0.24224299 0.        ]
  Chain 1 no_subs: AUC=0.66664 legs=['lr_meta_g', 'xgb_svd_meta', 'st_embed_bag_v2', 'mpnet_embed_bag_v2', 'e5_embed_g'] w=[0.21159857 0.19811641 0.17801016 0.17003187 0.24224299]
Chain 2: valid size=278
  Chain 2 with_subs: AUC=0.74896 legs=['lr_meta_g', 'xgb_svd_meta', 'st_embed_bag_v2', 'mpnet_embed_bag_v2', 'e5_embed_g', 'lr_subs_g'] w=[0.2649735  0.03920233 0.08336516 0.39553677 0.         0.21692224]
  Chain 2 no_subs: AUC=0.74586 legs=['lr_meta_g', 'xgb_svd_meta', 'st_embed_bag_v2', 'mpnet_embed_bag_v2', 'e5_embed_g'] w=[0.46580581 0.03985465 0.07446941 0.41987014 0.        ]
Chain 3: valid size=268
  Chain 3 with_subs: AUC=0.63565 legs=['lr_meta_g', 'xgb_svd_meta', 'st_embed_bag_v2', 'mpnet_embed_bag_v2', 'e5_embed_g', 'lr_subs_g'] w=[0.1789592  0.19912775 0.07158773 0.         0.32647242 0.2238529 ]
  Chain 3 no_subs: AUC=0.63626 legs=['lr_meta_g', 'xgb_svd_meta', 'st_embed_bag_v2', 'mpnet_embed_bag_v2', 'e5_embed_g'] w=[0.37452554 0.19850363 0.09950774 0.         0.32746309]
Chosen legs (forward-chain): ['lr_meta_g', 'xgb_svd_meta', 'st_embed_bag_v2', 'mpnet_embed_bag_v2', 'e5_embed_g']
Per-chain AUCs: [0.6666371477657145, 0.7489612449950895, 0.6362621753246754]
Averaged weights (no shrink): [0.30816619 0.14889396 0.12799208 0.22504574 0.18990203]
Saved submission_time_forward_chain.csv and updated submission.csv. Rows: 1162
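
The forward-chaining windows used above can be sketched on synthetic timestamps. This is a minimal sketch, assuming uniformly spread request times and made-up integer user IDs; `forward_chain_masks` and `purge_shared_users` are hypothetical names that mirror the logic of the `chains` list and `group_purged_mask`, not functions from the notebook.

```python
import numpy as np

DAY = 86400.0

def forward_chain_masks(ts, cut_quantiles=(0.6, 0.8, 0.9), gap_days=5.0):
    """Return (train_mask, valid_mask) pairs; each valid window starts gap_days after its cut."""
    cuts = np.quantile(ts, cut_quantiles)
    ends = list(cuts[1:]) + [np.inf]   # the last window is open-ended
    gap = gap_days * DAY
    return [(ts < c, (ts >= c + gap) & (ts < e)) for c, e in zip(cuts, ends)]

def purge_shared_users(users, tr_mask, va_mask):
    """Remove every row of any user that appears on both sides of the split."""
    shared = np.intersect1d(users[tr_mask], users[va_mask])
    drop = np.isin(users, shared)
    return tr_mask & ~drop, va_mask & ~drop

rng = np.random.RandomState(0)
ts = np.sort(rng.uniform(0, 400 * DAY, size=2000))   # synthetic request times (seconds)
users = rng.randint(0, 1500, size=2000)              # synthetic user IDs with repeats

for i, (tr, va) in enumerate(forward_chain_masks(ts), 1):
    tr, va = purge_shared_users(users, tr, va)
    print(f'chain {i}: train={tr.sum()} valid={va.sum()}')
```

The gap keeps requests posted just after a cut out of the validation window, and the user purge removes anyone who would otherwise appear on both sides of it.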


In [36]:
# Adversarial Validation (AV): train-vs-test classifier to detect shift; save sample weights for retraining fast legs
import os, re, time, numpy as np, pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
import xgboost as xgb

def load_df(path):
    try:
        return pd.read_json(path)
    except ValueError:
        import json as _json
        with open(path, 'r') as f:
            return pd.DataFrame(_json.load(f))

train = load_df('train.json')
test = load_df('test.json')
title_col = 'request_title' if 'request_title' in train.columns else 'title'
body_col = 'request_text_edit_aware' if 'request_text_edit_aware' in train.columns else 'request_text'

def build_text_upweighted(df: pd.DataFrame) -> pd.Series:
    t = df[title_col].fillna('').astype(str)
    b = df[body_col].fillna('').astype(str)
    return (t + ' ' + t + ' [SEP] ' + b)

txt_tr = build_text_upweighted(train)
txt_te = build_text_upweighted(test)

# Enriched meta (reuse from earlier: time features + lexicons) but defined locally
def count_urls(s: str) -> int: return len(re.findall(r'https?://\S+', s or ''))
def has_imgur(s: str) -> int: return 1 if re.search(r'imgur\.com', s or '', flags=re.IGNORECASE) else 0
def count_digits(s: str) -> int: return sum(ch.isdigit() for ch in (s or ''))
def dollar_flag(s: str) -> int: return 1 if ('$' in (s or '')) or re.search(r'\b\d+\s*(dollars|bucks)\b', s or '', flags=re.IGNORECASE) else 0
def caps_ratio(s: str) -> float:
    if not s: return 0.0
    letters = [c for c in s if c.isalpha()]
    return (sum(c.isupper() for c in letters) / max(1, len(letters))) if letters else 0.0
def word_count(s: str) -> int: return len((s or '').split())
def exclam_count(s: str) -> int: return (s or '').count('!')
def question_count(s: str) -> int: return (s or '').count('?')
def parse_subreddit_count(x) -> int: return len(x) if isinstance(x, list) else 0
def safe_log1p_signed(x): return np.sign(x) * np.log1p(np.abs(x))
LEX_PATTERNS = {
    'lex_please': r'\bplease\b',
    'lex_thanks': r'\b(thank you|thanks in advance|thanks|thank|tia)\b',
    'lex_appreciate': r'\b(appreciate|appreciated)\b',
    'lex_pay_it_forward': r'\b(pay it forward|return the favor)\b',
    'lex_repay': r'\b(repay|pay you back|pay back)\b',
    'lex_willing': r"\b(willing to|i\'ll|i will|i can)\b",
    'lex_karma': r'\bkarma\b',
    'lex_evidence': r'\b(proof|receipt|photo|picture|pic|verify|verification|evidence)\b',
    'lex_imgur_word': r'\bimgur\b',
    'lex_student': r'\b(student|college|university|finals|exam|midterm)\b',
    'lex_jobloss': r'\b(unemployed|laid off|lost my job|between jobs|job hunt)\b',
    'lex_broke': r'\b(broke)\b',
    'lex_rent_bills': r'\b(rent|bill|bills|utilities|electric|gas|water|paycheck)\b',
    'lex_family': r'\b(family|kids?|children|baby|pregnant|son|daughter|wife|husband)\b',
    'lex_hungry': r'\b(hungry|starving|no food)\b|\bfood (stamps|pantry)\b',
    'lex_struggling': r'\b(desperate|struggling)\b',
    'lex_urgency': r'\b(tonight|today|tomorrow|asap|urgent)\b',
    'lex_help': r'\bhelp\b',
}
def add_lexicons(df: pd.DataFrame) -> pd.DataFrame:
    txt = (df[title_col].fillna('').astype(str) + ' ' + df[body_col].fillna('').astype(str)).str.lower()
    out = pd.DataFrame(index=df.index)
    for name, pat in LEX_PATTERNS.items():
        out[name] = txt.str.count(pat, flags=re.IGNORECASE).astype(float)
    return out
def build_meta_enriched(df: pd.DataFrame) -> pd.DataFrame:
    title = df[title_col].fillna('').astype(str)
    body = df[body_col].fillna('').astype(str)
    ts = pd.to_numeric(df['unix_timestamp_of_request'], errors='coerce')
    dt = pd.to_datetime(ts, unit='s', utc=True)
    out = pd.DataFrame(index=df.index)
    keep_cols = [
        'requester_account_age_in_days_at_request',
        'requester_days_since_first_post_on_raop_at_request',
        'requester_number_of_comments_at_request',
        'requester_number_of_comments_in_raop_at_request',
        'requester_number_of_posts_at_request',
        'requester_number_of_posts_on_raop_at_request',
        'requester_number_of_subreddits_at_request',
        'requester_upvotes_minus_downvotes_at_request',
        'requester_upvotes_plus_downvotes_at_request',
    ]
    for c in keep_cols:
        out[c] = pd.to_numeric(df[c], errors='coerce') if c in df.columns else 0.0
    out['requester_subreddits_count'] = df['requester_subreddits_at_request'].apply(parse_subreddit_count).astype(float) if 'requester_subreddits_at_request' in df.columns else 0.0
    # time features
    out['hour'] = dt.dt.hour.astype(float)
    out['weekday'] = dt.dt.weekday.astype(float)
    out['is_weekend'] = dt.dt.weekday.isin([5,6]).astype(float)
    out['month'] = dt.dt.month.astype(float)
    out['quarter'] = dt.dt.quarter.astype(float)
    base_ts = np.nanmin(ts.values)
    out['days_since_start'] = ((ts - base_ts) / 86400.0).astype(float)
    order = np.argsort(ts.values)
    rel = np.empty_like(order, dtype=np.float64)
    rel[order] = np.arange(len(order), dtype=np.float64)
    out['relative_position'] = (rel / max(1, len(order)-1)).astype(float)
    # text stats
    out['title_len_chars'] = title.str.len().astype(float)
    out['title_len_words'] = title.apply(word_count).astype(float)
    out['body_len_chars'] = body.str.len().astype(float)
    out['body_len_words'] = body.apply(word_count).astype(float)
    out['url_count'] = body.apply(count_urls).astype(float)
    out['has_imgur'] = body.apply(has_imgur).astype(float)
    out['digits_count'] = body.apply(count_digits).astype(float)
    out['dollar_flag'] = body.apply(dollar_flag).astype(float)
    out['caps_ratio'] = body.apply(caps_ratio).astype(float)
    out['exclam_count'] = body.apply(exclam_count).astype(float)
    out['question_count'] = body.apply(question_count).astype(float)
    lex = add_lexicons(df)
    out = pd.concat([out, lex], axis=1).replace([np.inf,-np.inf], np.nan).fillna(0.0)
    # transforms
    nonneg_cols = [
        'requester_account_age_in_days_at_request','requester_days_since_first_post_on_raop_at_request',
        'requester_number_of_comments_at_request','requester_number_of_comments_in_raop_at_request',
        'requester_number_of_posts_at_request','requester_number_of_posts_on_raop_at_request',
        'requester_number_of_subreddits_at_request','requester_upvotes_plus_downvotes_at_request',
        'requester_subreddits_count','hour','weekday','is_weekend','month','quarter','days_since_start','relative_position',
        'title_len_chars','title_len_words','body_len_chars','body_len_words',
        'url_count','has_imgur','digits_count','dollar_flag','exclam_count','question_count',
        *list(LEX_PATTERNS.keys()),
    ]
    for c in nonneg_cols:
        if c in out.columns:
            out[c] = np.log1p(np.clip(out[c], a_min=0.0, a_max=None))
    if 'requester_upvotes_minus_downvotes_at_request' in out.columns:
        out['requester_upvotes_minus_downvotes_at_request'] = safe_log1p_signed(out['requester_upvotes_minus_downvotes_at_request'].values)
    return out.replace([np.inf,-np.inf],0.0).fillna(0.0)

meta_tr = build_meta_enriched(train)
meta_te = build_meta_enriched(test)
print('AV meta shapes:', meta_tr.shape, meta_te.shape, flush=True)

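# TF-IDF -> SVD features (100 word + 100 char components)
# Caveat: the vectorizers and SVD are fit on train only, so train rows are in-sample and test rows
# are not; that asymmetry alone can help the AV classifier separate the two sets.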
wv = TfidfVectorizer(ngram_range=(1,2), min_df=3, max_df=0.95, max_features=150000, sublinear_tf=True, lowercase=True, dtype=np.float32)
cv = TfidfVectorizer(analyzer='char', ngram_range=(3,6), min_df=2, max_features=300000, sublinear_tf=True, lowercase=True, dtype=np.float32)
Xw_tr = wv.fit_transform(txt_tr)
Xc_tr = cv.fit_transform(txt_tr)
Xw_te = wv.transform(txt_te)
Xc_te = cv.transform(txt_te)
svd_w = TruncatedSVD(n_components=100, random_state=42)
svd_c = TruncatedSVD(n_components=100, random_state=42)
Zw_tr = svd_w.fit_transform(Xw_tr).astype(np.float32)
Zc_tr = svd_c.fit_transform(Xc_tr).astype(np.float32)
Zw_te = svd_w.transform(Xw_te).astype(np.float32)
Zc_te = svd_c.transform(Xc_te).astype(np.float32)

# Build AV datasets
X_meta_tr = meta_tr.values.astype(np.float32)
X_meta_te = meta_te.values.astype(np.float32)
sc_meta = StandardScaler(with_mean=True, with_std=True)
Xm_tr = sc_meta.fit_transform(X_meta_tr).astype(np.float32)
Xm_te = sc_meta.transform(X_meta_te).astype(np.float32)

X_tr_all = np.hstack([Zw_tr, Zc_tr, Xm_tr]).astype(np.float32)
X_te_all = np.hstack([Zw_te, Zc_te, Xm_te]).astype(np.float32)
X = np.vstack([X_tr_all, X_te_all])
y_is_test = np.array([0]*len(train) + [1]*len(test), dtype=np.int32)

# Variant 1: WITHOUT time columns (semantic/user shift)
time_cols = ['hour','weekday','is_weekend','month','quarter','days_since_start','relative_position']
meta_no_time_tr = meta_tr.drop(columns=[c for c in time_cols if c in meta_tr.columns])
meta_no_time_te = meta_te.drop(columns=[c for c in time_cols if c in meta_te.columns])
sc_meta_nt = StandardScaler(with_mean=True, with_std=True)
Xm_nt_tr = sc_meta_nt.fit_transform(meta_no_time_tr.values.astype(np.float32))
Xm_nt_te = sc_meta_nt.transform(meta_no_time_te.values.astype(np.float32))
X_nt = np.vstack([np.hstack([Zw_tr, Zc_tr, Xm_nt_tr]).astype(np.float32),
                  np.hstack([Zw_te, Zc_te, Xm_nt_te]).astype(np.float32)])

def run_av_auc(Xmat, ybin, name):
    # Simple XGB classifier with 5-fold CV on concatenated data
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    oof = np.zeros(len(ybin), dtype=np.float32)
    params = dict(objective='binary:logistic', eval_metric='auc', tree_method='hist', max_depth=5, learning_rate=0.08,
                  subsample=0.8, colsample_bytree=0.8, reg_lambda=1.0, max_bin=256, random_state=42)
    for i, (tr_idx, va_idx) in enumerate(skf.split(Xmat, ybin), 1):
        dtr = xgb.DMatrix(Xmat[tr_idx], label=ybin[tr_idx])
        dva = xgb.DMatrix(Xmat[va_idx], label=ybin[va_idx])
        booster = xgb.train(params, dtr, num_boost_round=2000, evals=[(dva,'valid')], verbose_eval=False, early_stopping_rounds=100)
        oof[va_idx] = booster.predict(dva, iteration_range=(0, booster.best_iteration+1))
    auc = roc_auc_score(ybin, oof)
    print(f'AV OOF AUC [{name}]: {auc:.5f}', flush=True)
    return auc, oof

auc_all, oof_all = run_av_auc(X, y_is_test, 'with_time')
auc_nt, oof_nt = run_av_auc(X_nt, y_is_test, 'no_time')

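# Build sample weights for TRAIN rows from the AV 'with_time' OOF probabilities: odds p/(1-p), clipped to [0.5, 2.0].
# With AV AUC near 1.0 most train rows have p close to 0, so most weights end up at the 0.5 floor.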
p_train = oof_all[:len(train)]
eps = 1e-6
w_train = np.clip(p_train / np.clip(1.0 - p_train, eps, 1.0), 0.5, 2.0).astype(np.float32)
np.save('av_weights.npy', w_train)
print('Saved av_weights.npy. Train weight stats: min={:.3f} med={:.3f} max={:.3f}'.format(float(w_train.min()), float(np.median(w_train)), float(w_train.max())))
print('AV summary -> with_time AUC={:.5f}, no_time AUC={:.5f}'.format(auc_all, auc_nt))

AV meta shapes: (2878, 46) (1162, 46)
AV OOF AUC [with_time]: 0.99989
AV OOF AUC [no_time]: 0.96817
Saved av_weights.npy. Train weight stats: min=0.500 med=0.500 max=0.752
AV summary -> with_time AUC=0.99989, no_time AUC=0.96817
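
The weight construction above turns the AV probability p of "looks like test" into the odds p/(1-p) and clips the result. A minimal sketch on made-up probabilities shows why the logged weights sit at the floor (min=0.500, med=0.500): when the classifier separates train from test almost perfectly, train rows get p near 0 and the odds are clipped to the lower bound. `av_weights` and `effective_sample_size` (Kish's formula) are hypothetical helpers, not part of the notebook.

```python
import numpy as np

def av_weights(p_test, lo=0.5, hi=2.0, eps=1e-6):
    """Odds p/(1-p) of 'looks like test', clipped to [lo, hi]."""
    p = np.asarray(p_test, dtype=np.float64)
    odds = p / np.clip(1.0 - p, eps, 1.0)
    return np.clip(odds, lo, hi)

def effective_sample_size(w):
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    w = np.asarray(w, dtype=np.float64)
    return float(w.sum() ** 2 / np.square(w).sum())

# Made-up AV probabilities for five train rows (not taken from the notebook).
p = np.array([0.001, 0.01, 0.30, 0.50, 0.90])
w = av_weights(p)
print(w)
print(effective_sample_size(w))
```

When nearly every weight equals the floor, the weights are close to constant and the effective sample size is close to the row count, so reweighting changes training very little.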


In [37]:
# AV-reweighted training for fast XGB-on-emb legs (MiniLM, MPNet, E5) + update .npy files
import sys, subprocess, time, os, re, numpy as np, pandas as pd
from sklearn.model_selection import StratifiedGroupKFold
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score

try:
    from sentence_transformers import SentenceTransformer
except Exception:
    subprocess.run([sys.executable, '-m', 'pip', 'install', '--quiet', 'sentence-transformers==2.7.0'], check=True)
    from sentence_transformers import SentenceTransformer
try:
    import xgboost as xgb
except Exception:
    subprocess.run([sys.executable, '-m', 'pip', 'install', '--quiet', 'xgboost==2.1.1'], check=True)
    import xgboost as xgb

def load_df(path):
    try:
        return pd.read_json(path)
    except ValueError:
        import json as _json
        with open(path, 'r') as f:
            return pd.DataFrame(_json.load(f))

train = load_df('train.json')
test = load_df('test.json')
y = train['requester_received_pizza'].astype(int).values
groups = train['requester_username'].fillna('').astype(str).values if 'requester_username' in train.columns else np.arange(len(train))
title_col = 'request_title' if 'request_title' in train.columns else 'title'
body_col = 'request_text_edit_aware' if 'request_text_edit_aware' in train.columns else 'request_text'

def build_text_upweighted(df: pd.DataFrame) -> pd.Series:
    t = df[title_col].fillna('').astype(str)
    b = df[body_col].fillna('').astype(str)
    return (t + ' ' + t + ' [SEP] ' + b)

text_tr = build_text_upweighted(train).tolist()
text_te = build_text_upweighted(test).tolist()

# Enriched meta (same as in cells 21/23/25)
def count_urls(s: str) -> int: return len(re.findall(r'https?://\S+', s or ''))
def has_imgur(s: str) -> int: return 1 if re.search(r'imgur\.com', s or '', flags=re.IGNORECASE) else 0
def count_digits(s: str) -> int: return sum(ch.isdigit() for ch in (s or ''))
def dollar_flag(s: str) -> int: return 1 if ('$' in (s or '')) or re.search(r'\b\d+\s*(dollars|bucks)\b', s or '', flags=re.IGNORECASE) else 0
def caps_ratio(s: str) -> float:
    if not s: return 0.0
    letters = [c for c in s if c.isalpha()]
    return (sum(c.isupper() for c in letters) / max(1, len(letters))) if letters else 0.0
def word_count(s: str) -> int: return len((s or '').split())
def exclam_count(s: str) -> int: return (s or '').count('!')
def question_count(s: str) -> int: return (s or '').count('?')
def parse_subreddit_count(x) -> int: return len(x) if isinstance(x, list) else 0
def safe_log1p_signed(x): return np.sign(x) * np.log1p(np.abs(x))
LEX_PATTERNS = {
    'lex_please': r'\bplease\b',
    'lex_thanks': r'\b(thank you|thanks in advance|thanks|thank|tia)\b',
    'lex_appreciate': r'\b(appreciate|appreciated)\b',
    'lex_pay_it_forward': r'\b(pay it forward|return the favor)\b',
    'lex_repay': r'\b(repay|pay you back|pay back)\b',
    'lex_willing': r"\b(willing to|i\'ll|i will|i can)\b",
    'lex_karma': r'\bkarma\b',
    'lex_evidence': r'\b(proof|receipt|photo|picture|pic|verify|verification|evidence)\b',
    'lex_imgur_word': r'\bimgur\b',
    'lex_student': r'\b(student|college|university|finals|exam|midterm)\b',
    'lex_jobloss': r'\b(unemployed|laid off|lost my job|between jobs|job hunt)\b',
    'lex_broke': r'\b(broke)\b',
    'lex_rent_bills': r'\b(rent|bill|bills|utilities|electric|gas|water|paycheck)\b',
    'lex_family': r'\b(family|kids?|children|baby|pregnant|son|daughter|wife|husband)\b',
    'lex_hungry': r'\b(hungry|starving|no food)\b|\bfood (stamps|pantry)\b',
    'lex_struggling': r'\b(desperate|struggling)\b',
    'lex_urgency': r'\b(tonight|today|tomorrow|asap|urgent)\b',
    'lex_help': r'\bhelp\b',
}
def add_lexicons(df: pd.DataFrame) -> pd.DataFrame:
    txt = (df[title_col].fillna('').astype(str) + ' ' + df[body_col].fillna('').astype(str)).str.lower()
    out = pd.DataFrame(index=df.index)
    for name, pat in LEX_PATTERNS.items():
        out[name] = txt.str.count(pat, flags=re.IGNORECASE).astype(float)
    return out
def build_meta_enriched(df: pd.DataFrame) -> pd.DataFrame:
    title = df[title_col].fillna('').astype(str)
    body = df[body_col].fillna('').astype(str)
    ts = pd.to_numeric(df['unix_timestamp_of_request'], errors='coerce')
    dt = pd.to_datetime(ts, unit='s', utc=True)
    out = pd.DataFrame(index=df.index)
    keep_cols = [
        'requester_account_age_in_days_at_request',
        'requester_days_since_first_post_on_raop_at_request',
        'requester_number_of_comments_at_request',
        'requester_number_of_comments_in_raop_at_request',
        'requester_number_of_posts_at_request',
        'requester_number_of_posts_on_raop_at_request',
        'requester_number_of_subreddits_at_request',
        'requester_upvotes_minus_downvotes_at_request',
        'requester_upvotes_plus_downvotes_at_request',
    ]
    for c in keep_cols:
        out[c] = pd.to_numeric(df[c], errors='coerce') if c in df.columns else 0.0
    out['requester_subreddits_count'] = df['requester_subreddits_at_request'].apply(parse_subreddit_count).astype(float) if 'requester_subreddits_at_request' in df.columns else 0.0
    out['hour'] = dt.dt.hour.astype(float)
    out['weekday'] = dt.dt.weekday.astype(float)
    out['is_weekend'] = dt.dt.weekday.isin([5,6]).astype(float)
    out['month'] = dt.dt.month.astype(float)
    out['quarter'] = dt.dt.quarter.astype(float)
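    # NOTE: base_ts and the rank-based relative_position below are computed per frame,
    # so train and test values are not on a shared scale.
    base_ts = np.nanmin(ts.values)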
    out['days_since_start'] = ((ts - base_ts) / 86400.0).astype(float)
    order = np.argsort(ts.values)
    rel = np.empty_like(order, dtype=np.float64)
    rel[order] = np.arange(len(order), dtype=np.float64)
    out['relative_position'] = (rel / max(1, len(order)-1)).astype(float)
    out['title_len_chars'] = title.str.len().astype(float)
    out['title_len_words'] = title.apply(word_count).astype(float)
    out['body_len_chars'] = body.str.len().astype(float)
    out['body_len_words'] = body.apply(word_count).astype(float)
    out['url_count'] = body.apply(count_urls).astype(float)
    out['has_imgur'] = body.apply(has_imgur).astype(float)
    out['digits_count'] = body.apply(count_digits).astype(float)
    out['dollar_flag'] = body.apply(dollar_flag).astype(float)
    out['caps_ratio'] = body.apply(caps_ratio).astype(float)
    out['exclam_count'] = body.apply(exclam_count).astype(float)
    out['question_count'] = body.apply(question_count).astype(float)
    lex = add_lexicons(df)
    out = pd.concat([out, lex], axis=1).replace([np.inf,-np.inf], np.nan).fillna(0.0)
    nonneg_cols = [
        'requester_account_age_in_days_at_request','requester_days_since_first_post_on_raop_at_request',
        'requester_number_of_comments_at_request','requester_number_of_comments_in_raop_at_request',
        'requester_number_of_posts_at_request','requester_number_of_posts_on_raop_at_request',
        'requester_number_of_subreddits_at_request','requester_upvotes_plus_downvotes_at_request',
        'requester_subreddits_count','hour','weekday','is_weekend','month','quarter','days_since_start','relative_position',
        'title_len_chars','title_len_words','body_len_chars','body_len_words',
        'url_count','has_imgur','digits_count','dollar_flag','exclam_count','question_count',
        *list(LEX_PATTERNS.keys()),
    ]
    for c in nonneg_cols:
        if c in out.columns:
            out[c] = np.log1p(np.clip(out[c], a_min=0.0, a_max=None))
    if 'requester_upvotes_minus_downvotes_at_request' in out.columns:
        out['requester_upvotes_minus_downvotes_at_request'] = safe_log1p_signed(out['requester_upvotes_minus_downvotes_at_request'].values)
    return out.replace([np.inf,-np.inf],0.0).fillna(0.0)

meta_train = build_meta_enriched(train)
meta_test = build_meta_enriched(test)
print('Meta (for AV-weighted retrain) shapes:', meta_train.shape, meta_test.shape, flush=True)

# Load AV sample weights
w_train = np.load('av_weights.npy') if os.path.exists('av_weights.npy') else np.ones(len(train), dtype=np.float32)
print('AV weights loaded. Stats: min={:.3f}, med={:.3f}, max={:.3f}'.format(float(w_train.min()), float(np.median(w_train)), float(w_train.max())), flush=True)

sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
folds = list(sgkf.split(np.zeros(len(y)), y, groups))

def train_xgb_bag_weighted(emb_tr, emb_te, seeds=(7,13,29), name='leg'):
    oof_bag = np.zeros(len(train), dtype=np.float32)
    test_bag_per_seed = []
    for seed in seeds:
        rs = np.random.RandomState(seed)
        md = 4 + (rs.rand() < 0.5)
        subs = 0.75 + rs.rand()*0.10
        cols = 0.75 + rs.rand()*0.15
        lr = 0.045 + rs.rand()*0.010
        params = dict(
            objective='binary:logistic', eval_metric='auc', tree_method='hist',
            max_depth=int(md), learning_rate=float(lr), subsample=float(subs), colsample_bytree=float(cols),
            reg_lambda=1.0, min_child_weight=1.0, max_bin=256, random_state=int(seed)
        )
        oof_seed = np.zeros(len(train), dtype=np.float32)
        test_preds = []
        for i, (trn_idx, val_idx) in enumerate(folds, 1):
            t0 = time.time()
            scaler = StandardScaler(with_mean=True, with_std=True)
            M_tr = scaler.fit_transform(meta_train.iloc[trn_idx].values).astype(np.float32)
            M_va = scaler.transform(meta_train.iloc[val_idx].values).astype(np.float32)
            X_tr = np.hstack([emb_tr[trn_idx], M_tr]).astype(np.float32)
            X_va = np.hstack([emb_tr[val_idx], M_va]).astype(np.float32)
            dtr = xgb.DMatrix(X_tr, label=y[trn_idx], weight=w_train[trn_idx])
            dva = xgb.DMatrix(X_va, label=y[val_idx])
            booster = xgb.train(params, dtr, num_boost_round=2000, evals=[(dva,'valid')], verbose_eval=False, early_stopping_rounds=100)
            oof_seed[val_idx] = booster.predict(dva, iteration_range=(0, booster.best_iteration+1))
            M_te = scaler.transform(meta_test.values).astype(np.float32)
            Xt = np.hstack([emb_te, M_te]).astype(np.float32)
            dte = xgb.DMatrix(Xt)
            test_preds.append(booster.predict(dte, iteration_range=(0, booster.best_iteration+1)).astype(np.float32))
            print(f'  [{name} seed {seed}] Fold {i} AUC={roc_auc_score(y[val_idx], oof_seed[val_idx]):.5f} | iters={booster.best_iteration+1} | {time.time()-t0:.1f}s', flush=True)
        print(f'[{name}] Seed {seed} OOF AUC={roc_auc_score(y, oof_seed):.5f}', flush=True)
        oof_bag += oof_seed / len(seeds)
        test_bag_per_seed.append(np.mean(np.vstack(test_preds), axis=0).astype(np.float32))
    test_bag = np.mean(np.vstack(test_bag_per_seed), axis=0).astype(np.float32)
    auc_bag = roc_auc_score(y, oof_bag)
    print(f'[{name}] AV-weighted bagged OOF AUC: {auc_bag:.5f}', flush=True)
    return oof_bag, test_bag

# Encode embeddings
print('Encoding MiniLM...', flush=True)
minilm = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2', device='cpu')
emb_minilm_tr = minilm.encode(text_tr, batch_size=64, show_progress_bar=False, normalize_embeddings=True).astype(np.float32)
emb_minilm_te = minilm.encode(text_te, batch_size=64, show_progress_bar=False, normalize_embeddings=True).astype(np.float32)
print('Encoding MPNet...', flush=True)
mpnet = SentenceTransformer('sentence-transformers/multi-qa-mpnet-base-dot-v1', device='cpu')
emb_mpnet_tr = mpnet.encode(text_tr, batch_size=64, show_progress_bar=False, normalize_embeddings=True).astype(np.float32)
emb_mpnet_te = mpnet.encode(text_te, batch_size=64, show_progress_bar=False, normalize_embeddings=True).astype(np.float32)
print('Encoding E5-base-v2...', flush=True)
e5 = SentenceTransformer('intfloat/e5-base-v2', device='cpu')
emb_e5_tr = e5.encode([f'query: {s}' for s in text_tr], batch_size=128, show_progress_bar=False, normalize_embeddings=True).astype(np.float32)
emb_e5_te = e5.encode([f'query: {s}' for s in text_te], batch_size=128, show_progress_bar=False, normalize_embeddings=True).astype(np.float32)

seeds = (7,13,29)
print('=== AV-weighted bagging: MiniLM ===', flush=True)
oof_st_bag_w, test_st_bag_w = train_xgb_bag_weighted(emb_minilm_tr, emb_minilm_te, seeds=seeds, name='MiniLM(av)')
np.save('oof_st_embed_bag_v2.npy', oof_st_bag_w); np.save('test_st_embed_bag_v2.npy', test_st_bag_w)

print('=== AV-weighted bagging: MPNet ===', flush=True)
oof_mp_bag_w, test_mp_bag_w = train_xgb_bag_weighted(emb_mpnet_tr, emb_mpnet_te, seeds=seeds, name='MPNet(av)')
np.save('oof_mpnet_embed_bag_v2.npy', oof_mp_bag_w); np.save('test_mpnet_embed_bag_v2.npy', test_mp_bag_w)

print('=== AV-weighted bagging: E5 ===', flush=True)
oof_e5_bag_w, test_e5_bag_w = train_xgb_bag_weighted(emb_e5_tr, emb_e5_te, seeds=seeds, name='E5(av)')
np.save('oof_e5_embed_g.npy', oof_e5_bag_w); np.save('test_e5_embed_g.npy', test_e5_bag_w)
print('Saved AV-weighted OOF/test for MiniLM, MPNet, E5.')

Meta (for AV-weighted retrain) shapes: (2878, 46) (1162, 46)
AV weights loaded. Stats: min=0.500, med=0.500, max=0.752
Encoding MiniLM...
Encoding MPNet...
You try to use a model that was created with version 3.0.0.dev0, however, your version is 2.7.0. This might cause unexpected behavior or errors. In that case, try to update to the latest version.
Encoding E5-base-v2...


=== AV-weighted bagging: MiniLM ===
  [MiniLM(av) seed 7] Fold 1 AUC=0.67837 | iters=336 | 3.0s
  [MiniLM(av) seed 7] Fold 2 AUC=0.69849 | iters=7 | 0.8s
  [MiniLM(av) seed 7] Fold 3 AUC=0.67774 | iters=75 | 1.2s
  [MiniLM(av) seed 7] Fold 4 AUC=0.66487 | iters=256 | 2.5s
  [MiniLM(av) seed 7] Fold 5 AUC=0.70801 | iters=49 | 1.1s
[MiniLM(av)] Seed 7 OOF AUC=0.66498
  [MiniLM(av) seed 13] Fold 1 AUC=0.67450 | iters=24 | 0.7s
  [MiniLM(av) seed 13] Fold 2 AUC=0.69810 | iters=61 | 0.9s
  [MiniLM(av) seed 13] Fold 3 AUC=0.66844 | iters=37 | 0.7s
  [MiniLM(av) seed 13] Fold 4 AUC=0.66286 | iters=46 | 0.8s
  [MiniLM(av) seed 13] Fold 5 AUC=0.72800 | iters=21 | 0.7s
[MiniLM(av)] Seed 13 OOF AUC=0.68104
  [MiniLM(av) seed 29] Fold 1 AUC=0.67422 | iters=140 | 1.2s
  [MiniLM(av) seed 29] Fold 2 AUC=0.70413 | iters=30 | 0.9s
  [MiniLM(av) seed 29] Fold 3 AUC=0.65992 | iters=197 | 1.6s
  [MiniLM(av) seed 29] Fold 4 AUC=0.66373 | iters=120 | 1.1s
  [MiniLM(av) seed 29] Fold 5 AUC=0.74106 | iters=16 | 0.6s
[MiniLM(av)] Seed 29 OOF AUC=0.67325
[MiniLM(av)] AV-weighted bagged OOF AUC: 0.68668


=== AV-weighted bagging: MPNet ===
  [MPNet(av) seed 7] Fold 1 AUC=0.68504 | iters=112 | 3.2s
  [MPNet(av) seed 7] Fold 2 AUC=0.68625 | iters=122 | 3.3s
  [MPNet(av) seed 7] Fold 3 AUC=0.65774 | iters=66 | 2.5s
  [MPNet(av) seed 7] Fold 4 AUC=0.67285 | iters=109 | 3.1s
  [MPNet(av) seed 7] Fold 5 AUC=0.69575 | iters=42 | 2.1s
[MPNet(av)] Seed 7 OOF AUC=0.67854
  [MPNet(av) seed 13] Fold 1 AUC=0.68295 | iters=75 | 1.8s
  [MPNet(av) seed 13] Fold 2 AUC=0.70046 | iters=33 | 1.5s
  [MPNet(av) seed 13] Fold 3 AUC=0.65052 | iters=92 | 2.2s
  [MPNet(av) seed 13] Fold 4 AUC=0.67186 | iters=92 | 2.0s
  [MPNet(av) seed 13] Fold 5 AUC=0.72161 | iters=60 | 1.7s
[MPNet(av)] Seed 13 OOF AUC=0.68283
  [MPNet(av) seed 29] Fold 1 AUC=0.68682 | iters=49 | 1.5s
  [MPNet(av) seed 29] Fold 2 AUC=0.69046 | iters=99 | 2.0s
  [MPNet(av) seed 29] Fold 3 AUC=0.64576 | iters=40 | 1.5s
  [MPNet(av) seed 29] Fold 4 AUC=0.66695 | iters=77 | 1.8s
  [MPNet(av) seed 29] Fold 5 AUC=0.72553 | iters=12 | 1.2s
[MPNet(av)] Seed 29 OOF AUC=0.67312
[MPNet(av)] AV-weighted bagged OOF AUC: 0.68955


=== AV-weighted bagging: E5 ===
  [E5(av) seed 7] Fold 1 AUC=0.66958 | iters=95 | 3.2s
  [E5(av) seed 7] Fold 2 AUC=0.65991 | iters=84 | 2.6s
  [E5(av) seed 7] Fold 3 AUC=0.65957 | iters=41 | 2.0s
  [E5(av) seed 7] Fold 4 AUC=0.65043 | iters=88 | 2.9s
  [E5(av) seed 7] Fold 5 AUC=0.70043 | iters=8 | 1.5s
[E5(av)] Seed 7 OOF AUC=0.65931
  [E5(av) seed 13] Fold 1 AUC=0.68896 | iters=74 | 1.6s
  [E5(av) seed 13] Fold 2 AUC=0.67512 | iters=120 | 2.0s
  [E5(av) seed 13] Fold 3 AUC=0.65584 | iters=13 | 1.0s
  [E5(av) seed 13] Fold 4 AUC=0.67238 | iters=34 | 1.2s
  [E5(av) seed 13] Fold 5 AUC=0.72278 | iters=24 | 1.1s
[E5(av)] Seed 13 OOF AUC=0.67559
  [E5(av) seed 29] Fold 1 AUC=0.68271 | iters=94 | 1.7s
  [E5(av) seed 29] Fold 2 AUC=0.69455 | iters=65 | 1.5s
  [E5(av) seed 29] Fold 3 AUC=0.65535 | iters=263 | 3.1s
  [E5(av) seed 29] Fold 4 AUC=0.67240 | iters=215 | 2.7s
  [E5(av) seed 29] Fold 5 AUC=0.70351 | iters=190 | 2.8s
[E5(av)] Seed 29 OOF AUC=0.67627
[E5(av)] AV-weighted bagged OOF AUC: 0.68767
Saved AV-weighted OOF/test for MiniLM, MPNet, E5.


In [40]:
# Add RAOP-specific ratios + flair + time interactions to meta; retrain/bag E5 XGB and overwrite OOF/test
import sys, subprocess, time, os, re, numpy as np, pandas as pd
from sklearn.model_selection import StratifiedGroupKFold
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score

try:
    from sentence_transformers import SentenceTransformer
except Exception:
    subprocess.run([sys.executable, '-m', 'pip', 'install', '--quiet', 'sentence-transformers==2.7.0'], check=True)
    from sentence_transformers import SentenceTransformer
try:
    import xgboost as xgb
except Exception:
    subprocess.run([sys.executable, '-m', 'pip', 'install', '--quiet', 'xgboost==2.1.1'], check=True)
    import xgboost as xgb

def load_df(path):
    try:
        return pd.read_json(path)
    except ValueError:
        import json as _json
        with open(path, 'r') as f:
            return pd.DataFrame(_json.load(f))

train = load_df('train.json')
test = load_df('test.json')
y = train['requester_received_pizza'].astype(int).values
groups = train['requester_username'].fillna('').astype(str).values if 'requester_username' in train.columns else np.arange(len(train))
title_col = 'request_title' if 'request_title' in train.columns else 'title'
body_col = 'request_text_edit_aware' if 'request_text_edit_aware' in train.columns else 'request_text'

def build_text_upweighted(df: pd.DataFrame) -> pd.Series:
    t = df[title_col].fillna('').astype(str)
    b = df[body_col].fillna('').astype(str)
    return (t + ' [SEP] ' + t + ' [SEP] ' + b)

text_tr = build_text_upweighted(train).tolist()
text_te = build_text_upweighted(test).tolist()

# Utilities
def count_urls(s: str) -> int: return len(re.findall(r'https?://\S+', s or ''))
def has_imgur(s: str) -> int: return 1 if re.search(r'imgur\.com', s or '', flags=re.IGNORECASE) else 0
def count_digits(s: str) -> int: return sum(ch.isdigit() for ch in (s or ''))
def dollar_flag(s: str) -> int: return 1 if ('$' in (s or '')) or re.search(r'\b\d+\s*(dollars|bucks)\b', s or '', flags=re.IGNORECASE) else 0
def caps_ratio(s: str) -> float:
    if not s: return 0.0
    letters = [c for c in s if c.isalpha()]
    return (sum(c.isupper() for c in letters) / max(1, len(letters))) if letters else 0.0
def word_count(s: str) -> int: return len((s or '').split())
def exclam_count(s: str) -> int: return (s or '').count('!')
def question_count(s: str) -> int: return (s or '').count('?')
def parse_subreddit_count(x) -> int: return len(x) if isinstance(x, list) else 0
def safe_log1p_signed(x): return np.sign(x) * np.log1p(np.abs(x))

LEX_PATTERNS = {
    'lex_please': r'\bplease\b',
    'lex_thanks': r'\b(thank you|thanks in advance|thanks|thank|tia)\b',
    'lex_appreciate': r'\b(appreciate|appreciated)\b',
    'lex_pay_it_forward': r'\b(pay it forward|return the favor)\b',
    'lex_repay': r'\b(repay|pay you back|pay back)\b',
    'lex_willing': r"\b(willing to|i\'ll|i will|i can)\b",
    'lex_karma': r'\bkarma\b',
    'lex_evidence': r'\b(proof|receipt|photo|picture|pic|verify|verification|evidence)\b',
    'lex_imgur_word': r'\bimgur\b',
    'lex_student': r'\b(student|college|university|finals|exam|midterm)\b',
    'lex_jobloss': r'\b(unemployed|laid off|lost my job|between jobs|job hunt)\b',
    'lex_broke': r'\b(broke)\b',
    'lex_rent_bills': r'\b(rent|bill|bills|utilities|electric|gas|water|paycheck)\b',
    'lex_family': r'\b(family|kids?|children|baby|pregnant|son|daughter|wife|husband)\b',
    'lex_hungry': r'\b(hungry|starving|no food)\b|\bfood (stamps|pantry)\b',
    'lex_struggling': r'\b(desperate|struggling)\b',
    'lex_urgency': r'\b(tonight|today|tomorrow|asap|urgent)\b',
    'lex_help': r'\bhelp\b',
}
def add_lexicons(df: pd.DataFrame) -> pd.DataFrame:
    txt = (df[title_col].fillna('').astype(str) + ' ' + df[body_col].fillna('').astype(str)).str.lower()
    out = pd.DataFrame(index=df.index)
    for name, pat in LEX_PATTERNS.items():
        out[name] = txt.str.count(pat, flags=re.IGNORECASE).astype(float)
    return out

def build_meta_raop(df: pd.DataFrame) -> pd.DataFrame:
    title = df[title_col].fillna('').astype(str)
    body = df[body_col].fillna('').astype(str)
    ts = pd.to_numeric(df['unix_timestamp_of_request'], errors='coerce')
    dt = pd.to_datetime(ts, unit='s', utc=True)
    out = pd.DataFrame(index=df.index)
    # base at_request numeric
    keep = [
        'requester_account_age_in_days_at_request',
        'requester_days_since_first_post_on_raop_at_request',
        'requester_number_of_comments_at_request',
        'requester_number_of_comments_in_raop_at_request',
        'requester_number_of_posts_at_request',
        'requester_number_of_posts_on_raop_at_request',
        'requester_number_of_subreddits_at_request',
        'requester_upvotes_minus_downvotes_at_request',
        'requester_upvotes_plus_downvotes_at_request',
    ]
    for c in keep:
        out[c] = pd.to_numeric(df[c], errors='coerce') if c in df.columns else 0.0
    out['requester_subreddits_count'] = df['requester_subreddits_at_request'].apply(parse_subreddit_count).astype(float) if 'requester_subreddits_at_request' in df.columns else 0.0
    # time features
    out['hour'] = dt.dt.hour.astype(float)
    out['weekday'] = dt.dt.weekday.astype(float)
    out['is_weekend'] = dt.dt.weekday.isin([5,6]).astype(float)
    out['month'] = dt.dt.month.astype(float)
    out['quarter'] = dt.dt.quarter.astype(float)
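    # NOTE: base_ts and the rank-based relative_position below are computed per frame,
    # so train and test values are not on a shared scale.
    base_ts = np.nanmin(ts.values)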
    out['days_since_start'] = ((ts - base_ts) / 86400.0).astype(float)
    order = np.argsort(ts.values)
    rel = np.empty_like(order, dtype=np.float64); rel[order] = np.arange(len(order), dtype=np.float64)
    out['relative_position'] = (rel / max(1, len(order)-1)).astype(float)
    # text stats
    out['title_len_chars'] = title.str.len().astype(float)
    out['title_len_words'] = title.apply(word_count).astype(float)
    out['body_len_chars'] = body.str.len().astype(float)
    out['body_len_words'] = body.apply(word_count).astype(float)
    out['url_count'] = body.apply(count_urls).astype(float)
    out['has_imgur'] = body.apply(has_imgur).astype(float)
    out['digits_count'] = body.apply(count_digits).astype(float)
    out['dollar_flag'] = body.apply(dollar_flag).astype(float)
    out['caps_ratio'] = body.apply(caps_ratio).astype(float)
    out['exclam_count'] = body.apply(exclam_count).astype(float)
    out['question_count'] = body.apply(question_count).astype(float)
    # ratios
    c_all = out['requester_number_of_comments_at_request'].values
    c_raop = out['requester_number_of_comments_in_raop_at_request'].values
    p_all = out['requester_number_of_posts_at_request'].values
    p_raop = out['requester_number_of_posts_on_raop_at_request'].values
    upm = out['requester_upvotes_minus_downvotes_at_request'].values
    upp = out['requester_upvotes_plus_downvotes_at_request'].values
    out['raop_comment_ratio'] = (c_raop / (c_all + 1.0))
    out['raop_post_ratio'] = (p_raop / (p_all + 1.0))
    out['karma_balance_ratio'] = (upm / (upp + 1.0))
    out['title_to_body_len'] = (out['title_len_words'] / (out['body_len_words'] + 1.0))
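    # user flair
    # NOTE: likely target leakage. Flair appears to be assigned after a request is fulfilled, and every
    # fold in this cell's output stops at iters=1 with AUC=1.00000, so treat the flair features as suspect.
    flair = df['requester_user_flair'].fillna('').astype(str) if 'requester_user_flair' in df.columns else pd.Series('', index=df.index)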
    out['has_flair'] = (flair.str.len() > 0).astype(float)
    out['flair_len_chars'] = flair.str.len().astype(float)
    out['flair_word_count'] = flair.apply(word_count).astype(float)
    # lexicons
    lex = add_lexicons(df)
    out = pd.concat([out, lex], axis=1)
    # interactions
    out['int_caps_relpos'] = out['caps_ratio'] * out['relative_position']
    out['int_urgency_month'] = out['lex_urgency'] * out['month']
    # transforms
    out = out.replace([np.inf,-np.inf], np.nan).fillna(0.0)
    nonneg = [
        'requester_account_age_in_days_at_request','requester_days_since_first_post_on_raop_at_request',
        'requester_number_of_comments_at_request','requester_number_of_comments_in_raop_at_request',
        'requester_number_of_posts_at_request','requester_number_of_posts_on_raop_at_request',
        'requester_number_of_subreddits_at_request','requester_upvotes_plus_downvotes_at_request',
        'requester_subreddits_count','hour','weekday','is_weekend','month','quarter','days_since_start','relative_position',
        'title_len_chars','title_len_words','body_len_chars','body_len_words','url_count','has_imgur','digits_count','dollar_flag',
        'exclam_count','question_count','raop_comment_ratio','raop_post_ratio','title_to_body_len','has_flair','flair_len_chars','flair_word_count',
        *list(LEX_PATTERNS.keys()), 'int_caps_relpos','int_urgency_month'
    ]
    for c in nonneg:
        if c in out.columns:
            out[c] = np.log1p(np.clip(out[c], a_min=0.0, a_max=None))
    if 'requester_upvotes_minus_downvotes_at_request' in out.columns:
        out['requester_upvotes_minus_downvotes_at_request'] = safe_log1p_signed(out['requester_upvotes_minus_downvotes_at_request'].values)
    if 'karma_balance_ratio' in out.columns:
        out['karma_balance_ratio'] = safe_log1p_signed(out['karma_balance_ratio'].values)
    return out.replace([np.inf,-np.inf],0.0).fillna(0.0)

meta_train = build_meta_raop(train)
meta_test = build_meta_raop(test)
print('Meta(RAOP+) shapes:', meta_train.shape, meta_test.shape, flush=True)

# Encode E5 embeddings (normalize) with query: prefix
print('Encoding E5-base-v2 for RAOP+ meta...', flush=True)
e5 = SentenceTransformer('intfloat/e5-base-v2', device='cpu')
emb_tr = e5.encode([f'query: {s}' for s in text_tr], batch_size=128, show_progress_bar=False, normalize_embeddings=True).astype(np.float32)
emb_te = e5.encode([f'query: {s}' for s in text_te], batch_size=128, show_progress_bar=False, normalize_embeddings=True).astype(np.float32)
print('E5 shapes:', emb_tr.shape, emb_te.shape, flush=True)

sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
folds = list(sgkf.split(np.zeros(len(y)), y, groups))

def bag_xgb_on_emb(emb_tr, emb_te, seeds=(7,13,29), jitter=True):
    oof_bag = np.zeros(len(train), dtype=np.float32)
    test_bag = []
    for seed in seeds:
        rs = np.random.RandomState(seed)
        md = 4 + (rs.rand() < 0.5) if jitter else 4
        subs = 0.75 + rs.rand()*0.10 if jitter else 0.80
        cols = 0.75 + rs.rand()*0.15 if jitter else 0.80
        lr = 0.045 + rs.rand()*0.010 if jitter else 0.05
        params = dict(objective='binary:logistic', eval_metric='auc', tree_method='hist',
                      max_depth=int(md), learning_rate=float(lr), subsample=float(subs), colsample_bytree=float(cols),
                      reg_lambda=1.0, min_child_weight=1.0, max_bin=256, random_state=int(seed))
        oof_seed = np.zeros(len(train), dtype=np.float32)
        test_preds = []
        for i, (trn_idx, val_idx) in enumerate(folds, 1):
            t0 = time.time()
            scaler = StandardScaler(with_mean=True, with_std=True)
            M_tr = scaler.fit_transform(meta_train.iloc[trn_idx].values).astype(np.float32)
            M_va = scaler.transform(meta_train.iloc[val_idx].values).astype(np.float32)
            X_tr = np.hstack([emb_tr[trn_idx], M_tr]).astype(np.float32)
            X_va = np.hstack([emb_tr[val_idx], M_va]).astype(np.float32)
            dtr = xgb.DMatrix(X_tr, label=y[trn_idx])
            dva = xgb.DMatrix(X_va, label=y[val_idx])
            booster = xgb.train(params, dtr, num_boost_round=2000, evals=[(dva,'valid')], verbose_eval=False, early_stopping_rounds=100)
            oof_seed[val_idx] = booster.predict(dva, iteration_range=(0, booster.best_iteration+1))
            M_te = scaler.transform(meta_test.values).astype(np.float32)
            Xt = np.hstack([emb_te, M_te]).astype(np.float32)
            dte = xgb.DMatrix(Xt)
            test_preds.append(booster.predict(dte, iteration_range=(0, booster.best_iteration+1)).astype(np.float32))
            print(f'  [E5 RAOP+ seed {seed}] Fold {i} AUC={roc_auc_score(y[val_idx], oof_seed[val_idx]):.5f} | iters={booster.best_iteration+1} | {time.time()-t0:.1f}s', flush=True)
        print(f'[E5 RAOP+] Seed {seed} OOF AUC={roc_auc_score(y, oof_seed):.5f}', flush=True)
        oof_bag += oof_seed / len(seeds)
        test_bag.append(np.mean(np.vstack(test_preds), axis=0).astype(np.float32))
    test_mean = np.mean(np.vstack(test_bag), axis=0).astype(np.float32)
    auc_bag = roc_auc_score(y, oof_bag)
    print(f'[E5 RAOP+] Bagged OOF AUC: {auc_bag:.5f}', flush=True)
    return oof_bag, test_mean

oof_e5_raop, test_e5_raop = bag_xgb_on_emb(emb_tr, emb_te, seeds=(7,13,29), jitter=True)
np.save('oof_e5_embed_g.npy', oof_e5_raop)
np.save('test_e5_embed_g.npy', test_e5_raop)
print('Saved RAOP+-meta E5 OOF/test to oof_e5_embed_g.npy / test_e5_embed_g.npy')

Meta(RAOP+) shapes: (2878, 55) (1162, 55)
Encoding E5-base-v2 for RAOP+ meta...
E5 shapes: (2878, 768) (1162, 768)
  [E5 RAOP+ seed 7] Fold 1 AUC=1.00000 | iters=1 | 0.4s
  [E5 RAOP+ seed 7] Fold 2 AUC=1.00000 | iters=1 | 0.4s
  [E5 RAOP+ seed 7] Fold 3 AUC=1.00000 | iters=1 | 0.4s
  [E5 RAOP+ seed 7] Fold 4 AUC=1.00000 | iters=1 | 0.4s
  [E5 RAOP+ seed 7] Fold 5 AUC=1.00000 | iters=1 | 0.4s
[E5 RAOP+] Seed 7 OOF AUC=1.00000
  [E5 RAOP+ seed 13] Fold 1 AUC=1.00000 | iters=1 | 0.4s
  [E5 RAOP+ seed 13] Fold 2 AUC=1.00000 | iters=1 | 0.4s
  [E5 RAOP+ seed 13] Fold 3 AUC=1.00000 | iters=1 | 0.4s
  [E5 RAOP+ seed 13] Fold 4 AUC=1.00000 | iters=1 | 0.4s
  [E5 RAOP+ seed 13] Fold 5 AUC=1.00000 | iters=1 | 0.4s
[E5 RAOP+] Seed 13 OOF AUC=1.00000
  [E5 RAOP+ seed 29] Fold 1 AUC=1.00000 | iters=1 | 0.4s
  [E5 RAOP+ seed 29] Fold 2 AUC=1.00000 | iters=1 | 0.4s
  [E5 RAOP+ seed 29] Fold 3 AUC=1.00000 | iters=1 | 0.4s
  [E5 RAOP+ seed 29] Fold 4 AUC=1.00000 | iters=1 | 0.4s
  [E5 RAOP+ seed 29] Fold 5 AUC=1.00000 | iters=1 | 0.4s
[E5 RAOP+] Seed 29 OOF AUC=1.00000
[E5 RAOP+] Bagged OOF AUC: 1.00000
Saved RAOP+-meta E5 OOF/test to oof_e5_embed_g.npy / test_e5_embed_g.npy
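
Every fold above stops at iters=1 with AUC=1.00000, which usually means a single input column encodes the target. The flair-derived columns added in this cell are the likely candidates, though that is an assumption the output alone does not prove. One quick way to check is to score each meta column on its own against the label. The sketch below does that on toy data; `single_feature_auc`, the 0.95 threshold, and the toy column names are illustrative choices, not part of this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def single_feature_auc(meta: pd.DataFrame, y, threshold=0.95):
    """Return columns whose standalone AUC (in either direction) is >= threshold."""
    rows = []
    for col in meta.columns:
        x = meta[col].to_numpy(dtype=float)
        if np.unique(x).size < 2:
            continue  # a constant column carries no ranking information
        auc = roc_auc_score(y, x)
        rows.append((col, max(auc, 1.0 - auc)))  # direction-agnostic score
    report = pd.DataFrame(rows, columns=['feature', 'auc'])
    report = report.sort_values('auc', ascending=False)
    return report[report['auc'] >= threshold].reset_index(drop=True)

# Toy data (hypothetical): one column copies the label, one is noise, one is constant
rng = np.random.RandomState(0)
y_toy = rng.randint(0, 2, size=200)
toy = pd.DataFrame({
    'noise': rng.rand(200),
    'leaky_flag': y_toy.astype(float),
    'constant': np.zeros(200),
})
suspects = single_feature_auc(toy, y_toy)
print(suspects)
```

In this notebook the equivalent call would be `single_feature_auc(meta_train, y)`; any column it flags is worth dropping and re-running before the leg's OOF predictions are used in a blend.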


In [41]:
# Time-aware holdout rank-NNLS with shrink, EXCLUDING e5 leg (to avoid leakage-suspect preds); update submission
import numpy as np, pandas as pd, os
from sklearn.metrics import roc_auc_score
from scipy.optimize import nnls

train = pd.read_json('train.json')
test = pd.read_json('test.json')
y = train['requester_received_pizza'].astype(int).values
users = train['requester_username'].fillna('').astype(str).values
ts = pd.to_numeric(train['unix_timestamp_of_request'], errors='coerce').values

# Last-20% holdout with 5-day purge and group purge
cutoff = np.nanquantile(ts, 0.80)
gap_sec = 5*86400.0
train_mask = ts < (cutoff - gap_sec)
val_mask = ts >= cutoff
overlap = set(users[train_mask]).intersection(set(users[val_mask]))
if overlap:
    drop = np.isin(users, list(overlap))
    train_mask &= ~drop
    val_mask &= ~drop
idx_va = np.where(val_mask)[0]
print(f'Holdout split: train={train_mask.sum()}, valid={val_mask.sum()}, overlap_users={len(overlap)}', flush=True)

# Candidate legs WITHOUT e5
cands = {
    'lr_meta_g': ('oof_lr_meta_g.npy','test_lr_meta_g.npy'),
    'xgb_svd_meta': ('oof_xgb_svd_meta.npy','test_xgb_svd_meta.npy'),
    'st_embed_bag_v2': ('oof_st_embed_bag_v2.npy','test_st_embed_bag_v2.npy') if os.path.exists('oof_st_embed_bag_v2.npy') else ('oof_st_embed_bag.npy','test_st_embed_bag.npy'),
    'mpnet_embed_bag_v2': ('oof_mpnet_embed_bag_v2.npy','test_mpnet_embed_bag_v2.npy') if os.path.exists('oof_mpnet_embed_bag_v2.npy') else ('oof_mpnet_embed_bag.npy','test_mpnet_embed_bag.npy'),
    'lr_subs_g': ('oof_lr_subs_g.npy','test_lr_subs_g.npy'),
}

names, OOFs, TESTs = [], [], []
for name, (poof, ptest) in cands.items():
    if os.path.exists(poof) and os.path.exists(ptest):
        OOFs.append(np.load(poof)); TESTs.append(np.load(ptest)); names.append(name)
    else:
        print('Missing', name, '-> skip')
assert len(OOFs) >= 3, f'Need >=3 legs, got {len(OOFs)}'
P = np.vstack(OOFs).T
T = np.vstack(TESTs).T

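def rank01(a):
    """Map scores to ranks scaled to [0, 1]; ties are broken by input order (stable sort), not averaged."""
    order = a.argsort(kind='mergesort')
    r = np.empty_like(order, dtype=np.float64)
    r[order] = np.arange(len(a), dtype=np.float64)
    return r / max(1, len(a)-1)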

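def fit_rank_nnls_with_shrink(P_mat, y_vec, idx_valid, shrink=0.15):
    """Fit non-negative blend weights on rank-transformed holdout predictions, then shrink toward uniform.

    The returned AUC is computed on the same rows used to fit the weights, so it is an optimistic reference.
    """
    P_hold = P_mat[idx_valid]
    P_rank = np.apply_along_axis(rank01, 0, P_hold)
    w, _ = nnls(P_rank, y_vec[idx_valid].astype(float))
    if w.sum() > 0: w = w / w.sum()
    M = w.size
    # Convex combination of the NNLS weights with uniform weights 1/M
    w = (1.0 - shrink)*w + shrink*(1.0/M)
    if w.sum() > 0: w = w / w.sum()
    auc = roc_auc_score(y_vec[idx_valid], P_rank @ w)
    return w, auc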

# Try with and without lr_subs_g; choose best
try:
    subs_idx = names.index('lr_subs_g')
    keep_with = np.ones(len(names), dtype=bool)
    keep_wo = np.ones(len(names), dtype=bool); keep_wo[subs_idx] = False
    opts = [('with_subs', keep_with), ('no_subs', keep_wo)]
except ValueError:
    opts = [('no_subs', np.ones(len(names), dtype=bool))]

best = None
for tag, kmask in opts:
    w, auc = fit_rank_nnls_with_shrink(P[:, kmask], y, idx_va, shrink=0.15)
    print(f'Holdout AUC [{tag}] (no e5): {auc:.5f} | legs={np.array(names)[kmask].tolist()} | w={w}', flush=True)
    if best is None or auc > best[0]:
        best = (auc, w, kmask, tag)

auc_hold, w_best, keep_mask, tag = best
sel_names = np.array(names)[keep_mask].tolist()
print('Chosen legs (no e5):', sel_names)
print('Chosen weights (post-shrink):', w_best)

T_sel = T[:, keep_mask]
T_rank = np.apply_along_axis(rank01, 0, T_sel)
test_blend = (T_rank @ w_best).astype(np.float32)

P_sel = P[:, keep_mask]
P_hold_rank = np.apply_along_axis(rank01, 0, P_sel[idx_va])
print('Final holdout AUC (ref, no e5):', f'{roc_auc_score(y[idx_va], P_hold_rank @ w_best):.5f}')

sub = pd.DataFrame({'request_id': test['request_id'], 'requester_received_pizza': test_blend})
sub.to_csv('submission_time_holdout_no_e5.csv', index=False)
sub.to_csv('submission.csv', index=False)
print('Saved submission_time_holdout_no_e5.csv and updated submission.csv. Rows:', len(sub))

Holdout split: train=2289, valid=576, overlap_users=0


Holdout AUC [with_subs] (no e5): 0.69893 | legs=['lr_meta_g', 'xgb_svd_meta', 'st_embed_bag_v2', 'mpnet_embed_bag_v2', 'lr_subs_g'] | w=[0.33777885 0.07415389 0.15106778 0.30518513 0.13181435]


Holdout AUC [no_subs] (no e5): 0.69868 | legs=['lr_meta_g', 'xgb_svd_meta', 'st_embed_bag_v2', 'mpnet_embed_bag_v2'] | w=[0.43705793 0.0799925  0.16221774 0.32073184]


Chosen legs (no e5): ['lr_meta_g', 'xgb_svd_meta', 'st_embed_bag_v2', 'mpnet_embed_bag_v2', 'lr_subs_g']
Chosen weights (post-shrink): [0.33777885 0.07415389 0.15106778 0.30518513 0.13181435]
Final holdout AUC (ref, no e5): 0.69893
Saved submission_time_holdout_no_e5.csv and updated submission.csv. Rows: 1162
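
The holdout AUC printed above is measured on the same rows that were used to fit the NNLS weights, so it is likely optimistic. The following is a minimal sketch, on synthetic data only, of one way to get a less biased reference: fit the shrunk rank-NNLS weights on one half of the holdout and score them on the other half. All names here (`P_syn`, `y_syn`, `rank01_avg`, `fit_weights`) are hypothetical and not part of the notebook, and `rank01_avg` swaps in tie-averaged ranks from `scipy.stats.rankdata` instead of the stable-sort ranks used in the cell.

```python
import numpy as np
from scipy.optimize import nnls
from scipy.stats import rankdata
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the holdout: 600 rows, 4 legs of decreasing quality (illustrative only)
n, m = 600, 4
y_syn = rng.integers(0, 2, size=n)
noise = np.array([1.0, 1.5, 2.0, 3.0])
P_syn = y_syn[:, None] + rng.normal(size=(n, m)) * noise

def rank01_avg(a):
    """Tie-aware ranks scaled to [0, 1] (tied scores share their average rank)."""
    r = rankdata(a, method='average') - 1.0
    return r / max(1, len(a) - 1)

def fit_weights(P_part, y_part, shrink=0.15):
    """Rank-NNLS weights shrunk toward uniform; the result sums to 1."""
    R = np.apply_along_axis(rank01_avg, 0, P_part)
    w, _ = nnls(R, y_part.astype(float))
    if w.sum() > 0:
        w = w / w.sum()
    w = (1.0 - shrink) * w + shrink / w.size
    return w / w.sum()

# Fit on one random half of the holdout, score on the other half
perm = rng.permutation(n)
fit_idx, eval_idx = perm[: n // 2], perm[n // 2:]
w_syn = fit_weights(P_syn[fit_idx], y_syn[fit_idx])

R_fit = np.apply_along_axis(rank01_avg, 0, P_syn[fit_idx])
R_eval = np.apply_along_axis(rank01_avg, 0, P_syn[eval_idx])
auc_in = roc_auc_score(y_syn[fit_idx], R_fit @ w_syn)
auc_out = roc_auc_score(y_syn[eval_idx], R_eval @ w_syn)
print(f'in-sample AUC={auc_in:.4f} | out-of-sample AUC={auc_out:.4f} | w={np.round(w_syn, 3)}')
```

On the real holdout, splitting by timestamp rather than at random would stay closer to the time-aware setup, at the cost of fewer rows per half.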


In [45]:
# Forward-chaining, group-purged rank-NNLS using BEST single chain (no shrink); save alternate submission
import numpy as np, pandas as pd, os
from sklearn.metrics import roc_auc_score
from scipy.optimize import nnls

train = pd.read_json('train.json')
test = pd.read_json('test.json')
y = train['requester_received_pizza'].astype(int).values
users = train['requester_username'].fillna('').astype(str).values
ts = pd.to_numeric(train['unix_timestamp_of_request'], errors='coerce').values

# Candidate legs (clean set, CatBoost dropped); unlike the previous cell, e5_embed_g is included here
cands = {
    'lr_meta_g': ('oof_lr_meta_g.npy','test_lr_meta_g.npy'),
    'xgb_svd_meta': ('oof_xgb_svd_meta.npy','test_xgb_svd_meta.npy'),
    'st_embed_bag_v2': ('oof_st_embed_bag_v2.npy','test_st_embed_bag_v2.npy') if os.path.exists('oof_st_embed_bag_v2.npy') else ('oof_st_embed_bag.npy','test_st_embed_bag.npy'),
    'mpnet_embed_bag_v2': ('oof_mpnet_embed_bag_v2.npy','test_mpnet_embed_bag_v2.npy') if os.path.exists('oof_mpnet_embed_bag_v2.npy') else ('oof_mpnet_embed_bag.npy','test_mpnet_embed_bag.npy'),
    'e5_embed_g': ('oof_e5_embed_g.npy','test_e5_embed_g.npy'),
    'lr_subs_g': ('oof_lr_subs_g.npy','test_lr_subs_g.npy'),
}

names, OOFs, TESTs = [], [], []
for name, (poof, ptest) in list(cands.items()):
    if os.path.exists(poof) and os.path.exists(ptest):
        OOFs.append(np.load(poof))
        TESTs.append(np.load(ptest))
        names.append(name)
    else:
        print(f'Missing predictions for {name}; skipping')
assert len(OOFs) >= 3, f'Need >=3 legs, found {len(OOFs)}'
P = np.vstack(OOFs).T
T = np.vstack(TESTs).T

def rank01(a):
    order = a.argsort(kind='mergesort')
    r = np.empty_like(order, dtype=np.float64)
    r[order] = np.arange(len(a), dtype=np.float64)
    return r / max(1, len(a)-1)

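def fit_rank_nnls(P_mat, y_vec, idx_valid):
    """Fit non-negative blend weights (normalized to sum to 1) on rank-transformed validation predictions.

    The returned AUC is computed on the same rows used to fit the weights, so it is an optimistic reference.
    """
    P_hold = P_mat[idx_valid]
    P_rank = np.apply_along_axis(rank01, 0, P_hold)
    w, _ = nnls(P_rank, y_vec[idx_valid].astype(float))
    if w.sum() > 0: w = w / w.sum()
    auc = roc_auc_score(y_vec[idx_valid], P_rank @ w) if idx_valid.size else np.nan
    return w, auc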

def group_purged_mask(idx_train_cond, idx_valid_cond):
    tr_mask = idx_train_cond.copy()
    va_mask = idx_valid_cond.copy()
    users_tr = set(users[tr_mask])
    users_va = set(users[va_mask])
    overlap = users_tr.intersection(users_va)
    if overlap:
        drop = np.isin(users, list(overlap))
        tr_mask = tr_mask & (~drop)
        va_mask = va_mask & (~drop)
    return tr_mask, va_mask

# Forward chains with a 5-day purge before each validation window:
# [0-60% -> 60-80%], [0-80% -> 80-90%], [0-90% -> 90-100%]
qs = np.quantile(ts[~np.isnan(ts)], [0.6, 0.8, 0.9])
q60, q80, q90 = qs
gap_sec = 5*86400.0
chains = [
    (ts < q60, (ts >= q60 + gap_sec) & (ts < q80)),
    (ts < q80, (ts >= q80 + gap_sec) & (ts < q90)),
    (ts < q90, ts >= q90 + gap_sec),
]

best = None
best_tag = None
best_mask = None
for ci, (tr_cond, va_cond) in enumerate(chains, 1):
    tr_mask, va_mask = group_purged_mask(tr_cond, va_cond)
    idx_va = np.where(va_mask)[0]
    print(f'Chain {ci}: valid size={idx_va.size}', flush=True)
    # Evaluate with and without lr_subs_g
    try:
        subs_idx = names.index('lr_subs_g')
        keep_with = np.ones(len(names), dtype=bool)
        keep_wo = np.ones(len(names), dtype=bool); keep_wo[subs_idx] = False
        opts = [('with_subs', keep_with), ('no_subs', keep_wo)]
    except ValueError:
        opts = [('no_subs', np.ones(len(names), dtype=bool))]
    for tag, kmask in opts:
        w, auc = fit_rank_nnls(P[:, kmask], y, idx_va)
        print(f'  Chain {ci} {tag}: AUC={auc:.5f} | legs={np.array(names)[kmask].tolist()} | w={w}', flush=True)
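        # Keep the highest-AUC (chain, leg set) pair; AUCs come from different validation windows and are in-sample for the weights
        if (best is None) or (auc > best[0]):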
            best = (auc, w, kmask, ci, idx_va)
            best_tag = tag
            best_mask = kmask

auc_best, w_best, kmask_best, ci_best, idx_va_best = best
sel_names = np.array(names)[kmask_best].tolist()
print(f'Best chain: {ci_best} | tag={best_tag} | holdout AUC={auc_best:.5f}')
print('Chosen legs (best-chain):', sel_names)
print('Weights (best-chain):', w_best)

# Build test submission from best-chain weights
T_sel = T[:, kmask_best]
T_rank = np.apply_along_axis(rank01, 0, T_sel)
test_blend_best = (T_rank @ w_best).astype(np.float32)

# Save
sub_best = pd.DataFrame({'request_id': test['request_id'], 'requester_received_pizza': test_blend_best})
sub_best.to_csv('submission_time_forward_chain_best.csv', index=False)
sub_best.to_csv('submission.csv', index=False)
print('Saved submission_time_forward_chain_best.csv and updated submission.csv. Rows:', len(sub_best))

Chain 1: valid size=562


  Chain 1 with_subs: AUC=0.67229 | legs=['lr_meta_g', 'xgb_svd_meta', 'st_embed_bag_v2', 'mpnet_embed_bag_v2', 'e5_embed_g', 'lr_subs_g'] | w=[0.18127916 0.14721856 0.16107094 0.11633358 0.39409776 0.        ]


  Chain 1 no_subs: AUC=0.67229 | legs=['lr_meta_g', 'xgb_svd_meta', 'st_embed_bag_v2', 'mpnet_embed_bag_v2', 'e5_embed_g'] | w=[0.18127916 0.14721856 0.16107094 0.11633358 0.39409776]


Chain 2: valid size=278


  Chain 2 with_subs: AUC=0.74896 | legs=['lr_meta_g', 'xgb_svd_meta', 'st_embed_bag_v2', 'mpnet_embed_bag_v2', 'e5_embed_g', 'lr_subs_g'] | w=[0.2649735  0.03920233 0.08336516 0.39553677 0.         0.21692224]


  Chain 2 no_subs: AUC=0.74586 | legs=['lr_meta_g', 'xgb_svd_meta', 'st_embed_bag_v2', 'mpnet_embed_bag_v2', 'e5_embed_g'] | w=[0.46580581 0.03985465 0.07446941 0.41987014 0.        ]


Chain 3: valid size=268


  Chain 3 with_subs: AUC=0.65016 | legs=['lr_meta_g', 'xgb_svd_meta', 'st_embed_bag_v2', 'mpnet_embed_bag_v2', 'e5_embed_g', 'lr_subs_g'] | w=[0.11536552 0.06085583 0.         0.         0.6057933  0.21798535]


  Chain 3 no_subs: AUC=0.65189 | legs=['lr_meta_g', 'xgb_svd_meta', 'st_embed_bag_v2', 'mpnet_embed_bag_v2', 'e5_embed_g'] | w=[0.31454139 0.0658354  0.         0.         0.61962321]


Best chain: 2 | tag=with_subs | holdout AUC=0.74896
Chosen legs (best-chain): ['lr_meta_g', 'xgb_svd_meta', 'st_embed_bag_v2', 'mpnet_embed_bag_v2', 'e5_embed_g', 'lr_subs_g']
Weights (best-chain): [0.2649735  0.03920233 0.08336516 0.39553677 0.         0.21692224]
Saved submission_time_forward_chain_best.csv and updated submission.csv. Rows: 1162
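
The forward-chaining split used above (expanding training window, 5-day purge before each validation window, users on both sides removed) can be sketched as a small generator. This is only an illustration under stated assumptions: the timestamps and usernames are synthetic, and `forward_chains`, `ts_syn`, and `users_syn` are hypothetical names that do not appear in the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (illustrative only): 1000 requests over ~2 years, mostly distinct users
n = 1000
ts_syn = np.sort(rng.uniform(0, 730 * 86400.0, size=n))
users_syn = rng.integers(0, 5000, size=n).astype(str)

def forward_chains(ts, users, cut_qs=(0.6, 0.8, 0.9), gap_days=5.0):
    """Yield (train_mask, valid_mask) pairs for expanding-window validation.

    Each validation window starts gap_days after its cutoff quantile, and users
    that appear on both sides of a chain are removed from both sides.
    """
    gap = gap_days * 86400.0
    cuts = np.quantile(ts[~np.isnan(ts)], cut_qs)
    ends = list(cuts[1:]) + [np.inf]  # the last window runs to the end of the data
    for start, end in zip(cuts, ends):
        tr = ts < start
        va = (ts >= start + gap) & (ts < end)
        shared = np.intersect1d(users[tr], users[va])
        if shared.size:
            drop = np.isin(users, shared)
            tr, va = tr & ~drop, va & ~drop
        yield tr, va

chains_syn = list(forward_chains(ts_syn, users_syn))
for i, (tr, va) in enumerate(chains_syn, 1):
    print(f'chain {i}: train={tr.sum()} valid={va.sum()}')
```

With masks built this way, one option (not what the cell above does) would be to fit weights per chain and average them, rather than keeping the single chain with the highest AUC.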
