# Random Acts of Pizza – Plan

Objectives:
- Win a medal (AUC-ROC target ≥ 0.692 for bronze; aim much higher with strong text models).
- Build fast, reliable CV and iterate quickly with cached text features.

Data:
- train.json, test.json; labels likely binary (requester_received_pizza / success).
- Text fields: request_text, title, possibly combined; meta-features available (karma, account age, etc.).

Validation:
- Stratified KFold (e.g., 5 folds) with deterministic seed.
- Fit transforms inside folds; cache TF-IDF matrices once and slice for folds.

Baseline v1:
- Text-only: TF-IDF (word + char n-grams) → LogisticRegression (liblinear/saga) or LinearSVC with calibrated probs.
- Fast OOF to establish reference AUC.

Feature Engineering v2:
- Combine title + request_text; clean URLs/emojis; length stats (chars, words, sentences).
- Handcrafted features from paper: gratitude, politeness cues, reciprocity, evidence (imgur links), need, urgency, narrative, user-reputation proxies (karma, account age), subreddit activity.
- Binary flags: has_image, has_url, mentions_pizza, mentions_money, day-of-week/hour.

Models:
- Linear models on sparse TF-IDF: LogisticRegression, LinearSVC (calibrated).
- Tree models on dense/meta features: XGBoost/LightGBM (CPU ok; GPU optional).
- Optionally Naive Bayes–SVM (NB-SVM) style log-count ratio features.

Ensembling:
- Blend linear text model with boosted trees on meta features.
- Optionally stack: meta-model on OOF predictions.

Efficiency:
- Cache: TF-IDF matrices (train/test), OOF preds, test preds, feature frames.
- Log per-fold progress and elapsed time.
- Subsample quick checks (1k rows, 2 folds) before full runs.

Milestones (expert review at each):
1) Baseline plan + data loading/EDA
2) CV established + TF-IDF/logreg baseline OOF/LB
3) Add meta + text-engineered features; evaluate
4) NB-SVM/Calibrated SVC; evaluate
5) Gradient boosting on meta features; evaluate
6) Blending; finalize and generate submission

Next Steps:
- Load and inspect train.json/test.json schema; identify label and text/meta fields.
- Implement deterministic CV and text preprocessing pipeline; produce baseline OOF and submission.

In [1]:
# Data loading and quick EDA
import os, json, re, time, gc, sys
import numpy as np
import pandas as pd
from collections import Counter

SEED = 42
np.random.seed(SEED)

def read_json_file(path):
    # Try reading as a JSON array; fallback to lines
    try:
        df = pd.read_json(path)
    except ValueError:
        df = pd.read_json(path, lines=True)
    return df

train_path = 'train.json'
test_path = 'test.json'

train = read_json_file(train_path)
test = read_json_file(test_path)

print('Train shape:', train.shape)
print('Test shape:', test.shape)
print('Train columns:', list(train.columns))
print('Test columns:', list(test.columns))

# Peek at sample submission to identify ID/target column names
sample = pd.read_csv('sampleSubmission.csv')
print('SampleSubmission head:')
print(sample.head())

# Identify id and target names from sample submission
id_col = sample.columns[0]
target_col = sample.columns[1]
print('ID column in sample:', id_col, '| Target column:', target_col)

# Basic label check
if target_col in train.columns:
    y = train[target_col].astype(int)
    print('Label distribution:', y.value_counts(normalize=True).to_dict())
else:
    # Common label name in RAOP is requester_received_pizza
    if 'requester_received_pizza' in train.columns:
        target_col = 'requester_received_pizza'
        y = train[target_col].astype(int)
        print('Label distribution (requester_received_pizza):', y.value_counts(normalize=True).to_dict())
    else:
        raise RuntimeError('Could not locate target column in train.json')

# Identify text fields; typical: request_title, request_text or request_text_edit_aware
text_cols_candidates = [c for c in ['request_title','request_text_edit_aware','request_text'] if c in train.columns]
print('Text columns found:', text_cols_candidates)
if not text_cols_candidates:
    raise RuntimeError('No text columns found in dataset')

def preprocess_text_df(df):
    title = df[text_cols_candidates[0]].fillna('') if 'request_title' in text_cols_candidates[0] or 'request_title' in df.columns else ''
    # Prefer edit-aware if present, else request_text
    body_col = 'request_text_edit_aware' if 'request_text_edit_aware' in df.columns else ('request_text' if 'request_text' in df.columns else None)
    body = df[body_col].fillna('') if body_col else ''
    # Simple URL replacement
    def replace_urls(s):
        return re.sub(r'http\S+|www\.[^\s]+', ' [URL] ', s, flags=re.IGNORECASE)
    joined = (title.astype(str) + ' [SEP] ' + body.astype(str)).str.lower().map(replace_urls)
    return joined

train_text = preprocess_text_df(train)
test_text = preprocess_text_df(test)
print('Example combined text (train):')
print(train_text.iloc[0][:300] if len(train_text) else 'N/A')

# Ensure ID column exists in both train/test
if id_col not in train.columns:
    # Common RAOP id is 'request_id'
    if 'request_id' in train.columns:
        id_col = 'request_id'
    else:
        # Fallback: try to infer
        raise RuntimeError(f'ID column {id_col} from sample not found; available: {list(train.columns)}')
if id_col not in test.columns:
    if 'request_id' in test.columns:
        id_col = 'request_id'
    else:
        raise RuntimeError(f'ID column {id_col} not in test; available: {list(test.columns)}')

print('Using ID column:', id_col, 'and target column:', target_col)

# Save quick schema snapshot for reference
schema_info = {
    'train_shape': train.shape,
    'test_shape': test.shape,
    'id_col': id_col,
    'target_col': target_col,
    'text_cols': text_cols_candidates,
}
print('Schema summary:', schema_info)

Train shape: (2878, 32)
Test shape: (1162, 17)
Train columns: ['giver_username_if_known', 'number_of_downvotes_of_request_at_retrieval', 'number_of_upvotes_of_request_at_retrieval', 'post_was_edited', 'request_id', 'request_number_of_comments_at_retrieval', 'request_text', 'request_text_edit_aware', 'request_title', 'requester_account_age_in_days_at_request', 'requester_account_age_in_days_at_retrieval', 'requester_days_since_first_post_on_raop_at_request', 'requester_days_since_first_post_on_raop_at_retrieval', 'requester_number_of_comments_at_request', 'requester_number_of_comments_at_retrieval', 'requester_number_of_comments_in_raop_at_request', 'requester_number_of_comments_in_raop_at_retrieval', 'requester_number_of_posts_at_request', 'requester_number_of_posts_at_retrieval', 'requester_number_of_posts_on_raop_at_request', 'requester_number_of_posts_on_raop_at_retrieval', 'requester_number_of_subreddits_at_request', 'requester_received_pizza', 'requester_subreddits_at_request', 'r

In [2]:
# Baseline: TF-IDF (word 1-2) + LogisticRegression with 5-fold Stratified CV
import time
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
import numpy as np
import pandas as pd

X_text = train_text.values.astype(str)
y = train[target_col].astype(int).values

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
oof = np.zeros(len(train))
test_pred_folds = []

start = time.time()
for fold, (tr_idx, va_idx) in enumerate(skf.split(X_text, y), 1):
    t0 = time.time()
    X_tr_text = X_text[tr_idx]
    X_va_text = X_text[va_idx]
    y_tr, y_va = y[tr_idx], y[va_idx]

    # Vectorizer fit inside fold (no leakage)
    vec = TfidfVectorizer(lowercase=True, strip_accents='unicode',
                          ngram_range=(1,2), min_df=2, max_features=300000,
                          sublinear_tf=True)
    X_tr = vec.fit_transform(X_tr_text)
    X_va = vec.transform(X_va_text)
    X_te = vec.transform(test_text.values.astype(str))

    clf = LogisticRegression(solver='saga', C=2.0, penalty='l2',
                             max_iter=2000, n_jobs=-1, class_weight='balanced',
                             random_state=SEED + fold)
    clf.fit(X_tr, y_tr)
    oof[va_idx] = clf.predict_proba(X_va)[:,1]
    fold_auc = roc_auc_score(y_va, oof[va_idx])
    test_pred_folds.append(clf.predict_proba(X_te)[:,1])
    elapsed = time.time() - t0
    print(f'Fold {fold} AUC: {fold_auc:.5f} | time: {elapsed:.2f}s', flush=True)

oof_auc = roc_auc_score(y, oof)
print(f'OOF AUC: {oof_auc:.5f} | total time: {time.time()-start:.2f}s', flush=True)

# Blend test predictions across folds
test_pred = np.mean(np.vstack(test_pred_folds), axis=0)

# Save OOF and test predictions for reuse
np.save('oof_lr_tfidf.npy', oof)
np.save('test_lr_tfidf.npy', test_pred)

# Build submission
sub = pd.DataFrame({id_col: test[id_col].values, target_col: test_pred})
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv with shape', sub.shape, 'and columns', list(sub.columns))

Fold 1 AUC: 0.62931 | time: 0.99s


Fold 2 AUC: 0.63082 | time: 0.94s


Fold 3 AUC: 0.64105 | time: 0.92s


Fold 4 AUC: 0.58332 | time: 0.94s


Fold 5 AUC: 0.64344 | time: 0.56s


OOF AUC: 0.62528 | total time: 4.36s


Saved submission.csv with shape (1162, 2) and columns ['request_id', 'requester_received_pizza']


In [3]:
# Improved baseline: word (1-2) + char_wb (3-5) TF-IDF, LogisticRegression (C=4), 5-fold CV
import time
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from scipy.sparse import hstack, csr_matrix
import numpy as np
import pandas as pd

X_text = train_text.values.astype(str)
y = train[target_col].astype(int).values

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
oof2 = np.zeros(len(train))
test_pred_folds2 = []

start = time.time()
for fold, (tr_idx, va_idx) in enumerate(skf.split(X_text, y), 1):
    t0 = time.time()
    X_tr_text = X_text[tr_idx]
    X_va_text = X_text[va_idx]
    y_tr, y_va = y[tr_idx], y[va_idx]

    # Fit vectorizers inside fold
    vec_w = TfidfVectorizer(lowercase=True, strip_accents='unicode',
                            analyzer='word', ngram_range=(1,2), min_df=2,
                            max_features=60000, sublinear_tf=True)
    vec_c = TfidfVectorizer(lowercase=True, strip_accents='unicode',
                            analyzer='char_wb', ngram_range=(3,5), min_df=2,
                            max_features=200000, sublinear_tf=True)

    X_tr_w = vec_w.fit_transform(X_tr_text)
    X_va_w = vec_w.transform(X_va_text)
    X_te_w = vec_w.transform(test_text.values.astype(str))

    X_tr_c = vec_c.fit_transform(X_tr_text)
    X_va_c = vec_c.transform(X_va_text)
    X_te_c = vec_c.transform(test_text.values.astype(str))

    X_tr = hstack([X_tr_w, X_tr_c]).tocsr()
    X_va = hstack([X_va_w, X_va_c]).tocsr()
    X_te = hstack([X_te_w, X_te_c]).tocsr()

    clf = LogisticRegression(solver='saga', C=4.0, penalty='l2',
                             max_iter=3000, n_jobs=-1, class_weight='balanced',
                             random_state=SEED + 1337 + fold)
    clf.fit(X_tr, y_tr)
    oof2[va_idx] = clf.predict_proba(X_va)[:,1]
    fold_auc = roc_auc_score(y_va, oof2[va_idx])
    test_pred_folds2.append(clf.predict_proba(X_te)[:,1])
    elapsed = time.time() - t0
    print(f'[word+char] Fold {fold} AUC: {fold_auc:.5f} | time: {elapsed:.2f}s', flush=True)

oof_auc2 = roc_auc_score(y, oof2)
print(f'[word+char] OOF AUC: {oof_auc2:.5f} | total time: {time.time()-start:.2f}s', flush=True)

test_pred2 = np.mean(np.vstack(test_pred_folds2), axis=0)
np.save('oof_lr_tfidf_wordchar.npy', oof2)
np.save('test_lr_tfidf_wordchar.npy', test_pred2)

# Save improved submission variant
sub2 = pd.DataFrame({id_col: test[id_col].values, target_col: test_pred2})
sub2.to_csv('submission.csv', index=False)
print('Saved submission.csv (word+char) with shape', sub2.shape)

[word+char] Fold 1 AUC: 0.61288 | time: 9.43s


[word+char] Fold 2 AUC: 0.63908 | time: 9.82s


[word+char] Fold 3 AUC: 0.62343 | time: 9.11s


[word+char] Fold 4 AUC: 0.58649 | time: 9.56s


[word+char] Fold 5 AUC: 0.61231 | time: 8.44s


[word+char] OOF AUC: 0.61491 | total time: 46.36s


Saved submission.csv (word+char) with shape (1162, 2)


In [4]:
# Fix preprocessing (keep URLs/domains) and tighten TF-IDF; tune LR C on word-only
import time
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
import numpy as np
import pandas as pd

# Redefine text preprocessing: keep raw text (no URL replacement); prefer edit-aware to match test
def preprocess_text_df_keep_urls(df):
    title = df['request_title'].fillna('') if 'request_title' in df.columns else ''
    body_col = 'request_text_edit_aware' if 'request_text_edit_aware' in df.columns else ('request_text' if 'request_text' in df.columns else None)
    body = df[body_col].fillna('') if body_col else ''
    joined = (title.astype(str) + ' [SEP] ' + body.astype(str)).str.lower()
    return joined

train_text = preprocess_text_df_keep_urls(train)
test_text = preprocess_text_df_keep_urls(test)
print('Rebuilt text with URLs kept. Example:', train_text.iloc[0][:200])

X_text = train_text.values.astype(str)
y = train[target_col].astype(int).values

Cs = [0.5, 1.0, 2.0]
best_auc = -1.0
best_C = None
best_oof = None
best_test_pred = None

for C in Cs:
    print(f'\n=== Training word-only TF-IDF + LR with C={C} ===')
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
    oof_c = np.zeros(len(train))
    test_pred_folds_c = []
    start = time.time()
    for fold, (tr_idx, va_idx) in enumerate(skf.split(X_text, y), 1):
        t0 = time.time()
        X_tr_text = X_text[tr_idx]
        X_va_text = X_text[va_idx]
        y_tr, y_va = y[tr_idx], y[va_idx]

        vec = TfidfVectorizer(lowercase=True, strip_accents='unicode',
                              analyzer='word', ngram_range=(1,2),
                              min_df=5, max_features=60000, sublinear_tf=True)
        X_tr = vec.fit_transform(X_tr_text)
        X_va = vec.transform(X_va_text)
        X_te = vec.transform(test_text.values.astype(str))

        clf = LogisticRegression(solver='saga', C=C, penalty='l2',
                                 max_iter=2000, n_jobs=-1, random_state=SEED+fold)
        clf.fit(X_tr, y_tr)
        oof_c[va_idx] = clf.predict_proba(X_va)[:,1]
        fold_auc = roc_auc_score(y_va, oof_c[va_idx])
        test_pred_folds_c.append(clf.predict_proba(X_te)[:,1])
        print(f'[word-only C={C}] Fold {fold} AUC: {fold_auc:.5f} | time: {time.time()-t0:.2f}s', flush=True)

    oof_auc_c = roc_auc_score(y, oof_c)
    test_pred_c = np.mean(np.vstack(test_pred_folds_c), axis=0)
    print(f'[word-only C={C}] OOF AUC: {oof_auc_c:.5f} | total time: {time.time()-start:.2f}s', flush=True)

    if oof_auc_c > best_auc:
        best_auc = oof_auc_c
        best_C = C
        best_oof = oof_c.copy()
        best_test_pred = test_pred_c.copy()

print(f'Best C: {best_C} with OOF AUC: {best_auc:.5f}')

# Save best predictions and submission
np.save('oof_lr_tfidf_word_only.npy', best_oof)
np.save('test_lr_tfidf_word_only.npy', best_test_pred)
sub_best = pd.DataFrame({id_col: test[id_col].values, target_col: best_test_pred})
sub_best.to_csv('submission.csv', index=False)
print('Saved submission.csv (word-only tuned) with shape', sub_best.shape, 'best C:', best_C)

Rebuilt text with URLs kept. Example: [request] oceanside, ca. usa-  us marine getting ready to deploy. [sep] i will soon be going on a long deployment which i'm not aloud to discuss but willing to give some info if you ask. just wanna ea

=== Training word-only TF-IDF + LR with C=0.5 ===


[word-only C=0.5] Fold 1 AUC: 0.64491 | time: 0.80s


[word-only C=0.5] Fold 2 AUC: 0.64580 | time: 0.80s


[word-only C=0.5] Fold 3 AUC: 0.64008 | time: 0.82s


[word-only C=0.5] Fold 4 AUC: 0.59855 | time: 0.80s


[word-only C=0.5] Fold 5 AUC: 0.65375 | time: 0.71s


[word-only C=0.5] OOF AUC: 0.63614 | total time: 3.94s



=== Training word-only TF-IDF + LR with C=1.0 ===


[word-only C=1.0] Fold 1 AUC: 0.63859 | time: 0.88s


[word-only C=1.0] Fold 2 AUC: 0.64542 | time: 0.88s


[word-only C=1.0] Fold 3 AUC: 0.64169 | time: 0.91s


[word-only C=1.0] Fold 4 AUC: 0.59489 | time: 0.89s


[word-only C=1.0] Fold 5 AUC: 0.64655 | time: 0.80s


[word-only C=1.0] OOF AUC: 0.63296 | total time: 4.37s



=== Training word-only TF-IDF + LR with C=2.0 ===


[word-only C=2.0] Fold 1 AUC: 0.63008 | time: 1.00s


[word-only C=2.0] Fold 2 AUC: 0.64533 | time: 1.01s


[word-only C=2.0] Fold 3 AUC: 0.64105 | time: 1.02s


[word-only C=2.0] Fold 4 AUC: 0.58709 | time: 0.99s


[word-only C=2.0] Fold 5 AUC: 0.63672 | time: 0.91s


[word-only C=2.0] OOF AUC: 0.62754 | total time: 4.92s


Best C: 0.5 with OOF AUC: 0.63614
Saved submission.csv (word-only tuned) with shape (1162, 2) best C: 0.5


In [5]:
# NB-SVM (log-count ratio) with word counts (1,2), min_df=3, max_features=50k
import time
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
import numpy as np
import pandas as pd

def compute_log_count_ratio(X, y, alpha=1.0):
    # X is csr, y is binary array (0/1)
    pos_mask = (y == 1)
    neg_mask = (y == 0)
    # Sum counts per feature for each class
    pos_counts = X[pos_mask].sum(axis=0) + alpha
    neg_counts = X[neg_mask].sum(axis=0) + alpha
    # Normalize to probabilities
    pos_total = np.asarray(pos_counts).sum()
    neg_total = np.asarray(neg_counts).sum()
    p_pos = np.asarray(pos_counts / pos_total).ravel()
    p_neg = np.asarray(neg_counts / neg_total).ravel()
    r = np.log(p_pos / p_neg)
    return r

X_text = train_text.values.astype(str)
y = train[target_col].astype(int).values

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
oof_nbsvm = np.zeros(len(train))
test_pred_folds_nbsvm = []

start = time.time()
for fold, (tr_idx, va_idx) in enumerate(skf.split(X_text, y), 1):
    t0 = time.time()
    X_tr_text = X_text[tr_idx]
    X_va_text = X_text[va_idx]
    y_tr, y_va = y[tr_idx], y[va_idx]

    vec = CountVectorizer(lowercase=True, strip_accents='unicode',
                          analyzer='word', ngram_range=(1,2),
                          min_df=3, max_features=50000)
    X_tr = vec.fit_transform(X_tr_text)
    X_va = vec.transform(X_va_text)
    X_te = vec.transform(test_text.values.astype(str))

    r = compute_log_count_ratio(X_tr, y_tr, alpha=1.0)
    # Reweight by log-count ratio
    X_tr_lr = X_tr.multiply(r)
    X_va_lr = X_va.multiply(r)
    X_te_lr = X_te.multiply(r)

    clf = LogisticRegression(solver='saga', C=2.0, penalty='l2',
                             max_iter=2000, n_jobs=-1, random_state=SEED+fold)
    clf.fit(X_tr_lr, y_tr)
    oof_nbsvm[va_idx] = clf.predict_proba(X_va_lr)[:,1]
    fold_auc = roc_auc_score(y_va, oof_nbsvm[va_idx])
    test_pred_folds_nbsvm.append(clf.predict_proba(X_te_lr)[:,1])
    print(f'[NB-SVM] Fold {fold} AUC: {fold_auc:.5f} | time: {time.time()-t0:.2f}s', flush=True)

oof_auc_nbsvm = roc_auc_score(y, oof_nbsvm)
print(f'[NB-SVM] OOF AUC: {oof_auc_nbsvm:.5f} | total time: {time.time()-start:.2f}s', flush=True)

test_pred_nbsvm = np.mean(np.vstack(test_pred_folds_nbsvm), axis=0)
np.save('oof_nbsvm.npy', oof_nbsvm)
np.save('test_nbsvm.npy', test_pred_nbsvm)

# Save submission from NB-SVM
sub_nbsvm = pd.DataFrame({id_col: test[id_col].values, target_col: test_pred_nbsvm})
sub_nbsvm.to_csv('submission.csv', index=False)
print('Saved submission.csv (NB-SVM) with shape', sub_nbsvm.shape)

[NB-SVM] Fold 1 AUC: 0.56177 | time: 1.89s


[NB-SVM] Fold 2 AUC: 0.58880 | time: 5.55s




[NB-SVM] Fold 3 AUC: 0.59479 | time: 5.55s




[NB-SVM] Fold 4 AUC: 0.52723 | time: 5.54s




[NB-SVM] Fold 5 AUC: 0.55031 | time: 1.97s


[NB-SVM] OOF AUC: 0.56467 | total time: 20.51s


Saved submission.csv (NB-SVM) with shape (1162, 2)


In [11]:
# Meta/features model: XGBoost on *_at_request + timing + rich text/domain flags; 5-fold CV and blend
import time, re, math, ast, os
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier
import xgboost as xgb

def build_meta_features(df):
    feats = pd.DataFrame(index=df.index)
    # Numeric *_at_request columns present in both train/test
    cols_req = [
        'requester_account_age_in_days_at_request',
        'requester_days_since_first_post_on_raop_at_request',
        'requester_number_of_comments_at_request',
        'requester_number_of_comments_in_raop_at_request',
        'requester_number_of_posts_at_request',
        'requester_number_of_posts_on_raop_at_request',
        'requester_number_of_subreddits_at_request',
        'requester_upvotes_minus_downvotes_at_request',
        'requester_upvotes_plus_downvotes_at_request',
        'unix_timestamp_of_request'
    ]
    for c in cols_req:
        feats[c] = pd.to_numeric(df[c], errors='coerce') if c in df.columns else 0.0
    # Time features
    ts = feats['unix_timestamp_of_request'].fillna(0).astype(float)
    feats['hour'] = ((ts // 3600) % 24).astype(int)
    feats['dow'] = ((ts // 86400) + 4).astype(int) % 7  # epoch 1970-01-01 was Thursday
    feats['is_weekend'] = feats['dow'].isin([5,6]).astype(int)
    feats['month'] = pd.to_datetime(ts, unit='s', errors='coerce').dt.month.fillna(0).astype(int)
    # Rates (per day), avoid div by zero
    acc_age = feats['requester_account_age_in_days_at_request'].fillna(0).astype(float) + 1.0
    feats['comments_per_day'] = feats['requester_number_of_comments_at_request'].fillna(0).astype(float) / acc_age
    feats['posts_per_day'] = feats['requester_number_of_posts_at_request'].fillna(0).astype(float) / acc_age
    feats['karma_plus_per_day'] = feats['requester_upvotes_plus_downvotes_at_request'].fillna(0).astype(float) / acc_age
    feats['karma_minus_per_day'] = feats['requester_upvotes_minus_downvotes_at_request'].fillna(0).astype(float) / acc_age
    # RAOP ratios
    denom_comments = feats['requester_number_of_comments_at_request'].fillna(0).astype(float) + 1.0
    denom_posts = feats['requester_number_of_posts_at_request'].fillna(0).astype(float) + 1.0
    feats['raop_comments_share'] = feats['requester_number_of_comments_in_raop_at_request'].fillna(0).astype(float) / denom_comments
    feats['raop_posts_share'] = feats['requester_number_of_posts_on_raop_at_request'].fillna(0).astype(float) / denom_posts
    feats['prior_raop_flag'] = (feats['requester_number_of_posts_on_raop_at_request'].fillna(0) > 0).astype(int)
    feats['newbie_flag'] = (feats['requester_account_age_in_days_at_request'].fillna(0) < 30).astype(int)
    # requester_subreddits_at_request: count
    if 'requester_subreddits_at_request' in df.columns:
        def sub_count(x):
            if isinstance(x, list):
                return len(x)
            try:
                if isinstance(x, str):
                    val = ast.literal_eval(x)
                    return len(val) if isinstance(val, list) else 0
            except Exception:
                return 0
            return 0
        feats['subreddits_count'] = df['requester_subreddits_at_request'].apply(sub_count)
    else:
        feats['subreddits_count'] = 0
    # Build text locally from df (title + edit_aware or text), keep URLs
    title = df['request_title'].fillna('') if 'request_title' in df.columns else ''
    body_col = 'request_text_edit_aware' if 'request_text_edit_aware' in df.columns else ('request_text' if 'request_text' in df.columns else None)
    body = df[body_col].fillna('') if body_col else ''
    title_l = title.astype(str).str.lower() if isinstance(title, pd.Series) else pd.Series(['']*len(df))
    body_l = body.astype(str).str.lower() if isinstance(body, pd.Series) else pd.Series(['']*len(df))
    text_series = (title_l + ' ' + body_l).fillna('')
    feats['n_chars'] = text_series.str.len().astype(int)
    feats['n_words'] = text_series.str.split().str.len().astype(int)
    feats['title_len'] = title_l.str.len().astype(int) if isinstance(title_l, pd.Series) else 0
    # URL/image/domain flags
    def count_pattern(s, pat):
        return len(re.findall(pat, s)) if isinstance(s, str) else 0
    feats['has_url'] = text_series.apply(lambda s: int(bool(re.search(r'http[s]?://|www\.', s))))
    feats['num_urls'] = text_series.apply(lambda s: count_pattern(s, r'http[s]?://|www\.'))
    feats['has_imgur'] = text_series.apply(lambda s: int(isinstance(s, str) and ('imgur' in s)))
    feats['num_imgur'] = text_series.apply(lambda s: s.count('imgur') if isinstance(s, str) else 0)
    feats['imgur_in_title'] = title_l.apply(lambda s: int('imgur' in s) if isinstance(s, str) else 0)
    feats['imgur_in_body'] = body_l.apply(lambda s: int('imgur' in s) if isinstance(s, str) else 0)
    feats['has_image_ext'] = text_series.apply(lambda s: int(bool(re.search(r'\.(jpg|jpeg|png|gif)(\b|$)', s))))
    # Domain buckets
    domains = ['reddit','youtube','gyazo','tinypic','dropbox','google','blogspot']
    for d in domains:
        feats[f'domain_{d}'] = text_series.apply(lambda s: int(d in s))
    # Politeness/reciprocity/urgency keywords
    kw_money = ['$', 'usd', 'dollar', 'rent', 'bill', 'bills', 'paycheck', 'broke', 'unemployed', 'laid off', 'jobless', 'homeless']
    kw_urgency = ['hungry', 'starving', 'emergency', 'asap', 'today', 'tonight', 'now']
    kw_roles = ['student', 'college', 'finals', 'exam', 'interview', 'job', 'veteran', 'military', 'marine', 'kids', 'children', 'family', 'pregnant', 'wife', 'husband', 'roommate']
    kw_evidence = ['picture', 'photo', 'album', 'proof', 'receipt', 'screenshot', 'link']
    def kw_count(s, kws):
        if not isinstance(s, str): return 0
        return sum(s.count(k) for k in kws)
    feats['please_cnt'] = text_series.apply(lambda s: s.count('please') if isinstance(s, str) else 0)
    feats['thanks_cnt'] = text_series.apply(lambda s: s.count('thank') if isinstance(s, str) else 0)
    feats['payitforward_cnt'] = text_series.apply(lambda s: sum(s.count(k) for k in ['pay it forward','return the favor','pay-it-forward']) if isinstance(s, str) else 0)
    feats['money_kw_cnt'] = text_series.apply(lambda s: kw_count(s, kw_money))
    feats['urgency_kw_cnt'] = text_series.apply(lambda s: kw_count(s, kw_urgency))
    feats['roles_kw_cnt'] = text_series.apply(lambda s: kw_count(s, kw_roles))
    feats['evidence_kw_cnt'] = text_series.apply(lambda s: kw_count(s, kw_evidence))
    # Structure/tone
    feats['qmark_cnt'] = text_series.str.count('\?').fillna(0).astype(int)
    feats['excl_cnt'] = text_series.str.count('!').fillna(0).astype(int)
    feats['dots_cnt'] = text_series.str.count('\.\.\.').fillna(0).astype(int)
    feats['digit_cnt'] = text_series.str.count('[0-9]').fillna(0).astype(int)
    feats['uppercase_ratio'] = text_series.apply(lambda s: (sum(ch.isupper() for ch in s) / max(1,len(s))) if isinstance(s, str) else 0.0)
    pronouns = re.compile(r'\b(i|me|my|we|our)\b')
    feats['pronoun_density'] = text_series.apply(lambda s: (len(pronouns.findall(s))/max(1, len(s.split()))) if isinstance(s, str) else 0.0)
    # Cyclical hour encoding
    feats['hour_sin'] = np.sin(2*np.pi*feats['hour']/24.0)
    feats['hour_cos'] = np.cos(2*np.pi*feats['hour']/24.0)
    # Final cleanup: ensure numeric, finite, float32
    for col in feats.columns:
        feats[col] = pd.to_numeric(feats[col], errors='coerce')
    feats = feats.replace([np.inf, -np.inf], 0).fillna(0).astype(np.float32)
    return feats

X_train_meta = build_meta_features(train)
X_test_meta = build_meta_features(test)
y = train[target_col].astype(int).values
print('Meta features shape:', X_train_meta.shape, X_test_meta.shape, flush=True)

# XGBoost model (hist, no early stopping due to API limits)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
oof_meta = np.zeros(len(train), dtype=np.float32)
test_meta_folds = []
start = time.time()
pos = float(y.sum())
neg = float(len(y) - pos)
scale_pos_weight = max(1.0, neg / max(1.0, pos))
for fold, (tr_idx, va_idx) in enumerate(skf.split(X_train_meta, y), 1):
    t0 = time.time()
    X_tr = X_train_meta.iloc[tr_idx]
    X_va = X_train_meta.iloc[va_idx]
    y_tr, y_va = y[tr_idx], y[va_idx]
    clf = XGBClassifier(
        n_estimators=800,
        learning_rate=0.06,
        max_depth=4,
        subsample=0.85,
        colsample_bytree=0.75,
        min_child_weight=3.0,
        reg_lambda=1.5,
        objective='binary:logistic',
        eval_metric='auc',
        tree_method='hist',
        random_state=SEED+fold,
        scale_pos_weight=scale_pos_weight
    )
    clf.fit(X_tr, y_tr, eval_set=[(X_va, y_va)], verbose=False)
    oof_meta[va_idx] = clf.predict_proba(X_va)[:,1]
    fold_auc = roc_auc_score(y_va, oof_meta[va_idx])
    test_meta_folds.append(clf.predict_proba(X_test_meta)[:,1])
    print(f'[META-XGB] Fold {fold} AUC: {fold_auc:.5f} | time: {time.time()-t0:.2f}s', flush=True)

print('[META-XGB] Finished all folds, computing OOF/test/blend...', flush=True)
oof_auc_meta = roc_auc_score(y, oof_meta)
print(f'[META-XGB] OOF AUC: {oof_auc_meta:.5f} | total time: {time.time()-start:.2f}s', flush=True)
test_meta = np.mean(np.vstack(test_meta_folds), axis=0)
np.save('oof_meta_xgb.npy', oof_meta)
np.save('test_meta_xgb.npy', test_meta)

# Blend with best text model available (prefer word-only tuned from cell 4)
oof_text, test_text_pred = None, None
if os.path.exists('oof_lr_tfidf_word_only.npy') and os.path.exists('test_lr_tfidf_word_only.npy'):
    oof_text = np.load('oof_lr_tfidf_word_only.npy')
    test_text_pred = np.load('test_lr_tfidf_word_only.npy')
elif os.path.exists('oof_lr_tfidf.npy') and os.path.exists('test_lr_tfidf.npy'):
    oof_text = np.load('oof_lr_tfidf.npy')
    test_text_pred = np.load('test_lr_tfidf.npy')
else:
    print('[BLEND] No text OOF found; skipping blend and saving meta-only submission.', flush=True)
    sub_meta_only = pd.DataFrame({id_col: test[id_col].values, target_col: test_meta})
    sub_meta_only.to_csv('submission_meta.csv', index=False)
    sub_meta_only.to_csv('submission.csv', index=False)
    print('Saved submission.csv and submission_meta.csv (META only)', flush=True)
    raise SystemExit()

def tune_blend_weight(oof_a, oof_b, y_true):
    best_w, best_auc = 0.5, -1.0
    for w in [i/20 for i in range(0,21)]:
        pred = w*oof_a + (1-w)*oof_b
        auc = roc_auc_score(y_true, pred)
        if auc > best_auc:
            best_auc, best_w = auc, w
    return best_w, best_auc

w_text, best_blend_auc = tune_blend_weight(oof_text, oof_meta, y)
print(f'[BLEND] Best weight on text: {w_text:.2f} | OOF AUC: {best_blend_auc:.5f}', flush=True)
test_blend = w_text*test_text_pred + (1-w_text)*test_meta
np.save('test_blend.npy', test_blend)

# Save submissions
sub_blend = pd.DataFrame({id_col: test[id_col].values, target_col: test_blend})
sub_blend.to_csv('submission.csv', index=False)
print('Saved submission.csv (BLEND) with shape', sub_blend.shape, '| w_text=', w_text, flush=True)
pd.DataFrame({id_col: test[id_col].values, target_col: test_meta}).to_csv('submission_meta.csv', index=False)
print('Saved submission_meta.csv (META only)', flush=True)

Meta features shape: (2878, 55) (1162, 55)


[META-XGB] Fold 1 AUC: 0.64911 | time: 0.82s


[META-XGB] Fold 2 AUC: 0.64476 | time: 0.82s


[META-XGB] Fold 3 AUC: 0.69735 | time: 0.81s


[META-XGB] Fold 4 AUC: 0.58628 | time: 0.82s


[META-XGB] Fold 5 AUC: 0.66317 | time: 0.81s


[META-XGB] Finished all folds, computing OOF/test/blend...


[META-XGB] OOF AUC: 0.64700 | total time: 4.09s


[BLEND] Best weight on text: 0.80 | OOF AUC: 0.67328


Saved submission.csv (BLEND) with shape (1162, 2) | w_text= 0.8


Saved submission_meta.csv (META only)


In [12]:
# Improved text baseline v2: word(1,2) TF-IDF with better tokenization (keep URLs), no [SEP], LR lbfgs; blend with meta OOF
import time, os
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def preprocess_text_urls_no_sep(df):
    title = df['request_title'].fillna('') if 'request_title' in df.columns else ''
    body_col = 'request_text_edit_aware' if 'request_text_edit_aware' in df.columns else ('request_text' if 'request_text' in df.columns else None)
    body = df[body_col].fillna('') if body_col else ''
    # Simple concat with space; keep URLs/domains; lowercase
    return (title.astype(str) + ' ' + body.astype(str)).str.lower()

train_text_v2 = preprocess_text_urls_no_sep(train)
test_text_v2 = preprocess_text_urls_no_sep(test)
X_text = train_text_v2.values.astype(str)
y = train[target_col].astype(int).values

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
oof_t = np.zeros(len(train))
test_pred_folds_t = []
start = time.time()
for fold, (tr_idx, va_idx) in enumerate(skf.split(X_text, y), 1):
    t0 = time.time()
    X_tr_text = X_text[tr_idx]
    X_va_text = X_text[va_idx]
    y_tr, y_va = y[tr_idx], y[va_idx]
    vec = TfidfVectorizer(
        lowercase=True, strip_accents='unicode',
        analyzer='word', ngram_range=(1,2),
        min_df=3, max_features=60000, sublinear_tf=True,
        token_pattern=r'(?u)\b\w+\b'
    )
    X_tr = vec.fit_transform(X_tr_text)
    X_va = vec.transform(X_va_text)
    X_te = vec.transform(test_text_v2.values.astype(str))
    clf = LogisticRegression(solver='lbfgs', C=1.0, max_iter=2000, random_state=SEED+fold)
    clf.fit(X_tr, y_tr)
    oof_t[va_idx] = clf.predict_proba(X_va)[:,1]
    fold_auc = roc_auc_score(y_va, oof_t[va_idx])
    test_pred_folds_t.append(clf.predict_proba(X_te)[:,1])
    print(f'[text-v2] Fold {fold} AUC: {fold_auc:.5f} | time: {time.time()-t0:.2f}s', flush=True)

oof_auc_t = roc_auc_score(y, oof_t)
test_pred_t = np.mean(np.vstack(test_pred_folds_t), axis=0)
print(f'[text-v2] OOF AUC: {oof_auc_t:.5f} | total time: {time.time()-start:.2f}s', flush=True)

# Save replacing previous word-only to be picked by other cells
np.save('oof_lr_tfidf_word_only.npy', oof_t)
np.save('test_lr_tfidf_word_only.npy', test_pred_t)

# If meta OOF exists, compute tuned blend and save submission
if os.path.exists('oof_meta_xgb.npy') and os.path.exists('test_meta_xgb.npy'):
    oof_meta = np.load('oof_meta_xgb.npy')
    test_meta = np.load('test_meta_xgb.npy')
    best_w, best_auc = 0.5, -1.0
    for w in [i/20 for i in range(0,21)]:
        pred = w*oof_t + (1-w)*oof_meta
        auc = roc_auc_score(y, pred)
        if auc > best_auc:
            best_auc, best_w = auc, w
    print(f'[text+meta BLEND] Best weight on text: {best_w:.2f} | OOF AUC: {best_auc:.5f}', flush=True)
    test_blend = best_w*test_pred_t + (1-best_w)*test_meta
    pd.DataFrame({id_col: test[id_col].values, target_col: test_blend}).to_csv('submission.csv', index=False)
    print('Saved submission.csv (blend via v2) with shape', (len(test_blend), 2))
else:
    pd.DataFrame({id_col: test[id_col].values, target_col: test_pred_t}).to_csv('submission.csv', index=False)
    print('Saved submission.csv (text-v2 only); meta OOF not found')

[text-v2] Fold 1 AUC: 0.64069 | time: 0.47s


[text-v2] Fold 2 AUC: 0.64050 | time: 0.48s


[text-v2] Fold 3 AUC: 0.63523 | time: 0.44s


[text-v2] Fold 4 AUC: 0.59834 | time: 0.54s


[text-v2] Fold 5 AUC: 0.64703 | time: 0.48s


[text-v2] OOF AUC: 0.63231 | total time: 2.42s


[text+meta BLEND] Best weight on text: 0.75 | OOF AUC: 0.67137


Saved submission.csv (blend via v2) with shape (1162, 2)


In [13]:
# Text baseline v3: word(1,2) TF-IDF grid over C and larger vocab; blend with meta
import time, os
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def build_text_v3(df):
    title = df['request_title'].fillna('') if 'request_title' in df.columns else ''
    body_col = 'request_text_edit_aware' if 'request_text_edit_aware' in df.columns else ('request_text' if 'request_text' in df.columns else None)
    body = df[body_col].fillna('') if body_col else ''
    return (title.astype(str) + ' ' + body.astype(str)).str.lower()

train_text_v3 = build_text_v3(train)
test_text_v3 = build_text_v3(test)
X_text = train_text_v3.values.astype(str)
y = train[target_col].astype(int).values

Cs = [0.5, 1.0, 2.0]
best_auc, best_C = -1.0, None
best_oof, best_test_pred = None, None

for C in Cs:
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
    oof_t = np.zeros(len(train))
    test_pred_folds_t = []
    start = time.time()
    for fold, (tr_idx, va_idx) in enumerate(skf.split(X_text, y), 1):
        t0 = time.time()
        X_tr_text = X_text[tr_idx]
        X_va_text = X_text[va_idx]
        y_tr, y_va = y[tr_idx], y[va_idx]
        vec = TfidfVectorizer(
            lowercase=True, strip_accents='unicode',
            analyzer='word', ngram_range=(1,2),
            min_df=3, max_features=100000, sublinear_tf=True,
            token_pattern=r'(?u)\b\w+\b'
        )
        X_tr = vec.fit_transform(X_tr_text)
        X_va = vec.transform(X_va_text)
        X_te = vec.transform(test_text_v3.values.astype(str))
        clf = LogisticRegression(solver='lbfgs', C=C, max_iter=3000, random_state=SEED+fold)
        clf.fit(X_tr, y_tr)
        oof_t[va_idx] = clf.predict_proba(X_va)[:,1]
        test_pred_folds_t.append(clf.predict_proba(X_te)[:,1])
        print(f'[text-v3 C={C}] Fold {fold} AUC: {roc_auc_score(y_va, oof_t[va_idx]):.5f} | time: {time.time()-t0:.2f}s', flush=True)
    oof_auc_t = roc_auc_score(y, oof_t)
    test_pred_t = np.mean(np.vstack(test_pred_folds_t), axis=0)
    print(f'[text-v3 C={C}] OOF AUC: {oof_auc_t:.5f} | total time: {time.time()-start:.2f}s', flush=True)
    if oof_auc_t > best_auc:
        best_auc, best_C = oof_auc_t, C
        best_oof, best_test_pred = oof_t.copy(), test_pred_t.copy()

print(f'[text-v3] Best C: {best_C} with OOF AUC: {best_auc:.5f}', flush=True)
np.save('oof_lr_tfidf_word_only.npy', best_oof)
np.save('test_lr_tfidf_word_only.npy', best_test_pred)

# Blend with meta if available
if os.path.exists('oof_meta_xgb.npy') and os.path.exists('test_meta_xgb.npy'):
    oof_meta = np.load('oof_meta_xgb.npy')
    test_meta = np.load('test_meta_xgb.npy')
    best_w, best_blend_auc = 0.5, -1.0
    for w in [i/20 for i in range(0,21)]:
        auc = roc_auc_score(y, w*best_oof + (1-w)*oof_meta)
        if auc > best_blend_auc:
            best_blend_auc, best_w = auc, w
    print(f'[text-v3+meta] Best weight on text: {best_w:.2f} | OOF AUC: {best_blend_auc:.5f}', flush=True)
    test_blend = best_w*best_test_pred + (1-best_w)*test_meta
    pd.DataFrame({id_col: test[id_col].values, target_col: test_blend}).to_csv('submission.csv', index=False)
    print('Saved submission.csv (text-v3 blend) with shape', (len(test_blend), 2))
else:
    pd.DataFrame({id_col: test[id_col].values, target_col: best_test_pred}).to_csv('submission.csv', index=False)
    print('Saved submission.csv (text-v3 only)')

[text-v3 C=0.5] Fold 1 AUC: 0.64378 | time: 0.42s


[text-v3 C=0.5] Fold 2 AUC: 0.64098 | time: 0.46s


[text-v3 C=0.5] Fold 3 AUC: 0.63422 | time: 0.44s


[text-v3 C=0.5] Fold 4 AUC: 0.60297 | time: 0.45s


[text-v3 C=0.5] Fold 5 AUC: 0.65364 | time: 0.45s


[text-v3 C=0.5] OOF AUC: 0.63475 | total time: 2.24s


[text-v3 C=1.0] Fold 1 AUC: 0.64069 | time: 0.53s


[text-v3 C=1.0] Fold 2 AUC: 0.64050 | time: 0.47s


[text-v3 C=1.0] Fold 3 AUC: 0.63523 | time: 0.46s


[text-v3 C=1.0] Fold 4 AUC: 0.59834 | time: 0.47s


[text-v3 C=1.0] Fold 5 AUC: 0.64703 | time: 0.45s


[text-v3 C=1.0] OOF AUC: 0.63231 | total time: 2.40s


[text-v3 C=2.0] Fold 1 AUC: 0.63373 | time: 0.46s


[text-v3 C=2.0] Fold 2 AUC: 0.63892 | time: 0.53s


[text-v3 C=2.0] Fold 3 AUC: 0.63249 | time: 0.47s


[text-v3 C=2.0] Fold 4 AUC: 0.59282 | time: 0.48s


[text-v3 C=2.0] Fold 5 AUC: 0.63708 | time: 0.48s


[text-v3 C=2.0] OOF AUC: 0.62720 | total time: 2.43s


[text-v3] Best C: 0.5 with OOF AUC: 0.63475


[text-v3+meta] Best weight on text: 0.85 | OOF AUC: 0.67124


Saved submission.csv (text-v3 blend) with shape (1162, 2)


In [14]:
# Meta XGB param sweep (fast): try few combos to push OOF > 0.69; update blend
import numpy as np, time, os
import pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

# Reuse prebuilt features from cell 6
X_trm = X_train_meta
X_tem = X_test_meta
y_bin = train[target_col].astype(int).values

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
pos = float(y_bin.sum()); neg = float(len(y_bin)-pos)
spw = max(1.0, neg/max(1.0,pos))

param_grid = [
    {'max_depth':3, 'min_child_weight':3, 'subsample':0.90, 'colsample_bytree':0.70, 'learning_rate':0.06, 'n_estimators':900},
    {'max_depth':3, 'min_child_weight':5, 'subsample':0.90, 'colsample_bytree':0.70, 'learning_rate':0.06, 'n_estimators':900},
    {'max_depth':4, 'min_child_weight':5, 'subsample':0.85, 'colsample_bytree':0.70, 'learning_rate':0.05, 'n_estimators':900},
    {'max_depth':4, 'min_child_weight':4, 'subsample':0.80, 'colsample_bytree':0.80, 'learning_rate':0.05, 'n_estimators':800},
    {'max_depth':3, 'min_child_weight':4, 'subsample':0.85, 'colsample_bytree':0.80, 'learning_rate':0.05, 'n_estimators':1000},
]

best_auc, best_params = -1.0, None
best_oof, best_test = None, None

for i, p in enumerate(param_grid, 1):
    oof = np.zeros(len(X_trm), dtype=np.float32)
    test_folds = []
    t0 = time.time()
    for fold, (tr_idx, va_idx) in enumerate(skf.split(X_trm, y_bin), 1):
        X_tr, X_va = X_trm.iloc[tr_idx], X_trm.iloc[va_idx]
        y_tr, y_va = y_bin[tr_idx], y_bin[va_idx]
        clf = XGBClassifier(
            n_estimators=p['n_estimators'],
            learning_rate=p['learning_rate'],
            max_depth=p['max_depth'],
            subsample=p['subsample'],
            colsample_bytree=p['colsample_bytree'],
            min_child_weight=p['min_child_weight'],
            reg_lambda=1.8,
            objective='binary:logistic',
            eval_metric='auc',
            tree_method='hist',
            random_state=SEED+fold,
            scale_pos_weight=spw
        )
        clf.fit(X_tr, y_tr, eval_set=[(X_va, y_va)], verbose=False)
        oof[va_idx] = clf.predict_proba(X_va)[:,1]
        test_folds.append(clf.predict_proba(X_tem)[:,1])
    auc = roc_auc_score(y_bin, oof)
    print(f'[META-XGB SWEEP {i}] params={p} | OOF AUC: {auc:.5f} | time: {time.time()-t0:.2f}s', flush=True)
    if auc > best_auc:
        best_auc, best_params = auc, p
        best_oof = oof.copy()
        best_test = np.mean(np.vstack(test_folds), axis=0)

print(f'[META-XGB SWEEP] Best params: {best_params} | OOF AUC: {best_auc:.5f}', flush=True)
np.save('oof_meta_xgb.npy', best_oof)
np.save('test_meta_xgb.npy', best_test)

# Re-blend with current best text OOF if available
if os.path.exists('oof_lr_tfidf_word_only.npy') and os.path.exists('test_lr_tfidf_word_only.npy'):
    oof_text = np.load('oof_lr_tfidf_word_only.npy')
    test_text = np.load('test_lr_tfidf_word_only.npy')
    best_w, best_blend_auc = 0.5, -1.0
    for w in [i/20 for i in range(0,21)]:
        auc_b = roc_auc_score(y_bin, w*oof_text + (1-w)*best_oof)
        if auc_b > best_blend_auc:
            best_blend_auc, best_w = auc_b, w
    print(f'[BLEND after SWEEP] Best weight on text: {best_w:.2f} | OOF AUC: {best_blend_auc:.5f}', flush=True)
    test_blend = best_w*test_text + (1-best_w)*best_test
    pd.DataFrame({id_col: test[id_col].values, target_col: test_blend}).to_csv('submission.csv', index=False)
    print('Saved submission.csv (post-sweep blend) with shape', (len(test_blend), 2))
else:
    pd.DataFrame({id_col: test[id_col].values, target_col: best_test}).to_csv('submission.csv', index=False)
    print('Saved submission.csv (meta-only, post-sweep)')

[META-XGB SWEEP 1] params={'max_depth': 3, 'min_child_weight': 3, 'subsample': 0.9, 'colsample_bytree': 0.7, 'learning_rate': 0.06, 'n_estimators': 900} | OOF AUC: 0.64731 | time: 4.21s


[META-XGB SWEEP 2] params={'max_depth': 3, 'min_child_weight': 5, 'subsample': 0.9, 'colsample_bytree': 0.7, 'learning_rate': 0.06, 'n_estimators': 900} | OOF AUC: 0.64837 | time: 3.94s


[META-XGB SWEEP 3] params={'max_depth': 4, 'min_child_weight': 5, 'subsample': 0.85, 'colsample_bytree': 0.7, 'learning_rate': 0.05, 'n_estimators': 900} | OOF AUC: 0.65052 | time: 4.64s


[META-XGB SWEEP 4] params={'max_depth': 4, 'min_child_weight': 4, 'subsample': 0.8, 'colsample_bytree': 0.8, 'learning_rate': 0.05, 'n_estimators': 800} | OOF AUC: 0.64986 | time: 4.12s


[META-XGB SWEEP 5] params={'max_depth': 3, 'min_child_weight': 4, 'subsample': 0.85, 'colsample_bytree': 0.8, 'learning_rate': 0.05, 'n_estimators': 1000} | OOF AUC: 0.64878 | time: 4.36s


[META-XGB SWEEP] Best params: {'max_depth': 4, 'min_child_weight': 5, 'subsample': 0.85, 'colsample_bytree': 0.7, 'learning_rate': 0.05, 'n_estimators': 900} | OOF AUC: 0.65052


[BLEND after SWEEP] Best weight on text: 0.80 | OOF AUC: 0.67310


Saved submission.csv (post-sweep blend) with shape (1162, 2)


In [15]:
# Text+Meta in one LR: hstack TF-IDF(word 1-2) with standardized meta features; 5-fold OOF and blend
import time, os
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from scipy.sparse import hstack, csr_matrix

# Text inputs (same as v3)
def build_text_for_lr(df):
    title = df['request_title'].fillna('') if 'request_title' in df.columns else ''
    body_col = 'request_text_edit_aware' if 'request_text_edit_aware' in df.columns else ('request_text' if 'request_text' in df.columns else None)
    body = df[body_col].fillna('') if body_col else ''
    return (title.astype(str) + ' ' + body.astype(str)).str.lower()

train_text_lr = build_text_for_lr(train)
test_text_lr = build_text_for_lr(test)
X_text_arr = train_text_lr.values.astype(str)
y_bin = train[target_col].astype(int).values

# Meta features from cell 6 (use all built columns)
meta_cols = list(X_train_meta.columns)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
oof_lrm = np.zeros(len(train), dtype=np.float32)
test_pred_folds_lrm = []
start = time.time()

for fold, (tr_idx, va_idx) in enumerate(skf.split(X_text_arr, y_bin), 1):
    t0 = time.time()
    # Text vectorizer
    vec = TfidfVectorizer(lowercase=True, strip_accents='unicode',
                          analyzer='word', ngram_range=(1,2),
                          min_df=3, max_features=120000, sublinear_tf=True,
                          token_pattern=r'(?u)\b\w+\b')
    X_tr_text = vec.fit_transform(X_text_arr[tr_idx])
    X_va_text = vec.transform(X_text_arr[va_idx])
    X_te_text = vec.transform(test_text_lr.values.astype(str))

    # Meta scaler per fold
    scaler = StandardScaler(with_mean=True, with_std=True)
    X_tr_meta = scaler.fit_transform(X_train_meta.iloc[tr_idx][meta_cols].values)
    X_va_meta = scaler.transform(X_train_meta.iloc[va_idx][meta_cols].values)
    X_te_meta = scaler.transform(X_test_meta[meta_cols].values)

    # To sparse
    X_tr_meta_sp = csr_matrix(X_tr_meta)
    X_va_meta_sp = csr_matrix(X_va_meta)
    X_te_meta_sp = csr_matrix(X_te_meta)

    # Stack
    X_tr = hstack([X_tr_text, X_tr_meta_sp], format='csr')
    X_va = hstack([X_va_text, X_va_meta_sp], format='csr')
    X_te = hstack([X_te_text, X_te_meta_sp], format='csr')

    clf = LogisticRegression(solver='lbfgs', C=1.0, max_iter=3000, random_state=SEED+fold)
    clf.fit(X_tr, y_bin[tr_idx])
    oof_lrm[va_idx] = clf.predict_proba(X_va)[:,1]
    test_pred_folds_lrm.append(clf.predict_proba(X_te)[:,1])
    print(f'[LR text+meta] Fold {fold} AUC: {roc_auc_score(y_bin[va_idx], oof_lrm[va_idx]):.5f} | time: {time.time()-t0:.2f}s', flush=True)

oof_auc_lrm = roc_auc_score(y_bin, oof_lrm)
test_pred_lrm = np.mean(np.vstack(test_pred_folds_lrm), axis=0)
print(f'[LR text+meta] OOF AUC: {oof_auc_lrm:.5f} | total time: {time.time()-start:.2f}s', flush=True)

np.save('oof_lr_text_meta.npy', oof_lrm)
np.save('test_lr_text_meta.npy', test_pred_lrm)

# Blend this with meta-XGB if beneficial
if os.path.exists('oof_meta_xgb.npy') and os.path.exists('test_meta_xgb.npy'):
    oof_meta = np.load('oof_meta_xgb.npy')
    test_meta = np.load('test_meta_xgb.npy')
    best_w, best_blend_auc = 0.5, -1.0
    for w in [i/20 for i in range(0,21)]:
        auc_b = roc_auc_score(y_bin, w*oof_lrm + (1-w)*oof_meta)
        if auc_b > best_blend_auc:
            best_blend_auc, best_w = auc_b, w
    print(f'[BLEND LR(text+meta)+XGB] Best weight on LR: {best_w:.2f} | OOF AUC: {best_blend_auc:.5f}', flush=True)
    test_blend = best_w*test_pred_lrm + (1-best_w)*test_meta
    pd.DataFrame({id_col: test[id_col].values, target_col: test_blend}).to_csv('submission.csv', index=False)
    print('Saved submission.csv (blend LR(text+meta)+XGB) with shape', (len(test_blend), 2))
else:
    pd.DataFrame({id_col: test[id_col].values, target_col: test_pred_lrm}).to_csv('submission.csv', index=False)
    print('Saved submission.csv (LR text+meta only)')

[LR text+meta] Fold 1 AUC: 0.70851 | time: 0.60s


[LR text+meta] Fold 2 AUC: 0.67369 | time: 0.68s


[LR text+meta] Fold 3 AUC: 0.69034 | time: 0.67s


[LR text+meta] Fold 4 AUC: 0.63186 | time: 0.71s


[LR text+meta] Fold 5 AUC: 0.65692 | time: 0.65s


[LR text+meta] OOF AUC: 0.67149 | total time: 3.32s


[BLEND LR(text+meta)+XGB] Best weight on LR: 0.75 | OOF AUC: 0.68211


Saved submission.csv (blend LR(text+meta)+XGB) with shape (1162, 2)
