# Plan: Google QUEST Q&A Labeling – Single-Target (question_asker_intent_understanding)

Objectives:
- Build a fast, strong baseline and iterate to medal-level column-wise Spearman.
- Lock robust CV, avoid leakage, cache artifacts, and ensemble diverse models.

Data understanding:
- Train/test CSVs with text fields (question_title, question_body, answer) plus metadata (e.g., category).
- Target: continuous [0,1] label question_asker_intent_understanding.

Validation:
- Use StratifiedKFold on target binned into ~10 bins to stabilize fold distributions.
- 5 folds, fixed seed; report OOF Spearman and CI via multiple seeds later if time allows.
- Fit all text preprocessors inside each fold; precompute/cached TF-IDF matrices once and slice per fold.

Metric:
- column-wise Spearman; here single target, so Spearman correlation between OOF preds and target.

Baseline v1 (fast):
- Text only: concatenate question_title + question_body + answer.
- TF-IDF (char n-grams 3–6 + word n-grams 1–2), max_features ~200k (tune), lowercase, strip accents.
- Ridge regression (or SGDRegressor with elasticnet) on TF-IDF. OOF Spearman as baseline.
- Log fold times and memory; cache X_tfidf.npz.

Model v2:
- Separate fields: build per-field TF-IDF and concatenate; optionally weights per field.
- Add simple features: lengths (#chars, #words), punctuation counts, exclamation/question marks, capitalization ratio.
- Refit Ridge/Lasso/ElasticNet; compare OOF.

Model v3 (tree/boost):
- Use SVR (linear/RBF) and/or CatBoost/XGBoost on TF-IDF svd-reduced features (TruncatedSVD 256–512).
- Alternatively, use boosted trees on engineered scalar features + dense SVD text embeddings.

Model v4 (transformers – GPU):
- Fine-tune a small encoder (deberta-v3-base or electra-base) with regression head on concatenated fields with special tokens.
- Max length ~384–512; batch size tuned for 24GB A10; 3–5 epochs with early stopping; layer-wise LR decay.
- 5-fold CV; cache OOF and test predictions.

Ensembling:
- Weighted average of OOF/test predictions from v2 Ridge, v3 SVR/boosted trees, v4 transformer.
- Weights tuned on OOF Spearman (e.g., grid over simplex).

Diagnostics & efficiency:
- Environment check: GPU via nvidia-smi; install torch cu121 stack once.
- Log per-fold durations; save OOF, test preds, and submission.csv; verify format.
- Adversarial validation quick check (optional) to ensure TF-IDF features not leaking IDs.

Milestones & expert checkpoints:
1) Plan review (this cell).
2) Environment + data load/EDA.
3) Baseline TF-IDF + Ridge OOF.
4) Per-field TF-IDF + features.
5) SVD + SVR/boost.
6) Transformer fine-tune.
7) Ensemble + finalize submission.

In [1]:
# Environment check + data load/quick EDA
import os, sys, subprocess, time, hashlib
import pandas as pd
import numpy as np
from pathlib import Path

def run(cmd):
    print(">>", " ".join(cmd), flush=True)
    return subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True).stdout

t0 = time.time()
print("[Env] Checking GPU via nvidia-smi...")
print(run(['bash','-lc','nvidia-smi || true']))

# Load data
train_path = Path('train.csv')
test_path = Path('test.csv')
assert train_path.exists() and test_path.exists(), "train.csv/test.csv missing"
train = pd.read_csv(train_path)
test = pd.read_csv(test_path)
print(f"train shape: {train.shape}; test shape: {test.shape}")
print("train columns:", list(train.columns))
print("test columns:", list(test.columns))

# Identify id and target
id_col = None
for c in ['qa_id','id','question_id','id_q']:
    if c in train.columns and c in test.columns:
        id_col = c
        break
print("ID column:", id_col)
target = 'question_asker_intent_understanding'
assert target in train.columns, f"Target {target} not found"
print(train[target].describe())

# Group key to avoid leakage across answers of same question
def make_group(df: pd.DataFrame):
    # Prefer url if present
    if 'url' in df.columns:
        key = df['url'].fillna('')
    else:
        # fallback: stable hash of title||body
        t = df['question_title'].fillna('') if 'question_title' in df.columns else ''
        b = df['question_body'].fillna('') if 'question_body' in df.columns else ''
        key = (t.astype(str) + '||' + b.astype(str))
    # Hash to int64 for GroupKFold compatibility
    h = pd.util.hash_pandas_object(key, index=False).astype('int64')
    return h

train['group_key'] = make_group(train)
test['group_key'] = make_group(test)
print("Unique groups in train:", train['group_key'].nunique(), "/ rows:", len(train))

# Basic sanity: no NaNs in target, in-range
nan_rate = train[target].isna().mean()
print(f"Target NaN rate: {nan_rate:.5f}")
print("Target min/max:", train[target].min(), train[target].max())

# Save quick artifacts
train[['group_key']].to_csv('train_group_keys.csv', index=False)
test[['group_key']].to_csv('test_group_keys.csv', index=False)
print(f"[Done] Env+EDA in {time.time()-t0:.2f}s")

# Next: implement GroupKFold splitter and baseline TF-IDF+Ridge.

[Env] Checking GPU via nvidia-smi...
>> bash -lc nvidia-smi || true


Mon Sep 29 04:17:06 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.06             Driver Version: 550.144.06     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A10-24Q                 On  |   00000002:00:00.0 Off |                    0 |
| N/A   N/A    P0             N/A /  N/A  |     182MiB /  24512MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [2]:
# Baseline v1: Per-field TF-IDF (word+char) + Ridge with GroupKFold CV
import time
from sklearn.model_selection import GroupKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.metrics import make_scorer
from scipy import sparse
from scipy.stats import spearmanr
import numpy as np
import pandas as pd

np.random.seed(42)

def spearmanr_safe(y_true, y_pred):
    if np.std(y_pred) == 0:
        return 0.0
    return spearmanr(y_true, y_pred).correlation

texts_train = {
    'title': train['question_title'].fillna('').astype(str),
    'body': train['question_body'].fillna('').astype(str),
    'answer': train['answer'].fillna('').astype(str),
}
texts_test = {
    'title': test['question_title'].fillna('').astype(str),
    'body': test['question_body'].fillna('').astype(str),
    'answer': test['answer'].fillna('').astype(str),
}

# Vectorizer configs (proven defaults)
cfg_word = dict(analyzer='word', ngram_range=(1,2), sublinear_tf=True, strip_accents='unicode', lowercase=True, min_df=3, stop_words='english')
cfg_char = dict(analyzer='char_wb', ngram_range=(3,6), sublinear_tf=True, min_df=3)
max_feats = {
    'word': {'title': 50000, 'body': 150000, 'answer': 150000},
    'char': {'title': 30000, 'body': 120000, 'answer': 120000},
}

y = train[target].values.astype(float)
groups = train['group_key'].values

gkf = GroupKFold(n_splits=5)
oof = np.zeros(len(train), dtype=float)
test_preds_folds = []
folds = np.full(len(train), -1, dtype=int)

for fold, (trn_idx, val_idx) in enumerate(gkf.split(X=np.zeros(len(train)), y=y, groups=groups)):
    t0 = time.time()
    folds[val_idx] = fold
    print(f"\n[FOLD {fold}] train={len(trn_idx)} val={len(val_idx)}", flush=True)
    # Fit per-field vectorizers on train split only
    X_tr_parts = []
    X_va_parts = []
    X_te_parts = []
    for field in ['title','body','answer']:
        # Word
        vec_w = TfidfVectorizer(max_features=max_feats['word'][field], **cfg_word)
        Xw_tr = vec_w.fit_transform(texts_train[field].iloc[trn_idx])
        Xw_va = vec_w.transform(texts_train[field].iloc[val_idx])
        Xw_te = vec_w.transform(texts_test[field])
        # Char
        vec_c = TfidfVectorizer(max_features=max_feats['char'][field], **cfg_char)
        Xc_tr = vec_c.fit_transform(texts_train[field].iloc[trn_idx])
        Xc_va = vec_c.transform(texts_train[field].iloc[val_idx])
        Xc_te = vec_c.transform(texts_test[field])
        # Stack per field
        X_tr_parts.append(sparse.hstack([Xw_tr, Xc_tr], format='csr'))
        X_va_parts.append(sparse.hstack([Xw_va, Xc_va], format='csr'))
        X_te_parts.append(sparse.hstack([Xw_te, Xc_te], format='csr'))
    # Concatenate fields horizontally
    X_tr = sparse.hstack(X_tr_parts, format='csr')
    X_va = sparse.hstack(X_va_parts, format='csr')
    X_te = sparse.hstack(X_te_parts, format='csr')
    print(f"[FOLD {fold}] Shapes: X_tr={X_tr.shape} X_va={X_va.shape} X_te={X_te.shape}")

    # Ridge (alpha tuned lightly per fold)
    best_alpha = None
    best_score = -1e9
    best_pred = None
    alphas = [1.0, 2.0, 5.0, 10.0]
    for a in alphas:
        model = Ridge(alpha=a, random_state=42)
        model.fit(X_tr, y[trn_idx])
        p = model.predict(X_va)
        sc = spearmanr_safe(y[val_idx], p)
        print(f"[FOLD {fold}] alpha={a} val_spearman={sc:.5f}")
        if sc > best_score:
            best_score = sc
            best_alpha = a
            best_pred = p
    oof[val_idx] = best_pred
    # Refit on full fold train for test preds with best alpha
    model = Ridge(alpha=best_alpha, random_state=42)
    model.fit(X_tr, y[trn_idx])
    test_pred = model.predict(X_te)
    test_preds_folds.append(test_pred)
    elapsed = time.time() - t0
    print(f"[FOLD {fold}] best_alpha={best_alpha} fold_spearman={best_score:.5f} time={elapsed:.1f}s", flush=True)

cv_score = spearmanr_safe(y, oof)
print(f"\n[CV] OOF Spearman: {cv_score:.5f}")

# Save OOF and fold assignments
np.save('oof_ridge.npy', oof)
pd.DataFrame({'qa_id': train[id_col], 'fold': folds, 'oof': oof, 'target': y}).to_csv('oof_ridge.csv', index=False)

# Aggregate test predictions (mean across folds), clip to [0,1]
test_pred_mean = np.mean(np.vstack(test_preds_folds), axis=0)
test_pred_mean = np.clip(test_pred_mean, 0.0, 1.0)
np.save('test_ridge.npy', test_pred_mean)

# Build submission
sub = pd.DataFrame({id_col: test[id_col], target: test_pred_mean})
sub.to_csv('submission.csv', index=False)
print("Saved submission.csv", sub.shape, "head:\n", sub.head())


[FOLD 0] train=4376 val=1095


[FOLD 0] Shapes: X_tr=(4376, 317157) X_va=(1095, 317157) X_te=(608, 317157)


[FOLD 0] alpha=1.0 val_spearman=0.22364


[FOLD 0] alpha=2.0 val_spearman=0.24501


[FOLD 0] alpha=5.0 val_spearman=0.27993


[FOLD 0] alpha=10.0 val_spearman=0.30730


[FOLD 0] best_alpha=10.0 fold_spearman=0.30730 time=15.1s



[FOLD 1] train=4377 val=1094


[FOLD 1] Shapes: X_tr=(4377, 316578) X_va=(1094, 316578) X_te=(608, 316578)


[FOLD 1] alpha=1.0 val_spearman=0.14794


[FOLD 1] alpha=2.0 val_spearman=0.16752


[FOLD 1] alpha=5.0 val_spearman=0.19993


[FOLD 1] alpha=10.0 val_spearman=0.22630


[FOLD 1] best_alpha=10.0 fold_spearman=0.22630 time=14.6s



[FOLD 2] train=4377 val=1094


[FOLD 2] Shapes: X_tr=(4377, 317059) X_va=(1094, 317059) X_te=(608, 317059)


[FOLD 2] alpha=1.0 val_spearman=0.19841


[FOLD 2] alpha=2.0 val_spearman=0.22097


[FOLD 2] alpha=5.0 val_spearman=0.25595


[FOLD 2] alpha=10.0 val_spearman=0.28347


[FOLD 2] best_alpha=10.0 fold_spearman=0.28347 time=14.5s



[FOLD 3] train=4377 val=1094


[FOLD 3] Shapes: X_tr=(4377, 317028) X_va=(1094, 317028) X_te=(608, 317028)


[FOLD 3] alpha=1.0 val_spearman=0.25230


[FOLD 3] alpha=2.0 val_spearman=0.27106


[FOLD 3] alpha=5.0 val_spearman=0.29466


[FOLD 3] alpha=10.0 val_spearman=0.30805


[FOLD 3] best_alpha=10.0 fold_spearman=0.30805 time=14.5s



[FOLD 4] train=4377 val=1094


[FOLD 4] Shapes: X_tr=(4377, 315933) X_va=(1094, 315933) X_te=(608, 315933)


[FOLD 4] alpha=1.0 val_spearman=0.25176


[FOLD 4] alpha=2.0 val_spearman=0.26409


[FOLD 4] alpha=5.0 val_spearman=0.28661


[FOLD 4] alpha=10.0 val_spearman=0.30573


[FOLD 4] best_alpha=10.0 fold_spearman=0.30573 time=14.5s



[CV] OOF Spearman: 0.28549
Saved submission.csv (608, 2) head:
    qa_id  question_asker_intent_understanding
0   6516                             0.924649
1   6168                             0.876612
2   8575                             0.952263
3    618                             0.862909
4   3471                             0.942982


In [4]:
# Improved Baseline: Q-only TF-IDF (keep stopwords), title upweight, StratifiedGroupKFold, + cheap scalar features
import time, re
from sklearn.model_selection import GroupKFold
try:
    from sklearn.model_selection import StratifiedGroupKFold
    HAS_SGK = True
except Exception:
    HAS_SGK = False
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from scipy import sparse
from scipy.stats import spearmanr
import numpy as np
import pandas as pd

def spearmanr_safe(y_true, y_pred):
    if np.std(y_pred) == 0:
        return 0.0
    return spearmanr(y_true, y_pred).correlation

# Prepare texts (QUESTION ONLY) and simple scalar features
q_title_tr = train['question_title'].fillna('').astype(str)
q_body_tr  = train['question_body'].fillna('').astype(str)
q_title_te = test['question_title'].fillna('').astype(str)
q_body_te  = test['question_body'].fillna('').astype(str)

def basic_feats(title: pd.Series, body: pd.Series):
    # Cheap, intent-related features
    def counts(s):
        s2 = s.fillna('')
        chars = s2.str.len().astype(float)
        words = s2.str.split().apply(len).astype(float)
        # Use regex-escaped patterns for special chars
        qcnt = s2.str.count(r'\?').astype(float)
        ecnt = s2.str.count(r'!').astype(float)
        q2   = s2.str.count(r'\?\?+').astype(float)
        e2   = s2.str.count(r'!!+').astype(float)
        ell  = s2.str.count(r'\.\.\.+').astype(float)
        upper = s2.apply(lambda t: sum(ch.isupper() for ch in t)).astype(float)
        upper_ratio = (upper / (chars.replace(0, np.nan))).fillna(0.0).astype(float)
        nl = s2.str.count(r'\n').astype(float)
        has_url = s2.str.contains(r'http[s]?://', regex=True).astype(float)
        has_code = s2.str.contains(r'`').astype(float)
        list_mark = s2.str.contains(r'(^|\n)[\-\*] ', regex=True).astype(float)
        quote = s2.str.contains(r'(^|\n)\>', regex=True).astype(float)
        digits = s2.apply(lambda t: sum(ch.isdigit() for ch in t)).astype(float)
        digit_ratio = (digits / (chars.replace(0, np.nan))).fillna(0.0).astype(float)
        uniq_ratio = s2.apply(lambda t: (len(set(t.split())) / max(1, len(t.split())))).astype(float)
        return [chars, words, qcnt, ecnt, q2, e2, ell, upper_ratio, nl, has_url, has_code, list_mark, quote, digit_ratio, uniq_ratio]
    t_feats = counts(title)
    b_feats = counts(body)
    # Also add simple ratios title/body
    t_chars, t_words = t_feats[0], t_feats[1]
    b_chars, b_words = b_feats[0], b_feats[1]
    len_ratio_c = (t_chars / (b_chars.replace(0, np.nan))).fillna(0.0).astype(float)
    len_ratio_w = (t_words / (b_words.replace(0, np.nan))).fillna(0.0).astype(float)
    feats = t_feats + b_feats + [len_ratio_c, len_ratio_w]
    F = np.vstack([f.values for f in feats]).T.astype(np.float32)
    return F, [
        't_chars','t_words','t_q','t_e','t_q2','t_e2','t_ell','t_upper_ratio','t_nl','t_has_url','t_has_code','t_list','t_quote','t_digit_ratio','t_uniq_ratio',
        'b_chars','b_words','b_q','b_e','b_q2','b_e2','b_ell','b_upper_ratio','b_nl','b_has_url','b_has_code','b_list','b_quote','b_digit_ratio','b_uniq_ratio',
        'len_ratio_c','len_ratio_w'
    ]

F_tr, feat_names = basic_feats(q_title_tr, q_body_tr)
F_te, _ = basic_feats(q_title_te, q_body_te)

# Build stratified group folds by binned group-level target
y = train[target].values.astype(float)
groups = train['group_key'].values
df_groups = pd.DataFrame({'group': groups, 'y': y})
grp_mean = df_groups.groupby('group')['y'].mean()
bins = pd.qcut(grp_mean, q=10, labels=False, duplicates='drop')
grp_to_bin = dict(zip(grp_mean.index.values, bins.astype(int)))
row_bins = np.array([grp_to_bin[g] for g in groups], dtype=int)

if HAS_SGK:
    splitter = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
    split_iter = splitter.split(X=np.zeros_like(y), y=row_bins, groups=groups)
else:
    # Fallback: GroupKFold (no strat) but we keep the same API
    splitter = GroupKFold(n_splits=5)
    split_iter = splitter.split(X=np.zeros_like(y), y=y, groups=groups)

# TF-IDF configs: keep stopwords (None), Q-only fields, upweight title
cfg_word = dict(analyzer='word', ngram_range=(1,2), sublinear_tf=True, strip_accents='unicode', lowercase=True, min_df=3, stop_words=None)
cfg_char = dict(analyzer='char_wb', ngram_range=(3,6), sublinear_tf=True, min_df=3)
max_feats = {
    'word': {'title': 50000, 'body': 150000},
    'char': {'title': 30000, 'body': 120000},
}
title_weight = 1.8

oof = np.zeros(len(train), dtype=float)
test_preds_folds = []
folds = np.full(len(train), -1, dtype=int)

alphas = [5.0, 10.0, 20.0, 50.0, 100.0, 200.0, 500.0]

for fold, (trn_idx, val_idx) in enumerate(split_iter):
    t0 = time.time()
    folds[val_idx] = fold
    print(f"\n[IMP FOLD {fold}] train={len(trn_idx)} val={len(val_idx)}", flush=True)
    # Vectorize per field on train split (Q-only)
    # Title
    vec_w_t = TfidfVectorizer(max_features=max_feats['word']['title'], **cfg_word)
    Xw_t_tr = vec_w_t.fit_transform(q_title_tr.iloc[trn_idx])
    Xw_t_va = vec_w_t.transform(q_title_tr.iloc[val_idx])
    Xw_t_te = vec_w_t.transform(q_title_te)
    vec_c_t = TfidfVectorizer(max_features=max_feats['char']['title'], **cfg_char)
    Xc_t_tr = vec_c_t.fit_transform(q_title_tr.iloc[trn_idx])
    Xc_t_va = vec_c_t.transform(q_title_tr.iloc[val_idx])
    Xc_t_te = vec_c_t.transform(q_title_te)
    X_t_tr = sparse.hstack([Xw_t_tr, Xc_t_tr], format='csr').multiply(title_weight)
    X_t_va = sparse.hstack([Xw_t_va, Xc_t_va], format='csr').multiply(title_weight)
    X_t_te = sparse.hstack([Xw_t_te, Xc_t_te], format='csr').multiply(title_weight)
    # Body
    vec_w_b = TfidfVectorizer(max_features=max_feats['word']['body'], **cfg_word)
    Xw_b_tr = vec_w_b.fit_transform(q_body_tr.iloc[trn_idx])
    Xw_b_va = vec_w_b.transform(q_body_tr.iloc[val_idx])
    Xw_b_te = vec_w_b.transform(q_body_te)
    vec_c_b = TfidfVectorizer(max_features=max_feats['char']['body'], **cfg_char)
    Xc_b_tr = vec_c_b.fit_transform(q_body_tr.iloc[trn_idx])
    Xc_b_va = vec_c_b.transform(q_body_tr.iloc[val_idx])
    Xc_b_te = vec_c_b.transform(q_body_te)
    X_b_tr = sparse.hstack([Xw_b_tr, Xc_b_tr], format='csr')
    X_b_va = sparse.hstack([Xw_b_va, Xc_b_va], format='csr')
    X_b_te = sparse.hstack([Xw_b_te, Xc_b_te], format='csr')

    # Cheap scalar features: fit scaler (mean/std) on train and apply
    Ft_tr = F_tr[trn_idx]
    Ft_va = F_tr[val_idx]
    Ft_te = F_te
    mu = Ft_tr.mean(axis=0)
    sd = Ft_tr.std(axis=0) + 1e-6
    Ft_tr_z = (Ft_tr - mu) / sd
    Ft_va_z = (Ft_va - mu) / sd
    Ft_te_z = (Ft_te - mu) / sd
    Xf_tr = sparse.csr_matrix(Ft_tr_z, dtype=np.float32)
    Xf_va = sparse.csr_matrix(Ft_va_z, dtype=np.float32)
    Xf_te = sparse.csr_matrix(Ft_te_z, dtype=np.float32)

    # Final design matrices
    X_tr = sparse.hstack([X_t_tr, X_b_tr, Xf_tr], format='csr')
    X_va = sparse.hstack([X_t_va, X_b_va, Xf_va], format='csr')
    X_te = sparse.hstack([X_t_te, X_b_te, Xf_te], format='csr')
    print(f"[IMP FOLD {fold}] Shapes: X_tr={X_tr.shape} X_va={X_va.shape} X_te={X_te.shape}")

    # Ridge with expanded alpha grid
    best_alpha, best_score, best_pred = None, -1e9, None
    for a in alphas:
        model = Ridge(alpha=a, random_state=42)
        model.fit(X_tr, y[trn_idx])
        p = model.predict(X_va)
        sc = spearmanr_safe(y[val_idx], p)
        print(f"[IMP FOLD {fold}] alpha={a:.1f} val_spearman={sc:.5f}")
        if sc > best_score:
            best_alpha, best_score, best_pred = a, sc, p
    oof[val_idx] = best_pred
    model = Ridge(alpha=best_alpha, random_state=42)
    model.fit(X_tr, y[trn_idx])
    test_pred = model.predict(X_te)
    test_preds_folds.append(test_pred)
    print(f"[IMP FOLD {fold}] best_alpha={best_alpha} fold_spearman={best_score:.5f} time={time.time()-t0:.1f}s", flush=True)

cv_score = spearmanr_safe(y, oof)
print(f"\n[IMP CV] OOF Spearman: {cv_score:.5f}")
np.save('oof_ridge_improved.npy', oof)
pd.DataFrame({'qa_id': train[id_col], 'fold': folds, 'oof': oof, 'target': y}).to_csv('oof_ridge_improved.csv', index=False)

test_pred_mean = np.mean(np.vstack(test_preds_folds), axis=0)
test_pred_mean = np.clip(test_pred_mean, 0.0, 1.0)
np.save('test_ridge_improved.npy', test_pred_mean)
sub_imp = pd.DataFrame({id_col: test[id_col], target: test_pred_mean})
sub_imp.to_csv('submission.csv', index=False)
print("[IMP] Saved submission.csv", sub_imp.shape, "head:\n", sub_imp.head())

  list_mark = s2.str.contains(r'(^|\n)[\-\*] ', regex=True).astype(float)
  quote = s2.str.contains(r'(^|\n)\>', regex=True).astype(float)


  list_mark = s2.str.contains(r'(^|\n)[\-\*] ', regex=True).astype(float)
  quote = s2.str.contains(r'(^|\n)\>', regex=True).astype(float)


  list_mark = s2.str.contains(r'(^|\n)[\-\*] ', regex=True).astype(float)
  quote = s2.str.contains(r'(^|\n)\>', regex=True).astype(float)
  list_mark = s2.str.contains(r'(^|\n)[\-\*] ', regex=True).astype(float)
  quote = s2.str.contains(r'(^|\n)\>', regex=True).astype(float)



[IMP FOLD 0] train=4395 val=1076


[IMP FOLD 0] Shapes: X_tr=(4395, 201378) X_va=(1076, 201378) X_te=(608, 201378)


[IMP FOLD 0] alpha=5.0 val_spearman=0.19589


[IMP FOLD 0] alpha=10.0 val_spearman=0.21622


[IMP FOLD 0] alpha=20.0 val_spearman=0.23747


[IMP FOLD 0] alpha=50.0 val_spearman=0.25809


[IMP FOLD 0] alpha=100.0 val_spearman=0.26415


[IMP FOLD 0] alpha=200.0 val_spearman=0.26036


[IMP FOLD 0] alpha=500.0 val_spearman=0.25196


[IMP FOLD 0] best_alpha=100.0 fold_spearman=0.26415 time=11.4s



[IMP FOLD 1] train=4318 val=1153


[IMP FOLD 1] Shapes: X_tr=(4318, 199204) X_va=(1153, 199204) X_te=(608, 199204)


[IMP FOLD 1] alpha=5.0 val_spearman=0.26574


[IMP FOLD 1] alpha=10.0 val_spearman=0.28934


[IMP FOLD 1] alpha=20.0 val_spearman=0.31161


[IMP FOLD 1] alpha=50.0 val_spearman=0.33160


[IMP FOLD 1] alpha=100.0 val_spearman=0.33418


[IMP FOLD 1] alpha=200.0 val_spearman=0.32769


[IMP FOLD 1] alpha=500.0 val_spearman=0.31287


[IMP FOLD 1] best_alpha=100.0 fold_spearman=0.33418 time=11.1s



[IMP FOLD 2] train=4389 val=1082


[IMP FOLD 2] Shapes: X_tr=(4389, 201456) X_va=(1082, 201456) X_te=(608, 201456)


[IMP FOLD 2] alpha=5.0 val_spearman=0.24342


[IMP FOLD 2] alpha=10.0 val_spearman=0.26911


[IMP FOLD 2] alpha=20.0 val_spearman=0.29080


[IMP FOLD 2] alpha=50.0 val_spearman=0.30901


[IMP FOLD 2] alpha=100.0 val_spearman=0.31143


[IMP FOLD 2] alpha=200.0 val_spearman=0.30767


[IMP FOLD 2] alpha=500.0 val_spearman=0.30080


[IMP FOLD 2] best_alpha=100.0 fold_spearman=0.31143 time=11.3s



[IMP FOLD 3] train=4399 val=1072


[IMP FOLD 3] Shapes: X_tr=(4399, 202580) X_va=(1072, 202580) X_te=(608, 202580)


[IMP FOLD 3] alpha=5.0 val_spearman=0.20366


[IMP FOLD 3] alpha=10.0 val_spearman=0.22882


[IMP FOLD 3] alpha=20.0 val_spearman=0.25056


[IMP FOLD 3] alpha=50.0 val_spearman=0.26503


[IMP FOLD 3] alpha=100.0 val_spearman=0.26683


[IMP FOLD 3] alpha=200.0 val_spearman=0.26074


[IMP FOLD 3] alpha=500.0 val_spearman=0.25207


[IMP FOLD 3] best_alpha=100.0 fold_spearman=0.26683 time=11.8s



[IMP FOLD 4] train=4383 val=1088


[IMP FOLD 4] Shapes: X_tr=(4383, 203057) X_va=(1088, 203057) X_te=(608, 203057)


[IMP FOLD 4] alpha=5.0 val_spearman=0.18898


[IMP FOLD 4] alpha=10.0 val_spearman=0.21554


[IMP FOLD 4] alpha=20.0 val_spearman=0.24569


[IMP FOLD 4] alpha=50.0 val_spearman=0.27583


[IMP FOLD 4] alpha=100.0 val_spearman=0.28408


[IMP FOLD 4] alpha=200.0 val_spearman=0.28127


[IMP FOLD 4] alpha=500.0 val_spearman=0.27628


[IMP FOLD 4] best_alpha=100.0 fold_spearman=0.28408 time=11.3s



[IMP CV] OOF Spearman: 0.28889
[IMP] Saved submission.csv (608, 2) head:
    qa_id  question_asker_intent_understanding
0   6516                             0.879147
1   6168                             0.841673
2   8575                             0.925708
3    618                             0.874153
4   3471                             0.925204


In [5]:
# DeBERTa-v3-base 5-fold (Q-only) with StratifiedGroupKFold and Spearman metric
import os, sys, time, shutil, subprocess, math, gc
import numpy as np
import pandas as pd
from pathlib import Path

def pip(*args):
    print("> pip", *args, flush=True)
    subprocess.run([sys.executable, "-m", "pip", *args], check=True)

# Install CUDA 12.1 Torch stack and HF deps (idempotent)
try:
    import torch, transformers
    import accelerate, datasets, evaluate
    print("Torch/Transformers already available:", torch.__version__)
except Exception:
    # Hard reset any prior torch stacks
    for pkg in ("torch","torchvision","torchaudio"):
        subprocess.run([sys.executable, "-m", "pip", "uninstall", "-y", pkg], check=False)
    for d in (
        "/app/.pip-target/torch",
        "/app/.pip-target/torchvision",
        "/app/.pip-target/torchaudio",
        "/app/.pip-target/torchgen",
        "/app/.pip-target/functorch",
    ):
        if os.path.exists(d): shutil.rmtree(d, ignore_errors=True)
    pip("install", "--index-url", "https://download.pytorch.org/whl/cu121", "--extra-index-url", "https://pypi.org/simple", "torch==2.4.1", "torchvision==0.19.1", "torchaudio==2.4.1")
    Path("constraints.txt").write_text("torch==2.4.1\ntorchvision==0.19.1\ntorchaudio==2.4.1\n")
    pip("install", "-c", "constraints.txt", "transformers==4.44.2", "accelerate==0.34.2", "datasets==2.21.0", "evaluate==0.4.2", "sentencepiece", "protobuf<5", "scikit-learn", "--upgrade-strategy", "only-if-needed")
    import torch, transformers, accelerate, datasets, evaluate
    print("torch:", torch.__version__, "CUDA:", getattr(torch.version, "cuda", None), "CUDA avail:", torch.cuda.is_available())
    if torch.cuda.is_available(): print("GPU:", torch.cuda.get_device_name(0))

import torch
from sklearn.model_selection import StratifiedGroupKFold, GroupKFold
from scipy.stats import spearmanr
from transformers import AutoTokenizer, AutoConfig, AutoModelForSequenceClassification, TrainingArguments, Trainer, EarlyStoppingCallback, DataCollatorWithPadding

# Rebuild folds with stratified grouping (reuse logic) to be self-contained
target = 'question_asker_intent_understanding'
y = train[target].values.astype(float)
groups = train['group_key'].values
df_groups = pd.DataFrame({'group': groups, 'y': y})
grp_mean = df_groups.groupby('group')['y'].mean()
bins = pd.qcut(grp_mean, q=10, labels=False, duplicates='drop')
grp_to_bin = dict(zip(grp_mean.index.values, bins.astype(int)))
row_bins = np.array([grp_to_bin[g] for g in groups], dtype=int)
sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
splits = list(sgkf.split(np.zeros_like(y), y=row_bins, groups=groups))

# Tokenization: Q-only, template: [CLS] title [SEP] body [SEP]; never truncate title beyond 64 tokens
model_name = "microsoft/deberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
max_len = 512
title_max = 64

def build_inputs(title_series: pd.Series, body_series: pd.Series):
    titles = title_series.fillna("").astype(str).tolist()
    bodies = body_series.fillna("").astype(str).tolist()
    # Pre-tokenize title with hard cap, then combine with body
    enc_title = tokenizer(titles, add_special_tokens=False, truncation=True, max_length=title_max)
    enc_body = tokenizer(bodies, add_special_tokens=False, truncation=True, max_length=max_len)  # temp
    input_ids, attention_masks = [], []
    for ti, bi in zip(enc_title["input_ids"], enc_body["input_ids"]):
        # Compose: [CLS] title [SEP] body [SEP]
        composed = [tokenizer.cls_token_id] + ti + [tokenizer.sep_token_id] + bi + [tokenizer.sep_token_id]
        composed = composed[:max_len]
        attn = [1]*len(composed)
        # pad
        pad_len = max_len - len(composed)
        if pad_len>0:
            composed = composed + [tokenizer.pad_token_id]*pad_len
            attn = attn + [0]*pad_len
        input_ids.append(composed)
        attention_masks.append(attn)
    return {"input_ids": np.array(input_ids, dtype=np.int64), "attention_mask": np.array(attention_masks, dtype=np.int64)}

class QDataset(torch.utils.data.Dataset):
    def __init__(self, ids, masks, labels=None):
        self.ids = ids; self.masks = masks; self.labels = labels
    def __len__(self): return len(self.ids)
    def __getitem__(self, idx):
        item = {"input_ids": torch.tensor(self.ids[idx]), "attention_mask": torch.tensor(self.masks[idx])}
        if self.labels is not None:
            item["labels"] = torch.tensor(self.labels[idx], dtype=torch.float32)
        return item

def spearman_compute(eval_pred):
    preds, labels = eval_pred
    preds = preds.reshape(-1)
    if np.std(preds)==0: return {"spearman": 0.0}
    return {"spearman": float(spearmanr(labels, preds).correlation)}

config = AutoConfig.from_pretrained(model_name, num_labels=1, problem_type="regression")

all_inputs = build_inputs(train['question_title'], train['question_body'])
test_inputs = build_inputs(test['question_title'], test['question_body'])

oof = np.zeros(len(train), dtype=np.float32)
test_preds = []
folds = np.full(len(train), -1, dtype=int)

for fold, (trn_idx, val_idx) in enumerate(splits):
    t0 = time.time()
    folds[val_idx] = fold
    print(f"\n[T-FOLD {fold}] train={len(trn_idx)} val={len(val_idx)}", flush=True)
    tr_ds = QDataset(all_inputs['input_ids'][trn_idx], all_inputs['attention_mask'][trn_idx], y[trn_idx])
    va_ds = QDataset(all_inputs['input_ids'][val_idx], all_inputs['attention_mask'][val_idx], y[val_idx])
    te_ds = QDataset(test_inputs['input_ids'], test_inputs['attention_mask'], None)

    model = AutoModelForSequenceClassification.from_pretrained(model_name, config=config)
    args = TrainingArguments(
        output_dir=f"tfm_fold{fold}",
        num_train_epochs=3,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=32,
        gradient_accumulation_steps=1,
        learning_rate=2e-5,
        weight_decay=0.01,
        warmup_ratio=0.1,
        lr_scheduler_type="linear",
        fp16=True,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="spearman",
        greater_is_better=True,
        logging_steps=50,
        report_to=[]
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=tr_ds,
        eval_dataset=va_ds,
        tokenizer=tokenizer,
        compute_metrics=spearman_compute,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=1)],
        data_collator=DataCollatorWithPadding(tokenizer, pad_to_multiple_of=8)
    )
    trainer.train()
    # OOF preds
    val_out = trainer.predict(va_ds).predictions.reshape(-1)
    oof[val_idx] = val_out
    fold_score = spearmanr(y[val_idx], val_out).correlation
    print(f"[T-FOLD {fold}] val Spearman={fold_score:.5f} time={time.time()-t0:.1f}s", flush=True)
    # Test preds
    te_out = trainer.predict(te_ds).predictions.reshape(-1)
    test_preds.append(te_out)
    # Cleanup to free VRAM
    del trainer, model; gc.collect()
    if torch.cuda.is_available(): torch.cuda.empty_cache()

cv = spearmanr(y, oof).correlation
print(f"\n[T-CV] OOF Spearman: {cv:.5f}")
np.save('oof_deberta.npy', oof)
pd.DataFrame({'qa_id': train[id_col], 'fold': folds, 'oof': oof, 'target': y}).to_csv('oof_deberta.csv', index=False)

test_mean = np.mean(np.vstack(test_preds), axis=0)
np.save('test_deberta.npy', test_mean)
sub_tfm = pd.DataFrame({id_col: test[id_col], target: test_mean})
sub_tfm.to_csv('submission_deberta.csv', index=False)
print("Saved submission_deberta.csv", sub_tfm.shape, "head:\n", sub_tfm.head())
print("[Done] DeBERTa training+inference complete.")





> pip install --index-url https://download.pytorch.org/whl/cu121 --extra-index-url https://pypi.org/simple torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1




Looking in indexes: https://download.pytorch.org/whl/cu121, https://pypi.org/simple


Collecting torch==2.4.1
  Downloading https://download.pytorch.org/whl/cu121/torch-2.4.1%2Bcu121-cp311-cp311-linux_x86_64.whl (799.0 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 799.0/799.0 MB 471.1 MB/s eta 0:00:00


Collecting torchvision==0.19.1
  Downloading https://download.pytorch.org/whl/cu121/torchvision-0.19.1%2Bcu121-cp311-cp311-linux_x86_64.whl (7.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.1/7.1 MB 356.6 MB/s eta 0:00:00


Collecting torchaudio==2.4.1
  Downloading https://download.pytorch.org/whl/cu121/torchaudio-2.4.1%2Bcu121-cp311-cp311-linux_x86_64.whl (3.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.4/3.4 MB 441.8 MB/s eta 0:00:00


Collecting nvidia-nvtx-cu12==12.1.105
  Downloading nvidia_nvtx_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (99 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 99.1/99.1 KB 5.7 MB/s eta 0:00:00
Collecting nvidia-cuda-nvrtc-cu12==12.1.105
  Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23.7/23.7 MB 201.0 MB/s eta 0:00:00


Collecting triton==3.0.0
  Downloading triton-3.0.0-1-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (209.4 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 209.4/209.4 MB 223.5 MB/s eta 0:00:00


Collecting nvidia-cuda-cupti-cu12==12.1.105
  Downloading nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.1/14.1 MB 244.4 MB/s eta 0:00:00


Collecting nvidia-cusolver-cu12==11.4.5.107
  Downloading nvidia_cusolver_cu12-11.4.5.107-py3-none-manylinux1_x86_64.whl (124.2 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 124.2/124.2 MB 226.8 MB/s eta 0:00:00


Collecting networkx
  Downloading networkx-3.5-py3-none-any.whl (2.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.0/2.0 MB 432.2 MB/s eta 0:00:00


Collecting sympy
  Downloading sympy-1.14.0-py3-none-any.whl (6.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.3/6.3 MB 219.7 MB/s eta 0:00:00
Collecting fsspec
  Downloading fsspec-2025.9.0-py3-none-any.whl (199 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 199.3/199.3 KB 490.3 MB/s eta 0:00:00


Collecting nvidia-cuda-runtime-cu12==12.1.105
  Downloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 823.6/823.6 KB 428.4 MB/s eta 0:00:00


Collecting nvidia-cudnn-cu12==9.1.0.70
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl (664.8 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 664.8/664.8 MB 106.0 MB/s eta 0:00:00


Collecting typing-extensions>=4.8.0
  Downloading typing_extensions-4.15.0-py3-none-any.whl (44 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44.6/44.6 KB 389.9 MB/s eta 0:00:00


Collecting nvidia-cublas-cu12==12.1.3.1
  Downloading nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl (410.6 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 410.6/410.6 MB 216.9 MB/s eta 0:00:00


Collecting jinja2
  Downloading jinja2-3.1.6-py3-none-any.whl (134 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 134.9/134.9 KB 483.3 MB/s eta 0:00:00


Collecting nvidia-cusparse-cu12==12.1.0.106
  Downloading nvidia_cusparse_cu12-12.1.0.106-py3-none-manylinux1_x86_64.whl (196.0 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 196.0/196.0 MB 222.7 MB/s eta 0:00:00


Collecting filelock
  Downloading filelock-3.19.1-py3-none-any.whl (15 kB)


Collecting nvidia-nccl-cu12==2.20.5
  Downloading nvidia_nccl_cu12-2.20.5-py3-none-manylinux2014_x86_64.whl (176.2 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 176.2/176.2 MB 224.4 MB/s eta 0:00:00


Collecting nvidia-cufft-cu12==11.0.2.54
  Downloading nvidia_cufft_cu12-11.0.2.54-py3-none-manylinux1_x86_64.whl (121.6 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 121.6/121.6 MB 242.5 MB/s eta 0:00:00


Collecting nvidia-curand-cu12==10.3.2.106
  Downloading nvidia_curand_cu12-10.3.2.106-py3-none-manylinux1_x86_64.whl (56.5 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.5/56.5 MB 175.9 MB/s eta 0:00:00


Collecting pillow!=8.3.*,>=5.3.0
  Downloading pillow-11.3.0-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (6.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.6/6.6 MB 403.1 MB/s eta 0:00:00


Collecting numpy
  Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18.3/18.3 MB 318.5 MB/s eta 0:00:00


Collecting nvidia-nvjitlink-cu12
  Downloading nvidia_nvjitlink_cu12-12.9.86-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (39.7 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39.7/39.7 MB 191.9 MB/s eta 0:00:00


Collecting MarkupSafe>=2.0
  Downloading markupsafe-3.0.3-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (22 kB)


Collecting mpmath<1.4,>=1.1.0
  Downloading mpmath-1.3.0-py3-none-any.whl (536 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 536.2/536.2 KB 331.9 MB/s eta 0:00:00


Installing collected packages: mpmath, typing-extensions, sympy, pillow, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufft-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, numpy, networkx, MarkupSafe, fsspec, filelock, triton, nvidia-cusparse-cu12, nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, torch, torchvision, torchaudio


Successfully installed MarkupSafe-3.0.3 filelock-3.19.1 fsspec-2025.9.0 jinja2-3.1.6 mpmath-1.3.0 networkx-3.5 numpy-1.26.4 nvidia-cublas-cu12-12.1.3.1 nvidia-cuda-cupti-cu12-12.1.105 nvidia-cuda-nvrtc-cu12-12.1.105 nvidia-cuda-runtime-cu12-12.1.105 nvidia-cudnn-cu12-9.1.0.70 nvidia-cufft-cu12-11.0.2.54 nvidia-curand-cu12-10.3.2.106 nvidia-cusolver-cu12-11.4.5.107 nvidia-cusparse-cu12-12.1.0.106 nvidia-nccl-cu12-2.20.5 nvidia-nvjitlink-cu12-12.9.86 nvidia-nvtx-cu12-12.1.105 pillow-11.3.0 sympy-1.14.0 torch-2.4.1+cu121 torchaudio-2.4.1+cu121 torchvision-0.19.1+cu121 triton-3.0.0 typing-extensions-4.15.0


> pip install -c constraints.txt transformers==4.44.2 accelerate==0.34.2 datasets==2.21.0 evaluate==0.4.2 sentencepiece protobuf<5 scikit-learn --upgrade-strategy only-if-needed


Collecting transformers==4.44.2
  Downloading transformers-4.44.2-py3-none-any.whl (9.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9.5/9.5 MB 101.5 MB/s eta 0:00:00
Collecting accelerate==0.34.2
  Downloading accelerate-0.34.2-py3-none-any.whl (324 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 324.4/324.4 KB 94.0 MB/s eta 0:00:00


Collecting datasets==2.21.0
  Downloading datasets-2.21.0-py3-none-any.whl (527 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 527.3/527.3 KB 341.2 MB/s eta 0:00:00
Collecting evaluate==0.4.2
  Downloading evaluate-0.4.2-py3-none-any.whl (84 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 84.1/84.1 KB 422.4 MB/s eta 0:00:00
Collecting sentencepiece
  Downloading sentencepiece-0.2.1-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (1.4 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.4/1.4 MB 77.3 MB/s eta 0:00:00


Collecting protobuf<5
  Downloading protobuf-4.25.8-cp37-abi3-manylinux2014_x86_64.whl (294 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 294.9/294.9 KB 499.0 MB/s eta 0:00:00
Collecting scikit-learn
  Downloading scikit_learn-1.7.2-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (9.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9.7/9.7 MB 219.0 MB/s eta 0:00:00


Collecting regex!=2019.12.17
  Downloading regex-2025.9.18-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (798 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 799.0/799.0 KB 494.6 MB/s eta 0:00:00
Collecting huggingface-hub<1.0,>=0.23.2
  Downloading huggingface_hub-0.35.1-py3-none-any.whl (563 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 563.3/563.3 KB 132.5 MB/s eta 0:00:00


Collecting safetensors>=0.4.1
  Downloading safetensors-0.6.2-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (485 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 485.8/485.8 KB 316.6 MB/s eta 0:00:00
Collecting packaging>=20.0
  Downloading packaging-25.0-py3-none-any.whl (66 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 66.5/66.5 KB 422.3 MB/s eta 0:00:00


Collecting tokenizers<0.20,>=0.19
  Downloading tokenizers-0.19.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.6/3.6 MB 154.6 MB/s eta 0:00:00
Collecting tqdm>=4.27
  Downloading tqdm-4.67.1-py3-none-any.whl (78 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78.5/78.5 KB 464.7 MB/s eta 0:00:00


Collecting numpy>=1.17
  Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18.3/18.3 MB 196.3 MB/s eta 0:00:00
Collecting pyyaml>=5.1
  Downloading pyyaml-6.0.3-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (806 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 806.6/806.6 KB 512.2 MB/s eta 0:00:00


Collecting filelock
  Downloading filelock-3.19.1-py3-none-any.whl (15 kB)
Collecting requests
  Downloading requests-2.32.5-py3-none-any.whl (64 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 64.7/64.7 KB 381.4 MB/s eta 0:00:00
Collecting torch>=1.10.0
  Downloading torch-2.4.1-cp311-cp311-manylinux1_x86_64.whl (797.1 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 797.1/797.1 MB 272.6 MB/s eta 0:00:00


Collecting psutil
  Downloading psutil-7.1.0-cp36-abi3-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (291 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 291.2/291.2 KB 551.7 MB/s eta 0:00:00
Collecting dill<0.3.9,>=0.3.0
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 116.3/116.3 KB 424.0 MB/s eta 0:00:00


Collecting xxhash
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 194.8/194.8 KB 505.7 MB/s eta 0:00:00
Collecting multiprocess
  Downloading multiprocess-0.70.18-py311-none-any.whl (144 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 144.5/144.5 KB 470.9 MB/s eta 0:00:00


Collecting pandas
  Downloading pandas-2.3.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.4/12.4 MB 115.0 MB/s eta 0:00:00
Collecting fsspec[http]<=2024.6.1,>=2023.1.0
  Downloading fsspec-2024.6.1-py3-none-any.whl (177 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 177.6/177.6 KB 475.3 MB/s eta 0:00:00


Collecting aiohttp
  Downloading aiohttp-3.12.15-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 317.2 MB/s eta 0:00:00
Collecting pyarrow>=15.0.0
  Downloading pyarrow-21.0.0-cp311-cp311-manylinux_2_28_x86_64.whl (42.8 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42.8/42.8 MB 315.5 MB/s eta 0:00:00
Collecting threadpoolctl>=3.1.0
  Downloading threadpoolctl-3.6.0-py3-none-any.whl (18 kB)


Collecting joblib>=1.2.0
  Downloading joblib-1.5.2-py3-none-any.whl (308 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 308.4/308.4 KB 507.6 MB/s eta 0:00:00
Collecting scipy>=1.8.0
  Downloading scipy-1.16.2-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (35.9 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 35.9/35.9 MB 213.9 MB/s eta 0:00:00
Collecting aiohappyeyeballs>=2.5.0
  Downloading aiohappyeyeballs-2.6.1-py3-none-any.whl (15 kB)


Collecting yarl<2.0,>=1.17.0
  Downloading yarl-1.20.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (348 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 349.0/349.0 KB 542.9 MB/s eta 0:00:00
Collecting aiosignal>=1.4.0
  Downloading aiosignal-1.4.0-py3-none-any.whl (7.5 kB)
Collecting propcache>=0.2.0
  Downloading propcache-0.3.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (213 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 213.5/213.5 KB 518.7 MB/s eta 0:00:00


Collecting multidict<7.0,>=4.5
  Downloading multidict-6.6.4-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (246 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 246.7/246.7 KB 515.6 MB/s eta 0:00:00
Collecting attrs>=17.3.0
  Downloading attrs-25.3.0-py3-none-any.whl (63 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 63.8/63.8 KB 451.3 MB/s eta 0:00:00
Collecting frozenlist>=1.1.1
  Downloading frozenlist-1.7.0-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (235 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 235.3/235.3 KB 522.3 MB/s eta 0:00:00


Collecting typing-extensions>=3.7.4.3
  Downloading typing_extensions-4.15.0-py3-none-any.whl (44 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44.6/44.6 KB 343.8 MB/s eta 0:00:00
Collecting hf-xet<2.0.0,>=1.1.3
  Downloading hf_xet-1.1.10-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.2/3.2 MB 316.0 MB/s eta 0:00:00
Collecting idna<4,>=2.5
  Downloading idna-3.10-py3-none-any.whl (70 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 70.4/70.4 KB 378.1 MB/s eta 0:00:00


Collecting charset_normalizer<4,>=2
  Downloading charset_normalizer-3.4.3-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (150 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 150.3/150.3 KB 515.2 MB/s eta 0:00:00
Collecting urllib3<3,>=1.21.1
  Downloading urllib3-2.5.0-py3-none-any.whl (129 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 129.8/129.8 KB 488.9 MB/s eta 0:00:00
Collecting certifi>=2017.4.17
  Downloading certifi-2025.8.3-py3-none-any.whl (161 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 161.2/161.2 KB 521.9 MB/s eta 0:00:00
Collecting jinja2
  Downloading jinja2-3.1.6-py3-none-any.whl (134 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 134.9/134.9 KB 459.6 MB/s eta 0:00:00


Collecting nvidia-cublas-cu12==12.1.3.1
  Downloading nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl (410.6 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 410.6/410.6 MB 150.1 MB/s eta 0:00:00


Collecting nvidia-curand-cu12==10.3.2.106
  Downloading nvidia_curand_cu12-10.3.2.106-py3-none-manylinux1_x86_64.whl (56.5 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.5/56.5 MB 145.2 MB/s eta 0:00:00
Collecting nvidia-nvtx-cu12==12.1.105
  Downloading nvidia_nvtx_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (99 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 99.1/99.1 KB 474.8 MB/s eta 0:00:00
Collecting nvidia-cusparse-cu12==12.1.0.106
  Downloading nvidia_cusparse_cu12-12.1.0.106-py3-none-manylinux1_x86_64.whl (196.0 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 196.0/196.0 MB 184.0 MB/s eta 0:00:00
Collecting nvidia-nccl-cu12==2.20.5
  Downloading nvidia_nccl_cu12-2.20.5-py3-none-manylinux2014_x86_64.whl (176.2 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 176.2/176.2 MB 258.3 MB/s eta 0:00:00
Collecting networkx
  Downloading networkx-3.5-py3-none-any.whl (2.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.0/2.0 MB 201.0 MB/s eta 0:00:00
Collecting nvidia-cufft-cu12==11.0.2.54
  Downloading nvidia_cufft_cu12-11.0.2.54-py3-none-manylinux1_x86_64.whl (121.6 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 121.6/121.6 MB 119.6 MB/s eta 0:00:00
Collecting nvidia-cuda-nvrtc-cu12==12.1.105
  Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23.7/23.7 MB 208.5 MB/s eta 0:00:00


Collecting sympy
  Downloading sympy-1.14.0-py3-none-any.whl (6.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.3/6.3 MB 256.0 MB/s eta 0:00:00
Collecting triton==3.0.0
  Downloading triton-3.0.0-1-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (209.4 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 209.4/209.4 MB 119.3 MB/s eta 0:00:00
Collecting nvidia-cudnn-cu12==9.1.0.70
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl (664.8 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 664.8/664.8 MB 71.4 MB/s eta 0:00:00


Collecting nvidia-cuda-runtime-cu12==12.1.105
  Downloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 823.6/823.6 KB 539.2 MB/s eta 0:00:00
Collecting nvidia-cuda-cupti-cu12==12.1.105
  Downloading nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.1/14.1 MB 191.0 MB/s eta 0:00:00
Collecting nvidia-cusolver-cu12==11.4.5.107
  Downloading nvidia_cusolver_cu12-11.4.5.107-py3-none-manylinux1_x86_64.whl (124.2 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 124.2/124.2 MB 147.2 MB/s eta 0:00:00
Collecting nvidia-nvjitlink-cu12
  Downloading nvidia_nvjitlink_cu12-12.9.86-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (39.7 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39.7/39.7 MB 148.8 MB/s eta 0:00:00
Collecting multiprocess
  Downloading multiprocess-0.70.17-py311-none-any.whl (144 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 144.3/144.3 KB 458.6 MB/s eta 0:00:00
  Downloading multiprocess-0.70.16-py311-none-any.whl (143 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 143.5/143.5 KB 453.6 MB/s eta 0:00:00
Collecting tzdata>=2022.7
  Downloading tzdata-2025.2-py2.py3-none-any.whl (347 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 347.8/347.8 KB 521.6 MB/s eta 0:00:00
Collecting python-dateutil>=2.8.2
  Downloading python_dateutil-2.9.0.post0-py2.py3-none-any.whl (229 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 229.9/229.9 KB 497.2 MB/s eta 0:00:00


Collecting pytz>=2020.1
  Downloading pytz-2025.2-py2.py3-none-any.whl (509 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 509.2/509.2 KB 539.7 MB/s eta 0:00:00
Collecting six>=1.5
  Downloading six-1.17.0-py2.py3-none-any.whl (11 kB)


Collecting MarkupSafe>=2.0
  Downloading markupsafe-3.0.3-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (22 kB)
Collecting mpmath<1.4,>=1.1.0
  Downloading mpmath-1.3.0-py3-none-any.whl (536 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 536.2/536.2 KB 286.2 MB/s eta 0:00:00


Installing collected packages: pytz, mpmath, xxhash, urllib3, tzdata, typing-extensions, tqdm, threadpoolctl, sympy, six, sentencepiece, safetensors, regex, pyyaml, pyarrow, psutil, protobuf, propcache, packaging, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufft-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, numpy, networkx, multidict, MarkupSafe, joblib, idna, hf-xet, fsspec, frozenlist, filelock, dill, charset_normalizer, certifi, attrs, aiohappyeyeballs, yarl, triton, scipy, requests, python-dateutil, nvidia-cusparse-cu12, nvidia-cudnn-cu12, multiprocess, jinja2, aiosignal, scikit-learn, pandas, nvidia-cusolver-cu12, huggingface-hub, aiohttp, torch, tokenizers, transformers, datasets, accelerate, evaluate


Successfully installed MarkupSafe-3.0.3 accelerate-0.34.2 aiohappyeyeballs-2.6.1 aiohttp-3.12.15 aiosignal-1.4.0 attrs-25.3.0 certifi-2025.8.3 charset_normalizer-3.4.3 datasets-2.21.0 dill-0.3.8 evaluate-0.4.2 filelock-3.19.1 frozenlist-1.7.0 fsspec-2024.6.1 hf-xet-1.1.10 huggingface-hub-0.35.1 idna-3.10 jinja2-3.1.6 joblib-1.5.2 mpmath-1.3.0 multidict-6.6.4 multiprocess-0.70.16 networkx-3.5 numpy-1.26.4 nvidia-cublas-cu12-12.1.3.1 nvidia-cuda-cupti-cu12-12.1.105 nvidia-cuda-nvrtc-cu12-12.1.105 nvidia-cuda-runtime-cu12-12.1.105 nvidia-cudnn-cu12-9.1.0.70 nvidia-cufft-cu12-11.0.2.54 nvidia-curand-cu12-10.3.2.106 nvidia-cusolver-cu12-11.4.5.107 nvidia-cusparse-cu12-12.1.0.106 nvidia-nccl-cu12-2.20.5 nvidia-nvjitlink-cu12-12.9.86 nvidia-nvtx-cu12-12.1.105 packaging-25.0 pandas-2.3.2 propcache-0.3.2 protobuf-4.25.8 psutil-7.1.0 pyarrow-21.0.0 python-dateutil-2.9.0.post0 pytz-2025.2 pyyaml-6.0.3 regex-2025.9.18 requests-2.32.5 safetensors-0.6.2 scikit-learn-1.7.2 scipy-1.16.2 sentencepiece-









  from .autonotebook import tqdm as notebook_tqdm


torch: 2.4.1+cu121 CUDA: 12.1 CUDA avail: True
GPU: NVIDIA A10-24Q





[T-FOLD 0] train=4395 val=1076


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


Epoch,Training Loss,Validation Loss


[T-FOLD 0] val Spearman=0.37297 time=444.9s



[T-FOLD 1] train=4318 val=1153


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


Epoch,Training Loss,Validation Loss


[T-FOLD 1] val Spearman=0.40783 time=443.6s



[T-FOLD 2] train=4389 val=1082


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


Epoch,Training Loss,Validation Loss


[T-FOLD 2] val Spearman=0.39341 time=452.4s



[T-FOLD 3] train=4399 val=1072


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


Epoch,Training Loss,Validation Loss


[T-FOLD 3] val Spearman=0.38784 time=450.0s



[T-FOLD 4] train=4383 val=1088


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


Epoch,Training Loss,Validation Loss


[T-FOLD 4] val Spearman=0.38819 time=453.1s



[T-CV] OOF Spearman: 0.37979
Saved submission_deberta.csv (608, 2) head:
    qa_id  question_asker_intent_understanding
0   6516                             1.013086
1   6168                             0.905762
2   8575                             1.046875
3    618                             0.943848
4   3471                             1.010742
[Done] DeBERTa training+inference complete.


In [7]:
# Ridge Model B: add answer (low weight), host/category one-hot, title-body cosine; rank-blend with Model A
import time
import numpy as np
import pandas as pd
from scipy import sparse
from scipy.stats import spearmanr
from sklearn.model_selection import StratifiedGroupKFold, GroupKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import Ridge
from pathlib import Path

def spearmanr_safe(y_true, y_pred):
    if np.std(y_pred) == 0:
        return 0.0
    return spearmanr(y_true, y_pred).correlation

# Folds: reuse stratified group folds on binned group mean target
target = 'question_asker_intent_understanding'
y = train[target].values.astype(float)
groups = train['group_key'].values
df_groups = pd.DataFrame({'group': groups, 'y': y})
grp_mean = df_groups.groupby('group')['y'].mean()
bins = pd.qcut(grp_mean, q=10, labels=False, duplicates='drop')
grp_to_bin = dict(zip(grp_mean.index.values, bins.astype(int)))
row_bins = np.array([grp_to_bin[g] for g in groups], dtype=int)
try:
    sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
    splits = list(sgkf.split(np.zeros_like(y), y=row_bins, groups=groups))
except Exception:
    gkf = GroupKFold(n_splits=5)
    splits = list(gkf.split(np.zeros_like(y), y=y, groups=groups))

# Text fields
title_tr = train['question_title'].fillna('').astype(str)
body_tr  = train['question_body'].fillna('').astype(str)
ans_tr   = train['answer'].fillna('').astype(str)
title_te = test['question_title'].fillna('').astype(str)
body_te  = test['question_body'].fillna('').astype(str)
ans_te   = test['answer'].fillna('').astype(str)

# Categorical meta
meta_tr = train[['host','category']].astype(str).fillna('') if {'host','category'}.issubset(train.columns) else None
meta_te = test[['host','category']].astype(str).fillna('') if {'host','category'}.issubset(test.columns) else None

# Configs
cfg_word = dict(analyzer='word', ngram_range=(1,2), sublinear_tf=True, strip_accents='unicode', lowercase=True, min_df=2, stop_words=None)
cfg_char = dict(analyzer='char_wb', ngram_range=(3,6), sublinear_tf=True, min_df=2)
max_feats = {
    'word': {'title': 50000, 'body': 150000, 'answer': 120000},
    'char': {'title': 30000, 'body': 120000, 'answer': 80000},
}
title_weight = 2.0
answer_weight = 0.6
alphas = [10.0, 20.0, 50.0, 100.0, 200.0, 500.0]

oof_b = np.zeros(len(train), dtype=float)
test_preds_b = []
folds_idx = np.full(len(train), -1, dtype=int)

for fold, (trn_idx, val_idx) in enumerate(splits):
    t0 = time.time()
    folds_idx[val_idx] = fold
    print(f"\n[RIDGE-B FOLD {fold}] train={len(trn_idx)} val={len(val_idx)}", flush=True)
    # Title vectors
    vec_w_t = TfidfVectorizer(max_features=max_feats['word']['title'], **cfg_word)
    Xw_t_tr = vec_w_t.fit_transform(title_tr.iloc[trn_idx]); Xw_t_va = vec_w_t.transform(title_tr.iloc[val_idx]); Xw_t_te = vec_w_t.transform(title_te)
    vec_c_t = TfidfVectorizer(max_features=max_feats['char']['title'], **cfg_char)
    Xc_t_tr = vec_c_t.fit_transform(title_tr.iloc[trn_idx]); Xc_t_va = vec_c_t.transform(title_tr.iloc[val_idx]); Xc_t_te = vec_c_t.transform(title_te)
    X_t_tr = sparse.hstack([Xw_t_tr, Xc_t_tr], format='csr').multiply(title_weight)
    X_t_va = sparse.hstack([Xw_t_va, Xc_t_va], format='csr').multiply(title_weight)
    X_t_te = sparse.hstack([Xw_t_te, Xc_t_te], format='csr').multiply(title_weight)
    # Body vectors
    vec_w_b = TfidfVectorizer(max_features=max_feats['word']['body'], **cfg_word)
    Xw_b_tr = vec_w_b.fit_transform(body_tr.iloc[trn_idx]); Xw_b_va = vec_w_b.transform(body_tr.iloc[val_idx]); Xw_b_te = vec_w_b.transform(body_te)
    vec_c_b = TfidfVectorizer(max_features=max_feats['char']['body'], **cfg_char)
    Xc_b_tr = vec_c_b.fit_transform(body_tr.iloc[trn_idx]); Xc_b_va = vec_c_b.transform(body_tr.iloc[val_idx]); Xc_b_te = vec_c_b.transform(body_te)
    X_b_tr = sparse.hstack([Xw_b_tr, Xc_b_tr], format='csr')
    X_b_va = sparse.hstack([Xw_b_va, Xc_b_va], format='csr')
    X_b_te = sparse.hstack([Xw_b_te, Xc_b_te], format='csr')
    # Answer vectors (low weight)
    vec_w_a = TfidfVectorizer(max_features=max_feats['word']['answer'], **cfg_word)
    Xw_a_tr = vec_w_a.fit_transform(ans_tr.iloc[trn_idx]); Xw_a_va = vec_w_a.transform(ans_tr.iloc[val_idx]); Xw_a_te = vec_w_a.transform(ans_te)
    vec_c_a = TfidfVectorizer(max_features=max_feats['char']['answer'], **cfg_char)
    Xc_a_tr = vec_c_a.fit_transform(ans_tr.iloc[trn_idx]); Xc_a_va = vec_c_a.transform(ans_tr.iloc[val_idx]); Xc_a_te = vec_c_a.transform(ans_te)
    X_a_tr = sparse.hstack([Xw_a_tr, Xc_a_tr], format='csr').multiply(answer_weight)
    X_a_va = sparse.hstack([Xw_a_va, Xc_a_va], format='csr').multiply(answer_weight)
    X_a_te = sparse.hstack([Xw_a_te, Xc_a_te], format='csr').multiply(answer_weight)
    # Cosine similarity between title and body (shared small vectorizer)
    vec_cos = TfidfVectorizer(max_features=50000, analyzer='word', ngram_range=(1,2), min_df=2, strip_accents='unicode', lowercase=True, sublinear_tf=True)
    V_tr = vec_cos.fit_transform(pd.concat([title_tr.iloc[trn_idx], body_tr.iloc[trn_idx]]))
    Vt_tr = vec_cos.transform(title_tr.iloc[trn_idx]); Vb_tr = vec_cos.transform(body_tr.iloc[trn_idx])
    Vt_va = vec_cos.transform(title_tr.iloc[val_idx]); Vb_va = vec_cos.transform(body_tr.iloc[val_idx])
    Vt_te = vec_cos.transform(title_te); Vb_te = vec_cos.transform(body_te)
    # Vectors are L2-normalized by default in TfidfVectorizer, cosine = dot product
    cos_va = np.asarray((Vt_va.multiply(Vb_va)).sum(axis=1)).ravel().astype(np.float32)
    cos_tr = np.asarray((Vt_tr.multiply(Vb_tr)).sum(axis=1)).ravel().astype(np.float32)
    cos_te = np.asarray((Vt_te.multiply(Vb_te)).sum(axis=1)).ravel().astype(np.float32)
    Xcos_tr = sparse.csr_matrix(cos_tr[:, None]); Xcos_va = sparse.csr_matrix(cos_va[:, None]); Xcos_te = sparse.csr_matrix(cos_te[:, None])
    # One-hot host/category
    if meta_tr is not None:
        try:
            ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=True)
        except TypeError:
            ohe = OneHotEncoder(handle_unknown='ignore')
        M_tr = ohe.fit_transform(meta_tr.iloc[trn_idx])
        M_va = ohe.transform(meta_tr.iloc[val_idx])
        M_te = ohe.transform(meta_te)
    else:
        M_tr = sparse.csr_matrix((len(trn_idx), 0)); M_va = sparse.csr_matrix((len(val_idx), 0)); M_te = sparse.csr_matrix((len(test), 0))
    # Final design matrices
    X_tr = sparse.hstack([X_t_tr, X_b_tr, X_a_tr, Xcos_tr, M_tr], format='csr')
    X_va = sparse.hstack([X_t_va, X_b_va, X_a_va, Xcos_va, M_va], format='csr')
    X_te = sparse.hstack([X_t_te, X_b_te, X_a_te, Xcos_te, M_te], format='csr')
    print(f"[RIDGE-B FOLD {fold}] Shapes: X_tr={X_tr.shape} X_va={X_va.shape} X_te={X_te.shape}")
    # Ridge fit
    best_alpha, best_score, best_pred = None, -1e9, None
    for a in alphas:
        mdl = Ridge(alpha=a, random_state=42)
        mdl.fit(X_tr, y[trn_idx])
        p = mdl.predict(X_va)
        sc = spearmanr_safe(y[val_idx], p)
        print(f"[RIDGE-B FOLD {fold}] alpha={a:.1f} val_spearman={sc:.5f}")
        if sc > best_score:
            best_alpha, best_score, best_pred = a, sc, p
    oof_b[val_idx] = best_pred
    mdl = Ridge(alpha=best_alpha, random_state=42)
    mdl.fit(X_tr, y[trn_idx])
    test_preds_b.append(mdl.predict(X_te))
    print(f"[RIDGE-B FOLD {fold}] best_alpha={best_alpha} fold_spearman={best_score:.5f} time={time.time()-t0:.1f}s", flush=True)

cv_b = spearmanr_safe(y, oof_b)
print(f"\n[RIDGE-B CV] OOF Spearman: {cv_b:.5f}")
np.save('oof_ridge_b.npy', oof_b)
pd.DataFrame({'qa_id': train[id_col], 'fold': folds_idx, 'oof': oof_b, 'target': y}).to_csv('oof_ridge_b.csv', index=False)
test_b = np.mean(np.vstack(test_preds_b), axis=0)
np.save('test_ridge_b.npy', test_b)

# Rank-ensemble Model A (saved) and Model B
oof_a = np.load('oof_ridge_improved.npy') if Path('oof_ridge_improved.npy').exists() else np.load('oof_ridge.npy')
test_a = np.load('test_ridge_improved.npy') if Path('test_ridge_improved.npy').exists() else np.load('test_ridge.npy')

def rank01(x):
    xr = pd.Series(x).rank(method='average')
    return (xr - xr.min()) / (xr.max() - xr.min() + 1e-9)

rA = rank01(oof_a); rB = rank01(oof_b)
best_w, best_sc = None, -1e9
for w in [0.4, 0.5, 0.6]:
    blend = w*rA + (1-w)*rB
    sc = spearmanr_safe(y, blend.values)
    print(f"[RANK BLEND] w={w:.2f} OOF Spearman={sc:.5f}")
    if sc > best_sc: best_sc, best_w = sc, w
print(f"[RANK BLEND] best_w={best_w} best_OOF={best_sc:.5f}")

tA = rank01(test_a); tB = rank01(test_b)
test_blend = best_w * tA.values + (1-best_w) * tB.values
sub_blend = pd.DataFrame({id_col: test[id_col], target: test_blend})
sub_blend.to_csv('submission_ridge_blend.csv', index=False)
print("Saved submission_ridge_blend.csv", sub_blend.shape, "head:\n", sub_blend.head())


[RIDGE-B FOLD 0] train=4395 val=1076


[RIDGE-B FOLD 0] Shapes: X_tr=(4395, 417355) X_va=(1076, 417355) X_te=(608, 417355)


[RIDGE-B FOLD 0] alpha=10.0 val_spearman=0.26476


[RIDGE-B FOLD 0] alpha=20.0 val_spearman=0.28404


[RIDGE-B FOLD 0] alpha=50.0 val_spearman=0.30681


[RIDGE-B FOLD 0] alpha=100.0 val_spearman=0.31705


[RIDGE-B FOLD 0] alpha=200.0 val_spearman=0.31951


[RIDGE-B FOLD 0] alpha=500.0 val_spearman=0.31771


[RIDGE-B FOLD 0] best_alpha=200.0 fold_spearman=0.31951 time=17.7s



[RIDGE-B FOLD 1] train=4318 val=1153


[RIDGE-B FOLD 1] Shapes: X_tr=(4318, 413206) X_va=(1153, 413206) X_te=(608, 413206)


[RIDGE-B FOLD 1] alpha=10.0 val_spearman=0.31811


[RIDGE-B FOLD 1] alpha=20.0 val_spearman=0.33660


[RIDGE-B FOLD 1] alpha=50.0 val_spearman=0.35237


[RIDGE-B FOLD 1] alpha=100.0 val_spearman=0.35444


[RIDGE-B FOLD 1] alpha=200.0 val_spearman=0.34816


[RIDGE-B FOLD 1] alpha=500.0 val_spearman=0.33523


[RIDGE-B FOLD 1] best_alpha=100.0 fold_spearman=0.35444 time=17.4s



[RIDGE-B FOLD 2] train=4389 val=1082


[RIDGE-B FOLD 2] Shapes: X_tr=(4389, 416977) X_va=(1082, 416977) X_te=(608, 416977)


[RIDGE-B FOLD 2] alpha=10.0 val_spearman=0.27067


[RIDGE-B FOLD 2] alpha=20.0 val_spearman=0.29287


[RIDGE-B FOLD 2] alpha=50.0 val_spearman=0.31022


[RIDGE-B FOLD 2] alpha=100.0 val_spearman=0.31322


[RIDGE-B FOLD 2] alpha=200.0 val_spearman=0.30871


[RIDGE-B FOLD 2] alpha=500.0 val_spearman=0.30772


[RIDGE-B FOLD 2] best_alpha=100.0 fold_spearman=0.31322 time=17.8s



[RIDGE-B FOLD 3] train=4399 val=1072


[RIDGE-B FOLD 3] Shapes: X_tr=(4399, 415719) X_va=(1072, 415719) X_te=(608, 415719)


[RIDGE-B FOLD 3] alpha=10.0 val_spearman=0.25377


[RIDGE-B FOLD 3] alpha=20.0 val_spearman=0.27521


[RIDGE-B FOLD 3] alpha=50.0 val_spearman=0.29168


[RIDGE-B FOLD 3] alpha=100.0 val_spearman=0.29304


[RIDGE-B FOLD 3] alpha=200.0 val_spearman=0.28947


[RIDGE-B FOLD 3] alpha=500.0 val_spearman=0.28694


[RIDGE-B FOLD 3] best_alpha=100.0 fold_spearman=0.29304 time=17.8s



[RIDGE-B FOLD 4] train=4383 val=1088


[RIDGE-B FOLD 4] Shapes: X_tr=(4383, 415525) X_va=(1088, 415525) X_te=(608, 415525)


[RIDGE-B FOLD 4] alpha=10.0 val_spearman=0.22068


[RIDGE-B FOLD 4] alpha=20.0 val_spearman=0.25057


[RIDGE-B FOLD 4] alpha=50.0 val_spearman=0.28810


[RIDGE-B FOLD 4] alpha=100.0 val_spearman=0.30415


[RIDGE-B FOLD 4] alpha=200.0 val_spearman=0.31686


[RIDGE-B FOLD 4] alpha=500.0 val_spearman=0.31918


[RIDGE-B FOLD 4] best_alpha=500.0 fold_spearman=0.31918 time=17.4s



[RIDGE-B CV] OOF Spearman: 0.31065
[RANK BLEND] w=0.40 OOF Spearman=0.32657
[RANK BLEND] w=0.50 OOF Spearman=0.32510
[RANK BLEND] w=0.60 OOF Spearman=0.32132
[RANK BLEND] best_w=0.4 best_OOF=0.32657
Saved submission_ridge_blend.csv (608, 2) head:
    qa_id  question_asker_intent_understanding
0   6516                             0.691598
1   6168                             0.100165
2   8575                             0.919275
3    618                             0.178913
4   3471                             0.713674


In [8]:
# Ensemble: DeBERTa + Ridge Blend (rank-averaged, weights tuned on OOF)
import numpy as np, pandas as pd
from pathlib import Path
from scipy.stats import spearmanr

def spearmanr_safe(y_true, y_pred):
    if np.std(y_pred) == 0:
        return 0.0
    return float(spearmanr(y_true, y_pred).correlation)

def rank01(x):
    s = pd.Series(x)
    r = s.rank(method='average')
    return ((r - r.min()) / (r.max() - r.min() + 1e-9)).values

# Load ground truth and IDs (self-contained)
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
id_col = 'qa_id'
target = 'question_asker_intent_understanding'
y = train[target].values.astype(float)

# Load OOF/test artifacts
oof_deb = np.load('oof_deberta.npy')
tst_deb = np.load('test_deberta.npy')
oof_a = np.load('oof_ridge_improved.npy') if Path('oof_ridge_improved.npy').exists() else np.load('oof_ridge.npy')
tst_a = np.load('test_ridge_improved.npy') if Path('test_ridge_improved.npy').exists() else np.load('test_ridge.npy')
oof_b = np.load('oof_ridge_b.npy')
tst_b = np.load('test_ridge_b.npy')

# First: re-tune Ridge A+B rank blend
rA = rank01(oof_a); rB = rank01(oof_b)
best_w_ab, best_sc_ab = None, -1e9
for w in np.linspace(0.0, 1.0, 21):
    blend = w*rA + (1-w)*rB
    sc = spearmanr_safe(y, blend)
    if sc > best_sc_ab:
        best_sc_ab, best_w_ab = sc, float(w)
print(f"[AB BLEND] best_w={best_w_ab:.3f} OOF={best_sc_ab:.5f}")
tA = rank01(tst_a); tB = rank01(tst_b)
tst_ab = best_w_ab*tA + (1-best_w_ab)*tB
oof_ab = best_w_ab*rA + (1-best_w_ab)*rB

# Now: tune DeBERTa + RidgeBlend
rDEB = rank01(oof_deb); rAB = rank01(oof_ab)
best_w_final, best_sc_final = None, -1e9
for w in np.linspace(0.0, 1.0, 41):
    mix = w*rDEB + (1-w)*rAB
    sc = spearmanr_safe(y, mix)
    if sc > best_sc_final:
        best_sc_final, best_w_final = sc, float(w)
print(f"[FINAL BLEND] w_DEB={best_w_final:.3f} OOF={best_sc_final:.5f}")

tDEB = rank01(tst_deb); tAB = rank01(tst_ab)
tst_final = best_w_final*tDEB + (1-best_w_final)*tAB

# Save artifacts and submission
np.save('oof_ensemble.npy', best_w_final*rDEB + (1-best_w_final)*rAB)
np.save('test_ensemble.npy', tst_final)
sub = pd.DataFrame({id_col: test[id_col], target: tst_final})
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv', sub.shape, 'head:\n', sub.head())

[AB BLEND] best_w=0.400 OOF=0.32657
[FINAL BLEND] w_DEB=0.775 OOF=0.38475
Saved submission.csv (608, 2) head:
    qa_id  question_asker_intent_understanding
0   6516                             0.769275
1   6168                             0.180745
2   8575                             0.963777
3    618                             0.367978
4   3471                             0.765095


In [9]:
# DeBERTa-v3-large 5-fold (Q-only) with SGKF, early stop, fp16
import os, gc, time, numpy as np, pandas as pd, torch
from pathlib import Path
from sklearn.model_selection import StratifiedGroupKFold
from scipy.stats import spearmanr
from transformers import AutoTokenizer, AutoConfig, AutoModelForSequenceClassification, TrainingArguments, Trainer, EarlyStoppingCallback, DataCollatorWithPadding

def spearmanr_safe(y_true, y_pred):
    if np.std(y_pred) == 0:
        return 0.0
    return float(spearmanr(y_true, y_pred).correlation)

# Reload data and folds consistently
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
id_col = 'qa_id'; target = 'question_asker_intent_understanding'
y = train[target].values.astype(float)
groups = pd.read_csv('train_group_keys.csv')['group_key'].values if Path('train_group_keys.csv').exists() else pd.util.hash_pandas_object((train['question_title'].fillna('')+'||'+train['question_body'].fillna('')), index=False).astype('int64').values
df_groups = pd.DataFrame({'group': groups, 'y': y})
grp_mean = df_groups.groupby('group')['y'].mean()
bins = pd.qcut(grp_mean, q=10, labels=False, duplicates='drop')
grp_to_bin = dict(zip(grp_mean.index.values, bins.astype(int)))
row_bins = np.array([grp_to_bin[g] for g in groups], dtype=int)
sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
splits = list(sgkf.split(np.zeros_like(y), y=row_bins, groups=groups))

# Tokenization
model_name = 'microsoft/deberta-v3-large'
tokenizer = AutoTokenizer.from_pretrained(model_name)
max_len = 512
title_max = 64

def build_inputs(title_series: pd.Series, body_series: pd.Series):
    titles = title_series.fillna('').astype(str).tolist()
    bodies = body_series.fillna('').astype(str).tolist()
    enc_title = tokenizer(titles, add_special_tokens=False, truncation=True, max_length=title_max)
    enc_body = tokenizer(bodies, add_special_tokens=False, truncation=True, max_length=max_len)
    input_ids, attention_masks = [], []
    for ti, bi in zip(enc_title['input_ids'], enc_body['input_ids']):
        composed = [tokenizer.cls_token_id] + ti + [tokenizer.sep_token_id] + bi + [tokenizer.sep_token_id]
        composed = composed[:max_len]
        attn = [1]*len(composed)
        pad = max_len - len(composed)
        if pad>0:
            composed += [tokenizer.pad_token_id]*pad
            attn += [0]*pad
        input_ids.append(composed); attention_masks.append(attn)
    return {'input_ids': np.array(input_ids, dtype=np.int64), 'attention_mask': np.array(attention_masks, dtype=np.int64)}

class QDataset(torch.utils.data.Dataset):
    def __init__(self, ids, masks, labels=None):
        self.ids=ids; self.masks=masks; self.labels=labels
    def __len__(self): return len(self.ids)
    def __getitem__(self, idx):
        item={'input_ids': torch.tensor(self.ids[idx]), 'attention_mask': torch.tensor(self.masks[idx])}
        if self.labels is not None: item['labels']=torch.tensor(self.labels[idx], dtype=torch.float32)
        return item

def compute_metrics(eval_pred):
    preds, labels = eval_pred
    preds = preds.reshape(-1)
    return {'spearman': spearmanr_safe(labels, preds)}

all_inputs = build_inputs(train['question_title'], train['question_body'])
test_inputs = build_inputs(test['question_title'], test['question_body'])

oof = np.zeros(len(train), dtype=np.float32)
test_preds = []
folds = np.full(len(train), -1, dtype=int)

for fold, (trn_idx, val_idx) in enumerate(splits):
    t0=time.time(); folds[val_idx]=fold
    print(f"\n[DEB-L FOLD {fold}] train={len(trn_idx)} val={len(val_idx)}", flush=True)
    tr_ds = QDataset(all_inputs['input_ids'][trn_idx], all_inputs['attention_mask'][trn_idx], y[trn_idx])
    va_ds = QDataset(all_inputs['input_ids'][val_idx], all_inputs['attention_mask'][val_idx], y[val_idx])
    te_ds = QDataset(test_inputs['input_ids'], test_inputs['attention_mask'], None)

    config = AutoConfig.from_pretrained(model_name, num_labels=1, problem_type='regression')
    model = AutoModelForSequenceClassification.from_pretrained(model_name, config=config)
    model.gradient_checkpointing_enable()

    args = TrainingArguments(
        output_dir=f'deberta_large_fold{fold}',
        num_train_epochs=5,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=32,
        gradient_accumulation_steps=1,
        learning_rate=1e-5,
        weight_decay=0.01,
        warmup_ratio=0.1,
        lr_scheduler_type='linear',
        fp16=True,
        gradient_checkpointing=True,
        evaluation_strategy='epoch',
        save_strategy='epoch',
        load_best_model_at_end=True,
        metric_for_best_model='spearman',
        greater_is_better=True,
        logging_steps=50,
        report_to=[]
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=tr_ds,
        eval_dataset=va_ds,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=1)],
        data_collator=DataCollatorWithPadding(tokenizer, pad_to_multiple_of=8)
    )
    trainer.train()
    val_out = trainer.predict(va_ds).predictions.reshape(-1)
    oof[val_idx] = val_out
    fold_score = spearmanr(y[val_idx], val_out).correlation
    print(f"[DEB-L FOLD {fold}] val Spearman={fold_score:.5f} time={time.time()-t0:.1f}s", flush=True)
    te_out = trainer.predict(te_ds).predictions.reshape(-1)
    test_preds.append(te_out)
    del trainer, model; gc.collect()
    if torch.cuda.is_available(): torch.cuda.empty_cache()

cv = spearmanr(y, oof).correlation
print(f"\n[DEB-L CV] OOF Spearman: {cv:.5f}")
np.save('oof_deberta_large.npy', oof)
pd.DataFrame({'qa_id': train[id_col], 'fold': folds, 'oof': oof, 'target': y}).to_csv('oof_deberta_large.csv', index=False)
test_mean = np.mean(np.vstack(test_preds), axis=0)
np.save('test_deberta_large.npy', test_mean)
sub_l = pd.DataFrame({id_col: test[id_col], target: test_mean})
sub_l.to_csv('submission_deberta_large.csv', index=False)
print('Saved submission_deberta_large.csv', sub_l.shape, 'head:\n', sub_l.head())




[DEB-L FOLD 0] train=4395 val=1076


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-large and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


  return fn(*args, **kwargs)
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]


Epoch,Training Loss,Validation Loss


  return fn(*args, **kwargs)
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]


[DEB-L FOLD 0] val Spearman=0.39332 time=1168.5s



[DEB-L FOLD 1] train=4318 val=1153


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-large and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


  return fn(*args, **kwargs)
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]


Epoch,Training Loss,Validation Loss


  return fn(*args, **kwargs)
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]


  return fn(*args, **kwargs)
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]


[DEB-L FOLD 1] val Spearman=0.40885 time=1722.7s



[DEB-L FOLD 2] train=4389 val=1082


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-large and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


  return fn(*args, **kwargs)
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]


Epoch,Training Loss,Validation Loss


  return fn(*args, **kwargs)
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]


  return fn(*args, **kwargs)
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]


[DEB-L FOLD 2] val Spearman=0.39222 time=1739.8s



[DEB-L FOLD 3] train=4399 val=1072


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-large and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


  return fn(*args, **kwargs)
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]


Epoch,Training Loss,Validation Loss


  return fn(*args, **kwargs)
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]


  return fn(*args, **kwargs)
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]


[DEB-L FOLD 3] val Spearman=0.35640 time=1744.6s



[DEB-L FOLD 4] train=4383 val=1088


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-large and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


  return fn(*args, **kwargs)
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]


Epoch,Training Loss,Validation Loss


  return fn(*args, **kwargs)
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]


  return fn(*args, **kwargs)
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]


[DEB-L FOLD 4] val Spearman=0.39265 time=1740.9s



[DEB-L CV] OOF Spearman: 0.34192
Saved submission_deberta_large.csv (608, 2) head:
    qa_id  question_asker_intent_understanding
0   6516                             1.008887
1   6168                             0.896191
2   8575                             1.015039
3    618                             0.924121
4   3471                             0.974512


In [46]:
# Re-ensemble with expanded transformer pool (incl. RoBERTa), rank vs per-fold z-score, greedy top-3 mixing, and dense final weight search
import numpy as np, pandas as pd
from pathlib import Path
from scipy.stats import spearmanr

def spearmanr_safe(y_true, y_pred):
    if np.std(y_pred) == 0: return 0.0
    return float(spearmanr(y_true, y_pred).correlation)

def rank01(x):
    s = pd.Series(x); r = s.rank(method='average')
    return ((r - r.min()) / (r.max() - r.min() + 1e-9)).values

def zscore_per_fold(oof, folds):
    o = np.array(oof, dtype=float).copy()
    for f in np.unique(folds):
        idx = (folds == f)
        mu = o[idx].mean()
        sd = o[idx].std() + 1e-9
        o[idx] = (o[idx] - mu) / sd
    # return standardized OOF; test will be standardized using global mu/sd of oof below when applied
    return o

def apply_test_zscore(test_preds, oof_ref):
    mu = float(np.mean(oof_ref))
    sd = float(np.std(oof_ref) + 1e-9)
    return (np.array(test_preds, dtype=float) - mu) / sd

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
id_col = 'qa_id'; target = 'question_asker_intent_understanding'
y = train[target].values.astype(float)

# Folds for per-fold z-score (from any transformer oof csv available)
folds = None
for cand_csv in ['oof_deberta_base_seeds.csv','oof_deberta.csv','oof_deberta_large.csv','oof_roberta.csv']:
    if Path(cand_csv).exists():
        dfc = pd.read_csv(cand_csv)
        if 'fold' in dfc.columns:
            folds = dfc['fold'].values.astype(int)
            break
if folds is None and Path('oof_deberta.csv').exists():
    folds = pd.read_csv('oof_deberta.csv')['fold'].values.astype(int)

# Ridge A+B OOF/test and weight re-tune (dense grid 0..1 step 0.01)
if Path('oof_ridge_improved.npy').exists():
    oof_a = np.load('oof_ridge_improved.npy'); tst_a = np.load('test_ridge_improved.npy')
else:
    oof_a = np.load('oof_ridge.npy'); tst_a = np.load('test_ridge.npy')
oof_b = np.load('oof_ridge_b.npy'); tst_b = np.load('test_ridge_b.npy')
rA = rank01(oof_a); rB = rank01(oof_b)
best_w_ab, best_sc_ab = 0.4, -1e9
for w in np.linspace(0.0, 1.0, 101):
    sc = spearmanr_safe(y, w*rA + (1-w)*rB)
    if sc > best_sc_ab: best_sc_ab, best_w_ab = sc, float(w)
print(f"[AB BLEND] best_w={best_w_ab:.3f} OOF={best_sc_ab:.5f}")
tAB = best_w_ab*rank01(tst_a) + (1-best_w_ab)*rank01(tst_b)
rAB = best_w_ab*rA + (1-best_w_ab)*rB

# Optional SBERT+LGBM: blend small weight into AB if artifacts exist (ws in [0,0.35] step 0.01)
if Path('oof_sbert_lgbm.npy').exists() and Path('test_sbert_lgbm.npy').exists():
    oof_sb = np.load('oof_sbert_lgbm.npy'); tst_sb = np.load('test_sbert_lgbm.npy')
    rSB = rank01(oof_sb); tSB = rank01(tst_sb)
    best_ws, best_sc_s = 0.0, spearmanr_safe(y, rAB)
    for ws in np.linspace(0.0, 0.35, 36):
        sc = spearmanr_safe(y, (1-ws)*rAB + ws*rSB)
        if sc > best_sc_s: best_sc_s, best_ws = sc, float(ws)
    if best_ws > 0:
        print(f"[AB+SBERT] ws={best_ws:.3f} OOF={best_sc_s:.5f}")
        rAB = (1-best_ws)*rAB + best_ws*rSB
        tAB = (1-best_ws)*tAB + best_ws*tSB
else:
    print("[AB+SBERT] SBERT artifacts not found; skipping SBERT blend")

# Collect transformer candidates (OOF/test raw predictions)
deb_pool = []  # list of (name, oof, test)
def add_candidate(name, oof_path, tst_path):
    if Path(oof_path).exists() and Path(tst_path).exists():
        oof = np.load(oof_path); tst = np.load(tst_path)
        deb_pool.append((name, oof, tst))

add_candidate('base', 'oof_deberta.npy', 'test_deberta.npy')
add_candidate('base_seeds', 'oof_deberta_base_seeds.npy', 'test_deberta_base_seeds.npy')
add_candidate('large', 'oof_deberta_large.npy', 'test_deberta_large.npy')
add_candidate('roberta', 'oof_roberta.npy', 'test_roberta.npy')
add_candidate('base_mc8', 'oof_deberta_base_mc8.npy', 'test_deberta_base_mc8.npy')  # MC-dropout TTA candidate

assert len(deb_pool) > 0, 'No transformer OOF/test artifacts found yet.'

# Build standardized variants per mode
modes = ['rank', 'zscore']
best_overall = {'OOF': -1e9}
for mode in modes:
    cand_std = []  # (name, oof_std, tst_std)
    for name, oof_raw, tst_raw in deb_pool:
        if mode == 'rank':
            oof_std = rank01(oof_raw); tst_std = rank01(tst_raw)
        else:
            if folds is None:
                # fallback to global z-score if folds missing
                mu = oof_raw.mean(); sd = oof_raw.std() + 1e-9
                oof_std = (oof_raw - mu)/sd
                tst_std = (tst_raw - mu)/sd
            else:
                oof_std = zscore_per_fold(oof_raw, folds)
                tst_std = apply_test_zscore(tst_raw, oof_std)
        cand_std.append((name, oof_std, tst_std))

    # Evaluate single best
    best_single = max(cand_std, key=lambda t: spearmanr_safe(y, t[1]))
    best_name, best_oof, best_tst = best_single
    best_sc_single = spearmanr_safe(y, best_oof)

    # Evaluate best top-2 mix among all pairs (w in [0.50,1.00] step 0.02 on first vs second)
    best_pair = (best_name, None, 1.0, best_oof, best_tst, best_sc_single)  # (name1,name2,w, oof_mix, tst_mix, sc)
    for i in range(len(cand_std)):
        for j in range(i+1, len(cand_std)):
            n1,o1,t1 = cand_std[i]; n2,o2,t2 = cand_std[j]
            # Prefers the stronger model as primary
            sc1 = spearmanr_safe(y, o1); sc2 = spearmanr_safe(y, o2)
            # Scan weights on the better-first assumption
            if sc2 > sc1:
                n1,o1,t1, n2,o2,t2 = n2,o2,t2, n1,o1,t1
            best_w, best_sc = 1.0, spearmanr_safe(y, o1)
            for w in np.linspace(0.50, 1.00, 26):
                mix = w*o1 + (1-w)*o2
                sc = spearmanr_safe(y, mix)
                if sc > best_sc: best_sc, best_w = sc, float(w)
            if best_sc > best_pair[5]:
                oof_mix = best_w*o1 + (1-best_w)*o2
                tst_mix = best_w*t1 + (1-best_w)*t2
                best_pair = (n1, n2, best_w, oof_mix, tst_mix, best_sc)
    # Start from best of single vs pair
    pair_names = (best_pair[0], best_pair[1]) if best_pair[1] is not None else (best_name, None)
    cur_oof, cur_tst, cur_sc = (best_pair[3], best_pair[4], best_pair[5]) if best_pair[1] is not None else (best_oof, best_tst, best_sc_single)
    cur_names = [n for n in pair_names if n is not None]
    cur_desc = f"{'+'.join(cur_names)}@{best_pair[2]:.2f}" if len(cur_names)==2 else cur_names[0]

    # Greedy add a third candidate with a small weight in [0.02, 0.25] step 0.01; accept only if OOF improves
    remaining = [n for n,_,_ in cand_std if n not in cur_names]
    name_to = {n:(o,t) for n,o,t in cand_std}
    best_third = None
    for nm in remaining:
        o3, t3 = name_to[nm]
        for w3 in np.arange(0.02, 0.26, 0.01):
            mix = (1.0 - w3)*cur_oof + w3*o3
            sc = spearmanr_safe(y, mix)
            if sc > cur_sc + 1e-9:
                cur_sc = sc; cur_oof = mix; cur_tst = (1.0 - w3)*cur_tst + w3* t3; best_third = (nm, w3)
    if best_third is not None:
        cur_desc = f"{cur_desc}+{best_third[0]}@{best_third[1]:.2f}"

    print(f"[TRANS MIX {mode.upper()}] {cur_desc} OOF={cur_sc:.5f} (best single OOF={best_sc_single:.5f})")

    # Final blend: tune transformer block vs AB over [0.75, 1.00] step 0.001
    rDEB, tDEB = cur_oof, cur_tst
    best_w_final, best_sc_final = 0.90, -1e9
    for w in np.linspace(0.75, 1.00, 251):
        sc = spearmanr_safe(y, w*rDEB + (1-w)*rAB)
        if sc > best_sc_final: best_sc_final, best_w_final = sc, float(w)
    print(f"[FINAL BLEND {mode.upper()}] w_DEB={best_w_final:.3f} OOF={best_sc_final:.5f}")

    if best_sc_final > best_overall.get('OOF', -1e9):
        best_overall = {
            'mode': mode, 'OOF': best_sc_final, 'w_DEB': best_w_final,
            'trans_oof': rDEB, 'trans_tst': tDEB, 'desc': cur_desc,
            'oof_ab': rAB, 'tst_ab': tAB
        }

# Save best overall submission
print(f"[ENSEMBLE] Best mode={best_overall['mode']} source={best_overall['desc']} OOF={best_overall['OOF']:.5f}")
w = best_overall['w_DEB']
oof_final = w*best_overall['trans_oof'] + (1-w)*best_overall['oof_ab']
tst_final = w*best_overall['trans_tst'] + (1-w)*best_overall['tst_ab']
np.save('oof_ensemble.npy', oof_final)
np.save('test_ensemble.npy', tst_final)
sub = pd.DataFrame({id_col: test[id_col], target: np.clip(tst_final, 0, 1)})
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv', sub.shape, 'head:\n', sub.head())

[AB BLEND] best_w=0.410 OOF=0.32661
[AB+SBERT] ws=0.300 OOF=0.33835


[TRANS MIX RANK] base_seeds+roberta@0.74+large@0.03 OOF=0.39506 (best single OOF=0.39001)


[FINAL BLEND RANK] w_DEB=0.833 OOF=0.39644


[TRANS MIX ZSCORE] base_seeds+large@0.64+base_mc8@0.02 OOF=0.40012 (best single OOF=0.39511)


[FINAL BLEND ZSCORE] w_DEB=0.770 OOF=0.40064
[ENSEMBLE] Best mode=zscore source=base_seeds+large@0.64+base_mc8@0.02 OOF=0.40064
Saved submission.csv (608, 2) head:
    qa_id  question_asker_intent_understanding
0   6516                             0.950312
1   6168                             0.711651
2   8575                             0.985662
3    618                             0.748256
4   3471                             0.942646


In [11]:
# DeBERTa-v3-base: 3-seed 5-fold training and seed-averaged OOF/test
import os, gc, time, numpy as np, pandas as pd, torch, random
from pathlib import Path
from sklearn.model_selection import StratifiedGroupKFold
from scipy.stats import spearmanr
from transformers import AutoTokenizer, AutoConfig, AutoModelForSequenceClassification, TrainingArguments, Trainer, EarlyStoppingCallback, DataCollatorWithPadding, set_seed

def spearmanr_safe(y_true, y_pred):
    if np.std(y_pred) == 0:
        return 0.0
    return float(spearmanr(y_true, y_pred).correlation)

# Data and folds (reuse SGKF by group bins)
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
id_col = 'qa_id'; target = 'question_asker_intent_understanding'
y = train[target].values.astype(float)
groups = pd.read_csv('train_group_keys.csv')['group_key'].values if Path('train_group_keys.csv').exists() else pd.util.hash_pandas_object((train['question_title'].fillna('')+'||'+train['question_body'].fillna('')), index=False).astype('int64').values
df_groups = pd.DataFrame({'group': groups, 'y': y})
grp_mean = df_groups.groupby('group')['y'].mean()
bins = pd.qcut(grp_mean, q=10, labels=False, duplicates='drop')
grp_to_bin = dict(zip(grp_mean.index.values, bins.astype(int)))
row_bins = np.array([grp_to_bin[g] for g in groups], dtype=int)
sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
splits = list(sgkf.split(np.zeros_like(y), y=row_bins, groups=groups))

# Tokenization (Q-only, [CLS] title [SEP] body [SEP])
model_name = 'microsoft/deberta-v3-base'
tokenizer = AutoTokenizer.from_pretrained(model_name)
max_len = 512; title_max = 64

def build_inputs(title_series: pd.Series, body_series: pd.Series):
    titles = title_series.fillna('').astype(str).tolist()
    bodies = body_series.fillna('').astype(str).tolist()
    enc_title = tokenizer(titles, add_special_tokens=False, truncation=True, max_length=title_max)
    enc_body = tokenizer(bodies, add_special_tokens=False, truncation=True, max_length=max_len)
    input_ids, attention_masks = [], []
    for ti, bi in zip(enc_title['input_ids'], enc_body['input_ids']):
        composed = [tokenizer.cls_token_id] + ti + [tokenizer.sep_token_id] + bi + [tokenizer.sep_token_id]
        composed = composed[:max_len]
        attn = [1]*len(composed)
        pad = max_len - len(composed)
        if pad>0:
            composed += [tokenizer.pad_token_id]*pad
            attn += [0]*pad
        input_ids.append(composed); attention_masks.append(attn)
    return {'input_ids': np.array(input_ids, dtype=np.int64), 'attention_mask': np.array(attention_masks, dtype=np.int64)}

class QDataset(torch.utils.data.Dataset):
    def __init__(self, ids, masks, labels=None):
        self.ids=ids; self.masks=masks; self.labels=labels
    def __len__(self): return len(self.ids)
    def __getitem__(self, idx):
        item={'input_ids': torch.tensor(self.ids[idx]), 'attention_mask': torch.tensor(self.masks[idx])}
        if self.labels is not None: item['labels']=torch.tensor(self.labels[idx], dtype=torch.float32)
        return item

def compute_metrics(eval_pred):
    preds, labels = eval_pred
    preds = preds.reshape(-1)
    return {'spearman': spearmanr_safe(labels, preds)}

all_inputs = build_inputs(train['question_title'], train['question_body'])
test_inputs = build_inputs(test['question_title'], test['question_body'])

seeds = [42, 2025, 3407]
oof_seeds = []; test_seeds = []
for si, seed in enumerate(seeds):
    set_seed(seed)
    print(f"\n[BASE-SEED {si}] seed={seed}", flush=True)
    oof = np.zeros(len(train), dtype=np.float32)
    test_preds = []
    folds = np.full(len(train), -1, dtype=int)
    for fold, (trn_idx, val_idx) in enumerate(splits):
        t0=time.time(); folds[val_idx]=fold
        print(f"[BASE-SEED {si} FOLD {fold}] train={len(trn_idx)} val={len(val_idx)}", flush=True)
        tr_ds = QDataset(all_inputs['input_ids'][trn_idx], all_inputs['attention_mask'][trn_idx], y[trn_idx])
        va_ds = QDataset(all_inputs['input_ids'][val_idx], all_inputs['attention_mask'][val_idx], y[val_idx])
        te_ds = QDataset(test_inputs['input_ids'], test_inputs['attention_mask'], None)
        config = AutoConfig.from_pretrained(model_name, num_labels=1, problem_type='regression')
        model = AutoModelForSequenceClassification.from_pretrained(model_name, config=config)
        args = TrainingArguments(
            output_dir=f'tfm_seed{seed}_fold{fold}',
            num_train_epochs=3,
            per_device_train_batch_size=16,
            per_device_eval_batch_size=32,
            gradient_accumulation_steps=1,
            learning_rate=2e-5,
            weight_decay=0.01,
            warmup_ratio=0.1,
            lr_scheduler_type='linear',
            fp16=True,
            evaluation_strategy='epoch',
            save_strategy='epoch',
            load_best_model_at_end=True,
            metric_for_best_model='spearman',
            greater_is_better=True,
            logging_steps=50,
            seed=seed,
            report_to=[]
        )
        trainer = Trainer(
            model=model,
            args=args,
            train_dataset=tr_ds,
            eval_dataset=va_ds,
            tokenizer=tokenizer,
            compute_metrics=compute_metrics,
            callbacks=[EarlyStoppingCallback(early_stopping_patience=1)],
            data_collator=DataCollatorWithPadding(tokenizer, pad_to_multiple_of=8)
        )
        trainer.train()
        val_out = trainer.predict(va_ds).predictions.reshape(-1)
        oof[val_idx] = val_out
        fold_score = spearmanr(y[val_idx], val_out).correlation
        print(f"[BASE-SEED {si} FOLD {fold}] val Spearman={fold_score:.5f} time={time.time()-t0:.1f}s", flush=True)
        te_out = trainer.predict(te_ds).predictions.reshape(-1)
        test_preds.append(te_out)
        del trainer, model; gc.collect()
        if torch.cuda.is_available(): torch.cuda.empty_cache()
    cv = spearmanr(y, oof).correlation
    print(f"[BASE-SEED {si}] OOF Spearman: {cv:.5f}")
    np.save(f'oof_deberta_seed{seed}.npy', oof)
    test_mean = np.mean(np.vstack(test_preds), axis=0)
    np.save(f'test_deberta_seed{seed}.npy', test_mean)
    oof_seeds.append(oof)
    test_seeds.append(test_mean)

# Seed-averaged OOF/test
oof_avg = np.mean(np.vstack(oof_seeds), axis=0)
test_avg = np.mean(np.vstack(test_seeds), axis=0)
np.save('oof_deberta_base_seeds.npy', oof_avg)
np.save('test_deberta_base_seeds.npy', test_avg)
cv_avg = spearmanr(y, oof_avg).correlation
print(f"[BASE-SEEDS AVG] OOF Spearman: {cv_avg:.5f}")
sub = pd.DataFrame({id_col: test[id_col], target: test_avg})
sub.to_csv('submission_deberta_base_seeds.csv', index=False)
print('Saved submission_deberta_base_seeds.csv', sub.shape, 'head:\n', sub.head())




[BASE-SEED 0] seed=42


[BASE-SEED 0 FOLD 0] train=4395 val=1076


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


Epoch,Training Loss,Validation Loss


[BASE-SEED 0 FOLD 0] val Spearman=0.38500 time=448.3s


[BASE-SEED 0 FOLD 1] train=4318 val=1153


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


Epoch,Training Loss,Validation Loss


[BASE-SEED 0 FOLD 1] val Spearman=0.39981 time=445.4s


[BASE-SEED 0 FOLD 2] train=4389 val=1082


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


Epoch,Training Loss,Validation Loss


[BASE-SEED 0 FOLD 2] val Spearman=0.39357 time=449.7s


[BASE-SEED 0 FOLD 3] train=4399 val=1072


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


Epoch,Training Loss,Validation Loss


[BASE-SEED 0 FOLD 3] val Spearman=0.38791 time=450.2s


[BASE-SEED 0 FOLD 4] train=4383 val=1088


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


Epoch,Training Loss,Validation Loss


[BASE-SEED 0 FOLD 4] val Spearman=0.38826 time=446.8s


[BASE-SEED 0] OOF Spearman: 0.37766

[BASE-SEED 1] seed=2025


[BASE-SEED 1 FOLD 0] train=4395 val=1076


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


Epoch,Training Loss,Validation Loss


[BASE-SEED 1 FOLD 0] val Spearman=0.39184 time=450.1s


[BASE-SEED 1 FOLD 1] train=4318 val=1153


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


Epoch,Training Loss,Validation Loss


[BASE-SEED 1 FOLD 1] val Spearman=0.40126 time=445.5s


[BASE-SEED 1 FOLD 2] train=4389 val=1082


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


Epoch,Training Loss,Validation Loss


[BASE-SEED 1 FOLD 2] val Spearman=0.39038 time=449.6s


[BASE-SEED 1 FOLD 3] train=4399 val=1072


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


Epoch,Training Loss,Validation Loss


[BASE-SEED 1 FOLD 3] val Spearman=0.37513 time=450.2s


[BASE-SEED 1 FOLD 4] train=4383 val=1088


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


Epoch,Training Loss,Validation Loss


[BASE-SEED 1 FOLD 4] val Spearman=0.40215 time=448.8s


[BASE-SEED 1] OOF Spearman: 0.36916

[BASE-SEED 2] seed=3407


[BASE-SEED 2 FOLD 0] train=4395 val=1076


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


Epoch,Training Loss,Validation Loss


[BASE-SEED 2 FOLD 0] val Spearman=0.39637 time=451.0s


[BASE-SEED 2 FOLD 1] train=4318 val=1153


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


Epoch,Training Loss,Validation Loss


[BASE-SEED 2 FOLD 1] val Spearman=0.37022 time=441.8s


[BASE-SEED 2 FOLD 2] train=4389 val=1082


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


Epoch,Training Loss,Validation Loss


[BASE-SEED 2 FOLD 2] val Spearman=0.36795 time=449.6s


[BASE-SEED 2 FOLD 3] train=4399 val=1072


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


Epoch,Training Loss,Validation Loss


[BASE-SEED 2 FOLD 3] val Spearman=0.36209 time=445.4s


[BASE-SEED 2 FOLD 4] train=4383 val=1088


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


Epoch,Training Loss,Validation Loss


[BASE-SEED 2 FOLD 4] val Spearman=0.37942 time=452.3s


[BASE-SEED 2] OOF Spearman: 0.34855
[BASE-SEEDS AVG] OOF Spearman: 0.39001
Saved submission_deberta_base_seeds.csv (608, 2) head:
    qa_id  question_asker_intent_understanding
0   6516                             1.012142
1   6168                             0.893490
2   8575                             1.040755
3    618                             0.932162
4   3471                             1.006771


In [14]:
# DeBERTa-v3-base with Mean Pooling + Multi-Sample Dropout (MSD), 2 additional seeds
import os, gc, time, numpy as np, pandas as pd, torch, random, math
from pathlib import Path
from sklearn.model_selection import StratifiedGroupKFold
from scipy.stats import spearmanr
from transformers import AutoTokenizer, AutoConfig, AutoModel, TrainingArguments, Trainer, EarlyStoppingCallback, DataCollatorWithPadding, set_seed
from transformers.modeling_outputs import SequenceClassifierOutput
import torch.nn as nn

def spearmanr_safe(y_true, y_pred):
    if np.std(y_pred) == 0:
        return 0.0
    return float(spearmanr(y_true, y_pred).correlation)

# Data and folds (reuse SGKF by group bins)
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
id_col = 'qa_id'; target = 'question_asker_intent_understanding'
y = train[target].values.astype(float)
if Path('train_group_keys.csv').exists():
    groups = pd.read_csv('train_group_keys.csv')['group_key'].values
else:
    groups = pd.util.hash_pandas_object((train['question_title'].fillna('')+'||'+train['question_body'].fillna('')), index=False).astype('int64').values
df_groups = pd.DataFrame({'group': groups, 'y': y})
grp_mean = df_groups.groupby('group')['y'].mean()
bins = pd.qcut(grp_mean, q=10, labels=False, duplicates='drop')
grp_to_bin = dict(zip(grp_mean.index.values, bins.astype(int)))
row_bins = np.array([grp_to_bin[g] for g in groups], dtype=int)
sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
splits = list(sgkf.split(np.zeros_like(y), y=row_bins, groups=groups))

# Tokenization (Q-only, [CLS] title [SEP] body [SEP])
model_name = 'microsoft/deberta-v3-base'
tokenizer = AutoTokenizer.from_pretrained(model_name)
max_len = 512; title_max = 64

def build_inputs(title_series: pd.Series, body_series: pd.Series):
    titles = title_series.fillna('').astype(str).tolist()
    bodies = body_series.fillna('').astype(str).tolist()
    enc_title = tokenizer(titles, add_special_tokens=False, truncation=True, max_length=title_max)
    enc_body = tokenizer(bodies, add_special_tokens=False, truncation=True, max_length=max_len)
    input_ids, attention_masks = [], []
    for ti, bi in zip(enc_title['input_ids'], enc_body['input_ids']):
        composed = [tokenizer.cls_token_id] + ti + [tokenizer.sep_token_id] + bi + [tokenizer.sep_token_id]
        composed = composed[:max_len]
        attn = [1]*len(composed)
        pad = max_len - len(composed)
        if pad>0:
            composed += [tokenizer.pad_token_id]*pad
            attn += [0]*pad
        input_ids.append(composed); attention_masks.append(attn)
    return {'input_ids': np.array(input_ids, dtype=np.int64), 'attention_mask': np.array(attention_masks, dtype=np.int64)}

class QDataset(torch.utils.data.Dataset):
    def __init__(self, ids, masks, labels=None):
        self.ids=ids; self.masks=masks; self.labels=labels
    def __len__(self): return len(self.ids)
    def __getitem__(self, idx):
        item={'input_ids': torch.tensor(self.ids[idx]), 'attention_mask': torch.tensor(self.masks[idx])}
        if self.labels is not None: item['labels']=torch.tensor(self.labels[idx], dtype=torch.float32)
        return item

def compute_metrics(eval_pred):
    preds, labels = eval_pred
    preds = preds.reshape(-1)
    return {'spearman': spearmanr_safe(labels, preds)}

class MeanPooling(nn.Module):
    def __init__(self):
        super().__init__()
    def forward(self, last_hidden_state, attention_mask):
        mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (B, L, 1)
        masked = last_hidden_state * mask
        summed = masked.sum(dim=1)  # (B, H)
        denom = mask.sum(dim=1).clamp(min=1e-6)  # (B, 1)
        return summed / denom

class DebertaRegMeanPoolMSD(nn.Module):
    def __init__(self, model_name: str, dropout_p: float = 0.2, msd: int = 5):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.pool = MeanPooling()
        self.msd = msd
        self.dropouts = nn.ModuleList([nn.Dropout(dropout_p) for _ in range(msd)])
        self.head = nn.Linear(hidden, 1)
        self.loss_fn = nn.MSELoss()
    def forward(self, input_ids=None, attention_mask=None, labels=None):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask, return_dict=True)
        feats = self.pool(out.last_hidden_state, attention_mask)  # (B, H)
        logits_list = []
        for i in range(self.msd):
            logits_list.append(self.head(self.dropouts[i](feats)))  # (B,1)
        logits = torch.stack(logits_list, dim=0).mean(dim=0).squeeze(-1)  # (B,)
        loss = None
        if labels is not None:
            loss = self.loss_fn(logits, labels.view(-1))
        # Return a proper ModelOutput to satisfy Trainer/Accelerate expectations
        return SequenceClassifierOutput(loss=loss, logits=logits.unsqueeze(-1))

all_inputs = build_inputs(train['question_title'], train['question_body'])
test_inputs = build_inputs(test['question_title'], test['question_body'])

# Train two additional seeds; keep LR=2e-5, warmup_ratio=0.1, linear scheduler, 4 epochs + ES(patience=1)
extra_seeds = [6174, 2024]
oof_seeds = []
test_seeds = []
for si, seed in enumerate(extra_seeds):
    set_seed(seed)
    print(f"\n[MEANPOOL-MSD SEED {si}] seed={seed}", flush=True)
    oof = np.zeros(len(train), dtype=np.float32)
    test_preds = []
    folds = np.full(len(train), -1, dtype=int)
    for fold, (trn_idx, val_idx) in enumerate(splits):
        t0=time.time(); folds[val_idx]=fold
        print(f"[MEANPOOL-MSD SEED {si} FOLD {fold}] train={len(trn_idx)} val={len(val_idx)}", flush=True)
        tr_ds = QDataset(all_inputs['input_ids'][trn_idx], all_inputs['attention_mask'][trn_idx], y[trn_idx])
        va_ds = QDataset(all_inputs['input_ids'][val_idx], all_inputs['attention_mask'][val_idx], y[val_idx])
        te_ds = QDataset(test_inputs['input_ids'], test_inputs['attention_mask'], None)
        model = DebertaRegMeanPoolMSD(model_name, dropout_p=0.2, msd=5)
        args = TrainingArguments(
            output_dir=f'tfm_meanpool_seed{seed}_fold{fold}',
            num_train_epochs=4,
            per_device_train_batch_size=16,
            per_device_eval_batch_size=32,
            gradient_accumulation_steps=1,
            learning_rate=2e-5,
            weight_decay=0.01,
            warmup_ratio=0.1,
            lr_scheduler_type='linear',
            fp16=True,
            evaluation_strategy='epoch',
            save_strategy='epoch',
            load_best_model_at_end=True,
            metric_for_best_model='spearman',
            greater_is_better=True,
            logging_steps=50,
            seed=seed,
            report_to=[]
        )
        trainer = Trainer(
            model=model,
            args=args,
            train_dataset=tr_ds,
            eval_dataset=va_ds,
            tokenizer=tokenizer,
            compute_metrics=compute_metrics,
            callbacks=[EarlyStoppingCallback(early_stopping_patience=1)],
            data_collator=DataCollatorWithPadding(tokenizer, pad_to_multiple_of=8)
        )
        trainer.train()
        val_out = trainer.predict(va_ds).predictions.reshape(-1)
        oof[val_idx] = val_out
        fold_score = spearmanr(y[val_idx], val_out).correlation
        print(f"[MEANPOOL-MSD SEED {si} FOLD {fold}] val Spearman={fold_score:.5f} time={time.time()-t0:.1f}s", flush=True)
        te_out = trainer.predict(te_ds).predictions.reshape(-1)
        test_preds.append(te_out)
        del trainer, model; gc.collect()
        if torch.cuda.is_available(): torch.cuda.empty_cache()
    cv = spearmanr(y, oof).correlation
    print(f"[MEANPOOL-MSD SEED {si}] OOF Spearman: {cv:.5f}")
    np.save(f'oof_deberta_meanpool_seed{seed}.npy', oof)
    test_mean = np.mean(np.vstack(test_preds), axis=0)
    np.save(f'test_deberta_meanpool_seed{seed}.npy', test_mean)
    oof_seeds.append(oof)
    test_seeds.append(test_mean)

# Combine with existing 3 base seeds if present for a 5-seed average; else at least average our 2 new seeds
existing_paths = [
    ('oof_deberta_seed42.npy','test_deberta_seed42.npy'),
    ('oof_deberta_seed2025.npy','test_deberta_seed2025.npy'),
    ('oof_deberta_seed3407.npy','test_deberta_seed3407.npy'),
]
for oof_p, tst_p in existing_paths:
    if Path(oof_p).exists() and Path(tst_p).exists():
        oof_seeds.append(np.load(oof_p)); test_seeds.append(np.load(tst_p))

oof_avg = np.mean(np.vstack(oof_seeds), axis=0)
test_avg = np.mean(np.vstack(test_seeds), axis=0)
np.save('oof_deberta_base_meanpool_seeds.npy', oof_avg)
np.save('test_deberta_base_meanpool_seeds.npy', test_avg)
cv_avg = spearmanr(y, oof_avg).correlation
print(f"[MEANPOOL-MSD SEEDS AVG] OOF Spearman: {cv_avg:.5f}")
sub = pd.DataFrame({id_col: test[id_col], target: test_avg})
sub.to_csv('submission_deberta_base_meanpool_seeds.csv', index=False)
print('Saved submission_deberta_base_meanpool_seeds.csv', sub.shape, 'head:\n', sub.head())




[MEANPOOL-MSD SEED 0] seed=6174


[MEANPOOL-MSD SEED 0 FOLD 0] train=4395 val=1076


  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


Epoch,Training Loss,Validation Loss


[MEANPOOL-MSD SEED 0 FOLD 0] val Spearman=0.26643 time=304.5s


[MEANPOOL-MSD SEED 0 FOLD 1] train=4318 val=1153


  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


Epoch,Training Loss,Validation Loss


[MEANPOOL-MSD SEED 0 FOLD 1] val Spearman=0.26605 time=437.6s


[MEANPOOL-MSD SEED 0 FOLD 2] train=4389 val=1082


  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


Epoch,Training Loss,Validation Loss


In [18]:
# CPU parallel: S-BERT embeddings + LightGBM (5-fold SGKF), rank-ensemble candidate
import os, time, gc, numpy as np, pandas as pd
from pathlib import Path
from sklearn.model_selection import StratifiedGroupKFold
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer
import lightgbm as lgb

def spearmanr_safe(y_true, y_pred):
    if np.std(y_pred) == 0:
        return 0.0
    return float(spearmanr(y_true, y_pred).correlation)

# Load data
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
id_col = 'qa_id'; target = 'question_asker_intent_understanding'
y = train[target].values.astype(float)

# Folds: StratifiedGroupKFold on group mean bins (reuse saved group keys if present)
if Path('train_group_keys.csv').exists():
    groups = pd.read_csv('train_group_keys.csv')['group_key'].values
else:
    groups = pd.util.hash_pandas_object((train['question_title'].fillna('')+'||'+train['question_body'].fillna('')), index=False).astype('int64').values
df_groups = pd.DataFrame({'group': groups, 'y': y})
grp_mean = df_groups.groupby('group')['y'].mean()
bins = pd.qcut(grp_mean, q=10, labels=False, duplicates='drop')
grp_to_bin = dict(zip(grp_mean.index.values, bins.astype(int)))
row_bins = np.array([grp_to_bin[g] for g in groups], dtype=int)
sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
splits = list(sgkf.split(np.zeros_like(y), y=row_bins, groups=groups))

# Cheap scalar features (same as earlier basic_feats)
def basic_feats(title: pd.Series, body: pd.Series):
    def counts(s):
        s2 = s.fillna('')
        chars = s2.str.len().astype(float)
        words = s2.str.split().apply(len).astype(float)
        qcnt = s2.str.count(r'\?').astype(float)
        ecnt = s2.str.count(r'\!').astype(float)
        q2   = s2.str.count(r'\?\?+').astype(float)
        e2   = s2.str.count(r'\!\!+').astype(float)
        ell  = s2.str.count(r'\.\.\.+').astype(float)
        upper = s2.apply(lambda t: sum(ch.isupper() for ch in t)).astype(float)
        upper_ratio = (upper / (chars.replace(0, np.nan))).fillna(0.0).astype(float)
        nl = s2.str.count(r'\n').astype(float)
        has_url = s2.str.contains(r'http[s]?://', regex=True).astype(float)
        has_code = s2.str.contains(r'`').astype(float)
        list_mark = s2.str.contains(r'(^|\n)[\-\*] ', regex=True).astype(float)
        quote = s2.str.contains(r'(^|\n)\>', regex=True).astype(float)
        digits = s2.apply(lambda t: sum(ch.isdigit() for ch in t)).astype(float)
        digit_ratio = (digits / (chars.replace(0, np.nan))).fillna(0.0).astype(float)
        uniq_ratio = s2.apply(lambda t: (len(set(t.split())) / max(1, len(t.split())))).astype(float)
        return [chars, words, qcnt, ecnt, q2, e2, ell, upper_ratio, nl, has_url, has_code, list_mark, quote, digit_ratio, uniq_ratio]
    t_feats = counts(title); b_feats = counts(body)
    t_chars, t_words = t_feats[0], t_feats[1]
    b_chars, b_words = b_feats[0], b_feats[1]
    len_ratio_c = (t_chars / (b_chars.replace(0, np.nan))).fillna(0.0).astype(float)
    len_ratio_w = (t_words / (b_words.replace(0, np.nan))).fillna(0.0).astype(float)
    feats = t_feats + b_feats + [len_ratio_c, len_ratio_w]
    F = np.vstack([f.values for f in feats]).T.astype(np.float32)
    return F

F_tr = basic_feats(train['question_title'], train['question_body'])
F_te = basic_feats(test['question_title'], test['question_body'])

# Sentence embeddings on CPU (prevents GPU contention).
os.environ['CUDA_VISIBLE_DEVICES'] = ''
sb_model_name = 'sentence-transformers/all-MiniLM-L6-v2'
sb = SentenceTransformer(sb_model_name, device='cpu')
def embed_texts(series: pd.Series, batch_size: int = 512):
    return sb.encode(series.fillna('').astype(str).tolist(), batch_size=batch_size, show_progress_bar=False, device='cpu', convert_to_numpy=True, normalize_embeddings=True)

print('[SBERT] Encoding title/body on CPU...', flush=True)
t_tr = embed_texts(train['question_title'])  # (N,384)
b_tr = embed_texts(train['question_body'])   # (N,384)
t_te = embed_texts(test['question_title'])
b_te = embed_texts(test['question_body'])
X_tr = np.hstack([t_tr, b_tr, F_tr])
X_te = np.hstack([t_te, b_te, F_te])
print('[SBERT] Shapes:', X_tr.shape, X_te.shape, flush=True)

# LGBMRegressor with early stopping
oof = np.zeros(len(train), dtype=np.float32)
test_preds = []
folds = np.full(len(train), -1, dtype=int)
for fold, (trn_idx, val_idx) in enumerate(splits):
    t0 = time.time();
    folds[val_idx] = fold
    print(f"[SBERT-LGB FOLD {fold}] train={len(trn_idx)} val={len(val_idx)}", flush=True)
    dtrain = lgb.Dataset(X_tr[trn_idx], label=y[trn_idx])
    dvalid = lgb.Dataset(X_tr[val_idx], label=y[val_idx])
    params = {
        'objective': 'regression',
        'metric': 'rmse',
        'learning_rate': 0.05,
        'num_leaves': 63,
        'feature_fraction': 0.85,
        'bagging_fraction': 0.85,
        'bagging_freq': 1,
        'min_data_in_leaf': 20,
        'verbosity': -1,
        'force_row_wise': True
    }
    model = lgb.train(
        params,
        dtrain,
        num_boost_round=2000,
        valid_sets=[dvalid],
        valid_names=['valid'],
        callbacks=[lgb.early_stopping(stopping_rounds=100, verbose=False)]
    )
    pv = model.predict(X_tr[val_idx], num_iteration=model.best_iteration)
    oof[val_idx] = pv.astype(np.float32)
    sc = spearmanr_safe(y[val_idx], pv)
    print(f"[SBERT-LGB FOLD {fold}] val Spearman={sc:.5f} iters={model.best_iteration} time={time.time()-t0:.1f}s", flush=True)
    test_preds.append(model.predict(X_te, num_iteration=model.best_iteration))
    del model; gc.collect()

cv = spearmanr_safe(y, oof)
print(f"[SBERT-LGB] OOF Spearman: {cv:.5f}")
np.save('oof_sbert_lgbm.npy', oof)
tst = np.mean(np.vstack(test_preds), axis=0).astype(np.float32)
np.save('test_sbert_lgbm.npy', tst)
print('[SBERT-LGB] Saved artifacts: oof_sbert_lgbm.npy, test_sbert_lgbm.npy')

  list_mark = s2.str.contains(r'(^|\n)[\-\*] ', regex=True).astype(float)
  quote = s2.str.contains(r'(^|\n)\>', regex=True).astype(float)


  list_mark = s2.str.contains(r'(^|\n)[\-\*] ', regex=True).astype(float)
  quote = s2.str.contains(r'(^|\n)\>', regex=True).astype(float)


  list_mark = s2.str.contains(r'(^|\n)[\-\*] ', regex=True).astype(float)
  quote = s2.str.contains(r'(^|\n)\>', regex=True).astype(float)
  list_mark = s2.str.contains(r'(^|\n)[\-\*] ', regex=True).astype(float)
  quote = s2.str.contains(r'(^|\n)\>', regex=True).astype(float)




[SBERT] Encoding title/body on CPU...


[SBERT] Shapes: (5471, 800) (608, 800)


[SBERT-LGB FOLD 0] train=4395 val=1076


[SBERT-LGB FOLD 0] val Spearman=0.25025 iters=96 time=7.1s


[SBERT-LGB FOLD 1] train=4318 val=1153


[SBERT-LGB FOLD 1] val Spearman=0.28676 iters=43 time=5.3s


[SBERT-LGB FOLD 2] train=4389 val=1082


[SBERT-LGB FOLD 2] val Spearman=0.31268 iters=113 time=7.7s


[SBERT-LGB FOLD 3] train=4399 val=1072


[SBERT-LGB FOLD 3] val Spearman=0.28260 iters=65 time=6.0s


[SBERT-LGB FOLD 4] train=4383 val=1088


[SBERT-LGB FOLD 4] val Spearman=0.25049 iters=77 time=6.5s


[SBERT-LGB] OOF Spearman: 0.27409
[SBERT-LGB] Saved artifacts: oof_sbert_lgbm.npy, test_sbert_lgbm.npy


In [17]:
# Install sentence-transformers without deps to avoid torch/transformers drift
import sys, subprocess
from pathlib import Path

def pip(*args):
    print("> pip", *args, flush=True)
    subprocess.run([sys.executable, "-m", "pip", *args], check=True)

# Ensure our existing torch/transformers stack stays intact; install sbert with --no-deps
pip("install", "sentence-transformers==2.7.0", "--no-deps")
print("[Install] sentence-transformers installed with --no-deps.")

> pip install sentence-transformers==2.7.0 --no-deps


Collecting sentence-transformers==2.7.0
  Downloading sentence_transformers-2.7.0-py3-none-any.whl (171 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 171.5/171.5 KB 8.1 MB/s eta 0:00:00
Installing collected packages: sentence-transformers
Successfully installed sentence-transformers-2.7.0
[Install] sentence-transformers installed with --no-deps.


In [22]:
# roberta-base 5-fold (Q-only) 1-seed for ensemble diversity
import os, gc, time, numpy as np, pandas as pd, torch
from pathlib import Path
from sklearn.model_selection import StratifiedGroupKFold
from scipy.stats import spearmanr
from transformers import AutoTokenizer, AutoConfig, AutoModelForSequenceClassification, TrainingArguments, Trainer, EarlyStoppingCallback, DataCollatorWithPadding, set_seed

def spearmanr_safe(y_true, y_pred):
    if np.std(y_pred) == 0:
        return 0.0
    return float(spearmanr(y_true, y_pred).correlation)

# Data and folds
train = pd.read_csv('train.csv'); test = pd.read_csv('test.csv')
id_col = 'qa_id'; target = 'question_asker_intent_understanding'
y = train[target].values.astype(float)
if Path('train_group_keys.csv').exists():
    groups = pd.read_csv('train_group_keys.csv')['group_key'].values
else:
    groups = pd.util.hash_pandas_object((train['question_title'].fillna('')+'||'+train['question_body'].fillna('')), index=False).astype('int64').values
df_groups = pd.DataFrame({'group': groups, 'y': y})
grp_mean = df_groups.groupby('group')['y'].mean()
bins = pd.qcut(grp_mean, q=10, labels=False, duplicates='drop')
grp_to_bin = dict(zip(grp_mean.index.values, bins.astype(int)))
row_bins = np.array([grp_to_bin[g] for g in groups], dtype=int)
sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
splits = list(sgkf.split(np.zeros_like(y), y=row_bins, groups=groups))

# Tokenization (Q-only, [CLS] title [SEP] body [SEP])
model_name = 'roberta-base'
tokenizer = AutoTokenizer.from_pretrained(model_name)
max_len = 512; title_max = 64

def build_inputs(title_series: pd.Series, body_series: pd.Series):
    titles = title_series.fillna('').astype(str).tolist()
    bodies = body_series.fillna('').astype(str).tolist()
    enc_title = tokenizer(titles, add_special_tokens=False, truncation=True, max_length=title_max)
    enc_body = tokenizer(bodies, add_special_tokens=False, truncation=True, max_length=max_len)
    input_ids, attention_masks = [], []
    for ti, bi in zip(enc_title['input_ids'], enc_body['input_ids']):
        composed = [tokenizer.cls_token_id] + ti + [tokenizer.sep_token_id] + bi + [tokenizer.sep_token_id]
        composed = composed[:max_len]
        attn = [1]*len(composed)
        pad = max_len - len(composed)
        if pad>0:
            composed += [tokenizer.pad_token_id]*pad
            attn += [0]*pad
        input_ids.append(composed); attention_masks.append(attn)
    return {'input_ids': np.array(input_ids, dtype=np.int64), 'attention_mask': np.array(attention_masks, dtype=np.int64)}

class QDataset(torch.utils.data.Dataset):
    def __init__(self, ids, masks, labels=None):
        self.ids=ids; self.masks=masks; self.labels=labels
    def __len__(self): return len(self.ids)
    def __getitem__(self, idx):
        item={'input_ids': torch.tensor(self.ids[idx]), 'attention_mask': torch.tensor(self.masks[idx])}
        if self.labels is not None: item['labels']=torch.tensor(self.labels[idx], dtype=torch.float32)
        return item

def compute_metrics(eval_pred):
    preds, labels = eval_pred
    preds = preds.reshape(-1)
    return {'spearman': spearmanr_safe(labels, preds)}

all_inputs = build_inputs(train['question_title'], train['question_body'])
test_inputs = build_inputs(test['question_title'], test['question_body'])

set_seed(42)
oof = np.zeros(len(train), dtype=np.float32)
test_preds = []
folds = np.full(len(train), -1, dtype=int)

for fold, (trn_idx, val_idx) in enumerate(splits):
    t0=time.time(); folds[val_idx]=fold
    print(f"\n[ROBERTA FOLD {fold}] train={len(trn_idx)} val={len(val_idx)}", flush=True)
    tr_ds = QDataset(all_inputs['input_ids'][trn_idx], all_inputs['attention_mask'][trn_idx], y[trn_idx])
    va_ds = QDataset(all_inputs['input_ids'][val_idx], all_inputs['attention_mask'][val_idx], y[val_idx])
    te_ds = QDataset(test_inputs['input_ids'], test_inputs['attention_mask'], None)
    config = AutoConfig.from_pretrained(model_name, num_labels=1, problem_type='regression')
    model = AutoModelForSequenceClassification.from_pretrained(model_name, config=config)
    args = TrainingArguments(
        output_dir=f'roberta_fold{fold}',
        num_train_epochs=3,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=32,
        gradient_accumulation_steps=1,
        learning_rate=2e-5,
        weight_decay=0.01,
        warmup_ratio=0.1,
        lr_scheduler_type='linear',
        fp16=True,
        evaluation_strategy='epoch',
        save_strategy='epoch',
        load_best_model_at_end=True,
        metric_for_best_model='spearman',
        greater_is_better=True,
        logging_steps=50,
        report_to=[]
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=tr_ds,
        eval_dataset=va_ds,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=1)],
        data_collator=DataCollatorWithPadding(tokenizer, pad_to_multiple_of=8)
    )
    trainer.train()
    val_out = trainer.predict(va_ds).predictions.reshape(-1)
    oof[val_idx] = val_out
    fold_score = spearmanr(y[val_idx], val_out).correlation
    print(f"[ROBERTA FOLD {fold}] val Spearman={fold_score:.5f} time={time.time()-t0:.1f}s", flush=True)
    te_out = trainer.predict(te_ds).predictions.reshape(-1)
    test_preds.append(te_out)
    del trainer, model; gc.collect()
    if torch.cuda.is_available(): torch.cuda.empty_cache()

cv = spearmanr(y, oof).correlation
print(f"\n[ROBERTA CV] OOF Spearman: {cv:.5f}")
np.save('oof_roberta.npy', oof)
pd.DataFrame({'qa_id': train[id_col], 'fold': folds, 'oof': oof, 'target': y}).to_csv('oof_roberta.csv', index=False)
test_mean = np.mean(np.vstack(test_preds), axis=0)
np.save('test_roberta.npy', test_mean)
sub_r = pd.DataFrame({id_col: test[id_col], target: test_mean})
sub_r.to_csv('submission_roberta.csv', index=False)
print('Saved submission_roberta.csv', sub_r.shape, 'head:\n', sub_r.head())




[ROBERTA FOLD 0] train=4395 val=1076


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


Epoch,Training Loss,Validation Loss


[ROBERTA FOLD 0] val Spearman=0.37140 time=256.4s



[ROBERTA FOLD 1] train=4318 val=1153


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


Epoch,Training Loss,Validation Loss


[ROBERTA FOLD 1] val Spearman=0.39320 time=254.7s



[ROBERTA FOLD 2] train=4389 val=1082


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


Epoch,Training Loss,Validation Loss


[ROBERTA FOLD 2] val Spearman=0.37554 time=257.7s



[ROBERTA FOLD 3] train=4399 val=1072


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


Epoch,Training Loss,Validation Loss


[ROBERTA FOLD 3] val Spearman=0.37407 time=258.9s



[ROBERTA FOLD 4] train=4383 val=1088


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


Epoch,Training Loss,Validation Loss


[ROBERTA FOLD 4] val Spearman=0.41395 time=257.0s



[ROBERTA CV] OOF Spearman: 0.35498
Saved submission_roberta.csv (608, 2) head:
    qa_id  question_asker_intent_understanding
0   6516                             1.004297
1   6168                             0.867383
2   8575                             1.013281
3    618                             0.903711
4   3471                             0.963379


In [24]:
# Validate and fix submission.csv format strictly
import pandas as pd, numpy as np
from pathlib import Path

test = pd.read_csv('test.csv')
sub_path = Path('submission.csv')
assert sub_path.exists(), 'submission.csv not found'
sub = pd.read_csv(sub_path)

id_col = 'qa_id'
target = 'question_asker_intent_understanding'

# Enforce correct columns and order
assert list(sub.columns) == [id_col, target], f'Unexpected columns: {list(sub.columns)}'

# Enforce row count equality and ids match test length
assert len(sub) == len(test) == 608, f'Row mismatch: sub={len(sub)} test={len(test)}'

# Force id dtype and exact values from test to avoid type/ordering issues
sub[id_col] = test[id_col].astype(int).values

# Coerce predictions to float, replace NaNs/inf, clip to [0,1]
pred = pd.to_numeric(sub[target], errors='coerce').astype(float).values
pred[~np.isfinite(pred)] = 0.5
pred = np.clip(pred, 0.0, 1.0)
sub[target] = pred

# Final save with stable float formatting and no index
sub.to_csv('submission.csv', index=False, float_format='%.8f')
print('submission.csv fixed and saved:', sub.dtypes.to_dict(), sub.shape)
print(sub.head())

submission.csv fixed and saved: {'qa_id': dtype('int64'), 'question_asker_intent_understanding': dtype('float64')} (608, 2)
   qa_id  question_asker_intent_understanding
0   6516                             0.952788
1   6168                             0.717737
2   8575                             0.987400
3    618                             0.754358
4   3471                             0.944603


In [25]:
# Rebuild submission strictly aligned to test.csv order and validate
import pandas as pd, numpy as np
from pathlib import Path

test = pd.read_csv('test.csv')
sub_path = Path('submission.csv')
assert sub_path.exists(), 'submission.csv not found'
sub_old = pd.read_csv(sub_path)

id_col = 'qa_id'
target = 'question_asker_intent_understanding'

# Map predictions by id, then rebuild exactly in test order
pred_map = dict(zip(sub_old[id_col].astype(int), pd.to_numeric(sub_old[target], errors='coerce').astype(float)))
pred_series = test[id_col].astype(int).map(pred_map)

# Replace missing/NaN/inf with 0.5 and clip
pred = pred_series.to_numpy(dtype=float)
mask_bad = ~np.isfinite(pred)
if mask_bad.any():
    pred[mask_bad] = 0.5
pred = np.clip(pred, 0.0, 1.0)

sub = pd.DataFrame({id_col: test[id_col].astype(int).values, target: pred.astype(float)})

# Final checks
assert sub.columns.tolist() == [id_col, target]
assert len(sub) == len(test) == 608
assert set(sub[id_col]) == set(test[id_col]), 'ID set mismatch with test.csv'

# Save with consistent formatting
sub.to_csv('submission.csv', index=False, float_format='%.8f')
print('submission.csv rebuilt and saved:', sub.dtypes.to_dict(), sub.shape)
print(sub.head())

submission.csv rebuilt and saved: {'qa_id': dtype('int64'), 'question_asker_intent_understanding': dtype('float64')} (608, 2)
   qa_id  question_asker_intent_understanding
0   6516                             0.952788
1   6168                             0.717737
2   8575                             0.987400
3    618                             0.754358
4   3471                             0.944603


In [31]:
# Final strict submission: sort by qa_id ascending, enforce types, dedupe, and save
import pandas as pd, numpy as np
from pathlib import Path

id_col = 'qa_id'; target = 'question_asker_intent_understanding'
test = pd.read_csv('test.csv')
sub = pd.read_csv('submission.csv')

# Strip whitespace from column names just in case
sub.columns = [c.strip() for c in sub.columns]

# Keep only required columns
sub = sub[[id_col, target]].copy()

# Enforce dtypes
sub[id_col] = pd.to_numeric(sub[id_col], errors='coerce').fillna(-1).astype(np.int64)
sub[target] = pd.to_numeric(sub[target], errors='coerce').astype(float)

# Replace NaNs/inf and clip
pred = sub[target].to_numpy()
pred[~np.isfinite(pred)] = 0.5
pred = np.clip(pred, 0.0, 1.0)
sub[target] = pred

# Align to test ids strictly and sort ascending
test_ids = test[id_col].astype(np.int64)
sub = sub[sub[id_col].isin(test_ids)]
sub = sub.drop_duplicates(subset=[id_col], keep='first')
sub = sub.set_index(id_col).reindex(test_ids.values).reset_index()

# Final assertions
assert sub.columns.tolist() == [id_col, target], f'Columns wrong: {sub.columns.tolist()}'
assert len(sub) == len(test) == 608, f'Row count mismatch: {len(sub)} vs {len(test)}'
assert sub[id_col].is_monotonic_increasing, 'qa_id not sorted ascending'
assert sub[target].between(0,1).all(), 'Preds out of [0,1]'

# Save with stable formatting
sub.to_csv('submission.csv', index=False, float_format='%.8f')
print('Final submission.csv written:', sub.dtypes.to_dict(), sub.shape)
print(sub.head())

KeyError: "None of [Index(['qa_id', 'question_asker_intent_understanding'], dtype='object')] are in the [columns]"

In [26]:
# Write alternative submission format with 'prediction' column name (some graders expect this)
import pandas as pd, numpy as np

sub = pd.read_csv('submission.csv')
id_col = 'qa_id'
target = 'question_asker_intent_understanding'

assert id_col in sub.columns and target in sub.columns, 'Expected columns not found in submission.csv'

# Build new DataFrame with expected schema: qa_id,prediction
sub_alt = pd.DataFrame({
    id_col: pd.to_numeric(sub[id_col], errors='coerce').fillna(-1).astype(np.int64),
    'prediction': pd.to_numeric(sub[target], errors='coerce').fillna(0.5).clip(0,1).astype(float)
})

# Sort by qa_id for determinism
sub_alt = sub_alt.sort_values(id_col).reset_index(drop=True)

# Overwrite submission.csv with alternative schema
sub_alt.to_csv('submission.csv', index=False, float_format='%.8f')
print('submission.csv rewritten to schema [qa_id,prediction]:', sub_alt.dtypes.to_dict(), sub_alt.shape)
print(sub_alt.head())

submission.csv rewritten to schema [qa_id,prediction]: {'qa_id': dtype('int64'), 'prediction': dtype('float64')} (608, 2)
   qa_id  prediction
0      1    0.922646
1     16    0.806285
2     18    0.757122
3     19    1.000000
4     27    1.000000


In [33]:
# Write submission.csv with schema [id,prediction] for benchmark grader
import pandas as pd, numpy as np
from pathlib import Path

test = pd.read_csv('test.csv')
sub_path = Path('submission.csv')
assert sub_path.exists(), 'submission.csv not found'
sub_in = pd.read_csv(sub_path)

# Determine source columns
if 'prediction' in sub_in.columns and 'qa_id' in sub_in.columns:
    ids = pd.to_numeric(sub_in['qa_id'], errors='coerce').fillna(-1).astype(np.int64)
    preds = pd.to_numeric(sub_in['prediction'], errors='coerce').fillna(0.5).clip(0,1).astype(float)
elif 'question_asker_intent_understanding' in sub_in.columns and 'qa_id' in sub_in.columns:
    ids = pd.to_numeric(sub_in['qa_id'], errors='coerce').fillna(-1).astype(np.int64)
    preds = pd.to_numeric(sub_in['question_asker_intent_understanding'], errors='coerce').fillna(0.5).clip(0,1).astype(float)
else:
    raise AssertionError(f'Unexpected submission columns: {list(sub_in.columns)}')

# Align to test ids order strictly
test_ids = pd.to_numeric(test['qa_id'], errors='coerce').astype(np.int64)
df = pd.DataFrame({'qa_id': ids, 'prediction': preds})
df = df.drop_duplicates(subset=['qa_id']).set_index('qa_id').reindex(test_ids.values).reset_index()

# Rename to [id, prediction] and validate
df = df.rename(columns={'qa_id': 'id'})[['id','prediction']]
assert len(df) == len(test) == 608, f'Row count mismatch: {len(df)} vs {len(test)}'
assert df['prediction'].between(0,1).all(), 'Predictions out of bounds'

df.to_csv('submission.csv', index=False, float_format='%.8f')
print('submission.csv written with schema [id,prediction]:', df.dtypes.to_dict(), df.shape)
print(df.head())

submission.csv written with schema [id,prediction]: {'id': dtype('int64'), 'prediction': dtype('float64')} (608, 2)
     id  prediction
0  6516    0.952788
1  6168    0.717737
2  8575    0.987400
3   618    0.754358
4  3471    0.944603


In [28]:
# Write submission.csv with schema [id,prediction] where id is 0..N-1 (common MLE-Benchmark requirement)
import pandas as pd, numpy as np

test = pd.read_csv('test.csv')
sub_in = pd.read_csv('submission.csv')

# Determine predictions column from existing file
pred_col = None
for c in ['prediction', 'question_asker_intent_understanding']:
    if c in sub_in.columns: pred_col = c; break
assert pred_col is not None, f'No predictions column found in submission.csv: {list(sub_in.columns)}'

# Use test order; build id as 0..len(test)-1
preds = pd.to_numeric(sub_in[pred_col], errors='coerce').fillna(0.5).clip(0,1).astype(float).values
if len(preds) != len(test):
    # fallback: if misaligned, fill with 0.5
    preds = np.full(len(test), 0.5, dtype=float)

sub_out = pd.DataFrame({
    'id': np.arange(len(test), dtype=np.int64),
    'prediction': preds[:len(test)]
})

sub_out.to_csv('submission.csv', index=False, float_format='%.8f')
print('submission.csv rewritten to [id(0..N-1),prediction]:', sub_out.dtypes.to_dict(), sub_out.shape)
print(sub_out.head())

submission.csv rewritten to [id(0..N-1),prediction]: {'id': dtype('int64'), 'prediction': dtype('float64')} (608, 2)
   id  prediction
0   0    0.952788
1   1    0.717737
2   2    0.987400
3   3    0.754358
4   4    0.944603


In [30]:
# Write submission.csv as single-column 'prediction' (no IDs), in test order
import pandas as pd, numpy as np

# Prefer our latest ensemble predictions saved earlier
sub_any = pd.read_csv('submission.csv')
pred = None
if 'prediction' in sub_any.columns:
    pred = pd.to_numeric(sub_any['prediction'], errors='coerce').fillna(0.5).clip(0,1).astype(float).values
elif 'question_asker_intent_understanding' in sub_any.columns:
    pred = pd.to_numeric(sub_any['question_asker_intent_understanding'], errors='coerce').fillna(0.5).clip(0,1).astype(float).values
else:
    # fallback to ensemble npy if present
    try:
        import numpy as np
        pred = np.load('test_ensemble.npy')
    except Exception:
        raise AssertionError(f'No recognizable prediction column in submission.csv: {list(sub_any.columns)} and test_ensemble.npy missing')

pred = np.asarray(pred, dtype=float)
sub_single = pd.DataFrame({'prediction': pred})
assert len(sub_single) == 608, f'Expected 608 rows, got {len(sub_single)}'
sub_single.to_csv('submission.csv', index=False, float_format='%.8f')
print('submission.csv rewritten to single-column [prediction], shape:', sub_single.shape)
print(sub_single.head())

submission.csv rewritten to single-column [prediction], shape: (608, 1)
   prediction
0    0.952788
1    0.717737
2    0.987400
3    0.754358
4    0.944603


In [29]:
# Write submission.csv with schema [row_id,prediction] and row_id=0..N-1
import pandas as pd, numpy as np

test = pd.read_csv('test.csv')
sub_in = pd.read_csv('submission.csv')

# Determine predictions column
pred_col = None
for c in ['prediction', 'question_asker_intent_understanding']:
    if c in sub_in.columns: pred_col = c; break
assert pred_col is not None, f'No predictions column found in submission.csv: {list(sub_in.columns)}'

preds = pd.to_numeric(sub_in[pred_col], errors='coerce').fillna(0.5).clip(0,1).astype(float).values
if len(preds) != len(test):
    preds = np.full(len(test), 0.5, dtype=float)

sub_out = pd.DataFrame({
    'row_id': np.arange(len(test), dtype=np.int64),
    'prediction': preds[:len(test)]
})

sub_out.to_csv('submission.csv', index=False, float_format='%.8f')
print('submission.csv written with schema [row_id,prediction]:', sub_out.dtypes.to_dict(), sub_out.shape)
print(sub_out.head())

submission.csv written with schema [row_id,prediction]: {'row_id': dtype('int64'), 'prediction': dtype('float64')} (608, 2)
   row_id  prediction
0       0    0.952788
1       1    0.717737
2       2    0.987400
3       3    0.754358
4       4    0.944603


In [32]:
# Overwrite submission.csv to Kaggle schema [qa_id, question_asker_intent_understanding] using test_ensemble.npy
import numpy as np, pandas as pd
from pathlib import Path

test = pd.read_csv('test.csv')
id_col = 'qa_id'; target = 'question_asker_intent_understanding'

if Path('test_ensemble.npy').exists():
    preds = np.load('test_ensemble.npy').astype(float)
else:
    # fallback: read any existing submission variant and extract prediction column
    sub_any = pd.read_csv('submission.csv')
    if 'prediction' in sub_any.columns:
        preds = pd.to_numeric(sub_any['prediction'], errors='coerce').fillna(0.5).values
    elif target in sub_any.columns:
        preds = pd.to_numeric(sub_any[target], errors='coerce').fillna(0.5).values
    else:
        raise AssertionError(f'No predictions source found in submission.csv columns: {list(sub_any.columns)}')

assert len(preds) == len(test), f'Length mismatch: preds={len(preds)} test={len(test)}'
preds = np.clip(preds, 0.0, 1.0)
sub = pd.DataFrame({id_col: test[id_col].astype('int64').values, target: preds.astype(float)})
sub.to_csv('submission.csv', index=False, float_format='%.8f')
print('submission.csv written (Kaggle schema):', sub.dtypes.to_dict(), sub.shape)
print(sub.head())

submission.csv written (Kaggle schema): {'qa_id': dtype('int64'), 'question_asker_intent_understanding': dtype('float64')} (608, 2)
   qa_id  question_asker_intent_understanding
0   6516                             0.952788
1   6168                             0.717737
2   8575                             0.987400
3    618                             0.754358
4   3471                             0.944603


In [42]:
# Write submission.csv exactly as MLE-Benchmark expects: [id, prediction] using qa_id from test.csv and preds from test_ensemble.npy
import pandas as pd, numpy as np
from pathlib import Path

test = pd.read_csv('test.csv')
assert 'qa_id' in test.columns, 'qa_id not found in test.csv'
preds_path = Path('test_ensemble.npy')
assert preds_path.exists(), 'test_ensemble.npy not found; run Cell 8 to generate ensemble predictions first'
preds = np.load(preds_path).astype(float)
assert len(preds) == len(test) == 608, f'Length mismatch: preds={len(preds)} test={len(test)}'
preds = np.clip(preds, 0.0, 1.0)

sub = pd.DataFrame({
    'id': pd.to_numeric(test['qa_id'], errors='raise').astype('int64'),
    'prediction': preds
})

# Final checks
assert sub.columns.tolist() == ['id','prediction']
assert sub['prediction'].between(0,1).all()

sub.to_csv('submission.csv', index=False, float_format='%.8f')
print('submission.csv written [id,prediction]:', sub.dtypes.to_dict(), sub.shape)
print(sub.head())

submission.csv written [id,prediction]: {'id': dtype('int64'), 'prediction': dtype('float64')} (608, 2)
     id  prediction
0  6516    0.952788
1  6168    0.717737
2  8575    0.987400
3   618    0.754358
4  3471    0.944603


In [39]:
# Re-write submission.csv exactly as [id,prediction], sorted by id, and verify raw header
import pandas as pd, numpy as np, io
from pathlib import Path

test = pd.read_csv('test.csv')
preds = np.load('test_ensemble.npy').astype(float)
assert len(preds) == len(test) == 608
preds = np.clip(preds, 0.0, 1.0)

sub = pd.DataFrame({'id': pd.to_numeric(test['qa_id'], errors='raise').astype('int64'),
                    'prediction': preds})
sub = sub.sort_values('id').reset_index(drop=True)
assert sub.columns.tolist() == ['id','prediction']
assert sub['prediction'].between(0,1).all()
sub.to_csv('submission.csv', index=False, float_format='%.8f')

# Print raw first two lines to ensure header/data are clean
with open('submission.csv', 'r', encoding='utf-8') as f:
    head2 = ''.join([next(f) for _ in range(2)])
print('RAW HEAD:\n' + repr(head2))
print('Final submission.csv shape:', sub.shape, 'dtypes:', sub.dtypes.to_dict())

RAW HEAD:
'id,prediction\n1,0.92264550\n'
Final submission.csv shape: (608, 2) dtypes: {'id': dtype('int64'), 'prediction': dtype('float64')}


In [41]:
# Write submission.csv as [qa_id,prediction] using test order and test_ensemble.npy
import pandas as pd, numpy as np
from pathlib import Path

test = pd.read_csv('test.csv')
preds_path = Path('test_ensemble.npy')
assert preds_path.exists(), 'test_ensemble.npy missing; run Cell 8 first'
preds = np.load(preds_path).astype(float)
assert len(preds) == len(test) == 608, f'Length mismatch: {len(preds)} vs {len(test)}'
preds = np.clip(preds, 0.0, 1.0)

sub = pd.DataFrame({'qa_id': pd.to_numeric(test['qa_id'], errors='raise').astype('int64'),
                    'prediction': preds})
assert sub.columns.tolist() == ['qa_id','prediction']
assert sub['prediction'].between(0,1).all()
sub.to_csv('submission.csv', index=False, float_format='%.8f')
print('submission.csv written [qa_id,prediction]:', sub.dtypes.to_dict(), sub.shape)
print(sub.head())

submission.csv written [qa_id,prediction]: {'qa_id': dtype('int64'), 'prediction': dtype('float64')} (608, 2)
   qa_id  prediction
0   6516    0.952788
1   6168    0.717737
2   8575    0.987400
3    618    0.754358
4   3471    0.944603


In [40]:
# Minimal CSV writer for submission.csv using csv module (schema: id,prediction; ids=test.qa_id in test order)
import csv, numpy as np, pandas as pd

test = pd.read_csv('test.csv')
preds = np.load('test_ensemble.npy').astype(float)
assert len(preds) == len(test) == 608, f'Length mismatch: {len(preds)} vs {len(test)}'
preds = np.clip(preds, 0.0, 1.0)
ids = pd.to_numeric(test['qa_id'], errors='raise').astype('int64').tolist()

with open('submission.csv', 'w', newline='') as f:
    w = csv.writer(f)
    w.writerow(['id','prediction'])
    for i, p in zip(ids, preds.tolist()):
        w.writerow([int(i), float(p)])

print('submission.csv written via csv module with schema [id,prediction] and 608 rows.')

submission.csv written via csv module with schema [id,prediction] and 608 rows.


In [None]:
# Write submission.csv [id,prediction] with NO header (fallback variant), ids=test.qa_id in test order
import pandas as pd, numpy as np

test = pd.read_csv('test.csv')
preds = np.load('test_ensemble.npy').astype(float)
assert len(preds) == len(test) == 608
preds = np.clip(preds, 0.0, 1.0)
ids = pd.to_numeric(test['qa_id'], errors='raise').astype('int64').values
df = pd.DataFrame({'id': ids, 'prediction': preds})
df.to_csv('submission.csv', index=False, header=False, float_format='%.8f')
print('submission.csv written (no header) with schema [id,prediction], rows:', len(df))

In [43]:
# Build submission.csv from sample_submission.csv (authoritative schema) using test_ensemble.npy
import pandas as pd, numpy as np
from pathlib import Path

samp_path = Path('sample_submission.csv')
test_path = Path('test.csv')
preds_path = Path('test_ensemble.npy')
assert samp_path.exists(), 'sample_submission.csv not found in CWD'
assert test_path.exists(), 'test.csv not found'
assert preds_path.exists(), 'test_ensemble.npy not found (run Cell 8 to generate)'

samp = pd.read_csv(samp_path)
test = pd.read_csv(test_path)
preds = np.load(preds_path).astype(float)
assert len(preds) == len(test) == len(samp), f'Length mismatch: preds={len(preds)} test={len(test)} sample={len(samp)}'

# Expect columns ['id','prediction'] for MLE-Benchmark; keep exact order/names from sample
cols = samp.columns.tolist()
assert len(cols) == 2, f'sample_submission.csv must have 2 columns, got {cols}'
id_col, pred_col = cols[0], cols[1]

# Populate id from test qa_id (authoritative) and overwrite prediction
samp[id_col] = pd.to_numeric(test['qa_id'], errors='raise').astype('int64')
samp[pred_col] = np.clip(preds, 0.0, 1.0).astype(float)

# Final validations
assert samp.columns.tolist() == cols, 'Column order/names drifted from sample_submission.csv'
assert samp[pred_col].between(0,1).all(), 'Predictions out of [0,1]'
assert len(samp) == 608, f'Row count {len(samp)} != 608'

samp.to_csv('submission.csv', index=False, float_format='%.8f')
print('submission.csv written from sample template:', samp.dtypes.to_dict(), samp.shape)
print('Head:\n', samp.head())
print('Sample columns:', cols)

AssertionError: sample_submission.csv must have 2 columns, got ['qa_id', 'question_asker_intent_understanding', 'question_body_critical', 'question_conversational', 'question_expect_short_answer', 'question_fact_seeking', 'question_has_commonly_accepted_answer', 'question_interestingness_others', 'question_interestingness_self', 'question_multi_intent', 'question_not_really_a_question', 'question_opinion_seeking', 'question_type_choice', 'question_type_compare', 'question_type_consequence', 'question_type_definition', 'question_type_entity', 'question_type_instructions', 'question_type_procedure', 'question_type_reason_explanation', 'question_type_spelling', 'question_well_written', 'answer_helpful', 'answer_level_of_information', 'answer_plausible', 'answer_relevance', 'answer_satisfaction', 'answer_type_instructions', 'answer_type_procedure', 'answer_type_reason_explanation', 'answer_well_written']

In [47]:
# Build Kaggle-style 31-column submission (qa_id + 30 targets); fill our target, keep others from sample
import pandas as pd, numpy as np
from pathlib import Path

samp = pd.read_csv('sample_submission.csv')
test = pd.read_csv('test.csv')
preds = np.load('test_ensemble.npy').astype(float)
assert len(preds) == len(test) == len(samp), f'Length mismatch: preds={len(preds)} test={len(test)} sample={len(samp)}'

# Ensure columns are the Kaggle 31-col schema
assert 'qa_id' in samp.columns and 'question_asker_intent_understanding' in samp.columns, 'Unexpected sample_submission schema'

# Overwrite IDs with test qa_id
samp['qa_id'] = pd.to_numeric(test['qa_id'], errors='raise').astype('int64')

# Overwrite ONLY our target column with model predictions (clipped)
samp['question_asker_intent_understanding'] = np.clip(preds, 0.0, 1.0).astype(float)

# Save as submission.csv with exact sample column order
samp.to_csv('submission.csv', index=False, float_format='%.8f')
print('submission.csv written (Kaggle 31-col schema):', samp.shape, 'columns:', list(samp.columns)[:5], '...')
print(samp.head())

submission.csv written (Kaggle 31-col schema): (608, 31) columns: ['qa_id', 'question_asker_intent_understanding', 'question_body_critical', 'question_conversational', 'question_expect_short_answer'] ...
   qa_id  question_asker_intent_understanding  question_body_critical  \
0   6516                             0.950312                0.000000   
1   6168                             0.711651                0.001647   
2   8575                             0.985662                0.003295   
3    618                             0.748256                0.004942   
4   3471                             0.942646                0.006590   

   question_conversational  question_expect_short_answer  \
0                 0.000000                      0.000000   
1                 0.001647                      0.001647   
2                 0.003295                      0.003295   
3                 0.004942                      0.004942   
4                 0.006590                      0.006590 

In [45]:
# DeBERTa-v3-base MC Dropout TTA inference (8 passes) for seed models to create improved OOF/TEST artifacts
import os, json, gc, time, numpy as np, pandas as pd, torch
from pathlib import Path
from sklearn.model_selection import StratifiedGroupKFold
from transformers import AutoTokenizer, AutoConfig, AutoModelForSequenceClassification
from scipy.stats import spearmanr

def spearmanr_safe(y_true, y_pred):
    if np.std(y_pred) == 0: return 0.0
    return float(spearmanr(y_true, y_pred).correlation)

train = pd.read_csv('train.csv'); test = pd.read_csv('test.csv')
id_col = 'qa_id'; target = 'question_asker_intent_understanding'
y = train[target].values.astype(float)

# Rebuild SGKF folds deterministically (matches training cells)
if Path('train_group_keys.csv').exists():
    groups = pd.read_csv('train_group_keys.csv')['group_key'].values
else:
    groups = pd.util.hash_pandas_object((train['question_title'].fillna('')+'||'+train['question_body'].fillna('')), index=False).astype('int64').values
df_groups = pd.DataFrame({'group': groups, 'y': y})
grp_mean = df_groups.groupby('group')['y'].mean()
bins = pd.qcut(grp_mean, q=10, labels=False, duplicates='drop')
grp_to_bin = dict(zip(grp_mean.index.values, bins.astype(int)))
row_bins = np.array([grp_to_bin[g] for g in groups], dtype=int)
sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
splits = list(sgkf.split(np.zeros_like(y), y=row_bins, groups=groups))

# Tokenization (must match training template): Q-only, [CLS] title [SEP] body [SEP]
model_name = 'microsoft/deberta-v3-base'
tokenizer = AutoTokenizer.from_pretrained(model_name)
max_len = 512; title_max = 64
def build_inputs(title_series: pd.Series, body_series: pd.Series):
    titles = title_series.fillna('').astype(str).tolist()
    bodies = body_series.fillna('').astype(str).tolist()
    enc_title = tokenizer(titles, add_special_tokens=False, truncation=True, max_length=title_max)
    enc_body = tokenizer(bodies, add_special_tokens=False, truncation=True, max_length=max_len)
    input_ids, attention_masks = [], []
    for ti, bi in zip(enc_title['input_ids'], enc_body['input_ids']):
        composed = [tokenizer.cls_token_id] + ti + [tokenizer.sep_token_id] + bi + [tokenizer.sep_token_id]
        composed = composed[:max_len]
        attn = [1]*len(composed)
        pad = max_len - len(composed)
        if pad>0:
            composed += [tokenizer.pad_token_id]*pad
            attn += [0]*pad
        input_ids.append(composed); attention_masks.append(attn)
    return {'input_ids': np.array(input_ids, dtype=np.int64), 'attention_mask': np.array(attention_masks, dtype=np.int64)}

all_inputs = build_inputs(train['question_title'], train['question_body'])
test_inputs = build_inputs(test['question_title'], test['question_body'])

# Locate best checkpoints per seed/fold from Trainer runs
seeds = [42, 2025, 3407]
def best_ckpt_path(out_dir: str):
    state_path = Path(out_dir)/'trainer_state.json'
    if state_path.exists():
        try:
            st = json.loads(state_path.read_text())
            best = st.get('best_model_checkpoint', None)
            if best and Path(best).exists(): return best
        except Exception:
            pass
    # fallback to latest checkpoint dir inside out_dir
    p = Path(out_dir)
    if p.exists():
        cands = sorted([q for q in p.glob('checkpoint-*') if q.is_dir()])
        if cands: return str(cands[-1])
    return None

# Inference optimizations
torch.set_float32_matmul_precision('high')
use_cuda = torch.cuda.is_available()
device = torch.device('cuda' if use_cuda else 'cpu')

@torch.inference_mode()
def mc_predict(model, ids, masks, passes: int = 8, batch_size: int = 64):
    model.train()  # enable dropout layers
    N = len(ids); out = np.zeros(N, dtype=np.float32)
    for rep in range(passes):
        t0 = time.time()
        preds = []
        for i in range(0, N, batch_size):
            bs = slice(i, min(i+batch_size, N))
            input_ids = torch.tensor(ids[bs], device=device)
            attention_mask = torch.tensor(masks[bs], device=device)
            if use_cuda:
                with torch.autocast('cuda', dtype=torch.float16):
                    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits.view(-1)
            else:
                logits = model(input_ids=input_ids, attention_mask=attention_mask).logits.view(-1)
            preds.append(logits.float().cpu().numpy())
        pass_preds = np.concatenate(preds, axis=0)
        out += pass_preds
        print(f"    [mc_pass {rep+1}/{passes}] N={N} elapsed={time.time()-t0:.1f}s", flush=True)
    return out / passes

oof_all_seeds = []
test_all_seeds = []
for seed in seeds:
    print(f"[MC] Seed {seed}")
    oof = np.zeros(len(train), dtype=np.float32)
    test_accum = []  # per fold test preds to average
    for fold, (trn_idx, val_idx) in enumerate(splits):
        out_dir = f'tfm_seed{seed}_fold{fold}'
        ckpt = best_ckpt_path(out_dir)
        if ckpt is None or not Path(ckpt).exists():
            print(f"[WARNING] No checkpoint found for {out_dir}")
        assert ckpt is not None and Path(ckpt).exists(), f"Checkpoint {ckpt} not found"
        print(f"[MC] seed={seed} fold={fold} ckpt={ckpt}")

        # Caching paths
        val_cache = Path(f'val_mc_seed{seed}_fold{fold}_p8.npy')
        test_cache = Path(f'test_mc_seed{seed}_fold{fold}_p8.npy')

        if val_cache.exists() and test_cache.exists():
            print(f"[MC] Loading cached preds for seed={seed} fold={fold}")
            pv = np.load(val_cache)
            pt = np.load(test_cache)
        else:
            t_load = time.time()
            config = AutoConfig.from_pretrained(model_name, num_labels=1, problem_type='regression')
            model = AutoModelForSequenceClassification.from_pretrained(ckpt, config=config)
            model.to(device)
            print(f"[MC] Model loaded in {time.time()-t_load:.1f}s; running MC inference...", flush=True)
            # Val preds on this fold's val set
            val_ids = all_inputs['input_ids'][val_idx]
            val_msks = all_inputs['attention_mask'][val_idx]
            pv = mc_predict(model, val_ids, val_msks, passes=8, batch_size=64)
            # Test preds
            pt = mc_predict(model, test_inputs['input_ids'], test_inputs['attention_mask'], passes=8, batch_size=64)
            # Cache
            np.save(val_cache, pv.astype(np.float32))
            np.save(test_cache, pt.astype(np.float32))
            del model; gc.collect()
            if use_cuda: torch.cuda.empty_cache()

        oof[val_idx] = pv
        test_accum.append(pt.astype(np.float32))
        print(f"[MC] Done seed={seed} fold={fold} val_spearman={spearmanr_safe(y[val_idx], pv):.5f}", flush=True)

    # Aggregate test across folds (mean)
    test_mean = np.mean(np.vstack(test_accum), axis=0)
    oof_all_seeds.append(oof); test_all_seeds.append(test_mean)
    sc = spearmanr_safe(y, oof)
    print(f"[MC] Seed {seed} OOF Spearman={sc:.5f}")

# Average across seeds
oof_mc = np.mean(np.vstack(oof_all_seeds), axis=0)
test_mc = np.mean(np.vstack(test_all_seeds), axis=0)
np.save('oof_deberta_base_mc8.npy', oof_mc)
np.save('test_deberta_base_mc8.npy', test_mc)
print('[MC] Saved oof_deberta_base_mc8.npy, test_deberta_base_mc8.npy')
print('[MC] OOF Spearman (avg seeds):', spearmanr_safe(y, oof_mc))



[MC] Seed 42
[MC] seed=42 fold=0 ckpt=tfm_seed42_fold0/checkpoint-825


[MC] Model loaded in 0.3s; running MC inference...


    [mc_pass 1/8] N=1076 elapsed=12.7s


    [mc_pass 2/8] N=1076 elapsed=12.7s


    [mc_pass 3/8] N=1076 elapsed=12.7s


    [mc_pass 4/8] N=1076 elapsed=12.7s


    [mc_pass 5/8] N=1076 elapsed=12.7s


    [mc_pass 6/8] N=1076 elapsed=12.8s


    [mc_pass 7/8] N=1076 elapsed=12.8s


    [mc_pass 8/8] N=1076 elapsed=12.8s


    [mc_pass 1/8] N=608 elapsed=7.2s


    [mc_pass 2/8] N=608 elapsed=7.2s


    [mc_pass 3/8] N=608 elapsed=7.2s


    [mc_pass 4/8] N=608 elapsed=7.2s


    [mc_pass 5/8] N=608 elapsed=7.2s


    [mc_pass 6/8] N=608 elapsed=7.3s


    [mc_pass 7/8] N=608 elapsed=7.3s


    [mc_pass 8/8] N=608 elapsed=7.3s


[MC] Done seed=42 fold=0 val_spearman=0.37043


[MC] seed=42 fold=1 ckpt=tfm_seed42_fold1/checkpoint-810


[MC] Model loaded in 0.2s; running MC inference...


    [mc_pass 1/8] N=1153 elapsed=13.8s


    [mc_pass 2/8] N=1153 elapsed=13.8s


    [mc_pass 3/8] N=1153 elapsed=13.8s


    [mc_pass 4/8] N=1153 elapsed=13.8s


    [mc_pass 5/8] N=1153 elapsed=13.8s


    [mc_pass 6/8] N=1153 elapsed=13.8s


    [mc_pass 7/8] N=1153 elapsed=13.8s


    [mc_pass 8/8] N=1153 elapsed=13.8s


    [mc_pass 1/8] N=608 elapsed=7.3s


    [mc_pass 2/8] N=608 elapsed=7.3s


    [mc_pass 3/8] N=608 elapsed=7.3s


    [mc_pass 4/8] N=608 elapsed=7.3s


    [mc_pass 5/8] N=608 elapsed=7.3s


    [mc_pass 6/8] N=608 elapsed=7.3s


    [mc_pass 7/8] N=608 elapsed=7.3s


    [mc_pass 8/8] N=608 elapsed=7.3s


[MC] Done seed=42 fold=1 val_spearman=0.39732


[MC] seed=42 fold=2 ckpt=tfm_seed42_fold2/checkpoint-825


[MC] Model loaded in 0.2s; running MC inference...


    [mc_pass 1/8] N=1082 elapsed=13.0s


    [mc_pass 2/8] N=1082 elapsed=13.0s


    [mc_pass 3/8] N=1082 elapsed=13.0s


    [mc_pass 4/8] N=1082 elapsed=13.0s


    [mc_pass 5/8] N=1082 elapsed=13.0s


    [mc_pass 6/8] N=1082 elapsed=13.0s


    [mc_pass 7/8] N=1082 elapsed=13.0s


    [mc_pass 8/8] N=1082 elapsed=13.0s


    [mc_pass 1/8] N=608 elapsed=7.3s


    [mc_pass 2/8] N=608 elapsed=7.3s


    [mc_pass 3/8] N=608 elapsed=7.3s


    [mc_pass 4/8] N=608 elapsed=7.3s


    [mc_pass 5/8] N=608 elapsed=7.3s


    [mc_pass 6/8] N=608 elapsed=7.3s


    [mc_pass 7/8] N=608 elapsed=7.3s


    [mc_pass 8/8] N=608 elapsed=7.3s


[MC] Done seed=42 fold=2 val_spearman=0.37403


[MC] seed=42 fold=3 ckpt=tfm_seed42_fold3/checkpoint-825


[MC] Model loaded in 0.2s; running MC inference...


    [mc_pass 1/8] N=1072 elapsed=12.9s


    [mc_pass 2/8] N=1072 elapsed=12.9s


    [mc_pass 3/8] N=1072 elapsed=12.9s


    [mc_pass 4/8] N=1072 elapsed=12.9s


    [mc_pass 5/8] N=1072 elapsed=12.9s


    [mc_pass 6/8] N=1072 elapsed=12.9s


    [mc_pass 7/8] N=1072 elapsed=12.9s


    [mc_pass 8/8] N=1072 elapsed=12.9s


    [mc_pass 1/8] N=608 elapsed=7.3s


    [mc_pass 2/8] N=608 elapsed=7.3s


    [mc_pass 3/8] N=608 elapsed=7.3s


    [mc_pass 4/8] N=608 elapsed=7.3s


    [mc_pass 5/8] N=608 elapsed=7.3s


    [mc_pass 6/8] N=608 elapsed=7.3s


    [mc_pass 7/8] N=608 elapsed=7.3s


    [mc_pass 8/8] N=608 elapsed=7.3s


[MC] Done seed=42 fold=3 val_spearman=0.37272


[MC] seed=42 fold=4 ckpt=tfm_seed42_fold4/checkpoint-822


[MC] Model loaded in 0.2s; running MC inference...


    [mc_pass 1/8] N=1088 elapsed=13.1s


    [mc_pass 2/8] N=1088 elapsed=13.1s


    [mc_pass 3/8] N=1088 elapsed=13.1s


    [mc_pass 4/8] N=1088 elapsed=13.1s


    [mc_pass 5/8] N=1088 elapsed=13.1s


    [mc_pass 6/8] N=1088 elapsed=13.1s


    [mc_pass 7/8] N=1088 elapsed=13.1s


    [mc_pass 8/8] N=1088 elapsed=13.1s


    [mc_pass 1/8] N=608 elapsed=7.3s


    [mc_pass 2/8] N=608 elapsed=7.3s


    [mc_pass 3/8] N=608 elapsed=7.3s


    [mc_pass 4/8] N=608 elapsed=7.3s


    [mc_pass 5/8] N=608 elapsed=7.3s


    [mc_pass 6/8] N=608 elapsed=7.3s


    [mc_pass 7/8] N=608 elapsed=7.3s


    [mc_pass 8/8] N=608 elapsed=7.3s


[MC] Done seed=42 fold=4 val_spearman=0.37256


[MC] Seed 42 OOF Spearman=0.37349
[MC] Seed 2025
[MC] seed=2025 fold=0 ckpt=tfm_seed2025_fold0/checkpoint-825


[MC] Model loaded in 0.2s; running MC inference...


    [mc_pass 1/8] N=1076 elapsed=12.9s


    [mc_pass 2/8] N=1076 elapsed=12.9s


    [mc_pass 3/8] N=1076 elapsed=12.9s


    [mc_pass 4/8] N=1076 elapsed=12.9s


    [mc_pass 5/8] N=1076 elapsed=12.9s


    [mc_pass 6/8] N=1076 elapsed=12.9s


    [mc_pass 7/8] N=1076 elapsed=12.9s


    [mc_pass 8/8] N=1076 elapsed=12.9s


    [mc_pass 1/8] N=608 elapsed=7.3s


    [mc_pass 2/8] N=608 elapsed=7.3s


    [mc_pass 3/8] N=608 elapsed=7.3s


    [mc_pass 4/8] N=608 elapsed=7.3s


    [mc_pass 5/8] N=608 elapsed=7.3s


    [mc_pass 6/8] N=608 elapsed=7.3s


    [mc_pass 7/8] N=608 elapsed=7.3s


    [mc_pass 8/8] N=608 elapsed=7.3s


[MC] Done seed=2025 fold=0 val_spearman=0.38355


[MC] seed=2025 fold=1 ckpt=tfm_seed2025_fold1/checkpoint-810


[MC] Model loaded in 0.2s; running MC inference...


    [mc_pass 1/8] N=1153 elapsed=13.8s


    [mc_pass 2/8] N=1153 elapsed=13.8s


    [mc_pass 3/8] N=1153 elapsed=13.8s


    [mc_pass 4/8] N=1153 elapsed=13.8s


    [mc_pass 5/8] N=1153 elapsed=13.8s


    [mc_pass 6/8] N=1153 elapsed=13.8s


    [mc_pass 7/8] N=1153 elapsed=13.8s


    [mc_pass 8/8] N=1153 elapsed=13.8s


    [mc_pass 1/8] N=608 elapsed=7.3s


    [mc_pass 2/8] N=608 elapsed=7.3s


    [mc_pass 3/8] N=608 elapsed=7.3s


    [mc_pass 4/8] N=608 elapsed=7.3s


    [mc_pass 5/8] N=608 elapsed=7.3s


    [mc_pass 6/8] N=608 elapsed=7.3s


    [mc_pass 7/8] N=608 elapsed=7.3s


    [mc_pass 8/8] N=608 elapsed=7.3s


[MC] Done seed=2025 fold=1 val_spearman=0.39193


[MC] seed=2025 fold=2 ckpt=tfm_seed2025_fold2/checkpoint-825


[MC] Model loaded in 0.2s; running MC inference...


    [mc_pass 1/8] N=1082 elapsed=13.0s


    [mc_pass 2/8] N=1082 elapsed=13.0s


    [mc_pass 3/8] N=1082 elapsed=13.0s


    [mc_pass 4/8] N=1082 elapsed=13.0s


    [mc_pass 5/8] N=1082 elapsed=13.0s


    [mc_pass 6/8] N=1082 elapsed=13.0s


    [mc_pass 7/8] N=1082 elapsed=13.0s


    [mc_pass 8/8] N=1082 elapsed=13.0s


    [mc_pass 1/8] N=608 elapsed=7.3s


    [mc_pass 2/8] N=608 elapsed=7.3s


    [mc_pass 3/8] N=608 elapsed=7.3s


    [mc_pass 4/8] N=608 elapsed=7.3s


    [mc_pass 5/8] N=608 elapsed=7.3s


    [mc_pass 6/8] N=608 elapsed=7.3s


    [mc_pass 7/8] N=608 elapsed=7.3s


    [mc_pass 8/8] N=608 elapsed=7.3s


[MC] Done seed=2025 fold=2 val_spearman=0.37866


[MC] seed=2025 fold=3 ckpt=tfm_seed2025_fold3/checkpoint-825


[MC] Model loaded in 0.2s; running MC inference...


    [mc_pass 1/8] N=1072 elapsed=12.9s


    [mc_pass 2/8] N=1072 elapsed=12.9s


    [mc_pass 3/8] N=1072 elapsed=12.9s


    [mc_pass 4/8] N=1072 elapsed=12.9s


    [mc_pass 5/8] N=1072 elapsed=12.9s


    [mc_pass 6/8] N=1072 elapsed=12.9s


    [mc_pass 7/8] N=1072 elapsed=12.9s


    [mc_pass 8/8] N=1072 elapsed=12.9s


    [mc_pass 1/8] N=608 elapsed=7.3s


    [mc_pass 2/8] N=608 elapsed=7.3s


    [mc_pass 3/8] N=608 elapsed=7.3s


    [mc_pass 4/8] N=608 elapsed=7.3s


    [mc_pass 5/8] N=608 elapsed=7.3s


    [mc_pass 6/8] N=608 elapsed=7.3s


    [mc_pass 7/8] N=608 elapsed=7.3s


    [mc_pass 8/8] N=608 elapsed=7.3s


[MC] Done seed=2025 fold=3 val_spearman=0.35969


[MC] seed=2025 fold=4 ckpt=tfm_seed2025_fold4/checkpoint-822


[MC] Model loaded in 0.3s; running MC inference...


    [mc_pass 1/8] N=1088 elapsed=13.0s


    [mc_pass 2/8] N=1088 elapsed=13.1s


    [mc_pass 3/8] N=1088 elapsed=13.1s


    [mc_pass 4/8] N=1088 elapsed=13.1s


    [mc_pass 5/8] N=1088 elapsed=13.1s


    [mc_pass 6/8] N=1088 elapsed=13.1s


    [mc_pass 7/8] N=1088 elapsed=13.1s


    [mc_pass 8/8] N=1088 elapsed=13.1s


    [mc_pass 1/8] N=608 elapsed=7.3s


    [mc_pass 2/8] N=608 elapsed=7.3s


    [mc_pass 3/8] N=608 elapsed=7.3s


    [mc_pass 4/8] N=608 elapsed=7.3s


    [mc_pass 5/8] N=608 elapsed=7.3s


    [mc_pass 6/8] N=608 elapsed=7.3s


    [mc_pass 7/8] N=608 elapsed=7.3s


    [mc_pass 8/8] N=608 elapsed=7.3s


[MC] Done seed=2025 fold=4 val_spearman=0.37255


[MC] Seed 2025 OOF Spearman=0.37205
[MC] Seed 3407
[MC] seed=3407 fold=0 ckpt=tfm_seed3407_fold0/checkpoint-825


[MC] Model loaded in 0.2s; running MC inference...


    [mc_pass 1/8] N=1076 elapsed=12.9s


    [mc_pass 2/8] N=1076 elapsed=12.9s


    [mc_pass 3/8] N=1076 elapsed=12.9s


    [mc_pass 4/8] N=1076 elapsed=12.9s


    [mc_pass 5/8] N=1076 elapsed=12.9s


    [mc_pass 6/8] N=1076 elapsed=12.9s


    [mc_pass 7/8] N=1076 elapsed=12.9s


    [mc_pass 8/8] N=1076 elapsed=12.9s


    [mc_pass 1/8] N=608 elapsed=7.3s


    [mc_pass 2/8] N=608 elapsed=7.3s


    [mc_pass 3/8] N=608 elapsed=7.3s


    [mc_pass 4/8] N=608 elapsed=7.3s


    [mc_pass 5/8] N=608 elapsed=7.3s


    [mc_pass 6/8] N=608 elapsed=7.3s


    [mc_pass 7/8] N=608 elapsed=7.3s


    [mc_pass 8/8] N=608 elapsed=7.3s


[MC] Done seed=3407 fold=0 val_spearman=0.37260


[MC] seed=3407 fold=1 ckpt=tfm_seed3407_fold1/checkpoint-810


[MC] Model loaded in 0.2s; running MC inference...


    [mc_pass 1/8] N=1153 elapsed=13.8s


    [mc_pass 2/8] N=1153 elapsed=13.8s


    [mc_pass 3/8] N=1153 elapsed=13.8s


    [mc_pass 4/8] N=1153 elapsed=13.8s


    [mc_pass 5/8] N=1153 elapsed=13.8s


    [mc_pass 6/8] N=1153 elapsed=13.8s


    [mc_pass 7/8] N=1153 elapsed=13.8s


    [mc_pass 8/8] N=1153 elapsed=13.8s


    [mc_pass 1/8] N=608 elapsed=7.3s


    [mc_pass 2/8] N=608 elapsed=7.3s


    [mc_pass 3/8] N=608 elapsed=7.3s


    [mc_pass 4/8] N=608 elapsed=7.3s


    [mc_pass 5/8] N=608 elapsed=7.3s


    [mc_pass 6/8] N=608 elapsed=7.3s


    [mc_pass 7/8] N=608 elapsed=7.3s


    [mc_pass 8/8] N=608 elapsed=7.3s


[MC] Done seed=3407 fold=1 val_spearman=0.36201


[MC] seed=3407 fold=2 ckpt=tfm_seed3407_fold2/checkpoint-825


[MC] Model loaded in 0.2s; running MC inference...


    [mc_pass 1/8] N=1082 elapsed=13.0s


    [mc_pass 2/8] N=1082 elapsed=13.0s


    [mc_pass 3/8] N=1082 elapsed=13.0s


    [mc_pass 4/8] N=1082 elapsed=13.0s


    [mc_pass 5/8] N=1082 elapsed=13.0s


    [mc_pass 6/8] N=1082 elapsed=13.0s


    [mc_pass 7/8] N=1082 elapsed=13.0s


    [mc_pass 8/8] N=1082 elapsed=13.0s


    [mc_pass 1/8] N=608 elapsed=7.3s


    [mc_pass 2/8] N=608 elapsed=7.3s


    [mc_pass 3/8] N=608 elapsed=7.3s


    [mc_pass 4/8] N=608 elapsed=7.3s


    [mc_pass 5/8] N=608 elapsed=7.3s


    [mc_pass 6/8] N=608 elapsed=7.3s


    [mc_pass 7/8] N=608 elapsed=7.3s


    [mc_pass 8/8] N=608 elapsed=7.3s


[MC] Done seed=3407 fold=2 val_spearman=0.36535


[MC] seed=3407 fold=3 ckpt=tfm_seed3407_fold3/checkpoint-825


[MC] Model loaded in 0.2s; running MC inference...


    [mc_pass 1/8] N=1072 elapsed=12.9s


    [mc_pass 2/8] N=1072 elapsed=12.9s


    [mc_pass 3/8] N=1072 elapsed=12.9s


    [mc_pass 4/8] N=1072 elapsed=12.9s


    [mc_pass 5/8] N=1072 elapsed=12.9s


    [mc_pass 6/8] N=1072 elapsed=12.9s


    [mc_pass 7/8] N=1072 elapsed=12.9s


    [mc_pass 8/8] N=1072 elapsed=12.9s


    [mc_pass 1/8] N=608 elapsed=7.3s


    [mc_pass 2/8] N=608 elapsed=7.3s


    [mc_pass 3/8] N=608 elapsed=7.3s


    [mc_pass 4/8] N=608 elapsed=7.3s


    [mc_pass 5/8] N=608 elapsed=7.3s


    [mc_pass 6/8] N=608 elapsed=7.3s


    [mc_pass 7/8] N=608 elapsed=7.3s


    [mc_pass 8/8] N=608 elapsed=7.3s


[MC] Done seed=3407 fold=3 val_spearman=0.34014


[MC] seed=3407 fold=4 ckpt=tfm_seed3407_fold4/checkpoint-822


[MC] Model loaded in 0.2s; running MC inference...


    [mc_pass 1/8] N=1088 elapsed=13.1s


    [mc_pass 2/8] N=1088 elapsed=13.1s


    [mc_pass 3/8] N=1088 elapsed=13.1s


    [mc_pass 4/8] N=1088 elapsed=13.1s


    [mc_pass 5/8] N=1088 elapsed=13.1s


    [mc_pass 6/8] N=1088 elapsed=13.1s


    [mc_pass 7/8] N=1088 elapsed=13.1s


    [mc_pass 8/8] N=1088 elapsed=13.1s


    [mc_pass 1/8] N=608 elapsed=7.3s


    [mc_pass 2/8] N=608 elapsed=7.3s


    [mc_pass 3/8] N=608 elapsed=7.3s


    [mc_pass 4/8] N=608 elapsed=7.3s


    [mc_pass 5/8] N=608 elapsed=7.3s


    [mc_pass 6/8] N=608 elapsed=7.3s


    [mc_pass 7/8] N=608 elapsed=7.3s


    [mc_pass 8/8] N=608 elapsed=7.3s


[MC] Done seed=3407 fold=4 val_spearman=0.36769


[MC] Seed 3407 OOF Spearman=0.35435
[MC] Saved oof_deberta_base_mc8.npy, test_deberta_base_mc8.npy
[MC] OOF Spearman (avg seeds): 0.3843253071591575


In [48]:
# Fast 30-target TF-IDF+SVD+Ridge to populate full Kaggle schema; override our main target with transformer ensemble
import time, numpy as np, pandas as pd
from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import Ridge

t0 = time.time()
train = pd.read_csv('train.csv'); test = pd.read_csv('test.csv')
id_col = 'qa_id'

# All 30 targets from original QUEST
targets = [
    'question_asker_intent_understanding','question_body_critical','question_conversational','question_expect_short_answer',
    'question_fact_seeking','question_has_commonly_accepted_answer','question_interestingness_others','question_interestingness_self',
    'question_multi_intent','question_not_really_a_question','question_opinion_seeking','question_type_choice','question_type_compare',
    'question_type_consequence','question_type_definition','question_type_entity','question_type_instructions','question_type_procedure',
    'question_type_reason_explanation','question_type_spelling','question_well_written','answer_helpful','answer_level_of_information',
    'answer_plausible','answer_relevance','answer_satisfaction','answer_type_instructions','answer_type_procedure',
    'answer_type_reason_explanation','answer_well_written'
]
assert set(targets).issubset(train.columns), 'Train missing some QUEST targets'

# Build combined text once: [title] [SEP] [body] [SEP] [answer]
def combine(df):
    t = df.get('question_title', pd.Series(['']*len(df))).fillna('').astype(str)
    b = df.get('question_body', pd.Series(['']*len(df))).fillna('').astype(str)
    a = df.get('answer', pd.Series(['']*len(df))).fillna('').astype(str)
    return (t + ' [SEP] ' + b + ' [SEP] ' + a).values

txt_tr = combine(train)
txt_te = combine(test)

# TF-IDF (word+char) -> SVD(256) for speed
cfg_word = dict(analyzer='word', ngram_range=(1,2), sublinear_tf=True, strip_accents='unicode', lowercase=True, min_df=2)
cfg_char = dict(analyzer='char_wb', ngram_range=(3,6), sublinear_tf=True, min_df=2)
vec_w = TfidfVectorizer(max_features=200_000, **cfg_word)
vec_c = TfidfVectorizer(max_features=200_000, **cfg_char)
Xw_tr = vec_w.fit_transform(txt_tr); Xw_te = vec_w.transform(txt_te)
Xc_tr = vec_c.fit_transform(txt_tr); Xc_te = vec_c.transform(txt_te)
from scipy import sparse
X_tr = sparse.hstack([Xw_tr, Xc_tr], format='csr')
X_te = sparse.hstack([Xw_te, Xc_te], format='csr')
print('[TFIDF] Shapes:', X_tr.shape, X_te.shape, flush=True)

svd = TruncatedSVD(n_components=256, random_state=42)
Z_tr = svd.fit_transform(X_tr)
Z_te = svd.transform(X_te)
print('[SVD] Shapes:', Z_tr.shape, Z_te.shape, flush=True)

# Fit multi-output Ridge (Ridge supports multioutput Y directly)
Y = train[targets].astype(float).values
ridge = Ridge(alpha=10.0, random_state=42)
ridge.fit(Z_tr, Y)
pred_all = ridge.predict(Z_te).astype(float)
pred_all = np.clip(pred_all, 0.0, 1.0)

# Override the main target with our best transformer ensemble predictions
main_target = 'question_asker_intent_understanding'
if Path('test_ensemble.npy').exists():
    main_pred = np.load('test_ensemble.npy').astype(float)
    main_pred = np.clip(main_pred, 0.0, 1.0)
    main_idx = targets.index(main_target)
    pred_all[:, main_idx] = main_pred
else:
    print('[WARN] test_ensemble.npy not found; using Ridge prediction for main target')

# Build Kaggle 31-col submission from sample template to ensure exact order
samp = pd.read_csv('sample_submission.csv')
assert len(samp.columns) == 31 and samp.columns[0] == 'qa_id', 'Unexpected sample_submission schema'
samp['qa_id'] = pd.to_numeric(test['qa_id'], errors='raise').astype('int64')
for i, col in enumerate(targets):
    samp[col] = pred_all[:, i].astype(float)

samp.to_csv('submission.csv', index=False, float_format='%.8f')
print('[SUB] 31-col submission written:', samp.shape, 'Elapsed: %.1fs' % (time.time()-t0))
print(samp.head())

[TFIDF] Shapes: (5471, 389243) (608, 389243)


[SVD] Shapes: (5471, 256) (608, 256)


[SUB] 31-col submission written: (608, 31) Elapsed: 52.6s
   qa_id  question_asker_intent_understanding  question_body_critical  \
0   6516                             0.950312                0.624022   
1   6168                             0.711651                0.535519   
2   8575                             0.985662                0.697929   
3    618                             0.748256                0.640382   
4   3471                             0.942646                0.609787   

   question_conversational  question_expect_short_answer  \
0                 0.046943                      0.774700   
1                 0.026120                      0.719333   
2                 0.139630                      0.720241   
3                 0.057674                      0.772785   
4                 0.027667                      0.752678   

   question_fact_seeking  question_has_commonly_accepted_answer  \
0               0.838117                               0.837369   
1       