# Spooky Author Identification: Plan

Goals:
- Establish a strong baseline quickly, then iterate toward a medal.
- Metric: multi-class log loss; target: author (3 classes).
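
The metric can be sketched in a few lines of NumPy. This is an illustrative re-implementation, not the competition's scoring code; the function name `multiclass_log_loss` and the clipping constant `eps` are assumptions made for the example.

```python
import numpy as np

def multiclass_log_loss(y_idx, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_idx: integer class indices, shape (n,)
    probs: predicted probabilities, shape (n, n_classes)
    """
    probs = np.clip(np.asarray(probs, dtype=np.float64), eps, 1 - eps)
    probs = probs / probs.sum(axis=1, keepdims=True)  # renormalize after clipping
    picked = probs[np.arange(len(y_idx)), y_idx]      # probability of the true class
    return float(-np.mean(np.log(picked)))

# A uniform prediction over 3 classes scores ln(3), a useful "know-nothing" reference.
uniform = np.full((4, 3), 1 / 3)
uniform_loss = multiclass_log_loss(np.array([0, 1, 2, 0]), uniform)
print(uniform_loss)
```

Any model scoring above roughly 1.0986 (ln 3) is doing worse than guessing uniformly, which makes it a handy sanity bound when reading fold logs.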

Environment & Discipline:
- Verify GPU availability immediately (nvidia-smi). If unavailable, warn and continue on CPU (sufficient for linear TF-IDF models).
- Single, deterministic CV: StratifiedKFold (n_splits=5, shuffle=True, seed=42).
- Cache vectorized matrices (TF-IDF) once; slice per fold.
- Log progress and timing per fold; save OOF and test logits for ensembling.
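
One way to keep the CV split fixed across cells is to compute the fold assignment once and store it. The sketch below assumes a helper named `make_fold_ids` and a file name `fold_ids.npy`; both are hypothetical and do not appear elsewhere in this notebook.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def make_fold_ids(y, n_splits=5, seed=42):
    """Return, for each row, the index (0..n_splits-1) of the fold it validates in."""
    y = np.asarray(y)
    fold_ids = np.full(len(y), -1, dtype=np.int64)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # StratifiedKFold only needs y for stratification; X can be a placeholder.
    for k, (_, val_idx) in enumerate(skf.split(np.zeros(len(y)), y)):
        fold_ids[val_idx] = k
    return fold_ids

# Toy labels standing in for train['author'].
y_demo = np.array(['EAP', 'HPL', 'MWS'] * 10)
fold_ids = make_fold_ids(y_demo)
# np.save('fold_ids.npy', fold_ids)  # hypothetical file name; reload in later cells
```

Later cells can then use `fold_ids == k` as the validation mask instead of rebuilding the splitter, so every model's OOF predictions line up row for row.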

Baseline v1:
- Text-only linear models with TF-IDF:
  - Word n-grams (1–2), Character n-grams (3–5).
  - Models: Logistic Regression (saga or liblinear), plus LinearSVC with Platt scaling or SGDClassifier(loss='log_loss').
  - NB-SVM style log-count ratio variant for comparison.
- Expect strong baseline (classic for this comp) with CV logloss ~0.27–0.30; push to ≤0.27 for silver.
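
The NB-SVM style variant is not implemented in the cells below, so here is a minimal sketch of the idea under stated assumptions: one-vs-rest, a smoothed log-count ratio per class, and Logistic Regression standing in for the SVM. The helper names (`log_count_ratio`, `fit_nb_lr_ovr`, `predict_nb_lr_ovr`), the toy sentences, and the values of `C` and `alpha` are illustrative, not tuned settings.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def log_count_ratio(X, y_bin, alpha=1.0):
    """Smoothed log ratio of normalized feature counts: positive class vs. rest."""
    p = alpha + np.asarray(X[y_bin == 1].sum(axis=0)).ravel()
    q = alpha + np.asarray(X[y_bin == 0].sum(axis=0)).ravel()
    return np.log((p / p.sum()) / (q / q.sum()))

def fit_nb_lr_ovr(X, y, classes, C=4.0, alpha=1.0):
    """For each class: scale features by its log-count ratio, then fit a binary LR."""
    models = []
    for c in classes:
        y_bin = (y == c).astype(int)
        r = log_count_ratio(X, y_bin, alpha)
        clf = LogisticRegression(C=C, max_iter=1000)
        clf.fit(X.multiply(r.reshape(1, -1)).tocsr(), y_bin)
        models.append((r, clf))
    return models

def predict_nb_lr_ovr(models, X):
    """Stack per-class positive probabilities and renormalize each row to sum to 1."""
    cols = [clf.predict_proba(X.multiply(r.reshape(1, -1)).tocsr())[:, 1]
            for r, clf in models]
    scores = np.column_stack(cols)
    return scores / scores.sum(axis=1, keepdims=True)

# Toy corpus (made up for illustration) standing in for train['text'] / train['author'].
texts = np.array(["the raven croaked nevermore", "a raven at the chamber door",
                  "the nameless city of cyclopean stone", "cyclopean horrors beneath the city",
                  "my creature wandered the glacier", "the creature wept upon the glacier"])
authors = np.array(["EAP", "EAP", "HPL", "HPL", "MWS", "MWS"])
classes = np.unique(authors)
X = CountVectorizer(ngram_range=(1, 2)).fit_transform(texts)
probs = predict_nb_lr_ovr(fit_nb_lr_ovr(X, authors, classes), X)
```

In a real run the ratio and the classifier would be fit inside each CV fold, exactly like the TF-IDF vectorizers, so the OOF predictions stay leak-free.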

Feature Engineering v2:
- Tune TF-IDF ranges, min_df, sublinear_tf, normalization.
- Combine word + char spaces (FeatureUnion or hstack).
- Add simple lexical features (length, punctuation, capitalization ratios) to a tree model (CatBoost/XGBoost) and/or concatenate to linear.
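
The simple lexical features could look like the following sketch. The exact feature set and the function name `lexical_features` are assumptions chosen for illustration; the sample sentence is one of the training rows printed by the EDA cell.

```python
import string
import numpy as np

def lexical_features(text):
    """Length, punctuation, and capitalization statistics for one sentence."""
    n_chars = max(len(text), 1)   # guard against empty strings
    words = text.split()
    n_words = max(len(words), 1)
    return [
        len(text),                                                # character count
        len(words),                                               # whitespace-token count
        sum(ch in string.punctuation for ch in text) / n_chars,   # punctuation ratio
        text.count(',') / n_words,                                # commas per word
        text.count(';') / n_words,                                # semicolons per word
        sum(ch.isupper() for ch in text) / n_chars,               # uppercase-character ratio
        sum(w[0].isupper() for w in words) / n_words,             # capitalized-word ratio
    ]

sample = "The frame was oval, richly gilded and filigreed in Moresque."
feats = np.array([lexical_features(sample)], dtype=np.float32)
print(feats)
```

A dense block like this can be fed to a tree model directly, or scaled and wrapped in `scipy.sparse.csr_matrix` before being stacked next to the TF-IDF matrices.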

Modeling v2/v3:
- Try multinomial Naive Bayes, Logistic Regression (C sweep), SGD (alpha sweep).
- Calibrate scores (cv=5, method='isotonic' or 'sigmoid') if needed for logloss.
- Optional: Light CatBoost text (CPU) or XGBoost on sparse (GPU) if helpful.
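
As a rough illustration of the calibration step, the sketch below wraps a margin classifier in `CalibratedClassifierCV` so that it exposes `predict_proba`. Synthetic dense data from `make_classification` stands in for the TF-IDF matrix, and `C=0.5` is an arbitrary value, not a tuned one.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic 3-class problem standing in for the TF-IDF features (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=20, n_informative=5,
                                     n_classes=3, random_state=42)

# LinearSVC has no predict_proba; the wrapper fits it on internal CV splits and
# learns a sigmoid (Platt) mapping from decision scores to probabilities.
svc = LinearSVC(C=0.5, random_state=42)
calibrated = CalibratedClassifierCV(svc, method='sigmoid', cv=5)
calibrated.fit(X_demo, y_demo)

proba_demo = calibrated.predict_proba(X_demo)  # shape (n_samples, n_classes)
```

`method='isotonic'` is the non-parametric alternative; it tends to need more data per class than the sigmoid mapping, so comparing both on OOF log loss is the safer way to choose.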

Ensembling:
- Blend diverse OOFs (word vs char models, NB-SVM vs LR vs SGD).
- Weight by CV logloss; simple weighted average.
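
Weighting by CV log loss can be sketched as inverse-loss weights followed by a weighted average. The helpers `inverse_loss_weights` and `blend` and the `power` knob are hypothetical; the two loss values are the best OOF scores logged for the char_wb and word LR runs further down.

```python
import numpy as np

def inverse_loss_weights(losses, power=1.0):
    """Weights proportional to 1 / loss**power, normalized to sum to 1."""
    inv = 1.0 / np.asarray(losses, dtype=np.float64) ** power
    return inv / inv.sum()

def blend(prob_list, weights):
    """Weighted average of (n, n_classes) probability arrays, renormalized per row."""
    out = sum(w * p for w, p in zip(weights, prob_list))
    return out / out.sum(axis=1, keepdims=True)

# Lower loss gets the larger weight.
w = inverse_loss_weights([0.41153, 0.44133])

# Two toy prediction arrays standing in for saved OOF/test probabilities.
p_a = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
p_b = np.array([[0.5, 0.3, 0.2], [0.2, 0.6, 0.2]])
p_blend = blend([p_a, p_b], w)
```

Raising `power` sharpens the preference for the better model; with losses this close together the weights stay near 0.5 each, so a small grid search over weights scored on OOF log loss may be worth the extra minute.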

Validation & Tracking:
- Save and reuse: folds, vectorizers, OOF preds, test preds.
- Inspect per-class confusion and confidence bins to guide tweaks.
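
The per-class confusion and confidence-bin inspection can be done directly on an OOF array. The function names and bin edges below are assumptions for illustration, and the tiny arrays are made-up stand-ins for real OOF predictions.

```python
import numpy as np

def confusion_counts(y_idx, probs):
    """Rows are true classes, columns are predicted (argmax) classes."""
    n_classes = probs.shape[1]
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(cm, (y_idx, probs.argmax(axis=1)), 1)
    return cm

def confidence_bins(y_idx, probs, edges=(0.0, 0.5, 0.7, 0.9, 1.0)):
    """Per bin of top-class confidence: (low, high, count, accuracy)."""
    conf = probs.max(axis=1)
    correct = probs.argmax(axis=1) == y_idx
    bin_of = np.digitize(conf, edges[1:-1])  # bin index 0..len(edges)-2
    rows = []
    for b in range(len(edges) - 1):
        mask = bin_of == b
        n = int(mask.sum())
        acc = float(correct[mask].mean()) if n else float('nan')
        rows.append((edges[b], edges[b + 1], n, acc))
    return rows

y_toy = np.array([0, 1, 2, 0])
p_toy = np.array([[0.95, 0.03, 0.02],
                  [0.20, 0.60, 0.20],
                  [0.40, 0.35, 0.25],
                  [0.10, 0.80, 0.10]])
cm = confusion_counts(y_toy, p_toy)
bins = confidence_bins(y_toy, p_toy)
```

A well-calibrated model shows accuracy rising with the bin; a high-confidence bin with low accuracy is where log loss is being lost fastest.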

Next steps:
1) Env check + quick data EDA (size, lengths, class balance).
2) Implement TF-IDF (word+char) + Logistic Regression baseline with 5-fold CV, cache OOF/test.
3) Iterate: tuning + add char/word unions; evaluate; then blend.
4) Generate submission and aim for ≤0.27 CV.

We will request expert review after environment check + baseline CV results, and before committing longer training runs.

In [1]:
# Environment check + quick EDA
import os, sys, time, shutil, subprocess, json, math, random
import numpy as np
import pandas as pd

def log(msg):
    print(f"[{time.strftime('%H:%M:%S')}] {msg}", flush=True)

# 1) GPU check (nvidia-smi); warn and continue on CPU if no GPU is found
log("Checking GPU availability via nvidia-smi...")
try:
    out = subprocess.run(["bash","-lc","nvidia-smi || true"], capture_output=True, text=True)
    print(out.stdout)
    if 'NVIDIA-SMI' not in out.stdout:
        log("WARNING: GPU not detected. Proceeding with CPU (OK for linear TF-IDF models).")
    else:
        log("GPU detected.")
except Exception as e:
    log(f"nvidia-smi check failed: {e}")

# 2) Data load
t0=time.time()
log("Loading train.csv and test.csv...")
train_path = 'train.csv'
test_path = 'test.csv'
train = pd.read_csv(train_path)
test = pd.read_csv(test_path)
log(f"Loaded train: {train.shape}, test: {test.shape} in {time.time()-t0:.2f}s")

# 3) Basic schema
log("Train head:")
print(train.head(3))
log("Train columns:")
print(train.columns.tolist())
log("Null summary:")
print(train.isnull().mean())

# 4) Target distribution
if 'author' in train.columns:
    vc = train['author'].value_counts().sort_index()
    log("Author distribution (counts / proportions):")
    print(vc)
    print((vc / vc.sum()).round(4))
else:
    log("ERROR: 'author' not found in train.")

# 5) Text length stats
text_col = 'text' if 'text' in train.columns else None
if text_col:
    train['len'] = train[text_col].astype(str).str.len()
    train['wc'] = train[text_col].astype(str).str.split().apply(len)
    log("Length stats (char):")
    print(train['len'].describe())
    log("Length stats (words):")
    print(train['wc'].describe())
    log("Sample texts:")
    for i in range(3):
        r = train.sample(1, random_state=42+i).iloc[0]
        print(f"id={r['id']} author={r['author']} len={r['len']} wc={r['wc']}\n{r[text_col][:200]}...")
else:
    log("ERROR: 'text' column not found in train.")

# 6) Sanity: test has id+text
missing_test_cols = [c for c in ['id','text'] if c not in test.columns]
if missing_test_cols:
    log(f"ERROR: test is missing columns: {missing_test_cols}")
else:
    log("Test columns OK: ['id','text'] present.")

log("EDA complete. Next: implement TF-IDF + Logistic Regression baseline with 5-fold CV.")

[01:58:54] Checking GPU availability via nvidia-smi...
Mon Sep 29 01:58:54 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.06             Driver Version: 550.144.06     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A10-24Q                 On  |   00000002:00:00.0 Off |                    0 |
| N/A   N/A    P0             N/A /  N/A  |     128MiB /  24512MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

[01:58:54] Loading train.csv and test.csv...
[01:58:54] Loaded train: (17621, 3), test: (1958, 2) in 0.03s
[01:58:54] Train head:
        id                                               text author
0  id06121  So I did not abandon the search until I had be...    EAP
1  id01074  He had promised to spend some hours with me on...    MWS
2  id12145  In the end he give him a funny kind o' thingum...    HPL
[01:58:54] Train columns:
['id', 'text', 'author']
[01:58:54] Null summary:
id        0.0
text      0.0
author    0.0
dtype: float64
[01:58:54] Author distribution (counts / proportions):
author
EAP    7090
HPL    5074
MWS    5457
Name: count, dtype: int64
author
EAP    0.4024
HPL    0.2880
MWS    0.3097
Name: count, dtype: float64
[01:58:54] Length stats (char):
count    17621.000000
mean       148.750752
std        107.652448
min         21.000000
25%         81.000000
50%        128.000000
75%        191.000000
max       4663.000000
Name: len, dtype: float64
[01:58:54] Length stats (words):
count    17621.000000
mean        26.682651
std         19.226686
min          2.000000
25%         15.000000
50%         23.000000
75%         34.000000
max        861.000000
Name: wc, dtype: float64
[01:58:54] Sample texts:
id=id22723 author=EAP len=60 wc=10
The frame was oval, richly gilded and filigreed in Moresque....
id=id19992 author=HPL len=145 wc=26
I would, of course, soon have to shift from Washington to some other southward street; for that party from the hotel would doubtless be after me....
id=id21137 author=EAP len=168 wc=32
Jupiter and myself are going upon an expedition into the hills, upon the main land, and, in this expedition we shall need the aid of some person in whom we can confide....
[01:58:54] Test columns OK: ['id','text'] present.
[01:58:54] EDA complete. Next: implement TF-IDF + Logistic Regression baseline with 5-fold CV.


In [2]:
# Baseline v1: TF-IDF (word 1-2 + char 3-5) + Logistic Regression (multinomial) with 5-fold CV
import time, gc
from typing import Tuple
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

SEED = 42
N_FOLDS = 5
np.random.seed(SEED)

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

X_text = train['text'].astype(str).values
y = train['author'].values
X_test_text = test['text'].astype(str).values

# Vectorizer configs (expert defaults)
word_params = dict(analyzer='word', ngram_range=(1,2), min_df=2, max_df=0.9, lowercase=True,
                   strip_accents='unicode', token_pattern=r'(?u)\b\w+\b', sublinear_tf=True,
                   smooth_idf=True, norm='l2', dtype=np.float32)
char_params = dict(analyzer='char', ngram_range=(3,5), min_df=2, lowercase=True,
                   sublinear_tf=True, smooth_idf=True, norm='l2', dtype=np.float32)

def build_fold_features(x_tr, x_val, x_test) -> Tuple[sparse.csr_matrix, sparse.csr_matrix, sparse.csr_matrix]:
    v_word = TfidfVectorizer(**word_params)
    v_char = TfidfVectorizer(**char_params)
    Xw_tr = v_word.fit_transform(x_tr)
    Xw_val = v_word.transform(x_val)
    Xw_test = v_word.transform(x_test)
    Xc_tr = v_char.fit_transform(x_tr)
    Xc_val = v_char.transform(x_val)
    Xc_test = v_char.transform(x_test)
    X_tr = sparse.hstack([Xw_tr, Xc_tr], format='csr')
    X_val = sparse.hstack([Xw_val, Xc_val], format='csr')
    X_te = sparse.hstack([Xw_test, Xc_test], format='csr')
    return X_tr, X_val, X_te

skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)

# Prepare arrays
classes = np.unique(y)
class_to_idx = {c:i for i,c in enumerate(classes)}
oof = np.zeros((len(train), len(classes)), dtype=np.float32)
test_pred = np.zeros((len(test), len(classes)), dtype=np.float32)

fold_losses = []
t0_all = time.time()
for fold, (tr_idx, val_idx) in enumerate(skf.split(X_text, y), 1):
    t0 = time.time()
    print(f"[CV] Fold {fold}/{N_FOLDS} start: tr={len(tr_idx)} val={len(val_idx)}", flush=True)
    x_tr, x_val = X_text[tr_idx], X_text[val_idx]
    y_tr, y_val = y[tr_idx], y[val_idx]

    # Build features per fold to avoid leakage
    X_tr, X_val, X_te = build_fold_features(x_tr, x_val, X_test_text)
    print(f"[CV] Fold {fold} features: X_tr={X_tr.shape} X_val={X_val.shape} X_te={X_te.shape}", flush=True)

    # Model: Logistic Regression (multinomial) saga
    clf = LogisticRegression(solver='saga', penalty='l2', multi_class='multinomial',
                             C=4.0, max_iter=5000, tol=1e-3, n_jobs=-1, random_state=SEED)
    clf.fit(X_tr, y_tr)
    proba_val = clf.predict_proba(X_val).astype(np.float32)
    proba_test = clf.predict_proba(X_te).astype(np.float32)

    # Map to fixed class order
    # Ensure probabilities align by column to 'classes' order
    clf_order = list(clf.classes_)
    order_idx = [clf_order.index(c) for c in classes]
    proba_val = proba_val[:, order_idx]
    proba_test = proba_test[:, order_idx]

    # Store
    oof[val_idx] = proba_val
    test_pred += proba_test / N_FOLDS

    # Fold logloss
    loss = log_loss(y_val, proba_val, labels=list(classes))
    fold_losses.append(loss)
    dt = time.time() - t0
    print(f"[CV] Fold {fold} logloss={loss:.5f} elapsed={dt:.2f}s", flush=True)

    # Cleanup
    del X_tr, X_val, X_te, proba_val, proba_test, clf
    gc.collect()

oof_loss = log_loss(y, oof, labels=list(classes))
print(f"[CV] OOF logloss={oof_loss:.5f}; folds={fold_losses}; total_elapsed={time.time()-t0_all:.2f}s", flush=True)

# Save artifacts
np.save('oof_lr_wordchar.npy', oof)
np.save('test_lr_wordchar.npy', test_pred)

# Build submission with correct column order ['EAP','HPL','MWS']
submit_cols = ['EAP','HPL','MWS']
submit_order = submit_cols  # desired order

# Ensure classes cover exactly these and reorder
assert set(classes) == set(submit_cols), f"Classes mismatch: {classes}"
idx_map = [list(classes).index(c) for c in submit_order]
probs = test_pred[:, idx_map]

# Numerical safety: clip and renormalize per row
eps = 1e-9
probs = np.clip(probs, eps, 1 - eps)
probs = probs / probs.sum(axis=1, keepdims=True)

sub = pd.DataFrame(probs, columns=submit_order)
sub.insert(0, 'id', test['id'].values)
sub.to_csv('submission.csv', index=False)
print("Saved submission.csv. Head:\n", sub.head(), flush=True)

# Quick sanity: per-row sums ≈ 1
row_sums = sub[submit_cols].sum(axis=1).values
print("Row sums (first 5):", row_sums[:5], flush=True)

[CV] Fold 1/5 start: tr=14096 val=3525
[CV] Fold 1 features: X_tr=(14096, 173289) X_val=(3525, 173289) X_te=(1958, 173289)
[CV] Fold 1 logloss=0.39005 elapsed=6.35s
[CV] Fold 2/5 start: tr=14097 val=3524
[CV] Fold 2 features: X_tr=(14097, 174021) X_val=(3524, 174021) X_te=(1958, 174021)
[CV] Fold 2 logloss=0.39548 elapsed=6.15s
[CV] Fold 3/5 start: tr=14097 val=3524
[CV] Fold 3 features: X_tr=(14097, 173478) X_val=(3524, 173478) X_te=(1958, 173478)
[CV] Fold 3 logloss=0.40560 elapsed=6.36s
[CV] Fold 4/5 start: tr=14097 val=3524
[CV] Fold 4 features: X_tr=(14097, 173723) X_val=(3524, 173723) X_te=(1958, 173723)
[CV] Fold 4 logloss=0.39183 elapsed=6.24s
[CV] Fold 5/5 start: tr=14097 val=3524
[CV] Fold 5 features: X_tr=(14097, 173699) X_val=(3524, 173699) X_te=(1958, 173699)
[CV] Fold 5 logloss=0.38725 elapsed=6.23s
[CV] OOF logloss=0.39404; folds=[0.3900476689274559, 0.39547997709809957, 0.40559739327647654, 0.39182913231648786, 0.38725023417017657]; total_elapsed=31.60s
Saved submission.csv. Head:
         id       EAP       HPL       MWS
0  id27251  0.023297  0.963772  0.012931
1  id09612  0.234541  0.186852  0.578606
2  id11943  0.031630  0.009772  0.958598
3  id19526  0.018826  0.058122  0.923052
4  id12931  0.076831  0.049408  0.873761
Row sums (first 5): [0.99999994 1.         1.         1.         0.99999994]


In [3]:
# Models v2: char-only LR (char_wb 3-6), word-only LR (1-3), SGD(word+char) + fixed-weight blend scored on OOF
import time, gc
import numpy as np
import pandas as pd
from typing import Dict, Tuple, List
from scipy import sparse
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import log_loss

SEED = 42
N_FOLDS = 5
np.random.seed(SEED)

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
X_text = train['text'].astype(str).values
y = train['author'].values
X_test_text = test['text'].astype(str).values
classes = np.unique(y)
submit_cols = ['EAP','HPL','MWS']
assert set(classes) == set(submit_cols), f"Classes mismatch: {classes}"

skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)

def cv_model_single_vectorizer(vectorizer: TfidfVectorizer,
                               build_on_each_fold: bool,
                               clf_builder,
                               clf_param_grid: List,
                               name: str) -> Tuple[np.ndarray, np.ndarray, float, Dict]:
    t_all = time.time()
    best = dict(loss=1e9, params=None, oof=None, test=None)
    for params in clf_param_grid:
        oof = np.zeros((len(train), len(classes)), dtype=np.float32)
        test_pred = np.zeros((len(test), len(classes)), dtype=np.float32)
        fold_losses = []
        print(f"[{name}] Params: {params}", flush=True)
        for fold, (tr_idx, val_idx) in enumerate(skf.split(X_text, y), 1):
            t0=time.time()
            x_tr, x_val = X_text[tr_idx], X_text[val_idx]
            y_tr, y_val = y[tr_idx], y[val_idx]
            if build_on_each_fold:
                # Refit the vectorizer on the training fold only, so no validation text leaks into vocabulary/IDF.
                vec = vectorizer
                X_tr = vec.fit_transform(x_tr)
                X_val = vec.transform(x_val)
                X_te  = vec.transform(X_test_text)
            else:
                # Fit vectorizer on full train (not recommended for OOF honesty), then slice per fold.
                vec = vectorizer
                vec.fit(X_text)
                X_tr = vec.transform(x_tr)
                X_val = vec.transform(x_val)
                X_te  = vec.transform(X_test_text)
            clf = clf_builder(**params)
            clf.fit(X_tr, y_tr)
            proba_val = clf.predict_proba(X_val).astype(np.float32)
            proba_test = clf.predict_proba(X_te).astype(np.float32)
            # reorder to classes
            order_idx = [list(clf.classes_).index(c) for c in classes]
            proba_val = proba_val[:, order_idx]
            proba_test = proba_test[:, order_idx]
            oof[val_idx] = proba_val
            test_pred += proba_test / N_FOLDS
            loss = log_loss(y_val, proba_val, labels=list(classes))
            fold_losses.append(loss)
            print(f"[{name}] Fold {fold} loss={loss:.5f} elapsed={time.time()-t0:.2f}s", flush=True)
            del X_tr, X_val, X_te, proba_val, proba_test, clf
            gc.collect()
        oof_loss = log_loss(y, oof, labels=list(classes))
        print(f"[{name}] OOF={oof_loss:.5f}; folds={np.round(fold_losses,5).tolist()} total={time.time()-t_all:.2f}s", flush=True)
        if oof_loss < best['loss']:
            best.update(loss=oof_loss, params=params, oof=oof, test=test_pred)
    return best['oof'], best['test'], best['loss'], best

def cv_model_two_vectorizers(vec_word: TfidfVectorizer, vec_char: TfidfVectorizer,
                             clf_builder, clf_param_grid: List, name: str) -> Tuple[np.ndarray, np.ndarray, float, Dict]:
    t_all = time.time()
    best = dict(loss=1e9, params=None, oof=None, test=None)
    for params in clf_param_grid:
        oof = np.zeros((len(train), len(classes)), dtype=np.float32)
        test_pred = np.zeros((len(test), len(classes)), dtype=np.float32)
        fold_losses = []
        print(f"[{name}] Params: {params}", flush=True)
        for fold, (tr_idx, val_idx) in enumerate(skf.split(X_text, y), 1):
            t0=time.time()
            x_tr, x_val = X_text[tr_idx], X_text[val_idx]
            y_tr, y_val = y[tr_idx], y[val_idx]
            vw = vec_word
            vc = vec_char
            Xw_tr = vw.fit_transform(x_tr); Xw_val = vw.transform(x_val); Xw_te = vw.transform(X_test_text)
            Xc_tr = vc.fit_transform(x_tr); Xc_val = vc.transform(x_val); Xc_te = vc.transform(X_test_text)
            X_tr = sparse.hstack([Xw_tr, Xc_tr], format='csr')
            X_val = sparse.hstack([Xw_val, Xc_val], format='csr')
            X_te  = sparse.hstack([Xw_te, Xc_te], format='csr')
            clf = clf_builder(**params)
            clf.fit(X_tr, y_tr)
            proba_val = clf.predict_proba(X_val).astype(np.float32)
            proba_test = clf.predict_proba(X_te).astype(np.float32)
            order_idx = [list(clf.classes_).index(c) for c in classes]
            proba_val = proba_val[:, order_idx]
            proba_test = proba_test[:, order_idx]
            oof[val_idx] = proba_val
            test_pred += proba_test / N_FOLDS
            loss = log_loss(y_val, proba_val, labels=list(classes))
            fold_losses.append(loss)
            print(f"[{name}] Fold {fold} loss={loss:.5f} elapsed={time.time()-t0:.2f}s", flush=True)
            del Xw_tr, Xw_val, Xw_te, Xc_tr, Xc_val, Xc_te, X_tr, X_val, X_te, proba_val, proba_test, clf
            gc.collect()
        oof_loss = log_loss(y, oof, labels=list(classes))
        print(f"[{name}] OOF={oof_loss:.5f}; folds={np.round(fold_losses,5).tolist()} total={time.time()-t_all:.2f}s", flush=True)
        if oof_loss < best['loss']:
            best.update(loss=oof_loss, params=params, oof=oof, test=test_pred)
    return best['oof'], best['test'], best['loss'], best

# Vectorizers per expert defaults
vec_char_wb = TfidfVectorizer(analyzer='char_wb', ngram_range=(3,6), min_df=2, lowercase=True,
                              sublinear_tf=True, dtype=np.float32)
vec_char = TfidfVectorizer(analyzer='char', ngram_range=(3,5), min_df=2, lowercase=True,
                           sublinear_tf=True, smooth_idf=True, norm='l2', dtype=np.float32)
vec_word13 = TfidfVectorizer(analyzer='word', ngram_range=(1,3), min_df=2, max_df=0.9, lowercase=True,
                             strip_accents='unicode', token_pattern=r'(?u)\b\w+\b', sublinear_tf=True,
                             smooth_idf=True, norm='l2', dtype=np.float32)
vec_word12 = TfidfVectorizer(analyzer='word', ngram_range=(1,2), min_df=2, max_df=0.9, lowercase=True,
                             strip_accents='unicode', token_pattern=r'(?u)\b\w+\b', sublinear_tf=True,
                             smooth_idf=True, norm='l2', dtype=np.float32)

# 1) LR on char-only (char_wb 3-6), sweep C
lr_builder = lambda **kw: LogisticRegression(solver='saga', penalty='l2', multi_class='multinomial',
                                             max_iter=5000, tol=1e-3, n_jobs=-1, random_state=SEED, **kw)
lr_C_grid = [ {'C': c} for c in [2.0, 4.0, 8.0] ]
oof_char, test_char, loss_char, best_char = cv_model_single_vectorizer(vec_char_wb, True, lr_builder, lr_C_grid, name='LR_char_wb')
np.save('oof_lr_char.npy', oof_char); np.save('test_lr_char.npy', test_char)

# 2) LR on word-only (1-3), sweep C
oof_word, test_word, loss_word, best_word = cv_model_single_vectorizer(vec_word13, True, lr_builder, lr_C_grid, name='LR_word_1_3')
np.save('oof_lr_word.npy', oof_word); np.save('test_lr_word.npy', test_word)

# 3) SGD on word+char (word 1-2 + char 3-5), sweep alpha
sgd_builder = lambda **kw: SGDClassifier(loss='log_loss', penalty='l2', max_iter=2000, tol=1e-4,
                                         early_stopping=True, validation_fraction=0.1, n_iter_no_change=5,
                                         average=True, random_state=SEED, **kw)
alpha_grid = [ {'alpha': a} for a in [1e-6, 3e-6, 1e-5] ]
oof_sgd, test_sgd, loss_sgd, best_sgd = cv_model_two_vectorizers(vec_word12, vec_char, sgd_builder, alpha_grid, name='SGD_wordchar')
np.save('oof_sgd_wordchar.npy', oof_sgd); np.save('test_sgd_wordchar.npy', test_sgd)

print(f"Best OOF losses -> char:{loss_char:.5f} word:{loss_word:.5f} sgd:{loss_sgd:.5f}", flush=True)

# Blend (weights per expert starting point): 0.50 char, 0.35 word, 0.15 sgd
w_char, w_word, w_sgd = 0.50, 0.35, 0.15
oof_blend = (w_char*oof_char + w_word*oof_word + w_sgd*oof_sgd).astype(np.float32)
test_blend = (w_char*test_char + w_word*test_word + w_sgd*test_sgd).astype(np.float32)
oof_blend_loss = log_loss(y, oof_blend, labels=list(classes))
print(f"[BLEND] OOF logloss={oof_blend_loss:.5f}", flush=True)

# Save blended submission
idx_map = [list(classes).index(c) for c in submit_cols]
probs = test_blend[:, idx_map]
eps = 1e-9
probs = np.clip(probs, eps, 1-eps)
probs = probs / probs.sum(axis=1, keepdims=True)
sub = pd.DataFrame(probs, columns=submit_cols)
sub.insert(0, 'id', test['id'].values)
sub.to_csv('submission.csv', index=False)
print("Saved submission.csv (blend). Head:\n", sub.head(), flush=True)

# Log params chosen
print("Best params:")
print("  LR_char_wb:", best_char['params'])
print("  LR_word_1_3:", best_word['params'])
print("  SGD_wordchar:", best_sgd['params'])

[LR_char_wb] Params: {'C': 2.0}
[LR_char_wb] Fold 1 loss=0.45683 elapsed=3.53s
[LR_char_wb] Fold 2 loss=0.46149 elapsed=3.57s
[LR_char_wb] Fold 3 loss=0.46741 elapsed=3.66s
[LR_char_wb] Fold 4 loss=0.46121 elapsed=3.74s
[LR_char_wb] Fold 5 loss=0.45641 elapsed=3.75s
[LR_char_wb] OOF=0.46067; folds=[0.45683, 0.46149, 0.46741, 0.46121, 0.45641] total=18.52s
[LR_char_wb] Params: {'C': 4.0}
[LR_char_wb] Fold 1 loss=0.42375 elapsed=3.83s
[LR_char_wb] Fold 2 loss=0.42454 elapsed=3.83s
[LR_char_wb] Fold 3 loss=0.43573 elapsed=3.77s
[LR_char_wb] Fold 4 loss=0.42745 elapsed=3.76s
[LR_char_wb] Fold 5 loss=0.42344 elapsed=3.78s
[LR_char_wb] OOF=0.42698; folds=[0.42375, 0.42454, 0.43573, 0.42745, 0.42344] total=37.76s
[LR_char_wb] Params: {'C': 8.0}
[LR_char_wb] Fold 1 loss=0.40969 elapsed=4.29s
[LR_char_wb] Fold 2 loss=0.40529 elapsed=4.28s
[LR_char_wb] Fold 3 loss=0.42235 elapsed=4.38s
[LR_char_wb] Fold 4 loss=0.41161 elapsed=4.37s
[LR_char_wb] Fold 5 loss=0.40872 elapsed=4.33s
[LR_char_wb] OOF=0.41153; folds=[0.40969, 0.40529, 0.42235, 0.41161, 0.40872] total=59.68s
[LR_word_1_3] Params: {'C': 2.0}
[LR_word_1_3] Fold 1 loss=0.51518 elapsed=1.19s
[LR_word_1_3] Fold 2 loss=0.52073 elapsed=1.19s
[LR_word_1_3] Fold 3 loss=0.52547 elapsed=1.23s
[LR_word_1_3] Fold 4 loss=0.51224 elapsed=1.21s
[LR_word_1_3] Fold 5 loss=0.51230 elapsed=1.24s
[LR_word_1_3] OOF=0.51718; folds=[0.51518, 0.52073, 0.52547, 0.51224, 0.5123] total=6.33s
[LR_word_1_3] Params: {'C': 4.0}
[LR_word_1_3] Fold 1 loss=0.46807 elapsed=1.25s
[LR_word_1_3] Fold 2 loss=0.47426 elapsed=1.29s
[LR_word_1_3] Fold 3 loss=0.48028 elapsed=1.28s
[LR_word_1_3] Fold 4 loss=0.46474 elapsed=1.27s
[LR_word_1_3] Fold 5 loss=0.46666 elapsed=1.27s
[LR_word_1_3] OOF=0.47080; folds=[0.46807, 0.47426, 0.48028, 0.46474, 0.46666] total=12.97s
[LR_word_1_3] Params: {'C': 8.0}
[LR_word_1_3] Fold 1 loss=0.43805 elapsed=1.38s
[LR_word_1_3] Fold 2 loss=0.44438 elapsed=1.43s
[LR_word_1_3] Fold 3 loss=0.45196 elapsed=1.37s
[LR_word_1_3] Fold 4 loss=0.43415 elapsed=1.38s
[LR_word_1_3] Fold 5 loss=0.43811 elapsed=1.37s
[LR_word_1_3] OOF=0.44133; folds=[0.43805, 0.44438, 0.45196, 0.43415, 0.43811] total=20.19s
[SGD_wordchar] Params: {'alpha': 1e-06}
[SGD_wordchar] Fold 1 loss=0.82708 elapsed=3.80s
[SGD_wordchar] Fold 2 loss=0.82626 elapsed=3.77s
[SGD_wordchar] Fold 3 loss=0.93218 elapsed=3.84s
[SGD_wordchar] Fold 4 loss=0.89054 elapsed=3.84s
[SGD_wordchar] Fold 5 loss=0.80539 elapsed=3.88s
[SGD_wordchar] OOF=0.85629; folds=[0.82708, 0.82626, 0.93218, 0.89054, 0.80539] total=19.47s
[SGD_wordchar] Params: {'alpha': 3e-06}
[SGD_wordchar] Fold 1 loss=0.46374 elapsed=3.98s
[SGD_wordchar] Fold 2 loss=0.50019 elapsed=3.77s
[SGD_wordchar] Fold 3 loss=0.50480 elapsed=4.09s
[SGD_wordchar] Fold 4 loss=0.49524 elapsed=3.81s
[SGD_wordchar] Fold 5 loss=0.47229 elapsed=3.87s
[SGD_wordchar] OOF=0.48725; folds=[0.46374, 0.50019, 0.5048, 0.49524, 0.47229] total=39.37s
[SGD_wordchar] Params: {'alpha': 1e-05}


In [20]:
# Fix per expert: char analyzer (2-6, lowercase=False) + word max_df=1.0; LR C sweep with 5-fold CV
import time, gc
import numpy as np
import pandas as pd
from typing import Tuple
from scipy import sparse
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

SEED = 42
N_FOLDS = 5
np.random.seed(SEED)

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
X_text = train['text'].astype(str).values
y = train['author'].values
X_test_text = test['text'].astype(str).values
classes = np.unique(y)
submit_cols = ['EAP','HPL','MWS']
assert set(classes) == set(submit_cols), f"Classes mismatch: {classes}"

# Vectorizer params (known-good, with apostrophes and hyphens kept for words; no accent strip on char)
word_params = dict(analyzer='word', ngram_range=(1,2), min_df=2, max_df=1.0, lowercase=True,
                   strip_accents='unicode', token_pattern=r"(?u)\b[-\w']+\b", sublinear_tf=True,
                   smooth_idf=True, norm='l2', dtype=np.float32)
char_params = dict(analyzer='char', ngram_range=(2,6), min_df=2, lowercase=False,
                   strip_accents=None, sublinear_tf=True, smooth_idf=True, norm='l2', dtype=np.float32)

def build_fold_features(x_tr, x_val, x_test) -> Tuple[sparse.csr_matrix, sparse.csr_matrix, sparse.csr_matrix, int, int]:
    v_word = TfidfVectorizer(**word_params)
    v_char = TfidfVectorizer(**char_params)
    Xw_tr = v_word.fit_transform(x_tr); Xw_val = v_word.transform(x_val); Xw_test = v_word.transform(x_test)
    Xc_tr = v_char.fit_transform(x_tr); Xc_val = v_char.transform(x_val); Xc_test = v_char.transform(x_test)
    X_tr = sparse.hstack([Xw_tr, Xc_tr], format='csr')
    X_val = sparse.hstack([Xw_val, Xc_val], format='csr')
    X_te  = sparse.hstack([Xw_test, Xc_test], format='csr')
    return X_tr, X_val, X_te, Xw_tr.shape[1], Xc_tr.shape[1]

skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)

best = {'loss': 1e9, 'C': None, 'oof': None, 'test': None}
for C in [2.0, 4.0, 8.0, 12.0]:
    print(f"[RUN] C={C}", flush=True)
    oof = np.zeros((len(train), len(classes)), dtype=np.float32)
    test_pred = np.zeros((len(test), len(classes)), dtype=np.float32)
    fold_losses = []
    t0_all = time.time()
    for fold, (tr_idx, val_idx) in enumerate(skf.split(X_text, y), 1):
        t0 = time.time()
        x_tr, x_val = X_text[tr_idx], X_text[val_idx]
        y_tr, y_val = y[tr_idx], y[val_idx]
        X_tr, X_val, X_te, vw_dim, vc_dim = build_fold_features(x_tr, x_val, X_test_text)
        print(f"  [Fold {fold}] shapes: X_tr={X_tr.shape}, X_val={X_val.shape}, X_te={X_te.shape}; vocab(word,char)=({vw_dim},{vc_dim})", flush=True)
        clf = LogisticRegression(solver='saga', multi_class='multinomial', penalty='l2',
                                 C=C, max_iter=10000, tol=1e-4, n_jobs=-1, random_state=SEED)
        clf.fit(X_tr, y_tr)
        proba_val = clf.predict_proba(X_val).astype(np.float32)
        proba_test = clf.predict_proba(X_te).astype(np.float32)
        order_idx = [list(clf.classes_).index(c) for c in classes]
        proba_val = proba_val[:, order_idx]
        proba_test = proba_test[:, order_idx]
        oof[val_idx] = proba_val
        test_pred += proba_test / N_FOLDS
        loss = log_loss(y_val, proba_val, labels=list(classes))
        fold_losses.append(loss)
        print(f"  [Fold {fold}] logloss={loss:.5f} elapsed={time.time()-t0:.2f}s", flush=True)
        del X_tr, X_val, X_te, proba_val, proba_test, clf
        gc.collect()
    oof_loss = log_loss(y, oof, labels=list(classes))
    print(f"[RUN] C={C} OOF={oof_loss:.5f}; folds={np.round(fold_losses,5).tolist()} total_elapsed={time.time()-t0_all:.2f}s", flush=True)
    if oof_loss < best['loss']:
        best.update(loss=oof_loss, C=C, oof=oof, test=test_pred)

print(f"[BEST] OOF={best['loss']:.5f} at C={best['C']}", flush=True)
np.save('oof_lr_wordchar_fixed.npy', best['oof'])
np.save('test_lr_wordchar_fixed.npy', best['test'])

# Build submission
idx_map = [list(classes).index(c) for c in submit_cols]
probs = best['test'][:, idx_map]
eps = 1e-9
probs = np.clip(probs, eps, 1-eps)
probs = probs / probs.sum(axis=1, keepdims=True)
sub = pd.DataFrame(probs, columns=submit_cols)
sub.insert(0, 'id', test['id'].values)
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv (fixed recipe with hyphen-aware tokenization). Head:\n', sub.head(), flush=True)

[RUN] C=2.0
  [Fold 1] shapes: X_tr=(14096, 345620), X_val=(3525, 345620), X_te=(1958, 345620); vocab(word,char)=(52523,293097)
  [Fold 1] logloss=0.42226 elapsed=10.31s
  [Fold 2] shapes: X_tr=(14097, 346840), X_val=(3524, 346840), X_te=(1958, 346840); vocab(word,char)=(52659,294181)
  [Fold 2] logloss=0.42466 elapsed=18.73s
  [Fold 3] shapes: X_tr=(14097, 345772), X_val=(3524, 345772), X_te=(1958, 345772); vocab(word,char)=(52462,293310)
  [Fold 3] logloss=0.43415 elapsed=16.35s
  [Fold 4] shapes: X_tr=(14097, 346267), X_val=(3524, 346267), X_te=(1958, 346267); vocab(word,char)=(52519,293748)
  [Fold 4] logloss=0.42090 elapsed=12.70s
  [Fold 5] shapes: X_tr=(14097, 346805), X_val=(3524, 346805), X_te=(1958, 346805); vocab(word,char)=(52356,294449)
  [Fold 5] logloss=0.41732 elapsed=18.53s
[RUN] C=2.0 OOF=0.42386; folds=[0.42226, 0.42466, 0.43415, 0.4209, 0.41732] total_elapsed=76.98s
[RUN] C=4.0
  [Fold 1] shapes: X_tr=(14096, 345620), X_val=(3525, 345620), X_te=(1958, 345620); vocab(word,char)=(52523,293097)
  [Fold 1] logloss=0.39074 elapsed=19.86s
  [Fold 2] shapes: X_tr=(14097, 346840), X_val=(3524, 346840), X_te=(1958, 346840); vocab(word,char)=(52659,294181)
  [Fold 2] logloss=0.39204 elapsed=28.25s
  [Fold 3] shapes: X_tr=(14097, 345772), X_val=(3524, 345772), X_te=(1958, 345772); vocab(word,char)=(52462,293310)
  [Fold 3] logloss=0.40400 elapsed=26.15s
  [Fold 4] shapes: X_tr=(14097, 346267), X_val=(3524, 346267), X_te=(1958, 346267); vocab(word,char)=(52519,293748)
  [Fold 4] logloss=0.38968 elapsed=22.54s
  [Fold 5] shapes: X_tr=(14097, 346805), X_val=(3524, 346805), X_te=(1958, 346805); vocab(word,char)=(52356,294449)
  [Fold 5] logloss=0.38616 elapsed=27.79s
[RUN] C=4.0 OOF=0.39252; folds=[0.39074, 0.39204, 0.404, 0.38968, 0.38616] total_elapsed=124.95s
[RUN] C=8.0
  [Fold 1] shapes: X_tr=(14096, 345620), X_val=(3525, 345620), X_te=(1958, 345620); vocab(word,char)=(52523,293097)
KeyboardInterrupt:
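
The sweep above rebuilds the same per-fold matrices for every C (the fold shapes repeat exactly between C=2.0 and C=4.0), which conflicts with the plan's "cache vectorized matrices once" rule. Below is a minimal sketch of vectorizing each fold once and reusing the matrices across the C grid. It assumes scikit-learn is available; the toy corpus and the names `build_fold_cache` and `sweep_C` are hypothetical and only illustrate the pattern, not the notebook's fixed recipe.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold

def build_fold_cache(texts, y, n_splits=5, seed=42, **vec_params):
    """Fit one TF-IDF vectorizer per fold (training split only) and keep the matrices."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    cache = []
    for tr_idx, val_idx in skf.split(texts, y):
        vec = TfidfVectorizer(**vec_params)
        X_tr = vec.fit_transform(texts[tr_idx])
        X_val = vec.transform(texts[val_idx])
        cache.append((tr_idx, val_idx, X_tr, X_val))
    return cache

def sweep_C(cache, y, C_grid, classes):
    """Evaluate each C on the cached folds; returns {C: OOF log loss}."""
    results = {}
    for C in C_grid:
        oof = np.zeros((len(y), len(classes)), dtype=np.float64)
        for tr_idx, val_idx, X_tr, X_val in cache:
            clf = LogisticRegression(C=C, max_iter=1000)
            clf.fit(X_tr, y[tr_idx])
            # clf.classes_ is sorted, which matches the order of np.unique(y)
            oof[val_idx] = clf.predict_proba(X_val)
        results[C] = log_loss(y, oof, labels=list(classes))
    return results

# Toy corpus standing in for train['text'] / train['author'] (illustrative only)
texts = np.array(
    ["the raven croaked nevermore"] * 6
    + ["cosmic dread beneath the sea"] * 6
    + ["the creature wept alone"] * 6,
    dtype=object,
)
y = np.array(["EAP"] * 6 + ["HPL"] * 6 + ["MWS"] * 6)
classes = np.unique(y)

cache = build_fold_cache(texts, y, n_splits=3, analyzer="char", ngram_range=(1, 3))
results = sweep_C(cache, y, C_grid=[1.0, 4.0], classes=classes)
print(results)
```

The trade-off is memory: the cache holds one train and one validation matrix per fold, which at roughly 345k columns is still small in sparse form but grows with wider n-gram ranges.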

In [12]:
# Char-only model: Calibrated LinearSVC on TF-IDF char (1-6, min_df=1, lowercase=False, isotonic)
import time, gc
import numpy as np
import pandas as pd
from typing import List, Tuple, Dict
from scipy import sparse
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import log_loss

SEED = 42
N_FOLDS = 5
np.random.seed(SEED)

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
X_text = train['text'].astype(str).values
y = train['author'].values
X_test_text = test['text'].astype(str).values
classes = np.unique(y)
submit_cols = ['EAP','HPL','MWS']
assert set(classes) == set(submit_cols), f"Classes mismatch: {classes}"

skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)

# Char vectorizer per expert: analyzer='char', ngram_range=(1,6), min_df=1, lowercase=False, strip_accents=None
char_params = dict(analyzer='char', ngram_range=(1,6), min_df=1, lowercase=False,
                   strip_accents=None, sublinear_tf=True, smooth_idf=True, norm='l2', dtype=np.float32)

def build_char_fold(x_tr, x_val, x_test) -> Tuple[sparse.csr_matrix, sparse.csr_matrix, sparse.csr_matrix, int]:
    v = TfidfVectorizer(**char_params)
    X_tr = v.fit_transform(x_tr)
    X_val = v.transform(x_val)
    X_te  = v.transform(x_test)
    return X_tr, X_val, X_te, X_tr.shape[1]

def cv_char_svc(C_grid: List[float], name: str='CalibSVC_char_1_6_iso') -> Tuple[np.ndarray, np.ndarray, float, Dict]:
    best = dict(loss=1e9, C=None, oof=None, test=None)
    for C in C_grid:
        oof = np.zeros((len(train), len(classes)), dtype=np.float32)
        test_pred = np.zeros((len(test), len(classes)), dtype=np.float32)
        fold_losses = []
        t0_all = time.time()
        print(f"[{name}] C={C}", flush=True)
        for fold, (tr_idx, val_idx) in enumerate(skf.split(X_text, y), 1):
            t0 = time.time()
            x_tr, x_val = X_text[tr_idx], X_text[val_idx]
            y_tr, y_val = y[tr_idx], y[val_idx]
            X_tr, X_val, X_te, vdim = build_char_fold(x_tr, x_val, X_test_text)
            print(f"  [Fold {fold}] X_tr={X_tr.shape} vdim={vdim}", flush=True)
            base = LinearSVC(C=C, tol=1e-4, max_iter=10000, random_state=SEED, dual=True)
            clf = CalibratedClassifierCV(estimator=base, method='isotonic', cv=5, n_jobs=-1)
            clf.fit(X_tr, y_tr)
            proba_val = clf.predict_proba(X_val).astype(np.float32)
            proba_test = clf.predict_proba(X_te).astype(np.float32)
            order_idx = [list(clf.classes_).index(c) for c in classes]
            proba_val = proba_val[:, order_idx]
            proba_test = proba_test[:, order_idx]
            oof[val_idx] = proba_val
            test_pred += proba_test / N_FOLDS
            loss = log_loss(y_val, proba_val, labels=list(classes))
            fold_losses.append(loss)
            print(f"  [Fold {fold}] loss={loss:.5f} elapsed={time.time()-t0:.2f}s", flush=True)
            del X_tr, X_val, X_te, proba_val, proba_test, clf, base
            gc.collect()
        oof_loss = log_loss(y, oof, labels=list(classes))
        print(f"[{name}] OOF={oof_loss:.5f}; folds={np.round(fold_losses,5).tolist()} total={time.time()-t0_all:.2f}s", flush=True)
        if oof_loss < best['loss']:
            best.update(loss=oof_loss, C=C, oof=oof, test=test_pred)
    return best['oof'], best['test'], best['loss'], best

# Run Calibrated LinearSVC char-only (isotonic, includes 1-grams)
svc_C_grid = [0.5, 1.0, 2.0]
oof_svc_char16_iso, test_svc_char16_iso, loss_svc_char16_iso, best_svc_char16_iso = cv_char_svc(svc_C_grid, name='CalibSVC_char_1_6_iso')
np.save('oof_svc_char_1_6_iso.npy', oof_svc_char16_iso); np.save('test_svc_char_1_6_iso.npy', test_svc_char16_iso)
print(f"[CalibSVC_char_1_6_iso] BEST OOF={loss_svc_char16_iso:.5f} with C={best_svc_char16_iso['C']}", flush=True)

# Build submission from best SVC char-only
idx_map = [list(classes).index(c) for c in submit_cols]
probs = best_svc_char16_iso['test'][:, idx_map]
eps = 1e-9
probs = np.clip(probs, eps, 1-eps)
probs = probs / probs.sum(axis=1, keepdims=True)
sub = pd.DataFrame(probs, columns=submit_cols)
sub.insert(0, 'id', test['id'].values)
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv (CalibSVC char 1-6 isotonic). Head:\n', sub.head(), flush=True)

[CalibSVC_char_1_6_iso] C=0.5
  [Fold 1] X_tr=(14096, 512933) vdim=512933
  [Fold 1] loss=0.40480 elapsed=7.73s
  [Fold 2] X_tr=(14097, 515302) vdim=515302
  [Fold 2] loss=0.39344 elapsed=7.76s
  [Fold 3] X_tr=(14097, 515756) vdim=515756
  [Fold 3] loss=0.40903 elapsed=7.71s
  [Fold 4] X_tr=(14097, 513663) vdim=513663
  [Fold 4] loss=0.40520 elapsed=7.60s
  [Fold 5] X_tr=(14097, 514470) vdim=514470
  [Fold 5] loss=0.38461 elapsed=7.62s
[CalibSVC_char_1_6_iso] OOF=0.39942; folds=[0.4048, 0.39344, 0.40903, 0.4052, 0.38461] total=38.80s
[CalibSVC_char_1_6_iso] C=1.0
  [Fold 1] X_tr=(14096, 512933) vdim=512933
  [Fold 1] loss=0.38945 elapsed=7.88s
  [Fold 2] X_tr=(14097, 515302) vdim=515302
  [Fold 2] loss=0.38524 elapsed=7.87s
  [Fold 3] X_tr=(14097, 515756) vdim=515756
  [Fold 3] loss=0.40065 elapsed=7.83s
  [Fold 4] X_tr=(14097, 513663) vdim=513663
  [Fold 4] loss=0.39636 elapsed=7.37s
  [Fold 5] X_tr=(14097, 514470) vdim=514470
  [Fold 5] loss=0.37740 elapsed=7.36s
[CalibSVC_char_1_6_iso] OOF=0.38982; folds=[0.38945, 0.38524, 0.40065, 0.39636, 0.3774] total=38.66s
[CalibSVC_char_1_6_iso] C=2.0
  [Fold 1] X_tr=(14096, 512933) vdim=512933
  [Fold 1] loss=0.38765 elapsed=8.02s
  [Fold 2] X_tr=(14097, 515302) vdim=515302
  [Fold 2] loss=0.38231 elapsed=7.88s
  [Fold 3] X_tr=(14097, 515756) vdim=515756
  [Fold 3] loss=0.39823 elapsed=7.85s
  [Fold 4] X_tr=(14097, 513663) vdim=513663
  [Fold 4] loss=0.39746 elapsed=7.98s
  [Fold 5] X_tr=(14097, 514470) vdim=514470
  [Fold 5] loss=0.37686 elapsed=7.74s
[CalibSVC_char_1_6_iso] OOF=0.38850; folds=[0.38765, 0.38231, 0.39823, 0.39746, 0.37686] total=39.82s
[CalibSVC_char_1_6_iso] BEST OOF=0.38850 with C=2.0
Saved submission.csv (CalibSVC char 1-6 isotonic). Head:
         id       EAP       HPL       MWS
0  id27251  0.013289  0.966162  0.020548
1  id09612  0.173638  0.204327  0.622035
2  id11943  0.036458  0.004822  0.958720
3  id19526  0.002769  0.217847  0.779384
4  id12931  0.029037  0.060493  0.910470
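
Every cell clips the test probabilities to `[eps, 1-eps]` and renormalizes before writing `submission.csv`. This matters for log loss because a probability of exactly 0 on the true class gives an infinite penalty, and isotonic calibration can output exact zeros. A minimal NumPy-only sketch of that step follows; `finalize_probs` and `multiclass_log_loss` are hypothetical helper names, and the three rows are made-up values, not predictions from this notebook.

```python
import numpy as np

def finalize_probs(probs, eps=1e-9):
    """Clip to [eps, 1 - eps], then renormalize so each row sums to 1."""
    p = np.clip(np.asarray(probs, dtype=np.float64), eps, 1.0 - eps)
    return p / p.sum(axis=1, keepdims=True)

def multiclass_log_loss(y_idx, probs):
    """Mean negative log-probability assigned to the true class."""
    rows = np.arange(len(y_idx))
    return float(-np.mean(np.log(probs[rows, y_idx])))

# Made-up predictions: two hard (0/1) rows and one soft row
raw = np.array([
    [1.0, 0.0, 0.0],   # hard prediction, correct (true class 0)
    [0.0, 1.0, 0.0],   # hard prediction, wrong (true class 2)
    [0.2, 0.3, 0.5],   # soft prediction, correct (true class 2)
])
y_idx = np.array([0, 2, 2])

safe = finalize_probs(raw)
loss = multiclass_log_loss(y_idx, safe)
print(safe.sum(axis=1))
print(loss)  # finite; the unclipped version would hit log(0) on the second row
```

The choice of `eps` caps the worst-case per-row penalty at about `-log(eps)`, so a larger `eps` trades a little loss on confident correct rows for a smaller penalty on confident wrong ones.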


In [9]:
# Fast NB-SVM (log-count ratio) for word(1-2) only, then OOF-weighted blend with LR(word+char fixed)
import time, gc, numpy as np, pandas as pd
from typing import List, Tuple, Dict
from pathlib import Path
from scipy import sparse
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

SEED = 42
N_FOLDS = 5
np.random.seed(SEED)

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
X_text = train['text'].astype(str).values
y = train['author'].values
X_test_text = test['text'].astype(str).values
classes = np.unique(y)
submit_cols = ['EAP','HPL','MWS']
assert set(classes) == set(submit_cols), f"Classes mismatch: {classes}"

skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)

# Faster vectorizer config (reduced vdim): word 1-2, min_df=2, keep apostrophes
vec_word_params = dict(analyzer='word', ngram_range=(1,2), min_df=2, max_df=1.0, lowercase=True,
                        strip_accents='unicode', token_pattern=r"(?u)\b[\w']+\b", dtype=np.float32)

def _log_count_ratio(X: sparse.csr_matrix, y_bin: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    X_pos = X[y_bin == 1]
    X_neg = X[y_bin == 0]
    p = np.asarray(X_pos.sum(axis=0)).ravel() + alpha
    q = np.asarray(X_neg.sum(axis=0)).ravel() + alpha
    r = np.log(p / q)
    return r.astype(np.float32)

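def _apply_r(X: sparse.csr_matrix, r: np.ndarray) -> sparse.csr_matrix:
    # multiply() broadcasts r across rows but may return a COO matrix; convert back to CSR to match the annotation
    return X.multiply(r).tocsr()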

def cv_nbsvm_word(vec: CountVectorizer, alpha: float, C: float, name: str):
    t0_all = time.time()
    oof = np.zeros((len(train), len(classes)), dtype=np.float32)
    test_pred = np.zeros((len(test), len(classes)), dtype=np.float32)
    fold_losses = []
    print(f"[{name}] C={C} alpha={alpha}", flush=True)
    for fold, (tr_idx, val_idx) in enumerate(skf.split(X_text, y), 1):
        t0 = time.time()
        x_tr, x_val = X_text[tr_idx], X_text[val_idx]
        y_tr, y_val = y[tr_idx], y[val_idx]
        v = vec  # shared instance; fit_transform below refits the vocabulary on this fold's training split
        X_tr = v.fit_transform(x_tr)
        X_val = v.transform(x_val)
        X_te  = v.transform(X_test_text)
        vdim = X_tr.shape[1]
        print(f"  [Fold {fold}] X_tr={X_tr.shape} vdim={vdim}", flush=True)
        proba_val = np.zeros((len(val_idx), len(classes)), dtype=np.float32)
        proba_te  = np.zeros((len(test), len(classes)), dtype=np.float32)
        for ci, c in enumerate(classes):
            y_bin_tr = (y_tr == c).astype(np.int8)
            r = _log_count_ratio(X_tr, y_bin_tr, alpha=alpha)
            Xr_tr = _apply_r(X_tr, r)
            Xr_val = _apply_r(X_val, r)
            Xr_te  = _apply_r(X_te, r)
            # Faster binary LR: liblinear on the reduced vocabulary; predict_proba gives P(class c vs. rest)
            clf = LogisticRegression(solver='liblinear', penalty='l2', max_iter=2000, tol=1e-4,
                                     C=C, n_jobs=1, random_state=SEED)
            clf.fit(Xr_tr, y_bin_tr)
            proba_val[:, ci] = clf.predict_proba(Xr_val)[:, 1].astype(np.float32)
            proba_te[:,  ci] = clf.predict_proba(Xr_te)[:, 1].astype(np.float32)
            del y_bin_tr, r, Xr_tr, Xr_val, Xr_te, clf
            gc.collect()
        eps = 1e-9
        proba_val = np.clip(proba_val, eps, 1 - eps)
        proba_val = proba_val / proba_val.sum(axis=1, keepdims=True)
        proba_te  = np.clip(proba_te,  eps, 1 - eps)
        proba_te  = proba_te  / proba_te.sum(axis=1,  keepdims=True)
        oof[val_idx] = proba_val
        test_pred += proba_te / N_FOLDS
        loss = log_loss(y_val, proba_val, labels=list(classes))
        fold_losses.append(loss)
        print(f"  [Fold {fold}] loss={loss:.5f} elapsed={time.time()-t0:.2f}s", flush=True)
        del X_tr, X_val, X_te, proba_val, proba_te, v
        gc.collect()
    oof_loss = log_loss(y, oof, labels=list(classes))
    print(f"[{name}] OOF={oof_loss:.5f}; folds={np.round(fold_losses,5).tolist()} total={time.time()-t0_all:.2f}s", flush=True)
    return oof, test_pred, oof_loss

# Run fast NB-SVM word-only
word_vec = CountVectorizer(**vec_word_params)
alpha = 0.5
C = 12.0
oof_nb_word, test_nb_word, loss_nb_word = cv_nbsvm_word(word_vec, alpha, C, name='NBSVM_word_1_2_min2_fast')
np.save('oof_nbsvm_word.npy', oof_nb_word); np.save('test_nbsvm_word.npy', test_nb_word)
print(f"[NBSVM_word_fast] OOF={loss_nb_word:.5f} C={C}", flush=True)

# Load LR(word+char fixed recipe) OOF/test
oof_lr_wc = np.load('oof_lr_wordchar_fixed.npy') if (Path('oof_lr_wordchar_fixed.npy').exists()) else None
test_lr_wc = np.load('test_lr_wordchar_fixed.npy') if (Path('test_lr_wordchar_fixed.npy').exists()) else None

# Simple OOF-weighted blend over a small grid (non-negative, sum to 1) between NBSVM_word and LR(word+char)
best_blend = (1e9, None, None)
if oof_lr_wc is not None:
    for w in np.linspace(0.1, 0.9, 9):
        oof_bl = (w * oof_nb_word + (1.0 - w) * oof_lr_wc).astype(np.float32)
        loss = log_loss(y, oof_bl, labels=list(classes))
        if loss < best_blend[0]:
            best_blend = (loss, w, oof_bl)
    print(f"[BLEND word+lr_wc] best OOF={best_blend[0]:.5f} w_word={best_blend[1]:.2f} w_lr_wc={1-best_blend[1]:.2f}", flush=True)
    # Build blended submission
    w = best_blend[1]
    test_bl = (w * test_nb_word + (1.0 - w) * test_lr_wc).astype(np.float32)
    idx_map = [list(classes).index(c) for c in submit_cols]
    probs = test_bl[:, idx_map]
    eps = 1e-9
    probs = np.clip(probs, eps, 1-eps)
    probs = probs / probs.sum(axis=1, keepdims=True)
    sub = pd.DataFrame(probs, columns=submit_cols)
    sub.insert(0, 'id', test['id'].values)
    sub.to_csv('submission.csv', index=False)
    print('Saved submission.csv (Fast NB-SVM word + LR blend). Head:\n', sub.head(), flush=True)
else:
    print('Skipped blend: missing LR word+char fixed preds.', flush=True)

[NBSVM_word_1_2_min2_fast] C=12.0 alpha=0.5
  [Fold 1] X_tr=(14096, 52523) vdim=52523
  [Fold 1] loss=0.51822 elapsed=2.28s
  [Fold 2] X_tr=(14097, 52659) vdim=52659
  [Fold 2] loss=0.50560 elapsed=1.97s
  [Fold 3] X_tr=(14097, 52462) vdim=52462
  [Fold 3] loss=0.53824 elapsed=1.95s
  [Fold 4] X_tr=(14097, 52519) vdim=52519
  [Fold 4] loss=0.49216 elapsed=1.93s
  [Fold 5] X_tr=(14097, 52356) vdim=52356
  [Fold 5] loss=0.51517 elapsed=2.06s
[NBSVM_word_1_2_min2_fast] OOF=0.51388; folds=[0.51822, 0.5056, 0.53824, 0.49216, 0.51517] total=10.48s
[NBSVM_word_fast] OOF=0.51388 C=12.0
[BLEND word+lr_wc] best OOF=0.36143 w_word=0.20 w_lr_wc=0.80
Saved submission.csv (Fast NB-SVM word + LR blend). Head:
         id       EAP       HPL       MWS
0  id27251  0.005360  0.992360  0.002280
1  id09612  0.161116  0.145306  0.693578
2  id11943  0.012770  0.004396  0.982834
3  id19526  0.011659  0.042058  0.946283
4  id12931  0.052283  0.034653  0.913063
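
The log-count ratio used by the NB-SVM cell can be checked by hand on a tiny matrix. The sketch below mirrors the notebook's `_log_count_ratio` on a dense array; note that, like the notebook version, it does not L1-normalize the class counts, which the original NB-SVM formulation does. The four documents and three token columns are made-up values for illustration.

```python
import numpy as np

def log_count_ratio(X, y_bin, alpha=0.5):
    """r_j = log((alpha + count of feature j in positives) / (alpha + count in negatives))."""
    p = X[y_bin == 1].sum(axis=0) + alpha
    q = X[y_bin == 0].sum(axis=0) + alpha
    return np.log(p / q)

# Rows are documents, columns are counts of three hypothetical tokens
X = np.array([
    [2, 0, 1],
    [1, 0, 1],
    [0, 3, 1],
    [0, 1, 1],
], dtype=np.float64)
y_bin = np.array([1, 1, 0, 0])

r = log_count_ratio(X, y_bin, alpha=0.5)
X_scaled = X * r  # same role as X.multiply(r) on a sparse matrix

# Token 0 appears only in positives: p=3.5, q=0.5, r=log(7)
# Token 1 appears only in negatives: p=0.5, q=4.5, r=log(1/9)
# Token 2 is balanced:               p=q=2.5,      r=0
print(np.round(r, 3))
```

Tokens that occur equally in both groups get a ratio of zero and drop out of the scaled matrix, which is why the downstream linear model only sees discriminative features.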


In [17]:
# Blend existing OOFs: LR(word+char fixed) + NBSVM(word fast) + CalibSVC(char 1-6 isotonic) + LR(char 1-7) + CNB(word)
import numpy as np, pandas as pd
from pathlib import Path
from sklearn.metrics import log_loss

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
y = train['author'].values
classes = np.unique(y)
submit_cols = ['EAP','HPL','MWS']
assert set(classes) == set(submit_cols)

def load_or_none(p):
    return np.load(p) if Path(p).exists() else None

oof_lr = load_or_none('oof_lr_wordchar_fixed.npy')
test_lr = load_or_none('test_lr_wordchar_fixed.npy')
oof_nbw = load_or_none('oof_nbsvm_word.npy')
test_nbw = load_or_none('test_nbsvm_word.npy')
oof_svc_iso = load_or_none('oof_svc_char_1_6_iso.npy')
test_svc_iso = load_or_none('test_svc_char_1_6_iso.npy')
oof_lr_char17 = load_or_none('oof_lr_char_1_7.npy')
test_lr_char17 = load_or_none('test_lr_char_1_7.npy')
oof_cnb_word = load_or_none('oof_cnb_word.npy')
test_cnb_word = load_or_none('test_cnb_word.npy')

avail = [
    ('lr_wc', oof_lr, test_lr),
    ('nbw', oof_nbw, test_nbw),
    ('svc_iso', oof_svc_iso, test_svc_iso),
    ('lr_char17', oof_lr_char17, test_lr_char17),
    ('cnb_word', oof_cnb_word, test_cnb_word),
]
avail = [(n,o,t) for n,o,t in avail if o is not None and t is not None]
print('Available models:', [n for n,_,_ in avail])

assert any(n=='lr_wc' for n,_,_ in avail), 'LR(word+char fixed) required'

# Grid search non-negative weights that sum to 1 for up to 5 models (coarse grid to keep runtime tractable)
best = (1e9, None, None)
ws = np.round(np.arange(0.0, 1.0001, 0.05), 5)
names = [n for n,_,_ in avail]

def try_weights(ws_sel):
    s = sum(ws_sel)
    if abs(s - 1.0) > 1e-9: return None
    weights = {names[i]: ws_sel[i] for i in range(len(names))}
    blend = None
    for i,(name,oof,_) in enumerate(avail):
        w = weights.get(name, 0.0)
        if w == 0.0: continue
        blend = (oof * w) if blend is None else (blend + oof * w)
    eps = 1e-9
    blend = np.clip(blend, eps, 1 - eps)
    blend = blend / blend.sum(axis=1, keepdims=True)
    loss = log_loss(y, blend, labels=list(classes))
    return loss, weights, blend

L = len(avail)
if L >= 2:
    # Enumerate the first L-1 weights on the coarse grid; the last weight is the remainder.
    # product() visits combinations in the same order as explicit nested loops.
    from itertools import product
    for head in product(ws, repeat=L - 1):
        # Round the remainder so float drift (e.g. -1e-17) does not reject valid grid points
        w_last = round(float(1.0 - sum(head)), 5)
        if w_last < 0 or w_last > 1: continue
        res = try_weights([*head, w_last])
        if res is None: continue
        loss, weights, blend = res
        if loss < best[0]: best = (loss, weights.copy(), blend.copy())

print(f"[BLEND SEARCH] best OOF={best[0]:.5f} weights={best[1]}")

# Build blended test preds
weights = best[1]
test_blend = None
for name, _, tpred in avail:
    w = weights.get(name, 0.0)
    if w == 0.0:
        continue
    test_blend = (tpred * w) if test_blend is None else (test_blend + tpred * w)
eps = 1e-9
test_blend = np.clip(test_blend, eps, 1 - eps)
test_blend = test_blend / test_blend.sum(axis=1, keepdims=True)

# Save submission
idx_map = [list(classes).index(c) for c in submit_cols]
probs = test_blend[:, idx_map]
sub = pd.DataFrame(probs, columns=submit_cols)
sub.insert(0, 'id', test['id'].values)
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv (optimized blend incl. isotonic SVC, char LR, CNB). Head:\n', sub.head())

Available models: ['lr_wc', 'nbw', 'svc_iso', 'lr_char17', 'cnb_word']
[BLEND SEARCH] best OOF=0.34115 weights={'lr_wc': 0.05, 'nbw': 0.15, 'svc_iso': 0.0, 'lr_char17': 0.45, 'cnb_word': 0.3499999999999999}
Saved submission.csv (optimized blend incl. isotonic SVC, char LR, CNB). Head:
         id       EAP       HPL       MWS
0  id27251  0.002815  0.994643  0.002542
1  id09612  0.111937  0.146228  0.741835
2  id11943  0.011674  0.003286  0.985039
3  id19526  0.009523  0.055012  0.935465
4  id12931  0.037538  0.029420  0.933042
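
With a 0.05 step and five models, looping four weights over the full grid visits 21^4 = 194,481 combinations, most of which are rejected because the remainder is negative. One alternative is to enumerate only valid points using integer compositions (stars and bars), which yields 10,626 candidates for the same grid and avoids float drift in the weights. The sketch below is a NumPy-only illustration on synthetic OOF arrays; `simplex_grid` and `best_blend` are hypothetical names, not functions from this notebook.

```python
from itertools import combinations
import numpy as np

def simplex_grid(n_models, steps):
    """Yield every weight vector with entries k/steps (integer k >= 0) that sums to 1."""
    total = steps + n_models - 1
    for bars in combinations(range(total), n_models - 1):
        counts, prev = [], -1
        for b in bars:
            counts.append(b - prev - 1)   # stars between consecutive bars
            prev = b
        counts.append(total - prev - 1)   # stars after the last bar
        yield np.array(counts, dtype=np.float64) / steps

def best_blend(oofs, y_idx, steps=20, eps=1e-9):
    """Exhaustively search the simplex grid for the lowest OOF log loss."""
    stack = np.stack(oofs)                # (n_models, n_samples, n_classes)
    rows = np.arange(len(y_idx))
    best_loss, best_w = np.inf, None
    for w in simplex_grid(len(oofs), steps):
        blend = np.tensordot(w, stack, axes=1)
        blend = np.clip(blend, eps, 1 - eps)
        blend /= blend.sum(axis=1, keepdims=True)
        loss = float(-np.mean(np.log(blend[rows, y_idx])))
        if loss < best_loss:
            best_loss, best_w = loss, w
    return best_loss, best_w

# Synthetic stand-ins for saved OOF arrays (illustrative only)
rng = np.random.default_rng(0)
y_idx = rng.integers(0, 3, size=200)

def noisy_probs(noise):
    logits = 2.0 * np.eye(3)[y_idx] + rng.normal(scale=noise, size=(200, 3))
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

oofs = [noisy_probs(0.5), noisy_probs(1.0), noisy_probs(2.0)]
loss, w = best_blend(oofs, y_idx, steps=10)
print(loss, w)
```

Because the single-model vertices are part of the grid, the blended OOF loss found this way can never be worse than the best individual model on the same OOF data.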


In [11]:
# Proper NB-SVM (word+char presence) with softmax margins
import time, gc, numpy as np, pandas as pd
from typing import List, Tuple, Dict
from scipy import sparse
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

SEED = 42
N_FOLDS = 5
np.random.seed(SEED)

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
X_text = train['text'].astype(str).values
y = train['author'].values
X_test_text = test['text'].astype(str).values
classes = np.unique(y)
submit_cols = ['EAP','HPL','MWS']
assert set(classes) == set(submit_cols), f"Classes mismatch: {classes}"

skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)

# Vectorizers per expert: presence (binary=True)
word_params = dict(analyzer='word', ngram_range=(1,3), min_df=2, max_df=1.0, lowercase=True,
                   strip_accents='unicode', token_pattern=r"(?u)\b[\w']+\b", binary=True, dtype=np.float32)
char_params = dict(analyzer='char', ngram_range=(2,6), min_df=2, lowercase=False,
                   strip_accents=None, binary=True, dtype=np.float32)

def _log_count_ratio(X: sparse.csr_matrix, y_bin: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    X_pos = X[y_bin == 1]
    X_neg = X[y_bin == 0]
    p = np.asarray(X_pos.sum(axis=0)).ravel() + alpha
    q = np.asarray(X_neg.sum(axis=0)).ravel() + alpha
    r = np.log(p / q)
    return r.astype(np.float32)

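def _apply_r(X: sparse.csr_matrix, r: np.ndarray) -> sparse.csr_matrix:
    # multiply() broadcasts r across rows but may return a COO matrix; convert back to CSR to match the annotation
    return X.multiply(r).tocsr()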

def _softmax(m: np.ndarray, axis: int = 1) -> np.ndarray:
    m = m - m.max(axis=axis, keepdims=True)
    expm = np.exp(m, dtype=np.float32)
    s = expm.sum(axis=axis, keepdims=True)
    return expm / s

def cv_nbsvm_wordchar(alpha: float = 1.0, C: float = 30.0, name: str = 'NBSVM_wc_word1_3_char2_6_bin'):
    t0_all = time.time()
    oof = np.zeros((len(train), len(classes)), dtype=np.float32)
    test_pred = np.zeros((len(test), len(classes)), dtype=np.float32)
    fold_losses = []
    for fold, (tr_idx, val_idx) in enumerate(skf.split(X_text, y), 1):
        t0 = time.time()
        x_tr, x_val = X_text[tr_idx], X_text[val_idx]
        y_tr, y_val = y[tr_idx], y[val_idx]
        vw = CountVectorizer(**word_params)
        vc = CountVectorizer(**char_params)
        Xw_tr = vw.fit_transform(x_tr); Xw_val = vw.transform(x_val); Xw_te = vw.transform(X_test_text)
        Xc_tr = vc.fit_transform(x_tr); Xc_val = vc.transform(x_val); Xc_te = vc.transform(X_test_text)
        X_tr = sparse.hstack([Xw_tr, Xc_tr], format='csr')
        X_val = sparse.hstack([Xw_val, Xc_val], format='csr')
        X_te  = sparse.hstack([Xw_te, Xc_te], format='csr')
        vdim = X_tr.shape[1]
        print(f"[{name}] Fold {fold} X_tr={X_tr.shape} vdim={vdim}", flush=True)
        # margins for each class
        margins_val = np.zeros((len(val_idx), len(classes)), dtype=np.float32)
        margins_te  = np.zeros((len(test), len(classes)), dtype=np.float32)
        for ci, c in enumerate(classes):
            y_bin_tr = (y_tr == c).astype(np.int8)
            r = _log_count_ratio(X_tr, y_bin_tr, alpha=alpha)
            Xr_tr = _apply_r(X_tr, r)
            Xr_val = _apply_r(X_val, r)
            Xr_te  = _apply_r(X_te, r)
            clf = LogisticRegression(solver='liblinear', penalty='l2', max_iter=2000, tol=1e-4,
                                     C=C, n_jobs=1, random_state=SEED)
            clf.fit(Xr_tr, y_bin_tr)
            margins_val[:, ci] = clf.decision_function(Xr_val).astype(np.float32)
            margins_te[:,  ci] = clf.decision_function(Xr_te).astype(np.float32)
            del y_bin_tr, r, Xr_tr, Xr_val, Xr_te, clf
            gc.collect()
        proba_val = _softmax(margins_val, axis=1).astype(np.float32)
        proba_te  = _softmax(margins_te,  axis=1).astype(np.float32)
        oof[val_idx] = proba_val
        test_pred += proba_te / N_FOLDS
        loss = log_loss(y_val, proba_val, labels=list(classes))
        fold_losses.append(loss)
        print(f"[{name}] Fold {fold} loss={loss:.5f} elapsed={time.time()-t0:.2f}s", flush=True)
        del X_tr, X_val, X_te, Xw_tr, Xw_val, Xw_te, Xc_tr, Xc_val, Xc_te, margins_val, margins_te, proba_val, proba_te
        gc.collect()
    oof_loss = log_loss(y, oof, labels=list(classes))
    print(f"[{name}] OOF={oof_loss:.5f}; folds={np.round(fold_losses,5).tolist()} total={time.time()-t0_all:.2f}s", flush=True)
    return oof, test_pred, oof_loss

# Run NB-SVM word+char
alpha = 1.0
C = 30.0
oof_nbwc, test_nbwc, loss_nbwc = cv_nbsvm_wordchar(alpha=alpha, C=C, name='NBSVM_wc_word1_3_char2_6_bin')
np.save('oof_nbsvm_wordchar.npy', oof_nbwc); np.save('test_nbsvm_wordchar.npy', test_nbwc)
print(f"[NBSVM_wc] OOF={loss_nbwc:.5f} C={C}", flush=True)

# Build submission from NB-SVM wc
idx_map = [list(classes).index(c) for c in submit_cols]
probs = test_nbwc[:, idx_map]
eps = 1e-9
probs = np.clip(probs, eps, 1-eps)
probs = probs / probs.sum(axis=1, keepdims=True)
sub = pd.DataFrame(probs, columns=submit_cols)
sub.insert(0, 'id', test['id'].values)
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv (NBSVM wc). Head:\n', sub.head(), flush=True)

[$NBSVM_wc_word1_3_char2_6_bin] Fold 1 X_tr=(14096, 368484) vdim=368484
[NBSVM_wc_word1_3_char2_6_bin] Fold 1 loss=0.83395 elapsed=17.66s
[$NBSVM_wc_word1_3_char2_6_bin] Fold 2 X_tr=(14097, 369846) vdim=369846
[NBSVM_wc_word1_3_char2_6_bin] Fold 2 loss=0.79148 elapsed=19.07s
[$NBSVM_wc_word1_3_char2_6_bin] Fold 3 X_tr=(14097, 368577) vdim=368577
[NBSVM_wc_word1_3_char2_6_bin] Fold 3 loss=0.84791 elapsed=18.89s
[$NBSVM_wc_word1_3_char2_6_bin] Fold 4 X_tr=(14097, 369281) vdim=369281
[NBSVM_wc_word1_3_char2_6_bin] Fold 4 loss=0.76345 elapsed=18.66s
[$NBSVM_wc_word1_3_char2_6_bin] Fold 5 X_tr=(14097, 369678) vdim=369678
[NBSVM_wc_word1_3_char2_6_bin] Fold 5 loss=0.80935 elapsed=19.03s
[NBSVM_wc_word1_3_char2_6_bin] OOF=0.80923; folds=[0.83395, 0.79148, 0.84791, 0.76345, 0.80935] total=93.61s
[NBSVM_wc] OOF=0.80923 C=30.0
Saved submission.csv (NBSVM wc). Head:
         id           EAP           HPL           MWS
0  id27251  1.639839e-08  1.000000e+00  1.000000e-09
1  id09612  2.211119e-02  3.770798e-01  6.008090e-01
2  id11943  1.000000e-09  1.000000e-09  1.000000e+00
3  id19526  4.798707e-07  3.557995e-06  9.999959e-01
4  id12931  1.356093e-09  4.184225e-07  9.999996e-01
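
The OOF of 0.80923 together with test probabilities pinned at 1e-09 and 1.0 suggests that a softmax over raw one-vs-rest margins at C=30 is heavily overconfident. One common remedy is temperature scaling: divide the margins by a scalar T > 0 chosen to minimize log loss. The sketch below assumes the raw margins are kept (the cell above stores only probabilities) and uses synthetic margins; `fit_temperature` is a hypothetical helper, and fitting T on the same OOF rows it is scored on gives a mildly optimistic estimate.

```python
import numpy as np

def softmax(m):
    m = m - m.max(axis=1, keepdims=True)
    e = np.exp(m)
    return e / e.sum(axis=1, keepdims=True)

def log_loss_idx(y_idx, probs, eps=1e-15):
    """Mean negative log-probability of the true class, with a floor at eps."""
    p = np.clip(probs[np.arange(len(y_idx)), y_idx], eps, 1.0)
    return float(-np.mean(np.log(p)))

def fit_temperature(margins, y_idx, grid=None):
    """Pick T from a log-spaced grid that minimizes log loss of softmax(margins / T)."""
    if grid is None:
        grid = np.logspace(-1, 2, 61)  # T from 0.1 to 100
    losses = [log_loss_idx(y_idx, softmax(margins / T)) for T in grid]
    best = int(np.argmin(losses))
    return float(grid[best]), float(losses[best])

# Synthetic, deliberately over-scaled margins: correct about 80% of the time
rng = np.random.default_rng(42)
n = 500
y_idx = rng.integers(0, 3, size=n)
pred = np.where(rng.random(n) < 0.8, y_idx, (y_idx + 1) % 3)
margins = 20.0 * np.eye(3)[pred] + rng.normal(size=(n, 3))

raw_loss = log_loss_idx(y_idx, softmax(margins))
T, scaled_loss = fit_temperature(margins, y_idx)
print(f"raw={raw_loss:.4f} T={T:.2f} scaled={scaled_loss:.4f}")
```

Since the ranking of classes within each row is unchanged by dividing by T, accuracy stays the same and only the confidence is adjusted, which is what log loss penalizes here.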


In [14]:
# Strong char-only LR: TF-IDF char (1-7), lowercase=False, high C sweep
import time, gc, numpy as np, pandas as pd
from typing import List, Tuple, Dict
from scipy import sparse
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

SEED = 42
N_FOLDS = 5
np.random.seed(SEED)

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
X_text = train['text'].astype(str).values
y = train['author'].values
X_test_text = test['text'].astype(str).values
classes = np.unique(y)
submit_cols = ['EAP','HPL','MWS']
assert set(classes) == set(submit_cols), f"Classes mismatch: {classes}"

skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)

# Char TF-IDF: include 1-grams, keep case and punctuation; no accent strip
char_params = dict(analyzer='char', ngram_range=(1,7), min_df=1, lowercase=False,
                   strip_accents=None, sublinear_tf=True, smooth_idf=True, norm='l2', dtype=np.float32)

def build_char_fold(x_tr, x_val, x_test) -> Tuple[sparse.csr_matrix, sparse.csr_matrix, sparse.csr_matrix, int]:
    v = TfidfVectorizer(**char_params)
    X_tr = v.fit_transform(x_tr)
    X_val = v.transform(x_val)
    X_te  = v.transform(x_test)
    return X_tr, X_val, X_te, X_tr.shape[1]

def cv_char_lr(C_grid: List[float], name: str='LR_char_1_7') -> Tuple[np.ndarray, np.ndarray, float, Dict]:
    best = dict(loss=1e9, C=None, oof=None, test=None)
    for C in C_grid:
        oof = np.zeros((len(train), len(classes)), dtype=np.float32)
        test_pred = np.zeros((len(test), len(classes)), dtype=np.float32)
        fold_losses = []
        t0_all = time.time()
        print(f"[{name}] C={C}", flush=True)
        for fold, (tr_idx, val_idx) in enumerate(skf.split(X_text, y), 1):
            t0 = time.time()
            x_tr, x_val = X_text[tr_idx], X_text[val_idx]
            y_tr, y_val = y[tr_idx], y[val_idx]
            X_tr, X_val, X_te, vdim = build_char_fold(x_tr, x_val, X_test_text)
            print(f"  [Fold {fold}] X_tr={X_tr.shape} vdim={vdim}", flush=True)
            clf = LogisticRegression(solver='saga', multi_class='multinomial', penalty='l2',
                                     C=C, max_iter=12000, tol=1e-4, n_jobs=-1, random_state=SEED)
            clf.fit(X_tr, y_tr)
            proba_val = clf.predict_proba(X_val).astype(np.float32)
            proba_test = clf.predict_proba(X_te).astype(np.float32)
            order_idx = [list(clf.classes_).index(c) for c in classes]
            proba_val = proba_val[:, order_idx]
            proba_test = proba_test[:, order_idx]
            oof[val_idx] = proba_val
            test_pred += proba_test / N_FOLDS
            loss = log_loss(y_val, proba_val, labels=list(classes))
            fold_losses.append(loss)
            print(f"  [Fold {fold}] loss={loss:.5f} elapsed={time.time()-t0:.2f}s", flush=True)
            del X_tr, X_val, X_te, proba_val, proba_test, clf
            gc.collect()
        oof_loss = log_loss(y, oof, labels=list(classes))
        print(f"[{name}] OOF={oof_loss:.5f}; folds={np.round(fold_losses,5).tolist()} total={time.time()-t0_all:.2f}s", flush=True)
        if oof_loss < best['loss']:
            best.update(loss=oof_loss, C=C, oof=oof.copy(), test=test_pred.copy())
    return best['oof'], best['test'], best['loss'], best

# Run char LR with high-C sweep
C_grid = [16.0, 24.0, 32.0]
oof_lr_char17, test_lr_char17, loss_lr_char17, best_lr_char17 = cv_char_lr(C_grid, name='LR_char_1_7')
np.save('oof_lr_char_1_7.npy', oof_lr_char17); np.save('test_lr_char_1_7.npy', test_lr_char17)
print(f"[LR_char_1_7] BEST OOF={loss_lr_char17:.5f} with C={best_lr_char17['C']}", flush=True)

# Build submission from best char LR
idx_map = [list(classes).index(c) for c in submit_cols]
probs = best_lr_char17['test'][:, idx_map]
eps = 1e-9
probs = np.clip(probs, eps, 1-eps)
probs = probs / probs.sum(axis=1, keepdims=True)
sub = pd.DataFrame(probs, columns=submit_cols)
sub.insert(0, 'id', test['id'].values)
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv (LR char 1-7). Head:\n', sub.head(), flush=True)

[LR_char_1_7] C=16.0
  [Fold 1] X_tr=(14096, 1047048) vdim=1047048
  [Fold 1] loss=0.38691 elapsed=75.43s
  [Fold 2] X_tr=(14097, 1051824) vdim=1051824
  [Fold 2] loss=0.38322 elapsed=85.09s
  [Fold 3] X_tr=(14097, 1052194) vdim=1052194
  [Fold 3] loss=0.39852 elapsed=86.69s
  [Fold 4] X_tr=(14097, 1049129) vdim=1049129
  [Fold 4] loss=0.38787 elapsed=71.06s
  [Fold 5] X_tr=(14097, 1049968) vdim=1049968
  [Fold 5] loss=0.37650 elapsed=85.12s
[LR_char_1_7] OOF=0.38660; folds=[0.38691, 0.38322, 0.39852, 0.38787, 0.3765] total=403.79s
[LR_char_1_7] C=24.0
  [Fold 1] X_tr=(14096, 1047048) vdim=1047048
  [Fold 1] loss=0.37900 elapsed=74.71s
  [Fold 2] X_tr=(14097, 1051824) vdim=1051824
  [Fold 2] loss=0.37448 elapsed=92.76s
  [Fold 3] X_tr=(14097, 1052194) vdim=1052194
  [Fold 3] loss=0.39122 elapsed=94.76s
  [Fold 4] X_tr=(14097, 1049129) vdim=1049129
  [Fold 4] loss=0.38062 elapsed=83.48s
  [Fold 5] X_tr=(14097, 1049968) vdim=1049968
  [Fold 5] loss=0.36824 elapsed=95.58s
[LR_char_1_7] OOF=0.37871; folds=[0.379, 0.37448, 0.39122, 0.38062, 0.36824] total=441.68s
[LR_char_1_7] C=32.0
  [Fold 1] X_tr=(14096, 1047048) vdim=1047048
  [Fold 1] loss=0.37493 elapsed=84.32s
  [Fold 2] X_tr=(14097, 1051824) vdim=1051824
  [Fold 2] loss=0.36982 elapsed=105.92s
  [Fold 3] X_tr=(14097, 1052194) vdim=1052194
  [Fold 3] loss=0.38758 elapsed=109.78s
  [Fold 4] X_tr=(14097, 1049129) vdim=1049129
  [Fold 4] loss=0.37705 elapsed=93.48s
  [Fold 5] X_tr=(14097, 1049968) vdim=1049968
  [Fold 5] loss=0.36392 elapsed=109.18s
[LR_char_1_7] OOF=0.37466; folds=[0.37493, 0.36982, 0.38758, 0.37705, 0.36392] total=503.07s
[LR_char_1_7] BEST OOF=0.37466 with C=32.0
Saved submission.csv (LR char 1-7). Head:
         id       EAP       HPL       MWS
0  id27251  0.004349  0.990329  0.005322
1  id09612  0.193648  0.212092  0.594261
2  id11943  0.024085  0.006677  0.969238
3  id19526  0.008216  0.081952  0.909832
4  id12931  0.038221  0.034926  0.926853
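
The OOF loss fell monotonically (0.38660, 0.37871, 0.37466) as C rose from 16 to 24 to 32, and the best value sits on the edge of the grid, so the optimum may lie beyond C=32. A minimal sketch of extending the grid geometrically until the loss stops improving is shown below. `evaluate` is a hypothetical callable standing in for one full CV run at a given C, and `toy_oof_loss` is a made-up U-shaped curve, not a measurement from this notebook.

```python
import math

def expand_until_flat(evaluate, start, factor=1.5, min_delta=1e-4, max_steps=8):
    """Multiply C by `factor` until the loss fails to improve by at least `min_delta`."""
    history = []
    best_C, best_loss = None, math.inf
    C = start
    for _ in range(max_steps):
        loss = evaluate(C)
        history.append((C, loss))
        if loss < best_loss - min_delta:
            best_C, best_loss = C, loss
        else:
            break  # no meaningful improvement: stop expanding
        C *= factor
    return best_C, best_loss, history

def toy_oof_loss(C):
    # Made-up curve, U-shaped in log2(C) with its minimum at C=64
    return 0.36 + 0.01 * (math.log2(C) - 6.0) ** 2

best_C, best_loss, history = expand_until_flat(toy_oof_loss, start=32.0, factor=2.0)
print(best_C, round(best_loss, 5), len(history))
```

Each C in the log above took roughly 400 to 500 s for five folds, so `max_steps` and `factor` are worth setting with that per-step cost in mind.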


In [16]:
# Fast ComplementNB word-only (1-3) with apostrophes; add to blend
import time, gc, numpy as np, pandas as pd
from typing import List, Tuple, Dict
from scipy import sparse
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.metrics import log_loss
from pathlib import Path

SEED = 42
N_FOLDS = 5
np.random.seed(SEED)

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
X_text = train['text'].astype(str).values
y = train['author'].values
X_test_text = test['text'].astype(str).values
classes = np.unique(y)
submit_cols = ['EAP','HPL','MWS']
assert set(classes) == set(submit_cols), f"Classes mismatch: {classes}"

skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)

vec_word_params = dict(analyzer='word', ngram_range=(1,3), min_df=2, max_df=1.0, lowercase=True,
                        strip_accents='unicode', token_pattern=r"(?u)\b[\w']+\b", dtype=np.float32)

def cv_cnb_word(alpha_grid: List[float], name: str='CNB_word_1_3') -> Tuple[np.ndarray, np.ndarray, float, Dict]:
    best = dict(loss=1e9, alpha=None, oof=None, test=None)
    for alpha in alpha_grid:
        oof = np.zeros((len(train), len(classes)), dtype=np.float32)
        test_pred = np.zeros((len(test), len(classes)), dtype=np.float32)
        fold_losses = []
        t0_all = time.time()
        print(f"[{name}] alpha={alpha}", flush=True)
        for fold, (tr_idx, val_idx) in enumerate(skf.split(X_text, y), 1):
            t0 = time.time()
            x_tr, x_val = X_text[tr_idx], X_text[val_idx]
            y_tr, y_val = y[tr_idx], y[val_idx]
            v = CountVectorizer(**vec_word_params)
            X_tr = v.fit_transform(x_tr)
            X_val = v.transform(x_val)
            X_te  = v.transform(X_test_text)
            clf = ComplementNB(alpha=alpha)
            clf.fit(X_tr, y_tr)
            proba_val = clf.predict_proba(X_val).astype(np.float32)
            proba_test = clf.predict_proba(X_te).astype(np.float32)
            order_idx = [list(clf.classes_).index(c) for c in classes]
            proba_val = proba_val[:, order_idx]
            proba_test = proba_test[:, order_idx]
            oof[val_idx] = proba_val
            test_pred += proba_test / N_FOLDS
            loss = log_loss(y_val, proba_val, labels=list(classes))
            fold_losses.append(loss)
            print(f"  [Fold {fold}] loss={loss:.5f} elapsed={time.time()-t0:.2f}s", flush=True)
            del X_tr, X_val, X_te, proba_val, proba_test, clf, v
            gc.collect()
        oof_loss = log_loss(y, oof, labels=list(classes))
        print(f"[{name}] OOF={oof_loss:.5f}; folds={np.round(fold_losses,5).tolist()} total={time.time()-t0_all:.2f}s", flush=True)
        if oof_loss < best['loss']:
            best.update(loss=oof_loss, alpha=alpha, oof=oof.copy(), test=test_pred.copy())
    return best['oof'], best['test'], best['loss'], best

# Run CNB word-only quickly
alpha_grid = [0.2, 0.5, 1.0]
oof_cnb_word, test_cnb_word, loss_cnb_word, best_cnb_word = cv_cnb_word(alpha_grid, name='CNB_word_1_3')
np.save('oof_cnb_word.npy', oof_cnb_word); np.save('test_cnb_word.npy', test_cnb_word)
print(f"[CNB_word] BEST OOF={loss_cnb_word:.5f} alpha={best_cnb_word['alpha']}", flush=True)

# Build submission from CNB (for inspection)
idx_map = [list(classes).index(c) for c in submit_cols]
probs = best_cnb_word['test'][:, idx_map]
eps = 1e-9
probs = np.clip(probs, eps, 1-eps)
probs = probs / probs.sum(axis=1, keepdims=True)
sub = pd.DataFrame(probs, columns=submit_cols)
sub.insert(0, 'id', test['id'].values)
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv (CNB word). Head:\n', sub.head(), flush=True)

[CNB_word_1_3] alpha=0.2


  [Fold 1] loss=0.68287 elapsed=0.94s


  [Fold 2] loss=0.64267 elapsed=0.94s


  [Fold 3] loss=0.70578 elapsed=0.94s


  [Fold 4] loss=0.65235 elapsed=0.94s


  [Fold 1] loss=0.60621 elapsed=0.94s


  [Fold 2] loss=0.57018 elapsed=0.95s


  [Fold 3] loss=0.62249 elapsed=0.95s


  [Fold 4] loss=0.57917 elapsed=0.95s


  [Fold 5] loss=0.58839 elapsed=0.94s


[CNB_word_1_3] OOF=0.59329; folds=[0.60621, 0.57018, 0.62249, 0.57917, 0.58839] total=5.11s


[CNB_word_1_3] alpha=1.0


  [Fold 1] loss=0.55913 elapsed=0.96s


  [Fold 2] loss=0.52643 elapsed=0.96s


  [Fold 3] loss=0.57037 elapsed=0.95s


  [Fold 4] loss=0.53416 elapsed=0.94s


  [Fold 5] loss=0.54351 elapsed=0.95s


[CNB_word_1_3] OOF=0.54672; folds=[0.55913, 0.52643, 0.57037, 0.53416, 0.54351] total=5.12s


[CNB_word] BEST OOF=0.54672 alpha=1.0


Saved submission.csv (CNB word). Head:
         id           EAP           HPL       MWS
0  id27251  5.201592e-08  9.999968e-01  0.000003
1  id09612  2.594946e-03  8.419059e-02  0.913214
2  id11943  8.744872e-06  5.705953e-07  0.999991
3  id19526  2.069279e-03  1.644513e-02  0.981486
4  id12931  3.491907e-02  6.544923e-04  0.964426
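
The CNB head above contains probabilities as small as 5e-08, which is why the cell clips and renormalizes before writing the file. Below is a minimal sketch, on invented numbers, of how multi-class log loss reacts to a confidently wrong row with and without the `eps=1e-9` clip. `multiclass_log_loss` and `clip_and_renorm` are hypothetical helpers written for this illustration; they are not the scikit-learn functions used in the cell.

```python
import numpy as np

def multiclass_log_loss(y_idx, P):
    """Mean negative log-probability assigned to the true class."""
    P = np.asarray(P, dtype=np.float64)
    return float(-np.mean(np.log(P[np.arange(len(y_idx)), y_idx])))

def clip_and_renorm(P, eps=1e-9):
    """Clip to [eps, 1-eps], then rescale each row to sum to 1."""
    P = np.clip(np.asarray(P, dtype=np.float64), eps, 1 - eps)
    return P / P.sum(axis=1, keepdims=True)

# Toy data (invented): both rows have true class 0; the second row is confidently wrong.
y_idx = np.array([0, 0])
P = np.array([[0.90, 0.05, 0.05],
              [1e-12, 0.999999999999, 0.0]])

P_clipped = clip_and_renorm(P, eps=1e-9)
loss_raw = multiclass_log_loss(y_idx, P)          # true-class probs are 0.9 and 1e-12
loss_clipped = multiclass_log_loss(y_idx, P_clipped)
print(f"raw={loss_raw:.3f} clipped={loss_clipped:.3f}")
```

Clipping only bounds the worst-case penalty per row; it does not repair a miscalibrated model, which is what the temperature scaling in later cells targets.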


In [18]:
# Correct NB-SVM (word+char presence, row L2-normalized, softmax margins) per expert
import time, gc, numpy as np, pandas as pd
from scipy import sparse
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

SEED=42; N_FOLDS=5
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
X_text = train['text'].astype(str).values
y = train['author'].values
X_test = test['text'].astype(str).values
classes = np.unique(y)
submit_cols = ['EAP','HPL','MWS']
assert set(classes) == set(submit_cols)

word_params=dict(analyzer='word', ngram_range=(1,3), min_df=2, max_df=1.0, lowercase=True,
                 strip_accents='unicode', token_pattern=r"(?u)\b[-\w']+\b", binary=True, dtype=np.float32)
char_params=dict(analyzer='char', ngram_range=(2,6), min_df=2, lowercase=False,
                 strip_accents=None, binary=True, dtype=np.float32)

def _r(X, yb, a=1.0):
    p = np.asarray(X[yb==1].sum(axis=0)).ravel() + a
    q = np.asarray(X[yb==0].sum(axis=0)).ravel() + a
    return np.log(p/q).astype(np.float32)

def _softmax(m):
    m = m - m.max(axis=1, keepdims=True)
    e = np.exp(m); return e / e.sum(axis=1, keepdims=True)

skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)
oof = np.zeros((len(train), len(classes)), np.float32)
test_pred = np.zeros((len(test), len(classes)), np.float32)
t0_all = time.time()
for fold, (tr, va) in enumerate(skf.split(X_text, y), 1):
    t0 = time.time()
    x_tr, x_va = X_text[tr], X_text[va]
    y_tr, y_va = y[tr], y[va]
    vw = CountVectorizer(**word_params); vc = CountVectorizer(**char_params)
    Xw_tr = vw.fit_transform(x_tr); Xw_va = vw.transform(x_va); Xw_te = vw.transform(X_test)
    Xc_tr = vc.fit_transform(x_tr); Xc_va = vc.transform(x_va); Xc_te = vc.transform(X_test)
    X_tr = sparse.hstack([Xw_tr, Xc_tr], format='csr')
    X_va = sparse.hstack([Xw_va, Xc_va], format='csr')
    X_te = sparse.hstack([Xw_te, Xc_te], format='csr')
    print(f"[NB-SVM] Fold {fold}: X_tr={X_tr.shape}", flush=True)
    margins_va = np.zeros((len(va), len(classes)), np.float32)
    margins_te = np.zeros((len(test), len(classes)), np.float32)
    for ci, c in enumerate(classes):
        yb = (y_tr == c).astype(np.int8)
        r = _r(X_tr, yb, a=1.0)
        Xr_tr = normalize(X_tr.multiply(r), norm='l2', axis=1, copy=False)
        Xr_va = normalize(X_va.multiply(r), norm='l2', axis=1, copy=False)
        Xr_te = normalize(X_te.multiply(r), norm='l2', axis=1, copy=False)
        clf = LogisticRegression(solver='liblinear', C=30.0, max_iter=2000, random_state=SEED)
        clf.fit(Xr_tr, yb)
        margins_va[:, ci] = clf.decision_function(Xr_va).astype(np.float32)
        margins_te[:, ci] = clf.decision_function(Xr_te).astype(np.float32)
        del yb, r, Xr_tr, Xr_va, Xr_te, clf
        gc.collect()
    P_va = _softmax(margins_va).astype(np.float32)
    P_te = _softmax(margins_te).astype(np.float32)
    oof[va] = P_va
    test_pred += P_te / N_FOLDS
    loss = log_loss(y_va, P_va, labels=list(classes))
    print(f"[NB-SVM] Fold {fold} logloss={loss:.5f} elapsed={time.time()-t0:.2f}s", flush=True)
    del X_tr, X_va, X_te, Xw_tr, Xw_va, Xw_te, Xc_tr, Xc_va, Xc_te, margins_va, margins_te, P_va, P_te
    gc.collect()

oof_loss = log_loss(y, oof, labels=list(classes))
print(f"NB-SVM (wc presence) OOF: {oof_loss:.5f}; total_elapsed={time.time()-t0_all:.2f}s", flush=True)
np.save('oof_nbsvm_wc_fixed.npy', oof)
np.save('test_nbsvm_wc_fixed.npy', test_pred)

# No submission here; will use in optimized blend next
print('Saved oof_nbsvm_wc_fixed.npy and test_nbsvm_wc_fixed.npy', flush=True)

[NB-SVM] Fold 1: X_tr=(14096, 368484)


[NB-SVM] Fold 1 logloss=0.38046 elapsed=14.89s


[NB-SVM] Fold 2: X_tr=(14097, 369846)


[NB-SVM] Fold 2 logloss=0.37418 elapsed=14.55s


[NB-SVM] Fold 3: X_tr=(14097, 368577)


[NB-SVM] Fold 3 logloss=0.38723 elapsed=14.55s


[NB-SVM] Fold 4: X_tr=(14097, 369281)


[NB-SVM] Fold 4 logloss=0.36116 elapsed=14.19s


[NB-SVM] Fold 5: X_tr=(14097, 369678)


[NB-SVM] Fold 5 logloss=0.36744 elapsed=14.47s


NB-SVM (wc presence) OOF: 0.37409; total_elapsed=72.93s


Saved oof_nbsvm_wc_fixed.npy and test_nbsvm_wc_fixed.npy
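
The core of the NB-SVM cell above is the per-class log-count ratio `r`, which rescales each feature column before the one-vs-rest logistic regression. A minimal sketch of that step on a dense toy matrix follows; the cell itself works on sparse matrices and uses `sklearn.preprocessing.normalize`, whereas the matrix, labels, and the manual L2 normalization here are assumptions made for illustration.

```python
import numpy as np

def log_count_ratio(X, yb, a=1.0):
    """r_j = log((a + column sum over positives) / (a + column sum over negatives))."""
    p = X[yb == 1].sum(axis=0) + a
    q = X[yb == 0].sum(axis=0) + a
    return np.log(p / q)

# Toy presence matrix (invented): 4 documents x 3 features.
X = np.array([[1, 0, 1],
              [1, 0, 0],
              [0, 1, 1],
              [0, 1, 0]], dtype=np.float64)
yb = np.array([1, 1, 0, 0])  # one-vs-rest labels for a single class

r = log_count_ratio(X, yb, a=1.0)

# Scale each column by its ratio, then L2-normalize every row.
X_scaled = X * r
norms = np.linalg.norm(X_scaled, axis=1, keepdims=True)
X_scaled = X_scaled / np.where(norms == 0, 1.0, norms)
print(r)
```

Feature 2 appears equally often in both groups, so its ratio is zero and it contributes nothing after scaling; features 0 and 1 get ratios of opposite sign.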


In [71]:
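# Optimized blend: per-model temperature calibration, weighted geometric pooling
# (multi-start L-BFGS-B on softmax-parameterized weights), weight caps, final global temperature
import numpy as np, pandas as pd, time, gc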
from pathlib import Path
from sklearn.metrics import log_loss
from scipy.optimize import minimize, minimize_scalar

t0 = time.time()
train = pd.read_csv('train.csv'); test = pd.read_csv('test.csv')
y = train['author'].values
classes = np.unique(y).tolist()
submit_cols = ['EAP','HPL','MWS']
assert set(classes) == set(submit_cols)

def load(p): return np.load(p) if Path(p).exists() else None

# Pruned diverse set (drop weaker/duplicative bases):
cands = [
    ('nbsvm_wc_tweaked',          load('oof_nbsvm_wc_tweaked.npy'),          load('test_nbsvm_wc_tweaked.npy')),
    ('lr_wc_fixed',               load('oof_lr_wordchar_fixed.npy'),         load('test_lr_wordchar_fixed.npy')),
    ('nbsvm_char_2_6_counts',     load('oof_nbsvm_char_2_6_counts.npy'),     load('test_nbsvm_char_2_6_counts.npy')),
    ('nbsvm_char_2_7_presence',   load('oof_nbsvm_char_2_7_presence.npy'),   load('test_nbsvm_char_2_7_presence.npy')),
    ('lr_char_1_7',               load('oof_lr_char_1_7.npy'),               load('test_lr_char_1_7.npy')),
    ('lr_char_1_8_fast',          load('oof_lr_char_1_8.npy'),               load('test_lr_char_1_8.npy')),
    ('lr_char_1_8_hero',          load('oof_lr_char_1_8_hero.npy'),          load('test_lr_char_1_8_hero.npy')),
    ('svc_charwb_1_6_sig',        load('oof_svc_charwb_1_6_sig.npy'),        load('test_svc_charwb_1_6_sig.npy')),
    ('lr_charwb_4_8',             load('oof_lr_charwb_4_8.npy'),             load('test_lr_charwb_4_8.npy')),
    # Removed: 'svc_word_uni_iso', 'svc_char_1_6_iso', 'sgd_char_3_7_hinge_sig' (too weak / redundant)
]
cands = [(n,o,t) for n,o,t in cands if (o is not None and t is not None)]
names = [n for n,_,_ in cands]
def _clip_norm(P):
    # Clip first, then normalize by the clipped row sums so every row sums to exactly 1
    C = np.clip(P, 1e-12, 1-1e-12)
    return C / C.sum(axis=1, keepdims=True)
OOFs_raw  = [_clip_norm(o) for _,o,_ in cands]
TESTs_raw = [_clip_norm(t) for _,_,t in cands]
K = len(names); assert K>=2, f'Need >=2 models, got {K}'
print('Blending models:', names)

def scale_probs(P, T):
    S = np.clip(P, 1e-12, 1-1e-12) ** (1.0/float(T))
    return S / S.sum(axis=1, keepdims=True)

# Per-model temperature calibration on OOF
per_model_T = []
OOFs = []
TESTs = []
for i in range(K):
    Pi = OOFs_raw[i]
    def loss_Ti(T):
        return log_loss(y, scale_probs(Pi, T), labels=classes)
    resTi = minimize_scalar(loss_Ti, bounds=(0.5, 5.0), method='bounded')
    Ti = float(resTi.x)
    per_model_T.append(Ti)
    OOFs.append(scale_probs(OOFs_raw[i], Ti))
    TESTs.append(scale_probs(TESTs_raw[i], Ti))
print('Per-model T:', {names[i]: round(per_model_T[i],3) for i in range(K)})

# Diagnostics: Per-model OOFs after calibration
per_oof = {names[i]: log_loss(y, OOFs[i], labels=classes) for i in range(K)}
print('Per-model OOF (post-cal):', {k: round(v,5) for k,v in per_oof.items()})

def geo_pool_log(stacks, w):
    A = np.zeros_like(stacks[0], dtype=np.float64)
    for k in range(K):
        if w[k] == 0.0: continue
        A += w[k] * np.log(stacks[k])
    A -= A.max(axis=1, keepdims=True)
    P = np.exp(A); P /= P.sum(axis=1, keepdims=True)
    return P

def softmax(z):
    z = z - z.max()
    e = np.exp(z); return e / e.sum()

# Objective in log-prob space; a small entropy bonus on the weights (reg = -lambda_ent * H(w))
# discourages collapse onto a single model
lambda_ent = 0.005
def obj(theta):
    w = softmax(theta)
    P = geo_pool_log(OOFs, w)
    reg = lambda_ent * float(np.sum(w * (np.log(w + 1e-12))))
    return log_loss(y, P, labels=classes) + reg

# Deterministic multi-start L-BFGS (32 starts)
best = (1e9, None, None)
rng = np.random.RandomState(42)
starts = [np.zeros(K)] + [rng.normal(0, 0.5, size=K) for _ in range(31)]
for si, theta0 in enumerate(starts):
    res = minimize(obj, theta0, method='L-BFGS-B')
    w_try = softmax(res.x)
    val = float(res.fun)
    if val < best[0]:
        best = (val, w_try.copy(), res.x.copy())
w_raw = best[1]
print('Best obj:', round(best[0],5))

# Caps/pruning
min_weight = 0.00
global_cap = 0.60
nb_cap = 0.60
weak_cap = 0.12  # optional cap for weaker models
w = w_raw.copy()
w[w < min_weight] = 0.0
w = np.minimum(w, global_cap)
# NB-family cap
nb_mask = np.array([n.startswith('nbsvm_') for n in names], bool)
nb_sum = w[nb_mask].sum()
if nb_sum > nb_cap and nb_sum > 0:
    w[nb_mask] *= (nb_cap / nb_sum)
# Cap very weak models (OOF > 0.45) lightly if present
for i,n in enumerate(names):
    # per_oof is keyed by every entry in names, so no KeyError guard is needed
    if per_oof[n] > 0.45:
        w[i] = min(w[i], weak_cap)
# Normalize capped/pruned weights
s = w.sum()
if s == 0:
    w = np.ones_like(w) / len(w)
else:
    w = w / s
print('Final weights:', {names[i]: round(w[i],3) for i in range(K)})

# Blend OOF/Test
P_oof = geo_pool_log(OOFs, w)
P_test = geo_pool_log(TESTs, w)
oof_preT = log_loss(y, P_oof, labels=classes)
print('Blend OOF (pre-temp):', round(oof_preT,5))

# Final temperature (single global)
# Same transform as scale_probs defined earlier in this cell; reuse it
scale = scale_probs
def loss_T(T): return log_loss(y, scale(P_oof,T), labels=classes)
resT = minimize_scalar(loss_T, bounds=(0.5,5.0), method='bounded')
T_opt = float(resT.x)
P_oof_scaled = scale(P_oof, T_opt)
oof_final = log_loss(y, P_oof_scaled, labels=classes)
print('Final T:', round(T_opt,4), 'Final OOF:', round(oof_final,5))

# Apply to test and save
P_test_scaled = scale(P_test, T_opt)
probs = P_test_scaled[:, [classes.index(c) for c in submit_cols]]
sub = pd.DataFrame(probs, columns=submit_cols)
sub.insert(0,'id',test['id'].values)
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv; elapsed=%.2fs' % (time.time()-t0))

Blending models: ['nbsvm_wc_tweaked', 'lr_wc_fixed', 'nbsvm_char_2_6_counts', 'nbsvm_char_2_7_presence', 'lr_char_1_7', 'lr_char_1_8_fast', 'lr_char_1_8_hero', 'svc_charwb_1_6_sig', 'lr_charwb_4_8']


Per-model T: {'nbsvm_wc_tweaked': 1.55, 'lr_wc_fixed': 0.877, 'nbsvm_char_2_6_counts': 1.579, 'nbsvm_char_2_7_presence': 1.477, 'lr_char_1_7': 0.88, 'lr_char_1_8_fast': 0.825, 'lr_char_1_8_hero': 0.871, 'svc_charwb_1_6_sig': 0.803, 'lr_charwb_4_8': 1.036}
Per-model OOF (post-cal): {'nbsvm_wc_tweaked': 0.32156, 'lr_wc_fixed': 0.36343, 'nbsvm_char_2_6_counts': 0.33019, 'nbsvm_char_2_7_presence': 0.338, 'lr_char_1_7': 0.37154, 'lr_char_1_8_fast': 0.37566, 'lr_char_1_8_hero': 0.36987, 'svc_charwb_1_6_sig': 0.39657, 'lr_charwb_4_8': 0.40984}


Best obj: 0.31216
Final weights: {'nbsvm_wc_tweaked': 0.556, 'lr_wc_fixed': 0.084, 'nbsvm_char_2_6_counts': 0.027, 'nbsvm_char_2_7_presence': 0.037, 'lr_char_1_7': 0.093, 'lr_char_1_8_fast': 0.035, 'lr_char_1_8_hero': 0.062, 'svc_charwb_1_6_sig': 0.074, 'lr_charwb_4_8': 0.031}
Blend OOF (pre-temp): 0.32009


Final T: 0.9493 Final OOF: 0.31961
Saved submission.csv; elapsed=91.95s
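
The blend above pools models as a weighted geometric mean: it sums weighted log-probabilities and renormalizes each row. A minimal sketch of that pooling step on two invented prediction matrices follows. `geo_pool` is a hypothetical standalone version of `geo_pool_log` that takes the models by iteration instead of reading the global `K`; the weights are picked by hand, not optimized.

```python
import numpy as np

def geo_pool(prob_list, weights):
    """Weighted geometric mean of probability matrices, renormalized per row."""
    log_mix = np.zeros_like(prob_list[0], dtype=np.float64)
    for P, w in zip(prob_list, weights):
        log_mix += w * np.log(np.clip(P, 1e-12, 1.0))
    log_mix -= log_mix.max(axis=1, keepdims=True)  # stabilize before exponentiating
    mix = np.exp(log_mix)
    return mix / mix.sum(axis=1, keepdims=True)

# Toy predictions (invented) from two models on two rows, three classes.
P_a = np.array([[0.8, 0.1, 0.1], [0.2, 0.5, 0.3]])
P_b = np.array([[0.6, 0.3, 0.1], [0.1, 0.7, 0.2]])

blend = geo_pool([P_a, P_b], weights=[0.7, 0.3])
solo = geo_pool([P_a, P_b], weights=[1.0, 0.0])  # all weight on the first model
print(blend)
```

With all weight on one model the pool returns that model unchanged, which is a quick sanity check for any weight vector produced by the optimizer.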


In [21]:
# Fast char-only LR: TF-IDF char (2-6), lowercase=False, min_df=2, sublinear_tf=True; C sweep
import time, gc, numpy as np, pandas as pd
from typing import List, Tuple, Dict
from scipy import sparse
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

SEED = 42
N_FOLDS = 5
np.random.seed(SEED)

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
X_text = train['text'].astype(str).values
y = train['author'].values
X_test_text = test['text'].astype(str).values
classes = np.unique(y)
submit_cols = ['EAP','HPL','MWS']
assert set(classes) == set(submit_cols), f"Classes mismatch: {classes}"

skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)

char_params = dict(analyzer='char', ngram_range=(2,6), min_df=2, lowercase=False,
                   strip_accents=None, sublinear_tf=True, smooth_idf=True, norm='l2', dtype=np.float32)

def build_char_fold(x_tr, x_val, x_test) -> Tuple[sparse.csr_matrix, sparse.csr_matrix, sparse.csr_matrix, int]:
    v = TfidfVectorizer(**char_params)
    X_tr = v.fit_transform(x_tr)
    X_val = v.transform(x_val)
    X_te  = v.transform(x_test)
    return X_tr, X_val, X_te, X_tr.shape[1]

def cv_char26_lr(C_grid: List[float], name: str='LR_char_2_6') -> Tuple[np.ndarray, np.ndarray, float, Dict]:
    best = dict(loss=1e9, C=None, oof=None, test=None)
    for C in C_grid:
        oof = np.zeros((len(train), len(classes)), dtype=np.float32)
        test_pred = np.zeros((len(test), len(classes)), dtype=np.float32)
        fold_losses = []
        t0_all = time.time()
        print(f"[{name}] C={C}", flush=True)
        for fold, (tr_idx, val_idx) in enumerate(skf.split(X_text, y), 1):
            t0 = time.time()
            x_tr, x_val = X_text[tr_idx], X_text[val_idx]
            y_tr, y_val = y[tr_idx], y[val_idx]
            X_tr, X_val, X_te, vdim = build_char_fold(x_tr, x_val, X_test_text)
            print(f"  [Fold {fold}] X_tr={X_tr.shape} vdim={vdim}", flush=True)
            clf = LogisticRegression(solver='saga', multi_class='multinomial', penalty='l2',
                                     C=C, max_iter=8000, tol=1e-4, n_jobs=-1, random_state=SEED)
            clf.fit(X_tr, y_tr)
            proba_val = clf.predict_proba(X_val).astype(np.float32)
            proba_test = clf.predict_proba(X_te).astype(np.float32)
            order_idx = [list(clf.classes_).index(c) for c in classes]
            proba_val = proba_val[:, order_idx]
            proba_test = proba_test[:, order_idx]
            oof[val_idx] = proba_val
            test_pred += proba_test / N_FOLDS
            loss = log_loss(y_val, proba_val, labels=list(classes))
            fold_losses.append(loss)
            print(f"  [Fold {fold}] loss={loss:.5f} elapsed={time.time()-t0:.2f}s", flush=True)
            del X_tr, X_val, X_te, proba_val, proba_test, clf
            gc.collect()
        oof_loss = log_loss(y, oof, labels=list(classes))
        print(f"[{name}] OOF={oof_loss:.5f}; folds={np.round(fold_losses,5).tolist()} total={time.time()-t0_all:.2f}s", flush=True)
        if oof_loss < best['loss']:
            best.update(loss=oof_loss, C=C, oof=oof.copy(), test=test_pred.copy())
    return best['oof'], best['test'], best['loss'], best

# Run char(2,6) LR with C grid as per expert quick win
C_grid = [12.0, 20.0, 32.0]
oof_lr_char26, test_lr_char26, loss_lr_char26, best_lr_char26 = cv_char26_lr(C_grid, name='LR_char_2_6')
np.save('oof_lr_char_2_6.npy', oof_lr_char26); np.save('test_lr_char_2_6.npy', test_lr_char26)
print(f"[LR_char_2_6] BEST OOF={loss_lr_char26:.5f} with C={best_lr_char26['C']}", flush=True)

# Optional: quick submission from this model
idx_map = [list(classes).index(c) for c in submit_cols]
probs = best_lr_char26['test'][:, idx_map]
eps = 1e-9
probs = np.clip(probs, eps, 1-eps)
probs = probs / probs.sum(axis=1, keepdims=True)
sub = pd.DataFrame(probs, columns=submit_cols)
sub.insert(0, 'id', test['id'].values)
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv (LR char 2-6). Head:\n', sub.head(), flush=True)

[LR_char_2_6] C=12.0


  [Fold 1] X_tr=(14096, 293097) vdim=293097




  [Fold 1] loss=0.39090 elapsed=28.49s


  [Fold 2] X_tr=(14097, 294181) vdim=294181




  [Fold 2] loss=0.38840 elapsed=34.08s


  [Fold 3] X_tr=(14097, 293310) vdim=293310




  [Fold 3] loss=0.40300 elapsed=33.66s


  [Fold 4] X_tr=(14097, 293748) vdim=293748




  [Fold 4] loss=0.39411 elapsed=27.29s


  [Fold 5] X_tr=(14097, 294449) vdim=294449




  [Fold 5] loss=0.38137 elapsed=34.40s


[LR_char_2_6] OOF=0.39155; folds=[0.3909, 0.3884, 0.403, 0.39411, 0.38137] total=158.27s


[LR_char_2_6] C=20.0


  [Fold 1] X_tr=(14096, 293097) vdim=293097




  [Fold 1] loss=0.38241 elapsed=36.36s


  [Fold 2] X_tr=(14097, 294181) vdim=294181




KeyboardInterrupt: 

In [22]:
# Level-2 Stacking: Multinomial Logistic Regression on base-model OOF probabilities
import numpy as np, pandas as pd, time, gc
from pathlib import Path
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

train = pd.read_csv('train.csv'); test = pd.read_csv('test.csv')
y = train['author'].values
classes = np.unique(y).tolist()
submit_cols = ['EAP','HPL','MWS']
assert set(classes) == set(submit_cols)

def load(p):
    return np.load(p) if Path(p).exists() else None

# Load base OOF/Test predictions (only those available will be used)
bases = [
    ('lr_wc_fixed', load('oof_lr_wordchar_fixed.npy'), load('test_lr_wordchar_fixed.npy')),
    ('nbsvm_wc_fixed', load('oof_nbsvm_wc_fixed.npy'), load('test_nbsvm_wc_fixed.npy')),
    ('lr_char_1_7', load('oof_lr_char_1_7.npy'), load('test_lr_char_1_7.npy')),
    ('svc_char_1_6_iso', load('oof_svc_char_1_6_iso.npy'), load('test_svc_char_1_6_iso.npy')),
    ('cnb_word', load('oof_cnb_word.npy'), load('test_cnb_word.npy')),
    ('nbsvm_word_fast', load('oof_nbsvm_word.npy'), load('test_nbsvm_word.npy')),
]
bases = [(n,o,t) for n,o,t in bases if (o is not None and t is not None)]
names = [n for n,_,_ in bases]
assert len(bases) >= 2, 'Need at least two base models for stacking'
print('Stacking base models:', names)

# Build meta features: concatenate probabilities from each base model (order as in 'classes')
def reorder_cols(P: np.ndarray, current_order, target_order):
    idx = [list(current_order).index(c) for c in target_order]
    return P[:, idx]

OOFs = []
TESTs = []
for n,o,t in bases:
    # Columns are assumed to already be in `classes` order (use reorder_cols if not);
    # clip & renormalize for numerical safety
    o = np.clip(o, 1e-12, 1-1e-12); o = o / o.sum(axis=1, keepdims=True)
    t = np.clip(t, 1e-12, 1-1e-12); t = t / t.sum(axis=1, keepdims=True)
    OOFs.append(o.astype(np.float32))
    TESTs.append(t.astype(np.float32))

X_meta = np.concatenate(OOFs, axis=1)  # shape (n_train, 3*K)
X_test_meta = np.concatenate(TESTs, axis=1)  # shape (n_test, 3*K)
print('Meta feature shapes:', X_meta.shape, X_test_meta.shape)

# 5-fold meta CV
SEED=42; N_FOLDS=5
skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)
meta_oof = np.zeros((len(train), len(classes)), dtype=np.float32)
meta_test = np.zeros((len(test), len(classes)), dtype=np.float32)
fold_losses = []
t0_all = time.time()
for fold, (tr, va) in enumerate(skf.split(X_meta, y), 1):
    t0 = time.time()
    X_tr, X_va = X_meta[tr], X_meta[va]
    y_tr, y_va = y[tr], y[va]
    clf = LogisticRegression(solver='lbfgs', multi_class='multinomial', C=0.5, max_iter=2000, n_jobs=-1, random_state=SEED)
    clf.fit(X_tr, y_tr)
    P_va = clf.predict_proba(X_va).astype(np.float32)
    P_te = clf.predict_proba(X_test_meta).astype(np.float32)
    # reorder to fixed class order
    order_idx = [list(clf.classes_).index(c) for c in classes]
    P_va = P_va[:, order_idx]; P_te = P_te[:, order_idx]
    meta_oof[va] = P_va
    meta_test += P_te / N_FOLDS
    loss = log_loss(y_va, P_va, labels=classes)
    fold_losses.append(loss)
    print(f"[META] Fold {fold} loss={loss:.5f} elapsed={time.time()-t0:.2f}s", flush=True)
    del X_tr, X_va, y_tr, y_va, clf, P_va, P_te; gc.collect()

oof_loss = log_loss(y, meta_oof, labels=classes)
print(f"[META] OOF={oof_loss:.5f}; folds={np.round(fold_losses,5).tolist()} total={time.time()-t0_all:.2f}s", flush=True)
np.save('oof_meta_stack.npy', meta_oof); np.save('test_meta_stack.npy', meta_test)

# Build submission
idx_map = [classes.index(c) for c in submit_cols]
probs = meta_test[:, idx_map]
probs = np.clip(probs, 1e-12, 1-1e-12); probs = probs / probs.sum(axis=1, keepdims=True)
sub = pd.DataFrame(probs, columns=submit_cols)
sub.insert(0, 'id', test['id'].values)
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv (meta stack). Head:\n', sub.head())

Stacking base models: ['lr_wc_fixed', 'nbsvm_wc_fixed', 'lr_char_1_7', 'svc_char_1_6_iso', 'cnb_word', 'nbsvm_word_fast']
Meta feature shapes: (17621, 18) (1958, 18)




[META] Fold 1 loss=0.32405 elapsed=0.86s




[META] Fold 2 loss=0.32416 elapsed=0.72s




[META] Fold 3 loss=0.33462 elapsed=0.72s




[META] Fold 4 loss=0.32360 elapsed=0.74s




[META] Fold 5 loss=0.32435 elapsed=0.73s


[META] OOF=0.32616; folds=[0.32405, 0.32416, 0.33462, 0.3236, 0.32435] total=4.12s


Saved submission.csv (meta stack). Head:
         id       EAP       HPL       MWS
0  id27251  0.009436  0.984666  0.005898
1  id09612  0.068307  0.051661  0.880032
2  id11943  0.021022  0.007304  0.971674
3  id19526  0.031051  0.019808  0.949141
4  id12931  0.027187  0.011376  0.961437


In [23]:
# Temperature scaling for meta stack predictions
import numpy as np, pandas as pd
from pathlib import Path
from sklearn.metrics import log_loss
from scipy.optimize import minimize_scalar

train = pd.read_csv('train.csv'); test = pd.read_csv('test.csv')
y = train['author'].values
classes = np.unique(y)
submit_cols = ['EAP','HPL','MWS']
assert set(classes) == set(submit_cols)

oof_meta = np.load('oof_meta_stack.npy')
test_meta = np.load('test_meta_stack.npy')

def scale_probs(P, T):
    P = np.clip(P, 1e-12, 1-1e-12)
    S = P ** (1.0 / T)
    return S / S.sum(axis=1, keepdims=True)

def loss_T(T):
    return log_loss(y, scale_probs(oof_meta, T), labels=list(classes))

resT = minimize_scalar(loss_T, bounds=(0.5, 3.0), method='bounded')
T_opt = float(resT.x)
oof_scaled = scale_probs(oof_meta, T_opt)
test_scaled = scale_probs(test_meta, T_opt)
oof_loss_scaled = log_loss(y, oof_scaled, labels=list(classes))
print(f"[META TEMP] T={T_opt:.4f} OOF_scaled={oof_loss_scaled:.5f}")

# Save final submission using temperature-scaled meta predictions
idx_map = [list(classes).index(c) for c in submit_cols]
probs = test_scaled[:, idx_map]
sub = pd.DataFrame(probs, columns=submit_cols)
sub.insert(0, 'id', test['id'].values)
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv (meta stack + temperature scaling). Head:\n', sub.head())

[META TEMP] T=0.9985 OOF_scaled=0.32616
Saved submission.csv (meta stack + temperature scaling). Head:
         id       EAP       HPL       MWS
0  id27251  0.009374  0.984772  0.005854
1  id09612  0.068085  0.051472  0.880443
2  id11943  0.020908  0.007253  0.971838
3  id19526  0.030905  0.019701  0.949394
4  id12931  0.027051  0.011305  0.961644
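
Temperature scaling raises each probability to the power `1/T` and renormalizes: `T > 1` flattens the distribution, `T < 1` sharpens it, and `T = 1` leaves it unchanged. The meta stack above came out at `T=0.9985`, meaning it was already well calibrated. The sketch below shows the opposite case on invented overconfident predictions. It swaps `minimize_scalar` for a plain grid search to stay dependency-free, and `mean_nll` is a hypothetical helper.

```python
import numpy as np

def scale_probs(P, T):
    """Raise probabilities to 1/T and renormalize each row."""
    P = np.clip(P, 1e-12, 1 - 1e-12)
    S = P ** (1.0 / T)
    return S / S.sum(axis=1, keepdims=True)

def mean_nll(y_idx, P):
    """Mean negative log-probability of the true class."""
    return float(-np.mean(np.log(P[np.arange(len(y_idx)), y_idx])))

# Toy data (invented): four identical, very confident rows; the last one is wrong.
P = np.tile(np.array([0.98, 0.01, 0.01]), (4, 1))
y_idx = np.array([0, 0, 0, 1])

# Grid search over the same (0.5, 3.0) bounds used in the cell above.
grid = np.linspace(0.5, 3.0, 26)
losses = [mean_nll(y_idx, scale_probs(P, T)) for T in grid]
T_best = float(grid[int(np.argmin(losses))])
loss_at_1 = mean_nll(y_idx, scale_probs(P, 1.0))
print(f"T_best={T_best:.1f} loss@T=1: {loss_at_1:.4f} loss@T_best: {min(losses):.4f}")
```

Because one row in four is wrong at 0.98 confidence, the best temperature lands above 1, softening every prediction.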


In [25]:
# NB-SVM tweaked: word(1,2)+char(2,6); presence vs counts (alpha=0.5), C in [30,50]; pick best and save
import time, gc, numpy as np, pandas as pd
from scipy import sparse
from pathlib import Path
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

SEED=42; N_FOLDS=5
train = pd.read_csv('train.csv'); test = pd.read_csv('test.csv')
X_text = train['text'].astype(str).values
y = train['author'].values
X_test = test['text'].astype(str).values
classes = np.unique(y)
submit_cols = ['EAP','HPL','MWS']
assert set(classes) == set(submit_cols)

word_params_base = dict(analyzer='word', ngram_range=(1,2), min_df=2, max_df=1.0, lowercase=True,
                        strip_accents='unicode', token_pattern=r"(?u)\b[-\w']+\b", dtype=np.float32)
char_params_base = dict(analyzer='char', ngram_range=(2,6), min_df=2, lowercase=False,
                        strip_accents=None, dtype=np.float32)

def _r_presence(X, yb, a=0.5):
    p = np.asarray(X[yb==1].sum(axis=0)).ravel() + a
    q = np.asarray(X[yb==0].sum(axis=0)).ravel() + a
    return np.log(p/q).astype(np.float32)

def _r_counts_normed(X, yb, a=0.5):
    p = np.asarray(X[yb==1].sum(axis=0)).ravel() + a
    q = np.asarray(X[yb==0].sum(axis=0)).ravel() + a
    p = p / (p.sum() + a)
    q = q / (q.sum() + a)
    return np.log(p/q).astype(np.float32)

def _softmax(m):
    m = m - m.max(axis=1, keepdims=True)
    e = np.exp(m); return e / e.sum(axis=1, keepdims=True)

def run_nbsvm(variant='presence', C=30.0, alpha=0.5):
    wp = word_params_base.copy(); cp = char_params_base.copy()
    if variant == 'presence':
        wp.update(binary=True); cp.update(binary=True)
        r_func = _r_presence
    elif variant == 'counts':
        wp.update(binary=False); cp.update(binary=False)
        r_func = _r_counts_normed
    else:
        raise ValueError('Unknown variant')
    skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)
    oof = np.zeros((len(train), len(classes)), np.float32)
    test_pred = np.zeros((len(test), len(classes)), np.float32)
    fold_losses = []
    t0_all = time.time()
    for fold, (tr, va) in enumerate(skf.split(X_text, y), 1):
        t0 = time.time()
        x_tr, x_va = X_text[tr], X_text[va]
        y_tr, y_va = y[tr], y[va]
        vw = CountVectorizer(**wp); vc = CountVectorizer(**cp)
        Xw_tr = vw.fit_transform(x_tr); Xw_va = vw.transform(x_va); Xw_te = vw.transform(X_test)
        Xc_tr = vc.fit_transform(x_tr); Xc_va = vc.transform(x_va); Xc_te = vc.transform(X_test)
        X_tr = sparse.hstack([Xw_tr, Xc_tr], format='csr')
        X_va = sparse.hstack([Xw_va, Xc_va], format='csr')
        X_te = sparse.hstack([Xw_te, Xc_te], format='csr')
        margins_va = np.zeros((len(va), len(classes)), np.float32)
        margins_te = np.zeros((len(test), len(classes)), np.float32)
        for ci, c in enumerate(classes):
            yb = (y_tr == c).astype(np.int8)
            r = r_func(X_tr, yb, a=alpha)
            Xr_tr = normalize(X_tr.multiply(r), norm='l2', axis=1, copy=False)
            Xr_va = normalize(X_va.multiply(r), norm='l2', axis=1, copy=False)
            Xr_te = normalize(X_te.multiply(r), norm='l2', axis=1, copy=False)
            clf = LogisticRegression(solver='liblinear', C=C, max_iter=2000, random_state=SEED)
            clf.fit(Xr_tr, yb)
            margins_va[:, ci] = clf.decision_function(Xr_va).astype(np.float32)
            margins_te[:, ci] = clf.decision_function(Xr_te).astype(np.float32)
            del yb, r, Xr_tr, Xr_va, Xr_te, clf
            gc.collect()
        P_va = _softmax(margins_va).astype(np.float32)
        P_te = _softmax(margins_te).astype(np.float32)
        oof[va] = P_va
        test_pred += P_te / N_FOLDS
        loss = log_loss(y_va, P_va, labels=list(classes))
        fold_losses.append(loss)
        print(f"[NBSVM {variant} C={C}] Fold {fold} loss={loss:.5f} elapsed={time.time()-t0:.2f}s", flush=True)
        del X_tr, X_va, X_te, Xw_tr, Xw_va, Xw_te, Xc_tr, Xc_va, Xc_te, margins_va, margins_te, P_va, P_te
        gc.collect()
    oof_loss = log_loss(y, oof, labels=list(classes))
    print(f"[NBSVM {variant} C={C}] OOF={oof_loss:.5f}; total={time.time()-t0_all:.2f}s", flush=True)
    return oof, test_pred, oof_loss

# Grid over variants and C; pick best
configs = [(v, C) for v in ('presence','counts') for C in (30.0, 50.0)]
best = dict(loss=1e9, variant=None, C=None, oof=None, test=None)
for variant, C in configs:
    oof_, test_, loss_ = run_nbsvm(variant=variant, C=C, alpha=0.5)
    if loss_ < best['loss']:
        best.update(loss=loss_, variant=variant, C=C, oof=oof_, test=test_)

print(f"[NBSVM BEST] variant={best['variant']} C={best['C']} OOF={best['loss']:.5f}")
np.save('oof_nbsvm_wc_tweaked.npy', best['oof'])
np.save('test_nbsvm_wc_tweaked.npy', best['test'])
print('Saved oof_nbsvm_wc_tweaked.npy and test_nbsvm_wc_tweaked.npy')

[NBSVM presence C=30.0] Fold 1 loss=0.36812 elapsed=14.29s
[NBSVM presence C=30.0] Fold 2 loss=0.36319 elapsed=14.19s
[NBSVM presence C=30.0] Fold 3 loss=0.37374 elapsed=14.18s
[NBSVM presence C=30.0] Fold 4 loss=0.34871 elapsed=14.17s
[NBSVM presence C=30.0] Fold 5 loss=0.35646 elapsed=14.05s
[NBSVM presence C=30.0] OOF=0.36204; total=71.18s
[NBSVM presence C=50.0] Fold 1 loss=0.38758 elapsed=13.96s
[NBSVM presence C=50.0] Fold 2 loss=0.38165 elapsed=14.05s
[NBSVM presence C=50.0] Fold 3 loss=0.39283 elapsed=13.84s
[NBSVM presence C=50.0] Fold 4 loss=0.36632 elapsed=13.72s
[NBSVM presence C=50.0] Fold 5 loss=0.37434 elapsed=14.00s
[NBSVM presence C=50.0] OOF=0.38054; total=69.87s
[NBSVM counts C=30.0] Fold 1 loss=0.36347 elapsed=13.43s
[NBSVM counts C=30.0] Fold 2 loss=0.34677 elapsed=13.37s
[NBSVM counts C=30.0] Fold 3 loss=0.37334 elapsed=13.64s
[NBSVM counts C=30.0] Fold 4 loss=0.35811 elapsed=13.51s
[NBSVM counts C=30.0] Fold 5 loss=0.34077 elapsed=13.09s
[NBSVM counts C=30.0] OOF=0.35649; total=67.33s
[NBSVM counts C=50.0] Fold 1 loss=0.38089 elapsed=13.40s
[NBSVM counts C=50.0] Fold 2 loss=0.36180 elapsed=13.30s
[NBSVM counts C=50.0] Fold 3 loss=0.39125 elapsed=13.41s
[NBSVM counts C=50.0] Fold 4 loss=0.37514 elapsed=13.47s
[NBSVM counts C=50.0] Fold 5 loss=0.35629 elapsed=13.49s
[NBSVM counts C=50.0] OOF=0.37307; total=67.37s
[NBSVM BEST] variant=counts C=30.0 OOF=0.35649
Saved oof_nbsvm_wc_tweaked.npy and test_nbsvm_wc_tweaked.npy
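
The 'presence' and 'counts' variants above differ in how the NB-SVM log-count ratio is built before the per-class logistic regression. The following is a minimal sketch of that ratio on a tiny dense matrix, assuming plain L1 normalization of the smoothed totals; the helper name `log_count_ratio` and the toy data are hypothetical, and the notebook cells operate on sparse matrices instead.

```python
import numpy as np

def log_count_ratio(X, yb, alpha=0.5, normalize_counts=True):
    """Smoothed log-count ratio for one binary (one-vs-rest) target.

    X  : (n_docs, n_features) dense array of counts or 0/1 presence flags
    yb : (n_docs,) array, 1 for the positive class and 0 for the rest
    """
    p = X[yb == 1].sum(axis=0) + alpha  # smoothed feature totals, positive class
    q = X[yb == 0].sum(axis=0) + alpha  # smoothed feature totals, remaining classes
    if normalize_counts:
        # Turn totals into per-class feature distributions (assumption: plain L1 norm)
        p = p / p.sum()
        q = q / q.sum()
    return np.log(p / q)

# Toy data: feature 0 is typical of the positive class, feature 1 of the rest
X = np.array([[2, 0, 1],
              [3, 0, 0],
              [0, 4, 1],
              [0, 1, 1]], dtype=np.float64)
yb = np.array([1, 1, 0, 0])

r = log_count_ratio(X, yb)
X_scaled = X * r  # each feature column is scaled by its ratio before fitting LR
print(np.round(r, 3))
```

Features that favor the positive class get a positive ratio and features that favor the rest get a negative one, so the downstream linear model starts from a Naive Bayes style weighting.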


In [27]:
# Char-only LR: TF-IDF char (2-6), lowercase=True (for diversity vs. case-sensitive char models), min_df=2, sublinear_tf=True; C sweep
import time, gc, numpy as np, pandas as pd
from typing import List, Tuple, Dict
from scipy import sparse
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

SEED = 42
N_FOLDS = 5
np.random.seed(SEED)

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
X_text = train['text'].astype(str).values
y = train['author'].values
X_test_text = test['text'].astype(str).values
classes = np.unique(y)
submit_cols = ['EAP','HPL','MWS']
assert set(classes) == set(submit_cols), f"Classes mismatch: {classes}"

skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)

char_params_lower = dict(analyzer='char', ngram_range=(2,6), min_df=2, lowercase=True,
                         strip_accents='unicode', sublinear_tf=True, smooth_idf=True, norm='l2', dtype=np.float32)

def build_char_fold_lower(x_tr, x_val, x_test) -> Tuple[sparse.csr_matrix, sparse.csr_matrix, sparse.csr_matrix, int]:
    v = TfidfVectorizer(**char_params_lower)
    X_tr = v.fit_transform(x_tr)
    X_val = v.transform(x_val)
    X_te  = v.transform(x_test)
    return X_tr, X_val, X_te, X_tr.shape[1]

def cv_char26_lr_lower(C_grid: List[float], name: str='LR_char_2_6_lower') -> Tuple[np.ndarray, np.ndarray, float, Dict]:
    best = dict(loss=1e9, C=None, oof=None, test=None)
    for C in C_grid:
        oof = np.zeros((len(train), len(classes)), dtype=np.float32)
        test_pred = np.zeros((len(test), len(classes)), dtype=np.float32)
        fold_losses = []
        t0_all = time.time()
        print(f"[{name}] C={C}", flush=True)
        for fold, (tr_idx, val_idx) in enumerate(skf.split(X_text, y), 1):
            t0 = time.time()
            x_tr, x_val = X_text[tr_idx], X_text[val_idx]
            y_tr, y_val = y[tr_idx], y[val_idx]
            X_tr, X_val, X_te, vdim = build_char_fold_lower(x_tr, x_val, X_test_text)
            print(f"  [Fold {fold}] X_tr={X_tr.shape} vdim={vdim}", flush=True)
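            # Note: `multi_class` is deprecated as of scikit-learn 1.5 (multinomial is the default for saga);
            # n_jobs only parallelizes one-vs-rest fits, so it has no effect here.
            clf = LogisticRegression(solver='saga', multi_class='multinomial', penalty='l2',
                                     C=C, max_iter=8000, tol=1e-4, n_jobs=-1, random_state=SEED)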
            clf.fit(X_tr, y_tr)
            proba_val = clf.predict_proba(X_val).astype(np.float32)
            proba_test = clf.predict_proba(X_te).astype(np.float32)
            order_idx = [list(clf.classes_).index(c) for c in classes]
            proba_val = proba_val[:, order_idx]
            proba_test = proba_test[:, order_idx]
            oof[val_idx] = proba_val
            test_pred += proba_test / N_FOLDS
            loss = log_loss(y_val, proba_val, labels=list(classes))
            fold_losses.append(loss)
            print(f"  [Fold {fold}] loss={loss:.5f} elapsed={time.time()-t0:.2f}s", flush=True)
            del X_tr, X_val, X_te, proba_val, proba_test, clf
            gc.collect()
        oof_loss = log_loss(y, oof, labels=list(classes))
        print(f"[{name}] OOF={oof_loss:.5f}; folds={np.round(fold_losses,5).tolist()} total={time.time()-t0_all:.2f}s", flush=True)
        if oof_loss < best['loss']:
            best.update(loss=oof_loss, C=C, oof=oof.copy(), test=test_pred.copy())
    return best['oof'], best['test'], best['loss'], best

# Run char(2,6) lowercase LR with C grid
C_grid = [12.0, 20.0, 32.0]
oof_lr_char26_lower, test_lr_char26_lower, loss_lr_char26_lower, best_lr_char26_lower = cv_char26_lr_lower(C_grid, name='LR_char_2_6_lower')
np.save('oof_lr_char_2_6_lower.npy', oof_lr_char26_lower)
np.save('test_lr_char_2_6_lower.npy', test_lr_char26_lower)
print(f"[LR_char_2_6_lower] BEST OOF={loss_lr_char26_lower:.5f} with C={best_lr_char26_lower['C']}", flush=True)

# Optional: quick submission from this model
idx_map = [list(classes).index(c) for c in submit_cols]
probs = best_lr_char26_lower['test'][:, idx_map]
eps = 1e-9
probs = np.clip(probs, eps, 1-eps)
probs = probs / probs.sum(axis=1, keepdims=True)
sub = pd.DataFrame(probs, columns=submit_cols)
sub.insert(0, 'id', test['id'].values)
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv (LR char 2-6 lowercase). Head:\n', sub.head(), flush=True)

[LR_char_2_6_lower] C=12.0
  [Fold 1] X_tr=(14096, 277345) vdim=277345
  [Fold 1] loss=0.39519 elapsed=28.89s
  [Fold 2] X_tr=(14097, 278200) vdim=278200
  [Fold 2] loss=0.39413 elapsed=34.48s
  [Fold 3] X_tr=(14097, 277563) vdim=277563
  [Fold 3] loss=0.40774 elapsed=33.88s
  [Fold 4] X_tr=(14097, 278079) vdim=278079
  [Fold 4] loss=0.40166 elapsed=27.30s
  [Fold 5] X_tr=(14097, 278322) vdim=278322
  [Fold 5] loss=0.38833 elapsed=35.07s
[LR_char_2_6_lower] OOF=0.39741; folds=[0.39519, 0.39413, 0.40774, 0.40166, 0.38833] total=159.97s
[LR_char_2_6_lower] C=20.0
  [Fold 1] X_tr=(14096, 277345) vdim=277345
  [Fold 1] loss=0.38702 elapsed=36.28s
  [Fold 2] X_tr=(14097, 278200) vdim=278200
  [Fold 2] loss=0.38482 elapsed=42.30s
  [Fold 3] X_tr=(14097, 277563) vdim=277563
  [Fold 3] loss=0.40072 elapsed=41.88s
  [Fold 4] X_tr=(14097, 278079) vdim=278079
  [Fold 4] loss=0.39484 elapsed=35.11s
  [Fold 5] X_tr=(14097, 278322) vdim=278322
  [Fold 5] loss=0.38012 elapsed=43.59s
[LR_char_2_6_lower] OOF=0.38950; folds=[0.38702, 0.38482, 0.40072, 0.39484, 0.38012] total=199.51s
[LR_char_2_6_lower] C=32.0
  [Fold 1] X_tr=(14096, 277345) vdim=277345
  [Fold 1] loss=0.38383 elapsed=40.21s
  [Fold 2] X_tr=(14097, 278200) vdim=278200
  [Fold 2] loss=0.38064 elapsed=51.35s
  [Fold 3] X_tr=(14097, 277563) vdim=277563
  [Fold 3] loss=0.39858 elapsed=51.24s
  [Fold 4] X_tr=(14097, 278079) vdim=278079
  [Fold 4] loss=0.39294 elapsed=43.96s
  [Fold 5] X_tr=(14097, 278322) vdim=278322
  [Fold 5] loss=0.37691 elapsed=50.40s
[LR_char_2_6_lower] OOF=0.38658; folds=[0.38383, 0.38064, 0.39858, 0.39294, 0.37691] total=237.51s
[LR_char_2_6_lower] BEST OOF=0.38658 with C=32.0
Saved submission.csv (LR char 2-6 lowercase). Head:
         id       EAP       HPL       MWS
0  id27251  0.008741  0.985476  0.005784
1  id09612  0.185323  0.209027  0.605651
2  id11943  0.024677  0.003342  0.971981
3  id19526  0.003913  0.037732  0.958355
4  id12931  0.023281  0.040654  0.936065
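
The submission step above clips probabilities to `[eps, 1-eps]` and renormalizes each row. A small sketch of why that matters for log loss follows, assuming integer-encoded labels; the helper names `clip_and_renormalize` and `multiclass_log_loss` are illustrative and not part of the notebook.

```python
import numpy as np

def clip_and_renormalize(probs, eps=1e-9):
    """Clip away exact 0/1 probabilities, then make every row sum to 1 again."""
    probs = np.clip(probs, eps, 1.0 - eps)
    return probs / probs.sum(axis=1, keepdims=True)

def multiclass_log_loss(y_idx, probs):
    """Mean negative log-probability assigned to the true class (labels as ints)."""
    return float(-np.mean(np.log(probs[np.arange(len(y_idx)), y_idx])))

# Row 0 puts all mass on class 0, but the true class is 1: log(0) would be infinite
raw = np.array([[1.0, 0.0, 0.0],
                [0.2, 0.5, 0.3]])
y_idx = np.array([1, 1])

safe = clip_and_renormalize(raw)
loss = multiclass_log_loss(y_idx, safe)
print(f"loss after clipping: {loss:.3f}")
```

A confidently wrong row still costs a lot after clipping, but the penalty stays finite, which keeps a single bad prediction from making the whole score undefined.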


In [30]:
# New base model: LR with analyzer='char_wb' ngram_range=(3,6), lowercase=False; C sweep [8,16,24]
import time, gc, numpy as np, pandas as pd
from typing import List, Tuple, Dict
from scipy import sparse
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

SEED = 42
N_FOLDS = 5
np.random.seed(SEED)

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
X_text = train['text'].astype(str).values
y = train['author'].values
X_test_text = test['text'].astype(str).values
classes = np.unique(y)
submit_cols = ['EAP','HPL','MWS']
assert set(classes) == set(submit_cols), f"Classes mismatch: {classes}"

skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)

charwb_params = dict(analyzer='char_wb', ngram_range=(3,6), min_df=2, lowercase=False,
                     strip_accents=None, sublinear_tf=True, smooth_idf=True, norm='l2', dtype=np.float32)

def build_charwb_fold(x_tr, x_val, x_test) -> Tuple[sparse.csr_matrix, sparse.csr_matrix, sparse.csr_matrix, int]:
    v = TfidfVectorizer(**charwb_params)
    X_tr = v.fit_transform(x_tr)
    X_val = v.transform(x_val)
    X_te  = v.transform(x_test)
    return X_tr, X_val, X_te, X_tr.shape[1]

def cv_charwb_lr(C_grid: List[float], name: str='LR_charwb_3_6') -> Tuple[np.ndarray, np.ndarray, float, Dict]:
    best = dict(loss=1e9, C=None, oof=None, test=None)
    for C in C_grid:
        oof = np.zeros((len(train), len(classes)), dtype=np.float32)
        test_pred = np.zeros((len(test), len(classes)), dtype=np.float32)
        fold_losses = []
        t0_all = time.time()
        print(f"[{name}] C={C}", flush=True)
        for fold, (tr_idx, val_idx) in enumerate(skf.split(X_text, y), 1):
            t0 = time.time()
            x_tr, x_val = X_text[tr_idx], X_text[val_idx]
            y_tr, y_val = y[tr_idx], y[val_idx]
            X_tr, X_val, X_te, vdim = build_charwb_fold(x_tr, x_val, X_test_text)
            print(f"  [Fold {fold}] X_tr={X_tr.shape} vdim={vdim}", flush=True)
            clf = LogisticRegression(solver='saga', multi_class='multinomial', penalty='l2',
                                     C=C, max_iter=10000, tol=1e-4, n_jobs=-1, random_state=SEED)
            clf.fit(X_tr, y_tr)
            proba_val = clf.predict_proba(X_val).astype(np.float32)
            proba_test = clf.predict_proba(X_te).astype(np.float32)
            order_idx = [list(clf.classes_).index(c) for c in classes]
            proba_val = proba_val[:, order_idx]
            proba_test = proba_test[:, order_idx]
            oof[val_idx] = proba_val
            test_pred += proba_test / N_FOLDS
            loss = log_loss(y_val, proba_val, labels=list(classes))
            fold_losses.append(loss)
            print(f"  [Fold {fold}] loss={loss:.5f} elapsed={time.time()-t0:.2f}s", flush=True)
            del X_tr, X_val, X_te, proba_val, proba_test, clf
            gc.collect()
        oof_loss = log_loss(y, oof, labels=list(classes))
        print(f"[{name}] OOF={oof_loss:.5f}; folds={np.round(fold_losses,5).tolist()} total={time.time()-t0_all:.2f}s", flush=True)
        if oof_loss < best['loss']:
            best.update(loss=oof_loss, C=C, oof=oof.copy(), test=test_pred.copy())
    return best['oof'], best['test'], best['loss'], best

# Run LR char_wb(3,6) with C sweep
C_grid = [8.0, 16.0, 24.0]
oof_lr_charwb_36, test_lr_charwb_36, loss_lr_charwb_36, best_charwb_36 = cv_charwb_lr(C_grid, name='LR_charwb_3_6')
np.save('oof_lr_charwb_3_6.npy', oof_lr_charwb_36)
np.save('test_lr_charwb_3_6.npy', test_lr_charwb_36)
print(f"[LR_charwb_3_6] BEST OOF={loss_lr_charwb_36:.5f} with C={best_charwb_36['C']}", flush=True)

# Quick submission from best char_wb model
idx_map = [list(classes).index(c) for c in submit_cols]
probs = best_charwb_36['test'][:, idx_map]
eps = 1e-9
probs = np.clip(probs, eps, 1-eps)
probs = probs / probs.sum(axis=1, keepdims=True)
sub = pd.DataFrame(probs, columns=submit_cols)
sub.insert(0, 'id', test['id'].values)
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv (LR char_wb 3-6). Head:\n', sub.head(), flush=True)

[LR_charwb_3_6] C=8.0
  [Fold 1] X_tr=(14096, 117398) vdim=117398
  [Fold 1] loss=0.39976 elapsed=5.94s
  [Fold 2] X_tr=(14097, 117813) vdim=117813
  [Fold 2] loss=0.39901 elapsed=9.49s
  [Fold 3] X_tr=(14097, 117656) vdim=117656
  [Fold 3] loss=0.41587 elapsed=8.00s
  [Fold 4] X_tr=(14097, 117909) vdim=117909
  [Fold 4] loss=0.40470 elapsed=5.99s
  [Fold 5] X_tr=(14097, 117984) vdim=117984
  [Fold 5] loss=0.39950 elapsed=8.62s
[LR_charwb_3_6] OOF=0.40377; folds=[0.39976, 0.39901, 0.41587, 0.4047, 0.3995] total=38.39s
[LR_charwb_3_6] C=16.0
  [Fold 1] X_tr=(14096, 117398) vdim=117398
  [Fold 1] loss=0.39988 elapsed=8.63s
  [Fold 2] X_tr=(14097, 117813) vdim=117813
  [Fold 2] loss=0.39451 elapsed=13.72s
  [Fold 3] X_tr=(14097, 117656) vdim=117656
  [Fold 3] loss=0.41683 elapsed=12.49s
  [Fold 4] X_tr=(14097, 117909) vdim=117909
  [Fold 4] loss=0.40362 elapsed=8.03s
  [Fold 5] X_tr=(14097, 117984) vdim=117984
  [Fold 5] loss=0.39843 elapsed=13.01s
[LR_charwb_3_6] OOF=0.40266; folds=[0.39988, 0.39451, 0.41683, 0.40362, 0.39843] total=56.22s
[LR_charwb_3_6] C=24.0
  [Fold 1] X_tr=(14096, 117398) vdim=117398
  [Fold 1] loss=0.40588 elapsed=11.50s
  [Fold 2] X_tr=(14097, 117813) vdim=117813
  [Fold 2] loss=0.39783 elapsed=16.77s
  [Fold 3] X_tr=(14097, 117656) vdim=117656
  [Fold 3] loss=0.42332 elapsed=15.37s
  [Fold 4] X_tr=(14097, 117909) vdim=117909
  [Fold 4] loss=0.40890 elapsed=11.33s
  [Fold 5] X_tr=(14097, 117984) vdim=117984
  [Fold 5] loss=0.40362 elapsed=15.80s
[LR_charwb_3_6] OOF=0.40791; folds=[0.40588, 0.39783, 0.42332, 0.4089, 0.40362] total=71.11s
[LR_charwb_3_6] BEST OOF=0.40266 with C=16.0
Saved submission.csv (LR char_wb 3-6). Head:
         id       EAP       HPL       MWS
0  id27251  0.006199  0.992111  0.001690
1  id09612  0.232643  0.064667  0.702690
2  id11943  0.006258  0.000503  0.993239
3  id19526  0.019338  0.070578  0.910085
4  id12931  0.011832  0.023729  0.964439


In [35]:
# New base: Calibrated RidgeClassifier on word TF-IDF (1-3), token pattern keeps apostrophes/hyphens
import time, gc, numpy as np, pandas as pd
from typing import List, Tuple, Dict
from scipy import sparse
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import log_loss

SEED = 42
N_FOLDS = 5
np.random.seed(SEED)

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
X_text = train['text'].astype(str).values
y = train['author'].values
X_test_text = test['text'].astype(str).values
classes = np.unique(y)
submit_cols = ['EAP','HPL','MWS']
assert set(classes) == set(submit_cols), f"Classes mismatch: {classes}"

skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)

word_params = dict(analyzer='word', ngram_range=(1,3), min_df=2, lowercase=True,
                   strip_accents='unicode', token_pattern=r"(?u)\b[-\w']+\b",
                   sublinear_tf=True, smooth_idf=True, norm='l2', dtype=np.float32)

def build_word_fold(x_tr, x_val, x_test) -> Tuple[sparse.csr_matrix, sparse.csr_matrix, sparse.csr_matrix, int]:
    v = TfidfVectorizer(**word_params)
    X_tr = v.fit_transform(x_tr)
    X_val = v.transform(x_val)
    X_te  = v.transform(x_test)
    return X_tr, X_val, X_te, X_tr.shape[1]

def cv_calibrated_ridge(alpha_grid: List[float], name: str='CalibRidge_word_1_3') -> Tuple[np.ndarray, np.ndarray, float, Dict]:
    best = dict(loss=1e9, alpha=None, oof=None, test=None)
    for alpha in alpha_grid:
        oof = np.zeros((len(train), len(classes)), dtype=np.float32)
        test_pred = np.zeros((len(test), len(classes)), dtype=np.float32)
        fold_losses = []
        t0_all = time.time()
        print(f"[{name}] alpha={alpha}", flush=True)
        for fold, (tr_idx, val_idx) in enumerate(skf.split(X_text, y), 1):
            t0 = time.time()
            x_tr, x_val = X_text[tr_idx], X_text[val_idx]
            y_tr, y_val = y[tr_idx], y[val_idx]
            X_tr, X_val, X_te, vdim = build_word_fold(x_tr, x_val, X_test_text)
            print(f"  [Fold {fold}] X_tr={X_tr.shape} vdim={vdim}", flush=True)
            base = RidgeClassifier(alpha=alpha, random_state=SEED)
            # sklearn >=1.2 accepts 'estimator'; the old 'base_estimator' name was removed in 1.4
            clf = CalibratedClassifierCV(estimator=base, method='sigmoid', cv=5, n_jobs=-1)
            clf.fit(X_tr, y_tr)
            proba_val = clf.predict_proba(X_val).astype(np.float32)
            proba_test = clf.predict_proba(X_te).astype(np.float32)
            order_idx = [list(clf.classes_).index(c) for c in classes]
            proba_val = proba_val[:, order_idx]
            proba_test = proba_test[:, order_idx]
            oof[val_idx] = proba_val
            test_pred += proba_test / N_FOLDS
            loss = log_loss(y_val, proba_val, labels=list(classes))
            fold_losses.append(loss)
            print(f"  [Fold {fold}] loss={loss:.5f} elapsed={time.time()-t0:.2f}s", flush=True)
            del X_tr, X_val, X_te, proba_val, proba_test, clf, base
            gc.collect()
        oof_loss = log_loss(y, oof, labels=list(classes))
        print(f"[{name}] OOF={oof_loss:.5f}; folds={np.round(fold_losses,5).tolist()} total={time.time()-t0_all:.2f}s", flush=True)
        if oof_loss < best['loss']:
            best.update(loss=oof_loss, alpha=alpha, oof=oof.copy(), test=test_pred.copy())
    return best['oof'], best['test'], best['loss'], best

# Run calibrated Ridge with small alpha grid
alpha_grid = [2.0, 1.0, 4.0]
oof_ridge_word, test_ridge_word, loss_ridge_word, best_ridge = cv_calibrated_ridge(alpha_grid, name='CalibRidge_word_1_3')
np.save('oof_ridge_word.npy', oof_ridge_word)
np.save('test_ridge_word.npy', test_ridge_word)
print(f"[CalibRidge_word_1_3] BEST OOF={loss_ridge_word:.5f} alpha={best_ridge['alpha']}", flush=True)

# Optional quick submission from calibrated Ridge
idx_map = [list(classes).index(c) for c in submit_cols]
probs = best_ridge['test'][:, idx_map]
eps = 1e-9
probs = np.clip(probs, eps, 1-eps)
probs = probs / probs.sum(axis=1, keepdims=True)
sub = pd.DataFrame(probs, columns=submit_cols)
sub.insert(0, 'id', test['id'].values)
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv (Calibrated Ridge word 1-3). Head:\n', sub.head(), flush=True)

[CalibRidge_word_1_3] alpha=2.0
  [Fold 1] X_tr=(14096, 75387) vdim=75387
  [Fold 1] loss=0.46513 elapsed=2.15s
  [Fold 2] X_tr=(14097, 75665) vdim=75665
  [Fold 2] loss=0.46835 elapsed=1.78s
  [Fold 3] X_tr=(14097, 75267) vdim=75267
  [Fold 3] loss=0.47315 elapsed=1.73s
  [Fold 4] X_tr=(14097, 75533) vdim=75533
  [Fold 4] loss=0.46238 elapsed=1.73s
  [Fold 5] X_tr=(14097, 75229) vdim=75229
  [Fold 5] loss=0.46752 elapsed=1.72s
[CalibRidge_word_1_3] OOF=0.46730; folds=[0.46513, 0.46835, 0.47315, 0.46238, 0.46752] total=9.66s
[CalibRidge_word_1_3] alpha=1.0
  [Fold 1] X_tr=(14096, 75387) vdim=75387
  [Fold 1] loss=0.43918 elapsed=1.73s
  [Fold 2] X_tr=(14097, 75665) vdim=75665
  [Fold 2] loss=0.44207 elapsed=1.75s
  [Fold 3] X_tr=(14097, 75267) vdim=75267
  [Fold 3] loss=0.44729 elapsed=1.75s
  [Fold 4] X_tr=(14097, 75533) vdim=75533
  [Fold 4] loss=0.43559 elapsed=1.20s
  [Fold 5] X_tr=(14097, 75229) vdim=75229
  [Fold 5] loss=0.44300 elapsed=1.22s
[CalibRidge_word_1_3] OOF=0.44143; folds=[0.43918, 0.44207, 0.44729, 0.43559, 0.443] total=8.15s
[CalibRidge_word_1_3] alpha=4.0
  [Fold 1] X_tr=(14096, 75387) vdim=75387
  [Fold 1] loss=0.50802 elapsed=1.21s
  [Fold 2] X_tr=(14097, 75665) vdim=75665
  [Fold 2] loss=0.50931 elapsed=1.19s
  [Fold 3] X_tr=(14097, 75267) vdim=75267
  [Fold 3] loss=0.51530 elapsed=1.17s
  [Fold 4] X_tr=(14097, 75533) vdim=75533
  [Fold 4] loss=0.50476 elapsed=1.20s
  [Fold 5] X_tr=(14097, 75229) vdim=75229
  [Fold 5] loss=0.50717 elapsed=1.21s
[CalibRidge_word_1_3] OOF=0.50891; folds=[0.50802, 0.50931, 0.5153, 0.50476, 0.50717] total=6.47s
[CalibRidge_word_1_3] BEST OOF=0.44143 alpha=1.0
Saved submission.csv (Calibrated Ridge word 1-3). Head:
         id       EAP       HPL       MWS
0  id27251  0.014615  0.970435  0.014950
1  id09612  0.155445  0.115151  0.729403
2  id11943  0.025075  0.012276  0.962649
3  id19526  0.037478  0.061502  0.901020
4  id12931  0.155919  0.026078  0.818003


In [51]:
# New base: Char-only NB-SVM (counts) with char ngrams (2,6), min_df=2, lowercase=False; alpha=0.5, C=30
import time, gc, numpy as np, pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

SEED=42; N_FOLDS=5
train = pd.read_csv('train.csv'); test = pd.read_csv('test.csv')
X_text = train['text'].astype(str).values
y = train['author'].values
X_test = test['text'].astype(str).values
classes = np.unique(y)
submit_cols = ['EAP','HPL','MWS']
assert set(classes) == set(submit_cols)

char_params = dict(analyzer='char', ngram_range=(2,6), min_df=2, lowercase=False,
                   strip_accents=None, binary=False, dtype=np.float32)

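def _r_counts_normed(X, yb, a=0.5):
    # Smoothed per-feature count totals for the positive class (p) and the rest (q)
    p = np.asarray(X[yb==1].sum(axis=0)).ravel() + a
    q = np.asarray(X[yb==0].sum(axis=0)).ravel() + a
    # Normalize to (near-)unit mass; p and q already include the smoothing,
    # so the extra `+ a` in the denominator is a negligible deviation from plain L1 normalization
    p = p / (p.sum() + a)
    q = q / (q.sum() + a)
    # Log-count ratio used to scale features before the per-class LR
    return np.log(p/q).astype(np.float32)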

def _softmax(m):
    m = m - m.max(axis=1, keepdims=True)
    e = np.exp(m); return e / e.sum(axis=1, keepdims=True)

skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)
oof = np.zeros((len(train), len(classes)), np.float32)
test_pred = np.zeros((len(test), len(classes)), np.float32)
fold_losses = []
t0_all = time.time()
for fold, (tr, va) in enumerate(skf.split(X_text, y), 1):
    t0 = time.time()
    x_tr, x_va = X_text[tr], X_text[va]
    y_tr, y_va = y[tr], y[va]
    vc = CountVectorizer(**char_params)
    Xc_tr = vc.fit_transform(x_tr); Xc_va = vc.transform(x_va); Xc_te = vc.transform(X_test)
    print(f"[NBSVM_char_counts] Fold {fold} X_tr={Xc_tr.shape}", flush=True)
    margins_va = np.zeros((len(va), len(classes)), np.float32)
    margins_te = np.zeros((len(test), len(classes)), np.float32)
    for ci, c in enumerate(classes):
        yb = (y_tr == c).astype(np.int8)
        r = _r_counts_normed(Xc_tr, yb, a=0.5)
        Xr_tr = normalize(Xc_tr.multiply(r), norm='l2', axis=1, copy=False)
        Xr_va = normalize(Xc_va.multiply(r), norm='l2', axis=1, copy=False)
        Xr_te = normalize(Xc_te.multiply(r), norm='l2', axis=1, copy=False)
        clf = LogisticRegression(solver='liblinear', C=30.0, max_iter=2000, random_state=SEED)
        clf.fit(Xr_tr, yb)
        margins_va[:, ci] = clf.decision_function(Xr_va).astype(np.float32)
        margins_te[:, ci] = clf.decision_function(Xr_te).astype(np.float32)
        del yb, r, Xr_tr, Xr_va, Xr_te, clf
        gc.collect()
    P_va = _softmax(margins_va).astype(np.float32)
    P_te = _softmax(margins_te).astype(np.float32)
    oof[va] = P_va
    test_pred += P_te / N_FOLDS
    loss = log_loss(y_va, P_va, labels=list(classes))
    fold_losses.append(loss)
    print(f"[NBSVM_char_counts] Fold {fold} loss={loss:.5f} elapsed={time.time()-t0:.2f}s", flush=True)
    del Xc_tr, Xc_va, Xc_te, margins_va, margins_te, P_va, P_te, vc
    gc.collect()

oof_loss = log_loss(y, oof, labels=list(classes))
print(f"[NBSVM_char_counts] OOF={oof_loss:.5f}; folds={np.round(fold_losses,5).tolist()} total={time.time()-t0_all:.2f}s", flush=True)
np.save('oof_nbsvm_char_2_6_counts.npy', oof)
np.save('test_nbsvm_char_2_6_counts.npy', test_pred)
print('Saved oof_nbsvm_char_2_6_counts.npy and test_nbsvm_char_2_6_counts.npy', flush=True)

[NBSVM_char_counts] Fold 1 X_tr=(14096, 293097)
[NBSVM_char_counts] Fold 1 loss=0.37768 elapsed=12.24s
[NBSVM_char_counts] Fold 2 X_tr=(14097, 294181)
[NBSVM_char_counts] Fold 2 loss=0.35898 elapsed=12.19s
[NBSVM_char_counts] Fold 3 X_tr=(14097, 293310)
[NBSVM_char_counts] Fold 3 loss=0.38402 elapsed=11.65s
[NBSVM_char_counts] Fold 4 X_tr=(14097, 293748)
[NBSVM_char_counts] Fold 4 loss=0.37070 elapsed=11.91s
[NBSVM_char_counts] Fold 5 X_tr=(14097, 294449)
[NBSVM_char_counts] Fold 5 loss=0.35348 elapsed=12.25s
[NBSVM_char_counts] OOF=0.36897; folds=[0.37768, 0.35898, 0.38402, 0.3707, 0.35348] total=60.76s
Saved oof_nbsvm_char_2_6_counts.npy and test_nbsvm_char_2_6_counts.npy
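
The NB-SVM cells turn three independent one-vs-rest margins into probabilities with a row-wise softmax. A minimal sketch of that step is below, assuming margins are stacked as one column per class; the name `softmax_rows` is illustrative. Applying softmax to separately trained binary margins is a heuristic rather than a calibrated multinomial model, so the resulting probabilities may still benefit from calibration.

```python
import numpy as np

def softmax_rows(margins):
    """Numerically stable softmax applied independently to each row."""
    shifted = margins - margins.max(axis=1, keepdims=True)  # avoids overflow in exp
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

# One column per class; the second row uses large values to exercise stability
margins = np.array([[2.0, -1.0, -3.0],
                    [1000.0, 999.0, 998.0]])
P = softmax_rows(margins)
print(np.round(P, 4))
```

Subtracting the row maximum leaves the result unchanged mathematically but keeps `np.exp` from overflowing on large margins.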


In [54]:
# New base: Strong Char-only LR (1-8), lowercase=False, sublinear_tf=True; faster variant (min_df=2, OvR/liblinear, single C=32)
import time, gc, numpy as np, pandas as pd
from typing import List, Tuple, Dict
from scipy import sparse
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

SEED = 42
N_FOLDS = 5
np.random.seed(SEED)

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
X_text = train['text'].astype(str).values
y = train['author'].values
X_test_text = test['text'].astype(str).values
classes = np.unique(y)
submit_cols = ['EAP','HPL','MWS']
assert set(classes) == set(submit_cols), f"Classes mismatch: {classes}"

skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)

# Speed tweak: min_df=2 to reduce vdim; keep case/punct; (1,8) char range
char_params = dict(analyzer='char', ngram_range=(1,8), min_df=2, lowercase=False,
                   strip_accents=None, sublinear_tf=True, smooth_idf=True, norm='l2', dtype=np.float32)

def build_char_fold(x_tr, x_val, x_test) -> Tuple[sparse.csr_matrix, sparse.csr_matrix, sparse.csr_matrix, int]:
    v = TfidfVectorizer(**char_params)
    X_tr = v.fit_transform(x_tr)
    X_val = v.transform(x_val)
    X_te  = v.transform(x_test)
    return X_tr, X_val, X_te, X_tr.shape[1]

def cv_char18_lr_fast(C: float = 32.0, name: str='LR_char_1_8_fast') -> Tuple[np.ndarray, np.ndarray, float, Dict]:
    oof = np.zeros((len(train), len(classes)), dtype=np.float32)
    test_pred = np.zeros((len(test), len(classes)), dtype=np.float32)
    fold_losses = []
    t0_all = time.time()
    print(f"[{name}] C={C}", flush=True)
    for fold, (tr_idx, val_idx) in enumerate(skf.split(X_text, y), 1):
        t0 = time.time()
        x_tr, x_val = X_text[tr_idx], X_text[val_idx]
        y_tr, y_val = y[tr_idx], y[val_idx]
        X_tr, X_val, X_te, vdim = build_char_fold(x_tr, x_val, X_test_text)
        print(f"  [Fold {fold}] X_tr={X_tr.shape} vdim={vdim}", flush=True)
        # Faster on huge vdim: OvR/liblinear instead of multinomial/saga
        clf = LogisticRegression(solver='liblinear', multi_class='ovr', penalty='l2',
                                 C=C, max_iter=5000, tol=1e-4, n_jobs=1, random_state=SEED)
        clf.fit(X_tr, y_tr)
        proba_val = clf.predict_proba(X_val).astype(np.float32)
        proba_test = clf.predict_proba(X_te).astype(np.float32)
        order_idx = [list(clf.classes_).index(c) for c in classes]
        proba_val = proba_val[:, order_idx]
        proba_test = proba_test[:, order_idx]
        oof[val_idx] = proba_val
        test_pred += proba_test / N_FOLDS
        loss = log_loss(y_val, proba_val, labels=list(classes))
        fold_losses.append(loss)
        print(f"  [Fold {fold}] loss={loss:.5f} elapsed={time.time()-t0:.2f}s", flush=True)
        del X_tr, X_val, X_te, proba_val, proba_test, clf
        gc.collect()
    oof_loss = log_loss(y, oof, labels=list(classes))
    print(f"[{name}] OOF={oof_loss:.5f}; folds={np.round(fold_losses,5).tolist()} total={time.time()-t0_all:.2f}s", flush=True)
    meta = dict(C=C, folds=fold_losses, oof=oof_loss)
    return oof, test_pred, oof_loss, meta

# Train single strong C to keep runtime reasonable
oof_lr_char18, test_lr_char18, loss_lr_char18, meta_lr_char18 = cv_char18_lr_fast(C=32.0, name='LR_char_1_8_fast')
np.save('oof_lr_char_1_8.npy', oof_lr_char18)
np.save('test_lr_char_1_8.npy', test_lr_char18)
print(f"[LR_char_1_8_fast] OOF={loss_lr_char18:.5f} with C={meta_lr_char18['C']}", flush=True)

# Optional: quick submission preview
idx_map = [list(classes).index(c) for c in submit_cols]
probs = test_lr_char18[:, idx_map]
eps = 1e-9
probs = np.clip(probs, eps, 1-eps)
probs = probs / probs.sum(axis=1, keepdims=True)
sub = pd.DataFrame(probs, columns=submit_cols)
sub.insert(0, 'id', test['id'].values)
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv (LR char 1-8 fast). Head:\n', sub.head(), flush=True)

[LR_char_1_8_fast] C=32.0
  [Fold 1] X_tr=(14096, 768122) vdim=768122
  [Fold 1] loss=0.38229 elapsed=19.34s
  [Fold 2] X_tr=(14097, 770255) vdim=770255
  [Fold 2] loss=0.37923 elapsed=18.41s
  [Fold 3] X_tr=(14097, 767867) vdim=767867
  [Fold 3] loss=0.39507 elapsed=18.41s
  [Fold 4] X_tr=(14097, 769029) vdim=769029
  [Fold 4] loss=0.38627 elapsed=18.34s
  [Fold 5] X_tr=(14097, 769390) vdim=769390
  [Fold 5] loss=0.37175 elapsed=17.77s
[LR_char_1_8_fast] OOF=0.38292; folds=[0.38229, 0.37923, 0.39507, 0.38627, 0.37175] total=92.78s
[LR_char_1_8_fast] OOF=0.38292 with C=32.0
Saved submission.csv (LR char 1-8 fast). Head:
         id       EAP       HPL       MWS
0  id27251  0.009801  0.973037  0.017163
1  id09612  0.149320  0.213424  0.637257
2  id11943  0.033257  0.010454  0.956289
3  id19526  0.007851  0.194168  0.797982
4  id12931  0.046046  0.045911  0.908043


In [56]:
# New base (fast): NB-SVM char presence (2,7), alpha=0.75, liblinear OvR per-class
import time, gc, numpy as np, pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

SEED=42; N_FOLDS=5
train = pd.read_csv('train.csv'); test = pd.read_csv('test.csv')
X_text = train['text'].astype(str).values
y = train['author'].values
X_test = test['text'].astype(str).values
classes = np.unique(y)
submit_cols = ['EAP','HPL','MWS']
assert set(classes) == set(submit_cols)

char_params = dict(analyzer='char', ngram_range=(2,7), min_df=2, lowercase=False,
                   strip_accents=None, binary=True, dtype=np.float32)

def _r_presence(X, yb, a=0.75):
    p = np.asarray(X[yb==1].sum(axis=0)).ravel() + a
    q = np.asarray(X[yb==0].sum(axis=0)).ravel() + a
    return np.log(p/q).astype(np.float32)

def _softmax(m):
    m = m - m.max(axis=1, keepdims=True)
    e = np.exp(m); return e / e.sum(axis=1, keepdims=True)

skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)
oof = np.zeros((len(train), len(classes)), np.float32)
test_pred = np.zeros((len(test), len(classes)), np.float32)
fold_losses = []
t0_all = time.time()
print('[NBSVM_char_presence_2_7] alpha=0.75 C=25.0', flush=True)
for fold, (tr, va) in enumerate(skf.split(X_text, y), 1):
    t0 = time.time()
    x_tr, x_va = X_text[tr], X_text[va]
    y_tr, y_va = y[tr], y[va]
    vc = CountVectorizer(**char_params)
    Xc_tr = vc.fit_transform(x_tr); Xc_va = vc.transform(x_va); Xc_te = vc.transform(X_test)
    print(f'  [Fold {fold}] X_tr={Xc_tr.shape}', flush=True)
    margins_va = np.zeros((len(va), len(classes)), np.float32)
    margins_te = np.zeros((len(test), len(classes)), np.float32)
    for ci, c in enumerate(classes):
        yb = (y_tr == c).astype(np.int8)
        r = _r_presence(Xc_tr, yb, a=0.75)
        Xr_tr = normalize(Xc_tr.multiply(r), norm='l2', axis=1, copy=False)
        Xr_va = normalize(Xc_va.multiply(r), norm='l2', axis=1, copy=False)
        Xr_te = normalize(Xc_te.multiply(r), norm='l2', axis=1, copy=False)
        clf = LogisticRegression(solver='liblinear', C=25.0, max_iter=2000, random_state=SEED)
        clf.fit(Xr_tr, yb)
        margins_va[:, ci] = clf.decision_function(Xr_va).astype(np.float32)
        margins_te[:, ci] = clf.decision_function(Xr_te).astype(np.float32)
        del yb, r, Xr_tr, Xr_va, Xr_te, clf; gc.collect()
    P_va = _softmax(margins_va).astype(np.float32)
    P_te = _softmax(margins_te).astype(np.float32)
    oof[va] = P_va
    test_pred += P_te / N_FOLDS
    loss = log_loss(y_va, P_va, labels=list(classes))
    fold_losses.append(loss)
    print(f'  [Fold {fold}] loss={loss:.5f} elapsed={time.time()-t0:.2f}s', flush=True)
    del Xc_tr, Xc_va, Xc_te, margins_va, margins_te, P_va, P_te, vc; gc.collect()

oof_loss = log_loss(y, oof, labels=list(classes))
print(f'[NBSVM_char_presence_2_7] OOF={oof_loss:.5f}; folds={np.round(fold_losses,5).tolist()} total={time.time()-t0_all:.2f}s', flush=True)
np.save('oof_nbsvm_char_2_7_presence.npy', oof)
np.save('test_nbsvm_char_2_7_presence.npy', test_pred)
print('Saved oof_nbsvm_char_2_7_presence.npy and test_nbsvm_char_2_7_presence.npy')

[NBSVM_char_presence_2_7] alpha=0.75 C=25.0
  [Fold 1] X_tr=(14096, 512978)
  [Fold 1] loss=0.37087 elapsed=15.68s
  [Fold 2] X_tr=(14097, 514363)
  [Fold 2] loss=0.37088 elapsed=15.37s
  [Fold 3] X_tr=(14097, 512905)
  [Fold 3] loss=0.37714 elapsed=16.20s
  [Fold 4] X_tr=(14097, 513617)
  [Fold 4] loss=0.35531 elapsed=16.69s
  [Fold 5] X_tr=(14097, 514414)
  [Fold 5] loss=0.35912 elapsed=16.83s
[NBSVM_char_presence_2_7] OOF=0.36667; folds=[0.37087, 0.37088, 0.37714, 0.35531, 0.35912] total=81.39s
Saved oof_nbsvm_char_2_7_presence.npy and test_nbsvm_char_2_7_presence.npy
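
The log-count ratio computed by `_r_presence` can be checked by hand on a toy matrix. The sketch below is a minimal dense NumPy version, assuming a binary presence matrix; the toy data and the name `log_count_ratio` are illustrative and not part of the notebook. Like the cell above, it skips the L1 normalization of the two count vectors that the classic NB-SVM formulation applies before taking the ratio.

```python
import numpy as np

def log_count_ratio(X, yb, a=0.75):
    """Smoothed log ratio of per-feature document counts: positive rows vs. the rest."""
    p = X[yb == 1].sum(axis=0) + a
    q = X[yb == 0].sum(axis=0) + a
    return np.log(p / q)

# Toy binary presence matrix: 4 documents x 3 features (illustrative data only).
X = np.array([[1, 0, 1],
              [1, 0, 0],
              [0, 1, 1],
              [0, 1, 0]], dtype=np.float32)
yb = np.array([1, 1, 0, 0])

r = log_count_ratio(X, yb)
# Feature 0 appears only in positive docs, feature 1 only in negative docs,
# and feature 2 appears equally often in both, so its ratio is 0.
print(np.round(r, 4))
```

Multiplying each column of the presence matrix by `r` therefore pushes class-specific n-grams away from zero and suppresses n-grams that are equally common on both sides.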


In [58]:
# New base: LinearSVC on char_wb(1,6) with sigmoid calibration (fast, diverse)
import time, gc, numpy as np, pandas as pd
from typing import Tuple
from scipy import sparse
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import log_loss

SEED=42; N_FOLDS=5
train = pd.read_csv('train.csv'); test = pd.read_csv('test.csv')
X_text = train['text'].astype(str).values
y = train['author'].values
X_test = test['text'].astype(str).values
classes = np.unique(y)
submit_cols = ['EAP','HPL','MWS']
assert set(classes) == set(submit_cols)

charwb_params = dict(analyzer='char_wb', ngram_range=(1,6), min_df=1, lowercase=False,
                     strip_accents=None, sublinear_tf=True, smooth_idf=True, norm='l2', dtype=np.float32)

def build_fold(x_tr, x_va, x_te) -> Tuple[sparse.csr_matrix, sparse.csr_matrix, sparse.csr_matrix, int]:
    v = TfidfVectorizer(**charwb_params)
    X_tr = v.fit_transform(x_tr); X_va = v.transform(x_va); X_te = v.transform(x_te)
    return X_tr, X_va, X_te, X_tr.shape[1]

skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)
oof = np.zeros((len(train), len(classes)), np.float32)
test_pred = np.zeros((len(test), len(classes)), np.float32)
fold_losses = []
t0_all = time.time()
C = 1.5
print(f'[CalibSVC_charwb_1_6_sig] C={C}', flush=True)
for fold, (tr, va) in enumerate(skf.split(X_text, y), 1):
    t0 = time.time()
    x_tr, x_va = X_text[tr], X_text[va]
    y_tr, y_va = y[tr], y[va]
    X_tr, X_va, X_te, vdim = build_fold(x_tr, x_va, X_test)
    print(f'  [Fold {fold}] X_tr={X_tr.shape} vdim={vdim}', flush=True)
    base = LinearSVC(C=C, tol=1e-4, max_iter=10000, random_state=SEED, dual=True)
    clf = CalibratedClassifierCV(estimator=base, method='sigmoid', cv=5, n_jobs=-1)
    clf.fit(X_tr, y_tr)
    P_va = clf.predict_proba(X_va).astype(np.float32)
    P_te = clf.predict_proba(X_te).astype(np.float32)
    # reorder to fixed class order
    order_idx = [list(clf.classes_).index(c) for c in classes]
    P_va = P_va[:, order_idx]; P_te = P_te[:, order_idx]
    # clip+renorm
    P_va = np.clip(P_va, 1e-12, 1-1e-12); P_va /= P_va.sum(axis=1, keepdims=True)
    P_te = np.clip(P_te, 1e-12, 1-1e-12); P_te /= P_te.sum(axis=1, keepdims=True)
    oof[va] = P_va
    test_pred += P_te / N_FOLDS
    loss = log_loss(y_va, P_va, labels=list(classes))
    fold_losses.append(loss)
    print(f'  [Fold {fold}] loss={loss:.5f} elapsed={time.time()-t0:.2f}s', flush=True)
    del X_tr, X_va, X_te, P_va, P_te, clf, base; gc.collect()

oof_loss = log_loss(y, oof, labels=list(classes))
print(f'[CalibSVC_charwb_1_6_sig] OOF={oof_loss:.5f}; folds={np.round(fold_losses,5).tolist()} total={time.time()-t0_all:.2f}s', flush=True)
np.save('oof_svc_charwb_1_6_sig.npy', oof)
np.save('test_svc_charwb_1_6_sig.npy', test_pred)
print('Saved oof_svc_charwb_1_6_sig.npy and test_svc_charwb_1_6_sig.npy')

[CalibSVC_charwb_1_6_sig] C=1.5
  [Fold 1] X_tr=(14096, 184581) vdim=184581
  [Fold 1] loss=0.40118 elapsed=5.93s
  [Fold 2] X_tr=(14097, 185066) vdim=185066
  [Fold 2] loss=0.39867 elapsed=5.64s
  [Fold 3] X_tr=(14097, 185603) vdim=185603
  [Fold 3] loss=0.41855 elapsed=5.64s
  [Fold 4] X_tr=(14097, 184468) vdim=184468
  [Fold 4] loss=0.40914 elapsed=5.64s
  [Fold 5] X_tr=(14097, 185025) vdim=185025
  [Fold 5] loss=0.40124 elapsed=5.55s
[CalibSVC_charwb_1_6_sig] OOF=0.40576; folds=[0.40118, 0.39867, 0.41855, 0.40914, 0.40124] total=28.94s
Saved oof_svc_charwb_1_6_sig.npy and test_svc_charwb_1_6_sig.npy
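
The "reorder to fixed class order" and "clip+renorm" steps in the cell above can be isolated into one helper. This is a minimal sketch, assuming a probability matrix whose columns follow the model's own class order; the helper name `align_and_clip` and the toy inputs are hypothetical.

```python
import numpy as np

def align_and_clip(P, model_classes, target_classes, eps=1e-12):
    """Reorder probability columns to a fixed class order, then clip and renormalize rows."""
    order_idx = [list(model_classes).index(c) for c in target_classes]
    P = np.asarray(P, dtype=np.float64)[:, order_idx]
    P = np.clip(P, eps, 1 - eps)
    return P / P.sum(axis=1, keepdims=True)

# Hypothetical model output whose column order differs from the submission order.
model_classes = ['MWS', 'EAP', 'HPL']
target_classes = ['EAP', 'HPL', 'MWS']
P_raw = np.array([[0.0, 1.0, 0.0],
                  [0.2, 0.3, 0.5]])

P = align_and_clip(P_raw, model_classes, target_classes)
print(P)
```

The sketch works in float64. In float32, `1 - 1e-12` rounds to 1.0, so the upper clip has no effect there; the lower clip is what keeps the log loss finite when a model assigns an exact zero to the true class.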


In [60]:
# New base: Calibrated LinearSVC on word unigrams (case-sensitive) with isotonic calibration
import time, gc, numpy as np, pandas as pd
from typing import Tuple
from scipy import sparse
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import log_loss

SEED=42; N_FOLDS=5
train = pd.read_csv('train.csv'); test = pd.read_csv('test.csv')
X_text = train['text'].astype(str).values
y = train['author'].values
X_test = test['text'].astype(str).values
classes = np.unique(y)
submit_cols = ['EAP','HPL','MWS']
assert set(classes) == set(submit_cols)

word_params = dict(analyzer='word', ngram_range=(1,1), min_df=1, max_df=0.95, lowercase=False,
                   strip_accents=None, token_pattern=r"(?u)\b[-\w']+\b",
                   sublinear_tf=True, smooth_idf=True, norm='l2', dtype=np.float32)

def build_fold(x_tr, x_va, x_te) -> Tuple[sparse.csr_matrix, sparse.csr_matrix, sparse.csr_matrix, int]:
    v = TfidfVectorizer(**word_params)
    X_tr = v.fit_transform(x_tr); X_va = v.transform(x_va); X_te = v.transform(x_te)
    return X_tr, X_va, X_te, X_tr.shape[1]

skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)
oof = np.zeros((len(train), len(classes)), np.float32)
test_pred = np.zeros((len(test), len(classes)), np.float32)
fold_losses = []
t0_all = time.time()
C = 2.0
print(f'[CalibSVC_word_uni_iso] C={C}', flush=True)
for fold, (tr, va) in enumerate(skf.split(X_text, y), 1):
    t0 = time.time()
    x_tr, x_va = X_text[tr], X_text[va]
    y_tr, y_va = y[tr], y[va]
    X_tr, X_va, X_te, vdim = build_fold(x_tr, x_va, X_test)
    print(f'  [Fold {fold}] X_tr={X_tr.shape} vdim={vdim}', flush=True)
    base = LinearSVC(C=C, tol=1e-4, max_iter=3000, random_state=SEED, dual=True)
    clf = CalibratedClassifierCV(estimator=base, method='isotonic', cv=5, n_jobs=-1)
    clf.fit(X_tr, y_tr)
    P_va = clf.predict_proba(X_va).astype(np.float32)
    P_te = clf.predict_proba(X_te).astype(np.float32)
    order_idx = [list(clf.classes_).index(c) for c in classes]
    P_va = P_va[:, order_idx]; P_te = P_te[:, order_idx]
    P_va = np.clip(P_va, 1e-12, 1-1e-12); P_va /= P_va.sum(axis=1, keepdims=True)
    P_te = np.clip(P_te, 1e-12, 1-1e-12); P_te /= P_te.sum(axis=1, keepdims=True)
    oof[va] = P_va
    test_pred += P_te / N_FOLDS
    loss = log_loss(y_va, P_va, labels=list(classes))
    fold_losses.append(loss)
    print(f'  [Fold {fold}] loss={loss:.5f} elapsed={time.time()-t0:.2f}s', flush=True)
    del X_tr, X_va, X_te, P_va, P_te, clf, base; gc.collect()

oof_loss = log_loss(y, oof, labels=list(classes))
print(f'[CalibSVC_word_uni_iso] OOF={oof_loss:.5f}; folds={np.round(fold_losses,5).tolist()} total={time.time()-t0_all:.2f}s', flush=True)
np.save('oof_svc_word_uni_iso.npy', oof)
np.save('test_svc_word_uni_iso.npy', test_pred)
print('Saved oof_svc_word_uni_iso.npy and test_svc_word_uni_iso.npy')

[CalibSVC_word_uni_iso] C=2.0
  [Fold 1] X_tr=(14096, 24405) vdim=24405
  [Fold 1] loss=0.45176 elapsed=1.06s
  [Fold 2] X_tr=(14097, 24488) vdim=24488
  [Fold 2] loss=0.44446 elapsed=1.07s
  [Fold 3] X_tr=(14097, 24624) vdim=24624
  [Fold 3] loss=0.46200 elapsed=1.04s
  [Fold 4] X_tr=(14097, 24443) vdim=24443
  [Fold 4] loss=0.43980 elapsed=0.55s
  [Fold 5] X_tr=(14097, 24594) vdim=24594
  [Fold 5] loss=0.44370 elapsed=0.55s
[CalibSVC_word_uni_iso] OOF=0.44834; folds=[0.45176, 0.44446, 0.462, 0.4398, 0.4437] total=4.80s
Saved oof_svc_word_uni_iso.npy and test_svc_word_uni_iso.npy
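
To see what the custom `token_pattern` in the cell above keeps, the same regular expression can be run directly with the standard `re` module. This is a small sketch; the sample sentence is made up for illustration. Compared with scikit-learn's default pattern `(?u)\b\w\w+\b`, this one keeps hyphens and apostrophes inside a token and also keeps single-character tokens.

```python
import re

# Same pattern string as `token_pattern` in the cell above.
TOKEN_RE = re.compile(r"(?u)\b[-\w']+\b")

sample = "The well-known Count didn't answer; 'twas MERELY the wind."
tokens = TOKEN_RE.findall(sample)
print(tokens)
```

A leading apostrophe is dropped (`'twas` becomes `twas`) because the pattern must start at a word boundary, and case is preserved here because the cell sets `lowercase=False`.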


In [64]:
# New base (fast): SGD hinge on char counts (3,7) + sigmoid calibration
import time, gc, numpy as np, pandas as pd
from typing import Tuple
from scipy import sparse
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import log_loss

SEED=42; N_FOLDS=5
train = pd.read_csv('train.csv'); test = pd.read_csv('test.csv')
X_text = train['text'].astype(str).values
y = train['author'].values
X_test = test['text'].astype(str).values
classes = np.unique(y)
submit_cols = ['EAP','HPL','MWS']
assert set(classes) == set(submit_cols)

char_params = dict(analyzer='char', ngram_range=(3,7), min_df=2, lowercase=False,
                   strip_accents=None, binary=False, dtype=np.float32)

def build_fold(x_tr, x_va, x_te) -> Tuple[sparse.csr_matrix, sparse.csr_matrix, sparse.csr_matrix, int]:
    v = CountVectorizer(**char_params)
    X_tr = v.fit_transform(x_tr); X_va = v.transform(x_va); X_te = v.transform(x_te)
    return X_tr, X_va, X_te, X_tr.shape[1]

skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)
oof = np.zeros((len(train), len(classes)), np.float32)
test_pred = np.zeros((len(test), len(classes)), np.float32)
fold_losses = []
t0_all = time.time()
alpha = 1e-5
print(f'[SGD_char_3_7_hinge_sig] alpha={alpha}', flush=True)
for fold, (tr, va) in enumerate(skf.split(X_text, y), 1):
    t0 = time.time()
    x_tr, x_va = X_text[tr], X_text[va]
    y_tr, y_va = y[tr], y[va]
    X_tr, X_va, X_te, vdim = build_fold(x_tr, x_va, X_test)
    print(f'  [Fold {fold}] X_tr={X_tr.shape} vdim={vdim}', flush=True)
    base = SGDClassifier(loss='hinge', penalty='l2', alpha=alpha, max_iter=2000, tol=1e-4,
                         early_stopping=True, validation_fraction=0.1, n_iter_no_change=5,
                         random_state=SEED)
    clf = CalibratedClassifierCV(estimator=base, method='sigmoid', cv=3, n_jobs=-1)
    clf.fit(X_tr, y_tr)
    P_va = clf.predict_proba(X_va).astype(np.float32)
    P_te = clf.predict_proba(X_te).astype(np.float32)
    # reorder to fixed class order
    order_idx = [list(clf.classes_).index(c) for c in classes]
    P_va = P_va[:, order_idx]; P_te = P_te[:, order_idx]
    # clip+renorm
    P_va = np.clip(P_va, 1e-12, 1-1e-12); P_va /= P_va.sum(axis=1, keepdims=True)
    P_te = np.clip(P_te, 1e-12, 1-1e-12); P_te /= P_te.sum(axis=1, keepdims=True)
    oof[va] = P_va
    test_pred += P_te / N_FOLDS
    loss = log_loss(y_va, P_va, labels=list(classes))
    fold_losses.append(loss)
    print(f'  [Fold {fold}] loss={loss:.5f} elapsed={time.time()-t0:.2f}s', flush=True)
    del X_tr, X_va, X_te, P_va, P_te, clf, base; gc.collect()

oof_loss = log_loss(y, oof, labels=list(classes))
print(f'[SGD_char_3_7_hinge_sig] OOF={oof_loss:.5f}; folds={np.round(fold_losses,5).tolist()} total={time.time()-t0_all:.2f}s', flush=True)
np.save('oof_sgd_char_3_7_hinge_sig.npy', oof)
np.save('test_sgd_char_3_7_hinge_sig.npy', test_pred)
print('Saved oof_sgd_char_3_7_hinge_sig.npy and test_sgd_char_3_7_hinge_sig.npy')

[SGD_char_3_7_hinge_sig] alpha=1e-05
  [Fold 1] X_tr=(14096, 511677) vdim=511677
  [Fold 1] loss=0.46576 elapsed=7.86s
  [Fold 2] X_tr=(14097, 513064) vdim=513064
  [Fold 2] loss=0.47545 elapsed=7.40s
  [Fold 3] X_tr=(14097, 511604) vdim=511604
  [Fold 3] loss=0.47131 elapsed=7.58s
  [Fold 4] X_tr=(14097, 512304) vdim=512304
  [Fold 4] loss=0.47301 elapsed=7.48s
  [Fold 5] X_tr=(14097, 513095) vdim=513095
  [Fold 5] loss=0.45768 elapsed=7.51s
[SGD_char_3_7_hinge_sig] OOF=0.46864; folds=[0.46576, 0.47545, 0.47131, 0.47301, 0.45768] total=38.36s
Saved oof_sgd_char_3_7_hinge_sig.npy and test_sgd_char_3_7_hinge_sig.npy
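
Every cell in this section shares the same bookkeeping: each training row receives exactly one out-of-fold prediction, and the test prediction is the mean of the per-fold test predictions. The sketch below isolates that pattern under simplifying assumptions: a plain shuffled K-fold split stands in for `StratifiedKFold`, and a class-prior predictor stands in for the real model. All sizes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
n_train, n_test, n_classes, n_folds = 20, 5, 3, 5
y = rng.integers(0, n_classes, size=n_train)

# Plain shuffled K-fold split (a stand-in for StratifiedKFold, to stay NumPy-only).
perm = rng.permutation(n_train)
folds = np.array_split(perm, n_folds)

oof = np.zeros((n_train, n_classes))
test_pred = np.zeros((n_test, n_classes))
filled = np.zeros(n_train, dtype=int)  # how many times each row received an OOF prediction

for va in folds:
    tr = np.setdiff1d(perm, va)
    # Dummy "model": predict the smoothed class frequencies of the training part.
    prior = (np.bincount(y[tr], minlength=n_classes) + 1.0) / (len(tr) + n_classes)
    oof[va] = prior      # validation rows are scored by a model that never saw them
    filled[va] += 1
    test_pred += np.tile(prior, (n_test, 1)) / n_folds  # running mean over folds

print(filled.min(), filled.max())
print(test_pred.sum(axis=1))
```

Because each fold contributes a valid probability row weighted by `1 / n_folds`, the averaged test rows still sum to 1 without any extra renormalization.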


In [67]:
# New base: LR char_wb(4,8) multinomial (expected strong/diverse, relatively fast)
import time, gc, numpy as np, pandas as pd
from typing import List, Tuple, Dict
from scipy import sparse
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

SEED = 42
N_FOLDS = 5
np.random.seed(SEED)

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
X_text = train['text'].astype(str).values
y = train['author'].values
X_test_text = test['text'].astype(str).values
classes = np.unique(y)
submit_cols = ['EAP','HPL','MWS']
assert set(classes) == set(submit_cols), f"Classes mismatch: {classes}"

skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)

# Boundary-aware character n-grams; fewer features than plain char(1,8)
charwb_params = dict(analyzer='char_wb', ngram_range=(4,8), min_df=1, lowercase=True,
                     strip_accents='unicode', sublinear_tf=True, smooth_idf=True, norm='l2', dtype=np.float32)

def build_charwb_fold(x_tr, x_val, x_test) -> Tuple[sparse.csr_matrix, sparse.csr_matrix, sparse.csr_matrix, int]:
    v = TfidfVectorizer(**charwb_params)
    X_tr = v.fit_transform(x_tr)
    X_val = v.transform(x_val)
    X_te  = v.transform(x_test)
    return X_tr, X_val, X_te, X_tr.shape[1]

def cv_charwb_lr_strong(C: float = 20.0, name: str='LR_charwb_4_8_strong') -> Tuple[np.ndarray, np.ndarray, float, Dict]:
    oof = np.zeros((len(train), len(classes)), dtype=np.float32)
    test_pred = np.zeros((len(test), len(classes)), dtype=np.float32)
    fold_losses = []
    t0_all = time.time()
    print(f"[{name}] C={C}", flush=True)
    for fold, (tr_idx, val_idx) in enumerate(skf.split(X_text, y), 1):
        t0 = time.time()
        x_tr, x_val = X_text[tr_idx], X_text[val_idx]
        y_tr, y_val = y[tr_idx], y[val_idx]
        X_tr, X_val, X_te, vdim = build_charwb_fold(x_tr, x_val, X_test_text)
        print(f"  [Fold {fold}] X_tr={X_tr.shape} vdim={vdim}", flush=True)
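        # Note: `multi_class` is deprecated in scikit-learn 1.5+, where lbfgs already
        # fits a multinomial model for multiclass targets by default.
        clf = LogisticRegression(solver='lbfgs', multi_class='multinomial', penalty='l2',
                                 C=C, max_iter=10000, tol=1e-4, n_jobs=-1, random_state=SEED)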
        clf.fit(X_tr, y_tr)
        proba_val = clf.predict_proba(X_val).astype(np.float32)
        proba_test = clf.predict_proba(X_te).astype(np.float32)
        order_idx = [list(clf.classes_).index(c) for c in classes]
        proba_val = proba_val[:, order_idx]
        proba_test = proba_test[:, order_idx]
        oof[val_idx] = proba_val
        test_pred += proba_test / N_FOLDS
        loss = log_loss(y_val, proba_val, labels=list(classes))
        fold_losses.append(loss)
        print(f"  [Fold {fold}] loss={loss:.5f} elapsed={time.time()-t0:.2f}s", flush=True)
        del X_tr, X_val, X_te, proba_val, proba_test, clf
        gc.collect()
    oof_loss = log_loss(y, oof, labels=list(classes))
    print(f"[{name}] OOF={oof_loss:.5f}; folds={np.round(fold_losses,5).tolist()} total={time.time()-t0_all:.2f}s", flush=True)
    meta = dict(C=C, folds=fold_losses, oof=oof_loss)
    return oof, test_pred, oof_loss, meta

# Single strong C, per expert advice; adjust if needed
oof_lr_charwb_48, test_lr_charwb_48, loss_lr_charwb_48, meta_lr_charwb_48 = cv_charwb_lr_strong(C=20.0, name='LR_charwb_4_8_strong')
np.save('oof_lr_charwb_4_8.npy', oof_lr_charwb_48); np.save('test_lr_charwb_4_8.npy', test_lr_charwb_48)
print(f"[LR_charwb_4_8_strong] OOF={loss_lr_charwb_48:.5f} with C={meta_lr_charwb_48['C']}", flush=True)

[LR_charwb_4_8_strong] C=20.0
  [Fold 1] X_tr=(14096, 307266) vdim=307266
  [Fold 1] loss=0.41165 elapsed=10.95s
  [Fold 2] X_tr=(14097, 308405) vdim=308405
  [Fold 2] loss=0.40260 elapsed=8.15s
  [Fold 3] X_tr=(14097, 308856) vdim=308856
  [Fold 3] loss=0.41427 elapsed=11.23s
  [Fold 4] X_tr=(14097, 307582) vdim=307582
  [Fold 4] loss=0.41366 elapsed=10.52s
  [Fold 5] X_tr=(14097, 308469) vdim=308469
  [Fold 5] loss=0.40829 elapsed=11.87s
[LR_charwb_4_8_strong] OOF=0.41009; folds=[0.41165, 0.4026, 0.41427, 0.41366, 0.40829] total=53.26s
[LR_charwb_4_8_strong] OOF=0.41009 with C=20.0
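
The difference between the `char` and `char_wb` analyzers can be illustrated in pure Python. This is a simplified approximation, not scikit-learn's implementation: it handles a single n-gram length and emits nothing for words shorter than `n`. The function names and the sample text are hypothetical.

```python
def char_ngrams(text, n):
    """All contiguous character n-grams of the full string (analyzer='char' style)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def char_wb_ngrams(text, n):
    """n-grams taken inside each space-padded word only (analyzer='char_wb' style)."""
    grams = []
    for word in text.split():
        padded = " " + word + " "
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams

text = "the raven"
plain = char_ngrams(text, 4)
bounded = char_wb_ngrams(text, 4)
print(plain)    # includes n-grams that straddle the space between the two words
print(bounded)  # every n-gram stays inside one padded word
```

On a real corpus the cross-word n-grams are what inflate the plain `char` vocabulary, which is consistent with the smaller `vdim` values logged for the `char_wb` models in this section.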


In [69]:
# New base (fast): NB-SVM char presence (1,5), lowercase=True, alpha=1.0, C=50
import time, gc, numpy as np, pandas as pd
from scipy import sparse
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

SEED=42; N_FOLDS=5
train = pd.read_csv('train.csv'); test = pd.read_csv('test.csv')
X_text = train['text'].astype(str).values
y = train['author'].values
X_test = test['text'].astype(str).values
classes = np.unique(y)
submit_cols = ['EAP','HPL','MWS']
assert set(classes) == set(submit_cols)

char_params = dict(analyzer='char', ngram_range=(1,5), min_df=1, lowercase=True,
                   strip_accents='unicode', binary=True, dtype=np.float32)

def _r_presence(X, yb, a=1.0):
    p = np.asarray(X[yb==1].sum(axis=0)).ravel() + a
    q = np.asarray(X[yb==0].sum(axis=0)).ravel() + a
    return np.log(p/q).astype(np.float32)

def _softmax(m):
    m = m - m.max(axis=1, keepdims=True)
    e = np.exp(m); return e / e.sum(axis=1, keepdims=True)

skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)
oof = np.zeros((len(train), len(classes)), np.float32)
test_pred = np.zeros((len(test), len(classes)), np.float32)
fold_losses = []
t0_all = time.time()
print('[NBSVM_char_presence_1_5_lc] alpha=1.0 C=50.0', flush=True)
for fold, (tr, va) in enumerate(skf.split(X_text, y), 1):
    t0 = time.time()
    x_tr, x_va = X_text[tr], X_text[va]
    y_tr, y_va = y[tr], y[va]
    vc = CountVectorizer(**char_params)
    Xc_tr = vc.fit_transform(x_tr); Xc_va = vc.transform(x_va); Xc_te = vc.transform(X_test)
    print(f'  [Fold {fold}] X_tr={Xc_tr.shape}', flush=True)
    margins_va = np.zeros((len(va), len(classes)), np.float32)
    margins_te = np.zeros((len(test), len(classes)), np.float32)
    for ci, c in enumerate(classes):
        yb = (y_tr == c).astype(np.int8)
        r = _r_presence(Xc_tr, yb, a=1.0)
        Xr_tr = normalize(Xc_tr.multiply(r), norm='l2', axis=1, copy=False)
        Xr_va = normalize(Xc_va.multiply(r), norm='l2', axis=1, copy=False)
        Xr_te = normalize(Xc_te.multiply(r), norm='l2', axis=1, copy=False)
        clf = LogisticRegression(solver='liblinear', C=50.0, max_iter=2000, random_state=SEED)
        clf.fit(Xr_tr, yb)
        margins_va[:, ci] = clf.decision_function(Xr_va).astype(np.float32)
        margins_te[:, ci] = clf.decision_function(Xr_te).astype(np.float32)
        del yb, r, Xr_tr, Xr_va, Xr_te, clf; gc.collect()
    P_va = _softmax(margins_va).astype(np.float32)
    P_te = _softmax(margins_te).astype(np.float32)
    oof[va] = P_va
    test_pred += P_te / N_FOLDS
    loss = log_loss(y_va, P_va, labels=list(classes))
    fold_losses.append(loss)
    print(f'  [Fold {fold}] loss={loss:.5f} elapsed={time.time()-t0:.2f}s', flush=True)
    del Xc_tr, Xc_va, Xc_te, margins_va, margins_te, P_va, P_te, vc; gc.collect()

oof_loss = log_loss(y, oof, labels=list(classes))
print(f'[NBSVM_char_presence_1_5_lc] OOF={oof_loss:.5f}; folds={np.round(fold_losses,5).tolist()} total={time.time()-t0_all:.2f}s', flush=True)
np.save('oof_nbsvm_char_1_5_presence_lc.npy', oof)
np.save('test_nbsvm_char_1_5_presence_lc.npy', test_pred)
print('Saved oof_nbsvm_char_1_5_presence_lc.npy and test_nbsvm_char_1_5_presence_lc.npy')

[NBSVM_char_presence_1_5_lc] alpha=1.0 C=50.0
  [Fold 1] X_tr=(14096, 175400)
  [Fold 1] loss=0.45363 elapsed=11.60s
  [Fold 2] X_tr=(14097, 175847)
  [Fold 2] loss=0.44355 elapsed=11.48s
  [Fold 3] X_tr=(14097, 176099)
  [Fold 3] loss=0.45335 elapsed=10.49s
  [Fold 4] X_tr=(14097, 175579)
  [Fold 4] loss=0.44176 elapsed=10.99s
  [Fold 5] X_tr=(14097, 175671)
  [Fold 5] loss=0.44731 elapsed=10.63s
[NBSVM_char_presence_1_5_lc] OOF=0.44792; folds=[0.45363, 0.44355, 0.45335, 0.44176, 0.44731] total=55.69s
Saved oof_nbsvm_char_1_5_presence_lc.npy and test_nbsvm_char_1_5_presence_lc.npy
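
The NB-SVM cells turn three independent one-vs-rest margins into a probability row with `_softmax`. The sketch below reproduces that step on made-up margins and shows why the row maximum is subtracted first; the name `softmax_rows` and the input values are illustrative.

```python
import numpy as np

def softmax_rows(m):
    """Row-wise softmax with the max subtracted first for numerical stability."""
    m = m - m.max(axis=1, keepdims=True)
    e = np.exp(m)
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical one-vs-rest margins for two rows and three classes.
margins = np.array([[2.0, -1.0, -3.0],
                    [1000.0, 999.0, 0.0]])  # large values would overflow a naive exp()

P = softmax_rows(margins)
print(np.round(P, 4))
```

Note that a softmax over separately trained binary margins is a normalization heuristic rather than a calibrated probability, so the sharpness of the resulting rows depends on the scale of the margins (and hence on `C`).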


In [70]:
# Hero base: LR char(1,8) multinomial, min_df=1, lowercase=False, sublinear_tf=True (saga), single strong C
import time, gc, numpy as np, pandas as pd
from typing import Tuple
from scipy import sparse
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

SEED = 42; N_FOLDS = 5
train = pd.read_csv('train.csv'); test = pd.read_csv('test.csv')
X_text = train['text'].astype(str).values
y = train['author'].values
X_test = test['text'].astype(str).values
classes = np.unique(y)
submit_cols = ['EAP','HPL','MWS']
assert set(classes) == set(submit_cols)

char_params = dict(analyzer='char', ngram_range=(1,8), min_df=1, lowercase=False,
                   strip_accents=None, sublinear_tf=True, smooth_idf=True, norm='l2', dtype=np.float32)

def build_fold(x_tr, x_va, x_te) -> Tuple[sparse.csr_matrix, sparse.csr_matrix, sparse.csr_matrix, int]:
    v = TfidfVectorizer(**char_params)
    X_tr = v.fit_transform(x_tr); X_va = v.transform(x_va); X_te = v.transform(x_te)
    return X_tr, X_va, X_te, X_tr.shape[1]

skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)
oof = np.zeros((len(train), len(classes)), np.float32)
test_pred = np.zeros((len(test), len(classes)), np.float32)
fold_losses = []
t0_all = time.time()
C = 40.0
print(f"[LR_char_1_8_hero] C={C}", flush=True)
for fold, (tr, va) in enumerate(skf.split(X_text, y), 1):
    t0 = time.time()
    x_tr, x_va = X_text[tr], X_text[va]
    y_tr, y_va = y[tr], y[va]
    X_tr, X_va, X_te, vdim = build_fold(x_tr, x_va, X_test)
    print(f"  [Fold {fold}] X_tr={X_tr.shape} vdim={vdim}", flush=True)
    clf = LogisticRegression(solver='saga', multi_class='multinomial', penalty='l2',
                             C=C, max_iter=15000, tol=1e-4, n_jobs=-1, random_state=SEED)
    clf.fit(X_tr, y_tr)
    P_va = clf.predict_proba(X_va).astype(np.float32)
    P_te = clf.predict_proba(X_te).astype(np.float32)
    order_idx = [list(clf.classes_).index(c) for c in classes]
    P_va = P_va[:, order_idx]; P_te = P_te[:, order_idx]
    oof[va] = P_va
    test_pred += P_te / N_FOLDS
    loss = log_loss(y_va, P_va, labels=list(classes))
    fold_losses.append(loss)
    print(f"  [Fold {fold}] loss={loss:.5f} elapsed={time.time()-t0:.2f}s", flush=True)
    del X_tr, X_va, X_te, P_va, P_te, clf; gc.collect()

oof_loss = log_loss(y, oof, labels=list(classes))
print(f"[LR_char_1_8_hero] OOF={oof_loss:.5f}; folds={np.round(fold_losses,5).tolist()} total={time.time()-t0_all:.2f}s", flush=True)
np.save('oof_lr_char_1_8_hero.npy', oof)
np.save('test_lr_char_1_8_hero.npy', test_pred)
print('Saved oof_lr_char_1_8_hero.npy and test_lr_char_1_8_hero.npy', flush=True)

[LR_char_1_8_hero] C=40.0
  [Fold 1] X_tr=(14096, 1833412) vdim=1833412
  [Fold 1] loss=0.37428 elapsed=132.07s
  [Fold 2] X_tr=(14097, 1841692) vdim=1841692
  [Fold 2] loss=0.36944 elapsed=154.61s
  [Fold 3] X_tr=(14097, 1841174) vdim=1841174
  [Fold 3] loss=0.38547 elapsed=176.34s
  [Fold 4] X_tr=(14097, 1837079) vdim=1837079
  [Fold 4] loss=0.37512 elapsed=124.40s
  [Fold 5] X_tr=(14097, 1837993) vdim=1837993
  [Fold 5] loss=0.36295 elapsed=170.56s
[LR_char_1_8_hero] OOF=0.37345; folds=[0.37428, 0.36944, 0.38547, 0.37512, 0.36295] total=758.57s
Saved oof_lr_char_1_8_hero.npy and test_lr_char_1_8_hero.npy


In [72]:
# A1) NBSVM word-only presence (1,3), alpha=0.75, liblinear C=25; save OOF/TEST
import time, gc, numpy as np, pandas as pd
from scipy import sparse
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

SEED=42; N_FOLDS=5
train = pd.read_csv('train.csv'); test = pd.read_csv('test.csv')
X_text = train['text'].astype(str).values
y = train['author'].values
X_test = test['text'].astype(str).values
classes = np.unique(y)
submit_cols = ['EAP','HPL','MWS']
assert set(classes)==set(submit_cols)

vec_params = dict(analyzer='word', ngram_range=(1,3), min_df=2, max_df=1.0, lowercase=True,
                  strip_accents='unicode', token_pattern=r"(?u)\b[-\w']+\b", binary=True, dtype=np.float32)

def _r_presence(X, yb, a=0.75):
    p = np.asarray(X[yb==1].sum(axis=0)).ravel() + a
    q = np.asarray(X[yb==0].sum(axis=0)).ravel() + a
    return np.log(p/q).astype(np.float32)

def _softmax(m):
    m = m - m.max(axis=1, keepdims=True)
    e = np.exp(m); return e / e.sum(axis=1, keepdims=True)

skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)
oof = np.zeros((len(train), len(classes)), np.float32)
test_pred = np.zeros((len(test), len(classes)), np.float32)
fold_losses = []
t0_all = time.time()
print('[NBSVM_word_1_3_presence] alpha=0.75 C=25', flush=True)
for fold, (tr, va) in enumerate(skf.split(X_text, y), 1):
    t0 = time.time()
    x_tr, x_va = X_text[tr], X_text[va]
    y_tr, y_va = y[tr], y[va]
    vw = CountVectorizer(**vec_params)
    Xw_tr = vw.fit_transform(x_tr); Xw_va = vw.transform(x_va); Xw_te = vw.transform(X_test)
    print(f'  [Fold {fold}] X_tr={Xw_tr.shape}', flush=True)
    margins_va = np.zeros((len(va), len(classes)), np.float32)
    margins_te = np.zeros((len(test), len(classes)), np.float32)
    for ci, c in enumerate(classes):
        yb = (y_tr == c).astype(np.int8)
        r = _r_presence(Xw_tr, yb, a=0.75)
        Xr_tr = normalize(Xw_tr.multiply(r), norm='l2', axis=1, copy=False)
        Xr_va = normalize(Xw_va.multiply(r), norm='l2', axis=1, copy=False)
        Xr_te = normalize(Xw_te.multiply(r), norm='l2', axis=1, copy=False)
        clf = LogisticRegression(solver='liblinear', C=25.0, max_iter=2000, random_state=SEED)
        clf.fit(Xr_tr, yb)
        margins_va[:, ci] = clf.decision_function(Xr_va).astype(np.float32)
        margins_te[:, ci] = clf.decision_function(Xr_te).astype(np.float32)
        del yb, r, Xr_tr, Xr_va, Xr_te, clf; gc.collect()
    P_va = _softmax(margins_va).astype(np.float32)
    P_te = _softmax(margins_te).astype(np.float32)
    oof[va] = P_va
    test_pred += P_te / N_FOLDS
    loss = log_loss(y_va, P_va, labels=list(classes))
    fold_losses.append(loss)
    print(f'  [Fold {fold}] loss={loss:.5f} elapsed={time.time()-t0:.2f}s', flush=True)
    del Xw_tr, Xw_va, Xw_te, margins_va, margins_te, P_va, P_te, vw; gc.collect()

oof_loss = log_loss(y, oof, labels=list(classes))
print(f'[NBSVM_word_1_3_presence] OOF={oof_loss:.5f}; folds={np.round(fold_losses,5).tolist()} total={time.time()-t0_all:.2f}s', flush=True)
np.save('oof_nbsvm_word_1_3_presence.npy', oof)
np.save('test_nbsvm_word_1_3_presence.npy', test_pred)
print('Saved oof_nbsvm_word_1_3_presence.npy and test_nbsvm_word_1_3_presence.npy', flush=True)

[NBSVM_word_1_3_presence] alpha=0.75 C=25
  [Fold 1] X_tr=(14096, 75387)
  [Fold 1] loss=0.44227 elapsed=2.03s
  [Fold 2] X_tr=(14097, 75665)
  [Fold 2] loss=0.43865 elapsed=1.95s
  [Fold 3] loss=0.46112 elapsed=1.97s
  [Fold 4] X_tr=(14097, 75533)
  [Fold 4] loss=0.43584 elapsed=2.08s
  [Fold 5] X_tr=(14097, 75229)
  [Fold 5] loss=0.44982 elapsed=1.95s
[NBSVM_word_1_3_presence] OOF=0.44554; folds=[0.44227, 0.43865, 0.46112, 0.43584, 0.44982] total=10.42s
Saved oof_nbsvm_word_1_3_presence.npy and test_nbsvm_word_1_3_presence.npy
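
The saved OOF and test arrays exist so that models can later be blended. The following is a minimal sketch of a two-model weighted average, assuming synthetic stand-ins for the saved arrays; in the notebook the inputs would come from `np.load` on the `.npy` files and from integer-encoded labels. The helper names and the single-weight grid are hypothetical choices, not the notebook's blending method.

```python
import numpy as np

def multiclass_log_loss(y_idx, P, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    P = np.clip(P, eps, 1.0)
    return float(-np.mean(np.log(P[np.arange(len(y_idx)), y_idx])))

rng = np.random.default_rng(0)
n, k = 300, 3
y_idx = rng.integers(0, k, size=n)

def noisy_oof(strength):
    """Synthetic stand-in for a saved OOF array: true-class signal plus noise."""
    logits = rng.normal(size=(n, k))
    logits[np.arange(n), y_idx] += strength
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

oof_a, oof_b = noisy_oof(1.5), noisy_oof(1.0)

# Scan a single blend weight w in [0, 1]; w=0 and w=1 recover the individual models.
weights = np.linspace(0.0, 1.0, 21)
losses = [multiclass_log_loss(y_idx, w * oof_a + (1 - w) * oof_b) for w in weights]
best_w = float(weights[int(np.argmin(losses))])
print(best_w, min(losses), losses[0], losses[-1])
```

Since the weight is selected on the same OOF predictions it is scored on, the blended OOF loss is slightly optimistic; the weight found this way would then be applied unchanged to the averaged test predictions.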


In [73]:
# A2) NBSVM char_wb presence (2,7), alpha=0.75, liblinear C=30; save OOF/TEST
import time, gc, numpy as np, pandas as pd
from scipy import sparse
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

SEED=42; N_FOLDS=5
train = pd.read_csv('train.csv'); test = pd.read_csv('test.csv')
X_text = train['text'].astype(str).values
y = train['author'].values
X_test = test['text'].astype(str).values
classes = np.unique(y)
submit_cols = ['EAP','HPL','MWS']
assert set(classes)==set(submit_cols)

vec_params = dict(analyzer='char_wb', ngram_range=(2,7), min_df=1, lowercase=False,
                  strip_accents=None, binary=True, dtype=np.float32)

def _r_presence(X, yb, a=0.75):
    p = np.asarray(X[yb==1].sum(axis=0)).ravel() + a
    q = np.asarray(X[yb==0].sum(axis=0)).ravel() + a
    return np.log(p/q).astype(np.float32)

def _softmax(m):
    m = m - m.max(axis=1, keepdims=True)
    e = np.exp(m); return e / e.sum(axis=1, keepdims=True)

skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)
oof = np.zeros((len(train), len(classes)), np.float32)
test_pred = np.zeros((len(test), len(classes)), np.float32)
fold_losses = []
t0_all = time.time()
print('[NBSVM_charwb_2_7_presence] alpha=0.75 C=30', flush=True)
for fold, (tr, va) in enumerate(skf.split(X_text, y), 1):
    t0 = time.time()
    x_tr, x_va = X_text[tr], X_text[va]
    y_tr, y_va = y[tr], y[va]
    v = CountVectorizer(**vec_params)
    X_tr = v.fit_transform(x_tr); X_va = v.transform(x_va); X_te = v.transform(X_test)
    print(f'  [Fold {fold}] X_tr={X_tr.shape}', flush=True)
    margins_va = np.zeros((len(va), len(classes)), np.float32)
    margins_te = np.zeros((len(test), len(classes)), np.float32)
    for ci, c in enumerate(classes):
        yb = (y_tr == c).astype(np.int8)
        r = _r_presence(X_tr, yb, a=0.75)
        Xr_tr = normalize(X_tr.multiply(r), norm='l2', axis=1, copy=False)
        Xr_va = normalize(X_va.multiply(r), norm='l2', axis=1, copy=False)
        Xr_te = normalize(X_te.multiply(r), norm='l2', axis=1, copy=False)
        clf = LogisticRegression(solver='liblinear', C=30.0, max_iter=2000, random_state=SEED)
        clf.fit(Xr_tr, yb)
        margins_va[:, ci] = clf.decision_function(Xr_va).astype(np.float32)
        margins_te[:, ci] = clf.decision_function(Xr_te).astype(np.float32)
        del yb, r, Xr_tr, Xr_va, Xr_te, clf; gc.collect()
    P_va = _softmax(margins_va).astype(np.float32)
    P_te = _softmax(margins_te).astype(np.float32)
    oof[va] = P_va
    test_pred += P_te / N_FOLDS
    loss = log_loss(y_va, P_va, labels=list(classes))
    fold_losses.append(loss)
    print(f'  [Fold {fold}] loss={loss:.5f} elapsed={time.time()-t0:.2f}s', flush=True)
    del X_tr, X_va, X_te, margins_va, margins_te, P_va, P_te, v; gc.collect()

oof_loss = log_loss(y, oof, labels=list(classes))
print(f'[NBSVM_charwb_2_7_presence] OOF={oof_loss:.5f}; folds={np.round(fold_losses,5).tolist()} total={time.time()-t0_all:.2f}s', flush=True)
np.save('oof_nbsvm_charwb_2_7_presence.npy', oof)
np.save('test_nbsvm_charwb_2_7_presence.npy', test_pred)
print('Saved oof_nbsvm_charwb_2_7_presence.npy and test_nbsvm_charwb_2_7_presence.npy', flush=True)

[NBSVM_charwb_2_7_presence] alpha=0.75 C=30
  [Fold 1] X_tr=(14096, 264846)
  [Fold 1] loss=0.45101 elapsed=9.64s
  [Fold 2] X_tr=(14097, 265778)
  [Fold 2] loss=0.43024 elapsed=9.86s
  [Fold 3] X_tr=(14097, 266406)
  [Fold 3] loss=0.45530 elapsed=9.54s
  [Fold 4] X_tr=(14097, 264832)
  [Fold 4] loss=0.42798 elapsed=9.58s
  [Fold 5] X_tr=(14097, 265685)
  [Fold 5] loss=0.43758 elapsed=9.38s
[NBSVM_charwb_2_7_presence] OOF=0.44042; folds=[0.45101, 0.43024, 0.4553, 0.42798, 0.43758] total=48.50s
Saved oof_nbsvm_charwb_2_7_presence.npy and test_nbsvm_charwb_2_7_presence.npy
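
The cell above calls `_r_presence(X_tr, yb, a=0.75)`, which is defined elsewhere in the notebook. As a rough illustration only, the sketch below shows one common way to compute an NB log-count ratio from binarized (presence) features. The function name `r_presence_sketch` and its internals are assumptions inferred from the call signature, not the notebook's actual helper.

```python
import numpy as np
from scipy import sparse

def r_presence_sketch(X, y_bin, a=0.75):
    """Log-count ratio on presence features (hypothetical stand-in for `_r_presence`)."""
    Xb = X.copy().tocsr()
    Xb.data = np.ones_like(Xb.data)                  # presence: any stored nonzero becomes 1
    pos_rows = np.flatnonzero(np.asarray(y_bin) == 1)
    neg_rows = np.flatnonzero(np.asarray(y_bin) != 1)
    p = a + np.asarray(Xb[pos_rows].sum(axis=0)).ravel()   # smoothed document counts, positive class
    q = a + np.asarray(Xb[neg_rows].sum(axis=0)).ravel()   # smoothed document counts, all other rows
    return np.log((p / p.sum()) / (q / q.sum()))

# Toy matrix: feature 0 appears only in positives, feature 1 only in negatives,
# feature 2 appears once in each group.
X = sparse.csr_matrix(np.array([[3, 0, 1],
                                [1, 0, 0],
                                [0, 2, 1],
                                [0, 1, 0]], dtype=np.float32))
y_bin = np.array([1, 1, 0, 0])

r = r_presence_sketch(X, y_bin)
X_scaled = X.multiply(r.reshape(1, -1)).tocsr()      # scale each feature column by its ratio
print(np.round(r, 3))
```

Features that occur mostly in the positive class get a positive ratio, features that occur mostly elsewhere get a negative one, and evenly split features land near zero, which is what lets the per-class logistic regression start from an NB-style prior on each feature.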


In [74]:
# A3) Hybrid TF-IDF LR: word(1,3) + char_wb(3,6), multinomial saga; save OOF/TEST
import time, gc, numpy as np, pandas as pd
from typing import Tuple
from scipy import sparse
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

SEED=42; N_FOLDS=5
train = pd.read_csv('train.csv'); test = pd.read_csv('test.csv')
X_text = train['text'].astype(str).values
y = train['author'].values
X_test = test['text'].astype(str).values
classes = np.unique(y)
submit_cols = ['EAP','HPL','MWS']
assert set(classes)==set(submit_cols)

vec_word = TfidfVectorizer(analyzer='word', ngram_range=(1,3), min_df=3, max_df=0.95,
                           lowercase=True, strip_accents='unicode', token_pattern=r"(?u)\b[-\w']+\b",
                           sublinear_tf=True, norm='l2', dtype=np.float32)
vec_charwb = TfidfVectorizer(analyzer='char_wb', ngram_range=(3,6), min_df=2, lowercase=False,
                             strip_accents=None, sublinear_tf=True, norm='l2', dtype=np.float32)

def build_fold(x_tr, x_va, x_te) -> Tuple[sparse.csr_matrix, sparse.csr_matrix, sparse.csr_matrix]:
    Xw_tr = vec_word.fit_transform(x_tr); Xw_va = vec_word.transform(x_va); Xw_te = vec_word.transform(x_te)
    Xc_tr = vec_charwb.fit_transform(x_tr); Xc_va = vec_charwb.transform(x_va); Xc_te = vec_charwb.transform(x_te)
    X_tr = sparse.hstack([Xw_tr, Xc_tr], format='csr')
    X_va = sparse.hstack([Xw_va, Xc_va], format='csr')
    X_te = sparse.hstack([Xw_te, Xc_te], format='csr')
    return X_tr, X_va, X_te

skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)
best = dict(loss=1e9, C=None, oof=None, test=None)
for C in [12.0, 16.0]:
    print(f'[LR_word13_charwb36] C={C}', flush=True)
    oof = np.zeros((len(train), len(classes)), np.float32)
    test_pred = np.zeros((len(test), len(classes)), np.float32)
    fold_losses = []
    t0_all = time.time()
    for fold, (tr, va) in enumerate(skf.split(X_text, y), 1):
        t0 = time.time()
        x_tr, x_va = X_text[tr], X_text[va]
        y_tr, y_va = y[tr], y[va]
        X_tr, X_va, X_te = build_fold(x_tr, x_va, X_test)
        print(f'  [Fold {fold}] X_tr={X_tr.shape}', flush=True)
        clf = LogisticRegression(solver='saga', multi_class='multinomial', C=C,
                                 max_iter=8000, tol=1e-4, n_jobs=-1, random_state=SEED)
        clf.fit(X_tr, y_tr)
        P_va = clf.predict_proba(X_va).astype(np.float32)
        P_te = clf.predict_proba(X_te).astype(np.float32)
        order_idx = [list(clf.classes_).index(c) for c in classes]
        P_va = P_va[:, order_idx]; P_te = P_te[:, order_idx]
        oof[va] = P_va
        test_pred += P_te / N_FOLDS
        loss = log_loss(y_va, P_va, labels=list(classes))
        fold_losses.append(loss)
        print(f'  [Fold {fold}] loss={loss:.5f} elapsed={time.time()-t0:.2f}s', flush=True)
        del X_tr, X_va, X_te, P_va, P_te, clf; gc.collect()
    oof_loss = log_loss(y, oof, labels=list(classes))
    print(f'[LR_word13_charwb36] OOF={oof_loss:.5f}; folds={np.round(fold_losses,5).tolist()} total={time.time()-t0_all:.2f}s', flush=True)
    if oof_loss < best['loss']:
        best.update(loss=oof_loss, C=C, oof=oof.copy(), test=test_pred.copy())

print(f"[LR_word13_charwb36] BEST OOF={best['loss']:.5f} with C={best['C']}", flush=True)
np.save('oof_lr_word13_charwb36.npy', best['oof'])
np.save('test_lr_word13_charwb36.npy', best['test'])
print('Saved oof_lr_word13_charwb36.npy and test_lr_word13_charwb36.npy', flush=True)

[LR_word13_charwb36] C=12.0
  [Fold 1] X_tr=(14096, 155680)
  [Fold 1] loss=0.36414 elapsed=13.47s
  [Fold 2] X_tr=(14097, 156093)
  [Fold 2] loss=0.36485 elapsed=19.56s
  [Fold 3] X_tr=(14097, 155966)
  [Fold 3] loss=0.38251 elapsed=17.77s
  [Fold 4] X_tr=(14097, 156197)
  [Fold 4] loss=0.36746 elapsed=14.66s
  [Fold 5] X_tr=(14097, 156270)
  [Fold 5] loss=0.36449 elapsed=19.03s
[LR_word13_charwb36] OOF=0.36869; folds=[0.36414, 0.36485, 0.38251, 0.36746, 0.36449] total=85.03s
[LR_word13_charwb36] C=16.0
  [Fold 1] X_tr=(14096, 155680)
  [Fold 1] loss=0.36385 elapsed=15.59s
  [Fold 2] X_tr=(14097, 156093)
  [Fold 2] loss=0.36379 elapsed=21.96s
  [Fold 3] X_tr=(14097, 155966)
  [Fold 3] loss=0.38255 elapsed=20.10s
  [Fold 4] X_tr=(14097, 156197)
  [Fold 4] loss=0.36711 elapsed=17.00s
  [Fold 5] X_tr=(14097, 156270)
  [Fold 5] loss=0.36401 elapsed=21.25s
[LR_word13_charwb36] OOF=0.36826; folds=[0.36385, 0.36379, 0.38255, 0.36711, 0.36401] total=96.44s
[LR_word13_charwb36] BEST OOF=0.36826 with C=16.0
Saved oof_lr_word13_charwb36.npy and test_lr_word13_charwb36.npy
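
The log above shows identical `X_tr` shapes for C=12.0 and C=16.0, because the fold matrices depend only on the split and the vectorizer settings, not on C. The plan also calls for caching vectorized matrices. A minimal sketch of that idea follows: vectorize each fold once, then reuse the cached matrices for every C. The helper names `make_vectorizers` and `cache_fold_matrices`, the toy corpus, and the relaxed `min_df` are illustrative assumptions, not code from the notebook.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold

def make_vectorizers():
    # Default min_df=1 so the tiny demo corpus yields a vocabulary;
    # the notebook uses min_df=3 (word) and min_df=2 (char_wb) on the real data.
    return [
        TfidfVectorizer(analyzer='word', ngram_range=(1, 3), sublinear_tf=True, dtype=np.float32),
        TfidfVectorizer(analyzer='char_wb', ngram_range=(3, 6), lowercase=False,
                        sublinear_tf=True, dtype=np.float32),
    ]

def cache_fold_matrices(texts, labels, test_texts, n_folds, seed=42):
    """Fit vectorizers on each training fold once; return (tr, va, X_tr, X_va, X_te) per fold."""
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    cache = []
    for tr, va in skf.split(texts, labels):
        parts = {'tr': [], 'va': [], 'te': []}
        for vec in make_vectorizers():               # fresh vectorizers, fit on the training fold only
            parts['tr'].append(vec.fit_transform(texts[tr]))
            parts['va'].append(vec.transform(texts[va]))
            parts['te'].append(vec.transform(test_texts))
        stacked = tuple(sparse.hstack(parts[k], format='csr') for k in ('tr', 'va', 'te'))
        cache.append((tr, va) + stacked)
    return cache

texts = np.array(['the raven croaked', 'a raven at the door', 'the old ones sleep',
                  'ancient ones in the deep', 'the creature wept', 'my creature fled north'])
labels = np.array(['EAP', 'EAP', 'HPL', 'HPL', 'MWS', 'MWS'])
test_texts = np.array(['the raven and the creature'])

cache = cache_fold_matrices(texts, labels, test_texts, n_folds=2)
print([X_tr.shape for _, _, X_tr, _, _ in cache])
```

A C sweep would then loop over `cache` inside the `for C in ...` loop instead of calling `build_fold` again, trading memory for the repeated vectorization time.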


In [115]:
# Classwise Log-Opinion Pool (LOP) blender (9 models), per-model scalar temps, classwise weights, simple caps (revert for recovery)
import numpy as np, pandas as pd, time, gc
from pathlib import Path
from sklearn.metrics import log_loss
from scipy.optimize import minimize, minimize_scalar

t0 = time.time()
train = pd.read_csv('train.csv'); test = pd.read_csv('test.csv')
y = train['author'].values
classes = np.unique(y).tolist()
submit_cols = ['EAP','HPL','MWS']
assert set(classes) == set(submit_cols)
C = len(classes)

def load(p): return np.load(p) if Path(p).exists() else None

# Candidate set: original 9-core (no MNB here for recovery)
cands = [
    ('nbsvm_wc_tweaked',        'oof_nbsvm_wc_tweaked.npy',        'test_nbsvm_wc_tweaked.npy'),
    ('nbsvm_char_2_6_counts',   'oof_nbsvm_char_2_6_counts.npy',   'test_nbsvm_char_2_6_counts.npy'),
    ('nbsvm_wc_fixed',          'oof_nbsvm_wc_fixed.npy',          'test_nbsvm_wc_fixed.npy'),
    ('nbsvm_char_2_7_presence', 'oof_nbsvm_char_2_7_presence.npy', 'test_nbsvm_char_2_7_presence.npy'),
    ('lr_wc_fixed',             'oof_lr_wordchar_fixed.npy',       'test_lr_wordchar_fixed.npy'),
    ('lr_word13_charwb36',      'oof_lr_word13_charwb36.npy',      'test_lr_word13_charwb36.npy'),
    ('lr_char_1_8_hero',        'oof_lr_char_1_8_hero.npy',        'test_lr_char_1_8_hero.npy'),
    ('svc_charwb_1_6_sig',      'oof_svc_charwb_1_6_sig.npy',      'test_svc_charwb_1_6_sig.npy'),
    ('lr_wordpunct_1_3',        'oof_lr_wordpunct_1_3.npy',        'test_lr_wordpunct_1_3.npy'),
]

loaded = []
for name, oofp, tsp in cands:
    o = load(oofp); t = load(tsp)
    if o is None or t is None:
        continue
    o = np.clip(o, 1e-12, 1-1e-12); o = o / o.sum(axis=1, keepdims=True)
    t = np.clip(t, 1e-12, 1-1e-12); t = t / t.sum(axis=1, keepdims=True)
    loaded.append((name, o.astype(np.float64), t.astype(np.float64)))

assert len(loaded) == 9, f'Need exactly 9 core bases; got {len(loaded)}'
names = [n for n,_,_ in loaded]
K = len(names)
print('LOP candidates:', names, flush=True)

# Scalar temperature per model
def scale_probs_scalar(P, T):
    S = np.clip(P, 1e-12, 1-1e-12) ** (1.0/float(T))
    return S / S.sum(axis=1, keepdims=True)

OOFs_raw = [o for _,o,_ in loaded]
TESTs_raw = [t for _,_,t in loaded]
per_model_T = []; OOFs = []; TESTs = []
for i in range(K):
    Pi = OOFs_raw[i]
    def loss_Ti(T): return log_loss(y, scale_probs_scalar(Pi, T), labels=classes)
    resTi = minimize_scalar(loss_Ti, bounds=(0.5, 5.0), method='bounded')
    Ti = float(resTi.x)
    per_model_T.append(Ti)
    OOFs.append(scale_probs_scalar(OOFs_raw[i], Ti))
    TESTs.append(scale_probs_scalar(TESTs_raw[i], Ti))
print('Per-model scalar T:', {names[i]: round(per_model_T[i],3) for i in range(K)})

per_oof = {names[i]: log_loss(y, OOFs[i], labels=classes) for i in range(K)}
print('Per-model OOF (post scalar cal):', {k: round(v,5) for k,v in per_oof.items()})

# LOP with classwise weights
def geo_pool_log_classwise(stacks, W):
    n = stacks[0].shape[0]
    A = np.zeros((n, C), dtype=np.float64)
    for k in range(K):
        A += np.log(stacks[k]) * W[k][None, :]
    A -= A.max(axis=1, keepdims=True)
    P = np.exp(A); P /= P.sum(axis=1, keepdims=True)
    return P

def softmax_cols(Z):
    W = np.zeros_like(Z)
    for j in range(C):
        z = Z[:, j]
        z = z - z.max()
        e = np.exp(z)
        s = e.sum()
        W[:, j] = e / (s if s>0 else 1.0)
    return W

# Recovery params (simple caps/renorm that gave ~0.31184)
lambda_ent = 0.0025
starts = 128
global_cap = 0.55
nb_cap = 0.62
weak_cap = 0.09
tiny_prune_thresh = 0.00
explicit_caps = {
    'svc_charwb_1_6_sig': 0.06,
    'lr_wordpunct_1_3': 0.05,
}

nb_mask = np.array([n.startswith('nbsvm_') for n in names], dtype=bool)
name_to_idx = {n:i for i,n in enumerate(names)}

# Simple apply_caps (prior version that achieved 0.31184): entry caps -> NB family cap -> per-class renorm
def apply_caps(W):
    Wc = W.copy()
    # Global cap per entry
    Wc = np.minimum(Wc, global_cap)
    # Explicit per-model caps
    for n, cap in explicit_caps.items():
        if n in name_to_idx:
            i = name_to_idx[n]
            Wc[i, :] = np.minimum(Wc[i, :], cap)
    # Weak cap based on per-model OOF
    for i, n in enumerate(names):
        if per_oof[n] > 0.40:
            Wc[i, :] = np.minimum(Wc[i, :], weak_cap)
    # NB family cap per class
    for j in range(C):
        s_nb = Wc[nb_mask, j].sum()
        if s_nb > nb_cap and s_nb > 0:
            Wc[nb_mask, j] *= (nb_cap / s_nb)
    # Prune tiny (disabled here) and renormalize per class
    for j in range(C):
        col = Wc[:, j]
        if tiny_prune_thresh > 0:
            col[col < tiny_prune_thresh] = 0.0
        s = col.sum()
        if s == 0:
            col[:] = 1.0 / K
        else:
            col[:] = col / s
        Wc[:, j] = col
    return Wc

def objective(Z):
    W0 = softmax_cols(Z)
    Wc = apply_caps(W0)
    P = geo_pool_log_classwise(OOFs, Wc)
    ent = 0.0
    for j in range(C):
        wj = np.clip(Wc[:, j], 1e-12, 1.0)
        ent += float(np.sum(wj * np.log(wj)))
    reg = lambda_ent * ent
    return log_loss(y, P, labels=classes) + reg

# Multi-start optimization
best = (1e9, None, None)
rng = np.random.RandomState(42)
inits = [np.zeros((K, C))] + [rng.normal(0, 0.5, size=(K, C)) for _ in range(starts-1)]
for si, Z0 in enumerate(inits, 1):
    res = minimize(lambda z: objective(z.reshape(K, C)), Z0.ravel(), method='L-BFGS-B')
    val = float(res.fun)
    if val < best[0]:
        best = (val, res.x.reshape(K, C).copy(), res)
    if si % 16 == 0:
        print(f'  [start {si}/{starts}] best_obj={best[0]:.5f}', flush=True)

Z_star = best[1]
W0 = softmax_cols(Z_star)
Wc = apply_caps(W0)
print('Best obj:', round(best[0],5))

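# Report final per-class weights. apply_caps clips entries first and renormalizes each class
# column afterwards, so the renormalized weights printed here can exceed the nominal caps.
print('Final per-class weights (sum=1 each class):')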
for j, cls in enumerate(classes):
    wj = {names[i]: round(float(Wc[i, j]), 3) for i in range(K)}
    print(cls, wj)

# Blend OOF/Test with classwise weights
P_oof = geo_pool_log_classwise(OOFs, Wc)
P_test = geo_pool_log_classwise(TESTs, Wc)
oof_pre = log_loss(y, P_oof, labels=classes)
print('LOP Blend OOF (pre-temp):', round(oof_pre,5))

# Per-class final temperatures
def scale_classwise(P, Tvec):
    T = np.asarray(Tvec, dtype=np.float64)
    S = np.clip(P, 1e-12, 1-1e-12) ** (1.0 / T[None, :])
    return S / S.sum(axis=1, keepdims=True)

# `minimize` is already imported at the top of this cell; fit one temperature per class on the blended OOF
bounds = [(0.5, 2.0)] * C
x0 = np.ones(C, dtype=np.float64)
resTc = minimize(lambda t: log_loss(y, scale_classwise(P_oof, t), labels=classes),
                 x0, method='L-BFGS-B', bounds=bounds)
T_class = resTc.x
P_oof_scaled = scale_classwise(P_oof, T_class)
oof_final = log_loss(y, P_oof_scaled, labels=classes)
print('Classwise T:', np.round(T_class, 4), 'Final OOF:', round(oof_final,5))

# Save submission using classwise temperature scaling
P_test_scaled = scale_classwise(P_test, T_class)
probs = P_test_scaled[:, [classes.index(c) for c in submit_cols]]
sub = pd.DataFrame(probs, columns=submit_cols)
sub.insert(0, 'id', test['id'].values)
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv; elapsed=%.2fs' % (time.time()-t0), flush=True)

# Free
del OOFs_raw, TESTs_raw, OOFs, TESTs; gc.collect()

LOP candidates: ['nbsvm_wc_tweaked', 'nbsvm_char_2_6_counts', 'nbsvm_wc_fixed', 'nbsvm_char_2_7_presence', 'lr_wc_fixed', 'lr_word13_charwb36', 'lr_char_1_8_hero', 'svc_charwb_1_6_sig', 'lr_wordpunct_1_3']
Per-model scalar T: {'nbsvm_wc_tweaked': 1.55, 'nbsvm_char_2_6_counts': 1.579, 'nbsvm_wc_fixed': 1.567, 'nbsvm_char_2_7_presence': 1.477, 'lr_wc_fixed': 0.877, 'lr_word13_charwb36': 1.006, 'lr_char_1_8_hero': 0.871, 'svc_charwb_1_6_sig': 0.803, 'lr_wordpunct_1_3': 0.771}
Per-model OOF (post scalar cal): {'nbsvm_wc_tweaked': 0.32156, 'nbsvm_char_2_6_counts': 0.33019, 'nbsvm_wc_fixed': 0.33633, 'nbsvm_char_2_7_presence': 0.338, 'lr_wc_fixed': 0.36343, 'lr_word13_charwb36': 0.36825, 'lr_char_1_8_hero': 0.36987, 'svc_charwb_1_6_sig': 0.39657, 'lr_wordpunct_1_3': 0.40368}
  [start 16/128] best_obj=0.29898
  [start 32/128] best_obj=0.29898
  [start 48/128] best_obj=0.29898
  [start 64/128] best_obj=0.29898
  [start 80/128] best_obj=0.29898
  [start 96/128] best_obj=0.29898
  [start 112/128] best_obj=0.29898
  [start 128/128] best_obj=0.29881
Best obj: 0.29881
Final per-class weights (sum=1 each class):
EAP {'nbsvm_wc_tweaked': 0.234, 'nbsvm_char_2_6_counts': 0.105, 'nbsvm_wc_fixed': 0.178, 'nbsvm_char_2_7_presence': 0.159, 'lr_wc_fixed': 0.034, 'lr_word13_charwb36': 0.053, 'lr_char_1_8_hero': 0.022, 'svc_charwb_1_6_sig': 0.057, 'lr_wordpunct_1_3': 0.157}
HPL {'nbsvm_wc_tweaked': 0.346, 'nbsvm_char_2_6_counts': 0.164, 'nbsvm_wc_fixed': 0.145, 'nbsvm_char_2_7_presence': 0.145, 'lr_wc_fixed': 0.023, 'lr_word13_charwb36': 0.024, 'lr_char_1_8_hero': 0.01, 'svc_charwb_1_6_sig': 0.077, 'lr_wordpunct_1_3': 0.065}
MWS {'nbsvm_wc_tweaked': 0.291, 'nbsvm_char_2_6_counts': 0.127, 'nbsvm_wc_fixed': 0.069, 'nbsvm_char_2_7_presence': 0.107, 'lr_wc_fixed': 0.031, 'lr_word13_charwb36': 0.075, 'lr_char_1_8_hero': 0.014, 'svc_charwb_1_6_sig': 0.064, 'lr_wordpunct_1_3': 0.222}
LOP Blend OOF (pre-temp): 0.31313
Classwise T: [0.8854 0.9071 0.9523] Final OOF: 0.31158
Saved submission.csv; elapsed=1033.75s


644
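
The blender above combines two operations: per-model temperature scaling (raise probabilities to 1/T and renormalize) and a log-opinion pool with a separate weight per model and class. The toy sketch below reproduces both on two hand-made prediction matrices so the behavior is easy to inspect. The function names `temp_scale` and `classwise_lop` and all numbers are illustrative, chosen to mirror `scale_probs_scalar` and `geo_pool_log_classwise` rather than copied from a run.

```python
import numpy as np

def temp_scale(P, T):
    """Temperature scaling on probabilities: T > 1 flattens, T < 1 sharpens."""
    S = np.clip(P, 1e-12, 1 - 1e-12) ** (1.0 / T)
    return S / S.sum(axis=1, keepdims=True)

def classwise_lop(stacks, W):
    """Log-opinion pool: weighted sum of log-probabilities with one weight per (model, class)."""
    A = sum(np.log(P) * W[k][None, :] for k, P in enumerate(stacks))
    A = A - A.max(axis=1, keepdims=True)             # stabilize before exponentiating
    P = np.exp(A)
    return P / P.sum(axis=1, keepdims=True)

# Two toy models, two rows, three classes.
P1 = np.array([[0.7, 0.2, 0.1],
               [0.3, 0.4, 0.3]])
P2 = np.array([[0.5, 0.3, 0.2],
               [0.1, 0.8, 0.1]])

# Weights have shape (n_models, n_classes); each class column sums to 1.
W = np.array([[0.6, 0.5, 0.5],
              [0.4, 0.5, 0.5]])

blend = classwise_lop([temp_scale(P1, 1.5), temp_scale(P2, 0.9)], W)
print(np.round(blend, 4))
```

Because the pool works in log space, a model that assigns a very small probability to the true class can dominate the blend, which is one reason the notebook clips inputs and softens overconfident models (T > 1) before pooling.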

In [76]:
# A5) LR char_wb(1,6) multinomial (often strong); sweep C in [16, 20, 24]; save OOF/TEST
import time, gc, numpy as np, pandas as pd
from typing import Tuple
from scipy import sparse
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

SEED = 42; N_FOLDS = 5
train = pd.read_csv('train.csv'); test = pd.read_csv('test.csv')
X_text = train['text'].astype(str).values
y = train['author'].values
X_test = test['text'].astype(str).values
classes = np.unique(y)
submit_cols = ['EAP','HPL','MWS']
assert set(classes) == set(submit_cols)

charwb_params = dict(analyzer='char_wb', ngram_range=(1,6), min_df=1, lowercase=False,
                     strip_accents=None, sublinear_tf=True, smooth_idf=True, norm='l2', dtype=np.float32)

def build_fold(x_tr, x_va, x_te) -> Tuple[sparse.csr_matrix, sparse.csr_matrix, sparse.csr_matrix, int]:
    v = TfidfVectorizer(**charwb_params)
    X_tr = v.fit_transform(x_tr); X_va = v.transform(x_va); X_te = v.transform(x_te)
    return X_tr, X_va, X_te, X_tr.shape[1]

skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)
best = dict(loss=1e9, C=None, oof=None, test=None)
for C in [16.0, 20.0, 24.0]:
    print(f'[LR_charwb_1_6] C={C}', flush=True)
    oof = np.zeros((len(train), len(classes)), dtype=np.float32)
    test_pred = np.zeros((len(test), len(classes)), dtype=np.float32)
    fold_losses = []
    t0_all = time.time()
    for fold, (tr, va) in enumerate(skf.split(X_text, y), 1):
        t0 = time.time()
        x_tr, x_va = X_text[tr], X_text[va]
        y_tr, y_va = y[tr], y[va]
        X_tr, X_va, X_te, vdim = build_fold(x_tr, x_va, X_test)
        print(f'  [Fold {fold}] X_tr={X_tr.shape} vdim={vdim}', flush=True)
        clf = LogisticRegression(solver='lbfgs', multi_class='multinomial', penalty='l2',
                                 C=C, max_iter=12000, tol=1e-4, n_jobs=-1, random_state=SEED)
        clf.fit(X_tr, y_tr)
        P_va = clf.predict_proba(X_va).astype(np.float32)
        P_te = clf.predict_proba(X_te).astype(np.float32)
        order_idx = [list(clf.classes_).index(c) for c in classes]
        P_va = P_va[:, order_idx]; P_te = P_te[:, order_idx]
        oof[va] = P_va
        test_pred += P_te / N_FOLDS
        loss = log_loss(y_va, P_va, labels=list(classes))
        fold_losses.append(loss)
        print(f'  [Fold {fold}] loss={loss:.5f} elapsed={time.time()-t0:.2f}s', flush=True)
        del X_tr, X_va, X_te, P_va, P_te, clf; gc.collect()
    oof_loss = log_loss(y, oof, labels=list(classes))
    print(f'[LR_charwb_1_6] OOF={oof_loss:.5f}; folds={np.round(fold_losses,5).tolist()} total={time.time()-t0_all:.2f}s', flush=True)
    if oof_loss < best['loss']:
        best.update(loss=oof_loss, C=C, oof=oof.copy(), test=test_pred.copy())

print(f"[LR_charwb_1_6] BEST OOF={best['loss']:.5f} with C={best['C']}", flush=True)
np.save('oof_lr_charwb_1_6.npy', best['oof'])
np.save('test_lr_charwb_1_6.npy', best['test'])
print('Saved oof_lr_charwb_1_6.npy and test_lr_charwb_1_6.npy', flush=True)

[LR_charwb_1_6] C=16.0
  [Fold 1] X_tr=(14096, 184581) vdim=184581
  [Fold 1] loss=0.39544 elapsed=10.73s
  [Fold 2] X_tr=(14097, 185066) vdim=185066
  [Fold 2] loss=0.38354 elapsed=14.48s
  [Fold 3] X_tr=(14097, 185603) vdim=185603
  [Fold 3] loss=0.41368 elapsed=11.54s
  [Fold 4] X_tr=(14097, 184468) vdim=184468
  [Fold 4] loss=0.40293 elapsed=9.68s
  [Fold 5] X_tr=(14097, 185025) vdim=185025
  [Fold 5] loss=0.39141 elapsed=9.12s
[LR_charwb_1_6] OOF=0.39740; folds=[0.39544, 0.38354, 0.41368, 0.40293, 0.39141] total=56.10s
[LR_charwb_1_6] C=20.0
  [Fold 1] X_tr=(14096, 184581) vdim=184581
  [Fold 1] loss=0.39692 elapsed=14.45s
  [Fold 2] X_tr=(14097, 185066) vdim=185066
  [Fold 2] loss=0.38451 elapsed=10.69s
  [Fold 3] X_tr=(14097, 185603) vdim=185603
  [Fold 3] loss=0.41620 elapsed=12.31s
  [Fold 4] X_tr=(14097, 184468) vdim=184468
  [Fold 4] loss=0.39989 elapsed=13.24s
  [Fold 5] X_tr=(14097, 185025) vdim=185025
  [Fold 5] loss=0.39742 elapsed=11.22s
[LR_charwb_1_6] OOF=0.39899; folds=[0.39692, 0.38451, 0.4162, 0.39989, 0.39742] total=62.44s
[LR_charwb_1_6] C=24.0
  [Fold 1] X_tr=(14096, 184581) vdim=184581
  [Fold 1] loss=0.39886 elapsed=13.85s
  [Fold 2] X_tr=(14097, 185066) vdim=185066
  [Fold 2] loss=0.38602 elapsed=12.42s
  [Fold 3] X_tr=(14097, 185603) vdim=185603
  [Fold 3] loss=0.41828 elapsed=14.21s
  [Fold 4] X_tr=(14097, 184468) vdim=184468
  [Fold 4] loss=0.40106 elapsed=13.02s
  [Fold 5] X_tr=(14097, 185025) vdim=185025
  [Fold 5] loss=0.39747 elapsed=9.51s
[LR_charwb_1_6] OOF=0.40034; folds=[0.39886, 0.38602, 0.41828, 0.40106, 0.39747] total=63.53s
[LR_charwb_1_6] BEST OOF=0.39740 with C=16.0
Saved oof_lr_charwb_1_6.npy and test_lr_charwb_1_6.npy
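
In the sweep above the OOF loss rises monotonically with C (0.39740, 0.39899, 0.40034), so the selected value sits on the lower edge of the grid and a smaller C was never tried. A small check like the one below can flag that situation automatically. The helper name `best_on_grid_edge` is hypothetical; the losses are copied from the log above.

```python
def best_on_grid_edge(results):
    """Given {C: oof_loss}, return (best_C, True if best_C is the smallest or largest C tried)."""
    grid = sorted(results)
    best_C = min(results, key=results.get)
    return best_C, best_C in (grid[0], grid[-1])

# OOF log losses from the LR_charwb_1_6 sweep
oof_by_C = {16.0: 0.39740, 20.0: 0.39899, 24.0: 0.40034}

best_C, on_edge = best_on_grid_edge(oof_by_C)
print(best_C, on_edge)
```

When the flag is true, extending the grid past that edge (here, below 16) is a cheap way to confirm whether the optimum actually lies inside the range.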


In [80]:
# A4) Word+punct TF-IDF LR: word(1,3) with punctuation tokens; multinomial saga; save OOF/TEST
import time, gc, re, numpy as np, pandas as pd
from typing import Tuple
from scipy import sparse
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

SEED=42; N_FOLDS=5
train = pd.read_csv('train.csv'); test = pd.read_csv('test.csv')
X_text = train['text'].astype(str).values
y = train['author'].values
X_test = test['text'].astype(str).values
classes = np.unique(y)
submit_cols = ['EAP','HPL','MWS']
assert set(classes)==set(submit_cols)

# Token pattern keeps words (with apostrophes/hyphens) and standalone punctuation tokens
punct_pat = r"[\.,;:!\?—”“\"'()\-…]"  # includes em-dash and ellipsis
tok_pat = rf"(?u)\b[-\w']+\b|{punct_pat}"
vec = TfidfVectorizer(analyzer='word', ngram_range=(1,3), min_df=2, max_df=0.95, lowercase=False,
                      strip_accents=None, token_pattern=tok_pat, sublinear_tf=True, norm='l2', dtype=np.float32)

def build_fold(x_tr, x_va, x_te) -> Tuple[sparse.csr_matrix, sparse.csr_matrix, sparse.csr_matrix, int]:
    X_tr = vec.fit_transform(x_tr); X_va = vec.transform(x_va); X_te = vec.transform(x_te)
    return X_tr, X_va, X_te, X_tr.shape[1]

skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)
C = 10.0
print(f'[LR_wordpunct_1_3] C={C}', flush=True)
oof = np.zeros((len(train), len(classes)), np.float32)
test_pred = np.zeros((len(test), len(classes)), np.float32)
fold_losses = []
t0_all = time.time()
for fold, (tr, va) in enumerate(skf.split(X_text, y), 1):
    t0 = time.time()
    x_tr, x_va = X_text[tr], X_text[va]
    y_tr, y_va = y[tr], y[va]
    X_tr, X_va, X_te, vdim = build_fold(x_tr, x_va, X_test)
    print(f'  [Fold {fold}] X_tr={X_tr.shape} vdim={vdim}', flush=True)
    clf = LogisticRegression(solver='saga', multi_class='multinomial', C=C,
                             max_iter=8000, tol=1e-4, n_jobs=-1, random_state=SEED)
    clf.fit(X_tr, y_tr)
    P_va = clf.predict_proba(X_va).astype(np.float32)
    P_te = clf.predict_proba(X_te).astype(np.float32)
    order_idx = [list(clf.classes_).index(c) for c in classes]
    P_va = P_va[:, order_idx]; P_te = P_te[:, order_idx]
    oof[va] = P_va
    test_pred += P_te / N_FOLDS
    loss = log_loss(y_va, P_va, labels=list(classes))
    fold_losses.append(loss)
    print(f'  [Fold {fold}] loss={loss:.5f} elapsed={time.time()-t0:.2f}s', flush=True)
    del X_tr, X_va, X_te, P_va, P_te, clf; gc.collect()

oof_loss = log_loss(y, oof, labels=list(classes))
print(f'[LR_wordpunct_1_3] OOF={oof_loss:.5f}; folds={np.round(fold_losses,5).tolist()} total={time.time()-t0_all:.2f}s', flush=True)
np.save('oof_lr_wordpunct_1_3.npy', oof)
np.save('test_lr_wordpunct_1_3.npy', test_pred)
print('Saved oof_lr_wordpunct_1_3.npy and test_lr_wordpunct_1_3.npy', flush=True)

[LR_wordpunct_1_3] C=10.0
  [Fold 1] X_tr=(14096, 87868) vdim=87868
  [Fold 1] loss=0.41118 elapsed=2.02s
  [Fold 2] X_tr=(14097, 88062) vdim=88062
  [Fold 2] loss=0.41795 elapsed=2.22s
  [Fold 3] loss=0.42880 elapsed=2.03s
  [Fold 4] X_tr=(14097, 88161) vdim=88161
  [Fold 4] loss=0.41522 elapsed=1.97s
  [Fold 5] X_tr=(14097, 87824) vdim=87824
  [Fold 5] loss=0.41173 elapsed=2.05s
[LR_wordpunct_1_3] OOF=0.41697; folds=[0.41118, 0.41795, 0.4288, 0.41522, 0.41173] total=10.85s
Saved oof_lr_wordpunct_1_3.npy and test_lr_wordpunct_1_3.npy
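
To see what the custom `token_pattern` in this cell actually emits, the pattern can be run directly with `re.findall`, which is how scikit-learn's default tokenizer applies it. The sample sentence below is made up for illustration; the two pattern strings are copied from the cell.

```python
import re

# Pattern strings copied from the cell above
punct_pat = r"[\.,;:!\?—”“\"'()\-…]"
tok_pat = rf"(?u)\b[-\w']+\b|{punct_pat}"
tokenizer = re.compile(tok_pat)

# Illustrative sentence (not from the dataset)
sample = 'It\'s over, he said; "quite so!"'
tokens = tokenizer.findall(sample)
print(tokens)
```

Words keep their internal apostrophes, while commas, semicolons, quotation marks and exclamation marks come out as separate tokens, so the word n-grams in this model can capture punctuation habits alongside vocabulary.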


In [97]:
# New base (cheap, diverse): MultinomialNB on TF-IDF char (2,6), alpha=0.1, lowercase=True, sublinear_tf=True; save OOF/TEST
import time, gc, numpy as np, pandas as pd
from typing import Tuple
from scipy import sparse
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import log_loss

SEED=42; N_FOLDS=5
train = pd.read_csv('train.csv'); test = pd.read_csv('test.csv')
X_text = train['text'].astype(str).values
y = train['author'].values
X_test = test['text'].astype(str).values
classes = np.unique(y)
submit_cols = ['EAP','HPL','MWS']
assert set(classes)==set(submit_cols)

char_params = dict(analyzer='char', ngram_range=(2,6), min_df=2, lowercase=True,
                   strip_accents='unicode', sublinear_tf=True, smooth_idf=True, norm='l2', dtype=np.float32)

def build_fold(x_tr, x_va, x_te) -> Tuple[sparse.csr_matrix, sparse.csr_matrix, sparse.csr_matrix, int]:
    v = TfidfVectorizer(**char_params)
    X_tr = v.fit_transform(x_tr); X_va = v.transform(x_va); X_te = v.transform(x_te)
    return X_tr, X_va, X_te, X_tr.shape[1]

skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)
alpha = 0.1
oof = np.zeros((len(train), len(classes)), np.float32)
test_pred = np.zeros((len(test), len(classes)), np.float32)
fold_losses = []; t0_all=time.time()
print(f'[MNB_char_tfidf_2_6] alpha={alpha}', flush=True)
for fold,(tr,va) in enumerate(skf.split(X_text, y),1):
    t0=time.time()
    x_tr,x_va = X_text[tr], X_text[va]
    y_tr,y_va = y[tr], y[va]
    X_tr, X_va, X_te, vdim = build_fold(x_tr, x_va, X_test)
    print(f'  [Fold {fold}] X_tr={X_tr.shape} vdim={vdim}', flush=True)
    clf = MultinomialNB(alpha=alpha)
    clf.fit(X_tr, y_tr)
    P_va = clf.predict_proba(X_va).astype(np.float32)
    P_te = clf.predict_proba(X_te).astype(np.float32)
    # reorder to fixed class order
    order_idx = [list(clf.classes_).index(c) for c in classes]
    P_va = P_va[:, order_idx]; P_te = P_te[:, order_idx]
    # clip+renorm
    P_va = np.clip(P_va, 1e-12, 1-1e-12); P_va /= P_va.sum(axis=1, keepdims=True)
    P_te = np.clip(P_te, 1e-12, 1-1e-12); P_te /= P_te.sum(axis=1, keepdims=True)
    oof[va] = P_va
    test_pred += P_te / N_FOLDS
    loss = log_loss(y_va, P_va, labels=list(classes))
    fold_losses.append(loss)
    print(f'    loss={loss:.5f} elapsed={time.time()-t0:.2f}s', flush=True)
    del X_tr, X_va, X_te, P_va, P_te, clf; gc.collect()

oof_loss = log_loss(y, oof, labels=list(classes))
print(f'[MNB_char_tfidf_2_6] OOF={oof_loss:.5f}; folds={np.round(fold_losses,5).tolist()} total={time.time()-t0_all:.2f}s', flush=True)
np.save('oof_mnb_char_2_6.npy', oof)
np.save('test_mnb_char_2_6.npy', test_pred)
print('Saved oof_mnb_char_2_6.npy and test_mnb_char_2_6.npy', flush=True)

[MNB_char_tfidf_2_6] alpha=0.1
  [Fold 1] X_tr=(14096, 277345) vdim=277345
    loss=0.39844 elapsed=4.77s
  [Fold 2] X_tr=(14097, 278200) vdim=278200
    loss=0.39749 elapsed=4.78s
  [Fold 3] X_tr=(14097, 277563) vdim=277563
    loss=0.40645 elapsed=4.78s
  [Fold 4] X_tr=(14097, 278079) vdim=278079
    loss=0.40916 elapsed=4.80s
  [Fold 5] X_tr=(14097, 278322) vdim=278322
    loss=0.38659 elapsed=4.77s
[MNB_char_tfidf_2_6] OOF=0.39963; folds=[0.39844, 0.39749, 0.40645, 0.40916, 0.38659] total=24.42s
Saved oof_mnb_char_2_6.npy and test_mnb_char_2_6.npy


In [123]:
# 3-bin length-gated Classwise LOP (Run A): add MNB tiny-cap, stabilize temps, caps tweaks, confidence overlay
import numpy as np, pandas as pd, time, gc
from pathlib import Path
from scipy.optimize import minimize, minimize_scalar
from sklearn.metrics import log_loss

t0 = time.time()
train = pd.read_csv('train.csv'); test = pd.read_csv('test.csv')
y = train['author'].values
classes = np.unique(y).tolist()
submit_cols = ['EAP','HPL','MWS']
assert set(classes) == set(submit_cols)
C = len(classes)

def load(p):
    return np.load(p) if Path(p).exists() else None

# Use 9-core + add mnb_char_2_6 with ultra-tight per-bin cap
cands = [
    ('nbsvm_wc_tweaked',        'oof_nbsvm_wc_tweaked.npy',        'test_nbsvm_wc_tweaked.npy'),
    ('nbsvm_char_2_6_counts',   'oof_nbsvm_char_2_6_counts.npy',   'test_nbsvm_char_2_6_counts.npy'),
    ('nbsvm_wc_fixed',          'oof_nbsvm_wc_fixed.npy',          'test_nbsvm_wc_fixed.npy'),
    ('nbsvm_char_2_7_presence', 'oof_nbsvm_char_2_7_presence.npy', 'test_nbsvm_char_2_7_presence.npy'),
    ('lr_wc_fixed',             'oof_lr_wordchar_fixed.npy',       'test_lr_wordchar_fixed.npy'),
    ('lr_word13_charwb36',      'oof_lr_word13_charwb36.npy',      'test_lr_word13_charwb36.npy'),
    ('lr_char_1_8_hero',        'oof_lr_char_1_8_hero.npy',        'test_lr_char_1_8_hero.npy'),
    ('svc_charwb_1_6_sig',      'oof_svc_charwb_1_6_sig.npy',      'test_svc_charwb_1_6_sig.npy'),
    ('lr_wordpunct_1_3',        'oof_lr_wordpunct_1_3.npy',        'test_lr_wordpunct_1_3.npy'),
    ('mnb_char_2_6',            'oof_mnb_char_2_6.npy',            'test_mnb_char_2_6.npy'),
]
loaded = []
for name, oofp, tsp in cands:
    o = load(oofp); t = load(tsp)
    assert o is not None and t is not None, f'Missing preds for {name}'
    o = np.clip(o, 1e-12, 1-1e-12); o = o / o.sum(axis=1, keepdims=True)
    t = np.clip(t, 1e-12, 1-1e-12); t = t / t.sum(axis=1, keepdims=True)
    loaded.append((name, o.astype(np.float64), t.astype(np.float64)))
names = [n for n,_,_ in loaded]
K = len(names)
print('Length-gated LOP with models:', names, flush=True)

# Scalar temperature helper
def scale_probs_scalar(P, T):
    S = np.clip(P, 1e-12, 1-1e-12) ** (1.0/float(T))
    return S / S.sum(axis=1, keepdims=True)

OOFs_raw = [o for _,o,_ in loaded]
TESTs_raw = [t for _,_,t in loaded]

# Global per-model temps (for diagnostics and shrink target)
per_model_T_global = []
OOFs_global = []
TESTs_global = []
for i in range(K):
    Pi = OOFs_raw[i]
    # fit global T with broad bounds first
    resTi = minimize_scalar(lambda T: log_loss(y, scale_probs_scalar(Pi, T), labels=classes),
                            bounds=(0.5, 5.0), method='bounded')
    Ti = float(resTi.x)
    per_model_T_global.append(Ti)
    OOFs_global.append(scale_probs_scalar(OOFs_raw[i], Ti))
    TESTs_global.append(scale_probs_scalar(TESTs_raw[i], Ti))
per_oof = {names[i]: log_loss(y, OOFs_global[i], labels=classes) for i in range(K)}
print('Per-model T (global):', {names[i]: round(per_model_T_global[i],3) for i in range(K)})
print('Per-model OOF (post scalar cal):', {k: round(v,5) for k,v in per_oof.items()})

# Helpers
def geo_pool_log_classwise(stacks, W):
    n = stacks[0].shape[0]
    A = np.zeros((n, C), dtype=np.float64)
    for k in range(K):
        A += np.log(stacks[k]) * W[k][None, :]
    A -= A.max(axis=1, keepdims=True)
    P = np.exp(A); P /= P.sum(axis=1, keepdims=True)
    return P

def softmax_cols(Z):
    W = np.zeros_like(Z)
    for j in range(C):
        z = Z[:, j]
        z = z - z.max()
        e = np.exp(z); s = e.sum()
        W[:, j] = e / (s if s>0 else 1.0)
    return W

# Regularization and caps (Run A settings)
lambda_ent = 0.0025
global_cap = 0.55
weak_cap = 0.09
tiny_prune_thresh = 0.00
# explicit caps will be passed per-bin to include per-bin cap for mnb
nb_mask = np.array([n.startswith('nbsvm_') for n in names], dtype=bool)
name_to_idx = {n:i for i,n in enumerate(names)}

def apply_caps_with_nbcap(W, nb_cap_local, explicit_caps_local):
    Wc = W.copy()
    # Global entry cap
    Wc = np.minimum(Wc, global_cap)
    # Explicit per-model caps
    for n, cap in explicit_caps_local.items():
        if n in name_to_idx:
            i = name_to_idx[n]
            Wc[i, :] = np.minimum(Wc[i, :], cap)
    # Weak cap for very weak bases
    for i, n in enumerate(names):
        if per_oof.get(n, 1.0) > 0.40:
            Wc[i, :] = np.minimum(Wc[i, :], weak_cap)
    # NB-family total cap per class
    for j in range(C):
        s_nb = Wc[nb_mask, j].sum()
        if s_nb > nb_cap_local and s_nb > 0:
            Wc[nb_mask, j] *= (nb_cap_local / s_nb)
    # Per-class renormalize
    for j in range(C):
        col = Wc[:, j]
        if tiny_prune_thresh > 0:
            col[col < tiny_prune_thresh] = 0.0
        s = col.sum()
        if s == 0:
            col[:] = 1.0 / K
        else:
            col[:] = col / s
        Wc[:, j] = col
    return Wc

def make_objective(OOFs_subset, nb_cap_local, explicit_caps_local, y_bin):
    def objective(Z):
        W0 = softmax_cols(Z)
        Wc = apply_caps_with_nbcap(W0, nb_cap_local, explicit_caps_local)
        P = geo_pool_log_classwise(OOFs_subset, Wc)
        ent = 0.0
        for j in range(C):
            wj = np.clip(Wc[:, j], 1e-12, 1.0)
            ent += float(np.sum(wj * np.log(wj)))
        reg = lambda_ent * ent
        return log_loss(y_bin, P, labels=classes) + reg
    return objective

# Length bins (fixed thresholds)
train_len = train['text'].astype(str).str.len().values
test_len  = test['text'].astype(str).str.len().values
short_thr, mid_lo, mid_hi = 100, 101, 180
mask_short = (train_len <= short_thr)
mask_mid   = (train_len >= mid_lo) & (train_len <= mid_hi)
mask_long  = (train_len > mid_hi)
test_mask_short = (test_len <= short_thr)
test_mask_mid   = (test_len >= mid_lo) & (test_len <= mid_hi)
test_mask_long  = (test_len > mid_hi)
print('Bin sizes:', {'short': int(mask_short.sum()), 'mid': int(mask_mid.sum()), 'long': int(mask_long.sum())}, flush=True)

# Warm start from global Z_star if available
Z_global = None
try:
    Z_global = Z_star.copy()  # from Cell 32 if present
except NameError:
    Z_global = None
Z_warm = np.zeros((K, C), dtype=np.float64)
if Z_global is not None and getattr(Z_global, 'shape', None) == (K, C):
    Z_warm = Z_global

final_oof = np.zeros((len(train), C), dtype=np.float64)
final_test = np.zeros((len(test), C), dtype=np.float64)

def run_bin(name, tr_mask, te_mask, nb_cap_local, mnb_cap_local):
    idx_tr = np.where(tr_mask)[0]
    idx_te = np.where(te_mask)[0]
    if len(idx_tr) == 0:
        return
    y_bin = y[idx_tr]
    # Per-bin per-model scalar temperatures with bounds (0.75,1.5) and shrink toward global
    OOFs_bin = []
    TESTs_bin = []
    for i in range(K):
        Pi_tr = OOFs_raw[i][idx_tr]
        # bound temps for stability
        resTi = minimize_scalar(lambda T: log_loss(y_bin, scale_probs_scalar(Pi_tr, T), labels=classes),
                                bounds=(0.75, 1.5), method='bounded')
        Ti_bin = float(resTi.x)
        # shrink: 70% global + 30% fitted-in-bin
        Ti_shrunk = 0.7 * float(per_model_T_global[i]) + 0.3 * Ti_bin
        OOFs_bin.append(scale_probs_scalar(OOFs_raw[i][idx_tr], Ti_shrunk))
        TESTs_bin.append(scale_probs_scalar(TESTs_raw[i][idx_te], Ti_shrunk))
    # Optimize classwise weights
    starts = 64  # reduced per expert
    rng = np.random.RandomState(42)
    inits = [Z_warm.copy()] + [rng.normal(0, 0.5, size=(K, C)) for _ in range(starts-1)]
    best = (1e9, None)
    # per-bin explicit caps (svc/lr_wordpunct fixed; mnb per-bin tiny caps)
    explicit_caps_local = {
        'svc_charwb_1_6_sig': 0.06,
        'lr_wordpunct_1_3': 0.05,
        'mnb_char_2_6': mnb_cap_local,
    }
    obj = make_objective(OOFs_bin, nb_cap_local, explicit_caps_local, y_bin)
    for si, Z0 in enumerate(inits, 1):
        res = minimize(lambda z: obj(z.reshape(K, C)), Z0.ravel(), method='L-BFGS-B')
        val = float(res.fun)
        if val < best[0]:
            best = (val, res.x.reshape(K, C).copy())
        if si % 16 == 0:
            print(f'  [{name}] start {si}/{starts} best_obj={best[0]:.5f}', flush=True)
    Z_bin = best[1]
    W0 = softmax_cols(Z_bin)
    Wc = apply_caps_with_nbcap(W0, nb_cap_local, explicit_caps_local)
    # Blend
    P_oof_bin = geo_pool_log_classwise(OOFs_bin, Wc)
    P_test_bin = geo_pool_log_classwise(TESTs_bin, Wc)
    oof_pre = log_loss(y_bin, P_oof_bin, labels=classes)
    # Per-bin classwise temperature scaling with tighter bounds
    def scale_classwise(P, Tvec):
        T = np.asarray(Tvec, dtype=np.float64)
        S = np.clip(P, 1e-12, 1-1e-12) ** (1.0 / T[None, :])
        return S / S.sum(axis=1, keepdims=True)
    bounds = [(0.75, 1.5)] * C
    resTc = minimize(lambda t: log_loss(y_bin, scale_classwise(P_oof_bin, t), labels=classes),
                     x0=np.ones(C, dtype=np.float64), method='L-BFGS-B', bounds=bounds)
    T_class = resTc.x
    P_oof_scaled = scale_classwise(P_oof_bin, T_class)
    P_test_scaled = scale_classwise(P_test_bin, T_class)
    oof_final_bin = log_loss(y_bin, P_oof_scaled, labels=classes)
    print(f'  [{name}] OOF pre-temp={oof_pre:.5f} final={oof_final_bin:.5f} T={np.round(T_class,4)}', flush=True)
    # Stitch
    final_oof[idx_tr] = P_oof_scaled
    final_test[idx_te] = P_test_scaled

# Run bins with per-bin nb_cap and per-bin mnb tiny caps
run_bin('short', mask_short, test_mask_short, nb_cap_local=0.66, mnb_cap_local=0.04)
run_bin('mid',   mask_mid,   test_mask_mid,   nb_cap_local=0.62, mnb_cap_local=0.035)
run_bin('long',  mask_long,  test_mask_long,  nb_cap_local=0.58, mnb_cap_local=0.03)

# Overall OOF
oof_loss = log_loss(y, final_oof, labels=classes)
print('Length-gated overall OOF:', round(oof_loss,5), 'elapsed=%.2fs' % (time.time()-t0), flush=True)

# Confidence overlay post-processing
P_lop_oof = final_oof.copy(); P_lop_test = final_test.copy()
P_hero_oof = load('oof_lr_char_1_8_hero.npy'); P_hero_test = load('test_lr_char_1_8_hero.npy')
assert P_hero_oof is not None and P_hero_test is not None, 'Missing hero model preds for overlay'
P_hero_oof = np.clip(P_hero_oof, 1e-12, 1-1e-12); P_hero_oof /= P_hero_oof.sum(axis=1, keepdims=True)
P_hero_test = np.clip(P_hero_test, 1e-12, 1-1e-12); P_hero_test /= P_hero_test.sum(axis=1, keepdims=True)

def apply_overlay(P_base, P_aux, mask_conf, alpha=0.8):
    # mask_conf is True on low-confidence rows (max prob < 0.46 at the call sites below).
    # Those rows become alpha*P_base + (1-alpha)*P_aux; all other rows keep P_base.
    P = P_base.copy()
    P[mask_conf] = alpha * P_base[mask_conf] + (1.0 - alpha) * P_aux[mask_conf]
    P = np.clip(P, 1e-12, 1-1e-12); P /= P.sum(axis=1, keepdims=True)
    return P

# Define low-confidence masks
mask_low_oof = (P_lop_oof.max(axis=1) < 0.46)
mask_low_test = (P_lop_test.max(axis=1) < 0.46)
P_final_oof = apply_overlay(P_lop_oof, P_hero_oof, mask_low_oof, alpha=0.8)
P_final_test = apply_overlay(P_lop_test, P_hero_test, mask_low_test, alpha=0.8)

# Optional micro prior: boost MWS on very short texts (<=60 chars) by 1.02
short_mask_60_tr = (train_len <= 60)
short_mask_60_te = (test_len <= 60)
def boost_mws(P, mask):
    P2 = P.copy()
    j = classes.index('MWS')
    if mask.any():
        P2[mask, j] *= 1.02
        P2[mask] /= P2[mask].sum(axis=1, keepdims=True)
    return P2
P_final_oof = boost_mws(P_final_oof, short_mask_60_tr)
P_final_test = boost_mws(P_final_test, short_mask_60_te)

# Report OOF after overlay
oof_loss_final = log_loss(y, P_final_oof, labels=classes)
print('Post-overlay OOF:', round(oof_loss_final,5))

# Safety checks and save submission
probs = P_final_test[:, [classes.index(c) for c in submit_cols]]
probs = np.clip(probs, 1e-12, 1-1e-12)
probs = probs / probs.sum(axis=1, keepdims=True)
assert probs.shape[0] == len(test) and probs.shape[1] == 3, f'Bad submission shape {probs.shape}'
sub = pd.DataFrame(probs, columns=submit_cols)
sub.insert(0, 'id', test['id'].values)
assert sub.shape[0] == 1958, f'Expected 1958 rows, got {sub.shape[0]}'
assert np.all(np.isfinite(sub[submit_cols].values)), 'NaNs/Infs in submission'
row_sums = sub[submit_cols].sum(axis=1).values
assert np.allclose(row_sums, 1.0, atol=1e-6), 'Row sums not ~1'
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv (Run A: gated LOP + overlay). Elapsed=%.2fs' % (time.time()-t0), flush=True)

Length-gated LOP with models: ['nbsvm_wc_tweaked', 'nbsvm_char_2_6_counts', 'nbsvm_wc_fixed', 'nbsvm_char_2_7_presence', 'lr_wc_fixed', 'lr_word13_charwb36', 'lr_char_1_8_hero', 'svc_charwb_1_6_sig', 'lr_wordpunct_1_3', 'mnb_char_2_6']


Per-model T (global): {'nbsvm_wc_tweaked': 1.55, 'nbsvm_char_2_6_counts': 1.579, 'nbsvm_wc_fixed': 1.567, 'nbsvm_char_2_7_presence': 1.477, 'lr_wc_fixed': 0.877, 'lr_word13_charwb36': 1.006, 'lr_char_1_8_hero': 0.871, 'svc_charwb_1_6_sig': 0.803, 'lr_wordpunct_1_3': 0.771, 'mnb_char_2_6': 1.17}
Per-model OOF (post scalar cal): {'nbsvm_wc_tweaked': 0.32156, 'nbsvm_char_2_6_counts': 0.33019, 'nbsvm_wc_fixed': 0.33633, 'nbsvm_char_2_7_presence': 0.338, 'lr_wc_fixed': 0.36343, 'lr_word13_charwb36': 0.36825, 'lr_char_1_8_hero': 0.36987, 'svc_charwb_1_6_sig': 0.39657, 'lr_wordpunct_1_3': 0.40368, 'mnb_char_2_6': 0.39492}
Bin sizes: {'short': 6232, 'mid': 6419, 'long': 4970}
  [short] start 16/64 best_obj=0.44223
  [short] start 32/64 best_obj=0.44223
  [short] start 48/64 best_obj=0.44223
  [short] start 64/64 best_obj=0.44223
  [short] OOF pre-temp=0.45688 final=0.45585 T=[1.0988 0.9619 1.0017]
  [mid] start 16/64 best_obj=0.27634
  [mid] start 32/64 best_obj=0.27624
  [mid] start 48/64 best_obj=0.27619
  [mid] start 64/64 best_obj=0.27619
  [mid] OOF pre-temp=0.29132 final=0.28391 T=[0.75   0.9423 0.9038]
  [long] start 16/64 best_obj=0.13268
  [long] start 32/64 best_obj=0.13268
  [long] start 48/64 best_obj=0.13268
  [long] start 64/64 best_obj=0.13263
  [long] OOF pre-temp=0.14728 final=0.13967 T=[0.7819 0.7883 0.75  ]
Length-gated overall OOF: 0.30404 elapsed=631.55s
Post-overlay OOF: 0.30419
Saved submission.csv (Run A: gated LOP + overlay). Elapsed=631.58s
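
A quick sanity check on the Run A log: the three length bins partition the training rows and `log_loss` is a per-row mean, so the stitched overall OOF should equal the size-weighted mean of the per-bin `final` losses. The sketch below redoes that arithmetic from the logged numbers. The helper name `pooled_log_loss` is hypothetical, and the inputs are the rounded log values, so agreement is only expected to about 5 decimals.

```python
# Values copied from the Run A log (already rounded to 5 decimals there).
bin_sizes = {"short": 6232, "mid": 6419, "long": 4970}
bin_losses = {"short": 0.45585, "mid": 0.28391, "long": 0.13967}


def pooled_log_loss(sizes, losses):
    """Size-weighted mean of per-bin mean log losses (hypothetical helper)."""
    total = sum(sizes.values())
    return sum(sizes[b] * losses[b] for b in sizes) / total


overall = pooled_log_loss(bin_sizes, bin_losses)
print(f"reconstructed overall OOF: {overall:.5f}  (logged: 0.30404)")
```

If the reconstructed value drifted from the logged one by more than rounding error, that would suggest rows missing from every bin or falling into more than one.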


In [126]:
# 4-bin length-gated Classwise LOP (Run B): refine bins, per-bin NB caps, tiny-cap MNB, stabilized temps, confidence overlay
import time
import numpy as np
import pandas as pd
from pathlib import Path
from scipy.optimize import minimize, minimize_scalar
from sklearn.metrics import log_loss

t0 = time.time()
train = pd.read_csv('train.csv'); test = pd.read_csv('test.csv')
y = train['author'].values
classes = np.unique(y).tolist()
submit_cols = ['EAP','HPL','MWS']
assert set(classes) == set(submit_cols)
C = len(classes)

def load(p):
    return np.load(p) if Path(p).exists() else None

# Same portfolio as Run A (10-core including MNB)
cands = [
    ('nbsvm_wc_tweaked',        'oof_nbsvm_wc_tweaked.npy',        'test_nbsvm_wc_tweaked.npy'),
    ('nbsvm_char_2_6_counts',   'oof_nbsvm_char_2_6_counts.npy',   'test_nbsvm_char_2_6_counts.npy'),
    ('nbsvm_wc_fixed',          'oof_nbsvm_wc_fixed.npy',          'test_nbsvm_wc_fixed.npy'),
    ('nbsvm_char_2_7_presence', 'oof_nbsvm_char_2_7_presence.npy', 'test_nbsvm_char_2_7_presence.npy'),
    ('lr_wc_fixed',             'oof_lr_wordchar_fixed.npy',       'test_lr_wordchar_fixed.npy'),
    ('lr_word13_charwb36',      'oof_lr_word13_charwb36.npy',      'test_lr_word13_charwb36.npy'),
    ('lr_char_1_8_hero',        'oof_lr_char_1_8_hero.npy',        'test_lr_char_1_8_hero.npy'),
    ('svc_charwb_1_6_sig',      'oof_svc_charwb_1_6_sig.npy',      'test_svc_charwb_1_6_sig.npy'),
    ('lr_wordpunct_1_3',        'oof_lr_wordpunct_1_3.npy',        'test_lr_wordpunct_1_3.npy'),
    ('mnb_char_2_6',            'oof_mnb_char_2_6.npy',            'test_mnb_char_2_6.npy'),
]
loaded = []
for name, oofp, tsp in cands:
    o = load(oofp); t = load(tsp)
    assert o is not None and t is not None, f'Missing preds for {name}'
    o = np.clip(o, 1e-12, 1-1e-12); o = o / o.sum(axis=1, keepdims=True)
    t = np.clip(t, 1e-12, 1-1e-12); t = t / t.sum(axis=1, keepdims=True)
    loaded.append((name, o.astype(np.float64), t.astype(np.float64)))
names = [n for n,_,_ in loaded]
K = len(names)
print('Length-gated LOP (4-bin) with models:', names, flush=True)

def scale_probs_scalar(P, T):
    S = np.clip(P, 1e-12, 1-1e-12) ** (1.0/float(T))
    return S / S.sum(axis=1, keepdims=True)

OOFs_raw = [o for _,o,_ in loaded]
TESTs_raw = [t for _,_,t in loaded]

# Global per-model temps (shrink target)
per_model_T_global = []; OOFs_global = []; TESTs_global = []
for i in range(K):
    Pi = OOFs_raw[i]
    resTi = minimize_scalar(lambda T: log_loss(y, scale_probs_scalar(Pi, T), labels=classes),
                            bounds=(0.5, 5.0), method='bounded')
    Ti = float(resTi.x)
    per_model_T_global.append(Ti)
    OOFs_global.append(scale_probs_scalar(OOFs_raw[i], Ti))
    TESTs_global.append(scale_probs_scalar(TESTs_raw[i], Ti))
per_oof = {names[i]: log_loss(y, OOFs_global[i], labels=classes) for i in range(K)}
print('Per-model T (global):', {names[i]: round(per_model_T_global[i],3) for i in range(K)})
print('Per-model OOF (post scalar cal):', {k: round(v,5) for k,v in per_oof.items()})

def geo_pool_log_classwise(stacks, W):
    n = stacks[0].shape[0]
    A = np.zeros((n, C), dtype=np.float64)
    for k in range(K):
        A += np.log(stacks[k]) * W[k][None, :]
    A -= A.max(axis=1, keepdims=True)
    P = np.exp(A); P /= P.sum(axis=1, keepdims=True)
    return P

def softmax_cols(Z):
    W = np.zeros_like(Z)
    for j in range(C):
        z = Z[:, j]
        z = z - z.max()
        e = np.exp(z); s = e.sum()
        W[:, j] = e / (s if s>0 else 1.0)
    return W

# Regularization and caps
lambda_ent = 0.0025
global_cap = 0.55
weak_cap = 0.09
tiny_prune_thresh = 0.00
nb_mask = np.array([n.startswith('nbsvm_') for n in names], dtype=bool)
name_to_idx = {n:i for i,n in enumerate(names)}

def apply_caps_with_nbcap(W, nb_cap_local, explicit_caps_local):
    Wc = W.copy()
    Wc = np.minimum(Wc, global_cap)
    for n, cap in explicit_caps_local.items():
        if n in name_to_idx:
            i = name_to_idx[n]
            Wc[i, :] = np.minimum(Wc[i, :], cap)
    for i, n in enumerate(names):
        if per_oof.get(n, 1.0) > 0.40:
            Wc[i, :] = np.minimum(Wc[i, :], weak_cap)
    for j in range(C):
        s_nb = Wc[nb_mask, j].sum()
        if s_nb > nb_cap_local and s_nb > 0:
            Wc[nb_mask, j] *= (nb_cap_local / s_nb)
    for j in range(C):
        col = Wc[:, j]
        if tiny_prune_thresh > 0:
            col[col < tiny_prune_thresh] = 0.0
        s = col.sum()
        if s == 0:
            col[:] = 1.0 / K
        else:
            col[:] = col / s
        Wc[:, j] = col
    return Wc

def make_objective(OOFs_subset, nb_cap_local, explicit_caps_local, y_bin):
    def objective(Z):
        W0 = softmax_cols(Z)
        Wc = apply_caps_with_nbcap(W0, nb_cap_local, explicit_caps_local)
        P = geo_pool_log_classwise(OOFs_subset, Wc)
        ent = 0.0
        for j in range(C):
            wj = np.clip(Wc[:, j], 1e-12, 1.0)
            ent += float(np.sum(wj * np.log(wj)))
        reg = lambda_ent * ent
        return log_loss(y_bin, P, labels=classes) + reg
    return objective

# 4-bin thresholds: <=80, 81-130, 131-200, >200
train_len = train['text'].astype(str).str.len().values
test_len  = test['text'].astype(str).str.len().values
b1, b2, b3 = 80, 130, 200
mask_vshort = (train_len <= b1)
mask_short  = (train_len > b1) & (train_len <= b2)
mask_mid    = (train_len > b2) & (train_len <= b3)
mask_long   = (train_len > b3)
test_vshort = (test_len <= b1)
test_short  = (test_len > b1) & (test_len <= b2)
test_mid    = (test_len > b2) & (test_len <= b3)
test_long   = (test_len > b3)
print('4-bin sizes:', {'vshort': int(mask_vshort.sum()), 'short': int(mask_short.sum()), 'mid': int(mask_mid.sum()), 'long': int(mask_long.sum())}, flush=True)

Z_global = None
try:
    Z_global = Z_star.copy()
except NameError:
    Z_global = None
Z_warm = np.zeros((K, C), dtype=np.float64)
if Z_global is not None and getattr(Z_global, 'shape', None) == (K, C):
    Z_warm = Z_global

final_oof = np.zeros((len(train), C), dtype=np.float64)
final_test = np.zeros((len(test), C), dtype=np.float64)

def run_bin(name, tr_mask, te_mask, nb_cap_local, mnb_cap_local):
    idx_tr = np.where(tr_mask)[0]
    idx_te = np.where(te_mask)[0]
    if len(idx_tr) == 0:
        return
    y_bin = y[idx_tr]
    OOFs_bin = []; TESTs_bin = []
    for i in range(K):
        Pi_tr = OOFs_raw[i][idx_tr]
        resTi = minimize_scalar(lambda T: log_loss(y_bin, scale_probs_scalar(Pi_tr, T), labels=classes),
                                bounds=(0.75, 1.5), method='bounded')
        Ti_bin = float(resTi.x)
        Ti_shrunk = 0.7 * float(per_model_T_global[i]) + 0.3 * Ti_bin
        OOFs_bin.append(scale_probs_scalar(OOFs_raw[i][idx_tr], Ti_shrunk))
        TESTs_bin.append(scale_probs_scalar(TESTs_raw[i][idx_te], Ti_shrunk))
    starts = 64
    rng = np.random.RandomState(42)
    inits = [Z_warm.copy()] + [rng.normal(0, 0.5, size=(K, C)) for _ in range(starts-1)]
    best = (1e9, None)
    explicit_caps_local = {
        'svc_charwb_1_6_sig': 0.06,
        'lr_wordpunct_1_3': 0.05,
        'mnb_char_2_6': mnb_cap_local,
    }
    obj = make_objective(OOFs_bin, nb_cap_local, explicit_caps_local, y_bin)
    for si, Z0 in enumerate(inits, 1):
        res = minimize(lambda z: obj(z.reshape(K, C)), Z0.ravel(), method='L-BFGS-B')
        val = float(res.fun)
        if val < best[0]:
            best = (val, res.x.reshape(K, C).copy())
        if si % 16 == 0:
            print(f'  [{name}] start {si}/{starts} best_obj={best[0]:.5f}', flush=True)
    Z_bin = best[1]
    W0 = softmax_cols(Z_bin)
    Wc = apply_caps_with_nbcap(W0, nb_cap_local, explicit_caps_local)
    P_oof_bin = geo_pool_log_classwise(OOFs_bin, Wc)
    P_test_bin = geo_pool_log_classwise(TESTs_bin, Wc)
    oof_pre = log_loss(y_bin, P_oof_bin, labels=classes)
    def scale_classwise(P, Tvec):
        T = np.asarray(Tvec, dtype=np.float64)
        S = np.clip(P, 1e-12, 1-1e-12) ** (1.0 / T[None, :])
        return S / S.sum(axis=1, keepdims=True)
    bounds = [(0.75, 1.5)] * C
    resTc = minimize(lambda t: log_loss(y_bin, scale_classwise(P_oof_bin, t), labels=classes),
                     x0=np.ones(C, dtype=np.float64), method='L-BFGS-B', bounds=bounds)
    T_class = resTc.x
    P_oof_scaled = scale_classwise(P_oof_bin, T_class)
    P_test_scaled = scale_classwise(P_test_bin, T_class)
    oof_final_bin = log_loss(y_bin, P_oof_scaled, labels=classes)
    print(f'  [{name}] OOF pre-temp={oof_pre:.5f} final={oof_final_bin:.5f} T={np.round(T_class,4)}', flush=True)
    final_oof[idx_tr] = P_oof_scaled
    final_test[idx_te] = P_test_scaled

# Per-bin NB caps per expert for 4-bin: [0.68, 0.65, 0.62, 0.58]
# MNB tiny caps per-bin (extend slightly for 4th bin): [0.042, 0.038, 0.032, 0.028]
run_bin('vshort', mask_vshort, test_vshort, nb_cap_local=0.68, mnb_cap_local=0.042)
run_bin('short',  mask_short,  test_short,  nb_cap_local=0.65, mnb_cap_local=0.038)
run_bin('mid',    mask_mid,    test_mid,    nb_cap_local=0.62, mnb_cap_local=0.032)
run_bin('long',   mask_long,   test_long,   nb_cap_local=0.58, mnb_cap_local=0.028)

oof_loss = log_loss(y, final_oof, labels=classes)
print('4-bin Length-gated overall OOF:', round(oof_loss,5), 'elapsed=%.2fs' % (time.time()-t0), flush=True)

# Confidence overlay with hero model on low-confidence rows (max<0.46):
def apply_overlay(P_base, P_aux, mask_conf, alpha=0.8):
    P = P_base.copy()
    low_conf = mask_conf
    P[low_conf] = (alpha * P_base[low_conf] + (1.0 - alpha) * P_aux[low_conf])
    P = np.clip(P, 1e-12, 1-1e-12); P /= P.sum(axis=1, keepdims=True)
    return P

P_lop_oof = final_oof.copy(); P_lop_test = final_test.copy()
P_hero_oof = load('oof_lr_char_1_8_hero.npy'); P_hero_test = load('test_lr_char_1_8_hero.npy')
assert P_hero_oof is not None and P_hero_test is not None, 'Missing hero model preds for overlay'
P_hero_oof = np.clip(P_hero_oof, 1e-12, 1-1e-12); P_hero_oof /= P_hero_oof.sum(axis=1, keepdims=True)
P_hero_test = np.clip(P_hero_test, 1e-12, 1-1e-12); P_hero_test /= P_hero_test.sum(axis=1, keepdims=True)
mask_low_oof = (P_lop_oof.max(axis=1) < 0.46)
mask_low_test = (P_lop_test.max(axis=1) < 0.46)
P_final_oof = apply_overlay(P_lop_oof, P_hero_oof, mask_low_oof, alpha=0.8)
P_final_test = apply_overlay(P_lop_test, P_hero_test, mask_low_test, alpha=0.8)

# Optional micro prior on very short (<=60) MWS x1.02
short_mask_60_tr = (train_len <= 60)
short_mask_60_te = (test_len <= 60)
def boost_mws(P, mask):
    P2 = P.copy()
    j = classes.index('MWS')
    if mask.any():
        P2[mask, j] *= 1.02
        P2[mask] /= P2[mask].sum(axis=1, keepdims=True)
    return P2
P_final_oof = boost_mws(P_final_oof, short_mask_60_tr)
P_final_test = boost_mws(P_final_test, short_mask_60_te)

oof_loss_final = log_loss(y, P_final_oof, labels=classes)
print('4-bin Post-overlay OOF:', round(oof_loss_final,5))

probs = P_final_test[:, [classes.index(c) for c in submit_cols]]
probs = np.clip(probs, 1e-12, 1-1e-12); probs = probs / probs.sum(axis=1, keepdims=True)
sub = pd.DataFrame(probs, columns=submit_cols)
sub.insert(0, 'id', test['id'].values)
assert sub.shape[0] == 1958
assert np.all(np.isfinite(sub[submit_cols].values))
assert np.allclose(sub[submit_cols].sum(axis=1).values, 1.0, atol=1e-6)
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv (Run B: 4-bin gated LOP + overlay). Elapsed=%.2fs' % (time.time()-t0), flush=True)

Length-gated LOP (4-bin) with models: ['nbsvm_wc_tweaked', 'nbsvm_char_2_6_counts', 'nbsvm_wc_fixed', 'nbsvm_char_2_7_presence', 'lr_wc_fixed', 'lr_word13_charwb36', 'lr_char_1_8_hero', 'svc_charwb_1_6_sig', 'lr_wordpunct_1_3', 'mnb_char_2_6']


Per-model T (global): {'nbsvm_wc_tweaked': 1.55, 'nbsvm_char_2_6_counts': 1.579, 'nbsvm_wc_fixed': 1.567, 'nbsvm_char_2_7_presence': 1.477, 'lr_wc_fixed': 0.877, 'lr_word13_charwb36': 1.006, 'lr_char_1_8_hero': 0.871, 'svc_charwb_1_6_sig': 0.803, 'lr_wordpunct_1_3': 0.771, 'mnb_char_2_6': 1.17}
Per-model OOF (post scalar cal): {'nbsvm_wc_tweaked': 0.32156, 'nbsvm_char_2_6_counts': 0.33019, 'nbsvm_wc_fixed': 0.33633, 'nbsvm_char_2_7_presence': 0.338, 'lr_wc_fixed': 0.36343, 'lr_word13_charwb36': 0.36825, 'lr_char_1_8_hero': 0.36987, 'svc_charwb_1_6_sig': 0.39657, 'lr_wordpunct_1_3': 0.40368, 'mnb_char_2_6': 0.39492}
4-bin sizes: {'vshort': 4330, 'short': 4674, 'mid': 4698, 'long': 3919}
  [vshort] start 16/64 best_obj=0.47375
  [vshort] start 32/64 best_obj=0.47375
  [vshort] start 48/64 best_obj=0.47371
  [vshort] start 64/64 best_obj=0.47345
  [vshort] OOF pre-temp=0.48815 final=0.48532 T=[1.19   0.9634 1.0137]
  [short] start 16/64 best_obj=0.34177
  [short] start 32/64 best_obj=0.34177
  [short] start 48/64 best_obj=0.34177
  [short] start 64/64 best_obj=0.34177
  [short] OOF pre-temp=0.35646 final=0.35279 T=[0.8029 0.9461 0.9456]
  [mid] start 16/64 best_obj=0.22552
  [mid] start 32/64 best_obj=0.22512
  [mid] start 48/64 best_obj=0.22512
  [mid] start 64/64 best_obj=0.22510
  [mid] OOF pre-temp=0.24028 final=0.23172 T=[0.75   0.9341 0.8437]
  [long] start 16/64 best_obj=0.12151
  [long] start 32/64 best_obj=0.12148
  [long] start 48/64 best_obj=0.12141
  [long] start 64/64 best_obj=0.12141
  [long] OOF pre-temp=0.13632 final=0.12852 T=[0.8209 0.75   0.75  ]
4-bin Length-gated overall OOF: 0.3032 elapsed=672.57s
4-bin Post-overlay OOF: 0.30335
Saved submission.csv (Run B: 4-bin gated LOP + overlay). Elapsed=672.60s
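
In every bin of the Run B log, `best_obj` is lower than `OOF pre-temp`. That follows from how the objective is built: it is the bin log loss plus `lambda_ent * sum(w * log(w))`, and the second term is never positive. The gap between the two logged numbers therefore gives the total entropy of the fitted weight columns. The sketch below backs that out from the log; `implied_entropy` is a hypothetical helper, and the "max" it reports is the uniform-weights upper bound, which ignores the caps.

```python
import math

LAMBDA_ENT = 0.0025  # lambda_ent from the cell above
K, C = 10, 3         # models in the portfolio, classes

# (best_obj after 64 starts, OOF pre-temp) copied from the Run B log.
run_b = {
    "vshort": (0.47345, 0.48815),
    "short": (0.34177, 0.35646),
    "mid": (0.22510, 0.24028),
    "long": (0.12141, 0.13632),
}


def implied_entropy(best_obj, oof_pre, lam=LAMBDA_ENT):
    """Total weight entropy H = -sum_j sum_k w_kj * log(w_kj), in nats.

    objective = log_loss + lam * sum(w * log(w)) = log_loss - lam * H
    """
    return (oof_pre - best_obj) / lam


max_entropy = C * math.log(K)  # every class column uniform over K models
for name, (obj, pre) in run_b.items():
    h = implied_entropy(obj, pre)
    print(f"{name:>6}: H ~ {h:.2f} nats ({h / max_entropy:.0%} of max {max_entropy:.2f})")
```

Because the logged values are rounded to 5 decimals, each implied entropy is only accurate to roughly ±0.004 nats.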


In [125]:
# 3-bin length-gated LOP (Run C): swap lr_wordpunct_1_3 -> lr_char_1_7; keep stabilizers, mnb tiny-cap, overlay
import time
import numpy as np
import pandas as pd
from pathlib import Path
from scipy.optimize import minimize, minimize_scalar
from sklearn.metrics import log_loss

t0 = time.time()
train = pd.read_csv('train.csv'); test = pd.read_csv('test.csv')
y = train['author'].values
classes = np.unique(y).tolist()
submit_cols = ['EAP','HPL','MWS']
assert set(classes) == set(submit_cols)
C = len(classes)

def load(p):
    return np.load(p) if Path(p).exists() else None

# Portfolio swap: replace lr_wordpunct_1_3 with lr_char_1_7 (keep 10-core including MNB)
cands = [
    ('nbsvm_wc_tweaked',        'oof_nbsvm_wc_tweaked.npy',        'test_nbsvm_wc_tweaked.npy'),
    ('nbsvm_char_2_6_counts',   'oof_nbsvm_char_2_6_counts.npy',   'test_nbsvm_char_2_6_counts.npy'),
    ('nbsvm_wc_fixed',          'oof_nbsvm_wc_fixed.npy',          'test_nbsvm_wc_fixed.npy'),
    ('nbsvm_char_2_7_presence', 'oof_nbsvm_char_2_7_presence.npy', 'test_nbsvm_char_2_7_presence.npy'),
    ('lr_wc_fixed',             'oof_lr_wordchar_fixed.npy',       'test_lr_wordchar_fixed.npy'),
    ('lr_word13_charwb36',      'oof_lr_word13_charwb36.npy',      'test_lr_word13_charwb36.npy'),
    ('lr_char_1_8_hero',        'oof_lr_char_1_8_hero.npy',        'test_lr_char_1_8_hero.npy'),
    ('svc_charwb_1_6_sig',      'oof_svc_charwb_1_6_sig.npy',      'test_svc_charwb_1_6_sig.npy'),
    ('lr_char_1_7',             'oof_lr_char_1_7.npy',             'test_lr_char_1_7.npy'),
    ('mnb_char_2_6',            'oof_mnb_char_2_6.npy',            'test_mnb_char_2_6.npy'),
]
loaded = []
for name, oofp, tsp in cands:
    o = load(oofp); t = load(tsp)
    assert o is not None and t is not None, f'Missing preds for {name}'
    o = np.clip(o, 1e-12, 1-1e-12); o = o / o.sum(axis=1, keepdims=True)
    t = np.clip(t, 1e-12, 1-1e-12); t = t / t.sum(axis=1, keepdims=True)
    loaded.append((name, o.astype(np.float64), t.astype(np.float64)))
names = [n for n,_,_ in loaded]
K = len(names)
print('Length-gated LOP (Run C) with models:', names, flush=True)

def scale_probs_scalar(P, T):
    S = np.clip(P, 1e-12, 1-1e-12) ** (1.0/float(T))
    return S / S.sum(axis=1, keepdims=True)

OOFs_raw = [o for _,o,_ in loaded]
TESTs_raw = [t for _,_,t in loaded]

# Global per-model temps (shrink target)
per_model_T_global = []; OOFs_global = []; TESTs_global = []
for i in range(K):
    Pi = OOFs_raw[i]
    resTi = minimize_scalar(lambda T: log_loss(y, scale_probs_scalar(Pi, T), labels=classes),
                            bounds=(0.5, 5.0), method='bounded')
    Ti = float(resTi.x)
    per_model_T_global.append(Ti)
    OOFs_global.append(scale_probs_scalar(OOFs_raw[i], Ti))
    TESTs_global.append(scale_probs_scalar(TESTs_raw[i], Ti))
per_oof = {names[i]: log_loss(y, OOFs_global[i], labels=classes) for i in range(K)}
print('Per-model T (global):', {names[i]: round(per_model_T_global[i],3) for i in range(K)})
print('Per-model OOF (post scalar cal):', {k: round(v,5) for k,v in per_oof.items()})

def geo_pool_log_classwise(stacks, W):
    n = stacks[0].shape[0]
    A = np.zeros((n, C), dtype=np.float64)
    for k in range(K):
        A += np.log(stacks[k]) * W[k][None, :]
    A -= A.max(axis=1, keepdims=True)
    P = np.exp(A); P /= P.sum(axis=1, keepdims=True)
    return P

def softmax_cols(Z):
    W = np.zeros_like(Z)
    for j in range(C):
        z = Z[:, j]
        z = z - z.max()
        e = np.exp(z); s = e.sum()
        W[:, j] = e / (s if s>0 else 1.0)
    return W

# Regularization and caps (Run A/B settings)
lambda_ent = 0.0025
global_cap = 0.55
weak_cap = 0.09
tiny_prune_thresh = 0.00
nb_mask = np.array([n.startswith('nbsvm_') for n in names], dtype=bool)
name_to_idx = {n:i for i,n in enumerate(names)}

def apply_caps_with_nbcap(W, nb_cap_local, explicit_caps_local):
    Wc = W.copy()
    Wc = np.minimum(Wc, global_cap)
    for n, cap in explicit_caps_local.items():
        if n in name_to_idx:
            i = name_to_idx[n]
            Wc[i, :] = np.minimum(Wc[i, :], cap)
    for i, n in enumerate(names):
        if per_oof.get(n, 1.0) > 0.40:
            Wc[i, :] = np.minimum(Wc[i, :], weak_cap)
    for j in range(C):
        s_nb = Wc[nb_mask, j].sum()
        if s_nb > nb_cap_local and s_nb > 0:
            Wc[nb_mask, j] *= (nb_cap_local / s_nb)
    for j in range(C):
        col = Wc[:, j]
        if tiny_prune_thresh > 0:
            col[col < tiny_prune_thresh] = 0.0
        s = col.sum()
        if s == 0:
            col[:] = 1.0 / K
        else:
            col[:] = col / s
        Wc[:, j] = col
    return Wc

def make_objective(OOFs_subset, nb_cap_local, explicit_caps_local, y_bin):
    def objective(Z):
        W0 = softmax_cols(Z)
        Wc = apply_caps_with_nbcap(W0, nb_cap_local, explicit_caps_local)
        P = geo_pool_log_classwise(OOFs_subset, Wc)
        ent = 0.0
        for j in range(C):
            wj = np.clip(Wc[:, j], 1e-12, 1.0)
            ent += float(np.sum(wj * np.log(wj)))
        reg = lambda_ent * ent
        return log_loss(y_bin, P, labels=classes) + reg
    return objective

# 3-bin thresholds: <=100, 101-180, >180
train_len = train['text'].astype(str).str.len().values
test_len  = test['text'].astype(str).str.len().values
short_thr, mid_lo, mid_hi = 100, 101, 180
mask_short = (train_len <= short_thr)
mask_mid   = (train_len >= mid_lo) & (train_len <= mid_hi)
mask_long  = (train_len > mid_hi)
test_mask_short = (test_len <= short_thr)
test_mask_mid   = (test_len >= mid_lo) & (test_len <= mid_hi)
test_mask_long  = (test_len > mid_hi)
print('3-bin sizes:', {'short': int(mask_short.sum()), 'mid': int(mask_mid.sum()), 'long': int(mask_long.sum())}, flush=True)

Z_global = None
try:
    Z_global = Z_star.copy()
except NameError:
    Z_global = None
Z_warm = np.zeros((K, C), dtype=np.float64)
if Z_global is not None and getattr(Z_global, 'shape', None) == (K, C):
    Z_warm = Z_global

final_oof = np.zeros((len(train), C), dtype=np.float64)
final_test = np.zeros((len(test), C), dtype=np.float64)

def run_bin(name, tr_mask, te_mask, nb_cap_local, mnb_cap_local):
    idx_tr = np.where(tr_mask)[0]
    idx_te = np.where(te_mask)[0]
    if len(idx_tr) == 0:
        return
    y_bin = y[idx_tr]
    OOFs_bin = []; TESTs_bin = []
    # Per-bin per-model temps with bounds and shrink toward global
    for i in range(K):
        Pi_tr = OOFs_raw[i][idx_tr]
        resTi = minimize_scalar(lambda T: log_loss(y_bin, scale_probs_scalar(Pi_tr, T), labels=classes),
                                bounds=(0.75, 1.5), method='bounded')
        Ti_bin = float(resTi.x)
        Ti_shrunk = 0.7 * float(per_model_T_global[i]) + 0.3 * Ti_bin
        OOFs_bin.append(scale_probs_scalar(OOFs_raw[i][idx_tr], Ti_shrunk))
        TESTs_bin.append(scale_probs_scalar(TESTs_raw[i][idx_te], Ti_shrunk))
    starts = 64
    rng = np.random.RandomState(42)
    inits = [Z_warm.copy()] + [rng.normal(0, 0.5, size=(K, C)) for _ in range(starts-1)]
    best = (1e9, None)
    explicit_caps_local = {
        'svc_charwb_1_6_sig': 0.06,
        'lr_char_1_7': 0.50,  # no tiny cap; 0.50 sits just below global_cap (0.55)
        'mnb_char_2_6': mnb_cap_local,
    }
    obj = make_objective(OOFs_bin, nb_cap_local, explicit_caps_local, y_bin)
    for si, Z0 in enumerate(inits, 1):
        res = minimize(lambda z: obj(z.reshape(K, C)), Z0.ravel(), method='L-BFGS-B')
        val = float(res.fun)
        if val < best[0]:
            best = (val, res.x.reshape(K, C).copy())
        if si % 16 == 0:
            print(f'  [{name}] start {si}/{starts} best_obj={best[0]:.5f}', flush=True)
    Z_bin = best[1]
    W0 = softmax_cols(Z_bin)
    Wc = apply_caps_with_nbcap(W0, nb_cap_local, explicit_caps_local)
    P_oof_bin = geo_pool_log_classwise(OOFs_bin, Wc)
    P_test_bin = geo_pool_log_classwise(TESTs_bin, Wc)
    oof_pre = log_loss(y_bin, P_oof_bin, labels=classes)
    def scale_classwise(P, Tvec):
        T = np.asarray(Tvec, dtype=np.float64)
        S = np.clip(P, 1e-12, 1-1e-12) ** (1.0 / T[None, :])
        return S / S.sum(axis=1, keepdims=True)
    bounds = [(0.75, 1.5)] * C
    resTc = minimize(lambda t: log_loss(y_bin, scale_classwise(P_oof_bin, t), labels=classes),
                     x0=np.ones(C, dtype=np.float64), method='L-BFGS-B', bounds=bounds)
    T_class = resTc.x
    P_oof_scaled = scale_classwise(P_oof_bin, T_class)
    P_test_scaled = scale_classwise(P_test_bin, T_class)
    oof_final_bin = log_loss(y_bin, P_oof_scaled, labels=classes)
    print(f'  [{name}] OOF pre-temp={oof_pre:.5f} final={oof_final_bin:.5f} T={np.round(T_class,4)}', flush=True)
    final_oof[idx_tr] = P_oof_scaled
    final_test[idx_te] = P_test_scaled

# Per-bin NB caps per expert: short=0.66, mid=0.62, long=0.58; MNB tiny caps per-bin
run_bin('short', mask_short, test_mask_short, nb_cap_local=0.66, mnb_cap_local=0.04)
run_bin('mid',   mask_mid,   test_mask_mid,   nb_cap_local=0.62, mnb_cap_local=0.035)
run_bin('long',  mask_long,  test_mask_long,  nb_cap_local=0.58, mnb_cap_local=0.03)

# Overall OOF
oof_loss = log_loss(y, final_oof, labels=classes)
print('Run C 3-bin overall OOF:', round(oof_loss,5), 'elapsed=%.2fs' % (time.time()-t0), flush=True)

# Confidence overlay with hero model (max<0.46 => mix 0.8/0.2)
P_lop_oof = final_oof.copy(); P_lop_test = final_test.copy()
P_hero_oof = load('oof_lr_char_1_8_hero.npy'); P_hero_test = load('test_lr_char_1_8_hero.npy')
assert P_hero_oof is not None and P_hero_test is not None, 'Missing hero model preds for overlay'
P_hero_oof = np.clip(P_hero_oof, 1e-12, 1-1e-12); P_hero_oof /= P_hero_oof.sum(axis=1, keepdims=True)
P_hero_test = np.clip(P_hero_test, 1e-12, 1-1e-12); P_hero_test /= P_hero_test.sum(axis=1, keepdims=True)
def apply_overlay(P_base, P_aux, mask_conf, alpha=0.8):
    P = P_base.copy()
    low_conf = mask_conf
    P[low_conf] = (alpha * P_base[low_conf] + (1.0 - alpha) * P_aux[low_conf])
    P = np.clip(P, 1e-12, 1-1e-12); P /= P.sum(axis=1, keepdims=True)
    return P
mask_low_oof = (P_lop_oof.max(axis=1) < 0.46)
mask_low_test = (P_lop_test.max(axis=1) < 0.46)
P_final_oof = apply_overlay(P_lop_oof, P_hero_oof, mask_low_oof, alpha=0.8)
P_final_test = apply_overlay(P_lop_test, P_hero_test, mask_low_test, alpha=0.8)

# Optional micro prior: boost MWS for very short (<=60 chars)
short_mask_60_tr = (train_len <= 60)
short_mask_60_te = (test_len <= 60)
def boost_mws(P, mask):
    P2 = P.copy()
    j = classes.index('MWS')
    if mask.any():
        P2[mask, j] *= 1.02
        P2[mask] /= P2[mask].sum(axis=1, keepdims=True)
    return P2
P_final_oof = boost_mws(P_final_oof, short_mask_60_tr)
P_final_test = boost_mws(P_final_test, short_mask_60_te)

oof_loss_final = log_loss(y, P_final_oof, labels=classes)
print('Run C Post-overlay OOF:', round(oof_loss_final,5))

# Save submission
probs = P_final_test[:, [classes.index(c) for c in submit_cols]]
probs = np.clip(probs, 1e-12, 1-1e-12); probs = probs / probs.sum(axis=1, keepdims=True)
sub = pd.DataFrame(probs, columns=submit_cols)
sub.insert(0, 'id', test['id'].values)
assert sub.shape[0] == 1958
assert np.all(np.isfinite(sub[submit_cols].values))
assert np.allclose(sub[submit_cols].sum(axis=1).values, 1.0, atol=1e-6)
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv (Run C: 3-bin swap + overlay). Elapsed=%.2fs' % (time.time()-t0), flush=True)

Length-gated LOP (Run C) with models: ['nbsvm_wc_tweaked', 'nbsvm_char_2_6_counts', 'nbsvm_wc_fixed', 'nbsvm_char_2_7_presence', 'lr_wc_fixed', 'lr_word13_charwb36', 'lr_char_1_8_hero', 'svc_charwb_1_6_sig', 'lr_char_1_7', 'mnb_char_2_6']


Per-model T (global): {'nbsvm_wc_tweaked': 1.55, 'nbsvm_char_2_6_counts': 1.579, 'nbsvm_wc_fixed': 1.567, 'nbsvm_char_2_7_presence': 1.477, 'lr_wc_fixed': 0.877, 'lr_word13_charwb36': 1.006, 'lr_char_1_8_hero': 0.871, 'svc_charwb_1_6_sig': 0.803, 'lr_char_1_7': 0.88, 'mnb_char_2_6': 1.17}
Per-model OOF (post scalar cal): {'nbsvm_wc_tweaked': 0.32156, 'nbsvm_char_2_6_counts': 0.33019, 'nbsvm_wc_fixed': 0.33633, 'nbsvm_char_2_7_presence': 0.338, 'lr_wc_fixed': 0.36343, 'lr_word13_charwb36': 0.36825, 'lr_char_1_8_hero': 0.36987, 'svc_charwb_1_6_sig': 0.39657, 'lr_char_1_7': 0.37154, 'mnb_char_2_6': 0.39492}
3-bin sizes: {'short': 6232, 'mid': 6419, 'long': 4970}


  [short] start 16/64 best_obj=0.44541


  [short] start 32/64 best_obj=0.44541


  [short] start 48/64 best_obj=0.44541


  [short] start 64/64 best_obj=0.44541


  [short] OOF pre-temp=0.45984 final=0.45836 T=[1.1331 0.9752 1.0227]


  [mid] start 16/64 best_obj=0.27953


  [mid] start 32/64 best_obj=0.27948


  [mid] start 48/64 best_obj=0.27948


  [mid] start 64/64 best_obj=0.27948


  [mid] OOF pre-temp=0.29433 final=0.28808 T=[0.75   0.9669 0.9296]


  [long] start 16/64 best_obj=0.13378


  [long] start 32/64 best_obj=0.13378


  [long] start 48/64 best_obj=0.13378


  [long] start 64/64 best_obj=0.13377


  [long] OOF pre-temp=0.14797 final=0.14139 T=[0.7961 0.797  0.75  ]


Run C 3-bin overall OOF: 0.30693 elapsed=707.53s


Run C Post-overlay OOF: 0.30707
Saved submission.csv (Run C: 3-bin swap + overlay). Elapsed=707.55s
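As a sanity check on the log above: the three bins partition the training rows, so the overall OOF log loss should equal the size-weighted mean of the per-bin `final=` values. A minimal sketch using the bin sizes and losses printed above (the helper name `weighted_bin_loss` is hypothetical, not part of the notebook):

```python
def weighted_bin_loss(sizes, losses):
    """Size-weighted mean of per-bin log losses (bins must partition the rows)."""
    total = sum(sizes.values())
    return sum(sizes[b] * losses[b] for b in sizes) / total

# Values copied from the Run C log above
sizes_c = {'short': 6232, 'mid': 6419, 'long': 4970}
losses_c = {'short': 0.45836, 'mid': 0.28808, 'long': 0.14139}

overall_c = weighted_bin_loss(sizes_c, losses_c)
print('reconstructed overall OOF:', round(overall_c, 5))
```

The reconstructed value agrees with the logged `Run C 3-bin overall OOF` up to rounding of the per-bin losses, which suggests every training row was assigned to exactly one bin.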


In [127]:
# New bases: Char 5-gram LM-like generative probs + Stylometric LR features (OOF/Test saved)
import numpy as np, pandas as pd, time, gc, re
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

t0 = time.time()
train = pd.read_csv('train.csv'); test = pd.read_csv('test.csv')
X_text = train['text'].astype(str).values
y = train['author'].values
X_test_text = test['text'].astype(str).values
classes = np.unique(y).tolist()
submit_cols = ['EAP','HPL','MWS']
assert set(classes) == set(submit_cols)
C = len(classes)
SEED = 42; N_FOLDS = 5
skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)

# --- Base 1: Char 5-gram LM-like (bag-of-ngrams generative model, additive smoothing with alpha=0.5 by default, no class prior) ---
def char5_lm_oof(alpha: float = 0.5, ngram_range=(5,5), lowercase=False):
    oof = np.zeros((len(train), C), dtype=np.float64)
    tpred = np.zeros((len(test), C), dtype=np.float64)
    for fold, (tr, va) in enumerate(skf.split(X_text, y), 1):
        x_tr, x_va = X_text[tr], X_text[va]
        y_tr, y_va = y[tr], y[va]
        vec = CountVectorizer(analyzer='char', ngram_range=ngram_range, min_df=1, lowercase=lowercase, strip_accents=None, dtype=np.int32)
        X_tr = vec.fit_transform(x_tr); X_va = vec.transform(x_va); X_te = vec.transform(X_test_text)
        V = X_tr.shape[1]
        # compute per-class smoothed log probs over ngrams
        class_sum = np.zeros(C, dtype=np.float64)
        class_counts = np.zeros((C, V), dtype=np.float64)
        for ci, c in enumerate(classes):
            mask = (y_tr == c)
            if mask.any():
                cnts = np.asarray(X_tr[mask].sum(axis=0)).ravel().astype(np.float64)
            else:
                cnts = np.zeros(V, dtype=np.float64)
            class_counts[ci] = cnts + alpha
            class_sum[ci] = class_counts[ci].sum()
        logp = np.log(class_counts) - np.log(class_sum[:, None])  # shape (C,V)
        # scores: per-class log-likelihood = doc n-gram counts · logp^T (no class prior added)
        def scores(Xm):
            # sparse (n_samples, V) times dense (V, C) -> dense (n_samples, C)
            S = np.asarray(Xm.dot(logp.T), dtype=np.float64)
            # softmax across classes to get probs; summed log-likelihoods over many n-grams
            # are very peaked, so these probs are overconfident until temperature-scaled
            S = S - S.max(axis=1, keepdims=True)
            P = np.exp(S); P /= P.sum(axis=1, keepdims=True)
            return P
        P_va = scores(X_va); P_te = scores(X_te)
        oof[va] = P_va
        tpred += P_te / N_FOLDS
        loss = log_loss(y_va, P_va, labels=classes)
        print(f'[char5lm] Fold {fold} loss={loss:.5f} V={V}', flush=True)
        del X_tr, X_va, X_te, class_counts, logp; gc.collect()
    oof_loss = log_loss(y, oof, labels=classes)
    print(f'[char5lm] OOF={oof_loss:.5f}', flush=True)
    return oof.astype(np.float32), tpred.astype(np.float32), float(oof_loss)

oof_char5lm, test_char5lm, loss_char5lm = char5_lm_oof(alpha=0.5, ngram_range=(5,5), lowercase=False)
np.save('oof_char5lm.npy', oof_char5lm); np.save('test_char5lm.npy', test_char5lm)
print('[char5lm] Saved oof_char5lm.npy and test_char5lm.npy')

# --- Base 2: Stylometric features + Logistic Regression (multinomial) ---
def compute_stylo(texts: np.ndarray) -> np.ndarray:
    feats = []
    for t in texts:
        s = t
        L = len(s)
        wc = len(s.split()) if L>0 else 0
        avg_wlen = (sum(len(w) for w in s.split())/wc) if wc>0 else 0.0
        p_excl = s.count('!')/max(L,1)
        p_q    = s.count('?')/max(L,1)
        p_sem  = s.count(';')/max(L,1)
        p_col  = s.count(':')/max(L,1)
        p_dash = s.count('—')/max(L,1) + s.count('-')/max(L,1)
        p_ell  = s.count('…')/max(L,1)
        p_com  = s.count(',')/max(L,1)
        p_dot  = s.count('.')/max(L,1)
        caps = sum(1 for ch in s if ch.isupper())/max(L,1)
        digits = sum(1 for ch in s if ch.isdigit())/max(L,1)
        # type-token ratio (rough)
        toks = re.findall(r"[A-Za-z']+", s)
        ttr = (len(set(toks))/max(len(toks),1)) if toks else 0.0
        feats.append([L, wc, avg_wlen, p_excl, p_q, p_sem, p_col, p_dash, p_ell, p_com, p_dot, caps, digits, ttr])
    return np.asarray(feats, dtype=np.float32)

X_tr_sty = compute_stylo(X_text)
X_te_sty = compute_stylo(X_test_text)
oof_sty = np.zeros((len(train), C), dtype=np.float32)
test_sty = np.zeros((len(test), C), dtype=np.float32)
for fold, (tr, va) in enumerate(skf.split(X_tr_sty, y), 1):
    Xtr, Xva = X_tr_sty[tr], X_tr_sty[va]
    ytr, yva = y[tr], y[va]
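    # NOTE: the features are unscaled (raw length L and word count sit next to ratios < 1), which
    # likely explains why lbfgs hits max_iter below; standardizing the features is the usual remedy
    clf = LogisticRegression(solver='lbfgs', multi_class='multinomial', C=2.0, max_iter=2000, n_jobs=-1, random_state=SEED)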
    clf.fit(Xtr, ytr)
    P_va = clf.predict_proba(Xva).astype(np.float32)
    P_te = clf.predict_proba(X_te_sty).astype(np.float32)
    # reorder
    order = [list(clf.classes_).index(c) for c in classes]
    P_va = P_va[:, order]; P_te = P_te[:, order]
    # clip+renorm
    P_va = np.clip(P_va, 1e-12, 1-1e-12); P_va /= P_va.sum(axis=1, keepdims=True)
    P_te = np.clip(P_te, 1e-12, 1-1e-12); P_te /= P_te.sum(axis=1, keepdims=True)
    oof_sty[va] = P_va
    test_sty += P_te / N_FOLDS
    loss = log_loss(yva, P_va, labels=classes)
    print(f'[styloLR] Fold {fold} loss={loss:.5f}', flush=True)
oof_sty_loss = log_loss(y, oof_sty, labels=classes)
print(f'[styloLR] OOF={oof_sty_loss:.5f}', flush=True)
np.save('oof_stylo_lr.npy', oof_sty); np.save('test_stylo_lr.npy', test_sty)
print('[styloLR] Saved oof_stylo_lr.npy and test_stylo_lr.npy')

print('Finished new bases. Elapsed=%.2fs' % (time.time()-t0), flush=True)

[char5lm] Fold 1 loss=1.53375 V=141747


[char5lm] Fold 2 loss=1.51921 V=142475


[char5lm] Fold 3 loss=1.58201 V=142659


[char5lm] Fold 4 loss=1.51984 V=141991




[styloLR] Fold 1 loss=1.06426


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[styloLR] Fold 2 loss=1.06370


[styloLR] Fold 3 loss=1.07204


[styloLR] Fold 4 loss=1.06293


[styloLR] Fold 5 loss=1.06496


[styloLR] OOF=1.06558


[styloLR] Saved oof_stylo_lr.npy and test_stylo_lr.npy
Finished new bases. Elapsed=23.59s
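
The ConvergenceWarning in this cell's output points at feature scaling, and the stylometric matrix does mix raw counts (text length, word count) with ratios well below 1. A minimal sketch of one common remedy: standardize inside a scikit-learn pipeline so that each CV fold fits its own scaler. The data here is synthetic stand-in data, not the competition features, and whether this improves the stylometric OOF is untested here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(42)
n = 600
# Hypothetical 3-class labels and two columns on very different scales
y_demo = rng.randint(0, 3, size=n)
length = 100.0 + 60.0 * y_demo + rng.normal(0, 40, size=n)           # raw char length
comma_ratio = 0.02 + 0.005 * y_demo + rng.normal(0, 0.01, size=n)    # tiny ratio
X_demo = np.column_stack([length, comma_ratio]).astype(np.float64)

# The scaler is fit only on whatever data the pipeline is fit on,
# so calling pipe.fit(X[tr], y[tr]) per fold avoids leakage from validation rows.
pipe = make_pipeline(StandardScaler(), LogisticRegression(C=2.0, max_iter=2000))
pipe.fit(X_demo, y_demo)

P_demo = pipe.predict_proba(X_demo)
n_iter = int(pipe.named_steps['logisticregression'].n_iter_.max())
print('lbfgs iterations used:', n_iter)
```

In the fold loop, `pipe` would take the place of `clf`; `pipe.classes_` is still available for the column reordering step.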


In [128]:
# 4-bin length-gated LOP (Run D): add char5lm + stylo bases with ultra-tight caps; stabilize temps; overlay
import numpy as np, pandas as pd, time, gc
from pathlib import Path
from scipy.optimize import minimize, minimize_scalar
from sklearn.metrics import log_loss

t0 = time.time()
train = pd.read_csv('train.csv'); test = pd.read_csv('test.csv')
y = train['author'].values
classes = np.unique(y).tolist()
submit_cols = ['EAP','HPL','MWS']
assert set(classes) == set(submit_cols)
C = len(classes)

def load(p):
    return np.load(p) if Path(p).exists() else None

# Portfolio: Run B 10-core + new tiny-capped diversity bases (char5lm, stylo)
cands = [
    ('nbsvm_wc_tweaked',        'oof_nbsvm_wc_tweaked.npy',        'test_nbsvm_wc_tweaked.npy'),
    ('nbsvm_char_2_6_counts',   'oof_nbsvm_char_2_6_counts.npy',   'test_nbsvm_char_2_6_counts.npy'),
    ('nbsvm_wc_fixed',          'oof_nbsvm_wc_fixed.npy',          'test_nbsvm_wc_fixed.npy'),
    ('nbsvm_char_2_7_presence', 'oof_nbsvm_char_2_7_presence.npy', 'test_nbsvm_char_2_7_presence.npy'),
    ('lr_wc_fixed',             'oof_lr_wordchar_fixed.npy',       'test_lr_wordchar_fixed.npy'),
    ('lr_word13_charwb36',      'oof_lr_word13_charwb36.npy',      'test_lr_word13_charwb36.npy'),
    ('lr_char_1_8_hero',        'oof_lr_char_1_8_hero.npy',        'test_lr_char_1_8_hero.npy'),
    ('svc_charwb_1_6_sig',      'oof_svc_charwb_1_6_sig.npy',      'test_svc_charwb_1_6_sig.npy'),
    ('lr_wordpunct_1_3',        'oof_lr_wordpunct_1_3.npy',        'test_lr_wordpunct_1_3.npy'),
    ('mnb_char_2_6',            'oof_mnb_char_2_6.npy',            'test_mnb_char_2_6.npy'),
    ('char5lm',                 'oof_char5lm.npy',                 'test_char5lm.npy'),
    ('stylo_lr',                'oof_stylo_lr.npy',                'test_stylo_lr.npy'),
]
loaded = []
for name, oofp, tsp in cands:
    o = load(oofp); t = load(tsp)
    assert o is not None and t is not None, f'Missing preds for {name}'
    o = np.clip(o, 1e-12, 1-1e-12); o = o / o.sum(axis=1, keepdims=True)
    t = np.clip(t, 1e-12, 1-1e-12); t = t / t.sum(axis=1, keepdims=True)
    loaded.append((name, o.astype(np.float64), t.astype(np.float64)))
names = [n for n,_,_ in loaded]
K = len(names)
print('Length-gated LOP (4-bin + extras) with models:', names, flush=True)

def scale_probs_scalar(P, T):
    S = np.clip(P, 1e-12, 1-1e-12) ** (1.0/float(T))
    return S / S.sum(axis=1, keepdims=True)

OOFs_raw = [o for _,o,_ in loaded]
TESTs_raw = [t for _,_,t in loaded]

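# Global per-model temps (shrink target). The search is bounded to [0.5, 5.0]; a fitted T that
# lands exactly on a bound (char5lm reports 5.0 below) is clipped rather than truly optimal.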
per_model_T_global = []; OOFs_global = []; TESTs_global = []
for i in range(K):
    Pi = OOFs_raw[i]
    resTi = minimize_scalar(lambda T: log_loss(y, scale_probs_scalar(Pi, T), labels=classes),
                            bounds=(0.5, 5.0), method='bounded')
    Ti = float(resTi.x)
    per_model_T_global.append(Ti)
    OOFs_global.append(scale_probs_scalar(OOFs_raw[i], Ti))
    TESTs_global.append(scale_probs_scalar(TESTs_raw[i], Ti))
per_oof = {names[i]: log_loss(y, OOFs_global[i], labels=classes) for i in range(K)}
print('Per-model T (global):', {names[i]: round(per_model_T_global[i],3) for i in range(K)})
print('Per-model OOF (post scalar cal):', {k: round(v,5) for k,v in per_oof.items()})

def geo_pool_log_classwise(stacks, W):
    n = stacks[0].shape[0]
    A = np.zeros((n, C), dtype=np.float64)
    for k in range(K):
        A += np.log(stacks[k]) * W[k][None, :]
    A -= A.max(axis=1, keepdims=True)
    P = np.exp(A); P /= P.sum(axis=1, keepdims=True)
    return P

def softmax_cols(Z):
    W = np.zeros_like(Z)
    for j in range(C):
        z = Z[:, j]
        z = z - z.max()
        e = np.exp(z); s = e.sum()
        W[:, j] = e / (s if s>0 else 1.0)
    return W

# Regularization and caps
lambda_ent = 0.0025
global_cap = 0.55
weak_cap = 0.09
tiny_prune_thresh = 0.00
nb_mask = np.array([n.startswith('nbsvm_') for n in names], dtype=bool)
name_to_idx = {n:i for i,n in enumerate(names)}

def apply_caps_with_nbcap(W, nb_cap_local, explicit_caps_local):
    Wc = W.copy()
    Wc = np.minimum(Wc, global_cap)
    # explicit model caps
    for n, cap in explicit_caps_local.items():
        if n in name_to_idx:
            i = name_to_idx[n]
            Wc[i, :] = np.minimum(Wc[i, :], cap)
    # weak cap for very weak bases
    for i, n in enumerate(names):
        if per_oof.get(n, 1.0) > 0.40:
            Wc[i, :] = np.minimum(Wc[i, :], weak_cap)
    # NB-family total cap per class
    for j in range(C):
        s_nb = Wc[nb_mask, j].sum()
        if s_nb > nb_cap_local and s_nb > 0:
            Wc[nb_mask, j] *= (nb_cap_local / s_nb)
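    # Per-class renormalize (this rescales capped weights upward whenever the column sum is < 1,
    # so the final weights can end up slightly above the caps applied above)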
    for j in range(C):
        col = Wc[:, j]
        if tiny_prune_thresh > 0:
            col[col < tiny_prune_thresh] = 0.0
        s = col.sum()
        if s == 0:
            col[:] = 1.0 / K
        else:
            col[:] = col / s
        Wc[:, j] = col
    return Wc

def make_objective(OOFs_subset, nb_cap_local, explicit_caps_local, y_bin):
    def objective(Z):
        W0 = softmax_cols(Z)
        Wc = apply_caps_with_nbcap(W0, nb_cap_local, explicit_caps_local)
        P = geo_pool_log_classwise(OOFs_subset, Wc)
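        # ent accumulates sum(w * log w), i.e. the negative entropy of each class's weight column;
        # adding lambda_ent * ent favors spread-out weights, and the reported best_obj includes
        # this (non-positive) term, so it sits below the bin's plain log loss
        ent = 0.0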
        for j in range(C):
            wj = np.clip(Wc[:, j], 1e-12, 1.0)
            ent += float(np.sum(wj * np.log(wj)))
        reg = lambda_ent * ent
        return log_loss(y_bin, P, labels=classes) + reg
    return objective

# 4-bin thresholds: <=80, 81-130, 131-200, >200
train_len = train['text'].astype(str).str.len().values
test_len  = test['text'].astype(str).str.len().values
b1, b2, b3 = 80, 130, 200
mask_vshort = (train_len <= b1)
mask_short  = (train_len > b1) & (train_len <= b2)
mask_mid    = (train_len > b2) & (train_len <= b3)
mask_long   = (train_len > b3)
test_vshort = (test_len <= b1)
test_short  = (test_len > b1) & (test_len <= b2)
test_mid    = (test_len > b2) & (test_len <= b3)
test_long   = (test_len > b3)
print('4-bin sizes:', {'vshort': int(mask_vshort.sum()), 'short': int(mask_short.sum()), 'mid': int(mask_mid.sum()), 'long': int(mask_long.sum())}, flush=True)

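# Warm-start from an earlier cell's Z_star when it exists and has shape (K, C); otherwise use zeros
Z_global = None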
try:
    Z_global = Z_star.copy()
except NameError:
    Z_global = None
Z_warm = np.zeros((K, C), dtype=np.float64)
if Z_global is not None and getattr(Z_global, 'shape', None) == (K, C):
    Z_warm = Z_global

final_oof = np.zeros((len(train), C), dtype=np.float64)
final_test = np.zeros((len(test), C), dtype=np.float64)

def run_bin(name, tr_mask, te_mask, nb_cap_local, mnb_cap_local, char5_cap_local, stylo_cap_local):
    idx_tr = np.where(tr_mask)[0]
    idx_te = np.where(te_mask)[0]
    if len(idx_tr) == 0:
        return
    y_bin = y[idx_tr]
    OOFs_bin = []; TESTs_bin = []
    for i in range(K):
        Pi_tr = OOFs_raw[i][idx_tr]
        resTi = minimize_scalar(lambda T: log_loss(y_bin, scale_probs_scalar(Pi_tr, T), labels=classes),
                                bounds=(0.75, 1.5), method='bounded')
        Ti_bin = float(resTi.x)
        # shrink the per-bin temperature toward the global one (70% global, 30% per-bin)
        Ti_shrunk = 0.7 * float(per_model_T_global[i]) + 0.3 * Ti_bin
        OOFs_bin.append(scale_probs_scalar(OOFs_raw[i][idx_tr], Ti_shrunk))
        TESTs_bin.append(scale_probs_scalar(TESTs_raw[i][idx_te], Ti_shrunk))
    starts = 64
    rng = np.random.RandomState(42)
    inits = [Z_warm.copy()] + [rng.normal(0, 0.5, size=(K, C)) for _ in range(starts-1)]
    best = (1e9, None)
    explicit_caps_local = {
        'svc_charwb_1_6_sig': 0.06,
        'lr_wordpunct_1_3': 0.05,
        'mnb_char_2_6': mnb_cap_local,
        'char5lm': char5_cap_local,
        'stylo_lr': stylo_cap_local,
    }
    obj = make_objective(OOFs_bin, nb_cap_local, explicit_caps_local, y_bin)
    for si, Z0 in enumerate(inits, 1):
        res = minimize(lambda z: obj(z.reshape(K, C)), Z0.ravel(), method='L-BFGS-B')
        val = float(res.fun)
        if val < best[0]:
            best = (val, res.x.reshape(K, C).copy())
        if si % 16 == 0:
            print(f'  [{name}] start {si}/{starts} best_obj={best[0]:.5f}', flush=True)
    Z_bin = best[1]
    W0 = softmax_cols(Z_bin)
    Wc = apply_caps_with_nbcap(W0, nb_cap_local, explicit_caps_local)
    P_oof_bin = geo_pool_log_classwise(OOFs_bin, Wc)
    P_test_bin = geo_pool_log_classwise(TESTs_bin, Wc)
    oof_pre = log_loss(y_bin, P_oof_bin, labels=classes)
    def scale_classwise(P, Tvec):
        T = np.asarray(Tvec, dtype=np.float64)
        S = np.clip(P, 1e-12, 1-1e-12) ** (1.0 / T[None, :])
        return S / S.sum(axis=1, keepdims=True)
    bounds = [(0.75, 1.5)] * C
    resTc = minimize(lambda t: log_loss(y_bin, scale_classwise(P_oof_bin, t), labels=classes),
                     x0=np.ones(C, dtype=np.float64), method='L-BFGS-B', bounds=bounds)
    T_class = resTc.x
    P_oof_scaled = scale_classwise(P_oof_bin, T_class)
    P_test_scaled = scale_classwise(P_test_bin, T_class)
    oof_final_bin = log_loss(y_bin, P_oof_scaled, labels=classes)
    print(f'  [{name}] OOF pre-temp={oof_pre:.5f} final={oof_final_bin:.5f} T={np.round(T_class,4)}', flush=True)
    final_oof[idx_tr] = P_oof_scaled
    final_test[idx_te] = P_test_scaled

# Per-bin caps: NB-family [0.68,0.65,0.62,0.58]; MNB tiny [0.042,0.038,0.032,0.028];
# char5lm super tiny [0.02,0.018,0.015,0.012]; stylo super tiny [0.02,0.018,0.015,0.012]
run_bin('vshort', mask_vshort, test_vshort, nb_cap_local=0.68, mnb_cap_local=0.042, char5_cap_local=0.02,  stylo_cap_local=0.02)
run_bin('short',  mask_short,  test_short,  nb_cap_local=0.65, mnb_cap_local=0.038, char5_cap_local=0.018, stylo_cap_local=0.018)
run_bin('mid',    mask_mid,    test_mid,    nb_cap_local=0.62, mnb_cap_local=0.032, char5_cap_local=0.015, stylo_cap_local=0.015)
run_bin('long',   mask_long,   test_long,   nb_cap_local=0.58, mnb_cap_local=0.028, char5_cap_local=0.012, stylo_cap_local=0.012)

oof_loss = log_loss(y, final_oof, labels=classes)
print('Run D 4-bin overall OOF:', round(oof_loss,5), 'elapsed=%.2fs' % (time.time()-t0), flush=True)

# Confidence overlay with hero model on low-confidence rows (max<0.46):
def apply_overlay(P_base, P_aux, mask_conf, alpha=0.8):
    P = P_base.copy()
    low_conf = mask_conf
    P[low_conf] = (alpha * P_base[low_conf] + (1.0 - alpha) * P_aux[low_conf])
    P = np.clip(P, 1e-12, 1-1e-12); P /= P.sum(axis=1, keepdims=True)
    return P

P_lop_oof = final_oof.copy(); P_lop_test = final_test.copy()
P_hero_oof = load('oof_lr_char_1_8_hero.npy'); P_hero_test = load('test_lr_char_1_8_hero.npy')
assert P_hero_oof is not None and P_hero_test is not None, 'Missing hero model preds for overlay'
P_hero_oof = np.clip(P_hero_oof, 1e-12, 1-1e-12); P_hero_oof /= P_hero_oof.sum(axis=1, keepdims=True)
P_hero_test = np.clip(P_hero_test, 1e-12, 1-1e-12); P_hero_test /= P_hero_test.sum(axis=1, keepdims=True)
mask_low_oof = (P_lop_oof.max(axis=1) < 0.46)
mask_low_test = (P_lop_test.max(axis=1) < 0.46)
P_final_oof = apply_overlay(P_lop_oof, P_hero_oof, mask_low_oof, alpha=0.8)
P_final_test = apply_overlay(P_lop_test, P_hero_test, mask_low_test, alpha=0.8)

# Optional micro prior on very short (<=60) MWS x1.02
short_mask_60_tr = (train_len <= 60)
short_mask_60_te = (test_len <= 60)
def boost_mws(P, mask):
    P2 = P.copy()
    j = classes.index('MWS')
    if mask.any():
        P2[mask, j] *= 1.02
        P2[mask] /= P2[mask].sum(axis=1, keepdims=True)
    return P2
P_final_oof = boost_mws(P_final_oof, short_mask_60_tr)
P_final_test = boost_mws(P_final_test, short_mask_60_te)

oof_loss_final = log_loss(y, P_final_oof, labels=classes)
print('Run D 4-bin Post-overlay OOF:', round(oof_loss_final,5))

probs = P_final_test[:, [classes.index(c) for c in submit_cols]]
probs = np.clip(probs, 1e-12, 1-1e-12); probs = probs / probs.sum(axis=1, keepdims=True)
sub = pd.DataFrame(probs, columns=submit_cols)
sub.insert(0, 'id', test['id'].values)
assert sub.shape[0] == 1958
assert np.all(np.isfinite(sub[submit_cols].values))
assert np.allclose(sub[submit_cols].sum(axis=1).values, 1.0, atol=1e-6)
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv (Run D: 4-bin + char5lm + stylo + overlay). Elapsed=%.2fs' % (time.time()-t0), flush=True)

Length-gated LOP (4-bin + extras) with models: ['nbsvm_wc_tweaked', 'nbsvm_char_2_6_counts', 'nbsvm_wc_fixed', 'nbsvm_char_2_7_presence', 'lr_wc_fixed', 'lr_word13_charwb36', 'lr_char_1_8_hero', 'svc_charwb_1_6_sig', 'lr_wordpunct_1_3', 'mnb_char_2_6', 'char5lm', 'stylo_lr']


Per-model T (global): {'nbsvm_wc_tweaked': 1.55, 'nbsvm_char_2_6_counts': 1.579, 'nbsvm_wc_fixed': 1.567, 'nbsvm_char_2_7_presence': 1.477, 'lr_wc_fixed': 0.877, 'lr_word13_charwb36': 1.006, 'lr_char_1_8_hero': 0.871, 'svc_charwb_1_6_sig': 0.803, 'lr_wordpunct_1_3': 0.771, 'mnb_char_2_6': 1.17, 'char5lm': 5.0, 'stylo_lr': 0.784}
Per-model OOF (post scalar cal): {'nbsvm_wc_tweaked': 0.32156, 'nbsvm_char_2_6_counts': 0.33019, 'nbsvm_wc_fixed': 0.33633, 'nbsvm_char_2_7_presence': 0.338, 'lr_wc_fixed': 0.36343, 'lr_word13_charwb36': 0.36825, 'lr_char_1_8_hero': 0.36987, 'svc_charwb_1_6_sig': 0.39657, 'lr_wordpunct_1_3': 0.40368, 'mnb_char_2_6': 0.39492, 'char5lm': 0.42481, 'stylo_lr': 1.06406}
4-bin sizes: {'vshort': 4330, 'short': 4674, 'mid': 4698, 'long': 3919}


  [vshort] start 16/64 best_obj=0.46941


  [vshort] start 32/64 best_obj=0.46941


  [vshort] start 48/64 best_obj=0.46941


  [vshort] start 64/64 best_obj=0.46937


  [vshort] OOF pre-temp=0.48591 final=0.48287 T=[1.1475 0.9265 0.9561]


  [short] start 16/64 best_obj=0.33630


  [short] start 32/64 best_obj=0.33613


  [short] start 48/64 best_obj=0.33613


  [short] start 64/64 best_obj=0.33613


  [short] OOF pre-temp=0.35159 final=0.35091 T=[0.9413 0.945  0.9399]


  [mid] start 16/64 best_obj=0.22400


  [mid] start 32/64 best_obj=0.22245


  [mid] start 48/64 best_obj=0.22245


  [mid] start 64/64 best_obj=0.22245


  [mid] OOF pre-temp=0.23822 final=0.22958 T=[0.75   0.9113 0.8251]


  [long] start 16/64 best_obj=0.12054


  [long] start 32/64 best_obj=0.12039


  [long] start 48/64 best_obj=0.12039


  [long] start 64/64 best_obj=0.12039


  [long] OOF pre-temp=0.13598 final=0.12792 T=[0.8107 0.75   0.75  ]


Run D 4-bin overall OOF: 0.30139 elapsed=831.02s


Run D 4-bin Post-overlay OOF: 0.3015
Saved submission.csv (Run D: 4-bin + char5lm + stylo + overlay). Elapsed=831.04s
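
In the logs above, the post-overlay OOF (overlay plus MWS boost) is marginally higher than the pre-overlay value (0.3015 vs 0.30139 for Run D), yet the submission is written from the post-overlay predictions. One possible guard, sketched below under assumptions, keeps a post-processing step only when it lowers OOF log loss. The helper names `multiclass_log_loss` and `keep_if_better` are hypothetical, labels are assumed to be integer class indices, and the toy arrays are illustrative only.

```python
import numpy as np

def multiclass_log_loss(y_idx, P, eps=1e-12):
    """Mean negative log-probability of the true class (rows renormalized)."""
    P = np.clip(np.asarray(P, dtype=np.float64), eps, 1 - eps)
    P = P / P.sum(axis=1, keepdims=True)
    return float(-np.mean(np.log(P[np.arange(len(y_idx)), y_idx])))

def keep_if_better(y_idx, base_oof, post_oof, base_test, post_test):
    """Return the post-processed OOF/test preds only if they lower OOF loss."""
    base_loss = multiclass_log_loss(y_idx, base_oof)
    post_loss = multiclass_log_loss(y_idx, post_oof)
    if post_loss < base_loss:
        return post_oof, post_test, True
    return base_oof, base_test, False

# Toy check: blending confident, correct predictions toward uniform hurts log loss,
# so the guard should fall back to the base predictions.
y_toy = np.array([0, 1, 2])
base = np.array([[0.7, 0.2, 0.1],
                 [0.2, 0.6, 0.2],
                 [0.1, 0.2, 0.7]])
post = 0.8 * base + 0.2 * np.full_like(base, 1.0 / 3.0)
oof_sel, test_sel, used_post = keep_if_better(y_toy, base, post, base, post)
print('post-processing kept:', used_post)
```

Selecting on OOF in this way is itself a small fit to the OOF labels, so a difference at the fourth decimal place, as seen here, may not carry over to the leaderboard in either direction.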
