# Plan to Medal: MLSP 2013 Birds (AUC-ROC, multilabel)

Objectives:
- Build a strong baseline fast using provided spectrogram artifacts (BMPs) to avoid audio preprocessing overhead.
- Use a multilabel image classifier (e.g., EfficientNet/BiT/ResNet) with per-class sigmoid outputs and BCE loss, validated on the folds defined in CVfolds_2.txt.
- Generate test predictions and create submission.csv.

Data understanding tasks:
1) Parse species_list.txt (class order).
2) Parse rec_id2filename.txt (recording IDs, mapping to wav/bmp names).
3) Parse CVfolds_2.txt (train/val split IDs).
4) Determine train/test split using rec_labels_test_hidden.txt and sample_submission.csv.
5) Confirm image availability in supplemental_data/spectrograms or filtered_spectrograms.
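
A minimal sketch of how task 4 can identify test recordings, assuming each `Id` in sample_submission.csv packs the pair as `rec_id * 100 + class_id` (the encoding the parsing cell relies on). The helper names `decode_submission_id` and `encode_submission_id` are hypothetical.

```python
from typing import Tuple

def decode_submission_id(sub_id: int, num_classes: int = 19) -> Tuple[int, int]:
    """Split a packed submission Id into (rec_id, class_id).

    Assumes Id = rec_id * 100 + class_id, so class_id must be < 100.
    """
    rec_id, class_id = divmod(sub_id, 100)
    if not 0 <= class_id < num_classes:
        raise ValueError(f"class_id {class_id} out of range for Id {sub_id}")
    return rec_id, class_id

def encode_submission_id(rec_id: int, class_id: int) -> int:
    """Inverse of decode_submission_id."""
    return rec_id * 100 + class_id

# The unique rec_ids decoded from the Id column define the test set.
ids = [100, 101, 118, 1905]
test_rec_ids = sorted({decode_submission_id(i)[0] for i in ids})
print(test_rec_ids)
```

Any rec_id that appears both here and among the labeled recordings would indicate leakage.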

Baseline modeling plan:
- Input: 2D grayscale BMP spectrograms (use both spectrograms/ and filtered_spectrograms/ experiments later).
- Model: timm EfficientNet-B0 or ResNet50 with ImageNet-pretrained weights; replicate the single grayscale channel to 3 channels.
- Loss: BCEWithLogitsLoss; Metrics: micro/macro AUC on CV.
- Augmentations: light (RandomResizedCrop; no horizontal flip, which would reverse the time axis; vertical flip only as an experiment; add time masking later if needed).
- Optimizer: AdamW; LR: 1e-3 with cosine or OneCycle; epochs: 10-20 (early stop).
- Batch size: fit GPU (T4 16GB) -> start 32.
- CV: use CVfolds_2.txt (2-fold) to iterate quickly; log per-fold AUC and time.
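
For reference, a sketch of the per-element quantity that a sigmoid BCE loss on logits computes, written in the numerically stable form `max(x, 0) - x*y + log(1 + exp(-|x|))`. The function names are illustrative and not part of the notebook; this is only meant to make the planned loss concrete, not to replace `BCEWithLogitsLoss`.

```python
import math

def bce_with_logits(logit: float, target: float) -> float:
    """Binary cross-entropy on a raw logit, stable for large |logit|."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def multilabel_bce(logits, targets) -> float:
    """Mean BCE over every (sample, class) element of two nested lists."""
    total, count = 0.0, 0
    for row_logits, row_targets in zip(logits, targets):
        for x, y in zip(row_logits, row_targets):
            total += bce_with_logits(x, y)
            count += 1
    return total / count

# Two recordings, three classes: each class is an independent yes/no decision.
logits = [[2.0, -1.0, 0.0], [-3.0, 0.5, 4.0]]
targets = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
print(round(multilabel_bce(logits, targets), 4))
```

Because each class gets its own sigmoid, a recording can be positive for several species at once, which is what the multilabel setting needs.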

Efficiency:
- Cache dataset indices. A WebDataset-style loader is unnecessary at this scale; a simple ImageFolder-like custom dataset is enough.
- Use mixed precision (AMP) and cudnn.benchmark=True.
- Log progress every N steps; estimate ETA.
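
A small sketch of the ETA estimate mentioned above, assuming a constant per-step cost. `estimate_eta` and `format_eta` are hypothetical helpers, not functions defined elsewhere in this notebook.

```python
from typing import Optional

def estimate_eta(elapsed_s: float, steps_done: int, total_steps: int) -> Optional[float]:
    """Remaining seconds, extrapolating the average time per completed step."""
    if steps_done <= 0:
        return None  # nothing measured yet
    remaining = max(total_steps - steps_done, 0)
    return elapsed_s * remaining / steps_done

def format_eta(seconds: float) -> str:
    """Render seconds as MM:SS."""
    seconds = int(round(seconds))
    return f"{seconds // 60:02d}:{seconds % 60:02d}"

# Example: 25 of 100 steps took 10 s, so about 30 s remain.
eta = estimate_eta(10.0, 25, 100)
print(f"step 25/100 ETA {format_eta(eta)}")
```

Inside a training loop, the same call would sit behind an `if step % N == 0` check so logging stays cheap.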

Milestones:
A) Data loading + label matrix built.
B) Sanity-check that a small model can overfit a tiny subset.
C) Full 2-fold training run; evaluate CV AUC.
D) Inference on test; write submission.csv.

Expert checkpoints:
- After this plan.
- After data parsing/EDA.
- After first baseline CV results (decide on filtered vs raw spectrograms, augmentations, thresholds).
- Before long training runs or ensembling.

Stretch improvements (time permitting):
- Try filtered_spectrograms vs spectrograms; ensemble logits.
- Add mixup/cutmix (weak for multilabel images but can help).
- Fine-tune a stronger backbone (tf_efficientnet_b3_ns) and add TTA (center crop plus crops shifted along the time axis).
- Post-processing: none required for AUC; consider calibration if needed.
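
One possible reading of "ensemble logits", sketched under the assumption that each view (e.g., raw and filtered spectrograms) yields a logit matrix of the same shape: average the logits, then apply the sigmoid once. `ensemble_logits` and its `weights` argument are hypothetical.

```python
import math

def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def ensemble_logits(logit_sets, weights=None):
    """Weighted mean of per-model logit matrices, returned as probabilities.

    logit_sets: list of models, each a list of rows, each a list of class logits.
    """
    if weights is None:
        weights = [1.0] * len(logit_sets)
    total_w = sum(weights)
    n_rows, n_cols = len(logit_sets[0]), len(logit_sets[0][0])
    probs = []
    for r in range(n_rows):
        row = []
        for c in range(n_cols):
            avg = sum(w * m[r][c] for w, m in zip(weights, logit_sets)) / total_w
            row.append(sigmoid(avg))
        probs.append(row)
    return probs

raw_view = [[0.0, 2.0]]        # logits from a model on spectrograms/
filtered_view = [[0.0, -2.0]]  # logits from a model on filtered_spectrograms/
print(ensemble_logits([raw_view, filtered_view]))
```

Since the sigmoid is monotonic, applying it after averaging does not change the ranking that AUC measures; it only maps scores into the [0, 1] range the submission expects.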

Next step:
- Implement data parsing notebook cells and verify counts and shapes.

In [2]:
# Data parsing and integrity checks (revised for actual file formats)
import os, sys, json, time, math, random, re, gc, hashlib
from pathlib import Path
import pandas as pd
import numpy as np

BASE = Path('.')
ESS = BASE/'essential_data'
SUP = BASE/'supplemental_data'

def log(s):
    print(f"[{time.strftime('%H:%M:%S')}] {s}")

# 1) species_list.txt (CSV with columns: class_id, code, species)
sp_df = pd.read_csv(ESS/'species_list.txt')
assert {'class_id','code','species'}.issubset(sp_df.columns), 'species_list.txt must have class_id,code,species'
sp_df = sp_df.sort_values('class_id').reset_index(drop=True)
num_classes = sp_df.shape[0]
species_codes = sp_df['code'].tolist()
species_names = sp_df['species'].tolist()
log(f"Loaded species_list: {num_classes} classes -> {species_codes[:5]} ...")

# 2) rec_id2filename.txt (CSV with header: rec_id, filename)
rid2fn_df = pd.read_csv(ESS/'rec_id2filename.txt')
assert {'rec_id','filename'}.issubset(rid2fn_df.columns), 'rec_id2filename.txt must have rec_id,filename'
rid2fn_df['rec_id'] = rid2fn_df['rec_id'].astype(int)
id2fn = dict(zip(rid2fn_df['rec_id'], rid2fn_df['filename']))
log(f"Loaded rec_id2filename: {len(id2fn)} mappings")

# 3) rec_labels_test_hidden.txt -> parse train labels and identify hidden test
# Format: header 'rec_id,[labels]'; rows like '0,11,12' or '1,?' or '3' (no labels) etc.
train_labels = {}  # rec_id -> set(class_ids)
all_ids_in_labels_file = []
with open(ESS/'rec_labels_test_hidden.txt', 'r') as f:
    for i, line in enumerate(f):
        line = line.strip()
        if not line:
            continue
        if i == 0 and line.lower().startswith('rec_id'):
            continue  # header
        parts = [p.strip() for p in line.split(',') if p.strip()!='']
        if len(parts) == 0:
            continue
        rec_id = int(parts[0])
        all_ids_in_labels_file.append(rec_id)
        if len(parts) == 1:
            continue  # no labels listed (possibly an empty recording rather than hidden test); not used for training
        if parts[1] == '?' or parts[1] == '[labels]':
            continue  # hidden test indicator
        # remaining parts are class_ids
        class_ids = []
        for tok in parts[1:]:
            tok = tok.strip().strip('[]')
            if tok == '?' or tok == '':
                continue
            try:
                cid = int(tok)
            except ValueError:
                continue
            class_ids.append(cid)
        if len(class_ids) > 0:
            train_labels[rec_id] = set(class_ids)

all_ids_in_labels_file = pd.Index(sorted(set(all_ids_in_labels_file)))
train_ids = pd.Index(sorted(train_labels.keys()))
hidden_test_ids = all_ids_in_labels_file.difference(train_ids)
log(f"Parsed labels: total IDs listed={len(all_ids_in_labels_file)}, train IDs with labels={len(train_ids)}, hidden test IDs={len(hidden_test_ids)}")

# 4) CVfolds_2.txt (CSV with columns: rec_id, fold)
cv_df = pd.read_csv(ESS/'CVfolds_2.txt')
assert {'rec_id','fold'}.issubset(cv_df.columns), 'CVfolds_2.txt must have rec_id,fold'
cv_df['rec_id'] = cv_df['rec_id'].astype(int)
cv_df['fold'] = cv_df['fold'].astype(int)
fold_map = dict(zip(cv_df['rec_id'], cv_df['fold']))
log(f"Loaded CVfolds_2: {len(fold_map)} entries, folds={sorted(cv_df['fold'].unique().tolist())}")

# 5) sample_submission.csv -> Id,Probability with Id encoding rec_id*100 + class_id
sub_df = pd.read_csv('sample_submission.csv')
assert {'Id','Probability'}.issubset(sub_df.columns), 'sample_submission must have Id,Probability columns'
sub_df['rec_id'] = (sub_df['Id'] // 100).astype(int)
sub_df['class_id'] = (sub_df['Id'] % 100).astype(int)
assert sub_df['class_id'].between(0, num_classes-1).all(), 'class_id out of range in sample_submission'
submission_test_ids = pd.Index(sorted(sub_df['rec_id'].unique()))
log(f"sample_submission: rows={len(sub_df)}, unique test rec_ids={len(submission_test_ids)}")

# Determine train vs test sets
train_rec_ids = train_ids
test_rec_ids = submission_test_ids
leak_ids = train_rec_ids.intersection(test_rec_ids)
assert len(leak_ids) == 0, f"Leakage: {len(leak_ids)} rec_ids appear in both train and test"

# Build label matrix for train (multilabel one-hot over class_ids 0..num_classes-1)
y_list = []
train_rows = []
for rid in train_rec_ids:
    labs = train_labels.get(rid, set())
    y = np.zeros(num_classes, dtype=np.float32)
    for cid in labs:
        if 0 <= cid < num_classes:
            y[cid] = 1.0
    train_rows.append({'rec_id': rid})
    y_list.append(y)
Y = np.vstack(y_list) if len(y_list) else np.zeros((0, num_classes), dtype=np.float32)
train_df = pd.DataFrame(train_rows)
log(f"Train matrix: n={len(train_df)}, num_classes={Y.shape[1]}, positive labels={int(Y.sum())}")

# Attach fold assignments
train_df['fold'] = train_df['rec_id'].map(fold_map).astype('Int64')
if train_df['fold'].isna().any():
    miss = train_df[train_df['fold'].isna()]['rec_id'].tolist()[:10]
    log(f"WARNING: {train_df['fold'].isna().sum()} train rec_ids missing CV fold mapping. Examples: {miss}")
    train_df['fold'] = train_df['fold'].fillna(-1).astype(int)
else:
    train_df['fold'] = train_df['fold'].astype(int)

# Map rec_id -> filename stems
train_df['filename'] = train_df['rec_id'].map(id2fn)
test_df = sub_df[['Id','rec_id','class_id']].copy()
test_df['filename'] = test_df['rec_id'].map(id2fn)

# Choose input view: filtered_spectrograms
VIEW_DIR = SUP/'filtered_spectrograms'
def bmp_path(stem):
    return VIEW_DIR/f"{stem}.bmp" if isinstance(stem, str) else None
train_df['bmp'] = train_df['filename'].map(bmp_path)
test_df['bmp'] = test_df['filename'].map(bmp_path)

# Assert files exist
missing_train = train_df[~train_df['bmp'].map(lambda p: p is not None and p.exists())]
missing_test = test_df[~test_df['bmp'].map(lambda p: p is not None and p.exists())]
log(f"Missing BMPs -> train: {len(missing_train)}, test rows: {len(missing_test)} (note: test_df has multiple rows per rec_id)")
if len(missing_train) > 0:
    log(f"Example missing train: {missing_train.head(3).to_dict(orient='records')}")
if len(missing_test) > 0:
    log(f"Example missing test: {missing_test.head(3).to_dict(orient='records')}")

# Store artifacts for later cells
data_contract = {
    'num_classes': int(num_classes),
    'species_codes': species_codes,
    'species_names': species_names,
    'train_df_shape': tuple(train_df.shape),
    'test_df_shape': tuple(test_df.shape),
    'view_dir': str(VIEW_DIR),
}
log(json.dumps(data_contract)[:300] + ('...' if len(json.dumps(data_contract))>300 else ''))

# Preview
display(train_df.head())
display(test_df.head())
pos_per_class = Y.sum(axis=0)
log(f"Classes with zero positives in train: {(pos_per_class==0).sum()} / {num_classes}")

[04:45:46] Loaded species_list: 19 classes -> ['BRCR', 'PAWR', 'PSFL', 'RBNU', 'DEJU'] ...
[04:45:46] Loaded rec_id2filename: 322 mappings
[04:45:46] Parsed labels: total IDs listed=322, train IDs with labels=145, hidden test IDs=177
[04:45:46] Loaded CVfolds_2: 322 entries, folds=[0, 1]
[04:45:46] sample_submission: rows=1216, unique test rec_ids=64
[04:45:46] Train matrix: n=145, num_classes=19, positive labels=270
[04:45:46] Missing BMPs -> train: 0, test rows: 0 (note: test_df has multiple rows per rec_id)
[04:45:46] {"num_classes": 19, "species_codes": ["BRCR", "PAWR", "PSFL", "RBNU", "DEJU", "OSFL", "HETH", "CBCH", "VATH", "HEWA", "SWTH", "HAFL", "WETA", "BHGB", "GCKI", "WAVI", "MGWA", "STJA", "CONI"], "species_names": ["Brown Creeper", "Pacific Wren", "Pacific-slope Flycatcher", "Red-breasted Nuthatch", "Dark...


Unnamed: 0,rec_id,fold,filename,bmp
0,0,0,PC1_20090606_050012_0010,supplemental_data/filtered_spectrograms/PC1_20...
1,2,0,PC1_20090705_070000_0010,supplemental_data/filtered_spectrograms/PC1_20...
2,10,0,PC2_20090606_050000_0010,supplemental_data/filtered_spectrograms/PC2_20...
3,11,0,PC2_20090606_070000_0010,supplemental_data/filtered_spectrograms/PC2_20...
4,19,0,PC4_20100606_050000_0010,supplemental_data/filtered_spectrograms/PC4_20...


Unnamed: 0,Id,rec_id,class_id,filename,bmp
0,100,1,0,PC1_20090606_070012_0010,supplemental_data/filtered_spectrograms/PC1_20...
1,101,1,1,PC1_20090606_070012_0010,supplemental_data/filtered_spectrograms/PC1_20...
2,102,1,2,PC1_20090606_070012_0010,supplemental_data/filtered_spectrograms/PC1_20...
3,103,1,3,PC1_20090606_070012_0010,supplemental_data/filtered_spectrograms/PC1_20...
4,104,1,4,PC1_20090606_070012_0010,supplemental_data/filtered_spectrograms/PC1_20...


[04:45:46] Classes with zero positives in train: 0 / 19
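
The row shapes handled by the label parser above can be isolated in one small function. This is a sketch under the assumption that the three shapes named in the cell's comment (`0,11,12`, `1,?`, and a bare `3`) are the only ones; `parse_label_line` is a hypothetical helper. Unlike the cell, it returns an empty set for a bare rec_id instead of skipping it, so the caller can decide how to treat such rows.

```python
from typing import Optional, Set, Tuple

def parse_label_line(line: str) -> Optional[Tuple[int, Optional[Set[int]]]]:
    """Parse one row of the labels file.

    Returns None for blank/header rows, (rec_id, None) for hidden rows
    marked with '?', and (rec_id, set_of_class_ids) otherwise. An empty
    set means the row listed no classes at all.
    """
    line = line.strip()
    if not line or line.lower().startswith('rec_id'):
        return None
    parts = [p.strip() for p in line.split(',') if p.strip()]
    if not parts:
        return None
    rec_id = int(parts[0])
    if '?' in parts[1:]:
        return rec_id, None
    return rec_id, {int(tok) for tok in parts[1:]}

for row in ['rec_id,[labels]', '0,11,12', '1,?', '3']:
    print(repr(row), '->', parse_label_line(row))
```

Keeping "hidden" (`None`) distinct from "no classes listed" (empty set) makes it easy to count each group separately before deciding which rows enter training.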


In [None]:
# Compute mean/std for filtered_spectrograms (train only)
import numpy as np
from PIL import Image
from pathlib import Path
import time, sys

def compute_mean_std(image_paths):
    s = 0.0
    ss = 0.0
    n = 0
    h_acc = 0
    w_acc = 0
    t0 = time.time()
    for i, p in enumerate(image_paths):
        if i % 50 == 0:
            elapsed = time.time() - t0
            print(f"[{i}/{len(image_paths)}] elapsed {elapsed:.1f}s", flush=True)
        img = Image.open(p).convert('L')
        arr = np.asarray(img, dtype=np.float32) / 255.0
        s += arr.sum()
        ss += (arr * arr).sum()
        n += arr.size
        h_acc += arr.shape[0]
        w_acc += arr.shape[1]
    mean = s / n if n > 0 else 0.0
    var = ss / n - mean * mean if n > 0 else 0.0
    std = float(np.sqrt(max(var, 0.0)))
    h_mean = h_acc / len(image_paths) if len(image_paths) else 0
    w_mean = w_acc / len(image_paths) if len(image_paths) else 0
    return mean, std, int(h_mean), int(w_mean)

stats_cache = Path('filtered_stats.npy')
if stats_cache.exists():
    cached = np.load(stats_cache, allow_pickle=True).item()
    f_mean, f_std, Hmean, Wmean = cached['mean'], cached['std'], cached['Hmean'], cached['Wmean']
    print(f"Loaded cached stats: mean={f_mean:.6f}, std={f_std:.6f}, Hmean={Hmean}, Wmean={Wmean}")
else:
    paths = train_df['bmp'].tolist()
    f_mean, f_std, Hmean, Wmean = compute_mean_std(paths)
    print(f"Computed stats: mean={f_mean:.6f}, std={f_std:.6f}, Hmean={Hmean}, Wmean={Wmean}")
    np.save(stats_cache, {'mean': f_mean, 'std': f_std, 'Hmean': Hmean, 'Wmean': Wmean})

# For single-channel replicated to 3: use scalar mean/std for all 3 channels
norm_mean = [f_mean, f_mean, f_mean]
norm_std = [f_std, f_std, f_std]
print('Normalization params:', norm_mean, norm_std)
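
The statistics above rely on the identity Var[x] = E[x^2] - (E[x])^2, which lets the mean and standard deviation be accumulated in one pass without holding every pixel in memory. A dependency-free sketch on toy data; `streaming_mean_std` is a hypothetical name, and plain lists stand in for image arrays.

```python
import math

def streaming_mean_std(chunks):
    """One-pass mean/std over an iterable of value chunks (e.g., one chunk per image)."""
    s = 0.0   # running sum of values
    ss = 0.0  # running sum of squared values
    n = 0     # number of values seen
    for chunk in chunks:
        for v in chunk:
            s += v
            ss += v * v
            n += 1
    if n == 0:
        return 0.0, 0.0
    mean = s / n
    # Clamp at zero: rounding can push the difference slightly negative.
    var = max(ss / n - mean * mean, 0.0)
    return mean, math.sqrt(var)

images = [[0.0, 1.0], [1.0, 0.0]]  # two tiny "images" with pixels in [0, 1]
print(streaming_mean_std(images))
```

The result is a population (not sample) standard deviation pooled over all pixels, which matches what the cell computes.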

In [None]:
# Install required packages (PyTorch CUDA 12.1 build, timm, sklearn, albumentations, cv2)
import sys, subprocess, time
def pip_install(pkg_line):
    print(f"Installing: {pkg_line}", flush=True)
    code = subprocess.call([sys.executable, '-m', 'pip', 'install'] + pkg_line.split())
    print(f"Return code: {code}", flush=True)
    return code

pkgs = [
    "--index-url https://download.pytorch.org/whl/cu121 torch torchvision torchaudio",
    "timm==0.9.16",
    "albumentations==1.4.8",
    "opencv-python-headless==4.10.0.84",
    "scikit-learn==1.5.1"
]
t0=time.time()
for p in pkgs:
    pip_install(p)
print(f"All installs done in {time.time()-t0:.1f}s")

In [1]:
# Diagnostics: import libs and print versions to debug kernel crash
import sys
import torch, torchvision, timm
print('Python:', sys.version)
print('Torch:', torch.__version__, 'CUDA available:', torch.cuda.is_available())
print('Torchvision:', torchvision.__version__)
print('timm:', timm.__version__)
print('CUDA device count:', torch.cuda.device_count())
if torch.cuda.is_available():
    print('CUDA device:', torch.cuda.get_device_name(0))

KernelDied: Kernel died unexpectedly.

In [None]:
# Dataset, transforms, ASL loss, metrics, and training skeleton (torchvision-only, no albumentations)
import math, time, os, random, gc
from pathlib import Path
from PIL import Image
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from sklearn.metrics import roc_auc_score
import timm
from torchvision import transforms as T
from torchvision.transforms import InterpolationMode

SEED = 42
random.seed(SEED); np.random.seed(SEED); torch.manual_seed(SEED)
torch.backends.cudnn.benchmark = True

# Global normalization from the mean/std cell above
NORM_MEAN = norm_mean
NORM_STD = norm_std

class TimeFreqMask:
    def __init__(self, time_mask_frac=0.2, freq_mask_frac=0.2, num_time_masks=1, num_freq_masks=1, p=0.5):
        self.time_mask_frac = time_mask_frac
        self.freq_mask_frac = freq_mask_frac
        self.num_time_masks = num_time_masks
        self.num_freq_masks = num_freq_masks
        self.p = p
    def __call__(self, img_t):
        # img_t: Tensor CxHxW in [0,1]
        if self.p <= 0 or random.random() > self.p:
            return img_t
        C, H, W = img_t.shape
        out = img_t.clone()
        # time masks: along width
        max_w = max(1, int(W * self.time_mask_frac))
        for _ in range(self.num_time_masks):
            w = random.randint(1, max_w)
            x0 = random.randint(0, max(0, W - w))
            out[:, :, x0:x0+w] = 0.0
        # freq masks: along height
        max_h = max(1, int(H * self.freq_mask_frac))
        for _ in range(self.num_freq_masks):
            h = random.randint(1, max_h)
            y0 = random.randint(0, max(0, H - h))
            out[:, y0:y0+h, :] = 0.0
        return out

def get_train_transforms(out_size=224):
    return T.Compose([
        T.Resize((out_size, out_size), interpolation=InterpolationMode.BILINEAR),
        T.RandomAffine(degrees=0, translate=(0.1, 0.1), fill=0),
        T.ToTensor(),
        TimeFreqMask(time_mask_frac=0.2, freq_mask_frac=0.2, num_time_masks=2, num_freq_masks=1, p=0.7),
        T.Normalize(mean=NORM_MEAN, std=NORM_STD),
    ])

def get_valid_transforms(out_size=224):
    return T.Compose([
        T.Resize((out_size, out_size), interpolation=InterpolationMode.BILINEAR),
        T.ToTensor(),
        T.Normalize(mean=NORM_MEAN, std=NORM_STD),
    ])

class BirdDataset(Dataset):
    def __init__(self, df, labels_matrix=None, transforms=None):
        self.df = df.reset_index(drop=True)
        self.labels = labels_matrix
        self.transforms = transforms
        self.has_labels = labels_matrix is not None
    def __len__(self):
        return len(self.df)
    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        img = Image.open(row['bmp']).convert('RGB')  # replicate grayscale to 3 channels
        if self.transforms is not None:
            img_t = self.transforms(img)
        else:
            img_t = T.ToTensor()(img)
            img_t = T.Normalize(mean=NORM_MEAN, std=NORM_STD)(img_t)
        if self.has_labels:
            y = torch.from_numpy(self.labels[idx])
            return img_t, y
        else:
            return img_t, row['Id'], row['rec_id'], row['class_id']

# Asymmetric Loss for multilabel
class AsymmetricLoss(nn.Module):
    def __init__(self, gamma_pos=1.0, gamma_neg=4.0, clip=0.05, eps=1e-8, disable_torch_grad_focal_loss=True):
        super().__init__()
        self.gamma_pos = gamma_pos
        self.gamma_neg = gamma_neg
        self.clip = clip
        self.eps = eps
        self.disable_torch_grad_focal_loss = disable_torch_grad_focal_loss
    def forward(self, logits, targets):
        x_sigmoid = torch.sigmoid(logits)
        xs_pos = x_sigmoid
        xs_neg = 1.0 - x_sigmoid
        if self.clip is not None and self.clip > 0:
            xs_neg = (xs_neg + self.clip).clamp(max=1)
        los_pos = targets * torch.log(xs_pos.clamp(min=self.eps))
        los_neg = (1 - targets) * torch.log(xs_neg.clamp(min=self.eps))
        if self.gamma_pos > 0 or self.gamma_neg > 0:
            pt0 = xs_pos * targets
            pt1 = xs_neg * (1 - targets)
            one_sided_gamma = self.gamma_pos * targets + self.gamma_neg * (1 - targets)
            one_sided_w = torch.pow(1.0 - (pt0 + pt1), one_sided_gamma)
            if self.disable_torch_grad_focal_loss:
                # Treat the focal weight as a constant; detach() avoids toggling the global grad mode
                one_sided_w = one_sided_w.detach()
            los_pos = los_pos * one_sided_w
            los_neg = los_neg * one_sided_w
        loss = - (los_pos + los_neg).mean()
        return loss

def compute_auc(y_true, y_pred, average='macro'):
    # y_true, y_pred: numpy arrays [N, C]
    aucs = []
    C = y_true.shape[1]
    for c in range(C):
        yt = y_true[:, c]
        yp = y_pred[:, c]
        if yt.max() == yt.min():
            continue  # skip ill-defined
        try:
            aucs.append(roc_auc_score(yt, yp))
        except Exception:
            continue
    macro = float(np.mean(aucs)) if len(aucs) else float('nan')
    micro = float(roc_auc_score(y_true.ravel(), y_pred.ravel())) if (y_true.sum()>0 and y_true.sum()<y_true.size) else float('nan')
    return macro, micro, aucs

def build_model(model_name='tf_efficientnet_b0_ns', num_classes=19, pretrained=True, drop_rate=0.2):
    model = timm.create_model(model_name, pretrained=pretrained, in_chans=3, num_classes=num_classes, drop_rate=drop_rate)
    return model

def get_fold_indices(train_df, fold_id):
    trn_idx = train_df.index[train_df['fold'] != fold_id].to_numpy()
    val_idx = train_df.index[train_df['fold'] == fold_id].to_numpy()
    return trn_idx, val_idx

def train_one_epoch(model, loader, optimizer, scaler, criterion, device='cuda', log_interval=50):
    model.train()
    running = 0.0
    n = 0
    t0 = time.time()
    for i, (imgs, targets) in enumerate(loader):
        imgs = imgs.to(device, non_blocking=True)
        targets = targets.to(device, non_blocking=True)
        optimizer.zero_grad(set_to_none=True)
        with torch.cuda.amp.autocast():
            logits = model(imgs)
            loss = criterion(logits, targets)
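        scaler.scale(loss).backward()  # backprop through the scaled loss
        scaler.step(optimizer)         # unscales grads; skips the step if they contain inf/NaN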
        scaler.update()
        running += loss.item() * imgs.size(0)
        n += imgs.size(0)
        if (i+1) % log_interval == 0:
            print(f"  [train] step {i+1}/{len(loader)} loss={running/max(n,1):.4f} elapsed={time.time()-t0:.1f}s", flush=True)
    return running / max(n,1)

def validate(model, loader, device='cuda'):
    model.eval()
    preds = []
    targets_all = []
    with torch.no_grad():
        for imgs, targets in loader:
            imgs = imgs.to(device, non_blocking=True)
            logits = model(imgs)
            probs = torch.sigmoid(logits).float().cpu().numpy()
            preds.append(probs)
            targets_all.append(targets.numpy())
    y_pred = np.concatenate(preds, axis=0)
    y_true = np.concatenate(targets_all, axis=0)
    macro, micro, aucs = compute_auc(y_true, y_pred)
    return macro, micro, y_true, y_pred

def make_loaders(trn_df, val_df, Y, out_size=224, bs=32, num_workers=4):
    trn_ds = BirdDataset(trn_df, labels_matrix=Y[trn_df.index], transforms=get_train_transforms(out_size=out_size))
    val_ds = BirdDataset(val_df, labels_matrix=Y[val_df.index], transforms=get_valid_transforms(out_size=out_size))
    trn_ld = DataLoader(trn_ds, batch_size=bs, shuffle=True, num_workers=num_workers, pin_memory=True, drop_last=False)
    val_ld = DataLoader(val_ds, batch_size=bs*2, shuffle=False, num_workers=num_workers, pin_memory=True, drop_last=False)
    return trn_ld, val_ld

def run_cv(model_name='tf_efficientnet_b0_ns', out_size=224, epochs=12, lr=1e-3, wd=1e-4, bs=32, device='cuda'):
    folds = sorted(train_df['fold'].unique())
    folds = [f for f in folds if f >= 0]
    all_fold_metrics = []
    for f in folds:
        print(f"=== Fold {f} / {max(folds)} ===", flush=True)
        trn_idx, val_idx = get_fold_indices(train_df, f)
        # Keep the original index here: make_loaders slices Y with trn_df.index / val_df.index,
        # and BirdDataset resets the index itself.
        trn_df = train_df.loc[trn_idx]
        val_df = train_df.loc[val_idx]

        model = build_model(model_name=model_name, num_classes=Y.shape[1], pretrained=True, drop_rate=0.2)
        model.to(device)
        optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=wd)
        scaler = torch.cuda.amp.GradScaler()
        criterion = AsymmetricLoss(gamma_pos=1.0, gamma_neg=4.0, clip=0.05)

        trn_ld, val_ld = make_loaders(trn_df, val_df, Y, out_size=out_size, bs=bs, num_workers=4)
        best_macro = -1.0
        best_state = None
        t_fold0 = time.time()
        for ep in range(1, epochs+1):
            t_ep0 = time.time()
            loss = train_one_epoch(model, trn_ld, optimizer, scaler, criterion, device=device, log_interval=25)
            macro, micro, y_true, y_pred = validate(model, val_ld, device=device)
            print(f"Fold {f} Epoch {ep}/{epochs} | train_loss={loss:.4f} | val_macroAUC={macro:.4f} | val_microAUC={micro:.4f} | ep_time={time.time()-t_ep0:.1f}s", flush=True)
            if macro > best_macro:
                best_macro = macro
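                # Snapshot the weights: state_dict() returns live tensors that later epochs would overwrite
                best_state = { 'model': {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}, 'macro': macro, 'micro': micro, 'epoch': ep }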
        print(f"Fold {f} best macroAUC={best_macro:.4f} | fold_time={time.time()-t_fold0:.1f}s", flush=True)
        # Save best
        os.makedirs('checkpoints', exist_ok=True)
        torch.save(best_state, f'checkpoints/{model_name}_fold{f}_best.pth')
        all_fold_metrics.append(best_macro)
        # Free
        del model; gc.collect(); torch.cuda.empty_cache()
    print(f"CV macro AUCs: {all_fold_metrics} | mean={np.mean(all_fold_metrics):.4f}", flush=True)
    return all_fold_metrics

print("Dataset, transforms (torchvision), ASL, and training skeleton ready.")
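
The macro AUC that `compute_auc` reports can be checked by hand on small inputs. A dependency-free sketch, assuming binary 0/1 targets; `roc_auc` and `macro_auc` are hypothetical names. The pairwise loop is O(P*N), which is fine for a few hundred recordings but is not a replacement for `roc_auc_score`.

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        return None  # undefined when only one class is present
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def macro_auc(Y_true, Y_pred):
    """Mean per-class AUC, skipping classes whose AUC is undefined."""
    aucs = []
    for c in range(len(Y_true[0])):
        auc = roc_auc([row[c] for row in Y_true], [row[c] for row in Y_pred])
        if auc is not None:
            aucs.append(auc)
    return sum(aucs) / len(aucs) if aucs else float('nan')

Y_true = [[0, 0], [0, 0], [1, 0], [1, 0]]                   # class 1 has no positives
Y_pred = [[0.1, 0.5], [0.4, 0.5], [0.35, 0.5], [0.8, 0.5]]
print(macro_auc(Y_true, Y_pred))
```

With very small validation folds, several rare species may be skipped this way, so the macro average can be computed over fewer than 19 classes.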

In [None]:
# Quick sanity CV run (short) with EfficientNet-B0 @224
import time, torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print('Device:', device)
t0 = time.time()
cv_metrics = run_cv(model_name='tf_efficientnet_b0_ns', out_size=224, epochs=4, lr=1e-3, wd=1e-4, bs=32, device=device)
print('CV metrics:', cv_metrics, 'elapsed:', f'{time.time()-t0:.1f}s')

In [6]:
# Classical ML baseline using histogram_of_segments (no torch) with robust CV fallback
import pandas as pd, numpy as np, time, os, json, random
from pathlib import Path
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import roc_auc_score

t0 = time.time()
hist_path = SUP/'histogram_of_segments.txt'
# Load: first row is a header; skip it
hist_df = pd.read_csv(hist_path, header=None, skiprows=1)
assert hist_df.shape[1] >= 2, 'histogram_of_segments must have rec_id and features'
hist_df = hist_df.rename(columns={0:'rec_id'})
hist_df['rec_id'] = hist_df['rec_id'].astype(int)

# Features per rec_id
feat_cols = [c for c in hist_df.columns if c != 'rec_id']
agg_df = hist_df[['rec_id'] + feat_cols].copy().sort_values('rec_id').reset_index(drop=True)
log(f'Histogram features: rows={agg_df.shape[0]}, num_features={len(feat_cols)}')

# Build train/test feature matrices aligned to rec_ids
train_ids_sorted = pd.Index(sorted(train_rec_ids))
test_ids_sorted = pd.Index(sorted(test_rec_ids))

X_train = agg_df.set_index('rec_id').reindex(train_ids_sorted)[feat_cols].values
X_test = agg_df.set_index('rec_id').reindex(test_ids_sorted)[feat_cols].values
assert not np.isnan(X_train).any(), 'NaNs in train features after reindex (histogram)'
assert not np.isnan(X_test).any(), 'NaNs in test features after reindex (histogram)'

# Labels matrix Y aligned
Y_df = pd.DataFrame(Y, index=train_rec_ids)
Y_aligned = Y_df.reindex(train_ids_sorted).values.astype(np.float32)
assert Y_aligned.shape[0] == X_train.shape[0], 'Train features/labels mismatch'

# Map folds for these train_ids
fold_series = pd.Series(train_ids_sorted).map(fold_map)
fold_series = fold_series.fillna(-1).astype(int)
folds = sorted([f for f in fold_series.unique() if f >= 0])
log(f'Folds present (raw): {folds}')

def compute_auc_macro_micro(y_true, y_pred):
    aucs = []
    C = y_true.shape[1]
    for c in range(C):
        yt = y_true[:, c]
        yp = y_pred[:, c]
        if yt.max() == yt.min():
            continue
        try:
            aucs.append(roc_auc_score(yt, yp))
        except Exception:
            pass
    macro = float(np.mean(aucs)) if len(aucs) else float('nan')
    micro = float(roc_auc_score(y_true.ravel(), y_pred.ravel())) if (y_true.sum()>0 and y_true.sum()<y_true.size) else float('nan')
    return macro, micro, aucs

# CV loop with validation checks; fallback to random split if provided folds invalid
cv_macros = []
valid_cv_done = False
for f in folds:
    trn_mask = (fold_series.values != f)
    val_mask = (fold_series.values == f)
    if trn_mask.sum() == 0 or val_mask.sum() == 0:
        continue
    X_tr, X_val = X_train[trn_mask], X_train[val_mask]
    y_tr, y_val = Y_aligned[trn_mask], Y_aligned[val_mask]
    log(f'Fold {f}: train {X_tr.shape[0]}, val {X_val.shape[0]}')
    scaler = StandardScaler(with_mean=True, with_std=True)
    X_trs = scaler.fit_transform(X_tr)
    X_vals = scaler.transform(X_val)
    clf = OneVsRestClassifier(LogisticRegression(max_iter=1000, solver='lbfgs', class_weight='balanced'))
    t_fit0 = time.time()
    clf.fit(X_trs, y_tr)
    log(f'  Fit time: {time.time()-t_fit0:.2f}s')
    val_proba = np.vstack([est.predict_proba(X_vals)[:,1] for est in clf.estimators_]).T
    macro, micro, _ = compute_auc_macro_micro(y_val, val_proba)
    log(f'  Fold {f} AUC macro={macro:.4f} micro={micro:.4f}')
    cv_macros.append(macro)
    valid_cv_done = True

if valid_cv_done:
    log(f'CV macro AUCs: {cv_macros} | mean={np.mean(cv_macros):.4f}')
else:
    # Fallback: single random 80/20 split for a quick sanity metric
    n = X_train.shape[0]
    idx = np.arange(n)
    rng = np.random.default_rng(42)
    rng.shuffle(idx)
    cut = int(0.8 * n)
    tr_idx, va_idx = idx[:cut], idx[cut:]
    X_tr, X_val = X_train[tr_idx], X_train[va_idx]
    y_tr, y_val = Y_aligned[tr_idx], Y_aligned[va_idx]
    log(f'Random split: train {X_tr.shape[0]}, val {X_val.shape[0]}')
    scaler = StandardScaler(with_mean=True, with_std=True)
    X_trs = scaler.fit_transform(X_tr)
    X_vals = scaler.transform(X_val)
    clf = OneVsRestClassifier(LogisticRegression(max_iter=1000, solver='lbfgs', class_weight='balanced'))
    t_fit0 = time.time()
    clf.fit(X_trs, y_tr)
    log(f'  Fit time: {time.time()-t_fit0:.2f}s')
    val_proba = np.vstack([est.predict_proba(X_vals)[:,1] for est in clf.estimators_]).T
    macro, micro, _ = compute_auc_macro_micro(y_val, val_proba)
    log(f'  Random split AUC macro={macro:.4f} micro={micro:.4f}')

# Fit on full train and predict test
scaler_full = StandardScaler(with_mean=True, with_std=True)
X_tr_full = scaler_full.fit_transform(X_train)
X_te_full = scaler_full.transform(X_test)
clf_full = OneVsRestClassifier(LogisticRegression(max_iter=1000, solver='lbfgs', class_weight='balanced'))
t_fit_full0 = time.time()
clf_full.fit(X_tr_full, Y_aligned)
log(f'Full fit time: {time.time()-t_fit_full0:.2f}s')
test_proba_mat = np.vstack([est.predict_proba(X_te_full)[:,1] for est in clf_full.estimators_]).T  # shape [n_test_ids, num_classes]

# Build submission in sample_submission order
test_id_order = sub_df[['Id','rec_id','class_id']].copy()
rec_to_idx = {rid:i for i, rid in enumerate(test_ids_sorted)}
probs = []
for _, row in test_id_order.iterrows():
    rid = int(row['rec_id']); cid = int(row['class_id'])
    ti = rec_to_idx.get(rid, None)
    assert ti is not None, f'rec_id {rid} not found in test features'
    p = float(test_proba_mat[ti, cid])
    if not np.isfinite(p): p = 0.0
    probs.append(min(max(p, 0.0), 1.0))
sub_out = pd.DataFrame({'Id': test_id_order['Id'].astype(int), 'Probability': np.round(probs, 6)})
sub_out.to_csv('submission.csv', index=False)
log(f'Wrote submission.csv with {len(sub_out)} rows in {time.time()-t0:.1f}s')
display(sub_out.head())

[04:50:26] Histogram features: rows=322, num_features=100
[04:50:26] Folds present (raw): [0]
[04:50:26] Random split: train 116, val 29


[04:50:27]   Fit time: 1.31s
[04:50:28]   Random split AUC macro=0.7522 micro=0.7636


[04:50:31] Full fit time: 3.03s
[04:50:31] Wrote submission.csv with 1216 rows in 4.5s


Unnamed: 0,Id,Probability
0,100,0.003139
1,101,0.271228
2,102,0.154523
3,103,0.668181
4,104,0.28723
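
The baseline above fits the scaler on training rows only and reuses it for validation and test rows. A minimal sketch of that pattern without scikit-learn; `fit_standardizer` and `apply_standardizer` are hypothetical names. Constant columns get a scale of 1.0, which mirrors how `StandardScaler` avoids dividing by zero.

```python
import math

def fit_standardizer(rows):
    """Per-column mean and population std, computed from training rows only."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    scales = []
    for j in range(d):
        var = sum((r[j] - means[j]) ** 2 for r in rows) / n
        std = math.sqrt(var)
        scales.append(std if std > 0 else 1.0)  # constant column: leave unscaled
    return means, scales

def apply_standardizer(rows, means, scales):
    """Apply previously fitted statistics to any split."""
    return [[(v - m) / s for v, m, s in zip(r, means, scales)] for r in rows]

train = [[0.0, 5.0], [2.0, 5.0]]  # second feature is constant in train
held_out = [[4.0, 6.0]]
means, scales = fit_standardizer(train)
print(apply_standardizer(held_out, means, scales))
```

Fitting on the training rows alone keeps held-out statistics from leaking into the features the model sees.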


In [7]:
# Install classical ML deps and iterative multilabel stratification
import sys, subprocess, time
def pip_install(pkg_line):
    print(f"Installing: {pkg_line}", flush=True)
    code = subprocess.call([sys.executable, '-m', 'pip', 'install'] + pkg_line.split())
    print(f"Return code: {code}", flush=True)
    return code

pkgs = [
    "iterative-stratification==0.1.7",
    "lightgbm==4.5.0",
    "xgboost==2.1.1"
]
t0=time.time()
for p in pkgs:
    pip_install(p)
print(f"All installs done in {time.time()-t0:.1f}s")

# Set global seeds for reproducibility
import os, random, numpy as np
SEED = 42
os.environ['PYTHONHASHSEED'] = str(SEED)
random.seed(SEED)
np.random.seed(SEED)
print('Seeds set to', SEED)

Installing: iterative-stratification==0.1.7
Collecting iterative-stratification==0.1.7
  Downloading iterative_stratification-0.1.7-py3-none-any.whl (8.5 kB)
Collecting scikit-learn
  Downloading scikit_learn-1.7.2-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (9.7 MB)
Collecting numpy
  Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.3 MB)
Collecting scipy
  Downloading scipy-1.16.1-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (35.4 MB)
Collecting threadpoolctl>=3.1.0
  Downloading threadpoolctl-3.6.0-py3-none-any.whl (18 kB)
Collecting joblib>=1.2.0
  Downloading joblib-1.5.2-py3-none-any.whl (308 kB)
Installing collected packages: threadpoolctl, numpy, joblib, scipy, scikit-learn, iterative-stratification
Successfully installed iterative-stratification-0.1.7 joblib-1.5.2 numpy-1.26.4 scikit-learn-1.7.2 scipy-1.16.1 threadpoolctl-3.6.0
Return code: 0

Installing: lightgbm==4.5.0
Collecting lightgbm==4.5.0
  Downloading lightgbm-4.5.0-py3-none-manylinux_2_28_x86_64.whl (3.6 MB)
Collecting scipy
  Downloading scipy-1.16.1-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (35.4 MB)
Collecting numpy>=1.17.0
  Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.3 MB)
Installing collected packages: numpy, scipy, lightgbm
Successfully installed lightgbm-4.5.0 numpy-1.26.4 scipy-1.16.1
Return code: 0

Installing: xgboost==2.1.1
Collecting xgboost==2.1.1
  Downloading xgboost-2.1.1-py3-none-manylinux_2_28_x86_64.whl (153.9 MB)
Collecting scipy
  Downloading scipy-1.16.1-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (35.4 MB)
Collecting numpy
  Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.3 MB)
Collecting nvidia-nccl-cu12
  Downloading nvidia_nccl_cu12-2.28.3-py3-none-manylinux_2_18_x86_64.whl (295.9 MB)
Installing collected packages: nvidia-nccl-cu12, numpy, scipy, xgboost
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torch 2.8.0 requires nvidia-nccl-cu12==2.27.3; platform_system == "Linux" and platform_machine == "x86_64", but you have nvidia-nccl-cu12 2.28.3 which is incompatible.
torch 2.8.0 requires nvidia-nvjitlink-cu12==12.8.93; platform_system == "Linux" and platform_machine == "x86_64", but you have nvidia-nvjitlink-cu12 12.9.86 which is incompatible.
Successfully installed numpy-1.26.4 nvidia-nccl-cu12-2.28.3 scipy-1.16.1 xgboost-2.1.1
Return code: 0

All installs done in 40.2s
Seeds set to 42
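
The install log above ends with pip reporting that torch 2.8.0 pins `nvidia-nccl-cu12==2.27.3` and `nvidia-nvjitlink-cu12==12.8.93`, while different versions are now installed. A minimal sketch of a post-install check follows. It assumes only the standard-library `importlib.metadata`; the helper names `installed_versions` and `find_pin_mismatches` and the `pins` dict are hypothetical and not part of the notebook.

```python
from importlib import metadata


def installed_versions(names):
    """Return {name: version or None} for the given distribution names."""
    out = {}
    for name in names:
        try:
            out[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            out[name] = None  # distribution is not installed at all
    return out


def find_pin_mismatches(pins, installed):
    """Return {name: (wanted, found)} for every exact pin that is not met."""
    return {
        name: (wanted, installed.get(name))
        for name, wanted in pins.items()
        if installed.get(name) != wanted
    }


# Exact pins copied from the pip conflict messages in the log above.
pins = {"nvidia-nccl-cu12": "2.27.3", "nvidia-nvjitlink-cu12": "12.8.93"}
for name, (wanted, found) in find_pin_mismatches(pins, installed_versions(pins)).items():
    print(f"{name}: torch wants {wanted}, found {found}")
```

Running a check like this right after the installs would surface the mismatch before any later cell imports torch; whether the mismatch matters for this CPU-only tabular workflow is not established by the log.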


In [17]:
# Feature engineering + 5-fold MultilabelStratified CV + LGBM/LR OVR + Ensembling
import pandas as pd, numpy as np, time, os, gc, random
from pathlib import Path
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold
import lightgbm as lgb
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

def auc_macro_micro(y_true, y_pred):
    aucs = []
    C = y_true.shape[1]
    for c in range(C):
        yt = y_true[:, c]
        yp = y_pred[:, c]
        if yt.max() == yt.min():
            continue
        try:
            aucs.append(roc_auc_score(yt, yp))
        except Exception:
            pass
    macro = float(np.mean(aucs)) if len(aucs) else float('nan')
    micro = float(roc_auc_score(y_true.ravel(), y_pred.ravel())) if (y_true.sum()>0 and y_true.sum()<y_true.size) else float('nan')
    return macro, micro

# 1) Load histogram_of_segments and build features
hist_path = SUP/'histogram_of_segments.txt'
hist_df = pd.read_csv(hist_path, header=None, skiprows=1)
hist_df = hist_df.rename(columns={0:'rec_id'})
hist_df['rec_id'] = hist_df['rec_id'].astype(int)
hist_feat_cols = [c for c in hist_df.columns if c != 'rec_id']

# Base histogram stats per row (vectorized; avoid per-column inserts)
bins_mat = hist_df[hist_feat_cols].astype(np.float32).values
row_sum = bins_mat.sum(axis=1, keepdims=True) + 1e-6
row_mean = bins_mat.mean(axis=1, keepdims=True)
row_std = bins_mat.std(axis=1, keepdims=True)
proportions = bins_mat / row_sum
entropy = -(proportions * np.log(proportions + 1e-12)).sum(axis=1).astype(np.float32)
p10 = np.percentile(bins_mat, 10, axis=1).astype(np.float32)
p25 = np.percentile(bins_mat, 25, axis=1).astype(np.float32)
p75 = np.percentile(bins_mat, 75, axis=1).astype(np.float32)
p90 = np.percentile(bins_mat, 90, axis=1).astype(np.float32)

# Proportions features
prop_colnames = [f'{col}_prop' for col in hist_feat_cols]
prop_df = pd.DataFrame(proportions, columns=prop_colnames, index=hist_df.index).astype(np.float32)

# log1p of raw histogram counts
log_df = pd.DataFrame(np.log1p(bins_mat), columns=[f'{col}_log1p' for col in hist_feat_cols], index=hist_df.index).astype(np.float32)

# Ratios between consecutive bins (vectorized)
ratios = bins_mat[:, 1:] / (np.abs(bins_mat[:, :-1]) + 1e-6)
ratio_cols = [f'h_ratio_{i}' for i in range(1, len(hist_feat_cols))]
ratio_df = pd.DataFrame(ratios.astype(np.float32), columns=ratio_cols, index=hist_df.index)

# Row-level summary stats
summary_df = pd.DataFrame({
    'hist_row_sum': row_sum[:,0].astype(np.float32),
    'hist_row_mean': row_mean[:,0].astype(np.float32),
    'hist_row_std': row_std[:,0].astype(np.float32),
    'hist_entropy': entropy,
    'hist_p10': p10,
    'hist_p25': p25,
    'hist_p75': p75,
    'hist_p90': p90
}, index=hist_df.index)

# Assemble histogram feature frame once (avoid fragmentation)
hist_feats = pd.concat([hist_df[['rec_id']], hist_df[hist_feat_cols].astype(np.float32), prop_df, log_df, ratio_df, summary_df], axis=1)
hist_df = hist_feats  # reuse downstream

# 2) Load segment_features and aggregate per rec_id
seg_path = SUP/'segment_features.txt'
seg_raw = pd.read_csv(seg_path, header=None, skiprows=1)
seg_raw = seg_raw.rename(columns={0:'rec_id', 1:'seg_idx'})
seg_raw['rec_id'] = seg_raw['rec_id'].astype(int)
num_cols = [c for c in seg_raw.columns if c not in ['rec_id','seg_idx']]
for c in num_cols:
    seg_raw[c] = pd.to_numeric(seg_raw[c], errors='coerce')

def q25(x): return np.nanpercentile(x, 25)
def q75(x): return np.nanpercentile(x, 75)

agg_funcs = {c:['mean','std','min','max','median',q25,q75] for c in num_cols}
agg_df = seg_raw.groupby('rec_id').agg(agg_funcs)
# Flatten MultiIndex columns
# pandas labels callable aggregators by their __name__, so q25/q75 already appear as 'q25'/'q75'
agg_df.columns = [f'{col[0]}_{col[1]}' for col in agg_df.columns]
agg_df = agg_df.reset_index()
agg_df['n_seg'] = seg_raw.groupby('rec_id').size().reindex(agg_df['rec_id']).values.astype(np.float32)
# std/mean ratios
for c in num_cols:
    m = agg_df[f'{c}_mean'].replace(0, np.nan)
    s = agg_df[f'{c}_std']
    agg_df[f'{c}_std_over_mean'] = (s / m).replace([np.inf, -np.inf], np.nan).astype(np.float32)

# has_seg indicator and log1p of n_seg
agg_df['has_seg'] = 1.0
agg_df['n_seg_log1p'] = np.log1p(agg_df['n_seg'].fillna(0.0)).astype(np.float32)

# 3) Merge hist + ratios + seg aggregations
feat_df = hist_df.merge(agg_df, on='rec_id', how='left')
feat_df['has_seg'] = feat_df['has_seg'].fillna(0.0).astype(np.float32)
feat_df['n_seg'] = feat_df['n_seg'].fillna(0.0).astype(np.float32)
feat_df['n_seg_log1p'] = feat_df['n_seg_log1p'].fillna(0.0).astype(np.float32)
# Replace inf/nan
feat_df = feat_df.replace([np.inf, -np.inf], 0.0)
feat_df = feat_df.fillna(0.0)

# Drop near-constant columns
all_feat_cols = [c for c in feat_df.columns if c != 'rec_id']
vars_ = feat_df[all_feat_cols].astype(np.float32).var(axis=0).values
keep_mask = vars_ > 1e-8
kept_cols = [col for col, keep in zip(all_feat_cols, keep_mask) if keep]
X_all = feat_df[['rec_id'] + kept_cols].copy()
print(f'Engineered features: total={len(all_feat_cols)}, kept={len(kept_cols)}')

# Build train/test matrices aligned to rec_ids
train_ids_sorted = pd.Index(sorted(train_rec_ids))
test_ids_sorted = pd.Index(sorted(test_rec_ids))
X_train_df = X_all.set_index('rec_id').reindex(train_ids_sorted)[kept_cols].astype(np.float32)
X_test_df = X_all.set_index('rec_id').reindex(test_ids_sorted)[kept_cols].astype(np.float32)
assert not np.isnan(X_train_df.values).any() and not np.isnan(X_test_df.values).any(), 'NaNs after engineering'

Y_df = pd.DataFrame(Y, index=train_rec_ids)
Y_aligned = Y_df.reindex(train_ids_sorted).values.astype(np.float32)

# 4) MultilabelStratifiedKFold(5) + simple models (kept here for reference, but downstream improved cell will use X_train_df/X_test_df)
SEED = 42
mskf = MultilabelStratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)

N = X_train_df.shape[0]
C = Y_aligned.shape[1]
oof_lgb = np.zeros((N, C), dtype=np.float32)
oof_lr = np.zeros((N, C), dtype=np.float32)
test_lgb_folds = []
test_lr_folds = []

# LightGBM params
lgb_params = dict(
    objective='binary', boosting_type='gbdt',
    n_estimators=2000, learning_rate=0.03, max_depth=4, num_leaves=16,
    min_child_samples=10, subsample=0.8, subsample_freq=1, colsample_bytree=0.7,
    reg_alpha=1.0, reg_lambda=5.0, random_state=SEED, n_jobs=-1, verbosity=-1
)

fold_macros = {'lgb':[], 'lr':[]}
fold_micros = {'lgb':[], 'lr':[]}

X_all_np = X_train_df.values
T_all_np = X_test_df.values

fold_idx = 0
for tr_idx, va_idx in mskf.split(X_all_np, Y_aligned):
    t_fold0 = time.time()
    print(f'=== Fold {fold_idx} ===', flush=True)
    X_tr, X_va = X_all_np[tr_idx], X_all_np[va_idx]
    y_tr, y_va = Y_aligned[tr_idx], Y_aligned[va_idx]

    # -- LightGBM OVR (class-wise) with early stopping
    # NOTE: the validation fold doubles as the early-stopping set, so fold AUCs are optimistic
    te_pred_lgb_fold = np.zeros((T_all_np.shape[0], C), dtype=np.float32)
    for c in range(C):
        y_tr_c = y_tr[:, c]
        y_va_c = y_va[:, c]
        if y_tr_c.max() == y_tr_c.min():
            oof_lgb[va_idx, c] = 0.0
            te_pred_lgb_fold[:, c] = 0.0
            continue
        clf = lgb.LGBMClassifier(**lgb_params)
        clf.fit(
            X_tr, y_tr_c,
            eval_set=[(X_va, y_va_c)],
            eval_metric='auc',
            callbacks=[lgb.early_stopping(100, verbose=False), lgb.log_evaluation(period=0)]
        )
        oof_lgb[va_idx, c] = clf.predict_proba(X_va)[:,1]
        te_pred_lgb_fold[:, c] = clf.predict_proba(T_all_np)[:,1]
    m_lgb, mi_lgb = auc_macro_micro(Y_aligned[va_idx], oof_lgb[va_idx])
    fold_macros['lgb'].append(m_lgb); fold_micros['lgb'].append(mi_lgb)
    print(f'  LGBM fold macro={m_lgb:.4f} micro={mi_lgb:.4f}', flush=True)

    # -- LogisticRegression (elastic-net) OVR with StandardScaler (fit in-fold)
    scaler = StandardScaler(with_mean=True, with_std=True)
    X_trs = scaler.fit_transform(X_tr)
    X_vas = scaler.transform(X_va)
    X_tes = scaler.transform(T_all_np)
    lr_clf = OneVsRestClassifier(LogisticRegression(solver='saga', penalty='elasticnet',
                                                    l1_ratio=0.5, C=0.5, max_iter=5000,
                                                    class_weight='balanced', random_state=SEED, n_jobs=-1))
    lr_clf.fit(X_trs, y_tr)
    lr_val_pred = np.vstack([est.predict_proba(X_vas)[:,1] for est in lr_clf.estimators_]).T
    lr_te_pred = np.vstack([est.predict_proba(X_tes)[:,1] for est in lr_clf.estimators_]).T
    oof_lr[va_idx] = lr_val_pred.astype(np.float32)
    m_lr, mi_lr = auc_macro_micro(Y_aligned[va_idx], lr_val_pred)
    fold_macros['lr'].append(m_lr); fold_micros['lr'].append(mi_lr)
    print(f'  LR(enet) fold macro={m_lr:.4f} micro={mi_lr:.4f}', flush=True)

    test_lgb_folds.append(te_pred_lgb_fold)
    test_lr_folds.append(lr_te_pred.astype(np.float32))
    print(f'Fold {fold_idx} time: {time.time()-t_fold0:.1f}s', flush=True)
    fold_idx += 1

print('CV LGBM macro:', fold_macros['lgb'], 'mean=', np.nanmean(fold_macros['lgb']))
print('CV LGBM micro:', fold_micros['lgb'], 'mean=', np.nanmean(fold_micros['lgb']))
print('CV LR   macro:', fold_macros['lr'], 'mean=', np.nanmean(fold_macros['lr']))
print('CV LR   micro:', fold_micros['lr'], 'mean=', np.nanmean(fold_micros['lr']))

# 5) Tune blend weight on OOF (alpha for LGBM; (1-alpha) for LR) to maximize macro AUC
best_alpha = 1.0
best_macro = -1.0
for a in np.linspace(0.0, 1.0, 11):
    oof_blend_try = a * oof_lgb + (1.0 - a) * oof_lr
    m_try, _ = auc_macro_micro(Y_aligned, oof_blend_try)
    if m_try > best_macro:
        best_macro = m_try
        best_alpha = float(a)
print(f'Best OOF blend alpha (LGBM weight)={best_alpha:.2f} | macro={best_macro:.4f}')

# 6) Average test preds across folds within each model, then blend with tuned alpha
test_lgb = np.mean(test_lgb_folds, axis=0)
test_lr = np.mean(test_lr_folds, axis=0)
test_blend = best_alpha * test_lgb + (1.0 - best_alpha) * test_lr

# 7) Build submission according to sample_submission order
test_id_order = sub_df[['Id','rec_id','class_id']].copy()
rec_to_idx = {rid:i for i, rid in enumerate(test_ids_sorted)}
probs = []
for _, row in test_id_order.iterrows():
    rid = int(row['rec_id']); cid = int(row['class_id'])
    ti = rec_to_idx[rid]
    p = float(test_blend[ti, cid])
    if not np.isfinite(p): p = 0.0
    probs.append(min(max(p, 0.0), 1.0))
sub_out = pd.DataFrame({'Id': test_id_order['Id'].astype(int), 'Probability': np.round(probs, 6)})
sub_out.to_csv('submission.csv', index=False)
print(f'Wrote submission.csv with {len(sub_out)} rows')
display(sub_out.head())

Engineered features: total=714, kept=697
=== Fold 0 ===
  LGBM fold macro=0.8130 micro=0.7576




KeyboardInterrupt: 
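
The `auc_macro_micro` helper in the cell above averages per-class AUCs and skips any class whose labels are all 0 or all 1, since AUC is undefined there. The following is a dependency-free sketch of that macro rule, not the notebook's code: `binary_auc` and `macro_auc` are illustrative names, and the pairwise AUC here stands in for `roc_auc_score`.

```python
def binary_auc(y_true, y_score):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined when only one class is present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_auc(Y_true, Y_score):
    """Mean per-class AUC over classes that contain both labels; NaN if none do."""
    aucs = []
    for c in range(len(Y_true[0])):
        yt = [row[c] for row in Y_true]
        ys = [row[c] for row in Y_score]
        if min(yt) == max(yt):
            continue  # single-class column: skip, as the notebook helper does
        aucs.append(binary_auc(yt, ys))
    return sum(aucs) / len(aucs) if aucs else float("nan")


# Toy data: class 1 has no positives, so it is excluded from the mean.
Y_true = [[1, 0, 0], [0, 0, 1], [1, 0, 0], [0, 0, 1]]
Y_score = [[0.9, 0.1, 0.2], [0.2, 0.3, 0.4], [0.6, 0.2, 0.5], [0.7, 0.1, 0.3]]
print(macro_auc(Y_true, Y_score))  # prints 0.625 (mean of 0.75 and 0.5)
```

Because skipped classes drop out of the mean, two folds can average over different class sets, which makes per-fold macro values only loosely comparable.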

In [19]:
# Improved CV with fixed tie-aware ranking, RepeatedMultilabelStratifiedKFold, stronger regularization, per-class scale_pos_weight (clip[1,20]), LGBM+XGB, rank-averaged ensembling
import numpy as np, pandas as pd, time, os, gc, warnings
from sklearn.metrics import roc_auc_score
from iterstrat.ml_stratifiers import RepeatedMultilabelStratifiedKFold
import lightgbm as lgb
import xgboost as xgb
from sklearn.feature_selection import mutual_info_classif

def masked_auc_macro(y_true, y_pred, valid_mask, return_counts=False):
    C = y_true.shape[1]
    aucs = []
    counts = []
    for c in range(C):
        mask = valid_mask[:, c].astype(bool)
        if mask.sum() <= 1:
            continue
        yt = y_true[mask, c]
        yp = y_pred[mask, c]
        if yt.max() == yt.min():
            continue
        try:
            aucs.append(roc_auc_score(yt, yp))
            counts.append(mask.sum())
        except Exception:
            pass
    macro = float(np.mean(aucs)) if len(aucs) else float('nan')
    if return_counts:
        return macro, aucs, counts
    return macro, aucs

# Tie-aware rank functions (PRIORITY 1 fix)
def rank_cols_with_mask(mat, valid_mask):
    from scipy.stats import rankdata
    N, C = mat.shape
    out = np.full((N, C), np.nan, dtype=np.float32)
    for c in range(C):
        mask = valid_mask[:, c].astype(bool)
        if not mask.any():
            continue
        col = mat[mask, c]
        r = rankdata(col, method='average')
        r = (r - 1) / (len(r) - 1) if len(r) > 1 else r*0.0
        out[mask, c] = r.astype(np.float32)
    return out

def rank_cols(mat):
    from scipy.stats import rankdata
    N, C = mat.shape
    out = np.zeros_like(mat, dtype=np.float32)
    for c in range(C):
        r = rankdata(mat[:, c], method='average')
        out[:, c] = (r - 1) / (N - 1) if N > 1 else r*0.0
    return out

SEED = 42
N = X_train_df.shape[0]
C = Y_aligned.shape[1]
X_all_np = X_train_df.values.astype(np.float32)
T_all_np = X_test_df.values.astype(np.float32)

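# Feature reduction via MI (global top-k by mean MI across classes, not union) [PRIORITY 2]
# NOTE: MI is computed on all training rows before the CV split, so the selection sees
# validation labels and the CV scores below may be optimistic.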
top_k = 220
mi_scores = []
for c in range(C):
    y = Y_aligned[:, c]
    if y.sum() < 2 or y.sum() == len(y):
        continue
    try:
        s = mutual_info_classif(X_all_np, y, discrete_features=False, random_state=SEED)
        mi_scores.append(s)
    except Exception:
        continue
if len(mi_scores) > 0:
    mi_scores = np.vstack(mi_scores)
    mi_mean = mi_scores.mean(axis=0)
    idx_sel = np.argsort(-mi_mean)[:min(top_k, X_all_np.shape[1])].astype(int)
else:
    idx_sel = np.arange(X_all_np.shape[1])
X_all_np = X_all_np[:, idx_sel]
T_all_np = T_all_np[:, idx_sel]
print(f'MI selection (mean across classes) -> kept features: {X_all_np.shape[1]} / {X_train_df.shape[1]}', flush=True)

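# Repeated Multilabel Stratified K-Fold (5x2) for stability
# NOTE: every row is validated once per repeat, so OOF writes from the second repeat
# overwrite those from the first wherever the class is valid in both.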
rmskf = RepeatedMultilabelStratifiedKFold(n_splits=5, n_repeats=2, random_state=SEED)

oof_lgb = np.zeros((N, C), dtype=np.float32)
oof_xgb = np.zeros((N, C), dtype=np.float32)
oof_valid_mask_lgb = np.zeros((N, C), dtype=np.uint8)
oof_valid_mask_xgb = np.zeros((N, C), dtype=np.uint8)
test_lgb_folds = []
test_xgb_folds = []

# LightGBM params (stronger regularization, shallow trees)
lgb_base = dict(
    objective='binary', boosting_type='gbdt',
    n_estimators=5000, learning_rate=0.025, max_depth=3, num_leaves=7,
    min_child_samples=22, subsample=0.75, subsample_freq=1, colsample_bytree=0.5,
    reg_alpha=1.0, reg_lambda=15.0, random_state=SEED, n_jobs=-1, verbosity=-1
)

# XGBoost native params
xgb_base = dict(
    objective='binary:logistic', eval_metric='auc', tree_method='hist',
    eta=0.025, max_depth=3, min_child_weight=8, subsample=0.7, colsample_bytree=0.55,
    alpha=1.0, **{'lambda': 15.0}, seed=SEED, nthread=-1
)

fold_id = 0
for tr_idx, va_idx in rmskf.split(X_all_np, Y_aligned):
    t0 = time.time()
    print(f'=== Fold {fold_id} ===', flush=True)
    X_tr, X_va = X_all_np[tr_idx], X_all_np[va_idx]
    y_tr, y_va = Y_aligned[tr_idx], Y_aligned[va_idx]

    te_pred_lgb = np.zeros((T_all_np.shape[0], C), dtype=np.float32)
    te_pred_xgb = np.zeros((T_all_np.shape[0], C), dtype=np.float32)

    for c in range(C):
        y_tr_c = y_tr[:, c]
        y_va_c = y_va[:, c]
        tr_pos = int((y_tr_c == 1).sum()); tr_neg = int((y_tr_c == 0).sum())
        va_pos = int((y_va_c == 1).sum()); va_neg = int((y_va_c == 0).sum())
        valid = (tr_pos > 0 and tr_neg > 0 and va_pos > 0 and va_neg > 0)
        if not valid:
            continue
        spw = float(tr_neg) / float(max(tr_pos, 1))
        spw = float(np.clip(spw, 1.0, 20.0))

        # LightGBM
        lgb_params = dict(lgb_base)
        lgb_params['scale_pos_weight'] = spw
        lgb_clf = lgb.LGBMClassifier(**lgb_params)
        lgb_clf.fit(
            X_tr, y_tr_c,
            eval_set=[(X_va, y_va_c)],
            eval_metric='auc',
            callbacks=[lgb.early_stopping(250, verbose=False), lgb.log_evaluation(period=0)]
        )
        oof_lgb[va_idx, c] = lgb_clf.predict_proba(X_va)[:, 1]
        te_pred_lgb[:, c] = lgb_clf.predict_proba(T_all_np)[:, 1]
        oof_valid_mask_lgb[va_idx, c] = 1

        # XGBoost (native)
        x_params = dict(xgb_base)
        x_params['scale_pos_weight'] = spw
        dtr = xgb.DMatrix(X_tr, label=y_tr_c)
        dva = xgb.DMatrix(X_va, label=y_va_c)
        dte = xgb.DMatrix(T_all_np)
        bst = xgb.train(
            params=x_params,
            dtrain=dtr,
            num_boost_round=4000,
            evals=[(dva, 'valid')],
            early_stopping_rounds=250,
            verbose_eval=False
        )
        oof_xgb[va_idx, c] = bst.predict(dva)
        te_pred_xgb[:, c] = bst.predict(dte)
        oof_valid_mask_xgb[va_idx, c] = 1

    test_lgb_folds.append(te_pred_lgb)
    test_xgb_folds.append(te_pred_xgb)

    fold_mask_l = (oof_valid_mask_lgb[va_idx] > 0).astype(np.uint8)
    m_lgb, _ = masked_auc_macro(y_va, oof_lgb[va_idx], fold_mask_l)
    fold_mask_x = (oof_valid_mask_xgb[va_idx] > 0).astype(np.uint8)
    m_xgb, _ = masked_auc_macro(y_va, oof_xgb[va_idx], fold_mask_x)
    print(f'  Fold {fold_id} masked macro: LGBM={m_lgb:.4f} | XGB={m_xgb:.4f} | time={time.time()-t0:.1f}s', flush=True)
    fold_id += 1

# Global masked OOF
macro_lgb, _ = masked_auc_macro(Y_aligned, oof_lgb, oof_valid_mask_lgb)
macro_xgb, _ = masked_auc_macro(Y_aligned, oof_xgb, oof_valid_mask_xgb)
print(f'Global masked OOF macro: LGBM={macro_lgb:.4f} | XGB={macro_xgb:.4f}')

# Rank-avg tuning using intersection mask
valid_inter = (oof_valid_mask_lgb.astype(bool) & oof_valid_mask_xgb.astype(bool)).astype(np.uint8)
oof_lgb_rank = rank_cols_with_mask(oof_lgb, valid_inter)
oof_xgb_rank = rank_cols_with_mask(oof_xgb, valid_inter)
best_alpha = 0.5
best_macro = -1.0
for a in np.linspace(0.0, 1.0, 21):
    blend = a * oof_lgb_rank + (1.0 - a) * oof_xgb_rank
    m, _ = masked_auc_macro(Y_aligned, blend, valid_inter)
    if m > best_macro:
        best_macro = m
        best_alpha = float(a)
print(f'Best masked OOF rank-blend alpha (LGBM weight)={best_alpha:.2f} | macro={best_macro:.4f}')

# Average test preds across folds and rank-average with tuned alpha
test_lgb = np.mean(test_lgb_folds, axis=0)
test_xgb = np.mean(test_xgb_folds, axis=0)
test_lgb_rank = rank_cols(test_lgb)
test_xgb_rank = rank_cols(test_xgb)
test_blend_rank = best_alpha * test_lgb_rank + (1.0 - best_alpha) * test_xgb_rank

# Build submission
test_id_order = sub_df[['Id','rec_id','class_id']].copy()
rec_to_idx = {rid:i for i, rid in enumerate(sorted(test_rec_ids))}
probs = []
for _, row in test_id_order.iterrows():
    rid = int(row['rec_id']); cid = int(row['class_id'])
    ti = rec_to_idx[rid]
    p = float(test_blend_rank[ti, cid])
    probs.append(min(max(p, 0.0), 1.0))
sub_out = pd.DataFrame({'Id': test_id_order['Id'].astype(int), 'Probability': np.round(probs, 6)})
sub_out.to_csv('submission.csv', index=False)
print(f'Wrote submission.csv with {len(sub_out)} rows (rank-averaged LGBM+XGB, repeated CV, MI-selected features)')
display(sub_out.head())

MI selection (mean across classes) -> kept features: 220 / 697


=== Fold 0 ===
  Fold 0 masked macro: LGBM=0.7457 | XGB=0.7659 | time=6.4s


=== Fold 1 ===
  Fold 1 masked macro: LGBM=0.7583 | XGB=0.7088 | time=6.1s


=== Fold 2 ===
  Fold 2 masked macro: LGBM=0.7885 | XGB=0.6649 | time=5.3s


=== Fold 3 ===
  Fold 3 masked macro: LGBM=0.7695 | XGB=0.7625 | time=6.8s


=== Fold 4 ===
  Fold 4 masked macro: LGBM=0.7949 | XGB=0.8093 | time=5.9s


=== Fold 5 ===
  Fold 5 masked macro: LGBM=0.7217 | XGB=0.7305 | time=5.7s


=== Fold 6 ===
  Fold 6 masked macro: LGBM=0.7494 | XGB=0.7470 | time=5.7s


=== Fold 7 ===
  Fold 7 masked macro: LGBM=0.7788 | XGB=0.7600 | time=5.5s


=== Fold 8 ===
  Fold 8 masked macro: LGBM=0.7753 | XGB=0.7256 | time=6.1s


=== Fold 9 ===
  Fold 9 masked macro: LGBM=0.7148 | XGB=0.6932 | time=6.0s


Global masked OOF macro: LGBM=0.6944 | XGB=0.6989


Best masked OOF rank-blend alpha (LGBM weight)=0.55 | macro=0.7108
Wrote submission.csv with 1216 rows (rank-averaged LGBM+XGB, repeated CV, MI-selected features)


Unnamed: 0,Id,Probability
0,100,0.302381
1,101,0.420635
2,102,0.420635
3,103,0.753968
4,104,0.584921
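
In the log above, per-fold LGBM macro AUCs range from 0.7148 to 0.7949, yet the global masked OOF macro is 0.6944. One possible contributor (an assumption, not something the log diagnoses) is that the global number pools predictions from models trained on different folds, whose score scales need not agree. The toy sketch below shows that two folds can each rank perfectly while the pooled AUC is lower; `binary_auc` is an illustrative pairwise AUC, not the notebook's `masked_auc_macro`.

```python
def binary_auc(y_true, y_score):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Two folds scored by different models: each ranks its own rows perfectly,
# but fold B's scores sit on a lower scale than fold A's.
fold_a = ([1, 0], [0.9, 0.8])
fold_b = ([1, 0], [0.3, 0.2])

per_fold = [binary_auc(*fold_a), binary_auc(*fold_b)]
pooled = binary_auc(fold_a[0] + fold_b[0], fold_a[1] + fold_b[1])
print(per_fold, pooled)  # prints [1.0, 1.0] 0.75
```

In the pooled set, fold B's positive (0.3) falls below fold A's negative (0.8), which costs one of the four pairs. Separately, the cell writes `oof_lgb[va_idx, c]` on both repeats, so the pooled array also mixes predictions from the two repeats.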

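The `rank_cols` helper in the cell above turns each prediction column into tie-aware ranks scaled to [0, 1] before blending. Here is a dependency-free sketch of the same idea for a single column, assuming the average-rank convention that `rankdata(..., method='average')` uses; `average_ranks` and `normalized_ranks` are illustrative names.

```python
def average_ranks(values):
    """1-based ranks where tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def normalized_ranks(values):
    """Scale average ranks to [0, 1]; zero or one value maps to 0.0."""
    n = len(values)
    if n <= 1:
        return [0.0] * n
    return [(r - 1) / (n - 1) for r in average_ranks(values)]


print(normalized_ranks([0.2, 0.7, 0.7, 0.9]))  # prints [0.0, 0.5, 0.5, 1.0]
```

Rank scaling discards each model's calibration and keeps only its ordering, which is all AUC depends on. It also explains why the submission values above are rank fractions rather than probabilities.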

In [20]:
# Add CatBoost model with strong regularization and rank-ensemble with LGBM+XGB
import time, numpy as np, pandas as pd, warnings, sys, subprocess
warnings.filterwarnings('ignore')

def pip_install(pkg_line):
    print(f"Installing: {pkg_line}", flush=True)
    code = subprocess.call([sys.executable, '-m', 'pip', 'install'] + pkg_line.split())
    print(f"Return code: {code}")
    return code

try:
    import catboost
except Exception:
    pip_install('catboost==1.2.5')
    import catboost

from catboost import CatBoostClassifier, Pool

SEED = 42
N = X_all_np.shape[0]
C = Y_aligned.shape[1]

# OOF/test containers for CatBoost
oof_cb = np.zeros((N, C), dtype=np.float32)
oof_valid_mask_cb = np.zeros((N, C), dtype=np.uint8)
test_cb_folds = []

# Recreate the splitter with the same seed so CatBoost is evaluated on the same splits as LGBM/XGB
from iterstrat.ml_stratifiers import RepeatedMultilabelStratifiedKFold
rmskf = RepeatedMultilabelStratifiedKFold(n_splits=5, n_repeats=2, random_state=SEED)

fold_id = 0
for tr_idx, va_idx in rmskf.split(X_all_np, Y_aligned):
    t0 = time.time()
    print(f'=== CB Fold {fold_id} ===', flush=True)
    X_tr, X_va = X_all_np[tr_idx], X_all_np[va_idx]
    y_tr, y_va = Y_aligned[tr_idx], Y_aligned[va_idx]
    te_pred_cb = np.zeros((T_all_np.shape[0], C), dtype=np.float32)

    for c in range(C):
        y_tr_c = y_tr[:, c]
        y_va_c = y_va[:, c]
        tr_pos = int((y_tr_c == 1).sum()); tr_neg = int((y_tr_c == 0).sum())
        va_pos = int((y_va_c == 1).sum()); va_neg = int((y_va_c == 0).sum())
        valid = (tr_pos > 0 and tr_neg > 0 and va_pos > 0 and va_neg > 0)
        if not valid:
            continue
        train_pool = Pool(X_tr, label=y_tr_c)
        valid_pool = Pool(X_va, label=y_va_c)
        cb = CatBoostClassifier(
            loss_function='Logloss',
            depth=4,
            learning_rate=0.025,
            iterations=5000,
            l2_leaf_reg=15.0,
            auto_class_weights='Balanced',
            random_seed=SEED,
            early_stopping_rounds=250,
            verbose=False,
            task_type='CPU',
            thread_count=-1
        )
        cb.fit(train_pool, eval_set=valid_pool)
        oof_cb[va_idx, c] = cb.predict_proba(X_va)[:, 1]
        te_pred_cb[:, c] = cb.predict_proba(T_all_np)[:, 1]
        oof_valid_mask_cb[va_idx, c] = 1

    test_cb_folds.append(te_pred_cb)
    # Per-fold monitor
    fold_mask_cb = (oof_valid_mask_cb[va_idx] > 0).astype(np.uint8)
    m_cb, _ = masked_auc_macro(y_va, oof_cb[va_idx], fold_mask_cb)
    print(f'  CB Fold {fold_id} masked macro: {m_cb:.4f} | time={time.time()-t0:.1f}s', flush=True)
    fold_id += 1

# Global CatBoost masked OOF
macro_cb, _ = masked_auc_macro(Y_aligned, oof_cb, oof_valid_mask_cb)
print(f'Global masked OOF macro: CatBoost={macro_cb:.4f}')

# Rank-ensemble across LGBM, XGB, CatBoost using intersection mask
valid_inter_all = (oof_valid_mask_lgb.astype(bool) & oof_valid_mask_xgb.astype(bool) & oof_valid_mask_cb.astype(bool)).astype(np.uint8)
oof_lgb_rank = rank_cols_with_mask(oof_lgb, valid_inter_all)
oof_xgb_rank = rank_cols_with_mask(oof_xgb, valid_inter_all)
oof_cb_rank  = rank_cols_with_mask(oof_cb,  valid_inter_all)

# Coarse grid search for weights that sum to 1: w_lgb, w_xgb, w_cb
best_w = (1/3, 1/3, 1/3)
best_macro = -1.0
grid = np.linspace(0.0, 1.0, 6)
for wl in grid:
    for wx in grid:
        wc = 1.0 - wl - wx
        if wc < -1e-9:  # tolerance: float round-off can leave a boundary weight a hair below 0
            continue
        wc = max(wc, 0.0)
        blend = wl * oof_lgb_rank + wx * oof_xgb_rank + wc * oof_cb_rank
        m, _ = masked_auc_macro(Y_aligned, blend, valid_inter_all)
        if m > best_macro:
            best_macro = m
            best_w = (float(wl), float(wx), float(wc))
print(f'Best 3-model rank-blend weights (LGBM, XGB, CB)={best_w} | masked OOF macro={best_macro:.4f}')

# Average test preds across folds for each model, then convert to ranks
test_cb = np.mean(test_cb_folds, axis=0)
test_lgb = np.mean(test_lgb_folds, axis=0)
test_xgb = np.mean(test_xgb_folds, axis=0)
test_lgb_rank = rank_cols(test_lgb)
test_xgb_rank = rank_cols(test_xgb)
test_cb_rank  = rank_cols(test_cb)

wl, wx, wc = best_w
test_blend_rank3 = wl * test_lgb_rank + wx * test_xgb_rank + wc * test_cb_rank

# Build submission
test_id_order = sub_df[['Id','rec_id','class_id']].copy()
rec_to_idx = {rid:i for i, rid in enumerate(sorted(test_rec_ids))}
probs = []
for _, row in test_id_order.iterrows():
    rid = int(row['rec_id']); cid = int(row['class_id'])
    ti = rec_to_idx[rid]
    p = float(test_blend_rank3[ti, cid])
    probs.append(min(max(p, 0.0), 1.0))
sub_out = pd.DataFrame({'Id': test_id_order['Id'].astype(int), 'Probability': np.round(probs, 6)})
sub_out.to_csv('submission.csv', index=False)
print(f'Wrote submission.csv with {len(sub_out)} rows (rank-averaged LGBM+XGB+CatBoost)')
display(sub_out.head())

=== CB Fold 0 ===


  CB Fold 0 masked macro: 0.7851 | time=37.1s


=== CB Fold 1 ===


  CB Fold 1 masked macro: 0.7479 | time=16.8s


=== CB Fold 2 ===


  CB Fold 2 masked macro: 0.7568 | time=23.0s


=== CB Fold 3 ===


  CB Fold 3 masked macro: 0.8154 | time=41.1s


=== CB Fold 4 ===


  CB Fold 4 masked macro: 0.8180 | time=20.3s


=== CB Fold 5 ===


  CB Fold 5 masked macro: 0.7424 | time=25.6s


=== CB Fold 6 ===


  CB Fold 6 masked macro: 0.7677 | time=25.7s


=== CB Fold 7 ===


  CB Fold 7 masked macro: 0.7923 | time=29.3s


=== CB Fold 8 ===


  CB Fold 8 masked macro: 0.7989 | time=36.8s


=== CB Fold 9 ===


  CB Fold 9 masked macro: 0.7455 | time=21.8s


Global masked OOF macro: CatBoost=0.7996


Best 3-model rank-blend weights (LGBM, XGB, CB)=(0.2, 0.0, 0.8) | masked OOF macro=0.8018
Wrote submission.csv with 1216 rows (rank-averaged LGBM+XGB+CatBoost)


Unnamed: 0,Id,Probability
0,100,0.25873
1,101,0.522222
2,102,0.433333
3,103,0.753968
4,104,0.538095
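
The blend-weight search in the cell above walks a float grid and derives the last weight as `1.0 - wl - wx`. With floats, a boundary point can come out slightly negative: in plain Python, `1.0 - 0.4 - 3 * 0.2` is below zero. The sketch below is one possible alternative, not the notebook's method. It enumerates the weight simplex with integer counts so every tuple sums to exactly `steps` before dividing; `simplex_grid` is a hypothetical helper name.

```python
from itertools import product


def simplex_grid(n_models, steps):
    """Yield every weight tuple with entries k/steps whose integer counts sum to steps."""
    for counts in product(range(steps + 1), repeat=n_models - 1):
        last = steps - sum(counts)
        if last < 0:
            continue  # the first n_models - 1 weights already exceed 1
        yield tuple(k / steps for k in counts + (last,))


grid3 = list(simplex_grid(3, 5))
print(len(grid3))                 # prints 21
print((0.4, 0.6, 0.0) in grid3)   # prints True: the boundary point is kept
print(1.0 - 0.4 - 3 * 0.2 < 0)    # prints True: the float version would drop it
```

For three models and a step of 0.2 this gives 21 candidate weight tuples; each can be scored with the masked OOF macro exactly as in the loop above.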


In [21]:
# CatBoost variant (depth=3) + 4-model rank-ensemble (LGBM, XGB, CB_d4, CB_d3)
import numpy as np, time, pandas as pd
from catboost import CatBoostClassifier, Pool
from iterstrat.ml_stratifiers import RepeatedMultilabelStratifiedKFold

SEED = 42
N = X_all_np.shape[0]
C = Y_aligned.shape[1]

oof_cb2 = np.zeros((N, C), dtype=np.float32)
oof_valid_mask_cb2 = np.zeros((N, C), dtype=np.uint8)
test_cb2_folds = []

rmskf = RepeatedMultilabelStratifiedKFold(n_splits=5, n_repeats=2, random_state=SEED)
fold_id = 0
for tr_idx, va_idx in rmskf.split(X_all_np, Y_aligned):
    t0 = time.time()
    print(f'=== CB2 Fold {fold_id} ===', flush=True)
    X_tr, X_va = X_all_np[tr_idx], X_all_np[va_idx]
    y_tr, y_va = Y_aligned[tr_idx], Y_aligned[va_idx]
    te_pred_cb2 = np.zeros((T_all_np.shape[0], C), dtype=np.float32)
    for c in range(C):
        y_tr_c = y_tr[:, c]; y_va_c = y_va[:, c]
        tr_pos = int((y_tr_c == 1).sum()); tr_neg = int((y_tr_c == 0).sum())
        va_pos = int((y_va_c == 1).sum()); va_neg = int((y_va_c == 0).sum())
        if not (tr_pos > 0 and tr_neg > 0 and va_pos > 0 and va_neg > 0):
            continue
        train_pool = Pool(X_tr, label=y_tr_c)
        valid_pool = Pool(X_va, label=y_va_c)
        cb2 = CatBoostClassifier(
            loss_function='Logloss',
            depth=3,
            learning_rate=0.03,
            iterations=5000,
            l2_leaf_reg=20.0,
            auto_class_weights='Balanced',
            random_seed=SEED,
            early_stopping_rounds=250,
            verbose=False,
            task_type='CPU',
            thread_count=-1
        )
        cb2.fit(train_pool, eval_set=valid_pool)
        oof_cb2[va_idx, c] = cb2.predict_proba(X_va)[:, 1]
        te_pred_cb2[:, c] = cb2.predict_proba(T_all_np)[:, 1]
        oof_valid_mask_cb2[va_idx, c] = 1
    test_cb2_folds.append(te_pred_cb2)
    fold_mask_cb2 = (oof_valid_mask_cb2[va_idx] > 0).astype(np.uint8)
    m_cb2, _ = masked_auc_macro(y_va, oof_cb2[va_idx], fold_mask_cb2)
    print(f'  CB2 Fold {fold_id} masked macro: {m_cb2:.4f} | time={time.time()-t0:.1f}s', flush=True)
    fold_id += 1

macro_cb2, _ = masked_auc_macro(Y_aligned, oof_cb2, oof_valid_mask_cb2)
print(f'Global masked OOF macro: CatBoost(depth=3)={macro_cb2:.4f}')

# Rank-ensemble 4 models: LGBM, XGB, CB(depth=4)=oof_cb, CB(depth=3)=oof_cb2
valid_inter_4 = (oof_valid_mask_lgb.astype(bool) & oof_valid_mask_xgb.astype(bool) & oof_valid_mask_cb.astype(bool) & oof_valid_mask_cb2.astype(bool)).astype(np.uint8)
o_lgb_r = rank_cols_with_mask(oof_lgb, valid_inter_4)
o_xgb_r = rank_cols_with_mask(oof_xgb, valid_inter_4)
o_cb1_r = rank_cols_with_mask(oof_cb,  valid_inter_4)
o_cb2_r = rank_cols_with_mask(oof_cb2, valid_inter_4)

best_w = (0.2, 0.0, 0.4, 0.4)  # init guess (LGB, XGB, CB1, CB2)
best_macro = -1.0
grid = np.linspace(0.0, 1.0, 11)
for wl in grid:
    for wx in grid:
        for wc1 in grid:
            wc2 = 1.0 - wl - wx - wc1
            if wc2 < -1e-9:  # tolerance: float round-off can leave a boundary weight a hair below 0
                continue
            wc2 = max(wc2, 0.0)
            blend = wl * o_lgb_r + wx * o_xgb_r + wc1 * o_cb1_r + wc2 * o_cb2_r
            m, _ = masked_auc_macro(Y_aligned, blend, valid_inter_4)
            if m > best_macro:
                best_macro = m
                best_w = (float(wl), float(wx), float(wc1), float(wc2))
print(f'Best 4-model weights (LGBM, XGB, CBd4, CBd3)={best_w} | masked OOF macro={best_macro:.4f}')

# Build final submission with 4-model weights
test_cb = np.mean(test_cb_folds, axis=0)
test_cb2 = np.mean(test_cb2_folds, axis=0)
test_lgb = np.mean(test_lgb_folds, axis=0)
test_xgb = np.mean(test_xgb_folds, axis=0)
t_lgb_r = rank_cols(test_lgb)
t_xgb_r = rank_cols(test_xgb)
t_cb1_r = rank_cols(test_cb)
t_cb2_r = rank_cols(test_cb2)
wl, wx, wc1, wc2 = best_w
test_blend_rank4 = wl * t_lgb_r + wx * t_xgb_r + wc1 * t_cb1_r + wc2 * t_cb2_r

test_id_order = sub_df[['Id','rec_id','class_id']].copy()
rec_to_idx = {rid:i for i, rid in enumerate(sorted(test_rec_ids))}
probs = []
for _, row in test_id_order.iterrows():
    rid = int(row['rec_id']); cid = int(row['class_id'])
    ti = rec_to_idx[rid]
    p = float(test_blend_rank4[ti, cid])
    probs.append(min(max(p, 0.0), 1.0))
sub_out = pd.DataFrame({'Id': test_id_order['Id'].astype(int), 'Probability': np.round(probs, 6)})
sub_out.to_csv('submission.csv', index=False)
print(f'Wrote submission.csv with {len(sub_out)} rows (4-model rank ensemble)')
display(sub_out.head())

=== CB2 Fold 0 ===
  CB2 Fold 0 masked macro: 0.8004 | time=33.7s
=== CB2 Fold 1 ===
  CB2 Fold 1 masked macro: 0.7753 | time=21.3s
=== CB2 Fold 2 ===
  CB2 Fold 2 masked macro: 0.7896 | time=18.3s
=== CB2 Fold 3 ===
  CB2 Fold 3 masked macro: 0.8137 | time=28.0s
=== CB2 Fold 4 ===
  CB2 Fold 4 masked macro: 0.8453 | time=18.3s
=== CB2 Fold 5 ===
  CB2 Fold 5 masked macro: 0.7454 | time=25.4s
=== CB2 Fold 6 ===
  CB2 Fold 6 masked macro: 0.7709 | time=25.0s
=== CB2 Fold 7 ===
  CB2 Fold 7 masked macro: 0.8051 | time=23.7s
=== CB2 Fold 8 ===
  CB2 Fold 8 masked macro: 0.8114 | time=25.0s
=== CB2 Fold 9 ===
  CB2 Fold 9 masked macro: 0.7387 | time=30.6s
Global masked OOF macro: CatBoost(depth=3)=0.8079
Best 4-model weights (LGBM, XGB, CBd4, CBd3)=(0.1, 0.0, 0.0, 0.9) | masked OOF macro=0.8107
Wrote submission.csv with 1216 rows (4-model rank ensemble)


    Id  Probability
0  100     0.252381
1  101     0.534921
2  102     0.434921
3  103     0.753968
4  104     0.549206
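
The weight search above derives the last weight as `1.0 - wl - wx - wc1` from `np.linspace` values, so floating-point round-off can push it slightly below zero and silently drop a valid grid point (this can happen, for example, when two weights are 0.3 and 0.7). A minimal sketch of one way around this, assuming integer step counts are acceptable: enumerate the counts and divide once at the end. `simplex_grid` is a hypothetical helper, not part of the notebook.

```python
from itertools import product


def simplex_grid(n_models, steps=10):
    """Yield weight tuples on an evenly spaced grid over the probability simplex.

    Hypothetical helper: weights are built from integer counts that sum to
    `steps`, so no candidate is lost to floating-point round-off.
    """
    for combo in product(range(steps + 1), repeat=n_models - 1):
        last = steps - sum(combo)  # exact integer arithmetic
        if last < 0:
            continue
        yield tuple(k / steps for k in (*combo, last))


if __name__ == "__main__":
    candidates = list(simplex_grid(4, steps=10))
    print(len(candidates))  # 286 tuples for 4 models at step 0.1
```

Each tuple could then be scored exactly as in the loop above, with the blend and `masked_auc_macro` call unchanged.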


In [22]:
# CatBoost variant 3 (depth=4, lr=0.02, l2=20, seed=123) + CB-only 3-model rank ensemble
import numpy as np, time, pandas as pd
from catboost import CatBoostClassifier, Pool
from iterstrat.ml_stratifiers import RepeatedMultilabelStratifiedKFold

SEED3 = 123
N = X_all_np.shape[0]
C = Y_aligned.shape[1]

oof_cb3 = np.zeros((N, C), dtype=np.float32)
oof_valid_mask_cb3 = np.zeros((N, C), dtype=np.uint8)
test_cb3_folds = []

rmskf = RepeatedMultilabelStratifiedKFold(n_splits=5, n_repeats=2, random_state=SEED3)
fold_id = 0
for tr_idx, va_idx in rmskf.split(X_all_np, Y_aligned):
    t0 = time.time()
    print(f'=== CB3 Fold {fold_id} ===', flush=True)
    X_tr, X_va = X_all_np[tr_idx], X_all_np[va_idx]
    y_tr, y_va = Y_aligned[tr_idx], Y_aligned[va_idx]
    te_pred_cb3 = np.zeros((T_all_np.shape[0], C), dtype=np.float32)
    for c in range(C):
        y_tr_c = y_tr[:, c]; y_va_c = y_va[:, c]
        tr_pos = int((y_tr_c == 1).sum()); tr_neg = int((y_tr_c == 0).sum())
        va_pos = int((y_va_c == 1).sum()); va_neg = int((y_va_c == 0).sum())
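        # Skip classes that lack both labels in either split: the model cannot be fit or
        # the AUC cannot be scored, and the OOF mask stays 0 for those entries.
        if not (tr_pos > 0 and tr_neg > 0 and va_pos > 0 and va_neg > 0):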
            continue
        train_pool = Pool(X_tr, label=y_tr_c)
        valid_pool = Pool(X_va, label=y_va_c)
        cb3 = CatBoostClassifier(
            loss_function='Logloss',
            depth=4,
            learning_rate=0.02,
            iterations=5000,
            l2_leaf_reg=20.0,
            auto_class_weights='Balanced',
            random_seed=SEED3,
            early_stopping_rounds=250,
            verbose=False,
            task_type='CPU',
            thread_count=-1
        )
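        # Early stopping monitors this validation fold, so OOF scores on the same fold are
        # optimistically biased. With n_repeats=2, the second repeat overwrites the first
        # repeat's OOF entries wherever the class is trained again (no averaging).
        cb3.fit(train_pool, eval_set=valid_pool)
        oof_cb3[va_idx, c] = cb3.predict_proba(X_va)[:, 1]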
        te_pred_cb3[:, c] = cb3.predict_proba(T_all_np)[:, 1]
        oof_valid_mask_cb3[va_idx, c] = 1
    test_cb3_folds.append(te_pred_cb3)
    fold_mask_cb3 = (oof_valid_mask_cb3[va_idx] > 0).astype(np.uint8)
    m_cb3, _ = masked_auc_macro(y_va, oof_cb3[va_idx], fold_mask_cb3)
    print(f'  CB3 Fold {fold_id} masked macro: {m_cb3:.4f} | time={time.time()-t0:.1f}s', flush=True)
    fold_id += 1

macro_cb3, _ = masked_auc_macro(Y_aligned, oof_cb3, oof_valid_mask_cb3)
print(f'Global masked OOF macro: CatBoost(depth=4, lr=0.02, l2=20, seed=123)={macro_cb3:.4f}')

# Rank-ensemble CB-only (CB1=oof_cb, CB2=oof_cb2, CB3=oof_cb3) using intersection mask
valid_inter_cb = (oof_valid_mask_cb.astype(bool) & oof_valid_mask_cb2.astype(bool) & oof_valid_mask_cb3.astype(bool)).astype(np.uint8)
o_cb1_r = rank_cols_with_mask(oof_cb,  valid_inter_cb)
o_cb2_r = rank_cols_with_mask(oof_cb2, valid_inter_cb)
o_cb3_r = rank_cols_with_mask(oof_cb3, valid_inter_cb)

best_w = (1/3, 1/3, 1/3)
best_macro = -1.0
grid = np.linspace(0.0, 1.0, 11)
for w1 in grid:
    for w2 in grid:
        w3 = 1.0 - w1 - w2
        if w3 < 0 or w3 > 1:
            continue
        blend = w1 * o_cb1_r + w2 * o_cb2_r + w3 * o_cb3_r
        m, _ = masked_auc_macro(Y_aligned, blend, valid_inter_cb)
        if m > best_macro:
            best_macro = m
            best_w = (float(w1), float(w2), float(w3))
print(f'Best CB-only weights (CBd4_s42, CBd3_s42, CBd4_s123)={best_w} | masked OOF macro={best_macro:.4f}')

# Build final submission with CB-only weights
test_cb1 = np.mean(test_cb_folds, axis=0)
test_cb2 = np.mean(test_cb2_folds, axis=0)
test_cb3 = np.mean(test_cb3_folds, axis=0)
t_cb1_r = rank_cols(test_cb1)
t_cb2_r = rank_cols(test_cb2)
t_cb3_r = rank_cols(test_cb3)
w1, w2, w3 = best_w
test_blend_rank_cb = w1 * t_cb1_r + w2 * t_cb2_r + w3 * t_cb3_r

test_id_order = sub_df[['Id','rec_id','class_id']].copy()
rec_to_idx = {rid:i for i, rid in enumerate(sorted(test_rec_ids))}
probs = []
for _, row in test_id_order.iterrows():
    rid = int(row['rec_id']); cid = int(row['class_id'])
    ti = rec_to_idx[rid]
    p = float(test_blend_rank_cb[ti, cid])
    probs.append(min(max(p, 0.0), 1.0))
sub_out = pd.DataFrame({'Id': test_id_order['Id'].astype(int), 'Probability': np.round(probs, 6)})
sub_out.to_csv('submission.csv', index=False)
print(f'Wrote submission.csv with {len(sub_out)} rows (CatBoost-only 3-model rank ensemble)')
display(sub_out.head())

=== CB3 Fold 0 ===
  CB3 Fold 0 masked macro: 0.7633 | time=21.9s
=== CB3 Fold 1 ===
  CB3 Fold 1 masked macro: 0.8363 | time=27.8s
=== CB3 Fold 2 ===
  CB3 Fold 2 masked macro: 0.8168 | time=58.9s
=== CB3 Fold 3 ===
  CB3 Fold 3 masked macro: 0.7522 | time=32.2s
=== CB3 Fold 4 ===
  CB3 Fold 4 masked macro: 0.8264 | time=15.3s
=== CB3 Fold 5 ===
  CB3 Fold 5 masked macro: 0.8126 | time=27.7s
=== CB3 Fold 6 ===
  CB3 Fold 6 masked macro: 0.8322 | time=43.1s
=== CB3 Fold 7 ===
  CB3 Fold 7 masked macro: 0.8047 | time=34.1s
=== CB3 Fold 8 ===
  CB3 Fold 8 masked macro: 0.8143 | time=35.4s
=== CB3 Fold 9 ===
  CB3 Fold 9 masked macro: 0.8080 | time=33.2s
Global masked OOF macro: CatBoost(depth=4, lr=0.02, l2=20, seed=123)=0.8109
Best CB-only weights (CBd4_s42, CBd3_s42, CBd4_s123)=(0.30000000000000004, 0.1, 0.6) | masked OOF macro=0.8365
Wrote submission.csv with 1216 rows (CatBoost-only 3-model rank ensemble)


    Id  Probability
0  100     0.246032
1  101     0.538095
2  102     0.417460
3  103     0.753968
4  104     0.523810


In [23]:
# Official CV (CVfolds_2) + CatBoost seed bag (equal-weight rank avg), no global MI, no weight tuning
import numpy as np, pandas as pd, time
from catboost import CatBoostClassifier, Pool

# Prepare matrices from full engineered features (no global MI)
X_tr_full = X_train_df.values.astype(np.float32)
X_te_full = X_test_df.values.astype(np.float32)
Y_full = Y_aligned.astype(np.float32)
N, C = X_tr_full.shape[0], Y_full.shape[1]

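# Official folds mapping for the labeled train ids.
# Caveat: in this run every labeled id maps to fold 0 (see the log), so CVfolds_2 gives no
# usable validation split here; both folds are skipped and the submission is constant 0.5.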
fold_series = pd.Series(pd.Index(sorted(train_rec_ids))).map(fold_map).astype('Int64')
fold_series = fold_series.fillna(-1).astype(int)
folds_present = sorted([f for f in np.unique(fold_series.values) if f >= 0])
print('Folds present among labeled train ids:', folds_present, flush=True)

# Define CatBoost variants (diverse, strong regularization)
variants = [
    dict(name='cb_d3_lr003_l2_20_rsm06_s42', depth=3, learning_rate=0.03, l2_leaf_reg=20.0, rsm=0.6, subsample=0.8, random_strength=0.8, seed=42),
    dict(name='cb_d4_lr0025_l2_15_rsm06_s123', depth=4, learning_rate=0.025, l2_leaf_reg=15.0, rsm=0.6, subsample=0.8, random_strength=0.8, seed=123),
    dict(name='cb_d4_lr002_l2_25_rsm05_s456', depth=4, learning_rate=0.02, l2_leaf_reg=25.0, rsm=0.5, subsample=0.8, random_strength=1.0, seed=456),
    dict(name='cb_d3_lr0025_l2_25_rsm07_s789', depth=3, learning_rate=0.025, l2_leaf_reg=25.0, rsm=0.7, subsample=0.8, random_strength=0.6, seed=789),
]

oof_list = []
mask_list = []
test_pred_list = []

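def masked_auc_macro(y_true, y_pred, valid_mask):
    # Macro-average of per-class ROC AUC over the rows where valid_mask is set.
    # Note: this redefinition returns only the macro float, whereas earlier cells unpack
    # two values (`m, _ = masked_auc_macro(...)`); re-running those cells after this one
    # fails unless the original helper is restored.
    from sklearn.metrics import roc_auc_score
    C = y_true.shape[1]
    aucs = []
    for c in range(C):
        m = valid_mask[:, c].astype(bool)
        if m.sum() <= 1:
            continue
        yt = y_true[m, c]
        yp = y_pred[m, c]
        # AUC is undefined when only one label value is present
        if yt.max() == yt.min():
            continue
        try:
            aucs.append(roc_auc_score(yt, yp))
        except Exception:
            pass
    return float(np.mean(aucs)) if len(aucs) else float('nan')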

def rank_cols_with_mask(mat, valid_mask):
    from scipy.stats import rankdata
    N, C = mat.shape
    out = np.full((N, C), np.nan, dtype=np.float32)
    for c in range(C):
        mask = valid_mask[:, c].astype(bool)
        if not mask.any():
            continue
        col = mat[mask, c]
        r = rankdata(col, method='average')
        r = (r - 1) / (len(r) - 1) if len(r) > 1 else r*0.0
        out[mask, c] = r.astype(np.float32)
    return out

def rank_cols(mat):
    from scipy.stats import rankdata
    N, C = mat.shape
    out = np.zeros_like(mat, dtype=np.float32)
    for c in range(C):
        r = rankdata(mat[:, c], method='average')
        out[:, c] = (r - 1) / (N - 1) if N > 1 else r*0.0
    return out

for v in variants:
    print(f"=== Variant {v['name']} ===", flush=True)
    oof = np.zeros((N, C), dtype=np.float32)
    vmask = np.zeros((N, C), dtype=np.uint8)
    test_folds = []
    for fold in [0, 1]:
        trn_mask = (fold_series.values != fold)
        val_mask = (fold_series.values == fold)
        if trn_mask.sum() == 0 or val_mask.sum() == 0:
            print(f'  Skip fold {fold}: trn={trn_mask.sum()}, val={val_mask.sum()}', flush=True)
            continue
        X_tr, X_va = X_tr_full[trn_mask], X_tr_full[val_mask]
        y_tr, y_va = Y_full[trn_mask], Y_full[val_mask]
        te_pred = np.zeros((X_te_full.shape[0], C), dtype=np.float32)
        t0 = time.time()
        for c in range(C):
            y_tr_c = y_tr[:, c]; y_va_c = y_va[:, c]
            tr_pos = int((y_tr_c == 1).sum()); tr_neg = int((y_tr_c == 0).sum())
            va_pos = int((y_va_c == 1).sum()); va_neg = int((y_va_c == 0).sum())
            if not (tr_pos > 0 and tr_neg > 0 and va_pos > 0 and va_neg > 0):
                continue
            train_pool = Pool(X_tr, label=y_tr_c)
            valid_pool = Pool(X_va, label=y_va_c)
            cb = CatBoostClassifier(
                loss_function='Logloss',
                depth=v['depth'],
                learning_rate=v['learning_rate'],
                iterations=5000,
                l2_leaf_reg=v['l2_leaf_reg'],
                rsm=v['rsm'],
                subsample=v['subsample'],
                random_strength=v['random_strength'],
                auto_class_weights='Balanced',
                random_seed=v['seed'],
                early_stopping_rounds=250,
                verbose=False,
                task_type='CPU',
                thread_count=-1
            )
            cb.fit(train_pool, eval_set=valid_pool)
            oof[val_mask, c] = cb.predict_proba(X_va)[:, 1]
            te_pred[:, c] = cb.predict_proba(X_te_full)[:, 1]
            vmask[val_mask, c] = 1
        fold_macro = masked_auc_macro(y_va, oof[val_mask], vmask[val_mask])
        print(f"  Fold {fold} masked macro: {fold_macro:.4f} | time={time.time()-t0:.1f}s", flush=True)
        test_folds.append(te_pred)
    test_mean = np.mean(test_folds, axis=0) if len(test_folds) else np.zeros((X_te_full.shape[0], C), dtype=np.float32)
    oof_list.append(oof); mask_list.append(vmask); test_pred_list.append(test_mean)
    macro = masked_auc_macro(Y_full, oof, vmask)
    print(f"Variant OOF masked macro: {macro:.4f}", flush=True)

# Equal-weight rank ensemble across variants
inter_mask = mask_list[0].astype(bool)
for m in mask_list[1:]:
    inter_mask &= m.astype(bool)
inter_mask = inter_mask.astype(np.uint8)
ranked_oofs = [rank_cols_with_mask(o, inter_mask) for o in oof_list]
oof_ens = np.mean(ranked_oofs, axis=0)
macro_ens = masked_auc_macro(Y_full, oof_ens, inter_mask)
print(f'Ensemble (equal-weight) masked OOF macro: {macro_ens:.4f}', flush=True)

# Build test ensemble submission (equal-weight rank avg)
ranked_tests = [rank_cols(tp) for tp in test_pred_list]
test_ens = np.mean(ranked_tests, axis=0)

test_id_order = sub_df[['Id','rec_id','class_id']].copy()
rec_to_idx = {rid:i for i, rid in enumerate(sorted(test_rec_ids))}
probs = []
for _, row in test_id_order.iterrows():
    rid = int(row['rec_id']); cid = int(row['class_id'])
    ti = rec_to_idx[rid]
    p = float(test_ens[ti, cid])
    probs.append(min(max(p, 0.0), 1.0))
sub_out = pd.DataFrame({'Id': test_id_order['Id'].astype(int), 'Probability': np.round(probs, 6)})
sub_out.to_csv('submission.csv', index=False)
print(f"Wrote submission.csv with {len(sub_out)} rows (Official 2-fold CV, CB seed-bag equal-weight ranks)")
display(sub_out.head())

Folds present among labeled train ids: [0]
=== Variant cb_d3_lr003_l2_20_rsm06_s42 ===
  Skip fold 0: trn=0, val=145
  Skip fold 1: trn=145, val=0
Variant OOF masked macro: nan
=== Variant cb_d4_lr0025_l2_15_rsm06_s123 ===
  Skip fold 0: trn=0, val=145
  Skip fold 1: trn=145, val=0
Variant OOF masked macro: nan
=== Variant cb_d4_lr002_l2_25_rsm05_s456 ===
  Skip fold 0: trn=0, val=145
  Skip fold 1: trn=145, val=0
Variant OOF masked macro: nan
=== Variant cb_d3_lr0025_l2_25_rsm07_s789 ===
  Skip fold 0: trn=0, val=145
  Skip fold 1: trn=145, val=0
Variant OOF masked macro: nan
Ensemble (equal-weight) masked OOF macro: nan
Wrote submission.csv with 1216 rows (Official 2-fold CV, CB seed-bag equal-weight ranks)


    Id  Probability
0  100          0.5
1  101          0.5
2  102          0.5
3  103          0.5
4  104          0.5
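
The constant 0.5 predictions above follow from the log: every labeled id sits in fold 0, so neither fold leaves both training and validation rows and no model is ever fit. A minimal sketch of a fail-fast check that could run before the training loop; `usable_folds` and `require_usable_folds` are hypothetical helpers, and the fold labels are assumed to be a plain list of integers.

```python
def usable_folds(fold_of_row, candidate_folds):
    """Return the folds that leave at least one training and one validation row."""
    usable = []
    for fold in candidate_folds:
        n_val = sum(1 for f in fold_of_row if f == fold)
        n_trn = len(fold_of_row) - n_val  # everything outside the fold trains
        if n_val > 0 and n_trn > 0:
            usable.append(fold)
    return usable


def require_usable_folds(fold_of_row, candidate_folds):
    """Raise instead of silently producing an untrained, constant submission."""
    folds = usable_folds(fold_of_row, candidate_folds)
    if not folds:
        raise ValueError(
            "No fold has both training and validation rows; "
            "check the fold mapping before training."
        )
    return folds
```

Calling `require_usable_folds` on the fold labels before the variant loop would stop the cell before `submission.csv` is overwritten with an uninformative file.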


In [24]:
# Full-train CatBoost seed bag (no CV), equal-weight rank average, no global MI
import numpy as np, pandas as pd, time
from catboost import CatBoostClassifier, Pool

# Use full engineered features (no MI) from X_train_df/X_test_df
X_full = X_train_df.values.astype(np.float32)
T_full = X_test_df.values.astype(np.float32)
Y_full = Y_aligned.astype(np.float32)
N, C = X_full.shape[0], Y_full.shape[1]

# Diverse CatBoost variants (bagging by seeds/params)
variants = [
    dict(name='cb_full_d3_lr003_l2_20_rsm06_s42',  depth=3, learning_rate=0.03,  l2_leaf_reg=20.0, rsm=0.6, subsample=0.8, random_strength=0.8, seed=42),
    dict(name='cb_full_d4_lr0025_l2_15_rsm06_s123', depth=4, learning_rate=0.025, l2_leaf_reg=15.0, rsm=0.6, subsample=0.8, random_strength=0.8, seed=123),
    dict(name='cb_full_d4_lr002_l2_25_rsm05_s456', depth=4, learning_rate=0.02,  l2_leaf_reg=25.0, rsm=0.5, subsample=0.8, random_strength=1.0, seed=456),
    dict(name='cb_full_d3_lr0025_l2_25_rsm07_s789',depth=3, learning_rate=0.025, l2_leaf_reg=25.0, rsm=0.7, subsample=0.8, random_strength=0.6, seed=789),
    dict(name='cb_full_d3_lr002_l2_30_rsm06_s321', depth=3, learning_rate=0.02,  l2_leaf_reg=30.0, rsm=0.6, subsample=0.85, random_strength=0.9, seed=321),
]

def rank_cols(mat):
    from scipy.stats import rankdata
    N, C = mat.shape
    out = np.zeros_like(mat, dtype=np.float32)
    for c in range(C):
        r = rankdata(mat[:, c], method='average')
        out[:, c] = (r - 1) / (N - 1) if N > 1 else r*0.0
    return out

test_preds = []
for v in variants:
    print(f"=== Train variant {v['name']} on FULL train ===", flush=True)
    t0 = time.time()
    te_pred = np.zeros((T_full.shape[0], C), dtype=np.float32)
    for c in range(C):
        y_c = Y_full[:, c]
        pos = int((y_c == 1).sum()); neg = int((y_c == 0).sum())
        if not (pos > 0 and neg > 0):
            continue
        tr_pool = Pool(X_full, label=y_c)
        cb = CatBoostClassifier(
            loss_function='Logloss',
            depth=v['depth'],
            learning_rate=v['learning_rate'],
            iterations=5000,
            l2_leaf_reg=v['l2_leaf_reg'],
            rsm=v['rsm'],
            subsample=v['subsample'],
            random_strength=v['random_strength'],
            auto_class_weights='Balanced',
            random_seed=v['seed'],
            verbose=False,
            task_type='CPU',
            thread_count=-1
        )
        # Fit on full data; no eval_set to avoid using any CV fold
        cb.fit(tr_pool)
        te_pred[:, c] = cb.predict_proba(T_full)[:, 1]
    test_preds.append(te_pred)
    print(f"  Done {v['name']} in {time.time()-t0:.1f}s", flush=True)

# Equal-weight rank-average across variants
ranked_tests = [rank_cols(tp) for tp in test_preds]
test_ens = np.mean(ranked_tests, axis=0)

# Build submission
test_id_order = sub_df[['Id','rec_id','class_id']].copy()
rec_to_idx = {rid:i for i, rid in enumerate(sorted(test_rec_ids))}
probs = []
for _, row in test_id_order.iterrows():
    rid = int(row['rec_id']); cid = int(row['class_id'])
    ti = rec_to_idx[rid]
    p = float(test_ens[ti, cid])
    probs.append(min(max(p, 0.0), 1.0))
sub_out = pd.DataFrame({'Id': test_id_order['Id'].astype(int), 'Probability': np.round(probs, 6)})
sub_out.to_csv('submission.csv', index=False)
print(f"Wrote submission.csv with {len(sub_out)} rows (Full-train CatBoost seed bag, equal-weight rank avg)")
display(sub_out.head())

=== Train variant cb_full_d3_lr003_l2_20_rsm06_s42 on FULL train ===
  Done cb_full_d3_lr003_l2_20_rsm06_s42 in 275.1s
=== Train variant cb_full_d4_lr0025_l2_15_rsm06_s123 on FULL train ===
  Done cb_full_d4_lr0025_l2_15_rsm06_s123 in 370.2s
=== Train variant cb_full_d4_lr002_l2_25_rsm05_s456 on FULL train ===
  Done cb_full_d4_lr002_l2_25_rsm05_s456 in 349.8s
=== Train variant cb_full_d3_lr0025_l2_25_rsm07_s789 on FULL train ===
  Done cb_full_d3_lr0025_l2_25_rsm07_s789 in 283.7s
=== Train variant cb_full_d3_lr002_l2_30_rsm06_s321 on FULL train ===
  Done cb_full_d3_lr002_l2_30_rsm06_s321 in 274.3s
Wrote submission.csv with 1216 rows (Full-train CatBoost seed bag, equal-weight rank avg)


    Id  Probability
0  100     0.385714
1  101     0.601587
2  102     0.658730
3  103     0.753968
4  104     0.753968
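
The rank averaging used for the ensemble maps each class column to average ranks scaled to [0, 1]. A minimal pure-Python sketch of that transform for a single column, assuming the same convention as `rank_cols` (ties share the mean rank); `normalized_average_ranks` is a hypothetical name. It also shows why a column that was never filled (all zeros, as for a skipped class) comes out as 0.5 everywhere.

```python
def normalized_average_ranks(values):
    """Average-rank the values (ties share the mean rank) and scale to [0, 1]."""
    n = len(values)
    if n <= 1:
        return [0.0] * n
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    start = 0
    while start < n:
        # Extend the tie group while the next sorted value is equal
        end = start
        while end + 1 < n and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # 1-based mean rank of the group
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return [(r - 1) / (n - 1) for r in ranks]


if __name__ == "__main__":
    print(normalized_average_ranks([0.2, 0.9, 0.5]))  # [0.0, 1.0, 0.5]
    print(normalized_average_ranks([0.0, 0.0, 0.0]))  # [0.5, 0.5, 0.5]
```

Because a constant column carries no ranking information, it contributes nothing to that class's AUC.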


In [25]:
# Adversarial validation to drop shifted features + CB RMSKF bag (equal weights) and submit
import numpy as np, pandas as pd, time
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
import lightgbm as lgb
from catboost import CatBoostClassifier, Pool
from iterstrat.ml_stratifiers import RepeatedMultilabelStratifiedKFold

def rank_cols(mat):
    from scipy.stats import rankdata
    N, C = mat.shape
    out = np.zeros_like(mat, dtype=np.float32)
    for c in range(C):
        r = rankdata(mat[:, c], method='average')
        out[:, c] = (r - 1) / (N - 1) if N > 1 else r*0.0
    return out

# 1) Build adversarial dataset (train vs test) on current engineered features (697 cols)
X_tr_full = X_train_df.copy()
X_te_full = X_test_df.copy()
X_adv = pd.concat([X_tr_full.assign(is_test=0), X_te_full.assign(is_test=1)], axis=0).reset_index(drop=True)
y_adv = X_adv['is_test'].values.astype(int)
X_adv = X_adv.drop(columns=['is_test']).values.astype(np.float32)

# 2) Train a quick LGBM classifier with CV to get feature importances
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
feat_importance = np.zeros(X_adv.shape[1], dtype=np.float64)
for fi, (tr, va) in enumerate(skf.split(X_adv, y_adv)):
    Xtr, Xva = X_adv[tr], X_adv[va]
    ytr, yva = y_adv[tr], y_adv[va]
    clf = lgb.LGBMClassifier(
        objective='binary',
        n_estimators=1000, learning_rate=0.05,
        max_depth=3, num_leaves=7,
        subsample=0.8, colsample_bytree=0.6,
        reg_alpha=0.5, reg_lambda=5.0,
        n_jobs=-1, random_state=42, verbosity=-1
    )
    clf.fit(Xtr, ytr, eval_set=[(Xva, yva)], eval_metric='auc', callbacks=[lgb.early_stopping(100, verbose=False), lgb.log_evaluation(period=0)])
    feat_importance += clf.booster_.feature_importance(importance_type='gain').astype(np.float64)

feat_importance /= max(1, skf.get_n_splits())
cols = X_train_df.columns.tolist()
order = np.argsort(-feat_importance)  # descending
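# Fixed cut-off: the adversarial AUC is never inspected, so the top 40 features by gain
# are dropped even if train and test turn out to be hard to tell apart.
drop_top = 40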
drop_cols = [cols[i] for i in order[:min(drop_top, len(cols))]]
keep_cols = [c for c in cols if c not in drop_cols]
print(f'Adversarial validation: dropping top {len(drop_cols)} drifted features; keeping {len(keep_cols)}', flush=True)

# Reduced features
X_tr_red = X_train_df[keep_cols].values.astype(np.float32)
X_te_red = X_test_df[keep_cols].values.astype(np.float32)
Y_mat = Y_aligned.astype(np.float32)
N, C = X_tr_red.shape[0], Y_mat.shape[1]

# 3) Train 2 CatBoost variants with RepeatedMultilabelStratifiedKFold (5x2), equal-weight rank-avg
rmskf = RepeatedMultilabelStratifiedKFold(n_splits=5, n_repeats=2, random_state=42)
variants = [
    dict(name='cb_rmskf_d3_lr0025_l2_25_rsm06_s42', depth=3, learning_rate=0.025, l2_leaf_reg=25.0, rsm=0.6, subsample=0.8, random_strength=0.8, seed=42),
    dict(name='cb_rmskf_d4_lr002_l2_20_rsm05_s123', depth=4, learning_rate=0.02,  l2_leaf_reg=20.0, rsm=0.5, subsample=0.8, random_strength=1.0, seed=123),
]

test_pred_list = []
for v in variants:
    print(f"=== Variant {v['name']} (RMSKF 5x2) ===", flush=True)
    fold_id = 0
    test_folds = []
    for tr_idx, va_idx in rmskf.split(X_tr_red, Y_mat):
        t0 = time.time()
        Xtr, Xva = X_tr_red[tr_idx], X_tr_red[va_idx]
        ytr, yva = Y_mat[tr_idx], Y_mat[va_idx]
        te_pred = np.zeros((X_te_red.shape[0], C), dtype=np.float32)
        for c in range(C):
            ytr_c = ytr[:, c]; yva_c = yva[:, c]
            tr_pos = int((ytr_c == 1).sum()); tr_neg = int((ytr_c == 0).sum())
            va_pos = int((yva_c == 1).sum()); va_neg = int((yva_c == 0).sum())
            if not (tr_pos > 0 and tr_neg > 0 and va_pos > 0 and va_neg > 0):
                continue
            tr_pool = Pool(Xtr, label=ytr_c)
            va_pool = Pool(Xva, label=yva_c)
            cb = CatBoostClassifier(
                loss_function='Logloss',
                depth=v['depth'],
                learning_rate=v['learning_rate'],
                iterations=5000,
                l2_leaf_reg=v['l2_leaf_reg'],
                rsm=v['rsm'],
                subsample=v['subsample'],
                random_strength=v['random_strength'],
                auto_class_weights='Balanced',
                random_seed=v['seed'],
                early_stopping_rounds=250,
                verbose=False,
                task_type='CPU',
                thread_count=-1
            )
            cb.fit(tr_pool, eval_set=va_pool)
            te_pred[:, c] = cb.predict_proba(X_te_red)[:, 1]
        test_folds.append(te_pred)
        print(f"  Fold {fold_id} done in {time.time()-t0:.1f}s", flush=True)
        fold_id += 1
    test_mean = np.mean(test_folds, axis=0)
    test_pred_list.append(test_mean)

# 4) Equal-weight rank averaging across variants and write submission
ranked_tests = [rank_cols(tp) for tp in test_pred_list]
test_ens = np.mean(ranked_tests, axis=0)

test_id_order = sub_df[['Id','rec_id','class_id']].copy()
rec_to_idx = {rid:i for i, rid in enumerate(sorted(test_rec_ids))}
probs = []
for _, row in test_id_order.iterrows():
    rid = int(row['rec_id']); cid = int(row['class_id'])
    ti = rec_to_idx[rid]
    p = float(test_ens[ti, cid])
    probs.append(min(max(p, 0.0), 1.0))
sub_out = pd.DataFrame({'Id': test_id_order['Id'].astype(int), 'Probability': np.round(probs, 6)})
sub_out.to_csv('submission.csv', index=False)
print(f"Wrote submission.csv with {len(sub_out)} rows (Adv-validated feature drop {len(drop_cols)}, 2x CB RMSKF equal-weight rank avg)")
display(sub_out.head())

Adversarial validation: dropping top 40 drifted features; keeping 657
=== Variant cb_rmskf_d3_lr0025_l2_25_rsm06_s42 (RMSKF 5x2) ===
  Fold 0 done in 88.8s
  Fold 1 done in 20.3s
  Fold 2 done in 29.1s
  Fold 3 done in 53.2s
  Fold 4 done in 31.0s
  Fold 5 done in 33.7s
  Fold 6 done in 31.8s
  Fold 7 done in 37.4s
  Fold 8 done in 29.8s
  Fold 9 done in 49.9s
=== Variant cb_rmskf_d4_lr002_l2_20_rsm05_s123 (RMSKF 5x2) ===
  Fold 0 done in 92.2s
  Fold 1 done in 26.2s
  Fold 2 done in 36.7s
  Fold 3 done in 90.7s
  Fold 4 done in 42.3s
  Fold 5 done in 47.5s
  Fold 6 done in 42.1s
  Fold 7 done in 73.3s
  Fold 8 done in 58.6s
  Fold 9 done in 62.7s
Wrote submission.csv with 1216 rows (Adv-validated feature drop 40, 2x CB RMSKF equal-weight rank avg)


    Id  Probability
0  100     0.246032
1  101     0.531746
2  102     0.412698
3  103     0.753968
4  104     0.539683
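
The cell above picks drifted features from adversarial LightGBM gain importances. As a cross-check, a univariate drift score can be computed per feature; the sketch below swaps in a two-sample Kolmogorov-Smirnov statistic, which is not what the notebook uses. `ks_statistic` and `rank_features_by_drift` are hypothetical helpers operating on plain lists, and the feature names in the example are made up.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample KS statistic: max |ECDF_a(x) - ECDF_b(x)| over observed values."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    for x in set(a_sorted) | set(b_sorted):
        fa = bisect_right(a_sorted, x) / len(a_sorted)
        fb = bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(fa - fb))
    return d


def rank_features_by_drift(train_cols, test_cols):
    """Sort feature names by KS statistic, most shifted first.

    Both arguments are dicts mapping a feature name to a list of values.
    """
    scores = {name: ks_statistic(train_cols[name], test_cols[name])
              for name in train_cols}
    return sorted(scores.items(), key=lambda kv: -kv[1])


if __name__ == "__main__":
    train = {"stable": [1, 2, 3, 4], "shifted": [0, 1, 2, 3]}
    test = {"stable": [1, 2, 3, 4], "shifted": [10, 11, 12, 13]}
    print(rank_features_by_drift(train, test))
```

Features that rank high under both the adversarial importances and a univariate score like this one are more plausible drop candidates than those flagged by only one of them.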


In [26]:
# Add HOG spectrogram features (filtered_spectrograms) and train CatBoost RMSKF bag (equal-weight ranks)
import numpy as np, pandas as pd, time, os
from pathlib import Path
from PIL import Image
from skimage.feature import hog
from catboost import CatBoostClassifier, Pool
from iterstrat.ml_stratifiers import RepeatedMultilabelStratifiedKFold

VIEW_DIR = SUP/'filtered_spectrograms'
HOG_CACHE_T = Path('hog_train.npy')
HOG_CACHE_S = Path('hog_test.npy')

def load_bmp(stem):
    p = VIEW_DIR/f"{stem}.bmp"
    img = Image.open(p).convert('L')
    return img

def compute_hog_for_ids(id_index):
    feats = []
    for rid in id_index:
        stem = id2fn[int(rid)]
        img = load_bmp(stem)
        img = img.resize((256,256))
        arr = np.asarray(img, dtype=np.float32) / 255.0
        f = hog(arr, orientations=9, pixels_per_cell=(16,16), cells_per_block=(2,2), block_norm='L2-Hys', feature_vector=True)
        feats.append(f.astype(np.float32))
    return np.vstack(feats)

# Train/test rec_id order aligned to existing matrices
train_ids_sorted = pd.Index(sorted(train_rec_ids))
test_ids_sorted = pd.Index(sorted(test_rec_ids))

if HOG_CACHE_T.exists() and HOG_CACHE_S.exists():
    hog_train = np.load(HOG_CACHE_T)
    hog_test = np.load(HOG_CACHE_S)
else:
    t0=time.time()
    hog_train = compute_hog_for_ids(train_ids_sorted)
    hog_test = compute_hog_for_ids(test_ids_sorted)
    np.save(HOG_CACHE_T, hog_train); np.save(HOG_CACHE_S, hog_test)
    print(f'HOG computed in {time.time()-t0:.1f}s -> train {hog_train.shape}, test {hog_test.shape}', flush=True)

# Concatenate HOG to engineered tabular features
X_tr_aug = np.concatenate([X_train_df.values.astype(np.float32), hog_train], axis=1)
X_te_aug = np.concatenate([X_test_df.values.astype(np.float32), hog_test], axis=1)
Y_mat = Y_aligned.astype(np.float32)
N, C = X_tr_aug.shape[0], Y_mat.shape[1]
print('Augmented dims:', X_tr_aug.shape, X_te_aug.shape, flush=True)

# RepeatedMultilabelStratifiedKFold CatBoost bag (2 variants), equal-weight rank avg
rmskf = RepeatedMultilabelStratifiedKFold(n_splits=5, n_repeats=2, random_state=42)
variants = [
    dict(name='cb_hog_d3_lr0025_l2_25_rsm06_s42', depth=3, learning_rate=0.025, l2_leaf_reg=25.0, rsm=0.6, subsample=0.85, random_strength=0.8, seed=42),
    dict(name='cb_hog_d4_lr002_l2_20_rsm05_s123', depth=4, learning_rate=0.02,  l2_leaf_reg=20.0, rsm=0.5, subsample=0.85, random_strength=1.0, seed=123),
]

def rank_cols(mat):
    from scipy.stats import rankdata
    Nn, Cc = mat.shape
    out = np.zeros_like(mat, dtype=np.float32)
    for c in range(Cc):
        r = rankdata(mat[:, c], method='average')
        out[:, c] = (r - 1) / (Nn - 1) if Nn > 1 else r*0.0
    return out

test_pred_list = []
for v in variants:
    print(f"=== Variant {v['name']} (RMSKF 5x2) ===", flush=True)
    test_folds = []
    fold_id = 0
    for tr_idx, va_idx in rmskf.split(X_tr_aug, Y_mat):
        t0 = time.time()
        Xtr, Xva = X_tr_aug[tr_idx], X_tr_aug[va_idx]
        ytr, yva = Y_mat[tr_idx], Y_mat[va_idx]
        te_pred = np.zeros((X_te_aug.shape[0], C), dtype=np.float32)
        for c in range(C):
            ytr_c = ytr[:, c]; yva_c = yva[:, c]
            tr_pos = int((ytr_c == 1).sum()); tr_neg = int((ytr_c == 0).sum())
            va_pos = int((yva_c == 1).sum()); va_neg = int((yva_c == 0).sum())
            if not (tr_pos > 0 and tr_neg > 0 and va_pos > 0 and va_neg > 0):
                continue
            tr_pool = Pool(Xtr, label=ytr_c)
            va_pool = Pool(Xva, label=yva_c)
            cb = CatBoostClassifier(
                loss_function='Logloss',
                depth=v['depth'],
                learning_rate=v['learning_rate'],
                iterations=5000,
                l2_leaf_reg=v['l2_leaf_reg'],
                rsm=v['rsm'],
                subsample=v['subsample'],
                random_strength=v['random_strength'],
                auto_class_weights='Balanced',
                random_seed=v['seed'],
                early_stopping_rounds=250,
                verbose=False,
                task_type='CPU',
                thread_count=-1
            )
            cb.fit(tr_pool, eval_set=va_pool)
            te_pred[:, c] = cb.predict_proba(X_te_aug)[:, 1]
        test_folds.append(te_pred)
        print(f"  Fold {fold_id} done in {time.time()-t0:.1f}s", flush=True)
        fold_id += 1
    test_mean = np.mean(test_folds, axis=0)
    test_pred_list.append(test_mean)

# Equal-weight rank-average and write submission
ranked_tests = [rank_cols(tp) for tp in test_pred_list]
test_ens = np.mean(ranked_tests, axis=0)
test_id_order = sub_df[['Id','rec_id','class_id']].copy()
rec_to_idx = {rid:i for i, rid in enumerate(sorted(test_rec_ids))}
probs = []
for _, row in test_id_order.iterrows():
    rid = int(row['rec_id']); cid = int(row['class_id'])
    ti = rec_to_idx[rid]
    p = float(test_ens[ti, cid])
    probs.append(min(max(p, 0.0), 1.0))
sub_out = pd.DataFrame({'Id': test_id_order['Id'].astype(int), 'Probability': np.round(probs, 6)})
sub_out.to_csv('submission.csv', index=False)
print(f"Wrote submission.csv with {len(sub_out)} rows (HOG+tabular, 2x CB RMSKF equal-weight ranks)")
display(sub_out.head())

HOG computed in 2.9s -> train (145, 8100), test (64, 8100)
Augmented dims: (145, 8797) (64, 8797)
=== Variant cb_hog_d3_lr0025_l2_25_rsm06_s42 (RMSKF 5x2) ===
  Fold 0 done in 401.0s
  Fold 1 done in 226.4s


In [27]:
# Full-train CatBoost seed bag on augmented (tabular + HOG) features, equal-weight rank average
import numpy as np, pandas as pd, time
from pathlib import Path
from catboost import CatBoostClassifier, Pool

# Load cached HOG (from cell 17) or compute if missing
HOG_CACHE_T = Path('hog_train.npy')
HOG_CACHE_S = Path('hog_test.npy')
if HOG_CACHE_T.exists() and HOG_CACHE_S.exists():
    hog_train = np.load(HOG_CACHE_T)
    hog_test = np.load(HOG_CACHE_S)
else:
    # Fallback: compute quickly (same routine as cell 17)
    from PIL import Image
    from skimage.feature import hog
    VIEW_DIR = SUP/'filtered_spectrograms'
    def load_bmp(stem):
        p = VIEW_DIR/f"{stem}.bmp"
        img = Image.open(p).convert('L')
        return img
    def compute_hog_for_ids(id_index):
        feats = []
        for rid in id_index:
            stem = id2fn[int(rid)]
            img = load_bmp(stem)
            img = img.resize((256,256))
            arr = np.asarray(img, dtype=np.float32) / 255.0
            f = hog(arr, orientations=9, pixels_per_cell=(16,16), cells_per_block=(2,2), block_norm='L2-Hys', feature_vector=True)
            feats.append(f.astype(np.float32))
        return np.vstack(feats)
    train_ids_sorted = pd.Index(sorted(train_rec_ids))
    test_ids_sorted = pd.Index(sorted(test_rec_ids))
    t0=time.time()
    hog_train = compute_hog_for_ids(train_ids_sorted)
    hog_test = compute_hog_for_ids(test_ids_sorted)
    np.save(HOG_CACHE_T, hog_train); np.save(HOG_CACHE_S, hog_test)
    print(f'HOG computed in {time.time()-t0:.1f}s -> train {hog_train.shape}, test {hog_test.shape}', flush=True)

# Build augmented matrices
X_tab = X_train_df.values.astype(np.float32)
T_tab = X_test_df.values.astype(np.float32)
X_full = np.concatenate([X_tab, hog_train], axis=1)
T_full = np.concatenate([T_tab, hog_test], axis=1)
Y_full = Y_aligned.astype(np.float32)
N, D = X_full.shape[0], X_full.shape[1]
C = Y_full.shape[1]
print('Augmented dims (full-train):', X_full.shape, T_full.shape, flush=True)

# CatBoost seed/param bag (CPU), strong regularization
variants = [
    dict(name='cb_aug_d3_lr003_l2_25_rsm06_s42',  depth=3, learning_rate=0.03,  l2_leaf_reg=25.0, rsm=0.6, subsample=0.85, random_strength=0.8, seed=42),
    dict(name='cb_aug_d4_lr0025_l2_20_rsm06_s123', depth=4, learning_rate=0.025, l2_leaf_reg=20.0, rsm=0.6, subsample=0.85, random_strength=0.8, seed=123),
    dict(name='cb_aug_d4_lr002_l2_30_rsm05_s456',  depth=4, learning_rate=0.02,  l2_leaf_reg=30.0, rsm=0.5, subsample=0.85, random_strength=1.0, seed=456),
    dict(name='cb_aug_d3_lr0025_l2_30_rsm07_s789', depth=3, learning_rate=0.025, l2_leaf_reg=30.0, rsm=0.7, subsample=0.9,  random_strength=0.6, seed=789),
]

def rank_cols(mat):
    from scipy.stats import rankdata
    Nn, Cc = mat.shape
    out = np.zeros_like(mat, dtype=np.float32)
    for c in range(Cc):
        r = rankdata(mat[:, c], method='average')
        out[:, c] = (r - 1) / (Nn - 1) if Nn > 1 else r*0.0
    return out

test_preds = []
for v in variants:
    print(f"=== FULL-TRAIN {v['name']} ===", flush=True)
    t0 = time.time()
    te_pred = np.zeros((T_full.shape[0], C), dtype=np.float32)
    for c in range(C):
        y_c = Y_full[:, c]
        pos = int((y_c == 1).sum()); neg = int((y_c == 0).sum())
        if not (pos > 0 and neg > 0):
            continue
        tr_pool = Pool(X_full, label=y_c)
        cb = CatBoostClassifier(
            loss_function='Logloss',
            depth=v['depth'],
            learning_rate=v['learning_rate'],
            iterations=5000,
            l2_leaf_reg=v['l2_leaf_reg'],
            rsm=v['rsm'],
            subsample=v['subsample'],
            random_strength=v['random_strength'],
            auto_class_weights='Balanced',
            random_seed=v['seed'],
            verbose=False,
            task_type='CPU',
            thread_count=-1
        )
        cb.fit(tr_pool)
        te_pred[:, c] = cb.predict_proba(T_full)[:, 1]
    test_preds.append(te_pred)
    print(f"  Done in {time.time()-t0:.1f}s", flush=True)

# Equal-weight rank-average
ranked_tests = [rank_cols(tp) for tp in test_preds]
test_ens = np.mean(ranked_tests, axis=0)

# Build submission
test_id_order = sub_df[['Id','rec_id','class_id']].copy()
rec_to_idx = {rid:i for i, rid in enumerate(sorted(test_rec_ids))}
probs = []
for _, row in test_id_order.iterrows():
    rid = int(row['rec_id']); cid = int(row['class_id'])
    ti = rec_to_idx[rid]
    p = float(test_ens[ti, cid])
    probs.append(min(max(p, 0.0), 1.0))
sub_out = pd.DataFrame({'Id': test_id_order['Id'].astype(int), 'Probability': np.round(probs, 6)})
sub_out.to_csv('submission.csv', index=False)
print(f"Wrote submission.csv with {len(sub_out)} rows (Full-train CatBoost seed bag on augmented features, equal-weight rank avg)")
display(sub_out.head())

Augmented dims (full-train): (145, 8797) (64, 8797)


=== FULL-TRAIN cb_aug_d3_lr003_l2_25_rsm06_s42 ===


KeyboardInterrupt: 
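
Note: the full-train bag above was stopped by hand, so the cell never reached the submission step. A minimal sketch of a wall-clock guard follows, assuming each variant can be wrapped as a callable. `run_with_budget`, the job names, and the fake clock are hypothetical and not part of the notebook. The guard only decides whether to start the next job; it does not interrupt a fit that is already running.

```python
import time

def run_with_budget(jobs, budget_s, clock=time.monotonic):
    """Run (name, fn) jobs in order; stop starting new ones once budget_s has elapsed."""
    start = clock()
    results = {}
    for name, fn in jobs:
        if clock() - start >= budget_s:
            print(f"Budget exhausted before {name}; keeping {len(results)} finished job(s)")
            break
        results[name] = fn()
    return results

# Demo with a fake clock so the behaviour is deterministic (all values are made up).
fake_times = iter([0.0, 0.0, 5.0, 11.0])
jobs = [("v1", lambda: 1), ("v2", lambda: 2), ("v3", lambda: 3)]
done = run_with_budget(jobs, budget_s=10.0, clock=lambda: next(fake_times))
print(done)
```

In the cell above, each job would be one variant's loop over classes, and the rank average would then run over whichever variants finished.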

In [28]:
# Fast pipeline: HOG+tabular -> TruncatedSVD -> OneVsRest LogisticRegression (balanced) -> submission
import numpy as np, pandas as pd, time
from pathlib import Path
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Load cached HOG features
HOG_CACHE_T = Path('hog_train.npy')
HOG_CACHE_S = Path('hog_test.npy')
assert HOG_CACHE_T.exists() and HOG_CACHE_S.exists(), 'Run cell 17 to create hog_train.npy/hog_test.npy first.'
hog_train = np.load(HOG_CACHE_T)
hog_test = np.load(HOG_CACHE_S)

# Build augmented matrices (tabular + HOG)
X_tab = X_train_df.values.astype(np.float32)
T_tab = X_test_df.values.astype(np.float32)
X_aug = np.concatenate([X_tab, hog_train], axis=1).astype(np.float32)
T_aug = np.concatenate([T_tab, hog_test], axis=1).astype(np.float32)
Y_mat = Y_aligned.astype(np.float32)
N, D = X_aug.shape
C = Y_mat.shape[1]
print('Aug dims:', X_aug.shape, T_aug.shape, flush=True)

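# Scale, then SVD (fast, stable for high-D). Fit on train only.
# n_comp is capped at N-1, so with N << D the projection keeps essentially all train variance (a rotation more than a compression).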
t0 = time.time()
scaler = StandardScaler(with_mean=True, with_std=True)
X_s = scaler.fit_transform(X_aug)
T_s = scaler.transform(T_aug)
n_comp = min(256, X_s.shape[1]-1, X_s.shape[0]-1)
svd = TruncatedSVD(n_components=n_comp, random_state=42)
X_z = svd.fit_transform(X_s)
T_z = svd.transform(T_s)
print(f'SVD -> comps={n_comp}, explained_var_sum={svd.explained_variance_ratio_.sum():.4f}, time={time.time()-t0:.1f}s', flush=True)

# OneVsRest LogisticRegression (L2, balanced) on SVD components
# n_jobs is set on the OneVsRest wrapper so the per-class binary fits run in parallel
clf = OneVsRestClassifier(LogisticRegression(solver='lbfgs', C=1.0, max_iter=3000, class_weight='balanced'), n_jobs=-1)
t1 = time.time()
clf.fit(X_z, Y_mat)
print(f'LR OVR fit in {time.time()-t1:.1f}s', flush=True)
test_proba = np.vstack([est.predict_proba(T_z)[:,1] for est in clf.estimators_]).T.astype(np.float32)

# Build submission
test_id_order = sub_df[['Id','rec_id','class_id']].copy()
rec_to_idx = {rid:i for i, rid in enumerate(sorted(test_rec_ids))}
probs = []
for _, row in test_id_order.iterrows():
    rid = int(row['rec_id']); cid = int(row['class_id'])
    ti = rec_to_idx[rid]
    p = float(test_proba[ti, cid])
    probs.append(min(max(p, 0.0), 1.0))
sub_out = pd.DataFrame({'Id': test_id_order['Id'].astype(int), 'Probability': np.round(probs, 6)})
sub_out.to_csv('submission.csv', index=False)
print(f'Wrote submission.csv with {len(sub_out)} rows (LR-OVR on SVD(HOG+tabular))')
display(sub_out.head())

Aug dims: (145, 8797) (64, 8797)


SVD -> comps=144, explained_var_sum=1.0000, time=1.9s


LR OVR fit in 24.4s


Wrote submission.csv with 1216 rows (LR-OVR on SVD(HOG+tabular))


    Id  Probability
0  100     0.000213
1  101     0.009447
2  102     0.000718
3  103     0.000686
4  104     0.000461
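
Note: the log above reports comps=144 with explained_var_sum=1.0000. That is expected when the number of rows is far below the number of features: centered data with N rows has rank at most N-1, so N-1 components can carry all of the training variance. A minimal sketch on synthetic data (the shapes and seed are arbitrary assumptions, and plain `numpy.linalg.svd` stands in for TruncatedSVD):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 100))            # 20 samples, 100 features (N << D)
Xc = X - X.mean(axis=0)                   # centering, as StandardScaler(with_mean=True) does
s = np.linalg.svd(Xc, compute_uv=False)   # singular values in descending order
var_ratio = s**2 / np.sum(s**2)
top = var_ratio[: X.shape[0] - 1].sum()   # share captured by the first N-1 components
print(f"share of variance in first N-1 components: {top:.6f}")
```

If actual compression is wanted, a smaller component count chosen by cross-validation would be one option to try.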


In [29]:
# Blend CB-only rank ensemble (cell 13) with LR-SVD(HOG+tabular) (cell 19) and write submission
import numpy as np, pandas as pd

def rank_cols(mat):
    from scipy.stats import rankdata
    N, C = mat.shape
    out = np.zeros_like(mat, dtype=np.float32)
    for c in range(C):
        r = rankdata(mat[:, c], method='average')
        out[:, c] = (r - 1) / (N - 1) if N > 1 else r*0.0
    return out

# Expect test_blend_rank_cb from cell 13 and test_proba from cell 19
assert 'test_blend_rank_cb' in globals(), 'Run cell 13 (CB-only rank ensemble) first.'
assert 'test_proba' in globals(), 'Run cell 19 (LR-OVR SVD) first.'

test_lr_rank = rank_cols(test_proba)
# Fixed weights, CB-heavy rather than equal: start with CB 0.8 / LR 0.2
w_cb, w_lr = 0.8, 0.2
test_blend_final = w_cb * test_blend_rank_cb + w_lr * test_lr_rank

# Build submission
test_id_order = sub_df[['Id','rec_id','class_id']].copy()
rec_to_idx = {rid:i for i, rid in enumerate(sorted(test_rec_ids))}
probs = []
for _, row in test_id_order.iterrows():
    rid = int(row['rec_id']); cid = int(row['class_id'])
    ti = rec_to_idx[rid]
    p = float(test_blend_final[ti, cid])
    probs.append(min(max(p, 0.0), 1.0))
sub_out = pd.DataFrame({'Id': test_id_order['Id'].astype(int), 'Probability': np.round(probs, 6)})
sub_out.to_csv('submission.csv', index=False)
print(f'Wrote submission.csv with {len(sub_out)} rows (CB-only rank blend 0.8 + LR-SVD 0.2)')
display(sub_out.head())

Wrote submission.csv with 1216 rows (CB-only rank blend 0.8 + LR-SVD 0.2)


    Id  Probability
0  100     0.288889
1  101     0.576508
2  102     0.429206
3  103     0.771429
4  104     0.542857
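
Note: the blend above works on per-class ranks scaled to [0, 1]. A minimal sketch of the same idea on toy arrays follows, assuming a SciPy version whose `rankdata` accepts `axis`. The arrays `a` and `b` are made up. Because the weights sum to 1 and both inputs lie in [0, 1], the blended values stay in [0, 1], so the clamp in the submission loop should be a no-op.

```python
import numpy as np
from scipy.stats import rankdata

def rank_cols(mat):
    """Per-column average ranks rescaled to [0, 1] (vectorised over columns)."""
    n = mat.shape[0]
    if n < 2:
        return np.zeros_like(mat, dtype=np.float32)
    ranks = rankdata(mat, method="average", axis=0)
    return ((ranks - 1) / (n - 1)).astype(np.float32)

# Toy predictions from two models: 3 recordings x 2 classes
a = np.array([[0.1, 0.9], [0.4, 0.2], [0.3, 0.5]])
b = np.array([[0.001, 0.7], [0.002, 0.8], [0.9, 0.1]])

w_a, w_b = 0.8, 0.2
blend = w_a * rank_cols(a) + w_b * rank_cols(b)
print(blend)
```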


In [30]:
# Grouped CV by recording station (PCxx) + CatBoost bag (no MI) + rank ensemble submission
import re, time, numpy as np, pandas as pd
from sklearn.model_selection import GroupKFold
from sklearn.metrics import roc_auc_score
from catboost import CatBoostClassifier, Pool

def extract_station(stem):
    if not isinstance(stem, str):
        return None
    m = re.search(r'(PC\d+)', stem)
    return m.group(1) if m else None

# Build groups (station) for labeled train
train_meta = train_df[['rec_id','filename']].copy()
train_meta['station'] = train_meta['filename'].map(extract_station)
groups = train_meta['station'].fillna('UNK').values

# Matrices (full engineered features without MI)
X_tr_full = X_train_df.values.astype(np.float32)
X_te_full = X_test_df.values.astype(np.float32)
Y_full = Y_aligned.astype(np.float32)
N, C = X_tr_full.shape[0], Y_full.shape[1]

def masked_auc_macro(y_true, y_pred, valid_mask):
    aucs = []
    for c in range(y_true.shape[1]):
        m = valid_mask[:, c].astype(bool)
        if m.sum() <= 1:
            continue
        yt = y_true[m, c]; yp = y_pred[m, c]
        if yt.max() == yt.min():
            continue
        try:
            aucs.append(roc_auc_score(yt, yp))
        except Exception:
            pass
    return float(np.mean(aucs)) if len(aucs) else float('nan')

def rank_cols(mat):
    from scipy.stats import rankdata
    Nn, Cc = mat.shape
    out = np.zeros_like(mat, dtype=np.float32)
    for c in range(Cc):
        r = rankdata(mat[:, c], method='average')
        out[:, c] = (r - 1) / (Nn - 1) if Nn > 1 else r*0.0
    return out

# Define 4 diverse CatBoost variants (shallow trees, strong L2 regularization, feature subsampling via rsm, row subsampling via subsample)
variants = [
    dict(name='cb_gcv_d3_lr002_l2_30_rsm06_sub085_s42', depth=3, learning_rate=0.02,  l2_leaf_reg=30.0, rsm=0.6, subsample=0.85, random_strength=0.8, seed=42),
    dict(name='cb_gcv_d4_lr002_l2_40_rsm05_sub08_s123', depth=4, learning_rate=0.02,  l2_leaf_reg=40.0, rsm=0.5, subsample=0.80, random_strength=1.0, seed=123),
    dict(name='cb_gcv_d3_lr0025_l2_25_rsm06_sub085_s456', depth=3, learning_rate=0.025, l2_leaf_reg=25.0, rsm=0.6, subsample=0.85, random_strength=0.8, seed=456),
    dict(name='cb_gcv_d4_lr0015_l2_50_rsm05_sub075_s789', depth=4, learning_rate=0.015, l2_leaf_reg=50.0, rsm=0.5, subsample=0.75, random_strength=1.0, seed=789),
]

gkf = GroupKFold(n_splits=5)
folds = list(gkf.split(X_tr_full, Y_full, groups))
print('GroupKFold by station -> n_folds:', len(folds), 'unique stations:', len(np.unique(groups)))

test_pred_list = []
oof_list = []
mask_list = []

for v in variants:
    print(f"=== Variant {v['name']} ===", flush=True)
    oof = np.zeros((N, C), dtype=np.float32)
    vmask = np.zeros((N, C), dtype=np.uint8)
    test_folds = []
    for fi, (tr_idx, va_idx) in enumerate(folds):
        t0 = time.time()
        Xtr, Xva = X_tr_full[tr_idx], X_tr_full[va_idx]
        ytr, yva = Y_full[tr_idx], Y_full[va_idx]
        te_pred = np.zeros((X_te_full.shape[0], C), dtype=np.float32)
        for c in range(C):
            ytr_c = ytr[:, c]; yva_c = yva[:, c]
            tr_pos = int((ytr_c == 1).sum()); tr_neg = int((ytr_c == 0).sum())
            va_pos = int((yva_c == 1).sum()); va_neg = int((yva_c == 0).sum())
            if not (tr_pos > 0 and tr_neg > 0 and va_pos > 0 and va_neg > 0):
                continue
            tr_pool = Pool(Xtr, label=ytr_c)
            va_pool = Pool(Xva, label=yva_c)
            cb = CatBoostClassifier(
                loss_function='Logloss',
                depth=v['depth'],
                learning_rate=v['learning_rate'],
                iterations=5000,
                l2_leaf_reg=v['l2_leaf_reg'],
                rsm=v['rsm'],
                subsample=v['subsample'],
                random_strength=v['random_strength'],
                auto_class_weights='Balanced',
                random_seed=v['seed'],
                early_stopping_rounds=200,
                verbose=False,
                task_type='CPU',
                thread_count=-1
            )
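            # Early stopping is tuned on va_pool, the same rows scored as OOF below, so fold AUCs are optimistic
            cb.fit(tr_pool, eval_set=va_pool)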
            oof[va_idx, c] = cb.predict_proba(Xva)[:, 1]
            te_pred[:, c] = cb.predict_proba(X_te_full)[:, 1]
            vmask[va_idx, c] = 1
        fold_macro = masked_auc_macro(yva, oof[va_idx], vmask[va_idx])
        print(f"  Fold {fi}: masked macro={fold_macro:.4f} | n_val={len(va_idx)} | time={time.time()-t0:.1f}s", flush=True)
        test_folds.append(te_pred)
    test_mean = np.mean(test_folds, axis=0)
    oof_list.append(oof); mask_list.append(vmask); test_pred_list.append(test_mean)
    macro_v = masked_auc_macro(Y_full, oof, vmask)
    print(f"Variant OOF masked macro: {macro_v:.4f}", flush=True)

# Rank-ensemble across variants (equal weights)
inter_mask = mask_list[0].astype(bool)
for m in mask_list[1:]:
    inter_mask &= m.astype(bool)
inter_mask = inter_mask.astype(np.uint8)

# Rank OOFs and tests
ranked_oofs = []
from scipy.stats import rankdata
for o in oof_list:
    # Rank with mask: rank only within rows observed by every variant, per class
    Nn, Cc = o.shape
    R = np.full_like(o, np.nan, dtype=np.float32)
    for c in range(Cc):
        m = inter_mask[:, c].astype(bool)
        if not m.any():
            continue
        col = o[m, c]
        r = rankdata(col, method='average')
        R[m, c] = ((r - 1) / (len(r) - 1) if len(r) > 1 else r*0.0).astype(np.float32)
    ranked_oofs.append(R)
oof_ens = np.nanmean(ranked_oofs, axis=0)
macro_ens = masked_auc_macro(Y_full, oof_ens, inter_mask)
print(f'Ensemble (equal-weight, rank OOF) masked macro: {macro_ens:.4f}', flush=True)

ranked_tests = [rank_cols(tp) for tp in test_pred_list]
test_ens = np.mean(ranked_tests, axis=0)

# Build submission
test_id_order = sub_df[['Id','rec_id','class_id']].copy()
rec_to_idx = {rid:i for i, rid in enumerate(sorted(test_rec_ids))}
probs = []
for _, row in test_id_order.iterrows():
    rid = int(row['rec_id']); cid = int(row['class_id'])
    ti = rec_to_idx[rid]
    p = float(test_ens[ti, cid])
    probs.append(min(max(p, 0.0), 1.0))
sub_out = pd.DataFrame({'Id': test_id_order['Id'].astype(int), 'Probability': np.round(probs, 6)})
sub_out.to_csv('submission.csv', index=False)
print(f"Wrote submission.csv with {len(sub_out)} rows (GroupKFold-by-station CB bag, equal-weight rank ensemble)")
display(sub_out.head())

GroupKFold by station -> n_folds: 5 unique stations: 13
=== Variant cb_gcv_d3_lr002_l2_30_rsm06_sub085_s42 ===


  Fold 0: masked macro=0.7448 | n_val=31 | time=25.5s


  Fold 1: masked macro=0.7934 | n_val=27 | time=34.2s


  Fold 2: masked macro=0.7036 | n_val=26 | time=38.7s


  Fold 3: masked macro=0.6803 | n_val=31 | time=30.8s


  Fold 4: masked macro=0.7185 | n_val=30 | time=14.7s


Variant OOF masked macro: 0.7528


=== Variant cb_gcv_d4_lr002_l2_40_rsm05_sub08_s123 ===


  Fold 0: masked macro=0.6899 | n_val=31 | time=31.6s


  Fold 1: masked macro=0.7857 | n_val=27 | time=52.0s


  Fold 2: masked macro=0.7296 | n_val=26 | time=68.5s


  Fold 3: masked macro=0.7026 | n_val=31 | time=42.7s


  Fold 4: masked macro=0.6444 | n_val=30 | time=19.9s


Variant OOF masked macro: 0.7158


=== Variant cb_gcv_d3_lr0025_l2_25_rsm06_sub085_s456 ===


  Fold 0: masked macro=0.7525 | n_val=31 | time=24.5s


  Fold 1: masked macro=0.7615 | n_val=27 | time=28.7s


  Fold 2: masked macro=0.6980 | n_val=26 | time=33.1s


  Fold 3: masked macro=0.7031 | n_val=31 | time=40.3s


  Fold 4: masked macro=0.7544 | n_val=30 | time=14.1s


Variant OOF masked macro: 0.7517


=== Variant cb_gcv_d4_lr0015_l2_50_rsm05_sub075_s789 ===


  Fold 0: masked macro=0.7208 | n_val=31 | time=34.8s


  Fold 1: masked macro=0.7406 | n_val=27 | time=60.0s


  Fold 2: masked macro=0.7077 | n_val=26 | time=69.0s


  Fold 3: masked macro=0.7301 | n_val=31 | time=50.6s


  Fold 4: masked macro=0.7148 | n_val=30 | time=25.1s


Variant OOF masked macro: 0.7346


Ensemble (equal-weight, rank OOF) masked macro: 0.7418


Wrote submission.csv with 1216 rows (GroupKFold-by-station CB bag, equal-weight rank ensemble)


    Id  Probability
0  100     0.246032
1  101      0.47619
2  102      0.40873
3  103     0.732143
4  104     0.384921
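
Note: the grouped CV above skips a class in any fold where either the train or the validation side lacks positives or negatives, so the number of scorable (fold, class) pairs depends on the fold count. A minimal sketch for counting those pairs before committing to a split follows. The filenames, label matrix, and `scorable_pairs` helper are synthetic stand-ins, not the notebook's data.

```python
import re
import numpy as np
from sklearn.model_selection import GroupKFold

def extract_station(stem):
    m = re.search(r"(PC\d+)", stem)
    return m.group(1) if m else "UNK"

def scorable_pairs(Y, groups, n_splits):
    """Count (fold, class) pairs where train and validation both hold positives and negatives."""
    n_ok = 0
    X_dummy = np.zeros((len(groups), 1))
    for tr_idx, va_idx in GroupKFold(n_splits=n_splits).split(X_dummy, Y, groups):
        # A station must never appear on both sides of the split
        assert not set(groups[tr_idx]) & set(groups[va_idx])
        for c in range(Y.shape[1]):
            ytr, yva = Y[tr_idx, c], Y[va_idx, c]
            if 0 < ytr.sum() < len(ytr) and 0 < yva.sum() < len(yva):
                n_ok += 1
    return n_ok

# Synthetic stand-ins: 48 recordings from 6 hypothetical stations, 5 rare classes
rng = np.random.default_rng(0)
stems = [f"PC{1 + i % 6}_rec{i:03d}" for i in range(48)]
groups = np.array([extract_station(s) for s in stems])
Y = (rng.random((48, 5)) < 0.15).astype(int)
coverage = {k: scorable_pairs(Y, groups, k) for k in (2, 3)}
print(coverage)
```

Comparing the counts for candidate fold numbers gives a rough sense of how many classes the masked macro AUC will actually average over.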


In [31]:
# GroupKFold by station with 4 folds (better class coverage) + CB bag and rank ensemble
import re, time, numpy as np, pandas as pd
from sklearn.model_selection import GroupKFold
from sklearn.metrics import roc_auc_score
from catboost import CatBoostClassifier, Pool

def extract_station(stem):
    if not isinstance(stem, str):
        return None
    m = re.search(r'(PC\d+)', stem)
    return m.group(1) if m else None

# Build groups (station) for labeled train
train_meta = train_df[['rec_id','filename']].copy()
train_meta['station'] = train_meta['filename'].map(extract_station)
groups = train_meta['station'].fillna('UNK').values

# Matrices
X_tr_full = X_train_df.values.astype(np.float32)
X_te_full = X_test_df.values.astype(np.float32)
Y_full = Y_aligned.astype(np.float32)
N, C = X_tr_full.shape[0], Y_full.shape[1]

def masked_auc_macro(y_true, y_pred, valid_mask):
    aucs = []
    for c in range(y_true.shape[1]):
        m = valid_mask[:, c].astype(bool)
        if m.sum() <= 1:
            continue
        yt = y_true[m, c]; yp = y_pred[m, c]
        if yt.max() == yt.min():
            continue
        try:
            aucs.append(roc_auc_score(yt, yp))
        except Exception:
            pass
    return float(np.mean(aucs)) if len(aucs) else float('nan')

def rank_cols(mat):
    from scipy.stats import rankdata
    Nn, Cc = mat.shape
    out = np.zeros_like(mat, dtype=np.float32)
    for c in range(Cc):
        r = rankdata(mat[:, c], method='average')
        out[:, c] = (r - 1) / (Nn - 1) if Nn > 1 else r*0.0
    return out

# Slightly adjusted variants for diversity
variants = [
    dict(name='cb_gcv4_d3_lr002_l2_35_rsm06_sub085_s42',  depth=3, learning_rate=0.02,  l2_leaf_reg=35.0, rsm=0.6, subsample=0.85, random_strength=0.8, seed=42),
    dict(name='cb_gcv4_d4_lr002_l2_45_rsm05_sub08_s123',  depth=4, learning_rate=0.02,  l2_leaf_reg=45.0, rsm=0.5, subsample=0.80, random_strength=1.0, seed=123),
    dict(name='cb_gcv4_d3_lr0025_l2_30_rsm06_sub09_s456', depth=3, learning_rate=0.025, l2_leaf_reg=30.0, rsm=0.6, subsample=0.90, random_strength=0.9, seed=456),
    dict(name='cb_gcv4_d4_lr0015_l2_55_rsm05_sub075_s789',depth=4, learning_rate=0.015, l2_leaf_reg=55.0, rsm=0.5, subsample=0.75, random_strength=1.0, seed=789),
]

gkf = GroupKFold(n_splits=4)
folds = list(gkf.split(X_tr_full, Y_full, groups))
print('GroupKFold by station -> n_folds:', len(folds), 'unique stations:', len(np.unique(groups)))

test_pred_list = []
oof_list = []
mask_list = []

for v in variants:
    print(f"=== Variant {v['name']} ===", flush=True)
    oof = np.zeros((N, C), dtype=np.float32)
    vmask = np.zeros((N, C), dtype=np.uint8)
    test_folds = []
    for fi, (tr_idx, va_idx) in enumerate(folds):
        t0 = time.time()
        Xtr, Xva = X_tr_full[tr_idx], X_tr_full[va_idx]
        ytr, yva = Y_full[tr_idx], Y_full[va_idx]
        te_pred = np.zeros((X_te_full.shape[0], C), dtype=np.float32)
        for c in range(C):
            ytr_c = ytr[:, c]; yva_c = yva[:, c]
            tr_pos = int((ytr_c == 1).sum()); tr_neg = int((ytr_c == 0).sum())
            va_pos = int((yva_c == 1).sum()); va_neg = int((yva_c == 0).sum())
            if not (tr_pos > 0 and tr_neg > 0 and va_pos > 0 and va_neg > 0):
                continue
            tr_pool = Pool(Xtr, label=ytr_c)
            va_pool = Pool(Xva, label=yva_c)
            cb = CatBoostClassifier(
                loss_function='Logloss',
                depth=v['depth'],
                learning_rate=v['learning_rate'],
                iterations=5000,
                l2_leaf_reg=v['l2_leaf_reg'],
                rsm=v['rsm'],
                subsample=v['subsample'],
                random_strength=v['random_strength'],
                auto_class_weights='Balanced',
                random_seed=v['seed'],
                early_stopping_rounds=200,
                verbose=False,
                task_type='CPU',
                thread_count=-1
            )
            cb.fit(tr_pool, eval_set=va_pool)
            oof[va_idx, c] = cb.predict_proba(Xva)[:, 1]
            te_pred[:, c] = cb.predict_proba(X_te_full)[:, 1]
            vmask[va_idx, c] = 1
        fold_macro = masked_auc_macro(yva, oof[va_idx], vmask[va_idx])
        print(f"  Fold {fi}: masked macro={fold_macro:.4f} | n_val={len(va_idx)} | time={time.time()-t0:.1f}s", flush=True)
        test_folds.append(te_pred)
    test_mean = np.mean(test_folds, axis=0)
    oof_list.append(oof); mask_list.append(vmask); test_pred_list.append(test_mean)
    macro_v = masked_auc_macro(Y_full, oof, vmask)
    print(f"Variant OOF masked macro: {macro_v:.4f}", flush=True)

# Rank-ensemble across variants (equal weights)
inter_mask = mask_list[0].astype(bool)
for m in mask_list[1:]:
    inter_mask &= m.astype(bool)
inter_mask = inter_mask.astype(np.uint8)

# Rank OOFs and tests
ranked_oofs = []
from scipy.stats import rankdata
for o in oof_list:
    Nn, Cc = o.shape
    R = np.full_like(o, np.nan, dtype=np.float32)
    for c in range(Cc):
        m = inter_mask[:, c].astype(bool)
        if not m.any():
            continue
        col = o[m, c]
        r = rankdata(col, method='average')
        R[m, c] = ((r - 1) / (len(r) - 1) if len(r) > 1 else r*0.0).astype(np.float32)
    ranked_oofs.append(R)
oof_ens = np.nanmean(ranked_oofs, axis=0)
macro_ens = masked_auc_macro(Y_full, oof_ens, inter_mask)
print(f'Ensemble (equal-weight, rank OOF) masked macro: {macro_ens:.4f}', flush=True)

ranked_tests = [rank_cols(tp) for tp in test_pred_list]
test_ens = np.mean(ranked_tests, axis=0)

# Build submission
test_id_order = sub_df[['Id','rec_id','class_id']].copy()
rec_to_idx = {rid:i for i, rid in enumerate(sorted(test_rec_ids))}
probs = []
for _, row in test_id_order.iterrows():
    rid = int(row['rec_id']); cid = int(row['class_id'])
    ti = rec_to_idx[rid]
    p = float(test_ens[ti, cid])
    probs.append(min(max(p, 0.0), 1.0))
sub_out = pd.DataFrame({'Id': test_id_order['Id'].astype(int), 'Probability': np.round(probs, 6)})
sub_out.to_csv('submission.csv', index=False)
print(f"Wrote submission.csv with {len(sub_out)} rows (GroupKFold-4-by-station CB bag, equal-weight rank ensemble)")
display(sub_out.head())

GroupKFold by station -> n_folds: 4 unique stations: 13
=== Variant cb_gcv4_d3_lr002_l2_35_rsm06_sub085_s42 ===


  Fold 0: masked macro=0.7211 | n_val=39 | time=12.3s


  Fold 1: masked macro=0.7248 | n_val=34 | time=36.6s


  Fold 2: masked macro=0.7279 | n_val=37 | time=28.3s


  Fold 3: masked macro=0.6715 | n_val=35 | time=28.3s


Variant OOF masked macro: 0.7326


=== Variant cb_gcv4_d4_lr002_l2_45_rsm05_sub08_s123 ===


  Fold 0: masked macro=0.6916 | n_val=39 | time=15.9s


  Fold 1: masked macro=0.7342 | n_val=34 | time=55.5s


  Fold 2: masked macro=0.6963 | n_val=37 | time=39.5s


  Fold 3: masked macro=0.7109 | n_val=35 | time=35.5s


Variant OOF masked macro: 0.7242


=== Variant cb_gcv4_d3_lr0025_l2_30_rsm06_sub09_s456 ===


  Fold 0: masked macro=0.6223 | n_val=39 | time=10.6s


  Fold 1: masked macro=0.7476 | n_val=34 | time=29.6s


  Fold 2: masked macro=0.6815 | n_val=37 | time=26.4s


  Fold 3: masked macro=0.7147 | n_val=35 | time=22.9s


Variant OOF masked macro: 0.6931


=== Variant cb_gcv4_d4_lr0015_l2_55_rsm05_sub075_s789 ===


  Fold 0: masked macro=0.6723 | n_val=39 | time=18.7s


  Fold 1: masked macro=0.7616 | n_val=34 | time=78.6s


  Fold 2: masked macro=0.7173 | n_val=37 | time=42.6s


  Fold 3: masked macro=0.6857 | n_val=35 | time=39.5s


Variant OOF masked macro: 0.7317


Ensemble (equal-weight, rank OOF) masked macro: 0.7312


Wrote submission.csv with 1216 rows (GroupKFold-4-by-station CB bag, equal-weight rank ensemble)


    Id  Probability
0  100     0.301587
1  101     0.484127
2  102     0.488095
3  103     0.652778
4  104     0.392857
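
Note: in the logs above, the "Variant OOF masked macro" is not the mean of the per-fold values (for the first 4-fold variant the folds average about 0.711 while the pooled figure is 0.7326). One plausible contributor is that AUC computed on pooled out-of-fold predictions compares scores across folds, whereas per-fold AUC does not. The set of scorable classes also differs between the two views. A minimal sketch with made-up numbers shows the pooling effect:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Two toy folds, each perfectly ranked within itself but on different score scales
y_a, p_a = np.array([0, 1]), np.array([0.1, 0.2])
y_b, p_b = np.array([0, 1]), np.array([0.6, 0.9])

fold_mean = np.mean([roc_auc_score(y_a, p_a), roc_auc_score(y_b, p_b)])
pooled = roc_auc_score(np.concatenate([y_a, y_b]), np.concatenate([p_a, p_b]))
print(f"mean of fold AUCs: {fold_mean:.2f} | pooled AUC: {pooled:.2f}")
```

The gap can go in either direction, so it may help to track both numbers when comparing variants.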


In [32]:
# Blend GroupKFold CB-bag ranks (test_ens) with LR-SVD(HOG+tab) at small weight (0.1) and write submission
import numpy as np, pandas as pd
from scipy.stats import rankdata

def rank_cols(mat):
    N, C = mat.shape
    out = np.zeros_like(mat, dtype=np.float32)
    for c in range(C):
        r = rankdata(mat[:, c], method='average')
        out[:, c] = (r - 1) / (N - 1) if N > 1 else r*0.0
    return out

assert 'test_ens' in globals(), 'Run GroupKFold CB bag cell (21 or 22) first to get test_ens.'
assert 'test_proba' in globals(), 'Run cell 19 (LR-OVR SVD) first to get test_proba.'

lr_rank = rank_cols(test_proba.astype(np.float32))
w_cb, w_lr = 0.9, 0.1
blend_rank = w_cb * test_ens + w_lr * lr_rank

test_id_order = sub_df[['Id','rec_id','class_id']].copy()
rec_to_idx = {rid:i for i, rid in enumerate(sorted(test_rec_ids))}
probs = []
for _, row in test_id_order.iterrows():
    rid = int(row['rec_id']); cid = int(row['class_id'])
    ti = rec_to_idx[rid]
    p = float(blend_rank[ti, cid])
    probs.append(min(max(p, 0.0), 1.0))
sub_out = pd.DataFrame({'Id': test_id_order['Id'].astype(int), 'Probability': np.round(probs, 6)})
sub_out.to_csv('submission.csv', index=False)
print(f"Wrote submission.csv with {len(sub_out)} rows (GroupKFold CB bag 0.9 + LR-SVD 0.1 rank blend)")
display(sub_out.head())

Wrote submission.csv with 1216 rows (GroupKFold CB bag 0.9 + LR-SVD 0.1 rank blend)


    Id  Probability
0  100      0.31746
1  101      0.50873
2  102     0.486905
3  103     0.671627
4  104     0.415476


In [33]:
# Three-way rank blend: GroupCV CB bag (test_ens) + CB-only (cell 13) + LR-SVD (cell 19)
import numpy as np, pandas as pd
from scipy.stats import rankdata

def rank_cols(mat):
    N, C = mat.shape
    out = np.zeros_like(mat, dtype=np.float32)
    for c in range(C):
        r = rankdata(mat[:, c], method='average')
        out[:, c] = (r - 1) / (N - 1) if N > 1 else r*0.0
    return out

assert 'test_ens' in globals(), 'Run GroupKFold CB bag cell (21 or 22) first.'
assert 'test_blend_rank_cb' in globals(), 'Run CB-only ensemble (cell 13) first.'
assert 'test_proba' in globals(), 'Run LR-OVR SVD (cell 19) first.'

lr_rank = rank_cols(test_proba.astype(np.float32))
w_gcb, w_cb, w_lr = 0.7, 0.2, 0.1
blend_rank3 = w_gcb * test_ens + w_cb * test_blend_rank_cb + w_lr * lr_rank

test_id_order = sub_df[['Id','rec_id','class_id']].copy()
rec_to_idx = {rid:i for i, rid in enumerate(sorted(test_rec_ids))}
probs = []
for _, row in test_id_order.iterrows():
    rid = int(row['rec_id']); cid = int(row['class_id'])
    ti = rec_to_idx[rid]
    p = float(blend_rank3[ti, cid])
    probs.append(min(max(p, 0.0), 1.0))
sub_out = pd.DataFrame({'Id': test_id_order['Id'].astype(int), 'Probability': np.round(probs, 6)})
sub_out.to_csv('submission.csv', index=False)
print(f"Wrote submission.csv with {len(sub_out)} rows (GroupCV CB 0.7 + CB-only 0.2 + LR-SVD 0.1 rank blend)")
display(sub_out.head())

Wrote submission.csv with 1216 rows (GroupCV CB 0.7 + CB-only 0.2 + LR-SVD 0.1 rank blend)


    Id  Probability
0  100     0.306349
1  101     0.519524
2  102     0.472778
3  103     0.691865
4  104     0.441667
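
Note: every cell above rebuilds the submission with the same `iterrows` loop. A minimal sketch of a vectorised equivalent follows, using NumPy integer indexing. The toy `sub_df`, `test_rec_ids`, and `pred` below are hypothetical stand-ins with the same column names as the notebook's objects.

```python
import numpy as np
import pandas as pd

# Toy stand-ins: 3 test recordings x 2 classes; pred rows follow sorted(test_rec_ids)
test_rec_ids = [7, 3, 5]
sub_df = pd.DataFrame({
    "Id":       [300, 301, 500, 501, 700, 701],
    "rec_id":   [3, 3, 5, 5, 7, 7],
    "class_id": [0, 1, 0, 1, 0, 1],
})
pred = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])

rec_to_idx = {rid: i for i, rid in enumerate(sorted(test_rec_ids))}
row_idx = sub_df["rec_id"].map(rec_to_idx)
assert not row_idx.isna().any(), "submission contains a rec_id missing from test_rec_ids"

rows = row_idx.to_numpy(dtype=int)
cols = sub_df["class_id"].to_numpy(dtype=int)
probs = np.clip(pred[rows, cols], 0.0, 1.0)   # one lookup per submission row
sub_out = pd.DataFrame({"Id": sub_df["Id"].astype(int), "Probability": np.round(probs, 6)})
print(sub_out)
```

Wrapping this in a single helper would also remove the need to copy the loop into each blending cell.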


In [34]:
# Grouped CV by station + adversarial feature drop (top-30) + CatBoost bag + rank ensemble submission
import numpy as np, pandas as pd, time, re
from sklearn.model_selection import GroupKFold, StratifiedKFold
from sklearn.metrics import roc_auc_score
import lightgbm as lgb
from catboost import CatBoostClassifier, Pool

def extract_station(stem):
    if not isinstance(stem, str): return None
    m = re.search(r'(PC\d+)', stem); return m.group(1) if m else None

# Build groups (station) for labeled train
train_meta = train_df[['rec_id','filename']].copy()
train_meta['station'] = train_meta['filename'].map(extract_station)
groups = train_meta['station'].fillna('UNK').values

# Full engineered features
X_tr_full_df = X_train_df.copy()
X_te_full_df = X_test_df.copy()
Y_full = Y_aligned.astype(np.float32)
N, C = X_tr_full_df.shape[0], Y_full.shape[1]

# 1) Adversarial validation to identify drifted features (train vs test) and drop top-K
X_adv = pd.concat([X_tr_full_df.assign(is_test=0), X_te_full_df.assign(is_test=1)], axis=0).reset_index(drop=True)
y_adv = X_adv['is_test'].values.astype(int)
X_adv_mat = X_adv.drop(columns=['is_test']).values.astype(np.float32)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
feat_importance = np.zeros(X_adv_mat.shape[1], dtype=np.float64)
for fi, (tr, va) in enumerate(skf.split(X_adv_mat, y_adv)):
    Xtr, Xva = X_adv_mat[tr], X_adv_mat[va]
    ytr, yva = y_adv[tr], y_adv[va]
    clf = lgb.LGBMClassifier(
        objective='binary', n_estimators=800, learning_rate=0.05,
        max_depth=3, num_leaves=7, subsample=0.8, colsample_bytree=0.6,
        reg_alpha=0.5, reg_lambda=5.0, n_jobs=-1, random_state=42, verbosity=-1
    )
    clf.fit(Xtr, ytr, eval_set=[(Xva, yva)], eval_metric='auc',
            callbacks=[lgb.early_stopping(100, verbose=False), lgb.log_evaluation(period=0)])
    feat_importance += clf.booster_.feature_importance(importance_type='gain').astype(np.float64)
feat_importance /= max(1, skf.get_n_splits())
cols = X_tr_full_df.columns.tolist()
order = np.argsort(-feat_importance)
drop_top = 30
drop_cols = [cols[i] for i in order[:min(drop_top, len(cols))]]
keep_cols = [c for c in cols if c not in drop_cols]
print(f'Adversarial drop: top {len(drop_cols)} features removed; keeping {len(keep_cols)}', flush=True)

X_tr = X_tr_full_df[keep_cols].values.astype(np.float32)
X_te = X_te_full_df[keep_cols].values.astype(np.float32)

def masked_auc_macro(y_true, y_pred, valid_mask):
    aucs = []
    for c in range(y_true.shape[1]):
        m = valid_mask[:, c].astype(bool)
        if m.sum() <= 1: continue
        yt = y_true[m, c]; yp = y_pred[m, c]
        if yt.max() == yt.min(): continue
        try: aucs.append(roc_auc_score(yt, yp))
        except Exception: pass
    return float(np.mean(aucs)) if len(aucs) else float('nan')

def rank_cols(mat):
    from scipy.stats import rankdata
    Nn, Cc = mat.shape
    out = np.zeros_like(mat, dtype=np.float32)
    for c in range(Cc):
        r = rankdata(mat[:, c], method='average')
        out[:, c] = (r - 1) / (Nn - 1) if Nn > 1 else r*0.0
    return out

# 2) GroupKFold by station (5 folds); if poor coverage, user can switch to 4
gkf = GroupKFold(n_splits=5)
folds = list(gkf.split(X_tr, Y_full, groups))
print('GroupKFold by station -> n_folds:', len(folds), 'unique stations:', len(np.unique(groups)))

# 3) CatBoost variants (shallow trees, strong L2 regularization, feature subsampling via rsm, row subsampling via subsample)
variants = [
    dict(name='cb_gcv_adv_d3_lr002_l2_30_rsm06_sub085_s42', depth=3, learning_rate=0.02,  l2_leaf_reg=30.0, rsm=0.6, subsample=0.85, random_strength=0.8, seed=42),
    dict(name='cb_gcv_adv_d4_lr002_l2_40_rsm05_sub08_s123', depth=4, learning_rate=0.02,  l2_leaf_reg=40.0, rsm=0.5, subsample=0.80, random_strength=1.0, seed=123),
    dict(name='cb_gcv_adv_d3_lr0025_l2_25_rsm06_sub09_s456', depth=3, learning_rate=0.025, l2_leaf_reg=25.0, rsm=0.6, subsample=0.90, random_strength=0.9, seed=456),
]

test_pred_list = []; oof_list = []; mask_list = []
for v in variants:
    print(f"=== Variant {v['name']} ===", flush=True)
    oof = np.zeros((N, C), dtype=np.float32)
    vmask = np.zeros((N, C), dtype=np.uint8)
    test_folds = []
    for fi, (tr_idx, va_idx) in enumerate(folds):
        t0 = time.time()
        Xtr_f, Xva_f = X_tr[tr_idx], X_tr[va_idx]
        ytr, yva = Y_full[tr_idx], Y_full[va_idx]
        te_pred = np.zeros((X_te.shape[0], C), dtype=np.float32)
        for c in range(C):
            ytr_c = ytr[:, c]; yva_c = yva[:, c]
            tr_pos = int((ytr_c == 1).sum()); tr_neg = int((ytr_c == 0).sum())
            va_pos = int((yva_c == 1).sum()); va_neg = int((yva_c == 0).sum())
            if not (tr_pos > 0 and tr_neg > 0 and va_pos > 0 and va_neg > 0):
                continue
            tr_pool = Pool(Xtr_f, label=ytr_c)
            va_pool = Pool(Xva_f, label=yva_c)
            cb = CatBoostClassifier(
                loss_function='Logloss',
                depth=v['depth'],
                learning_rate=v['learning_rate'],
                iterations=5000,
                l2_leaf_reg=v['l2_leaf_reg'],
                rsm=v['rsm'],
                subsample=v['subsample'],
                random_strength=v['random_strength'],
                auto_class_weights='Balanced',
                random_seed=v['seed'],
                early_stopping_rounds=200,
                verbose=False,
                task_type='CPU',
                thread_count=-1
            )
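            # NOTE: the validation fold doubles as the early-stopping set (CatBoost keeps the best
            # iteration on it by default), so the OOF AUC printed below is optimistically biased.
            cb.fit(tr_pool, eval_set=va_pool)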
            oof[va_idx, c] = cb.predict_proba(Xva_f)[:, 1]
            te_pred[:, c] = cb.predict_proba(X_te)[:, 1]
            vmask[va_idx, c] = 1
        fold_macro = masked_auc_macro(yva, oof[va_idx], vmask[va_idx])
        print(f"  Fold {fi}: masked macro={fold_macro:.4f} | n_val={len(va_idx)} | time={time.time()-t0:.1f}s", flush=True)
        test_folds.append(te_pred)
    test_mean = np.mean(test_folds, axis=0)
    oof_list.append(oof); mask_list.append(vmask); test_pred_list.append(test_mean)
    macro_v = masked_auc_macro(Y_full, oof, vmask)
    print(f"Variant OOF masked macro: {macro_v:.4f}", flush=True)

# 4) Equal-weight rank ensemble across variants
inter_mask = mask_list[0].astype(bool)
for m in mask_list[1:]:
    inter_mask &= m.astype(bool)
inter_mask = inter_mask.astype(np.uint8)

ranked_oofs = []
from scipy.stats import rankdata
for o in oof_list:
    Nn, Cc = o.shape
    R = np.full_like(o, np.nan, dtype=np.float32)
    for c in range(Cc):
        m = inter_mask[:, c].astype(bool)
        if not m.any():
            continue
        col = o[m, c]
        r = rankdata(col, method='average')
        R[m, c] = ((r - 1) / (len(r) - 1) if len(r) > 1 else r*0.0).astype(np.float32)
    ranked_oofs.append(R)
oof_ens = np.nanmean(ranked_oofs, axis=0)
macro_ens = masked_auc_macro(Y_full, oof_ens, inter_mask)
print(f'Ensemble (equal-weight, rank OOF) masked macro: {macro_ens:.4f}', flush=True)

ranked_tests = [rank_cols(tp) for tp in test_pred_list]
test_ens_adv = np.mean(ranked_tests, axis=0)

# 5) Build submission
test_id_order = sub_df[['Id','rec_id','class_id']].copy()
rec_to_idx = {rid:i for i, rid in enumerate(sorted(test_rec_ids))}
probs = []
for _, row in test_id_order.iterrows():
    rid = int(row['rec_id']); cid = int(row['class_id'])
    ti = rec_to_idx[rid]
    p = float(test_ens_adv[ti, cid])
    probs.append(min(max(p, 0.0), 1.0))
sub_out = pd.DataFrame({'Id': test_id_order['Id'].astype(int), 'Probability': np.round(probs, 6)})
sub_out.to_csv('submission.csv', index=False)
print(f"Wrote submission.csv with {len(sub_out)} rows (GroupKFold-by-station + adv-drop{drop_top}, CB bag, rank ensemble)")
display(sub_out.head())

Adversarial drop: top 30 features removed; keeping 667
GroupKFold by station -> n_folds: 5 unique stations: 13
=== Variant cb_gcv_adv_d3_lr002_l2_30_rsm06_sub085_s42 ===
  Fold 0: masked macro=0.7473 | n_val=31 | time=23.3s
  Fold 1: masked macro=0.7609 | n_val=27 | time=30.6s
  Fold 2: masked macro=0.6889 | n_val=26 | time=35.4s
  Fold 3: masked macro=0.7023 | n_val=31 | time=27.8s
  Fold 4: masked macro=0.7574 | n_val=30 | time=12.8s
Variant OOF masked macro: 0.7610
=== Variant cb_gcv_adv_d4_lr002_l2_40_rsm05_sub08_s123 ===
  Fold 0: masked macro=0.6911 | n_val=31 | time=30.5s
  Fold 1: masked macro=0.7614 | n_val=27 | time=46.9s
  Fold 2: masked macro=0.6957 | n_val=26 | time=48.8s
  Fold 3: masked macro=0.7210 | n_val=31 | time=34.9s
  Fold 4: masked macro=0.7474 | n_val=30 | time=16.1s
Variant OOF masked macro: 0.7449
=== Variant cb_gcv_adv_d3_lr0025_l2_25_rsm06_sub09_s456 ===
  Fold 0: masked macro=0.7250 | n_val=31 | time=23.3s
  Fold 1: masked macro=0.7450 | n_val=27 | time=24.9s
  Fold 2: masked macro=0.7060 | n_val=26 | time=30.1s
  Fold 3: masked macro=0.7345 | n_val=31 | time=27.9s
  Fold 4: masked macro=0.7102 | n_val=30 | time=11.5s
Variant OOF masked macro: 0.7418
Ensemble (equal-weight, rank OOF) masked macro: 0.7527
Wrote submission.csv with 1216 rows (GroupKFold-by-station + adv-drop30, CB bag, rank ensemble)


Unnamed: 0,Id,Probability
0,100,0.246032
1,101,0.457672
2,102,0.436508
3,103,0.716931
4,104,0.420635
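
The station-grouped split used above can be checked on toy data. The sketch below is illustrative only: the filename stems are hypothetical (the real ones come from rec_id2filename.txt), and it simply confirms that `GroupKFold` never places the same station in both the training and validation side of a split.

```python
import re
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical filename stems with 13 station prefixes (PC1..PC13).
stems = [f"PC{1 + i % 13}_2009_{i:04d}" for i in range(52)]
groups = np.array([re.search(r"(PC\d+)", s).group(1) for s in stems])
X = np.zeros((len(stems), 1))  # features are irrelevant for the split itself

shared = 0
for tr_idx, va_idx in GroupKFold(n_splits=5).split(X, groups=groups):
    # Count stations that appear on both sides of this split.
    shared += len(set(groups[tr_idx]) & set(groups[va_idx]))

n_stations = len(np.unique(groups))
print("stations:", n_stations, "| stations shared between train and val:", shared)
```

With only 13 stations and 5 folds, each validation fold holds two or three stations, which is why per-fold AUC in the log varies so much from fold to fold.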


In [35]:
# GroupKFold-by-station with expanded 8-variant CatBoost bag (strong reg, diverse seeds) + rank ensemble submission
import re, time, numpy as np, pandas as pd
from sklearn.model_selection import GroupKFold
from sklearn.metrics import roc_auc_score
from catboost import CatBoostClassifier, Pool

def extract_station(stem):
    if not isinstance(stem, str):
        return None
    m = re.search(r'(PC\d+)', stem)
    return m.group(1) if m else None

# Groups
train_meta = train_df[['rec_id','filename']].copy()
train_meta['station'] = train_meta['filename'].map(extract_station)
groups = train_meta['station'].fillna('UNK').values

# Features and labels (full engineered, no MI)
X_tr_full = X_train_df.values.astype(np.float32)
X_te_full = X_test_df.values.astype(np.float32)
Y_full = Y_aligned.astype(np.float32)
N, C = X_tr_full.shape[0], Y_full.shape[1]

def masked_auc_macro(y_true, y_pred, valid_mask):
    aucs = []
    for c in range(y_true.shape[1]):
        m = valid_mask[:, c].astype(bool)
        if m.sum() <= 1:
            continue
        yt = y_true[m, c]; yp = y_pred[m, c]
        if yt.max() == yt.min():
            continue
        try:
            aucs.append(roc_auc_score(yt, yp))
        except Exception:
            pass
    return float(np.mean(aucs)) if len(aucs) else float('nan')

def rank_cols(mat):
    from scipy.stats import rankdata
    Nn, Cc = mat.shape
    out = np.zeros_like(mat, dtype=np.float32)
    for c in range(Cc):
        r = rankdata(mat[:, c], method='average')
        out[:, c] = (r - 1) / (Nn - 1) if Nn > 1 else r*0.0
    return out

# 8 diverse CB variants (shallow, strong L2, feature subsampling, subsample, varied seeds)
variants = [
    dict(name='cb_gcv_d3_lr002_l2_35_rsm06_sub085_s42',  depth=3, learning_rate=0.02,  l2_leaf_reg=35.0, rsm=0.6, subsample=0.85, random_strength=0.8, seed=42),
    dict(name='cb_gcv_d4_lr002_l2_45_rsm05_sub08_s123',  depth=4, learning_rate=0.02,  l2_leaf_reg=45.0, rsm=0.5, subsample=0.80, random_strength=1.0, seed=123),
    dict(name='cb_gcv_d3_lr0025_l2_30_rsm06_sub09_s456', depth=3, learning_rate=0.025, l2_leaf_reg=30.0, rsm=0.6, subsample=0.90, random_strength=0.9, seed=456),
    dict(name='cb_gcv_d4_lr0015_l2_55_rsm05_sub075_s789',depth=4, learning_rate=0.015, l2_leaf_reg=55.0, rsm=0.5, subsample=0.75, random_strength=1.0, seed=789),
    dict(name='cb_gcv_d3_lr002_l2_50_rsm07_sub08_s101',  depth=3, learning_rate=0.02,  l2_leaf_reg=50.0, rsm=0.7, subsample=0.80, random_strength=0.7, seed=101),
    dict(name='cb_gcv_d4_lr002_l2_60_rsm04_sub08_s202',  depth=4, learning_rate=0.02,  l2_leaf_reg=60.0, rsm=0.4, subsample=0.80, random_strength=0.9, seed=202),
    dict(name='cb_gcv_d3_lr0015_l2_45_rsm05_sub09_s303', depth=3, learning_rate=0.015, l2_leaf_reg=45.0, rsm=0.5, subsample=0.90, random_strength=1.0, seed=303),
    dict(name='cb_gcv_d4_lr0025_l2_35_rsm06_sub085_s404',depth=4, learning_rate=0.025, l2_leaf_reg=35.0, rsm=0.6, subsample=0.85, random_strength=0.8, seed=404),
]

gkf = GroupKFold(n_splits=5)
folds = list(gkf.split(X_tr_full, Y_full, groups))
print('GroupKFold by station -> n_folds:', len(folds), 'unique stations:', len(np.unique(groups)))

test_pred_list = []
oof_list = []
mask_list = []

for v in variants:
    print(f"=== Variant {v['name']} ===", flush=True)
    oof = np.zeros((N, C), dtype=np.float32)
    vmask = np.zeros((N, C), dtype=np.uint8)
    test_folds = []
    for fi, (tr_idx, va_idx) in enumerate(folds):
        t0 = time.time()
        Xtr, Xva = X_tr_full[tr_idx], X_tr_full[va_idx]
        ytr, yva = Y_full[tr_idx], Y_full[va_idx]
        te_pred = np.zeros((X_te_full.shape[0], C), dtype=np.float32)
        for c in range(C):
            ytr_c = ytr[:, c]; yva_c = yva[:, c]
            tr_pos = int((ytr_c == 1).sum()); tr_neg = int((ytr_c == 0).sum())
            va_pos = int((yva_c == 1).sum()); va_neg = int((yva_c == 0).sum())
            if not (tr_pos > 0 and tr_neg > 0 and va_pos > 0 and va_neg > 0):
                continue
            tr_pool = Pool(Xtr, label=ytr_c)
            va_pool = Pool(Xva, label=yva_c)
            cb = CatBoostClassifier(
                loss_function='Logloss',
                depth=v['depth'],
                learning_rate=v['learning_rate'],
                iterations=5000,
                l2_leaf_reg=v['l2_leaf_reg'],
                rsm=v['rsm'],
                subsample=v['subsample'],
                random_strength=v['random_strength'],
                auto_class_weights='Balanced',
                random_seed=v['seed'],
                early_stopping_rounds=200,
                verbose=False,
                task_type='CPU',
                thread_count=-1
            )
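            # NOTE: the validation fold doubles as the early-stopping set (CatBoost keeps the best
            # iteration on it by default), so the OOF AUC printed below is optimistically biased.
            cb.fit(tr_pool, eval_set=va_pool)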
            oof[va_idx, c] = cb.predict_proba(Xva)[:, 1]
            te_pred[:, c] = cb.predict_proba(X_te_full)[:, 1]
            vmask[va_idx, c] = 1
        fold_macro = masked_auc_macro(yva, oof[va_idx], vmask[va_idx])
        print(f"  Fold {fi}: masked macro={fold_macro:.4f} | n_val={len(va_idx)} | time={time.time()-t0:.1f}s", flush=True)
        test_folds.append(te_pred)
    test_mean = np.mean(test_folds, axis=0)
    oof_list.append(oof); mask_list.append(vmask); test_pred_list.append(test_mean)
    macro_v = masked_auc_macro(Y_full, oof, vmask)
    print(f"Variant OOF masked macro: {macro_v:.4f}", flush=True)

# Equal-weight rank ensemble across 8 variants
inter_mask = mask_list[0].astype(bool)
for m in mask_list[1:]:
    inter_mask &= m.astype(bool)
inter_mask = inter_mask.astype(np.uint8)

ranked_oofs = []
from scipy.stats import rankdata
for o in oof_list:
    Nn, Cc = o.shape
    R = np.full_like(o, np.nan, dtype=np.float32)
    for c in range(Cc):
        m = inter_mask[:, c].astype(bool)
        if not m.any():
            continue
        col = o[m, c]
        r = rankdata(col, method='average')
        R[m, c] = ((r - 1) / (len(r) - 1) if len(r) > 1 else r*0.0).astype(np.float32)
    ranked_oofs.append(R)
oof_ens = np.nanmean(ranked_oofs, axis=0)
macro_ens = masked_auc_macro(Y_full, oof_ens, inter_mask)
print(f'Ensemble (8x CB, equal-weight rank OOF) masked macro: {macro_ens:.4f}', flush=True)

ranked_tests = [rank_cols(tp) for tp in test_pred_list]
test_ens8 = np.mean(ranked_tests, axis=0)

# Build submission
test_id_order = sub_df[['Id','rec_id','class_id']].copy()
rec_to_idx = {rid:i for i, rid in enumerate(sorted(test_rec_ids))}
probs = []
for _, row in test_id_order.iterrows():
    rid = int(row['rec_id']); cid = int(row['class_id'])
    ti = rec_to_idx[rid]
    p = float(test_ens8[ti, cid])
    probs.append(min(max(p, 0.0), 1.0))
sub_out = pd.DataFrame({'Id': test_id_order['Id'].astype(int), 'Probability': np.round(probs, 6)})
sub_out.to_csv('submission.csv', index=False)
print(f"Wrote submission.csv with {len(sub_out)} rows (GroupKFold-by-station, 8x CB bag, equal-weight rank ensemble)")
display(sub_out.head())

GroupKFold by station -> n_folds: 5 unique stations: 13
=== Variant cb_gcv_d3_lr002_l2_35_rsm06_sub085_s42 ===
  Fold 0: masked macro=0.7490 | n_val=31 | time=25.7s
  Fold 1: masked macro=0.7750 | n_val=27 | time=31.1s
  Fold 2: masked macro=0.7013 | n_val=26 | time=40.2s
  Fold 3: masked macro=0.6840 | n_val=31 | time=30.8s
  Fold 4: masked macro=0.7147 | n_val=30 | time=14.6s
Variant OOF masked macro: 0.7557
=== Variant cb_gcv_d4_lr002_l2_45_rsm05_sub08_s123 ===
  Fold 0: masked macro=0.6901 | n_val=31 | time=32.4s
  Fold 1: masked macro=0.7651 | n_val=27 | time=55.6s
  Fold 2: masked macro=0.7354 | n_val=26 | time=68.9s
  Fold 3: masked macro=0.7044 | n_val=31 | time=47.7s
  Fold 4: masked macro=0.6526 | n_val=30 | time=20.8s
Variant OOF masked macro: 0.7189
=== Variant cb_gcv_d3_lr0025_l2_30_rsm06_sub09_s456 ===
  Fold 0: masked macro=0.7330 | n_val=31 | time=25.6s
  Fold 1: masked macro=0.7682 | n_val=27 | time=27.5s
  Fold 2: masked macro=0.6894 | n_val=26 | time=34.1s
  Fold 3: masked macro=0.7429 | n_val=31 | time=31.5s
  Fold 4: masked macro=0.7512 | n_val=30 | time=14.2s
Variant OOF masked macro: 0.7427
=== Variant cb_gcv_d4_lr0015_l2_55_rsm05_sub075_s789 ===
  Fold 0: masked macro=0.7230 | n_val=31 | time=35.2s
  Fold 1: masked macro=0.7388 | n_val=27 | time=66.9s
  Fold 2: masked macro=0.7202 | n_val=26 | time=69.4s
  Fold 3: masked macro=0.7254 | n_val=31 | time=56.5s
  Fold 4: masked macro=0.7152 | n_val=30 | time=24.5s
Variant OOF masked macro: 0.7404
=== Variant cb_gcv_d3_lr002_l2_50_rsm07_sub08_s101 ===
  Fold 0: masked macro=0.7479 | n_val=31 | time=28.7s
  Fold 1: masked macro=0.7183 | n_val=27 | time=36.9s
  Fold 2: masked macro=0.7224 | n_val=26 | time=46.0s
  Fold 3: masked macro=0.7515 | n_val=31 | time=30.0s
  Fold 4: masked macro=0.6993 | n_val=30 | time=18.9s
Variant OOF masked macro: 0.7415
=== Variant cb_gcv_d4_lr002_l2_60_rsm04_sub08_s202 ===
  Fold 0: masked macro=0.7273 | n_val=31 | time=31.9s
  Fold 1: masked macro=0.7793 | n_val=27 | time=48.9s
  Fold 2: masked macro=0.7294 | n_val=26 | time=61.4s
  Fold 3: masked macro=0.7230 | n_val=31 | time=46.4s
  Fold 4: masked macro=0.7056 | n_val=30 | time=20.9s
Variant OOF masked macro: 0.7482
=== Variant cb_gcv_d3_lr0015_l2_45_rsm05_sub09_s303 ===
  Fold 0: masked macro=0.7119 | n_val=31 | time=28.6s
  Fold 1: masked macro=0.7327 | n_val=27 | time=42.8s
  Fold 2: masked macro=0.7049 | n_val=26 | time=37.8s
  Fold 3: masked macro=0.7281 | n_val=31 | time=35.9s
  Fold 4: masked macro=0.6851 | n_val=30 | time=16.0s
Variant OOF masked macro: 0.7368
=== Variant cb_gcv_d4_lr0025_l2_35_rsm06_sub085_s404 ===
  Fold 0: masked macro=0.7160 | n_val=31 | time=32.1s
  Fold 1: masked macro=0.7435 | n_val=27 | time=41.3s
  Fold 2: masked macro=0.6820 | n_val=26 | time=60.8s
  Fold 3: masked macro=0.7658 | n_val=31 | time=42.1s
  Fold 4: masked macro=0.7285 | n_val=30 | time=19.3s
Variant OOF masked macro: 0.7463
Ensemble (8x CB, equal-weight rank OOF) masked macro: 0.7474
Wrote submission.csv with 1216 rows (GroupKFold-by-station, 8x CB bag, equal-weight rank ensemble)


Unnamed: 0,Id,Probability
0,100,0.246032
1,101,0.464286
2,102,0.422619
3,103,0.719246
4,104,0.396825
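
To make the equal-weight rank ensemble concrete, here is a minimal sketch on made-up scores for a single class column. The two "models" and their values are hypothetical; only `scipy.stats.rankdata` with `method='average'` and the `(r - 1) / (n - 1)` scaling are taken from the cell above.

```python
import numpy as np
from scipy.stats import rankdata

def rank_normalize(col):
    """Map scores to [0, 1] by average rank, as rank_cols does per column."""
    r = rankdata(col, method="average")
    return (r - 1) / (len(r) - 1) if len(r) > 1 else np.zeros_like(r)

# Hypothetical scores from two models for the same four recordings.
model_a = np.array([0.90, 0.10, 0.40, 0.20])  # wide score range
model_b = np.array([0.05, 0.01, 0.09, 0.02])  # compressed range, different top item

raw_mean = (model_a + model_b) / 2
rank_mean = (rank_normalize(model_a) + rank_normalize(model_b)) / 2

print("raw mean :", np.round(raw_mean, 3))
print("rank mean:", np.round(rank_mean, 3))
```

In the raw average, model_a decides the ordering simply because its scores are larger. After rank normalization each model gets an equal vote, so the two recordings the models disagree on end up tied. Since AUC depends only on ordering, this is a natural way to blend models whose probabilities are calibrated differently, though it does not guarantee the blend beats the best single variant.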


In [36]:
# Group-aware CV: Random station-to-fold search to maximize class coverage + CB bag + rank ensemble
import numpy as np, pandas as pd, time, re, random
from sklearn.metrics import roc_auc_score
from catboost import CatBoostClassifier, Pool

def extract_station(stem):
    if not isinstance(stem, str): return None
    m = re.search(r'(PC\d+)', stem); return m.group(1) if m else None

# Build groups and per-station label matrix
# reset_index so that index labels equal row positions in X_train_df / Y_aligned;
# the fold indices built from train_meta.index below are used positionally.
train_meta = train_df[['rec_id','filename']].reset_index(drop=True)
train_meta['station'] = train_meta['filename'].map(extract_station).fillna('UNK')
stations = sorted(train_meta['station'].unique().tolist())
st2idx = {s:i for i,s in enumerate(stations)}
N, C = X_train_df.shape[0], Y_aligned.shape[1]
st_labels = np.zeros((len(stations), C), dtype=np.int32)
for i, st in enumerate(train_meta['station'].values):
    st_labels[st2idx[st]] += (Y_aligned[i] > 0).astype(np.int32)

# Candidate fold assignment search
def score_assignment(assign, n_folds):
    # assign: list length = n_stations with fold id in [0..n_folds-1]
    # Score: minimize total missing classes across folds (both in fold and in complement), and balance fold sizes
    total_miss = 0
    sizes = [0]*n_folds
    # compute per-fold label sums
    fold_lab = [np.zeros(C, dtype=np.int32) for _ in range(n_folds)]
    for si, f in enumerate(assign):
        fold_lab[f] += st_labels[si]
        sizes[f] += (train_meta['station'].values == stations[si]).sum()
    # per-fold missing positives in val
    for f in range(n_folds):
        val_pos = fold_lab[f]
        trn_pos = (np.sum(fold_lab, axis=0) - val_pos)
        miss_val = int((val_pos == 0).sum())
        miss_trn = int((trn_pos == 0).sum())
        total_miss += (miss_val + miss_trn)
    # size imbalance penalty
    sizes = np.array(sizes, dtype=np.int32)
    size_pen = int(np.var(sizes))
    return total_miss * 1000 + size_pen  # miss dominates

def random_assignments(n_st, n_folds, trials=256, seed=42):
    rng = random.Random(seed)
    best = None; best_score = None
    for t in range(trials):
        assign = [rng.randrange(n_folds) for _ in range(n_st)]
        # ensure all folds non-empty
        if len(set(assign)) < n_folds:
            continue
        sc = score_assignment(assign, n_folds)
        if (best_score is None) or (sc < best_score):
            best = assign; best_score = sc
    return best, best_score

n_folds = 5
best_assign, best_sc = random_assignments(len(stations), n_folds, trials=512, seed=42)
print('Stations:', len(stations), '| Best assign score:', best_sc, '| Assignment:', best_assign)
st2fold = {s: f for s, f in zip(stations, best_assign)}
fold_indices = []
for f in range(n_folds):
    va_idx = train_meta.index[train_meta['station'].map(st2fold) == f].to_numpy()
    tr_idx = train_meta.index[train_meta['station'].map(st2fold) != f].to_numpy()
    fold_indices.append((tr_idx, va_idx))
    print(f'Fold {f}: n_val={len(va_idx)} n_trn={len(tr_idx)}')

# Features and labels
X_tr_full = X_train_df.values.astype(np.float32)
X_te_full = X_test_df.values.astype(np.float32)
Y_full = Y_aligned.astype(np.float32)
N, C = X_tr_full.shape[0], Y_full.shape[1]

def masked_auc_macro(y_true, y_pred, valid_mask):
    aucs = []
    for c in range(y_true.shape[1]):
        m = valid_mask[:, c].astype(bool)
        if m.sum() <= 1: continue
        yt = y_true[m, c]; yp = y_pred[m, c]
        if yt.max() == yt.min(): continue
        try: aucs.append(roc_auc_score(yt, yp))
        except Exception: pass
    return float(np.mean(aucs)) if len(aucs) else float('nan')

def rank_cols(mat):
    from scipy.stats import rankdata
    Nn, Cc = mat.shape
    out = np.zeros_like(mat, dtype=np.float32)
    for c in range(Cc):
        r = rankdata(mat[:, c], method='average')
        out[:, c] = (r - 1) / (Nn - 1) if Nn > 1 else r*0.0
    return out

# CatBoost variants (4 diverse, shallow, strong reg)
variants = [
    dict(name='cb_stfold_d3_lr002_l2_35_rsm06_sub085_s42',  depth=3, learning_rate=0.02,  l2_leaf_reg=35.0, rsm=0.6, subsample=0.85, random_strength=0.8, seed=42),
    dict(name='cb_stfold_d4_lr002_l2_45_rsm05_sub08_s123',  depth=4, learning_rate=0.02,  l2_leaf_reg=45.0, rsm=0.5, subsample=0.80, random_strength=1.0, seed=123),
    dict(name='cb_stfold_d3_lr0025_l2_30_rsm06_sub09_s456', depth=3, learning_rate=0.025, l2_leaf_reg=30.0, rsm=0.6, subsample=0.90, random_strength=0.9, seed=456),
    dict(name='cb_stfold_d4_lr0015_l2_55_rsm05_sub075_s789',depth=4, learning_rate=0.015, l2_leaf_reg=55.0, rsm=0.5, subsample=0.75, random_strength=1.0, seed=789),
]

test_pred_list = []; oof_list = []; mask_list = []
for v in variants:
    print(f"=== Variant {v['name']} ===", flush=True)
    oof = np.zeros((N, C), dtype=np.float32)
    vmask = np.zeros((N, C), dtype=np.uint8)
    test_folds = []
    for fi, (tr_idx, va_idx) in enumerate(fold_indices):
        t0 = time.time()
        Xtr, Xva = X_tr_full[tr_idx], X_tr_full[va_idx]
        ytr, yva = Y_full[tr_idx], Y_full[va_idx]
        te_pred = np.zeros((X_te_full.shape[0], C), dtype=np.float32)
        for c in range(C):
            ytr_c = ytr[:, c]; yva_c = yva[:, c]
            tr_pos = int((ytr_c == 1).sum()); tr_neg = int((ytr_c == 0).sum())
            va_pos = int((yva_c == 1).sum()); va_neg = int((yva_c == 0).sum())
            if not (tr_pos > 0 and tr_neg > 0 and va_pos > 0 and va_neg > 0):
                continue
            tr_pool = Pool(Xtr, label=ytr_c)
            va_pool = Pool(Xva, label=yva_c)
            cb = CatBoostClassifier(
                loss_function='Logloss',
                depth=v['depth'], learning_rate=v['learning_rate'],
                iterations=5000, l2_leaf_reg=v['l2_leaf_reg'],
                rsm=v['rsm'], subsample=v['subsample'], random_strength=v['random_strength'],
                auto_class_weights='Balanced', random_seed=v['seed'],
                early_stopping_rounds=200, verbose=False, task_type='CPU', thread_count=-1
            )
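            # NOTE: the validation fold doubles as the early-stopping set (CatBoost keeps the best
            # iteration on it by default), so the OOF AUC printed below is optimistically biased.
            cb.fit(tr_pool, eval_set=va_pool)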
            oof[va_idx, c] = cb.predict_proba(Xva)[:, 1]
            te_pred[:, c] = cb.predict_proba(X_te_full)[:, 1]
            vmask[va_idx, c] = 1
        fold_macro = masked_auc_macro(yva, oof[va_idx], vmask[va_idx])
        print(f"  Fold {fi}: masked macro={fold_macro:.4f} | n_val={len(va_idx)} | time={time.time()-t0:.1f}s", flush=True)
        test_folds.append(te_pred)
    test_mean = np.mean(test_folds, axis=0)
    oof_list.append(oof); mask_list.append(vmask); test_pred_list.append(test_mean)
    macro_v = masked_auc_macro(Y_full, oof, vmask)
    print(f"Variant OOF masked macro: {macro_v:.4f}", flush=True)

# Equal-weight rank ensemble across variants
inter_mask = mask_list[0].astype(bool)
for m in mask_list[1:]: inter_mask &= m.astype(bool)
inter_mask = inter_mask.astype(np.uint8)

ranked_oofs = []
from scipy.stats import rankdata
for o in oof_list:
    Nn, Cc = o.shape
    R = np.full_like(o, np.nan, dtype=np.float32)
    for c in range(Cc):
        m = inter_mask[:, c].astype(bool)
        if not m.any(): continue
        col = o[m, c]; r = rankdata(col, method='average')
        R[m, c] = ((r - 1) / (len(r) - 1) if len(r) > 1 else r*0.0).astype(np.float32)
    ranked_oofs.append(R)
oof_ens = np.nanmean(ranked_oofs, axis=0)
macro_ens = masked_auc_macro(Y_full, oof_ens, inter_mask)
print(f'Ensemble (station-fold search, equal-weight rank OOF) masked macro: {macro_ens:.4f}', flush=True)

ranked_tests = [rank_cols(tp) for tp in test_pred_list]
test_ens_st = np.mean(ranked_tests, axis=0)

# Build submission
test_id_order = sub_df[['Id','rec_id','class_id']].copy()
rec_to_idx = {rid:i for i, rid in enumerate(sorted(test_rec_ids))}
probs = []
for _, row in test_id_order.iterrows():
    rid = int(row['rec_id']); cid = int(row['class_id'])
    ti = rec_to_idx[rid]
    p = float(test_ens_st[ti, cid])
    probs.append(min(max(p, 0.0), 1.0))
sub_out = pd.DataFrame({'Id': test_id_order['Id'].astype(int), 'Probability': np.round(probs, 6)})
sub_out.to_csv('submission.csv', index=False)
print(f"Wrote submission.csv with {len(sub_out)} rows (Group station-fold search CB bag, equal-weight rank ensemble)")
display(sub_out.head())

Stations: 13 | Best assign score: 26107 | Assignment: [4, 4, 1, 2, 0, 3, 0, 2, 3, 0, 0, 1, 2]
Fold 0: n_val=48 n_trn=97
Fold 1: n_val=30 n_trn=115
Fold 2: n_val=18 n_trn=127
Fold 3: n_val=22 n_trn=123
Fold 4: n_val=27 n_trn=118
=== Variant cb_stfold_d3_lr002_l2_35_rsm06_sub085_s42 ===
  Fold 0: masked macro=0.6392 | n_val=48 | time=34.1s
  Fold 1: masked macro=0.7072 | n_val=30 | time=13.3s
  Fold 2: masked macro=0.7767 | n_val=18 | time=18.6s
  Fold 3: masked macro=0.6884 | n_val=22 | time=40.8s
  Fold 4: masked macro=0.7750 | n_val=27 | time=31.2s
Variant OOF masked macro: 0.7347
=== Variant cb_stfold_d4_lr002_l2_45_rsm05_sub08_s123 ===
  Fold 0: masked macro=0.6342 | n_val=48 | time=59.4s
  Fold 1: masked macro=0.6465 | n_val=30 | time=12.4s
  Fold 2: masked macro=0.8068 | n_val=18 | time=35.6s
  Fold 3: masked macro=0.6753 | n_val=22 | time=58.2s
  Fold 4: masked macro=0.7651 | n_val=27 | time=55.6s
Variant OOF masked macro: 0.7473
=== Variant cb_stfold_d3_lr0025_l2_30_rsm06_sub09_s456 ===
  Fold 0: masked macro=0.6107 | n_val=48 | time=30.6s
  Fold 1: masked macro=0.7338 | n_val=30 | time=8.5s
  Fold 2: masked macro=0.8315 | n_val=18 | time=14.5s
  Fold 3: masked macro=0.7535 | n_val=22 | time=33.6s
  Fold 4: masked macro=0.7682 | n_val=27 | time=27.6s
Variant OOF masked macro: 0.7461
=== Variant cb_stfold_d4_lr0015_l2_55_rsm05_sub075_s789 ===
  Fold 0: masked macro=0.6316 | n_val=48 | time=62.4s
  Fold 1: masked macro=0.6725 | n_val=30 | time=13.6s
  Fold 2: masked macro=0.8232 | n_val=18 | time=43.0s
  Fold 3: masked macro=0.7336 | n_val=22 | time=72.6s
  Fold 4: masked macro=0.7388 | n_val=27 | time=67.0s
Variant OOF masked macro: 0.7524
Ensemble (station-fold search, equal-weight rank OOF) masked macro: 0.7528
Wrote submission.csv with 1216 rows (Group station-fold search CB bag, equal-weight rank ensemble)


Unnamed: 0,Id,Probability
0,100,0.246032
1,101,0.452381
2,102,0.440476
3,103,0.666667
4,104,0.392857
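
The logged `Best assign score: 26107` can be unpacked by hand, since `score_assignment` returns `total_miss * 1000 + int(var(sizes))`. The sketch below only re-does that arithmetic from the numbers printed above; splitting with `divmod` is valid here solely because the variance term happens to be below 1000.

```python
# Values copied from the log of the station-fold search.
best_score = 26107
val_sizes = [48, 30, 18, 22, 27]

# Population variance (what np.var computes by default, ddof=0).
mean_size = sum(val_sizes) / len(val_sizes)
size_var = sum((s - mean_size) ** 2 for s in val_sizes) / len(val_sizes)

# Undo score = total_miss * 1000 + int(var); assumes int(var) < 1000.
total_miss, size_pen = divmod(best_score, 1000)

print(f"mean fold size={mean_size}, variance={size_var:.1f}")
print(f"missing class/fold slots={total_miss}, size penalty={size_pen}")
```

So even the best of 512 random assignments leaves 26 class-by-fold gaps (a class with no positives in a validation fold or in its complementary training set), and the fold sizes range from 18 to 48. Both are consistent with the wide spread of per-fold AUC in this run.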


In [37]:
# Station-to-fold assignment search with constraints (5 folds, trials=2048) -> cached fold indices for views
import numpy as np, pandas as pd, re, random, time
from collections import Counter

def extract_station(stem):
    if not isinstance(stem, str): return None
    m = re.search(r'(PC\d+)', stem); return m.group(1) if m else None

# Prepare station groups and label matrix
# reset_index so that index labels equal row positions in X_train_df / Y_aligned;
# build_fold_indices below returns train_meta.index values that are used positionally.
train_meta = train_df[['rec_id','filename']].reset_index(drop=True)
train_meta['station'] = train_meta['filename'].map(extract_station).fillna('UNK')
stations = sorted(train_meta['station'].unique().tolist())
st2idx = {s:i for i,s in enumerate(stations)}
N, C = X_train_df.shape[0], Y_aligned.shape[1]
st_labels = np.zeros((len(stations), C), dtype=np.int32)
st_sizes = np.zeros(len(stations), dtype=np.int32)
for i, st in enumerate(train_meta['station'].values):
    si = st2idx[st]
    st_labels[si] += (Y_aligned[i] > 0).astype(np.int32)
    st_sizes[si] += 1

def build_fold_indices(assign, n_folds):
    st2fold = {s:f for s,f in zip(stations, assign)}
    folds = []
    for f in range(n_folds):
        va_idx = train_meta.index[train_meta['station'].map(st2fold) == f].to_numpy()
        tr_idx = train_meta.index[train_meta['station'].map(st2fold) != f].to_numpy()
        folds.append((tr_idx, va_idx))
    return folds

def violates_constraints(assign, n_folds, min_fold=25, max_fold=40, min_pos=3):
    # Fold sizes
    fold_sizes = [0]*n_folds
    fold_lab = [np.zeros(C, dtype=np.int32) for _ in range(n_folds)]
    for si, f in enumerate(assign):
        fold_sizes[f] += st_sizes[si]
        fold_lab[f] += st_labels[si]
    # Size bounds check
    for fs in fold_sizes:
        if fs < min_fold or fs > max_fold:
            return True
    # Class coverage: for each fold, for each class, val_pos >= min_pos and train_pos >= min_pos
    total = np.sum(fold_lab, axis=0)
    for f in range(n_folds):
        val_pos = fold_lab[f]
        trn_pos = total - val_pos
        if (val_pos < min_pos).any():
            return True
        if (trn_pos < min_pos).any():
            return True
    return False

def score_assignment(assign, n_folds):
    # Minimize size variance; constraints handled separately
    fold_sizes = [0]*n_folds
    for si, f in enumerate(assign):
        fold_sizes[f] += st_sizes[si]
    sizes = np.array(fold_sizes, dtype=np.float32)
    return float(np.var(sizes))

def random_search_station_folds(n_folds=5, trials=2048, seed=42, min_fold=25, max_fold=40, min_pos=3):
    rng = random.Random(seed)
    best = None; best_score = None; tried = 0; kept = 0
    for t in range(trials):
        # Random assign ensuring all folds non-empty
        assign = [rng.randrange(n_folds) for _ in stations]
        if len(set(assign)) < n_folds:
            continue
        tried += 1
        if violates_constraints(assign, n_folds, min_fold, max_fold, min_pos):
            continue
        kept += 1
        sc = score_assignment(assign, n_folds)
        if (best_score is None) or (sc < best_score):
            best = list(assign); best_score = sc
    return best, best_score, tried, kept

n_folds = 5
t0 = time.time()
assign1, sc1, tried1, kept1 = random_search_station_folds(n_folds=n_folds, trials=2048, seed=42, min_fold=25, max_fold=40, min_pos=3)
print(f'[Search1] tried={tried1} kept={kept1} best_score={sc1} assign={assign1} time={time.time()-t0:.1f}s', flush=True)
if assign1 is None:
    print('WARNING: No feasible assignment found with constraints. Consider relaxing min_pos or size bounds.', flush=True)
folds_station_opt1 = build_fold_indices(assign1, n_folds) if assign1 is not None else []
for f,(tr_idx,va_idx) in enumerate(folds_station_opt1):
    print(f'  Opt1 Fold {f}: n_val={len(va_idx)} n_trn={len(tr_idx)}')

# Optional second view with different seed
t1 = time.time()
assign2, sc2, tried2, kept2 = random_search_station_folds(n_folds=n_folds, trials=2048, seed=1337, min_fold=25, max_fold=40, min_pos=3)
print(f'[Search2] tried={tried2} kept={kept2} best_score={sc2} assign={assign2} time={time.time()-t1:.1f}s', flush=True)
folds_station_opt2 = build_fold_indices(assign2, n_folds) if assign2 is not None else []
for f,(tr_idx,va_idx) in enumerate(folds_station_opt2):
    print(f'  Opt2 Fold {f}: n_val={len(va_idx)} n_trn={len(tr_idx)}')

# Quick per-fold class coverage summary for Opt1
if assign1 is not None:
    st2fold1 = {s:f for s,f in zip(stations, assign1)}
    for f in range(n_folds):
        va_mask = (train_meta['station'].map(st2fold1).values == f)
        tr_mask = ~va_mask
        yv = Y_aligned[va_mask]
        yt = Y_aligned[tr_mask]
        val_pos = yv.sum(axis=0); trn_pos = yt.sum(axis=0)
        print(f'  Opt1 Fold {f}: min val_pos={int(val_pos.min())} min trn_pos={int(trn_pos.min())}', flush=True)

[Search1] tried=1504 kept=0 best_score=None assign=None time=0.0s
[Search2] tried=1517 kept=0 best_score=None assign=None time=0.0s
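
Both searches kept 0 assignments. One plausible cause (an assumption, since the notebook does not print per-class totals) is that some rare species cannot meet `min_pos=3` in every validation fold no matter how stations are assigned. The sketch below checks two necessary conditions before any search is run; `infeasible_classes` and the toy matrix are hypothetical, and passing the check does not guarantee that a feasible assignment exists.

```python
import numpy as np

def infeasible_classes(st_labels, n_folds, min_pos):
    """Classes that can never reach val_pos >= min_pos in every fold.

    st_labels: (n_stations, n_classes) positive counts per station.
    Necessary conditions only:
      1) total positives >= n_folds * min_pos
      2) the class occurs in at least n_folds stations (when min_pos > 0),
         because stations are assigned to folds as indivisible units.
    """
    total_pos = st_labels.sum(axis=0)
    n_st_with_class = (st_labels > 0).sum(axis=0)
    too_few_pos = total_pos < n_folds * min_pos
    too_few_stations = np.logical_and(min_pos > 0, n_st_with_class < n_folds)
    return np.flatnonzero(too_few_pos | too_few_stations)

# Hypothetical counts: 6 stations x 3 classes (one common, two rare).
toy = np.array([
    [4, 1, 0],
    [3, 0, 0],
    [5, 2, 1],
    [3, 0, 0],
    [4, 1, 0],
    [3, 0, 2],
])

strict = infeasible_classes(toy, n_folds=5, min_pos=3)
relaxed = infeasible_classes(toy, n_folds=2, min_pos=1)
print("blocking classes (5 folds, min_pos=3):", strict.tolist())
print("blocking classes (2 folds, min_pos=1):", relaxed.tolist())
```

Running the same check on the real `st_labels` would show whether relaxing `min_pos` or the fold count is required, or whether the coverage constraint should be restricted to classes with enough positives.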


In [38]:
# Relaxed constrained station-to-fold assignment search (multi-try) to obtain feasible folds
import numpy as np, random, time

def relaxed_search(n_folds=5, trial_grid=(2048, 4096), size_bounds=((25,40), (22,42), (20,45)), min_pos_list=(3,2,1), seeds=(42,1337)):
    results = {}
    def violates(assign, min_fold, max_fold, min_pos):
        fold_sizes = [0]*n_folds
        fold_lab = [np.zeros(C, dtype=np.int32) for _ in range(n_folds)]
        for si, f in enumerate(assign):
            fold_sizes[f] += st_sizes[si]
            fold_lab[f] += st_labels[si]
        for fs in fold_sizes:
            if fs < min_fold or fs > max_fold:
                return True
        total = np.sum(fold_lab, axis=0)
        for f in range(n_folds):
            val_pos = fold_lab[f]
            trn_pos = total - val_pos
            if (val_pos < min_pos).any():
                return True
            if (trn_pos < min_pos).any():
                return True
        return False
    def score(assign):
        fold_sizes = [0]*n_folds
        for si, f in enumerate(assign):
            fold_sizes[f] += st_sizes[si]
        sizes = np.array(fold_sizes, dtype=np.float32)
        return float(np.var(sizes))
    def build(assign):
        st2fold = {s:f for s,f in zip(stations, assign)}
        folds = []
        for f in range(n_folds):
            va_idx = train_meta.index[train_meta['station'].map(st2fold) == f].to_numpy()
            tr_idx = train_meta.index[train_meta['station'].map(st2fold) != f].to_numpy()
            folds.append((tr_idx, va_idx))
        return folds
    for sd in seeds:
        got = False
        best = None; best_sc = None; best_cfg = None; tried_total=0; kept_total=0
        for trials in trial_grid:
            if got: break
            for (mn_sz, mx_sz) in size_bounds:
                if got: break
                for mp in min_pos_list:
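                    # Re-seeding per configuration replays the same candidate assignments,
                    # so each relaxation is tested on an identical sample and the
                    # 4096-trial pass repeats the first 2048 draws.
                    rng = random.Random(sd)
                    best_local = None; best_sc_local=None; tried=0; kept=0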
                    for t in range(trials):
                        assign = [rng.randrange(n_folds) for _ in stations]
                        if len(set(assign)) < n_folds:
                            continue
                        tried += 1
                        if violates(assign, mn_sz, mx_sz, mp):
                            continue
                        kept += 1
                        sc = score(assign)
                        if (best_sc_local is None) or (sc < best_sc_local):
                            best_local = list(assign); best_sc_local = sc
                    tried_total += tried; kept_total += kept
                    print(f'[Seed {sd}] trials={trials} size=({mn_sz},{mx_sz}) min_pos={mp} -> tried={tried} kept={kept} best_var={best_sc_local}', flush=True)
                    if best_sc_local is not None:
                        best = best_local; best_sc = best_sc_local; best_cfg=(mn_sz, mx_sz, mp, trials); got=True; break
        if best is not None:
            folds_opt = build(best)
            key = f'opt_seed_{sd}'
            results[key] = dict(assign=best, score=best_sc, cfg=best_cfg, folds=folds_opt, tried=tried_total, kept=kept_total)
            print(f'[Seed {sd}] FOUND assignment: score={best_sc} cfg={best_cfg} assign={best}', flush=True)
            for f,(tr_idx,va_idx) in enumerate(folds_opt):
                yv = Y_aligned[va_idx]; yt = Y_aligned[tr_idx]
                print(f'  Fold {f}: n_val={len(va_idx)} n_trn={len(tr_idx)} | min val_pos={int(yv.sum(axis=0).min())} min trn_pos={int(yt.sum(axis=0).min())}', flush=True)
        else:
            print(f'[Seed {sd}] No feasible assignment found under all relaxations.', flush=True)
    return results

t0=time.time()
station_fold_search = relaxed_search()
print('Search total time:', f'{time.time()-t0:.1f}s')

# Export first two found assignments (if any) to folds_station_opt1/opt2
found_keys = list(station_fold_search.keys())
if len(found_keys) >= 1:
    folds_station_opt1 = station_fold_search[found_keys[0]]['folds']
    print('folds_station_opt1 ready with key:', found_keys[0])
if len(found_keys) >= 2:
    folds_station_opt2 = station_fold_search[found_keys[1]]['folds']
    print('folds_station_opt2 ready with key:', found_keys[1])

[Seed 42] trials=2048 size=(25,40) min_pos=3 -> tried=1504 kept=0 best_var=None
[Seed 42] trials=2048 size=(25,40) min_pos=2 -> tried=1504 kept=0 best_var=None
[Seed 42] trials=2048 size=(25,40) min_pos=1 -> tried=1504 kept=0 best_var=None
[Seed 42] trials=2048 size=(22,42) min_pos=3 -> tried=1504 kept=0 best_var=None
[Seed 42] trials=2048 size=(22,42) min_pos=2 -> tried=1504 kept=0 best_var=None
[Seed 42] trials=2048 size=(22,42) min_pos=1 -> tried=1504 kept=0 best_var=None
[Seed 42] trials=2048 size=(20,45) min_pos=3 -> tried=1504 kept=0 best_var=None
[Seed 42] trials=2048 size=(20,45) min_pos=2 -> tried=1504 kept=0 best_var=None
[Seed 42] trials=2048 size=(20,45) min_pos=1 -> tried=1504 kept=0 best_var=None
[Seed 42] trials=4096 size=(25,40) min_pos=3 -> tried=3039 kept=0 best_var=None
[Seed 42] trials=4096 size=(25,40) min_pos=2 -> tried=3039 kept=0 best_var=None
[Seed 42] trials=4096 size=(25,40) min_pos=1 -> tried=3039 kept=0 best_var=None
[Seed 42] trials=4096 size=(22,42) min_pos=3 -> tried=3039 kept=0 best_var=None
[Seed 42] trials=4096 size=(22,42) min_pos=2 -> tried=3039 kept=0 best_var=None
[Seed 42] trials=4096 size=(22,42) min_pos=1 -> tried=3039 kept=0 best_var=None
[Seed 42] trials=4096 size=(20,45) min_pos=3 -> tried=3039 kept=0 best_var=None
[Seed 42] trials=4096 size=(20,45) min_pos=2 -> tried=3039 kept=0 best_var=None
[Seed 42] trials=4096 size=(20,45) min_pos=1 -> tried=3039 kept=0 best_var=None
[Seed 42] No feasible assignment found under all relaxations.
[Seed 1337] trials=2048 size=(25,40) min_pos=3 -> tried=1517 kept=0 best_var=None
[Seed 1337] trials=2048 size=(25,40) min_pos=2 -> tried=1517 kept=0 best_var=None
[Seed 1337] trials=2048 size=(25,40) min_pos=1 -> tried=1517 kept=0 best_var=None
[Seed 1337] trials=2048 size=(22,42) min_pos=3 -> tried=1517 kept=0 best_var=None
[Seed 1337] trials=2048 size=(22,42) min_pos=2 -> tried=1517 kept=0 best_var=None
[Seed 1337] trials=2048 size=(22,42) min_pos=1 -> tried=1517 kept=0 best_var=None
[Seed 1337] trials=2048 size=(20,45) min_pos=3 -> tried=1517 kept=0 best_var=None
[Seed 1337] trials=2048 size=(20,45) min_pos=2 -> tried=1517 kept=0 best_var=None
[Seed 1337] trials=2048 size=(20,45) min_pos=1 -> tried=1517 kept=0 best_var=None
[Seed 1337] trials=4096 size=(25,40) min_pos=3 -> tried=3029 kept=0 best_var=None
[Seed 1337] trials=4096 size=(25,40) min_pos=2 -> tried=3029 kept=0 best_var=None
[Seed 1337] trials=4096 size=(25,40) min_pos=1 -> tried=3029 kept=0 best_var=None
[Seed 1337] trials=4096 size=(22,42) min_pos=3 -> tried=3029 kept=0 best_var=None
[Seed 1337] trials=4096 size=(22,42) min_pos=2 -> tried=3029 kept=0 best_var=None
[Seed 1337] trials=4096 size=(22,42) min_pos=1 -> tried=3029 kept=0 best_var=None
[Seed 1337] trials=4096 size=(20,45) min_pos=3 -> tried=3029 kept=0 best_var=None
[Seed 1337] trials=4096 size=(20,45) min_pos=2 -> tried=3029 kept=0 best_var=None
[Seed 1337] trials=4096 size=(20,45) min_pos=1 -> tried=3029 kept=0 best_var=None
[Seed 1337] No feasible assignment found under all relaxations.
Search total time: 2.5s
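
Every configuration above reports kept=0, which suggests the constraints may be infeasible rather than under-sampled. A necessary condition can be checked before any sampling: a station belongs to exactly one fold, so with min_pos >= 1 each class needs positives in at least n_folds distinct stations, and at least n_folds * min_pos positives overall. The sketch below assumes a per-station count matrix shaped like `st_labels` (stations x classes); the function name `infeasible_classes` and the toy data are hypothetical, not part of the notebook.

```python
import numpy as np

def infeasible_classes(st_label_counts, n_folds=5, min_pos=1):
    """Return class indices that cannot reach `min_pos` validation positives in every fold.

    st_label_counts: (n_stations, n_classes) positives per station and class.
    These are necessary conditions only; passing them does not guarantee feasibility.
    """
    if min_pos <= 0:
        return []
    counts = np.asarray(st_label_counts)
    total_pos = counts.sum(axis=0)                 # positives per class
    stations_with_pos = (counts > 0).sum(axis=0)   # stations holding >= 1 positive
    # Each fold needs min_pos positives, hence n_folds * min_pos overall,
    # and at least one positive-bearing station per fold.
    bad = (total_pos < n_folds * min_pos) | (stations_with_pos < n_folds)
    return np.flatnonzero(bad).tolist()

# Toy data: 6 stations, 3 classes (hypothetical, not the competition labels).
toy = np.array([
    [2, 0, 1],
    [1, 0, 1],
    [3, 4, 1],
    [1, 0, 1],
    [2, 0, 1],
    [1, 3, 0],
])
print(infeasible_classes(toy, n_folds=5, min_pos=1))
```

If a check like this flags any class on the real `st_labels`, no number of random trials can satisfy `violates`, and the per-class minimum would have to be dropped or applied only to sufficiently frequent classes.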


In [39]:
# View-level CB bags (GroupKFold-by-station) + Rank Stacking (Ridge) across views A (full) and B (adv-drop30)
import numpy as np, pandas as pd, time, re
from sklearn.model_selection import GroupKFold, StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import Ridge
from catboost import CatBoostClassifier, Pool
import lightgbm as lgb

def extract_station(stem):
    if not isinstance(stem, str): return None
    m = re.search(r'(PC\d+)', stem); return m.group(1) if m else None

def masked_auc_macro(y_true, y_pred, valid_mask):
    aucs = []
    for c in range(y_true.shape[1]):
        m = valid_mask[:, c].astype(bool)
        if m.sum() <= 1: continue
        yt = y_true[m, c]; yp = y_pred[m, c]
        if yt.max() == yt.min(): continue
        try: aucs.append(roc_auc_score(yt, yp))
        except Exception: pass
    return float(np.mean(aucs)) if len(aucs) else float('nan')

def rank_cols(mat):
    from scipy.stats import rankdata
    Nn, Cc = mat.shape
    out = np.zeros_like(mat, dtype=np.float32)
    for c in range(Cc):
        r = rankdata(mat[:, c], method='average')
        out[:, c] = (r - 1) / (Nn - 1) if Nn > 1 else r*0.0
    return out

# Fixed 5-fold GroupKFold by station
train_meta = train_df[['rec_id','filename']].copy()
train_meta['station'] = train_meta['filename'].map(extract_station).fillna('UNK')
groups = train_meta['station'].values
X_full = X_train_df.values.astype(np.float32)
T_full = X_test_df.values.astype(np.float32)
Y_full = Y_aligned.astype(np.float32)
N, C = X_full.shape[0], Y_full.shape[1]
gkf = GroupKFold(n_splits=5)
folds = list(gkf.split(X_full, Y_full, groups))
print('GroupKFold-by-station: folds=', len(folds), 'unique stations=', len(np.unique(groups)))

# ---- View A: Full features, 4 CB variants, equal-weight rank within view ----
variants_A = [
    dict(depth=3, learning_rate=0.02,  l2_leaf_reg=30.0, rsm=0.6, subsample=0.85, random_strength=0.8, seed=42),
    dict(depth=4, learning_rate=0.02,  l2_leaf_reg=40.0, rsm=0.5, subsample=0.80, random_strength=1.0, seed=123),
    dict(depth=3, learning_rate=0.025, l2_leaf_reg=25.0, rsm=0.6, subsample=0.85, random_strength=0.8, seed=456),
    dict(depth=4, learning_rate=0.015, l2_leaf_reg=50.0, rsm=0.5, subsample=0.75, random_strength=1.0, seed=789),
]

def train_view(X_tr, T_te, Y, folds, variants, label='A'):
    N, C = X_tr.shape[0], Y.shape[1]
    view_oofs = []
    view_masks = []
    view_tests = []
    for vi, v in enumerate(variants):
        print(f'[View {label}] Variant {vi+1}/{len(variants)}', flush=True)
        oof = np.zeros((N, C), dtype=np.float32)
        vmask = np.zeros((N, C), dtype=np.uint8)
        test_folds = []
        for fi, (tr_idx, va_idx) in enumerate(folds):
            t0 = time.time()
            Xtr, Xva = X_tr[tr_idx], X_tr[va_idx]
            ytr, yva = Y[tr_idx], Y[va_idx]
            te_pred = np.zeros((T_te.shape[0], C), dtype=np.float32)
            for c in range(C):
                ytr_c = ytr[:, c]; yva_c = yva[:, c]
                tr_pos = int((ytr_c == 1).sum()); tr_neg = int((ytr_c == 0).sum())
                va_pos = int((yva_c == 1).sum()); va_neg = int((yva_c == 0).sum())
                if not (tr_pos > 0 and tr_neg > 0 and va_pos > 0 and va_neg > 0):
                    continue
                tr_pool = Pool(Xtr, label=ytr_c)
                va_pool = Pool(Xva, label=yva_c)
                cb = CatBoostClassifier(
                    loss_function='Logloss',
                    depth=v['depth'], learning_rate=v['learning_rate'],
                    iterations=5000, l2_leaf_reg=v['l2_leaf_reg'],
                    rsm=v['rsm'], subsample=v['subsample'], random_strength=v['random_strength'],
                    auto_class_weights='Balanced', random_seed=v['seed'],
                    early_stopping_rounds=200, verbose=False, task_type='CPU', thread_count=-1
                )
                cb.fit(tr_pool, eval_set=va_pool)
                oof[va_idx, c] = cb.predict_proba(Xva)[:, 1]
                te_pred[:, c] = cb.predict_proba(T_te)[:, 1]
                vmask[va_idx, c] = 1
            fold_macro = masked_auc_macro(yva, oof[va_idx], vmask[va_idx])
            print(f'  Fold {fi}: masked macro={fold_macro:.4f} | n_val={len(va_idx)} | time={time.time()-t0:.1f}s', flush=True)
            test_folds.append(te_pred)
        view_oofs.append(oof)
        view_masks.append(vmask)
        view_tests.append(np.mean(test_folds, axis=0))
        macro_v = masked_auc_macro(Y, oof, vmask)
        print(f'[View {label}] Variant OOF masked macro: {macro_v:.4f}', flush=True)
    # intersection mask across variants
    inter_mask = view_masks[0].astype(bool)
    for m in view_masks[1:]: inter_mask &= m.astype(bool)
    inter_mask = inter_mask.astype(np.uint8)
    # rank-avg OOF within view
    ranked_oofs = []
    from scipy.stats import rankdata
    for o in view_oofs:
        Nn, Cc = o.shape
        R = np.full_like(o, np.nan, dtype=np.float32)
        for c in range(Cc):
            m = inter_mask[:, c].astype(bool)
            if not m.any(): continue
            col = o[m, c]; r = rankdata(col, method='average')
            R[m, c] = ((r - 1) / (len(r) - 1) if len(r) > 1 else r*0.0).astype(np.float32)
        ranked_oofs.append(R)
    view_oof_rank = np.nanmean(ranked_oofs, axis=0)
    # rank-avg test within view
    ranked_tests = [rank_cols(tp) for tp in view_tests]
    view_test_rank = np.mean(ranked_tests, axis=0)
    view_macro = masked_auc_macro(Y, view_oof_rank, inter_mask)
    print(f'[View {label}] View-level OOF masked macro: {view_macro:.4f}', flush=True)
    return view_oof_rank.astype(np.float32), inter_mask.astype(np.uint8), view_test_rank.astype(np.float32)

OA, MA, TA = train_view(X_full, T_full, Y_full, folds, variants_A, label='A')

# ---- View B: Adv-drop-30, 6 CB variants per guidance ----
print('[View B] Adversarial feature pruning (top-30) ...', flush=True)
X_tr_full_df = X_train_df.copy(); X_te_full_df = X_test_df.copy()
X_adv = pd.concat([X_tr_full_df.assign(is_test=0), X_te_full_df.assign(is_test=1)], axis=0).reset_index(drop=True)
y_adv = X_adv['is_test'].values.astype(int)
X_adv_mat = X_adv.drop(columns=['is_test']).values.astype(np.float32)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
feat_importance = np.zeros(X_adv_mat.shape[1], dtype=np.float64)
for fi, (tr, va) in enumerate(skf.split(X_adv_mat, y_adv)):
    Xtr, Xva = X_adv_mat[tr], X_adv_mat[va]
    ytr, yva = y_adv[tr], y_adv[va]
    clf = lgb.LGBMClassifier(objective='binary', n_estimators=800, learning_rate=0.05,
                             max_depth=3, num_leaves=7, subsample=0.8, colsample_bytree=0.6,
                             reg_alpha=0.5, reg_lambda=5.0, n_jobs=-1, random_state=42, verbosity=-1)
    clf.fit(Xtr, ytr, eval_set=[(Xva, yva)], eval_metric='auc',
            callbacks=[lgb.early_stopping(100, verbose=False), lgb.log_evaluation(period=0)])
    feat_importance += clf.booster_.feature_importance(importance_type='gain').astype(np.float64)
feat_importance /= max(1, skf.get_n_splits())
cols = X_tr_full_df.columns.tolist()
order = np.argsort(-feat_importance)
drop_top = 30
drop_cols = [cols[i] for i in order[:min(drop_top, len(cols))]]
keep_cols = [c for c in cols if c not in drop_cols]
print(f'[View B] Dropping top {len(drop_cols)} features; keeping {len(keep_cols)}', flush=True)
X_B = X_tr_full_df[keep_cols].values.astype(np.float32)
T_B = X_te_full_df[keep_cols].values.astype(np.float32)
variants_B = []
# depth=3, seeds=[42,123,456]
for s in [42,123,456]:
    variants_B.append(dict(depth=3, learning_rate=0.015, l2_leaf_reg=40.0, rsm=0.55, subsample=0.85, random_strength=0.8, seed=s))
# depth=4, seeds=[789,101,202]
for s in [789,101,202]:
    variants_B.append(dict(depth=4, learning_rate=0.01,  l2_leaf_reg=50.0, rsm=0.45, subsample=0.80, random_strength=1.0, seed=s))
OB, MB, TB = train_view(X_B, T_B, Y_full, folds, variants_B, label='B')

# ---- Rank Stacking (Ridge) across views OA, OB using folds from View A ----
print('[Stack] Rank stacking over views A,B with Ridge(alpha=15), fold-wise to avoid leakage', flush=True)
views_oof = [OA, OB]
views_test = [TA, TB]
inter_mask_views = (MA.astype(bool) & MB.astype(bool)).astype(np.uint8)

meta_oof = np.zeros_like(OA, dtype=np.float32)
test_meta_folds = []
for fi, (tr_idx, va_idx) in enumerate(folds):
    t0 = time.time()
    # build meta-train rows over intersection mask
    for c in range(C):
        m_full = inter_mask_views[:, c].astype(bool)
        m_tr = np.zeros(N, dtype=bool); m_tr[tr_idx] = True; m_tr &= m_full
        m_va = np.zeros(N, dtype=bool); m_va[va_idx] = True; m_va &= m_full
        if m_tr.sum() < 5 or m_va.sum() < 2:
            continue
        yt_tr = Y_full[m_tr, c]
        if yt_tr.max() == yt_tr.min():
            continue
        Xtr_meta = np.column_stack([v[m_tr, c] for v in views_oof]).astype(np.float32)
        Xva_meta = np.column_stack([v[m_va, c] for v in views_oof]).astype(np.float32)
        # Fit Ridge on ranks
        reg = Ridge(alpha=15.0, positive=False, random_state=42)
        try:
            reg.fit(Xtr_meta, yt_tr)
        except Exception:
            continue
        meta_oof[m_va, c] = reg.predict(Xva_meta).astype(np.float32)
    print(f'  Stack fold {fi} done in {time.time()-t0:.1f}s', flush=True)
    # test fold prediction: fit per-class on training rows and predict test ranks
    te_pred = np.zeros((T_full.shape[0], C), dtype=np.float32)
    for c in range(C):
        m_full = inter_mask_views[:, c].astype(bool)
        m_tr = np.zeros(N, dtype=bool); m_tr[tr_idx] = True; m_tr &= m_full
        if m_tr.sum() < 5: continue
        yt_tr = Y_full[m_tr, c]
        if yt_tr.max() == yt_tr.min():
            continue
        Xtr_meta = np.column_stack([v[m_tr, c] for v in views_oof]).astype(np.float32)
        Xte_meta = np.column_stack([vt[:, c] for vt in views_test]).astype(np.float32)
        reg = Ridge(alpha=15.0, positive=False, random_state=42)
        try:
            reg.fit(Xtr_meta, yt_tr)
            te_pred[:, c] = reg.predict(Xte_meta).astype(np.float32)
        except Exception:
            continue
    test_meta_folds.append(te_pred)

# Evaluate meta OOF on intersection mask
meta_macro = masked_auc_macro(Y_full, meta_oof, inter_mask_views)
print(f'[Stack] Meta OOF masked macro: {meta_macro:.4f}', flush=True)
test_meta = np.mean(test_meta_folds, axis=0)
test_meta_rank = rank_cols(test_meta.astype(np.float32))

# Build submission
test_id_order = sub_df[['Id','rec_id','class_id']].copy()
rec_to_idx = {rid:i for i, rid in enumerate(sorted(test_rec_ids))}
probs = []
for _, row in test_id_order.iterrows():
    rid = int(row['rec_id']); cid = int(row['class_id'])
    ti = rec_to_idx[rid]
    p = float(test_meta_rank[ti, cid])
    probs.append(min(max(p, 0.0), 1.0))
sub_out = pd.DataFrame({'Id': test_id_order['Id'].astype(int), 'Probability': np.round(probs, 6)})
sub_out.to_csv('submission.csv', index=False)
print(f"Wrote submission.csv with {len(sub_out)} rows (Rank-stacked views A+B)")
display(sub_out.head())

GroupKFold-by-station: folds= 5 unique stations= 13
[View A] Variant 1/4
  Fold 0: masked macro=0.7448 | n_val=31 | time=25.6s
  Fold 1: masked macro=0.7934 | n_val=27 | time=34.2s
  Fold 2: masked macro=0.7036 | n_val=26 | time=38.9s
  Fold 3: masked macro=0.6803 | n_val=31 | time=31.0s
  Fold 4: masked macro=0.7185 | n_val=30 | time=14.7s
[View A] Variant OOF masked macro: 0.7528
[View A] Variant 2/4
  Fold 0: masked macro=0.6899 | n_val=31 | time=31.9s
  Fold 1: masked macro=0.7857 | n_val=27 | time=52.7s
  Fold 2: masked macro=0.7296 | n_val=26 | time=68.9s
  Fold 3: masked macro=0.7026 | n_val=31 | time=42.7s
  Fold 4: masked macro=0.6444 | n_val=30 | time=19.9s
[View A] Variant OOF masked macro: 0.7158
[View A] Variant 3/4
  Fold 0: masked macro=0.7525 | n_val=31 | time=24.6s
  Fold 1: masked macro=0.7615 | n_val=27 | time=28.6s
  Fold 2: masked macro=0.6980 | n_val=26 | time=33.3s
  Fold 3: masked macro=0.7031 | n_val=31 | time=40.4s
  Fold 4: masked macro=0.7544 | n_val=30 | time=14.2s
[View A] Variant OOF masked macro: 0.7517
[View A] Variant 4/4
  Fold 0: masked macro=0.7208 | n_val=31 | time=35.0s
  Fold 1: masked macro=0.7406 | n_val=27 | time=60.7s
  Fold 2: masked macro=0.7077 | n_val=26 | time=69.8s
  Fold 3: masked macro=0.7301 | n_val=31 | time=51.7s
  Fold 4: masked macro=0.7148 | n_val=30 | time=25.4s
[View A] Variant OOF masked macro: 0.7346
[View A] View-level OOF masked macro: 0.7418
[View B] Adversarial feature pruning (top-30) ...
[View B] Dropping top 30 features; keeping 667
[View B] Variant 1/6
  Fold 0: masked macro=0.7063 | n_val=31 | time=24.8s
  Fold 1: masked macro=0.7604 | n_val=27 | time=37.5s
  Fold 2: masked macro=0.7070 | n_val=26 | time=36.7s
  Fold 3: masked macro=0.7190 | n_val=31 | time=24.2s
  Fold 4: masked macro=0.7141 | n_val=30 | time=14.6s
[View B] Variant OOF masked macro: 0.7334
[View B] Variant 2/6
  Fold 0: masked macro=0.7385 | n_val=31 | time=25.1s
  Fold 1: masked macro=0.7322 | n_val=27 | time=37.2s
  Fold 2: masked macro=0.6652 | n_val=26 | time=37.9s
  Fold 3: masked macro=0.7017 | n_val=31 | time=19.2s
  Fold 4: masked macro=0.7430 | n_val=30 | time=17.7s
[View B] Variant OOF masked macro: 0.7430
[View B] Variant 3/6
  Fold 0: masked macro=0.7208 | n_val=31 | time=25.3s
  Fold 1: masked macro=0.7306 | n_val=27 | time=39.0s
  Fold 2: masked macro=0.7077 | n_val=26 | time=36.8s
  Fold 3: masked macro=0.7200 | n_val=31 | time=28.4s
  Fold 4: masked macro=0.6820 | n_val=30 | time=17.5s
[View B] Variant OOF masked macro: 0.7162
[View B] Variant 4/6
  Fold 0: masked macro=0.7475 | n_val=31 | time=35.4s
  Fold 1: masked macro=0.7357 | n_val=27 | time=70.3s
  Fold 2: masked macro=0.6511 | n_val=26 | time=52.4s
  Fold 3: masked macro=0.7260 | n_val=31 | time=42.2s
  Fold 4: masked macro=0.6984 | n_val=30 | time=24.8s
[View B] Variant OOF masked macro: 0.7374
[View B] Variant 5/6
  Fold 0: masked macro=0.6936 | n_val=31 | time=35.3s
  Fold 1: masked macro=0.7588 | n_val=27 | time=65.4s
  Fold 2: masked macro=0.7067 | n_val=26 | time=48.1s
  Fold 3: masked macro=0.7651 | n_val=31 | time=43.2s
  Fold 4: masked macro=0.6914 | n_val=30 | time=24.4s
[View B] Variant OOF masked macro: 0.7422
[View B] Variant 6/6
  Fold 0: masked macro=0.7296 | n_val=31 | time=34.6s
  Fold 1: masked macro=0.7376 | n_val=27 | time=66.0s
  Fold 2: masked macro=0.6961 | n_val=26 | time=49.6s
  Fold 3: masked macro=0.7798 | n_val=31 | time=41.3s
  Fold 4: masked macro=0.7100 | n_val=30 | time=23.9s
[View B] Variant OOF masked macro: 0.7561
[View B] View-level OOF masked macro: 0.7453
[Stack] Rank stacking over views A,B with Ridge(alpha=15), fold-wise to avoid leakage
  Stack fold 0 done in 0.0s
  Stack fold 1 done in 0.0s
  Stack fold 2 done in 0.0s
  Stack fold 3 done in 0.0s
  Stack fold 4 done in 0.0s
[Stack] Meta OOF masked macro: 0.6573
Wrote submission.csv with 1216 rows (Rank-stacked views A+B)


,Id,Probability
0,100,0.246032
1,101,0.468254
2,102,0.404762
3,103,0.246032
4,104,0.373016
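
The "masked macro" figures in the log average per-class AUC only over (row, class) cells where a model was actually fitted, and skip any class whose masked labels are all 0 or all 1. A minimal sketch of that rule on toy data follows. It assumes a small pairwise AUC helper in plain NumPy (`auc_pairwise`, hypothetical) in place of `roc_auc_score`, so the snippet has no scikit-learn dependency; the numbers are illustrative only.

```python
import numpy as np

def auc_pairwise(y, p):
    """Pairwise (Mann-Whitney) AUC; tied scores count as half a win."""
    pos, neg = p[y == 1], p[y == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

def masked_macro_auc(y_true, y_pred, mask):
    """Mean per-class AUC over masked rows; single-label classes are skipped."""
    aucs = []
    for c in range(y_true.shape[1]):
        m = mask[:, c].astype(bool)
        yt, yp = y_true[m, c], y_pred[m, c]
        if yt.size < 2 or yt.min() == yt.max():
            continue  # AUC is undefined without both labels present
        aucs.append(auc_pairwise(yt, yp))
    return float(np.mean(aucs)) if aucs else float("nan")

y_true = np.array([[0, 0, 1], [1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[0.1, 0.6, 0.3], [0.9, 0.2, 0.4], [0.2, 0.5, 0.8], [0.8, 0.9, 0.7]])
mask = np.array([[1, 1, 1], [1, 1, 1], [1, 1, 0], [1, 1, 0]], dtype=np.uint8)

# Class 0 scores 1.0, class 1 scores 0.75, class 2 is skipped (masked labels are all 1).
print(masked_macro_auc(y_true, y_pred, mask))
```

Because skipped classes drop out of the mean, two runs with different masks are averaging over different class sets, which is worth keeping in mind when comparing the per-variant, view-level, and stacked scores.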


In [40]:
# Fallback: rank-blend views A (full) and B (adv-drop30) with OOF-tuned weight; write submission
import numpy as np, pandas as pd
from scipy.stats import rankdata

def rank_cols(mat):
    N, C = mat.shape
    out = np.zeros_like(mat, dtype=np.float32)
    for c in range(C):
        r = rankdata(mat[:, c], method='average')
        out[:, c] = (r - 1) / (N - 1) if N > 1 else r*0.0
    return out

assert 'OA' in globals() and 'OB' in globals() and 'MA' in globals() and 'MB' in globals(), 'Run cell 30 to compute view ranks (OA/OB) and masks (MA/MB) first.'
assert 'TA' in globals() and 'TB' in globals(), 'Run cell 30 to compute view test ranks (TA/TB) first.'

# Intersection mask
inter_mask = (MA.astype(bool) & MB.astype(bool)).astype(np.uint8)

# Grid search weight for View B (adv-drop) on masked OOF
from sklearn.metrics import roc_auc_score
best_wb, best_macro = 0.5, -1.0
C = OA.shape[1]
for wb in np.linspace(0.0, 1.0, 21):
    blend_oof = (1.0 - wb) * OA + wb * OB
    # compute masked macro AUC
    aucs = []
    for c in range(C):
        m = inter_mask[:, c].astype(bool)
        if not m.any():
            continue
        yt = Y_aligned[m, c]
        yp = blend_oof[m, c]
        if yt.max() == yt.min():
            continue
        try:
            aucs.append(roc_auc_score(yt, yp))
        except Exception:
            pass
    macro = float(np.mean(aucs)) if len(aucs) else float('nan')
    if macro > best_macro:
        best_macro = macro
        best_wb = float(wb)
print(f'[Blend A+B] Best OOF masked macro={best_macro:.4f} at wB={best_wb:.2f}')

# Blend test ranks with best weight
test_blend_rank = (1.0 - best_wb) * TA + best_wb * TB

# Build submission
test_id_order = sub_df[['Id','rec_id','class_id']].copy()
rec_to_idx = {rid:i for i, rid in enumerate(sorted(test_rec_ids))}
probs = []
for _, row in test_id_order.iterrows():
    rid = int(row['rec_id']); cid = int(row['class_id'])
    ti = rec_to_idx[rid]
    p = float(test_blend_rank[ti, cid])
    probs.append(min(max(p, 0.0), 1.0))
sub_out = pd.DataFrame({'Id': test_id_order['Id'].astype(int), 'Probability': np.round(probs, 6)})
sub_out.to_csv('submission.csv', index=False)
print(f"Wrote submission.csv with {len(sub_out)} rows (Rank-blend Views A+B, wB={best_wb:.2f})")
display(sub_out.head())

[Blend A+B] Best OOF masked macro=0.7476 at wB=0.80
Wrote submission.csv with 1216 rows (Rank-blend Views A+B, wB=0.80)


,Id,Probability
0,100,0.246032
1,101,0.45291
2,102,0.405556
3,103,0.69246
4,104,0.385979
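
Because the weight grid includes wB=0 and wB=1, the tuned blend can never score below either view alone on the same OOF data, so a small gain (0.7476 here against view-level scores of 0.7418 and 0.7453) may be partly selection noise. One way to judge this is to look at the whole AUC-versus-weight curve rather than only its maximum. The sketch below does that for a single class on synthetic scores; `auc`, `to_rank01`, and `blend_curve` are hypothetical helpers and the data are random, not the notebook's OOF ranks.

```python
import numpy as np

def auc(y, p):
    """Pairwise (Mann-Whitney) AUC; tied scores count as half a win."""
    pos, neg = p[y == 1], p[y == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

def to_rank01(p):
    """Scale ranks to [0, 1]; assumes continuous scores with no ties."""
    return p.argsort().argsort() / (len(p) - 1)

def blend_curve(y, ra, rb, weights):
    """AUC of the rank blend (1 - w) * ra + w * rb for each weight w."""
    return [(float(w), auc(y, (1.0 - w) * ra + w * rb)) for w in weights]

rng = np.random.default_rng(0)
y = np.array([0] * 20 + [1] * 10)
ra = to_rank01(y + rng.normal(0.0, 1.0, size=y.size))  # synthetic "view A"
rb = to_rank01(y + rng.normal(0.0, 1.0, size=y.size))  # synthetic "view B"

curve = blend_curve(y, ra, rb, np.linspace(0.0, 1.0, 21))
best_w, best_auc = max(curve, key=lambda t: t[1])
print(f"A only={curve[0][1]:.4f}  B only={curve[-1][1]:.4f}  best={best_auc:.4f} at wB={best_w:.2f}")
```

A flat curve around the optimum would suggest that a fixed, untuned weight is about as defensible as the grid-searched one.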


In [41]:
# 3-view rank blend per expert fallback: View A (TA), View B (TB), View C (station-search test_ens_st)
import numpy as np, pandas as pd
from scipy.stats import rankdata

def ensure_rank(mat):
    # If input falls outside [0,1], convert to per-column ranks.
    # Caveat: probabilities already inside [0,1] pass through unranked.
    if mat.min() < 0 or mat.max() > 1.0 + 1e-6:
        N, C = mat.shape
        out = np.zeros_like(mat, dtype=np.float32)
        for c in range(C):
            r = rankdata(mat[:, c], method='average')
            out[:, c] = (r - 1) / (N - 1) if N > 1 else r*0.0
        return out.astype(np.float32)
    return mat.astype(np.float32)

assert 'TA' in globals() and 'TB' in globals(), 'Run cell 30 to compute TA/TB first.'
assert 'test_ens_st' in globals(), 'Run cell 27 to compute station-search view (test_ens_st) first.'

RA = ensure_rank(TA)
RB = ensure_rank(TB)
RC = ensure_rank(test_ens_st)

# Default weights: wA=0.2 (full), wB=0.4 (adv-drop30), wC=0.4 (station-search)
wA, wB, wC = 0.2, 0.4, 0.4
test_blend_rank_3v = wA * RA + wB * RB + wC * RC

# Build submission
test_id_order = sub_df[['Id','rec_id','class_id']].copy()
rec_to_idx = {rid:i for i, rid in enumerate(sorted(test_rec_ids))}
probs = []
for _, row in test_id_order.iterrows():
    rid = int(row['rec_id']); cid = int(row['class_id'])
    ti = rec_to_idx[rid]
    p = float(test_blend_rank_3v[ti, cid])
    probs.append(min(max(p, 0.0), 1.0))
sub_out = pd.DataFrame({'Id': test_id_order['Id'].astype(int), 'Probability': np.round(probs, 6)})
sub_out.to_csv('submission.csv', index=False)
print(f"Wrote submission.csv with {len(sub_out)} rows (3-view rank blend: A=0.2, B=0.4, C=0.4)")
display(sub_out.head())

Wrote submission.csv with 1216 rows (3-view rank blend: A=0.2, B=0.4, C=0.4)


,Id,Probability
0,100,0.246032
1,101,0.455026
2,102,0.419841
3,103,0.686111
4,104,0.388624
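
`ensure_rank` decides whether to rank by checking the value range, so a matrix of raw probabilities that already lies in [0, 1] is returned unchanged and then blended with true ranks on a different scale. Whether that applies to `test_ens_st` depends on how cell 27 produced it, which is not shown here. The sketch below shows an unconditional alternative under that assumption; `avg_rank01`, `always_rank`, and `looks_unranked` are hypothetical names, and the average-rank logic is written in plain NumPy rather than with `scipy.stats.rankdata`.

```python
import numpy as np

def avg_rank01(col):
    """Average-rank a 1-D array and scale to [0, 1]; ties share the mean rank."""
    col = np.asarray(col, dtype=np.float64)
    _, inv, counts = np.unique(col, return_inverse=True, return_counts=True)
    ends = np.cumsum(counts)          # last 1-based rank of each tie group
    starts = ends - counts + 1        # first 1-based rank of each tie group
    ranks = ((starts + ends) / 2.0)[inv]
    if len(col) < 2:
        return np.zeros_like(ranks)
    return (ranks - 1.0) / (len(col) - 1.0)

def always_rank(mat):
    """Rank every column regardless of the input range."""
    mat = np.asarray(mat, dtype=np.float64)
    cols = [avg_rank01(mat[:, c]) for c in range(mat.shape[1])]
    return np.column_stack(cols).astype(np.float32)

def looks_unranked(mat):
    """The range test used by ensure_rank."""
    return bool(mat.min() < 0 or mat.max() > 1.0 + 1e-6)

# Low-valued probabilities (illustrative only): inside [0, 1], so the range test does not fire.
probs = np.array([[0.02, 0.30], [0.10, 0.05], [0.25, 0.10], [0.10, 0.20]], dtype=np.float32)
print(looks_unranked(probs))
ranked = always_rank(probs)
print(ranked.min(axis=0), ranked.max(axis=0))
```

Ranking a matrix that is already a single rank vector per column is harmless, so applying `always_rank` to every view is the simpler contract; note that averaged ranks such as `TA` would be re-spread to the full [0, 1] range by it.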


In [42]:
# 3-view OOF-tuned rank blend (Views A:OA/TA, B:OB/TB, C:oof_ens/test_ens_st) and submission
import numpy as np, pandas as pd
from sklearn.metrics import roc_auc_score
from scipy.stats import rankdata

def masked_auc_macro(y_true, y_pred, valid_mask):
    aucs = []
    C = y_true.shape[1]
    for c in range(C):
        m = valid_mask[:, c].astype(bool)
        if m.sum() <= 1:
            continue
        yt = y_true[m, c]; yp = y_pred[m, c]
        if yt.max() == yt.min():
            continue
        try:
            aucs.append(roc_auc_score(yt, yp))
        except Exception:
            pass
    return float(np.mean(aucs)) if len(aucs) else float('nan')

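def ensure_rank(mat):
    # Rank per column only when values fall outside [0,1]; probabilities in [0,1] pass through unranked.
    if mat.min() < 0 or mat.max() > 1.0 + 1e-6: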
        N, C = mat.shape
        out = np.zeros_like(mat, dtype=np.float32)
        for c in range(C):
            r = rankdata(mat[:, c], method='average')
            out[:, c] = (r - 1) / (N - 1) if N > 1 else r*0.0
        return out.astype(np.float32)
    return mat.astype(np.float32)

assert 'OA' in globals() and 'OB' in globals() and 'MA' in globals() and 'MB' in globals(), 'Run cell 30 first to get OA/OB/MA/MB/TA/TB.'
assert 'TA' in globals() and 'TB' in globals(), 'Run cell 30 first to get TA/TB.'
assert 'oof_ens' in globals() and 'test_ens_st' in globals(), 'Run cell 27 first to get oof_ens and test_ens_st.'

# View oofs and masks
OA_r = ensure_rank(OA)
OB_r = ensure_rank(OB)
OC_r = ensure_rank(oof_ens.astype(np.float32))
MC = np.isfinite(OC_r).astype(np.uint8)  # mask where OC exists

# Intersection mask across all three views
inter_mask_abc = (MA.astype(bool) & MB.astype(bool) & MC.astype(bool)).astype(np.uint8)

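# Grid search weights wB, wC (wA = 1 - wB - wC)
# best_wA is initialized too, so the print below cannot raise NameError if no grid point beats -1.0
best_wA, best_wB, best_wC, best_macro = 0.2, 0.4, 0.4, -1.0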
grid = np.linspace(0.0, 1.0, 21)
for wB in grid:
    for wC in grid:
        wA = 1.0 - wB - wC
        if wA < 0 or wA > 1:
            continue
        blend_oof = wA * OA_r + wB * OB_r + wC * OC_r
        macro = masked_auc_macro(Y_aligned, blend_oof, inter_mask_abc)
        if macro > best_macro:
            best_macro = macro; best_wB = float(wB); best_wC = float(wC); best_wA = float(wA)
print(f'[3-view blend] Best masked OOF macro={best_macro:.4f} at weights A={best_wA:.2f}, B={best_wB:.2f}, C={best_wC:.2f}')

# Blend test ranks with best weights
RA = ensure_rank(TA); RB = ensure_rank(TB); RC = ensure_rank(test_ens_st.astype(np.float32))
test_blend_rank_3v_opt = best_wA * RA + best_wB * RB + best_wC * RC

# Build submission
test_id_order = sub_df[['Id','rec_id','class_id']].copy()
rec_to_idx = {rid:i for i, rid in enumerate(sorted(test_rec_ids))}
probs = []
for _, row in test_id_order.iterrows():
    rid = int(row['rec_id']); cid = int(row['class_id'])
    ti = rec_to_idx[rid]
    p = float(test_blend_rank_3v_opt[ti, cid])
    probs.append(min(max(p, 0.0), 1.0))
sub_out = pd.DataFrame({'Id': test_id_order['Id'].astype(int), 'Probability': np.round(probs, 6)})
sub_out.to_csv('submission.csv', index=False)
print(f"Wrote submission.csv with {len(sub_out)} rows (3-view OOF-tuned rank blend: A={best_wA:.2f}, B={best_wB:.2f}, C={best_wC:.2f})")
display(sub_out.head())

[3-view blend] Best masked OOF macro=0.7652 at weights A=0.50, B=0.05, C=0.45
Wrote submission.csv with 1216 rows (3-view OOF-tuned rank blend: A=0.50, B=0.05, C=0.45)
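
The three-weight search derives wA as `1.0 - wB - wC` from two float grids. For combinations that should sum exactly to 1, floating-point subtraction can leave wA a hair below zero, and the `wA < 0` check would then skip a valid boundary point. A sketch of an alternative that enumerates the simplex with integer steps follows; `simplex_grid` is a hypothetical helper, not part of the notebook.

```python
def simplex_grid(steps=20):
    """Yield (wA, wB, wC) on a grid of spacing 1/steps, each >= 0 and summing to 1."""
    for b in range(steps + 1):
        for c in range(steps + 1 - b):
            a = steps - b - c  # integer remainder, never negative
            yield a / steps, b / steps, c / steps

weights = list(simplex_grid(20))
print(len(weights))  # 231 combinations for steps=20
print(weights[0], weights[-1])
```

Each triple can be passed straight into the blend (`wA * OA_r + wB * OB_r + wC * OC_r`) without a feasibility check, since every generated point is already on the simplex.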


In [43]:
# Pseudo-labeling with high-confidence test ranks + GroupKFold-by-station CB bag (augment train folds only), rank-ensemble submission
import re, time, numpy as np, pandas as pd
from sklearn.model_selection import GroupKFold
from sklearn.metrics import roc_auc_score
from catboost import CatBoostClassifier, Pool
from scipy.stats import rankdata

def extract_station(stem):
    if not isinstance(stem, str):
        return None
    m = re.search(r'(PC\d+)', stem)
    return m.group(1) if m else None

# Use best available test ranks for pseudo-labeling
assert 'test_blend_rank_3v_opt' in globals() or 'test_blend_rank' in globals() or 'test_ens' in globals(), 'Need test rank predictions (run cells 30/31/27).'
if 'test_blend_rank_3v_opt' in globals():
    T_rank = test_blend_rank_3v_opt.astype(np.float32)
elif 'test_blend_rank' in globals():
    T_rank = test_blend_rank.astype(np.float32)
else:
    T_rank = test_ens.astype(np.float32)

# Build test rec_id index map
test_id_order = sub_df[['Id','rec_id','class_id']].copy()
test_rec_sorted = pd.Index(sorted(test_rec_ids))
rec_to_idx = {rid:i for i, rid in enumerate(test_rec_sorted)}

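# Pseudo-labels: select top/bottom rank thresholds per class.
# Thresholds apply to per-class ranks, not calibrated probabilities, so a class can
# receive pseudo-positives even if it never occurs in the test set.
thr_pos, thr_neg = 0.98, 0.02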
n_te, C = T_rank.shape
Y_pseudo = np.full((n_te, C), np.nan, dtype=np.float32)
pos_ct = 0; neg_ct = 0
for c in range(C):
    col = T_rank[:, c]
    pos_mask = col >= thr_pos
    neg_mask = col <= thr_neg
    Y_pseudo[pos_mask, c] = 1.0; pos_ct += int(pos_mask.sum())
    Y_pseudo[neg_mask, c] = 0.0; neg_ct += int(neg_mask.sum())
print(f'Pseudo labels selected: pos={pos_ct}, neg={neg_ct} across {C} classes (thr_pos={thr_pos}, thr_neg={thr_neg})', flush=True)

# GroupKFold-by-station on original train; augment only the training split with pseudo-labeled test rows (class-wise)
train_meta = train_df[['rec_id','filename']].copy()
train_meta['station'] = train_meta['filename'].map(extract_station).fillna('UNK')
groups = train_meta['station'].values

X_tr_full = X_train_df.values.astype(np.float32)
X_te_full = X_test_df.values.astype(np.float32)
Y_full = Y_aligned.astype(np.float32)
N = X_tr_full.shape[0]

gkf = GroupKFold(n_splits=5)
folds = list(gkf.split(X_tr_full, Y_full, groups))
print('GroupKFold by station -> n_folds:', len(folds), 'unique stations:', len(np.unique(groups)))

def masked_auc_macro(y_true, y_pred, valid_mask):
    aucs = []
    for c in range(y_true.shape[1]):
        m = valid_mask[:, c].astype(bool)
        if m.sum() <= 1:
            continue
        yt = y_true[m, c]; yp = y_pred[m, c]
        if yt.max() == yt.min():
            continue
        try:
            aucs.append(roc_auc_score(yt, yp))
        except Exception:
            pass
    return float(np.mean(aucs)) if len(aucs) else float('nan')

def rank_cols(mat):
    Nn, Cc = mat.shape
    out = np.zeros_like(mat, dtype=np.float32)
    for c in range(Cc):
        r = rankdata(mat[:, c], method='average')
        out[:, c] = (r - 1) / (Nn - 1) if Nn > 1 else r*0.0
    return out

# Two diverse CB variants to start (can add more later)
variants = [
    dict(name='cb_pl_d3_lr0015_l2_45_rsm06_sub085_s42', depth=3, learning_rate=0.015, l2_leaf_reg=45.0, rsm=0.6, subsample=0.85, random_strength=0.8, seed=42),
    dict(name='cb_pl_d4_lr0015_l2_55_rsm05_sub08_s123', depth=4, learning_rate=0.015, l2_leaf_reg=55.0, rsm=0.5, subsample=0.80, random_strength=1.0, seed=123),
]

test_pred_list = []
oof_list = []
mask_list = []

for v in variants:
    print(f"=== Variant {v['name']} (pseudo-augmented folds) ===", flush=True)
    oof = np.zeros((N, C), dtype=np.float32)
    vmask = np.zeros((N, C), dtype=np.uint8)
    test_folds = []
    for fi, (tr_idx, va_idx) in enumerate(folds):
        t0 = time.time()
        Xtr, Xva = X_tr_full[tr_idx], X_tr_full[va_idx]
        ytr, yva = Y_full[tr_idx], Y_full[va_idx]
        te_pred = np.zeros((X_te_full.shape[0], C), dtype=np.float32)
        for c in range(C):
            # Build class-wise augmented training data: original train split + pseudo rows with labels for class c
            ytr_c = ytr[:, c]; yva_c = yva[:, c]
            valid_val = (yva_c.max() != yva_c.min())
            tr_pos = int((ytr_c == 1).sum()); tr_neg = int((ytr_c == 0).sum())
            if not (tr_pos > 0 and tr_neg > 0) or not valid_val:
                continue
            # Select pseudo rows for class c
            pseudo_mask_c = np.isfinite(Y_pseudo[:, c])
            X_pseudo_c = X_te_full[pseudo_mask_c]
            y_pseudo_c = Y_pseudo[pseudo_mask_c, c].astype(np.float32)
            # Concatenate
            if X_pseudo_c.shape[0] > 0:
                Xtr_c = np.concatenate([Xtr, X_pseudo_c], axis=0)
                ytr_c_aug = np.concatenate([ytr_c, y_pseudo_c], axis=0)
            else:
                Xtr_c = Xtr
                ytr_c_aug = ytr_c
            tr_pool = Pool(Xtr_c, label=ytr_c_aug)
            va_pool = Pool(Xva, label=yva_c)
            cb = CatBoostClassifier(
                loss_function='Logloss',
                depth=v['depth'],
                learning_rate=v['learning_rate'],
                iterations=5000,
                l2_leaf_reg=v['l2_leaf_reg'],
                rsm=v['rsm'],
                subsample=v['subsample'],
                random_strength=v['random_strength'],
                auto_class_weights='Balanced',
                random_seed=v['seed'],
                early_stopping_rounds=200,
                verbose=False,
                task_type='CPU',
                thread_count=-1
            )
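            # Note: eval_set drives early stopping, so the AUC reported on this same validation fold is optimistically biased.
            cb.fit(tr_pool, eval_set=va_pool)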
            oof[va_idx, c] = cb.predict_proba(Xva)[:, 1]
            te_pred[:, c] = cb.predict_proba(X_te_full)[:, 1]
            vmask[va_idx, c] = 1
        fold_macro = masked_auc_macro(yva, oof[va_idx], vmask[va_idx])
        print(f"  Fold {fi}: masked macro={fold_macro:.4f} | n_val={len(va_idx)} | time={time.time()-t0:.1f}s", flush=True)
        test_folds.append(te_pred)
    test_mean = np.mean(test_folds, axis=0)
    oof_list.append(oof); mask_list.append(vmask); test_pred_list.append(test_mean)
    macro_v = masked_auc_macro(Y_full, oof, vmask)
    print(f"Variant OOF masked macro: {macro_v:.4f}", flush=True)

# Rank-ensemble across variants
inter_mask = mask_list[0].astype(bool)
for m in mask_list[1:]:
    inter_mask &= m.astype(bool)
inter_mask = inter_mask.astype(np.uint8)

ranked_oofs = []
for o in oof_list:
    Nn, Cc = o.shape
    R = np.full_like(o, np.nan, dtype=np.float32)
    for c in range(Cc):
        m = inter_mask[:, c].astype(bool)
        if not m.any():
            continue
        col = o[m, c]
        r = rankdata(col, method='average')
        R[m, c] = ((r - 1) / (len(r) - 1) if len(r) > 1 else r*0.0).astype(np.float32)
    ranked_oofs.append(R)
oof_ens_pl = np.nanmean(ranked_oofs, axis=0)
macro_ens_pl = masked_auc_macro(Y_full, oof_ens_pl, inter_mask)
print(f'Pseudo-label ensemble (rank OOF) masked macro: {macro_ens_pl:.4f}', flush=True)

ranked_tests = [rank_cols(tp) for tp in test_pred_list]
test_ens_pl = np.mean(ranked_tests, axis=0)

# Build submission from pseudo-label ensemble ranks
probs = []
for _, row in test_id_order.iterrows():
    rid = int(row['rec_id']); cid = int(row['class_id'])
    ti = rec_to_idx[rid]
    p = float(test_ens_pl[ti, cid])
    probs.append(min(max(p, 0.0), 1.0))
sub_out = pd.DataFrame({'Id': test_id_order['Id'].astype(int), 'Probability': np.round(probs, 6)})
sub_out.to_csv('submission.csv', index=False)
print(f"Wrote submission.csv with {len(sub_out)} rows (Pseudo-label GroupKFold-by-station CB bag, rank ensemble)")
display(sub_out.head())

Pseudo labels selected: pos=18, neg=16 across 19 classes (thr_pos=0.98, thr_neg=0.02)
GroupKFold by station -> n_folds: 5 unique stations: 13
=== Variant cb_pl_d3_lr0015_l2_45_rsm06_sub085_s42 (pseudo-augmented folds) ===
  Fold 0: masked macro=0.7805 | n_val=31 | time=40.9s
  Fold 1: masked macro=0.7724 | n_val=27 | time=42.4s
  Fold 2: masked macro=0.7069 | n_val=26 | time=40.7s
  Fold 3: masked macro=0.7231 | n_val=31 | time=37.5s
  Fold 4: masked macro=0.7477 | n_val=30 | time=21.4s
Variant OOF masked macro: 0.7683
=== Variant cb_pl_d4_lr0015_l2_55_rsm05_sub08_s123 (pseudo-augmented folds) ===
  Fold 0: masked macro=0.6815 | n_val=31 | time=51.7s
  Fold 1: masked macro=0.7744 | n_val=27 | time=71.9s
  Fold 2: masked macro=0.6850 | n_val=26 | time=70.3s
  Fold 3: masked macro=0.7146 | n_val=31 | time=53.4s
  Fold 4: masked macro=0.7062 | n_val=30 | time=38.0s
Variant OOF masked macro: 0.7291
Pseudo-label ensemble (rank OOF) masked macro: 0.7470
Wrote submission.csv with 1216 rows (Pseudo-label GroupKFold-by-station CB bag, rank ensemble)

    Id  Probability
0  100     0.246032
1  101     0.460317
2  102     0.436508
3  103     0.726190
4  104     0.380952
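
The rank ensemble in the cell above maps each model's per-class scores to [0, 1] ranks before averaging, so a model with compressed probabilities counts as much as one with widely spread probabilities. A minimal sketch of that idea on made-up scores (the two toy arrays are assumptions for illustration, not notebook data):

```python
import numpy as np
from scipy.stats import rankdata

def rank_cols(mat):
    """Replace each column by its average ranks, rescaled to [0, 1]."""
    n_rows, n_cols = mat.shape
    out = np.zeros((n_rows, n_cols), dtype=np.float64)
    for c in range(n_cols):
        r = rankdata(mat[:, c], method="average")
        out[:, c] = (r - 1) / (n_rows - 1) if n_rows > 1 else 0.0
    return out

# Made-up scores from two models for 4 recordings and 1 class.
model_a = np.array([[0.10], [0.20], [0.90], [0.95]])  # widely spread probabilities
model_b = np.array([[0.49], [0.51], [0.50], [0.52]])  # compressed probabilities

# After ranking, both models live on the same [0, 1] scale before averaging.
blend = (rank_cols(model_a) + rank_cols(model_b)) / 2.0
print(blend.ravel())
```

The two models disagree on the order of the middle two recordings, so their averaged ranks tie at 0.5. Averaging the raw probabilities instead would have let model_a decide that order almost alone.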


In [44]:
# Final 4-view blend: OOF-tuned combination of 3-view blend and pseudo-label ensemble
import numpy as np, pandas as pd
from sklearn.metrics import roc_auc_score
from scipy.stats import rankdata

def masked_auc_macro(y_true, y_pred, valid_mask):
    aucs = []
    C = y_true.shape[1]
    for c in range(C):
        m = valid_mask[:, c].astype(bool)
        if m.sum() <= 1:
            continue
        yt = y_true[m, c]; yp = y_pred[m, c]
        if yt.max() == yt.min():
            continue
        try:
            aucs.append(roc_auc_score(yt, yp))
        except Exception:
            pass
    return float(np.mean(aucs)) if len(aucs) else float('nan')

assert 'OA_r' in globals() and 'OB_r' in globals() and 'OC_r' in globals() and 'inter_mask_abc' in globals(), 'Run cell 33 first to get 3-view OOF ranks and mask.'
assert 'best_wA' in globals() and 'best_wB' in globals() and 'best_wC' in globals(), 'Weights from cell 33 not found.'
assert 'oof_ens_pl' in globals(), 'Run cell 34 first to compute pseudo-label OOF ensemble.'
assert 'test_blend_rank_3v_opt' in globals() and 'test_ens_pl' in globals(), 'Need test ranks from 3-view and pseudo-label models.'

# Build 3-view OOF blend
oof_3v = best_wA * OA_r + best_wB * OB_r + best_wC * OC_r

# Combined intersection mask
mask_pl = np.isfinite(oof_ens_pl).astype(np.uint8)
inter_mask_comb = (inter_mask_abc.astype(bool) & mask_pl.astype(bool)).astype(np.uint8)

# Tune weight for pseudo-label ensemble
best_wp, best_macro = 0.0, -1.0
for wp in np.linspace(0.0, 1.0, 21):
    blend_oof = (1.0 - wp) * oof_3v + wp * oof_ens_pl
    macro = masked_auc_macro(Y_aligned, blend_oof, inter_mask_comb)
    if macro > best_macro:
        best_macro = macro; best_wp = float(wp)
print(f'[4-view blend] Best masked OOF macro={best_macro:.4f} at pseudo weight={best_wp:.2f}', flush=True)

# Apply best weight to test predictions
test_final = (1.0 - best_wp) * test_blend_rank_3v_opt + best_wp * test_ens_pl

# Write submission
test_id_order = sub_df[['Id','rec_id','class_id']].copy()
rec_to_idx = {rid:i for i, rid in enumerate(sorted(test_rec_ids))}
probs = []
for _, row in test_id_order.iterrows():
    rid = int(row['rec_id']); cid = int(row['class_id'])
    ti = rec_to_idx[rid]
    p = float(test_final[ti, cid])
    probs.append(min(max(p, 0.0), 1.0))
sub_out = pd.DataFrame({'Id': test_id_order['Id'].astype(int), 'Probability': np.round(probs, 6)})
sub_out.to_csv('submission.csv', index=False)
print(f'Wrote submission.csv with {len(sub_out)} rows (4-view: 3-view blend + pseudo-label ensemble, wp={best_wp:.2f})')
display(sub_out.head())

[4-view blend] Best masked OOF macro=0.7652 at pseudo weight=0.00
Wrote submission.csv with 1216 rows (4-view: 3-view blend + pseudo-label ensemble, wp=0.00)

    Id  Probability
0  100     0.246032
1  101     0.464021
2  102     0.422817
3  103     0.700198
4  104     0.388558
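
The weight search in the cell above uses a strict `>` comparison, so a reported weight of 0.00 means no grid point scored strictly higher than the 3-view blend alone. A minimal sketch of the same one-dimensional search on toy data, assuming a single class and no mask (`pairwise_auc` is a hand-rolled stand-in for `roc_auc_score`, and `tune_blend_weight` is a hypothetical helper name):

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

def tune_blend_weight(y_true, base, extra, grid):
    """Return (best_weight, best_auc) for blend = (1 - w) * base + w * extra."""
    best_w, best_auc = 0.0, -1.0
    for w in grid:
        auc = pairwise_auc(y_true, (1.0 - w) * base + w * extra)
        if auc > best_auc:  # strict '>' keeps the smallest weight among ties
            best_auc, best_w = auc, float(w)
    return best_w, best_auc

# Toy data: 'base' misorders one pair, 'extra' separates the classes perfectly.
y = np.array([0, 0, 1, 1])
base = np.array([0.10, 0.40, 0.35, 0.80])
extra = np.array([0.20, 0.10, 0.60, 0.90])

best_w, best_auc = tune_blend_weight(y, base, extra, np.linspace(0.0, 1.0, 21))
print(best_w, best_auc)
```

In this toy case the search stops at the smallest weight that fixes the misordered pair, even though every larger weight scores the same.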


In [45]:
# Pseudo-labeling V2: consensus 2/3 views, class-conditional thresholds, hard labels with confidence weights, 5-fold GroupKFold-by-station CB bag, 4-view reblend
import numpy as np, pandas as pd, re, time
from sklearn.metrics import roc_auc_score, precision_score
from sklearn.model_selection import GroupKFold
from catboost import CatBoostClassifier, Pool
from scipy.stats import rankdata

def extract_station(stem):
    if not isinstance(stem, str): return None
    m = re.search(r'(PC\d+)', stem); return m.group(1) if m else None

def masked_auc_macro(y_true, y_pred, valid_mask):
    aucs = []
    C = y_true.shape[1]
    for c in range(C):
        m = valid_mask[:, c].astype(bool)
        if m.sum() <= 1: continue
        yt = y_true[m, c]; yp = y_pred[m, c]
        if yt.max() == yt.min(): continue
        try: aucs.append(roc_auc_score(yt, yp))
        except Exception: pass
    return float(np.mean(aucs)) if len(aucs) else float('nan')

def rank_cols(mat):
    N, C = mat.shape
    out = np.zeros_like(mat, dtype=np.float32)
    for c in range(C):
        r = rankdata(mat[:, c], method='average')
        out[:, c] = (r - 1) / (N - 1) if N > 1 else r*0.0
    return out.astype(np.float32)

# Preconditions
assert 'OA_r' in globals() and 'OB_r' in globals() and 'OC_r' in globals() and 'inter_mask_abc' in globals(), 'Run cell 33 first to get OOF ranks/mask.'
assert 'TA' in globals() and 'TB' in globals() and 'test_ens_st' in globals(), 'Run cells 30 and 27 first to get test view ranks.'
assert 'test_blend_rank_3v_opt' in globals(), 'Run cell 33 to get best 3-view test ranks.'

# Per-class positive threshold t_pos[c]: the smallest grid threshold whose precision is >= 0.90 on the OOF intersection mask,
# scored on the OOF rank averaged across the three views. Classes with no qualifying threshold keep the 0.90 fallback.
OOF_mean = (OA_r + OB_r + OC_r) / 3.0
C = OOF_mean.shape[1]
t_pos = np.full(C, 0.90, dtype=np.float32)  # fallback
for c in range(C):
    m = inter_mask_abc[:, c].astype(bool)
    if not m.any():
        continue
    yt = Y_aligned[m, c].astype(np.int32)
    if yt.max() == yt.min():
        continue
    yp = OOF_mean[m, c]
    # grid thresholds from 0.5 to 0.99, finer near top
    grid = np.unique(np.concatenate([np.linspace(0.5, 0.95, 10), np.linspace(0.95, 0.99, 9)]))
    found = False
    for thr in grid:
        pred = (yp >= thr).astype(int)
        if pred.sum() == 0:
            continue
        prec = precision_score(yt, pred, zero_division=0)
        if prec >= 0.90:
            t_pos[c] = float(thr)
            found = True
            break
    if not found:
        t_pos[c] = 0.90

# Build consensus pseudo-positive mask on TEST:
# require at least 2/3 views >= t_pos_c and mean >= t_pos_c.
RA = TA.astype(np.float32); RB = TB.astype(np.float32); RC = test_ens_st.astype(np.float32)
RA = RA if RA.min() >= 0 and RA.max() <= 1.0 else rank_cols(RA)
RB = RB if RB.min() >= 0 and RB.max() <= 1.0 else rank_cols(RB)
RC = RC if RC.min() >= 0 and RC.max() <= 1.0 else rank_cols(RC)
T_mean = (RA + RB + RC) / 3.0
n_te = RA.shape[0]
Y_pseudo = np.full((n_te, C), np.nan, dtype=np.float32)
added_ct = 0
for c in range(C):
    thr = t_pos[c]
    mA = RA[:, c] >= thr; mB = RB[:, c] >= thr; mC = RC[:, c] >= thr
    votes = (mA.astype(int) + mB.astype(int) + mC.astype(int))
    pos_mask = (votes >= 2) & (T_mean[:, c] >= thr)
    # store candidate positives (we'll cap per fold/class later)
    Y_pseudo[pos_mask, c] = 1.0
    added_ct += int(pos_mask.sum())
print(f'[PL V2] Initial consensus pseudo-positive candidates: {added_ct} across {C} classes')

# GroupKFold-by-station (5 folds)
train_meta = train_df[['rec_id','filename']].copy()
train_meta['station'] = train_meta['filename'].map(extract_station).fillna('UNK')
groups = train_meta['station'].values
X_tr_full = X_train_df.values.astype(np.float32)
X_te_full = X_test_df.values.astype(np.float32)
Y_full = Y_aligned.astype(np.float32)
N = X_tr_full.shape[0]
gkf = GroupKFold(n_splits=5)
folds = list(gkf.split(X_tr_full, Y_full, groups))
print('GroupKFold by station -> n_folds:', len(folds), 'unique stations:', len(np.unique(groups)))

# PL V2 CatBoost variants (stronger regularization)
variants = [
    dict(name='cb_plv2_d3_lr0015_l2_60_rsm05_sub08_s42', depth=3, learning_rate=0.015, l2_leaf_reg=60.0, rsm=0.5, subsample=0.80, random_strength=0.8, seed=42),
    dict(name='cb_plv2_d4_lr0015_l2_70_rsm04_sub075_s123', depth=4, learning_rate=0.015, l2_leaf_reg=70.0, rsm=0.4, subsample=0.75, random_strength=1.0, seed=123),
]

oof_list = []; mask_list = []; test_pred_list = []
for v in variants:
    print(f"=== Variant {v['name']} (PL V2 pseudo-aug) ===", flush=True)
    oof = np.zeros((N, C), dtype=np.float32)
    vmask = np.zeros((N, C), dtype=np.uint8)
    test_folds = []
    for fi, (tr_idx, va_idx) in enumerate(folds):
        t0 = time.time()
        Xtr = X_tr_full[tr_idx]; Xva = X_tr_full[va_idx]
        ytr = Y_full[tr_idx]; yva = Y_full[va_idx]
        te_pred = np.zeros((X_te_full.shape[0], C), dtype=np.float32)
        # per-class training
        for c in range(C):
            ytr_c = ytr[:, c]; yva_c = yva[:, c]
            if ytr_c.max() == ytr_c.min() or yva_c.max() == yva_c.min():
                continue
            tr_pos = int((ytr_c == 1).sum())
            tr_neg = int((ytr_c == 0).sum())
            if tr_pos == 0 or tr_neg == 0:
                continue
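            # per-class cap on pseudo-positives: min(6, round(0.08*tr_pos + 1), max(1, floor(0.15*tr_pos))); at least 1 when tr_pos >= 1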
            cap_base = min(6, int(round(0.08 * tr_pos + 1)))
            cap_pct = int(max(1, np.floor(0.15 * tr_pos)))
            cap = max(0, min(cap_base, cap_pct))
            # candidate pseudo indices for this class
            cand_mask = np.isfinite(Y_pseudo[:, c]) & (Y_pseudo[:, c] == 1.0)
            if cap > 0 and cand_mask.any():
                # score by blended rank (best 3-view tuned), keep top-k
                scores = test_blend_rank_3v_opt[cand_mask, c]
                order = np.argsort(-scores)
                take = min(cap, order.size)
                idx_te = np.where(cand_mask)[0][order[:take]]
                X_p = X_te_full[idx_te]
                y_p = np.ones(take, dtype=np.float32)
                # confidence weights: w = clip(((rank-0.5)*2)^2, 0.3, 1.0)
                r = test_blend_rank_3v_opt[idx_te, c].astype(np.float32)
                w_p = np.clip(((r - 0.5) * 2.0) ** 2, 0.3, 1.0).astype(np.float32)
                # assemble augmented training
                Xtr_c = np.concatenate([Xtr, X_p], axis=0)
                ytr_c_aug = np.concatenate([ytr_c, y_p], axis=0)
                w_tr = np.concatenate([np.ones(Xtr.shape[0], dtype=np.float32), w_p], axis=0)
            else:
                Xtr_c = Xtr; ytr_c_aug = ytr_c; w_tr = np.ones(Xtr.shape[0], dtype=np.float32)
            tr_pool = Pool(Xtr_c, label=ytr_c_aug, weight=w_tr)
            va_pool = Pool(Xva, label=yva_c)
            cb = CatBoostClassifier(
                loss_function='Logloss',
                depth=v['depth'], learning_rate=v['learning_rate'],
                iterations=5000, l2_leaf_reg=v['l2_leaf_reg'],
                rsm=v['rsm'], subsample=v['subsample'], random_strength=v['random_strength'],
                auto_class_weights='Balanced', random_seed=v['seed'],
                early_stopping_rounds=200, verbose=False, task_type='CPU', thread_count=-1
            )
            cb.fit(tr_pool, eval_set=va_pool)
            oof[va_idx, c] = cb.predict_proba(Xva)[:, 1]
            te_pred[:, c] = cb.predict_proba(X_te_full)[:, 1]
            vmask[va_idx, c] = 1
        fold_macro = masked_auc_macro(yva, oof[va_idx], vmask[va_idx])
        print(f"  Fold {fi}: masked macro={fold_macro:.4f} | n_val={len(va_idx)} | time={time.time()-t0:.1f}s", flush=True)
        test_folds.append(te_pred)
    test_pred_list.append(np.mean(test_folds, axis=0))
    oof_list.append(oof); mask_list.append(vmask)
    macro_v = masked_auc_macro(Y_full, oof, vmask)
    print(f"Variant OOF masked macro: {macro_v:.4f}", flush=True)

# Rank-ensemble PL V2 variants
inter_mask_plv2 = mask_list[0].astype(bool)
for m in mask_list[1:]: inter_mask_plv2 &= m.astype(bool)
inter_mask_plv2 = inter_mask_plv2.astype(np.uint8)
ranked_oofs = []
for o in oof_list:
    Nn, Cc = o.shape
    R = np.full_like(o, np.nan, dtype=np.float32)
    for c in range(Cc):
        m = inter_mask_plv2[:, c].astype(bool)
        if not m.any(): continue
        col = o[m, c]
        r = rankdata(col, method='average')
        R[m, c] = ((r - 1) / (len(r) - 1) if len(r) > 1 else r*0.0).astype(np.float32)
    ranked_oofs.append(R)
oof_ens_pl_v2 = np.nanmean(ranked_oofs, axis=0)
macro_pl_v2 = masked_auc_macro(Y_full, oof_ens_pl_v2, inter_mask_plv2)
print(f'[PL V2] Rank-ensemble OOF masked macro: {macro_pl_v2:.4f}', flush=True)

ranked_tests = [rank_cols(tp) for tp in test_pred_list]
test_ens_pl_v2 = np.mean(ranked_tests, axis=0).astype(np.float32)

# 4-view reblend: (A,B,C) 3-view tuned + PL V2 with small weight grid [0..0.3]
assert 'test_blend_rank_3v_opt' in globals(), 'Need 3-view test ranks'
oof_3v = best_wA * OA_r + best_wB * OB_r + best_wC * OC_r
mask_3v = inter_mask_abc
mask_comb = (mask_3v.astype(bool) & inter_mask_plv2.astype(bool)).astype(np.uint8)
best_wp, best_macro = 0.0, -1.0
for wp in np.linspace(0.0, 0.3, 16):
    blend_oof = (1.0 - wp) * oof_3v + wp * oof_ens_pl_v2
    macro = masked_auc_macro(Y_full, blend_oof, mask_comb)
    if macro > best_macro:
        best_macro = macro; best_wp = float(wp)
print(f"[4-view (PL V2)] Best masked OOF macro={best_macro:.4f} at wp={best_wp:.2f}")

test_final = (1.0 - best_wp) * test_blend_rank_3v_opt + best_wp * test_ens_pl_v2

# Write submission
test_id_order = sub_df[['Id','rec_id','class_id']].copy()
rec_to_idx = {rid:i for i, rid in enumerate(sorted(test_rec_ids))}
probs = []
for _, row in test_id_order.iterrows():
    rid = int(row['rec_id']); cid = int(row['class_id'])
    ti = rec_to_idx[rid]
    p = float(test_final[ti, cid])
    probs.append(min(max(p, 0.0), 1.0))
sub_out = pd.DataFrame({'Id': test_id_order['Id'].astype(int), 'Probability': np.round(probs, 6)})
sub_out.to_csv('submission.csv', index=False)
print(f"Wrote submission.csv with {len(sub_out)} rows (4-view with PL V2, wp={best_wp:.2f})")
display(sub_out.head())

[PL V2] Initial consensus pseudo-positive candidates: 54 across 19 classes
GroupKFold by station -> n_folds: 5 unique stations: 13
=== Variant cb_plv2_d3_lr0015_l2_60_rsm05_sub08_s42 (PL V2 pseudo-aug) ===
  Fold 0: masked macro=0.7373 | n_val=31 | time=27.7s
  Fold 1: masked macro=0.7641 | n_val=27 | time=45.1s
  Fold 2: masked macro=0.7107 | n_val=26 | time=43.2s
  Fold 3: masked macro=0.7399 | n_val=31 | time=42.8s
  Fold 4: masked macro=0.7630 | n_val=30 | time=32.0s
Variant OOF masked macro: 0.7622
=== Variant cb_plv2_d4_lr0015_l2_70_rsm04_sub075_s123 (PL V2 pseudo-aug) ===
  Fold 0: masked macro=0.7461 | n_val=31 | time=48.8s
  Fold 1: masked macro=0.7593 | n_val=27 | time=72.4s
  Fold 2: masked macro=0.7388 | n_val=26 | time=63.4s
  Fold 3: masked macro=0.7450 | n_val=31 | time=67.0s
  Fold 4: masked macro=0.7478 | n_val=30 | time=35.3s
Variant OOF masked macro: 0.7668
[PL V2] Rank-ensemble OOF masked macro: 0.7681
[4-view (PL V2)] Best masked OOF macro=0.7682 at wp=0.30
Wrote submission.csv with 1216 rows (4-view with PL V2, wp=0.30)

    Id  Probability
0  100     0.246032
1  101     0.474815
2  102     0.426925
3  103     0.709186
4  104     0.383896
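
The per-class cap and the confidence weights in the PL V2 cell reduce to two small formulas. A minimal sketch that evaluates them on a few made-up inputs (the helper names `pseudo_cap` and `confidence_weight` are hypothetical, and float64 is used here where the notebook uses float32):

```python
import numpy as np

def pseudo_cap(tr_pos):
    """Max pseudo-positives to add for a class with tr_pos real positives in the training split."""
    cap_base = min(6, int(round(0.08 * tr_pos + 1)))
    cap_pct = int(max(1, np.floor(0.15 * tr_pos)))
    return max(0, min(cap_base, cap_pct))

def confidence_weight(rank):
    """Sample weight from a [0, 1] rank: quadratic in the distance from 0.5, clipped to [0.3, 1.0]."""
    rank = np.asarray(rank, dtype=np.float64)
    return np.clip(((rank - 0.5) * 2.0) ** 2, 0.3, 1.0)

# Made-up positive counts and blended ranks, for illustration only.
caps = {n: pseudo_cap(n) for n in (1, 5, 30, 100)}
weights = confidence_weight([1.0, 0.99, 0.9, 0.7])
print(caps)
print(weights)
```

Under these formulas the cap is at least 1 whenever `tr_pos >= 1`, so the `cap > 0` guard never skips a class. Ranks between roughly 0.23 and 0.77 all receive the floor weight of 0.3.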


In [46]:
# Power-mean rank transform + joint weight tuning for 3-view + PL V2 (4-view) blend
import numpy as np, pandas as pd
from sklearn.metrics import roc_auc_score

def masked_auc_macro(y_true, y_pred, valid_mask):
    C = y_true.shape[1]
    aucs = []
    for c in range(C):
        m = valid_mask[:, c].astype(bool)
        if m.sum() <= 1: continue
        yt = y_true[m, c]; yp = y_pred[m, c]
        if yt.max() == yt.min(): continue
        try: aucs.append(roc_auc_score(yt, yp))
        except Exception: pass
    return float(np.mean(aucs)) if len(aucs) else float('nan')

def pow_rank(mat, gamma):
    # ranks in [0,1]; power transform keeps [0,1]
    return np.clip(np.power(mat, gamma, dtype=np.float64), 0.0, 1.0).astype(np.float32)

assert 'OA_r' in globals() and 'OB_r' in globals() and 'OC_r' in globals() and 'inter_mask_abc' in globals(), 'Run cell 33 first.'
assert 'oof_ens_pl_v2' in globals() and 'inter_mask_plv2' in globals(), 'Run cell 36 (PL V2) first.'
assert 'TA' in globals() and 'TB' in globals() and 'test_ens_st' in globals() and 'test_ens_pl_v2' in globals(), 'Need test ranks from views A/B/C and PL V2.'

# Grids
gammas = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5]
w_grid = np.linspace(0.0, 1.0, 21)
wp_grid = np.linspace(0.0, 0.3, 16)

best_cfg = None
best_macro = -1.0

# The combined mask does not depend on gamma, so build it once before the loop.
inter_mask_comb = (inter_mask_abc.astype(bool) & inter_mask_plv2.astype(bool)).astype(np.uint8)
for g in gammas:
    A_o = pow_rank(OA_r, g); B_o = pow_rank(OB_r, g); C_o = pow_rank(OC_r, g); PL_o = pow_rank(oof_ens_pl_v2, g)
    for wB in w_grid:
        for wC in w_grid:
            wA = 1.0 - wB - wC
            if wA < 0 or wA > 1: continue
            oof_3 = wA * A_o + wB * B_o + wC * C_o
            for wp in wp_grid:
                blend_oof = (1.0 - wp) * oof_3 + wp * PL_o
                macro = masked_auc_macro(Y_aligned, blend_oof, inter_mask_comb)
                if macro > best_macro:
                    best_macro = macro
                    best_cfg = (float(g), float(wA), float(wB), float(wC), float(wp))

g_best, wA_best, wB_best, wC_best, wp_best = best_cfg
print(f"[Power+Weights] Best masked OOF macro={best_macro:.4f} | gamma={g_best} | wA={wA_best:.2f} wB={wB_best:.2f} wC={wC_best:.2f} | wp={wp_best:.2f}", flush=True)

# Build test with best config
RA = pow_rank(TA.astype(np.float32), g_best)
RB = pow_rank(TB.astype(np.float32), g_best)
RC = pow_rank(test_ens_st.astype(np.float32), g_best)
RPL = pow_rank(test_ens_pl_v2.astype(np.float32), g_best)
test_3 = wA_best * RA + wB_best * RB + wC_best * RC
test_final = (1.0 - wp_best) * test_3 + wp_best * RPL

# Write submission
test_id_order = sub_df[['Id','rec_id','class_id']].copy()
rec_to_idx = {rid:i for i, rid in enumerate(sorted(test_rec_ids))}
probs = []
for _, row in test_id_order.iterrows():
    rid = int(row['rec_id']); cid = int(row['class_id'])
    ti = rec_to_idx[rid]
    p = float(test_final[ti, cid])
    probs.append(min(max(p, 0.0), 1.0))
sub_out = pd.DataFrame({'Id': test_id_order['Id'].astype(int), 'Probability': np.round(probs, 6)})
sub_out.to_csv('submission.csv', index=False)
print(f"Wrote submission.csv with {len(sub_out)} rows (Power+Weights tuned: gamma={g_best}, wA={wA_best:.2f}, wB={wB_best:.2f}, wC={wC_best:.2f}, wp={wp_best:.2f})")
display(sub_out.head())

[Power+Weights] Best masked OOF macro=0.7691 | gamma=1.1 | wA=0.45 wB=0.05 wC=0.50 | wp=0.30
Wrote submission.csv with 1216 rows (Power+Weights tuned: gamma=1.1, wA=0.45, wB=0.05, wC=0.50, wp=0.30)

    Id  Probability
0  100     0.213841
1  101     0.439926
2  102     0.393238
3  103     0.682867
4  104     0.349133
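
A power transform is monotone, so applied to a single view it leaves that view's AUC unchanged. It only matters inside a blend, where it changes how far apart the ranks sit before they are averaged. A minimal sketch on toy ranks (the arrays and `gamma = 2.0` are assumptions chosen to make the effect visible, and `pairwise_auc` is a hand-rolled stand-in for `roc_auc_score`):

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

def pow_rank(ranks, gamma):
    """Power transform of [0, 1] ranks; monotone for gamma > 0."""
    return np.clip(np.power(ranks, gamma), 0.0, 1.0)

y = np.array([1, 0, 0, 1])
view_a = np.array([0.90, 0.55, 0.20, 1.00])
view_b = np.array([0.10, 0.55, 0.30, 0.80])
gamma = 2.0  # exaggerated for illustration; the notebook searches 1.0 to 1.5

# Single view: same ordering before and after the transform.
auc_b_raw = pairwise_auc(y, view_b)
auc_b_pow = pairwise_auc(y, pow_rank(view_b, gamma))

# Blend: the transform changes the spacing, so the blended ordering can change.
auc_blend_raw = pairwise_auc(y, 0.5 * view_a + 0.5 * view_b)
auc_blend_pow = pairwise_auc(y, 0.5 * pow_rank(view_a, gamma) + 0.5 * pow_rank(view_b, gamma))
print(auc_b_raw, auc_b_pow, auc_blend_raw, auc_blend_pow)
```

The small gain from gamma in the notebook's search (0.7682 to 0.7691 OOF) comes from the same data used to tune it, so it may not carry over to the test set.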


In [47]:
# Advanced power-rank per-view tuning + joint weight search (3-view + PL V2)
import numpy as np, pandas as pd
from sklearn.metrics import roc_auc_score

def masked_auc_macro(y_true, y_pred, valid_mask):
    C = y_true.shape[1]
    aucs = []
    for c in range(C):
        m = valid_mask[:, c].astype(bool)
        if m.sum() <= 1: continue
        yt = y_true[m, c]; yp = y_pred[m, c]
        if yt.max() == yt.min(): continue
        try: aucs.append(roc_auc_score(yt, yp))
        except Exception: pass
    return float(np.mean(aucs)) if len(aucs) else float('nan')

def pow_rank(mat, gamma):
    return np.clip(np.power(mat, gamma, dtype=np.float64), 0.0, 1.0).astype(np.float32)

assert 'OA_r' in globals() and 'OB_r' in globals() and 'OC_r' in globals() and 'inter_mask_abc' in globals(), 'Run cell 33 first.'
assert 'oof_ens_pl_v2' in globals() and 'inter_mask_plv2' in globals(), 'Run cell 36 first (PL V2).'
assert 'TA' in globals() and 'TB' in globals() and 'test_ens_st' in globals() and 'test_ens_pl_v2' in globals(), 'Need test ranks for A/B/C/PL V2.'

# Small per-view gamma sets to keep search tractable
gammas = [1.0, 1.1, 1.2]
w_grid = np.linspace(0.0, 1.0, 21)  # for B and C; A = 1 - B - C
wp_grid = np.linspace(0.0, 0.3, 16)

best = dict(macro=-1.0, gA=None, gB=None, gC=None, gP=None, wA=None, wB=None, wC=None, wp=None)

inter_mask_comb = (inter_mask_abc.astype(bool) & inter_mask_plv2.astype(bool)).astype(np.uint8)

# Precompute the power-transformed PL V2 OOF once per gamma; it is reused for every weight combination.
PL_by_gamma = {gP: pow_rank(oof_ens_pl_v2, gP) for gP in gammas}

for gA in gammas:
    A_o = pow_rank(OA_r, gA)
    for gB in gammas:
        B_o = pow_rank(OB_r, gB)
        for gC in gammas:
            C_o = pow_rank(OC_r, gC)
            for wB in w_grid:
                for wC in w_grid:
                    wA = 1.0 - wB - wC
                    if wA < 0 or wA > 1: continue
                    oof_3 = wA * A_o + wB * B_o + wC * C_o
                    for gP in gammas:
                        PL_o = PL_by_gamma[gP]
                        for wp in wp_grid:
                            blend_oof = (1.0 - wp) * oof_3 + wp * PL_o
                            macro = masked_auc_macro(Y_aligned, blend_oof, inter_mask_comb)
                            if macro > best['macro']:
                                best.update(macro=float(macro), gA=float(gA), gB=float(gB), gC=float(gC), gP=float(gP),
                                            wA=float(wA), wB=float(wB), wC=float(wC), wp=float(wp))

print(f"[Per-view Power+Weights] Best masked OOF macro={best['macro']:.4f} | gA={best['gA']} gB={best['gB']} gC={best['gC']} gP={best['gP']} | wA={best['wA']:.2f} wB={best['wB']:.2f} wC={best['wC']:.2f} | wp={best['wp']:.2f}", flush=True)

# Build test with best config
RA = pow_rank(TA.astype(np.float32), best['gA'])
RB = pow_rank(TB.astype(np.float32), best['gB'])
RC = pow_rank(test_ens_st.astype(np.float32), best['gC'])
RPL = pow_rank(test_ens_pl_v2.astype(np.float32), best['gP'])
test_3 = best['wA'] * RA + best['wB'] * RB + best['wC'] * RC
test_final = (1.0 - best['wp']) * test_3 + best['wp'] * RPL

# Write submission
test_id_order = sub_df[['Id','rec_id','class_id']].copy()
rec_to_idx = {rid:i for i, rid in enumerate(sorted(test_rec_ids))}
probs = []
for _, row in test_id_order.iterrows():
    rid = int(row['rec_id']); cid = int(row['class_id'])
    ti = rec_to_idx[rid]
    p = float(test_final[ti, cid])
    probs.append(min(max(p, 0.0), 1.0))
sub_out = pd.DataFrame({'Id': test_id_order['Id'].astype(int), 'Probability': np.round(probs, 6)})
sub_out.to_csv('submission.csv', index=False)
print(f"Wrote submission.csv with {len(sub_out)} rows (Per-view Power+Weights tuned)")
display(sub_out.head())
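
The weight search above derives `wA = 1.0 - wB - wC` in floating point, so a boundary combination where `wA` should be exactly zero may come out as a tiny negative number and be dropped by the `wA < 0` check. A minimal sketch of an alternative, assuming the same 0.05 step: enumerate the weights as integer multiples of the step, which keeps every boundary point. The helper name `simplex_weights` is hypothetical and this is not what the notebook does.

```python
import numpy as np

def simplex_weights(n_steps):
    """Yield (wA, wB, wC) with each weight a multiple of 1/n_steps and wA + wB + wC == 1."""
    for i_b in range(n_steps + 1):
        for i_c in range(n_steps + 1 - i_b):
            i_a = n_steps - i_b - i_c  # non-negative by construction
            yield i_a / n_steps, i_b / n_steps, i_c / n_steps

int_grid = list(simplex_weights(20))

# Same grid built the float way, for comparison.
w_grid = np.linspace(0.0, 1.0, 21)
float_grid = [(1.0 - wB - wC, wB, wC) for wB in w_grid for wC in w_grid
              if 0.0 <= 1.0 - wB - wC <= 1.0]

# Any shortfall in the second count is a boundary point lost to rounding.
print(len(int_grid), len(float_grid))
```

The best configuration reported earlier (wA=0.45, wB=0.05, wC=0.50) is an interior point, so this only matters if the optimum sits on an edge of the simplex.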

In [50]:
# PL V2: add 2 more CatBoost variants to the bag, re-ensemble, re-run power+weights search, and write submission
import numpy as np, time
from catboost import CatBoostClassifier, Pool
from scipy.stats import rankdata
from sklearn.metrics import roc_auc_score

def rank_cols(mat):
    N, C = mat.shape
    out = np.zeros_like(mat, dtype=np.float32)
    for c in range(C):
        r = rankdata(mat[:, c], method='average')
        out[:, c] = (r - 1) / (N - 1) if N > 1 else r*0.0
    return out.astype(np.float32)

def masked_auc_macro(y_true, y_pred, valid_mask):
    C = y_true.shape[1]
    aucs = []
    for c in range(C):
        m = valid_mask[:, c].astype(bool)
        if m.sum() <= 1: continue
        yt = y_true[m, c]; yp = y_pred[m, c]
        if yt.max() == yt.min(): continue
        try: aucs.append(roc_auc_score(yt, yp))
        except Exception: pass
    return float(np.mean(aucs)) if len(aucs) else float('nan')

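# Preconditions from the PL V2 cell. This cell appends to oof_list/mask_list/test_pred_list in place,
# so re-running it without first re-running the PL V2 cell adds the extra variants twice.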
assert 'folds' in globals() and 'X_tr_full' in globals() and 'X_te_full' in globals() and 'Y_full' in globals(), 'Run cell 36 first.'
assert 'Y_pseudo' in globals() and 'test_blend_rank_3v_opt' in globals(), 'PL candidates and 3-view ranks required.'
assert 'oof_list' in globals() and 'mask_list' in globals() and 'test_pred_list' in globals(), 'PL V2 lists missing; run cell 36.'

# Two additional PL V2 variants per expert guidance
extra_variants = [
    dict(name='cb_plv2_d3_lr0015_l2_80_rsm05_sub08_s456', depth=3, learning_rate=0.015, l2_leaf_reg=80.0, rsm=0.5, subsample=0.80, random_strength=0.8, seed=456),
    dict(name='cb_plv2_d4_lr0015_l2_100_rsm04_sub075_s789', depth=4, learning_rate=0.015, l2_leaf_reg=100.0, rsm=0.4, subsample=0.75, random_strength=1.0, seed=789),
]

C = Y_full.shape[1]
for v in extra_variants:
    print(f"=== Variant {v['name']} (PL V2 pseudo-aug EXTRA) ===", flush=True)
    oof = np.zeros((X_tr_full.shape[0], C), dtype=np.float32)
    vmask = np.zeros((X_tr_full.shape[0], C), dtype=np.uint8)
    test_folds = []
    for fi, (tr_idx, va_idx) in enumerate(folds):
        t0 = time.time()
        Xtr = X_tr_full[tr_idx]; Xva = X_tr_full[va_idx]
        ytr = Y_full[tr_idx]; yva = Y_full[va_idx]
        te_pred = np.zeros((X_te_full.shape[0], C), dtype=np.float32)
        for c in range(C):
            ytr_c = ytr[:, c]; yva_c = yva[:, c]
            if ytr_c.max() == ytr_c.min() or yva_c.max() == yva_c.min():
                continue
            tr_pos = int((ytr_c == 1).sum()); tr_neg = int((ytr_c == 0).sum())
            if tr_pos == 0 or tr_neg == 0:
                continue
            # per-class cap (same as PL V2)
            cap_base = min(6, int(round(0.08 * tr_pos + 1)))
            cap_pct = int(max(1, np.floor(0.15 * tr_pos)))
            cap = max(0, min(cap_base, cap_pct))
            cand_mask = np.isfinite(Y_pseudo[:, c]) & (Y_pseudo[:, c] == 1.0)
            if cap > 0 and cand_mask.any():
                scores = test_blend_rank_3v_opt[cand_mask, c]
                order = np.argsort(-scores)
                take = min(cap, order.size)
                idx_te = np.where(cand_mask)[0][order[:take]]
                X_p = X_te_full[idx_te]
                y_p = np.ones(take, dtype=np.float32)
                r = test_blend_rank_3v_opt[idx_te, c].astype(np.float32)
                w_p = np.clip(((r - 0.5) * 2.0) ** 2, 0.3, 1.0).astype(np.float32)
                Xtr_c = np.concatenate([Xtr, X_p], axis=0)
                ytr_c_aug = np.concatenate([ytr_c, y_p], axis=0)
                w_tr = np.concatenate([np.ones(Xtr.shape[0], dtype=np.float32), w_p], axis=0)
            else:
                Xtr_c = Xtr; ytr_c_aug = ytr_c; w_tr = np.ones(Xtr.shape[0], dtype=np.float32)
            tr_pool = Pool(Xtr_c, label=ytr_c_aug, weight=w_tr)
            va_pool = Pool(Xva, label=yva_c)
            cb = CatBoostClassifier(
                loss_function='Logloss',
                depth=v['depth'], learning_rate=v['learning_rate'],
                iterations=5000, l2_leaf_reg=v['l2_leaf_reg'],
                rsm=v['rsm'], subsample=v['subsample'], random_strength=v['random_strength'],
                auto_class_weights='Balanced', random_seed=v['seed'],
                early_stopping_rounds=200, verbose=False, task_type='CPU', thread_count=-1
            )
            cb.fit(tr_pool, eval_set=va_pool)
            oof[va_idx, c] = cb.predict_proba(Xva)[:, 1]
            te_pred[:, c] = cb.predict_proba(X_te_full)[:, 1]
            vmask[va_idx, c] = 1
        fold_macro = masked_auc_macro(yva, oof[va_idx], vmask[va_idx])
        print(f"  Fold {fi}: masked macro={fold_macro:.4f} | n_val={len(va_idx)} | time={time.time()-t0:.1f}s", flush=True)
        test_folds.append(te_pred)
    test_pred_list.append(np.mean(test_folds, axis=0))
    oof_list.append(oof); mask_list.append(vmask)
    macro_v = masked_auc_macro(Y_full, oof, vmask)
    print(f"Variant OOF masked macro: {macro_v:.4f}", flush=True)

# Recompute PL V2 rank-ensemble with 4 variants
inter_mask_plv2 = mask_list[0].astype(bool)
for m in mask_list[1:]: inter_mask_plv2 &= m.astype(bool)
inter_mask_plv2 = inter_mask_plv2.astype(np.uint8)
ranked_oofs = []
for o in oof_list:
    Nn, Cc = o.shape
    R = np.full_like(o, np.nan, dtype=np.float32)
    for c in range(Cc):
        m = inter_mask_plv2[:, c].astype(bool)
        if not m.any(): continue
        col = o[m, c]
        r = rankdata(col, method='average')
        R[m, c] = ((r - 1) / (len(r) - 1) if len(r) > 1 else r*0.0).astype(np.float32)
    ranked_oofs.append(R)
oof_ens_pl_v2 = np.nanmean(ranked_oofs, axis=0)
macro_pl_v2 = masked_auc_macro(Y_full, oof_ens_pl_v2, inter_mask_plv2)
print(f"[PL V2 x4] Rank-ensemble OOF masked macro: {macro_pl_v2:.4f}", flush=True)

ranked_tests = [rank_cols(tp) for tp in test_pred_list]
test_ens_pl_v2 = np.mean(ranked_tests, axis=0).astype(np.float32)

# Re-run power+weights grid (same as cell 37) with updated PL V2 ensemble
def pow_rank(mat, gamma):
    return np.clip(np.power(mat, gamma, dtype=np.float64), 0.0, 1.0).astype(np.float32)
assert 'OA_r' in globals() and 'OB_r' in globals() and 'OC_r' in globals() and 'inter_mask_abc' in globals(), 'Need 3-view OOF ranks and mask (cell 33)'
assert 'TA' in globals() and 'TB' in globals() and 'test_ens_st' in globals(), 'Need A/B/C test ranks'
gammas = [1.0, 1.1, 1.2, 1.3]
w_grid = np.linspace(0.0, 1.0, 21)
wp_grid = np.linspace(0.0, 0.3, 16)
best_cfg = None; best_macro = -1.0
inter_mask_comb = (inter_mask_abc.astype(bool) & inter_mask_plv2.astype(bool)).astype(np.uint8)
for g in gammas:
    A_o = pow_rank(OA_r, g); B_o = pow_rank(OB_r, g); C_o = pow_rank(OC_r, g); PL_o = pow_rank(oof_ens_pl_v2, g)
    for wB in w_grid:
        for wC in w_grid:
            wA = 1.0 - wB - wC
            if wA < -1e-9: continue  # tolerance: round-off can leave wA slightly negative when wB + wC == 1
            wA = max(wA, 0.0)
            oof_3 = wA * A_o + wB * B_o + wC * C_o
            for wp in wp_grid:
                blend_oof = (1.0 - wp) * oof_3 + wp * PL_o
                macro = masked_auc_macro(Y_aligned, blend_oof, inter_mask_comb)
                if macro > best_macro:
                    best_macro = macro; best_cfg = (float(g), float(wA), float(wB), float(wC), float(wp))
g_best, wA_best, wB_best, wC_best, wp_best = best_cfg
print(f"[Power+Weights x4 PL] Best masked OOF macro={best_macro:.4f} | gamma={g_best} | wA={wA_best:.2f} wB={wB_best:.2f} wC={wC_best:.2f} | wp={wp_best:.2f}", flush=True)

# Build and write submission
RA = pow_rank(TA.astype(np.float32), g_best)
RB = pow_rank(TB.astype(np.float32), g_best)
RC = pow_rank(test_ens_st.astype(np.float32), g_best)
RPL = pow_rank(test_ens_pl_v2.astype(np.float32), g_best)
test_3 = wA_best * RA + wB_best * RB + wC_best * RC
test_final = (1.0 - wp_best) * test_3 + wp_best * RPL
test_id_order = sub_df[['Id','rec_id','class_id']].copy()
rec_to_idx = {rid:i for i, rid in enumerate(sorted(test_rec_ids))}
probs = []
for _, row in test_id_order.iterrows():
    rid = int(row['rec_id']); cid = int(row['class_id'])
    ti = rec_to_idx[rid]
    p = float(test_final[ti, cid])
    probs.append(min(max(p, 0.0), 1.0))
sub_out = pd.DataFrame({'Id': test_id_order['Id'].astype(int), 'Probability': np.round(probs, 6)})
sub_out.to_csv('submission.csv', index=False)
print(f"Wrote submission.csv with {len(sub_out)} rows (PL V2 x4 + Power+Weights tuned)")
display(sub_out.head())

=== Variant cb_plv2_d3_lr0015_l2_80_rsm05_sub08_s456 (PL V2 pseudo-aug EXTRA) ===
  Fold 0: masked macro=0.7481 | n_val=31 | time=29.8s
  Fold 1: masked macro=0.7696 | n_val=27 | time=52.2s
  Fold 2: masked macro=0.6882 | n_val=26 | time=43.1s
  Fold 3: masked macro=0.7215 | n_val=31 | time=48.2s
  Fold 4: masked macro=0.7317 | n_val=30 | time=34.9s
Variant OOF masked macro: 0.7669

=== Variant cb_plv2_d4_lr0015_l2_100_rsm04_sub075_s789 (PL V2 pseudo-aug EXTRA) ===
  Fold 0: masked macro=0.7556 | n_val=31 | time=51.0s
  Fold 1: masked macro=0.7519 | n_val=27 | time=76.8s
  Fold 2: masked macro=0.7085 | n_val=26 | time=64.5s

KeyboardInterrupt: 
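
The per-class pseudo-label budget and the confidence weights used in the training loop above can be checked in isolation. The sketch below reproduces only that arithmetic on toy inputs; the function names `pseudo_cap` and `pseudo_weight` and their keyword defaults are hypothetical wrappers around the constants that appear in the cell, not part of the notebook.

```python
import numpy as np

def pseudo_cap(tr_pos, hard_cap=6, base_frac=0.08, pct_frac=0.15):
    """Max number of pseudo-positives to add for a class with tr_pos real positives."""
    cap_base = min(hard_cap, int(round(base_frac * tr_pos + 1)))
    cap_pct = int(max(1, np.floor(pct_frac * tr_pos)))
    return max(0, min(cap_base, cap_pct))

def pseudo_weight(rank, floor=0.3):
    """Sample weight from a [0, 1] rank score: quadratic in distance from 0.5, floored."""
    r = np.asarray(rank, dtype=np.float32)
    return np.clip(((r - 0.5) * 2.0) ** 2, floor, 1.0).astype(np.float32)

# Budget for a few class sizes (number of real positives in the training fold)
caps = {n: pseudo_cap(n) for n in (5, 10, 40, 100)}
# Weights for candidates at decreasing rank confidence
weights = pseudo_weight([1.0, 0.9, 0.7, 0.5])
print(caps)
print(weights)
```

With these constants a rare class never receives more than one pseudo-positive and no class receives more than six, and a candidate needs a rank score well above 0.7 before its weight rises above the 0.3 floor.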

In [49]:
# PL V2: add 2 more CatBoost variants (total 6), re-ensemble, re-run power+weights search, and write submission
import numpy as np, pandas as pd, time
from catboost import CatBoostClassifier, Pool
from scipy.stats import rankdata
from sklearn.metrics import roc_auc_score

def rank_cols(mat):
    N, C = mat.shape
    out = np.zeros_like(mat, dtype=np.float32)
    for c in range(C):
        r = rankdata(mat[:, c], method='average')
        out[:, c] = (r - 1) / (N - 1) if N > 1 else r*0.0
    return out.astype(np.float32)

def masked_auc_macro(y_true, y_pred, valid_mask):
    C = y_true.shape[1]
    aucs = []
    for c in range(C):
        m = valid_mask[:, c].astype(bool)
        if m.sum() <= 1: continue
        yt = y_true[m, c]; yp = y_pred[m, c]
        if yt.max() == yt.min(): continue
        try: aucs.append(roc_auc_score(yt, yp))
        except Exception: pass
    return float(np.mean(aucs)) if len(aucs) else float('nan')

# Preconditions from PL V2 cell
assert 'folds' in globals() and 'X_tr_full' in globals() and 'X_te_full' in globals() and 'Y_full' in globals(), 'Run cell 36 first.'
assert 'Y_pseudo' in globals() and 'test_blend_rank_3v_opt' in globals(), 'PL candidates and 3-view ranks required.'
assert 'oof_list' in globals() and 'mask_list' in globals() and 'test_pred_list' in globals(), 'PL V2 lists missing; run cell 36/39.'

# Two more PL V2 variants (diverse seeds/regs)
extra_variants2 = [
    dict(name='cb_plv2_d3_lr0015_l2_90_rsm05_sub08_s1010', depth=3, learning_rate=0.015, l2_leaf_reg=90.0, rsm=0.5, subsample=0.80, random_strength=0.9, seed=1010),
    dict(name='cb_plv2_d4_lr0015_l2_110_rsm04_sub075_s2020', depth=4, learning_rate=0.015, l2_leaf_reg=110.0, rsm=0.4, subsample=0.75, random_strength=1.1, seed=2020),
]

C = Y_full.shape[1]
for v in extra_variants2:
    print(f"=== Variant {v['name']} (PL V2 pseudo-aug EXTRA2) ===", flush=True)
    oof = np.zeros((X_tr_full.shape[0], C), dtype=np.float32)
    vmask = np.zeros((X_tr_full.shape[0], C), dtype=np.uint8)
    test_folds = []
    for fi, (tr_idx, va_idx) in enumerate(folds):
        t0 = time.time()
        Xtr = X_tr_full[tr_idx]; Xva = X_tr_full[va_idx]
        ytr = Y_full[tr_idx]; yva = Y_full[va_idx]
        te_pred = np.zeros((X_te_full.shape[0], C), dtype=np.float32)
        for c in range(C):
            ytr_c = ytr[:, c]; yva_c = yva[:, c]
            if ytr_c.max() == ytr_c.min() or yva_c.max() == yva_c.min():
                continue
            tr_pos = int((ytr_c == 1).sum()); tr_neg = int((ytr_c == 0).sum())
            if tr_pos == 0 or tr_neg == 0:
                continue
            cap_base = min(6, int(round(0.08 * tr_pos + 1)))
            cap_pct = int(max(1, np.floor(0.15 * tr_pos)))
            cap = max(0, min(cap_base, cap_pct))
            cand_mask = np.isfinite(Y_pseudo[:, c]) & (Y_pseudo[:, c] == 1.0)
            if cap > 0 and cand_mask.any():
                scores = test_blend_rank_3v_opt[cand_mask, c]
                order = np.argsort(-scores)
                take = min(cap, order.size)
                idx_te = np.where(cand_mask)[0][order[:take]]
                X_p = X_te_full[idx_te]
                y_p = np.ones(take, dtype=np.float32)
                r = test_blend_rank_3v_opt[idx_te, c].astype(np.float32)
                w_p = np.clip(((r - 0.5) * 2.0) ** 2, 0.3, 1.0).astype(np.float32)
                Xtr_c = np.concatenate([Xtr, X_p], axis=0)
                ytr_c_aug = np.concatenate([ytr_c, y_p], axis=0)
                w_tr = np.concatenate([np.ones(Xtr.shape[0], dtype=np.float32), w_p], axis=0)
            else:
                Xtr_c = Xtr; ytr_c_aug = ytr_c; w_tr = np.ones(Xtr.shape[0], dtype=np.float32)
            tr_pool = Pool(Xtr_c, label=ytr_c_aug, weight=w_tr)
            va_pool = Pool(Xva, label=yva_c)
            cb = CatBoostClassifier(
                loss_function='Logloss',
                depth=v['depth'], learning_rate=v['learning_rate'],
                iterations=5000, l2_leaf_reg=v['l2_leaf_reg'],
                rsm=v['rsm'], subsample=v['subsample'], random_strength=v['random_strength'],
                auto_class_weights='Balanced', random_seed=v['seed'],
                early_stopping_rounds=200, verbose=False, task_type='CPU', thread_count=-1
            )
            cb.fit(tr_pool, eval_set=va_pool)
            oof[va_idx, c] = cb.predict_proba(Xva)[:, 1]
            te_pred[:, c] = cb.predict_proba(X_te_full)[:, 1]
            vmask[va_idx, c] = 1
        fold_macro = masked_auc_macro(yva, oof[va_idx], vmask[va_idx])
        print(f"  Fold {fi}: masked macro={fold_macro:.4f} | n_val={len(va_idx)} | time={time.time()-t0:.1f}s", flush=True)
        test_folds.append(te_pred)
    test_pred_list.append(np.mean(test_folds, axis=0))
    oof_list.append(oof); mask_list.append(vmask)
    macro_v = masked_auc_macro(Y_full, oof, vmask)
    print(f"Variant OOF masked macro: {macro_v:.4f}", flush=True)

# Recompute PL V2 rank-ensemble with 6 variants
inter_mask_plv2 = mask_list[0].astype(bool)
for m in mask_list[1:]: inter_mask_plv2 &= m.astype(bool)
inter_mask_plv2 = inter_mask_plv2.astype(np.uint8)
ranked_oofs = []
for o in oof_list:
    Nn, Cc = o.shape
    R = np.full_like(o, np.nan, dtype=np.float32)
    for c in range(Cc):
        m = inter_mask_plv2[:, c].astype(bool)
        if not m.any(): continue
        col = o[m, c]
        r = rankdata(col, method='average')
        R[m, c] = ((r - 1) / (len(r) - 1) if len(r) > 1 else r*0.0).astype(np.float32)
    ranked_oofs.append(R)
oof_ens_pl_v2 = np.nanmean(ranked_oofs, axis=0)
macro_pl_v2 = masked_auc_macro(Y_full, oof_ens_pl_v2, inter_mask_plv2)
print(f"[PL V2 x6] Rank-ensemble OOF masked macro: {macro_pl_v2:.4f}", flush=True)

ranked_tests = [rank_cols(tp) for tp in test_pred_list]
test_ens_pl_v2 = np.mean(ranked_tests, axis=0).astype(np.float32)

# Re-run power+weights grid (same as cell 37) with updated PL V2 ensemble
def pow_rank(mat, gamma):
    return np.clip(np.power(mat, gamma, dtype=np.float64), 0.0, 1.0).astype(np.float32)
assert 'OA_r' in globals() and 'OB_r' in globals() and 'OC_r' in globals() and 'inter_mask_abc' in globals(), 'Need 3-view OOF ranks and mask (cell 33)'
assert 'TA' in globals() and 'TB' in globals() and 'test_ens_st' in globals(), 'Need A/B/C test ranks'
gammas = [1.0, 1.1, 1.2, 1.3]
w_grid = np.linspace(0.0, 1.0, 21)
wp_grid = np.linspace(0.0, 0.3, 16)
best_cfg = None; best_macro = -1.0
inter_mask_comb = (inter_mask_abc.astype(bool) & inter_mask_plv2.astype(bool)).astype(np.uint8)
for g in gammas:
    A_o = pow_rank(OA_r, g); B_o = pow_rank(OB_r, g); C_o = pow_rank(OC_r, g); PL_o = pow_rank(oof_ens_pl_v2, g)
    for wB in w_grid:
        for wC in w_grid:
            wA = 1.0 - wB - wC
            if wA < -1e-9: continue  # tolerance: round-off can leave wA slightly negative when wB + wC == 1
            wA = max(wA, 0.0)
            oof_3 = wA * A_o + wB * B_o + wC * C_o
            for wp in wp_grid:
                blend_oof = (1.0 - wp) * oof_3 + wp * PL_o
                macro = masked_auc_macro(Y_aligned, blend_oof, inter_mask_comb)
                if macro > best_macro:
                    best_macro = macro; best_cfg = (float(g), float(wA), float(wB), float(wC), float(wp))
g_best, wA_best, wB_best, wC_best, wp_best = best_cfg
print(f"[Power+Weights x6 PL] Best masked OOF macro={best_macro:.4f} | gamma={g_best} | wA={wA_best:.2f} wB={wB_best:.2f} wC={wC_best:.2f} | wp={wp_best:.2f}", flush=True)

# Build and write submission
RA = pow_rank(TA.astype(np.float32), g_best)
RB = pow_rank(TB.astype(np.float32), g_best)
RC = pow_rank(test_ens_st.astype(np.float32), g_best)
RPL = pow_rank(test_ens_pl_v2.astype(np.float32), g_best)
test_3 = wA_best * RA + wB_best * RB + wC_best * RC
test_final = (1.0 - wp_best) * test_3 + wp_best * RPL
test_id_order = sub_df[['Id','rec_id','class_id']].copy()
rec_to_idx = {rid:i for i, rid in enumerate(sorted(test_rec_ids))}
probs = []
for _, row in test_id_order.iterrows():
    rid = int(row['rec_id']); cid = int(row['class_id'])
    ti = rec_to_idx[rid]
    p = float(test_final[ti, cid])
    probs.append(min(max(p, 0.0), 1.0))
sub_out = pd.DataFrame({'Id': test_id_order['Id'].astype(int), 'Probability': np.round(probs, 6)})
sub_out.to_csv('submission.csv', index=False)
print(f"Wrote submission.csv with {len(sub_out)} rows (PL V2 x6 + Power+Weights tuned)")
display(sub_out.head())

=== Variant cb_plv2_d3_lr0015_l2_90_rsm05_sub08_s1010 (PL V2 pseudo-aug EXTRA2) ===
  Fold 0: masked macro=0.7403 | n_val=31 | time=30.4s
  Fold 1: masked macro=0.7714 | n_val=27 | time=57.7s
  Fold 2: masked macro=0.6868 | n_val=26 | time=46.2s
  Fold 3: masked macro=0.7457 | n_val=31 | time=51.6s
  Fold 4: masked macro=0.7184 | n_val=30 | time=30.6s
Variant OOF masked macro: 0.7605

=== Variant cb_plv2_d4_lr0015_l2_110_rsm04_sub075_s2020 (PL V2 pseudo-aug EXTRA2) ===
  Fold 0: masked macro=0.7493 | n_val=31 | time=51.1s
  Fold 1: masked macro=0.7645 | n_val=27 | time=80.5s
  Fold 2: masked macro=0.7043 | n_val=26 | time=64.2s
  Fold 3: masked macro=0.7585 | n_val=31 | time=69.6s
  Fold 4: masked macro=0.7370 | n_val=30 | time=37.7s
Variant OOF masked macro: 0.7699

[PL V2 x6] Rank-ensemble OOF masked macro: 0.7706
[Power+Weights x6 PL] Best masked OOF macro=0.7702 | gamma=1.0 | wA=0.40 wB=0.05 wC=0.55 | wp=0.30
Wrote submission.csv with 1216 rows (PL V2 x6 + Power+Weights tuned)

    Id  Probability
0  100     0.246032
1  101     0.472355
2  102     0.429147
3  103     0.694682
4  104     0.385245
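
The three-view weight search walks a grid on the simplex `wA + wB + wC = 1`. When the weights come from `np.linspace` and `wA` is computed as `1.0 - wB - wC`, floating-point round-off can push boundary points slightly below zero, so a strict bounds check may drop them. A minimal sketch of an integer-indexed alternative follows; `simplex_grid` is a hypothetical helper, and the float-based list is built only for comparison.

```python
import numpy as np

def simplex_grid(steps=20):
    """Yield (wA, wB, wC), each a multiple of 1/steps, with wA + wB + wC == 1."""
    for i in range(steps + 1):          # wB = i / steps
        for j in range(steps + 1 - i):  # wC = j / steps, so i + j <= steps
            k = steps - i - j           # remaining mass goes to wA, never negative
            yield k / steps, i / steps, j / steps

grid = list(simplex_grid(20))

# Float-based enumeration with a strict bounds check, for comparison only
w = np.linspace(0.0, 1.0, 21)
float_grid = [
    (1.0 - wB - wC, wB, wC)
    for wB in w for wC in w
    if 0.0 <= 1.0 - wB - wC <= 1.0
]
print(len(grid), len(float_grid))
```

Indexing by integers guarantees that every point with `wA = 0` is visited, which matters when the best blend sits on an edge of the simplex (for example when one view gets zero weight).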


In [53]:
# Lightweight linear view (LR on standardized tabular features, GroupKFold by station), added as a 5th view with a small weight on top of the power+weights 4-view blend
import numpy as np, pandas as pd, time, re
from sklearn.model_selection import GroupKFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import roc_auc_score
from scipy.stats import rankdata

def extract_station(stem):
    if not isinstance(stem, str): return None
    m = re.search(r'(PC\d+)', stem); return m.group(1) if m else None

def masked_auc_macro(y_true, y_pred, valid_mask):
    C = y_true.shape[1]
    aucs = []
    for c in range(C):
        m = valid_mask[:, c].astype(bool)
        if m.sum() <= 1: continue
        yt = y_true[m, c]; yp = y_pred[m, c]
        if yt.max() == yt.min(): continue
        try: aucs.append(roc_auc_score(yt, yp))
        except Exception: pass
    return float(np.mean(aucs)) if len(aucs) else float('nan')

def rank_cols(mat):
    N, C = mat.shape
    out = np.zeros_like(mat, dtype=np.float32)
    for c in range(C):
        r = rankdata(mat[:, c], method='average')
        out[:, c] = (r - 1) / (N - 1) if N > 1 else r*0.0
    return out.astype(np.float32)

def pow_rank(mat, gamma):
    return np.clip(np.power(mat, gamma, dtype=np.float64), 0.0, 1.0).astype(np.float32)

# Preconditions: reuse best 4-view config from power+weights (cell 37) and PL V2 (cell 36)
assert 'g_best' in globals() and 'wA_best' in globals() and 'wB_best' in globals() and 'wC_best' in globals() and 'wp_best' in globals(), 'Run cell 37 first (power+weights best cfg)'
assert 'OA_r' in globals() and 'OB_r' in globals() and 'OC_r' in globals() and 'oof_ens_pl_v2' in globals(), 'Need OOF ranks from views and PL V2'
assert 'TA' in globals() and 'TB' in globals() and 'test_ens_st' in globals() and 'test_ens_pl_v2' in globals(), 'Need test ranks from views and PL V2'
assert 'inter_mask_abc' in globals() and 'inter_mask_plv2' in globals(), 'Need masks from previous cells'

# Build GroupKFold by station splits
train_meta = train_df[['rec_id','filename']].copy()
train_meta['station'] = train_meta['filename'].map(extract_station).fillna('UNK')
groups = train_meta['station'].values
X_full_tab = X_train_df.values.astype(np.float32)
T_full_tab = X_test_df.values.astype(np.float32)
Y_full = Y_aligned.astype(np.float32)
N, C = X_full_tab.shape[0], Y_full.shape[1]
gkf = GroupKFold(n_splits=5)
folds_lin = list(gkf.split(X_full_tab, Y_full, groups))

# Train LR-OVR per fold with in-fold StandardScaler; collect OOF and test preds (rank later)
oof_lin = np.zeros((N, C), dtype=np.float32)
vmask_lin = np.zeros((N, C), dtype=np.uint8)
test_lin_folds = []
for fi, (tr_idx, va_idx) in enumerate(folds_lin):
    t0 = time.time()
    Xtr, Xva = X_full_tab[tr_idx], X_full_tab[va_idx]
    ytr, yva = Y_full[tr_idx], Y_full[va_idx]
    scaler = StandardScaler(with_mean=True, with_std=True)
    Xtr_s = scaler.fit_transform(Xtr)
    Xva_s = scaler.transform(Xva)
    Tes_s = scaler.transform(T_full_tab)
    clf = OneVsRestClassifier(LogisticRegression(solver='lbfgs', penalty='l2', C=1.0, max_iter=2000, class_weight='balanced', n_jobs=-1))
    clf.fit(Xtr_s, ytr)
    va_pred = np.vstack([est.predict_proba(Xva_s)[:,1] for est in clf.estimators_]).T.astype(np.float32)
    te_pred = np.vstack([est.predict_proba(Tes_s)[:,1] for est in clf.estimators_]).T.astype(np.float32)
    oof_lin[va_idx] = va_pred
    vmask_lin[va_idx] = 1
    test_lin_folds.append(te_pred)
    fold_macro = masked_auc_macro(yva, va_pred, np.ones_like(va_pred, dtype=np.uint8))
    print(f"[LR View] Fold {fi}: macro={fold_macro:.4f} | n_val={len(va_idx)} | time={time.time()-t0:.1f}s", flush=True)

# Rank OOF/test for LR view
oof_lin_r = np.full_like(oof_lin, np.nan, dtype=np.float32)
for c in range(C):
    m = vmask_lin[:, c].astype(bool)
    if not m.any(): continue
    col = oof_lin[m, c]
    r = rankdata(col, method='average')
    oof_lin_r[m, c] = ((r - 1) / (len(r) - 1) if len(r) > 1 else r*0.0).astype(np.float32)
test_lin_mean = np.mean(test_lin_folds, axis=0)
test_lin_r = rank_cols(test_lin_mean)

# Build current best 4-view OOF under power transform g_best
A_o = pow_rank(OA_r, g_best); B_o = pow_rank(OB_r, g_best); C_o = pow_rank(OC_r, g_best); P_o = pow_rank(oof_ens_pl_v2, g_best)
oof_3 = wA_best * A_o + wB_best * B_o + wC_best * C_o
oof_4 = (1.0 - wp_best) * oof_3 + wp_best * P_o

# Apply same power to LR view ranks
L_o = pow_rank(oof_lin_r, g_best)

# Combined intersection mask (previous 4-view mask) ∩ LR mask
mask_4 = (inter_mask_abc.astype(bool) & inter_mask_plv2.astype(bool)).astype(np.uint8)
mask_lin = vmask_lin.astype(np.uint8)
mask_comb5 = (mask_4.astype(bool) & mask_lin.astype(bool)).astype(np.uint8)

# Tune small weight for LR view in [0..0.15]; score the wl=0 baseline on the same mask as the candidates
best_wl, best_macro = 0.0, masked_auc_macro(Y_full, oof_4, mask_comb5)
for wl in np.linspace(0.0, 0.15, 16):
    blend_oof5 = (1.0 - wl) * oof_4 + wl * L_o
    macro = masked_auc_macro(Y_full, blend_oof5, mask_comb5)
    if macro > best_macro:
        best_macro = macro; best_wl = float(wl)
print(f"[5-view add LR] Best masked OOF macro={best_macro:.4f} at w_lr={best_wl:.3f}", flush=True)

# Build test with best_wl
RA = pow_rank(TA.astype(np.float32), g_best)
RB = pow_rank(TB.astype(np.float32), g_best)
RC = pow_rank(test_ens_st.astype(np.float32), g_best)
RPL = pow_rank(test_ens_pl_v2.astype(np.float32), g_best)
RL  = pow_rank(test_lin_r.astype(np.float32), g_best)
test_3 = wA_best * RA + wB_best * RB + wC_best * RC
test_4 = (1.0 - wp_best) * test_3 + wp_best * RPL
test_final = (1.0 - best_wl) * test_4 + best_wl * RL

# Write submission
test_id_order = sub_df[['Id','rec_id','class_id']].copy()
rec_to_idx = {rid:i for i, rid in enumerate(sorted(test_rec_ids))}
probs = []
for _, row in test_id_order.iterrows():
    rid = int(row['rec_id']); cid = int(row['class_id'])
    ti = rec_to_idx[rid]
    p = float(test_final[ti, cid])
    probs.append(min(max(p, 0.0), 1.0))
sub_out = pd.DataFrame({'Id': test_id_order['Id'].astype(int), 'Probability': np.round(probs, 6)})
sub_out.to_csv('submission.csv', index=False)
print(f"Wrote submission.csv with {len(sub_out)} rows (5th view LR added, w_lr={best_wl:.3f}, g={g_best})")
display(sub_out.head())

[LR View] Fold 0: macro=0.6257 | n_val=31 | time=22.3s
[LR View] Fold 1: macro=0.6528 | n_val=27 | time=24.5s
[LR View] Fold 2: macro=0.6595 | n_val=26 | time=0.5s
[LR View] Fold 3: macro=0.6993 | n_val=31 | time=0.5s
[LR View] Fold 4: macro=0.6095 | n_val=30 | time=0.5s
[5-view add LR] Best masked OOF macro=0.7702 at w_lr=0.000
Wrote submission.csv with 1216 rows (5th view LR added, w_lr=0.000, g=1.0)

    Id  Probability
0  100     0.246032
1  101     0.472355
2  102     0.429147
3  103     0.694682
4  104     0.385245
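
Each view is converted to column-wise ranks in [0, 1] before blending, and the ranks are computed only over rows that have a valid out-of-fold prediction. The sketch below shows that masked rank normalization on toy data using NumPy only; `average_ranks` and `masked_unit_ranks` are hypothetical names, and the tie handling is meant to mirror `rankdata(..., method='average')` rather than replace it.

```python
import numpy as np

def average_ranks(x):
    """1-based ranks where tied values share the mean of their positions."""
    x = np.asarray(x, dtype=np.float64)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x), dtype=np.float64)
    ranks[order] = np.arange(1, len(x) + 1)
    _, inv, counts = np.unique(x, return_inverse=True, return_counts=True)
    sums = np.bincount(inv, weights=ranks)  # sum of positions per distinct value
    return sums[inv] / counts[inv]

def masked_unit_ranks(pred, mask):
    """Column-wise ranks scaled to [0, 1], computed only where mask is set; NaN elsewhere."""
    out = np.full(pred.shape, np.nan, dtype=np.float32)
    for c in range(pred.shape[1]):
        m = mask[:, c].astype(bool)
        n = int(m.sum())
        if n == 0:
            continue
        r = average_ranks(pred[m, c])
        out[m, c] = (r - 1.0) / (n - 1.0) if n > 1 else 0.0
    return out

# Toy data: 4 rows, 1 class, two hypothetical models on different score scales
p1 = np.array([[0.10], [0.40], [0.40], [0.90]])
p2 = np.array([[2.0], [9.0], [5.0], [7.0]])
mask = np.array([[1], [1], [1], [0]], dtype=np.uint8)  # last row has no valid OOF

r1 = masked_unit_ranks(p1, mask)
r2 = masked_unit_ranks(p2, mask)
blend = 0.7 * r1 + 0.3 * r2  # NaN propagates on rows outside the mask
print(blend.ravel())
```

Because rows outside the mask stay NaN after blending, any metric computed on the blend has to use a mask that is the intersection of every contributing view's mask.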


In [58]:
# Station one-hot feature augmentation + compact CB bag (View A_stn) + re-run power+weights (swap A with A_stn) -> submission
import re, time, numpy as np, pandas as pd
from sklearn.model_selection import GroupKFold
from sklearn.metrics import roc_auc_score
from catboost import CatBoostClassifier, Pool
from scipy.stats import rankdata

def extract_station(stem):
    if not isinstance(stem, str): return None
    m = re.search(r'(PC\d+)', stem); return m.group(1) if m else None

def masked_auc_macro(y_true, y_pred, valid_mask):
    C = y_true.shape[1]
    aucs = []
    for c in range(C):
        m = valid_mask[:, c].astype(bool)
        if m.sum() <= 1: continue
        yt = y_true[m, c]; yp = y_pred[m, c]
        if yt.max() == yt.min(): continue
        try: aucs.append(roc_auc_score(yt, yp))
        except Exception: pass
    return float(np.mean(aucs)) if len(aucs) else float('nan')

def rank_cols(mat):
    N, C = mat.shape
    out = np.zeros_like(mat, dtype=np.float32)
    for c in range(C):
        r = rankdata(mat[:, c], method='average')
        out[:, c] = (r - 1) / (N - 1) if N > 1 else r*0.0
    return out.astype(np.float32)

def pow_rank(mat, gamma):
    return np.clip(np.power(mat, gamma, dtype=np.float64), 0.0, 1.0).astype(np.float32)

# Preconditions for blending later
assert 'OB_r' in globals() and 'OC_r' in globals() and 'oof_ens_pl_v2' in globals() and 'inter_mask_plv2' in globals(), 'Run cells 30/27/36 for OB_r/OC_r/PL V2.'
assert 'TB' in globals() and 'test_ens_st' in globals() and 'test_ens_pl_v2' in globals(), 'Need test ranks for B/C/PL V2.'

# 1) Build station one-hot dummies and augment features
train_ids_sorted = pd.Index(sorted(train_rec_ids))
test_ids_sorted = pd.Index(sorted(test_rec_ids))
tr_stations = train_df.set_index('rec_id').loc[train_ids_sorted, 'filename'].map(extract_station).fillna('UNK')
te_stations = pd.Series([id2fn[int(r)] for r in test_ids_sorted], index=test_ids_sorted).map(extract_station).fillna('UNK')
d_tr = pd.get_dummies(tr_stations, prefix='stn').astype(np.float32)
d_te = pd.get_dummies(te_stations, prefix='stn').astype(np.float32)
d_all_cols = sorted(set(d_tr.columns).union(set(d_te.columns)))
d_tr = d_tr.reindex(columns=d_all_cols, fill_value=0.0).astype(np.float32)
d_te = d_te.reindex(columns=d_all_cols, fill_value=0.0).astype(np.float32)
X_tr_aug = np.hstack([X_train_df.values.astype(np.float32), d_tr.values.astype(np.float32)])
X_te_aug = np.hstack([X_test_df.values.astype(np.float32), d_te.values.astype(np.float32)])
Y_full = pd.DataFrame(Y, index=train_rec_ids).loc[train_ids_sorted].values.astype(np.float32)
N, C = X_tr_aug.shape[0], Y_full.shape[1]

# Groups for GroupKFold by station aligned to train_ids_sorted
groups = tr_stations.values
gkf = GroupKFold(n_splits=5)
folds = list(gkf.split(X_tr_aug, Y_full, groups))
print('Station one-hot columns:', len(d_all_cols), '| GroupKFold folds:', len(folds), flush=True)

# 2) Compact CB bag (3 variants) on augmented matrices -> OA_stn (ranked OOF), MA_stn (mask), TA_stn (ranked test)
variants = [
    dict(name='cb_stn_d3_lr002_l2_50_rsm06_sub085_s42', depth=3, learning_rate=0.02,  l2_leaf_reg=50.0, rsm=0.6, subsample=0.85, random_strength=0.8, seed=42),
    dict(name='cb_stn_d4_lr0015_l2_60_rsm05_sub08_s123', depth=4, learning_rate=0.015, l2_leaf_reg=60.0, rsm=0.5, subsample=0.80, random_strength=1.0, seed=123),
    dict(name='cb_stn_d3_lr002_l2_70_rsm06_sub085_s456', depth=3, learning_rate=0.02,  l2_leaf_reg=70.0, rsm=0.6, subsample=0.85, random_strength=0.8, seed=456),
]
oofs = []; masks = []; tests = []
for v in variants:
    print(f"=== Station-aug Variant {v['name']} ===", flush=True)
    oof = np.zeros((N, C), dtype=np.float32)
    vmask = np.zeros((N, C), dtype=np.uint8)
    test_folds = []
    for fi, (tr_idx, va_idx) in enumerate(folds):
        t0 = time.time()
        Xtr, Xva = X_tr_aug[tr_idx], X_tr_aug[va_idx]
        ytr, yva = Y_full[tr_idx], Y_full[va_idx]
        te_pred = np.zeros((X_te_aug.shape[0], C), dtype=np.float32)
        for c in range(C):
            ytr_c = ytr[:, c]; yva_c = yva[:, c]
            if ytr_c.max() == ytr_c.min() or yva_c.max() == yva_c.min():
                continue
            tr_pool = Pool(Xtr, label=ytr_c)
            va_pool = Pool(Xva, label=yva_c)
            cb = CatBoostClassifier(
                loss_function='Logloss',
                depth=v['depth'], learning_rate=v['learning_rate'],
                iterations=5000, l2_leaf_reg=v['l2_leaf_reg'],
                rsm=v['rsm'], subsample=v['subsample'], random_strength=v['random_strength'],
                auto_class_weights='Balanced', random_seed=v['seed'],
                early_stopping_rounds=200, verbose=False, task_type='CPU', thread_count=-1
            )
            cb.fit(tr_pool, eval_set=va_pool)
            oof[va_idx, c] = cb.predict_proba(Xva)[:, 1]
            te_pred[:, c] = cb.predict_proba(X_te_aug)[:, 1]
            vmask[va_idx, c] = 1
        fold_macro = masked_auc_macro(yva, oof[va_idx], vmask[va_idx])
        print(f"  Fold {fi}: masked macro={fold_macro:.4f} | n_val={len(va_idx)} | time={time.time()-t0:.1f}s", flush=True)
        test_folds.append(te_pred)
    oofs.append(oof); masks.append(vmask); tests.append(np.mean(test_folds, axis=0))
    macro_v = masked_auc_macro(Y_full, oof, vmask)
    print(f"Variant OOF masked macro: {macro_v:.4f}", flush=True)

# Rank-avg within view A_stn
inter_mask_A = masks[0].astype(bool)
for m in masks[1:]: inter_mask_A &= m.astype(bool)
MA_stn = inter_mask_A.astype(np.uint8)
ranked_oofs = []
for o in oofs:
    Nn, Cc = o.shape
    R = np.full_like(o, np.nan, dtype=np.float32)
    for c in range(Cc):
        m = MA_stn[:, c].astype(bool)
        if not m.any(): continue
        col = o[m, c]; r = rankdata(col, method='average')
        R[m, c] = ((r - 1) / (len(r) - 1) if len(r) > 1 else r*0.0).astype(np.float32)
    ranked_oofs.append(R)
OA_stn = np.nanmean(ranked_oofs, axis=0).astype(np.float32)
TA_stn = np.mean([rank_cols(tp) for tp in tests], axis=0).astype(np.float32)
macro_Astn = masked_auc_macro(Y_full, OA_stn, MA_stn)
print(f'[View A_stn] OOF masked macro={macro_Astn:.4f}', flush=True)

# 3) Re-run power+weights tuning with A_stn replacing A
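assert 'MB' in globals(), 'Need View B mask (from cell 30)'
MC = np.isfinite(OC_r).astype(np.uint8)
# Include the PL V2 mask: PL_o is NaN outside it, and a NaN blend makes roc_auc_score raise, which masked_auc_macro silently skips
inter_mask_new = (MA_stn.astype(bool) & MB.astype(bool) & MC.astype(bool) & inter_mask_plv2.astype(bool)).astype(np.uint8)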
gammas = [1.0, 1.1, 1.2, 1.3]
w_grid = np.linspace(0.0, 1.0, 21)
wp_grid = np.linspace(0.0, 0.3, 16)
best_cfg = None; best_macro = -1.0
for g in gammas:
    A_o = pow_rank(OA_stn, g); B_o = pow_rank(OB_r, g); C_o = pow_rank(OC_r, g); PL_o = pow_rank(oof_ens_pl_v2, g)
    for wB in w_grid:
        for wC in w_grid:
            wA = 1.0 - wB - wC
            if wA < -1e-9: continue  # tolerance: round-off can leave wA slightly negative when wB + wC == 1
            wA = max(wA, 0.0)
            oof_3 = wA * A_o + wB * B_o + wC * C_o
            for wp in wp_grid:
                blend_oof = (1.0 - wp) * oof_3 + wp * PL_o
                macro = masked_auc_macro(Y_full, blend_oof, inter_mask_new)
                if macro > best_macro:
                    best_macro = macro; best_cfg = (float(g), float(wA), float(wB), float(wC), float(wp))
g_best_astn, wA_best_astn, wB_best_astn, wC_best_astn, wp_best_astn = best_cfg
print(f"[Power+Weights (A_stn)] Best masked OOF={best_macro:.4f} | gamma={g_best_astn} | wA={wA_best_astn:.2f} wB={wB_best_astn:.2f} wC={wC_best_astn:.2f} | wp={wp_best_astn:.2f}", flush=True)

# 4) Build test with A_stn best config and write submission
RA = pow_rank(TA_stn.astype(np.float32), g_best_astn)
RB = pow_rank(TB.astype(np.float32), g_best_astn)
RC = pow_rank(test_ens_st.astype(np.float32), g_best_astn)
RPL = pow_rank(test_ens_pl_v2.astype(np.float32), g_best_astn)
test_3 = wA_best_astn * RA + wB_best_astn * RB + wC_best_astn * RC
test_final = (1.0 - wp_best_astn) * test_3 + wp_best_astn * RPL

test_id_order = sub_df[['Id','rec_id','class_id']].copy()
rec_to_idx = {rid:i for i, rid in enumerate(sorted(test_rec_ids))}
probs = []
for _, row in test_id_order.iterrows():
    rid = int(row['rec_id']); cid = int(row['class_id'])
    ti = rec_to_idx[rid]
    p = float(test_final[ti, cid])
    probs.append(min(max(p, 0.0), 1.0))
sub_out = pd.DataFrame({'Id': test_id_order['Id'].astype(int), 'Probability': np.round(probs, 6)})
sub_out.to_csv('submission.csv', index=False)
print(f"Wrote submission.csv with {len(sub_out)} rows (Station one-hot View A_stn + power+weights)")
display(sub_out.head())

Station one-hot columns: 13 | GroupKFold folds: 5

=== Station-aug Variant cb_stn_d3_lr002_l2_50_rsm06_sub085_s42 ===
  Fold 0: masked macro=0.7189 | n_val=31 | time=25.5s
  Fold 1: masked macro=0.7417 | n_val=27 | time=26.9s
  Fold 2: masked macro=0.7262 | n_val=26 | time=18.8s
  Fold 3: masked macro=0.7574 | n_val=31 | time=33.0s
  Fold 4: masked macro=0.7250 | n_val=30 | time=13.8s
Variant OOF masked macro: 0.7479

=== Station-aug Variant cb_stn_d4_lr0015_l2_60_rsm05_sub08_s123 ===
  Fold 0: masked macro=0.7178 | n_val=31 | time=35.2s
  Fold 1: masked macro=0.7562 | n_val=27 | time=54.0s
  Fold 2: masked macro=0.7142 | n_val=26 | time=35.2s
  Fold 3: masked macro=0.7459 | n_val=31 | time=58.0s
  Fold 4: masked macro=0.7338 | n_val=30 | time=22.2s
Variant OOF masked macro: 0.7580

=== Station-aug Variant cb_stn_d3_lr002_l2_70_rsm06_sub085_s456 ===
  Fold 0: masked macro=0.7103 | n_val=31 | time=26.4s
  Fold 1: masked macro=0.7587 | n_val=27 | time=31.5s
  Fold 2: masked macro=0.7165 | n_val=26 | time=16.9s
  Fold 3: masked macro=0.7231 | n_val=31 | time=36.0s
  Fold 4: masked macro=0.6913 | n_val=30 | time=14.6s
Variant OOF masked macro: 0.7349

[View A_stn] OOF masked macro=0.7514
[Power+Weights (A_stn)] Best masked OOF=0.7787 | gamma=1.0 | wA=0.55 wB=0.00 wC=0.45 | wp=0.22
Wrote submission.csv with 1216 rows (Station one-hot View A_stn + power+weights)

    Id  Probability
0  100     0.177937
1  101     0.414608
2  102     0.374345
3  103     0.694923
4  104     0.370348
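
One thing worth keeping in mind when reading the A_stn scores: with folds grouped by station, every validation station is absent from the training rows of its fold, so the one-hot column for that station is all zeros during training. The sketch below illustrates this on toy data with a hand-rolled group hold-out that stands in for `GroupKFold`; the station names and the helper `group_holdout_splits` are made up for the example.

```python
import numpy as np

# Toy station labels (hypothetical), two recordings per station
stations = np.array(["PC1", "PC1", "PC2", "PC2", "PC4", "PC4", "PC7", "PC7"])
levels, codes = np.unique(stations, return_inverse=True)
onehot = np.eye(len(levels), dtype=np.float32)[codes]  # one column per station

def group_holdout_splits(groups, n_splits):
    """Assign whole groups to folds round-robin (simplified stand-in for GroupKFold)."""
    uniq = np.unique(groups)
    for f in range(n_splits):
        held = uniq[f::n_splits]
        va = np.isin(groups, held)
        yield np.where(~va)[0], np.where(va)[0]

overlaps = []      # does any station column fire on both sides of a split?
unseen_cols = []   # station columns active in validation but all-zero in training
for tr_idx, va_idx in group_holdout_splits(stations, n_splits=2):
    active_in_val = onehot[va_idx].sum(axis=0) > 0
    seen_in_train = onehot[tr_idx].sum(axis=0) > 0
    overlaps.append(bool(np.any(active_in_val & seen_in_train)))
    unseen_cols.append(int(np.sum(active_in_val & ~seen_in_train)))
print(overlaps, unseen_cols)
```

Under this kind of split the out-of-fold score cannot reflect whatever the station columns contribute on stations the model has already seen, so if test recordings come from stations that also occur in training, the OOF estimate for A_stn may understate (or otherwise misjudge) their effect on the test set.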


In [55]:
# Station + Time feature augmentation (cyclical time) -> compact CB bag (View A_stn_time) -> re-run power+weights blend
import re, time, numpy as np, pandas as pd
from sklearn.model_selection import GroupKFold
from sklearn.metrics import roc_auc_score
from catboost import CatBoostClassifier, Pool
from scipy.stats import rankdata

def extract_station(stem):
    if not isinstance(stem, str): return None
    m = re.search(r'(PC\d+)', stem); return m.group(1) if m else None

def parse_datetime_from_stem(stem):
    # Examples: PC10_20090513_054500_0020 -> date 2009-05-13, time 05:45:00
    # Returns: year, month, day, hour, minute (ints); if fails, None
    if not isinstance(stem, str):
        return None
    m = re.search(r'_(\d{8})_(\d{6})_', stem + '_')  # ensure trailing underscore for regex
    if not m:
        return None
    d8 = m.group(1); t6 = m.group(2)
    try:
        year = int(d8[0:4]); month = int(d8[4:6]); day = int(d8[6:8])
        hour = int(t6[0:2]); minute = int(t6[2:4])
        return year, month, day, hour, minute
    except Exception:
        return None

def masked_auc_macro(y_true, y_pred, valid_mask):
    C = y_true.shape[1]
    aucs = []
    for c in range(C):
        m = valid_mask[:, c].astype(bool)
        if m.sum() <= 1: continue
        yt = y_true[m, c]; yp = y_pred[m, c]
        if yt.max() == yt.min(): continue
        try: aucs.append(roc_auc_score(yt, yp))
        except Exception: pass
    return float(np.mean(aucs)) if len(aucs) else float('nan')

def rank_cols(mat):
    N, C = mat.shape
    out = np.zeros_like(mat, dtype=np.float32)
    for c in range(C):
        r = rankdata(mat[:, c], method='average')
        out[:, c] = (r - 1) / (N - 1) if N > 1 else r*0.0
    return out.astype(np.float32)

def pow_rank(mat, gamma):
    return np.clip(np.power(mat, gamma, dtype=np.float64), 0.0, 1.0).astype(np.float32)

# Preconditions: existing views B/C/PL V2 for blending
assert 'OB_r' in globals() and 'OC_r' in globals() and 'oof_ens_pl_v2' in globals() and 'inter_mask_plv2' in globals(), 'Need OB_r/OC_r/PL V2 (run cells 30/27/36)'
assert 'TB' in globals() and 'test_ens_st' in globals() and 'test_ens_pl_v2' in globals(), 'Need TB/test_ens_st/test_ens_pl_v2 (run cells 30/27/36)'
assert 'MB' in globals(), 'Need MB mask from view B'  # MB is used when building inter_mask_new below

# 1) Build station OHE + time cyclical features
train_ids_sorted = pd.Index(sorted(train_rec_ids))
test_ids_sorted = pd.Index(sorted(test_rec_ids))
tr_stems = train_df.set_index('rec_id').loc[train_ids_sorted, 'filename']
te_stems = pd.Series([id2fn[int(r)] for r in test_ids_sorted], index=test_ids_sorted)
tr_stations = tr_stems.map(extract_station).fillna('UNK')
te_stations = te_stems.map(extract_station).fillna('UNK')
d_tr = pd.get_dummies(tr_stations, prefix='stn').astype(np.float32)
d_te = pd.get_dummies(te_stations, prefix='stn').astype(np.float32)
d_all_cols = sorted(set(d_tr.columns).union(set(d_te.columns)))
d_tr = d_tr.reindex(columns=d_all_cols, fill_value=0.0).astype(np.float32)
d_te = d_te.reindex(columns=d_all_cols, fill_value=0.0).astype(np.float32)

# Parse time
tr_dt = tr_stems.map(parse_datetime_from_stem)
te_dt = te_stems.map(parse_datetime_from_stem)
def dt_to_feats(dt_series):
    # Build month, day, hour, minute and cyclical encodings
    df = pd.DataFrame(index=dt_series.index)
    df['month'] = dt_series.map(lambda x: x[1] if x else np.nan).astype('float32')
    df['day'] = dt_series.map(lambda x: x[2] if x else np.nan).astype('float32')
    df['hour'] = dt_series.map(lambda x: x[3] if x else np.nan).astype('float32')
    df['minute'] = dt_series.map(lambda x: x[4] if x else np.nan).astype('float32')
    # fill missing values with -1, then zero out their cyclical encodings below
    df = df.fillna(-1.0).astype('float32')
    # cyclical encodings (month in [1..12] -> [0..1) ; hour in [0..23])
    m_norm = (df['month'] - 1.0) / 12.0
    h_norm = (df['hour']) / 24.0
    df['month_sin'] = np.sin(2*np.pi*m_norm).astype('float32')
    df['month_cos'] = np.cos(2*np.pi*m_norm).astype('float32')
    df['hour_sin'] = np.sin(2*np.pi*h_norm).astype('float32')
    df['hour_cos'] = np.cos(2*np.pi*h_norm).astype('float32')
    # Zero out cyclical for missing month/hour (marked -1)
    df.loc[df['month'] < 0, ['month_sin','month_cos']] = 0.0
    df.loc[df['hour'] < 0, ['hour_sin','hour_cos']] = 0.0
    return df[['month','day','hour','minute','month_sin','month_cos','hour_sin','hour_cos']].astype('float32')

tr_time = dt_to_feats(tr_dt)
te_time = dt_to_feats(te_dt)

# Assemble augmented matrices
X_tr_base = X_train_df.set_index(pd.Index(sorted(train_rec_ids))).loc[train_ids_sorted].values.astype(np.float32)
X_te_base = X_test_df.set_index(pd.Index(sorted(test_rec_ids))).loc[test_ids_sorted].values.astype(np.float32)
X_tr_aug = np.hstack([X_tr_base, d_tr.values.astype(np.float32), tr_time.values.astype(np.float32)])
X_te_aug = np.hstack([X_te_base, d_te.values.astype(np.float32), te_time.values.astype(np.float32)])
Y_full = pd.DataFrame(Y, index=train_rec_ids).loc[train_ids_sorted].values.astype(np.float32)
N, C = X_tr_aug.shape[0], Y_full.shape[1]

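# Groups for GroupKFold by station
# Note: each validation fold holds out whole stations, so the stn_* one-hot column of a held-out
# station is all zeros in that fold's training rows; the OOF score reflects unseen-station generalization.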
groups = tr_stations.values
gkf = GroupKFold(n_splits=5)
folds = list(gkf.split(X_tr_aug, Y_full, groups))
print('Station OHE:', len(d_all_cols), '| Time cols:', tr_time.shape[1], '| Aug dims:', X_tr_aug.shape[1], '| folds:', len(folds), flush=True)

# 2) Compact CB bag (3 variants) on augmented matrices -> OA_stn_time, MA_stn_time, TA_stn_time
variants = [
    dict(name='cb_stn_time_d3_lr002_l2_50_rsm06_sub085_s42', depth=3, learning_rate=0.02,  l2_leaf_reg=50.0, rsm=0.6, subsample=0.85, random_strength=0.8, seed=42),
    dict(name='cb_stn_time_d4_lr0015_l2_60_rsm05_sub08_s123', depth=4, learning_rate=0.015, l2_leaf_reg=60.0, rsm=0.5, subsample=0.80, random_strength=1.0, seed=123),
    dict(name='cb_stn_time_d3_lr002_l2_70_rsm06_sub085_s456', depth=3, learning_rate=0.02,  l2_leaf_reg=70.0, rsm=0.6, subsample=0.85, random_strength=0.8, seed=456),
]
oofs, masks, tests = [], [], []
for v in variants:
    print(f"=== Station+Time Variant {v['name']} ===", flush=True)
    oof = np.zeros((N, C), dtype=np.float32)
    vmask = np.zeros((N, C), dtype=np.uint8)
    test_folds = []
    for fi, (tr_idx, va_idx) in enumerate(folds):
        t0 = time.time()
        Xtr, Xva = X_tr_aug[tr_idx], X_tr_aug[va_idx]
        ytr, yva = Y_full[tr_idx], Y_full[va_idx]
        te_pred = np.zeros((X_te_aug.shape[0], C), dtype=np.float32)
        for c in range(C):
            ytr_c = ytr[:, c]; yva_c = yva[:, c]
            if ytr_c.max() == ytr_c.min() or yva_c.max() == yva_c.min():
                continue
            tr_pool = Pool(Xtr, label=ytr_c)
            va_pool = Pool(Xva, label=yva_c)
            cb = CatBoostClassifier(
                loss_function='Logloss',
                depth=v['depth'], learning_rate=v['learning_rate'],
                iterations=5000, l2_leaf_reg=v['l2_leaf_reg'],
                rsm=v['rsm'], subsample=v['subsample'], random_strength=v['random_strength'],
                auto_class_weights='Balanced', random_seed=v['seed'],
                early_stopping_rounds=200, verbose=False, task_type='CPU', thread_count=-1
            )
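            # eval_set is the validation fold itself, so early stopping is tuned on the same rows
            # that are scored out-of-fold; treat the OOF AUC as optimistic
            cb.fit(tr_pool, eval_set=va_pool)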
            oof[va_idx, c] = cb.predict_proba(Xva)[:, 1]
            te_pred[:, c] = cb.predict_proba(X_te_aug)[:, 1]
            vmask[va_idx, c] = 1
        fold_macro = masked_auc_macro(yva, oof[va_idx], vmask[va_idx])
        print(f"  Fold {fi}: masked macro={fold_macro:.4f} | n_val={len(va_idx)} | time={time.time()-t0:.1f}s", flush=True)
        test_folds.append(te_pred)
    oofs.append(oof); masks.append(vmask); tests.append(np.mean(test_folds, axis=0))
    macro_v = masked_auc_macro(Y_full, oof, vmask)
    print(f"Variant OOF masked macro: {macro_v:.4f}", flush=True)

# Rank-avg within view
inter_mask_time = masks[0].astype(bool)
for m in masks[1:]: inter_mask_time &= m.astype(bool)
MA_stn_time = inter_mask_time.astype(np.uint8)
ranked_oofs = []
for o in oofs:
    Nn, Cc = o.shape
    R = np.full_like(o, np.nan, dtype=np.float32)
    for c in range(Cc):
        m = MA_stn_time[:, c].astype(bool)
        if not m.any(): continue
        col = o[m, c]; r = rankdata(col, method='average')
        R[m, c] = ((r - 1) / (len(r) - 1) if len(r) > 1 else r*0.0).astype(np.float32)
    ranked_oofs.append(R)
OA_stn_time = np.nanmean(ranked_oofs, axis=0).astype(np.float32)
TA_stn_time = np.mean([rank_cols(tp) for tp in tests], axis=0).astype(np.float32)
macro_Astn_time = masked_auc_macro(Y_full, OA_stn_time, MA_stn_time)
print(f'[View A_stn_time] OOF masked macro={macro_Astn_time:.4f}')

# 3) Power+weights tuning with A_stn_time replacing A
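MC = np.isfinite(OC_r).astype(np.uint8)
# Note: inter_mask_plv2 is asserted above but not intersected here (the A_stn_cat cell does intersect it),
# so the best-OOF values of the two cells may be scored on different (row, class) entries.
inter_mask_new = (MA_stn_time.astype(bool) & MB.astype(bool) & MC.astype(bool)).astype(np.uint8)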
gammas = [1.0, 1.05, 1.1]
w_grid = np.linspace(0.0, 1.0, 21)
wp_grid = np.linspace(0.0, 0.3, 16)
best_cfg = None; best_macro = -1.0
for g in gammas:
    A_o = pow_rank(OA_stn_time, g); B_o = pow_rank(OB_r, g); C_o = pow_rank(OC_r, g); PL_o = pow_rank(oof_ens_pl_v2, g)
    for wB in w_grid:
        for wC in w_grid:
            wA = 1.0 - wB - wC
            if wA < 0 or wA > 1: continue
            oof_3 = wA * A_o + wB * B_o + wC * C_o
            for wp in wp_grid:
                blend_oof = (1.0 - wp) * oof_3 + wp * PL_o
                macro = masked_auc_macro(Y_full, blend_oof, inter_mask_new)
                if macro > best_macro:
                    best_macro = macro; best_cfg = (float(g), float(wA), float(wB), float(wC), float(wp))
g_best_ast, wA_best_ast, wB_best_ast, wC_best_ast, wp_best_ast = best_cfg
print(f"[Power+Weights (A_stn_time)] Best masked OOF={best_macro:.4f} | gamma={g_best_ast} | wA={wA_best_ast:.2f} wB={wB_best_ast:.2f} wC={wC_best_ast:.2f} | wp={wp_best_ast:.2f}", flush=True)

# 4) Build test and write submission
RA = pow_rank(TA_stn_time.astype(np.float32), g_best_ast)
RB = pow_rank(TB.astype(np.float32), g_best_ast)
RC = pow_rank(test_ens_st.astype(np.float32), g_best_ast)
RPL = pow_rank(test_ens_pl_v2.astype(np.float32), g_best_ast)
test_3 = wA_best_ast * RA + wB_best_ast * RB + wC_best_ast * RC
test_final = (1.0 - wp_best_ast) * test_3 + wp_best_ast * RPL

test_id_order = sub_df[['Id','rec_id','class_id']].copy()
rec_to_idx = {rid:i for i, rid in enumerate(sorted(test_rec_ids))}
probs = []
for _, row in test_id_order.iterrows():
    rid = int(row['rec_id']); cid = int(row['class_id'])
    ti = rec_to_idx[rid]
    p = float(test_final[ti, cid])
    probs.append(min(max(p, 0.0), 1.0))
sub_out = pd.DataFrame({'Id': test_id_order['Id'].astype(int), 'Probability': np.round(probs, 6)})
sub_out.to_csv('submission.csv', index=False)
print(f"Wrote submission.csv with {len(sub_out)} rows (Station+Time View A_stn_time + power+weights)")
display(sub_out.head())

Station OHE: 13 | Time cols: 8 | Aug dims: 718 | folds: 5
=== Station+Time Variant cb_stn_time_d3_lr002_l2_50_rsm06_sub085_s42 ===
  Fold 0: masked macro=0.7321 | n_val=31 | time=24.2s
  Fold 1: masked macro=0.7736 | n_val=27 | time=36.9s
  Fold 2: masked macro=0.7136 | n_val=26 | time=13.3s
  Fold 3: masked macro=0.7336 | n_val=31 | time=33.1s
  Fold 4: masked macro=0.6664 | n_val=30 | time=13.5s
Variant OOF masked macro: 0.7487
=== Station+Time Variant cb_stn_time_d4_lr0015_l2_60_rsm05_sub08_s123 ===
  Fold 0: masked macro=0.7097 | n_val=31 | time=50.9s
  Fold 1: masked macro=0.7833 | n_val=27 | time=80.1s
  Fold 2: masked macro=0.7273 | n_val=26 | time=19.7s
  Fold 3: masked macro=0.7344 | n_val=31 | time=55.2s
  Fold 4: masked macro=0.6805 | n_val=30 | time=21.5s
Variant OOF masked macro: 0.7546
=== Station+Time Variant cb_stn_time_d3_lr002_l2_70_rsm06_sub085_s456 ===
  Fold 0: masked macro=0.7247 | n_val=31 | time=21.4s
  Fold 1: masked macro=0.7838 | n_val=27 | time=46.3s
  Fold 2: masked macro=0.6649 | n_val=26 | time=13.3s
  Fold 3: masked macro=0.7487 | n_val=31 | time=25.2s
  Fold 4: masked macro=0.7035 | n_val=30 | time=15.2s
Variant OOF masked macro: 0.7567
[View A_stn_time] OOF masked macro=0.7606
[Power+Weights (A_stn_time)] Best masked OOF=0.7748 | gamma=1.0 | wA=0.65 wB=0.00 wC=0.35 | wp=0.24
Wrote submission.csv with 1216 rows (Station+Time View A_stn_time + power+weights)

    Id  Probability
0  100     0.189836
1  101     0.305042
2  102      0.31341
3  103      0.68573
4  104     0.420749

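As a quick check of the time features built in the cell above, here is a minimal standard-library sketch that parses a filename stem and computes the cyclical hour encoding. The helper names (`parse_hour_minute`, `hour_cyclical`, `circle_distance`) are illustrative and not part of the notebook; the stem layout is assumed to match the `PC10_20090513_054500_0020` example in the code comment. It shows why sin/cos is used: 23:00 and 00:00 land close together, while the raw hour values differ by 23.

```python
import math
import re

def parse_hour_minute(stem):
    """Return (hour, minute) from a stem like 'PC10_20090513_054500_0020', else None."""
    m = re.search(r'_(\d{8})_(\d{6})(?:_|$)', stem)
    if not m:
        return None
    t6 = m.group(2)
    return int(t6[0:2]), int(t6[2:4])

def hour_cyclical(hour):
    """Map an hour in [0, 24) onto the unit circle (same formula as hour_sin/hour_cos)."""
    angle = 2.0 * math.pi * hour / 24.0
    return math.sin(angle), math.cos(angle)

def circle_distance(h1, h2):
    """Euclidean distance between two hours in (sin, cos) space."""
    s1, c1 = hour_cyclical(h1)
    s2, c2 = hour_cyclical(h2)
    return math.hypot(s1 - s2, c1 - c2)

hour, minute = parse_hour_minute('PC10_20090513_054500_0020')
print(hour, minute)                                     # 5 45
print(circle_distance(23, 0) < circle_distance(0, 6))   # True
```

Missing timestamps have no natural place on the circle, which is why the cell above sets both components to 0.0 for them: the point (0, 0) lies off the unit circle and cannot be confused with a real hour.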

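The weight search above walks (wB, wC) on a 0.05 grid and derives wA = 1 - wB - wC. In floating point, wA can come out as a tiny negative number when wB + wC should equal exactly 1, and the `wA < 0` check then silently skips that boundary point. A minimal sketch of one way to avoid this, enumerating integer steps and dividing once (the name `simplex_weights` is hypothetical, not from the notebook):

```python
def simplex_weights(steps=20):
    """Yield (wA, wB, wC) with non-negative entries summing to 1 on a 1/steps grid."""
    for b in range(steps + 1):
        for c in range(steps + 1 - b):
            a = steps - b - c          # integer arithmetic, so a is never negative
            yield a / steps, b / steps, c / steps

weights = list(simplex_weights(20))
print(len(weights))                    # 231 grid points for a 0.05 step
print(1.0 - 0.9 - 0.1 < 0)             # True: the float pitfall described above
print((0.0, 0.9, 0.1) in weights)      # True: the boundary point is kept
```

Whether any boundary point was actually dropped in the run above is not known from the log; the sketch only shows how to rule the issue out.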
In [57]:
# A_stn_cat: station as single categorical feature (CatBoost) + compact blend
import re, time, numpy as np, pandas as pd
from sklearn.model_selection import GroupKFold
from sklearn.metrics import roc_auc_score
from catboost import CatBoostClassifier, Pool
from scipy.stats import rankdata

def extract_station(s):
    m = re.search(r'(PC\d+)', s) if isinstance(s, str) else None
    return m.group(1) if m else 'UNK'

def masked_auc_macro(y_true, y_pred, valid_mask):
    C = y_true.shape[1]; aucs = []
    for c in range(C):
        m = valid_mask[:, c].astype(bool)
        if m.sum() <= 1: continue
        yt, yp = y_true[m, c], y_pred[m, c]
        if yt.max() == yt.min(): continue
        try: aucs.append(roc_auc_score(yt, yp))
        except Exception: pass
    return float(np.mean(aucs)) if aucs else float('nan')

def rank_cols(mat):
    N, C = mat.shape; out = np.zeros_like(mat, dtype=np.float32)
    for c in range(C):
        r = rankdata(mat[:, c], method='average')
        out[:, c] = (r - 1) / (N - 1) if N > 1 else 0.0  # plain 0 has no .astype; out is already float32
    return out

def pow_rank(mat, g):
    return np.clip(np.power(mat, g, dtype=np.float64), 0.0, 1.0).astype(np.float32)

# Preconditions
assert 'OB_r' in globals() and 'OC_r' in globals() and 'oof_ens_pl_v2' in globals() and 'inter_mask_plv2' in globals(), 'Need OB_r/OC_r/PLV2'
assert 'TB' in globals() and 'test_ens_st' in globals() and 'test_ens_pl_v2' in globals(), 'Need TB/test_ens_st/test_ens_pl_v2'
assert 'MB' in globals(), 'Need MB mask from view B'

# Build station categorical DF
train_ids_sorted = pd.Index(sorted(train_rec_ids)); test_ids_sorted = pd.Index(sorted(test_rec_ids))
tr_stems = train_df.set_index('rec_id').loc[train_ids_sorted, 'filename']
te_stems = pd.Series([id2fn[int(r)] for r in test_ids_sorted], index=test_ids_sorted)
tr_station = tr_stems.map(extract_station); te_station = te_stems.map(extract_station)
X_tr = X_train_df.set_index(pd.Index(sorted(train_rec_ids))).loc[train_ids_sorted].reset_index(drop=True).astype(np.float32)
X_te = X_test_df.set_index(pd.Index(sorted(test_rec_ids))).loc[test_ids_sorted].reset_index(drop=True).astype(np.float32)
X_tr['station_id'] = tr_station.values.astype(str); X_te['station_id'] = te_station.values.astype(str)
cat_idx = [X_tr.columns.get_loc('station_id')]
Y_full = pd.DataFrame(Y, index=train_rec_ids).loc[train_ids_sorted].values.astype(np.float32)
N, C = X_tr.shape[0], Y_full.shape[1]
groups = tr_station.values
folds = list(GroupKFold(n_splits=5).split(X_tr, Y_full, groups))
print('Station categorical idx:', cat_idx[0], '| dims:', X_tr.shape, flush=True)

# Compact 3-variant CatBoost bag
variants = [
    dict(name='stncat_d3_s42', depth=3, learning_rate=0.02,  l2_leaf_reg=50.0, rsm=0.6, subsample=0.85, random_strength=0.8, seed=42),
    dict(name='stncat_d4_s123',depth=4, learning_rate=0.015, l2_leaf_reg=60.0, rsm=0.5, subsample=0.80, random_strength=1.0, seed=123),
    dict(name='stncat_d3_s456',depth=3, learning_rate=0.02,  l2_leaf_reg=70.0, rsm=0.6, subsample=0.85, random_strength=0.8, seed=456),
]
oofs, masks, tests = [], [], []
for v in variants:
    print('===', v['name'], '===', flush=True)
    oof = np.zeros((N, C), dtype=np.float32); vmask = np.zeros((N, C), dtype=np.uint8); test_folds = []
    for fi, (tr_idx, va_idx) in enumerate(folds):
        t0 = time.time(); Xtr_df, Xva_df = X_tr.iloc[tr_idx], X_tr.iloc[va_idx]; ytr, yva = Y_full[tr_idx], Y_full[va_idx]
        te_pred = np.zeros((X_te.shape[0], C), dtype=np.float32)
        for c in range(C):
            ytr_c, yva_c = ytr[:, c], yva[:, c]
            if ytr_c.max() == ytr_c.min() or yva_c.max() == yva_c.min(): continue
            tr_pool = Pool(Xtr_df, label=ytr_c, cat_features=cat_idx); va_pool = Pool(Xva_df, label=yva_c, cat_features=cat_idx)
            cb = CatBoostClassifier(loss_function='Logloss', depth=v['depth'], learning_rate=v['learning_rate'], iterations=5000,
                                 l2_leaf_reg=v['l2_leaf_reg'], rsm=v['rsm'], subsample=v['subsample'], random_strength=v['random_strength'],
                                 auto_class_weights='Balanced', random_seed=v['seed'], early_stopping_rounds=200, verbose=False, task_type='CPU', thread_count=-1)
            cb.fit(tr_pool, eval_set=va_pool)
            oof[va_idx, c] = cb.predict_proba(Xva_df)[:, 1]; te_pred[:, c] = cb.predict_proba(X_te)[:, 1]; vmask[va_idx, c] = 1
        print(f'  Fold {fi}: macro={masked_auc_macro(yva, oof[va_idx], vmask[va_idx]):.4f} | n_val={len(va_idx)} | t={time.time()-t0:.1f}s', flush=True)
        test_folds.append(te_pred)
    oofs.append(oof); masks.append(vmask); tests.append(np.mean(test_folds, axis=0))
    print('Variant OOF:', f'{masked_auc_macro(Y_full, oof, vmask):.4f}', flush=True)

# Rank-avg within A_stn_cat
inter_mask = masks[0].astype(bool)
for m in masks[1:]: inter_mask &= m.astype(bool)
MA_stn_cat = inter_mask.astype(np.uint8)
ranked_oofs = []
for o in oofs:
    R = np.full_like(o, np.nan, dtype=np.float32)
    for c in range(C):
        m = MA_stn_cat[:, c].astype(bool)
        if not m.any(): continue
        col = o[m, c]; r = rankdata(col, method='average')
        R[m, c] = (r - 1) / (len(r) - 1) if len(r) > 1 else 0.0  # plain 0 has no .astype; R is already float32
    ranked_oofs.append(R)
OA_stn_cat = np.nanmean(ranked_oofs, axis=0).astype(np.float32)
TA_stn_cat = np.mean([rank_cols(tp) for tp in tests], axis=0).astype(np.float32)
print('[A_stn_cat] OOF macro=', f'{masked_auc_macro(Y_full, OA_stn_cat, MA_stn_cat):.4f}', flush=True)

# Power+weights with A_stn_cat
MC = np.isfinite(OC_r).astype(np.uint8)
inter_mask_new = (MA_stn_cat.astype(bool) & MB.astype(bool) & MC.astype(bool) & inter_mask_plv2.astype(bool)).astype(np.uint8)
gammas = [1.0, 1.1]; w_grid = np.linspace(0.0, 1.0, 21); wp_grid = np.linspace(0.0, 0.3, 16)
best = None; best_macro = -1.0
for g in gammas:
    A_o = pow_rank(OA_stn_cat, g); B_o = pow_rank(OB_r, g); C_o = pow_rank(OC_r, g); P_o = pow_rank(oof_ens_pl_v2, g)
    for wB in w_grid:
        for wC in w_grid:
            wA = 1.0 - wB - wC
            if wA < 0 or wA > 1: continue
            o3 = wA*A_o + wB*B_o + wC*C_o
            for wp in wp_grid:
                blend = (1.0 - wp)*o3 + wp*P_o
                m = masked_auc_macro(Y_full, blend, inter_mask_new)
                if m > best_macro: best_macro = m; best = (float(g), float(wA), float(wB), float(wC), float(wp))
gA, wA, wB, wC, wp = best
print(f"[Blend A_stn_cat] Best OOF={best_macro:.4f} | g={gA} | wA={wA:.2f} wB={wB:.2f} wC={wC:.2f} | wp={wp:.2f}", flush=True)

# Build test and write submission
RA = pow_rank(TA_stn_cat.astype(np.float32), gA); RB = pow_rank(TB.astype(np.float32), gA)
RC = pow_rank(test_ens_st.astype(np.float32), gA); RPL = pow_rank(test_ens_pl_v2.astype(np.float32), gA)
test_3 = wA*RA + wB*RB + wC*RC; test_final = (1.0 - wp)*test_3 + wp*RPL
order = sub_df[['Id','rec_id','class_id']].copy(); rec_to_idx = {rid:i for i, rid in enumerate(sorted(test_rec_ids))}
probs = []
for _, r in order.iterrows():
    ti = rec_to_idx[int(r['rec_id'])]; cid = int(r['class_id'])
    p = float(test_final[ti, cid]); probs.append(min(max(p, 0.0), 1.0))
pd.DataFrame({'Id': order['Id'].astype(int), 'Probability': np.round(probs, 6)}).to_csv('submission.csv', index=False)
print('Wrote submission.csv (A_stn_cat + power+weights)')

Station categorical idx: 697 | dims: (145, 698)
=== stncat_d3_s42 ===
  Fold 0: macro=0.7520 | n_val=31 | t=35.2s
  Fold 1: macro=0.7644 | n_val=27 | t=49.3s
  Fold 2: macro=0.6944 | n_val=26 | t=14.7s
  Fold 3: macro=0.7403 | n_val=31 | t=26.4s
  Fold 4: macro=0.7066 | n_val=30 | t=20.6s
Variant OOF: 0.7427
=== stncat_d4_s123 ===
  Fold 0: macro=0.7289 | n_val=31 | t=55.7s
  Fold 1: macro=0.7952 | n_val=27 | t=104.2s
  Fold 2: macro=0.6866 | n_val=26 | t=49.8s
  Fold 3: macro=0.7218 | n_val=31 | t=66.9s
  Fold 4: macro=0.7068 | n_val=30 | t=44.2s
Variant OOF: 0.7310
=== stncat_d3_s456 ===
  Fold 0: macro=0.6949 | n_val=31 | t=34.3s
  Fold 1: macro=0.7650 | n_val=27 | t=50.4s
  Fold 2: macro=0.6663 | n_val=26 | t=20.3s
  Fold 3: macro=0.7254 | n_val=31 | t=38.8s
  Fold 4: macro=0.7078 | n_val=30 | t=19.5s
Variant OOF: 0.7261
[A_stn_cat] OOF macro= 0.7306
[Blend A_stn_cat] Best OOF=0.7679 | g=1.0 | wA=0.25 wB=0.15 wC=0.60 | wp=0.30
Wrote submission.csv (A_stn_cat + power+weights)

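The rank-average and power-blend steps used in the cells above can be illustrated on toy data. The notebook relies on `scipy.stats.rankdata`; the sketch below is a dependency-free equivalent for a single class column, with hypothetical helper names (`rank01`, `power_blend`) and made-up scores and weights.

```python
def rank01(values):
    """Average ranks rescaled to [0, 1]; tied values share their mean rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0      # zero-based mean rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return [r / (n - 1) if n > 1 else 0.0 for r in ranks]

def power_blend(views, weights, gamma=1.0):
    """Weighted sum of per-view rank scores, each raised to the power gamma."""
    n = len(views[0])
    return [sum(w * (v[i] ** gamma) for v, w in zip(views, weights)) for i in range(n)]

# Toy scores for one class over four recordings (not taken from the notebook).
view_a = rank01([0.20, 0.90, 0.20, 0.50])
view_c = rank01([0.10, 0.40, 0.80, 0.30])
blend = power_blend([view_a, view_c], weights=[0.6, 0.4], gamma=1.0)
print([round(x, 4) for x in view_a])
print([round(x, 4) for x in blend])
```

Ranking before blending puts every view on the same [0, 1] scale, so the weights compare orderings rather than raw probabilities, which suits an AUC metric that only depends on order.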

In [59]:
# Hedge submission: reuse A_stn view but with gamma=1.1 and same weights as best A_stn config
import numpy as np, pandas as pd

def pow_rank(mat, gamma):
    return np.clip(np.power(mat, gamma, dtype=np.float64), 0.0, 1.0).astype(np.float32)

assert 'TA_stn' in globals() and 'TB' in globals() and 'test_ens_st' in globals() and 'test_ens_pl_v2' in globals(), 'Run cell 42 first (A_stn) and have B/C/PL test ranks ready'
assert 'wA_best_astn' in globals() and 'wB_best_astn' in globals() and 'wC_best_astn' in globals() and 'wp_best_astn' in globals(), 'Run cell 42 to set best weights for A_stn'

g_hedge = 1.1
RA = pow_rank(TA_stn.astype(np.float32), g_hedge)
RB = pow_rank(TB.astype(np.float32), g_hedge)
RC = pow_rank(test_ens_st.astype(np.float32), g_hedge)
RPL = pow_rank(test_ens_pl_v2.astype(np.float32), g_hedge)
test_3 = wA_best_astn * RA + wB_best_astn * RB + wC_best_astn * RC
test_final = (1.0 - wp_best_astn) * test_3 + wp_best_astn * RPL

order = sub_df[['Id','rec_id','class_id']].copy()
rec_to_idx = {rid:i for i, rid in enumerate(sorted(test_rec_ids))}
probs = []
for _, r in order.iterrows():
    ti = rec_to_idx[int(r['rec_id'])]; cid = int(r['class_id'])
    p = float(test_final[ti, cid]); probs.append(min(max(p, 0.0), 1.0))
pd.DataFrame({'Id': order['Id'].astype(int), 'Probability': np.round(probs, 6)}).to_csv('submission.csv', index=False)
print('Wrote submission.csv (A_stn weights hedge, gamma=1.1)')

Wrote submission.csv (A_stn weights hedge, gamma=1.1)
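
Why changing only gamma can serve as a hedge: a power transform with gamma > 0 is monotone, so it leaves the ordering (and therefore the AUC) of any single view unchanged, but it can reorder a weighted sum of views. A toy illustration with made-up rank scores and equal weights (none of these numbers come from the notebook; `blend` is a hypothetical helper):

```python
def blend(a, b, gamma, w_a=0.5, w_b=0.5):
    """Weighted sum of two rank scores after the power transform."""
    return w_a * a ** gamma + w_b * b ** gamma

# Two recordings scored by two views (made-up rank scores in [0, 1]).
rec1 = (0.90, 0.10)   # the views disagree strongly
rec2 = (0.51, 0.51)   # the views agree on a middling rank

for gamma in (1.0, 1.1):
    s1, s2 = blend(*rec1, gamma), blend(*rec2, gamma)
    print(gamma, 'rec1 > rec2' if s1 > s2 else 'rec2 > rec1')
# gamma=1.0 ranks rec2 above rec1; gamma=1.1 ranks rec1 above rec2
```

With gamma > 1 the transform is convex, so a recording that one view ranks very high loses less than one that all views rank in the middle.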
