# Lean Rebuild Plan: MLSP 2013 Birds (Primary Plan)

Objective: Clean, tabular-only pipeline using exactly the 588 histogram features (no priors, no station one-hot encoding). Train 4 highly regularized models; select the best single model and the best blend by pooled OOF macro AUC. Target: LB medal.

Key constraints learned:
- 19 classes only (assertions enforced).
- No leakage: fit anything only on train folds.
- Correct scorer: pooled plain macro AUC over OOF probabilities.
- LOSO CV, handle single-class folds robustly.

Data & Features:
- Use existing histogram features spec (588-dim) identical to build_hist_features from production but exclude station priors and any station OHE.
- Ensure deterministic feature order and identical transforms for train/test.
- Robust loaders for species list and labels (header-safe).

Validation:
- LOSO CV (leave-one-station-out) based on station parsed from filenames/metadata.
- For any fold with classes missing in train: skip the fit for that class and backfill its validation predictions with a constant class prior (global prevalence computed on the training split; empirical-Bayes smoothing optional, but avoid station priors).
- Pooled OOF macro AUC across all samples as the single selection metric.
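
The validation rules above can be sketched as follows. This is a minimal illustration rather than the production loop: `loso_oof`, `pooled_macro_auc`, and the toy `score_by_first_feature` stand-in model are hypothetical names, and the constant fill uses the prevalence of the training split only.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.metrics import roc_auc_score

def pooled_macro_auc(Y, P):
    """Mean of per-class AUCs, each computed once over all pooled OOF rows."""
    aucs = []
    for c in range(Y.shape[1]):
        if Y[:, c].min() == Y[:, c].max():
            continue  # AUC is undefined for a single-class column
        aucs.append(roc_auc_score(Y[:, c], P[:, c]))
    return float(np.mean(aucs)) if aucs else float('nan')

def loso_oof(X, Y, groups, fit_predict):
    """Leave-one-station-out OOF predictions with a constant fill for
    classes that are single-valued in the training split."""
    groups = np.asarray(groups)
    P = np.zeros(Y.shape, dtype=float)
    for tr, va in LeaveOneGroupOut().split(X, groups=groups):
        for c in range(Y.shape[1]):
            y_tr = Y[tr, c]
            if y_tr.min() == y_tr.max():
                P[va, c] = y_tr.mean()  # prior from the training split only
                continue
            P[va, c] = fit_predict(X[tr], y_tr, X[va])
    return P

# Toy data: 3 stations, 2 classes; class 1 occurs only at station C.
X_demo = np.arange(12, dtype=float).reshape(6, 2)
Y_demo = np.array([[0, 0], [1, 0], [0, 0], [1, 0], [1, 1], [0, 0]])
groups_demo = ['A', 'A', 'B', 'B', 'C', 'C']

# Hypothetical stand-in for a real model: score rows by their first feature.
score_by_first_feature = lambda X_tr, y_tr, X_va: X_va[:, 0]

P_demo = loso_oof(X_demo, Y_demo, groups_demo, score_by_first_feature)
auc_demo = pooled_macro_auc(Y_demo, P_demo)
print(P_demo[:, 1], round(auc_demo, 4))
```

Because the metric is pooled, a constant fill in one fold is ranked against model scores from the other folds, so the fill value affects the AUC for that class.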

Models (4 total):
- CatBoost #1: strong regularization, depth ~4-6, l2 ~ 8-20, learning_rate ~ 0.03-0.06, iterations ~ 2000 with early stopping.
- CatBoost #2: even stronger regularization / different depth & lr to diversify.
- LightGBM: gbdt with high reg (min_data_in_leaf high, feature_fraction ~0.5-0.8, bagging_fraction ~0.6-0.9, lambda_l1/l2 > 0), early stopping.
- ExtraTrees (sklearn): many estimators (1000+), max_depth limited, min_samples_leaf >= 5, bootstrap=False.
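- All models output positive-class probabilities (column 1 of predict_proba), not labels or raw margins; verify the LightGBM column is not inverted (a per-class AUC well below 0.5 is the warning sign).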

Training Loop:
- Multilabel setup: fit one binary model per class (1-vs-rest) for all tree models. CatBoost also offers a native multi-label objective (MultiLogloss), but unless experts advise otherwise, use 1-vs-rest everywhere for consistency.
- Efficient parallelization per class or per fold using joblib.
- Log per-fold timing, class counts, and AUC per class; flush periodically.

Blending:
- Start with equal-weight average of model probabilities.
- Do simple weight sweep on OOF (grid over a simplex for 2-4 models) to find best pooled macro AUC; keep it tiny and fast.
- Select: (a) best single model by pooled OOF macro AUC, (b) best simple blend.
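
The weight sweep can stay tiny: with 4 models and a step of 0.1 the simplex grid has only 286 points. The sketch below is one possible implementation; `simplex_grid`, `sweep_blend_weights`, and the pairwise `binary_auc` are hypothetical helpers, and the AUC is written in plain NumPy only to keep the example self-contained.

```python
import itertools
import numpy as np

def binary_auc(y, s):
    """Mann-Whitney AUC with ties counted as 0.5; NaN if single-class."""
    pos, neg = s[y == 1], s[y == 0]
    if len(pos) == 0 or len(neg) == 0:
        return float('nan')
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

def pooled_macro_auc(P, Y):
    aucs = [binary_auc(Y[:, c], P[:, c]) for c in range(Y.shape[1])]
    aucs = [a for a in aucs if not np.isnan(a)]
    return float(np.mean(aucs)) if aucs else float('nan')

def simplex_grid(n_models, steps):
    """Yield weight vectors with entries k/steps that sum to 1."""
    for combo in itertools.product(range(steps + 1), repeat=n_models - 1):
        if sum(combo) <= steps:
            yield np.array(combo + (steps - sum(combo),), dtype=float) / steps

def sweep_blend_weights(oofs, Y, steps=10):
    """Return (weights, auc) maximizing pooled macro AUC on OOF predictions."""
    stack = np.stack(oofs)  # shape (n_models, N, C)
    best_w, best_auc = None, -np.inf
    for w in simplex_grid(len(oofs), steps):
        auc = pooled_macro_auc(np.tensordot(w, stack, axes=1), Y)
        if auc > best_auc:
            best_w, best_auc = w, auc
    return best_w, best_auc

# Toy check: one informative model and one inverted model.
Y_toy = np.array([[0], [0], [0], [1], [1], [1]])
oof_a = np.array([[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]])
oof_b = 1.0 - oof_a
best_w, best_auc = sweep_blend_weights([oof_a, oof_b], Y_toy, steps=10)
print(best_w, best_auc)
```

Weights chosen on OOF predictions are slightly optimistic for the blend's own OOF score, which is one reason to keep the grid coarse.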

Submission:
- Train on full train set per model, predict test, produce two submissions: best_single.csv and best_blend.csv.
- Sanity checks: match sample_submission columns and length.
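
The sanity checks could be sketched as below. The Id encoding `rec_id * 100 + class_index` follows the submission code later in this notebook; `build_submission` and `check_submission` are hypothetical helper names, and the toy sample frame stands in for sample_submission.csv.

```python
import numpy as np
import pandas as pd

def build_submission(rec_ids, P):
    """Flatten an (n_records, n_classes) matrix into long Id/Probability rows."""
    n, C = P.shape
    assert C <= 100, 'Id encoding rec_id*100 + class needs fewer than 100 classes'
    ids = (np.asarray(rec_ids, dtype=np.int64)[:, None] * 100
           + np.arange(C)[None, :]).ravel()
    sub = pd.DataFrame({'Id': ids, 'Probability': P.astype(float).ravel()})
    return sub.sort_values('Id').reset_index(drop=True)

def check_submission(sub, sample):
    """Raise AssertionError if the submission does not match the sample schema."""
    assert list(sub.columns) == list(sample.columns), 'column mismatch'
    assert len(sub) == len(sample), 'row count mismatch'
    assert sub['Id'].is_unique, 'duplicate Ids'
    assert set(sub['Id']) == set(sample['Id']), 'Id set mismatch'
    assert not sub['Probability'].isna().any(), 'NaN probabilities'
    assert sub['Probability'].between(0, 1).all(), 'probabilities outside [0, 1]'

# Toy example: two test recordings, 19 classes.
rec_ids_toy = [1, 3]
P_toy = np.linspace(0.0, 1.0, 2 * 19).reshape(2, 19)
sub_toy = build_submission(rec_ids_toy, P_toy)
sample_toy = pd.DataFrame({
    'Id': [r * 100 + c for r in rec_ids_toy for c in range(19)],
    'Probability': 0,
})
check_submission(sub_toy, sample_toy)
print(sub_toy.head())
```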

Checkpoints for Expert Review:
1) This plan & exact model hyperparameters (request guidance on precise values).
2) After implementing data loaders and feature builder (confirm 588 features, shapes, and no leakage).
3) After first LOSO CV run with one model to verify pooled OOF macro AUC stability.
4) After training all 4 models and producing OOFs; before weight sweep.
5) Before final training-on-full and submission generation.

Open Questions for Experts:
- Exact hyperparameters for the 4 models you recommend (depth, iterations, reg).
- Whether to use CatBoost's multi-label mode or stick to 1-vs-rest for reliability here.
- Any LOSO edge-case handling tips specific to this dataset.

Next Steps:
- Implement robust loaders + 588-feature builder (no priors/OHE).
- Implement LOSO splitter and pooled macro AUC scorer.
- Implement class-wise 1-vs-rest training for CatBoost/LGBM/ExtraTrees with logging.
- Run initial CV with one conservative model to validate pipeline, then expand to all four.

In [1]:
import os, json, numpy as np, pandas as pd, re, sys, gc, time
from pathlib import Path

def log(s):
    print(time.strftime('%H:%M:%S'), s, flush=True)

# Inspect available precomputed feature artifacts and submission schema
log('Listing candidate feature files...')
candidates = ['hog_train.npy','hog_test.npy','filtered_stats.npy','panns_cnn14_emb.npy','vggish_256.npy']
avail = {}
for f in candidates:
    if os.path.exists(f):
        try:
            arr = np.load(f, allow_pickle=True)
            shape = getattr(arr, 'shape', None)
            avail[f] = shape
            log(f'{f}: shape={shape}')
        except Exception as e:
            log(f'{f}: load error {e}')
    else:
        log(f'{f}: MISSING')

log('Reading sample_submission.csv ...')
sample = pd.read_csv('sample_submission.csv')
log(f'sample_submission shape={sample.shape}; head:\n{sample.head()}')

# Inspect essential mappings and filenames to derive stations and counts
rec_map_path = Path('essential_data/rec_id2filename.txt')
if rec_map_path.exists():
    # Robust load: auto-detect delimiter, skip header if present
    with open(rec_map_path, 'r') as fh:
        head = fh.readline()
    delim = ',' if ',' in head and '\t' not in head else ('\t' if '\t' in head else None)
    df_map = pd.read_csv(rec_map_path, sep=delim, engine='python')
    # Try to normalize columns
    # Normalize the first two columns to canonical names (header casing varies),
    # so the 'Filename' lookup below always finds the column
    if len(df_map.columns) >= 2:
        df_map.columns = ['Id','Filename'] + list(df_map.columns[2:])
    log(f'rec_id2filename rows={len(df_map)}, cols={df_map.columns.tolist()}')
    # Derive station as prefix before first underscore
    if 'Filename' in df_map.columns:
        df_map['station'] = df_map['Filename'].astype(str).str.split('_').str[0]
        log(f"Stations (n={df_map['station'].nunique()}): {sorted(df_map['station'].unique())[:10]}...")
else:
    log('rec_id2filename.txt not found')

# Inspect essential_data/species_list.txt for class count (should be 19)
species_path = Path('essential_data/species_list.txt')
if species_path.exists():
    with open(species_path, 'r') as fh:
        lines = [ln.strip() for ln in fh if ln.strip()]
    # Drop header if present (first line starts with 'species' or contains a delimiter)
    if lines and (lines[0].lower().startswith('species') or ',' in lines[0] or '\t' in lines[0]):
        lines = lines[1:]
    log(f'species_list entries={len(lines)} (expect 19)')
    if len(lines) != 19:
        log('WARNING: species count != 19')
else:
    log('species_list.txt not found')

log('Done.')

22:31:43 Listing candidate feature files...
22:31:43 hog_train.npy: shape=(145, 8100)
22:31:43 hog_test.npy: shape=(64, 8100)
22:31:43 filtered_stats.npy: shape=()
22:31:43 panns_cnn14_emb.npy: shape=(322, 2048)
22:31:43 vggish_256.npy: shape=(322, 256)
22:31:43 Reading sample_submission.csv ...
22:31:44 sample_submission shape=(1216, 2); head:
    Id  Probability
0  100            0
1  101            0
2  102            0
3  103            0
4  104            0
22:31:44 rec_id2filename rows=322, cols=['Id', 'Filename']
22:31:44 Stations (n=13): ['PC1', 'PC10', 'PC11', 'PC13', 'PC15', 'PC16', 'PC17', 'PC18', 'PC2', 'PC4']...
22:31:44 species_list entries=19 (expect 19)
22:31:44 Done.


In [2]:
# Data loading + 588-feature loader (filtered_stats.npy) and alignment
import numpy as np, pandas as pd, os, time, sys
from pathlib import Path

DATA_DIR = Path('essential_data')

def log(msg):
    print(time.strftime('%H:%M:%S'), msg, flush=True)

def load_species_list(path: Path):
    # Robust header-safe loader; expect 19 entries
    try:
        df = pd.read_csv(path)
        if {'class_id','code'}.issubset(df.columns):
            return df.sort_values('class_id')['code'].tolist()
    except Exception:
        pass
    with open(path, 'r') as f:
        lines = [ln.strip() for ln in f if ln.strip()]
    if lines and (lines[0].lower().startswith('species') or ',' in lines[0] or '\t' in lines[0]):
        lines = lines[1:]
    out = []
    for s in lines:
        parts = s.split(',')
        out.append(parts[1] if len(parts)>1 else s)
    return out

def parse_rec_id2filename(path: Path):
    df = pd.read_csv(path)
    df = df.rename(columns={df.columns[0]:'rec_id', df.columns[1]:'filename'})
    df['rec_id'] = df['rec_id'].astype(int)
    df['station'] = df['filename'].str.extract(r'^(PC\d+)')
    return df[['rec_id','filename','station']]

def parse_labels(path: Path, C: int):
    rec_ids, flags, Y = [], [], []
    with open(path, 'r') as f:
        f.readline()  # skip header
        for line in f:
            line = line.strip()
            if not line: continue
            parts = [tok.strip() for tok in line.split(',')]
            try: rid = int(parts[0])
            except: continue
            tokens = parts[1:] if len(parts)>1 else []
            is_test = any(tok=='?' for tok in tokens)
            y = np.zeros(C, dtype=np.uint8)
            if not is_test and tokens:
                for tok in tokens:
                    if tok in ('','?'): continue
                    try: idx = int(tok)
                    except: continue
                    if 0 <= idx < C: y[idx] = 1
            rec_ids.append(rid); flags.append(is_test); Y.append(y)
    lab_cols = [f'label_{i}' for i in range(C)]
    ydf = pd.DataFrame(np.vstack(Y), columns=lab_cols)
    df = pd.DataFrame({'rec_id': rec_ids, 'is_test': flags})
    return df.join(ydf), lab_cols

def load_filtered_stats(path: Path, mask_train=None, mask_test=None):
    arr = np.load(path, allow_pickle=True)
    # npz-like
    if hasattr(arr, 'files'):
        Xtr = arr['X_train'] if 'X_train' in arr.files else arr[arr.files[0]]
        Xte = arr['X_test'] if 'X_test' in arr.files else None
        id_tr = arr['train_ids'] if 'train_ids' in arr.files else None
        id_te = arr['test_ids'] if 'test_ids' in arr.files else None
        return Xtr, Xte, id_tr, id_te
    # object ndarray holding dict
    if isinstance(arr, np.ndarray) and arr.dtype == object:
        try:
            obj = arr.item()
        except Exception:
            obj = None
        if isinstance(obj, dict):
            def pick(*keys):
                # First non-None value; `a or b` raises ValueError on multi-element arrays
                return next((obj[k] for k in keys if obj.get(k) is not None), None)
            Xtr = pick('X_train', 'train', 'train_X')
            Xte = pick('X_test', 'test', 'test_X')
            id_tr = pick('train_ids', 'ids_train', 'rec_ids_train')
            id_te = pick('test_ids', 'ids_test', 'rec_ids_test')
            return Xtr, Xte, id_tr, id_te
    # single 2D stacked array
    if isinstance(arr, np.ndarray) and arr.ndim == 2 and mask_train is not None and mask_test is not None:
        assert arr.shape[0] == int(mask_train.sum() + mask_test.sum()), 'Rows mismatch total records'
        return arr[mask_train], arr[mask_test], None, None
    raise RuntimeError('Unsupported filtered_stats.npy format')

def align_by_ids(X, ids, order_ids):
    if ids is None:
        # assume already aligned to order
        return X
    mp = {int(r): i for i, r in enumerate(ids)}
    idx = [mp[int(r)] for r in order_ids]
    return X[idx]

# 1) Core metadata
species = load_species_list(DATA_DIR/'species_list.txt')
assert len(species) == 19, f'Expected 19 species, got {len(species)}'
rec_map = parse_rec_id2filename(DATA_DIR/'rec_id2filename.txt')
labels_df, _ = parse_labels(DATA_DIR/'rec_labels_test_hidden.txt', len(species))
df_all = rec_map.merge(labels_df, on='rec_id', how='right')
train_df = df_all[~df_all['is_test']].copy()
test_df = df_all[df_all['is_test']].copy()
y_train = train_df[[c for c in train_df.columns if c.startswith('label_')]].copy()
y_train.columns = [f'label_{s}' for s in species]
meta_train = train_df[['rec_id','filename','station']].copy()
meta_test = test_df[['rec_id','filename','station']].copy()
log(f'Train N={len(train_df)} Test N={len(test_df)} Classes={y_train.shape[1]} Stations={train_df.station.nunique()}')

# 2) Load 588-d filtered stats and align
mask_train = (~df_all['is_test']).values
mask_test = (df_all['is_test']).values
Xtr_raw, Xte_raw, id_tr, id_te = load_filtered_stats(Path('filtered_stats.npy'), mask_train=mask_train, mask_test=mask_test)
Xtr = align_by_ids(Xtr_raw, id_tr, meta_train['rec_id'].values.tolist())
Xte = align_by_ids(Xte_raw, id_te, meta_test['rec_id'].values.tolist())
assert Xtr is not None and Xte is not None, 'Failed to load/align filtered_stats'
log(f'Filtered stats shapes: X_train={getattr(Xtr, "shape", None)}, X_test={getattr(Xte, "shape", None)}')
assert Xtr.shape[1] == 588, f'Expected 588 features, got {Xtr.shape[1]}'
assert y_train.shape[1] == 19, f'Expected 19 label columns, got {y_train.shape[1]}'
assert len(Xtr) == len(meta_train) and len(Xte) == len(meta_test), 'Row count mismatch after alignment'

# 3) Quick summaries
pos_per_class = y_train.values.sum(axis=0)
neg_per_class = len(y_train) - pos_per_class
log('Class positives (first 10): ' + ', '.join(map(str, pos_per_class[:10].tolist())))
log('Any empty classes in train? ' + str(bool((pos_per_class==0).any())))
log('Stations: ' + ', '.join(sorted(train_df.station.unique().tolist())))
log('Done data/feature prep.')

22:32:48 Train N=258 Test N=64 Classes=19 Stations=13


AssertionError: Failed to load/align filtered_stats

In [3]:
# Diagnose filtered_stats.npy structure
import numpy as np
from pathlib import Path
p = Path('filtered_stats.npy')
arr = np.load(p, allow_pickle=True)
print('loaded type:', type(arr), 'shape:', getattr(arr, 'shape', None), 'dtype:', getattr(arr, 'dtype', None))
if hasattr(arr, 'files'):
    print('npz-like files:', arr.files)
elif isinstance(arr, np.ndarray) and arr.dtype == object:
    try:
        obj = arr.item()
        if isinstance(obj, dict):
            print('top-level dict keys:', list(obj.keys()))
            for k, v in obj.items():
                if hasattr(v, 'shape'):
                    print(' key', k, '-> shape', v.shape, 'dtype', getattr(v, 'dtype', None))
                else:
                    print(' key', k, '-> type', type(v))
        else:
            print('top-level object type:', type(obj))
    except Exception as e:
        print('arr.item() failed:', e)
else:
    print('Unsupported container; direct shape:', getattr(arr, 'shape', None))

loaded type: <class 'numpy.ndarray'> shape: () dtype: object
top-level dict keys: ['mean', 'std', 'Hmean', 'Wmean']
 key mean -> shape () dtype float64
 key std -> type <class 'float'>
 key Hmean -> type <class 'int'>
 key Wmean -> type <class 'int'>
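
The diagnostic output above shows that filtered_stats.npy holds only normalization scalars (mean, std, Hmean, Wmean), not a feature matrix, so the loader in cell 2 returned None for every array. A loader could fail fast with a descriptive error instead. The sketch below is one way to do that; `first_present` and `extract_feature_matrix` are hypothetical helpers, and the candidate key names are the ones tried in cell 2.

```python
import numpy as np

def first_present(d, keys):
    """Return the first non-None value among d[k] for k in keys, else None."""
    for k in keys:
        v = d.get(k)
        if v is not None:
            return v
    return None

def extract_feature_matrix(obj, keys=('X_train', 'train', 'train_X'),
                           n_features=588):
    """Pull a 2D feature matrix out of a dict artifact or raise a clear error."""
    X = first_present(obj, keys)
    if X is None:
        raise KeyError(
            f'none of {keys} found; available keys: {sorted(obj)}')
    X = np.asarray(X)
    if X.ndim != 2 or X.shape[1] != n_features:
        raise ValueError(
            f'expected shape (N, {n_features}), got {X.shape}')
    return X

# The structure reported by the diagnostic cell: scalars only.
stats_only = {'mean': 0.1, 'std': 0.2, 'Hmean': 3, 'Wmean': 4}
try:
    extract_feature_matrix(stats_only)
except KeyError as e:
    print('load failed:', e)
```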


In [5]:
# Lean rebuild: robust loaders and 588-feature builder (no priors/OHE), assemble X/y and LOSO groups
import os, sys, re, time, gc, warnings
from pathlib import Path
import numpy as np
import pandas as pd
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.metrics import roc_auc_score

warnings.filterwarnings('ignore')
DATA_DIR = Path('essential_data')
SUPP_DIR = Path('supplemental_data')

def log(msg):
    print(time.strftime('%H:%M:%S'), msg, flush=True)

def load_species_list(path: Path):
    # CSV with headers class_id,code,species; fallback to line list
    try:
        df = pd.read_csv(path)
        if {'class_id','code'}.issubset(df.columns):
            return df.sort_values('class_id')['code'].tolist()
    except Exception:
        pass
    lines = []
    with open(path, 'r') as f:
        for ln in f:
            s = ln.strip()
            if not s: continue
            if s.lower().startswith('species') or ',' in s or '\t' in s:
                continue
            lines.append(s)
    return lines

def parse_rec_id2filename(path: Path):
    df = pd.read_csv(path)
    df = df.rename(columns={df.columns[0]:'rec_id', df.columns[1]:'filename'})
    df['rec_id'] = df['rec_id'].astype(int)
    df['station'] = df['filename'].str.extract(r'^(PC\d+)')
    return df[['rec_id','filename','station']]

def parse_labels(path: Path, C: int, species_codes):
    rec_ids, flags, Y = [], [], []
    with open(path, 'r') as f:
        _ = f.readline()
        for line in f:
            line = line.strip()
            if not line: continue
            parts = [tok.strip() for tok in line.split(',')]
            try: rid = int(parts[0])
            except: continue
            tokens = parts[1:] if len(parts)>1 else []
            is_test = any(tok=='?' for tok in tokens)
            y = np.zeros(C, dtype=np.uint8)
            if not is_test:
                for tok in tokens:
                    if tok in ('','?'): continue
                    try: idx = int(tok)
                    except: continue
                    if 0 <= idx < C: y[idx] = 1
            rec_ids.append(rid); flags.append(is_test); Y.append(y)
    lab_df = pd.DataFrame(np.vstack(Y), columns=[f'label_{s}' for s in species_codes])
    base = pd.DataFrame({'rec_id': rec_ids, 'is_test': flags})
    return pd.concat([base, lab_df], axis=1)

def _safe_int(tok):
    try: return int(tok)
    except: return None

def _parse_numeric_list(tokens):
    vals = []
    for t in tokens:
        try: vals.append(float(t))
        except:
            t2 = re.sub(r'[^0-9eE+\-\.]', '', t)
            if t2 != '':
                try: vals.append(float(t2))
                except: pass
    return vals

def load_histograms(path: Path):
    rows = []
    with open(path, 'r') as f:
        first = f.readline()
        delim = ',' if ',' in first else None
        parts = [p for p in (first.strip().split(',') if delim==',' else first.strip().split()) if p!='']
        rid = _safe_int(parts[0]) if parts else None
        if rid is not None:
            rows.append((rid, _parse_numeric_list(parts[1:])))
        for line in f:
            p = [pp for pp in (line.strip().split(',') if delim==',' else line.strip().split()) if pp!='']
            if not p: continue
            rid = _safe_int(p[0])
            if rid is None: continue
            rows.append((rid, _parse_numeric_list(p[1:])))
    if not rows:
        return pd.DataFrame({'rec_id': []}), []
    maxK = max(len(r[1]) for r in rows)
    rec_ids, data = [], []
    for rid, vals in rows:
        if len(vals) < maxK: vals = vals + [0.0]*(maxK-len(vals))
        elif len(vals) > maxK: vals = vals[:maxK]
        rec_ids.append(rid); data.append(vals)
    H = np.asarray(data, dtype=float)
    hist_cols = [f'hist_{i:03d}' for i in range(H.shape[1])]
    df = pd.DataFrame(H, columns=hist_cols)
    df.insert(0, 'rec_id', rec_ids)
    return df, hist_cols

def load_segment_features(path: Path):
    rows = []
    with open(path, 'r') as f:
        first = f.readline()
        delim = ',' if ',' in first else None
        parts = [p for p in (first.strip().split(',') if delim==',' else first.strip().split()) if p!='']
        rid = _safe_int(parts[0]) if parts else None
        if rid is not None:
            rows.append((rid, _parse_numeric_list(parts[1:])))
        for line in f:
            p = [pp for pp in (line.strip().split(',') if delim==',' else line.strip().split()) if pp!='']
            if not p: continue
            rid = _safe_int(p[0])
            if rid is None: continue
            rows.append((rid, _parse_numeric_list(p[1:])))
    if not rows: return pd.DataFrame({'rec_id': []}), []
    maxM = max(len(r[1]) for r in rows)
    rec_ids = [r[0] for r in rows]
    X = []
    for _, vals in rows:
        if len(vals) < maxM: vals = vals + [0.0]*(maxM-len(vals))
        elif len(vals) > maxM: vals = vals[:maxM]
        X.append(vals)
    X = np.asarray(X, dtype=float)
    seg_cols = [f'seg_{j:03d}' for j in range(X.shape[1])]
    seg_df = pd.DataFrame(X, columns=seg_cols)
    seg_df.insert(0, 'rec_id', rec_ids)
    return seg_df, seg_cols

def aggregate_segments(seg_df: pd.DataFrame, seg_cols):
    # per rec_id aggregations; include skew to match 588 feature target
    g = seg_df.groupby('rec_id')
    aggs = g[seg_cols].agg(['mean','std','min','max','median','skew'])
    aggs.columns = [f"{c}_{stat}" for c, stat in aggs.columns]
    aggs = aggs.reset_index()
    cnt = g.size().rename('n_seg').reset_index()
    out = aggs.merge(cnt, on='rec_id', how='left')
    out['n_seg_log1p'] = np.log1p(out['n_seg'])
    return out

def build_hist_features(hist_df: pd.DataFrame, hist_cols):
    H = hist_df[hist_cols].values.astype(float)
    n_bins = H.shape[1]
    raw = pd.DataFrame(H, columns=[f'h_raw_{i:03d}' for i in range(n_bins)])
    log1p = pd.DataFrame(np.log1p(H), columns=[f'h_log1p_{i:03d}' for i in range(n_bins)])
    sums = H.sum(axis=1, keepdims=True) + 1e-9
    prop = H / sums
    prop_df = pd.DataFrame(prop, columns=[f'h_prop_{i:03d}' for i in range(n_bins)])
    ent = -(prop * (np.log(prop + 1e-12))).sum(axis=1)
    b0, b1, b2, b3 = 0, n_bins//3, 2*n_bins//3, n_bins
    def band_stats(M, prefix):
        low, mid, high = M[:, b0:b1], M[:, b1:b2], M[:, b2:b3]
        d = {}
        for name, part in zip(['low','mid','high'], [low,mid,high]):
            d[f'{prefix}{name}_sum'] = part.sum(axis=1)
            d[f'{prefix}{name}_mean'] = part.mean(axis=1)
        d[f'{prefix}low_mid_ratio'] = d[f'{prefix}low_sum']/np.clip(d[f'{prefix}mid_sum'],1e-8,None)
        d[f'{prefix}low_high_ratio'] = d[f'{prefix}low_sum']/np.clip(d[f'{prefix}high_sum'],1e-8,None)
        d[f'{prefix}mid_high_ratio'] = d[f'{prefix}mid_sum']/np.clip(d[f'{prefix}high_sum'],1e-8,None)
        return d
    band_raw = band_stats(H, 'band_raw_')
    band_log = band_stats(np.log1p(H), 'band_log1p_')
    band_prop = band_stats(prop, 'band_prop_')
    HHI = (prop**2).sum(axis=1); gini_imp = 1.0 - HHI; renyi2 = -np.log(HHI + 1e-12)
    max_bin_prop = prop.max(axis=1)
    part = np.partition(prop, -2, axis=1)[:, -2:]; top2_sum_prop = part.sum(axis=1)
    idx = np.arange(n_bins).astype(float); idx_z = (idx - idx.mean())/(idx.std()+1e-9)
    centroid = (prop * idx).sum(axis=1)
    spread = (prop * (idx - centroid[:, None])**2).sum(axis=1)
    slope = (prop * idx_z).sum(axis=1) / (idx_z.var() + 1e-9)
    def row_moments(M):
        mu = M.mean(axis=1, keepdims=True); sd = M.std(axis=1, keepdims=True) + 1e-9
        z = (M - mu)/sd
        return (z**3).mean(axis=1), (z**4).mean(axis=1)
    skew_raw, kurt_raw = row_moments(H); skew_prop, kurt_prop = row_moments(prop)
    L = np.log1p(prop); L_mean = L.mean(axis=1); L_std = L.std(axis=1); L_ent = -(L*np.log(L+1e-12)).sum(axis=1)
    p10 = np.percentile(H, 10, axis=1); p25 = np.percentile(H, 25, axis=1); p75 = np.percentile(H, 75, axis=1); p90 = np.percentile(H, 90, axis=1)
    summa = H.sum(axis=1)
    extras = pd.DataFrame({
        'hist_entropy': ent, 'hist_sum': summa, 'hist_p10': p10, 'hist_p25': p25, 'hist_p75': p75, 'hist_p90': p90,
        'prop_HHI': HHI, 'prop_gini_impurity': gini_imp, 'prop_renyi2': renyi2, 'prop_max_bin': max_bin_prop, 'prop_top2_sum': top2_sum_prop,
        'spec_centroid': centroid, 'spec_spread': spread, 'spec_slope': slope,
        'raw_skew': skew_raw, 'raw_kurt': kurt_raw, 'prop_skew': skew_prop, 'prop_kurt': kurt_prop,
        'log1pprop_mean': L_mean, 'log1pprop_std': L_std, 'log1pprop_entropy': L_ent,
    })
    for d in (band_raw, band_log, band_prop):
        for k, v in d.items(): extras[k] = v
    out = pd.concat([hist_df[['rec_id']], raw, log1p, prop_df, extras], axis=1)
    return out

def assemble_dataset():
    species = load_species_list(DATA_DIR/'species_list.txt')
    assert len(species) == 19, f'Expected 19 species, got {len(species)}'
    rec_map = parse_rec_id2filename(DATA_DIR/'rec_id2filename.txt')
    labels = parse_labels(DATA_DIR/'rec_labels_test_hidden.txt', len(species), species)
    hist_df_raw, hist_cols = load_histograms(SUPP_DIR/'histogram_of_segments.txt')
    seg_df_raw, seg_cols = load_segment_features(SUPP_DIR/'segment_features.txt')
    hist_feats = build_hist_features(hist_df_raw, hist_cols)
    seg_agg = aggregate_segments(seg_df_raw, seg_cols) if len(seg_cols) else pd.DataFrame({'rec_id': rec_map['rec_id']})
    df = rec_map.merge(labels, on='rec_id', how='right')
    df = df.merge(hist_feats, on='rec_id', how='left')
    df = df.merge(seg_agg, on='rec_id', how='left')
    # derive simple time features
    dt_str = df['filename'].str.split('_').str[1]
    ts = pd.to_datetime(dt_str, format='%Y%m%d', errors='coerce')
    df['month'] = ts.dt.month.fillna(0).astype(int)
    df['day_of_year'] = ts.dt.dayofyear.fillna(0).astype(int)
    df['doy_sin'] = np.sin(2*np.pi*df['day_of_year']/366.0)
    df['doy_cos'] = np.cos(2*np.pi*df['day_of_year']/366.0)
    label_cols = [c for c in df.columns if c.startswith('label_')]
    feature_exclude = ['rec_id','filename','station','is_test'] + label_cols
    feature_cols = [c for c in df.columns if c not in feature_exclude]
    train_df = df[~df['is_test']].copy()
    test_df = df[df['is_test']].copy()
    X_train = train_df[feature_cols].fillna(0)
    y_train = train_df[label_cols].copy()
    X_test = test_df[feature_cols].fillna(0)
    groups = train_df['station'].values
    meta_train = train_df[['rec_id','filename','station']].copy()
    meta_test = test_df[['rec_id','filename','station']].copy()
    return species, X_train, y_train, X_test, groups, meta_train, meta_test, feature_cols, label_cols

# Build dataset
species, X_train, y_train, X_test, groups, meta_train, meta_test, feature_cols, label_cols = assemble_dataset()
log(f'Train={len(X_train)} Test={len(X_test)} Features={len(feature_cols)} Labels={len(label_cols)} Stations={len(pd.unique(groups))}')
assert len(feature_cols) == 588, f'Expected 588 features, got {len(feature_cols)}'
assert y_train.shape[1] == 19, f'y_train columns != 19'
assert len(X_test) == len(meta_test) and len(X_train) == len(meta_train)
log('Sample features: ' + ', '.join(feature_cols[:5]) + ' ...')
log('Stations: ' + ', '.join(sorted(pd.unique(groups))))
pos = y_train.values.sum(axis=0); log('Positives per class (first 10): ' + ', '.join(map(str, pos[:10].tolist())))
logo = LeaveOneGroupOut()
splits = [(tr, va) for tr, va in logo.split(np.arange(len(groups)), groups=groups)]
log(f'Built LOSO splits: {len(splits)} folds')

22:37:09 Train=258 Test=64 Features=588 Labels=19 Stations=13
22:37:09 Sample features: h_raw_000, h_raw_001, h_raw_002, h_raw_003, h_raw_004 ...
22:37:09 Stations: PC1, PC10, PC11, PC13, PC15, PC16, PC17, PC18, PC2, PC4, PC5, PC7, PC8
22:37:09 Positives per class (first 10): 7, 28, 19, 4, 9, 5, 19, 19, 24, 21
22:37:09 Built LOSO splits: 13 folds


In [None]:
# LOSO CV training: 4 models (CatBoost x2, LightGBM, ExtraTrees), 1-vs-rest, pooled OOF macro AUC; generate two submissions
import numpy as np, pandas as pd, time, sys, gc, warnings, subprocess, importlib, os
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import ExtraTreesClassifier

warnings.filterwarnings('ignore')

# Ensure lightgbm and catboost are available
try:
    import lightgbm as lgb
except Exception:
    subprocess.run([sys.executable, '-m', 'pip', 'install', '-q', 'lightgbm==4.3.0'], check=True)
    import lightgbm as lgb
try:
    from catboost import CatBoostClassifier
except Exception:
    subprocess.run([sys.executable, '-m', 'pip', 'install', '-q', 'catboost==1.2.5'], check=True)
    from catboost import CatBoostClassifier

def log(msg):
    print(time.strftime('%H:%M:%S'), msg, flush=True)

def macro_auc_pooled(P: np.ndarray, Y: np.ndarray) -> float:
    C = Y.shape[1]; aucs = []
    for c in range(C):
        yt = Y[:, c]; yp = P[:, c]
        if yt.sum() == 0 or yt.sum() == len(yt):
            continue
        try:
            aucs.append(roc_auc_score(yt, yp))
        except Exception:
            pass
    return float(np.mean(aucs)) if aucs else np.nan

# Configs (use exactly as advised; note: CatBoost's Bayesian bootstrap does not accept 'subsample', so sanitize_cb drops it)
cb1_params = dict(loss_function='Logloss', eval_metric='AUC', iterations=3000, learning_rate=0.03, depth=4, l2_leaf_reg=80, subsample=0.7, colsample_bylevel=0.6, random_strength=0.2, bagging_temperature=0.5, bootstrap_type='Bayesian', early_stopping_rounds=300, verbose=False, random_seed=42)
cb2_params = dict(loss_function='Logloss', eval_metric='AUC', iterations=3000, learning_rate=0.025, depth=5, l2_leaf_reg=100, subsample=0.8, colsample_bylevel=0.5, random_strength=0.2, bagging_temperature=0.8, bootstrap_type='Bayesian', early_stopping_rounds=300, verbose=False, random_seed=2025)
lgb_params = dict(objective='binary', metric='auc', n_estimators=3000, learning_rate=0.02, num_leaves=16, max_depth=5, min_child_samples=20, subsample=0.7, colsample_bytree=0.6, reg_alpha=0.1, reg_lambda=20, is_unbalance=True, n_jobs=-1, random_state=42, boosting_type='gbdt')
et_params = dict(n_estimators=1000, max_features=0.4, min_samples_leaf=5, min_samples_split=5, bootstrap=False, class_weight='balanced', n_jobs=-1, random_state=42)

def sanitize_cb(params: dict) -> dict:
    p = params.copy()
    # CatBoost with bootstrap_type='Bayesian' does not support 'subsample'; drop it
    if str(p.get('bootstrap_type', '')).lower() == 'bayesian' and 'subsample' in p:
        p.pop('subsample', None)
    p['allow_writing_files'] = False
    return p

models_cfg = [
    {'name': 'CatBoost#1', 'type': 'catboost', 'params': cb1_params},
    {'name': 'CatBoost#2', 'type': 'catboost', 'params': cb2_params},
    {'name': 'LightGBM',   'type': 'lightgbm', 'params': lgb_params},
    {'name': 'ExtraTrees', 'type': 'extratrees','params': et_params},
]

# Build LOSO splits
from sklearn.model_selection import LeaveOneGroupOut
logo = LeaveOneGroupOut()
idx_all = np.arange(len(groups))
splits = list(logo.split(idx_all, groups=groups))
log(f'LOSO folds: {len(splits)}')

# Global prevalence from FULL training set (used for missing-class folds)
Y_full = y_train.values.astype(np.uint8)
p_global_full = Y_full.mean(axis=0)

oof_by_model = []
test_by_model = []
scores = []

for mi, cfg in enumerate(models_cfg):
    name = cfg['name']; mtype = cfg['type']; params = cfg['params']
    N, C = len(X_train), y_train.shape[1]
    P_oof = np.zeros((N, C), dtype=np.float32)
    fold_t0 = time.time()
    for fold, (tr, va) in enumerate(splits):
        t_fold = time.time()
        Xtr = X_train.iloc[tr]; Xva = X_train.iloc[va]
        Ytr = y_train.iloc[tr].values.astype(np.uint8)
        # Per-class training
        for c in range(C):
            ytr_c = Ytr[:, c]
            # Missing-class in TRAIN split -> fill VAL with global prevalence; skip fit
            if ytr_c.min() == ytr_c.max():
                P_oof[va, c] = p_global_full[c]
                continue
            if mtype == 'catboost':
                params_cb = sanitize_cb(params)
                model = CatBoostClassifier(**params_cb)
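                # Early stopping on the held-out station. Caveat: this tunes the iteration count on the
                # same labels that are scored, so the OOF AUC is optimistically biased; AUC is also
                # undefined when the held-out station has a single class for c.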
                model.fit(Xtr, ytr_c, eval_set=(Xva, y_train.iloc[va, c].values.astype(np.uint8)))
                P_oof[va, c] = model.predict_proba(Xva)[:, 1]
                del model
            elif mtype == 'lightgbm':
                model = lgb.LGBMClassifier(**params)
                model.fit(Xtr.values, ytr_c, eval_set=[(Xva.values, y_train.iloc[va, c].values.astype(np.uint8))], eval_metric='auc', callbacks=[lgb.early_stopping(stopping_rounds=300, verbose=False)])
                P_oof[va, c] = model.predict_proba(Xva.values)[:, 1]
                del model
            elif mtype == 'extratrees':
                model = ExtraTreesClassifier(**params)
                model.fit(Xtr.values, ytr_c)
                P_oof[va, c] = model.predict_proba(Xva.values)[:, 1]
                del model
        log(f'[{name}] fold {fold+1:02d}/{len(splits)} done in {time.time()-t_fold:.1f}s')
        gc.collect()
    auc = macro_auc_pooled(P_oof, y_train.values.astype(np.uint8))
    oof_by_model.append(P_oof)
    scores.append((name, auc))
    log(f'[{name}] pooled OOF macro AUC: {auc:.4f} | dt={time.time()-fold_t0:.1f}s')
    gc.collect()

# Equal-weight OOF blend
P_oof_stack = np.stack(oof_by_model, axis=0)
P_oof_mean = P_oof_stack.mean(axis=0)
auc_blend = macro_auc_pooled(P_oof_mean, y_train.values.astype(np.uint8))
log('Equal-weight 4-model blend pooled OOF macro AUC: %.4f' % auc_blend)

# Select best single model
best_idx = int(np.argmax([s[1] for s in scores]))
best_name, best_auc = scores[best_idx]
log(f'Best single model: {best_name} | AUC={best_auc:.4f}')

# Full-train models and generate two submissions
def train_full_and_predict(cfg):
    name = cfg['name']; mtype = cfg['type']; params = cfg['params']
    C = y_train.shape[1]
    P_test = np.zeros((len(X_test), C), dtype=np.float32)
    for c in range(C):
        y_c = y_train.iloc[:, c].values.astype(np.uint8)
        if y_c.min() == y_c.max():
            P_test[:, c] = p_global_full[c]
            continue
        if mtype == 'catboost':
            params_cb = sanitize_cb(params)
            model = CatBoostClassifier(**params_cb)
            model.fit(X_train, y_c)
            P_test[:, c] = model.predict_proba(X_test)[:, 1]
            del model
        elif mtype == 'lightgbm':
            model = lgb.LGBMClassifier(**params)
            model.fit(X_train.values, y_c)
            P_test[:, c] = model.predict_proba(X_test.values)[:, 1]
            del model
        elif mtype == 'extratrees':
            model = ExtraTreesClassifier(**params)
            model.fit(X_train.values, y_c)
            P_test[:, c] = model.predict_proba(X_test.values)[:, 1]
            del model
    gc.collect()
    return P_test

log('Full-train: generating test predictions for all four models...')
test_preds_all = []
for cfg in models_cfg:
    log(f'Full-train {cfg["name"]} ...')
    test_preds_all.append(train_full_and_predict(cfg))

# Best single submission
best_test = test_preds_all[best_idx]
rows_single = []
test_rec_ids = meta_test['rec_id'].values.tolist()
C = y_train.shape[1]
for i, rid in enumerate(test_rec_ids):
    for c in range(C):
        rows_single.append((rid*100 + c, float(best_test[i, c])))
sub_single = pd.DataFrame(rows_single, columns=['Id','Probability']).sort_values('Id').reset_index(drop=True)
sub_single.to_csv('submission_rebuild_single.csv', index=False)

# Equal-weight blend submission
blend_test = np.mean(np.stack(test_preds_all, axis=0), axis=0)
rows_blend = []
for i, rid in enumerate(test_rec_ids):
    for c in range(C):
        rows_blend.append((rid*100 + c, float(blend_test[i, c])))
sub_blend = pd.DataFrame(rows_blend, columns=['Id','Probability']).sort_values('Id').reset_index(drop=True)
sub_blend.to_csv('submission_rebuild_blend.csv', index=False)

log('Saved submission_rebuild_single.csv and submission_rebuild_blend.csv')
print('Model OOF scores:')
for n, s in scores:
    print(f'  {n}: {s:.4f}')
print('Blend OOF macro AUC:', f'{auc_blend:.4f}')

22:39:50 LOSO folds: 13
