# OpenVaccine: COVID-19 mRNA Degradation — Plan

Goals:
- Build a medal-winning solution (multi-class-log-loss ≤ 0.35340).
- Establish robust CV mirroring test; ship fast baseline; iterate with FE and model diversity; ensemble.

Initial Facts:
- Data: train.json, test.json, sample_submission.csv.
- Target: per-base predictions for several reactivity/deg signals; metric: multi-class-log-loss.
- Common winning approaches: sequence + structure features; deep models (GRU/LSTM/Transformer), sometimes + pseudo-labeling; strong CV by sequence length buckets.

Validation Strategy:
- GroupKFold by sequence length buckets (or sequence id) to mimic test distribution and avoid leakage across positions.
- OOF per base, compute multi-class-log-loss with same targets as submission.

Baseline v0:
- Simple featureizer: one-hot nucleotides, base pairing info (from provided structure strings if available), position, length-normalized position.
- Model: light baseline (e.g., CatBoost/XGBoost per-target) to get quick OOF and sanity-check pipeline.
- Ship a valid submission quickly.

Feature Engineering (Iterative):
- Sequence encodings: k-mer counts, rolling windows, GC content, local neighborhood features.
- Structure encodings: from dot-bracket to pairing indices, loop types, paired distance, stems/loops length, minimum free energy if available.
- Positional features: relative position, ends proximity, segment IDs.
- Augment with base quality heuristics (masking/unmasking missing).

Modeling Path:
1) Baseline tree models (GPU XGBoost/CatBoost) per target with multioutput wrapper; quick CV.
2) RNN/Transformer sequence model using PyTorch (cu121) with positional/structural embeddings; train with masking to variable lengths; early stopping; mixed precision.
3) Blend/stack: weighted average of diverse seeds/models; calibration if needed.

Execution Plan (24h):
T0: Environment + GPU check
T1: Data loading + schema/target check; build CV folds; metric function replica
T2: Baseline features + fast GPU tree model; OOF + submission v0
T3: Error analysis; add key structural features; v1
T4: PyTorch model (GRU baseline); v2
T5: Ensembling and tuning; final submission

Checkpoints requiring expert review:
- After this plan
- After data/EDA/CV setup
- After baseline v0 OOF
- After deep model v1 OOF
- Before final blend

Next Action:
- Run GPU/environment check, list files, preview data schema to define targets and submission columns.

In [1]:
# Environment + Data schema check
import os, sys, json, time, subprocess, shutil, math
import pandas as pd
import numpy as np

def run(cmd):
    return subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True).stdout

print('=== GPU CHECK (nvidia-smi) ===', flush=True)
print(run(['bash','-lc','nvidia-smi || true']))

print('=== CWD and files ===', flush=True)
print('CWD:', os.getcwd())
for f in sorted(os.listdir('.')):
    try:
        sz = os.path.getsize(f)
    except Exception:
        sz = -1
    print(f'{f}\t{sz:,} bytes')

print('\n=== sample_submission.csv preview ===', flush=True)
sub = pd.read_csv('sample_submission.csv')
print('shape:', sub.shape)
print('columns:', list(sub.columns))
print(sub.head(3))

def load_json(path):
    with open(path, 'r') as f:
        return json.load(f)

print('\n=== Load train.json/test.json ===', flush=True)
train = load_json('train.json')
test = load_json('test.json')
print('train entries:', len(train), 'test entries:', len(test))
if len(train):
    print('train[0] keys:', sorted(train[0].keys()))
    # Inspect basic fields
    ex = train[0]
    for k in ['id','seq','structure','predicted_loop_type','signal_to_noise','SN_filter']:
        if k in ex:
            if isinstance(ex[k], (list, tuple)):
                print(f'{k}: list len {len(ex[k])}')
            else:
                print(f'{k}:', ex[k])
    # Detect targets for regression
    possible_targets = ['reactivity','deg_Mg_pH10','deg_pH10','deg_Mg_50C','deg_50C']
    found_targets = [k for k in possible_targets if k in ex]
    if found_targets:
        print('Detected regression targets:', found_targets, 'lengths:', [len(ex[k]) for k in found_targets])
    else:
        # Try to detect classification label field
        label_keys = [k for k,v in ex.items() if isinstance(v, list) and len(v)>0 and isinstance(v[0], (int,float,str)) and k not in ['seq','structure','predicted_loop_type']]
        print('Potential per-base label keys:', label_keys)

# Infer task from submission columns
cols = list(sub.columns)
task = 'unknown'
if any(c.startswith('class_') for c in cols[1:]):
    task = 'classification'
elif any(c in cols for c in ['reactivity','deg_Mg_pH10','deg_pH10','deg_Mg_50C','deg_50C']):
    task = 'regression'
print('\nInferred task type from submission columns:', task)

# Check id format
id_col = cols[0] if len(cols)>0 else None
print('ID column:', id_col)
if id_col and id_col.lower().startswith('id'):
    # Try to parse pos from id_seqpos like '{id}_{pos}'
    sample_ids = sub[id_col].head(5).tolist()
    print('Sample IDs:', sample_ids)
    def parse_pos(s):
        try:
            return int(s.split('_')[-1])
        except Exception:
            return None
    poses = [parse_pos(s) for s in sub[id_col].head(500).tolist()]
    poses = [p for p in poses if p is not None]
    if poses:
        print('Submission pos stats: min=', min(poses), 'max=', max(poses))

print('\n=== DONE: Decide next steps based on detected schema ===', flush=True)

=== GPU CHECK (nvidia-smi) ===


Sat Sep 27 03:12:45 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.06             Driver Version: 550.144.06     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A10-24Q                 On  |   00000002:00:00.0 Off |                    0 |
| N/A   N/A    P0             N/A /  N/A  |     182MiB /  24512MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

CWD: /var/lib/simon/agent_run_states/stanford-covid-vaccine-20250927-030510
.00_eda_and_planning_kernel_state.json	182 bytes
00_eda_and_planning.ipynb	7,822 bytes
agent_metadata	4,096 bytes
description.md	13,414 bytes
docker_run.log	47,527 bytes
requirements.txt	2,021 bytes
sample_submission.csv	2,079,421 bytes
submission.csv	2,079,421 bytes
task.txt	4,854 bytes
test.json	105,970 bytes
train.json	11,445,889 bytes

=== sample_submission.csv preview ===


shape: (25680, 6)
columns: ['id_seqpos', 'reactivity', 'deg_Mg_pH10', 'deg_pH10', 'deg_Mg_50C', 'deg_50C']
        id_seqpos  reactivity  deg_Mg_pH10  deg_pH10  deg_Mg_50C  deg_50C
0  id_00b436dec_0         0.0          0.0       0.0         0.0      0.0
1  id_00b436dec_1         0.0          0.0       0.0         0.0      0.0
2  id_00b436dec_2         0.0          0.0       0.0         0.0      0.0

=== Load train.json/test.json ===


JSONDecodeError: Extra data: line 2 column 1 (char 5340)

In [2]:
# Robust JSONL loader and schema inspection
import json

def load_jsonl(path):
    rows = []
    with open(path, 'r') as f:
        for i, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue
            try:
                rows.append(json.loads(line))
            except json.JSONDecodeError as e:
                print(f'JSON decode error at line {i}: {e}')
                raise
    return rows

print('=== Reload as JSON Lines ===', flush=True)
train = load_jsonl('train.json')
test = load_jsonl('test.json')
print('train entries:', len(train), 'test entries:', len(test))
ex = train[0]
print('train[0] keys:', sorted(ex.keys()))
L = len(ex.get('seq',''))
print('Example length:', L)
targets = ['reactivity','deg_Mg_pH10','deg_pH10','deg_Mg_50C','deg_50C']
present = [t for t in targets if t in ex]
print('Targets present:', present, 'lens:', [len(ex[t]) for t in present])
print('Has SN fields:', 'signal_to_noise' in ex, 'SN_filter' in ex)
print('Structure fields present:', 'structure' in ex, 'predicted_loop_type' in ex)

# Quick DF overview
import pandas as pd
df_train = pd.DataFrame([{
    'id': r.get('id'),
    'len': len(r.get('seq','')),
    'SN_filter': r.get('SN_filter', None),
    'has_structure': 'structure' in r,
    'has_loop_type': 'predicted_loop_type' in r
} for r in train])
print(df_train.describe(include='all'))

# Determine scored positions from submission (max pos + 1)
def parse_pos(s):
    try:
        return int(s.split('_')[-1])
    except Exception:
        return None
poses = [parse_pos(x) for x in pd.read_csv('sample_submission.csv')['id_seqpos'].head(2000)]
poses = [p for p in poses if p is not None]
L_pred = (max(poses)+1) if poses else None
print('Inferred scored positions (L_pred):', L_pred)
print('=== JSONL load OK ===', flush=True)

=== Reload as JSON Lines ===


train entries: 2160 test entries: 240
train[0] keys: ['SN_filter', 'deg_50C', 'deg_Mg_50C', 'deg_Mg_pH10', 'deg_error_50C', 'deg_error_Mg_50C', 'deg_error_Mg_pH10', 'deg_error_pH10', 'deg_pH10', 'id', 'index', 'predicted_loop_type', 'reactivity', 'reactivity_error', 'seq_length', 'seq_scored', 'sequence', 'signal_to_noise', 'structure']
Example length: 0
Targets present: ['reactivity', 'deg_Mg_pH10', 'deg_pH10', 'deg_Mg_50C', 'deg_50C'] lens: [68, 68, 68, 68, 68]
Has SN fields: True True
Structure fields present: True True
                  id     len    SN_filter has_structure has_loop_type
count           2160  2160.0  2160.000000          2160          2160
unique          2160     NaN          NaN             1             1
top     id_001f94081     NaN          NaN          True          True
freq               1     NaN          NaN          2160          2160
mean             NaN     0.0     0.624537           NaN           NaN
std              NaN     0.0     0.484354          

In [4]:
# Build features, CV, XGBoost baseline, OOF MCRMSE, and submission
import math, gc, time
from collections import defaultdict, deque
import numpy as np, pandas as pd
from sklearn.model_selection import GroupKFold
from sklearn.metrics import mean_squared_error
import xgboost as xgb

TARGETS = ['reactivity','deg_Mg_pH10','deg_pH10','deg_Mg_50C','deg_50C']

def mcrmse(y_true, y_pred):
    rmses = []
    for i in range(y_true.shape[1]):
        rmses.append(mean_squared_error(y_true[:, i], y_pred[:, i], squared=False))
    return float(np.mean(rmses))

def parse_pairs(dot):
    stack = []
    pair = [-1]*len(dot)
    for i,ch in enumerate(dot):
        if ch == '(':
            stack.append(i)
        elif ch == ')':
            if stack:
                j = stack.pop()
                pair[i] = j
                pair[j] = i
    return pair

def window_gc(seq, k=11):
    n = len(seq)
    half = k//2
    arr = np.zeros(n, dtype=np.float32)
    for i in range(n):
        l = max(0, i-half)
        r = min(n, i+half+1)
        s = seq[l:r]
        if len(s)==0:
            arr[i] = 0.0
        else:
            gc_count = s.count('G') + s.count('C')
            arr[i] = gc_count/len(s)
    return arr

def build_features_rows(record):
    rid = record['id']
    seq = record['sequence']
    struct = record['structure']
    loop = record['predicted_loop_type']
    L = len(seq)
    pair = parse_pairs(struct)
    is_paired = [1 if p!=-1 else 0 for p in pair]
    pair_dist = [abs(i-pair[i]) if pair[i]!=-1 else -1 for i in range(L)]
    gc_win = window_gc(seq, 11)
    rows = []
    for i in range(L):
        base = seq[i]
        s = struct[i]
        lp = loop[i]
        # neighbors
        def nb(arr, idx, off):
            j = idx+off
            return arr[j] if 0<=j<len(arr) else 'PAD' if isinstance(arr, str) else -1
        base_m1 = seq[i-1] if i-1>=0 else 'PAD'
        base_p1 = seq[i+1] if i+1<L else 'PAD'
        base_m2 = seq[i-2] if i-2>=0 else 'PAD'
        base_p2 = seq[i+2] if i+2<L else 'PAD'
        struct_m1 = struct[i-1] if i-1>=0 else 'P'
        struct_p1 = struct[i+1] if i+1<L else 'P'
        loop_m1 = loop[i-1] if i-1>=0 else 'P'
        loop_p1 = loop[i+1] if i+1<L else 'P'
        partner_idx = pair[i]
        partner_base = seq[partner_idx] if partner_idx!=-1 else 'NP'
        rows.append({
            'id': rid,
            'pos': i,
            'L': L,
            'base': base,
            'struct': s,
            'loop': lp,
            'base_m1': base_m1, 'base_p1': base_p1, 'base_m2': base_m2, 'base_p2': base_p2,
            'struct_m1': struct_m1, 'struct_p1': struct_p1,
            'loop_m1': loop_m1, 'loop_p1': loop_p1,
            'is_paired': is_paired[i],
            'pair_dist': pair_dist[i],
            'partner_base': partner_base,
            'rel_pos': i/max(1, L-1),
            'dist_to_end': min(i, L-1-i),
            'gc11': float(gc_win[i])
        })
    return rows

print('Flattening train/test to per-base rows...', flush=True)
t0 = time.time()
train_rows = []
for r in train:
    rows = build_features_rows(r)
    # only keep scored positions (first seq_scored)
    L_scored = int(r.get('seq_scored', 68))
    for row in rows[:L_scored]:
        # attach targets
        for t in TARGETS:
            row[t] = r[t][row['pos']]
        row['SN_filter'] = r.get('SN_filter', 1)
        train_rows.append(row)
print(f'Train rows: {len(train_rows)} in {time.time()-t0:.2f}s')

test_rows = []
for r in test:
    rows = build_features_rows(r)
    # We'll predict for all positions that appear in submission for this id; to be safe keep all and merge later
    for row in rows:
        test_rows.append(row)
print(f'Test rows (all positions): {len(test_rows)}')

df_tr = pd.DataFrame(train_rows)
df_te = pd.DataFrame(test_rows)
print('df_tr shape:', df_tr.shape, 'df_te shape:', df_te.shape)
print('df_tr cols:', df_tr.columns.tolist()[:20], '...')

# Prepare categorical and numeric features
cat_cols = ['base','struct','loop','base_m1','base_p1','base_m2','base_p2','struct_m1','struct_p1','loop_m1','loop_p1','partner_base']
num_cols = ['pos','L','rel_pos','dist_to_end','gc11','is_paired','pair_dist']

# One-hot encode categoricals on full training set for column stability, then align test
df_tr_cat = pd.get_dummies(df_tr[cat_cols], prefix=cat_cols, dummy_na=False)
df_te_cat = pd.get_dummies(df_te[cat_cols], prefix=cat_cols, dummy_na=False)
df_tr_num = df_tr[num_cols].astype(np.float32)
df_te_num = df_te[num_cols].astype(np.float32)
X_tr = pd.concat([df_tr_num.reset_index(drop=True), df_tr_cat.reset_index(drop=True)], axis=1)
X_te = pd.concat([df_te_num.reset_index(drop=True), df_te_cat.reset_index(drop=True)], axis=1)
# Align columns
X_te = X_te.reindex(columns=X_tr.columns, fill_value=0)

y_tr = df_tr[TARGETS].values.astype(np.float32)
groups = df_tr['id'].values
sn_mask = (df_tr['SN_filter'].values==1)

print('Feature matrix shapes:', X_tr.shape, X_te.shape)

# CV: GroupKFold by id
gkf = GroupKFold(n_splits=5)
oof = np.zeros_like(y_tr)
models = {t: [] for t in TARGETS}

def train_fold(X, y, tr_idx, va_idx, target_name):
    dtr = xgb.DMatrix(X.iloc[tr_idx], label=y[tr_idx])
    dva = xgb.DMatrix(X.iloc[va_idx], label=y[va_idx])
    params = {
        'objective': 'reg:squarederror',
        'eval_metric': 'rmse',
        'tree_method': 'gpu_hist',
        'max_depth': 8,
        'eta': 0.05,
        'subsample': 0.8,
        'colsample_bytree': 0.8,
        'min_child_weight': 1.0,
        'lambda': 1.0
    }
    evallist = [(dtr, 'train'), (dva, 'valid')]
    bst = xgb.train(params, dtr, num_boost_round=2000, evals=evallist,
                    early_stopping_rounds=100, verbose_eval=False)
    pred = bst.predict(dva)
    return bst, pred

start = time.time()
for fold, (tr_idx, va_idx) in enumerate(gkf.split(X_tr, y_tr[:,0], groups), 1):
    t_fold = time.time()
    print(f'Fold {fold}: train={len(tr_idx)} valid={len(va_idx)} elapsed={time.time()-start:.1f}s', flush=True)
    for ti, t in enumerate(TARGETS):
        y = y_tr[:, ti]
        bst, pred = train_fold(X_tr, y, tr_idx, va_idx, t)
        models[t].append(bst)
        oof[va_idx, ti] = pred
    # Compute fold metric on SN_filter==1 only
    mask = sn_mask[va_idx]
    score = mcrmse(y_tr[va_idx][mask], oof[va_idx][mask]) if mask.any() else mcrmse(y_tr[va_idx], oof[va_idx])
    print(f'Fold {fold} MCRMSE (SN==1): {score:.5f} | fold time {time.time()-t_fold:.1f}s', flush=True)

# Overall OOF
mask_all = sn_mask
oof_score = mcrmse(y_tr[mask_all], oof[mask_all]) if mask_all.any() else mcrmse(y_tr, oof)
print(f'OOF MCRMSE (SN==1): {oof_score:.6f}')

# Predict on test: average across folds per target
dte = xgb.DMatrix(X_te)
pred_te = np.zeros((X_te.shape[0], len(TARGETS)), dtype=np.float32)
for ti, t in enumerate(TARGETS):
    preds_t = None
    for bst in models[t]:
        best_it = getattr(bst, 'best_iteration', None)
        if best_it is not None:
            p = bst.predict(dte, iteration_range=(0, best_it + 1))
        else:
            p = bst.predict(dte)
        preds_t = p if preds_t is None else (preds_t + p)
    pred_te[:, ti] = preds_t / max(1, len(models[t]))

# Build submission by mapping id,pos to rows required
sub = pd.read_csv('sample_submission.csv')
sub['id'] = sub['id_seqpos'].apply(lambda s: '_'.join(s.split('_')[:-1]))
sub['pos'] = sub['id_seqpos'].apply(lambda s: int(s.split('_')[-1]))
df_te_key = df_te[['id','pos']].copy()
for i, t in enumerate(TARGETS):
    df_te_key[t] = pred_te[:, i]
merge = sub[['id','pos','id_seqpos']].merge(df_te_key, on=['id','pos'], how='left')
miss = merge[TARGETS].isna().any(axis=1).sum()
if miss>0:
    print(f'Warning: {miss} submission rows missing predictions; filling with 0.0')
    for t in TARGETS:
        merge[t] = merge[t].fillna(0.0)
out = merge[['id_seqpos'] + TARGETS].copy()
out.to_csv('submission.csv', index=False)
print('Saved submission.csv with shape', out.shape)
print(out.head())
gc.collect()
print('Baseline done.')

Flattening train/test to per-base rows...


Train rows: 146880 in 0.54s
Test rows (all positions): 25680


df_tr shape: (146880, 26) df_te shape: (25680, 20)
df_tr cols: ['id', 'pos', 'L', 'base', 'struct', 'loop', 'base_m1', 'base_p1', 'base_m2', 'base_p2', 'struct_m1', 'struct_p1', 'loop_m1', 'loop_p1', 'is_paired', 'pair_dist', 'partner_base', 'rel_pos', 'dist_to_end', 'gc11'] ...
Feature matrix shapes: (146880, 66) (25680, 66)
Fold 1: train=117504 valid=29376 elapsed=0.0s



    E.g. tree_method = "hist", device = "cuda"




    E.g. tree_method = "hist", device = "cuda"


    E.g. tree_method = "hist", device = "cuda"




    E.g. tree_method = "hist", device = "cuda"


    E.g. tree_method = "hist", device = "cuda"




    E.g. tree_method = "hist", device = "cuda"


    E.g. tree_method = "hist", device = "cuda"




    E.g. tree_method = "hist", device = "cuda"


    E.g. tree_method = "hist", device = "cuda"



Fold 1 MCRMSE (SN==1): 0.33253 | fold time 6.2s


Fold 2: train=117504 valid=29376 elapsed=6.3s



    E.g. tree_method = "hist", device = "cuda"


    E.g. tree_method = "hist", device = "cuda"




    E.g. tree_method = "hist", device = "cuda"


    E.g. tree_method = "hist", device = "cuda"




    E.g. tree_method = "hist", device = "cuda"


    E.g. tree_method = "hist", device = "cuda"




    E.g. tree_method = "hist", device = "cuda"


    E.g. tree_method = "hist", device = "cuda"




    E.g. tree_method = "hist", device = "cuda"


    E.g. tree_method = "hist", device = "cuda"



Fold 2 MCRMSE (SN==1): 0.33570 | fold time 5.2s


Fold 3: train=117504 valid=29376 elapsed=11.5s



    E.g. tree_method = "hist", device = "cuda"


    E.g. tree_method = "hist", device = "cuda"




    E.g. tree_method = "hist", device = "cuda"


    E.g. tree_method = "hist", device = "cuda"




    E.g. tree_method = "hist", device = "cuda"


    E.g. tree_method = "hist", device = "cuda"




    E.g. tree_method = "hist", device = "cuda"




    E.g. tree_method = "hist", device = "cuda"


    E.g. tree_method = "hist", device = "cuda"



Fold 3 MCRMSE (SN==1): 0.33226 | fold time 5.2s


Fold 4: train=117504 valid=29376 elapsed=16.7s



    E.g. tree_method = "hist", device = "cuda"




    E.g. tree_method = "hist", device = "cuda"


    E.g. tree_method = "hist", device = "cuda"




    E.g. tree_method = "hist", device = "cuda"


    E.g. tree_method = "hist", device = "cuda"




    E.g. tree_method = "hist", device = "cuda"


    E.g. tree_method = "hist", device = "cuda"




    E.g. tree_method = "hist", device = "cuda"


    E.g. tree_method = "hist", device = "cuda"



Fold 4 MCRMSE (SN==1): 0.32328 | fold time 5.3s


Fold 5: train=117504 valid=29376 elapsed=22.0s



    E.g. tree_method = "hist", device = "cuda"


    E.g. tree_method = "hist", device = "cuda"




    E.g. tree_method = "hist", device = "cuda"


    E.g. tree_method = "hist", device = "cuda"




    E.g. tree_method = "hist", device = "cuda"


    E.g. tree_method = "hist", device = "cuda"




    E.g. tree_method = "hist", device = "cuda"


    E.g. tree_method = "hist", device = "cuda"




    E.g. tree_method = "hist", device = "cuda"


    E.g. tree_method = "hist", device = "cuda"



Fold 5 MCRMSE (SN==1): 0.32206 | fold time 5.3s



    E.g. tree_method = "hist", device = "cuda"



OOF MCRMSE (SN==1): 0.329411
Saved submission.csv with shape (25680, 6)
        id_seqpos  reactivity  deg_Mg_pH10  deg_pH10  deg_Mg_50C   deg_50C
0  id_00b436dec_0    0.537878     0.565004  1.877807    0.478339  0.607095
1  id_00b436dec_1    1.567672     2.089131  2.848847    2.024551  1.682175
2  id_00b436dec_2    1.163138     0.745847  0.953829    0.994940  1.045937
3  id_00b436dec_3    0.952121     0.569766  0.677064    0.812523  0.819617
4  id_00b436dec_4    1.149979     1.084227  1.182120    1.330350  1.189750
Baseline done.


In [None]:
# Next iteration prep: length-stratified folds, SN weights, extra features, and improved XGB params (not executed yet)
import numpy as np, pandas as pd, time, math, gc
from sklearn.model_selection import StratifiedGroupKFold
import xgboost as xgb

TARGETS = ['reactivity','deg_Mg_pH10','deg_pH10','deg_Mg_50C','deg_50C']

# Utilities to compute additional structural features on per-base DataFrame (df_tr/df_te style)
def compute_run_lengths(arr):
    # returns run length and index within run for a boolean array
    n = len(arr)
    run_len = np.zeros(n, dtype=np.int32)
    idx_in_run = np.zeros(n, dtype=np.int32)
    i = 0
    while i < n:
        j = i
        val = arr[i]
        while j < n and arr[j] == val:
            j += 1
        length = j - i
        run_len[i:j] = length
        idx_in_run[i:j] = np.arange(length)
        i = j
    return run_len, idx_in_run

def add_extra_features(df):
    # df contains columns: id, pos, base, struct, loop, partner_base, is_paired, pair_dist, L
    df = df.copy()
    # Stem/loop run lengths per id
    df['stem_len'] = 0
    df['loop_len'] = 0
    df['pos_in_loop'] = -1.0
    df['dist_to_end_norm'] = df['dist_to_end'] / df['L'].clip(lower=1)
    for rid, g in df.groupby('id', sort=False):
        is_paired = g['is_paired'].to_numpy().astype(bool)
        stem_run, _ = compute_run_lengths(is_paired)
        loop_run, loop_idx = compute_run_lengths(~is_paired)
        idx = g.index
        df.loc[idx, 'stem_len'] = stem_run
        df.loc[idx, 'loop_len'] = loop_run
        denom = np.maximum(loop_run - 1, 1)
        df.loc[idx, 'pos_in_loop'] = (loop_idx / denom).astype(np.float32)
    # Pair type/strength
    pair_map = {('G','C'):3, ('C','G'):3, ('A','U'):2, ('U','A'):2, ('G','U'):1, ('U','G'):1}
    bases = df['base'].astype(str).values
    partners = df['partner_base'].astype(str).values
    strengths = np.zeros(len(df), dtype=np.float32)
    for i,(b,p) in enumerate(zip(bases, partners)):
        strengths[i] = pair_map.get((b,p), 0)
    df['pair_strength'] = strengths
    # Stacking context: paired neighbors
    df['paired_m1'] = df.groupby('id')['is_paired'].shift(1).fillna(0).astype(int)
    df['paired_p1'] = df.groupby('id')['is_paired'].shift(-1).fillna(0).astype(int)
    df['stacked_pairs'] = (df['paired_m1'] & df['is_paired']) + (df['paired_p1'] & df['is_paired'])
    return df

def build_len_buckets(seq_lengths, bins=None):
    x = np.array(seq_lengths)
    if bins is None:
        # fixed bins capturing 68/91/107/130-like ranges; fallback to quantiles if needed
        bins = [0, 90, 110, 140, 10**9]
    labels = np.digitize(x, bins, right=True)
    return labels

def make_stratified_group_folds(df_seq):
    # df_seq: one row per sequence id with 'id' and 'seq_length' columns
    y_strat = build_len_buckets(df_seq['seq_length'].values)
    g = df_seq['id'].values
    sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
    folds = []
    for tr_idx, va_idx in sgkf.split(np.zeros(len(df_seq)), y_strat, groups=g):
        folds.append((df_seq.iloc[tr_idx]['id'].values, df_seq.iloc[va_idx]['id'].values))
    return folds

# Prepare next-run matrices with extra features and aligned dummies
def prepare_matrices_with_extra(df_tr, df_te):
    df_tr2 = add_extra_features(df_tr)
    df_te2 = add_extra_features(df_te)
    cat_cols = ['base','struct','loop','base_m1','base_p1','base_m2','base_p2','struct_m1','struct_p1','loop_m1','loop_p1','partner_base']
    num_cols = ['pos','L','rel_pos','dist_to_end','dist_to_end_norm','gc11','is_paired','pair_dist','stem_len','loop_len','pos_in_loop','pair_strength','stacked_pairs']
    Xtr = pd.concat([df_tr2[num_cols].astype(np.float32), pd.get_dummies(df_tr2[cat_cols], prefix=cat_cols, dummy_na=False)], axis=1).reset_index(drop=True)
    Xte = pd.concat([df_te2[num_cols].astype(np.float32), pd.get_dummies(df_te2[cat_cols], prefix=cat_cols, dummy_na=False)], axis=1).reset_index(drop=True)
    Xte = Xte.reindex(columns=Xtr.columns, fill_value=0)
    return Xtr, Xte

def train_xgb_weighted(X, y, groups, seq_lengths, sn_mask, folds=None):
    if folds is None:
        # build sequence-level folds
        df_seq = pd.DataFrame({'id': pd.Series(groups).unique()})
        # Map a representative seq_length per id from the first occurrence in X (needs external mapping)
        raise RuntimeError('Provide precomputed sequence-level folds to avoid recomputing here.')
    oof = np.zeros((X.shape[0], len(TARGETS)), dtype=np.float32)
    models = {t: [] for t in TARGETS}
    # Sample weights: SN==1 -> 1.0, else 0.5
    w = np.where(sn_mask, 1.0, 0.5).astype(np.float32)
    params = {
        'objective': 'reg:squarederror',
        'eval_metric': 'rmse',
        'tree_method': 'hist',
        'device': 'cuda',
        'max_depth': 6,
        'eta': 0.05,
        'subsample': 0.8,
        'colsample_bytree': 0.75,
        'min_child_weight': 3.0,
        'reg_lambda': 1.0,
        'reg_alpha': 0.0
    }
    # Build an index lookup for id -> row indices
    df_idx = pd.DataFrame({'id': groups})
    for fold, (tr_ids, va_ids) in enumerate(folds, 1):
        va_mask = df_idx['id'].isin(set(va_ids)).values
        tr_mask = ~va_mask
        tr_idx = np.where(tr_mask)[0]
        va_idx = np.where(va_mask)[0]
        print(f'[NextRun] Fold {fold}: train={len(tr_idx)} valid={len(va_idx)}')
        dte = None
        for ti, t in enumerate(TARGETS):
            dtr = xgb.DMatrix(X.iloc[tr_idx], label=y[tr_idx, ti], weight=w[tr_idx])
            dva = xgb.DMatrix(X.iloc[va_idx], label=y[va_idx, ti], weight=w[va_idx])
            bst = xgb.train(params, dtr, num_boost_round=1500, evals=[(dtr,'train'),(dva,'valid')], early_stopping_rounds=100, verbose_eval=False)
            pred = bst.predict(dva, iteration_range=(0, (bst.best_iteration or 0)+1))
            models[t].append(bst)
            oof[va_idx, ti] = pred
        # compute fold score on SN==1 only
        from sklearn.metrics import mean_squared_error
        def mcrmse(a,b):
            return np.mean([mean_squared_error(a[:,i], b[:,i], squared=False) for i in range(a.shape[1])])
        mask = sn_mask[va_idx]
        score = mcrmse(y[va_idx][mask], oof[va_idx][mask]) if mask.any() else mcrmse(y[va_idx], oof[va_idx])
        print(f'[NextRun] Fold {fold} MCRMSE (SN==1): {score:.5f}')
    return oof, models

# Seq-level frame for folds: derive from original train objects
seq_df = pd.DataFrame([{ 'id': r['id'], 'seq_length': int(r.get('seq_length', len(r.get('sequence','')))) } for r in train])
folds_len = make_stratified_group_folds(seq_df)
print('Prepared length-stratified folds for next run.')
print('Next: reuse df_tr/df_te to call prepare_matrices_with_extra(), then train_xgb_weighted() with folds_len. Also proceed to build BiGRU in a separate script.')