# Plan to Medal: OpenVaccine (stanford-covid-vaccine)

Objectives:
- Establish GPU availability and stable environment
- Build a robust CV mirroring test conditions
- Ship a fast baseline ASAP, then iterate with feature engineering and stronger models
- Use expert reviews at milestones

Initial Steps (Milestone 1):
1) Environment check:
- Verify GPU via nvidia-smi; if unavailable, exit competition
- Install CUDA 12.1-compatible PyTorch only if/when needed

2) Data audit:
- Load train.json/test.json; inspect schema, sizes, and sample rows
- Inspect sample_submission.csv to confirm expected output format and row count

3) Validation plan:
- Reproduce competition split logic (sequence-level, per-base targets)
- Deterministic KFold/GroupKFold as appropriate; ensure transforms fit inside folds

4) Baseline model (Milestone 2):
- Start with a simple per-base model (e.g., token features + lightweight model) to verify pipeline
- Generate submission.csv; validate format and sanity-check predictions

5) Feature engineering (Milestone 3):
- Encodings: sequence tokens, base pairing structure, loop types if provided
- Positional features: index, length-normalized, GC-content windows
- Optionally augment with predicted structure if available/feasible

6) Stronger models (Milestone 4):
- Sequence models: BiGRU/LSTM or lightweight Transformer
- Tune with early stopping; log progress per fold
- Cache OOF/test predictions for blends

7) Ensembling (Milestone 5):
- Blend diverse seeds/architectures; weighted average on OOF

Checkpoints for Expert Review:
- After this plan
- After data audit + CV finalized
- After baseline submission
- After FE + stronger models
- Before final blend

Risk controls:
- Subsample for smoke tests
- Print elapsed time per fold; interrupt long/stuck jobs
- Verify submission file before scoring

In [2]:
# Environment check and quick data audit
import os, sys, json, time, subprocess, math, statistics as stats
import pandas as pd

print('=== GPU CHECK: nvidia-smi ===', flush=True)
try:
    out = subprocess.run(['bash','-lc','nvidia-smi || true'], capture_output=True, text=True)
    print(out.stdout)
except Exception as e:
    print('nvidia-smi failed:', e)

print('=== Files present ===', flush=True)
print(os.listdir('.'))

print('=== sample_submission.csv audit ===')
sub = pd.read_csv('sample_submission.csv')
print('sample_submission shape:', sub.shape)
print('submission columns:', list(sub.columns))
print(sub.head(3))

def read_json_robust(path):
    # Many Kaggle datasets here are JSON lines; try lines=True first
    try:
        df = pd.read_json(path, lines=True)
        return df
    except Exception as e1:
        print(f'lines=True failed for {path}:', e1)
        try:
            df = pd.read_json(path)
            return df
        except Exception as e2:
            print(f'standard read_json failed for {path}:', e2)
            raise

print('=== train.json audit ===')
train = read_json_robust('train.json')
print('train rows (sequences):', len(train))
print('train columns:', list(train.columns))
print(train.head(1).T.head(30))

# Infer key fields commonly present
expected_cols = ['id','sequence','structure','predicted_loop_type','signal_to_noise','SN_filter','seq_scored']
present = [c for c in expected_cols if c in train.columns]
print('present expected columns:', present)

# Derive lengths and scored lengths if possible
def safe_len(x):
    try:
        return len(x) if isinstance(x, (list, str)) else int(x) if pd.notnull(x) else None
    except Exception:
        return None

lengths = train['sequence'].map(safe_len) if 'sequence' in train.columns else pd.Series([None]*len(train))
if 'seq_scored' in train.columns:
    scored = train['seq_scored']
else:
    target_like = [c for c in train.columns if c.startswith('react') or c.startswith('deg') or c.endswith('_error')]
    scored = train[target_like[0]].map(safe_len) if target_like else pd.Series([None]*len(train))
print('length value counts (top):')
print(lengths.value_counts().head(10))
print('seq_scored value counts (top):')
print(scored.value_counts().head(10))

print('=== test.json audit ===')
test = read_json_robust('test.json')
print('test rows (sequences):', len(test))
print('test columns:', list(test.columns))
print(test.head(1).T.head(30))

# Confirm target columns from sample_submission
target_cols = [c for c in sub.columns if c != 'id_seqpos']
print('Targets per base (from submission):', target_cols, ' | count:', len(target_cols))

print('=== QUICK SANITY: id/seqpos mapping example ===')
print(sub.head(5)['id_seqpos'])

print('AUDIT COMPLETE')

=== GPU CHECK: nvidia-smi ===


Thu Sep 25 01:27:18 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.06             Driver Version: 550.144.06     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A10-24Q                 On  |   00000002:00:00.0 Off |                    0 |
| N/A   N/A    P0             N/A /  N/A  |     182MiB /  24512MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

['sample_submission.csv', '.00_eda_and_planning_kernel_state.json', 'docker_run.log', 'requirements.txt', 'submission.csv', 'train.json', 'agent_metadata', 'task.txt', 'test.json', '00_eda_and_planning.ipynb', 'description.md']
=== sample_submission.csv audit ===
sample_submission shape: (25680, 6)
submission columns: ['id_seqpos', 'reactivity', 'deg_Mg_pH10', 'deg_pH10', 'deg_Mg_50C', 'deg_50C']
        id_seqpos  reactivity  deg_Mg_pH10  deg_pH10  deg_Mg_50C  deg_50C
0  id_00b436dec_0         0.0          0.0       0.0         0.0      0.0
1  id_00b436dec_1         0.0          0.0       0.0         0.0      0.0
2  id_00b436dec_2         0.0          0.0       0.0         0.0      0.0
=== train.json audit ===
train rows (sequences): 2160
train columns: ['index', 'id', 'sequence', 'structure', 'predicted_loop_type', 'signal_to_noise', 'SN_filter', 'seq_length', 'seq_scored', 'reactivity_error', 'deg_error_Mg_pH10', 'deg_error_pH10', 'deg_error_Mg_50C', 'deg_error_50C', 'reactivity', '

In [3]:
# Install CUDA 12.1 PyTorch stack and verify GPU
import os, sys, subprocess, shutil
from pathlib import Path

def pip(*args):
    print('> pip', *args, flush=True)
    subprocess.run([sys.executable, '-m', 'pip', *args], check=True)

# Uninstall any preinstalled torch variants to avoid conflicts
for pkg in ('torch','torchvision','torchaudio'):
    subprocess.run([sys.executable, '-m', 'pip', 'uninstall', '-y', pkg], check=False)

# Clean stray site dirs that might shadow correct wheels
for d in (
    '/app/.pip-target/torch',
    '/app/.pip-target/torchvision',
    '/app/.pip-target/torchaudio',
    '/app/.pip-target/torchgen',
    '/app/.pip-target/functorch',
):
    if os.path.exists(d):
        print('Removing', d)
        shutil.rmtree(d, ignore_errors=True)

# Install exact cu121 torch stack
pip('install',
    '--index-url', 'https://download.pytorch.org/whl/cu121',
    '--extra-index-url', 'https://pypi.org/simple',
    'torch==2.4.1', 'torchvision==0.19.1', 'torchaudio==2.4.1')

# Freeze versions for later installs
Path('constraints.txt').write_text('torch==2.4.1\ntorchvision==0.19.1\ntorchaudio==2.4.1\n')

import torch
print('torch:', torch.__version__, 'built CUDA:', getattr(torch.version, 'cuda', None))
print('CUDA available:', torch.cuda.is_available())
assert str(getattr(torch.version,'cuda','')).startswith('12.1'), f'Wrong CUDA build: {torch.version.cuda}'
assert torch.cuda.is_available(), 'CUDA not available'
print('GPU:', torch.cuda.get_device_name(0))





> pip install --index-url https://download.pytorch.org/whl/cu121 --extra-index-url https://pypi.org/simple torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1




Looking in indexes: https://download.pytorch.org/whl/cu121, https://pypi.org/simple


Collecting torch==2.4.1
  Downloading https://download.pytorch.org/whl/cu121/torch-2.4.1%2Bcu121-cp311-cp311-linux_x86_64.whl (799.0 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 799.0/799.0 MB 390.5 MB/s eta 0:00:00


Collecting torchvision==0.19.1
  Downloading https://download.pytorch.org/whl/cu121/torchvision-0.19.1%2Bcu121-cp311-cp311-linux_x86_64.whl (7.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.1/7.1 MB 220.7 MB/s eta 0:00:00


Collecting torchaudio==2.4.1
  Downloading https://download.pytorch.org/whl/cu121/torchaudio-2.4.1%2Bcu121-cp311-cp311-linux_x86_64.whl (3.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.4/3.4 MB 316.8 MB/s eta 0:00:00
Collecting filelock


  Downloading filelock-3.19.1-py3-none-any.whl (15 kB)
Collecting triton==3.0.0
  Downloading triton-3.0.0-1-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (209.4 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 209.4/209.4 MB 85.5 MB/s eta 0:00:00


Collecting nvidia-cublas-cu12==12.1.3.1
  Downloading nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl (410.6 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 410.6/410.6 MB 418.6 MB/s eta 0:00:00


Collecting fsspec
  Downloading fsspec-2025.9.0-py3-none-any.whl (199 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 199.3/199.3 KB 513.9 MB/s eta 0:00:00


Collecting nvidia-cuda-runtime-cu12==12.1.105
  Downloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 823.6/823.6 KB 508.6 MB/s eta 0:00:00


Collecting nvidia-nccl-cu12==2.20.5
  Downloading nvidia_nccl_cu12-2.20.5-py3-none-manylinux2014_x86_64.whl (176.2 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 176.2/176.2 MB 226.5 MB/s eta 0:00:00


Collecting nvidia-cuda-nvrtc-cu12==12.1.105
  Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23.7/23.7 MB 173.6 MB/s eta 0:00:00


Collecting nvidia-curand-cu12==10.3.2.106
  Downloading nvidia_curand_cu12-10.3.2.106-py3-none-manylinux1_x86_64.whl (56.5 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.5/56.5 MB 223.5 MB/s eta 0:00:00
Collecting nvidia-cusolver-cu12==11.4.5.107
  Downloading nvidia_cusolver_cu12-11.4.5.107-py3-none-manylinux1_x86_64.whl (124.2 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 124.2/124.2 MB 47.9 MB/s eta 0:00:00


Collecting networkx
  Downloading networkx-3.5-py3-none-any.whl (2.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.0/2.0 MB 349.0 MB/s eta 0:00:00
Collecting nvidia-nvtx-cu12==12.1.105
  Downloading nvidia_nvtx_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (99 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 99.1/99.1 KB 477.6 MB/s eta 0:00:00


Collecting nvidia-cufft-cu12==11.0.2.54
  Downloading nvidia_cufft_cu12-11.0.2.54-py3-none-manylinux1_x86_64.whl (121.6 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 121.6/121.6 MB 51.0 MB/s eta 0:00:00


Collecting typing-extensions>=4.8.0
  Downloading typing_extensions-4.15.0-py3-none-any.whl (44 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44.6/44.6 KB 420.9 MB/s eta 0:00:00
Collecting sympy
  Downloading sympy-1.14.0-py3-none-any.whl (6.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.3/6.3 MB 354.9 MB/s eta 0:00:00


Collecting nvidia-cudnn-cu12==9.1.0.70
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl (664.8 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 664.8/664.8 MB 205.7 MB/s eta 0:00:00


Collecting nvidia-cusparse-cu12==12.1.0.106
  Downloading nvidia_cusparse_cu12-12.1.0.106-py3-none-manylinux1_x86_64.whl (196.0 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 196.0/196.0 MB 220.3 MB/s eta 0:00:00


Collecting jinja2
  Downloading jinja2-3.1.6-py3-none-any.whl (134 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 134.9/134.9 KB 501.6 MB/s eta 0:00:00
Collecting nvidia-cuda-cupti-cu12==12.1.105
  Downloading nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.1/14.1 MB 106.6 MB/s eta 0:00:00


Collecting pillow!=8.3.*,>=5.3.0
  Downloading pillow-11.3.0-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (6.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.6/6.6 MB 172.8 MB/s eta 0:00:00


Collecting numpy
  Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18.3/18.3 MB 483.6 MB/s eta 0:00:00


Collecting nvidia-nvjitlink-cu12
  Downloading nvidia_nvjitlink_cu12-12.9.86-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (39.7 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39.7/39.7 MB 195.5 MB/s eta 0:00:00


Collecting MarkupSafe>=2.0
  Downloading MarkupSafe-3.0.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (23 kB)


Collecting mpmath<1.4,>=1.1.0
  Downloading mpmath-1.3.0-py3-none-any.whl (536 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 536.2/536.2 KB 488.9 MB/s eta 0:00:00


Installing collected packages: mpmath, typing-extensions, sympy, pillow, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufft-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, numpy, networkx, MarkupSafe, fsspec, filelock, triton, nvidia-cusparse-cu12, nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, torch, torchvision, torchaudio


Successfully installed MarkupSafe-3.0.2 filelock-3.19.1 fsspec-2025.9.0 jinja2-3.1.6 mpmath-1.3.0 networkx-3.5 numpy-1.26.4 nvidia-cublas-cu12-12.1.3.1 nvidia-cuda-cupti-cu12-12.1.105 nvidia-cuda-nvrtc-cu12-12.1.105 nvidia-cuda-runtime-cu12-12.1.105 nvidia-cudnn-cu12-9.1.0.70 nvidia-cufft-cu12-11.0.2.54 nvidia-curand-cu12-10.3.2.106 nvidia-cusolver-cu12-11.4.5.107 nvidia-cusparse-cu12-12.1.0.106 nvidia-nccl-cu12-2.20.5 nvidia-nvjitlink-cu12-12.9.86 nvidia-nvtx-cu12-12.1.105 pillow-11.3.0 sympy-1.14.0 torch-2.4.1+cu121 torchaudio-2.4.1+cu121 torchvision-0.19.1+cu121 triton-3.0.0 typing-extensions-4.15.0


torch: 2.4.1+cu121 built CUDA: 12.1
CUDA available: True
GPU: NVIDIA A10-24Q


In [7]:
# Data pipeline: encoding, pairing parse, Dataset/Collate
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

SEQ_VOCAB = {'A':0,'C':1,'G':2,'U':3}
STR_VOCAB = {'.':0,'(':1,')':2}
LOOP_VOCAB = {'B':0,'E':1,'H':2,'I':3,'M':4,'S':5,'X':6}

def encode_string(s, vocab, unk_val=0):
    return np.array([vocab.get(ch, unk_val) for ch in s], dtype=np.int64)

def parse_dot_bracket(struct):
    stack = []
    pair_idx = np.full(len(struct), -1, dtype=np.int32)
    for i, ch in enumerate(struct):
        if ch == '(':
            stack.append(i)
        elif ch == ')':
            if not stack:
                continue
            j = stack.pop()
            pair_idx[i] = j
            pair_idx[j] = i
    is_paired = (pair_idx != -1).astype(np.int8)
    pair_dist = np.where(pair_idx!=-1, np.abs(np.arange(len(struct)) - pair_idx), 0).astype(np.int16)
    return pair_idx, is_paired, pair_dist

TARGET_COLS = ['reactivity','deg_Mg_pH10','deg_pH10','deg_Mg_50C','deg_50C']

class RNADataset(Dataset):
    def __init__(self, df, train_mode=True):
        self.df = df.reset_index(drop=True)
        self.train_mode = train_mode

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        seq = row['sequence']
        struct = row['structure']
        loop = row['predicted_loop_type']
        L = len(seq)
        seq_ids = encode_string(seq, SEQ_VOCAB)
        str_ids = encode_string(struct, STR_VOCAB)
        loop_ids = encode_string(loop, LOOP_VOCAB)
        pair_idx, is_paired, pair_dist = parse_dot_bracket(struct)
        pos_idx = np.arange(L, dtype=np.int32)
        pos_norm = pos_idx / max(L-1, 1)
        snr = float(row.get('signal_to_noise', 1.0))
        snr_feat = np.full(L, snr, dtype=np.float32)
        # features per position
        feats = np.stack([
            pos_idx.astype(np.float32),
            pos_norm.astype(np.float32),
            is_paired.astype(np.float32),
            pair_dist.astype(np.float32),
            snr_feat,
        ], axis=1)  # [L, F]
        seq_scored = int(row['seq_scored']) if 'seq_scored' in row else L
        mask_scored = np.zeros(L, dtype=np.float32)
        mask_scored[:seq_scored] = 1.0
        item = {
            'seq_ids': torch.from_numpy(seq_ids),
            'str_ids': torch.from_numpy(str_ids),
            'loop_ids': torch.from_numpy(loop_ids),
            'feats': torch.from_numpy(feats),
            'mask_scored': torch.from_numpy(mask_scored),
            'id': row['id'],
        }
        if self.train_mode:
            # Targets are length seq_scored (68). Pad to full sequence length L.
            T = len(TARGET_COLS)
            tar = np.zeros((L, T), dtype=np.float32)
            for ti, c in enumerate(TARGET_COLS):
                arr = np.array(row[c], dtype=np.float32)
                take = min(len(arr), L)
                tar[:take, ti] = arr[:take]
            item['targets'] = torch.from_numpy(tar)
        return item

def collate_pad(batch):
    # pad to max len in batch; guard against any per-item length mismatch by slicing to min length
    lens_seq = [len(b['seq_ids']) for b in batch]
    maxL = max(lens_seq)
    T = len(TARGET_COLS)
    B = len(batch)
    seq_ids = torch.full((B, maxL), 0, dtype=torch.long)
    str_ids = torch.full((B, maxL), 0, dtype=torch.long)
    loop_ids = torch.full((B, maxL), 0, dtype=torch.long)
    feats = torch.zeros((B, maxL, 5), dtype=torch.float32)
    mask_pad = torch.zeros((B, maxL), dtype=torch.float32)
    mask_scored = torch.zeros((B, maxL), dtype=torch.float32)
    ids = []
    targets = None
    has_targets = 'targets' in batch[0]
    if has_targets:
        targets = torch.zeros((B, maxL, T), dtype=torch.float32)
    for i, b in enumerate(batch):
        Ls = len(b['seq_ids'])
        Lt = b['targets'].shape[0] if has_targets else Ls
        L = min(Ls, Lt)
        seq_ids[i,:L] = b['seq_ids'][:L]
        str_ids[i,:L] = b['str_ids'][:L]
        loop_ids[i,:L] = b['loop_ids'][:L]
        feats[i,:L] = b['feats'][:L]
        mask_pad[i,:L] = 1.0
        mask_scored[i,:L] = b['mask_scored'][:L]
        ids.append(b['id'])
        if has_targets:
            targets[i,:L] = b['targets'][:L]
    out = {'seq_ids':seq_ids, 'str_ids':str_ids, 'loop_ids':loop_ids, 'feats':feats,
           'mask_pad':mask_pad, 'mask_scored':mask_scored, 'ids':ids}
    if has_targets:
        out['targets'] = targets
    return out

print('Dataset utilities defined.')

Dataset utilities defined.


In [None]:
# BiGRU baseline: model, CV training (3-fold smoke), inference, submission
import math, time, gc
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from sklearn.model_selection import GroupKFold

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
TARGET_COLS = ['reactivity','deg_Mg_pH10','deg_pH10','deg_Mg_50C','deg_50C']

class BiGRUModel(nn.Module):
    def __init__(self, emb_dim=48, feat_dim=5, hidden=256, layers=2, dropout=0.3, n_targets=5):
        super().__init__()
        self.seq_emb = nn.Embedding(len(SEQ_VOCAB), emb_dim, padding_idx=0)
        self.str_emb = nn.Embedding(len(STR_VOCAB), emb_dim, padding_idx=0)
        self.loop_emb = nn.Embedding(len(LOOP_VOCAB), emb_dim, padding_idx=0)
        in_dim = emb_dim*3 + feat_dim
        self.proj = nn.Linear(in_dim, hidden*2)
        self.gru = nn.GRU(input_size=hidden*2, hidden_size=hidden, num_layers=layers, batch_first=True, bidirectional=True, dropout=dropout if layers>1 else 0.0)
        self.dropout = nn.Dropout(dropout)
        self.head = nn.Linear(hidden*2, n_targets)

    def forward(self, seq_ids, str_ids, loop_ids, feats, lengths):
        x = torch.cat([
            self.seq_emb(seq_ids),
            self.str_emb(str_ids),
            self.loop_emb(loop_ids),
            feats
        ], dim=-1)  # [B,L,3E+F]
        x = self.proj(x)
        # pack for efficiency
        packed = nn.utils.rnn.pack_padded_sequence(x, lengths.cpu(), batch_first=True, enforce_sorted=False)
        packed_out, _ = self.gru(packed)
        out, _ = nn.utils.rnn.pad_packed_sequence(packed_out, batch_first=True)
        out = self.dropout(out)
        preds = self.head(out)
        return preds

def masked_mse(preds, targets, mask):
    # mask: [B,L] 1=include
    diff = (preds - targets)**2
    mask3 = mask.unsqueeze(-1).expand_as(diff)
    num = (diff * mask3).sum(dim=(0,1))  # sum over B,L per target
    den = mask3.sum(dim=(0,1)).clamp_min(1.0)
    mse_t = num / den  # per target
    return mse_t, torch.sqrt(mse_t.mean())  # per-target and MCRMSE

def train_one_epoch(model, loader, optimizer, scaler, scheduler=None):
    model.train()
    total_loss = 0.0
    n_batches = 0
    t0 = time.time()
    for it, batch in enumerate(loader):
        seq_ids = batch['seq_ids'].to(device)
        str_ids = batch['str_ids'].to(device)
        loop_ids = batch['loop_ids'].to(device)
        feats = batch['feats'].to(device)
        mask_scored = (batch['mask_scored'] * batch['mask_pad']).to(device)
        targets = batch['targets'].to(device)
        lengths = batch['mask_pad'].sum(dim=1).to(device)
        optimizer.zero_grad(set_to_none=True)
        with torch.cuda.amp.autocast(enabled=True):
            preds = model(seq_ids, str_ids, loop_ids, feats, lengths)
            mse_t, mcrmse = masked_mse(preds, targets, mask_scored)
            loss = mse_t.mean()
        scaler.scale(loss).backward()
        # unscale for clipping then step
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        scaler.step(optimizer)
        scaler.update()
        if scheduler is not None:
            scheduler.step()
        total_loss += loss.item()
        n_batches += 1
        if (it+1) % 50 == 0:
            print(f'  iter {it+1}/{len(loader)} loss {total_loss/n_batches:.5f} elapsed {time.time()-t0:.1f}s', flush=True)
    return total_loss / max(n_batches,1)

def validate(model, loader):
    model.eval()
    t_mse_sum = torch.zeros(len(TARGET_COLS), device=device)
    t_den = torch.zeros(len(TARGET_COLS), device=device)
    with torch.no_grad():
        for batch in loader:
            seq_ids = batch['seq_ids'].to(device)
            str_ids = batch['str_ids'].to(device)
            loop_ids = batch['loop_ids'].to(device)
            feats = batch['feats'].to(device)
            mask_scored = (batch['mask_scored'] * batch['mask_pad']).to(device)
            targets = batch['targets'].to(device)
            lengths = batch['mask_pad'].sum(dim=1).to(device)
            preds = model(seq_ids, str_ids, loop_ids, feats, lengths)
            diff = (preds - targets)**2
            mask3 = mask_scored.unsqueeze(-1).expand_as(diff)
            t_mse_sum += (diff * mask3).sum(dim=(0,1))
            t_den += mask3.sum(dim=(0,1))
    mse_t = (t_mse_sum / t_den.clamp_min(1.0)).detach().cpu().numpy()
    mcrmse = float(np.sqrt(mse_t.mean()))
    return mcrmse, mse_t

def predict_model(model, loader):
    model.eval()
    preds_all = []
    ids_all = []
    with torch.no_grad():
        for batch in loader:
            seq_ids = batch['seq_ids'].to(device)
            str_ids = batch['str_ids'].to(device)
            loop_ids = batch['loop_ids'].to(device)
            feats = batch['feats'].to(device)
            lengths = batch['mask_pad'].sum(dim=1).to(device)
            preds = model(seq_ids, str_ids, loop_ids, feats, lengths)  # [B,L,T]
            preds_all.append(preds.detach().cpu().numpy())
            ids_all.extend(batch['ids'])
    return np.concatenate(preds_all, axis=0), ids_all  # [N,L,T], ids list

# Build dataframes
train_df = read_json_robust('train.json')
test_df = read_json_robust('test.json')
target_cols = TARGET_COLS.copy()

# Filter training by quality as advised
if 'SN_filter' in train_df.columns:
    train_df = train_df[train_df['SN_filter']==1].reset_index(drop=True)
print('Train after SN_filter==1:', train_df.shape, flush=True)

# Config
FOLDS = 3  # smoke test; later switch to 5
EPOCHS = 10
BATCH_SIZE = 64
LR = 1e-3
WD = 1e-4
DROPOUT = 0.3
HIDDEN = 256
EMB = 48

# CV split
groups = train_df['id'].values
gkf = GroupKFold(n_splits=FOLDS)
fold_indices = list(gkf.split(train_df, groups=groups))
print('Fold sizes:', [ (len(tr), len(va)) for tr,va in fold_indices ])

oof_preds = np.zeros((len(train_df), train_df['seq_length'].iloc[0], len(TARGET_COLS)), dtype=np.float32)
test_preds_folds = []

for fi, (tr_idx, va_idx) in enumerate(fold_indices):
    print(f'===== Fold {fi+1}/{FOLDS} =====', flush=True)
    tr_df = train_df.iloc[tr_idx].reset_index(drop=True)
    va_df = train_df.iloc[va_idx].reset_index(drop=True)
    tr_ds = RNADataset(tr_df, train_mode=True)
    va_ds = RNADataset(va_df, train_mode=True)
    te_ds = RNADataset(test_df, train_mode=False)
    tr_loader = DataLoader(tr_ds, batch_size=BATCH_SIZE, shuffle=True, num_workers=2, pin_memory=True, collate_fn=collate_pad)
    va_loader = DataLoader(va_ds, batch_size=BATCH_SIZE, shuffle=False, num_workers=2, pin_memory=True, collate_fn=collate_pad)
    te_loader = DataLoader(te_ds, batch_size=BATCH_SIZE, shuffle=False, num_workers=2, pin_memory=True, collate_fn=collate_pad)

    model = BiGRUModel(emb_dim=EMB, feat_dim=5, hidden=HIDDEN, layers=2, dropout=DROPOUT, n_targets=len(TARGET_COLS)).to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=LR, weight_decay=WD)
    total_steps = EPOCHS * len(tr_loader)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)
    scaler = torch.cuda.amp.GradScaler(enabled=True)

    best_mcrmse = 1e9
    best_state = None
    patience = 5
    bad = 0
    t0 = time.time()
    for epoch in range(1, EPOCHS+1):
        ep_t0 = time.time()
        tr_loss = train_one_epoch(model, tr_loader, optimizer, scaler, scheduler)
        val_mcrmse, val_mse_t = validate(model, va_loader)
        print(f'Epoch {epoch}/{EPOCHS} fold {fi} tr_loss {tr_loss:.5f} val_mcrmse {val_mcrmse:.5f} per-target {np.sqrt(val_mse_t)} time {time.time()-ep_t0:.1f}s', flush=True)
        if val_mcrmse < best_mcrmse - 1e-4:
            best_mcrmse = val_mcrmse
            best_state = {k:v.cpu() for k,v in model.state_dict().items()}
            bad = 0
        else:
            bad += 1
        if bad >= patience:
            print('Early stopping.', flush=True)
            break
    print(f'Fold {fi} best mcrmse: {best_mcrmse:.5f} elapsed {time.time()-t0:.1f}s', flush=True)
    if best_state is not None:
        model.load_state_dict({k:v.to(device) for k,v in best_state.items()})

    # OOF preds
    model.eval()
    with torch.no_grad():
        va_loader2 = DataLoader(va_ds, batch_size=BATCH_SIZE, shuffle=False, num_workers=2, pin_memory=True, collate_fn=collate_pad)
        ptr = 0
        for batch in va_loader2:
            B = len(batch['ids'])
            seq_ids = batch['seq_ids'].to(device)
            str_ids = batch['str_ids'].to(device)
            loop_ids = batch['loop_ids'].to(device)
            feats = batch['feats'].to(device)
            lengths = batch['mask_pad'].sum(dim=1).to(device)
            preds = model(seq_ids, str_ids, loop_ids, feats, lengths).detach().cpu().numpy()
            oof_preds[va_idx[ptr:ptr+B]] = preds
            ptr += B

    # Test preds
    fold_test_preds, _ = predict_model(model, te_loader)
    test_preds_folds.append(fold_test_preds)
    del model, optimizer, scheduler, scaler, tr_loader, va_loader, te_loader
    gc.collect()
    torch.cuda.empty_cache()

# OOF metric on train (scored positions only) with padded y_true
seq_len = int(train_df['seq_length'].iloc[0])
seq_scored = int(train_df['seq_scored'].iloc[0])
y_true = np.zeros((len(train_df), seq_len, len(TARGET_COLS)), dtype=np.float32)
for ti, col in enumerate(TARGET_COLS):
    arrs = train_df[col].tolist()
    for i, arr in enumerate(arrs):
        a = np.asarray(arr, dtype=np.float32)
        take = min(len(a), seq_len, seq_scored)
        y_true[i, :take, ti] = a[:take]
mask = np.zeros((len(train_df), seq_len), dtype=np.float32)
mask[:, :seq_scored] = 1.0
mse_t = ((oof_preds - y_true)**2 * mask[...,None]).sum(axis=(0,1)) / mask.sum(axis=(0,1)).clip(1)
oof_mcrmse = float(np.sqrt(mse_t.mean()))
print('OOF MCRMSE:', oof_mcrmse, ' per-target RMSE:', np.sqrt(mse_t))

# Ensemble test preds over folds
test_preds = np.mean(np.stack(test_preds_folds, axis=0), axis=0)  # [Ntest, L, T]

# Clip predictions per target (hardcoded safe bounds) and slice first 68 positions
seq_scored_test = int(test_df['seq_scored'].iloc[0])
bounds = {'reactivity':3.0, 'deg_Mg_pH10':5.0, 'deg_pH10':5.0, 'deg_Mg_50C':5.0, 'deg_50C':5.0}
for ti, col in enumerate(TARGET_COLS):
    test_preds[..., ti] = np.clip(test_preds[..., ti], 0.0, bounds[col])

# Build submission from sample order
sub = pd.read_csv('sample_submission.csv')
id_to_row = {rid:i for i,rid in enumerate(test_df['id'].values)}
records = []
for i in range(len(test_df)):
    rid = test_df['id'].iloc[i]
    for pos in range(seq_scored_test):
        rec = {'id_seqpos': f'{rid}_{pos}'}
        for ti, col in enumerate(TARGET_COLS):
            rec[col] = float(test_preds[i, pos, ti])
        records.append(rec)
pred_df = pd.DataFrame(records)
sub_out = sub[['id_seqpos']].merge(pred_df, on='id_seqpos', how='left')
assert sub_out.shape == sub.shape, f'Wrong submission shape: {sub_out.shape} vs {sub.shape}'
sub_out.to_csv('submission.csv', index=False)
print('Saved submission.csv with shape', sub_out.shape, 'OOF MCRMSE', oof_mcrmse)
print(sub_out.head())
print('Done.')

Train after SN_filter==1: (1349, 19)


Fold sizes: [(899, 450), (899, 450), (900, 449)]
===== Fold 1/3 =====


  scaler = torch.cuda.amp.GradScaler(enabled=True)


  with torch.cuda.amp.autocast(enabled=True):


Epoch 1/10 fold 0 tr_loss 0.61789 val_mcrmse 0.46464 per-target [0.42203888 0.49103627 0.42863777 0.43786088 0.53363615] time 0.5s


Epoch 2/10 fold 0 tr_loss 0.21553 val_mcrmse 0.40631 per-target [0.38557398 0.45640785 0.41188267 0.42247215 0.34689045] time 0.4s


Epoch 3/10 fold 0 tr_loss 0.17475 val_mcrmse 0.38148 per-target [0.3467144  0.44628134 0.38573438 0.38672563 0.33151618] time 0.5s


Epoch 4/10 fold 0 tr_loss 0.15714 val_mcrmse 0.36894 per-target [0.35675916 0.4192098  0.36312023 0.3765078  0.3223946 ] time 0.5s


Epoch 5/10 fold 0 tr_loss 0.14136 val_mcrmse 0.35654 per-target [0.31844038 0.41604698 0.35648796 0.35778314 0.32558483] time 0.5s


Epoch 6/10 fold 0 tr_loss 0.13102 val_mcrmse 0.34075 per-target [0.31065324 0.39456546 0.34727368 0.34295106 0.30025217] time 0.5s


Epoch 7/10 fold 0 tr_loss 0.12575 val_mcrmse 0.33628 per-target [0.3085154  0.38822353 0.34262368 0.3374293  0.2971208 ] time 0.5s


Epoch 8/10 fold 0 tr_loss 0.12272 val_mcrmse 0.33370 per-target [0.3072244  0.38512233 0.34006754 0.33471873 0.29393208] time 0.5s


Epoch 9/10 fold 0 tr_loss 0.12104 val_mcrmse 0.33302 per-target [0.30614346 0.38426918 0.33965087 0.33402365 0.29358512] time 0.5s


Epoch 10/10 fold 0 tr_loss 0.12131 val_mcrmse 0.33333 per-target [0.3075565  0.38434333 0.33943585 0.33448124 0.29348603] time 0.5s


Fold 0 best mcrmse: 0.33302 elapsed 4.8s
