# Plan: OpenVaccine (stanford-covid-vaccine) — Medal-Oriented Workflow

Objectives:
- Build a strong baseline fast, align CV with LB, iterate to medal.

Datasets:
- train.json, test.json, sample_submission.csv

Metric:
- multi-class-log-loss (per-position targets across 5 tasks: reactivity, deg_Mg_pH10, deg_pH10, deg_Mg_50C, deg_50C for L=107 or 130 seq lengths depending on data).

High-level Strategy:
1) Environment & GPU:
- Verify GPU with nvidia-smi and torch CUDA stack sanity.

2) Data Audit:
- Load train.json/test.json; inspect schema, lengths, missing values, target columns, sequence length distributions, and public vs private split hints.
- Confirm how to format submission (per-base predictions melted to long with id_seqpos).

3) Validation Protocol:
- Use GroupKFold by sequence_id or stratification by sequence length/structure; mirror test: sequence-level CV, not token-level.
- Fix random seed; 5 folds. Save folds to disk.

4) Baseline Model (fast):
- Token features: sequence (A,C,G,U one-hot), structure (BPP features from provided pairing probability if present; else dot-bracket + simple pairing features), predicted_loop_type one-hot.
- Local context windows (k-mer embeddings) and position index features.
- Model A: Lightweight BiLSTM/GRU with attention, trained with MSE to match baseline (regression to targets); optionally multi-head outputs.
- Model B: CatBoost/XGBoost on pooled/contextual features (for quick sanity).

5) Feature Engineering v1:
- Base pairing probabilities (from bpps matrices if included or compute via ViennaRNA if allowed/time permits).
- Positional features: distance to paired base, GC content in window, k-mer counts, loop type runs.
- Temperature/Mg condition indicators per target head.

6) Training Details:
- Sequence-length bucketing and padding mask.
- Per-position loss masked to valid positions.
- Early stopping on CV fold OOF loss. Log progress, time/fold.
- Save OOF predictions and test predictions per seed.

7) Ensembling:
- Blend multiple seeds/architectures (BiLSTM + GRU; with/without BPP; small Transformer) by weighted average on OOF to pick weights, then apply to test.

8) Error Analysis:
- Analyze OOF by position bins, structure classes, sequence length, confidence; iterate on features.

9) Submission:
- Generate submission.csv exactly matching sample format.

Milestones (request expert review after each):
- Plan (this).
- Data audit + CV plan.
- Baseline training + OOF.
- FE v1 + improved model.
- Ensemble + final checks.

Next steps:
- 1) Verify GPU, install torch-cu121 stack.
- 2) Load and profile data; design CV folds.
- 3) Implement quick CatBoost baseline for smoke test and submission shape check, then move to BiLSTM.

In [1]:
# Environment check + data audit: GPU, schema, metric/target confirmation
import os, sys, json, time, shutil, subprocess
import pandas as pd, numpy as np
from pathlib import Path
pd.set_option('display.max_columns', 200)
pd.set_option('display.width', 200)

def sh(cmd):
    print(f"$ {' '.join(cmd)}", flush=True)
    try:
        out = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
        print(out.stdout)
    except Exception as e:
        print('CMD failed:', e)

print('== GPU check ==')
sh(['bash','-lc','nvidia-smi || true'])

print('== Files ==')
print(os.listdir('.'))

print('== sample_submission.csv head/cols ==')
ss = pd.read_csv('sample_submission.csv')
print(ss.shape)
print(ss.head())
print('Columns:', ss.columns.tolist()[:20], '... total:', len(ss.columns))

print('== train.json schema/head ==')
try:
    tr = pd.read_json('train.json')
except ValueError:
    tr = pd.read_json('train.json', lines=True)
print(tr.shape)
print(tr.columns.tolist())
print(tr.head(2))
print(tr.dtypes)

# Inspect target nature and aux columns
cols = tr.columns.tolist()
targets = [c for c in cols if c.startswith('react') or c.startswith('deg')]
print('Target-like columns:', targets)
if targets:
    first = tr.iloc[0]
    for c in targets[:10]:
        v = first.get(c, None)
        if isinstance(v, (list, tuple, np.ndarray)) and len(v)>0:
            print(f'{c}: list len={len(v)}, type0={type(v[0])}')
        else:
            print(f'{c}: type={type(v)} value_sample={str(v)[:60]}')

for name in ['sequence','structure','predicted_loop_type','seq_scored','signal_to_noise','SN_filter']:
    print(f"Has {name}:", name in tr.columns)

# Sequence length distribution
if 'sequence' in tr.columns:
    tr['seq_len'] = tr['sequence'].astype(str).str.len()
    print('Train seq_len value_counts:\n', tr['seq_len'].value_counts().sort_index())
else:
    print('No sequence column found.')

print('== test.json schema/head ==')
try:
    te = pd.read_json('test.json')
except ValueError:
    te = pd.read_json('test.json', lines=True)
print(te.shape)
print(te.columns.tolist())
print(te.head(2))
if 'sequence' in te.columns:
    te['seq_len'] = te['sequence'].astype(str).str.len()
    print('Test seq_len value_counts:\n', te['seq_len'].value_counts().sort_index())

print('== sample_submission format inference ==')
print('First row id_seqpos:', ss.iloc[0,0])
print('Submission columns (first 10):', ss.columns[:10].tolist())


== GPU check ==
$ bash -lc nvidia-smi || true


Sat Sep 27 02:20:51 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.06             Driver Version: 550.144.06     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A10-24Q                 On  |   00000002:00:00.0 Off |                    0 |
| N/A   N/A    P0             N/A /  N/A  |     182MiB /  24512MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [2]:
# Fast baseline: per-position means for 0..67; global means for 68..106; write submission.csv
import pandas as pd, numpy as np
from pathlib import Path

ss = pd.read_csv('sample_submission.csv')
try:
    tr = pd.read_json('train.json')
except ValueError:
    tr = pd.read_json('train.json', lines=True)
try:
    te = pd.read_json('test.json')
except ValueError:
    te = pd.read_json('test.json', lines=True)

targets = ['reactivity','deg_Mg_pH10','deg_pH10','deg_Mg_50C','deg_50C']

# Stack train targets into long dataframe with seqpos only for scored positions
rows = []
for _, r in tr.iterrows():
    scored = int(r['seq_scored'])
    for t in targets:
        vals = np.array(r[t], dtype=float)
        for i in range(scored):
            rows.append((t, i, vals[i]))
train_long = pd.DataFrame(rows, columns=['target','seqpos','value'])

# Compute per-position mean for 0..67 and global mean
pos_means = (train_long.groupby(['target','seqpos'])['value'].mean().unstack('seqpos'))
global_means = train_long.groupby('target')['value'].mean()
max_pos = pos_means.columns.max() if pos_means.shape[1] > 0 else 67
print('Computed pos_means up to position:', int(max_pos))

# Build prediction dict: for each test id and position 0..106
pred_map = {}  # key: (id, pos) -> dict of target->pred
for _, r in te[['id','seq_length','seq_scored']].iterrows():
    rid = r['id']
    L = int(r['seq_length'])
    scored = int(r['seq_scored'])  # expected 68
    for i in range(L):
        for t in targets:
            if i in pos_means.columns:
                val = float(pos_means.loc[t, i])
            else:
                # unscored positions fallback
                val = float(global_means.loc[t])
            pred_map[(rid, i, t)] = val

# Merge predictions into sample_submission to preserve order
def parse_id_seqpos(x):
    # 'id_xxx_pos'
    s, pos = x.rsplit('_', 1)
    return s, int(pos)

ids = []
pos = []
for v in ss['id_seqpos'].values:
    i, p = parse_id_seqpos(v)
    ids.append(i)
    pos.append(p)
ss['_id'] = ids
ss['_pos'] = pos

for t in targets:
    ss[t] = [pred_map[(i, p, t)] for i, p in zip(ss['_id'], ss['_pos'])]

ss.drop(columns=['_id','_pos'], inplace=True)
ss.to_csv('submission.csv', index=False)
print('Wrote submission.csv with shape:', ss.shape)
print(ss.head())

Computed pos_means up to position: 67


Wrote submission.csv with shape: (25680, 6)
        id_seqpos  reactivity  deg_Mg_pH10  deg_pH10  deg_Mg_50C   deg_50C
0  id_00b436dec_0    0.531662     0.590755  2.001310    0.474894  0.608811
1  id_00b436dec_1    1.515921     2.078699  2.904483    2.025060  1.792079
2  id_00b436dec_2    1.117059     0.713412  0.918506    0.951581  1.013684
3  id_00b436dec_3    0.823220     0.540694  0.716204    0.728683  0.751078
4  id_00b436dec_4    0.664863     0.676643  0.731160    0.846463  0.736625


In [3]:
# Install PyTorch cu121 stack and sanity-check GPU
import os, sys, subprocess, shutil
from pathlib import Path

def pip(*args):
    print('>', *args, flush=True)
    subprocess.run([sys.executable, '-m', 'pip', *args], check=True)

# Uninstall any preinstalled torch stacks (idempotent)
for pkg in ('torch','torchvision','torchaudio'):
    subprocess.run([sys.executable, '-m', 'pip', 'uninstall', '-y', pkg], check=False)

# Clean stray site dirs that can shadow correct wheels (idempotent)
for d in (
    '/app/.pip-target/torch',
    '/app/.pip-target/torch-2.8.0.dist-info',
    '/app/.pip-target/torch-2.4.1.dist-info',
    '/app/.pip-target/torchvision',
    '/app/.pip-target/torchvision-0.23.0.dist-info',
    '/app/.pip-target/torchvision-0.19.1.dist-info',
    '/app/.pip-target/torchaudio',
    '/app/.pip-target/torchaudio-2.8.0.dist-info',
    '/app/.pip-target/torchaudio-2.4.1.dist-info',
    '/app/.pip-target/torchgen',
    '/app/.pip-target/functorch',
):
    if os.path.exists(d):
        print('Removing', d)
        shutil.rmtree(d, ignore_errors=True)

# 1) Install the EXACT cu121 torch stack
pip('install',
    '--index-url', 'https://download.pytorch.org/whl/cu121',
    '--extra-index-url', 'https://pypi.org/simple',
    'torch==2.4.1', 'torchvision==0.19.1', 'torchaudio==2.4.1')

# 2) Freeze torch versions for later installs
Path('constraints.txt').write_text('torch==2.4.1\ntorchvision==0.19.1\ntorchaudio==2.4.1\n')

# 3) Sanity gate (hard fail on drift)
import torch
print('torch:', torch.__version__, 'built CUDA:', getattr(torch.version, 'cuda', None))
print('CUDA available:', torch.cuda.is_available())
assert str(getattr(torch.version,'cuda','')).startswith('12.1'), f'Wrong CUDA build: {torch.version.cuda}'
assert torch.cuda.is_available(), 'CUDA not available'
print('GPU:', torch.cuda.get_device_name(0))





> install --index-url https://download.pytorch.org/whl/cu121 --extra-index-url https://pypi.org/simple torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1




Looking in indexes: https://download.pytorch.org/whl/cu121, https://pypi.org/simple


Collecting torch==2.4.1
  Downloading https://download.pytorch.org/whl/cu121/torch-2.4.1%2Bcu121-cp311-cp311-linux_x86_64.whl (799.0 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 799.0/799.0 MB 230.1 MB/s eta 0:00:00


Collecting torchvision==0.19.1
  Downloading https://download.pytorch.org/whl/cu121/torchvision-0.19.1%2Bcu121-cp311-cp311-linux_x86_64.whl (7.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.1/7.1 MB 230.1 MB/s eta 0:00:00


Collecting torchaudio==2.4.1
  Downloading https://download.pytorch.org/whl/cu121/torchaudio-2.4.1%2Bcu121-cp311-cp311-linux_x86_64.whl (3.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.4/3.4 MB 249.3 MB/s eta 0:00:00


Collecting nvidia-cublas-cu12==12.1.3.1
  Downloading nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl (410.6 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 410.6/410.6 MB 146.7 MB/s eta 0:00:00


Collecting nvidia-cuda-cupti-cu12==12.1.105
  Downloading nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.1/14.1 MB 190.3 MB/s eta 0:00:00


Collecting nvidia-cusparse-cu12==12.1.0.106
  Downloading nvidia_cusparse_cu12-12.1.0.106-py3-none-manylinux1_x86_64.whl (196.0 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 196.0/196.0 MB 221.0 MB/s eta 0:00:00


Collecting jinja2
  Downloading jinja2-3.1.6-py3-none-any.whl (134 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 134.9/134.9 KB 490.0 MB/s eta 0:00:00


Collecting filelock
  Downloading filelock-3.19.1-py3-none-any.whl (15 kB)


Collecting networkx
  Downloading networkx-3.5-py3-none-any.whl (2.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.0/2.0 MB 432.0 MB/s eta 0:00:00


Collecting nvidia-cuda-nvrtc-cu12==12.1.105
  Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23.7/23.7 MB 235.0 MB/s eta 0:00:00


Collecting nvidia-cuda-runtime-cu12==12.1.105
  Downloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 823.6/823.6 KB 494.8 MB/s eta 0:00:00


Collecting sympy
  Downloading sympy-1.14.0-py3-none-any.whl (6.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.3/6.3 MB 265.2 MB/s eta 0:00:00


Collecting nvidia-nccl-cu12==2.20.5
  Downloading nvidia_nccl_cu12-2.20.5-py3-none-manylinux2014_x86_64.whl (176.2 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 176.2/176.2 MB 185.9 MB/s eta 0:00:00


Collecting nvidia-cusolver-cu12==11.4.5.107
  Downloading nvidia_cusolver_cu12-11.4.5.107-py3-none-manylinux1_x86_64.whl (124.2 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 124.2/124.2 MB 157.1 MB/s eta 0:00:00


Collecting triton==3.0.0
  Downloading triton-3.0.0-1-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (209.4 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 209.4/209.4 MB 149.6 MB/s eta 0:00:00


Collecting nvidia-cudnn-cu12==9.1.0.70
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl (664.8 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 664.8/664.8 MB 177.9 MB/s eta 0:00:00


Collecting nvidia-curand-cu12==10.3.2.106
  Downloading nvidia_curand_cu12-10.3.2.106-py3-none-manylinux1_x86_64.whl (56.5 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.5/56.5 MB 178.5 MB/s eta 0:00:00


Collecting typing-extensions>=4.8.0
  Downloading typing_extensions-4.15.0-py3-none-any.whl (44 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44.6/44.6 KB 433.6 MB/s eta 0:00:00


Collecting nvidia-cufft-cu12==11.0.2.54
  Downloading nvidia_cufft_cu12-11.0.2.54-py3-none-manylinux1_x86_64.whl (121.6 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 121.6/121.6 MB 213.3 MB/s eta 0:00:00


Collecting fsspec
  Downloading fsspec-2025.9.0-py3-none-any.whl (199 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 199.3/199.3 KB 493.3 MB/s eta 0:00:00


Collecting nvidia-nvtx-cu12==12.1.105
  Downloading nvidia_nvtx_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (99 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 99.1/99.1 KB 384.6 MB/s eta 0:00:00


Collecting numpy
  Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18.3/18.3 MB 347.3 MB/s eta 0:00:00


Collecting pillow!=8.3.*,>=5.3.0
  Downloading pillow-11.3.0-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (6.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.6/6.6 MB 223.2 MB/s eta 0:00:00


Collecting nvidia-nvjitlink-cu12
  Downloading nvidia_nvjitlink_cu12-12.9.86-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (39.7 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39.7/39.7 MB 175.0 MB/s eta 0:00:00


Collecting MarkupSafe>=2.0
  Downloading MarkupSafe-3.0.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (23 kB)


Collecting mpmath<1.4,>=1.1.0
  Downloading mpmath-1.3.0-py3-none-any.whl (536 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 536.2/536.2 KB 542.1 MB/s eta 0:00:00


Installing collected packages: mpmath, typing-extensions, sympy, pillow, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufft-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, numpy, networkx, MarkupSafe, fsspec, filelock, triton, nvidia-cusparse-cu12, nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, torch, torchvision, torchaudio


Successfully installed MarkupSafe-3.0.2 filelock-3.19.1 fsspec-2025.9.0 jinja2-3.1.6 mpmath-1.3.0 networkx-3.5 numpy-1.26.4 nvidia-cublas-cu12-12.1.3.1 nvidia-cuda-cupti-cu12-12.1.105 nvidia-cuda-nvrtc-cu12-12.1.105 nvidia-cuda-runtime-cu12-12.1.105 nvidia-cudnn-cu12-9.1.0.70 nvidia-cufft-cu12-11.0.2.54 nvidia-curand-cu12-10.3.2.106 nvidia-cusolver-cu12-11.4.5.107 nvidia-cusparse-cu12-12.1.0.106 nvidia-nccl-cu12-2.20.5 nvidia-nvjitlink-cu12-12.9.86 nvidia-nvtx-cu12-12.1.105 pillow-11.3.0 sympy-1.14.0 torch-2.4.1+cu121 torchaudio-2.4.1+cu121 torchvision-0.19.1+cu121 triton-3.0.0 typing-extensions-4.15.0


torch: 2.4.1+cu121 built CUDA: 12.1
CUDA available: True
GPU: NVIDIA A10-24Q


In [5]:
# Build CV splits (StratifiedGroupKFold on SN_filter) and define masked MCRMSE metric
import pandas as pd, numpy as np
from pathlib import Path
from sklearn.model_selection import GroupKFold
try:
    from iterstrat.ml_stratifiers import MultilabelStratifiedKFold  # optional
except Exception:
    MultilabelStratifiedKFold = None
try:
    from sklearn.model_selection import StratifiedKFold
except Exception:
    StratifiedKFold = None

SEED = 42
N_FOLDS = 5
targets = ['reactivity','deg_Mg_pH10','deg_pH10','deg_Mg_50C','deg_50C']

def set_seed(seed=SEED):
    import random, os
    random.seed(seed); np.random.seed(seed)
set_seed()

# Robust JSON reader (handles non-line-delimited and line-delimited)
try:
    tr = pd.read_json('train.json')
except ValueError:
    tr = pd.read_json('train.json', lines=True)

# Grouping by sequence id; stratify by SN_filter (0/1).
groups = tr['id'].values
y_strat = tr['SN_filter'].values if 'SN_filter' in tr.columns else np.zeros(len(tr), dtype=int)

# Create folds
folds = np.full(len(tr), -1, dtype=int)
gkf = GroupKFold(n_splits=N_FOLDS)
splitter = gkf.split(np.zeros(len(tr)), y_strat, groups)
for fold, (tr_idx, va_idx) in enumerate(splitter):
    folds[va_idx] = fold

assert (folds >= 0).all(), 'Some folds not assigned'
tr_folds = tr[['id','SN_filter']].copy() if 'SN_filter' in tr.columns else tr[['id']].copy()
tr_folds['fold'] = folds
tr_folds.to_csv('folds.csv', index=False)
print('Saved folds.csv with shape:', tr_folds.shape)
print('Fold counts:\n', tr_folds['fold'].value_counts().sort_index())
if 'SN_filter' in tr_folds.columns:
    print('SN_filter by fold:\n', tr_folds.groupby('fold')['SN_filter'].value_counts().unstack(fill_value=0))

# Masked MCRMSE utility (evaluate on positions 0..67 only)
def masked_mcrmse(y_true, y_pred, mask):
    # y_* shape: [B, L, T], mask: [B, L] boolean for scored positions
    T = y_true.shape[-1]
    rmses = []
    for t in range(T):
        yt = y_true[..., t]; yp = y_pred[..., t]
        m = mask.astype(bool)
        diff = (yp - yt)[m]
        rmse = np.sqrt(np.mean(diff**2)) if diff.size > 0 else np.nan
        rmses.append(rmse)
    return float(np.nanmean(rmses))

print('Masked MCRMSE ready. Use mask = (pos < 68).')

Saved folds.csv with shape: (2160, 3)
Fold counts:
 fold
0    432
1    432
2    432
3    432
4    432
Name: count, dtype: int64
SN_filter by fold:
 SN_filter    0    1
fold               
0          154  278
1          149  283
2          170  262
3          168  264
4          170  262
Masked MCRMSE ready. Use mask = (pos < 68).


In [10]:
# Dataset, model, and training utilities with engineered structure features, standardization, SmoothL1 loss, Conv1D front-end
import math, random, gc, time, os
import numpy as np, pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
SEED = 42
torch.manual_seed(SEED); np.random.seed(SEED); random.seed(SEED)

# Encoders
BASES = ['A','C','G','U']
STRUCT = ['.', '(', ')']
LOOPS = list('BEHIMSX')
BASE2IDX = {b:i for i,b in enumerate(BASES)}
STRUCT2IDX = {c:i for i,c in enumerate(STRUCT)}
LOOP2IDX = {c:i for i,c in enumerate(LOOPS)}

def one_hot(idx, K):
    v = np.zeros(K, dtype=np.float32)
    if idx is not None and 0 <= idx < K: v[idx]=1.0
    return v

def dotbracket_partner_and_depth(struct):
    stack = []
    L = len(struct)
    partner = np.full(L, -1, dtype=np.int32)
    depth = np.zeros(L, dtype=np.int32)
    cur_depth = 0
    for i,ch in enumerate(struct):
        if ch == '(':
            stack.append(i); cur_depth += 1; depth[i] = cur_depth
        elif ch == ')':
            if stack:
                j = stack.pop();
                partner[i] = j; partner[j] = i
                depth[i] = cur_depth
            cur_depth = max(0, cur_depth-1)
        else:
            depth[i] = cur_depth
    return partner, depth

def run_lengths(arr):
    # length of the contiguous run containing each position (arr is array of codes/ints)
    L = len(arr); out = np.zeros(L, dtype=np.int32)
    start = 0
    for i in range(1, L+1):
        if i==L or arr[i] != arr[start]:
            out[start:i] = i - start
            start = i
    return out

def encode_row(row):
    seq = str(row['sequence']); struct = str(row['structure']); loop = str(row['predicted_loop_type'])
    L = len(seq)
    partner, depth = dotbracket_partner_and_depth(struct)
    is_paired = (partner >= 0).astype(np.float32)
    partner_dist = np.zeros(L, dtype=np.float32)
    for i in range(L):
        if partner[i] >= 0:
            partner_dist[i] = abs(partner[i] - i) / max(1,(L-1))
        else:
            partner_dist[i] = 0.0
    # stem run-length (paired runs) and loop run-length (same loop char) normalized
    paired_int = is_paired.astype(np.int32)
    stem_run = run_lengths(paired_int) / max(1, L)
    loop_codes = np.array([LOOP2IDX.get(c, 0) for c in loop], dtype=np.int32)
    loop_run = run_lengths(loop_codes) / max(1, L)
    # local GC ratio (win=5 centered)
    base_codes = np.array([BASE2IDX.get(b, -1) for b in seq], dtype=np.int32)
    is_gc = np.isin(base_codes, [BASE2IDX['G'], BASE2IDX['C']]).astype(np.float32)
    gc_win = np.zeros(L, dtype=np.float32)
    k = 5; rad = k//2
    csum = np.concatenate([[0.0], is_gc.cumsum()])
    for i in range(L):
        a = max(0, i - rad); b = min(L, i + rad + 1)
        gc_win[i] = (csum[b] - csum[a]) / max(1, b - a)
    feats = []
    for i,(b,s,l) in enumerate(zip(seq, struct, loop)):
        v = []
        v.extend(one_hot(BASE2IDX.get(b, -1), len(BASES)))
        v.extend(one_hot(STRUCT2IDX.get(s, -1), len(STRUCT)))
        v.extend(one_hot(LOOP2IDX.get(l, -1), len(LOOPS)))
        # positional scalars
        pos_norm = i / max(1,(L-1))
        v.append(float(i))
        v.append(float(L))
        v.append(pos_norm)
        v.append(math.sin(2*math.pi*pos_norm))
        v.append(math.cos(2*math.pi*pos_norm))
        # engineered continuous features
        v.append(float(is_paired[i]))
        v.append(float(partner_dist[i]))
        v.append(float(depth[i]))
        v.append(float(stem_run[i]))
        v.append(float(loop_run[i]))
        v.append(float(gc_win[i]))
        feats.append(v)
    x = np.asarray(feats, dtype=np.float32)  # [L, C]
    return x

TARGETS = ['reactivity','deg_Mg_pH10','deg_pH10','deg_Mg_50C','deg_50C']
SCORED_LEN = 68

def get_targets(row):
    y = np.zeros((len(str(row['sequence'])), len(TARGETS)), dtype=np.float32)
    for t_i, t in enumerate(TARGETS):
        arr = np.array(row[t], dtype=np.float32)[:SCORED_LEN]
        y[:SCORED_LEN, t_i] = arr
    return y

# Index where continuous channels start (one-hots first): 4 + 3 + 7 = 14
CONT_START = 4 + 3 + 7

class TrainDataset(Dataset):
    def __init__(self, df, scaler=None, sample_weights=None):
        self.df = df.reset_index(drop=True)
        self.scaler = scaler  # dict with 'mean','std' for continuous cols
        self.sample_weights = sample_weights if sample_weights is not None else np.ones(len(self.df), dtype=np.float32)
    def __len__(self): return len(self.df)
    def __getitem__(self, idx):
        r = self.df.iloc[idx]
        x = encode_row(r)  # [L, C]
        if self.scaler is not None:
            mu = self.scaler['mean']; sd = self.scaler['std']
            x_cont = x[:, CONT_START:]
            x[:, CONT_START:] = np.clip((x_cont - mu) / (sd + 1e-6), -5.0, 5.0)
        y = get_targets(r) # [L, T]
        L = x.shape[0]
        mask = np.zeros((L,), dtype=np.float32)
        mask[:SCORED_LEN]=1.0
        w = float(self.sample_weights[idx])
        return x, y, mask, r['id'], w

class TestDataset(Dataset):
    def __init__(self, df, scaler=None):
        self.df = df.reset_index(drop=True); self.scaler = scaler
    def __len__(self): return len(self.df)
    def __getitem__(self, idx):
        r = self.df.iloc[idx]
        x = encode_row(r)
        if self.scaler is not None:
            mu = self.scaler['mean']; sd = self.scaler['std']
            x_cont = x[:, CONT_START:]
            x[:, CONT_START:] = np.clip((x_cont - mu) / (sd + 1e-6), -5.0, 5.0)
        L = x.shape[0]
        mask = np.zeros((L,), dtype=np.float32)
        mask[:SCORED_LEN]=1.0
        return x, mask, r['id']

def collate_train(batch):
    lens = [b[0].shape[0] for b in batch]
    maxL = max(lens)
    C = batch[0][0].shape[1]
    T = batch[0][1].shape[1]
    B = len(batch)
    x = np.zeros((B,maxL,C), dtype=np.float32)
    y = np.zeros((B,maxL,T), dtype=np.float32)
    mask = np.zeros((B,maxL), dtype=np.float32)
    ids = []; w = np.zeros((B,), dtype=np.float32)
    for i,(xi, yi, mi, idv, wi) in enumerate(batch):
        L = xi.shape[0]
        x[i,:L,:] = xi
        y[i,:L,:] = yi
        mask[i,:L] = mi
        ids.append(idv); w[i]=wi
    return torch.from_numpy(x), torch.from_numpy(y), torch.from_numpy(mask), ids, torch.from_numpy(w)

def collate_test(batch):
    lens = [b[0].shape[0] for b in batch]
    maxL = max(lens); C = batch[0][0].shape[1]; B=len(batch)
    x = np.zeros((B,maxL,C), dtype=np.float32)
    mask = np.zeros((B,maxL), dtype=np.float32)
    ids = []
    for i,(xi, mi, idv) in enumerate(batch):
        L = xi.shape[0]
        x[i,:L,:]=xi; mask[i,:L]=mi; ids.append(idv)
    return torch.from_numpy(x), torch.from_numpy(mask), ids

class CNNBiLSTMModel(nn.Module):
    def __init__(self, in_ch, conv_ch=128, hidden=256, num_layers=2, out_ch=5, dropout=0.3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_ch, conv_ch, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Dropout(dropout),
        )
        self.lstm = nn.LSTM(input_size=conv_ch, hidden_size=hidden, num_layers=num_layers,
                            dropout=dropout if num_layers>1 else 0.0, batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(dropout)
        self.head = nn.Linear(hidden*2, out_ch)
    def forward(self, x, lengths):
        # x: [B,L,C]; lengths: [B]
        x = x.transpose(1,2)  # [B,C,L]
        x = self.conv(x)
        x = x.transpose(1,2)  # [B,L,conv_ch]
        packed = nn.utils.rnn.pack_padded_sequence(x, lengths.cpu(), batch_first=True, enforce_sorted=False)
        out_packed, _ = self.lstm(packed)
        out, _ = nn.utils.rnn.pad_packed_sequence(out_packed, batch_first=True)
        out = self.dropout(out)
        y = self.head(out)  # [B,L,5]
        return y

def _lengths_from_x(xb):
    return (xb.abs().sum(dim=2) > 0).sum(dim=1).long().clamp(min=1)

def masked_smoothl1_loss(pred, target, mask, sample_weights=None, beta=1.0):
    # pred/target: [B,L,T], mask: [B,L], sample_weights: [B]
    loss = F.smooth_l1_loss(pred, target, reduction='none', beta=beta)  # [B,L,T]
    mask3 = mask.unsqueeze(-1)
    loss = loss * mask3
    denom = (mask3.sum(dim=(1,2)).clamp(min=1.0))  # [B]
    loss_b = loss.sum(dim=(1,2)) / denom  # [B]
    if sample_weights is not None:
        loss_b = loss_b * sample_weights
        return loss_b.mean()
    else:
        return loss_b.mean()

def compute_scaler(train_df):
    # Fit mean/std for continuous channels from train fold only
    xs = []
    for _, r in train_df.iterrows():
        x = encode_row(r)
        xs.append(x[:, CONT_START:])
    X = np.concatenate(xs, axis=0)  # [sumL, C_cont]
    mu = X.mean(axis=0).astype(np.float32)
    sd = X.std(axis=0).astype(np.float32)
    return {'mean': mu, 'std': sd}

def run_fold(fold, tr_df, te_df, folds_df, epochs=20, batch_size=64, lr=1e-3, wd=1e-4, use_amp=True):
    # split ids
    tr_ids = set(tr_df['id'].values)
    train_ids = set(folds_df.loc[folds_df['fold']!=fold, 'id'].values) & tr_ids
    valid_ids = set(folds_df.loc[folds_df['fold']==fold, 'id'].values) & tr_ids
    dtr = tr_df[tr_df['id'].isin(train_ids)].reset_index(drop=True)
    dva = tr_df[tr_df['id'].isin(valid_ids)].reset_index(drop=True)
    # sample weights by SN_filter
    if 'SN_filter' in dtr.columns:
        sw_tr = np.where(dtr['SN_filter'].values.astype(int)==1, 1.0, 0.5).astype(np.float32)
    else:
        sw_tr = np.ones(len(dtr), dtype=np.float32)
    # scaler per fold
    scaler = compute_scaler(dtr)
    ds_tr = TrainDataset(dtr, scaler=scaler, sample_weights=sw_tr)
    ds_va = TrainDataset(dva, scaler=scaler, sample_weights=np.ones(len(dva), dtype=np.float32))
    ds_te = TestDataset(te_df, scaler=scaler)
    dl_tr = DataLoader(ds_tr, batch_size=batch_size, shuffle=True, num_workers=2, pin_memory=True, collate_fn=collate_train)
    dl_va = DataLoader(ds_va, batch_size=batch_size, shuffle=False, num_workers=2, pin_memory=True, collate_fn=collate_train)
    dl_te = DataLoader(ds_te, batch_size=batch_size, shuffle=False, num_workers=2, pin_memory=True, collate_fn=collate_test)
    in_ch = ds_tr[0][0].shape[1]
    model = CNNBiLSTMModel(in_ch=in_ch).to(DEVICE)
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=wd)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, mode='min', patience=2, factor=0.5, verbose=False)
    scaler_amp = torch.cuda.amp.GradScaler(enabled=use_amp and DEVICE=='cuda')
    best = 1e9; best_state=None; start=time.time()
    for ep in range(1, epochs+1):
        model.train(); tr_loss=0.0; n_batches=0; t0=time.time()
        for xb, yb, mb, _, wb in dl_tr:
            xb=xb.to(DEVICE); yb=yb.to(DEVICE); mb=mb.to(DEVICE); wb=wb.to(DEVICE); lens = _lengths_from_x(xb)
            opt.zero_grad(set_to_none=True)
            with torch.amp.autocast('cuda', enabled=use_amp and DEVICE=='cuda'):
                preds = model(xb, lens)
                loss = masked_smoothl1_loss(preds, yb, mb, sample_weights=wb, beta=1.0)
            scaler_amp.scale(loss).backward()
            nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            scaler_amp.step(opt); scaler_amp.update()
            tr_loss += loss.item(); n_batches+=1
        # validate
        model.eval(); ys=[]; ps=[]; ms=[]
        with torch.no_grad():
            for xb, yb, mb, _, _ in dl_va:
                xb=xb.to(DEVICE); yb=yb.to(DEVICE); mb=mb.to(DEVICE); lens = _lengths_from_x(xb)
                preds = model(xb, lens)
                ys.append(yb.cpu().numpy()); ps.append(preds.cpu().numpy()); ms.append(mb.cpu().numpy())
        y_true = np.concatenate(ys, axis=0); y_pred = np.concatenate(ps, axis=0); m = np.concatenate(ms, axis=0).astype(bool)
        # MCRMSE
        rmses=[]
        for t in range(y_true.shape[-1]):
            diff = (y_pred[...,t]-y_true[...,t])[m]
            rmse = float(np.sqrt(np.mean(diff**2))) if diff.size>0 else np.nan
            rmses.append(rmse)
        mcrmse = float(np.nanmean(rmses))
        scheduler.step(mcrmse)
        per_target_str = '[' + ', '.join(str(round(r,5)) for r in rmses) + ']'
        print(f'Fold {fold} Epoch {ep}/{epochs} tr_loss={tr_loss/max(1,n_batches):.5f} val_MCRMSE={mcrmse:.5f} per-target={per_target_str} time_ep={time.time()-t0:.1f}s elapsed={time.time()-start:.1f}s', flush=True)
        if mcrmse < best: best=mcrmse; best_state = {k:v.cpu() for k,v in model.state_dict().items()}
    if best_state is not None: model.load_state_dict(best_state)
    # OOF preds
    model.eval(); ys=[]; ps=[]; ms=[]; va_ids=[]
    with torch.no_grad():
        for xb, yb, mb, ids, _ in dl_va:
            xb=xb.to(DEVICE); mb=mb.to(DEVICE); lens = _lengths_from_x(xb)
            preds = model(xb, lens)
            ys.append(yb.cpu().numpy()); ps.append(preds.cpu().numpy()); ms.append(mb.cpu().numpy()); va_ids.extend(ids)
    oof_y = np.concatenate(ys, axis=0); oof_p = np.concatenate(ps, axis=0); oof_m = np.concatenate(ms, axis=0).astype(bool)
    # Test preds with reverse TTA
    test_preds = [] ; test_ids = []
    with torch.no_grad():
        for xb, mb, ids in dl_te:
            xb=xb.to(DEVICE); mb=mb.to(DEVICE); lens = _lengths_from_x(xb)
            preds_fwd = model(xb, lens)
            xb_rev = xb.flip(1); lens_rev = lens  # same lengths
            preds_rev = model(xb_rev, lens_rev).flip(1)
            preds = 0.5*(preds_fwd + preds_rev)
            test_preds.append(preds.cpu().numpy()); test_ids.extend(ids)
    test_pred = np.concatenate(test_preds, axis=0)
    return oof_y, oof_p, oof_m, va_ids, test_pred, test_ids, best

print('Dataset/model utilities ready (engineered features + standardization + SmoothL1 + Conv1D + reverse TTA).')

Dataset/model utilities ready (engineered features + standardization + SmoothL1 + Conv1D + reverse TTA).


In [11]:
# Orchestrate 5-fold training, aggregate OOF and test preds, and build submission
import pandas as pd, numpy as np, time, os

# Load data and folds
try:
    tr = pd.read_json('train.json')
except ValueError:
    tr = pd.read_json('train.json', lines=True)
try:
    te = pd.read_json('test.json')
except ValueError:
    te = pd.read_json('test.json', lines=True)
folds_df = pd.read_csv('folds.csv')

N_FOLDS = 5
EPOCHS = 20  # increased for stronger training with engineered features
BATCH_SIZE = 64
LR = 1e-3
WD = 1e-4

all_oof_true = []
all_oof_pred = []
all_oof_mask = []
all_oof_ids = []
test_pred_accum = None
test_ids_ref = None
fold_scores = []

t0 = time.time()
for fold in range(N_FOLDS):
    print(f'===== Fold {fold} / {N_FOLDS} =====', flush=True)
    oof_y, oof_p, oof_m, va_ids, te_pred, te_ids, best = run_fold(
        fold, tr, te, folds_df, epochs=EPOCHS, batch_size=BATCH_SIZE, lr=LR, wd=WD, use_amp=True
    )
    # compute fold MCRMSE
    m = oof_m.astype(bool)
    rmses=[]
    for t in range(oof_y.shape[-1]):
        diff = (oof_p[...,t]-oof_y[...,t])[m]
        rmse = float(np.sqrt(np.mean(diff**2))) if diff.size>0 else np.nan
        rmses.append(rmse)
    mcrmse = float(np.nanmean(rmses))
    fold_scores.append(mcrmse)
    print(f'Fold {fold} best_val_MCRMSE={best:.5f} OOF_MCRMSE={mcrmse:.5f} per-target={[round(r,5) for r in rmses]}', flush=True)
    all_oof_true.append(oof_y); all_oof_pred.append(oof_p); all_oof_mask.append(oof_m); all_oof_ids.extend(va_ids)
    # accumulate test preds (align by te_ids order)
    if test_pred_accum is None:
        test_pred_accum = te_pred.copy()
        test_ids_ref = te_ids
    else:
        # ensure same order
        assert test_ids_ref == te_ids, 'Test id order mismatch across folds'
        test_pred_accum += te_pred

print('Fold scores:', fold_scores, 'mean:', float(np.nanmean(fold_scores)))

# Average test predictions across folds
test_pred_mean = test_pred_accum / N_FOLDS
print('Test pred shape:', test_pred_mean.shape)

# Build submission by merging onto sample_submission order
ss = pd.read_csv('sample_submission.csv')

# Map test id to row index in te dataframe to access its prediction row
id_to_idx = {idv:i for i, idv in enumerate(test_ids_ref)}
targets = ['reactivity','deg_Mg_pH10','deg_pH10','deg_Mg_50C','deg_50C']

def parse_id_seqpos(x):
    s, pos = x.rsplit('_', 1)
    return s, int(pos)

ids = []
pos = []
for v in ss['id_seqpos'].values:
    i, p = parse_id_seqpos(v)
    ids.append(i); pos.append(p)
ss['_id'] = ids; ss['_pos'] = pos

for t_i, t in enumerate(targets):
    vals = []
    for i, p in zip(ss['_id'].values, ss['_pos'].values):
        idx = id_to_idx[i]
        vals.append(float(test_pred_mean[idx, p, t_i]))
    ss[t] = vals

ss.drop(columns=['_id','_pos'], inplace=True)
ss.to_csv('submission.csv', index=False)
print('Wrote submission.csv with shape:', ss.shape)

# Compute overall OOF MCRMSE
oof_y_full = np.concatenate(all_oof_true, axis=0)
oof_p_full = np.concatenate(all_oof_pred, axis=0)
oof_m_full = np.concatenate(all_oof_mask, axis=0).astype(bool)
rmses=[]
for t in range(oof_y_full.shape[-1]):
    diff = (oof_p_full[...,t]-oof_y_full[...,t])[oof_m_full]
    rmse = float(np.sqrt(np.mean(diff**2))) if diff.size>0 else np.nan
    rmses.append(rmse)
mcrmse = float(np.nanmean(rmses))
print(f'OOF MCRMSE={mcrmse:.5f} per-target={[round(r,5) for r in rmses]} total_time={time.time()-t0:.1f}s')

print('Done. Next: consider adding BPP features and reverse TTA to improve scores.')

===== Fold 0 / 5 =====


  scaler_amp = torch.cuda.amp.GradScaler(enabled=use_amp and DEVICE=='cuda')


Fold 0 Epoch 1/20 tr_loss=0.51085 val_MCRMSE=0.92862 per-target=[0.67074, 0.69005, 1.28625, 0.88451, 1.11155] time_ep=1.7s elapsed=1.7s


Fold 0 Epoch 2/20 tr_loss=0.40382 val_MCRMSE=0.91585 per-target=[0.65898, 0.67288, 1.26874, 0.87228, 1.10636] time_ep=1.6s elapsed=3.3s


Fold 0 Epoch 3/20 tr_loss=0.38127 val_MCRMSE=0.90645 per-target=[0.65106, 0.65795, 1.25803, 0.86288, 1.10235] time_ep=1.6s elapsed=4.9s


Fold 0 Epoch 4/20 tr_loss=0.36253 val_MCRMSE=0.89826 per-target=[0.64627, 0.64102, 1.25088, 0.8542, 1.09891] time_ep=1.6s elapsed=6.5s


Fold 0 Epoch 5/20 tr_loss=0.34706 val_MCRMSE=0.89244 per-target=[0.64385, 0.63046, 1.2471, 0.84525, 1.09555] time_ep=1.6s elapsed=8.1s


Fold 0 Epoch 6/20 tr_loss=0.33905 val_MCRMSE=0.89179 per-target=[0.64342, 0.62961, 1.24624, 0.84501, 1.09465] time_ep=1.6s elapsed=9.7s


Fold 0 Epoch 7/20 tr_loss=0.33453 val_MCRMSE=0.88966 per-target=[0.64127, 0.6265, 1.24456, 0.84262, 1.09333] time_ep=1.6s elapsed=11.4s


Fold 0 Epoch 8/20 tr_loss=0.32971 val_MCRMSE=0.88885 per-target=[0.64043, 0.62524, 1.24457, 0.84119, 1.09282] time_ep=1.6s elapsed=13.0s


Fold 0 Epoch 9/20 tr_loss=0.32711 val_MCRMSE=0.88758 per-target=[0.6389, 0.62364, 1.24292, 0.8403, 1.09216] time_ep=1.6s elapsed=14.6s


Fold 0 Epoch 10/20 tr_loss=0.32342 val_MCRMSE=0.88594 per-target=[0.63785, 0.62098, 1.24222, 0.83769, 1.09096] time_ep=1.6s elapsed=16.3s


Fold 0 Epoch 11/20 tr_loss=0.32138 val_MCRMSE=0.88633 per-target=[0.63799, 0.62166, 1.24163, 0.83925, 1.09111] time_ep=1.6s elapsed=17.9s


Fold 0 Epoch 12/20 tr_loss=0.32051 val_MCRMSE=0.88428 per-target=[0.63727, 0.61827, 1.24026, 0.83618, 1.0894] time_ep=1.6s elapsed=19.5s


Fold 0 Epoch 13/20 tr_loss=0.31701 val_MCRMSE=0.88388 per-target=[0.6354, 0.61779, 1.2403, 0.83675, 1.08914] time_ep=1.6s elapsed=21.1s


Fold 0 Epoch 14/20 tr_loss=0.31452 val_MCRMSE=0.88270 per-target=[0.63515, 0.61649, 1.23905, 0.83415, 1.08868] time_ep=1.7s elapsed=22.8s


Fold 0 Epoch 15/20 tr_loss=0.31170 val_MCRMSE=0.88287 per-target=[0.63435, 0.61698, 1.23924, 0.83566, 1.08812] time_ep=1.9s elapsed=24.7s


Fold 0 Epoch 16/20 tr_loss=0.30897 val_MCRMSE=0.88185 per-target=[0.63428, 0.61598, 1.23776, 0.83367, 1.08755] time_ep=1.7s elapsed=26.3s


Fold 0 Epoch 17/20 tr_loss=0.30751 val_MCRMSE=0.87929 per-target=[0.63185, 0.61141, 1.23646, 0.83092, 1.0858] time_ep=1.6s elapsed=28.0s


Fold 0 Epoch 18/20 tr_loss=0.30406 val_MCRMSE=0.87942 per-target=[0.63271, 0.61187, 1.23589, 0.83101, 1.0856] time_ep=1.6s elapsed=29.6s


Fold 0 Epoch 19/20 tr_loss=0.30252 val_MCRMSE=0.87758 per-target=[0.63068, 0.60897, 1.23472, 0.8291, 1.08446] time_ep=1.6s elapsed=31.3s


Fold 0 Epoch 20/20 tr_loss=0.30118 val_MCRMSE=0.87828 per-target=[0.63152, 0.61001, 1.23496, 0.82997, 1.08496] time_ep=1.6s elapsed=32.9s


Fold 0 best_val_MCRMSE=0.87758 OOF_MCRMSE=0.87758 per-target=[0.63068, 0.60897, 1.23472, 0.8291, 1.08446]


===== Fold 1 / 5 =====


Fold 1 Epoch 1/20 tr_loss=0.50783 val_MCRMSE=0.90326 per-target=[0.66619, 0.68628, 1.19572, 0.78906, 1.17906] time_ep=1.7s elapsed=1.7s


Fold 1 Epoch 2/20 tr_loss=0.41065 val_MCRMSE=0.88989 per-target=[0.65052, 0.67053, 1.17806, 0.77648, 1.17388] time_ep=1.6s elapsed=3.3s


Fold 1 Epoch 3/20 tr_loss=0.38421 val_MCRMSE=0.87952 per-target=[0.64074, 0.65444, 1.16716, 0.76602, 1.16925] time_ep=1.6s elapsed=4.9s


Fold 1 Epoch 4/20 tr_loss=0.36266 val_MCRMSE=0.87066 per-target=[0.63506, 0.63709, 1.1602, 0.75549, 1.16545] time_ep=1.6s elapsed=6.5s


Fold 1 Epoch 5/20 tr_loss=0.34713 val_MCRMSE=0.86677 per-target=[0.63378, 0.62857, 1.15775, 0.74962, 1.16413] time_ep=1.6s elapsed=8.2s


Fold 1 Epoch 6/20 tr_loss=0.33974 val_MCRMSE=0.86542 per-target=[0.63266, 0.62651, 1.15711, 0.74803, 1.1628] time_ep=1.6s elapsed=9.8s


Fold 1 Epoch 7/20 tr_loss=0.33606 val_MCRMSE=0.86428 per-target=[0.63172, 0.62516, 1.15575, 0.74604, 1.16271] time_ep=1.7s elapsed=11.4s


Fold 1 Epoch 8/20 tr_loss=0.33284 val_MCRMSE=0.86334 per-target=[0.63111, 0.62337, 1.15501, 0.74547, 1.16171] time_ep=1.7s elapsed=13.1s


Fold 1 Epoch 9/20 tr_loss=0.33173 val_MCRMSE=0.86274 per-target=[0.62956, 0.62238, 1.15494, 0.74494, 1.16189] time_ep=2.0s elapsed=15.2s


Fold 1 Epoch 10/20 tr_loss=0.32728 val_MCRMSE=0.86175 per-target=[0.62837, 0.62125, 1.15397, 0.74339, 1.16176] time_ep=2.2s elapsed=17.4s


Fold 1 Epoch 11/20 tr_loss=0.32440 val_MCRMSE=0.86017 per-target=[0.62706, 0.61846, 1.15285, 0.74204, 1.16043] time_ep=1.7s elapsed=19.1s


Fold 1 Epoch 12/20 tr_loss=0.32151 val_MCRMSE=0.86020 per-target=[0.62708, 0.61971, 1.1526, 0.74138, 1.16022] time_ep=1.7s elapsed=20.7s


Fold 1 Epoch 13/20 tr_loss=0.31869 val_MCRMSE=0.85905 per-target=[0.62667, 0.61697, 1.15135, 0.74088, 1.15937] time_ep=1.7s elapsed=22.4s


Fold 1 Epoch 14/20 tr_loss=0.31600 val_MCRMSE=0.85950 per-target=[0.62703, 0.61823, 1.15144, 0.74088, 1.15994] time_ep=1.7s elapsed=24.0s


Fold 1 Epoch 15/20 tr_loss=0.31417 val_MCRMSE=0.85723 per-target=[0.62541, 0.61427, 1.14974, 0.73874, 1.158] time_ep=1.7s elapsed=25.7s


Fold 1 Epoch 16/20 tr_loss=0.31166 val_MCRMSE=0.85601 per-target=[0.62411, 0.61302, 1.14933, 0.73598, 1.15759] time_ep=1.6s elapsed=27.3s


Fold 1 Epoch 17/20 tr_loss=0.30909 val_MCRMSE=0.85581 per-target=[0.62414, 0.612, 1.1498, 0.7354, 1.15772] time_ep=1.7s elapsed=29.0s


Fold 1 Epoch 18/20 tr_loss=0.30781 val_MCRMSE=0.85568 per-target=[0.62403, 0.61181, 1.14896, 0.7361, 1.15748] time_ep=1.7s elapsed=30.7s


Fold 1 Epoch 19/20 tr_loss=0.30543 val_MCRMSE=0.85458 per-target=[0.62376, 0.61004, 1.14804, 0.73478, 1.15627] time_ep=1.6s elapsed=32.4s


Fold 1 Epoch 20/20 tr_loss=0.30426 val_MCRMSE=0.85327 per-target=[0.62234, 0.6071, 1.14723, 0.73367, 1.15601] time_ep=1.7s elapsed=34.1s


Fold 1 best_val_MCRMSE=0.85327 OOF_MCRMSE=0.85327 per-target=[0.62234, 0.6071, 1.14723, 0.73367, 1.15601]


===== Fold 2 / 5 =====


Fold 2 Epoch 1/20 tr_loss=0.50759 val_MCRMSE=0.88423 per-target=[0.69684, 0.66298, 1.22714, 0.80654, 1.02763] time_ep=1.6s elapsed=1.6s


Fold 2 Epoch 2/20 tr_loss=0.40464 val_MCRMSE=0.87333 per-target=[0.68665, 0.64785, 1.20921, 0.7989, 1.02405] time_ep=1.6s elapsed=3.3s


Fold 2 Epoch 3/20 tr_loss=0.38138 val_MCRMSE=0.86607 per-target=[0.68286, 0.63566, 1.20072, 0.79007, 1.02105] time_ep=1.7s elapsed=4.9s


Fold 2 Epoch 4/20 tr_loss=0.36436 val_MCRMSE=0.85875 per-target=[0.67912, 0.62112, 1.19348, 0.78177, 1.01824] time_ep=1.6s elapsed=6.6s


Fold 2 Epoch 5/20 tr_loss=0.35152 val_MCRMSE=0.85414 per-target=[0.67607, 0.6134, 1.1902, 0.77545, 1.01557] time_ep=1.7s elapsed=8.3s


Fold 2 Epoch 6/20 tr_loss=0.34415 val_MCRMSE=0.85395 per-target=[0.67729, 0.61222, 1.18935, 0.77538, 1.01553] time_ep=1.6s elapsed=10.0s


Fold 2 Epoch 7/20 tr_loss=0.33975 val_MCRMSE=0.85200 per-target=[0.67455, 0.60981, 1.18857, 0.77314, 1.01393] time_ep=1.7s elapsed=11.7s


Fold 2 Epoch 8/20 tr_loss=0.33622 val_MCRMSE=0.85062 per-target=[0.67329, 0.60791, 1.18817, 0.77103, 1.01272] time_ep=1.7s elapsed=13.3s


Fold 2 Epoch 9/20 tr_loss=0.33327 val_MCRMSE=0.84966 per-target=[0.67256, 0.60691, 1.18533, 0.7707, 1.01282] time_ep=1.6s elapsed=15.0s


Fold 2 Epoch 10/20 tr_loss=0.32995 val_MCRMSE=0.84713 per-target=[0.67081, 0.60283, 1.1843, 0.76725, 1.01046] time_ep=1.6s elapsed=16.6s


Fold 2 Epoch 11/20 tr_loss=0.32662 val_MCRMSE=0.84609 per-target=[0.66983, 0.60187, 1.18275, 0.76608, 1.00989] time_ep=1.7s elapsed=18.3s


Fold 2 Epoch 12/20 tr_loss=0.32493 val_MCRMSE=0.84511 per-target=[0.66848, 0.59969, 1.18251, 0.76521, 1.00965] time_ep=1.7s elapsed=19.9s


Fold 2 Epoch 13/20 tr_loss=0.32263 val_MCRMSE=0.84405 per-target=[0.66786, 0.59869, 1.18116, 0.76415, 1.00838] time_ep=1.6s elapsed=21.6s


Fold 2 Epoch 14/20 tr_loss=0.32021 val_MCRMSE=0.84427 per-target=[0.66807, 0.59823, 1.18113, 0.76442, 1.00949] time_ep=1.6s elapsed=23.2s


Fold 2 Epoch 15/20 tr_loss=0.31778 val_MCRMSE=0.84297 per-target=[0.6665, 0.59677, 1.18, 0.7634, 1.00819] time_ep=1.7s elapsed=24.9s


Fold 2 Epoch 16/20 tr_loss=0.31563 val_MCRMSE=0.84167 per-target=[0.66524, 0.5951, 1.17898, 0.76148, 1.00753] time_ep=1.6s elapsed=26.5s


Fold 2 Epoch 17/20 tr_loss=0.31371 val_MCRMSE=0.84096 per-target=[0.6648, 0.59325, 1.17864, 0.76168, 1.00644] time_ep=1.7s elapsed=28.2s


Fold 2 Epoch 18/20 tr_loss=0.31210 val_MCRMSE=0.83968 per-target=[0.66352, 0.59186, 1.17738, 0.75981, 1.0058] time_ep=1.7s elapsed=29.8s


Fold 2 Epoch 19/20 tr_loss=0.30965 val_MCRMSE=0.83859 per-target=[0.66287, 0.58975, 1.17682, 0.75815, 1.00538] time_ep=1.6s elapsed=31.5s


Fold 2 Epoch 20/20 tr_loss=0.30827 val_MCRMSE=0.83932 per-target=[0.66475, 0.59064, 1.17636, 0.75967, 1.00516] time_ep=1.6s elapsed=33.1s


Fold 2 best_val_MCRMSE=0.83859 OOF_MCRMSE=0.83859 per-target=[0.66287, 0.58975, 1.17682, 0.75815, 1.00538]


===== Fold 3 / 5 =====


Fold 3 Epoch 1/20 tr_loss=0.50378 val_MCRMSE=1.02185 per-target=[0.74329, 0.7111, 1.4657, 0.93761, 1.25153] time_ep=1.6s elapsed=1.6s


Fold 3 Epoch 2/20 tr_loss=0.39808 val_MCRMSE=1.00971 per-target=[0.73049, 0.69632, 1.44888, 0.92693, 1.2459] time_ep=1.7s elapsed=3.3s


Fold 3 Epoch 3/20 tr_loss=0.37798 val_MCRMSE=1.00252 per-target=[0.72607, 0.68441, 1.43958, 0.91928, 1.24328] time_ep=1.7s elapsed=5.0s


Fold 3 Epoch 4/20 tr_loss=0.35886 val_MCRMSE=0.99726 per-target=[0.72349, 0.67317, 1.43515, 0.91284, 1.24163] time_ep=1.7s elapsed=6.7s


Fold 3 Epoch 5/20 tr_loss=0.34356 val_MCRMSE=0.99162 per-target=[0.72205, 0.66145, 1.42878, 0.90667, 1.23913] time_ep=1.6s elapsed=8.3s


Fold 3 Epoch 6/20 tr_loss=0.33408 val_MCRMSE=0.98912 per-target=[0.71928, 0.65783, 1.42786, 0.90307, 1.23754] time_ep=1.6s elapsed=9.9s


Fold 3 Epoch 7/20 tr_loss=0.32911 val_MCRMSE=0.98776 per-target=[0.71815, 0.65559, 1.42699, 0.90151, 1.23656] time_ep=1.7s elapsed=11.6s


Fold 3 Epoch 8/20 tr_loss=0.32453 val_MCRMSE=0.98653 per-target=[0.7171, 0.65361, 1.42609, 0.90003, 1.23579] time_ep=1.6s elapsed=13.3s


Fold 3 Epoch 9/20 tr_loss=0.32178 val_MCRMSE=0.98590 per-target=[0.7169, 0.65243, 1.42579, 0.89907, 1.23532] time_ep=1.7s elapsed=14.9s


Fold 3 Epoch 10/20 tr_loss=0.31886 val_MCRMSE=0.98607 per-target=[0.71579, 0.65238, 1.42726, 0.89892, 1.23597] time_ep=1.7s elapsed=16.6s


Fold 3 Epoch 11/20 tr_loss=0.31639 val_MCRMSE=0.98504 per-target=[0.71581, 0.6517, 1.42393, 0.89902, 1.23473] time_ep=1.7s elapsed=18.2s


Fold 3 Epoch 12/20 tr_loss=0.31460 val_MCRMSE=0.98351 per-target=[0.71475, 0.64901, 1.42377, 0.89627, 1.23377] time_ep=1.7s elapsed=19.9s


Fold 3 Epoch 13/20 tr_loss=0.31141 val_MCRMSE=0.98198 per-target=[0.71381, 0.64654, 1.42222, 0.8949, 1.23242] time_ep=1.7s elapsed=21.6s


Fold 3 Epoch 14/20 tr_loss=0.30819 val_MCRMSE=0.98012 per-target=[0.71158, 0.64378, 1.42076, 0.89303, 1.23145] time_ep=1.6s elapsed=23.2s


Fold 3 Epoch 15/20 tr_loss=0.30567 val_MCRMSE=0.97972 per-target=[0.71162, 0.64353, 1.41977, 0.89252, 1.23119] time_ep=1.6s elapsed=24.9s


Fold 3 Epoch 16/20 tr_loss=0.30372 val_MCRMSE=0.97905 per-target=[0.71088, 0.64193, 1.42061, 0.89105, 1.23078] time_ep=1.7s elapsed=26.5s


Fold 3 Epoch 17/20 tr_loss=0.30021 val_MCRMSE=0.97836 per-target=[0.71047, 0.6408, 1.41964, 0.89059, 1.23031] time_ep=1.7s elapsed=28.2s


Fold 3 Epoch 18/20 tr_loss=0.29998 val_MCRMSE=0.97901 per-target=[0.71206, 0.64282, 1.41912, 0.89087, 1.23018] time_ep=1.6s elapsed=29.9s


Fold 3 Epoch 19/20 tr_loss=0.29789 val_MCRMSE=0.97666 per-target=[0.70901, 0.63791, 1.41909, 0.88885, 1.22841] time_ep=1.6s elapsed=31.5s


Fold 3 Epoch 20/20 tr_loss=0.29532 val_MCRMSE=0.97609 per-target=[0.70899, 0.63708, 1.41783, 0.88773, 1.22881] time_ep=1.7s elapsed=33.2s


Fold 3 best_val_MCRMSE=0.97609 OOF_MCRMSE=0.97609 per-target=[0.70899, 0.63708, 1.41783, 0.88773, 1.22881]


===== Fold 4 / 5 =====


Fold 4 Epoch 1/20 tr_loss=0.49804 val_MCRMSE=0.95594 per-target=[0.73223, 0.67484, 1.33571, 0.87283, 1.16411] time_ep=1.7s elapsed=1.7s


Fold 4 Epoch 2/20 tr_loss=0.40235 val_MCRMSE=0.94718 per-target=[0.72487, 0.66295, 1.32261, 0.86427, 1.1612] time_ep=1.6s elapsed=3.3s


Fold 4 Epoch 3/20 tr_loss=0.37977 val_MCRMSE=0.93897 per-target=[0.71835, 0.64843, 1.31486, 0.85532, 1.1579] time_ep=1.6s elapsed=4.9s


Fold 4 Epoch 4/20 tr_loss=0.35885 val_MCRMSE=0.93274 per-target=[0.71627, 0.63472, 1.30976, 0.84791, 1.15504] time_ep=1.6s elapsed=6.6s


Fold 4 Epoch 5/20 tr_loss=0.34607 val_MCRMSE=0.93034 per-target=[0.71546, 0.6298, 1.30885, 0.84389, 1.15369] time_ep=1.6s elapsed=8.2s


Fold 4 Epoch 6/20 tr_loss=0.33951 val_MCRMSE=0.92848 per-target=[0.71306, 0.6276, 1.30694, 0.84216, 1.15264] time_ep=1.6s elapsed=9.8s


Fold 4 Epoch 7/20 tr_loss=0.33650 val_MCRMSE=0.92807 per-target=[0.71346, 0.62658, 1.30655, 0.84145, 1.15233] time_ep=1.7s elapsed=11.5s


Fold 4 Epoch 8/20 tr_loss=0.33324 val_MCRMSE=0.92737 per-target=[0.71246, 0.62554, 1.30588, 0.84078, 1.15218] time_ep=1.6s elapsed=13.2s


Fold 4 Epoch 9/20 tr_loss=0.33027 val_MCRMSE=0.92623 per-target=[0.71054, 0.62377, 1.30628, 0.83928, 1.15126] time_ep=1.7s elapsed=14.8s


Fold 4 Epoch 10/20 tr_loss=0.32863 val_MCRMSE=0.92667 per-target=[0.71196, 0.62369, 1.3051, 0.84039, 1.15222] time_ep=1.6s elapsed=16.5s


Fold 4 Epoch 11/20 tr_loss=0.32585 val_MCRMSE=0.92396 per-target=[0.70887, 0.62065, 1.30332, 0.83737, 1.14958] time_ep=1.7s elapsed=18.1s


Fold 4 Epoch 12/20 tr_loss=0.32271 val_MCRMSE=0.92408 per-target=[0.70902, 0.62029, 1.30357, 0.8374, 1.15012] time_ep=1.7s elapsed=19.8s


Fold 4 Epoch 13/20 tr_loss=0.32101 val_MCRMSE=0.92198 per-target=[0.70752, 0.61716, 1.30076, 0.8358, 1.14867] time_ep=1.6s elapsed=21.4s


Fold 4 Epoch 14/20 tr_loss=0.31747 val_MCRMSE=0.92126 per-target=[0.70577, 0.61621, 1.30104, 0.83537, 1.14792] time_ep=1.6s elapsed=23.1s


Fold 4 Epoch 15/20 tr_loss=0.31456 val_MCRMSE=0.92023 per-target=[0.70506, 0.61467, 1.3004, 0.83378, 1.14723] time_ep=1.6s elapsed=24.7s


Fold 4 Epoch 16/20 tr_loss=0.31284 val_MCRMSE=0.91981 per-target=[0.70475, 0.61371, 1.29932, 0.83418, 1.14709] time_ep=1.7s elapsed=26.4s


Fold 4 Epoch 17/20 tr_loss=0.31039 val_MCRMSE=0.91913 per-target=[0.70442, 0.61278, 1.29894, 0.83295, 1.14654] time_ep=1.7s elapsed=28.0s


Fold 4 Epoch 18/20 tr_loss=0.30794 val_MCRMSE=0.91841 per-target=[0.70403, 0.61184, 1.29845, 0.83202, 1.14571] time_ep=1.7s elapsed=29.7s


Fold 4 Epoch 19/20 tr_loss=0.30789 val_MCRMSE=0.91808 per-target=[0.70334, 0.61112, 1.29839, 0.83162, 1.14593] time_ep=1.6s elapsed=31.4s


Fold 4 Epoch 20/20 tr_loss=0.30603 val_MCRMSE=0.91724 per-target=[0.70284, 0.61073, 1.29717, 0.83062, 1.14483] time_ep=1.7s elapsed=33.0s


Fold 4 best_val_MCRMSE=0.91724 OOF_MCRMSE=0.91724 per-target=[0.70284, 0.61073, 1.29717, 0.83062, 1.14483]


Fold scores: [0.8775848388671875, 0.8532713294029236, 0.8385927438735962, 0.976087236404419, 0.917236077785492] mean: 0.8925544452667236
Test pred shape: (240, 107, 5)
Wrote submission.csv with shape: (25680, 6)
OOF MCRMSE=0.89440 per-target=[0.6665, 0.61091, 1.25845, 0.80975, 1.12639] total_time=179.1s
Done. Next: consider adding BPP features and reverse TTA to improve scores.


In [9]:
# Install ViennaRNA (RNAplfold), compute BPP features for all ids, and cache per-position aggregates
import os, sys, time, math, json, shutil, subprocess
import numpy as np, pandas as pd
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path

def sh(cmd):
    print(f"$ {cmd}", flush=True)
    p = subprocess.run(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    print(p.stdout)
    return p.returncode

start = time.time()
print('== Install ViennaRNA (RNAplfold) ==', flush=True)
rc = sh('apt-get update -qq && apt-get install -y -qq viennarna')
if rc != 0:
    print('APT install failed; trying pip ViennaRNAx (fallback).', flush=True)
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'ViennaRNAx'], check=True)
print('RNAplfold version/help:')
sh('RNAplfold -h || true')

# Load train/test and collect sequences
try:
    tr = pd.read_json('train.json')
except ValueError:
    tr = pd.read_json('train.json', lines=True)
try:
    te = pd.read_json('test.json')
except ValueError:
    te = pd.read_json('test.json', lines=True)

all_df = pd.concat([tr[['id','sequence']], te[['id','sequence']]], ignore_index=True)
ids = all_df['id'].tolist()
seqs = all_df['sequence'].tolist()

work_root = Path('plfold_work'); work_root.mkdir(exist_ok=True)
out_root = Path('plfold_out'); out_root.mkdir(exist_ok=True)
feat_root = Path('bpp_features'); feat_root.mkdir(exist_ok=True)

def run_plfold(one):
    idv, seq = one
    d = work_root / idv
    out_dp = out_root / f'{idv}_dp.ps'
    out_lunp = out_root / f'{idv}_lunp'
    if out_dp.exists() and out_lunp.exists():
        return idv, True
    if d.exists():
        shutil.rmtree(d, ignore_errors=True)
    d.mkdir(parents=True, exist_ok=True)
    fasta = d / f'{idv}.fa'
    fasta.write_text(f'>{idv}\n{seq}\n')
    env = os.environ.copy()
    env['OMP_NUM_THREADS'] = '1'
    cmd = f'RNAplfold -W 150 -L 120 -u 3 -noLP < {fasta.name}'
    p = subprocess.run(cmd, shell=True, cwd=str(d), env=env, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    if p.returncode != 0:
        return idv, False
    # Move outputs
    dp_src = d / 'plfold_dp.ps'
    lunp_src = d / 'plfold_lunp'
    if dp_src.exists():
        shutil.move(str(dp_src), str(out_dp))
    if lunp_src.exists():
        shutil.move(str(lunp_src), str(out_lunp))
    shutil.rmtree(d, ignore_errors=True)
    return idv, (out_dp.exists() and out_lunp.exists())

print('== Running RNAplfold in parallel ==', flush=True)
ok = 0; fail = 0; t0 = time.time()
with ThreadPoolExecutor(max_workers=min(16, os.cpu_count() or 8)) as ex:
    futures = {ex.submit(run_plfold, item): item[0] for item in zip(ids, seqs)}
    for i, fut in enumerate(as_completed(futures), 1):
        idv = futures[fut]
        try:
            _, success = fut.result()
            if success: ok += 1
            else: fail += 1
        except Exception:
            fail += 1
        if i % 100 == 0 or i == len(futures):
            print(f'Processed {i}/{len(futures)} ok={ok} fail={fail} elapsed={time.time()-t0:.1f}s', flush=True)

def parse_lunp(path):
    # Returns pu1 (unpaired prob for u=1) per position, length L
    lines = Path(path).read_text().strip().splitlines()
    vals = []
    # Try two formats: with leading position index or not
    for ln in lines:
        parts = ln.strip().split()
        if not parts: continue
        # Expect at least pu1 present
        if len(parts) == 1:
            # Single column unlikely; treat as pu1
            vals.append(float(parts[0]))
        else:
            # If first token is integer index
            try:
                int(parts[0]); has_idx = True
            except ValueError:
                has_idx = False
            pu1 = float(parts[1] if has_idx else parts[0])
            vals.append(pu1)
    return np.array(vals, dtype=np.float32)

def parse_dp_ps(path, L):
    # Build LxL probability matrix from _dp.ps; lines like: i j p_sqrt ubox
    P = np.zeros((L, L), dtype=np.float32)
    for ln in Path(path).read_text().splitlines():
        ln = ln.strip()
        if not ln or ln.startswith('%'):
            continue
        parts = ln.split()
        if len(parts) >= 4 and parts[-1] == 'ubox':
            try:
                i = int(parts[0]) - 1
                j = int(parts[1]) - 1
                psqrt = float(parts[2])
            except Exception:
                continue
            p = psqrt * psqrt
            if 0 <= i < L and 0 <= j < L:
                P[i, j] = max(P[i, j], p)
                P[j, i] = max(P[j, i], p)
    return P

def build_features(P, pu1):
    L = P.shape[0]
    eps = 1e-8
    row_sum = P.sum(axis=1)  # [L]
    row_max = P.max(axis=1)
    # entropy on normalized rows
    Q = P / (row_sum[:, None] + eps)
    entropy = -(Q * (np.log(Q + eps))).sum(axis=1)
    # expected distance normalized
    idx = np.arange(L, dtype=np.float32)
    D = np.abs(idx[None, :] - idx[:, None])
    exp_dist = (P * D).sum(axis=1) / (row_sum + eps) / max(1.0, (L - 1))
    # local mass windows
    feats = [row_sum, row_max, entropy, exp_dist]
    for k in (3, 7, 15):
        mask = (D <= k).astype(np.float32)
        lm = (P * mask).sum(axis=1)
        feats.append(lm)
    # append pu1 and pairedness
    pu1 = pu1.astype(np.float32)
    if pu1.shape[0] != L:
        # pad or trim
        pu1_fix = np.zeros((L,), dtype=np.float32)
        m = min(L, pu1.shape[0])
        pu1_fix[:m] = pu1[:m]
        pu1 = pu1_fix
    paired = 1.0 - pu1
    feats.append(pu1); feats.append(paired)
    X = np.stack(feats, axis=1).astype(np.float32)  # [L, C]
    return X

print('== Parsing RNAplfold outputs and caching features ==', flush=True)
n_done = 0
for idv, seq in zip(ids, seqs):
    npy_out = feat_root / f'{idv}.npy'
    if npy_out.exists():
        n_done += 1; continue
    dp = out_root / f'{idv}_dp.ps'
    lunp = out_root / f'{idv}_lunp'
    if not (dp.exists() and lunp.exists()):
        # missing outputs; write zeros as fallback
        L = len(seq)
        np.save(npy_out, np.zeros((L, 10), dtype=np.float32))
        continue
    L = len(seq)
    P = parse_dp_ps(dp, L)
    pu1 = parse_lunp(lunp)
    X = build_features(P, pu1)  # [L, ~10]
    np.save(npy_out, X)
    n_done += 1
    if n_done % 200 == 0:
        print(f'Cached {n_done}/{len(ids)} feature arrays...', flush=True)

print(f'Done BPP features. Cached {n_done} arrays. Total time {time.time()-start:.1f}s', flush=True)

== Install ViennaRNA (RNAplfold) ==


$ apt-get update -qq && apt-get install -y -qq viennarna


E: List directory /var/lib/apt/lists/partial is missing. - Acquire (30: Read-only file system)

APT install failed; trying pip ViennaRNAx (fallback).


ERROR: Could not find a version that satisfies the requirement ViennaRNAx (from versions: none)
ERROR: No matching distribution found for ViennaRNAx


CalledProcessError: Command '['/usr/bin/python3.11', '-m', 'pip', 'install', 'ViennaRNAx']' returned non-zero exit status 1.