# Stanford RNA 3D Folding 2 — Template Matching v1 | WandB Offline Sync via kaggle-wandb-sync

## Synced with kaggle-wandb-sync

This notebook uses **[kaggle-wandb-sync](https://pypi.org/project/kaggle-wandb-sync/)** to track experiments with Weights & Biases — even with internet disabled.

Since Kaggle competition notebooks run with internet **disabled**, you can't push W&B metrics in real time.
`kaggle-wandb-sync` solves this with a simple offline sync pipeline:

```
Notebook (WANDB_MODE=offline)
    → W&B logs saved to /kaggle/working/wandb/
    → kaggle kernels output  (download via GitHub Actions)
    → wandb sync             (push to W&B cloud)
```

### How to use

```bash
pip install kaggle-wandb-sync

# All-in-one: push notebook → poll → download output → wandb sync
export WANDB_API_KEY=your_api_key
kaggle-wandb-sync run stanford-rna-3d-folding-2/

# Or step by step:
kaggle-wandb-sync push  stanford-rna-3d-folding-2/
kaggle-wandb-sync poll  yasunorim/stanford-rna-3d-folding-2-baseline
kaggle-wandb-sync output yasunorim/stanford-rna-3d-folding-2-baseline
kaggle-wandb-sync sync  ./kaggle_output
```
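
For reference, the download and sync steps can also be reproduced by hand with the stock `kaggle` and `wandb` CLIs. The sketch below only builds the command lines. It is not a description of how kaggle-wandb-sync works internally, and the helper names `download_command`, `find_offline_runs`, and `sync_commands` are hypothetical. It assumes W&B's usual `offline-run-*` directory naming for offline runs.

```python
from pathlib import Path


def download_command(kernel_slug, out_dir):
    """Command that downloads a kernel's output files into out_dir."""
    return ["kaggle", "kernels", "output", kernel_slug, "-p", str(out_dir)]


def find_offline_runs(root):
    """Return offline W&B run directories found anywhere under root, sorted."""
    return sorted(p for p in Path(root).rglob("offline-run-*") if p.is_dir())


def sync_commands(out_dir):
    """One `wandb sync <run_dir>` command per offline run under out_dir.

    Call this only after the download has finished, so the run
    directories exist. WANDB_API_KEY must be set when the commands run.
    """
    return [["wandb", "sync", str(run_dir)] for run_dir in find_offline_runs(out_dir)]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` on a machine with internet access.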

→ **kaggle-wandb-sync GitHub**: https://github.com/yasumorishima/kaggle-wandb-sync  
→ **kaggle-wandb-sync PyPI**: https://pypi.org/project/kaggle-wandb-sync/  
→ **Notebook source**: https://github.com/yasumorishima/kaggle-competitions/tree/main/stanford-rna-3d-folding-2

---

## Competition Overview

**Task**: Predict the 3D structure of RNA molecules from sequence alone.  
**Metric**: TM-score (best of 5 predictions per sequence, averaged across targets)  
**Output**: x, y, z coordinates of the C1' atom for every nucleotide — 5 structures per sequence.
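
As a rough illustration of the aggregation described under **Metric**, the sketch below takes a table of per-structure TM-scores and reduces it to one number. The per-structure scores themselves come from the competition's evaluator and are not computed here. The function name `aggregate_best_of_5` and the example values are made up for illustration.

```python
import numpy as np


def aggregate_best_of_5(tm_scores):
    """Mean over targets of the best TM-score among the 5 predictions.

    tm_scores: (n_targets, 5) array, one TM-score per submitted structure.
    """
    tm_scores = np.asarray(tm_scores, dtype=float)
    best_per_target = tm_scores.max(axis=1)  # best of 5 for each target
    return float(best_per_target.mean())     # average across targets


# Made-up scores for two targets (illustrative values only)
example = [
    [0.125, 0.25, 0.0, 0.125, 0.125],
    [0.5, 0.25, 0.75, 0.5, 0.25],
]
print(aggregate_best_of_5(example))  # 0.5 (mean of 0.25 and 0.75)
```

Because only the best of the 5 structures counts per target, submitting 5 different candidates cannot hurt the score relative to repeating one of them.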

**This notebook (template-v1)**: Use 3D coordinates from the training data as templates.  
For each test sequence, find the 5 training structures with the closest sequence length, then linearly interpolate their coordinates to match the test length. Templates are chosen by length alone (sequence content is not compared); the expectation is that real RNA structures are better starting points than idealized helices.

In [None]:
# W&B must be set to offline BEFORE importing wandb
import os
os.environ['WANDB_MODE'] = 'offline'
os.environ['WANDB_PROJECT'] = 'stanford-rna-3d-folding-2'
os.environ['WANDB_RUN_GROUP'] = 'template'

import numpy as np
import pandas as pd
from pathlib import Path
from scipy.interpolate import interp1d
import warnings
warnings.filterwarnings('ignore')

import wandb

print('Libraries loaded.')
print(f'WANDB_MODE: {os.environ["WANDB_MODE"]}')

In [None]:
# --- Data path detection ---
INPUT_ROOT = Path('/kaggle/input')
SLUG = 'stanford-rna-3d-folding-2'

print('=== /kaggle/input/ structure ===')
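for p in sorted(INPUT_ROOT.iterdir()):
    # Guard against plain files directly under /kaggle/input
    if not p.is_dir():
        print(f'  {p.name}')
        continue
    print(f'  {p.name}/')
    for sub in sorted(p.iterdir())[:5]:
        print(f'    {sub.name}')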

# Auto-detect DATA_DIR
DATA_DIR = None
for p in INPUT_ROOT.rglob('test_sequences.csv'):
    DATA_DIR = p.parent
    break

if DATA_DIR is None:
    raise FileNotFoundError(f'test_sequences.csv not found under {INPUT_ROOT}')

print(f'\nDATA_DIR: {DATA_DIR}')
print('\nCSV files:')
for f in sorted(DATA_DIR.glob('*.csv')):
    print(f'  {f.name}')

## 1. Load Data (Test + Training)

In [None]:
test_df = pd.read_csv(DATA_DIR / 'test_sequences.csv')
sample_sub = pd.read_csv(DATA_DIR / 'sample_submission.csv')
train_seq_df = pd.read_csv(DATA_DIR / 'train_sequences.csv')
train_labels_df = pd.read_csv(DATA_DIR / 'train_labels.csv')

print(f'Test sequences:     {len(test_df)} rows')
print(f'Sample submission:  {len(sample_sub)} rows')
print(f'Train sequences:    {len(train_seq_df)} rows')
print(f'Train labels:       {len(train_labels_df)} rows')
print(f'\nTrain sequences columns: {list(train_seq_df.columns)}')
print(f'Train labels columns:    {list(train_labels_df.columns)}')
train_seq_df.head(3)

In [None]:
# Build training structure library: {target_id: xyz_array of shape (length, 3)}
# train_labels has columns like: ID, resname, resid, x_1, y_1, z_1
# We use the first structure (x_1, y_1, z_1) as the representative coords

# Detect coordinate columns
label_cols = train_labels_df.columns.tolist()
print(f'Train labels columns: {label_cols}')

# Extract target_id from ID column (format: targetid_resid)
train_labels_df['_target'] = train_labels_df['ID'].str.rsplit('_', n=1).str[0]

# Build library of training structures
train_structures = {}
for target_id, group in train_labels_df.groupby('_target', sort=False):
    coords = group[['x_1', 'y_1', 'z_1']].values.astype(np.float64)
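    # Drop structures that are mostly missing; gap-fill the rest
    if np.isnan(coords).any():
        nan_frac = np.isnan(coords).mean()
        if nan_frac > 0.5:
            continue
        # For partial NaN, linearly interpolate missing values per axis
        # (np.interp holds the nearest valid value at the chain ends)
        for col in range(3):
            mask = np.isnan(coords[:, col])
            if mask.any() and not mask.all():
                valid = np.where(~mask)[0]
                coords[mask, col] = np.interp(
                    np.where(mask)[0], valid, coords[valid, col]
                )
        # An axis that was entirely NaN cannot be filled; skip the structure
        if np.isnan(coords).any():
            continue
    train_structures[target_id] = coords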

# Compute lengths
train_lengths = {tid: len(c) for tid, c in train_structures.items()}
train_ids = np.array(list(train_lengths.keys()))
train_lens = np.array(list(train_lengths.values()))

print(f'\nTraining structures loaded: {len(train_structures)}')
print(f'Length range: {train_lens.min()} - {train_lens.max()} (mean: {train_lens.mean():.1f})')

# Test sequence lengths (from sample_submission)
sample_sub['_target'] = sample_sub['ID'].str.rsplit('_', n=1).str[0]
test_lengths = sample_sub.groupby('_target', sort=False).size()
print(f'\nTest sequences: {len(test_lengths)}')
print(f'Test length range: {test_lengths.min()} - {test_lengths.max()} (mean: {test_lengths.mean():.1f})')
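
The gap-filling step in the cell above relies on `np.interp`. The sketch below isolates that step on a single toy coordinate axis to show what it does to interior and terminal gaps. `fill_gaps_1d` is a hypothetical helper written for this illustration and the input values are made up.

```python
import numpy as np


def fill_gaps_1d(values):
    """Linearly fill NaN entries of a 1-D array from its valid neighbours."""
    values = np.asarray(values, dtype=np.float64).copy()
    mask = np.isnan(values)
    # Nothing to do if there are no gaps, or no valid points to fill from
    if mask.any() and not mask.all():
        valid = np.where(~mask)[0]
        values[mask] = np.interp(np.where(mask)[0], valid, values[valid])
    return values


print(fill_gaps_1d([np.nan, 1.0, np.nan, 3.0, np.nan]))
# [1. 1. 2. 3. 3.]
```

The interior gap lands on the straight line between its neighbours, while leading and trailing gaps copy the nearest valid value. Several missing residues at a chain end therefore collapse onto the same point.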

## 2. W&B Initialization

In [None]:
SEQ_COL = 'sequence' if 'sequence' in test_df.columns else test_df.columns[1]
seq_lengths = test_df[SEQ_COL].str.len()

run = wandb.init(
    project='stanford-rna-3d-folding-2',
    name='template-v1',
    config={
        'approach': 'template_matching',
        'n_structures': 5,
        'n_templates': 5,
        'interpolation': 'linear',
        'n_train_structures': len(train_structures),
    }
)

wandb.log({
    'n_test_sequences': len(test_lengths),
    'seq_len_min': int(test_lengths.min()),
    'seq_len_max': int(test_lengths.max()),
    'seq_len_mean': float(test_lengths.mean()),
    'n_train_structures': len(train_structures),
    'train_len_min': int(train_lens.min()),
    'train_len_max': int(train_lens.max()),
    'train_len_mean': float(train_lens.mean()),
})

print(f'W&B run: {run.name} (mode: {os.environ["WANDB_MODE"]})')

## 3. Prediction — Template Matching with Interpolation

For each test sequence:
1. Find the 5 training structures with the closest sequence length
2. If the training structure has a different length, use linear interpolation (scipy) to resize the 3D coordinates to match the test length
3. Each of the 5 templates becomes one of the 5 required structure predictions

This reuses real RNA folding patterns from the training data, which are expected to score better than idealized helices. Note that templates are matched on length only, so there is no guarantee of sequence or structural similarity to the target. Using 5 different templates also gives structural variety for the best-of-5 TM-score evaluation.
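
Step 1 can be tried in isolation on a toy library. The sketch below is a minimal illustration with made-up IDs and lengths. `closest_by_length` is a hypothetical stand-in for the notebook's own selection function, and it uses a stable sort so that ties are resolved in library order.

```python
import numpy as np

# Toy library of template lengths (made-up values, not competition data)
toy_ids = np.array(["t0", "t1", "t2", "t3", "t4", "t5"])
toy_lens = np.array([10, 50, 48, 52, 200, 47])


def closest_by_length(target_len, n=3):
    """Return the n (id, abs_length_diff) pairs nearest to target_len."""
    diffs = np.abs(toy_lens - target_len)
    # A stable sort keeps tied templates in library order
    order = np.argsort(diffs, kind="stable")[:n]
    return [(str(toy_ids[i]), int(diffs[i])) for i in order]


print(closest_by_length(50))
# [('t1', 0), ('t2', 2), ('t3', 2)]
```

`t2` and `t3` are equally close in length, but nothing in this criterion says whether either of them folds like the target.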

In [None]:
N_TEMPLATES = 5


def interpolate_coords(coords: np.ndarray, target_len: int) -> np.ndarray:
    """Resize 3D coordinates from source length to target length using linear interpolation.

    Args:
        coords: (source_len, 3) array of x, y, z coordinates
        target_len: desired output length

    Returns:
        (target_len, 3) array of interpolated coordinates
    """
    source_len = len(coords)
    if source_len == target_len:
        return coords.copy()

    # Normalize positions to [0, 1] for both source and target
    src_positions = np.linspace(0, 1, source_len)
    tgt_positions = np.linspace(0, 1, target_len)

    result = np.zeros((target_len, 3))
    for dim in range(3):
        f = interp1d(src_positions, coords[:, dim], kind='linear')
        result[:, dim] = f(tgt_positions)

    return result


def find_closest_templates(target_len: int, n: int = 5) -> list:
    """Find n training structures with lengths closest to target_len.

    Returns list of (train_target_id, length_diff) tuples.
    """
    diffs = np.abs(train_lens - target_len)
    # Indices of the n smallest differences; a stable sort keeps ties in training order
    closest_idx = np.argsort(diffs, kind='stable')[:n]
    return [(train_ids[i], diffs[i]) for i in closest_idx]


# Use sample_submission as template
submission = sample_sub.copy()

template_stats = []

for target_id, group in submission.groupby('_target', sort=False):
    test_len = len(group)
    idx = group.index

    # Find 5 closest training structures
    templates = find_closest_templates(test_len, N_TEMPLATES)

    for s, (train_tid, len_diff) in enumerate(templates, 1):
        train_coords = train_structures[train_tid]
        # Interpolate to match test length
        pred_coords = interpolate_coords(train_coords, test_len)

        submission.loc[idx, f'x_{s}'] = pred_coords[:, 0].round(3)
        submission.loc[idx, f'y_{s}'] = pred_coords[:, 1].round(3)
        submission.loc[idx, f'z_{s}'] = pred_coords[:, 2].round(3)

    template_stats.append({
        'target_id': target_id,
        'test_len': test_len,
        'templates': [(tid, int(d)) for tid, d in templates],
        'avg_len_diff': float(np.mean([d for _, d in templates])),
    })

submission = submission.drop(columns=['_target'])

# Log template matching stats
# (both values aggregate the per-target average length difference)
avg_diff = np.mean([s['avg_len_diff'] for s in template_stats])
max_diff = max(s['avg_len_diff'] for s in template_stats)
wandb.log({
    'avg_template_len_diff': avg_diff,
    'max_template_len_diff': max_diff,
})

print(f'Submission shape: {submission.shape}')
print(f'Average template length difference: {avg_diff:.1f}')
print(f'Max average length difference: {max_diff:.1f}')
print(f'\nTemplate assignments (first 5):')
for s in template_stats[:5]:
    t_info = ', '.join([f'{tid}(±{d})' for tid, d in s['templates']])
    print(f'  {s["target_id"]} (len={s["test_len"]}): {t_info}')

submission.head(3)

## 4. Save Submission

In [None]:
OUTPUT_PATH = Path('/kaggle/working/submission.csv')
submission.to_csv(OUTPUT_PATH, index=False)
print(f'Saved: {OUTPUT_PATH}  ({OUTPUT_PATH.stat().st_size / 1024:.1f} KB)')

wandb.log({
    'n_submission_rows': len(submission),
    'submission_columns': len(submission.columns),
})

# Log template assignment table
template_table = wandb.Table(
    columns=['target_id', 'test_len', 'template_1', 'template_2', 'template_3', 'template_4', 'template_5', 'avg_len_diff'],
    data=[
        [
            s['target_id'], s['test_len'],
            *[f"{tid}(±{d})" for tid, d in s['templates']],
            round(s['avg_len_diff'], 1),
        ]
        for s in template_stats
    ],
)
wandb.log({'template_assignments': template_table})

wandb.finish()
print('\nW&B run finished (offline). Sync with:')
print('  kaggle-wandb-sync run stanford-rna-3d-folding-2/ --skip-push')
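
Before submitting, it can be worth comparing the saved frame against the sample submission. The sketch below is one possible check, not part of the original pipeline. `check_submission` is a hypothetical helper, and it assumes the `ID` and `x_k`/`y_k`/`z_k` column naming used in the cells above.

```python
import numpy as np
import pandas as pd


def check_submission(submission, sample):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if list(submission.columns) != list(sample.columns):
        problems.append("column names or order differ from the sample")
    if len(submission) != len(sample):
        problems.append("row count differs from the sample")
    elif "ID" in submission.columns and "ID" in sample.columns:
        if (submission["ID"].to_numpy() != sample["ID"].to_numpy()).any():
            problems.append("ID column differs from the sample")
    # Coordinate columns are assumed to be named x_k, y_k, z_k
    coord_cols = [c for c in submission.columns if c[:2] in ("x_", "y_", "z_")]
    if submission[coord_cols].isna().any().any():
        problems.append("NaN in coordinate columns")
    return problems
```

In this notebook it could be called as `check_submission(submission, pd.read_csv(DATA_DIR / 'sample_submission.csv'))`, since `sample_sub` itself carries the extra `_target` column by this point.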

## Summary

| Step | Detail |
|---|---|
| Approach | Template matching — closest-length training structures |
| Templates | 5 per test sequence (closest by sequence length) |
| Interpolation | Linear (scipy) to resize coordinates |
| W&B run | template-v1 |
| W&B mode | offline → synced via kaggle-wandb-sync |

### Why template matching?

The previous approach (A-form helix + noise) places all residues on a single helix, which doesn't reflect actual RNA folding. Real RNA structures from the training data contain stems, loops, hairpins, and other motifs. By using training structures as templates:
- We get realistic 3D geometries instead of idealized helices
- Different templates provide meaningful structural diversity
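- Linear interpolation preserves the overall shape while adjusting to the target length, though it rescales the spacing between consecutive C1' atoms by roughly (source length - 1) / (target length - 1)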
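
The effect of resizing on local geometry is easy to see on a toy chain. The sketch below is illustrative only: `resample_path` is a hypothetical helper that uses `np.interp` in place of the notebook's scipy `interp1d` (for linear interpolation on the same grid the two agree), and the 6.0 spacing is a made-up value.

```python
import numpy as np


def resample_path(coords, target_len):
    """Linearly resample an (N, 3) path to target_len points on a [0, 1] axis."""
    coords = np.asarray(coords, dtype=float)
    src = np.linspace(0.0, 1.0, len(coords))
    tgt = np.linspace(0.0, 1.0, target_len)
    return np.column_stack([np.interp(tgt, src, coords[:, d]) for d in range(3)])


# Straight toy chain: 5 points spaced 6.0 apart along x (made-up spacing)
template = np.column_stack([np.arange(5) * 6.0, np.zeros(5), np.zeros(5)])
stretched = resample_path(template, 9)  # 5 -> 9 residues

steps = np.linalg.norm(np.diff(stretched, axis=0), axis=1)
print(stretched[0], stretched[-1])  # end points are unchanged
print(steps)                        # every step is now 3.0 instead of 6.0
```

The end-to-end extent of the template is kept, so a longer target gets residues packed more tightly and a shorter target gets them spread further apart. This is one reason templates with a small length difference are preferred.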

### Sync W&B runs after execution

```bash
# Full pipeline: push notebook, poll, download output, wandb sync
kaggle-wandb-sync run stanford-rna-3d-folding-2/

# Or to re-sync without re-running:
kaggle-wandb-sync run stanford-rna-3d-folding-2/ --skip-push
```