# Stanford RNA 3D Folding Competition Notebook

This notebook is designed for the "Stanford RNA 3D Folding" Kaggle competition.
It covers:

1. Data Exploration  
2. Data Preprocessing  
   - Sequence encoding  
   - Label grouping and padding (with NaN handling)
3. Model Building using a fast CNN architecture  
4. Model Training with early stopping  
5. Prediction on test set and submission file generation

_Note: This notebook uses only the provided CSV files (no external internet access)._

## 1. Import Libraries

In [36]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# TensorFlow/Keras for deep learning model
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, Conv1D, BatchNormalization, Dropout
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras import losses

# Ensure GPU usage
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    try:
        # Restrict to first GPU and set memory growth
        tf.config.set_visible_devices(gpus[0], 'GPU')
        tf.config.experimental.set_memory_growth(gpus[0], True)
        print("Using GPU:", gpus[0])
    except RuntimeError as e:
        print(e)
else:
    print("Using CPU")

# Define pairs
def is_complementary(base1, base2):
    """Check if two nucleotides are complementary."""
    pairs = {
        'A': ['U'],  # Adenine pairs with Uracil
        'U': ['A', 'G'],  # Uracil pairs with Adenine or Guanine
        'G': ['C', 'U'],  # Guanine pairs with Cytosine or Uracil
        'C': ['G']  # Cytosine pairs with Guanine
    }
    return base2 in pairs.get(base1, [])

# Set random seed for reproducibility
np.random.seed(42)
tf.random.set_seed(42)

Using GPU: PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')


## 2. Data Loading and Exploration

We load the CSV files provided in the competition:
- `train_sequences.csv`
- `train_labels.csv`
- `validation_sequences.csv` & `validation_labels.csv`
- `test_sequences.csv`
- `sample_submission.csv`

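**Important:** We fill missing values in the labels data with 0 to avoid NaN issues during training. A zero-filled coordinate is indistinguishable from a real atom at the origin, so these positions still contribute to the loss.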

In [37]:
# Define file paths (Kaggle input paths)
TRAIN_SEQ_PATH = '/kaggle/input/stanford-rna-3d-folding/train_sequences.csv'
TRAIN_LABELS_PATH = '/kaggle/input/stanford-rna-3d-folding/train_labels.csv'
VALID_SEQ_PATH = '/kaggle/input/stanford-rna-3d-folding/validation_sequences.csv'
VALID_LABELS_PATH = '/kaggle/input/stanford-rna-3d-folding/validation_labels.csv'
TEST_SEQ_PATH  = '/kaggle/input/stanford-rna-3d-folding/test_sequences.csv'
SAMPLE_SUB_PATH = '/kaggle/input/stanford-rna-3d-folding/sample_submission.csv'

# Load CSV files
train_sequences = pd.read_csv(TRAIN_SEQ_PATH)
train_labels = pd.read_csv(TRAIN_LABELS_PATH)
valid_sequences = pd.read_csv(VALID_SEQ_PATH)
valid_labels = pd.read_csv(VALID_LABELS_PATH)
test_sequences = pd.read_csv(TEST_SEQ_PATH)
sample_submission = pd.read_csv(SAMPLE_SUB_PATH)

# Fill missing values in labels with 0
train_labels.fillna(0, inplace=True)
valid_labels.fillna(0, inplace=True)

# Display basic info
print("Train Sequences Shape:", train_sequences.shape)
print("Train Labels Shape:", train_labels.shape)
print("Validation Sequences Shape:", valid_sequences.shape)
print("Validation Labels Shape:", valid_labels.shape)
print("Test Sequences Shape:", test_sequences.shape)

# Look at a few examples
print("\nTrain Sequences Head:")
print(train_sequences.head())
print("\nTrain Labels Head:")
print(train_labels.head())

Train Sequences Shape: (844, 5)
Train Labels Shape: (137095, 6)
Validation Sequences Shape: (12, 5)
Validation Labels Shape: (2515, 123)
Test Sequences Shape: (12, 5)

Train Sequences Head:
  target_id                            sequence temporal_cutoff  \
0    1SCL_A       GGGUGCUCAGUACGAGAGGAACCGCACCC      1995-01-26   
1    1RNK_A  GGCGCAGUGGGCUAGCGCCACUCAAAAGGCCCAU      1995-02-27   
2    1RHT_A            GGGACUGACGAUCACGCAGUCUAU      1995-06-03   
3    1HLX_A                GGGAUAACUUCGGUUGUCCC      1995-09-15   
4    1HMH_E  GGCGACCCUGAUGAGGCCGAAAGGCCGAAACCGU      1995-12-07   

                                         description  \
0               THE SARCIN-RICIN LOOP, A MODULAR RNA   
1  THE STRUCTURE OF AN RNA PSEUDOKNOT THAT CAUSES...   
2  24-MER RNA HAIRPIN COAT PROTEIN BINDING SITE F...   
3  P1 HELIX NUCLEIC ACIDS (DNA/RNA) RIBONUCLEIC ACID   
4  THREE-DIMENSIONAL STRUCTURE OF A HAMMERHEAD RI...   

                                       all_sequences  
0  >1SCL_1|Chai

## 3. Data Preprocessing

### 3.1 Sequence Encoding

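We map each nucleotide to an integer:
- A: 1, C: 2, G: 3, U: 4  

With the similarity rule defined below, `N` is replaced by the most frequent nucleotide in the training sequences. Any other unknown character is mapped to 0, which is also the padding value.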

In [38]:
from collections import defaultdict

nucleotide_map = {'A': 1, 'C': 2, 'G': 3, 'U': 4}

def encode_sequence(seq, known_sequences):
    """Encodes an RNA sequence into a list of integers, replacing unknown nucleotides
       with the most similar known nucleotide's mapping.
    """
    encoded_seq = []
    for ch in seq:
        if ch in nucleotide_map:
            encoded_seq.append(nucleotide_map[ch])
        else:
            # Find the most similar known nucleotide
            most_similar = find_most_similar_nucleotide(ch, known_sequences)
            if most_similar:
                encoded_seq.append(nucleotide_map[most_similar])
            else:
                # If no similar nucleotide is found, handle accordingly (e.g., raise an error, use a default value, etc.)
                encoded_seq.append(0)  # Or raise ValueError(f"Unknown nucleotide: {ch}")

    return encoded_seq

def find_most_similar_nucleotide(unknown_nucleotide, known_sequences):
    """Finds the most similar nucleotide from known sequences."""
    similarity_counts = defaultdict(int)

    for known_seq in known_sequences:
        for known_ch in known_seq:
            if known_ch in nucleotide_map:
                if are_similar(unknown_nucleotide, known_ch):
                    similarity_counts[known_ch] += 1

    if similarity_counts:
        return max(similarity_counts, key=similarity_counts.get)
    else:
        return None

def are_similar(unknown_nucleotide, known_nucleotide):
    """Determines if two nucleotides are similar.
       This is a placeholder; you'll need to define your similarity logic.
       Example: considering 'N' as any nucleotide.
    """
    if unknown_nucleotide == known_nucleotide:
        return True
    if unknown_nucleotide == 'N': # N means any nucleotide
        return True
    if known_nucleotide == 'N': # N means any nucleotide.
        return True
    # Add other similarity rules as needed.
    return False

def preprocess_sequences(sequences_df, known_sequences):
    """Encodes sequences in a DataFrame, using known sequences for unknown nucleotides."""
    sequences_df['encoded'] = sequences_df['sequence'].apply(lambda seq: encode_sequence(seq, known_sequences))
    return sequences_df

# Assuming train_sequences, valid_sequences, and test_sequences are pandas DataFrames
# and 'sequence' is the column containing the RNA sequences.

# Collect the training sequences; they are the reference used to resolve unknown nucleotides.
known_sequences = train_sequences['sequence'].tolist()

# Apply encoding to all sequence files
train_sequences = preprocess_sequences(train_sequences, known_sequences)
valid_sequences = preprocess_sequences(valid_sequences, known_sequences)
test_sequences = preprocess_sequences(test_sequences, known_sequences)
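Because `are_similar` treats `N` as matching every nucleotide, `find_most_similar_nucleotide` returns the most frequent nucleotide in the training sequences for `N` (and `None` for any other unknown character), and it rescans every training sequence on each call. The following is a minimal sketch of a cheaper variant that counts nucleotides once, assuming that behaviour is what is wanted. `build_fallback` and `encode_with_fallback` are hypothetical names, and the toy list stands in for `train_sequences['sequence']`.

```python
from collections import Counter

NUCLEOTIDE_MAP = {'A': 1, 'C': 2, 'G': 3, 'U': 4}

def build_fallback(known_sequences):
    """Return the code of the most frequent known nucleotide, or 0 if none is seen."""
    counts = Counter(ch for seq in known_sequences for ch in seq if ch in NUCLEOTIDE_MAP)
    if not counts:
        return 0
    most_common_base, _ = counts.most_common(1)[0]
    return NUCLEOTIDE_MAP[most_common_base]

def encode_with_fallback(seq, n_code):
    """Encode A/C/G/U directly, 'N' as n_code, and anything else as 0."""
    return [NUCLEOTIDE_MAP.get(ch, n_code if ch == 'N' else 0) for ch in seq]

toy_train = ["GGGAUC", "GGCA"]        # toy stand-in for the training sequences
n_code = build_fallback(toy_train)    # computed once, reused for every sequence
encoded = encode_with_fallback("ANGX", n_code)
print(n_code, encoded)
```

Computing the fallback once keeps encoding linear in the sequence length, whereas the original lookup costs a full pass over the training set for every unknown character.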

### 3.2 Processing Label Data

Each row in the labels CSV is for one residue, with an `ID` formatted as `target_id_resid`.
We group rows by `target_id` and sort by residue number.
Here, we use the first structure (x_1, y_1, z_1) as our target coordinates.

In [39]:
def process_labels(labels_df):
    """
    Processes a labels DataFrame by grouping rows by target_id.
    Returns a dictionary mapping target_id to an array of coordinates (seq_len, 3).
    """
    label_dict = {}
    for idx, row in labels_df.iterrows():
        # Split ID into target_id and residue number (assumes format "targetid_resid")
        parts = row['ID'].split('_')
        target_id = "_".join(parts[:-1])
        resid = int(parts[-1])
        # Extract the coordinates; they should be numeric (missing values already set to 0)
        coord = np.array([row['x_1'], row['y_1'], row['z_1']], dtype=np.float32)
        if target_id not in label_dict:
            label_dict[target_id] = []
        label_dict[target_id].append((resid, coord))
    
    # Sort residues by resid and stack coordinates
    for key in label_dict:
        sorted_coords = sorted(label_dict[key], key=lambda x: x[0])
        coords = np.stack([c for r, c in sorted_coords])
        label_dict[key] = coords
    return label_dict

# Process training and validation labels
train_labels_dict = process_labels(train_labels)
valid_labels_dict = process_labels(valid_labels)
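`iterrows` visits each of the 137,095 training label rows in Python, which is slow. The sketch below shows a vectorized alternative, assuming the same `ID`, `x_1`, `y_1`, `z_1` columns. `process_labels_fast` is a hypothetical name and the toy frame stands in for `train_labels`.

```python
import numpy as np
import pandas as pd

def process_labels_fast(labels_df):
    """Group label rows by target_id and return {target_id: (seq_len, 3) float32 array}."""
    # Split "targetid_resid" on the last underscore only, since target ids contain underscores.
    parts = labels_df['ID'].str.rsplit('_', n=1, expand=True)
    df = labels_df.assign(target_id=parts[0], resid=parts[1].astype(int))
    df = df.sort_values(['target_id', 'resid'])
    return {
        tid: group[['x_1', 'y_1', 'z_1']].to_numpy(dtype=np.float32)
        for tid, group in df.groupby('target_id', sort=False)
    }

toy_labels = pd.DataFrame({
    'ID': ['1SCL_A_2', '1SCL_A_1', '1RNK_A_1'],
    'x_1': [1.0, 0.0, 5.0],
    'y_1': [1.0, 0.0, 5.0],
    'z_1': [1.0, 0.0, 5.0],
})
toy_dict = process_labels_fast(toy_labels)
print({tid: arr.shape for tid, arr in toy_dict.items()})
```

The result has the same structure as `process_labels`, so it could be passed to `create_dataset` unchanged.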

### 3.3 Creating Datasets and Padding

We match each target sequence with its corresponding coordinate labels.
Then we pad sequences and coordinate arrays to a uniform length.

Padded positions in coordinates are set to 0.

In [40]:
def create_dataset(sequences_df, labels_dict):
    """
    Creates a dataset from a sequences DataFrame and a labels dictionary.
    Returns:
        X: list of encoded sequences,
        y: list of coordinate arrays,
        target_ids: list of target ids.
    """
    X, y, target_ids = [], [], []
    for idx, row in sequences_df.iterrows():
        tid = row['target_id']
        if tid in labels_dict:
            X.append(row['encoded'])
            y.append(labels_dict[tid])
            target_ids.append(tid)
    return X, y, target_ids

# Create training and validation datasets
X_train, y_train, train_ids = create_dataset(train_sequences, train_labels_dict)
X_valid, y_valid, valid_ids = create_dataset(valid_sequences, valid_labels_dict)

# Determine maximum sequence length from training set
max_len = max(len(seq) for seq in X_train)
print("Maximum sequence length (train):", max_len)

# Pad the sequences (padding value = 0)
X_train_pad = pad_sequences(X_train, maxlen=max_len, padding='post', value=0)
X_valid_pad = pad_sequences(X_valid, maxlen=max_len, padding='post', value=0)

# Function to pad coordinate arrays
def pad_coordinates(coord_array, max_len):
    L = coord_array.shape[0]
    if L < max_len:
        pad_width = ((0, max_len - L), (0, 0))
        return np.pad(coord_array, pad_width, mode='constant', constant_values=0)
    else:
        return coord_array

# Pad coordinate arrays
y_train_pad = np.array([pad_coordinates(arr, max_len) for arr in y_train])
y_valid_pad = np.array([pad_coordinates(arr, max_len) for arr in y_valid])

# Check for any NaN values in the targets
print("Any NaN in y_train_pad?", np.isnan(y_train_pad).any())
print("Any NaN in y_valid_pad?", np.isnan(y_valid_pad).any())

print("X_train_pad shape:", X_train_pad.shape)
print("y_train_pad shape:", y_train_pad.shape)

Maximum sequence length (train): 4298
Any NaN in y_train_pad? False
Any NaN in y_valid_pad? False
X_train_pad shape: (844, 4298)
y_train_pad shape: (844, 4298, 3)
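`pad_sequences` truncates inputs longer than `maxlen` (from the front by default), whereas `pad_coordinates` returns longer arrays unchanged, so a target longer than the training maximum would produce mismatched shapes. The sketch below pads or truncates and also returns a mask of real residues. `pad_or_truncate` is a hypothetical name, and truncating from the end is an assumption that would need `truncating='post'` on the sequence side to stay aligned.

```python
import numpy as np

def pad_or_truncate(coord_array, max_len):
    """Return (coords, mask): coords has shape (max_len, 3), mask is 1.0 at real residues."""
    coords = np.zeros((max_len, 3), dtype=np.float32)
    mask = np.zeros(max_len, dtype=np.float32)
    n = min(coord_array.shape[0], max_len)   # keep at most max_len residues, from the start
    coords[:n] = coord_array[:n]
    mask[:n] = 1.0
    return coords, mask

short = np.ones((3, 3), dtype=np.float32)
long_ = np.ones((7, 3), dtype=np.float32)
for arr in (short, long_):
    coords, mask = pad_or_truncate(arr, max_len=5)
    print(coords.shape, int(mask.sum()))
```

Keeping the mask alongside the coordinates makes it possible to tell padding apart from residues whose true coordinates happen to be zero.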


## 4. Fast CNN Model Building

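In this section, we build a small CNN-based model that is fast to train.
The model uses:
- An Embedding layer  
- Two Conv1D blocks (with BatchNormalization and Dropout)  
- A final Conv1D layer (kernel size 1) to output 3 coordinates per residue

The embedding sets `mask_zero=True`, but `Conv1D` does not use the mask, so padded positions still pass through the network and count toward the MSE loss.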

In [41]:
# Define hyperparameters for the CNN model
vocab_size = max(nucleotide_map.values()) + 1  # +1 for padding token 0
embedding_dim = 16
num_filters = 64
kernel_size = 3
drop_rate = 0.2

# Build the CNN model
input_seq_cnn = Input(shape=(max_len,), name='input_seq')
x_cnn = Embedding(input_dim=vocab_size, output_dim=embedding_dim, mask_zero=True, name='embedding')(input_seq_cnn)

# First convolutional block
x_cnn = Conv1D(filters=num_filters, kernel_size=kernel_size, padding='same', activation='relu', name='conv1')(x_cnn)
x_cnn = BatchNormalization(name='bn1')(x_cnn)
x_cnn = Dropout(drop_rate, name='drop1')(x_cnn)

# Second convolutional block
x_cnn = Conv1D(filters=num_filters, kernel_size=kernel_size, padding='same', activation='relu', name='conv2')(x_cnn)
x_cnn = BatchNormalization(name='bn2')(x_cnn)
x_cnn = Dropout(drop_rate, name='drop2')(x_cnn)

# Final convolution to output 3 coordinates per residue (x, y, z)
output_coords_cnn = Conv1D(filters=3, kernel_size=1, padding='same', activation='linear', name='predicted_coords')(x_cnn)

cnn_model = Model(inputs=input_seq_cnn, outputs=output_coords_cnn)
cnn_model.compile(optimizer='adam', loss='mse')

cnn_model.summary()
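With `loss='mse'`, every padded position counts toward the loss. The shapes printed earlier (137,095 label rows against 844 × 4298 target positions) imply that only about 4% of positions are real residues, so a plain MSE is dominated by padding. The NumPy sketch below averages the error over real residues only. `masked_mse` is a hypothetical helper, and the mask is assumed to come from the padded token array (for example `X_train_pad != 0`). The same arithmetic could be written with TensorFlow ops for use as a Keras loss.

```python
import numpy as np

def masked_mse(y_true, y_pred, mask):
    """Mean squared error over positions where mask == 1 (real residues)."""
    sq_err = np.sum((y_true - y_pred) ** 2, axis=-1)       # (batch, max_len)
    n_real = np.maximum(mask.sum(), 1.0)                    # avoid division by zero
    return float((sq_err * mask).sum() / (3.0 * n_real))    # 3 coordinates per residue

# Toy batch: one target with 2 real residues padded to length 4.
tokens = np.array([[1, 3, 0, 0]])
mask = (tokens != 0).astype(np.float32)
y_true = np.zeros((1, 4, 3), dtype=np.float32)
y_true[0, :2] = 1.0
y_pred = np.zeros((1, 4, 3), dtype=np.float32)

plain = float(np.mean((y_true - y_pred) ** 2))
masked = masked_mse(y_true, y_pred, mask)
print("plain MSE :", plain)
print("masked MSE:", masked)
```

In this toy batch half of the positions are padding, so the plain MSE comes out at half the masked value. With the notebook's padding ratio the dilution would be far stronger.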



## 5. Model Training

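We train the CNN model with early stopping on the validation loss.
With NaNs removed from the labels, the loss no longer becomes `nan`. The logged validation loss, however, stays fixed at about 2.9e32, which suggests the validation labels still contain very large placeholder values that the zero fill does not catch. Early stopping therefore halts after the patience window without any usable validation signal.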

In [42]:
# 5. Model Training (Original Teacher)
train_dataset = tf.data.Dataset.from_tensor_slices((X_train_pad, y_train_pad))
train_dataset = train_dataset.shuffle(1000).batch(64).prefetch(tf.data.AUTOTUNE)

valid_dataset = tf.data.Dataset.from_tensor_slices((X_valid_pad, y_valid_pad))
valid_dataset = valid_dataset.batch(64).prefetch(tf.data.AUTOTUNE)

early_stop_cnn = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
history_cnn = cnn_model.fit(
    train_dataset,
    validation_data=valid_dataset,
    epochs=50,
    callbacks=[early_stop_cnn],
    verbose=1
)

Epoch 1/50
14/14 ━━━━━━━━━━━━━━━━━━━━ 6s 189ms/step - loss: 573.8710 - val_loss: 290832909220992902620675565420544.0000
Epoch 2/50
14/14 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step - loss: 971.4595 - val_loss: 290832909220992902620675565420544.0000
Epoch 3/50
14/14 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step - loss: 615.1906 - val_loss: 290832909220992902620675565420544.0000
Epoch 4/50
14/14 ━━━━━━━━━━━━━━━━━━━━ 0s 21ms/step - loss: 834.8428 - val_loss: 290832909220992902620675565420544.0000
Epoch 5/50
14/14 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step - loss: 567.2512 - val_loss: 290832909220992902620675565420544.0000
Epoch 6/50
14/14 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step - loss: 524.1871 - val_loss: 290832909220992902620675565420544.0000

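A validation loss near 2.9e32 that never moves is what one would expect if some validation coordinates hold a very large placeholder value rather than NaN, since squaring a value of that size and averaging over mostly padded positions gives magnitudes in this range. This is an inference from the log above, not something the notebook verifies. The sketch below flags such entries and excludes them through a mask. The `1e6` threshold and the helper name `coordinate_mask` are assumptions.

```python
import numpy as np

def coordinate_mask(coords, limit=1e6):
    """Return 1.0 where all three coordinates are finite and below `limit` in magnitude."""
    ok = np.isfinite(coords) & (np.abs(coords) < limit)
    return ok.all(axis=-1).astype(np.float32)

toy = np.array([[1.0, 2.0, 3.0],
                [-1e18, -1e18, -1e18],   # placeholder-like row
                [np.nan, 0.0, 0.0]], dtype=np.float32)
mask = coordinate_mask(toy)
clean = np.where(mask[:, None] > 0, toy, 0.0)   # zero out rows that should not be trained on
print(mask, float(np.abs(clean).max()))
```

Running a check like this on the raw label columns before `fillna(0)` would show whether placeholders are present and how many residues they affect.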

In [None]:
import tensorflow as tf
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Embedding, Conv1D, BatchNormalization, Dropout
from tensorflow.keras.callbacks import EarlyStopping
import numpy as np

# Function to build the CNN model
def build_cnn_model():
    input_seq = Input(shape=(max_len,), name='input_seq')
    x = Embedding(vocab_size, embedding_dim, mask_zero=True, name='embedding')(input_seq)
    x = Conv1D(num_filters, kernel_size, padding='same', activation='relu', name='conv1')(x)
    x = BatchNormalization(name='bn1')(x)
    x = Dropout(drop_rate, name='drop1')(x)
    x = Conv1D(num_filters, kernel_size, padding='same', activation='relu', name='conv2')(x)
    x = BatchNormalization(name='bn2')(x)
    x = Dropout(drop_rate, name='drop2')(x)
    outputs = Conv1D(3, 1, padding='same', activation='linear', name='predicted_coords')(x)
    model = Model(inputs=input_seq, outputs=outputs)
    return model

# Define distillation loss combining true labels and teacher predictions
def combined_loss(y_true_combined, y_pred):
    y_true = y_true_combined[..., :3]  # Original labels
    y_teacher = y_true_combined[..., 3:]  # Teacher predictions
    mse_true = tf.keras.losses.MeanSquaredError()(y_true, y_pred)
    mse_teacher = tf.keras.losses.MeanSquaredError()(y_teacher, y_pred)
    return 0.5 * mse_true + 0.5 * mse_teacher

# Perform 3 distillation steps
teacher_model = cnn_model  # Start with original model
for distil_step in range(3):
    print(f"\nPerforming distillation step {distil_step + 1}/3")

    # Generate teacher predictions
    print("Generating teacher predictions for training and validation data...")
    train_teacher_pred = teacher_model.predict(X_train_pad, verbose=1)
    valid_teacher_pred = teacher_model.predict(X_valid_pad, verbose=1)

    # Debug: Check shapes before concatenation
    print(f"y_train_pad shape: {y_train_pad.shape}")
    print(f"train_teacher_pred shape: {train_teacher_pred.shape}")

    # Concatenate along the last axis
    combined_y_train = np.concatenate([y_train_pad, train_teacher_pred], axis=-1)
    combined_y_valid = np.concatenate([y_valid_pad, valid_teacher_pred], axis=-1)

    # Debug: Check shapes after concatenation
    print(f"combined_y_train shape: {combined_y_train.shape}")
    print(f"combined_y_valid shape: {combined_y_valid.shape}")

    # Create datasets for student training
    student_train_dataset = tf.data.Dataset.from_tensor_slices((X_train_pad, combined_y_train))
    student_train_dataset = student_train_dataset.shuffle(1000).batch(64).prefetch(tf.data.AUTOTUNE)

    # Create validation dataset
    valid_dataset = tf.data.Dataset.from_tensor_slices((X_valid_pad, combined_y_valid))
    valid_dataset = valid_dataset.batch(64).prefetch(tf.data.AUTOTUNE)

    # Build and train student model
    student_model = build_cnn_model()
    student_model.compile(optimizer='adam', loss=combined_loss)

    early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
    print("Training student model...")
    history_student_model = student_model.fit(
        student_train_dataset,
        validation_data=valid_dataset,
        epochs=50,
        callbacks=[early_stop],
        verbose=1
    )

    teacher_model = student_model  # Student becomes teacher for next step


# Modified distillation loop with physical constraints
teacher_model = cnn_model
for distil_step in range(3):
    print(f"\nPerforming distillation step {distil_step + 1}/3 with physical constraints")

    # Generate teacher predictions with physical validation
    print("Generating teacher predictions for training and validation data...")
    train_teacher_pred = teacher_model.predict(X_train_pad, verbose=0)
    valid_teacher_pred = teacher_model.predict(X_valid_pad, verbose=0)

    # Apply physical constraints to teacher predictions
    def apply_physical_constraints(sequences, coords):
        print(f"Applying physical constraints to predictions...")
        constrained_coords = []
        for seq, coord in zip(sequences, coords):
            seq_str = seq['sequence']  # Extract the RNA sequence
            valid_coords = tf.Variable(coord.copy())  # Create a TensorFlow variable for GPU computation

            n = len(seq_str)
            for i in range(n):
                for j in range(i+4, n):  # Minimum loop length of 4
                    if is_complementary(seq_str[i], seq_str[j]):
                        # Compute distance using TensorFlow
                        current_dist = tf.norm(valid_coords[i] - valid_coords[j])
                        if current_dist < 2.8 or current_dist > 3.8:
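                            direction = (valid_coords[j] - valid_coords[i]) / (current_dist + 1e-8)
                            # Signed step along i -> j: negative when the pair is too far apart, positive when too close.
                            # Moving i by -step and j by +step shifts the distance toward 3.4 Å
                            # (the original signs pushed distant pairs further apart).
                            step = direction * (3.4 - current_dist) * 0.1
                            valid_coords_i_updated = valid_coords[i] - step
                            valid_coords_j_updated = valid_coords[j] + step
                            valid_coords = tf.tensor_scatter_nd_update(valid_coords, [[i], [j]], [valid_coords_i_updated, valid_coords_j_updated])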

            # Resolve steric clashes
            for i in range(n):
                for j in range(i+2, n):  # Skip adjacent residues
                    if abs(i-j) < 4: continue
                    dist = tf.norm(valid_coords[i] - valid_coords[j])
                    if dist < 2.0:
                        direction = (valid_coords[j] - valid_coords[i]) / (dist + 1e-8)
                        push_force = (2.0 - dist) * 0.2
                        valid_coords_i_updated = valid_coords[i] - direction * push_force
                        valid_coords_j_updated = valid_coords[j] + direction * push_force
                        valid_coords = tf.tensor_scatter_nd_update(valid_coords, [[i], [j]], [valid_coords_i_updated, valid_coords_j_updated])

            constrained_coords.append(valid_coords)
        return tf.stack(constrained_coords)

    # Apply constraints to teacher predictions
    print("Applying physical constraints to teacher predictions...")
    train_teacher_pred = apply_physical_constraints(train_sequences.to_dict('records'), train_teacher_pred)
    valid_teacher_pred = apply_physical_constraints(valid_sequences.to_dict('records'), valid_teacher_pred)

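    # Create enhanced training targets with physical guidance.
    # Channels 0-2: true coordinates, 3-5: constrained teacher predictions,
    # 6: encoded nucleotide (0 = padding) so the loss can see each sequence.
    combined_y_train = np.concatenate(
        [y_train_pad, np.asarray(train_teacher_pred, dtype=np.float32),
         X_train_pad[..., None].astype(np.float32)], axis=-1)
    combined_y_valid = np.concatenate(
        [y_valid_pad, np.asarray(valid_teacher_pred, dtype=np.float32),
         X_valid_pad[..., None].astype(np.float32)], axis=-1)

    # Rebuild the datasets so the student trains on this step's targets.
    # The pairwise terms need batch * max_len^2 memory, so a smaller batch is used here (assumed value).
    phys_batch_size = 8
    student_train_dataset = tf.data.Dataset.from_tensor_slices((X_train_pad, combined_y_train))
    student_train_dataset = student_train_dataset.shuffle(1000).batch(phys_batch_size).prefetch(tf.data.AUTOTUNE)
    valid_dataset = tf.data.Dataset.from_tensor_slices((X_valid_pad, combined_y_valid))
    valid_dataset = valid_dataset.batch(phys_batch_size).prefetch(tf.data.AUTOTUNE)

    # Build student model with physics-aware loss
    student_model = build_cnn_model()

    # Pairing lookup indexed by nucleotide code (mirrors is_complementary; row/column 0 is padding)
    pair_table = np.zeros((vocab_size, vocab_size), dtype=np.float32)
    for base1, code1 in nucleotide_map.items():
        for base2, code2 in nucleotide_map.items():
            pair_table[code1, code2] = float(is_complementary(base1, base2))
    pair_table = tf.constant(pair_table)

    # Enhanced loss function with physical regularization.
    # The original looked up sequences with `i.numpy()` and the batch index, which fails in graph
    # mode and does not match shuffled batches; tensor ops on the token channel replace that here.
    def physics_aware_loss(y_true_combined, y_pred):
        # Standard distillation loss
        y_true = y_true_combined[..., :3]
        y_teacher = y_true_combined[..., 3:6]
        tokens = tf.cast(y_true_combined[..., 6], tf.int32)
        mse_loss = 0.5 * tf.keras.losses.MeanSquaredError()(y_true, y_pred) + 0.5 * tf.keras.losses.MeanSquaredError()(y_teacher, y_pred)

        def sample_phys_loss(args):
            coords, tok = args
            # Pairwise distances between all residues of one sample
            sq = tf.reduce_sum(tf.square(coords), axis=-1)
            d2 = sq[:, None] + sq[None, :] - 2.0 * tf.matmul(coords, coords, transpose_b=True)
            dist = tf.sqrt(tf.maximum(d2, 0.0) + 1e-8)
            # Keep pairs (j, k) with k >= j + 4 where both residues are real (non-padding)
            idx = tf.range(tf.shape(tok)[0])
            separated = tf.cast(idx[None, :] - idx[:, None] >= 4, tf.float32)
            real = tf.cast(tok > 0, tf.float32)
            keep = separated * real[:, None] * real[None, :]
            complementary = tf.gather(tf.gather(pair_table, tok), tok, axis=1)
            # Base pair satisfaction: allow the 3.0-3.8 Å range
            pair_loss = tf.reduce_sum(keep * complementary * tf.maximum(0.0, tf.abs(dist - 3.4) - 0.4))
            # Steric clash penalty: penalize < 2.0 Å
            clash_loss = tf.reduce_sum(keep * tf.maximum(0.0, 2.0 - dist))
            return pair_loss + clash_loss

        phys_loss = tf.map_fn(sample_phys_loss, (y_pred, tokens), fn_output_signature=tf.float32)
        total_phys_loss = tf.reduce_mean(phys_loss)
        return mse_loss + 0.1 * total_phys_loss  # Weighted combination

    student_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001*0.8**distil_step),
                          loss=physics_aware_loss)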

    # Training with physical validation callback.
    # Note: check_structure is defined in Section 6; that definition must be run before this cell.
    class PhysicalConstraintCallback(tf.keras.callbacks.Callback):
        def on_epoch_end(self, epoch, logs=None):
            logs = logs if logs is not None else {}
            first_seq = valid_sequences.iloc[0]['sequence']
            val_pred = self.model.predict(X_valid_pad[:1], verbose=0)  # Only the first validation target is checked
            valid_pairs, clashes = check_structure(first_seq, val_pred[0][:len(first_seq)])
            logs['val_pairs'] = len(valid_pairs)
            logs['val_clashes'] = len(clashes)
            print(f" | Val_pairs: {len(valid_pairs)} | Val_clashes: {len(clashes)}")

    print("Training student model with physical constraints...")
    history_student_model_distill = student_model.fit(
        student_train_dataset,  # The dataset already yields (inputs, targets); Keras rejects a separate y argument here
        validation_data=valid_dataset,
        epochs=50,
        callbacks=[EarlyStopping(monitor='val_loss', patience=3), PhysicalConstraintCallback()],
        verbose=1
    )

    teacher_model = student_model



Performing distillation step 1/3 with physical constraints
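The 50/50 distillation loss in `combined_loss` can be rewritten to show what the student is actually fitting. Writing $y$ for the true coordinates, $t$ for the teacher prediction, and $\hat{y}$ for the student output at one residue (notation introduced here, not in the notebook), completing the square gives:

```latex
\tfrac{1}{2}\lVert y-\hat{y}\rVert^{2}+\tfrac{1}{2}\lVert t-\hat{y}\rVert^{2}
  = \left\lVert \hat{y}-\tfrac{y+t}{2}\right\rVert^{2}+\tfrac{1}{4}\lVert y-t\rVert^{2}
```

The last term does not depend on the student, so up to a constant the student regresses onto the midpoint of the labels and the teacher's predictions. If the teacher is poor, half of its error is carried into the next step's target.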


## 6. Generating Predictions and Submission File

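For each test sequence, we predict the 3D coordinates using `teacher_model`, which after the distillation loop is the last trained student. Before writing the submission, we count plausible base pairs and steric clashes in each prediction as a sanity check.

The submission requires 5 sets of coordinates per target. In this baseline, we replicate the same predicted structure 5 times.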

In [None]:
# 6. Generating Predictions and Verification
X_test = test_sequences['encoded'].tolist()
X_test_pad = pad_sequences(X_test, maxlen=max_len, padding='post', value=0)
predictions = teacher_model.predict(X_test_pad)

# Physical Soundness Verification Functions
def is_complementary(base1, base2):
    pairs = {'A': ['U'], 'U': ['A', 'G'], 'G': ['C', 'U'], 'C': ['G']}
    return base2 in pairs.get(base1, [])

def calculate_distance(coord1, coord2):
    return np.sqrt(sum((a - b)**2 for a, b in zip(coord1, coord2)))

def check_structure(sequence, coords):
    valid_pairs = []
    clashes = []
    pair_distance_range = (2.5, 4.0)  # Base pairing distance in Å
    clash_threshold = 2.0
    
    for i in range(len(sequence)):
        for j in range(i+1, len(sequence)):
            if abs(i - j) < 4: continue
            
            distance = calculate_distance(coords[i], coords[j])
            
            if is_complementary(sequence[i], sequence[j]):
                if pair_distance_range[0] <= distance <= pair_distance_range[1]:
                    valid_pairs.append((i+1, j+1, round(distance, 2)))

            if abs(i - j) > 1 and distance < clash_threshold:
                clashes.append((i+1, j+1, round(distance, 2)))

    return valid_pairs, clashes

# Verify all test predictions before submission
print("\n=== Physical Soundness Verification ===")
for idx, row in test_sequences.iterrows():
    target_id = row['target_id']
    seq = row['sequence']
    pred_coords = predictions[idx][:len(seq)]  # Remove padding
    
    valid_pairs, clashes = check_structure(seq, pred_coords)
    
    print(f"\nTarget: {target_id}")
    print(f"Valid base pairs: {len(valid_pairs)} | Steric clashes: {len(clashes)}")
    if clashes:
        print(f"WARNING: {len(clashes)} steric clashes detected!")
        print(f"Example clashes (residues, distance): {clashes[:3]}")
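The nested Python loops in `check_structure` scale quadratically and become slow for long targets. The NumPy sketch below counts the same two quantities (complementary pairs within 2.5 to 4.0 Å, and clashes below 2.0 Å, both restricted to residues at least 4 apart). `count_pairs_and_clashes` is a hypothetical name, and it returns counts rather than the per-pair lists. For very long targets the `(n, n, 3)` intermediate is large, so this is a trade of memory for speed.

```python
import numpy as np

CODE = {'A': 1, 'C': 2, 'G': 3, 'U': 4}
PAIR_TABLE = np.zeros((5, 5), dtype=bool)
for a, b in [('A', 'U'), ('G', 'C'), ('G', 'U')]:
    PAIR_TABLE[CODE[a], CODE[b]] = True
    PAIR_TABLE[CODE[b], CODE[a]] = True

def count_pairs_and_clashes(sequence, coords, pair_range=(2.5, 4.0), clash_threshold=2.0):
    """Count in-range complementary pairs and steric clashes for residues with |i - j| >= 4."""
    n = len(sequence)
    coords = np.asarray(coords, dtype=np.float64)[:n]
    codes = np.array([CODE.get(ch, 0) for ch in sequence])
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))        # (n, n) pairwise distances
    i, j = np.triu_indices(n, k=4)                   # upper triangle with j - i >= 4
    d = dist[i, j]
    comp = PAIR_TABLE[codes[i], codes[j]]
    n_pairs = int(np.sum(comp & (d >= pair_range[0]) & (d <= pair_range[1])))
    n_clashes = int(np.sum(d < clash_threshold))
    return n_pairs, n_clashes

toy_seq = "GAAAC"
paired = [[0, 0, 0], [10, 0, 0], [20, 0, 0], [30, 0, 0], [3, 0, 0]]    # G1 and C5 are 3 Å apart
clashing = [[0, 0, 0], [10, 0, 0], [20, 0, 0], [30, 0, 0], [1, 0, 0]]  # G1 and C5 are 1 Å apart
print(count_pairs_and_clashes(toy_seq, paired))
print(count_pairs_and_clashes(toy_seq, clashing))
```

The toy coordinates are chosen so that only residues 1 and 5 are far enough apart in sequence to be considered, which makes the two cases easy to check by hand.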

## 7. Saving the Submission File

Finally, we save the submission file as `submission.csv`.

In [None]:
# 7. Generate Submission File After Verification
submission_rows = []
for idx, row in test_sequences.iterrows():
    target_id = row['target_id']
    pred_coords = predictions[idx][:len(row['encoded'])]  # Actual residues

    for i in range(len(pred_coords)):
        x, y, z = (float(v) for v in pred_coords[i])
        submission_row = {
            'ID': f"{target_id}_{i+1}",
            'resname': row['sequence'][i],
            'resid': i+1,
        }
        # Replicate the single predicted structure across all 5 coordinate sets.
        # The original wrote only x_1..x_5 and left the y and z columns out.
        for j in range(5):
            submission_row[f"x_{j+1}"] = x
            submission_row[f"y_{j+1}"] = y
            submission_row[f"z_{j+1}"] = z
        submission_rows.append(submission_row)

submission_df = pd.DataFrame(submission_rows)
submission_df.to_csv("submission.csv", index=False)
print("Final submission generated with verified predictions")
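The submission needs five coordinate sets per residue, which this baseline fills with the same prediction. The sketch below builds one row with all fifteen coordinate columns. The `x_k, y_k, z_k` column order is an assumption that should be checked against `sample_submission.csv`, and `make_submission_row` and the target id `TOY_A` are hypothetical.

```python
def make_submission_row(target_id, resname, resid, xyz, n_structures=5):
    """Build one submission row, repeating the same (x, y, z) for every structure slot."""
    row = {'ID': f"{target_id}_{resid}", 'resname': resname, 'resid': resid}
    x, y, z = (float(v) for v in xyz)
    for k in range(1, n_structures + 1):
        row[f"x_{k}"] = x
        row[f"y_{k}"] = y
        row[f"z_{k}"] = z
    return row

row = make_submission_row("TOY_A", "G", 1, (1.5, -2.0, 0.25))
print(list(row.keys()))
```

Comparing `list(row.keys())` with `list(sample_submission.columns)` before writing the file is a cheap way to catch a missing or misordered column.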