# Enhanced GNN baseline 🤖

## Introduction 🌟
Welcome to this Jupyter notebook, developed for the Stanford Ribonanza RNA Folding competition, whose goal is a model that predicts the structure of any RNA molecule.


### Inspiration and Credits 🙌
This notebook is inspired by the work of fnands, available at [this Kaggle project](https://www.kaggle.com/code/fnands/a-quick-gnn-baseline/notebook). I extend my gratitude to fnands for sharing their insights and code.

🌟 Explore my profile and other public projects, and don't forget to share your feedback! 
👉 [Visit my Profile](https://www.kaggle.com/zulqarnainali) 👈

🙏 Thank you for taking the time to review my work, and please give it a thumbs-up if you found it valuable! 👍

## Purpose 🎯
The primary purpose of this notebook is to:
- Load and preprocess the competition data 📁
- Engineer relevant features for model training 🏋️‍♂️
- Train predictive models to make target variable predictions 🧠
- Submit predictions to the competition environment 📤

## Notebook Structure 📚
This notebook is structured as follows:
1. **Data Preparation**: In this section, we load and preprocess the competition data.
2. **Feature Engineering**: We generate and select relevant features for model training.
3. **Model Training**: We train machine learning models on the prepared data.
4. **Prediction and Submission**: We make predictions on the test data and submit them for evaluation.


## How to Use 🛠️
To use this notebook effectively, please follow these steps:
1. Ensure you have the competition data and environment set up.
2. Execute each cell sequentially to perform data preparation, feature engineering, model training, and prediction submission.
3. Customize and adapt the code as needed to improve model performance or experiment with different approaches.

**Note**: Make sure to replace any placeholder paths or configurations with your specific information.

## Acknowledgments 🙏
We acknowledge the Stanford Ribonanza RNA Folding competition organizers for providing the dataset and the competition platform.

Let's get started! Feel free to reach out if you have any questions or need assistance along the way.

In [None]:
!pip install torch-geometric

## 📦 Import necessary libraries and modules


In [None]:
import torch  # 🔥 Import PyTorch for deep learning
from torch.utils.data import random_split  # 📂 For splitting datasets
from torch_geometric.data import Data, Dataset  # 📊 PyTorch Geometric for graph data handling
from torch_geometric.loader import DataLoader  # 🚚 DataLoader for batching data
import pandas as pd  # 🐼 Pandas for data manipulation
from pathlib import Path  # 📁 pathlib for handling file paths
import numpy as np  # 🧮 NumPy for numerical operations
from sklearn.preprocessing import OneHotEncoder  # 🧬 Scikit-Learn for one-hot encoding
import polars as pl  # 📊 Polars for data manipulation
import re  # 🧵 Regular expressions for text processing
from tqdm import tqdm  # 🔄 tqdm for progress bar display


## 📁 Define file paths for dataset and output


In [None]:
DATA_DIR = Path("/kaggle/input/stanford-ribonanza-rna-folding/")  # 📂 Directory for dataset
TRAIN_CSV = DATA_DIR / "train_data.csv"  # 🚆 Training data in CSV format
TRAIN_PARQUET_FILE = "train_data.parquet"  # 📦 Training data in Parquet format
TEST_CSV = DATA_DIR / "test_sequences.csv"  # 🚀 Test sequences in CSV format
TEST_PARQUET_FILE = "test_sequences.parquet"  # 📦 Test sequences in Parquet format
PRED_CSV = "submission.csv"  # 📄 Output file for predictions


## 📝 Define a function to convert CSV to Parquet

**Explanation**:

This code defines a Python function named `to_parquet` that takes two parameters: `csv_file` and `parquet_file`. The purpose of this function is to read data from a CSV file, manipulate the schema of the data, and then save it as a Parquet file with specific settings.

1. `dummy_df = pl.scan_csv(csv_file)`: This line lazily scans the CSV file with Polars and returns a `LazyFrame` called `dummy_df`. No data is loaded into memory at this point; the frame is only used to obtain the inferred column names and data types.

2. `new_schema = {}`: This line initializes an empty dictionary called `new_schema`. This dictionary will be used to define a new schema for the DataFrame with specific column types.

3. `for key, value in dummy_df.schema.items():`: This line starts a loop that iterates over the column names of `dummy_df` and their inferred data types in the schema.

4. `if key.startswith("reactivity"):`: Within the loop, this line checks if the column name (`key`) starts with the string "reactivity." If it does, it means that this column is related to reactivity, and we want to change its data type.

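5. `new_schema[key] = pl.Float32`: If the column name starts with "reactivity," this line sets the data type of that column to `Float32` in the `new_schema` dictionary. Forcing one explicit type means every such column is parsed as a 32-bit float regardless of what Polars inferred for it, and 32-bit floats take half the space of 64-bit ones. Note that the prefix check also matches the `reactivity_error_*` columns.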

6. `else:`: If the column name does not start with "reactivity," this line executes when dealing with columns other than those related to reactivity.

7. `new_schema[key] = value`: For columns that are not related to reactivity, this line simply copies the existing data type from the original schema to the `new_schema` dictionary.

8. `df = pl.scan_csv(csv_file, schema=new_schema)`: After defining the new schema, this line scans `csv_file` again, this time passing `new_schema` so that the columns are parsed with the desired data types. The result `df` is again a `LazyFrame`.

9. `df.sink_parquet(...)`: This line executes the lazy query in a streaming fashion and writes the result to the Parquet file specified by `parquet_file`, without loading the whole CSV into memory at once. It includes some additional settings:
   - `compression='uncompressed'`: It specifies that the Parquet file should not be compressed, which makes it fast to read but may result in a larger file size.
   - `row_group_size=10`: It sets the row group size for the Parquet file to 10 rows. Row groups are the chunks a Parquet file is divided into, and this parameter controls how many rows each one holds.

This code reads a CSV file, modifies the schema for specific columns, and then saves the data to a Parquet file with customizable settings.
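
The schema-override step can be tried in isolation on a plain dictionary. The sketch below is only an illustration: the column names are toy examples, the type labels (`"Utf8"`, `"Float64"`, `"Float32"`) are placeholder strings rather than Polars objects, and `override_reactivity_types` is a hypothetical helper that does not appear in the notebook.

```python
def override_reactivity_types(schema, target_type="Float32"):
    """Return a copy of `schema` with every 'reactivity*' column set to `target_type`."""
    new_schema = {}
    for name, dtype in schema.items():
        if name.startswith("reactivity"):
            new_schema[name] = target_type  # force a single numeric type
        else:
            new_schema[name] = dtype  # keep the inferred type
    return new_schema


# Toy schema: plain strings stand in for Polars data types (assumption for this example)
inferred = {
    "sequence_id": "Utf8",
    "sequence": "Utf8",
    "reactivity_0001": "Float64",
    "reactivity_error_0001": "Float64",
}
converted = override_reactivity_types(inferred)
print(converted)
```

The prefix test also matches `reactivity_error_0001`, so that column is converted as well, while the original dictionary is left unchanged.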


In [None]:
def to_parquet(csv_file, parquet_file):
    # 📊 Read CSV data using Polars
    dummy_df = pl.scan_csv(csv_file)

    # 🔍 Define a new schema mapping for specific columns
    new_schema = {}
    for key, value in dummy_df.schema.items():
        if key.startswith("reactivity"):
            new_schema[key] = pl.Float32  # 📊 Convert 'reactivity' columns to Float32
        else:
            new_schema[key] = value

    # 📊 Read CSV data with the new schema and write to Parquet
    df = pl.scan_csv(csv_file, schema=new_schema)
    
    # 💾 Write data to Parquet format with specified settings
    df.sink_parquet(
        parquet_file,
        compression='uncompressed',  # No compression for easy access
        row_group_size=10,  # Adjust row group size as needed
    )


## 📝 Convert training and test CSV data to Parquet format


In [None]:
to_parquet(TRAIN_CSV, TRAIN_PARQUET_FILE)  # 🚆 Training data
to_parquet(TEST_CSV, TEST_PARQUET_FILE)    # 🚀 Test data


## 📝 Define a function to generate nearest adjacency matrix


**Explanation**:

This code defines a Python function called `nearest_adjacency` that generates adjacency information for elements in a sequence. It connects each element to its neighbors within `n` positions, considering both positive and negative offsets. If `loops` is set to `True`, each element is also connected to itself (a self-loop). The two ends of the sequence are not connected to each other.

1. `base = np.arange(sequence_length)`: This line creates a NumPy array `base` containing integers from 0 to `sequence_length-1`. This array represents the sequence of elements for which adjacency is being calculated.

2. `connections = []`: This initializes an empty list `connections` where the adjacency information will be stored.

3. `for i in range(-n, n + 1):`: This starts a loop that iterates through a range of values from `-n` to `n`, inclusive. This loop considers neighbors within a specified range around each element.

4. `if i == 0 and not loops:`: This checks if `i` is equal to 0 (indicating the current element itself) and if `loops` is set to `False`. If both conditions are met, it continues to the next iteration of the loop, skipping the case where an element is considered its own neighbor.

5. `elif i == 0 and loops:`: This handles the case where `i` is equal to 0 and `loops` is set to `True`. In this case, it stacks the `base` array on top of itself, so each element gets an edge to itself (a self-loop). The result is appended to `connections` and the loop continues with the next offset.

6. `neighbours = base.take(range(i, sequence_length + i), mode='wrap')`: For non-zero values of `i`, this line computes, for every position, the index of the element at offset `i`. The `take` method with `mode='wrap'` wraps indices that fall off either end of the sequence, so the array keeps length `sequence_length`. The wrapped entries are discarded again in step 8.

7. `stack = np.vstack([base, neighbours])`: This stacks the `base` array and the `neighbours` array vertically into an array with two rows. The first row holds the source positions and the second row holds the neighbor at offset `i`, so each column is one edge.

8. The `if` and `elif` blocks remove the columns that were produced by wrapping. If `i` is negative, the first `|i|` columns are dropped (`stack[:, -i:]`). If `i` is positive, the last `i` columns are dropped (`stack[:, :-i]`). As a result, the start and the end of the sequence are never connected to each other.

9. `connections.append(...)`: The trimmed array, representing the edges for a specific offset `i`, is appended to the `connections` list.

10. After the loop, the code combines all the arrays in the `connections` list horizontally using `np.hstack(connections)` and returns the result. This array has shape `(2, number_of_edges)`, which is the edge index layout that PyTorch Geometric expects.

In summary, this code generates the edges between every position in a sequence and its neighbors up to `n` steps away in both directions, optionally including self-loops and without wrapping around the ends. It returns an array with two rows in which each column is one (source, target) pair.
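
One way to sanity-check the function is to build the expected edge set by brute force and compare. The sketch below uses only plain Python. `expected_edges` is a hypothetical helper written for this check, and it assumes the behaviour read from the code in the next cell: every ordered pair of positions at most `n` apart is an edge, with no wrap-around between the two ends of the sequence.

```python
def expected_edges(sequence_length, n=2, loops=True):
    """All ordered (source, target) pairs whose positions differ by at most n."""
    edges = set()
    for source in range(sequence_length):
        for offset in range(-n, n + 1):
            if offset == 0 and not loops:
                continue  # skip self-loops unless requested
            target = source + offset
            if 0 <= target < sequence_length:  # no wrap-around at the ends
                edges.add((source, target))
    return edges


edges = expected_edges(10, n=1, loops=False)
print(len(edges))  # 2 * (10 - 1) = 18 directed edges
print((0, 9) in edges, (9, 0) in edges)  # False False: the ends are not connected
```

Once `nearest_adjacency` is defined, `set(zip(*nearest_adjacency(10, n=1, loops=False).tolist())) == expected_edges(10, n=1, loops=False)` should hold. For `n` smaller than the sequence length `L`, the number of directed edges without self-loops is `2*n*L - n*(n + 1)`.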

In [None]:
def nearest_adjacency(sequence_length, n=2, loops=True):
    base = np.arange(sequence_length)
    connections = []

    for i in range(-n, n + 1):
        if i == 0 and not loops:
            continue
        elif i == 0 and loops:
            stack = np.vstack([base, base])
            connections.append(stack)
            continue
        
        # 🔄 Wrap around the sequence for circular connections
        neighbours = base.take(range(i, sequence_length + i), mode='wrap')
        stack = np.vstack([base, neighbours])

        # Separate connections for positive and negative offsets
        if i < 0:
            connections.append(stack[:, -i:])
        elif i > 0:
            connections.append(stack[:, :-i])

    # Combine connections horizontally
    return np.hstack(connections)

# Quick check: prints a 2 x 18 edge array (9 backward edges followed by 9 forward edges)
print(nearest_adjacency(10, n=1, loops=False))


## 📏 Define the edge distance for generating adjacency matrix

In [None]:
EDGE_DISTANCE = 4 #Edge distance for generating adjacency matrix.


## 📊 Define a custom dataset class for a simple graph dataset

**Explanation**:

This code defines a PyTorch Geometric dataset class called `SimpleGraphDataset` for working with graph data where nodes have one-hot encoded sequence information and target values. It reads data from a Parquet file, processes it, and provides a way to access individual data samples.

1. `class SimpleGraphDataset(Dataset):`: This line defines a class `SimpleGraphDataset` that inherits from the `Dataset` class imported from `torch_geometric.data`, so its samples can be batched by the PyTorch Geometric `DataLoader`.

2. `def __init__(self, parquet_name, edge_distance=5, root=None, transform=None, pre_transform=None, pre_filter=None):`: This is the constructor method for the class. It initializes the dataset object and accepts several parameters:
   - `parquet_name`: The name of the Parquet file containing the data.
   - `edge_distance`: The distance for generating the adjacency matrix.
   - `root`, `transform`, `pre_transform`, `pre_filter`: Parameters inherited from the `Dataset` class for custom transformations and filtering (optional).

3. `super().__init__(root, transform, pre_transform, pre_filter)`: This line calls the constructor of the parent class (`Dataset`) to initialize the dataset with any provided parameters.

4. `self.parquet_name = parquet_name`: It sets the name of the Parquet file as an instance variable.

5. `self.edge_distance = edge_distance`: It sets the edge distance for generating the adjacency matrix as an instance variable.

6. `self.node_encoder = OneHotEncoder(sparse_output=False, max_categories=5)`: This line initializes a one-hot encoder (`node_encoder`) with specific settings:
   - `sparse_output=False`: Specifies that the output should not be sparse.
   - `max_categories=5`: Sets the maximum number of categories to 5.

7. `self.node_encoder.fit(np.array(['A', 'G', 'U', 'C']).reshape(-1, 1))`: It fits the one-hot encoder to the possible values ('A', 'G', 'U', 'C') by reshaping them into a column vector.

8. `self.df = pl.read_parquet(self.parquet_name)`: This line reads the data from the Parquet file using Polars (`pl`) and stores it in the `self.df` DataFrame.

9. `self.df = self.df.filter(pl.col("SN_filter") == 1.0)`: It filters the DataFrame to keep only rows where the "SN_filter" column has a value of 1.0.

10. The code uses regular expressions to identify and select columns with names matching the pattern "reactivity_[0-9]." These columns are assumed to contain reactivity information.

11. `self.reactivity_df = self.df.select(reactivity_names)`: It selects only the columns related to reactivity and stores them in the `self.reactivity_df` DataFrame.

12. `self.sequence_df = self.df.select("sequence")`: This line selects the "sequence" column from the DataFrame and stores it in the `self.sequence_df` DataFrame.

13. `def parse_row(self, idx):`: This method is used to parse a row from the dataset. It takes an index `idx` as input and returns a PyTorch `Data` object containing node features, adjacency information, targets, and valid masks.

14. The `parse_row` method reads the sequence and reactivity information for the given index and performs the following steps:
   - It converts the sequence string into a one-hot encoded array.
   - Calculates the adjacency matrix using the `nearest_adjacency` function.
   - Processes reactivity information and creates a valid mask.
   - Defines node features, targets, and creates a PyTorch `Data` object.

15. `def len(self):`: This method returns the length of the dataset, which is the number of rows in the DataFrame.

16. `def get(self, idx):`: This method is used to retrieve a data sample at a specified index. It calls the `parse_row` method to parse the data and returns a PyTorch `Data` object for the specified index.

This class provides a convenient way to work with graph data stored in a Parquet file, handling data loading, preprocessing, and providing access to individual data samples as PyTorch tensors.
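
To make the node features and the mask concrete, here is a small NumPy-only sketch of what `parse_row` computes for one toy row. It does not use `OneHotEncoder`; the column order A, C, G, U is an assumption that mirrors the sorted categories scikit-learn learns from the four bases, and the sequence and reactivity values are invented.

```python
import numpy as np

BASES = np.array(["A", "C", "G", "U"])  # assumed column order (alphabetical)


def one_hot(sequence):
    """Return a (len(sequence), 4) float32 array with a single 1 per row."""
    letters = np.array(list(sequence)).reshape(-1, 1)
    return (letters == BASES).astype(np.float32)


sequence = "GGAUC"  # invented example sequence
reactivity = np.array([np.nan, 0.5, 1.2, np.nan, -0.1], dtype=np.float32)  # invented values

features = one_hot(sequence)
valid = ~np.isnan(reactivity)                 # True where a measurement exists
valid_positions = np.flatnonzero(valid)       # the same information as position indices
targets = np.nan_to_num(reactivity, nan=0.0)  # NaN -> 0.0; these entries are masked out later

print(features.shape)            # (5, 4)
print(features[0].tolist())      # [0.0, 0.0, 1.0, 0.0] -> 'G'
print(valid_positions.tolist())  # [1, 2, 4]
```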

In [None]:
class SimpleGraphDataset(Dataset):
    def __init__(self, parquet_name, edge_distance=5, root=None, transform=None, pre_transform=None, pre_filter=None):
        super().__init__(root, transform, pre_transform, pre_filter)
        # 📄 Set the Parquet file name
        self.parquet_name = parquet_name
        # 📏 Set the edge distance for generating the adjacency matrix
        self.edge_distance = edge_distance
        # 🧮 Initialize the one-hot encoder for node features
        self.node_encoder = OneHotEncoder(sparse_output=False, max_categories=5)
        # 🧮 Fit the one-hot encoder to possible values (A, G, U, C)
        self.node_encoder.fit(np.array(['A', 'G', 'U', 'C']).reshape(-1, 1))
        # 📊 Load the Parquet dataframe
        self.df = pl.read_parquet(self.parquet_name)
        # 📊 Filter the dataframe by 'SN_filter' column where the value is 1.0
        self.df = self.df.filter(pl.col("SN_filter") == 1.0)
        # 🧬 Get reactivity column names using regular expression
        reactivity_match = re.compile('(reactivity_[0-9])')
        reactivity_names = [col for col in self.df.columns if reactivity_match.match(col)]
        # 📊 Select only the reactivity columns
        self.reactivity_df = self.df.select(reactivity_names)
        # 📊 Select the 'sequence' column
        self.sequence_df = self.df.select("sequence")

    def parse_row(self, idx):
        # 📊 Read the row at the given index
        sequence_row = self.sequence_df.row(idx)
        reactivity_row = self.reactivity_df.row(idx)
        # 🧬 Get the sequence string and convert it to an array
        sequence = np.array(list(sequence_row[0])).reshape(-1, 1)
        # 🧬 Encode the sequence array using the one-hot encoder
        encoded_sequence = self.node_encoder.transform(sequence)
        # 📏 Get the sequence length
        sequence_length = len(sequence)
        # 📊 Get the edge index using nearest adjacency function
        edges_np = nearest_adjacency(sequence_length, n=self.edge_distance, loops=False)
        # 📏 Convert the edge index to a torch tensor
        edge_index = torch.tensor(edges_np, dtype=torch.long)
        # 🧬 Get reactivity targets for nodes
        reactivity = np.array(reactivity_row, dtype=np.float32)[0:sequence_length]
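        # 🔒 Create a boolean valid mask for nodes (True where reactivity is not NaN);
        # a boolean mask stays aligned with the nodes when several graphs are batched together
        valid_mask = ~np.isnan(reactivity)
        torch_valid_mask = torch.tensor(valid_mask, dtype=torch.bool)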
        # 🧬 Replace nan values for reactivity with 0.0 (not super important as they get masked)
        reactivity = np.nan_to_num(reactivity, copy=False, nan=0.0)
        # 📊 Define node features as the one-hot encoded sequence
        node_features = torch.Tensor(encoded_sequence)
        # 🎯 Define targets
        targets = torch.Tensor(reactivity)
        # 📊 Create a PyTorch Data object
        data = Data(x=node_features, edge_index=edge_index, y=targets, valid_mask=torch_valid_mask)
        return data

    def len(self):
        # 📏 Return the length of the dataset
        return len(self.df)

    def get(self, idx):
        # 📊 Get and parse data for the specified index
        data = self.parse_row(idx)
        return data


## 📚 Create the full training dataset and split it into training and validation datasets

In [None]:
full_train_dataset = SimpleGraphDataset(parquet_name=TRAIN_PARQUET_FILE, edge_distance=EDGE_DISTANCE)  # 🚆 Full training dataset
generator1 = torch.Generator().manual_seed(42)  # 🌱 Initialize random seed generator
train_dataset, val_dataset = random_split(full_train_dataset, [0.7, 0.3], generator1)  # 🎯 Split dataset into training (70%) and validation (30%)


## 🚂 Create data loaders for training and validation


In [None]:
train_dataloader = DataLoader(train_dataset, batch_size=256, shuffle=True, num_workers=2)  # 📦 Training data loader
val_dataloader = DataLoader(val_dataset, batch_size=256, shuffle=False, num_workers=2)  # 📦 Validation data loader
print("Done")

## 📉 Define loss functions for training and evaluation


In [None]:
# 📉 Define loss functions for training and evaluation
import torch.nn.functional as F

def loss_fn(output, target):
    # 🪟 Clip the target values to be within the range [0, 1]
    clipped_target = torch.clip(target, min=0, max=1)
    # 📉 Calculate the mean squared error loss
    mses = F.mse_loss(output, clipped_target, reduction='mean')
    return mses

def mae_fn(output, target):
    # 🪟 Clip the target values to be within the range [0, 1]
    clipped_target = torch.clip(target, min=0, max=1)
    # 📉 Calculate the mean absolute error loss
    maes = F.l1_loss(output, clipped_target, reduction='mean')
    return maes
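
A small worked example shows the effect of clipping the targets. The sketch below re-implements both metrics in plain Python purely for illustration (the notebook itself uses `F.mse_loss` and `F.l1_loss`); the helper names and the numbers are invented.

```python
def clip(value, low=0.0, high=1.0):
    """Limit a value to the range [low, high]."""
    return max(low, min(high, value))


def clipped_mse(outputs, targets):
    clipped = [clip(t) for t in targets]
    return sum((o - t) ** 2 for o, t in zip(outputs, clipped)) / len(outputs)


def clipped_mae(outputs, targets):
    clipped = [clip(t) for t in targets]
    return sum(abs(o - t) for o, t in zip(outputs, clipped)) / len(outputs)


outputs = [0.2, 0.9]   # invented predictions
targets = [-0.3, 1.5]  # invented targets; they become 0.0 and 1.0 after clipping

print(round(clipped_mse(outputs, targets), 4))  # 0.025
print(round(clipped_mae(outputs, targets), 4))  # 0.15
```

Only the targets are clipped. The model output is compared as is, so a prediction above 1 is still penalised relative to the clipped target.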


## 🧠 Create a neural network model using EdgeCNN

In [None]:
from torch_geometric.nn.models import EdgeCNN

# 🛠️ Set the device to GPU if available, otherwise use CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# 🏗️ Initialize the EdgeCNN model with specified parameters
model = EdgeCNN(
    in_channels=full_train_dataset.num_features,  # 📊 Input features determined by the dataset
    hidden_channels=128,  # 🕳️ Number of hidden channels in the model
    num_layers=4,  # 🧱 Number of layers in the model
    out_channels=1  # 📤 Number of output channels
).to(device)  # 🏗️ Move the model to the selected device (GPU or CPU)


In [None]:
# Make sure we are using the GPU
device

## 🔄 Training loop


**Explanation**:

This code trains the model with a standard PyTorch training loop. The task here is regression: the model predicts one reactivity value per nucleotide. For each epoch, the loop reports the training and validation loss (MSE) and the mean absolute error (MAE).

1. `n_epochs = 15`: This line defines the number of training epochs, which is the number of times the entire training dataset will be processed by the model during training.

2. `optimizer = torch.optim.Adam(model.parameters(), lr=0.0003, weight_decay=5e-4)`: Here, an Adam optimizer is defined with the following settings:
   - `model.parameters()`: It specifies the model's parameters that the optimizer will update during training.
   - `lr=0.0003`: This sets the learning rate, controlling the step size during optimization.
   - `weight_decay=5e-4`: Weight decay is a regularization technique to prevent overfitting by penalizing large weights.

3. `for epoch in range(n_epochs):`: This loop iterates over the specified number of training epochs.

4. `train_losses = []` and `train_maes = []`: These lines initialize empty lists to store training losses and mean absolute errors for each batch in the training dataset.

5. `model.train()`: This sets the model in training mode, which can be important for certain layers or operations that behave differently during training and inference (e.g., dropout, batch normalization).

6. The next inner loop iterates over batches in the `train_dataloader` using the `tqdm` progress bar for visualization. The code within this loop performs the following steps for each batch:

   - `batch = batch.to(device)`: It moves the batch of data to the specified computing device (e.g., CPU or GPU) to utilize hardware acceleration.

   - `optimizer.zero_grad()`: This resets the gradients of the model's parameters to zero before computing gradients for the current batch.

   - `out = model(batch.x, batch.edge_index)`: It passes the input data (`batch.x` and `batch.edge_index`) through the model to get predictions (`out`). The specific model architecture and input structure depend on the problem.

   - `out = torch.squeeze(out)`: This squeezes any unnecessary dimensions from the output tensor.

   - `loss = loss_fn(out[batch.valid_mask], batch.y[batch.valid_mask])`: It calculates the loss between the model's predictions (`out`) and the ground truth targets (`batch.y`) but only for the valid data points (where `batch.valid_mask` is `True`).

   - `mae = mae_fn(out[batch.valid_mask], batch.y[batch.valid_mask])`: Similarly, this calculates the mean absolute error (MAE) between predictions and targets for valid data points.

   - `loss.backward()`: This computes gradients for the model's parameters with respect to the loss.

   - `train_losses.append(loss.detach().cpu().numpy())` and `train_maes.append(mae.detach().cpu().numpy())`: These lines append the loss and MAE values (converted to NumPy arrays) for the current batch to the respective lists.

   - `optimizer.step()`: This performs a parameter update using the computed gradients.

   - `pbar.set_description(f"Train loss {loss.detach().cpu().numpy():.4f}")`: This updates the progress bar description to display the current training loss.

7. After processing all batches in the training dataset, the code prints the average training loss and MAE for the current epoch.

8. `val_losses = []` and `val_maes = []`: Similar to the training phase, these lines initialize empty lists to store validation losses and MAEs.

9. `model.eval()`: This sets the model in evaluation mode, which disables operations like dropout during evaluation.

10. The inner loop now iterates over batches in the validation dataset using the `val_dataloader`. It follows the same steps as the training loop but calculates and stores validation loss and MAE.

11. After processing all validation batches, the code prints the average validation loss and MAE for the current epoch.

This code represents a typical training loop for supervised learning with neural networks, tracking model performance over multiple epochs and optimizing model parameters using gradient descent with the Adam optimizer. It also handles GPU acceleration if available (`batch = batch.to(device)`).
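
The masking step relies on `valid_mask` lining up with the node order of the whole batch. The NumPy sketch below imitates what happens when two toy graphs are concatenated into one batch; the arrays are invented. It contrasts a boolean mask with per-graph position indices, which only stay correct if each graph's indices are shifted by the number of nodes that come before it. As far as I understand PyTorch Geometric's default batching, that shift is applied automatically only to attributes whose name contains `index` (such as `edge_index`), so this is worth checking for custom attributes.

```python
import numpy as np

# Two toy graphs with 3 and 4 nodes; NaN marks a missing target (invented values)
y_a = np.array([0.1, np.nan, 0.3])
y_b = np.array([np.nan, 0.5, 0.6, np.nan])
y_batch = np.concatenate([y_a, y_b])  # node-level arrays are concatenated

# Boolean masks concatenate cleanly
bool_mask = np.concatenate([~np.isnan(y_a), ~np.isnan(y_b)])
print(y_batch[bool_mask].tolist())  # [0.1, 0.3, 0.5, 0.6]

# Per-graph indices must be offset by the number of preceding nodes
idx_a = np.flatnonzero(~np.isnan(y_a))               # [0, 2]
idx_b = np.flatnonzero(~np.isnan(y_b))               # [1, 2]
naive = np.concatenate([idx_a, idx_b])               # [0, 2, 1, 2]: all point into graph A
shifted = np.concatenate([idx_a, idx_b + len(y_a)])  # [0, 2, 4, 5]
print(y_batch[shifted].tolist())  # [0.1, 0.3, 0.5, 0.6]
print(y_batch[naive].tolist())    # [0.1, 0.3, nan, 0.3]
```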

In [None]:
n_epochs = 15

# 📈 Define the optimizer with learning rate and weight decay
optimizer = torch.optim.Adam(model.parameters(), lr=0.0003, weight_decay=5e-4)

# 🚂 Iterate over epochs
for epoch in range(n_epochs):
    train_losses = []
    train_maes = []
    model.train()
    
    # 🚞 Iterate over batches in the training dataloader
    for batch in (pbar := tqdm(train_dataloader, position=0, leave=True)):
        batch = batch.to(device)
        optimizer.zero_grad()
        out = model(batch.x, batch.edge_index)
        out = torch.squeeze(out)
        loss = loss_fn(out[batch.valid_mask], batch.y[batch.valid_mask])
        mae = mae_fn(out[batch.valid_mask], batch.y[batch.valid_mask])
        loss.backward()
        train_losses.append(loss.detach().cpu().numpy())
        train_maes.append(mae.detach().cpu().numpy())
        optimizer.step()
        pbar.set_description(f"Train loss {loss.detach().cpu().numpy():.4f}")
    
    # 📊 Print average training loss and MAE for the epoch
    print(f"Epoch {epoch} train loss: ", np.mean(train_losses))
    print(f"Epoch {epoch} train mae: ", np.mean(train_maes))
    
    val_losses = []
    val_maes = []
    model.eval()
    
    # 🚞 Iterate over batches in the validation dataloader (no gradients needed)
    with torch.no_grad():
        for batch in (pbar := tqdm(val_dataloader, position=0, leave=True)):
            batch = batch.to(device)
            out = model(batch.x, batch.edge_index)
            out = torch.squeeze(out)
            loss = loss_fn(out[batch.valid_mask], batch.y[batch.valid_mask])
            mae = mae_fn(out[batch.valid_mask], batch.y[batch.valid_mask])
            val_losses.append(loss.detach().cpu().numpy())
            val_maes.append(mae.detach().cpu().numpy())
            pbar.set_description(f"Validation loss {loss.detach().cpu().numpy():.4f}")
    
    # 📊 Print average validation loss and MAE for the epoch
    print(f"Epoch {epoch} val loss: ", np.mean(val_losses))
    print(f"Epoch {epoch} val mae: ", np.mean(val_maes))


## 🗑️ Clear the GPU memory cache


In [None]:
torch.cuda.empty_cache()


## 📚 Define a custom dataset class for inference on a graph


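**Explanation**:

This class mirrors `SimpleGraphDataset` but is meant for the test sequences, so there are no reactivity targets and no `SN_filter` step. Instead of `y` and `valid_mask`, each `Data` object carries `ids`, the consecutive row ids from `id_min` to `id_min + sequence_length - 1`, so that every node prediction can later be matched to its row in the submission file.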



In [None]:
class InferenceGraphDataset(Dataset):
    def __init__(self, parquet_name, edge_distance=2, root=None, transform=None, pre_transform=None, pre_filter=None):
        super().__init__(root, transform, pre_transform, pre_filter)
        # 📄 Set the Parquet file name
        self.parquet_name = parquet_name
        # 📏 Set the edge distance for generating the adjacency matrix
        self.edge_distance = edge_distance
        # 🧮 Initialize the one-hot encoder for node features
        self.node_encoder = OneHotEncoder(sparse_output=False, max_categories=4)
        # 🧮 Fit the one-hot encoder to possible values (A, G, U, C)
        self.node_encoder.fit(np.array(['A', 'G', 'U', 'C']).reshape(-1, 1))
        # 📊 Load the Parquet dataframe
        self.df = pl.read_parquet(self.parquet_name)
        # 📊 Select the 'sequence' and 'id_min' columns
        self.sequence_df = self.df.select("sequence")
        self.id_min_df = self.df.select("id_min")

    def parse_row(self, idx):
        # 📊 Read the row at the given index
        sequence_row = self.sequence_df.row(idx)
        id_min = self.id_min_df.row(idx)[0]

        # 🧬 Get the sequence string and convert it to an array
        sequence = np.array(list(sequence_row[0])).reshape(-1, 1)
        # 🧬 Encode the sequence array using the one-hot encoder
        encoded_sequence = self.node_encoder.transform(sequence)
        # 📏 Get the sequence length
        sequence_length = len(sequence)
        # 📊 Get the edge index using nearest adjacency function
        edges_np = nearest_adjacency(sequence_length, n=self.edge_distance, loops=False)
        # 📏 Convert the edge index to a torch tensor
        edge_index = torch.tensor(edges_np, dtype=torch.long)

        # 📊 Define node features as the one-hot encoded sequence
        node_features = torch.Tensor(encoded_sequence)
        ids = torch.arange(id_min, id_min+sequence_length, 1)

        data = Data(x=node_features, edge_index=edge_index, ids=ids)

        return data

    def len(self):
        # 📏 Return the length of the dataset
        return len(self.df)

    def get(self, idx):
        # 📊 Get and parse data for the specified index
        data = self.parse_row(idx)
        return data
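
The `ids` tensor is what ties each node prediction back to a submission row. Here is a plain-Python sketch of the mapping, assuming two invented test rows in which `id_min` is the first row id of each sequence (`node_ids` is a hypothetical helper, not part of the notebook):

```python
def node_ids(id_min, sequence):
    """One consecutive id per nucleotide, starting at id_min."""
    return list(range(id_min, id_min + len(sequence)))


# Invented test rows: (id_min, sequence)
rows = [(0, "GGAU"), (4, "ACG")]

all_ids = []
for id_min, sequence in rows:
    all_ids.extend(node_ids(id_min, sequence))

print(all_ids)  # [0, 1, 2, 3, 4, 5, 6]
```

Because the inference dataloader is not shuffled, concatenating the per-batch ids yields them in the same order as the rows of the test file.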


## 📚 Create an inference dataset and dataloader

In [None]:
infer_dataset = InferenceGraphDataset(parquet_name=TEST_PARQUET_FILE, edge_distance=EDGE_DISTANCE)  # 🚀 Inference dataset
infer_dataloader = DataLoader(infer_dataset, batch_size=128, shuffle=False, num_workers=2)  # 📦 Inference dataloader


## 🏭 Set the device to GPU if available, otherwise use CPU


In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# 🏭 Move the model to the selected device (GPU or CPU) and switch to evaluation mode
model = model.eval().to(device)


## 🧮 Collect IDs and predictions batch by batch


In [None]:
id_chunks = []
pred_chunks = []

# 🚀 Iterate over batches in the inference dataloader (no gradients needed)
with torch.no_grad():
    for batch in tqdm(infer_dataloader):
        batch = batch.to(device)
        out = model(batch.x, batch.edge_index).cpu().numpy()

        # 📦 Keep per-batch IDs and predictions; flatten (num_nodes, 1) outputs to 1-D
        id_chunks.append(batch.ids.cpu().numpy())
        pred_chunks.append(out.reshape(-1))

# 🧮 Concatenate once at the end instead of calling np.append in every iteration
ids = np.concatenate(id_chunks)
preds = np.concatenate(pred_chunks)


## 📊 Create a DataFrame for the submission


In [None]:
# 📊 The model has a single output, so the same prediction is written to both reactivity columns
submission_df = pl.DataFrame({"id": ids, "reactivity_DMS_MaP": preds, "reactivity_2A3_MaP": preds})


## 💾 Write the submission DataFrame to a CSV file


In [None]:
submission_df.write_csv(PRED_CSV)


## Explore More! 👀
Thank you for exploring this notebook! If you found this notebook insightful or if it helped you in any way, I invite you to explore more of my work on my profile.

👉 [Visit my Profile](https://www.kaggle.com/zulqarnainali) 👈

## Feedback and Gratitude 🙏
I value your feedback! Your insights and suggestions are essential for my continuous improvement. If you have any comments, questions, or ideas to share, please don't hesitate to reach out.

📬 Contact me via email: [zulqar445ali@gmail.com](mailto:zulqar445ali@gmail.com)

I would like to express my heartfelt gratitude for your time and engagement. Your support motivates me to create more valuable content.

Happy coding and best of luck in your data science endeavors! 🚀
