# 🧬 IEDB Epitope Prediction (Jespersen et al.)

📖 **Dataset Description**

Epitope prediction involves identifying regions on an antigen that are recognized by B-cell receptors or antibodies. These regions, called **B-cell epitopes**, are critical in vaccine design and immunotherapy.

The **IEDB Jespersen** dataset is curated from BepiPred 2.0, which gathers B-cell epitope data from the **Immune Epitope Database (IEDB)**. The task is structured as a **token-level classification** problem, where each **amino acid residue** in a protein sequence is labeled as:  
- `1` if it's part of an epitope  
- `0` otherwise

The dataset contains antigen sequences with corresponding residue-level labels determined using crystallographic data.

- **Task Type**: Token-level binary classification  
- **Input**: Amino acid sequence (protein/antigen)  
- **Output**: List of epitope residue positions (residues at the listed positions are labeled `1`, all others `0`)  
- **Size**: 3,159 antigen sequences  
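
To make the token-level framing concrete, here is a minimal sketch that expands a list of epitope positions into one `0`/`1` label per residue. The helper name `positions_to_labels` is hypothetical, and the sketch assumes the positions are 0-based indices into the sequence, which should be checked against the dataset documentation.

```python
def positions_to_labels(sequence, epitope_positions):
    """Return one 0/1 label per residue (assumes 0-based positions)."""
    positions = set(epitope_positions)
    out_of_range = [p for p in positions if not 0 <= p < len(sequence)]
    if out_of_range:
        raise ValueError(f"positions outside the sequence: {sorted(out_of_range)}")
    return [1 if i in positions else 0 for i in range(len(sequence))]


# Toy example, not taken from the dataset.
labels = positions_to_labels("MKLLILTCLV", [2, 3, 4])
print(labels)  # [0, 0, 1, 1, 1, 0, 0, 0, 0, 0]
```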

## 📚 References  
1. Vita, Randi, et al. “The immune epitope database (IEDB): 2018 update.” *Nucleic acids research* 47.D1 (2019): D339-D343.  
2. Jespersen, Martin Closter, et al. “BepiPred-2.0: improving sequence-based B-cell epitope prediction using conformational epitopes.” *Nucleic acids research* 45.W1 (2017): W24-W29.

**License**: CC BY 4.0


## 📦 Imports

In [1]:
import os
import sys

sys.path.append(os.path.abspath(".."))

import pandas as pd
from scripts.tdc_dataset_download import TDCDatasetDownloader
from scripts.eda_utils import DatasetLoader, EDAVisualizer, SMARTSPatternAnalyzer


______

## 📥 Download IEDB Jespersen Dataset

In [2]:
# Declare category and dataset
category = 'epitope'
dataset = 'IEDB_Jespersen'

In [3]:
# Initiate downloader class to download the dataset
downloader = TDCDatasetDownloader(category, dataset)

Downloading...
100%|██████████| 2.18M/2.18M [01:07<00:00, 32.2kiB/s]
Loading...
Done!


✅ Dataset 'IEDB_Jespersen' saved to '/Users/taiwoadelakin/Documents/Doc/Projects/outreachy-contributions/data/IEDB_Jespersen/IEDB_Jespersen.csv'
✅ train split saved to '/Users/taiwoadelakin/Documents/Doc/Projects/outreachy-contributions/data/IEDB_Jespersen/splits/train.csv'
✅ valid split saved to '/Users/taiwoadelakin/Documents/Doc/Projects/outreachy-contributions/data/IEDB_Jespersen/splits/valid.csv'
✅ test split saved to '/Users/taiwoadelakin/Documents/Doc/Projects/outreachy-contributions/data/IEDB_Jespersen/splits/test.csv'


________

## 📊 Exploratory Data Analysis

In this section, we analyze the IEDB Jespersen epitope dataset to better understand its structure, balance, and sequence content. Exploratory Data Analysis (EDA) helps us uncover patterns, detect anomalies, and gain insights that will guide our preprocessing and modeling steps.

### 1. Load Datasets ###

In [4]:
######################################## Initialize the dataset loader ######################################## 
loader = DatasetLoader(dataset_name=dataset)

In [5]:
######################################## Load datasets ######################################## 
main_df, train_df, valid_df, test_df = loader.load_all()

🧬 IEDB_Jespersen_main  ➡️ (3159, 3)
⚗️ IEDB_Jespersen_train ➡️ (2211, 3)
🔬 IEDB_Jespersen_valid ➡️ (316, 3)
🧪 IEDB_Jespersen_test  ➡️ (632, 3)
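
As a quick sanity check on the shapes printed above, the row counts can be turned into split fractions. The sketch below uses only the numbers from that output.

```python
# Row counts copied from the loader output above.
sizes = {"train": 2211, "valid": 316, "test": 632}
total = sum(sizes.values())

# The three splits should add up to the main dataset (3159 rows).
assert total == 3159

fractions = {name: round(n / total, 3) for name, n in sizes.items()}
print(fractions)  # {'train': 0.7, 'valid': 0.1, 'test': 0.2}
```

This corresponds to a roughly 70/10/20 train/validation/test split.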


______

### 2. 📦 Dataset Overview ###

Before diving into the analysis, it’s essential to understand the structure of the datasets we’re working with — including the main dataset and the train/validation/test splits.

The function `EDAVisualizer.show_dataset_info()` provides a concise summary of:
- Shape (rows × columns)
- Column names
- Missing values
- Sample preview (via `.head()`)

You can run this with:
```python
EDAVisualizer.show_dataset_info(loader)
```
Alternatively, to show specific splits only, e.g. just the train and test data:
```python
selected_datasets = ['train', 'test']
EDAVisualizer.show_dataset_info(loader, dataset_names=selected_datasets)
```


In [6]:
######################################## Show Dataset Info  ########################################

EDAVisualizer.show_dataset_info(loader)

🧬 IEDB_Jespersen_main Info
----------------------------------------
Shape: (3159, 3)
Columns: ['Antigen_ID', 'Antigen', 'Y']

Missing values:
Antigen_ID    0
Antigen       0
Y             0
dtype: int64

Preview:


Unnamed: 0,Antigen_ID,Antigen,Y
0,Protein 1,MASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFF...,"[109, 110, 111, 112, 113, 114, 115, 116, 117, ..."
1,Protein 2,MSDLTDIQEDITRHEQQLIVARQKLKDAERAVEVDPDDVNKNTLQA...,"[312, 313, 314, 315, 316, 317, 318, 319, 320, ..."
2,Protein 3,MAEGFAANRQWIGPEEAEELLDFDIAIQMNEEGPLNPGVNPFRVPG...,"[585, 586, 587, 588, 589, 590, 591, 592, 593, ..."
3,Protein 4,MSKKPGGPGKSRAVNMLKRGMPRVLSLTGLKRAMLSLIDGRGPTRF...,"[811, 812, 813, 814, 815, 816, 817, 818, 819, ..."
4,Protein 5,MKLLILTCLVAVALARPKHPIKHQGLPQEVLNENLLRFFVAPFPEV...,"[55, 56, 57, 58, 59, 60, 61, 62, 63, 64]"




⚗️ IEDB_Jespersen_train Info
----------------------------------------
Shape: (2211, 3)
Columns: ['Antigen_ID', 'Antigen', 'Y']

Missing values:
Antigen_ID    0
Antigen       0
Y             0
dtype: int64

Preview:


Unnamed: 0,Antigen_ID,Antigen,Y
0,Protein 2,MSDLTDIQEDITRHEQQLIVARQKLKDAERAVEVDPDDVNKNTLQA...,"[312, 313, 314, 315, 316, 317, 318, 319, 320, ..."
1,Protein 3,MAEGFAANRQWIGPEEAEELLDFDIAIQMNEEGPLNPGVNPFRVPG...,"[585, 586, 587, 588, 589, 590, 591, 592, 593, ..."
2,Protein 4,MSKKPGGPGKSRAVNMLKRGMPRVLSLTGLKRAMLSLIDGRGPTRF...,"[811, 812, 813, 814, 815, 816, 817, 818, 819, ..."
3,Protein 5,MKLLILTCLVAVALARPKHPIKHQGLPQEVLNENLLRFFVAPFPEV...,"[55, 56, 57, 58, 59, 60, 61, 62, 63, 64]"
4,Protein 7,MNTLLLTLVVVTIVCLDFGYTTKCLTKFSPGLQTSQTCPAGQKICF...,"[27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 3..."




🔬 IEDB_Jespersen_valid Info
----------------------------------------
Shape: (316, 3)
Columns: ['Antigen_ID', 'Antigen', 'Y']

Missing values:
Antigen_ID    0
Antigen       0
Y             0
dtype: int64

Preview:


Unnamed: 0,Antigen_ID,Antigen,Y
0,Protein 491,GSCVTTMAKNKPTLDFELIKTEAKQPATLRKYCIEAKLTNTTTESR...,"[198, 199, 200, 201, 202, 203, 204, 205]"
1,Protein 1277,QSVLTQPPSVSGAPGQRVTISCTGSRSNIGAGYHVHWYQQLPGTAP...,"[132, 133, 134, 135, 136, 137, 138, 139, 140, ..."
2,Protein 1039,MRTLWIMAVLLVGVEGDLWQFGQMILKETGKLPFPYYTTYGCYCGW...,"[36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 4..."
3,Protein 893,MATLEKLMKAFESLKSFQQQQQQQQQQQQQQQQQQQQQQQPPPPPP...,"[442, 443, 444, 445, 446, 447, 448, 449, 450, ..."
4,Protein 110,MACATLKRTHDWDPLHSPNGRSPKRRRCMPLSVTQAATPPTRAHQI...,"[56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 6..."




🧪 IEDB_Jespersen_test Info
----------------------------------------
Shape: (632, 3)
Columns: ['Antigen_ID', 'Antigen', 'Y']

Missing values:
Antigen_ID    0
Antigen       0
Y             0
dtype: int64

Preview:


Unnamed: 0,Antigen_ID,Antigen,Y
0,Protein 2464,MNMSRQGIFQTVGSGLDHILSLADIEEEQMIQSVDRTAVTGASYFT...,"[821, 822, 823, 824, 825, 826, 827, 828, 829, ..."
1,Protein 2509,MEVRVPNFHSFVEGITSSYLKTPACWNAQTAWDTVTFHVPDVIRVG...,"[31, 32, 33, 34, 35, 36, 37]"
2,Protein 2987,MAATGTAAAAATGRLLLLLLVGLTAPALALAGYIEALAANAGTGFA...,"[661, 662, 663, 664, 665, 666, 667, 668, 669, ..."
3,Protein 1407,YASVAVAVAVLGAGFANQTTVKAESSNNAESSNISQESKLINTLTD...,"[23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 3..."
4,Protein 860,MRALAVLSVTLVMACTEAFFPFISRGKELLWGKPEESRVSSVLEES...,"[361, 362, 363, 364, 365, 366, 367, 368, 369, ..."
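
For readers without the `scripts` helpers, a summary like the one printed above can be approximated with plain pandas. This is a sketch, not the implementation of `show_dataset_info`; the function name `summarize` and the toy rows are made up for illustration.

```python
import pandas as pd


def summarize(df, name):
    """Print shape, columns, missing-value counts and a short preview."""
    print(f"{name} Info")
    print("-" * 40)
    print(f"Shape: {df.shape}")
    print(f"Columns: {list(df.columns)}")
    print("\nMissing values:")
    print(df.isna().sum())
    print("\nPreview:")
    print(df.head())


# Toy frame with the same column names as the dataset (values are invented).
toy_df = pd.DataFrame({
    "Antigen_ID": ["Protein A", "Protein B"],
    "Antigen": ["MKLLILTCLV", "MSDLTDIQED"],
    "Y": [[2, 3, 4], [0, 1]],
})
summarize(toy_df, "toy")
```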






-----

### 3. 💎 Unique Antigen Sequence Analysis ###

Knowing how many unique antigen sequences exist in each dataset helps:
- Measure diversity
- Avoid duplication bias
- Spot duplicated sequences within or across splits

Run:
```python
EDAVisualizer.compare_unique_smiles(
    dfs=[train_df, valid_df, test_df],
    df_names=['Train', 'Valid', 'Test'],
    smiles_col='Antigen'  # sequence column in this dataset (it has no 'Drug' column)
)
```

In [7]:
######################################## Check Unique Antigen Count in each dataset ########################################
# The sequence column is 'Antigen'; single quotes inside the f-strings keep them valid on Python < 3.12.
print(f"Number of unique antigen sequences in {dataset} train data: =====>> {train_df['Antigen'].nunique()}\n")
print(f"Number of unique antigen sequences in {dataset} validation data: =====>> {valid_df['Antigen'].nunique()}\n")
print(f"Number of unique antigen sequences in {dataset} test data: =====>> {test_df['Antigen'].nunique()}\n")
print(f"Total number of unique antigen sequences in {dataset} data: =====>> {main_df['Antigen'].nunique()}\n")

In [None]:
######################################## Compare Unique Antigen Count Distribution ########################################
EDAVisualizer.compare_unique_smiles(
    dfs=[train_df, valid_df, test_df],
    df_names=['Train', 'Valid', 'Test'],
    smiles_col='Antigen'  # sequence column in this dataset
)
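
Beyond counting unique sequences per split, it can be useful to check whether the same antigen sequence appears in more than one split. The sketch below does this with plain Python sets; `split_overlap` is a hypothetical helper, and the toy lists stand in for columns such as `train_df["Antigen"]`.

```python
from itertools import combinations


def split_overlap(splits):
    """Map each pair of split names to the number of shared sequences."""
    unique = {name: set(seqs) for name, seqs in splits.items()}
    return {
        (a, b): len(unique[a] & unique[b])
        for a, b in combinations(unique, 2)
    }


# Toy data; in the notebook these would be e.g. train_df["Antigen"].
toy_splits = {
    "train": ["MKLL", "MSDL", "MAEG"],
    "valid": ["MSKK", "MKLL"],
    "test": ["MNTL"],
}
overlap = split_overlap(toy_splits)
print(overlap)
# {('train', 'valid'): 1, ('train', 'test'): 0, ('valid', 'test'): 0}
```

A non-zero count for any pair means at least one sequence is shared between those two splits.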

_____

### 4. 🧮  Target Class Distribution in dataset ###

Class imbalance is a common challenge in classification tasks. Plotting the distribution of the target labels helps us understand:
- Whether the dataset is balanced
- The dominant class (if any)
- The need for resampling or weighted loss functions

We use:
```python
EDAVisualizer.plot_label_distribution(df)
```

In [None]:
########################################  Plot class distribution on Main Dataset ######################################## 
EDAVisualizer.plot_label_distribution(main_df, target_col='Y', title='Main - Epitope Label Distribution')

In [None]:
########################################  Plot class distribution Comparison between Split Dataset ######################################## 
EDAVisualizer.compare_label_distributions(
    dfs=[train_df, valid_df, test_df], 
    df_names=['Train', 'Valid', 'Test'], 
    target_col='Y'
)
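
Because `Y` holds a list of epitope positions rather than a single label per row, class balance for this task is most naturally measured per residue. The sketch below is one way to compute it, assuming the lists may arrive as strings (as they would after a CSV round trip) and that every listed position is an epitope residue. `residue_class_balance` is a hypothetical helper and the data is invented.

```python
import ast


def residue_class_balance(sequences, position_lists):
    """Return (epitope_residues, total_residues, epitope_fraction)."""
    epitope = 0
    total = 0
    for seq, positions in zip(sequences, position_lists):
        # Lists read back from CSV are strings such as "[2, 3, 4]".
        if isinstance(positions, str):
            positions = ast.literal_eval(positions)
        epitope += len(set(positions))
        total += len(seq)
    return epitope, total, epitope / total


# Toy data; in the notebook these would be main_df["Antigen"] and main_df["Y"].
seqs = ["MKLLILTCLV", "MSDLTDIQEDITRHEQQLIV"]
ys = ["[2, 3, 4]", [0, 1]]
epitope, total, fraction = residue_class_balance(seqs, ys)
print(f"{epitope} of {total} residues are epitope ({fraction:.3f})")
# 5 of 30 residues are epitope (0.167)
```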

______

### 5. 📏 Sequence Length Analysis ###

Antigen sequences vary widely in length. Analyzing their length:
- Highlights outliers or unusually long/short sequences
- Informs sequence-based model designs (like RNNs, Transformers), for example when choosing a maximum input length

To analyze the distribution:
```python
EDAVisualizer.check_smiles_length(loader=loader)
```

In [None]:
######################################## Check the Sequence Length Distribution of the Full Dataset ########################################
EDAVisualizer.check_smiles_length(loader=loader)

In [None]:
######################################## Alternatively, Check the Sequence Length Distribution of Selected Splits ########################################
EDAVisualizer.check_smiles_length(dfs=[test_df], names=["Test"])


In [None]:
######################################## Compare the Sequence Length Distribution between Splits ########################################

EDAVisualizer.compare_smiles_length(loader=loader)

In [None]:
######################################## Alternatively, we can compare selected splits dataset by executing below ######################################## 

selected_dfs = [train_df, valid_df]
dataset_names=['Train', 'Validation']


EDAVisualizer.compare_smiles_length(selected_dfs, dataset_names)
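
As a numeric complement to the length plots, the sketch below summarizes sequence lengths with the standard library only. `length_summary` is a hypothetical helper, and the toy list stands in for a column such as `train_df["Antigen"]`.

```python
from statistics import mean, median


def length_summary(sequences):
    """Return basic length statistics for an iterable of sequences."""
    lengths = [len(s) for s in sequences]
    return {
        "n": len(lengths),
        "min": min(lengths),
        "median": median(lengths),
        "mean": mean(lengths),
        "max": max(lengths),
    }


# Toy sequences of length 4, 6 and 10.
toy_sequences = ["MKLL", "MSDLTD", "MAEGFAANRQ"]
summary = length_summary(toy_sequences)
print(summary)
```

A large gap between the median and the maximum points to a few very long antigens, which matters when fixing a maximum input length for a model.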

_____________

### 5. ✔️ RDKit Molecular Validity Check ###

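Not all SMILES strings are guaranteed to represent valid molecules. Some may contain syntax errors or rare patterns RDKit cannot parse. The RDKit-based steps from here on (validity, descriptors, SMARTS matching, molecule drawing) operate on SMILES strings; the IEDB Jespersen frames loaded above contain only `Antigen_ID`, `Antigen`, and `Y`, so these steps need a SMILES column to run.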

This function evaluates validity by attempting to convert each SMILES to an RDKit Mol object:
```python
EDAVisualizer.check_molecular_validity(loader=loader)
```

You can also pass a different column name if your SMILES column isn't named `Drug`, e.g.:
```python
EDAVisualizer.check_molecular_validity(train_df, smiles_col='SMILES')
```

In [None]:
######################################## Check Drug Molecular Validity ######################################## 

EDAVisualizer.check_molecular_validity(loader=loader)
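
For protein sequences, a rough analogue of the validity check is to look for characters outside the 20 standard amino acid letters. This sketch is not part of `EDAVisualizer`; `nonstandard_residues` is a hypothetical helper, and treating anything outside the 20 standard one-letter codes as non-standard is an assumption that may be too strict for some data.

```python
# One-letter codes of the 20 standard amino acids.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def nonstandard_residues(sequence):
    """Return the sorted characters that are not standard amino acid codes."""
    return sorted(set(sequence.upper()) - STANDARD_AMINO_ACIDS)


print(nonstandard_residues("MKLLILTCLV"))  # []
print(nonstandard_residues("MKXLLB"))      # ['B', 'X']
```

Applied to a column such as `main_df["Antigen"]`, an empty list for every row would indicate that all sequences use only the standard alphabet.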

____________

### 6. 🧬 Molecular Descriptor Engineering ###
Use RDKit to calculate standard drug-likeness properties for molecules — a key step in both Exploratory Data Analysis (EDA) and featurization.

This function computes key cheminformatics descriptors from SMILES using RDKit:
- Molecular weight (MW)
- LogP (lipophilicity)
- Topological Polar Surface Area (TPSA)
- Hydrogen Bond Donors/Acceptors (HBD/HBA)
- Rotatable bonds
- Ring counts (total and aromatic)

These descriptors provide chemical insights into your dataset during EDA (e.g., distribution of molecular weights) and also serve as informative features for downstream machine learning models.

You can add descriptors to a dataset via:
```python
EDAVisualizer.add_molecular_descriptors(loader=loader)

```
You can also:

🔹 Apply to Specific DataFrames:

```python
EDAVisualizer.add_molecular_descriptors(dfs=[train_df, test_df], names=["Train", "Test"])
```

🔹 Apply to a single DataFrame:

```python
EDAVisualizer.add_molecular_descriptors(dfs=valid_df, names=["Validation"])
```

🔹 Return new DataFrames:

```python
updated = EDAVisualizer.add_molecular_descriptors(dfs=[train_df], inplace=False)
```

In [None]:
######################################## Add Molecular Descriptors ######################################## 
EDAVisualizer.add_molecular_descriptors(loader=loader)
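
The descriptors above are computed from SMILES. For antigen sequences, one simple sequence-level feature that plays a similar role is amino acid composition. The sketch below is an illustration only, not part of `EDAVisualizer`; `amino_acid_composition` is a hypothetical helper.

```python
from collections import Counter


def amino_acid_composition(sequence):
    """Return the fraction of each residue type present in the sequence."""
    counts = Counter(sequence)
    n = len(sequence)
    return {aa: counts[aa] / n for aa in sorted(counts)}


# Toy sequence: 4 of its 10 residues are leucine (L).
composition = amino_acid_composition("MKLLILTCLV")
print(composition["L"])  # 0.4
```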


_____

### 7. ✨ SMARTS Pattern Matching ###

SMARTS patterns represent functional groups (e.g., nitro groups, amines, halogens). This module detects presence of such substructures and summarizes their occurrence by class.

**Steps**:
1. Use `SMARTSPatternAnalyzer().analyze(df)` to add SMARTS flags.
2. Use `.summarize_patterns(df)` to compare frequency by label.

Example:
```python
smarts_analyzer = SMARTSPatternAnalyzer()
train_df = smarts_analyzer.analyze(train_df)
smarts_analyzer.summarize_patterns(train_df)
```

In [None]:
######################################## Initialize SMARTS Pattern Analyzer ######################################## 
analyzer = SMARTSPatternAnalyzer()

In [None]:
######################################## Detect SMARTS Substructures in Main Dataset ######################################## 

data_with_flags = analyzer.analyze(main_df)

In [None]:
########################################  Analyze SMARTS Substructures ######################################## 

analyzer.summarize_patterns(data_with_flags)

__________

### 8. 📈 Visualize correlation between features  ###

### Feature Correlation Heatmaps

Correlation matrices help identify:
- Redundant features
- Feature interactions
- Potential for multicollinearity

We visualize correlations for numeric descriptors:
```python
EDAVisualizer.compare_correlation_heatmaps(
    [main_df],
    ["Data"],
    cols=['MW', 'LogP', 'TPSA', 'HBD', 'HBA', 'RotBonds', 'RingCount', 'AromaticRings', 'Y']
)
```

You can also visualize multiple datasets with:
```python
EDAVisualizer.compare_correlation_heatmaps(
    dfs=[train_df, valid_df, test_df],
    df_names=["Train", "Validation", "Test"],
    cols=['MW', 'LogP', 'TPSA', 'HBD', 'HBA', 'RotBonds', 'RingCount', 'AromaticRings']
)
```

In [None]:
######################################## Show Correlation Heatmaps for Molecular Descriptors ######################################## 

EDAVisualizer.compare_correlation_heatmaps(
    [main_df],
    ["Data"],
    cols=['MW', 'LogP', 'TPSA', 'HBD', 'HBA', 'RotBonds', 'RingCount', 'AromaticRings', 'Y']
)


In [None]:
######################################## Check Correlation Heatmaps across Splits ######################################## 

EDAVisualizer.compare_correlation_heatmaps(
    [train_df, valid_df, test_df],
    ["Train", "Validation", "Test"],
    cols=['MW', 'LogP', 'TPSA', 'HBD', 'HBA', 'RotBonds', 'RingCount', 'AromaticRings', 'Y']
)
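
A heatmap of this kind is drawn from a correlation matrix. As a minimal illustration of that underlying step, pandas `DataFrame.corr()` computes pairwise Pearson correlations by default. The column names and values in the toy frame are made up and do not come from the dataset.

```python
import pandas as pd

# Toy numeric features: n_epitope rises with length, other falls with it.
toy_features = pd.DataFrame({
    "length": [10, 20, 30, 40],
    "n_epitope": [2, 4, 6, 8],
    "other": [4, 3, 2, 1],
})

# Pairwise Pearson correlation between all numeric columns.
corr = toy_features.corr()
print(corr.round(2))
```

Here `length` and `n_epitope` are perfectly correlated, which is the pattern that would flag one of them as redundant.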

### Boxplot of Numeric Features Across Splits

Boxplots give another visual cue on median, spread, and outliers per descriptor.

Use:
```python
EDAVisualizer.compare_boxplots(
    dfs=[train_df, valid_df, test_df],
    df_names=['Train', 'Validation', 'Test'],
    col='MW'
)
```

In [None]:
######################################## Compare MW Boxplots ######################################## 

EDAVisualizer.compare_boxplots(
    [train_df, valid_df, test_df],
    ["Train", "Validation", "Test"],
    col='MW'
)
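
The box in a boxplot spans the first and third quartiles, with the median marked inside. The sketch below computes those three numbers per split with pandas, using made-up sequence lengths as a stand-in for a numeric column such as `MW`.

```python
import pandas as pd

# Toy sequence lengths per split (invented numbers).
lengths = {
    "Train": pd.Series([10, 20, 30, 40, 50]),
    "Test": pd.Series([15, 25, 35]),
}

# Rows are the quartiles (0.25, 0.5, 0.75); columns are the splits.
quartiles = pd.DataFrame(
    {name: values.quantile([0.25, 0.5, 0.75]) for name, values in lengths.items()}
)
print(quartiles)
```

Similar quartiles across splits suggest the splits cover a comparable range of values.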

### Compare Feature Distributions (Histogram)

Distribution shifts between train/test datasets can impact model generalization. Here, we compare numeric descriptor histograms.

Example:
```python
EDAVisualizer.compare_numeric_distribution(
    dfs=[train_df, test_df],
    df_names=['Train', 'Test'],
    col='MW'  # or LogP, TPSA, etc.
)
```

In [None]:
######################################## Compare LogP Distributions across Splits ########################################

EDAVisualizer.compare_numeric_distribution(
    [train_df, valid_df, test_df],
    ["Train", "Validation", "Test"],
    col='LogP'
)

_____________

### 9. 🔍 Visual Inspection of Molecules ###

Before building models, it’s helpful to get a “chemical feel” for the data by visualizing molecules per class.

This function:
- Randomly samples `n` molecules per class
- Converts them to RDKit molecules
- Displays them with grid labels

Usage:
```python
EDAVisualizer.draw_samples_by_class(main_df, n=4)
```


In [None]:
######################################## Visualize "n" samples of molecules per class ########################################
EDAVisualizer.draw_samples_by_class(main_df, n=5)

In [None]:
######################################## Draw Sample Drug SMILES ########################################

EDAVisualizer.draw_molecules([
    "CC(=O)OC1=CC=CC=C1C(=O)O",  # Aspirin
    "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O"  # Ibuprofen
])

________