<h1 style='text-align: center;'> B-cell Epitope Immune Response | Rafael Almazan </h1>
<h3 style='text-align: center;'> Data Cleaning and Preparation </h3>

## Data Introduction:
This dataset is taken from [Kaggle](https://www.kaggle.com/datasets/futurecorporation/epitope-prediction?sort=votes) and records whether or not an amino acid peptide produces an immune response (antibody activity). The protein and peptide features for each peptide sequence are taken from the IEDB website for epitope predictions as well as the biopython package. An explanation of each feature is included in the EDA notebook.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

from Bio import SeqUtils
from Bio.SeqUtils.ProtParam import ProteinAnalysis

pd.set_option('display.max_columns', None)
sns.set_style('darkgrid')
%matplotlib inline

To start, we will look at our B-cell dataset.

In [3]:
bcell = pd.read_csv('input_bcell.csv')
bcell.head()

Unnamed: 0,parent_protein_id,protein_seq,start_position,end_position,peptide_seq,chou_fasman,emini,kolaskar_tongaonkar,parker,isoelectric_point,aromaticity,hydrophobicity,stability,target
0,A2T3T0,MDVLYSLSKTLKDARDKIVEGTLYSNVSDLIQQFNQMIITMNGNEF...,161,165,SASFT,1.016,0.703,1.018,2.22,5.810364,0.103275,-0.143829,40.2733,1
1,F0V2I4,MTIHKVAINGFGRIGRLLFRNLLSSQGVQVVAVNDVVDIKVLTHLL...,251,255,LCLKI,0.77,0.179,1.199,-3.86,6.210876,0.065476,-0.036905,24.998512,1
2,O75508,MVATCLQVVGFVTSFVGWIGVIVTTSTNDWVVTCGYTIPTCRKLDE...,145,149,AHRET,0.852,3.427,0.96,4.28,8.223938,0.091787,0.879227,27.863333,1
3,O84462,MTNSISGYQPTVTTSTSSTTSASGASGSLGASSVSTTANATVTQTA...,152,156,SNYDD,1.41,2.548,0.936,6.32,4.237976,0.044776,-0.521393,30.765373,1
4,P00918,MSHHWGYGKHNGPEHWHKDFPIAKGERQSPVDIDTHTAKYDPSLKP...,85,89,DGTYR,1.214,1.908,0.937,4.64,6.867493,0.103846,-0.578846,21.684615,1


In [4]:
# get info
bcell.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14387 entries, 0 to 14386
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   parent_protein_id    14387 non-null  object 
 1   protein_seq          14387 non-null  object 
 2   start_position       14387 non-null  int64  
 3   end_position         14387 non-null  int64  
 4   peptide_seq          14387 non-null  object 
 5   chou_fasman          14387 non-null  float64
 6   emini                14387 non-null  float64
 7   kolaskar_tongaonkar  14387 non-null  float64
 8   parker               14387 non-null  float64
 9   isoelectric_point    14387 non-null  float64
 10  aromaticity          14387 non-null  float64
 11  hydrophobicity       14387 non-null  float64
 12  stability            14387 non-null  float64
 13  target               14387 non-null  int64  
dtypes: float64(8), int64(3), object(3)
memory usage: 1.5+ MB


In [5]:
# get brief description
bcell.describe()

Unnamed: 0,start_position,end_position,chou_fasman,emini,kolaskar_tongaonkar,parker,isoelectric_point,aromaticity,hydrophobicity,stability,target
count,14387.0,14387.0,14387.0,14387.0,14387.0,14387.0,14387.0,14387.0,14387.0,14387.0,14387.0
mean,297.675818,308.085077,0.994706,1.059788,1.021188,1.767137,7.067472,0.075727,-0.406097,43.703902,0.271217
std,353.74145,353.733297,0.124772,1.621931,0.053804,1.968985,1.888708,0.025767,0.394618,16.682362,0.444603
min,1.0,6.0,0.534,0.0,0.838,-9.029,3.686096,0.0,-1.971171,5.448936,0.0
25%,84.0,95.0,0.911,0.248,0.986,0.6,5.621033,0.060606,-0.606215,31.614529,0.0
50%,191.0,200.0,0.99,0.556,1.02,1.793,6.499573,0.074534,-0.33054,42.287268,0.0
75%,382.0,393.0,1.074,1.209,1.055,3.0095,8.676575,0.091312,-0.189591,49.101172,1.0
max,3079.0,3086.0,1.546,27.189,1.255,9.12,12.232727,0.182254,1.267089,137.046667,1.0


In [6]:
# checking the proportions of the target column
bcell.target.value_counts()

0    10485
1     3902
Name: target, dtype: int64
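The counts above are easier to compare as proportions. The sketch below rebuilds a stand-in target column from the two counts shown (it does not reload the CSV) and uses `value_counts(normalize=True)` to express the class balance as fractions.

```python
import pandas as pd

# Stand-in target column rebuilt from the counts printed above
# (10485 negatives, 3902 positives); the real notebook would use bcell.target.
target = pd.Series([0] * 10485 + [1] * 3902, name="target")

# normalize=True returns fractions of the total instead of raw counts.
proportions = target.value_counts(normalize=True)
print(proportions.round(3))
```

Roughly 27% of the peptides are labeled positive, so the classes are imbalanced. This matters later when choosing evaluation metrics.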

In [7]:
# check for shape
bcell.shape

(14387, 14)

In [8]:
# check for null values
bcell.isna().sum()

parent_protein_id      0
protein_seq            0
start_position         0
end_position           0
peptide_seq            0
chou_fasman            0
emini                  0
kolaskar_tongaonkar    0
parker                 0
isoelectric_point      0
aromaticity            0
hydrophobicity         0
stability              0
target                 0
dtype: int64

In [9]:
# checking unique values
bcell.nunique()

parent_protein_id        760
protein_seq              757
start_position          1443
end_position            1452
peptide_seq            14362
chou_fasman              768
emini                   3342
kolaskar_tongaonkar      350
parker                  3614
isoelectric_point        744
aromaticity              687
hydrophobicity           757
stability                757
target                     2
dtype: int64

We have loaded our dataframe and see that there are no null values. Its initial shape is (14387, 14), which means there are 14387 rows and 14 columns. Three columns have the object data type and the rest are numerical floats or integers. The target column is binary, stating whether or not the peptide induces an immune response in B-cells.

We will now check for duplicated rows.

In [10]:
# check for duplicated rows
bcell[bcell.duplicated()]

Unnamed: 0,parent_protein_id,protein_seq,start_position,end_position,peptide_seq,chou_fasman,emini,kolaskar_tongaonkar,parker,isoelectric_point,aromaticity,hydrophobicity,stability,target
769,P06821,MSLLTEVETPIRNEWGCRCNGSSDPLAIAANIIGILHLILWILDRL...,6,13,EVETPIRN,0.93,1.055,0.982,2.8,5.681824,0.092784,-0.207216,42.820619,1
1977,P18012,MEIQNTKPTQTLYTDISTKQTQSSSETQKSQNYQQIAAHIPLNVGK...,162,169,VTQVGITG,0.936,0.139,1.062,1.55,7.830627,0.016529,-0.440496,45.818182,0
7307,P9WNK7,MTEQQWNFAGIEAAASAIQGNVTSIHSLLDEGKQSLTKLAAAWGGS...,85,95,STEGNVTGMFA,1.012,0.462,0.961,2.555,4.478577,0.063158,-0.255789,36.353684,1
7980,P87020,MNYLLFCLFFAFSVAAPVTVTRFVDASPTGYDWRADWVKGFPIDSS...,221,232,YYALDVYAYDVT,0.946,0.621,1.119,0.433,5.004089,0.12709,-0.398328,31.725753,1
13324,P17763,MNNQRKKTGRPSFNMLKRARNRVSTVSQLAKRFSKGLLSGQGPMKL...,586,594,FKLEKEVAE,0.732,1.217,1.022,1.644,8.702576,0.081368,-0.203184,33.844487,1


We will drop these duplicated rows: exact duplicates add no new information, and only 5 rows are affected.
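`DataFrame.duplicated()` only flags the second and later copies of a row by default. The unique counts earlier show 14362 distinct peptides in 14387 rows, so some peptides also recur in rows that are not exact duplicates. The sketch below uses a small made-up frame (its values are illustrative, not taken from the dataset) to show how the `keep` and `subset` parameters change what is flagged.

```python
import pandas as pd

# Toy frame: rows 0 and 1 are exact duplicates; rows 2 and 3 share a peptide
# but differ in position and label.
toy = pd.DataFrame({
    "peptide_seq": ["SASFT", "SASFT", "LCLKI", "LCLKI"],
    "start_position": [161, 161, 251, 300],
    "target": [1, 1, 1, 0],
})

# Default keep="first": only the repeated copy (row 1) is flagged.
later_copies = toy[toy.duplicated()]

# keep=False flags every member of a duplicated group, which helps inspection.
all_copies = toy[toy.duplicated(keep=False)]

# subset= compares only the chosen columns, here the peptide sequence alone.
same_peptide = toy[toy.duplicated(subset="peptide_seq", keep=False)]

# drop_duplicates() removes only the exact duplicates.
deduped = toy.drop_duplicates()
print(len(later_copies), len(all_copies), len(same_peptide), len(deduped))
```

Whether repeated peptides with different labels should also be removed is a modeling decision; this notebook drops only exact duplicates.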

In [11]:
# dropping duplicated rows
bcell = bcell.drop_duplicates()

In [12]:
# sanity check for duplicates
bcell[bcell.duplicated()]

Unnamed: 0,parent_protein_id,protein_seq,start_position,end_position,peptide_seq,chou_fasman,emini,kolaskar_tongaonkar,parker,isoelectric_point,aromaticity,hydrophobicity,stability,target


Now that we have confirmed that there are no duplicates, it's time to refine the data to prepare it for analysis.

To start, we will create a new column giving each peptide's length (the number of amino acids in its sequence) and remove the end_position column. end_position is highly correlated with start_position, so it adds little once the length is known. We keep start_position because it preserves the peptide's position within the parent protein sequence.

In [13]:
# creating a new column
peptide_length = (bcell['end_position'] - bcell['start_position'] + 1) # plus 1 because the end_position is inclusive
bcell.insert(5, 'peptide_length', peptide_length)

#dropping end position
bcell.drop('end_position', axis=1, inplace=True)

#sanity check
bcell.head()

Unnamed: 0,parent_protein_id,protein_seq,start_position,peptide_seq,peptide_length,chou_fasman,emini,kolaskar_tongaonkar,parker,isoelectric_point,aromaticity,hydrophobicity,stability,target
0,A2T3T0,MDVLYSLSKTLKDARDKIVEGTLYSNVSDLIQQFNQMIITMNGNEF...,161,SASFT,5,1.016,0.703,1.018,2.22,5.810364,0.103275,-0.143829,40.2733,1
1,F0V2I4,MTIHKVAINGFGRIGRLLFRNLLSSQGVQVVAVNDVVDIKVLTHLL...,251,LCLKI,5,0.77,0.179,1.199,-3.86,6.210876,0.065476,-0.036905,24.998512,1
2,O75508,MVATCLQVVGFVTSFVGWIGVIVTTSTNDWVVTCGYTIPTCRKLDE...,145,AHRET,5,0.852,3.427,0.96,4.28,8.223938,0.091787,0.879227,27.863333,1
3,O84462,MTNSISGYQPTVTTSTSSTTSASGASGSLGASSVSTTANATVTQTA...,152,SNYDD,5,1.41,2.548,0.936,6.32,4.237976,0.044776,-0.521393,30.765373,1
4,P00918,MSHHWGYGKHNGPEHWHKDFPIAKGERQSPVDIDTHTAKYDPSLKP...,85,DGTYR,5,1.214,1.908,0.937,4.64,6.867493,0.103846,-0.578846,21.684615,1
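As a cross-check on the inclusive `end - start + 1` formula, the computed length can be compared with the actual number of characters in `peptide_seq`. The sketch below uses two rows copied from the outputs above. In the notebook itself this check would have to run before `end_position` is dropped.

```python
import pandas as pd

# Two rows taken from the tables shown earlier in this notebook.
toy = pd.DataFrame({
    "start_position": [161, 6],
    "end_position": [165, 13],
    "peptide_seq": ["SASFT", "EVETPIRN"],
})

# Inclusive positions, so the length is end - start + 1.
computed = toy["end_position"] - toy["start_position"] + 1

# Length measured directly from the sequence string.
observed = toy["peptide_seq"].str.len()

# Any rows listed here would indicate inconsistent positions or sequences.
mismatches = toy[computed != observed]
print(len(mismatches))
```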


Next, we will use biopython to add some peptide characteristics:
- Isoelectric point
- Aromaticity
- Molecular weight
- Stability
- Hydrophobicity
- Charge at pH=7.4

We are looking at the charge at pH=7.4 since the pH of human blood ranges from 7.35 to 7.45.
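To illustrate why the chosen pH matters, here is a minimal sketch of the textbook Henderson-Hasselbalch relation for a peptide with no ionizable side chains. It is not Biopython's implementation, and the two pKa values are illustrative assumptions rather than values taken from this notebook.

```python
def fraction_protonated_base(ph, pka):
    """Fraction of a basic group (e.g. an N-terminus) carrying +1 at this pH."""
    return 1.0 / (1.0 + 10 ** (ph - pka))


def fraction_deprotonated_acid(ph, pka):
    """Fraction of an acidic group (e.g. a C-terminus) carrying -1 at this pH."""
    return 1.0 / (1.0 + 10 ** (pka - ph))


# Illustrative pKa values only (assumptions for this sketch).
N_TERM_PKA = 9.0
C_TERM_PKA = 2.0


def termini_net_charge(ph):
    """Net charge contributed by the two termini alone."""
    positive = fraction_protonated_base(ph, N_TERM_PKA)
    negative = fraction_deprotonated_acid(ph, C_TERM_PKA)
    return positive - negative


for ph in (7.0, 7.4):
    print(ph, round(termini_net_charge(ph), 4))
```

Even this small shift in pH changes the computed charge, so the pH passed to the charge calculation should match the pH named in the column label.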


While we already have these values for the parent proteins, the peptide-level values also matter because the peptides are what we are classifying. To keep the two apart, we will rename the parent columns with a `parent_` prefix.

We will also rename the target variable.

In [14]:
# creating dictionary with new feature names
parent_names = {'isoelectric_point': 'parent_isoelectric_point', 
                'aromaticity': 'parent_aromaticity', 
                'hydrophobicity': 'parent_hydrophobicity', 
                'stability': 'parent_instability_index'}

# renaming column
bcell = bcell.rename(columns=parent_names)

# renaming target variable
bcell = bcell.rename(columns={'target': 'antibody_activity'})

# sanity check
bcell.head()

Unnamed: 0,parent_protein_id,protein_seq,start_position,peptide_seq,peptide_length,chou_fasman,emini,kolaskar_tongaonkar,parker,parent_isoelectric_point,parent_aromaticity,parent_hydrophobicity,parent_instability_index,antibody_activity
0,A2T3T0,MDVLYSLSKTLKDARDKIVEGTLYSNVSDLIQQFNQMIITMNGNEF...,161,SASFT,5,1.016,0.703,1.018,2.22,5.810364,0.103275,-0.143829,40.2733,1
1,F0V2I4,MTIHKVAINGFGRIGRLLFRNLLSSQGVQVVAVNDVVDIKVLTHLL...,251,LCLKI,5,0.77,0.179,1.199,-3.86,6.210876,0.065476,-0.036905,24.998512,1
2,O75508,MVATCLQVVGFVTSFVGWIGVIVTTSTNDWVVTCGYTIPTCRKLDE...,145,AHRET,5,0.852,3.427,0.96,4.28,8.223938,0.091787,0.879227,27.863333,1
3,O84462,MTNSISGYQPTVTTSTSSTTSASGASGSLGASSVSTTANATVTQTA...,152,SNYDD,5,1.41,2.548,0.936,6.32,4.237976,0.044776,-0.521393,30.765373,1
4,P00918,MSHHWGYGKHNGPEHWHKDFPIAKGERQSPVDIDTHTAKYDPSLKP...,85,DGTYR,5,1.214,1.908,0.937,4.64,6.867493,0.103846,-0.578846,21.684615,1


In [15]:
# helper function that adds one column per requested peptide parameter
def add_protein_analysis_parameter(df, seq_column, analysis_params):
    '''
    inputs:
    df -> pandas.DataFrame 
    seq_column -> name of protein sequence column 
    analysis_params -> list of parameters to be added to dataframe
    
    output:
    new_df -> pandas.DataFrame with added protein parameter columns
    '''
    
    new_df = df.copy()

    # go over each parameter in analysis_params list
    for param in analysis_params:
        
        # Create empty list to store the analysis parameter values
        parameter_values = []
        
        if param == 'charge_at_pH':
            for sequence in df[seq_column]:
                protein = ProteinAnalysis(sequence)
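                # NOTE: the charge is evaluated at pH 7 here, although the column
                # below is labeled pH=7.4; pass 7.4 to match the label
                param_value = getattr(protein, param)(7)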
                parameter_values.append(param_value)    
        
        else:
            # get peptide parameter for every peptide in dataset
            for sequence in df[seq_column]:
                protein = ProteinAnalysis(sequence)
                param_value = getattr(protein, param)()
                parameter_values.append(param_value)

        # add new column to dataframe
        if param == 'gravy':
            new_df['hydrophobicity'] = parameter_values
        elif param == 'charge_at_pH':
            new_df['charge_at_pH=7.4'] = parameter_values
        else:
            new_df[param] = parameter_values

    return new_df
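The function above depends on `getattr` to turn a method name stored as a string into a callable. The sketch below isolates that dispatch pattern using `ToySequenceStats`, a hypothetical stand-in class (not part of Biopython), so it runs without any extra packages.

```python
class ToySequenceStats:
    """Hypothetical stand-in for an analysis object; not a Biopython class."""

    def __init__(self, sequence):
        self.sequence = sequence

    def length(self):
        return len(self.sequence)

    def serine_count(self):
        return self.sequence.count("S")

    def count_of(self, letter):
        return self.sequence.count(letter)


stats = ToySequenceStats("SASFT")
results = {}

# Zero-argument methods: look the method up by name, then call it.
for name in ["length", "serine_count"]:
    method = getattr(stats, name)
    results[name] = method()

# A method that needs an argument is a special case, just like charge_at_pH.
results["count_of_T"] = getattr(stats, "count_of")("T")
print(results)
```

A misspelled name raises `AttributeError` at the `getattr` call, which is a useful early failure when the parameter list is typed by hand.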

In [16]:
# create list of all protein analysis parameters
peptide_params = ['isoelectric_point', 'aromaticity', 'molecular_weight', 
                 'instability_index', 'gravy', 'charge_at_pH']


# Adding peptide parameter features
bcell = add_protein_analysis_parameter(bcell, 'peptide_seq', peptide_params)


In [17]:
# sanity check
bcell.head()

Unnamed: 0,parent_protein_id,protein_seq,start_position,peptide_seq,peptide_length,chou_fasman,emini,kolaskar_tongaonkar,parker,parent_isoelectric_point,parent_aromaticity,parent_hydrophobicity,parent_instability_index,antibody_activity,isoelectric_point,aromaticity,molecular_weight,instability_index,hydrophobicity,charge_at_pH=7.4
0,A2T3T0,MDVLYSLSKTLKDARDKIVEGTLYSNVSDLIQQFNQMIITMNGNEF...,161,SASFT,5,1.016,0.703,1.018,2.22,5.810364,0.103275,-0.143829,40.2733,1,5.240009,0.2,511.5255,8.0,0.46,-0.539854
1,F0V2I4,MTIHKVAINGFGRIGRLLFRNLLSSQGVQVVAVNDVVDIKVLTHLL...,251,LCLKI,5,0.77,0.179,1.199,-3.86,6.210876,0.065476,-0.036905,24.998512,1,8.222249,0.0,588.8033,12.56,2.14,0.749202
2,O75508,MVATCLQVVGFVTSFVGWIGVIVTTSTNDWVVTCGYTIPTCRKLDE...,145,AHRET,5,0.852,3.427,0.96,4.28,8.223938,0.091787,0.879227,27.863333,1,6.794385,0.0,612.6361,-8.98,-2.02,-0.114151
3,O84462,MTNSISGYQPTVTTSTSSTTSASGASGSLGASSVSTTANATVTQTA...,152,SNYDD,5,1.41,2.548,0.936,6.32,4.237976,0.044776,-0.521393,30.765373,1,4.050028,0.2,612.5432,55.36,-2.52,-2.53543
4,P00918,MSHHWGYGKHNGPEHWHKDFPIAKGERQSPVDIDTHTAKYDPSLKP...,85,DGTYR,5,1.214,1.908,0.937,4.64,6.867493,0.103846,-0.578846,21.684615,1,5.835682,0.2,610.6168,-42.8,-2.08,-0.239787


Now that we have our peptide features, we turn to amino acid composition. Each letter in the peptide sequence corresponds to one amino acid. We will create a column for each of the 20 amino acids, where the value is the number of times that amino acid occurs in the peptide.

| A | R | N | D | C | H | I | Q | E | G | L | K | M | F | P | S | T | W | Y | V |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |

In [18]:
def amino_acid_breakdown(df, seq_column):
    """
    inputs: 
    df -> pd.DataFrame
    seq_column -> name of column containing sequences
    
    output:
    new_df -> pd.DataFrame
    
    """
    new_df = df.copy()

    # Map each one-letter amino acid code to its three-letter abbreviation
    amino_acids = {'A': 'ala', 'R': 'arg', 'N': 'asn', 'D': 'asp', 'C': 'cys', 'Q': 'gln', 'E': 'glu', 
                   'G': 'gly', 'H': 'his', 'I': 'ile', 'L': 'leu', 'K': 'lys', 'M': 'met', 'F': 'phe', 
                   'P': 'pro', 'S': 'ser', 'T': 'thr', 'W': 'trp', 'Y': 'tyr', 'V': 'val'}
    # Create a column for each amino acid
    for amino_acid in amino_acids:
        column_name = f'amino_acid_{amino_acids[amino_acid]}'
        # this line taken from chatGPT:
        # counts all of the specific amino acids in the column and applies it onto the created amino acid column
        new_df[column_name] = df[seq_column].apply(lambda seq: seq.count(amino_acid))

    return new_df

bcell = amino_acid_breakdown(bcell, 'peptide_seq')
bcell.head()

Unnamed: 0,parent_protein_id,protein_seq,start_position,peptide_seq,peptide_length,chou_fasman,emini,kolaskar_tongaonkar,parker,parent_isoelectric_point,parent_aromaticity,parent_hydrophobicity,parent_instability_index,antibody_activity,isoelectric_point,aromaticity,molecular_weight,instability_index,hydrophobicity,charge_at_pH=7.4,amino_acid_ala,amino_acid_arg,amino_acid_asn,amino_acid_asp,amino_acid_cys,amino_acid_gln,amino_acid_glu,amino_acid_gly,amino_acid_his,amino_acid_ile,amino_acid_leu,amino_acid_lys,amino_acid_met,amino_acid_phe,amino_acid_pro,amino_acid_ser,amino_acid_thr,amino_acid_trp,amino_acid_tyr,amino_acid_val
0,A2T3T0,MDVLYSLSKTLKDARDKIVEGTLYSNVSDLIQQFNQMIITMNGNEF...,161,SASFT,5,1.016,0.703,1.018,2.22,5.810364,0.103275,-0.143829,40.2733,1,5.240009,0.2,511.5255,8.0,0.46,-0.539854,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,2,1,0,0,0
1,F0V2I4,MTIHKVAINGFGRIGRLLFRNLLSSQGVQVVAVNDVVDIKVLTHLL...,251,LCLKI,5,0.77,0.179,1.199,-3.86,6.210876,0.065476,-0.036905,24.998512,1,8.222249,0.0,588.8033,12.56,2.14,0.749202,0,0,0,0,1,0,0,0,0,1,2,1,0,0,0,0,0,0,0,0
2,O75508,MVATCLQVVGFVTSFVGWIGVIVTTSTNDWVVTCGYTIPTCRKLDE...,145,AHRET,5,0.852,3.427,0.96,4.28,8.223938,0.091787,0.879227,27.863333,1,6.794385,0.0,612.6361,-8.98,-2.02,-0.114151,1,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,0
3,O84462,MTNSISGYQPTVTTSTSSTTSASGASGSLGASSVSTTANATVTQTA...,152,SNYDD,5,1.41,2.548,0.936,6.32,4.237976,0.044776,-0.521393,30.765373,1,4.050028,0.2,612.5432,55.36,-2.52,-2.53543,0,0,1,2,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0
4,P00918,MSHHWGYGKHNGPEHWHKDFPIAKGERQSPVDIDTHTAKYDPSLKP...,85,DGTYR,5,1.214,1.908,0.937,4.64,6.867493,0.103846,-0.578846,21.684615,1,5.835682,0.2,610.6168,-42.8,-2.08,-0.239787,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0
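One consistency check on the count columns is that the 20 counts in a row should add up to the peptide's length. A shortfall would point to a letter outside the standard 20 in the sequence. The sketch below uses two peptides from the table above. The `count_` column names are chosen for the sketch and differ from the notebook's `amino_acid_` names.

```python
import pandas as pd

# Two peptides taken from the first rows of the dataset.
toy = pd.DataFrame({"peptide_seq": ["SASFT", "LCLKI"]})

# One-letter codes of the 20 standard amino acids.
letters = "ARNDCQEGHILKMFPSTWYV"

# One count column per amino acid, using the vectorized string method.
for letter in letters:
    toy[f"count_{letter}"] = toy["peptide_seq"].str.count(letter)

count_cols = [f"count_{letter}" for letter in letters]
total = toy[count_cols].sum(axis=1)

# Rows where the counts do not add up to the sequence length.
unexpected = toy[total != toy["peptide_seq"].str.len()]
print(len(unexpected))
```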


In [19]:
bcell.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14382 entries, 0 to 14386
Data columns (total 40 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   parent_protein_id         14382 non-null  object 
 1   protein_seq               14382 non-null  object 
 2   start_position            14382 non-null  int64  
 3   peptide_seq               14382 non-null  object 
 4   peptide_length            14382 non-null  int64  
 5   chou_fasman               14382 non-null  float64
 6   emini                     14382 non-null  float64
 7   kolaskar_tongaonkar       14382 non-null  float64
 8   parker                    14382 non-null  float64
 9   parent_isoelectric_point  14382 non-null  float64
 10  parent_aromaticity        14382 non-null  float64
 11  parent_hydrophobicity     14382 non-null  float64
 12  parent_instability_index  14382 non-null  float64
 13  antibody_activity         14382 non-null  int64  
 14  isoele

Here we have our complete dataset with no duplicates or null values. This is the data we will use for our EDA and for building our models.

In [20]:
# saving pandas dataframe to .csv

bcell.to_csv('bcell_cleaned.csv', index=False)
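Passing `index=False` matters here because the dataframe's index has gaps after the duplicates were dropped. The sketch below writes a toy frame to an in-memory buffer instead of a file and shows what happens on reload with and without the index.

```python
import io

import pandas as pd

# Toy frame with a gap in the index, as happens after drop_duplicates().
toy = pd.DataFrame(
    {"peptide_seq": ["SASFT", "LCLKI"], "antibody_activity": [1, 1]},
    index=[0, 2],
)

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded = pd.read_csv(without_index)

print(cols_with_index)
print(reloaded.columns.tolist())
```

On reload, the frame written without its index gets a fresh 0-based index and no stray `Unnamed: 0` column.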

Now that we have our cleaned B-cell dataset, we will go through the same process with our SARS and COVID datasets.
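Since the same steps are repeated for each dataset, they could be bundled into one function. The sketch below is an optional refactor, not what the notebook does in the cells that follow. The function name `prepare_epitope_frame` is hypothetical, and it covers only the deduplication, length, and renaming steps.

```python
import pandas as pd

# Same renaming scheme as the parent_names dictionary used for the B-cell data.
PARENT_NAMES = {
    "isoelectric_point": "parent_isoelectric_point",
    "aromaticity": "parent_aromaticity",
    "hydrophobicity": "parent_hydrophobicity",
    "stability": "parent_instability_index",
}


def prepare_epitope_frame(df):
    """Drop exact duplicates, add peptide_length, drop end_position, rename."""
    out = df.drop_duplicates().copy()
    # Positions are inclusive, hence the + 1.
    length = out["end_position"] - out["start_position"] + 1
    # Place the new column directly after peptide_seq.
    out.insert(out.columns.get_loc("peptide_seq") + 1, "peptide_length", length)
    out = out.drop(columns="end_position")
    return out.rename(columns={**PARENT_NAMES, "target": "antibody_activity"})


# Toy input using a few SARS values shown later in this notebook.
toy = pd.DataFrame({
    "start_position": [1, 1, 2],
    "end_position": [17, 17, 10],
    "peptide_seq": ["MFIFLLFLTLTSGSDLD", "MFIFLLFLTLTSGSDLD", "FIFLLFLTL"],
    "stability": [33.205116] * 3,
    "target": [0, 0, 0],
})
cleaned = prepare_epitope_frame(toy)
print(cleaned.columns.tolist())
```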

In [21]:
# reading in our sars dataset
sars = pd.read_csv('input_sars.csv')

In [22]:
# checking the first 5 rows
display(sars.head())

Unnamed: 0,parent_protein_id,protein_seq,start_position,end_position,peptide_seq,chou_fasman,emini,kolaskar_tongaonkar,parker,isoelectric_point,aromaticity,hydrophobicity,stability,target
0,AAU93319,MFIFLLFLTLTSGSDLDRCTTFDDVQAPNYTQHTSSMRGVYYPDEI...,1,17,MFIFLLFLTLTSGSDLD,0.887,0.04,1.056,-2.159,5.569763,0.116335,-0.061116,33.205116,0
1,AAU93319,MFIFLLFLTLTSGSDLDRCTTFDDVQAPNYTQHTSSMRGVYYPDEI...,1,15,MFIFLLFLTLTSGSD,0.869,0.047,1.056,-2.5,5.569763,0.116335,-0.061116,33.205116,0
2,AAU93319,MFIFLLFLTLTSGSDLDRCTTFDDVQAPNYTQHTSSMRGVYYPDEI...,2,10,FIFLLFLTL,0.621,0.042,1.148,-7.467,5.569763,0.116335,-0.061116,33.205116,0
3,AAU93319,MFIFLLFLTLTSGSDLDRCTTFDDVQAPNYTQHTSSMRGVYYPDEI...,6,20,LFLTLTSGSDLDRCT,1.021,0.23,1.049,0.927,5.569763,0.116335,-0.061116,33.205116,0
4,AAU93319,MFIFLLFLTLTSGSDLDRCTTFDDVQAPNYTQHTSSMRGVYYPDEI...,9,25,TLTSGSDLDRCTTFDDV,1.089,0.627,1.015,3.165,5.569763,0.116335,-0.061116,33.205116,0


In [23]:
# checking the info
sars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 520 entries, 0 to 519
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   parent_protein_id    520 non-null    object 
 1   protein_seq          520 non-null    object 
 2   start_position       520 non-null    int64  
 3   end_position         520 non-null    int64  
 4   peptide_seq          520 non-null    object 
 5   chou_fasman          520 non-null    float64
 6   emini                520 non-null    float64
 7   kolaskar_tongaonkar  520 non-null    float64
 8   parker               520 non-null    float64
 9   isoelectric_point    520 non-null    float64
 10  aromaticity          520 non-null    float64
 11  hydrophobicity       520 non-null    float64
 12  stability            520 non-null    float64
 13  target               520 non-null    int64  
dtypes: float64(8), int64(3), object(3)
memory usage: 57.0+ KB


In [24]:
# getting a description summary of our data.
sars.describe()

Unnamed: 0,start_position,end_position,chou_fasman,emini,kolaskar_tongaonkar,parker,isoelectric_point,aromaticity,hydrophobicity,stability,target
count,520.0,520.0,520.0,520.0,520.0,520.0,520.0,520.0,520.0,520.0,520.0
mean,617.871154,635.876923,1.000442,1.719804,1.03896,1.278696,5.569763,0.1163347,-0.06111554,33.20512,0.269231
std,349.582246,349.315328,0.08719,4.736354,0.037978,1.418791,0.0,2.7782300000000004e-17,6.945576e-18,1.422454e-14,0.443987
min,1.0,10.0,0.621,0.0,0.908,-7.467,5.569763,0.1163347,-0.06111554,33.20512,0.0
25%,359.0,373.75,0.949,0.17975,1.013,0.5345,5.569763,0.1163347,-0.06111554,33.20512,0.0
50%,571.5,592.5,1.009,0.4395,1.036,1.412,5.569763,0.1163347,-0.06111554,33.20512,0.0
75%,921.0,940.0,1.05525,1.18125,1.058,2.245,5.569763,0.1163347,-0.06111554,33.20512,1.0
max,1241.0,1255.0,1.317,40.605,1.228,4.907,5.569763,0.1163347,-0.06111554,33.20512,1.0
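In the summary above, the four parent-protein columns have a standard deviation of zero (or floating-point noise such as 2.78e-17) and identical minimum and maximum values. This is consistent with every SARS row sharing one parent protein, and such columns cannot help separate classes within this dataset. Below is a minimal sketch for flagging constant columns, using a toy frame built from a few of the values shown.

```python
import pandas as pd

# Toy frame: parker varies, the two parent-level columns do not.
toy = pd.DataFrame({
    "parker": [-2.159, -2.5, -7.467],
    "isoelectric_point": [5.569763] * 3,
    "stability": [33.205116] * 3,
})

# A column with exactly one distinct value is constant.
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) == 1]
print(constant_cols)
```

Counting distinct values is more reliable than testing `std() == 0`, because the standard deviation can come out as a tiny nonzero number.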


In [25]:
# checking for null values
sars.isna().sum()

parent_protein_id      0
protein_seq            0
start_position         0
end_position           0
peptide_seq            0
chou_fasman            0
emini                  0
kolaskar_tongaonkar    0
parker                 0
isoelectric_point      0
aromaticity            0
hydrophobicity         0
stability              0
target                 0
dtype: int64

In [26]:
# checking to see the target proportions
sars.target.value_counts()

0    380
1    140
Name: target, dtype: int64

In [27]:
# getting the shape
sars.shape

(520, 14)

Our SARS dataset is very small (only 520 rows and 14 columns). We see no null values in our data. We will now check for duplicated rows.

In [28]:
# checking for duplicated rows
sars[sars.duplicated()]

Unnamed: 0,parent_protein_id,protein_seq,start_position,end_position,peptide_seq,chou_fasman,emini,kolaskar_tongaonkar,parker,isoelectric_point,aromaticity,hydrophobicity,stability,target
130,AAU93319,MFIFLLFLTLTSGSDLDRCTTFDDVQAPNYTQHTSSMRGVYYPDEI...,359,494,TSFSTFKCYGVSATKLNDLCFSNVYADSFVVKGDDVRQIAPGQTGV...,1.089,9.922,1.031,1.672,5.569763,0.116335,-0.061116,33.205116,1
176,AAU93319,MFIFLLFLTLTSGSDLDRCTTFDDVQAPNYTQHTSSMRGVYYPDEI...,429,454,DATSTGNYNYKYRYLRHGKLRPFERD,1.098,40.605,0.982,2.669,5.569763,0.116335,-0.061116,33.205116,1
177,AAU93319,MFIFLLFLTLTSGSDLDRCTTFDDVQAPNYTQHTSSMRGVYYPDEI...,429,454,DATSTGNYNYKYRYLRHGKLRPFERD,1.098,40.605,0.982,2.669,5.569763,0.116335,-0.061116,33.205116,1
179,AAU93319,MFIFLLFLTLTSGSDLDRCTTFDDVQAPNYTQHTSSMRGVYYPDEI...,429,454,DATSTGNYNYKYRYLRHGKLRPFERD,1.098,40.605,0.982,2.669,5.569763,0.116335,-0.061116,33.205116,1
275,AAU93319,MFIFLLFLTLTSGSDLDRCTTFDDVQAPNYTQHTSSMRGVYYPDEI...,597,603,LYQDVNC,1.06,0.51,1.123,1.371,5.569763,0.116335,-0.061116,33.205116,1
488,AAU93319,MFIFLLFLTLTSGSDLDRCTTFDDVQAPNYTQHTSSMRGVYYPDEI...,1164,1191,EIDRLNEVAKNLNESLIDLQELGKYEQY,0.948,3.29,1.01,1.871,5.569763,0.116335,-0.061116,33.205116,1


We see that we have 6 duplicated rows. As we did with the B-cell duplicates, we will drop these rows from our dataset.

In [29]:
# dropping duplicated rows
sars.drop_duplicates(inplace=True)

In [30]:
# sanity check
sars[sars.duplicated()]

Unnamed: 0,parent_protein_id,protein_seq,start_position,end_position,peptide_seq,chou_fasman,emini,kolaskar_tongaonkar,parker,isoelectric_point,aromaticity,hydrophobicity,stability,target


Now that we have dropped our duplicated rows, we will add the peptide length column and drop the end_position column as we did above.

In [31]:
# creating a new column
peptide_length = (sars['end_position'] - sars['start_position'] + 1) # plus 1 because the end_position is inclusive
sars.insert(5, 'peptide_length', peptide_length) #insert beside peptide_seq

#dropping end position
sars.drop('end_position', axis=1, inplace=True)

#sanity check
sars.head()

Unnamed: 0,parent_protein_id,protein_seq,start_position,peptide_seq,peptide_length,chou_fasman,emini,kolaskar_tongaonkar,parker,isoelectric_point,aromaticity,hydrophobicity,stability,target
0,AAU93319,MFIFLLFLTLTSGSDLDRCTTFDDVQAPNYTQHTSSMRGVYYPDEI...,1,MFIFLLFLTLTSGSDLD,17,0.887,0.04,1.056,-2.159,5.569763,0.116335,-0.061116,33.205116,0
1,AAU93319,MFIFLLFLTLTSGSDLDRCTTFDDVQAPNYTQHTSSMRGVYYPDEI...,1,MFIFLLFLTLTSGSD,15,0.869,0.047,1.056,-2.5,5.569763,0.116335,-0.061116,33.205116,0
2,AAU93319,MFIFLLFLTLTSGSDLDRCTTFDDVQAPNYTQHTSSMRGVYYPDEI...,2,FIFLLFLTL,9,0.621,0.042,1.148,-7.467,5.569763,0.116335,-0.061116,33.205116,0
3,AAU93319,MFIFLLFLTLTSGSDLDRCTTFDDVQAPNYTQHTSSMRGVYYPDEI...,6,LFLTLTSGSDLDRCT,15,1.021,0.23,1.049,0.927,5.569763,0.116335,-0.061116,33.205116,0
4,AAU93319,MFIFLLFLTLTSGSDLDRCTTFDDVQAPNYTQHTSSMRGVYYPDEI...,9,TLTSGSDLDRCTTFDDV,17,1.089,0.627,1.015,3.165,5.569763,0.116335,-0.061116,33.205116,0


We will also rename the parent protein columns and the target column as well as create the peptide feature columns as we did above.

In [32]:
# take parent_names dictionary from above and use to rename sars columns
sars = sars.rename(columns=parent_names)

# renaming target column
sars = sars.rename(columns={'target': 'antibody_activity'})

#sanity check
sars.head()

Unnamed: 0,parent_protein_id,protein_seq,start_position,peptide_seq,peptide_length,chou_fasman,emini,kolaskar_tongaonkar,parker,parent_isoelectric_point,parent_aromaticity,parent_hydrophobicity,parent_instability_index,antibody_activity
0,AAU93319,MFIFLLFLTLTSGSDLDRCTTFDDVQAPNYTQHTSSMRGVYYPDEI...,1,MFIFLLFLTLTSGSDLD,17,0.887,0.04,1.056,-2.159,5.569763,0.116335,-0.061116,33.205116,0
1,AAU93319,MFIFLLFLTLTSGSDLDRCTTFDDVQAPNYTQHTSSMRGVYYPDEI...,1,MFIFLLFLTLTSGSD,15,0.869,0.047,1.056,-2.5,5.569763,0.116335,-0.061116,33.205116,0
2,AAU93319,MFIFLLFLTLTSGSDLDRCTTFDDVQAPNYTQHTSSMRGVYYPDEI...,2,FIFLLFLTL,9,0.621,0.042,1.148,-7.467,5.569763,0.116335,-0.061116,33.205116,0
3,AAU93319,MFIFLLFLTLTSGSDLDRCTTFDDVQAPNYTQHTSSMRGVYYPDEI...,6,LFLTLTSGSDLDRCT,15,1.021,0.23,1.049,0.927,5.569763,0.116335,-0.061116,33.205116,0
4,AAU93319,MFIFLLFLTLTSGSDLDRCTTFDDVQAPNYTQHTSSMRGVYYPDEI...,9,TLTSGSDLDRCTTFDDV,17,1.089,0.627,1.015,3.165,5.569763,0.116335,-0.061116,33.205116,0


In [33]:
# add peptide parameter features, peptide_params taken from above
sars = add_protein_analysis_parameter(sars, 'peptide_seq', peptide_params)

# add the amino acid features taken from above
sars = amino_acid_breakdown(sars, 'peptide_seq')

#sanity check
sars.head()

Unnamed: 0,parent_protein_id,protein_seq,start_position,peptide_seq,peptide_length,chou_fasman,emini,kolaskar_tongaonkar,parker,parent_isoelectric_point,parent_aromaticity,parent_hydrophobicity,parent_instability_index,antibody_activity,isoelectric_point,aromaticity,molecular_weight,instability_index,hydrophobicity,charge_at_pH=7.4,amino_acid_ala,amino_acid_arg,amino_acid_asn,amino_acid_asp,amino_acid_cys,amino_acid_gln,amino_acid_glu,amino_acid_gly,amino_acid_his,amino_acid_ile,amino_acid_leu,amino_acid_lys,amino_acid_met,amino_acid_phe,amino_acid_pro,amino_acid_ser,amino_acid_thr,amino_acid_trp,amino_acid_tyr,amino_acid_val
0,AAU93319,MFIFLLFLTLTSGSDLDRCTTFDDVQAPNYTQHTSSMRGVYYPDEI...,1,MFIFLLFLTLTSGSDLD,17,0.887,0.04,1.056,-2.159,5.569763,0.116335,-0.061116,33.205116,0,4.050028,0.176471,1933.2668,9.411765,1.376471,-2.494223,0,0,0,2,0,0,0,1,0,1,5,0,1,3,0,2,2,0,0,0
1,AAU93319,MFIFLLFLTLTSGSDLDRCTTFDDVQAPNYTQHTSSMRGVYYPDEI...,1,MFIFLLFLTLTSGSD,15,0.869,0.047,1.056,-2.5,5.569763,0.116335,-0.061116,33.205116,0,4.298131,0.2,1705.0218,9.333333,1.54,-1.495344,0,0,0,1,0,0,0,1,0,1,4,0,1,3,0,2,2,0,0,0
2,AAU93319,MFIFLLFLTLTSGSDLDRCTTFDDVQAPNYTQHTSSMRGVYYPDEI...,2,FIFLLFLTL,9,0.621,0.042,1.148,-7.467,5.569763,0.116335,-0.061116,33.205116,0,5.525,0.333333,1126.4286,8.888889,3.044444,-0.239898,0,0,0,0,0,0,0,0,0,1,4,0,0,3,0,0,1,0,0,0
3,AAU93319,MFIFLLFLTLTSGSDLDRCTTFDDVQAPNYTQHTSSMRGVYYPDEI...,6,LFLTLTSGSDLDRCT,15,1.021,0.23,1.049,0.927,5.569763,0.116335,-0.061116,33.205116,0,4.207813,0.066667,1641.8405,26.04,0.326667,-1.247568,0,1,0,2,1,0,0,1,0,0,4,0,0,1,0,2,3,0,0,0
4,AAU93319,MFIFLLFLTLTSGSDLDRCTTFDDVQAPNYTQHTSSMRGVYYPDEI...,9,TLTSGSDLDRCTTFDDV,17,1.089,0.627,1.015,3.165,5.569763,0.116335,-0.061116,33.205116,0,4.050028,0.058824,1845.935,38.670588,-0.364706,-3.607231,0,1,0,4,1,0,0,1,0,0,2,0,0,1,0,2,4,0,0,1


In [34]:
# checking all columns
sars.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 514 entries, 0 to 519
Data columns (total 40 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   parent_protein_id         514 non-null    object 
 1   protein_seq               514 non-null    object 
 2   start_position            514 non-null    int64  
 3   peptide_seq               514 non-null    object 
 4   peptide_length            514 non-null    int64  
 5   chou_fasman               514 non-null    float64
 6   emini                     514 non-null    float64
 7   kolaskar_tongaonkar       514 non-null    float64
 8   parker                    514 non-null    float64
 9   parent_isoelectric_point  514 non-null    float64
 10  parent_aromaticity        514 non-null    float64
 11  parent_hydrophobicity     514 non-null    float64
 12  parent_instability_index  514 non-null    float64
 13  antibody_activity         514 non-null    int64  
 14  isoelectri

This will be our final SARS dataset. We will train our model on the B-cell data and test it on this SARS data. We will also combine both datasets and train our model to make predictions on the COVID dataset, which does not contain our target variable.
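Combining the two frames, as described above, depends on both having the same columns in the same order. Otherwise `pd.concat` fills the gaps with NaN. The sketch below uses two toy frames to check column alignment first and then concatenates with `ignore_index=True`, which gives the result a fresh 0-based index.

```python
import pandas as pd

# Toy stand-ins for the cleaned frames; note the overlapping index label 2.
bcell_toy = pd.DataFrame(
    {"peptide_seq": ["SASFT", "LCLKI"], "antibody_activity": [1, 1]},
    index=[0, 2],
)
sars_toy = pd.DataFrame(
    {"peptide_seq": ["FIFLLFLTL"], "antibody_activity": [0]},
    index=[2],
)

# Guard against silently introduced NaN columns.
columns_match = list(bcell_toy.columns) == list(sars_toy.columns)

# ignore_index=True discards the old labels and renumbers from 0.
combined = pd.concat([bcell_toy, sars_toy], ignore_index=True)
print(columns_match, combined.index.tolist())
```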

In [35]:
# saving the SARS dataset as a csv
sars.to_csv('sars_cleaned.csv', index=False)

In [36]:
display(bcell.head())
display(sars.head())

Unnamed: 0,parent_protein_id,protein_seq,start_position,peptide_seq,peptide_length,chou_fasman,emini,kolaskar_tongaonkar,parker,parent_isoelectric_point,parent_aromaticity,parent_hydrophobicity,parent_instability_index,antibody_activity,isoelectric_point,aromaticity,molecular_weight,instability_index,hydrophobicity,charge_at_pH=7.4,amino_acid_ala,amino_acid_arg,amino_acid_asn,amino_acid_asp,amino_acid_cys,amino_acid_gln,amino_acid_glu,amino_acid_gly,amino_acid_his,amino_acid_ile,amino_acid_leu,amino_acid_lys,amino_acid_met,amino_acid_phe,amino_acid_pro,amino_acid_ser,amino_acid_thr,amino_acid_trp,amino_acid_tyr,amino_acid_val
0,A2T3T0,MDVLYSLSKTLKDARDKIVEGTLYSNVSDLIQQFNQMIITMNGNEF...,161,SASFT,5,1.016,0.703,1.018,2.22,5.810364,0.103275,-0.143829,40.2733,1,5.240009,0.2,511.5255,8.0,0.46,-0.539854,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,2,1,0,0,0
1,F0V2I4,MTIHKVAINGFGRIGRLLFRNLLSSQGVQVVAVNDVVDIKVLTHLL...,251,LCLKI,5,0.77,0.179,1.199,-3.86,6.210876,0.065476,-0.036905,24.998512,1,8.222249,0.0,588.8033,12.56,2.14,0.749202,0,0,0,0,1,0,0,0,0,1,2,1,0,0,0,0,0,0,0,0
2,O75508,MVATCLQVVGFVTSFVGWIGVIVTTSTNDWVVTCGYTIPTCRKLDE...,145,AHRET,5,0.852,3.427,0.96,4.28,8.223938,0.091787,0.879227,27.863333,1,6.794385,0.0,612.6361,-8.98,-2.02,-0.114151,1,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,0
3,O84462,MTNSISGYQPTVTTSTSSTTSASGASGSLGASSVSTTANATVTQTA...,152,SNYDD,5,1.41,2.548,0.936,6.32,4.237976,0.044776,-0.521393,30.765373,1,4.050028,0.2,612.5432,55.36,-2.52,-2.53543,0,0,1,2,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0
4,P00918,MSHHWGYGKHNGPEHWHKDFPIAKGERQSPVDIDTHTAKYDPSLKP...,85,DGTYR,5,1.214,1.908,0.937,4.64,6.867493,0.103846,-0.578846,21.684615,1,5.835682,0.2,610.6168,-42.8,-2.08,-0.239787,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0


Unnamed: 0,parent_protein_id,protein_seq,start_position,peptide_seq,peptide_length,chou_fasman,emini,kolaskar_tongaonkar,parker,parent_isoelectric_point,parent_aromaticity,parent_hydrophobicity,parent_instability_index,antibody_activity,isoelectric_point,aromaticity,molecular_weight,instability_index,hydrophobicity,charge_at_pH=7.4,amino_acid_ala,amino_acid_arg,amino_acid_asn,amino_acid_asp,amino_acid_cys,amino_acid_gln,amino_acid_glu,amino_acid_gly,amino_acid_his,amino_acid_ile,amino_acid_leu,amino_acid_lys,amino_acid_met,amino_acid_phe,amino_acid_pro,amino_acid_ser,amino_acid_thr,amino_acid_trp,amino_acid_tyr,amino_acid_val
0,AAU93319,MFIFLLFLTLTSGSDLDRCTTFDDVQAPNYTQHTSSMRGVYYPDEI...,1,MFIFLLFLTLTSGSDLD,17,0.887,0.04,1.056,-2.159,5.569763,0.116335,-0.061116,33.205116,0,4.050028,0.176471,1933.2668,9.411765,1.376471,-2.494223,0,0,0,2,0,0,0,1,0,1,5,0,1,3,0,2,2,0,0,0
1,AAU93319,MFIFLLFLTLTSGSDLDRCTTFDDVQAPNYTQHTSSMRGVYYPDEI...,1,MFIFLLFLTLTSGSD,15,0.869,0.047,1.056,-2.5,5.569763,0.116335,-0.061116,33.205116,0,4.298131,0.2,1705.0218,9.333333,1.54,-1.495344,0,0,0,1,0,0,0,1,0,1,4,0,1,3,0,2,2,0,0,0
2,AAU93319,MFIFLLFLTLTSGSDLDRCTTFDDVQAPNYTQHTSSMRGVYYPDEI...,2,FIFLLFLTL,9,0.621,0.042,1.148,-7.467,5.569763,0.116335,-0.061116,33.205116,0,5.525,0.333333,1126.4286,8.888889,3.044444,-0.239898,0,0,0,0,0,0,0,0,0,1,4,0,0,3,0,0,1,0,0,0
3,AAU93319,MFIFLLFLTLTSGSDLDRCTTFDDVQAPNYTQHTSSMRGVYYPDEI...,6,LFLTLTSGSDLDRCT,15,1.021,0.23,1.049,0.927,5.569763,0.116335,-0.061116,33.205116,0,4.207813,0.066667,1641.8405,26.04,0.326667,-1.247568,0,1,0,2,1,0,0,1,0,0,4,0,0,1,0,2,3,0,0,0
4,AAU93319,MFIFLLFLTLTSGSDLDRCTTFDDVQAPNYTQHTSSMRGVYYPDEI...,9,TLTSGSDLDRCTTFDDV,17,1.089,0.627,1.015,3.165,5.569763,0.116335,-0.061116,33.205116,0,4.050028,0.058824,1845.935,38.670588,-0.364706,-3.607231,0,1,0,4,1,0,0,1,0,0,2,0,0,1,0,2,4,0,0,1


In [37]:
# concatenating the bcell and sars data
bcell_sars = pd.concat([bcell, sars], ignore_index=True)
bcell_sars

Unnamed: 0,parent_protein_id,protein_seq,start_position,peptide_seq,peptide_length,chou_fasman,emini,kolaskar_tongaonkar,parker,parent_isoelectric_point,parent_aromaticity,parent_hydrophobicity,parent_instability_index,antibody_activity,isoelectric_point,aromaticity,molecular_weight,instability_index,hydrophobicity,charge_at_pH=7.4,amino_acid_ala,amino_acid_arg,amino_acid_asn,amino_acid_asp,amino_acid_cys,amino_acid_gln,amino_acid_glu,amino_acid_gly,amino_acid_his,amino_acid_ile,amino_acid_leu,amino_acid_lys,amino_acid_met,amino_acid_phe,amino_acid_pro,amino_acid_ser,amino_acid_thr,amino_acid_trp,amino_acid_tyr,amino_acid_val
0,A2T3T0,MDVLYSLSKTLKDARDKIVEGTLYSNVSDLIQQFNQMIITMNGNEF...,161,SASFT,5,1.016,0.703,1.018,2.220,5.810364,0.103275,-0.143829,40.273300,1,5.240009,0.200000,511.5255,8.000000,0.460000,-0.539854,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,2,1,0,0,0
1,F0V2I4,MTIHKVAINGFGRIGRLLFRNLLSSQGVQVVAVNDVVDIKVLTHLL...,251,LCLKI,5,0.770,0.179,1.199,-3.860,6.210876,0.065476,-0.036905,24.998512,1,8.222249,0.000000,588.8033,12.560000,2.140000,0.749202,0,0,0,0,1,0,0,0,0,1,2,1,0,0,0,0,0,0,0,0
2,O75508,MVATCLQVVGFVTSFVGWIGVIVTTSTNDWVVTCGYTIPTCRKLDE...,145,AHRET,5,0.852,3.427,0.960,4.280,8.223938,0.091787,0.879227,27.863333,1,6.794385,0.000000,612.6361,-8.980000,-2.020000,-0.114151,1,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,0
3,O84462,MTNSISGYQPTVTTSTSSTTSASGASGSLGASSVSTTANATVTQTA...,152,SNYDD,5,1.410,2.548,0.936,6.320,4.237976,0.044776,-0.521393,30.765373,1,4.050028,0.200000,612.5432,55.360000,-2.520000,-2.535430,0,0,1,2,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0
4,P00918,MSHHWGYGKHNGPEHWHKDFPIAKGERQSPVDIDTHTAKYDPSLKP...,85,DGTYR,5,1.214,1.908,0.937,4.640,6.867493,0.103846,-0.578846,21.684615,1,5.835682,0.200000,610.6168,-42.800000,-2.080000,-0.239787,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14891,AAU93319,MFIFLLFLTLTSGSDLDRCTTFDDVQAPNYTQHTSSMRGVYYPDEI...,1231,SCGSCCKFDEDDSEP,15,1.227,0.503,1.035,4.907,5.569763,0.116335,-0.061116,33.205116,0,4.050028,0.066667,1621.6787,112.386667,-1.033333,-4.561572,0,0,0,3,3,0,2,1,0,0,0,1,0,1,1,3,0,0,0,0
14892,AAU93319,MFIFLLFLTLTSGSDLDRCTTFDDVQAPNYTQHTSSMRGVYYPDEI...,1234,SCCKFDEDDSEPVLKGVKLHYT,22,1.047,0.606,1.064,2.577,5.569763,0.116335,-0.061116,33.205116,0,4.825710,0.090909,2513.7967,76.972727,-0.645455,-2.467494,0,0,0,3,2,0,2,1,1,0,2,3,0,1,1,2,1,0,1,2
14893,AAU93319,MFIFLLFLTLTSGSDLDRCTTFDDVQAPNYTQHTSSMRGVYYPDEI...,1236,CKFDEDDSEPVLKGVKLHYT,20,1.021,1.361,1.049,2.440,5.569763,0.116335,-0.061116,33.205116,1,4.828552,0.100000,2323.5765,67.370000,-0.795000,-2.157638,0,0,0,3,1,0,2,1,1,0,2,3,0,1,1,1,1,0,1,2
14894,AAU93319,MFIFLLFLTLTSGSDLDRCTTFDDVQAPNYTQHTSSMRGVYYPDEI...,1236,CKFDEDDSEPVLKGV,15,1.051,0.886,1.042,3.127,5.569763,0.116335,-0.061116,33.205116,0,4.105049,0.066667,1680.8302,70.440000,-0.706667,-3.242814,0,0,0,3,1,0,2,1,0,0,1,2,0,1,1,1,0,0,0,2


In [38]:
# saving the combined data into a csv
bcell_sars.to_csv('bcell_sars_cleaned.csv', index=False)

Now that we have our SARS and combined datasets, we will apply the same edits to our COVID dataset. This dataset does not have the target feature (the presence of an immune response) because it had not yet been recorded at the time of data collection. We will then use our models to predict epitopes for these peptides and compare the peptide sequences with those of the currently available mRNA vaccines.

In [39]:
# reading in covid dataset
covid = pd.read_csv('input_covid.csv')
covid.head()

Unnamed: 0,parent_protein_id,protein_seq,start_position,end_position,peptide_seq,chou_fasman,emini,kolaskar_tongaonkar,parker,isoelectric_point,aromaticity,hydrophobicity,stability
0,6VYB_A,MGILPSPGMPALLSLVSLLSVLLMGCVAETGTQCVNLTTRTQLPPA...,1,5,MGILP,0.948,0.28,1.033,-2.72,6.03595,0.10929,-0.138642,31.377603
1,6VYB_A,MGILPSPGMPALLSLVSLLSVLLMGCVAETGTQCVNLTTRTQLPPA...,2,6,GILPS,1.114,0.379,1.07,-0.58,6.03595,0.10929,-0.138642,31.377603
2,6VYB_A,MGILPSPGMPALLSLVSLLSVLLMGCVAETGTQCVNLTTRTQLPPA...,3,7,ILPSP,1.106,0.592,1.108,-1.3,6.03595,0.10929,-0.138642,31.377603
3,6VYB_A,MGILPSPGMPALLSLVSLLSVLLMGCVAETGTQCVNLTTRTQLPPA...,4,8,LPSPG,1.324,0.836,1.053,1.44,6.03595,0.10929,-0.138642,31.377603
4,6VYB_A,MGILPSPGMPALLSLVSLLSVLLMGCVAETGTQCVNLTTRTQLPPA...,5,9,PSPGM,1.326,1.004,0.968,2.44,6.03595,0.10929,-0.138642,31.377603


In [40]:
# getting info
covid.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20312 entries, 0 to 20311
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   parent_protein_id    20312 non-null  object 
 1   protein_seq          20312 non-null  object 
 2   start_position       20312 non-null  int64  
 3   end_position         20312 non-null  int64  
 4   peptide_seq          20312 non-null  object 
 5   chou_fasman          20312 non-null  float64
 6   emini                20312 non-null  float64
 7   kolaskar_tongaonkar  20312 non-null  float64
 8   parker               20312 non-null  float64
 9   isoelectric_point    20312 non-null  float64
 10  aromaticity          20312 non-null  float64
 11  hydrophobicity       20312 non-null  float64
 12  stability            20312 non-null  float64
dtypes: float64(8), int64(2), object(3)
memory usage: 2.0+ MB


In [41]:
# getting a summary description
covid.describe()

Unnamed: 0,start_position,end_position,chou_fasman,emini,kolaskar_tongaonkar,parker,isoelectric_point,aromaticity,hydrophobicity,stability
count,20312.0,20312.0,20312.0,20312.0,20312.0,20312.0,20312.0,20312.0,20312.0,20312.0
mean,635.258369,646.741631,1.003054,0.999996,1.037257,1.334786,6.03595,0.1092896,-0.1386417,31.3776
std,366.496487,366.496487,0.106191,1.287882,0.046677,1.539362,0.0,4.163439e-17,5.551252e-17,7.105602e-15
min,1.0,5.0,0.596,0.003,0.837,-7.317,6.03595,0.1092896,-0.1386417,31.3776
25%,318.0,329.0,0.935,0.272,1.008,0.453,6.03595,0.1092896,-0.1386417,31.3776
50%,635.0,647.0,1.001,0.587,1.035,1.406,6.03595,0.1092896,-0.1386417,31.3776
75%,953.0,964.0,1.067,1.222,1.064,2.289,6.03595,0.1092896,-0.1386417,31.3776
max,1277.0,1281.0,1.538,18.298,1.282,7.3,6.03595,0.1092896,-0.1386417,31.3776


In [42]:
# checking for null values
covid.isna().sum()

parent_protein_id      0
protein_seq            0
start_position         0
end_position           0
peptide_seq            0
chou_fasman            0
emini                  0
kolaskar_tongaonkar    0
parker                 0
isoelectric_point      0
aromaticity            0
hydrophobicity         0
stability              0
dtype: int64

In [43]:
# checking shape
covid.shape

(20312, 13)

We have taken a high-level look at our covid dataset and found no null values. The shape of the dataset is (20312, 13), meaning there are 20312 rows of peptide data and 13 columns. The target column is missing because these peptides had not yet been tested when the dataset was created. Next, we will check for duplicated rows.

In [44]:
# checking for duplicates
covid[covid.duplicated()]

Unnamed: 0,parent_protein_id,protein_seq,start_position,end_position,peptide_seq,chou_fasman,emini,kolaskar_tongaonkar,parker,isoelectric_point,aromaticity,hydrophobicity,stability


We see that there are no duplicated rows in our data and we are able to move on to the next part of our preprocessing. To avoid redundancy, we will add the sequence length, drop the end_position, add the peptide features and add the amino acid breakdown all in one section.

In [45]:
# creating a new column
peptide_length = (covid['end_position'] - covid['start_position'] + 1) # plus 1 because the end_position is inclusive
covid.insert(5, 'peptide_length', peptide_length) #insert beside peptide_seq

#dropping end position
covid.drop('end_position', axis=1, inplace=True)

# take parent_names dictionary from above and use to rename covid columns
covid = covid.rename(columns=parent_names)

# covid has no 'target' column, so this rename is a no-op; kept for consistency with the other datasets
covid = covid.rename(columns={'target': 'antibody_activity'})

# add peptide parameter features, peptide_params taken from above
covid = add_protein_analysis_parameter(covid, 'peptide_seq', peptide_params)

# add the amino acid features taken from above
covid = amino_acid_breakdown(covid, 'peptide_seq')


In [46]:
#sanity check
display(covid.head())
print(covid.shape)

Unnamed: 0,parent_protein_id,protein_seq,start_position,peptide_seq,peptide_length,chou_fasman,emini,kolaskar_tongaonkar,parker,parent_isoelectric_point,parent_aromaticity,parent_hydrophobicity,parent_instability_index,isoelectric_point,aromaticity,molecular_weight,instability_index,hydrophobicity,charge_at_pH=7.4,amino_acid_ala,amino_acid_arg,amino_acid_asn,amino_acid_asp,amino_acid_cys,amino_acid_gln,amino_acid_glu,amino_acid_gly,amino_acid_his,amino_acid_ile,amino_acid_leu,amino_acid_lys,amino_acid_met,amino_acid_phe,amino_acid_pro,amino_acid_ser,amino_acid_thr,amino_acid_trp,amino_acid_tyr,amino_acid_val
0,6VYB_A,MGILPSPGMPALLSLVSLLSVLLMGCVAETGTQCVNLTTRTQLPPA...,1,MGILP,5,0.948,0.28,1.033,-2.72,6.03595,0.10929,-0.138642,31.377603,5.275022,0.0,529.693,68.06,1.64,-0.499645,0,0,0,0,0,0,0,1,0,1,1,0,1,0,1,0,0,0,0,0
1,6VYB_A,MGILPSPGMPALLSLVSLLSVLLMGCVAETGTQCVNLTTRTQLPPA...,2,GILPS,5,1.114,0.379,1.07,-0.58,6.03595,0.10929,-0.138642,31.377603,5.525,0.0,485.5743,106.58,1.1,-0.239898,0,0,0,0,0,0,0,1,0,1,1,0,0,0,1,1,0,0,0,0
2,6VYB_A,MGILPSPGMPALLSLVSLLSVLLMGCVAETGTQCVNLTTRTQLPPA...,3,ILPSP,5,1.106,0.592,1.108,-1.3,6.03595,0.10929,-0.138642,31.377603,5.525,0.0,525.6382,211.44,0.86,-0.239898,0,0,0,0,0,0,0,0,0,1,1,0,0,0,2,1,0,0,0,0
3,6VYB_A,MGILPSPGMPALLSLVSLLSVLLMGCVAETGTQCVNLTTRTQLPPA...,4,LPSPG,5,1.324,0.836,1.053,1.44,6.03595,0.10929,-0.138642,31.377603,5.525,0.0,469.5319,172.92,-0.12,-0.239898,0,0,0,0,0,0,0,1,0,0,1,0,0,0,2,1,0,0,0,0
4,6VYB_A,MGILPSPGMPALLSLVSLLSVLLMGCVAETGTQCVNLTTRTQLPPA...,5,PSPGM,5,1.326,1.004,0.968,2.44,6.03595,0.10929,-0.138642,31.377603,5.954987,0.0,487.5703,134.4,-0.5,-0.041471,0,0,0,0,0,0,0,1,0,0,0,0,1,0,2,1,0,0,0,0


(20312, 39)


Here, we see that our covid dataset has undergone the same changes as the two other datasets above. It has one fewer column because it lacks the 'target' column (renamed to antibody_activity) that the other two have. Now that we have cleaned and preprocessed this dataset, we will export it as a .csv file and continue on to our EDA process.

In [47]:
# saving to a csv file
covid.to_csv('covid_cleaned.csv', index=False)

## Creating Mask for peptides within protein sequence

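We will create a mask marking each peptide's position within its parent protein sequence: a list as long as the protein, with 1 at residues covered by the peptide and 0 everywhere else. This will be used for the loss function of our neural networks.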
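As an illustration of how such a mask could enter a loss function, here is a minimal sketch of a masked binary cross-entropy in plain Python. It is an assumption, not the loss used in the modelling notebooks: the function name `masked_binary_cross_entropy`, the per-residue labels and the toy predictions are all hypothetical. The idea is simply that only positions where the mask is 1 contribute to the average.

```python
import math

def masked_binary_cross_entropy(y_true, y_pred, mask, eps=1e-7):
    """Mean binary cross-entropy over the positions where mask == 1 (hypothetical helper)."""
    total, count = 0.0, 0
    for target, prob, flag in zip(y_true, y_pred, mask):
        if flag:
            prob = min(max(prob, eps), 1 - eps)  # clip to avoid log(0)
            total -= target * math.log(prob) + (1 - target) * math.log(1 - prob)
            count += 1
    # an all-zero mask contributes no loss
    return total / count if count else 0.0

# toy protein of length 6; the peptide covers residues 2-4 (1-indexed)
mask = [0, 1, 1, 1, 0, 0]
y_true = [0, 1, 1, 1, 0, 0]
y_pred = [0.9, 0.8, 0.8, 0.8, 0.9, 0.9]  # poor predictions outside the peptide are ignored

loss = masked_binary_cross_entropy(y_true, y_pred, mask)
print(round(loss, 4))  # 0.2231
```

Because positions outside the peptide are skipped, the model is only penalised on residues for which we actually have a label.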

In [48]:
def create_peptide_mask(start_position, peptide_sequence, protein_sequence):
    """Return a 0/1 list marking where the peptide sits in the protein (start_position is 1-indexed)."""
    mask = [0] * len(protein_sequence)  # Initialize the mask with zeros

    # Convert the 1-indexed start into a 0-indexed half-open range [start, end)
    start = start_position - 1
    end = start + len(peptide_sequence)

    # Set the corresponding positions in the mask to 1
    for i in range(start, end):
        mask[i] = 1

    return mask

Now that we have our function, we will verify that it is doing what we want it to do

In [68]:
# peptide spans the entire one-residue protein
print(create_peptide_mask(1, 'M', 'M'))

# peptide at the start of the protein
print(create_peptide_mask(1, 'M', 'MA'))

# peptide at the end of the protein
print(create_peptide_mask(2, 'A', 'MA'))

# two-residue peptide inside a longer protein
print(create_peptide_mask(2, 'AA', 'MAAOOOPPS'))

# five-residue peptide starting at position 3
print(create_peptide_mask(3, 'ABDSK', 'MAABDSKJFSLNFB'))

[1]
[1, 0]
[0, 1]
[0, 1, 1, 0, 0, 0, 0, 0, 0]
[0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]


As the examples above show, the function creates a mask marking the position of each peptide within the larger sequence.
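Printed masks are easy to misread for long proteins, so a programmatic check can complement the examples. The sketch below is an assumption rather than part of the original workflow: `build_mask` is a compact stand-in for the mask function so the block runs on its own, and `mask_matches_peptide` is a hypothetical helper that confirms the masked residues spell out the peptide.

```python
def build_mask(start_position, peptide_sequence, protein_sequence):
    """Compact stand-in for the mask function (start_position is 1-indexed)."""
    start = start_position - 1
    end = start + len(peptide_sequence)
    return [1 if start <= i < end else 0 for i in range(len(protein_sequence))]

def mask_matches_peptide(mask, peptide_sequence, protein_sequence):
    """True if the mask has the protein's length and the masked residues equal the peptide."""
    masked = ''.join(res for res, flag in zip(protein_sequence, mask) if flag)
    return len(mask) == len(protein_sequence) and masked == peptide_sequence

protein = 'MAABDSKJFSLNFB'
good = build_mask(3, 'ABDSK', protein)  # correct start position
bad = build_mask(2, 'ABDSK', protein)   # start position off by one

print(mask_matches_peptide(good, 'ABDSK', protein))  # True
print(mask_matches_peptide(bad, 'ABDSK', protein))   # False
```

Applied row by row, a check like this would also flag any record whose start_position does not agree with its peptide_seq.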

Now that we have validated our function, we will add a new peptide_mask column to each of our datasets.

In [71]:
# creating new column on datasets for mask
for df in (bcell, sars, covid, bcell_sars):
    df['peptide_mask'] = df.apply(lambda row: create_peptide_mask(row['start_position'], row['peptide_seq'], row['protein_seq']), axis=1)

#sanity checks
for df in (bcell, sars, covid, bcell_sars):
    display(df[['peptide_seq', 'protein_seq', 'start_position', 'peptide_mask']].head(1))


Unnamed: 0,peptide_seq,protein_seq,start_position,peptide_mask
0,SASFT,MDVLYSLSKTLKDARDKIVEGTLYSNVSDLIQQFNQMIITMNGNEF...,161,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."


Unnamed: 0,peptide_seq,protein_seq,start_position,peptide_mask
0,MFIFLLFLTLTSGSDLD,MFIFLLFLTLTSGSDLDRCTTFDDVQAPNYTQHTSSMRGVYYPDEI...,1,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."


Unnamed: 0,peptide_seq,protein_seq,start_position,peptide_mask
0,MGILP,MGILPSPGMPALLSLVSLLSVLLMGCVAETGTQCVNLTTRTQLPPA...,1,"[1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."


Unnamed: 0,peptide_seq,protein_seq,start_position,peptide_mask
0,SASFT,MDVLYSLSKTLKDARDKIVEGTLYSNVSDLIQQFNQMIITMNGNEF...,161,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."


In [72]:
# saving to csv
bcell.to_csv('bcell_cleaned.csv', index=False)
sars.to_csv('sars_cleaned.csv', index=False)
covid.to_csv('covid_cleaned.csv', index=False)
bcell_sars.to_csv('bcell_sars_cleaned.csv', index=False)
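
One caveat when saving list-valued columns such as peptide_mask: a CSV file stores each list as its text representation, so it comes back as a string rather than a list. The sketch below shows the effect with the standard-library `csv` module on a toy row (the row itself is made up for illustration) and recovers the list with `ast.literal_eval`. With pandas, the same conversion can be applied at load time through the `converters` argument of `pd.read_csv`.

```python
import ast
import csv
import io

# write a toy row containing a list-valued mask to an in-memory CSV
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['peptide_seq', 'peptide_mask'])
writer.writerow(['AA', [0, 1, 1, 0]])  # the list is written as its string form

# read it back: the mask is now a string, not a list
buffer.seek(0)
row = next(csv.DictReader(buffer))
raw = row['peptide_mask']
print(type(raw).__name__, raw)  # str [0, 1, 1, 0]

# safely parse the string back into a Python list
restored = ast.literal_eval(raw)
print(type(restored).__name__, restored)  # list [0, 1, 1, 0]

# pandas equivalent when loading the cleaned files (not executed here):
# pd.read_csv('covid_cleaned.csv', converters={'peptide_mask': ast.literal_eval})
```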