#Antibody Epitope Prediction - Tutorial

I. Methods for predicting continuous antibody epitope from protein sequences

General basis: Parameters such as hydrophilicity, flexibility, accessibility, turns, exposed surface, polarity and antigenic propensity of polypeptide chains have been correlated with the location of continuous epitopes. This has led to a search for empirical rules that would allow the position of continuous epitopes to be predicted from certain features of the protein sequence. All prediction calculations are based on propensity scales for the 20 amino acids. Each scale consists of 20 values, one per amino acid residue, assigned on the basis of the residue's relative propensity to possess the property described by the scale.
General method: When computing the score for a given residue i, the amino acids in an interval of the chosen length, centered on residue i, are considered. In other words, for a window size n, the (n-1)/2 neighboring residues on each side of residue i, that is, positions i - (n-1)/2 through i + (n-1)/2, are used to compute the score for residue i. Unless specified otherwise, the score for residue i is the average of the scale values for these amino acids (see table 1 for specific method implementation details). In general, a window size of 5 to 7 is appropriate for finding regions that may potentially be antigenic.
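
The windowed averaging described above can be sketched in a few lines of plain Python. This is only an illustration: the function name `window_scores` and the three-letter `toy_scale` are hypothetical (not a published propensity scale), and skipping residues that lack a full window at the sequence ends is an assumption, since the text does not say how the ends are handled.

```python
def window_scores(sequence, scale, window=7):
    """Average scale values over a centered window for each residue with a full window."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = (window - 1) // 2  # neighbors taken on each side of residue i
    values = [scale[aa] for aa in sequence]
    scores = {}
    for i in range(half, len(values) - half):
        segment = values[i - half:i + half + 1]
        scores[i + 1] = sum(segment) / window  # keys are 1-based residue positions
    return scores


# Hypothetical toy scale for illustration only (not a real propensity scale).
toy_scale = {"A": 1.0, "G": 2.0, "K": 3.0}
scores = window_scores("AGKGA", toy_scale, window=3)
print(scores)
```

With a window of 3 on a five-residue sequence, only positions 2 to 4 receive a score, because the first and last residues have no neighbor on one side.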

Interpretation of output graphs and tables: On the graphs, the Y-axis shows the score for each residue (averaged over the specified window), be it a BepiPred score or a residue score on the Karplus and Schulz flexibility scale, while the X-axis shows the residue position in the sequence. The tables list the calculated score for each residue. A larger score suggests that the residue has a higher probability of being part of an epitope (those residues are colored yellow on the graphs). However, the presented methods do not predict epitopes per se, either linear or discontinuous; they can only guide researchers toward protein regions worth exploring as genuine epitopes. http://tools.iedb.org/bcell/help/

![](https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcRe9oro6cBfFoQvBOOyQ_BkLV00V1CgerDTIQ&usqp=CAU)slideplayer.com

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns #visualization
import matplotlib.pyplot as plt #visualization
%matplotlib inline
import plotly.express as px
import plotly.graph_objects as go
import plotly.offline as py
import cv2

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

#B-Cell

![](https://askabiologist.asu.edu/sites/default/files/resources/activities/body_depot/viral_attack/duck3.gif)

You might think B-cells got their name because they are made inside your bones. It is true that most blood cells are made inside the bone marrow, but that is not where the “B” in B-cells came from. Their name comes from the place where they were discovered, the Bursa of Fabricius. The Bursa is an organ found only in birds. (Kagglers probably have that organ too, since we are all birds.) https://askabiologist.asu.edu/b-cell

In [None]:
df = pd.read_csv('../input/epitope-prediction/input_sars.csv', encoding='ISO-8859-2')
df.head()

In [None]:
df1 = pd.read_csv('../input/ai4all-project/results/deconvolution/CIBERSORTx_Results_Krasnow_facs_droplet.csv', encoding='ISO-8859-2')
df1.head()

In [None]:
sns.countplot(x="B cell",data=df1,palette="flag",edgecolor="black")
plt.title('B cell', weight='bold')
plt.xticks(rotation=45)
plt.yticks(rotation=45)
# changing the font size
sns.set(font_scale=1)

In [None]:
# checking dataset

print ("Rows     : " ,df.shape[0])
print ("Columns  : " ,df.shape[1])
print ("\nFeatures : \n" ,df.columns.tolist())
print ("\nMissing values :  ", df.isnull().sum().values.sum())
print ("\nUnique values :  \n",df.nunique())

In [None]:
# Distributions of start position, end position, and target
fig , ax = plt.subplots(1,3,figsize = (12,5))

start_position = df.start_position.values
end_position = df.end_position.values
target = df.target.values

# sns.distplot is deprecated; histplot with kde=True draws the histogram plus a density curve
sns.histplot(start_position , kde = True , ax = ax[0] , color = 'blue').set_title('B Cell Start Position' , fontsize = 14)
sns.histplot(end_position , kde = True , ax = ax[1] , color = 'cyan').set_title('B Cell End Position' , fontsize = 14)
sns.histplot(target , kde = True , ax = ax[2] , color = 'purple').set_title('B Cell Target' , fontsize = 14)


plt.show()

In [None]:
import matplotlib.gridspec as gridspec
from scipy.stats import skew
from sklearn.preprocessing import RobustScaler,MinMaxScaler
from scipy import stats
import matplotlib.style as style
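# Matplotlib 3.6+ renamed the bundled seaborn styles to 'seaborn-v0_8-*'; fall back to the old name otherwise
style_name = 'seaborn-v0_8-colorblind' if 'seaborn-v0_8-colorblind' in style.available else 'seaborn-colorblind'
style.use(style_name)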

In [None]:
def plotting_3_chart(df, feature): 
    ## Create a customized figure and set its size. 
    fig = plt.figure(constrained_layout=True, figsize=(10,6))
    ## Create a grid of 3 columns and 3 rows. 
    grid = gridspec.GridSpec(ncols=3, nrows=3, figure=fig)

    ## Customizing the histogram grid. 
    ax1 = fig.add_subplot(grid[0, :2])
    ## Set the title. 
    ax1.set_title('Histogram')
    ## plot the histogram. 
    sns.histplot(df.loc[:,feature], stat='density', kde=True, ax = ax1)

    # customizing the QQ_plot. 
    ax2 = fig.add_subplot(grid[1, :2])
    ## Set the title. 
    ax2.set_title('QQ_plot')
    ## Plotting the QQ_Plot. 
    stats.probplot(df.loc[:,feature], plot = ax2)

    ## Customizing the Box Plot. 
    ax3 = fig.add_subplot(grid[:, 2])
    ## Set title. 
    ax3.set_title('Box Plot')
    ## Plotting the box plot. 
    sns.boxplot(y=df.loc[:,feature], ax = ax3 );
 

print('Skewness: '+ str(df['target'].skew())) 
print("Kurtosis: " + str(df['target'].kurt()))
plotting_3_chart(df, 'target')

In [None]:
train_heat=df[df["target"].notnull()]
train_heat=train_heat.drop(["target"],axis=1)
style.use('ggplot')
sns.set_style('whitegrid')
plt.subplots(figsize = (10,8))
## Plotting heatmap. 

# Generate a mask for the upper triangle (taken from seaborn example gallery)
# Correlate numeric columns only; np.bool was removed from NumPy, so use the built-in bool
corr = train_heat.select_dtypes(include='number').corr()
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True


sns.heatmap(corr, 
            cmap=sns.diverging_palette(255, 133, l=60, n=7), 
            mask = mask, 
            annot=True, 
            center = 0, 
           );
## Give title. 
plt.title("Heatmap of all the Features", fontsize = 30);

#Kolaskar and Tongaonkar antigenicity scale

Reference: Kolaskar AS, Tongaonkar PC. A semi-empirical method for prediction of antigenic determinants on protein antigens. FEBS Lett. 1990 Dec 10;276(1-2):172-4.
Description: A semi-empirical method was developed to predict antigenic determinants on proteins. It makes use of the physicochemical properties of amino acid residues and their frequencies of occurrence in experimentally known segmental epitopes. The authors applied the method to a large number of proteins and reported that it predicts antigenic determinants with about 75% accuracy, which is better than most of the known methods.
http://tools.iedb.org/bcell/help/

In [None]:
fig = px.bar(df, 
             x='kolaskar_tongaonkar', y='isoelectric_point', color_discrete_sequence=['#2B3A67'],
             title='Kolaskar and Tongaonkar antigenicity scale', text='end_position')
fig.show()

In [None]:
fig = px.bar(df, 
             x='kolaskar_tongaonkar', y='hydrophobicity', color_discrete_sequence=['crimson'],
             title='Kolaskar and Tongaonkar antigenicity scale', text='end_position')
fig.show()

In [None]:
# Bars show the mean end position for each Kolaskar-Tongaonkar score value
ax = df.groupby('kolaskar_tongaonkar')['end_position'].mean().plot(kind='barh', figsize=(12,8),
                                                           title='Mean end position by Kolaskar-Tongaonkar score')
plt.xlabel('Mean end position')
plt.ylabel('Kolaskar-Tongaonkar score')
plt.show()

#Chou and Fasman beta turn prediction

Reference: Chou PY, Fasman GD. Prediction of the secondary structure of proteins from their amino acid sequence. Adv Enzymol Relat Areas Mol Biol. 1978;47:45-148.
Description: The rationale for predicting turns to predict antibody epitopes is based on the paper by Pellequer et al, Immunology Letters, 36 (1993) 83-99. Instead of implementing the turn scale of that paper which has some non-standard properties, we decided to use the Chou and Fasman scale which is commonly used to predict beta turns as described in the reference link above.
http://tools.iedb.org/bcell/help/

In [None]:
fig = px.bar(df, 
             x='chou_fasman', y='stability', color_discrete_sequence=['orange'],
             title='Chou & Fasman Beta Turn Prediction', text='end_position')
fig.show()

In [None]:
from category_encoders import OneHotEncoder
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler

cols_selected = ['chou_fasman']
ohe = OneHotEncoder(cols=cols_selected, use_cat_names=True)
df_t = ohe.fit_transform(df[cols_selected+['end_position']])

#scaler = MaxAbsScaler()
X = df_t.iloc[:,:-1]
y = df_t.iloc[:, -1].fillna(df_t.iloc[:, -1].mean()) / df_t.iloc[:, -1].max()

mdl = Ridge(alpha=0.1)
mdl.fit(X,y)

pd.Series(mdl.coef_, index=X.columns).sort_values().head(10).plot.barh()

#Parker Hydrophilicity Prediction

Reference: Parker JM, Guo D, Hodges RS. New hydrophilicity scale derived from high-performance liquid chromatography peptide retention data: correlation of predicted surface residues with antigenicity and X-ray-derived accessible sites. Biochemistry. 1986 Sep 23; 25(19):5425-32.
Description: In this method, a hydrophilicity scale was constructed from peptide retention times during high-performance liquid chromatography (HPLC) on a reversed-phase column. A window of seven residues is used to analyze epitope regions: the scale value is looked up for each of the seven residues, and the arithmetic mean of the seven values is assigned to the fourth (i+3) residue in the segment.
http://tools.iedb.org/bcell/help/
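
As a rough sketch of the seven-residue averaging, a centered rolling mean in pandas assigns each window's mean to its fourth residue. The values in `toy_hydrophilicity` are hypothetical placeholders chosen for illustration and are not the published Parker scale; `parker_like_profile` is likewise an illustrative name, and leaving the first and last three residues as NaN is an assumption.

```python
import pandas as pd

# Hypothetical hydrophilicity values for illustration only; NOT the published Parker scale.
toy_hydrophilicity = {"D": 2.0, "K": 1.5, "S": 1.0, "G": 0.5, "A": 0.0, "L": -1.5, "F": -2.0}


def parker_like_profile(sequence, scale, window=7):
    """Centered rolling mean: each window's mean lands on its middle (fourth of seven) residue."""
    values = pd.Series([scale[aa] for aa in sequence],
                       index=range(1, len(sequence) + 1))  # 1-based residue positions
    return values.rolling(window=window, center=True).mean()


profile = parker_like_profile("DKSGALFAG", toy_hydrophilicity)
print(profile.dropna())  # only positions with a full seven-residue window
```

For this nine-residue toy sequence, only positions 4, 5 and 6 have a complete window, so the remaining positions stay undefined.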

In [None]:
fig = px.bar(df, 
             x='hydrophobicity', y='parker', color_discrete_sequence=['darkgreen'],
             title='Parker Hydrophilicity Prediction', text='end_position')
fig.show()

In [None]:
ax = df.groupby('parker')['end_position'].min().sort_values(ascending=True).plot(kind='barh', figsize=(12,8), color='r',
                                                                                  title='Min.estimated Parker Prediction')
plt.xlabel('Min.estimated Parker Prediction')
plt.ylabel('End Position')
plt.show()

#Emini surface accessibility scale

Reference: Emini EA, Hughes JV, Perlow DS, Boger J. Induction of hepatitis A virus-neutralizing antibody by a virus-specific synthetic peptide. J Virol. 1985 Sep;55(3):836-9.
Description: The calculation is based on a surface accessibility scale and uses a product, rather than a sum, within the window. The accessibility profile is obtained using the formula $S_n = \left(\prod_{i=1}^{6} \delta_{n+4+i}\right) (0.37)^{-6}$, where $S_n$ is the surface probability, $\delta_n$ is the fractional surface probability value, and $i$ varies from 1 to 6. A hexapeptide sequence with $S_n$ greater than 1.0 indicates an increased probability of being found on the surface.
http://tools.iedb.org/bcell/help/
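
The product-based score can be illustrated with a short sketch. It takes the six fractional surface probabilities of one hexapeptide as input; the function name `emini_like_surface_probability` and the example values are hypothetical, not taken from the Emini scale. Under this formula, six values of exactly 0.37 give a score of 1.0, which is why 1.0 acts as the reference level.

```python
import math


def emini_like_surface_probability(fractional_probs):
    """Multiply six fractional surface probabilities and scale the product by 0.37**-6."""
    if len(fractional_probs) != 6:
        raise ValueError("expected exactly six values (one hexapeptide)")
    return math.prod(fractional_probs) * 0.37 ** -6


# Hypothetical inputs for illustration only (not real scale values).
baseline = emini_like_surface_probability([0.37] * 6)  # product cancels the scaling factor
exposed = emini_like_surface_probability([0.5] * 6)    # every value above 0.37

print(baseline, exposed)
```

Because the score is a product, a single very small value in the window pulls the whole hexapeptide score down much more strongly than it would in an averaged scale.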

In [None]:
fig = px.bar(df, 
             x='aromaticity', y='emini', color_discrete_sequence=['purple'],
             title='Emini surface accessibility scale', text='end_position')
fig.show()

In [None]:
def plot_emini(col, df, title):
    fig, ax = plt.subplots(figsize=(18,6))
    df.groupby(['emini'])[col].sum().plot(rot=45, kind='bar', ax=ax, legend=True, cmap='bone')
    ax.set_yticklabels(['{:,}'.format(int(x)) for x in ax.get_yticks().tolist()])
    ax.set(title=title, xlabel='Emini')  # property names are lowercase; 'Title' raises an error
    return ax

In [None]:
plot_emini('isoelectric_point', df, 'B Cell Emini Scale');

In [None]:
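# Select several columns with a list; tuple-style selection on a groupby is no longer supported in pandas
ax = df.groupby('parker')[['stability', 'hydrophobicity']].sum().plot(kind='bar', rot=45, figsize=(12,6), logy=True,
                                                                 title='Parker Scale')
plt.xlabel('Parker Scale')
plt.ylabel('Stability & Hydrophobicity')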

plt.show()

#Other Scales in the Tutorial, though they aren't in this file.

Karplus and Schulz flexibility scale
Reference: Karplus PA, Schulz GE. Prediction of Chain Flexibility in Proteins - A tool for the Selection of Peptide Antigens. Naturwissenschaften 1985; 72:212-3.
Description: In this method, a flexibility scale was constructed from the mobility of protein segments, based on the known temperature B factors of the α-carbons of 31 proteins of known structure. The calculation is similar to the classical calculation, except that the center is the first amino acid of the six-residue window, and there are three scales for describing flexibility instead of a single one.

Bepipred-1.0 Linear Epitope Prediction
Reference: Jens Erik Pontoppidan Larsen, Ole Lund and Morten Nielsen. Improved method for predicting linear B-cell epitopes. Immunome Res. 2006; 2:2.
Description: BepiPred predicts the location of linear B-cell epitopes using a combination of a hidden Markov model and a propensity scale method. Residues with scores above the threshold (default value is 0.35) are predicted to be part of an epitope; they are colored yellow on the graph (where the Y-axis shows residue scores and the X-axis shows residue positions in the sequence) and marked with "E" in the output table. The values of the scores are not affected by the selected threshold. The table below shows the relationship between selected thresholds and the sensitivity/specificity of the prediction method, calculated on the basis of the epitope/non-epitope predictions. The table is based on a large benchmark calculation containing close to 85 B cell epitopes.

BepiPred-2.0: Sequential B-Cell Epitope Predictor
Reference: Jespersen MC, Peters B, Nielsen M, Marcatili P. BepiPred-2.0: improving sequence-based B-cell epitope prediction using conformational epitopes. Nucleic Acids Res 2017.
Description: The BepiPred-2.0 server predicts B-cell epitopes from a protein sequence, using a Random Forest algorithm trained on epitope and non-epitope amino acids determined from crystal structures. A sequential prediction smoothing is performed afterwards. Residues with scores above the threshold (default value is 0.5) are predicted to be part of an epitope; they are colored yellow on the graph (where the Y-axis shows residue scores and the X-axis shows residue positions in the sequence) and marked with "E" in the output table. The values of the scores are not affected by the selected threshold. The table below shows the relationship between selected thresholds and the sensitivity/specificity of the prediction method.
http://tools.iedb.org/bcell/help/
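
The thresholding step described for both BepiPred versions can be sketched as follows. This only illustrates how per-residue scores turn into "E" marks and contiguous segments; it is not the BepiPred model. The scores in `toy_scores` are made up, the function names are hypothetical, and using "." for non-epitope residues is an assumption.

```python
def mark_epitope_residues(scores, threshold=0.5):
    """Mark residues scoring above the threshold with 'E' and all others with '.'."""
    return ["E" if s > threshold else "." for s in scores]


def epitope_segments(scores, threshold=0.5):
    """Return (start, end) 1-based inclusive positions of consecutive residues above the threshold."""
    segments, start = [], None
    for pos, s in enumerate(scores, start=1):
        if s > threshold and start is None:
            start = pos  # a new run of predicted epitope residues begins
        elif s <= threshold and start is not None:
            segments.append((start, pos - 1))  # the run ended at the previous residue
            start = None
    if start is not None:
        segments.append((start, len(scores)))  # run extends to the end of the sequence
    return segments


# Made-up scores for illustration only.
toy_scores = [0.42, 0.55, 0.61, 0.48, 0.52, 0.58]
labels = "".join(mark_epitope_residues(toy_scores))
segments = epitope_segments(toy_scores)
print(labels, segments)
```

Changing `threshold` changes which residues are marked but leaves the scores themselves untouched, which matches the note that score values do not depend on the selected threshold.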

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import Normalizer
from sklearn.model_selection import train_test_split

In [None]:
# Fill missing values in float columns with the column mean
for c in df.columns:
    if pd.api.types.is_float_dtype(df[c]):
        df[c] = df[c].fillna(df[c].mean())  # fillna returns a new Series, so assign it back

# Fill any remaining missing values (categorical columns) with -999
df = df.fillna(-999)
# Label Encoding
for f in df.columns:
    if df[f].dtype=='object': 
        lbl = LabelEncoder()
        lbl.fit(list(df[f].values))
        df[f] = lbl.transform(list(df[f].values))
        
print('Labelling done.')

In [None]:
from sklearn.model_selection import train_test_split
# One-hot encode any remaining categorical features
df = pd.get_dummies(df) 

# There is no separate test set in this notebook: X takes every row and `test` ends up empty
X = df[:len(df)]
test = df[len(df):]

X.shape

In [None]:
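# Save the target values, then drop the target column from the features
y = df.target.values
X = X.drop(columns=['target'])  # X was sliced earlier and still held the target (label leakage)

# `test` is empty here, so the concat below adds no rows
df.drop(['target'], axis=1, inplace=True)
df = pd.concat((df, test)).reset_index(drop=True)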

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.20, random_state=0)

In [None]:
from sklearn.model_selection import KFold
# Indicate number of folds for cross validation
kfolds = KFold(n_splits=5, shuffle=True, random_state=42)

# Parameters for models
e_alphas = [0.0001, 0.0002, 0.0003, 0.0004, 0.0005, 0.0006, 0.0007]
e_l1ratio = [0.8, 0.85, 0.9, 0.95, 0.99, 1]
alphas_alt = [14.5, 14.6, 14.7, 14.8, 14.9, 15, 15.1, 15.2, 15.3, 15.4, 15.5]
alphas2 = [0.00005, 0.0001, 0.0002, 0.0003, 0.0004, 0.0005, 0.0006, 0.0007, 0.0008]

#Lasso

In [None]:
from sklearn.model_selection import KFold, cross_val_score
from sklearn.preprocessing import RobustScaler
from sklearn.pipeline import make_pipeline
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LassoCV
# Lasso Model
# max_iter must be an integer; recent scikit-learn versions reject the float 1e7
lasso = make_pipeline(RobustScaler(), LassoCV(max_iter=int(1e7), alphas = alphas2, random_state = 42, cv=kfolds))

# Printing Lasso Score with Cross-Validation
lasso_score = cross_val_score(lasso, X, y, cv=kfolds, scoring='neg_mean_squared_error')
lasso_rmse = np.sqrt(-lasso_score.mean())
print("LASSO RMSE: ", lasso_rmse)
print("LASSO STD: ", lasso_score.std())

In [None]:
# Training Model for later
lasso.fit(X_train, y_train)

In [None]:
from PIL import Image
im = Image.open("../input/ai4all-project/figures/classifier/lassoRandomForest_5gene_roc.png")
# Show the ROC figure that ships with the dataset
plt.imshow(im)
plt.axis('off')
plt.show()

In [None]:
#plt.style.use('dark_background')
def plot_count(feature, title, df, size=1):
    f, ax = plt.subplots(1,1, figsize=(4*size,4))
    total = float(len(df))
    g = sns.countplot(x=df[feature], order = df[feature].value_counts().index[:20], palette='Set2', ax=ax)
    g.set_title("Number and percentage of {}".format(title))
    if(size > 2):
        plt.xticks(rotation=90, size=8)
    for p in ax.patches:
        height = p.get_height()
        ax.text(p.get_x()+p.get_width()/2.,
                height + 3,
                '{:1.2f}%'.format(100*height/total),
                ha="center") 
    plt.show()

In [None]:
plot_count("start_position", "start_position", df,4)

In [None]:
plt.style.use('dark_background')
plot_count("end_position", "end_position", df,4)

That's it. Kaggle Notebook Runner: Marília Prata   @mpwolke