#Epitope Predictions Indicate the Presence of Two Distinct Types of Epitope-Antibody-Reactivities Determined by Epitope Profiling of Intravenous Immunoglobulins

Authors: Mitja Luštrek, Peter Lorenz, Michael Kreutzer, Zilliang Qian, Felix Steinbeck, Di Wu, Nadine Born, Bjoern Ziems, Michael Hecker, Miri Blank, Yehuda Shoenfeld, Zhiwei Cao, Michael O. Glocker, Yixue Li, Georg Fuellen, and Hans-Jürgen Thiesen - Francesco Pappalardo, Editor
PLoS One. 2013; 8(11): e78605. - Published online 2013 Nov 11. doi: 10.1371/journal.pone.0078605

Computational prediction of linear B cell epitopes was conducted using machine learning with an ensemble of classifiers in combination with position weight matrix (PWM) analysis. 

The first attempts to predict continuous B cell epitopes were based on propensity scales. Current state-of-the-art epitope prediction generally uses machine learning approaches. The peptide chips were incubated with commercial intravenous immunoglobulin fractions (IVIG). 

The peptides that bind antibodies are considered to contain at least one epitope, i.e., one antibody binding site. The input data set was split in half by random sampling to form a training and a test set.

The training and test sets contain roughly three times more non-binding than binding peptides. Such imbalanced data sets might reduce the performance of classifiers trained by machine learning algorithms. To handle imbalanced data sets, two methods were applied, random oversampling and undersampling.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3823795/
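
The two balancing methods named above can be sketched in a few lines of plain Python. This is only an illustration under my own assumptions: the function names, the toy peptides and the fixed seed are hypothetical, and the paper does not state how the authors implemented the resampling.

```python
import random


def _group_by_class(samples, labels):
    """Collect the samples of each class label in a dictionary."""
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    return by_class


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen items of the smaller classes until all classes are equally large."""
    rng = random.Random(seed)
    by_class = _group_by_class(samples, labels)
    target = max(len(items) for items in by_class.values())
    balanced = []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        balanced.extend((sample, label) for sample in items + extra)
    rng.shuffle(balanced)
    return balanced


def random_undersample(samples, labels, seed=0):
    """Keep a random subset of the larger classes so all classes match the smallest one."""
    rng = random.Random(seed)
    by_class = _group_by_class(samples, labels)
    target = min(len(items) for items in by_class.values())
    balanced = []
    for label, items in by_class.items():
        balanced.extend((sample, label) for sample in rng.sample(items, target))
    rng.shuffle(balanced)
    return balanced


# Toy data with the roughly 3:1 imbalance described above (hypothetical peptides).
peptides = ["AAAA", "CCCC", "DDDD", "EEEE", "FFFF", "GGGG", "YYYY", "WWWW"]
binding = [0, 0, 0, 0, 0, 0, 1, 1]

over = random_oversample(peptides, binding)
under = random_undersample(peptides, binding)
print(len(over), len(under))
```

Oversampling keeps every original peptide but repeats some binders, while undersampling discards non-binders; resampling should be applied to the training set only, so that the test set keeps its original class ratio.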

#Regulatory T cell (Treg) epitopes, now known as Tregitopes

Tregitopes are found in IgG, the main component of intravenous immunoglobulin therapy (IVIg). They provide one explanation for the expansion and activation of Treg cells following IVIg treatment.

![](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcR89lzYCi4Vxbt6i0tv2w27hvgR6QfFtEegLw&usqp=CAU)epivax.com

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt #visualization
%matplotlib inline
import plotly.express as px
import plotly.graph_objects as go
import plotly.offline as py
import seaborn as sns

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

#Coronavirus epitope prediction from highly conserved region of spike protein

Author: Valentina Yurina - Clin Exp Vaccine Res 2020;9:169-173 -https://doi.org/10.7774/cevr.2020.9.2.169 - pISSN 2287-3651 • eISSN 2287-366X 

Purpose: The aim of this research was to predict epitopes for the spike protein of the coronavirus family. Coronaviruses are highly evolved viruses that have caused several outbreaks in the past decades. Therefore, it is crucial to design a global vaccine candidate to prevent future coronavirus outbreaks.

Materials and Methods: The spike protein amino acid sequences from nine members of the coronavirus family were searched for in the UniProt database. The spike protein sequences were aligned using the Clustal method. The highly conserved amino acids were analyzed for linear and continuous B cell epitopes and for T cell epitopes.

Results: The alignment showed that there is a highly conserved region in the extracellular domain of the spike protein. Using prediction methods on this highly conserved region, B cell and T cell epitopes of the spike protein were derived.

Conclusion: Several different prediction methods identified B cell and T cell epitopes in the highly conserved region, which makes it a promising coronavirus vaccine candidate.

![](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQu4YuKHtelKCe6ignKtJZMDeHuQbMPvDNqOQ&usqp=CAU)
https://ecevr.org/Synapse/Data/PDFData/9995CEVR/cevr-9-169.pdf

![](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3823795/bin/pone.0078605.g001.jpg)
Data set preparation and computational workflow for the prediction of epitope-antibody-reactivities (EAR) determined for IVIG antibodies.
Rectangles represent groups of peptides (numbers in each group are indicated); boxes with rounded corners indicate the applied classification approaches. Numbered notes in the figure:

1. All peptides printed on the microarrays
2. Removal of false positive (binding) peptides (e.g. those reactive with secondary antibodies)
3. Separation of the peptide set according to signal intensities of EAR into non-binders, binders and unassigned peptides
4. Classification approach ML-advanced = machine learning with an ensemble classifier
5. Number of peptides predicted to be non-binding/binding, separated into those predicted correctly (underlined) and incorrectly
6. Classification approach PWM = position weight matrix
7. Classification approach ML-simple = simplified machine learning using human-understandable attributes
8. Capital letters A–H indicate subsets of peptides assigned in supplementary information table S1 and explained there in the legend

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3823795/

In [None]:
nRowsRead = 1000 # specify 'None' if want to read whole file
df = pd.read_csv('../input/cusersmarildownloadspeptidecsv/peptide.csv', delimiter=';', encoding = "ISO-8859-1", nrows = nRowsRead)
df.dataframeName = 'peptide.csv'
nRow, nCol = df.shape
print(f'There are {nRow} rows and {nCol} columns')
df.head()

![](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSiP_3UF8oo5O6nluFqxU7fQmYeZTHMEM9Gbw&usqp=CAU)onlinelibrary.wiley.com

#Simplified Machine Learning Using Human Understandable Attributes (ML-simple)

A simplified machine learning approach was conducted using human-understandable attributes and rules (ML-simple). The principal advantage of the additional PWM approach was that it is more readily readable and makes it easier to identify position effects and amino acid patterns that can be explored experimentally.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3823795/

In [None]:
df.plot()

In [None]:
fig = px.pie(df,
             values="PepPosition",
             names="Prediction_ML_advanced_original",
             template="seaborn")
fig.update_traces(rotation=90, pull=0.05, textinfo="percent+label")
fig.show()

#Epitope Prediction - Attributes: High Aromaticity, Low Polarity and High Tyrosine Content

Classifiers for epitope prediction were trained on the training set and applied to the test set. Both sets contained three times more non-binding than binding peptides, but this may not necessarily be the case for potential new data. 

AUC scores and accuracies characterized ML-advanced as slightly better than the machine learning method of El-Manzalawy when trained on both the original and the balanced training sets. The corresponding ROC curves are visualized together with those derived from PWM classification of the training set.

To determine the attributes that enable the prediction of “binding”, peptides of the training set were classified in a machine learning approach named ML-simple by using attributes reflecting human-readable rules.

Attributes significantly associated with prediction of “binding” were in particular high aromaticity, low polarity and high tyrosine content. However, the moderate percentages of correctly classified peptides for the individual attributes illustrate once more the advantage of using an ensemble approach for the most accurate prediction. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3823795/
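
To make such human-readable attributes concrete, here is a minimal sketch that computes simple composition fractions for a peptide. The definitions are my assumptions, not the paper's: aromaticity is taken as the fraction of Y, W and F residues, tyrosine content as the fraction of Y, and `nqe_fraction` (N, Q, E) is only a rough stand-in for the polar residues mentioned later in the text, not the polarity scale the authors used. The function name and example peptide are hypothetical.

```python
AROMATIC = set("YWF")   # tyrosine, tryptophan, phenylalanine
POLAR_NQE = set("NQE")  # asparagine, glutamine, glutamic acid


def composition_attributes(peptide):
    """Return simple composition fractions for a peptide given in one-letter code."""
    n = len(peptide)
    if n == 0:
        raise ValueError("peptide must not be empty")
    return {
        "aromaticity": sum(aa in AROMATIC for aa in peptide) / n,
        "tyrosine": peptide.count("Y") / n,
        "nqe_fraction": sum(aa in POLAR_NQE for aa in peptide) / n,
    }


# Hypothetical 10-residue peptide: 4 aromatic residues, 2 tyrosines, 3 N/Q/E residues.
attrs = composition_attributes("YYWFNQEAAA")
print(attrs)
```

A rule such as "aromaticity above some threshold predicts binding" can then be read and checked by a person, which is the point of the ML-simple approach.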

In [None]:
fig = px.pie(df,
             values="NQE",
             names="Prediction_EL_Manzalawy_balanced",
             template="seaborn")
fig.update_traces(rotation=90, pull=0.05, textinfo="percent+label")
fig.show()

#Predicting linear B-cell epitopes using string kernels

Authors: Yasser El-Manzalawy, Drena Dobbs, Vasant Honavar - PMID: 18496882 PMCID: PMC2683948 DOI: 10.1002/jmr.893


The identification and characterization of B-cell epitopes play an important role in vaccine design, immunodiagnostic tests, and antibody production.

Therefore, computational tools for reliably predicting linear B-cell epitopes are highly desirable. El-Manzalawy, Dobbs and Honavar (authors above) evaluated Support Vector Machine (SVM) classifiers trained utilizing five different kernel methods using fivefold cross-validation on a homology-reduced data set of 701 linear B-cell epitopes, extracted from Bcipep database, and 701 non-epitopes, randomly extracted from SwissProt sequences.

Analysis of the data sets used and the results of this comparison show that conclusions about the relative performance of different B-cell epitope prediction methods drawn on the basis of experiments using data sets of unique B-cell epitopes are likely to yield overly optimistic estimates of performance of evaluated methods. 

This argues for the use of carefully homology-reduced data sets in comparing B-cell epitope prediction methods to avoid misleading conclusions about how different methods compare to each other. https://pubmed.ncbi.nlm.nih.gov/18496882/
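
As an illustration of what a string kernel computes, the sketch below implements a k-spectrum kernel: the similarity of two sequences is the dot product of their k-mer count vectors. The summary above does not list the five kernels that were evaluated, so treat this as one common example of the family rather than the authors' exact method; the function names and the default `k=3` are my assumptions.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count every overlapping substring of length k in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def spectrum_kernel(seq_a, seq_b, k=3):
    """k-spectrum kernel: dot product of the k-mer count vectors of two sequences."""
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # Counter returns 0 for k-mers that do not occur in seq_b.
    return sum(count * counts_b[kmer] for kmer, count in counts_a.items())


k_ab = spectrum_kernel("ABCABC", "ABCD", k=3)   # shared 3-mer "ABC": 2 * 1
k_aa = spectrum_kernel("AAAA", "AAAA", k=2)     # "AA" occurs 3 times in each: 3 * 3
print(k_ab, k_aa)
```

A matrix of such kernel values between all pairs of peptides can be handed to an SVM as a precomputed kernel, so the classifier never needs a fixed-length numeric encoding of the sequences.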

![](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3823795/bin/pone.0078605.g002.jpg)Performance comparison of the tested classifiers for IVIG binding prediction on the test set peptides.

ROC analysis for the authors' machine learning approach with an ensemble classifier (ML-advanced), the machine learning method of El-Manzalawy et al. (2008), and a PWM approach using a PWM derived from the training set. Both machine learning approaches were trained on the original training set (“original”: three times more non-binding than binding peptides) and on the balanced training set (“balanced”: equal number of binding and non-binding peptides) and finally applied to the test set. AUC values are indicated as well. Note that the curves based on the original and balanced training sets of the authors' ML-advanced method show almost complete overlap. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3823795/
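
For readers unfamiliar with the AUC values mentioned in the caption, here is a small sketch of what the number means: the probability that a randomly chosen binding peptide receives a higher score than a randomly chosen non-binding one, with ties counted as one half. The function name and the toy labels and scores are hypothetical; `sklearn.metrics.roc_auc_score` computes the same quantity more efficiently.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve computed from pairwise comparisons of scores."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(positives) * len(negatives))


# Toy example: 1 = binding, 0 = non-binding (hypothetical classifier scores).
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect separation of binders and non-binders, which is why it is less sensitive to the 3:1 class imbalance than plain accuracy.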

In [None]:
# Plotting a boxplot for the 'Binding' column using the 'seaborn' library:
sns.boxplot(x=df['Binding'], color='brown')
plt.show()

In [None]:
sns.boxplot(x=df['PepPosition'], color='brown')
plt.show()

#Accuracy: expected and not expected

While experimenting with various attributes and machine learning algorithms, the authors discovered that many of them can predict epitopes with an accuracy of around 80%. Not all classifiers misclassified the same peptides, which is why combining the classifiers into an ensemble improved the performance. However, it seemed that 15–20% of the peptides resisted correct first round classification irrespective of the method used.

The accuracy on the 1st degree classifiable peptides was close to 100%, which was also as expected. The value did not reach 100% because the classifier was exclusively trained on the 1st degree classifiable peptides, whereas the classifier that divided the peptides into classifiable and unclassifiable was trained on all peptides of the training set. However, the accuracy on the 1st degree unclassifiable peptides was also high (91.5%), which was not expected.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3823795/

In [None]:
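# NQE most likely denotes the asparagine (N), glutamine (Q) and glutamic acid (E) content
# histplot replaces the deprecated distplot
sns.histplot(df['NQE'], kde=True, color='green')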
plt.show()

#Evidence for Two Types of EAR (Epitope Antibody Reactivities)

The authors' machine learning approaches initially identified two main classes of peptides, 1st degree classifiable and unclassifiable peptides. In order to directly relate this classification to information on the amino acid composition, the characteristics of the respective peptides were visualized with the help of two ratio PWMs in separate scatter graphs for the “binding” and “non-binding” peptides. The visualization indicated that the 1st degree classifiable and unclassifiable peptides separate into two distinct groups for both the “binding” and the “non-binding” peptides, due to their opposite physico-chemical characteristics. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3823795/

In [None]:
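# YWF most likely denotes the tyrosine (Y), tryptophan (W) and phenylalanine (F) content
sns.histplot(df['YWF'], kde=True, color='purple')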
plt.show()

In [None]:
# Now taking the logarithm of the PepPosition data and then plotting its distribution:
sns.histplot(np.log(df["PepPosition"]), kde=True, color='red')
plt.show()

In [None]:
# Pass the column values (not the column name) and a string label
plt.hist(df['PepPosition'], histtype='stepfilled', label='PepPosition', color='cyan', bins=10)
plt.show()

I copied the lines below from Rachit Shukla  https://www.kaggle.com/rax007/insurance-company-premium

Notice from the distribution plot that although the 'PepPosition' variable is treated as a float by Python, it is actually an integer variable: it takes discrete values, not continuous ones. It reminds me of a categorical variable, so now is the time to introduce categorical variables and their analysis:

Categorical variables are discrete in nature and are stored as the 'object' datatype. During the univariate analysis of categorical variables, the task is to look at 'count' and 'count%'.

In [None]:
#df['PepPosition'].value_counts().plot.bar(color='green')
#plt.show()

In [None]:
# Word cloud of the IVIG_sample column
from wordcloud import WordCloud
text = " ".join(str(each) for each in df.IVIG_sample)
# Create and generate a word cloud image:
wordcloud = WordCloud(max_words=200,colormap='Set3', background_color="black").generate(text)
# One figure is enough; the extra plt.figure calls only created empty figures
plt.figure(figsize=(15,10))
# Display the generated image:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

In [None]:
# Word cloud of the Description column
from wordcloud import WordCloud
text = " ".join(str(each) for each in df.Description)
# Create and generate a word cloud image:
wordcloud = WordCloud(max_words=200,colormap='GnBu', background_color="white").generate(text)
# One figure is enough; the extra plt.figure calls only created empty figures
plt.figure(figsize=(15,10))
# Display the generated image:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

In [None]:
#Codes by Pooja Jain https://www.kaggle.com/jainpooja/av-guided-hackathon-predict-youtube-likes/notebook

text_cols = ['PepSPID', 'Description', 'IVIG_sample']

from wordcloud import WordCloud, STOPWORDS

wc = WordCloud(stopwords = set(list(STOPWORDS) + ['|']), random_state = 42)
fig, axes = plt.subplots(2, 2, figsize=(20, 12))
axes = [ax for axes_row in axes for ax in axes_row]

for i, c in enumerate(text_cols):
  op = wc.generate(str(df[c]))
  _ = axes[i].imshow(op)
  _ = axes[i].set_title(c.upper(), fontsize=24)
  _ = axes[i].axis('off')

_ = fig.delaxes(axes[3])

![](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3823795/bin/pone.0078605.g007.jpg)
Distribution of peptides initially scored as 1st degree classifiable and unclassifiable by ML-simple using PWM measures.

Peptides of the training set were assigned to the groups 1st degree classifiable and unclassifiable by the authors' “simplified machine learning using human-understandable attributes” (ML-simple) approach. They were further divided into peptides reacting with IVIG (“binding”; panel A) or not reactive with IVIG (“non-binding”; panel B). In a next step each peptide was assigned values using two ratio PWMs. The x-axis values derive from a PWM that was based on all peptides present in the training set. They are calculated by multiplying the ratios of the relative frequencies of each amino acid at each position in a peptide sequence for the group “binding” (panel A) and “non-binding” (panel B), respectively. The y-axis values were calculated in the same way; however, only the 1st degree unclassifiable peptides present in the training set were used as input of the PWM. Each peptide is represented by one dot. Peptides in red in panel A correspond to the type I EAR while those in black depict the type II EAR. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3823795/
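
The ratio-PWM scoring described in this legend can be sketched as follows. This is my reading of the description, not the authors' code: each matrix entry is the ratio of the relative frequency of an amino acid at a position among binding peptides to its frequency among non-binding peptides, and a peptide's score is the product of the entries along its sequence. The pseudocount (to avoid division by zero), the equal-length toy peptides and all function names are assumptions.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def position_frequencies(peptides, pseudocount=1.0):
    """Relative frequency of each amino acid at each position (peptides of equal length)."""
    length = len(peptides[0])
    total = len(peptides) + pseudocount * len(AMINO_ACIDS)
    frequencies = []
    for pos in range(length):
        counts = Counter(p[pos] for p in peptides)
        frequencies.append({aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS})
    return frequencies


def ratio_pwm(binding, non_binding, pseudocount=1.0):
    """Per-position ratio of binding to non-binding amino acid frequencies."""
    freq_b = position_frequencies(binding, pseudocount)
    freq_n = position_frequencies(non_binding, pseudocount)
    return [{aa: fb[aa] / fn[aa] for aa in AMINO_ACIDS} for fb, fn in zip(freq_b, freq_n)]


def pwm_score(peptide, pwm):
    """Product of the per-position ratios, computed via logs for numerical stability."""
    return math.exp(sum(math.log(pwm[i][aa]) for i, aa in enumerate(peptide)))


# Hypothetical toy peptides: aromatic "binders" versus N/Q/E-rich "non-binders".
binders = ["YWFY", "YYFW", "WYFY"]
non_binders = ["NQEN", "QNEQ", "NEQE"]
pwm = ratio_pwm(binders, non_binders)

score_binder = pwm_score("YWFY", pwm)
score_non_binder = pwm_score("NQEN", pwm)
print(score_binder, score_non_binder)
```

Scores above 1 mean the peptide looks more like the binding group at most positions, scores below 1 the opposite; plotting two such scores from differently trained PWMs against each other gives a scatter graph like the one in the figure.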

#Everything above comes from the authors. From here on, it is my own work.

Please, don't laugh.

In [None]:
# NQE most likely denotes the asparagine (N), glutamine (Q) and glutamic acid (E) content
sns.stripplot(x="NQE", y="PepPosition", data=df, jitter=True);

In [None]:
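# Keyword arguments are required by recent seaborn versions
# NQE most likely denotes the asparagine (N), glutamine (Q) and glutamic acid (E) content
sns.catplot(x="NQE", y="PepPosition", data=df)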

In [None]:
sns.set_style('whitegrid')
# catplot is figure-level and creates its own figure, so no plt.figure call is needed
sns.catplot(y='PepPosition',x='NQE',data=df,col='Binding',sharey=False)

In [None]:
sns.set_style('whitegrid')
# catplot is figure-level and creates its own figure, so no plt.figure call is needed
sns.catplot(y='PepPosition',x='YWF',data=df,col='Binding',sharey=False)

In [None]:
# For this analysis I'll use a scatter plot (x and y passed as keywords):
sns.scatterplot(x=df['NQE'], y=df['PepPosition'])
plt.show()

In [None]:
px.scatter(df, x='Prediction_EL_Manzalawy_balanced', y='PepPosition', color='Binding')

#Building classifier ensembles for B-cell epitope prediction

Authors:  EL-Manzalawy Y, Vasant Honavar
Methods in Molecular Biology (Clifton, N.J.), 31 Dec 2013, 1184:285-294
DOI: 10.1007/978-1-4939-1115-8_15 PMID: 25048130 PMCID: PMC4385709

Identification of B-cell epitopes in target antigens is a critical step in epitope-driven vaccine design, immunodiagnostic tests, and antibody production.

B-cell epitopes could be linear, i.e., a contiguous amino acid sequence fragment of an antigen, or conformational, i.e., amino acids that are often not contiguous in the primary sequence but appear in close proximity within the folded 3D antigen structure. 

Numerous computational methods have been proposed for predicting both types of B-cell epitopes. However, the development of tools for reliably predicting B-cell epitopes remains a major challenge in immunoinformatics.

Classifier ensembles are a promising approach for combining a set of classifiers such that the overall performance of the resulting ensemble is better than the predictive performance of the best individual classifier.
http://europepmc.org/article/PMC/4385709 
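
A minimal way to see why an ensemble can beat its best member is a majority vote over classifiers that make different mistakes. The sketch below is only an illustration with made-up predictions; the function names are hypothetical, and the cited chapter covers more elaborate combination schemes than a plain vote.

```python
from collections import Counter


def majority_vote(predictions_per_classifier):
    """Combine equally long prediction lists (one per classifier) by majority vote."""
    combined = []
    for votes in zip(*predictions_per_classifier):
        # With an odd number of binary classifiers there are no ties.
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


def accuracy(predicted, truth):
    """Fraction of predictions that match the true labels."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


# Hypothetical labels (1 = binding) and three classifiers that each err on a different peptide.
truth = [1, 0, 1, 1, 0, 0]
clf_a = [1, 0, 0, 1, 0, 0]
clf_b = [1, 1, 1, 1, 0, 0]
clf_c = [1, 0, 1, 0, 0, 0]

ensemble = majority_vote([clf_a, clf_b, clf_c])
print(accuracy(clf_a, truth), accuracy(ensemble, truth))
```

Each single classifier gets five of six peptides right, while the vote is correct on all six because no two classifiers fail on the same peptide. When the errors are shared, as with the peptides that resist classification by every method, voting cannot help.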

In [None]:
px.scatter(df, x='Prediction_ML_advanced_original', y='PepPosition', color='Binding')

#FastRNABindR: Fast and accurate prediction of protein-RNA interface residues

Authors: Yasser EL-Manzalawy, Mostafa Abbas, Qutaibah M. Malluhi, Vasant Honavar
July 2016, PLoS ONE 11(7):e0158445 - DOI: 10.1371/journal.pone.0158445

A wide range of biological processes, including regulation of gene expression, protein synthesis, and replication and assembly of many viruses are mediated by RNA-protein interactions.

However, experimental determination of the structures of protein-RNA complexes is expensive and technically challenging. Hence, a number of computational tools have been developed for predicting protein-RNA interfaces.

The computational efforts needed for generating PSSMs severely limits the practical utility of protein-RNA interface prediction servers. In this work, the authors experimented with two approaches, random sampling and sequence similarity reduction, for extracting a representative reference database of protein sequences from more than 50 million protein sequences in UniRef100.

Their results suggest that random sampled databases produce better PSSM profiles (in terms of the number of hits used to generate the profile and the distance of the generated profile to the corresponding profile generated using the entire UniRef100 data as well as the accuracy of the machine learning classifier trained using these profiles).

Based on their results, they developed FastRNABindR, an improved version of RNABindR for predicting protein-RNA interface residues using PSSM profiles generated using 1% of the UniRef100 sequences sampled uniformly at random.

FastRNABindR is the only protein-RNA interface residue prediction online server that requires generation of PSSM profiles for query sequences and accepts hundreds of protein sequences per submission.

Their approach for determining the optimal BLAST database for a protein-RNA interface residue classification task has the potential of substantially speeding up, and hence increasing the practical utility of, other amino acid sequence based predictors of protein-protein and protein-DNA interfaces. https://www.researchgate.net/publication/304993225_FastRNABindR_Fast_and_accurate_prediction_of_protein-RNA_interface_residues

In [None]:
px.scatter(df, x='Peptide', y='PepPosition', color='Binding')

![](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQ7ph5SZePZuB_DZazVHWwZIDyCVeaqujEQ6w&usqp=CAU)sciencedirect.com

In [None]:
px.scatter(df, x='NQE', y='PepPosition', color='Binding', color_discrete_sequence=["red", "green", "blue"])

#Limits of Epitope Prediction - Immune system classifies based on more specific characteristics using different modes of antibody binding.

The design of a high-performance classifier for epitope prediction raises the principal question of the limits of such machine learning approaches. There may be representations of peptides, which could be used as attributes for machine learning, that capture all relevant epitope features for antibody binding and ignore all properties irrelevant for antibody binding. It appears that such a representation needs to include the information on whether a peptide belongs to the classifiable or unclassifiable group (at least 1st degree, if not also 2nd).

However, the authors do not yet know how to obtain this information without knowing the class of the peptide (binding or non-binding).

Some peptides belong to groups that have common characteristics, which can be learned by machine learning algorithms, so they are classified correctly – these are the 1st degree classifiable ones. The remaining peptides are not classified correctly because their broad characteristics point to the wrong classification, while the immune system classifies them based on more specific characteristics, most likely by using different modes of antibody binding. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3823795/

In [None]:
num_vars=[x for x in df.columns if df[x].dtypes!='O']
num_vars

#Type I EAR (Epitope Antibody Reactivities) -  T cell activation

The typical classifiable epitopes bound by IVIG antibodies (Type I EAR) share several properties expected from the literature: high frequency of tyrosine, low hydrophobicity and high antigenicity.

Antibody generation giving rise to Type I EAR (Epitope Antibody Reactivities) might be induced directly by T cell activation elicited by peptide binding to MHC class II complexes. This is illustrated by the ROC curve of the 1st degree classifiable peptides, which are predicted to bind MHC class II with a higher AUC score than all peptides of the training set.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3823795/

In [None]:
#Code by Pranay https://www.kaggle.com/phuskisher/epitope-prediction-bcell-auc-0-9

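# Size the grid from the number of numeric columns so the loop cannot run past the axes
n_cols = 2
n_rows = int(np.ceil(len(num_vars) / n_cols))
fig, axes = plt.subplots(n_rows, n_cols, figsize=(20, 10 * n_rows), squeeze=False)
for i, j in enumerate(num_vars):
    ax = axes[i // n_cols, i % n_cols]
    sns.kdeplot(df[j], ax=ax)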

#NQE (asparagine, glutamine and glutamic acid content)

In [None]:
# NQE most likely denotes the asparagine (N), glutamine (Q) and glutamic acid (E) content
#sns.barplot(df['PepPosition'],df['NQE'],color='b',errcolor='c',errwidth='.26')
#plt.xticks(rotation=45)
#plt.show()
#It's messy, though I save the snippet for a next time.

#Type II EAR - Antibodies binding generated independently of specific MHC-peptide presentation to T cell receptors

(MHC- Major Histocompatibility Complex)

A second class of epitopes is represented by 1st degree unclassifiable peptides bound by IVIG (Type II EAR). Their properties are opposed to those of Type I EAR peptides, i.e. they are specifically enriched in the polar amino acids asparagine, glutamine and glutamic acid and display low aromaticity.

The polarity in this group would fit previous epitope descriptions. Antibodies binding to these peptides are suspected to have been generated independently of specific MHC-peptide (MHC: major histocompatibility complex) presentation to T cell receptors (see the ROC curve of the 1st degree unclassifiable peptides). https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3823795/

In [None]:
# NQE most likely denotes the asparagine (N), glutamine (Q) and glutamic acid (E) content
px.line(df,x="NQE",y="PepPosition",width = 500)

In [None]:
#Code by Chumajin https://www.kaggle.com/chumajin/eda-for-biginner/notebook

px.histogram(
    df, 
    x='PepPosition',
    nbins=50,
    width = 500
)

The existence of two distinct EAR modes on the epitope level might have counterparts on the corresponding paratope level of binding antibodies. The question is whether these two EAR groups can be related to specific types of antibody species and whether paratope binding rules can be established in the future as well. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3823795/

In [None]:
fig = px.bar(df, x= "NQE", y= "PepPosition", color_discrete_sequence=['crimson'], title="NQE Content & Peptide Position")
fig.show()

#Immunological Perspectives

Antibodies from healthy human donors have been found by epitope profiling of IVIG preparations to bind to a vast number of peptides of human origin. 

In absolute terms, the human immune system should not have any antibodies directed against its own proteome. However, because of the way antibodies are generated, sophisticated mechanisms have to be in place either to prevent or minimize the generation of autoreactive antibodies by B cells, or to remove B cells (e.g. by autophagy) so that highly reactive auto-antibodies are not generated.

Possibly, the immunoglobulin locus including the machinery regulating humoral in conjunction with cellular immune responses has been modified over millions of years under constraints to select immunoglobulin structures harmless to their own organisms.

About 25% of peptides in the authors' training and test sets were scored as “binding” to IVIG with high confidence. However, this percentage ignores the number of peptides tested in total and is probably overestimated.

EAR (Epitope Antibody Reactivities) analysis leads to the hypothesis that under physiological conditions immunoglobulins possibly contribute to the homeostasis of the immune system by constantly capturing circulating peptides that originate from human proteins. The postulated scavenger function of eliminating self-peptides should not lead to inflammatory processes as exemplified by autoimmune diseases. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3823795/

In [None]:
fig = px.bar(df, x= "YWF", y= "PepPosition", color_discrete_sequence=['#2B3A67'], title="YWF (Aromatic Residue) Content & Peptide Position")
fig.show()

In [None]:
#Code by Tushar Mishra https://www.kaggle.com/tmchls/covid-19-epitope-prediction-task-1-b-cell/notebook
features=["Binding","NQE","YWF","PepPosition"]
plt.figure(figsize=(20,20))
plt.subplots_adjust(hspace=2.0)
j=1
for i in features:
    plt.subplot(4,5,j)
    sns.histplot(df[i], kde=True)  # histplot replaces the deprecated distplot
    j+=1

#Handling Missing Values  

In [None]:
df.isnull().sum()

In [None]:
# categorical features with missing values
categorical_nan = [feature for feature in df.columns if df[feature].isna().sum()>0 and df[feature].dtypes=='O']
print(categorical_nan)

In [None]:
# replacing missing values in categorical features
for feature in categorical_nan:
    df[feature] = df[feature].fillna('None')

In [None]:
df[categorical_nan].isna().sum()

In [None]:
# Let's handle numerical features with NaN values (any column with at least one missing value)
numerical_nan = [feature for feature in df.columns if df[feature].isna().sum()>0 and df[feature].dtypes!='O']
numerical_nan

In [None]:
df[numerical_nan].isna().sum()

In [None]:
## Replacing the numerical missing values

for feature in numerical_nan:
    ## We will replace by using median since there are outliers
    median_value=df[feature].median()
    # Assign the result back instead of using inplace=True on a column selection
    df[feature] = df[feature].fillna(median_value)

df[numerical_nan].isnull().sum()

In [None]:
from sklearn.preprocessing import LabelEncoder

#fill in mean for floats
for c in df.columns:
    if df[c].dtype=='float16' or  df[c].dtype=='float32' or  df[c].dtype=='float64':
        # fillna returns a new Series, so the result has to be assigned back
        df[c] = df[c].fillna(df[c].mean())

#fill in -999 for categoricals
df = df.fillna(-999)
# Label Encoding
for f in df.columns:
    if df[f].dtype=='object': 
        lbl = LabelEncoder()
        lbl.fit(list(df[f].values))
        df[f] = lbl.transform(list(df[f].values))
        
print('Labelling done.')

#Dummification

In [None]:
df = pd.get_dummies(df)

In [None]:
df.head()

In [None]:
#Code By Pranay  https://www.kaggle.com/phuskisher/epitope-prediction-bcell-auc-0-9

for i in num_vars:
    fig=px.histogram(df, x=i, color='NQE')
    fig.show()

In [None]:
for i in num_vars:
    fig=px.box(df, y=i, color='YWF')
    fig.show()

In [None]:
corr = df.corr()
# Boolean mask that hides the upper triangle (entries above the diagonal)
mask = np.triu(np.ones_like(corr, dtype=bool), k=1)
fig,ax = plt.subplots()
fig.set_size_inches(20,10)
sns.heatmap(corr, mask=mask, vmax=0.9, square=True, annot=True, cmap='YlGnBu')
plt.show()

#Training the Model, by Rachit Shukla https://www.kaggle.com/rax007/insurance-company-premium

First, create a set of independent variables from the train dataset. Drop the 'target' variable (here 'PepPosition') from it using axis=1, which specifies that the drop happens along the columns. Store this set in an object called "x" as follows:

In [None]:
x = df.drop('PepPosition',axis=1)

 Keep only the 'target' variable in an object y:

In [None]:
y = df['PepPosition']

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
train_x, test_x, train_y, test_y = train_test_split(x, y, random_state=11111, shuffle=True, train_size=None, test_size=None)

#I don't have train or test in this Dataset. I have previously made dummification above.

In [None]:
# Creating dummies of both train_x and test_x sets:
train_x = pd.get_dummies(train_x)
test_x = pd.get_dummies(test_x)

#Check the proportion of 1s and 0s in the dependent variables of train & test that are just created.

In [None]:
train_y.value_counts()/len(train_y)

It was supposed to be 0 and 1, but I don't have a train file, and the result above doesn't explain much.

In [None]:
test_y.value_counts()/len(test_y)

#Logistic Regression Model

In [None]:
from sklearn.linear_model import LogisticRegression

#Create a LogisticRegression object, named logr, so that its "fit" and "predict" functions can be used.

In [None]:
logr = LogisticRegression(n_jobs=1,max_iter=100,random_state=11111)

In [None]:
logr.fit(train_x,train_y)

In [None]:
logr.score(train_x,train_y)

In [None]:
logr.score(test_x,test_y)

I am just copying the explanation below for the moment when I have train/test files.

This means there's 58.1% accuracy on my train dataset and 39.6% accuracy on the test dataset. A training score of 58.1% is not that great for a good prediction of an unseen test dataset, and the gap to the test score shows that the model does not generalize well. The test set that I've been using till now was made out of the train dataset. The explanation I copied also refers to a separate "test.csv" file with totally new records, which would be unseen data for my model; for this reason, my LogisticRegression model would not give reliable predictions for such a dataset.

#Decision Tree Classifier Model

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
dtc= DecisionTreeClassifier(max_depth=None, max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,min_weight_fraction_leaf=0.0)

In [None]:
dtc.fit(train_x,train_y)

In [None]:
dtc.score(train_x,train_y)

In [None]:
dtc.score(test_x,test_y)

I just copied and changed the numbers, since there is no train/test 

So there's 100% accuracy on my train dataset and 49.6% accuracy on the test dataset. A perfect training score combined with a test accuracy of only 49.6% suggests that the tree is overfitting the training data. So I'll take up one more model, which is called Random Forest.

#Random Forest Model

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
rfc = RandomForestClassifier(n_estimators = 100)

In [None]:
rfc.fit(train_x,train_y)

In [None]:
rfc.score(train_x,train_y)

In [None]:
rfc.score(test_x,test_y)

As the Random Forest model has achieved a significant score, use this model to get the predictions. Start with getting predictions on the train_x set as follows:

In [None]:
rfc.predict(train_x)

In [None]:
# Similarly getting predictions on test_x as follows:
rfc.predict(test_x)

Compare the sizes of this and the train_x set to see how different the test dataset currently is from train_x.

We don't have test and train files, so df.shape is used here in place of test.shape.

In [None]:
df.shape , train_x.shape

We don't have test/train. Save the lines below for another Dataset.

In [None]:
#rfc.predict(test)

In [None]:
#test_prediction = rfc.predict(test)

In [None]:
#submission = pd.DataFrame()
#submission['target'] = test_prediction
#submission.to_csv('customer_premium_on_time.csv', header=True, index=False)

In [None]:
#Code by Olga Belitskaya https://www.kaggle.com/olgabelitskaya/sequential-data/comments
from IPython.display import display,HTML
c1,c2,f1,f2,fs1,fs2=\
'#eb3434','#eb3446','Akronim','Smokum',30,15
def dhtml(string,fontcolor=c1,font=f1,fontsize=fs1):
    display(HTML("""<style>
    @import 'https://fonts.googleapis.com/css?family="""\
    +font+"""&effect=3d-float';</style>
    <h1 class='font-effect-3d-float' style='font-family:"""+\
    font+"""; color:"""+fontcolor+"""; font-size:"""+\
    str(fontsize)+"""px;'>%s</h1>"""%string))
    
    
dhtml('Marília Prata, @mpwolke was Here' )