# PROJECT 1 
### Author : Sourav Yadav (17285)

Goal: predict the state of a compound, active or inactive, using an inhibition cut-off of 75%. A compound is active if its inhibition value is greater than 75%; otherwise it is inactive.


### Our Approach : 

Rather than predicting inhibition values directly, we classify each compound as active or inactive (active means an inhibition value > 75%), mainly because the classifier performs considerably better. Reducing the inhibition value to a binary label loses some detail, but the label is still a good predictor of what might work and what might not. It also greatly reduces our working set.
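The labelling rule can be sketched in a few lines. The inhibition values below are hypothetical and only illustrate how the strict 75% cut-off behaves at the boundary.

```python
import pandas as pd

# Hypothetical inhibition percentages, for illustration only
inhibition = pd.Series([12.5, 75.0, 75.06, 99.9], name="inhibition_pct")

# Strictly greater than the 75% cut-off is active (1); everything else is inactive (0)
labels = (inhibition > 75).astype(int)
print(labels.tolist())
```

Note that a value of exactly 75.0 is labelled inactive, because the rule uses a strict inequality.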





### Data Collection :
 
The Coronaviridae are a family of enveloped, positive-sense, single-stranded RNA viruses. They typically cause mild respiratory diseases, but infections with the β-coronaviruses SARS-CoV, MERS-CoV and SARS-CoV-2 can lead to acute respiratory diseases and high mortality, particularly in individuals with underlying health conditions. In the last 20 years, Coronaviridae have caused two severe outbreaks (SARS-CoV in 2002/2003 and MERS-CoV in 2012) and one pandemic (SARS-CoV-2, starting in late 2019). At the time of writing, the SARS-CoV-2 pandemic has led to close to 39 million confirmed infections and 1 million deaths (https://coronavirus.jhu.edu).

Multiple interventional clinical trials have been initiated in the search for effective pharmacological treatments against SARS-CoV-2 infection and the related disease Covid-19. Bioinformatics analyses have proposed repurposed drugs based on the interactome between virus-encoded proteins and host-cell pathways. In the absence of safe and effective vaccines against SARS-CoV-2, repurposing existing drugs is a pragmatic first strategy for treating Covid-19 patients. We therefore aim to identify potential inhibitors of SARS-CoV-2 in-vitro cellular toxicity in human (Caco-2) cells by predicting the bioactivity of each compound with a machine-learning model trained on a large-scale drug repurposing collection[3].

Bioactivity describes the ability of an implant material to interact with, or initiate a specific reaction in, living tissue upon exposure. The biochemical systems encountered by a drug molecule (the implant material in this setting) are extremely complex. The factors affecting bioactivity[4] may be divided into three categories:

1. Physicochemical properties such as solubility, partition coefficients, and ionization.
2. Chemical structure parameters such as resonance, inductive effect, oxidation potentials, types of bonding, and isosterism.
3. Spatial considerations such as molecular dimensions, interatomic distances, and stereochemistry.

To identify possible candidates for progression towards clinical studies against SARS-CoV-2, the authors of the paper[1] screened a well-defined collection of compounds. We obtained this data from the ChEMBL_27 SARS-CoV-2 release, where it is published under the title "Identification of inhibitors of SARS-CoV-2 in-vitro cellular toxicity in human (Caco-2) cells using a large scale drug repurposing collection"[3].

We obtain this data in two parts as follows:

1. inhibition.csv – This data set holds the bioactivity measure (the inhibition value). It is the target that the model learns to predict from the corresponding attributes in compounds.csv, so that compounds can be classified as active or inactive. A compound is called active if its inhibition value is greater than 75%; otherwise it is classified as inactive.  

2. compounds.csv – This data set holds the factors that affect bioactivity, as described above. From the available attributes we select Molecular Weight, AlogP, PSA, HBA, HBD, CX ApKa, CX BpKa and CX LogD as our descriptors.
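How the two files fit together can be sketched with toy data. The column names `ChEMBL ID`, `Molecule ChEMBL ID`, `CX ApKa` and `Standard Value` are the ones used in this notebook; the IDs and values are made up, and the assumption that missing entries appear as the string `'None'` follows the cleaning step used further down.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for compounds.csv and inhibition.csv; IDs and values are made up
compounds = pd.DataFrame({
    "ChEMBL ID": ["CHEMBL_A", "CHEMBL_B", "CHEMBL_C"],
    "CX ApKa": ["4.5", "None", "9.8"],  # missing value stored as the string 'None'
})
inhibition = pd.DataFrame({
    "Molecule ChEMBL ID": ["CHEMBL_A", "CHEMBL_B", "CHEMBL_C", "CHEMBL_D"],
    "Standard Value": [91.3, 50.0, 4.2, 80.0],
})

# Inner join: keep only compounds present in both tables, one row per compound
merged = compounds.merge(
    inhibition,
    how="inner",
    left_on="ChEMBL ID",
    right_on="Molecule ChEMBL ID",
    validate="one_to_one",
)

# Turn 'None' strings into real missing values, drop incomplete rows, then convert to numbers
clean = merged.replace("None", np.nan).dropna().copy()
clean["CX ApKa"] = pd.to_numeric(clean["CX ApKa"])
print(clean[["ChEMBL ID", "CX ApKa", "Standard Value"]])
```

`validate="one_to_one"` makes the merge fail loudly if either table still contains duplicate IDs, which is why duplicates are dropped before joining.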


### Importing the Libraries 

In [2]:
#Linear Algebra 
import numpy as np 

#Data processing 
import pandas as pd 

#Model Selection 
from sklearn.model_selection import train_test_split, cross_validate

#Exploratory data profiling
from pandas_profiling import ProfileReport

#Classifiers
from sklearn.linear_model import LogisticRegression, Perceptron, SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC 

### Data Processing

In [3]:
#Load both tables and keep one row per compound
df_compounds = pd.read_csv('compounds.csv', sep=';').drop_duplicates('ChEMBL ID')
df_inhibition = pd.read_csv('inhibition.csv', sep=';').drop_duplicates('Molecule ChEMBL ID')
#Join descriptors to inhibition measurements on the ChEMBL ID
df = df_compounds.merge(df_inhibition, how='inner', left_on='ChEMBL ID', right_on='Molecule ChEMBL ID', validate='one_to_one')
df = df[['Molecular Weight_x', 'AlogP_x','PSA','HBA', 'HBD','CX ApKa', 'CX BpKa', 'CX LogD','QED Weighted','Standard Value']]
#Missing values are stored as the string 'None'; drop incomplete rows
df = df.replace(to_replace='None', value=np.nan).dropna()
#Convert to numeric before thresholding so the comparison is never done on strings
df = df.apply(pd.to_numeric)
#Label: 1 = active (inhibition > 75%), 0 = inactive
df['Bioactivity'] = (df['Standard Value'] > 75).astype(int)
df.info()
df.describe(include="all") 

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1909 entries, 0 to 5614
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Molecular Weight_x  1909 non-null   float64
 1   AlogP_x             1909 non-null   float64
 2   PSA                 1909 non-null   float64
 3   HBA                 1909 non-null   int64  
 4   HBD                 1909 non-null   int64  
 5   CX ApKa             1909 non-null   float64
 6   CX BpKa             1909 non-null   float64
 7   CX LogD             1909 non-null   float64
 8   QED Weighted        1909 non-null   float64
 9   Standard Value      1909 non-null   float64
 10  Bioactivity         1909 non-null   int64  
dtypes: float64(8), int64(3)
memory usage: 179.0 KB


| | Molecular Weight_x | AlogP_x | PSA | HBA | HBD | CX ApKa | CX BpKa | CX LogD | QED Weighted | Standard Value | Bioactivity |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1909.0 | 1909.0 | 1909.0 | 1909.0 | 1909.0 | 1909.0 | 1909.0 | 1909.0 | 1909.0 | 1909.0 | 1909.0 |
| mean | 400.865663 | 2.533133 | 98.185904 | 5.684652 | 2.556836 | 8.71406 | 6.160272 | 0.891964 | 0.522745 | 11.122142 | 0.051336 |
| std | 155.54067 | 2.443517 | 50.761157 | 2.874941 | 1.801043 | 4.127851 | 3.08711 | 3.365741 | 0.212746 | 24.691942 | 0.22074 |
| min | 75.07 | -8.9 | 3.24 | 1.0 | 0.0 | -4.37 | 0.0 | -17.92 | 0.02 | -39.64 | 0.0 |
| 25% | 297.27 | 1.1 | 67.16 | 4.0 | 1.0 | 4.67 | 3.62 | -0.62 | 0.37 | -2.65 | 0.0 |
| 50% | 392.44 | 2.8 | 88.85 | 5.0 | 2.0 | 9.81 | 6.72 | 1.53 | 0.52 | 2.92 | 0.0 |
| 75% | 487.41 | 4.13 | 114.62 | 7.0 | 3.0 | 12.28 | 8.89 | 3.05 | 0.69 | 14.12 | 0.0 |
| max | 1385.46 | 11.51 | 437.28 | 25.0 | 16.0 | 14.0 | 15.38 | 10.04 | 0.94 | 135.09 | 1.0 |


### Data Analysis 

In [4]:
prof = ProfileReport(df)
prof





In [5]:
#Active compounds: rows with inhibition above the 75% cut-off
active = df[df['Standard Value'] > 75]
active.describe()

| | Molecular Weight_x | AlogP_x | PSA | HBA | HBD | CX ApKa | CX BpKa | CX LogD | QED Weighted | Standard Value | Bioactivity |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 98.0 | 98.0 | 98.0 | 98.0 | 98.0 | 98.0 | 98.0 | 98.0 | 98.0 | 98.0 | 98.0 |
| mean | 458.143878 | 4.071224 | 91.467347 | 5.959184 | 2.132653 | 10.201122 | 5.848571 | 2.873265 | 0.481735 | 92.029286 | 1.0 |
| std | 130.964269 | 2.062545 | 33.450349 | 2.237996 | 1.163338 | 3.400624 | 2.73711 | 2.226608 | 0.181875 | 10.76374 | 0.0 |
| min | 142.16 | -1.32 | 23.47 | 2.0 | 0.0 | 0.36 | 0.27 | -4.51 | 0.11 | 75.06 | 1.0 |
| 25% | 382.1975 | 2.7375 | 65.8275 | 5.0 | 1.0 | 8.51 | 3.7125 | 1.4675 | 0.37 | 83.865 | 1.0 |
| 50% | 459.56 | 4.12 | 91.97 | 6.0 | 2.0 | 11.65 | 5.81 | 3.075 | 0.47 | 92.5 | 1.0 |
| 75% | 525.14 | 5.42 | 111.4225 | 7.0 | 3.0 | 12.6525 | 8.3575 | 4.195 | 0.6075 | 99.865 | 1.0 |
| max | 1004.84 | 9.76 | 201.75 | 14.0 | 6.0 | 13.9 | 10.83 | 9.2 | 0.86 | 135.09 | 1.0 |
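The two summaries can be compared directly. The sketch below copies a few of the reported means (all compounds versus active compounds) and the two row counts from the outputs above. It only restates those numbers and is not a new analysis.

```python
import pandas as pd

# Means copied from the two describe() outputs above
means = pd.DataFrame(
    {
        "all": [400.865663, 2.533133, 98.185904, 0.891964],
        "active": [458.143878, 4.071224, 91.467347, 2.873265],
    },
    index=["Molecular Weight_x", "AlogP_x", "PSA", "CX LogD"],
)
means["difference"] = means["active"] - means["all"]
print(means.round(2))

# Row counts from the same outputs: 98 active compounds out of 1909
n_total, n_active = 1909, 98
active_fraction = n_active / n_total
print(f"Active fraction: {active_fraction:.4f}")
```

On these figures the active compounds are, on average, heavier and more lipophilic (higher AlogP and CX LogD) than the full set, and only about 5% of the compounds are active, so the two classes are strongly imbalanced.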


### Model 


In [8]:
#Features: the eight descriptors; target: the binary Bioactivity label
x = df.drop(['Standard Value', 'Bioactivity', 'QED Weighted'], axis=1)
#Use a 1-D Series as target; a one-column DataFrame triggers a DataConversionWarning in fit()
y = df['Bioactivity']
qed = df[['QED Weighted']]

#Hold out 20% of the compounds for testing (the split is random, so scores vary between runs)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
model = SVC(kernel='rbf',C=1.0,gamma='scale')
model.fit(x_train, y_train)
#5-fold cross-validation on the full data set; cross_validate fits fresh clones of the model
cv_results = cross_validate(model, x, y, cv=5, return_train_score=True)
    
print("Training Score {}\n".format(model.score(x_train, y_train)))
print("Testing Score {}\n ".format(model.score(x_test, y_test)))
print("Cross Validation Train Scores: {}\n".format(cv_results['train_score']))
print("Cross Validation Train Mean: {}\n".format(np.mean(cv_results['train_score'])))
print("Cross Validation Test Scores: {}\n".format(cv_results['test_score']))
print("Cross Validation Test Mean: {}\n".format(np.mean(cv_results['test_score'])))

Training Score 0.9482645710543549

Testing Score 0.9502617801047121
 
Cross Validation Train Scores: [0.94826457 0.94891945 0.94891945 0.94891945 0.94829843]

Cross Validation Train Mean: 0.9486642700158061

Cross Validation Test Scores: [0.95026178 0.94764398 0.94764398 0.94764398 0.95013123]

Cross Validation Test Mean: 0.9486649901746574
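These accuracies should be read against the class balance. With 98 active compounds out of 1909, a model that always predicts "inactive" already scores 1811/1909 ≈ 0.9487, which is very close to the cross-validation test mean above. Accuracy alone therefore may not show whether any active compounds are found. The sketch below shows one way to check this. It runs on synthetic data with roughly 5% positives (an assumption standing in for the real descriptor table). Feature scaling, `class_weight="balanced"` and the extra metrics are suggestions, not part of the original model.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the descriptors: 8 features, roughly 5% positives (assumption)
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=4, n_redundant=0,
    weights=[0.95, 0.05], flip_y=0, random_state=0,
)

# Stratified folds keep the active/inactive ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

candidates = {
    # Baseline that always predicts the majority class (inactive)
    "always inactive": DummyClassifier(strategy="most_frequent"),
    # RBF SVC on standardized features, with errors on the rare class weighted up
    "scaled, class-weighted SVC": make_pipeline(
        StandardScaler(),
        SVC(kernel="rbf", C=1.0, gamma="scale", class_weight="balanced"),
    ),
}

results = {}
for name, clf in candidates.items():
    # Out-of-fold predictions for every sample
    pred = cross_val_predict(clf, X, y, cv=cv)
    results[name] = {
        "accuracy": accuracy_score(y, pred),
        "balanced_accuracy": balanced_accuracy_score(y, pred),
        "recall_active": recall_score(y, pred),
    }
    print(name, results[name])
```

The baseline reaches high accuracy with zero recall on the active class, so balanced accuracy and recall are the more informative numbers when actives are this rare. Running the same comparison on `x` and `y` from the cell above would show whether the SVC improves on that baseline.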



### References

1. https://www.researchsquare.com/article/rs-23951/v1

2. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3524573/

3. https://www.ebi.ac.uk/chembl/document_report_card/CHEMBL4303101/

4. https://www.drugtimes.org/how-drugs-act/factors-affecting-bioactivity.html

5. https://www.researchgate.net/figure/Values-MW-clogP-HBA-HBD-PSA-logBB-and-logP-e-a-for-ML-and-1_tbl1_259626002