Predicting Bioactivity towards inhibiting the Acetylcholinesterase enzyme - A drug target for Alzheimer's Disease
-
When the brain's structures begin to shrink, it can no longer efficiently produce the ACh chemical needed for intra-neuronal messaging and the receptors responsible for catching these chemicals to process the signal are also less plentiful. Although both of these components begin to fail simultaneously, the decline in the number of receptors to catch these ACh molecules is much more significant than the reduced chemical production.
-
At this stage, although there no treatments can reverse the process of destruction, there are options to promote a strong signaling pathway for as long as possible. This is the goal of an Acetylcholinesterase inhibitor. It works by blocking an enzyme, acetylcholinesterase, from attaching to the ACh chemical and digesting it. By allowing the chemical to have more time to bind to a receptor, there is more of a chance for a message to be sent from one neuron to another and allow the body to have coordinated movements and functioning.
Recent Discoveries in Drug Discovery of Alzheimer's: https://www.bbc.com/news/health-63749586
-
To study the Chemical Structure [ here - canonical_smiles ] of Compounds inhibiting Human AChE enzyme with the help of Lipinski Descriptors
-
To Study the Relantionship between Molecular Fingerprints Descriptor andMolar Concentration [IC50] value of the Compound, inhibiting the Human AChE enzyme
-
On the basis of Insights from the relationship [ be it weak or Strong, in both cases it will provide us with some insights ], Building a Model and a Web App which will predict the Bioactivity [ IC50 ] of the Compound
Fingerprints are arrnagement of bits on the basis of Division of Chemical Structure of Compounds so Logically
they should not be Major Contributors in explaining the Variations in IC50 values [ Molar Concentration ] of
inhibiting Human AChE Enzyme
Fingerprints Holds an Important Place in the field of Bioinformatics and Genomics. They are Quantative Arrangement
of Structures on the basis of some rules and Algorithms, Explaining certain properties in that manner . They Holds an
important place in drug biology and are important in explaining bioacitivity of many different types of compounds but
now a question arises within me which is,
-
Are they significant in explaining bioactivity of Compounds inhibiting Human AChE enzyme ? They Do contribute but the curosity is, are they the Major Contributors or we should look for another ones when it comes to inhibiting Human AChE Enzyme
-
So this is what we are Trying to Study, the relationship between Fingerprints and Bioactivity [IC50] of Compounds, inhibiting the Human AChE enzyme
I wish to achieve this Relationship with the Help of the concept of Hypothesis Testing , here
- Alternative Hypothesis [Ha] - Fingerprints are not Major Contributors in explaining the Variations in IC50 values inhibiting the Human AChE Enzyme
- Null Hypothesis [H0] - Fingerprints are Major Contributors in explaining the Variations in IC50 values inhibiting the Human AChE Enzyme [this is what I wish to disprove]
R square value in regression analysis tells us about how much variation in dependent / Target variable can be explained by Independent variables. A Relatively low value would suggest that even though the features are significant they are not able to explain the variation, which will support my cause and will result into rejecting null hypothesis Will i get that Value? Let's Find Out
The ChEMBL Database is a database that contains curated bioactivity data of more than 2 million compounds. It is compiled from more than 76,000 documents, 1.2 million assays and the data spans 13,000 targets and 1,800 cells and 33,000 indications
Installing ChEMBL web service Package to Extract bioactivity data from the ChEMBL Database
! pip install chembl_webresource_client
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from chembl_webresource_client.new_client import new_client
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski
import math
from numpy.random import seed
from numpy.random import randn
from scipy.stats import mannwhitneyu
target = new_client.target
target_query = target.search('Acetylcholinesterase')
target_query # Dictionary
targets = pd.DataFrame(target_query)
targets.columns
Out target is to Predict for Human Acetylcholinesterase: Retrieve chembl id for Human Acetylcholinesterase ['CHEMBL220']
filt = (targets.organism == 'Homo sapiens') & (targets.pref_name == 'Acetylcholinesterase')
targets[filt]
selected_target = targets[filt]['target_chembl_id'][0]
selected_target
In the field of pharmacology, potency is a measure of drug activity expressed in terms of the amount required to produce an effect of given intensity A highly potent drug (e.g., fentanyl, alprazolam, risperidone, bumetanide, bisoprolol) evokes a given response at low concentrations, while a drug of lower potency (meperidine, diazepam, ziprasidone, furosemide, metoprolol) evokes the same response only at higher concentrations
The half maximal inhibitory concentration (IC50) is a measure of the potency of a substance in inhibiting a specific biological or biochemical function. IC50 is a quantitative measure that indicates how much of a particular inhibitory substance (e.g. drug) is needed to inhibit, in vitro, a given biological process or biological component by 50%. The biological component could be an enzyme, cell, cell receptor or microorganism. IC50 values are typically expressed as molar concentration.
activity = new_client.activity
res = activity.filter(target_chembl_id = selected_target).filter(standard_types="IC50")
df = pd.DataFrame(res)
test = df[['standard_value', 'standard_type', 'standard_units', 'type', 'units']]
filt = df['standard_type'] == 'IC50'
df = df[filt]
df.to_csv("AChE_Bioactivity_data_1.csv", index=False)
Required Columns are : molecule_chembl_id
, canonical_smiles
and standard_value
- molecule_chembl_id has no missing values - so we are good to go
- canonical_smiles contains 0.43% of missing values which we can drop easily as replacing values brings variability / variations so droping them is the best option
- standard_value contains 21.74% of missing values which we can drop easily as replacing values brings variability / variations so droping them is the best option
df1.dropna(subset=['canonical_smiles', 'standard_value'], axis=0, inplace=True)
df2 = df1[['molecule_chembl_id','canonical_smiles', 'standard_value']]
df2.to_csv('AChE_Bioactivity_data_2.csv', index=False)
Labeling Compounds as either being active, inactive or intermediate: The Bioactivity data is in the IC50 unit. Compounds Having values [Potency] of
- Less than 1000nM will be considered as active
- Greater than 10000nM will be considered as inactive
- Between 1000nM and 10000nM will be considered as intermediate
df3 = pd.read_csv('AChE_Bioactivity_data_2.csv')
bioactivity_threshold = []
for i in df3['standard_value']:
if i >= 10000:
bioactivity_threshold.append('inactive')
elif i <= 1000:
bioactivity_threshold.append('active')
else:
bioactivity_threshold.append('intermediate')
bioactivity_class = pd.Series(bioactivity_threshold, name='class')
df4 = pd.concat([df3, bioactivity_class], axis=1)
df4.to_csv('AChE_Bioactivity_data_3.csv', index=False)
df4['class'].value_counts()
Christopher Lipinski, a scientist at Pfizer, came up with a set of rule-of-thumb for evaluating the druglikeness of compounds. Such druglikeness is based on the Absorption, Distribution, Metabolism and Excretion (ADME) that is also known as the pharmacokinetic profile. Lipinski analyzed all orally active FDA-approved drugs in the formulation of what is to be known as the Rule-of-Five or Lipinski's Rule.
The Lipinski's Rule stated the following:
- Molecular weight < 500 Dalton
- Octanol-water partition coefficient (LogP) < 5
- Hydrogen bond donors < 5
- Hydrogen bond acceptors < 10
def lipinski(smiles, verbose=False):
moldata= []
for elem in smiles:
mol=Chem.MolFromSmiles(elem)
moldata.append(mol)
baseData= np.arange(1,1)
i=0
for mol in moldata:
desc_MolWt = Descriptors.MolWt(mol)
desc_MolLogP = Descriptors.MolLogP(mol)
desc_NumHDonors = Lipinski.NumHDonors(mol)
desc_NumHAcceptors = Lipinski.NumHAcceptors(mol)
row = np.array([desc_MolWt,
desc_MolLogP,
desc_NumHDonors,
desc_NumHAcceptors])
if(i==0):
baseData=row
else:
baseData=np.vstack([baseData, row])
i=i+1
columnNames=["MW","LogP","NumHDonors","NumHAcceptors"]
descriptors = pd.DataFrame(data=baseData,columns=columnNames)
return descriptors
df_lipinski = lipinski(df5['canonical_smiles'])
df6 = pd.concat([df5, df_lipinski], axis=1)
To Reduce Skewness and Variablity in the Data and allowing Data to be uniformly Distributed, I will convert IC50 to pIC50 Value which is basically a negative log -> -log10(IC50)
- Take the IC50 values from the standard_value column and converts it from nM to M by multiplying the value by 10^-9
- Take the molar value and apply -log10
- Delete the standard_value column and create a new pIC50 column
norm =[]
for i in df7.standard_value:
if i > math.pow(10, 8):
i = math.pow(10, 8)
norm.append(i)
df7['standard_norm_value'] = norm
df7.drop('standard_value', axis=1, inplace=True)
pIC50=[]
for i in df7.standard_norm_value:
molar = i*math.pow(10, -9)
pIC50.append(-np.log10(molar))
df7['pIC50'] = pIC50
df7.drop('standard_norm_value', axis=1, inplace=True)
Removing Intermediate Bioactivity Class: To perform Mann whitney U Test
Scatter plot of MW vs LogP
The Mann-Whitney U Test is a statistical test used to determine if 2 groups are significantly different from each other on the basis of variable of interest. Variable of interest should be continuous and 2 groups should have similar values on variable of interest.
def mannwhitney(descriptor, verbose=False):
selection = [descriptor, 'class']
df_test = df8[selection]
active = df_test[df_test['class'] == 'active']
active = active[descriptor]
selection = [descriptor, 'class']
df_test = df8[selection]
inactive = df_test[df_test['class'] == 'inactive']
inactive = inactive[descriptor]
stat, p = mannwhitneyu(active, inactive)
alpha = 0.05
if p > alpha:
interpretation = 'Same distribution (fail to reject H0)'
else:
interpretation = 'Different distribution (reject H0)'
results = pd.DataFrame({'Descriptor':descriptor,
'Statistics':stat,
'p':p,
'alpha':alpha,
'Interpretation':interpretation}, index=[0])
filename = 'mannwhitneyu_' + descriptor + '.csv'
results.to_csv(filename)
return results
pIC50 values
Taking a look at pIC50 values, the actives and inactives displayed statistically significant difference, which is to be expected since threshold values
- IC50 < 1,000 nM = Actives while IC50 > 10,000 nM = Inactives, corresponding to
- pIC50 > 6 = Actives and pIC50 < 5 = Inactives were used to define actives and inactives.
Lipinski's descriptors
3 of the 4 Lipinski's descriptors exhibited statistically significant difference between the actives and inactives.
Molecular Descriptors are Quantitive Description of the Compounds, can be defined as mathematical representations of molecules properties that are generated by algorithms. The numerical values of molecular descriptors are used to quantitatively describe the physical and chemical information of the molecule Fingerprint is the Molecular Descriptor that is used here, Specifically pubchem Fingerprints. Molecular Fingerprint are useful in predicting drug Binding Affinities or Bioactivity
Bit allocation in Pubchem Fingerprints
Software used : PaDEL-Descriptor ! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
df_final = pd.read_csv('AChE_Bioactivity_data_4.csv')
df_final_selected = df_final[['canonical_smiles', 'molecule_chembl_id']]
df_final_selected.to_csv('molecule.smi', sep='\t', index=False, header=False)
PaDEL-Descriptor
Result obtained from the Software is Stored in Fingerprints.csv
similar cases for other fingerprints
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.feature_selection import VarianceThreshold
import lazypredict
from lazypredict.Supervised import LazyRegressor
import lightgbm as ltb
import math
from scipy import stats
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingRegressor
Variance Threshold
The Variance Threshold is a feature selection technique used in data analysis and machine learning to remove features with low variance. This method is particularly useful in simplifying models, reducing overfitting, and improving computational efficiency.
def remove_low_variance(input_data, threshold):
selection = VarianceThreshold(threshold)
selection.fit(input_data)
return input_data[input_data.columns[selection.get_support(indices=True)]]
X = remove_low_variance(X, threshold=0.1)
p_value = []
col_name = []
names = []
for x in X.columns:
w = stats.pearsonr(DataSet[x], Y)
p_value.append(w[1])
col_name.append(x)
Imp = pd.DataFrame({'Column_name':col_name,
'p-value':p_value})
filt = Imp['p-value'] > 0.05
df = Imp[filt]
for i in df['Column_name']:
names.append(i)
X.drop(names, axis=1, inplace=True)
X.to_csv('descriptor_list.csv', index=False)
Data Splitting
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20, random_state=42)
print(f"{X_train.shape} \t {Y_train.shape} \n {X_test.shape} \t {Y_test.shape}")
Lazy Regressor for Regression Analysis
clf = LazyRegressor(verbose=0,ignore_warnings=True, custom_metric=None)
models_test,predictions_test = clf.fit(X_train, X_test, Y_train, Y_test)
plt.figure(figsize=(5, 10))
sns.set_theme(style="whitegrid")
ax = sns.barplot(y=predictions_test.index, x="RMSE", data=predictions_test)
ax.set(xlim=(0, 10))
import pickle
pickle.dump(model, open('AChE_model_pubchem.pkl', 'wb'))
Key Takeaways
After Observing the Results from all 5 fingerprints
- Data is in form of 0s and 1s, so no place for encoding and Feature Scalling
- From the list from Lazy Predict and our random forest model with hyperparameter tuning, we can tell that,
- The range of R square values ranges from 20-28 percent, which is quite low
- The error / loss / MSE value ranges from 1-2, which is optimal
This function performs molecular descriptor calculations using the PaDEL-Descriptor software. It runs a Java command to execute the PaDEL-Descriptor JAR file with options to remove salts, standardize nitro groups, and calculate fingerprints based on the PubChem Fingerprinter descriptor types. The results are saved to 'descriptors_output.csv'. After the calculation, the input file 'molecule.smi' is deleted to clean up the working directory. The function uses the subprocess module to handle the execution of the Java command and captures any output or errors generated during the process.
def desc_calc():
# Performs the descriptor calculation
bashCommand = "java -Xms2G -Xmx2G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes " \
"./PaDEL-Descriptor/PubchemFingerprinter.xml -dir ./ -file descriptors_output.csv"
process = subprocess.Popen(bashCommand.split(), stdout=subprocess.PIPE)
output, error = process.communicate()
os.remove('molecule.smi')
def build_model(input_data):
# Reads in saved regression model
load_model = pickle.load(open('AChE_model_pubchem.pkl', 'rb'))
# Apply model to make predictions
prediction = load_model.predict(input_data)
st.header('**Prediction output**')
prediction_output = pd.Series(prediction, name='pIC50')
molecule_name = pd.Series(load_data[1], name='molecule_name')
df = pd.concat([molecule_name, prediction_output], axis=1)
st.write(df)
st.markdown(filedownload(df), unsafe_allow_html=True)
st.header('**Prediction Graph**')
st.latex('pIC50 = -log (IC50)')
fig = plt.figure()
ax = sns.barplot(y=df['molecule_name'], x=df['pIC50'], errwidth=0)
temp = int(max(df['pIC50'])) + 2
plt.xticks(np.arange(0, temp, 0.5))
plt.xlim([0, temp])
for i in ax.containers:
ax.bar_label(i, )
plt.ylabel(" CHEMBL_ID ", size=12, fontstyle='normal', weight=600)
plt.xlabel("pIC50 Values ", size=12, fontstyle='normal', weight=600)
plt.title("pIC50 value of various Molecules", fontstyle='normal', weight=600)
st.pyplot(fig)
with st.sidebar.header('1. Upload your CSV data'):
uploaded_file = st.sidebar.file_uploader("Upload your input file", type=['txt'])
st.sidebar.markdown("""
[Example input file](https://drive.google.com/file/d/1BkljjkCckFPn1mKLXIBsAmLk9d-ZBQ1R/view?usp=sharing)
""")
if st.sidebar.button('Predict'):
load_data = pd.read_table(uploaded_file, sep=' ', header=None)
load_data.to_csv('molecule.smi', sep='\t', header=False, index=False)
st.header('**Original input data**')
st.write(load_data)
with st.spinner("Calculating descriptors..."):
desc_calc()
# Read in calculated descriptors and display the dataframe
st.header('**Calculated molecular descriptors**')
desc = pd.read_csv('descriptors_output.csv')
st.write(desc)
st.write(desc.shape)
# Read descriptor list used in previously built model
st.header('**Subset of descriptors from previously built model**')
Xlist = list(pd.read_csv('descriptor_list.csv').columns)
desc_subset = desc[Xlist]
st.write(desc_subset)
st.write(desc_subset.shape)
# Apply trained model to make prediction on query compounds
build_model(desc_subset)
else:
st.info('Upload input data in the sidebar to start!')