#Predict Drug Activity for Androgen Receptor

This tutorial uses machine learning (ML) algorithms to predict whether a molecule is active against the androgen receptor,
using Quantitative Structure–Property Relationship (QSPR) descriptors computed with mordred.

Adapted from Tomasz K. Piskorz. 
 [Predict Drug activity for androgen receptor](https://github.com/tkpiskorz/cheminformatics_notebooks/blob/master/AR/Androgen%20receptor.ipynb).

##Installing RDKit and mordred on Google Colab
RDKit: a Python [Open-Source Cheminformatics Software](https://www.rdkit.org/).

mordred: a Python [molecular descriptor calculator](https://github.com/mordred-descriptor/mordred) package.

In [None]:
!pip install rdkit-pypi

In [None]:
!pip install mordred

##Import required libraries

In [None]:
from rdkit import Chem
import pandas as pd
from rdkit.Chem.Draw import IPythonConsole
# display is imported from IPython.display; importing it from IPython.core.display is deprecated
from IPython.display import display
import numpy as np

##“Toxicology in the 21st Century” (Tox21) Dataset

The “Toxicology in the 21st Century” (Tox21) initiative created a public database measuring the toxicity of compounds, which was used in the 2014 Tox21 Data Challenge. This dataset contains qualitative toxicity measurements for about 8,000 compounds on 12 different targets, including nuclear receptors and stress response pathways.

The data file is a CSV table. The following columns are used:

- "smiles" - SMILES representation of the molecular structure
- "NR-XXX" - Nuclear receptor signaling bioassays results
  - [AR](https://pubchem.ncbi.nlm.nih.gov/bioassay/743040): qHTS assay to identify small molecule agonists of the androgen receptor (AR) signaling pathway using the MDA cell line.
  - [AhR](https://pubchem.ncbi.nlm.nih.gov/bioassay/743122): qHTS assay to identify small molecules that activate the aryl hydrocarbon receptor (AhR) signaling pathway.
  - [AR-LBD](https://pubchem.ncbi.nlm.nih.gov/bioassay/74353): qHTS assay to identify small molecule agonists of the androgen receptor (AR) signaling pathway.
  - [ER](https://pubchem.ncbi.nlm.nih.gov/bioassay/743079): qHTS assay to identify small molecule agonists of the estrogen receptor alpha (ER-alpha) signaling pathway using the BG1 cell line.
  - [ER-LBD](https://pubchem.ncbi.nlm.nih.gov/bioassay/743077): qHTS assay to identify small molecule agonists of the estrogen receptor alpha (ER-alpha) signaling pathway.
  - [aromatase](https://pubchem.ncbi.nlm.nih.gov/bioassay/743139): qHTS assay to identify aromatase inhibitors.
  - [PPAR-gamma](https://pubchem.ncbi.nlm.nih.gov/bioassay/743140): qHTS assay to identify small molecule agonists of the peroxisome proliferator-activated receptor gamma (PPARg) signaling pathway.

- "SR-XXX" - Stress response bioassays results
	- [ARE](https://pubchem.ncbi.nlm.nih.gov/bioassay/743219): qHTS assay for small molecule agonists of the antioxidant response element (ARE) signaling pathway.
	- [ATAD5](https://pubchem.ncbi.nlm.nih.gov/bioassay/720516): qHTS assay for small molecules that induce genotoxicity in human embryonic kidney cells expressing luciferase-tagged ATAD5.
	- [HSE](https://pubchem.ncbi.nlm.nih.gov/bioassay/743228): qHTS assay for small molecule activators of the heat shock response signaling pathway. 
	- [MMP](https://pubchem.ncbi.nlm.nih.gov/bioassay/720637): qHTS assay for small molecule disruptors of the mitochondrial membrane potential.
	- [p53](https://pubchem.ncbi.nlm.nih.gov/bioassay/720552): qHTS assay for small molecule agonists of the p53 signaling pathway.

Please refer to the links at https://tripod.nih.gov/tox21/challenge/data.jsp for details.

References:

Tox21 Challenge. https://tripod.nih.gov/tox21/challenge/

##Download Tox21 dataset

In [None]:
!pip install wget
!python -m wget -o tox21.csv "https://raw.githubusercontent.com/udel-cbcb/al_ml_workshop/main/data/tox21.csv"
df = pd.read_csv('tox21.csv')

In [None]:
#Show top 10 rows
df.head(10)

In [None]:
#Show descriptive summary statistics
df.describe()

In [None]:
# Get column names
list(df.columns)

In [None]:
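# Keep only the 'NR-AR' and 'smiles' columns.
# .copy() makes an independent DataFrame, so adding a column later does not trigger a SettingWithCopyWarning.
df = df[['NR-AR','smiles']].copy()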
df.head()

In [None]:
df.shape

In [None]:
# Importing PandasTools enables several features for using RDKit molecules as columns of a pandas DataFrame.
from rdkit.Chem import PandasTools

In [None]:
# Convert the SMILES strings in the column named by "smilesCol" to RDKit molecules and append them to the
# dataframe in a new column (named 'ROMol' by default).
# If desired, a fingerprint can be computed and stored with the molecule objects to accelerate
# substructure matching.
PandasTools.AddMoleculeColumnToFrame(df,smilesCol='smiles')
df.head()

In [None]:
# Remove rows with missing values (NaN)
df = df[~df['ROMol'].isnull()]
df = df[~df['NR-AR'].isnull()]
df.shape

Comparing this shape with the earlier one shows that 566 rows with missing values (NaN) were removed.

In [None]:
# Draw a grid image of the first five active molecules ('NR-AR' equal to 1)
display(PandasTools.FrameToGridImage(df[df['NR-AR']==1].head(5), legendsCol='NR-AR', molsPerRow=5))

In [None]:
# Draw a grid image of the first five inactive molecules ('NR-AR' equal to 0)
display(PandasTools.FrameToGridImage(df[df['NR-AR']==0].head(5), legendsCol='NR-AR', molsPerRow=5))

In [None]:
# List the distinct values in the 'NR-AR' column
df['NR-AR'].unique()

In [None]:
# Count the non-missing entries in the 'NR-AR' column
df['NR-AR'].count()

In [None]:
# Sum of the 'NR-AR' column (the labels are 0/1, so this is the number of active compounds)
df['NR-AR'].sum()
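
Because the labels are 0 or 1, the count and the sum above together give the fraction of active compounds. The following is a minimal pure-Python sketch of that arithmetic. The `labels` list is made up for illustration and is not the Tox21 data.

```python
# Hypothetical 0/1 activity labels standing in for df['NR-AR'].
labels = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]

n_total = len(labels)    # plays the role of df['NR-AR'].count()
n_active = sum(labels)   # plays the role of df['NR-AR'].sum()
active_fraction = n_active / n_total

print(f"{n_active} of {n_total} compounds are active ({active_fraction:.0%})")
```

A small active fraction means the classes are imbalanced, which is worth keeping in mind when reading accuracy scores later on.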

##What is a molecular descriptor?

Molecular descriptors can be defined as mathematical representations of molecules’ properties that are generated by algorithms. The numerical values of molecular descriptors are used to quantitatively describe the physical and chemical information of the molecules. They can be used to predict the activity, toxicity, and other properties resulting from the chemical structures of compounds.
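
To make the idea concrete, here is a minimal sketch of two very simple descriptors, molecular weight and heavy-atom count, computed from a molecular formula in plain Python. This is an illustration only. The helper names and the small atomic-weight table are assumptions made for this example, and mordred computes its descriptors from the full molecular graph rather than from a formula string.

```python
import re

# Standard atomic weights (g/mol) for a few common elements (illustrative subset).
ATOMIC_WEIGHT = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}

def parse_formula(formula):
    """Split a simple formula such as 'C2H6O' into (symbol, count) pairs."""
    return [(symbol, int(count) if count else 1)
            for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula)]

def molecular_weight(formula):
    """Sum the atomic weights of all atoms in the formula."""
    return sum(ATOMIC_WEIGHT[symbol] * count for symbol, count in parse_formula(formula))

def heavy_atom_count(formula):
    """Count all atoms that are not hydrogen."""
    return sum(count for symbol, count in parse_formula(formula) if symbol != "H")

# Ethanol as a two-element descriptor vector: [molecular weight, heavy atoms]
ethanol = "C2H6O"
descriptor_vector = [molecular_weight(ethanol), heavy_atom_count(ethanol)]
print(descriptor_vector)
```

Each molecule becomes one row of numbers like `descriptor_vector`, and a table of such rows is what the ML models below take as input.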

In [None]:
from mordred import Calculator, descriptors

In [None]:
# create descriptor calculator with all descriptors
calc = Calculator(descriptors, ignore_3D=True)

In [None]:
# Show the first molecule
# .iloc[0] selects by position, which still works if the row labelled 0 was dropped above
mol = df['ROMol'].iloc[0]
mol

In [None]:
# The pandas method calculates the descriptors for multiple molecules and returns a pandas DataFrame
df2 = calc.pandas(df['ROMol'])

In [None]:
df2.head()

In [None]:
df2.shape

In [None]:
# Find the columns that contain mordred's Missing error objects
from mordred.error import Missing
missing = []
for column in df2.columns:
    if df2[column].apply(lambda x: isinstance(x, Missing)).any():
        missing.append(column)

In [None]:
# Drop the columns that contain Missing error objects
df_new = df2.drop(missing, axis=1)

In [None]:
df_new.head()

In [None]:
df_new.shape

In [None]:
# Target
y = df['NR-AR']
# Molecular descriptors
X = df_new

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

In [None]:
# Split data into 75% training and 25% test sets
X_train, X_test, y_train, y_test = train_test_split(X, y)
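
When one class is rare, a purely random split can leave the test set with very few active compounds. A stratified split keeps the class proportions similar in both parts. The sketch below shows the idea in plain Python. The `stratified_split` helper and the toy labels are hypothetical and written only for this illustration. In scikit-learn the same effect is available through the `stratify` argument of `train_test_split`.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx), splitting each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # randomize within the class
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical imbalanced labels: 8 actives among 100 compounds.
labels = [1] * 8 + [0] * 92
train_idx, test_idx = stratified_split(labels)

print(len(train_idx), len(test_idx))
print(sum(labels[i] for i in test_idx), "actives in the test set")
```

Passing an integer `random_state` to `train_test_split` would additionally make the split reproducible between runs.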

In [None]:
#A random forest is a meta estimator that fits a number of decision tree
#classifiers on various sub-samples of the dataset and uses averaging to
#improve the predictive accuracy and control over-fitting.

from sklearn.ensemble import RandomForestClassifier

In [None]:
# Create a RandomForestClassifier with 100 trees in the forest.
clf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)

In [None]:
# Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC) from predicted class probabilities 
# The predicted class probabilities of an input sample are computed as 
# the mean predicted class probabilities of the trees in the forest.
# The class probability of a single tree is the fraction of samples of
# the same class in a leaf.
roc_auc_score(y_test, clf.predict_proba(X_test)[:,1])
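
The averaging described in the comments above can be sketched in a few lines of plain Python. Each tree is reduced here to the class counts of the leaf that one sample lands in. The counts are invented for illustration and do not come from the fitted forest.

```python
# Hypothetical leaf contents for one sample in four trees: (n_inactive, n_active).
leaf_counts_per_tree = [(9, 1), (4, 4), (0, 5), (7, 3)]

def tree_probability(n_inactive, n_active):
    """Probability of the active class from one tree: its fraction in the leaf."""
    return n_active / (n_inactive + n_active)

# Per-tree probabilities of the active class
tree_probs = [tree_probability(*counts) for counts in leaf_counts_per_tree]

# The forest's probability is the mean over all trees
forest_prob = sum(tree_probs) / len(tree_probs)

print(tree_probs)
print(forest_prob)
```

This per-sample probability for class 1 corresponds to what `clf.predict_proba(X_test)[:,1]` returns for each row.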

In [None]:
# Compute the mean accuracy of training data
clf.score(X_train,y_train)

In [None]:
# Compute the mean accuracy of testing data
clf.score(X_test,y_test)
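
Accuracy alone can be misleading when active compounds are a small minority. As a sketch under that assumption, the toy example below scores a trivial model that predicts "inactive" for every compound. The label lists are invented and are not model output.

```python
# Hypothetical test labels: 4 actives among 100 compounds.
y_true = [0] * 96 + [1] * 4
# Trivial "model" that always predicts the inactive class.
y_pred = [0] * 100

# Accuracy: fraction of predictions that match the labels
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall: fraction of true actives that the model actually found
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy = {accuracy:.2f}, recall = {recall:.2f}")
```

The trivial model scores high accuracy while finding no active compound, which is one reason to report ROC AUC alongside accuracy.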

In [None]:
# Multi-layer Perceptron classifier.
# This model optimizes the log-loss function using LBFGS or stochastic gradient descent.
from sklearn.neural_network import MLPClassifier

In [None]:
# Create a Multi-layer Perceptron classifier with 6 hidden layers with corresponding number of neurons of
# 1000,500,250,100,50,20
clf = MLPClassifier(hidden_layer_sizes=[1000,500,250,100,50,20]).fit(X_train, y_train)
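
The scikit-learn documentation notes that multi-layer perceptrons are sensitive to feature scaling, and molecular descriptors can span very different numeric ranges. A common remedy is to standardize each column using statistics from the training set only. The sketch below shows that idea in plain Python with invented numbers. The helper names are hypothetical, and in practice scikit-learn's `StandardScaler` does the same job.

```python
from statistics import mean, pstdev

def fit_scaler(rows):
    """Compute per-column mean and standard deviation from the training rows."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Fall back to 1.0 for constant columns to avoid division by zero
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds

def transform(rows, means, stds):
    """Standardize rows with previously fitted means and standard deviations."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Two hypothetical descriptors on very different scales
train = [[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]]
test = [[2.0, 300.0]]

means, stds = fit_scaler(train)              # fit on training data only
train_scaled = transform(train, means, stds)
test_scaled = transform(test, means, stds)   # reuse the training statistics

print(train_scaled)
print(test_scaled)
```

Fitting the scaler on the training rows and reusing it for the test rows keeps information from the test set out of the model.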

In [None]:
# Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC) from predicted class probabilities
roc_auc_score(y_test, clf.predict_proba(X_test)[:,1])
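
ROC AUC can be read as the probability that a randomly chosen active compound receives a higher score than a randomly chosen inactive one. The following pure-Python sketch computes it directly from that definition on a tiny invented example. The `roc_auc` helper is written for illustration and is not scikit-learn's implementation.

```python
def roc_auc(y_true, scores):
    """ROC AUC by comparing every (active, inactive) pair; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # active ranked above inactive
            elif p == n:
                wins += 0.5      # tie
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities for the active class
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)
print(auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, independent of any probability threshold.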

In [None]:
# Compute the mean accuracy of training data
clf.score(X_train,y_train)

In [None]:
# Compute the mean accuracy of testing data
clf.score(X_test,y_test)