# Predict Drug Activity for Androgen Receptor

Adapted from Tomasz K. Piskorz. [Predict Drug activity for androgen receptor](https://github.com/tkpiskorz/cheminformatics_notebooks/blob/master/AR/Androgen%20receptor.ipynb).

## Overview

This tutorial demonstrates how to use machine learning algorithms to predict drug activity for androgen receptor using Quantitative Structure–Property Relationship (QSPR) descriptors. The analysis uses the Tox21 dataset and molecular descriptors calculated using the `mordred` package to build predictive models.

This tutorial uses a couple of packages we have not yet seen.  You can learn more about them here:

- RDKit: a Python [Open-Source Cheminformatics Software](https://www.rdkit.org/).
- mordred: a Python [molecular descriptor calculator](https://github.com/mordred-descriptor/mordred) package.

## Learning Objectives

- Learn how to work with chemical structure data using RDKit and `mordred`
- Understand how to calculate and use molecular descriptors for drug activity prediction 
- Build and evaluate machine learning models for drug activity classification
- Interpret model performance using ROC AUC scores and accuracy metrics

### Tasks to complete

- Load and preprocess the Tox21 dataset
- Calculate molecular descriptors using `mordred`
- Train and evaluate a Random Forest model
- Train and evaluate a Neural Network model
- Compare model performances

## Prerequisites

- A working Python environment and familiarity with Python
- Basic understanding of machine learning concepts
- Familiarity with pandas and numpy libraries
- Knowledge of basic statistical concepts

## Get Started

### Set up conda environment

Ensure that you have created the conda environment using the environment file included in this repository (`conda_env_submodule_4.yml` in the example below). E.g.,

```
# Create conda environment
conda env create -f conda_env_submodule_4.yml

# Register the kernel
python -m ipykernel install --user \
    --name=nigms_sandbox_ud__submodule_4 \
    --display-name "Python (NIGMS Sandbox UD, Submodule 4)"
```

Then, when starting the notebook, select the `"Python (NIGMS Sandbox UD, Submodule 4)"` kernel from the list.

Note that you may need to restart Jupyter Lab for these changes to take effect.

## Import necessary libraries

*Note: You may get a deprecation warning regarding `IPython.core.display`. This shouldn't affect the results of the notebook.*

In [None]:
import numpy as np
import pandas as pd
from IPython.core.display import display
from mordred import Calculator, descriptors
from mordred.error import Missing
from rdkit import Chem

# Importing PandasTools enables several features that allow RDKit
# molecules to be used as columns of a pandas DataFrame.
from rdkit.Chem import PandasTools
from rdkit.Chem.Draw import IPythonConsole
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

## *Toxicology in the 21st Century* (Tox21) Dataset

The *Toxicology in the 21st Century* (Tox21) initiative created a public database of compound toxicity measurements, which was used in the 2014 Tox21 Data Challenge. This dataset contains qualitative toxicity measurements for about 8,000 compounds on 12 different targets, including nuclear receptors and stress response pathways.

The data file is a CSV table. The following columns are used:

- "smiles" - SMILES representation of the molecular structure
- "NR-XXX" - Nuclear receptor signaling bioassay results
  - [AR](https://pubchem.ncbi.nlm.nih.gov/bioassay/743040): qHTS assay to identify small molecule agonists of the androgen receptor (AR) signaling pathway using the MDA cell line.
  - [AhR](https://pubchem.ncbi.nlm.nih.gov/bioassay/743122): qHTS assay to identify small molecules that activate the aryl hydrocarbon receptor (AhR) signaling pathway.
  - [AR-LBD](https://pubchem.ncbi.nlm.nih.gov/bioassay/74353): qHTS assay to identify small molecule agonists of the androgen receptor (AR) signaling pathway.
  - [ER](https://pubchem.ncbi.nlm.nih.gov/bioassay/743079): qHTS assay to identify small molecule agonists of the estrogen receptor alpha (ER-alpha) signaling pathway using the BG1 cell line.
  - [ER-LBD](https://pubchem.ncbi.nlm.nih.gov/bioassay/743077): qHTS assay to identify small molecule agonists of the estrogen receptor alpha (ER-alpha) signaling pathway.
  - [aromatase](https://pubchem.ncbi.nlm.nih.gov/bioassay/743139): qHTS assay to identify aromatase inhibitors.
  - [PPAR-gamma](https://pubchem.ncbi.nlm.nih.gov/bioassay/743140): qHTS assay to identify small molecule agonists of the peroxisome proliferator-activated receptor gamma (PPARg) signaling pathway.

- "SR-XXX" - Stress response bioassay results
  - [ARE](https://pubchem.ncbi.nlm.nih.gov/bioassay/743219): qHTS assay for small molecule agonists of the antioxidant response element (ARE) signaling pathway.
  - [ATAD5](https://pubchem.ncbi.nlm.nih.gov/bioassay/720516): qHTS assay for small molecules that induce genotoxicity in human embryonic kidney cells expressing luciferase-tagged ATAD5.
  - [HSE](https://pubchem.ncbi.nlm.nih.gov/bioassay/743228): qHTS assay for small molecule activators of the heat shock response signaling pathway.
  - [MMP](https://pubchem.ncbi.nlm.nih.gov/bioassay/720637): qHTS assay for small molecule disruptors of the mitochondrial membrane potential.
  - [p53](https://pubchem.ncbi.nlm.nih.gov/bioassay/720552): qHTS assay for small molecule agonists of the p53 signaling pathway.

Please refer to the links at https://tripod.nih.gov/tox21/challenge/data.jsp for details.

### References

Tox21 Challenge. https://tripod.nih.gov/tox21/challenge/

### Load Tox21 dataset

In [None]:
df = pd.read_csv("../../Data/tox21.csv")

In [None]:
# Show top 10 rows
df.head(10)

In [None]:
# Show descriptive summary statistics
df.describe()

In [None]:
# Get column names
list(df.columns)

In [None]:
# Get only 'NR-AR','smiles' columns
df = df[["NR-AR", "smiles"]]
df.head()

In [None]:
df.shape

In [None]:
# Convert the SMILES strings in the "smiles" column to RDKit molecules and
# append them to the DataFrame as a new column (named "ROMol" by default).
# SMILES strings that RDKit cannot parse produce a missing value in that column.
PandasTools.AddMoleculeColumnToFrame(df, smilesCol="smiles")
df.head()

In [None]:
# Remove rows with missing values (NaN)
df = df[~df["ROMol"].isnull()]
df = df[~df["NR-AR"].isnull()]
df.shape

Comparing the shape before and after filtering shows that 566 rows with missing values (NaN) were removed.

In [None]:
# Draw grid image of molecules in pandas DataFrame for 'NR-AR' of 1
display(
    PandasTools.FrameToGridImage(
        df[df["NR-AR"] == 1].head(5), legendsCol="NR-AR", molsPerRow=5
    )
)

In [None]:
# Draw grid image of mols in pandas DataFrame for 'NR-AR' of 0
display(
    PandasTools.FrameToGridImage(
        df[df["NR-AR"] == 0].head(5), legendsCol="NR-AR", molsPerRow=5
    )
)

In [None]:
# List the distinct values in the 'NR-AR' column
df["NR-AR"].unique()

In [None]:
# Count the non-missing values in the 'NR-AR' column
df["NR-AR"].count()

In [None]:
# Sum the 'NR-AR' column; with 0/1 labels this is the number of active compounds
df["NR-AR"].sum()
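
Because the labels are 0/1, the count and the sum together describe the class balance. The sketch below uses a small hypothetical label list (not the Tox21 data) to show how those two numbers give the fraction of actives and the accuracy of a trivial model that always predicts the majority class. That baseline is worth keeping in mind when reading the accuracy scores later in the tutorial.

```python
# Hypothetical toy labels (not the Tox21 data): 1 = active, 0 = inactive.
labels = [0, 0, 0, 1, 0, 0, 0, 0, 1, 0]

n_total = len(labels)   # analogous to df["NR-AR"].count()
n_active = sum(labels)  # analogous to df["NR-AR"].sum()

active_fraction = n_active / n_total
# A model that always predicts the majority class is correct this often.
majority_baseline = max(active_fraction, 1 - active_fraction)

print(f"actives: {n_active}/{n_total} ({active_fraction:.0%})")
print(f"majority-class baseline accuracy: {majority_baseline:.0%}")
```

When actives are rare, a high accuracy can simply reflect the majority class, which is one reason this tutorial also reports ROC AUC.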

## What is a molecular descriptor?

Molecular descriptors can be defined as mathematical representations of molecules’ properties that are generated by algorithms. The numerical values of molecular descriptors are used to quantitatively describe the physical and chemical information of the molecules. They can be used to predict the activity, toxicity, and other properties resulting from the chemical structures of compounds.
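
To make the idea concrete, here is a toy sketch that computes two very simple descriptors, molecular weight and heavy-atom count, from a hand-written atom count. The `toy_descriptors` function and the `ATOMIC_WEIGHT` table are illustrative names introduced here, not part of `mordred` or RDKit, which derive a much larger set of descriptors from the full molecular graph.

```python
# Standard atomic weights (g/mol) for a few common elements.
ATOMIC_WEIGHT = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def toy_descriptors(atom_counts):
    """Return two simple descriptors from a dict of element -> atom count."""
    mol_weight = sum(ATOMIC_WEIGHT[el] * n for el, n in atom_counts.items())
    heavy_atoms = sum(n for el, n in atom_counts.items() if el != "H")
    return {"mol_weight": round(mol_weight, 3), "heavy_atoms": heavy_atoms}


# Ethanol (C2H6O), with the composition written out by hand for illustration.
ethanol = {"C": 2, "H": 6, "O": 1}
print(toy_descriptors(ethanol))
```

Each descriptor becomes one numeric column, so a molecule ends up as a row of numbers that a machine learning model can consume.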

In [None]:
# create descriptor calculator with all descriptors
calc = Calculator(descriptors, ignore_3D=True)

In [None]:
# Show the first molecule (use position, since rows were dropped above)
mol = df["ROMol"].iloc[0]
mol

(The following step may take a few minutes to complete.)

In [None]:
# Calculate the descriptors for all molecules; returns a pandas DataFrame
df2 = calc.pandas(df["ROMol"])

In [None]:
df2.head()

In [None]:
df2.shape

In [None]:
# Find columns in which at least one descriptor could not be calculated
# (mordred marks these values with a Missing object)

missing = []
for column in df2.columns:
    if df2[column].apply(lambda x: isinstance(x, Missing)).any():
        missing.append(column)

In [None]:
# Drop the columns that contain Missing values
df_new = df2.drop(missing, axis=1)

In [None]:
df_new.head()

In [None]:
df_new.shape

In [None]:
# Target
y = df["NR-AR"]

# Molecular descriptors
X = df_new

In [None]:
# Split data into 75% training and 25% test sets
X_train, X_test, y_train, y_test = train_test_split(X, y)
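
Called without size arguments, `train_test_split` holds out 25% of the rows, and it shuffles randomly, so the scores below change from run to run unless `random_state` is set. It also accepts a `stratify` argument that keeps the class proportions the same in both sets. The pure-Python sketch below illustrates the stratification idea on hypothetical toy labels. `stratified_split` is an illustrative helper, not scikit-learn's implementation.

```python
import random


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx), sampling each class separately."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical toy labels: 4 actives among 40 compounds.
labels = [1] * 4 + [0] * 36
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), "training rows,", len(test_idx), "test rows")
print(sum(labels[i] for i in test_idx), "active(s) in the test set")
```

With rare actives, an unstratified split can leave very few positives in the test set, which makes ROC AUC estimates noisy.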

## Random forest

A random forest is a meta estimator that fits a number of decision tree
classifiers on various sub-samples of the dataset and uses averaging to
improve the predictive accuracy and control over-fitting.


In [None]:
# Create a RandomForestClassifier with 100 trees in the forest.
clf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)

### Area Under the Receiver Operating Characteristic Curve (ROC AUC)

Compute the Area Under the Receiver Operating Characteristic Curve (ROC AUC) from predicted class probabilities.

The predicted class probabilities of an input sample are computed as
the mean predicted class probabilities of the trees in the forest.

The class probability of a single tree is the fraction of samples of
the same class in a leaf.
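
The averaging described above can be sketched in a few lines. The leaf counts here are hypothetical numbers for a single input sample in a three-tree forest. The sketch illustrates the arithmetic only and is not how scikit-learn stores its trees.

```python
# Hypothetical leaf contents for ONE input sample in a 3-tree forest.
# Each tuple is (inactive_count, active_count) among the training samples
# that ended up in the leaf this sample falls into.
leaf_counts = [(8, 2), (5, 5), (9, 1)]


def leaf_probability(inactive, active):
    """Fraction of training samples in the leaf that are active."""
    return active / (inactive + active)


tree_probs = [leaf_probability(n0, n1) for n0, n1 in leaf_counts]
forest_prob = sum(tree_probs) / len(tree_probs)

print("per-tree P(active):", tree_probs)
print("forest P(active):", round(forest_prob, 3))
```

The forest value corresponds to what the second column of `predict_proba` reports for that sample.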


In [None]:
roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
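
One way to read the score: ROC AUC equals the probability that a randomly chosen active compound receives a higher predicted probability than a randomly chosen inactive one, so 0.5 corresponds to random ranking and 1.0 to perfect ranking. The sketch below computes it by brute force over all (active, inactive) pairs on a tiny made-up example. `roc_auc_pairwise` is an illustrative helper, and its quadratic cost makes it unsuitable for large datasets.

```python
def roc_auc_pairwise(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correctly ranked pair.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels and scores for illustration.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(roc_auc_pairwise(y_true, y_score))  # 3 of 4 pairs ranked correctly -> 0.75
```

Because it depends only on the ranking of the scores, ROC AUC is less affected by class imbalance than plain accuracy.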

In [None]:
# Compute the mean accuracy on the training data
clf.score(X_train, y_train)

In [None]:
# Compute the mean accuracy on the test data
clf.score(X_test, y_test)

## Multi-layer Perceptron classifier

This model optimizes the log-loss function using LBFGS or stochastic gradient descent.
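
For binary labels, the log-loss is the mean of $-[y \log p + (1 - y) \log(1 - p)]$, where $p$ is the predicted probability of the active class. The following sketch evaluates it on made-up predictions to show that confident wrong answers are penalized much more heavily than confident right ones. The `log_loss` helper and its `eps` clipping value are illustrative choices, not the classifier's internal code.

```python
import math


def log_loss(y_true, p_active, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, p_active):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Made-up labels and predicted probabilities for illustration.
y_true = [1, 0, 1, 0]
confident_right = [0.9, 0.1, 0.8, 0.2]
confident_wrong = [0.1, 0.9, 0.2, 0.8]

print("confident and right:", round(log_loss(y_true, confident_right), 3))
print("confident and wrong:", round(log_loss(y_true, confident_wrong), 3))
```

Training adjusts the network weights to push this quantity down on the training set.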

In [None]:
print(X_train)

print(X_train.dropna())

(The following cell may take a few minutes to complete.)

In [None]:
# Create a Multi-layer Perceptron classifier with 6 hidden layers of
# 1000, 500, 250, 100, 50, and 20 neurons, respectively
clf = MLPClassifier(hidden_layer_sizes=[1000, 500, 250, 100, 50, 20]).fit(
    X_train, y_train
)

### Area Under the Receiver Operating Characteristic Curve (ROC AUC)

Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC) from predicted class probabilities.

In [None]:
roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

In [None]:
# Compute the mean accuracy on the training data
clf.score(X_train, y_train)

In [None]:
# Compute the mean accuracy of testing data
clf.score(X_test, y_test)

## Conclusion

This tutorial demonstrated how to:

- Work with chemical structure data using RDKit
- Calculate molecular descriptors using `mordred`
- Build and evaluate machine learning models for predicting drug activity
- Use different model architectures (Random Forest and Neural Networks) for classification tasks
- Assess model performance using ROC AUC scores and accuracy metrics

## Clean up

Remember to shut down your Jupyter Notebook environment and delete any unnecessary files or resources once you've completed the tutorial.
