# Predict Drug Activity for Androgen Receptor

Adapted from Tomasz K. Piskorz. [Predict Drug activity for androgen receptor](https://github.com/tkpiskorz/cheminformatics_notebooks/blob/master/AR/Androgen%20receptor.ipynb).

## Overview

This tutorial is a step-by-step guide to using machine learning to predict drug activity for the androgen receptor from Quantitative Structure–Property Relationship (QSPR) descriptors. The analysis is based on the Tox21 dataset, which contains chemical compounds and their measured biological activities. To build predictive models, we calculate molecular descriptors with the `mordred` package, which generates a wide range of molecular features. These descriptors capture chemical properties of each compound and serve as the input to models that predict whether a compound is active (1) or inactive (0) in the androgen receptor (NR-AR) assay. By following this tutorial, you'll see how machine learning can be applied to drug discovery and cheminformatics to link molecular structure to biological activity.

This tutorial uses a couple of packages we have not yet seen. You can learn more about them here:

- RDKit: a Python [Open-Source Cheminformatics Software](https://www.rdkit.org/).
- mordred: a Python [molecular descriptor calculator](https://github.com/mordred-descriptor/mordred) package.

## Learning Objectives

- Learn how to work with chemical structure data using RDKit and `mordred`
- Understand how to calculate and use molecular descriptors for drug activity prediction
- Build and evaluate machine learning models for drug activity classification
- Interpret model performance using ROC AUC scores and accuracy metrics

### Tasks to complete

- Load and preprocess the Tox21 dataset
- Calculate molecular descriptors using `mordred`
- Train and evaluate a Random Forest model
- Train and evaluate a Neural Network model
- Compare model performances

## Prerequisites

- A working Python environment and familiarity with Python
- Basic understanding of machine learning concepts
- Familiarity with pandas and numpy libraries
- Knowledge of basic statistical concepts


## Get Started

- Select the `conda_tensorflow2_p310` kernel on your SageMaker notebook instance.


## Import necessary libraries

Note: You may get a deprecation warning regarding `IPython.core.display`. This shouldn't affect the results of the notebook.


In [None]:
# Install the mordredcommunity library version 2.0.6 using pip package manager.
%pip install mordredcommunity==2.0.6

In [None]:
# Import the NumPy library for numerical operations, often used for array manipulation and mathematical functions.
import numpy as np

# Import the Pandas library for data manipulation and analysis, particularly for working with DataFrames.
import pandas as pd

# Import the display function from IPython.display to enable rich outputs like DataFrames in notebooks.
from IPython.display import display

# Import the Calculator class and the descriptors module from the mordred library. Mordred is used for molecular descriptor calculation.
from mordred import Calculator, descriptors

# Import the Missing class from mordred.error to handle missing descriptor values.
from mordred.error import Missing

# Import the Chem module from RDKit, which is the core module for chemical informatics tasks like molecule handling.
from rdkit import Chem

# Import PandasTools from rdkit.Chem. This module enhances Pandas DataFrames to work seamlessly with RDKit molecules.
# It allows you to store and manipulate RDKit molecules directly within DataFrame columns.
from rdkit.Chem import PandasTools

# Import IPythonConsole from rdkit.Chem.Draw to enable the display of molecule images directly in IPython environments like Jupyter notebooks.
from rdkit.Chem.Draw import IPythonConsole

# Import the RandomForestClassifier from sklearn.ensemble. This is a machine learning model used for classification tasks.
from sklearn.ensemble import RandomForestClassifier

# Import the roc_auc_score function from sklearn.metrics to evaluate the performance of classification models, specifically using the Area Under the ROC Curve metric.
from sklearn.metrics import roc_auc_score

# Import the train_test_split function from sklearn.model_selection to split datasets into training and testing sets for model evaluation.
from sklearn.model_selection import train_test_split

# Import the MLPClassifier from sklearn.neural_network. This is a Multi-layer Perceptron classifier, a type of neural network used for classification.
from sklearn.neural_network import MLPClassifier

## _Toxicology in the 21st Century_ (Tox21) Dataset

The _Toxicology in the 21st Century_ (Tox21) initiative created a public database of compound toxicity measurements, which was used in the 2014 Tox21 Data Challenge. The dataset contains qualitative toxicity measurements for about 8,000 compounds on 12 different targets, including nuclear receptors and stress response pathways.

The data file is a CSV table with the following columns:

- "smiles" - SMILES representation of the molecular structure
- "NR-XXX" - Nuclear receptor signaling bioassays results
  - [AR](https://pubchem.ncbi.nlm.nih.gov/bioassay/743040): qHTS assay to identify small molecule agonists of the androgen receptor (AR) signaling pathway using the MDA cell line.
  - [AhR](https://pubchem.ncbi.nlm.nih.gov/bioassay/743122): qHTS assay to identify small molecules that activate the aryl hydrocarbon receptor (AhR) signaling pathway.
  - [AR-LBD](https://pubchem.ncbi.nlm.nih.gov/bioassay/74353): qHTS assay to identify small molecule agonists of the androgen receptor (AR) signaling pathway.
  - [ER](https://pubchem.ncbi.nlm.nih.gov/bioassay/743079): qHTS assay to identify small molecule agonists of the estrogen receptor alpha (ER-alpha) signaling pathway using the BG1 cell line.
  - [ER-LBD](https://pubchem.ncbi.nlm.nih.gov/bioassay/743077): qHTS assay to identify small molecule agonists of the estrogen receptor alpha (ER-alpha) signaling pathway.
  - [aromatase](https://pubchem.ncbi.nlm.nih.gov/bioassay/743139): qHTS assay to identify aromatase inhibitors.
  - [PPAR-gamma](https://pubchem.ncbi.nlm.nih.gov/bioassay/743140): qHTS assay to identify small molecule agonists of the peroxisome proliferator-activated receptor gamma (PPARg) signaling pathway.

- "SR-XXX" - Stress response bioassays results
  - [ARE](https://pubchem.ncbi.nlm.nih.gov/bioassay/743219): qHTS assay for small molecule agonists of the antioxidant response element (ARE) signaling pathway.
  - [ATAD5](https://pubchem.ncbi.nlm.nih.gov/bioassay/720516): qHTS assay for small molecules that induce genotoxicity in human embryonic kidney cells expressing luciferase-tagged ATAD5.
  - [HSE](https://pubchem.ncbi.nlm.nih.gov/bioassay/743228): qHTS assay for small molecule activators of the heat shock response signaling pathway.
  - [MMP](https://pubchem.ncbi.nlm.nih.gov/bioassay/720637): qHTS assay for small molecule disruptors of the mitochondrial membrane potential.
  - [p53](https://pubchem.ncbi.nlm.nih.gov/bioassay/720552): qHTS assay for small molecule agonists of the p53 signaling pathway.

Please refer to the links at https://tripod.nih.gov/tox21/challenge/data.jsp for details.

### References

Tox21 Challenge. https://tripod.nih.gov/tox21/challenge/


### Load Tox21 dataset


In [None]:
# Reads the tox21.csv file from the "../../Data/" directory into a pandas DataFrame called 'df'.
df = pd.read_csv("../../Data/tox21.csv")

In [None]:
# Display the first 10 rows of the DataFrame 'df' to get a quick overview of the data.
df.head(10)

In [None]:
# Show descriptive summary statistics for the DataFrame 'df'.
# This will include count, mean, std, min, 25%, 50%, 75%, max for numerical columns.
# For categorical columns, it will include count, unique, top, and freq.
df.describe()

In [None]:
# Get column names of the DataFrame 'df' and convert them to a list.
list(df.columns)

In [None]:
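# Select only the columns 'NR-AR' and 'smiles' from the DataFrame 'df'.
# '.copy()' makes 'df' an independent DataFrame, so adding a molecule column later does not trigger a SettingWithCopyWarning.
df = df[["NR-AR", "smiles"]].copy()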

# Displays the first few rows of the DataFrame 'df' with the selected columns.
df.head()

In [None]:
# Get the shape of the DataFrame 'df' (number of rows and columns).
df.shape

In [None]:
# Import RDLogger so that RDKit's log output can be silenced while the SMILES strings are parsed.
from rdkit import RDLogger

# Suppress RDKit warnings to keep the output cleaner.
RDLogger.DisableLog("rdApp.warning")
# Also suppress RDKit error messages (optional). SMILES strings that fail to parse
# still produce missing entries in the molecule column, which are removed in the next cell.
RDLogger.DisableLog("rdApp.error")

# Uses PandasTools to add a new column of RDKit molecule objects to the DataFrame 'df'. 
# The molecules are created from the SMILES strings in the column named "smiles".
PandasTools.AddMoleculeColumnToFrame(
    df, smilesCol="smiles"
)

# Displays the first few rows of the DataFrame 'df' to show the newly added molecule column.
df.head()  

In [None]:
# Remove rows from the DataFrame 'df' where the 'ROMol' column contains missing values (NaN).
df = df[~df["ROMol"].isnull()]

# Remove rows from the DataFrame 'df' where the 'NR-AR' column contains missing values (NaN).
df = df[~df["NR-AR"].isnull()]

# Print the shape (number of rows and columns) of the DataFrame 'df' after removing rows with missing values.
df.shape

We can see that 566 rows with missing values (NaN) were removed.


In [None]:
# Use RDKit PandasTools to generate a grid image of molecules from a Pandas DataFrame.
display(
    PandasTools.FrameToGridImage(
        # Filter the DataFrame 'df' to select rows where the 'NR-AR' column is equal to 1.
        df[df["NR-AR"] == 1].head(5),
        
        # Specify that the 'NR-AR' column should be used to generate legends for each molecule in the grid.
        legendsCol="NR-AR",
        
        # Set the number of molecules to be displayed in each row of the grid to 5.
        molsPerRow=5,
    )
)

In [None]:
# Display a grid image of molecules from a Pandas DataFrame where the 'NR-AR' column is equal to 0.
display(
    PandasTools.FrameToGridImage(
        # Filter the DataFrame 'df' to include only rows where the 'NR-AR' column is 0.
        df[df["NR-AR"] == 0].head(5),
        
        # Specify that the 'NR-AR' column should be used for legends in the grid image.
        legendsCol="NR-AR",
        
        # Set the number of molecules to display per row in the grid to 5.
        molsPerRow=5,
    )
)

In [None]:
# Get the distinct values that appear in the 'NR-AR' column of the DataFrame 'df'. This returns the values themselves, not how many there are.
df["NR-AR"].unique()

In [None]:
# Counts the number of non-missing values in the 'NR-AR' column of the DataFrame 'df'.
df["NR-AR"].count()

In [None]:
# Calculate and return the sum of the values in the 'NR-AR' column of the DataFrame 'df'.
df["NR-AR"].sum()

## What is a molecular descriptor?

Molecular descriptors are mathematical representations of a molecule's properties, generated through algorithmic calculations. These descriptors translate the physical and chemical characteristics of molecules into numerical values, providing a quantitative way to describe their structure and behavior. By capturing essential information about molecular features, such as size, shape, polarity, and electronic properties, molecular descriptors serve as powerful tools for predicting various outcomes, including biological activity, toxicity, and other properties derived from the chemical structure of compounds. They play a critical role in fields like drug discovery, chemical informatics, and environmental science, enabling researchers to link molecular structure to function in a systematic and data-driven manner.
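
To make the idea of "structure in, numbers out" concrete, here is a minimal sketch that is deliberately not how `mordred` or RDKit work. It assumes a molecule is given only as a dictionary of atom counts and computes three hand-picked toy descriptors. The function name `toy_descriptors` and the descriptor names are hypothetical and exist only for illustration; real descriptor calculators work from the full molecular graph.

```python
# Toy illustration only: these three descriptors are hand-picked and computed
# from atom counts, not from a molecular graph as mordred or RDKit would do.

# Standard atomic weights (g/mol), rounded to three decimals.
ATOMIC_WEIGHT = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def toy_descriptors(atom_counts):
    """Map a molecule, given as {element: count}, to a small numeric feature vector."""
    mol_weight = sum(ATOMIC_WEIGHT[el] * n for el, n in atom_counts.items())
    heavy_atoms = sum(n for el, n in atom_counts.items() if el != "H")
    hetero_atoms = sum(n for el, n in atom_counts.items() if el not in ("C", "H"))
    return {
        "mol_weight": round(mol_weight, 3),
        "heavy_atom_count": heavy_atoms,
        "heteroatom_fraction": hetero_atoms / heavy_atoms,
    }


ethanol = {"C": 2, "H": 6, "O": 1}  # C2H6O
benzene = {"C": 6, "H": 6}          # C6H6

ethanol_features = toy_descriptors(ethanol)
benzene_features = toy_descriptors(benzene)
print(ethanol_features)
print(benzene_features)
```

Each molecule becomes a row of numbers with the same keys, which is the shape a machine learning model expects. The descriptor table computed later in this notebook has the same layout, with one row per compound and one column per descriptor.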

(Altogether, the code in this section may take up to 20 minutes to complete.)


In [None]:
# Create a descriptor calculator object named 'calc' that will compute all descriptors listed in the 'descriptors' variable.
# The argument 'ignore_3D=True' specifies that 3D descriptors should be excluded from the calculation.
calc = Calculator(descriptors, ignore_3D=True)

In [None]:
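# Select the first molecule in the 'ROMol' column by position.
# '.iloc[0]' is used because rows were filtered above, so the index label 0 is not guaranteed to exist.
mol = df["ROMol"].iloc[0]

# Display 'mol', an RDKit molecule object; in a notebook it renders as a 2D structure drawing.
mol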

(The following step may take up to 40 minutes to complete.)


In [None]:
# Uses the 'pandas' method from the 'calc' object to calculate molecular properties for multiple molecules in the 'ROMol' column of DataFrame 'df'. Returns the results as a pandas DataFrame named 'df2'.
df2 = calc.pandas(df["ROMol"])

In [None]:
# Display the first few rows of the DataFrame 'df2' (by default, it shows the first 5 rows).
df2.head()

In [None]:
# Get the shape of the DataFrame df2 (number of rows and columns).
df2.shape

In [None]:
# Initialize an empty list called 'missing' to store column names with missing values.
missing = []

# Iterate through each column name in the DataFrame 'df2'.
for column in df2.columns:
    # Check whether any value in the current column is a mordred 'Missing' object.
    # 'df2[column].apply(...)' applies the lambda to each element of the column,
    # and 'isinstance(x, Missing)' is True when the element 'x' is a Missing value.
    # '.any()' returns True if at least one element in the resulting Series is True.
    if df2[column].apply(lambda x: isinstance(x, Missing)).any():
        
        # If the condition in the 'if' statement is True (meaning the column contains at least one 'Missing' value),
        # append the name of the 'column' to the 'missing' list.
        missing.append(column)

In [None]:
# Drop the columns that contain at least one Missing value from the DataFrame 'df2' and assign the result to 'df_new'.
df_new = df2.drop(missing, axis=1)

In [None]:
# Display the first 5 rows of the DataFrame 'df_new' to inspect the data.
df_new.head()

In [None]:
# Returns the shape of the DataFrame 'df_new' as a tuple (number of rows, number of columns).
df_new.shape

In [None]:
# Assigns the 'NR-AR' column from the DataFrame 'df' to the variable 'y' as the target variable.
y = df["NR-AR"]

# Assigns the DataFrame 'df_new' to the variable 'X' to be used as the feature matrix (molecular descriptors).
X = df_new

In [None]:
# Split data into 75% training and 25% test sets
# X_train: Features for the training dataset (75% of X).
# X_test: Features for the test dataset (25% of X).
# y_train: Labels for the training dataset (75% of y).
# y_test: Labels for the test dataset (25% of y).
# train_test_split: Function used to split the dataset into training and testing sets.
# X: Features data to be split.
# y: Labels data to be split.
# By default, test_size is 0.25 if not specified, meaning 25% of the data will be used for testing, and 75% for training.
X_train, X_test, y_train, y_test = train_test_split(X, y)

## Random forest

A **random forest** is an ensemble learning method that functions as a **meta-estimator**. It works by training multiple **decision tree classifiers** on different sub-samples of the dataset, typically drawn with replacement (bootstrapping). Each tree is trained independently, and the final prediction is obtained by averaging the predictions of all the individual trees (for regression tasks) or through majority voting (for classification tasks). This approach not only enhances the model's **predictive accuracy** but also helps to **control over-fitting** by reducing the variance that can occur with individual decision trees. By combining the strengths of many trees, random forests create a robust and reliable model that performs well on a wide range of datasets.
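
The two ingredients described above, bootstrap sampling and combining the trees' outputs, can be sketched in plain Python. This is an illustration under simplifying assumptions, not scikit-learn's implementation: the tree outputs are hypothetical hard-coded values rather than the result of training, and the helper names are made up for this example.

```python
import random
from collections import Counter


def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def majority_vote(tree_predictions):
    """Return the class label predicted by the most trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


def mean_probability(tree_probabilities):
    """Average the per-tree probabilities for the positive class."""
    return sum(tree_probabilities) / len(tree_probabilities)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
sample = bootstrap_indices(10, rng)
# Sampling with replacement: some rows can repeat and others can be left out.
print(sorted(sample))

# Hypothetical outputs from five trees for a single compound.
votes = [1, 0, 1, 1, 0]
probabilities = [0.9, 0.2, 0.7, 0.6, 0.1]

forest_label = majority_vote(votes)
forest_probability = mean_probability(probabilities)
print(forest_label)        # 1, because three of the five trees voted for class 1
print(forest_probability)  # the averaged probability for class 1
```

Each tree sees a different bootstrap sample, so the trees make different errors, and combining them reduces variance. Note that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than by counting hard votes.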


In [None]:
# Create a RandomForestClassifier (imported at the top of the notebook) with 100 trees (n_estimators=100)
# and train it on the training data (X_train, y_train).
clf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)

### Area Under the Receiver Operating Characteristic Curve (ROC AUC)

The **Area Under the Receiver Operating Characteristic Curve (ROC AUC)** is a performance metric used to evaluate the effectiveness of a classification model. It measures the model's ability to distinguish between classes by plotting the **True Positive Rate (TPR)** against the **False Positive Rate (FPR)** at various threshold settings. A higher ROC AUC score indicates better model performance, with a score of 1 representing perfect classification and 0.5 indicating random guessing.

In the context of a **random forest classifier**, the predicted class probabilities for an input sample are calculated as the **mean predicted class probabilities** across all the trees in the forest. For a single decision tree, the class probability is determined by the fraction of samples belonging to the same class in a given leaf node. By averaging these probabilities across all trees, the random forest provides a robust estimate of the likelihood that a sample belongs to a particular class. This approach enhances the model's reliability and accuracy, making ROC AUC a valuable metric for assessing its performance.
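
One way to build intuition for the score is its ranking interpretation: ROC AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes that quantity directly. It is a from-scratch illustration with a hypothetical function name and made-up data, and it is far less efficient than `roc_auc_score`, which should be preferred in practice.

```python
def roc_auc_pairwise(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one positive and one negative label")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Hypothetical labels and predicted probabilities for four compounds.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc_pairwise(labels, scores)
print(auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```

Because only the ordering of the scores matters, ROC AUC does not depend on the classification threshold, which is why it is computed from `predict_proba` output rather than from predicted labels.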


In [None]:
# Calculate the Area Under the Receiver Operating Characteristic Curve (ROC AUC) score.
# The roc_auc_score function is used to evaluate the model's performance by comparing
# the true labels (y_test) with the predicted probabilities for the positive class.

# clf.predict_proba(X_test)[:, 1] extracts the predicted probabilities for the positive class (class 1)
# from the test dataset (X_test). The [:, 1] indexing selects the second column of the probability
# array, which corresponds to the positive class.

roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

In [None]:
# Compute the mean accuracy of the classifier 'clf' on the training data (X_train, y_train).
clf.score(X_train, y_train)

In [None]:
# Compute the mean accuracy of the classifier (clf) on the test data (X_test, y_test).
# This method calculates the accuracy by comparing the classifier's predictions for X_test
# against the true labels y_test and returning the mean accuracy score.
clf.score(X_test, y_test)
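
Accuracy is easier to interpret next to a trivial baseline. The earlier cells that call `df["NR-AR"].count()` and `df["NR-AR"].sum()` give the number of labelled compounds and the number of actives. If actives are a small minority, a model that always predicts "inactive" already scores high accuracy. The sketch below shows the effect on hypothetical labels (96 inactive, 4 active); these numbers are made up for illustration and are not the Tox21 counts.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall_positive(y_true, y_pred):
    """Fraction of true positives (label 1) that the model also predicts as 1."""
    predictions_for_actives = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(predictions_for_actives) / len(predictions_for_actives)


# Hypothetical, deliberately imbalanced labels: 96 inactive (0) and 4 active (1).
y_true = [0] * 96 + [1] * 4

# A "model" that ignores its input and always predicts the majority class.
always_inactive = [0] * len(y_true)

baseline_accuracy = accuracy(y_true, always_inactive)
baseline_recall = recall_positive(y_true, always_inactive)
print(baseline_accuracy)  # 0.96, despite the model never finding an active compound
print(baseline_recall)    # 0.0
```

A test accuracy is informative only to the extent that it exceeds this majority-class baseline, which is one reason to report ROC AUC alongside accuracy.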

## Multi-layer Perceptron classifier

The **Multi-layer Perceptron (MLP) classifier** is a type of artificial neural network for supervised learning tasks, particularly classification. The model is trained by minimizing the log-loss function (also known as cross-entropy loss), which measures the difference between predicted probabilities and actual labels. scikit-learn's `MLPClassifier` offers two families of optimizers through its `solver` parameter:

- **LBFGS (Limited-memory Broyden–Fletcher–Goldfarb–Shanno)**: A quasi-Newton optimization algorithm that approximates the Hessian matrix to find the minimum of the loss function. It uses the full training set at each step and can converge faster and perform better on small datasets.

- **Stochastic Gradient Descent (SGD)**: An iterative optimization method that updates model parameters using small random subsets (batches) of the data. It scales better to larger datasets. The default solver, `adam`, is a stochastic gradient-based optimizer of this kind, and it is the one used by the code below because no `solver` is specified.

The MLP classifier consists of multiple layers of interconnected nodes (neurons), including an input layer, one or more hidden layers, and an output layer. Each neuron applies a non-linear activation function (e.g., ReLU or sigmoid) to its inputs, enabling the network to learn complex patterns and relationships in the data. This flexibility makes the MLP classifier a powerful tool for solving a wide range of classification problems.
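
The layer-by-layer computation and the log-loss can be sketched for a tiny network in plain Python. This is a minimal illustration, assuming one hidden layer with ReLU, a sigmoid output for binary classification, and hand-picked hypothetical weights. It shows only the forward pass and the loss for one sample, not the training procedure.

```python
import math


def relu(x):
    """Rectified linear unit: pass positive values, clamp negatives to zero."""
    return max(0.0, x)


def sigmoid(x):
    """Squash a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(w . x + b) for each neuron."""
    return [
        activation(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]


def log_loss(y_true, p):
    """Binary cross-entropy for a single sample with predicted probability p."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


# Hypothetical weights for a 2-input network with one hidden layer of 2 neurons.
hidden_w = [[1.0, -1.0], [0.5, 0.5]]
hidden_b = [0.0, -1.0]
output_w = [[1.0, 1.0]]
output_b = [0.0]

x = [2.0, 1.0]  # two descriptor values for one compound
hidden = dense(x, hidden_w, hidden_b, relu)
prob_active = dense(hidden, output_w, output_b, sigmoid)[0]
loss_if_active = log_loss(1, prob_active)

print(hidden)          # [1.0, 0.5]
print(prob_active)     # sigmoid(1.5), roughly 0.82
print(loss_if_active)  # small, because the prediction agrees with the label 1
```

Training adjusts the weights and biases to reduce the average of this loss over the training set. The network in this notebook has the same structure, only with six hidden layers and many more neurons per layer.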


In [None]:
# Print the original DataFrame X_train, including rows with missing values (NaN).
print(X_train)

# Print a new DataFrame that is created by removing rows with any missing values (NaN) from X_train.
# This will show X_train with only complete rows, where no values are missing.
print(X_train.dropna())

(The following cell should take about three minutes to complete.)


In [None]:
# Create a Multi-layer Perceptron classifier instance with 6 hidden layers.
# The number of neurons in each hidden layer is given as a list: [1000, 500, 250, 100, 50, 20].
# Layer 1: 1000 neurons
# Layer 2: 500 neurons
# Layer 3: 250 neurons
# Layer 4: 100 neurons
# Layer 5: 50 neurons
# Layer 6: 20 neurons
clf = MLPClassifier(hidden_layer_sizes=[1000, 500, 250, 100, 50, 20])

# Train the Multi-layer Perceptron classifier model
# using the training data (X_train features and y_train labels).
clf = clf.fit(X_train, y_train)

### Area Under the Receiver Operating Characteristic Curve (ROC AUC)

Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC) from predicted class probabilities.


In [None]:
# Calculate the Area Under the Receiver Operating Characteristic Curve (ROC AUC score).
# This metric evaluates the performance of the classifier by measuring the area under the ROC curve.
# It uses the true labels (y_test) and the predicted probabilities of the positive class (class '1')
# from the classifier (clf) on the test data (X_test).
roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

In [None]:
# Compute the mean accuracy of the classifier 'clf' on the training data (X_train, y_train).
clf.score(X_train, y_train)

In [None]:
# Compute the mean accuracy of the classifier on the test data (X_test, y_test).
clf.score(X_test, y_test)

## Conclusion

This tutorial demonstrated how to:

- Work with chemical structure data using RDKit
- Calculate molecular descriptors using `mordred`
- Build and evaluate machine learning models for predicting drug activity
- Use different model architectures (Random Forest and Neural Networks) for classification tasks
- Assess model performance using ROC AUC scores and accuracy metrics

## Clean up

Remember to shut down your Jupyter Notebook environment and delete any unnecessary files or resources once you've completed the tutorial.
