# QSAR - Regression

Build either a Random Forest or k-Nearest Neighbor regression model using scikit-learn.  

## Script variables

Set the script variables in the cell below before running the notebook.  Four pieces of information are required.  

1) `SDFILE_DIR`: the file path to the SDFile containing the chemicals used to build the ML model
2) `ACTIVITY_COLUMN`: the name of the column/property in the SDFile that contains the activity you would like to perform QSAR on
3) `NAME_COLUMN`: the name of the column/property in the SDFile that contains the name or identifier of the molecule
4) `ALGORITHM`: which machine learning algorithm you would like to use - either random forest (rf) or k-nearest neighbors (knn)

In [1]:
SDFILE_DIR = 'data/QSAR_solubility_example.sdf'
ACTIVITY_COLUMN = 'Solubility'
NAME_COLUMN = 'Compound ID'
ALGORITHM = 'rf'
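
Before moving on, it can help to check these settings for typos. The sketch below is not part of the original notebook: `check_settings` is a hypothetical helper, and it assumes only that the SDFile must exist on disk and that `ALGORITHM` is either `'rf'` or `'knn'`.

```python
import os

# Hypothetical helper (not part of the original notebook): collects
# human-readable problems with the four script variables.
def check_settings(sdfile, activity_column, name_column, algorithm):
    problems = []
    if not os.path.isfile(sdfile):
        problems.append(f"SDFile not found: {sdfile}")
    if not activity_column:
        problems.append("ACTIVITY_COLUMN is empty")
    if not name_column:
        problems.append("NAME_COLUMN is empty")
    if algorithm not in ("rf", "knn"):
        problems.append(f"ALGORITHM must be 'rf' or 'knn', got {algorithm!r}")
    return problems

# Example call with the values from the cell above; prints nothing if all is well.
for problem in check_settings('data/QSAR_solubility_example.sdf',
                              'Solubility', 'Compound ID', 'rf'):
    print(problem)
```

An empty list means the settings passed these basic checks. It does not confirm that the named columns actually exist in the SDFile.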

First, import the necessary packages.

In [2]:
# Standard imports
import pandas as pd
import numpy as np
import os
from rdkit import Chem 
from rdkit.Chem import PandasTools
from rdkit.ML.Descriptors import MoleculeDescriptors
from rdkit.Chem import AllChem
from rdkit.Chem import Descriptors

# ML imports
from sklearn import model_selection
from sklearn import pipeline
from sklearn.model_selection import cross_val_predict
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor

# Plotting imports
import seaborn as sns
import matplotlib.pyplot as plt

## Data preprocessing

Here we create descriptors for the chemicals and prepare the descriptor matrix and the activity column we want to learn.

In [3]:
def calc_descriptors_from_mol(mol):
    """
    Encode a molecule from a RDKit Mol into a set of descriptors.

    Parameters
    ----------
    mol : RDKit Mol
        The RDKit molecule.

    Returns
    -------
    list
        The set of chemical descriptors as a list.

    """
    calc = MoleculeDescriptors.MolecularDescriptorCalculator([desc[0] for desc in Descriptors.descList])
    return list(calc.CalcDescriptors(mol))

Loop through the data frame and create chemical descriptors for each molecule. 

In [4]:
df = PandasTools.LoadSDF(SDFILE_DIR)

desc_list = []

for mol in df.ROMol.tolist():
    desc = calc_descriptors_from_mol(mol)
    desc_list.append(desc)
    
desc_frame = pd.DataFrame(desc_list, columns = [desc[0] for desc in Descriptors.descList])

Create variables to feed into the ML algorithm.  `X` is the descriptor matrix and `y` contains the activity to predict.

In [5]:
X = desc_frame.copy()
X.index = df[NAME_COLUMN].astype(str)
y = df[ACTIVITY_COLUMN].astype(float)
y.index = df[NAME_COLUMN].astype(str)

print(X.shape)
print(y.shape)

(250, 208)
(250,)


Sometimes descriptors cannot be calculated for a chemical, so we need to remove any rows that contain missing values.

In [6]:
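# Keep only the rows where every descriptor was calculated (no missing values)
has_all_descriptors = X.notnull().all(axis=1)
X_train = X[has_all_descriptors]
y_train = y[has_all_descriptors]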

print(X_train.shape)
print(y_train.shape)

(250, 208)
(250,)
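
The null check keeps every row in this example, but it does not flag infinite values, which `notnull()` treats as valid. Depending on the RDKit version and the molecules involved, a descriptor may occasionally come back infinite, and scikit-learn estimators reject such input. The following is a minimal sketch of a stricter filter. It uses a small made-up frame in place of `X` and `y`, and the name `finite_mask` and the data are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the descriptor matrix and activity vector (made-up values).
X_demo = pd.DataFrame(
    {"desc_a": [1.0, np.nan, 3.0, 4.0], "desc_b": [0.5, 0.1, np.inf, 0.2]},
    index=["mol1", "mol2", "mol3", "mol4"],
)
y_demo = pd.Series([0.1, 0.2, 0.3, 0.4], index=X_demo.index)

# True only for rows where every descriptor is a finite number
# (np.isfinite is False for NaN, +inf and -inf).
finite_mask = np.isfinite(X_demo.to_numpy(dtype=float)).all(axis=1)

X_clean = X_demo[finite_mask]
y_clean = y_demo[finite_mask]

print(X_clean.index.tolist())
```

Here `mol2` is dropped for its missing value and `mol3` for its infinite one. The same mask is applied to both objects so that descriptors and activities stay aligned.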


## Model training

The model is trained using 5-fold cross validation.  Model parameters are chosen with a "grid search", which tries every parameter value in a predefined grid and keeps the one with the best cross-validated score.  

The 5-fold cross validation predictions are exported to a file `data/five_fold_predictions.csv`.

In [7]:
# Random forest regressor
model_RF = RandomForestRegressor()

# k-nearest neighbors regressor using Euclidean distance
model_KNN = KNeighborsRegressor(metric='euclidean')


models = {
    "rf": model_RF,
    "knn": model_KNN
}

params_dic = {
    'rf' : {'rf__n_estimators': [10, 25, 50, 100]},
    'knn': {'knn__n_neighbors': [1, 2, 3, 4, 5, 10, 25]}
}
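
To make the grid search concrete: each key in `params_dic` has the form `<pipeline step name>__<parameter name>`, which is how scikit-learn routes a value to a step inside a `Pipeline`. The search then fits one model per candidate per fold. The sketch below enumerates the candidates in plain Python. `expand_grid` is a hypothetical helper written for illustration, not a scikit-learn function.

```python
from itertools import product

# Hypothetical helper: expand a {name: [values]} grid into a list of
# {name: value} candidates, one per combination of values.
def expand_grid(grid):
    names = sorted(grid)
    return [dict(zip(names, combo))
            for combo in product(*(grid[name] for name in names))]

rf_grid = {'rf__n_estimators': [10, 25, 50, 100]}
candidates = expand_grid(rf_grid)

n_folds = 5
# One fit per candidate per fold; the final refit on all data is not counted.
n_cv_fits = len(candidates) * n_folds

print(candidates)
print(n_cv_fits)
```

With a single parameter the grid is just a list of values, so the random forest search evaluates 4 candidates over 5 folds. Adding a second parameter multiplies the number of candidates, which is why grids are usually kept small.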

In [8]:
N_FOLDS = 5

clf = models[ALGORITHM]
params = params_dic[ALGORITHM]
pipe = pipeline.Pipeline([('scaler', StandardScaler()), (ALGORITHM, clf)])
cv = KFold(n_splits=N_FOLDS, shuffle=True, random_state=0)
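# refit=True retrains the best pipeline on all training data; with no `scoring`
# argument, candidates are ranked by the regressor's default score (R^2).
grid_search = model_selection.GridSearchCV(pipe, param_grid=params, cv=cv, refit=True)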
grid_search.fit(X_train, y_train)
best_estimator = grid_search.best_estimator_
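
After fitting, the search object records which parameter value won and how each candidate scored. The sketch below shows how these results might be inspected. It is self-contained and runs on synthetic data, so `X_demo`, `y_demo`, `pipe_demo` and `search` are illustrative stand-ins rather than the notebook's own variables.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for X_train / y_train.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 4))
y_demo = X_demo[:, 0] * 2.0 + rng.normal(scale=0.1, size=60)

pipe_demo = Pipeline([('scaler', StandardScaler()),
                      ('knn', KNeighborsRegressor())])
search = GridSearchCV(
    pipe_demo,
    param_grid={'knn__n_neighbors': [1, 3, 5]},
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_demo, y_demo)

# Winning parameter value and its mean cross-validated score.
# With no `scoring` argument, a regressor is scored with R^2.
print(search.best_params_)
print(round(search.best_score_, 3))

# Mean score of every candidate, in the order they were tried.
results = search.cv_results_
for candidate, score in zip(results['params'], results['mean_test_score']):
    print(candidate, round(score, 3))
```

In the notebook itself, the same attributes (`best_params_`, `best_score_`, `cv_results_`) are available on `grid_search`.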

Generate cross-validated predictions with the best estimator and export them.

In [9]:
# Reuse the shuffled folds from the grid search so the predictions match its splits
preds = pd.DataFrame(cross_val_predict(best_estimator, X_train, y_train, cv=cv), 
                    index=X_train.index, columns=['CV Prediction'])
preds['CV Prediction'] = preds['CV Prediction'].round(2)
preds['Compound'] = X_train.index
preds[['Compound', 'CV Prediction']].to_csv('data/five_fold_predictions.csv', index=False)
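
The exported predictions can be compared with the measured activities to summarize how well the model performs. The notebook already imports `mean_absolute_error` and `r2_score`. The sketch below shows their use on made-up numbers. In the notebook, `y_train` and the unrounded cross-validated predictions would presumably take the place of the two lists.

```python
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up observed and cross-validated predicted values (illustrative only).
y_obs = [1.0, 2.0, 3.0, 4.0]
y_cv = [1.5, 2.0, 2.5, 4.0]

mae = mean_absolute_error(y_obs, y_cv)  # mean of |observed - predicted|
r2 = r2_score(y_obs, y_cv)              # 1 - SS_res / SS_tot

print(f"MAE: {mae:.2f}")
print(f"R2:  {r2:.2f}")
```

For these numbers the absolute errors are 0.5, 0, 0.5 and 0, giving an MAE of 0.25. The residual sum of squares is 0.5 against a total of 5.0, giving an R² of 0.9.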