# Regression Models for Fa Prediction using Descriptors Calculated with Mordred

## Materials and Method

- Libraries: NumPy, pandas, scikit-learn, matplotlib, RDKit, mordred and SHAP
- Dataset: Fraction of absorption (Fa) and permeability measured in Caco-2 cells (Papp), collected in a previous study (Esaki et al., Journal of Pharmaceutical Sciences, 2019)
- Descriptor calculation: Mordred

### Library Import

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
print('numpy version: ', np.__version__)
print('pandas version: ', pd.__version__)

import sklearn
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_selection import VarianceThreshold
print('scikit-learn version: ', sklearn.__version__)

import shap
print('shap version: ', shap.__version__)

In [None]:
target = 'Fa'

### Dataset Import

The dataset contained the chemical structures of 5567 compounds as SMILES strings, of which 946 had experimental Fa values. Owing to its accuracy, we used CORINA (ver. 4.4.0) to generate 3D structures of the chemical compounds in structure data format (SDF).

We used the molecular descriptor calculator Mordred to calculate 1613 1D and 2D descriptors for the Fa dataset.

In [None]:
df_2Ddescriptors = pd.read_csv('CBIJ_Esaki_et_al_Descriptor_Mordred_1D2D.csv', index_col=0)
df_2Ddescriptors.head(5)

In [None]:
df_2Ddescriptors = df_2Ddescriptors.drop("Papp", axis=1)
df_2Ddescriptors.head(5)

## Definition of Functions

### Data preparation and Distribution confirmation

The Fa dataset was randomly split into a training set (70%, 660 compounds) and a test set (30%, 286 compounds) using the train_test_split function in scikit-learn. The Fa measurements ranged between 0.0 and 1.0 and were concentrated near either 0.0 or 1.0. To spread out the response variable, we first set Fa = 0.0 to 0.01 and Fa = 1.0 to 0.99, and then transformed the values to log10(Fa/(1 - Fa)). Note that the code below applies the natural logarithm (np.log), which differs from log10 only by a constant factor.

In [None]:
def data_separation(df, target):
    df = df.dropna(subset=[target])
    X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, 2:], df[target],
                                                        train_size=0.7, random_state=0)
    return X_train, X_test, y_train, y_test


def distribution_conformation(df, target):
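    X_train, X_test, y_train, y_test = data_separation(df, target)
    # Copy the targets so the in-place edits below do not trigger chained-assignment warnings
    y_train, y_test = y_train.copy(), y_test.copy()
    
    # Remove descriptors that contain NaN in the training set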
    X_train = X_train.dropna(axis=1)
    X_test = X_test[X_train.columns]
    print(X_train.shape, X_test.shape, X_test.dropna(axis=1).shape)
    
    fig = plt.figure(figsize=(15, 3))
    
    for f in range(4):
        ax = fig.add_subplot(1, 4, f+1)
        if f == 0:
            ax.hist(y_train, bins=25, label='Training set')
            ax.hist(y_test, bins=25, label='Test set')
            ax.set_title('{}'.format(target))
        elif f == 1:
            if target == 'Fa':
                y_train[y_train == 0] = 0.01
                y_train[y_train == 1] = 0.99
                y_train = np.log(y_train/(1-y_train))
                y_test[y_test == 0] = 0.01
                y_test[y_test == 1] = 0.99
                y_test = np.log(y_test/(1-y_test))
            elif target == 'Papp':
                y_train = np.log(y_train)
                y_test = np.log(y_test)
            ax.hist(y_train, bins=25, label='Training set')
            ax.hist(y_test, bins=25, label='Test set')
            ax.set_title('Converted {}'.format(target))
        elif f == 2:
            ax.hist(X_train['MW'], range=(0,700), bins=25, label='Training set')
            ax.hist(X_test['MW'], range=(0,700), bins=25, label='Test set')
            ax.set_title('MW: {} data'.format(target))
        elif f == 3:
            ax.hist(X_train['SLogP'], range=(-10,10), bins=25, label='Training set')
            ax.hist(X_test['SLogP'], range=(-10,10), bins=25, label='Test set')
            ax.set_title('SLogP: {} data'.format(target))
        plt.legend(fontsize=12)
    plt.show()
    
    return X_train, X_test, y_train, y_test
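
The Fa conversion performed inside `distribution_conformation` can be isolated as a small helper. The following is a minimal sketch rather than part of the original notebook: the function names `fa_to_logit` and `logit_to_fa` are hypothetical, and the natural logarithm is assumed because that is what the notebook code uses. The inverse function shows one way predictions on the converted scale could be mapped back to Fa.

```python
import numpy as np


def fa_to_logit(fa, lower=0.01, upper=0.99):
    """Map Fa in [0, 1] to log(Fa / (1 - Fa)), replacing exact 0 and 1 first."""
    fa = np.asarray(fa, dtype=float).copy()
    fa[fa == 0.0] = lower  # avoids log(0)
    fa[fa == 1.0] = upper  # avoids division by zero
    return np.log(fa / (1.0 - fa))


def logit_to_fa(z):
    """Inverse transform: map a value on the converted scale back to Fa."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical Fa values, including both boundary cases
fa = np.array([0.0, 0.25, 0.5, 0.9, 1.0])
z = fa_to_logit(fa)
fa_back = logit_to_fa(z)
print(z)
print(fa_back)
```

After the round trip, the boundary values come back as 0.01 and 0.99 rather than 0.0 and 1.0, because the replacement step is not reversible.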

### Descriptor preparation

We performed data preparation steps for the descriptors of the compounds in the training set. First, descriptors containing NaN were removed. Next, descriptors with small variance were removed using VarianceThreshold in scikit-learn (threshold=1.0). Any NaN values remaining in the test set were filled with the corresponding training-set means.

In [None]:
def descriptor_preparation(X_train, X_test):
    # Remove low-variance descriptors (training-set variance <= 1.0)
    var = VarianceThreshold(threshold=1.0).fit(X_train)
    X_train = X_train.loc[:, var.get_support()]
    X_test = X_test.loc[:, var.get_support()]
    
    # Fill NaNs in the test set with training-set averages
    train_averages = X_train.mean()
    X_test = X_test.fillna(train_averages)
    
    return X_train, X_test
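
To illustrate what the variance filter keeps and discards, here is a small sketch on a toy table. The column names and values are hypothetical and are not taken from the Fa dataset. It shows that `VarianceThreshold` works on the raw (unscaled) variance, so a descriptor expressed in small units is dropped at `threshold=1.0` even if it is not constant.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy descriptor table (hypothetical values)
X_toy = pd.DataFrame({
    "const": [1.0, 1.0, 1.0, 1.0],          # variance 0
    "narrow": [0.1, 0.2, 0.3, 0.4],         # variance 0.0125
    "wide": [100.0, 250.0, 400.0, 550.0],   # variance 28125
})

selector = VarianceThreshold(threshold=1.0).fit(X_toy)

# get_support() is True for columns whose variance exceeds the threshold
kept = X_toy.columns[selector.get_support()].tolist()
print(dict(zip(X_toy.columns, selector.variances_)))
print(kept)
```

Because the threshold depends on the scale of each descriptor, a different cut-off or prior scaling could change which descriptors survive. The value 1.0 is the one used in this notebook.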

### Model Construction

We constructed a Random Forest Regression (RFR) model for Fa prediction. RFR is an ensemble model based on decision trees, and the following hyperparameters were optimized: n_estimators (the number of trees), max_depth (the maximum depth of each tree), and min_samples_split (the minimum number of samples required to split an internal node).

In [None]:
from sklearn.ensemble import RandomForestRegressor

def model_construction_rfr(X_train, y_train):    
    # RFR
    search_params = [{"max_depth": [15, 17, 20, 25, 30],
                      "n_estimators":[750, 1000, 1200, 1500, 1750, 2000],
                      "min_samples_split": [2, 3, 5]}]
    gs_rfr = GridSearchCV(RandomForestRegressor(random_state=0, n_jobs=-1),
                          search_params,
                          cv=10,
                          n_jobs=-1,
                          scoring='neg_mean_squared_error')
    gs_rfr.fit(X_train, y_train)
    
    return gs_rfr
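
The cost of the grid search above can be estimated before running it. The sketch below only counts candidates and fits for the grid and `cv=10` setting used in `model_construction_rfr`. It does not train any model, and the variable names are illustrative.

```python
from math import prod

# Same grid as in model_construction_rfr
search_params = {
    "max_depth": [15, 17, 20, 25, 30],
    "n_estimators": [750, 1000, 1200, 1500, 1750, 2000],
    "min_samples_split": [2, 3, 5],
}
cv_folds = 10

# Number of hyperparameter combinations evaluated by the grid search
n_candidates = prod(len(values) for values in search_params.values())

# Each candidate is fitted once per fold (the final refit is not counted)
n_fits = n_candidates * cv_folds

# Total trees grown: each n_estimators value is combined with every
# (max_depth, min_samples_split) pair and every fold
n_other = len(search_params["max_depth"]) * len(search_params["min_samples_split"])
n_trees = sum(search_params["n_estimators"]) * n_other * cv_folds

print(n_candidates, n_fits, n_trees)
```

With 90 candidates and 900 fits, the search grows more than a million trees, so the run time on the full descriptor set can be substantial.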

### Visualization of results

R2 is an inadequate score for nonlinear models. Thus, the root mean squared error (RMSE) was employed to compare the predictive performance of the three algorithms.

The scatter plot shows the observed and predicted Fa values.

In [None]:
def visualization_results(model, X_test, y_test):
    # Predicted values
    y_pred = model.predict(X_test)
    
    print(f'Correlation coefficient between observed and predicted values: {round(np.corrcoef(y_pred, y_test)[1,0], 4)}')
    
    # Value range covering both observed and predicted values (for plot scaling)
    y_min = min([min(list(y_test)), min(list(y_pred))])
    y_max = max([max(list(y_test)), max(list(y_pred))])
    
    # Scatter plot of observed vs. predicted values
    fig = plt.figure(figsize=(4, 4))
    # Scatter plot
    plt.scatter(y_test, y_pred)
    # Title (includes the RMSE)
    plt.title('Scatter plot, RMSE: {}'.format(round(np.sqrt(np.sum((y_pred - y_test)**2)/len(y_test)), 4)),fontsize=14)
    plt.xlabel('Observed value')
    plt.ylabel('Predicted value')
    
    # Show grid lines
    plt.grid()
    plt.show()
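
The two metrics reported by `visualization_results` measure different things. As a minimal sketch with made-up numbers (the helper names `rmse` and `pearson_r` are hypothetical), the example below applies a constant offset to the observed values. The correlation coefficient stays perfect, while the RMSE reports the systematic error.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between observed and predicted values."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])


# Made-up observed values and predictions shifted by a constant offset
y_obs = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y_hat = y_obs + 1.0

error = rmse(y_obs, y_hat)
corr = pearson_r(y_obs, y_hat)
print(error, corr)
```

This is one reason to read the correlation coefficient together with the RMSE rather than on its own.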

## Results: Fa Prediction using 1D2D Descriptors

### Checking Data Distribution 

The distributions of molecular weight (MW) and SLogP in this dataset, both calculated using Mordred, are shown in Figure 1 of the main text. The p-values of the Wilcoxon signed rank test were higher than 0.05 for the converted Fa, MW, and SLogP, indicating no significant difference between the training and test sets.

In [None]:
X_train, X_test, y_train, y_test = distribution_conformation(df_2Ddescriptors, target)
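
A train/test comparison of the kind described above could be sketched as follows. This is an assumption-laden illustration, not the original analysis. The signed-rank form of the Wilcoxon test requires paired samples of equal size, so the sketch uses the unpaired rank-sum counterpart (Mann-Whitney U) from SciPy. The input values are synthetic and only mimic the sizes of the two sets.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Synthetic stand-ins for a descriptor such as MW in each set (not real data)
rng = np.random.default_rng(0)
train_values = rng.normal(loc=350.0, scale=80.0, size=660)
test_values = rng.normal(loc=350.0, scale=80.0, size=286)

# Two-sided rank-sum test for a difference between two independent samples
stat, p_value = mannwhitneyu(train_values, test_values, alternative="two-sided")

print(f"U statistic: {stat:.1f}, p-value: {p_value:.4f}")
```

In the notebook, the same call could be applied to `y_train` and `y_test`, or to the `MW` and `SLogP` columns of `X_train` and `X_test`.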

### Model Construction

Suitable parameters were selected using GridSearchCV (cv=10, scoring='neg_mean_squared_error').

In [None]:
X_train, X_test = descriptor_preparation(X_train, X_test)

In [None]:
rfr_model = model_construction_rfr(X_train, y_train)
rfr_model.best_params_

### Result Visualization

The predicted Fa was calculated with the RFR model constructed using 1D and 2D descriptors. The x-axis and y-axis show the converted Fa values. The correlation coefficient between the observed and predicted values was 0.6861.

In [None]:
visualization_results(rfr_model, X_test, y_test)

### Contribution of descriptors for Fa prediction

Although the relationships learned between the descriptors and the predicted results are useful for compound optimization, the constructed models are complex, which makes these relationships difficult to elucidate. Shapley additive explanations (SHAP) is a useful tool for overcoming this hurdle: based on game theory, it calculates an importance value of each feature for a prediction. In Fa prediction by RFR, low values of TopoPSA(NO) and VSA_EState are likely to increase the output value.

In [None]:
import shap
print('shap version: ', shap.__version__)

# load JS visualization code to notebook
shap.initjs()

# Create a KernelExplainer, using the test set as background data
explainer = shap.KernelExplainer(rfr_model.predict, X_test.values)
# Calculate SHAP values
shap_values = explainer.shap_values(X_test.values)

shap.summary_plot(shap_values, X_test)
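
The game-theoretic idea behind SHAP can be made concrete with an exact Shapley value calculation on a tiny model. The sketch below is a simplification and is not how `shap.KernelExplainer` is implemented: it enumerates every feature subset, which is only feasible for a handful of features, and it replaces "absent" features with a single baseline vector rather than averaging over a background dataset. The model, inputs, and function names are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of each feature of x relative to a baseline input."""
    n = len(x)

    def value(subset):
        # Features in `subset` take their real value, the rest the baseline value
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            for subset in combinations(others, size):
                s = set(subset)
                # Shapley weight of a coalition of this size
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                # Marginal contribution of feature i to this coalition
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(z):
    """Hypothetical model with one interaction term between features 0 and 2."""
    return 2.0 * z[0] - 1.0 * z[1] + 0.5 * z[0] * z[2]


x = [1.0, 2.0, 4.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The contributions sum to the difference between the prediction for `x` and the prediction for the baseline, and the interaction term is shared equally between the two features involved. The same additivity is what makes the per-descriptor values in the summary plot comparable.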