# Train Machine Learning Models

## 1. Load Libraries

All classes and functions required for training the machine learning models are contained in the files classes_MLENS.py and classes_ML_training.py. Start a Python session and import all functions as follows:

In [1]:
import sys
sys.path.append('ex3_train_ML_model')

from classes_ML_training import *




## 2. Load Data

The pkl files containing the dataframes of the terms extracted from the MD simulations and the 2D-counts for each of the compounds are in the folder ./ex3_train_ML_model/data. 

We load the dataframe of the ChEMBL compounds as follows:

In [2]:
df_chembl = Read_data.read_data_from_dataframe('./ex3_train_ML_model/data/Properties_ChEMBL_water_Final.pkl')
df_chembl.head()

Unnamed: 0,2d_psa,2d_shape,3d_psa_av,3d_psa_med,3d_psa_sd,Br_count,Cl_count,F_count,HA_count,HBA_count,...,std_mu,std_mu_x,std_mu_y,std_mu_z,wat_rgyr_av,wat_rgyr_med,wat_rgyr_std,wat_sasa_av,wat_sasa_med,wat_sasa_std
0,8.354,11.251257,7.165937,7.168406,0.125717,0,0,0,25,6,...,0.03352,0.050041,0.049648,0.05537,0.435483,0.430718,0.029359,5.986559,6.011076,0.109505
1,3.579,3.172216,3.915703,3.919708,0.133734,0,0,0,21,3,...,0.209238,0.559725,0.588688,0.547068,0.435365,0.438954,0.026534,5.940969,5.962865,0.125045
2,2.769,1.74505,2.75293,2.757647,0.111028,0,0,0,26,3,...,0.023356,0.032846,0.033016,0.033972,0.398355,0.398852,0.022665,6.21604,6.232614,0.207518
3,12.526,8.268967,12.279819,12.279489,0.142294,0,0,1,28,7,...,0.024918,0.040317,0.035263,0.038872,0.50939,0.50966,0.003255,6.59743,6.597555,0.04193
4,5.552,10.665094,5.640595,5.673066,0.157287,0,1,0,23,2,...,0.069553,0.384246,0.415959,0.428572,0.396757,0.368957,0.067673,5.657603,5.567729,0.40568


If the dataframe also contains a column with labels, you can read the labels into a separate variable by specifying the labels_column argument:

In [3]:
df_chembl, is_sub_chembl = Read_data.read_data_from_dataframe('./ex3_train_ML_model/data/Properties_ChEMBL_water_Final.pkl', labels_column = "is_sub")


Check the labels and the number of instances for each label (0 = nonsubstrate, 1 = substrate):

In [4]:
np.unique(is_sub_chembl, return_counts = True)

(array([0, 1]), array([394, 720]))
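
The counts show that the dataset is imbalanced toward substrates. As a quick sanity check, the sketch below derives the class fractions and the accuracy that a trivial classifier would reach by always predicting the majority class. The counts are copied from the output above; the variable names are illustrative and not part of the tutorial code.

```python
# Class counts copied from the output of np.unique above
counts = {0: 394, 1: 720}   # 0 = nonsubstrate, 1 = substrate

total = sum(counts.values())
fractions = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the majority class
majority_label = max(counts, key=counts.get)
majority_baseline = counts[majority_label] / total

print(f"Total compounds: {total}")
print(f"Class fractions: {fractions}")
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any trained model should be compared against this reference value, since plain accuracy can look good on imbalanced data even when one class is poorly predicted.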

## 3. Compute Other Descriptors


In addition to the MDFP descriptors, other descriptors can be computed: the ECFP4 structural descriptor, the RDKit Topological fingerprint (RDKitFP, see https://www.rdkit.org/docs/GettingStartedInPython.html#topological-fingerprints), or the RDKit 2D descriptor (RDKit2D, see https://github.com/bp-kelley/descriptastorus). 

In [5]:
if 'RDKitFP' not in df_chembl:
    df_chembl = DataPrep.add_RDKitFP(df_chembl, smiles_column = "smiles")

if 'RDKit2D' not in df_chembl:
    df_chembl = DataPrep.add_RDKit2D(df_chembl, smiles_column = "smiles")

if 'ECFP4' not in df_chembl:
    df_chembl = DataPrep.add_ECFP4(df_chembl, smiles_column = "smiles")


The newly computed descriptors have been added to the dataframe in columns named "RDKitFP", "RDKit2D", and "ECFP4".

In [6]:
df_chembl.head()

Unnamed: 0,2d_psa,2d_shape,3d_psa_av,3d_psa_med,3d_psa_sd,Br_count,Cl_count,F_count,HA_count,HBA_count,...,std_mu_z,wat_rgyr_av,wat_rgyr_med,wat_rgyr_std,wat_sasa_av,wat_sasa_med,wat_sasa_std,RDKitFP,RDKit2D,ECFP4
0,8.354,11.251257,7.165937,7.168406,0.125717,0,0,0,25,6,...,0.05537,0.435483,0.430718,0.029359,5.986559,6.011076,0.109505,"[1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, ...","[1.6761953361556736, 1036.0250126830442, 17.38...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
1,3.579,3.172216,3.915703,3.919708,0.133734,0,0,0,21,3,...,0.547068,0.435365,0.438954,0.026534,5.940969,5.962865,0.125045,"[1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, ...","[1.5173530348416693, 465.8582594494218, 14.656...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
2,2.769,1.74505,2.75293,2.757647,0.111028,0,0,0,26,3,...,0.033972,0.398355,0.398852,0.022665,6.21604,6.232614,0.207518,"[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1.7096091090142584, 827.8664162688564, 18.192...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
3,12.526,8.268967,12.279819,12.279489,0.142294,0,0,1,28,7,...,0.038872,0.50939,0.50966,0.003255,6.59743,6.597555,0.04193,"[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, ...","[1.9965564476842292, 1077.4262127752988, 20.42...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
4,5.552,10.665094,5.640595,5.673066,0.157287,0,1,0,23,2,...,0.428572,0.396757,0.368957,0.067673,5.657603,5.567729,0.40568,"[1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, ...","[1.485990605271661, 801.4766465829462, 16.0707...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."


## 4. Descriptors

### 4.1. Available Descriptors

Check all the available descriptors:

In [7]:
MDFP_terms.descriptor_dictionary.keys()


dict_keys(['MDFP', 'MDFP_D', 'MDFP+', 'MDFP_P2', 'MDFP_P3', 'MDFP_PP', 'MDFP++', 'MDFP_P2++', 'MDFP_P3++', 'C2D', 'P2D', 'CP2D', 'CP2D_P2', 'DIP_MOM', 'RDKitFP', 'RDKit2D', 'MDFP_RDKit2D', 'ECFP4', 'ECFP4++', 'ECFP4_P2', 'ECFP4_P2++', 'ECFP4_P3++', 'ECFP4_MDFP', 'ECFP4_MDFP_P3', 'ECFP4_MDFP_P3++', 'ECFP4_RDKit2D', 'ECFP4_MDFP_RDKit2D'])

Select the descriptors on which to train the machine learning models:

In [8]:
# select all of them
descriptors_list_all = MDFP_terms.descriptor_dictionary.keys() 
# select a subset
descriptors_list = ['CP2D', 'MDFP_P3++', 'RDKit2D', 'ECFP4', 'RDKitFP', 'ECFP4_MDFP_P3++'] 

#### Description of every descriptor:
##### Property-based descriptors:
- **C2D:** 2D-counts as described in the original publication. These are the number of heavy atoms, rotatable bonds, and N, O, P, S, F, Cl, Br, I atoms.
- **P2D:** additional 2D-counts. These are the number of hydrogen bond donors, number of hydrogen bond acceptors, molecular weight, 2D-shape, and a binary label for zwitterionic compounds. The 2D-shape is calculated as the ratio between the eigenvalues of the covariance matrix of the 2D coordinates. A compound is considered zwitterionic if it contains both a negative and a positive charge at pH 7.
- **CP2D:** C2D plus P2D
- **RDKit2D:** Collection of 200 2D-properties, including some 2D-counts, estimated LogP values, and the topological polar surface area (TPSA). To list all of them, call the function DataPrep.get_RDKit2D_colnames()
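
To make the 2D-shape term more concrete, here is a minimal sketch of a ratio between the eigenvalues of the covariance matrix of a set of 2D coordinates. The function name `shape_ratio_2d`, the toy coordinates, and the choice of dividing the larger eigenvalue by the smaller one are assumptions made for illustration; the tutorial classes may define the ratio differently.

```python
import math

def shape_ratio_2d(coords):
    """Ratio of the larger to the smaller eigenvalue of the 2D covariance matrix."""
    n = len(coords)
    mean_x = sum(x for x, _ in coords) / n
    mean_y = sum(y for _, y in coords) / n
    # Entries of the symmetric 2x2 covariance matrix
    sxx = sum((x - mean_x) ** 2 for x, _ in coords) / n
    syy = sum((y - mean_y) ** 2 for _, y in coords) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in coords) / n
    # Closed-form eigenvalues of a symmetric 2x2 matrix
    half_trace = (sxx + syy) / 2
    radius = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    lam_max, lam_min = half_trace + radius, half_trace - radius
    if lam_min <= 0:
        return math.inf  # all points lie on a line
    return lam_max / lam_min

# Toy coordinates (not real molecules): an elongated and a compact arrangement
elongated = [(-2.0, 0.0), (2.0, 0.0), (0.0, -1.0), (0.0, 1.0)]
square = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0)]
print(shape_ratio_2d(elongated))
print(shape_ratio_2d(square))
```

With this convention, a value close to 1 indicates a compact, roughly isotropic 2D layout, and larger values indicate more elongated molecules.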

##### MD descriptors:
- **MDFP:** MDFP as described in the original publication [Link](https://pubs.acs.org/doi/abs/10.1021/acs.jcim.6b00778?casa_token=M_3CANCWZzAAAAAA:cQ0WGF5SUMEraDcrYzlEI9wdUkjLLLqCQDXwgdidT5P71rdHLRCuVj21vXuxCuWneJCEomavBA9Qptel)
- **MDFP+:** MDFP+ as described in the original publication. It consists of MDFP plus C2D
- **DIP_MOM:** Dipole moment terms. These are mean, median, and standard deviation of the dipole moment magnitude and of the dipole moment x,y,z components
- **MDFP_D:** MDFP plus DIP_MOM
- **MDFP_P2:** MDFP plus the topological polar surface area (TPSA)
- **MDFP_P3:** MDFP plus the mean, median, and standard deviation of the 3D polar surface area (3D-PSA) calculated over the simulation trajectory
- **MDFP_PP:** MDFP plus the mean, median, and standard deviation of the 3D-PSA calculated using partial charges (> 0.3) over the simulation trajectory
- **MDFP++:** MDFP+ plus P2D
- **MDFP_P2++:** MDFP_P2 plus CP2D (or MDFP++ plus TPSA)
- **MDFP_P3++:** MDFP_P3 plus CP2D (or MDFP++ plus 3D-PSA terms)
- **MDFP_RDKit2D:** MDFP plus RDKit2D

##### Structure-based descriptors: circular (ECFP) or path based (RDKitFP) fingerprints 
- **ECFP4:** RDKit Morgan fingerprint with radius = 2 and 2048 bits (RDKit function: AllChem.GetMorganFingerprintAsBitVect(x,2, nBits=2048))
- **ECFP4++:** ECFP4 plus CP2D
- **ECFP4_P2:** ECFP4 plus TPSA
- **ECFP4_P2++:** ECFP4++ plus TPSA
- **RDKitFP:** bits identify topological paths in the molecule. (RDKit function: Chem.rdFingerprintGenerator.GetRDKitFPGenerator(maxPath=5, fpSize=2048))


##### Hybrid fingerprints (combinations of the descriptors described above)
- **ECFP4_P3++:** ECFP4++ plus 3D-PSA terms
- **ECFP4_MDFP:** ECFP4 plus MDFP
- **ECFP4_MDFP_P3:** ECFP4 plus MDFP_P3
- **ECFP4_MDFP_P3++:** ECFP4 plus MDFP_P3++
- **ECFP4_RDKit2D:** ECFP4 plus RDKit2D
- **ECFP4_MDFP_RDKit2D:** ECFP4 plus MDFP_RDKit2D
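
As an illustration of what "ECFP4 plus MDFP" means in practice, the sketch below concatenates a fingerprint bit vector with a few property values into a single feature vector. It is a simplified stand-in, assuming that hybrid descriptors are plain concatenations; the helper `combine_descriptors`, the 8-bit toy fingerprint, and the two selected columns are hypothetical, and the library's own routines may differ in detail.

```python
# Toy stand-ins: a short bit vector instead of the 2048-bit ECFP4,
# and two property columns instead of the full MDFP term list.
compound = {
    "ECFP4": [0, 1, 0, 0, 1, 0, 0, 1],
    "wat_rgyr_av": 0.435483,
    "wat_sasa_av": 5.986559,
}
property_terms = ["wat_rgyr_av", "wat_sasa_av"]

def combine_descriptors(row, fingerprint_column, terms):
    """Append the selected property values to the fingerprint bits."""
    return list(row[fingerprint_column]) + [row[term] for term in terms]

hybrid = combine_descriptors(compound, "ECFP4", property_terms)
print(len(hybrid))
print(hybrid)
```

Tree-based models such as random forests do not require feature scaling, so mixing binary bits and real-valued terms in one vector is unproblematic for them; for scale-sensitive models the property terms would likely need to be standardized first.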

    


### 4.2. Customized MDFPs

This section describes how to define a customized descriptor such that it can be read by the available functions.

Let's assume that features other than those computed by default by the ComposerGMX class have been obtained from the MD simulations (see Tutorial2.ipynb). For example, let's assume that the number of water molecules in the lower (within 0.34 nm) and upper (within 0.5 nm) solvation shells have been computed for all the compounds and that the mean, standard deviation, and median have been stored in the pkl file. 

To define a descriptor that contains both the terms of the original MDFP and these additional features, the MDFP_terms class has to be modified, as follows:

**1.** Add a line with the list of the names of the columns containing the additional features: 



In [9]:
solvation_waters_terms = ["N_waters_lower_av", "N_waters_lower_std", "N_waters_lower_med", 
                          "N_waters_upper_av", "N_waters_upper_std", "N_waters_upper_med"]

**2.** Combine this list with the list of the features composing the MDFP:

In [10]:
MDFP_new = MDFP_terms.mdfp + solvation_waters_terms

# print the features composing the new MDFP
MDFP_new


['intra_ene_av_wat',
 'intra_ene_std_wat',
 'intra_ene_med_wat',
 'intra_lj_av_wat',
 'intra_lj_std_wat',
 'intra_lj_med_wat',
 'intra_crf_av_wat',
 'intra_crf_std_wat',
 'intra_crf_med_wat',
 'total_ene_av_wat',
 'total_ene_std_wat',
 'total_ene_med_wat',
 'total_lj_av_wat',
 'total_lj_std_wat',
 'total_lj_med_wat',
 'total_crf_av_wat',
 'total_crf_std_wat',
 'total_crf_med_wat',
 'wat_rgyr_av',
 'wat_rgyr_std',
 'wat_rgyr_med',
 'wat_sasa_av',
 'wat_sasa_std',
 'wat_sasa_med',
 'N_waters_lower_av',
 'N_waters_lower_std',
 'N_waters_lower_med',
 'N_waters_upper_av',
 'N_waters_upper_std',
 'N_waters_upper_med']

**3.** Assign a name to the new MDFP and add it to the MDFP_terms.descriptor_dictionary.

  #### IMPORTANT:

  For the new descriptor to be recognized as an MDFP, the descriptor name must not contain the word "RDKit" or "ECFP4".

**4.** To avoid errors when generating the feature importance plots, also define colors for the new features and add them to the colors_dictionary.
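
Step 3 could look roughly like the sketch below. It uses a plain dictionary as a stand-in for MDFP_terms.descriptor_dictionary and assumes that the dictionary maps descriptor names to lists of column names. The helper `register_descriptor`, the name "MDFP_SOLV", and the abbreviated term lists are hypothetical and only illustrate the naming rule.

```python
# Stand-in for MDFP_terms.descriptor_dictionary (hypothetical, abbreviated content)
descriptor_dictionary = {
    "MDFP": ["wat_rgyr_av", "wat_rgyr_std", "wat_rgyr_med"],
}

# Words that would make a descriptor be treated as a non-MDFP descriptor
RESERVED_WORDS = ("RDKit", "ECFP4")

def register_descriptor(dictionary, name, terms):
    """Add a named list of feature columns, rejecting names with reserved words."""
    for word in RESERVED_WORDS:
        if word in name:
            raise ValueError(f"Descriptor name must not contain '{word}': {name}")
    if name in dictionary:
        raise ValueError(f"Descriptor '{name}' is already defined")
    dictionary[name] = list(terms)

solvation_terms = ["N_waters_lower_av", "N_waters_lower_std", "N_waters_lower_med"]
register_descriptor(descriptor_dictionary, "MDFP_SOLV",
                    descriptor_dictionary["MDFP"] + solvation_terms)
print(descriptor_dictionary["MDFP_SOLV"])
```

With the real class, the equivalent step would presumably be a direct assignment of the new feature list under the chosen name in MDFP_terms.descriptor_dictionary.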

## 5. Train-Test Splits

The next step is to split the dataset into a training and a test set. To this end, different functions are available in the class TrainTestSplit. All functions return four outputs, i.e.
the dataframe of the training set, the dataframe of the test set, the labels for the training set, and the labels for the test set.

### Description of the available train-test splits

The available splits and respective functions are:

**5.1. Chemical Diversity Splits:**

To form the test set, a maximally diverse subset of compounds is selected from the dataset using the MaxMin algorithm in the RDKit.

- **General**: split the data such that the test set contains a maximally diverse set of compounds:
        TrainTestSplit.max_chem_diversity(df_dataset, smiles_column='smiles', test_set_size=None, random_seed=None)
            
- **Balanced**: split the data such that the test set is class balanced and contains a maximally diverse set of compounds:
        TrainTestSplit.max_chem_diversity_class_balanced(df_dataset, smiles_column='smiles', labels_column='is_sub', test_set_size=None, random_seed=None)

- **Stratified**: split the data such that the test set has the same class proportions as the dataset and contains a maximally diverse set of compounds:
        TrainTestSplit.max_chem_diversity_class_stratified(df_dataset, smiles_column='smiles', labels_column='is_sub', test_set_size=None, random_seed=None)
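
To give an intuition for the MaxMin selection behind these splits, here is a simplified pure-Python sketch. It is not the RDKit implementation: fingerprints are represented as sets of on-bit indices, the first pick is fixed instead of random, and the names `tanimoto_distance`, `maxmin_pick`, and `toy_fps` are illustrative only.

```python
def tanimoto_distance(bits_a, bits_b):
    """1 - Tanimoto similarity for fingerprints stored as sets of on-bit indices."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 0.0
    return 1.0 - len(bits_a & bits_b) / union

def maxmin_pick(fingerprints, n_pick, first=0):
    """Greedy MaxMin: repeatedly add the item farthest from those already picked."""
    picked = [first]
    # Distance from every item to its nearest picked item
    min_dist = [tanimoto_distance(fp, fingerprints[first]) for fp in fingerprints]
    while len(picked) < min(n_pick, len(fingerprints)):
        candidates = [i for i in range(len(fingerprints)) if i not in picked]
        best = max(candidates, key=lambda i: min_dist[i])
        picked.append(best)
        # Update nearest-picked distances with the newly added item
        for i, fp in enumerate(fingerprints):
            min_dist[i] = min(min_dist[i], tanimoto_distance(fp, fingerprints[best]))
    return picked

# Toy fingerprints: two groups of similar "compounds"
toy_fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {1, 2, 3, 4}, {7, 8, 10}]
test_idx = maxmin_pick(toy_fps, n_pick=2)
train_idx = [i for i in range(len(toy_fps)) if i not in test_idx]
print(test_idx, train_idx)
```

The picked indices form the test set and the rest form the training set. For the balanced and stratified variants, the same selection would be run separately within each class, as the function documentation describes.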

The documentation of the functions can be checked as follows:


In [11]:
print(TrainTestSplit.max_chem_diversity_class_stratified.__doc__)

It splits the dataset into a training and a test set such that the test set 
        contains a maximumally diverse set of compounds. Moreover, the test set preserves the class distribution of the (eventually imbalanced) dataset.
        To use this function, the input dataset has to contain a smiles_column and a labels_column.
        
        To ensure that the test set is stratified, the compounds are first divided into two groups based on their class. 
        Then, the chemical similarity among the groups is quantified using the ECFP4 Tanimoto coefficient. 
        Finally, to obtain the test set, a maximally diverse subset of compounds is selected from each group using the MaxMin algorithm in the RDKit.
        The remaining compounds are included in the training set.
 
        Parameters:
        -----------
        df_dataset: df
            dataframe of the dataset. It should contain features, a SMILES column, and classification labels
        smiles_column: str, optional
    

For example, split the ChEMBL dataset using the stratified chemical diversity split:

In [12]:
# Chemical Diversity Split for the ChEMBL dataset
df_chembl_train, df_chembl_test, is_substrate_chembl_train, is_substrate_chembl_test = \
TrainTestSplit.max_chem_diversity_class_stratified(df_chembl, smiles_column='smiles', 
                                                   labels_column='is_sub', test_set_size = df_chembl.shape[0]/5)


**5.2. Chemical Series Splits (and simulated time split):**

The test set will be formed by chemical series different from those contained in the training set. The output dataframe of the test set will contain an extra column, named "clusters". Compounds belonging to the same chemical series will be grouped in the same cluster.

The data are split using the following procedure: (i) The compounds are decomposed into Murcko frameworks using RDKit. (ii) Frameworks are represented by ECFP4 fingerprints. (iii) ECFP4 fingerprints are clustered based on the Tanimoto similarity using the Butina algorithm in the RDKit. (iv) Clusters are randomly picked and included in the test set.
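
Step (iii) can be pictured with the simplified, pure-Python variant of Butina-style clustering below. It is a sketch under several assumptions: fingerprints are sets of on-bit indices, the cutoff is interpreted as a Tanimoto distance, and neighbour counts are recomputed among the still-unassigned items. The function names and toy data are illustrative; the tutorial itself relies on the RDKit implementation.

```python
def tanimoto_distance(bits_a, bits_b):
    """1 - Tanimoto similarity for fingerprints stored as sets of on-bit indices."""
    union = len(bits_a | bits_b)
    return 0.0 if union == 0 else 1.0 - len(bits_a & bits_b) / union

def butina_like_clusters(fingerprints, cutoff=0.2):
    """Simplified Butina-style clustering on a Tanimoto distance cutoff."""
    n = len(fingerprints)
    # Neighbours of each item: all other items within the distance cutoff
    neighbours = [
        {j for j in range(n)
         if j != i and tanimoto_distance(fingerprints[i], fingerprints[j]) <= cutoff}
        for i in range(n)
    ]
    unassigned = set(range(n))
    clusters = []
    while unassigned:
        # Centroid: unassigned item with most unassigned neighbours (ties: lowest index)
        centre = max(sorted(unassigned), key=lambda i: len(neighbours[i] & unassigned))
        members = [centre] + sorted(neighbours[centre] & unassigned)
        clusters.append(members)
        unassigned -= set(members)
    return clusters

# Toy framework fingerprints: one series of three members and two singletons
toy_fps = [{1, 2, 3, 4, 5}, {1, 2, 3, 4, 5, 6}, {1, 2, 3, 4, 6},
           {10, 11, 12}, {10, 11, 12, 13}]
clusters = butina_like_clusters(toy_fps, cutoff=0.2)
series = [c for c in clusters if len(c) >= 2]
singletons = [c for c in clusters if len(c) == 1]
print(clusters)
```

Clusters with two or more members play the role of chemical series, while single-member clusters are the singletons that the split functions can optionally include in the test set.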

- **General or simulated time split**: The clusters can contain compounds that represent either a single class or different classes. To include in the test set only clusters that contain mixed classes of compounds, specify the labels_column and set mixed_clusters_only = True. One can also specify a cutoff for the imbalance ratio of the clusters to be picked for the test set. See the documentation of the function, below.
        TrainTestSplit.chemical_series(df_dataset, smiles_column='smiles', labels_column=None, remove_atom_types=False, exclude_series_larger_than=None, exclude_from_test_series_larger_than=None, test_set_size=None, include_singletons_in_test=False, mixed_clusters_only=False, ratio_cutoff=None, clustering_cutoff=0.2, plot_clustering_results=False, plot_basename=None, class_plot_labels=None, colors_classes=None)
            
- **Per class**: compounds belonging to different classes are clustered separately. This means that the compounds are first divided into classes and then the procedure described above is applied to every group. The test set will contain random clusters from each of the groups. The clusters can be picked so that the test set has balanced classes (set balanced = True) or the same class distribution as the original dataset (set stratified = True). By default stratified = True.
        TrainTestSplit.chemical_series_per_class(df_dataset, smiles_column='smiles', labels_column='is_sub', remove_atom_types=False, exclude_series_larger_than=None, exclude_from_test_series_larger_than=None, test_set_size=None, include_singletons_in_test=False, clustering_cutoff=0.2, balanced=False, stratified=False, plot_clustering_results=None, plot_basename=None)


To print the documentation of each function, use help(), for example help(TrainTestSplit.chemical_series_per_class).

Here are some examples of the train-test splits that can be done with these two functions:

In [13]:
# Split the data into a training and a test set, such that the test set contains all the identified chemical series 
# (clusters with 2 or more members). Labels are not taken into account.
df_train, df_test = TrainTestSplit.chemical_series(df_chembl, smiles_column='smiles',
                                                   plot_clustering_results = True, plot_basename = "all_series")


No chemical series is excluded from the dataset
Singletons are NOT picked for the test set but are included in the training set
Large chemical series can also be picked for the test set


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_test_set['clusters'] = cluster_info


In [14]:
# Simulated time split of the ChEMBL dataset. 
# All clusters containing different classes of compounds and with an imbalance ratio <= 10 
# are included in the test set:
df_chembl_train, df_chembl_test, is_substrate_chembl_train, is_substrate_chembl_test = \
TrainTestSplit.chemical_series(df_chembl, smiles_column='smiles', labels_column = "is_sub", 
                               mixed_clusters_only = True, ratio_cutoff = 10, 
                               plot_clustering_results = True, plot_basename = "time_split")

# To visualize the clusters that have been included in the test set and the class distribution in each of the clusters,
# check the figure barplot_clusters_murcko_time_split_testset.png


No chemical series is excluded from the dataset
Singletons are NOT picked for the test set but are included in the training set
Large chemical series can also be picked for the test set
Baseline Accuracy: 63.62827584601777


The baseline accuracy is based on the null hypothesis that all the compounds in a given cluster are predicted according to the majority class of that cluster. 
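
One possible reading of this baseline is sketched below: every compound is "predicted" as the majority class of its cluster, and the fraction of correct assignments is pooled over all test compounds. This is an assumption about the calculation, not the library's code; an unweighted average of per-cluster accuracies would be another plausible definition. The function name and toy data are illustrative.

```python
from collections import Counter

def cluster_majority_baseline(cluster_ids, labels):
    """Accuracy (%) when every compound is assigned its cluster's majority class."""
    per_cluster = {}
    for cluster, label in zip(cluster_ids, labels):
        per_cluster.setdefault(cluster, Counter())[label] += 1
    # In each cluster, the majority-class members are the "correct" predictions
    correct = sum(max(counter.values()) for counter in per_cluster.values())
    return 100.0 * correct / len(labels)

# Toy test set: three clusters with mixed classes
toy_clusters = [0, 0, 0, 1, 1, 1, 1, 2, 2]
toy_labels = [1, 1, 0, 0, 0, 0, 1, 1, 0]
baseline = cluster_majority_baseline(toy_clusters, toy_labels)
print(f"Baseline accuracy: {baseline:.2f}")
```

A model evaluated on a simulated time split should exceed this value to show that it learned more than the dominant class of each chemical series.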

In [15]:
# Cluster substrates and nonsubstrates separately. 
# - Ensure that the test set is class balanced (balanced = True).
# - Exclude large clusters from the test set to avoid that the accuracy metrics are biased 
#   by the results on a particular chemical series (exclude_from_test_series_larger_than = 10).
# - Set the test set size to be 20% of the dataset. 
# - If there are not enough clusters to have a test set that is balanced and is 20% of the dataset, 
#   include singletons in the test set (include_singletons_in_test = True).

df_chembl_train, df_chembl_test, is_substrate_chembl_train, is_substrate_chembl_test = \
TrainTestSplit.chemical_series_per_class(df_chembl, smiles_column='smiles', labels_column = "is_sub",
                                         balanced = True, exclude_from_test_series_larger_than = 10, 
                                         test_set_size=df_chembl.shape[0]/5, include_singletons_in_test = True, 
                                         plot_clustering_results = True, plot_basename = "per_class")

# check that the test set is actually balanced
print("Number of nonsubstrates (0) and substrates (1):")
print(np.unique(is_substrate_chembl_test, return_counts = True))

# check if the test set contains singletons:
clus, counts = np.unique(list(df_chembl_test['clusters']), return_counts = True)
if any(i == 1 for i in counts):
    print("The test set contains singletons")
else:
    print("The test set does not contain singletons")

No chemical series is excluded from the dataset
Also singletons can be picked for the test set
Chemical series larger than 10 are NOT picked for the test set


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_test_set['clusters'] = cluster_info1


No chemical series is excluded from the dataset
Also singletons can be picked for the test set
Chemical series larger than 10 are NOT picked for the test set
Number of nonsubstrates (0) and substrates (1):
(array([0, 1]), array([111, 111]))
The test set does not contain singletons


**5.3. Random, Class Stratified Split**

The dataset is split into a training and a test set such that the test set has the same class proportions as the dataset.

    TrainTestSplit.random_class_stratified(df_dataset, labels_column=None, test_set_size=None, random_seed=None) 

In [16]:
# Random stratified split of the ChEMBL dataset
df_chembl_train, df_chembl_test, is_substrate_chembl_train, is_substrate_chembl_test = \
TrainTestSplit.random_class_stratified(df_chembl, labels_column = "is_sub", test_set_size = 1000) 
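
To make explicit what "class stratified" means, the following pure-Python sketch draws the same fraction of compounds from every class. It is an illustration, not the implementation of TrainTestSplit.random_class_stratified; the function `stratified_split`, its `test_fraction` argument, and the toy labels are assumptions.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction, seed=0):
    """Return (train_indices, test_indices) preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for label, indices in by_class.items():
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels with the same class counts as the ChEMBL set (394 vs 720)
toy_labels = [0] * 394 + [1] * 720
train_idx, test_idx = stratified_split(toy_labels, test_fraction=0.2)
n_test_pos = sum(toy_labels[i] for i in test_idx)
print(len(test_idx), n_test_pos, len(test_idx) - n_test_pos)
```

With scikit-learn, a comparable split can be obtained from train_test_split by passing the labels to its stratify argument.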


## 6. Train Machine Learning Models on Single Descriptors


**Stratified Chemical diversity train-test split**

Let's split the dataset such that the test set contains a maximally chemically diverse subset:

In [17]:
df_chembl_train, df_chembl_test, is_substrate_chembl_train, is_substrate_chembl_test = \
TrainTestSplit.max_chem_diversity_class_stratified(df_chembl, smiles_column='smiles', labels_column='is_sub', 
                                                   test_set_size = df_chembl.shape[0]/5)



**Select the descriptors of interest:**

In [18]:
# select 2D-counts or MDFPs
# Descriptors are defined in the class MDFP_terms. 
# The terms composing each descriptor can be extracted from the MDFP_terms class itself 
# or using the functions of the class SelectDescriptors.

# Extract MDFP from the class MDFP_terms
mdfp_train = np.array(df_chembl_train[MDFP_terms.mdfp])
mdfp_test = np.array(df_chembl_test[MDFP_terms.mdfp])  

# Extract MDFP and CP2D using the function SelectDescriptors.MDFPFromList()
mdfp_dict = SelectDescriptors.MDFPFromList(['CP2D', 'MDFP'])

mdfp_features = mdfp_dict['MDFP']
mdfp_train = np.array(df_chembl_train[mdfp_features])
mdfp_test = np.array(df_chembl_test[mdfp_features])  

cp2d_features = mdfp_dict['CP2D']
cp2d_train = np.array(df_chembl_train[cp2d_features])
cp2d_test = np.array(df_chembl_test[cp2d_features])

# select ECFP4 or RDKitFP or RDKit2D
# if previously computed (see section 3) then:
ecfp4_train = list(df_chembl_train['ECFP4'])
ecfp4_test = list(df_chembl_test['ECFP4'])

rdkitfp_train = list(df_chembl_train['RDKitFP'])
rdkitfp_test = list(df_chembl_test['RDKitFP'])

rdkit2d_train = list(df_chembl_train['RDKit2D'])
rdkit2d_test = list(df_chembl_test['RDKit2D'])


**Train a ML model, for example random forest, using scikit-learn:**

In [19]:
# specify fingerprints of the training and test sets (any of the ones described above)
fp_train = mdfp_train
fp_test = mdfp_test

In [20]:
# train ML model
clf_rf = None
clf_rf = RandomForestClassifier(max_depth=6, n_estimators=1000, 
                                min_samples_leaf = 1).fit(fp_train, is_substrate_chembl_train)

**Calculate classification metrics:**
number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), global accuracy (GA), sensitivity or recall (SE), specificity (SP), precision (Prec), F1 score (F1), Matthews correlation coefficient (MCC), Cohen’s Kappa, and area under the receiver operating characteristic curve (AUC).



In [21]:
metrics = Evaluate.calc_metrics(clf_rf, fp_train, is_substrate_chembl_train, fp_test, is_substrate_chembl_test)
metrics_colnames = ['Train Score','Test Score', 'TP', 'TN', 'FP', 'FN', 'AUC', 'Prec', 'SE', 'SP', 'Kappa', 'F1', 'MCC']
df_metrics = pd.DataFrame([metrics], columns=metrics_colnames)
df_metrics['fingerprint'] = "MDFP"
df_metrics

Unnamed: 0,Train Score,Test Score,TP,TN,FP,FN,AUC,Prec,SE,SP,Kappa,F1,MCC,fingerprint
0,0.894619,0.765766,124,46,32,20,0.779469,0.794872,0.861111,0.589744,0.467331,0.826667,0.470902,MDFP
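
Most of these metrics follow directly from the four confusion-matrix counts. As a cross-check, the sketch below recomputes them from the TP, TN, FP, and FN values of the MDFP row above using the standard textbook formulas. It is independent of Evaluate.calc_metrics, whose internals are not shown here. The AUC is omitted because it requires the predicted probabilities, not only the counts.

```python
import math

# Confusion-matrix counts copied from the MDFP row above
TP, TN, FP, FN = 124, 46, 32, 20
n = TP + TN + FP + FN

accuracy = (TP + TN) / n                     # global accuracy, matches "Test Score"
sensitivity = TP / (TP + FN)                 # SE, recall
specificity = TN / (TN + FP)                 # SP
precision = TP / (TP + FP)                   # Prec
f1 = 2 * TP / (2 * TP + FP + FN)
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
# Cohen's kappa: observed agreement corrected for chance agreement
p_chance = ((TP + FP) * (TP + FN) + (TN + FN) * (TN + FP)) / n ** 2
kappa = (accuracy - p_chance) / (1 - p_chance)

for name, value in [("GA", accuracy), ("SE", sensitivity), ("SP", specificity),
                    ("Prec", precision), ("F1", f1), ("MCC", mcc), ("Kappa", kappa)]:
    print(f"{name}: {value:.6f}")
```

The recomputed values agree with the corresponding columns of the table, which also shows that the high sensitivity comes with a much lower specificity on this imbalanced test set.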


## 7. Train Machine Learning Models on Multiple Descriptors

Functions:

In [22]:
def calc_descriptors(descriptors_list, df_training_set, df_test_set, smiles_column = 'smiles'):
    #ECFP4
    if 'ECFP4' in descriptors_list and 'ECFP4' not in list(df_training_set):
        df_training_set = DataPrep.add_ECFP4(df_training_set, smiles_column = smiles_column)
        df_test_set = DataPrep.add_ECFP4(df_test_set, smiles_column = smiles_column)
    #RDKit2D
    if 'RDKit2D' in descriptors_list and 'RDKit2D' not in list(df_training_set):
        df_training_set = DataPrep.add_RDKit2D(df_training_set, smiles_column = smiles_column)
        df_test_set = DataPrep.add_RDKit2D(df_test_set, smiles_column = smiles_column)
    #RDKitFP
    if 'RDKitFP' in descriptors_list and 'RDKitFP' not in list(df_training_set):
        df_training_set = DataPrep.add_RDKitFP(df_training_set, smiles_column = smiles_column)
        df_test_set = DataPrep.add_RDKitFP(df_test_set, smiles_column = smiles_column)
    #MDFP_RDKit2D
    if 'MDFP_RDKit2D' in descriptors_list and 'MDFP_RDKit2D' not in list(df_training_set):
        df_training_set = DataPrep.add_MDFP_RDKit2D(df_training_set, smiles_column = smiles_column)
        df_test_set = DataPrep.add_MDFP_RDKit2D(df_test_set, smiles_column = smiles_column)
    #ECFP4_RDKit2D
    if 'ECFP4_RDKit2D' in descriptors_list and 'ECFP4_RDKit2D' not in list(df_training_set):
        df_training_set = DataPrep.add_ECFP4_RDKit2D(df_training_set, smiles_column = smiles_column)
        df_test_set = DataPrep.add_ECFP4_RDKit2D(df_test_set, smiles_column = smiles_column)
    #ECFP4_MDFP_RDKit2D
    if 'ECFP4_MDFP_RDKit2D' in descriptors_list and 'ECFP4_MDFP_RDKit2D' not in list(df_training_set):
        df_training_set = DataPrep.add_ECFP4_MDFP_RDKit2D(df_training_set, smiles_column = smiles_column)
        df_test_set = DataPrep.add_ECFP4_MDFP_RDKit2D(df_test_set, smiles_column = smiles_column)
    return df_training_set, df_test_set


def train_RF(descriptors_list, df_training_set, classes_train, df_test_set, classes_test, **kwargs):
    # compute missing descriptors if needed
    df_training_set, df_test_set = calc_descriptors(descriptors_list, df_training_set, df_test_set)
    #select descriptors
    mdfp_dict = SelectDescriptors.MDFPFromList(descriptors_list)
    ECFP4combi_dict = SelectDescriptors.ECFP4CombiFromList(descriptors_list)
    rdkit_dict = SelectDescriptors.RDKitFPsFromList(descriptors_list)
    if 'ECFP4' in descriptors_list:
        rdkit_dict.update({'ECFP4': ['ECFP4']})
    # 
    decision_threshold = 0.5   #standard for random forest
    # Initialize output variables
    rf_metrics = []
    predictions = []
    trained_models = {}
    # Train RF on MDFPs
    if len(mdfp_dict) != 0:
        for fp_name, terms_to_test in mdfp_dict.items():
            print(fp_name)
            mdfp_train = np.array(df_training_set[terms_to_test])      #array containing mdfp
            mdfp_test = np.array(df_test_set[terms_to_test])           #array containing mdfp
            clf_rf = None
            clf_rf = RandomForestClassifier(random_state = 0, max_depth=6, n_estimators=1000, **kwargs).fit(mdfp_train, classes_train)
            # write outputs
            trained_models.update({fp_name: clf_rf})
            rf_metrics.append({"fingerprint": fp_name, "metrics": Evaluate.calc_metrics(clf_rf, mdfp_train, classes_train, mdfp_test, classes_test, decision_threshold = decision_threshold)})
            predictions.append({"fingerprint": fp_name, "y_pred": clf_rf.predict(mdfp_test), "pred_proba": clf_rf.predict_proba(mdfp_test)[:,1]})
    # Train RF on RDKit fingerprints and ECFP4
    if len(rdkit_dict) != 0:
        for fp_name, terms_to_test in rdkit_dict.items():
            print(fp_name)
            mdfp_train = list(df_training_set[fp_name])
            mdfp_test = list(df_test_set[fp_name])
            clf_rf = None
            clf_rf = RandomForestClassifier(random_state = 0, max_depth=6, n_estimators=1000, **kwargs).fit(mdfp_train, classes_train)
            trained_models.update({fp_name: clf_rf})
            rf_metrics.append({"fingerprint": fp_name, "metrics": Evaluate.calc_metrics(clf_rf, mdfp_train, classes_train, mdfp_test, classes_test, decision_threshold = decision_threshold)})
            predictions.append({"fingerprint": fp_name, "y_pred": clf_rf.predict(mdfp_test), "pred_proba": clf_rf.predict_proba(mdfp_test)[:,1]})
    #train model on combinations of ECFP4 with other descriptors
    if len(ECFP4combi_dict) != 0:
        for fp_name, terms_to_test in ECFP4combi_dict.items():
            print(fp_name)
            mdfp_train, mdfp_test = DataPrep.combine_descriptors_train_test(df_training_set, df_test_set, 'ECFP4', terms_to_test)       #array containing mdfp
            clf_rf = None
            clf_rf = RandomForestClassifier(random_state = 0, max_depth=6, n_estimators=1000, **kwargs).fit(mdfp_train, classes_train)
            trained_models.update({fp_name: clf_rf})
            rf_metrics.append({"fingerprint": fp_name, "metrics": Evaluate.calc_metrics(clf_rf, mdfp_train, classes_train, mdfp_test, classes_test, decision_threshold = decision_threshold)})
            predictions.append({"fingerprint": fp_name, "y_pred": clf_rf.predict(mdfp_test), "pred_proba": clf_rf.predict_proba(mdfp_test)[:,1]})
                
    # output results in a dataframe
    df_metrics = pd.DataFrame(rf_metrics)
    metrics_colnames = ['Train Score','Test Score', 'TP', 'TN', 'FP', 'FN', 'AUC', 'Prec', 'SE', 'SP', 'Kappa', 'F1', 'MCC']
    df_tmp = pd.DataFrame(df_metrics['metrics'].values.tolist(), columns=metrics_colnames)
    df_metrics2 = pd.concat([df_metrics.drop(columns=['metrics']), df_tmp], axis=1)
    df_metrics2 = df_metrics2.round(2)
    df_metrics2 = df_metrics2[['fingerprint', 'Train Score','Test Score', 'TP', 'TN', 'FP', 'FN', 'AUC', 'Prec', 'SE', 'SP', 'Kappa', 'F1', 'MCC']]

    return df_metrics2, predictions, trained_models
                    
    

In [23]:
descriptors_list = ['MDFP','MDFP+', 'MDFP_P2', 'MDFP_P3','MDFP++','MDFP_P3++','C2D',
                    'CP2D','RDKitFP','RDKit2D','MDFP_RDKit2D','ECFP4','ECFP4_MDFP','ECFP4_MDFP_P3',
                    'ECFP4_MDFP_P3++','ECFP4_RDKit2D','ECFP4_MDFP_RDKit2D']

In [24]:
df_metrics, rf_predictions, rf_trained_models = train_RF(descriptors_list, df_chembl_train, is_substrate_chembl_train, df_chembl_test, is_substrate_chembl_test)

MDFP
MDFP+
MDFP_P2
MDFP_P3
MDFP++
MDFP_P3++
C2D
CP2D
RDKitFP
RDKit2D
MDFP_RDKit2D
ECFP4_RDKit2D
ECFP4_MDFP_RDKit2D
ECFP4
ECFP4_MDFP
ECFP4_MDFP_P3
ECFP4_MDFP_P3++


Print output metrics

In [25]:
df_metrics

Unnamed: 0,fingerprint,Train Score,Test Score,TP,TN,FP,FN,AUC,Prec,SE,SP,Kappa,F1,MCC
0,MDFP,0.89,0.76,124,44,34,20,0.78,0.78,0.86,0.56,0.44,0.82,0.45
1,MDFP+,0.9,0.76,124,44,34,20,0.78,0.78,0.86,0.56,0.44,0.82,0.45
2,MDFP_P2,0.9,0.77,124,47,31,20,0.78,0.8,0.86,0.6,0.48,0.83,0.48
3,MDFP_P3,0.9,0.75,122,44,34,22,0.79,0.78,0.85,0.56,0.43,0.81,0.43
4,MDFP++,0.9,0.76,124,45,33,20,0.79,0.79,0.86,0.58,0.46,0.82,0.46
5,MDFP_P3++,0.9,0.77,125,45,33,19,0.8,0.79,0.87,0.58,0.46,0.83,0.47
6,C2D,0.82,0.75,121,46,32,23,0.77,0.79,0.84,0.59,0.44,0.81,0.44
7,CP2D,0.83,0.76,123,46,32,21,0.78,0.79,0.85,0.59,0.46,0.82,0.46
8,RDKitFP,0.92,0.71,110,47,31,34,0.77,0.78,0.76,0.6,0.36,0.77,0.36
9,RDKit2D,0.91,0.79,129,47,31,15,0.8,0.81,0.9,0.6,0.52,0.85,0.53
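
To compare descriptors at a glance, the rows can be ranked by a single metric. The sketch below does this in plain Python for the ten rows shown above, using values copied from the table, and also reports the gap between training and test score as a rough indicator of overfitting. The tuple layout and variable names are illustrative.

```python
# (fingerprint, train score, test score, MCC) copied from the table above
rows = [
    ("MDFP", 0.89, 0.76, 0.45), ("MDFP+", 0.90, 0.76, 0.45),
    ("MDFP_P2", 0.90, 0.77, 0.48), ("MDFP_P3", 0.90, 0.75, 0.43),
    ("MDFP++", 0.90, 0.76, 0.46), ("MDFP_P3++", 0.90, 0.77, 0.47),
    ("C2D", 0.82, 0.75, 0.44), ("CP2D", 0.83, 0.76, 0.46),
    ("RDKitFP", 0.92, 0.71, 0.36), ("RDKit2D", 0.91, 0.79, 0.53),
]

# Rank by MCC (highest first) and report the train-test gap
ranked = sorted(rows, key=lambda r: r[3], reverse=True)
for name, train, test, mcc in ranked:
    print(f"{name:<10} MCC={mcc:.2f}  train-test gap={train - test:.2f}")
```

On the dataframe itself, the same ordering can be obtained with df_metrics.sort_values("MCC", ascending=False).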


The trained models have been stored in a dictionary:

In [26]:
rf_trained_models['MDFP']

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=6, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=1000,
                       n_jobs=None, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)

The prediction and prediction probabilities for the test set have also been stored as a list of dictionaries:

In [27]:
rf_predictions

[{'fingerprint': 'MDFP',
  'y_pred': array([0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
         1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1,
         0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1,
         0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,
         1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1,
         1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
         1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1]),
  'pred_proba': array([0.47646741, 0.24406168, 0.47156882, 0.20374857, 0.12677227,
         0.68517711, 0.23842339, 0.0539608 , 0.06158898, 0.23355088,
         0.60624977, 0.19683324, 0.4769958 , 0