# Machine learning 

In this part, we use machine learning (ML) models, in particular random forests (RF), to predict the endocrine (ER) activity of the compounds in our previously processed dataset. As a reminder, the dataset contains morphological profiles for around 30k molecules as well as the Morgan fingerprints computed from their SMILES.

## Import libraries and previous data


In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef

### Load data: per-treatment profile fingerprints and annotation file

- Load the fingerprint file _output_notebook_2.pkl_ generated in _Part2-Similarity_Analysis.ipynb_
- Load the endocrine activity file _ER_activity_luc_bg1.csv_
- Join the data frames into one

1. Load the data 

In [2]:
dataset = pd.read_pickle('../Data/Output/output_notebook_2.pkl')
dataset.head()

Unnamed: 0,Metadata_broad_sample,CPD_NAME,CPD_SMILES,Cells_AreaShape_Area,Cells_AreaShape_Center_X,Cells_AreaShape_Center_Y,Cells_AreaShape_Compactness,Cells_AreaShape_FormFactor,Cells_AreaShape_Orientation,Cells_AreaShape_Perimeter,...,Nuclei_Texture_SumAverage_RNA_10_0,Nuclei_Texture_SumEntropy_AGP_10_0,Nuclei_Texture_SumEntropy_DNA_10_0,Nuclei_Texture_SumEntropy_RNA_10_0,STD_smile,morganB_fps,euclidean_distance_morganB_fps,tanimoto_distance_morganB_fps,morphological_fingerprint,euclidean_distance_morphological_fingerprint
0,BRD-K08693008-001-01-9,BRD-K08693008,OC[C@@H]1O[C@@H](CCn2cc(nn2)C2CCCCC2)CC[C@@H]1...,0.048557,1.511709,0.671319,1.608394,-0.511233,0.565885,0.257311,...,0.586195,1.125321,1.009055,-1.088308,O=C(Nc1ccc(Cl)c(Cl)c1)N[C@H]1CC[C@H](CCn2cc(C3...,"[0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",8.944272,0.144928,"[0.04855710428510099, 1.5117089902809278, 0.67...",27.358629
1,BRD-K63982890-001-01-9,BRD-K63982890,OC[C@@H]1O[C@@H](CCn2cc(nn2)C2CCCCC2)CC[C@H]1N...,-0.4525,0.951801,0.099539,0.742502,-0.953422,-0.629503,0.145339,...,-0.662275,-0.373907,1.198932,-0.733997,O=C(Nc1ccc(Cl)c(Cl)c1)N[C@@H]1CC[C@H](CCn2cc(C...,"[0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",8.944272,0.144928,"[-0.4524995916327004, 0.951801364194127, 0.099...",27.667467
2,BRD-K41006887-001-01-9,BRD-K41006887,OC[C@H]1O[C@@H](CCn2cc(nn2)C2CCCCC2)CC[C@@H]1N...,-0.057587,1.049713,0.849848,1.010894,0.477231,0.104341,-0.089863,...,0.225152,0.297243,0.937306,-0.964718,O=C(Nc1ccc(Cl)c(Cl)c1)N[C@H]1CC[C@H](CCn2cc(C3...,"[0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",8.944272,0.144928,"[-0.0575871545944799, 1.0497128524077577, 0.84...",19.326218
3,BRD-K06226868-001-01-9,BRD-K06226868,OC[C@H]1O[C@@H](CCn2cc(nn2)C2CCCCC2)CC[C@H]1NC...,-0.479606,-0.420357,0.579939,-0.186316,0.066921,-0.435716,-0.235182,...,2.401989,0.121881,1.291487,0.298728,O=C(Nc1ccc(Cl)c(Cl)c1)N[C@@H]1CC[C@H](CCn2cc(C...,"[0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",8.944272,0.144928,"[-0.47960621508791773, -0.4203570009833136, 0....",19.968936
4,BRD-K80296876-001-01-1,BRD-K80296876,OC[C@@H]1O[C@H](CCn2cc(nn2)C2CCCCC2)CC[C@@H]1N...,-0.640939,-0.311107,0.639009,-0.391117,-0.171688,-0.574212,-0.597188,...,1.411005,1.217564,1.154178,0.036902,O=C(Nc1ccc(Cl)c(Cl)c1)N[C@H]1CC[C@@H](CCn2cc(C...,"[0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",8.944272,0.144928,"[-0.6409394395723342, -0.3111072013741303, 0.6...",17.030844


In this part, we do not need the Euclidean and Tanimoto distances computed per fingerprint, so these columns can be dropped.


In [3]:
dataset.drop(['euclidean_distance_morganB_fps', 'euclidean_distance_morphological_fingerprint',
             'tanimoto_distance_morganB_fps'], axis=1, inplace=True)

2. Load file with endocrine activity (ER)

In [4]:
er_table = pd.read_csv(
    '../Data/Annotations/ER_activity_luc_bg1.csv', sep=',', index_col=0)

We want to mark any endocrine activity as 'active' (1), regardless of whether the compound is annotated as an antagonist or an agonist, and only compounds with no activity as 'inactive' (0).

In [5]:
er_table.dropna(inplace=True)

In [43]:
def convert_to_three_classes(x):
    '''
        encodes the ER response of a compound as one of four integer classes
        (0: not active, 1: agonist, 2: antagonist, 3: both) for the multiclass setting

        @param x: row in dataframe
        @return: class as int
    '''
    if (x['TOX21_ERa_LUC_BG1_Antagonist'] == 1) and (x['TOX21_ERa_LUC_BG1_Agonist'] == 1):
        return 3
    elif x['TOX21_ERa_LUC_BG1_Antagonist'] == 1:
        return 2
    elif x['TOX21_ERa_LUC_BG1_Agonist'] == 1:
        return 1
    else:
        return 0

In [44]:
class_encoding_dict = {3: 'both', 2: 'antagonist',
                       1: 'agonist', 0: 'not active'}

In [45]:
er_table['three_class_response'] = er_table.apply(
    lambda x: convert_to_three_classes(x), axis=1)
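
As a quick sanity check of this encoding, the integer classes can be mapped back to readable labels with `class_encoding_dict`. The sketch below uses a small made-up table (`toy_er` is hypothetical, not part of the dataset) and assumes that both assay columns contain only 0 and 1, in which case the row-wise function is equivalent to a simple weighted sum.

```python
import pandas as pd

# Hypothetical toy table with the same two assay columns as er_table
toy_er = pd.DataFrame({
    'TOX21_ERa_LUC_BG1_Antagonist': [0, 1, 0, 1],
    'TOX21_ERa_LUC_BG1_Agonist':    [0, 0, 1, 1],
})

class_encoding_dict = {3: 'both', 2: 'antagonist', 1: 'agonist', 0: 'not active'}

# Vectorized equivalent of the row-wise encoding (assumes 0/1 values only):
# an agonist hit contributes 1, an antagonist hit contributes 2
toy_er['three_class_response'] = (
    toy_er['TOX21_ERa_LUC_BG1_Agonist'] + 2 * toy_er['TOX21_ERa_LUC_BG1_Antagonist']
)

# Map the integer codes back to readable labels
toy_er['class_label'] = toy_er['three_class_response'].map(class_encoding_dict)
print(toy_er['class_label'].tolist())
```

Applying `value_counts()` to such a label column would show how many compounds fall into each class.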

In [46]:
er_table['activity'] = er_table.apply(lambda x: min(int(
    x['TOX21_ERa_LUC_BG1_Antagonist']) + int(x['TOX21_ERa_LUC_BG1_Agonist']), 1), axis=1)

In [47]:
activity_only_df = er_table.drop(
    columns=['TOX21_ERa_LUC_BG1_Antagonist', 'TOX21_ERa_LUC_BG1_Agonist'])

In [48]:
print(
    f'In the dataset, we find {activity_only_df["activity"].sum()} active compounds.')
print(
    f'{round((activity_only_df["activity"].sum()/len(activity_only_df))*100, 1)}% of the compounds are active.')

In the dataset, we find 212 active compounds.
24.1% of the compounds are active.


In [49]:
merged_df = activity_only_df.join(dataset.set_index('CPD_NAME'), on='CPD_NAME')
merged_df.reset_index(inplace=True, drop=True)
merged_df.dropna(inplace=True)

In [50]:
merged_df.head()

Unnamed: 0,CPD_NAME,three_class_response,activity,Metadata_broad_sample,CPD_SMILES,Cells_AreaShape_Area,Cells_AreaShape_Center_X,Cells_AreaShape_Center_Y,Cells_AreaShape_Compactness,Cells_AreaShape_FormFactor,...,Nuclei_Texture_InverseDifferenceMoment_RNA_3_0,Nuclei_Texture_SumAverage_AGP_10_0,Nuclei_Texture_SumAverage_DNA_10_0,Nuclei_Texture_SumAverage_RNA_10_0,Nuclei_Texture_SumEntropy_AGP_10_0,Nuclei_Texture_SumEntropy_DNA_10_0,Nuclei_Texture_SumEntropy_RNA_10_0,STD_smile,morganB_fps,morphological_fingerprint
0,mibefradil,2,1,BRD-K09549677-311-05-6,COCC(=O)O[C@]1(CCN(C)CCCc2nc3ccccc3[nH]2)CCc2c...,-1.522142,-0.167552,-0.057663,0.922271,1.969991,...,0.126704,-1.622976,0.448551,-0.025718,-1.723061,-0.21309,-0.63833,COCC(=O)O[C@]1(CCN(C)CCCc2nc3ccccc3[nH]2)CCc2c...,"[0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[-1.5221420268136925, -0.167552301981941, -0.0..."
1,mibefradil,2,1,BRD-K09549677-300-01-8,COCC(=O)O[C@]1(CCN(C)CCCc2nc3ccccc3[nH]2)CCc2c...,0.307927,0.775698,0.480218,2.664921,1.42163,...,-1.101514,-0.628421,0.754518,-0.195795,-2.022078,0.089506,-1.074223,COCC(=O)O[C@]1(CCN(C)CCCc2nc3ccccc3[nH]2)CCc2c...,"[0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.30792735741491867, 0.7756983423999834, 0.48..."
2,fenbufen,0,0,BRD-K12513978-001-18-3,OC(=O)CCC(=O)c1ccc(cc1)-c1ccccc1,-1.087932,-0.341208,-0.165434,-0.144534,-0.205437,...,-0.245718,-0.488566,-0.129325,-0.082212,0.240349,-0.054421,0.19311,O=C(O)CCC(=O)c1ccc(-c2ccccc2)cc1,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[-1.0879321652896299, -0.34120750238044284, -0..."
3,fenbufen,0,0,BRD-K12513978-001-04-3,OC(=O)CCC(=O)c1ccc(cc1)-c1ccccc1,-0.121079,-0.045807,0.049945,0.036023,0.389832,...,-0.092717,-0.255664,-0.289665,-0.079264,-0.198003,-0.273866,-0.079771,O=C(O)CCC(=O)c1ccc(-c2ccccc2)cc1,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[-0.12107863384308526, -0.04580730280029293, 0..."
4,diethanolamine,0,0,BRD-K19401842-001-04-8,OCCNCCO,-0.436675,-0.146961,-0.460239,0.638962,-0.16615,...,0.157471,-0.42289,-0.580701,-0.228447,-0.132181,-0.549056,-0.466799,OCCNCCO,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[-0.4366754247968007, -0.14696086259426278, -0..."


We notice that some compound names appear several times in the list with different Broad IDs.
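
One way to list such repeated compounds is to count the distinct Broad IDs per compound name. The following is a minimal sketch on a hypothetical miniature of `merged_df` (`toy_merged`), reusing four rows from the table above.

```python
import pandas as pd

# Hypothetical miniature of merged_df containing only the two ID columns
toy_merged = pd.DataFrame({
    'CPD_NAME': ['mibefradil', 'mibefradil', 'fenbufen', 'diethanolamine'],
    'Metadata_broad_sample': ['BRD-K09549677-311-05-6', 'BRD-K09549677-300-01-8',
                              'BRD-K12513978-001-18-3', 'BRD-K19401842-001-04-8'],
})

# Number of distinct Broad IDs per compound name
ids_per_name = toy_merged.groupby('CPD_NAME')['Metadata_broad_sample'].nunique()

# Keep only the names that occur with more than one Broad ID
repeated = ids_per_name[ids_per_name > 1]
print(repeated)
```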

## Splitting Data into Training and Test Sets

For ML, we always need a training set and a test set that are disjoint. <br>
To this end, we split the data into two parts, where the training set contains 80% of the samples and the test set the remaining 20%. <br>
Moreover, we need a response vector y, i.e., the column with the activity information, and a feature matrix X, i.e., a matrix with the morphological or molecular fingerprints, or a combination thereof, for each compound. <br>
To avoid data leakage, we split on compound names, so that all instances of a compound that appears several times in the dataset are put in the same set. Moreover, we want enough instances of the active class in both sets. A stratified split (the `stratify` argument of `train_test_split`) would guarantee a constant ratio of active and inactive compounds in both sets; the call below does not pass `stratify`, so the ratio is only approximately preserved by the random split.
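
A stratified split on compound names could be sketched as follows. The toy arrays `names` and `labels` are hypothetical, and the sketch assumes one entry per unique compound name; it is an illustration of the `stratify` argument, not the split used for the results in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: one entry per unique compound name with its binary label
names = np.array([f'cpd_{i}' for i in range(20)])
labels = np.array([1] * 5 + [0] * 15)   # 25% active

train_names, test_names, y_tr, y_te = train_test_split(
    names, labels, test_size=0.2, random_state=42, shuffle=True,
    stratify=labels)  # keep the active/inactive ratio equal in both sets

# Leakage check: no compound name may occur in both sets
assert set(train_names).isdisjoint(test_names)

print(f'active fraction: train {y_tr.mean():.2f}, test {y_te.mean():.2f}')
```

Because the split is made on names rather than on rows, all replicates of a compound follow their name into the same set when the rows are selected afterwards.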

1. Features and variable to predict (response vector)

In [51]:
train_names, test_names, train_activities, test_activities = train_test_split(
    activity_only_df['CPD_NAME'].values, activity_only_df['activity'].values.astype('int'), random_state=42, test_size=0.2, shuffle=True)

In [52]:
if 'linoleic acid' in train_names:
    print('linoleic acid is in training set')
elif 'linoleic acid' in test_names:
    print('linoleic acid is in test set')

linoleic acid is in training set


## Binary Classification

In [53]:
X = merged_df.copy(deep=True)
X_train = X.loc[X['CPD_NAME'].isin(train_names), :].copy(deep=True)
X_test = X.loc[X['CPD_NAME'].isin(test_names), :].copy(deep=True)

broad_ids_train = X_train['Metadata_broad_sample'].values
broad_ids_test = X_test['Metadata_broad_sample'].values

y_train = X_train['activity'].values
y_test = X_test['activity'].values

cols_to_drop = ['morganB_fps', 'morphological_fingerprint', 'activity', 'CPD_SMILES',
                'CPD_NAME', 'Metadata_broad_sample', 'STD_smile', 'three_class_response']
X_test_morph_fp = X_test.drop(cols_to_drop, axis=1)
X_train_morph_fp = X_train.drop(cols_to_drop, axis=1)

## Cross Validation and Hyperparameter Tuning

In ML, we want to minimize a loss function, e.g., the mean-squared error (MSE). In particular, we fit the model to minimize the loss on the training samples while still keeping the error on unseen test samples small. Moreover, the hyperparameters, which define the hypothesis space of the model, can be tuned to minimize the estimated test error. The test error can be estimated using k-fold cross-validation (CV), which is performed in the following.
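
To illustrate the idea behind k-fold CV, the sketch below partitions sample indices into k folds, each of which serves once as validation set while the remaining folds are used for training. The helper `k_fold_indices` is hypothetical and deliberately plain (unstratified); scikit-learn's `cross_validate` with `cv=5` and a classifier uses stratified folds instead.

```python
import numpy as np

def k_fold_indices(n_samples, k=5, seed=42):
    """Split shuffled sample indices into k validation folds of near-equal size."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    return np.array_split(indices, k)

n_samples = 23
folds = k_fold_indices(n_samples, k=5)

for i, val_idx in enumerate(folds):
    # All samples outside the validation fold are used for training
    train_idx = np.setdiff1d(np.arange(n_samples), val_idx)
    print(f'fold {i}: {len(train_idx)} training / {len(val_idx)} validation samples')
```

The estimated test score is then the mean of the k validation scores.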

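For an RF, there are three hyperparameters that we want to tune in order to optimize the estimated test error. In the function below, each one is tuned separately, with the other two left at their default values: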
* depth of the trees in the forest: 10, 20, or 30
* number of trees in the forest: 100, 300, or 500
* minimum number of samples per leaf: 10, 15, or 20

In [54]:
def rf_cross_validation(X_train, y_train, max_depth_range=(10, 20, 30), num_tree_range=(100, 300, 500), min_samples_leaf_range=(10, 15, 20)):
    '''
        performs a 5-fold CV for a Random Forest for given X_train and y_train;
        each hyperparameter is tuned separately, with the others left at their defaults

        @param X_train: the training matrix
        @param y_train: the associated response vector
        @param max_depth_range: values that should be tested for max depth, default (10, 20, 30)
        @param num_tree_range: values that should be tested for the number of trees, default (100, 300, 500)
        @param min_samples_leaf_range: values that should be tested for the minimum number of samples per leaf, default (10, 15, 20)

        @return: a forest with the best hyperparameters according to the mean balanced accuracy in the CV, trained on the whole training set
    '''
    best_score = -float('inf')
    for depth in max_depth_range:
        cv_results = cross_validate(RandomForestClassifier(random_state=42, max_depth=depth, n_jobs=-1,
                                    class_weight='balanced'), X=X_train, y=y_train, scoring='balanced_accuracy', cv=5)
        score = np.mean(cv_results['test_score'])
        if score > best_score:
            best_score = score
            best_depth = depth

    best_score = -float('inf')
    for n_tree in num_tree_range:
        cv_results = cross_validate(RandomForestClassifier(random_state=42, n_estimators=n_tree,
                                    n_jobs=-1, class_weight='balanced'), X=X_train, y=y_train, scoring='balanced_accuracy', cv=5)
        score = np.mean(cv_results['test_score'])
        if score > best_score:
            best_score = score
            best_n_tree = n_tree

    best_score = -float('inf')
    for num_samples in min_samples_leaf_range:
        cv_results = cross_validate(RandomForestClassifier(random_state=42, min_samples_leaf=num_samples,
                                    n_jobs=-1, class_weight='balanced'), X=X_train, y=y_train, scoring='balanced_accuracy', cv=5)
        score = np.mean(cv_results['test_score'])
        if score > best_score:
            best_score = score
            best_min_samples = num_samples

    rf = RandomForestClassifier(random_state=42, n_estimators=best_n_tree, max_depth=best_depth,
                                min_samples_leaf=best_min_samples, n_jobs=-1, class_weight='balanced')
    rf.fit(X_train, y_train)
    return rf
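
The function above tunes each hyperparameter on its own. As an alternative, a joint search over all combinations could be sketched with scikit-learn's `GridSearchCV`. The toy data from `make_classification` and the reduced grid are assumptions chosen to keep the example fast; this is not the procedure used for the results reported in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical toy data standing in for X_train / y_train (about 25% positives)
X_toy, y_toy = make_classification(
    n_samples=200, n_features=20, weights=[0.75, 0.25], random_state=42)

# Reduced grid to keep the sketch fast; the ranges from the notebook could be used instead
param_grid = {
    'max_depth': [10, 20],
    'n_estimators': [25, 50],
    'min_samples_leaf': [10, 20],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42, class_weight='balanced'),
    param_grid=param_grid, scoring='balanced_accuracy', cv=5)

# With refit=True (the default), the best combination is retrained on all data
search.fit(X_toy, y_toy)

print(search.best_params_, round(search.best_score_, 3))
```

A joint search evaluates every combination, so its cost grows with the product of the range sizes rather than their sum; in return, it can capture interactions between the hyperparameters.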

In [55]:
best_estimator = rf_cross_validation(X_train_morph_fp, y_train)

Now, we print the hyperparameters of the forest, to see what performed best in the CV.

In [56]:
param_dict = best_estimator.get_params()
print(
    f'We use a max depth of {param_dict["max_depth"]}, {param_dict["n_estimators"]} trees, and at least {param_dict["min_samples_leaf"]} samples per leaf.')

We use a max depth of 20, 300 trees, and at least 10 samples per leaf.


## Prediction on Test Set 

Now, we predict the activity for the test data, i.e., samples the model has not seen so far, to estimate the actual test error.

In [57]:
random_forest = best_estimator
y_pred = random_forest.predict(X_test_morph_fp)

We then calculate the Matthews correlation coefficient (MCC) of our test set predictions.

In [58]:
print(
    f'The RF trained on morphological fingerprints reached an MCC of {matthews_corrcoef(y_pred=y_pred, y_true=y_test)}')

The RF trained on morphological fingerprints reached an MCC of 0.23936958727293156
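
For orientation, the binary MCC is computed from the confusion-matrix counts as (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). It ranges from -1 to 1, where 0 corresponds to chance-level predictions. The sketch below works through a small made-up example; the helper `mcc_from_counts` and the toy label vectors are hypothetical.

```python
import math

def mcc_from_counts(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Common convention when a row or column of the confusion matrix is empty
        return 0.0
    return (tp * tn - fp * fn) / denominator

# Hypothetical toy labels: 3 active and 5 inactive compounds
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 0, 0, 0, 0, 0, 1]

pairs = list(zip(y_true_toy, y_pred_toy))
tp = sum(t == 1 and p == 1 for t, p in pairs)
tn = sum(t == 0 and p == 0 for t, p in pairs)
fp = sum(t == 0 and p == 1 for t, p in pairs)
fn = sum(t == 1 and p == 0 for t, p in pairs)

mcc_toy = mcc_from_counts(tp, tn, fp, fn)
print(round(mcc_toy, 3))
```

Unlike plain accuracy, the MCC uses all four cells of the confusion matrix, which makes it informative for imbalanced data such as ours.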


## Using Molecular (Morgan) Fingerprints

1. Features and variable to predict (response vector)

In [59]:

X_test_morgan_fp = X_test['morganB_fps'].copy(deep=True).values
X_train_morgan_fp = X_train['morganB_fps'].copy(deep=True).values

In [60]:
def convert_list_features_to_numpy(x):
    '''
        converts features stored as an array of lists into a 2D array (one row per sample) so that shape works

        @param x: array containing lists
        @return new_x: array of arrays of ints
    '''
    new_x = []
    for element in x:
        new_x.append(np.array(element))
    return np.array(new_x)

In [61]:
X_test_morgan_fp = convert_list_features_to_numpy(X_test_morgan_fp)
X_train_morgan_fp = convert_list_features_to_numpy(X_train_morgan_fp)
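
To see what this conversion achieves, consider the following sketch on a hypothetical toy column (`toy_fps`). It assumes that all fingerprints have the same length, so that the lists can be combined into one regular matrix.

```python
import numpy as np
import pandas as pd

# Hypothetical column of list-valued fingerprints, as stored in 'morganB_fps'
toy_fps = pd.Series([[0, 1, 0, 1], [1, 0, 0, 0], [0, 0, 1, 1]])

# .values on such a column yields a 1D object array of lists
print(toy_fps.values.shape)

# Combining the lists gives one (n_samples, n_bits) integer matrix
fp_matrix = np.array(toy_fps.tolist(), dtype=np.uint8)
print(fp_matrix.shape, fp_matrix.dtype)
```

scikit-learn estimators expect such a 2D numeric matrix, which is why the object array of lists cannot be passed to the forest directly.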

In [62]:
best_estimator = rf_cross_validation(X_train_morgan_fp, y_train)

In [63]:
param_dict = best_estimator.get_params()
print(
    f'We use a max depth of {param_dict["max_depth"]}, {param_dict["n_estimators"]} trees, and at least {param_dict["min_samples_leaf"]} samples per leaf.')

We use a max depth of 10, 500 trees, and at least 10 samples per leaf.


### Prediction on Test Set

In [64]:
y_pred = best_estimator.predict(X_test_morgan_fp)

In [65]:
print(
    f'The RF trained on Morgan fingerprints reached an MCC of {matthews_corrcoef(y_pred=y_pred, y_true=y_test)}')

The RF trained on Morgan fingerprints reached an MCC of 0.3907841737774495


## Using both Molecular (Morgan) and Morphological Fingerprints

In [66]:
def append_concat_molecular_and_morphological_FP(x):
    '''
        creates a new column 'appended_profile' containing the joint morphological and Morgan fingerprints

        @param x: pandas DataFrame with columns 'morphological_fingerprint' and 'morganB_fps'
        @return x with new column 'appended_profile' 
    '''
    # use the DataFrame passed in (x), not the global X
    x['appended_profile'] = x.apply(lambda row: np.concatenate(
        [row['morphological_fingerprint'], row['morganB_fps']]), axis=1)
    return x

In [67]:
X_test_morph_morgan_fp = append_concat_molecular_and_morphological_FP(
    X_test.copy(deep=True))
X_train_morph_morgan_fp = append_concat_molecular_and_morphological_FP(
    X_train.copy(deep=True))

In [68]:
X_train_morph_morgan_fp = convert_list_features_to_numpy(
    X_train_morph_morgan_fp['appended_profile'].values)
X_test_morph_morgan_fp = convert_list_features_to_numpy(
    X_test_morph_morgan_fp['appended_profile'].values)

In [69]:
rf_combined_morgan_morph = rf_cross_validation(
    X_train=X_train_morph_morgan_fp, y_train=y_train)

In [70]:
y_pred_combi = rf_combined_morgan_morph.predict(X_test_morph_morgan_fp)

In [71]:
print(
    f'The RF trained on both morphological and Morgan fingerprints reached an MCC of {matthews_corrcoef(y_pred=y_pred_combi, y_true=y_test)}')

The RF trained on both morphological and Morgan fingerprints reached an MCC of 0.23561281812452103


## Conclusion

The model trained on Morgan fingerprints alone reached the highest MCC (0.39), compared with 0.24 for the morphological fingerprints and 0.24 for the combination of both.

## Multiclass Classification

In [72]:
y_train = X_train['three_class_response'].values
y_test = X_test['three_class_response'].values

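# 'three_class_response' is the label, so it must be dropped from the features as well
cols_to_drop = ['morganB_fps', 'morphological_fingerprint', 'activity', 'CPD_SMILES',
                'CPD_NAME', 'Metadata_broad_sample', 'STD_smile', 'three_class_response']
X_test_morph_fp = X_test.drop(cols_to_drop, axis=1)
X_train_morph_fp = X_train.drop(cols_to_drop, axis=1)

In [73]:
best_estimator = rf_cross_validation(X_train_morph_fp, y_train)

In [74]:
y_pred = best_estimator.predict(X_test_morph_fp)

In [76]:
print(
    f'The RF trained on morphological fingerprints reached a multiclass MCC of {matthews_corrcoef(y_pred=y_pred, y_true=y_test)}')

The RF trained on morphological fingerprints reached a multiclass MCC of 0.6938813594468924

Caution: the MCC of 0.69 reported here stems from a run in which `three_class_response`, i.e., the label itself, was still among the feature columns. It therefore overestimates the performance and should be recomputed with the column dropped, as in the cell above.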


To do LM (biologist proof):
- DMSO as control (known to be non-toxic) + 1 molecule known to be toxic (chemotherapeutic)
- a concrete example with 2 molecules