# Machine learning 

In this part, we use machine learning (ML) models, particularly random forests (RF), to predict the estrogen receptor (ER) activity of the compounds in our previously processed dataset.

## Import libraries 

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef
import sklearn.model_selection as model_selection 
from sklearn.metrics import f1_score, balanced_accuracy_score,classification_report, accuracy_score,make_scorer

## Reminder: the datasets


As a reminder, we are using:

- **Morphological fingerprint**: Morphological fingerprint features represent the effect of a treatment measured on cells, in this case U2OS osteosarcoma cells.
- **Structural fingerprint**: Structural fingerprints are computer-readable vectors obtained from the SMILES. They encapsulate atom environment, connectivity, and substructure information.

We are now introducing:

- **Activity annotation**: Activity and inactivity annotations are obtained from agonist and antagonist assays. A set of molecules has been tested on LUC BG1 breast cancer cells to estimate IC50 (TO CHECK). The values are then converted to a binary format: 1 (agonist or antagonist) / 0 (non-agonist, non-antagonist).



### Load data: per-treatment profile fingerprints and annotation file

1. Load the morphological data 

In [2]:
dataset = pd.read_pickle('../Data/Output/output_notebook_2.pkl')
dataset.head(2)

Unnamed: 0,Metadata_broad_sample,CPD_NAME,CPD_SMILES,Cells_AreaShape_Area,Cells_AreaShape_Center_X,Cells_AreaShape_Center_Y,Cells_AreaShape_Compactness,Cells_AreaShape_FormFactor,Cells_AreaShape_Orientation,Cells_AreaShape_Perimeter,...,Nuclei_Texture_SumAverage_RNA_10_0,Nuclei_Texture_SumEntropy_AGP_10_0,Nuclei_Texture_SumEntropy_DNA_10_0,Nuclei_Texture_SumEntropy_RNA_10_0,STD_smile,morganB_fps,euclidean_distance_morganB_fps,tanimoto_distance_morganB_fps,morphological_fingerprint,euclidean_distance_morphological_fingerprint
0,BRD-K08693008-001-01-9,BRD-K08693008,OC[C@@H]1O[C@@H](CCn2cc(nn2)C2CCCCC2)CC[C@@H]1...,0.048557,1.511709,0.671319,1.608394,-0.511233,0.565885,0.257311,...,0.586195,1.125321,1.009055,-1.088308,O=C(Nc1ccc(Cl)c(Cl)c1)N[C@H]1CC[C@H](CCn2cc(C3...,"[0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",8.944272,0.144928,"[0.04855710428510099, 1.5117089902809278, 0.67...",27.358629
1,BRD-K63982890-001-01-9,BRD-K63982890,OC[C@@H]1O[C@@H](CCn2cc(nn2)C2CCCCC2)CC[C@H]1N...,-0.4525,0.951801,0.099539,0.742502,-0.953422,-0.629503,0.145339,...,-0.662275,-0.373907,1.198932,-0.733997,O=C(Nc1ccc(Cl)c(Cl)c1)N[C@@H]1CC[C@H](CCn2cc(C...,"[0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",8.944272,0.144928,"[-0.4524995916327004, 0.951801364194127, 0.099...",27.667467


In [Part2-Similarity_Analysis](Part2-Similarity_Analysis), we computed Euclidean and Tanimoto distances. They are not needed here, so we drop them together with the SMILES columns.

In [3]:
dataset.drop(['CPD_SMILES','STD_smile','euclidean_distance_morganB_fps', 'euclidean_distance_morphological_fingerprint', 'tanimoto_distance_morganB_fps'], axis = 1, inplace=True)

2. Endocrine activity (ER) dataset 

For the annotation dataset, we will:
- Load the data
- Check the data for missing values
- Compute the ratio of active to inactive compounds


In [4]:
er_table = pd.read_csv('../Data/Annotations/ER_activity_luc_bg1.csv', sep = ',', index_col=0)
print(f'There are {len(er_table["CPD_NAME"])} tested for ER alpha activity')
er_table.head(2)

There are 886 tested for ER alpha activity


Unnamed: 0,TOX21_ERa_LUC_BG1_Antagonist,TOX21_ERa_LUC_BG1_Agonist,CPD_NAME
0,1.0,0.0,mibefradil
1,0.0,0.0,fenbufen


Check whether the table contains any NaN values and list the affected rows.

In [5]:

if er_table.isna().sum().sum() == 0:
    print("There are no NaN values in er_table.")
else:
    print("There are NaN values in er_table : ")

er_table[er_table.isna().any(axis=1)]


There are NaN values in er_table : 


Unnamed: 0,TOX21_ERa_LUC_BG1_Antagonist,TOX21_ERa_LUC_BG1_Agonist,CPD_NAME
35,,,metoprolol tartrate
255,,,"2,4-dichlorophenoxyacetic acid"
281,,,ciprofloxacin
402,,,paraxanthine
632,,,tyrphostin AG-1478


In [6]:
# Drop the rows that contain NaN values
er_table.dropna(inplace=True)

## Activity classes 

We want to distinguish between the activity classes, i.e., agonism and antagonism, and therefore train one model for each.

1. Agonism activity and Antagonism activity 

In [7]:
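# Copy the slices so that the assignments below do not raise a SettingWithCopyWarning
agonist = er_table[['TOX21_ERa_LUC_BG1_Agonist','CPD_NAME']].copy()
antagonist = er_table[['TOX21_ERa_LUC_BG1_Antagonist','CPD_NAME']].copy()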

agonist.loc[:, 'TOX21_ERa_LUC_BG1_Agonist'] = agonist['TOX21_ERa_LUC_BG1_Agonist'].astype('int')
antagonist.loc[:, 'TOX21_ERa_LUC_BG1_Antagonist'] = antagonist['TOX21_ERa_LUC_BG1_Antagonist'].astype('int')

We also want to mark any endocrine activity as 'active' (1), regardless of whether the compound is annotated as an agonist or an antagonist, and mark compounds with no activity as 'inactive' (0).

2. Estrogen receptor activity vs non-activity 

In [8]:
er_table['activity'] = er_table.apply(lambda x: min(int(x['TOX21_ERa_LUC_BG1_Antagonist']) + int(x['TOX21_ERa_LUC_BG1_Agonist']), 1), axis=1)

In [9]:
activity_only_df = er_table.drop(
    columns=['TOX21_ERa_LUC_BG1_Antagonist', 'TOX21_ERa_LUC_BG1_Agonist'])

## Merging the annotation and morphological fingerprint 

We will create three types of datasets:

- **Activity dataset:** Classifies molecules as active or non-active based on both agonism and antagonism assays. Non-active molecules did not show activity for ERα.

- **Agonism dataset:** Considers only the agonism assay results. Active molecules are agonists, and non-active molecules did not show agonist effects during the assay.

- **Antagonism dataset:** Considers only the antagonism assay results. Active molecules are antagonists, and non-active molecules did not show antagonist effects during the assay.

1. Merging the annotation to the morphological data 

In [10]:
activity_df = activity_only_df.join(dataset.set_index('CPD_NAME'), on='CPD_NAME')
activity_df.reset_index(inplace=True, drop=True)
activity_df.dropna(inplace=True)
#activity_df.drop('activity_label',axis = 1,inplace=True)
duplicated_activity_df = activity_df.duplicated(subset='Metadata_broad_sample', keep='first')
unique_activity_df = activity_df[~duplicated_activity_df]
print(f'{round((unique_activity_df["activity"].sum()/len(unique_activity_df))*100, 1)}% of the compounds are active.') 
unique_activity_df.head(2)

23.0% of the compounds are active.


Unnamed: 0,CPD_NAME,activity,Metadata_broad_sample,Cells_AreaShape_Area,Cells_AreaShape_Center_X,Cells_AreaShape_Center_Y,Cells_AreaShape_Compactness,Cells_AreaShape_FormFactor,Cells_AreaShape_Orientation,Cells_AreaShape_Perimeter,...,Nuclei_Texture_InverseDifferenceMoment_ER_10_0,Nuclei_Texture_InverseDifferenceMoment_RNA_3_0,Nuclei_Texture_SumAverage_AGP_10_0,Nuclei_Texture_SumAverage_DNA_10_0,Nuclei_Texture_SumAverage_RNA_10_0,Nuclei_Texture_SumEntropy_AGP_10_0,Nuclei_Texture_SumEntropy_DNA_10_0,Nuclei_Texture_SumEntropy_RNA_10_0,morganB_fps,morphological_fingerprint
0,mibefradil,1,BRD-K09549677-311-05-6,-1.522142,-0.167552,-0.057663,0.922271,1.969991,-0.140241,-2.420645,...,0.797366,0.126704,-1.622976,0.448551,-0.025718,-1.723061,-0.21309,-0.63833,"[0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[-1.5221420268136925, -0.167552301981941, -0.0..."
1,mibefradil,1,BRD-K09549677-300-01-8,0.307927,0.775698,0.480218,2.664921,1.42163,0.032674,-0.583083,...,-1.452899,-1.101514,-0.628421,0.754518,-0.195795,-2.022078,0.089506,-1.074223,"[0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.30792735741491867, 0.7756983423999834, 0.48..."


In [11]:
agonist_df = agonist.join(dataset.set_index('CPD_NAME'), on='CPD_NAME')
agonist_df.rename(columns={'TOX21_ERa_LUC_BG1_Agonist': 'activity'},inplace=True)
agonist_df.reset_index(inplace=True, drop=True)
agonist_df.dropna(inplace=True)
duplicated_agonist_df = agonist_df.duplicated(subset='Metadata_broad_sample', keep='first')
unique_agonist_df = agonist_df[~duplicated_agonist_df]
print(f'{round((unique_agonist_df["activity"].sum()/len(unique_agonist_df))*100, 1)}% of the compounds are agonist.') 
unique_agonist_df.head(2)

14.3% of the compounds are agonist.


Unnamed: 0,activity,CPD_NAME,Metadata_broad_sample,Cells_AreaShape_Area,Cells_AreaShape_Center_X,Cells_AreaShape_Center_Y,Cells_AreaShape_Compactness,Cells_AreaShape_FormFactor,Cells_AreaShape_Orientation,Cells_AreaShape_Perimeter,...,Nuclei_Texture_InverseDifferenceMoment_ER_10_0,Nuclei_Texture_InverseDifferenceMoment_RNA_3_0,Nuclei_Texture_SumAverage_AGP_10_0,Nuclei_Texture_SumAverage_DNA_10_0,Nuclei_Texture_SumAverage_RNA_10_0,Nuclei_Texture_SumEntropy_AGP_10_0,Nuclei_Texture_SumEntropy_DNA_10_0,Nuclei_Texture_SumEntropy_RNA_10_0,morganB_fps,morphological_fingerprint
0,0.0,mibefradil,BRD-K09549677-311-05-6,-1.522142,-0.167552,-0.057663,0.922271,1.969991,-0.140241,-2.420645,...,0.797366,0.126704,-1.622976,0.448551,-0.025718,-1.723061,-0.21309,-0.63833,"[0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[-1.5221420268136925, -0.167552301981941, -0.0..."
1,0.0,mibefradil,BRD-K09549677-300-01-8,0.307927,0.775698,0.480218,2.664921,1.42163,0.032674,-0.583083,...,-1.452899,-1.101514,-0.628421,0.754518,-0.195795,-2.022078,0.089506,-1.074223,"[0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.30792735741491867, 0.7756983423999834, 0.48..."


In [12]:
antagonist_df = antagonist.join(dataset.set_index('CPD_NAME'), on='CPD_NAME')
antagonist_df.rename(columns={'TOX21_ERa_LUC_BG1_Antagonist': 'activity'},inplace=True)
antagonist_df.reset_index(inplace=True, drop=True)
antagonist_df.dropna(inplace=True)
duplicated_antagonist_df = antagonist_df.duplicated(subset='Metadata_broad_sample', keep='first')
unique_antagonist_df = antagonist_df[~duplicated_antagonist_df]
print(f'{round((unique_antagonist_df["activity"].sum()/len(unique_antagonist_df))*100, 1)}% of the compounds are antagonist.') 
unique_antagonist_df.head(2)

10.4% of the compounds are antagonist.


Unnamed: 0,activity,CPD_NAME,Metadata_broad_sample,Cells_AreaShape_Area,Cells_AreaShape_Center_X,Cells_AreaShape_Center_Y,Cells_AreaShape_Compactness,Cells_AreaShape_FormFactor,Cells_AreaShape_Orientation,Cells_AreaShape_Perimeter,...,Nuclei_Texture_InverseDifferenceMoment_ER_10_0,Nuclei_Texture_InverseDifferenceMoment_RNA_3_0,Nuclei_Texture_SumAverage_AGP_10_0,Nuclei_Texture_SumAverage_DNA_10_0,Nuclei_Texture_SumAverage_RNA_10_0,Nuclei_Texture_SumEntropy_AGP_10_0,Nuclei_Texture_SumEntropy_DNA_10_0,Nuclei_Texture_SumEntropy_RNA_10_0,morganB_fps,morphological_fingerprint
0,1.0,mibefradil,BRD-K09549677-311-05-6,-1.522142,-0.167552,-0.057663,0.922271,1.969991,-0.140241,-2.420645,...,0.797366,0.126704,-1.622976,0.448551,-0.025718,-1.723061,-0.21309,-0.63833,"[0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[-1.5221420268136925, -0.167552301981941, -0.0..."
1,1.0,mibefradil,BRD-K09549677-300-01-8,0.307927,0.775698,0.480218,2.664921,1.42163,0.032674,-0.583083,...,-1.452899,-1.101514,-0.628421,0.754518,-0.195795,-2.022078,0.089506,-1.074223,"[0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.30792735741491867, 0.7756983423999834, 0.48..."


As mentioned before, some treatments have been tested as different stereoisomers: their Broad IDs differ, although the compound names are identical. Therefore, we must make sure that all Broad IDs sharing a compound name end up in only one of the sets, to avoid data leakage.

## Splitting Data into Training and Test Sets

For ML, we always need a training set and a test set that are disjoint. <br>

To this end, we split the data into two parts, where the training set will contain 80% of the samples and the test set the remaining 20%. <br>

Moreover, we need a response vector y, i.e., the column with the activity information and a feature matrix X, i.e., a matrix with the morphological or molecular fingerprints or a combination thereof, for each compound. <br>

To avoid data leakage, all instances of a compound that appears several times in the dataset must be placed in the same set. Moreover, we want enough instances of the active class in both sets. Therefore, we perform a stratified split, which keeps the ratio of active to inactive compounds constant across the sets.

1. Features and variables to predict 

In [13]:
def stratified_split(data):
    X_train, X_test, y_train_activities, y_test_activities= train_test_split(
        data.iloc[:,3:], 
        data['activity'].values.astype('int'), 
        random_state=42, 
        test_size=0.2, 
        shuffle=True,
        stratify=data['activity'].values.astype('int')) #stratify split 
    return X_train, X_test, y_train_activities, y_test_activities

In [14]:
X_train, X_test, y_train_activities, y_test_activities = stratified_split(unique_antagonist_df) #unique_antagonist_df , unique_agonist_df

A. Morphological features

In [15]:
X_train_morpho = X_train.drop(['morganB_fps','morphological_fingerprint'], axis=1).values
X_test_morpho = X_test.drop(['morganB_fps','morphological_fingerprint'], axis=1).values

B. Structural features

In [16]:
def convert_list_features_to_numpy(x):
    '''
        converts the features that are stored in an array containing lists to an array of arrays such that shape works

        @param x: array containing lists 
        @return x_new: array of arrays of ints
    '''
    new_x = []
    for element in x:
        new_x.append(np.array(element))
    return np.array(new_x)

In [17]:
X_test_morgan_fp = X_test['morganB_fps'].copy(deep=True).values
X_train_morgan_fp = X_train['morganB_fps'].copy(deep=True).values
#convert as an array 
X_test_morgan_fp = convert_list_features_to_numpy(X_test_morgan_fp)
X_train_morgan_fp = convert_list_features_to_numpy(X_train_morgan_fp)

C. For combined features

In [18]:
def append_concat_molecular_and_morphological_FP(x):
    '''
        creates a new column 'appended_profile' containing the joint morphological and Morgan fingerprints

        @param x: pandas DataFrame with columns 'morphological_fingerprint' and 'morganB_fps'
        @return x with new column 'appended_profile' 
    '''
    x['appended_profile'] = x.apply(lambda x: np.concatenate(
        [x['morphological_fingerprint'], x['morganB_fps']]), axis=1)
    return x

In [19]:
X_test_morph_morgan_fp = append_concat_molecular_and_morphological_FP(X_test.copy(deep=True))
X_train_morph_morgan_fp = append_concat_molecular_and_morphological_FP(X_train.copy(deep=True))

In [20]:
X_train_morph_morgan_fp = convert_list_features_to_numpy(X_train_morph_morgan_fp['appended_profile'].values)
X_test_morph_morgan_fp = convert_list_features_to_numpy(X_test_morph_morgan_fp['appended_profile'].values)

## Binary Classification

### Cross-validation and Hyperparameter Tuning

In ML, we want to minimize a loss function, e.g., the mean-squared error (MSE) in regression. In particular, we fit the model to minimize the loss on the training samples while still keeping the error on unseen test samples small. Hyperparameters define the hypothesis space of the model and can be tuned to minimize the estimated test error. The test error can be estimated with k-fold cross-validation (CV); here we use 5-fold CV with balanced accuracy as the score.

For an RF, there are three hyperparameters that we want to tune to optimize the estimated test error: 
* depth of the trees in the forest: 10, 20, 30 or 50 
* number of trees in the forest: 100, 200, 300 or 500
* minimum number of samples per leaf: 5, 10, 15, 20 or 25
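As a point of comparison, the three hyperparameters can also be searched jointly, so that every combination is scored by cross-validation. The following is a minimal sketch with scikit-learn's `GridSearchCV`, assuming a synthetic imbalanced dataset as a stand-in for the fingerprint matrix and a reduced grid to keep the run short. `X_toy`, `y_toy`, and the grid values are illustrative and not taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in for the feature matrix (assumption).
X_toy, y_toy = make_classification(n_samples=300, n_features=20,
                                   weights=[0.9, 0.1], random_state=42)

# Reduced grid for speed; the ranges listed above can be substituted.
param_grid = {
    "max_depth": [10, 20],
    "n_estimators": [50, 100],
    "min_samples_leaf": [5, 10],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42, class_weight="balanced"),
    param_grid=param_grid,
    scoring="balanced_accuracy",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
# With the default refit=True, the best combination is retrained on all data.
search.fit(X_toy, y_toy)

print(search.best_params_)
print(f"mean CV balanced accuracy: {search.best_score_:.3f}")
```

A joint search costs more fits than tuning one hyperparameter at a time, but it can capture interactions, for example between tree depth and the minimum leaf size.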


In [21]:
def rf_cross_validation(X_train, y_train, max_depth_range=[10, 20, 30, 50], num_tree_range=[100, 200, 300, 500], min_samples_leaf_range=[5,10, 15, 20, 25]):
    '''
        performs a 5-fold CV for a Random Forest for given X_train and y_train;
        each hyperparameter is tuned separately, with the others left at their defaults

        @param X_train: the training matrix
        @param y_train: the associated response vector
        @param max_depth_range: list containing the values that should be tested for max depth, default [10,20,30,50]
        @param num_tree_range: list containing the values that should be tested for the number of trees, default [100,200,300,500]
        @param min_samples_leaf_range: list containing the values that should be tested for the minimum number of samples per leaf, default [5,10,15,20,25]

        @return: a forest with the best hyperparameters according to the mean CV balanced accuracy, trained on the whole training set
    '''
    best_score = -float('inf')
    for depth in max_depth_range:
        cv_results = cross_validate(RandomForestClassifier(random_state=42, max_depth=depth, n_jobs=-1,
                                    class_weight='balanced'), X=X_train, y=y_train, scoring='balanced_accuracy', cv=5)
        score = np.mean(cv_results['test_score'])
        if score > best_score:
            best_score = score
            best_depth = depth

    best_score = -float('inf')
    for n_tree in num_tree_range:
        cv_results = cross_validate(RandomForestClassifier(random_state=42, n_estimators=n_tree,
                                    n_jobs=-1, class_weight='balanced'), X=X_train, y=y_train, scoring='balanced_accuracy', cv=5)
        score = np.mean(cv_results['test_score'])
        if score > best_score:
            best_score = score
            best_n_tree = n_tree

    best_score = -float('inf')
    for num_samples in min_samples_leaf_range:
        cv_results = cross_validate(RandomForestClassifier(random_state=42, min_samples_leaf=num_samples,
                                    n_jobs=-1, class_weight='balanced'), X=X_train, y=y_train, scoring='balanced_accuracy', cv=5)
        score = np.mean(cv_results['test_score'])
        if score > best_score:
            best_score = score
            best_min_samples = num_samples

    rf = RandomForestClassifier(random_state=42, n_estimators=best_n_tree, max_depth=best_depth,
                                min_samples_leaf=best_min_samples, n_jobs=-1, class_weight='balanced')
    rf.fit(X_train, y_train)
    
    return rf

In [22]:
best_estimator_morpho = rf_cross_validation(X_train_morpho, y_train_activities)
best_estimator_struct= rf_cross_validation(X_train_morgan_fp, y_train_activities)
best_estimator_combined = rf_cross_validation(X_train_morph_morgan_fp, y_train_activities)

### Instantiate model - evaluation function

In [23]:
models = {
    "RF"         : [RandomForestClassifier(random_state=42,class_weight ='balanced')], ##default
    "RF_morpho"  : [best_estimator_morpho],
    "RF_struct"  : [best_estimator_struct],
    "RF_combined": [best_estimator_combined]
}

def evaluate_model(y_test,y_pred):
    
    '''
    Assess a model with balanced accuracy and MCC, and print a classification report
        Parameters : 
            y_test: true labels of the test set 
            y_pred: labels predicted by the model
        Returns : 
            None (the metrics and the classification report are printed)
    '''    
    
    #acc = accuracy_score(y_test, y_pred)
    ba = balanced_accuracy_score(y_test,y_pred)
    #f1score = f1_score(y_test, y_pred, average='weighted')
    mcc = matthews_corrcoef(y_pred=y_pred, y_true=y_test)

    print(f' Evaluation of predicition made on Test set : ')
    print(f'Metrics to evaluate the model: \n balanced accuracy : {ba*100:.2f} % , \n MCC : {mcc:.3f}.')
    print(f'Summary metrics in a cross table: \n')
    print(classification_report(y_test, y_pred))

def make_predictions(models_dict ,selected_model,X_train,X_test,y_test,y_train) :

    '''
    Fit the selected model on the training set and evaluate it on the test set 
        Parameters : 
            models_dict (dictionary): maps a model name to a one-element list holding the estimator
            selected_model (str): name of the model to choose
            X_train, X_test : feature matrices (morphological or structural fingerprints, or both)
            y_train, y_test : response vectors (activity labels)
        Returns : 
            the fitted model
    '''      

    # fail early instead of hitting an undefined `model` further down
    if selected_model not in models_dict :
        raise KeyError(f'Unknown model: {selected_model}')
    model = models_dict[selected_model][0]

    # learn model on train
    model.fit(X_train,y_train) 
    
    #compute prediction on test set 
    predictions = model.predict(X_test)
    evaluate_model(y_test,predictions) 
    return model



###  Make Predictions

1. Morphological fingerprint

A. Default Setting 


In [24]:
make_predictions(models,'RF',X_train_morpho,X_test_morpho,y_test_activities,y_train_activities)


 Evaluation of predicition made on Test set : 
Metrics to evaluate the model: 
 balanced accuracy : 52.78 % , 
 MCC : 0.224.
Summary metrics in a cross table: 

              precision    recall  f1-score   support

           0       0.90      1.00      0.95       156
           1       1.00      0.06      0.11        18

    accuracy                           0.90       174
   macro avg       0.95      0.53      0.53       174
weighted avg       0.91      0.90      0.86       174



B. Tuned: using the cross-validated hyperparameters

In [25]:
morpho_model = make_predictions(models,'RF_morpho',X_train_morpho,X_test_morpho,y_test_activities,y_train_activities)

 Evaluation of predicition made on Test set : 
Metrics to evaluate the model: 
 balanced accuracy : 73.61 % , 
 MCC : 0.425.
Summary metrics in a cross table: 

              precision    recall  f1-score   support

           0       0.95      0.92      0.93       156
           1       0.43      0.56      0.49        18

    accuracy                           0.88       174
   macro avg       0.69      0.74      0.71       174
weighted avg       0.89      0.88      0.89       174



2. Structural fingerprint

A. Default Setting

In [26]:
make_predictions(models,'RF',X_train_morgan_fp,X_test_morgan_fp,y_test_activities,y_train_activities)

 Evaluation of predicition made on Test set : 
Metrics to evaluate the model: 
 balanced accuracy : 57.69 % , 
 MCC : 0.280.
Summary metrics in a cross table: 

              precision    recall  f1-score   support

           0       0.91      0.99      0.95       156
           1       0.60      0.17      0.26        18

    accuracy                           0.90       174
   macro avg       0.76      0.58      0.60       174
weighted avg       0.88      0.90      0.88       174



B. Tuned: using the cross-validated hyperparameters

In [27]:
make_predictions(models,'RF_struct',X_train_morgan_fp,X_test_morgan_fp,y_test_activities,y_train_activities)

 Evaluation of predicition made on Test set : 
Metrics to evaluate the model: 
 balanced accuracy : 58.97 % , 
 MCC : 0.145.
Summary metrics in a cross table: 

              precision    recall  f1-score   support

           0       0.92      0.85      0.88       156
           1       0.20      0.33      0.25        18

    accuracy                           0.79       174
   macro avg       0.56      0.59      0.56       174
weighted avg       0.84      0.79      0.81       174



3. Combined fingerprint 

A. Default Setting

In [28]:
make_predictions(models,'RF',X_train_morph_morgan_fp,X_test_morph_morgan_fp,y_test_activities,y_train_activities)

 Evaluation of predicition made on Test set : 
Metrics to evaluate the model: 
 balanced accuracy : 55.56 % , 
 MCC : 0.317.
Summary metrics in a cross table: 

              precision    recall  f1-score   support

           0       0.91      1.00      0.95       156
           1       1.00      0.11      0.20        18

    accuracy                           0.91       174
   macro avg       0.95      0.56      0.58       174
weighted avg       0.92      0.91      0.87       174



B. Tuned: using the cross-validated hyperparameters

In [29]:
make_predictions(models,'RF_combined',X_train_morph_morgan_fp,X_test_morph_morgan_fp,y_test_activities,y_train_activities)

 Evaluation of predicition made on Test set : 
Metrics to evaluate the model: 
 balanced accuracy : 72.97 % , 
 MCC : 0.399.
Summary metrics in a cross table: 

              precision    recall  f1-score   support

           0       0.95      0.90      0.92       156
           1       0.40      0.56      0.47        18

    accuracy                           0.87       174
   macro avg       0.67      0.73      0.69       174
weighted avg       0.89      0.87      0.88       174



## Conclusion

The Matthews correlation coefficient (MCC) is a more reliable statistic than plain accuracy on imbalanced data: it yields a high score only if the predictions are good in all four confusion-matrix categories (true positives, false negatives, true negatives, and false positives), in proportion to both the number of positive and the number of negative samples in the dataset.
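To make the definition concrete, the sketch below computes MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)) and the balanced accuracy by hand. The counts are an assumption inferred from the classification report of the default morphological model above (support 156 / 18, recall 1.00 / 0.06, precision 1.00 for the active class), not values stored in the notebook.

```python
import math

import numpy as np
from sklearn.metrics import balanced_accuracy_score, matthews_corrcoef

# Confusion-matrix counts inferred from the report (assumption):
# all 156 inactives predicted inactive, 1 of 18 actives recovered.
tp, fn, tn, fp = 1, 17, 156, 0

# MCC from the four confusion-matrix cells.
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

# Balanced accuracy is the mean of the recall of each class.
balanced_acc = 0.5 * (tp / (tp + fn) + tn / (tn + fp))

# Cross-check against scikit-learn on label vectors with the same counts.
y_true = np.array([1] * (tp + fn) + [0] * (tn + fp))
y_pred = np.array([1] * tp + [0] * fn + [0] * tn + [1] * fp)
assert math.isclose(mcc, matthews_corrcoef(y_true, y_pred))
assert math.isclose(balanced_acc, balanced_accuracy_score(y_true, y_pred))

print(f"MCC: {mcc:.3f}, balanced accuracy: {balanced_acc:.4f}")
# prints "MCC: 0.224, balanced accuracy: 0.5278"
```

Although 90% of the test compounds are classified correctly in that run, the MCC stays low because almost all actives are missed.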

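For the antagonist models evaluated above, the tuned model trained on morphological fingerprints reached the highest MCC (0.425), closely followed by the tuned model trained on the combined fingerprints (0.399). The models trained on Morgan fingerprints only scored lower (0.280 with default settings, 0.145 after tuning).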

To do LM (biologist proof):
- DMSO as control (known to be non-toxic) + 1 molecule known to be toxic (chemotherapeutic)
- a concrete example with 2 molecules

In [None]:
#to test : BRD-K02985669,BRD-K98953112,BRD-K57359365 ,'malonoben' 
test_mol = dataset[dataset['CPD_NAME'] == 'BRD-K02985669'].iloc[:,2:-2]
morpho_model.predict(test_mol.values)[0]

In [None]:

for name in dataset['CPD_NAME'].tolist():
    test = dataset[dataset['CPD_NAME'] == name].iloc[:, 2:-2]
    if name not in antagonist_df['CPD_NAME'].tolist() and morpho_model.predict(test.values)[0] != 0:
        print(name)


In [46]:
antagonist_predicited= []
filtered_dataset = dataset[~dataset['CPD_NAME'].isin(unique_activity_df['CPD_NAME'])]
for name in filtered_dataset['CPD_NAME'].to_list():
    test = filtered_dataset[filtered_dataset['CPD_NAME'] == name].iloc[:, 2:-2]
    if morpho_model.predict(test.values)[0] != 0:
        antagonist_predicited.append(name)
        print(name)

BRD-K98953112
BRD-K57359365
BRD-K39406901
BRD-K67843691
BRD-K31614544
BRD-K89701103
BRD-K99873441
BRD-K85641667
BRD-K52130884
BRD-K50389815
BRD-K86897283
BRD-K53889798
BRD-K77834935
BRD-K16093370
BRD-K19502817
BRD-K94489736
BRD-K98061177
BRD-K09693670
BRD-K01774461
BRD-K08406395
BRD-K13110820
BRD-K03624299
BRD-K85357344
BRD-K55940699
BRD-K60483981
BRD-K15322427
BRD-K13817284
BRD-K53026281
BRD-K67670670
BRD-K20984459
BRD-K04358482
BRD-K66831269
BRD-K93698227
BRD-K58836556
BRD-K39573478
BRD-K39600742
BRD-K39624517
BRD-K39681206
BRD-K39695412
BRD-K39709957
BRD-K40864085
BRD-K40954438
BRD6802
BRD-K40982265
BRD-K41033343
BRD-K41044903
BRD-K41116634
BRD-K41133627
BRD-K41135617
BRD-K41172555
BRD-K41185648
BRD-K41272069
BRD-K41412255
BRD-K41517267
BRD-K41562445
BRD-K41656340
BRD-K41674016
BRD-K41774018
BRD-K41866979
BRD-K41939184
BRD-K42032479
BRD-K42455150
BRD-K42654859
BRD-K43287163
BRD-K44506257
BRD-K44648112
BRD3154
BRD-K45282594
BRD-K46401069
BRD-K46494135
BRD-K47533800
BRD-K48549243
BRD-