# Machine learning 

In this part, we use machine learning (ML) models, specifically random forests (RF), to predict the estrogen receptor (ER) activity of the compounds in our previously processed dataset.

## Import libraries and previous data


In [43]:
import pandas as pd
import numpy as np
import plotly.express as px
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef

## Reminder: the datasets


As a reminder, we are using:

- **Morphological fingerprint**: Morphological fingerprint features represent the effect of a treatment measured on cells, in this case U2OS osteosarcoma cells.
- **Structural fingerprint**: Structural fingerprints are computer-readable vectors obtained from the SMILES. They encapsulate atom environment, connectivity, and substructure information.

We are now introducing:

- **Activity annotation**: Activity and inactivity annotations are obtained from agonist and antagonist assays. A set of molecules has been tested on LUC BG1 breast cancer cells to estimate [IC50?]. The values are then converted to a binary format: 1 (agonist or antagonist) / 0 (non-agonist, non-antagonist).



### Load data : per-treatment profile fingerprints and annotation file

- Load the fingerprint file _output_notebook_2.pkl_ generated in _Part2-Similarity_Analysis.ipynb_
- Load the endocrine activity file: _ER_activity_luc_bg1.csv_
- Join the data frames into one

1. Load the morphological data 

In [44]:
dataset = pd.read_pickle('../Data/Output/output_notebook_2.pkl')
dataset.head(2)

Unnamed: 0,Metadata_broad_sample,CPD_NAME,CPD_SMILES,Cells_AreaShape_Area,Cells_AreaShape_Center_X,Cells_AreaShape_Center_Y,Cells_AreaShape_Compactness,Cells_AreaShape_FormFactor,Cells_AreaShape_Orientation,Cells_AreaShape_Perimeter,...,Nuclei_Texture_SumAverage_RNA_10_0,Nuclei_Texture_SumEntropy_AGP_10_0,Nuclei_Texture_SumEntropy_DNA_10_0,Nuclei_Texture_SumEntropy_RNA_10_0,STD_smile,morganB_fps,euclidean_distance_morganB_fps,tanimoto_distance_morganB_fps,morphological_fingerprint,euclidean_distance_morphological_fingerprint
0,BRD-K08693008-001-01-9,BRD-K08693008,OC[C@@H]1O[C@@H](CCn2cc(nn2)C2CCCCC2)CC[C@@H]1...,0.048557,1.511709,0.671319,1.608394,-0.511233,0.565885,0.257311,...,0.586195,1.125321,1.009055,-1.088308,O=C(Nc1ccc(Cl)c(Cl)c1)N[C@H]1CC[C@H](CCn2cc(C3...,"[0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",8.944272,0.144928,"[0.04855710428510099, 1.5117089902809278, 0.67...",27.358629
1,BRD-K63982890-001-01-9,BRD-K63982890,OC[C@@H]1O[C@@H](CCn2cc(nn2)C2CCCCC2)CC[C@H]1N...,-0.4525,0.951801,0.099539,0.742502,-0.953422,-0.629503,0.145339,...,-0.662275,-0.373907,1.198932,-0.733997,O=C(Nc1ccc(Cl)c(Cl)c1)N[C@@H]1CC[C@H](CCn2cc(C...,"[0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",8.944272,0.144928,"[-0.4524995916327004, 0.951801364194127, 0.099...",27.667467


In [Part2-Similarity_Analysis](Part2-Similarity_Analysis), we computed Euclidean and Tanimoto distances. They are not needed here, so we drop them together with the two SMILES columns.

In [45]:
dataset.drop(['CPD_SMILES','STD_smile','euclidean_distance_morganB_fps', 'euclidean_distance_morphological_fingerprint',
             'tanimoto_distance_morganB_fps'], axis=1, inplace=True)

2. Endocrine activity (ER) dataset 

In this section on the annotation dataset, we will:
- Load the data
- Check the data for missing values
- Define two classes for estrogen receptor activity
- Compute the ratio of active to inactive compounds and show it as a bar plot

In [46]:
er_table = pd.read_csv(
    '../Data/Annotations/ER_activity_luc_bg1.csv', sep=',', index_col=0)
er_table.head(2)

Unnamed: 0,TOX21_ERa_LUC_BG1_Antagonist,TOX21_ERa_LUC_BG1_Agonist,CPD_NAME
0,1.0,0.0,mibefradil
1,0.0,0.0,fenbufen


In [47]:
print(f'There are {len(er_table["CPD_NAME"])} compounds tested for ER alpha activity')

There are 886 compounds tested for ER alpha activity


Check whether NaN (missing) values are present and, if so, drop the affected rows.

In [48]:
# Check if the sum of NaNs for each column is 0
if er_table.isna().sum().sum() == 0:
    print("There are no NaN values in er_table.")
else:
    print("There are NaN values in er_table : ")

er_table[er_table.isna().any(axis=1)]


There are NaN values in er_table : 


Unnamed: 0,TOX21_ERa_LUC_BG1_Antagonist,TOX21_ERa_LUC_BG1_Agonist,CPD_NAME
35,,,metoprolol tartrate
255,,,"2,4-dichlorophenoxyacetic acid"
281,,,ciprofloxacin
402,,,paraxanthine
632,,,tyrphostin AG-1478


In [49]:
# Drop the NaN value 
er_table.dropna(inplace=True)

## Activity class 

We want to mark any endocrine activity as 'active' (1), regardless of whether the compound is annotated as an antagonist or an agonist, and mark only compounds with no activity in either assay as 'inactive' (0).

1. Endocrine activity vs non-activity 

In [50]:
er_table['activity'] = er_table.apply(lambda x: min(int(x['TOX21_ERa_LUC_BG1_Antagonist']) + int(x['TOX21_ERa_LUC_BG1_Agonist']), 1), axis=1)
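The same label can also be derived without a row-wise `apply`. The following is a minimal sketch on a small hypothetical table (`toy`, with made-up values), assuming the two assay columns contain only 0 and 1 once the NaN rows are dropped. It also recomputes the row-wise version for comparison.

```python
import pandas as pd

# Hypothetical toy table mimicking the two assay columns (values are made up)
toy = pd.DataFrame({
    'TOX21_ERa_LUC_BG1_Antagonist': [1.0, 0.0, 0.0, 1.0],
    'TOX21_ERa_LUC_BG1_Agonist': [0.0, 0.0, 1.0, 1.0],
})

assay_cols = ['TOX21_ERa_LUC_BG1_Antagonist', 'TOX21_ERa_LUC_BG1_Agonist']

# A compound is active (1) if at least one assay flagged it, otherwise inactive (0)
toy['activity'] = (toy[assay_cols] == 1).any(axis=1).astype(int)

# Row-wise formulation, as used in the cell above, for comparison
row_wise = toy.apply(
    lambda x: min(int(x[assay_cols[0]]) + int(x[assay_cols[1]]), 1), axis=1)

print(toy['activity'].tolist())
print((toy['activity'] == row_wise).all())
```

Both formulations give the same labels on this toy table; the vectorised one avoids a Python-level loop over the rows.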

Here we can look at the active and inactive compounds separately.

In [51]:
tot_active = er_table[(er_table['TOX21_ERa_LUC_BG1_Antagonist'] != 0) | (er_table['TOX21_ERa_LUC_BG1_Agonist'] != 0)].shape[0]
print(f"There are {tot_active} active compounds, here are the first: ")
er_table[(er_table['TOX21_ERa_LUC_BG1_Antagonist'] != 0) | (er_table['TOX21_ERa_LUC_BG1_Agonist'] != 0)].head(3)

There are 212 active compounds, here are the first: 


Unnamed: 0,TOX21_ERa_LUC_BG1_Antagonist,TOX21_ERa_LUC_BG1_Agonist,CPD_NAME,activity
0,1.0,0.0,mibefradil,1
6,0.0,1.0,BRD-K03978601,1
7,1.0,0.0,mifepristone,1


In [52]:
tot_inactive = er_table[(er_table['TOX21_ERa_LUC_BG1_Antagonist'] == 0) & (er_table['TOX21_ERa_LUC_BG1_Agonist'] == 0)].shape[0]
print(f"There are {tot_inactive} inactive compounds, here are the first: ")
er_table[(er_table['TOX21_ERa_LUC_BG1_Antagonist'] == 0) & (er_table['TOX21_ERa_LUC_BG1_Agonist'] == 0)].head(3)

There are 669 inactive compounds, here are the first: 


Unnamed: 0,TOX21_ERa_LUC_BG1_Antagonist,TOX21_ERa_LUC_BG1_Agonist,CPD_NAME,activity
1,0.0,0.0,fenbufen,0
2,0.0,0.0,diethanolamine,0
3,0.0,0.0,BRD-K58885221,0


We can visualise the distribution of active and inactive compounds within the dataset.

In [53]:
# Map activity values to labels
activity_labels = {0: 'Inactive', 1: 'Active'}
er_table['activity_label'] = er_table['activity'].map(activity_labels)

# Counting the occurrences of each activity label
activity_counts = er_table['activity_label'].value_counts().reset_index()
activity_counts.columns = ['activity_label', 'count']

# Plotting the bar plot using Plotly
fig = px.bar(activity_counts, x='activity_label', y='count', 
             labels={'activity_label': 'Activity', 'count': 'Count'},
             title='Count of Activity Values',
             color='activity_label',
             color_discrete_map={'Inactive': 'purple', 'Active': 'green'})

fig.show()
print(f'{round((er_table["activity"].sum()/len(er_table))*100, 1)}% of the compounds are active.') 

24.1% of the compounds are active.
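With roughly a quarter of the compounds active, the two classes are imbalanced. The forests trained later in this notebook are created with `class_weight='balanced'`. As a rough illustration of what that heuristic does, the sketch below computes the balanced weights for the compound counts printed above (669 inactive, 212 active). The label vector `y_counts` is built only for this illustration, and the actual training rows will differ once the annotations are merged with the profiles.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative label vector built from the counts reported above
y_counts = np.array([0] * 669 + [1] * 212)

# 'balanced' assigns n_samples / (n_classes * count_of_class) to each class
weights = compute_class_weight(
    class_weight='balanced', classes=np.array([0, 1]), y=y_counts)

print(dict(zip(['inactive', 'active'], weights.round(3))))
```

The minority (active) class receives the larger weight, so misclassifying an active compound costs more during training.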


In [54]:
# Since the distinction between agonist and antagonist is not needed here, we drop the two assay columns
activity_only_df = er_table.drop(
    columns=['TOX21_ERa_LUC_BG1_Antagonist', 'TOX21_ERa_LUC_BG1_Agonist'])

We can also separate the activity class into agonism and antagonism, and train one model for each.

2. Agonism activity and Antagonism activity 

In [55]:
agonist = er_table[['TOX21_ERa_LUC_BG1_Agonist','CPD_NAME']]
antagonist = er_table[['TOX21_ERa_LUC_BG1_Antagonist','CPD_NAME']]

In [56]:
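# Cast the assay labels to int. Depending on the pandas version, assigning through
# .loc[:, col] can keep the original float dtype (the merged tables below still show 0.0/1.0).
agonist.loc[:, 'TOX21_ERa_LUC_BG1_Agonist'] = agonist['TOX21_ERa_LUC_BG1_Agonist'].astype('int')
antagonist.loc[:, 'TOX21_ERa_LUC_BG1_Antagonist'] = antagonist['TOX21_ERa_LUC_BG1_Antagonist'].astype('int')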



## Merging the annotation and morphological fingerprints

We will create three types of datasets:

- **Activity dataset:** Classifies molecules as active or non-active based on both agonism and antagonism assays. Non-active molecules did not show activity for ERα.

- **Agonism dataset:** Considers only the agonism assay results. Active molecules are agonists, and non-active molecules did not show agonist effects during the assay.

- **Antagonism dataset:** Considers only the antagonism assay results. Active molecules are antagonists, and non-active molecules did not show antagonist effects during the assay.


3. Merging the annotations with the morphological data 

In [57]:
activity_df = activity_only_df.join(dataset.set_index('CPD_NAME'), on='CPD_NAME')
activity_df.reset_index(inplace=True, drop=True)
activity_df.dropna(inplace=True)
activity_df.drop('activity_label',axis = 1,inplace=True)
activity_df.head(2)

Unnamed: 0,CPD_NAME,activity,Metadata_broad_sample,Cells_AreaShape_Area,Cells_AreaShape_Center_X,Cells_AreaShape_Center_Y,Cells_AreaShape_Compactness,Cells_AreaShape_FormFactor,Cells_AreaShape_Orientation,Cells_AreaShape_Perimeter,...,Nuclei_Texture_InverseDifferenceMoment_ER_10_0,Nuclei_Texture_InverseDifferenceMoment_RNA_3_0,Nuclei_Texture_SumAverage_AGP_10_0,Nuclei_Texture_SumAverage_DNA_10_0,Nuclei_Texture_SumAverage_RNA_10_0,Nuclei_Texture_SumEntropy_AGP_10_0,Nuclei_Texture_SumEntropy_DNA_10_0,Nuclei_Texture_SumEntropy_RNA_10_0,morganB_fps,morphological_fingerprint
0,mibefradil,1,BRD-K09549677-311-05-6,-1.522142,-0.167552,-0.057663,0.922271,1.969991,-0.140241,-2.420645,...,0.797366,0.126704,-1.622976,0.448551,-0.025718,-1.723061,-0.21309,-0.63833,"[0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[-1.5221420268136925, -0.167552301981941, -0.0..."
1,mibefradil,1,BRD-K09549677-300-01-8,0.307927,0.775698,0.480218,2.664921,1.42163,0.032674,-0.583083,...,-1.452899,-1.101514,-0.628421,0.754518,-0.195795,-2.022078,0.089506,-1.074223,"[0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.30792735741491867, 0.7756983423999834, 0.48..."


In [58]:
agonist_df = agonist.join(dataset.set_index('CPD_NAME'), on='CPD_NAME')
agonist_df.rename(columns={'TOX21_ERa_LUC_BG1_Agonist': 'activity'},inplace=True)
agonist_df.reset_index(inplace=True, drop=True)
agonist_df.dropna(inplace=True)
agonist_df.head(2)

Unnamed: 0,activity,CPD_NAME,Metadata_broad_sample,Cells_AreaShape_Area,Cells_AreaShape_Center_X,Cells_AreaShape_Center_Y,Cells_AreaShape_Compactness,Cells_AreaShape_FormFactor,Cells_AreaShape_Orientation,Cells_AreaShape_Perimeter,...,Nuclei_Texture_InverseDifferenceMoment_ER_10_0,Nuclei_Texture_InverseDifferenceMoment_RNA_3_0,Nuclei_Texture_SumAverage_AGP_10_0,Nuclei_Texture_SumAverage_DNA_10_0,Nuclei_Texture_SumAverage_RNA_10_0,Nuclei_Texture_SumEntropy_AGP_10_0,Nuclei_Texture_SumEntropy_DNA_10_0,Nuclei_Texture_SumEntropy_RNA_10_0,morganB_fps,morphological_fingerprint
0,0.0,mibefradil,BRD-K09549677-311-05-6,-1.522142,-0.167552,-0.057663,0.922271,1.969991,-0.140241,-2.420645,...,0.797366,0.126704,-1.622976,0.448551,-0.025718,-1.723061,-0.21309,-0.63833,"[0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[-1.5221420268136925, -0.167552301981941, -0.0..."
1,0.0,mibefradil,BRD-K09549677-300-01-8,0.307927,0.775698,0.480218,2.664921,1.42163,0.032674,-0.583083,...,-1.452899,-1.101514,-0.628421,0.754518,-0.195795,-2.022078,0.089506,-1.074223,"[0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.30792735741491867, 0.7756983423999834, 0.48..."


In [59]:
antagonist_df = antagonist.join(dataset.set_index('CPD_NAME'), on='CPD_NAME')
antagonist_df.rename(columns={'TOX21_ERa_LUC_BG1_Antagonist': 'activity'},inplace=True)
antagonist_df.reset_index(inplace=True, drop=True)
antagonist_df.dropna(inplace=True)
antagonist_df.head(2)

Unnamed: 0,activity,CPD_NAME,Metadata_broad_sample,Cells_AreaShape_Area,Cells_AreaShape_Center_X,Cells_AreaShape_Center_Y,Cells_AreaShape_Compactness,Cells_AreaShape_FormFactor,Cells_AreaShape_Orientation,Cells_AreaShape_Perimeter,...,Nuclei_Texture_InverseDifferenceMoment_ER_10_0,Nuclei_Texture_InverseDifferenceMoment_RNA_3_0,Nuclei_Texture_SumAverage_AGP_10_0,Nuclei_Texture_SumAverage_DNA_10_0,Nuclei_Texture_SumAverage_RNA_10_0,Nuclei_Texture_SumEntropy_AGP_10_0,Nuclei_Texture_SumEntropy_DNA_10_0,Nuclei_Texture_SumEntropy_RNA_10_0,morganB_fps,morphological_fingerprint
0,1.0,mibefradil,BRD-K09549677-311-05-6,-1.522142,-0.167552,-0.057663,0.922271,1.969991,-0.140241,-2.420645,...,0.797366,0.126704,-1.622976,0.448551,-0.025718,-1.723061,-0.21309,-0.63833,"[0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[-1.5221420268136925, -0.167552301981941, -0.0..."
1,1.0,mibefradil,BRD-K09549677-300-01-8,0.307927,0.775698,0.480218,2.664921,1.42163,0.032674,-0.583083,...,-1.452899,-1.101514,-0.628421,0.754518,-0.195795,-2.022078,0.089506,-1.074223,"[0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.30792735741491867, 0.7756983423999834, 0.48..."


As mentioned before, some treatments were tested as different stereoisomers: their Broad IDs differ although the compound names are identical. Therefore, we must make sure that all Broad IDs belonging to one compound name are contained in only one of the sets, to avoid data leakage.
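To see which compound names are affected, one can count the distinct Broad IDs per name. The sketch below uses a hypothetical miniature table (`toy_merged`, with made-up names and IDs) that has the same two columns as the merged data frames above.

```python
import pandas as pd

# Hypothetical miniature of a merged table (names and IDs are made up)
toy_merged = pd.DataFrame({
    'CPD_NAME': ['cpd_a', 'cpd_a', 'cpd_b', 'cpd_c', 'cpd_c'],
    'Metadata_broad_sample': ['BRD-1', 'BRD-2', 'BRD-3', 'BRD-4', 'BRD-4'],
})

# Number of distinct Broad IDs recorded under each compound name
ids_per_name = toy_merged.groupby('CPD_NAME')['Metadata_broad_sample'].nunique()

# Names with more than one Broad ID have to stay together in a split
multi_id_names = ids_per_name[ids_per_name > 1].index.tolist()
print(multi_id_names)
```

Applied to `activity_df`, the same `groupby` would list the names that need to be kept on one side of the split.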

## Splitting Data into Training and Test Sets

For ML, we always need a training set and a test set that are disjoint. <br>
To this end, we split the data into two parts: the training set contains 80% of the samples and the test set the remaining 20%. <br>
Moreover, we need a response vector y, i.e., the column with the activity information, and a feature matrix X, i.e., a matrix with the morphological or molecular fingerprints, or a combination thereof, for each compound. <br>
To avoid data leakage, all instances of a compound that appears several times in the dataset should be placed in the same set. We also want enough instances of the active class in both sets, so we perform a stratified split that keeps the ratio of active to inactive compounds constant across the sets.

1. Features and variable to predict (response vector)

In [60]:
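def train_test_name(data):
    '''
        stratified 80/20 split of the compound names and their activity labels

        Note: the split is done row by row, so a name that occurs on several rows
        of data can end up in both train_names and test_names.
    '''
    labels = data['activity'].values.astype('int')
    train_names, test_names, train_activities, test_activities = train_test_split(
        data['CPD_NAME'].values,
        labels,
        random_state=42,
        test_size=0.2,
        shuffle=True,
        stratify=labels)  # keep the active/inactive ratio in both sets
    return train_names, test_names, train_activities, test_activities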

In [61]:
train_names, test_names, train_activities, test_activities = train_test_name(activity_df)

In [62]:
if 'linoleic acid' in train_names:
    print('linoleic acid is in training set')
elif 'linoleic acid' in test_names:
    print('linoleic acid is in test set')

linoleic acid is in training set
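Because `train_test_split` works on rows, a compound name that occurs on several rows can be drawn into both `train_names` and `test_names`. One way to rule this out is to split on the unique names and to check the overlap afterwards. This is a minimal sketch on a hypothetical table (`toy_rows`), assuming every compound name carries a single activity label; it is not the split used for the results in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical merged table: 20 compound names, each occurring on two rows
toy_rows = pd.DataFrame({
    'CPD_NAME': [f'cpd_{i // 2}' for i in range(40)],
    'activity': [1 if (i // 2) % 4 == 0 else 0 for i in range(40)],
})

# Keep one row per name so that each compound is assigned to exactly one set
per_compound = toy_rows.drop_duplicates('CPD_NAME')

demo_train_names, demo_test_names = train_test_split(
    per_compound['CPD_NAME'].values,
    test_size=0.2,
    random_state=42,
    stratify=per_compound['activity'].values,  # keep the class ratio in both sets
)

# Sanity check: no compound name may appear in both sets
overlap = set(demo_train_names) & set(demo_test_names)
print(f'{len(overlap)} compound names shared between training and test set')
```

The resulting name lists can then be passed to a selection step based on `isin`, and every replicate of a compound follows its name into the same set.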


## Binary Classification

In [63]:
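def define_set(data, train_names, test_names):
    '''
        selects the training and test rows of a merged table by compound name

        @param data: DataFrame with 'CPD_NAME', 'activity', metadata, and feature columns
        @param train_names: compound names assigned to the training set
        @param test_names: compound names assigned to the test set
        @return: X_test_morph_fp, X_train_morph_fp, y_train, y_test, X_train, X_test (test features come first)
    '''
    X = data.copy(deep=True)
    X_train = X.loc[X['CPD_NAME'].isin(train_names), :].copy(deep=True)
    X_test = X.loc[X['CPD_NAME'].isin(test_names), :].copy(deep=True)

    y_train = X_train['activity'].values
    y_test = X_test['activity'].values

    # Keep only the numeric morphological feature columns
    cols_to_drop = ['morganB_fps', 'morphological_fingerprint', 'activity',
                    'Metadata_broad_sample', 'CPD_NAME']
    X_test_morph_fp = X_test.drop(cols_to_drop, axis=1)
    X_train_morph_fp = X_train.drop(cols_to_drop, axis=1)

    return X_test_morph_fp, X_train_morph_fp, y_train, y_test, X_train, X_test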

In [64]:
#X_test_morph_fp ,X_train_morph_fp , y_train , y_test,X_train ,X_test = define_set(agonist_df,train_names, test_names)

In [65]:
X_test_morph_fp ,X_train_morph_fp , y_train , y_test,X_train ,X_test = define_set(antagonist_df,train_names, test_names)

In [66]:
#X_test_morph_fp ,X_train_morph_fp , y_train , y_test,X_train ,X_test = define_set(activity_df,train_names, test_names)

## Cross Validation and Hyperparameter Tuning

In ML, we want to minimize a loss function, e.g., the mean-squared error (MSE). In particular, we want to fit the training data so as to minimize the loss on the training samples while still having a small error on unseen test samples. Moreover, there are hyperparameters defining the hypothesis space of the model, which can be tuned to minimize the estimated test error. The test error can be estimated using k-fold cross-validation (CV), which we perform below with k = 5, scoring each candidate by balanced accuracy.

For an RF, there are three hyperparameters that we want to tune in order to optimize the estimated test error. Each one is tuned separately, with the other two left at their scikit-learn defaults:
* depth of the trees in the forest: 10, 20, or 30
* number of trees in the forest: 100, 300, or 500
* minimum number of samples per leaf: 10, 15, or 20

In [67]:
def rf_cross_validation(X_train, y_train, max_depth_range=[10, 20, 30], num_tree_range=[100, 300, 500], min_samples_leaf_range=[10, 15, 20]):
    '''
        performs a 5-fold CV for a Random Forest for given X_train and y_train

        @param X_train: the training matrix
        @param y_train: the associated response vector
        @param max_depth_range: list containing the values that should be tested for max depth, default [10,20,30]
        @param num_tree_range: list containing the values that should be tested for the number of trees, default [100,300,500]
        @param min_samples_leaf_range: list containing the values that should be tested for the minimum number of samples per leaf, default [10,15,20]

        @return: a forest with the best hyperparameters according to the cross-validated balanced accuracy, trained on the whole training set
    '''
    best_score = -float('inf')
    for depth in max_depth_range:
        cv_results = cross_validate(RandomForestClassifier(random_state=42, max_depth=depth, n_jobs=-1,
                                    class_weight='balanced'), X=X_train, y=y_train, scoring='balanced_accuracy', cv=5)
        score = np.mean(cv_results['test_score'])
        if score > best_score:
            best_score = score
            best_depth = depth

    best_score = -float('inf')
    for n_tree in num_tree_range:
        cv_results = cross_validate(RandomForestClassifier(random_state=42, n_estimators=n_tree,
                                    n_jobs=-1, class_weight='balanced'), X=X_train, y=y_train, scoring='balanced_accuracy', cv=5)
        score = np.mean(cv_results['test_score'])
        if score > best_score:
            best_score = score
            best_n_tree = n_tree

    best_score = -float('inf')
    for num_samples in min_samples_leaf_range:
        cv_results = cross_validate(RandomForestClassifier(random_state=42, min_samples_leaf=num_samples,
                                    n_jobs=-1, class_weight='balanced'), X=X_train, y=y_train, scoring='balanced_accuracy', cv=5)
        score = np.mean(cv_results['test_score'])
        if score > best_score:
            best_score = score
            best_min_samples = num_samples

    rf = RandomForestClassifier(random_state=42, n_estimators=best_n_tree, max_depth=best_depth,
                                min_samples_leaf=best_min_samples, n_jobs=-1, class_weight='balanced')
    rf.fit(X_train, y_train)
    
    return rf
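The function above tunes one hyperparameter at a time. A joint search over all combinations is a different technique, and scikit-learn's `GridSearchCV` provides it. The sketch below shows it on synthetic data (`X_demo`, `y_demo` from `make_classification`, standing in for the notebook's training matrix) with a deliberately reduced grid so that it runs quickly. It is an illustration, not the procedure used for the results reported here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for the training data (not the notebook data)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, weights=[0.75, 0.25], random_state=42)

# Reduced grid to keep the sketch fast; the ranges listed above could be used instead
param_grid = {
    'max_depth': [10, 20],
    'n_estimators': [50, 100],
    'min_samples_leaf': [10, 20],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42, class_weight='balanced'),
    param_grid=param_grid,
    scoring='balanced_accuracy',
    cv=5,
    refit=True,  # retrain the best combination on all data passed to fit
)
search.fit(X_demo, y_demo)

print(search.best_params_)
rf_demo = search.best_estimator_
```

A full grid over the three ranges above means 27 combinations times 5 folds, so it costs more than the 9 separate runs of the one-at-a-time search, but it can find interactions between the hyperparameters.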

In [68]:
best_estimator = rf_cross_validation(X_train_morph_fp, y_train)

Now, we print the hyperparameters of the forest, to see what performed best in the CV.

In [69]:
param_dict = best_estimator.get_params()
print(
    f'We use a max depth of {param_dict["max_depth"]}, {param_dict["n_estimators"]} trees, and at least {param_dict["min_samples_leaf"]} samples per leaf.')

We use a max depth of 20, 100 trees, and at least 20 samples per leaf.


## Prediction on Test Set 

Now we predict on the test data, i.e., samples that were not seen during training, to estimate the test error.

In [70]:
random_forest = best_estimator
y_pred = random_forest.predict(X_test_morph_fp)

Then we calculate the Matthews correlation coefficient (MCC) of our test set predictions.

In [71]:
print(
    f'The RF trained on morphological fingerprints reached an MCC of {matthews_corrcoef(y_pred=y_pred, y_true=y_test)}')

The RF trained on morphological fingerprints reached an MCC of 0.6167646926321844
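For a binary problem, the MCC can be written in terms of the four confusion-matrix counts. The sketch below recomputes it by hand on made-up labels (`y_true_demo`, `y_pred_demo`, both hypothetical) and compares the result with `matthews_corrcoef`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, matthews_corrcoef

# Made-up labels and predictions, for illustration only
y_true_demo = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 1])
y_pred_demo = np.array([1, 1, 0, 0, 0, 0, 0, 1, 0, 1])

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

# MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
numerator = tp * tn - fp * fn
denominator = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
mcc_manual = numerator / denominator

print(mcc_manual, matthews_corrcoef(y_true_demo, y_pred_demo))
```

The MCC ranges from -1 to 1, with 0 corresponding to a prediction no better than chance, which makes it a reasonable single-number summary for imbalanced classes.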


## Using Structural (Morgan) Fingerprints

1. Features and variable to predict (response vector)

In [72]:

X_test_morgan_fp = X_test['morganB_fps'].copy(deep=True).values
X_train_morgan_fp = X_train['morganB_fps'].copy(deep=True).values

In [73]:
def convert_list_features_to_numpy(x):
    '''
        converts the features that are stored in an array containing lists to an array of arrays such that shape works

        @param x: array containing lists 
        @return x_new: array of arrays of ints
    '''
    new_x = []
    for element in x:
        new_x.append(np.array(element))
    return np.array(new_x)

In [74]:
X_test_morgan_fp = convert_list_features_to_numpy(X_test_morgan_fp)
X_train_morgan_fp = convert_list_features_to_numpy(X_train_morgan_fp)
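The conversion is needed because a DataFrame column of lists is exposed as a one-dimensional object array, whereas scikit-learn expects a two-dimensional numeric matrix. A small sketch with a hypothetical column of bit vectors (`fps_demo`, values made up) shows the difference in shape.

```python
import numpy as np
import pandas as pd

# Hypothetical column of bit-vector lists, similar in kind to 'morganB_fps'
fps_demo = pd.Series([[0, 1, 0, 1], [1, 1, 0, 0], [0, 0, 0, 1]])

# The raw values form a 1-D object array holding three Python lists
print(fps_demo.values.shape)

# Converting the list of lists gives a 2-D integer matrix (samples x bits)
X_fp_demo = np.array(fps_demo.tolist())
print(X_fp_demo.shape, X_fp_demo.dtype.kind)
```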

In [75]:
best_estimator = rf_cross_validation(X_train_morgan_fp, y_train)

In [76]:
param_dict = best_estimator.get_params()
print(
    f'We use a max depth of {param_dict["max_depth"]}, {param_dict["n_estimators"]} trees, and at least {param_dict["min_samples_leaf"]} samples per leaf.')

We use a max depth of 20, 100 trees, and at least 15 samples per leaf.


### Prediction on Test Set

In [77]:
y_pred = best_estimator.predict(X_test_morgan_fp)

In [78]:
print(
    f'The RF trained on Morgan fingerprints reached an MCC of {matthews_corrcoef(y_pred=y_pred, y_true=y_test)}')

The RF trained on Morgan fingerprints reached an MCC of 0.45776534854849266


## Using both Structural (Morgan) and Morphological Fingerprints

In [79]:
def append_concat_molecular_and_morphological_FP(x):
    '''
        creates a new column 'appended_profile' containing the joint morphological and Morgan fingerprints

        @param x: pandas DataFrame with columns 'morphological_fingerprint' and 'morganB_fps'
        @return x with new column 'appended_profile' 
    '''
    x['appended_profile'] = x.apply(lambda x: np.concatenate(
        [x['morphological_fingerprint'], x['morganB_fps']]), axis=1)
    return x

In [80]:
X_test_morph_morgan_fp = append_concat_molecular_and_morphological_FP(
    X_test.copy(deep=True))
X_train_morph_morgan_fp = append_concat_molecular_and_morphological_FP(
    X_train.copy(deep=True))

In [81]:
X_train_morph_morgan_fp = convert_list_features_to_numpy(
    X_train_morph_morgan_fp['appended_profile'].values)
X_test_morph_morgan_fp = convert_list_features_to_numpy(
    X_test_morph_morgan_fp['appended_profile'].values)

In [82]:
rf_combined_morgan_morph = rf_cross_validation(
    X_train=X_train_morph_morgan_fp, y_train=y_train)

In [83]:
y_pred_combi = rf_combined_morgan_morph.predict(X_test_morph_morgan_fp)

In [84]:
print(
    f'The RF trained on both morphological and Morgan fingerprints reached an MCC of {matthews_corrcoef(y_pred=y_pred_combi, y_true=y_test)}')

The RF trained on both morphological and Morgan fingerprints reached an MCC of 0.5315812566841983


## Conclusion

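On the antagonism dataset used above, the model trained on morphological fingerprints alone reached the best (highest) MCC (0.62), ahead of the model trained on both fingerprints (0.53) and the model trained on Morgan fingerprints alone (0.46).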

## Multiclass Classification

1. Multiclass problem 

We created four classes to later train a multiclass classifier based on the two assays available, which are:
- Agonist
- Antagonist
- Both agonist and antagonist
- No activity

In [85]:
def convert_to_four_classes(x):
    ''' 
        assigns one of the four classes used in the multiclass setting

        @param x: row in dataframe
        @return: class as int (0: not active, 1: agonist, 2: antagonist, 3: both)
    '''
    if (x['TOX21_ERa_LUC_BG1_Antagonist'] == 1) and (x['TOX21_ERa_LUC_BG1_Agonist'] == 1):
        return 3
    elif x['TOX21_ERa_LUC_BG1_Antagonist'] == 1:
        return 2
    elif x['TOX21_ERa_LUC_BG1_Agonist'] == 1:
        return 1
    else:
        return 0
    

class_encoding_dict = {3: 'both', 2: 'antagonist',
                       1: 'agonist', 0: 'not active'}


er_table['four_class_response'] = er_table.apply(
    lambda x: convert_to_four_classes(x), axis=1)


er_table['activity'] = er_table.apply(lambda x: min(int(
    x['TOX21_ERa_LUC_BG1_Antagonist']) + int(x['TOX21_ERa_LUC_BG1_Agonist']), 1), axis=1)


# Map the four_class_response values to their respective class names
er_table['class_label'] = er_table['four_class_response'].map(class_encoding_dict)

# Count the occurrences of each class label
class_counts = er_table['class_label'].value_counts().reset_index()
class_counts.columns = ['class_label', 'count']

# Plotting the bar plot using Plotly
fig = px.bar(class_counts, x='class_label', y='count', 
             labels={'class_label': 'Class Label', 'count': 'Count'},
             title='Value Count of Four Class Response',
             color='class_label',
             color_discrete_sequence=px.colors.qualitative.Safe)

fig.show()
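The four-class encoding can also be obtained arithmetically: if the agonist flag counts 1 and the antagonist flag counts 2, their sum is exactly the class index used above. The sketch below checks this on a hypothetical table (`toy_assays`, values made up) covering all four combinations, assuming the flags contain only 0 and 1.

```python
import pandas as pd

# Hypothetical assay table covering all four combinations (values made up)
toy_assays = pd.DataFrame({
    'TOX21_ERa_LUC_BG1_Antagonist': [0.0, 0.0, 1.0, 1.0],
    'TOX21_ERa_LUC_BG1_Agonist': [0.0, 1.0, 0.0, 1.0],
})

# Agonist contributes 1, antagonist contributes 2:
# 0 = not active, 1 = agonist, 2 = antagonist, 3 = both
toy_assays['four_class_response'] = (
    toy_assays['TOX21_ERa_LUC_BG1_Agonist'].astype(int)
    + 2 * toy_assays['TOX21_ERa_LUC_BG1_Antagonist'].astype(int)
)

class_names = {3: 'both', 2: 'antagonist', 1: 'agonist', 0: 'not active'}
toy_assays['class_label'] = toy_assays['four_class_response'].map(class_names)
print(toy_assays[['four_class_response', 'class_label']])
```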

In [86]:
combinded_activity_df = er_table.join(dataset.set_index('CPD_NAME'), on='CPD_NAME')
combinded_activity_df.drop(['activity','activity_label','class_label','TOX21_ERa_LUC_BG1_Antagonist','TOX21_ERa_LUC_BG1_Agonist'],axis=1,inplace=True)
combinded_activity_df.rename(columns={'four_class_response': 'activity'},inplace=True)
combinded_activity_df.reset_index(inplace=True, drop=True)
combinded_activity_df.dropna(inplace=True)
combinded_activity_df.head(2)

Unnamed: 0,CPD_NAME,activity,Metadata_broad_sample,Cells_AreaShape_Area,Cells_AreaShape_Center_X,Cells_AreaShape_Center_Y,Cells_AreaShape_Compactness,Cells_AreaShape_FormFactor,Cells_AreaShape_Orientation,Cells_AreaShape_Perimeter,...,Nuclei_Texture_InverseDifferenceMoment_ER_10_0,Nuclei_Texture_InverseDifferenceMoment_RNA_3_0,Nuclei_Texture_SumAverage_AGP_10_0,Nuclei_Texture_SumAverage_DNA_10_0,Nuclei_Texture_SumAverage_RNA_10_0,Nuclei_Texture_SumEntropy_AGP_10_0,Nuclei_Texture_SumEntropy_DNA_10_0,Nuclei_Texture_SumEntropy_RNA_10_0,morganB_fps,morphological_fingerprint
0,mibefradil,2,BRD-K09549677-311-05-6,-1.522142,-0.167552,-0.057663,0.922271,1.969991,-0.140241,-2.420645,...,0.797366,0.126704,-1.622976,0.448551,-0.025718,-1.723061,-0.21309,-0.63833,"[0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[-1.5221420268136925, -0.167552301981941, -0.0..."
1,mibefradil,2,BRD-K09549677-300-01-8,0.307927,0.775698,0.480218,2.664921,1.42163,0.032674,-0.583083,...,-1.452899,-1.101514,-0.628421,0.754518,-0.195795,-2.022078,0.089506,-1.074223,"[0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.30792735741491867, 0.7756983423999834, 0.48..."


2. We build a model using the 4 classes 

In [87]:
# Same stratified 80/20 split as before, reusing the helper defined above
train_names, test_names, train_activities, test_activities = train_test_name(combinded_activity_df)

In [88]:
X_test_morph_fp ,X_train_morph_fp , y_train , y_test , X_train ,X_test = define_set(combinded_activity_df,train_names, test_names)

3. Training multiclass model

In [89]:
best_estimator = rf_cross_validation(X_train_morph_fp, y_train)

4. Predicting multiclass model 

In [90]:
y_pred = best_estimator.predict(X_test_morph_fp)

In [91]:
print(
    f'The multiclass RF trained on morphological fingerprints reached an MCC of {matthews_corrcoef(y_pred=y_pred, y_true=y_test)}')

The multiclass RF trained on morphological fingerprints reached an MCC of 0.3995125347737898
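A single MCC value hides which of the four classes are confused with each other. A confusion matrix and a per-class report make this visible. The sketch below uses made-up labels and predictions (`y_true_mc`, `y_pred_mc`, both hypothetical) rather than the notebook's results.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Made-up four-class labels and predictions
# (0: not active, 1: agonist, 2: antagonist, 3: both)
y_true_mc = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 3])
y_pred_mc = np.array([0, 0, 0, 2, 1, 0, 2, 2, 0, 2])

labels = [0, 1, 2, 3]
names = ['not active', 'agonist', 'antagonist', 'both']

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_true_mc, y_pred_mc, labels=labels)
print(cm)

# Per-class precision and recall; zero_division=0 silences the warning
# for classes that are never predicted
print(classification_report(
    y_true_mc, y_pred_mc, labels=labels, target_names=names, zero_division=0))
```

With the notebook's own `y_test` and `y_pred`, the same two calls would show how the rarer classes fare compared with the 'not active' majority.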


To do LM (biologist proof):
- DMSO as control (known not to be toxic) + 1 molecule known to be toxic (chemotherapeutic)
- a concrete example with 2 molecules 