# Machine learning 

In this part, we use machine learning (ML) models, specifically random forests (RF), to predict the estrogen receptor activity of the compounds in our previously processed dataset.

## Import libraries 

In [67]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef
from sklearn.metrics import balanced_accuracy_score, classification_report

## Reminder: the datasets


As a reminder, we are using:

- **Morphological fingerprint**: Morphological fingerprint features represent the effect of a treatment measured on cells, in this case U2OS cells, an osteosarcoma cell line.
- **Structural fingerprint**: Structural fingerprints are computer-readable vectors obtained from the SMILES. They encapsulate atom environment, connectivity, and substructure information.

We are now introducing:

- **Activity annotation**: Activity and inactivity annotations are obtained from agonist and antagonist assays. A set of molecules was tested on LUC BG1 breast cancer cells to estimate the AC50 value, i.e., the concentration at half-maximal activity, derived from a Hill equation model. The ToxCast results (agonist/antagonist) are provided in binary format: 1 (agonist or antagonist) / 0 (non-agonist or non-antagonist).
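To make the AC50 definition concrete, here is a minimal sketch of a Hill curve. The function name `hill_response` and all parameter values are illustrative assumptions, not the fitting procedure used to produce the annotations. It only shows that the response at a concentration equal to the AC50 is half of the maximal response.

```python
def hill_response(conc, top, ac50, hill_coef):
    """Hill curve: response at concentration `conc` (same units as `ac50`)."""
    return top / (1.0 + (ac50 / conc) ** hill_coef)

# Illustrative values only (not taken from the assay data)
top, ac50, hill_coef = 100.0, 2.5, 1.3

half_max = hill_response(ac50, top, ac50, hill_coef)
near_plateau = hill_response(1000 * ac50, top, ac50, hill_coef)
print(half_max, near_plateau)
```

At `conc == ac50` the ratio `ac50 / conc` equals 1, so the denominator is 2 whatever the Hill coefficient is.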



### Load data: per-treatment profile fingerprints and annotation file

1. Load the morphological data 

In [68]:
dataset = pd.read_pickle('../Data/Output/output_notebook_2.pkl')
dataset.head(2)

Unnamed: 0,Metadata_broad_sample,CPD_NAME,CPD_SMILES,Cells_AreaShape_Area,Cells_AreaShape_Center_X,Cells_AreaShape_Center_Y,Cells_AreaShape_Compactness,Cells_AreaShape_FormFactor,Cells_AreaShape_Orientation,Cells_AreaShape_Perimeter,...,Nuclei_Texture_SumAverage_RNA_10_0,Nuclei_Texture_SumEntropy_AGP_10_0,Nuclei_Texture_SumEntropy_DNA_10_0,Nuclei_Texture_SumEntropy_RNA_10_0,STD_smile,morganB_fps,euclidean_distance_morganB_fps,tanimoto_distance_morganB_fps,morphological_fingerprint,euclidean_distance_morphological_fingerprint
0,BRD-K08693008-001-01-9,BRD-K08693008,OC[C@@H]1O[C@@H](CCn2cc(nn2)C2CCCCC2)CC[C@@H]1...,0.048557,1.511709,0.671319,1.608394,-0.511233,0.565885,0.257311,...,0.586195,1.125321,1.009055,-1.088308,O=C(Nc1ccc(Cl)c(Cl)c1)N[C@H]1CC[C@H](CCn2cc(C3...,"[0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",8.944272,0.144928,"[0.04855710428510099, 1.5117089902809278, 0.67...",27.358629
1,BRD-K63982890-001-01-9,BRD-K63982890,OC[C@@H]1O[C@@H](CCn2cc(nn2)C2CCCCC2)CC[C@H]1N...,-0.4525,0.951801,0.099539,0.742502,-0.953422,-0.629503,0.145339,...,-0.662275,-0.373907,1.198932,-0.733997,O=C(Nc1ccc(Cl)c(Cl)c1)N[C@@H]1CC[C@H](CCn2cc(C...,"[0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",8.944272,0.144928,"[-0.4524995916327004, 0.951801364194127, 0.099...",27.667467


In [Part2-Similarity_Analysis](Part2-Similarity_Analysis.ipynb), we computed the Euclidean and Tanimoto distances. They are not needed here, so we drop them together with the SMILES columns.

In [69]:
dataset.drop(['CPD_SMILES','STD_smile','euclidean_distance_morganB_fps', 'euclidean_distance_morphological_fingerprint', 'tanimoto_distance_morganB_fps'], axis = 1, inplace=True)

2. Endocrine activity (ER) dataset 

In this section, we will:
- Load the annotation data
- Check the data for missing values
- Determine the ratio of active to inactive compounds


In [70]:
er_table = pd.read_csv('../Data/Annotations/ER_activity_luc_bg1.csv', sep = ',', index_col=0)
print(f'There are {len(er_table["CPD_NAME"])} compounds tested for ER alpha activity')
er_table.head(2)

There are 886 compounds tested for ER alpha activity


Unnamed: 0,TOX21_ERa_LUC_BG1_Antagonist,TOX21_ERa_LUC_BG1_Agonist,CPD_NAME
0,1.0,0.0,mibefradil
1,0.0,0.0,fenbufen


Check how many missing values (`NaN`) there are.

In [71]:

if er_table.isna().sum().sum() == 0:
    print("There are no NaN values in er_table.")
else:
    print("There are NaN values in er_table : ")

er_table[er_table.isna().any(axis=1)]


There are NaN values in er_table : 


Unnamed: 0,TOX21_ERa_LUC_BG1_Antagonist,TOX21_ERa_LUC_BG1_Agonist,CPD_NAME
35,,,metoprolol tartrate
255,,,"2,4-dichlorophenoxyacetic acid"
281,,,ciprofloxacin
402,,,paraxanthine
632,,,tyrphostin AG-1478


In [72]:
# Drop the NaN value 
er_table.dropna(inplace=True)

## Activity classes 

We want to make a distinction between the activity classes, i.e., agonism and antagonism, and only train a model for the antagonist class. Note that it is possible to train a model for a multiclass problem, where the model could distinguish between agonist, antagonist, and no activity. However, to keep the task simple, we will only build a model for a binary problem: antagonist vs. non-antagonist.

NB: We prepare the code to train binary classifiers for the different activity types, so you can experiment with it.

1. Agonism activity and Antagonism activity 

In [73]:
agonist = er_table[['TOX21_ERa_LUC_BG1_Agonist','CPD_NAME']].copy(deep = True)
agonist['TOX21_ERa_LUC_BG1_Agonist'] = agonist['TOX21_ERa_LUC_BG1_Agonist'].astype('int')

antagonist = er_table[['TOX21_ERa_LUC_BG1_Antagonist','CPD_NAME']].copy(deep = True)
antagonist['TOX21_ERa_LUC_BG1_Antagonist'] = antagonist['TOX21_ERa_LUC_BG1_Antagonist'].astype('int')

We also want to mark any endocrine activity, regardless of whether the compound is annotated as an antagonist or an agonist, as 'active' (1), and mark only the compounds with no activity as 'inactive' (0).

2. Estrogen receptor activity vs non-activity 

In [74]:
er_table['activity'] = er_table.apply(lambda x: min(int(x['TOX21_ERa_LUC_BG1_Antagonist']) + int(x['TOX21_ERa_LUC_BG1_Agonist']), 1), axis=1)

In [75]:
activity_only_df = er_table.drop(columns=['TOX21_ERa_LUC_BG1_Antagonist', 'TOX21_ERa_LUC_BG1_Agonist'])
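The labelling rule can be checked on a toy table. The sketch below is a vectorized equivalent of the row-wise `apply`, under the assumption that both assay columns hold only 0 or 1. The compound names are made up.

```python
import pandas as pd

# Toy annotation table (made-up compounds, not the real data)
toy = pd.DataFrame({
    "TOX21_ERa_LUC_BG1_Antagonist": [1.0, 0.0, 0.0, 1.0],
    "TOX21_ERa_LUC_BG1_Agonist":    [0.0, 0.0, 1.0, 1.0],
    "CPD_NAME": ["cpd_a", "cpd_b", "cpd_c", "cpd_d"],
})

assay_cols = ["TOX21_ERa_LUC_BG1_Antagonist", "TOX21_ERa_LUC_BG1_Agonist"]
# A compound is 'active' (1) as soon as at least one of the two assays is positive
toy["activity"] = (toy[assay_cols].sum(axis=1) > 0).astype(int)
print(toy[["CPD_NAME", "activity"]])
```

Only `cpd_b`, which is negative in both assays, ends up labelled inactive.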

## Merging the annotation and morphological fingerprint 

We will create three types of datasets:

- **Activity dataset:** Classifies molecules as active or non-active based on both agonism and antagonism assays. Non-active molecules did not show activity for ERα.

- **Agonism dataset:** Considers only the agonism assay results. Active molecules are agonists, and non-active molecules did not show agonist effects during the assay.

- **Antagonism dataset:** Considers only the antagonism assay results. Active molecules are antagonists, and non-active molecules did not show antagonist effects during the assay.

1. Merging the annotation to the morphological data 

In [76]:
activity_df = activity_only_df.join(dataset.set_index('CPD_NAME'), on='CPD_NAME')
activity_df.reset_index(inplace=True, drop=True)
activity_df.dropna(inplace=True)

duplicated_activity_df = activity_df.duplicated(subset='Metadata_broad_sample', keep='first')
unique_activity_df = activity_df[~duplicated_activity_df]
print(f'{round((unique_activity_df["activity"].sum()/len(unique_activity_df))*100, 1)}% of the compounds are active.') 
unique_activity_df.head(2)


23.0% of the compounds are active.


Unnamed: 0,CPD_NAME,activity,Metadata_broad_sample,Cells_AreaShape_Area,Cells_AreaShape_Center_X,Cells_AreaShape_Center_Y,Cells_AreaShape_Compactness,Cells_AreaShape_FormFactor,Cells_AreaShape_Orientation,Cells_AreaShape_Perimeter,...,Nuclei_Texture_InverseDifferenceMoment_ER_10_0,Nuclei_Texture_InverseDifferenceMoment_RNA_3_0,Nuclei_Texture_SumAverage_AGP_10_0,Nuclei_Texture_SumAverage_DNA_10_0,Nuclei_Texture_SumAverage_RNA_10_0,Nuclei_Texture_SumEntropy_AGP_10_0,Nuclei_Texture_SumEntropy_DNA_10_0,Nuclei_Texture_SumEntropy_RNA_10_0,morganB_fps,morphological_fingerprint
0,mibefradil,1,BRD-K09549677-311-05-6,-1.522142,-0.167552,-0.057663,0.922271,1.969991,-0.140241,-2.420645,...,0.797366,0.126704,-1.622976,0.448551,-0.025718,-1.723061,-0.21309,-0.63833,"[0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[-1.5221420268136925, -0.167552301981941, -0.0..."
1,mibefradil,1,BRD-K09549677-300-01-8,0.307927,0.775698,0.480218,2.664921,1.42163,0.032674,-0.583083,...,-1.452899,-1.101514,-0.628421,0.754518,-0.195795,-2.022078,0.089506,-1.074223,"[0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.30792735741491867, 0.7756983423999834, 0.48..."


In [77]:
agonist_df = agonist.join(dataset.set_index('CPD_NAME'), on='CPD_NAME')
agonist_df.rename(columns={'TOX21_ERa_LUC_BG1_Agonist': 'activity'},inplace=True)
agonist_df.reset_index(inplace=True, drop=True)
agonist_df.dropna(inplace=True)
duplicated_agonist_df = agonist_df.duplicated(subset='Metadata_broad_sample', keep='first')
unique_agonist_df = agonist_df[~duplicated_agonist_df]
print(f'{round((unique_agonist_df["activity"].sum()/len(unique_agonist_df))*100, 1)}% of the compounds are agonist.') 
unique_agonist_df.head(2)

14.3% of the compounds are agonist.


Unnamed: 0,activity,CPD_NAME,Metadata_broad_sample,Cells_AreaShape_Area,Cells_AreaShape_Center_X,Cells_AreaShape_Center_Y,Cells_AreaShape_Compactness,Cells_AreaShape_FormFactor,Cells_AreaShape_Orientation,Cells_AreaShape_Perimeter,...,Nuclei_Texture_InverseDifferenceMoment_ER_10_0,Nuclei_Texture_InverseDifferenceMoment_RNA_3_0,Nuclei_Texture_SumAverage_AGP_10_0,Nuclei_Texture_SumAverage_DNA_10_0,Nuclei_Texture_SumAverage_RNA_10_0,Nuclei_Texture_SumEntropy_AGP_10_0,Nuclei_Texture_SumEntropy_DNA_10_0,Nuclei_Texture_SumEntropy_RNA_10_0,morganB_fps,morphological_fingerprint
0,0,mibefradil,BRD-K09549677-311-05-6,-1.522142,-0.167552,-0.057663,0.922271,1.969991,-0.140241,-2.420645,...,0.797366,0.126704,-1.622976,0.448551,-0.025718,-1.723061,-0.21309,-0.63833,"[0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[-1.5221420268136925, -0.167552301981941, -0.0..."
1,0,mibefradil,BRD-K09549677-300-01-8,0.307927,0.775698,0.480218,2.664921,1.42163,0.032674,-0.583083,...,-1.452899,-1.101514,-0.628421,0.754518,-0.195795,-2.022078,0.089506,-1.074223,"[0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.30792735741491867, 0.7756983423999834, 0.48..."


In [78]:
antagonist_df = antagonist.join(dataset.set_index('CPD_NAME'), on='CPD_NAME')
antagonist_df.rename(columns={'TOX21_ERa_LUC_BG1_Antagonist': 'activity'},inplace=True)
antagonist_df.reset_index(inplace=True, drop=True)
antagonist_df.dropna(inplace=True)
duplicated_antagonist_df = antagonist_df.duplicated(subset='Metadata_broad_sample', keep='first')
unique_antagonist_df = antagonist_df[~duplicated_antagonist_df]
print(f'{round((unique_antagonist_df["activity"].sum()/len(unique_antagonist_df))*100, 1)}% of the compounds are antagonist.') 
unique_antagonist_df.head(2)

10.4% of the compounds are antagonist.


Unnamed: 0,activity,CPD_NAME,Metadata_broad_sample,Cells_AreaShape_Area,Cells_AreaShape_Center_X,Cells_AreaShape_Center_Y,Cells_AreaShape_Compactness,Cells_AreaShape_FormFactor,Cells_AreaShape_Orientation,Cells_AreaShape_Perimeter,...,Nuclei_Texture_InverseDifferenceMoment_ER_10_0,Nuclei_Texture_InverseDifferenceMoment_RNA_3_0,Nuclei_Texture_SumAverage_AGP_10_0,Nuclei_Texture_SumAverage_DNA_10_0,Nuclei_Texture_SumAverage_RNA_10_0,Nuclei_Texture_SumEntropy_AGP_10_0,Nuclei_Texture_SumEntropy_DNA_10_0,Nuclei_Texture_SumEntropy_RNA_10_0,morganB_fps,morphological_fingerprint
0,1,mibefradil,BRD-K09549677-311-05-6,-1.522142,-0.167552,-0.057663,0.922271,1.969991,-0.140241,-2.420645,...,0.797366,0.126704,-1.622976,0.448551,-0.025718,-1.723061,-0.21309,-0.63833,"[0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[-1.5221420268136925, -0.167552301981941, -0.0..."
1,1,mibefradil,BRD-K09549677-300-01-8,0.307927,0.775698,0.480218,2.664921,1.42163,0.032674,-0.583083,...,-1.452899,-1.101514,-0.628421,0.754518,-0.195795,-2.022078,0.089506,-1.074223,"[0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.30792735741491867, 0.7756983423999834, 0.48..."


As mentioned before, some treatments were tested as different stereoisomers: their Broad IDs differ, although the compound names are identical. Therefore, we must make sure that all Broad IDs of a given compound end up in only one of the sets, to avoid data leakage.

## Splitting Data into Training and Test Sets

For ML, we always need a training set and a test set that are disjoint. <br>
To this end, we split the data into two parts: the training set contains 80% of the samples and the test set the remaining 20%. The training set is used for training, evaluating, and optimizing the model, while the test set is held out to assess the predictions of the final model.<br>

Moreover, we need a response vector y, i.e., the column with the activity information and a feature matrix X, i.e., a matrix with the morphological or molecular fingerprints or a combination thereof, for each compound. <br>

To avoid data leakage, we need to ensure that all instances of a compound that appears several times in the dataset are placed in the same set. Moreover, we want enough instances of the active class in both sets. Therefore, we perform a stratified split, which keeps the ratio of active to inactive compounds constant across the sets.

![Cross-validation and hyperparameter Tuning](../Images/CV.png)


1. Features and variables to predict 

In [79]:
def stratified_split(data):
    X_train, X_test, y_train_activities, y_test_activities= train_test_split(
        data.iloc[:,3:], 
        data['activity'].values.astype('int'), 
        random_state=42, 
        test_size=0.2, 
        shuffle=True,
        stratify=data['activity'].values.astype('int')) #stratify split 
    return X_train, X_test, y_train_activities, y_test_activities

In [80]:
X_train, X_test, y_train_activities, y_test_activities = stratified_split(unique_antagonist_df) #unique_antagonist_df , unique_agonist_df

A. For morphological features

In [81]:
X_train_morpho = X_train.drop(['morganB_fps','morphological_fingerprint'], axis=1).values
X_test_morpho = X_test.drop(['morganB_fps','morphological_fingerprint'], axis=1).values

B. For structural features

In [82]:
def convert_list_features_to_numpy(x):
    '''
        converts features stored as an array of lists into a 2D array (array of arrays) so that .shape works as expected

        @param x: array containing lists 
        @return new_x: array of arrays
    '''
    new_x = []
    for element in x:
        new_x.append(np.array(element))
    return np.array(new_x)
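The effect of this conversion is easiest to see on a toy column. The values below are made up. The point is the change from a 1-D object array of lists to a regular 2-D numeric array that scikit-learn can consume.

```python
import numpy as np
import pandas as pd

# A column whose cells are Python lists, as with 'morganB_fps'
toy = pd.Series([[0, 1, 0, 1], [1, 1, 0, 0], [0, 0, 0, 1]])

raw = toy.values                                   # 1-D object array of lists
stacked = np.array([np.array(fp) for fp in raw])   # 2-D numeric array

print(raw.shape, raw.dtype)
print(stacked.shape, stacked.dtype)
```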

In [83]:
X_test_morgan_fp = X_test['morganB_fps'].copy(deep=True).values
X_train_morgan_fp = X_train['morganB_fps'].copy(deep=True).values
#convert as an array 
X_test_morgan_fp = convert_list_features_to_numpy(X_test_morgan_fp)
X_train_morgan_fp = convert_list_features_to_numpy(X_train_morgan_fp)

C. For combined features

In [84]:
def append_concat_molecular_and_morphological_FP(x):
    '''
        creates a new column 'appended_profile' containing the joint morphological and Morgan fingerprints

        @param x: pandas DataFrame with columns 'morphological_fingerprint' and 'morganB_fps'
        @return x with new column 'appended_profile' 
    '''
    x['appended_profile'] = x.apply(lambda x: np.concatenate(
        [x['morphological_fingerprint'], x['morganB_fps']]), axis=1)
    return x

In [85]:
X_test_morph_morgan_fp = append_concat_molecular_and_morphological_FP(X_test.copy(deep=True))
X_train_morph_morgan_fp = append_concat_molecular_and_morphological_FP(X_train.copy(deep=True))

In [86]:
X_train_morph_morgan_fp = convert_list_features_to_numpy(X_train_morph_morgan_fp['appended_profile'].values)
X_test_morph_morgan_fp = convert_list_features_to_numpy(X_test_morph_morgan_fp['appended_profile'].values)
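As a small illustration of the combined representation, the sketch below concatenates two made-up fingerprints per row. The column names match the notebook, and the values are invented. Each combined vector is the morphological profile followed by the Morgan bits.

```python
import numpy as np
import pandas as pd

# Toy rows with invented values: 3 morphological features and 4 Morgan bits each
toy = pd.DataFrame({
    "morphological_fingerprint": [[0.5, -1.2, 0.3], [1.1, 0.0, -0.7]],
    "morganB_fps": [[0, 1, 0, 1], [1, 0, 0, 0]],
})

combined = np.array([
    np.concatenate([morph, morgan])
    for morph, morgan in zip(toy["morphological_fingerprint"], toy["morganB_fps"])
])
print(combined.shape)
```

Because the two blocks live on different scales (z-scored floats versus bits), tree-based models such as random forests are a convenient choice, as they do not require feature scaling.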

## Binary Classification

### Cross-validation and Hyperparameter Tuning 

In ML, we want to minimize a loss function, e.g., the mean-squared error (MSE). In particular, we want to fit the training data to minimize the loss on the training samples, while still having a small error on the unseen test samples. Moreover, hyperparameters define the hypothesis space of the model, which can be tuned to minimize the estimated test error. The test error can be estimated using a so-called k-fold cross-validation (CV), which is performed in the following.

For an RF, there are three hyperparameters that we want to tune to optimize the estimated test error: 
* depth of the trees in the forest: 10, 20, 30 or 50 
* number of trees in the forest: 100, 200, 300 or 500
* minimum number of samples per leaf: 5, 10, 15, 20 or 25

In [87]:
def rf_cross_validation(X_train, y_train, max_depth_range=[10, 20, 30, 50], num_tree_range=[100, 200, 300, 500], min_samples_leaf_range=[5,10, 15, 20, 25]):
    '''
        performs a 5-fold CV for a Random Forest for given X_train and y_train;
        each hyperparameter is tuned separately, with the other two left at their scikit-learn defaults

        @param X_train: the training matrix
        @param y_train: the associated response vector
        @param max_depth_range: list containing the values that should be tested for max depth, default [10, 20, 30, 50]
        @param num_tree_range: list containing the values that should be tested for the number of trees, default [100, 200, 300, 500]
        @param min_samples_leaf_range: list containing the values that should be tested for the minimum number of samples per leaf, default [5, 10, 15, 20, 25]

        @return: an unfitted forest configured with the best hyperparameters according to the mean CV balanced accuracy
    '''
    best_score = -float('inf')
    for depth in max_depth_range:
        cv_results = cross_validate(RandomForestClassifier(random_state=42, max_depth=depth, n_jobs=-1,
                                    class_weight='balanced'), X=X_train, y=y_train, scoring='balanced_accuracy', cv=5)
        score = np.mean(cv_results['test_score'])
        if score > best_score:
            best_score = score
            best_depth = depth

    best_score = -float('inf')
    for n_tree in num_tree_range:
        cv_results = cross_validate(RandomForestClassifier(random_state=42, n_estimators=n_tree,
                                    n_jobs=-1, class_weight='balanced'), X=X_train, y=y_train, scoring='balanced_accuracy', cv=5)
        score = np.mean(cv_results['test_score'])
        if score > best_score:
            best_score = score
            best_n_tree = n_tree

    best_score = -float('inf')
    for num_samples in min_samples_leaf_range:
        cv_results = cross_validate(RandomForestClassifier(random_state=42, min_samples_leaf=num_samples,
                                    n_jobs=-1, class_weight='balanced'), X=X_train, y=y_train, scoring='balanced_accuracy', cv=5)
        score = np.mean(cv_results['test_score'])
        if score > best_score:
            best_score = score
            best_min_samples = num_samples

    rf = RandomForestClassifier(random_state=42, n_estimators=best_n_tree, max_depth=best_depth,
                                min_samples_leaf=best_min_samples, n_jobs=-1, class_weight='balanced')
    
    return rf

In [88]:
best_estimator_morpho = rf_cross_validation(X_train_morpho, y_train_activities)
best_estimator_struct= rf_cross_validation(X_train_morgan_fp, y_train_activities)
best_estimator_combined = rf_cross_validation(X_train_morph_morgan_fp, y_train_activities)
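The function above tunes one hyperparameter at a time. A possible alternative, shown here only as a sketch on synthetic data, is to search the three hyperparameters jointly with scikit-learn's `GridSearchCV`. The grid is deliberately smaller than the ranges listed above to keep the example fast, and `X_toy` / `y_toy` are stand-ins for the real training data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for X_train / y_train (about 10% positives)
X_toy, y_toy = make_classification(
    n_samples=200, n_features=10, weights=[0.9, 0.1], random_state=42)

# Reduced grid for illustration; every combination is evaluated with 5-fold CV
param_grid = {
    "max_depth": [10, 20],
    "n_estimators": [50, 100],
    "min_samples_leaf": [5, 10],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42, class_weight="balanced"),
    param_grid=param_grid,
    scoring="balanced_accuracy",
    cv=5,
)
search.fit(X_toy, y_toy)
print(search.best_params_, round(search.best_score_, 3))
```

A joint search can capture interactions between hyperparameters, at the cost of fitting one model per combination and fold. By default, `search.best_estimator_` is refit on the whole training set.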

### Model Instantiation and Evaluation Function

We instantiate the estimator for each modality by selecting the best estimator obtained during hyperparameter tuning. Additionally, we include a default random forest to compare how the performance varies when the model is optimized.

We use two metrics to evaluate the predictive performance of the model:
- **Matthews correlation coefficient (MCC):** The MCC is a reliable statistical measure that produces a high score only if the predictions are good in all four confusion matrix categories (true positives, false negatives, true negatives, and false positives), in proportion to the sizes of the positive and negative classes in the dataset.
- **Balanced accuracy (BA):** BA is the mean of sensitivity (true positive rate), which measures how well the model detects positive cases and answers questions like 'How many antagonist molecules did the model correctly recall?', and specificity (true negative rate), which measures the model's ability to correctly identify negative cases.
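Both metrics can be computed by hand from the four confusion matrix counts. The counts below are invented for illustration and do not come from the models in this notebook.

```python
import math

# Invented confusion matrix counts for an imbalanced test set
tp, fn, tn, fp = 8, 2, 80, 10

sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # true negative rate
balanced_accuracy = (sensitivity + specificity) / 2

# MCC: numerator is a covariance-like term, denominator normalizes it to [-1, 1]
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print(f"BA = {balanced_accuracy:.3f}, MCC = {mcc:.3f}")
```

Here BA looks comfortable because both rates are high, while the MCC is noticeably lower because the 10 false positives outnumber the 8 true positives.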


In [None]:
models = {
    "RF"         : [RandomForestClassifier(random_state=42,class_weight ='balanced')], ##default
    "RF_morpho"  : [best_estimator_morpho],
    "RF_struct"  : [best_estimator_struct],
    "RF_combined": [best_estimator_combined]
}

def evaluate_model(y_test, y_pred):
    '''
    Evaluate predictions with balanced accuracy and MCC, and print a classification report
        Parameters : 
            y_test: true labels of the test set
            y_pred: labels predicted by the model
        Returns : 
            None (the metrics and the classification report are printed)
    '''
    ba = balanced_accuracy_score(y_test, y_pred)
    mcc = matthews_corrcoef(y_true=y_test, y_pred=y_pred)

    print('Evaluation of predictions made on the test set:')
    print(f'Metrics to evaluate the model: \n balanced accuracy : {ba*100:.2f} % , \n MCC : {mcc:.3f}.')
    print('Summary metrics in a cross table: \n')
    print(classification_report(y_test, y_pred))

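from sklearn.base import clone

def make_predictions(models_dict, selected_model, X_train, X_test, y_test, y_train):
    '''
    Fit the selected model on the training set and evaluate it on the test set
        Parameters : 
            models_dict (dictionary): maps a model name to a one-element list holding the estimator
            selected_model (str): name of the model to choose
            X_train, X_test : feature matrices (morphological, structural, or combined fingerprints)
            y_test, y_train : activity labels of the test and training sets
        Returns : 
            the fitted model
    '''
    if selected_model not in models_dict:
        raise KeyError(f'Unknown model: {selected_model}')
    # clone so that the shared default "RF" entry is not refit in place by later calls
    model = clone(models_dict[selected_model][0])

    # learn by fitting the model on the training set
    model.fit(X_train, y_train)

    # compute predictions on the test set
    predictions = model.predict(X_test)
    evaluate_model(y_test, predictions)
    return model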



###  Make Predictions

1. Morphological fingerprint

A. Default Setting 


In [None]:
default_morpho = make_predictions(models,'RF',X_train_morpho,X_test_morpho,y_test_activities,y_train_activities)


B. Tuned: using the selected hyperparameter settings

In [None]:
morpho_model = make_predictions(models,'RF_morpho',X_train_morpho,X_test_morpho,y_test_activities,y_train_activities)

2. Structural fingerprint

A. Default Setting

In [None]:
default_struct = make_predictions(models,'RF',X_train_morgan_fp,X_test_morgan_fp,y_test_activities,y_train_activities)

B. Tuned: using the selected hyperparameter settings

In [None]:
struct_model = make_predictions(models,'RF_struct',X_train_morgan_fp,X_test_morgan_fp,y_test_activities,y_train_activities)

3. Combined fingerprint 

A. Default Setting

In [None]:
default_combined = make_predictions(models,'RF',X_train_morph_morgan_fp,X_test_morph_morgan_fp,y_test_activities,y_train_activities)

B. Tuned: using the selected hyperparameter settings

In [None]:
combined_model = make_predictions(models,'RF_combined',X_train_morph_morgan_fp,X_test_morph_morgan_fp,y_test_activities,y_train_activities)

## Conclusion

In this section, we focus on predicting the activity of molecules based on their morphological profiles, structural fingerprints, or a combination of both. We specifically examine the prediction of estrogen receptor alpha (ERα) antagonism. After training and testing the model, we observe that ERα antagonism activity is better predicted using morphological fingerprints alone. However, the performance of structural fingerprints improves when combined with morphological fingerprints.


Of the two metrics, we consider the Matthews correlation coefficient (MCC) the more reliable on this imbalanced dataset: it produces a high score only if the predictions are good in all four confusion matrix categories (true positives, false negatives, true negatives, and false positives), in proportion to the sizes of the positive and negative classes.

## Discussion 

1. Case study: a concrete example of ERα antagonism prediction 

---

In this discussion section, we demonstrate how to predict whether a molecule is an ERα antagonist. To do so, we need a molecule that the model has not seen. It must have the same number of features as the data the model was trained on, here 290 features for the morphological fingerprint. Our initial data contained around 30,000 molecules, and we only worked with 880 molecules in this section. Therefore, we can make predictions for some of the remaining 29,508 molecules.

Let's predict the activity of a phytoestrogen (a class that includes flavonoids, coumestans, and lignans), specifically naringin (a dietary flavonoid), and of one random compound, BRD-K98953112.
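Because the feature count must match, a small guard before predicting can catch shape mistakes early. The helper `predict_single` below is hypothetical (it is not part of the notebook) and is demonstrated on a toy forest. It relies on the `n_features_in_` attribute that fitted scikit-learn estimators expose.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def predict_single(model, features):
    """Predict one sample after checking its length against the fitted model."""
    features = np.asarray(features, dtype=float).reshape(1, -1)
    if features.shape[1] != model.n_features_in_:
        raise ValueError(
            f"expected {model.n_features_in_} features, got {features.shape[1]}")
    return int(model.predict(features)[0])

# Toy model with 8 random features (a stand-in for the 290 morphological features)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 8))
y_toy = (X_toy[:, 0] > 0).astype(int)
toy_model = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_toy, y_toy)

print(predict_single(toy_model, X_toy[0]))
```

Passing a vector of the wrong length raises a `ValueError` with an explicit message.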


In [None]:
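naringin = dataset[dataset['CPD_NAME'] == 'naringin'].iloc[:,2:-2]
# predict once and reuse the result
naringin_prediction = morpho_model.predict(naringin.values)[0]
if naringin_prediction == 1:
    print('The molecule naringin is predicted as an estrogen receptor alpha antagonist')
else:
    print('The molecule naringin is not predicted as an estrogen receptor alpha antagonist')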

In [None]:
BRD_K98953112 = dataset[dataset['CPD_NAME'] == 'BRD-K98953112'].iloc[:,2:-2]
if morpho_model.predict(BRD_K98953112.values)[0] == 1:
    print(f'The molecule BRD-K98953112 is predicted as an estrogen receptor alpha antagonist')
else: 
    print(f'The molecule is not predicted as an estrogen receptor alpha antagonist')

---

Here we see that naringin is predicted to be an ERα antagonist. An antagonist is a drug that diminishes the effect of an agonist (a drug that binds to and activates a receptor). Antagonists can be either competitive or non-competitive, and each type can be reversible or irreversible.

Naringin, a phytoestrogen belonging to the flavonoid group, exhibits biphasic activity, functioning in a dose-dependent manner with both estrogenic and anti-estrogenic effects (Ryoiti Kiyama - [Estrogenic flavonoids and their molecular mechanisms of action](https://www.sciencedirect.com/science/article/pii/S0955286322003187#cebibl1)).

For BRD-K98953112, we consulted PubChem for more details. It appears to be a fluorinated organic compound, specifically a PFAS. However, we found no studies on this molecule, so we cannot draw any conclusions. PFAS are primarily used in food contact materials for their water- and oil-repellent properties. The effects of several PFAS on thyroid hormone levels during pregnancy and in childhood have been documented in reports such as [Endocrine Disruptors: from Scientific Evidence to Human Health Protection](https://www.europarl.europa.eu/RegData/etudes/STUD/2019/608866/IPOL_STU(2019)608866_EN.pdf).