# ER ToxCast dataset

This notebook evaluates how many compounds overlap between the Broad dataset BBBC047 and the different ToxCast assays on the estrogen receptor alpha.
Data from ToxCast are accessible from the [Environmental Protection Agency](https://www.epa.gov/comptox-tools/exploring-toxcast-data) website. The dataset used here is the ToxCast set downloaded from [MoleculeNet](https://moleculenet.org/datasets-1).

Estrogen receptor model data and complementary information on the assays are retrieved from the [high-throughput screening data for estrogen receptor model](https://clowder.edap-cluster.com/files/6114b353e4b0856fdc656937?dataset=61147fefe4b0856fdc65639b&space=&folder=6114818be4b0856fdc656443), in the version released in 2015.


## Import libraries

In [None]:
import pandas as pd 
import plotly.express as px
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize
from rdkit import RDLogger
import warnings

## Load data: ToxCast and per-treatment profiles

The ToxCast dataset gathers toxicology data for a large library of compounds from in vitro high-throughput screening, covering over 600 tasks. It contains assay outcomes for antagonism and agonism of the estrogen receptor alpha, as well as viability, tested on the recombinant human ovarian cell line BG1Luc4E2. The protocol for these experiments is available under [Tox21Assay_SLPs](https://clowder.edap-cluster.com/files/638f9982e4b04f6bb1497267?dataset=63602c6de4b04f6bb13dc4d4&space=).

In [None]:
toxcast = pd.read_csv('../Data/Annotations/toxcast_data.csv')
dataset = pd.read_pickle('../Data/Output/output_notebook_1.pkl')

### Load datasets

Out of the different assays, we are only interested in the ones performed on ER, so we select them by checking the assay name.

### Get Estrogen receptor relevant assays 

In [None]:
toxcast.columns.to_list()

In [None]:
ER_TOX_ASSAY_id = []
for assay in toxcast.columns.to_list():
    if assay.startswith('TOX21_ERa_') or assay.startswith('TOX21_ESRE_') or assay.startswith('OT_ER_ER'):
        ER_TOX_ASSAY_id.append(assay)
print(f'There are {len(ER_TOX_ASSAY_id)} assays related to estrogen activity')
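
The prefix-based selection above can also be written more compactly, since `str.startswith` accepts a tuple of prefixes. The sketch below uses a short, hand-written list of column names purely for illustration; it is not the real ToxCast header.

```python
# Illustrative column names only; the real table has several hundred assay columns
columns = [
    'smiles',
    'TOX21_ERa_LUC_BG1_Agonist',
    'TOX21_ESRE_BLA',
    'OT_ER_ERaERa_0480',
    'TOX21_AR_BLA_Agonist_ratio',
]

# Same three prefixes as in the loop above
ER_PREFIXES = ('TOX21_ERa_', 'TOX21_ESRE_', 'OT_ER_ER')

# startswith returns True as soon as any prefix of the tuple matches
er_assays = [col for col in columns if col.startswith(ER_PREFIXES)]
print(er_assays)
```

Keeping the prefixes in one tuple makes it easier to add or remove an assay family later.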

In [None]:
toxcast_er = toxcast[['smiles'] + ER_TOX_ASSAY_id]

### Keep unique information and drop lines with missing values

Now we build a list of unique SMILES that have a value (anything but NaN) for at least one ER assay.

In [None]:
filtered_toxcast_smiles_ER = []
for assay in ER_TOX_ASSAY_id:
    filtered_data = toxcast[toxcast[assay].notnull() & (toxcast[assay] != ' ')]
    if not filtered_data.empty:
        filtered_toxcast_smiles_ER += filtered_data['smiles'].tolist()
filtered_toxcast_smiles_ER = list(set(filtered_toxcast_smiles_ER))

In [None]:
print(f'We now have {len(filtered_toxcast_smiles_ER)} SMILES with an endpoint for the estrogen receptor')
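
As a possible alternative to the loop, the same idea (keep compounds tested in at least one assay) can be sketched with `DataFrame.dropna`. The table below is a toy stand-in with made-up column names and values, and the sketch assumes missing values are encoded as NaN only; it does not handle blank strings such as `' '`.

```python
import pandas as pd

# Toy stand-in for the ToxCast table; names and values are made up
toy = pd.DataFrame({
    'smiles': ['CCO', 'c1ccccc1', 'CC(=O)O', 'CCO'],
    'assay_a': [1.0, None, None, 0.0],
    'assay_b': [None, None, 0.0, None],
})
assay_cols = ['assay_a', 'assay_b']

# Drop rows where every assay column is missing, then deduplicate the SMILES
tested = toy.dropna(subset=assay_cols, how='all')
unique_smiles = sorted(tested['smiles'].unique())
print(unique_smiles)
```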

Since we want to evaluate the overlap between two datasets and molecule names are not consistent, we use the SMILES information from both datasets. Ideally, both SMILES sets are prepared in the same way so that molecules common to both sets can be identified.

### Check the validity of SMILES and standardization

In [None]:

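# Function to check if a SMILES string is valid
def is_valid_smiles(smiles):
    # MolFromSmiles returns None when the SMILES cannot be parsed
    return Chem.MolFromSmiles(smiles) is not None

# Filter out invalid SMILES; copy so that a column can be added later without warnings
RDLogger.DisableLog('rdApp.info')
valid_smiles_df = toxcast_er[toxcast_er['smiles'].apply(is_valid_smiles)].copy()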


Now we standardize the SMILES obtained in the previous step and those from our dataset.

In [None]:
def clean_std_smiles(dataset, smiles):
    '''
    Standardize (cleanup) and canonicalize parent SMILES
        Parameters :
            dataset (data frame): data frame holding the SMILES to standardize
            smiles (str): name of the column containing the SMILES
        Returns :
            dataset (data frame): same data frame with one more column named 'STD_smile'
    '''
    # Disable the log message and warning
    RDLogger.DisableLog('rdApp.info')
    warnings.filterwarnings('ignore')

    def standardize(smi):
        # Convert the SMILES to an RDKit molecule object; unparsable SMILES give None
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            return None
        # Clean up and standardize the molecule
        clean_mol = rdMolStandardize.Cleanup(mol)
        # Get the largest fragment after cleanup
        parent_clean_mol = rdMolStandardize.FragmentParent(clean_mol)
        # Convert the standardized molecule back to SMILES
        return Chem.MolToSmiles(parent_clean_mol)

    # Add the standardized SMILES row by row so they stay aligned with the dataset,
    # even when some SMILES cannot be parsed
    dataset.loc[:, 'STD_smile'] = dataset[smiles].apply(standardize)

    return dataset

In [None]:
dataset = clean_std_smiles(dataset=dataset,smiles='CPD_SMILES')
toxcast_er = clean_std_smiles(dataset=valid_smiles_df,smiles='smiles')

Now that both datasets have a standardized SMILES column, we can compare them to identify overlapping molecules.

### Overlap between sets

In [None]:
overlapped = dataset[dataset['STD_smile'].isin(toxcast_er['STD_smile'].to_list())]
print(f'There are {overlapped.shape[0]} overlapping compounds between all ToxCast estrogen receptor endpoints and the Broad dataset')
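
One thing worth checking here is whether the overlap is counted in rows or in distinct compounds. If the profile table holds several rows per compound (an assumption about the per-treatment data, not something verified here), `shape[0]` will be larger than the number of unique structures. A minimal sketch on toy data with made-up names:

```python
import pandas as pd

# Toy data: 'profiles' mimics per-treatment rows, 'reference' the ToxCast side
profiles = pd.DataFrame({
    'CPD_NAME': ['cpd1', 'cpd1', 'cpd2', 'cpd3'],
    'STD_smile': ['CCO', 'CCO', 'c1ccccc1', 'CC(=O)O'],
})
reference = pd.DataFrame({'STD_smile': ['CCO', 'CC(=O)O', 'CCN']})

# Rows of 'profiles' whose standardized SMILES also occurs in 'reference'
overlap = profiles[profiles['STD_smile'].isin(reference['STD_smile'])]

n_rows = overlap.shape[0]                     # counts every matching row
n_compounds = overlap['STD_smile'].nunique()  # counts distinct structures
print(n_rows, n_compounds)
```

Reporting both numbers makes it clear which one is meant by "overlapping compounds".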

We want to see if we have enough molecules for each class before building a predictive model in [Part3-Machine_learning.ipynb](Part3-Machine_learning.ipynb). 


### Proportions of activity for each assay.

Now we look at the ratio of active to inactive compounds, i.e. the classes, for each assay. This will be plotted later for better visualisation.


In [None]:
counts_dict = {}
for col in toxcast_er[ER_TOX_ASSAY_id]:
    counts_dict[col] = toxcast_er[col].value_counts()

# Create a DataFrame from the dictionary
counts_df_assay = pd.DataFrame.from_dict(counts_dict, orient='index').transpose().fillna(0).astype(int)

In [None]:
print('One can quickly see that the classes are highly imbalanced regardless of the assay')
counts_df_assay
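
To put a number on the imbalance, the fraction of actives per assay can be computed alongside the counts. The sketch below assumes labels are encoded as 0 (inactive), 1 (active) and NaN (not tested), and uses toy columns with made-up values.

```python
import pandas as pd

# Toy assay table: 0 = inactive, 1 = active, NaN = not tested
toy = pd.DataFrame({
    'assay_a': [0.0, 0.0, 0.0, 1.0],
    'assay_b': [0.0, 1.0, None, None],
})

# mean() skips NaN by default, so this is actives / tested compounds per assay
active_fraction = toy.mean()
# Number of compounds that actually have a label in each assay
n_tested = toy.notna().sum()

summary = pd.DataFrame({'n_tested': n_tested, 'active_fraction': active_fraction})
print(summary)
```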

### Visualization of the assay endpoints with the most overlapping compounds

The next step is to identify which assay endpoint has the most compounds in the ToxCast-Broad overlap dataset, which contains 14 assay endpoints and over 800 molecules.

In [None]:
merged_df = overlapped.merge(toxcast_er, on='STD_smile', how='left')
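
A left merge repeats a row on the left for every matching row on the right. After standardization, several ToxCast entries could in principle collapse to the same `STD_smile` (for example different salt forms), which would inflate the counts. The sketch below shows the effect on toy data and one possible guard; keeping the first duplicate is an arbitrary choice made for illustration.

```python
import pandas as pd

left = pd.DataFrame({'CPD_NAME': ['cpd1', 'cpd2'], 'STD_smile': ['CCO', 'CCN']})
# The same standardized SMILES appears twice on the right-hand side
right = pd.DataFrame({'STD_smile': ['CCO', 'CCO', 'CCN'], 'assay_a': [1.0, 1.0, 0.0]})

# Naive left merge: cpd1 is duplicated because it matches two rows
naive = left.merge(right, on='STD_smile', how='left')

# Deduplicate the right side first; validate raises if its keys are not unique
deduplicated = left.merge(
    right.drop_duplicates(subset='STD_smile'),
    on='STD_smile',
    how='left',
    validate='many_to_one',
)
print(len(naive), len(deduplicated))
```

If duplicates carry conflicting labels, a rule for resolving them would be needed before dropping rows.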


In [None]:
counts_dict = {}
for col in merged_df[ER_TOX_ASSAY_id]:
    counts_dict[col] = merged_df[col].value_counts()

# Create a DataFrame from the dictionary
counts_df_assay = pd.DataFrame.from_dict(counts_dict, orient='index').transpose().fillna(0).astype(int)

In [None]:
counts_df_assay = pd.DataFrame.from_dict(counts_dict, orient='index').transpose().fillna(0).astype(int)
# Reset the index to turn it into a column for melting
counts_df_assay.reset_index(inplace=True)
counts_df_assay.rename(columns={'index': 'status'}, inplace=True)

# Melt the DataFrame to long format
melted_df = counts_df_assay.melt(id_vars=['status'], var_name='assay', value_name='count')

# Map the status column to active/inactive for better readability
status_mapping = {0.0: 'inactive', 1.0: 'active'}
melted_df['status'] = melted_df['status'].map(status_mapping)

# Create the bar plot
fig = px.bar(melted_df, x='assay', y='count', color='status', barmode='group',
             labels={'assay': 'Assay', 'count': 'Count', 'status': 'Status'},
             title='Number of Active and Inactive Counts per Assay')

# Update layout for better appearance
fig.update_layout(xaxis_title='Assay', yaxis_title='Count', title='Number of Active and Inactive Counts per Assay for Estrogen receptor Agonist / Antagonist activity')

# Show the plot
fig.show()


In [None]:
ERa_LUC_BG1 = merged_df[['TOX21_ERa_LUC_BG1_Antagonist','TOX21_ERa_LUC_BG1_Agonist','CPD_NAME']]

In [None]:
ERa_LUC_BG1.head()

In [None]:
# Compare explicitly to 1: NaN (not tested) != 0 evaluates to True and would count untested compounds as active
active = len(ERa_LUC_BG1[(ERa_LUC_BG1['TOX21_ERa_LUC_BG1_Antagonist'] == 1) | (ERa_LUC_BG1['TOX21_ERa_LUC_BG1_Agonist'] == 1)]['CPD_NAME'].unique())
inactive = len(ERa_LUC_BG1[(ERa_LUC_BG1['TOX21_ERa_LUC_BG1_Antagonist'] == 0) & (ERa_LUC_BG1['TOX21_ERa_LUC_BG1_Agonist'] == 0)]['CPD_NAME'].unique())
print(f'There are {active} active (agonist or antagonist) and {inactive} inactive molecules for ER alpha activity tested on the LUC_BG1 cell line and also identified in the BBBC047 dataset.')
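
When counting actives in a table that may contain missing labels, it helps to keep in mind that pandas evaluates `NaN != 0` as `True`. The toy example below (made-up names and values) contrasts a "not equal to 0" mask with an explicit "equal to 1" mask.

```python
import pandas as pd

# cpd2 was not tested in either assay
toy = pd.DataFrame({
    'CPD_NAME': ['cpd1', 'cpd2', 'cpd3'],
    'antagonist': [0.0, None, 1.0],
    'agonist': [0.0, None, 0.0],
})

# NaN != 0 is True, so the untested cpd2 is picked up by this mask
not_zero = (toy['antagonist'] != 0) | (toy['agonist'] != 0)
# An explicit comparison to 1 only selects measured actives
is_one = (toy['antagonist'] == 1) | (toy['agonist'] == 1)

flagged_not_zero = toy.loc[not_zero, 'CPD_NAME'].tolist()
flagged_is_one = toy.loc[is_one, 'CPD_NAME'].tolist()
print(flagged_not_zero, flagged_is_one)
```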

### Saving activity data set

In [None]:
ERa_LUC_BG1.to_csv('../Data/Output/ER_activity_luc_bg1.csv')