<a href="https://colab.research.google.com/github/stawiskm/QSAR_Modelbuilding_amesTest/blob/main/AMES_Test-Part-1-genotoxicity-data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Computational Ames test [Part 1] Data collection**

Marc Jermann

In this Jupyter notebook, we build a real-life **data science project**: a machine learning model based on ChEMBL bioactivity data and other data sources. This first part covers data collection.

---

## **ChEMBL Database**

The [*ChEMBL Database*](https://www.ebi.ac.uk/chembl/) contains curated bioactivity data for more than 2 million compounds. It is compiled from more than 76,000 documents and 1.2 million assays, and the data span 13,000 targets, 1,800 cells, and 33,000 indications.
[Data as of March 25, 2020; ChEMBL version 26].

## **Installing libraries**

Install the ChEMBL web service package so that we can retrieve bioactivity data from the ChEMBL Database.

In [1]:
! pip install chembl_webresource_client



## **Importing libraries**

In [2]:
# Import necessary libraries
import pandas as pd
from chembl_webresource_client.new_client import new_client

## **Search for Target**

Since the Ames test is performed using *Salmonella typhimurium*, this organism is our target.

### **Target search for Salmonella typhimurium**

In [3]:
# Target search for Salmonella typhimurium
target = new_client.target
target_query = target.search('Salmonella typhimurium')
targets = pd.DataFrame.from_dict(target_query)
targets.head(4)

| | cross_references | organism | pref_name | score | species_group_flag | target_chembl_id | target_components | target_type | tax_id |
|---|---|---|---|---|---|---|---|---|---|
| 0 | [] | Salmonella enterica subsp. enterica serovar Ty... | Salmonella typhimurium | 31.0 | False | CHEMBL351 | [] | ORGANISM | 90371 |
| 1 | [] | Salmonella | Salmonella | 15.0 | True | CHEMBL614446 | [] | ORGANISM | 590 |
| 2 | [] | Homo sapiens | Protein NipSnap homolog 3A | 15.0 | False | CHEMBL3817722 | `[{'accession': 'Q9UFN0', 'component_descriptio...` | SINGLE PROTEIN | 9606 |
| 3 | [] | Salmonella enterica subsp. enterica serovar Pa... | Salmonella paratyphi | 13.0 | False | CHEMBL612293 | [] | ORGANISM | 54388 |


### **Select and retrieve activity data for Ames tests with *Salmonella typhimurium* (first entry)**

We assign the first entry (which corresponds to the target organism, *Salmonella typhimurium*) to the ***selected_target*** variable.

In [4]:
selected_target = targets.target_chembl_id[0]
selected_target

'CHEMBL351'
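Selecting row `0` depends on the order in which the search returns its results. As a hedged alternative, the sketch below selects the target by content instead of by position. It uses a small toy frame (`targets_demo`, a name introduced here for illustration) that copies three columns from the search output above, so it runs without a network connection.

```python
import pandas as pd

# Toy copy of the first search results shown above (only the columns needed here).
targets_demo = pd.DataFrame({
    "pref_name": ["Salmonella typhimurium", "Salmonella",
                  "Protein NipSnap homolog 3A", "Salmonella paratyphi"],
    "target_type": ["ORGANISM", "ORGANISM", "SINGLE PROTEIN", "ORGANISM"],
    "target_chembl_id": ["CHEMBL351", "CHEMBL614446",
                         "CHEMBL3817722", "CHEMBL612293"],
})

# Select by content rather than by row position.
mask = (targets_demo["target_type"] == "ORGANISM") & (
    targets_demo["pref_name"] == "Salmonella typhimurium"
)
matches = targets_demo.loc[mask, "target_chembl_id"]

# Fail loudly if the search returns zero or several matching rows.
assert len(matches) == 1, f"expected exactly one match, got {len(matches)}"
selected_target_demo = matches.iloc[0]
print(selected_target_demo)
```

Applied to the real `targets` frame, the same mask would guard against a change in result ordering, at the cost of depending on the exact `pref_name` string.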

Here, we retrieve only the bioactivity data for *Salmonella typhimurium* (CHEMBL351) whose **standard_type** is reported as "Activity".

In [5]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="Activity")

In [6]:
df = pd.DataFrame.from_dict(res)

Next, we keep only the entries whose **assay_description** contains the keyword "Ames" (for example "Ames test" or "Ames assay").
Of these, we keep only the entries with an unambiguous **activity_comment**: "Non-toxic", "Toxic" or "Non-Toxic".

In [7]:
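# Keep assays whose description contains "Ames" surrounded by spaces; na=False skips missing descriptions
df_ames = df.loc[df['assay_description'].str.contains(' Ames ', na=False)]
# Keep only rows with an interpretable label; build the mask from df_ames itself and copy to allow later edits
df_ames = df_ames.loc[df_ames['activity_comment'].isin(['Non-toxic', 'Toxic', 'Non-Toxic'])].copy()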
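The keyword filter above matches "Ames" only when it has a space on both sides, so a description that starts or ends with the word is not matched. The sketch below compares that pattern with a regex word-boundary pattern on a toy frame. The assay descriptions are invented for illustration and do not come from ChEMBL; the word-boundary variant is an alternative to consider, not what this notebook uses.

```python
import pandas as pd

# Hypothetical toy records; the descriptions are invented for illustration.
demo = pd.DataFrame({
    "assay_description": [
        "Mutagenicity in Ames test using TA98",  # matched by both patterns
        "Genotoxicity assessed by Ames",         # "Ames" at the end of the text
        "Antibacterial activity (MIC)",          # unrelated assay
        None,                                    # missing description
    ],
    "activity_comment": ["Toxic", "Non-toxic", "Active", "Non-Toxic"],
})

# Pattern used in this notebook: "Ames" with a space on both sides.
spaced = demo["assay_description"].str.contains(" Ames ", na=False)

# Alternative: regex word boundaries also match at the start or end of the text.
bounded = demo["assay_description"].str.contains(r"\bAmes\b", regex=True, na=False)

# Restrict both selections to the interpretable labels.
labels = ["Non-toxic", "Toxic", "Non-Toxic"]
kept_spaced = demo[spaced & demo["activity_comment"].isin(labels)]
kept_bounded = demo[bounded & demo["activity_comment"].isin(labels)]

print(len(kept_spaced), len(kept_bounded))
```

Switching patterns on the real data may change the number of retained entries, so the counts reported below apply to the space-delimited pattern only.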

## **Data pre-processing**

### **Handling missing data**
If any compound has a missing value in the **activity_comment** column, it should be dropped. First, we count how often each value occurs.

In [8]:
# frequency count of column activity_comment
count = df_ames["activity_comment"].value_counts()
print(count)

Non-toxic    964
Non-Toxic    404
Toxic        298
Name: activity_comment, dtype: int64


This dataset has no missing data, but there are two spellings of "Non-toxic". We correct this in the next step.

In [9]:
df_ames['activity_comment'] = df_ames['activity_comment'].replace({'Non-Toxic':'Non-toxic'})
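With the counts reported above, merging the two spellings should give 964 + 404 = 1368 "Non-toxic" entries against 298 "Toxic" entries. The sketch below shows one possible way to normalize the spelling without listing every variant, using `str.capitalize()` on a toy series (`labels_demo` is a name introduced here for illustration).

```python
import pandas as pd

# Toy labels with the same spelling variants seen in the counts above.
labels_demo = pd.Series(["Non-toxic", "Non-Toxic", "Toxic", "Non-Toxic"])

# str.capitalize() upper-cases the first character and lower-cases the rest,
# so 'Non-Toxic' becomes 'Non-toxic' and 'Toxic' stays 'Toxic'.
normalized = labels_demo.str.capitalize()

# Re-count to confirm that only two distinct labels remain.
counts = normalized.value_counts()
print(counts.to_dict())
```

Re-running `value_counts()` after the replacement in the notebook is a quick way to confirm that only two labels remain.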

### **Remove unneeded columns and duplicates**

In [10]:
selection = ['molecule_chembl_id', 'canonical_smiles', 'activity_comment']
df_ames = df_ames[selection]

In [11]:
df_ames = df_ames.drop_duplicates(subset=['molecule_chembl_id'], keep='last')
df_ames

| | molecule_chembl_id | canonical_smiles | activity_comment |
|---|---|---|---|
| 389 | CHEMBL398372 | `CC(C)(N)C(=O)N[C@H](COCc1ccccc1)c1nnnn1CCOC(=O...` | Non-toxic |
| 420 | CHEMBL137803 | `C=C1C(=O)O[C@@H]2C[C@@]3(C)CCCC(=C)[C@@H]3C[C@...` | Toxic |
| 422 | CHEMBL486423 | `C=C1CCC[C@]2(C)C[C@H]3OC(=O)[C@@H](C)[C@H]3C[C...` | Non-toxic |
| 423 | CHEMBL6466 | `O=c1ccc2ccccc2o1` | Non-toxic |
| 424 | CHEMBL24171 | `COc1c2ccoc2cc2oc(=O)ccc12` | Non-toxic |
| ... | ... | ... | ... |
| 4768 | CHEMBL4283853 | `COCCCOc1cc2c(cc1OC)-c1cc(=O)c(C(=O)O)cn1[C@H](...` | Non-toxic |
| 4770 | CHEMBL4284040 | `N#C/C(=C\c1cccc2cnccc12)c1c[nH]c2ccccc12` | Non-toxic |
| 4771 | CHEMBL3945880 | `O=c1cc(NC2CCN(S(=O)(=O)C(F)(F)F)CC2)c2cc(C(c3c...` | Non-toxic |
| 4793 | CHEMBL4514379 | `N#Cc1cnc2ccc(-c3c(-c4ccc(F)c(Cl)c4)ncn3CCO)nn12` | Non-toxic |
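Dropping duplicates with `keep='last'` keeps the final record of each molecule, which silently settles any case where the same molecule carries different labels. The sketch below shows one way to list such molecules before dropping duplicates. The identifiers `M1` to `M3` and their labels are hypothetical toy data, not ChEMBL records.

```python
import pandas as pd

# Hypothetical toy data: molecule "M1" carries two different labels.
dup_demo = pd.DataFrame({
    "molecule_chembl_id": ["M1", "M2", "M1", "M3"],
    "activity_comment": ["Toxic", "Non-toxic", "Non-toxic", "Toxic"],
})

# Count distinct labels per molecule; more than one means the records disagree.
n_labels = dup_demo.groupby("molecule_chembl_id")["activity_comment"].nunique()
conflicting_ids = n_labels[n_labels > 1].index.tolist()

# keep='last' retains the final record of each molecule ("Non-toxic" for M1 here).
deduped = dup_demo.drop_duplicates(subset=["molecule_chembl_id"], keep="last")

print(conflicting_ids)
print(deduped)
```

Whether conflicting molecules should be kept, relabelled, or removed is a modelling decision; this check only makes them visible.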


## **File handling**

Finally, we save the resulting data to a CSV file.

In [12]:
df_ames.to_csv('QSAR_ames_raw-data.csv', index=False)
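A quick round-trip check can confirm that the CSV reads back with the expected columns and rows. The sketch below uses two rows copied from the table above and writes to an in-memory buffer instead of a file, so it leaves nothing on disk; `out_demo` is a name introduced here for illustration.

```python
import io

import pandas as pd

# Two rows copied from the table above, standing in for df_ames.
out_demo = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL6466", "CHEMBL24171"],
    "canonical_smiles": ["O=c1ccc2ccccc2o1", "COc1c2ccoc2cc2oc(=O)ccc12"],
    "activity_comment": ["Non-toxic", "Non-toxic"],
})

# Write to an in-memory buffer; index=False omits the row index, as in the notebook.
buffer = io.StringIO()
out_demo.to_csv(buffer, index=False)
buffer.seek(0)

# Read the CSV back and compare columns and values with the original frame.
reloaded = pd.read_csv(buffer)
same_columns = list(reloaded.columns) == list(out_demo.columns)
same_values = reloaded.values.tolist() == out_demo.values.tolist()
print(same_columns, same_values)
```

For the real file, the same comparison can be run by calling `pd.read_csv('QSAR_ames_raw-data.csv')` after the save.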