# **Computational Ames test [Part 1] Download Bioactivity Data**

Marc Jermann

In this Jupyter notebook, we will build a real-life **data science project**: a machine learning model based on ChEMBL bioactivity data and other data sources.

---

## **ChEMBL Database**

The [*ChEMBL Database*](https://www.ebi.ac.uk/chembl/) contains curated bioactivity data for more than 2 million compounds. It is compiled from more than 76,000 documents and 1.2 million assays, and the data span 13,000 targets, 1,800 cells, and 33,000 indications.
[Data as of March 25, 2020; ChEMBL version 26].

## **Installing libraries**

Install the ChEMBL web service package so that we can retrieve bioactivity data from the ChEMBL Database.

In [None]:
! pip install chembl_webresource_client

Collecting chembl_webresource_client
  Downloading chembl_webresource_client-0.10.7-py3-none-any.whl (55 kB)
Collecting requests-cache~=0.7.0
  Downloading requests_cache-0.7.5-py3-none-any.whl (39 kB)
Collecting pyyaml>=5.4
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting itsdangerous>=2.0.1
  Downloading itsdangerous-2.1.0-py3-none-any.whl (15 kB)
Collecting url-normalize<2.0,>=1.4
  Downloading url_normalize-1.4.3-py2.py3

## **Importing libraries**

In [None]:
# Import necessary libraries
import pandas as pd
from chembl_webresource_client.new_client import new_client

## **Search for Target**

Because the Ames test is performed with *Salmonella typhimurium*, we search ChEMBL for this organism as our target.

### **Target search for Salmonella typhimurium**

In [None]:
# Target search for Salmonella typhimurium
target = new_client.target
target_query = target.search('Salmonella typhimurium')
targets = pd.DataFrame.from_dict(target_query)
targets.head(4)

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,[],Salmonella enterica subsp. enterica serovar Ty...,Salmonella typhimurium,31.0,False,CHEMBL351,[],ORGANISM,90371
1,[],Salmonella,Salmonella,15.0,True,CHEMBL614446,[],ORGANISM,590
2,[],Homo sapiens,Protein NipSnap homolog 3A,15.0,False,CHEMBL3817722,"[{'accession': 'Q9UFN0', 'component_descriptio...",SINGLE PROTEIN,9606
3,[],Salmonella enterica subsp. enterica serovar Pa...,Salmonella paratyphi,13.0,False,CHEMBL612293,[],ORGANISM,54388


### **Select and retrieve bioactivity data for Ames tests with *Salmonella typhimurium* (first entry)**

We will assign the first entry (which corresponds to the target organism, *Salmonella typhimurium*) to the ***selected_target*** variable.

In [None]:
selected_target = targets.target_chembl_id[0]
selected_target

'CHEMBL351'
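Selecting by position assumes that the search ranking keeps the organism in the first row. A more explicit alternative is to filter on the columns themselves. The following is a minimal sketch, not the notebook's method: `targets_demo` is a hand-made frame that copies three columns from the search output above, and `selected_target_demo` is a hypothetical name used only for illustration.

```python
import pandas as pd

# Hand-made copy of three columns from the search result shown above
targets_demo = pd.DataFrame({
    "pref_name": ["Salmonella typhimurium", "Salmonella",
                  "Protein NipSnap homolog 3A", "Salmonella paratyphi"],
    "target_type": ["ORGANISM", "ORGANISM", "SINGLE PROTEIN", "ORGANISM"],
    "target_chembl_id": ["CHEMBL351", "CHEMBL614446",
                         "CHEMBL3817722", "CHEMBL612293"],
})

# Select by content rather than by row position
mask = (
    (targets_demo["target_type"] == "ORGANISM")
    & (targets_demo["pref_name"] == "Salmonella typhimurium")
)
matches = targets_demo.loc[mask, "target_chembl_id"]

# Fail early if the search result is ambiguous or empty
if len(matches) != 1:
    raise ValueError(f"Expected exactly one match, found {len(matches)}")

selected_target_demo = matches.iloc[0]
print(selected_target_demo)
```

With the real `targets` frame, the same mask would guard against a change in the order of the search hits.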

Here, we retrieve only the bioactivity data for *Salmonella typhimurium* (CHEMBL351) whose **standard_type** is reported as "Activity".

In [None]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="Activity")

In [None]:
df = pd.DataFrame.from_dict(res)

Next, we keep only the entries whose **assay_description** contains the keyword "Ames" (for example, as "Ames test" or "Ames assay").
Of those, we keep only the entries that have an interpretable **activity_comment**.

In [None]:
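# Keep entries whose assay description mentions "Ames"; na=False skips missing descriptions
df_ames = df.loc[df['assay_description'].str.contains(' Ames ', na=False)]
# Keep entries with an interpretable comment; copy so later edits do not touch df
df2 = df_ames.loc[df_ames['activity_comment'].isin(['Non-toxic', 'Toxic', 'Non-Toxic'])].copy()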
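Note that the literal pattern `' Ames '` requires a space on both sides of the keyword, so a description that begins with "Ames" would not match. The sketch below compares it with a word-boundary regular expression. The assay descriptions are invented for illustration and are not taken from ChEMBL.

```python
import pandas as pd

# Hypothetical assay descriptions, including a missing value
demo = pd.DataFrame({
    "assay_description": [
        "Mutagenicity in Salmonella typhimurium by Ames test",
        "Ames assay in strain TA98",
        "Antibacterial activity against Salmonella typhimurium",
        None,
    ]
})

# Literal match: needs a space before and after the keyword
space_mask = demo["assay_description"].str.contains(" Ames ", regex=False, na=False)

# Word-boundary match: also catches "Ames" at the start or end of the text
word_mask = demo["assay_description"].str.contains(r"\bAmes\b", regex=True, na=False)

print(pd.DataFrame({"space": space_mask, "word": word_mask}))
```

Whether the broader pattern is appropriate depends on the data: it may also admit descriptions that merely mention the Ames test, so the extra rows should be inspected before they are used.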

## **Data pre-processing of the bioactivity data**

### **Handling missing data**

If any compound has a missing value in the **activity_comment** column, it should be dropped. We first count how often each value occurs.

In [None]:
# frequency count of column activity_comment
count = df2["activity_comment"].value_counts()
print(count)

Non-toxic    1368
Toxic         298
Name: activity_comment, dtype: int64


Apparently, this dataset has no missing data. However, "Non-toxic" appears in two spellings ("Non-toxic" and "Non-Toxic"), which the next step unifies.

In [None]:
df2['activity_comment'] = df2['activity_comment'].replace({'Non-Toxic':'Non-toxic'})
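An explicit `replace` works when the variant spellings are known in advance. As a hedged alternative, the sketch below normalizes capitalization generically with the pandas string methods `str.strip` and `str.capitalize`. The `labels` series is a made-up example, not the notebook's data.

```python
import pandas as pd

# Made-up labels showing the two spellings discussed above
labels = pd.Series(
    ["Non-toxic", "Toxic", "Non-Toxic", "Non-toxic"],
    name="activity_comment",
)

# capitalize() uppercases the first character and lowercases the rest,
# so "Non-Toxic" and "Non-toxic" collapse into one label
normalized = labels.str.strip().str.capitalize()

print(normalized.value_counts())
```

This approach would also merge spellings such as "non-toxic" or "TOXIC", which may or may not be desirable, so checking `value_counts()` before and after is a sensible safeguard.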

### **Remove unneeded columns and duplicates**

In [None]:
selection = ['molecule_chembl_id', 'canonical_smiles', 'activity_comment']
df3 = df2[selection]

In [None]:
# Keep one row per compound (the last occurrence)
df3 = df3.drop_duplicates(subset=['molecule_chembl_id'], keep='last')
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,activity_comment
389,CHEMBL398372,CC(C)(N)C(=O)N[C@H](COCc1ccccc1)c1nnnn1CCOC(=O...,Non-toxic
420,CHEMBL137803,C=C1C(=O)O[C@@H]2C[C@@]3(C)CCCC(=C)[C@@H]3C[C@...,Toxic
422,CHEMBL486423,C=C1CCC[C@]2(C)C[C@H]3OC(=O)[C@@H](C)[C@H]3C[C...,Non-toxic
423,CHEMBL6466,O=c1ccc2ccccc2o1,Non-toxic
424,CHEMBL24171,COc1c2ccoc2cc2oc(=O)ccc12,Non-toxic
...,...,...,...
4768,CHEMBL4283853,COCCCOc1cc2c(cc1OC)-c1cc(=O)c(C(=O)O)cn1[C@H](...,Non-toxic
4770,CHEMBL4284040,N#C/C(=C\c1cccc2cnccc12)c1c[nH]c2ccccc12,Non-toxic
4771,CHEMBL3945880,O=c1cc(NC2CCN(S(=O)(=O)C(F)(F)F)CC2)c2cc(C(c3c...,Non-toxic
4793,CHEMBL4514379,N#Cc1cnc2ccc(-c3c(-c4ccc(F)c(Cl)c4)ncn3CCO)nn12,Non-toxic
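Dropping duplicates with `keep='last'` silently picks one label when a compound has been reported with different outcomes. The sketch below shows one way to list such compounds before deduplicating. The identifiers `CHEMBL_A`, `CHEMBL_B`, and `CHEMBL_C` are placeholders, not real ChEMBL IDs.

```python
import pandas as pd

# Placeholder data: CHEMBL_A carries two conflicting labels
demo = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL_A", "CHEMBL_A", "CHEMBL_B", "CHEMBL_B", "CHEMBL_C"],
    "activity_comment": ["Toxic", "Non-toxic", "Non-toxic", "Non-toxic", "Toxic"],
})

# Count the distinct labels reported for each compound
n_labels = demo.groupby("molecule_chembl_id")["activity_comment"].nunique()
conflicting_ids = n_labels[n_labels > 1].index.tolist()
print("Compounds with conflicting labels:", conflicting_ids)

# Same deduplication as in the notebook: the last row per compound wins
deduplicated = demo.drop_duplicates(subset=["molecule_chembl_id"], keep="last")
print(deduplicated)
```

How to resolve conflicts (drop the compound, keep the toxic label, or keep the last entry as done here) is a modeling decision that this sketch leaves open.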


## **Copying files to Google Drive**

Finally, we save the resulting bioactivity data to the CSV file **bioactivity_preprocessed_data.csv**.

In [None]:
df3.to_csv('bioactivity_preprocessed_data.csv', index=False)

In [None]:
! ls -l

total 56
-rw-r--r-- 1 root root 48388 Feb 28 16:24 bioactivity_preprocessed_data.csv
drwx------ 5 root root  4096 Feb 28 16:23 gdrive
drwxr-xr-x 1 root root  4096 Feb 18 14:33 sample_data


Next, we mount Google Drive in Colab so that we can access it from within the notebook.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive/', force_remount=True)


Mounted at /content/gdrive/


Let's copy the file to Google Drive.

In [None]:
! mkdir -p "/content/gdrive/My Drive/Colab Notebooks/data/"

In [None]:
! cp bioactivity_preprocessed_data.csv "/content/gdrive/My Drive/Colab Notebooks/data"

In [None]:
! ls "/content/gdrive/My Drive/Colab Notebooks/data"

bioactivity_preprocessed_data.csv
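The shell commands above can also be written in plain Python, which avoids an error when the folder already exists. This is a minimal sketch using only the standard library. `copy_to_folder` is a hypothetical helper, and the example runs in a temporary directory rather than on Google Drive.

```python
import shutil
import tempfile
from pathlib import Path


def copy_to_folder(src, dest_dir):
    """Create dest_dir if needed and copy src into it; return the new path."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)  # like `mkdir -p`
    return Path(shutil.copy(src, dest_dir))      # like `cp src dest_dir`


# Demonstration in a temporary directory instead of the mounted drive
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "bioactivity_preprocessed_data.csv"
    src.write_text("molecule_chembl_id,canonical_smiles,activity_comment\n")

    copied = copy_to_folder(src, Path(tmp) / "data")
    copied_exists = copied.exists()
    copied_name = copied.name
    print(copied_name, copied_exists)
```

In Colab, the destination would be the mounted data folder used in the cells above.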


Let's take a glimpse at the **bioactivity_preprocessed_data.csv** file that we've just created.

In [None]:
! head "/content/gdrive/MyDrive/Colab Notebooks/data/bioactivity_preprocessed_data.csv"

molecule_chembl_id,canonical_smiles,activity_comment
CHEMBL398372,CC(C)(N)C(=O)N[C@H](COCc1ccccc1)c1nnnn1CCOC(=O)NCCCCO,Non-toxic
CHEMBL137803,C=C1C(=O)O[C@@H]2C[C@@]3(C)CCCC(=C)[C@@H]3C[C@H]12,Toxic
CHEMBL486423,C=C1CCC[C@]2(C)C[C@H]3OC(=O)[C@@H](C)[C@H]3C[C@@H]12,Non-toxic
CHEMBL6466,O=c1ccc2ccccc2o1,Non-toxic
CHEMBL24171,COc1c2ccoc2cc2oc(=O)ccc12,Non-toxic
CHEMBL416,COc1c2occc2cc2ccc(=O)oc12,Non-toxic
CHEMBL52229,COc1ccc2ccc(=O)oc2c1CC=C(C)C,Non-toxic
CHEMBL453805,CC(C)=CCOc1c2occc2cc2ccc(=O)oc12,Non-toxic
CHEMBL164660,O=c1ccc2cc3ccoc3cc2o1,Non-toxic
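Beyond viewing the raw text, the file can be read back with pandas to confirm that the columns and labels survived the round trip. The sketch below parses three rows copied from the output above out of an in-memory string. With the real file, `pd.read_csv` would be given the file path instead.

```python
import io

import pandas as pd

# Three rows copied from the `head` output above
csv_text = """molecule_chembl_id,canonical_smiles,activity_comment
CHEMBL137803,C=C1C(=O)O[C@@H]2C[C@@]3(C)CCCC(=C)[C@@H]3C[C@H]12,Toxic
CHEMBL6466,O=c1ccc2ccccc2o1,Non-toxic
CHEMBL24171,COc1c2ccoc2cc2oc(=O)ccc12,Non-toxic
"""

check = pd.read_csv(io.StringIO(csv_text))

# Basic sanity checks on structure and content
expected_columns = ["molecule_chembl_id", "canonical_smiles", "activity_comment"]
columns_ok = list(check.columns) == expected_columns
no_missing = not check.isna().any().any()
label_counts = check["activity_comment"].value_counts()

print(check.shape, columns_ok, no_missing)
print(label_counts)
```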
