<a href="https://colab.research.google.com/github/shankar124/googlecolab/blob/main/BioInformatics_Project_Scratch_Download_Bioactivity_Data_Part1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

YouTube link for this project: https://www.youtube.com/watch?v=plVLRashaA8&list=PLtqF5YXg7GLlQJUv9XJ3RWdd5VYGwBHrP


**ChEMBL Database**

---
[ChEMBL](https://www.ebi.ac.uk/chembl/) is a database of curated bioactivity data for more than 2 million compounds. It is compiled from more than 76,000 documents and 1.2 million assays, and the data span 13,000 targets, 1,800 cells, and 33,000 indications (data as of March 25, 2020; ChEMBL version 26).




**Installing Libraries**



Install the ChEMBL web service package so that we can retrieve bioactivity data from the ChEMBL Database.


In [None]:
! pip install chembl_webresource_client

**Importing Libraries**

In [2]:
import pandas as pd
from chembl_webresource_client.new_client import new_client

### Search for Target protein

#### Target search for coronavirus

In [None]:
## Target search for coronavirus

target = new_client.target
target_query = target.search('coronavirus')
targets = pd.DataFrame.from_dict(target_query)
targets

Select and retrieve bioactivity data for SARS coronavirus 3C-like proteinase (seventh entry)

We will assign the ChEMBL ID of the seventh entry (index 6, which corresponds to the target protein, coronavirus 3C-like proteinase) to the *selected_target* variable.

In [4]:
selected_target = targets.target_chembl_id[6]
selected_target

'CHEMBL3927'
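
Selecting by position works only as long as the search keeps returning rows in the same order. As a hedged alternative, the sketch below picks the target by name instead. It runs on a toy stand-in for `targets`: only `CHEMBL3927` comes from this notebook, the other rows and IDs are hypothetical placeholders, and the `pref_name` column is assumed to be present in the real search result.

```python
import pandas as pd

# Toy stand-in for the `targets` dataframe. Only CHEMBL3927 is taken from
# this notebook; the other names and IDs are hypothetical placeholders.
targets_demo = pd.DataFrame({
    "pref_name": [
        "Hypothetical target A",
        "SARS coronavirus 3C-like proteinase",
        "Hypothetical target B",
    ],
    "target_chembl_id": [
        "CHEMBL_PLACEHOLDER_1",
        "CHEMBL3927",
        "CHEMBL_PLACEHOLDER_2",
    ],
})

# Match on the target name instead of relying on row order.
mask = targets_demo["pref_name"].str.contains(
    "3C-like proteinase", case=False, regex=False
)
matches = targets_demo.loc[mask, "target_chembl_id"]

# Fail loudly if the name matches zero rows or more than one row.
if len(matches) != 1:
    raise ValueError(f"expected exactly one match, found {len(matches)}")

selected_target_demo = matches.iloc[0]
print(selected_target_demo)
```

With the real `targets` dataframe, the same mask could replace the hard-coded index, provided the name matches exactly one row.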

Here, we will retrieve only the bioactivity data for coronavirus 3C-like proteinase (CHEMBL3927) that are reported as IC50 values in nM (nanomolar) units. Note that the query below filters on `standard_type` only, not on the unit.

In [None]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")
res

Convert the result to a dataframe called `df`.

In [None]:
df = pd.DataFrame.from_dict(res)
# df = pd.DataFrame(res)
df

In [None]:
## Check the unique values in the standard_type column
df.standard_type.unique()

## It contains only IC50, so the filter worked as intended
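
The same kind of check can be applied to the units, since the query filters on `standard_type` only. The sketch below is a minimal illustration on a toy dataframe (the rows are made up for the example, not real ChEMBL records); it assumes the activity records carry a `standard_units` column.

```python
import pandas as pd

# Toy stand-in for `df`; the values are illustrative, not real ChEMBL records.
df_demo = pd.DataFrame({
    "standard_type": ["IC50", "IC50", "IC50"],
    "standard_units": ["nM", "nM", "ug.mL-1"],
    "standard_value": ["7200.0", "9400.0", "15.0"],
})

# Count records per unit to see whether anything other than nM is present.
print(df_demo["standard_units"].value_counts(dropna=False))

# Keep only rows reported in nM so that later thresholds are comparable.
df_nm = df_demo[df_demo["standard_units"] == "nM"]
print(len(df_nm))
```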

In [11]:
df.to_csv('bio_data.csv', index=False)

**Copying files to Google Drive**


In [None]:
# from google.colab import drive
# drive.mount('/content/gdrive/', force_remount=True)

In [None]:
! head bio_data.csv

**Handling missing data**

Check for null values


In [None]:
df.isnull().sum()

If any compound has a missing value in the standard_value column, drop it.

In [None]:
df2 = df[df.standard_value.notna()]
df2

In [None]:
# Check Again
# df2.isnull().sum()

For this dataset there appears to be no missing data, but the code cell above remains useful for the bioactivity data of other target proteins.
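
A null check does not catch values that are present but not numeric. The sketch below shows one possible way to handle both cases at once, assuming `standard_value` arrives as text. The toy rows and molecule IDs are hypothetical.

```python
import pandas as pd

# Toy stand-in for `df`; IDs and values are hypothetical.
df_demo = pd.DataFrame({
    "molecule_chembl_id": ["MOL_A", "MOL_B", "MOL_C"],
    "standard_value": ["7200.0", None, "not reported"],
})

df_clean = df_demo.copy()

# Convert text to numbers; anything that cannot be parsed becomes NaN.
df_clean["standard_value"] = pd.to_numeric(
    df_clean["standard_value"], errors="coerce"
)

# Drop rows whose standard_value is missing or could not be parsed.
df_clean = df_clean.dropna(subset=["standard_value"])
print(df_clean)
```

Converting early also means later comparisons do not need a `float(...)` call per value.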


### Data pre-processing of the bioactivity data



**Labeling compounds as either being active, inactive or intermediate**

The bioactivity data are IC50 values in nM. Compounds with values of 1,000 nM or less will be considered active, while those with values of 10,000 nM or more will be considered inactive. Values in between 1,000 and 10,000 nM will be referred to as intermediate.


In [None]:
col_list = ['molecule_chembl_id','canonical_smiles','standard_value']
df3 = df2[col_list]
df3

In [None]:
# Label each compound by its IC50 (standard_value, in nM):
#   >= 10,000 -> inactive, <= 1,000 -> active, otherwise intermediate.
# Work on a copy so that adding a column does not touch a slice of df2.
df4 = df3.copy()
bioactivity_class = []
for value in df4.standard_value:
  value = float(value)  # convert in case the value is stored as text
  if value >= 10000:
    bioactivity_class.append('inactive')
  elif value <= 1000:
    bioactivity_class.append('active')
  else:
    bioactivity_class.append('intermediate')
print(len(bioactivity_class))  # should equal the number of rows in df4

df4['bioactivity_class'] = bioactivity_class
df4
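
The same labeling rule can also be written without an explicit loop. The following is a minimal sketch using `numpy.select` on a toy dataframe with made-up values; it is an alternative formulation, not the method used in this notebook. Conditions are checked in order, so the first one that matches wins.

```python
import numpy as np
import pandas as pd

# Toy IC50 values in nM (illustrative only), stored as text.
df_demo = pd.DataFrame(
    {"standard_value": ["500", "1000", "5000", "10000", "25000"]}
)
values = pd.to_numeric(df_demo["standard_value"])

# Same thresholds as the loop: >= 10,000 inactive, <= 1,000 active.
conditions = [values >= 10000, values <= 1000]
labels = ["inactive", "active"]

# Rows that match neither condition fall back to the default label.
df_demo["bioactivity_class"] = np.select(
    conditions, labels, default="intermediate"
)
print(df_demo["bioactivity_class"].tolist())
```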


Save the dataframe to a CSV file

In [34]:
df4.to_csv('bioactivity_preprocesed_data.csv', index=False)
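
As a quick sanity check, a saved file can be read back and compared with the dataframe that produced it. The sketch below does this for a toy dataframe in a temporary directory; the file name `demo.csv` and the rows are hypothetical. One thing it illustrates is that `read_csv` infers column types, so a text column of numbers comes back as numeric.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for `df4`; IDs and values are hypothetical.
df_demo = pd.DataFrame({
    "molecule_chembl_id": ["MOL_A", "MOL_B"],
    "standard_value": ["500.0", "25000.0"],  # stored as text
    "bioactivity_class": ["active", "inactive"],
})

# Write to a temporary file and read it straight back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.csv")
    df_demo.to_csv(path, index=False)
    df_back = pd.read_csv(path)

# Shape and column names survive the round trip; dtypes may change.
print(df_back.shape == df_demo.shape)
print(list(df_back.columns) == list(df_demo.columns))
print(df_back["standard_value"].dtype)
```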