# Drug Discovery

Drug discovery is the process of identifying and developing new medications. It involves finding new chemical compounds or biologics that can be used to prevent, treat, or cure diseases. Drug discovery is a complex and multi-step process that integrates biology, chemistry, pharmacology, and clinical science to develop new therapies.

Target Identification: The process begins with identifying a biological target, such as a protein, enzyme, or gene, that plays a critical role in a disease. The idea is to find a target that, when modulated by a drug, can lead to therapeutic benefits.

Target Validation: Once a potential target is identified, researchers validate it by demonstrating that modulating the target affects the disease's progression or symptoms.

``https://github.com/dataprofessor/bioinformatics_freecodecamp/tree/main``

# ChEMBL Database

The ChEMBL Database contains curated bioactivity data for more than 2 million compounds. It is compiled from more than 76,000 documents and 1.2 million assays, and the data span 13,000 targets, 1,800 cells, and 33,000 indications. [Data as of March 25, 2020; ChEMBL version 26].

ChEMBL contains bioactivity data for small molecules, including information about their interactions with drug targets like proteins, enzymes, and receptors. This data can be used to identify potential drug candidates by understanding how these molecules interact with biological targets.

In [1]:
import pandas as pd
from chembl_webresource_client.new_client import new_client

In [2]:
# Target search for coronavirus
target = new_client.target
target_query = target.search('coronavirus')
targets = pd.DataFrame.from_dict(target_query)
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,[],Coronavirus,Coronavirus,17.0,False,CHEMBL613732,[],ORGANISM,11119
1,[],Feline coronavirus,Feline coronavirus,14.0,False,CHEMBL612744,[],ORGANISM,12663
2,[],Murine coronavirus,Murine coronavirus,14.0,False,CHEMBL5209664,[],ORGANISM,694005
3,[],Canine coronavirus,Canine coronavirus,14.0,False,CHEMBL5291668,[],ORGANISM,11153
4,[],Human coronavirus 229E,Human coronavirus 229E,13.0,False,CHEMBL613837,[],ORGANISM,11137
5,[],Human coronavirus OC43,Human coronavirus OC43,13.0,False,CHEMBL5209665,[],ORGANISM,31631
6,"[{'xref_id': 'P0C6U8', 'xref_name': None, 'xre...",SARS coronavirus,SARS coronavirus 3C-like proteinase,10.0,False,CHEMBL3927,"[{'accession': 'P0C6U8', 'component_descriptio...",SINGLE PROTEIN,227859
7,[],Middle East respiratory syndrome-related coron...,Middle East respiratory syndrome-related coron...,9.0,False,CHEMBL4296578,[],ORGANISM,1335626
8,"[{'xref_id': 'P0C6X7', 'xref_name': None, 'xre...",SARS coronavirus,Replicase polyprotein 1ab,4.0,False,CHEMBL5118,"[{'accession': 'P0C6X7', 'component_descriptio...",SINGLE PROTEIN,227859
9,[],Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,4.0,False,CHEMBL4523582,"[{'accession': 'P0DTD1', 'component_descriptio...",SINGLE PROTEIN,2697049


Assign the seventh entry (index 6), which corresponds to the target protein *coronavirus 3C-like proteinase*, to the ***selected_target*** variable.

In [3]:
selected_target = targets.target_chembl_id[6]
selected_target

'CHEMBL3927'
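
Selecting by position depends on the order of the search results, which may differ between ChEMBL releases. A minimal sketch of selecting by name and target type instead, using a hand-copied subset of the rows shown above (the `rows` list and the `pick_target` helper are illustrative and not part of the ChEMBL client):

```python
# Subset of the search results shown above, copied by hand for illustration.
rows = [
    {"pref_name": "Coronavirus",
     "target_type": "ORGANISM",
     "target_chembl_id": "CHEMBL613732"},
    {"pref_name": "SARS coronavirus 3C-like proteinase",
     "target_type": "SINGLE PROTEIN",
     "target_chembl_id": "CHEMBL3927"},
    {"pref_name": "Replicase polyprotein 1ab",
     "target_type": "SINGLE PROTEIN",
     "target_chembl_id": "CHEMBL5118"},
]


def pick_target(rows, name_fragment, target_type="SINGLE PROTEIN"):
    """Return the ChEMBL ID of the single row matching name and type."""
    matches = [
        r["target_chembl_id"]
        for r in rows
        if r["target_type"] == target_type
        and name_fragment.lower() in r["pref_name"].lower()
    ]
    # Fail loudly rather than silently picking the wrong target.
    if len(matches) != 1:
        raise ValueError(f"expected exactly one match, found {len(matches)}")
    return matches[0]


selected_target_by_name = pick_target(rows, "3C-like proteinase")
print(selected_target_by_name)
```

The same idea could be applied to the `targets` DataFrame by filtering on its `pref_name` and `target_type` columns.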

Here, we retrieve only the bioactivity data for *coronavirus 3C-like proteinase* (CHEMBL3927) that are reported as IC50 values. The `standard_value` column used later expresses these values in nM (nanomolar).

IC₅₀ (Half Maximal Inhibitory Concentration) is a commonly used measure in pharmacology to evaluate the potency of a substance, such as a drug or an inhibitor, in inhibiting a specific biological or biochemical function.
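
The definition can be illustrated with a simple dose-response curve. The sketch below assumes a Hill-type model with a slope of 1, which is a simplification for illustration and not something the ChEMBL data guarantee; the function name and the concentrations are hypothetical.

```python
def fraction_inhibited(conc_nm, ic50_nm, hill_slope=1.0):
    """Fraction of activity inhibited (0 to 1) under a Hill-type model."""
    if conc_nm < 0 or ic50_nm <= 0:
        raise ValueError("concentration must be >= 0 and IC50 must be > 0")
    if conc_nm == 0:
        return 0.0
    return 1.0 / (1.0 + (ic50_nm / conc_nm) ** hill_slope)


ic50 = 7200.0  # nM, illustrative value
# Ten-fold below, equal to, and ten-fold above the IC50.
for conc in (720.0, 7200.0, 72000.0):
    print(f"{conc:>8.0f} nM -> {fraction_inhibited(conc, ic50):.0%} inhibited")
```

When the concentration equals the IC50, the modeled inhibition is exactly one half, which is what the name "half maximal inhibitory concentration" describes.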

In [4]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

In [5]:
df = pd.DataFrame.from_dict(res)

In [6]:
df.head(5)

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,1480935,[],CHEMBL829584,In vitro inhibitory concentration against SARS...,B,,,BAO_0000190,...,SARS coronavirus,SARS coronavirus 3C-like proteinase,227859,,,IC50,uM,UO_0000065,,7.2
1,,,1480936,[],CHEMBL829584,In vitro inhibitory concentration against SARS...,B,,,BAO_0000190,...,SARS coronavirus,SARS coronavirus 3C-like proteinase,227859,,,IC50,uM,UO_0000065,,9.4
2,,,1481061,[],CHEMBL830868,In vitro inhibitory concentration against SARS...,B,,,BAO_0000190,...,SARS coronavirus,SARS coronavirus 3C-like proteinase,227859,,,IC50,uM,UO_0000065,,13.5
3,,,1481065,[],CHEMBL829584,In vitro inhibitory concentration against SARS...,B,,,BAO_0000190,...,SARS coronavirus,SARS coronavirus 3C-like proteinase,227859,,,IC50,uM,UO_0000065,,13.11
4,,,1481066,[],CHEMBL829584,In vitro inhibitory concentration against SARS...,B,,,BAO_0000190,...,SARS coronavirus,SARS coronavirus 3C-like proteinase,227859,,,IC50,uM,UO_0000065,,2.0


In [7]:
df.shape

(133, 46)

In [8]:
df['standard_type'].unique()

array(['IC50'], dtype=object)

IC50 stands for half-maximal inhibitory concentration. It is a measure of the potency of a drug in inhibiting a specific biological or biochemical function. The IC50 value represents the concentration of a substance required to inhibit a given biological process by 50%.

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 133 entries, 0 to 132
Data columns (total 46 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   action_type                0 non-null      object
 1   activity_comment           0 non-null      object
 2   activity_id                133 non-null    int64 
 3   activity_properties        133 non-null    object
 4   assay_chembl_id            133 non-null    object
 5   assay_description          133 non-null    object
 6   assay_type                 133 non-null    object
 7   assay_variant_accession    0 non-null      object
 8   assay_variant_mutation     0 non-null      object
 9   bao_endpoint               133 non-null    object
 10  bao_format                 133 non-null    object
 11  bao_label                  133 non-null    object
 12  canonical_smiles           133 non-null    object
 13  data_validity_comment      27 non-null     object
 14  data_valid

In [10]:
print(df['standard_value'])

0       7200.0
1       9400.0
2      13500.0
3      13110.0
4       2000.0
        ...   
128    10600.0
129    10100.0
130    11500.0
131    10700.0
132    78900.0
Name: standard_value, Length: 133, dtype: object


For `standard_value`, a lower value means higher potency, while a higher value means lower potency (more of the compound is required to reach 50% inhibition).
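
The `standard_value` column appears to hold the IC50 in nM, while the raw `value` column shown earlier is in uM (7.2 uM in the first row corresponds to 7200.0 here). A minimal sketch of that conversion, assuming only the three units listed in the hypothetical `TO_NM` table:

```python
# Conversion factors to nanomolar; a hypothetical helper table, not a ChEMBL API.
TO_NM = {"nM": 1.0, "uM": 1_000.0, "mM": 1_000_000.0}


def to_nanomolar(value, units):
    """Convert a concentration to nM, raising on units not in TO_NM."""
    try:
        return float(value) * TO_NM[units]
    except KeyError:
        raise ValueError(f"unsupported unit: {units!r}") from None


# First rows of the `value` column shown earlier, all reported in uM.
raw_values_um = [7.2, 9.4, 13.5, 13.11, 2.0]
# Round to one decimal to hide floating-point noise from the multiplication.
converted = [round(to_nanomolar(v, "uM"), 1) for v in raw_values_um]
print(converted)
```

Because `float()` is applied first, the helper also accepts values stored as strings, which matters here since the column is printed with `dtype: object`.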

In [11]:
df.to_csv('bioactivity_data.csv', index=False)

# Data Preprocessing

In [12]:
df = pd.read_csv('bioactivity_data.csv')

In [13]:
df.duplicated().sum()

0

**Bioactivity Class:** Categorizes the activity of compounds, usually based on their measured effects in biological assays. Common classes include:\
**Active:** Compounds that show significant biological activity in the assay.\
**Inactive:** Compounds that do not show significant activity.\
**Intermediate:** Compounds with ambiguous or borderline activity.

**Labeling compounds as active, inactive, or intermediate**

The bioactivity data are IC50 values in nM. Compounds with values of 1,000 nM or less are considered **active**, those with values of 10,000 nM or more are considered **inactive**, and those in between are labeled **intermediate**.

In [14]:
bioactivity_class = []
for i in df.standard_value:
  if float(i) >= 10000:
    bioactivity_class.append("inactive")
  elif float(i) <= 1000:
    bioactivity_class.append("active")
  else:
    bioactivity_class.append("intermediate")

In [15]:
bioactivity_class[:10]

['intermediate',
 'intermediate',
 'inactive',
 'inactive',
 'intermediate',
 'active',
 'intermediate',
 'active',
 'inactive',
 'inactive']
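
The thresholds can also be wrapped in a small function, which makes the boundary handling explicit. The sketch below is one possible formulation, not the notebook's own code: the name `classify_ic50` and the choice to return `None` for missing values are assumptions. A plain `if`/`elif`/`else` chain would label a NaN as "intermediate", because every comparison with NaN is false.

```python
import math


def classify_ic50(ic50_nm, active_max=1000.0, inactive_min=10000.0):
    """Label an IC50 given in nM; returns None when the value is missing."""
    if ic50_nm is None:
        return None
    value = float(ic50_nm)
    if math.isnan(value):
        return None  # avoid silently falling through to "intermediate"
    if value >= inactive_min:
        return "inactive"
    if value <= active_max:
        return "active"
    return "intermediate"


# First ten standard_value entries from this dataset.
sample = [7200.0, 9400.0, 13500.0, 13110.0, 2000.0,
          980.0, 4820.0, 950.0, 11200.0, 23500.0]
labels = [classify_ic50(v) for v in sample]
print(labels)
```

Both thresholds are inclusive, so a value of exactly 1,000 nM is labeled active and exactly 10,000 nM is labeled inactive.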

**Copy the *molecule_chembl_id* column to a list**

In [16]:
mol_cid = []
for i in df['molecule_chembl_id']:
  mol_cid.append(i)

`molecule_chembl_id` is a unique identifier assigned to each molecule (compound) in the ChEMBL database.
The molecule is the compound that modulates the activity of the target protein.

In [17]:
mol_cid[:10]

['CHEMBL187579',
 'CHEMBL188487',
 'CHEMBL185698',
 'CHEMBL426082',
 'CHEMBL187717',
 'CHEMBL365134',
 'CHEMBL187598',
 'CHEMBL190743',
 'CHEMBL365469',
 'CHEMBL188983']

**Copy the *canonical_smiles* column to a list**

In [18]:
canonical_smiles = []
for i in df.canonical_smiles:
  canonical_smiles.append(i)

Canonical SMILES (Simplified Molecular Input Line Entry System) is a standardized text representation of a molecule's structure. It encodes the molecular structure using a linear string of characters that uniquely represents the arrangement of atoms, bonds, and stereochemistry in the molecule.

In [19]:
canonical_smiles[:10]

['Cc1noc(C)c1CN1C(=O)C(=O)c2cc(C#N)ccc21',
 'O=C1C(=O)N(Cc2ccc(F)cc2Cl)c2ccc(I)cc21',
 'O=C1C(=O)N(CC2COc3ccccc3O2)c2ccc(I)cc21',
 'O=C1C(=O)N(Cc2cc3ccccc3s2)c2ccccc21',
 'O=C1C(=O)N(Cc2cc3ccccc3s2)c2c1cccc2[N+](=O)[O-]',
 'O=C1C(=O)N(Cc2cc3ccccc3s2)c2c(Br)cccc21',
 'O=C1C(=O)N(Cc2cc3ccccc3s2)c2ccc(F)cc21',
 'O=C1C(=O)N(Cc2cc3ccccc3s2)c2ccc(I)cc21',
 'O=C1C(=O)N(Cc2cc3ccccc3s2)c2cccc(Cl)c21',
 'O=C1C(=O)N(C/C=C/c2cc3ccccc3s2)c2ccc(I)cc21']

**Copy the *standard_value* column to a list**

In [20]:
standard_value = []
for i in df.standard_value:
  standard_value.append(i)

In the ChEMBL database, the term "standard value" typically refers to the standardized measurement of a biological or chemical property, such as the potency, activity, or concentration of a compound. These values are used to provide consistent and comparable data across different studies and experiments.

In [21]:
standard_value[:10]

[7200.0,
 9400.0,
 13500.0,
 13110.0,
 2000.0,
 980.0,
 4820.0,
 950.0,
 11200.0,
 23500.0]

### **Combine the 4 lists into a dataframe**

In [22]:
data_tuples = list(zip(mol_cid, canonical_smiles, standard_value, bioactivity_class))
res_df = pd.DataFrame( data_tuples,  columns=['molecule_chembl_id', 'canonical_smiles', 'standard_value', 'bioactivity_class'])

In [23]:
res_df.head()

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,bioactivity_class
0,CHEMBL187579,Cc1noc(C)c1CN1C(=O)C(=O)c2cc(C#N)ccc21,7200.0,intermediate
1,CHEMBL188487,O=C1C(=O)N(Cc2ccc(F)cc2Cl)c2ccc(I)cc21,9400.0,intermediate
2,CHEMBL185698,O=C1C(=O)N(CC2COc3ccccc3O2)c2ccc(I)cc21,13500.0,inactive
3,CHEMBL426082,O=C1C(=O)N(Cc2cc3ccccc3s2)c2ccccc21,13110.0,inactive
4,CHEMBL187717,O=C1C(=O)N(Cc2cc3ccccc3s2)c2c1cccc2[N+](=O)[O-],2000.0,intermediate
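
The same table can be assembled without intermediate lists by selecting the columns directly and mapping a labeling function over `standard_value`. This is a sketch under stated assumptions: `df_sketch` is a three-row stand-in built from values shown above, and `label`, `columns`, and `res_sketch` are hypothetical names.

```python
import pandas as pd

# Small stand-in for `df`, using three rows from the outputs above.
df_sketch = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL187579", "CHEMBL185698", "CHEMBL365134"],
    "canonical_smiles": [
        "Cc1noc(C)c1CN1C(=O)C(=O)c2cc(C#N)ccc21",
        "O=C1C(=O)N(CC2COc3ccccc3O2)c2ccc(I)cc21",
        "O=C1C(=O)N(Cc2cc3ccccc3s2)c2c(Br)cccc21",
    ],
    "standard_value": [7200.0, 13500.0, 980.0],
    "assay_type": ["B", "B", "B"],  # extra column that is not carried over
})


def label(ic50_nm):
    # Same inclusive thresholds as the labeling loop in this notebook.
    if ic50_nm >= 10000:
        return "inactive"
    if ic50_nm <= 1000:
        return "active"
    return "intermediate"


columns = ["molecule_chembl_id", "canonical_smiles", "standard_value"]
res_sketch = df_sketch[columns].copy()  # copy so the original frame is untouched
res_sketch["bioactivity_class"] = res_sketch["standard_value"].map(label)
print(res_sketch)
```

Selecting columns keeps each row's values aligned by construction, so there is no risk of the four lists drifting out of step.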


### Add labels to each tuple

In [24]:
# Note: the result is displayed but not assigned, so res_df itself is unchanged.
pd.concat([res_df, pd.Series(bioactivity_class)], axis=1)

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,bioactivity_class,0
0,CHEMBL187579,Cc1noc(C)c1CN1C(=O)C(=O)c2cc(C#N)ccc21,7200.0,intermediate,intermediate
1,CHEMBL188487,O=C1C(=O)N(Cc2ccc(F)cc2Cl)c2ccc(I)cc21,9400.0,intermediate,intermediate
2,CHEMBL185698,O=C1C(=O)N(CC2COc3ccccc3O2)c2ccc(I)cc21,13500.0,inactive,inactive
3,CHEMBL426082,O=C1C(=O)N(Cc2cc3ccccc3s2)c2ccccc21,13110.0,inactive,inactive
4,CHEMBL187717,O=C1C(=O)N(Cc2cc3ccccc3s2)c2c1cccc2[N+](=O)[O-],2000.0,intermediate,intermediate
...,...,...,...,...,...
128,CHEMBL2146517,COC(=O)[C@@]1(C)CCCc2c1ccc1c2C(=O)C(=O)c2c(C)c...,10600.0,inactive,inactive
129,CHEMBL187460,C[C@H]1COC2=C1C(=O)C(=O)c1c2ccc2c1CCCC2(C)C,10100.0,inactive,inactive
130,CHEMBL363535,Cc1coc2c1C(=O)C(=O)c1c-2ccc2c(C)cccc12,11500.0,inactive,inactive
131,CHEMBL227075,Cc1cccc2c3c(ccc12)C1=C(C(=O)C3=O)[C@@H](C)CO1,10700.0,inactive,inactive


In [25]:
# res_df has no column named 0, so this rename leaves the frame unchanged.
res_df = res_df.rename(columns={0: 'Bioactivity_Class'})

In [26]:
res_df.head()

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,bioactivity_class
0,CHEMBL187579,Cc1noc(C)c1CN1C(=O)C(=O)c2cc(C#N)ccc21,7200.0,intermediate
1,CHEMBL188487,O=C1C(=O)N(Cc2ccc(F)cc2Cl)c2ccc(I)cc21,9400.0,intermediate
2,CHEMBL185698,O=C1C(=O)N(CC2COc3ccccc3O2)c2ccc(I)cc21,13500.0,inactive
3,CHEMBL426082,O=C1C(=O)N(Cc2cc3ccccc3s2)c2ccccc21,13110.0,inactive
4,CHEMBL187717,O=C1C(=O)N(Cc2cc3ccccc3s2)c2c1cccc2[N+](=O)[O-],2000.0,intermediate


In [27]:
res_df.to_csv('bioactivity_preprocessed_data.csv', index=False)