
**Installing libraries**

Install the ChEMBL web service package so that we can retrieve bioactivity data from the ChEMBL Database.

In [1]:
! pip install chembl_webresource_client

Collecting chembl_webresource_client
  Downloading https://files.pythonhosted.org/packages/2e/48/0db29040c92726fcc6f99a5bc89e0ea8cf5a9d84753ebaaf53108792da2a/chembl-webresource-client-0.10.2.tar.gz (51kB)
Collecting requests-cache>=0.4.7
  Downloading https://files.pythonhosted.org/packages/7f/55/9b1c40eb83c16d8fc79c5f6c2ffade04208b080670fbfc35e0a5effb5a92/requests_cache-0.5.2-py2.py3-none-any.whl
Building wheels for collected packages: chembl-webresource-client
  Building wheel for chembl-webresource-client (setup.py) ... done
  Created wheel for chembl-webresource-client: fil

Install Miniconda and RDKit to work with molecular data.

In [2]:
! wget https://repo.anaconda.com/miniconda/Miniconda3-py37_4.8.2-Linux-x86_64.sh
! chmod +x Miniconda3-py37_4.8.2-Linux-x86_64.sh
! bash ./Miniconda3-py37_4.8.2-Linux-x86_64.sh -b -f -p /usr/local
! conda install -c rdkit rdkit -y
import sys
sys.path.append('/usr/local/lib/python3.7/site-packages/') 

--2020-08-14 06:33:15--  https://repo.anaconda.com/miniconda/Miniconda3-py37_4.8.2-Linux-x86_64.sh
Resolving repo.anaconda.com (repo.anaconda.com)... 104.16.130.3, 104.16.131.3, 2606:4700::6810:8203, ...
Connecting to repo.anaconda.com (repo.anaconda.com)|104.16.130.3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 85055499 (81M) [application/x-sh]
Saving to: ‘Miniconda3-py37_4.8.2-Linux-x86_64.sh’


2020-08-14 06:33:15 (193 MB/s) - ‘Miniconda3-py37_4.8.2-Linux-x86_64.sh’ saved [85055499/85055499]

PREFIX=/usr/local
Unpacking payload ...
Collecting package metadata (current_repodata.json): - \ done
Solving environment: / - done

## Package Plan ##

  environment location: /usr/local

  added / updated specs:
    - _libgcc_mutex==0.1=main
    - asn1crypto==1.3.0=py37_0
    - ca-certificates==2020.1.1=0
    - certifi==2019.11.28=py37_0
    - cffi==1.14.0=py37h2e261b9_0
    - chardet==3.0.4=py37_1003
    - conda-package-handling==1.6.0=py37h7b6447c_0
    

**Importing libraries**

In [3]:
from chembl_webresource_client.new_client import new_client
import pandas as pd
import math
from rdkit.Chem import PandasTools

**Search for target protein**

In [4]:
target = new_client.target
molecule = new_client.molecule
bioactivities = new_client.activity

In [5]:
uniprot_id = 'P00533'
# Get target information from ChEMBL but restrict to specified values only
target_query = target.get(target_components__accession=uniprot_id) \
                     .only('target_chembl_id', 'organism', 'pref_name', 'target_type')
#print(type(target_query))
targets = pd.DataFrame.from_records(target_query)
targets

Unnamed: 0,organism,pref_name,target_chembl_id,target_type
0,Homo sapiens,Epidermal growth factor receptor erbB1,CHEMBL203,SINGLE PROTEIN
1,Homo sapiens,Epidermal growth factor receptor erbB1,CHEMBL203,SINGLE PROTEIN
2,Homo sapiens,Epidermal growth factor receptor and ErbB2 (HE...,CHEMBL2111431,PROTEIN FAMILY
3,Homo sapiens,Epidermal growth factor receptor,CHEMBL2363049,PROTEIN FAMILY
4,Homo sapiens,MER intracellular domain/EGFR extracellular do...,CHEMBL3137284,CHIMERIC PROTEIN



Select CHEMBL203 as the target of interest.

CHEMBL203 is a single-protein target representing the human epidermal growth factor receptor (EGFR, also named erbB1).


In [6]:
selected_target = targets.target_chembl_id[0]
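
Picking row 0 depends on the order in which the web service returns its records. As a minimal sketch, assuming the same columns as the table above, the target could instead be selected by its `target_type` and `organism`. The `targets_demo` frame is a hand-built stand-in for the query result, not a live ChEMBL call.

```python
import pandas as pd

# Hand-built stand-in for the query result above (IDs and types copied from that table)
targets_demo = pd.DataFrame({
    "organism": ["Homo sapiens"] * 5,
    "target_chembl_id": ["CHEMBL203", "CHEMBL203", "CHEMBL2111431",
                         "CHEMBL2363049", "CHEMBL3137284"],
    "target_type": ["SINGLE PROTEIN", "SINGLE PROTEIN", "PROTEIN FAMILY",
                    "PROTEIN FAMILY", "CHIMERIC PROTEIN"],
})

# Keep only human single-protein entries, then collapse repeated rows
single = targets_demo[
    (targets_demo["target_type"] == "SINGLE PROTEIN")
    & (targets_demo["organism"] == "Homo sapiens")
]
candidate_ids = single["target_chembl_id"].drop_duplicates()

# Exactly one ID should remain; fail loudly otherwise instead of guessing
assert len(candidate_ids) == 1, "expected exactly one single-protein target"
selected_target_demo = candidate_ids.iloc[0]
print(selected_target_demo)
```

Selecting by content rather than position keeps the choice stable if the service ever returns the rows in a different order.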

We want to download all molecules that have been tested against our target of interest, considering only:
- human proteins
- bioactivity type IC50
- exact measurements (relation '=')
- binding data (assay type 'B')

In [7]:
bioact = bioactivities.filter(target_chembl_id = selected_target).filter(type = 'IC50').filter(relation = '=').filter(assay_type = 'B') \
                      .only('activity_id','assay_chembl_id','canonical_smiles', 'assay_description', 'assay_type', \
                            'molecule_chembl_id', 'type', 'units', 'relation', 'value', \
                            'target_chembl_id', 'target_organism')
len(bioact), len(bioact[0]), type(bioact), type(bioact[0])
bioact_df = pd.DataFrame.from_records(bioact)
bioact_df.head(10)

Unnamed: 0,activity_id,assay_chembl_id,assay_description,assay_type,canonical_smiles,molecule_chembl_id,relation,target_chembl_id,target_organism,type,units,value
0,32260,CHEMBL674637,Inhibitory activity towards tyrosine phosphory...,B,Cc1cc(C)c(/C=C2\C(=O)Nc3ncnc(Nc4ccc(F)c(Cl)c4)...,CHEMBL68920,=,CHEMBL203,Homo sapiens,IC50,uM,0.041
1,32260,CHEMBL674637,Inhibitory activity towards tyrosine phosphory...,B,Cc1cc(C)c(/C=C2\C(=O)Nc3ncnc(Nc4ccc(F)c(Cl)c4)...,CHEMBL68920,=,CHEMBL203,Homo sapiens,IC50,uM,0.041
2,32267,CHEMBL674637,Inhibitory activity towards tyrosine phosphory...,B,Cc1cc(C(=O)N2CCOCC2)[nH]c1/C=C1\C(=O)Nc2ncnc(N...,CHEMBL69960,=,CHEMBL203,Homo sapiens,IC50,uM,0.17
3,32680,CHEMBL677833,In vitro inhibition of Epidermal growth factor...,B,CN(c1ccccc1)c1ncnc2ccc(N/N=N/Cc3ccccn3)cc12,CHEMBL137635,=,CHEMBL203,Homo sapiens,IC50,uM,9.3
4,32770,CHEMBL674643,Inhibitory concentration of EGF dependent auto...,B,CC(=C(C#N)C#N)c1ccc(NC(=O)CCC(=O)O)cc1,CHEMBL306988,=,CHEMBL203,Homo sapiens,IC50,uM,500.0
5,32772,CHEMBL674643,Inhibitory concentration of EGF dependent auto...,B,O=C(O)/C=C/c1ccc(O)cc1,CHEMBL66879,=,CHEMBL203,Homo sapiens,IC50,uM,3000.0
6,32780,CHEMBL674643,Inhibitory concentration of EGF dependent auto...,B,N#CC(C#N)=Cc1cc(O)ccc1[N+](=O)[O-],CHEMBL77085,=,CHEMBL203,Homo sapiens,IC50,uM,96.0
7,33406,CHEMBL674637,Inhibitory activity towards tyrosine phosphory...,B,Cc1cc(C(=O)NCCN2CCOCC2)[nH]c1/C=C1\C(=O)N(C)c2...,CHEMBL443268,=,CHEMBL203,Homo sapiens,IC50,uM,5.31
8,34039,CHEMBL674643,Inhibitory concentration of EGF dependent auto...,B,COc1cc(/C=C(\C#N)C(=O)O)cc(OC)c1O,CHEMBL76979,=,CHEMBL203,Homo sapiens,IC50,uM,264.0
9,34041,CHEMBL674643,Inhibitory concentration of EGF dependent auto...,B,N#CC(C#N)=C(N)/C(C#N)=C/c1ccc(O)cc1,CHEMBL76589,=,CHEMBL203,Homo sapiens,IC50,uM,0.125


In [8]:
bioact_df.shape

(7178, 12)

In [9]:
bioact_df = bioact_df.dropna(axis=0, how='any') #delete entries with missing values

In [10]:
# Delete duplicates: the same molecule (molecule_chembl_id) has sometimes been tested more than once
bioact_df = bioact_df.drop_duplicates('molecule_chembl_id', keep='first')

In [11]:
bioact_df.shape

(5498, 12)
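
To make the effect of the two cleaning steps concrete, here is a small sketch on a toy table. The identifiers `MOL_A` to `MOL_C` and their values are hypothetical. It also shows one possible alternative to `keep='first'`, taking the median of repeated measurements, which is an assumption for illustration and not what this notebook does.

```python
import pandas as pd

# Hypothetical toy records; identifiers and values are made up for illustration
demo = pd.DataFrame({
    "molecule_chembl_id": ["MOL_A", "MOL_A", "MOL_B", "MOL_C"],
    "value": [0.04, 0.06, 0.17, None],  # None becomes NaN (a missing value)
})

# Same two steps as above: drop rows with missing values,
# then keep the first remaining row for each molecule
cleaned = demo.dropna(axis=0, how="any")
first_kept = (
    cleaned.drop_duplicates("molecule_chembl_id", keep="first")
    .reset_index(drop=True)
)

# Alternative (assumption): summarize repeated measurements by their median
median_kept = cleaned.groupby("molecule_chembl_id", as_index=False)["value"].median()

print(first_kept)
print(median_kept)
```

With `keep='first'`, the retained value is whichever measurement happens to come first in the table, so the result depends on row order.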

In [12]:
print(bioact_df.units.unique())
bioact_df = bioact_df.drop(bioact_df.index[~bioact_df.units.str.contains('M')])
print(bioact_df.units.unique())
bioact_df.shape

['uM' 'nM' 'M' "10'1 ug/ml" 'ug ml-1' "10'-1microM" "10'1 uM"
 "10'-1 ug/ml" "10'-2 ug/ml" "10'2 uM" '/uM' "10'-6g/ml" 'mM' 'umol/L'
 'nmol/L']
['uM' 'nM' 'M' "10'-1microM" "10'1 uM" "10'2 uM" '/uM' 'mM']


(5428, 12)
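
The filter above keeps every unit string that contains a capital `M`. The sketch below replays it on the unit strings printed above, without touching `bioact_df`, to show which units survive. Note that the match is case-sensitive, so molar units written as `umol/L` and `nmol/L` are dropped together with the mass-based ones.

```python
import pandas as pd

# Unit strings copied from the first printed list above
units = pd.Series([
    "uM", "nM", "M", "10'1 ug/ml", "ug ml-1", "10'-1microM", "10'1 uM",
    "10'-1 ug/ml", "10'-2 ug/ml", "10'2 uM", "/uM", "10'-6g/ml", "mM",
    "umol/L", "nmol/L",
])

# str.contains is case-sensitive by default, so only a capital 'M' matches
keep_mask = units.str.contains("M")
kept = units[keep_mask].tolist()
dropped = units[~keep_mask].tolist()

print("kept:   ", kept)
print("dropped:", dropped)
```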

In [13]:
bioact_df = bioact_df.reset_index(drop=True) 
bioact_df.head()

Unnamed: 0,activity_id,assay_chembl_id,assay_description,assay_type,canonical_smiles,molecule_chembl_id,relation,target_chembl_id,target_organism,type,units,value
0,32260,CHEMBL674637,Inhibitory activity towards tyrosine phosphory...,B,Cc1cc(C)c(/C=C2\C(=O)Nc3ncnc(Nc4ccc(F)c(Cl)c4)...,CHEMBL68920,=,CHEMBL203,Homo sapiens,IC50,uM,0.041
1,32267,CHEMBL674637,Inhibitory activity towards tyrosine phosphory...,B,Cc1cc(C(=O)N2CCOCC2)[nH]c1/C=C1\C(=O)Nc2ncnc(N...,CHEMBL69960,=,CHEMBL203,Homo sapiens,IC50,uM,0.17
2,32680,CHEMBL677833,In vitro inhibition of Epidermal growth factor...,B,CN(c1ccccc1)c1ncnc2ccc(N/N=N/Cc3ccccn3)cc12,CHEMBL137635,=,CHEMBL203,Homo sapiens,IC50,uM,9.3
3,32770,CHEMBL674643,Inhibitory concentration of EGF dependent auto...,B,CC(=C(C#N)C#N)c1ccc(NC(=O)CCC(=O)O)cc1,CHEMBL306988,=,CHEMBL203,Homo sapiens,IC50,uM,500.0
4,32772,CHEMBL674643,Inhibitory concentration of EGF dependent auto...,B,O=C(O)/C=C/c1ccc(O)cc1,CHEMBL66879,=,CHEMBL203,Homo sapiens,IC50,uM,3000.0


In [14]:
bioact_df.to_csv('bioactivity-raw.csv', index = False)

In [15]:
from google.colab import drive
drive.mount('/content/gdrive/', force_remount=True)

Mounted at /content/gdrive/


In [16]:
! cp bioactivity-raw.csv "/content/gdrive/My Drive/Colab Notebooks/Data"

Convert all value units to nM

In [17]:
def convert_to_NM(unit, bioactivity):
    """Convert a bioactivity value reported in `unit` to nM."""
    value = float(bioactivity)
    if unit == "nM":
        return value
    elif unit == "pM":
        return value / 1000
    elif unit == "10'-11M":
        return value / 100
    elif unit == "10'-10M":
        return value / 10
    elif unit == "10'-8M":
        return value * 10
    elif unit == "10'-1microM" or unit == "10'-7M":
        return value * 100
    elif unit == "uM" or unit == "/uM" or unit == "10'-6M":
        return value * 1000
    elif unit == "10'1 uM":
        return value * 10000
    elif unit == "10'2 uM":
        return value * 100000
    elif unit == "mM":
        return value * 1000000
    elif unit == "M":
        return value * 1000000000
    else:
        # Unrecognized unit: report it and return NaN rather than a wrong number
        print('unit not recognized...', unit)
        return float('nan')

In [18]:
bioactivity_nM = []
for i, row in bioact_df.iterrows():
    bioact_nM = convert_to_NM(row['units'], row['value'])
    bioactivity_nM.append(bioact_nM)
bioact_df['value'] = bioactivity_nM
bioact_df['units'] = 'nM'
bioact_df.head(10)

Unnamed: 0,activity_id,assay_chembl_id,assay_description,assay_type,canonical_smiles,molecule_chembl_id,relation,target_chembl_id,target_organism,type,units,value
0,32260,CHEMBL674637,Inhibitory activity towards tyrosine phosphory...,B,Cc1cc(C)c(/C=C2\C(=O)Nc3ncnc(Nc4ccc(F)c(Cl)c4)...,CHEMBL68920,=,CHEMBL203,Homo sapiens,IC50,nM,41.0
1,32267,CHEMBL674637,Inhibitory activity towards tyrosine phosphory...,B,Cc1cc(C(=O)N2CCOCC2)[nH]c1/C=C1\C(=O)Nc2ncnc(N...,CHEMBL69960,=,CHEMBL203,Homo sapiens,IC50,nM,170.0
2,32680,CHEMBL677833,In vitro inhibition of Epidermal growth factor...,B,CN(c1ccccc1)c1ncnc2ccc(N/N=N/Cc3ccccn3)cc12,CHEMBL137635,=,CHEMBL203,Homo sapiens,IC50,nM,9300.0
3,32770,CHEMBL674643,Inhibitory concentration of EGF dependent auto...,B,CC(=C(C#N)C#N)c1ccc(NC(=O)CCC(=O)O)cc1,CHEMBL306988,=,CHEMBL203,Homo sapiens,IC50,nM,500000.0
4,32772,CHEMBL674643,Inhibitory concentration of EGF dependent auto...,B,O=C(O)/C=C/c1ccc(O)cc1,CHEMBL66879,=,CHEMBL203,Homo sapiens,IC50,nM,3000000.0
5,32780,CHEMBL674643,Inhibitory concentration of EGF dependent auto...,B,N#CC(C#N)=Cc1cc(O)ccc1[N+](=O)[O-],CHEMBL77085,=,CHEMBL203,Homo sapiens,IC50,nM,96000.0
6,33406,CHEMBL674637,Inhibitory activity towards tyrosine phosphory...,B,Cc1cc(C(=O)NCCN2CCOCC2)[nH]c1/C=C1\C(=O)N(C)c2...,CHEMBL443268,=,CHEMBL203,Homo sapiens,IC50,nM,5310.0
7,34039,CHEMBL674643,Inhibitory concentration of EGF dependent auto...,B,COc1cc(/C=C(\C#N)C(=O)O)cc(OC)c1O,CHEMBL76979,=,CHEMBL203,Homo sapiens,IC50,nM,264000.0
8,34041,CHEMBL674643,Inhibitory concentration of EGF dependent auto...,B,N#CC(C#N)=C(N)/C(C#N)=C/c1ccc(O)cc1,CHEMBL76589,=,CHEMBL203,Homo sapiens,IC50,nM,125.0
9,34049,CHEMBL674643,Inhibitory concentration of EGF dependent auto...,B,N#CC(C#N)=Cc1ccc(O)c(O)c1,CHEMBL76904,=,CHEMBL203,Homo sapiens,IC50,nM,35000.0


In [19]:
bioact_df.shape

(5428, 12)
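
The row-by-row loop above could also be written as a lookup table plus one vectorized multiplication. This is a sketch on a toy frame, not a replacement for the cells above. The dictionary name `TO_NM` and the demo values are hypothetical, while the unit strings and factors mirror the cases handled in `convert_to_NM`.

```python
import pandas as pd

# Factor that converts each unit string to nM (same cases as convert_to_NM)
TO_NM = {
    "pM": 1e-3, "10'-11M": 1e-2, "10'-10M": 1e-1, "nM": 1.0,
    "10'-8M": 1e1, "10'-1microM": 1e2, "10'-7M": 1e2,
    "uM": 1e3, "/uM": 1e3, "10'-6M": 1e3,
    "10'1 uM": 1e4, "10'2 uM": 1e5, "mM": 1e6, "M": 1e9,
}

# Hypothetical demo rows; values are stored as strings, as a web service might return them
demo = pd.DataFrame({
    "units": ["uM", "nM", "mM", "10'1 uM", "ug ml-1"],
    "value": ["0.041", "5", "0.5", "2", "1"],
})

# Units missing from the table map to NaN, which flags them instead of hiding them
factors = demo["units"].map(TO_NM)
demo["value_nM"] = demo["value"].astype(float) * factors

print(demo)
```

Rows whose unit is not in the table end up as NaN in `value_nM`, so they can be found with `isna()` and handled explicitly.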

In [20]:
selection = ['molecule_chembl_id','canonical_smiles','type','value','units']
df1 = bioact_df[selection]
df1.head(10)

Unnamed: 0,molecule_chembl_id,canonical_smiles,type,value,units
0,CHEMBL68920,Cc1cc(C)c(/C=C2\C(=O)Nc3ncnc(Nc4ccc(F)c(Cl)c4)...,IC50,41.0,nM
1,CHEMBL69960,Cc1cc(C(=O)N2CCOCC2)[nH]c1/C=C1\C(=O)Nc2ncnc(N...,IC50,170.0,nM
2,CHEMBL137635,CN(c1ccccc1)c1ncnc2ccc(N/N=N/Cc3ccccn3)cc12,IC50,9300.0,nM
3,CHEMBL306988,CC(=C(C#N)C#N)c1ccc(NC(=O)CCC(=O)O)cc1,IC50,500000.0,nM
4,CHEMBL66879,O=C(O)/C=C/c1ccc(O)cc1,IC50,3000000.0,nM
5,CHEMBL77085,N#CC(C#N)=Cc1cc(O)ccc1[N+](=O)[O-],IC50,96000.0,nM
6,CHEMBL443268,Cc1cc(C(=O)NCCN2CCOCC2)[nH]c1/C=C1\C(=O)N(C)c2...,IC50,5310.0,nM
7,CHEMBL76979,COc1cc(/C=C(\C#N)C(=O)O)cc(OC)c1O,IC50,264000.0,nM
8,CHEMBL76589,N#CC(C#N)=C(N)/C(C#N)=C/c1ccc(O)cc1,IC50,125.0,nM
9,CHEMBL76904,N#CC(C#N)=Cc1ccc(O)c(O)c1,IC50,35000.0,nM


Convert the IC50 values to pIC50

In [21]:
df2 = df1.rename(columns= {'value':'value_IC50'})
df2.head(10)

Unnamed: 0,molecule_chembl_id,canonical_smiles,type,value_IC50,units
0,CHEMBL68920,Cc1cc(C)c(/C=C2\C(=O)Nc3ncnc(Nc4ccc(F)c(Cl)c4)...,IC50,41.0,nM
1,CHEMBL69960,Cc1cc(C(=O)N2CCOCC2)[nH]c1/C=C1\C(=O)Nc2ncnc(N...,IC50,170.0,nM
2,CHEMBL137635,CN(c1ccccc1)c1ncnc2ccc(N/N=N/Cc3ccccn3)cc12,IC50,9300.0,nM
3,CHEMBL306988,CC(=C(C#N)C#N)c1ccc(NC(=O)CCC(=O)O)cc1,IC50,500000.0,nM
4,CHEMBL66879,O=C(O)/C=C/c1ccc(O)cc1,IC50,3000000.0,nM
5,CHEMBL77085,N#CC(C#N)=Cc1cc(O)ccc1[N+](=O)[O-],IC50,96000.0,nM
6,CHEMBL443268,Cc1cc(C(=O)NCCN2CCOCC2)[nH]c1/C=C1\C(=O)N(C)c2...,IC50,5310.0,nM
7,CHEMBL76979,COc1cc(/C=C(\C#N)C(=O)O)cc(OC)c1O,IC50,264000.0,nM
8,CHEMBL76589,N#CC(C#N)=C(N)/C(C#N)=C/c1ccc(O)cc1,IC50,125.0,nM
9,CHEMBL76904,N#CC(C#N)=Cc1ccc(O)c(O)c1,IC50,35000.0,nM


In [22]:
df2 = df2[~df2['canonical_smiles'].isnull()]
print(df2.shape)
df2.head()

(5428, 5)


Unnamed: 0,molecule_chembl_id,canonical_smiles,type,value_IC50,units
0,CHEMBL68920,Cc1cc(C)c(/C=C2\C(=O)Nc3ncnc(Nc4ccc(F)c(Cl)c4)...,IC50,41.0,nM
1,CHEMBL69960,Cc1cc(C(=O)N2CCOCC2)[nH]c1/C=C1\C(=O)Nc2ncnc(N...,IC50,170.0,nM
2,CHEMBL137635,CN(c1ccccc1)c1ncnc2ccc(N/N=N/Cc3ccccn3)cc12,IC50,9300.0,nM
3,CHEMBL306988,CC(=C(C#N)C#N)c1ccc(NC(=O)CCC(=O)O)cc1,IC50,500000.0,nM
4,CHEMBL66879,O=C(O)/C=C/c1ccc(O)cc1,IC50,3000000.0,nM


To make the IC50 data more uniformly distributed, we convert IC50 to a negative logarithmic scale, pIC50 = -log10(IC50), with IC50 expressed in mol/L.

The cells below:

- Take the IC50 values from the value_IC50 column and convert them from nM to M by multiplying by 10$^{-9}$
- Apply -log10 to the molar value (for nM input this is equivalent to 9 - log10(IC50))
- Store the result in a new value_pIC50 column, keeping value_IC50 alongside it
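
As a quick check of the arithmetic in the steps above, the sketch below computes pIC50 in two ways on the first five IC50 values from the table: explicitly via mol/L, and with the shortcut 9 - log10(IC50 in nM). It uses NumPy's vectorized `log10`, which is an assumption about style and not the approach taken in the cells that follow.

```python
import numpy as np
import pandas as pd

# First five IC50 values (in nM) from the table above
ic50_nM = pd.Series([41.0, 170.0, 9300.0, 500000.0, 3000000.0])

# Two-step route: nM -> mol/L, then negative decadic logarithm
ic50_M = ic50_nM * 1e-9
pIC50_two_step = -np.log10(ic50_M)

# Shortcut: -log10(x * 10**-9) = 9 - log10(x)
pIC50_shortcut = 9 - np.log10(ic50_nM)

print(pIC50_shortcut.round(6).tolist())
```

Both routes should agree to within floating-point error, and an IC50 above 1 M (10^9 nM) would produce a negative pIC50.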


In [23]:
df3 = df2.reset_index(drop=True)

In [24]:
# dfx = df2.head(5)
ic50 = df3['value_IC50'].astype(float)

In [25]:
print(len(ic50))
print(ic50.head(10))

5428
0         41.0
1        170.0
2       9300.0
3     500000.0
4    3000000.0
5      96000.0
6       5310.0
7     264000.0
8        125.0
9      35000.0
Name: value_IC50, dtype: float64


In [26]:
pIC50 = pd.Series(dtype=float)  # explicit dtype avoids the empty-Series warning
i = 0
while i < len(df3.value_IC50):
    # pIC50 = -log10(IC50 in mol/L); for IC50 in nM: -log10(IC50 * 10**-9) = 9 - log10(IC50)
    value = 9 - math.log10(ic50[i])
    if value < 0:
        print("Negative pIC50 value at index " + str(i))
    pIC50.at[i] = value
    i += 1

df3['value_pIC50'] = pIC50
df3.head()



Unnamed: 0,molecule_chembl_id,canonical_smiles,type,value_IC50,units,value_pIC50
0,CHEMBL68920,Cc1cc(C)c(/C=C2\C(=O)Nc3ncnc(Nc4ccc(F)c(Cl)c4)...,IC50,41.0,nM,7.387216
1,CHEMBL69960,Cc1cc(C(=O)N2CCOCC2)[nH]c1/C=C1\C(=O)Nc2ncnc(N...,IC50,170.0,nM,6.769551
2,CHEMBL137635,CN(c1ccccc1)c1ncnc2ccc(N/N=N/Cc3ccccn3)cc12,IC50,9300.0,nM,5.031517
3,CHEMBL306988,CC(=C(C#N)C#N)c1ccc(NC(=O)CCC(=O)O)cc1,IC50,500000.0,nM,3.30103
4,CHEMBL66879,O=C(O)/C=C/c1ccc(O)cc1,IC50,3000000.0,nM,2.522879


In [27]:
selection = ['molecule_chembl_id','canonical_smiles','value_IC50','value_pIC50','units']
df = df3[selection]
df.head(10)

Unnamed: 0,molecule_chembl_id,canonical_smiles,value_IC50,value_pIC50,units
0,CHEMBL68920,Cc1cc(C)c(/C=C2\C(=O)Nc3ncnc(Nc4ccc(F)c(Cl)c4)...,41.0,7.387216,nM
1,CHEMBL69960,Cc1cc(C(=O)N2CCOCC2)[nH]c1/C=C1\C(=O)Nc2ncnc(N...,170.0,6.769551,nM
2,CHEMBL137635,CN(c1ccccc1)c1ncnc2ccc(N/N=N/Cc3ccccn3)cc12,9300.0,5.031517,nM
3,CHEMBL306988,CC(=C(C#N)C#N)c1ccc(NC(=O)CCC(=O)O)cc1,500000.0,3.30103,nM
4,CHEMBL66879,O=C(O)/C=C/c1ccc(O)cc1,3000000.0,2.522879,nM
5,CHEMBL77085,N#CC(C#N)=Cc1cc(O)ccc1[N+](=O)[O-],96000.0,4.017729,nM
6,CHEMBL443268,Cc1cc(C(=O)NCCN2CCOCC2)[nH]c1/C=C1\C(=O)N(C)c2...,5310.0,5.274905,nM
7,CHEMBL76979,COc1cc(/C=C(\C#N)C(=O)O)cc(OC)c1O,264000.0,3.578396,nM
8,CHEMBL76589,N#CC(C#N)=C(N)/C(C#N)=C/c1ccc(O)cc1,125.0,6.90309,nM
9,CHEMBL76904,N#CC(C#N)=Cc1ccc(O)c(O)c1,35000.0,4.455932,nM


In [28]:
df.to_csv('filtered-data.csv', index = False)

In [29]:
! cp filtered-data.csv "/content/gdrive/My Drive/Colab Notebooks/Data"

In [33]:
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

In [34]:
def lipinski(smiles, verbose=False):
    """Compute the Lipinski descriptors MW, LogP, HBD and HBA for each SMILES string."""
    # `verbose` is kept for compatibility with existing calls; it is currently unused
    rows = []
    for element in smiles:
        mol = Chem.MolFromSmiles(element)
        rows.append([
            Descriptors.MolWt(mol),       # molecular weight
            Descriptors.MolLogP(mol),     # calculated octanol-water partition coefficient
            Lipinski.NumHDonors(mol),     # number of hydrogen-bond donors
            Lipinski.NumHAcceptors(mol),  # number of hydrogen-bond acceptors
        ])

    # Build a float array so all four columns share one dtype, as before
    baseData = np.array(rows, dtype=float)
    column_titles = ["MW", "LogP", "HBD", "HBA"]
    descriptors = pd.DataFrame(data=baseData, columns=column_titles)

    return descriptors

In [35]:
df_lipinski = lipinski(df.canonical_smiles)

In [36]:
df_full = pd.concat([df,df_lipinski], axis=1)
df_full.head(10)

Unnamed: 0,molecule_chembl_id,canonical_smiles,value_IC50,value_pIC50,units,MW,LogP,HBD,HBA
0,CHEMBL68920,Cc1cc(C)c(/C=C2\C(=O)Nc3ncnc(Nc4ccc(F)c(Cl)c4)...,41.0,7.387216,nM,383.814,4.45034,3.0,4.0
1,CHEMBL69960,Cc1cc(C(=O)N2CCOCC2)[nH]c1/C=C1\C(=O)Nc2ncnc(N...,170.0,6.769551,nM,482.903,3.61432,3.0,6.0
2,CHEMBL137635,CN(c1ccccc1)c1ncnc2ccc(N/N=N/Cc3ccccn3)cc12,9300.0,5.031517,nM,369.432,4.772,1.0,6.0
3,CHEMBL306988,CC(=C(C#N)C#N)c1ccc(NC(=O)CCC(=O)O)cc1,500000.0,3.30103,nM,283.287,2.31056,2.0,4.0
4,CHEMBL66879,O=C(O)/C=C/c1ccc(O)cc1,3000000.0,2.522879,nM,164.16,1.49,2.0,2.0
5,CHEMBL77085,N#CC(C#N)=Cc1cc(O)ccc1[N+](=O)[O-],96000.0,4.017729,nM,215.168,1.73096,1.0,5.0
6,CHEMBL443268,Cc1cc(C(=O)NCCN2CCOCC2)[nH]c1/C=C1\C(=O)N(C)c2...,5310.0,5.274905,nM,539.999,3.22822,3.0,7.0
7,CHEMBL76979,COc1cc(/C=C(\C#N)C(=O)O)cc(OC)c1O,264000.0,3.578396,nM,249.222,1.40098,2.0,5.0
8,CHEMBL76589,N#CC(C#N)=C(N)/C(C#N)=C/c1ccc(O)cc1,125.0,6.90309,nM,236.234,1.55914,2.0,5.0
9,CHEMBL76904,N#CC(C#N)=Cc1ccc(O)c(O)c1,35000.0,4.455932,nM,186.17,1.52836,2.0,4.0
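
The four descriptors are the ones used in Lipinski's rule of five, so a natural next step is to count rule violations per molecule. The sketch below assumes the commonly cited thresholds (MW at most 500, LogP at most 5, at most 5 hydrogen-bond donors, at most 10 hydrogen-bond acceptors) and the common convention of tolerating one violation. The column names `ro5_violations` and `ro5_pass` are hypothetical, and the two rows are copied from the table above.

```python
import pandas as pd

# Two rows copied from the descriptor table above
desc = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL68920", "CHEMBL443268"],
    "MW": [383.814, 539.999],
    "LogP": [4.45034, 3.22822],
    "HBD": [3.0, 3.0],
    "HBA": [4.0, 7.0],
})

# Count how many of the four thresholds each molecule exceeds
violations = (
    (desc["MW"] > 500).astype(int)
    + (desc["LogP"] > 5).astype(int)
    + (desc["HBD"] > 5).astype(int)
    + (desc["HBA"] > 10).astype(int)
)
desc["ro5_violations"] = violations

# Assumption: a molecule passes if it has at most one violation
desc["ro5_pass"] = violations <= 1

print(desc[["molecule_chembl_id", "ro5_violations", "ro5_pass"]])
```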
