***
# 1. ChEMBL data retrieval



#### We first downloaded the compounds associated with SARS-CoV-2 in the ChEMBL database using this link:

#### [ChEMBL compounds dataset](https://www.ebi.ac.uk/chembl/g/#browse/compounds/filter/_metadata.compound_records.src_id%3A52)


#### Alternatively, the data can be accessed through the [ChEMBL database](https://www.ebi.ac.uk/chembl/) > Explore SARS-CoV-2 data > 8.2K Compounds section.

#### After downloading, we saved the raw dataset as _chembl_covid_raw.csv_
***
 

In [24]:
# imports
import os 
import sys
import json 
import pprint
import pandas as pd 
import numpy as np
from collections import Counter

***
# 2. Filtering by Max Phase and Small Molecule

#### We first filtered out compounds that are not small molecules or that have no _max phase_ above 0, so that the analysis uses drugs that have reached at least phase 1 of development as a baseline.
***


In [10]:
df = pd.read_csv("chembl_covid_raw.csv", delimiter=";")

print('Compounds associated with SARS-CoV-2: ', df['ChEMBL ID'].nunique())

subset = df.loc[(df['Max Phase'] > 0) & (df['Type'] == "Small molecule")]

cols = ["ChEMBL ID", "Name", "Type", "Max Phase"]

subset = subset[cols]

Compounds associated with SARS-CoV-2:  8208
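
The same condition also drops any row whose `Max Phase` is missing, because pandas evaluates a comparison against a missing value as `False`. A minimal sketch on made-up rows shows this; the compound IDs and names below are hypothetical and only the column names are taken from the cell above.

```python
import numpy as np
import pandas as pd

# Toy rows using the column names of the ChEMBL export; all values are invented.
toy = pd.DataFrame({
    "ChEMBL ID": ["CHEMBL_A", "CHEMBL_B", "CHEMBL_C", "CHEMBL_D"],
    "Name": ["A", "B", "C", "D"],
    "Type": ["Small molecule", "Small molecule", "Antibody", "Small molecule"],
    "Max Phase": [4, np.nan, 3, 0],
})

# Same condition as in the cell above.
kept = toy.loc[(toy["Max Phase"] > 0) & (toy["Type"] == "Small molecule")]

# CHEMBL_B is dropped (missing phase), CHEMBL_C (not a small molecule), CHEMBL_D (phase 0).
print(kept["ChEMBL ID"].tolist())
```

Only `CHEMBL_A` passes both conditions in this toy table.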


***
# 3. Retrieve activities for each drug

#### Using a custom bash script, we queried ChEMBL through cURL for the activities (IC50 and Ki) reported for each drug.

###### A) cURL command obtained from the ChEMBL database for one activity type.

###### B) Script to generate a cURL file for each drug and activity type.

###### C) Run all cURL files in a terminal using a bash command.

###### D) Parse the responses to obtain a new table with the activities reported for each drug.

***


In [19]:
# A) - - - - - - - cURL - - - - - - - #
"""
curl -XPOST "https://www.ebi.ac.uk/chembl/elk/es/chembl_activity/_search" -H 'Content-Type: application/json' -d'{
  "size": 20,
  "from": 0,
  "_source": [
    "molecule_chembl_id",
    "_metadata.parent_molecule_data.compound_key",
    "standard_type",
    "standard_relation",
    "standard_value",
    "standard_units",
    "pchembl_value",
    "activity_comment",
    "assay_chembl_id",
    "assay_description",
    "bao_label",
    "_metadata.assay_data.assay_organism",
    "target_chembl_id",
    "target_pref_name",
    "target_organism",
    "_metadata.target_data.target_type",
    "document_chembl_id",
    "_metadata.source.src_description",
    "_metadata.assay_data.cell_chembl_id",
    "molecule_pref_name",
    "_metadata.parent_molecule_data.max_phase",
    "_metadata.parent_molecule_data.full_mwt",
    "_metadata.parent_molecule_data.num_ro5_violations",
    "_metadata.parent_molecule_data.alogp",
    "canonical_smiles",
    "data_validity_comment",
    "uo_units",
    "ligand_efficiency.bei",
    "ligand_efficiency.le",
    "ligand_efficiency.lle",
    "ligand_efficiency.sei",
    "potential_duplicate",
    "assay_type",
    "bao_format",
    "_metadata.assay_data.tissue_chembl_id",
    "_metadata.assay_data.assay_tissue",
    "_metadata.assay_data.assay_cell_type",
    "_metadata.assay_data.assay_subcellular_fraction",
    "_metadata.assay_data.assay_parameters",
    "assay_variant_accession",
    "assay_variant_mutation",
    "src_id",
    "document_journal",
    "document_year",
    "activity_properties",
    "_metadata.parent_molecule_data.image_file",
    "activity_id"
  ],
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "analyze_wildcard": true,
            "query": "molecule_chembl_id:(\"CHEMBL2019024\") AND standard_type:(\"IC50\")"
          }
        }
      ],
      "filter": []
    }
  },
  "track_total_hits": true,
  "sort": []
}'
"""

# create folders for json responses
#bash_command -> mkdir -p json_files/Ki
#bash_command -> mkdir -p json_files/IC50

print()




In [20]:
# B) - - - - - - - Python script to fill in each compound ID and activity type - - - - - - - #
MOD = '"query": "molecule_chembl_id:'
activities = ["IC50", "Ki"]       # activity types queried; they match the json_files/ subfolders
chembl_ids = subset["ChEMBL ID"].tolist()
for compound in chembl_ids:
    continue                      # ----------------------> remove to run script
    for activity in activities:
        # one cURL file per compound and activity, written to the folder that step D reads
        with open("json_files/{}/{}_{}.json".format(activity, compound, activity), "w") as g:
            with open("chembl_activities.json") as f:
                for line in f:
                    if MOD in line:
                        # assumes the template line holds only MOD followed by the closing quote
                        line = line.replace(MOD, '{}(\\"{}\\") AND standard_type:(\\"{}\\")'.format(MOD, compound, activity))
                    g.write(line)  # write every template line, not only the modified one
        print(compound, activity)
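
As an alternative to editing a text template line by line, the request body can be built as a Python dictionary and serialized with `json.dumps`, which handles the quote escaping. The sketch below is only an illustration: `build_activity_query` is a hypothetical helper, not part of the original workflow, and the long `_source` field list from step A is left out for brevity.

```python
import json

def build_activity_query(chembl_id, activity, size=20):
    """Return the request body for one compound and one activity type (hypothetical helper)."""
    query = 'molecule_chembl_id:("{}") AND standard_type:("{}")'.format(chembl_id, activity)
    return {
        "size": size,
        "from": 0,
        "query": {
            "bool": {
                "must": [{"query_string": {"analyze_wildcard": True, "query": query}}],
                "filter": [],
            }
        },
        "track_total_hits": True,
        "sort": [],
    }

# Same compound and activity as in the cURL command of step A.
body = build_activity_query("CHEMBL2019024", "IC50")
payload = json.dumps(body)  # string that could be passed to curl with -d
print(payload)
```

Building the body this way keeps the query valid JSON for any compound ID, at the cost of maintaining the field list in Python instead of in the template file.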



In [21]:
# C) - - - - - - - run all cURL files - - - - - - - #

# run in the folder that holds the .json files
#bash_command -> for f in *.json; do name=`echo $f|cut -d "." -f1`; bash $f > $name.response; done


In [23]:
# D) - - - - - - - parsing json responses  - - - - - - - #


Ki = "json_files/Ki"

responses = list(f for f in os.listdir(Ki) if f.endswith("response"))
responses = list(os.path.join(Ki, f) for f in responses)

datos = []
for f in responses:
    chembl_id = f.split("/")[-1].split("_")[0]
    with open(f) as fh:  # close the file once it has been read
        data = json.load(fh)
    data = data.get("hits").get("hits")

    for i in data:
        subdict = i
        standard_unit  = subdict.get("_source").get("standard_units")
        if standard_unit != "nM":
            continue
        standard_type  = subdict.get("_source").get("standard_type")
        standard_value = subdict.get("_source").get("standard_value")
        organism       = subdict.get("_source").get("target_organism")
        target_type    = subdict.get("_source").get("_metadata").get("target_data").get("target_type")
        target_chembl_id = subdict.get("_source").get("target_chembl_id")


        # Only nM records reach this point (see the unit check above), so no unit conversion is applied.

        if not all([standard_unit, standard_type, standard_value, organism, target_type]):
            continue
        
        if target_type != 'SINGLE PROTEIN':
            continue
        if 'coronavirus' not in organism:
            if 'Homo sapiens' not in organism:
                continue

        standard_value = float(standard_value)

        datos.append([chembl_id, standard_type, standard_value, standard_unit, target_chembl_id])

datos = pd.DataFrame(datos, columns=['chembl_id', 'standard_type', 'standard_value', 'standard_unit', 'target_chembl_id'])

print(datos)
print(datos.dtypes)

datos.to_csv("data_Ki.csv")

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - #

IC50 = "json_files/IC50"

responses = list(f for f in os.listdir(IC50) if f.endswith("response"))
responses = list(os.path.join(IC50, f) for f in responses)

datos = []
for f in responses:

    chembl_id = f.split("/")[-1].split("_")[0]
    with open(f) as fh:  # close the file once it has been read
        data = json.load(fh)
    data = data.get("hits").get("hits")
    
    for i in data:
        subdict = i

        standard_unit  = subdict.get("_source").get("standard_units")
        if standard_unit != "nM":
            continue
        standard_type  = subdict.get("_source").get("standard_type")
        standard_value = subdict.get("_source").get("standard_value")
        organism       = subdict.get("_source").get("target_organism")
        target_type    = subdict.get("_source").get("_metadata").get("target_data").get("target_type")

        target_chembl_id = subdict.get("_source").get("target_chembl_id")


        if not all([standard_unit, standard_type, standard_value, organism, target_type]):
            continue

        if target_type != 'SINGLE PROTEIN':
            continue
        if 'coronavirus' not in organism:
            if 'Homo sapiens' not in organism:
                continue
        
        standard_value = float(standard_value)

        datos.append([chembl_id, standard_type, standard_value, standard_unit, target_chembl_id])

        

datos = pd.DataFrame(datos, columns=['chembl_id', 'standard_type', 'standard_value', 'standard_unit', 'target_chembl_id'])

print(datos)
print(datos.dtypes)



datos.to_csv("data_IC50.csv")

          chembl_id standard_type  standard_value standard_unit  \
0      CHEMBL100259            Ki      51000.0000            nM   
1      CHEMBL100259            Ki          0.0012            nM   
2      CHEMBL100259            Ki      31000.0000            nM   
3      CHEMBL100259            Ki       5200.0000            nM   
4      CHEMBL100259            Ki     242000.0000            nM   
...             ...           ...             ...           ...   
18170      CHEMBL99            Ki          1.0000            nM   
18171      CHEMBL99            Ki         45.0000            nM   
18172      CHEMBL99            Ki          0.7000            nM   
18173      CHEMBL99            Ki       1400.0000            nM   
18174      CHEMBL99            Ki        800.0000            nM   

      target_chembl_id  
0           CHEMBL1997  
1           CHEMBL4502  
2           CHEMBL5780  
3           CHEMBL5551  
4        CHEMBL3509606  
...                ...  
18170       CHEMBL18
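
The Ki and IC50 loops in step D apply the same filters, so they could share one function. The sketch below is an assumption about how that might look, not the code that produced the tables above: `parse_hits` is a hypothetical name and the toy response contains invented values.

```python
def parse_hits(response, chembl_id):
    """Return [chembl_id, type, value, unit, target] rows from one parsed response."""
    rows = []
    for hit in response.get("hits", {}).get("hits", []):
        src = hit.get("_source", {})
        unit = src.get("standard_units")
        std_type = src.get("standard_type")
        value = src.get("standard_value")
        organism = src.get("target_organism")
        target_type = src.get("_metadata", {}).get("target_data", {}).get("target_type")

        # Keep only complete nM records, as in the cells above.
        if unit != "nM" or not all([std_type, value, organism, target_type]):
            continue
        if target_type != "SINGLE PROTEIN":
            continue
        if "coronavirus" not in organism and "Homo sapiens" not in organism:
            continue
        rows.append([chembl_id, std_type, float(value), unit, src.get("target_chembl_id")])
    return rows

def make_hit(unit, organism):
    """Build one invented hit with the fields the parser reads."""
    return {"_source": {
        "standard_units": unit, "standard_type": "Ki", "standard_value": "120",
        "target_organism": organism, "target_chembl_id": "CHEMBL_T1",
        "_metadata": {"target_data": {"target_type": "SINGLE PROTEIN"}},
    }}

toy_response = {"hits": {"hits": [
    make_hit("nM", "Homo sapiens"),       # kept
    make_hit("ug ml-1", "Homo sapiens"),  # dropped: unit is not nM
    make_hit("nM", "Mus musculus"),       # dropped: organism is neither human nor a coronavirus
]}}

rows = parse_hits(toy_response, "CHEMBL_X")
print(rows)
```

With a function like this, each activity folder would only need a loop that loads every `.response` file and extends one list with the returned rows.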

***
# 4. Filtering drugs using activities values

#### --> Continued in the filtering_drugs Jupyter notebook
***