# ATC_Dictionary

This notebook builds a dictionary that maps ATC codes to gctx sig_ids. The sig_ids identify the columns for the corresponding perturbagens in the main gctx file, which contains all perturbagen and gene expression data.

Steps:
1. First, a dataframe was created relating ATC codes to PubChem CIDs, using PubChem's database. The original JSON data comes from https://pubchem.ncbi.nlm.nih.gov/source/11950#data=Annotations (5 pages). I saved each JSON page as a text file (found in this directory); the code parses the JSON content of those files.
2. Next, I used the PubChem Identifier Exchange Service (https://pubchem.ncbi.nlm.nih.gov/idexchange/idexchange.cgi) to find the InChIKey for each CID, saved the result as a tab-separated text file, and loaded it as a dataframe.
3. Then I downloaded the compoundinfo data for the gctx from clue.io (https://clue.io/data/CMap2020#LINCS2020) and loaded it, since it maps InChIKeys to pert_ids.
4. These tables are then merged into one table relating ATC codes to pert_ids.
5. To find the sig_ids, a function takes an ATC prefix, finds all ATC codes in the dictionary that start with it, collects the corresponding pert_ids, and inner-merges them with the sig_info table. The resulting sig_ids are the column ids used to pull the drugs matching the ATC prefix out of the gctx database.
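
Prefix matching works here because ATC codes are hierarchical: a full code has five levels, and each level extends the previous one. The following is a minimal sketch of that structure; `atc_levels` is a hypothetical helper for illustration and is not part of the notebook.

```python
def atc_levels(code):
    """Return the prefix of a full 7-character ATC code at each of its five levels."""
    # Level boundaries: 1 letter, +2 digits, +1 letter, +1 letter, +2 digits
    cut_points = [1, 3, 4, 5, 7]
    return [code[:n] for n in cut_points]

levels = atc_levels("V03AB22")
print(levels)
```

Any of these prefixes can be passed to the lookup function in step 5; a shorter prefix selects a broader group of drugs.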

In [7]:
import pandas as pd
import json
import numpy as np

## Step 1: Processing JSON data (ATC to PubChem CID)

Initializing lists which will be combined into a dataframe

In [8]:
atc_codes = []
cids = []
names = []

Processing raw data

In [9]:
# Pages 1-5 of the PubChem ATC annotations, one JSON document per file
for page in range(1, 6):
    with open(f"pubchem_ATC_json_pg{page}.txt", 'r') as fp:
        data = json.load(fp)
    annotation = data["Annotations"]["Annotation"]

    # Iterate over every entry on the page (each page holds up to 1000)
    # rather than hard-coding the length of each page
    for entry in annotation:
        # The fifth markup string has the form "<ATC code> <separator> <compound name>"
        tokens = entry["Data"][0]["Value"]["StringWithMarkup"][4]["String"].split()
        atc_codes.append(tokens[0])
        cids.append(entry["LinkedRecords"]["CID"][0])
        names.append(" ".join(tokens[2:]))
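
To make the indexing above easier to follow, here is a small sketch that parses a single annotation entry. The record is hypothetical: it mirrors only the fields the notebook reads, and the "CODE - Name" layout of the string is an assumption inferred from the `split()` indices. `parse_annotation` is an illustrative helper, not part of the notebook.

```python
import json

# Hypothetical entry containing only the fields read above
raw = json.dumps({
    "Data": [{"Value": {"StringWithMarkup": [
        {"String": "placeholder 1"}, {"String": "placeholder 2"},
        {"String": "placeholder 3"}, {"String": "placeholder 4"},
        {"String": "V03AB22 - Amyl nitrite"},  # assumed layout
    ]}}],
    "LinkedRecords": {"CID": [10026]},
})

def parse_annotation(entry):
    """Extract (ATC code, first linked CID, compound name) from one entry."""
    tokens = entry["Data"][0]["Value"]["StringWithMarkup"][4]["String"].split()
    atc_code = tokens[0]          # first token is the ATC code
    name = " ".join(tokens[2:])   # skip the separator token at index 1
    cid = entry["LinkedRecords"]["CID"][0]
    return atc_code, cid, name

parsed = parse_annotation(json.loads(raw))
print(parsed)
```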

Checking if lists populated correctly

In [10]:
atc_codes

['V03AB22',
 'A07AA12',
 'M01AB12',
 'N05AX17',
 'L01EX06',
 'B01AC26',
 'G01AA04',
 'N06BX15',
 'J05AX17',
 'R05DA06',
 'A10BH05',
 'N02AC04',
 'A02BX05',
 'N02AB01',
 'V09IX08',
 'B03AA10',
 'A02AC01',
 'L01EX03',
 'D03AX10',
 'L01EE04',
 'N06BA14',
 'A07AA11',
 'A05AA01',
 'M03AA01',
 'V08CA03',
 'S01AE08',
 'B01AF02',
 'L01EB03',
 'N05CC04',
 'D11AX04',
 'A02BA05',
 'C01BA08',
 'A07AA02',
 'P01AX02',
 'N05CH03',
 'G03DA01',
 'P01BE05',
 'C01BA12',
 'C02DB01',
 'N07BA04',
 'V04CE02',
 'M01CA03',
 'N05BB02',
 'J01CE06',
 'N04AB01',
 'D11AH02',
 'A07AA12',
 'N04AA09',
 'B01AF04',
 'B01AF03',
 'L01EE03',
 'M09AX11',
 'R01BA01',
 'A04AD14',
 'P01CD01',
 'C10AX08',
 'J05AP03',
 'A03BB04',
 'C01CX06',
 'A14AA06',
 'A11CC06',
 'N03AX30',
 'P03AA01',
 'A03BB06',
 'D07AB05',
 'C04AC07',
 'C01BA01',
 'M01CB05',
 'A10BK05',
 'B03AA06',
 'J04AK01',
 'C10AX15',
 'L02BA03',
 'L01AD05',
 'A16AX09',
 'N01AX05',
 'A08AX01',
 'C02KX01',
 'P01BE02',
 'N07XX01',
 'P03BA01',
 'P01BE03',
 'B02AB05',
 'N0

In [11]:
cids

[10026,
 10034073,
 100472,
 10071196,
 10074640,
 10077130,
 10079874,
 10083,
 10089466,
 10090,
 10096344,
 10100,
 10101269,
 10101,
 10103319,
 101033550,
 10112,
 10113978,
 10114,
 10127622,
 10130337,
 101307877,
 10133,
 101602193,
 101673418,
 10178705,
 10182969,
 10184653,
 10188,
 10197702,
 10203245,
 102058611,
 102090452,
 10219,
 10220503,
 102210,
 10221470,
 102239676,
 10230,
 10235,
 102371197,
 10239,
 10240,
 10250769,
 10255142,
 102572331,
 102580518,
 102669,
 10275777,
 10280735,
 10288191,
 10295295,
 10297,
 10311306,
 10311,
 1031,
 10324367,
 10347880,
 10351092,
 10360683,
 10363641,
 10391,
 10404,
 10429215,
 10430619,
 10444661,
 10448938,
 10452965,
 10453870,
 10464762,
 1046,
 10472693,
 104741,
 104799,
 10482134,
 104845,
 104850,
 104865,
 104888,
 104903,
 104926,
 105031,
 105102,
 10518,
 1051,
 10531,
 10548,
 1054,
 10599,
 10607,
 10610,
 10629256,
 10630,
 10631,
 10635,
 10651,
 1065,
 10695961,
 10715,
 107641,
 107715,
 107751,
 107770

In [12]:
names

['Amyl nitrite',
 'Fidaxomicin',
 'Difenpiramide',
 'Pimavanserin',
 'Masitinib',
 'Vorapaxar',
 'Candicidin',
 'Pipradrol',
 'Enisamium iodide',
 'Normethadone',
 'Linagliptin',
 'Dextropropoxyphene',
 'Bismuth subcitrate',
 'Ketobemidone',
 'Fluoroethylcholine (18F)',
 'Ferrous ascorbate',
 'Calcium carbonate',
 'Pazopanib',
 'Enoxolone',
 'Selumetinib',
 'Solriamfetol',
 'Rifaximin',
 'Chenodeoxycholic acid',
 'Alcuronium',
 'Gadodiamide',
 'Besifloxacin',
 'Apixaban',
 'Afatinib',
 'Dichloralphenazone',
 'Lithium succinate',
 'Niperotidine',
 'Prajmaline',
 'Nystatin',
 'Emetine',
 'Tasimelteon',
 'Gestonorone',
 'Artenimol',
 'Lorajmine',
 'Dihydralazine',
 'Cytisine',
 'Sulfobromophthalein',
 'Oxycinchophen',
 'Captodiame',
 'Penamecillin',
 'Etanautine',
 'Pimecrolimus',
 'Fidaxomicin',
 'Phenglutarimide',
 'Betrixaban',
 'Edoxaban',
 'Binimetinib',
 'Palovarotene',
 'Phenylpropanolamine',
 'Rolapitant',
 'Melarsoprol',
 'Policosanol',
 'Boceprevir',
 'Fentonium',
 'Angiotensina

Combining the lists into a dataframe and saving it as CSV

In [13]:
atc_to_cid = pd.DataFrame({"ATC_Code" : atc_codes, "cids" : cids, "Compound_Name" : names})
atc_to_cid.to_csv("atc_to_cid.csv")
atc_to_cid

| | ATC_Code | cids | Compound_Name |
|---|---|---|---|
| 0 | V03AB22 | 10026 | Amyl nitrite |
| 1 | A07AA12 | 10034073 | Fidaxomicin |
| 2 | M01AB12 | 100472 | Difenpiramide |
| 3 | N05AX17 | 10071196 | Pimavanserin |
| 4 | L01EX06 | 10074640 | Masitinib |
| ... | ... | ... | ... |
| 4359 | C05BB05 | 996 | Phenol |
| 4360 | A02BB02 | 9978336 | Enprostil |
| 4361 | A08AA06 | 9982 | Etilamfetamine |
| 4362 | D10AX04 | 9989226 | Aluminium oxide |


## Step 2: Using PubChem Identifier Exchange Service (CIDs to InChIKeys)

Saving the CIDs to a text file, which is then uploaded to the exchange service (https://pubchem.ncbi.nlm.nih.gov/idexchange/idexchange.cgi)

In [14]:
np_cids = np.array(cids)
np.savetxt("cid_list.txt", np_cids, delimiter = ",", fmt = "%d")

Loading the InChIKeys returned by the exchange service

In [15]:
inchi_keys = pd.read_csv("inchi_keys.txt", sep = "\t")
inchi_keys

| | cids | inchi_key |
|---|---|---|
| 0 | 10026 | CSDTZUBPSYWZDX-UHFFFAOYSA-N |
| 1 | 10034073 | ZVGNESXIJDCBKN-UUEYKCAUSA-N |
| 2 | 100472 | PWHROYKAGRUWDQ-UHFFFAOYSA-N |
| 3 | 10071196 | RKEWSXXUOLRFBX-UHFFFAOYSA-N |
| 4 | 10074640 | WJEOLQLKVOPQFV-UHFFFAOYSA-N |
| ... | ... | ... |
| 4359 | 996 | ISWSIDIOOBJBQZ-UHFFFAOYSA-N |
| 4360 | 9978336 | PTOJVMZPWPAXER-FPXSIRDUSA-N |
| 4361 | 9982 | YAGBSNMZQKEFCO-UHFFFAOYSA-N |
| 4362 | 9989226 | PNEYBMLMFCGWSK-UHFFFAOYSA-N |
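
Since the InChIKeys come from a manual upload and download, a quick format check can catch a truncated or mis-parsed file. A standard InChIKey is 27 characters long: three hyphen-separated blocks of 14, 10, and 1 uppercase letters. The sketch below runs on a toy table; the third row is deliberately malformed, and the pattern name is hypothetical.

```python
import re
import pandas as pd

# 14 letters, hyphen, 10 letters, hyphen, 1 letter
INCHIKEY_PATTERN = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")

# Toy stand-in for the exchange-service output; the last key is invalid on purpose
toy_keys = pd.DataFrame({
    "cids": [10026, 10034073, 1],
    "inchi_key": [
        "CSDTZUBPSYWZDX-UHFFFAOYSA-N",
        "ZVGNESXIJDCBKN-UUEYKCAUSA-N",
        "not-a-key",
    ],
})

# Missing values (NaN) are not strings, so they are flagged as invalid too
is_valid = toy_keys["inchi_key"].apply(
    lambda key: isinstance(key, str) and bool(INCHIKEY_PATTERN.match(key))
)
bad_rows = toy_keys[~is_valid]
print(bad_rows)
```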


## Step 3: Loading compound_info (InChIKeys to pert_ids)

Downloading (from https://clue.io/data/CMap2020#LINCS2020) and loading the relevant compound_info columns

In [16]:
compound_info = pd.read_csv("compoundinfo_beta.txt", sep = "\t").loc[:,["pert_id", "inchi_key"]]
compound_info

| | pert_id | inchi_key |
|---|---|---|
| 0 | BRD-A08715367 | DATAGRPVKZEWHA-UHFFFAOYSA-N |
| 1 | BRD-A12237696 | RHGKLRLOHDJJDR-UHFFFAOYSA-N |
| 2 | BRD-A18795974 | BLYMJBIZMIGWFK-UHFFFAOYSA-N |
| 3 | BRD-A27924917 | WBSMZVIMANOCNX-UHFFFAOYSA-N |
| 4 | BRD-A35931254 | VMWNQDUVQKEIOC-UHFFFAOYSA-N |
| ... | ... | ... |
| 39316 | BRD-K62685538 | VXKHXGOKWPXYNA-PGBVPBMZSA-N |
| 39317 | BRD-K62221994 | RANJJVIMTOIWIN-UHFFFAOYSA-N |
| 39318 | BRD-K53397409 | WPYMKLBDIGXBTP-UHFFFAOYSA-N |
| 39319 | BRD-A62182663 | HLXSCTYHLQHQDJ-UHFFFAOYSA-N |


## Step 4: Merging dictionaries into one (ATC to pert_id)

Creating dictionary

In [17]:
dictionary = atc_to_cid.merge(inchi_keys, on = "cids", how = "inner").merge(compound_info, on = "inchi_key", how = "inner")
dictionary

| | ATC_Code | cids | Compound_Name | inchi_key | pert_id |
|---|---|---|---|---|---|
| 0 | N05AX17 | 10071196 | Pimavanserin | RKEWSXXUOLRFBX-UHFFFAOYSA-N | BRD-K83405785 |
| 1 | L01EX06 | 10074640 | Masitinib | WJEOLQLKVOPQFV-UHFFFAOYSA-N | BRD-K71035033 |
| 2 | L01EX06 | 10074640 | Masitinib | WJEOLQLKVOPQFV-UHFFFAOYSA-N | BRD-K71035033 |
| 3 | L01EX06 | 10074640 | Masitinib | WJEOLQLKVOPQFV-UHFFFAOYSA-N | BRD-K71035033 |
| 4 | L01EX06 | 10074640 | Masitinib | WJEOLQLKVOPQFV-UHFFFAOYSA-N | BRD-K71035033 |
| ... | ... | ... | ... | ... | ... |
| 3663 | N06AX26 | 9966051 | Vortioxetine | YQNWZWMKLDQSAC-UHFFFAOYSA-N | BRD-K53963539 |
| 3664 | N06AX26 | 9966051 | Vortioxetine | YQNWZWMKLDQSAC-UHFFFAOYSA-N | BRD-K53963539 |
| 3665 | N06AX26 | 9966051 | Vortioxetine | YQNWZWMKLDQSAC-UHFFFAOYSA-N | BRD-K53963539 |
| 3666 | N06AX26 | 9966051 | Vortioxetine | YQNWZWMKLDQSAC-UHFFFAOYSA-N | BRD-K53963539 |
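
The repeated rows in the merged table (for example, Masitinib appearing four times) are what an inner merge produces when the right-hand table lists the same (pert_id, inchi_key) pair more than once. That compound_info contains such repeats is an inference from the output, not something verified here. The toy sketch below shows the effect and one possible way to avoid it by deduplicating the lookup table before merging; all table names are illustrative.

```python
import pandas as pd

# Toy tables; the triple repetition is made up to reproduce the effect
toy_atc = pd.DataFrame({"ATC_Code": ["L01EX06"], "cids": [10074640]})
toy_keys = pd.DataFrame({
    "cids": [10074640],
    "inchi_key": ["WJEOLQLKVOPQFV-UHFFFAOYSA-N"],
})
toy_compounds = pd.DataFrame({
    "pert_id": ["BRD-K71035033"] * 3,
    "inchi_key": ["WJEOLQLKVOPQFV-UHFFFAOYSA-N"] * 3,
})

# Merging against the raw table repeats the ATC row once per matching compound row
raw_merge = toy_atc.merge(toy_keys, on="cids").merge(toy_compounds, on="inchi_key")

# Deduplicating first keeps one row per (pert_id, inchi_key) pair
deduped = toy_compounds.drop_duplicates()
clean_merge = toy_atc.merge(toy_keys, on="cids").merge(deduped, on="inchi_key")

print(len(raw_merge), len(clean_merge))
```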


In [18]:
# Keep one row per pert_id; a compound listed under several ATC codes keeps only its first code
dictionary.drop_duplicates(subset = "pert_id", inplace = True, ignore_index= True)
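
Deduplicating on `pert_id` alone keeps a single row per compound, so a compound that carries several ATC codes retains only the first one. The sketch below contrasts that with deduplicating on the (ATC_Code, pert_id) pair. The codes and the id are placeholders, and the second variant is an alternative to consider, not what the notebook does.

```python
import pandas as pd

# Toy merged table: one compound under two ATC codes, with one repeated row
toy = pd.DataFrame({
    "ATC_Code": ["A01AA01", "A01AA01", "B02BB02"],  # placeholder codes
    "pert_id": ["BRD-X", "BRD-X", "BRD-X"],         # placeholder id
})

# One row per pert_id: the second ATC code is dropped
by_pert = toy.drop_duplicates(subset="pert_id", ignore_index=True)

# One row per (ATC code, pert_id) pair: both codes survive
by_pair = toy.drop_duplicates(subset=["ATC_Code", "pert_id"], ignore_index=True)

print(by_pert)
print(by_pair)
```

With the pair-based variant, a prefix lookup would find the compound under either of its codes.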

Info loss:

In [19]:
print("Number of unique ATC Codes: " + str(atc_to_cid.ATC_Code.nunique()))
print("Number of unique CID Codes: " + str(atc_to_cid.cids.nunique()))
# Counts the unique CIDs submitted to the exchange service (one InChIKey lookup per CID),
# not the distinct values in inchi_keys.inchi_key
print("Number of unique InChiKeys: " + str(len(np.unique(np_cids))))
print("Number of unique pert_ids: " + str(dictionary.pert_id.nunique()))


Number of unique ATC Codes: 2984
Number of unique CID Codes: 4364
Number of unique InChiKeys: 4364
Number of unique pert_ids: 1418
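
The counts above show how many identifiers survive, but not which ones are lost. One way to see that is a left merge with pandas' `indicator=True` option, which adds a `_merge` column recording whether each row found a match. A minimal sketch on toy data with placeholder keys and ids:

```python
import pandas as pd

toy_keys = pd.DataFrame({
    "cids": [1, 2, 3],
    "inchi_key": ["KEY-A", "KEY-B", "KEY-C"],  # placeholder keys
})
toy_compounds = pd.DataFrame({
    "pert_id": ["BRD-1", "BRD-3"],             # placeholder ids
    "inchi_key": ["KEY-A", "KEY-C"],
})

# A left merge keeps unmatched rows; `_merge` says where each row was found
audit = toy_keys.merge(toy_compounds, on="inchi_key", how="left", indicator=True)
unmatched = audit[audit["_merge"] == "left_only"]

print(len(unmatched), "of", len(toy_keys), "CIDs have no pert_id")
```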


Saving dictionary

In [20]:
dictionary.to_csv("ATC_to_pert_id_dictionary.txt")

## Step 5: Sig_ids Finder Function

Opening sig_info metadata downloaded from https://clue.io/data/CMap2020#LINCS2020

In [21]:
sig_info = pd.read_csv("siginfo_beta.txt", sep = "\t", dtype=str)
sig_info

Unnamed: 0,bead_batch,nearest_dose,pert_dose,pert_dose_unit,pert_idose,pert_itime,pert_time,pert_time_unit,cell_mfc_name,pert_mfc_id,...,sig_id,pert_type,cell_iname,det_wells,det_plates,distil_ids,build_name,project_code,cmap_name,is_ncs_exemplar
0,b17,,100,ug/ml,100 ug/ml,336 h,336,h,N8,BRD-U44432129,...,MET001_N8_XH:BRD-U44432129:100:336,trt_cp,NAMEC8,H05|H06|H07|H08,MET001_N8_XH_X1_B17,MET001_N8_XH_X1_B17:H05|MET001_N8_XH_X1_B17:H0...,,MET,BRD-U44432129,0
1,b15,10,10,uM,10 uM,3 h,3,h,A549,BRD-K81418486,...,ABY001_A549_XH:BRD-K81418486:10:3,trt_cp,A549,L04|L08|L12,ABY001_A549_XH_X1_B15,ABY001_A549_XH_X1_B15:L04|ABY001_A549_XH_X1_B1...,,ABY,vorinostat,0
2,b15,2.5,2.5,uM,2.5 uM,24 h,24,h,HT29,BRD-K70511574,...,ABY001_HT29_XH:BRD-K70511574:2.5:24,trt_cp,HT29,E18|E22,ABY001_HT29_XH_X1_B15,ABY001_HT29_XH_X1_B15:E18|ABY001_HT29_XH_X1_B1...,,ABY,HMN-214,0
3,b18,10,10,uM,10 uM,3 h,3,h,HME1,BRD-K81418486,...,LTC002_HME1_3H:BRD-K81418486:10,trt_cp,HME1,F19,LTC002_HME1_3H_X1_B18,LTC002_HME1_3H_X1_B18:F19,,LTC,vorinostat,0
4,b15,10,10,uM,10 uM,3 h,3,h,H1975,BRD-A61304759,...,ABY001_H1975_XH:BRD-A61304759:10:3,trt_cp,H1975,P01|P05|P09,ABY001_H1975_XH_X1_B15,ABY001_H1975_XH_X1_B15:P01|ABY001_H1975_XH_X1_...,,ABY,tanespimycin,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1202651,b18,10,10,uM,10 uM,24 h,24,h,HCC515,BRD-K48853221,...,DOSVAL001_HCC515_24H:BRD-K48853221:10,trt_cp,HCC515,K01,DOSVAL001_HCC515_24H_X1_B18|DOSVAL001_HCC515_2...,DOSVAL001_HCC515_24H_X1_B18:K01|DOSVAL001_HCC5...,,DOSVAL,BRD-K48853221,1
1202652,b18,10,10,uM,10 uM,24 h,24,h,HCC515,BRD-K90382497,...,DOSVAL001_HCC515_24H:BRD-K90382497:10,trt_cp,HCC515,O03,DOSVAL001_HCC515_24H_X1_B18|DOSVAL001_HCC515_2...,DOSVAL001_HCC515_24H_X1_B18:O03|DOSVAL001_HCC5...,,DOSVAL,GW-843682X,0
1202653,b19,20,20,uM,20 uM,24 h,24,h,HCC515,BRD-K45785972,...,DOSVAL002_HCC515_24H:BRD-K45785972:20,trt_cp,HCC515,M22,DOSVAL002_HCC515_24H_X1.L2_B19|DOSVAL002_HCC51...,DOSVAL002_HCC515_24H_X1.L2_B19:M22|DOSVAL002_H...,,DOSVAL,BRD-K45785972,0
1202654,b19,4,5,uM,4 uM,24 h,24,h,A375,BRD-K28513938,...,DOSVAL004_A375_24H:BRD-K28513938:5,trt_cp,A375,E09,DOSVAL004_A375_24H_X1.A2_B19|DOSVAL004_A375_24...,DOSVAL004_A375_24H_X1.A2_B19:E09|DOSVAL004_A37...,,DOSVAL,BRD-K28513938,0
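
The lookup in this step only uses the `pert_mfc_id` and `sig_id` columns, so for a table with over a million rows it may be worth loading just those two with the `usecols` parameter of `pd.read_csv`. The sketch below uses a small in-memory stand-in for siginfo_beta.txt; whether other columns are needed elsewhere in the project is not something this notebook shows.

```python
import io
import pandas as pd

# Toy tab-separated text standing in for siginfo_beta.txt
toy_tsv = io.StringIO(
    "bead_batch\tpert_mfc_id\tsig_id\tcell_iname\n"
    "b15\tBRD-K81418486\tABY001_A549_XH:BRD-K81418486:10:3\tA549\n"
    "b18\tBRD-K81418486\tLTC002_HME1_3H:BRD-K81418486:10\tHME1\n"
)

# Load only the join key and the signature id
slim = pd.read_csv(toy_tsv, sep="\t", dtype=str, usecols=["pert_mfc_id", "sig_id"])
print(slim)
```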


Creating function that returns sig_ids from ATC code prefix

In [23]:
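def ATC_to_records(atc_code):
    """Return the sig_ids of all signatures whose compound has an ATC code starting with atc_code."""
    # Rows of the dictionary whose ATC code begins with the given prefix
    bool_series = dictionary.ATC_Code.str.startswith(atc_code, na = False)
    records = dictionary[bool_series]
    pert_ids = records.pert_id
    # Keep only the sig_info rows whose pert_mfc_id matches one of those pert_ids
    sig_ids = sig_info.merge(pert_ids, left_on = "pert_mfc_id", right_on= "pert_id", how = "inner").sig_id
    return sig_ids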

In [24]:
test_ATC_to_records = ATC_to_records("B01AA")
test_ATC_to_records

0        CPD002_PC3_6H:BRD-K82236179-001-07-6:10
1       CPD002_PC3_24H:BRD-K82236179-001-07-6:10
2      CPD002_MCF7_24H:BRD-K82236179-001-07-6:10
3       CPD002_MCF7_6H:BRD-K82236179-001-07-6:10
4                        REP.A020_HEK293_24H:G20
                         ...                    
549                        REP.A008_YAPC_24H:O10
550                        REP.B008_MCF7_24H:O12
551                        REP.B008_MCF7_24H:O09
552                        REP.A008_MCF7_24H:O08
553                         REP.B008_PC3_24H:O07
Name: sig_id, Length: 554, dtype: object
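
The same lookup can be tried on toy tables to see what the prefix match selects. The sketch below is a variation that filters with `isin` instead of an inner merge and takes the tables as arguments. The function name, ids, and sig_ids are all hypothetical, and the ATC codes serve only as example strings.

```python
import pandas as pd

toy_dictionary = pd.DataFrame({
    "ATC_Code": ["B01AA03", "B01AC06", "N05AX17"],
    "pert_id": ["BRD-1", "BRD-2", "BRD-3"],          # placeholder ids
})
toy_sig_info = pd.DataFrame({
    "pert_mfc_id": ["BRD-1", "BRD-1", "BRD-2", "BRD-9"],
    "sig_id": ["sig_a", "sig_b", "sig_c", "sig_d"],  # placeholder sig_ids
})

def sig_ids_for_prefix(prefix, dictionary, sig_info):
    """Return the sig_ids of compounds whose ATC code starts with prefix."""
    is_match = dictionary["ATC_Code"].str.startswith(prefix, na=False)
    pert_ids = dictionary.loc[is_match, "pert_id"]
    # isin keeps each sig_info row at most once, in its original order
    return sig_info.loc[sig_info["pert_mfc_id"].isin(pert_ids), "sig_id"]

narrow = sig_ids_for_prefix("B01AA", toy_dictionary, toy_sig_info)
broad = sig_ids_for_prefix("B01", toy_dictionary, toy_sig_info)
print(narrow.tolist(), broad.tolist())
```

Unlike the merge, `isin` keeps the original sig_info index and cannot duplicate a signature if a pert_id appears twice in the dictionary.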

Creating function that returns number of CIDs from given ATC code prefix

In [25]:
def num_compounds(atc_code):
    """Return the number of unique CIDs in the dictionary whose ATC code starts with atc_code."""
    bool_series = dictionary.ATC_Code.str.startswith(atc_code, na=False)
    records = dictionary[bool_series]
    return records.cids.nunique()

In [26]:
test_num_compounds = num_compounds("B01")
test_num_compounds

22
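
To get these counts for every group at a given ATC level at once, rather than one prefix at a time, the codes can be truncated and grouped. This is a sketch on toy data; `compounds_per_group` is a hypothetical helper, and the codes and CIDs are placeholders.

```python
import pandas as pd

toy_dictionary = pd.DataFrame({
    "ATC_Code": ["B01AA03", "B01AC06", "B02BB02", "N05AX17"],
    "cids": [1, 2, 3, 4],  # placeholder CIDs
})

def compounds_per_group(dictionary, prefix_length):
    """Count unique CIDs for each ATC prefix of the given length."""
    groups = dictionary["ATC_Code"].str[:prefix_length]
    return dictionary.groupby(groups)["cids"].nunique()

counts = compounds_per_group(toy_dictionary, 3)
print(counts)
```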