This notebook explores the available datasets and selects the one best suited to the goal of predicting side effects.

Although several datasets were imported, some did not have enough information to merge with the other datasets, and others were set aside in favor of datasets with more observations that I wanted to explore. Ultimately, the Liu dataset and the two_sides dataset were used for all analyses and predictions.

In [1]:
# Import libraries
import pandas as pd
from pandas import DataFrame
import numpy as np
pd.set_option('display.max_columns', None)  
#pd.set_option('display.max_rows', None)

1) UMLS_id = UMLS concept id as found on the drug label (UMLS = Unified Medical Language System)

2) concept_name = concept name as found on the label (an indication in this file, a side effect in the side-effect files)

3) detection_method = method of detection (NLP_indication/NLP_precondition/text_mention)
The authors used NLP on drug label inserts to find the terms associated with a particular drug.

4) concept_type = MedDRA concept type (LLT = lowest level term, PT = preferred term)
In a few cases, the term is neither LLT nor PT.
NOTE: according to the meddra.tsv file (SIDER website), a PT covers all the different forms of a condition.
e.g. the PT abdominal distension covers the forms abdominal distension, distended abdomen, swollen abdomen, swelling abdomen and swelling abd

5) UMLS_id_medra = UMLS concept id for the MedDRA term

6) medra_conceptname = MedDRA concept name

In meddra_all_indications.tsv.gz, all concepts found on the labels are given as LLT. There is at least one PT for every LLT, but sometimes the PT is the same as the LLT.
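
The LLT/PT relationship described above can be illustrated on a handful of rows. This is only a minimal sketch: `sample` is a hypothetical stand-in for the loaded file, built from a few rows of the indications data, and `pt_only` is an illustrative name.

```python
import pandas as pd

# Toy stand-in for meddra_all_indications.tsv (a few rows only)
sample = pd.DataFrame(
    [
        ("CID100000085", "C0015544", "text_mention", "Failure to Thrive", "LLT", "C0015544", "Failure to thrive"),
        ("CID100000085", "C0015544", "text_mention", "Failure to Thrive", "PT", "C0015544", "Failure to thrive"),
        ("CID100000085", "C0020615", "text_mention", "Hypoglycemia", "LLT", "C0020615", "Hypoglycaemia"),
        ("CID100000085", "C0020615", "text_mention", "Hypoglycemia", "PT", "C0020615", "Hypoglycaemia"),
    ],
    columns=["stitch_id", "UMLS_id", "detection_method", "concept_name",
             "concept_type", "UMLS_id_medra", "medra_conceptname"],
)

# Each concept appears once as LLT and once as PT in this sample
type_counts = sample["concept_type"].value_counts().to_dict()

# Keeping only the preferred terms removes the repeated LLT rows
pt_only = sample[sample["concept_type"] == "PT"]
print(type_counts)
print(pt_only[["stitch_id", "UMLS_id", "concept_name"]])
```

Filtering on PT before merging keeps one row per drug and concept in cases like these, where the LLT and PT are the same concept.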

In [2]:
# Import the data files: the SNAP dataset from Stanford and the SIDER database
# MedDRA is the dictionary SIDER uses to extract side-effect information from drug label inserts
meddra_df = pd.read_table('data/meddra_all_indications.tsv', sep='\t', \
                          names=["stitch_id", "UMLS_id", "detection_method", "concept_name", \
                                 "concept_type", "UMLS_id_medra", "medra_conceptname"])


In [4]:
meddra_df.head()
# concept_name and medra_conceptname are the indications for which a drug is used.

Unnamed: 0,stitch_id,UMLS_id,detection_method,concept_name,concept_type,UMLS_id_medra,medra_conceptname
0,CID100000085,C0015544,text_mention,Failure to Thrive,LLT,C0015544,Failure to thrive
1,CID100000085,C0015544,text_mention,Failure to Thrive,PT,C0015544,Failure to thrive
2,CID100000085,C0020615,text_mention,Hypoglycemia,LLT,C0020615,Hypoglycaemia
3,CID100000085,C0020615,text_mention,Hypoglycemia,PT,C0020615,Hypoglycaemia
4,CID100000085,C0022661,NLP_indication,"Kidney Failure, Chronic",LLT,C0022661,Renal failure chronic


In [5]:
#meddra_df[meddra_df.UMLS_id=='C0085393']

In [6]:
meddra_df.shape

(30835, 7)

In [7]:
# Keep only the rows whose concept type is the preferred term (PT)
meddra_df=meddra_df[meddra_df.concept_type=='PT']

In [8]:
meddra_df=meddra_df.drop(['detection_method', 'concept_type', 'UMLS_id_medra', \
                         'medra_conceptname'], axis=1)

In [9]:
meddra_df.head(1)

Unnamed: 0,stitch_id,UMLS_id,concept_name
1,CID100000085,C0015544,Failure to Thrive


In [10]:
meddra_df.nunique()

stitch_id       1437
UMLS_id         2705
concept_name    2705
dtype: int64

In [11]:
#Use this only to merge with SNAP database since SNAP database has CID1... which is stitch_id1

#meddra_all_label_indications.tsv -the only column that is extra in this
#file is the source label which is the first column
meddra_label_df =pd.read_table('data/meddra_all_label_indications.tsv', sep='\t', \
                               names=["source_label", "stitch_id1", "stitch_id2","UMLS_id", \
                                      "detection_method", "concept_name", \
                                 "concept_type", "UMLS_id_medra", "medra_conceptname"])
#What is the difference between stitch_id1 and "stitch_id2"?
#there is only a difference of 1 digit- stitch_id1 is CID1... and stitch_id2 is CID0...
#CIDs / CID0... - this is a stereo-specific compound, and the suffix is the 
#PubChem compound id.
#CIDm / CID1... - this is a "flat" compound, i.e. with merged stereo-isomers
#The suffix (without the leading "1") is the PubChem compound id.
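
The prefix convention in the comments above can be sketched as a small parser. The helper name `parse_stitch_id` is hypothetical, and it assumes ids always look like `CID0...` or `CID1...` followed by digits, as in the rows shown in this notebook.

```python
def parse_stitch_id(stitch_id):
    """Split a STITCH id into its kind ('flat' or 'stereo') and PubChem compound id.

    Assumes the 'CID1...' (flat) / 'CID0...' (stereo) convention described above.
    """
    if len(stitch_id) < 5 or not stitch_id.startswith("CID") or stitch_id[3] not in "01":
        raise ValueError(f"unexpected STITCH id format: {stitch_id!r}")
    kind = "flat" if stitch_id[3] == "1" else "stereo"
    # The digits after the leading 0/1 are the PubChem compound id
    return kind, int(stitch_id[4:])


print(parse_stitch_id("CID100216416"))
print(parse_stitch_id("CID000216416"))
```

Parsing both columns this way makes it easy to check whether a flat id and a stereo id carry the same PubChem number before using either as a merge key.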

1) Columns not needed:

detection_method, UMLS_id

2) UMLS id = concept unique identifier for the concept_name. For example, searching for the UMLS_id C0016658 shows that the associated concept_name is fracture. (https://ncim.nci.nih.gov/ncimbrowser/ConceptReport.jsp?dictionary=NCI%20Metathesaurus&code=C00166580)

In [12]:
#meddra.tsv has name of the side effect
drug_side_effect=pd.read_table('data/meddra_all_label_se.tsv', sep='\t', \
                          names=["source", "stitch_id1", "stitch_id2", "umls_id", "concept_type", "umls_idmeddra", \
                                 "side_effect"])

In [13]:
drug_side_effect=drug_side_effect[drug_side_effect.concept_type=='PT']

In [14]:
drug_side_effect.head(2)

Unnamed: 0,source,stitch_id1,stitch_id2,umls_id,concept_type,umls_idmeddra,side_effect
1,EMA/WC500020092.html,CID100216416,CID000216416,C0000737,PT,C0000737,Abdominal pain
2,EMA/WC500020092.html,CID100216416,CID000216416,C0000737,PT,C0687713,Gastrointestinal pain


In [15]:
drug_side_effect=drug_side_effect.drop(['source', "stitch_id2", 'concept_type', 'umls_idmeddra'], axis=1)

In [16]:
drug_side_effect.shape

(2523626, 3)

In [17]:
drug_side_effect.nunique()

stitch_id1     1430
umls_id        5805
side_effect    4251
dtype: int64

CID100002909, CID100003222, CID100003249, CID100010340, 

In [6]:
drug_names=pd.read_table('data/drug_names.tsv', sep='\t', header=None, names=["stitch_code", "drug_name"])

In [7]:
drug_names.head()

Unnamed: 0,stitch_code,drug_name
0,CID100000085,carnitine
1,CID100000119,gamma-aminobutyric
2,CID100000137,5-aminolevulinic
3,CID100000143,leucovorin
4,CID100000146,5-methyltetrahydrofolate


In [None]:
drug_names.shape

In [None]:
drug_names.nunique()

In [None]:
# merge the drug names with meddra_df (meddra_all_indications data) on the stitch_id
drug_meddra=pd.merge(drug_names, meddra_df, left_on="stitch_code", right_on="stitch_id", how='left')

In [None]:
#drop the stitch_id since it is same as stitch_code
drug_meddra=drug_meddra.drop(columns=['stitch_id'])
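
A left merge keeps drugs that have no indication rows, so it can help to see which ones failed to match. The sketch below uses toy frames (`names` and `indications` are hypothetical stand-ins, and the absence of a match for the second drug is made up for illustration) together with the `indicator` option of `pd.merge`.

```python
import pandas as pd

names = pd.DataFrame({
    "stitch_code": ["CID100000085", "CID100000119"],
    "drug_name": ["carnitine", "gamma-aminobutyric"],
})
indications = pd.DataFrame({
    "stitch_id": ["CID100000085", "CID100000085"],
    "UMLS_id": ["C0015544", "C0020615"],
    "concept_name": ["Failure to Thrive", "Hypoglycemia"],
})

# indicator=True adds a '_merge' column saying where each row came from
merged = pd.merge(names, indications, left_on="stitch_code",
                  right_on="stitch_id", how="left", indicator=True)

# Drugs present in `names` but with no matching indication row
unmatched = merged.loc[merged["_merge"] == "left_only", "drug_name"].tolist()

# The right-hand key and the indicator are redundant once checked
merged = merged.drop(columns=["stitch_id", "_merge"])
print(unmatched)
print(merged)
```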

In [None]:
drug_meddra.head(3)

In [None]:
drug_meddra.shape

In [None]:
drug_meddra.nunique()

In [None]:
drug_meddra.groupby('drug_name').stitch_code.count()

In [None]:
drug_meddra.isnull().sum()

In [None]:
drug_meddra

In [None]:
#merge drug_meddra with drug_side_effect
drug_meddra_side_effect=pd.merge(drug_meddra, drug_side_effect, left_on='UMLS_id', right_on='umls_id', how='left')
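
Because the same UMLS id can occur many times on both sides, a merge on that key can multiply rows. The following is a minimal sketch on toy frames (all values hypothetical) showing the row growth and how the `validate` argument of `pd.merge` can flag it.

```python
import pandas as pd

left = pd.DataFrame({
    "drug_name": ["drug_a", "drug_b"],
    "UMLS_id": ["C0000737", "C0000737"],
})
right = pd.DataFrame({
    "stitch_id1": ["CID100000001", "CID100000002", "CID100000003"],
    "umls_id": ["C0000737", "C0000737", "C0000737"],
    "side_effect": ["Abdominal pain"] * 3,
})

# Every left row pairs with every right row that shares the key: 2 x 3 rows
merged = pd.merge(left, right, left_on="UMLS_id", right_on="umls_id", how="left")

# validate raises MergeError when the right-hand keys are not unique
try:
    pd.merge(left, right, left_on="UMLS_id", right_on="umls_id",
             how="left", validate="many_to_one")
    many_to_many = False
except pd.errors.MergeError:
    many_to_many = True

print(len(left), len(right), len(merged), many_to_many)
```

Comparing `len(merged)` against the input sizes is a quick way to decide whether the key needs to be combined with the drug id before merging.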

In [None]:
drug_meddra_side_effect.head(3)

In [None]:
drug_meddra_side_effect[drug_meddra_side_effect.UMLS_id=='C0085393']

## Drug ATC codes will be needed only to search using ATC codes in SIDER

In [None]:
drug_atc=pd.read_table('data/drug_atc.tsv', sep='\t', header=None, names=["stitch_code", "atc_code"])

In [None]:
# How many drugs are there?
# Which drugs have the most side effects?
# How many side effects does each drug have?
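
The questions above can be answered with `nunique` and a `groupby`. This is a sketch on a toy frame that reuses the column names of `drug_side_effect`; the rows themselves are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "stitch_id1": ["CID100000085", "CID100000085", "CID100000085", "CID100000119"],
    "side_effect": ["Abdominal pain", "Gastrointestinal pain", "Abdominal pain", "Nausea"],
})

# How many distinct drugs?
n_drugs = toy["stitch_id1"].nunique()

# How many distinct side effects per drug, largest first?
per_drug = (toy.groupby("stitch_id1")["side_effect"]
               .nunique()
               .sort_values(ascending=False))

# Which drug has the most distinct side effects?
top_drug = per_drug.idxmax()
print(n_drugs, top_drug)
print(per_drug)
```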

In [None]:
#ATC code can be used to find the drug name from SIDER database
#IS THERE ANOTHER DATABASE THAT CONNECTS ATC_CODE WITH DRUG NAME? -
# https://www.genome.jp/kegg-bin/get_htext#E587
drug_atc.head()

### Bio-Decagon (Stanford SNAP) dataset

In [2]:
bio_decagon=pd.read_csv('data/bio-decagon-combo.csv')

In [3]:
bio_decagon.head(2)

Unnamed: 0,STITCH 1,STITCH 2,Polypharmacy Side Effect,Side Effect Name
0,CID000002173,CID000003345,C0151714,hypermagnesemia
1,CID000002173,CID000003345,C0035344,retinopathy of prematurity


In [4]:
bio_decagon.columns=['stitch_1', 'stitch_2', 'polypharmacy_side_effect', 'side_effect_name']


In [5]:
bio_decagon.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4649441 entries, 0 to 4649440
Data columns (total 4 columns):
stitch_1                    object
stitch_2                    object
polypharmacy_side_effect    object
side_effect_name            object
dtypes: object(4)
memory usage: 141.9+ MB


In [None]:
bio_decagon.stitch_1.nunique()

In [None]:
#merge bio_decagon with drug_names
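
One possible way to attach names to both drugs in a pair is sketched below. It rests on a loud assumption: that the `CID0...` ids in the pair data and the `CID1...` ids in `drug_names` refer to the same drug when their numeric suffixes match. The toy rows, including the side-effect label, are hypothetical.

```python
import pandas as pd

pairs = pd.DataFrame({
    "stitch_1": ["CID000000085"],
    "stitch_2": ["CID000000143"],
    "side_effect_name": ["hypothetical effect"],
})
names = pd.DataFrame({
    "stitch_code": ["CID100000085", "CID100000143"],
    "drug_name": ["carnitine", "leucovorin"],
})

# Assumption: match on the numeric part after the 'CID0'/'CID1' prefix
lookup = names.set_index(names["stitch_code"].str[4:].astype(int))["drug_name"]

# Map each side of the pair to a drug name through the numeric suffix
for col in ("stitch_1", "stitch_2"):
    pairs[col + "_name"] = pairs[col].str[4:].astype(int).map(lookup)

print(pairs)
```

Ids without a match come back as NaN, which gives a direct count of how many pairs could not be named.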

In [30]:
#Import chemical substr dataset
chem_str=pd.read_csv('data/chemical_substr.txt', sep='\t')
chem_str.head()

Unnamed: 0.1,Unnamed: 0,SUB1,SUB2,SUB3,SUB4,SUB5,...,SUB879,SUB880,SUB881
0,carnitine,1,1,1,0,0,...,0,0,0
1,GABA,1,1,0,0,0,...,0,0,0
2,delta-aminolevulinic acid,1,1,0,0,0,...,0,0,0
3,leucovorin,1,1,1,0,0,...,0,0,0
4,PGE2,1,1,1,1,0,...,0,0,0


In [None]:
# Load all of the Pauwels datasets
#http://members.cbio.mines-paristech.fr/~yyamanishi/side-effect/

In [None]:
#Load Liu dataset from .mat file in NSS/capstone/data folder

In [None]:
# Load the Mizutani datasets
#http://web.kuicr.kyoto-u.ac.jp/supp/smizutan/target-effect/

In [19]:
#Import off_sides dataset. OFFSIDES- associations before drug approval
off_sides=pd.read_csv('data/OFFSIDES.csv.xz.csv', compression='xz', header=0, sep=',', quotechar='"')



In [20]:
off_sides.head()

Unnamed: 0,drug_rxnorn_id,drug_concept_name,condition_meddra_id,condition_concept_name,A,B,C,D,PRR,PRR_error,mean_reporting_frequency
0,4024,"ergoloid mesylates, USP",10002034,Anaemia,6,126,21,1299,2.85714,0.45382,0.0454545
1,4024,"ergoloid mesylates, USP",10002965,Aplasia pure red cell,1,131,1,1319,10.0,1.41126,0.00757576
2,4024,"ergoloid mesylates, USP",10013442,Disseminated intravascular coagulation,1,131,6,1314,1.66667,1.07626,0.00757576
3,4024,"ergoloid mesylates, USP",10023126,Jaundice,2,130,7,1313,2.85714,0.79657,0.0151515
4,4024,"ergoloid mesylates, USP",10016288,Febrile neutropenia,1,131,5,1315,2.0,1.09163,0.00757576
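
The A, B, C and D counts appear to determine the other three numeric columns. The formulas below are inferred from the displayed rows (they reproduce the first row) rather than taken from the dataset documentation, so treat them as an assumption; the function names are illustrative.

```python
import math


def mean_reporting_frequency(a, b):
    # Share of reports for the drug that mention the condition
    return a / (a + b)


def prr(a, b, c, d):
    # Proportional reporting ratio: rate for the drug over the rate in the comparison group
    return (a / (a + b)) / (c / (c + d))


def prr_error(a, b, c, d):
    # Standard error of log(PRR)
    return math.sqrt(1 / a + 1 / c - 1 / (a + b) - 1 / (c + d))


# First row of the table above: A=6, B=126, C=21, D=1299
row = (6, 126, 21, 1299)
print(round(prr(*row), 5), round(prr_error(*row), 5),
      round(mean_reporting_frequency(6, 126), 7))
```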


In [21]:
#Import two sides dataset. TWOSIDES - data of side effects of pairs of drugs
two_sides=pd.read_csv('data/TWOSIDES.csv.xz.csv', compression='xz', header=0, sep=',', quotechar='"')



In [22]:
two_sides.head(2)

Unnamed: 0,drug_1_rxnorn_id,drug_1_concept_name,drug_2_rxnorm_id,drug_2_concept_name,condition_meddra_id,condition_concept_name,A,B,C,D,PRR,PRR_error,mean_reporting_frequency
0,10355,Temazepam,136411,sildenafil,10003239,Arthralgia,7,149,24,1536,2.91667,0.421275,0.0448718
1,1808,Bumetanide,7824,Oxytocin,10003239,Arthralgia,1,13,2,138,5.0,1.19224,0.0714286


In [23]:
two_sides.drug_1_concept_name.nunique()

1716

In [None]:
#pivot two-sides
#res = df.pivot_table(index=['item', 'day'], columns='time',
#                    values='data', aggfunc='first').reset_index()

two_sides_pivot=two_sides.drop(['A', 'B', 'C', 'D', 'PRR', 'PRR_error', 'mean_reporting_frequency'], axis=1)

In [None]:
two_sides_pivot.head(2)

In [None]:
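# One row per drug pair, one column per side effect (the columns/values choice is an assumption)
two_sides_pivot=two_sides_pivot.pivot_table(index=['drug_1_rxnorn_id', 'drug_1_concept_name', 'drug_2_rxnorm_id',
                                                   'drug_2_concept_name'], columns='condition_concept_name',
                                            values='condition_meddra_id', aggfunc='count', fill_value=0)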
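
The intended reshaping, one row per drug pair and one indicator column per side effect, can be sketched on a toy frame. `pd.crosstab` is used here as one possible alternative to `pivot_table`; the Nausea row is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "drug_1_concept_name": ["Temazepam", "Temazepam", "Bumetanide"],
    "drug_2_concept_name": ["sildenafil", "sildenafil", "Oxytocin"],
    "condition_concept_name": ["Arthralgia", "Nausea", "Arthralgia"],
})

# Count occurrences per drug pair and side effect, then cap at 1 for a binary matrix
wide = pd.crosstab(
    index=[toy["drug_1_concept_name"], toy["drug_2_concept_name"]],
    columns=toy["condition_concept_name"],
).clip(upper=1)

print(wide)
```

Side effects a pair never shows are filled with 0, so the result can be used directly as a multi-label target matrix.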

How many drug-drug interactions are there?

Question for Michael: two_sides already has the side effects for the two drugs. How are we going to merge the chemical substru

In [32]:
# The file is pipe-delimited; pass a single separator (sep and delimiter cannot both be given)
side_effect_binary=pd.read_csv('data/source_codes_and_datasets_2015-09-05/Liu_dataset_and_experiments/Liu_dataset/merged_data/random_group_cv_data.indication', sep='|', engine='python', header=None)

In [33]:
side_effect_binary.columns=['drugBankID', 'drugName', 'compoundID', 'ADE_str', 'chem_str', \
                           'target_gene', 'transporter', 'enzyme', 'kgg', 'indication', \
                           'group_str']

In [34]:
side_effect_binary.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 832 entries, 0 to 831
Data columns (total 11 columns):
drugBankID     832 non-null object
drugName       832 non-null object
compoundID     832 non-null int64
ADE_str        832 non-null object
chem_str       832 non-null object
target_gene    832 non-null object
transporter    832 non-null object
enzyme         832 non-null object
kgg            832 non-null object
indication     832 non-null object
group_str      832 non-null object
dtypes: int64(1), object(10)
memory usage: 71.6+ KB


In [35]:
side_effect_binary.head(2)

Unnamed: 0,drugBankID,drugName,compoundID,ADE_str,chem_str,target_gene,transporter,enzyme,kgg,indication,group_str
0,DB00220,nelfinavir,4451,0010000000000000000000000000000000001000000000...,0000111111111110000111000000000000000010000000...,0000000000000000000000000000000000000000000000...,0000000000000000100001000000000000000000000000...,0000000000000000000000000000000000000000010000...,0000000000000000000000000000000000000000000000...,0000000000000000000000000000000100000000000000...,"3,0,0,2,4,2,3,4,0,3,2,0,4,2,3,3,3,0,2,3,4,4,4,..."
1,DB01340,cilazapril,2751,0010000000000000000100000001000000001010000000...,0000011111011110000111000000000000000000000000...,0000000000000000000000000000000000000000000000...,0000000000000000000000000000000000000000000000...,0000000000000000000000000000000000000000000000...,0000000000000000000000000000000000000000010000...,0000000000000000000000000000000000000000000000...,"4,2,3,4,4,2,1,4,0,4,4,4,2,2,3,0,2,0,0,3,3,1,3,..."


In [36]:
side_effect_binary1=side_effect_binary[['drugBankID', 'drugName', 'compoundID']]
side_effect_binary1.head(2)

Unnamed: 0,drugBankID,drugName,compoundID
0,DB00220,nelfinavir,4451
1,DB01340,cilazapril,2751


In [37]:
side_effect_binary_str=side_effect_binary[['ADE_str', 'chem_str', 'target_gene', \
                                           'transporter', 'enzyme', 'kgg', 'indication'
                                          ]]

In [38]:
type(side_effect_binary_str)

pandas.core.frame.DataFrame

In [41]:
#convert ADE_str to list and then to dataframe and concatenate to side_effect_binary1
side_effect_binary_ade=side_effect_binary_str.ADE_str.apply(list)
type(side_effect_binary_ade)

pandas.core.series.Series

In [42]:
se_binary_ade=pd.DataFrame(side_effect_binary_ade.tolist())

In [43]:
type(se_binary_ade)

pandas.core.frame.DataFrame
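
Splitting a bit string with `list` yields the characters '0' and '1', not numbers. The sketch below shows the same expansion on toy strings with a cast to integers added; the drug names, the bit strings and the `ADE_` column prefix are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "drugName": ["drug_a", "drug_b"],
    "ADE_str": ["0010", "1001"],
})

# One column per character of the bit string, then cast '0'/'1' to integers
bits = pd.DataFrame(toy["ADE_str"].apply(list).tolist(), index=toy.index).astype(int)

# Prefix the positional column names so they stay readable after concatenation
bits = bits.add_prefix("ADE_")

wide = pd.concat([toy[["drugName"]], bits], axis=1)
print(wide)
```

With integer columns, sums and means over the side-effect indicators behave as expected, which is not the case for string cells.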

In [44]:
side_effect_binary1=pd.concat([side_effect_binary1, se_binary_ade], axis=1)
side_effect_binary1.head(2)

Unnamed: 0,drugBankID,drugName,compoundID,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255,256,257,258,259,260,261,262,263,264,265,266,267,268,269,270,271,272,273,274,275,276,277,278,279,280,281,282,283,284,285,286,287,288,289,290,291,292,293,294,295,296,297,298,299,300,301,302,303,304,305,306,307,308,309,310,311,312,313,314,315,316,317,318,319,320,321,322,323,324,325,326,327,328,329,330,331,332,333,334,335,336,337,338,339,340,341,342,343,344,345,346,347,348,349,350,351,352,353,354,355,356,357,358,359,360,361,362,363,364,365,366,367,368,369,370,371,372,373,374,375,376,377,378,379,380,381,382,383,384,385,386,387,388,389,390,391,392,393,394,395,396,397,398,399,400,401,402,403,404,405,406,407,408,409,410,411,412,413,414,415,416,417,418,419,420,421,422,423,424,425,426,427,428,429,430,431,432,433,434,435,436,437,438,439,440,441,442,443,444,445,446,447,448,449,450,451,452,453,454,455,456,457,458,459,460,461,462,463,464,465,466,467,468,469,470,471,472,473,474,475,476,477,478,479,480,481,482,483,484,485,486,487,488,489,490,491,492,493,494,495,496,497,498,499,500,501,502,503,504,505,506,507,508,509,510,511,512,513,514,515,516,
517,518,519,520,521,522,523,524,525,526,527,528,529,530,531,532,533,534,535,536,537,538,539,540,541,542,543,544,545,546,547,548,549,550,551,552,553,554,555,556,557,558,559,560,561,562,563,564,565,566,567,568,569,570,571,572,573,574,575,576,577,578,579,580,581,582,583,584,585,586,587,588,589,590,591,592,593,594,595,596,597,598,599,600,601,602,603,604,605,606,607,608,609,610,611,612,613,614,615,616,617,618,619,620,621,622,623,624,625,626,627,628,629,630,631,632,633,634,635,636,637,638,639,640,641,642,643,644,645,646,647,648,649,650,651,652,653,654,655,656,657,658,659,660,661,662,663,664,665,666,667,668,669,670,671,672,673,674,675,676,677,678,679,680,681,682,683,684,685,686,687,688,689,690,691,692,693,694,695,696,697,698,699,700,701,702,703,704,705,706,707,708,709,710,711,712,713,714,715,716,717,718,719,720,721,722,723,724,725,726,727,728,729,730,731,732,733,734,735,736,737,738,739,740,741,742,743,744,745,746,747,748,749,750,751,752,753,754,755,756,757,758,759,760,761,762,763,764,765,766,767,768,769,770,771,772,773,774,775,776,777,778,779,780,781,782,783,784,785,786,787,788,789,790,791,792,793,794,795,796,797,798,799,800,801,802,803,804,805,806,807,808,809,810,811,812,813,814,815,816,817,818,819,820,821,822,823,824,825,826,827,828,829,830,831,832,833,834,835,836,837,838,839,840,841,842,843,844,845,846,847,848,849,850,851,852,853,854,855,856,857,858,859,860,861,862,863,864,865,866,867,868,869,870,871,872,873,874,875,876,877,878,879,880,881,882,883,884,885,886,887,888,889,890,891,892,893,894,895,896,897,898,899,900,901,902,903,904,905,906,907,908,909,910,911,912,913,914,915,916,917,918,919,920,921,922,923,924,925,926,927,928,929,930,931,932,933,934,935,936,937,938,939,940,941,942,943,944,945,946,947,948,949,950,951,952,953,954,955,956,957,958,959,960,961,962,963,964,965,966,967,968,969,970,971,972,973,974,975,976,977,978,979,980,981,982,983,984,985,986,987,988,989,990,991,992,993,994,995,996,997,998,999,1000,1001,1002,1003,1004,1005,1006,1007,1008,1009,1010,1011,1012,101
3,1014,1015,1016,1017,1018,1019,1020,1021,1022,1023,1024,1025,1026,1027,1028,1029,1030,1031,1032,1033,1034,1035,1036,1037,1038,1039,1040,1041,1042,1043,1044,1045,1046,1047,1048,1049,1050,1051,1052,1053,1054,1055,1056,1057,1058,1059,1060,1061,1062,1063,1064,1065,1066,1067,1068,1069,1070,1071,1072,1073,1074,1075,1076,1077,1078,1079,1080,1081,1082,1083,1084,1085,1086,1087,1088,1089,1090,1091,1092,1093,1094,1095,1096,1097,1098,1099,1100,1101,1102,1103,1104,1105,1106,1107,1108,1109,1110,1111,1112,1113,1114,1115,1116,1117,1118,1119,1120,1121,1122,1123,1124,1125,1126,1127,1128,1129,1130,1131,1132,1133,1134,1135,1136,1137,1138,1139,1140,1141,1142,1143,1144,1145,1146,1147,1148,1149,1150,1151,1152,1153,1154,1155,1156,1157,1158,1159,1160,1161,1162,1163,1164,1165,1166,1167,1168,1169,1170,1171,1172,1173,1174,1175,1176,1177,1178,1179,1180,1181,1182,1183,1184,1185,1186,1187,1188,1189,1190,1191,1192,1193,1194,1195,1196,1197,1198,1199,1200,1201,1202,1203,1204,1205,1206,1207,1208,1209,1210,1211,1212,1213,1214,1215,1216,1217,1218,1219,1220,1221,1222,1223,1224,1225,1226,1227,1228,1229,1230,1231,1232,1233,1234,1235,1236,1237,1238,1239,1240,1241,1242,1243,1244,1245,1246,1247,1248,1249,1250,1251,1252,1253,1254,1255,1256,1257,1258,1259,1260,1261,1262,1263,1264,1265,1266,1267,1268,1269,1270,1271,1272,1273,1274,1275,1276,1277,1278,1279,1280,1281,1282,1283,1284,1285,1286,1287,1288,1289,1290,1291,1292,1293,1294,1295,1296,1297,1298,1299,1300,1301,1302,1303,1304,1305,1306,1307,1308,1309,1310,1311,1312,1313,1314,1315,1316,1317,1318,1319,1320,1321,1322,1323,1324,1325,1326,1327,1328,1329,1330,1331,1332,1333,1334,1335,1336,1337,1338,1339,1340,1341,1342,1343,1344,1345,1346,1347,1348,1349,1350,1351,1352,1353,1354,1355,1356,1357,1358,1359,1360,1361,1362,1363,1364,1365,1366,1367,1368,1369,1370,1371,1372,1373,1374,1375,1376,1377,1378,1379,1380,1381,1382,1383,1384
0,DB00220,nelfinavir,4451,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
1,DB01340,cilazapril,2751,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0


In [213]:
side_effect_binary.drugName.nunique()

832

In [233]:
side_effect_binary.shape

(832, 11)
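
The two outputs above (832 unique drug names, 832 rows) suggest one row per drug. A minimal sketch of how that can be checked explicitly is shown below. The `toy` frame and the `one_row_per_drug` helper are hypothetical stand-ins; only the `drugName` column name comes from this notebook.

```python
import pandas as pd

# Hypothetical stand-in for side_effect_binary (only `drugName` is a real column name).
toy = pd.DataFrame({
    "drugName": ["nelfinavir", "cilazapril", "nelfinavir"],
    "flag_a": [1, 0, 1],
})

def one_row_per_drug(df, key="drugName"):
    """Return True when every value in `key` is present and appears exactly once."""
    # nunique() ignores NaN, so missing names are checked separately.
    return df[key].nunique() == len(df) and not df[key].isna().any()

# List the names that occur more than once, to inspect before any merge.
duplicated_names = (
    toy.loc[toy["drugName"].duplicated(keep=False), "drugName"].unique().tolist()
)

print(one_row_per_drug(toy))
print(duplicated_names)
```

On the real table the same helper could be called as `one_row_per_drug(side_effect_binary)`.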

In [None]:
#remove group_str from side_effect_binary
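
The cell above is still a placeholder. One way to do the removal is sketched below on a toy frame; the toy columns (apart from the names `drugName` and `group_str`) are assumptions, since the real contents of `group_str` are not shown in this notebook.

```python
import pandas as pd

# Hypothetical miniature of side_effect_binary; values are made up for illustration.
toy = pd.DataFrame({
    "drugName": ["nelfinavir", "cilazapril"],
    "group_str": ["a", "b"],
    "effect_0": [0, 1],
})

# drop() returns a new frame; errors="ignore" makes the cell safe to re-run
# after the column has already been removed.
trimmed = toy.drop(columns=["group_str"], errors="ignore")

print(trimmed.columns.tolist())
```

Assigning the result back (for example `side_effect_binary = side_effect_binary.drop(columns=["group_str"], errors="ignore")`) keeps the cell idempotent.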

In [None]:
#merge side_effect_binary with drug_meddra
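
The merge itself is not written yet, so the following is only a sketch under stated assumptions: the right-hand table is assumed to carry a drug-name column (called `drug_label` here, a hypothetical name), and names are normalised before joining because spelling and case often differ between sources. `indicator=True` makes unmatched drugs easy to list.

```python
import pandas as pd

# Toy stand-ins; only `drugName` and `medra_conceptname` are names used in this notebook.
side_effects = pd.DataFrame({"drugName": ["nelfinavir", "cilazapril"],
                             "effect_0": [0, 1]})
meddra = pd.DataFrame({"drug_label": ["Nelfinavir ", "aspirin"],   # hypothetical column
                       "medra_conceptname": ["Diarrhoea", "Nausea"]})

def normalise(names):
    """Lower-case and strip whitespace so the join keys are comparable."""
    return names.str.strip().str.lower()

side_effects["drug_key"] = normalise(side_effects["drugName"])
meddra["drug_key"] = normalise(meddra["drug_label"])

# Left join keeps every drug from side_effects; validate raises if the
# right-hand key is not unique; _merge records where each row came from.
merged = side_effects.merge(meddra, on="drug_key", how="left",
                            validate="many_to_one", indicator=True)

# Drugs that found no partner in the right-hand table.
unmatched = merged.loc[merged["_merge"] == "left_only", "drugName"].tolist()
print(unmatched)
```

If the right-hand table has several rows per drug, `validate="many_to_one"` will raise, which is a useful early warning that the table needs aggregating first.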

ADE = adverse drug event

In [None]:
#merge TWO_sides with side_effect_binary data

### Load the DrugBank dataset (if access has been granted)

In [265]:
# The DrugBank dataset is an XML file; parse it with the standard-library ElementTree module
import xml.etree.ElementTree as et

In [267]:
xtree = et.parse("data/full_drugbank_database.xml")

In [268]:
def parse_XML(xml_file, df_cols):
    """Parse the input XML file and store the result in a pandas
    DataFrame with the given columns.

    The first element of df_cols is expected to be the identifier
    variable, read from an attribute of each top-level node in the
    XML data; the remaining columns are read from the text content
    of the direct sub-element with the same tag name.

    Note: tag names are matched literally. If the file declares a
    default XML namespace, unqualified names such as 'name' will not
    match and the column will be filled with None.
    """
    xroot = et.parse(xml_file).getroot()
    rows = []

    for node in xroot:
        # The identifier comes from the node's attributes (None if absent).
        row = {df_cols[0]: node.attrib.get(df_cols[0])}
        for el in df_cols[1:]:
            child = node.find(el)
            # Missing sub-elements are recorded as None.
            row[el] = child.text if child is not None else None
        rows.append(row)

    return pd.DataFrame(rows, columns=df_cols)
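
A plain `find('name')` only matches elements whose tag is literally `name`. The sketch below shows the difference on a small inline sample, assuming the DrugBank export declares the default namespace `http://www.drugbank.ca` and stores the primary ID in a `drugbank-id` element; both are assumptions that should be checked against the first lines of the real file. The sample XML is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up sample mimicking the assumed DrugBank layout.
SAMPLE = """<drugbank xmlns="http://www.drugbank.ca">
  <drug type="small molecule">
    <drugbank-id primary="true">DB00220</drugbank-id>
    <name>nelfinavir</name>
  </drug>
</drugbank>"""

NS = {"db": "http://www.drugbank.ca"}  # assumed namespace URI

root = ET.fromstring(SAMPLE)
drug = root[0]

plain = drug.find("name")              # no namespace: nothing matches
qualified = drug.find("db:name", NS)   # namespace-qualified: matches

records = [
    {
        "id": d.findtext("db:drugbank-id[@primary='true']", namespaces=NS),
        "name": d.findtext("db:name", namespaces=NS),
        "description": d.findtext("db:description", namespaces=NS),  # absent here
    }
    for d in root.findall("db:drug", NS)
]
print(plain, qualified.text, records)
```

If the real file behaves like this sample, every column returned by an unqualified lookup would come back as `None`, which is worth ruling out before trusting the parsed table.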

In [None]:
parse_XML("data/full_drugbank_database.xml", ['name', 'id', 'description', 'cas-number'])
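
The full DrugBank export is large, and `et.parse` builds the whole tree in memory. A streaming alternative with `iterparse` is sketched below. The namespace URI, the `drug` tag name and the `stream_drugs` helper are assumptions for illustration; the depth counter is there so that only top-level records are collected, not elements with the same tag nested deeper in a record.

```python
import io
import xml.etree.ElementTree as ET
import pandas as pd

NS = "{http://www.drugbank.ca}"  # assumed namespace, in ElementTree's {uri}tag form

def stream_drugs(source, fields):
    """Collect `fields` from each top-level <drug> without keeping the full tree."""
    rows, depth, root = [], 0, None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if root is None:
                root = elem          # first element seen is the document root
            depth += 1
        else:
            depth -= 1
            # depth == 1 after closing means elem is a direct child of the root.
            if depth == 1 and elem.tag == NS + "drug":
                rows.append({f: elem.findtext(NS + f) for f in fields})
                root.clear()         # drop finished records to free memory
    return pd.DataFrame(rows, columns=fields)

# Made-up sample with one nested <drug> that must be ignored.
sample = b"""<drugbank xmlns="http://www.drugbank.ca">
  <drug><name>nelfinavir</name>
    <pathways><pathway><drugs><drug><name>nested</name></drug></drugs></pathway></pathways>
  </drug>
  <drug><name>cilazapril</name></drug>
</drugbank>"""

drugs_df = stream_drugs(io.BytesIO(sample), ["name", "description"])
print(drugs_df)
```

`iterparse` also accepts a file path, so the same helper could be pointed at `"data/full_drugbank_database.xml"` once the tag names have been confirmed.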

### Data Exploration
