In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import zipfile

In [2]:
path_to_zip = "Data/webmd.zip"
with zipfile.ZipFile(path_to_zip, 'r') as zip_ref:
    zip_ref.extractall("Data")

In [3]:
path_to_zip = "Data/drugsComTrain_raw.zip"
with zipfile.ZipFile(path_to_zip, 'r') as zip_ref:
    zip_ref.extractall("Data")

In [4]:
path_to_zip = "Data/drugsComTest_raw.zip"
with zipfile.ZipFile(path_to_zip, 'r') as zip_ref:
    zip_ref.extractall("Data")

<h2>The three cells above extract the zip archives into the Data directory. Run them only once, at the beginning.</h2>
<h3>REMEMBER NOT TO PUSH THE EXTRACTED FILES TO GITHUB. THEY ARE TOO BIG.</h3>

<br>Explanation: The extracted .csv is around 168 MB, and GitHub blocks files larger than 100 MB. Run this part of the code on your local machine so you have the data to work with, but delete the .csv files and keep only the .zip files before you push to GitHub.
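
To make the "run only once" rule harder to break, the extraction can be wrapped in a guard that skips files already on disk. This is a minimal sketch, not part of the original notebook: `extract_if_missing` is a hypothetical helper name, and it assumes the archives unpack directly into the destination directory.

```python
import zipfile
from pathlib import Path


def extract_if_missing(zip_path, dest_dir="Data"):
    """Extract only the archive members that are not already in dest_dir.

    Hypothetical helper (not from the original notebook).
    Returns the list of member names that were extracted.
    """
    dest = Path(dest_dir)
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        # Compare the archive listing against what is already on disk.
        missing = [name for name in zip_ref.namelist()
                   if not (dest / name).exists()]
        if missing:
            zip_ref.extractall(dest, members=missing)
    return missing
```

Calling `extract_if_missing("Data/webmd.zip")` a second time would then return an empty list instead of unpacking again. Adding `Data/*.csv` to `.gitignore` is another way to keep the extracted files out of commits.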

In [52]:
UCIdrug_train = pd.read_csv("Data/drugsComTrain_raw.csv", parse_dates=["date"])
UCIdrug_test = pd.read_csv("Data/drugsComTest_raw.csv", parse_dates=["date"])
webmd = pd.read_csv("Data/webmd.csv")

Load each .csv file into a pandas DataFrame, parsing the date column of the two UCI files as datetimes.

In [55]:
print("UCI Train shape :" ,UCIdrug_train.shape)
print("UCI Test shape :", UCIdrug_test.shape)
print("Webmd shape:", webmd.shape)

UCI Train shape : (161297, 7)
UCI Test shape : (53766, 7)
Webmd shape: (362806, 12)


Above we check the shape of each DataFrame as a sanity check that the data loaded correctly.<br>
The UCI train data has 161297 rows and 7 columns.<br>
The UCI test data has 53766 rows and 7 columns.<br>
The webmd data has 362806 rows and 12 columns.

In [57]:
UCIdrug_train.head()

Unnamed: 0,uniqueID,drugName,condition,review,rating,date,usefulCount
0,206461,Valsartan,Left Ventricular Dysfunction,"""It has no side effect, I take it in combinati...",9,2012-05-20,27
1,95260,Guanfacine,ADHD,"""My son is halfway through his fourth week of ...",8,2010-04-27,192
2,92703,Lybrel,Birth Control,"""I used to take another oral contraceptive, wh...",5,2009-12-14,17
3,138000,Ortho Evra,Birth Control,"""This is my first time using any form of birth...",8,2015-11-03,10
4,35696,Buprenorphine / naloxone,Opiate Dependence,"""Suboxone has completely turned my life around...",9,2016-11-27,37


We print the first few rows of the UCI data to get an idea of its contents. Arguably the most important columns here are drugName, condition, review, rating, and possibly usefulCount.

In [58]:
webmd.head()

Unnamed: 0,Age,Condition,Date,Drug,DrugId,EaseofUse,Effectiveness,Reviews,Satisfaction,Sex,Sides,UsefulCount
0,75 or over,Stuffy Nose,9/21/2014,25dph-7.5peh,146724,5,5,I'm a retired physician and of all the meds I ...,5,Male,"Drowsiness, dizziness , dry mouth /nose/thro...",0
1,25-34,Cold Symptoms,1/13/2011,25dph-7.5peh,146724,5,5,cleared me right up even with my throat hurtin...,5,Female,"Drowsiness, dizziness , dry mouth /nose/thro...",1
2,65-74,Other,7/16/2012,warfarin (bulk) 100 % powder,144731,2,3,why did my PTINR go from a normal of 2.5 to ov...,3,Female,,0
3,75 or over,Other,9/23/2010,warfarin (bulk) 100 % powder,144731,2,2,FALLING AND DON'T REALISE IT,1,Female,,0
4,35-44,Other,1/6/2009,warfarin (bulk) 100 % powder,144731,1,1,My grandfather was prescribed this medication ...,1,Male,,1


Now we print the first few rows of the webmd data to get an idea of its contents. Arguably the most important columns here are Drug, Condition, Reviews, EaseofUse, Satisfaction, and Sides. However, preprocessing is required, since the drug name also includes the form of the drug, and many entries in the Sides (side effects) column are empty.
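
One way to quantify how many side-effect entries are empty is to count values that are either missing or whitespace-only. The sketch below uses a made-up toy frame with the same column name; `blank_fraction` is a hypothetical helper, and how empty entries are actually encoded in webmd.csv is an assumption that should be checked against the real data.

```python
import numpy as np
import pandas as pd


def blank_fraction(series):
    """Return the fraction of entries that are NaN or whitespace-only strings."""
    is_blank = series.isna() | series.astype(str).str.strip().eq("")
    return is_blank.mean()


# Toy frame with the same column name as the webmd data (values are made up).
toy = pd.DataFrame({"Sides": ["Drowsiness, dizziness", " ", np.nan, "Nausea"]})
print(blank_fraction(toy["Sides"]))
```

On the real data this would be called as `blank_fraction(webmd["Sides"])`.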

In [66]:
num_unique_drug_webmd = webmd["Drug"].nunique() 
print("Unique drugs in webmd: ", num_unique_drug_webmd)
num_unique_drug_UCItrain = UCIdrug_train["drugName"].nunique() 
print("Unique drugs in UCI Train data: ", num_unique_drug_UCItrain)

Unique drugs in webmd:  7093
Unique drugs in UCI Train data:  3436


Above we count the unique values in the Drug column of the webmd data and the drugName column of the UCI train data. However, the webmd data lists the same drug in several forms, and each form counts as a separate value. We must account for this to find, as best we can, the true number of unique drugs in the webmd dataset.

In [67]:
unique_drug_webmd = webmd["Drug"].unique() 
unique_drug_UCItrain = UCIdrug_test["drugName"].unique()  # assumption: the undefined df_test was the UCI test frame, as the output order suggests
print(type(unique_drug_UCItrain))
print(unique_drug_UCItrain[0:10])
print(unique_drug_webmd[0:10])

<class 'numpy.ndarray'>
['Mirtazapine' 'Mesalamine' 'Bactrim' 'Contrave' 'Cyclafem 1 / 35'
 'Zyclara' 'Copper' 'Amitriptyline' 'Methadone' 'Levora']
['25dph-7.5peh' 'warfarin (bulk) 100 % powder' 'wymzya fe'
 '12 hour nasal relief spray, non-aerosol' 'pyrogallol crystals' 'lyza'
 'lysiplex plus liquid' 'lysteda' 'pyrithione zinc shampoo'
 'lysine acetate 4,000 mg oral powder packet']


As we see above, most webmd entries start with the drug name, and what follows describes the form of the drug. This is what we must take care of. <br>
unique_drug_webmd contains the unique values from the Drug column of webmd. <br>
unique_drug_UCItrain contains the unique values from the drugName column of the UCI data. Note that the printed names begin with Mirtazapine rather than Valsartan, the first drug in the train set, so despite the variable name they appear to come from the test set.

In [84]:
sep = ' '
unique_drug_webmd_names = []
for i in unique_drug_webmd:
    unique_drug_webmd_names.append(i.split(sep, 1)[0])
    
unique_drug_webmd_names = np.array(unique_drug_webmd_names)
initial = len(unique_drug_webmd_names)

unique_drug_webmd_names = np.unique(unique_drug_webmd_names)   #returns sorted unique values in numpy array
print(unique_drug_webmd_names[50:60])
final = len(unique_drug_webmd_names)

print("\nLength of array of first words before removing repetition:", initial)
print("Length of array of first words after removing repetition:", final)
print("Repetitions removed:", initial - final)

['acetic' 'acetyl-l-carnitine' 'acetylcholine'
 'acetylcyst-mecobal-l-mefolate' 'acetylcysteine' 'acid' 'acidophilus'
 'acidophilus-pectin' 'acidophilus-sporogenes' 'aciphex']

Length of array of first words before removing repetition: 7093
Length of array of first words after removing repetition: 4615
Repetitions removed: 2478


The above cell keeps only the first word of each entry in unique_drug_webmd and stores it in unique_drug_webmd_names. Different forms of the same drug now collapse to the same first word, so we take the unique elements of unique_drug_webmd_names. The array shrinks from 7093 to 4615 entries, so 2478 repetitions were removed.
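
The same first-word extraction can be done without an explicit loop by using the pandas string accessor. This is a sketch on a toy Series standing in for `webmd["Drug"]`, with values taken from the samples printed earlier. The first-word rule is a heuristic: the output above contains entries such as 'acid' that are not drug names on their own.

```python
import pandas as pd

# Toy stand-in for webmd["Drug"]; the real column is not loaded here.
drug = pd.Series([
    "warfarin (bulk) 100 % powder",
    "warfarin",
    "lysteda",
    "lysine acetate 4,000 mg oral powder packet",
])

# Split once on the first space and keep the part before it.
first_word = drug.str.split(pat=" ", n=1).str[0]
unique_first_words = sorted(first_word.unique())
print(unique_first_words)
```

Here the two warfarin entries collapse into one name, which is the effect the cell above relies on.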

In [87]:
# Upper-case every name, including the last element of each array,
# so that the comparison below is case-insensitive.
unique_drug_UCItrain = np.array([name.upper() for name in unique_drug_UCItrain])
unique_drug_webmd_names = np.array([name.upper() for name in unique_drug_webmd_names])
    
print(unique_drug_UCItrain[0:5])
print(unique_drug_webmd_names[0:5])

['MIRTAZAPINE' 'MESALAMINE' 'BACTRIM' 'CONTRAVE' 'CYCLAFEM 1 / 35']
['12' '15DM-100GFN-5PEH' '20DM-4CPM' '25DPH-7.5PEH' '4']


Now we want to compare the two arrays to find the drugs that appear in both datasets. To avoid mismatches caused by uppercase and lowercase letters, we convert everything in both arrays to uppercase.

In [89]:
match_count = 0
common_drugs = []
for i in unique_drug_UCItrain:
    if i in unique_drug_webmd_names:
        match_count+=1
        common_drugs.append(i)

print(match_count)
print(common_drugs[0:10])

1570
['MIRTAZAPINE', 'MESALAMINE', 'BACTRIM', 'CONTRAVE', 'ZYCLARA', 'AMITRIPTYLINE', 'METHADONE', 'PAROXETINE', 'MICONAZOLE', 'BELVIQ']


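We have 1570 drugs that are present in both the UCI drug names collected above and the first words of the webmd drug names. A few of the common drugs are printed above. Because only the first word of each webmd name is kept, multi-word UCI names such as CYCLAFEM 1 / 35 cannot match.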
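
The membership loop above scans the whole webmd array for every UCI name. A set lookup gives the same matches with constant-time checks. This is a sketch on toy lists whose values come from the printed samples; `uci_names`, `webmd_first_words`, and `webmd_lookup` are illustrative names, not variables from the notebook.

```python
# Toy stand-ins for the two upper-cased name arrays built above.
uci_names = ["MIRTAZAPINE", "MESALAMINE", "CYCLAFEM 1 / 35", "BACTRIM"]
webmd_first_words = ["BACTRIM", "MESALAMINE", "MIRTAZAPINE", "WARFARIN"]

# Build the set once, then test each UCI name against it.
webmd_lookup = set(webmd_first_words)
common = [name for name in uci_names if name in webmd_lookup]
print(len(common), common)
```

Keeping the list comprehension rather than a plain set intersection preserves the order of the UCI names, as the original loop does.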