In [111]:
# Basic Libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot
sb.set() # set the default Seaborn style for graphics

In [112]:
DrugData = pd.read_csv('Drug.csv')
DrugData.head()


Unnamed: 0,Condition,Drug,Indication,Type,Reviews,Effective,EaseOfUse,Satisfaction,Information
0,Acute Bacterial Sinusitis,Levofloxacin,On Label,RX,994 Reviews,2.52,3.01,1.84,\n\t\t\t\t\tLevofloxacin is used to treat a va...
1,Acute Bacterial Sinusitis,Levofloxacin,On Label,RX,994 Reviews,2.52,3.01,1.84,\n\t\t\t\t\tLevofloxacin is used to treat a va...
2,Acute Bacterial Sinusitis,Moxifloxacin,On Label,RX,755 Reviews,2.78,3.0,2.08,\n\t\t\t\t\t This is a generic drug. The avera...
3,Acute Bacterial Sinusitis,Azithromycin,On Label,RX,584 Reviews,3.21,4.01,2.57,\n\t\t\t\t\tAzithromycin is an antibiotic (mac...
4,Acute Bacterial Sinusitis,Azithromycin,On Label,RX,584 Reviews,3.21,4.01,2.57,\n\t\t\t\t\tAzithromycin is an antibiotic (mac...


In [113]:
DrugData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2219 entries, 0 to 2218
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Condition     2219 non-null   object 
 1   Drug          2219 non-null   object 
 2   Indication    2219 non-null   object 
 3   Type          2219 non-null   object 
 4   Reviews       2219 non-null   object 
 5   Effective     2219 non-null   float64
 6   EaseOfUse     2219 non-null   float64
 7   Satisfaction  2219 non-null   float64
 8   Information   2219 non-null   object 
dtypes: float64(3), object(6)
memory usage: 156.1+ KB


Let's list our observations from this quick glance at the dataset.

There are 9 variables/features/columns and 2219 observations/samples/rows in the dataset.

The response variable seems to be Effective, while the remaining 8 are most likely predictors.

There are 3 variables identified as float64 by default, and it seems they are indeed Numeric.

There are 6 variables identified as object by default, and they are most likely Categorical (except Reviews, which seems to be a numerical count stored as text).

None of the variables/features seems to have any missing values.
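
The observations above can also be checked programmatically instead of being read off the `info()` printout. The following is a minimal sketch on a hypothetical three-row `sample` frame (not the real `Drug.csv`), so the variable names and values are assumptions used only for illustration.

```python
import pandas as pd

# Hypothetical three-row sample with the same kinds of columns as the dataset
sample = pd.DataFrame({
    "Drug": ["Levofloxacin", "Moxifloxacin", "Azithromycin"],
    "Reviews": ["994 Reviews", "755 Reviews", "584 Reviews"],
    "Effective": [2.52, 2.78, 3.21],
})

# Number of observations (rows) and variables (columns)
n_rows, n_cols = sample.shape

# Split the columns into numeric and non-numeric by dtype
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
other_cols = sample.select_dtypes(exclude="number").columns.tolist()

# Count missing values per column
missing_per_col = sample.isna().sum()

print(n_rows, n_cols)
print(numeric_cols, other_cols)
print(missing_per_col.to_dict())
```

Running the same three checks on `DrugData` should reproduce the counts listed above.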

---
## Data Preparation and Cleaning

We will look through the dataset to identify anything that needs to be cleaned before our analysis.

In [114]:
DrugData["Reviews"]

0       994 Reviews
1       994 Reviews
2       755 Reviews
3       584 Reviews
4       584 Reviews
           ...     
2214      2 Reviews
2215      1 Reviews
2216      1 Reviews
2217      1 Reviews
2218      1 Reviews
Name: Reviews, Length: 2219, dtype: object

Looking at the Reviews variable, we note that it should be numerical but is stored as object because every value carries a trailing " Reviews" string. Hence, we need to remove that suffix and convert the column to a numerical type.

In [115]:

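# Keep only the leading digits; str.rstrip(' Reviews') would strip a set of characters, not the suffix
DrugData["Reviews"] = DrugData["Reviews"].str.extract(r'(\d+)', expand=False).astype('int64')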
DrugData["Reviews"]


0       994
1       994
2       755
3       584
4       584
       ... 
2214      2
2215      1
2216      1
2217      1
2218      1
Name: Reviews, Length: 2219, dtype: int64
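
A short note on the conversion: `str.rstrip` treats its argument as a set of characters rather than a literal suffix. That is harmless for values like "994 Reviews" because digits are not in the set, but it can surprise on other text. The sketch below uses a hypothetical `reviews` series to show the behaviour and a digit-extraction alternative.

```python
import pandas as pd

# Hypothetical sample in the same format as the Reviews column
reviews = pd.Series(["994 Reviews", "755 Reviews", "1 Reviews"])

# rstrip removes any trailing characters found in the set {' ', 'R', 'e', 'v', 'i', 'w', 's'}
digits_survive = "994 Reviews".rstrip(" Reviews")
letters_eaten = "news Reviews".rstrip(" Reviews")  # trailing 'e', 'w', 's' of "news" go too
print(repr(digits_survive), repr(letters_eaten))

# Extracting the leading digits does not depend on the exact suffix
counts = reviews.str.extract(r"(\d+)", expand=False).astype("int64")
print(counts.tolist())
```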

The price and form of each medicine can be found within the Information column. We will extract both from that column and create new columns for price and form. (Examples of forms are "tablet", "capsule", etc.)

In [116]:
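# Form keywords that may appear in the "average price" sentence of Information
all_form = ['Capsule(s)','Capsule','Tablet(s)','Tablet','tablets','Bottle','Vial(s)','Vial','Reconstituted(s)','Reconstituted','Tube','Jar','Can','Box','Syringe','Implant','Package','Pen(s)','Inhaler']
Form = []
Price = []
for i in range(DrugData.shape[0]):
    info = DrugData["Information"].iloc[i]
    words = []  # reset per row so a row without a price sentence never reuses stale words
    for x in info.split('.'):
        if 'average' in x:
            words = x.split()
    # First word matching a known form, or None so that Form stays aligned with the rows
    form = None
    for word in words:
        word = word.replace(",", "")
        if word in all_form:
            form = word
            break
    Form.append(form)
    # First word containing '$', or None so that Price stays aligned with the rows
    price = None
    for word in words:
        if '$' in word:
            temp = word.replace("$", "")
            price = temp[:-1]  # trim the trailing character after the amount
            break
    Price.append(price)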
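
As an alternative to word-by-word scanning, the price and form could be pulled out with regular expressions. This is only a sketch: the `info` string below is a hypothetical sentence, since the Information text is truncated in the output above, and both patterns are assumptions about its wording. It also shows that splitting on '.' would cut a price with cents in two, which is worth checking against the real text.

```python
import re

# Hypothetical Information text; the wording is an assumption, not a quote from the dataset
info = "This is a generic drug. The average cash price for 10 Tablet(s), 500mg each is $172.99. More text."

# Splitting on '.' separates the cents from the dollars if a price has decimals
average_parts = [s for s in info.split('.') if 'average' in s]
print(average_parts)

# Assumed pattern: '$' followed by digits, optional thousands commas and optional cents
price_match = re.search(r"\$(\d[\d,]*(?:\.\d+)?)", info)
price = float(price_match.group(1).replace(",", "")) if price_match else None

# Shortened keyword list for illustration; the full list of forms would go here
form_match = re.search(r"\b(Tablet|Capsule|Vial|Bottle)", info)
form = form_match.group(1) if form_match else None

print(price, form)
```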

In [117]:
# Collapse the raw form keywords into six broader categories
form_groups = {
    'Tablet': ['Tablet(s)', 'Tablet', 'tablets'],
    'Capsule': ['Capsule(s)', 'Capsule'],
    'Cream': ['Tube', 'Can', 'Jar'],
    'Liquid (Drink)': ['Bottle'],
    'Liquid (Inject)': ['Vial', 'Reconstituted', 'Reconstituted(s)', 'Vial(s)', 'Pen(s)', 'Syringe'],
    'Other': ['Box', 'Package', 'Implant', 'Inhaler'],
}
# Invert to keyword -> category for direct lookup
form_map = {raw: group for group, raws in form_groups.items() for raw in raws}
# Values that are not in the map are left unchanged
Form = [form_map.get(f, f) for f in Form]
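
Before the lists become the new price and form columns, it may help to confirm that each has exactly one entry per row, since column assignment from a list fails otherwise. The sketch below uses hypothetical stand-ins (`df`, `form_values`, `price_values`) rather than the real data, and `pd.to_numeric` with `errors="coerce"` is one possible way to turn the price strings into numbers.

```python
import pandas as pd

# Hypothetical stand-ins for the DataFrame and the extracted lists
df = pd.DataFrame({"Drug": ["A", "B", "C"]})
form_values = ["Tablet", "Capsule", None]
price_values = ["172.99", "35", None]

# A list assigned as a column must have one value per row
lengths_match = len(form_values) == len(df) and len(price_values) == len(df)
print(lengths_match)

df["Form"] = form_values
# errors="coerce" turns missing or unparseable prices into NaN instead of raising
df["Price"] = pd.to_numeric(pd.Series(price_values), errors="coerce")

# Quick look at how many rows fall into each form category
form_counts = df["Form"].value_counts().to_dict()
print(form_counts)
print(df["Price"].tolist())
```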