<div id="about_dataset">
    <h2>About the dataset</h2>
    Imagine that you are a medical researcher compiling data for a study. You have collected data about a set of patients, all of whom suffered from the same illness. During their course of treatment, each patient responded to one of 5 medications: Drug A, Drug B, Drug C, Drug X, and Drug Y.
    <br>
    <br>
    Part of your job is to build a model to find out which drug might be appropriate for a future patient with the same illness. The features of this dataset are the Age, Sex, Blood Pressure, Cholesterol, and Na_to_K (sodium-to-potassium ratio) of the patients, and the target is the drug that each patient responded to.
    <br>
    <br>
    This is an example of a multiclass classification problem (five possible drugs). You can use the training part of the dataset
    to build a decision tree, and then use it to predict the class of an unknown patient, or to prescribe a drug to a new patient.
</div>

In [1]:
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("drug200.csv", delimiter=",")
df.head(10)

Unnamed: 0,Age,Sex,BP,Cholesterol,Na_to_K,Drug
0,23,F,HIGH,HIGH,25.355,drugY
1,47,M,LOW,HIGH,13.093,drugC
2,47,M,LOW,HIGH,10.114,drugC
3,28,F,NORMAL,HIGH,7.798,drugX
4,61,F,LOW,HIGH,18.043,drugY
5,22,F,NORMAL,HIGH,8.607,drugX
6,49,F,NORMAL,HIGH,16.275,drugY
7,41,M,LOW,HIGH,11.037,drugC
8,60,M,NORMAL,HIGH,15.171,drugY
9,43,M,LOW,NORMAL,19.368,drugY


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 6 columns):
Age            200 non-null int64
Sex            200 non-null object
BP             200 non-null object
Cholesterol    200 non-null object
Na_to_K        200 non-null float64
Drug           200 non-null object
dtypes: float64(1), int64(1), object(4)
memory usage: 9.5+ KB


In [3]:
# Pre-processing: X is the feature Matrix and y is the response vector

X = df[["Age", "Sex", "BP", "Cholesterol", "Na_to_K"]].values
# Select the target as a 1-D array; df[["Drug"]] would give an (n, 1) column vector
y = df["Drug"].values

In [4]:
# scikit-learn decision trees need numeric inputs, so encode the categorical features as integers
from sklearn import preprocessing

le_sex = preprocessing.LabelEncoder()
le_sex.fit(["F", "M"])
X[:, 1] = le_sex.transform(X[:, 1])

le_bp = preprocessing.LabelEncoder()
le_bp.fit(["LOW", "NORMAL", "HIGH"])
X[:, 2] = le_bp.transform(X[:, 2])

le_chol = preprocessing.LabelEncoder()
le_chol.fit(["NORMAL", "HIGH"])
X[:, 3] = le_chol.transform(X[:, 3])
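
`LabelEncoder` assigns integer codes in sorted (alphabetical) order of the labels, not in the order passed to `fit`, so the BP codes above do not follow LOW < NORMAL < HIGH. A minimal sketch for inspecting the mapping; the names `le_bp_demo` and `bp_mapping` are illustrative and not part of the notebook:

```python
from sklearn import preprocessing

# Illustrative encoder, fitted the same way as le_bp above
le_bp_demo = preprocessing.LabelEncoder()
le_bp_demo.fit(["LOW", "NORMAL", "HIGH"])

# classes_ holds the labels in sorted order; each position is the integer code
bp_mapping = {str(label): int(code) for code, label in enumerate(le_bp_demo.classes_)}
print(bp_mapping)  # {'HIGH': 0, 'LOW': 1, 'NORMAL': 2}
```

A decision tree only compares each code against a threshold, so it can still separate the categories, but a split such as `BP <= 0.5` then means "HIGH versus the rest" rather than "low versus higher".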

In [5]:
# Train-test split 70-30
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=3)
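
As a quick sanity check on the split, the sketch below uses stand-in arrays with the same number of rows as the dataset (200) to show the resulting sizes. `X_demo` and `y_demo` are placeholders, not the real features:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with 200 rows, matching the size of the dataset
X_demo = np.arange(200).reshape(-1, 1)
y_demo = np.arange(200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=3
)
print(X_tr.shape, X_te.shape)  # (140, 1) (60, 1)
```

With only 60 test rows, each misclassified patient changes the reported accuracy by 1/60, or about 1.7 percentage points.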

In [6]:
# Modelling

dtree = DecisionTreeClassifier(criterion="entropy", max_depth=4)
# criterion="entropy" chooses each split by information gain (reduction in entropy)
# max_depth limits the depth of the tree, i.e. the number of successive splits on any path from root to leaf

dtree.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=4,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')
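
To make `criterion="entropy"` concrete, here is a minimal pure-Python sketch of the quantity the tree maximizes at each split. The helper names `entropy` and `information_gain` and the toy labels are illustrative; they are not scikit-learn functions and the labels are not taken from drug200.csv:

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
    return entropy(parent) - weighted


# Toy labels: a perfectly mixed parent node split into two pure children
parent = ["drugY", "drugY", "drugX", "drugX"]
left = ["drugY", "drugY"]
right = ["drugX", "drugX"]

print(entropy(parent))                        # 1.0
print(information_gain(parent, left, right))  # 1.0
```

A split that leaves both children as mixed as the parent would have a gain of 0, so the tree prefers splits that produce purer children.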

In [8]:
# Prediction
pred = dtree.predict(X_test)

# Accuracy
from sklearn.metrics import accuracy_score
print("Accuracy:", accuracy_score(y_test, pred))

Accuracy: 0.9833333333333333


In [11]:
# Accuracy computed by hand, without sklearn and with minimal NumPy.
# Note: this is the fraction of exact matches (plain accuracy), not the
# set-based Jaccard index, so it should agree with accuracy_score above.

# Flatten in case the labels are stored as an (n, 1) column vector
y_true = np.ravel(y_test)
y_pred = np.ravel(pred)

# The string labels can be compared element-wise; no integer encoding is needed
matches = y_true == y_pred

matches.sum() / len(matches)

0.9833333333333333
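
For contrast, the sketch below computes both plain accuracy and a per-class Jaccard index (intersection over union of the samples assigned to one label) on toy labels. The helper names and the toy labels are illustrative and are not taken from the model above:

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def jaccard_per_class(y_true, y_pred, label):
    """|intersection| / |union| of the sample positions labelled `label`."""
    true_idx = {i for i, t in enumerate(y_true) if t == label}
    pred_idx = {i for i, p in enumerate(y_pred) if p == label}
    union = true_idx | pred_idx
    return len(true_idx & pred_idx) / len(union) if union else 0.0


# Toy labels: one drugX patient is misclassified as drugY
y_true_demo = ["drugX", "drugX", "drugY", "drugY"]
y_pred_demo = ["drugX", "drugY", "drugY", "drugY"]

print(accuracy(y_true_demo, y_pred_demo))                    # 0.75
print(jaccard_per_class(y_true_demo, y_pred_demo, "drugX"))  # 0.5
```

The two numbers differ because the Jaccard index penalizes a class for both its missed samples and its false positives, while accuracy counts each sample once.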

### Visualization


![Decision Tree](download.png)
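
The notebook does not show how the image was produced. One possible way to inspect a fitted tree, assuming scikit-learn 0.21 or later, is `sklearn.tree.export_text`. The sketch below fits a tiny stand-in tree on made-up data (`X_demo`, `y_demo`, and `tree_demo` are illustrative) rather than reusing `dtree`:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Tiny stand-in data: one numeric feature, two classes (not from drug200.csv)
X_demo = [[5.0], [8.0], [16.0], [25.0]]
y_demo = ["drugX", "drugX", "drugY", "drugY"]

tree_demo = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=0)
tree_demo.fit(X_demo, y_demo)

# Plain-text rendering of the split rules and leaf classes
tree_text = export_text(tree_demo, feature_names=["Na_to_K"])
print(tree_text)
```

For a graphical rendering similar to the image, `sklearn.tree.plot_tree` (also available from scikit-learn 0.21) draws a fitted tree with matplotlib; passing the five feature names in column order should make the split labels readable.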