# Decision Trees

In this notebook, we will learn about a popular machine learning classification algorithm, the decision tree. We will use historical data on patients and their responses to different medications to build a decision tree model, and then use the trained model to predict the appropriate drug for a new patient.

In [None]:
import numpy as np
import pandas as pd

%matplotlib inline
import matplotlib.pyplot as plt

Suppose that we have data about a set of patients, all of whom suffered from the same illness. During the course of treatment, each patient responded to one of five medications: drugA, drugB, drugC, drugD, and drugE. We want to build a decision tree model to find out which drug might be appropriate for a future patient with the same illness.

The features in this dataset are the Age, Sex, Blood Pressure (BP), Cholesterol, and sodium-to-potassium ratio (Na_to_K) of each patient, and the target is the drug that each patient responded to.

## Data Pre-processing

In [None]:
df = pd.read_csv('drug200.csv')

In [None]:
df.head()

In [None]:
df['Sex'].value_counts()

In [None]:
df['BP'].value_counts()

In [None]:
df['Cholesterol'].value_counts()

Sex, BP, and Cholesterol are categorical variables. Let's convert them into numeric dummy variables:

In [None]:
dummyVariable_sex = pd.get_dummies(df['Sex'], dtype=int)   # get 0/1 dummy variables for the Sex column
df = pd.concat([df, dummyVariable_sex], axis=1)            # concatenate the dummy variables to the dataframe
df.drop('Sex', axis=1, inplace=True)                       # drop the original Sex column
df.drop('F', axis=1, inplace=True)                         # drop the redundant F dummy variable
df.rename(columns={'M': 'dSex'}, inplace=True)             # dSex: 0 for F and 1 for M

dummyVariable_chl = pd.get_dummies(df['Cholesterol'], dtype=int)
df = pd.concat([df, dummyVariable_chl], axis=1)
df.drop('Cholesterol', axis=1, inplace=True)
df.drop('HIGH', axis=1, inplace=True)                      # drop the redundant HIGH dummy variable
df.rename(columns={'NORMAL': 'dCholesterol'}, inplace=True)  # dCholesterol: 0 for HIGH and 1 for NORMAL

dummyVariable_bp = pd.get_dummies(df['BP'], dtype=int)
df = pd.concat([df, dummyVariable_bp], axis=1)
df.drop('BP', axis=1, inplace=True)
df['dBP'] = 2*df['NORMAL'] + df['LOW']                     # dBP: 0, 1, 2 represent HIGH, LOW, NORMAL
df.drop(['HIGH', 'LOW', 'NORMAL'], axis=1, inplace=True)   # drop the BP dummy columns
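
The same encoding can also be written with explicit label-to-integer mappings. The sketch below is one possible alternative, not the approach used in this notebook. It runs on a small made-up dataframe (`toy`, with invented rows) and assumes the category labels are `F`/`M`, `HIGH`/`NORMAL`, and `HIGH`/`LOW`/`NORMAL`, as the dummy-variable code above implies.

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from drug200.csv
toy = pd.DataFrame({
    'Sex': ['F', 'M', 'M'],
    'BP': ['HIGH', 'LOW', 'NORMAL'],
    'Cholesterol': ['HIGH', 'NORMAL', 'HIGH'],
})

# Explicit mappings that reproduce the 0/1/2 coding used above
sex_codes = {'F': 0, 'M': 1}
bp_codes = {'HIGH': 0, 'LOW': 1, 'NORMAL': 2}
chl_codes = {'HIGH': 0, 'NORMAL': 1}

encoded = pd.DataFrame({
    'dSex': toy['Sex'].map(sex_codes),
    'dBP': toy['BP'].map(bp_codes),
    'dCholesterol': toy['Cholesterol'].map(chl_codes),
})
print(encoded)
```

With `Series.map`, any label missing from a mapping becomes `NaN`, which makes unexpected categories easy to spot.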

In [None]:
df.head()

Let's define the feature matrix X and the target vector y:

In [None]:
X = df[['Age', 'dSex', 'dBP', 'dCholesterol', 'Na_to_K']].values
y = df['Drug']

## Decision Tree

Let's split the data into training and test sets:

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 3)
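
To make the idea of a hold-out split concrete, here is a simplified sketch in plain Python. It is not how `train_test_split` is implemented; the helper `simple_train_test_split` and the stand-in data are hypothetical and only illustrate shuffling the rows with a fixed seed and setting aside 30% of them for testing.

```python
import random

def simple_train_test_split(rows, test_size=0.3, seed=3):
    """Shuffle row indices and hold out a fraction for testing (simplified sketch)."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)       # reproducible shuffle
    n_test = round(test_size * len(rows))      # number of held-out rows
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = [rows[i] for i in train_idx]
    test = [rows[i] for i in test_idx]
    return train, test

rows = list(range(10))                         # stand-in for 10 samples
train_rows, test_rows = simple_train_test_split(rows)
print(len(train_rows), len(test_rows))
```

Fixing the seed plays the same role as `random_state` above: the same rows end up in the test set on every run.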

Let's import DecisionTreeClassifier, and create an instance of it named drugTree

In [None]:
from sklearn.tree import DecisionTreeClassifier

# we specify criterion = 'entropy' (lowercase) so that splits are chosen by information gain
drugTree = DecisionTreeClassifier(criterion = 'entropy', max_depth = 4)
drugTree   # shows the default parameters
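
As a rough illustration of what the entropy criterion measures, the sketch below computes Shannon entropy and the information gain of a single split by hand. The helper names `entropy` and `information_gain` and the labels are made up for this example; scikit-learn's internal implementation differs.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
    return entropy(parent) - weighted

parent = ['drugA', 'drugA', 'drugB', 'drugB']   # made-up labels
left, right = ['drugA', 'drugA'], ['drugB', 'drugB']
print(entropy(parent))                          # 1.0 bit: two equally likely classes
print(information_gain(parent, left, right))    # 1.0: the split separates the classes perfectly
```

At each node, the tree prefers the split with the highest information gain, which is the split that leaves the child nodes as pure as possible.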

Now, we will fit the model to the training feature matrix X_train and the training target y_train

In [None]:
drugTree.fit(X_train, y_train)

Let's make predictions on the test set and store them in a variable called predTree

In [None]:
predTree = drugTree.predict(X_test)

Let's compare the predictions on the test set with the actual values of the target variable

In [None]:
print(y_test[0:5])

In [None]:
print(predTree[0:5])

Now, let's estimate the accuracy of our model

In [None]:
from sklearn import metrics

print('Decision Tree Accuracy : ', metrics.accuracy_score(y_test, predTree))

__Accuracy classification score__ is the fraction of test samples whose predicted label exactly matches the true label. For a single-label problem like this one, that is the number of correct predictions divided by the number of test samples.

In multilabel classification, the function instead returns the subset accuracy: a sample counts as correct only if its entire set of predicted labels exactly matches the true set of labels.
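
For the single-label case, the score can be reproduced by hand. The sketch below uses a hypothetical helper `accuracy` and made-up labels; it is meant only to show the arithmetic behind `metrics.accuracy_score`, not to replace it.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Made-up labels for illustration
y_true = ['drugA', 'drugB', 'drugC', 'drugA']
y_pred = ['drugA', 'drugB', 'drugA', 'drugA']
print(accuracy(y_true, y_pred))   # 0.75: three of the four predictions are correct
```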

Finally, let's visualize the tree

In [None]:
from io import StringIO   # sklearn.externals.six was removed from recent scikit-learn releases
import pydotplus
import matplotlib.image as mpimg
from sklearn import tree

In [None]:
dot_data = StringIO()
filename = "drugtree.png"
# feature names must be listed in the same column order as X
featureNames = ['Age', 'dSex', 'dBP', 'dCholesterol', 'Na_to_K']
# class names must follow the sorted label order used by the fitted tree
targetNames = np.unique(y_train).tolist()
tree.export_graphviz(drugTree, feature_names=featureNames, out_file=dot_data,
                     class_names=targetNames, filled=True, special_characters=True, rotate=False)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png(filename)
img = mpimg.imread(filename)
plt.figure(figsize=(100, 200))
plt.imshow(img, interpolation='nearest')
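
If Graphviz or pydotplus is not available, a plain-text view of the tree is one possible fallback. The sketch below uses scikit-learn's `export_text` on a tiny made-up dataset (`X_toy`, `y_toy`, and `toy_tree` are hypothetical names, and the rows are invented). The same call should work on `drugTree` when given the feature names in the column order of X.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up training data: two features per patient, for illustration only
X_toy = [[20, 10.0], [30, 12.0], [50, 20.0], [60, 25.0]]
y_toy = ['drugA', 'drugA', 'drugB', 'drugB']

toy_tree = DecisionTreeClassifier(criterion='entropy', random_state=0)
toy_tree.fit(X_toy, y_toy)

# export_text returns the learned rules as an indented plain-text string
rules = export_text(toy_tree, feature_names=['Age', 'Na_to_K'])
print(rules)
```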