
# Decision Trees

In this notebook, we will:

*   Develop a classification model using the decision tree algorithm

We will use this algorithm to build a model from historical patient data and the patients' responses to different medications. We will then use the trained decision tree to predict the class of an unknown patient, that is, to find a suitable drug for a new patient.

### Import the Required Packages:

In [1]:
import warnings
warnings.filterwarnings("ignore")

In [2]:
import sys
import numpy as np 
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
import sklearn.tree as tree

### About the dataset
    
Imagine that you are a medical researcher compiling data for a study. You have collected data about a set of patients, all of whom suffered from the same illness. During their course of treatment, each patient responded to one of 5 medications: Drug A, Drug B, Drug C, Drug X, or Drug Y.
<br>
<br>
Part of your job is to build a model to find out which drug might be appropriate for a future patient with the same illness. The features of this dataset are the Age, Sex, Blood Pressure, Cholesterol, and sodium-to-potassium ratio (Na_to_K) of the patients, and the target is the drug that each patient responded to.
<br>
<br>
This is an example of a multiclass classification problem. You can use the training part of the dataset to build a decision tree, and then use it to predict the class of an unknown patient, that is, to prescribe a drug to a new patient.

<div id="downloading_data"> 
    <h2>Downloading the Data</h2>
    To download the data, we will use the pandas library to read it directly into a DataFrame from IBM Object Storage.
</div>


In [3]:
my_data = pd.read_csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/data/drug200.csv', delimiter=",")
my_data.head()

Unnamed: 0,Age,Sex,BP,Cholesterol,Na_to_K,Drug
0,23,F,HIGH,HIGH,25.355,drugY
1,47,M,LOW,HIGH,13.093,drugC
2,47,M,LOW,HIGH,10.114,drugC
3,28,F,NORMAL,HIGH,7.798,drugX
4,61,F,LOW,HIGH,18.043,drugY


Size of data

In [4]:
my_data.shape

(200, 6)

### Pre-processing

With <b>my_data</b> holding the drug200.csv data read by pandas, declare the following variables: <br>

<ul>
    <li> <b> x </b> as the <b> Feature Matrix </b> (the feature columns of my_data) </li>
    <li> <b> y </b> as the <b> response vector </b> (target) </li>
</ul>

Leave the target column (Drug) out of the feature matrix, since it is the label to be predicted rather than a feature.


In [5]:
my_data.head()

Unnamed: 0,Age,Sex,BP,Cholesterol,Na_to_K,Drug
0,23,F,HIGH,HIGH,25.355,drugY
1,47,M,LOW,HIGH,13.093,drugC
2,47,M,LOW,HIGH,10.114,drugC
3,28,F,NORMAL,HIGH,7.798,drugX
4,61,F,LOW,HIGH,18.043,drugY


In [6]:
x = my_data[['Age', 'Sex', 'BP', 'Cholesterol', 'Na_to_K']].values
x[0:5]

array([[23, 'F', 'HIGH', 'HIGH', 25.355],
       [47, 'M', 'LOW', 'HIGH', 13.093],
       [47, 'M', 'LOW', 'HIGH', 10.114],
       [28, 'F', 'NORMAL', 'HIGH', 7.798],
       [61, 'F', 'LOW', 'HIGH', 18.043]], dtype=object)

In [7]:
x.shape

(200, 5)

In [8]:
x.ndim

# x[:, 1]: the first index (:) selects all rows and the second index (1) selects the column at position 1 (Sex) of the two-dimensional NumPy array

2

As you may have noticed, some features in this dataset are categorical, such as Sex or BP. Unfortunately, <code>sklearn</code> decision trees do not handle categorical variables directly. We can convert these features to numerical values using <code>LabelEncoder()</code>, which maps each category to an integer code (it does not create dummy/indicator variables). Because the encoder sorts its classes alphabetically, BP is encoded as HIGH = 0, LOW = 1, NORMAL = 2, regardless of the order passed to <code>fit</code>.
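As a minimal sketch of how `LabelEncoder` assigns codes, the example below encodes a few hypothetical blood-pressure values (the list `bp_values` is made up for illustration and is not read from the dataset). It shows that the fitted classes are stored in sorted order and that the mapping can be reversed with `inverse_transform`.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample values that mirror the BP column (not read from the dataset)
bp_values = ["HIGH", "LOW", "LOW", "NORMAL", "LOW"]

encoder = LabelEncoder()
encoder.fit(["LOW", "NORMAL", "HIGH"])  # the order given here does not affect the codes

# classes_ is stored sorted, so the integer code is the position in this list
bp_classes = encoder.classes_.tolist()

# Encode the strings as integers, then map the integers back to strings
bp_codes = encoder.transform(bp_values).tolist()
bp_decoded = encoder.inverse_transform(bp_codes).tolist()

print(bp_classes)
print(bp_codes)
print(bp_decoded)
```

Note that these integer codes impose an arbitrary order on the categories. A decision tree can still split on them, but the codes carry no meaning beyond identifying the category.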

In [9]:
from sklearn import preprocessing
le_sex = preprocessing.LabelEncoder()
le_sex.fit(['F','M'])
x[:,1] = le_sex.transform(x[:,1]) 


le_BP = preprocessing.LabelEncoder()
le_BP.fit([ 'LOW', 'NORMAL', 'HIGH'])
x[:,2] = le_BP.transform(x[:,2])


le_Chol = preprocessing.LabelEncoder()
le_Chol.fit([ 'NORMAL', 'HIGH'])
x[:,3] = le_Chol.transform(x[:,3]) 

x[0:5]

array([[23, 0, 0, 0, 25.355],
       [47, 1, 1, 0, 13.093],
       [47, 1, 1, 0, 10.114],
       [28, 0, 2, 0, 7.798],
       [61, 0, 1, 0, 18.043]], dtype=object)

In [10]:
my_data.head()

Unnamed: 0,Age,Sex,BP,Cholesterol,Na_to_K,Drug
0,23,F,HIGH,HIGH,25.355,drugY
1,47,M,LOW,HIGH,13.093,drugC
2,47,M,LOW,HIGH,10.114,drugC
3,28,F,NORMAL,HIGH,7.798,drugX
4,61,F,LOW,HIGH,18.043,drugY


Now we can fill the target variable.


In [11]:
y = my_data["Drug"]
y[0:5]

0    drugY
1    drugC
2    drugC
3    drugX
4    drugY
Name: Drug, dtype: object

<div id="setting_up_tree">
    <h3>Setting up the Decision Tree</h3>
    We will be using a train/test split on our decision tree. Let's import <b>train_test_split</b> from <b>sklearn.model_selection</b>.
</div>

In [12]:
from sklearn.model_selection import train_test_split

<b>train_test_split</b> returns four arrays. We will name them:<br>
x_trainset, x_testset, y_trainset, y_testset <br> <br>

<b>x</b> and <b>y</b> are the arrays to be split, <code>test_size</code> is the fraction of the data held out for testing (0.3 means 30%), and <code>random_state</code> fixes the shuffling seed so that we obtain the same split on every run.
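To illustrate the effect of `test_size` and `random_state`, here is a small sketch on toy arrays (`features` and `labels` are stand-ins invented for this example, not the drug data). Calling `train_test_split` twice with the same seed should produce identical splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the feature matrix and the target (not the drug data)
features = np.arange(40).reshape(20, 2)
labels = np.arange(20)

# Two calls with the same random_state
split_a = train_test_split(features, labels, test_size=0.3, random_state=3)
split_b = train_test_split(features, labels, test_size=0.3, random_state=3)

x_train, x_test, y_train, y_test = split_a
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)

# True when every returned array matches between the two calls
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```

With 20 samples and `test_size=0.3`, 6 samples go to the test set and 14 to the training set, in the same proportion as the 60/140 split used in this notebook.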

In [13]:
x_trainset, x_testset, y_trainset, y_testset = train_test_split(x, y, test_size=0.3, random_state=3)

In [14]:
x_trainset

array([[26, 0, 0, 1, 19.161],
       [41, 0, 2, 1, 22.905],
       [28, 0, 2, 0, 19.675],
       [19, 0, 0, 0, 13.313],
       [50, 1, 2, 1, 15.79],
       [24, 1, 2, 0, 25.786],
       [72, 1, 1, 0, 16.31],
       [74, 0, 1, 0, 20.942],
       [37, 0, 1, 1, 12.006],
       [31, 1, 0, 1, 17.069],
       [22, 0, 2, 0, 8.607],
       [20, 0, 2, 1, 9.281],
       [28, 0, 1, 0, 13.127],
       [59, 0, 2, 0, 13.884],
       [15, 1, 0, 1, 17.206],
       [51, 0, 1, 1, 23.003],
       [45, 1, 1, 1, 10.017],
       [33, 0, 1, 0, 33.486],
       [39, 1, 0, 0, 9.664],
       [29, 0, 0, 0, 29.45],
       [60, 1, 2, 0, 15.171],
       [24, 0, 0, 1, 18.457],
       [49, 0, 2, 1, 9.381],
       [37, 1, 1, 1, 8.968],
       [32, 0, 0, 1, 10.292],
       [21, 0, 0, 1, 28.632],
       [23, 1, 2, 0, 12.26],
       [40, 1, 0, 0, 27.826],
       [38, 1, 1, 0, 18.295],
       [47, 1, 1, 1, 30.568],
       [22, 0, 0, 1, 22.818],
       [47, 1, 0, 0, 10.403],
       [30, 0, 2, 0, 10.443],
       [69, 1, 1, 0

Print the shape of x_trainset and y_trainset to ensure that the dimensions match.

In [15]:
print('Shape of x training set {}'.format(x_trainset.shape), '&', 'Size of y training set {}'.format(y_trainset.shape))

Shape of x training set (140, 5) & Size of y training set (140,)


Print the shape of x_testset and y_testset to ensure that the dimensions match.

In [16]:
print('Shape of x testing set {}'.format(x_testset.shape), '&', 'Size of y testing set {}'.format(y_testset.shape))

Shape of x testing set (60, 5) & Size of y testing set (60,)


<div id="modeling">
    <h3>Modeling</h3>
    We will first create an instance of the <b>DecisionTreeClassifier</b> called <b>drugTree</b>.
    <br><br>
    Inside the classifier, specify <i> criterion="entropy" </i> so that each split is chosen by information gain. We also set <i> max_depth=4 </i> to limit how deep the tree can grow.
</div>
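To make the entropy criterion concrete, the sketch below computes Shannon entropy and the information gain of a split by hand. The helper functions `entropy` and `information_gain` and the toy labels are written for this illustration; they are not part of scikit-learn's API, and the library's internal implementation differs.

```python
import numpy as np


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    probabilities = counts / counts.sum()
    return float(-np.sum(probabilities * np.log2(probabilities)))


def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of its two children."""
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children


# Hypothetical node holding four patients of each of two classes (entropy of 1 bit)
parent = ["drugX"] * 4 + ["drugY"] * 4

# A split that separates the classes perfectly recovers the full 1 bit
perfect_gain = information_gain(parent, parent[:4], parent[4:])

# A split whose children are as mixed as the parent gains nothing
useless_gain = information_gain(parent, parent[::2], parent[1::2])

print(perfect_gain, useless_gain)
```

At each node, the tree prefers the candidate split with the highest information gain, so the first split in this example would be chosen over the second.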

In [17]:
drugTree = DecisionTreeClassifier(criterion="entropy", max_depth = 4)
drugTree # displays the classifier and its parameters

Next, we will fit the model with the training feature matrix <b> x_trainset </b> and the training response vector <b> y_trainset </b>.

In [18]:
drugTree.fit(x_trainset,y_trainset)
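Once a tree is fitted, its learned rules can be inspected as text. The following sketch is a self-contained illustration: it fits a separate small tree (`small_tree`, a name chosen for this example) on only the Age and Na_to_K values of the five rows shown in the `head()` output, so the rules it prints are not those of `drugTree`.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Age and Na_to_K of the five rows shown by my_data.head(), with their drugs
X_small = [[23, 25.355], [47, 13.093], [47, 10.114], [28, 7.798], [61, 18.043]]
y_small = ["drugY", "drugC", "drugC", "drugX", "drugY"]

# Same settings as the notebook's classifier, plus a fixed seed for tie-breaking
small_tree = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=0)
small_tree.fit(X_small, y_small)

# Render the learned splits as indented if/else style text
rules = export_text(small_tree, feature_names=["Age", "Na_to_K"])
print(rules)
```

The same call should work on the notebook's tree, for example `export_text(drugTree, feature_names=['Age', 'Sex', 'BP', 'Cholesterol', 'Na_to_K'])`, assuming the feature order used when building `x`.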

<div id="prediction">
    <h3>Prediction</h3>
    Let's make some <b>predictions</b> on the testing dataset and store them in a variable called <b>predTree</b>.
</div>

In [19]:
predTree = drugTree.predict(x_testset)

In [20]:
predTree

array(['drugY', 'drugX', 'drugX', 'drugX', 'drugX', 'drugC', 'drugY',
       'drugA', 'drugB', 'drugA', 'drugY', 'drugA', 'drugY', 'drugY',
       'drugX', 'drugY', 'drugX', 'drugX', 'drugB', 'drugX', 'drugX',
       'drugY', 'drugY', 'drugY', 'drugX', 'drugB', 'drugY', 'drugY',
       'drugA', 'drugX', 'drugB', 'drugC', 'drugC', 'drugX', 'drugX',
       'drugC', 'drugY', 'drugX', 'drugX', 'drugX', 'drugA', 'drugY',
       'drugC', 'drugY', 'drugA', 'drugY', 'drugY', 'drugY', 'drugY',
       'drugY', 'drugB', 'drugX', 'drugY', 'drugX', 'drugY', 'drugY',
       'drugA', 'drugX', 'drugY', 'drugX'], dtype=object)
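The same `predict` call can be applied to a single new patient, provided the categorical values are encoded with the encoders that were fitted on the training data. The sketch below is a self-contained illustration under stated assumptions: the first five training rows are taken from the `head()` output, while the sixth row, the patient `new_patient`, and the names `encoders` and `clf` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Columns: Age, Sex, BP, Cholesterol, Na_to_K
# First five rows come from my_data.head(); the sixth is made up for illustration
train_rows = np.array([
    [23, "F", "HIGH", "HIGH", 25.355],
    [47, "M", "LOW", "HIGH", 13.093],
    [47, "M", "LOW", "HIGH", 10.114],
    [28, "F", "NORMAL", "HIGH", 7.798],
    [61, "F", "LOW", "HIGH", 18.043],
    [35, "M", "NORMAL", "NORMAL", 27.2],
], dtype=object)
train_labels = ["drugY", "drugC", "drugC", "drugX", "drugY", "drugY"]

# Fit one encoder per categorical column and keep it for later reuse
encoders = {}
for col in (1, 2, 3):
    encoders[col] = LabelEncoder().fit(train_rows[:, col])
    train_rows[:, col] = encoders[col].transform(train_rows[:, col])

clf = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=0)
clf.fit(train_rows.astype(float), train_labels)

# Hypothetical new patient, encoded with the SAME fitted encoders
new_patient = [50, "F", "HIGH", "NORMAL", 22.0]
encoded = list(new_patient)
for col, enc in encoders.items():
    encoded[col] = enc.transform([new_patient[col]])[0]

prediction = clf.predict([encoded])[0]
print(prediction)
```

Keep in mind that `LabelEncoder.transform` raises a `ValueError` for a category it did not see during `fit`, so a new patient can only be encoded if every categorical value appeared in the training data.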

In [21]:
y_testset[0:10]

40     drugY
51     drugX
139    drugX
197    drugX
170    drugX
82     drugC
183    drugY
46     drugA
70     drugB
100    drugA
Name: Drug, dtype: object

We can print out <b>predTree</b> and <b>y_testset</b> if we want to visually compare the predictions to the actual values.

In [22]:
print(predTree[0:5])
print(y_testset[0:5])

['drugY' 'drugX' 'drugX' 'drugX' 'drugX']
40     drugY
51     drugX
139    drugX
197    drugX
170    drugX
Name: Drug, dtype: object
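Visual comparison becomes tedious beyond a few rows. As one possible alternative, the sketch below puts predictions and actual values side by side in a DataFrame and filters out the rows where they disagree. The arrays `predicted` and `actual` are small made-up stand-ins for `predTree` and `y_testset`.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for predTree (NumPy array) and y_testset (pandas Series)
predicted = np.array(["drugY", "drugX", "drugX", "drugC", "drugA"])
actual = pd.Series(
    ["drugY", "drugX", "drugB", "drugC", "drugA"],
    index=[40, 51, 139, 197, 170],
    name="Drug",
)

# Align the two by position while keeping the original row labels
comparison = pd.DataFrame(
    {"actual": actual.values, "predicted": predicted},
    index=actual.index,
)

# Keep only the rows where the prediction differs from the actual drug
mismatches = comparison[comparison["actual"] != comparison["predicted"]]
print(mismatches)
```

Using `actual.values` avoids index alignment issues, since the test set keeps the shuffled row labels of the original DataFrame while the prediction array has none.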


<div id="evaluation">
    <h3>Evaluation</h3>
    Next, let's import <b>metrics</b> from sklearn and check the accuracy of our model.
</div>

In [23]:
from sklearn import metrics
print("DecisionTrees's Accuracy: ", metrics.accuracy_score(y_testset, predTree))

DecisionTrees's Accuracy:  0.9833333333333333
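Accuracy is the fraction of test samples whose predicted label equals the actual label. The sketch below checks this by hand against `metrics.accuracy_score` and also builds a confusion matrix to show which classes are confused. The label arrays are made up for illustration and are not the notebook's results.

```python
import numpy as np
from sklearn import metrics

# Made-up labels for illustration (not the notebook's test set)
actual = np.array(["drugY", "drugX", "drugX", "drugC", "drugA", "drugB"])
predicted = np.array(["drugY", "drugX", "drugY", "drugC", "drugA", "drugB"])

# Fraction of positions where prediction and truth agree
manual_accuracy = float(np.mean(actual == predicted))
sklearn_accuracy = metrics.accuracy_score(actual, predicted)

# Rows are actual classes, columns are predicted classes, in the given label order
class_names = ["drugA", "drugB", "drugC", "drugX", "drugY"]
confusion = metrics.confusion_matrix(actual, predicted, labels=class_names)

print(manual_accuracy, sklearn_accuracy)
print(confusion)
```

The diagonal of the confusion matrix counts correct predictions, so its sum divided by the total number of samples gives the same accuracy value.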


#### © IBM Corporation 2020. All rights reserved.
