# Decision Tree

Decision trees are not a great choice for a feature space with complex relationships between numerical variables, but they work well for data with a simpler mix of numerical and categorical variables.

<img src="https://databricks.com/wp-content/uploads/2014/09/decision-tree-example.png" width="585" height="328" alt="Decision tree example">

### <font color = red>Wisconsin Breast Cancer Database</font>

Citation Request:
   This breast cancer database was obtained from the University of Wisconsin
   Hospitals, Madison from Dr. William H. Wolberg.

URL: https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(original)

### Relevant Information

In [103]:
import os
import subprocess

import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from IPython.display import Image
from io import StringIO
import pydot
%matplotlib inline

In [104]:
data = np.genfromtxt(fname ='breast_cancer.csv', delimiter= ',', dtype= float)

Take a look at the dataset.

In [110]:
columns = ['id number', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape', 'Marginal Adhesion', 
           'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin', 'Normal Nucleoli', 'Mitoses', 'Class']
df = pd.DataFrame(data, columns = columns)
df.head(10)

| | id number | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025.0 | 5.0 | 1.0 | 1.0 | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 1.0 | 2.0 |
| 1 | 1002945.0 | 5.0 | 4.0 | 4.0 | 5.0 | 7.0 | 10.0 | 3.0 | 2.0 | 1.0 | 2.0 |
| 2 | 1015425.0 | 3.0 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | 3.0 | 1.0 | 1.0 | 2.0 |
| 3 | 1016277.0 | 6.0 | 8.0 | 8.0 | 1.0 | 3.0 | 4.0 | 3.0 | 7.0 | 1.0 | 2.0 |
| 4 | 1017023.0 | 4.0 | 1.0 | 1.0 | 3.0 | 2.0 | 1.0 | 3.0 | 1.0 | 1.0 | 2.0 |
| 5 | 1017122.0 | 8.0 | 10.0 | 10.0 | 8.0 | 7.0 | 10.0 | 9.0 | 7.0 | 1.0 | 4.0 |
| 6 | 1018099.0 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 | 10.0 | 3.0 | 1.0 | 1.0 | 2.0 |
| 7 | 1018561.0 | 2.0 | 1.0 | 2.0 | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 1.0 | 2.0 |
| 8 | 1033078.0 | 2.0 | 1.0 | 1.0 | 1.0 | 2.0 | 1.0 | 1.0 | 1.0 | 5.0 | 2.0 |
| 9 | 1033078.0 | 4.0 | 2.0 | 1.0 | 1.0 | 2.0 | 1.0 | 2.0 | 1.0 | 1.0 | 2.0 |


In [111]:
df.shape

(699, 11)

### Cleaning up the data

The first column of the cancer dataset holds the patient's id number. The id carries no diagnostic information, so we remove it to keep it from biasing the predictions. NumPy's delete() function handles this.

delete(): Returns a new array with the specified sub-arrays removed. It takes three parameters:

- arr: The input array.
- obj: The index or indices of the sub-arrays to remove.
- axis: The axis along which to delete. axis = 1 removes columns and axis = 0 removes rows.
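
To make the role of each parameter concrete, here is a minimal sketch on a toy array. The array and its values are made up for illustration and are not taken from the cancer dataset.

```python
import numpy as np

# Toy array (hypothetical values): column 0 plays the role of an id column.
toy = np.array([[101, 5, 1],
                [102, 3, 4]])

# axis=1 removes a column; np.delete returns a new array.
without_id = np.delete(arr=toy, obj=0, axis=1)
print(without_id)   # the two rows without their first column
print(toy.shape)    # the original array is left unchanged

# axis=0 removes a row instead.
without_first_row = np.delete(arr=toy, obj=0, axis=0)
print(without_first_row)
```

Because delete() returns a new array rather than modifying its input, the result has to be assigned to a name for the change to persist.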

In [112]:
data = np.delete(arr = data, obj= 0, axis = 1)

Next, we split the dataset into features and labels. The features are the predictor variables that help us predict the label (the criterion variable). Here, the first 9 columns hold numeric features that help us predict whether a patient's tumor is benign or malignant, and the last column holds the class label.

In [113]:
features = data[:, :9]
classification = data[:, 9]

**Data Imputation:**

Imputation is the process of replacing missing values with substituted values. In our dataset, some columns have missing values. We can replace them with the mean, median, mode, or any particular value.
scikit-learn provides an imputer class that does this in a couple of lines of code: Imputer in sklearn.preprocessing in older releases, replaced by SimpleImputer in sklearn.impute from version 0.20 onward. We need to define missing_values and strategy (the older Imputer also takes axis, where axis = 0 means column-wise; SimpleImputer always works column-wise). We use the “median” of each column to substitute for the missing values.

For instance, the row at index 23 has a NaN value in the 6th ('Bare Nuclei') column.

In [114]:
features[23]

array([  8.,   4.,   5.,   1.,   2.,  nan,   7.,   3.,   1.])

In [115]:
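# SimpleImputer (sklearn.impute) replaces the Imputer class that was removed in scikit-learn 0.22.
from sklearn.impute import SimpleImputer

# Replace each NaN with the median of its column (SimpleImputer always works column-wise).
imp = SimpleImputer(missing_values=np.nan, strategy='median')
features = imp.fit_transform(features)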

In [116]:
features[23]

array([ 8.,  4.,  5.,  1.,  2.,  1.,  7.,  3.,  1.])

We can see that the NaN value in the 6th column is now replaced by the median value of the column.
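
For intuition, the same median substitution can be sketched by hand with NumPy alone. This is only an illustration on a made-up toy matrix, not the notebook's actual imputation step.

```python
import numpy as np

# Toy feature matrix (hypothetical values) with one missing entry.
toy = np.array([[1.0, 10.0],
                [2.0, np.nan],
                [3.0, 30.0],
                [4.0, 50.0]])

# Median of each column, ignoring NaN entries.
col_medians = np.nanmedian(toy, axis=0)

# Replace each NaN with the median of its own column.
filled = np.where(np.isnan(toy), col_medians, toy)
print(col_medians)
print(filled[1])
```

The median is computed per column from the observed values only, so the missing entry in the second column is filled from that column's three known values.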

## Visualize the decision tree

In [117]:
# dt = DecisionTreeClassifier(min_samples_split=20, random_state=99)
dt = DecisionTreeClassifier()
dt.fit(features, classification)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [118]:
with open("breast_cancer_classifier.txt", "w") as f:
    # export_graphviz writes the tree's DOT source to the open file
    export_graphviz(dt, out_file=f)

Open the text file in the current directory, copy its contents, and paste them into the page linked below to visualize the graph!

*graphviz web portal address: http://webgraphviz.com/*
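
As an alternative to going through a file, export_graphviz can return the DOT source as a string when out_file=None, and it accepts feature and class names to make the rendered nodes easier to read. The sketch below uses a made-up four-row dataset, and the benign/malignant names assume the dataset's documented coding of class 2 as benign and class 4 as malignant.

```python
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Toy data (hypothetical): two features, classes 2 and 4.
X = [[1, 1], [2, 1], [8, 9], [10, 7]]
y = [2, 2, 4, 4]

toy_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# out_file=None makes export_graphviz return the DOT source as a string.
dot_source = export_graphviz(
    toy_tree,
    out_file=None,
    feature_names=["Clump Thickness", "Uniformity of Cell Size"],
    class_names=["benign", "malignant"],  # listed in ascending order of class label
    filled=True,
)
print(dot_source)
```

The printed string can be pasted into the same web portal as the contents of the text file.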

## Predicting whether a patient has a benign or malignant tumor

The observation at index 5 of the cancer dataset (the sixth row):

In [128]:
data[5]

array([  8.,  10.,  10.,   8.,   7.,  10.,   9.,   7.,   1.,   4.])

In [132]:
features[5]

array([  8.,  10.,  10.,   8.,   7.,  10.,   9.,   7.,   1.])

In [131]:
test_features_5 = [8.0, 10.0, 10.0, 8.0, 7.0, 10.0, 9.0, 7.0, 1.0]
test_features_5_class = dt.predict([test_features_5])

print("Input: ", features[5])
print("Actual class: ", data[5][9])
print("Classifier predicted: ", test_features_5_class)

Input:  [  8.  10.  10.   8.   7.  10.   9.   7.   1.]
Actual class:  4.0
Classifier predicted:  [ 4.]
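
The check above uses a row the tree was trained on, so a correct prediction is expected. A more informative check holds out part of the data before fitting. The sketch below shows that pattern on synthetic stand-in data; the data, the labelling rule, and the 25% split are assumptions for illustration, not results from the cancer dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical): 9 integer features in 1..10,
# labelled 4 when the first feature exceeds 5 and 2 otherwise.
rng = np.random.RandomState(0)
X = rng.randint(1, 11, size=(300, 9)).astype(float)
y = np.where(X[:, 0] > 5, 4.0, 2.0)

# Hold out 25% of the rows; the tree never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Mean accuracy on the held-out rows.
test_accuracy = clf.score(X_test, y_test)
print("Held-out accuracy:", test_accuracy)
```

With the real data, features and classification would take the place of X and y.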
