## Application of Tree Based Classification Methods to Wisconsin Breast Cancer Dataset

In [1]:
import numpy as np
import pandas as pd
from time import time
from functools import reduce
from collections import deque
from sklearn import tree, ensemble
from metrix import accuracy, confusion_matrix, precision, recall, f1_score
import scipy.stats as st

from DecisionTree import DecisionTreeClassifier, DecisionTreeRegressor
from RandomForestClassifier import RandomForestClassifier
from AdaBoostClassifier import AdaBoostClassifier
from GradientBoostingClassifier import GradientBoostingClassifier

In [2]:
def train_test_split(X, ratio=0.7):
    X = X.sample(frac=1).reset_index(drop=True)
    return X[:int(len(X) * ratio)], X[int(len(X) * ratio):]

# Helper functions for grid search
def cartesian_product(arr1, arr2):
    product = []
    for i in arr1:
        for j in arr2:
            if isinstance(i, tuple):
                # (1, 2) and 5 -> (1, 2, 5)
                product.append((*i,j))
            else:
                # 1 and 2 -> (1, 2)
                product.append((i,j))
    return product

def all_possible_param_combinations(params):
    return reduce(cartesian_product, map(lambda param_name: params[param_name], params))
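
The two helpers above build the grid by folding a pairwise product over the value lists. As a cross-check, the same ordered combinations can be produced with the standard library's `itertools.product`. This is a minimal sketch that assumes the dictionary keeps insertion order (Python 3.7+); `param_grid` and `example_params` are hypothetical names, not part of the notebook.

```python
from itertools import product

def param_grid(params):
    """Return every combination of values as a list of tuples, in key order."""
    return list(product(*params.values()))

example_params = {
    'max_depth': [2, 3],
    'min_members': [10, 20],
    'max_features': [1],
}
combos = param_grid(example_params)
print(len(combos))   # 4
print(combos[0])     # (2, 10, 1)
```

Unlike the `reduce`-based version, this variant also returns tuples when the dictionary holds a single parameter, so unpacking with `*possibility` keeps working in that case.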

In [3]:
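def grid_search(model, params_to_optimize, X_train, y_train, X_test, y_test):
    """Fit one model per parameter combination and keep the most accurate one."""
    all_possibilities = all_possible_param_combinations(params_to_optimize)
    best_accuracy = 0
    best_model = None
    for index, possibility in enumerate(all_possibilities):
        # Values are passed positionally, so the key order of params_to_optimize
        # must match the order of the model's constructor arguments.
        model_i = model(*possibility)
        start = time()
        model_i.fit(X_train, y_train)
        end = time()
        print('model', index + 1)
        print('trained in', end - start, 'seconds')
        # Named model_accuracy to avoid shadowing the imported accuracy function.
        model_accuracy = model_i.score(X_test, y_test)
        print('accuracy: ', model_accuracy)
        # Strict comparison: on ties the earliest combination is kept.
        if model_accuracy > best_accuracy:
            best_accuracy = model_accuracy
            best_model = model_i

    return best_accuracy, best_model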

## Data Preparation

In [4]:
df = pd.read_csv('breast-cancer-wisconsin.data', header=None)
df

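|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 694 | 776715 | 3 | 1 | 1 | 1 | 3 | 2 | 1 | 1 | 1 | 2 |
| 695 | 841769 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 2 |
| 696 | 888820 | 5 | 10 | 10 | 3 | 7 | 3 | 8 | 10 | 2 | 4 |
| 697 | 897471 | 4 | 8 | 6 | 4 | 3 | 4 | 10 | 6 | 1 | 4 |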


In [5]:
df = df.loc[(df != '?').all(axis=1)]

In [6]:
df = df.astype('int')

In [7]:
train_data, test_data = train_test_split(df.iloc[:, 1:])

In [8]:
X_train = train_data.iloc[:, :-1]
y_train = train_data.iloc[:, -1]

X_test = test_data.iloc[:, :-1]
y_test = test_data.iloc[:, -1]

## Model Selection with Cross-Validation and Grid Search

We have several hyperparameters to consider. *min_members* is the minimum number of examples required to label a node as a leaf. *max_depth* is the maximum depth of the tree. *max_features* is the number of features to consider when making a split; those features are selected randomly. Let's search for the best parameters with grid search, evaluating each candidate on the validation set (here, the held-out split `X_test`, `y_test`). First, we'll build single decision trees from our data, and later we'll compare their performance with ensemble methods such as Random Forest, AdaBoost, and Gradient Boosting.
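
The section title mentions cross-validation, while the cells that follow score each candidate on a single held-out split. A minimal sketch of how k-fold index pairs could be generated with the standard library is given here; `k_fold_indices` is a hypothetical helper and is not part of the notebook's modules.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs that partition range(n_samples) into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(10, k=3))
for train_idx, val_idx in folds:
    print(len(train_idx), len(val_idx))
```

With a DataFrame, the rows of each fold could be selected with `X_train.iloc[train_idx]` and `X_train.iloc[val_idx]`, and the mean of the per-fold scores would then replace the single score used to rank candidates.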

In [9]:
params_to_optimize = {
    'tol': [0.1],
    'max_depth': [2, 3, 4],
    'min_members': [10, 20, 50],
    'criterion': ['gini'],
    'split_method': ['binary'],
    'max_features': [1, 2]
}

In [10]:
best_dt_accuracy, best_dt_model = grid_search(DecisionTreeClassifier, params_to_optimize, X_train, y_train, X_test, y_test)

model 1
trained in 0.0024471282958984375 seconds
accuracy:  0.8292682926829268
model 2
trained in 0.0025339126586914062 seconds
accuracy:  0.9317073170731708
model 3
trained in 0.0018148422241210938 seconds
accuracy:  0.9219512195121952
model 4
trained in 0.0032830238342285156 seconds
accuracy:  0.8926829268292683
model 5
trained in 0.0013821125030517578 seconds
accuracy:  0.8292682926829268
model 6
trained in 0.002389669418334961 seconds
accuracy:  0.9317073170731708
model 7
trained in 0.0026226043701171875 seconds
accuracy:  0.926829268292683
model 8
trained in 0.0029892921447753906 seconds
accuracy:  0.9365853658536586
model 9
trained in 0.0021331310272216797 seconds
accuracy:  0.9414634146341463
model 10
trained in 0.0028219223022460938 seconds
accuracy:  0.9365853658536586
model 11
trained in 0.0024251937866210938 seconds
accuracy:  0.9317073170731708
model 12
trained in 0.0026772022247314453 seconds
accuracy:  0.9414634146341463
model 13
trained in 0.002405881881713867 seconds
ac

In [11]:
best_dt_accuracy

0.9414634146341463

In [12]:
best_dt_model.min_members, best_dt_model.max_depth, best_dt_model.max_features

(20, 3, 1)

Of the 18 models we evaluated, the one with min_members=20, max_depth=3, max_features=1 is the best, with about 94.1% accuracy on the validation set. Let us evaluate this model further by looking at other metrics.

In [13]:
pred_dt = best_dt_model.predict(X_test)
pred_dt

array([2, 2, 4, 4, 2, 2, 4, 2, 4, 2, 4, 4, 2, 2, 2, 2, 4, 4, 2, 2, 4, 2,
       2, 2, 4, 4, 2, 4, 4, 2, 4, 2, 4, 2, 4, 4, 2, 2, 2, 2, 2, 4, 2, 4,
       2, 4, 4, 2, 2, 2, 4, 2, 4, 2, 2, 2, 4, 4, 4, 4, 4, 2, 4, 4, 4, 2,
       4, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 4, 2, 2, 2, 4,
       2, 2, 2, 4, 2, 2, 4, 2, 2, 2, 4, 2, 4, 2, 4, 2, 2, 2, 2, 2, 2, 4,
       2, 2, 2, 4, 2, 2, 4, 2, 4, 2, 2, 4, 4, 4, 2, 2, 2, 2, 2, 2, 4, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 4, 4, 2, 2, 2, 4, 2, 2, 4, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 4, 2, 2, 2, 2, 2, 2, 4, 2, 2, 2, 4, 2, 2, 4, 2, 2,
       2, 2, 4, 4, 2, 4, 2, 2, 2, 4, 2, 2, 2, 4, 2, 2, 2, 2, 2, 4, 2, 2,
       4, 4, 4, 4, 2, 2, 2])

In [14]:
conf_matrix = confusion_matrix(y_test, pred_dt, classes=[2, 4])
conf_matrix

|  | Actual Positive | Actual Negative |
|---|---|---|
| Predicted Positive | 61 | 4 |
| Predicted Negative | 8 | 132 |


In [15]:
dt_true_positive = conf_matrix.iloc[0, 0]
dt_predicted_positive = conf_matrix.iloc[0, :].sum()
dt_actual_positive = conf_matrix.iloc[:, 0].sum()

We have 61 true positives out of 65 examples predicted positive. So, our *precision* is 61/65:

In [16]:
dt_precision = precision(y_test, pred_dt, classes=[2, 4])
dt_precision

0.9384615384615385

The model correctly classified 61 out of 69 actual positive examples. So, the *recall* is 61/69:

In [17]:
dt_recall = recall(y_test, pred_dt, classes=[2, 4])
dt_recall

0.8840579710144928

Finally, we calculate *f1 score* by $ \frac{2pr}{p + r} $, where p is precision and r is recall:

In [18]:
dt_f1 = f1_score(y_test, pred_dt, classes=[2, 4])
dt_f1

0.9104477611940298
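
As a sanity check, the three metrics and the accuracy can be recomputed by hand from the four cells of the decision tree confusion matrix above. The sketch uses plain arithmetic rather than the `metrix` module; the variable names are illustrative.

```python
# Counts copied from the decision tree confusion matrix above.
tp, fp = 61, 4     # "Predicted Positive" row: actual positive, actual negative
fn, tn = 8, 132    # "Predicted Negative" row: actual positive, actual negative

precision_value = tp / (tp + fp)
recall_value = tp / (tp + fn)
f1_value = 2 * precision_value * recall_value / (precision_value + recall_value)
accuracy_value = (tp + tn) / (tp + fp + fn + tn)

print(round(precision_value, 4))  # 0.9385
print(round(recall_value, 4))     # 0.8841
print(round(f1_value, 4))         # 0.9104
print(round(accuracy_value, 4))   # 0.9415
```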

Now, let us examine how well Random Forest will perform on our data. We have an extra hyperparameter, *n_trees*, which defines the number of trees to build.
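
A random forest typically trains each tree on a bootstrap sample of the rows and combines the trees' predictions by majority vote. The sketch below illustrates only those two steps with the standard library; `bootstrap_indices` and `majority_vote` are hypothetical helpers, and the `RandomForestClassifier` module used in this notebook may be implemented differently.

```python
import random
from collections import Counter

def bootstrap_indices(n_samples, rng):
    """Sample n_samples row indices with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def majority_vote(predictions_per_tree):
    """Combine per-tree label lists into one voted label per example."""
    n_examples = len(predictions_per_tree[0])
    voted = []
    for i in range(n_examples):
        votes = Counter(tree_preds[i] for tree_preds in predictions_per_tree)
        voted.append(votes.most_common(1)[0][0])
    return voted

rng = random.Random(0)
sample = bootstrap_indices(5, rng)  # indices of the rows one tree would train on

# Three hypothetical trees voting on four examples (labels 2 and 4 as in the dataset).
tree_predictions = [
    [2, 4, 4, 2],
    [2, 2, 4, 2],
    [4, 4, 4, 2],
]
print(majority_vote(tree_predictions))  # [2, 4, 4, 2]
```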

In [19]:
params_to_optimize = {
    'n_trees': [100, 150, 200],
    'tol': [0.1],
    'max_depth': [3, 4],
    'min_members': [10, 20],
    'criterion': ['gini'],
    'split_method': ['binary'],
    'max_features': [1, 2],
}

In [20]:
best_rf_accuracy, best_rf_model = grid_search(RandomForestClassifier, params_to_optimize, X_train, y_train, X_test, y_test)

model 1
trained in 0.1718606948852539 seconds
accuracy:  0.9658536585365853
model 2
trained in 0.2550930976867676 seconds
accuracy:  0.9658536585365853
model 3
trained in 0.1602160930633545 seconds
accuracy:  0.9658536585365853
model 4
trained in 0.253950834274292 seconds
accuracy:  0.9658536585365853
model 5
trained in 0.2288961410522461 seconds
accuracy:  0.9658536585365853
model 6
trained in 0.3246805667877197 seconds
accuracy:  0.9658536585365853
model 7
trained in 0.22014999389648438 seconds
accuracy:  0.9658536585365853
model 8
trained in 0.3346598148345947 seconds
accuracy:  0.9658536585365853
model 9
trained in 0.23746824264526367 seconds
accuracy:  0.9658536585365853
model 10
trained in 0.38608264923095703 seconds
accuracy:  0.9658536585365853
model 11
trained in 0.23241543769836426 seconds
accuracy:  0.9658536585365853
model 12
trained in 0.3595423698425293 seconds
accuracy:  0.9658536585365853
model 13
trained in 0.30576252937316895 seconds
accuracy:  0.9658536585365853
mode

In [21]:
best_rf_accuracy

0.9658536585365853

In [22]:
best_rf_model.n_trees, best_rf_model.max_depth, best_rf_model.max_features, best_rf_model.min_members

(100, 3, 1, 10)

Of the 24 models we evaluated, the one with min_members=10, max_depth=3, max_features=1, n_trees=100 performs best, with about 96.6% accuracy on the validation set. Again, let's compute the other metrics to examine the performance of this model.

In [23]:
pred_rf = best_rf_model.predict(X_test)

In [24]:
confusion_matrix(y_test, pred_rf, classes=[2, 4])

|  | Actual Positive | Actual Negative |
|---|---|---|
| Predicted Positive | 67 | 5 |
| Predicted Negative | 2 | 131 |


In [25]:
rf_precision = precision(y_test, pred_rf, classes=[2, 4])
rf_precision

0.9305555555555556

In [26]:
rf_recall = recall(y_test, pred_rf, classes=[2, 4])
rf_recall

0.9710144927536232

In [27]:
rf_f1 = f1_score(y_test, pred_rf, classes=[2, 4])
rf_f1

0.9503546099290779

Now, let's see how AdaBoost performs on our dataset. Instead of *n_trees*, we now have the *n_learners* parameter, which defines the number of weak learners to build.
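
For intuition, one round of the standard discrete AdaBoost update is sketched below: the learner's weight grows as its weighted error shrinks, and misclassified examples gain weight for the next round. This is the textbook formulation, not necessarily what the `AdaBoostClassifier` module here does; `adaboost_round` is a hypothetical helper.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """Return (learner_weight, updated_weights) for one discrete AdaBoost round."""
    miss = [yt != yp for yt, yp in zip(y_true, y_pred)]
    error = sum(w for w, m in zip(weights, miss) if m)
    # Assumes 0 < error < 1; a real implementation would guard these edge cases.
    learner_weight = 0.5 * math.log((1 - error) / error)
    updated = [w * math.exp(learner_weight if m else -learner_weight)
               for w, m in zip(weights, miss)]
    total = sum(updated)
    return learner_weight, [w / total for w in updated]

weights = [0.25, 0.25, 0.25, 0.25]
y_true = [2, 2, 4, 4]
y_pred = [2, 2, 4, 2]   # one mistake, on the last example
learner_weight, new_weights = adaboost_round(weights, y_true, y_pred)
print(round(learner_weight, 4))             # 0.5493
print([round(w, 4) for w in new_weights])   # [0.1667, 0.1667, 0.1667, 0.5]
```

After the update the misclassified example carries half of the total weight, which is what forces the next weak learner to focus on it.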

In [28]:
params_to_optimize = {
    'n_learners': [50, 100, 150, 200],
    'tol': [0.1],
    'max_depth': [2],
    'min_members': [10, 20],
    'criterion': ['gini'],
    'split_method': ['binary'],
    'max_features': [1, 2, 3, 4, 5],
}

In [29]:
best_ab_accuracy, best_ab_model = grid_search(AdaBoostClassifier, params_to_optimize, X_train, y_train, X_test, y_test)

model 1
trained in 0.058773040771484375 seconds
accuracy:  0.9365853658536586
model 2
trained in 0.1498410701751709 seconds
accuracy:  0.9512195121951219
model 3
trained in 0.21143436431884766 seconds
accuracy:  0.9414634146341463
model 4
trained in 0.22395777702331543 seconds
accuracy:  0.9609756097560975
model 5
trained in 0.25395774841308594 seconds
accuracy:  0.9463414634146341
model 6
trained in 0.11724472045898438 seconds
accuracy:  0.9365853658536586
model 7
trained in 0.15119481086730957 seconds
accuracy:  0.9317073170731708
model 8
trained in 0.18380069732666016 seconds
accuracy:  0.9317073170731708
model 9
trained in 0.21530532836914062 seconds
accuracy:  0.9463414634146341
model 10
trained in 0.25317883491516113 seconds
accuracy:  0.9463414634146341
model 11
trained in 0.1747736930847168 seconds
accuracy:  0.9365853658536586
model 12
trained in 0.24207186698913574 seconds
accuracy:  0.9463414634146341
model 13
trained in 0.30335450172424316 seconds
accuracy:  0.9463414634146

In [30]:
best_ab_accuracy

0.9609756097560975

In [31]:
best_ab_model.n_learners, best_ab_model.min_members, best_ab_model.max_features

(50, 10, 4)

Of the 40 models we evaluated (4 × 2 × 5 combinations), the one with min_members=10, max_features=4, n_learners=50 performs best, with about 96.1% accuracy on the validation set. Let's compute the same metrics again.

In [32]:
pred_ab = best_ab_model.predict(X_test)

In [33]:
confusion_matrix(y_test, pred_ab, classes=[2, 4])

|  | Actual Positive | Actual Negative |
|---|---|---|
| Predicted Positive | 65 | 4 |
| Predicted Negative | 4 | 132 |


In [34]:
ab_precision = precision(y_test, pred_ab, classes=[2, 4])
ab_precision

0.9420289855072463

In [35]:
ab_recall = recall(y_test, pred_ab, classes=[2, 4])
ab_recall

0.9420289855072463

In [36]:
ab_f1 = f1_score(y_test, pred_ab, classes=[2, 4])
ab_f1

0.9420289855072463

Finally, let's look at the performance of the Gradient Boosting Classifier. In addition to the AdaBoost hyperparameters, we have *alpha* and *n_iters_stop*. *alpha* is the learning rate: it controls the contribution of each regression tree that predicts the residuals. *n_iters_stop* defines the stopping criterion: the number of consecutive iterations without a change in the loss after which training stops.
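
The sketch below shows one plausible reading of these two hyperparameters: each tree's output is scaled by *alpha* before being added to the running prediction, and training stops once the loss has changed by less than a tolerance for *n_iters_stop* consecutive iterations. Both helpers are hypothetical, and the use of `loss_tol` as the tolerance is an assumption; the `GradientBoostingClassifier` module may apply them differently.

```python
def should_stop(loss_history, n_iters_stop=5, loss_tol=1e-3):
    """Return True if the last n_iters_stop loss changes are all below loss_tol."""
    if len(loss_history) <= n_iters_stop:
        return False
    recent = loss_history[-(n_iters_stop + 1):]
    changes = [abs(recent[i + 1] - recent[i]) for i in range(n_iters_stop)]
    return all(change < loss_tol for change in changes)

def boosted_prediction(initial, tree_outputs, alpha=0.5):
    """Add each tree's residual prediction, scaled by the learning rate alpha."""
    prediction = initial
    for output in tree_outputs:
        prediction += alpha * output
    return prediction

losses = [0.70, 0.50, 0.40, 0.3995, 0.3993, 0.3992, 0.3992, 0.3991]
print(should_stop(losses[:4]))  # False
print(should_stop(losses))      # True
print(boosted_prediction(0.0, [1.0, 0.5, 0.25], alpha=0.5))  # 0.875
```

A smaller *alpha* makes each tree contribute less, which usually calls for more learners before the stopping rule triggers.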

In [48]:
params_to_optimize = {
    'n_learners': [50, 100, 150, 200],
    'n_iters_stop': [5],
    'loss_tol': [10e-4],
    'alpha': [0.1, 0.3, 0.5, 0.8],
    'tol': [0.1],
    'max_depth': [2],
    'min_members': [10, 20],
    'split_method': ['binary'],
    'max_features': [1, 2, 3],
}

In [49]:
best_gb_accuracy, best_gb_model = grid_search(GradientBoostingClassifier, params_to_optimize, X_train, y_train, X_test, y_test)

model 1
trained in 0.24473261833190918 seconds
accuracy:  0.9609756097560975
model 2
trained in 0.31832313537597656 seconds
accuracy:  0.9609756097560975
model 3
trained in 0.845130205154419 seconds
accuracy:  0.9560975609756097
model 4
trained in 0.22322964668273926 seconds
accuracy:  0.9609756097560975
model 5
trained in 0.30461597442626953 seconds
accuracy:  0.9609756097560975
model 6
trained in 0.8693137168884277 seconds
accuracy:  0.9560975609756097
model 7
trained in 0.23079752922058105 seconds
accuracy:  0.9560975609756097
model 8
trained in 0.33502650260925293 seconds
accuracy:  0.9560975609756097
model 9
trained in 0.9166715145111084 seconds
accuracy:  0.9560975609756097
model 10
trained in 0.23220419883728027 seconds
accuracy:  0.9560975609756097
model 11
trained in 0.33174872398376465 seconds
accuracy:  0.9560975609756097
model 12
trained in 0.9008445739746094 seconds
accuracy:  0.9560975609756097
model 13
trained in 0.23565101623535156 seconds
accuracy:  0.9658536585365853


In [50]:
best_gb_accuracy

0.9658536585365853

In [51]:
best_gb_model.n_learners, best_gb_model.min_members, best_gb_model.max_features, best_gb_model.alpha

(50, 10, 2, 0.5)

Among the 96 models we evaluated, the one with n_learners=50, max_features=2, alpha=0.5, min_members=10 performs best, with about 96.6% accuracy on the validation set. Again, we calculate the metrics.

In [41]:
pred_gb = best_gb_model.predict(X_test)

In [42]:
confusion_matrix(y_test, pred_gb, classes=[2,4])

|  | Actual Positive | Actual Negative |
|---|---|---|
| Predicted Positive | 67 | 4 |
| Predicted Negative | 2 | 132 |


In [43]:
gb_precision = precision(y_test, pred_gb, classes=[2, 4])
gb_precision

0.9436619718309859

In [44]:
gb_recall = recall(y_test, pred_gb, classes=[2, 4])
gb_recall

0.9710144927536232

In [45]:
gb_f1 = f1_score(y_test, pred_gb, classes=[2, 4])
gb_f1

0.9571428571428571

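Now, let's take a final look at the models we selected and compare their performance. The Gradient Boosting accuracy in the table (0.970732) matches its confusion matrix, (67 + 132) / 205, rather than the 0.965854 printed by the search cells; those cells carry later execution numbers (In [48] to In [51]), so they were apparently re-run after the metrics were computed.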

In [46]:
model_eval_matrix = np.array([
    [best_dt_accuracy, best_rf_accuracy, best_ab_accuracy, best_gb_accuracy],
    [dt_precision, rf_precision, ab_precision, gb_precision],
    [dt_recall, rf_recall, ab_recall, gb_recall],
    [dt_f1, rf_f1, ab_f1, gb_f1]
])
model_eval_df = pd.DataFrame(model_eval_matrix, columns=['Decision Tree', 'Random Forest', 'AdaBoost', 'Gradient Boosting'], index=['Accuracy', 'Precision', 'Recall', 'F1 Score'])

In [47]:
model_eval_df

|  | Decision Tree | Random Forest | AdaBoost | Gradient Boosting |
|---|---|---|---|---|
| Accuracy | 0.941463 | 0.965854 | 0.960976 | 0.970732 |
| Precision | 0.938462 | 0.930556 | 0.942029 | 0.943662 |
| Recall | 0.884058 | 0.971014 | 0.942029 | 0.971014 |
| F1 Score | 0.910448 | 0.950355 | 0.942029 | 0.957143 |
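
To read the comparison off programmatically, the best model for each metric can be picked from the table values. The sketch copies the numbers from the table above into a plain dictionary and reports ties explicitly; `best_models` is a hypothetical helper.

```python
# Scores copied from the comparison table above.
models = ['Decision Tree', 'Random Forest', 'AdaBoost', 'Gradient Boosting']
scores = {
    'Accuracy':  [0.941463, 0.965854, 0.960976, 0.970732],
    'Precision': [0.938462, 0.930556, 0.942029, 0.943662],
    'Recall':    [0.884058, 0.971014, 0.942029, 0.971014],
    'F1 Score':  [0.910448, 0.950355, 0.942029, 0.957143],
}

def best_models(values, names):
    """Return every model name that reaches the maximum value (ties included)."""
    top = max(values)
    return [name for name, value in zip(names, values) if value == top]

best_per_metric = {metric: best_models(values, models) for metric, values in scores.items()}
for metric, winners in best_per_metric.items():
    print(metric, winners)
```

On the DataFrame itself, `model_eval_df.idxmax(axis=1)` gives a similar answer, although it reports only the first column on ties such as the recall row.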
