Improving a Decision Tree Classifier's performance by additional Pre-processing

### Importing the necessary libraries

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.tree import DecisionTreeClassifier #DecisionTree
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split 
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

from sklearn.feature_selection import RFE
from sklearn.decomposition import PCA

### Dataset's description

In [2]:
# loading the dataset
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
print("The data set's description:")
print(cancer.DESCR)
print("\n")

The data set's description:
.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        worst/largest values) of these features were computed for each image,
        resulting in 30 features.  For in

In [3]:
# converting the dataset into pandas df using feature data
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)

# adding a column for target data
df['target'] = cancer.target

In [4]:
# basic statistics
print("***basic statistics for attribute values in the dataset***")
print(df.describe())
print("\n")

***basic statistics for attribute values in the dataset***
       mean radius  mean texture  mean perimeter    mean area  \
count   569.000000    569.000000      569.000000   569.000000   
mean     14.127292     19.289649       91.969033   654.889104   
std       3.524049      4.301036       24.298981   351.914129   
min       6.981000      9.710000       43.790000   143.500000   
25%      11.700000     16.170000       75.170000   420.300000   
50%      13.370000     18.840000       86.240000   551.100000   
75%      15.780000     21.800000      104.100000   782.700000   
max      28.110000     39.280000      188.500000  2501.000000   

       mean smoothness  mean compactness  mean concavity  mean concave points  \
count       569.000000        569.000000      569.000000           569.000000   
mean          0.096360          0.104341        0.088799             0.048919   
std           0.014064          0.052813        0.079720             0.038803   
min           0.052630         

In [5]:
# for scaling the data
scaler = StandardScaler()

# storing the feature data
X = cancer.data[:,0:9] #taking only first 10 features for better reliability of decision tree
X = scaler.fit_transform(X)
# storing the target data
y = cancer.target

# splitting the data using Scikit-Learn's train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True,test_size=0.3, random_state=0) #shuffling to remove ordering/rando

## DECISION TREE

The advantage with decision trees is that the implementation doesn’t need much pre-processing.

- After loading the dataset, its observed that it is a multi-variate dataset with no missing values and free of categorical variables.
- The data is scaled to a standard format using a standard scaler before implementing the decision tree.
- Before modelling, the original data was split into train (70%) and test (30%) datasets.
- There are 30 features in the dataset. For the Decision Tree classifier, I used only 10 attributes for better reliability of the decision tree.

In [6]:
# initialization
dt = DecisionTreeClassifier()
# training the model
dt.fit(X_train,y_train)

# prediction
# evaluating on the test data
y_test_pred = dt.predict(X_test)
dt_test = accuracy_score(y_test, y_test_pred)

# evaluating on the training data
y_train_pred = dt.predict(X_train)
dt_train = accuracy_score(y_train, y_train_pred)

print("Training data accuracy is " +  repr(dt_train) + " and test data accuracy is " + repr(dt_test) + "\n")

# classification report
print("***Classification report***")
print(classification_report(y_test, y_test_pred))
print("\n")

# confusion matrix
ConfMatrix = confusion_matrix(y_test, dt.predict(X_test))
print("***Confusion matrix***")
print(ConfMatrix)
print("\n")

Training data accuracy is 1.0 and test data accuracy is 0.935672514619883

***Classification report***
              precision    recall  f1-score   support

           0       0.91      0.92      0.91        63
           1       0.95      0.94      0.95       108

    accuracy                           0.94       171
   macro avg       0.93      0.93      0.93       171
weighted avg       0.94      0.94      0.94       171



***Confusion matrix***
[[ 58   5]
 [  6 102]]




### Attribute selection (Feature Selection using Recursive Feature Elimination RFE)

Feature selection using Recursive Feature Elimination (RFE):

It is an optimization technique based on the greedy method. 
Based on an elimination pattern, the features are ranked.
The models are built iteratively by the RFE, which rates each iteration's top and worst performing features.
The model is created again using the remaining features once the desirable features have been marked, and so on until all the features have been used.
After utilising the Decision Tree Classifier to analyse the chosen features I obtained from the RFE model,
basic statistics like precision, recall, and accuracy remained almost similar as evident from the results.

In [7]:
X_col = df.drop(columns=["target"], axis=1)
y_col = df['target']

rfe = RFE(estimator= DecisionTreeClassifier(), n_features_to_select= 10)
rfe.fit(X_col, y_col)

for i, col in zip(range(X_col.shape[1]), X_col.columns):
    print("{}: selected= {} rank= {}".format(col, rfe.support_[i], rfe.ranking_[i]))

X_rfe = df[['mean concave points', 'mean symmetry', 'mean fractal dimension', 'radius error', 'texture error', 
            'perimeter error', 'area error', 'smoothness error', 'compactness error', 'concavity error', 
            'fractal dimension error', 'worst radius', 'worst texture', 'worst perimeter', 'worst area', 
            'worst smoothness', 'worst compactness', 'worst concavity', 'worst concave points', 
            'worst symmetry']]
y_rfe = df['target']

X_train, X_test, y_train, y_test = train_test_split(X_rfe, y_rfe, random_state=0, test_size=.30)

rfe_final = DecisionTreeClassifier()
rfe_final.fit(X_train, y_train)

y_as = rfe_final.predict(X_test)

# Confusion matrix
print("***Confusion matrix*** \n ", confusion_matrix(y_test, y_as))

# Classification report
print("***Classification report*** \n ",classification_report(y_test, y_as))

mean radius: selected= False rank= 21
mean texture: selected= False rank= 14
mean perimeter: selected= False rank= 13
mean area: selected= False rank= 12
mean smoothness: selected= False rank= 10
mean compactness: selected= False rank= 5
mean concavity: selected= False rank= 4
mean concave points: selected= True rank= 1
mean symmetry: selected= False rank= 7
mean fractal dimension: selected= False rank= 6
radius error: selected= False rank= 16
texture error: selected= False rank= 20
perimeter error: selected= False rank= 18
area error: selected= False rank= 2
smoothness error: selected= True rank= 1
compactness error: selected= True rank= 1
concavity error: selected= False rank= 11
concave points error: selected= False rank= 8
symmetry error: selected= True rank= 1
fractal dimension error: selected= False rank= 3
worst radius: selected= True rank= 1
worst texture: selected= True rank= 1
worst perimeter: selected= False rank= 15
worst area: selected= True rank= 1
worst smoothness: selec

### Principal Component Analysis (PCA)

With 30 dimensions, the dataset is challenging to visualise. An effective preliminary technique for investigating a dataset is PCA.
Its efficient due to its speed and ability to operate with practically all dataset types, but it might not be able to identify more nuanced
categories that result in better visuals for more complicated datasets.
When performing PCA, all attribute statistics will be roughly on the same scale, which is quite helpful.
In this case, PCA is not effective.

The decision tree performed well even without standardizing and gave a score of 92%. But as we see the difference in training and test accuracy, overfitting of the model can be
observed which is not good. This can be resolved using different parameters during the decision tree implementation.

The overall goal should be aimed at choosing the optimum feature selection method to maximise model prediction.

In [9]:
pca = PCA(n_components=4)
pc = pca.fit_transform(X)
pc_df = pd.DataFrame(data= pc, columns = ['PC 1', 'PC 2', 'PC 3', 'PC 4'])
print(pc_df)

pc_df['target'] = df['target'].values
pc_df.head()

X = pc_df.drop(columns= 'target', axis= 1).values
y = pc_df[['target']].values

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=.30)
pc_dt = DecisionTreeClassifier()
pc_dt.fit(X_train, y_train)
ypc = pc_dt.predict(X_test)

# Confusion matrix
print("***Confusion matrix*** \n ", confusion_matrix(y_test, ypc))

# Classification report
print("***Classification report*** \n ", classification_report(y_test, ypc))


         PC 1      PC 2      PC 3      PC 4
0    5.020811  2.765995 -2.080034  0.490632
1    1.847320 -2.240095 -1.179743  1.004146
2    4.012625 -0.070391 -0.245925  0.145749
3    3.142418  5.151176  0.940843 -0.524327
4    3.223664 -0.938076 -1.934712  0.123766
..        ...       ...       ...       ...
564  4.548010 -1.467778 -0.529461 -0.719554
565  3.000012 -2.003788  1.078556 -0.191926
566  0.752028 -1.919558  1.574417 -0.272770
567  6.821815  1.123028  1.734324  0.060279
568 -3.921449 -1.247852  2.038703  0.803170

[569 rows x 4 columns]
***Confusion matrix*** 
  [[53 10]
 [11 97]]
***Classification report*** 
                precision    recall  f1-score   support

           0       0.83      0.84      0.83        63
           1       0.91      0.90      0.90       108

    accuracy                           0.88       171
   macro avg       0.87      0.87      0.87       171
weighted avg       0.88      0.88      0.88       171

