<i>Modified from the file written by Ahsan Khan on behalf of the Alberta Machine Intelligence Institute for the AI Pathways Partnership, supported by Prairies Economic Development Canada.</i>

---

**Important Note:**

Please do not alter any part of this notebook outside the designated text cells that are clearly marked with "*Start student input* ↓" and "*End student input ↑*". Changes made outside these specified areas could lead to incorrect evaluations of your work, potentially affecting your lab scores.

Ensure you complete all activities within these sections, which are indicated by labels like **[A1]**, **[A2]**, **[A3]**, ... Each activity is crucial for the successful completion of this lab. Additionally, please name your variables exactly as specified in the instructions (if specified) to ensure that your answers are correctly assessed.

---


# Lab 2: Decision Trees

Following on from the k-NN classifier, you will now be introduced to the Decision Tree (DT) classifier. DTs, like k-NN, naturally handle non-linear, multi-class data. Unlike k-NN, the algorithm is not distance-based; instead, it makes a sequence of binary decisions at the internal nodes of the tree to arrive at a prediction (a leaf). For this lab you will use the Breast Cancer Wisconsin (Diagnostic) dataset again.
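
To make the idea of sequential binary decisions concrete, here is a minimal sketch of a hand-built tree and the walk from its root to a leaf. The feature names mirror the dataset, but `toy_tree`, `predict_one`, the thresholds, and the leaf labels are all made up for illustration; they are not learned from the data and are not how scikit-learn stores a tree internally.

```python
# A hand-built toy tree. The thresholds and leaf labels are illustrative
# assumptions, not values learned from the breast cancer data.
toy_tree = {
    "feature": "worst radius",
    "threshold": 16.8,
    "left": {  # followed when worst radius <= 16.8
        "feature": "worst concave points",
        "threshold": 0.14,
        "left": {"leaf": 1},
        "right": {"leaf": 0},
    },
    "right": {"leaf": 0},  # followed when worst radius > 16.8
}


def predict_one(node, sample):
    """Walk from the root to a leaf, making one binary decision per internal node."""
    while "leaf" not in node:
        if sample[node["feature"]] <= node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["leaf"]


# Two made-up samples that end up in different leaves.
sample_a = {"worst radius": 12.0, "worst concave points": 0.05}
sample_b = {"worst radius": 25.4, "worst concave points": 0.27}

print(predict_one(toy_tree, sample_a))
print(predict_one(toy_tree, sample_b))
```

Each prediction only ever compares one feature against one threshold at a time, which is worth keeping in mind when you reach the question about scaling later in the lab.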

In [None]:
# Crucial data processing and analysis libraries
import numpy as np
import pandas as pd

# Loading the modules required to build and evaluate a DT classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

# Loading the Breast Cancer Wisconsin (Diagnostic) dataset from sklearn
from sklearn.datasets import load_breast_cancer

##### Loading our data into a dataframe, the same way you did previously in lab 1.

In [None]:
# Loading the data
breast_cancer = load_breast_cancer()

# There are three key parts to the dataset we care about
# The feature data, X
X = breast_cancer.data
display(X)

array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,
        1.189e-01],
       [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,
        8.902e-02],
       [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,
        8.758e-02],
       ...,
       [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,
        7.820e-02],
       [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,
        1.240e-01],
       [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,
        7.039e-02]])

In [None]:
# The target classes, y
y = breast_cancer.target
display(y)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0,
       0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,

In [None]:
# The feature names
display(breast_cancer.feature_names)

array(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error',
       'fractal dimension error', 'worst radius', 'worst texture',
       'worst perimeter', 'worst area', 'worst smoothness',
       'worst compactness', 'worst concavity', 'worst concave points',
       'worst symmetry', 'worst fractal dimension'], dtype='<U23')

##### This is what the data looks like in a pandas dataframe

In [None]:
df = pd.DataFrame(X, columns=breast_cancer['feature_names'])
df['class'] = y


print(f"Number of rows in the data: {df.shape[0]}")
df.head()

Number of rows in the data: 569


| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | 0 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | 0 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | 0 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | 0 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | 0 |


##### It is always a good idea to look at some summary statistics of the dataset to get an understanding of it.

In [None]:
df.describe()

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | ... | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 |
| mean | 14.127292 | 19.289649 | 91.969033 | 654.889104 | 0.09636 | 0.104341 | 0.088799 | 0.048919 | 0.181162 | 0.062798 | ... | 25.677223 | 107.261213 | 880.583128 | 0.132369 | 0.254265 | 0.272188 | 0.114606 | 0.290076 | 0.083946 | 0.627417 |
| std | 3.524049 | 4.301036 | 24.298981 | 351.914129 | 0.014064 | 0.052813 | 0.07972 | 0.038803 | 0.027414 | 0.00706 | ... | 6.146258 | 33.602542 | 569.356993 | 0.022832 | 0.157336 | 0.208624 | 0.065732 | 0.061867 | 0.018061 | 0.483918 |
| min | 6.981 | 9.71 | 43.79 | 143.5 | 0.05263 | 0.01938 | 0.0 | 0.0 | 0.106 | 0.04996 | ... | 12.02 | 50.41 | 185.2 | 0.07117 | 0.02729 | 0.0 | 0.0 | 0.1565 | 0.05504 | 0.0 |
| 25% | 11.7 | 16.17 | 75.17 | 420.3 | 0.08637 | 0.06492 | 0.02956 | 0.02031 | 0.1619 | 0.0577 | ... | 21.08 | 84.11 | 515.3 | 0.1166 | 0.1472 | 0.1145 | 0.06493 | 0.2504 | 0.07146 | 0.0 |
| 50% | 13.37 | 18.84 | 86.24 | 551.1 | 0.09587 | 0.09263 | 0.06154 | 0.0335 | 0.1792 | 0.06154 | ... | 25.41 | 97.66 | 686.5 | 0.1313 | 0.2119 | 0.2267 | 0.09993 | 0.2822 | 0.08004 | 1.0 |
| 75% | 15.78 | 21.8 | 104.1 | 782.7 | 0.1053 | 0.1304 | 0.1307 | 0.074 | 0.1957 | 0.06612 | ... | 29.72 | 125.4 | 1084.0 | 0.146 | 0.3391 | 0.3829 | 0.1614 | 0.3179 | 0.09208 | 1.0 |
| max | 28.11 | 39.28 | 188.5 | 2501.0 | 0.1634 | 0.3454 | 0.4268 | 0.2012 | 0.304 | 0.09744 | ... | 49.54 | 251.2 | 4254.0 | 0.2226 | 1.058 | 1.252 | 0.291 | 0.6638 | 0.2075 | 1.0 |


# Lab activity: the decision tree classifier

##### **[A1]**  
Split your data into training and validation sets. Use ``X_train``, ``y_train``, ``X_val``, and ``y_val`` as the assigned variables, respectively.

*Start student input* ↓

In [None]:
# Put your code here.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)

*End student input ↑*
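
A plain `train_test_split` call draws a different random split on every run, so the scores reported further down will vary from run to run. As a hedged sketch (not a lab requirement), the split can be made repeatable and class-balanced with the `random_state` and `stratify` arguments; the value `random_state=0` is an arbitrary choice made here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# random_state=0 and stratify=y are illustrative choices, not lab requirements.
# random_state fixes the shuffle; stratify keeps the class ratio similar in both sets.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print(X_train.shape, X_val.shape)
print(y_train.mean(), y_val.mean())  # fraction of class 1 in each set
```

With 569 rows and `test_size=0.2`, the validation set holds 114 rows, which matches the `support` total in the classification reports below.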

##### **[A2]**
Instantiate a decision tree classifier called `dt`. For now, you do not need to set any hyperparameters or other class constructor arguments.

*Start student input* ↓

In [None]:
# Put your code here.
dt = DecisionTreeClassifier()

*End student input ↑*

##### **[A3]**
Fit the decision tree classifier to your training data.

*Start student input* ↓

In [None]:
# Put your code here.
dt.fit(X_train, y_train)

*End student input ↑*

##### **[A4]**
Predict on your validation data.

*Start student input* ↓

In [None]:
# Put your code here.
y_prediction = dt.predict(X_val)

*End student input ↑*
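
Once a tree has been fitted, the rules it learned can be printed as text, which helps connect predictions back to the threshold tests inside the tree. The sketch below uses `export_text` from `sklearn.tree` on a deliberately shallow tree; `small_tree`, `max_depth=2`, and `random_state=0` are assumptions made here to keep the printout short and repeatable, whereas the lab's `dt` uses the default settings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()

# max_depth=2 and random_state=0 are illustrative choices; they are not
# the settings used for `dt` in this lab.
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
small_tree.fit(data.data, data.target)

# Each printed line is either a "feature <= threshold" test or a leaf with its class.
rules = export_text(small_tree, feature_names=list(data.feature_names))
print(rules)
```

The same call works on a full-depth tree such as `dt`, although the printout becomes much longer.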

##### **[A5]**
Evaluate model performance using the `accuracy_score()` and `classification_report()` functions (you can find the documentation for `classification_report` [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)).

*Start student input* ↓

In [None]:
# Put your code here.
# Calculate the accuracy score
accuracy_unscaled = accuracy_score(y_val, y_prediction)
print(accuracy_unscaled)

# Generate the classification report
report_unscaled = classification_report(y_val, y_prediction)
print(report_unscaled)

0.9122807017543859
              precision    recall  f1-score   support

           0       0.96      0.85      0.90        53
           1       0.88      0.97      0.92        61

    accuracy                           0.91       114
   macro avg       0.92      0.91      0.91       114
weighted avg       0.92      0.91      0.91       114



*End student input ↑*
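
To see where the numbers in the report come from, the sketch below recomputes them from prediction counts. The four counts are not a stored output of this notebook; they are a reconstruction inferred from the accuracy, recall, and support values printed above, so treat them as an assumption. The variable names are hypothetical.

```python
# Counts inferred from the printed report (an assumption, not a stored output).
# Rows are the true class, columns are the predicted class.
true0_pred0, true0_pred1 = 45, 8   # 53 validation samples of class 0
true1_pred0, true1_pred1 = 2, 59   # 61 validation samples of class 1

total = true0_pred0 + true0_pred1 + true1_pred0 + true1_pred1
accuracy = (true0_pred0 + true1_pred1) / total

# Precision: of everything predicted as a class, how much truly belongs to it.
precision_0 = true0_pred0 / (true0_pred0 + true1_pred0)
precision_1 = true1_pred1 / (true1_pred1 + true0_pred1)

# Recall: of everything truly in a class, how much was found.
recall_0 = true0_pred0 / (true0_pred0 + true0_pred1)
recall_1 = true1_pred1 / (true1_pred1 + true1_pred0)

# F1: harmonic mean of precision and recall.
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy = {accuracy:.4f}")
print(f"class 0: precision={precision_0:.2f} recall={recall_0:.2f} f1={f1_0:.2f}")
print(f"class 1: precision={precision_1:.2f} recall={recall_1:.2f} f1={f1_1:.2f}")
```

Note that the two classes have different error profiles even though a single accuracy value summarizes both.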

Going back to your lab 1 notebook, you may notice that the accuracy of your first k-NN model was lower than the accuracy achieved by the first DT model here. Recall that you had to scale your data afterwards to achieve a decent accuracy score with the k-NN classifier.

##### **[A6]**
Scale your dataset using the `StandardScaler()` class, transforming both the `X_train` and `X_val` values with parameters fitted on the training data only. Then, fit a new model `dt2` to the scaled data. Finally, evaluate the accuracy score.

*Start student input* ↓

In [None]:
# Put your code here.
# Standardize the features: fit the scaler on the training data only,
# then apply the same transformation to the validation data
standardscaler = StandardScaler()
X_train_scaled = standardscaler.fit_transform(X_train)
X_val_scaled = standardscaler.transform(X_val)

# Instantiate a new Decision Tree classifier
dt2 = DecisionTreeClassifier()

dt2.fit(X_train_scaled, y_train)

y_prediction_scaled = dt2.predict(X_val_scaled)

# Compute the accuracy score of the model on the scaled data
accuracy_score_scaled = accuracy_score(y_val, y_prediction_scaled)
print(accuracy_score_scaled)


# Generate a classification report to analyze model performance
report_score_scaled = classification_report(y_val, y_prediction_scaled)
print(report_score_scaled)

0.9298245614035088
              precision    recall  f1-score   support

           0       0.96      0.89      0.92        53
           1       0.91      0.97      0.94        61

    accuracy                           0.93       114
   macro avg       0.93      0.93      0.93       114
weighted avg       0.93      0.93      0.93       114



*End student input ↑*
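
The phrase "parameters fitted on the training data" can be unpacked with a few lines of NumPy. The sketch below reproduces the fit and transform steps by hand on tiny made-up arrays; `X_train_demo` and `X_val_demo` are hypothetical stand-ins, not the lab data. It uses the population standard deviation (`ddof=0`), which is the convention `StandardScaler` follows.

```python
import numpy as np

# Tiny made-up arrays standing in for X_train and X_val
# (two features on very different scales).
X_train_demo = np.array([[10.0, 0.10], [14.0, 0.20], [18.0, 0.30]])
X_val_demo = np.array([[12.0, 0.25]])

# "Fit": learn the per-feature mean and standard deviation from the training rows only.
mu = X_train_demo.mean(axis=0)
sigma = X_train_demo.std(axis=0)  # population std (ddof=0)

# "Transform": apply the same training statistics to both sets.
X_train_demo_scaled = (X_train_demo - mu) / sigma
X_val_demo_scaled = (X_val_demo - mu) / sigma

print(X_train_demo_scaled.mean(axis=0))  # approximately 0 for each feature
print(X_train_demo_scaled.std(axis=0))   # approximately 1 for each feature
print(X_val_demo_scaled)
```

The validation rows never influence `mu` or `sigma`, which avoids leaking information from the validation set into preprocessing.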

##### **[A7]**
Based on your results for the above two accuracy evaluations do you need to scale your data for a DT classifer? Explain.

*Start student input* ↓

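Explanation: No. A DT classifier does not rely on distance-based computations, so scaling is not required. Decision trees are invariant to feature scaling because they split the data by comparing one feature at a time against a threshold, and standardization preserves the ordering of the values within each feature. The model performs well without normalization, as shown by the small difference in accuracy before scaling (91.2%) and after scaling (92.9%). That small difference most likely comes from randomness in tree construction (no `random_state` was set) rather than from the scaling itself.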

*End student input ↑*
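
The invariance argument can be checked on a single split. The sketch below is a simplified, pure-Python split search using Gini impurity on one made-up feature; `gini`, `best_split`, and the data values are assumptions for illustration and do not reproduce scikit-learn's actual implementation. It shows that standardizing the feature changes the threshold value but not which samples fall on each side.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    return 2 * p1 * (1 - p1)


def best_split(values, labels):
    """Return (threshold, left_indices) for the split with the lowest weighted Gini."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    best = None
    for k in range(1, len(order)):
        lo, hi = values[order[k - 1]], values[order[k]]
        if lo == hi:
            continue  # no threshold can separate equal values
        left, right = order[:k], order[k:]
        score = (
            len(left) * gini([labels[i] for i in left])
            + len(right) * gini([labels[i] for i in right])
        ) / len(values)
        if best is None or score < best[0]:
            best = (score, (lo + hi) / 2, sorted(left))
    return best[1], best[2]


# Made-up feature values and labels, for illustration only.
raw = [7.8, 11.4, 12.5, 17.9, 19.7, 20.6]
labels = [1, 1, 1, 0, 0, 0]

# Standardize the feature by hand (population standard deviation).
mean = sum(raw) / len(raw)
std = (sum((v - mean) ** 2 for v in raw) / len(raw)) ** 0.5
scaled = [(v - mean) / std for v in raw]

raw_threshold, raw_left = best_split(raw, labels)
scaled_threshold, scaled_left = best_split(scaled, labels)

print(raw_threshold, raw_left)
print(scaled_threshold, scaled_left)
```

The two thresholds differ, but the set of sample indices sent to the left branch is the same, so the resulting partition of the data is unchanged.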