# Decision Tree

## Table of Contents

1. [**Introduction**](#Intro)  
    
2. [**Model Development Procedure**](#ModelDevp)
   
3. [**Model Development and Evaluation**](#Class)



## 1 Introduction <a name="Intro"></a>

A decision tree is a flowchart-like tree structure where: 

- Each internal node (decision node) denotes a test on a feature
- Each branch represents an outcome of the test
- Each leaf node (or terminal node) holds a class label
- The topmost node is the root node
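
To make the structure concrete, a small tree can be written by hand as nested conditionals. The sketch below is purely illustrative: the function name `classify_plate`, the two feature names, the thresholds, and the class labels are all made up and do not come from any fitted model.

```python
def classify_plate(thickness, pixel_area):
    """Toy decision tree with hypothetical thresholds and labels."""
    # Root node: test on the first feature
    if thickness <= 50:
        # Internal (decision) node: test on a second feature
        if pixel_area <= 1000:
            return "class A"  # leaf node holding a class label
        return "class B"      # leaf node
    return "class C"          # leaf node reached directly from the root


# Each call follows one root-to-leaf path through the tree
for sample in [(40, 500), (40, 5000), (80, 500)]:
    print(sample, "->", classify_plate(*sample))
```

A fitted `DecisionTreeClassifier` encodes the same kind of rules, except that the features and thresholds are learned from the training data.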





## 2 Model Development Procedure <a name="ModelDevp"></a>

Here are the steps to implement a decision tree classifier in Python using the <font color='blue'>scikit-learn</font> library:

__1.__ Import `DecisionTreeClassifier`, the `tree` module, `train_test_split`, and `MinMaxScaler` from the scikit-learn library, along with the `numpy` and `matplotlib` libraries:
```python
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler  # For normalization
import numpy as np
```

__2.__ Define the dependent variable (target) and the independent variables (features) from the data set:
```python
x_data=np.array(df[['feature1','feature2',...]])
y_data=df['target variable']
```

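__3.__ Normalize your data using <font color='blue'>MinMaxScaler</font> (optional; decision trees do not require feature scaling, but it keeps the workflow consistent with scale-sensitive models)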
```python
MinMaxscaler = MinMaxScaler()  # define min max scaler
x_data_scaled = MinMaxscaler.fit_transform(x_data)  # transform data
```

__4.__ Split the data into train and test sets: `x_train,x_test,y_train,y_test=train_test_split(x_data_scaled,y_data)`



__5.__ Create a decision tree object using the constructor: `dt = DecisionTreeClassifier()`


__6.__ Use the fit function to fit the model to the training data: `dt.fit(x_train,y_train)`

__7.__ Then, make predictions on the test data and the training data:
```python
yhatTest=dt.predict(x_test)
yhatTrain=dt.predict(x_train)
```

__8.__ Create visualizations (Optional)

__9.__ Model Evaluation
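
The steps above can be sketched end to end on a synthetic dataset. This is a minimal sketch, not the workflow used later in this notebook: the data come from scikit-learn's `make_classification`, and the sample count, feature count, `test_size`, and `random_state` values are arbitrary assumptions chosen to make the run reproducible.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Step 2: synthetic stand-in for a real feature matrix and target (assumed sizes)
x_data, y_data = make_classification(n_samples=200, n_features=5, random_state=0)

# Step 3: scale every feature to the [0, 1] range
x_data_scaled = MinMaxScaler().fit_transform(x_data)

# Step 4: hold out 30% of the samples for testing
x_train, x_test, y_train, y_test = train_test_split(
    x_data_scaled, y_data, test_size=0.3, random_state=0
)

# Steps 5 and 6: create the classifier and fit it to the training data
dt = DecisionTreeClassifier(random_state=0)
dt.fit(x_train, y_train)

# Step 7: predict on both splits
yhat_test = dt.predict(x_test)
yhat_train = dt.predict(x_train)

# Step 9: a first evaluation using accuracy
train_acc = accuracy_score(y_train, yhat_train)
test_acc = accuracy_score(y_test, yhat_test)
print(f"train accuracy: {train_acc:0.3f}, test accuracy: {test_acc:0.3f}")
```

A tree grown without depth limits typically fits the training data almost perfectly, so the gap between the two printed accuracies gives a rough indication of overfitting.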

## 3 Model Development and Evaluation <a name="Class"></a>

The steel plate faults dataset was provided by Semeion, Research Center of Sciences of Communication, Via Sersale 117, 00128, Rome, Italy. In this dataset, the faults of steel plates are classified into 7 types. Since it was donated on October 26, 2010, the dataset has been widely used in machine learning for automatic pattern recognition. The fault types and the corresponding numbers of samples are shown in the table below.

<img src="https://docs.google.com/uc?export=download&id=1pw1oJ7plDsTASg_ntI_QSVivQ-tMhlqq" width="500">


The number of samples varies considerably from one category to another. Fault 7 is a special class because it covers every fault other than the first six kinds. In other words, samples in class 7 may have no obvious common characteristics. For every sample, 27 features are recorded, providing evidence for its fault class. All attributes are expressed as integers or real numbers. Detailed information about these 27 independent variables is listed in the following table.

<img src="https://docs.google.com/uc?export=download&id=1lAV-mPa2seL9VWkezbaCicnZVwOup2c6" width="500">


Ref: https://www.sciencedirect.com/science/article/pii/S0925231214012193?casa_token=8ZvcrfiUELkAAAAA:Vt2ShomuyzpagA6Su9nSQHzImgti_HHvtK5zuGqgC01It_Xn9UsccPB-5HVtzBonmsYCibDgYQ



Ref for the dataset: https://archive.ics.uci.edu/ml/datasets/Steel+Plates+Faults


In [None]:
import pandas as pd

url = ('https://raw.githubusercontent.com/MasoudMiM/ME_364/main/Steel_Plates_Faults/Data.csv')
df = pd.read_csv(url,names=['X_Minimum', 'X_Maximum', 'Y_Minimum', 'Y_Maximum', 'Pixels_Areas', 'X_Perimeter',
                            'Y_Perimeter', 'Sum_of_Luminosity', 'Minimum_of_Luminosity', 'Maximum_of_Luminosity',
                            'Length_of_Conveyer', 'TypeOfSteel_A300', 'TypeOfSteel_A400', 'Steel_Plate_Thickness',
                            'Edges_Index', 'Empty_Index', 'Square_Index', 'Outside_X_Index', 'Edges_X_Index',
                            'Edges_Y_Index', 'Outside_Global_Index', 'LogOfAreas', 'Log_X_Index', 'Log_Y_Index',
                            'Orientation_Index', 'Luminosity_Index', 'SigmoidOfAreas', 'Pastry', 'Z_Scratch',
                            'K_Scratch', 'Stains', 'Dirtiness', 'Bumps', 'Other_Faults'])           
df.head()

In [None]:
# Check to see if there are missing values in the dataset
df.isnull().sum().sum()

Step __1__, importing the required libraries

In [None]:
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler  # For normalization
import numpy as np

Step __2__, defining the dependent and independent variables

In [None]:
features = ['X_Minimum', 'X_Maximum', 'Y_Minimum', 'Y_Maximum', 'Pixels_Areas', 'X_Perimeter',
             'Y_Perimeter', 'Sum_of_Luminosity', 'Minimum_of_Luminosity', 'Maximum_of_Luminosity',
             'TypeOfSteel_A300', 'TypeOfSteel_A400', 'Steel_Plate_Thickness']
x_data=np.array(df[features])
y_data=df['K_Scratch']

In [None]:
df['K_Scratch'].unique()

In [None]:
print('Target variable distribution:')
print( df['K_Scratch'].value_counts() )

df['K_Scratch'].value_counts().plot(kind='barh')

Step __3__, normalization using MinMaxScaler

In [None]:
MinMaxscaler = MinMaxScaler()  # define min max scaler
x_data_scaled = MinMaxscaler.fit_transform(x_data)  # transform data

Step __4__, splitting the data

In [None]:
x_train,x_test,y_train,y_test=train_test_split(x_data_scaled,y_data,test_size=0.3)
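
Because the split above is not seeded, the resulting metrics change from run to run, and with an imbalanced target the share of positive samples can differ between the two splits. One option is to pass the `stratify` and `random_state` arguments of `train_test_split`. The sketch below uses made-up data (the arrays `x_demo` and `y_demo` and the 20% positive rate are assumptions for illustration) rather than the steel plates data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced data: 200 positive and 800 negative samples
rng = np.random.default_rng(0)
x_demo = rng.random((1000, 3))
y_demo = np.array([1] * 200 + [0] * 800)

# stratify keeps the class ratio in both splits; random_state fixes the shuffle
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

print(f"positive share in train: {y_tr.mean():0.3f}")
print(f"positive share in test:  {y_te.mean():0.3f}")
```

Applied to this notebook, the same idea would amount to adding `stratify=y_data` and a fixed `random_state` to the existing call.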

Step __5__, creating a decision tree object

In [None]:
dt = DecisionTreeClassifier(criterion="entropy")
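
With `criterion="entropy"`, candidate splits are scored by how much they reduce the Shannon entropy of the class labels (the information gain). The following sketch computes both quantities by hand for a tiny made-up label array; the helper names `entropy` and `information_gain` are hypothetical and are not part of scikit-learn.

```python
import numpy as np


def entropy(labels):
    """Shannon entropy (in bits) of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


def information_gain(parent, left, right):
    """Entropy reduction achieved by splitting parent into left and right."""
    n = len(parent)
    weighted_children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted_children


# Made-up node with a 50/50 class mix
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# A split that separates the classes perfectly
perfect_gain = information_gain(parent, parent[:4], parent[4:])

# A split that leaves each child with the same 50/50 mix
useless_gain = information_gain(parent, parent[[0, 1, 4, 5]], parent[[2, 3, 6, 7]])

print(f"parent entropy: {entropy(parent):0.3f}")
print(f"gain of perfect split: {perfect_gain:0.3f}")
print(f"gain of uninformative split: {useless_gain:0.3f}")
```

At each node the tree keeps the split with the highest gain among the candidates it evaluates.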

Step __6__, fitting the model (training the model)

In [None]:
dt.fit(x_train,y_train)

Step __7__, making predictions

In [None]:
yhatTest=dt.predict(x_test)
yhatTrain=dt.predict(x_train)

We can look at the classes

In [None]:
dt.classes_

We can also look at the tree structure

In [None]:
# Method 1
text_representation = tree.export_text(dt, max_depth=3, feature_names=features)
print(text_representation)

In [None]:
# Method 2
_, ax = plt.subplots(figsize=(30,30)) # Resize figure
tree.plot_tree(dt, max_depth=3, feature_names=features, filled=True, ax=ax);

For the next visualization, you need to install the `dtreeviz` library first if it is not already installed. Uncomment the cell below to install it.

In [None]:
#!pip install dtreeviz

In [None]:
# Method 3
from dtreeviz.trees import dtreeviz # dtreeviz 1.x API; releases from 2.0 on expose dtreeviz.model(...) instead

dtviz = dtreeviz(dt, x_train, y_train,
                target_name="target",
                feature_names=features,
                class_names=('No Fault', 'Faulty'))

dtviz

In [None]:
from sklearn.metrics import accuracy_score 
from sklearn.metrics import jaccard_score
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss

from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay

In [None]:
acc_scoreTrain = accuracy_score(y_train,yhatTrain)
acc_scoreTest = accuracy_score(y_test,yhatTest)
print(f'The accuracy for training data is {acc_scoreTrain:0.3f}')
print(f'The accuracy for the test data is {acc_scoreTest:0.3f}')

In [None]:
J_scoreTrain = jaccard_score(y_train,yhatTrain)
J_scoreTest = jaccard_score(y_test,yhatTest)
print(f'Jaccard index for training data is {J_scoreTrain:0.3f}')
print(f'Jaccard index for the test data is {J_scoreTest:0.3f}')

In [None]:
F_scoreTrain = f1_score(y_train,yhatTrain)
F_scoreTest = f1_score(y_test,yhatTest)
print(f'F-score for training data is {F_scoreTrain:0.3f}')
print(f'F-score for the test data is {F_scoreTest:0.3f}')

In [None]:
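# Log loss is defined on predicted probabilities, so use predict_proba rather than hard labels
LogLossTrain = log_loss(y_train,dt.predict_proba(x_train))
LogLossTest = log_loss(y_test,dt.predict_proba(x_test))
print(f'Log loss for training data is {LogLossTrain:0.3f}')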
print(f'Log loss for the test data is {LogLossTest:0.3f}')
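
Log loss is the average negative log of the probability that the model assigned to the true class, which is why it is normally computed from probabilities such as those returned by `predict_proba`. The sketch below checks this against `log_loss` on a tiny hand-made example; the labels and probabilities are invented for illustration.

```python
import numpy as np
from sklearn.metrics import log_loss

# Invented true labels and predicted probabilities (columns: class 0, class 1)
y_true = np.array([0, 0, 1, 1])
proba = np.array([
    [0.9, 0.1],
    [0.6, 0.4],
    [0.2, 0.8],
    [0.3, 0.7],
])

# Probability assigned to the true class of each sample
p_true = proba[np.arange(len(y_true)), y_true]

# Manual formula: mean of -log(p_true)
manual = float(-np.mean(np.log(p_true)))
from_sklearn = log_loss(y_true, proba)

print(f"manual log loss:  {manual:0.4f}")
print(f"sklearn log loss: {from_sklearn:0.4f}")
```

Confident wrong predictions (a true-class probability close to 0) are penalized heavily, so a fully grown tree whose leaves are pure can show a large test log loss even when its accuracy looks good.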

In [None]:
print('Confusion matrix for training data')
CM_scoreTrain = confusion_matrix(y_train,yhatTrain)   # possible option normalize='true'
print(CM_scoreTrain)

print(40*'-')

print('Confusion matrix for test data')
CM_scoreTest = confusion_matrix(y_test,yhatTest)   # possible option normalize='true'
print(CM_scoreTest)

In [None]:
dispTr = ConfusionMatrixDisplay(CM_scoreTrain,display_labels=['No Fault','Fault'])
dispTr.plot()

dispTs = ConfusionMatrixDisplay(CM_scoreTest,display_labels=['No Fault','Fault'])
dispTs.plot()
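
The accuracy, Jaccard index, and F-score reported earlier can all be recovered from the four cells of a binary confusion matrix. The sketch below shows this on a small invented pair of label arrays (`y_true` and `y_pred` are made up, not taken from the model above) and compares the hand-computed values with the scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, jaccard_score

# Invented labels: 1 marks the positive (fault) class
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0])

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of correct predictions
jaccard = tp / (tp + fp + fn)               # overlap of predicted and true positives
f1 = 2 * tp / (2 * tp + fp + fn)            # harmonic mean of precision and recall

print(f"accuracy: {accuracy:0.3f} (sklearn: {accuracy_score(y_true, y_pred):0.3f})")
print(f"jaccard:  {jaccard:0.3f} (sklearn: {jaccard_score(y_true, y_pred):0.3f})")
print(f"f1:       {f1:0.3f} (sklearn: {f1_score(y_true, y_pred):0.3f})")
```

Unlike accuracy, the Jaccard index and F-score ignore true negatives, which makes them more informative when the positive class is rare.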