**Basic Decision Tree Classification using Python** 😊

by: Jomar Saif P. Baudin

In this workshop, we will use Google Colaboratory and Python to build a Decision Tree that uses continuous and categorical data from the [sample data](https://github.com/superbaw/Machine-Learning-for-Psychology/blob/main/Board%20Exam.csv) to predict whether a college graduate has passed the board examination.

Decision Trees are an exceptionally useful machine learning method when you need to know how the decisions are being made. For example, if you have to justify the predictions to your superior, Decision Trees are a good method because each step in the decision-making process is easy to understand.

In this workshop, you will learn about:
1.	Importing Python libraries
2.	Loading Dataset
3.	Checking the missing data
4.	Finding the missing data
5.	Removing missing values
6.	Further data cleaning
7.	Checking the number of classes, unique values under the Board Exam Result column
8.	Generating descriptive statistics results
9.	Generating plots
10.	Preprocessing data for decision tree
11.	Splitting the data into training and testing subsets
12.	Training decision tree classifier using training data
13.	Obtaining predictions from trained decision tree using the testing data
14.	Generating a classification report
15.	Generating a confusion matrix
16.	Displaying feature importance
17.	Generating text representation of decision tree
18.	Generating the decision tree plot

Importing Python libraries:

In [1]:
#Importing python libraries

import pandas as pd # Pandas for Data Manipulation
import numpy as np # Numerical Python for Numerical data
import matplotlib.pyplot as plt # Data visualizations
import seaborn as sns # Data Visualizations

Loading dataset:

In [None]:
# Load Dataset

df = pd.read_csv("direct link of the spreadsheet.csv")
df.head() # shows the first five rows; .tail() and .sample() work the same way
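
The sample data link in the introduction points to a GitHub web page, while `pd.read_csv` needs the raw file. The following is a minimal sketch, assuming GitHub's usual `raw.githubusercontent.com` URL pattern applies to this repository. The helper names `to_raw_url` and `load_board_exam` are hypothetical and are not part of the workshop notebook.

```python
import pandas as pd

# Link to the sample data as given in the introduction (a GitHub "blob" page).
BLOB_URL = ("https://github.com/superbaw/Machine-Learning-for-Psychology/"
            "blob/main/Board%20Exam.csv")

def to_raw_url(blob_url: str) -> str:
    """Convert a GitHub 'blob' page URL into a raw-file URL (assumed pattern)."""
    return (blob_url
            .replace("https://github.com/", "https://raw.githubusercontent.com/", 1)
            .replace("/blob/", "/", 1))

def load_board_exam(url: str = BLOB_URL) -> pd.DataFrame:
    """Download the CSV into a DataFrame; requires internet access."""
    return pd.read_csv(to_raw_url(url))

raw_url = to_raw_url(BLOB_URL)
print(raw_url)
# In Colab you could then run: df = load_board_exam()
```

If the download fails, uploading the CSV to Colab and passing its local path to `pd.read_csv` is an alternative.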

Checking the missing data:

In [None]:
# Checking for missing data (compare each column's non-null count with the number of rows)

df.info()

Finding the missing data:

In [None]:
# Find the Missing Data

df.isnull().sum() # count the missing values in each column

Removing missing values:

In [None]:
# Removing missing values

df = df.dropna()
df.isnull().sum()
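
To see what the two steps above do, here is a small sketch on made-up data. The column names come from the workshop dataset, but the values are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data: the values are made up for illustration.
toy = pd.DataFrame({
    "Math Grade": [85.0, np.nan, 78.0, 90.0],
    "Board Exam Result": ["Passed", "Failed", None, "Passed"],
})

missing_per_column = toy.isnull().sum()  # number of missing cells per column
cleaned = toy.dropna()                   # drop every row that has a missing cell

print(missing_per_column)
print(f"rows before: {len(toy)}, rows after: {len(cleaned)}")
```

Note that `dropna()` removes a whole row even if only one cell is missing, so it is worth comparing the row counts before and after.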

Further data cleaning:

In [None]:
# Check whether a column contains placeholder entries such as "." that still need cleaning

df[df['Math Grade'] == '.']
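
If such placeholder rows do turn up, one possible way to handle them is sketched below on made-up values. It assumes the placeholder caused the column to be read as text; `pd.to_numeric` with `errors="coerce"` turns anything non-numeric into a missing value, which can then be dropped.

```python
import pandas as pd

# Toy column read as text because of the "." placeholder (values are made up).
toy = pd.DataFrame({"Math Grade": ["85", ".", "78.5", "90"]})

placeholder_rows = toy[toy["Math Grade"] == "."]  # same check as in the cell above

# Convert to numbers; entries that are not numeric become NaN.
toy["Math Grade"] = pd.to_numeric(toy["Math Grade"], errors="coerce")
toy = toy.dropna(subset=["Math Grade"])

print(toy)
print(toy.dtypes)
```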

Checking the number of classes, unique values under the Board Exam Result column:

In [None]:
# check number of classes, unique values under Board Exam column

df['Board Exam Result'].unique()
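
Besides listing the classes, it can help to see how many rows fall into each one. A short sketch on made-up values, using `value_counts`:

```python
import pandas as pd

# Toy target column (values are made up for illustration).
results = pd.Series(["Passed", "Failed", "Passed", "Passed"],
                    name="Board Exam Result")

classes = results.unique()                       # distinct class labels
counts = results.value_counts()                  # rows per class
shares = results.value_counts(normalize=True)    # proportion per class

print(classes)
print(counts)
print(shares)
```

A strongly unbalanced target is worth knowing about before reading the accuracy figures later on.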

Generating descriptive statistics results:

In [None]:
# Descriptive Statistics

df[df['Board Exam Result'] == 'Passed'].groupby('Scholar Status').describe().T

Generating plots:

In [None]:
# Generating pairplots

sns.pairplot(df, hue = 'Board Exam Result')

In [None]:
# Just another pairplot

sns.pairplot(df, hue = 'Scholar Status')

In [None]:
# One more pairplot

sns.pairplot(df, hue = 'Working Student Status')

In [None]:
# 3D scatter plot

import plotly.express as px
fig = px.scatter_3d(df, x='Math Grade', y='English Grade', z='Science Grade', color='Working Student Status')
fig.show()

Preprocessing data for decision tree:

In [None]:
# Multiple Categorical Features
# preprocess the data by encoding categorical features and separating the features from the target variable

X = pd.get_dummies(df.drop('Board Exam Result', axis = 1), drop_first = True) # dummy (one-hot) encoding; drop_first keeps one 0/1 column per two-level feature
y = df['Board Exam Result']
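
The effect of `pd.get_dummies` with `drop_first = True` is easier to see on a tiny example. In this sketch the category labels `Scholar` and `Non-Scholar` are made up and may not match the labels in the real dataset.

```python
import pandas as pd

# Toy data: grades and category labels are invented for illustration.
toy = pd.DataFrame({
    "Math Grade": [85, 72, 90, 65],
    "Scholar Status": ["Scholar", "Non-Scholar", "Scholar", "Non-Scholar"],
    "Board Exam Result": ["Passed", "Failed", "Passed", "Failed"],
})

# Numeric columns pass through unchanged; text columns become indicator columns.
# drop_first=True drops the first category (alphabetically) of each text column.
X_toy = pd.get_dummies(toy.drop("Board Exam Result", axis=1), drop_first=True)
y_toy = toy["Board Exam Result"]

print(X_toy)
```

Here `Scholar Status` becomes a single column, `Scholar Status_Scholar`, because the `Non-Scholar` level is implied whenever that column is 0 (or `False`).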

Splitting the data into training and testing subsets:

In [None]:
# train - test split
# split the dataset into training and testing subsets

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 101) # random_state can be any integer; for lecture purposes, let us all use the same one so everyone gets the same split
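
The sketch below shows, on made-up arrays, what `test_size` and `random_state` do. The `stratify` argument at the end is an optional extra that the workshop code does not use; it asks for class proportions to be kept similar in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 rows, 2 features, made-up labels.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array(["Passed"] * 6 + ["Failed"] * 4)

# The same random_state gives the same split every time.
split_a = train_test_split(X_toy, y_toy, test_size=0.3, random_state=101)
split_b = train_test_split(X_toy, y_toy, test_size=0.3, random_state=101)
X_train, X_test, y_train, y_test = split_a

print(X_train.shape, X_test.shape)  # 70% of rows for training, 30% for testing

# Optional: a stratified split (not used in the workshop code).
_, _, _, y_test_strat = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=101, stratify=y_toy)
print(y_test_strat)
```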

Training decision tree classifier using training data:

In [None]:
# Instance Model
# train a Decision Tree classifier using the training data

from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(random_state = 101)
model.fit(X_train, y_train) # train the model

Obtaining predictions from trained decision tree using the testing data:

In [None]:
# base Predictions (Default Settings)
# obtain predictions from the trained Decision Tree classifier on the testing data

base_preds = model.predict(X_test) # predictions based on X_test
base_preds

Generating a classification report:

In [None]:
# Classification Report

from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_test, base_preds))
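
To make the numbers in the classification report less abstract, here is a sketch that computes precision, recall, and accuracy for the `Passed` class on six made-up predictions.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up true labels and predictions for illustration.
y_true = ["Passed", "Passed", "Passed", "Failed", "Failed", "Passed"]
y_pred = ["Passed", "Failed", "Passed", "Failed", "Passed", "Passed"]

# Precision: of everything predicted "Passed", how much was really "Passed"?
precision = precision_score(y_true, y_pred, pos_label="Passed")
# Recall: of everything really "Passed", how much did the model find?
recall = recall_score(y_true, y_pred, pos_label="Passed")
# Accuracy: share of all predictions that were correct.
accuracy = accuracy_score(y_true, y_pred)

print(precision, recall, accuracy)
```

In this toy case three of the four `Passed` predictions are correct (precision 0.75), three of the four true `Passed` rows are found (recall 0.75), and four of six predictions are right overall.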

Generating a confusion matrix:

In [None]:
# Confusion matrix

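labels = list(model.classes_) # scikit-learn sorts the classes alphabetically, so use its own order for the tick labels
cm = confusion_matrix(y_test, base_preds, labels = labels)
sns.heatmap(cm, annot = True, fmt = 'd', xticklabels = labels,
            yticklabels = labels, cbar = True).set(title = 'Confusion matrix', xlabel = 'Predicted', ylabel = 'Actual')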
plt.show()
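
The order of rows and columns in a confusion matrix matters when labelling the heatmap. By default `confusion_matrix` sorts the labels, so with string classes `Failed` comes before `Passed`. The sketch below shows this on made-up predictions, along with the `labels` argument for choosing the order explicitly.

```python
from sklearn.metrics import confusion_matrix

# Made-up true labels and predictions for illustration.
y_true = ["Passed", "Passed", "Passed", "Failed", "Failed", "Passed"]
y_pred = ["Passed", "Failed", "Passed", "Failed", "Passed", "Passed"]

# Default order: sorted labels, i.e. ["Failed", "Passed"].
cm_default = confusion_matrix(y_true, y_pred)

# Explicit order: rows and columns follow the list passed as labels.
cm_explicit = confusion_matrix(y_true, y_pred, labels=["Passed", "Failed"])

print(cm_default)   # rows = actual, columns = predicted
print(cm_explicit)
```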

Displaying feature importance:

In [None]:
# Decision Tree Model Attributes

#model.feature_importances_

# provide insights into which features are most important for the trained Decision Tree classifier's decision-making process

ft = pd.DataFrame(index = X.columns, columns = ['Feature Importance'],
                  data = model.feature_importances_) # this builds a table of importances, one row per feature
ft.sort_values(by = 'Feature Importance', ascending = False) # sort the values in descending order
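
A small sketch on made-up data may help when reading the importance table. Here one feature separates the classes perfectly and the other is constant, so the tree can only use the first one. The column name `Noise` is hypothetical.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy data: "Math Grade" separates the classes, "Noise" never changes.
X_toy = pd.DataFrame({
    "Math Grade": [70, 72, 74, 85, 88, 90],
    "Noise": [1, 1, 1, 1, 1, 1],
})
y_toy = ["Failed"] * 3 + ["Passed"] * 3

toy_model = DecisionTreeClassifier(random_state=101).fit(X_toy, y_toy)

# One importance value per feature; they sum to 1 when the tree has a split.
importance = pd.Series(toy_model.feature_importances_,
                       index=X_toy.columns).sort_values(ascending=False)
print(importance)
```

A feature with importance 0 was never used for a split in this particular tree; that does not by itself prove the feature is unrelated to the outcome.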

Generating text representation of decision tree:

In [None]:
# This is how to do a text representation of decision trees (text version)

from sklearn import tree

text_representation = tree.export_text(model, feature_names = X.columns.tolist())
print(text_representation)
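
To get a feel for how to read the text output, here is a sketch on a tiny made-up dataset with a single feature, where the tree needs only one split.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data: grades below 80 failed, grades above 80 passed (made up).
X_toy = pd.DataFrame({"Math Grade": [70, 72, 74, 85, 88, 90]})
y_toy = ["Failed"] * 3 + ["Passed"] * 3

toy_model = DecisionTreeClassifier(random_state=101).fit(X_toy, y_toy)

# Each indented level is one decision; "class:" lines are the leaves.
toy_text = export_text(toy_model, feature_names=X_toy.columns.tolist())
print(toy_text)
```

Each `<=` line and its matching `>` line are the two branches of one split, and the indentation shows how deep in the tree that split sits.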

Generating a decision tree plot:

In [None]:
# Tree Visualization (Graphical)

from sklearn.tree import plot_tree # at each split, the left branch is True and the right branch is False

plt.figure(figsize = (15,10), dpi = 1000) # set the figure size and resolution
plot_tree(model, feature_names = X.columns.tolist(), filled = True,
          class_names = list(model.classes_)); # class names must follow the order of model.classes_ (alphabetical)

# see classification report for class arrangement

**END OF WORKSHOP**

**Thank you!**