# Unit II task: Supervised Learning: Decision Trees - Dinosaur Geological Time Period Predictor

Universidad Politecnica de Yucatan

Course: Machine Learning


## Advantages



1.   Simple to understand and interpret.
2.   Able to handle both numerical and categorical data.
3.   Automatic variable selection.
4.   Quick training time.
5.   No need for feature scaling.



## Disadvantages

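1.  Prone to overfitting.
2.  High-variance estimators: small changes in the training data can produce a very different tree.
3.  Can become computationally expensive on large datasets with many features, because every candidate split has to be evaluated.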
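4.  Not fully supported in scikit-learn: its tree implementation does not accept categorical features directly, so they must be encoded as numbers first.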
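
The overfitting point can be illustrated with a minimal sketch. It uses synthetic data with noisy labels (an assumption made for illustration, not the dinosaur dataset) and compares an unrestricted tree with one limited by `max_depth`.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (hypothetical): the label depends on the first feature only,
# and roughly 20% of the labels are flipped to act as noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
y = (X[:, 0] > 0).astype(int)
flip = rng.random(600) < 0.2
y[flip] = 1 - y[flip]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# An unrestricted tree keeps splitting until every leaf is pure.
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# A depth limit stops the tree before it memorises the noisy labels.
shallow = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)

for name, model in [("unrestricted", deep), ("max_depth=2", shallow)]:
    print(name,
          "| depth:", model.get_depth(),
          "| train:", round(model.score(X_train, y_train), 3),
          "| test:", round(model.score(X_test, y_test), 3))
```

A large gap between the training and test scores of the unrestricted tree is the usual sign of overfitting; the exact numbers depend on the data and the random seed.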

## Types of decision trees

*   **Binary Decision Trees.**

Binary Decision Trees are suited to classification problems with two classes.

*   **Multiclass Decision Trees.**

Multiclass Decision Trees handle classification problems with more than two categories, such as the geological time periods predicted in this task.

*   **Regression Trees.**

Regression Trees predict continuous numerical values instead of class labels.

*   **Gradient Boosting Trees.**

Gradient boosting is an ensemble technique that improves accuracy by fitting decision trees sequentially, each one correcting the errors of the previous ones. It applies to both classification and regression problems.
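
The four variants listed above map onto scikit-learn estimators roughly as in the following sketch. The data is synthetic and the hyperparameters (`max_depth`, `n_estimators`) are arbitrary choices for illustration, not values used in this task.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Synthetic features (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 2))

# Binary and multiclass targets use the same classifier class.
y_binary = (X[:, 0] > 0).astype(int)            # two classes: 0, 1
y_multi = np.digitize(X[:, 0], bins=[-1.0, 1.0])  # three classes: 0, 1, 2

binary_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y_binary)
multi_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y_multi)

# A regression tree predicts a continuous value.
y_cont = np.sin(X[:, 0])
reg_tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y_cont)

# Gradient boosting fits many shallow trees one after another.
boosted = GradientBoostingClassifier(
    n_estimators=50, max_depth=2, random_state=0
).fit(X, y_multi)

print("binary classes:    ", binary_tree.classes_)
print("multiclass classes:", multi_tree.classes_)
print("regression output: ", reg_tree.predict(X[:3]))
print("boosting stages:   ", len(boosted.estimators_))
```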

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, classification_report
import matplotlib.pyplot as plt

## Data Loading and Preprocessing

Several criteria were considered when selecting columns during preprocessing. The dataset includes both categorical and numerical values; however, most of the categorical columns are not suitable for one-hot encoding because they contain too many distinct categories.
The selected columns were 'Formation', 'Country', 'Diet', 'Lat', 'Lng', 'Geological Time Period'.
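
One way to check how many categories each column has before encoding is sketched below. The toy frame, its values, and the `max_categories` threshold are hypothetical; only the column names mirror the dataset.

```python
import pandas as pd

# Toy frame with made-up rows; column names follow the dataset.
toy = pd.DataFrame({
    "Accepted Name": ["A", "B", "C", "D", "E", "F"],
    "Country": ["China", "China", "Canada", "Canada", "Mexico", "China"],
    "Diet": ["herbivore", "omnivore", "omnivore", "carnivore", "carnivore", "herbivore"],
    "Lat": [42.9, 41.8, 41.8, 50.7, 50.7, 40.1],
})

categorical = ["Accepted Name", "Country", "Diet"]

# Number of distinct categories per categorical column.
cardinality = toy[categorical].nunique()
print(cardinality)

# Keep only the columns at or below a (hypothetical) cardinality limit,
# since each category becomes one new column after one-hot encoding.
max_categories = 3
low_card = [col for col in categorical if cardinality[col] <= max_categories]

encoded = pd.get_dummies(toy[low_card + ["Lat"]], columns=low_card)
print(encoded.columns.tolist())
```

A column such as 'Accepted Name', with nearly one category per row, would add almost as many columns as there are rows, which is why high-cardinality columns are usually left out.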

In [None]:
# load the dataset
data = pd.read_csv('C:\\Users\\cokit\\Documents\\9quarter\\Machine Learning\\dinosaur_data.csv')
data.head()

Unnamed: 0,Lat,Lng,What Dinosaurs Eat,Accepted Name,Country,Cc,Diet,Early Interval,Formation,Geological Interval,Geological Time Period,Ref Author,Ref Pubyr,State,Max Ma,Min Ma
0,42.9333,123.966698,PLANT,Chaoyangsaurus youngi,China,CN,herbivore,Late Tithonian,Tuchengzi,Tithonian,Jurassic,Dong,1992,Liaoning,150.8,132.9
1,41.799999,120.73333,PLANT and ANIMAL,Protarchaeopteryx robusta,China,CN,omnivore,Late Barremian,Yixian,Barremian,Cretaceous,Ji et al.,1998,Liaoning,130.0,122.46
2,41.799999,120.73333,PLANT and ANIMAL,Caudipteryx zoui,China,CN,omnivore,Late Barremian,Yixian,Barremian,Cretaceous,Ji and Ji,1997,Liaoning,130.0,122.46
3,50.740726,-111.528732,FLESH,Gorgosaurus libratus,Canada,CA,carnivore,Late Campanian,Dinosaur Park,Campanian,Cretaceous,Matthew and Brown,1922,Alberta,83.5,70.6
4,50.737015,-111.549347,FLESH,Gorgosaurus libratus,Canada,CA,carnivore,Late Campanian,Dinosaur Park,Campanian,Cretaceous,Russell,1970,Alberta,83.5,70.6


In [None]:
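# fill missing formations; assign the result back instead of using inplace=True on a column selection
data['Formation'] = data['Formation'].fillna('Unknown')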

In [None]:
selected_columns = ['Formation', 'Country', 'Diet', 'Lat', 'Lng', 'Geological Time Period'] # select the columns that will be used
encoded_data = pd.get_dummies(data[selected_columns], columns=['Formation', 'Country', 'Diet']) # one-hot encoding
encoded_data.head()

Unnamed: 0,Lat,Lng,Geological Time Period,"Formation_""Lance""",Formation_Aguja,Formation_Antlers,Formation_Arundel Clay,Formation_Bayan Gobi,Formation_Bayan Mandahu,Formation_Bearpaw Shale,...,Country_Canada,Country_China,Country_Mexico,Country_United States,Diet_carnivore,"Diet_carnivore, omnivore",Diet_herbivore,"Diet_herbivore, omnivore",Diet_omnivore,Diet_piscivore
0,42.9333,123.966698,Jurassic,0,0,0,0,0,0,0,...,0,1,0,0,0,0,1,0,0,0
1,41.799999,120.73333,Cretaceous,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,1,0
2,41.799999,120.73333,Cretaceous,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,1,0
3,50.740726,-111.528732,Cretaceous,0,0,0,0,0,0,0,...,1,0,0,0,1,0,0,0,0,0
4,50.737015,-111.549347,Cretaceous,0,0,0,0,0,0,0,...,1,0,0,0,1,0,0,0,0,0


In [None]:
X = encoded_data.drop('Geological Time Period', axis=1) # drop the target column from the features
y = data['Geological Time Period'].values.reshape(-1,1) # use the target column as the label
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # split into train and test sets
X.head()


Unnamed: 0,Lat,Lng,"Formation_""Lance""",Formation_Aguja,Formation_Antlers,Formation_Arundel Clay,Formation_Bayan Gobi,Formation_Bayan Mandahu,Formation_Bearpaw Shale,Formation_Bearpaw shale,...,Country_Canada,Country_China,Country_Mexico,Country_United States,Diet_carnivore,"Diet_carnivore, omnivore",Diet_herbivore,"Diet_herbivore, omnivore",Diet_omnivore,Diet_piscivore
0,42.9333,123.966698,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,1,0,0,0
1,41.799999,120.73333,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,1,0
2,41.799999,120.73333,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,1,0
3,50.740726,-111.528732,0,0,0,0,0,0,0,0,...,1,0,0,0,1,0,0,0,0,0
4,50.737015,-111.549347,0,0,0,0,0,0,0,0,...,1,0,0,0,1,0,0,0,0,0


## Sklearn

In [None]:
clf = DecisionTreeClassifier(random_state=42) # define the classifier with scikit-learn
clf.fit(X_train, y_train.ravel()) # fit the model; ravel() passes the labels as a 1-D array
y_pred = clf.predict(X_test) # predict on the held-out test set
accuracy = accuracy_score(y_test, y_pred) # accuracy with scikit-learn
print(f'Accuracy: {accuracy:.4f}') # print the accuracy

Accuracy: 0.9899
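
Accuracy alone can hide weak performance on rare classes. The sketch below shows how `classification_report` and `confusion_matrix` give a per-class view. It runs on imbalanced synthetic data from `make_classification` (an assumption for illustration), so its numbers say nothing about the dinosaur dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Imbalanced three-class synthetic data (hypothetical, not the notebook's data).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, weights=[0.7, 0.2, 0.1], random_state=0,
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

demo_clf = DecisionTreeClassifier(random_state=0).fit(Xd_train, yd_train)
yd_pred = demo_clf.predict(Xd_test)

# Precision, recall and F1 for each class.
report = classification_report(yd_test, yd_pred, zero_division=0)
print(report)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(yd_test, yd_pred, labels=[0, 1, 2])
print(cm)
```

The variable names carry a `demo`/`d` suffix so that running the sketch does not overwrite `X_train`, `y_test` or `clf` from the cells above.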


## From Scratch Version

This "Scratch version" was taken from the channel Normalized Nerd (1), which implements the algorithm by defining all the functions explicitly.

## Node class

In [None]:
class Node():
    def __init__(self, feature_index=None, threshold=None, left=None, right=None, info_gain=None, value=None):
        ''' constructor '''

        # for decision node
        self.feature_index = feature_index
        self.threshold = threshold
        self.left = left
        self.right = right
        self.info_gain = info_gain

        # for leaf node
        self.value = value
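
To see how these fields are used, here is a minimal sketch that wires a few nodes together by hand and walks the tree for a prediction. The thresholds and the labels attached to them are invented for illustration, and `predict_one` is a hypothetical helper, not part of the original code.

```python
class Node:
    """Minimal copy of the Node class, repeated so this sketch runs on its own."""
    def __init__(self, feature_index=None, threshold=None, left=None,
                 right=None, info_gain=None, value=None):
        self.feature_index = feature_index
        self.threshold = threshold
        self.left = left
        self.right = right
        self.info_gain = info_gain
        self.value = value


def predict_one(x, node):
    """Follow decision nodes until a leaf (a node with a value) is reached."""
    while node.value is None:
        if x[node.feature_index] <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.value


# Hand-built tree over two features, x[0] = Lat and x[1] = Lng.
# The thresholds and leaf labels are made up, not learned from data.
toy_tree = Node(
    feature_index=1, threshold=0.0,
    left=Node(value="Cretaceous"),
    right=Node(
        feature_index=0, threshold=45.0,
        left=Node(value="Jurassic"),
        right=Node(value="Triassic"),
    ),
)

print(predict_one([50.7, -111.5], toy_tree))  # Cretaceous
print(predict_one([42.9, 123.9], toy_tree))   # Jurassic
print(predict_one([48.0, 10.0], toy_tree))    # Triassic
```

Decision nodes carry `feature_index` and `threshold`, and only leaves carry `value`, so checking `value` is enough to know when to stop.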

## Tree class

In [None]:
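# note: this class reuses the name DecisionTreeClassifier, so from this cell on
# it shadows the scikit-learn class imported at the top of the notebook
class DecisionTreeClassifier():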
    def __init__(self, min_samples_split=2, max_depth=2):
        ''' constructor '''

        # initialize the root of the tree
        self.root = None

        # stopping conditions
        self.min_samples_split = min_samples_split
        self.max_depth = max_depth

    def build_tree(self, dataset, curr_depth=0):
        ''' recursive function to build the tree '''

        X, Y = dataset[:,:-1], dataset[:,-1]
        num_samples, num_features = np.shape(X)

        # split until stopping conditions are met
        if num_samples>=self.min_samples_split and curr_depth<=self.max_depth:
            # find the best split
            best_split = self.get_best_split(dataset, num_samples, num_features)
            # check that a valid split was found and that its information gain is positive
            if best_split.get("info_gain", 0)>0:
                # recur left
                left_subtree = self.build_tree(best_split["dataset_left"], curr_depth+1)
                # recur right
                right_subtree = self.build_tree(best_split["dataset_right"], curr_depth+1)
                # return decision node
                return Node(best_split["feature_index"], best_split["threshold"],
                            left_subtree, right_subtree, best_split["info_gain"])

        # compute leaf node
        leaf_value = self.calculate_leaf_value(Y)
        # return leaf node
        return Node(value=leaf_value)

    def get_best_split(self, dataset, num_samples, num_features):
        ''' function to find the best split '''

        # dictionary to store the best split
        best_split = {}
        max_info_gain = -float("inf")

        # loop over all the features
        for feature_index in range(num_features):
            feature_values = dataset[:, feature_index]
            possible_thresholds = np.unique(feature_values)
            # loop over all the feature values present in the data
            for threshold in possible_thresholds:
                # get current split
                dataset_left, dataset_right = self.split(dataset, feature_index, threshold)
                # check if childs are not null
                if len(dataset_left)>0 and len(dataset_right)>0:
                    y, left_y, right_y = dataset[:, -1], dataset_left[:, -1], dataset_right[:, -1]
                    # compute information gain
                    curr_info_gain = self.information_gain(y, left_y, right_y, "gini")
                    # update the best split if needed
                    if curr_info_gain>max_info_gain:
                        best_split["feature_index"] = feature_index
                        best_split["threshold"] = threshold
                        best_split["dataset_left"] = dataset_left
                        best_split["dataset_right"] = dataset_right
                        best_split["info_gain"] = curr_info_gain
                        max_info_gain = curr_info_gain

        # return best split
        return best_split

    def split(self, dataset, feature_index, threshold):
        ''' function to split the data '''

        # boolean mask over the feature column: rows at or below the threshold go left
        mask = dataset[:, feature_index] <= threshold
        dataset_left = dataset[mask]
        dataset_right = dataset[~mask]
        return dataset_left, dataset_right

    def information_gain(self, parent, l_child, r_child, mode="entropy"):
        ''' function to compute information gain '''

        weight_l = len(l_child) / len(parent)
        weight_r = len(r_child) / len(parent)
        if mode=="gini":
            gain = self.gini_index(parent) - (weight_l*self.gini_index(l_child) + weight_r*self.gini_index(r_child))
        else:
            gain = self.entropy(parent) - (weight_l*self.entropy(l_child) + weight_r*self.entropy(r_child))
        return gain

    def entropy(self, y):
        ''' function to compute entropy '''

        class_labels = np.unique(y)
        entropy = 0
        for cls in class_labels:
            p_cls = len(y[y == cls]) / len(y)
            entropy += -p_cls * np.log2(p_cls)
        return entropy

    def gini_index(self, y):
        ''' function to compute gini index '''

        class_labels = np.unique(y)
        gini = 0
        for cls in class_labels:
            p_cls = len(y[y == cls]) / len(y)
            gini += p_cls**2
        return 1 - gini

    def calculate_leaf_value(self, Y):
        ''' function to compute leaf node '''

        Y = list(Y)
        return max(Y, key=Y.count)

    def print_tree(self, tree=None, indent=" "):
        ''' function to print the tree '''

        if not tree:
            tree = self.root

        if tree.value is not None:
            print(tree.value)

        else:
            print("X_"+str(tree.feature_index), "<=", tree.threshold, "?", tree.info_gain)
            print("%sleft:" % (indent), end="")
            self.print_tree(tree.left, indent + indent)
            print("%sright:" % (indent), end="")
            self.print_tree(tree.right, indent + indent)

    def fit(self, X, Y):
        ''' function to train the tree '''

        dataset = np.concatenate((X, Y), axis=1)
        self.root = self.build_tree(dataset)

    def predict(self, X):
        ''' function to predict new dataset '''

        predictions = [self.make_prediction(x, self.root) for x in X.values]

        return predictions

    def make_prediction(self, x, tree):
        ''' function to predict a single data point '''

        # a leaf node stores a value; return it as the prediction
        if tree.value is not None:
            return tree.value
        feature_val = x[tree.feature_index]
        if feature_val<=tree.threshold:
            return self.make_prediction(x, tree.left)
        else:
            return self.make_prediction(x, tree.right)
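
The split criterion used by `information_gain` can be checked by hand on a small example. The sketch below restates the Gini and entropy formulas as standalone functions (a vectorised rewrite for illustration, not the class methods themselves) and applies them to a made-up split of eight labels.

```python
import numpy as np


def gini(y):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)


def entropy(y):
    """Shannon entropy in bits: minus the sum of p * log2(p) over classes."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))


def information_gain(parent, left, right, impurity=gini):
    """Parent impurity minus the size-weighted impurity of the two children."""
    w_left = len(left) / len(parent)
    w_right = len(right) / len(parent)
    return impurity(parent) - (w_left * impurity(left) + w_right * impurity(right))


# Made-up labels: 4 + 4 in the parent, split into 3:1 and 1:3.
parent = np.array(["Cretaceous"] * 4 + ["Jurassic"] * 4)
left = np.array(["Cretaceous"] * 3 + ["Jurassic"] * 1)
right = np.array(["Cretaceous"] * 1 + ["Jurassic"] * 3)

print(gini(parent))                           # 0.5
print(gini(left))                             # 0.375
print(information_gain(parent, left, right))  # 0.125
print(entropy(parent))                        # 1.0
```

With Gini, the parent impurity is 1 - (0.5² + 0.5²) = 0.5 and each child has 1 - (0.75² + 0.25²) = 0.375, so the gain is 0.5 - 0.375 = 0.125. `get_best_split` keeps the feature and threshold pair that gives the largest such gain.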

## Train-Test split

In [None]:
X = encoded_data.drop('Geological Time Period', axis=1)
y = data['Geological Time Period'].values.reshape(-1,1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X.head()

Unnamed: 0,Lat,Lng,"Formation_""Lance""",Formation_Aguja,Formation_Antlers,Formation_Arundel Clay,Formation_Bayan Gobi,Formation_Bayan Mandahu,Formation_Bearpaw Shale,Formation_Bearpaw shale,...,Country_Canada,Country_China,Country_Mexico,Country_United States,Diet_carnivore,"Diet_carnivore, omnivore",Diet_herbivore,"Diet_herbivore, omnivore",Diet_omnivore,Diet_piscivore
0,42.9333,123.966698,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,1,0,0,0
1,41.799999,120.73333,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,1,0
2,41.799999,120.73333,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,1,0
3,50.740726,-111.528732,0,0,0,0,0,0,0,0,...,1,0,0,0,1,0,0,0,0,0
4,50.737015,-111.549347,0,0,0,0,0,0,0,0,...,1,0,0,0,1,0,0,0,0,0


## Fit the model

In [None]:
classifier = DecisionTreeClassifier(min_samples_split=3, max_depth=5)
classifier.fit(X_train, y_train)
classifier.print_tree()

X_122 <= 0.0 ? 0.08837796634460471
 left:X_146 <= 0.0 ? 0.10744579338236943
  left:X_1 <= -86.016388 ? 0.0494895623689236
    left:X_25 <= 0.0 ? 0.020803681191599727
        left:X_91 <= 0.0 ? 0.01601169112349877
                left:X_50 <= 0.0 ? 0.014586078727358766
                                left:Cretaceous
                                right:Jurassic
                right:Jurassic
        right:Triassic
    right:X_1 <= 106.619347 ? 0.17561432556843143
        left:X_1 <= -74.514938 ? 0.12249570371930735
                left:X_0 <= 40.134052 ? 0.24970046082949304
                                left:Cretaceous
                                right:Triassic
                right:X_1 <= -64.058052 ? 0.05329290813307247
                                left:Jurassic
                                right:Jurassic
        right:X_174 <= 0.0 ? 0.043762868092321563
                left:X_178 <= 0.0 ? 0.029292079901611215
                                left:Cretaceous
              

## Test the model

In [None]:
y_pred = classifier.predict(X_test)

In [None]:
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.4f}')

Accuracy: 0.9554


## References

(1) Normalized Nerd. (2021, January 13). Decision Tree Classification Clearly Explained! [Video]. YouTube. https://www.youtube.com/watch?v=ZVR2Way4nwQ&ab_channel=NormalizedNerd