1. Model Selection:
Choose Algorithm: Select an appropriate machine learning algorithm based on the nature of the problem (classification, regression, clustering, etc.), the size of the dataset, and other factors.

2. Model Building:
Instantiate Model: Create an instance of the chosen machine learning algorithm.

Fit Model: Train the model on the training data by calling the fit() method. During training, the model learns the patterns and relationships present in the data.
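
The instantiate-and-fit steps above can be sketched as follows. This is a minimal illustration that assumes scikit-learn's bundled Iris data as a stand-in for the project's dataset; the variable names (X_toy, model, test_accuracy) are hypothetical and not part of the project code.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data standing in for the project's dataset (assumption for illustration).
X_toy, y_toy = load_iris(return_X_y=True)
X_fit, X_eval, y_fit, y_eval = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy
)

# Instantiate the chosen algorithm, then train it with fit().
model = DecisionTreeClassifier(random_state=0)
model.fit(X_fit, y_fit)

# Accuracy on held-out data gives a first impression of how well the model learned.
test_accuracy = model.score(X_eval, y_eval)
print(f"Held-out accuracy: {test_accuracy:.3f}")
```

Fixing random_state is a choice made here only so that repeated runs give the same tree.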

A Decision Tree is a supervised machine learning algorithm used for both classification and regression tasks. It builds a tree-like structure where each internal node represents a decision based on a feature, each branch represents the outcome of the decision, and each leaf node represents the class label (in classification) or predicted value (in regression). Here's a description of the Decision Tree model:

1. Structure of Decision Tree:
Root Node: The topmost node in the tree, which corresponds to the feature that best splits the data into homogeneous subsets based on a certain criterion (e.g., Gini impurity, entropy).

Internal Nodes: Intermediate nodes in the tree that represent decisions based on feature values. Each internal node tests a specific feature and splits the data into subsets accordingly.

Leaf Nodes: Terminal nodes in the tree that represent the final prediction or classification label. Each leaf node corresponds to a class label (in classification) or predicted value (in regression).
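
To make the root, internal, and leaf nodes concrete, a small fitted tree can be printed as text. The sketch below assumes the Iris toy dataset and a depth limit of 2 chosen only to keep the printout short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
small_tree.fit(iris.data, iris.target)

# Each indented "|---" line is a test at an internal node;
# lines ending in "class: ..." are leaf nodes.
print(export_text(small_tree, feature_names=list(iris.feature_names)))

n_nodes = small_tree.tree_.node_count   # root + internal nodes + leaves
n_leaves = small_tree.get_n_leaves()
print("nodes:", n_nodes, "leaves:", n_leaves)
```

Because scikit-learn builds binary trees, the total number of nodes is always twice the number of leaves minus one.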

2. Training Process:
Splitting Criteria: At each node, the decision tree algorithm evaluates candidate splits using an impurity criterion (e.g., Gini impurity, entropy) and selects the feature and split point that produce the most homogeneous subsets.

Recursive Partitioning: The dataset is recursively partitioned into subsets based on the selected feature and its possible values. This process continues until certain stopping criteria are met, such as maximum tree depth, minimum samples per leaf, or minimum impurity decrease.
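
The impurity measures and the search for a split point can be illustrated in plain Python. This is a simplified sketch for a single numeric feature, not the optimized procedure a library uses; the helper names (gini, entropy, best_threshold) and the toy values are hypothetical.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy (in bits) of the class proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_threshold(values, labels, impurity=gini):
    """Return (threshold, weighted impurity) of the best binary split."""
    n = len(labels)
    best = (None, float("inf"))
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * impurity(left) + len(right) * impurity(right)) / n
        if score < best[1]:
            best = (t, score)
    return best


values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["a", "a", "a", "b", "b", "b"]

print(gini(labels))      # 0.5 (two equally frequent classes)
print(entropy(labels))   # 1.0
threshold, score = best_threshold(values, labels)
print(threshold, score)  # 6.5 0.0 (both subsets are pure)
```

Recursive partitioning repeats this search on each resulting subset until a stopping criterion such as maximum depth is reached.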

3. Prediction Process:
Traversal: To make a prediction for a new instance, the model traverses the decision tree from the root node to a leaf node, following at each internal node the branch that matches the instance's feature value.

Classification: In classification tasks, the class label associated with the leaf node reached by the instance determines the predicted class.

Regression: In regression tasks, the predicted value associated with the leaf node reached by the instance is the final predicted value.
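
The traversal can be reproduced by hand from the arrays scikit-learn stores on a fitted tree (tree_.feature, tree_.threshold, tree_.children_left, tree_.children_right, tree_.value). The sketch below assumes the Iris toy dataset; predict_by_traversal is a hypothetical helper written for illustration, and the cast to float32 mirrors how scikit-learn handles inputs internally.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_toy, y_toy)


def predict_by_traversal(clf, sample):
    """Walk from the root to a leaf and return that leaf's majority class."""
    t = clf.tree_
    sample = np.asarray(sample, dtype=np.float32)
    node = 0  # the root is always node 0
    while t.children_left[node] != -1:  # -1 marks a leaf
        if sample[t.feature[node]] <= t.threshold[node]:
            node = t.children_left[node]
        else:
            node = t.children_right[node]
    # value[node][0] holds the per-class weights at the leaf.
    return clf.classes_[np.argmax(t.value[node][0])]


manual = np.array([predict_by_traversal(clf, row) for row in X_toy])
matches = bool((manual == clf.predict(X_toy)).all())
print("Manual traversal matches predict():", matches)
```

For a regression tree the loop would be the same; only the last step changes, returning the value stored at the leaf instead of a class label.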

4. Key Advantages:
Interpretability: Decision Trees are easy to interpret and visualize, making them useful for understanding the underlying decision-making process of the model.

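Non-Parametric: Decision Trees make no assumptions about the underlying data distribution and can, in principle, handle both numerical and categorical features. Note that scikit-learn's DecisionTreeClassifier expects numeric input, so categorical features must be encoded first.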

Handles Non-Linear Relationships: Decision Trees can capture complex non-linear relationships between features and the target variable.

Feature Importance: Decision Trees can provide information about feature importance, which helps in feature selection and understanding the most influential features.
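
As a small illustration of the feature-importance point, a fitted scikit-learn tree exposes impurity-based importances through feature_importances_. The example assumes the Iris toy dataset rather than the project's data.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# Map each feature name to its impurity-based importance (values sum to 1).
importances = dict(zip(iris.feature_names, clf.feature_importances_))

for name, value in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.3f}")
```

These scores reflect how much each feature reduced impurity in this particular tree, so they can shift when the training data changes.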

5. Key Limitations:
Overfitting: Decision Trees are prone to overfitting, especially when the tree depth is not properly controlled or the training data is noisy.

Instability: Small variations in the training data can lead to significantly different tree structures, making the model unstable.

Bias towards Features with Many Levels: Decision Trees tend to favor features with many levels (high cardinality) during the splitting process, which can lead to biased trees.

Limited Generalization: Decision Trees may not generalize well to unseen data, especially when the decision boundaries are too complex.
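
The overfitting limitation can be observed by comparing an unrestricted tree with a constrained one on noisy data. The sketch below is illustrative only: the synthetic dataset from make_classification, the 20% label noise, and the limits max_depth=4 and min_samples_leaf=10 are arbitrary assumptions, not tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% randomly flipped labels to simulate noise.
X_syn, y_syn = make_classification(
    n_samples=600, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=0
)

models = {
    "unrestricted": DecisionTreeClassifier(random_state=0),
    "constrained": DecisionTreeClassifier(
        max_depth=4, min_samples_leaf=10, random_state=0
    ),
}

results = {}
for name, m in models.items():
    m.fit(X_tr, y_tr)
    results[name] = (m.score(X_tr, y_tr), m.score(X_te, y_te), m.get_depth())
    train_acc, test_acc, depth = results[name]
    print(f"{name}: train={train_acc:.3f} test={test_acc:.3f} depth={depth}")
```

The unrestricted tree memorizes the training set, noise included; a large gap between its training and test accuracy is the usual symptom of overfitting.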

Overall, Decision Trees are versatile and widely used in various domains due to their simplicity, interpretability, and ability to handle both classification and regression tasks. However, they are often used in ensemble methods (e.g., Random Forests, Gradient Boosting) to overcome their limitations and improve performance.

In [1]:
# Import required libraries (details are available in the README.md file)
from sklearn.tree import DecisionTreeClassifier
import os
import matplotlib.pyplot as plt
import random
import cv2
import pickle


In [2]:
# Get the current directory
current_dir = os.getcwd()

# Move one level up from the current directory
current_dir = os.path.dirname(current_dir)

# Move one more level up to reach the project root
parent_dir = os.path.dirname(current_dir)

# Print the project root directory
print("Parent Directory:", parent_dir)

Parent Directory: E:\upgrade_capston_project-main


In [3]:
# Build the path portably; the trailing '' keeps the trailing separator
preprocessed_data_dir = os.path.join(parent_dir, 'datasets', 'processed_dataset', '')

In [4]:
# Load the preprocessed training features (X_train) from file
with open(os.path.join(preprocessed_data_dir,'X_train.pkl'), 'rb') as f:
    X_train = pickle.load(f)

# Load y_train from file
with open(os.path.join(preprocessed_data_dir,'y_train.pkl'), 'rb') as f:
    y_train = pickle.load(f)


In [5]:
# Define a function to train a decision tree
def train_decision_tree(X_train, y_train):
    """Fit a DecisionTreeClassifier with default hyperparameters and return it."""
    dt_classifier = DecisionTreeClassifier()
    dt_classifier.fit(X_train, y_train)
    return dt_classifier

In [6]:
# Train decision tree
dt_classifier = train_decision_tree(X_train, y_train)

In [7]:
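# Build the path portably and make sure the directory exists before saving
model_dir = os.path.join(parent_dir, 'models', '')
os.makedirs(model_dir, exist_ok=True)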

In [8]:
# Save the trained model to a file
with open(os.path.join(model_dir,'decisionTreeModel.pkl'), 'wb') as f:
    pickle.dump(dt_classifier, f)
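
To check that a saved model can be restored, a save-and-load round trip such as the following may help. It is a sketch that uses a temporary directory and the Iris toy dataset in place of the project's model_dir and training data (both assumptions for illustration). Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.

```python
import os
import pickle
import tempfile

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Toy model standing in for the project's dt_classifier (assumption).
X_toy, y_toy = load_iris(return_X_y=True)
toy_model = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "decisionTreeModel.pkl")

    # Save the trained model.
    with open(model_path, "wb") as f:
        pickle.dump(toy_model, f)

    # Load it back, as a later notebook or script would.
    with open(model_path, "rb") as f:
        restored_model = pickle.load(f)

# The restored model should reproduce the original predictions exactly.
same_predictions = bool(
    (restored_model.predict(X_toy) == toy_model.predict(X_toy)).all()
)
print("Restored model matches original:", same_predictions)
```

A pickled model is generally only safe to load with the same scikit-learn version that created it.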