<a href="https://colab.research.google.com/github/ranjithdurgunala/ML-LAB-2025-2026/blob/main/Decision_Tree_classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Introduction to Decision Tree Classifier**

A Decision Tree is a supervised machine learning algorithm used for both classification and regression tasks. It's one of the most intuitive and easy-to-understand models.

Think of it like a flowchart or a game of "20 Questions." The tree starts with a single root node representing the entire dataset. It then splits the data into smaller, more homogeneous groups based on a series of questions about the features. Each question corresponds to an internal node, and the possible answers are the branches. The process continues until it reaches the leaf nodes, which represent the final classification or prediction.

The goal during training is to find the best questions (splits) that most effectively separate the data into pure classes. To do this, it commonly uses metrics like the Gini Impurity or Entropy to measure the "disorder" or "impurity" of a node. A lower impurity means the data in that node is more uniform (i.e., belongs to a single class).
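
As a rough illustration of the two impurity measures named above, the sketch below computes Gini impurity and entropy for a small list of class labels. The helper names `gini_impurity` and `entropy` and the toy label lists are made up for this example; scikit-learn computes these quantities internally and does not expose them under these names.

```python
import math
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 - sum(p_k ** 2) over the class proportions p_k."""
    n = len(labels)
    proportions = [count / n for count in Counter(labels).values()]
    return 1.0 - sum(p ** 2 for p in proportions)


def entropy(labels):
    """Shannon entropy in bits: sum(-p_k * log2(p_k)) over the class proportions."""
    n = len(labels)
    proportions = [count / n for count in Counter(labels).values()]
    return sum(-p * math.log2(p) for p in proportions)


# Hypothetical nodes: one holds a single class, the other is an even mix of two.
pure_node = ["a", "a", "a", "a"]
mixed_node = ["a", "a", "b", "b"]

print("pure :", gini_impurity(pure_node), entropy(pure_node))
print("mixed:", gini_impurity(mixed_node), entropy(mixed_node))
```

Both measures are zero for the pure node and reach their two-class maximum (0.5 for Gini, 1.0 bit for entropy) for the evenly mixed node, which is why a good split is one that lowers them.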

Key Advantages:

Easy to interpret: Its flowchart-like structure is simple to visualize and explain.

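Handles both numerical and categorical data: In principle, the algorithm can split on either type of feature. Note that scikit-learn's DecisionTreeClassifier expects numeric input, so categorical features must be encoded (for example, one-hot encoded) before training.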

Non-parametric: It doesn't make strong assumptions about the underlying distribution of the data.

Key Disadvantage:

Prone to overfitting: Without constraints (like limiting the tree's depth), it can create overly complex trees that memorize the training data but fail to generalize to new data. This is why hyperparameter tuning is crucial.
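
One way to see this effect is to compare training and test accuracy as the depth limit is relaxed. The following is a minimal sketch on the wine dataset, assuming the same 80/20 split and random_state=42 used later in this notebook; the depth values tried and the `results` dictionary are choices made only for this illustration.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# max_depth=None places no limit on depth, so the tree grows until its leaves are pure.
results = {}
for depth in (1, 2, 3, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    train_acc = tree.score(X_train, y_train)
    test_acc = tree.score(X_test, y_test)
    results[depth] = (train_acc, test_acc)
    print(f"max_depth={depth}: actual depth={tree.get_depth()}, "
          f"train={train_acc:.3f}, test={test_acc:.3f}")
```

A large gap between training and test accuracy for the unconstrained tree would be a sign of overfitting. On a small, well-separated dataset like this one the gap may be modest.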




**Scikit-learn (sklearn)**

Scikit-learn is a comprehensive, open-source machine learning library for Python. It provides simple and efficient tools for data mining and data analysis.

sklearn.datasets: This module provides helper functions to load popular datasets, like the wine dataset (load_wine) used in the example. These built-in datasets are great for practicing and testing algorithms.

sklearn.model_selection: This module contains essential tools for splitting data and tuning models.

train_test_split: A function to randomly split a dataset into training and testing subsets. This is fundamental for evaluating a model's performance on unseen data.

GridSearchCV: An automated tool for hyperparameter tuning. It exhaustively searches through a specified grid of parameters and uses cross-validation to find the combination that yields the best performance.

sklearn.tree: This module contains the decision tree-based models.

DecisionTreeClassifier: The class that implements the Decision Tree algorithm for classification tasks.

sklearn.metrics: This module provides tools for evaluating model performance.

accuracy_score: A function that calculates the accuracy of a classification model, which is the proportion of correct predictions.
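
To make the pieces above concrete, here is a small sketch that loads the wine dataset, checks the sizes produced by train_test_split, and calls accuracy_score on a toy pair of label lists. The toy labels `y_true_toy` and `y_pred_toy` are invented for this example.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# sklearn.datasets: load a built-in dataset.
wine = load_wine()
print("Data shape:", wine.data.shape)          # (n_samples, n_features)
print("Classes:", list(wine.target_names))

# sklearn.model_selection: hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=42
)
print("Train/test rows:", X_train.shape[0], X_test.shape[0])

# sklearn.metrics: accuracy is the fraction of matching labels.
y_true_toy = [0, 1, 2, 2]
y_pred_toy = [0, 1, 1, 2]
toy_accuracy = accuracy_score(y_true_toy, y_pred_toy)
print("Toy accuracy:", toy_accuracy)  # 3 of 4 correct
```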

***Basic Decision Tree Implementation***

First, let's build and train a simple Decision Tree classifier with its default settings.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the dataset
# The wine dataset is great for classification examples.
wine = load_wine()
X = pd.DataFrame(wine.data, columns=wine.feature_names)
y = pd.Series(wine.target)

# Split data into training and testing sets
# 80% for training, 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the Decision Tree model
# Using default parameters first.
dt_default = DecisionTreeClassifier(random_state=42)
dt_default.fit(X_train, y_train)

# Make predictions and evaluate the model
y_pred_default = dt_default.predict(X_test)
accuracy_default = accuracy_score(y_test, y_pred_default)

print(f"Accuracy of default Decision Tree: {accuracy_default:.4f}")
```

Output:

```
Accuracy of default Decision Tree: 0.9444
```
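
Because a fitted tree is a set of if/else rules, it can be inspected directly, which is what makes the model easy to interpret. The sketch below refits the default tree on the same split and prints its size and its first few levels as text. It uses `export_text` from sklearn.tree; the `max_depth=2` argument only truncates the printout and is a choice made for this example.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=42
)
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Size of the fitted tree.
print("Depth:", tree.get_depth())
print("Leaves:", tree.get_n_leaves())

# Text rendering of the learned rules; deeper branches are truncated in the printout.
rules = export_text(tree, feature_names=list(wine.feature_names), max_depth=2)
print(rules)
```

Each indented line in the printout is one question about a feature, and each `class:` line is a leaf.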


***Hyperparameter Tuning with GridSearchCV***

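Hyperparameter tuning is the process of finding the hyperparameter values (settings chosen before training rather than learned from the data) that give the best validated performance. For a Decision Tree, key parameters to tune include: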

criterion: The function to measure the quality of a split ('gini' or 'entropy').

max_depth: The maximum depth of the tree. Limiting this helps prevent overfitting.

min_samples_split: The minimum number of samples required to split an internal node.

min_samples_leaf: The minimum number of samples required to be at a leaf node.

We will use GridSearchCV to test a "grid" of different parameter combinations and find the best one using cross-validation.

```python
from sklearn.model_selection import GridSearchCV

# 1. Define the parameter grid to search
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 5, 10, 15, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# 2. Set up GridSearchCV
# The estimator is our Decision Tree model.
# cv=5 means 5-fold cross-validation.
# n_jobs=-1 uses all available CPU cores to speed up the process.
dt = DecisionTreeClassifier(random_state=42)
grid_search = GridSearchCV(estimator=dt, param_grid=param_grid,
                           cv=5, n_jobs=-1, verbose=1, scoring='accuracy')

# 3. Fit GridSearchCV to the training data
# This will train the model with every parameter combination.
grid_search.fit(X_train, y_train)

# 4. Get the best parameters and the best model
print("\nBest Parameters found by GridSearchCV:")
print(grid_search.best_params_)

best_dt = grid_search.best_estimator_

# 5. Evaluate the tuned model
y_pred_tuned = best_dt.predict(X_test)
accuracy_tuned = accuracy_score(y_test, y_pred_tuned)

print(f"\nAccuracy of default Decision Tree: {accuracy_default:.4f}")
print(f"Accuracy of tuned Decision Tree:   {accuracy_tuned:.4f} ✨")
```

Output:

```
Fitting 5 folds for each of 90 candidates, totalling 450 fits

Best Parameters found by GridSearchCV:
{'criterion': 'gini', 'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 10}

Accuracy of default Decision Tree: 0.9444
Accuracy of tuned Decision Tree:   0.9444 ✨
```
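
The "90 candidates" and "450 fits" reported in the log follow directly from the size of the grid. As a quick check, the sketch below counts the combinations with `ParameterGrid` from sklearn.model_selection; the grid is copied from the cell above, and the names `n_candidates` and `n_folds` are introduced only for this example.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the tuning cell.
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 5, 10, 15, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

n_candidates = len(ParameterGrid(param_grid))  # 2 * 5 * 3 * 3 combinations
n_folds = 5                                    # matches cv=5

print("Candidates:", n_candidates)
print("Total fits:", n_candidates * n_folds)
```

Counting the grid before fitting is a cheap way to estimate how long a search will take, since the number of fits grows multiplicatively with every parameter added.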


***Explanation of the Tuning Process***

Define Parameter Grid: We create a dictionary (param_grid) where keys are the parameter names (max_depth, etc.) and values are lists of settings to try for those parameters.

Instantiate GridSearchCV: We pass our base model (dt), the parameter grid, and a cross-validation strategy (cv=5). GridSearchCV then trains and evaluates the model for every combination of parameters in the grid: 2 × 5 × 3 × 3 = 90 candidates, each fitted on 5 folds, for 450 fits in total. For example, one candidate is a tree with criterion='gini', max_depth=5, min_samples_split=2, min_samples_leaf=1.

Fit: Calling .fit() starts this exhaustive search process on the training data.

Best Estimator: After the search is complete, grid_search.best_params_ gives the combination with the highest mean cross-validated score, and grid_search.best_estimator_ gives a model refit on the full training set with those parameters (this is the default behavior, refit=True).

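Final Evaluation: Comparing the default and tuned models on the test set shows whether tuning helped. In this run it did not change the result: both models score 0.9444, which is consistent with 34 of the 36 test samples being classified correctly. Tuning does not guarantee a higher test score, particularly on a small, easy dataset where the default tree already performs well. The tuned parameters were, however, selected by cross-validation rather than by a single split, and a setting such as min_samples_split=10 yields a simpler tree that is less prone to overfitting.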
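
Beyond best_params_, the fitted search object keeps the score of every candidate, which helps judge how much the parameters actually matter. The following is a minimal sketch, assuming the same data split as above. It uses a deliberately smaller grid than the one in this section so that it runs quickly, and the column selection is just one reasonable view of `cv_results_`.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Reduced grid for illustration only (3 * 2 = 6 candidates).
param_grid = {"max_depth": [None, 3, 5], "min_samples_split": [2, 10]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42), param_grid, cv=5, scoring="accuracy"
)
search.fit(X_train, y_train)

# best_score_ is the mean cross-validated accuracy of the best candidate,
# measured on the training data only; the test set gives a separate estimate.
print(f"Best CV accuracy: {search.best_score_:.4f}")
print(f"Test accuracy:    {search.score(X_test, y_test):.4f}")

# One row per candidate, ordered from best to worst rank.
results = pd.DataFrame(search.cv_results_)
cols = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
print(results[cols].sort_values("rank_test_score").to_string(index=False))
```

If many candidates have nearly the same mean score, or the standard deviation across folds is larger than the differences between candidates, the choice of "best" parameters is not very firm and a simpler tree is usually a sensible pick.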