
**Introduction to Decision Tree Classifier**

A Decision Tree is a supervised machine learning algorithm used for both classification and regression tasks. It is among the easiest models to interpret, because every prediction can be traced through an explicit sequence of feature tests.

Think of it like a flowchart or a game of "20 Questions." The tree starts with a single root node representing the entire dataset. It then splits the data into smaller, more homogeneous groups based on a series of questions about the features. Each question corresponds to an internal node, and the possible answers are the branches. The process continues until it reaches the leaf nodes, which represent the final classification or prediction.
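To make the flowchart picture concrete, here is a minimal sketch that fits a deliberately shallow tree on the wine dataset and prints its questions as nested rules. The `max_depth=2` setting and the name `shallow_tree` are illustrative assumptions, not part of the notebook's own model.

```python
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier, export_text

wine = load_wine()

# A deliberately shallow tree so the printed rules stay readable
# (max_depth=2 is an illustrative choice, not a tuned value).
shallow_tree = DecisionTreeClassifier(max_depth=2, random_state=42)
shallow_tree.fit(wine.data, wine.target)

# export_text renders the fitted tree as indented feature/threshold tests:
# each "|---" line is a question (internal node) or a "class:" answer (leaf).
rules = export_text(shallow_tree, feature_names=list(wine.feature_names))
print(rules)
```

Reading the output from top to bottom follows one path from the root to a leaf, which is exactly the sequence of questions the tree asks for a single sample.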



In [1]:
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

wine = load_wine()
X = pd.DataFrame(wine.data, columns=wine.feature_names)
y = pd.Series(wine.target)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


dt_default = DecisionTreeClassifier(random_state=42)
dt_default.fit(X_train, y_train)

y_pred_default = dt_default.predict(X_test)
accuracy_default = accuracy_score(y_test, y_pred_default)

print(f"Accuracy of default Decision Tree: {accuracy_default:.4f}")

Accuracy of default Decision Tree: 0.9444


**Hyperparameter Tuning with GridSearchCV**

Hyperparameter tuning is the process of finding the settings, chosen before training rather than learned from the data, that give the best-performing model. For a Decision Tree, key parameters to tune include:

criterion: The function to measure the quality of a split ('gini' or 'entropy').

max_depth: The maximum depth of the tree. Limiting this helps prevent overfitting.

min_samples_split: The minimum number of samples required to split an internal node.

min_samples_leaf: The minimum number of samples required to be at a leaf node.
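The two `criterion` options both measure how mixed the class labels in a node are. As a rough sketch (hand-written helpers named `gini` and `entropy` for illustration, not scikit-learn's internal implementation), the standard textbook definitions can be computed like this:

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def entropy(labels):
    """Shannon entropy in bits: sum(p_k * log2(1 / p_k))."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * np.log2(1.0 / p)))

pure = [0, 0, 0, 0]    # a node containing a single class
mixed = [0, 0, 1, 1]   # a node split evenly between two classes

print(gini(pure), entropy(pure))    # both are 0.0 for a pure node
print(gini(mixed), entropy(mixed))  # 0.5 and 1.0 for a 50/50 node
```

Both measures are zero for a pure node and largest when the classes are evenly mixed, so a good split is one whose child nodes score lower than the parent.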

We will use GridSearchCV to test a "grid" of different parameter combinations and find the best one using cross-validation.
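To see what the "grid" amounts to, the sketch below enumerates every combination with `ParameterGrid` from scikit-learn, using the same parameter values as this notebook's search. The names `candidates` and `n_folds` are only illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the grid searched in this notebook.
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 5, 10, 15, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

# ParameterGrid expands the dictionary into every possible combination.
candidates = list(ParameterGrid(param_grid))
n_folds = 5  # matches cv=5

print(len(candidates))            # 2 * 5 * 3 * 3 = 90 combinations
print(len(candidates) * n_folds)  # 90 * 5 = 450 model fits
print(candidates[0])              # one combination, as a plain dict
```

The number of fits grows multiplicatively with every value added to the grid, which is worth keeping in mind before widening the search.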

In [2]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 5, 10, 15, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

dt = DecisionTreeClassifier(random_state=42)
grid_search = GridSearchCV(estimator=dt, param_grid=param_grid,
                           cv=5, n_jobs=-1, verbose=1, scoring='accuracy')

grid_search.fit(X_train, y_train)

print("\nBest Parameters found by GridSearchCV:")
print(grid_search.best_params_)

best_dt = grid_search.best_estimator_

y_pred_tuned = best_dt.predict(X_test)
accuracy_tuned = accuracy_score(y_test, y_pred_tuned)

print(f"\nAccuracy of default Decision Tree: {accuracy_default:.4f}")
print(f"Accuracy of tuned Decision Tree:   {accuracy_tuned:.4f} ✨")

Fitting 5 folds for each of 90 candidates, totalling 450 fits

Best Parameters found by GridSearchCV:
{'criterion': 'gini', 'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 10}

Accuracy of default Decision Tree: 0.9444
Accuracy of tuned Decision Tree:   0.9444 ✨
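The tuned and default trees tie on the test set. With `test_size=0.2` the split holds out 36 of the 178 wines, so 0.9444 corresponds to 34 correct predictions and a single sample shifts accuracy by roughly 0.028. One way to compare candidates at a finer grain is to look at the cross-validation scores stored in `cv_results_`. The sketch below is one possible way to do that; it rebuilds the same search under the name `search`, and the choice of columns to display is an assumption made for readability.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()
X = pd.DataFrame(wine.data, columns=wine.feature_names)
y = pd.Series(wine.target)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 5, 10, 15, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      param_grid, cv=5, scoring='accuracy')
search.fit(X_train, y_train)

# cv_results_ holds one row per candidate. Note that "mean_test_score"
# refers to the validation folds inside X_train, not to the held-out X_test.
results = pd.DataFrame(search.cv_results_)
columns = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
top = results.sort_values('rank_test_score')[columns].head(5)

print(f"Held-out test samples: {len(X_test)}")
print(f"Best mean CV accuracy: {search.best_score_:.4f}")
print(top.to_string(index=False))
```

If several candidates sit within one standard deviation of each other, the data gives little reason to prefer one over another, and a tie on a small test set is not surprising.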
