# **Decision Tree**

# **Question 1: What is a Decision Tree, and how does it work in the context of classification?**

**ANSWER:** A Decision Tree is a supervised machine learning algorithm used for both classification and regression tasks. It learns a hierarchy of feature tests (if/else rules) from labeled data and predicts by following those tests from the root down to a leaf.

**How it works in classification**

**Root Node Selection**

The tree starts with a root node that contains the entire training dataset.

The algorithm chooses the feature (and, for numerical features, the threshold) that best splits the dataset.

The "best" split is usually judged by Gini Impurity or by Entropy / Information Gain (from information theory).

**Splitting the Data**

The chosen feature is used to partition the dataset into subsets.

Example: If the feature is "Age > 30," data gets divided into two branches.

**Recursive Partitioning**

The algorithm repeats the process for each subset (choosing the best feature at that node).

This continues until all data points in a node belong to the same class, or until a stopping criterion such as maximum depth is reached.

**Leaf Node Assignment**

At the end of each path, the node is assigned a class label (majority class of samples in that node).
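The steps above can be seen on a tiny example. This is a minimal sketch assuming scikit-learn is installed; the age/income data and the feature names are made up for illustration. `export_text` prints the learned rules, so the root split, the branches, and the class label at each leaf are visible.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: [age, income in thousands]; label 1 = bought, 0 = did not buy
X = [[22, 25], [25, 32], [28, 40], [35, 60], [42, 75], [50, 90]]
y = [0, 0, 0, 1, 1, 1]

# Fit a tree using the Gini criterion
clf = DecisionTreeClassifier(criterion="gini", random_state=0)
clf.fit(X, y)

# Print the learned rules: the first condition is the root split,
# and each "class:" line is a leaf with its majority class
rules = export_text(clf, feature_names=["age", "income"])
print(rules)

# Classify a new sample by following the rules from root to leaf
prediction = clf.predict([[30, 45]])
print("Predicted class:", prediction[0])
```

Either feature separates the two classes perfectly here, so a single split is enough; which of the two the tree picks is an implementation detail.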

# **Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?**


**ANSWER:**

**Gini Impurity**

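It measures the probability of incorrectly classifying a randomly chosen element if it were labeled at random according to the class distribution in the node.

A value of 0 means the node is pure (all samples in one class). Higher values indicate more impurity, up to 0.5 for two equally frequent classes.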

**Entropy**

It measures the disorder or uncertainty in the node.

With a base-2 logarithm, 0 means the node is pure and 1 is the maximum impurity for two equally frequent classes.
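Both measures are computed from the class proportions p_i in a node: Gini = 1 - Σ p_i² and Entropy = -Σ p_i log2(p_i). The helper functions below are a small illustrative sketch (the names `gini` and `entropy` are not from any library) that reproduces the boundary values mentioned above.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2) over the class proportions p_i."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: sum(-p_i * log2(p_i))."""
    n = len(labels)
    return sum(-(count / n) * math.log2(count / n) for count in Counter(labels).values())

pure = ["yes"] * 4                   # every sample in one class
mixed = ["yes", "yes", "no", "no"]   # two equally frequent classes

print(gini(pure), entropy(pure))     # 0.0 0.0
print(gini(mixed), entropy(mixed))   # 0.5 1.0
```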

**Impact on Decision Tree Splits**

At each node, the algorithm chooses the feature that gives the largest reduction in impurity.

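Gini is slightly cheaper to compute because it avoids logarithms. It is the criterion used by CART (Classification and Regression Trees) and the default in scikit-learn.

Entropy (via Information Gain) can be slightly more sensitive to changes in the class distribution, but in practice the two criteria usually produce very similar trees.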

Both aim to create child nodes that are as pure as possible, improving classification accuracy.

# **Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.**


**ANSWER:**

**Pre-Pruning (Early Stopping):**

The tree growth is stopped early, before it becomes too complex.

Stopping conditions may include: maximum depth, minimum samples per node, or minimum information gain.

**Advantage:** Saves computation time and memory because the tree never grows large, and it limits overfitting from the outset.

**Post-Pruning (Pruning after Full Growth):**

The tree is grown fully and then pruned back by removing branches that do not improve accuracy.

Methods: cost complexity pruning, reduced error pruning.

**Advantage:** Produces simpler, more general trees while still considering all possible splits first.
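The two approaches can be compared side by side. This is a minimal sketch assuming scikit-learn: pre-pruning uses the `max_depth` and `min_samples_leaf` constructor parameters, and post-pruning uses cost complexity pruning through `cost_complexity_pruning_path` and `ccp_alpha`. The mid-range alpha is picked only for illustration; in practice it would be chosen by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Pre-pruning: constrain growth up front
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the full tree, then compute its cost-complexity path
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Illustrative choice: a mid-range alpha (clamped at 0 to guard against rounding)
alpha = max(0.0, float(path.ccp_alphas[len(path.ccp_alphas) // 2]))
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
post_pruned.fit(X_train, y_train)

# Compare tree size and test accuracy
for name, model in [("full", full_tree), ("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    print(f"{name}: nodes={model.tree_.node_count}, test accuracy={model.score(X_test, y_test):.3f}")
```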

# **Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?**

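**ANSWER:**

**Information Gain** (IG) measures the reduction in impurity (or uncertainty) achieved by splitting a dataset on a feature. It is the entropy of the parent node minus the size-weighted average entropy of the child nodes.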

**Importance for Splits:**

A higher IG means the feature provides more “information” about the target class.

When entropy is the splitting criterion, the tree chooses the feature with the highest Information Gain at each node.

This drives the tree toward purer subsets at every split, which generally improves classification accuracy.
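A worked example makes the comparison concrete. The sketch below assumes the usual definition IG = H(parent) - Σ (n_k / n) · H(child_k), with H the entropy in bits; the function names and the two candidate splits are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(-(count / n) * math.log2(count / n) for count in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes"] * 5 + ["no"] * 5

# Hypothetical split A separates the classes perfectly
split_a = [["yes"] * 5, ["no"] * 5]
# Hypothetical split B leaves both children almost as mixed as the parent
split_b = [["yes", "yes", "yes", "no", "no"], ["yes", "yes", "no", "no", "no"]]

ig_a = information_gain(parent, split_a)
ig_b = information_gain(parent, split_b)
print(ig_a)  # 1.0
print(ig_b)  # about 0.029
```

Split A would be chosen because it removes all of the uncertainty in the parent node, while split B removes almost none.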

# **Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?**


**ANSWER:**

**Applications of Decision Trees**

**Medical Diagnosis** – predicting diseases from patient symptoms.

**Finance** – credit scoring, loan approval, fraud detection.

**Marketing** – customer segmentation, predicting customer churn.

**Operations** – risk analysis, decision support systems.


**Advantages:**

Easy to understand and interpret (like flowcharts).

Handles both numerical and categorical data.

Requires little data preprocessing (no scaling/normalization needed).

Can capture non-linear relationships.
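The point about scaling can be checked directly. This is a small sketch assuming scikit-learn: the same tree configuration is fitted once on the raw Iris features and once on standardized features, and the training-set predictions are compared. Splits depend only on the ordering of feature values, so standardization should not change them.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Standardize every feature to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)

# Same hyperparameters and seed; only the feature scale differs
tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_scaled, y)

# Compare predictions sample by sample
same = np.array_equal(tree_raw.predict(X), tree_scaled.predict(X_scaled))
print("Identical predictions:", same)
```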

**Limitations:**

Prone to overfitting if not pruned.

Small changes in data can lead to a completely different tree (high variance).

Splits based on Information Gain are biased towards features with many distinct categories.

Less accurate compared to ensemble methods (e.g., Random Forests).

# **Question 6: Write a Python program to: 1. Load the Iris Dataset 2. Train a Decision Tree Classifier using the Gini criterion 3. Print the model’s accuracy and feature importances**

(Iris Dataset for classification tasks: sklearn.datasets.load_iris() or provided CSV.)

```python
# Import libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# 2. Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=52
)

# 3. Train a Decision Tree Classifier using the Gini criterion
clf = DecisionTreeClassifier(criterion="gini", random_state=52)
clf.fit(X_train, y_train)

# 4. Predict on the test set
y_pred = clf.predict(X_test)

# 5. Print accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))

# 6. Print feature importances
print("Feature Importances:")
for feature, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{feature}: {importance:.4f}")
```

Output:

```
Accuracy: 0.9333333333333333
Feature Importances:
sepal length (cm): 0.0191
sepal width (cm): 0.0000
petal length (cm): 0.9374
petal width (cm): 0.0435
```


# **Question 7: Write a Python program to: ● Load the Iris Dataset ● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.**


```python
# Question 7: Decision Tree Classifier - comparing max_depth=3 vs. a full tree

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# 2. Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=69
)

# 3. Train a Decision Tree Classifier with max_depth=3
clf_limited = DecisionTreeClassifier(max_depth=3, random_state=69)
clf_limited.fit(X_train, y_train)
y_pred_limited = clf_limited.predict(X_test)

# 4. Train a fully-grown Decision Tree (no depth limit)
clf_full = DecisionTreeClassifier(random_state=69)
clf_full.fit(X_train, y_train)
y_pred_full = clf_full.predict(X_test)

# 5. Print accuracies
print("Accuracy with max_depth=3:", accuracy_score(y_test, y_pred_limited))
print("Accuracy with full tree:", accuracy_score(y_test, y_pred_full))
```

Output:

```
Accuracy with max_depth=3: 0.9777777777777777
Accuracy with full tree: 0.9555555555555556
```


# **Question 8: Write a Python program to: ● Load the Boston Housing Dataset ● Train a Decision Tree Regressor ● Print the Mean Squared Error (MSE) and feature importances**

(Boston Housing Dataset for regression tasks: sklearn.datasets.load_boston() or provided CSV.)

```python
# Import necessary libraries
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
import pandas as pd

# Load the Boston Housing dataset from OpenML
# (load_boston was removed in scikit-learn 1.2, so it is not used here)
boston = fetch_openml(name="boston", version=1, as_frame=True)
X = boston.data
y = boston.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=69)
regressor.fit(X_train, y_train)

# Predict on the test set
y_pred = regressor.predict(X_test)

# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse:.3f}")

# Print feature importances, largest first
feature_importances = pd.Series(regressor.feature_importances_, index=X.columns)
print("\nFeature Importances:")
print(feature_importances.sort_values(ascending=False))
```

Output:

```
Mean Squared Error (MSE): 9.574

Feature Importances:
RM         0.590075
LSTAT      0.191007
CRIM       0.071858
DIS        0.064809
TAX        0.026275
AGE        0.015267
INDUS      0.011848
B          0.010606
PTRATIO    0.007085
NOX        0.006802
RAD        0.002061
ZN         0.001994
CHAS       0.000312
dtype: float64
```


# **Question 9: Write a Python program to: ● Load the Iris Dataset ● Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV ● Print the best parameters and the resulting model accuracy**

(Iris Dataset for classification tasks: sklearn.datasets.load_iris() or provided CSV.)



```python
# Import libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the Decision Tree Classifier
dtree = DecisionTreeClassifier(random_state=42)

# Set up the hyperparameter grid for tuning
param_grid = {
    'max_depth': [None, 2, 3, 4, 5],
    'min_samples_split': [2, 3, 4, 5]
}

# Initialize GridSearchCV with 5-fold cross-validation
grid_search = GridSearchCV(estimator=dtree, param_grid=param_grid, cv=5, scoring='accuracy')

# Fit GridSearchCV on the training data
grid_search.fit(X_train, y_train)

# Get the best parameters
best_params = grid_search.best_params_
print("Best Parameters:", best_params)

# Predict on the test set using the best model
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.3f}")
```

Output:

```
Best Parameters: {'max_depth': None, 'min_samples_split': 2}
Model Accuracy: 1.000
```


# **Question 10: Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values. Explain the step-by-step process you would follow to: ● Handle the missing values ● Encode the categorical features ● Train a Decision Tree model ● Tune its hyperparameters ● Evaluate its performance. Also describe what business value this model could provide in a real-world setting.**

**ANSWER:**


**1: Handle Missing Values**

First, check how many values are missing in each column with **df.isnull().sum()**.

If any values are missing, impute them by feature type:

**Numerical features:** Use the mean, the median, or KNN imputation.

**Categorical features:** Use the mode or a new category such as “Unknown”.
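A minimal sketch of this step with pandas follows. The DataFrame, its column names, and its values are hypothetical stand-ins for patient records. In a real project the fill values should be computed from the training split only, so that no information leaks from the test data.

```python
import numpy as np
import pandas as pd

# Hypothetical patient records with gaps
df = pd.DataFrame({
    "age": [34, np.nan, 58, 45],
    "glucose": [90.0, 140.0, np.nan, 110.0],
    "smoker": ["no", "yes", np.nan, "no"],
})

# Count missing values per column
print(df.isnull().sum())

num_cols = ["age", "glucose"]

# Numerical features: fill with the column median
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Categorical features: fill with an explicit "Unknown" category
df["smoker"] = df["smoker"].fillna("Unknown")

print(df)
```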


**2: Encode Categorical Features**

Decision Trees in scikit-learn require numeric input. Ordinal categories can be encoded as ordered integers, but one-hot encoding is safer for non-ordinal categorical variables, because integer codes would imply an order that does not exist.

Merge the encoded categorical features with the numerical features.

Alternatively, use ColumnTransformer to handle numerical and categorical preprocessing in one pipeline.


**3: Train a Decision Tree Model**

Split the data into training and test sets, stratifying on the target so both sets keep the same disease rate.

Train a Decision Tree Classifier on the training set only.
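A sketch of the split and the fit is shown here. Because no real patient data is available, `make_classification` generates a synthetic, imbalanced stand-in for the preprocessed features; `class_weight="balanced"` is an optional choice for imbalanced targets rather than a requirement.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for preprocessed patient features (about 20% positive cases)
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=0
)

# Stratify so the train and test sets keep the same class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Fit the tree on the training portion only
clf = DecisionTreeClassifier(class_weight="balanced", random_state=0)
clf.fit(X_train, y_train)

test_accuracy = clf.score(X_test, y_test)
print("Test accuracy:", test_accuracy)
```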


**4: Tune Hyperparameters**

Goal: avoid overfitting and improve performance, typically with a cross-validated search such as GridSearchCV.

Important hyperparameters for a Decision Tree:

**max_depth:** Maximum depth of the tree.

**min_samples_split:** Minimum samples required to split a node.

**min_samples_leaf:** Minimum samples required at a leaf node.

**criterion:** “gini” or “entropy” for split quality.
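The sketch below searches over these four hyperparameters and then checks the best model on held-out data. It assumes scikit-learn and uses synthetic data from `make_classification` in place of real patient records. Scoring on recall is an illustrative choice for a setting where missed cases are costly.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for preprocessed patient features (not real data)
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Candidate values for the hyperparameters listed above
param_grid = {
    "max_depth": [3, 5, None],
    "min_samples_split": [2, 10],
    "min_samples_leaf": [1, 5],
    "criterion": ["gini", "entropy"],
}

# 5-fold cross-validated grid search on the training set
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), param_grid, cv=5, scoring="recall"
)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)

# Evaluate the best model on the held-out test set
best = search.best_estimator_
print(classification_report(y_test, best.predict(X_test)))
auc = roc_auc_score(y_test, best.predict_proba(X_test)[:, 1])
print("ROC AUC:", auc)
```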


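**5: Evaluate Performance**

Evaluate the tuned model on the held-out test set with a confusion matrix, precision, recall, F1-score, and ROC AUC rather than accuracy alone. For disease prediction, recall is usually the priority, because a missed case (false negative) tends to cost more than a false alarm.

**6: Business Value in Real-World Healthcare**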

**Early Disease Detection:** Predicting disease risk early allows timely intervention.

**Resource Optimization:** Helps hospitals prioritize high-risk patients for tests or treatments.

**Personalized Care:** Tailor patient care based on predicted risk.

**Cost Reduction:** Reduce unnecessary tests for low-risk patients while focusing on high-risk ones.

**Data-Driven Decisions:** Provides actionable insights for clinicians and management.

**Example:** If the model predicts high risk of diabetes in a patient, the hospital can schedule further tests and preventive care earlier, improving patient outcomes and reducing treatment costs later.