1. What is a Decision Tree, and how does it work in the context of
classification?

1. Definition (4 marks)
A Decision Tree is a supervised machine learning algorithm used for classification and regression tasks.
It represents decisions in the form of a tree-like structure, where:

Root Node → Represents the entire dataset and the first splitting feature.

Internal Nodes → Represent features/attributes and conditions.

Branches → Represent the outcome of a decision.

Leaf Nodes → Represent the final class label or prediction.

In classification, it predicts a categorical output (e.g., spam/not spam, yes/no).

2. How it Works (10 marks)
Step 1: Selecting the Best Attribute
At each node, the algorithm chooses the best feature to split the data.

Selection is based on splitting criteria like:

Gini Impurity (CART algorithm)

Entropy / Information Gain (ID3; C4.5 uses the normalized Gain Ratio)

Chi-square (CHAID, for categorical features)

Step 2: Splitting the Dataset
The chosen feature divides the dataset into subsets where the target classes are more homogeneous.

Step 3: Recursion
The process repeats recursively on each subset, creating new decision nodes and branches.

Step 4: Stopping Criteria
Tree growth stops when:

All samples in a node belong to the same class.

No more features are available.

A pre-set depth/leaf size limit is reached (to prevent overfitting).

Step 5: Classification
A new data point is classified by traversing the tree from root to leaf, following the decisions until a leaf node is reached, which gives the predicted class.

3. Example (4 marks)
Problem: Classify if a student will pass an exam based on Study Hours and Attendance.



In [None]:
Root: Attendance ≥ 75%?
    Yes → Study Hours ≥ 3?
        Yes → PASS
        No  → FAIL
    No  → FAIL


Here, each condition is a node decision, and the final prediction is at the leaf.
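The traversal described in Step 5 can be sketched by writing the example tree as nested conditions. This is only an illustration: the function name `predict_pass` and its argument names are made up here, and a trained tree would learn the thresholds from data rather than have them hard-coded.

```python
def predict_pass(attendance_pct: float, study_hours: float) -> str:
    """Walk the example tree from the root to a leaf and return the leaf label."""
    # Root node: Attendance >= 75%?
    if attendance_pct >= 75:
        # Internal node: Study Hours >= 3?
        if study_hours >= 3:
            return "PASS"   # leaf
        return "FAIL"       # leaf
    # Low attendance goes straight to a FAIL leaf.
    return "FAIL"


print(predict_pass(80, 4))  # PASS
print(predict_pass(80, 2))  # FAIL
print(predict_pass(60, 5))  # FAIL
```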

4. Advantages & Disadvantages (2 marks)
Advantages:

Easy to understand and visualize.

Handles both numerical and categorical data.

Disadvantages:

Prone to overfitting.

Small changes in data can cause large changes in the tree (high variance).

2.  Explain the concepts of Gini Impurity and Entropy as impurity measures.
How do they impact the splits in a Decision Tree?

Gini Impurity & Entropy – Impurity Measures in Decision Trees
1. Introduction (3 marks)
In a Decision Tree, the choice of the “best” feature for splitting is based on how well it separates the data into pure subsets.
Impurity Measures are metrics that quantify how mixed the class labels are in a node.
Two common measures:

Gini Impurity (used in CART)

Entropy (used in ID3, C4.5)

2. Gini Impurity (7 marks)
Definition:
Probability that a randomly chosen sample from a node would be misclassified if it were labeled according to the distribution of classes in that node.

Formula:
$$\text{Gini Impurity} = 1 - \sum_{i=1}^{n} (p_i)^2$$

where:

$p_i$ = proportion of samples belonging to class i in the node.

$n$ = number of classes.

Properties:
Range: 0 (pure) to 0.5 in the binary case (classes evenly mixed); in general the maximum is 1 − 1/n for n classes.

Lower Gini = purer node.

Example:
Node contains 4 samples → 3 “Yes” and 1 “No”:

$$p_{\text{Yes}} = 3/4 = 0.75, \quad p_{\text{No}} = 0.25$$

$$\text{Gini} = 1 - (0.75^2 + 0.25^2) = 1 - (0.5625 + 0.0625) = 0.375$$
Smaller Gini after a split means better separation.
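
As a quick check of the arithmetic above, the Gini formula can be written as a few lines of Python. This is a minimal sketch; the helper name `gini_impurity` is an assumption for illustration and is not part of any library.

```python
from collections import Counter


def gini_impurity(labels):
    """Return 1 - sum(p_i^2) over the class proportions in `labels`."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / total) ** 2 for count in counts.values())


node = ["Yes", "Yes", "Yes", "No"]
print(round(gini_impurity(node), 3))  # 0.375
```

A pure node such as `["Yes", "Yes"]` gives 0, and an evenly mixed binary node gives 0.5.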

3. Entropy (7 marks)
Definition:
Measures disorder or uncertainty in a node.

Based on the concept from Information Theory (Shannon).

Formula:
$$\text{Entropy} = -\sum_{i=1}^{n} p_i \log_2 p_i$$

where $p_i$ is the proportion of class i.

Properties:
Range: 0 (pure) to 1 in the binary case (classes evenly mixed); in general the maximum is log2(n) for n classes.

Higher Entropy = more disorder.

Example:
Using the same 3 “Yes” and 1 “No”:

$$\text{Entropy} = -(0.75 \log_2 0.75 + 0.25 \log_2 0.25)$$

$$= -(0.75 \times (-0.4150) + 0.25 \times (-2))$$

$$= 0.811$$
Lower Entropy after a split = better separation.
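
The same node can be checked in code. This is a minimal sketch using only the standard library; the helper name `entropy` is chosen here for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Return -sum(p_i * log2(p_i)) over the class proportions in `labels`."""
    total = len(labels)
    if total == 0:
        return 0.0
    proportions = [count / total for count in Counter(labels).values()]
    return sum(-p * math.log2(p) for p in proportions)


node = ["Yes", "Yes", "Yes", "No"]
print(round(entropy(node), 3))  # 0.811
```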

4. Impact on Splits (3 marks)
Goal: Select the feature and threshold whose split gives the purest child nodes, i.e. the lowest size-weighted Gini or Entropy across the children.

Information Gain (IG):

$$IG = \text{Entropy(Parent)} - \text{Weighted Entropy(Children)}$$
Higher IG = better split.
For Gini, we choose the split that yields the lowest weighted Gini.
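
To make the "lowest weighted Gini" rule concrete, the sketch below scores two candidate splits of the same parent node and picks the better one. The two splits and the helper names are hypothetical, chosen only to show the comparison.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of one node (assumes the node is not empty)."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())


def weighted_gini(children):
    """Size-weighted average Gini of the child nodes produced by a split."""
    n = sum(len(child) for child in children)
    return sum(len(child) / n * gini(child) for child in children)


# Two hypothetical ways to split the same 6-sample parent node.
split_a = [["Yes", "Yes", "Yes"], ["No", "No", "No"]]  # separates the classes
split_b = [["Yes", "Yes", "No"], ["Yes", "No", "No"]]  # leaves both children mixed

scores = {"A": weighted_gini(split_a), "B": weighted_gini(split_b)}
best = min(scores, key=scores.get)
print(scores, "-> choose split", best)
```

Split A scores 0 (both children pure) and split B scores 4/9, so A is chosen.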

Key Point: Both measures usually give similar results, but:

Gini is faster to compute (no logarithms).

Entropy is grounded directly in information theory and is measured in bits.

✅ In short:

Gini Impurity and Entropy quantify node impurity.

Decision Trees split on features that reduce impurity the most.

Lower impurity = more homogeneous nodes, which usually means better classification on the training data; pruning keeps this from turning into overfitting.

3. What is the difference between Pre-Pruning and Post-Pruning in Decision
Trees? Give one practical advantage of using each.

1. Introduction (3 marks)
Decision Trees tend to overfit if grown without restrictions — capturing noise instead of patterns.
Pruning controls overfitting by reducing the size of the tree.

Two main approaches:

Pre-Pruning (Early Stopping)

Post-Pruning (Prune After Full Growth)

2. Pre-Pruning (Early Stopping) (6 marks)
Definition:
Stops tree growth before it becomes too complex by applying constraints during construction.

Common stopping criteria:

Maximum tree depth (max_depth)

Minimum samples required to split a node (min_samples_split)

Minimum impurity decrease required

Maximum number of leaf nodes

Example:
Stop splitting if a node has fewer than 5 samples.

Practical Advantage:

Time & computation efficient — prevents building unnecessarily large trees.
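
In scikit-learn, the stopping criteria listed above map onto constructor parameters of `DecisionTreeClassifier`. The sketch below is one possible configuration; the specific values (3, 5, 0.01, 8) are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Tree grown without restrictions, for comparison.
unrestricted = DecisionTreeClassifier(random_state=0).fit(X, y)

# Pre-pruned tree: growth stops as soon as any constraint is hit.
pre_pruned = DecisionTreeClassifier(
    max_depth=3,                 # maximum tree depth
    min_samples_split=5,         # do not split nodes with fewer than 5 samples
    min_impurity_decrease=0.01,  # require at least this impurity reduction
    max_leaf_nodes=8,            # cap on the number of leaves
    random_state=0,
).fit(X, y)

print("Unrestricted:", unrestricted.get_depth(), "levels,", unrestricted.get_n_leaves(), "leaves")
print("Pre-pruned:  ", pre_pruned.get_depth(), "levels,", pre_pruned.get_n_leaves(), "leaves")
```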

3. Post-Pruning (Reduced Error / Cost Complexity) (6 marks)
Definition:
First grow the full tree (allowing overfitting), then remove branches that provide little improvement in predictive performance.

Techniques:

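Reduced Error Pruning: Replace a subtree with a leaf if accuracy on a held-out validation set does not drop.

Cost Complexity Pruning (CCP): Balance tree size against error using a complexity parameter ($\alpha$, exposed as ccp_alpha in scikit-learn); a larger $\alpha$ prunes more aggressively.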

Example:
After full tree growth, remove a branch that only classifies 1–2 samples but increases variance.

Practical Advantage:

Often better generalization: pruning decisions are made after the full tree is visible, so a branch is removed only when it does not help on validation data.
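
Cost complexity pruning can be sketched with scikit-learn's `cost_complexity_pruning_path` and the `ccp_alpha` parameter. This is one possible workflow under stated assumptions: the 70/30 split and the choice to select alpha on a single validation set are illustrative, and cross-validation would be a more robust way to choose it.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Grow the full tree first, then ask for the candidate alpha values.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Refit one tree per alpha and record its validation accuracy and size.
results = []
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative rounding errors
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    results.append((tree.score(X_val, y_val), alpha, tree.get_n_leaves()))

# Prefer the highest validation accuracy; break ties with the larger alpha (smaller tree).
best_score, best_alpha, best_leaves = max(results, key=lambda r: (r[0], r[1]))
print(f"best alpha={best_alpha:.4f}, leaves={best_leaves}, validation accuracy={best_score:.3f}")
```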

4. Key Differences (5 marks)
Aspect	Pre-Pruning	Post-Pruning
Timing	Stops growth during building	Prunes after full tree is built
Computation	Faster, less memory usage	Slower, requires full tree first
Overfitting control	May underfit if stopped too early	More accurate control after seeing tree
Flexibility	Limited — stops based on thresholds	Flexible — can evaluate whole structure

✅ In short:

Pre-Pruning: Prevents over-complexity before it happens (fast but risk of underfitting).

Post-Pruning: Lets tree overfit, then trims excess (slower but often more accurate).

4. What is Information Gain in Decision Trees, and why is it important for
choosing the best split?

Information Gain in Decision Trees
1. Introduction (3 marks)
In a Decision Tree, the goal is to choose the feature that best separates the data into pure subsets.
Information Gain (IG) is a measure used (especially in ID3 and C4.5 algorithms) to determine how much “information” a feature gives us about the target variable.

2. Definition (5 marks)
Information Gain = Reduction in uncertainty about the target variable after splitting on a particular feature.

It is based on Entropy from information theory.

Formula:

$$IG(S, A) = \text{Entropy}(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} \times \text{Entropy}(S_v)$$

Where:

S = dataset at the current node
A = feature to split on
$S_v$ = subset of S where A = v
$|S_v| / |S|$ = proportion of samples in subset $S_v$

3. Step-by-Step Working (6 marks)
Step 1: Calculate Entropy of the parent node (before split).
Step 2: For each possible value of a feature, split the dataset and calculate the Entropy of each subset.
Step 3: Compute the Weighted Average Entropy after the split.
Step 4: Subtract this weighted entropy from the parent node’s entropy → this is the Information Gain.
Step 5: Repeat for all features, choose the one with highest IG.

4. Example (3 marks)
Suppose we want to classify “Play Tennis” based on “Weather”:

Parent Entropy (before split): 0.94
Weighted Entropy after splitting on Weather: 0.69

IG = 0.94 − 0.69 = 0.25
→ This means splitting on “Weather” reduces uncertainty about the target by 0.25 bits.
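
The five steps above can be sketched in plain Python. The class counts used here are an assumption: they are taken from the classic 14-day Play Tennis table (9 "Yes", 5 "No", split by outlook), which happens to reproduce the 0.94 and 0.25 figures quoted in this answer. The helper names are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy in bits of the class labels in one node."""
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(parent, subsets):
    """Parent entropy minus the size-weighted entropy of the subsets."""
    n = len(parent)
    weighted = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent) - weighted


# Assumed counts from the classic Play Tennis table, grouped by weather/outlook.
sunny = ["Yes"] * 2 + ["No"] * 3
overcast = ["Yes"] * 4
rain = ["Yes"] * 3 + ["No"] * 2
parent = sunny + overcast + rain

print(round(entropy(parent), 2))                                    # 0.94
print(round(information_gain(parent, [sunny, overcast, rain]), 2))  # 0.25
```

Running the same function for every candidate feature and taking the maximum corresponds to Step 5.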

5. Importance for Choosing the Best Split (3 marks)
High IG means the feature gives a better separation of classes → leads to purer child nodes.

Ensures that the most informative features are chosen at higher levels of the tree.

Helps improve classification accuracy and reduce tree depth.

✅ In short:
Information Gain measures how much knowing a feature helps in predicting the target.
The Decision Tree selects the feature with the highest IG at each node to maximize purity and efficiency.

5. What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?

Real-World Applications of Decision Trees
1. Introduction (2 marks)
A Decision Tree is a supervised machine learning algorithm used for classification and regression tasks.
Its interpretability makes it popular across industries.

2. Common Real-World Applications (9 marks)
Medical Diagnosis

Predicting diseases based on patient symptoms, test results, and medical history.

Example: Classifying whether a tumor is malignant or benign.

Customer Churn Prediction

Telecom or subscription services use trees to predict customers likely to leave.

Credit Scoring & Risk Assessment

Banks use trees to evaluate loan applications based on income, credit history, etc.

Fraud Detection

Identify fraudulent transactions in banking or e-commerce.

Retail & Marketing

Segment customers for targeted advertising campaigns.

Manufacturing Quality Control

Detect defective products based on production parameters.

Weather Prediction

Predicting rain, storms, or suitable conditions for farming.

Education

Predicting student performance or identifying at-risk students.

HR Analytics

Screening job candidates based on qualifications and past performance.

3. Main Advantages (5 marks)
Easy to Understand & Interpret – No complex mathematics; resembles human decision-making.

Handles Both Numerical & Categorical Data – Versatile in data types.

No Need for Feature Scaling – Unlike algorithms such as SVM or KNN.

Can Capture Non-Linear Relationships – Flexible in splitting criteria.

Works Well for Small to Medium Datasets – Requires less preprocessing.

4. Main Limitations (4 marks)
Overfitting – Tends to fit noise in the training data if not pruned.

Instability – Small changes in data can cause large changes in the tree.

Biased Towards Features with More Levels – Categorical features with many categories may dominate.

Not Always the Most Accurate – Often outperformed by ensemble methods like Random Forests.

✅ In short:
Decision Trees are widely used in healthcare, finance, marketing, and more due to their simplicity and interpretability, but they must be pruned or combined with ensembles to avoid overfitting and improve stability.

In [None]:
Datasets:
● Iris Dataset for classification tasks (sklearn.datasets.load_iris() or provided CSV).
● Boston Housing Dataset for regression tasks (sklearn.datasets.load_boston() or provided CSV).
  Note: load_boston() was removed in scikit-learn 1.2; Question 8 uses the California Housing dataset.
6. Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier using the Gini criterion
● Print the model’s accuracy and feature importances


Python Program – Decision Tree Classifier on Iris Dataset (Gini Criterion)


# Question 6: Decision Tree Classifier on Iris Dataset using Gini Criterion

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import pandas as pd

# 1. Load the Iris dataset
iris = load_iris()
X = iris.data        # Features
y = iris.target      # Target labels

# 2. Split into train and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Initialize Decision Tree Classifier with Gini criterion
clf = DecisionTreeClassifier(criterion='gini', random_state=42)

# 4. Train the model
clf.fit(X_train, y_train)

# 5. Predict on test data
y_pred = clf.predict(X_test)

# 6. Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# 7. Get feature importances
feature_importances = pd.Series(
    clf.feature_importances_,
    index=iris.feature_names
)

# 8. Print results
print("Decision Tree Classifier (Gini Criterion)")
print("========================================")
print(f"Accuracy on test data: {accuracy:.2f}")
print("\nFeature Importances:")
print(feature_importances.sort_values(ascending=False))


In [None]:
Sample output (illustrative; exact values vary by scikit-learn version):

Decision Tree Classifier (Gini Criterion)
========================================
Accuracy on test data: 1.00

Feature Importances:
petal length (cm)    0.57
petal width (cm)     0.43
sepal width (cm)     0.00
sepal length (cm)    0.00
dtype: float64


In [None]:
7. Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
a fully-grown tree.

# Question: Compare Decision Tree with max_depth=3 vs Fully-Grown Tree

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# 2. Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Train Decision Tree with max_depth=3
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
shallow_tree.fit(X_train, y_train)

# 4. Train Fully-Grown Decision Tree (no max_depth limit)
full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(X_train, y_train)

# 5. Predictions
y_pred_shallow = shallow_tree.predict(X_test)
y_pred_full = full_tree.predict(X_test)

# 6. Accuracy scores
acc_shallow = accuracy_score(y_test, y_pred_shallow)
acc_full = accuracy_score(y_test, y_pred_full)

# 7. Print results
print("Decision Tree Accuracy Comparison")
print("=================================")
print(f"Shallow Tree (max_depth=3) Accuracy: {acc_shallow:.2f}")
print(f"Fully-Grown Tree Accuracy:          {acc_full:.2f}")


In [None]:
Sample output (illustrative; exact values vary by scikit-learn version):

Decision Tree Accuracy Comparison
=================================
Shallow Tree (max_depth=3) Accuracy: 0.97
Fully-Grown Tree Accuracy:          1.00


In [None]:
8. Write a Python program to:
● Load the California Housing dataset from sklearn
● Train a Decision Tree Regressor
● Print the Mean Squared Error (MSE) and feature importances

# Question 8: Decision Tree Regressor on California Housing Dataset

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
import pandas as pd

# 1. Load the California Housing dataset
housing = fetch_california_housing()
X = housing.data       # Features
y = housing.target     # Target variable (Median House Value)

# 2. Split data into train (80%) and test (20%)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Initialize Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=42)

# 4. Train the model
regressor.fit(X_train, y_train)

# 5. Predict on test set
y_pred = regressor.predict(X_test)

# 6. Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)

# 7. Get feature importances
feature_importances = pd.Series(
    regressor.feature_importances_,
    index=housing.feature_names
)

# 8. Print results
print("Decision Tree Regressor on California Housing Dataset")
print("=====================================================")
print(f"Mean Squared Error (MSE): {mse:.4f}\n")
print("Feature Importances:")
print(feature_importances.sort_values(ascending=False))




In [None]:
Sample output (illustrative; exact values vary by scikit-learn version):

Decision Tree Regressor on California Housing Dataset
=====================================================
Mean Squared Error (MSE): 0.1615

Feature Importances:
MedInc      0.5331
Latitude    0.1387
Longitude   0.1294
AveOccup    0.0596
HouseAge    0.0541
AveRooms    0.0418
AveBedrms   0.0275
Population  0.0160
dtype: float64


In [None]:
9. Write a Python program to:
● Load the Iris Dataset
● Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV
● Print the best parameters and the resulting model accuracy

# Question 9: Decision Tree Hyperparameter Tuning with GridSearchCV

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# 2. Split data into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Initialize base Decision Tree Classifier
dt_clf = DecisionTreeClassifier(random_state=42)

# 4. Define parameter grid for tuning
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 3, 4, 5, 6]
}

# 5. Initialize GridSearchCV
grid_search = GridSearchCV(
    estimator=dt_clf,
    param_grid=param_grid,
    scoring='accuracy',
    cv=5,                # 5-fold cross-validation
    n_jobs=-1
)

# 6. Fit GridSearchCV on training data
grid_search.fit(X_train, y_train)

# 7. Get the best estimator
best_model = grid_search.best_estimator_

# 8. Predict on test set
y_pred = best_model.predict(X_test)

# 9. Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# 10. Print results
print("Decision Tree Hyperparameter Tuning (Iris Dataset)")
print("==================================================")
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best Cross-Validation Accuracy: {grid_search.best_score_:.4f}")
print(f"Test Set Accuracy: {accuracy:.4f}")


In [None]:
Sample output (illustrative; exact values vary by scikit-learn version):

Decision Tree Hyperparameter Tuning (Iris Dataset)
==================================================
Best Parameters: {'max_depth': 3, 'min_samples_split': 2}
Best Cross-Validation Accuracy: 0.9667
Test Set Accuracy: 1.0000


10. Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:
● Handle the missing values
● Encode the categorical features
● Train a Decision Tree model
● Tune its hyperparameters
● Evaluate its performance
And describe what business value this model could provide in the real-world
setting.

1. Handle Missing Values
Identify missing values per column using df.isnull().sum() or df.info().

Numerical features: Replace missing values using mean or median imputation (SimpleImputer(strategy='median') is preferred for skewed data).

Categorical features: Replace missing values using most frequent category (SimpleImputer(strategy='most_frequent')).

If too many missing values exist in a feature (>40–50%), consider dropping that feature to avoid noise.

For advanced handling, use KNN imputation (KNNImputer) or IterativeImputer (still marked experimental in scikit-learn) to leverage patterns in other features.
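
A minimal sketch of the median and most-frequent imputation described above. The DataFrame is a made-up stand-in for patient records; all column names and values are hypothetical. In a real workflow the imputers would be fitted on the training split only and then applied to the test split.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical patient records with gaps in numeric and categorical columns.
df = pd.DataFrame({
    "age": [34, np.nan, 58, 41],
    "blood_pressure": [120.0, 135.0, np.nan, 128.0],
    "blood_type": pd.Series(["A", "O", np.nan, "O"], dtype=object),
})

print(df.isnull().sum())  # count of missing values per column

num_cols = ["age", "blood_pressure"]
cat_cols = ["blood_type"]

# Median for numeric columns, most frequent category for categorical ones.
df[num_cols] = SimpleImputer(strategy="median").fit_transform(df[num_cols])
df[cat_cols] = SimpleImputer(strategy="most_frequent").fit_transform(df[cat_cols])

print(df)
```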

2. Encode the Categorical Features (3 marks)
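Ordinal encoding for ordinal features (where order matters, e.g., disease stage: mild < moderate < severe). In scikit-learn, use OrdinalEncoder with an explicit category order; LabelEncoder is intended for the target column.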

One-Hot Encoding for nominal features (no order, e.g., blood type: A, B, AB, O).

Use OneHotEncoder(handle_unknown='ignore') to avoid errors with unseen categories during prediction.

Since Decision Trees are not affected by feature scaling, no standardization is needed.

3. Train a Decision Tree Model (4 marks)
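Data Split: Use train_test_split() (e.g., 80% training, 20% testing). Fit imputers and encoders on the training portion only, then apply them to the test portion, so no information leaks from the test set.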

Model Choice: Use DecisionTreeClassifier(criterion='gini', random_state=42) or criterion='entropy' depending on preference.

Fit the Model: Train on the preprocessed training set.

Check Feature Importances: Identify which medical features are most predictive (e.g., blood pressure, lab test results).

4. Tune Hyperparameters (4 marks)
Use GridSearchCV or RandomizedSearchCV to optimize:

max_depth (controls tree depth, prevents overfitting)

min_samples_split (minimum samples to split a node)

min_samples_leaf (minimum samples in a leaf node)

criterion (Gini vs Entropy)




In [None]:
Example:

param_grid = {
    'max_depth': [3, 5, 7, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'criterion': ['gini', 'entropy']
}

Evaluate parameters using cross-validation to ensure robust performance.

5. Evaluate Performance (3 marks)
Metrics:

Accuracy → overall correctness

Precision & Recall → important for imbalanced healthcare data
(Recall is critical to minimize false negatives in disease detection)

F1-score → balances precision and recall

ROC-AUC → measures separability of classes

Use confusion matrix to understand type of errors (false positives vs false negatives).

If data is imbalanced, use stratified splits and possibly class weights.
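
The metrics above can be computed as in the sketch below. Because no real patient data is available here, it uses `make_classification` to generate a synthetic, imbalanced stand-in (about 10% positives); the dataset, the `max_depth=5` setting, and the use of `class_weight="balanced"` are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient data: roughly 10% positive (disease) cases.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=42
)

# Stratified split keeps the class ratio the same in train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# class_weight="balanced" makes errors on the rare class count more.
clf = DecisionTreeClassifier(max_depth=5, class_weight="balanced", random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

cm = confusion_matrix(y_test, y_pred)
auc = roc_auc_score(y_test, y_prob)

print(cm)  # rows: true class, columns: predicted class
print(classification_report(y_test, y_pred, digits=3))
print("ROC-AUC:", round(auc, 3))
```

In the confusion matrix, the bottom-left cell counts false negatives (missed disease cases), which is the error type recall is meant to keep low.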

6. Business Value in Real-World Setting (2 marks)
Early Disease Detection: Helps doctors identify at-risk patients earlier, improving treatment outcomes.

Personalized Treatment: High-importance features guide targeted interventions (e.g., recommending specific tests).

Operational Efficiency: Automates initial screening, reducing workload on medical staff.

Cost Reduction: Avoids unnecessary tests for low-risk patients, optimizing healthcare spending.

Patient Satisfaction: Faster diagnosis leads to quicker care and better patient trust.

✅ In summary:
We preprocess the data (handle missing values & encode categories), train and tune a Decision Tree, evaluate using healthcare-relevant metrics, and apply it in a way that delivers better patient outcomes, efficiency, and cost savings.