# Assignment Code: DA-AG-012
# Decision Tree | Assignment
Instructions: Carefully read each question. Use Google Docs, Microsoft Word, or a similar tool
to create a document where you type out each question along with its answer. Save the
document as a PDF, and then upload it to the LMS. Please do not zip or archive the files before
uploading them. Each question carries 20 marks.
Total Marks: 100

Question 1: What is a Decision Tree, and how does it work in the context of
classification?

Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures.
How do they impact the splits in a Decision Tree?


Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision
Trees? Give one practical advantage of using each.

Question 4: What is Information Gain in Decision Trees, and why is it important for
choosing the best split?

Question 5: What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?

Dataset Info:
● Iris Dataset for classification tasks (sklearn.datasets.load_iris() or
provided CSV).

● Boston Housing Dataset for regression tasks
(sklearn.datasets.load_boston() or provided CSV).

Question 6: Write a Python program to:

● Load the Iris Dataset

● Train a Decision Tree Classifier using the Gini criterion

● Print the model’s accuracy and feature importances
(Include your Python code and output in the code box below.)

Question 7: Write a Python program to:

● Load the Iris Dataset

● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
a fully-grown tree.
(Include your Python code and output in the code box below.)


Question 8: Write a Python program to:

● Load the Boston Housing Dataset

● Train a Decision Tree Regressor

● Print the Mean Squared Error (MSE) and feature importances
(Include your Python code and output in the code box below.)

Question 9: Write a Python program to:

● Load the Iris Dataset

● Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV

● Print the best parameters and the resulting model accuracy
(Include your Python code and output in the code box below.)


Question 10: Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:

● Handle the missing values

● Encode the categorical features

● Train a Decision Tree model

● Tune its hyperparameters

● Evaluate its performance

And describe what business value this model could provide in the real-world
setting.

## Question 1
### What is a Decision Tree, and how does it work in the context of classification?


**Answer:**

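A Decision Tree is a supervised machine learning algorithm used for both classification and regression.
In classification, it works by recursively splitting the training data on feature values, choosing at each node the feature and threshold that most reduce impurity (measured by Gini Impurity or Entropy).

Each internal node represents a test on a feature, each branch represents an outcome of that test, and each leaf node represents a class label.

Splitting continues until the subsets are pure or a stopping condition such as a maximum depth is reached. To classify a new sample, the tree is traversed from the root to a leaf by following the branch that matches each test, and the majority class of that leaf is returned.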
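
The node, branch, and leaf structure can be made visible by printing a small fitted tree as text. The sketch below is illustrative only: it assumes scikit-learn is installed, and the choice of `max_depth=2` and `random_state=0` is an assumption made to keep the printed rules short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rules short enough to read.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Lines with a feature comparison are internal-node tests;
# lines starting with "class:" are leaves.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Reading the output from top to bottom follows the same root-to-leaf path the model takes when it classifies a sample.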


## Question 2
### Explain the concepts of Gini Impurity and Entropy as impurity measures.


**Answer:**

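**Gini Impurity** is the probability that a randomly chosen sample from a node would be misclassified if it were labeled at random according to the node's class distribution.

Gini = 1 − Σ(pᵢ²)

where pᵢ is the proportion of samples in the node that belong to class i. Gini is 0 for a pure node and reaches its maximum when the classes are evenly mixed (0.5 for two classes).

**Entropy** measures the uncertainty of the class distribution in a node.

Entropy = −Σ(pᵢ log₂ pᵢ)

Entropy is 0 for a pure node and 1 for an even two-class mix.

At each node, the tree evaluates candidate splits and picks the one with the lowest weighted impurity in the resulting child nodes. Gini and Entropy usually select similar splits; Gini is slightly cheaper to compute because it avoids the logarithm.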
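
As a small worked example, both formulas can be evaluated directly on a few class distributions. The helper names `gini` and `entropy` are hypothetical and exist only for this sketch; it uses the standard library alone.

```python
import math


def gini(probs):
    """Gini impurity for a list of class proportions that sum to 1."""
    return 1.0 - sum(p * p for p in probs)


def entropy(probs):
    """Entropy in bits; classes with zero proportion contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


# A pure node, an even two-class mix, and a mostly pure node.
for probs in ([1.0, 0.0], [0.5, 0.5], [0.9, 0.1]):
    print(probs, "gini =", round(gini(probs), 3), "entropy =", round(entropy(probs), 3))
```

Both measures are zero for the pure node and largest for the even mix, which is why a split that lowers them produces purer child nodes.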


## Question 3
### Difference between Pre-Pruning and Post-Pruning in Decision Trees


**Answer:**

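Pre-Pruning stops tree growth early by applying constraints such as max_depth, min_samples_split, or min_samples_leaf while the tree is being built.

Post-Pruning grows the tree fully and then removes branches that contribute little to predictive performance, for example through cost-complexity pruning (the ccp_alpha parameter in scikit-learn).

**Advantages:**
- Pre-Pruning reduces training time and memory use, because branches that would be discarded are never grown.
- Post-Pruning evaluates every branch before removing it, so it can keep a useful split that an early stopping rule would have cut off.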
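
The two approaches can be compared side by side in scikit-learn. This is a minimal sketch, not a tuned setup: the constraint values and the choice of a mid-range `ccp_alpha` are assumptions made for illustration, and in practice the pruning strength would normally be chosen by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Pre-pruning: growth is constrained while the tree is being built.
pre_pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_split=5, random_state=42
).fit(X_train, y_train)

# Post-pruning: grow the full tree, then prune with cost-complexity pruning.
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)

# Illustrative choice: take a mid-range alpha from the pruning path.
# max(0.0, ...) guards against tiny negative values from rounding.
alpha = max(0.0, float(path.ccp_alphas[len(path.ccp_alphas) // 2]))
post_pruned = DecisionTreeClassifier(
    ccp_alpha=alpha, random_state=42
).fit(X_train, y_train)

for name, model in [("full", full), ("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    print(name, "nodes:", model.tree_.node_count, "test accuracy:", model.score(X_test, y_test))
```

Comparing node counts shows how much each strategy shrinks the tree relative to the fully grown one.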


## Question 4
### What is Information Gain in Decision Trees?


**Answer:**

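Information Gain measures the reduction in entropy achieved by splitting the dataset on a feature.

Information Gain = Entropy(parent) − Σ (nₖ / n) × Entropy(childₖ)

where n is the number of samples in the parent node and nₖ is the number of samples in child k, so each child's entropy is weighted by its share of the samples.

It is important because, at each node, the tree picks the split with the highest Information Gain, which is the split that produces the purest child nodes.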
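
A short calculation makes the idea concrete. The sketch below weights each child's entropy by its share of the parent's samples, then compares a split that separates the classes perfectly with one that does not help. The function names and the toy labels are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["yes"] * 5 + ["no"] * 5

# Split A separates the classes completely.
perfect_split = [["yes"] * 5, ["no"] * 5]
# Split B leaves both children as mixed as the parent.
useless_split = [["yes", "yes", "no", "no"], ["yes"] * 3 + ["no"] * 3]

gain_perfect = information_gain(parent, perfect_split)
gain_useless = information_gain(parent, useless_split)
print("perfect split gain:", gain_perfect)
print("useless split gain:", gain_useless)
```

The tree would choose split A, because it removes all of the parent's uncertainty while split B removes none.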


## Question 5
### Applications, Advantages, and Limitations of Decision Trees


**Answer:**

**Applications:**
- Medical diagnosis
- Fraud detection
- Credit scoring
- Customer churn prediction

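**Advantages:**
- Easy to interpret and visualize
- Can handle both numerical and categorical data (scikit-learn requires categorical features to be encoded as numbers first)
- Minimal preprocessing: no feature scaling or normalization is needed

**Limitations:**
- Prone to overfitting when grown without constraints
- Unstable: small changes in the training data can produce a very different tree
- Often less accurate than ensemble models such as Random Forests or Gradient Boosting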


## Question 6
### Decision Tree Classifier using Gini Criterion


In [1]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = DecisionTreeClassifier(criterion="gini")
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Feature Importances:", model.feature_importances_)


Accuracy: 1.0
Feature Importances: [0.         0.01667014 0.90614339 0.07718647]
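
The importance array is easier to read when each value is paired with its feature name. The following sketch repeats the same train/test split and prints the features in descending order of importance. Setting `random_state=42` on the model is an addition made here so that repeated runs give the same tree.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# random_state on the model is an assumption added for reproducibility.
model = DecisionTreeClassifier(criterion="gini", random_state=42)
model.fit(X_train, y_train)

# Pair each importance with its feature name and sort from most to least important.
ranked = sorted(
    zip(iris.feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name:<20} {importance:.3f}")
```

The importances sum to 1, so each value can be read as that feature's share of the total impurity reduction in the tree.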


## Question 7
### Compare max_depth = 3 with a fully-grown Decision Tree


In [2]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(X_train, y_train)
full_accuracy = accuracy_score(y_test, full_tree.predict(X_test))

limited_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
limited_tree.fit(X_train, y_train)
limited_accuracy = accuracy_score(y_test, limited_tree.predict(X_test))

print("Full Tree Accuracy:", full_accuracy)
print("Max Depth = 3 Accuracy:", limited_accuracy)


Full Tree Accuracy: 1.0
Max Depth = 3 Accuracy: 1.0
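
Both trees reach the same accuracy on this split, but the test set holds only 30 samples, so a single split may hide small differences. One hedged way to compare the two settings more robustly is 5-fold cross-validation over the whole dataset. The sketch below is illustrative; the `results` dictionary is a hypothetical name.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

results = {}
for label, depth in (("fully grown", None), ("max_depth=3", 3)):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    # Mean accuracy over 5 folds, each fold used once as the validation set.
    scores = cross_val_score(tree, X, y, cv=5)
    results[label] = scores.mean()
    print(f"{label}: mean CV accuracy = {results[label]:.3f}")
```

If the shallower tree scores about the same, it is usually the better choice because it is simpler and easier to interpret.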


## Question 8
### Decision Tree Regressor on Boston Housing Dataset


In [10]:
from sklearn.datasets import fetch_openml
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load the Boston Housing dataset from OpenML
# (sklearn.datasets.load_boston was removed in scikit-learn 1.2)
boston = fetch_openml(name="boston", version=1, as_frame=True)

X = boston.data
y = boston.target.astype(float)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

# Predictions
y_pred = regressor.predict(X_test)

# Output
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("Feature Importances:", regressor.feature_importances_)


Mean Squared Error: 10.416078431372549
Feature Importances: [5.12956739e-02 3.35270585e-03 5.81619171e-03 2.27940651e-06
 2.71483790e-02 6.00326256e-01 1.36170630e-02 7.06881622e-02
 1.94062297e-03 1.24638653e-02 1.10116089e-02 9.00872742e-03
 1.93328464e-01]


## Question 9
### Hyperparameter Tuning using GridSearchCV


In [6]:
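from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Reload the Iris split: X_train and y_train were overwritten by the
# Boston Housing split in Question 8.
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

param_grid = {
    "max_depth": [2, 3, 4, 5],
    "min_samples_split": [2, 5, 10]
}

dt = DecisionTreeClassifier(random_state=42)

# 5-fold cross-validation over every parameter combination
grid = GridSearchCV(dt, param_grid, cv=5)
grid.fit(X_train, y_train)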

print("Best Parameters:", grid.best_params_)
print("Best Accuracy:", grid.best_score_)


Best Parameters: {'max_depth': 4, 'min_samples_split': 2}
Best Accuracy: 0.9416666666666668
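
The value printed as "Best Accuracy" is `best_score_`, the mean cross-validated accuracy on the training folds. To report how the tuned model performs on unseen data, the refitted best estimator can also be scored on the held-out test set. This is a sketch under the same split and parameter grid as above; the variable name `test_accuracy` is an addition.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

param_grid = {"max_depth": [2, 3, 4, 5], "min_samples_split": [2, 5, 10]}
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)

# GridSearchCV refits the best parameters on all of X_train by default,
# so score() evaluates that refitted model on the held-out data.
test_accuracy = grid.score(X_test, y_test)
print("Best Parameters:", grid.best_params_)
print("Cross-validated accuracy:", grid.best_score_)
print("Test accuracy:", test_accuracy)
```

Reporting both numbers separates the score used for model selection from the estimate of performance on new samples.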


## Question 10
### Healthcare Decision Tree – Step-by-Step Process


**Answer:**

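1. Handle Missing Values:
   - Split the data into training and testing sets first, then fit the imputers on the training set only to avoid data leakage.
   - Numerical features: impute with the mean or median (the median is more robust to outliers).
   - Categorical features: impute with the mode or a separate "Missing" category.

2. Encode Categorical Features:
   - One-Hot Encoding for nominal features
   - Label (ordinal) Encoding for features with a natural order

3. Train Decision Tree:
   - Choose Gini or Entropy as the split criterion
   - Fit the model on the training set

4. Hyperparameter Tuning:
   - max_depth
   - min_samples_split
   - Search the combinations with GridSearchCV using cross-validation

5. Model Evaluation:
   - Accuracy
   - Precision and Recall (recall matters most when a missed diagnosis is costly)
   - Confusion Matrix
   - ROC-AUC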
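
The steps above can be sketched as a single scikit-learn pipeline. Everything about the data here is an assumption: the patient records are synthetic, the column names (`age`, `blood_pressure`, `smoker`) are hypothetical, and the parameter grid and the use of recall for tuning are illustrative choices rather than recommendations.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient records; all column names are hypothetical.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": pd.Series(rng.choice(["yes", "no"], n), dtype=object),
})
# Toy label rule, defined before any values are removed.
y = ((df["age"] > 60) & (df["smoker"] == "yes")).astype(int)

# Inject roughly 10% missing values into one numeric and one categorical column.
df.loc[rng.random(n) < 0.1, "blood_pressure"] = np.nan
df.loc[rng.random(n) < 0.1, "smoker"] = np.nan

# Steps 1 and 2: impute and encode inside the pipeline, so the imputers and
# the encoder are fitted on training folds only.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age", "blood_pressure"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), ["smoker"]),
])

# Step 3: the Decision Tree is the final stage of the pipeline.
model = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=42
)

# Step 4: tune the tree's hyperparameters, optimizing recall.
grid = GridSearchCV(
    model,
    {"tree__max_depth": [2, 3, 5], "tree__min_samples_split": [2, 10]},
    cv=5,
    scoring="recall",
)
grid.fit(X_train, y_train)

# Step 5: evaluate on the held-out test set.
auc = roc_auc_score(y_test, grid.predict_proba(X_test)[:, 1])
print("Best parameters:", grid.best_params_)
print(classification_report(y_test, grid.predict(X_test)))
print("ROC-AUC:", auc)
```

Keeping the preprocessing inside the pipeline means the same imputation and encoding are applied automatically when the model scores new patients.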

**Business Value:**
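The model can flag high-risk patients early so that they are prioritized for further testing, which can reduce treatment costs and improve patient outcomes. Because a Decision Tree's rules are readable, clinicians can see why a patient was flagged, which supports data-driven clinical decisions.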
