---
# Assignment Code: DA-AG-012
---

# Question 1: What is a Decision Tree, and how does it work in the context of classification?
- Answer:

A Decision Tree is a machine learning method that is often used for classification problems. It works like a flowchart where we keep asking questions about the data until we reach a final decision.

At the top (root node), the tree starts with the feature that best separates the classes, as judged by an impurity measure such as Gini or Entropy.

From there, the data is split into smaller groups based on answers (like yes/no or true/false).

This splitting continues at different nodes until we reach the leaf nodes, which give us the final class or prediction.

For example, if we want to decide whether someone will play cricket, the tree may first check the weather. If it’s overcast, the answer is “Yes”. If it’s sunny, it may then check humidity, and so on, until it reaches a decision.

In short, a decision tree makes predictions by asking a series of simple questions about the features of the data and following the path of answers down to the final outcome.
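The cricket example above can be written as a few nested questions. This is only a hand-written sketch: the function name `will_play_cricket`, the humidity values, and every outcome other than "overcast means Yes" are assumptions made for illustration, not something learned from data.

```python
def will_play_cricket(outlook: str, humidity: str) -> str:
    """Follow the tree from the root (outlook) down to a leaf (Yes/No)."""
    if outlook == "overcast":
        return "Yes"  # leaf node, as in the example above
    if outlook == "sunny":
        # Internal node: ask a second question before deciding.
        # The outcomes for high/normal humidity are assumed for illustration.
        return "No" if humidity == "high" else "Yes"
    # Any other weather (e.g. rainy): assumed outcome for illustration
    return "No"


print(will_play_cricket("overcast", "high"))   # Yes
print(will_play_cricket("sunny", "high"))      # No
print(will_play_cricket("sunny", "normal"))    # Yes
```

A trained Decision Tree has the same shape, except that the questions and their order are chosen automatically from the training data.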


---


# Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?
- Answer:

When a Decision Tree is built, it has to decide which feature to split on at each step. To do this, it uses impurity measures like Gini Impurity and Entropy. These tell us how “mixed up” the classes are in a node.

Gini Impurity:

It measures how often a randomly chosen element would be wrongly classified if we labeled it according to the distribution of classes in that node.

A pure node (where all samples belong to one class) has Gini = 0.

The higher the Gini value, the more mixed the classes are (for two classes, the maximum is 0.5, reached when they are evenly split).

Entropy:

Entropy comes from information theory and measures the uncertainty or disorder in the data.

If all data belongs to one class, entropy = 0 (no disorder).

If the data is evenly split between classes, entropy is at its maximum (1 bit for two classes).

Impact on Splits:

Both Gini and Entropy help the tree choose the “best” feature for splitting.

The algorithm looks for the split that reduces impurity the most (called Information Gain in case of Entropy).

In simple words, they guide the tree to create child nodes that are purer (more homogeneous) than the parent node, improving classification accuracy.

In short:

Gini Impurity focuses on misclassification probability.

Entropy focuses on information gain (disorder reduction).

Both help the decision tree decide the best way to split the data.
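To make the two measures concrete, here is a minimal sketch that computes them from a list of class labels, using the standard definitions: Gini = 1 - sum of squared class proportions, and Entropy = sum of p * log2(1/p). The helper names (`class_proportions`, `gini`, `entropy`) and the toy labels are assumptions for illustration.

```python
from collections import Counter
from math import log2


def class_proportions(labels):
    """Return the fraction of samples belonging to each class in a node."""
    counts = Counter(labels)
    total = len(labels)
    return [count / total for count in counts.values()]


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Entropy in bits: sum of p * log2(1 / p) over the classes present."""
    return sum(p * log2(1 / p) for p in class_proportions(labels))


pure_node = ["yes", "yes", "yes", "yes"]
mixed_node = ["yes", "yes", "no", "no"]

print(gini(pure_node))      # 0.0
print(gini(mixed_node))     # 0.5
print(entropy(pure_node))   # 0.0
print(entropy(mixed_node))  # 1.0
```

Both measures are zero for a pure node and largest for an even mix, which is why they usually lead to similar splits.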

---

# Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.
- Answer:

Pruning in Decision Trees means reducing the size of the tree so that it doesn’t become too complex and overfit the data. There are two types: Pre-Pruning and Post-Pruning.

Pre-Pruning (Early Stopping):

In this method, we stop the tree from growing too deep while it is being built.

For example, we can set rules like “stop splitting if a node has fewer than 5 samples” or “stop if the tree reaches a maximum depth.”

Advantage: Saves time and resources because the tree does not grow unnecessarily large.

Post-Pruning (Pruning after Full Growth):

Here, we first let the tree grow fully, even if it becomes very complex.

After that, we cut back some branches that do not improve accuracy or that cause overfitting.

Advantage: Often generalizes better, because the tree is grown in full before trimming, so a useful split deeper in the tree is not cut off by an early stopping rule.

In short:

Pre-Pruning = Stop early to prevent complexity.

Post-Pruning = Grow fully, then cut back for better generalization.
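As a rough sketch of how the two approaches look in scikit-learn: pre-pruning corresponds to limits such as `max_depth` and `min_samples_split`, while post-pruning is available as cost-complexity pruning through `ccp_alpha`. The dataset (Iris), the specific limits, and the choice of a mid-range alpha are assumptions for illustration; in practice alpha would be chosen by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Pre-pruning: growth is limited while the tree is being built
pre_pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_split=5, random_state=42
).fit(X_train, y_train)

# Post-pruning: grow the full tree first ...
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# ... then compute the candidate pruning strengths (alphas) for that tree
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Pick a mid-range alpha purely for illustration (clipped at 0 for safety)
alpha = max(float(path.ccp_alphas[len(path.ccp_alphas) // 2]), 0.0)
post_pruned = DecisionTreeClassifier(
    ccp_alpha=alpha, random_state=42
).fit(X_train, y_train)

for name, tree in [
    ("full", full_tree),
    ("pre-pruned", pre_pruned),
    ("post-pruned", post_pruned),
]:
    print(name, "| leaves:", tree.get_n_leaves(),
          "| test accuracy:", tree.score(X_test, y_test))
```

Larger values of `ccp_alpha` remove more branches, so the post-pruned tree never has more leaves than the fully grown one.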

---


# Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?
- Answer:

Information Gain is a measure used in Decision Trees to decide which feature should be chosen for splitting the data. It tells us how much “information” a feature gives us about the class labels.

It is calculated using Entropy.

At each split, the algorithm checks how much the entropy (disorder) is reduced after dividing the dataset based on that feature.

The higher the Information Gain, the better that feature is at separating the data into pure groups.

Why it is important:

A Decision Tree’s goal is to make nodes that are as pure as possible (where most samples belong to the same class).

Information Gain helps the algorithm choose the feature that makes the child nodes purer compared to the parent node.

This helps the tree become more accurate and efficient in classification, since each split is the best one available at that step.

In short: Information Gain = reduction in disorder after a split. It is important because it helps the tree pick the best question (feature) at each step.
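A small worked example may help here. Information Gain is the parent's entropy minus the weighted average entropy of the child nodes. The function names and the toy labels below are assumptions for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy in bits of the class labels in one node."""
    total = len(labels)
    return sum(
        (count / total) * log2(total / count)
        for count in Counter(labels).values()
    )


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = len(parent)
    weighted_child_entropy = sum(
        len(child) / total * entropy(child) for child in children
    )
    return entropy(parent) - weighted_child_entropy


parent = ["yes", "yes", "yes", "no", "no", "no"]

# A split that separates the classes completely
perfect_split = [["yes", "yes", "yes"], ["no", "no", "no"]]

# A split whose children are just as mixed as the parent
useless_split = [["yes", "no"], ["yes", "yes", "no", "no"]]

print(information_gain(parent, perfect_split))  # 1.0
print(information_gain(parent, useless_split))  # approximately 0
```

The tree would prefer the first split, because it removes all of the disorder in the parent node, while the second removes none.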

---


# Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?
- Answer:

Real-world applications of Decision Trees:

Healthcare: Used to diagnose diseases based on patient symptoms and test results.

Finance: Helps banks decide whether to approve a loan by analyzing income, credit history, and other factors.

Marketing: Predicts customer behavior, such as whether they will respond to a campaign or buy a product.

Fraud Detection: Identifies suspicious transactions by checking patterns in financial data.

Education: Predicts student performance or dropout risk based on attendance, grades, and activities.

Main Advantages:

Easy to understand and interpret (works like asking questions).

Handles both numerical and categorical data.

Requires little data preprocessing (no need for normalization or scaling).

Main Limitations:

Can overfit if the tree is too deep.

Sensitive to small changes in data (a small change may create a different tree).

A single tree is often less accurate than ensemble models built from many trees (like Random Forests), especially when there are many features.

In short: Decision Trees are widely used in fields like healthcare, finance, and marketing because they are simple and explainable, but they can become unstable and overfit if not controlled properly.
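The interpretability advantage can be seen directly: scikit-learn's `export_text` prints a fitted tree as readable if/else rules. This is a minimal sketch on the Iris dataset; the depth limit of 2 is an assumption chosen only to keep the printed rules short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rule set small and easy to read
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(iris.data, iris.target)

# Turn the fitted tree into plain-text rules using the real feature names
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line is one question about a feature, and each `class:` line is a leaf, so the output can be read by someone without a machine learning background.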

---


# Question 6: Write a Python program to:
# ● Load the Iris Dataset
# ● Train a Decision Tree Classifier using the Gini criterion
# ● Print the model’s accuracy and feature importances
- Answer:


```python
# Question 6: Decision Tree on Iris Dataset using Gini Criterion

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data      # Features
y = iris.target    # Target labels

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train Decision Tree Classifier using Gini criterion
clf = DecisionTreeClassifier(criterion="gini", random_state=42)
clf.fit(X_train, y_train)

# Predictions
y_pred = clf.predict(X_test)

# Print accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

# Print feature importances
print("Feature Importances:")
for feature, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{feature}: {importance:.4f}")
```

```text
Model Accuracy: 1.0
Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0191
petal length (cm): 0.8933
petal width (cm): 0.0876
```


---

# Question 7: Write a Python program to:
# ● Load the Iris Dataset
# ● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
# a fully-grown tree.
- Answer:


```python
# Question 7: Decision Tree Classifier with max_depth vs fully-grown tree

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fully grown Decision Tree
full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(X_train, y_train)
y_pred_full = full_tree.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)

# Decision Tree with max_depth=3
pruned_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
pruned_tree.fit(X_train, y_train)
y_pred_pruned = pruned_tree.predict(X_test)
accuracy_pruned = accuracy_score(y_test, y_pred_pruned)

# Print results
print("Accuracy of fully-grown tree:", accuracy_full)
print("Accuracy of tree with max_depth=3:", accuracy_pruned)
```

```text
Accuracy of fully-grown tree: 1.0
Accuracy of tree with max_depth=3: 1.0
```


---

# Question 8: Write a Python program to:
# ● Load the Boston Housing Dataset
# ● Train a Decision Tree Regressor
# ● Print the Mean Squared Error (MSE) and feature importances
- Answer:

```python
# Question 8: Decision Tree Regressor using California Housing Dataset
# Note: the question asks for the Boston Housing dataset, but load_boston
# was removed from scikit-learn in version 1.2, so the California Housing
# dataset is used here instead.

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load California Housing dataset
housing = fetch_california_housing()
X = housing.data
y = housing.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

# Predictions
y_pred = regressor.predict(X_test)

# Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error (MSE):", mse)

# Feature Importances
print("\nFeature Importances:")
for feature, importance in zip(housing.feature_names, regressor.feature_importances_):
    print(f"{feature}: {importance:.4f}")
```

```text
Mean Squared Error (MSE): 0.5280096503174904

Feature Importances:
MedInc: 0.5235
HouseAge: 0.0521
AveRooms: 0.0494
AveBedrms: 0.0250
Population: 0.0322
AveOccup: 0.1390
Latitude: 0.0900
Longitude: 0.0888
```


---

# Question 9: Write a Python program to:
# ● Load the Iris Dataset
# ● Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV
# ● Print the best parameters and the resulting model accuracy
- Answer:

```python
# Question 9: Hyperparameter tuning of Decision Tree using GridSearchCV

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Define parameter grid
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 3, 4, 5, 10]
}

# Decision Tree Classifier
dt = DecisionTreeClassifier(random_state=42)

# GridSearchCV
grid_search = GridSearchCV(
    estimator=dt,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy'
)

# Fit the model
grid_search.fit(X_train, y_train)

# Best parameters
print("Best Parameters:", grid_search.best_params_)

# Best estimator
best_model = grid_search.best_estimator_

# Predictions
y_pred = best_model.predict(X_test)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy with Best Parameters:", accuracy)
```

```text
Best Parameters: {'max_depth': 4, 'min_samples_split': 10}
Model Accuracy with Best Parameters: 1.0
```


---

# Question 10: Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values. Explain the step-by-step process you would follow to:
# ● Handle the missing values
# ● Encode the categorical features
# ● Train a Decision Tree model
# ● Tune its hyperparameters
# ● Evaluate its performance
# And describe what business value this model could provide in the real-world setting.




- Answer:

If I were working as a data scientist on this healthcare problem, I would follow these steps:

1. Handle Missing Values

First, I would check how much data is missing.

For numerical features, I could fill missing values using the mean or median.

For categorical features, I could fill them with the most frequent category (mode) or use a special label like “Unknown.”

If a feature has too many missing values (e.g., more than 40–50%), I might consider dropping it.
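A minimal sketch of this step, assuming a small made-up patient table (the column names and values are hypothetical). It checks the share of missing values per column, fills numeric columns with the median, and fills a categorical column with its most frequent value.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical patient records; column names and values are made up
patients = pd.DataFrame({
    "age": [34, np.nan, 58, 45],
    "blood_pressure": [120.0, 135.0, np.nan, 128.0],
    "smoker": ["yes", np.nan, "no", "no"],
})

# Step 1: check what fraction of each column is missing
print(patients.isna().mean())

# Step 2: fill numeric columns with the column median
num_cols = ["age", "blood_pressure"]
patients[num_cols] = SimpleImputer(strategy="median").fit_transform(
    patients[num_cols]
)

# Step 3: fill the categorical column with its most frequent value (mode)
patients["smoker"] = patients["smoker"].fillna(patients["smoker"].mode()[0])

print(patients)
```

On a real project the imputation values would normally be learned from the training split only and then applied to the test split, so that no information leaks from the test data.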

2. Encode the Categorical Features

Decision Trees can handle categorical features, but most implementations (like scikit-learn) need them in numeric form.

For this, I would use One-Hot Encoding (creating dummy variables) for categories with no natural order, or Label (ordinal) Encoding for categories that do have an order, such as low/medium/high.
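The two encodings can be sketched with scikit-learn's `OneHotEncoder` and `OrdinalEncoder`. The columns `blood_type` and `pain_level` and their values are hypothetical and only illustrate an unordered and an ordered category.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical categorical columns, made up for illustration
records = pd.DataFrame({
    "blood_type": ["A", "B", "O", "A"],               # no natural order
    "pain_level": ["low", "high", "medium", "low"],   # ordered
})

# One-hot encoding: one 0/1 column per category
one_hot = OneHotEncoder(handle_unknown="ignore")
blood_type_encoded = one_hot.fit_transform(records[["blood_type"]]).toarray()
print(one_hot.categories_)
print(blood_type_encoded)

# Ordinal encoding: integers that follow the stated order low < medium < high
ordinal = OrdinalEncoder(categories=[["low", "medium", "high"]])
pain_level_encoded = ordinal.fit_transform(records[["pain_level"]])
print(pain_level_encoded.ravel())
```

One-hot encoding avoids suggesting a false order between categories, at the cost of adding one column per category.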

3. Train a Decision Tree Model

I would split the dataset into training and testing sets.

Then, I’d train a Decision Tree Classifier on the training data.

The model will learn patterns from patient information (age, symptoms, test results, etc.) to predict whether the patient has the disease.

4. Tune its Hyperparameters

To avoid overfitting, I would tune parameters like:

- max_depth (how deep the tree grows)
- min_samples_split (minimum samples required to split a node)
- criterion (Gini or Entropy)

I’d use GridSearchCV or RandomizedSearchCV for systematic tuning.
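Steps 1 to 4 can be combined into one scikit-learn `Pipeline`, so that imputation and encoding are refit inside each cross-validation fold. The sketch below runs on synthetic data: the columns (`age`, `glucose`, `smoker`), the rule that generates the label, the grid values, and the choice of recall as the tuning metric are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic patient data (entirely made up)
rng = np.random.default_rng(42)
n = 300
data = pd.DataFrame({
    "age": rng.integers(20, 80, size=n).astype(float),
    "glucose": rng.normal(100, 20, size=n),
    "smoker": rng.choice(["yes", "no"], size=n),
})
# Made-up rule that decides who "has the disease"
y = (
    (data["glucose"] > 110) | ((data["age"] > 60) & (data["smoker"] == "yes"))
).astype(int)
# Remove some glucose readings to simulate missing values
data.loc[rng.choice(n, size=30, replace=False), "glucose"] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    data, y, test_size=0.3, random_state=42, stratify=y
)

# Impute numeric columns, one-hot encode the categorical column
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age", "glucose"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["smoker"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>"
param_grid = {
    "tree__max_depth": [3, 5, None],
    "tree__min_samples_split": [2, 10],
    "tree__criterion": ["gini", "entropy"],
}
search = GridSearchCV(model, param_grid, cv=5, scoring="recall")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Test recall:", search.score(X_test, y_test))
```

Keeping the preprocessing inside the pipeline means the test set is transformed with values learned from the training data only.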

5. Evaluate its Performance

I’d test the model on the unseen test set.

Metrics I’d use:

- Accuracy (overall correctness)
- Precision (how many predicted positives are truly positive)
- Recall (how many real positives we correctly detected)
- F1-score (balance of precision and recall)

In healthcare, Recall is usually the most important of these, because missing a sick patient (a false negative) is typically riskier than raising a false alarm (a false positive).
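A short sketch of these metrics on made-up predictions shows why accuracy alone can mislead. The labels below are hypothetical: 4 patients have the disease, and the model finds only 2 of them.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical labels: 1 = has the disease, 0 = healthy
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

# 2 true positives, 2 false negatives, 1 false positive, 5 true negatives
accuracy = accuracy_score(y_true, y_pred)    # 7 of 10 correct
precision = precision_score(y_true, y_pred)  # 2 of 3 predicted positives
recall = recall_score(y_true, y_pred)        # 2 of 4 real positives
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

print(f"Accuracy:  {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"F1-score:  {f1:.2f}")
```

Here accuracy is 70%, yet half of the sick patients are missed, which is the kind of gap that recall makes visible.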

6. Business Value in Real-world Setting

This model can help doctors quickly identify high-risk patients and prioritize their treatment.

It reduces manual workload by highlighting patterns that may not be obvious.

It improves early detection of diseases, which can save lives.

From a business point of view, it increases efficiency, reduces costs of unnecessary tests, and improves patient satisfaction.

---