Question 1: What is a Decision Tree, and how does it work in the context of
classification?

Ans


A **Decision Tree** is a supervised machine learning algorithm used for classification and regression tasks. In classification, it is used to predict a category or class label based on input features. It works like a flowchart structure where each internal node represents a decision based on a feature, each branch represents the outcome of that decision, and each leaf node represents the final class label.

In classification, the algorithm starts at the root node and splits the data based on the feature that best separates the classes. It uses measures like **Gini Index** or **Entropy (Information Gain)** to decide the best feature for splitting. The process continues recursively, creating smaller branches until the data is clearly classified or a stopping condition is reached.

For example, if we are predicting whether a loan should be approved or not, the tree might first split based on income level, then credit score, and then employment status. By following the decisions from root to leaf, the model assigns a final class label.

In simple terms, a Decision Tree classifies data by asking a series of questions and following the answers step by step until it reaches a final decision.
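
As a rough illustration of the flowchart idea, the loan example above can be written as a small hand-built tree. This is only a sketch: the thresholds (50000 for income, 650 for credit score), the dictionary layout, and the `predict` helper are assumptions made for illustration, not values learned from data.

```python
# A hand-written decision tree for the loan example.
# All thresholds and field names below are hypothetical.
loan_tree = {
    "feature": "income", "threshold": 50000,
    "left": {"label": "Rejected"},            # income <= 50000
    "right": {                                # income > 50000
        "feature": "credit_score", "threshold": 650,
        "left": {"label": "Rejected"},        # credit_score <= 650
        "right": {                            # credit_score > 650
            "feature": "employed", "threshold": 0.5,
            "left": {"label": "Rejected"},    # not employed (0)
            "right": {"label": "Approved"},   # employed (1)
        },
    },
}


def predict(node, sample):
    """Walk from the root to a leaf, answering one question per internal node."""
    while "label" not in node:
        go_left = sample[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]


applicant = {"income": 72000, "credit_score": 700, "employed": 1}
print(predict(loan_tree, applicant))  # Approved
```

A trained tree works the same way at prediction time; the difference is that the features and thresholds are chosen by the algorithm from the training data.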


Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures.
How do they impact the splits in a Decision Tree?

Ans

**Gini Impurity** and **Entropy** are impurity measures used in Decision Trees to decide the best feature for splitting the data. They measure how mixed the classes are in a node. A pure node contains data points from only one class, while an impure node contains a mix of classes.

**Gini Impurity** measures the probability of incorrectly classifying a randomly chosen data point if it were randomly labeled according to the class distribution in the node. Its value ranges from 0 to 1 - 1/k for k classes, so the maximum is 0.5 for binary classification. A Gini value of 0 means the node is completely pure (all samples belong to one class). The Decision Tree algorithm selects the split whose child nodes have the lowest weighted Gini impurity.

**Entropy** measures the amount of uncertainty or randomness in the data. It ranges from 0 (completely pure) to 1 (maximum disorder) for binary classification, and up to log2(k) for k classes. Entropy is used to calculate **Information Gain**, which shows how much uncertainty is reduced after a split. The split with the highest Information Gain (equivalently, the lowest weighted entropy in the child nodes) is chosen.

Both Gini Impurity and Entropy help the Decision Tree choose the best feature to split the data. They impact the splits by ensuring that each split makes the resulting nodes as pure as possible, leading to better classification performance.
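
A minimal sketch of both measures, using the standard definitions Gini = 1 - sum(p_i^2) and Entropy = -sum(p_i * log2(p_i)), where p_i is the fraction of samples in class i. The function names and the toy label lists are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def class_probabilities(labels):
    """Fraction of samples belonging to each class present in the node."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def gini(labels):
    """Gini impurity: 1 - sum of squared class probabilities."""
    return 1.0 - sum(p * p for p in class_probabilities(labels))


def entropy(labels):
    """Shannon entropy in bits: sum of -p * log2(p) over the classes."""
    return sum(-p * log2(p) for p in class_probabilities(labels))


pure_node = ["yes", "yes", "yes", "yes"]
mixed_node = ["yes", "yes", "no", "no"]

print(gini(pure_node), entropy(pure_node))    # 0.0 0.0
print(gini(mixed_node), entropy(mixed_node))  # 0.5 1.0
```

The pure node scores 0 on both measures, while the evenly mixed binary node reaches each measure's maximum (0.5 for Gini, 1.0 for Entropy).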


Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision
Trees? Give one practical advantage of using each.


Ans

**Pre-Pruning** and **Post-Pruning** are techniques used to prevent overfitting in Decision Trees.

**Pre-Pruning (Early Stopping)** stops the tree from growing before it becomes too complex. It sets limits such as maximum depth, minimum number of samples required to split a node, or minimum impurity decrease. When a node reaches one of these limits (for example, it is already at the maximum depth or has too few samples), it is not split further and becomes a leaf.
**Practical Advantage:** It reduces training time and computational cost because the tree is controlled during the building process.

**Post-Pruning (Backward Pruning)** allows the tree to grow fully and then removes branches that do not significantly improve model performance. It trims unnecessary nodes after evaluating them using validation data or cost-complexity pruning.
**Practical Advantage:** It often produces a more accurate and better-generalized model because pruning decisions are made after seeing the full tree structure.

In short, pre-pruning controls growth early to save resources, while post-pruning simplifies a fully grown tree to improve generalization.
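
The two approaches can be sketched with scikit-learn as follows. This is an illustrative sketch, not a tuned setup: the limits `max_depth=3` and `min_samples_split=4` and the choice of the middle `ccp_alpha` value are arbitrary assumptions. In practice the pruning strength would be chosen with validation data or cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Pre-pruning: growth limits are fixed before training starts.
pre_pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_split=4, random_state=42
).fit(X_train, y_train)

# Post-pruning: grow the full tree first, then apply cost-complexity pruning.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Arbitrary choice for illustration: take the middle candidate alpha.
# max(..., 0.0) guards against tiny negative values from rounding.
alpha = max(path.ccp_alphas[len(path.ccp_alphas) // 2], 0.0)
post_pruned = DecisionTreeClassifier(
    ccp_alpha=alpha, random_state=42
).fit(X_train, y_train)

print("Full tree nodes:       ", full_tree.tree_.node_count)
print("Pre-pruned tree nodes: ", pre_pruned.tree_.node_count)
print("Post-pruned tree nodes:", post_pruned.tree_.node_count)
```

Larger `ccp_alpha` values remove more branches, so the post-pruned tree never has more nodes than the fully grown one.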



Question 4: What is Information Gain in Decision Trees, and why is it important for
choosing the best split?


Ans

**Information Gain** is a measure used in Decision Trees to decide which feature should be chosen for splitting the data. It is based on the concept of **Entropy**, which measures the uncertainty or impurity in a dataset.

Information Gain calculates how much the entropy decreases after splitting the data on a particular feature. In simple terms, it measures how much “information” a feature provides about the class label.

If a split results in child nodes that are more pure (less mixed), the entropy decreases significantly, and the Information Gain is high.

It is important because the Decision Tree algorithm selects the feature with the **highest Information Gain** for splitting. This ensures that each split makes the data more organized and closer to pure class labels, leading to better and more accurate classification.
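
A small worked sketch of the calculation: Information Gain is the parent's entropy minus the size-weighted average entropy of the child nodes. The helper names and the toy labels are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy in bits of the class labels in a node."""
    total = len(labels)
    probs = [count / total for count in Counter(labels).values()]
    return sum(-p * log2(p) for p in probs)


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = len(parent)
    weighted_child_entropy = sum(
        len(child) / total * entropy(child) for child in children
    )
    return entropy(parent) - weighted_child_entropy


parent = ["yes"] * 4 + ["no"] * 4

# A split that separates the classes perfectly.
perfect_split = [["yes"] * 4, ["no"] * 4]
# A split that leaves both children as mixed as the parent.
useless_split = [["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"]]

print(information_gain(parent, perfect_split))  # 1.0
print(information_gain(parent, useless_split))  # 0.0
```

The tree would prefer the first split, since it removes all of the uncertainty in the parent node, while the second removes none.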


Question 5: What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?

Ans

Decision Trees are used in many real-life applications because they are simple and easy to understand.

Common applications include loan approval, where banks decide whether to give a loan based on income and credit score; medical diagnosis, where symptoms are used to predict diseases; fraud detection, where suspicious transactions are identified; customer churn prediction, where companies find customers who may leave; and marketing, where businesses group customers for targeted advertising.

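The main advantages of Decision Trees are that they are easy to understand and explain, can handle both numerical and categorical data (although scikit-learn's implementation needs categorical features encoded as numbers first), require little data preparation (for example, no feature scaling), and are useful for decision-making problems.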

The main limitations are that they can easily overfit if the tree becomes too large, small changes in data can create a very different tree, and they may not be as accurate as advanced models like Random Forest or Gradient Boosting.

In simple terms, Decision Trees are useful and easy-to-use models, but they must be carefully controlled to avoid overfitting and instability.


Question 6: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier using the Gini criterion
● Print the model’s accuracy and feature importances

In [1]:
# Import required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset (built-in dataset)
iris = load_iris()
X = iris.data
y = iris.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Create Decision Tree model using Gini criterion
model = DecisionTreeClassifier(criterion='gini', random_state=42)

# Train the model
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print results
print("Model Accuracy:", accuracy)
print("Feature Importances:", model.feature_importances_)

Model Accuracy: 1.0
Feature Importances: [0.         0.01911002 0.89326355 0.08762643]


Question 7: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
a fully-grown tree.

In [2]:
# Import required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset (built-in dataset)
iris = load_iris()
X = iris.data
y = iris.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# -----------------------------
# Fully-grown Decision Tree
# -----------------------------
full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(X_train, y_train)
y_pred_full = full_tree.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)

# -----------------------------
# Decision Tree with max_depth=3
# -----------------------------
pruned_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
pruned_tree.fit(X_train, y_train)
y_pred_pruned = pruned_tree.predict(X_test)
accuracy_pruned = accuracy_score(y_test, y_pred_pruned)

# Print results
print("Accuracy of Fully-Grown Tree:", accuracy_full)
print("Accuracy of Tree with max_depth=3:", accuracy_pruned)

Accuracy of Fully-Grown Tree: 1.0
Accuracy of Tree with max_depth=3: 1.0


Question 8: Write a Python program to:
● Load the Boston Housing Dataset
● Train a Decision Tree Regressor
● Print the Mean Squared Error (MSE) and feature importance

In [3]:
# Import required libraries
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load Boston Housing dataset from OpenML
boston = fetch_openml(name="boston", version=1, as_frame=True)

X = boston.data
y = boston.target.astype(float)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Create Decision Tree Regressor
model = DecisionTreeRegressor(random_state=42)

# Train the model
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)

# Print results
print("Mean Squared Error (MSE):", mse)
print("\nFeature Importances:")
for feature, importance in zip(X.columns, model.feature_importances_):
    print(f"{feature}: {importance}")

Mean Squared Error (MSE): 11.588026315789474

Feature Importances:
CRIM: 0.05846545229060361
ZN: 0.000988919249451643
INDUS: 0.009872448809169472
CHAS: 0.0002973342835618114
NOX: 0.007050562083191356
RM: 0.575807411273885
AGE: 0.007170198655228184
DIS: 0.10962404854314393
RAD: 0.001646356693641641
TAX: 0.002181112508453187
PTRATIO: 0.025042865841170155
B: 0.011872990423277916
LSTAT: 0.189980299345222


Question 9: Write a Python program to:
● Load the Iris Dataset
● Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV
● Print the best parameters and the resulting model accuracy


In [4]:
# Import required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Define Decision Tree model
dt = DecisionTreeClassifier(random_state=42)

# Define parameter grid
param_grid = {
    'max_depth': [None, 2, 3, 4, 5, 6],
    'min_samples_split': [2, 4, 6, 8, 10]
}

# Apply GridSearchCV
grid_search = GridSearchCV(
    estimator=dt,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy'
)

# Train using GridSearch
grid_search.fit(X_train, y_train)

# Get best model
best_model = grid_search.best_estimator_

# Predict using best model
y_pred = best_model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print results
print("Best Parameters:", grid_search.best_params_)
print("Model Accuracy with Best Parameters:", accuracy)

Best Parameters: {'max_depth': None, 'min_samples_split': 6}
Model Accuracy with Best Parameters: 1.0


Question 10: Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:
● Handle the missing values
● Encode the categorical features
● Train a Decision Tree model
● Tune its hyperparameters
● Evaluate its performance
And describe what business value this model could provide in the real-world
setting.

Ans

If I were working as a data scientist in a healthcare company, I would follow a structured step-by-step approach:

Step 1: Handle Missing Values
First, I would analyze how many missing values exist and in which columns.

* For numerical features (like age, blood pressure), I would fill missing values using the mean or median (the median is more robust to outliers).
* For categorical features (like gender, disease history), I would use the mode (most frequent value).
* If a column has a very high share of missing values, I might consider dropping that feature.
* In advanced cases, I could use techniques like KNN imputation.
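
A minimal sketch of median and mode imputation with scikit-learn's `SimpleImputer`. The toy `age` and `smoker` columns are hypothetical. Each imputer is fitted on training data and then reused on new rows, so the fill values come from the training set only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy training columns; np.nan marks a missing value.
age_train = np.array([[25.0], [np.nan], [40.0], [31.0]])
smoker_train = np.array([["Yes"], ["No"], ["Yes"], [np.nan]], dtype=object)

num_imputer = SimpleImputer(strategy="median")         # numerical features
cat_imputer = SimpleImputer(strategy="most_frequent")  # categorical features

# Learn the fill values from the training data and apply them.
age_filled = num_imputer.fit_transform(age_train)
smoker_filled = cat_imputer.fit_transform(smoker_train)

# Reuse the learned median on an unseen row with a missing age.
age_new = num_imputer.transform(np.array([[np.nan]]))

print(age_filled.ravel())     # missing age replaced by the median, 31.0
print(smoker_filled.ravel())  # missing value replaced by the mode, "Yes"
print(age_new.ravel())
```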

Step 2: Encode Categorical Features
scikit-learn's Decision Tree implementation expects numeric input, so text categories must be encoded first.

* For binary categories (Yes/No), I would use Label Encoding (mapping the two values to 0 and 1).
* For features with multiple categories (e.g., blood type), I would use One-Hot Encoding.

This converts categorical data into a numerical format suitable for the Decision Tree.
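
A short sketch of both encodings on hypothetical toy columns. It uses `OrdinalEncoder` for the binary feature because that is scikit-learn's encoder for input features (`LabelEncoder` is intended for target labels), and `OneHotEncoder` for the multi-category feature.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy columns.
smoker = np.array([["No"], ["Yes"], ["No"]], dtype=object)
blood_type = np.array([["A"], ["B"], ["O"], ["A"]], dtype=object)

# Binary feature: categories are sorted, so "No" -> 0.0 and "Yes" -> 1.0.
binary_encoder = OrdinalEncoder()
smoker_encoded = binary_encoder.fit_transform(smoker)

# Multi-category feature: one 0/1 column per category.
# handle_unknown="ignore" encodes unseen categories as all zeros at predict time.
onehot_encoder = OneHotEncoder(handle_unknown="ignore")
blood_encoded = onehot_encoder.fit_transform(blood_type).toarray()

print(smoker_encoded.ravel())        # [0. 1. 0.]
print(onehot_encoder.categories_[0]) # ['A' 'B' 'O']
print(blood_encoded)
```

One-Hot Encoding avoids implying an order between categories such as blood types, at the cost of adding one column per category.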

Step 3: Split the Dataset
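I would split the dataset into training and testing sets (for example, 80% training and 20% testing), stratified on the disease label so both sets keep the same class balance. The imputation values and encoders from Steps 1 and 2 should be fitted on the training set only and then applied to the test set, so that no information from the test set leaks into training.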

Step 4: Train the Decision Tree Model
I would create a DecisionTreeClassifier and train it using the cleaned and encoded training data. Decision Trees are useful because they handle mixed data types well and are easy to interpret.

Step 5: Tune Hyperparameters
To avoid overfitting, I would tune parameters such as:

* max_depth
* min_samples_split
* min_samples_leaf
* criterion (gini or entropy)

I would use GridSearchCV or RandomizedSearchCV to find the best combination of parameters.

Step 6: Evaluate the Model
I would evaluate performance using:

* Accuracy
* Precision
* Recall
* F1-score
* Confusion Matrix
* ROC-AUC score

In healthcare, Recall is very important because missing a diseased patient (false negative) can be dangerous.
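
A brief sketch of how these metrics relate to the confusion matrix, using scikit-learn on a small made-up set of labels (1 = has the disease, 0 = healthy). The `y_true` and `y_pred` values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from sklearn.metrics import (
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical labels: 1 = diseased, 0 = healthy.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

# For binary labels the matrix is [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

print("TN, FP, FN, TP:", tn, fp, fn, tp)  # 3 1 1 3
print("Precision:", precision)            # 0.75
print("Recall:", recall)                  # 0.75
print("F1-score:", f1)
```

Here one diseased patient out of four is missed (the single false negative), which is exactly what Recall exposes and plain Accuracy can hide when the classes are imbalanced.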

Step 7: Interpret the Model
Decision Trees allow us to see feature importance. This helps doctors understand which factors (e.g., blood pressure, cholesterol, age) contribute most to the disease prediction.

Business Value in Real-World Healthcare

This model can provide several benefits:

* Early disease detection, leading to timely treatment
* Reduced healthcare costs by identifying high-risk patients early
* Better resource allocation in hospitals
* Data-driven decision support for doctors
* Improved patient outcomes

In simple terms, this model can help the healthcare company detect diseases faster, improve treatment planning, and potentially save lives while reducing operational costs.
