In [None]:
# Question 1: What is a Decision Tree, and how does it work in the context of classification?
# Ans 1: A Decision Tree is a supervised machine learning algorithm structured like a flowchart.
# It splits a dataset into smaller and smaller subsets based on feature values. Each internal node tests
# one feature, each branch is an outcome of that test, and each leaf node holds the predicted class.
# To classify a new data point, start at the root and follow the matching branches until a leaf is reached.
# Example: New Data Point: Outlook = Sunny, Humidity = High, Wind = Weak
# 1. Start at Root Node: The node asks, "Outlook?"
#    The data's value is "Sunny," so it follows the "Sunny" branch.
# 2. Arrive at New Node: This node asks, "Humidity?"
#    The data's value is "High," so it follows the "High" branch.
# 3. Arrive at Leaf Node: The leaf says "Class: No," so the prediction is "No." Wind is never checked on this path.
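# A minimal sketch of the walk-through above as nested if/else checks. Only the Sunny -> High -> "No" path
# comes from the example; every other branch and the function name `classify` are illustrative assumptions.

```python
def classify(outlook, humidity, wind):
    """Walk a small hand-written decision tree and return a class label."""
    if outlook == "Sunny":              # root node: "Outlook?"
        if humidity == "High":          # internal node: "Humidity?"
            return "No"                 # leaf taken from the example above
        return "Yes"                    # assumed leaf, for illustration only
    if outlook == "Rain":               # assumed branch, for illustration only
        return "Yes" if wind == "Weak" else "No"
    return "Yes"                        # assumed leaf for any other outlook


# The data point from the example: Outlook = Sunny, Humidity = High, Wind = Weak
print(classify("Sunny", "High", "Weak"))
```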

# Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?
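# Ans 2: Gini Impurity and Entropy are both metrics used to measure the "impurity" or "disorder" of a set of data.
# In a decision tree, "impurity" refers to how mixed the classes are at a particular node.
# A "pure" node is ideal: all data points in it belong to a single class (e.g., 100% "Spam").
# A maximally impure node is the worst case: the data points are split evenly among all classes (e.g., 50% "Spam," 50% "Not Spam").
# Gini Impurity
# Gini Impurity measures the likelihood of a randomly chosen element from a node being incorrectly classified if it were randomly
#  labeled according to the distribution of classes in that node.
# Gini = 1 - sum(p_i^2), where p_i is the fraction of samples in class i. It is 0 for a pure node and 0.5 for an even two-class split.
# Entropy
# Entropy is a concept from information theory that measures the amount of uncertainty or randomness (i.e., disorder) in a node.
# Entropy = -sum(p_i * log2(p_i)). It is 0 for a pure node and 1 bit for an even two-class split.
# Impact on splits
# At each node the tree evaluates candidate splits and chooses the one that gives the largest drop in weighted child impurity.
# Both measures usually produce similar trees; Gini is slightly cheaper to compute because it avoids the logarithm.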
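# A small sketch of both measures computed from a list of class labels, using the standard formulas
# Gini = 1 - sum(p_i^2) and Entropy = -sum(p_i * log2(p_i)). The helper names are illustrative, not from any library.

```python
import math
from collections import Counter


def class_probabilities(labels):
    """Return the fraction of samples belonging to each class."""
    counts = Counter(labels)
    total = len(labels)
    return [count / total for count in counts.values()]


def gini(labels):
    """Gini impurity: 1 - sum of squared class probabilities."""
    return 1.0 - sum(p ** 2 for p in class_probabilities(labels))


def entropy(labels):
    """Shannon entropy in bits: sum of -p * log2(p) over classes."""
    return sum(-p * math.log2(p) for p in class_probabilities(labels))


pure_node = ["Spam"] * 4
mixed_node = ["Spam", "Spam", "Not Spam", "Not Spam"]

print("Pure node  -> Gini:", gini(pure_node), "Entropy:", entropy(pure_node))
print("Mixed node -> Gini:", gini(mixed_node), "Entropy:", entropy(mixed_node))
```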

# Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.
# Ans 3: Pruning means reducing the size of a decision tree by removing unnecessary branches or nodes that add little predictive power.
# It helps the model generalize better to unseen data.
# Pre-pruning stops the tree from growing too deep by setting constraints (e.g., max_depth, min_samples_split) before the tree is fully built.
# Advantage of pre-pruning:
# Faster training and a smaller model: branches that violate the constraints are never built, which saves time on large datasets.
# Post-pruning allows the tree to grow completely first, and then removes branches that don't improve performance on validation data.
# Advantage of post-pruning:
# Better generalization: every split is seen before any is removed, so the model keeps only the most meaningful splits, improving performance on unseen data.
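# A hedged sketch of both strategies in scikit-learn. Pre-pruning uses max_depth and min_samples_leaf; for
# post-pruning, scikit-learn offers cost-complexity pruning (ccp_alpha), which is used here in place of the
# validation-based pruning described above. The mid-range alpha is an arbitrary choice for illustration;
# in practice it would be selected by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Pre-pruning: growth constraints are applied while the tree is being built
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=42)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the full tree first, then prune it back
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Arbitrary mid-range alpha (illustration only); clamp at 0 to guard against float noise
alpha = max(float(path.ccp_alphas[len(path.ccp_alphas) // 2]), 0.0)
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_train, y_train)

for name, model in [("full", full_tree), ("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    print(f"{name}: nodes={model.tree_.node_count}, test accuracy={model.score(X_test, y_test):.3f}")
```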

# Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?
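# Ans 4: Information Gain (IG) measures how much impurity is removed by splitting a dataset on a particular feature.
# IG = Entropy(parent) - weighted average of Entropy(children), where each child is weighted by its share of the parent's samples.
# It tells us how good a feature is at separating the classes (like YES/NO, spam/not spam, etc.).
# The higher the Information Gain, the better the feature is for making a split, so at each node the tree picks the split with the highest IG.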
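# A minimal sketch of Information Gain as parent entropy minus the weighted entropy of the child nodes.
# The function names and the toy label lists are illustrative assumptions, not part of the original answer.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["Yes"] * 4 + ["No"] * 4

# A split that separates the classes completely
perfect_split = [["Yes"] * 4, ["No"] * 4]
# A split that leaves both children as mixed as the parent
useless_split = [["Yes", "Yes", "No", "No"], ["Yes", "Yes", "No", "No"]]

print("IG of perfect split:", information_gain(parent, perfect_split))
print("IG of useless split:", information_gain(parent, useless_split))
```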

# Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?
# Ans 5: Decision Trees are widely used in finance, healthcare, marketing, and business decision-making. Banks use them for loan
#  approvals and fraud detection, while in healthcare, they assist in disease diagnosis and treatment prediction. Marketers apply
#  them for customer segmentation and churn prediction. Their main advantages include easy interpretation, handling both numerical and
#  categorical data, and minimal preprocessing. However, Decision Trees often overfit, are sensitive to small changes in the data, and
#  can be biased toward features with many categories. Despite these drawbacks, they remain a popular choice for interpretable and
#  efficient decision-making, and their weaknesses are often reduced by combining many trees in ensemble methods like Random Forests.
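# To illustrate the "easy interpretation" advantage, the sketch below prints a shallow tree as plain if/else rules
# with scikit-learn's export_text. The Iris dataset and max_depth=2 are arbitrary choices for the illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rule set short enough to read at a glance
tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(iris.data, iris.target)

# export_text renders the fitted tree as indented threshold rules
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```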


In [1]:
# Question 6: Write a Python program to:
# ● Load the Iris Dataset
# ● Train a Decision Tree Classifier using the Gini criterion
# ● Print the model’s accuracy and feature importances
# Import required libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

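# Load the features and class labels
iris = load_iris()
X = iris.data
y = iris.target

# Hold out 30% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the classifier with Gini impurity as the split criterion
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# Predict on the held-out test set
y_pred = clf.predict(X_test)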


accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", round(accuracy * 100, 2), "%")


print("\nFeature Importances:")
for feature, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{feature}: {importance:.4f}")


Model Accuracy: 100.0 %

Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0191
petal length (cm): 0.8933
petal width (cm): 0.0876


In [2]:
# Question 7: Write a Python program to:
# ● Load the Iris Dataset
# ● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
# a fully-grown tree.
# Import required libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf_limited = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42)
clf_limited.fit(X_train, y_train)

clf_full = DecisionTreeClassifier(criterion='gini', random_state=42)
clf_full.fit(X_train, y_train)

y_pred_limited = clf_limited.predict(X_test)
y_pred_full = clf_full.predict(X_test)

accuracy_limited = accuracy_score(y_test, y_pred_limited)
accuracy_full = accuracy_score(y_test, y_pred_full)


print("Accuracy with max_depth=3:", round(accuracy_limited * 100, 2), "%")
print("Accuracy with fully-grown tree:", round(accuracy_full * 100, 2), "%")




Accuracy with max_depth=3: 100.0 %
Accuracy with fully-grown tree: 100.0 %


In [3]:
# Question 8: Write a Python program to:
# ● Load the Boston Housing Dataset
# ● Train a Decision Tree Regressor
# ● Print the Mean Squared Error (MSE) and feature importances

# Import required libraries
from sklearn.datasets import fetch_openml
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# load_boston was removed in scikit-learn 1.2, so the dataset is fetched from OpenML instead
boston = fetch_openml(name="boston", version=1, as_frame=True)
X = boston.data
y = boston.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

regressor = DecisionTreeRegressor(criterion='squared_error', random_state=42)
regressor.fit(X_train, y_train)

y_pred = regressor.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error (MSE):", round(mse, 2))


print("\nFeature Importances:")
for feature, importance in zip(X.columns, regressor.feature_importances_):
    print(f"{feature}: {importance:.4f}")


Mean Squared Error (MSE): 11.59

Feature Importances:
CRIM: 0.0585
ZN: 0.0010
INDUS: 0.0099
CHAS: 0.0003
NOX: 0.0071
RM: 0.5758
AGE: 0.0072
DIS: 0.1096
RAD: 0.0016
TAX: 0.0022
PTRATIO: 0.0250
B: 0.0119
LSTAT: 0.1900


In [4]:
# Question 9: Write a Python program to:
# ● Load the Iris Dataset
# ● Tune the Decision Tree’s max_depth and min_samples_split using
# GridSearchCV
# ● Print the best parameters and the resulting model accuracy

# Import required libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf = DecisionTreeClassifier(random_state=42)

param_grid = {
    'max_depth': [2, 3, 4, 5, 6, None],
    'min_samples_split': [2, 3, 4, 5, 10]
}

grid_search = GridSearchCV(
    estimator=clf,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Best Parameters:", best_params)
print("Model Accuracy after tuning:", round(accuracy * 100, 2), "%")


Best Parameters: {'max_depth': 4, 'min_samples_split': 10}
Model Accuracy after tuning: 100.0 %


In [None]:
# Question 10: Imagine you’re working as a data scientist for a healthcare company that
# wants to predict whether a patient has a certain disease. You have a large dataset with
# mixed data types and some missing values.
# Explain the step-by-step process you would follow to:
# ● Handle the missing values
# ● Encode the categorical features
# ● Train a Decision Tree model
# ● Tune its hyperparameters
# ● Evaluate its performance
# And describe what business value this model could provide in the real-world
# setting

# Ans 10:
# 1. Handle Missing Values:
# Drop irrelevant columns. Impute numerical features with the mean or median and categorical features with the mode or an "Unknown" category.
# Add missing-value indicator columns so the model can learn from the gaps themselves.
# 2. Encode Categorical Features:
# Use one-hot encoding for nominal features and ordinal encoding for ordered features. Merge rare categories.
# Fit the encoders inside a pipeline so they only see training data, which prevents data leakage and keeps train and test transformations consistent.
# 3. Train Decision Tree Model:
# Split the data into training and test sets, train a DecisionTreeClassifier, and handle class imbalance with class weights.
# Check accuracy, precision, recall, and F1-score on validation data.
# 4. Tune Hyperparameters:
# Use GridSearchCV to search over max_depth, min_samples_split, and criterion.
# Cross-validation gives a more reliable estimate of generalization than a single split, which helps select settings that do not overfit.
# 5. Evaluate & Business Value:
# Evaluate on the held-out test set with accuracy, recall, F1-score, and ROC-AUC. Recall deserves particular attention,
# because a missed diagnosis is usually costlier than a false alarm.
# The model enables early disease detection, reduces costs, and supports better patient decisions through data-driven healthcare insights.
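# The five steps above can be sketched as a single scikit-learn pipeline. This is a minimal illustration under
# loud assumptions: the patient table is synthetic, and the column names (age, blood_pressure, smoker, region),
# the labelling rule, and the parameter grid are all hypothetical stand-ins for a real healthcare dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient records (hypothetical columns and labelling rule)
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
    "region": rng.choice(["north", "south", "east"], n),
})
y = ((df["age"] > 60) & (df["smoker"] == "yes")).astype(int)

# Inject missing values after labelling so the target stays well defined
df.loc[rng.choice(n, 40, replace=False), "blood_pressure"] = np.nan
df.loc[rng.choice(n, 40, replace=False), "smoker"] = np.nan

num_cols = ["age", "blood_pressure"]
cat_cols = ["smoker", "region"]

# Steps 1 and 2: impute and encode inside the pipeline to avoid data leakage
numeric = SimpleImputer(strategy="median", add_indicator=True)
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="Unknown")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([("num", numeric, num_cols), ("cat", categorical, cat_cols)])

# Step 3: decision tree with class weights to handle imbalance
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(class_weight="balanced", random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.3, stratify=y, random_state=42
)

# Step 4: tune hyperparameters with 5-fold cross-validation
param_grid = {
    "tree__max_depth": [3, 5, None],
    "tree__min_samples_split": [2, 10],
    "tree__criterion": ["gini", "entropy"],
}
search = GridSearchCV(model, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

# Step 5: evaluate the best model on the held-out test set
y_pred = search.predict(X_test)
y_prob = search.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test, y_prob)

print("Best Parameters:", search.best_params_)
print(classification_report(y_test, y_pred))
print("ROC-AUC:", round(roc_auc, 3))
```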