Question 1: What is a Decision Tree, and how does it work in the context of
classification?

Answer:

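A Decision Tree is a flowchart-like model that classifies a sample by routing it through a sequence of feature tests. Each internal node tests one feature (for example, whether petal length is below a threshold), each branch corresponds to an outcome of that test, and each leaf node holds the predicted class label.

It is built by recursively partitioning the training data: at each node the algorithm picks the feature and threshold that most reduce impurity in the resulting child nodes, and it stops when a node is pure or a stopping condition (such as a maximum depth) is reached.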
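
To make the node/branch/leaf structure concrete, the following is a minimal sketch that fits a deliberately shallow tree on Iris and prints its rules as text. The depth limit of 2 and the variable names are illustrative assumptions, not part of the original answer.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree (max_depth=2 is an illustrative choice) keeps the printout short.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each "|---" line is either a feature test (internal node) or a class (leaf).
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Reading the printout from top to bottom follows the same path a sample takes from the root to a leaf.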

Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures.
How do they impact the splits in a Decision Tree?

Answer:

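Gini Impurity: The probability of misclassifying a randomly chosen sample if it were labeled at random according to the node's class distribution. Gini = 1 - sum(p_i^2), where p_i is the fraction of samples in class i. (Lower Gini = purer node; 0 means every sample belongs to one class.)
Entropy: Measures the disorder/uncertainty in a node. Entropy = -sum(p_i * log2(p_i)). (Lower Entropy = higher purity; 0 means a pure node.)
Impact: At each node the tree chooses the split whose child nodes have the lowest weighted impurity, so the children are purer than the parent. The two criteria usually produce similar trees; Gini is slightly cheaper to compute because it avoids the logarithm.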
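
As a small worked example, both measures can be computed directly from class counts. This is a sketch with hypothetical helper names (`gini`, `entropy`) and toy label lists, not scikit-learn's internal implementation.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2) over the class fractions p_i."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def entropy(labels):
    """Entropy in bits: sum(p_i * log2(1 / p_i)) over the class fractions p_i."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * np.log2(1.0 / p)))

pure = [0, 0, 0, 0]    # one class only
mixed = [0, 0, 1, 1]   # 50/50 split between two classes

print(gini(pure), entropy(pure))    # both are 0 for a pure node
print(gini(mixed), entropy(mixed))  # 0.5 and 1.0 for a 50/50 two-class node
```

For two classes, both measures peak at a 50/50 mix and fall to zero as the node becomes pure, which is why minimizing either one tends to favor the same splits.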


Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision
Trees? Give one practical advantage of using each.

Answer:

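Pre-Pruning: Stops tree growth early using conditions (e.g., max_depth, min_samples_split, min_samples_leaf).
Advantage: Faster training and a smaller tree, because branches that would overfit are never grown.

Post-Pruning: Grows the full tree first, then removes branches that contribute little (e.g., cost-complexity pruning).
Advantage: Often generalizes better, because pruning decisions are made after seeing the complete structure, so a useful split deeper in the tree is not cut off prematurely.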
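
Both strategies are available in scikit-learn. The sketch below contrasts them on Iris: pre-pruning through growth limits, and post-pruning through cost-complexity pruning (`ccp_alpha`). The specific limits and the mid-path choice of alpha are arbitrary assumptions for illustration; in practice alpha would be chosen by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Pre-pruning: growth stops once a limit is hit (values are illustrative).
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=42)
pre_pruned.fit(X_train, y_train)

# Post-pruning: compute the cost-complexity pruning path of the full tree,
# then refit with a non-zero alpha so the weakest branches are collapsed.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]  # arbitrary mid-path choice
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
post_pruned.fit(X_train, y_train)

models = [("full", full_tree), ("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]
for name, model in models:
    print(name, "leaves:", model.get_n_leaves(),
          "test accuracy:", model.score(X_test, y_test))
```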


Question 4: What is Information Gain in Decision Trees, and why is it important for
choosing the best split?

Answer:

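Information Gain measures the reduction in impurity (Entropy) after splitting a dataset on a feature: the parent node's entropy minus the size-weighted average entropy of the child nodes.
Importance: The split with the highest Information Gain is chosen, because it yields the purest child nodes and therefore separates the classes most effectively at that step.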
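
The calculation can be sketched for a binary split as follows. The helper names and the six-sample toy node are assumptions made for illustration.

```python
import numpy as np

def entropy(labels):
    """Entropy in bits of the class distribution in `labels`."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * np.log2(1.0 / p)))

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted_children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted_children

parent = np.array([0, 0, 0, 1, 1, 1])

# Candidate split A separates the classes perfectly.
gain_a = information_gain(parent, parent[:3], parent[3:])
# Candidate split B leaves both children mixed.
gain_b = information_gain(parent, parent[[0, 1, 3]], parent[[2, 4, 5]])

print(gain_a, gain_b)  # split A has the larger gain, so it would be chosen
```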

Question 5: What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?

Answer:

Applications: Medical diagnosis, credit risk scoring, fraud detection, customer segmentation, recommendation systems.
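Advantages: Easy to understand and interpret, handles both categorical and numerical data, and needs no feature scaling.
Limitations: Prone to overfitting, unstable (small changes in the data can produce a very different tree), and usually less accurate than ensemble methods such as Random Forests or Gradient Boosting.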



In [6]:
'''
● Iris Dataset for classification tasks (sklearn.datasets.load_iris() or
provided CSV).
● Boston Housing Dataset for regression tasks
(sklearn.datasets.load_boston() or provided CSV).
Question 6: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier using the Gini criterion
● Print the model’s accuracy and feature importances
'''
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


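# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Decision Tree Classifier using the Gini criterion
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# Accuracy on the held-out test set
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

# Impurity-based feature importances, in the order of iris.feature_names
print("Feature Importances:", clf.feature_importances_)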



Accuracy: 1.0
Feature Importances: [0.         0.01667014 0.90614339 0.07718647]


In [7]:
'''
Question 7: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
a fully-grown tree.

'''
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fully-grown tree: no depth limit
full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(X_train, y_train)
y_pred_full = full_tree.predict(X_test)
print("Full Tree Accuracy:", accuracy_score(y_test, y_pred_full))

# Pre-pruned tree: growth stops at depth 3
pruned_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
pruned_tree.fit(X_train, y_train)
y_pred_pruned = pruned_tree.predict(X_test)
print("Max Depth=3 Tree Accuracy:", accuracy_score(y_test, y_pred_pruned))


Full Tree Accuracy: 1.0
Max Depth=3 Tree Accuracy: 1.0


In [8]:
'''
Question 8: Write a Python program to:
● Load the California Housing dataset from sklearn
● Train a Decision Tree Regressor
● Print the Mean Squared Error (MSE) and feature importances

'''
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load California Housing dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Decision Tree Regressor
reg = DecisionTreeRegressor(random_state=42)
reg.fit(X_train, y_train)

# Predictions and MSE
y_pred = reg.predict(X_test)
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))

# Feature importances
print("Feature Importances:", reg.feature_importances_)



Mean Squared Error: 0.4979753095391472
Feature Importances: [0.52760631 0.05182796 0.05397784 0.0284493  0.0302391  0.13141145
 0.09337296 0.08311508]


In [9]:
'''
Question 9: Write a Python program to:
● Load the Iris Dataset
● Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV
● Print the best parameters and the resulting model accuracy
'''

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

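# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Hold out 20% of the samples for the final accuracy check
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Candidate values for the two hyperparameters being tuned
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 4, 6, 8, 10]}

# 5-fold cross-validated grid search on the training set only
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)

# Evaluate the best model (refit on the full training set) on the test set
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)

print("Best Parameters:", grid.best_params_)
print("Accuracy:", accuracy_score(y_test, y_pred))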


Best Parameters: {'max_depth': 4, 'min_samples_split': 2}
Accuracy: 1.0


Question 10: Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:
● Handle the missing values
● Encode the categorical features
● Train a Decision Tree model
● Tune its hyperparameters
● Evaluate its performance
And describe what business value this model could provide in the real-world
setting.

Answer:

1. Handle missing values

   a. Inspect pattern & % missing (MCAR/MAR/MNAR).
   b. Small %: drop rows or simple impute (median for numeric, mode for categorical).
   c. Larger or informative: add missing-indicator column + use KNN or IterativeImputer (MICE).
   d. For MNAR use domain rules or model-based approaches.

2. Encode categorical features

   a. Identify ordinal vs nominal.
   b. Ordinal → `OrdinalEncoder`; nominal (low-cardinality) → `OneHotEncoder`.
   c. High-cardinality → frequency/target/hash encoding or embeddings.
   d. Implement with `ColumnTransformer` inside a `Pipeline` to avoid leakage.

3. Train Decision Tree

   a. Split data stratified: train / validation / test.
   b. Use a pipeline: preprocessing → `DecisionTreeClassifier` (scaling usually not required).
   c. Handle imbalance: `class_weight='balanced'` or resample (SMOTE) inside CV.

4. Tune hyperparameters

   a. Use `GridSearchCV` or `RandomizedSearchCV` with stratified CV.
   b. Tune `max_depth`, `min_samples_split`, `min_samples_leaf`, `max_features`, `criterion`, `class_weight`.
   c. Optimize business-relevant metric (recall/ROC-AUC/PR-AUC), consider nested CV.

5. Evaluate performance

   a. On test set report: confusion matrix, recall (sensitivity), precision, F1, ROC-AUC, PR-AUC, calibration (Brier).
   b. Tune decision threshold for business trade-offs (FP vs FN cost).
   c. Explain with feature importances + SHAP; check fairness and clinical plausibility.

6. Deployment & monitoring (brief)

   a. Clinician validation, integrate into workflow, log predictions, monitor drift, retrain periodically, ensure privacy/compliance.

7. Business value

   a. Early detection and triage, prioritize tests/resources, reduce costs, improve patient outcomes, enable targeted interventions and measurable ROI — provided thresholds and clinical validation align with real-world cost/benefit.
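
Steps 1 to 5 above can be sketched end to end as a single scikit-learn pipeline. This is only an illustration under stated assumptions: the data is synthetic, every column name (`age`, `blood_pressure`, `smoker`, `region`) and the label rule are made up, simple median/mode imputation stands in for the richer options in step 1, and recall is used as the tuning metric as one possible business-relevant choice.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient records; all column names are hypothetical.
rng = np.random.default_rng(0)
n = 600
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
    "region": rng.choice(["north", "south", "east"], n),
})
# Made-up label rule so the example has some signal to learn.
risk = 0.03 * (df["age"] - 50) + 0.8 * (df["smoker"] == "yes") + rng.normal(0, 1, n)
y = (risk > 0.5).astype(int)
# Knock out roughly 10% of two columns to simulate missing values.
df.loc[rng.random(n) < 0.1, "blood_pressure"] = np.nan
df.loc[rng.random(n) < 0.1, "smoker"] = np.nan

numeric_cols = ["age", "blood_pressure"]
categorical_cols = ["smoker", "region"]

# Steps 1 and 2: impute and encode inside a ColumnTransformer to avoid leakage.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median", add_indicator=True), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

# Step 3: preprocessing followed by a class-weighted decision tree.
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(class_weight="balanced", random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=42
)

# Step 4: tune with stratified CV, optimizing recall (an illustrative choice).
param_grid = {
    "tree__max_depth": [3, 5, None],
    "tree__min_samples_leaf": [1, 5, 20],
}
search = GridSearchCV(
    model, param_grid, scoring="recall",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X_train, y_train)

# Step 5: evaluate once on the untouched test set.
y_pred = search.predict(X_test)
y_prob = search.predict_proba(X_test)[:, 1]
print("Best parameters:", search.best_params_)
print(classification_report(y_test, y_pred))
print("ROC-AUC:", roc_auc_score(y_test, y_prob))
```

Keeping the imputers and the encoder inside the pipeline means they are refit on each training fold during cross-validation, so no statistics from the validation folds or the test set leak into preprocessing.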
