Question 4. What is Information Gain in Decision Trees, and why is it important for choosing the best split?

Answer:

Information Gain (IG) measures the reduction in entropy (uncertainty) after a dataset is split on a feature.

Why it's important:

- IG quantifies how much a candidate split reduces uncertainty about the target label. The split with the highest IG is typically chosen because it produces child nodes that are more homogeneous (purer).

- Selecting splits by maximizing IG places the features that best separate the classes near the root of the tree, so later splits operate on purer subsets of the data.
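For a parent node S split into child nodes S_v, the usual definition is IG = H(S) - sum over v of (|S_v| / |S|) * H(S_v), where H is the Shannon entropy. The sketch below computes this by hand on a made-up set of ten labels; the helper names `entropy` and `information_gain` are illustrative and not part of any library.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Toy labels (made up for illustration): 5 "yes" and 5 "no"
parent = ["yes"] * 5 + ["no"] * 5

# Candidate split A: each child is dominated by one class
left = ["yes"] * 4 + ["no"] * 1
right = ["yes"] * 1 + ["no"] * 4
ig_good = information_gain(parent, [left, right])

# Candidate split B: both children are as mixed as the parent
ig_useless = information_gain(parent, [["yes", "no"] * 2, ["yes", "no"] * 3])

print(f"IG of split A: {ig_good:.3f}")
print(f"IG of split B: {ig_useless:.3f}")
```

Split A has the higher gain, so a tree that selects splits by IG would choose it over split B.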

Question 5. What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?

Answer:

Applications:

- Medical diagnosis (predicting disease presence)

- Credit scoring / loan approval

- Customer churn prediction

- Fraud detection

- Feature selection / interpretable rule extraction

- Manufacturing defect classification

Advantages:

- Simple to understand and interpret (white-box model).

- Handles numerical and categorical features (with preprocessing).

- No need for feature scaling.

- Can capture non-linear relationships and interactions.

- Fast inference.
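As a small illustration of the interpretability and no-scaling points, the sketch below fits a shallow tree on the raw Iris features and prints its rules with scikit-learn's `export_text`. The choice of `max_depth=2` is only an assumption to keep the printed rules short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# Fit on the raw, unscaled features; max_depth=2 just keeps the rule list short
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(iris.data, iris.target)

# Render the learned splits as human-readable if/else rules
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Each printed line is a threshold test on a single feature, which is what makes the model easy to audit.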

Limitations:

- Tend to overfit without pruning or regularization.

- High variance: small changes in the training data can produce very different trees.

- Greedy splitting picks the locally best split at each node, so the resulting tree is not guaranteed to be globally optimal.

- Not as accurate as ensemble methods (Random Forests, Gradient Boosting) on many tasks.

- Can be biased toward the majority class when classes are imbalanced.

Question 6: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier using the Gini criterion
● Print the model’s accuracy and feature importances

Answer:

In [20]:
# Decision Tree with the Gini criterion on the Iris dataset

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the data into a DataFrame so the feature names are kept
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

# Hold out 25% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Train a Decision Tree with the Gini criterion
classifier = DecisionTreeClassifier(criterion='gini', random_state=42)
classifier.fit(X_train, y_train)

# Predict and evaluate
y_pred = classifier.predict(X_test)
acc = accuracy_score(y_test, y_pred)

# Print results
print(f'Accuracy(test) : {acc:.3f}')
imp = classifier.feature_importances_
for name, impo in zip(data.feature_names, imp):
    print(f'{name} --> {impo:.3f}')


Accuracy(test) : 1.000
sepal length (cm) --> 0.000
sepal width (cm) --> 0.018
petal length (cm) --> 0.900
petal width (cm) --> 0.082


Question 7: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
a fully-grown tree.

Answer:

In [None]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load dataset
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
X = df.drop('target', axis=1)
y = df.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Decision tree limited to depth 3
cls_dep3 = DecisionTreeClassifier(max_depth=3, random_state=42)
cls_dep3.fit(X_train, y_train)
y_pred_dep3 = cls_dep3.predict(X_test)
acc_dep3 = accuracy_score(y_test, y_pred_dep3)
print(f'Accuracy score of Decision tree with depth 3 is : {acc_dep3:.3f}')

# Fully grown decision tree (no depth limit)
cls = DecisionTreeClassifier(random_state=42)
cls.fit(X_train, y_train)
y_pred = cls.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f'Accuracy score of fully grown decision tree is : {acc:.3f}')

Accuracy score of Decision tree with depth 3 is : 1.000
Accuracy score of fully grown decision tree is : 1.000


Question 8: Write a Python program to:
● Load the California Housing dataset from sklearn
● Train a Decision Tree Regressor
● Print the Mean Squared Error (MSE) and feature importances

Answer:

In [11]:
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import pandas as pd

# Load the dataset into a DataFrame
data = fetch_california_housing()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
X = df.drop('target', axis=1)
y = df.target

# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a fully grown Decision Tree Regressor
reg = DecisionTreeRegressor(random_state=42)
reg.fit(X_train, y_train)

# Predict and compute the test MSE
y_pred = reg.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

# Print results
print(f'Mean_squared_error(test) : {mse:.3f}\n')
importances = reg.feature_importances_
for name, imp in zip(data.feature_names, importances):
    print(f"{name} --> {imp:.3f}")



Mean_squared_error(test) : 0.495

MedInc --> 0.529
HouseAge --> 0.052
AveRooms --> 0.053
AveBedrms --> 0.029
Population --> 0.031
AveOccup --> 0.131
Latitude --> 0.094
Longitude --> 0.083


Question 9: Write a Python program to:
● Load the Iris Dataset
● Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV
● Print the best parameters and the resulting model accuracy

Answer:

In [21]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
X = df.drop('target', axis=1)
y = df.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Search grid; an integer min_samples_split must be at least 2
param = {
    'max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'min_samples_split': [2, 3, 4, 5, 6]
}
clf = DecisionTreeClassifier(random_state=42)
model = GridSearchCV(clf, param_grid=param, cv=5, scoring='accuracy')
model.fit(X_train, y_train)

# Evaluate the best estimator on the held-out test set
y_pred = model.best_estimator_.predict(X_test)
acc = accuracy_score(y_test, y_pred)
best_param = model.best_params_

# Print results
print(f'\nBest Parameter of the model is : \n{best_param}\n')
print(f'\nAccuracy of the model is : {acc:.3f}\n')

Question 10: Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:
● Handle the missing values
● Encode the categorical features
● Train a Decision Tree model
● Tune its hyperparameters
● Evaluate its performance
And describe what business value this model could provide in the real-world
setting.

Answer:

1. Understand the data & problem

- Check class balance, data types, missingness patterns (MCAR / MAR / MNAR).

- Identify important features, potential leakage, and privacy/safety constraints.

2. Handle missing values

- Exploratory analysis: quantify missingness per feature and by class.

- Simple strategies:

  - If only a very small fraction of rows have missing values and those rows are not otherwise valuable, drop them.

  - If feature has > ~30–50% missing, consider dropping or carefully imputing based on domain knowledge.

- Imputation:

  - Numerical: median (robust) or mean.

  - Categorical: fill with the most frequent value.

  - Use pipelines to avoid data leakage: fit the imputer on the training data only, then apply it to the test data.
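A minimal sketch of these imputation choices, assuming a tiny made-up patient table (the column names `age`, `blood_pressure`, and `smoker` are hypothetical). It uses scikit-learn's `SimpleImputer`, fitted on the training split only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical toy patient table; columns and values are made up
df = pd.DataFrame({
    "age": [34, 51, np.nan, 46, 29, 63, np.nan, 58],
    "blood_pressure": [120, np.nan, 135, 128, np.nan, 150, 118, 142],
    "smoker": pd.Series(["no", "yes", np.nan, "no", "no", np.nan, "yes", "yes"], dtype="object"),
})

# Quantify missingness per feature before deciding on a strategy
print(df.isna().mean())

train, test = train_test_split(df, test_size=0.25, random_state=0)
num_cols = ["age", "blood_pressure"]
cat_cols = ["smoker"]

# Median for numerical columns, most frequent value for categorical ones
num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="most_frequent")

# Fit on the training split only, then reuse the learned statistics on the test split
train_num = num_imputer.fit_transform(train[num_cols])
train_cat = cat_imputer.fit_transform(train[cat_cols])
test_num = num_imputer.transform(test[num_cols])
test_cat = cat_imputer.transform(test[cat_cols])

print(num_imputer.statistics_)  # medians learned from the training rows
```

Because the medians and modes come from the training rows alone, nothing about the test rows leaks into preprocessing.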


3. Encode categorical features

- Low-cardinality categorical: One-hot encoding (use OneHotEncoder with handle_unknown='ignore').

- High-cardinality categorical: target encoding or ordinal encoding with caution (use cross-validated target encoding to avoid leakage).



4. Train a Decision Tree model

- Build a pipeline: preprocessing (imputer + encoder) → DecisionTreeClassifier. Feature scaling is not required for trees.

- Set random_state for reproducibility.

- Use class_weight='balanced' if classes are imbalanced (or resampling techniques like SMOTE/undersampling with caution).

- Use cross-validation to estimate baseline performance.
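One way to put these training steps together is sketched below on synthetic data standing in for patient records. The columns, the rule that generates the label, and `max_depth=4` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient records (all values are made up)
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "age": rng.integers(20, 80, n).astype(float),
    "bmi": rng.normal(27, 4, n),
    "smoker": pd.Series(rng.choice(["yes", "no"], n), dtype="object"),
})
# Hypothetical label rule: older smokers are the positive (minority) class
y = ((X["age"] > 55) & (X["smoker"] == "yes")).astype(int)

# Knock out roughly 10% of two columns to mimic missing values
X.loc[rng.random(n) < 0.1, "bmi"] = np.nan
X.loc[rng.random(n) < 0.1, "smoker"] = np.nan

# Numerical columns: median imputation; categorical: mode imputation + one-hot
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age", "bmi"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), ["smoker"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(class_weight="balanced", max_depth=4, random_state=42)),
])

# The whole pipeline is refit inside each fold, so preprocessing never sees held-out rows
scores = cross_val_score(model, X, y, cv=5, scoring="recall")
print(f"Cross-validated recall: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Passing the full pipeline to `cross_val_score`, rather than preprocessing the data first, is what keeps the baseline estimate free of leakage.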

5. Tune hyperparameters

- Use GridSearchCV or RandomizedSearchCV in a cross-validated pipeline.

- Key Decision Tree hyperparameters: max_depth, min_samples_split, min_samples_leaf, max_features, ccp_alpha (cost complexity pruning).

- Use stratified CV for classification.
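A hedged sketch of such a search follows. It uses scikit-learn's built-in breast cancer dataset as a stand-in for the healthcare data, and the grid values are arbitrary illustrative choices rather than recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset (assumption): binary diagnosis labels with numeric features
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Small illustrative grid over the hyperparameters named above
param_grid = {
    "max_depth": [3, 5, None],
    "min_samples_split": [2, 10],
    "min_samples_leaf": [1, 5],
    "ccp_alpha": [0.0, 0.01],
}

# Stratified folds keep the class ratio similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=cv,
    scoring="roc_auc",
)
search.fit(X_train, y_train)

print(search.best_params_)
print(f"Best cross-validated ROC-AUC: {search.best_score_:.3f}")
print(f"Test accuracy of the refit model: {search.score(X_test, y_test):.3f}")
```

With the default `refit=True`, `search` is refit on the full training split using the best parameters. Note that `search.score` reports the metric named in `scoring` (ROC-AUC here), so the last printed label would be more accurately read as a test ROC-AUC.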



6. Evaluate performance

- Use appropriate metrics:

   - For disease prediction: precision, recall (sensitivity), specificity, F1-score, ROC-AUC, PR-AUC.

   - Recall (sensitivity) is often the most important metric for disease detection, because missing a true case is costly; false positives also carry a cost, so track precision and specificity alongside it.

   - Use confusion matrix and class-specific metrics.
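These metrics can be computed as in the sketch below, again assuming the built-in breast cancer dataset as a stand-in. Its labels are flipped so that 1 marks the malignant (disease) class, and `max_depth=4` is an arbitrary choice.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import (
    average_precision_score,
    classification_report,
    confusion_matrix,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
y = 1 - y  # the dataset codes malignant as 0; flip so 1 = malignant (positive class)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
clf = DecisionTreeClassifier(max_depth=4, class_weight="balanced", random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]  # predicted probability of the positive class

# For binary labels, ravel() yields counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
sensitivity = recall_score(y_test, y_pred)  # tp / (tp + fn)
specificity = tn / (tn + fp)
roc_auc = roc_auc_score(y_test, y_prob)
pr_auc = average_precision_score(y_test, y_prob)

print(f"Sensitivity (recall): {sensitivity:.3f}")
print(f"Specificity: {specificity:.3f}")
print(f"ROC-AUC: {roc_auc:.3f}  PR-AUC: {pr_auc:.3f}")
print(classification_report(y_test, y_pred, target_names=["benign", "malignant"]))
```

Specificity has no dedicated scikit-learn function, so it is derived here from the confusion matrix counts.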


7. Business value

- Early detection: identify probable patients for further testing, enabling earlier treatment and better outcomes.

- Resource allocation: prioritize high-risk patients for limited diagnostic resources.

- Cost savings: reduce unnecessary tests if model helps target testing efficiently, or reduce late-stage treatment costs by early intervention.

- Operational efficiency: automate triage decisions and assist clinicians with interpretable rules.

- Monitoring & feedback loop: tracking the model's performance after deployment catches drift early, so predictions stay reliable and clinical risk is reduced.