# Decision Tree

Question 1: What is a Decision Tree, and how does it work in the context of classification?

What is a Decision Tree?

A Decision Tree is a machine learning model that makes decisions by asking a series of questions, just like how we make decisions in real life.

It looks like an inverted tree:

Root node → first question

Branches → possible answers

Leaf nodes → final decision (class label)

✅ How it works in classification

In classification, the goal is to predict a category (like “spam or not spam”, “diabetic or not”, “apple or orange”).

A Decision Tree works in these steps:

1. Start at the root

It begins with the whole dataset and picks the best feature to split the data.
Example: For predicting if a fruit is “Apple” or “Orange”, the first question could be:
“Is the color orange?”

2. Split into branches

Depending on the answer (“Yes” or “No”), the data is divided into groups.

3. Continue asking questions

Each group is further split based on another best feature (like weight, size, texture).

4. Reach the leaf node

When a node is pure, or a stopping condition such as a maximum depth is reached, it becomes a leaf and predicts the majority class of the samples in it.

For example:

If color = orange → Orange

If color ≠ orange and weight < some value → Apple

✅ How does the tree decide which question to ask?

It uses mathematical criteria like:

Gini impurity

Entropy / Information Gain

These help choose the feature that best separates the classes.

✅ Simple Example

Suppose you want to classify whether a student will “Pass” or “Fail” based on study hours:

Root:
➡️ Did the student study ≥ 3 hours?

Yes → Pass

No → Fail

Just like that, a tree uses simple questions to classify.
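
The pass/fail example can be sketched as a one-question tree (a "decision stump"). The study-hours data below is hypothetical and invented for illustration; the point is only that the tree learns the threshold from data instead of being told "3 hours".

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: hours studied and outcome (0 = Fail, 1 = Pass)
hours = [[1.0], [2.0], [2.5], [3.0], [4.0], [5.0]]
passed = [0, 0, 0, 1, 1, 1]

# max_depth=1 limits the tree to a single question
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
stump.fit(hours, passed)

# The learned threshold falls between the last Fail (2.5) and the first Pass (3.0)
threshold = stump.tree_.threshold[0]
predictions = stump.predict([[1.5], [4.5]])
print("Learned threshold:", threshold)
print("Predictions for 1.5 h and 4.5 h:", predictions)
```

With more features, the same procedure is repeated inside each branch.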

✅ Summary

A Decision Tree:

is a flowchart-like model

works by asking the best possible questions

splits data step-by-step

ends with a final classification decision

Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?


✅ Question 2: What are Gini Impurity and Entropy? How do they affect Decision Tree splits?

Decision Trees try to split the data in such a way that each group becomes as pure as possible.

A pure node = contains only one class
Example: All “Apple” or all “Orange”.

To measure how impure a node is, we use impurity measures:
✅ Gini Impurity
✅ Entropy

✅ 1. Gini Impurity
Definition

Gini impurity tells us how mixed the classes are in a node.

Formula:

$$\text{Gini} = 1 - \sum_i p_i^2$$

where $p_i$ is the proportion of samples in the node that belong to class $i$.

Intuition

0 → perfectly pure (only one class)

Higher value → more mixed

Example

If a node has:

50% apples

50% oranges

$$\text{Gini} = 1 - (0.5^2 + 0.5^2) = 0.5$$

If a node has:

100% apples

$$\text{Gini} = 1 - 1^2 = 0 \quad \text{(pure)}$$
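
The two Gini calculations above can be checked with a few lines of Python. This is a minimal sketch; the helper name `gini` is chosen here for illustration and is not part of any library.

```python
def gini(proportions):
    """Gini impurity for a list of class proportions that sum to 1."""
    return 1.0 - sum(p ** 2 for p in proportions)

mixed = gini([0.5, 0.5])   # 50% apples, 50% oranges
pure = gini([1.0])         # 100% apples
print("Mixed node:", mixed)
print("Pure node:", pure)
```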

✅ 2. Entropy
Definition

Entropy measures the uncertainty or disorder in a node.

Formula:

$$\text{Entropy} = -\sum_i p_i \log_2 p_i$$

Intuition

0 → pure

Higher entropy → more disorder (more mixed)

Example

50% apples, 50% oranges:

$$\text{Entropy} = -\left[0.5 \log_2 0.5 + 0.5 \log_2 0.5\right] = 1$$

100% apples → Entropy = 0 (pure)
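
The same check works for entropy. In this sketch the helper name `entropy` is illustrative; classes with proportion 0 are skipped because their contribution is taken as 0 by convention.

```python
import math

def entropy(proportions):
    """Shannon entropy in bits; classes with proportion 0 contribute nothing."""
    return sum(-p * math.log2(p) for p in proportions if p > 0)

mixed = entropy([0.5, 0.5])   # 50% apples, 50% oranges
pure = entropy([1.0])         # 100% apples
print("Mixed node:", mixed)
print("Pure node:", pure)
```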

✅ How do Gini & Entropy impact splits in a Decision Tree?

A decision tree tries to split data such that impurity decreases after the split.

✅ Goal of every split

Choose the feature (and threshold) that produces child nodes with the lowest weighted impurity.

✅ Information Gain (based on Entropy)

If entropy is used, the tree chooses the split with the highest Information Gain:

$$\text{Information Gain} = \text{Entropy(before split)} - \text{Entropy(after split)}$$

Here Entropy(after split) is the average entropy of the child nodes, weighted by the fraction of samples each child receives.

More gain → better split.

✅ How Gini vs Entropy change the splits

| Measure | What it focuses on | Behaviour |
|---|---|---|
| Gini Impurity | Probability of misclassification | Slightly cheaper to compute (no logarithm) |
| Entropy | Uncertainty / information gain | Slightly slower because of the logarithm; not systematically more accurate |
| Both | Prefer pure splits | Give very similar results in practice |

✅ Example (Simple)

Suppose you can split the data using:

Color

Weight

The tree calculates:

Gini/entropy before the split

Gini/entropy of child nodes after the split

Whichever feature results in the largest reduction in impurity becomes the split.
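
The comparison between Color and Weight can be sketched as follows. The four fruit rows are hypothetical toy data, and the helper names `gini` and `impurity_reduction` are invented for this illustration; the sketch handles only categorical features and is not how scikit-learn implements splitting internally.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def impurity_reduction(rows, feature, target="fruit"):
    """Parent Gini minus the size-weighted Gini of the groups made by `feature`."""
    parent = [r[target] for r in rows]
    groups = {}
    for r in rows:
        groups.setdefault(r[feature], []).append(r[target])
    weighted = sum(len(g) / len(rows) * gini(g) for g in groups.values())
    return gini(parent) - weighted

# Hypothetical toy data, invented for illustration
rows = [
    {"color": "orange", "weight": "heavy", "fruit": "Orange"},
    {"color": "orange", "weight": "light", "fruit": "Orange"},
    {"color": "red",    "weight": "heavy", "fruit": "Apple"},
    {"color": "red",    "weight": "light", "fruit": "Apple"},
]

scores = {f: impurity_reduction(rows, f) for f in ("color", "weight")}
best = max(scores, key=scores.get)
print(scores, "-> split on", best)
```

Here Color separates the classes perfectly (reduction 0.5) while Weight leaves both groups as mixed as the parent (reduction 0), so Color is chosen.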

✅ Summary

Gini Impurity and Entropy measure how mixed a node is.

Lower values = purer nodes.

Decision Trees choose the feature that gives:

Lowest weighted impurity in the child nodes, OR

Highest information gain (entropy)

Gini is slightly cheaper to compute; entropy usually picks the same or very similar splits.

Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.


✅ Question 3: Difference between Pre-Pruning and Post-Pruning in Decision Trees

Decision Trees can easily become too deep and overfit the training data.
To prevent this, we use pruning, which means keeping the tree from becoming unnecessarily large, either by stopping its growth early or by cutting it back afterwards.

There are two types:

✅ Pre-Pruning (Early Stopping)

✅ Post-Pruning (Prune after full growth)

✅ 1. Pre-Pruning (Early Stopping)
Definition

Pre-pruning stops the tree while it is still growing.
We apply certain conditions to stop the splitting early.

Common pre-pruning methods:

Set a maximum depth (e.g., max_depth = 5)

Minimum samples required to split (min_samples_split)

Minimum samples in a leaf node (min_samples_leaf)
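
As a rough sketch, the three pre-pruning settings listed above map directly onto `DecisionTreeClassifier` parameters. The specific values here are illustrative, not tuned, and the model is fitted on the full Iris data only to keep the example short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Illustrative (untuned) pre-pruning limits
pre_pruned = DecisionTreeClassifier(
    max_depth=3,           # stop growing below depth 3
    min_samples_split=10,  # a node needs at least 10 samples to be split
    min_samples_leaf=5,    # every leaf must keep at least 5 samples
    random_state=42,
)
pre_pruned.fit(X, y)

unrestricted = DecisionTreeClassifier(random_state=42).fit(X, y)
print("Pre-pruned   depth/leaves:", pre_pruned.get_depth(), pre_pruned.get_n_leaves())
print("Unrestricted depth/leaves:", unrestricted.get_depth(), unrestricted.get_n_leaves())
```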

Practical Advantage

✅ Faster training → because the tree doesn’t grow fully.
Useful when working with large datasets or limited computational resources.

✅ 2. Post-Pruning (Pruning after full growth)
Definition

Post-pruning allows the tree to grow completely first, and then removes branches that do not improve accuracy.

Common post-pruning techniques:

Cost Complexity Pruning (used in scikit-learn, parameter: ccp_alpha)

Reduced-error pruning (older method)
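
A minimal sketch of cost-complexity pruning in scikit-learn is shown below. Picking the middle value of the pruning path is an arbitrary choice made only for illustration; in practice `ccp_alpha` would normally be selected by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Grow the full tree, then ask for the alphas at which subtrees get pruned
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)

# Arbitrary mid-range alpha for illustration (clamped at 0 for safety)
alpha = max(0.0, float(path.ccp_alphas[len(path.ccp_alphas) // 2]))
pruned = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha).fit(X_train, y_train)

print("Nodes in full tree  :", full.tree_.node_count)
print("Nodes in pruned tree:", pruned.tree_.node_count)
print("Test accuracy (full, pruned):",
      full.score(X_test, y_test), pruned.score(X_test, y_test))
```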

Practical Advantage

✅ Improves generalization → because unnecessary branches are removed after evaluating accuracy.
It can give more accurate models than pre-pruning, because it does not stop early and miss splits that only become useful deeper in the tree.

✅ Main Differences Summary

| Feature | Pre-Pruning | Post-Pruning |
|---|---|---|
| When applied? | During tree growth | After full tree is built |
| How? | Stops splits early | Removes weak branches |
| Goal | Prevent complexity | Simplify after evaluation |
| Advantage | Faster training | Often better generalization |

Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?

✅ What is Information Gain?

Information Gain (IG) is a measure used in Decision Trees to decide which feature to split on at each step.

It tells us how much “information” a split adds by reducing uncertainty (entropy).

Formula:

$$\text{Information Gain} = \text{Entropy(before split)} - \text{Entropy(after split)}$$

where Entropy(after split) is the average entropy of the child nodes, weighted by the fraction of samples in each child.

Intuition

High Information Gain → the split makes the data purer

Low Information Gain → the split does not improve purity much
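
To make the formula concrete, here is a small worked example with made-up counts (not taken from any dataset in this notebook). A parent node holds 5 apples and 5 oranges; a candidate split sends 4 apples to the left child and 1 apple plus 5 oranges to the right child. Entropy after the split is taken as the size-weighted average of the child entropies.

```latex
\begin{aligned}
H(\text{parent}) &= -\left[\tfrac{5}{10}\log_2\tfrac{5}{10} + \tfrac{5}{10}\log_2\tfrac{5}{10}\right] = 1 \\
H(\text{left})   &= 0 \quad \text{(4 apples, pure)} \\
H(\text{right})  &= -\left[\tfrac{1}{6}\log_2\tfrac{1}{6} + \tfrac{5}{6}\log_2\tfrac{5}{6}\right] \approx 0.650 \\
H(\text{after})  &= \tfrac{4}{10}\cdot 0 + \tfrac{6}{10}\cdot 0.650 \approx 0.390 \\
\text{Information Gain} &= 1 - 0.390 \approx 0.610
\end{aligned}
```

A split that left both children at 50/50 would instead give an Information Gain of 0.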

✅ Why is Information Gain important?

A Decision Tree must choose the best question (feature) at every node.

Information Gain helps by:

✅ showing which feature reduces impurity the most

✅ ensuring the tree asks the most informative question first

✅ keeping the tree compact, since informative splits reach pure leaves in fewer steps

Example:

If splitting on Color gives child nodes that are 80% pure
and splitting on Size gives child nodes that are only 40% pure,
→ The tree will choose Color because it has higher Information Gain.

✅ In short:

Information Gain = how much entropy decreases after a split.

The decision tree selects the feature with the highest information gain.

Because this choice is greedy (made one node at a time), it tends to give compact, accurate trees but does not guarantee the globally best tree.

Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?

✅ Question 5: Real-World Applications of Decision Trees + Advantages & Limitations
✅ Real-World Applications of Decision Trees
1. Medical Diagnosis

Decision Trees help doctors classify whether a patient has a certain disease based on symptoms, test results, and medical history.

2. Banking & Finance

Used for:

credit scoring (approve or reject loans)

identifying risky customers

fraud detection

3. Marketing & Customer Segmentation

Companies use trees to:

predict which customers will buy a product

target advertisements

analyze customer behavior

4. E-commerce & Recommendation Systems

Used to predict:

whether a user will click on a product

what items to recommend

5. Manufacturing & Quality Control

To detect defective products by checking measurements, weight, or color.

6. Weather Prediction

Classifying whether it will rain or not based on temperature, humidity, pressure, etc.

✅ Main Advantages of Decision Trees
1. Easy to Understand & Interpret

Looks like a flowchart, so even non-technical people can follow the logic.

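2. Works with Both Numerical and Categorical Data

Handles numbers (e.g., age, salary) and categories (e.g., color, gender). Note that scikit-learn's tree implementation expects numeric input, so categorical columns must be encoded first.

3. Requires Little Data Preprocessing

No need for:

feature scaling

normalization

Categorical features are the exception in scikit-learn: they still need one-hot or ordinal encoding, because DecisionTreeClassifier does not handle string categories internally.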

4. Fast and Efficient

Tree construction is quick and works well even for large datasets.

✅ Main Limitations of Decision Trees
1. Highly Prone to Overfitting

If not pruned, the tree becomes too deep and memorizes the training data.

2. Unstable

A small change in data can produce a completely different tree.

3. May Create Biased Trees

If one feature has many levels (e.g., zip code), the tree may prefer it even if not meaningful.

4. Not the Best for Continuous Predictions

Decision Trees for regression often give block-like, non-smooth predictions.
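
The block-like behaviour is easy to see on a smooth synthetic curve. The sine data below is invented for illustration; a depth-2 tree has at most four leaves, so it can output at most four distinct values no matter how smooth the target is.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic smooth target, invented for illustration
X = np.linspace(0, 6, 200).reshape(-1, 1)
y = np.sin(X).ravel()

reg = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
pred = reg.predict(X)

# Each leaf predicts one constant, so predictions form a step function
n_levels = len(np.unique(pred))
print("Distinct predicted values:", n_levels)
```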

✅ In Short
✅ Applications:

Medical diagnosis, finance, marketing, e-commerce, manufacturing, weather, etc.

✅ Advantages:

Easy to interpret, fast, handles numerical and categorical data, minimal preprocessing.

✅ Limitations:

Overfitting, instability, bias toward many-level features, not smooth for regression.

Dataset Info:
● Iris Dataset for classification tasks (sklearn.datasets.load_iris() or
provided CSV).
● Boston Housing Dataset for regression tasks
(sklearn.datasets.load_boston() or provided CSV).

Question 6: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier using the Gini criterion
● Print the model’s accuracy and feature importances

✅ Python Program: Decision Tree on Iris Dataset (Gini criterion)

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Load the Iris Dataset
iris = load_iris()
X = iris.data        # features
y = iris.target      # labels

# 2. Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 3. Train Decision Tree Classifier using Gini criterion
model = DecisionTreeClassifier(criterion='gini', random_state=42)
model.fit(X_train, y_train)

# 4. Predict on test data
y_pred = model.predict(X_test)

# 5. Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

# 6. Print feature importances
print("Feature Importances:")
for name, importance in zip(iris.feature_names, model.feature_importances_):
    print(f"{name}: {importance}")


Model Accuracy: 1.0
Feature Importances:
sepal length (cm): 0.0
sepal width (cm): 0.01911001911001911
petal length (cm): 0.8932635518001373
petal width (cm): 0.08762642908984374


✅ What the Program Does

Loads the Iris dataset

Splits it into training (70%) and testing (30%)

Trains a Decision Tree Classifier using Gini Impurity

Prints:

✅ Model accuracy

✅ Feature importances (which features were most important for the tree)
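
Feature importances summarize the tree, but the learned questions themselves can also be printed. The sketch below refits the same kind of model and uses scikit-learn's `export_text`; the exact thresholds printed depend on the split and the library version, so no output is shown.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

model = DecisionTreeClassifier(criterion='gini', random_state=42)
model.fit(X_train, y_train)

# Text rendering of the tree: one line per question, leaves show the class
rules = export_text(model, feature_names=iris.feature_names)
print(rules)
```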

Question 7: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.

The program below trains:

✅ a fully-grown Decision Tree
✅ a Decision Tree with max_depth = 3

…and compares their accuracies.

✅ Python Program: Compare Full Tree vs max_depth=3

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# 2. Split into training and test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 3. Fully-grown Decision Tree (no max_depth limit)
full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(X_train, y_train)
full_pred = full_tree.predict(X_test)
full_acc = accuracy_score(y_test, full_pred)

# 4. Decision Tree with max_depth = 3
limited_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
limited_tree.fit(X_train, y_train)
limited_pred = limited_tree.predict(X_test)
limited_acc = accuracy_score(y_test, limited_pred)

# 5. Print accuracy comparison
print("Accuracy of Fully-Grown Tree     :", full_acc)
print("Accuracy of max_depth=3 Tree     :", limited_acc)


Accuracy of Fully-Grown Tree     : 1.0
Accuracy of max_depth=3 Tree     : 1.0


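✅ What this program shows

A fully-grown tree may overfit, but it can still score very high on an easy dataset such as Iris.

A limited-depth tree (max_depth=3):

is simpler

often generalizes better

may have slightly lower or sometimes equal accuracy.

On this particular split both trees reach a test accuracy of 1.0, so limiting the depth costs nothing here.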
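
Test accuracy alone does not show how different the two trees are. A rough way to compare them, sketched below under the same split settings, is to also look at depth, leaf count, and training accuracy; a large gap between training and test accuracy is the usual sign of overfitting.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
limited_tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)

# Report size and train/test accuracy side by side
for name, tree in [("fully-grown", full_tree), ("max_depth=3", limited_tree)]:
    print(
        f"{name:12s} depth={tree.get_depth()} leaves={tree.get_n_leaves()} "
        f"train_acc={tree.score(X_train, y_train):.3f} "
        f"test_acc={tree.score(X_test, y_test):.3f}"
    )
```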

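Question 8: Write a Python program to:
● Load the Boston Housing Dataset
● Train a Decision Tree Regressor
● Print the Mean Squared Error (MSE) and feature importances

The program below trains a Decision Tree Regressor on the Boston Housing Dataset as the question asks. Note that load_boston was removed in scikit-learn 1.2, so on current versions this cell fails with the ImportError shown after it; the California Housing version further down is the one that runs.

Using Boston Housing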

In [None]:
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# 1. Load the Boston Housing Dataset
boston = load_boston()
X = boston.data
y = boston.target

# 2. Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Train Decision Tree Regressor
model = DecisionTreeRegressor(random_state=42)
model.fit(X_train, y_train)

# 4. Predict and compute MSE
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

print("Mean Squared Error (MSE):", mse)

# 5. Print feature importances
print("\nFeature Importances:")
for name, importance in zip(boston.feature_names, model.feature_importances_):
    print(f"{name}: {importance}")


ImportError: 
`load_boston` has been removed from scikit-learn since version 1.2.

The Boston housing prices dataset has an ethical problem: as
investigated in [1], the authors of this dataset engineered a
non-invertible variable "B" assuming that racial self-segregation had a
positive impact on house prices [2]. Furthermore the goal of the
research that led to the creation of this dataset was to study the
impact of air quality but it did not give adequate demonstration of the
validity of this assumption.

The scikit-learn maintainers therefore strongly discourage the use of
this dataset unless the purpose of the code is to study and educate
about ethical issues in data science and machine learning.

In this special case, you can fetch the dataset from the original
source::

    import pandas as pd
    import numpy as np

    data_url = "http://lib.stat.cmu.edu/datasets/boston"
    raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
    data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
    target = raw_df.values[1::2, 2]

Alternative datasets include the California housing dataset and the
Ames housing dataset. You can load the datasets as follows::

    from sklearn.datasets import fetch_california_housing
    housing = fetch_california_housing()

for the California housing dataset and::

    from sklearn.datasets import fetch_openml
    housing = fetch_openml(name="house_prices", as_frame=True)

for the Ames housing dataset.

[1] M Carlisle.
"Racist data destruction?"
<https://medium.com/@docintangible/racist-data-destruction-113e3eff54a8>

[2] Harrison Jr, David, and Daniel L. Rubinfeld.
"Hedonic housing prices and the demand for clean air."
Journal of environmental economics and management 5.1 (1978): 81-102.
<https://www.researchgate.net/publication/4974606_Hedonic_housing_prices_and_the_demand_for_clean_air>


Using California Housing Dataset

In [None]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load California Housing Dataset
data = fetch_california_housing()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train Decision Tree Regressor
model = DecisionTreeRegressor(random_state=42)
model.fit(X_train, y_train)

# Predict and compute MSE
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

print("Mean Squared Error (MSE):", mse)

# Feature importances
print("\nFeature Importances:")
for name, importance in zip(data.feature_names, model.feature_importances_):
    print(f"{name}: {importance}")

Mean Squared Error (MSE): 0.495235205629094

Feature Importances:
MedInc: 0.5285090936963706
HouseAge: 0.05188353710616045
AveRooms: 0.05297496833123543
AveBedrms: 0.02866045788296106
Population: 0.030515676373806224
AveOccup: 0.13083767753210346
Latitude: 0.09371656401749287
Longitude: 0.08290202505986989


✅ What the Program Does

Loads the dataset

Splits into training/testing

Trains a Decision Tree Regressor

Prints:

✅ Mean Squared Error

✅ Feature importances

Question 9: Write a Python program to:
● Load the Iris Dataset
● Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV
● Print the best parameters and the resulting model accuracy

✅ Loads the Iris dataset
✅ Tunes max_depth and min_samples_split using GridSearchCV
✅ Prints the best parameters and model accuracy

✅ Python Program: Decision Tree Hyperparameter Tuning with GridSearchCV

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Load the Iris Dataset
iris = load_iris()
X = iris.data
y = iris.target

# 2. Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 3. Define the Decision Tree model
dt = DecisionTreeClassifier(random_state=42)

# 4. Define the parameter grid
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 5, 10]
}

# 5. GridSearchCV for tuning
grid = GridSearchCV(
    estimator=dt,
    param_grid=param_grid,
    cv=5,               # 5-fold cross-validation
    scoring='accuracy'
)

# 6. Fit the grid search
grid.fit(X_train, y_train)

# 7. Predict using the best model
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)

# 8. Compute accuracy
accuracy = accuracy_score(y_test, y_pred)

# 9. Output results
print("Best Parameters:", grid.best_params_)
print("Model Accuracy:", accuracy)

Best Parameters: {'max_depth': 4, 'min_samples_split': 10}
Model Accuracy: 1.0


✅ What This Program Does

Uses GridSearchCV to try multiple combinations of:

max_depth

min_samples_split

Selects the combination with the highest mean cross-validated accuracy on the training set

Refits that best model on the full training set and evaluates it on the held-out test set
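
The best parameters are only one row of the search. As a sketch, the full table of cross-validation scores can be inspected through `cv_results_`, which often shows that several combinations score almost the same. For brevity this example fits the search on the full Iris data rather than on a training split.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 5, 10],
}
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42), param_grid, cv=5, scoring='accuracy'
)
grid.fit(X, y)

# One row per parameter combination, best-ranked first
results = pd.DataFrame(grid.cv_results_)
top = results.sort_values('rank_test_score')[
    ['params', 'mean_test_score', 'std_test_score']
].head()
print(top)
```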

Question 10: Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:
● Handle the missing values
● Encode the categorical features
● Train a Decision Tree model
● Tune its hyperparameters
● Evaluate its performance
And describe what business value this model could provide in the real-world
setting.

Step-by-step process
1) Understand the data first

Check how many rows/columns, which columns are numeric vs categorical, how much missingness each column has, class balance (disease vs no disease).

Ask domain questions: could a missing value be meaningful? (e.g., “test not ordered” may carry info)

2) Handle missing values

Strategy depends on the column type and why it’s missing. Common approaches:

Numerical features

If missingness is small & data missing at random: impute with median (robust) or mean.

If distribution is skewed, median is safer.

If missingness is informative, add a binary missingness indicator column.

For complex patterns, consider KNN imputation or MICE / iterative imputation (can be slower).

Categorical features

Impute with most frequent (mode) or a new category like "<missing>".

If missingness itself is meaningful, keep a separate flag.

Target (label)

Usually drop rows with missing labels (unless you plan weak supervision).

Practical rule: Don’t leak test data. Fit imputers on the training set only and apply them to the test set (use Pipelines to enforce this).
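
A small sketch of median imputation with a missingness indicator, fitted on training data only, might look like this. The tiny arrays are made up for illustration; `add_indicator=True` appends one 0/1 column per feature that had missing values during fitting.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up numeric data with missing values (np.nan)
X_train = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 5.0]])
X_test = np.array([[np.nan, np.nan]])

# Medians are learned from the training rows only
imputer = SimpleImputer(strategy='median', add_indicator=True)
imputer.fit(X_train)

# Test rows are filled with the training medians, plus indicator columns
filled = imputer.transform(X_test)
print(filled)
```

The first two columns hold the imputed values (training medians 1.5 and 4.0); the last two flag that both values were missing.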

3) Encode categorical features

One-hot encoding: for nominal categorical features with a small number of levels.

Ordinal encoding: only if categories have a real order (e.g., stage I/II/III).

Target/mean encoding: possible for high-cardinality features but must be used carefully (use cross-validated target encoding to avoid leakage).

Use OneHotEncoder(handle_unknown='ignore') to handle unseen categories in test data.
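
The two main encoders can be sketched as follows. The category values are invented for illustration; the point is that an unseen category becomes an all-zero row with `handle_unknown='ignore'`, and that `OrdinalEncoder` respects an explicitly given order.

```python
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# One-hot: fit on two known categories, then transform an unseen one
ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit([['Never'], ['Former']])
unseen = ohe.transform([['Current']]).toarray()
print("Unseen category encodes to:", unseen)

# Ordinal: the order I < II < III is passed explicitly
ordinal = OrdinalEncoder(categories=[['I', 'II', 'III']])
stages = ordinal.fit_transform([['II'], ['I'], ['III']])
print("Stages encode to:", stages.ravel())
```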

4) Train a Decision Tree model

Split data: train_test_split(..., stratify=y) to keep class balance.

Decision trees don’t need feature scaling.

Consider class imbalance: set class_weight='balanced' or use resampling (SMOTE) if necessary.

Start with a simple baseline DecisionTree (e.g., DecisionTreeClassifier(random_state=42)).

5) Tune hyperparameters

Important hyperparameters to tune:

max_depth: controls tree depth (prevents overfitting).

min_samples_split and min_samples_leaf: minimum samples to split / be a leaf.

max_features: how many features to consider at each split.

criterion: 'gini' or 'entropy'.

ccp_alpha: cost-complexity pruning parameter (scikit-learn).

Use GridSearchCV or RandomizedSearchCV with cross-validation (e.g., 5-fold stratified CV). Choose scoring metric according to business needs (see next).

6) Evaluate performance

Use multiple evaluation tools; don’t rely on accuracy alone, especially in healthcare:

Classification metrics

Recall (sensitivity): fraction of sick patients correctly identified. It is often the most important metric in disease detection (missed cases are costly).

Precision: fraction of predicted positives that are true positives.

F1 score: balance of precision & recall.

ROC AUC: overall ranking ability.

PR curve / average precision: better when classes are imbalanced.

Other checks

Confusion matrix: shows TP, FP, TN, FN.

Calibration: are predicted probabilities well calibrated? (important for decision thresholds)

Cross-validation: get mean + std of metrics to estimate stability.

External validation: test on data from different hospitals / time periods.

Explainability: feature importances, SHAP or partial dependence plots. These are essential for clinician trust.

Fairness & bias checks: ensure model doesn’t systematically underperform for protected groups.

Monitoring: track model drift and performance in production.
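
To illustrate why recall, precision, and the decision threshold matter together, here is a small sketch with made-up labels and predicted probabilities (not output from any model in this notebook). Lowering the threshold catches more sick patients at the cost of more false alarms.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up ground truth (1 = disease) and predicted probabilities
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_proba = np.array([0.10, 0.40, 0.35, 0.80, 0.60, 0.20, 0.45, 0.55])

report = {}
for threshold in (0.5, 0.3):
    # Predict "disease" when the probability reaches the threshold
    y_pred = (y_proba >= threshold).astype(int)
    report[threshold] = {
        "recall": recall_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "confusion": confusion_matrix(y_true, y_pred),
    }
    print(f"threshold={threshold}: recall={report[threshold]['recall']:.2f}, "
          f"precision={report[threshold]['precision']:.2f}")
```

On this toy data, recall rises from 0.50 to 1.00 when the threshold drops from 0.5 to 0.3, while the number of false positives grows from 1 to 2.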

7) Deployment & governance considerations (healthcare specifics)

Interpretability: doctors need explanations (decision paths, feature importances, SHAP).

Regulatory & privacy: follow HIPAA/GDPR as appropriate.

Clinical validation: prospective studies, clinician review before deploying.

Actionability: define clear downstream actions (triage, order confirmatory test, urgent review).

8) Business value (real-world)

Early detection → faster treatment, better outcomes.

Triage & prioritization → helps prioritize high-risk patients for immediate attention.

Cost savings → avoid unnecessary tests / reduce late-stage treatment costs.

Resource allocation → better use of staff and diagnostic equipment.

Decision support → assist clinicians with second opinions and reduce human error.

Population health insights → identify risk factors and target preventive interventions.

Compact scikit-learn example pipeline

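This example assumes a DataFrame df with a binary disease column named label. The synthetic df that produced the output below is created in the cell after this one, so that cell must be run first. The example shows separate preprocessing for numeric and categorical columns, a DecisionTree, and GridSearchCV.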

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix

# Example: df is your DataFrame, 'label' is 0/1 disease column
# df = pd.read_csv('your_data.csv')
# For illustration, assume:
# numeric_features = ['age', 'bmi', 'blood_pressure']
# categorical_features = ['gender', 'smoking_status']

numeric_features = ['age', 'bmi', 'blood_pressure']
categorical_features = ['gender', 'smoking_status']

X = df[numeric_features + categorical_features]
y = df['label']

# Train/test split (stratify to keep class balance)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Preprocessing pipelines
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),  # median imputation
])

categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('ohe', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features),
    ],
    remainder='drop'
)

# Full pipeline with a Decision Tree
pipe = Pipeline([
    ('pre', preprocessor),
    ('clf', DecisionTreeClassifier(random_state=42))
])

# Hyperparameter grid
param_grid = {
    'clf__criterion': ['gini', 'entropy'],
    'clf__max_depth': [3, 5, 8, None],
    'clf__min_samples_split': [2, 5, 10],
    'clf__min_samples_leaf': [1, 2, 5],
    'clf__class_weight': [None, 'balanced']
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

grid = GridSearchCV(pipe, param_grid, cv=cv, scoring='roc_auc', n_jobs=-1, verbose=1)
grid.fit(X_train, y_train)

print("Best params:", grid.best_params_)
best_model = grid.best_estimator_

# Evaluate on test set
y_pred = best_model.predict(X_test)
y_proba = best_model.predict_proba(X_test)[:,1]

print(classification_report(y_test, y_pred, digits=4))
print("ROC AUC:", roc_auc_score(y_test, y_proba))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))


Fitting 5 folds for each of 144 candidates, totalling 720 fits
Best params: {'clf__class_weight': None, 'clf__criterion': 'gini', 'clf__max_depth': 8, 'clf__min_samples_leaf': 1, 'clf__min_samples_split': 5}
              precision    recall  f1-score   support

           0     0.5000    0.4545    0.4762        11
           1     0.4000    0.4444    0.4211         9

    accuracy                         0.4500        20
   macro avg     0.4500    0.4495    0.4486        20
weighted avg     0.4550    0.4500    0.4514        20

ROC AUC: 0.4747474747474747
Confusion matrix:
 [[5 6]
 [5 4]]

Because the labels in the synthetic df are drawn at random and are unrelated to the features, performance near chance (ROC AUC around 0.5) is expected. This output only confirms that the pipeline runs end to end.


In [None]:
import pandas as pd
import numpy as np

# Create a sample DataFrame to simulate the healthcare data
data = {
    'age': np.random.randint(20, 70, 100),
    'bmi': np.random.uniform(18.0, 35.0, 100),
    'blood_pressure': np.random.randint(90, 180, 100),
    'gender': np.random.choice(['Male', 'Female'], 100),
    'smoking_status': np.random.choice(['Never', 'Former', 'Current'], 100),
    'label': np.random.randint(0, 2, 100) # 0 for no disease, 1 for disease
}
df = pd.DataFrame(data)

print("Sample DataFrame 'df' created:")
display(df.head())


Sample DataFrame 'df' created:


| | age | bmi | blood_pressure | gender | smoking_status | label |
|---|---|---|---|---|---|---|
| 0 | 69 | 19.99656 | 147 | Male | Never | 1 |
| 1 | 42 | 21.254005 | 148 | Female | Current | 1 |
| 2 | 34 | 30.790642 | 167 | Female | Former | 0 |
| 3 | 32 | 20.749895 | 164 | Female | Former | 1 |
| 4 | 23 | 18.905749 | 138 | Female | Never | 0 |


Short checklist you can follow for a project

Data exploration & domain questions

Split (train/validation/test) with stratification

Build preprocessing pipeline (impute, encode)

Baseline model (DecisionTree)

Hyperparameter tuning with CV (choose metric carefully)

Evaluate with multiple metrics + explainability

External validation & fairness checks

Deploy with monitoring, clinical validation, and governance