# **Decision Tree | Assignment**

**1. What is a Decision Tree, and how does it work in the context of
classification?**

Definition

- A Decision Tree is a supervised machine learning algorithm used for both classification and regression tasks.

- It is a tree-like structure where each internal node represents a feature (attribute), each branch represents a decision (rule) based on that feature, and each leaf node represents an outcome (class label in classification).

How it Works in Classification

1.Root Node (Start):

- The tree starts with the root node, which contains the entire dataset.

- The algorithm chooses the best feature to split the data.

- Best feature = one that maximizes class separation (using measures like Gini Index, Entropy/Information Gain, or Chi-Square).

2.Splitting (Recursive Partitioning):

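- Data is divided into subsets based on the chosen feature's values: one branch per category for categorical features, or a threshold test (such as value <= t) for numerical features.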

- This process is repeated recursively, forming branches and child nodes.

3.Stopping Criteria:

- Splitting continues until one of the following occurs:

- All records in a node belong to the same class.

- No further improvement can be made by splitting.

- Maximum depth or minimum samples per leaf is reached (to avoid overfitting).

4.Leaf Nodes (Prediction):

- Each leaf node represents a final classification.

- For a new input, the model traverses the tree according to the feature values until it reaches a leaf node, and assigns the corresponding class label.

Example

Suppose we want to classify whether a person will buy a laptop (Yes/No) based on:

- Age (Young, Middle, Old)

- Income (High, Medium, Low)

- The tree might first split on Income (best predictor).

- If Income = High → predict “Yes”.

- If Income = Low → split further on Age.

- Eventually, leaf nodes give the final Yes/No classification.
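The traversal in the example above can be sketched with a hand-written tree. This is only an illustration: the nested-dictionary layout, the `predict` helper, and the leaf labels for Medium income and for the Age branches are assumptions, since the example does not state them.

```python
# Hand-built tree for the laptop example.
# The dictionary layout and several leaf labels are assumptions for illustration.
tree = {
    "feature": "Income",
    "branches": {
        "High": "Yes",
        "Medium": "Yes",  # assumed label, not stated in the example
        "Low": {
            "feature": "Age",
            # assumed labels for the Age branches
            "branches": {"Young": "Yes", "Middle": "No", "Old": "No"},
        },
    },
}


def predict(node, sample):
    """Follow the branch matching each feature value until a leaf (a string) is reached."""
    while isinstance(node, dict):
        value = sample[node["feature"]]
        node = node["branches"][value]
    return node


print(predict(tree, {"Income": "High", "Age": "Old"}))    # Yes
print(predict(tree, {"Income": "Low", "Age": "Middle"}))  # No
```

Internal nodes are dictionaries and leaves are plain strings, so the loop stops as soon as it reaches a class label.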

Advantages

- Easy to understand and interpret (like human decision-making).

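- Works with both categorical and numerical data (some implementations, such as scikit-learn, require categorical features to be encoded as numbers first).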

- Requires little preprocessing.

Disadvantages

- Prone to overfitting if not pruned.

- Sensitive to small changes in data.

**2. Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?**

1.Impurity in Decision Trees

- In Decision Trees, impurity measures determine how well a feature separates the classes.

- A pure node contains only one class → impurity = 0.

- An impure node contains mixed classes → impurity > 0.

- The algorithm chooses splits that reduce impurity the most.

2.Gini Impurity

- Definition: Gini Impurity measures the probability of incorrectly classifying a randomly chosen element if we label it according to the class distribution in the node.

- Formula:

- $$Gini = 1 - \sum_{i=1}^{k} p_i^2$$

Where:

- $k$ = number of classes

- $p_i$ = probability of class $i$ in the node

- Range: 0 → perfectly pure (only one class)

- Maximum = 0.5 (for binary classes with equal probability 0.5, 0.5)

Example:

- Suppose a node has 70% Class A, 30% Class B.

- $Gini = 1 - (0.7^2 + 0.3^2) = 1 - (0.49 + 0.09) = 0.42$

- → Some impurity exists.
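As a quick check of the Gini formula and the 70/30 example, here is a minimal sketch. The function name `gini_impurity` is an illustrative choice, and the input is assumed to be a list of class probabilities that sum to 1.

```python
def gini_impurity(probabilities):
    """Return 1 - sum(p_i^2) for class probabilities that sum to 1."""
    return 1.0 - sum(p ** 2 for p in probabilities)


print(round(gini_impurity([0.7, 0.3]), 2))  # 0.42 (the node from the example)
print(gini_impurity([1.0, 0.0]))            # 0.0  (pure node)
print(gini_impurity([0.5, 0.5]))            # 0.5  (binary maximum)
```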

3.Entropy

- Definition: Entropy (from Information Theory) measures the uncertainty or disorder in a node.

- Formula:

- $Entropy = -\sum_{i=1}^{k} p_i \log_2(p_i)$

- Range:

- 0 → perfectly pure (all samples in one class)

- 1 → maximum disorder (binary classes equally split, 0.5 & 0.5)

- Example:

- For the same node (70% A, 30% B):

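- $Entropy = -(0.7 \log_2(0.7) + 0.3 \log_2(0.3)) \approx 0.881$

- → The value is higher than the Gini value (0.42) for the same node, but the two measures use different scales (binary maximum 1 vs. 0.5), so the numbers are not directly comparable.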
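The entropy formula can be checked the same way. This is a small sketch; the helper name `entropy` is illustrative, and classes with zero probability are skipped because they contribute nothing to the sum.

```python
import math


def entropy(probabilities):
    """Return -sum(p_i * log2(p_i)); classes with p = 0 are skipped."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


print(round(entropy([0.7, 0.3]), 3))  # 0.881 (the node from the example)
print(entropy([0.5, 0.5]))            # 1.0   (binary maximum)
```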

4.Impact on Splits

- Decision Trees use Gini or Entropy to evaluate splits.

- For each possible feature split:

- Compute the weighted average impurity of child nodes.

- Choose the split with the lowest weighted child impurity, i.e. the largest impurity reduction (called Information Gain when entropy is used).
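The split search described above can be sketched for one numeric feature. Everything here is a simplified illustration under stated assumptions: the toy data (study hours and pass/fail labels) is made up, candidate thresholds are midpoints between consecutive distinct values, and Gini is used as the impurity measure.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(left, right):
    """Weighted average impurity of the two child nodes."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_threshold(values, labels):
    """Try midpoints between consecutive distinct values; return (threshold, impurity)."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical values cannot be separated by a threshold
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for value, label in pairs if value <= t]
        right = [label for value, label in pairs if value > t]
        score = weighted_gini(left, right)
        if score < best[1]:
            best = (t, score)
    return best


# Hypothetical toy data
hours = [1, 2, 3, 4, 6, 7, 8, 9]
result = ["Fail", "Fail", "Fail", "Pass", "Pass", "Pass", "Pass", "Pass"]

threshold, impurity = best_threshold(hours, result)
print(threshold, impurity)  # 3.5 0.0
```

A real implementation repeats this search over every feature and recurses into the resulting child nodes.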

Difference:

- Gini is faster to compute (no logarithms).

- Entropy is somewhat more sensitive to low-probability (rare) classes because of the logarithm.

- In practice, both often give similar trees.

| Feature                  | **Gini Impurity**                              | **Entropy**                                                 |
| ------------------------ | ---------------------------------------------- | ----------------------------------------------------------- |
| **Definition**           | Measures the probability of misclassification  | Measures the disorder/uncertainty in the data               |
| **Formula**              | $Gini = 1 - \sum p_i^2$                        | $Entropy = -\sum p_i \log_2(p_i)$                           |
| **Range**                | 0 (pure) → 0.5 (binary max)                    | 0 (pure) → 1 (binary max)                                   |
| **Computation**          | Faster (no logarithm)                          | Slower (uses logarithm)                                     |
| **Interpretation**       | Focuses on correct classification probability  | Focuses on information content and uncertainty              |
| **Bias**                 | Slightly favors larger classes                 | Treats rare classes more carefully                          |
| **Practical Difference** | Often similar results                          | Often similar results                                       |
| **Preferred Use**        | Default in scikit-learn DecisionTreeClassifier | Used when information-theory based interpretation is needed |



**3. What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.**

1.Introduction

- Decision Trees tend to overfit if they grow too deep, capturing noise instead of patterns.

- Pruning is a technique used to reduce the size of the tree and improve generalization.

- Two main types: Pre-Pruning and Post-Pruning.

2.Pre-Pruning (Early Stopping)

- Definition: Stop growing the tree before it becomes too complex.

- How: By setting constraints like:

- Maximum depth of tree (max_depth)

- Minimum samples required to split a node (min_samples_split)

- Minimum samples per leaf (min_samples_leaf)

- Maximum number of leaf nodes (max_leaf_nodes)

- Practical Advantage: Saves time and memory during training because the tree does not grow unnecessarily large.
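To see the effect of these constraints on tree size, the sketch below fits an unconstrained and a constrained tree on Iris and compares depth and leaf count. The specific constraint values are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# No constraints: the tree grows until its leaves are pure
unconstrained = DecisionTreeClassifier(random_state=42).fit(X, y)

# Pre-pruned: growth stops as soon as any constraint would be violated
# (the values below are arbitrary illustrative choices)
constrained = DecisionTreeClassifier(
    max_depth=3,
    min_samples_split=10,
    min_samples_leaf=5,
    max_leaf_nodes=6,
    random_state=42,
).fit(X, y)

for name, model in [("unconstrained", unconstrained), ("constrained", constrained)]:
    print(f"{name}: depth={model.get_depth()}, leaves={model.get_n_leaves()}")
```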

3.Post-Pruning (e.g., Reduced Error Pruning or Cost-Complexity Pruning)

- Definition: First grow the tree fully, then remove branches that do not improve performance.

- How:

- Train a full tree.

- Use a validation set or cross-validation to evaluate performance.

- Iteratively prune subtrees that do not improve accuracy.

- Practical Advantage: Produces a more accurate and generalized tree, as pruning is based on actual performance rather than fixed rules.
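One way to follow these steps in scikit-learn is cost-complexity pruning, with the pruning strength `ccp_alpha` selected by cross-validation on the training data. Note that this is cost-complexity pruning, not reduced error pruning in the strict sense, and the choices below (Iris data, 5 folds, skipping the last alpha) are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Candidate pruning strengths computed from the fully grown tree
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)

# The last alpha prunes the tree down to the root, so it is skipped;
# clipping guards against tiny negative values caused by floating-point error
alphas = np.unique(np.clip(path.ccp_alphas[:-1], 0.0, None))

# Score every alpha with cross-validation on the training set only
cv_scores = [
    cross_val_score(
        DecisionTreeClassifier(random_state=42, ccp_alpha=alpha), X_train, y_train, cv=5
    ).mean()
    for alpha in alphas
]
best_alpha = alphas[int(np.argmax(cv_scores))]

# Refit with the selected alpha and evaluate once on the untouched test set
pruned = DecisionTreeClassifier(random_state=42, ccp_alpha=best_alpha).fit(X_train, y_train)
test_accuracy = pruned.score(X_test, y_test)
print("Selected alpha:", best_alpha)
print("Leaves after pruning:", pruned.get_n_leaves())
print("Test accuracy:", test_accuracy)
```

Keeping the test set out of the alpha selection means the final accuracy is an honest estimate of generalization.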

4. Key Difference

| Aspect                | **Pre-Pruning**                                | **Post-Pruning**                      |
| --------------------- | ---------------------------------------------- | ------------------------------------- |
| **When applied**      | During tree construction                       | After full tree is built              |
| **Decision criteria** | Based on stopping rules (depth, samples, etc.) | Based on validation performance       |
| **Complexity**        | Lower training cost                            | Higher training cost                  |
| **Risk**              | May underfit if stopped too early              | Less underfitting, but more expensive |

5.Summary

- Pre-Pruning → Faster, prevents overly deep trees early.

- Post-Pruning → More accurate, improves generalization by removing weak branches.

Example:

```python
# Import libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# -----------------------------
# 1. Pre-Pruning Example
# -----------------------------
# Limit the depth of the tree (pre-pruning)
pre_pruned_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
pre_pruned_tree.fit(X_train, y_train)

# Evaluate
y_pred_pre = pre_pruned_tree.predict(X_test)
print("Pre-Pruning Accuracy:", accuracy_score(y_test, y_pred_pre))

# -----------------------------
# 2. Post-Pruning Example
# -----------------------------
# First, train a full tree
full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(X_train, y_train)

print("Full Tree Accuracy (before pruning):", accuracy_score(y_test, full_tree.predict(X_test)))

# Use Cost-Complexity Pruning (Post-Pruning)
# Get effective alphas and impurities
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas

# Train different pruned trees for each alpha
# Note: alphas are compared on the test set here only to keep the demo short.
# In practice, select alpha on a validation set or with cross-validation,
# otherwise the reported test accuracy is optimistic.
best_accuracy = 0
best_alpha = 0
for alpha in ccp_alphas:
    pruned_tree = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha)
    pruned_tree.fit(X_train, y_train)
    acc = accuracy_score(y_test, pruned_tree.predict(X_test))
    # Strict ">" keeps the first (smallest) alpha when several alphas tie
    if acc > best_accuracy:
        best_accuracy = acc
        best_alpha = alpha

print("Best alpha for post-pruning:", best_alpha)
print("Post-Pruning Accuracy:", best_accuracy)
```

```text
Pre-Pruning Accuracy: 1.0
Full Tree Accuracy (before pruning): 1.0
Best alpha for post-pruning: 0.0
Post-Pruning Accuracy: 1.0
```


**4. What is Information Gain in Decision Trees, and why is it important for choosing the best split?**

1.Introduction

- Decision Trees split data at each node to create pure subsets.

- Information Gain (IG) is a metric that measures how much “information” a feature provides about the class labels after a split.

- It is based on the reduction in Entropy (or impurity).

2.Definition of Information Gain

- Entropy before split: Measures the impurity of the parent node.

- Entropy after split: Weighted average of entropy of child nodes.

- $$IG(S, A) = Entropy(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} \cdot Entropy(S_v)$$

Where:

- $S$ = dataset at the node

- $A$ = feature used for split

- $S_v$ = subset of $S$ where feature $A = v$

Interpretation:

- Higher IG → feature provides more useful information → better split.

- If IG = 0 → feature doesn’t help in classification.

3.Example

- Suppose we want to classify whether a student will Pass or Fail based on Study Hours.

- Parent Node Entropy (before split): 6 Pass, 6 Fail →

- $Entropy = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1.0$

- After Split (Study Hours < 5 and ≥ 5):

- Left Node: 5 Fail, 1 Pass → Entropy ≈ 0.65

- Right Node: 5 Pass, 1 Fail → Entropy ≈ 0.65

Weighted Entropy:

- $(6/12 \times 0.65) + (6/12 \times 0.65) = 0.65$

Information Gain:

- $IG = 1.0 - 0.65 = 0.35$

- This means splitting on Study Hours improves classification.
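The arithmetic in this example can be reproduced with a few lines of Python. This is a minimal sketch working from class counts; the helper names `entropy_from_counts` and `information_gain` are illustrative.

```python
import math


def entropy_from_counts(counts):
    """Entropy of a node given the number of samples in each class."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(parent_counts, children_counts):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = sum(parent_counts)
    weighted = sum(
        sum(child) / total * entropy_from_counts(child) for child in children_counts
    )
    return entropy_from_counts(parent_counts) - weighted


parent = [6, 6]             # 6 Pass, 6 Fail
children = [[1, 5], [5, 1]] # left: 1 Pass, 5 Fail; right: 5 Pass, 1 Fail

print(round(entropy_from_counts(children[0]), 2))     # 0.65
print(round(information_gain(parent, children), 2))   # 0.35
```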

4.Importance of Information Gain

- Feature Selection: Helps select the feature that best separates classes.

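- Tree Growth: Each chosen split increases the purity of the child nodes.

- Prevents Random Splits: Without a criterion such as IG, splits would be arbitrary and uninformative.

- Better Generalization: Informative splits help the tree capture real patterns, but IG alone does not prevent overfitting; pruning or depth limits are still needed.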

5.Summary

- Information Gain = Reduction in Entropy after a split.

- Higher IG → better feature choice.

- It guides Decision Trees toward informative, meaningful splits.

Example:

```python
# Import libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.model_selection import train_test_split

# Load dataset (Iris flower dataset)
iris = load_iris()
X, y = iris.data, iris.target

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Decision Tree with Entropy (Information Gain)
tree_entropy = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=42)
tree_entropy.fit(X_train, y_train)

# Print accuracy
print("Decision Tree Accuracy:", tree_entropy.score(X_test, y_test))

# Show tree rules
tree_rules = export_text(tree_entropy, feature_names=iris.feature_names)
print("\nDecision Tree Rules (using Information Gain):\n")
print(tree_rules)
```

```text
Decision Tree Accuracy: 0.9777777777777777

Decision Tree Rules (using Information Gain):

|--- petal length (cm) <= 2.45
|   |--- class: 0
|--- petal length (cm) >  2.45
|   |--- petal length (cm) <= 4.75
|   |   |--- petal width (cm) <= 1.60
|   |   |   |--- class: 1
|   |   |--- petal width (cm) >  1.60
|   |   |   |--- class: 2
|   |--- petal length (cm) >  4.75
|   |   |--- petal length (cm) <= 5.15
|   |   |   |--- class: 2
|   |   |--- petal length (cm) >  5.15
|   |   |   |--- class: 2
```



**5. What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?**

1.Introduction

- Decision Trees are widely used in classification and regression problems because they are simple, interpretable, and mimic human decision-making.

2.Real-World Applications

A. Healthcare Diagnosis

- Predict diseases (e.g., diabetes, cancer) based on patient symptoms, test results, and lifestyle factors.

B. Finance & Banking

- Credit scoring and loan approval decisions.

- Fraud detection in transactions.

C. Marketing & Customer Analytics

- Predict customer churn (likelihood of leaving a service).

- Targeted advertising and product recommendations.

D. Manufacturing & Quality Control

- Predict product defects based on production conditions.

- Optimize supply chain decisions.

E. Agriculture

- Crop disease detection.

- Yield prediction based on soil quality, weather, and irrigation.

F. Education

- Predict student performance based on study habits and attendance.

3.Advantages of Decision Trees

- Easy to Understand & Interpret – Mimics human reasoning with “if-then” rules.

- No Need for Feature Scaling – Works well with raw data (no normalization/standardization required).

- Handles Both Numerical & Categorical Data.

- Fast Training & Prediction – Efficient on small-to-medium datasets.

- Feature Selection Built-In – Automatically picks the most important features.

4.Limitations of Decision Trees

- Overfitting – Tends to create very deep trees that don’t generalize well.

- Unstable – Small changes in data can produce a very different tree.

- Biased Splits – Features with many categories may dominate splits.

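- Limited for Continuous Predictions – Regression trees output piecewise-constant values and cannot extrapolate beyond the range of the training targets, so they are often less accurate than smoother or ensemble models.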

- Lower Accuracy than Ensemble Models – Alone, Decision Trees are weaker than Random Forests or Gradient Boosted Trees.

Applications of Decision Trees

| **Domain**                | **Example Use Case**                                                  |
| ------------------------- | --------------------------------------------------------------------- |
| **Healthcare**          | Predict diseases (e.g., diabetes, cancer) based on symptoms and tests |
| **Finance & Banking**   | Loan approval, credit scoring, fraud detection                        |
| **Marketing & Retail**  | Predict customer churn, product recommendation, targeted ads          |
| **Manufacturing**       | Detect defective products, optimize supply chain                      |
| **Agriculture**        | Crop disease detection, yield prediction                              |
| **Education**           | Predict student performance, dropout risk analysis                    |


5.Summary (Importance)

- Applications: Healthcare, finance, marketing, manufacturing, etc.

- Advantages: Simple, interpretable, little preprocessing needed (no feature scaling).

- Limitations: Overfitting, instability, less accurate compared to ensembles.

**6. Write a Python program to:**

**● Load the Iris Dataset**

**● Train a Decision Tree Classifier using the Gini criterion**

**● Print the model’s accuracy and feature importances**

Answer :

```python
# Import libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Decision Tree Classifier using Gini criterion
clf = DecisionTreeClassifier(criterion="gini", random_state=42)
clf.fit(X_train, y_train)

# Predict on test set
y_pred = clf.predict(X_test)

# Print accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Decision Tree Accuracy (Gini):", accuracy)

# Print feature importances
print("\nFeature Importances:")
for feature, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{feature}: {importance:.4f}")
```

```text
Decision Tree Accuracy (Gini): 1.0

Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0191
petal length (cm): 0.8933
petal width (cm): 0.0876
```


**7. Write a Python program to:**

**● Load the Iris Dataset**

**● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.**

Answer :

```python
# Import libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 1. Fully-grown Decision Tree (no depth limit)
full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(X_train, y_train)
y_pred_full = full_tree.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)

# 2. Decision Tree with max_depth=3
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
shallow_tree.fit(X_train, y_train)
y_pred_shallow = shallow_tree.predict(X_test)
accuracy_shallow = accuracy_score(y_test, y_pred_shallow)

# Print results
print("Accuracy of Fully-grown Tree:", accuracy_full)
print("Accuracy of Tree with max_depth=3:", accuracy_shallow)
```

```text
Accuracy of Fully-grown Tree: 1.0
Accuracy of Tree with max_depth=3: 1.0
```


**8. Write a Python program to:**

**● Load the California Housing dataset from sklearn**

**● Train a Decision Tree Regressor**

**● Print the Mean Squared Error (MSE) and feature importances**

Answer :

```python
# Import libraries
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load the California Housing dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

# Predict on test set
y_pred = regressor.predict(X_test)

# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error (MSE):", mse)

# Print feature importances
print("\nFeature Importances:")
for feature, importance in zip(housing.feature_names, regressor.feature_importances_):
    print(f"{feature}: {importance:.4f}")
```

```text
Mean Squared Error (MSE): 0.5280096503174904

Feature Importances:
MedInc: 0.5235
HouseAge: 0.0521
AveRooms: 0.0494
AveBedrms: 0.0250
Population: 0.0322
AveOccup: 0.1390
Latitude: 0.0900
Longitude: 0.0888
```


**9. Write a Python program to:**

**● Load the Iris Dataset**

**● Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV**

**● Print the best parameters and the resulting model accuracy**

Answer :

```python
# Import libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the Decision Tree model
dt = DecisionTreeClassifier(random_state=42)

# Define hyperparameter grid
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 3, 4, 5, 10]
}

# Apply GridSearchCV
grid_search = GridSearchCV(estimator=dt, param_grid=param_grid,
                           cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)

# Best parameters
print("Best Parameters:", grid_search.best_params_)

# Best estimator accuracy on test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Model Accuracy on Test Set:", accuracy)
```

```text
Best Parameters: {'max_depth': 4, 'min_samples_split': 10}
Model Accuracy on Test Set: 1.0
```


**10. Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values.
Explain the step-by-step process you would follow to:**

**● Handle the missing values**

**● Encode the categorical features**

**● Train a Decision Tree model**

**● Tune its hyperparameters**

**● Evaluate its performance**

**And describe what business value this model could provide in the real-world setting.**

High-level summary (one line)

- Build a robust pipeline: clean & impute → encode → (optionally) balance classes → train with cross-validation → tune → calibrate → evaluate on hold-out & external data → explain & deploy with monitoring.

1.Handle missing values (why & how)

- Understand the missingness: check whether values are MCAR / MAR / MNAR — that affects choice of imputation (e.g., MNAR needs domain input).

- Simple, safe imputation first:

- For numeric features: median (robust to outliers) or model-based / iterative imputation if patterns are complex.

- For categorical: most_frequent or a special "MISSING" category.

- Always add a missingness indicator (binary flag per feature) to preserve the signal that a value was missing; this often helps predictive models. (scikit-learn's imputation examples demonstrate this approach.)

- Advanced imputation when needed:

- Use IterativeImputer / KNN imputation if missingness relates to other features and dataset size supports it.

- Do imputation inside a pipeline (never impute on full data before cross-validation) — this avoids data leakage.
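A minimal sketch of median imputation with missingness flags, placed inside a pipeline so the imputer is fitted only on the data passed to `fit`. The toy array and labels are hypothetical; `SimpleImputer(add_indicator=True)` appends one binary column for each feature that had missing values during fitting.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: two numeric features with missing entries
X = np.array([
    [50.0, 1.2],
    [np.nan, 0.8],
    [61.0, np.nan],
    [45.0, 1.0],
    [np.nan, 1.5],
    [70.0, 0.9],
])
y = np.array([0, 0, 1, 0, 1, 1])

# Median imputation plus one indicator column per feature with missing values
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_imputed = imputer.fit_transform(X)
print(X_imputed.shape)  # (6, 4): 2 imputed features + 2 missingness flags
print(X_imputed[1])     # [55.5  0.8  1.   0. ]

# Inside a pipeline, the imputer is re-fitted on each training fold during CV
model = Pipeline([
    ("imputer", SimpleImputer(strategy="median", add_indicator=True)),
    ("clf", DecisionTreeClassifier(random_state=42)),
])
model.fit(X, y)
predictions = model.predict(X)
```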

2.Encode categorical features (practical choices)

- Low-cardinality categories: use OneHotEncoder (handle_unknown='ignore').

- High-cardinality categorical variables (e.g., thousands of codes): prefer target encoding or frequency encoding, or use a tree library with native categorical support. Target encoding must be done inside CV to avoid leakage.

- Ordinal categorical variables: if an intrinsic order exists, use OrdinalEncoder (with care).

- Pipeline approach: use ColumnTransformer to apply numeric and categorical pipelines separately.
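The encoding choices above might be combined in a `ColumnTransformer` roughly as follows. The data frame, column names, and category order are hypothetical, and `sparse_threshold=0` is set only so that the result prints as a dense array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical patient records; column names and categories are illustrative only
df = pd.DataFrame({
    "age": [34, 58, 47, 62],
    "blood_type": ["A", "O", "B", "A"],          # nominal, low cardinality
    "severity": ["low", "high", "medium", "low"],  # ordinal
})

preprocessor = ColumnTransformer(
    transformers=[
        # Nominal feature: one column per category, unseen categories become all zeros
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["blood_type"]),
        # Ordinal feature: explicit order low < medium < high
        ("ordinal", OrdinalEncoder(categories=[["low", "medium", "high"]]), ["severity"]),
    ],
    remainder="passthrough",  # numeric columns are passed through unchanged
    sparse_threshold=0,       # always return a dense array
)

encoded = preprocessor.fit_transform(df)
print(encoded.shape)  # (4, 5): 3 one-hot columns + 1 ordinal column + age
print(encoded[0])     # first row: A -> [1, 0, 0], low -> 0, age -> 34
```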

3.Balance classes (important for disease prediction)

- If disease prevalence is low, the model can be biased toward the majority (healthy). Two common options:

- Resampling (applied only on training folds): SMOTE (synthetic minority oversampling) and its variants (SMOTE + Tomek, SMOTEENN) — available in imblearn. Use inside the training pipeline so resampling happens per CV fold.

- Class weights: set class_weight='balanced' in DecisionTreeClassifier to upweight minority class without changing sample counts. Good when synthetic samples might be risky.

- Which to choose? If minority patterns are under-represented in feature space, SMOTE often helps. If you want simplicity and avoid synthetic patients, use class_weight. (Both are valid; try both in tuning.)
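To make the class-weight option concrete, this sketch shows the weights that `class_weight='balanced'` corresponds to on a hypothetical 90/10 label split, computed as n_samples / (n_classes × class count). SMOTE is not shown here because it needs the separate imbalanced-learn package.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 90 healthy (0) and 10 diseased (1) patients
y = np.array([0] * 90 + [1] * 10)

weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
print(weights.round(3))  # [0.556 5.   ] -> each diseased patient counts 9x more

# The same weighting is applied internally when the classifier is configured like this
clf = DecisionTreeClassifier(class_weight="balanced", random_state=42)
```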

4.Train a Decision Tree model (practical setup)

- Prefer building a scikit-learn pipeline that encapsulates preprocessing → (resampler) → classifier so all steps are applied correctly inside cross-validation.

- Important tree hyperparameters to consider: max_depth, min_samples_split, min_samples_leaf, max_features, and pruning parameter ccp_alpha (cost-complexity pruning). These control complexity and overfitting.

5.Hyperparameter tuning

- Use StratifiedKFold (preserve class ratios in each fold).

- Use GridSearchCV for small grids, or RandomizedSearchCV for larger search spaces. Tune with scoring that matches business needs (see next section).

- If dataset is large or hyperparameter space is big, consider Bayesian optimization (Optuna, scikit-optimize).

- When resampling (SMOTE) is used, put it inside the pipeline so sampling is done only on training folds.
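A sketch of the tuning setup described above, using `RandomizedSearchCV` with stratified folds and recall as the scoring metric. The synthetic dataset from `make_classification` stands in for real patient data, and the parameter values and `n_iter=20` are arbitrary illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (about 10% positives) as a stand-in for patient records
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.9, 0.1], random_state=42
)

# Candidate values are illustrative, not recommendations
param_distributions = {
    "max_depth": [3, 5, 8, None],
    "min_samples_split": [2, 5, 10, 20],
    "min_samples_leaf": [1, 2, 5, 10],
    "ccp_alpha": [0.0, 0.001, 0.01],
}

# Stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = RandomizedSearchCV(
    DecisionTreeClassifier(class_weight="balanced", random_state=42),
    param_distributions=param_distributions,
    n_iter=20,          # sample 20 of the 192 possible combinations
    scoring="recall",   # prioritise detecting positive (diseased) cases
    cv=cv,
    random_state=42,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated recall:", search.best_score_)
```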

6.Evaluation (what to measure — this is critical in healthcare)

- Do not rely on accuracy for rare disease prediction. Use:

- Recall (sensitivity) — how many actual diseased patients we detect (critical when missing a disease is costly).

- Precision — among predicted positives, how many truly have disease (important to avoid unnecessary followups).

- F1-score — harmonic mean (useful when balancing precision & recall).

- ROC-AUC and PR-AUC (average precision) — for imbalanced data PR-AUC / precision-recall curves can be more informative than ROC.

- Confusion matrix for concrete counts.

- Calibration (do predicted probabilities reflect true risk?) — produce calibration curve / Brier score and consider CalibratedClassifierCV (isotonic or sigmoid) if probabilities are poorly calibrated. scikit-learn provides calibration tools.

- Validation strategy:

- Hold out a final test set (never used in CV).

- If possible, validate on an external / temporal / prospective dataset to estimate real-world performance.

- Decision threshold tuning: choose probability threshold based on business cost matrix (false negatives vs false positives). Use decision/utility analysis (net benefit) to pick thresholds.
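The metrics listed above can be computed with scikit-learn as in the sketch below. The synthetic data is a stand-in for real patient records, and the 0.3 decision threshold is a hypothetical value; in practice it would come from a cost analysis with clinicians.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import (
    average_precision_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (about 10% positives) as a stand-in for patient records
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

model = DecisionTreeClassifier(max_depth=4, class_weight="balanced", random_state=42)
model.fit(X_train, y_train)

# Predicted probability of the positive class
proba = model.predict_proba(X_test)[:, 1]

# Hypothetical threshold: lower than 0.5 to favour recall over precision
threshold = 0.3
y_pred = (proba >= threshold).astype(int)

cm = confusion_matrix(y_test, y_pred)
recall = recall_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, zero_division=0)
f1 = f1_score(y_test, y_pred, zero_division=0)
roc_auc = roc_auc_score(y_test, proba)          # threshold-independent
pr_auc = average_precision_score(y_test, proba)  # threshold-independent

print("Confusion matrix:\n", cm)
print("Recall:", recall)
print("Precision:", precision)
print("F1-score:", f1)
print("ROC-AUC:", roc_auc)
print("PR-AUC (average precision):", pr_auc)
```

ROC-AUC and PR-AUC are computed from the probabilities, while recall, precision, and F1 depend on the chosen threshold.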

7.Explainability & model checks

- Trees are interpretable — show feature_importances_. But for deeper insight, use SHAP or partial dependence plots so clinicians can understand per-patient contributions. (SHAP is commonly used in healthcare explainability literature.)

- Check for spurious features (data leakage), fairness (does performance vary by subgroups?), and robustness (sensitivity to small perturbations).

8.Calibration & post-processing

- Calibrate predicted probabilities if you will use them for risk thresholds / triage (see scikit-learn calibration docs). Calibrated probabilities are essential when the output is used to trigger actions (e.g., immediate referral).
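A minimal calibration sketch, assuming synthetic data and a sigmoid (Platt) calibrator with 5-fold internal cross-validation. The Brier score is printed for both models; whether calibration actually improves it depends on the data, so no particular outcome is claimed here.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data as a stand-in for patient records
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Uncalibrated tree: leaf class frequencies are used directly as probabilities
raw_tree = DecisionTreeClassifier(max_depth=4, random_state=42)
raw_tree.fit(X_train, y_train)
raw_brier = brier_score_loss(y_test, raw_tree.predict_proba(X_test)[:, 1])

# Calibrated tree: a sigmoid mapping is fitted on held-out folds of the training data
calibrated = CalibratedClassifierCV(
    DecisionTreeClassifier(max_depth=4, random_state=42), method="sigmoid", cv=5
)
calibrated.fit(X_train, y_train)
cal_brier = brier_score_loss(y_test, calibrated.predict_proba(X_test)[:, 1])

# Lower Brier score means probabilities closer to the observed outcomes
print("Brier score (raw tree):", raw_brier)
print("Brier score (calibrated):", cal_brier)
```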

9.Deployment, monitoring & regulatory/ethical considerations

- Monitoring: track data drift, performance decay, and input distributions; periodically retrain with recent data.

- Logging and auditability: log predictions, outcomes (when available), and model inputs for audits.

- Clinical review & governance: model should be reviewed by clinicians and integrated into workflows with human oversight.

- Regulatory: medical AI may be regulated (FDA has published an AI/ML action plan and guidance for software as a medical device) — follow regulatory best practices (traceability, documentation, risk assessment).

10.Business value (what this model provides)

- Early detection / triage: identify high-risk patients for earlier testing or treatment → improved outcomes.

- Resource prioritization: focus limited diagnostics/treatment resources on the most likely patients.

- Cost reduction: reduce unnecessary tests for low-risk patients and avoid late-stage expensive treatments.

- Clinical decision support: augment clinicians with an explainable risk score, improving decision consistency.

- Population health insights: discover risk factors and actionable interventions across cohorts.

- Important caveat: clinical deployment must weigh harms: false negatives (missed disease) can be fatal; false positives can cause anxiety/over-treatment and cost. Choose thresholds with clinicians and quantify net benefit.

11.Concise practical checklist (for an assignment / exam — 20-mark style)

- Inspect missingness → decide MCAR/MAR/MNAR → impute inside pipeline + add missing flags.

- Encode categorical features via ColumnTransformer (OneHot / target encoding if high cardinality).

- Split: train / validation (CV) / held-out test (and external if possible).

- Use an imbalance strategy: SMOTE inside pipeline or class_weight='balanced'.

- Build pipeline: preprocess → (SMOTE) → DecisionTreeClassifier (or ensemble).

- Tune max_depth, min_samples_split, min_samples_leaf, max_features, ccp_alpha by GridSearchCV/RandomizedSearchCV with StratifiedKFold.

- Evaluate with recall, precision, F1, PR-AUC, ROC-AUC and calibration curves; choose threshold based on cost/benefit.

- Explainability: feature importances + SHAP; clinical validation & prospective testing.

- Document, monitor, and follow regulatory guidance.

12.Final practical tips

- Prefer recall for dangerous diseases (but quantify false positive cost).

- Calibrate probabilities before deploying (use CalibratedClassifierCV from sklearn).

- Document everything (data lineage, feature definitions, versions) for audits and regulatory review. FDA has specific guidance and plans for AI/ML SaMD; read and follow applicable regulations.

Key references

- scikit-learn imputation documentation and examples (SimpleImputer; missingness indicator example).

- imbalanced-learn documentation: SMOTE and related resampling techniques for class imbalance.

- Saito & Rehmsmeier, PLoS One 2015: Precision-Recall is more informative than ROC for imbalanced datasets.

- scikit-learn probability calibration documentation (CalibratedClassifierCV, calibration curves).

- FDA AI/ML action plan and regulation considerations for medical software.

13.Colab / sklearn pipeline (runnable skeleton)

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.metrics import classification_report, roc_auc_score, average_precision_score

# ---- Load dataset ----
# Iris is used as a stand-in; replace it with the real patient data
iris = load_iris(as_frame=True)
df = iris.frame
X = df.drop("target", axis=1)
y = df["target"]

# Example: here all features are numeric
numeric_cols = X.columns.tolist()
categorical_cols = []   # no categorical columns in Iris

# ---- Define transformers ----
# Scaling is not required by trees; it is kept so the skeleton also works with other models
numeric_transformer = ImbPipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = ImbPipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

# ---- ColumnTransformer ----
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_cols),
        ('cat', categorical_transformer, categorical_cols)
    ],
    remainder='drop'
)

# ---- Pipeline ----
# SMOTE and class_weight are both shown as placeholders; on real imbalanced data
# one of the two is usually enough (Iris is already balanced)
pipeline = ImbPipeline([
    ('preproc', preprocessor),
    ('smote', SMOTE(random_state=42)),
    ('clf', DecisionTreeClassifier(class_weight='balanced', random_state=42))
])

# ---- Hyperparameter grid ----
param_grid = {
    'clf__max_depth': [3, 5, None],
    'clf__min_samples_split': [2, 5, 10],
    'clf__min_samples_leaf': [1, 2, 5]
}

# ---- Train/test split ----
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# ---- GridSearchCV ----
# For rare-disease data, swap 'accuracy' for a metric such as recall or average precision
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
grid = GridSearchCV(pipeline, param_grid, scoring='accuracy', cv=cv, n_jobs=-1)
grid.fit(X_train, y_train)

# ---- Evaluate ----
best = grid.best_estimator_
print("Best Parameters:", grid.best_params_)

y_pred = best.predict(X_test)
print("Classification Report:\n", classification_report(y_test, y_pred))
```

```text
Best Parameters: {'clf__max_depth': 3, 'clf__min_samples_leaf': 1, 'clf__min_samples_split': 2}
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        15
           1       0.94      1.00      0.97        15
           2       1.00      0.93      0.97        15

    accuracy                           0.98        45
   macro avg       0.98      0.98      0.98        45
weighted avg       0.98      0.98      0.98        45
```

