## Question 1: What is a Decision Tree, and how does it work in the context of classification?

### **Definition**

A **Decision Tree** is a **supervised machine learning algorithm** that is widely used for **classification and regression** tasks.  
In the context of **classification**, a decision tree predicts the **class label** of an input by learning **simple decision rules** inferred from data features.

It is called a "tree" because it has a structure similar to a tree in nature — consisting of **nodes**, **branches**, and **leaves**.



### **Structure of a Decision Tree**

- **Root Node:**  
  Represents the entire dataset and the first feature used to make a decision.

- **Decision Nodes:**  
  These are internal nodes that split the data based on certain conditions (e.g., “Is Age > 30?”).

- **Branches:**  
  Represent the outcomes of decisions and connect nodes.

- **Leaf Nodes:**  
  Represent the final class label (e.g., “Yes” or “No”) or output value.



### **How a Decision Tree Works (Classification Process)**

1. **Data Splitting:**  
   The algorithm selects the feature that best divides the data into distinct classes.  
   This is usually done using measures like:
   - **Gini Impurity**
   - **Entropy / Information Gain**
   - **Gain Ratio**

2. **Decision Making:**  
   Each node splits the data based on a condition, for example “Is Age > 30?”.

3. **Recursive Splitting:**  
   The process continues recursively, splitting each subset into smaller groups until:
   - All data points in a node belong to the same class, or
   - A stopping criterion is met (e.g., maximum depth reached).

4. **Prediction:**  
   For a new data point, the algorithm follows the path of decisions from the root to a leaf node, where it assigns the final class label.



### **Example**

Suppose we want to predict whether a person will **buy a car** based on two features: **Age** and **Income**.

A simple decision tree might look like:


                 [Age > 30?]
                  /       \
               Yes         No
                /           \
       [Income > 50k?]     No Buy
          /       \
       Yes         No
        /           \
      Buy          No Buy

This tree shows a sequence of decisions leading to the classification outcome.
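The tree above can be read as nested conditionals. The following is a minimal sketch, assuming the "No" branch of the age test leads directly to "No Buy" and that income is given in plain currency units; the function name `predict_buy` is hypothetical.

```python
def predict_buy(age: float, income: float) -> str:
    """Follow the example tree from the root to a leaf."""
    if age > 30:                 # root node: Age > 30?
        if income > 50_000:      # decision node: Income > 50k?
            return "Buy"
        return "No Buy"
    return "No Buy"              # Age <= 30 goes straight to a leaf


print(predict_buy(age=45, income=80_000))  # Buy
print(predict_buy(age=25, income=80_000))  # No Buy
```

Each call traces exactly one root-to-leaf path, which is why prediction with a trained tree is fast.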

---

### **Advantages of Decision Trees**

- Easy to **understand** and **visualize**.  
- Handles both **numerical** and **categorical** data.  
- Does not require feature scaling (normalization).  
- Can capture **non-linear relationships**.



### **Limitations**

- **Overfitting:** May fit noise if the tree becomes too deep.  
- **Unstable:** Small changes in data can change the tree structure.  
- **Bias toward dominant classes:** If the training data is imbalanced, the tree tends to favor the majority class.




## Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?


### **Introduction**

In a **Decision Tree**, the goal at each step is to **split the data** in a way that results in the **purest possible subsets** — meaning each subset contains mostly data points from a single class.  

To measure this **purity or impurity**, we use **impurity measures** such as **Gini Impurity** and **Entropy**.



### **1. Gini Impurity**

**Definition:**  
Gini Impurity measures how often a randomly chosen data point from the dataset would be **incorrectly classified** if it were labeled according to the class distribution in that subset.

**Formula:**

\[
Gini = 1 - \sum_{i=1}^{C} (p_i)^2
\]

Where:  
- \( C \) = Number of classes  
- \( p_i \) = Probability of a data point belonging to class \( i \)

**Interpretation:**
- \( Gini = 0 \): The node is **pure** (all samples belong to one class).  
- \( Gini \) increases as the classes become more mixed.  

**Example:**
If a node contains 80% Class A and 20% Class B:
\[
Gini = 1 - (0.8^2 + 0.2^2) = 1 - (0.64 + 0.04) = 0.32
\]
So, the impurity is **0.32**.
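The same arithmetic can be checked in a few lines. This is a small sketch; the helper name `gini_impurity` is hypothetical and it assumes the class proportions sum to 1.

```python
def gini_impurity(proportions):
    """Gini = 1 - sum(p_i^2) for a list of class proportions."""
    return 1.0 - sum(p * p for p in proportions)


print(round(gini_impurity([0.8, 0.2]), 2))  # 0.32
print(gini_impurity([0.5, 0.5]))            # 0.5
```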



### **2. Entropy**

**Definition:**  
Entropy measures the **amount of randomness or disorder** in the dataset. It originates from **information theory** and is used in decision trees built with the **Information Gain** criterion (as in the ID3 algorithm).

**Formula:**

\[
Entropy = - \sum_{i=1}^{C} p_i \log_2(p_i)
\]

Where:  
- \( p_i \) = Probability of class \( i \)

**Interpretation:**
- \( Entropy = 0 \): Node is pure (only one class present).  
- \( Entropy \) is **maximum** when classes are evenly distributed.

**Example:**
For 50% Class A and 50% Class B:
\[
Entropy = - (0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1
\]
So, the node has **maximum impurity**.
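A short sketch of the entropy formula, assuming class proportions that sum to 1 (the helper name `entropy` is hypothetical). Terms with \( p_i = 0 \) are skipped, following the usual convention that they contribute nothing.

```python
import math


def entropy(proportions):
    """Entropy = -sum(p_i * log2(p_i)); zero-probability classes are skipped."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)


print(entropy([0.5, 0.5]))            # 1.0
print(round(entropy([0.8, 0.2]), 3))  # 0.722
```

The 80/20 node is purer than the 50/50 node, so its entropy is lower.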



### **3. How They Impact the Splits**

During tree building:
1. The algorithm calculates **Gini** or **Entropy** for all possible splits.
2. It selects the split that produces the **largest reduction in impurity** — that is, the **most pure child nodes**.

This reduction is known as **Information Gain**:

\[
Information\ Gain = Entropy_{parent} - \sum_{j} \frac{N_j}{N} Entropy_{child_j}
\]

or, when using Gini:
\[
Gini\ Gain = Gini_{parent} - \sum_{j} \frac{N_j}{N} Gini_{child_j}
\]
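To make the split search concrete, here is a minimal sketch that tries every candidate threshold on one numeric feature and keeps the one with the largest Gini gain. The data is a made-up toy example and the function names are hypothetical; real implementations such as CART are more efficient but follow the same idea.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_gain(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted


def best_threshold(values, labels):
    """Try a threshold midway between each pair of adjacent distinct values."""
    pairs = list(zip(values, labels))
    distinct = sorted(set(values))
    best = (None, 0.0)
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in pairs if x <= t]
        right = [y for x, y in pairs if x > t]
        gain = gini_gain(labels, left, right)
        if gain > best[1]:
            best = (t, gain)
    return best


# Toy data (assumed for illustration): age and whether the person bought a car.
ages = [22, 25, 28, 35, 40, 52]
bought = ["No", "No", "No", "Yes", "Yes", "Yes"]
print(best_threshold(ages, bought))  # (31.5, 0.5)
```

Here the threshold 31.5 separates the classes perfectly, so the gain equals the full parent impurity of 0.5.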



### **4. Comparison Between Gini and Entropy**

| Criteria | Gini Impurity | Entropy |
|-----------|----------------|----------|
| Range | 0 to 0.5 (for binary classification) | 0 to 1 (for binary classification) |
| Computation | Simpler and faster | Slightly more complex |
| Behavior | Tends to isolate the most frequent class | More sensitive to class imbalance |
| Used In | CART Algorithm | ID3, C4.5 Algorithms |







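## Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of each.


### **Introduction**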

A **Decision Tree** algorithm recursively splits data into smaller subsets based on feature values to build a predictive model.  
However, if the tree keeps splitting until every data point is perfectly classified, it may become **too complex** and start **overfitting** — meaning it performs well on training data but poorly on unseen data.

To prevent overfitting and improve generalization, we use a process called **Pruning**.

**Pruning** means **reducing the size of a decision tree** by removing unnecessary branches or nodes that do not contribute much to the model’s predictive power.

There are **two main types** of pruning:
1. **Pre-Pruning (Early Stopping)**
2. **Post-Pruning (Pruning After Training)**



### **1. Pre-Pruning (Early Stopping)**

**Definition:**  
Pre-pruning stops the tree-building process **early**, before it becomes too complex.  
In this approach, the algorithm decides **not to split a node** if the split does not provide significant improvement in purity (measured using **Gini Impurity** or **Entropy**).



#### **How It Works**

During the training phase, the algorithm evaluates each possible split.  
It checks whether the split improves the model enough to justify the increase in complexity.

If the improvement (known as **information gain**) is below a certain **threshold**, the algorithm **stops splitting** further.

\[
Information\ Gain = Impurity_{parent} - \sum_{j=1}^{k} \frac{N_j}{N} \times Impurity_{child_j}
\]

If `Information Gain < Minimum Threshold`, the split is **not made**.



#### **Common Pre-Pruning Parameters (in Scikit-learn)**

- `max_depth` → maximum depth of the tree  
- `min_samples_split` → minimum number of samples required to split a node  
- `min_samples_leaf` → minimum samples allowed in a leaf node  
- `max_leaf_nodes` → limits total number of leaf nodes  
- `min_impurity_decrease` → minimum decrease in impurity needed for a split  
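A minimal sketch of these parameters in use, assuming the Iris data and arbitrary illustrative values (none of the settings below are tuned):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Unconstrained tree versus one limited by early-stopping parameters.
full_tree = DecisionTreeClassifier(random_state=42).fit(X, y)
pre_pruned = DecisionTreeClassifier(
    max_depth=3,                 # never grow deeper than 3 levels
    min_samples_split=10,        # a node needs at least 10 samples to be split
    min_samples_leaf=5,          # every leaf keeps at least 5 samples
    min_impurity_decrease=0.01,  # skip splits that barely reduce impurity
    random_state=42,
).fit(X, y)

print("Full tree:  depth", full_tree.get_depth(), "leaves", full_tree.get_n_leaves())
print("Pre-pruned: depth", pre_pruned.get_depth(), "leaves", pre_pruned.get_n_leaves())
```

Comparing the two printed lines shows how the constraints cap the size of the tree.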

---

#### **Example**

Suppose we are classifying emails as “Spam” or “Not Spam”.  
If splitting on the word *“offer”* reduces impurity by only a negligible amount (below the chosen threshold), the algorithm does not make that split, which avoids overfitting to noise words.



#### **Practical Advantage of Pre-Pruning**

**Advantage:**  
- **Saves computation time and limits overfitting early.**  
  Since the tree stops growing when the improvement becomes insignificant, it produces a **simpler model that is faster to train** and often generalizes better.


### **2. Post-Pruning (Pruning After Training)**

**Definition:**  
Post-pruning allows the decision tree to **grow fully** — potentially overfitting — and then **removes unimportant branches** afterward.  
This process simplifies the model without losing much predictive accuracy.


#### **How It Works**

1. The algorithm first grows a **complete tree** using all training data.  
2. It then evaluates the **importance of each branch or leaf** using a validation set or cross-validation.  
3. Subtrees or branches that contribute **little to accuracy** are pruned (removed).  
4. The pruning continues until the model’s performance stops improving.



#### **Common Post-Pruning Methods**

- **Reduced Error Pruning:**  
  - Prunes nodes only if removing them does not reduce accuracy on the validation set.  
- **Cost Complexity Pruning (CCP):**  
  - Introduced in CART algorithm.  
  - Adds a penalty for tree complexity using the formula:  

\[
R_{\alpha}(T) = R(T) + \alpha |T|
\]

Where:  
- \( R(T) \): Total misclassification error of the tree  
- \( |T| \): Number of leaf nodes (complexity)  
- \( \alpha \): Complexity parameter (controls pruning strength)

By adjusting \( \alpha \), the algorithm balances **accuracy vs. simplicity**.
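As a hedged illustration, scikit-learn exposes cost complexity pruning through `cost_complexity_pruning_path` and the `ccp_alpha` parameter. Note that scikit-learn measures \( R(T) \) as the total sample-weighted impurity of the leaves rather than the raw misclassification error. The sketch below assumes the Iris data and a 70/30 split; in practice \( \alpha \) would be chosen by cross-validation rather than by looking at the test set.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Effective alphas at which subtrees get pruned away, weakest link first.
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(
    X_train, y_train
)

leaf_counts = []
for alpha in path.ccp_alphas:
    alpha = max(alpha, 0.0)  # guard against tiny negative values from rounding
    tree = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha)
    tree.fit(X_train, y_train)
    leaf_counts.append(tree.get_n_leaves())
    print(f"alpha={alpha:.4f}  leaves={tree.get_n_leaves()}  "
          f"test accuracy={tree.score(X_test, y_test):.3f}")
```

Larger values of `ccp_alpha` remove more of the tree, so the leaf count shrinks as the loop proceeds.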



#### **Example**

In a decision tree for predicting loan approvals, post-pruning may remove branches related to extremely rare cases (like income > ₹10 million), which add complexity but don’t improve prediction accuracy.


#### **Practical Advantage of Post-Pruning**

**Advantage:**  
- **Often produces a more accurate and generalizable model.**  
  Since pruning occurs **after** full growth, a split that looks weak on its own but enables useful splits deeper in the tree is not discarded prematurely.



### **3. Key Differences Between Pre-Pruning and Post-Pruning**

| **Feature** | **Pre-Pruning (Early Stopping)** | **Post-Pruning (After Training)** |
|--------------|----------------------------------|-----------------------------------|
| **Timing** | Stops tree growth during training | Prunes the tree after it is fully grown |
| **Goal** | Prevent overfitting early | Remove overfitted branches after learning |
| **Computation** | Less computationally expensive | More computationally intensive |
| **Risk** | Might stop too early and underfit | Allows full learning, but then simplifies |
| **Example Parameters** | `max_depth`, `min_samples_split` | `ccp_alpha`, validation-based pruning |



### **4. Summary**

- **Pre-Pruning** limits tree growth using early stopping criteria → reduces overfitting and saves computation time.  
- **Post-Pruning** grows a complete tree first, then removes unnecessary parts → improves model performance and generalization.  

Both techniques aim to build a **simpler, more robust Decision Tree** that performs well on unseen data.



## Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?



### **Introduction**

In **Decision Tree algorithms**, the most important step is **deciding where to split the data** at each node.  
The quality of each split determines how well the tree classifies the data.  

To measure this quality, we use **Information Gain (IG)** — a metric that quantifies how much **“information” or “purity”** is gained after a split.  

Information Gain is primarily used when the **Entropy** measure is chosen as the impurity criterion (as in **ID3** and **C4.5** algorithms).



### **Definition**

**Information Gain** measures the **reduction in entropy (impurity)** achieved by partitioning the dataset based on a particular feature.

It tells us **how much uncertainty in the dataset decreases** when we split it on a given attribute.



### **Formula for Information Gain**

\[
Information\ Gain (IG) = Entropy(Parent) - \sum_{i=1}^{k} \frac{N_i}{N} \times Entropy(Child_i)
\]

Where:  
- \( Entropy(Parent) \) → Entropy of the original dataset (before splitting)  
- \( Entropy(Child_i) \) → Entropy of each child subset after splitting  
- \( N_i \) → Number of samples in the \( i^{th} \) child node  
- \( N \) → Total number of samples in the parent node  



### **Step-by-Step Explanation**

1. **Compute Parent Entropy:**  
   Calculate the entropy of the original dataset before any split.

2. **Split the Data:**  
   Divide the dataset into subsets based on one feature (e.g., “Age > 30?”).

3. **Compute Child Entropies:**  
   For each subset, compute the entropy again.

4. **Compute Weighted Average of Child Entropies:**  
   Weight each child’s entropy by the proportion of samples it contains.

5. **Calculate Information Gain:**  
   Subtract the weighted child entropy from the parent entropy.


### **Interpretation**

- A **higher Information Gain** means the feature provides **more useful information** for classifying data.  
- The **feature with the highest Information Gain** is selected as the **best split** at that node.



### **Example**

Suppose we are predicting whether a student **passes** or **fails** based on whether they **study regularly**.

| Student | Studies Regularly | Result |
|----------|-------------------|--------|
| A | Yes | Pass |
| B | Yes | Pass |
| C | No | Fail |
| D | No | Fail |
| E | Yes | Pass |

#### **Step 1: Calculate Parent Entropy**

Total = 5 students → 3 Pass, 2 Fail  
\[
Entropy(Parent) = - \left( \frac{3}{5} \log_2 \frac{3}{5} + \frac{2}{5} \log_2 \frac{2}{5} \right) = 0.971
\]

#### **Step 2: Split on “Studies Regularly”**

- **Yes group:** (3 Pass, 0 Fail) → Entropy = 0  
- **No group:** (0 Pass, 2 Fail) → Entropy = 0  

#### **Step 3: Weighted Entropy After Split**

\[
Entropy(Children) = \frac{3}{5}(0) + \frac{2}{5}(0) = 0
\]

#### **Step 4: Information Gain**

\[
IG = 0.971 - 0 = 0.971
\]

**High Information Gain (0.971)** means “Studies Regularly” is an excellent attribute to split on — it perfectly classifies the data.
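The four steps above can be reproduced with a short script. This is a sketch using only the five-student table from this example; the helper names `entropy` and `information_gain` are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Parent entropy minus the size-weighted entropy of each feature group."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    weighted = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - weighted


studies = ["Yes", "Yes", "No", "No", "Yes"]
result = ["Pass", "Pass", "Fail", "Fail", "Pass"]

print(round(entropy(result), 3))                    # 0.971
print(round(information_gain(studies, result), 3))  # 0.971
```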



### **Why Information Gain is Important**

1. **Feature Selection:**  
   Information Gain helps the Decision Tree **choose the most informative feature** at each node.

2. **Efficient Splitting:**  
   It ensures that every split **maximizes class purity**, leading to faster and more accurate learning.

3. **Reduces Uncertainty:**  
   Each split reduces the randomness (entropy) of the dataset, making predictions more certain.

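4. **Supports Model Accuracy:**  
   Splits with higher Information Gain yield purer child nodes and fewer training misclassifications; generalization to unseen data still depends on controlling tree size (e.g., through pruning).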



### **Relation to Entropy**

Information Gain is directly based on **Entropy**, which measures impurity:

\[
Entropy = - \sum_{i=1}^{C} p_i \log_2(p_i)
\]

- High entropy → High disorder (mixed classes).  
- Low entropy → High purity (mostly one class).  

So, a split that **reduces entropy the most** will have the **highest Information Gain**.



### **Advantages of Using Information Gain**

- Provides a **quantitative measure** for evaluating splits.  
- Works well with **categorical attributes**.  
- Encourages **pure and interpretable** tree structures.



### **Limitations**

- Tends to favor features with **many unique values** (e.g., ID numbers).  
  → This is addressed by using **Gain Ratio** (used in C4.5 algorithm).  
- Computationally more expensive than Gini Impurity.







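## Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?


### **Introduction**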

A **Decision Tree** is a supervised machine learning algorithm used for both **classification** and **regression** problems.  
It operates by recursively splitting the dataset into smaller and smaller subsets based on the most significant attributes (features), forming a tree-like structure of decisions and outcomes.  
Each internal node represents a **test on an attribute**, each branch represents the **result of that test**, and each leaf node represents a **final decision or output**.

Decision Trees are popular because they are **easy to interpret**, **require little data preprocessing**, and **mimic human decision-making**.  
They are frequently used in various real-world applications due to their transparency and flexibility.



### **1. Real-World Applications of Decision Trees**



#### **1.1 Iris Dataset – Classification Task**

One of the most common applications of Decision Trees is **classification**.  
The **Iris dataset** is a classic example used to classify flowers into three species based on their physical features.

- **Dataset Information:**
  - Features: Sepal Length, Sepal Width, Petal Length, Petal Width
  - Target: Species of Iris (Setosa, Versicolor, Virginica)

- **Working Example:**
  A Decision Tree might split the data as follows:
  - If `Petal Length < 2.5 cm` → Iris-setosa
  - Else if `Petal Width < 1.8 cm` → Iris-versicolor
  - Else → Iris-virginica

- **Practical Use Case:**  
  Such classification systems can be adapted for **automated plant identification**, **image-based species recognition**, or **biological research** applications.

---

#### **1.2 Boston Housing Dataset – Regression Task**

Decision Trees can also perform **regression**, predicting continuous values instead of categories.  
The **Boston Housing dataset** has long been used for this purpose, although `load_boston` was removed from scikit-learn in version 1.2; the California Housing dataset is the usual replacement.

- **Dataset Information:**
  - Features: Average number of rooms, crime rate, accessibility to highways, property tax rate, etc.
  - Target: Median value of owner-occupied homes in $1000s.

- **Working Example:**
  A Decision Tree might split the data as follows, where `RM` is the average number of rooms per dwelling and `LSTAT` is the percentage of lower-status population:
  - If `RM > 6.5` and `LSTAT < 10%` → House Price ≈ High
  - Else if `RM < 5.5` → House Price ≈ Low
  - Else → House Price ≈ Medium

- **Practical Use Case:**  
  This can help **real estate companies** or **urban planners** predict housing prices and plan infrastructure development.

---

#### **1.3 Additional Real-World Examples**

- **Finance:** Credit risk analysis, loan approval, and fraud detection.  
- **Healthcare:** Predicting diseases based on symptoms and medical test results.  
- **Marketing:** Customer segmentation, churn prediction, and sales forecasting.  
- **Manufacturing:** Quality control and predictive maintenance of machinery.  
- **Education:** Student performance prediction and adaptive learning systems.



### **2. Advantages of Decision Trees**

1. **Easy to Understand and Interpret:**  
 The tree structure is visual and similar to human reasoning, making it easy for non-technical users to interpret.

2. **Handles Both Numerical and Categorical Data:**  
 Decision Trees can work with various data types without requiring feature scaling or normalization.

3. **Requires Minimal Data Preparation:**  
 Unlike many algorithms, trees don’t require feature standardization. Some implementations (such as scikit-learn’s) still need categorical features encoded as numbers.

4. **Useful for Feature Selection:**  
 Decision Trees automatically rank features by importance during training, helping identify key predictors.

5. **Works Well on Nonlinear Relationships:**  
 Can model complex relationships between features without assuming linearity.



### **3. Limitations of Decision Trees**

1. **Overfitting:**  
 Trees can become overly complex and fit noise in the training data, reducing generalization on unseen data.

2. **Instability:**  
 Small changes in data can drastically change the structure of the tree.

3. **Biased Toward Features with More Levels:**  
 Features with more categories may dominate the splitting process.

4. **Less Accurate Alone:**  
 While easy to interpret, single trees may not achieve the same accuracy as ensemble models like **Random Forests** or **Gradient Boosted Trees**.

5. **Computational Cost:**  
 For large datasets, training and pruning can become computationally expensive.



### **4. Summary of Datasets**

| Dataset | Task Type | Example Use | scikit-learn Loader |
|----------|------------|--------------|---------------------|
| **Iris Dataset** | Classification | Predicting flower species | `sklearn.datasets.load_iris()` |
| **Boston Housing Dataset** | Regression | Predicting house prices | `sklearn.datasets.load_boston()` *(removed in scikit-learn 1.2; load via CSV, or use `fetch_california_housing()`)* |




In [2]:
from sklearn.datasets import load_iris, fetch_california_housing
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification using Iris dataset
iris = load_iris()
clf = DecisionTreeClassifier(random_state=42)
clf.fit(iris.data, iris.target)

# Regression using California Housing dataset (modern replacement for Boston)
housing = fetch_california_housing()
reg = DecisionTreeRegressor(random_state=42)
reg.fit(housing.data, housing.target)

print("Classification Tree Depth:", clf.get_depth())
print("Regression Tree Depth:", reg.get_depth())


Classification Tree Depth: 5
Regression Tree Depth: 40


## Question 6: Write a Python program to:
## ● Load the Iris Dataset
## ● Train a Decision Tree Classifier using the Gini criterion
## ● Print the model’s accuracy and feature importances

In [3]:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Hold out 30% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a Decision Tree using the Gini criterion
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# Predict on the held-out test set
y_pred = clf.predict(X_test)

# Report test accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", round(accuracy * 100, 2), "%")

# Report how much each feature contributed to the splits
print("\nFeature Importances:")
for feature, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{feature}: {round(importance, 4)}")


Model Accuracy: 100.0 %

Feature Importances:
sepal length (cm): 0.0
sepal width (cm): 0.0191
petal length (cm): 0.8933
petal width (cm): 0.0876


## Question 7: Write a Python program to:
## ● Load the Iris Dataset
## ● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.


In [4]:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Hold out 30% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Tree limited to depth 3 (pre-pruned)
clf_limited = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42)
clf_limited.fit(X_train, y_train)

# Fully grown tree (no depth limit)
clf_full = DecisionTreeClassifier(criterion='gini', random_state=42)
clf_full.fit(X_train, y_train)

# Predict with both models on the same test set
y_pred_limited = clf_limited.predict(X_test)
y_pred_full = clf_full.predict(X_test)

# Compare test accuracy
acc_limited = accuracy_score(y_test, y_pred_limited)
acc_full = accuracy_score(y_test, y_pred_full)

print("Decision Tree with max_depth=3 Accuracy: ", round(acc_limited * 100, 2), "%")
print("Fully-grown Decision Tree Accuracy: ", round(acc_full * 100, 2), "%")


Decision Tree with max_depth=3 Accuracy:  100.0 %
Fully-grown Decision Tree Accuracy:  100.0 %


## Question 8: Write a Python program to:
## ● Load the Boston Housing Dataset
## ● Train a Decision Tree Regressor
## ● Print the Mean Squared Error (MSE) and feature importances

In [8]:

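from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# load_boston was removed in scikit-learn 1.2, so the California Housing
# dataset is used here as a replacement regression dataset
housing = fetch_california_housing()
X = housing.data
y = housing.target

# Hold out 30% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train a Decision Tree Regressor with default settings (fully grown tree)
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

# Predict on the held-out test set
y_pred = regressor.predict(X_test)

# Mean Squared Error between true and predicted values
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error (MSE):", round(mse, 2))

# Report how much each feature contributed to the splits
print("\nFeature Importances:")
for name, importance in zip(housing.feature_names, regressor.feature_importances_):
    print(f"{name}: {round(importance, 4)}")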


Mean Squared Error (MSE): 0.53

Feature Importances:
MedInc: 0.5235
HouseAge: 0.0521
AveRooms: 0.0494
AveBedrms: 0.025
Population: 0.0322
AveOccup: 0.139
Latitude: 0.09
Longitude: 0.0888


## Question 9: Write a Python program to:
## ● Load the Iris Dataset
## ● Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV
## ● Print the best parameters and the resulting model accuracy

In [10]:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Hold out 30% of the samples for the final test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Base estimator whose hyperparameters will be tuned
dt = DecisionTreeClassifier(random_state=42)

# Candidate values for the two hyperparameters
param_grid = {
    'max_depth': [2, 3, 4, 5, 6, None],
    'min_samples_split': [2, 3, 4, 5, 6]
}

grid_search = GridSearchCV(
    estimator=dt,
    param_grid=param_grid,
    cv=5,               # 5-fold cross-validation on the training set
    scoring='accuracy',
    n_jobs=-1           # use all available CPU cores
)

# Run the search using the training data only
grid_search.fit(X_train, y_train)

# Best model, refit on the full training set
best_model = grid_search.best_estimator_

# Evaluate the tuned model on the held-out test set
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Best Parameters Found:", grid_search.best_params_)
print("Best Cross-Validation Accuracy:", round(grid_search.best_score_ * 100, 2), "%")
print("Test Set Accuracy with Best Model:", round(accuracy * 100, 2), "%")


Best Parameters Found: {'max_depth': 4, 'min_samples_split': 6}
Best Cross-Validation Accuracy: 94.29 %
Test Set Accuracy with Best Model: 100.0 %


## Question 10: Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values.
## Explain the step-by-step process you would follow to:
## ● Handle the missing values
## ● Encode the categorical features
## ● Train a Decision Tree model
## ● Tune its hyperparameters
## ● Evaluate its performance
## And describe what business value this model could provide in the real-world setting.


### Problem context
You have a large healthcare dataset (mixed numerical + categorical features, some missing values). Goal: predict whether a patient has a disease (binary classification).



## 1. Understand the data (first, before modeling)
**Why:** good decisions about imputation, encoding and evaluation depend on data characteristics.

**Actions**
- Inspect column types, missingness pattern, cardinality of categorical features, class balance (positive vs negative cases), distributions and outliers.
- Ask whether missingness is likely MCAR / MAR / MNAR (domain knowledge). For example, a missing lab test might mean “test not ordered” (informative).
- Calculate baseline metrics (e.g., base prevalence) so you know what “good” looks like.

**Checks**
- `df.isnull().mean()` per column (missing rate).
- `df.nunique()` for categorical cardinalities.
- Class counts: `y.value_counts(normalize=True)`.



## 2. Handling missing values
**Principles**
- Use domain knowledge. Missingness can be informative (create indicator flag).
- Avoid data leakage: impute using only training data statistics inside a pipeline.

**Strategies (by feature type & missing rate)**

1. **Low missingness (e.g., <5–10%)**
   - Numeric: median (robust to outliers), or mean if the distribution is roughly symmetric.
   - Categorical: constant like `'missing'` or the mode.

2. **Moderate missingness**
   - Numeric: median or model-based imputation (KNN, Iterative Imputer / MICE).
   - Categorical: impute with `'missing'` or create a new category; consider target/impact encoding (careful with leakage).

3. **High missingness (>30–50%)**
   - Consider dropping the feature unless domain knowledge says it’s essential.
   - Or engineer a binary indicator `feature_missing` and treat the feature carefully.

4. **Informative missingness**
   - Create a missing indicator column (`feature_X_missing`) and use it as a predictor.

5. **Time-dependent / longitudinal labs**
   - If multiple measurements exist, extract summary statistics (last, max, trend) rather than naively imputing.

**Recommended tools (scikit-learn)**
- `sklearn.impute.SimpleImputer` (mean/median/most_frequent/constant)
- `sklearn.impute.KNNImputer` (if reasonable size)
- `sklearn.impute.IterativeImputer` (MICE-like), but ensure performance & stability

**Important:** Put imputation **inside** a pipeline and fit only on training folds to avoid leakage.
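A minimal sketch of imputation inside a pipeline, assuming a tiny made-up array of two lab values with gaps. `add_indicator=True` appends a missing-indicator column for each feature that had gaps during fitting, which covers the "informative missingness" case above.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: two lab values with gaps, and a binary label.
X_train = np.array([
    [5.1, np.nan],
    [6.2, 1.4],
    [np.nan, 0.9],
    [7.0, 2.1],
    [5.5, np.nan],
    [6.8, 1.9],
])
y_train = np.array([0, 0, 0, 1, 0, 1])
X_new = np.array([[np.nan, 2.0]])

model = Pipeline([
    # Medians are learned during fit() only, so no test data leaks in.
    ("impute", SimpleImputer(strategy="median", add_indicator=True)),
    ("clf", DecisionTreeClassifier(max_depth=2, random_state=42)),
])
model.fit(X_train, y_train)

print(model.named_steps["impute"].statistics_)  # medians learned from X_train only
print(model.predict(X_new))                     # new row is imputed, then classified
```

Because the imputer is a pipeline step, cross-validation refits it on each training fold automatically.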


## 3. Encoding categorical features
**Principles**
- Use encoders that match feature cardinality and model type.
- One-hot for low-cardinality categories.
- For high-cardinality categorical features consider target encoding / count encoding / embeddings — but be careful to avoid target leakage (use nested CV or smoothing).

**Options**
- **One-Hot Encoding** (`OneHotEncoder` with `handle_unknown='ignore'`): works well for small cardinalities, and tree models tolerate sparse one-hot columns.
- **Ordinal Encoding** (`OrdinalEncoder`): only if categories have natural order.
- **Count / Frequency Encoding:** replace category by its frequency — good for high cardinality.
- **Target Encoding / Mean Encoding:** can improve performance for high cardinality but MUST be done with cross-validation or using smoothing to prevent leakage.
- **Leave-one-out / CatBoost-style encoders**: alternatives that reduce leakage risk.

**Practical rule of thumb**
- If cardinality <= 10: One-hot.
- If cardinality > 30: use target/count encoding or hashing (with careful CV).
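For the low-cardinality case, a short sketch of `OneHotEncoder` with `handle_unknown='ignore'`, assuming a made-up blood-type column. A category that was never seen during fitting is encoded as all zeros instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single categorical column.
train = np.array([["A"], ["B"], ["O"], ["A"]], dtype=object)
new = np.array([["AB"], ["B"]], dtype=object)  # "AB" was never seen in training

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

encoded = encoder.transform(new).toarray()
print(encoder.categories_[0])  # categories learned from the training column
print(encoded)                 # the unseen "AB" row is all zeros
```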



## 4. Feature engineering and scaling
**Decision Trees do not require scaling**, but:
- Create interaction / domain features (e.g., BMI from height & weight).
- Aggregate lab time series (last value, slope).
- Binarize clinically meaningful thresholds (e.g., `age >= 65`).
- Add missingness indicators.

---

## 5. Train a Decision Tree (proper pipeline & cross-validation)
**Why pipeline?** ensures preprocessing + imputation + encoding are applied identically in CV and deployment, and prevents leakage.

**Suggested pipeline (sklearn)**
- `ColumnTransformer` to apply different imputers/encoders to numeric vs categorical.
- `Pipeline` with Transformer → `DecisionTreeClassifier`.

**Use stratified splits** because disease prevalence may be low (stratify by label).

**Handle class imbalance**
- Use `class_weight='balanced'` or supply `sample_weight`.
- Also consider resampling (SMOTE, undersampling) but do that inside CV pipeline to avoid leakage.
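The pieces above might be assembled as follows. This is a sketch under stated assumptions: the data is synthetic, the column names (`age`, `glucose`, `smoker`) are hypothetical stand-ins for patient features, and the tree settings are illustrative rather than tuned.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient data (assumed for illustration only).
rng = np.random.default_rng(42)
n = 300
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "glucose": rng.normal(100, 20, n),
    "smoker": pd.Series(rng.choice(["yes", "no"], n), dtype=object),
})
y = ((df["glucose"] > 110) & (df["age"] > 50)).astype(int)

# Inject roughly 10% missing values into one numeric and one categorical column.
df.loc[rng.random(n) < 0.1, "glucose"] = np.nan
df.loc[rng.random(n) < 0.1, "smoker"] = np.nan

numeric = ["age", "glucose"]
categorical = ["smoker"]

preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median", add_indicator=True), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

model = Pipeline([
    ("prep", preprocess),
    ("clf", DecisionTreeClassifier(
        max_depth=4, class_weight="balanced", random_state=42
    )),
])

# Stratified folds keep the disease prevalence similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, df, y, cv=cv, scoring="roc_auc")
print("ROC AUC per fold:", scores.round(3))
```

Naming the final step `clf` means its hyperparameters can be addressed in a search grid with the `clf__` prefix.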



## 6. Hyperparameter tuning
**Key DecisionTree hyperparameters**
- `max_depth` (controls overfitting)
- `min_samples_split`
- `min_samples_leaf`
- `max_features` (subset of features to consider at each split)
- `ccp_alpha` (cost complexity pruning)
- `criterion` (`gini` or `entropy`) — usually minor differences

**Tuning approach**
- Use `GridSearchCV` or `RandomizedSearchCV` with **StratifiedKFold** and scoring tuned to business objective (e.g., `'roc_auc'`, `'average_precision'`, or custom scoring that weights false negatives more).
- Use nested CV if you need an unbiased estimate of generalization performance when you also report tuned CV performance.
- Use `n_jobs=-1` to parallelize.

**Example parameter grid**
```py
param_grid = {
  'clf__max_depth': [3, 5, 7, 10, None],
  'clf__min_samples_split': [2, 5, 10, 20],
  'clf__min_samples_leaf': [1, 2, 5, 10],
  'clf__ccp_alpha': [0.0, 0.001, 0.01, 0.1]
}
