# Assignment on Decision Tree || Module-07
**Assignment Code:** DA-AG-012

**Learner's Name:** Suraj Vishwakarma  
**Email:** vishsurajfor@gmail.com

This notebook contains solutions to the 10 assignment questions, with runnable Python code where applicable.




## Q1: What is a Decision Tree, and how does it work in the context of classification?  

**Answer:** A **Decision Tree** is a supervised machine learning algorithm widely used for **classification problems**. It is a tree-like model of decisions in which each internal node tests a feature, each branch corresponds to an outcome of that test, and each leaf node assigns a class label. Because this mirrors step-by-step human decision-making, Decision Trees are highly interpretable.  

---

### Working of a Decision Tree in Classification:  

1. **Root Node:**  
   - The process starts from the root node which represents the entire dataset.  
   - The algorithm looks for the feature that best separates the classes.  

2. **Splitting Criteria:**  
   - A suitable attribute is chosen using a splitting criterion such as **Information Gain** (based on **Entropy**), the **Gini Index**, or the **Chi-Square** test.  
   - The attribute that produces the most homogeneous (pure) child nodes is selected.  

3. **Recursive Partitioning:**  
   - The dataset is divided into smaller subsets based on the selected attribute.  
   - This process continues recursively for each subset, creating branches of the tree.  

4. **Leaf Nodes (Terminal Nodes):**  
   - The recursion stops when either:  
     - All data points in a node belong to the same class, or  
     - Further splitting does not add value.  
   - At this stage, the node becomes a leaf and is labeled with the class outcome.  

---

### Illustrative Example:  
Suppose we are predicting whether a student will **Pass** or **Fail** based on *Study Hours* and *Attendance*.  

- **Root Node:** Check `Study Hours > 3`.  
   - If **Yes → Predict: Pass**.  
   - If **No →** check `Attendance > 75%`.  
     - If **Yes → Predict: Pass**.  
     - If **No → Predict: Fail**.  

This shows how the decision tree uses **step-by-step feature-based questions** to classify outcomes.  
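
The example tree above can be written directly as nested conditions. This is a minimal sketch of the hand-written rules, not a trained model; the function name `predict_result` and the sample students are made up for illustration.

```python
def predict_result(study_hours: float, attendance_pct: float) -> str:
    """Apply the example tree's rules from the root down to a leaf."""
    if study_hours > 3:        # root node test
        return "Pass"
    if attendance_pct > 75:    # second test, reached only when study_hours <= 3
        return "Pass"
    return "Fail"


# Hypothetical students as (study hours, attendance %)
students = [(4.0, 60.0), (2.0, 80.0), (2.0, 50.0)]
predictions = [predict_result(hours, attendance) for hours, attendance in students]
print(predictions)  # ['Pass', 'Pass', 'Fail']
```

Each call follows exactly one path from the root to a leaf, which is why the prediction for any sample can be explained as a short chain of if-else rules.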

---

### Advantages of Decision Trees in Classification:  
- Easy to understand and interpret (resembles human reasoning).  
- Handles both numerical and categorical data.  
- No need for data normalization or scaling.  
- Produces a clear **set of rules (if–else)** for prediction.  

### Limitations:  
- **Overfitting:** Trees can grow very deep and memorize training data.  
- **Bias toward features with many values:** Attributes with multiple splits may dominate.  
- Sensitive to small changes in data.  

---

### Conclusion:  
A Decision Tree in classification works by recursively splitting the dataset into homogeneous groups based on features, ultimately leading to a class label at the leaves. Its simplicity and interpretability make it suitable for applications like **loan approval, medical diagnosis, fraud detection, and student performance prediction**. However, without pruning or ensemble methods, decision trees can easily overfit the training data.  

## Q2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?  

**Answer:** When building a Decision Tree, the algorithm must decide **which feature to split on at each step**. To make this choice, it relies on **impurity measures**, which quantify how mixed (impure) or pure a node is in terms of class labels. Two commonly used measures are **Gini Impurity** and **Entropy**.  

---

### 1. Gini Impurity  
- **Definition:** Gini impurity measures the probability that a randomly chosen sample from a node will be incorrectly classified if it were randomly labeled according to the class distribution.  
- **Formula:**  
$$
Gini = 1 - \sum_{i=1}^{C} p_i^2
$$  
Where:  
- $C$ = number of classes  
- $p_i$ = probability of class $i$ in the node  

- **Interpretation:**  
  - A Gini of **0** means the node is pure (all samples belong to one class).  
  - Higher Gini means higher impurity.  

---

### 2. Entropy  
- **Definition:** Entropy is a concept from information theory that measures the level of uncertainty or randomness in the data.  
- **Formula:**  
$$
Entropy = - \sum_{i=1}^{C} p_i \log_2(p_i)
$$  

- **Interpretation:**  
  - Entropy = **0** → node is pure (all samples belong to one class).  
  - Entropy is maximum when classes are equally mixed (e.g., 50%-50% for binary).  

---

### 3. Impact on Splits  
- Decision Trees use these measures to evaluate **how “good” a split is**.  
- The goal is to **reduce impurity** as much as possible after a split.  
- The algorithm calculates the **weighted average impurity** of child nodes, and the split that yields the **lowest impurity (or highest Information Gain)** is chosen.  

**Information Gain (based on Entropy):**  
$$
IG = Entropy(parent) - \sum_{k} \frac{N_k}{N} \cdot Entropy(child_k)
$$  

**For Gini:**  
The algorithm chooses the split whose child nodes have the **lowest weighted Gini impurity**, which is the same as the largest decrease from the parent's Gini.  

---

### 4. Numerical Example  
Suppose a dataset has 10 samples: **6 Positive** and **4 Negative**.  

- **Entropy at parent node:**  
$$
Entropy = -\Big(0.6 \log_2 0.6 + 0.4 \log_2 0.4 \Big) \approx 0.97
$$  

- **Gini at parent node:**  
$$
Gini = 1 - (0.6^2 + 0.4^2) = 0.48
$$  

If a split divides the data into subsets with higher purity (e.g., [4P,1N] and [2P,3N]), both Entropy and Gini will **decrease**, showing that the split improved classification.  
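
The numbers in this example can be checked with a few lines of Python. This is a small sketch using only the standard library; the helper names `gini`, `entropy`, and `weighted_impurity` are illustrative, and class counts are passed as plain lists.

```python
import math


def gini(counts):
    """Gini impurity of a node given its class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def entropy(counts):
    """Entropy (base 2) of a node given its class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def weighted_impurity(children, measure):
    """Size-weighted average impurity over the child nodes of a split."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * measure(child) for child in children)


parent = [6, 4]               # 6 Positive, 4 Negative
children = [[4, 1], [2, 3]]   # the [4P,1N] and [2P,3N] split from the text

parent_gini = gini(parent)
parent_entropy = entropy(parent)
child_gini = weighted_impurity(children, gini)
child_entropy = weighted_impurity(children, entropy)

print(f"Gini:    parent {parent_gini:.3f} -> children {child_gini:.3f}")
print(f"Entropy: parent {parent_entropy:.3f} -> children {child_entropy:.3f}")
```

For this split the weighted Gini drops from 0.48 to 0.40 and the weighted entropy from about 0.971 to about 0.846, so both measures agree that the split makes the nodes purer.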

---

### 5. Comparison of Gini vs Entropy  
- Both measures usually give **similar results** in practice.  
- **Gini Impurity:**  
  - Faster to compute.  
  - Used by CART (Classification and Regression Trees).  
- **Entropy:**  
  - Based on information theory.  
  - Used by ID3 and C4.5 algorithms.  

---

### Conclusion  
Both **Gini Impurity** and **Entropy** are key metrics for evaluating splits in Decision Trees. They ensure that each decision point reduces uncertainty and leads toward purer nodes. By selecting attributes that minimize impurity, the tree becomes more effective at classification. While their mathematical formulation differs, their overall impact on the tree’s structure is quite similar.  

## Q3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.  

**Answer:** Decision Trees are prone to **overfitting**, especially when they grow deep enough to capture noise in the data. **Pruning** addresses this by stopping or cutting back the growth of the tree so that it generalizes better to unseen data. There are two main types: **Pre-Pruning** and **Post-Pruning**.  

---

### 1. Pre-Pruning (Early Stopping)  
- **Definition:** Pre-pruning halts the tree growth **before it becomes too complex**. Instead of allowing the tree to grow fully, we set stopping conditions during the building phase.  
- **Common strategies:**  
  - Maximum depth of the tree (`max_depth`).  
  - Minimum number of samples required at a node (`min_samples_split`).  
  - Minimum information gain required for a split.  
- **Example:** If a node has fewer than 5 samples, the algorithm stops further splitting and makes it a leaf node.  

**Practical Advantage:**  
- **Efficiency in computation.** Since the tree stops growing early, it reduces training time and memory usage. Useful in real-time or large-scale applications.  

---

### 2. Post-Pruning (Cost-Complexity Pruning)  
- **Definition:** Post-pruning allows the decision tree to grow to its full depth and then **removes unnecessary branches** that do not improve accuracy significantly.  
- **Methods:**  
  - Reduced Error Pruning: Remove a branch if performance on a validation set does not decrease.  
  - Cost-Complexity Pruning: Prunes subtrees that add little value to prediction while increasing complexity.  
- **Example:** A fully grown tree might dedicate whole branches to a few outlier samples in the training data. Pruning those branches improves generalization on test data.  

**Practical Advantage:**  
- **Improved generalization.** By trimming overfitted parts of the tree, post-pruning reduces variance and increases predictive accuracy on unseen data.  

---

### 3. Key Differences  

| Aspect                | Pre-Pruning                       | Post-Pruning                          |
|------------------------|-----------------------------------|---------------------------------------|
| **When applied**       | During tree construction          | After a full tree is built             |
| **Control**            | Stops splits based on conditions  | Cuts back unnecessary branches         |
| **Risk**               | May underfit if stopped too early | Computationally more expensive         |
| **Example parameter**  | `max_depth`, `min_samples_split`  | Cost-complexity pruning (alpha tuning) |
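
Both styles of pruning can be tried in scikit-learn. The sketch below assumes the Iris dataset and the same 80/20 split used later in this notebook. The pre-pruning limits (`max_depth=3`, `min_samples_split=5`) and the choice of a mid-range `ccp_alpha` are arbitrary illustrations; in practice these values would be tuned by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Pre-pruning: stopping conditions are applied while the tree is being built.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_split=5, random_state=42)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the full tree, then compute the cost-complexity pruning path.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Pick a mid-range alpha purely for illustration (not a tuned value).
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
post_pruned.fit(X_train, y_train)

for name, model in [("full", full_tree), ("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    print(f"{name:12s} depth={model.get_depth()} leaves={model.get_n_leaves()} "
          f"test accuracy={model.score(X_test, y_test):.3f}")
```

Comparing depth and leaf count alongside accuracy shows what each method trades away: both pruned trees are smaller than the full tree, and the question is whether test accuracy holds up.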

---

### Conclusion  
- **Pre-Pruning** avoids overfitting by **limiting the growth** of the tree in advance, saving time and resources.  
- **Post-Pruning** allows the tree to fully explore data patterns and then **removes overfitted branches**, improving generalization.  
Both methods are essential in practice, and the choice depends on the trade-off between **efficiency** (pre-pruning) and **accuracy** (post-pruning).  

## Q4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?  

**Answer:** When constructing a Decision Tree, the main goal is to create splits that **best separate the data into homogeneous groups**. One of the most widely used criteria for selecting splits is **Information Gain (IG)**, which is derived from the concept of **Entropy** in information theory.  

---

### 1. Entropy as a Measure of Impurity  
Entropy measures the amount of **disorder or uncertainty** in a dataset. For a node with $C$ classes and probabilities $p_i$ of each class $i$, entropy is:  

$$
Entropy(S) = - \sum_{i=1}^{C} p_i \log_2(p_i)
$$  

- If all samples belong to one class, $Entropy = 0$ (pure node).  
- If classes are equally distributed, entropy is maximum (highest impurity).  

---

### 2. Information Gain (IG)  
**Definition:** Information Gain is the **reduction in entropy** achieved after splitting a dataset based on a particular attribute.  

**Formula:**  
$$
IG(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \cdot Entropy(S_v)
$$  

Where:  
- $S$ = parent dataset  
- $A$ = attribute used for splitting  
- $S_v$ = subset of $S$ for which attribute $A$ has value $v$  
- $|S_v|/|S|$ = proportion of samples in subset  

---

### 3. Why It Matters for Splitting  
- At each step, the Decision Tree algorithm evaluates all features.  
- The feature that gives the **highest Information Gain** is chosen for the split, because it provides the **greatest reduction in uncertainty**.  
- This ensures that child nodes are **purer** than the parent node.  

---

### 4. Numerical Example  
Suppose we want to predict whether students will **Pass** or **Fail** based on *Study Hours*.  

- Parent node: 10 samples → 6 Pass, 4 Fail.  
Entropy:  
$$
Entropy(parent) = - (0.6 \log_2 0.6 + 0.4 \log_2 0.4) \approx 0.97
$$  

- After split:  
  - Node A (5 samples): 4 Pass, 1 Fail → $Entropy(A) \approx 0.72$  
  - Node B (5 samples): 2 Pass, 3 Fail → $Entropy(B) \approx 0.97$  

Weighted average entropy:  
$$
Entropy_{children} = \frac{5}{10}(0.72) + \frac{5}{10}(0.97) = 0.845
$$  

Information Gain:  
$$
IG = 0.97 - 0.845 = 0.125
$$  

This means splitting on *Study Hours* reduces impurity, making it a good candidate feature.  
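
The same calculation can be reproduced in code. This is a minimal standard-library sketch; `entropy` and `information_gain` are illustrative helper names. Because it does not round the intermediate entropies, it gives about 0.1245 rather than the hand-computed 0.125.

```python
import math


def entropy(counts):
    """Entropy (base 2) of a node given its class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(parent_counts, children_counts):
    """Parent entropy minus the size-weighted entropy of the children."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted


parent = [6, 4]        # 6 Pass, 4 Fail
node_a = [4, 1]        # Node A: 4 Pass, 1 Fail
node_b = [2, 3]        # Node B: 2 Pass, 3 Fail

ig = information_gain(parent, [node_a, node_b])
print(f"Entropy(A) = {entropy(node_a):.3f}, Entropy(B) = {entropy(node_b):.3f}")
print(f"Information Gain = {ig:.4f}")
```

A tree-building algorithm would run this calculation for every candidate split and keep the one with the largest gain.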

---

### 5. Importance of Information Gain  
- Ensures **maximum reduction in impurity** at each split.  
- Guides the algorithm toward creating **purer nodes** and more accurate predictions.  
- Prevents arbitrary splits by giving a **quantitative criterion** for feature selection.  

---

### Conclusion  
Information Gain is central to Decision Trees as it measures the effectiveness of a split in reducing uncertainty. By consistently choosing attributes with the highest Information Gain, the tree builds a hierarchy of decisions that progressively improve classification accuracy and interpretability.  

## Q5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?  

**Answer:** Decision Trees are widely used in practice due to their simplicity, interpretability, and ability to handle diverse types of data. They are applied in many real-world domains where **rule-based classification or prediction** is required.  

---

### 1. Common Real-World Applications  

1. **Finance and Banking**  
   - Loan approval: Banks use decision trees to decide whether to approve or reject a loan based on factors such as income, credit history, and employment status.  
   - Fraud detection: Classifies transactions as “fraudulent” or “genuine” by analyzing transaction patterns.  

2. **Healthcare**  
   - Medical diagnosis: Helps in predicting the presence of a disease based on patient symptoms and test results.  
   - Risk stratification: Identifies high-risk patients for preventive care.  

3. **Marketing and Customer Analytics**  
   - Customer segmentation: Identifies which customers are more likely to respond to promotions.  
   - Churn prediction: Classifies customers as “likely to churn” or “retain” based on service usage patterns.  

4. **Education**  
   - Predicting student performance based on attendance, study hours, and assignment completion.  
   - Identifying at-risk students for intervention.  

5. **Operations and Manufacturing**  
   - Quality control: Classifies defective vs. non-defective items in production.  
   - Supply chain optimization: Decision rules for choosing suppliers or logistics routes.  

---

### 2. Advantages of Decision Trees  
- **Interpretability:** Easy to understand and explain (tree rules resemble human reasoning).  
- **No data scaling required:** Works with raw data without normalization.  
- **Handles mixed data types:** Can process both categorical and numerical variables.  
- **Non-parametric:** Makes no assumption about data distribution.  
- **Feature selection built-in:** Automatically selects the most important attributes for splitting.  

---

### 3. Limitations of Decision Trees  
- **Overfitting:** Trees can grow very deep and capture noise in the data.  
- **Instability:** Small changes in data can lead to very different tree structures.  
- **Bias toward features with many categories:** Attributes with many distinct values can dominate splits.  
- **Lower predictive accuracy (alone):** Often less accurate compared to ensemble methods (e.g., Random Forest, Gradient Boosting).  

---

### 4. Practical Example  
In banking, a decision tree for loan approval may start with:  
- Root node: `Credit Score > 650?`  
  - If **Yes → Check Income > 40,000?**  
    - If **Yes → Approve Loan**  
    - If **No → Further check Employment Stability**  
  - If **No → Reject Loan**  

This rule-based flow is transparent and explainable to both customers and loan officers.  

---

### Conclusion  
Decision Trees have become essential in many domains such as finance, healthcare, marketing, and education because of their clarity and ease of interpretation. Their main **advantage** lies in their **simplicity and transparency**, while their main **limitations** are a tendency to **overfit** and **instability** under small changes in the data. To overcome these drawbacks, ensemble methods like **Random Forests** and **Gradient Boosted Trees** are often used in practice.

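Dataset Info:
- Iris Dataset for classification tasks (`sklearn.datasets.load_iris()` or provided CSV).
- Boston Housing Dataset for regression tasks (`sklearn.datasets.load_boston()` or provided CSV). Note: `load_boston` was removed in scikit-learn 1.2; Q8 below uses the California Housing dataset instead.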

## Q6: Write a Python program to:
- Load the Iris Dataset
- Train a Decision Tree Classifier using the Gini criterion
- Print the model’s accuracy and feature importances

(Include your Python code and output in the code box below.)

**Answer:** Below is the Python program that:
1. Loads the **Iris dataset**  
2. Trains a **Decision Tree Classifier** using the **Gini criterion**  
3. Prints the **accuracy** and **feature importances**  



```python
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split dataset into training and testing (80%-20%)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Initialize Decision Tree Classifier with Gini criterion
clf = DecisionTreeClassifier(criterion='gini', random_state=42)

# Train the model
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print accuracy and feature importances
print("Decision Tree Classifier (Gini) Results:")
print("Accuracy on Test Set:", round(accuracy, 3))
print("Feature Importances:", clf.feature_importances_)
print("Feature Names:", iris.feature_names)
```

```
Decision Tree Classifier (Gini) Results:
Accuracy on Test Set: 1.0
Feature Importances: [0.         0.01667014 0.90614339 0.07718647]
Feature Names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
```


### Sample Output (Actual from running the code):

Decision Tree Classifier (Gini) Results:  
Accuracy on Test Set: **1.0**  
Feature Importances: **[0.0, 0.01667014, 0.90614339, 0.07718647]**  
Feature Names: **['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']**  

---

### Interpretation:
- The model achieved **100% accuracy** on the test set with `random_state=42`.  
- The most important feature for classification is **petal length** (≈90.6%), followed by **petal width** (≈7.7%).  
- Sepal features contribute very little to the classification in this dataset, highlighting that **petal measurements dominate species differentiation**.  
  



## Q7: Write a Python program to
- Load the Iris Dataset
- Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.

(Include your Python code and output in the code box below.)

**Answer:** This program:
1. Loads the **Iris dataset**  
2. Trains a **Decision Tree Classifier** with `max_depth=3`  
3. Trains a **fully-grown Decision Tree** for comparison  
4. Prints **accuracy scores** of both models and compares them.


```python
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split dataset into training and testing sets (80%-20%)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Initialize Decision Tree Classifier with max_depth=3
clf_limited = DecisionTreeClassifier(max_depth=3, criterion='gini', random_state=42)
clf_limited.fit(X_train, y_train)

# Initialize fully-grown Decision Tree Classifier
clf_full = DecisionTreeClassifier(criterion='gini', random_state=42)
clf_full.fit(X_train, y_train)

# Make predictions
y_pred_limited = clf_limited.predict(X_test)
y_pred_full = clf_full.predict(X_test)

# Evaluate accuracy
accuracy_limited = accuracy_score(y_test, y_pred_limited)
accuracy_full = accuracy_score(y_test, y_pred_full)

# Print results
print("Decision Tree with max_depth=3 Accuracy:", round(accuracy_limited, 3))
print("Fully-grown Decision Tree Accuracy:", round(accuracy_full, 3))
```

```
Decision Tree with max_depth=3 Accuracy: 1.0
Fully-grown Decision Tree Accuracy: 1.0
```


### Sample Output (reproducible because `random_state=42` fixes the split):

Decision Tree with max_depth=3 Accuracy: 1.0  
Fully-grown Decision Tree Accuracy: 1.0  

---

### Interpretation:
- The **limited-depth tree (max_depth=3)** achieves the **same accuracy** as the fully-grown tree on this dataset.  
- Limiting depth helps **reduce overfitting**, makes the tree **simpler and more interpretable**, and often performs similarly on small, well-structured datasets like Iris.  
- Fully-grown trees can be deeper, potentially memorizing training data, which may reduce generalization in larger or noisier datasets.  
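
A single 30-sample test set cannot separate the two models when both score 1.0. One way to look closer, sketched here as an assumption rather than as part of the original assignment, is to compare them with 5-fold cross-validation on the full Iris data and to inspect how large each tree actually is.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = {
    "max_depth=3": DecisionTreeClassifier(max_depth=3, random_state=42),
    "fully grown": DecisionTreeClassifier(random_state=42),
}

# Mean accuracy over 5 stratified folds for each model
cv_means = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    cv_means[name] = scores.mean()
    print(f"{name}: mean CV accuracy = {scores.mean():.3f} (std {scores.std():.3f})")

# Tree size after fitting on all of the data
for name, model in models.items():
    model.fit(X, y)
    print(f"{name}: depth = {model.get_depth()}, leaves = {model.get_n_leaves()}")
```

If the cross-validated accuracies are close, the smaller tree is usually the better choice because it is easier to read and less likely to have memorized noise.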

## Q8: Write a Python program to:
- Load the California Housing dataset from sklearn
-  Train a Decision Tree Regressor
-  Print the Mean Squared Error (MSE) and feature importances

(Include your Python code and output in the code box below.)

**Answer:** This program:
1. Loads the **California Housing dataset** from `sklearn.datasets`  
2. Trains a **Decision Tree Regressor**  
3. Prints the **Mean Squared Error (MSE)** and **feature importances**


```python
# Import necessary libraries
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np

# Load California Housing dataset
california = fetch_california_housing()
X = california.data
y = california.target
feature_names = california.feature_names

# Split dataset into training and testing sets (80%-20%)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Initialize Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=42)

# Train the model
regressor.fit(X_train, y_train)

# Make predictions
y_pred = regressor.predict(X_test)

# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)

# Print results
print("Decision Tree Regressor Results:")
print("Mean Squared Error (MSE) on Test Set:", round(mse, 3))
print("Feature Importances:", regressor.feature_importances_)
print("Feature Names:", feature_names)
```

```
Decision Tree Regressor Results:
Mean Squared Error (MSE) on Test Set: 0.495
Feature Importances: [0.52850909 0.05188354 0.05297497 0.02866046 0.03051568 0.13083768
 0.09371656 0.08290203]
Feature Names: ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']
```


### Sample Output:

Decision Tree Regressor Results:  
Mean Squared Error (MSE) on Test Set: 0.495  
Feature Importances: [0.529, 0.052, 0.053, 0.029, 0.031, 0.131, 0.094, 0.083]  
Feature Names: ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']  

---

### Interpretation:
- The **MSE ≈ 0.495** indicates the average squared error between predicted and actual median house values.  
- The most influential feature is **MedInc** (median income, ≈52.9%), followed by **AveOccup** (average occupancy per household, ≈13.1%) and the location features **Latitude** (≈9.4%) and **Longitude** (≈8.3%). This aligns with the expectation that income and location strongly affect housing prices.  
- `AveBedrms` (≈2.9%) and `Population` (≈3.1%) contribute least to the model’s predictions.  
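
MSE is in squared units, so it is often easier to read as a root mean squared error (RMSE). The short calculation below assumes, per the scikit-learn dataset description, that the California Housing target is the median house value in units of $100,000.

```python
import math

mse = 0.495                     # test-set MSE reported above
rmse = math.sqrt(mse)           # back in the target's own units
rmse_dollars = rmse * 100_000   # assumes the target is in units of $100,000

print(f"RMSE = {rmse:.3f} (about ${rmse_dollars:,.0f})")
```

An RMSE of roughly 0.70 means the unpruned tree's predictions are typically off by around $70,000, which gives the MSE a more concrete scale.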

## Q9: Write a Python program to:
-  Load the Iris Dataset
-  Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV
-  Print the best parameters and the resulting model accuracy

(Include your Python code and output in the code box below.)

**Answer:** This program:
1. Loads the **Iris dataset**  
2. Uses **GridSearchCV** to tune `max_depth` and `min_samples_split` of a Decision Tree Classifier  
3. Prints the **best hyperparameters** and the corresponding **model accuracy**



```python
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split dataset into training and testing (80%-20%)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Initialize the Decision Tree Classifier
clf = DecisionTreeClassifier(criterion='gini', random_state=42)

# Define hyperparameter grid
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 5, 10]
}

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=clf, param_grid=param_grid,
                           cv=5, scoring='accuracy', n_jobs=-1)

# Fit GridSearchCV
grid_search.fit(X_train, y_train)

# Get the best parameters and corresponding model
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

# Evaluate on the test set
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Print results
print("Best Hyperparameters:", best_params)
print("Test Set Accuracy with Best Model:", round(accuracy, 3))
```

```
Best Hyperparameters: {'max_depth': 4, 'min_samples_split': 2}
Test Set Accuracy with Best Model: 1.0
```


### Sample Output (typical):

Best Hyperparameters: {'max_depth': 4, 'min_samples_split': 2}  
Test Set Accuracy with Best Model: 1.0  

---

### Interpretation:
- The **best hyperparameters** found by GridSearchCV are `max_depth=4` and `min_samples_split=2`.  
- With these parameters, the Decision Tree achieves **100% accuracy** on the test set.  
- Hyperparameter tuning helps **control overfitting** by restricting tree depth and the minimum number of samples needed to split a node, improving generalization.  
- This shows that a tree capped at **depth 4** is enough to classify the Iris test set perfectly while remaining interpretable.  


## Q10: Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values. Explain the step-by-step process you would follow to:
- Handle the missing values
- Encode the categorical features
- Train a Decision Tree model
- Tune its hyperparameters
- Evaluate its performance

And describe what business value this model could provide in the real-world setting.

**Answer:** In a healthcare scenario, building a predictive model requires careful **data preprocessing, model training, hyperparameter tuning, evaluation, and alignment with business goals**.

---

### Step 1: Handle Missing Values
- Identify missing values using `df.isnull().sum()`.  
- Impute missing values:  
  - Numerical features → **median**  
  - Categorical features → **most frequent value**  

### Step 2: Encode Categorical Features
- scikit-learn's Decision Trees require numerical input, so categorical columns must be encoded first.  
- Categorical features are **one-hot encoded** after imputation.  

### Step 3: Train a Decision Tree Model
- Split data into **training and testing sets**.  
- Train a **Decision Tree Classifier** using the Gini criterion.  

### Step 4: Tune Hyperparameters
- Use **GridSearchCV** to find the best `max_depth` and `min_samples_split`.  

### Step 5: Evaluate Model Performance
- Use **classification report** to check Accuracy, Precision, Recall, and F1-score.  
- In healthcare, **high recall** is critical to minimize **false negatives**.  

### Step 6: Business Value
- **Early Detection:** Enables timely intervention, reducing complications and costs.  
- **Resource Allocation:** Helps identify high-risk patients and optimize medical resources.  
- **Personalized Care:** Supports tailored treatment plans.  
- **Data-Driven Decisions:** Assists in planning preventive programs and improving healthcare quality.  
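
Steps 1 to 5 can be combined into one scikit-learn pipeline. The sketch below runs on a small synthetic table, because no patient dataset is provided. The column names, the rule that generates the label, and the 10% missing rate are all made up for illustration. Grid search is scored on recall to reflect the emphasis on avoiding false negatives.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient records: hypothetical columns, not real data.
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "blood_pressure": rng.normal(125, 15, n),
    "smoker": pd.Series(rng.choice(["yes", "no"], n), dtype=object),
    "sex": pd.Series(rng.choice(["F", "M"], n), dtype=object),
})
# Synthetic label: risk rises with age and smoking, plus noise.
risk = (df["age"] > 60).astype(int) + (df["smoker"] == "yes").astype(int)
y = ((risk + rng.normal(0, 0.5, n)) > 1).astype(int)
# Knock out roughly 10% of two columns to simulate missing values.
df.loc[rng.random(n) < 0.1, "blood_pressure"] = np.nan
df.loc[rng.random(n) < 0.1, "smoker"] = np.nan

numeric_cols = ["age", "blood_pressure"]
categorical_cols = ["smoker", "sex"]

# Steps 1-2: median / most-frequent imputation, then one-hot encoding.
preprocessor = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

# Step 3: preprocessing and the tree in one pipeline, fitted on training data only.
pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", DecisionTreeClassifier(criterion="gini", random_state=42)),
])
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=42
)

# Step 4: tune depth and minimum split size, scoring on recall.
search = GridSearchCV(
    pipeline,
    {"classifier__max_depth": [3, 5, None],
     "classifier__min_samples_split": [2, 10]},
    cv=5, scoring="recall",
)
search.fit(X_train, y_train)

# Step 5: evaluate on the held-out test set.
y_pred = search.predict(X_test)
print("Best hyperparameters:", search.best_params_)
print(classification_report(y_test, y_pred))
```

Keeping imputation and encoding inside the pipeline means they are re-fitted on each training fold during cross-validation, so no information from the validation folds leaks into preprocessing.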

---







The cell below exercises the pipeline, tuning, and evaluation steps on the Iris dataset as a stand-in. Iris has no missing values or categorical columns, so the imputation and encoding steps are not exercised here.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
import pandas as pd

# Load Iris dataset
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

# No categorical columns in Iris; just numerical
preprocessor = ColumnTransformer([
    ('num', 'passthrough', X.columns)
])

# Pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', DecisionTreeClassifier(random_state=42))
])

# Hyperparameter tuning
param_grid = {
    'classifier__max_depth': [2, 3, None],
    'classifier__min_samples_split': [2, 3, 4]
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X, y)

# Note: predictions are made on the same data used for fitting, so the
# report below is optimistic; a held-out test set gives a fairer estimate.
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X)

print("Best Hyperparameters:", grid_search.best_params_)
print("\nClassification Report:\n", classification_report(y, y_pred))
```

```
Best Hyperparameters: {'classifier__max_depth': 3, 'classifier__min_samples_split': 2}

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        50
           1       0.98      0.94      0.96        50
           2       0.94      0.98      0.96        50

    accuracy                           0.97       150
   macro avg       0.97      0.97      0.97       150
weighted avg       0.97      0.97      0.97       150
```

