1. **What is a Decision Tree, and how does it work in the context of classification?**


**Decision Tree**

- A Decision Tree is a supervised machine learning algorithm used for classification and regression.
- In classification, it predicts a class label by learning simple decision rules from the data.

**How It Works in Classification**

- The algorithm splits the dataset into smaller subsets based on feature values.
- At each step, it selects the feature (and split point) that best separates the data into distinct classes.
- Splitting continues until:
  - all samples in a node belong to the same class, or
  - stopping criteria like max depth or minimum samples per leaf are reached.

**Structure**

- **Root Node**: Represents the entire dataset and the first splitting feature.
- **Internal Nodes**: Represent decision points based on feature values.
- **Leaf Nodes**: Contain final class predictions.

**Decision Rule**

- At each node, the split is chosen using metrics such as:
  - Gini Index
  - Entropy (Information Gain)

**Final Prediction**

- A new sample is classified by traversing the tree from root to leaf, following decision rules at each node, resulting in a class label.
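
The root-to-leaf traversal can be sketched with a hand-written toy tree. The node layout, feature names, and thresholds below are invented for illustration and do not come from a fitted model.

```python
# A toy tree: internal nodes test one feature against a threshold,
# leaves carry a class label. All values here are made up.
toy_tree = {
    "feature": "petal_length", "threshold": 2.5,
    "left": {"label": "setosa"},
    "right": {
        "feature": "petal_width", "threshold": 1.75,
        "left": {"label": "versicolor"},
        "right": {"label": "virginica"},
    },
}

def predict(node, sample):
    """Walk from the root to a leaf, going left when value <= threshold."""
    while "label" not in node:
        branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]

print(predict(toy_tree, {"petal_length": 1.4, "petal_width": 0.2}))  # setosa
print(predict(toy_tree, {"petal_length": 5.1, "petal_width": 2.0}))  # virginica
```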


2. **Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?**


**Gini Impurity**

- Gini Impurity measures how often a randomly chosen sample would be misclassified if it were labeled at random according to the class distribution in the node.
- Formula: Gini = 1 − Σ (pᵢ)²  
  where pᵢ is the proportion of samples belonging to class i.
- Lower Gini indicates a purer node (more samples from a single class).

**Entropy**

- Entropy measures the amount of uncertainty or disorder in a node.
- Formula: Entropy = − Σ (pᵢ * log₂(pᵢ))  
  where pᵢ is the proportion of samples belonging to class i.
- Lower entropy means higher purity.
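
As a small worked example, both formulas can be computed directly from class proportions. The function names are illustrative, not part of any library.

```python
from math import log2

def gini(proportions):
    """Gini = 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in proportions)

def entropy(proportions):
    """Entropy = -sum(p_i * log2(p_i)); classes with p_i == 0 contribute nothing."""
    return sum(-p * log2(p) for p in proportions if p > 0)

# A pure node, a perfectly mixed two-class node, and a mostly pure node
for dist in ([1.0, 0.0], [0.5, 0.5], [0.9, 0.1]):
    print(dist, "gini:", round(gini(dist), 3), "entropy:", round(entropy(dist), 3))
```

Both measures are 0 for a pure node and reach their two-class maximum (0.5 for Gini, 1.0 for entropy) at a 50/50 mix.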

**Impact on Decision Tree Splitting**

- Both Gini and Entropy evaluate how well a feature split separates the data into pure groups.
- During tree building:
  - The algorithm evaluates candidate splits for each feature (for example, thresholds between consecutive sorted values).
  - It chooses the split that results in the **largest reduction in impurity** (highest Information Gain for entropy or highest Gini reduction).
- The goal is to create nodes that contain samples mostly from a single class, improving classification accuracy.
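
The split search can be sketched for a single numeric feature. This is a simplified illustration of the idea (midpoint thresholds, Gini reduction), not the exact procedure any particular library uses, and the data is made up.

```python
from collections import Counter

def gini_of(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Try midpoints between consecutive distinct values.

    Returns (threshold, impurity_reduction) for the best split found,
    or (None, 0.0) if no split reduces impurity.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    parent = gini_of([label for _, label in pairs])
    best = (None, 0.0)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = len(left) / n * gini_of(left) + len(right) / n * gini_of(right)
        reduction = parent - weighted
        if reduction > best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, reduction)
    return best

values = [1.0, 1.5, 2.0, 3.0, 3.5, 4.0]
labels = ["a", "a", "a", "b", "b", "b"]
threshold, reduction = best_threshold(values, labels)
print("threshold:", threshold, "gini reduction:", reduction)
```

Here the parent node has Gini 0.5 and the split at 2.5 produces two pure children, so the whole impurity is removed.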

3. **What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.**

**Pre-Pruning**

- Pre-pruning (also called early stopping) stops the tree from growing too deep during training.
- Conditions like maximum depth, minimum samples per split, or minimum information gain are applied to halt further splits.

**Practical Advantage of Pre-Pruning**

- Reduces training time and keeps the tree from becoming overly complex, which limits overfitting from the start.

**Post-Pruning**

- Post-pruning allows the tree to grow fully first and then removes branches that do not provide meaningful improvement.
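- Nodes or subtrees are pruned based on validation performance (reduced-error pruning) or on a cost-complexity criterion that trades training error against tree size (cost-complexity pruning, exposed in scikit-learn as `ccp_alpha`).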

**Practical Advantage of Post-Pruning**

- Produces simpler and more generalizable trees because pruning decisions are based on observed performance rather than assumptions during training.
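
A minimal sketch of both approaches in scikit-learn, assuming the Iris data used elsewhere in this notebook. The specific limits (`max_depth=3`, `min_samples_leaf=5`) and the choice of a middle alpha are arbitrary and only for illustration; in practice the alpha would be selected by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Pre-pruning: growth is limited by constraints set before fitting.
pre_pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=42
).fit(X_train, y_train)

# Post-pruning: grow the full tree, then get the candidate alphas
# for cost-complexity pruning from the training data.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Arbitrary pick for illustration; clamp at 0 to guard against rounding noise.
alpha = max(0.0, path.ccp_alphas[len(path.ccp_alphas) // 2])
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_train, y_train)

for name, tree in [("full", full_tree), ("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    print(f"{name}: nodes={tree.tree_.node_count}, test accuracy={tree.score(X_test, y_test):.3f}")
```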


4. **What is Information Gain in Decision Trees, and why is it important for choosing the best split?**

**Information Gain**

- Information Gain measures how much uncertainty (impurity) is reduced in the target variable after splitting the dataset based on a feature.
- It is calculated as the difference between the entropy of the parent node and the weighted entropy of the child nodes after the split.

**Formula**

  Information Gain = Entropy(parent) − Σ ( (nᵢ / N) × Entropy(childᵢ) )

  where nᵢ = number of samples in child node i  
  and N = total number of samples in the parent node
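
A worked example of the formula, using made-up class counts: a parent node with 5 positive and 5 negative samples is split into children with counts [4, 0] and [1, 5]. The helper names are illustrative.

```python
from math import log2

def entropy_from_counts(counts):
    """Entropy of a node given its per-class sample counts."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    """Entropy(parent) minus the size-weighted entropy of the children."""
    n_parent = sum(parent_counts)
    weighted = sum(
        sum(child) / n_parent * entropy_from_counts(child)
        for child in children_counts
    )
    return entropy_from_counts(parent_counts) - weighted

gain = information_gain([5, 5], [[4, 0], [1, 5]])
print("Information Gain:", round(gain, 3))  # about 0.61
```

The parent entropy is 1.0, the left child is pure (entropy 0), and the right child has entropy of about 0.65, giving 1.0 − 0.6 × 0.65 ≈ 0.61.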

**Why It Is Important for Choosing the Best Split**

- Information Gain identifies which feature provides the most effective separation of classes at each step.
- A higher Information Gain means the split leads to purer child nodes.
- Decision Trees aim to reduce impurity with every split, so the feature with the highest Information Gain is chosen.
- Because the choice is greedy (made one split at a time), it tends to produce compact, accurate trees but does not guarantee a globally optimal tree.

5. **What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?**

**Real-World Applications of Decision Trees**

- **Fraud detection** in banking and financial institutions.
- **Medical diagnosis** for predicting disease likelihood.
- **Customer churn prediction** in telecom and subscription services.
- **Loan approval** and credit risk assessment.
- **Product recommendation** and customer segmentation in e-commerce.
- **Weather prediction** and agriculture planning.

---

**Advantages**

- **Easy to understand and interpret** because decisions are shown in a tree-like structure.
- **Requires little data preprocessing** (no scaling or normalization needed).
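- **Can handle both numerical and categorical variables**, although some implementations (scikit-learn's, for example) require categorical features to be numerically encoded first.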
- **Captures non-linear relationships** between features and target.

---

**Limitations**

- **Prone to overfitting** when the tree grows too deep.
- **Small changes in data can drastically alter the tree structure**, reducing stability.
- **Biased toward features with many possible split values**, unless corrected.
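- **Less suited to continuous prediction tasks than ensemble models**: a single tree predicts one constant value per leaf, so its output is stepwise and cannot extrapolate beyond the target range seen in training.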


In [2]:
# 6. Write a Python program to:
# ● Load the Iris Dataset
# ● Train a Decision Tree Classifier using the Gini criterion
# ● Print the model’s accuracy and feature importances

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Train–test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train Decision Tree Classifier using Gini criterion
model = DecisionTreeClassifier(criterion='gini', random_state=42)
model.fit(X_train, y_train)

# Predict and calculate accuracy
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Print results
print("Accuracy of Decision Tree Classifier:", accuracy)
print("Feature Importances:", model.feature_importances_)
print("Feature Names and Importance:")
for name, score in zip(iris.feature_names, model.feature_importances_):
    print(f"{name}: {score}")

Accuracy of Decision Tree Classifier: 1.0
Feature Importances: [0.         0.01667014 0.90614339 0.07718647]
Feature Names and Importance:
sepal length (cm): 0.0
sepal width (cm): 0.016670139612419255
petal length (cm): 0.9061433868879218
petal width (cm): 0.07718647349965893


In [3]:
# 7. Write a Python program to:
# ● Load the Iris Dataset
# ● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Train–test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fully grown decision tree
full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(X_train, y_train)
full_pred = full_tree.predict(X_test)
full_accuracy = accuracy_score(y_test, full_pred)

# Decision tree with max_depth=3
pruned_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
pruned_tree.fit(X_train, y_train)
pruned_pred = pruned_tree.predict(X_test)
pruned_accuracy = accuracy_score(y_test, pruned_pred)

# Print accuracy comparison
print("Accuracy of Fully-Grown Tree:", full_accuracy)
print("Accuracy of Tree with max_depth=3:", pruned_accuracy)

Accuracy of Fully-Grown Tree: 1.0
Accuracy of Tree with max_depth=3: 1.0


In [5]:
# 8. Write a Python program to:
# ● Load the Boston Housing Dataset
# ● Train a Decision Tree Regressor
# ● Print the Mean Squared Error (MSE) and feature importances

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load Boston Housing dataset from OpenML
# (load_boston was removed from scikit-learn in version 1.2)
data = fetch_openml(name="boston", version=1, as_frame=True)
X = data.data
y = data.target
feature_names = X.columns

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train Decision Tree Regressor
model = DecisionTreeRegressor(random_state=42)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Compute MSE
mse = mean_squared_error(y_test, y_pred)

# Display results
print("Mean Squared Error (MSE):", mse)
print("\nFeature Importances:")
for name, importance in zip(feature_names, model.feature_importances_):
    print(f"{name}: {importance:.4f}")


Mean Squared Error (MSE): 10.416078431372549

Feature Importances:
CRIM: 0.0513
ZN: 0.0034
INDUS: 0.0058
CHAS: 0.0000
NOX: 0.0271
RM: 0.6003
AGE: 0.0136
DIS: 0.0707
RAD: 0.0019
TAX: 0.0125
PTRATIO: 0.0110
B: 0.0090
LSTAT: 0.1933


In [6]:
# 9. Write a Python program to:
# ● Load the Iris Dataset
# ● Tune the Decision Tree’s max_depth and min_samples_split using
# GridSearchCV
# ● Print the best parameters and the resulting model accuracy

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load Iris Dataset
iris = load_iris()
X = iris.data
y = iris.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Parameter grid for tuning
param_grid = {
    "max_depth": [1, 2, 3, 4, 5, None],
    "min_samples_split": [2, 3, 4, 5, 6]
}

# Decision Tree Classifier with GridSearchCV
dt = DecisionTreeClassifier(random_state=42)
grid_search = GridSearchCV(
    estimator=dt,
    param_grid=param_grid,
    cv=5,
    scoring="accuracy",
    n_jobs=-1
)

grid_search.fit(X_train, y_train)

# Best parameters
print("Best Parameters:", grid_search.best_params_)

# Evaluate on test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Model Accuracy with Best Parameters:", accuracy)


Best Parameters: {'max_depth': 4, 'min_samples_split': 2}
Model Accuracy with Best Parameters: 1.0


10. **Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values. Explain the step-by-step process you would follow to:**
- Handle the missing values
- Encode the categorical features
- Train a Decision Tree model
- Tune its hyperparameters
- Evaluate its performance

And describe what business value this model could provide in a real-world setting.

**Overall approach (high-level)**

- **Goal:** build a robust, explainable Decision Tree classifier to predict disease presence from mixed-type healthcare data, while handling missing values and avoiding leakage.
- **Key principles:** preserve clinical meaning, avoid leakage, handle class imbalance, prefer interpretability and calibration, monitor post-deployment.

**1) Handle missing values**

- **Understand missingness**
  - Determine mechanism: Missing Completely at Random (MCAR), Missing at Random (MAR), Missing Not at Random (MNAR).
  - Check missing-rate per feature and by class (target).
- **Simple strategies**
  - **Numerical:** median imputation (robust) or mean if symmetric; consider KNN or IterativeImputer for complex patterns.
  - **Categorical:** impute with a special category like `"Missing"` or the mode.
- **Advanced strategies**
  - **Model-based imputation:** Iterative (multivariate) imputer or predictive models if missingness is informative.
  - **Missingness indicator:** create a binary flag `feature_missing` for features with >~1% missing to capture MNAR signals.
- **Practical rules**
  - Never impute based on target or test set statistics during training (avoid leakage).
  - Impute using training set statistics and apply same transformers to validation/test.
  - Prefer simple, interpretable imputations in healthcare; preserve original distribution if possible.
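
A minimal sketch of median imputation with missingness flags, fitted on training data only. The feature meanings (`age`, a lab value) and the numbers are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric features: [age, lab_value]; np.nan marks a missing entry.
X_train = np.array([[50.0, 1.2], [61.0, np.nan], [47.0, 0.9], [np.nan, 1.5]])
X_test = np.array([[np.nan, 1.1], [58.0, np.nan]])

# add_indicator=True appends one binary "was missing" column for each
# feature that had missing values during fit.
imputer = SimpleImputer(strategy="median", add_indicator=True)
imputer.fit(X_train)                         # medians come from the training data only
X_test_imputed = imputer.transform(X_test)   # the same medians are reused for the test data

print("Training medians:", imputer.statistics_)
print(X_test_imputed)
```

The output has four columns: the two imputed features followed by their missingness flags.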

**2) Encode categorical features**

- **Separate types**
  - **Nominal (no order):** use One-Hot Encoding for low-cardinality features; target encoding is an alternative as cardinality grows.
  - **Ordinal (known order):** use OrdinalEncoder with a clinically meaningful mapping.
  - **High-cardinality (>20–50 categories):** consider target/mean encoding, frequency encoding, or embedding techniques; apply smoothing and cross-validation to prevent leakage.
- **Avoid leakage with target encoding**
  - Use K-fold target encoding or implement target encoding inside cross-validation folds / training pipeline.
- **Tree-specific notes**
  - Decision trees do not require scaling.
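  - Trees can split on integer-encoded categories, but scikit-learn treats the codes as ordered numbers, so the mapping must be identical at training and inference time, and for nominal features the implied order is arbitrary.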
- **Pipeline usage**
  - Use `ColumnTransformer` to apply different encoders per column so transformations are reproducible.
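
A short sketch of per-column encoding with `ColumnTransformer`. The column names and categories are hypothetical; the ordinal mapping would come from clinical input in a real project.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical columns, for illustration only.
df = pd.DataFrame({
    "blood_type": ["A", "B", "O", "A"],                      # nominal
    "pain_level": ["mild", "severe", "moderate", "mild"],    # ordinal
})

encoder = ColumnTransformer(
    [
        # Unknown categories at inference time become all-zero rows.
        ("nominal", OneHotEncoder(handle_unknown="ignore"), ["blood_type"]),
        # Explicit category order: mild=0, moderate=1, severe=2.
        ("ordinal", OrdinalEncoder(categories=[["mild", "moderate", "severe"]]), ["pain_level"]),
    ],
    sparse_threshold=0,  # always return a dense array, for easy printing
)

encoded = encoder.fit_transform(df)
print(encoded)
```

The first three output columns are the one-hot indicators for `A`, `B`, and `O`; the last column is the ordinal code.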

**3) Train a Decision Tree model**

- **Data split**
  - Create stratified train/validation/test splits (e.g., 60/20/20) to preserve class ratios.
- **Baseline model**
  - Train a simple `DecisionTreeClassifier(random_state=R)` with default parameters to establish baseline metrics.
- **Preprocessing pipeline**
  - Build a scikit-learn `Pipeline` (with a `ColumnTransformer` for per-column preprocessing) that:
    - imputes numericals and categoricals,
    - applies encoders,
    - optionally creates missingness indicators and basic interactions,
    - then fits the `DecisionTreeClassifier`.
- **Class imbalance**
  - If disease is rare, use `class_weight='balanced'`, oversampling such as SMOTE applied only to the training folds within cross-validation, or threshold adjustment based on business costs.
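
The training steps above might be wired together roughly as follows. The data is synthetic, and the column names (`age`, `bmi`, `smoker`), the label rule, and the tree settings are assumptions made only so the sketch runs end to end.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient data; names and label rule are invented.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "bmi": rng.normal(27, 4, n),
    "smoker": rng.choice(["yes", "no"], n),
})
y = ((df["age"] > 60) & (df["smoker"] == "yes")).astype(int)

# Inject missing values after the labels have been derived.
df.loc[rng.choice(n, 20, replace=False), "bmi"] = np.nan
df.loc[rng.choice(n, 20, replace=False), "smoker"] = np.nan

preprocess = ColumnTransformer([
    # Median imputation plus missingness flags for numeric columns
    ("num", SimpleImputer(strategy="median", add_indicator=True), ["age", "bmi"]),
    # "Missing" category, then one-hot encoding, for categorical columns
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="Missing")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), ["smoker"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(max_depth=4, class_weight="balanced", random_state=42)),
])

# Stratified split keeps the class ratio similar in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=42
)
model.fit(X_train, y_train)  # imputers and encoder are fitted on training data only
test_predictions = model.predict(X_test)
print("Test accuracy:", model.score(X_test, y_test))
```

Keeping preprocessing inside the pipeline means cross-validation and test-time prediction reuse statistics learned from training data, which is what prevents leakage.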

**4) Tune hyperparameters**

- **Important hyperparameters to consider**
  - `max_depth` — controls complexity and overfitting.
  - `min_samples_split` / `min_samples_leaf` — avoid tiny leaves.
  - `max_features` — number of features to consider at each split.
  - `criterion` — 'gini' or 'entropy'.
  - `class_weight` — handle imbalance.
  - `ccp_alpha` — cost-complexity pruning parameter.
- **Search strategy**
  - Use `GridSearchCV` or `RandomizedSearchCV` with `StratifiedKFold` CV.
  - Score using metrics relevant to business (see evaluation below), not just accuracy.
  - Use nested CV for unbiased estimation when tuning and reporting final performance.
- **Computational tips**
  - Start with wide `RandomizedSearchCV`, then refine with `GridSearchCV`.
  - Limit tree depth early to avoid extremely large trees.
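
A sketch of tuning with stratified folds, a non-accuracy metric, and nested cross-validation. The data is synthetic (about 10% positives), and the grid values are arbitrary starting points rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data.
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=42
)

param_grid = {
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5, 20],
    "ccp_alpha": [0.0, 0.001, 0.01],
}
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

search = GridSearchCV(
    DecisionTreeClassifier(class_weight="balanced", random_state=42),
    param_grid,
    scoring="average_precision",  # summarizes the precision-recall curve
    cv=inner_cv,
)

# Nested CV: the outer loop scores the whole tuning procedure
# on folds that the inner search never saw.
nested_scores = cross_val_score(search, X, y, scoring="average_precision", cv=outer_cv)
print("Nested CV average precision:", nested_scores.mean())

# Final fit on all data to obtain the parameters to deploy.
search.fit(X, y)
print("Best parameters:", search.best_params_)
```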

**5) Evaluate performance**

- **Evaluation splits**
  - Hold out a final test set not used during training or tuning.
- **Metrics (choose by business needs)**
  - **Primary:** ROC-AUC (general separability), PR-AUC (for imbalanced data), or F1 if balancing precision/recall.
  - **Clinical emphasis:** sensitivity (recall) to catch disease cases, specificity to avoid false alarms — choose threshold accordingly.
  - **Other:** Accuracy (useful if classes balanced), Precision, Recall, F1-score, Confusion Matrix.
  - **Calibration:** use calibration curve / Brier score — well-calibrated probabilities are critical in healthcare.
- **Explainability & validation**
  - Compute feature importances; use SHAP/LIME to explain individual predictions and validate clinical plausibility.
  - Check partial dependence plots for how features affect risk.
- **Error analysis**
  - Examine false negatives and false positives for systematic patterns (age groups, missing data, subpopulations).
- **Robustness & stability**
  - Test model on subgroups (age, gender, hospital) to detect bias.
  - Perform sensitivity analyses (imputation methods, encoding variants).
- **Operational metrics**
  - Decision-curve analysis or cost-benefit analysis to quantify clinical utility given downstream costs (tests, treatments, missed cases).
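
To illustrate how the decision threshold trades sensitivity against specificity, here is a small sketch with made-up labels and predicted probabilities. The helper names are illustrative.

```python
def confusion_counts(y_true, y_prob, threshold):
    """Return (tp, fp, tn, fn), predicting positive when prob >= threshold."""
    tp = fp = tn = fn = 0
    for truth, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and truth == 1:
            tp += 1
        elif predicted == 1 and truth == 0:
            fp += 1
        elif predicted == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def sensitivity_specificity(y_true, y_prob, threshold):
    tp, fp, tn, fn = confusion_counts(y_true, y_prob, threshold)
    return tp / (tp + fn), tn / (tn + fp)

# Made-up ground truth (1 = disease) and model probabilities
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_prob = [0.9, 0.6, 0.35, 0.4, 0.2, 0.1, 0.55, 0.8]

for threshold in (0.5, 0.3):
    sens, spec = sensitivity_specificity(y_true, y_prob, threshold)
    print(f"threshold={threshold}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Lowering the threshold from 0.5 to 0.3 catches the one missed disease case (sensitivity rises from 0.75 to 1.00) at the cost of one more false alarm (specificity falls from 0.75 to 0.50).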

**6) Deployment & monitoring (brief)**

- **Model packaging**
  - Serialize pipeline (`joblib`) that includes imputer, encoder, and tree.
- **Monitoring**
  - Track data drift, model performance drift, and calibration over time.
  - Re-evaluate and retrain periodically as new labeled data arrives.
- **Governance**
  - Maintain audit trail, clinician review, and documentation of feature transformations and limitations.
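
A minimal sketch of packaging a fitted pipeline with `joblib`. The Iris data stands in for real patient data, and the file name is hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("tree", DecisionTreeClassifier(max_depth=3, random_state=42)),
]).fit(X, y)

# Hypothetical file name; a temporary directory keeps the example self-cleaning.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "disease_tree_pipeline.joblib")
    joblib.dump(pipeline, path)     # preprocessing and model are saved together
    restored = joblib.load(path)

same = bool((restored.predict(X) == pipeline.predict(X)).all())
print("Restored pipeline reproduces predictions:", same)
```

Saving the whole pipeline, rather than the tree alone, ensures the same imputation and encoding are applied at inference time.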

**7) Business value of the model**

- **Early detection & intervention**
  - Identify high-risk patients earlier so clinicians can run confirmatory tests or start preventive care.
- **Resource allocation**
  - Prioritize limited diagnostic resources (imaging, specialist appointments) for likely positives.
- **Cost reduction**
  - Reduce unnecessary testing for low-risk patients and avoid expensive late-stage treatments via early action.
- **Operational efficiency**
  - Automate triage and flagging in electronic health records to streamline workflows.
- **Clinical decision support**
  - Provide transparent risk scores and feature-based explanations to assist clinician judgement.
- **Population health**
  - Inform screening programs and public-health interventions by identifying high-risk cohorts.
- **Caveats**
  - Must ensure model fairness, transparency, and clinical validation before live use; false negatives in healthcare can have serious consequences so align threshold and operating point with clinical risk tolerance.

**8) Final recommendations**

- In healthcare, prioritize explainability, calibration, and conservative thresholds for clinical safety.
- Involve clinicians early for feature selection and to validate model outputs.
- Maintain continuous monitoring and a retraining plan as the data distribution evolves.
