# Decision Tree | Assignment

## Question 1
**What is a Decision Tree, and how does it work in the context of classification?**

### Answer
A **Decision Tree** is a supervised learning model that predicts a target by learning simple decision rules inferred from data features.  
For **classification**, the tree recursively partitions the feature space into regions that are (ideally) pure in terms of class labels:

1. **Root node**: Start with all training data.  
2. **Split**: Choose the feature and threshold that best separate the classes (e.g., by maximizing information gain or minimizing the weighted Gini impurity of the child nodes).  
3. **Recursion**: Repeat splitting on each child node until a stopping criterion is met (e.g., max depth, minimum samples, or pure leaf).  
4. **Prediction**: For a new sample, traverse the tree from root to a **leaf** following the decision rules; the predicted class is the majority class at that leaf.

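Trees are **interpretable** (each root-to-leaf path is a human-readable set of if/else rules) and need little preprocessing; in particular, no feature scaling is required. In principle they handle both numeric and categorical features, although some implementations, including scikit-learn, require categorical features to be encoded as numbers first.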
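
The steps above can be sketched with scikit-learn. This is a minimal illustration, not part of the assignment's required code: the choice of `max_depth=2` and `random_state=0` is an assumption made only to keep the printed rules short and reproducible.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A deliberately shallow tree so the printed rules stay short.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# export_text renders the learned splits as nested if/else rules.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)

# Prediction: follow the rules from the root down to a leaf.
sample = iris.data[:1]
predicted = tree.predict(sample)[0]
print("Predicted class:", iris.target_names[predicted])
```

Each indented line in the printed rules corresponds to one split, and each `class:` line is a leaf.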


## Question 2
**Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?**

### Answer
Let a node have class proportions \(p_1, p_2, \dots, p_K\).

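- **Gini Impurity**: \( G = 1 - \sum_{k=1}^K p_k^2 \). It is 0 for a pure node and reaches its maximum, \(1 - 1/K\), when all classes are equally likely; lower Gini means a purer node.  
- **Entropy**: \( H = -\sum_{k=1}^K p_k \log_2 p_k \), with the convention \(0 \log_2 0 = 0\). It is 0 for a pure node and reaches its maximum, \(\log_2 K\), when all classes are equally likely.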

**Impact on splits**: During training, the algorithm evaluates candidate splits and chooses the one that **reduces impurity** the most, that is, the one with the lowest sample-weighted average impurity across the child nodes. When entropy is the measure, this reduction is called **information gain**. In practice, Gini and entropy often produce similar trees; Gini is slightly cheaper to compute because it needs no logarithm, while entropy has an information-theoretic interpretation.
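
To make the two formulas concrete, here is a small sketch that evaluates both measures for a few two-class nodes. The helper names `gini` and `entropy` are illustrative, not library functions.

```python
import math

def gini(proportions):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in proportions)

def entropy(proportions):
    """Shannon entropy in bits; classes with p = 0 contribute nothing."""
    return sum(p * math.log2(1.0 / p) for p in proportions if p > 0)

# From a pure node to a perfectly mixed two-class node.
for props in ([1.0, 0.0], [0.9, 0.1], [0.5, 0.5]):
    print(props, f"gini={gini(props):.3f}", f"entropy={entropy(props):.3f}")
```

Both measures are zero for the pure node and largest for the 50/50 node, which is why they tend to rank candidate splits similarly.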


## Question 3
**What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.**

### Answer
- **Pre-pruning (early stopping)**: Limit tree growth during training using constraints like `max_depth`, `min_samples_split`, `min_samples_leaf`, or `max_leaf_nodes`.

  - *Advantage*: Faster training and smaller, more interpretable trees that generalize better when data is noisy.

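- **Post-pruning (cost-complexity pruning)**: Grow a large tree first, then remove subtrees whose impurity reduction does not justify their extra leaves. A complexity penalty controls how aggressively to prune (e.g., `ccp_alpha` in scikit-learn), and its value is typically chosen by validation or cross-validation performance.

  - *Advantage*: The full tree is seen before it is simplified, so a split that looks weak on its own but enables useful splits further down can be kept, whereas early stopping might never have made it.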
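
The two approaches can be compared side by side in scikit-learn. The sketch below is one possible setup, not a prescribed recipe: the Iris data, the 70/30 split, the specific constraints, and the rule of picking `ccp_alpha` by validation accuracy are all assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Pre-pruning: constrain growth while the tree is being built.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow fully, then ask for the candidate pruning strengths.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score, post_pruned = None, -1.0, None
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative round-off
    candidate = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
    candidate.fit(X_train, y_train)
    score = candidate.score(X_val, y_val)
    if score >= best_score:  # on ties, prefer the larger alpha (smaller tree)
        best_alpha, best_score, post_pruned = alpha, score, candidate

print("Full tree leaves:       ", full_tree.get_n_leaves())
print("Pre-pruned tree leaves: ", pre_pruned.get_n_leaves())
print("Post-pruned tree leaves:", post_pruned.get_n_leaves(), f"(ccp_alpha={best_alpha:.4f})")
```

With more data, cross-validation over `ccp_alpha` would be a more reliable way to choose the penalty than a single validation split.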


## Question 4
**What is Information Gain in Decision Trees, and why is it important for choosing the best split?**

### Answer
**Information Gain (IG)** is the **reduction in impurity** (often entropy) achieved by a split. If \(H(\text{parent})\) is the impurity of a node and the split creates children with impurities \(H(\text{left}), H(\text{right})\) that receive fractions \(w_l, w_r\) of the parent's samples (so \(w_l + w_r = 1\)), then:
\[ IG = H(\text{parent}) - \big(w_l H(\text{left}) + w_r H(\text{right})\big). \]

A split with higher IG yields children that are **purer**, improving class separability and usually predictive performance. The tree chooses the split (feature + threshold) that **maximizes** IG (or equivalently minimizes weighted child impurity).
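
A worked example may help. The sketch below computes IG for two hypothetical splits of a 10-sample node with 5 samples of each class; the labels and the helper names `entropy` and `information_gain` are invented for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Parent entropy minus the sample-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

parent = ["A"] * 5 + ["B"] * 5

# Split 1 separates the classes fairly well: 4A/1B versus 1A/4B.
ig_good = information_gain(parent, ["A"] * 4 + ["B"], ["A"] + ["B"] * 4)

# Split 2 barely separates them: 3A/2B versus 2A/3B.
ig_poor = information_gain(parent, ["A"] * 3 + ["B"] * 2, ["A"] * 2 + ["B"] * 3)

print(f"IG of split 1: {ig_good:.4f}")
print(f"IG of split 2: {ig_poor:.4f}")
```

Split 1 has the higher gain (about 0.278 bits versus about 0.029 bits), so a tree using the entropy criterion would prefer it.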


## Question 5
**What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?**

### Answer
**Applications**: credit approval, fraud detection, medical diagnosis, churn prediction, risk scoring, quality control, and simple rule-based decision support.

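**Advantages**: interpretable rules, little preprocessing (no feature scaling needed), captures nonlinear relationships and feature interactions, applicable to mixed data types (after encoding, depending on the implementation), and largely unaffected by monotonic transformations of individual features, since only the ordering of values determines the splits.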

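**Limitations**: high variance (overfitting) if unpruned; greedy split selection is myopic and can miss splits that only pay off further down the tree; axis-aligned splits approximate diagonal decision boundaries poorly; and small changes in the data can produce a very different tree. Ensembles (Random Forests, Gradient Boosting) often perform better but lose some interpretability.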


## Question 6
**Load the Iris Dataset; train a Decision Tree Classifier using the Gini criterion; print the model’s accuracy and feature importances.**

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
feature_names = iris.feature_names
class_names = iris.target_names

# Train/test split (stratified so each class keeps its share in both sets)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Train classifier with Gini
clf_gini = DecisionTreeClassifier(criterion='gini', random_state=42)
clf_gini.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = clf_gini.predict(X_test)
acc = accuracy_score(y_test, y_pred)

print(f"Accuracy (Gini): {acc:.4f}\n")
print("Feature Importances:")
for name, imp in zip(feature_names, clf_gini.feature_importances_):
    print(f"  {name:25s} {imp:.4f}")
```

Output:

```
Accuracy (Gini): 0.8947

Feature Importances:
  sepal length (cm)         0.0134
  sepal width (cm)          0.0201
  petal length (cm)         0.9199
  petal width (cm)          0.0466
```
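
The importances above are impurity-based, so they are computed from the training data only. As an optional cross-check that goes beyond what the question asks, the sketch below compares them with permutation importances measured on the test set, using the same split and model settings. The choice of `n_repeats=20` is an arbitrary assumption.

```python
from sklearn.datasets import load_iris
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=42, stratify=iris.target
)
clf = DecisionTreeClassifier(criterion="gini", random_state=42).fit(X_train, y_train)

# Shuffle one feature at a time on the held-out set and record the accuracy drop.
perm = permutation_importance(clf, X_test, y_test, n_repeats=20, random_state=42)

for name, imp_impurity, imp_perm in zip(
    iris.feature_names, clf.feature_importances_, perm.importances_mean
):
    print(f"  {name:25s} impurity={imp_impurity:.4f}  permutation={imp_perm:.4f}")
```

If both measures rank the same features highly, the ranking is less likely to be an artifact of how the tree happened to be grown.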


## Question 7
**Train a Decision Tree Classifier with `max_depth=3` and compare its accuracy to a fully-grown tree.**

```python
# Constrained-depth tree
clf_depth3 = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42)
clf_depth3.fit(X_train, y_train)
acc_depth3 = accuracy_score(y_test, clf_depth3.predict(X_test))

# Fully grown tree: reuse clf_gini from Question 6, which has no depth limit
acc_full = accuracy_score(y_test, clf_gini.predict(X_test))

print(f"Accuracy (max_depth=3): {acc_depth3:.4f}")
print(f"Accuracy (fully-grown): {acc_full:.4f}")
```

Output:

```
Accuracy (max_depth=3): 0.8947
Accuracy (fully-grown): 0.8947
```
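
Test accuracy alone does not show how the two trees differ. The following sketch, which rebuilds the same split so it can run on its own, also reports training accuracy, depth, and leaf count for each tree; the `results` dictionary is just an illustrative container.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

results = {}
for label, depth in [("max_depth=3", 3), ("fully-grown", None)]:
    model = DecisionTreeClassifier(criterion="gini", max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    results[label] = {
        "train_acc": model.score(X_train, y_train),
        "test_acc": model.score(X_test, y_test),
        "depth": model.get_depth(),
        "leaves": model.get_n_leaves(),
    }
    print(label, results[label])
```

A gap between training and test accuracy that widens as the tree grows is the usual sign of overfitting; equal test accuracy with fewer leaves favors the simpler tree.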


## Question 8
**Train a Decision Tree Regressor on a built-in (offline-friendly) regression dataset and report MSE and feature importances.**

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load dataset (offline, bundled with scikit-learn)
dia = load_diabetes()
Xr, yr = dia.data, dia.target
feature_names_reg = dia.feature_names

# Train/test split
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    Xr, yr, test_size=0.25, random_state=42
)

# Train regressor (no depth limit)
dtr = DecisionTreeRegressor(random_state=42)
dtr.fit(Xr_train, yr_train)

# Evaluate on the held-out test set
pred = dtr.predict(Xr_test)
mse = mean_squared_error(yr_test, pred)

print(f"MSE: {mse:.4f}\n")
print("Feature Importances:")
for name, imp in zip(feature_names_reg, dtr.feature_importances_):
    print(f"  {name:25s} {imp:.4f}")
```

Output:

```
MSE: 5941.7027

Feature Importances:
  age                       0.0501
  sex                       0.0194
  bmi                       0.4145
  bp                        0.0777
  s1                        0.0749
  s2                        0.0595
  s3                        0.0572
  s4                        0.0314
  s5                        0.1397
  s6                        0.0756
```
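
An MSE value is hard to judge in isolation. As an optional addition, the sketch below puts it in context by comparing the unconstrained tree with a mean-predicting baseline and a depth-limited tree on the same split, reporting RMSE and R². The `max_depth=3` setting and the choice of baseline are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

models = {
    "mean baseline": DummyRegressor(strategy="mean"),
    "full tree": DecisionTreeRegressor(random_state=42),
    "max_depth=3 tree": DecisionTreeRegressor(max_depth=3, random_state=42),
}

scores = {}
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    mse = mean_squared_error(y_test, pred)
    scores[name] = {"rmse": float(np.sqrt(mse)), "r2": float(r2_score(y_test, pred))}
    print(f"{name:18s} RMSE={scores[name]['rmse']:.2f}  R2={scores[name]['r2']:.3f}")
```

RMSE is in the same units as the target, and an R² near or below zero means the model does no better than predicting the mean.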


## Question 9
**Tune the Decision Tree’s `max_depth` and `min_samples_split` using GridSearchCV; print the best parameters and accuracy.**

```python
from sklearn.model_selection import GridSearchCV

# Candidate values for the two hyperparameters (5 x 5 = 25 combinations)
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 4, 6, 8, 10]
}
grid = GridSearchCV(DecisionTreeClassifier(random_state=42),
                    param_grid=param_grid,
                    cv=5,
                    n_jobs=None)
grid.fit(X_train, y_train)

# best_estimator_ is refit on the whole training set with the best parameters
best = grid.best_estimator_
best_acc = accuracy_score(y_test, best.predict(X_test))

print("Best Params:", grid.best_params_)
print(f"Test Accuracy with Best Model: {best_acc:.4f}")
```

Output:

```
Best Params: {'max_depth': 3, 'min_samples_split': 2}
Test Accuracy with Best Model: 0.8947
```
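
The search selects parameters by mean cross-validated accuracy on the training folds, which is a different number from the test accuracy printed above. The self-contained sketch below reports both, using the same grid and split; `scoring="accuracy"` is set explicitly here, although it is already the default scorer for classifiers.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={
        "max_depth": [2, 3, 4, 5, None],
        "min_samples_split": [2, 4, 6, 8, 10],
    },
    cv=5,
    scoring="accuracy",
)
grid.fit(X_train, y_train)

print("Best params:", grid.best_params_)
print(f"Mean CV accuracy of best params: {grid.best_score_:.4f}")
print(f"Held-out test accuracy:          {grid.score(X_test, y_test):.4f}")
```

The cross-validated score is the one the search optimized; the test score is the less biased estimate of performance on new data.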


## Question 10
**End-to-end plan for disease prediction with mixed data and missing values.**

### Answer (Step-by-step)
1. **Exploration & Target Definition**: Understand label prevalence, feature types (numeric, categorical, ordinal), potential target leakage, and business KPIs (e.g., recall at a fixed precision, AUROC).

2. **Handling Missing Values**: Use imputers: median for skewed numeric features, mean for roughly normal numeric features, most-frequent for low-cardinality categorical features. Consider **iterative imputation** or domain-driven defaults where appropriate. Keep **missingness indicators** when missingness itself correlates with the label.

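3. **Encoding Categorical Features**: Trees need no feature scaling, but scikit-learn's implementation still requires numeric inputs. Ordinal encoding (e.g., `OrdinalEncoder`) suits ordered categories but imposes an arbitrary order on nominal ones. One-hot encoding avoids that, provided dimensionality stays reasonable (group rare levels, or use target encoding with leakage-safe CV).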

4. **Train/Validation Split**: Stratified split (or time-based split if temporal). Keep a hold-out test set. Address class imbalance with class weights (e.g., `class_weight='balanced'`) or resampling.

5. **Model Training (Decision Tree)**: Start with constraints: `max_depth`, `min_samples_split`, `min_samples_leaf`. Use pipelines to bundle imputers + encoders + classifier.

6. **Hyperparameter Tuning**: Grid/random search over depth, leaf size, split criteria, and `ccp_alpha` (cost-complexity pruning). Optimize for a metric aligned to business goals (e.g., **recall** to reduce missed positives) and include calibration if probabilistic thresholds matter.

7. **Evaluation**: Report accuracy, precision/recall/F1, ROC-AUC, PR-AUC, confusion matrix, and calibration curve. Perform cross-validation. Check subgroup performance for fairness.

8. **Business Value**: The model prioritizes patients for further testing or early intervention, optimizing clinician time and cost while improving patient outcomes. Use decision thresholds aligned with capacity and risk tolerance, monitor drift, and retrain on new data.

9. **Deployment & Monitoring**: Serve via an API or batch scoring; log predictions, feature distributions, and outcomes; set up alerts for performance decay and data drift.
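
Steps 2 to 5 above can be bundled into a single scikit-learn pipeline so that imputation and encoding are fit only on the training folds. The sketch below is a minimal illustration under loud assumptions: the data is synthetic, and the column names (`age`, `bmi`, `smoker`), the label rule, and the hyperparameter values are all hypothetical placeholders rather than a real clinical schema.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: column names, values, and label rule are invented.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(20, 80, size=n).astype(float),
    "bmi": rng.normal(27, 4, size=n),
    "smoker": rng.choice(["yes", "no"], size=n).astype(object),
})
y = ((df["age"] > 55) | (df["smoker"] == "yes")).astype(int)

# Inject missing values into one numeric and one categorical column.
df.loc[rng.choice(n, size=20, replace=False), "bmi"] = np.nan
df.loc[rng.choice(n, size=20, replace=False), "smoker"] = np.nan

numeric_cols = ["age", "bmi"]
categorical_cols = ["smoker"]

preprocess = ColumnTransformer([
    # Median imputation plus a missingness indicator for numeric features.
    ("num", SimpleImputer(strategy="median", add_indicator=True), numeric_cols),
    # Most-frequent imputation followed by one-hot encoding for categoricals.
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(
        max_depth=4, min_samples_leaf=5, class_weight="balanced", random_state=0
    )),
])

# Recall is used as the metric here, matching the goal of few missed positives.
scores = cross_val_score(model, df, y, cv=5, scoring="recall")
print("Recall per fold:", np.round(scores, 3))
print(f"Mean recall: {scores.mean():.3f}")
```

Because preprocessing lives inside the pipeline, the same object can be passed to `GridSearchCV` for step 6 without leaking validation data into the imputers.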



**Note**: Tree-based ensembles (Random Forests/Gradient Boosting) typically outperform a single tree, at the cost of the direct rule-level interpretability a single tree offers. Post hoc tools such as feature importances and SHAP values recover part of that interpretability.