# Decision Tree — Assignment Answers

### Question 1 — What is a Decision Tree, and how does it work in the context of classification?

**Answer (short):**  
A **Decision Tree** is a tree-structured model used for classification and regression. In classification, the tree partitions the feature space into regions using simple decision rules (e.g., `feature <= threshold`). Each internal node tests one feature, each branch represents an outcome of the test, and each leaf node assigns a class label (or class probabilities). The tree is grown greedily by choosing, at each node, the split that most reduces an impurity measure on the training data (e.g., entropy, which gives Information Gain, or Gini impurity).

**How it works (mechanics):**
1. Start with the root node containing the full training set.
2. For each candidate feature and threshold, compute an impurity measure for the resulting child nodes.
3. Select the split that yields the largest decrease in impurity (highest information gain).
4. Recursively split child nodes until a stopping criterion is met (pure nodes, max depth, min samples).
5. For prediction, pass a sample down the tree following the tests until a leaf is reached; the leaf's class is the predicted class (or class distribution).
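
Steps 2 and 3 can be illustrated with a minimal sketch that searches thresholds on a single numeric feature and scores each candidate by weighted Gini impurity. The helper names `gini` and `best_split` and the toy data are hypothetical; real implementations such as scikit-learn search over all features and are considerably more efficient.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted_impurity) for the best 'value <= threshold' split."""
    n = len(values)
    best = (None, gini(labels))  # baseline: impurity of the unsplit node
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
        if weighted < best[1]:
            best = (t, weighted)
    return best

# Toy data (hypothetical): one feature, two classes
petal_len = [1.4, 1.3, 1.5, 4.7, 4.5, 5.1]
species = ["a", "a", "a", "b", "b", "b"]
threshold, impurity = best_split(petal_len, species)
print(threshold, impurity)
```

Applying `best_split` recursively to the left and right subsets, until a stopping criterion is met, yields the full tree described in step 4.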


### Question 2 — Gini Impurity and Entropy (impurity measures)

**Gini Impurity:**  
- For a node with class probabilities \(p_k\), Gini impurity is \(G = 1 - \sum_k p_k^2\).  
- It measures the probability of mislabeling a random sample from the node if it were randomly labeled according to the class distribution.

**Entropy:**  
- Entropy is \(H = -\sum_k p_k \log_2 p_k\).  
- It measures the amount of “disorder” or unpredictability in the class distribution.

**Impact on splits:**  
- Both measure impurity; splits are chosen to reduce impurity the most (maximize Information Gain).  
- Entropy changes more steeply than Gini as a class probability approaches 0 or 1, so it is somewhat more sensitive to small minority classes in a node. Gini avoids the logarithm and is therefore slightly cheaper to compute. In practice both often produce similar trees; Gini is the default criterion in scikit-learn's `DecisionTreeClassifier`.
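
As a small worked example, the two formulas can be evaluated on a few class distributions. The function names are illustrative, and the input is assumed to be a list of class probabilities that sums to 1.

```python
import math

def gini(probs):
    """Gini impurity: 1 - sum of squared class probabilities."""
    return 1.0 - sum(p * p for p in probs)

def entropy(probs):
    """Shannon entropy in bits; zero-probability classes contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

for probs in ([0.5, 0.5], [0.9, 0.1], [1.0, 0.0]):
    print(probs, "gini =", round(gini(probs), 4), "entropy =", round(entropy(probs), 4))
```

Both measures are largest for the evenly mixed node and drop to zero for the pure node; for two classes, Gini peaks at 0.5 while entropy peaks at 1 bit.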


### Question 3 — Pre-Pruning vs Post-Pruning

**Pre-Pruning (early stopping):**  
- Stop tree growth early using criteria such as `max_depth`, `min_samples_split`, `min_samples_leaf`, or a minimum impurity decrease.  
- **Advantage:** Reduces overfitting and training time because the tree never grows too large.

**Post-Pruning (prune after full growth):**  
- Grow a full tree, then remove (prune) branches that do not improve performance on a validation set, for example with cost-complexity pruning.  
- **Advantage:** Can produce simpler trees while keeping beneficial splits discovered in deep parts of the tree (often yields better accuracy than naive pre-pruning).
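
The contrast can be sketched with scikit-learn, assuming the Iris data and a simple train/validation split chosen only for illustration. Pre-pruning caps growth through constructor parameters, while post-pruning here uses `cost_complexity_pruning_path` to obtain candidate `ccp_alpha` values and keeps the one with the best validation accuracy.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Pre-pruning: limit growth up front
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
pre.fit(X_train, y_train)

# Post-pruning: candidate alphas come from the fully grown tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
candidates = [
    # max(a, 0.0) is a precaution against tiny negative alphas from rounding
    DecisionTreeClassifier(random_state=0, ccp_alpha=max(a, 0.0)).fit(X_train, y_train)
    for a in path.ccp_alphas
]
post = max(candidates, key=lambda m: m.score(X_val, y_val))

print("pre-pruned  leaves:", pre.get_n_leaves(), "val acc:", pre.score(X_val, y_val))
print("post-pruned leaves:", post.get_n_leaves(), "val acc:", post.score(X_val, y_val))
```

Because the validation set is used to pick `ccp_alpha`, its accuracy is an optimistic estimate; a separate test set or cross-validation would be needed for an unbiased figure.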


### Question 4 — Information Gain

**Information Gain (IG):**  
- IG measures the reduction in impurity (entropy or Gini) due to a split. For entropy:  
  \(\text{IG} = H(\text{parent}) - \sum_{i \in \{\text{children}\}} \frac{N_i}{N}\, H(i)\).  
- **Why important:** It quantifies how much a split improves class purity; the split with the highest IG is typically chosen as the best split.
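
A short sketch of the entropy-based formula, comparing a split that separates the classes perfectly with one that barely helps. The function names and toy labels are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """H(parent) minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    return entropy(parent) - sum(len(ch) / n * entropy(ch) for ch in children)

parent = ["yes"] * 5 + ["no"] * 5
perfect = [["yes"] * 5, ["no"] * 5]                      # each child is pure
weak = [["yes"] * 3 + ["no"] * 2, ["yes"] * 2 + ["no"] * 3]  # children stay mixed

print("perfect split IG:", information_gain(parent, perfect))
print("weak split IG:   ", round(information_gain(parent, weak), 4))
```

The perfect split recovers the full 1 bit of parent entropy, while the weak split gains only about 0.03 bits, so a tree builder would prefer the former.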


### Question 5 — Applications, advantages and limitations

**Common applications:**  
- Medical diagnosis, credit scoring, customer churn prediction, fraud detection, loan approval, and many tabular-data tasks.  
- Also used inside ensemble methods like Random Forests and Gradient Boosted Trees.

**Advantages:**  
- Easy to interpret and visualize.  
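- Handles both numerical and categorical features in principle (scikit-learn's implementation requires categorical features to be encoded numerically first).  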
- Requires little feature preparation (no scaling needed).  
- Can model non-linear relationships and interactions.

**Limitations:**  
- Prone to overfitting if not pruned.  
- Unstable: small changes in data can yield different trees.  
- Single trees often underperform compared to ensembles (Random Forests, XGBoost) on predictive accuracy.


In [1]:

```python
# Question 6: Load Iris, train DecisionTreeClassifier with Gini, print accuracy and feature importances

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target
# Stratified 80/20 split keeps class proportions equal in train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)

print('Accuracy (Iris, Gini):', acc)
print('Feature importances:')
for name, imp in zip(iris.feature_names, clf.feature_importances_):
    print(f'  {name}: {imp:.4f}')
```

```text
Accuracy (Iris, Gini): 0.9333333333333333
Feature importances:
  sepal length (cm): 0.0062
  sepal width (cm): 0.0292
  petal length (cm): 0.5586
  petal width (cm): 0.4060
```


In [2]:

```python
# Question 7: Train DecisionTreeClassifier with max_depth=3 and compare to fully-grown tree
# (reuses X_train, X_test, y_train, y_test from the previous cell)

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# fully-grown tree (no max_depth)
clf_full = DecisionTreeClassifier(criterion='gini', random_state=42)
clf_full.fit(X_train, y_train)
acc_full = accuracy_score(y_test, clf_full.predict(X_test))

# constrained tree
clf_md3 = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42)
clf_md3.fit(X_train, y_train)
acc_md3 = accuracy_score(y_test, clf_md3.predict(X_test))

print(f'Fully-grown tree accuracy: {acc_full:.4f}')
print(f'max_depth=3 tree accuracy: {acc_md3:.4f}')
```

```text
Fully-grown tree accuracy: 0.9333
max_depth=3 tree accuracy: 0.9667
```


In [3]:

```python
# Question 8: Load Boston Housing Dataset (or fallback) and train DecisionTreeRegressor
# Print MSE and feature importances

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Try to load the Boston dataset; if unavailable, create a synthetic regression dataset
try:
    # load_boston was removed in scikit-learn 1.2; only older versions provide it
    from sklearn.datasets import load_boston
    bdata = load_boston()
    Xb, yb = bdata.data, bdata.target
    feature_names = list(bdata.feature_names)
    source = 'load_boston()'
except ImportError:
    # Fallback: create synthetic dataset with 13 features to mimic Boston's dimensionality
    from sklearn.datasets import make_regression
    Xb, yb = make_regression(n_samples=506, n_features=13, noise=0.5, random_state=42)
    feature_names = [f'feat_{i}' for i in range(Xb.shape[1])]
    source = 'make_regression (fallback)'

Xb_train, Xb_test, yb_train, yb_test = train_test_split(Xb, yb, test_size=0.2, random_state=42)

reg = DecisionTreeRegressor(random_state=42)
reg.fit(Xb_train, yb_train)
yb_pred = reg.predict(Xb_test)
mse = mean_squared_error(yb_test, yb_pred)

print('Dataset source:', source)
print('MSE (DecisionTreeRegressor):', mse)
print('\nFeature importances:')
for name, imp in zip(feature_names, reg.feature_importances_):
    print(f'  {name}: {imp:.4f}')
```

```text
Dataset source: make_regression (fallback)
MSE (DecisionTreeRegressor): 11399.504755366335

Feature importances:
  feat_0: 0.0077
  feat_1: 0.0099
  feat_2: 0.0065
  feat_3: 0.0238
  feat_4: 0.0673
  feat_5: 0.0458
  feat_6: 0.0686
  feat_7: 0.0188
  feat_8: 0.0327
  feat_9: 0.3592
  feat_10: 0.0055
  feat_11: 0.0231
  feat_12: 0.3310
```


In [4]:

```python
# Question 9: Tune max_depth and min_samples_split on Iris using GridSearchCV
# (reuses X_train, X_test, y_train, y_test from the Question 6 cell)

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

param_grid = {
    'max_depth': [None, 1, 2, 3, 4, 5],
    'min_samples_split': [2, 3, 4, 5, 10]
}

# 5-fold cross-validation on the training split only; the test split stays untouched
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X_train, y_train)

print('Best parameters found:', grid.best_params_)
best_clf = grid.best_estimator_
print('Validation (cross-validated) score:', grid.best_score_)
print('Test accuracy with best estimator:', accuracy_score(y_test, best_clf.predict(X_test)))
```

```text
Best parameters found: {'max_depth': None, 'min_samples_split': 2}
Validation (cross-validated) score: 0.9416666666666668
Test accuracy with best estimator: 0.9333333333333333
```


### Question 10 — Healthcare predictive model: step-by-step process

**Problem:** Predict whether a patient has a disease. Dataset has mixed data types and missing values.

**1) Handle missing values**
- Inspect missingness pattern (missing completely at random vs not).  
- If missingness is small: simple imputation — mean/median for numeric, most frequent or new category for categorical.  
- If missingness is informative: add binary indicator columns for "was_missing".  
- For advanced work: use `IterativeImputer` or model-based imputation.
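
A minimal sketch of median imputation combined with missingness indicators, using scikit-learn's `SimpleImputer` with `add_indicator=True`. The two columns (age, blood pressure) and their values are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric columns: age, systolic blood pressure
X = np.array([
    [63.0, 140.0],
    [np.nan, 120.0],
    [51.0, np.nan],
    [47.0, 130.0],
])

# Median imputation; add_indicator appends one 0/1 column per feature
# that had missing values during fit
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```

The first two output columns hold the imputed values and the last two flag where a value was originally missing. In a real workflow the imputer should be fitted on the training split only and then applied to validation and test data.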

**2) Encode categorical features**
- For low-cardinality categories: One-Hot Encoding (e.g., `OneHotEncoder`) or pandas `get_dummies`.  
- For high-cardinality categories: Target encoding, frequency encoding, or embedding techniques.  
- Ensure consistent encoding between train and production (save encoders).
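
One way to keep encoding consistent is to fit a `OneHotEncoder` on the training data and reuse the fitted object later. The sketch below assumes a hypothetical blood-type column and uses `handle_unknown="ignore"` so that a category unseen during training does not raise an error at prediction time.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column: blood type
train = np.array([["A"], ["B"], ["O"], ["A"]], dtype=object)
new = np.array([["B"], ["AB"]], dtype=object)  # "AB" never appeared in training

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)                           # categories are learned from training data only
encoded = encoder.transform(new).toarray()   # default output is sparse, so densify to print

print(encoder.categories_[0])
print(encoded)
```

The unseen category `"AB"` is encoded as an all-zero row. The fitted `encoder` is the object that would be saved and loaded in production.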

**3) Train a Decision Tree model**
- Split data into train/validation/test (or use cross-validation).  
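- Scaling numeric features is unnecessary: tree splits depend only on the ordering of feature values, so standardization or normalization does not change the resulting tree.  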
- Fit `DecisionTreeClassifier` using criterion (`gini` or `entropy`) and set basic regularization (`max_depth`, `min_samples_leaf`) to avoid overfitting.

**4) Tune hyperparameters**
- Use `GridSearchCV` or `RandomizedSearchCV` over `max_depth`, `min_samples_split`, `min_samples_leaf`, `max_features`, and `ccp_alpha` (cost-complexity pruning).  
- Use cross-validation and consider class imbalance (use `class_weight='balanced'` or resampling).
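
A hedged sketch of such a search, using a synthetic imbalanced dataset from `make_classification` as a stand-in for patient data. The grid values and the choice of recall as the scoring metric are illustrative assumptions, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient data with roughly 10% positive cases
X, y = make_classification(n_samples=600, n_features=10, weights=[0.9, 0.1], random_state=0)

param_grid = {
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5, 20],
    "ccp_alpha": [0.0, 0.001, 0.01],
}

search = GridSearchCV(
    # class_weight="balanced" reweights classes inversely to their frequency
    DecisionTreeClassifier(class_weight="balanced", random_state=0),
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),  # keeps class ratio per fold
    scoring="recall",
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Cross-validated recall:", search.best_score_)
```

For larger grids, `RandomizedSearchCV` accepts the same estimator and cross-validation arguments and samples a fixed number of parameter combinations instead of trying all of them.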

**5) Evaluate performance**
- Use metrics appropriate to the business objective: accuracy, precision, recall, F1-score, ROC AUC. For medical diagnosis, maximize sensitivity/recall (catch cases) while maintaining acceptable precision.  
- Calibrate predicted probabilities if needed (e.g., `CalibratedClassifierCV`).  
- Use confusion matrix, ROC, PR curves, and calibration plots.
- Perform fairness and bias checks and validate on hold-out / external dataset.
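
The trade-off between recall and precision can be made concrete with a small example. The labels and predicted probabilities below are hypothetical, and `report` is an illustrative helper that computes confusion-matrix counts at a chosen decision threshold.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, roc_auc_score

# Hypothetical ground truth (1 = disease) and predicted probabilities for 8 patients
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.8, 0.6, 0.1, 0.7, 0.3]

def report(threshold):
    """Confusion-matrix counts plus precision and recall at a decision threshold."""
    y_pred = [int(p >= threshold) for p in y_prob]
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "tp": int(tp), "fp": int(fp), "fn": int(fn), "tn": int(tn),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
    }

auc = roc_auc_score(y_true, y_prob)  # threshold-independent ranking quality
default = report(0.5)
lowered = report(0.35)

print("ROC AUC:", auc)
print("threshold 0.50:", default)
print("threshold 0.35:", lowered)
```

On this toy data, lowering the threshold from 0.5 to 0.35 removes the one missed case (recall rises from 0.75 to 1.0) without adding false positives; on real data a lower threshold usually costs some precision, which is why the threshold should be chosen against the clinical cost of each error type.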

**Business value**
- Early identification of patients who may have the disease (triage), helping prioritize further testing and intervention.  
- Better allocation of diagnostic resources and cost reduction by reducing unnecessary tests.  
- Improved patient outcomes via earlier treatment, and potential to build clinical decision support tools (with careful validation and regulatory compliance).
