# ML Practice Questions Part 6: Tree-Based Methods

This notebook covers decision trees and tree-based algorithms, including their theoretical foundations, implementation details, and practical considerations. Each question includes mathematical derivations, algorithmic implementations, and empirical analysis.

**Topics Covered:**
- Decision tree construction and splitting criteria
- Pruning techniques and overfitting prevention
- Tree ensemble methods fundamentals
- Feature importance and interpretation
- Handling categorical and missing data

**Format:** Each question includes theory, implementation, and analysis sections.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_classification, make_regression, load_iris, load_wine
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import accuracy_score, mean_squared_error, classification_report
from sklearn.preprocessing import LabelEncoder
import seaborn as sns
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8')
np.random.seed(42)

## Question 1: Decision Tree Construction and Splitting Criteria

**Question:** Implement a decision tree from scratch with multiple splitting criteria (Gini, Entropy, MSE). Compare their behavior and explain when to use each criterion.

### Theory

**Decision Tree Algorithm:**
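1. Start with the entire dataset at the root
2. Find the feature and threshold that give the best split
3. Recursively apply the same procedure to each child node
4. Stop when a stopping criterion is met (maximum depth reached, too few samples to split, or a pure node)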

**Splitting Criteria:**

**Gini Impurity (Classification):**
$$\text{Gini}(S) = 1 - \sum_{i=1}^c p_i^2$$
where $p_i$ is the proportion of class $i$ in set $S$ and $c$ is the number of classes

**Entropy (Classification):**
$$\text{Entropy}(S) = -\sum_{i=1}^c p_i \log_2(p_i)$$
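
As a quick sanity check on the two classification criteria above, the sketch below evaluates both on a few hand-picked binary class distributions. The helper names `gini` and `entropy` are illustrative only and are not part of the notebook's classes. Both measures peak at the uniform distribution (0.5 for Gini and 1 bit for entropy in the binary case).

```python
import numpy as np

def gini(p):
    """Gini impurity of a class-probability vector p (assumed to sum to 1)."""
    p = np.asarray(p, dtype=float)
    return float(1.0 - np.sum(p ** 2))

def entropy(p):
    """Shannon entropy in bits; zero-probability classes contribute nothing."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # avoid log2(0)
    return float(-np.sum(p * np.log2(p)))

# Illustrative distributions, from nearly pure to maximally mixed
for dist in ([0.9, 0.1], [0.7, 0.3], [0.5, 0.5]):
    print(dist, round(gini(dist), 4), round(entropy(dist), 4))
```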

**Mean Squared Error (Regression):**
$$\text{MSE}(S) = \frac{1}{|S|} \sum_{i \in S} (y_i - \bar{y})^2$$
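
Because a regression leaf predicts the node mean $\bar{y}$, the node MSE is the (population) variance of the targets in that node. A minimal check on made-up target values:

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0, 9.0])       # illustrative targets falling in one node
node_mse = np.mean((y - y.mean()) ** 2)   # squared error around the node mean
print(node_mse, np.var(y))                # both print 5.0
```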

**Information Gain:**
$$\text{IG}(S, A) = \text{Impurity}(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} \text{Impurity}(S_v)$$
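
For the binary threshold splits used in this notebook, the sum runs over just two subsets (left and right). The following sketch works through one such split by hand; `gini_from_labels` and `information_gain` are hypothetical helper names chosen for this example.

```python
import numpy as np

def gini_from_labels(y):
    """Gini impurity computed from an array of non-negative integer class labels."""
    p = np.bincount(y) / len(y)
    return 1.0 - np.sum(p ** 2)

def information_gain(y_parent, y_children, impurity=gini_from_labels):
    """Parent impurity minus the size-weighted impurity of the child subsets."""
    n = len(y_parent)
    weighted = sum(len(c) / n * impurity(c) for c in y_children)
    return impurity(y_parent) - weighted

parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])            # Gini = 0.5
left, right = np.array([0, 0, 0, 1]), np.array([0, 1, 1, 1])  # Gini = 0.375 each
print(information_gain(parent, [left, right]))          # 0.125
```

Swapping in an entropy or MSE function for `impurity` gives the gain under the other criteria.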

In [None]:
class DecisionTreeNode:
    """Node in a decision tree."""
    
    def __init__(self, feature=None, threshold=None, left=None, right=None, value=None, samples=None):
        self.feature = feature      # Feature index to split on
        self.threshold = threshold  # Threshold value for split
        self.left = left           # Left child node
        self.right = right         # Right child node
        self.value = value         # Prediction value (for leaf nodes)
        self.samples = samples     # Number of samples in this node
    
    def is_leaf(self):
        return self.value is not None

class DecisionTreeCustom:
    """Decision tree implementation from scratch."""
    
    def __init__(self, criterion='gini', max_depth=None, min_samples_split=2, 
                 min_samples_leaf=1, max_features=None, task='classification'):
        self.criterion = criterion
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.min_samples_leaf = min_samples_leaf
        self.max_features = max_features
        self.task = task
        self.root = None
        self.feature_importances_ = None
        
    def _gini_impurity(self, y):
        """Calculate Gini impurity."""
        if len(y) == 0:
            return 0
        
        counts = np.bincount(y)
        probabilities = counts / len(y)
        return 1 - np.sum(probabilities ** 2)
    
    def _entropy(self, y):
        """Calculate entropy."""
        if len(y) == 0:
            return 0
        
        counts = np.bincount(y)
        probabilities = counts / len(y)
        # Avoid log(0)
        probabilities = probabilities[probabilities > 0]
        return -np.sum(probabilities * np.log2(probabilities))
    
    def _mse(self, y):
        """Calculate mean squared error."""
        if len(y) == 0:
            return 0
        
        mean_y = np.mean(y)
        return np.mean((y - mean_y) ** 2)
    
    def _calculate_impurity(self, y):
        """Calculate impurity based on criterion."""
        if self.task == 'classification':
            if self.criterion == 'gini':
                return self._gini_impurity(y)
            elif self.criterion == 'entropy':
                return self._entropy(y)
            # Fail loudly instead of silently returning None
            raise ValueError(f"Unknown classification criterion: {self.criterion!r}")
        else:  # regression
            return self._mse(y)
    
    def _information_gain(self, y, y_left, y_right):
        """Calculate information gain from a split."""
        n = len(y)
        n_left, n_right = len(y_left), len(y_right)
        
        if n_left == 0 or n_right == 0:
            return 0
        
        parent_impurity = self._calculate_impurity(y)
        left_impurity = self._calculate_impurity(y_left)
        right_impurity = self._calculate_impurity(y_right)
        
        weighted_impurity = (n_left / n) * left_impurity + (n_right / n) * right_impurity
        
        return parent_impurity - weighted_impurity
    
    def _best_split(self, X, y):
        """Find the best split for the given data."""
        n_samples, n_features = X.shape
        
        if n_samples < self.min_samples_split:
            return None, None
        
        # Determine features to consider
        if self.max_features is None:
            features_to_consider = range(n_features)
        else:
            n_features_to_consider = min(self.max_features, n_features)
            features_to_consider = np.random.choice(n_features, n_features_to_consider, replace=False)
        
        best_gain = -1
        best_feature = None
        best_threshold = None
        
        for feature in features_to_consider:
            # Get unique values for potential thresholds
            thresholds = np.unique(X[:, feature])
            
            for threshold in thresholds:
                # Split data
                left_mask = X[:, feature] <= threshold
                right_mask = ~left_mask
                
                # Check minimum samples constraint
                if np.sum(left_mask) < self.min_samples_leaf or np.sum(right_mask) < self.min_samples_leaf:
                    continue
                
                # Calculate information gain
                y_left, y_right = y[left_mask], y[right_mask]
                gain = self._information_gain(y, y_left, y_right)
                
                if gain > best_gain:
                    best_gain = gain
                    best_feature = feature
                    best_threshold = threshold
        
        return best_feature, best_threshold
    
    def _build_tree(self, X, y, depth=0):
        """Recursively build the decision tree."""
        n_samples = len(y)
        
        # Determine leaf value
        if self.task == 'classification':
            leaf_value = np.bincount(y).argmax()  # Majority class
        else:
            leaf_value = np.mean(y)  # Mean for regression
        
        # Stopping criteria
        if (self.max_depth is not None and depth >= self.max_depth) or \
           n_samples < self.min_samples_split or \
           len(np.unique(y)) == 1:  # Pure node
            return DecisionTreeNode(value=leaf_value, samples=n_samples)
        
        # Find best split
        best_feature, best_threshold = self._best_split(X, y)
        
        if best_feature is None:
            return DecisionTreeNode(value=leaf_value, samples=n_samples)
        
        # Split data
        left_mask = X[:, best_feature] <= best_threshold
        right_mask = ~left_mask
        
        # Recursively build left and right subtrees
        left_child = self._build_tree(X[left_mask], y[left_mask], depth + 1)
        right_child = self._build_tree(X[right_mask], y[right_mask], depth + 1)
        
        return DecisionTreeNode(
            feature=best_feature,
            threshold=best_threshold,
            left=left_child,
            right=right_child,
            samples=n_samples
        )
    
    def fit(self, X, y):
        """Fit the decision tree."""
        X = np.array(X)
        y = np.array(y)
        
        self.n_features_ = X.shape[1]
        self.root = self._build_tree(X, y)
        
        # Calculate feature importances
        self._calculate_feature_importances(X, y)
        
        return self
    
    def _predict_sample(self, x, node):
        """Predict a single sample."""
        if node.is_leaf():
            return node.value
        
        if x[node.feature] <= node.threshold:
            return self._predict_sample(x, node.left)
        else:
            return self._predict_sample(x, node.right)
    
    def predict(self, X):
        """Make predictions."""
        X = np.array(X)
        return np.array([self._predict_sample(x, self.root) for x in X])
    
    def _calculate_feature_importances(self, X, y):
        """Calculate feature importances based on information gain."""
        importances = np.zeros(self.n_features_)
        
        def _traverse(node, X_subset, y_subset):
            if node.is_leaf():
                return
            
            # Calculate weighted information gain
            n_samples = len(y_subset)
            left_mask = X_subset[:, node.feature] <= node.threshold
            right_mask = ~left_mask
            
            y_left, y_right = y_subset[left_mask], y_subset[right_mask]
            gain = self._information_gain(y_subset, y_left, y_right)
            
            # Weight by number of samples
            importances[node.feature] += (n_samples / len(y)) * gain
            
            # Recursively traverse children
            if not node.left.is_leaf():
                _traverse(node.left, X_subset[left_mask], y_left)
            if not node.right.is_leaf():
                _traverse(node.right, X_subset[right_mask], y_right)
        
        _traverse(self.root, X, y)
        
        # Normalize importances
        if np.sum(importances) > 0:
            importances = importances / np.sum(importances)
        
        self.feature_importances_ = importances

# Generate classification dataset
X_cls, y_cls = make_classification(n_samples=1000, n_features=10, n_informative=5, 
                                  n_redundant=2, n_clusters_per_class=1, random_state=42)
X_train_cls, X_test_cls, y_train_cls, y_test_cls = train_test_split(X_cls, y_cls, test_size=0.3, random_state=42)

# Generate regression dataset
X_reg, y_reg = make_regression(n_samples=1000, n_features=10, noise=10, random_state=42)
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X_reg, y_reg, test_size=0.3, random_state=42)

# Compare splitting criteria for classification
criteria = ['gini', 'entropy']
classification_results = {}

for criterion in criteria:
    # Custom implementation
    dt_custom = DecisionTreeCustom(criterion=criterion, max_depth=5, task='classification')
    dt_custom.fit(X_train_cls, y_train_cls)
    y_pred_custom = dt_custom.predict(X_test_cls)
    
    # Sklearn implementation
    dt_sklearn = DecisionTreeClassifier(criterion=criterion, max_depth=5, random_state=42)
    dt_sklearn.fit(X_train_cls, y_train_cls)
    y_pred_sklearn = dt_sklearn.predict(X_test_cls)
    
    classification_results[criterion] = {
        'custom_accuracy': accuracy_score(y_test_cls, y_pred_custom),
        'sklearn_accuracy': accuracy_score(y_test_cls, y_pred_sklearn),
        'custom_importances': dt_custom.feature_importances_,
        'sklearn_importances': dt_sklearn.feature_importances_
    }

# Test regression with MSE
dt_reg_custom = DecisionTreeCustom(criterion='mse', max_depth=5, task='regression')
dt_reg_custom.fit(X_train_reg, y_train_reg)
y_pred_reg_custom = dt_reg_custom.predict(X_test_reg)

dt_reg_sklearn = DecisionTreeRegressor(criterion='squared_error', max_depth=5, random_state=42)
dt_reg_sklearn.fit(X_train_reg, y_train_reg)
y_pred_reg_sklearn = dt_reg_sklearn.predict(X_test_reg)

print("Classification Results:")
for criterion in criteria:
    result = classification_results[criterion]
    print(f"\n{criterion.upper()}:")
    print(f"  Custom accuracy: {result['custom_accuracy']:.4f}")
    print(f"  Sklearn accuracy: {result['sklearn_accuracy']:.4f}")

print("\nRegression Results:")
print(f"Custom MSE: {mean_squared_error(y_test_reg, y_pred_reg_custom):.4f}")
print(f"Sklearn MSE: {mean_squared_error(y_test_reg, y_pred_reg_sklearn):.4f}")

In [None]:
# Visualize splitting criteria comparison
fig, axes = plt.subplots(2, 3, figsize=(18, 12))

# Compare Gini vs Entropy curves
p_values = np.linspace(0.01, 0.99, 100)
gini_values = [1 - p**2 - (1-p)**2 for p in p_values]
entropy_values = [-p*np.log2(p) - (1-p)*np.log2(1-p) for p in p_values]

axes[0, 0].plot(p_values, gini_values, label='Gini Impurity', linewidth=2)
axes[0, 0].plot(p_values, entropy_values, label='Entropy', linewidth=2)
axes[0, 0].set_xlabel('Probability of Class 1')
axes[0, 0].set_ylabel('Impurity')
axes[0, 0].set_title('Gini vs Entropy Comparison')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Feature importance comparison
feature_names = [f'Feature {i+1}' for i in range(10)]
x_pos = np.arange(len(feature_names))
width = 0.35

gini_importances = classification_results['gini']['custom_importances']
entropy_importances = classification_results['entropy']['custom_importances']

axes[0, 1].bar(x_pos - width/2, gini_importances, width, label='Gini', alpha=0.8)
axes[0, 1].bar(x_pos + width/2, entropy_importances, width, label='Entropy', alpha=0.8)
axes[0, 1].set_xlabel('Features')
axes[0, 1].set_ylabel('Importance')
axes[0, 1].set_title('Feature Importance: Gini vs Entropy')
axes[0, 1].set_xticks(x_pos)
axes[0, 1].set_xticklabels(feature_names, rotation=45)
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# Accuracy vs max_depth
depths = range(1, 11)
gini_scores = []
entropy_scores = []

for depth in depths:
    # Gini
    dt_gini = DecisionTreeCustom(criterion='gini', max_depth=depth, task='classification')
    dt_gini.fit(X_train_cls, y_train_cls)
    gini_scores.append(accuracy_score(y_test_cls, dt_gini.predict(X_test_cls)))
    
    # Entropy
    dt_entropy = DecisionTreeCustom(criterion='entropy', max_depth=depth, task='classification')
    dt_entropy.fit(X_train_cls, y_train_cls)
    entropy_scores.append(accuracy_score(y_test_cls, dt_entropy.predict(X_test_cls)))

axes[0, 2].plot(depths, gini_scores, 'o-', label='Gini', linewidth=2, markersize=6)
axes[0, 2].plot(depths, entropy_scores, 's-', label='Entropy', linewidth=2, markersize=6)
axes[0, 2].set_xlabel('Max Depth')
axes[0, 2].set_ylabel('Test Accuracy')
axes[0, 2].set_title('Accuracy vs Tree Depth')
axes[0, 2].legend()
axes[0, 2].grid(True, alpha=0.3)

# Decision boundary visualization (2D projection)
# Use first 2 features for visualization
X_2d = X_train_cls[:, :2]
dt_2d = DecisionTreeCustom(criterion='gini', max_depth=3, task='classification')
dt_2d.fit(X_2d, y_train_cls)

# Create mesh
h = 0.02
x_min, x_max = X_2d[:, 0].min() - 1, X_2d[:, 0].max() + 1
y_min, y_max = X_2d[:, 1].min() - 1, X_2d[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

mesh_points = np.c_[xx.ravel(), yy.ravel()]
Z = dt_2d.predict(mesh_points)
Z = Z.reshape(xx.shape)

axes[1, 0].contourf(xx, yy, Z, alpha=0.8, cmap='RdYlBu')
scatter = axes[1, 0].scatter(X_2d[:, 0], X_2d[:, 1], c=y_train_cls, cmap='RdYlBu', edgecolors='black')
axes[1, 0].set_xlabel('Feature 1')
axes[1, 0].set_ylabel('Feature 2')
axes[1, 0].set_title('Decision Tree Boundary (Gini, depth=3)')

# Impurity reduction at each split simulation
# Generate simple 1D data for clear visualization
np.random.seed(42)
x_simple = np.random.randn(100)
y_simple = (x_simple > 0).astype(int)

# Calculate impurity before and after split at x=0
gini_before = 1 - (np.sum(y_simple == 0)/len(y_simple))**2 - (np.sum(y_simple == 1)/len(y_simple))**2
entropy_before = -np.sum([(np.sum(y_simple == i)/len(y_simple)) * np.log2(np.sum(y_simple == i)/len(y_simple) + 1e-10) for i in [0, 1]])

left_mask = x_simple <= 0
right_mask = x_simple > 0
y_left, y_right = y_simple[left_mask], y_simple[right_mask]

if len(y_left) > 0 and len(y_right) > 0:
    gini_left = 1 - (np.sum(y_left == 0)/len(y_left))**2 - (np.sum(y_left == 1)/len(y_left))**2
    gini_right = 1 - (np.sum(y_right == 0)/len(y_right))**2 - (np.sum(y_right == 1)/len(y_right))**2
    gini_after = (len(y_left)/len(y_simple)) * gini_left + (len(y_right)/len(y_simple)) * gini_right
    
    entropy_left = -np.sum([(np.sum(y_left == i)/len(y_left)) * np.log2(np.sum(y_left == i)/len(y_left) + 1e-10) for i in [0, 1]])
    entropy_right = -np.sum([(np.sum(y_right == i)/len(y_right)) * np.log2(np.sum(y_right == i)/len(y_right) + 1e-10) for i in [0, 1]])
    entropy_after = (len(y_left)/len(y_simple)) * entropy_left + (len(y_right)/len(y_simple)) * entropy_right

    categories = ['Before Split', 'After Split']
    gini_values_split = [gini_before, gini_after]
    entropy_values_split = [entropy_before, entropy_after]
    
    x_cat = np.arange(len(categories))
    width = 0.35
    
    axes[1, 1].bar(x_cat - width/2, gini_values_split, width, label='Gini', alpha=0.8)
    axes[1, 1].bar(x_cat + width/2, entropy_values_split, width, label='Entropy', alpha=0.8)
    axes[1, 1].set_ylabel('Impurity')
    axes[1, 1].set_title('Impurity Reduction After Split')
    axes[1, 1].set_xticks(x_cat)
    axes[1, 1].set_xticklabels(categories)
    axes[1, 1].legend()
    axes[1, 1].grid(True, alpha=0.3)
    
    # Add gain annotations
    gini_gain = gini_before - gini_after
    entropy_gain = entropy_before - entropy_after
    axes[1, 1].text(0.5, max(max(gini_values_split), max(entropy_values_split)) * 0.8, 
                   f'Gini Gain: {gini_gain:.3f}\nEntropy Gain: {entropy_gain:.3f}', 
                   ha='center', va='center', bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))

# Squared-error visualization for regression (errors around the training mean)
y_mean = np.mean(y_train_reg)
mse_values = [(y_val - y_mean)**2 for y_val in y_train_reg[:50]]  # Squared errors of the first 50 samples

axes[1, 2].hist(mse_values, bins=20, alpha=0.7, edgecolor='black')
axes[1, 2].axvline(x=np.mean(mse_values), color='red', linestyle='--', linewidth=2, label=f'MSE: {np.mean(mse_values):.2f}')
axes[1, 2].set_xlabel('Squared Error')
axes[1, 2].set_ylabel('Frequency')
axes[1, 2].set_title('Distribution of Squared Errors (Regression)')
axes[1, 2].legend()
axes[1, 2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nSplitting Criteria Analysis:")
print("Gini impurity range: [0, 0.5] for binary classification")
print("Entropy range: [0, 1] bits for binary classification")
print("Gini avoids the logarithm, so it is slightly cheaper to compute")
print("Gini and entropy usually select similar splits; compare the accuracies above")
print("MSE is used for regression; minimizing it minimizes within-node variance")

## Question 2: Pruning Techniques and Overfitting Prevention

**Question:** Implement pre-pruning and post-pruning techniques. Compare their effectiveness in preventing overfitting and analyze the bias-variance tradeoff.

### Theory

**Pre-pruning (Early Stopping):**
- Stop tree growth based on criteria:
  - Maximum depth
  - Minimum samples per leaf
  - Minimum information gain
  - Maximum number of leaves
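
For reference, each of these stopping rules has a counterpart among scikit-learn's `DecisionTreeClassifier` parameters. The sketch below uses a small synthetic dataset and arbitrary parameter values chosen only for illustration. Note that `min_impurity_decrease` is scikit-learn's closest analogue of a minimum information gain, but it weights the decrease by the fraction of samples reaching the node.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only to have something to fit
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X, y)
pre_pruned = DecisionTreeClassifier(
    max_depth=4,                 # maximum depth
    min_samples_leaf=5,          # minimum samples per leaf
    min_impurity_decrease=0.01,  # minimum (weighted) impurity decrease
    max_leaf_nodes=10,           # maximum number of leaves
    random_state=0,
).fit(X, y)

print("leaves (unpruned vs pre-pruned):", full.get_n_leaves(), pre_pruned.get_n_leaves())
print("depth  (unpruned vs pre-pruned):", full.get_depth(), pre_pruned.get_depth())
```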

**Post-pruning:**
- Build full tree then remove branches
- **Cost Complexity Pruning (Minimal Cost-Complexity Pruning):**

$$R_{\alpha}(T) = R(T) + \alpha|\tilde{T}|$$

Where:
- $R(T)$ = misclassification rate of tree $T$
- $\alpha$ = complexity parameter
- $|\tilde{T}|$ = number of terminal nodes
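
In scikit-learn, the same idea is exposed through `cost_complexity_pruning_path` and the `ccp_alpha` parameter. One difference to keep in mind: scikit-learn takes $R(T)$ to be the total sample-weighted impurity of the terminal nodes rather than the misclassification rate. The sketch below, on an illustrative synthetic dataset, fits one tree per candidate $\alpha$ and picks the one with the best validation accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy data, used only for illustration
X, y = make_classification(n_samples=400, n_features=10, flip_y=0.1, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Effective alphas at which successive subtrees are pruned away
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

results = []
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative values from rounding
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_tr, y_tr)
    results.append((alpha, tree.get_n_leaves(), tree.score(X_val, y_val)))

best_alpha, best_leaves, best_acc = max(results, key=lambda r: r[2])
print(f"alpha={best_alpha:.4f}, leaves={best_leaves}, val acc={best_acc:.3f}")
```

Larger values of $\alpha$ give smaller trees, so the number of leaves in `results` never increases along the path.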

**Reduced Error Pruning:**
- Use validation set to decide pruning
- Remove node if validation error doesn't increase
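
A minimal sketch of reduced error pruning on a hand-built tree follows. The `Node` class, the helper functions, and the tiny validation set are all hypothetical and independent of the notebook's `DecisionTreeNode`; the point is only to show the bottom-up "collapse if validation error does not increase" rule.

```python
import numpy as np

class Node:
    """Minimal tree node; `majority` is the majority training class at the node."""
    def __init__(self, majority, feature=None, threshold=None, left=None, right=None):
        self.majority = majority
        self.feature, self.threshold = feature, threshold
        self.left, self.right = left, right

    def is_leaf(self):
        return self.left is None

def predict(node, x):
    """Route one sample down the tree and return the leaf's majority class."""
    while not node.is_leaf():
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.majority

def subtree_errors(node, X, y):
    """Number of samples in (X, y) that the subtree rooted at `node` misclassifies."""
    return sum(predict(node, x) != t for x, t in zip(X, y))

def reduced_error_prune(node, X_val, y_val):
    """Bottom-up: replace a node by a leaf if validation error does not increase.

    Nodes that no validation sample reaches are collapsed as well (0 <= 0).
    """
    if node.is_leaf():
        return node
    mask = X_val[:, node.feature] <= node.threshold
    node.left = reduced_error_prune(node.left, X_val[mask], y_val[mask])
    node.right = reduced_error_prune(node.right, X_val[~mask], y_val[~mask])
    leaf_errors = int(np.sum(y_val != node.majority))
    if leaf_errors <= subtree_errors(node, X_val, y_val):
        return Node(majority=node.majority)
    return node

# Hand-built tree: the root split is useful, the split under the right child is spurious
tree = Node(0, feature=0, threshold=0.0,
            left=Node(0),
            right=Node(1, feature=1, threshold=0.5, left=Node(0), right=Node(1)))

X_val = np.array([[-1.0, 0.2], [-0.5, 0.9], [0.7, 0.1], [1.2, 0.8], [0.3, 0.4]])
y_val = np.array([0, 0, 1, 1, 1])

pruned = reduced_error_prune(tree, X_val, y_val)
print(pruned.is_leaf(), pruned.right.is_leaf())  # False True
```

The root survives because collapsing it would misclassify three validation samples, while the right child is collapsed because its split only adds validation errors.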

**Bias-Variance Tradeoff:**
- Unpruned trees: Low bias, high variance
- Pruned trees: Higher bias, lower variance
- Optimal pruning minimizes total error
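
For squared-error loss, "total error" can be made precise. Assuming $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\text{Var}(\varepsilon) = \sigma^2$ (notation introduced here for illustration), the expected error of a fitted tree $\hat{f}$ at a point $x$, taken over training sets and noise, decomposes as:

$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \underbrace{\big(f(x) - \mathbb{E}[\hat{f}(x)]\big)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible noise}}$$

Pruning trades the first term against the second; the noise term is unaffected by the choice of tree.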

In [None]:
class PrunedDecisionTree(DecisionTreeCustom):
    """Decision tree with pruning capabilities."""
    
    def __init__(self, criterion='gini', max_depth=None, min_samples_split=2, 
                 min_samples_leaf=1, min_info_gain=0.0, ccp_alpha=0.0, **kwargs):
        super().__init__(criterion=criterion, max_depth=max_depth, 
                        min_samples_split=min_samples_split, min_samples_leaf=min_samples_leaf, **kwargs)
        self.min_info_gain = min_info_gain
        self.ccp_alpha = ccp_alpha
        self.tree_size_ = 0
        
    def _best_split_with_gain_threshold(self, X, y):
        """Find best split with minimum information gain threshold."""
        best_feature, best_threshold = self._best_split(X, y)
        
        if best_feature is None:
            return None, None
        
        # Calculate information gain for the best split
        left_mask = X[:, best_feature] <= best_threshold
        right_mask = ~left_mask
        y_left, y_right = y[left_mask], y[right_mask]
        
        gain = self._information_gain(y, y_left, y_right)
        
        if gain < self.min_info_gain:
            return None, None
        
        return best_feature, best_threshold
    
    def _build_tree_with_prepruning(self, X, y, depth=0):
        """Build tree with pre-pruning (early stopping)."""
        n_samples = len(y)
        self.tree_size_ += 1
        
        # Determine leaf value
        if self.task == 'classification':
            leaf_value = np.bincount(y).argmax()
        else:
            leaf_value = np.mean(y)
        
        # Pre-pruning conditions
        if (self.max_depth is not None and depth >= self.max_depth) or \
           n_samples < self.min_samples_split or \
           len(np.unique(y)) == 1:
            return DecisionTreeNode(value=leaf_value, samples=n_samples)
        
        # Find best split with gain threshold
        best_feature, best_threshold = self._best_split_with_gain_threshold(X, y)
        
        if best_feature is None:
            return DecisionTreeNode(value=leaf_value, samples=n_samples)
        
        # Split data
        left_mask = X[:, best_feature] <= best_threshold
        right_mask = ~left_mask
        
        # Build children
        left_child = self._build_tree_with_prepruning(X[left_mask], y[left_mask], depth + 1)
        right_child = self._build_tree_with_prepruning(X[right_mask], y[right_mask], depth + 1)
        
        return DecisionTreeNode(
            feature=best_feature,
            threshold=best_threshold,
            left=left_child,
            right=right_child,
            samples=n_samples
        )
    
    def _calculate_tree_error(self, node, X, y):
        """Total error of a tree/subtree: misclassification count (classification) or sum of squared errors (regression)."""
        if node.is_leaf():
            if self.task == 'classification':
                predictions = np.full(len(y), node.value)
                return np.sum(predictions != y)
            else:
                return np.sum((y - node.value) ** 2)
        
        left_mask = X[:, node.feature] <= node.threshold
        right_mask = ~left_mask
        
        left_error = 0 if np.sum(left_mask) == 0 else self._calculate_tree_error(node.left, X[left_mask], y[left_mask])
        right_error = 0 if np.sum(right_mask) == 0 else self._calculate_tree_error(node.right, X[right_mask], y[right_mask])
        
        return left_error + right_error
    
    def _count_leaves(self, node):
        """Count number of leaf nodes in tree."""
        if node.is_leaf():
            return 1
        return self._count_leaves(node.left) + self._count_leaves(node.right)
    
    def _post_prune_ccp(self, X, y):
        """Cost complexity post-pruning."""
        if self.ccp_alpha <= 0:
            return
        
        def _prune_recursive(node, X_subset, y_subset):
            if node.is_leaf():
                return node
            
            # Get subsets for children
            left_mask = X_subset[:, node.feature] <= node.threshold
            right_mask = ~left_mask
            
            X_left, y_left = X_subset[left_mask], y_subset[left_mask]
            X_right, y_right = X_subset[right_mask], y_subset[right_mask]
            
            # Recursively prune children
            node.left = _prune_recursive(node.left, X_left, y_left)
            node.right = _prune_recursive(node.right, X_right, y_right)
            
            # Calculate error and complexity for current subtree
            subtree_error = self._calculate_tree_error(node, X_subset, y_subset)
            subtree_leaves = self._count_leaves(node)
            
            # Calculate error if we prune this node (make it a leaf)
            if self.task == 'classification':
                leaf_value = np.bincount(y_subset).argmax()
                leaf_error = np.sum(y_subset != leaf_value)
            else:
                leaf_value = np.mean(y_subset)
                leaf_error = np.sum((y_subset - leaf_value) ** 2)
            
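            # Cost complexity criterion: R(T) is an error rate, so normalize the
            # error totals by the number of training samples before adding the penalty
            n_total = len(y)
            subtree_cost = subtree_error / n_total + self.ccp_alpha * subtree_leaves
            leaf_cost = leaf_error / n_total + self.ccp_alpha * 1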
            
            # Prune if leaf is better
            if leaf_cost <= subtree_cost:
                return DecisionTreeNode(value=leaf_value, samples=len(y_subset))
            
            return node
        
        self.root = _prune_recursive(self.root, X, y)
    
    def fit(self, X, y):
        """Fit tree with pruning."""
        X = np.array(X)
        y = np.array(y)
        
        self.n_features_ = X.shape[1]
        self.tree_size_ = 0
        
        # Build tree with pre-pruning
        self.root = self._build_tree_with_prepruning(X, y)
        
        # Apply post-pruning if specified
        self._post_prune_ccp(X, y)
        
        # Calculate feature importances
        self._calculate_feature_importances(X, y)
        
        return self

# Generate dataset with noise to demonstrate overfitting
np.random.seed(42)
n_samples = 1000
X_noise = np.random.randn(n_samples, 20)  # 20 features
# Only first 5 features are relevant
y_noise = (X_noise[:, 0] + X_noise[:, 1] - X_noise[:, 2] + 0.5*X_noise[:, 3] - 0.3*X_noise[:, 4] > 0).astype(int)
# Add noise to labels
noise_indices = np.random.choice(n_samples, size=int(0.1 * n_samples), replace=False)
y_noise[noise_indices] = 1 - y_noise[noise_indices]

X_train_noise, X_test_noise, y_train_noise, y_test_noise = train_test_split(
    X_noise, y_noise, test_size=0.3, random_state=42)

# Further split training data for validation (for reduced error pruning)
X_train_sub, X_val, y_train_sub, y_val = train_test_split(
    X_train_noise, y_train_noise, test_size=0.3, random_state=42)

# Test different pruning strategies
pruning_configs = {
    'No Pruning': {'max_depth': None, 'min_samples_leaf': 1, 'min_info_gain': 0.0, 'ccp_alpha': 0.0},
    'Max Depth 5': {'max_depth': 5, 'min_samples_leaf': 1, 'min_info_gain': 0.0, 'ccp_alpha': 0.0},
    'Min Samples Leaf 10': {'max_depth': None, 'min_samples_leaf': 10, 'min_info_gain': 0.0, 'ccp_alpha': 0.0},
    'Min Info Gain 0.01': {'max_depth': None, 'min_samples_leaf': 1, 'min_info_gain': 0.01, 'ccp_alpha': 0.0},
    'CCP Alpha 0.01': {'max_depth': None, 'min_samples_leaf': 1, 'min_info_gain': 0.0, 'ccp_alpha': 0.01},
    'Combined Pruning': {'max_depth': 8, 'min_samples_leaf': 5, 'min_info_gain': 0.005, 'ccp_alpha': 0.005}
}

pruning_results = {}

for config_name, config in pruning_configs.items():
    dt = PrunedDecisionTree(
        criterion='gini',
        task='classification',
        **config
    )
    
    dt.fit(X_train_sub, y_train_sub)
    
    # Evaluate on different sets
    train_acc = accuracy_score(y_train_sub, dt.predict(X_train_sub))
    val_acc = accuracy_score(y_val, dt.predict(X_val))
    test_acc = accuracy_score(y_test_noise, dt.predict(X_test_noise))
    
    # Count tree complexity
    n_leaves = dt._count_leaves(dt.root)
    
    pruning_results[config_name] = {
        'train_accuracy': train_acc,
        'val_accuracy': val_acc,
        'test_accuracy': test_acc,
        'n_leaves': n_leaves,
        'overfitting': train_acc - test_acc  # Measure of overfitting
    }

# Convert to DataFrame for easier analysis
pruning_df = pd.DataFrame(pruning_results).T
print("Pruning Strategies Comparison:")
print(pruning_df.round(4))

In [None]:
# Visualize pruning effects and bias-variance tradeoff
fig, axes = plt.subplots(2, 3, figsize=(18, 12))

# Accuracy comparison
config_names = list(pruning_results.keys())
train_accs = [pruning_results[name]['train_accuracy'] for name in config_names]
val_accs = [pruning_results[name]['val_accuracy'] for name in config_names]
test_accs = [pruning_results[name]['test_accuracy'] for name in config_names]

x_pos = np.arange(len(config_names))
width = 0.25

axes[0, 0].bar(x_pos - width, train_accs, width, label='Train', alpha=0.8)
axes[0, 0].bar(x_pos, val_accs, width, label='Validation', alpha=0.8)
axes[0, 0].bar(x_pos + width, test_accs, width, label='Test', alpha=0.8)
axes[0, 0].set_xlabel('Pruning Strategy')
axes[0, 0].set_ylabel('Accuracy')
axes[0, 0].set_title('Accuracy Comparison')
axes[0, 0].set_xticks(x_pos)
axes[0, 0].set_xticklabels(config_names, rotation=45, ha='right')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Tree complexity (number of leaves)
n_leaves = [pruning_results[name]['n_leaves'] for name in config_names]
bars = axes[0, 1].bar(config_names, n_leaves, alpha=0.7, color='orange')
axes[0, 1].set_ylabel('Number of Leaves')
axes[0, 1].set_title('Tree Complexity')
plt.setp(axes[0, 1].get_xticklabels(), rotation=45, ha='right')  # rotate existing category labels
for bar, val in zip(bars, n_leaves):
    axes[0, 1].text(bar.get_x() + bar.get_width()/2., bar.get_height() + 1,
                   f'{val}', ha='center', va='bottom')
axes[0, 1].grid(True, alpha=0.3)

# Overfitting measure (train - test accuracy)
overfitting = [pruning_results[name]['overfitting'] for name in config_names]
colors = ['red' if x > 0.05 else 'green' for x in overfitting]
bars = axes[0, 2].bar(config_names, overfitting, alpha=0.7, color=colors)
axes[0, 2].set_ylabel('Train - Test Accuracy')
axes[0, 2].set_title('Overfitting Measure')
plt.setp(axes[0, 2].get_xticklabels(), rotation=45, ha='right')  # rotate existing category labels
axes[0, 2].axhline(y=0.05, color='black', linestyle='--', alpha=0.5, label='Overfitting threshold')
axes[0, 2].legend()
for bar, val in zip(bars, overfitting):
    axes[0, 2].text(bar.get_x() + bar.get_width()/2., bar.get_height() + 0.005,
                   f'{val:.3f}', ha='center', va='bottom', fontsize=9)
axes[0, 2].grid(True, alpha=0.3)

# Bias-variance tradeoff simulation
# Test different tree depths
depths = range(1, 16)
n_bootstrap = 50
bias_variance_results = {'depth': [], 'bias': [], 'variance': [], 'total_error': []}

for depth in depths:
    predictions = []
    
    # Bootstrap sampling for variance estimation
    for _ in range(n_bootstrap):
        # Bootstrap sample
        bootstrap_idx = np.random.choice(len(X_train_sub), len(X_train_sub), replace=True)
        X_bootstrap = X_train_sub[bootstrap_idx]
        y_bootstrap = y_train_sub[bootstrap_idx]
        
        # Train tree
        dt_bootstrap = PrunedDecisionTree(criterion='gini', max_depth=depth, task='classification')
        dt_bootstrap.fit(X_bootstrap, y_bootstrap)
        
        # Predict on test set
        pred = dt_bootstrap.predict(X_test_noise)
        predictions.append(pred)
    
    predictions = np.array(predictions)
    
    # Calculate bias and variance
    mean_predictions = np.mean(predictions, axis=0)
    
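    # Squared bias: squared gap between the average prediction and the observed labels
    # (the labels are noisy, so this term also absorbs the irreducible error)
    bias_squared = np.mean((mean_predictions - y_test_noise) ** 2)
    
    # Variance: average variance of predictions across bootstrap samples
    variance = np.mean(np.var(predictions, axis=0))
    
    # Total error: for 0/1 predictions, bias² + variance equals the mean
    # misclassification rate of the bootstrap trees on the test set
    total_error = bias_squared + variance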
    
    bias_variance_results['depth'].append(depth)
    bias_variance_results['bias'].append(bias_squared)
    bias_variance_results['variance'].append(variance)
    bias_variance_results['total_error'].append(total_error)

# Plot bias-variance tradeoff
axes[1, 0].plot(bias_variance_results['depth'], bias_variance_results['bias'], 'o-', 
               label='Bias²', linewidth=2, markersize=6)
axes[1, 0].plot(bias_variance_results['depth'], bias_variance_results['variance'], 's-', 
               label='Variance', linewidth=2, markersize=6)
axes[1, 0].plot(bias_variance_results['depth'], bias_variance_results['total_error'], '^-', 
               label='Total Error', linewidth=2, markersize=6)
axes[1, 0].set_xlabel('Max Depth')
axes[1, 0].set_ylabel('Error')
axes[1, 0].set_title('Bias-Variance Tradeoff')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# Complexity vs Performance
axes[1, 1].scatter(n_leaves, test_accs, c=overfitting, cmap='RdYlGn_r', 
                  s=100, alpha=0.8, edgecolors='black')
for i, name in enumerate(config_names):
    axes[1, 1].annotate(name, (n_leaves[i], test_accs[i]), 
                       xytext=(5, 5), textcoords='offset points', fontsize=8)
axes[1, 1].set_xlabel('Number of Leaves (Complexity)')
axes[1, 1].set_ylabel('Test Accuracy')
axes[1, 1].set_title('Complexity vs Performance')
cbar = plt.colorbar(axes[1, 1].collections[0], ax=axes[1, 1])
cbar.set_label('Overfitting')
axes[1, 1].grid(True, alpha=0.3)

# Learning curves for different pruning strategies
selected_configs = ['No Pruning', 'Max Depth 5', 'Combined Pruning']
train_sizes = np.linspace(0.1, 1.0, 10)

for config_name in selected_configs:
    config = pruning_configs[config_name]
    train_scores = []
    val_scores = []
    
    for train_size in train_sizes:
        n_samples = int(train_size * len(X_train_sub))
        X_subset = X_train_sub[:n_samples]
        y_subset = y_train_sub[:n_samples]
        
        dt = PrunedDecisionTree(criterion='gini', task='classification', **config)
        dt.fit(X_subset, y_subset)
        
        train_scores.append(accuracy_score(y_subset, dt.predict(X_subset)))
        val_scores.append(accuracy_score(y_val, dt.predict(X_val)))
    
    axes[1, 2].plot(train_sizes * len(X_train_sub), train_scores, 'o-', 
                   label=f'{config_name} (Train)', alpha=0.7)
    axes[1, 2].plot(train_sizes * len(X_train_sub), val_scores, 's--', 
                   label=f'{config_name} (Val)', alpha=0.7)

axes[1, 2].set_xlabel('Training Set Size')
axes[1, 2].set_ylabel('Accuracy')
axes[1, 2].set_title('Learning Curves')
axes[1, 2].legend()
axes[1, 2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Find optimal pruning strategy
best_config = max(pruning_results.keys(), key=lambda x: pruning_results[x]['test_accuracy'])
print(f"\nBest pruning strategy: {best_config}")
print(f"Test accuracy: {pruning_results[best_config]['test_accuracy']:.4f}")
print(f"Overfitting measure: {pruning_results[best_config]['overfitting']:.4f}")
print(f"Tree complexity: {pruning_results[best_config]['n_leaves']} leaves")

print("\nPruning Guidelines:")
print("- Pre-pruning is computationally efficient but may stop too early")
print("- Post-pruning builds full tree then removes branches (more expensive but thorough)")
print("- Combined approach often works best in practice")
print("- Monitor validation performance to detect optimal pruning level")

## Question 3: Feature Importance and Tree Interpretation

**Question:** Compare different feature importance measures in decision trees. Implement permutation importance and analyze how tree structure affects interpretability.

### Theory

**Feature Importance Measures:**

1. **Gini/Entropy Importance (Mean Decrease Impurity):**
$$\text{Importance}_j = \sum_{t \in \text{splits on feature } j} p_t \cdot \Delta I_t$$
where $p_t$ is the proportion of training samples reaching node $t$ and $\Delta I_t$ is the impurity decrease produced by the split at $t$

2. **Permutation Importance:**
$$\text{Importance}_j = \text{Score}(\text{original}) - \text{Score}(\text{permuted}_j)$$
- More reliable than impurity-based importance when computed on held-out data, because it measures the change in predictive performance directly
- Model-agnostic approach

3. **Drop-Column Importance:**
$$\text{Importance}_j = \text{Score}(\text{all features}) - \text{Score}(\text{without feature } j)$$
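
As a sanity check on the mean-decrease-impurity formula, the sketch below recomputes $\sum_t p_t \cdot \Delta I_t$ from the node statistics of a fitted scikit-learn `DecisionTreeClassifier` and compares it with the library's `feature_importances_`. Using scikit-learn's tree instead of this notebook's own class, and the variable names (`X_iris`, `mdi`, and so on), are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_iris, y_iris)
tree = clf.tree_

mdi = np.zeros(X_iris.shape[1])
n_root = tree.weighted_n_node_samples[0]

for node in range(tree.node_count):
    left, right = tree.children_left[node], tree.children_right[node]
    if left == -1:  # leaves have no children and contribute nothing
        continue
    n_t = tree.weighted_n_node_samples[node]
    n_left = tree.weighted_n_node_samples[left]
    n_right = tree.weighted_n_node_samples[right]
    p_t = n_t / n_root  # proportion of samples reaching this node
    delta_impurity = (tree.impurity[node]
                      - (n_left / n_t) * tree.impurity[left]
                      - (n_right / n_t) * tree.impurity[right])
    mdi[tree.feature[node]] += p_t * delta_impurity

# scikit-learn reports importances normalized to sum to 1
mdi_normalized = mdi / mdi.sum()
print(np.round(mdi_normalized, 4))
print(np.round(clf.feature_importances_, 4))
```

The two printed vectors should agree, which shows that the library's impurity-based importance is this weighted sum followed by a normalization step.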

**Interpretation Challenges:**
- **Feature interactions**: Trees naturally capture interactions
- **Bias toward high-cardinality features**: More split opportunities
- **Correlated features**: May substitute for each other
- **Tree instability**: Small data changes can create very different trees
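
The bias toward high-cardinality features can be illustrated with a small synthetic experiment. The sketch below is an assumption-laden toy setup, not part of the original analysis: one binary feature carries all the signal, the labels are flipped 20% of the time, and a continuous noise feature offers many split points. It uses scikit-learn's `DecisionTreeClassifier` and `sklearn.inspection.permutation_importance`, and every variable name is hypothetical.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 2000
x_signal = rng.integers(0, 2, size=n)   # binary, informative
x_noise = rng.normal(size=n)            # continuous, pure noise
flip = rng.random(n) < 0.2              # 20% label noise
y_demo = np.where(flip, 1 - x_signal, x_signal)
X_demo = np.column_stack([x_signal, x_noise]).astype(float)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=0)

# Unpruned tree: after splitting on the signal it keeps splitting on the
# noise feature until every flipped training label is memorized
clf_demo = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

mdi_demo = clf_demo.feature_importances_
perm_demo = permutation_importance(
    clf_demo, X_te, y_te, n_repeats=10, random_state=0).importances_mean

print("Impurity-based (signal, noise):", np.round(mdi_demo, 3))
print("Permutation    (signal, noise):", np.round(perm_demo, 3))
```

In this setup the impurity-based measure credits the noise feature with a large share of the importance, while permutation importance on held-out data leaves it near zero.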

In [None]:
class InterpretableDecisionTree(PrunedDecisionTree):
    """Decision tree with enhanced interpretability features."""
    
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.permutation_importances_ = None
        self.drop_column_importances_ = None
        
    def _calculate_permutation_importance(self, X, y, metric='accuracy', n_repeats=10):
        """Calculate permutation importance."""
        X = np.array(X)
        baseline_score = self._calculate_score(X, y, metric)
        
        importances = np.zeros(X.shape[1])
        
        for feature_idx in range(X.shape[1]):
            scores = []
            
            for _ in range(n_repeats):
                # Permute feature
                X_permuted = X.copy()
                X_permuted[:, feature_idx] = np.random.permutation(X_permuted[:, feature_idx])
                
                # Calculate score with permuted feature
                permuted_score = self._calculate_score(X_permuted, y, metric)
                scores.append(baseline_score - permuted_score)
            
            importances[feature_idx] = np.mean(scores)
        
        return importances
    
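    def fit(self, X, y, *args, **kwargs):
        """Fit the tree and keep a copy of the training data for drop-column importance."""
        self._X_fit = np.array(X)
        self._y_fit = np.array(y)
        return super().fit(X, y, *args, **kwargs)
    
    def _calculate_drop_column_importance(self, X, y, metric='accuracy'):
        """Calculate drop-column importance.
        
        Each reduced model is refit on the stored training data without one
        feature and scored on (X, y), so it is evaluated on the same data as
        the baseline model instead of on the data it was trained on.
        """
        X = np.array(X)
        baseline_score = self._calculate_score(X, y, metric)
        
        importances = np.zeros(X.shape[1])
        
        for feature_idx in range(X.shape[1]):
            # Remove this feature from both the training and evaluation data
            X_fit_drop = np.delete(self._X_fit, feature_idx, axis=1)
            X_eval_drop = np.delete(X, feature_idx, axis=1)
            
            # Retrain a model with the same hyperparameters, without this feature
            dt_drop = InterpretableDecisionTree(
                criterion=self.criterion,
                max_depth=self.max_depth,
                min_samples_split=self.min_samples_split,
                min_samples_leaf=self.min_samples_leaf,
                task=self.task
            )
            dt_drop.fit(X_fit_drop, self._y_fit)
            
            # Score the reduced model on the evaluation data
            drop_score = dt_drop._calculate_score(X_eval_drop, y, metric)
            importances[feature_idx] = baseline_score - drop_score
        
        return importances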
    
    def _calculate_score(self, X, y, metric):
        """Calculate model score."""
        predictions = self.predict(X)
        
        if metric == 'accuracy':
            return accuracy_score(y, predictions)
        elif metric == 'mse':
            return -mean_squared_error(y, predictions)  # Negative for consistency
        else:
            raise ValueError(f"Unknown metric: {metric}")
    
    def calculate_all_importances(self, X, y, metric='accuracy'):
        """Calculate all types of feature importance."""
        # Gini/Entropy importance already calculated in fit
        self.permutation_importances_ = self._calculate_permutation_importance(X, y, metric)
        self.drop_column_importances_ = self._calculate_drop_column_importance(X, y, metric)
        
        return {
            'gini_entropy': self.feature_importances_,
            'permutation': self.permutation_importances_,
            'drop_column': self.drop_column_importances_
        }
    
    def get_tree_rules(self, feature_names=None):
        """Extract decision rules from the tree."""
        if feature_names is None:
            feature_names = [f'feature_{i}' for i in range(self.n_features_)]
        
        rules = []
        
        def _extract_rules(node, path=None):
            path = path or []  # avoid a mutable default argument
            if node.is_leaf():
                rule = ' AND '.join(path) if path else 'True'
                rules.append({
                    'rule': rule,
                    'prediction': node.value,
                    'samples': node.samples
                })
                return
            
            feature_name = feature_names[node.feature]
            
            # Left child (<=)
            left_condition = f"{feature_name} <= {node.threshold:.3f}"
            _extract_rules(node.left, path + [left_condition])
            
            # Right child (>)
            right_condition = f"{feature_name} > {node.threshold:.3f}"
            _extract_rules(node.right, path + [right_condition])
        
        _extract_rules(self.root)
        return rules
    
    def analyze_feature_interactions(self, X, y, top_features=5):
        """Count how often sets of top features co-occur on a root-to-leaf path, weighted by leaf samples."""
        # Get top features by importance
        top_feature_indices = np.argsort(self.feature_importances_)[-top_features:]
        
        interactions = {}
        
        def _find_interactions(node, features_in_path=None):
            features_in_path = features_in_path or set()  # avoid a mutable default argument
            if node.is_leaf():
                if len(features_in_path) > 1:
                    # Record interaction
                    interaction_key = tuple(sorted(features_in_path))
                    if interaction_key not in interactions:
                        interactions[interaction_key] = 0
                    interactions[interaction_key] += node.samples
                return
            
            # Add current feature to path
            new_features_in_path = features_in_path.copy()
            if node.feature in top_feature_indices:
                new_features_in_path.add(node.feature)
            
            _find_interactions(node.left, new_features_in_path)
            _find_interactions(node.right, new_features_in_path)
        
        _find_interactions(self.root)
        
        # Sort by frequency
        sorted_interactions = sorted(interactions.items(), key=lambda x: x[1], reverse=True)
        
        return sorted_interactions

# Create dataset with known feature interactions
np.random.seed(42)
n_samples = 1000
n_features = 10

X_interpret = np.random.randn(n_samples, n_features)
# Create target with known feature importance and interactions
y_interpret = (
    2 * X_interpret[:, 0] +          # Strong individual effect
    1.5 * X_interpret[:, 1] +        # Medium individual effect
    X_interpret[:, 0] * X_interpret[:, 1] +  # Interaction effect
    0.5 * X_interpret[:, 2] +        # Weak individual effect
    np.random.randn(n_samples) * 0.3  # Noise
) > 0
y_interpret = y_interpret.astype(int)

# Features 3-9 do not affect the target (should have low importance); split into train and test sets
X_train_interp, X_test_interp, y_train_interp, y_test_interp = train_test_split(
    X_interpret, y_interpret, test_size=0.3, random_state=42)

# Train interpretable decision tree
dt_interp = InterpretableDecisionTree(
    criterion='gini',
    max_depth=6,
    min_samples_leaf=10,
    task='classification'
)
dt_interp.fit(X_train_interp, y_train_interp)

# Calculate all importance measures
feature_names = [f'Feature_{i}' for i in range(n_features)]
all_importances = dt_interp.calculate_all_importances(X_test_interp, y_test_interp)

# Get decision rules
rules = dt_interp.get_tree_rules(feature_names)

# Analyze feature interactions
interactions = dt_interp.analyze_feature_interactions(X_train_interp, y_train_interp)

print("Feature Importance Comparison:")
importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Gini_Entropy': all_importances['gini_entropy'],
    'Permutation': all_importances['permutation'],
    'Drop_Column': all_importances['drop_column']
})
print(importance_df.round(4))

print(f"\nTest Accuracy: {accuracy_score(y_test_interp, dt_interp.predict(X_test_interp)):.4f}")

print("\nFirst 5 Decision Rules (tree traversal order):")
for i, rule in enumerate(rules[:5]):
    print(f"{i+1}. IF {rule['rule']} THEN class={rule['prediction']} (samples={rule['samples']})")

print("\nTop Feature Interactions:")
for interaction, frequency in interactions[:5]:
    feature_names_interaction = [feature_names[i] for i in interaction]
    print(f"{' & '.join(feature_names_interaction)}: {frequency} samples")

In [None]:
# Visualize feature importance and interpretability
fig, axes = plt.subplots(2, 3, figsize=(18, 12))

# Feature importance comparison
x_pos = np.arange(len(feature_names))
width = 0.25

axes[0, 0].bar(x_pos - width, all_importances['gini_entropy'], width, 
              label='Gini/Entropy', alpha=0.8)
axes[0, 0].bar(x_pos, all_importances['permutation'], width, 
              label='Permutation', alpha=0.8)
axes[0, 0].bar(x_pos + width, all_importances['drop_column'], width, 
              label='Drop Column', alpha=0.8)
axes[0, 0].set_xlabel('Features')
axes[0, 0].set_ylabel('Importance')
axes[0, 0].set_title('Feature Importance Comparison')
axes[0, 0].set_xticks(x_pos)
axes[0, 0].set_xticklabels(feature_names, rotation=45)
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Correlation between importance measures
axes[0, 1].scatter(all_importances['gini_entropy'], all_importances['permutation'], 
                  alpha=0.7, s=100, edgecolors='black')
for i, feature in enumerate(feature_names):
    axes[0, 1].annotate(feature, 
                       (all_importances['gini_entropy'][i], all_importances['permutation'][i]),
                       xytext=(5, 5), textcoords='offset points', fontsize=8)

# Pearson correlation between the two importance measures (reported in the title)
corr = np.corrcoef(all_importances['gini_entropy'], all_importances['permutation'])[0, 1]
axes[0, 1].set_xlabel('Gini/Entropy Importance')
axes[0, 1].set_ylabel('Permutation Importance')
axes[0, 1].set_title(f'Importance Correlation (r={corr:.3f})')
axes[0, 1].grid(True, alpha=0.3)

# Feature stability analysis
n_bootstrap = 30
bootstrap_importances = []

for _ in range(n_bootstrap):
    # Bootstrap sample
    bootstrap_idx = np.random.choice(len(X_train_interp), len(X_train_interp), replace=True)
    X_bootstrap = X_train_interp[bootstrap_idx]
    y_bootstrap = y_train_interp[bootstrap_idx]
    
    # Train tree
    dt_bootstrap = InterpretableDecisionTree(
        criterion='gini', max_depth=6, min_samples_leaf=10, task='classification'
    )
    dt_bootstrap.fit(X_bootstrap, y_bootstrap)
    
    bootstrap_importances.append(dt_bootstrap.feature_importances_)

bootstrap_importances = np.array(bootstrap_importances)

# Box plot of feature importance stability
axes[0, 2].boxplot([bootstrap_importances[:, i] for i in range(n_features)])
axes[0, 2].set_xticklabels(feature_names)  # avoids the `labels` keyword, deprecated in Matplotlib 3.9
axes[0, 2].set_ylabel('Feature Importance')
axes[0, 2].set_title('Feature Importance Stability')
axes[0, 2].tick_params(axis='x', rotation=45)
axes[0, 2].grid(True, alpha=0.3)

# Tree depth vs interpretability
depths = range(1, 11)
n_rules = []
avg_rule_length = []
test_accuracies = []

for depth in depths:
    dt_depth = InterpretableDecisionTree(
        criterion='gini', max_depth=depth, min_samples_leaf=10, task='classification'
    )
    dt_depth.fit(X_train_interp, y_train_interp)
    
    rules_depth = dt_depth.get_tree_rules(feature_names)
    
    n_rules.append(len(rules_depth))
    avg_length = np.mean([len(rule['rule'].split(' AND ')) for rule in rules_depth])
    avg_rule_length.append(avg_length)
    
    test_acc = accuracy_score(y_test_interp, dt_depth.predict(X_test_interp))
    test_accuracies.append(test_acc)

ax1 = axes[1, 0]
ax2 = ax1.twinx()

line1 = ax1.plot(depths, n_rules, 'b-o', label='Number of Rules')
line2 = ax2.plot(depths, test_accuracies, 'r-s', label='Test Accuracy')

ax1.set_xlabel('Max Depth')
ax1.set_ylabel('Number of Rules', color='b')
ax2.set_ylabel('Test Accuracy', color='r')
ax1.set_title('Complexity vs Performance')

# Combine legends
lines = line1 + line2
labels = [l.get_label() for l in lines]
ax1.legend(lines, labels, loc='center right')
ax1.grid(True, alpha=0.3)

# Average rule length vs. tree depth
axes[1, 1].plot(depths, avg_rule_length, 'g-^', linewidth=2, markersize=8)
axes[1, 1].set_xlabel('Max Depth')
axes[1, 1].set_ylabel('Average Rule Length')
axes[1, 1].set_title('Rule Complexity vs Tree Depth')
axes[1, 1].grid(True, alpha=0.3)

# Feature usage frequency in tree
feature_usage = np.zeros(n_features)

def count_feature_usage(node):
    if node.is_leaf():
        return
    feature_usage[node.feature] += 1
    count_feature_usage(node.left)
    count_feature_usage(node.right)

count_feature_usage(dt_interp.root)

bars = axes[1, 2].bar(feature_names, feature_usage, alpha=0.7, color='purple')
axes[1, 2].set_ylabel('Usage Count in Tree')
axes[1, 2].set_title('Feature Usage in Tree Structure')
axes[1, 2].tick_params(axis='x', rotation=45)
for bar, val in zip(bars, feature_usage):
    if val > 0:
        axes[1, 2].text(bar.get_x() + bar.get_width()/2., bar.get_height() + 0.1,
                       f'{int(val)}', ha='center', va='bottom')
axes[1, 2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Calculate feature importance statistics
print("\nFeature Importance Statistics:")
for i, feature in enumerate(feature_names):
    std_dev = np.std(bootstrap_importances[:, i])
    cv = std_dev / (np.mean(bootstrap_importances[:, i]) + 1e-10)  # Coefficient of variation
    print(f"{feature}: Mean={np.mean(bootstrap_importances[:, i]):.4f}, "
          f"Std={std_dev:.4f}, CV={cv:.4f}")

print("\nInterpretability Insights:")
print(f"- Tree has {len(rules)} decision rules")
print(f"- Average rule length: {np.mean([len(rule['rule'].split(' AND ')) for rule in rules]):.2f} conditions")
print(f"- Most important features: {', '.join([feature_names[i] for i in np.argsort(all_importances['permutation'])[-3:]])}")
print(f"- Feature importance correlation (Gini vs Permutation): {corr:.3f}")

## Summary and Key Takeaways

### Tree-Based Methods Fundamentals:

1. **Splitting Criteria**:
   - **Gini Impurity**: Range [0, 0.5] for binary classification; computationally efficient
   - **Entropy**: Range [0, 1] for binary classification; slightly costlier to compute because of the logarithm and more sensitive to small class probabilities
   - **MSE**: Used for regression; minimizes variance in predictions
   - Choice of criterion has minimal impact on final performance in most cases

2. **Pruning Strategies**:
   - **Pre-pruning**: Prevents overfitting during construction; computationally efficient
   - **Post-pruning**: More thorough but expensive; builds full tree then removes branches
   - **Cost Complexity Pruning**: Balances tree size and accuracy using α parameter
   - Combined approaches often work best in practice

3. **Feature Importance**:
   - **Gini/Entropy Importance**: Fast to compute but can be biased
   - **Permutation Importance**: More reliable; measures actual predictive contribution
   - **Drop-Column Importance**: Measures the effect of removing a feature directly, but requires retraining the model once per feature
   - Feature importance can be unstable across different tree structures
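
The ranges quoted for Gini impurity and entropy follow from evaluating both criteria for a binary node. The sketch below does so for a few class proportions; the helper names `gini` and `entropy` are illustrative and are not the methods of the tree classes defined earlier in this notebook.

```python
import numpy as np

def gini(p):
    """Gini impurity of a binary node with positive-class proportion p."""
    return 1.0 - (p ** 2 + (1.0 - p) ** 2)

def entropy(p):
    """Binary entropy in bits; terms with zero probability are skipped."""
    probs = np.array([p, 1.0 - p])
    probs = probs[probs > 0]
    return float(np.sum(probs * np.log2(1.0 / probs)))

for p in [0.0, 0.1, 0.5, 0.9, 1.0]:
    print(f"p={p:.1f}  gini={gini(p):.3f}  entropy={entropy(p):.3f}")
```

Both criteria vanish for a pure node and peak at $p = 0.5$, where Gini reaches 0.5 and entropy reaches 1 bit.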

### Practical Guidelines:

**Preventing Overfitting:**
- Set reasonable max_depth (typically 3-8 for interpretability)
- Use min_samples_leaf (5-20) so each leaf's prediction is based on enough samples
- Monitor validation performance for optimal pruning
- Consider ensemble methods for better generalization
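
One way to monitor validation performance while pruning is to sweep the cost-complexity parameter α and keep the value that scores best on a held-out set. The sketch below assumes scikit-learn's `DecisionTreeClassifier` (its `cost_complexity_pruning_path` method and `ccp_alpha` parameter) and not the `PrunedDecisionTree` class from this notebook; the synthetic dataset and variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_ccp, y_ccp = make_classification(n_samples=600, n_features=10, n_informative=4,
                                   flip_y=0.1, random_state=0)
X_fit, X_valid, y_fit, y_valid = train_test_split(
    X_ccp, y_ccp, test_size=0.3, random_state=0)

# Candidate alphas come from the pruning path of the fully grown tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_fit, y_fit)
alphas = np.unique(np.clip(path.ccp_alphas, 0.0, None))  # guard against tiny negative round-off

n_leaves, val_accs = [], []
for alpha in alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_fit, y_fit)
    n_leaves.append(tree.get_n_leaves())
    val_accs.append(tree.score(X_valid, y_valid))

best = int(np.argmax(val_accs))
print(f"best alpha={alphas[best]:.5f}, leaves={n_leaves[best]}, "
      f"validation accuracy={val_accs[best]:.3f}")
```

Larger values of α produce smaller trees, so the sweep traces the full range from the unpruned tree to a heavily pruned one. A separate test set is still needed for an unbiased estimate of the chosen tree's accuracy.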

**Interpretability:**
- Shallow trees (depth ≤ 5) are most interpretable
- Decision rules provide clear logical explanations
- Feature interactions are naturally captured
- Use permutation importance for reliable feature ranking
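
For trees trained with scikit-learn, decision rules can be printed without a custom traversal. The sketch below assumes `sklearn.tree.export_text` and the Iris data as a stand-in; it is an alternative to, not a replacement for, the `get_tree_rules` method defined above.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
shallow_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(
    iris.data, iris.target)

# One indented line per split condition, one "class: k" line per leaf
rules_text = export_text(shallow_tree, feature_names=list(iris.feature_names))
print(rules_text)
```

With `max_depth=2` the whole model fits in a handful of lines, which is the practical meaning of "shallow trees are most interpretable".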

**When to Use Decision Trees:**
- Need interpretable models
- Mixed data types (numerical and categorical)
- Non-linear relationships and interactions
- Missing values can be handled by some implementations (for example, via surrogate splits)

**Limitations:**
- High variance (small data changes → different trees)
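- Impurity-based splits and importances favor features with many distinct values
- Difficulty with linear relationships, which axis-aligned splits can only approximate with step functions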
- Can create overly complex models without pruning
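
The difficulty with linear relationships can be seen on a one-dimensional toy problem. The sketch below is an illustrative assumption (synthetic data, scikit-learn estimators, hypothetical variable names): it fits a depth-3 regression tree and a linear model to $y = 2x + \varepsilon$ and compares their error on held-out points.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x_lin = rng.uniform(0, 10, size=(400, 1))
y_lin = 2.0 * x_lin[:, 0] + rng.normal(scale=0.5, size=400)

x_train_lin, y_train_lin = x_lin[:300], y_lin[:300]
x_test_lin, y_test_lin = x_lin[300:], y_lin[300:]

linear = LinearRegression().fit(x_train_lin, y_train_lin)
tree_reg = DecisionTreeRegressor(max_depth=3, random_state=0).fit(x_train_lin, y_train_lin)

mse_linear = mean_squared_error(y_test_lin, linear.predict(x_test_lin))
mse_tree = mean_squared_error(y_test_lin, tree_reg.predict(x_test_lin))
# A depth-3 tree has at most 8 leaves, so it predicts at most 8 distinct values
n_levels = len(np.unique(tree_reg.predict(x_test_lin)))

print(f"Linear model MSE: {mse_linear:.3f}")
print(f"Depth-3 tree MSE: {mse_tree:.3f} ({n_levels} distinct predictions)")
```

The tree approximates the straight line with a staircase, so its error stays above the linear model's unless the tree is grown much deeper, which in turn increases variance.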