<a href="https://colab.research.google.com/github/sreent/machine-learning/blob/main/Decision%20Tree%20Classification/Decision%20Tree%20Classification%20Code%20Walk%20Through.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Decision Tree Classification: Code Walk Through

This notebook walks through the **computational steps** of Decision Tree Classification from scratch.

## What We'll Cover:
1. **Calculate Entropy** - measure impurity/disorder in a dataset
2. **Calculate Information Gain** - measure how much a split reduces entropy
3. **Find Best Split** - evaluate all possible thresholds for numerical features
4. **Build the Tree** - recursively partition the feature space
5. **Make Predictions** - traverse the tree to classify new points
6. **Visualize Decision Boundaries** - see the axis-aligned partitions

We'll show **both manual calculations** (to understand the logic) and **vectorized NumPy versions** (for efficiency).

## Step 1: Import Libraries

We need:
- **NumPy** for numerical operations
- **Matplotlib** for visualization

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Set random seed for reproducibility
np.random.seed(42)

## Step 2: Create Training Data

We'll use the same dataset from the lecture slides (slides 21-27):
- **11 training points** with **2 numerical features** ($x_1$ and $x_2$)
- **2 classes**: class 0 (blue) and class 1 (orange)

This dataset has a clear pattern that decision trees can capture with axis-aligned splits.

In [None]:
# Training data from lecture slides (numerical features example)
X_train = np.array([
    [-0.5, -4.0],   # Point 0, class 0
    [-1.5, -2.5],   # Point 1, class 0
    [ 0.0,  0.0],   # Point 2, class 0
    [-1.0,  0.5],   # Point 3, class 0
    [ 0.5,  1.5],   # Point 4, class 0
    [ 2.5,  1.0],   # Point 5, class 0
    [ 3.5, -3.5],   # Point 6, class 1
    [ 2.0, -3.0],   # Point 7, class 1
    [ 3.0, -2.0],   # Point 8, class 1
    [ 1.5, -1.5],   # Point 9, class 1
    [ 4.0, -1.0]    # Point 10, class 1
])

# Labels: which class each point belongs to
y_train = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

print("Training data shape:", X_train.shape)  # (11, 2) = 11 points, 2 features
print("Labels shape:", y_train.shape)         # (11,) = 11 labels
print(f"\nClass distribution: {np.bincount(y_train)}")
print(f"  Class 0: {np.sum(y_train == 0)} samples")
print(f"  Class 1: {np.sum(y_train == 1)} samples")

## Step 3: Visualize the Data

Let's plot our training data to see how it's distributed in 2D space.

In [None]:
plt.figure(figsize=(10, 8))
plt.scatter(X_train[y_train == 0, 0], X_train[y_train == 0, 1],
            c='skyblue', s=150, edgecolors='black', linewidths=1.5,
            label='Class 0')
plt.scatter(X_train[y_train == 1, 0], X_train[y_train == 1, 1],
            c='orange', s=150, edgecolors='black', linewidths=1.5,
            label='Class 1')
plt.xlabel('$x_1$', fontsize=14)
plt.ylabel('$x_2$', fontsize=14)
plt.title('Training Data Visualization', fontsize=16)
plt.legend(fontsize=12)
plt.grid(True, alpha=0.3)
plt.axhline(y=0, color='gray', linestyle='--', alpha=0.5)
plt.axvline(x=0, color='gray', linestyle='--', alpha=0.5)
plt.show()

print(f"We have {len(X_train)} training points in 2D space")
print("Notice: Class 0 (blue) tends to be on the left/top, Class 1 (orange) on the right/bottom")

## Step 4: Understanding Entropy

**Entropy** measures the **impurity** or **disorder** in a dataset. It tells us how "mixed" the classes are.

$$E = -\sum_{i=1}^{N} p_i \log_2(p_i)$$

Where:
- $p_i$ is the proportion of samples belonging to class $i$
- $N$ is the number of classes
- By convention, $0 \cdot \log_2(0) = 0$

**Key insights:**
- **Pure node** (all same class): Entropy = 0
- **Maximally mixed** (equal proportions): Entropy = $\log_2(N)$
- For binary classification: max entropy = 1.0 (when 50-50 split)
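
As a quick sanity check of the key insights above, here is a minimal sketch using only the Python standard library. The helper name `entropy_from_proportions` is illustrative and is not used elsewhere in this notebook; it skips zero proportions to apply the $0 \cdot \log_2(0) = 0$ convention.

```python
import math

def entropy_from_proportions(proportions):
    """Entropy in bits of a list of class proportions (illustrative helper)."""
    # Skip p == 0 terms: by convention 0 * log2(0) contributes nothing.
    return sum(-p * math.log2(p) for p in proportions if p > 0)

pure = entropy_from_proportions([1.0, 0.0])        # all samples in one class
balanced = entropy_from_proportions([0.5, 0.5])    # 50-50 binary split
uniform_4 = entropy_from_proportions([0.25] * 4)   # four equally likely classes

print(f"pure:      {pure:.4f}")
print(f"balanced:  {balanced:.4f}")
print(f"uniform 4: {uniform_4:.4f} (log2(4) = {math.log2(4):.4f})")
```

The four-class case illustrates the general bound: with $N$ equally likely classes the entropy reaches $\log_2(N)$.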

### Manual Entropy Calculation

Let's calculate the entropy of our full training set step by step.

We have 6 samples of class 0 and 5 samples of class 1 (total = 11).
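
Substituting these counts into the entropy formula gives the value the step-by-step calculation should reproduce (intermediate values rounded to four decimals):

$$E = -\left(\tfrac{6}{11}\log_2\tfrac{6}{11} + \tfrac{5}{11}\log_2\tfrac{5}{11}\right) \approx -\big(0.5455 \times (-0.8745) + 0.4545 \times (-1.1375)\big) \approx 0.4770 + 0.5170 = 0.9940$$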

In [None]:
# Step 1: Count samples in each class
n_total = len(y_train)
n_class_0 = np.sum(y_train == 0)
n_class_1 = np.sum(y_train == 1)

print("Step 1 - Count samples:")
print(f"  Total samples: {n_total}")
print(f"  Class 0: {n_class_0}")
print(f"  Class 1: {n_class_1}")
print()

# Step 2: Calculate proportions
p_0 = n_class_0 / n_total
p_1 = n_class_1 / n_total

print("Step 2 - Calculate proportions:")
print(f"  p_0 = {n_class_0}/{n_total} = {p_0:.4f}")
print(f"  p_1 = {n_class_1}/{n_total} = {p_1:.4f}")
print()

# Step 3: Calculate each term: p_i * log2(p_i)
term_0 = p_0 * np.log2(p_0)
term_1 = p_1 * np.log2(p_1)

print("Step 3 - Calculate p_i × log₂(p_i):")
print(f"  p_0 × log₂(p_0) = {p_0:.4f} × log₂({p_0:.4f}) = {p_0:.4f} × {np.log2(p_0):.4f} = {term_0:.4f}")
print(f"  p_1 × log₂(p_1) = {p_1:.4f} × log₂({p_1:.4f}) = {p_1:.4f} × {np.log2(p_1):.4f} = {term_1:.4f}")
print()

# Step 4: Sum and negate
entropy_manual = -(term_0 + term_1)

print("Step 4 - Sum and negate:")
print(f"  E = -({term_0:.4f} + {term_1:.4f})")
print(f"  E = -({term_0 + term_1:.4f})")
print(f"  E = {entropy_manual:.4f}")
print()
print(f"Root Entropy = {entropy_manual:.3f} (matches lecture slide 21: 0.994)")

### Vectorized Entropy Function

Now let's create an efficient function to calculate entropy for any label array.

In [None]:
def entropy(y):
    """
    Calculate entropy of a label array.

    E = -Σ p_i × log₂(p_i)
    """
    if len(y) == 0:
        return 0.0

    # Count occurrences of each class
    _, counts = np.unique(y, return_counts=True)

    # Calculate proportions
    proportions = counts / len(y)

    # np.unique only reports classes that occur, so every proportion is > 0
    # and log2 is safe. Adding 0.0 turns the -0.0 of a pure node into 0.0.
    return -np.sum(proportions * np.log2(proportions)) + 0.0

# Verify it matches our manual calculation
entropy_vectorized = entropy(y_train)
print(f"Vectorized entropy: {entropy_vectorized:.4f}")
print(f"Manual entropy:     {entropy_manual:.4f}")
print(f"Results match: {np.isclose(entropy_vectorized, entropy_manual)}")

### Visualizing Entropy: Low vs High Impurity

Let's see how entropy varies with class proportions.

In [None]:
# Test entropy on different distributions
print("Entropy Examples:")
print("=" * 60)

# Pure node (all class 0)
y_pure = np.array([0, 0, 0, 0, 0])
print(f"Pure (all 0s):     {y_pure} → E = {entropy(y_pure):.4f}")

# Almost pure (9 muffins, 1 cookie - from lecture slide 9)
y_low_entropy = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])
print(f"Low entropy (9:1): {y_low_entropy} → E = {entropy(y_low_entropy):.4f}")

# Our dataset (6:5)
print(f"Our data (6:5):    → E = {entropy(y_train):.4f}")

# Perfectly balanced
y_balanced = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
print(f"Balanced (5:5):    {y_balanced} → E = {entropy(y_balanced):.4f}")

# Plot entropy curve
plt.figure(figsize=(10, 6))
p_values = np.linspace(0.001, 0.999, 100)
entropy_values = [-p * np.log2(p) - (1-p) * np.log2(1-p) for p in p_values]

plt.plot(p_values, entropy_values, 'b-', linewidth=2)
plt.axvline(x=6/11, color='red', linestyle='--', label=f'Our data (p={6/11:.3f})')
plt.scatter([6/11], [entropy(y_train)], color='red', s=100, zorder=5)
plt.xlabel('Proportion of Class 0', fontsize=12)
plt.ylabel('Entropy', fontsize=12)
plt.title('Entropy vs Class Proportion (Binary Classification)', fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.show()

## Step 5: Understanding Information Gain

**Information Gain (IG)** measures how much a split reduces entropy.

$$IG = E_{parent} - \frac{|L|}{N} E_L - \frac{|R|}{N} E_R$$

Where:
- $E_{parent}$ is the entropy before the split
- $|L|, |R|$ are the number of samples in left and right children
- $E_L, E_R$ are the entropies of the children
- $N$ is the total number of samples

**Higher information gain = better split!**
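
To make the formula concrete, here is a minimal standard-library sketch that applies it to the split $x_1 \leq 1.0$ of the training set defined in Step 2. Given that data, the left child holds the five class-0 points with $x_1 \leq 1.0$, and the right child holds the remaining class-0 point (Point 5) together with all five class-1 points. The helper names `label_entropy` and `info_gain` are illustrative and not part of the notebook's own functions.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Entropy in bits of a list of class labels (illustrative helper)."""
    n = len(labels)
    if n == 0:
        return 0.0
    # Counter only reports classes that occur, so every proportion is > 0.
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * label_entropy(left) + (len(right) / n) * label_entropy(right)
    return label_entropy(parent) - weighted

# Labels induced by the split x1 <= 1.0 on the Step 2 training data.
left = [0, 0, 0, 0, 0]        # five class-0 points
right = [0, 1, 1, 1, 1, 1]    # Point 5 (class 0) plus five class-1 points
parent = left + right

gain = info_gain(parent, left, right)
print(f"E_parent = {label_entropy(parent):.4f}")
print(f"E_left   = {label_entropy(left):.4f}")
print(f"E_right  = {label_entropy(right):.4f}")
print(f"IG       = {gain:.4f}")
```

Because the left child is pure, only the right child contributes to the weighted entropy, and the gain comes out at roughly 0.639 for this data.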

In [None]:
def information_gain(y_parent, y_left, y_right):
    """
    Calculate information gain from a split.

    IG = E_parent - (|L|/N × E_L + |R|/N × E_R)
    """
    n_parent = len(y_parent)
    n_left = len(y_left)
    n_right = len(y_right)

    if n_left == 0 or n_right == 0:
        return 0.0

    # Parent entropy
    e_parent = entropy(y_parent)

    # Weighted child entropy
    e_left = entropy(y_left)
    e_right = entropy(y_right)
    weighted_child = (n_left / n_parent) * e_left + (n_right / n_parent) * e_right

    return e_parent - weighted_child

print("Information Gain Function defined!")

## Step 6: Finding the Best Split for Numerical Features

For numerical features, we need to find the **best threshold** to split on.

**Algorithm:**
1. Sort the unique values of the feature
2. Consider **midpoints** between consecutive values as candidate thresholds
3. For each threshold, split the data into left ($\leq$ threshold) and right ($>$ threshold)
4. Calculate information gain for each split
5. Choose the threshold with the highest information gain

From lecture slide 22:
- Unique values of $x_1$: [-1.5, -1.0, -0.5, 0.0, 0.5, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
- Candidate thresholds: [-1.25, -0.75, -0.25, 0.25, 1.0, 1.75, 2.25, 2.75, 3.25, 3.75]
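
The candidate thresholds listed above can be reproduced with a few lines of plain Python. This is a minimal sketch: `candidate_thresholds` is an illustrative name, and the `x1_values` list simply repeats the first column of the training data.

```python
def candidate_thresholds(values):
    """Midpoints between consecutive sorted unique values (illustrative helper)."""
    unique_sorted = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(unique_sorted, unique_sorted[1:])]

# First feature (x1) of the 11 training points, in dataset order.
x1_values = [-0.5, -1.5, 0.0, -1.0, 0.5, 2.5, 3.5, 2.0, 3.0, 1.5, 4.0]

thresholds = candidate_thresholds(x1_values)
print(thresholds)
```

With $n$ distinct values there are $n - 1$ candidates, so the 11 distinct $x_1$ values yield 10 thresholds.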

### Manual Calculation: Evaluating Split on $x_1$ at threshold 1.0

Let's manually evaluate the split $x_1 \leq 1.0$ (from lecture slide 23).

In [None]:
# Feature x1 values
x1 = X_train[:, 0]
print("Feature x₁ values:")
for i, (val, label) in enumerate(zip(x1, y_train)):
    print(f"  Point {i}: x₁ = {val:5.1f}, class = {label}")

print("\n" + "=" * 60)
print("EVALUATING SPLIT: x₁ ≤ 1.0")
print("=" * 60)

# Split at threshold 1.0
threshold = 1.0
left_mask = x1 <= threshold
right_mask = x1 > threshold

y_left = y_train[left_mask]
y_right = y_train[right_mask]

print(f"\nThreshold: x₁ ≤ {threshold}")
print(f"\nLeft child (x₁ ≤ {threshold}):")
print(f"  Samples: {np.sum(left_mask)} points")
print(f"  Labels: {y_left}")
print(f"  Class 0: {np.sum(y_left == 0)}, Class 1: {np.sum(y_left == 1)}")

print(f"\nRight child (x₁ > {threshold}):")
print(f"  Samples: {np.sum(right_mask)} points")
print(f"  Labels: {y_right}")
print(f"  Class 0: {np.sum(y_right == 0)}, Class 1: {np.sum(y_right == 1)}")

In [None]:
# Calculate entropies
print("\nEntropy Calculations:")
print("-" * 60)

e_parent = entropy(y_train)
print(f"Parent Entropy: E = {e_parent:.4f}")

e_left = entropy(y_left)
print(f"\nLeft Child (5 class 0, 0 class 1):")
print(f"  E_left = -{5/5:.4f} × log₂({5/5:.4f})  (pure node)")
print(f"  E_left = {e_left:.4f}")

e_right = entropy(y_right)
print(f"\nRight Child (1 class 0, 5 class 1):")
print(f"  E_right = -{1/6:.4f} × log₂({1/6:.4f}) - {5/6:.4f} × log₂({5/6:.4f})")
print(f"  E_right = {e_right:.4f}")

# Calculate weighted entropy
n_left = len(y_left)
n_right = len(y_right)
n_total = len(y_train)

weighted_entropy = (n_left / n_total) * e_left + (n_right / n_total) * e_right
print(f"\nWeighted Child Entropy:")
print(f"  E = ({n_left}/{n_total}) × {e_left:.4f} + ({n_right}/{n_total}) × {e_right:.4f}")
print(f"  E = {n_left/n_total:.4f} × {e_left:.4f} + {n_right/n_total:.4f} × {e_right:.4f}")
print(f"  E = {weighted_entropy:.4f}")

# Calculate information gain
ig = e_parent - weighted_entropy
print(f"\nInformation Gain:")
print(f"  IG = E_parent - E_weighted")
print(f"  IG = {e_parent:.4f} - {weighted_entropy:.4f}")
print(f"  IG = {ig:.4f}")
print(f"\n→ Split at x₁ ≤ 1.0 gives IG = {ig:.3f} (compare with lecture slide 23, which reports 0.629)")

### Visualize the Split

In [None]:
plt.figure(figsize=(12, 5))

# Plot 1: Before split
plt.subplot(1, 2, 1)
plt.scatter(X_train[y_train == 0, 0], X_train[y_train == 0, 1],
            c='skyblue', s=150, edgecolors='black', linewidths=1.5, label='Class 0')
plt.scatter(X_train[y_train == 1, 0], X_train[y_train == 1, 1],
            c='orange', s=150, edgecolors='black', linewidths=1.5, label='Class 1')
plt.axvline(x=1.0, color='green', linewidth=3, linestyle='-', label='Split: x₁ = 1.0')
plt.xlabel('$x_1$', fontsize=12)
plt.ylabel('$x_2$', fontsize=12)
plt.title(f'Split at x₁ ≤ 1.0 (IG = {ig:.3f})', fontsize=14)
plt.legend(fontsize=10)
plt.grid(True, alpha=0.3)

# Plot 2: After split
plt.subplot(1, 2, 2)
# Left region (x1 <= 1.0)
plt.axvspan(-2.5, 1.0, alpha=0.2, color='blue', label='Left: x₁ ≤ 1.0')
# Right region (x1 > 1.0)
plt.axvspan(1.0, 4.5, alpha=0.2, color='red', label='Right: x₁ > 1.0')

plt.scatter(X_train[y_train == 0, 0], X_train[y_train == 0, 1],
            c='skyblue', s=150, edgecolors='black', linewidths=1.5)
plt.scatter(X_train[y_train == 1, 0], X_train[y_train == 1, 1],
            c='orange', s=150, edgecolors='black', linewidths=1.5)
plt.axvline(x=1.0, color='green', linewidth=3, linestyle='-')
plt.xlabel('$x_1$', fontsize=12)
plt.ylabel('$x_2$', fontsize=12)
plt.title('Regions After Split', fontsize=14)
plt.legend(fontsize=10, loc='lower right')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Left region: 5 blue (class 0), 0 orange (class 1) → E = {e_left:.3f}")
print(f"Right region: 1 blue (class 0), 5 orange (class 1) → E = {e_right:.3f}")

### Evaluating All Possible Splits

Let's find the **best split** by evaluating all candidate thresholds for both features.

In [None]:
def find_best_split(X, y):
    """
    Find the best feature and threshold to split on.

    Returns: (best_feature, best_threshold, best_gain, split_info)
    """
    n_samples, n_features = X.shape
    best_gain = 0.0
    best_feature = None
    best_threshold = None
    all_splits = []

    for feature_idx in range(n_features):
        feature_values = X[:, feature_idx]
        unique_values = np.unique(feature_values)

        if len(unique_values) < 2:
            continue

        # Midpoint thresholds
        thresholds = (unique_values[:-1] + unique_values[1:]) / 2

        for threshold in thresholds:
            left_mask = feature_values <= threshold
            right_mask = ~left_mask

            y_left = y[left_mask]
            y_right = y[right_mask]

            if len(y_left) == 0 or len(y_right) == 0:
                continue

            gain = information_gain(y, y_left, y_right)
            all_splits.append((feature_idx, threshold, gain))

            if gain > best_gain:
                best_gain = gain
                best_feature = feature_idx
                best_threshold = threshold

    return best_feature, best_threshold, best_gain, all_splits

# Find best split for root node
best_feat, best_thresh, best_ig, all_splits = find_best_split(X_train, y_train)

print("All Candidate Splits:")
print("=" * 60)
print(f"{'Feature':<10} {'Threshold':<12} {'Info Gain':<12}")
print("-" * 60)

# Sort by feature then threshold
all_splits_sorted = sorted(all_splits, key=lambda x: (x[0], x[1]))
for feat, thresh, gain in all_splits_sorted:
    marker = " ← BEST" if feat == best_feat and thresh == best_thresh else ""
    print(f"x[{feat}]      {thresh:<12.3f} {gain:<12.4f}{marker}")

print("\n" + "=" * 60)
print(f"BEST SPLIT: x[{best_feat}] ≤ {best_thresh:.1f} with IG = {best_ig:.4f}")
print("(Compare with lecture slide 25, where x₁ ≤ 1.0 has the highest IG, reported as 0.629)")

## Step 7: Building the Decision Tree

Now let's build the complete decision tree using **recursive splitting**.

**Stopping conditions:**
1. Node is **pure** (all samples same class)
2. Reached **maximum depth**
3. Not enough samples to split (< `min_samples_split`)
4. No valid split improves purity
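
The first three conditions can be checked before any split search is run. Here is a minimal sketch, assuming an integer depth and a plain list of labels. The function name `should_stop` is hypothetical; the classifier in this step inlines these checks rather than calling a helper. Condition 4 is only known after the split search comes back empty, so it is not covered here.

```python
def should_stop(labels, depth, max_depth=None, min_samples_split=2):
    """Return the reason a node must become a leaf, or None to keep splitting."""
    if len(set(labels)) <= 1:
        return "pure"                 # condition 1: only one class left
    if max_depth is not None and depth >= max_depth:
        return "max_depth"            # condition 2: depth budget used up
    if len(labels) < min_samples_split:
        return "min_samples_split"    # condition 3: too few samples to split
    return None                       # condition 4 is decided by the split search

print(should_stop([1, 1, 1], depth=0))                      # pure
print(should_stop([0, 1, 1], depth=3, max_depth=3))         # max_depth
print(should_stop([0, 1], depth=1, min_samples_split=3))    # min_samples_split
print(should_stop([0, 0, 1, 1], depth=1, max_depth=3))      # None
```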

In [None]:
class DecisionTreeClassifier:
    """
    Decision Tree Classifier using entropy and information gain.
    """

    def __init__(self, max_depth=None, min_samples_split=2, min_samples_leaf=1):
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.min_samples_leaf = min_samples_leaf
        self.tree_ = None
        self.n_classes_ = None
        self.classes_ = None

    def _entropy(self, y):
        """Calculate entropy of label array."""
        if len(y) == 0:
            return 0.0
        counts = np.bincount(y, minlength=self.n_classes_)
        proportions = counts / len(y)
        proportions = proportions[proportions > 0]
        return -np.sum(proportions * np.log2(proportions))

    def _information_gain(self, y, y_left, y_right):
        """Calculate information gain from a split."""
        n = len(y)
        n_left, n_right = len(y_left), len(y_right)

        if n_left == 0 or n_right == 0:
            return 0.0

        e_parent = self._entropy(y)
        e_left = self._entropy(y_left)
        e_right = self._entropy(y_right)

        return e_parent - (n_left/n * e_left + n_right/n * e_right)

    def _find_best_split(self, X, y):
        """Find the best feature and threshold to split on."""
        n_samples, n_features = X.shape
        best_gain = 0.0
        best_split = None

        for feature_idx in range(n_features):
            feature_values = X[:, feature_idx]
            unique_values = np.unique(feature_values)

            if len(unique_values) < 2:
                continue

            thresholds = (unique_values[:-1] + unique_values[1:]) / 2

            for threshold in thresholds:
                left_mask = feature_values <= threshold
                right_mask = ~left_mask

                n_left = np.sum(left_mask)
                n_right = np.sum(right_mask)

                if n_left < self.min_samples_leaf or n_right < self.min_samples_leaf:
                    continue

                y_left = y[left_mask]
                y_right = y[right_mask]

                gain = self._information_gain(y, y_left, y_right)

                if gain > best_gain:
                    best_gain = gain
                    best_split = {
                        'feature': feature_idx,
                        'threshold': threshold,
                        'left_mask': left_mask,
                        'right_mask': right_mask,
                        'gain': gain
                    }

        return best_split

    def _build_tree(self, X, y, depth=0):
        """Recursively build the decision tree."""
        n_samples = len(y)

        # Calculate class probabilities
        counts = np.bincount(y, minlength=self.n_classes_)
        proba = counts / counts.sum()
        majority_class = np.argmax(counts)

        # Stopping conditions
        # 1. Pure node
        if len(np.unique(y)) == 1:
            return {'leaf': True, 'proba': proba, 'class': majority_class}

        # 2. Max depth reached
        if self.max_depth is not None and depth >= self.max_depth:
            return {'leaf': True, 'proba': proba, 'class': majority_class}

        # 3. Not enough samples
        if n_samples < self.min_samples_split:
            return {'leaf': True, 'proba': proba, 'class': majority_class}

        # Find best split
        best_split = self._find_best_split(X, y)

        # 4. No valid split
        if best_split is None:
            return {'leaf': True, 'proba': proba, 'class': majority_class}

        # Recursively build children
        left_tree = self._build_tree(
            X[best_split['left_mask']],
            y[best_split['left_mask']],
            depth + 1
        )
        right_tree = self._build_tree(
            X[best_split['right_mask']],
            y[best_split['right_mask']],
            depth + 1
        )

        return {
            'leaf': False,
            'feature': best_split['feature'],
            'threshold': best_split['threshold'],
            'left': left_tree,
            'right': right_tree,
            'gain': best_split['gain'],
            'n_samples': n_samples,
            'entropy': self._entropy(y)
        }

    def fit(self, X, y):
        """Build decision tree from training data."""
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)

        self.classes_, y_encoded = np.unique(y, return_inverse=True)
        self.n_classes_ = len(self.classes_)
        self.n_features_ = X.shape[1]

        self.tree_ = self._build_tree(X, y_encoded, depth=0)
        return self

    def _predict_single(self, x, node):
        """Traverse tree for single sample."""
        while not node['leaf']:
            if x[node['feature']] <= node['threshold']:
                node = node['left']
            else:
                node = node['right']
        return node['proba']

    def predict_proba(self, X):
        """Predict class probabilities."""
        X = np.asarray(X, dtype=float)
        return np.array([self._predict_single(x, self.tree_) for x in X])

    def predict(self, X):
        """Predict class labels."""
        proba = self.predict_proba(X)
        return self.classes_[np.argmax(proba, axis=1)]

print("DecisionTreeClassifier class defined!")

### Train the Model

In [None]:
# Train our decision tree
model = DecisionTreeClassifier(max_depth=3, min_samples_leaf=1)
model.fit(X_train, y_train)

print("Model trained successfully!")
print(f"Number of classes: {model.n_classes_}")
print(f"Classes: {model.classes_}")

### Visualize the Tree Structure

In [None]:
def print_tree(node, feature_names=None, indent=0):
    """Pretty print the decision tree structure."""
    prefix = "  " * indent

    if node['leaf']:
        print(f"{prefix}🍃 Leaf: predict class {node['class']}")
        print(f"{prefix}   proba = {node['proba']}")
    else:
        feat_name = f"x[{node['feature']}]" if feature_names is None else feature_names[node['feature']]
        print(f"{prefix}📊 {feat_name} ≤ {node['threshold']:.2f}")
        print(f"{prefix}   (IG = {node['gain']:.4f}, E = {node['entropy']:.4f}, n = {node['n_samples']})")
        print(f"{prefix}   ├─ True (left):")
        print_tree(node['left'], feature_names, indent + 2)
        print(f"{prefix}   └─ False (right):")
        print_tree(node['right'], feature_names, indent + 2)

print("Decision Tree Structure:")
print("=" * 70)
# Plain-text names: LaTeX markup is not rendered by print()
print_tree(model.tree_, feature_names=['x₁', 'x₂'])

## Step 8: Making Predictions

Let's classify a new point: $(2.0, -2.0)$ (from lecture slide 41).

In [None]:
# Test point from lecture
X_test_point = np.array([[2.0, -2.0]])

print("Prediction Walk-Through for point (2.0, -2.0):")
print("=" * 60)

# Manual traversal
node = model.tree_
x = X_test_point[0]
step = 1

while not node['leaf']:
    feat = node['feature']
    thresh = node['threshold']

    print(f"\nStep {step}: At node 'x[{feat}] ≤ {thresh:.2f}'")
    print(f"  Test: x[{feat}] = {x[feat]:.2f} ≤ {thresh:.2f}?")

    if x[feat] <= thresh:
        print(f"  Result: {x[feat]:.2f} ≤ {thresh:.2f} is TRUE → go LEFT")
        node = node['left']
    else:
        print(f"  Result: {x[feat]:.2f} ≤ {thresh:.2f} is FALSE → go RIGHT")
        node = node['right']
    step += 1

print(f"\nStep {step}: Reached LEAF node")
print(f"  Probabilities: {node['proba']}")
print(f"  Predicted class: {node['class']}")

# Verify with predict method
pred = model.predict(X_test_point)[0]
proba = model.predict_proba(X_test_point)[0]

print("\n" + "=" * 60)
print("PREDICTION RESULT:")
print(f"  Point: {X_test_point[0].tolist()}")
print(f"  Probabilities [class 0, class 1]: {proba.tolist()}")
print(f"  Predicted Class: {pred}")
print(f"\n(Matches lecture slide 41: Expected class 1)")

### Evaluate on Training Data

In [None]:
# Predict on all training points
y_pred = model.predict(X_train)

print("Predictions on Training Data:")
print("-" * 50)
print(f"{'Point':<8} {'x₁':>6} {'x₂':>6}   {'True':>5} {'Pred':>5} {'Match':>6}")
print("-" * 50)

correct = 0
for i in range(len(X_train)):
    match = "✓" if y_train[i] == y_pred[i] else "✗"
    if y_train[i] == y_pred[i]:
        correct += 1
    print(f"  {i:<6} {X_train[i, 0]:>6.1f} {X_train[i, 1]:>6.1f}   {y_train[i]:>5} {y_pred[i]:>5} {match:>6}")

accuracy = correct / len(y_train)
print("-" * 50)
print(f"Training Accuracy: {correct}/{len(y_train)} = {accuracy:.2%}")

## Step 9: Visualize Decision Boundary

Decision trees create **axis-aligned** (rectangular) decision boundaries.

In [None]:
def plot_decision_boundary(model, X, y, title="Decision Tree Decision Boundary"):
    """Plot decision boundary for 2D classification."""
    # Create mesh
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.linspace(x1_min, x1_max, 200),
                           np.linspace(x2_min, x2_max, 200))

    # Predict on mesh
    X_mesh = np.c_[xx1.ravel(), xx2.ravel()]
    Z = model.predict_proba(X_mesh)[:, 1].reshape(xx1.shape)

    # Plot
    plt.figure(figsize=(10, 8))
    plt.contourf(xx1, xx2, Z, levels=20, cmap='RdBu_r', alpha=0.6)
    plt.colorbar(label='P(Class 1)')
    plt.contour(xx1, xx2, Z, levels=[0.5], colors='black', linewidths=2, linestyles='dashed')

    # Plot data points
    plt.scatter(X[y == 0, 0], X[y == 0, 1], c='skyblue', s=150,
                edgecolors='black', linewidths=1.5, label='Class 0')
    plt.scatter(X[y == 1, 0], X[y == 1, 1], c='orange', s=150,
                edgecolors='black', linewidths=1.5, label='Class 1')

    plt.xlabel('$x_1$', fontsize=14)
    plt.ylabel('$x_2$', fontsize=14)
    plt.title(title, fontsize=16)
    plt.legend(fontsize=12)
    plt.grid(True, alpha=0.3)
    plt.show()

plot_decision_boundary(model, X_train, y_train,
                       "Decision Tree: Axis-Aligned Decision Boundary")

### Understanding the Rectangular Regions

Notice how the decision boundary consists of **horizontal and vertical lines**. This is because each split thresholds a single feature.
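
One way to see this is to turn a tree into explicit rules: every root-to-leaf path is a conjunction of single-feature threshold tests, which describes a rectangle in feature space. The sketch below is illustrative only. `extract_rules` is a hypothetical helper, and `example_tree` is a hand-written nested dict that follows the node layout used by the classifier in Step 7 (keys `leaf`, `feature`, `threshold`, `left`, `right`, `class`) with the two splits discussed in this notebook.

```python
def extract_rules(node, conditions=()):
    """Return a list of (conditions, predicted_class), one entry per leaf."""
    if node["leaf"]:
        return [(list(conditions), node["class"])]
    name = f"x{node['feature'] + 1}"
    t = node["threshold"]
    # The left branch is taken when the test "feature <= threshold" is true.
    left = extract_rules(node["left"], conditions + (f"{name} <= {t}",))
    right = extract_rules(node["right"], conditions + (f"{name} > {t}",))
    return left + right

# Hand-written tree mirroring the nested-dict node structure (an assumption
# for illustration, not the output of a fitted model).
example_tree = {
    "leaf": False, "feature": 0, "threshold": 1.0,
    "left": {"leaf": True, "class": 0},
    "right": {
        "leaf": False, "feature": 1, "threshold": 0.0,
        "left": {"leaf": True, "class": 1},
        "right": {"leaf": True, "class": 0},
    },
}

rules = extract_rules(example_tree)
for conds, label in rules:
    print("IF " + " AND ".join(conds) + f" THEN predict {label}")
```

Each printed rule bounds one region, so the number of rectangles equals the number of leaves.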

In [None]:
# Visualize the splits explicitly
plt.figure(figsize=(10, 8))

# Background regions. ymin/ymax and xmin are axes fractions: with xlim (-2.5, 5)
# and ylim (-5, 3) set below, x1 = 1.0 maps to 3.5/7.5 and x2 = 0.0 maps to 5/8.
plt.axvspan(-2.5, 1.0, alpha=0.15, color='blue', label='Region: x₁ ≤ 1.0')
plt.axvspan(1.0, 5.0, ymin=0, ymax=5/8, alpha=0.15, color='red')   # x1 > 1.0, x2 <= 0
plt.axvspan(1.0, 5.0, ymin=5/8, ymax=1, alpha=0.15, color='blue')  # x1 > 1.0, x2 > 0

# Data points
plt.scatter(X_train[y_train == 0, 0], X_train[y_train == 0, 1],
            c='skyblue', s=150, edgecolors='black', linewidths=1.5, label='Class 0', zorder=5)
plt.scatter(X_train[y_train == 1, 0], X_train[y_train == 1, 1],
            c='orange', s=150, edgecolors='black', linewidths=1.5, label='Class 1', zorder=5)

# Split lines
plt.axvline(x=1.0, color='green', linewidth=3, linestyle='-', label='Split 1: x₁ = 1.0')
plt.axhline(y=0.0, color='purple', linewidth=3, linestyle='-', xmin=3.5/7.5, label='Split 2: x₂ = 0.0')

# Annotations
plt.annotate('Predict 0\n(5 blue, 0 orange)', xy=(-0.5, 2), fontsize=11, ha='center',
             bbox=dict(boxstyle='round', facecolor='lightblue', alpha=0.8))
plt.annotate('Predict 1\n(0 blue, 5 orange)', xy=(3, -2.5), fontsize=11, ha='center',
             bbox=dict(boxstyle='round', facecolor='lightyellow', alpha=0.8))
plt.annotate('Predict 0\n(1 blue, 0 orange)', xy=(3.5, 1), fontsize=11, ha='center',
             bbox=dict(boxstyle='round', facecolor='lightblue', alpha=0.8))

plt.xlabel('$x_1$', fontsize=14)
plt.ylabel('$x_2$', fontsize=14)
plt.title('Decision Tree Splits (from lecture slide 27)', fontsize=16)
plt.legend(fontsize=10, loc='lower left')
plt.grid(True, alpha=0.3)
plt.xlim(-2.5, 5)
plt.ylim(-5, 3)
plt.show()

print("Decision Rules (from lecture slide 27):")
print("  IF x₁ ≤ 1.0 → predict 0")
print("  ELSE IF x₂ ≤ 0.0 → predict 1")
print("  ELSE → predict 0")

## Step 10: Comparison with scikit-learn

Let's verify our implementation matches sklearn's `DecisionTreeClassifier`.

In [None]:
from sklearn.tree import DecisionTreeClassifier as SklearnDecisionTree
from sklearn.tree import plot_tree

# Train sklearn model with same parameters
sklearn_model = SklearnDecisionTree(
    criterion='entropy',
    max_depth=3,
    min_samples_split=2,
    min_samples_leaf=1,
    random_state=42
)
sklearn_model.fit(X_train, y_train)

# Compare predictions
our_pred = model.predict(X_train)
sklearn_pred = sklearn_model.predict(X_train)

print("Comparison: Our Implementation vs scikit-learn")
print("=" * 60)
print(f"\nOur predictions:     {our_pred}")
print(f"sklearn predictions: {sklearn_pred}")
print(f"\nPredictions match: {np.all(our_pred == sklearn_pred)}")

# Compare accuracies
our_accuracy = np.mean(our_pred == y_train)
sklearn_accuracy = sklearn_model.score(X_train, y_train)

print(f"\nOur Training Accuracy:     {our_accuracy:.2%}")
print(f"sklearn Training Accuracy: {sklearn_accuracy:.2%}")

In [None]:
# Visualize sklearn tree
plt.figure(figsize=(16, 10))
plot_tree(sklearn_model,
          feature_names=['$x_1$', '$x_2$'],
          class_names=['Class 0', 'Class 1'],
          filled=True,
          rounded=True,
          fontsize=11)
plt.title('sklearn DecisionTreeClassifier Structure', fontsize=14)
plt.tight_layout()
plt.show()

## Summary

We've walked through all the computational steps of Decision Tree Classification:

1. ✅ **Calculated Entropy** - measured impurity using $E = -\sum p_i \log_2(p_i)$
2. ✅ **Calculated Information Gain** - measured split quality as reduction in entropy
3. ✅ **Found Best Splits** - evaluated all (feature, threshold) pairs using midpoints
4. ✅ **Built the Tree** - recursively partitioned feature space
5. ✅ **Made Predictions** - traversed tree from root to leaf
6. ✅ **Visualized Decision Boundaries** - saw axis-aligned rectangular regions

### Key Concepts

| Concept | Formula | Description |
|---------|---------|-------------|
| **Entropy** | $E = -\sum_{i} p_i \log_2(p_i)$ | Measures impurity (0 = pure, higher = more mixed) |
| **Information Gain** | $IG = E_{parent} - \sum_j \frac{\lvert S_j \rvert}{N} E_j$ | Reduction in entropy from a split |
| **Greedy Splitting** | Select $\arg\max$ IG | Choose locally best split at each node |
| **Threshold Selection** | Midpoints of unique values | $t_k = \frac{v_k + v_{k+1}}{2}$ |

### Key NumPy Operations Used

| Operation | Purpose |
|-----------|--------|
| `np.unique(y, return_counts=True)` | Count class occurrences |
| `np.bincount(y)` | Fast counting for integer labels |
| `np.log2(p)` | Logarithm base 2 for entropy |
| `X[:, feature_idx] <= threshold` | Create boolean mask for split |
| `np.argmax(proba)` | Find class with highest probability |

### Decision Tree Characteristics

| Property | Decision Trees |
|----------|---------------|
| **Decision Boundary** | Axis-aligned (rectangular regions) |
| **Feature Scaling** | Not required |
| **Interpretability** | High (can extract IF-ELSE rules) |
| **Training Speed** | Fast (O(n × d × log n) per split) |
| **Prediction Speed** | Very fast (O(depth)) |
| **Overfitting Risk** | High (control with max_depth, min_samples) |
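
The "Feature Scaling: Not required" row can be checked directly. Threshold splits depend only on the ordering of feature values, so rescaling a feature with a positive factor and a shift moves the threshold but leaves the left/right partition unchanged. Here is a minimal standard-library sketch under that assumption. `label_entropy` and `best_partition` are illustrative names, and the search is a simplified single-feature version of the midpoint procedure from Step 6.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Entropy in bits of a list of class labels (illustrative helper)."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_partition(values, labels):
    """Indices sent left by the highest-gain midpoint split on one feature."""
    n = len(labels)
    uniq = sorted(set(values))
    best_gain, best_left = 0.0, None
    for a, b in zip(uniq, uniq[1:]):
        t = (a + b) / 2
        left = [i for i, v in enumerate(values) if v <= t]
        right = [i for i, v in enumerate(values) if v > t]
        child = sum(len(s) / n * label_entropy([labels[i] for i in s])
                    for s in (left, right))
        gain = label_entropy(labels) - child
        if gain > best_gain:
            best_gain, best_left = gain, left
    return best_left

x1 = [-0.5, -1.5, 0.0, -1.0, 0.5, 2.5, 3.5, 2.0, 3.0, 1.5, 4.0]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
x1_rescaled = [100.0 * v + 7.0 for v in x1]   # positive scale + shift keeps the ordering

same = best_partition(x1, y) == best_partition(x1_rescaled, y)
print("Same partition after rescaling:", same)
```

Distance-based methods such as KNN do not share this property, which is why the comparison table marks scaling as needed for them.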

### Comparison with Other Algorithms

| Algorithm | Boundary Type | Scaling Needed | Interpretable | Handles Non-linear |
|-----------|--------------|----------------|---------------|--------------------|
| **Decision Trees** | Axis-aligned | No | Yes | Yes |
| **Logistic Regression** | Linear | Yes | Yes | No (manual features) |
| **KNN** | Non-parametric | Yes | Limited | Yes |
| **Naive Bayes** | Linear/Quadratic | No | Yes | Limited |