# Tree Ensembles Fundamentals

In this notebook, we'll explore tree ensembles, which combine multiple decision trees into a single model that is typically more accurate and more robust than any individual tree. We'll focus on random forests and gradient boosting, two popular ensemble methods.

We'll cover:
1. Generating a synthetic dataset with a nonlinear decision boundary
2. Training a from-scratch random forest and making predictions
3. Visualizing ensemble decision boundaries
4. Comparing with scikit-learn's random forest and gradient boosting
5. Examining the effect of ensemble size
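
To see why combining trees can help, consider an idealized case: several classifiers that are each correct with probability `p` and make their errors independently. The sketch below computes the accuracy of their majority vote. The independence assumption is a simplification (trees trained on the same data are correlated, which is why random forests inject randomness), and the function name `majority_vote_accuracy` is purely illustrative.

```python
from math import comb


def majority_vote_accuracy(p: float, n_voters: int) -> float:
    """Probability that a strict majority of n independent voters is correct.

    Assumes every voter is correct with the same probability p and that
    errors are independent; n_voters should be odd to avoid ties.
    """
    majority = n_voters // 2 + 1
    return sum(
        comb(n_voters, k) * p**k * (1 - p) ** (n_voters - k)
        for k in range(majority, n_voters + 1)
    )


# Accuracy of the vote for a growing (odd) number of voters
for n in (1, 3, 11, 51):
    print(f"n={n:2d}: {majority_vote_accuracy(0.7, n):.3f}")
```

With `p` above 0.5, the vote becomes more reliable as voters are added; with `p` below 0.5, it becomes less reliable, so the individual trees still need to be better than chance.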

In [9]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
import sys
import os

# Add the parent directory to sys.path to import custom modules
sys.path.append(os.path.join(os.getcwd(), '..'))

# Import custom modules
from models.random_forest import RandomForest
from utils.data_generator import generate_nonlinear_data
from utils.plotting import plot_decision_boundary

# Set random seed for reproducibility
np.random.seed(42)

## 1. Generate Synthetic Data

We'll create a synthetic dataset with a complex decision boundary to demonstrate the power of tree ensembles.

In [None]:
# Generate synthetic data
X, y = generate_nonlinear_data(n_samples=1000, noise=0.2, random_state=42)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Plot the data
plt.figure(figsize=(10, 8))
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis')
plt.colorbar()
plt.title('Synthetic Classification Dataset')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
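
`generate_nonlinear_data` lives in `utils.data_generator`, which is not shown in this notebook. If that module is unavailable, a rough stand-in can be built on scikit-learn's `make_moons`. This is an assumption about the kind of data involved, not a reproduction of the original generator, and `generate_standin_data` is a hypothetical name.

```python
import numpy as np
from sklearn.datasets import make_moons


def generate_standin_data(n_samples=1000, noise=0.2, random_state=42):
    """Hypothetical stand-in for utils.data_generator.generate_nonlinear_data.

    Returns X with shape (n_samples, 2) and binary labels y in {0, 1}.
    The two-moons shape is an assumption, not the original generator's output.
    """
    X, y = make_moons(n_samples=n_samples, noise=noise, random_state=random_state)
    return X, y


X_demo, y_demo = generate_standin_data()
print(X_demo.shape, np.bincount(y_demo))
```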

## 2. Train Our Custom Random Forest

Now we'll train our custom random forest implementation on the synthetic data.

In [None]:
# Create and train our custom random forest
rf = RandomForest(n_trees=10, max_depth=5, min_samples_split=2)
rf.fit(X_train, y_train)

# Make predictions
y_pred = rf.predict(X_test)

# Calculate accuracy
accuracy = np.mean(y_pred == y_test)
print(f'Test accuracy: {accuracy:.4f}')
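
The `RandomForest` class is imported from `models.random_forest`, so its internals are not visible here. As a rough idea of what such a class might do, the sketch below bags scikit-learn decision trees on bootstrap samples and predicts by majority vote. `SimpleForest` is a hypothetical name, and the choices shown (`max_features="sqrt"`, the `random_state` argument) are assumptions rather than the notebook's actual implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


class SimpleForest:
    """Illustrative bagged-tree classifier; not the notebook's RandomForest."""

    def __init__(self, n_trees=10, max_depth=5, min_samples_split=2, random_state=None):
        self.n_trees = n_trees
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.random_state = random_state
        self.trees_ = []

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        rng = np.random.default_rng(self.random_state)
        self.classes_ = np.unique(y)
        self.trees_ = []
        for _ in range(self.n_trees):
            # Bootstrap sample: draw len(X) row indices with replacement
            idx = rng.integers(0, len(X), size=len(X))
            tree = DecisionTreeClassifier(
                max_depth=self.max_depth,
                min_samples_split=self.min_samples_split,
                max_features="sqrt",  # random feature subset at each split
                random_state=int(rng.integers(0, 2**31 - 1)),
            )
            tree.fit(X[idx], y[idx])
            self.trees_.append(tree)
        return self

    def predict(self, X):
        X = np.asarray(X)
        # votes has shape (n_trees, n_samples)
        votes = np.stack([tree.predict(X) for tree in self.trees_])
        # counts has shape (n_classes, n_samples): votes received by each class
        counts = np.stack([(votes == c).sum(axis=0) for c in self.classes_])
        return self.classes_[counts.argmax(axis=0)]
```

It exposes the same `fit`/`predict` shape as the cell above, so `SimpleForest(n_trees=10, max_depth=5).fit(X_train, y_train).predict(X_test)` could be used for a side-by-side check.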

## 3. Visualize Decision Boundary

Let's visualize the decision boundary learned by our custom random forest.

In [None]:
# Plot decision boundary for our custom implementation
plot_decision_boundary(X_test, y_test, rf, title='Custom Random Forest Decision Boundary')
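
`plot_decision_boundary` comes from `utils.plotting` and is not shown here. A helper like it typically evaluates the model on a dense grid of points and colors each grid cell by the predicted class. The sketch below covers only that grid-evaluation step; `decision_grid` and its `step` and `pad` parameters are hypothetical, chosen to mirror the values used later in this notebook.

```python
import numpy as np


def decision_grid(model, X, step=0.02, pad=0.5):
    """Evaluate model.predict on a regular grid covering the 2-D points in X.

    Returns (xx, yy, Z), where Z holds the predicted class for each grid point.
    """
    x_min, x_max = X[:, 0].min() - pad, X[:, 0].max() + pad
    y_min, y_max = X[:, 1].min() - pad, X[:, 1].max() + pad
    xx, yy = np.meshgrid(np.arange(x_min, x_max, step),
                         np.arange(y_min, y_max, step))
    # Flatten the grid into (n_points, 2), predict, then restore the grid shape
    Z = np.asarray(model.predict(np.c_[xx.ravel(), yy.ravel()])).reshape(xx.shape)
    return xx, yy, Z
```

The result can be passed to matplotlib, for example `xx, yy, Z = decision_grid(rf, X_test)` followed by `plt.contourf(xx, yy, Z, alpha=0.4)` and a scatter plot of the test points.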

## 4. Compare with Scikit-learn

Let's compare our implementation with scikit-learn's random forest and gradient boosting classifiers.

In [None]:
# Create and train scikit-learn's random forest
sklearn_rf = RandomForestClassifier(n_estimators=10, max_depth=5, random_state=42)
sklearn_rf.fit(X_train, y_train)

# Make predictions
sklearn_rf_pred = sklearn_rf.predict(X_test)

# Calculate accuracy
sklearn_rf_accuracy = np.mean(sklearn_rf_pred == y_test)
print(f'Scikit-learn Random Forest accuracy: {sklearn_rf_accuracy:.4f}')

# Plot decision boundary
plot_decision_boundary(X_test, y_test, sklearn_rf, title='Scikit-learn Random Forest Decision Boundary')

In [None]:
# Create and train scikit-learn's gradient boosting
sklearn_gb = GradientBoostingClassifier(n_estimators=10, max_depth=5, random_state=42)
sklearn_gb.fit(X_train, y_train)

# Make predictions
sklearn_gb_pred = sklearn_gb.predict(X_test)

# Calculate accuracy
sklearn_gb_accuracy = np.mean(sklearn_gb_pred == y_test)
print(f'Scikit-learn Gradient Boosting accuracy: {sklearn_gb_accuracy:.4f}')

# Plot decision boundary
plot_decision_boundary(X_test, y_test, sklearn_gb, title='Scikit-learn Gradient Boosting Decision Boundary')
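
Unlike a random forest, which trains its trees independently, gradient boosting adds trees one at a time, each fit to the errors of the ensemble so far. The sketch below shows that loop for binary labels in {0, 1} using the log loss. It is a simplified illustration under those assumptions and omits refinements that production implementations add; `TinyGradientBoosting` is a hypothetical name, not part of scikit-learn.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


class TinyGradientBoosting:
    """Simplified gradient boosting for binary labels in {0, 1} (illustrative only)."""

    def __init__(self, n_estimators=10, learning_rate=0.1, max_depth=3):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth

    def fit(self, X, y):
        y = np.asarray(y, dtype=float)
        p = np.clip(y.mean(), 1e-6, 1 - 1e-6)
        self.init_score_ = np.log(p / (1 - p))  # constant starting log-odds
        scores = np.full(len(y), self.init_score_)
        self.trees_ = []
        for _ in range(self.n_estimators):
            # Negative gradient of the log loss with respect to the scores
            residuals = y - sigmoid(scores)
            tree = DecisionTreeRegressor(max_depth=self.max_depth, random_state=0)
            tree.fit(X, residuals)
            # Add a damped correction from this tree to the running scores
            scores += self.learning_rate * tree.predict(X)
            self.trees_.append(tree)
        return self

    def decision_function(self, X):
        scores = np.full(len(X), self.init_score_)
        for tree in self.trees_:
            scores += self.learning_rate * tree.predict(X)
        return scores

    def predict(self, X):
        return (sigmoid(self.decision_function(X)) > 0.5).astype(int)
```

Because every tree depends on the scores produced by the trees before it, this loop cannot be parallelized across trees the way a random forest can.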

## 5. Effect of Ensemble Size

Let's explore how the number of trees in the ensemble affects the model's performance and decision boundary.

In [None]:
n_trees_list = [1, 5, 10, 20]
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
axes = axes.ravel()  # Flatten the 2x2 array of axes

# Build the mesh grid once; it depends only on the test data, not on the model
x_min, x_max = X_test[:, 0].min() - 0.5, X_test[:, 0].max() + 0.5
y_min, y_max = X_test[:, 1].min() - 0.5, X_test[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                     np.arange(y_min, y_max, 0.02))
grid = np.c_[xx.ravel(), yy.ravel()]

for ax, n_trees in zip(axes, n_trees_list):
    # Train a random forest with the given number of trees
    # (a separate name keeps the 10-tree `rf` from section 2 intact)
    rf_n = RandomForest(n_trees=n_trees, max_depth=5, min_samples_split=2)
    rf_n.fit(X_train, y_train)

    # Test accuracy, so performance can be read alongside the boundary
    acc = np.mean(rf_n.predict(X_test) == y_test)

    # Predict the class at every grid point
    Z = rf_n.predict(grid).reshape(xx.shape)

    # Plot on the appropriate subplot
    ax.contourf(xx, yy, Z, alpha=0.4)
    ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, alpha=0.8)
    ax.set_title(f'Decision Boundary (n_trees={n_trees}, test accuracy={acc:.4f})')
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')

plt.tight_layout()
plt.show()
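
A single run per ensemble size can be noisy, since each forest depends on its random bootstrap samples. One way to get a steadier picture is to repeat the fit over several seeds and report the mean and spread of the test accuracy. The sketch below does this with scikit-learn's `RandomForestClassifier` rather than the custom `RandomForest`, because the custom class's seeding interface is not shown; `accuracy_by_ensemble_size` is a hypothetical helper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def accuracy_by_ensemble_size(X_train, y_train, X_test, y_test,
                              sizes=(1, 5, 10, 20), seeds=range(5), max_depth=5):
    """Return {n_estimators: (mean accuracy, std)} over several random seeds."""
    results = {}
    for n in sizes:
        scores = [
            RandomForestClassifier(n_estimators=n, max_depth=max_depth, random_state=seed)
            .fit(X_train, y_train)
            .score(X_test, y_test)
            for seed in seeds
        ]
        results[n] = (float(np.mean(scores)), float(np.std(scores)))
    return results
```

In this notebook it could be called as `accuracy_by_ensemble_size(X_train, y_train, X_test, y_test)`, printing each entry of the returned dictionary.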

## Conclusion

In this notebook, we've explored tree ensembles by:
1. Using a random forest implemented from scratch
2. Training it on synthetic data
3. Visualizing its decision boundaries
4. Comparing with scikit-learn's implementations
5. Analyzing the effect of ensemble size on model performance

Key takeaways:
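- Tree ensembles are usually more accurate and more robust than a single decision tree
- Random forests reduce overfitting by averaging many trees trained on bootstrap samples, which lowers variance
- Gradient boosting builds a strong learner sequentially: each new tree is fit to correct the errors of the ensemble so far
- For random forests, adding trees generally improves performance, with diminishing returns; for gradient boosting, too many trees can eventually overfit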