# Feature Importance Analysis

This notebook examines which features matter most to a Random Forest:
1. Calculate feature importance with scikit-learn
2. Visualize importance scores
3. Check the scores against a synthetic dataset with known importance

---

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris, make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import sys
sys.path.append('..')
from decision_trees.tree_from_scratch import DecisionTree
from utils import bootstrap_sample, majority_vote

sns.set_style('darkgrid')
plt.rcParams['figure.figsize'] = (12, 6)
np.random.seed(42)

---
## Load Iris Dataset

In [None]:
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f'Dataset: {X.shape}')
print(f'Features: {feature_names}')

---
## Feature Importance with Scikit-Learn

In [None]:
# Train Random Forest
rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf.fit(X_train, y_train)

# Get feature importance
importances = rf.feature_importances_

# Sort by importance
indices = np.argsort(importances)[::-1]

print('='*60)
print('FEATURE IMPORTANCE')
print('='*60)
for i in range(len(feature_names)):
    idx = indices[i]
    print(f'{i+1}. {feature_names[idx]:25s}: {importances[idx]:.4f}')

---
## Visualize Feature Importance

In [None]:
# Create bar plot
plt.figure(figsize=(10, 6))
plt.bar(range(len(importances)), importances[indices], 
        color='steelblue', edgecolor='black')
plt.xticks(range(len(importances)), 
          [feature_names[i] for i in indices], rotation=45, ha='right')
plt.xlabel('Features', fontsize=12)
plt.ylabel('Importance Score', fontsize=12)
plt.title('Random Forest Feature Importance', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

### Interpretation

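Features are ranked by mean decrease in impurity: how much the splits on each feature reduce impurity, weighted by the number of training samples reaching each split and averaged across all trees.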

**For Iris Dataset**:
- **Petal length** and **petal width** are most important
- These features best separate the three Iris species
- Sepal measurements are less discriminative
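
The ranking above can be traced back to the individual trees. The following is a minimal sketch, assuming that a scikit-learn forest's `feature_importances_` is the average of its trees' importances, renormalized to sum to 1. The names `forest`, `per_tree`, and `manual` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
forest = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
forest.fit(iris.data, iris.target)

# One row of importances per fitted tree: shape (n_trees, n_features)
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])

# Average across trees, then renormalize so the scores sum to 1
manual = per_tree.mean(axis=0)
manual = manual / manual.sum()

for name, score in zip(iris.feature_names, manual):
    print(f'{name:25s}: {score:.4f}')
print('Matches forest.feature_importances_:',
      np.allclose(manual, forest.feature_importances_))
```

If the final line prints `True`, the forest-level score is just the per-tree scores averaged, which is why a feature favored by only a few trees ends up with a modest overall score.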

---
## Create Synthetic Dataset with Known Importance

In [None]:
# Create dataset where features have different importance
n_samples = 1000
X_syn = np.random.randn(n_samples, 5)

# y depends strongly on X[:, 0] and X[:, 1], weakly on X[:, 2], not on X[:, 3] or X[:, 4]
y_syn = (3 * X_syn[:, 0] + 2 * X_syn[:, 1] + 0.5 * X_syn[:, 2] + 
         np.random.randn(n_samples) * 0.1)
y_syn = (y_syn > np.median(y_syn)).astype(int)

print('True feature importance (by construction):')
print('Feature 0: HIGH (coef=3.0)')
print('Feature 1: HIGH (coef=2.0)') 
print('Feature 2: LOW (coef=0.5)')
print('Feature 3: NONE (random noise)')
print('Feature 4: NONE (random noise)')

In [None]:
# Train Random Forest
rf_syn = RandomForestClassifier(n_estimators=100, random_state=42)
rf_syn.fit(X_syn, y_syn)

# Get importance
imp_syn = rf_syn.feature_importances_

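# Visualize
from matplotlib.patches import Patch

plt.figure(figsize=(10, 6))
colors = ['green', 'green', 'orange', 'red', 'red']
plt.bar(range(5), imp_syn, color=colors, edgecolor='black', alpha=0.7)
plt.xticks(range(5), [f'Feature {i}' for i in range(5)])
plt.xlabel('Features', fontsize=12)
plt.ylabel('Importance Score', fontsize=12)
plt.title('Feature Importance on Synthetic Data', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3, axis='y')
# plt.bar returns a single container, so passing three labels to plt.legend
# would label only one entry. Build one handle per color group instead.
legend_handles = [
    Patch(facecolor='green', edgecolor='black', alpha=0.7, label='Important'),
    Patch(facecolor='orange', edgecolor='black', alpha=0.7, label='Somewhat Important'),
    Patch(facecolor='red', edgecolor='black', alpha=0.7, label='Noise'),
]
plt.legend(handles=legend_handles, loc='upper right')
plt.tight_layout()
plt.show()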

print('\nLearned Feature Importance:')
for i in range(5):
    print(f'Feature {i}: {imp_syn[i]:.4f}')

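# Check the claim instead of asserting it: the two strongest features by construction are 0 and 1
top_two = sorted(np.argsort(imp_syn)[::-1][:2].tolist())
print(f'\nTop-2 features by learned importance: {top_two} (expected [0, 1] by construction)')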
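
A single importance score per feature hides how much the trees disagree. The sketch below rebuilds a similar synthetic dataset and reports the spread of per-tree importances next to the forest-level score. It uses a local `np.random.default_rng` generator, so the numbers will differ from the cell above. The names `X_demo`, `y_demo`, and `forest_demo` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Same construction as above: strong (0, 1), weak (2), pure noise (3, 4)
rng = np.random.default_rng(42)
X_demo = rng.standard_normal((1000, 5))
signal = (3 * X_demo[:, 0] + 2 * X_demo[:, 1] + 0.5 * X_demo[:, 2]
          + 0.1 * rng.standard_normal(1000))
y_demo = (signal > np.median(signal)).astype(int)

forest_demo = RandomForestClassifier(n_estimators=100, random_state=42)
forest_demo.fit(X_demo, y_demo)

# Spread of each feature's importance across the individual trees
per_tree = np.array([t.feature_importances_ for t in forest_demo.estimators_])
mean_imp = forest_demo.feature_importances_
std_imp = per_tree.std(axis=0)

for i in range(5):
    print(f'Feature {i}: {mean_imp[i]:.4f} +/- {std_imp[i]:.4f}')

top_two = sorted(np.argsort(mean_imp)[::-1][:2].tolist())
print('Top-2 features:', top_two)
print('Noise features still receive non-zero scores:',
      bool((mean_imp[3:] > 0).all()))
```

Fully grown trees keep splitting until the leaves are pure, so even pure-noise features pick up some impurity reduction. A small non-zero score is therefore not evidence that a feature carries signal.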

---
## Summary

### Feature Importance Benefits:
1. **Feature selection**: Remove unimportant features
2. **Model interpretation**: Understand what drives predictions
3. **Domain insight**: Validate assumptions about data
4. **Debugging**: Identify unexpected feature influence

### How It Works:
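- **Mean Decrease in Impurity**: For each feature, sum the impurity reduction of every split that uses it, weighting each split by the fraction of training samples that reach it, then average across trees
- **Normalized**: Scores sum to 1.0
- **Higher score**: The feature accounted for more of the impurity reduction during training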
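
The steps above can be reproduced by hand for a single tree. This is a minimal sketch that walks the fitted `tree_` arrays of a scikit-learn `DecisionTreeClassifier` (`children_left`, `children_right`, `feature`, `impurity`, `weighted_n_node_samples`) and accumulates the weighted impurity decrease per feature. The variable names and the choice of `max_depth=3` are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(iris.data, iris.target)
t = clf.tree_

n = t.weighted_n_node_samples      # samples reaching each node
n_total = n[0]                     # samples at the root
mdi = np.zeros(iris.data.shape[1])

for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:                 # leaf node: no split, nothing to add
        continue
    # Impurity before the split minus impurity after it,
    # each term weighted by the share of samples involved
    decrease = (n[node] * t.impurity[node]
                - n[left] * t.impurity[left]
                - n[right] * t.impurity[right]) / n_total
    mdi[t.feature[node]] += decrease

mdi_normalized = mdi / mdi.sum()   # scores now sum to 1.0

for name, score in zip(iris.feature_names, mdi_normalized):
    print(f'{name:25s}: {score:.4f}')
print('Matches clf.feature_importances_:',
      np.allclose(mdi_normalized, clf.feature_importances_))
```

A forest repeats this calculation for every tree and averages the results, so a feature scores highly only if many trees find useful splits on it.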

### Key Point:
"Random Forests calculate feature importance as a by-product of training by measuring how much each feature reduces impurity across all trees and splits. Features whose splits produce the largest impurity reductions (the purest child nodes) receive the highest scores. This provides interpretability without sacrificing predictive power."

---

**Random Forests component complete!**