# Notebook 5: Feature Selection — Filter, Wrapper, and Embedded Methods

**Module ML600 — Optimization, Regularization, and Model Selection**

## Learning Objectives

By the end of this notebook you will be able to:

- Explain **why feature selection matters** (dimensionality, interpretability, speed)
- Apply **filter methods**: correlation, mutual information, variance threshold
- Apply **wrapper methods**: Recursive Feature Elimination (RFE)
- Apply **embedded methods**: Lasso coefficients, tree-based importance, `SelectFromModel`
- **Compare** all three approaches on the same dataset
- Avoid common pitfalls (data leakage, relying on a single method)

## Prerequisites

- Familiarity with supervised learning (classification / regression)
- Understanding of train/test splitting and cross-validation
- Basic knowledge of Lasso (L1) regularization and tree-based models
- Python libraries: `numpy`, `pandas`, `matplotlib`, `seaborn`, `sklearn`

## Table of Contents

1. [Why Feature Selection Matters](#1)
2. [Filter Methods](#2)
   - 2a. Correlation with Target
   - 2b. Mutual Information
   - 2c. Variance Threshold
3. [Wrapper Methods — RFE](#3)
4. [Embedded Methods](#4)
   - 4a. Lasso (L1) Coefficients
   - 4b. Tree-Based Feature Importance
   - 4c. SelectFromModel
5. [Comparison of All Methods](#5)
6. [Common Mistakes](#6)
7. [Exercise](#7)

---
## Setup

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, LassoCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import (
    VarianceThreshold, mutual_info_classif, RFE, SelectFromModel
)
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)
print('Setup complete.')

In [None]:
# Load the breast cancer dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name='target')  # 0=malignant, 1=benign

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE, stratify=y
)

print(f'Training: {X_train.shape}  |  Test: {X_test.shape}')
print(f'Features: {X_train.shape[1]}')
X.head()

<a id='1'></a>
## 1. Why Feature Selection Matters

| Benefit | Explanation |
|---------|-------------|
| **Reduce dimensionality** | Fewer features = simpler model, lower risk of overfitting |
| **Improve interpretability** | Easier to explain which features drive predictions |
| **Speed up training** | Less data to process per iteration |
| **Remove noise** | Irrelevant / redundant features add noise and hurt generalization |

Three families of methods:

1. **Filter**: score features independently of any model (fast, model-agnostic)
2. **Wrapper**: use a model to evaluate feature subsets (often more accurate for that model, but slower because the model is refit many times)
3. **Embedded**: feature selection happens as part of model training (e.g., L1 penalty, tree splits)
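As a rough orientation, the three families can be sketched side by side with scikit-learn selectors that all expose `get_support()`. The synthetic dataset, the variable names, `SelectKBest` with `f_classif` as the filter, and the `C=0.1` setting are assumptions made for this illustration, not choices taken from the rest of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic data (illustrative only): 20 features, 5 of them informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_redundant=2, random_state=0
)

selectors = {
    # Filter: univariate ANOVA F-score, no model involved
    'filter': SelectKBest(score_func=f_classif, k=5),
    # Wrapper: repeatedly refits a model and drops the weakest feature
    'wrapper': RFE(LogisticRegression(max_iter=1000), n_features_to_select=5),
    # Embedded: keeps features whose L1-penalized coefficient is not (near) zero
    'embedded': SelectFromModel(
        LogisticRegression(penalty='l1', solver='liblinear', C=0.1),
        threshold=1e-5,
    ),
}

chosen = {}
for family, selector in selectors.items():
    selector.fit(X_demo, y_demo)
    chosen[family] = selector.get_support(indices=True)
    print(f'{family:9s}: {chosen[family].tolist()}')
```

The filter and wrapper selectors return exactly the requested number of features, while the embedded selector returns however many coefficients survive the penalty.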

<a id='2'></a>
## 2. Filter Methods

Filter methods rank features using **statistical measures** without training a model.

### 2a. Correlation with Target

For regression (or binary classification mapped to 0/1), Pearson correlation measures linear association between each feature and the target.

In [None]:
# Compute absolute correlation of each feature with the target
train_df = X_train.copy()
train_df['target'] = y_train.values

correlations = train_df.corr()['target'].drop('target').abs().sort_values(ascending=False)

fig, ax = plt.subplots(figsize=(12, 7))
correlations.plot(kind='barh', ax=ax, color=sns.color_palette('viridis', len(correlations)))
ax.set_xlabel('|Correlation| with Target')
ax.set_title('Feature-Target Correlation (Absolute Pearson)')
ax.invert_yaxis()
plt.tight_layout()
plt.show()

print('Top 10 features by |correlation|:')
print(correlations.head(10).to_string())
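With a 0/1 target, the Pearson correlation reduces to a scaled difference of class means (the point-biserial correlation): $r = \frac{M_1 - M_0}{s}\sqrt{p(1-p)}$, where $M_1, M_0$ are the feature means in each class, $s$ is the population standard deviation of the feature, and $p$ is the fraction of class 1. The check below uses synthetic data; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative only): a 0/1 label and a feature shifted by class
label = rng.integers(0, 2, size=500)
feature = rng.normal(loc=1.5 * label, scale=1.0)

# Pearson correlation computed directly
r_pearson = np.corrcoef(feature, label)[0, 1]

# Same quantity from class means: (M1 - M0) / s * sqrt(p * (1 - p))
p = label.mean()
mean_gap = feature[label == 1].mean() - feature[label == 0].mean()
r_from_means = mean_gap / feature.std() * np.sqrt(p * (1 - p))

print(f'Pearson r:        {r_pearson:.6f}')
print(f'From class means: {r_from_means:.6f}')
```

Read this way, a high |correlation| means the two classes have well-separated means relative to the feature's spread. It says nothing about differences in shape or variance between the classes.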

### 2b. Mutual Information

Mutual information captures **any** (including non-linear) dependency between a feature and the target. `sklearn.feature_selection.mutual_info_classif` estimates MI for classification tasks.

In [None]:
# Compute mutual information scores
mi_scores = mutual_info_classif(
    X_train, y_train, random_state=RANDOM_STATE, n_neighbors=5
)
mi_series = pd.Series(mi_scores, index=X_train.columns).sort_values(ascending=False)

fig, ax = plt.subplots(figsize=(12, 7))
mi_series.plot(kind='barh', ax=ax, color=sns.color_palette('magma', len(mi_series)))
ax.set_xlabel('Mutual Information Score')
ax.set_title('Feature Ranking by Mutual Information')
ax.invert_yaxis()
plt.tight_layout()
plt.show()

print('Top 10 features by MI:')
print(mi_series.head(10).to_string())
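To see why mutual information can catch dependencies that Pearson correlation misses, consider a small synthetic case where the label depends on the magnitude of a feature but not its sign. The data and names below are assumptions made for this illustration only.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 2000

# Synthetic data (illustrative only)
x_signal = rng.uniform(-1, 1, size=n)  # determines the label, but symmetrically
x_noise = rng.uniform(-1, 1, size=n)   # unrelated to the label
label = (np.abs(x_signal) > 0.5).astype(int)

X_toy = np.column_stack([x_signal, x_noise])

# Absolute Pearson correlation of each column with the label
pearson = [abs(np.corrcoef(X_toy[:, j], label)[0, 1]) for j in range(2)]
# Nearest-neighbor MI estimate of each column with the label
mi = mutual_info_classif(X_toy, label, random_state=0)

for name, r, m in zip(['x_signal', 'x_noise'], pearson, mi):
    print(f'{name:8s}  |pearson| = {r:.3f}   MI = {m:.3f}')
```

The label is a deterministic function of `x_signal`, yet the positive and negative halves cancel in the correlation. The MI estimate should still separate the signal column from the noise column.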

### 2c. Variance Threshold

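Removes features whose variance falls below a threshold. Zero-variance (constant) features carry no information about the target. Because variance depends on a feature's units, a non-zero threshold is only meaningful when features share a scale, for example binary indicators or min-max-scaled columns. Standardizing with `StandardScaler` sets the variance of every non-constant feature to 1, so apply `VarianceThreshold` before standardizing, not after.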

In [None]:
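# Standardized copy of the training data, kept as a DataFrame.
# Note: after StandardScaler every non-constant column has variance 1,
# so the variance filter below is applied to the raw features instead.
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(
    scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index
)

# Show raw feature variances (these depend on each feature's units)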
raw_var = X_train.var().sort_values(ascending=False)
print('Feature variances (raw, top 5):')
print(raw_var.head().to_string())

# Apply VarianceThreshold on raw features
vt = VarianceThreshold(threshold=0.0)  # remove zero-variance only
vt.fit(X_train)
kept = X_train.columns[vt.get_support()]
removed = X_train.columns[~vt.get_support()]
print(f'\nVarianceThreshold (threshold=0): kept {len(kept)}, removed {len(removed)}')
if len(removed) > 0:
    print(f'Removed features: {list(removed)}')
else:
    print('No features removed (all have variance > 0).')
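A non-zero threshold is easiest to interpret for binary features, where the variance is $p(1-p)$. The sketch below uses two hand-built toy columns (not the breast cancer data) and a threshold of `0.8 * (1 - 0.8)`, which drops any binary column that takes the same value in more than 80% of rows. It also shows why the filter is uninformative after standardization.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler

# Toy binary features (illustrative only), 100 rows each
rare = np.array([1] * 5 + [0] * 95)       # 5% ones  -> variance 0.05 * 0.95 = 0.0475
balanced = np.array([1] * 50 + [0] * 50)  # 50% ones -> variance 0.5 * 0.5 = 0.25
X_bin = np.column_stack([rare, balanced])

# Drop binary features that take the same value in more than 80% of rows
vt_bin = VarianceThreshold(threshold=0.8 * (1 - 0.8))
vt_bin.fit(X_bin)
print('Kept columns [rare, balanced]:', vt_bin.get_support())

# After standardization every column has variance 1, so nothing can be filtered
X_std = StandardScaler().fit_transform(X_bin.astype(float))
print('Variances after StandardScaler:', X_std.var(axis=0))
```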

In [None]:
# Visualize: side-by-side comparison of correlation vs mutual information rankings
comparison_filter = pd.DataFrame({
    'Correlation Rank': range(1, len(correlations) + 1),
    'MI Rank': [list(mi_series.index).index(f) + 1 for f in correlations.index]
}, index=correlations.index)

fig, ax = plt.subplots(figsize=(10, 8))
top_n = 15
for i, feat in enumerate(correlations.index[:top_n]):
    corr_rank = comparison_filter.loc[feat, 'Correlation Rank']
    mi_rank = comparison_filter.loc[feat, 'MI Rank']
    ax.plot([0, 1], [corr_rank, mi_rank], 'o-', color=plt.cm.tab20(i), markersize=6)
    ax.text(-0.05, corr_rank, feat, ha='right', fontsize=8, va='center')
    ax.text(1.05, mi_rank, feat, ha='left', fontsize=8, va='center')

ax.set_xlim(-0.5, 1.5)
ax.set_xticks([0, 1])
ax.set_xticklabels(['Correlation Rank', 'MI Rank'], fontsize=12)
ax.set_ylabel('Rank (1 = best)')
ax.set_title(f'Correlation vs MI Rankings (Top {top_n} by Correlation)')
ax.invert_yaxis()
plt.tight_layout()
plt.show()

<a id='3'></a>
## 3. Wrapper Methods — Recursive Feature Elimination (RFE)

RFE works by:
1. Training a model on all features
2. Ranking features by importance (e.g., coefficients, feature importances)
3. Removing the least important feature(s)
4. Repeating until the desired number of features is reached

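It is more expensive than filter methods because the model is refit at every elimination step, but it often selects better subsets because each feature is judged in the context of the others rather than in isolation.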
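The four steps can be written out as a short loop. This is a minimal sketch on synthetic data, assuming a logistic regression whose absolute coefficients serve as the importance measure; `manual_rfe` is a hypothetical helper written for this illustration, and the result is compared against scikit-learn's `RFE` with `step=1`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative only), standardized so coefficients are comparable
X_demo, y_demo = make_classification(
    n_samples=400, n_features=12, n_informative=4, n_redundant=2, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)


def manual_rfe(X, y, n_keep):
    """Hypothetical helper: drop the feature with the smallest |coef| each round."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        # Steps 1-2: fit on the surviving features and rank them by |coefficient|
        model = LogisticRegression(max_iter=1000).fit(X[:, remaining], y)
        weakest = int(np.argmin(np.abs(model.coef_[0])))
        # Step 3: remove the least important feature, then repeat (step 4)
        del remaining[weakest]
    return remaining


manual_kept = manual_rfe(X_demo, y_demo, n_keep=4)
sklearn_kept = RFE(
    LogisticRegression(max_iter=1000), n_features_to_select=4, step=1
).fit(X_demo, y_demo).get_support(indices=True).tolist()

print('Manual loop :', manual_kept)
print('sklearn RFE :', sklearn_kept)
```

Reducing 12 features to 4 costs 8 model fits here, which is where the extra expense relative to filter methods comes from.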

In [None]:
# RFE with LogisticRegression
# Scale features first (important for logistic regression)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

lr = LogisticRegression(random_state=RANDOM_STATE, max_iter=5000)

# Select top 10 features
rfe = RFE(estimator=lr, n_features_to_select=10, step=1)
rfe.fit(X_train_scaled, y_train)

rfe_selected = X_train.columns[rfe.support_]
rfe_ranking = pd.Series(rfe.ranking_, index=X_train.columns).sort_values()

print('RFE selected features (top 10):')
for i, feat in enumerate(rfe_selected, 1):
    print(f'  {i}. {feat}')

In [None]:
# Visualize RFE ranking
fig, ax = plt.subplots(figsize=(12, 7))
colors = ['#4CAF50' if r == 1 else '#BDBDBD' for r in rfe_ranking.values]
rfe_ranking.plot(kind='barh', ax=ax, color=colors, edgecolor='black')
ax.set_xlabel('RFE Ranking (1 = selected)')
ax.set_title('Recursive Feature Elimination Rankings')
ax.invert_yaxis()
plt.tight_layout()
plt.show()
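The RFE cell above fixes `n_features_to_select=10` by hand. One possible alternative is `RFECV`, which runs the elimination inside cross-validation and keeps the feature count with the best mean score. The sketch below reloads the data so it runs on its own; the variable names and the shuffled `StratifiedKFold` are assumptions of this sketch, and scaling is done up front only to keep it short.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.preprocessing import StandardScaler

X_bc, y_bc = load_breast_cancer(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_bc, y_bc, test_size=0.2, random_state=42, stratify=y_bc
)

# Simplification: strictly, the scaler should also be refit inside each CV fold
X_tr_scaled = StandardScaler().fit_transform(X_tr)

rfecv = RFECV(
    estimator=LogisticRegression(max_iter=5000),
    step=1,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring='accuracy',
    min_features_to_select=1,
)
rfecv.fit(X_tr_scaled, y_tr)

print(f'Number of features chosen by CV: {rfecv.n_features_}')
print('Selected:', X_tr.columns[rfecv.support_].tolist())
```

Accuracy curves on this dataset tend to be flat over a wide range of feature counts, so the chosen number can shift with the fold assignment. Treat it as a guide, not a precise optimum.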

In [None]:
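# Evaluate: all features vs RFE-selected features
# Caveat: `rfe` and the scaler were fit on the full training set, so every CV fold
# has already influenced the selection and these scores are slightly optimistic.
# All features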
lr_all = LogisticRegression(random_state=RANDOM_STATE, max_iter=5000)
scores_all = cross_val_score(lr_all, X_train_scaled, y_train, cv=5, scoring='accuracy')

# RFE features only
X_train_rfe = X_train_scaled[:, rfe.support_]
X_test_rfe = X_test_scaled[:, rfe.support_]
lr_rfe = LogisticRegression(random_state=RANDOM_STATE, max_iter=5000)
scores_rfe = cross_val_score(lr_rfe, X_train_rfe, y_train, cv=5, scoring='accuracy')

print(f'All {X_train.shape[1]} features  -> CV accuracy: {scores_all.mean():.4f} +/- {scores_all.std():.4f}')
print(f'RFE {len(rfe_selected)} features -> CV accuracy: {scores_rfe.mean():.4f} +/- {scores_rfe.std():.4f}')
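To keep selection out of the validation folds, the scaler, the selector, and the classifier can be wrapped in a `Pipeline` so that all three are refit on the training part of every fold. A minimal sketch, reloading the data so it runs standalone; the step names and variable names are this sketch's own.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_bc, y_bc = load_breast_cancer(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_bc, y_bc, test_size=0.2, random_state=42, stratify=y_bc
)

# Scaling and RFE are now part of the model, so each CV fold fits them
# on its own training portion only
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('select', RFE(LogisticRegression(max_iter=5000), n_features_to_select=10)),
    ('clf', LogisticRegression(max_iter=5000)),
])

pipe_scores = cross_val_score(pipe, X_tr, y_tr, cv=5, scoring='accuracy')
print(f'Pipeline (scale + RFE + LR) -> CV accuracy: '
      f'{pipe_scores.mean():.4f} +/- {pipe_scores.std():.4f}')
```

On a dataset with this much signal the gap to the simpler evaluation is usually small. The pattern matters more when features are numerous and samples are few.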

<a id='4'></a>
## 4. Embedded Methods

Embedded methods perform feature selection **during model training**.

### 4a. Lasso (L1) Coefficients

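L1 regularization drives some coefficients to **exactly zero**, effectively removing those features. Because the target here is a class label, the cell below uses L1-penalized logistic regression rather than `Lasso` itself, which is a regression model; the selection mechanism is the same.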

In [None]:
# Use LogisticRegression with L1 penalty (saga solver supports L1)
lr_l1 = LogisticRegression(
    penalty='l1', solver='saga', C=1.0,
    random_state=RANDOM_STATE, max_iter=5000
)
lr_l1.fit(X_train_scaled, y_train)

lasso_coefs = pd.Series(
    np.abs(lr_l1.coef_[0]), index=X_train.columns
).sort_values(ascending=False)

n_nonzero = (lasso_coefs > 0).sum()
n_zero = (lasso_coefs == 0).sum()
print(f'L1 Logistic Regression: {n_nonzero} non-zero coefficients, {n_zero} zeroed out')

fig, ax = plt.subplots(figsize=(12, 7))
colors = ['#E53935' if c > 0 else '#BDBDBD' for c in lasso_coefs.values]
lasso_coefs.plot(kind='barh', ax=ax, color=colors, edgecolor='black')
ax.set_xlabel('|Coefficient| (L1 Logistic Regression)')
ax.set_title('Lasso (L1) Feature Importance')
ax.invert_yaxis()
plt.tight_layout()
plt.show()

lasso_selected = lasso_coefs[lasso_coefs > 0].index.tolist()
print(f'\nSelected features ({len(lasso_selected)}): {lasso_selected}')
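How many coefficients survive depends on the regularization strength: in scikit-learn, `C` is the inverse of that strength, so smaller `C` means a stronger penalty and a sparser model. The sweep below is a sketch under assumptions: it reloads the data to be self-contained, uses the `liblinear` solver instead of `saga`, and the grid of `C` values is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_tr, _, y_tr, _ = train_test_split(
    X_bc, y_bc, test_size=0.2, random_state=42, stratify=y_bc
)
X_tr_scaled = StandardScaler().fit_transform(X_tr)

nonzero_counts = {}
for C in [0.01, 0.1, 1.0, 10.0]:  # arbitrary illustrative grid
    model = LogisticRegression(
        penalty='l1', solver='liblinear', C=C, random_state=42, max_iter=5000
    )
    model.fit(X_tr_scaled, y_tr)
    # Count coefficients that the L1 penalty left different from zero
    nonzero_counts[C] = int(np.sum(model.coef_[0] != 0))

for C, count in nonzero_counts.items():
    print(f'C = {C:<5} -> {count} non-zero coefficients out of {X_tr.shape[1]}')
```

In practice `C` would be chosen by cross-validation on predictive performance, with the resulting sparsity treated as a by-product.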

### 4b. Tree-Based Feature Importance

Decision trees and ensembles provide `feature_importances_` based on how much each feature reduces impurity (Gini or entropy for classification, MSE for regression).

In [None]:
# Random Forest feature importances
rf = RandomForestClassifier(n_estimators=200, random_state=RANDOM_STATE)
rf.fit(X_train, y_train)  # no scaling needed for trees

rf_importances = pd.Series(
    rf.feature_importances_, index=X_train.columns
).sort_values(ascending=False)

fig, ax = plt.subplots(figsize=(12, 7))
rf_importances.plot(kind='barh', ax=ax,
                     color=sns.color_palette('YlOrRd_r', len(rf_importances)),
                     edgecolor='black')
ax.set_xlabel('Feature Importance (Gini)')
ax.set_title('Random Forest Feature Importance')
ax.invert_yaxis()
plt.tight_layout()
plt.show()

print('Top 10 features by RF importance:')
print(rf_importances.head(10).to_string())
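Impurity-based importances are computed from the training data and can favor features with many distinct values, so a cross-check is often worthwhile. One option, not used elsewhere in this notebook, is `sklearn.inspection.permutation_importance` evaluated on held-out rows. In the sketch below, the extra validation split carved out of the training set and all variable names are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X_bc, y_bc = load_breast_cancer(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_bc, y_bc, test_size=0.2, random_state=42, stratify=y_bc
)
# Carve a validation set out of the training data; the test set stays untouched
X_fit, X_val, y_fit, y_val = train_test_split(
    X_tr, y_tr, test_size=0.25, random_state=42, stratify=y_tr
)

forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_fit, y_fit)

# Mean drop in validation accuracy when one column is shuffled
perm = permutation_importance(
    forest, X_val, y_val, n_repeats=10, random_state=42, scoring='accuracy'
)

comparison = pd.DataFrame(
    {'impurity': forest.feature_importances_, 'permutation': perm.importances_mean},
    index=X_fit.columns,
).sort_values('impurity', ascending=False)
print(comparison.head(10).to_string())
```

When features are strongly correlated, as many are in this dataset, shuffling one of them may barely hurt accuracy because its correlated partners carry the same information. Low permutation importance therefore does not by itself mean a feature is irrelevant.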

### 4c. SelectFromModel

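`SelectFromModel` keeps features whose importance (`feature_importances_` or absolute `coef_`) is at or above a threshold. With `threshold=None`, the default is `'mean'`, except for estimators with an L1 penalty, where it is `1e-5`.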

In [None]:
# SelectFromModel with RandomForest
sfm_rf = SelectFromModel(rf, threshold='mean')
sfm_rf.fit(X_train, y_train)
sfm_rf_features = X_train.columns[sfm_rf.get_support()]

print(f'SelectFromModel (RF, threshold=mean): {len(sfm_rf_features)} features selected')
print(f'Features: {list(sfm_rf_features)}')

# SelectFromModel with L1 LogisticRegression
sfm_l1 = SelectFromModel(lr_l1, threshold=1e-5)
sfm_l1.fit(X_train_scaled, y_train)
sfm_l1_features = X_train.columns[sfm_l1.get_support()]

print(f'\nSelectFromModel (L1 LR, threshold=1e-5): {len(sfm_l1_features)} features selected')
print(f'Features: {list(sfm_l1_features)}')
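The threshold is the main knob of `SelectFromModel`. Besides `'mean'`, it accepts `'median'` and scaled forms such as `'1.25*mean'`, and setting `threshold=-np.inf` together with `max_features` selects a fixed number of top features. The sketch below reloads the data so it runs on its own; the particular thresholds and `max_features=5` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

X_bc, y_bc = load_breast_cancer(return_X_y=True, as_frame=True)
X_tr, _, y_tr, _ = train_test_split(
    X_bc, y_bc, test_size=0.2, random_state=42, stratify=y_bc
)

n_selected = {}
for threshold in ['median', 'mean', '1.25*mean']:
    sfm = SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=42),
        threshold=threshold,
    ).fit(X_tr, y_tr)
    n_selected[threshold] = int(sfm.get_support().sum())
    print(f'threshold={threshold!r:12s} -> {n_selected[threshold]} features')

# Fixed-size selection: disable the threshold and cap the count instead
top5 = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold=-np.inf,
    max_features=5,
).fit(X_tr, y_tr)
top5_features = X_tr.columns[top5.get_support()].tolist()
print('Top 5 by importance:', top5_features)
```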

In [None]:
# Compare Lasso vs RF importances side by side
compare_embedded = pd.DataFrame({
    'Lasso |coef|': lasso_coefs / lasso_coefs.max(),  # normalize to [0,1]
    'RF Importance': rf_importances / rf_importances.max()
}).sort_values('RF Importance', ascending=True)

fig, ax = plt.subplots(figsize=(12, 8))
compare_embedded.plot(kind='barh', ax=ax, width=0.8, edgecolor='black')
ax.set_xlabel('Normalized Importance')
ax.set_title('Lasso (L1) vs Random Forest Feature Importance (Normalized)')
ax.legend(loc='lower right')
plt.tight_layout()
plt.show()

<a id='5'></a>
## 5. Comparison of All Methods

Let us compare the top-10 features selected by each method and evaluate model accuracy.

In [None]:
# Collect top-10 features from each method
top_k = 10

methods = {
    'Correlation': correlations.head(top_k).index.tolist(),
    'Mutual Info': mi_series.head(top_k).index.tolist(),
    'RFE (LR)': rfe_selected.tolist(),
    'Lasso (L1)': lasso_coefs.head(top_k).index.tolist(),
    'RF Importance': rf_importances.head(top_k).index.tolist()
}

# Print selected features
for name, feats in methods.items():
    print(f'{name:15s}: {feats}')
    print()

In [None]:
# Evaluate each feature subset with LogisticRegression (5-fold CV)
results = []

for method_name, selected_feats in methods.items():
    # Scale the selected features
    sc = StandardScaler()
    X_tr_sub = sc.fit_transform(X_train[selected_feats])
    X_te_sub = sc.transform(X_test[selected_feats])
    
    lr_eval = LogisticRegression(random_state=RANDOM_STATE, max_iter=5000)
    cv_scores = cross_val_score(lr_eval, X_tr_sub, y_train, cv=5, scoring='accuracy')
    
    lr_eval.fit(X_tr_sub, y_train)
    test_acc = lr_eval.score(X_te_sub, y_test)
    
    results.append({
        'Method': method_name,
        'Num Features': len(selected_feats),
        'CV Accuracy (mean)': cv_scores.mean(),
        'CV Accuracy (std)': cv_scores.std(),
        'Test Accuracy': test_acc
    })

# Add baseline (all features)
lr_base = LogisticRegression(random_state=RANDOM_STATE, max_iter=5000)
cv_base = cross_val_score(lr_base, X_train_scaled, y_train, cv=5, scoring='accuracy')
lr_base.fit(X_train_scaled, y_train)
results.append({
    'Method': 'All Features',
    'Num Features': X_train.shape[1],
    'CV Accuracy (mean)': cv_base.mean(),
    'CV Accuracy (std)': cv_base.std(),
    'Test Accuracy': lr_base.score(X_test_scaled, y_test)
})

# Rank by CV accuracy: the test set is for a final check, not for choosing a method
results_df = pd.DataFrame(results).sort_values('CV Accuracy (mean)', ascending=False)
print(results_df.to_string(index=False))

In [None]:
# Visualize comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# CV accuracy
sorted_df = results_df.sort_values('CV Accuracy (mean)')
colors = ['#4CAF50' if m == 'All Features' else '#2196F3' for m in sorted_df['Method']]
axes[0].barh(sorted_df['Method'], sorted_df['CV Accuracy (mean)'],
             xerr=sorted_df['CV Accuracy (std)'], color=colors, edgecolor='black')
axes[0].set_xlabel('CV Accuracy')
axes[0].set_title('5-Fold CV Accuracy by Feature Selection Method')

# Test accuracy
sorted_df2 = results_df.sort_values('Test Accuracy')
colors2 = ['#4CAF50' if m == 'All Features' else '#FF9800' for m in sorted_df2['Method']]
axes[1].barh(sorted_df2['Method'], sorted_df2['Test Accuracy'],
             color=colors2, edgecolor='black')
axes[1].set_xlabel('Test Accuracy')
axes[1].set_title('Test Set Accuracy by Feature Selection Method')

plt.tight_layout()
plt.show()

In [None]:
# Feature overlap heatmap: which features appear across methods?
all_features = sorted(set(f for feats in methods.values() for f in feats))
presence = pd.DataFrame(
    {method: [1 if f in feats else 0 for f in all_features]
     for method, feats in methods.items()},
    index=all_features
)
presence['Total'] = presence.sum(axis=1)
presence = presence.sort_values('Total', ascending=False)

fig, ax = plt.subplots(figsize=(10, 10))
sns.heatmap(
    presence.drop(columns='Total'), annot=True, cmap='YlGn',
    linewidths=0.5, cbar_kws={'label': 'Selected (1) / Not (0)'},
    ax=ax
)
ax.set_title('Feature Selection Overlap Across Methods')
plt.tight_layout()
plt.show()

print('Features selected by ALL methods:')
consensus = presence[presence['Total'] == len(methods)].index.tolist()
print(consensus if consensus else 'None')
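The heatmap shows overlap feature by feature. A single number per pair of methods can complement it, for example the Jaccard index (size of the intersection divided by size of the union). The sketch below uses a toy stand-in for the `methods` dictionary so that it runs on its own; `jaccard` is a hypothetical helper and the feature names are made up.

```python
from itertools import combinations


def jaccard(a, b):
    """Hypothetical helper: |A intersect B| / |A union B| for two feature lists."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0


# Toy stand-in for the `methods` dictionary (feature names are illustrative)
toy_methods = {
    'Correlation': ['f1', 'f2', 'f3', 'f4'],
    'RFE': ['f1', 'f2', 'f5', 'f6'],
    'RF Importance': ['f1', 'f2', 'f3', 'f7'],
}

overlap = {
    (m1, m2): jaccard(toy_methods[m1], toy_methods[m2])
    for m1, m2 in combinations(toy_methods, 2)
}

for (m1, m2), score in overlap.items():
    print(f'{m1:12s} vs {m2:14s}: Jaccard = {score:.2f}')
```

Passing the notebook's own `methods` dictionary in place of `toy_methods` gives the pairwise overlap of the real selections.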

<a id='6'></a>
## 6. Common Mistakes

| Mistake | Why It Is Wrong | Fix |
|---------|----------------|-----|
| **Feature selection BEFORE train/test split** | Information from the test set leaks into feature ranking, giving optimistic results | Always split first, then do feature selection on the training set only |
| **Using only one method** | Different methods capture different aspects (linear vs non-linear, univariate vs multivariate) | Compare at least one filter + one embedded method |
| **Ignoring feature interactions** | Filter methods rank features independently | Use wrapper or embedded methods to capture interactions |
| **Selecting too few or too many features** | Too few = underfitting; too many = no benefit | Use cross-validation to choose the optimal number |
| **Using correlation for non-linear relationships** | Pearson correlation misses non-linear patterns | Use mutual information or tree-based importance |

<a id='7'></a>
## 7. Exercise

**Task**: Apply feature selection to the wine dataset and compare methods.

1. Load `sklearn.datasets.load_wine()`
2. Split into train/test (80/20, stratify, `random_state=42`)
3. Compute mutual information scores on the training set
4. Run RFE with `RandomForestClassifier` to select 5 features
5. Fit `LogisticRegression` with (a) all features, (b) top-5 MI features, (c) RFE 5 features
6. Report 5-fold CV accuracy and test accuracy for each
7. Which method produces the best result with only 5 features?

In [None]:
# YOUR CODE HERE
from sklearn.datasets import load_wine

# Step 1: Load data
# wine = load_wine()
# X_wine = pd.DataFrame(wine.data, columns=wine.feature_names)
# y_wine = wine.target

# Step 2: Split
# X_tr, X_te, y_tr, y_te = train_test_split(...)

# Step 3: Mutual information
# mi = mutual_info_classif(X_tr, y_tr, random_state=42)

# Step 4: RFE
# rfe_wine = RFE(RandomForestClassifier(random_state=42), n_features_to_select=5)

# Step 5-6: Evaluate and compare
# ...

---
**End of Notebook 5** | Next: [06 — End-to-End ML Project Template](06_End_to_End_ML_Project_Template.ipynb)