# Feature Selection: Methods and Impact on Model Performance

This notebook explores various feature selection techniques and their effects on machine learning model performance. We'll investigate how selecting the right subset of features can improve model accuracy, reduce overfitting, and enhance interpretability.

## What is Feature Selection?

Feature selection is the process of identifying and selecting a subset of input features that are most relevant to the predictive modeling problem. It helps to:

- **Reduce dimensionality** and simplify models
- **Improve performance** by removing irrelevant or redundant features
- **Reduce overfitting** and enhance generalization
- **Increase interpretability** by focusing on the most important variables
- **Decrease computational cost** during model training and inference

Let's explore several feature selection approaches:

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import sys
import os
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

# Add the parent directory to sys.path
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

# Import our implementations
from models.linear_regression import LinearRegression
from models.logistic_regression import LogisticRegression
from utils.feature_selection import FeatureSelector, plot_feature_importance

# Set random seed for reproducibility
np.random.seed(42)

## 1. Creating a Synthetic Dataset with Known Important Features

To properly evaluate feature selection techniques, we'll start by creating a synthetic dataset where we know exactly which features are important and which are irrelevant noise.

In [None]:
def create_synthetic_data(n_samples=500, n_features=50, n_informative=10, noise=0.5):
    """Create a synthetic regression dataset with known important features."""
    # Generate random feature matrix
    X = np.random.randn(n_samples, n_features)
    
    # Generate weights: only n_informative features have non-zero weights
    true_weights = np.zeros(n_features)
    informative_indices = np.random.choice(n_features, n_informative, replace=False)
    true_weights[informative_indices] = np.random.uniform(1, 5, size=n_informative)
    
    # Generate target variable
    y = np.dot(X, true_weights) + np.random.normal(0, noise, size=n_samples)
    
    return X, y, true_weights, informative_indices

# Create dataset
n_features = 50
n_informative = 10
X, y, true_weights, informative_indices = create_synthetic_data(
    n_samples=500, n_features=n_features, n_informative=n_informative, noise=1.0)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Visualize the true feature importance
plt.figure(figsize=(12, 6))
plt.bar(range(n_features), np.abs(true_weights), color='skyblue')
plt.xlabel('Feature Index')
plt.ylabel('Absolute Weight')
plt.title('True Feature Importance')
plt.grid(axis='y')

# Highlight informative features
for idx in informative_indices:
    plt.bar(idx, np.abs(true_weights[idx]), color='tomato')

plt.show()

print(f"Dataset created with {n_features} total features")
print(f"Only {n_informative} features are truly informative: {informative_indices}")
print(f"The remaining {n_features - n_informative} features are noise")

## 2. Evaluating a Model with All Features (Baseline)

First, let's establish a baseline by training a model using all features, including the irrelevant ones.

In [None]:
# Scale features for better performance
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train model with all features
model_all = LinearRegression(learning_rate=0.01, max_iterations=2000, regularization=0.0)
model_all.fit(X_train_scaled, y_train)

# Evaluate performance
y_pred_train = model_all.predict(X_train_scaled)
y_pred_test = model_all.predict(X_test_scaled)

mse_train_all = mean_squared_error(y_train, y_pred_train)
mse_test_all = mean_squared_error(y_test, y_pred_test)
r2_test_all = r2_score(y_test, y_pred_test)

print(f"Baseline Model (All {n_features} Features)")
print(f"Training MSE: {mse_train_all:.4f}")
print(f"Test MSE: {mse_test_all:.4f}")
print(f"Test R²: {r2_test_all:.4f}")

# Compare learned weights with true weights
plt.figure(figsize=(12, 6))
plt.bar(range(n_features), np.abs(true_weights), alpha=0.5, label='True Weights')
plt.bar(range(n_features), np.abs(model_all.weights), alpha=0.5, label='Learned Weights')
plt.xlabel('Feature Index')
plt.ylabel('Absolute Weight')
plt.title('True vs. Learned Feature Weights')
plt.legend()
plt.grid(True, axis='y')
plt.show()

# Calculate how many of the top features match the true informative features
top_n_indices = np.argsort(np.abs(model_all.weights))[-n_informative:]
matches = np.intersect1d(top_n_indices, informative_indices)
print(f"\nOf the top {n_informative} learned features, {len(matches)} match the true informative features.")
print(f"Match rate: {len(matches)/n_informative:.1%}")

## 3. Filter Methods: Correlation-Based Feature Selection

Filter methods score each feature with a statistical measure, independently of any model. The simplest approach is to keep the k features with the highest absolute correlation with the target variable.
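The `select_k_best` method used in the next cell lives in `utils/feature_selection.py` and is not shown in this notebook, so the following is only a minimal sketch of what a correlation filter typically does, assuming absolute Pearson correlation as the score. The function name `top_k_by_correlation` and the demo data are hypothetical.

```python
import numpy as np

def top_k_by_correlation(X, y, k):
    """Return the indices of the k features with the largest |Pearson r|, plus all scores."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Pearson r per column: covariance divided by the product of standard deviations
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    denom = np.where(denom == 0, 1.0, denom)  # constant columns end up with a score of 0
    scores = np.abs(Xc.T @ yc / denom)
    selected = np.argsort(scores)[::-1][:k]
    return selected, scores

# Hypothetical demo data: the target depends only on columns 0 and 3
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 3] + rng.normal(scale=0.1, size=200)

selected_demo, scores_demo = top_k_by_correlation(X_demo, y_demo, k=2)
print("Selected columns:", sorted(selected_demo.tolist()))
print("Scores:", np.round(scores_demo, 3))
```

Because each feature is scored on its own, a filter like this is cheap, but it cannot tell that two selected features are redundant with each other.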

In [None]:
# Create feature selector
selector = FeatureSelector()

# Apply correlation-based selection to choose top k features
k = n_informative  # Select the same number as true informative features
selected_features_corr, corr_scores = selector.select_k_best(X_train_scaled, y_train, k=k)

# Plot correlation scores
plt.figure(figsize=(12, 6))
plt.bar(range(n_features), corr_scores, color='lightblue')
plt.xlabel('Feature Index')
plt.ylabel('Absolute Correlation')
plt.title('Feature Correlation with Target')
plt.grid(axis='y')

# Highlight selected features
for idx in selected_features_corr:
    plt.bar(idx, corr_scores[idx], color='orange')
    
# Highlight true informative features with a different marker
for idx in informative_indices:
    plt.axvline(x=idx, color='red', linestyle='--', alpha=0.3)

plt.show()

# Calculate overlap with true informative features
overlap_corr = np.intersect1d(selected_features_corr, informative_indices)
print("Correlation-based feature selection:")
print(f"Selected {k} features: {selected_features_corr}")
print(f"Correctly identified {len(overlap_corr)} out of {n_informative} informative features")
print(f"Recall of informative features: {len(overlap_corr)/n_informative:.1%}")

In [None]:
# Train a model using only correlation-selected features
X_train_corr = X_train_scaled[:, selected_features_corr]
X_test_corr = X_test_scaled[:, selected_features_corr]

model_corr = LinearRegression(learning_rate=0.01, max_iterations=2000, regularization=0.0)
model_corr.fit(X_train_corr, y_train)

# Evaluate performance
y_pred_train_corr = model_corr.predict(X_train_corr)
y_pred_test_corr = model_corr.predict(X_test_corr)

mse_train_corr = mean_squared_error(y_train, y_pred_train_corr)
mse_test_corr = mean_squared_error(y_test, y_pred_test_corr)
r2_test_corr = r2_score(y_test, y_pred_test_corr)

print(f"Correlation-Based Feature Selection Model ({k} Features)")
print(f"Training MSE: {mse_train_corr:.4f}")
print(f"Test MSE: {mse_test_corr:.4f}")
print(f"Test R²: {r2_test_corr:.4f}")
print(f"\nImprovement over baseline (Test MSE): {(mse_test_all - mse_test_corr)/mse_test_all:.1%}")

## 4. Wrapper Methods: Recursive Feature Elimination (RFE)

Wrapper methods evaluate features by training the model itself. Recursive Feature Elimination (RFE) fits the model, removes the least important features, and repeats on the reduced set until the desired number of features remains.
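The loop behind RFE can be sketched in a few lines. This is not the notebook's `recursive_feature_elimination` (its source is not shown here); it is an illustrative version that assumes ordinary least squares as the model, absolute coefficient size as the importance measure, and features on a comparable scale. The name `rfe_least_squares` and the demo data are hypothetical.

```python
import numpy as np

def rfe_least_squares(X, y, n_features_to_select):
    """Drop the feature with the smallest |coefficient| until n_features_to_select remain."""
    n_features = X.shape[1]
    remaining = list(range(n_features))
    ranks = np.ones(n_features, dtype=int)  # rank 1 = still selected at the end
    yc = y - y.mean()
    while len(remaining) > n_features_to_select:
        Xr = X[:, remaining]
        # Refit on the surviving features only (centering stands in for an intercept)
        coef, *_ = np.linalg.lstsq(Xr - Xr.mean(axis=0), yc, rcond=None)
        weakest = int(np.argmin(np.abs(coef)))
        # Features eliminated earlier receive a larger (worse) rank
        ranks[remaining[weakest]] = len(remaining) - n_features_to_select + 1
        del remaining[weakest]
    return np.array(remaining), ranks

# Hypothetical demo data: only columns 1, 5 and 6 carry signal
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 8))
y_demo = (4.0 * X_demo[:, 1] + 2.5 * X_demo[:, 5] - 3.0 * X_demo[:, 6]
          + rng.normal(scale=0.5, size=300))

kept_demo, ranks_demo = rfe_least_squares(X_demo, y_demo, n_features_to_select=3)
print("Kept columns:", kept_demo)
print("Ranks (1 = kept):", ranks_demo)
```

Each elimination step requires a full refit, which is why wrapper methods cost more than filters as the number of features grows.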

In [None]:
# Apply recursive feature elimination
selected_features_rfe, feature_ranks = selector.recursive_feature_elimination(
    X_train_scaled, y_train, n_features_to_select=k)

# Plot feature ranks (lower rank = more important)
importance_scores = n_features + 1 - feature_ranks  # Convert ranks to scores for visualization

plt.figure(figsize=(12, 6))
plt.bar(range(n_features), importance_scores, color='lightgreen')
plt.xlabel('Feature Index')
plt.ylabel('Importance Score')
plt.title('RFE Feature Importance')
plt.grid(axis='y')

# Highlight selected features
for idx in selected_features_rfe:
    plt.bar(idx, importance_scores[idx], color='green')
    
# Highlight true informative features with a different marker
for idx in informative_indices:
    plt.axvline(x=idx, color='red', linestyle='--', alpha=0.3)

plt.show()

# Calculate overlap with true informative features
overlap_rfe = np.intersect1d(selected_features_rfe, informative_indices)
print("Recursive Feature Elimination:")
print(f"Selected {k} features: {selected_features_rfe}")
print(f"Correctly identified {len(overlap_rfe)} out of {n_informative} informative features")
print(f"Recall of informative features: {len(overlap_rfe)/n_informative:.1%}")

In [None]:
# Train a model using only RFE-selected features
X_train_rfe = X_train_scaled[:, selected_features_rfe]
X_test_rfe = X_test_scaled[:, selected_features_rfe]

model_rfe = LinearRegression(learning_rate=0.01, max_iterations=2000, regularization=0.0)
model_rfe.fit(X_train_rfe, y_train)

# Evaluate performance
y_pred_train_rfe = model_rfe.predict(X_train_rfe)
y_pred_test_rfe = model_rfe.predict(X_test_rfe)

mse_train_rfe = mean_squared_error(y_train, y_pred_train_rfe)
mse_test_rfe = mean_squared_error(y_test, y_pred_test_rfe)
r2_test_rfe = r2_score(y_test, y_pred_test_rfe)

print(f"RFE Feature Selection Model ({k} Features)")
print(f"Training MSE: {mse_train_rfe:.4f}")
print(f"Test MSE: {mse_test_rfe:.4f}")
print(f"Test R²: {r2_test_rfe:.4f}")
print(f"\nImprovement over baseline (Test MSE): {(mse_test_all - mse_test_rfe)/mse_test_all:.1%}")

## 5. Embedded Methods: L1 Regularization (Lasso) for Feature Selection

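Embedded methods perform feature selection during model training. L1 regularization (Lasso) penalizes the absolute size of the weights, which pushes the weights of irrelevant features toward zero, often exactly to zero. Features whose absolute weight stays above a small threshold are kept, so the number of selected features depends on the regularization strength `alpha`.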
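To see why the L1 penalty produces exact zeros, it helps to look at the soft-thresholding step used by proximal gradient descent (ISTA). The sketch below is one common way to fit a Lasso model and is not necessarily how the notebook's `l1_based_selection` works; the names `soft_threshold` and `lasso_ista`, the objective scaling, and the demo data are all assumptions made for illustration.

```python
import numpy as np

def soft_threshold(w, t):
    """Shrink every weight toward zero by t and set weights smaller than t to exactly zero."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def lasso_ista(X, y, alpha, n_iter=500):
    """Minimise (1 / (2n)) * ||y - Xw||^2 + alpha * ||w||_1 by proximal gradient descent."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    # Step size 1/L, where L = sigma_max(X)^2 / n bounds the curvature of the squared loss
    step = n_samples / (np.linalg.norm(X, 2) ** 2)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n_samples
        w = soft_threshold(w - step * grad, step * alpha)
    return w

# Hypothetical demo data: only columns 2 and 7 carry signal
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(200, 10))
y_demo = 3.0 * X_demo[:, 2] - 2.0 * X_demo[:, 7] + rng.normal(scale=0.1, size=200)

w_demo = lasso_ista(X_demo, y_demo, alpha=0.1)
print("Weights:", np.round(w_demo, 3))
print("Columns with non-negligible weight:", np.flatnonzero(np.abs(w_demo) > 1e-6))
```

Raising `alpha` widens the band of weights that get clipped to zero, so fewer features survive; lowering it keeps more.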

In [None]:
# Apply L1-based feature selection
selected_features_l1, l1_importance = selector.l1_based_selection(
    X_train_scaled, y_train, alpha=0.1, threshold=0.01)

# Sort selected features for better display
selected_features_l1 = np.sort(selected_features_l1)

# Plot L1 feature importance
plt.figure(figsize=(12, 6))
plt.bar(range(n_features), l1_importance, color='lightpink')
plt.xlabel('Feature Index')
plt.ylabel('Absolute Weight')
plt.title('L1 Regularization Feature Importance')
plt.grid(axis='y')

# Highlight selected features
for idx in selected_features_l1:
    plt.bar(idx, l1_importance[idx], color='red')
    
# Highlight true informative features with a different marker
for idx in informative_indices:
    plt.axvline(x=idx, color='blue', linestyle='--', alpha=0.3)

plt.show()

# Calculate overlap with true informative features
overlap_l1 = np.intersect1d(selected_features_l1, informative_indices)
print("L1 Regularization Feature Selection:")
print(f"Selected {len(selected_features_l1)} features: {selected_features_l1}")
print(f"Correctly identified {len(overlap_l1)} out of {n_informative} informative features")
print(f"Recall of informative features: {len(overlap_l1)/n_informative:.1%}")

In [None]:
# Train a model using only L1-selected features
X_train_l1 = X_train_scaled[:, selected_features_l1]
X_test_l1 = X_test_scaled[:, selected_features_l1]

model_l1 = LinearRegression(learning_rate=0.01, max_iterations=2000, regularization=0.0)
model_l1.fit(X_train_l1, y_train)

# Evaluate performance
y_pred_train_l1 = model_l1.predict(X_train_l1)
y_pred_test_l1 = model_l1.predict(X_test_l1)

mse_train_l1 = mean_squared_error(y_train, y_pred_train_l1)
mse_test_l1 = mean_squared_error(y_test, y_pred_test_l1)
r2_test_l1 = r2_score(y_test, y_pred_test_l1)

print(f"L1-Based Feature Selection Model ({len(selected_features_l1)} Features)")
print(f"Training MSE: {mse_train_l1:.4f}")
print(f"Test MSE: {mse_test_l1:.4f}")
print(f"Test R²: {r2_test_l1:.4f}")
print(f"\nImprovement over baseline (Test MSE): {(mse_test_all - mse_test_l1)/mse_test_all:.1%}")

## 6. Comparing All Feature Selection Methods

Now let's compare the performance of all the feature selection methods we've explored.

In [None]:
# Create a DataFrame to compare all methods
comparison = pd.DataFrame({
    'Method': ['All Features', 'Correlation-Based', 'RFE', 'L1 Regularization'],
    'Feature Count': [n_features, len(selected_features_corr), len(selected_features_rfe), len(selected_features_l1)],
    'Training MSE': [mse_train_all, mse_train_corr, mse_train_rfe, mse_train_l1],
    'Test MSE': [mse_test_all, mse_test_corr, mse_test_rfe, mse_test_l1],
    'Test R²': [r2_test_all, r2_test_corr, r2_test_rfe, r2_test_l1],
    'Identified Correctly': [len(matches), len(overlap_corr), len(overlap_rfe), len(overlap_l1)],
    'Identification Rate': [len(matches)/n_informative, len(overlap_corr)/n_informative, 
                          len(overlap_rfe)/n_informative, len(overlap_l1)/n_informative],
})

# Calculate improvement over baseline
comparison['MSE Improvement'] = (mse_test_all - comparison['Test MSE']) / mse_test_all

# Sort by test MSE
comparison = comparison.sort_values('Test MSE')

# Display the comparison
pd.set_option('display.precision', 4)
display(comparison)

# Visualize the comparison
plt.figure(figsize=(14, 10))

# Plot 1: Feature Count vs. Methods
plt.subplot(2, 2, 1)
plt.bar(comparison['Method'], comparison['Feature Count'], color='skyblue')
plt.ylabel('Number of Features')
plt.title('Feature Count by Method')
plt.xticks(rotation=45)
plt.grid(axis='y')

# Plot 2: Test MSE vs. Methods
plt.subplot(2, 2, 2)
plt.bar(comparison['Method'], comparison['Test MSE'], color='lightgreen')
plt.ylabel('Test MSE')
plt.title('Test Error by Method')
plt.xticks(rotation=45)
plt.grid(axis='y')

# Plot 3: Identification Rate vs. Methods
plt.subplot(2, 2, 3)
plt.bar(comparison['Method'], comparison['Identification Rate'], color='coral')
plt.ylabel('Identification Rate')
plt.title('Rate of Correctly Identifying True Features')
plt.xticks(rotation=45)
plt.grid(axis='y')

# Plot 4: MSE Improvement vs. Feature Count
plt.subplot(2, 2, 4)
plt.scatter(comparison['Feature Count'], comparison['MSE Improvement'], 
          s=100, c=range(len(comparison)), cmap='viridis')
for i, method in enumerate(comparison['Method']):
    plt.annotate(method, 
               (comparison['Feature Count'].iloc[i], comparison['MSE Improvement'].iloc[i]),
               xytext=(5, 5), textcoords='offset points')
plt.xlabel('Number of Features')
plt.ylabel('MSE Improvement')
plt.title('Performance Improvement vs. Feature Count')
plt.grid(True)

plt.tight_layout()
plt.show()

## 7. Real-World Example: Diabetes Dataset

Now let's apply our feature selection methods to a real-world dataset.

In [None]:
# Load the diabetes dataset
diabetes = load_diabetes()
X_diabetes = diabetes.data
y_diabetes = diabetes.target
feature_names = diabetes.feature_names

# Split into training and test sets
X_train_diab, X_test_diab, y_train_diab, y_test_diab = train_test_split(
    X_diabetes, y_diabetes, test_size=0.3, random_state=42)

# Scale features
scaler_diab = StandardScaler()
X_train_diab_scaled = scaler_diab.fit_transform(X_train_diab)
X_test_diab_scaled = scaler_diab.transform(X_test_diab)

print("Diabetes Dataset Information:")
print(f"Number of samples: {X_diabetes.shape[0]}")
print(f"Number of features: {X_diabetes.shape[1]}")
print(f"Feature names: {feature_names}")

In [None]:
# Function to evaluate model performance
def evaluate_diabetes_model(X_train, X_test, y_train, y_test, method_name):
    # Train model
    model = LinearRegression(learning_rate=0.01, max_iterations=2000, regularization=0.01)
    model.fit(X_train, y_train)
    
    # Evaluate
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)
    
    mse_train = mean_squared_error(y_train, y_pred_train)
    mse_test = mean_squared_error(y_test, y_pred_test)
    r2_test = r2_score(y_test, y_pred_test)
    
    return {
        'Method': method_name,
        'Feature Count': X_train.shape[1],
        'Training MSE': mse_train,
        'Test MSE': mse_test,
        'Test R²': r2_test,
        'Model': model
    }

# Apply feature selection methods
# 1. Correlation-based
selected_diab_corr, _ = selector.select_k_best(X_train_diab_scaled, y_train_diab, k=5)

# 2. RFE
selected_diab_rfe, _ = selector.recursive_feature_elimination(
    X_train_diab_scaled, y_train_diab, n_features_to_select=5)

# 3. L1 regularization
selected_diab_l1, l1_importance_diab = selector.l1_based_selection(
    X_train_diab_scaled, y_train_diab, alpha=0.1, threshold=0.01)

# Evaluate all methods
results_diab = []

# All features
results_diab.append(evaluate_diabetes_model(
    X_train_diab_scaled, X_test_diab_scaled, y_train_diab, y_test_diab, 'All Features'))

# Correlation-based
results_diab.append(evaluate_diabetes_model(
    X_train_diab_scaled[:, selected_diab_corr], X_test_diab_scaled[:, selected_diab_corr], 
    y_train_diab, y_test_diab, 'Correlation-Based'))

# RFE
results_diab.append(evaluate_diabetes_model(
    X_train_diab_scaled[:, selected_diab_rfe], X_test_diab_scaled[:, selected_diab_rfe], 
    y_train_diab, y_test_diab, 'RFE'))

# L1 regularization
results_diab.append(evaluate_diabetes_model(
    X_train_diab_scaled[:, selected_diab_l1], X_test_diab_scaled[:, selected_diab_l1], 
    y_train_diab, y_test_diab, 'L1 Regularization'))

# Create DataFrame
df_results_diab = pd.DataFrame([
    {k: v for k, v in result.items() if k != 'Model'} for result in results_diab
])

# Sort by test MSE
df_results_diab = df_results_diab.sort_values('Test MSE')

# Display results
display(df_results_diab)

Let's visualize feature importance for the diabetes dataset and see which features were selected by each method:

In [None]:
# Visualize feature importance
plt.figure(figsize=(12, 10))

# Plot weights from full model
full_model = results_diab[0]['Model']
plt.subplot(2, 2, 1)
plt.bar(range(len(feature_names)), np.abs(full_model.weights), color='lightblue')
plt.xticks(range(len(feature_names)), feature_names, rotation=90)
plt.ylabel('Absolute Weight')
plt.title('Feature Importance: All Features')
plt.grid(axis='y')

# Plot selected features by each method
plt.subplot(2, 2, 2)
plt.bar(range(len(feature_names)), np.zeros(len(feature_names)), color='white', alpha=0.0)
for idx in selected_diab_corr:
    plt.bar(idx, 1, color='orange')
plt.xticks(range(len(feature_names)), feature_names, rotation=90)
plt.ylabel('Selected (1=Yes)')
plt.title('Features Selected: Correlation-Based')
plt.ylim(0, 1.2)

plt.subplot(2, 2, 3)
plt.bar(range(len(feature_names)), np.zeros(len(feature_names)), color='white', alpha=0.0)
for idx in selected_diab_rfe:
    plt.bar(idx, 1, color='green')
plt.xticks(range(len(feature_names)), feature_names, rotation=90)
plt.ylabel('Selected (1=Yes)')
plt.title('Features Selected: RFE')
plt.ylim(0, 1.2)

plt.subplot(2, 2, 4)
plt.bar(range(len(feature_names)), np.zeros(len(feature_names)), color='white', alpha=0.0)
for idx in selected_diab_l1:
    plt.bar(idx, 1, color='red')
plt.xticks(range(len(feature_names)), feature_names, rotation=90)
plt.ylabel('Selected (1=Yes)')
plt.title('Features Selected: L1 Regularization')
plt.ylim(0, 1.2)

plt.tight_layout()
plt.show()

# Find common features selected by all methods
common_features = set(selected_diab_corr)
common_features = common_features.intersection(selected_diab_rfe)
common_features = common_features.intersection(selected_diab_l1)
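# Sort the indices so the printed order is deterministic (set iteration order is not guaranteed)
common_feature_names = [feature_names[i] for i in sorted(common_features)]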

print(f"Features selected by all methods: {common_feature_names}")

# Compare model performance
plt.figure(figsize=(10, 6))
plt.bar(df_results_diab['Method'], df_results_diab['Test MSE'], color='lightgreen')
plt.xlabel('Method')
plt.ylabel('Test MSE')
plt.title('Diabetes Dataset: Model Performance by Feature Selection Method')
plt.xticks(rotation=45)
plt.grid(axis='y')
plt.tight_layout()
plt.show()
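The intersection computed above keeps only features that every method agrees on. A softer variant is to count votes per feature and keep those chosen by a majority. The sketch below uses hypothetical index arrays in place of `selected_diab_corr`, `selected_diab_rfe`, and `selected_diab_l1`; the helper name `selection_votes` is also an assumption.

```python
import numpy as np

def selection_votes(selections, n_features):
    """Count how many selection methods picked each feature index."""
    votes = np.zeros(n_features, dtype=int)
    for selected in selections:
        votes[np.unique(selected)] += 1  # np.unique guards against duplicate indices
    return votes

# Hypothetical selections from three methods over 10 features (not real results)
demo_selections = [
    np.array([2, 8, 3, 6, 7]),
    np.array([2, 8, 3, 1, 4]),
    np.array([2, 8, 1, 3, 9, 6]),
]

votes_demo = selection_votes(demo_selections, n_features=10)
unanimous_demo = np.flatnonzero(votes_demo == len(demo_selections))  # -> [2 3 8]
majority_demo = np.flatnonzero(votes_demo >= 2)                      # -> [1 2 3 6 8]
print("Votes per feature:", votes_demo)
print("Unanimous:", unanimous_demo)
print("Majority:", majority_demo)
```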

## 8. Key Insights and Best Practices

### What We've Learned

1. **Benefits of Feature Selection**:
   - Reduces model complexity, often with little or no loss in performance
   - Can often improve generalization by removing noise features
   - Allows us to identify which variables are most important for predictions
   - Different methods have different strengths in identifying important features

2. **Comparison of Methods**:
   - **Filter Methods** (Correlation-based):
     - Simple and fast to compute
     - Independent of the model
     - May miss features with non-linear or interactive relationships
   
   - **Wrapper Methods** (RFE):
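     - Often select better subsets because they score features with the actual model
     - Computationally more intensive, since the model is refit many times
     - Account for how features behave jointly in the model, which univariate filters cannot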
   
   - **Embedded Methods** (L1 Regularization):
     - Integrates feature selection with model training
     - Determines the number of features implicitly through the regularization strength, which still needs tuning
     - Balances model complexity with performance

3. **Real-World Applications**:
   - Different datasets may benefit from different feature selection approaches
   - Features selected by multiple methods are often truly important
   - The optimal number of features depends on the specific problem

### Best Practices for Feature Selection

1. **Always Evaluate on Validation/Test Data**
   - Feature selection should be based on training data only
   - Evaluate results on separate validation data
   - Avoid data leakage between selection and evaluation

2. **Consider Multiple Methods**
   - Try different approaches and compare results
   - Features selected by multiple methods are often important
   - Ensemble different feature selection techniques

3. **Domain Knowledge is Valuable**
   - Incorporate subject matter expertise when possible
   - Some features may be important even when statistical measures rank them low
   - Context matters in interpreting feature importance

4. **Balance Performance and Interpretability**
   - Fewer features often lead to more interpretable models
   - Consider the trade-off between simplicity and accuracy
   - Choose the minimal set of features that provides acceptable performance
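The first practice above, keeping selection inside the training data, is easy to get wrong with cross-validation: if features are chosen once on the full dataset, every validation fold has already influenced the choice. The following is a minimal sketch of a leakage-free loop, assuming a top-k correlation filter and an ordinary least squares model; the function name `cv_mse_with_selection` and the demo data are hypothetical.

```python
import numpy as np

def cv_mse_with_selection(X, y, k, n_folds=5, seed=0):
    """K-fold CV in which scaling and top-k selection are refit on each training fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    fold_mse = []
    for i, val_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        X_tr, y_tr = X[train_idx], y[train_idx]
        X_va, y_va = X[val_idx], y[val_idx]
        # Scaling statistics come from the training fold only
        mu, sigma = X_tr.mean(axis=0), X_tr.std(axis=0)
        sigma[sigma == 0] = 1.0
        X_tr, X_va = (X_tr - mu) / sigma, (X_va - mu) / sigma
        # Feature scores (proportional to |Pearson r|) also use the training fold only
        y_mean = y_tr.mean()
        scores = np.abs(X_tr.T @ (y_tr - y_mean))
        keep = np.argsort(scores)[-k:]
        # Fit on the selected columns, then score on the untouched validation fold
        coef, *_ = np.linalg.lstsq(X_tr[:, keep], y_tr - y_mean, rcond=None)
        pred = X_va[:, keep] @ coef + y_mean
        fold_mse.append(np.mean((y_va - pred) ** 2))
    return float(np.mean(fold_mse))

# Hypothetical demo data: 20 features, only columns 0 and 4 carry signal, noise variance 1
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(150, 20))
y_demo = 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 4] + rng.normal(scale=1.0, size=150)

cv_mse_demo = cv_mse_with_selection(X_demo, y_demo, k=2)
print(f"Cross-validated MSE with in-fold selection: {cv_mse_demo:.3f}")
```

The same idea applies to a single train/test split: fit the scaler and the selector on the training portion, then only apply them to the test portion, as the cells in this notebook do.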

Feature selection is a powerful technique in the machine learning toolkit. By carefully choosing the most relevant features, we can build models that are simpler, often more accurate, and easier to understand and explain.