# Lesson 08: Feature Selection

**What you'll learn:**
- Why select features?
- Correlation-based selection
- SelectKBest (automatic selection)
- Testing if selection helps

**This is ONE of the optimization techniques for your assignment!**

---

## Section 1: Why Select Features?

### READ

Not all features are useful! Some might be:
- **Irrelevant**: Don't help prediction
- **Redundant**: Duplicate information from other features
- **Noisy**: Add randomness that hurts accuracy

**Benefits of feature selection:**
- Faster training (fewer columns to process)
- Simpler models (easier to explain)
- Sometimes better accuracy, because the model has less noise to overfit
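
To make the three kinds of unhelpful features concrete, here is a minimal sketch on synthetic data (not the lesson's dataset). The column names `signal`, `copy_of_signal`, and `noise` are made up for illustration, and the label is assumed to depend only on `signal`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

signal = rng.normal(size=n)                                # drives the label
copy_of_signal = signal + rng.normal(scale=0.05, size=n)   # redundant: near-duplicate of signal
noise = rng.normal(size=n)                                 # irrelevant: unrelated to the label

demo = pd.DataFrame({
    "signal": signal,
    "copy_of_signal": copy_of_signal,
    "noise": noise,
})
label = pd.Series((signal > 0).astype(int), index=demo.index)

# Redundant features are strongly correlated with another feature
redundancy = demo["signal"].corr(demo["copy_of_signal"])

# Irrelevant features have almost no correlation with the label
relevance = demo.corrwith(label).abs()

print(f"signal vs copy_of_signal: {redundancy:.3f}")
print(relevance.sort_values(ascending=False))
```

In this sketch `copy_of_signal` looks just as relevant as `signal`, which is why a relevance score alone cannot spot redundancy.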

### TRY IT - Setup

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.metrics import f1_score

# Load data
df = pd.read_csv('../datasets/tomatjus.csv')
X = df.drop('quality', axis=1)
y = df['quality']

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Total features: {X.shape[1]}")
print(f"Feature names: {X.columns.tolist()}")

---

## Section 2: Correlation-Based Selection

### READ

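Select features based on their **correlation** with the target:
- High |correlation| → Feature has a strong linear relationship with the target and is likely useful
- Low |correlation| → Feature might not help (or its relationship is non-linear, which correlation misses)

Threshold: Keep features with |correlation| > 0.1. This is a rule of thumb, so adjust as needed.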

### TRY IT

In [None]:
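# Convert target to numeric for correlation
# (LabelEncoder numbers the classes in sorted order, so the result is most
# meaningful when the classes have a natural order or there are only two)
le = LabelEncoder()
y_numeric = le.fit_transform(y_train)

# Calculate correlation with target on the training data only (avoid data leakage)
df_temp = X_train.copy()
df_temp['target'] = y_numeric
correlations = df_temp.corr()['target'].drop('target')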

print("Correlation with target:")
print(correlations.abs().sort_values(ascending=False))

In [None]:
# Visualize correlations
plt.figure(figsize=(10, 5))
correlations.abs().sort_values().plot(kind='barh', color='steelblue')
plt.xlabel('Absolute Correlation with Target')
plt.title('Feature Correlations')
plt.axvline(x=0.1, color='red', linestyle='--', label='Threshold (0.1)')
plt.legend()
plt.tight_layout()
plt.show()

In [None]:
# Select features with |correlation| > 0.1
threshold = 0.1
selected_by_corr = correlations[correlations.abs() > threshold].index.tolist()
print(f"\nFeatures with |correlation| > {threshold}:")
print(selected_by_corr)
print(f"\nReduced from {X.shape[1]} to {len(selected_by_corr)} features!")
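
The steps above can be wrapped in a small helper so that the feature list is chosen from the training split and then reused unchanged on the test split. This is only a sketch: `select_by_correlation` is a hypothetical helper (not part of pandas or scikit-learn), and the data comes from `make_classification` as a stand-in for the lesson's CSV.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split


def select_by_correlation(X_train, y_train, threshold=0.1):
    """Return columns whose |Pearson correlation| with a numeric target exceeds threshold."""
    y_series = pd.Series(np.asarray(y_train), index=X_train.index)
    corr = X_train.corrwith(y_series).abs()
    return corr[corr > threshold].index.tolist()


# Synthetic stand-in data: with shuffle=False the first 3 columns are
# informative and the remaining 5 are random noise
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=3, n_redundant=0,
    n_repeated=0, shuffle=False, random_state=0,
)
X_demo = pd.DataFrame(X_demo, columns=[f"f{i}" for i in range(8)])

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Choose columns using the training split only
kept = select_by_correlation(Xd_train, yd_train, threshold=0.1)

# Apply the same column list to both splits
Xd_train_sel = Xd_train[kept]
Xd_test_sel = Xd_test[kept]

print(f"Kept {len(kept)} of {X_demo.shape[1]} features: {kept}")
```

Passing a list of column names around, rather than recomputing correlations on the test set, keeps the test data out of the selection step.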

---

## Section 3: SelectKBest

### READ

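**SelectKBest** scores each feature on its own with a statistical test, then keeps the K highest-scoring features. Because every feature is scored separately, it does not detect redundancy between features.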

Common score functions:
- `f_classif`: ANOVA F-test (for classification)
- `mutual_info_classif`: Mutual information (detects non-linear relationships)
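
The difference between the two score functions shows up most clearly on a non-linear relationship. The sketch below uses made-up data where the label depends on `|x|` rather than on `x` itself. The variable names are illustrative, and the exact scores will vary with the random seed.

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

rng = np.random.default_rng(0)
n = 1000

x_nonlinear = rng.uniform(-1, 1, size=n)   # label depends on |x|, not on x directly
x_noise = rng.uniform(-1, 1, size=n)       # unrelated to the label
y_demo = (np.abs(x_nonlinear) > 0.5).astype(int)

X_demo = np.column_stack([x_nonlinear, x_noise])

# The F-test compares class means; both classes have a mean near 0 for x_nonlinear
f_scores, f_pvalues = f_classif(X_demo, y_demo)

# Mutual information measures any kind of dependence, including non-linear
mi_scores = mutual_info_classif(X_demo, y_demo, random_state=0)

print("Columns: [x_nonlinear, x_noise]")
print("F-test scores:      ", f_scores.round(2))
print("Mutual information: ", mi_scores.round(2))
```

Here the F-test should struggle to tell the two columns apart, while mutual information should rank `x_nonlinear` clearly higher. Mutual information is slower to compute, so `f_classif` remains a reasonable default.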

### TRY IT

In [None]:
# Select top 5 features
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X_train, y_train)

# Which features were selected?
selected_mask = selector.get_support()
selected_features = X.columns[selected_mask].tolist()

print("Top 5 features (SelectKBest):")
print(selected_features)

In [None]:
# See the scores for all features
scores = pd.DataFrame({
    'Feature': X.columns,
    'Score': selector.scores_
}).sort_values('Score', ascending=False)

print("\nFeature Scores (higher = more important):")
print(scores)

---

## Section 4: Testing if Selection Helps

### TRY IT

In [None]:
# Baseline: All features
rf_all = RandomForestClassifier(random_state=42)
rf_all.fit(X_train, y_train)
score_all = f1_score(y_test, rf_all.predict(X_test), average='weighted')

print(f"All features ({X.shape[1]}): F1 = {score_all:.3f}")

In [None]:
# With selected features (each k must be <= the number of features)
for k in [3, 5, 7]:
    # Fit the selector on the training data, then reuse it on the test data
    selector = SelectKBest(f_classif, k=k)
    X_train_sel = selector.fit_transform(X_train, y_train)
    X_test_sel = selector.transform(X_test)

    # Same model and random_state as the baseline, so only the features differ
    rf_sel = RandomForestClassifier(random_state=42)
    rf_sel.fit(X_train_sel, y_train)
    score_sel = f1_score(y_test, rf_sel.predict(X_test_sel), average='weighted')

    print(f"Top {k} features: F1 = {score_sel:.3f}")

### EXPLAIN

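**Key points:**
- Fit the selector on training data only, then call `transform` on the test data (avoids data leakage)
- Fewer features can give the same or a better F1 score, so always compare against the all-features baseline
- Simpler models are easier to explain and faster to train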
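
One way to make the "fit on training data only" rule hard to break is to put the selector and the model in a scikit-learn `Pipeline` and score it with cross-validation, so the selector is refit inside every fold. This is a sketch on synthetic data. The step names `select` and `model`, the `k=5` setting, and `n_estimators=50` are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the lesson's dataset
X_demo, y_demo = make_classification(
    n_samples=400, n_features=12, n_informative=4, n_redundant=2, random_state=0
)

# Selection and model travel together, so selection only sees each fold's training part
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=5)),
    ("model", RandomForestClassifier(n_estimators=50, random_state=42)),
])

cv_scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="f1_weighted")
print(f"Top 5 features, 5-fold CV: F1 = {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")
```

Averaging over several folds also gives a steadier comparison than a single train/test split, where small F1 differences may just be noise.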

---

## Quick Reference

```python
from sklearn.feature_selection import SelectKBest, f_classif

# Select top K features
selector = SelectKBest(f_classif, k=5)
X_train_sel = selector.fit_transform(X_train, y_train)
X_test_sel = selector.transform(X_test)  # Don't fit again!

# Get selected feature names
selected_mask = selector.get_support()
selected_features = X.columns[selected_mask]
```

---

## Next Lesson

In **Lesson 09: Handling Imbalance**, you'll learn:
- What class imbalance is
- Oversampling and undersampling
- Class weights
- Which method works best