# Phase 2: Multi-Method Feature Ranking

**Master's thesis:** Non-destructive materials testing using 3MA-X8 micromagnetics  
**Input:** ~84 features from Phase 1  
**Output:** 8 different feature rankings

---

## Methodological Foundations

### Ranking Methods (8)

| Method | Type | Characteristics |
|--------|------|-----------------|
| **ANOVA F-test** | Filter (univariate) | Fastest baseline, linear separability |
| **Mutual Information** | Filter (nonlinear) | Captures nonlinear dependencies |
| **mRMR** | Filter (multivariate) | Minimizes redundancy, maximizes relevance |
| **ReliefF** | Filter (instance-based) | Accounts for feature interactions |
| **L1-Lasso** | Embedded | Sparse solution through L1 regularization |
| **Random Forest** | Embedded (nonlinear) | Gini importance (mean decrease in impurity) |
| **Permutation Importance** | Model-agnostic (post hoc) | Out-of-sample performance loss |
| **PCA Importance** | Unsupervised | Loadings on the first principal components |

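### CRITICAL: Fold-Aware Ranking

**All supervised methods (and, in this notebook, PCA as well) are run inside a 5-fold GroupKFold CV!**  
→ One ranking per fold, then the ranks are averaged across folds (mean rank)  
→ Reduces the optimistic bias that arises when features are ranked on data that is later used for evaluation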
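
The grouping constraint can be illustrated on toy data. This is a minimal sketch that assumes the helper `create_group_kfold_splits` used later wraps scikit-learn's `GroupKFold` (the helper's source is not shown here); the data, specimen IDs, and variable names are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: 10 specimens (groups) with 4 repeated measurements each
rng = np.random.default_rng(0)
groups = np.repeat(np.arange(10), 4)
X = rng.normal(size=(groups.size, 3))
y = groups % 2  # two classes, constant within a specimen

gkf = GroupKFold(n_splits=5)
shared = []
for train_idx, val_idx in gkf.split(X, y, groups):
    # Specimens that appear on both sides of the split (should be none)
    shared.append(set(groups[train_idx]) & set(groups[val_idx]))

print(shared)
```

Every entry of `shared` should be an empty set: all measurements of a specimen land on the same side of each split, so a ranking computed on the training part never sees the validation specimens.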

---

In [None]:
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_selection import (
    f_classif,
    mutual_info_classif,
    SelectKBest
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer
from mrmr import mrmr_classif
from skrebate import ReliefF
from tqdm.notebook import tqdm
import warnings
warnings.filterwarnings('ignore')

# Custom Utilities
import sys
sys.path.append('..')
from utils.validation import create_group_kfold_splits
from utils.visualization import plot_ranking_comparison

# Plotting
plt.style.use('seaborn-v0_8-darkgrid')
%matplotlib inline

## 1. Load Data (Output of Phase 1)

In [None]:
# Data from Phase 1
DATA_PATH = '../data/processed/features_after_phase1.csv'
df = pd.read_csv(DATA_PATH)

TARGET_COL = 'class'
GROUP_COL = 'sample_id'

feature_cols = [col for col in df.columns if col not in [TARGET_COL, GROUP_COL]]
X = df[feature_cols].copy()
y = df[TARGET_COL].copy()
groups = df[GROUP_COL].copy()

print(f"✓ Data loaded: {X.shape}")
print(f"  Features: {X.shape[1]}")
print(f"  Samples: {X.shape[0]}")
print(f"  Classes: {y.nunique()}")
print(f"  Groups: {groups.nunique()}")

## 2. Preprocessing Pipeline

**CRITICAL:** Imputation and scaling must happen INSIDE each CV fold!

In [None]:
def preprocess_data(X_train, X_test=None, fit=True):
    """
    Preprocessing: median imputation + standardization.

    IMPORTANT: fit on the training data only! The fitted imputer and
    scaler are then applied unchanged to X_test.
    """
    if fit:
        # Imputer (medians are learned from the training rows)
        imputer = SimpleImputer(strategy='median')
        X_train_imputed = imputer.fit_transform(X_train)

        # Scaler (mean and standard deviation are learned from the training rows)
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train_imputed)

        if X_test is not None:
            X_test_imputed = imputer.transform(X_test)
            X_test_scaled = scaler.transform(X_test_imputed)
            return X_train_scaled, X_test_scaled, imputer, scaler
        else:
            return X_train_scaled, imputer, scaler
    else:
        raise ValueError("fit must be True for preprocessing")
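
The same train-only fitting can be expressed with a scikit-learn pipeline. The following sketch is an alternative formulation on made-up numbers, not a replacement for `preprocess_data`; the array values and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 40.0]])
X_val = np.array([[np.nan, 20.0], [10.0, 50.0]])

prep = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
X_train_s = prep.fit_transform(X_train)  # statistics come from the training rows only
X_val_s = prep.transform(X_val)          # validation rows reuse those statistics

# Medians learned by the imputer (training data only)
train_medians = prep.named_steps["simpleimputer"].statistics_
print(train_medians)
print(X_val_s)
```

The missing value in the first validation row is filled with the training median of that column, so it maps to zero after scaling with the training mean. Nothing about the validation rows influences the fitted statistics.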

## 3. Ranking Methods

### 3.1 ANOVA F-Test

In [None]:
def rank_anova_f(X_train, y_train, feature_names):
    """
    ANOVA F-test: ratio of between-class to within-class variance per feature.
    """
    f_scores, _ = f_classif(X_train, y_train)

    # Replace NaN/inf scores (e.g. from constant features) with 0
    f_scores = np.nan_to_num(f_scores, nan=0.0, posinf=0.0, neginf=0.0)
    
    ranking = pd.DataFrame({
        'feature': feature_names,
        'score': f_scores
    }).sort_values('score', ascending=False).reset_index(drop=True)
    
    ranking['rank'] = range(1, len(ranking) + 1)
    return ranking
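
To make the between/within variance ratio concrete, the sketch below computes the one-way ANOVA F-statistic by hand and compares it with `f_classif`. The helper `f_statistic` and the toy data are illustrative assumptions, not part of the notebook's pipeline.

```python
import numpy as np
from sklearn.feature_selection import f_classif

def f_statistic(x, y):
    """One-way ANOVA F-statistic for a single feature x and class labels y."""
    classes = np.unique(y)
    n, k = x.size, classes.size
    grand_mean = x.mean()
    # Between-class sum of squares: class means around the grand mean
    ss_between = sum((y == c).sum() * (x[y == c].mean() - grand_mean) ** 2 for c in classes)
    # Within-class sum of squares: samples around their own class mean
    ss_within = sum(((x[y == c] - x[y == c].mean()) ** 2).sum() for c in classes)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 30)
X = rng.normal(size=(90, 2))
X[:, 0] += y  # feature 0 shifts with the class, feature 1 is pure noise

manual = np.array([f_statistic(X[:, j], y) for j in range(X.shape[1])])
sklearn_f, _ = f_classif(X, y)
print(manual, sklearn_f)
```

Both computations should agree, and the class-dependent feature should receive the clearly larger F-score.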

### 3.2 Mutual Information

In [None]:
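def rank_mutual_info(X_train, y_train, feature_names, random_state=42):
    """
    Mutual information I(X;Y) between each feature and the class label,
    estimated by scikit-learn's nearest-neighbor method for continuous features.
    """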
    mi_scores = mutual_info_classif(
        X_train, y_train,
        discrete_features=False,
        random_state=random_state
    )
    
    ranking = pd.DataFrame({
        'feature': feature_names,
        'score': mi_scores
    }).sort_values('score', ascending=False).reset_index(drop=True)
    
    ranking['rank'] = range(1, len(ranking) + 1)
    return ranking
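
A small synthetic case shows why mutual information complements the F-test. In this sketch (toy data and variable names are assumptions) the class depends on the magnitude of a feature, not on its sign, so the class means of that feature nearly coincide.

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

rng = np.random.default_rng(0)
n = 1000
x_nonlinear = rng.uniform(-1, 1, size=n)
x_noise = rng.normal(size=n)
y = (np.abs(x_nonlinear) > 0.5).astype(int)  # class depends on |x|, not on x itself

X = np.column_stack([x_nonlinear, x_noise])
f_scores, _ = f_classif(X, y)
mi_scores = mutual_info_classif(X, y, discrete_features=False, random_state=42)

print("F-scores: ", f_scores)
print("MI scores:", mi_scores)
```

The F-score of the first feature is expected to be small because both classes are centered near zero, while its mutual information estimate is clearly above that of the noise feature.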

### 3.3 mRMR (Minimum Redundancy Maximum Relevance)

In [None]:
def rank_mrmr(X_train, y_train, feature_names, K=None):
    """
    mRMR: maximizes relevance while minimizing redundancy.
    """
    if K is None:
        K = len(feature_names)

    # The mrmr library expects pandas objects
    X_df = pd.DataFrame(X_train, columns=feature_names)
    # Drop the original index so that y aligns row by row with X_df
    y_series = pd.Series(np.asarray(y_train), name='target')

    # Greedy mRMR selection; the returned list is ordered by selection step
    selected_features = mrmr_classif(X=X_df, y=y_series, K=K, show_progress=False)

    # Build the ranking
    ranking = pd.DataFrame({
        'feature': selected_features,
        'rank': range(1, len(selected_features) + 1)
    })

    # Append features that were not selected (in their original column order)
    selected_set = set(selected_features)
    missing_features = [f for f in feature_names if f not in selected_set]
    if missing_features:
        missing_df = pd.DataFrame({
            'feature': missing_features,
            'rank': range(len(selected_features) + 1, len(feature_names) + 1)
        })
        ranking = pd.concat([ranking, missing_df], ignore_index=True)

    # Score = inverted rank (for consistency with the other methods)
    ranking['score'] = len(feature_names) - ranking['rank'] + 1

    return ranking.sort_values('rank').reset_index(drop=True)
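
The relevance/redundancy trade-off can be sketched in a few lines. The function `greedy_mrmr` below is a hypothetical, simplified quotient variant (relevance = F-statistic, redundancy = mean absolute Pearson correlation with the features already selected); it is not the implementation of the `mrmr` package, and the toy data are made up.

```python
import numpy as np
from sklearn.feature_selection import f_classif

def greedy_mrmr(X, y, k):
    """Greedy mRMR sketch: pick the feature with the best relevance/redundancy ratio."""
    relevance, _ = f_classif(X, y)
    corr = np.abs(np.corrcoef(X, rowvar=False))
    selected = [int(np.argmax(relevance))]  # start with the most relevant feature
    while len(selected) < k:
        remaining = [j for j in range(X.shape[1]) if j not in selected]
        # Mean |correlation| of each candidate with the already selected features
        redundancy = corr[np.ix_(remaining, selected)].mean(axis=1)
        score = relevance[remaining] / np.maximum(redundancy, 1e-3)
        selected.append(remaining[int(np.argmax(score))])
    return selected

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 100)
signal = y + rng.normal(scale=0.5, size=200)
X = np.column_stack([
    signal,                                     # 0: informative
    signal + rng.normal(scale=0.01, size=200),  # 1: near-duplicate of feature 0
    y + rng.normal(scale=0.5, size=200),        # 2: informative, independent noise
])
order = greedy_mrmr(X, y, k=3)
print(order)
```

Features 0 and 1 carry almost the same information, so the redundancy penalty should keep them from occupying the first two positions together, even though a purely univariate ranking could place them side by side.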

### 3.4 ReliefF

In [None]:
def rank_relieff(X_train, y_train, feature_names, n_neighbors=10):
    """
    ReliefF: instance-based feature weighting.
    """
    # Encode the class labels as integers for ReliefF
    le = LabelEncoder()
    y_encoded = le.fit_transform(y_train)

    # Reduce n_neighbors if there are too few samples
    n_neighbors = min(n_neighbors, len(X_train) // 2)

    relief = ReliefF(n_features_to_select=len(feature_names), n_neighbors=n_neighbors)
    relief.fit(X_train, y_encoded)

    # Feature scores
    scores = relief.feature_importances_
    
    ranking = pd.DataFrame({
        'feature': feature_names,
        'score': scores
    }).sort_values('score', ascending=False).reset_index(drop=True)
    
    ranking['rank'] = range(1, len(ranking) + 1)
    return ranking

### 3.5 L1-Lasso (Logistic Regression)

In [None]:
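def rank_lasso(X_train, y_train, feature_names, C=0.1, random_state=42):
    """
    L1-Lasso: sparse coefficients of an L1-penalized logistic regression.
    """
    # NOTE: `multi_class` is deprecated in newer scikit-learn releases
    # (assumption: 1.5 and later); wrapping the estimator in
    # OneVsRestClassifier is the suggested way to keep a one-vs-rest scheme.
    lasso = LogisticRegression(
        penalty='l1',
        solver='liblinear',
        C=C,
        max_iter=1000,
        random_state=random_state,
        multi_class='ovr'
    )

    lasso.fit(X_train, y_train)

    # Absolute coefficient values, averaged over the rows of coef_
    # (one row per class in the one-vs-rest case, a single row for two classes)
    coef_abs = np.abs(lasso.coef_).mean(axis=0)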
    
    ranking = pd.DataFrame({
        'feature': feature_names,
        'score': coef_abs
    }).sort_values('score', ascending=False).reset_index(drop=True)
    
    ranking['rank'] = range(1, len(ranking) + 1)
    return ranking
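
How strongly `C` controls sparsity is easy to check on synthetic data. The sketch below uses a made-up binary problem from `make_classification`; the dataset and the chosen `C` values are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(
    n_samples=300, n_features=20, n_informative=4, n_redundant=0,
    random_state=0,
)
X = StandardScaler().fit_transform(X)

nonzero = {}
for C in (0.01, 0.1, 1.0):
    model = LogisticRegression(penalty="l1", solver="liblinear", C=C, max_iter=1000)
    model.fit(X, y)
    # Number of features that survive the L1 penalty at this regularization strength
    nonzero[C] = int(np.count_nonzero(model.coef_))

print(nonzero)
```

A smaller `C` means a stronger penalty and fewer non-zero coefficients. For the ranking this matters because all features with a coefficient of exactly zero share the same score, so their relative order carries no information.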

### 3.6 Random Forest (Gini Importance)

In [None]:
def rank_random_forest(X_train, y_train, feature_names, n_estimators=100, random_state=42):
    """
    Random Forest: Mean Decrease in Impurity (Gini).
    """
    rf = RandomForestClassifier(
        n_estimators=n_estimators,
        random_state=random_state,
        max_depth=10,
        min_samples_split=5,
        n_jobs=-1
    )
    
    rf.fit(X_train, y_train)
    
    importances = rf.feature_importances_
    
    ranking = pd.DataFrame({
        'feature': feature_names,
        'score': importances
    }).sort_values('score', ascending=False).reset_index(drop=True)
    
    ranking['rank'] = range(1, len(ranking) + 1)
    return ranking

### 3.7 Permutation Importance

In [None]:
def rank_permutation(X_train, y_train, X_val, y_val, feature_names, random_state=42):
    """
    Permutation importance: out-of-sample performance loss.

    CRITICAL: computed on the validation data (X_val, y_val)!
    """
    # Base Model: Random Forest
    rf = RandomForestClassifier(
        n_estimators=100,
        random_state=random_state,
        max_depth=10,
        n_jobs=-1
    )
    
    rf.fit(X_train, y_train)
    
    # Permutation importance on the validation data
    perm_importance = permutation_importance(
        rf, X_val, y_val,
        n_repeats=10,
        random_state=random_state,
        n_jobs=-1
    )
    
    importances = perm_importance.importances_mean
    
    ranking = pd.DataFrame({
        'feature': feature_names,
        'score': importances
    }).sort_values('score', ascending=False).reset_index(drop=True)
    
    ranking['rank'] = range(1, len(ranking) + 1)
    return ranking
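
What `permutation_importance` measures can be reproduced by hand: shuffle one validation column at a time and record the drop in accuracy. The following sketch uses synthetic data and a plain train/validation split (both are assumptions for illustration, the notebook itself uses grouped folds).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=400, n_features=5, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,  # shuffle=False keeps the informative features in columns 0-1
)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
baseline = model.score(X_val, y_val)

rng = np.random.default_rng(0)
importances = np.zeros(X.shape[1])
for j in range(X.shape[1]):
    drops = []
    for _ in range(10):
        X_perm = X_val.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the link between feature j and y
        drops.append(baseline - model.score(X_perm, y_val))
    importances[j] = np.mean(drops)

print(importances)
```

The model is fitted once and never retrained; only the validation inputs are perturbed. Noise features should end up near zero (small negative values are possible), while at least one informative feature shows a clear accuracy drop.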

### 3.8 PCA Importance (Unsupervised)

In [None]:
def rank_pca(X_train, feature_names, n_components=10):
    """
    PCA importance: loadings on the first principal components.
    """
    n_components = min(n_components, X_train.shape[1], X_train.shape[0])

    pca = PCA(n_components=n_components)
    pca.fit(X_train)

    # Absolute loadings, weighted by the explained variance ratio of each component
    loadings = np.abs(pca.components_)
    explained_var = pca.explained_variance_ratio_
    
    weighted_loadings = np.sum(loadings * explained_var[:, np.newaxis], axis=0)
    
    ranking = pd.DataFrame({
        'feature': feature_names,
        'score': weighted_loadings
    }).sort_values('score', ascending=False).reset_index(drop=True)
    
    ranking['rank'] = range(1, len(ranking) + 1)
    return ranking
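
The weighting scheme is easiest to read on a three-feature toy case. In this sketch (data and names are made up) two features share a latent factor and the third is independent noise; no class label is involved at all.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
latent = rng.normal(size=500)
X = np.column_stack([
    latent + 0.1 * rng.normal(size=500),  # 0: driven by the shared latent factor
    latent + 0.1 * rng.normal(size=500),  # 1: driven by the shared latent factor
    rng.normal(size=500),                 # 2: independent noise
])
X = StandardScaler().fit_transform(X)

pca = PCA(n_components=2).fit(X)
# score_j = sum_k explained_variance_ratio_k * |component_kj|
scores = (np.abs(pca.components_) * pca.explained_variance_ratio_[:, None]).sum(axis=0)
print(scores)
```

The two correlated features load on the dominant component and therefore score higher than the independent one. This score reflects the variance structure only, so a high PCA rank says nothing by itself about how well a feature separates the classes.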

## 4. Fold-Aware Ranking Pipeline

**CRITICAL:** Rankings are computed inside each CV fold and then averaged (mean rank).

In [None]:
def compute_fold_aware_rankings(X, y, groups, random_state=42, n_splits=5):
    """
    Computes feature rankings inside a GroupKFold CV.

    Returns:
    --------
    dict: {method_name: aggregated_ranking_dataframe}
    """
    feature_names = X.columns.tolist()
    gkf = create_group_kfold_splits(n_splits=n_splits)

    # Storage for the rankings per method and fold
    all_rankings = {
        'ANOVA': [],
        'MutualInfo': [],
        'mRMR': [],
        'ReliefF': [],
        'Lasso': [],
        'RandomForest': [],
        'Permutation': [],
        'PCA': []
    }
    
    print("Starting fold-aware ranking...\n")
    
    for fold_idx, (train_idx, val_idx) in enumerate(gkf.split(X, y, groups), 1):
        print(f"Fold {fold_idx}/{n_splits}")
        
        X_train_raw, X_val_raw = X.iloc[train_idx], X.iloc[val_idx]
        y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]
        
        # Preprocessing INSIDE the fold
        X_train, X_val, imputer, scaler = preprocess_data(X_train_raw, X_val_raw, fit=True)
        
        # 1. ANOVA
        print("  - ANOVA")
        rank_anova = rank_anova_f(X_train, y_train, feature_names)
        all_rankings['ANOVA'].append(rank_anova)
        
        # 2. Mutual Information
        print("  - Mutual Info")
        rank_mi = rank_mutual_info(X_train, y_train, feature_names, random_state)
        all_rankings['MutualInfo'].append(rank_mi)
        
        # 3. mRMR
        print("  - mRMR")
        rank_mrmr_result = rank_mrmr(X_train, y_train, feature_names)
        all_rankings['mRMR'].append(rank_mrmr_result)
        
        # 4. ReliefF
        print("  - ReliefF")
        rank_relief = rank_relieff(X_train, y_train, feature_names)
        all_rankings['ReliefF'].append(rank_relief)
        
        # 5. Lasso
        print("  - Lasso")
        rank_lasso_result = rank_lasso(X_train, y_train, feature_names, random_state=random_state)
        all_rankings['Lasso'].append(rank_lasso_result)
        
        # 6. Random Forest
        print("  - Random Forest")
        rank_rf = rank_random_forest(X_train, y_train, feature_names, random_state=random_state)
        all_rankings['RandomForest'].append(rank_rf)
        
        # 7. Permutation Importance
        print("  - Permutation")
        rank_perm = rank_permutation(X_train, y_train, X_val, y_val, feature_names, random_state)
        all_rankings['Permutation'].append(rank_perm)
        
        # 8. PCA (unsupervised, fitted on X_train only)
        print("  - PCA")
        rank_pca_result = rank_pca(X_train, feature_names)
        all_rankings['PCA'].append(rank_pca_result)
        
        print()
    
    # Aggregation: mean of the ranks across folds
    print("Aggregating rankings across folds...\n")

    aggregated_rankings = {}

    for method_name, fold_rankings in all_rankings.items():
        # Collect the ranks of every feature
        rank_dict = {feat: [] for feat in feature_names}

        for fold_ranking in fold_rankings:
            for _, row in fold_ranking.iterrows():
                rank_dict[row['feature']].append(row['rank'])

        # Mean rank per feature
        mean_ranks = {feat: np.mean(ranks) for feat, ranks in rank_dict.items()}

        # Final ranking (lowest mean rank first)
        final_ranking = pd.DataFrame({
            'feature': list(mean_ranks.keys()),
            'mean_rank': list(mean_ranks.values())
        }).sort_values('mean_rank').reset_index(drop=True)

        final_ranking['final_rank'] = range(1, len(final_ranking) + 1)
        
        aggregated_rankings[method_name] = final_ranking
    
    return aggregated_rankings
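
The mean-rank aggregation can also be written with `pd.concat` and `groupby`. The sketch below is an equivalent formulation on three hypothetical fold rankings; the extra `rank_std` column is an optional addition (not part of the function above) that shows how stable a feature's rank is across folds.

```python
import pandas as pd

# Hypothetical per-fold rankings for three features (rank 1 = best)
fold_rankings = [
    pd.DataFrame({"feature": ["a", "b", "c"], "rank": [1, 2, 3]}),
    pd.DataFrame({"feature": ["b", "a", "c"], "rank": [1, 2, 3]}),
    pd.DataFrame({"feature": ["a", "c", "b"], "rank": [1, 2, 3]}),
]

# Stack all folds and aggregate the ranks per feature
stacked = pd.concat(fold_rankings, keys=range(1, len(fold_rankings) + 1), names=["fold"])
aggregated = (
    stacked.groupby("feature")["rank"]
    .agg(mean_rank="mean", rank_std="std")
    .sort_values("mean_rank")
    .reset_index()
)
aggregated["final_rank"] = range(1, len(aggregated) + 1)
print(aggregated)
```

Feature `a` is ranked 1, 2, 1 across the folds (mean rank 4/3) and ends up first, followed by `b` (mean 2) and `c` (mean 8/3).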

## 5. Compute Rankings

In [None]:
# MAIN COMPUTATION
rankings = compute_fold_aware_rankings(
    X=X,
    y=y,
    groups=groups,
    random_state=42,
    n_splits=5
)

print("="*70)
print("✓ PHASE 2 COMPLETE")
print("="*70)
print("8 feature rankings computed (5-fold GroupKFold CV)")
print(f"Features per ranking: {len(rankings['ANOVA'])}")
print("="*70)

## 6. Display Rankings (Top 20 per Method)

In [None]:
for method_name, ranking_df in rankings.items():
    print(f"\n{'='*70}")
    print(f"{method_name} - Top 20 Features")
    print(f"{'='*70}")
    print(ranking_df.head(20).to_string(index=False))

## 7. Visualization: Ranking Comparison

In [None]:
# Ranking comparison plot (top 20)
# NOTE: 'mean_rank' is passed on under the name 'score', but a LOWER mean rank
# is better. Check that plot_ranking_comparison sorts accordingly.
rankings_for_plot = {name: df[['feature', 'mean_rank']].rename(columns={'mean_rank': 'score'})
                     for name, df in rankings.items()}

# Select 3 methods for a compact plot
selected_methods = ['ANOVA', 'RandomForest', 'mRMR']
rankings_subset = {k: v for k, v in rankings_for_plot.items() if k in selected_methods}

fig = plot_ranking_comparison(
    rankings_dict=rankings_subset,
    top_k=20,
    save_path='../results/plots/phase2_ranking_comparison.png'
)
plt.show()

## 8. Save Results

In [None]:
# Save each ranking
import os
os.makedirs('../results/rankings', exist_ok=True)

for method_name, ranking_df in rankings.items():
    output_path = f'../results/rankings/phase2_ranking_{method_name}.csv'
    ranking_df.to_csv(output_path, index=False)
    print(f"✓ {method_name} saved: {output_path}")

# Combined ranking (all methods, one rank column per method)
combined = pd.DataFrame({'feature': X.columns})
for method_name, ranking_df in rankings.items():
    combined = combined.merge(
        ranking_df[['feature', 'final_rank']].rename(columns={'final_rank': f'rank_{method_name}'}),
        on='feature',
        how='left'
    )

combined_path = '../results/rankings/phase2_all_rankings_combined.csv'
combined.to_csv(combined_path, index=False)
print(f"\n✓ Combined ranking saved: {combined_path}")

---
## ✓ Phase 2 complete!

**Next step:** Notebook 3 - Phase 3: Iterative Reduction Evaluation (LDA/QDA benchmarking)