# Phase 1: Quality Filtering and Correlation Pre-Pruning

**Master's thesis:** Non-destructive materials testing using 3MA-X8 micromagnetics  
**Goal:** Reduce P=261 features to ~84 features

---

## Methodological Foundations

### 1.1 Quality Filtering
- **Missing values:** Remove features with more than 15% missing values
- **Near-zero variance:** Remove quasi-constant features

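### 1.2 One-vs-Rest (OvR) Signal
Supervised correlation of each feature $x_j$ with the binary indicator of each class $c$:
$$\text{OvR-Score}_j = \max_{c \in \text{Classes}} \max\left(|\rho_{\text{Pearson}}(x_j, \mathbb{1}[y = c])|, |\rho_{\text{Spearman}}(x_j, \mathbb{1}[y = c])|\right)$$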
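
As a minimal sketch of the formula above, the following toy example computes the OvR score of a single feature against three classes. The helper name `ovr_score` and the data are illustrative assumptions and not part of the pipeline itself.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def ovr_score(x, y):
    """Max over classes of max(|Pearson|, |Spearman|) against the class indicator."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    best = 0.0
    for c in np.unique(y):
        indicator = (y == c).astype(float)  # 1 for class c, 0 for the rest
        r_p, _ = pearsonr(x, indicator)
        r_s, _ = spearmanr(x, indicator)
        best = max(best, abs(r_p), abs(r_s))
    return best

# Hypothetical toy data: the feature is clearly elevated for class "C" only
x_toy = [1.0, 1.2, 0.9, 1.1, 1.0, 1.3, 5.0, 5.2, 4.9]
y_toy = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]
score_toy = ovr_score(x_toy, y_toy)
print(f"OvR score: {score_toy:.3f}")
```

Because the maximum is taken over classes, a feature that separates only one class from the rest still receives a high score.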

### 1.3 Hierarchical Clustering (Redundancy Elimination)
- **Distance metric:** $d = 1 - |\rho|$
- **Threshold:** $|\rho| \geq 0.90$ (i.e. $d \leq 0.10$) → merge into one cluster
- **Representatives:** Hybrid score = $0.5 \cdot \text{centrality} + 0.5 \cdot \text{OvR signal}$
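
The clustering rule can be illustrated on synthetic data. This is a sketch under assumptions: the three toy features are made up, and average linkage is used because that is the linkage applied later in this notebook.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
base = rng.normal(size=200)

# Hypothetical toy features: f1 and f2 are near-duplicates, f3 is independent
toy = pd.DataFrame({
    "f1": base,
    "f2": base + 0.05 * rng.normal(size=200),
    "f3": rng.normal(size=200),
})

# d = 1 - |rho|, converted to the condensed form that linkage() expects
dist = 1 - np.abs(toy.corr().to_numpy())
condensed = squareform(dist, checks=False)
Z_toy = linkage(condensed, method="average")

# Cut the tree at d <= 0.10, i.e. |rho| >= 0.90
labels_toy = fcluster(Z_toy, t=0.10, criterion="distance")
print(dict(zip(toy.columns, labels_toy)))
```

The two near-duplicate features end up with the same cluster label, while the independent feature forms a singleton cluster.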

---

In [None]:
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import pearsonr, spearmanr
from sklearn.preprocessing import LabelBinarizer
import warnings
warnings.filterwarnings('ignore')

# Custom Utilities
import sys
sys.path.append('..')
from utils.validation import validate_data_structure, print_validation_report
from utils.visualization import plot_correlation_heatmap

# Plotting configuration
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
%matplotlib inline

## 1. Load Data

**IMPORTANT:** Adjust the file path to your data source!  
Expected structure:
- **Columns:** 261 features + 1 target variable (e.g. 'class') + 1 specimen ID (e.g. 'sample_id')
- **Rows:** n samples (typically repeated measurements per specimen)
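
Before loading real data, the expected layout can be checked with a few lines of pandas. The helper `check_expected_structure` below is hypothetical and is not the `validate_data_structure` utility imported above, whose implementation is not shown here. The rule that every specimen carries exactly one class label is an assumption of this sketch.

```python
import pandas as pd

def check_expected_structure(df, target_col="class", group_col="sample_id", n_features=261):
    """Hypothetical helper: returns a list of problems found (empty list = OK)."""
    problems = []
    for col in (target_col, group_col):
        if col not in df.columns:
            problems.append(f"missing column: {col!r}")
    feature_cols = [c for c in df.columns if c not in (target_col, group_col)]
    if len(feature_cols) != n_features:
        problems.append(f"expected {n_features} features, found {len(feature_cols)}")
    non_numeric = df[feature_cols].select_dtypes(exclude="number").columns.tolist()
    if non_numeric:
        problems.append(f"non-numeric feature columns: {non_numeric[:5]}")
    # Assumption: repeated measurements of one specimen share a single class label
    if target_col in df.columns and group_col in df.columns:
        labels_per_group = df.groupby(group_col)[target_col].nunique()
        if (labels_per_group > 1).any():
            problems.append("some specimens have more than one class label")
    return problems

# Synthetic toy frame with 3 features instead of 261 (for illustration only)
toy_df = pd.DataFrame({
    "f1": [0.1, 0.2, 0.3, 0.4],
    "f2": [1.0, 1.1, 0.9, 1.2],
    "f3": [5.0, 5.5, 4.5, 5.2],
    "class": ["A", "A", "B", "B"],
    "sample_id": [1, 1, 2, 2],
})
print(check_expected_structure(toy_df, n_features=3))
print(check_expected_structure(toy_df))
```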

In [None]:
# ============================================================================
# ADJUST: file path to your 3MA-X8 data
# ============================================================================
DATA_PATH = '../data/raw/3ma_x8_features.csv'  # <-- ADJUST HERE!

# Read the data
df_raw = pd.read_csv(DATA_PATH)

print(f"✓ Data loaded: {df_raw.shape}")
print("\nFirst 5 rows:")
df_raw.head()

In [None]:
# ============================================================================
# ADJUST: column names for the target variable and the specimen ID
# ============================================================================
TARGET_COL = 'class'       # <-- name of the target column
GROUP_COL = 'sample_id'    # <-- name of the specimen ID column

# Feature matrix X, target y, groups
feature_cols = [col for col in df_raw.columns if col not in [TARGET_COL, GROUP_COL]]
X = df_raw[feature_cols].copy()
y = df_raw[TARGET_COL].copy()
groups = df_raw[GROUP_COL].copy()

print(f"Features: {X.shape}")
print(f"Classes: {y.nunique()} ({y.value_counts().to_dict()})")
print(f"Groups (specimens): {groups.nunique()}")

## 2. Data Structure Validation

In [None]:
# Validate against the specification
validation_results = validate_data_structure(
    X=X,
    y=y,
    groups=groups,
    expected_features=261
)

print_validation_report(validation_results)

## 3. Step 1.1: Missing Values Filter

In [None]:
MISSING_THRESHOLD = 0.15  # 15% per the specification

# Fraction of missing values per feature
missing_ratios = X.isnull().mean()
features_to_keep = missing_ratios[missing_ratios <= MISSING_THRESHOLD].index.tolist()

n_removed = len(X.columns) - len(features_to_keep)
print("✓ Missing values filter:")
print(f"  Features removed: {n_removed}")
print(f"  Features remaining: {len(features_to_keep)}")

X_filtered = X[features_to_keep].copy()

## 4. Step 1.2: Near-Zero Variance Filter

In [None]:
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler

# Variance threshold (relative to the feature's range)
VARIANCE_THRESHOLD = 0.01

# Temporarily rescale every feature to [0, 1] so that one threshold is
# comparable across features. Z-standardization is not suitable here: it sets
# the variance of every non-constant feature to 1, so quasi-constant features
# would never fall below the threshold.
scaler_temp = MinMaxScaler()
X_scaled_temp = scaler_temp.fit_transform(X_filtered)

# Apply VarianceThreshold (NaNs are ignored when the variance is computed)
selector = VarianceThreshold(threshold=VARIANCE_THRESHOLD)
selector.fit(X_scaled_temp)

features_high_var = X_filtered.columns[selector.get_support()].tolist()

n_removed_var = len(X_filtered.columns) - len(features_high_var)
print("✓ Near-zero variance filter:")
print(f"  Features removed: {n_removed_var}")
print(f"  Features remaining: {len(features_high_var)}")

X_filtered = X_filtered[features_high_var].copy()
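
The choice of scaling before a variance threshold matters. The sketch below uses synthetic data (an assumption, for illustration only) to compare z-standardization with min-max scaling on one informative and one quasi-constant feature. The helper `surviving_features` is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)
n = 500

# Quasi-constant feature: a single deviating measurement among 500
spike = np.full(n, 3.0)
spike[0] = 4.0
toy_X = pd.DataFrame({
    "informative": rng.uniform(0.0, 10.0, size=n),
    "quasi_constant": spike,
})

def surviving_features(scaler, X, threshold=0.01):
    """Columns of X whose variance after scaling exceeds the threshold."""
    selector = VarianceThreshold(threshold=threshold).fit(scaler.fit_transform(X))
    return X.columns[selector.get_support()].tolist()

kept_standard = surviving_features(StandardScaler(), toy_X)
kept_minmax = surviving_features(MinMaxScaler(), toy_X)
print("StandardScaler keeps:", kept_standard)
print("MinMaxScaler keeps:  ", kept_minmax)
```

After z-standardization both columns have unit variance, so the quasi-constant feature survives. After min-max scaling its variance is about 0.002 and it is removed, while the uniformly distributed feature (variance near 1/12) is kept.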

## 5. Step 1.3: Compute the One-vs-Rest (OvR) Signal

In [None]:
def compute_ovr_signal(X, y):
    """
    Computes the OvR signal for each feature:
    max_over_classes(max(|ρ_Pearson|, |ρ_Spearman|))
    """
    # Binary class indicators (one column per class)
    lb = LabelBinarizer()
    y_binary = lb.fit_transform(y)
    if y_binary.shape[1] == 1:  # LabelBinarizer returns a single column for 2 classes
        y_binary = np.hstack([1 - y_binary, y_binary])

    ovr_scores = {}

    for feature in X.columns:
        feature_values = X[feature].values

        # Temporarily replace missing values with the median
        if np.isnan(feature_values).any():
            feature_values = pd.Series(feature_values).fillna(pd.Series(feature_values).median()).values

        max_corr = 0

        # For each class
        for class_idx in range(y_binary.shape[1]):
            class_indicator = y_binary[:, class_idx]

            # Pearson (a NaN result, e.g. for constant input, counts as no signal)
            try:
                corr_p, _ = pearsonr(feature_values, class_indicator)
                corr_p = abs(corr_p) if not np.isnan(corr_p) else 0
            except ValueError:
                corr_p = 0

            # Spearman
            try:
                corr_s, _ = spearmanr(feature_values, class_indicator)
                corr_s = abs(corr_s) if not np.isnan(corr_s) else 0
            except ValueError:
                corr_s = 0

            # Maximum over both correlations and all classes seen so far
            max_corr = max(max_corr, corr_p, corr_s)

        ovr_scores[feature] = max_corr

    return pd.Series(ovr_scores)

# Computation
print(f"Computing OvR signal for {len(X_filtered.columns)} features...")
ovr_signal = compute_ovr_signal(X_filtered, y)

print("✓ OvR signal computed")
print(f"  Mean: {ovr_signal.mean():.3f}")
print(f"  Median: {ovr_signal.median():.3f}")
print("\nTop 10 features (highest OvR signal):")
print(ovr_signal.sort_values(ascending=False).head(10))

## 6. Step 1.4: Hierarchical Clustering & Redundancy Elimination

In [None]:
# Compute the correlation matrix (Pearson)
X_clean = X_filtered.fillna(X_filtered.median())  # temporary imputation for the correlation computation
corr_matrix = X_clean.corr(method='pearson')

print(f"✓ Correlation matrix: {corr_matrix.shape}")

In [None]:
# Distance matrix: d = 1 - |ρ|
distance_matrix = 1 - np.abs(corr_matrix.values)

# Hierarchical clustering (average linkage)
from scipy.spatial.distance import squareform
condensed_dist = squareform(distance_matrix, checks=False)
Z = linkage(condensed_dist, method='average')

# Cut the tree at |ρ| ≥ 0.90, i.e. at an average-linkage distance ≤ 0.10
CORR_THRESHOLD = 0.90
DISTANCE_THRESHOLD = 1 - CORR_THRESHOLD

cluster_labels = fcluster(Z, t=DISTANCE_THRESHOLD, criterion='distance')

print("✓ Clustering complete")
print(f"  Number of clusters: {len(np.unique(cluster_labels))}")
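
The cluster count alone says little about which features were grouped. As a sketch, the hypothetical helper `summarize_clusters` below lists every multi-member cluster. The toy labels are illustrative values in the format returned by `fcluster`.

```python
import numpy as np
import pandas as pd

def summarize_clusters(feature_names, labels):
    """Hypothetical helper: one row per multi-member cluster, largest first."""
    clusters = pd.Series(list(feature_names)).groupby(np.asarray(labels)).agg(list)
    summary = pd.DataFrame({
        "size": clusters.map(len),
        "members": clusters,
    })
    summary = summary[summary["size"] > 1]
    return summary.sort_values("size", ascending=False)

# Toy input: f1, f3 and f5 share cluster 2, the others are singletons
toy_names = ["f1", "f2", "f3", "f4", "f5"]
toy_labels = np.array([2, 1, 2, 3, 2])
toy_summary = summarize_clusters(toy_names, toy_labels)
print(toy_summary)
```

In this notebook the call would be `summarize_clusters(X_filtered.columns, cluster_labels)`.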

In [None]:
# Select one representative per cluster
def select_cluster_representatives(features, cluster_labels, corr_matrix, ovr_signal, alpha=0.5):
    """
    Selects, for each cluster, the representative with the highest hybrid score.

    Hybrid score = α * centrality + (1-α) * OvR signal

    Centrality = mean absolute correlation to the other cluster members
    """
    features = np.asarray(features)
    abs_corr = np.abs(corr_matrix.to_numpy())  # same feature order as `features`
    representatives = []

    for cluster_id in np.unique(cluster_labels):
        # Positions and names of the features in this cluster
        cluster_indices = np.where(cluster_labels == cluster_id)[0]
        cluster_features = features[cluster_indices]

        if len(cluster_features) == 1:
            # Singleton cluster
            representatives.append(cluster_features[0])
        else:
            # Compute the hybrid score for every feature in the cluster
            scores = {}

            for feat_idx, feat in zip(cluster_indices, cluster_features):
                # Centrality: mean |ρ| to all other members (self-correlation excluded)
                others = cluster_indices[cluster_indices != feat_idx]
                centrality = abs_corr[feat_idx, others].mean()

                # The OvR signal is an absolute correlation and thus already in [0, 1]
                hybrid_score = alpha * centrality + (1 - alpha) * ovr_signal[feat]
                scores[feat] = hybrid_score

            # Best representative
            best_rep = max(scores, key=scores.get)
            representatives.append(best_rep)

    return representatives

# Select the representatives
features_array = X_filtered.columns.values
representatives = select_cluster_representatives(
    features=features_array,
    cluster_labels=cluster_labels,
    corr_matrix=corr_matrix,
    ovr_signal=ovr_signal,
    alpha=0.5
)

print("✓ Representatives selected")
print(f"  Features after clustering: {len(representatives)}")
print(f"  Reduction rate: {(1 - len(representatives) / len(X_filtered.columns)) * 100:.1f}%")
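
A worked example shows how the two terms of the hybrid score trade off. All numbers below are made up for illustration, and centrality is computed here as the mean |ρ| to the other cluster members, which is an assumption of this sketch.

```python
import numpy as np

# Illustrative 3-feature cluster: pairwise |rho| and OvR signals (made-up numbers)
abs_corr_toy = np.array([
    [1.00, 0.95, 0.92],
    [0.95, 1.00, 0.97],
    [0.92, 0.97, 1.00],
])
ovr_toy = np.array([0.40, 0.35, 0.80])
names_toy = ["f1", "f2", "f3"]
alpha_toy = 0.5

# Centrality: mean |rho| to the other members (diagonal removed)
n_members = len(names_toy)
centrality_toy = (abs_corr_toy.sum(axis=1) - np.diag(abs_corr_toy)) / (n_members - 1)

# Hybrid score = alpha * centrality + (1 - alpha) * OvR signal
hybrid_toy = alpha_toy * centrality_toy + (1 - alpha_toy) * ovr_toy
best_toy = names_toy[int(np.argmax(hybrid_toy))]

for name, c, h in zip(names_toy, centrality_toy, hybrid_toy):
    print(f"{name}: centrality={c:.3f}, hybrid={h:.4f}")
print("Representative:", best_toy)
```

Here `f2` is the most central feature (0.960), but `f3` is selected because its much higher OvR signal outweighs the small difference in centrality. With `alpha` closer to 1, centrality dominates and the choice shifts toward `f2`.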

In [None]:
# Final feature matrix for Phase 1
X_phase1 = X_filtered[representatives].copy()

print(f"\n{'='*70}")
print("PHASE 1 COMPLETE")
print(f"{'='*70}")
print(f"Features before: {X.shape[1]}")
print(f"Features after:  {X_phase1.shape[1]}")
print(f"Reduction:       {X.shape[1] - X_phase1.shape[1]} features ({(1 - X_phase1.shape[1] / X.shape[1]) * 100:.1f}%)")
print(f"{'='*70}")

## 7. Visualization: Correlation Heatmap (Top 30 Features)

In [None]:
# Plot the correlation heatmap of the top 30 features (sorted by OvR signal)
top_30_features = ovr_signal.loc[representatives].sort_values(ascending=False).head(30).index.tolist()

fig = plot_correlation_heatmap(
    X=X_phase1[top_30_features],
    method='pearson',
    top_k=30,
    save_path='../results/plots/phase1_correlation_heatmap.png'
)
plt.show()

## 8. Save Results

In [None]:
from pathlib import Path

# Save the reduced feature matrix
output_df = X_phase1.copy()
output_df[TARGET_COL] = y
output_df[GROUP_COL] = groups

output_path = '../data/processed/features_after_phase1.csv'
Path(output_path).parent.mkdir(parents=True, exist_ok=True)  # create the folder if needed
output_df.to_csv(output_path, index=False)
print(f"✓ Data saved: {output_path}")

# Save the feature list and OvR scores
feature_info = pd.DataFrame({
    'feature': representatives,
    'ovr_signal': [ovr_signal[f] for f in representatives]
}).sort_values('ovr_signal', ascending=False)

feature_info_path = '../results/rankings/phase1_feature_info.csv'
Path(feature_info_path).parent.mkdir(parents=True, exist_ok=True)
feature_info.to_csv(feature_info_path, index=False)
print(f"✓ Feature info saved: {feature_info_path}")

---
## ✓ Phase 1 complete!

**Next step:** Notebook 2 - Phase 2: Multi-Method Feature Ranking