## 📝 Section 1: Handling Missing Data

### Why Missing Data Occurs

**Common Causes:**
1. **Measurement Failures**: Sensor malfunctions, test equipment errors
2. **Data Collection Issues**: Logging failures, transmission errors
3. **Not Applicable**: Conditional tests that only run in certain scenarios
4. **Intentional Omissions**: Privacy concerns, cost constraints

**Types of Missing Data:**
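- **MCAR (Missing Completely At Random)**: Missingness is independent of both observed and unobserved values
- **MAR (Missing At Random)**: Missingness depends only on observed data (e.g., readings are lost more often at one test site)
- **MNAR (Missing Not At Random)**: Missingness depends on unobserved data, including the missing value itself (e.g., a sensor fails when the reading is out of range)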
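
To make the three mechanisms concrete, the sketch below simulates each one on synthetic data. The column names (`temp`, `reading`), the probabilities, and the thresholds are all made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({
    "temp": rng.normal(50, 10, n),      # always observed
    "reading": rng.normal(100, 15, n),  # column that loses values
})

# MCAR: every row has the same 10% chance of being missing.
mcar_mask = rng.random(n) < 0.10

# MAR: the chance of missingness depends only on the observed column `temp`.
mar_mask = rng.random(n) < np.where(df["temp"] > 60, 0.30, 0.05)

# MNAR: the chance of missingness depends on the value that goes missing.
mnar_mask = rng.random(n) < np.where(df["reading"] > 115, 0.40, 0.05)

for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]:
    observed_mean = df.loc[~mask, "reading"].mean()
    print(f"{name}: {mask.mean():.1%} missing, observed mean = {observed_mean:.2f}")
```

In this simulation only the MNAR mask pulls the observed mean of `reading` below the true mean, because high readings are removed preferentially. Here `temp` and `reading` are independent, so the MAR mask leaves the mean unbiased; with correlated columns it would not.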

### Strategies for Handling Missing Data

**1. Mean/Median/Mode Imputation:**
- **Mean**: For normally distributed numeric features
- **Median**: For numeric features with outliers (more robust)
- **Mode**: For categorical features

$$\text{Imputed Value} = \begin{cases} 
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i & \text{(Mean)} \\
\text{median}(x) & \text{(Median)} \\
\text{mode}(x) & \text{(Mode)}
\end{cases}$$
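
As a minimal sketch of the three cases, the snippet below fills one column with each statistic using plain pandas. The column names and values are invented for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "vdd": [1.0, 1.1, np.nan, 0.9],        # roughly symmetric -> mean
    "leakage": [2.0, 3.0, 100.0, np.nan],  # contains an outlier -> median
    "site": ["A", "B", "A", None],         # categorical -> mode
})

imputed = df.copy()
imputed["vdd"] = df["vdd"].fillna(df["vdd"].mean())
imputed["leakage"] = df["leakage"].fillna(df["leakage"].median())
imputed["site"] = df["site"].fillna(df["site"].mode().iloc[0])
print(imputed)
```

The median of `leakage` is 3.0, while its mean is 35.0 because of the single outlier, which is why the median is the safer fill value there.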

**2. Constant Imputation:**
- Replace with a specific value (e.g., 0, -1, 'Missing')
- Useful when missing values have semantic meaning

**3. Forward Fill / Backward Fill:**
- **Forward Fill**: Use previous valid value (time series)
- **Backward Fill**: Use next valid value (time series)
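
A small sketch of both fills on a made-up daily series. Note that each direction leaves a gap at one end of the series, so the two are often chained.

```python
import numpy as np
import pandas as pd

s = pd.Series(
    [np.nan, 1.0, np.nan, np.nan, 4.0, np.nan],
    index=pd.date_range("2024-01-01", periods=6, freq="D"),
)

forward = s.ffill()        # leading NaN stays: there is no earlier value
backward = s.bfill()       # trailing NaN stays: there is no later value
both = s.ffill().bfill()   # forward fill first, then back-fill the leading gap

print(pd.DataFrame({"raw": s, "ffill": forward, "bfill": backward, "both": both}))
```

For forecasting tasks, backward fill copies a future value into the past, so it is usually only appropriate for offline analysis.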

**4. Advanced Methods:**
- **KNN Imputation**: Use k-nearest neighbors' values
- **Iterative Imputation**: Model each feature as a function of others
- **Deep Learning**: Autoencoders for complex imputation
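
The first two advanced methods are available in scikit-learn. The sketch below applies both to a tiny made-up matrix in which the second column is twice the first; `IterativeImputer` is still marked experimental and needs the explicit enabling import.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (exposes IterativeImputer)
from sklearn.impute import IterativeImputer, KNNImputer

X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],   # value to impute
    [3.0, 6.0],
    [8.0, 16.0],
])

# KNN: average the second column over the 2 rows closest in the first column.
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

# Iterative: regress the incomplete column on the other column(s).
X_iter = IterativeImputer(random_state=0).fit_transform(X)

print("KNN:      ", X_knn[1, 1])
print("Iterative:", X_iter[1, 1])
```

With `n_neighbors=2` the nearest rows are the first and third, so the KNN fill is the average of 2.0 and 6.0. The iterative result depends on the default regressor and is not shown here.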

**When to Use Each (rules of thumb, not hard limits):**
- **< 5% missing**: Most methods work reasonably well
- **5-20% missing**: Use median/mode or KNN imputation
- **20-40% missing**: Consider iterative imputation or dropping the feature
- **> 40% missing**: Usually better to drop the feature entirely
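
These thresholds can be written as a small lookup. The function name `suggest_strategy` is hypothetical, and how the exact boundary values (5, 20, 40) are assigned is an assumption, since the list above does not say which side they fall on.

```python
def suggest_strategy(missing_pct: float) -> str:
    """Map a column's missing percentage to the rule-of-thumb suggestion."""
    if not 0 <= missing_pct <= 100:
        raise ValueError("missing_pct must be between 0 and 100")
    if missing_pct < 5:
        return "any simple method"
    if missing_pct < 20:
        return "median/mode or KNN imputation"
    if missing_pct <= 40:
        return "iterative imputation or drop feature"
    return "drop feature"


for pct in (2, 12, 30, 65):
    print(f"{pct:>3}% missing -> {suggest_strategy(pct)}")
```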

In [None]:
# Missing Data Handling Implementation

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer
import matplotlib.pyplot as plt
import seaborn as sns

class MissingDataHandler:
    """
    Comprehensive missing data handling toolkit.
    
    Supports multiple imputation strategies:
    - Statistical: mean, median, mode
    - Fixed: constant value
    - Sequential: forward fill, backward fill
    - Advanced: KNN imputation
    """
    
    def __init__(self, strategy='median'):
        """
        Initialize with default strategy.
        
        Parameters:
        -----------
        strategy : str
            Default imputation strategy: 'mean', 'median', 'mode', 'constant', 
            'forward_fill', 'backward_fill', 'knn'
        """
        self.strategy = strategy
        self.imputer = None
        self.fill_values = {}
        
    def analyze_missing(self, df):
        """
        Analyze missing data patterns.
        
        Returns:
        --------
        DataFrame with missing value statistics per column
        """
        missing_stats = pd.DataFrame({
            'column': df.columns,
            'missing_count': df.isnull().sum(),
            'missing_pct': (df.isnull().sum() / len(df)) * 100,
            'dtype': df.dtypes
        })
        
        missing_stats = missing_stats[missing_stats['missing_count'] > 0]
        missing_stats = missing_stats.sort_values('missing_pct', ascending=False)
        
        return missing_stats
    
    def impute_simple(self, df, strategy='median', fill_value=None):
        """
        Simple imputation using sklearn SimpleImputer.
        
        Parameters:
        -----------
        df : DataFrame
            Input data with missing values
        strategy : str
            'mean', 'median', 'most_frequent', 'constant'
        fill_value : any
            Value to use when strategy='constant'
        
        Returns:
        --------
        DataFrame with imputed values
        """
        numeric_cols = df.select_dtypes(include=[np.number]).columns
        
        if strategy == 'constant':
            self.imputer = SimpleImputer(strategy=strategy, fill_value=fill_value)
        else:
            self.imputer = SimpleImputer(strategy=strategy)
        
        df_imputed = df.copy()
        df_imputed[numeric_cols] = self.imputer.fit_transform(df[numeric_cols])
        
        # Store fill values for inspection
        self.fill_values = dict(zip(numeric_cols, self.imputer.statistics_))
        
        return df_imputed
    
    def impute_forward_fill(self, df):
        """Forward fill (use previous valid value)."""
        # DataFrame.ffill() replaces fillna(method='ffill'), which recent pandas deprecates
        return df.ffill()
    
    def impute_backward_fill(self, df):
        """Backward fill (use next valid value)."""
        # DataFrame.bfill() replaces fillna(method='bfill'), which recent pandas deprecates
        return df.bfill()
    
    def impute_knn(self, df, n_neighbors=5):
        """
        KNN imputation - uses k-nearest neighbors' values.
        
        Parameters:
        -----------
        df : DataFrame
            Input data with missing values
        n_neighbors : int
            Number of neighbors to use
        
        Returns:
        --------
        DataFrame with KNN-imputed values
        """
        numeric_cols = df.select_dtypes(include=[np.number]).columns
        
        self.imputer = KNNImputer(n_neighbors=n_neighbors)
        
        df_imputed = df.copy()
        df_imputed[numeric_cols] = self.imputer.fit_transform(df[numeric_cols])
        
        return df_imputed
    
    def visualize_missing(self, df):
        """
        Visualize missing data patterns.
        """
        plt.figure(figsize=(12, 6))
        
        # Missing data heatmap
        plt.subplot(1, 2, 1)
        sns.heatmap(df.isnull(), cbar=False, cmap='viridis', yticklabels=False)
        plt.title('Missing Data Pattern')
        plt.xlabel('Features')
        plt.ylabel('Samples')
        
        # Missing percentage bar chart
        plt.subplot(1, 2, 2)
        missing_pct = (df.isnull().sum() / len(df)) * 100
        missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=True)
        missing_pct.plot(kind='barh', color='coral')
        plt.title('Missing Data Percentage by Feature')
        plt.xlabel('Missing %')
        plt.ylabel('Features')
        
        plt.tight_layout()
        plt.show()

print("✅ MissingDataHandler class implemented")
print("\nKey Methods:")
print("- analyze_missing(): Get missing data statistics")
print("- impute_simple(): Mean/median/mode/constant imputation")
print("- impute_forward_fill(): Use previous valid value")
print("- impute_backward_fill(): Use next valid value")
print("- impute_knn(): K-nearest neighbors imputation")
print("- visualize_missing(): Plot missing data patterns")

## 📝 Section 2: Feature Scaling and Normalization

### Why Scale Features?

**Problem:** Features with larger numeric ranges can dominate the learning process.

**Example:**
- Frequency: 1000-3000 MHz (range: 2000)
- Voltage: 0.8-1.2 V (range: 0.4)
- Temperature: 25-85°C (range: 60)

Without scaling, frequency would dominate distance calculations in algorithms such as KNN and K-Means.
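
A quick numeric check of this effect, using the three ranges above. The three device vectors `a`, `b`, and `c` are invented for illustration.

```python
import numpy as np

# Columns: frequency (MHz), voltage (V), temperature (°C)
a = np.array([1000.0, 0.8, 25.0])
b = np.array([1010.0, 1.2, 85.0])  # spans the full voltage and temperature range
c = np.array([1100.0, 0.8, 25.0])  # differs only by 100 MHz (5% of the frequency range)

lo = np.array([1000.0, 0.8, 25.0])  # range minima from the example
hi = np.array([3000.0, 1.2, 85.0])  # range maxima from the example


def scale(x):
    """Min-max scale each feature to [0, 1] using the stated ranges."""
    return (x - lo) / (hi - lo)


raw_ab, raw_ac = np.linalg.norm(a - b), np.linalg.norm(a - c)
scaled_ab = np.linalg.norm(scale(a) - scale(b))
scaled_ac = np.linalg.norm(scale(a) - scale(c))

print(f"raw:    d(a,b) = {raw_ab:.2f}, d(a,c) = {raw_ac:.2f}")
print(f"scaled: d(a,b) = {scaled_ab:.3f}, d(a,c) = {scaled_ac:.3f}")
```

On raw values `c` looks farther from `a` than `b` does, purely because of the MHz units. After scaling the order reverses.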

### Scaling Methods

**1. Standardization (Z-Score Normalization):**

Transforms features to have mean=0 and standard deviation=1.

$$x_{\text{scaled}} = \frac{x - \mu}{\sigma}$$

Where:
- $\mu$ = mean of feature
- $\sigma$ = standard deviation of feature

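**When to Use:**
- Features are roughly symmetric (approximately normal)
- Algorithms are sensitive to feature scale: distance-based methods (KNN, K-Means), PCA, SVMs, and regularized or gradient-trained linear and logistic regression
- Presence of outliers is acceptable (they remain outliers and also shift the mean and standard deviation)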
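
The formula can be checked by hand with NumPy. This is a minimal sketch on made-up frequency values; it uses the population standard deviation (`ddof=0`), which is also what scikit-learn's `StandardScaler` uses.

```python
import numpy as np

x_train = np.array([1000.0, 1500.0, 2000.0, 2500.0, 3000.0])

mu = x_train.mean()
sigma = x_train.std()          # population std (ddof=0)
z_train = (x_train - mu) / sigma

# New data must reuse the training statistics, not its own.
x_new = np.array([2000.0, 3500.0])
z_new = (x_new - mu) / sigma

print("train:", np.round(z_train, 3))
print("new:  ", np.round(z_new, 3))
```

Note that pandas' `Series.std()` defaults to `ddof=1`, so a hand-rolled pandas version gives slightly different values unless `ddof=0` is passed.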

**2. Min-Max Normalization:**

Scales features to a fixed range [0, 1] (or custom range).

$$x_{\text{scaled}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

**When to Use:**
- Bounded ranges needed (e.g., neural networks with sigmoid activation)
- Data doesn't follow normal distribution
- No significant outliers (outliers compress the scale)
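
The last point is easy to see numerically. In this sketch (made-up values) a single outlier squeezes all other points into the bottom few percent of the range.

```python
import numpy as np


def min_max(x):
    """Scale a 1-D array to [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())


clean = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
with_outlier = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

scaled_clean = min_max(clean)
scaled_outlier = min_max(with_outlier)

print(np.round(scaled_clean, 3))
print(np.round(scaled_outlier, 3))
```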

**3. Robust Scaling:**

Uses median and interquartile range (IQR), robust to outliers.

$$x_{\text{scaled}} = \frac{x - \text{median}(x)}{\text{IQR}(x)}$$

Where: $\text{IQR} = Q_3 - Q_1$ (75th percentile - 25th percentile)

**When to Use:**
- Data contains significant outliers
- Median is more representative than mean
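
A hand computation of the robust formula on made-up values with one outlier. The quartiles use NumPy's default linear interpolation.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1

x_robust = (x - median) / iqr
print(np.round(x_robust, 3))
```

The four regular points keep an evenly spaced spread between -1 and 0.5, while the outlier stays clearly separated at 48.5 instead of compressing the rest.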

**4. Log Transformation:**

Applies logarithm to compress large values.

$$x_{\text{scaled}} = \log(x + 1)$$

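(Adding 1 keeps the transform defined at $x = 0$; it still requires $x > -1$, so negative values need a shift or a different transform.)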

**When to Use:**
- Highly skewed distributions (right-skewed)
- Power-law relationships
- Making multiplicative relationships additive
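
As a rough illustration, the sketch below applies `log1p` to synthetic right-skewed data (a log-normal sample chosen for the example) and compares sample skewness before and after.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=5000))  # right-skewed, non-negative

x_log = np.log1p(x)          # log(1 + x)
restored = np.expm1(x_log)   # exact inverse: exp(x) - 1

print(f"skew before: {x.skew():.2f}")
print(f"skew after:  {x_log.skew():.2f}")
```

The transform reduces the skew but does not necessarily remove it, so it is worth re-checking the distribution afterwards.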

### Decision Guide

```
Does data have outliers?
├─ YES: Use Robust Scaler
└─ NO: Is data normally distributed?
       ├─ YES: Use Standard Scaler
       └─ NO: Is data highly skewed?
              ├─ YES: Use Log Transform + Standard Scaler
              └─ NO: Use Min-Max Scaler
```

In [None]:
# Feature Scaling Implementation

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from scipy import stats

class FeatureScaler:
    """
    Comprehensive feature scaling toolkit.
    
    Supports multiple scaling methods:
    - Standard scaling (z-score normalization)
    - Min-Max scaling (0-1 normalization)
    - Robust scaling (median and IQR)
    - Log transformation
    """
    
    def __init__(self, method='standard'):
        """
        Initialize with scaling method.
        
        Parameters:
        -----------
        method : str
            Scaling method: 'standard', 'minmax', 'robust', 'log'
        """
        self.method = method
        self.scaler = None
        self.feature_names = None
        
    def fit(self, X, feature_names=None):
        """
        Fit scaler to data.
        
        Parameters:
        -----------
        X : array-like or DataFrame
            Training data
        feature_names : list
            Names of features (for DataFrame output)
        """
        if isinstance(X, pd.DataFrame):
            self.feature_names = X.columns.tolist()
            X = X.values
        else:
            self.feature_names = feature_names
        
        if self.method == 'standard':
            self.scaler = StandardScaler()
        elif self.method == 'minmax':
            self.scaler = MinMaxScaler()
        elif self.method == 'robust':
            self.scaler = RobustScaler()
        elif self.method == 'log':
            self.scaler = None  # Log transform doesn't need fitting
        else:
            raise ValueError(f"Unknown scaling method: {self.method}")
        
        if self.scaler is not None:
            self.scaler.fit(X)
        
        return self
    
    def transform(self, X):
        """
        Transform data using fitted scaler.
        
        Parameters:
        -----------
        X : array-like or DataFrame
            Data to transform
        
        Returns:
        --------
        DataFrame or array with scaled features
        """
        is_dataframe = isinstance(X, pd.DataFrame)
        
        if is_dataframe:
            X_values = X.values
        else:
            X_values = X
        
        if self.method == 'log':
            X_scaled = np.log1p(X_values)  # log(1 + x) to handle zeros
        else:
            X_scaled = self.scaler.transform(X_values)
        
        if is_dataframe or self.feature_names is not None:
            feature_names = self.feature_names if self.feature_names else X.columns
            return pd.DataFrame(X_scaled, columns=feature_names, index=X.index if is_dataframe else None)
        
        return X_scaled
    
    def fit_transform(self, X, feature_names=None):
        """
        Fit scaler and transform data in one step.
        """
        return self.fit(X, feature_names).transform(X)
    
    def inverse_transform(self, X_scaled):
        """
        Inverse transform scaled data back to original scale.
        """
        if self.method == 'log':
            return np.expm1(X_scaled)  # exp(x) - 1
        
        is_dataframe = isinstance(X_scaled, pd.DataFrame)
        
        if is_dataframe:
            X_values = X_scaled.values
        else:
            X_values = X_scaled
        
        X_original = self.scaler.inverse_transform(X_values)
        
        if is_dataframe or self.feature_names is not None:
            feature_names = self.feature_names if self.feature_names else X_scaled.columns
            return pd.DataFrame(X_original, columns=feature_names, 
                              index=X_scaled.index if is_dataframe else None)
        
        return X_original
    
    def get_scaling_params(self):
        """
        Get scaling parameters for inspection.
        
        Returns:
        --------
        Dictionary with scaling parameters
        """
        if self.method == 'log':
            return {'method': 'log', 'note': 'No parameters (element-wise transform)'}
        
        params = {'method': self.method}
        
        if hasattr(self.scaler, 'mean_'):
            params['mean'] = self.scaler.mean_
        if hasattr(self.scaler, 'scale_'):
            params['scale'] = self.scaler.scale_
        if hasattr(self.scaler, 'center_'):
            params['center'] = self.scaler.center_
        if hasattr(self.scaler, 'data_min_'):
            params['data_min'] = self.scaler.data_min_
        if hasattr(self.scaler, 'data_max_'):
            params['data_max'] = self.scaler.data_max_
        
        return params
    
    def visualize_scaling_effect(self, X_original, X_scaled, feature_idx=0):
        """
        Visualize the effect of scaling on a feature.
        
        Parameters:
        -----------
        X_original : array-like
            Original data before scaling
        X_scaled : array-like
            Scaled data after transformation
        feature_idx : int
            Index of feature to visualize
        """
        if isinstance(X_original, pd.DataFrame):
            feature_name = X_original.columns[feature_idx]
            X_orig_values = X_original.iloc[:, feature_idx].values
        else:
            feature_name = f"Feature {feature_idx}"
            X_orig_values = X_original[:, feature_idx]
        
        if isinstance(X_scaled, pd.DataFrame):
            X_scaled_values = X_scaled.iloc[:, feature_idx].values
        else:
            X_scaled_values = X_scaled[:, feature_idx]
        
        fig, axes = plt.subplots(2, 2, figsize=(14, 10))
        
        # Original distribution
        axes[0, 0].hist(X_orig_values, bins=50, alpha=0.7, color='blue', edgecolor='black')
        axes[0, 0].set_title(f'{feature_name} - Original Distribution')
        axes[0, 0].set_xlabel('Value')
        axes[0, 0].set_ylabel('Frequency')
        axes[0, 0].axvline(np.mean(X_orig_values), color='red', linestyle='--', label='Mean')
        axes[0, 0].axvline(np.median(X_orig_values), color='green', linestyle='--', label='Median')
        axes[0, 0].legend()
        
        # Scaled distribution
        axes[0, 1].hist(X_scaled_values, bins=50, alpha=0.7, color='orange', edgecolor='black')
        axes[0, 1].set_title(f'{feature_name} - Scaled Distribution ({self.method})')
        axes[0, 1].set_xlabel('Scaled Value')
        axes[0, 1].set_ylabel('Frequency')
        axes[0, 1].axvline(np.mean(X_scaled_values), color='red', linestyle='--', label='Mean')
        axes[0, 1].axvline(np.median(X_scaled_values), color='green', linestyle='--', label='Median')
        axes[0, 1].legend()
        
        # Original box plot
        axes[1, 0].boxplot(X_orig_values, vert=False)
        axes[1, 0].set_title(f'{feature_name} - Original Box Plot')
        axes[1, 0].set_xlabel('Value')
        
        # Scaled box plot
        axes[1, 1].boxplot(X_scaled_values, vert=False)
        axes[1, 1].set_title(f'{feature_name} - Scaled Box Plot')
        axes[1, 1].set_xlabel('Scaled Value')
        
        plt.tight_layout()
        plt.show()
        
        # Print statistics
        print(f"\n{feature_name} Statistics:")
        print(f"\nOriginal:")
        print(f"  Mean: {np.mean(X_orig_values):.4f}")
        print(f"  Std: {np.std(X_orig_values):.4f}")
        print(f"  Min: {np.min(X_orig_values):.4f}")
        print(f"  Max: {np.max(X_orig_values):.4f}")
        
        print(f"\nScaled ({self.method}):")
        print(f"  Mean: {np.mean(X_scaled_values):.4f}")
        print(f"  Std: {np.std(X_scaled_values):.4f}")
        print(f"  Min: {np.min(X_scaled_values):.4f}")
        print(f"  Max: {np.max(X_scaled_values):.4f}")

print("✅ FeatureScaler class implemented")
print("\nSupported Scaling Methods:")
print("- 'standard': Z-score normalization (mean=0, std=1)")
print("- 'minmax': Scale to [0, 1] range")
print("- 'robust': Use median and IQR (robust to outliers)")
print("- 'log': Log transformation (for skewed distributions)")

## 📝 Section 3: Categorical Encoding

### Why Encode Categorical Variables?

**Problem:** Most machine learning algorithms require numeric input, but real-world data contains categorical features (device_type, wafer_lot, test_site, etc.).

**Example Categorical Features:**
- **Nominal**: No inherent order (color: red, blue, green)
- **Ordinal**: Natural order (temperature: low, medium, high)
- **High Cardinality**: Many unique values (product_id: 1000+ categories)

### Encoding Methods

**1. One-Hot Encoding (Dummy Variables):**

Creates one binary column per category.

**Original:**
```
| Color  |
|--------|
| Red    |
| Blue   |
| Green  |
```

**One-Hot Encoded:**
```
| Color_Red | Color_Blue | Color_Green |
|-----------|------------|-------------|
| 1         | 0          | 0           |
| 0         | 1          | 0           |
| 0         | 0          | 1           |
```

**Mathematical Representation:**
For category $c_i$ in feature $C$ with $k$ categories:

$$\text{OneHot}(c_i) = [0, 0, ..., 1, ..., 0] \in \{0,1\}^k$$

Where 1 is at position $i$.

**When to Use:**
- Nominal categories (no order)
- Low cardinality (< 15 categories)
- Tree-based models or linear models

**Limitations:**
- High dimensionality with many categories
- Sparse matrices (mostly zeros)
- Multicollinearity in linear models (drop one column)
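
A short pandas sketch of one-hot encoding, dropping one column for linear models, and aligning new data to the training columns. The `color` values and the unseen category `Yellow` are made up.

```python
import pandas as pd

train = pd.DataFrame({"color": ["Red", "Blue", "Green", "Blue"]})

full = pd.get_dummies(train, columns=["color"], dtype=int)
reduced = pd.get_dummies(train, columns=["color"], drop_first=True, dtype=int)

# New data may lack some categories or contain unseen ones;
# reindex to the training columns so the feature layout stays fixed.
new = pd.DataFrame({"color": ["Red", "Yellow"]})
new_encoded = pd.get_dummies(new, columns=["color"], dtype=int)
new_aligned = new_encoded.reindex(columns=full.columns, fill_value=0)

print(full)
print(reduced)
print(new_aligned)
```

With this alignment an unseen category becomes an all-zero row, which is one possible convention rather than the only one.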

**2. Label Encoding (Integer Encoding):**

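Maps each category to an integer, e.g. {Red: 0, Blue: 1, Green: 2}. (scikit-learn's `LabelEncoder` assigns codes in sorted order, so it would produce {Blue: 0, Green: 1, Red: 2}.)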

$$\text{LabelEncode}(c_i) = i \in \{0, 1, ..., k-1\}$$

**When to Use:**
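- Ordinal categories (temperature: low < medium < high), as long as the integer codes follow the real order; an alphabetical mapping would give high < low < medium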
- Tree-based models (can handle integers naturally)
- Target variable in classification

**Limitations:**
- Implies ordering (Red < Blue < Green) when none exists
- Linear models interpret as numeric distance
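
For ordinal features the integer order has to match the real order. The sketch below contrasts `LabelEncoder`, which sorts class labels alphabetically, with an explicit ordered categorical; the temperature levels are the example from the text.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

temps = ["low", "high", "medium", "low"]

# LabelEncoder sorts classes: high=0, low=1, medium=2 (order is lost).
alphabetical = LabelEncoder().fit_transform(temps)

# Explicit order: low=0, medium=1, high=2.
order = ["low", "medium", "high"]
ordered = pd.Categorical(temps, categories=order, ordered=True).codes

print("alphabetical:", alphabetical.tolist())
print("ordered:     ", ordered.tolist())
```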

**3. Target Encoding (Mean Encoding):**

Replaces each category with the mean of the target variable for that category.

$$\text{TargetEncode}(c_i) = \frac{1}{n_i} \sum_{j: x_j = c_i} y_j$$

Where:
- $n_i$ = count of samples with category $c_i$
- $y_j$ = target value for sample $j$

**Example:**
```
| Device_Type | Yield (target) |
|-------------|----------------|
| A           | 85%            |
| A           | 90%            |
| B           | 70%            |
| B           | 75%            |
```

Target Encoding: {A: 87.5%, B: 72.5%}

**When to Use:**
- High cardinality features (many categories)
- Strong correlation between category and target
- Gradient boosting models (XGBoost, LightGBM)

**Limitations:**
- Risk of overfitting (use smoothing or cross-validation)
- Data leakage if not done carefully
- Requires target variable (supervised only)
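
One common way to reduce the overfitting risk is to shrink each category mean toward the global mean. The sketch below uses the yield example above with an additive smoothing formula, $(n_i \bar{y}_i + m \bar{y}) / (n_i + m)$; the smoothing strength `m` and the column names are assumptions, and other smoothing schemes exist.

```python
import pandas as pd

train = pd.DataFrame({
    "device_type": ["A", "A", "B", "B"],
    "yield_pct": [85.0, 90.0, 70.0, 75.0],
})

m = 2.0  # smoothing strength (hypothetical choice)
global_mean = train["yield_pct"].mean()
stats = train.groupby("device_type")["yield_pct"].agg(["mean", "count"])

plain = stats["mean"]
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)

# Apply the mapping learned on training data; unseen categories get the global mean.
new = pd.DataFrame({"device_type": ["A", "B", "C"]})
new["device_type_te"] = new["device_type"].map(smoothed).fillna(global_mean)

print(plain.to_dict())
print(smoothed.to_dict())
print(new)
```

To limit leakage, the mapping should be computed on training rows only (or out-of-fold within cross-validation) and then applied to validation and test rows.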

**4. Frequency Encoding:**

Replaces each category with its relative frequency in the training data.

$$\text{FreqEncode}(c_i) = \frac{n_i}{n}$$
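
A minimal pandas sketch of this formula, with made-up lot IDs. Treating unseen categories as frequency 0 is one possible convention.

```python
import pandas as pd

lots = pd.Series(["L1", "L1", "L2", "L3"])

freq = lots.value_counts(normalize=True)   # n_i / n per category
encoded = lots.map(freq)

# Categories not seen during fitting have no frequency; fill with 0 here.
new_encoded = pd.Series(["L1", "L9"]).map(freq).fillna(0.0)

print(encoded.tolist())
print(new_encoded.tolist())
```

Categories that occur equally often (here `L2` and `L3`) receive the same code, so the encoding cannot distinguish them.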

**5. Binary Encoding:**

Converts each category's integer code to binary digits, creating roughly log₂(k) columns (⌈log₂(k)⌉ in the simplest scheme) instead of k.
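
The idea can be sketched by hand: assign each category an integer code, then spread the code's bits over columns. The helper `binary_encode` below is a hypothetical illustration, not the API of any encoding library, and it assumes the column has no missing values.

```python
import numpy as np
import pandas as pd


def binary_encode(series: pd.Series) -> pd.DataFrame:
    """Encode categories as the binary digits of their integer codes."""
    codes, uniques = pd.factorize(series)              # codes in order of first appearance
    n_bits = max(1, (len(uniques) - 1).bit_length())   # ceil(log2(k)) for k >= 2
    shifts = np.arange(n_bits - 1, -1, -1)             # most significant bit first
    bits = (codes[:, None] >> shifts) & 1
    cols = [f"{series.name}_bit{i}" for i in range(n_bits)]
    return pd.DataFrame(bits, columns=cols, index=series.index)


lots = pd.Series(["A", "B", "C", "D", "E", "A"], name="lot")
encoded = binary_encode(lots)
print(encoded)
```

Five categories fit into three columns here, compared with five columns for one-hot encoding.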

**Decision Guide:**

```
What type of categorical feature?
├─ Ordinal (ordered): Use Label Encoding
└─ Nominal (no order):
    ├─ Low cardinality (< 15): Use One-Hot Encoding
    └─ High cardinality (≥ 15):
        ├─ Tree-based model: Target Encoding
        └─ Linear model: One-Hot + dimensionality reduction
```

In [None]:
# Categorical Encoding Implementation

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from category_encoders import TargetEncoder  # pip install category-encoders

class CategoricalEncoder:
    """
    Comprehensive categorical encoding toolkit.
    
    Supports multiple encoding methods:
    - One-hot encoding (binary columns)
    - Label encoding (integer mapping)
    - Target encoding (mean of target per category)
    - Frequency encoding (category occurrence rate)
    """
    
    def __init__(self, method='onehot'):
        """
        Initialize with encoding method.
        
        Parameters:
        -----------
        method : str
            Encoding method: 'onehot', 'label', 'target', 'frequency'
        """
        self.method = method
        self.encoders = {}
        self.encoded_columns = []
        
    def fit(self, df, categorical_cols, target_col=None):
        """
        Fit encoders to categorical columns.
        
        Parameters:
        -----------
        df : DataFrame
            Training data
        categorical_cols : list
            Names of categorical columns to encode
        target_col : str
            Target column name (required for target encoding)
        """
        self.categorical_cols = categorical_cols
        self.target_col = target_col
        
        for col in categorical_cols:
            if self.method == 'label':
                encoder = LabelEncoder()
                encoder.fit(df[col])
                self.encoders[col] = encoder
                
            elif self.method == 'target':
                if target_col is None:
                    raise ValueError("target_col required for target encoding")
                encoder = TargetEncoder()
                encoder.fit(df[col], df[target_col])
                self.encoders[col] = encoder
                
            elif self.method == 'frequency':
                # Store frequency mapping
                freq_map = df[col].value_counts(normalize=True).to_dict()
                self.encoders[col] = freq_map
        
        return self
    
    def transform(self, df):
        """
        Transform categorical columns using fitted encoders.
        
        Parameters:
        -----------
        df : DataFrame
            Data to encode
        
        Returns:
        --------
        DataFrame with encoded categorical features
        """
        df_encoded = df.copy()
        
        if self.method == 'onehot':
            # One-hot encoding using pandas get_dummies
            df_encoded = pd.get_dummies(df_encoded, columns=self.categorical_cols, 
                                       prefix=self.categorical_cols, drop_first=False)
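            # Match on the '<col>_' prefix; a plain substring test would also
            # pick up unrelated columns whose names merely contain the category name
            self.encoded_columns = [col for col in df_encoded.columns 
                                   if any(col.startswith(f'{cat}_') for cat in self.categorical_cols)]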
        
        elif self.method == 'label':
            for col in self.categorical_cols:
                encoder = self.encoders[col]
                df_encoded[col] = encoder.transform(df[col])
        
        elif self.method == 'target':
            for col in self.categorical_cols:
                encoder = self.encoders[col]
                df_encoded[col] = encoder.transform(df[col])
        
        elif self.method == 'frequency':
            for col in self.categorical_cols:
                freq_map = self.encoders[col]
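                # Unseen categories map to NaN and are set to 0. The result is
                # assigned back directly because inplace fillna on a column
                # selection is not applied reliably in recent pandas.
                df_encoded[col] = df[col].map(freq_map).fillna(0)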
        
        return df_encoded
    
    def fit_transform(self, df, categorical_cols, target_col=None):
        """
        Fit encoders and transform data in one step.
        """
        return self.fit(df, categorical_cols, target_col).transform(df)
    
    def inverse_transform(self, df, encoded_cols=None):
        """
        Inverse transform encoded data back to original categories.
        Only works for label encoding.
        """
        if self.method != 'label':
            raise ValueError("Inverse transform only supported for label encoding")
        
        df_decoded = df.copy()
        
        for col in self.categorical_cols:
            if col in df_decoded.columns:
                encoder = self.encoders[col]
                df_decoded[col] = encoder.inverse_transform(df_decoded[col].astype(int))
        
        return df_decoded
    
    def get_encoding_info(self):
        """
        Get information about encodings applied.
        
        Returns:
        --------
        Dictionary with encoding details
        """
        info = {
            'method': self.method,
            'categorical_columns': self.categorical_cols,
            'encoders': {}
        }
        
        for col in self.categorical_cols:
            if self.method == 'label':
                encoder = self.encoders[col]
                info['encoders'][col] = {
                    'classes': encoder.classes_.tolist(),
                    'n_classes': len(encoder.classes_)
                }
            elif self.method == 'frequency':
                info['encoders'][col] = self.encoders[col]
        
        if self.method == 'onehot':
            info['encoded_columns'] = self.encoded_columns
            info['n_new_columns'] = len(self.encoded_columns)
        
        return info
    
    def visualize_encoding(self, df_original, df_encoded, col_name):
        """
        Visualize the effect of encoding on a categorical column.
        
        Parameters:
        -----------
        df_original : DataFrame
            Original data before encoding
        df_encoded : DataFrame
            Encoded data after transformation
        col_name : str
            Name of categorical column to visualize
        """
        fig, axes = plt.subplots(1, 2, figsize=(14, 5))
        
        # Original categories distribution
        value_counts = df_original[col_name].value_counts()
        axes[0].bar(range(len(value_counts)), value_counts.values, color='skyblue', edgecolor='black')
        axes[0].set_xticks(range(len(value_counts)))
        axes[0].set_xticklabels(value_counts.index, rotation=45, ha='right')
        axes[0].set_title(f'{col_name} - Original Categories')
        axes[0].set_xlabel('Category')
        axes[0].set_ylabel('Count')
        axes[0].grid(axis='y', alpha=0.3)
        
        # Encoded representation
        if self.method == 'onehot':
            # Show one-hot encoded columns
            encoded_cols = [c for c in df_encoded.columns if c.startswith(f'{col_name}_')]
            encoded_sample = df_encoded[encoded_cols].head(10)
            
            sns.heatmap(encoded_sample.T, cmap='YlOrRd', cbar=True, 
                       linewidths=0.5, linecolor='gray', ax=axes[1])
            axes[1].set_title(f'{col_name} - One-Hot Encoded (First 10 Samples)')
            axes[1].set_xlabel('Sample Index')
            axes[1].set_ylabel('Encoded Columns')
        
        else:
            # Show encoded values distribution
            axes[1].hist(df_encoded[col_name], bins=30, color='coral', edgecolor='black', alpha=0.7)
            axes[1].set_title(f'{col_name} - Encoded Values ({self.method})')
            axes[1].set_xlabel('Encoded Value')
            axes[1].set_ylabel('Frequency')
            axes[1].grid(axis='y', alpha=0.3)
        
        plt.tight_layout()
        plt.show()
        
        # Print encoding mapping
        print(f"\n{col_name} Encoding Details:")
        if self.method == 'label':
            encoder = self.encoders[col_name]
            mapping = dict(zip(encoder.classes_, range(len(encoder.classes_))))
            print(f"Label Mapping: {mapping}")
        elif self.method == 'frequency':
            print(f"Frequency Mapping: {self.encoders[col_name]}")
        elif self.method == 'onehot':
            encoded_cols = [c for c in df_encoded.columns if c.startswith(f'{col_name}_')]
            print(f"Created {len(encoded_cols)} binary columns: {encoded_cols[:5]}...")

print("✅ CategoricalEncoder class implemented")
print("\nSupported Encoding Methods:")
print("- 'onehot': Binary columns for each category")
print("- 'label': Integer mapping (0, 1, 2, ...)")
print("- 'target': Mean of target variable per category")
print("- 'frequency': Category occurrence frequency")

## 📝 Section 4: Feature Creation and Engineering

### Why Create New Features?

**Core Principle:** Raw features rarely represent the underlying patterns optimally. Creating derived features helps models learn relationships more easily.

**Types of Feature Creation:**

### 1. Polynomial Features (Interactions and Powers)

**Interactions:** Combine two features to capture their joint effect.

$$\text{Interaction}(x_1, x_2) = x_1 \times x_2$$

**Example (Semiconductor):**
- $x_1$ = Voltage (VDD)
- $x_2$ = Frequency
- Interaction = VDD × Frequency (captures dynamic power relationship)

**Powers:** Capture non-linear relationships.

$$\text{Polynomial}(x, d) = [x, x^2, x^3, ..., x^d]$$

**Full Polynomial Features (degree 2):**
For features $[x_1, x_2]$, creates: $[1, x_1, x_2, x_1^2, x_1 x_2, x_2^2]$

**When to Use:**
- Non-linear relationships in linear models
- Interactions between features are important
- Sufficient data to avoid overfitting (curse of dimensionality)

**Trade-off:**
- Degree 2 with 10 features → 66 features (including the constant term)
- Degree 3 with 10 features → 286 features (including the constant term)
- Can lead to overfitting with limited data
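
These counts follow from the number of monomials of total degree at most $d$ in $n$ variables, $\binom{n+d}{d}$, which includes the constant term. A small standard-library check (the function names are hypothetical):

```python
from itertools import combinations_with_replacement
from math import comb


def n_poly_features(n_features: int, degree: int, include_bias: bool = True) -> int:
    """Number of polynomial terms of total degree <= `degree`."""
    total = comb(n_features + degree, degree)
    return total if include_bias else total - 1


def poly_terms(names, degree):
    """List the terms explicitly, lowest degree first."""
    terms = []
    for d in range(degree + 1):
        for combo in combinations_with_replacement(names, d):
            terms.append(" ".join(combo) if combo else "1")
    return terms


print(n_poly_features(10, 2), n_poly_features(10, 3))
print(poly_terms(["x1", "x2"], 2))
```

Passing `include_bias=False`, as the implementation in this section does with `PolynomialFeatures`, gives one fewer column in each case.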

### 2. Domain-Specific Features

**Definition:** Features created using domain expertise and physical/business understanding.

**Semiconductor Testing Examples:**

**Power Efficiency:**
$$\text{Efficiency} = \frac{\text{Frequency}}{\text{Power}}$$

Higher efficiency = better performance per watt

**Power Density:**
$$\text{Power Density} = \frac{\text{Power}}{\text{VDD}}$$

Power normalized by supply voltage (dimensionally an effective supply current)

**Leakage Current (Normalized):**
$$\text{Leakage Norm} = \frac{\text{IDD}_{\text{static}}}{\text{VDD}}$$

Conductance-like metric for leakage (current per volt, the inverse of a resistance)

**Spatial Distance:**
$$\text{Radial Distance} = \sqrt{x^2 + y^2}$$

Distance from wafer center, assuming die coordinates $(x, y)$ are measured from the center (edge dies often behave differently)

**Temperature Coefficient:**
$$\text{Temp Coeff} = \frac{\text{Performance}}{\text{Temperature}}$$

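Performance per degree at a single operating point. To measure how performance changes with temperature, use the slope $\Delta\text{Performance} / \Delta\text{Temperature}$ between two temperatures instead.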
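
Several of these ratios can be computed in one pass with pandas. The column names, units, and values below are assumptions for illustration, and die coordinates are taken to be relative to the wafer center.

```python
import numpy as np
import pandas as pd

dies = pd.DataFrame({
    "frequency_mhz": [2000.0, 3000.0],
    "power_w": [10.0, 20.0],
    "vdd_v": [1.0, 1.2],
    "idd_static_ma": [5.0, 6.0],
    "die_x": [3.0, 0.0],   # offset from wafer center
    "die_y": [4.0, 0.0],
})

features = dies.assign(
    efficiency=dies["frequency_mhz"] / dies["power_w"],
    power_per_volt=dies["power_w"] / dies["vdd_v"],
    leakage_norm=dies["idd_static_ma"] / dies["vdd_v"],
    radial_distance=np.hypot(dies["die_x"], dies["die_y"]),
)

print(features[["efficiency", "power_per_volt", "leakage_norm", "radial_distance"]])
```

A zero denominator produces `inf` in these ratios, so real data usually needs a guard such as replacing infinities with `NaN` before modeling.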

### 3. Statistical Aggregations

**Group-Level Features:** Aggregate statistics within groups.

**Example:** For dies on the same wafer:
- Mean VDD per wafer
- Std deviation of frequency per wafer
- Deviation from wafer mean

$$\text{Deviation from Wafer Mean} = x_{\text{die}} - \bar{x}_{\text{wafer}}$$

**Why This Matters:**
- Captures process variation patterns
- Identifies outlier dies within wafer
- Helps predict wafer-level yield
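
Group-level features like these map directly onto `groupby(...).transform(...)`, which returns one value per original row. The column names `wafer_id` and `vdd` and the values are made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "wafer_id": ["W1", "W1", "W2", "W2", "W2"],
    "vdd": [1.0, 1.2, 0.9, 0.9, 1.2],
})

grouped = df.groupby("wafer_id")["vdd"]
df["vdd_wafer_mean"] = grouped.transform("mean")
df["vdd_wafer_std"] = grouped.transform("std")
df["vdd_dev_from_wafer"] = df["vdd"] - df["vdd_wafer_mean"]

print(df)
```

If the aggregated column is the prediction target, or closely derived from it, the group statistics should be computed on training rows only.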

### 4. Temporal Features (Time Series)

**From timestamps, extract:**
- Hour of day, day of week, month
- Days since reference event
- Rolling statistics (moving average, std)
- Lag features (previous N values)
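
A compact pandas sketch of these extractions on a made-up daily series. The rolling mean is computed on shifted values so that each row only uses earlier observations; that choice is an assumption suited to forecasting setups.

```python
import pandas as pd

ts = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01 08:00", periods=5, freq="D"),
    "value": [1.0, 2.0, 3.0, 4.0, 5.0],
})

ts["hour"] = ts["timestamp"].dt.hour
ts["day_of_week"] = ts["timestamp"].dt.dayofweek   # Monday = 0
ts["month"] = ts["timestamp"].dt.month
ts["days_since_start"] = (ts["timestamp"] - ts["timestamp"].iloc[0]).dt.days

ts["lag_1"] = ts["value"].shift(1)                                   # previous value
ts["rolling_mean_3"] = ts["value"].shift(1).rolling(window=3).mean() # mean of the 3 previous values

print(ts)
```

The first rows of the lag and rolling columns are `NaN` because there is not enough history yet; they need imputing or dropping.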

### 5. Text Features

**From text data, extract:**
- Text length, word count
- Presence of keywords
- Sentiment scores
- TF-IDF vectors
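
The simpler items in this list need only pandas string methods. The sketch below covers length, word count, and a keyword flag on made-up test notes; sentiment scores and TF-IDF require additional libraries and are left out.

```python
import pandas as pd

notes = pd.Series([
    "Test FAIL at site 3",
    "pass",
    "retest needed: contact failure",
])

text_features = pd.DataFrame({
    "char_count": notes.str.len(),
    "word_count": notes.str.split().str.len(),
    "mentions_fail": notes.str.contains("fail", case=False),
})

print(text_features)
```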

### Feature Creation Strategy

```mermaid
graph TD
    A[Raw Features] --> B{Domain Knowledge?}
    B -->|Yes| C[Create Domain Features]
    B -->|No| D[Try Polynomial Features]
    C --> E[Check Correlation]
    D --> E
    E --> F{New Features Useful?}
    F -->|Yes| G[Keep Features]
    F -->|No| H[Discard Features]
    G --> I[Feature Selection]
```

**Best Practices:**
1. **Start Simple:** Linear combinations first
2. **Use Domain Knowledge:** Physics, business logic
3. **Validate Usefulness:** Check correlation with target
4. **Regularization:** Use L1/L2 to handle many features
5. **Feature Importance:** Remove low-importance features

In [None]:
# Feature Creation Implementation

from sklearn.preprocessing import PolynomialFeatures
from itertools import combinations

class FeatureCreator:
    """
    Comprehensive feature creation toolkit.
    
    Supports:
    - Polynomial features (powers and interactions)
    - Domain-specific features (custom recipes)
    - Statistical aggregations (group-level features)
    - Mathematical transformations (log, sqrt, etc.)
    """
    
    def __init__(self):
        """Initialize feature creator."""
        self.created_features = []
        self.poly_features = None
        
    def create_polynomial_features(self, df, numeric_cols, degree=2, interaction_only=False):
        """
        Create polynomial features (powers and interactions).
        
        Parameters:
        -----------
        df : DataFrame
            Input data
        numeric_cols : list
            Numeric columns to create polynomial features from
        degree : int
            Polynomial degree (2 = squares and interactions, 3 = cubes, etc.)
        interaction_only : bool
            If True, only create interaction terms (no powers)
        
        Returns:
        --------
        DataFrame with original + polynomial features
        """
        # Extract numeric data
        X = df[numeric_cols].values
        
        # Create polynomial features
        poly = PolynomialFeatures(degree=degree, interaction_only=interaction_only, 
                                  include_bias=False)
        X_poly = poly.fit_transform(X)
        
        # Get feature names
        feature_names = poly.get_feature_names_out(numeric_cols)
        
        # Create DataFrame with polynomial features
        df_poly = pd.DataFrame(X_poly, columns=feature_names, index=df.index)
        
        # Store polynomial transformer
        self.poly_features = poly
        
        # Combine with original dataframe (keep non-numeric columns)
        non_numeric_cols = [col for col in df.columns if col not in numeric_cols]
        df_result = pd.concat([df[non_numeric_cols], df_poly], axis=1)
        
        # Track created features
        new_features = [f for f in feature_names if f not in numeric_cols]
        self.created_features.extend(new_features)
        
        print(f"✅ Created {len(new_features)} polynomial features (degree={degree})")
        print(f"   Original: {len(numeric_cols)} → New: {len(feature_names)}")
        
        return df_result
    
    def create_domain_features(self, df, recipes):
        """
        Create domain-specific features using custom recipes.
        
        Parameters:
        -----------
        df : DataFrame
            Input data
        recipes : dict
            Dictionary mapping new feature names to lambda functions
            Example: {'efficiency': lambda x: x['freq'] / x['power']}
        
        Returns:
        --------
        DataFrame with original + domain features
        """
        df_result = df.copy()
        
        for feature_name, recipe_func in recipes.items():
            try:
                df_result[feature_name] = recipe_func(df)
                self.created_features.append(feature_name)
                print(f"✅ Created domain feature: {feature_name}")
            except Exception as e:
                print(f"❌ Failed to create {feature_name}: {e}")
        
        return df_result
    
    def create_aggregation_features(self, df, group_col, agg_cols, agg_funcs=['mean', 'std', 'min', 'max']):
        """
        Create group-level aggregation features.
        
        Parameters:
        -----------
        df : DataFrame
            Input data
        group_col : str
            Column to group by (e.g., 'wafer_id')
        agg_cols : list
            Columns to aggregate
        agg_funcs : list
            Aggregation functions to apply
        
        Returns:
        --------
        DataFrame with original + aggregation features
        """
        df_result = df.copy()
        
        for col in agg_cols:
            for func in agg_funcs:
                # Create aggregation
                agg_name = f'{col}_{func}_by_{group_col}'
                agg_values = df.groupby(group_col)[col].transform(func)
                df_result[agg_name] = agg_values
                
                self.created_features.append(agg_name)
                
                # Also create deviation from group mean
                if func == 'mean':
                    dev_name = f'{col}_deviation_from_{group_col}'
                    df_result[dev_name] = df[col] - agg_values
                    self.created_features.append(dev_name)
        
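        # Each 'mean' aggregation also adds one deviation feature per column
        n_deviation = len(agg_cols) if 'mean' in agg_funcs else 0
        n_created = len(agg_cols) * len(agg_funcs) + n_deviation
        print(f"✅ Created {n_created} aggregation features")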
        print(f"   Grouped by: {group_col}")
        
        return df_result
    
    def create_interaction_features(self, df, col_pairs):
        """
        Create interaction features for specific column pairs.
        
        Parameters:
        -----------
        df : DataFrame
            Input data
        col_pairs : list of tuples
            Pairs of columns to create interactions
            Example: [('voltage', 'frequency'), ('power', 'temp')]
        
        Returns:
        --------
        DataFrame with original + interaction features
        """
        df_result = df.copy()
        
        for col1, col2 in col_pairs:
            # Multiplication interaction
            mult_name = f'{col1}_x_{col2}'
            df_result[mult_name] = df[col1] * df[col2]
            self.created_features.append(mult_name)
            
            # Ratio features (both directions)
            ratio_name_1 = f'{col1}_div_{col2}'
            ratio_name_2 = f'{col2}_div_{col1}'
            
            # Avoid division by zero
            df_result[ratio_name_1] = df[col1] / (df[col2] + 1e-10)
            df_result[ratio_name_2] = df[col2] / (df[col1] + 1e-10)
            
            self.created_features.extend([ratio_name_1, ratio_name_2])
            
            print(f"✅ Created interactions for ({col1}, {col2}): multiply + 2 ratios")
        
        return df_result
    
    def create_mathematical_transforms(self, df, numeric_cols, transforms=['log', 'sqrt', 'square']):
        """
        Create mathematical transformations of numeric features.
        
        Parameters:
        -----------
        df : DataFrame
            Input data
        numeric_cols : list
            Columns to transform
        transforms : list
            Transformations to apply: 'log', 'sqrt', 'square', 'inverse'
        
        Returns:
        --------
        DataFrame with original + transformed features
        """
        df_result = df.copy()
        
        for col in numeric_cols:
            if 'log' in transforms:
                # Log transform (handle zeros and negatives)
                log_name = f'{col}_log'
                df_result[log_name] = np.log1p(df[col].clip(lower=0))
                self.created_features.append(log_name)
            
            if 'sqrt' in transforms:
                # Square root (handle negatives)
                sqrt_name = f'{col}_sqrt'
                df_result[sqrt_name] = np.sqrt(df[col].clip(lower=0))
                self.created_features.append(sqrt_name)
            
            if 'square' in transforms:
                # Square
                square_name = f'{col}_square'
                df_result[square_name] = df[col] ** 2
                self.created_features.append(square_name)
            
            if 'inverse' in transforms:
                # Inverse (handle zeros)
                inv_name = f'{col}_inverse'
                df_result[inv_name] = 1 / (df[col] + 1e-10)
                self.created_features.append(inv_name)
        
        print(f"✅ Created {len(numeric_cols) * len(transforms)} mathematical transform features")
        
        return df_result
    
    def get_created_features(self):
        """
        Get list of all created features.
        
        Returns:
        --------
        List of feature names created by this instance
        """
        return self.created_features
    
    def reset(self):
        """Reset created features list."""
        self.created_features = []
        self.poly_features = None

print("✅ FeatureCreator class implemented")
print("\nSupported Feature Creation Methods:")
print("- create_polynomial_features(): Powers and interactions")
print("- create_domain_features(): Custom recipes (domain knowledge)")
print("- create_aggregation_features(): Group-level statistics")
print("- create_interaction_features(): Specific column pair interactions")
print("- create_mathematical_transforms(): Log, sqrt, square, inverse")

## 📝 Section 5: Feature Selection Methods

### Why Feature Selection?

**Problems with Too Many Features:**
1. **Curse of Dimensionality:** Model performance degrades with too many features relative to samples
2. **Overfitting:** Model learns noise instead of patterns
3. **Computational Cost:** Training and inference become expensive
4. **Interpretability:** Harder to understand which features matter
5. **Multicollinearity:** Correlated features cause instability in linear models

**Goal:** Keep only the most informative features that contribute to prediction accuracy.

### Feature Selection Categories

1. **Filter Methods:** Use statistical measures that are independent of any model
2. **Wrapper Methods:** Use model performance to evaluate feature subsets
3. **Embedded Methods:** Perform feature selection during model training (e.g., L1 regularization)

### Filter Methods

#### A. Variance Threshold

**Principle:** Remove features with low variance (near-constant values).

$$\text{Variance}(x) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

**When to Use:**
- Quick preprocessing step
- Remove features that don't vary much across samples
- Binary features with 95%+ same value

**Example:**
- Feature with values [1, 1, 1, 1, 2, 1, 1, 1] → Low variance, likely uninformative
- Feature with values [1, 5, 2, 8, 3, 9, 1, 6] → High variance, potentially useful

**Sklearn Implementation:**
```python
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.01)  # Remove if variance < 0.01
```
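
To make the two example features concrete, the sketch below computes their population variances and applies `VarianceThreshold`. The threshold of 0.2 is an illustrative assumption, not a recommended default.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# The two example features from above, stored as the columns of one matrix
X = np.array([
    [1, 1], [1, 5], [1, 2], [1, 8],
    [2, 3], [1, 9], [1, 1], [1, 6],
], dtype=float)

# Population variance (divide by n), matching the formula above
variances = X.var(axis=0)
print(variances)  # [0.109375 8.484375]

# Features whose variance does not exceed the threshold are dropped
selector = VarianceThreshold(threshold=0.2)  # illustrative threshold
X_kept = selector.fit_transform(X)
kept_mask = selector.get_support()
print(kept_mask)  # [False  True]
```

The near-constant feature has a variance of about 0.11, so a threshold of 0.01 would still keep it. Because variance depends on each feature's scale, thresholds are usually chosen after inspecting the variances or rescaling the features.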

#### B. Correlation-Based Selection

**Principle:** Remove highly correlated features (redundant information).

**Pearson Correlation:**
$$r_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2 \sum(y_i - \bar{y})^2}}$$

Range: $r \in [-1, 1]$
- $r = 1$: Perfect positive correlation
- $r = -1$: Perfect negative correlation
- $r = 0$: No linear correlation

**Strategy:**
1. Calculate correlation matrix
2. For pairs with $|r| > 0.95$ (highly correlated), remove one feature
3. Keep feature with higher correlation to target (if available)

**When to Use:**
- Linear models (affected by multicollinearity)
- Many features with redundant information
- After creating polynomial features
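
The first two steps of the strategy above can be sketched with pandas alone. The data is synthetic and the feature names (`a`, `a_scaled`, `b`) are hypothetical. Choosing which member of a pair to keep based on the target (step 3) is left out for brevity.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
features = pd.DataFrame({
    "a": a,
    "a_scaled": 2.0 * a + rng.normal(scale=0.01, size=200),  # near-duplicate of a
    "b": rng.normal(size=200),                                # independent of a
})

# Absolute correlations; keep only the upper triangle so each pair appears once
corr = features.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop any column that is highly correlated with an earlier column
threshold = 0.95
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = features.drop(columns=to_drop)

print(to_drop)
print(list(reduced.columns))
```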

#### C. Mutual Information

**Principle:** Measure mutual dependence between feature and target.

**Definition:**
$$I(X; Y) = \sum_{x \in X} \sum_{y \in Y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}$$

Where:
- $I(X; Y)$ = mutual information between X and Y
- $p(x, y)$ = joint probability
- $p(x), p(y)$ = marginal probabilities

**Properties:**
- $I(X; Y) \geq 0$ (always non-negative)
- $I(X; Y) = 0$ if X and Y are independent
- Captures non-linear relationships (unlike correlation)

**When to Use:**
- Non-linear relationships between features and target
- Classification problems (discrete target)
- Ranking features by importance

**Sklearn Implementation:**
```python
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression
mi_scores = mutual_info_classif(X, y)  # Classification
mi_scores = mutual_info_regression(X, y)  # Regression
```
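
As a small illustration of the non-linearity property, the sketch below builds a target that depends on a feature only through its square. The data is synthetic. Note that scikit-learn estimates mutual information with a nearest-neighbor method, so the scores are estimates rather than exact values.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=1000)
noise = rng.uniform(-1, 1, size=1000)  # feature unrelated to the target
y = x ** 2                             # symmetric, non-linear dependence on x

# Pearson correlation misses the relationship because it is not linear
pearson_r = np.corrcoef(x, y)[0, 1]

# Mutual information scores for both features against the target
X = np.column_stack([x, noise])
mi = mutual_info_regression(X, y, random_state=0)

print(f"Pearson r(x, y): {pearson_r:.3f}")
print(f"MI(x; y): {mi[0]:.3f}   MI(noise; y): {mi[1]:.3f}")
```

In this setup the correlation is close to zero while the mutual information of `x` is clearly larger than that of the unrelated feature.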

#### D. Chi-Square Test

**Principle:** Test independence between categorical feature and target.

$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$

Where:
- $O_i$ = observed frequency
- $E_i$ = expected frequency (if independent)

**When to Use:**
- Categorical features
- Classification problems
- Quick statistical test for relevance
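
A minimal sketch of chi-square scoring with scikit-learn, assuming two hypothetical binary indicator features on synthetic data. `sklearn.feature_selection.chi2` expects non-negative feature values such as counts or one-hot indicators.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)  # binary target

# Hypothetical indicator features
related = np.where(rng.random(500) < 0.9, y, 1 - y)  # agrees with y about 90% of the time
unrelated = rng.integers(0, 2, size=500)             # independent of y
X = np.column_stack([related, unrelated])

# Higher chi-square score (lower p-value) means stronger evidence of dependence
scores, p_values = chi2(X, y)

# Keep the single best-scoring feature
selector = SelectKBest(score_func=chi2, k=1)
X_best = selector.fit_transform(X, y)

print(scores, p_values)
print(selector.get_support())
```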

### Wrapper Methods

#### Recursive Feature Elimination (RFE)

**Algorithm:**
1. Train model on all features
2. Rank features by importance
3. Remove least important feature
4. Repeat until desired number of features

**When to Use:**
- Small to medium feature sets
- When computational cost is acceptable
- Linear models or tree-based models (have feature importance)

**Trade-off:** More accurate but computationally expensive (trains many models)
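
The algorithm above maps directly onto scikit-learn's `RFE`. This sketch runs it on synthetic data with logistic regression as the estimator, an assumption made to keep the example fast. Any estimator exposing `coef_` or `feature_importances_` could be substituted.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, of which 3 are informative
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=3,
    n_redundant=0, random_state=0,
)

# step=1 removes one feature per iteration, as in the algorithm above
rfe = RFE(
    estimator=LogisticRegression(max_iter=1000),
    n_features_to_select=3,
    step=1,
)
rfe.fit(X, y)

selected = rfe.support_  # boolean mask of the kept features
ranking = rfe.ranking_   # 1 = selected; larger values were eliminated earlier

print(selected)
print(ranking)
```

Going from 10 features to 3 with `step=1` fits the model once per elimination round, which is where the computational cost comes from.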

### Embedded Methods

#### L1 Regularization (Lasso)

**Objective Function:**
$$\min_{\beta} \left\{ \sum_{i=1}^{n} (y_i - \beta^T x_i)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\}$$

**Effect:** L1 penalty drives some coefficients to exactly zero → automatic feature selection

**When to Use:**
- Linear models (regression, logistic regression)
- High-dimensional data (more features than samples)
- Want sparse solution (few non-zero coefficients)
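
A minimal sketch of Lasso-based selection on synthetic data where only two of eight features affect the target. The value `alpha=0.1` is an illustrative assumption. In scikit-learn, `alpha` plays the role of $\lambda$, with the squared-error term scaled by $1/(2n)$.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
# Only the first two features influence the target
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Standardize so the L1 penalty treats all coefficients comparably
X_scaled = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=0.1)  # illustrative regularization strength
lasso.fit(X_scaled, y)

# Features with non-zero coefficients are the ones Lasso "selected"
selected = np.flatnonzero(lasso.coef_)

print(lasso.coef_.round(3))
print(selected)
```

In practice the regularization strength is usually tuned by cross-validation (for example with `LassoCV`) rather than fixed by hand.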

### Feature Selection Strategy

```mermaid
graph TD
    A[All Features] --> B{How many features?}
    B -->|< 100| C[Try All Methods]
    B -->|100-1000| D[Filter Methods First]
    B -->|> 1000| E[Variance Threshold + Correlation]
    
    C --> F{Model Type?}
    D --> F
    E --> F
    
    F -->|Linear| G[Correlation + Lasso]
    F -->|Tree-based| H[Feature Importance + RFE]
    F -->|Neural Net| I[Mutual Information + PCA]
    
    G --> J[Final Feature Set]
    H --> J
    I --> J
```

In [None]:
# Feature Selection Implementation

from sklearn.feature_selection import VarianceThreshold, SelectKBest, mutual_info_classif, mutual_info_regression, chi2, RFE
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

class FeatureSelector:
    """
    Comprehensive feature selection toolkit.
    
    Supports multiple selection methods:
    - Variance threshold (remove low-variance features)
    - Correlation-based (remove highly correlated features)
    - Mutual information (rank by MI score)
    - Chi-square test (categorical features)
    - Recursive Feature Elimination (wrapper method)
    - Feature importance (from tree-based models)
    """
    
    def __init__(self, method='variance'):
        """
        Initialize with selection method.
        
        Parameters:
        -----------
        method : str
            Selection method: 'variance', 'correlation', 'mutual_info', 
            'chi2', 'rfe', 'importance'
        """
        self.method = method
        self.selector = None
        self.selected_features = []
        self.feature_scores = {}
        
    def select_by_variance(self, df, threshold=0.01):
        """
        Remove features with variance below threshold.
        
        Parameters:
        -----------
        df : DataFrame
            Input data
        threshold : float
            Minimum variance required (features below this are removed)
        
        Returns:
        --------
        DataFrame with low-variance features removed
        """
        numeric_cols = df.select_dtypes(include=[np.number]).columns
        
        # Calculate variance for each feature
        variances = df[numeric_cols].var()
        
        # Select features above threshold
        selected = variances[variances > threshold].index.tolist()
        removed = variances[variances <= threshold].index.tolist()
        
        self.selected_features = selected
        self.feature_scores = variances.to_dict()
        
        print(f"✅ Variance Threshold Selection:")
        print(f"   Threshold: {threshold}")
        print(f"   Kept: {len(selected)} features")
        print(f"   Removed: {len(removed)} features")
        if removed:
            print(f"   Removed features: {removed[:5]}..." if len(removed) > 5 else f"   Removed features: {removed}")
        
        # Keep selected numeric + all non-numeric columns
        non_numeric_cols = [col for col in df.columns if col not in numeric_cols]
        return df[non_numeric_cols + selected]
    
    def select_by_correlation(self, df, threshold=0.95, target_col=None):
        """
        Remove highly correlated features (keeps one from each correlated pair).
        
        Parameters:
        -----------
        df : DataFrame
            Input data
        threshold : float
            Correlation threshold (remove if |correlation| > threshold)
        target_col : str
            Target column (if provided, keeps feature more correlated with target)
        
        Returns:
        --------
        DataFrame with redundant features removed
        """
        numeric_cols = df.select_dtypes(include=[np.number]).columns
        if target_col and target_col in numeric_cols:
            numeric_cols = [col for col in numeric_cols if col != target_col]
        
        # Calculate correlation matrix
        corr_matrix = df[numeric_cols].corr().abs()
        
        # Get upper triangle (avoid duplicate pairs)
        upper_tri = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
        
        # Find features to remove
        to_remove = set()
        
        for col in upper_tri.columns:
            # Get features highly correlated with this feature
            high_corr_features = upper_tri.index[upper_tri[col] > threshold].tolist()
            
            if high_corr_features:
                for corr_feature in high_corr_features:
                    # If target provided, keep feature more correlated with target
                    if target_col and target_col in df.columns:
                        corr_with_target_col = abs(df[col].corr(df[target_col]))
                        corr_with_target_feat = abs(df[corr_feature].corr(df[target_col]))
                        
                        if corr_with_target_col >= corr_with_target_feat:
                            to_remove.add(corr_feature)
                        else:
                            to_remove.add(col)
                    else:
                        # No target: drop the other member of the pair (the earlier-listed column)
                        to_remove.add(corr_feature)
        
        # Select features to keep
        self.selected_features = [col for col in numeric_cols if col not in to_remove]
        
        print(f"✅ Correlation-Based Selection:")
        print(f"   Threshold: {threshold}")
        print(f"   Kept: {len(self.selected_features)} features")
        print(f"   Removed: {len(to_remove)} features")
        if to_remove:
            removed_list = list(to_remove)
            print(f"   Removed features: {removed_list[:5]}..." if len(removed_list) > 5 else f"   Removed features: {removed_list}")
        
        # Keep selected numeric + all non-numeric + target
        non_numeric_cols = [col for col in df.columns if col not in numeric_cols and col != target_col]
        keep_cols = non_numeric_cols + self.selected_features
        if target_col:
            keep_cols.append(target_col)
        
        return df[keep_cols]
    
    def select_by_mutual_info(self, df, target_col, k=10, task='classification'):
        """
        Select top k features by mutual information score.
        
        Parameters:
        -----------
        df : DataFrame
            Input data
        target_col : str
            Target column name
        k : int
            Number of top features to select
        task : str
            'classification' or 'regression'
        
        Returns:
        --------
        DataFrame with top k features + target
        """
        # Separate features and target
        feature_cols = [col for col in df.columns if col != target_col]
        X = df[feature_cols].select_dtypes(include=[np.number])
        y = df[target_col]
        
        # Calculate mutual information
        if task == 'classification':
            mi_scores = mutual_info_classif(X, y, random_state=42)
        else:
            mi_scores = mutual_info_regression(X, y, random_state=42)
        
        # Create DataFrame with scores
        mi_df = pd.DataFrame({
            'feature': X.columns,
            'mi_score': mi_scores
        }).sort_values('mi_score', ascending=False)
        
        # Select top k features
        self.selected_features = mi_df.head(k)['feature'].tolist()
        self.feature_scores = dict(zip(mi_df['feature'], mi_df['mi_score']))
        
        print(f"✅ Mutual Information Selection ({task}):")
        print(f"   Selected top {k} features:")
        for i, row in mi_df.head(k).iterrows():
            print(f"   {row['feature']:30s} MI = {row['mi_score']:.4f}")
        
        # Keep selected features + target + non-numeric
        non_numeric_cols = [col for col in df.columns if col not in X.columns and col != target_col]
        return df[non_numeric_cols + self.selected_features + [target_col]]
    
    def select_by_rfe(self, df, target_col, n_features=10, task='classification'):
        """
        Recursive Feature Elimination using Random Forest.
        
        Parameters:
        -----------
        df : DataFrame
            Input data
        target_col : str
            Target column name
        n_features : int
            Number of features to select
        task : str
            'classification' or 'regression'
        
        Returns:
        --------
        DataFrame with selected features + target
        """
        # Separate features and target
        feature_cols = [col for col in df.columns if col != target_col]
        X = df[feature_cols].select_dtypes(include=[np.number])
        y = df[target_col]
        
        # Choose estimator based on task
        if task == 'classification':
            estimator = RandomForestClassifier(n_estimators=50, random_state=42, n_jobs=-1)
        else:
            estimator = RandomForestRegressor(n_estimators=50, random_state=42, n_jobs=-1)
        
        # Perform RFE
        rfe = RFE(estimator=estimator, n_features_to_select=n_features, step=1)
        rfe.fit(X, y)
        
        # Get selected features
        self.selected_features = X.columns[rfe.support_].tolist()
        
        # Get feature rankings
        rankings = dict(zip(X.columns, rfe.ranking_))
        self.feature_scores = rankings
        
        print(f"✅ Recursive Feature Elimination ({task}):")
        print(f"   Selected {n_features} features:")
        for feat in self.selected_features:
            print(f"   {feat}")
        
        # Keep selected features + target + non-numeric
        non_numeric_cols = [col for col in df.columns if col not in X.columns and col != target_col]
        return df[non_numeric_cols + self.selected_features + [target_col]]
    
    def select_by_importance(self, df, target_col, threshold=0.01, task='classification'):
        """
        Select features by Random Forest feature importance.
        
        Parameters:
        -----------
        df : DataFrame
            Input data
        target_col : str
            Target column name
        threshold : float
            Minimum importance threshold
        task : str
            'classification' or 'regression'
        
        Returns:
        --------
        DataFrame with important features + target
        """
        # Separate features and target
        feature_cols = [col for col in df.columns if col != target_col]
        X = df[feature_cols].select_dtypes(include=[np.number])
        y = df[target_col]
        
        # Train Random Forest
        if task == 'classification':
            model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
        else:
            model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
        
        model.fit(X, y)
        
        # Get feature importances
        importances = pd.DataFrame({
            'feature': X.columns,
            'importance': model.feature_importances_
        }).sort_values('importance', ascending=False)
        
        # Select features above threshold
        self.selected_features = importances[importances['importance'] > threshold]['feature'].tolist()
        self.feature_scores = dict(zip(importances['feature'], importances['importance']))
        
        print(f"✅ Feature Importance Selection ({task}):")
        print(f"   Threshold: {threshold}")
        print(f"   Selected {len(self.selected_features)} features:")
        for i, row in importances[importances['importance'] > threshold].iterrows():
            print(f"   {row['feature']:30s} Importance = {row['importance']:.4f}")
        
        # Keep selected features + target + non-numeric
        non_numeric_cols = [col for col in df.columns if col not in X.columns and col != target_col]
        return df[non_numeric_cols + self.selected_features + [target_col]]
    
    def visualize_feature_scores(self, top_n=20):
        """
        Visualize feature scores/rankings.
        
        Parameters:
        -----------
        top_n : int
            Number of top features to display
        """
        if not self.feature_scores:
            print("❌ No feature scores available. Run a selection method first.")
            return
        
        # Sort by score
        scores_df = pd.DataFrame(list(self.feature_scores.items()), 
                                columns=['feature', 'score'])
        scores_df = scores_df.sort_values('score', ascending=False).head(top_n)
        
        # Plot
        plt.figure(figsize=(12, 6))
        plt.barh(range(len(scores_df)), scores_df['score'], color='steelblue')
        plt.yticks(range(len(scores_df)), scores_df['feature'])
        plt.xlabel('Score')
        plt.ylabel('Features')
        plt.title(f'Top {top_n} Features by {self.method.capitalize()} Method')
        plt.gca().invert_yaxis()
        plt.grid(axis='x', alpha=0.3)
        plt.tight_layout()
        plt.show()

print("✅ FeatureSelector class implemented")
print("\nSupported Selection Methods:")
print("- select_by_variance(): Remove low-variance features")
print("- select_by_correlation(): Remove highly correlated features")
print("- select_by_mutual_info(): Select top k by mutual information")
print("- select_by_rfe(): Recursive Feature Elimination")
print("- select_by_importance(): Select by Random Forest importance")

## 🔬 Complete Example: Semiconductor Device Yield Prediction

### Scenario

**Problem:** Predict device yield (pass/fail) from parametric test data.

**Dataset:**
- **2000 semiconductor devices** tested across **20 wafers**
- **Raw Test Parameters:**
  - VDD (voltage): 0.8-1.2V
  - IDD (current): 10-100 mA
  - Frequency: 1000-3000 MHz
  - Power: 0.5-5 W
  - Temperature: 25-85°C
- **Spatial Information:**
  - die_x, die_y: Position on wafer (-10 to +10 mm from center)
  - wafer_id: Which wafer (1-20)
- **Target:** yield (0 = fail, 1 = pass)

**Challenges:**
- **Missing Data:** ~10% missing values per feature
- **Spatial Effects:** Dies near wafer edge fail more often
- **Process Variation:** Different wafers have different characteristics
- **Imbalanced:** Only ~15% fail rate

**Feature Engineering Strategy:**
1. Handle missing data (median imputation)
2. Create domain-specific features (efficiency, power density, leakage)
3. Create spatial features (radial distance, quadrant, edge indicator)
4. Create statistical features (wafer-level aggregations, deviations)
5. Scale numeric features (StandardScaler)
6. Select important features (correlation + feature importance)

In [None]:
# Complete Semiconductor Example - Part 1: Data Generation and Initial Engineering

import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

# Generate synthetic semiconductor test data
n_samples = 2000
n_wafers = 20

# Wafer assignment
wafer_ids = np.random.randint(1, n_wafers + 1, n_samples)

# Die positions on wafer (-10 to +10 mm from center)
die_x = np.random.uniform(-10, 10, n_samples)
die_y = np.random.uniform(-10, 10, n_samples)

# Radial distance from wafer center
radial_distance = np.sqrt(die_x**2 + die_y**2)
wafer_radius = 10.0

# Spatial effect: dies near edge have lower performance
spatial_effect = radial_distance / wafer_radius  # 0 at center, 1 at the wafer radius (corner dies of the square grid exceed 1)

# Base test parameters with realistic distributions
vdd_base = np.random.normal(1.0, 0.05, n_samples)  # 1.0V ± 50mV
idd_base = np.random.normal(50, 10, n_samples)      # 50mA ± 10mA
freq_base = np.random.normal(2000, 200, n_samples)  # 2000MHz ± 200MHz
temp_base = np.random.normal(55, 15, n_samples)     # 55°C ± 15°C

# Apply spatial effects (edge dies have worse parameters)
vdd = vdd_base - 0.05 * spatial_effect * np.random.randn(n_samples)
idd = idd_base * (1 + 0.2 * spatial_effect + 0.1 * np.random.randn(n_samples))
freq = freq_base * (1 - 0.1 * spatial_effect + 0.05 * np.random.randn(n_samples))
temp = temp_base + 5 * spatial_effect + 2 * np.random.randn(n_samples)

# Power calculation (realistic relationship)
power = vdd * idd / 1000 + 0.5 * (freq / 1000)**2 / 1000

# Wafer-level process variation
wafer_means = np.random.normal(0, 0.5, n_wafers)
for i in range(n_wafers):
    wafer_mask = wafer_ids == (i + 1)
    vdd[wafer_mask] += wafer_means[i] * 0.02
    idd[wafer_mask] *= (1 + wafer_means[i] * 0.05)

# Yield determination (based on multiple factors)
# Fail if: edge location + high power + low freq + high temp + wafer variation
fail_score = (
    0.3 * spatial_effect +                    # Edge effect
    0.2 * (power - power.mean()) / power.std() +  # High power bad
    0.2 * (temp - temp.mean()) / temp.std() +     # High temp bad
    0.15 * (freq.mean() - freq) / freq.std() +    # Low freq bad
    0.15 * (idd - idd.mean()) / idd.std()         # High current bad
)

# Convert to binary with ~15% failure rate
fail_threshold = np.percentile(fail_score, 85)
yield_binary = (fail_score < fail_threshold).astype(int)

# Create DataFrame
df_raw = pd.DataFrame({
    'wafer_id': wafer_ids,
    'die_x': die_x,
    'die_y': die_y,
    'vdd': vdd,
    'idd': idd,
    'freq': freq,
    'power': power,
    'temp': temp,
    'yield': yield_binary
})

# Inject missing values (~10% per feature)
missing_rate = 0.10
for col in ['vdd', 'idd', 'freq', 'power', 'temp']:
    missing_indices = np.random.choice(df_raw.index, size=int(len(df_raw) * missing_rate), replace=False)
    df_raw.loc[missing_indices, col] = np.nan

print("=" * 80)
print("SEMICONDUCTOR DEVICE YIELD PREDICTION - DATASET")
print("=" * 80)
print(f"\n📊 Dataset Shape: {df_raw.shape}")
print(f"   Samples: {len(df_raw)}")
print(f"   Features: {len(df_raw.columns) - 1} (excluding target)")
print(f"   Target: 'yield' (binary)")

print(f"\n📈 Target Distribution:")
print(f"   Pass (1): {(df_raw['yield'] == 1).sum()} ({(df_raw['yield'] == 1).sum() / len(df_raw) * 100:.1f}%)")
print(f"   Fail (0): {(df_raw['yield'] == 0).sum()} ({(df_raw['yield'] == 0).sum() / len(df_raw) * 100:.1f}%)")

print(f"\n❓ Missing Data:")
for col in ['vdd', 'idd', 'freq', 'power', 'temp']:
    missing_count = df_raw[col].isnull().sum()
    missing_pct = (missing_count / len(df_raw)) * 100
    print(f"   {col:10s}: {missing_count:4d} missing ({missing_pct:.1f}%)")

print(f"\n📏 Feature Statistics (before engineering):")
print(df_raw.describe())

# Step 1: Handle Missing Data
print("\n" + "=" * 80)
print("STEP 1: HANDLE MISSING DATA")
print("=" * 80)

handler = MissingDataHandler(strategy='median')
df_imputed = handler.impute_simple(df_raw, strategy='median')

print(f"\n✅ Imputation Complete:")
print(f"   Strategy: Median")
print(f"   Fill Values Used:")
for col, value in handler.fill_values.items():
    print(f"   {col:10s}: {value:.4f}")

# Verify no missing values remain
print(f"\n✅ Verification: {df_imputed.isnull().sum().sum()} missing values remain")

# Step 2: Create Domain-Specific Features
print("\n" + "=" * 80)
print("STEP 2: CREATE DOMAIN-SPECIFIC FEATURES")
print("=" * 80)

creator = FeatureCreator()

# Define domain recipes based on semiconductor physics
domain_recipes = {
    # Power efficiency: performance per watt
    'efficiency': lambda x: x['freq'] / (x['power'] + 1e-10),
    
    # Power density: normalized by voltage
    'power_density': lambda x: x['power'] / (x['vdd'] + 1e-10),
    
    # Leakage normalized: current per voltage (like conductance)
    'leakage_normalized': lambda x: x['idd'] / (x['vdd'] + 1e-10),
    
    # Temperature coefficient: performance vs temperature
    'temp_coefficient': lambda x: x['freq'] / (x['temp'] + 1e-10),
    
    # Dynamic power factor: VDD^2 * Freq (proportional to dynamic power)
    'dynamic_power_factor': lambda x: (x['vdd'] ** 2) * x['freq'],
}

df_domain = creator.create_domain_features(df_imputed, domain_recipes)

print(f"\n✅ Created {len(domain_recipes)} domain-specific features")
print(f"   New features: {list(domain_recipes.keys())}")

# Step 3: Create Spatial Features
print("\n" + "=" * 80)
print("STEP 3: CREATE SPATIAL FEATURES")
print("=" * 80)

# Radial distance from wafer center
df_domain['radial_distance'] = np.sqrt(df_domain['die_x']**2 + df_domain['die_y']**2)

# Quadrant (1-4 based on signs of x and y)
df_domain['quadrant'] = 1  # Default
df_domain.loc[(df_domain['die_x'] >= 0) & (df_domain['die_y'] >= 0), 'quadrant'] = 1
df_domain.loc[(df_domain['die_x'] < 0) & (df_domain['die_y'] >= 0), 'quadrant'] = 2
df_domain.loc[(df_domain['die_x'] < 0) & (df_domain['die_y'] < 0), 'quadrant'] = 3
df_domain.loc[(df_domain['die_x'] >= 0) & (df_domain['die_y'] < 0), 'quadrant'] = 4

# Edge indicator (1 if within 2mm of edge)
edge_threshold = 8.0  # mm from center
df_domain['is_edge_die'] = (df_domain['radial_distance'] > edge_threshold).astype(int)

print(f"✅ Created 3 spatial features:")
print(f"   - radial_distance: {df_domain['radial_distance'].min():.2f} to {df_domain['radial_distance'].max():.2f} mm")
print(f"   - quadrant: {df_domain['quadrant'].nunique()} categories")
print(f"   - is_edge_die: {df_domain['is_edge_die'].sum()} edge dies ({df_domain['is_edge_die'].sum()/len(df_domain)*100:.1f}%)")

# Step 4: Create Statistical Aggregation Features
print("\n" + "=" * 80)
print("STEP 4: CREATE WAFER-LEVEL AGGREGATIONS")
print("=" * 80)

# Aggregate by wafer_id
agg_cols = ['vdd', 'idd', 'freq', 'power', 'temp']
df_agg = creator.create_aggregation_features(df_domain, 'wafer_id', agg_cols, agg_funcs=['mean', 'std'])

print(f"✅ Created wafer-level aggregations:")
print(f"   Grouped by: wafer_id")
print(f"   Aggregated columns: {agg_cols}")
print(f"   Functions: mean, std (+ deviations)")
print(f"   Total new features: {len(agg_cols) * 2 + len(agg_cols)} per wafer")

print(f"\n📊 Current Feature Count: {len(df_agg.columns)} columns")
print(f"   Raw features: {len(df_raw.columns)}")
print(f"   Engineered: {len(df_agg.columns) - len(df_raw.columns)}")

# Save for next cell
df_engineered = df_agg.copy()
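
Step 4 relies on `creator.create_aggregation_features`, which is defined earlier in the notebook. As a rough sketch of what such a step typically computes (per-group mean, per-group std, and each row's deviation from its group mean), the snippet below uses a hypothetical helper `add_group_aggregations`, made-up column naming, and a toy frame. It is an illustration under those assumptions, not the notebook's actual implementation.

```python
import pandas as pd

def add_group_aggregations(df, group_col, value_cols):
    """Hypothetical helper: add per-group mean/std and deviation-from-mean columns."""
    out = df.copy()
    grouped = df.groupby(group_col)
    for col in value_cols:
        mean_name = f"{col}_{group_col}_mean"
        # transform() broadcasts the group statistic back to every row of the group
        out[mean_name] = grouped[col].transform("mean")
        out[f"{col}_{group_col}_std"] = grouped[col].transform("std")
        out[f"{col}_dev_from_{group_col}_mean"] = out[col] - out[mean_name]
    return out

# Toy data: two wafers with two dies each (values are made up)
toy = pd.DataFrame({
    "wafer_id": [1, 1, 2, 2],
    "vdd": [1.0, 1.2, 0.9, 1.1],
})
toy_agg = add_group_aggregations(toy, "wafer_id", ["vdd"])
print(toy_agg)
```

Each aggregated column yields three new columns, which matches the `len(agg_cols) * 3` count reported in the summary.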

In [None]:
# Complete Semiconductor Example - Part 2: Scaling, Selection, and Model Comparison

# Step 5: Encode Categorical Features
print("=" * 80)
print("STEP 5: ENCODE CATEGORICAL FEATURES")
print("=" * 80)

# One-hot encode quadrant
encoder = CategoricalEncoder(method='onehot')
df_encoded = encoder.fit_transform(df_engineered, categorical_cols=['quadrant'])

print(f"✅ One-hot encoded 'quadrant' feature")
print(f"   Original: 1 column with 4 categories")
print(f"   Encoded: {len([c for c in df_encoded.columns if 'quadrant' in c])} binary columns")

# Step 6: Scale Numeric Features
print("\n" + "=" * 80)
print("STEP 6: SCALE NUMERIC FEATURES")
print("=" * 80)

# Separate features and target
target_col = 'yield'
feature_cols = [col for col in df_encoded.columns if col != target_col]
X = df_encoded[feature_cols]
y = df_encoded[target_col]

# Identify numeric columns to scale
numeric_cols = X.select_dtypes(include=[np.number]).columns.tolist()

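# Scale using StandardScaler
# Note: the scaler is fit on the full dataset here for brevity. To avoid leakage,
# fit it on the training split only and apply it to the test split.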
scaler = FeatureScaler(method='standard')
X_scaled = scaler.fit_transform(X[numeric_cols])

# Combine scaled numeric with non-numeric (if any)
non_numeric_cols = [col for col in X.columns if col not in numeric_cols]
if non_numeric_cols:
    X_final = pd.concat([X[non_numeric_cols], X_scaled], axis=1)
else:
    X_final = X_scaled

print(f"✅ Scaled {len(numeric_cols)} numeric features using StandardScaler")
print(f"   Mean ≈ 0, Std ≈ 1 for all features")

# Step 7: Feature Selection - Remove Highly Correlated Features
print("\n" + "=" * 80)
print("STEP 7: FEATURE SELECTION - CORRELATION")
print("=" * 80)

# Combine X_final and y for correlation selection
df_for_selection = pd.concat([X_final, y], axis=1)

selector_corr = FeatureSelector(method='correlation')
df_selected = selector_corr.select_by_correlation(df_for_selection, threshold=0.95, target_col=target_col)

# Separate again
X_selected = df_selected.drop(columns=[target_col])

print(f"\n📊 Features after correlation selection: {len(X_selected.columns)}")

# Step 8: Feature Selection - Feature Importance
print("\n" + "=" * 80)
print("STEP 8: FEATURE SELECTION - IMPORTANCE")
print("=" * 80)

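# Combine for importance selection
# Note: selecting features on the full dataset lets test labels influence the choice.
# In practice, run target-based selection on the training split only.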
df_for_importance = pd.concat([X_selected, y], axis=1)

selector_importance = FeatureSelector(method='importance')
df_final = selector_importance.select_by_importance(df_for_importance, target_col=target_col, 
                                                     threshold=0.01, task='classification')

# Final feature set
X_final_selected = df_final.drop(columns=[target_col])
y_final = df_final[target_col]

print(f"\n✅ Final Feature Set: {len(X_final_selected.columns)} features")
print(f"   Original: {len(df_raw.columns) - 1} raw features")
print(f"   Engineered: {len(X_final.columns)} total features")
print(f"   Selected: {len(X_final_selected.columns)} final features")
print(f"   Reduction: {((len(X_final.columns) - len(X_final_selected.columns)) / len(X_final.columns) * 100):.1f}%")

# Step 9: Model Comparison - Before and After Feature Engineering
print("\n" + "=" * 80)
print("STEP 9: MODEL COMPARISON")
print("=" * 80)

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Prepare raw features (just basic scaling, no engineering)
X_raw = df_imputed[['vdd', 'idd', 'freq', 'power', 'temp', 'die_x', 'die_y']]
y_raw = df_imputed['yield']

scaler_raw = FeatureScaler(method='standard')
X_raw_scaled = scaler_raw.fit_transform(X_raw)

# Split data - raw features
X_train_raw, X_test_raw, y_train_raw, y_test_raw = train_test_split(
    X_raw_scaled, y_raw, test_size=0.3, random_state=42, stratify=y_raw
)

# Split data - engineered features
X_train_eng, X_test_eng, y_train_eng, y_test_eng = train_test_split(
    X_final_selected, y_final, test_size=0.3, random_state=42, stratify=y_final
)

print(f"📊 Train/Test Split:")
print(f"   Train: {len(X_train_raw)} samples (70%)")
print(f"   Test: {len(X_test_raw)} samples (30%)")

# Train models on raw features
print(f"\n🔹 Training on RAW features ({X_raw_scaled.shape[1]} features)...")

lr_raw = LogisticRegression(max_iter=1000, random_state=42)
lr_raw.fit(X_train_raw, y_train_raw)
y_pred_lr_raw = lr_raw.predict(X_test_raw)
y_prob_lr_raw = lr_raw.predict_proba(X_test_raw)[:, 1]

rf_raw = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf_raw.fit(X_train_raw, y_train_raw)
y_pred_rf_raw = rf_raw.predict(X_test_raw)
y_prob_rf_raw = rf_raw.predict_proba(X_test_raw)[:, 1]

# Train models on engineered features
print(f"🔹 Training on ENGINEERED features ({X_final_selected.shape[1]} features)...")

lr_eng = LogisticRegression(max_iter=1000, random_state=42)
lr_eng.fit(X_train_eng, y_train_eng)
y_pred_lr_eng = lr_eng.predict(X_test_eng)
y_prob_lr_eng = lr_eng.predict_proba(X_test_eng)[:, 1]

rf_eng = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf_eng.fit(X_train_eng, y_train_eng)
y_pred_rf_eng = rf_eng.predict(X_test_eng)
y_prob_rf_eng = rf_eng.predict_proba(X_test_eng)[:, 1]

# Compare Results
print("\n" + "=" * 80)
print("RESULTS COMPARISON: RAW vs ENGINEERED FEATURES")
print("=" * 80)

results = pd.DataFrame({
    'Model': ['Logistic Regression (Raw)', 'Random Forest (Raw)', 
              'Logistic Regression (Engineered)', 'Random Forest (Engineered)'],
    'Accuracy': [
        accuracy_score(y_test_raw, y_pred_lr_raw),
        accuracy_score(y_test_raw, y_pred_rf_raw),
        accuracy_score(y_test_eng, y_pred_lr_eng),
        accuracy_score(y_test_eng, y_pred_rf_eng)
    ],
    'Precision': [
        precision_score(y_test_raw, y_pred_lr_raw),
        precision_score(y_test_raw, y_pred_rf_raw),
        precision_score(y_test_eng, y_pred_lr_eng),
        precision_score(y_test_eng, y_pred_rf_eng)
    ],
    'Recall': [
        recall_score(y_test_raw, y_pred_lr_raw),
        recall_score(y_test_raw, y_pred_rf_raw),
        recall_score(y_test_eng, y_pred_lr_eng),
        recall_score(y_test_eng, y_pred_rf_eng)
    ],
    'F1': [
        f1_score(y_test_raw, y_pred_lr_raw),
        f1_score(y_test_raw, y_pred_rf_raw),
        f1_score(y_test_eng, y_pred_lr_eng),
        f1_score(y_test_eng, y_pred_rf_eng)
    ],
    'AUC-ROC': [
        roc_auc_score(y_test_raw, y_prob_lr_raw),
        roc_auc_score(y_test_raw, y_prob_rf_raw),
        roc_auc_score(y_test_eng, y_prob_lr_eng),
        roc_auc_score(y_test_eng, y_prob_rf_eng)
    ]
})

print("\n")
print(results.to_string(index=False))

# Calculate improvements
lr_improvement = (results.loc[2, 'AUC-ROC'] - results.loc[0, 'AUC-ROC']) / results.loc[0, 'AUC-ROC'] * 100
rf_improvement = (results.loc[3, 'AUC-ROC'] - results.loc[1, 'AUC-ROC']) / results.loc[1, 'AUC-ROC'] * 100

print(f"\n🎯 Feature Engineering Impact:")
print(f"   Logistic Regression: {lr_improvement:+.1f}% improvement in AUC-ROC")
print(f"   Random Forest: {rf_improvement:+.1f}% improvement in AUC-ROC")

# Visualizations
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 1. Feature importance from engineered model
importances = pd.DataFrame({
    'feature': X_final_selected.columns,
    'importance': rf_eng.feature_importances_
}).sort_values('importance', ascending=False).head(15)

axes[0, 0].barh(range(len(importances)), importances['importance'], color='steelblue')
axes[0, 0].set_yticks(range(len(importances)))
axes[0, 0].set_yticklabels(importances['feature'], fontsize=8)
axes[0, 0].set_xlabel('Importance')
axes[0, 0].set_title('Top 15 Feature Importances (Engineered Model)')
axes[0, 0].invert_yaxis()
axes[0, 0].grid(axis='x', alpha=0.3)

# 2. ROC Curves Comparison
from sklearn.metrics import roc_curve

fpr_lr_raw, tpr_lr_raw, _ = roc_curve(y_test_raw, y_prob_lr_raw)
fpr_rf_raw, tpr_rf_raw, _ = roc_curve(y_test_raw, y_prob_rf_raw)
fpr_lr_eng, tpr_lr_eng, _ = roc_curve(y_test_eng, y_prob_lr_eng)
fpr_rf_eng, tpr_rf_eng, _ = roc_curve(y_test_eng, y_prob_rf_eng)

axes[0, 1].plot(fpr_lr_raw, tpr_lr_raw, label=f'LR Raw (AUC={results.loc[0, "AUC-ROC"]:.3f})', linestyle='--', color='blue')
axes[0, 1].plot(fpr_rf_raw, tpr_rf_raw, label=f'RF Raw (AUC={results.loc[1, "AUC-ROC"]:.3f})', linestyle='--', color='green')
axes[0, 1].plot(fpr_lr_eng, tpr_lr_eng, label=f'LR Engineered (AUC={results.loc[2, "AUC-ROC"]:.3f})', color='blue', linewidth=2)
axes[0, 1].plot(fpr_rf_eng, tpr_rf_eng, label=f'RF Engineered (AUC={results.loc[3, "AUC-ROC"]:.3f})', color='green', linewidth=2)
axes[0, 1].plot([0, 1], [0, 1], 'k--', alpha=0.3)
axes[0, 1].set_xlabel('False Positive Rate')
axes[0, 1].set_ylabel('True Positive Rate')
axes[0, 1].set_title('ROC Curves: Raw vs Engineered Features')
axes[0, 1].legend(fontsize=8)
axes[0, 1].grid(alpha=0.3)

# 3. Model performance comparison
metrics = ['Accuracy', 'Precision', 'Recall', 'F1', 'AUC-ROC']
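# Row selection from a mixed-dtype frame yields object dtype, so cast to float for plotting
raw_lr = results.loc[0, metrics].astype(float).values
eng_lr = results.loc[2, metrics].astype(float).values
raw_rf = results.loc[1, metrics].astype(float).values
eng_rf = results.loc[3, metrics].astype(float).values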

x = np.arange(len(metrics))
width = 0.2

axes[1, 0].bar(x - width*1.5, raw_lr, width, label='LR Raw', color='lightblue')
axes[1, 0].bar(x - width/2, eng_lr, width, label='LR Engineered', color='blue')
axes[1, 0].bar(x + width/2, raw_rf, width, label='RF Raw', color='lightgreen')
axes[1, 0].bar(x + width*1.5, eng_rf, width, label='RF Engineered', color='green')
axes[1, 0].set_xticks(x)
axes[1, 0].set_xticklabels(metrics, rotation=45)
axes[1, 0].set_ylabel('Score')
axes[1, 0].set_title('Metric Comparison: Raw vs Engineered')
axes[1, 0].legend(fontsize=8)
axes[1, 0].grid(axis='y', alpha=0.3)
axes[1, 0].set_ylim([0, 1])

# 4. Feature engineering pipeline summary
pipeline_text = f"""Feature Engineering Pipeline Summary

Raw Features: {len(df_raw.columns) - 1}
├─ Parametric Tests: 5 (VDD, IDD, Freq, Power, Temp)
└─ Spatial: 2 (die_x, die_y)

After Engineering: {len(X_final.columns)}
├─ Missing Data: Median imputation
├─ Domain Features: +{len(domain_recipes)} (efficiency, power_density, etc.)
├─ Spatial Features: +3 (radial_distance, quadrant, edge)
├─ Aggregations: +{len(agg_cols)*3} (wafer mean/std/deviation)
└─ Encoding: quadrant one-hot

After Selection: {len(X_final_selected.columns)}
├─ Correlation Filter: removed {len(X_final.columns) - len(X_selected.columns)} redundant
└─ Importance Filter: kept {len(X_final_selected.columns)} most important

Performance Gain:
├─ Logistic Regression: {lr_improvement:+.1f}% AUC improvement
└─ Random Forest: {rf_improvement:+.1f}% AUC improvement
"""

axes[1, 1].text(0.05, 0.95, pipeline_text, transform=axes[1, 1].transAxes,
                fontsize=9, verticalalignment='top', fontfamily='monospace',
                bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.3))
axes[1, 1].axis('off')

plt.tight_layout()
plt.show()

print("\n✅ Feature Engineering Complete!")
print(f"   Time saved in production: Feature engineering pipeline automates complex transformations")
print(f"   Model improvement: up to {max(lr_improvement, rf_improvement):+.1f}% relative change in AUC-ROC")
print(f"   Business impact: Fewer false negatives → Reduced yield loss")

## 🚀 Real-World Project Ideas

### Post-Silicon Validation Projects

#### 1. Automated Feature Discovery Engine ($20M+ Yield Improvement)
**Objective:** Build system that automatically discovers optimal features for yield prediction across different device types.

**Key Features:**
- Automated missing data strategy selection
- Domain-agnostic feature generation (polynomial, interactions)
- Genetic algorithms for feature selection
- Cross-validation to prevent overfitting

**Success Metrics:**
- 15-25% improvement in yield prediction accuracy
- Reduces feature engineering time from weeks to hours
- Generalizes across 10+ different product families
- Deployed in production for real-time test optimization

**Technologies:** Python, sklearn, TPOT (AutoML), Spark for scale

**Business Impact:** $20M+ annual savings from reduced scrap and improved test coverage

---

#### 2. Cross-Product Feature Transfer Learning ($15M+ Faster Ramp)
**Objective:** Transfer engineered features from mature products to new products to accelerate ramp-up.

**Key Features:**
- Feature importance transfer between similar devices
- Domain adaptation techniques
- Physics-based feature templates
- Automated feature validation on new product data

**Success Metrics:**
- 40% faster time-to-market for new products
- 60% reduction in feature engineering effort
- 80% accuracy with <1000 samples (vs 5000+ normally)

**Technologies:** Transfer learning, meta-learning, domain adaptation

**Business Impact:** $15M+ revenue from faster product launches

---

#### 3. Spatial Feature Synthesizer for Wafer Analysis ($10M+ Equipment Savings)
**Objective:** Create spatial features that reveal equipment-specific failure patterns on wafers.

**Key Features:**
- Radial, angular, and grid-based spatial encodings
- Wafer map clustering and pattern recognition
- Equipment signature extraction
- Automated root cause correlation

**Success Metrics:**
- Identify 90%+ of equipment-related failures
- Reduce equipment-related yield loss by 30%
- Predict equipment maintenance needs 2 weeks in advance

**Technologies:** Image processing, CNN for wafer maps, spatial statistics

**Business Impact:** $10M+ savings from equipment optimization and reduced downtime

---

#### 4. Test Correlation Feature Engineering ($12M+ Test Time Reduction)
**Objective:** Engineer features that capture test-to-test correlations to enable test reduction.

**Key Features:**
- Association rule mining between test failures
- Conditional probability features (P(Test B fails | Test A fails))
- Test clustering based on failure patterns
- Sequential test ordering optimization
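
As a small illustration of the conditional probability features listed above, the sketch below estimates P(test j fails | test i fails) from a binary fail matrix. The test names and data are invented for the example, and a test with zero failures would produce NaN in its row.

```python
import pandas as pd

# Toy pass/fail matrix: rows are devices, columns are tests, 1 = fail (made-up data)
fails = pd.DataFrame({
    "test_a": [1, 1, 0, 1, 0, 0],
    "test_b": [1, 1, 0, 0, 0, 1],
    "test_c": [0, 0, 0, 0, 1, 0],
})

# co_fail.loc[i, j] = number of devices failing both test i and test j
co_fail = fails.T.dot(fails)

# Divide row i by the failure count of test i to get P(j fails | i fails)
cond_prob = co_fail.div(fails.sum(axis=0), axis=0)
print(cond_prob.round(2))
```

A high off-diagonal value suggests that test j adds little information once test i has failed, which is one possible signal for test reduction.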

**Success Metrics:**
- Reduce test time by 30% without yield impact
- Identify 50+ redundant tests
- 99.5%+ coverage with reduced test set

**Technologies:** Apriori algorithm, Bayesian networks, graph analytics

**Business Impact:** $12M+ annual savings from reduced test time and equipment utilization

---

### General Machine Learning Projects

#### 5. E-Commerce Customer Feature Pipeline ($50M+ Revenue)
**Objective:** Build comprehensive feature engineering pipeline for customer lifetime value prediction.

**Key Features:**
- RFM features (Recency, Frequency, Monetary)
- Behavioral sequences (click patterns, cart abandonment)
- Temporal features (seasonality, trend, day-of-week)
- Cross-sell propensity scores
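
The RFM features above can be sketched with a single pandas group-by. The order table, column names, and snapshot date below are assumptions made for illustration only.

```python
import pandas as pd

# Toy order history (made-up data)
orders = pd.DataFrame({
    "customer_id": ["a", "a", "b", "b", "b", "c"],
    "order_date": pd.to_datetime([
        "2024-01-05", "2024-03-01", "2024-02-10",
        "2024-02-20", "2024-03-10", "2023-12-01",
    ]),
    "amount": [50.0, 70.0, 20.0, 35.0, 15.0, 200.0],
})
snapshot = pd.Timestamp("2024-03-31")  # reference date for recency (assumed)

rfm = orders.groupby("customer_id").agg(
    recency_days=("order_date", lambda d: (snapshot - d.max()).days),  # days since last order
    frequency=("order_date", "count"),                                # number of orders
    monetary=("amount", "sum"),                                       # total spend
)
print(rfm)
```

Only orders dated before the snapshot should feed these features, otherwise the model sees information that would not be available at prediction time.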

**Success Metrics:**
- 25% improvement in CLV prediction accuracy
- 40% increase in marketing campaign ROI
- 15% reduction in churn through targeted retention

**Technologies:** Spark for scale, Feature Store (Feast), A/B testing framework

**Business Impact:** $50M+ revenue from optimized marketing spend

---

#### 6. Financial Fraud Detection Features ($100M+ Fraud Prevention)
**Objective:** Engineer features for real-time credit card fraud detection.

**Key Features:**
- Velocity features (transactions per hour, amount per day)
- Geographic anomalies (distance from previous transaction)
- Merchant category patterns
- Network features (connections between entities)
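
One way to sketch the velocity features above is a time-based rolling window per card. The column names, the 60-minute window, and the transactions are illustrative assumptions. Note that the default window includes the current transaction.

```python
import pandas as pd

# Toy transaction log (made-up data), sorted by time so rolling windows are valid
tx = pd.DataFrame({
    "card_id": ["x", "x", "x", "y", "x"],
    "ts": pd.to_datetime([
        "2024-01-01 10:00", "2024-01-01 10:20", "2024-01-01 10:50",
        "2024-01-01 11:00", "2024-01-01 12:30",
    ]),
    "amount": [10.0, 25.0, 40.0, 5.0, 60.0],
}).sort_values("ts")

# Rolling 60-minute window per card over the timestamp index
windows = tx.set_index("ts").groupby("card_id")["amount"].rolling("60min")

velocity = pd.DataFrame({
    "tx_count_1h": windows.count(),   # transactions in the last hour
    "amount_sum_1h": windows.sum(),   # amount spent in the last hour
})
print(velocity)
```

The result is indexed by `(card_id, ts)`, so it can be joined back onto the transaction table on those two keys.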

**Success Metrics:**
- 95%+ fraud detection rate
- <0.1% false positive rate
- <100ms prediction latency
- Detect novel fraud patterns within 24 hours

**Technologies:** Streaming (Kafka), graph databases (Neo4j), online learning

**Business Impact:** $100M+ annual fraud prevention

---

#### 7. Healthcare Risk Prediction Features ($200M+ Patient Safety)
**Objective:** Engineer features for hospital readmission and complication prediction.

**Key Features:**
- Clinical history aggregations (prior admissions, diagnoses)
- Medication interaction features
- Lab trend features (slope, volatility, anomalies)
- Social determinants of health proxies

**Success Metrics:**
- 30% reduction in preventable readmissions
- 40% earlier detection of complications
- Reduces false alarms by 50%
- Works across 100+ hospitals

**Technologies:** FHIR standard, privacy-preserving ML, causal inference

**Business Impact:** $200M+ healthcare cost savings, improved patient outcomes

---

#### 8. IoT Sensor Feature Engineering ($80M+ Maintenance Savings)
**Objective:** Engineer features from time-series sensor data for predictive maintenance.

**Key Features:**
- Statistical features (rolling mean, std, percentiles)
- Frequency domain features (FFT components, spectral entropy)
- Change point detection features
- Multi-sensor correlation features
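
The statistical and frequency-domain features above can be sketched on a synthetic signal. The sampling rate, window length, and signal composition are assumptions chosen so that the dominant frequency is known in advance.

```python
import numpy as np
import pandas as pd

fs = 100                      # sampling rate in Hz (assumed)
t = np.arange(200) / fs       # 2 seconds of samples
signal = np.sin(2 * np.pi * 5 * t) + 0.1 * np.sin(2 * np.pi * 20 * t)

# Statistical features over a rolling window of 50 samples (0.5 s)
s = pd.Series(signal)
rolling_mean = s.rolling(window=50).mean()
rolling_std = s.rolling(window=50).std()

# Frequency-domain features from the one-sided FFT magnitude spectrum
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1 / fs)
dominant_freq = freqs[np.argmax(spectrum[1:]) + 1]  # skip the DC bin

# Spectral entropy: Shannon entropy of the normalized power spectrum
power = spectrum ** 2
p = power / power.sum()
spectral_entropy = -np.sum(p * np.log2(p + 1e-12))

print(f"dominant frequency: {dominant_freq:.1f} Hz")
print(f"spectral entropy: {spectral_entropy:.3f} bits")
```

A signal concentrated in a few frequencies has low spectral entropy, while broadband noise pushes it higher. A rising value over time can therefore act as a simple degradation indicator.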

**Success Metrics:**
- Predict failures 2 weeks in advance (90% accuracy)
- Reduce unplanned downtime by 50%
- Extend equipment life by 20%
- Optimize maintenance schedules

**Technologies:** Time series analysis, wavelets, autoencoders, edge computing

**Business Impact:** $80M+ annual savings from optimized maintenance

## 🎯 Key Takeaways and Best Practices

### Critical Insights

**1. Feature Engineering Impact:**
- Can improve model performance substantially (gains of 20-50% are possible, depending on the problem and baseline)
- Often more impactful than algorithm selection
- Requires domain knowledge AND data science expertise
- Good features make simple models outperform complex models with poor features

**2. The Feature Engineering Mindset:**
```
Raw Data → Domain Understanding → Feature Hypotheses → Implementation → Validation
```
Always ask:
- What physical/business processes generate this data?
- What relationships might exist between features?
- What aggregations make sense for this domain?
- What temporal patterns are relevant?

### When to Use Each Technique

| Technique | Use Case | When to Skip |
|-----------|----------|-------------|
| **Missing Data Imputation** | Whenever missing values are present | If <1% missing, consider dropping rows |
| **Scaling** | Linear models, neural nets, distance-based | Tree-based models (optional) |
| **One-Hot Encoding** | Nominal categories, <15 levels | High cardinality (>20 categories) |
| **Target Encoding** | High cardinality categories | Risk of overfitting with small data |
| **Polynomial Features** | Non-linear relationships in linear models | High dimensions already, tree models |
| **Domain Features** | Always try if domain knowledge available | Purely exploratory analysis |
| **Aggregations** | Group structure exists (wafer, customer) | No meaningful grouping |
| **Variance Threshold** | Always as first step | After already doing feature selection |
| **Correlation Filter** | Linear models, many features | Tree models (robust to correlated features) |
| **Mutual Information** | Non-linear relationships | Limited compute (use correlation instead) |
| **RFE** | Small-medium feature sets | Large feature sets or datasets (too slow) |
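
To make two rows of the table concrete, the sketch below applies a variance threshold and then compares mutual information with Pearson correlation on synthetic data. The target depends on `|x|`, so correlation is close to zero while mutual information is not. The data and feature names are made up for this example.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold, mutual_info_classif

rng = np.random.default_rng(0)
n = 500
x_informative = rng.normal(size=n)
x_noise = rng.normal(size=n)
x_constant = np.ones(n)                      # zero variance, carries no information
X = np.column_stack([x_informative, x_noise, x_constant])
y = (np.abs(x_informative) > 1).astype(int)  # non-linear dependence on the first feature

# Step 1: drop zero-variance features
vt = VarianceThreshold(threshold=0.0)
X_var = vt.fit_transform(X)
kept = vt.get_support()

# Step 2: score the remaining features with mutual information
mi = mutual_info_classif(X_var, y, random_state=0)

# For comparison: linear correlation misses the |x| relationship
corr = np.corrcoef(x_informative, y)[0, 1]

print("kept after variance threshold:", kept)
print("mutual information:", mi.round(3))
print(f"Pearson correlation of informative feature: {corr:.3f}")
```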

### Common Pitfalls and How to Avoid Them

**1. Data Leakage**
- ❌ **Wrong:** Fit scaler/imputer on entire dataset
- ✅ **Right:** Fit only on training data, apply to test data
- ❌ **Wrong:** Use target in feature creation (future information)
- ✅ **Right:** Only use information available at prediction time
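
One common way to follow the "fit only on training data" rule is to wrap preprocessing and the model in a scikit-learn `Pipeline`. Cross-validation then refits the imputer and scaler inside each training fold. The sketch below uses synthetic data and is only an illustration of the pattern.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (made up): the label depends on the first two features
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan  # inject roughly 10% missing values

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # medians learned per training fold
    ("scale", StandardScaler()),                    # mean/std learned per training fold
    ("model", LogisticRegression(max_iter=1000)),
])

# Each fold fits the whole pipeline on its training part only
scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
print(f"AUC-ROC: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The same fitted pipeline object can be applied to new data with `pipe.predict`, so the training-time statistics are reused at inference.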

**2. Overfitting Through Feature Engineering**
- ❌ **Wrong:** Create hundreds of features without validation
- ✅ **Right:** Use cross-validation, monitor validation performance
- ❌ **Wrong:** Target encoding without regularization
- ✅ **Right:** Use smoothing or cross-validation for target encoding
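
A minimal sketch of smoothed, out-of-fold target encoding is shown below. The helper name `smoothed_target_encode`, the smoothing weight `m`, and the toy data are assumptions for illustration. Each row is encoded using statistics from the other folds only, so its own label never leaks into its feature value.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def smoothed_target_encode(train_cat, train_y, apply_cat, m=10.0):
    """Blend each category's mean target with the global mean.

    m is a hypothetical smoothing weight: larger values pull rare
    categories more strongly toward the global mean.
    """
    global_mean = train_y.mean()
    stats = train_y.groupby(train_cat).agg(["sum", "count"])
    encoding = (stats["sum"] + m * global_mean) / (stats["count"] + m)
    # Categories unseen in the training part fall back to the global mean
    return apply_cat.map(encoding).fillna(global_mean)

# Toy data (made up): lot "C" has a single sample, so smoothing matters most there
df = pd.DataFrame({
    "lot": ["A"] * 6 + ["B"] * 3 + ["C"],
    "fail": [1, 1, 1, 0, 1, 1, 0, 0, 1, 1],
})

df["lot_te"] = np.nan
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for fit_idx, enc_idx in kfold.split(df):
    fit_part, enc_part = df.iloc[fit_idx], df.iloc[enc_idx]
    encoded = smoothed_target_encode(fit_part["lot"], fit_part["fail"], enc_part["lot"])
    df.loc[df.index[enc_idx], "lot_te"] = encoded.values

print(df)
```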

**3. Forgetting Feature Selection**
- ❌ **Wrong:** Feed all engineered features to model
- ✅ **Right:** Remove redundant and low-importance features
- Rule of thumb: if the number of features exceeds samples/10, apply feature selection

**4. Ignoring Computational Cost**
- ❌ **Wrong:** Complex features that slow down production inference
- ✅ **Right:** Balance accuracy gain vs inference speed
- Example: Real-time fraud detection needs <100ms, so limit feature complexity

**5. Not Documenting Feature Logic**
- ❌ **Wrong:** Complex lambda functions with no comments
- ✅ **Right:** Document domain rationale, units, expected ranges
- Include: Feature name, formula, business meaning, valid range

### Production Considerations

**1. Feature Store:**
```python
# Store feature definitions for reuse
feature_store = {
    'efficiency': {
        'formula': 'freq / power',
        'description': 'Performance per watt',
        'unit': 'MHz/W',
        'valid_range': [200, 5000],
        'created_date': '2024-01-15'
    }
}
```
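
A registry like this can also drive computation and range checks. The sketch below is one possible approach, assuming the `formula` strings are valid `DataFrame.eval` expressions. The helper `compute_and_validate` and the measurements are hypothetical.

```python
import pandas as pd

# Trimmed copy of the registry entry, keeping only the fields used here
feature_store = {
    "efficiency": {"formula": "freq / power", "valid_range": [200, 5000]},
}

def compute_and_validate(df, store):
    """Hypothetical helper: compute registered features and count out-of-range rows."""
    out = df.copy()
    report = {}
    for name, spec in store.items():
        out[name] = out.eval(spec["formula"])
        low, high = spec["valid_range"]
        # between() is False for NaN, so missing values also count as violations
        report[name] = int((~out[name].between(low, high)).sum())
    return out, report

# Made-up measurements: the second row yields an efficiency above the valid range
measurements = pd.DataFrame({
    "freq": [2000.0, 2400.0, 1800.0],
    "power": [2.0, 0.3, 4.0],
})
features, violations = compute_and_validate(measurements, feature_store)
print(features)
print("out-of-range counts:", violations)
```

Keeping the formula and the valid range in one place means the training code and the monitoring code cannot drift apart silently.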

**2. Feature Monitoring:**
- Track feature distributions over time
- Alert if feature values shift (data drift)
- Monitor missing value rates
- Validate feature ranges
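
As one example of a drift check, the sketch below computes a Population Stability Index (PSI) between a reference sample and a current sample of a feature. The function name, the bin count, and the thresholds in the comments (0.1 and 0.25 are commonly quoted heuristics) are assumptions, not fixed rules.

```python
import numpy as np

def population_stability_index(reference, current, n_bins=10):
    """PSI between two samples, using quantile bins derived from the reference."""
    # Interior bin edges from reference quantiles; outer bins are open-ended
    inner_edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))[1:-1]
    ref_counts = np.bincount(np.searchsorted(inner_edges, reference), minlength=n_bins)
    cur_counts = np.bincount(np.searchsorted(inner_edges, current), minlength=n_bins)
    # Clip fractions to avoid log(0) when a bin is empty
    ref_frac = np.clip(ref_counts / len(reference), 1e-6, None)
    cur_frac = np.clip(cur_counts / len(current), 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

rng = np.random.default_rng(0)
train_sample = rng.normal(0.0, 1.0, size=5000)   # reference distribution
stable_sample = rng.normal(0.0, 1.0, size=5000)  # same distribution
shifted_sample = rng.normal(1.0, 1.0, size=5000) # mean shifted by one std

psi_stable = population_stability_index(train_sample, stable_sample)
psi_shifted = population_stability_index(train_sample, shifted_sample)

# Heuristic reading: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 significant shift
print(f"PSI (stable):  {psi_stable:.4f}")
print(f"PSI (shifted): {psi_shifted:.4f}")
```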

**3. Versioning:**
- Version your feature engineering code
- Track which features were used for each model version
- Enable rollback if features break

**4. Scalability:**
- Test feature engineering on production data volumes
- Use vectorized operations (NumPy, Pandas)
- Consider Spark/Dask for big data
- Cache expensive features

### Decision Framework

**Starting a New Project?**

```mermaid
graph TD
    A[New Dataset] --> B{Understand Data}
    B --> C[Handle Missing Data]
    C --> D{Domain Knowledge?}
    D -->|Yes| E[Create Domain Features]
    D -->|No| F[Try Polynomial Features]
    E --> G{Many Features?}
    F --> G
    G -->|Yes >50| H[Feature Selection]
    G -->|No <50| I[Try All Features]
    H --> J[Train Baseline Model]
    I --> J
    J --> K{Performance Good?}
    K -->|No| L[Iterate: More Features or Better Selection]
    K -->|Yes| M[Deploy to Production]
    L --> C
```

### Recommended Workflow

**Phase 1: Understand Data (Day 1)**
1. Visualize distributions
2. Identify missing patterns
3. Check correlations
4. Document domain knowledge

**Phase 2: Basic Engineering (Day 2-3)**
1. Handle missing data
2. Scale numeric features
3. Encode categorical features
4. Create 5-10 domain features

**Phase 3: Advanced Engineering (Day 4-7)**
1. Try polynomial interactions
2. Create aggregation features
3. Test mathematical transforms
4. Generate temporal features (if time series)

**Phase 4: Selection & Validation (Day 8-10)**
1. Remove low variance
2. Remove high correlation
3. Select by importance
4. Cross-validate thoroughly

**Phase 5: Production (Day 11+)**
1. Document all features
2. Create feature pipeline
3. Test on production data
4. Set up monitoring

### Resources and Next Steps

**Further Learning:**
- **Books:** "Feature Engineering for Machine Learning" (Zheng & Casari)
- **Courses:** Fast.ai (practical feature engineering)
- **Libraries:** Featuretools (automated feature engineering), TPOT (AutoML)

**Next Notebooks:**
- **042:** Model Evaluation Metrics (how to measure improvement)
- **043:** Cross-Validation Techniques (proper feature validation)
- **044:** Hyperparameter Tuning (optimize after feature engineering)

**Practice Projects:**
- Kaggle competitions (force you to engineer features)
- Real-world datasets from work (domain knowledge helps!)
- Time series forecasting (temporal feature engineering)

---

## 🏁 Conclusion

**Feature Engineering is Both Art and Science:**
- **Art:** Requires intuition, domain knowledge, creativity
- **Science:** Needs rigorous validation, statistical testing, documentation

**Remember:**
> "Better data beats better algorithms. Better features beat better models."

A common rule of thumb is that ML practitioners spend 60-70% of their time on feature engineering and data quality, and only 30-40% on model selection and tuning.

**You've now learned:**
- ✅ Missing data handling (6 methods)
- ✅ Feature scaling (4 methods)
- ✅ Categorical encoding (5 methods)
- ✅ Feature creation (5 types)
- ✅ Feature selection (6 methods)
- ✅ End-to-end pipeline (semiconductor example)
- ✅ Production best practices

**Go engineer some amazing features! 🚀**