# ML Practice Questions - Part 2: Data Preprocessing and Feature Engineering

This notebook covers essential data preprocessing and feature engineering techniques. These skills are critical for building robust ML models that work well in production.

## Learning Objectives

By completing these questions, you will:
- Master different strategies for handling missing data
- Understand when and how to apply feature scaling techniques
- Learn various methods for encoding categorical variables
- Apply feature selection techniques to improve model performance
- Recognize and prevent data leakage in preprocessing pipelines

## Difficulty Levels
- ★☆☆ **Beginner**: Basic preprocessing concepts
- ★★☆ **Intermediate**: Advanced techniques and edge cases
- ★★★ **Advanced**: Complex scenarios and optimization

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler, RobustScaler, Normalizer,
    LabelEncoder, OneHotEncoder, OrdinalEncoder
)
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.feature_selection import (
    SelectKBest, chi2, f_classif, mutual_info_classif,
    RFE, SelectFromModel
)
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.datasets import make_classification, make_regression
from sklearn.metrics import accuracy_score, mean_squared_error
import warnings
warnings.filterwarnings('ignore')

# Set style for better plots
plt.style.use('default')
sns.set_palette("husl")
np.random.seed(42)

print("Libraries imported successfully!")

---

## Question 1: Missing Data Strategies ★★☆

**Question:** You have a dataset with different types of missing data patterns. For each scenario below, recommend the most appropriate handling strategy and implement it:

1. **Customer age**: 5% missing, normally distributed
2. **Income**: 15% missing, right-skewed distribution
3. **Product category**: 8% missing, categorical with clear hierarchy
4. **Website engagement score**: 25% missing, only missing for users who never logged in
5. **Survey responses**: 40% missing, likely missing not at random (MNAR)

Explain your reasoning and demonstrate the impact on model performance.

### Answer 1: Missing Data Strategies

#### **Missing Data Types (MCAR, MAR, MNAR)**

**MCAR (Missing Completely At Random)**: Missingness is independent of observed and unobserved data  
**MAR (Missing At Random)**: Missingness depends on observed data but not unobserved data  
**MNAR (Missing Not At Random)**: Missingness depends on the unobserved value itself
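
The three mechanisms above can be simulated to see how each one affects a naive estimate. The sketch below is illustrative only: the variables `age` and `income`, the 30% MCAR rate, and the sigmoid parameters are all assumptions chosen for the demonstration, not values from the question.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data: income rises with age (assumed relationship).
age = rng.normal(40, 10, n)
income = 1_000 * age + rng.normal(0, 5_000, n)

# MCAR: every row has the same 30% chance of being missing.
mcar_mask = rng.random(n) < 0.30

# MAR: missingness depends only on an observed variable (age).
mar_mask = rng.random(n) < 1 / (1 + np.exp(-(age - 40) / 5))

# MNAR: missingness depends on the unobserved value itself (income).
mnar_mask = rng.random(n) < 1 / (1 + np.exp(-(income - 40_000) / 5_000))

true_mean = income.mean()
observed_means = {
    "MCAR": income[~mcar_mask].mean(),
    "MAR": income[~mar_mask].mean(),
    "MNAR": income[~mnar_mask].mean(),
}
for name, observed in observed_means.items():
    print(f"{name}: observed mean minus true mean = {observed - true_mean:,.0f}")
```

Under MCAR the observed mean stays close to the true mean. Under MAR and MNAR the complete-case mean is biased; the difference is that the MAR bias can be corrected by conditioning on `age`, while the MNAR bias cannot be removed using observed data alone.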

#### **1. Customer Age (5% missing, normal distribution)**
**Strategy: Mean/Median Imputation**
- **Reasoning**: Low missingness rate, normal distribution suggests mean imputation is reasonable
- **Alternative**: KNN imputation if other demographic features are available
- **Implementation**: SimpleImputer with mean strategy

#### **2. Income (15% missing, right-skewed)**
**Strategy: Median Imputation or Log-transform then Mean**
- **Reasoning**: In a right-skewed distribution the long upper tail pulls the mean above the typical value, so the median is a more representative fill value
- **Alternative**: Predictive imputation using other features
- **Implementation**: SimpleImputer with median or KNNImputer

#### **3. Product Category (8% missing, categorical hierarchy)**
**Strategy: Mode Imputation or Hierarchical Imputation**
- **Reasoning**: Use category hierarchy to impute at appropriate level
- **Alternative**: Create "Unknown" category to preserve missingness information
- **Implementation**: Custom imputation based on parent categories
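
One possible reading of "imputation based on parent categories" is to fill a missing category with the most frequent category inside the same parent group. The sketch below assumes a hypothetical parent column `department` that is always observed; the column names and toy values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical two-level hierarchy: department (parent) -> category (child).
products = pd.DataFrame({
    "department": ["Tech", "Tech", "Tech", "Tech", "Media", "Media", "Media", "Garden"],
    "category": ["Phones", "Phones", "Laptops", np.nan, "Books", "Books", np.nan, np.nan],
})

def mode_or_unknown(values):
    """Most frequent non-missing value, or 'Unknown' if the group has none."""
    modes = values.mode()  # mode() skips NaN by default
    return modes.iloc[0] if not modes.empty else "Unknown"

# Most frequent child category within each parent group.
parent_mode = products.groupby("department")["category"].apply(mode_or_unknown)

# Keep a flag so the model can still see which rows were imputed.
products["category_was_missing"] = products["category"].isna().astype(int)
products["category_filled"] = products["category"].fillna(
    products["department"].map(parent_mode)
)
print(products)
```

Rows whose parent group has no observed category fall back to an explicit "Unknown" level, which matches the alternative listed above.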

#### **4. Website Engagement (25% missing, systematic pattern)**
**Strategy: Missingness Indicator + Imputation**
- **Reasoning**: Missingness is informative (never logged in), should be preserved
- **Method**: Create binary "never_logged_in" feature + impute with 0
- **Implementation**: Add indicator variable before imputation
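
The indicator-plus-fill approach can also be expressed with scikit-learn alone. This minimal sketch uses `SimpleImputer` with `add_indicator=True`, which appends a binary missingness column to the imputed output; the toy `engagement` values are made up.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy engagement scores; NaN marks users who never logged in (invented values).
engagement = np.array([[52.0], [np.nan], [47.5], [np.nan], [61.0]])

# Fill with 0 and append a 0/1 column that records where values were missing.
imputer = SimpleImputer(strategy="constant", fill_value=0, add_indicator=True)
engagement_with_flag = imputer.fit_transform(engagement)

print(engagement_with_flag)
# Column 0: imputed engagement, column 1: missingness indicator
```

Because the indicator is produced by the transformer itself, the same step can sit inside a `Pipeline` and be fitted on training folds only.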

#### **5. Survey Responses (40% missing, MNAR)**
**Strategy: Multiple Imputation or Domain-Specific Modeling**
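- **Reasoning**: With 40% missing and missingness tied to the unobserved value, any imputation fitted on the observed responses inherits their bias
- **Alternatives**: 
  - Complete-case analysis as a baseline only: it is also biased under MNAR, even when the remaining sample is large
  - Model the missingness mechanism explicitly (e.g., selection or pattern-mixture models)
- **Implementation**: Multiple imputation (which assumes MAR) combined with a sensitivity analysis that varies the assumed departure from MAR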
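
A rough sketch of multiple imputation with a simple sensitivity analysis is shown below. It uses scikit-learn's experimental `IterativeImputer` with `sample_posterior=True` to draw several completed datasets. The synthetic variables `x` and `y`, the MAR mechanism, and the `delta` shifts are assumptions made for illustration; a real MNAR analysis would need domain input on plausible shifts.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
n = 500
x = rng.normal(0, 1, n)
y = 2 * x + rng.normal(0, 0.5, n)
data = np.column_stack([x, y])

# Assumed mechanism: y is more often missing when the observed x is large (MAR).
missing = rng.random(n) < 1 / (1 + np.exp(-x))
data_missing = data.copy()
data_missing[missing, 1] = np.nan

# Draw several completed datasets and estimate the mean of y in each.
estimates = []
for seed in range(5):
    imputer = IterativeImputer(sample_posterior=True, max_iter=10, random_state=seed)
    completed = imputer.fit_transform(data_missing)
    estimates.append(completed[:, 1].mean())

pooled = float(np.mean(estimates))           # pooled point estimate
between_var = float(np.var(estimates, ddof=1))  # spread across imputations

print(f"Complete-case mean of y: {np.nanmean(data_missing[:, 1]):.3f}")
print(f"Pooled imputed mean:     {pooled:.3f} (between-imputation var {between_var:.5f})")

# Sensitivity analysis: shift every imputed value by delta to mimic MNAR.
for delta in (0.0, -0.5, -1.0):
    print(f"delta={delta:+.1f}: adjusted mean = {pooled + delta * missing.mean():.3f}")
```

If the conclusion changes materially across plausible `delta` values, the result depends on untestable assumptions about the missing responses and should be reported that way.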

In [None]:
# Create synthetic dataset with different missing data patterns
np.random.seed(42)
n_samples = 1000

# Generate base data
age = np.random.normal(35, 10, n_samples)
income = np.random.lognormal(10.5, 0.5, n_samples)  # Right-skewed
categories = np.random.choice(['Electronics', 'Clothing', 'Books', 'Home', 'Sports'], n_samples)
logged_in = np.random.binomial(1, 0.7, n_samples)  # 70% log in
engagement = np.where(logged_in, np.random.normal(50, 15, n_samples), np.nan)
survey_response = np.clip(np.random.normal(3, 1, n_samples), 1, 5)  # Clipped to the 1-5 scale

# Create DataFrame
df = pd.DataFrame({
    'age': age,
    'income': income,
    'category': categories,
    'engagement': engagement,
    'survey_response': survey_response
})

# Introduce missing data patterns

# 1. Age: 5% MCAR
missing_age_idx = np.random.choice(n_samples, int(0.05 * n_samples), replace=False)
df.loc[missing_age_idx, 'age'] = np.nan

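# 2. Income: ~15% MAR (higher chance of missing for younger people)
# Centering the sigmoid at age 15 gives an average missingness of roughly 15% for ages ~ N(35, 10)
income_missing_prob = 1 / (1 + np.exp(0.1 * (df['age'] - 15)))  # Sigmoid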
income_missing_prob = income_missing_prob.fillna(0.15)  # Handle NaN ages
income_missing = np.random.binomial(1, income_missing_prob)
df.loc[income_missing == 1, 'income'] = np.nan

# 3. Category: 8% MCAR
missing_cat_idx = np.random.choice(n_samples, int(0.08 * n_samples), replace=False)
df.loc[missing_cat_idx, 'category'] = np.nan

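# 4. Engagement: Already has systematic missingness (never logged in)
# This is structural missingness: the score does not exist for these users,
# and the NaN mask itself encodes login status (logged_in is not a column in df)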

# 5. Survey: 40% MNAR (people with extreme opinions less likely to respond)
extreme_opinions = (df['survey_response'] < 2) | (df['survey_response'] > 4)
survey_missing_prob = np.where(extreme_opinions, 0.6, 0.3)  # Higher missingness for extreme views
survey_missing = np.random.binomial(1, survey_missing_prob)
df.loc[survey_missing == 1, 'survey_response'] = np.nan

print("Dataset with Missing Data Patterns:")
print("="*50)
print(f"Total samples: {len(df)}")
print("\nMissing data summary:")
missing_summary = df.isnull().sum()
for col in missing_summary.index:
    pct = missing_summary[col] / len(df) * 100
    print(f"{col:<15}: {missing_summary[col]:>4} ({pct:>5.1f}%)")

# Visualize missing data patterns
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
fig.suptitle('Missing Data Patterns and Distributions', fontsize=16)

# Missing data heatmap
missing_matrix = df.isnull()
sns.heatmap(missing_matrix, yticklabels=False, cbar=True, cmap='viridis', ax=axes[0, 0])
axes[0, 0].set_title('Missing Data Pattern')
axes[0, 0].set_xlabel('Features')

# Age distribution
axes[0, 1].hist(df['age'].dropna(), bins=30, alpha=0.7, color='skyblue')
axes[0, 1].set_title('Age Distribution (Normal)')
axes[0, 1].set_xlabel('Age')
axes[0, 1].axvline(df['age'].mean(), color='red', linestyle='--', label='Mean')
axes[0, 1].axvline(df['age'].median(), color='orange', linestyle='--', label='Median')
axes[0, 1].legend()

# Income distribution (linear scale, right-skewed)
axes[0, 2].hist(df['income'].dropna(), bins=30, alpha=0.7, color='lightgreen')
axes[0, 2].set_title('Income Distribution (Right-skewed)')
axes[0, 2].set_xlabel('Income')
axes[0, 2].axvline(df['income'].mean(), color='red', linestyle='--', label='Mean')
axes[0, 2].axvline(df['income'].median(), color='orange', linestyle='--', label='Median')
axes[0, 2].legend()

# Category distribution
category_counts = df['category'].value_counts()
axes[1, 0].bar(category_counts.index, category_counts.values, alpha=0.7, color='lightcoral')
axes[1, 0].set_title('Category Distribution')
axes[1, 0].set_xlabel('Category')
axes[1, 0].tick_params(axis='x', rotation=45)

# Engagement vs Login status
df_temp = df.copy()
df_temp['logged_in'] = ~df_temp['engagement'].isnull()
login_counts = df_temp['logged_in'].value_counts()
axes[1, 1].bar(['Never Logged In', 'Logged In'], 
               [login_counts[False], login_counts[True]], 
               alpha=0.7, color=['red', 'green'])
axes[1, 1].set_title('Engagement Missingness Pattern')
axes[1, 1].set_ylabel('Count')

# Survey response distribution
axes[1, 2].hist(df['survey_response'].dropna(), bins=20, alpha=0.7, color='purple')
axes[1, 2].set_title('Survey Response Distribution')
axes[1, 2].set_xlabel('Survey Score (1-5)')
axes[1, 2].axvline(df['survey_response'].mean(), color='red', linestyle='--', label='Mean')
axes[1, 2].legend()

plt.tight_layout()
plt.show()

print(f"\nMissing Data Analysis:")
print(f"• Age: Low missingness ({missing_summary['age']/len(df):.1%}), normal distribution → Mean imputation")
print(f"• Income: Moderate missingness ({missing_summary['income']/len(df):.1%}), skewed → Median imputation")
print(f"• Category: Low missingness ({missing_summary['category']/len(df):.1%}) → Mode imputation")
print(f"• Engagement: High systematic missingness ({missing_summary['engagement']/len(df):.1%}) → Indicator + Imputation")
print(f"• Survey: Very high missingness ({missing_summary['survey_response']/len(df):.1%}), MNAR → Multiple imputation")

In [None]:
# Implement different imputation strategies and compare performance

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# Create target variable for evaluation
# Simulate a target that depends on our features
np.random.seed(42)
target = (0.1 * df['age'].fillna(df['age'].mean()) + 
          0.00001 * df['income'].fillna(df['income'].median()) +
          0.01 * df['engagement'].fillna(0) +
          df['survey_response'].fillna(3) +
          np.random.normal(0, 1, len(df)))
target = (target > target.median()).astype(int)  # Convert to binary classification

# Strategy 1: Simple Imputation
def simple_imputation_strategy(df):
    df_imputed = df.copy()
    
    # Filled columns are assigned back explicitly; fillna(inplace=True) on a
    # selected column is unreliable under pandas copy-on-write
    
    # Age: Mean imputation
    df_imputed['age'] = df_imputed['age'].fillna(df_imputed['age'].mean())
    
    # Income: Median imputation
    df_imputed['income'] = df_imputed['income'].fillna(df_imputed['income'].median())
    
    # Category: Mode imputation
    df_imputed['category'] = df_imputed['category'].fillna(df_imputed['category'].mode()[0])
    
    # Engagement: Create indicator + zero imputation
    df_imputed['never_logged_in'] = df_imputed['engagement'].isnull().astype(int)
    df_imputed['engagement'] = df_imputed['engagement'].fillna(0)
    
    # Survey: Mean imputation (simple approach)
    df_imputed['survey_response'] = df_imputed['survey_response'].fillna(
        df_imputed['survey_response'].mean())
    
    return df_imputed

# Strategy 2: Advanced Imputation
def advanced_imputation_strategy(df):
    df_imputed = df.copy()
    
    # Engagement: create the indicator first, then fill with 0
    df_imputed['never_logged_in'] = df_imputed['engagement'].isnull().astype(int)
    df_imputed['engagement'] = df_imputed['engagement'].fillna(0)
    
    # Age, income, survey: KNN imputation on the remaining numeric columns
    # (features are left unscaled here, so income dominates the neighbor distances)
    knn_cols = ['age', 'income', 'survey_response']
    knn_imputer = KNNImputer(n_neighbors=5)
    df_imputed[knn_cols] = knn_imputer.fit_transform(df_imputed[knn_cols])
    
    # Category: Mode imputation
    df_imputed['category'] = df_imputed['category'].fillna(df_imputed['category'].mode()[0])
    
    return df_imputed

# Strategy 3: Complete Case Analysis (for comparison)
def complete_case_strategy(df):
    return df.dropna()

# Apply strategies
df_simple = simple_imputation_strategy(df)
df_advanced = advanced_imputation_strategy(df)
df_complete = complete_case_strategy(df)

print("Imputation Strategy Comparison:")
print("="*50)
print(f"Original dataset: {len(df)} samples")
print(f"Complete cases: {len(df_complete)} samples ({len(df_complete)/len(df):.1%} retained)")
print(f"Simple imputation: {len(df_simple)} samples (no loss)")
print(f"Advanced imputation: {len(df_advanced)} samples (no loss)")

# Prepare features for modeling
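def prepare_features(df_imputed):
    # One-hot encode category; dtype=int keeps the dummy columns numeric so that
    # select_dtypes does not drop them (the default dummy dtype is bool in pandas >= 2.0)
    df_encoded = pd.get_dummies(df_imputed, columns=['category'], prefix='cat', dtype=int)
    return df_encoded.select_dtypes(include=[np.number])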

# Evaluate different strategies
def evaluate_imputation_strategy(df_imputed, target_subset, strategy_name):
    X = prepare_features(df_imputed)
    y = target_subset
    
    # Simple model for evaluation
    model = RandomForestClassifier(n_estimators=50, random_state=42)
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    
    return scores.mean(), scores.std()

# Evaluate strategies
results = {}

# Complete cases
complete_indices = df_complete.index
results['Complete Cases'] = evaluate_imputation_strategy(
    df_complete, target[complete_indices], 'Complete Cases'
)

# Simple imputation
results['Simple Imputation'] = evaluate_imputation_strategy(
    df_simple, target, 'Simple Imputation'
)

# Advanced imputation
results['Advanced Imputation'] = evaluate_imputation_strategy(
    df_advanced, target, 'Advanced Imputation'
)

print("\nModel Performance Comparison:")
print("="*50)
for strategy, (mean_score, std_score) in results.items():
    print(f"{strategy:<20}: {mean_score:.4f} ± {std_score:.4f}")

# Visualize impact of missing data handling
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Sample sizes
strategies = list(results.keys())
sample_sizes = [len(df_complete), len(df_simple), len(df_advanced)]
accuracies = [results[s][0] for s in strategies]
errors = [results[s][1] for s in strategies]

# Plot 1: Sample sizes
bars1 = axes[0].bar(strategies, sample_sizes, alpha=0.7, color=['red', 'orange', 'green'])
axes[0].set_title('Sample Sizes After Handling Missing Data')
axes[0].set_ylabel('Number of Samples')
axes[0].tick_params(axis='x', rotation=45)
for bar, size in zip(bars1, sample_sizes):
    axes[0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 10,
                 str(size), ha='center', va='bottom')

# Plot 2: Model performance
bars2 = axes[1].bar(strategies, accuracies, yerr=errors, alpha=0.7, 
                    color=['red', 'orange', 'green'], capsize=5)
axes[1].set_title('Model Performance by Strategy')
axes[1].set_ylabel('Accuracy')
axes[1].tick_params(axis='x', rotation=45)
axes[1].set_ylim(0.5, max(accuracies) + 0.05)
for bar, acc in zip(bars2, accuracies):
    axes[1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.005,
                 f'{acc:.3f}', ha='center', va='bottom')

# Plot 3: Feature correlation after imputation
corr_matrix = prepare_features(df_advanced).corr()
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
sns.heatmap(corr_matrix, mask=mask, annot=False, cmap='coolwarm', center=0, ax=axes[2])
axes[2].set_title('Feature Correlations (Advanced Imputation)')

plt.tight_layout()
plt.show()

print(f"\nKey Insights:")
print(f"• Complete case analysis loses {(1-len(df_complete)/len(df)):.1%} of data")
print(f"• Simple imputation preserves all samples but shrinks variance and can bias estimates")
print(f"• Advanced imputation (KNN) can better preserve relationships between features")
print(f"• Adding missingness indicators preserves important information")
print(f"• Strategy choice depends on missingness mechanism and business context")

---

## Question 2: Feature Scaling Decisions ★★☆

**Question:** You're building different types of models with the following features. For each model type, decide whether scaling is needed and which scaling method to use:

**Features:**
- Age (18-80)
- Income (20,000-500,000)
- Credit Score (300-850)
- Number of transactions (0-1000)
- Account balance (-10,000 to 100,000)

**Models to consider:**
1. Random Forest
2. Logistic Regression
3. SVM with RBF kernel
4. K-Means clustering
5. Principal Component Analysis (PCA)

Implement and demonstrate the impact of different scaling choices.

### Answer 2: Feature Scaling Decisions

#### **Scaling Requirements by Algorithm Type**

**Distance- and Variance-Based Algorithms**: Require scaling (KNN, SVM, K-Means are distance-based; PCA is variance-based)  
**Tree-Based Algorithms**: Don't require scaling (Random Forest, Decision Trees)  
**Linear Algorithms**: Often benefit from scaling, especially with regularization or iterative solvers (Linear/Logistic Regression)  
**Neural Networks**: Almost always require scaling

#### **Scaling Method Selection**

**StandardScaler (Z-score normalization)**:
- Best for: Roughly symmetric (e.g., normal) features without extreme outliers
- Formula: (x - μ) / σ
- Use when: No fixed output range is needed; note that μ and σ are themselves sensitive to outliers

**MinMaxScaler**:
- Best for: Features with known bounds and few outliers
- Formula: (x - min) / (max - min)
- Use when: Features must lie in [0,1]; a single outlier compresses all other values (to preserve zeros in sparse data, use MaxAbsScaler instead)

**RobustScaler**:
- Best for: Data with outliers
- Formula: (x - median) / IQR
- Use when: Outliers are present and should not drive the center and scale

**Normalizer**:
- Best for: When direction matters more than magnitude
- Formula: x / ||x||₂, applied to each sample (row) rather than each feature (column)
- Use when: Text features, sparse data, cosine-similarity models
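
The formulas above can be checked against scikit-learn directly. The sketch below computes each transformation by hand on a small made-up matrix (the values, including the outlier of 100, are arbitrary) and compares the result with the corresponding scaler.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer, RobustScaler, StandardScaler

# Toy matrix: column 0 contains an outlier (100), column 1 is evenly spaced.
X_toy = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0], [4.0, 500.0], [100.0, 600.0]])

# StandardScaler: column-wise z-score using the population std (ddof=0).
standard_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# MinMaxScaler: column-wise rescaling to [0, 1].
minmax_manual = (X_toy - X_toy.min(axis=0)) / (X_toy.max(axis=0) - X_toy.min(axis=0))

# RobustScaler: column-wise centering on the median, scaling by the IQR.
q1, median, q3 = np.percentile(X_toy, [25, 50, 75], axis=0)
robust_manual = (X_toy - median) / (q3 - q1)

# Normalizer: row-wise division by the L2 norm of each sample.
normalizer_manual = X_toy / np.linalg.norm(X_toy, axis=1, keepdims=True)

checks = {
    "StandardScaler": np.allclose(StandardScaler().fit_transform(X_toy), standard_manual),
    "MinMaxScaler": np.allclose(MinMaxScaler().fit_transform(X_toy), minmax_manual),
    "RobustScaler": np.allclose(RobustScaler().fit_transform(X_toy), robust_manual),
    "Normalizer": np.allclose(Normalizer().fit_transform(X_toy), normalizer_manual),
}
print(checks)
print("MinMax, column 0:", minmax_manual[:, 0].round(3))  # outlier squeezes the rest near 0
print("Robust, column 0:", robust_manual[:, 0].round(3))  # typical values keep their spread
```

Comparing the last two printed lines shows why `RobustScaler` is preferred when outliers are present: the four ordinary values stay well separated instead of being compressed into a narrow band.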

#### **Recommendations by Model**

1. **Random Forest**: No scaling needed (tree-based)
2. **Logistic Regression**: StandardScaler (helps convergence and makes regularization treat features evenly)
3. **SVM with RBF**: StandardScaler or MinMaxScaler (distance-based)
4. **K-Means**: StandardScaler (distance-based clustering)
5. **PCA**: StandardScaler (variance-based dimensionality reduction)
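
When a scaler is evaluated with cross-validation, it should be fitted on the training folds only. A minimal sketch of that pattern follows, using `make_pipeline` so the scaler is re-fitted inside each fold. The synthetic data and the inflated noise column are assumptions made purely to make the scaling effect visible.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data: with shuffle=False, columns 0-2 are informative and 3-4 are noise.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
X_demo[:, 4] *= 1000  # blow up a pure-noise column so it dominates distances

# Without scaling, the RBF kernel is driven almost entirely by the noise column.
unscaled_scores = cross_val_score(SVC(), X_demo, y_demo, cv=5)

# The pipeline fits StandardScaler on each training fold, then applies it to the test fold.
scaled_scores = cross_val_score(
    make_pipeline(StandardScaler(), SVC()), X_demo, y_demo, cv=5
)

print(f"SVC without scaling:      {unscaled_scores.mean():.3f}")
print(f"SVC with scaler in pipeline: {scaled_scores.mean():.3f}")
```

Calling `fit_transform` on the full dataset before cross-validation would let test-fold statistics influence the scaler; keeping the scaler inside the pipeline avoids that leakage at no extra cost.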

In [None]:
# Generate synthetic financial dataset
np.random.seed(42)
n_samples = 1000

# Create features with different scales
age = np.random.randint(18, 81, n_samples)
income = np.random.lognormal(10.8, 0.6, n_samples)  # 20k-500k range
income = np.clip(income, 20000, 500000)
credit_score = np.random.normal(650, 100, n_samples)
credit_score = np.clip(credit_score, 300, 850)
transactions = np.random.poisson(50, n_samples)  # Counts centered near 50, within the 0-1000 bounds
transactions = np.clip(transactions, 0, 1000)
balance = np.random.normal(20000, 25000, n_samples)  # Can be negative
balance = np.clip(balance, -10000, 100000)

# Create DataFrame
financial_df = pd.DataFrame({
    'age': age,
    'income': income,
    'credit_score': credit_score,
    'transactions': transactions,
    'balance': balance
})

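# Create target variable (loan approval)
# The intercept of -10 roughly offsets the average feature contribution,
# which keeps the two classes approximately balanced
approval_prob = 1 / (1 + np.exp(-(
    0.05 * age + 
    0.00001 * income + 
    0.01 * credit_score + 
    0.002 * transactions + 
    0.00001 * balance - 10
)))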
loan_approved = np.random.binomial(1, approval_prob)

print("Financial Dataset Overview:")
print("="*50)
print(financial_df.describe())
print(f"\nLoan Approval Rate: {loan_approved.mean():.1%}")

# Visualize feature distributions and scales
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
fig.suptitle('Feature Distributions and Scales', fontsize=16)

features = ['age', 'income', 'credit_score', 'transactions', 'balance']
colors = ['skyblue', 'lightgreen', 'orange', 'pink', 'purple']

for i, (feature, color) in enumerate(zip(features, colors)):
    row, col = divmod(i, 3)
    axes[row, col].hist(financial_df[feature], bins=30, alpha=0.7, color=color)
    axes[row, col].set_title(f'{feature.replace("_", " ").title()}')
    axes[row, col].set_xlabel(feature.replace("_", " ").title())
    axes[row, col].grid(True, alpha=0.3)
    
    # Add statistics
    mean_val = financial_df[feature].mean()
    std_val = financial_df[feature].std()
    axes[row, col].axvline(mean_val, color='red', linestyle='--', alpha=0.7)
    axes[row, col].text(0.05, 0.95, f'μ={mean_val:.0f}\nσ={std_val:.0f}', 
                        transform=axes[row, col].transAxes, verticalalignment='top',
                        bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))

# Feature scale comparison
axes[1, 2].bar(range(len(features)), 
               [financial_df[f].std() for f in features], 
               alpha=0.7, color=colors)
axes[1, 2].set_title('Feature Standard Deviations')
axes[1, 2].set_ylabel('Standard Deviation')
axes[1, 2].set_xticks(range(len(features)))
axes[1, 2].set_xticklabels([f.replace('_', '\n') for f in features], rotation=0)
axes[1, 2].set_yscale('log')
axes[1, 2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\nFeature Scale Analysis:")
print(f"{'Feature':<15} {'Min':<10} {'Max':<10} {'Mean':<10} {'Std':<10} {'Range':<15}")
print("-" * 75)
for feature in features:
    data = financial_df[feature]
    print(f"{feature:<15} {data.min():<10.0f} {data.max():<10.0f} {data.mean():<10.0f} {data.std():<10.0f} {data.max()-data.min():<15.0f}")

print(f"\nScale Differences:")
print(f"• Income's spread is roughly three orders of magnitude larger than age's")
print(f"• Balance can be negative; the other features are non-negative")
print(f"• Different distributions: uniform, log-normal, normal, Poisson")

In [None]:
# Compare different scaling methods and their impact on various algorithms

from sklearn.svm import SVC
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# Prepare data
X = financial_df.values
y = loan_approved

# Initialize scalers
scalers = {
    'No Scaling': None,
    'StandardScaler': StandardScaler(),
    'MinMaxScaler': MinMaxScaler(),
    'RobustScaler': RobustScaler()
}

# Function to apply scaling to the full matrix (used for unsupervised K-Means only)
def apply_scaling(X, scaler):
    if scaler is None:
        return X
    return scaler.fit_transform(X)

# Wrap scaler and model in a Pipeline so that cross-validation re-fits the
# scaler on each training fold and test-fold statistics do not leak
def with_scaler(model, scaler):
    if scaler is None:
        return model
    return Pipeline([('scaler', scaler), ('model', model)])

# Evaluate different models with different scaling
results = {}

print("Model Performance with Different Scaling Methods:")
print("=" * 75)
print(f"{'Scaler':<15} {'Random Forest':<15} {'Logistic Reg':<15} {'SVM':<15} {'K-Means':<15}")
print("-" * 75)

for scaler_name, scaler in scalers.items():
    # Random Forest (should be unaffected by scaling)
    rf_scores = cross_val_score(
        with_scaler(RandomForestClassifier(n_estimators=50, random_state=42), scaler),
        X, y, cv=5, scoring='accuracy')
    rf_mean = rf_scores.mean()
    
    # Logistic Regression (should benefit from scaling)
    lr_scores = cross_val_score(
        with_scaler(LogisticRegression(max_iter=1000, random_state=42), scaler),
        X, y, cv=5, scoring='accuracy')
    lr_mean = lr_scores.mean()
    
    # SVM (very sensitive to scaling)
    try:
        svm_scores = cross_val_score(
            with_scaler(SVC(random_state=42), scaler),
            X, y, cv=3, scoring='accuracy')  # Reduced CV for speed
        svm_mean = svm_scores.mean()
    except Exception:
        svm_mean = 0.0  # Fitting failed
    
    # K-Means clustering (unsupervised, so there is no held-out fold to leak into)
    # The silhouette is computed in each scaled space, so compare across scalers with care
    X_scaled = apply_scaling(X, scaler)
    kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
    cluster_labels = kmeans.fit_predict(X_scaled)
    silhouette = silhouette_score(X_scaled, cluster_labels)
    
    print(f"{scaler_name:<15} {rf_mean:<15.3f} {lr_mean:<15.3f} {svm_mean:<15.3f} {silhouette:<15.3f}")
    
    results[scaler_name] = {
        'Random Forest': rf_mean,
        'Logistic Regression': lr_mean,
        'SVM': svm_mean,
        'K-Means': silhouette
    }

print("\nNotes:")
print("• Random Forest: Accuracy (tree-based, scale-invariant)")
print("• Logistic Regression: Accuracy (linear, benefits from scaling)")
print("• SVM: Accuracy (distance-based, very sensitive to scaling)")
print("• K-Means: Silhouette Score (distance-based clustering)")

# Visualize scaling effects
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
fig.suptitle('Impact of Scaling on Different Algorithms', fontsize=16)

# Performance comparison
scaler_names = list(results.keys())
models = ['Random Forest', 'Logistic Regression', 'SVM', 'K-Means']
colors = ['skyblue', 'lightgreen', 'orange', 'pink']

for i, model in enumerate(models):
    row, col = divmod(i, 2)
    scores = [results[scaler][model] for scaler in scaler_names]
    
    bars = axes[row, col].bar(scaler_names, scores, alpha=0.7, color=colors[i])
    axes[row, col].set_title(f'{model} Performance')
    axes[row, col].set_ylabel('Score' if model != 'K-Means' else 'Silhouette Score')
    axes[row, col].tick_params(axis='x', rotation=45)
    axes[row, col].grid(True, alpha=0.3)
    
    # Add value labels
    for bar, score in zip(bars, scores):
        axes[row, col].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.005,
                            f'{score:.3f}', ha='center', va='bottom')

plt.tight_layout()
plt.show()

In [None]:
# Demonstrate the effect of scaling on feature distributions and PCA

# Compare original vs scaled features
standard_scaler = StandardScaler()
minmax_scaler = MinMaxScaler()
robust_scaler = RobustScaler()

X_standard = standard_scaler.fit_transform(X)
X_minmax = minmax_scaler.fit_transform(X)
X_robust = robust_scaler.fit_transform(X)

# Create comparison DataFrame
scaling_comparison = pd.DataFrame({
    'Original_Age': X[:, 0],
    'Standard_Age': X_standard[:, 0],
    'MinMax_Age': X_minmax[:, 0],
    'Robust_Age': X_robust[:, 0],
    'Original_Income': X[:, 1],
    'Standard_Income': X_standard[:, 1],
    'MinMax_Income': X_minmax[:, 1],
    'Robust_Income': X_robust[:, 1]
})

# Visualize scaling effects
fig, axes = plt.subplots(2, 4, figsize=(16, 8))
fig.suptitle('Effect of Different Scaling Methods', fontsize=16)

# Age distributions
scaling_methods = ['Original', 'Standard', 'MinMax', 'Robust']
age_cols = ['Original_Age', 'Standard_Age', 'MinMax_Age', 'Robust_Age']
income_cols = ['Original_Income', 'Standard_Income', 'MinMax_Income', 'Robust_Income']

for i, (method, age_col, income_col) in enumerate(zip(scaling_methods, age_cols, income_cols)):
    # Age
    axes[0, i].hist(scaling_comparison[age_col], bins=30, alpha=0.7, color='skyblue')
    axes[0, i].set_title(f'Age - {method}')
    axes[0, i].set_xlabel('Scaled Value')
    axes[0, i].grid(True, alpha=0.3)
    
    # Income
    axes[1, i].hist(scaling_comparison[income_col], bins=30, alpha=0.7, color='lightgreen')
    axes[1, i].set_title(f'Income - {method}')
    axes[1, i].set_xlabel('Scaled Value')
    axes[1, i].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# PCA comparison with and without scaling
print("\nPCA Analysis: Impact of Scaling")
print("="*50)

# PCA without scaling
pca_no_scale = PCA()
pca_no_scale.fit(X)

# PCA with standard scaling
pca_scaled = PCA()
pca_scaled.fit(X_standard)

print("Explained Variance Ratio (First 3 Components):")
print(f"Without Scaling: {pca_no_scale.explained_variance_ratio_[:3]}")
print(f"With Scaling:    {pca_scaled.explained_variance_ratio_[:3]}")

print(f"\nCumulative Explained Variance (First 3 Components):")
print(f"Without Scaling: {np.cumsum(pca_no_scale.explained_variance_ratio_[:3])}")
print(f"With Scaling:    {np.cumsum(pca_scaled.explained_variance_ratio_[:3])}")

# Feature contributions to first PC
print(f"\nFirst Principal Component Loadings:")
print(f"{'Feature':<15} {'No Scaling':<12} {'With Scaling':<12}")
print("-" * 40)
for i, feature in enumerate(features):
    loading_no_scale = pca_no_scale.components_[0, i]
    loading_scaled = pca_scaled.components_[0, i]
    print(f"{feature:<15} {loading_no_scale:<12.4f} {loading_scaled:<12.4f}")

print(f"\nKey Observations:")
print(f"• Without scaling: Income and balance dominate PCA due to their large variances")
print(f"• With scaling: All features contribute more equally")
print(f"• Random Forest: Essentially unaffected by scaling (tree-based)")
print(f"• Logistic Regression: Scaling typically improves solver convergence")
print(f"• SVM: Usually the largest gain from scaling (the RBF kernel is distance-based)")
print(f"• K-Means: Without scaling, clusters are driven by the large-scale features")

# Summary recommendations
print(f"\n" + "="*60)
print(f"SCALING RECOMMENDATIONS BY ALGORITHM")
print(f"="*60)
recommendations = {
    'Random Forest': 'No scaling needed',
    'Logistic Regression': 'StandardScaler (improves convergence)',
    'SVM': 'StandardScaler or MinMaxScaler (essential)',
    'K-Means': 'StandardScaler (distance-based)',
    'PCA': 'StandardScaler (variance-based)',
    'Neural Networks': 'StandardScaler or MinMaxScaler',
    'KNN': 'StandardScaler or MinMaxScaler'
}

for algorithm, recommendation in recommendations.items():
    print(f"{algorithm:<20}: {recommendation}")

---

## Question 3: Categorical Encoding Strategies ★★★

**Question:** You have different types of categorical variables that need encoding. For each variable type below, choose the most appropriate encoding method and explain potential pitfalls:

1. **Product Category** (5 categories, no order): Electronics, Clothing, Books, Home, Sports
2. **Education Level** (ordinal): High School < Bachelor's < Master's < PhD
3. **City** (high cardinality): 500+ unique cities
4. **Customer ID** (identifier): Unique per customer
5. **Day of Week** (cyclical): Monday through Sunday

Implement different encoding strategies and compare their impact on model performance.

### Answer 3: Categorical Encoding Strategies

#### **Encoding Method Selection Guide**

**One-Hot Encoding**:
- Best for: Low cardinality nominal variables
- Pros: No ordinal assumptions, interpretable
- Cons: High dimensionality, sparse features
- Use when: <10-20 categories, no natural order

**Ordinal Encoding**:
- Best for: Variables with natural order
- Pros: Single feature, preserves order
- Cons: Assumes equal spacing between categories
- Use when: Clear hierarchical relationship
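
One pitfall with ordinal encoding in scikit-learn is that `OrdinalEncoder` sorts categories alphabetically unless the order is passed explicitly. The short sketch below contrasts the two behaviours on the education levels from the question; the DataFrame itself is a toy example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

education_order = ["High School", "Bachelor's", "Master's", "PhD"]
toy = pd.DataFrame({"education": education_order})

# Default behaviour: categories are sorted alphabetically, which breaks the ranking.
alphabetical_codes = OrdinalEncoder().fit_transform(toy[["education"]]).ravel()

# Passing the intended order preserves High School < Bachelor's < Master's < PhD.
explicit_codes = OrdinalEncoder(categories=[education_order]).fit_transform(
    toy[["education"]]
).ravel()

print("Alphabetical:", alphabetical_codes.tolist())
print("Explicit:    ", explicit_codes.tolist())
```

With the default ordering, "Bachelor's" receives a lower code than "High School", so a linear model would see the hierarchy in the wrong order.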

**Target Encoding (Mean Encoding)**:
- Best for: High cardinality variables
- Pros: Reduces dimensionality, captures target relationship
- Cons: Overfitting and target leakage unless computed out-of-fold with smoothing
- Use when: Many categories; smooth toward the global mean where data per category is limited

**Binary Encoding**:
- Best for: Medium cardinality variables
- Pros: Lower dimensionality than one-hot
- Cons: Less interpretable
- Use when: 10-100 categories

**Cyclical Encoding**:
- Best for: Cyclical variables (time, angles)
- Pros: Captures cyclical nature
- Cons: Requires knowing the period in advance and produces two features per variable
- Use when: Natural cyclical patterns
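
Cyclical encoding maps each position in the cycle to a point on the unit circle. A minimal sketch for day of week follows; the integer mapping (Monday = 0 through Sunday = 6) is an assumed convention.

```python
import numpy as np

days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
day_index = np.arange(len(days))  # Monday=0 ... Sunday=6 (assumed convention)

# Map each day to an angle on the unit circle, then take sin and cos.
angle = 2 * np.pi * day_index / len(days)
encoded = np.column_stack([np.sin(angle), np.cos(angle)])

def encoded_distance(a, b):
    """Euclidean distance between two days in the (sin, cos) space."""
    return float(np.linalg.norm(encoded[a] - encoded[b]))

# With plain integers, Sunday (6) and Monday (0) look 6 steps apart.
print("Integer gap Sunday-Monday:  ", abs(6 - 0))
# On the circle, Sunday-Monday is as close as Monday-Tuesday.
print("Encoded dist Sunday-Monday: ", round(encoded_distance(6, 0), 3))
print("Encoded dist Monday-Tuesday:", round(encoded_distance(0, 1), 3))
```

Both sine and cosine are needed: with only one of them, two different days can share the same value.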

#### **Specific Recommendations**

1. **Product Category**: One-Hot Encoding (low cardinality, nominal)
2. **Education Level**: Ordinal Encoding (clear hierarchy)
3. **City**: Target Encoding or Embedding (high cardinality)
4. **Customer ID**: Drop or use for grouping (identifier, not predictive)
5. **Day of Week**: Cyclical Encoding (sin/cos transformation)
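
For a high-cardinality variable such as City, target encoding is usually computed out-of-fold so that a row's own label never contributes to its encoding. The function below is a hand-rolled sketch, not a library API: the name `oof_target_encode`, the `smoothing` parameter, and the toy data are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(categories, target, n_splits=5, smoothing=10.0, seed=0):
    """Out-of-fold mean target encoding, shrunk toward the fold's global mean."""
    categories = pd.Series(categories).reset_index(drop=True)
    target = pd.Series(target, dtype=float).reset_index(drop=True)
    encoded = pd.Series(np.nan, index=categories.index, dtype=float)

    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in folds.split(categories):
        prior = target.iloc[fit_idx].mean()
        stats = target.iloc[fit_idx].groupby(categories.iloc[fit_idx]).agg(["sum", "count"])
        # Rare categories are pulled toward the prior; frequent ones keep their own mean.
        smoothed = (stats["sum"] + smoothing * prior) / (stats["count"] + smoothing)
        # Categories unseen in the fitting folds fall back to the prior.
        encoded.iloc[enc_idx] = (
            categories.iloc[enc_idx].map(smoothed).fillna(prior).to_numpy()
        )
    return encoded

# Toy usage with made-up cities and a random binary target.
rng = np.random.default_rng(0)
toy_city = rng.choice([f"City_{i:03d}" for i in range(50)], size=1000)
toy_target = rng.integers(0, 2, size=1000)
city_encoded = oof_target_encode(toy_city, toy_target)
print(city_encoded.describe())
```

Recent scikit-learn releases (1.3 and later) also provide `sklearn.preprocessing.TargetEncoder`, which applies a similar cross-fitting scheme and can be placed inside a `Pipeline`.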

In [None]:
# Generate synthetic dataset with different categorical variable types
np.random.seed(42)
n_samples = 2000

# 1. Product Category (nominal, low cardinality)
product_categories = ['Electronics', 'Clothing', 'Books', 'Home', 'Sports']
product_probs = [0.3, 0.25, 0.15, 0.2, 0.1]  # Different popularity
product_category = np.random.choice(product_categories, n_samples, p=product_probs)

# 2. Education Level (ordinal)
education_levels = ['High School', 'Bachelor\'s', 'Master\'s', 'PhD']
education_probs = [0.4, 0.35, 0.2, 0.05]
education = np.random.choice(education_levels, n_samples, p=education_probs)

# 3. City (high cardinality)
# Generate 200 cities with Zipf distribution (realistic city size distribution)
city_names = [f'City_{i:03d}' for i in range(200)]
zipf_weights = 1 / np.arange(1, 201)  # Zipf distribution
zipf_weights = zipf_weights / zipf_weights.sum()
city = np.random.choice(city_names, n_samples, p=zipf_weights)

# 4. Customer ID (identifier - should not be used directly)
customer_id = [f'CUST_{i:06d}' for i in range(n_samples)]

# 5. Day of Week (cyclical)
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
day_of_week = np.random.choice(days, n_samples)

# Create DataFrame
categorical_df = pd.DataFrame({
    'customer_id': customer_id,
    'product_category': product_category,
    'education': education,
    'city': city,
    'day_of_week': day_of_week
})

# Create target variable influenced by categorical variables
# Education effect (ordinal)
education_effect = {'High School': 0, 'Bachelor\'s': 1, 'Master\'s': 2, 'PhD': 3}
edu_score = np.array([education_effect[ed] for ed in education])

# Product category effect (nominal)
category_effect = {'Electronics': 2, 'Clothing': 1, 'Books': 3, 'Home': 1, 'Sports': 2}
category_score = np.array([category_effect[cat] for cat in product_category])

# City effect (some cities are better markets)
city_effect = {name: np.random.normal(0, 1) for name in city_names}
city_score = np.array([city_effect[c] for c in city])

# Day of week effect (cyclical). With Monday=0 the cosine peaks on Monday,
# stays high on Sunday/Tuesday, and is lowest on Thursday/Friday.
day_mapping = {day: i for i, day in enumerate(days)}
day_numeric = np.array([day_mapping[d] for d in day_of_week])
weekend_effect = np.cos(2 * np.pi * day_numeric / 7)  # Cyclical pattern (name kept for brevity)

# Combine effects to create target
target_logit = (0.5 * edu_score + 
                0.3 * category_score + 
                0.2 * city_score + 
                0.4 * weekend_effect + 
                np.random.normal(0, 0.5, n_samples))

target = (target_logit > np.median(target_logit)).astype(int)

print("Categorical Dataset Overview:")
print("="*50)
print(f"Total samples: {n_samples}")
print(f"Target distribution: {target.mean():.1%} positive class")
print("\nCategorical Variable Summary:")
for col in ['product_category', 'education', 'city', 'day_of_week']:
    unique_count = categorical_df[col].nunique()
    print(f"{col:<20}: {unique_count:>3} unique values")

# Show value counts for each categorical variable
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
fig.suptitle('Categorical Variable Distributions', fontsize=16)

# Product Category
category_counts = categorical_df['product_category'].value_counts()
axes[0, 0].bar(category_counts.index, category_counts.values, alpha=0.7)
axes[0, 0].set_title('Product Category Distribution')
axes[0, 0].tick_params(axis='x', rotation=45)

# Education Level
education_counts = categorical_df['education'].value_counts().reindex(education_levels)
axes[0, 1].bar(education_counts.index, education_counts.values, alpha=0.7, color='orange')
axes[0, 1].set_title('Education Level Distribution')
axes[0, 1].tick_params(axis='x', rotation=45)

# City (top 10)
city_counts = categorical_df['city'].value_counts().head(10)
axes[1, 0].bar(range(len(city_counts)), city_counts.values, alpha=0.7, color='green')
axes[1, 0].set_title('Top 10 Cities Distribution')
axes[1, 0].set_xlabel('City Rank')
axes[1, 0].set_ylabel('Count')

# Day of Week
day_counts = categorical_df['day_of_week'].value_counts().reindex(days)
axes[1, 1].bar(day_counts.index, day_counts.values, alpha=0.7, color='purple')
axes[1, 1].set_title('Day of Week Distribution')
axes[1, 1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

print(f"\nCardinality Analysis:")
print(f"• Product Category: {categorical_df['product_category'].nunique()} categories (Low)")
print(f"• Education: {categorical_df['education'].nunique()} levels (Ordinal)")
print(f"• City: {categorical_df['city'].nunique()} cities (High cardinality)")
print(f"• Day of Week: {categorical_df['day_of_week'].nunique()} days (Cyclical)")
print(f"• Customer ID: {categorical_df['customer_id'].nunique()} IDs (Identifier)")

In [None]:
# Implement different encoding strategies

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score

# Target encoding is implemented manually below, so no extra dependency is needed.
# (The optional `category_encoders` package provides a ready-made TargetEncoder.)
def target_encode(X_train, X_test, y_train, column, smoothing=1.0):
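    """
    Target encoding with smoothing to reduce overfitting on rare categories.

    Statistics are computed from the training split only. Training rows are
    still encoded with means that include their own label, so scores measured
    on the training set can be optimistic; out-of-fold encoding avoids this.
    """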
    # Calculate global mean
    global_mean = y_train.mean()
    
    # Calculate category means and counts
    category_stats = pd.DataFrame({
        'category': X_train[column],
        'target': y_train
    }).groupby('category').agg({
        'target': ['count', 'mean']
    })
    
    category_stats.columns = ['count', 'mean']
    
    # Apply smoothing
    category_stats['smoothed_mean'] = (
        (category_stats['mean'] * category_stats['count'] + global_mean * smoothing) /
        (category_stats['count'] + smoothing)
    )
    
    # Create mapping
    encoding_map = category_stats['smoothed_mean'].to_dict()
    
    # Apply encoding
    X_train_encoded = X_train[column].map(encoding_map).fillna(global_mean)
    X_test_encoded = X_test[column].map(encoding_map).fillna(global_mean)
    
    return X_train_encoded, X_test_encoded

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    categorical_df, target, test_size=0.3, random_state=42, stratify=target
)

print("Encoding Strategy Comparison:")
print("="*60)

# Strategy 1: One-Hot Encoding for Product Category
def encode_strategy_1(X_train, X_test, y_train):
    """
    Strategy 1: Basic encodings
    - Product Category: One-Hot
    - Education: Ordinal
    - City: Target Encoding
    - Day: One-Hot
    - Customer ID: Drop
    """
    result_train = pd.DataFrame()
    result_test = pd.DataFrame()
    
    # Product Category: One-Hot
    product_dummies_train = pd.get_dummies(X_train['product_category'], prefix='product')
    product_dummies_test = pd.get_dummies(X_test['product_category'], prefix='product')
    # Ensure same columns
    for col in product_dummies_train.columns:
        if col not in product_dummies_test.columns:
            product_dummies_test[col] = 0
    product_dummies_test = product_dummies_test[product_dummies_train.columns]
    
    result_train = pd.concat([result_train, product_dummies_train], axis=1)
    result_test = pd.concat([result_test, product_dummies_test], axis=1)
    
    # Education: Ordinal
    education_mapping = {'High School': 0, 'Bachelor\'s': 1, 'Master\'s': 2, 'PhD': 3}
    result_train['education_ordinal'] = X_train['education'].map(education_mapping)
    result_test['education_ordinal'] = X_test['education'].map(education_mapping)
    
    # City: Target Encoding
    city_train, city_test = target_encode(X_train, X_test, y_train, 'city')
    result_train['city_target'] = city_train
    result_test['city_target'] = city_test
    
    # Day: One-Hot
    day_dummies_train = pd.get_dummies(X_train['day_of_week'], prefix='day')
    day_dummies_test = pd.get_dummies(X_test['day_of_week'], prefix='day')
    # Ensure same columns
    for col in day_dummies_train.columns:
        if col not in day_dummies_test.columns:
            day_dummies_test[col] = 0
    day_dummies_test = day_dummies_test[day_dummies_train.columns]
    
    result_train = pd.concat([result_train, day_dummies_train], axis=1)
    result_test = pd.concat([result_test, day_dummies_test], axis=1)
    
    # Customer ID: Drop (not predictive)
    
    return result_train.fillna(0), result_test.fillna(0)

# Strategy 2: Cyclical Encoding for Day of Week
def encode_strategy_2(X_train, X_test, y_train):
    """
    Strategy 2: Advanced encodings
    - Product Category: One-Hot
    - Education: Ordinal
    - City: Target Encoding
    - Day: Cyclical (sin/cos)
    - Customer ID: Drop
    """
    result_train = pd.DataFrame()
    result_test = pd.DataFrame()
    
    # Product Category: One-Hot (same as strategy 1)
    product_dummies_train = pd.get_dummies(X_train['product_category'], prefix='product')
    product_dummies_test = pd.get_dummies(X_test['product_category'], prefix='product')
    for col in product_dummies_train.columns:
        if col not in product_dummies_test.columns:
            product_dummies_test[col] = 0
    product_dummies_test = product_dummies_test[product_dummies_train.columns]
    
    result_train = pd.concat([result_train, product_dummies_train], axis=1)
    result_test = pd.concat([result_test, product_dummies_test], axis=1)
    
    # Education: Ordinal (same as strategy 1)
    education_mapping = {'High School': 0, 'Bachelor\'s': 1, 'Master\'s': 2, 'PhD': 3}
    result_train['education_ordinal'] = X_train['education'].map(education_mapping)
    result_test['education_ordinal'] = X_test['education'].map(education_mapping)
    
    # City: Target Encoding (same as strategy 1)
    city_train, city_test = target_encode(X_train, X_test, y_train, 'city')
    result_train['city_target'] = city_train
    result_test['city_target'] = city_test
    
    # Day: Cyclical Encoding
    day_mapping = {'Monday': 0, 'Tuesday': 1, 'Wednesday': 2, 'Thursday': 3, 
                   'Friday': 4, 'Saturday': 5, 'Sunday': 6}
    
    day_numeric_train = X_train['day_of_week'].map(day_mapping)
    day_numeric_test = X_test['day_of_week'].map(day_mapping)
    
    # Convert to sin/cos
    result_train['day_sin'] = np.sin(2 * np.pi * day_numeric_train / 7)
    result_train['day_cos'] = np.cos(2 * np.pi * day_numeric_train / 7)
    result_test['day_sin'] = np.sin(2 * np.pi * day_numeric_test / 7)
    result_test['day_cos'] = np.cos(2 * np.pi * day_numeric_test / 7)
    
    return result_train.fillna(0), result_test.fillna(0)

# Strategy 3: Label Encoding Everything (poor strategy)
def encode_strategy_3(X_train, X_test, y_train):
    """
    Strategy 3: Label encoding for everything (demonstrating poor choices)
    """
    result_train = pd.DataFrame()
    result_test = pd.DataFrame()
    
    # Label encode everything
    for col in ['product_category', 'education', 'city', 'day_of_week']:
        le = LabelEncoder()
        # Fit on train, transform both
        le.fit(X_train[col])
        result_train[f'{col}_label'] = le.transform(X_train[col])
        
        # Handle unseen categories in test
        test_encoded = []
        for val in X_test[col]:
            if val in le.classes_:
                test_encoded.append(le.transform([val])[0])
            else:
                test_encoded.append(-1)  # Unseen category
        result_test[f'{col}_label'] = test_encoded
    
    return result_train.fillna(0), result_test.fillna(0)

# Apply encoding strategies
X_train_s1, X_test_s1 = encode_strategy_1(X_train, X_test, y_train)
X_train_s2, X_test_s2 = encode_strategy_2(X_train, X_test, y_train)
X_train_s3, X_test_s3 = encode_strategy_3(X_train, X_test, y_train)

print(f"Encoding Results:")
print(f"Strategy 1 (Mixed): {X_train_s1.shape[1]} features")
print(f"Strategy 2 (Cyclical): {X_train_s2.shape[1]} features")
print(f"Strategy 3 (Label): {X_train_s3.shape[1]} features")

# Evaluate strategies
strategies = {
    'Strategy 1 (Mixed)': (X_train_s1, X_test_s1),
    'Strategy 2 (Cyclical)': (X_train_s2, X_test_s2),
    'Strategy 3 (Label)': (X_train_s3, X_test_s3)
}

results = {}
for name, (X_tr, X_te) in strategies.items():
    # Random Forest
    rf_scores = cross_val_score(RandomForestClassifier(n_estimators=50, random_state=42),
                                X_tr, y_train, cv=5, scoring='accuracy')
    
    # Logistic Regression
    lr_scores = cross_val_score(LogisticRegression(max_iter=1000, random_state=42),
                                X_tr, y_train, cv=5, scoring='accuracy')
    
    results[name] = {
        'RF': rf_scores.mean(),
        'LR': lr_scores.mean(),
        'Features': X_tr.shape[1]
    }

print(f"\nPerformance Comparison:")
print(f"{'Strategy':<20} {'Features':<10} {'Random Forest':<15} {'Logistic Reg':<15}")
print("-" * 65)
for strategy, metrics in results.items():
    print(f"{strategy:<20} {metrics['Features']:<10} {metrics['RF']:<15.4f} {metrics['LR']:<15.4f}")

print("\nKey Observations:")
print("• Strategy 1: Each variable gets an encoding suited to its type")
print("• Strategy 2: Cyclical encoding keeps Sunday adjacent to Monday and uses 2 features instead of 7")
print("• Strategy 3: Label encoding imposes an arbitrary order, which tends to hurt linear models more than trees")
print("• Target encoding keeps a high-cardinality variable to a single feature")
print("• Feature count differs between strategies (see the table above)")
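
One caveat about the comparison above: `target_encode` computes each city's mean from all training rows, and the cross-validation then runs on those already-encoded rows, so every row's own label has influenced its `city_target` value. The accuracies for Strategies 1 and 2 may therefore be optimistic. A minimal sketch of one common remedy, out-of-fold encoding, follows. The helper name `target_encode_oof` and its parameters are hypothetical, and the demo data is synthetic with labels deliberately unrelated to the category.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def target_encode_oof(categories, y, n_splits=5, smoothing=1.0, seed=42):
    """Out-of-fold smoothed target encoding for one categorical column.

    Hypothetical helper: each row is encoded with statistics computed on the
    other folds, so a row's own label never feeds its encoded value.
    """
    categories = pd.Series(categories).reset_index(drop=True)
    y = pd.Series(np.asarray(y, dtype=float))
    encoded = pd.Series(np.nan, index=categories.index, dtype=float)

    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(categories):
        fold_mean = y.iloc[fit_idx].mean()
        stats = y.iloc[fit_idx].groupby(categories.iloc[fit_idx]).agg(['count', 'mean'])
        smoothed = ((stats['mean'] * stats['count'] + fold_mean * smoothing)
                    / (stats['count'] + smoothing))
        # Categories unseen in the fitting folds fall back to the fold mean
        encoded.iloc[enc_idx] = (
            categories.iloc[enc_idx].map(smoothed).fillna(fold_mean).to_numpy()
        )
    return encoded

# Synthetic check: 200 categories, labels independent of the category
rng = np.random.default_rng(0)
cats = rng.choice([f'City_{i:03d}' for i in range(200)], size=1000)
labels = rng.integers(0, 2, size=1000)

# In-sample encoding: each row's category mean includes its own label
naive = pd.Series(labels, dtype=float).groupby(pd.Series(cats)).transform('mean')
oof = target_encode_oof(cats, labels)

naive_corr = np.corrcoef(naive, labels)[0, 1]
oof_corr = np.corrcoef(oof, labels)[0, 1]
print(f"in-sample encoding vs. label correlation:   {naive_corr:.3f}")
print(f"out-of-fold encoding vs. label correlation: {oof_corr:.3f}")
```

Because the labels here carry no information about the category, any correlation between the encoded feature and the label is pure leakage. The in-sample version shows a clearly positive correlation, while the out-of-fold version should stay near zero.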

---

## Question 4: Feature Selection Techniques ★★★

**Question:** You have a dataset with 100 features and 1000 samples for binary classification. Apply and compare different feature selection methods:

1. **Filter Methods**: Chi-square test, mutual information
2. **Wrapper Methods**: Recursive Feature Elimination (RFE)
3. **Embedded Methods**: L1 regularization (LASSO)
4. **Hybrid Approach**: Combine multiple methods

Analyze the computational cost, selected features, and impact on model performance.

### Answer 4: Feature Selection Techniques

#### **Feature Selection Categories**

**Filter Methods**:
- Independent of ML algorithm
- Fast computation
- Based on statistical tests
- Examples: Chi-square, correlation, mutual information

**Wrapper Methods**:
- Use ML algorithm performance
- Computationally expensive
- Account for feature interactions
- Examples: RFE, forward/backward selection

**Embedded Methods**:
- Feature selection during model training
- Algorithm-specific
- Cost typically falls between filter and wrapper methods
- Examples: LASSO, tree-based importance

#### **When to Use Each Method**

- **High Dimensionality**: Start with filter methods
- **Small Datasets**: Wrapper methods may overfit
- **Linear Models**: L1 regularization works well
- **Tree Models**: Use built-in feature importance
- **Time Constraints**: Filter methods are fastest
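
For the "Linear Models" case, the sketch below shows one way an embedded selector could look for a classification target: an L1-penalized `LogisticRegression` wrapped in `SelectFromModel`. The data (`X_demo`, `y_demo`), the value `C=0.05`, and the explicit `threshold=1e-5` are assumptions chosen for illustration, not tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic problem: 5 informative features out of 40
X_demo, y_demo = make_classification(
    n_samples=500, n_features=40, n_informative=5, n_redundant=0,
    n_clusters_per_class=1, random_state=0,
)

# The L1 penalty drives some coefficients exactly to zero;
# a smaller C means stronger regularization and fewer surviving features.
l1_model = LogisticRegression(penalty='l1', solver='liblinear', C=0.05, random_state=0)

# Scale first so the penalty treats all features comparably,
# then keep features whose absolute coefficient exceeds the threshold.
selector = make_pipeline(
    StandardScaler(),
    SelectFromModel(l1_model, threshold=1e-5),
)
X_selected = selector.fit_transform(X_demo, y_demo)

kept = selector[-1].get_support(indices=True)
print(f"Kept {len(kept)} of {X_demo.shape[1]} features: {kept}")
```

In practice `C` would be chosen by cross-validation rather than fixed by hand.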

In [None]:
# Generate synthetic high-dimensional dataset
np.random.seed(42)
n_samples = 1000
n_features = 100
n_informative = 20  # Only 20% of features carry independent signal
n_redundant = 10    # Random linear combinations of the informative features
n_clusters_per_class = 2

# Generate classification dataset
X, y = make_classification(
    n_samples=n_samples,
    n_features=n_features,
    n_informative=n_informative,
    n_redundant=n_redundant,
    n_clusters_per_class=n_clusters_per_class,
    class_sep=0.8,
    random_state=42
)

# Create feature names
feature_names = [f'feature_{i:03d}' for i in range(n_features)]
feature_df = pd.DataFrame(X, columns=feature_names)

print("High-Dimensional Dataset:")
print("="*50)
print(f"Samples: {n_samples}")
print(f"Features: {n_features}")
print(f"Informative features: {n_informative}")
print(f"Redundant features: {n_redundant}")
print(f"Noise features: {n_features - n_informative - n_redundant}")
print(f"Class distribution: {np.bincount(y)} ({y.mean():.1%} positive)")

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

print(f"\nData split:")
print(f"Training: {X_train.shape}")
print(f"Testing: {X_test.shape}")

# Baseline performance (all features)
baseline_rf = RandomForestClassifier(n_estimators=50, random_state=42)
baseline_scores = cross_val_score(baseline_rf, X_train, y_train, cv=5, scoring='accuracy')
baseline_performance = baseline_scores.mean()

print(f"\nBaseline Performance (all {n_features} features):")
print(f"Random Forest Accuracy: {baseline_performance:.4f} ± {baseline_scores.std():.4f}")

# Visualize feature importance from baseline model
baseline_rf.fit(X_train, y_train)
baseline_importance = baseline_rf.feature_importances_

fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Feature importance distribution
axes[0].hist(baseline_importance, bins=20, alpha=0.7, color='skyblue')
axes[0].set_title('Feature Importance Distribution (Random Forest)')
axes[0].set_xlabel('Importance Score')
axes[0].set_ylabel('Number of Features')
axes[0].grid(True, alpha=0.3)

# Top 20 feature importances
top_20_idx = np.argsort(baseline_importance)[-20:]
axes[1].barh(range(20), baseline_importance[top_20_idx], alpha=0.7, color='lightgreen')
axes[1].set_title('Top 20 Most Important Features')
axes[1].set_xlabel('Importance Score')
axes[1].set_ylabel('Feature Rank')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\nFeature Importance Analysis:")
print(f"Mean importance: {baseline_importance.mean():.6f}")
print(f"Std importance: {baseline_importance.std():.6f}")
print(f"Max importance: {baseline_importance.max():.6f}")
print(f"Features with >mean importance: {(baseline_importance > baseline_importance.mean()).sum()}")

In [None]:
# Implement and compare different feature selection methods
import time
from sklearn.linear_model import LassoCV

# Store results
selection_results = {}

print("Feature Selection Methods Comparison:")
print("="*70)

# 1. Filter Method: Chi-square
print("\n1. Chi-square Test (Filter Method)")
print("-" * 40)
start_time = time.time()

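# chi2 requires non-negative inputs. Rescale each feature to [0, 1] with a scaler
# fit on the training data only, then reuse it for the test data (clipping any
# test values that fall below the training minimum).
chi2_scaler = MinMaxScaler()
X_train_pos = chi2_scaler.fit_transform(X_train)
X_test_pos = np.clip(chi2_scaler.transform(X_test), 0, None)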

chi2_selector = SelectKBest(chi2, k=20)
X_train_chi2 = chi2_selector.fit_transform(X_train_pos, y_train)
X_test_chi2 = chi2_selector.transform(X_test_pos)

chi2_time = time.time() - start_time
chi2_features = chi2_selector.get_support(indices=True)
chi2_scores = cross_val_score(RandomForestClassifier(n_estimators=50, random_state=42),
                              X_train_chi2, y_train, cv=5, scoring='accuracy')

selection_results['Chi-square'] = {
    'selected_features': chi2_features,
    'n_features': len(chi2_features),
    'computation_time': chi2_time,
    'cv_score': chi2_scores.mean(),
    'cv_std': chi2_scores.std()
}

print(f"Selected features: {len(chi2_features)}")
print(f"Computation time: {chi2_time:.4f} seconds")
print(f"CV Accuracy: {chi2_scores.mean():.4f} ± {chi2_scores.std():.4f}")

# 2. Filter Method: Mutual Information
print("\n2. Mutual Information (Filter Method)")
print("-" * 40)
start_time = time.time()

mi_selector = SelectKBest(mutual_info_classif, k=20)
X_train_mi = mi_selector.fit_transform(X_train, y_train)
X_test_mi = mi_selector.transform(X_test)

mi_time = time.time() - start_time
mi_features = mi_selector.get_support(indices=True)
mi_scores = cross_val_score(RandomForestClassifier(n_estimators=50, random_state=42),
                            X_train_mi, y_train, cv=5, scoring='accuracy')

selection_results['Mutual Info'] = {
    'selected_features': mi_features,
    'n_features': len(mi_features),
    'computation_time': mi_time,
    'cv_score': mi_scores.mean(),
    'cv_std': mi_scores.std()
}

print(f"Selected features: {len(mi_features)}")
print(f"Computation time: {mi_time:.4f} seconds")
print(f"CV Accuracy: {mi_scores.mean():.4f} ± {mi_scores.std():.4f}")

# 3. Wrapper Method: Recursive Feature Elimination
print("\n3. Recursive Feature Elimination (Wrapper Method)")
print("-" * 40)
start_time = time.time()

rfe_estimator = RandomForestClassifier(n_estimators=20, random_state=42)  # Reduced for speed
rfe_selector = RFE(rfe_estimator, n_features_to_select=20, step=5)
X_train_rfe = rfe_selector.fit_transform(X_train, y_train)
X_test_rfe = rfe_selector.transform(X_test)

rfe_time = time.time() - start_time
rfe_features = rfe_selector.get_support(indices=True)
rfe_scores = cross_val_score(RandomForestClassifier(n_estimators=50, random_state=42),
                             X_train_rfe, y_train, cv=5, scoring='accuracy')

selection_results['RFE'] = {
    'selected_features': rfe_features,
    'n_features': len(rfe_features),
    'computation_time': rfe_time,
    'cv_score': rfe_scores.mean(),
    'cv_std': rfe_scores.std()
}

print(f"Selected features: {len(rfe_features)}")
print(f"Computation time: {rfe_time:.4f} seconds")
print(f"CV Accuracy: {rfe_scores.mean():.4f} ± {rfe_scores.std():.4f}")

# 4. Embedded Method: LASSO (L1 Regularization)
print("\n4. LASSO Regularization (Embedded Method)")
print("-" * 40)
start_time = time.time()

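# Use LASSO with cross-validation to select alpha.
# Note: LassoCV is a regressor, so the 0/1 labels are treated as a numeric target here;
# L1-penalized LogisticRegression is the classification counterpart.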
lasso_cv = LassoCV(cv=5, random_state=42, max_iter=1000)
lasso_cv.fit(X_train, y_train)

# Select features with non-zero coefficients
lasso_features = np.where(np.abs(lasso_cv.coef_) > 1e-6)[0]
X_train_lasso = X_train[:, lasso_features]
X_test_lasso = X_test[:, lasso_features]

lasso_time = time.time() - start_time
lasso_scores = cross_val_score(RandomForestClassifier(n_estimators=50, random_state=42),
                               X_train_lasso, y_train, cv=5, scoring='accuracy')

selection_results['LASSO'] = {
    'selected_features': lasso_features,
    'n_features': len(lasso_features),
    'computation_time': lasso_time,
    'cv_score': lasso_scores.mean(),
    'cv_std': lasso_scores.std()
}

print(f"Selected features: {len(lasso_features)}")
print(f"Computation time: {lasso_time:.4f} seconds")
print(f"CV Accuracy: {lasso_scores.mean():.4f} ± {lasso_scores.std():.4f}")
print(f"LASSO alpha: {lasso_cv.alpha_:.6f}")

# 5. Hybrid Approach: Combine multiple methods
print("\n5. Hybrid Approach (Features Chosen by At Least 2 Methods)")
print("-" * 40)
start_time = time.time()

# Collect the feature sets chosen by each method for voting
all_selected_features = {
    'chi2': set(chi2_features),
    'mi': set(mi_features),
    'rfe': set(rfe_features),
    'lasso': set(lasso_features)
}

# Features selected by at least 2 methods
feature_votes = {}
for method, features in all_selected_features.items():
    for feature in features:
        feature_votes[feature] = feature_votes.get(feature, 0) + 1

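# Sort for a stable order; dtype=int keeps indexing valid even if the list is empty
hybrid_features = np.array(sorted(f for f, votes in feature_votes.items() if votes >= 2), dtype=int)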
X_train_hybrid = X_train[:, hybrid_features]
X_test_hybrid = X_test[:, hybrid_features]

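# Voting itself is cheap; include the cost of the four methods it depends on
hybrid_time = (time.time() - start_time) + chi2_time + mi_time + rfe_time + lasso_time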
hybrid_scores = cross_val_score(RandomForestClassifier(n_estimators=50, random_state=42),
                                X_train_hybrid, y_train, cv=5, scoring='accuracy')

selection_results['Hybrid'] = {
    'selected_features': hybrid_features,
    'n_features': len(hybrid_features),
    'computation_time': hybrid_time,
    'cv_score': hybrid_scores.mean(),
    'cv_std': hybrid_scores.std()
}

print(f"Selected features: {len(hybrid_features)}")
print(f"Computation time: {hybrid_time:.4f} seconds")
print(f"CV Accuracy: {hybrid_scores.mean():.4f} ± {hybrid_scores.std():.4f}")

# Add baseline to results
selection_results['Baseline (All)'] = {
    'selected_features': np.arange(n_features),
    'n_features': n_features,
    'computation_time': 0.0,
    'cv_score': baseline_performance,
    'cv_std': baseline_scores.std()
}

In [None]:
# Visualize and analyze feature selection results

# Create summary table
print("\nFeature Selection Comparison Summary:")
print("="*80)
print(f"{'Method':<15} {'Features':<10} {'Time (s)':<10} {'CV Score':<12} {'Std':<10} {'vs Baseline':<12}")
print("-" * 80)

baseline_score = selection_results['Baseline (All)']['cv_score']

for method, results in selection_results.items():
    score_diff = results['cv_score'] - baseline_score
    print(f"{method:<15} {results['n_features']:<10} {results['computation_time']:<10.4f} "
          f"{results['cv_score']:<12.4f} {results['cv_std']:<10.4f} {score_diff:>+8.4f}")

# Visualization
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('Feature Selection Methods Comparison', fontsize=16)

methods = list(selection_results.keys())
colors = plt.cm.Set3(np.linspace(0, 1, len(methods)))

# 1. Number of features
n_features_list = [selection_results[m]['n_features'] for m in methods]
bars1 = axes[0, 0].bar(methods, n_features_list, color=colors, alpha=0.7)
axes[0, 0].set_title('Number of Selected Features')
axes[0, 0].set_ylabel('Number of Features')
axes[0, 0].tick_params(axis='x', rotation=45)
axes[0, 0].grid(True, alpha=0.3)
for bar, n_feat in zip(bars1, n_features_list):
    axes[0, 0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1,
                    str(n_feat), ha='center', va='bottom')

# 2. Computation time
times = [selection_results[m]['computation_time'] for m in methods]
bars2 = axes[0, 1].bar(methods, times, color=colors, alpha=0.7)
axes[0, 1].set_title('Computation Time')
axes[0, 1].set_ylabel('Time (seconds)')
axes[0, 1].tick_params(axis='x', rotation=45)
axes[0, 1].set_yscale('log')
axes[0, 1].grid(True, alpha=0.3)
for bar, time_val in zip(bars2, times):
    if time_val > 0:
        axes[0, 1].text(bar.get_x() + bar.get_width()/2, bar.get_height() * 1.1,
                        f'{time_val:.3f}', ha='center', va='bottom')

# 3. CV Scores with error bars
cv_scores = [selection_results[m]['cv_score'] for m in methods]
cv_stds = [selection_results[m]['cv_std'] for m in methods]
bars3 = axes[0, 2].bar(methods, cv_scores, yerr=cv_stds, color=colors, alpha=0.7, capsize=5)
axes[0, 2].set_title('Cross-Validation Accuracy')
axes[0, 2].set_ylabel('Accuracy')
axes[0, 2].tick_params(axis='x', rotation=45)
axes[0, 2].grid(True, alpha=0.3)
axes[0, 2].set_ylim(min(cv_scores) - 0.02, max(cv_scores) + 0.02)

# 4. Feature overlap analysis
feature_sets = {
    'Chi-square': set(selection_results['Chi-square']['selected_features']),
    'Mutual Info': set(selection_results['Mutual Info']['selected_features']),
    'RFE': set(selection_results['RFE']['selected_features']),
    'LASSO': set(selection_results['LASSO']['selected_features'])
}

# Calculate pairwise overlaps
overlap_matrix = np.zeros((4, 4))
method_names = list(feature_sets.keys())
for i, method1 in enumerate(method_names):
    for j, method2 in enumerate(method_names):
        if i == j:
            overlap_matrix[i, j] = len(feature_sets[method1])
        else:
            overlap = len(feature_sets[method1].intersection(feature_sets[method2]))
            overlap_matrix[i, j] = overlap

im = axes[1, 0].imshow(overlap_matrix, cmap='Blues', aspect='auto')
axes[1, 0].set_title('Feature Overlap Between Methods')
axes[1, 0].set_xticks(range(4))
axes[1, 0].set_yticks(range(4))
axes[1, 0].set_xticklabels([m.replace(' ', '\n') for m in method_names], rotation=45)
axes[1, 0].set_yticklabels(method_names)

# Add text annotations
for i in range(4):
    for j in range(4):
        axes[1, 0].text(j, i, int(overlap_matrix[i, j]), ha='center', va='center')

plt.colorbar(im, ax=axes[1, 0])

# 5. Performance vs complexity trade-off
axes[1, 1].scatter(n_features_list[:-1], cv_scores[:-1], 
                   c=times[:-1], s=100, alpha=0.7, cmap='viridis')
axes[1, 1].scatter(n_features_list[-1], cv_scores[-1], 
                   c='red', s=150, marker='*', label='Baseline')
axes[1, 1].set_xlabel('Number of Features')
axes[1, 1].set_ylabel('CV Accuracy')
axes[1, 1].set_title('Performance vs Complexity')
axes[1, 1].grid(True, alpha=0.3)
axes[1, 1].legend()

# Add method labels
for i, method in enumerate(methods[:-1]):
    axes[1, 1].annotate(method.split()[0], 
                        (n_features_list[i], cv_scores[i]), 
                        xytext=(5, 5), textcoords='offset points', fontsize=8)

# 6. Feature importance comparison for selected features
# Show baseline importance for features selected by hybrid method
if len(hybrid_features) > 0:
    hybrid_importance = baseline_importance[hybrid_features]
    axes[1, 2].bar(range(len(hybrid_features)), hybrid_importance, alpha=0.7, color='green')
    axes[1, 2].set_title('Baseline Importance of Hybrid-Selected Features')
    axes[1, 2].set_xlabel('Feature Index (Hybrid Selection)')
    axes[1, 2].set_ylabel('Baseline Importance')
    axes[1, 2].grid(True, alpha=0.3)
else:
    axes[1, 2].text(0.5, 0.5, 'No features selected\nby hybrid method', 
                    ha='center', va='center', transform=axes[1, 2].transAxes)
    axes[1, 2].set_title('Hybrid Method Results')

plt.tight_layout()
plt.show()

# Analysis of results
print(f"\nDetailed Analysis:")
print("=" * 60)

# Find best performing method
best_method = max(selection_results.keys(), 
                  key=lambda x: selection_results[x]['cv_score'])
print(f"Best performing method: {best_method}")
print(f"Score: {selection_results[best_method]['cv_score']:.4f}")

# Find most efficient method (score/time ratio)
efficiency_scores = {}
for method, results in selection_results.items():
    if results['computation_time'] > 0:
        efficiency = results['cv_score'] / results['computation_time']
        efficiency_scores[method] = efficiency

if efficiency_scores:
    most_efficient = max(efficiency_scores.keys(), key=lambda x: efficiency_scores[x])
    print(f"Most efficient method: {most_efficient}")
    print(f"Efficiency (score/time): {efficiency_scores[most_efficient]:.2f}")

print("\nKey Insights:")
print("• Filter methods (chi-square, MI) score features one at a time: cheap, but blind to interactions")
print("• Wrapper methods (RFE) refit a model repeatedly: usually slowest, but they account for interactions")
print("• Embedded methods (LASSO) select during training, typically between the two in cost")
print("• Hybrid voting favors features that several methods agree on")
print("• Fewer features do not always mean better accuracy; compare against the baseline")
print("• Computational cost can differ by orders of magnitude between methods")
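
A note on reading the scores above: each selector was fit on all of `X_train` (with labels) before `cross_val_score` ran, so the held-out folds had already influenced which features were kept. The reported accuracies may therefore be somewhat optimistic. The sketch below, on deliberately pure-noise data, illustrates how large that bias can get and how wrapping the selector in a `Pipeline` removes it. The data sizes and the choice of `f_classif` are assumptions for the demo.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X_noise = rng.normal(size=(100, 2000))   # pure noise features
y_noise = rng.integers(0, 2, size=100)   # labels unrelated to X_noise

# Leaky: select on the full data, then cross-validate on the reduced matrix
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky_acc = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y_noise, cv=5
).mean()

# Leak-free: the selector is refit inside every training fold
pipe = Pipeline([
    ('select', SelectKBest(f_classif, k=20)),
    ('clf', LogisticRegression(max_iter=1000)),
])
clean_acc = cross_val_score(pipe, X_noise, y_noise, cv=5).mean()

print(f"Selection before CV (leaky): {leaky_acc:.3f}")
print(f"Selection inside CV (clean): {clean_acc:.3f}")
```

Since there is no real signal, an honest estimate should hover around chance. The leaky variant scores well above that, purely because the selector saw the validation labels.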

---

## Summary and Key Takeaways

### **Core Preprocessing Concepts Mastered**

1. **Missing Data Handling**: Understanding MCAR, MAR, MNAR and appropriate imputation strategies
2. **Feature Scaling**: Knowing when and which scaling method to apply for different algorithms
3. **Categorical Encoding**: Choosing appropriate encoding based on cardinality and variable type
4. **Feature Selection**: Comparing filter, wrapper, and embedded methods for dimensionality reduction

### **Critical Decision Framework**

**Missing Data Strategy Selection:**
- **Low missingness (<5%)**: Simple imputation (mean/median/mode)
- **Moderate missingness (5-25%)**: Advanced imputation (KNN, iterative)
- **High missingness (>25%)**: Consider dropping or specialized techniques
- **Systematic patterns**: Add missingness indicators
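
A minimal sketch of the "add missingness indicators" idea, using scikit-learn's `SimpleImputer` with `add_indicator=True`. The toy array `X_missing` is made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_missing = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
    [5.0, 40.0],
])

# add_indicator=True appends one binary column per feature that had missing
# values during fit, so a model can still see where values were imputed.
imputer = SimpleImputer(strategy='median', add_indicator=True)
X_imputed = imputer.fit_transform(X_missing)

print(X_imputed)
```

The output has four columns: the two imputed features followed by two 0/1 flags marking which entries were originally missing.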

**Scaling Method Selection:**
- **Tree-based algorithms**: No scaling needed
- **Distance-based algorithms**: StandardScaler or MinMaxScaler
- **Linear algorithms**: StandardScaler for faster convergence and comparable regularization across features
- **Data with outliers**: RobustScaler
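
To make the outlier point concrete, the sketch below scales a toy column containing one extreme value with both `StandardScaler` and `RobustScaler` and compares how much spread the ordinary points retain. The numbers are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Five ordinary values and one extreme outlier
values = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [1000.0]])

standard = StandardScaler().fit_transform(values).ravel()
robust = RobustScaler().fit_transform(values).ravel()

# Spread of the five ordinary points after scaling
standard_spread = standard[:5].max() - standard[:5].min()
robust_spread = robust[:5].max() - robust[:5].min()

print(f"StandardScaler spread of ordinary points: {standard_spread:.4f}")
print(f"RobustScaler spread of ordinary points:   {robust_spread:.4f}")
```

`StandardScaler` uses the mean and standard deviation, both of which the outlier inflates, so the ordinary points collapse into a narrow band. `RobustScaler` uses the median and interquartile range, which the single outlier barely moves.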

**Categorical Encoding Strategy:**
- **Low cardinality (<10)**: One-hot encoding
- **Ordinal variables**: Ordinal encoding
- **High cardinality (>50)**: Target encoding or embeddings
- **Cyclical variables**: Sin/cos transformation
- **Identifiers**: Drop or use for grouping
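
For ordinal variables, a small sketch with scikit-learn's `OrdinalEncoder` shows how to fix the category order explicitly and map categories never seen during fitting to a sentinel value. The `'Diploma'` level and the `-1` sentinel are assumptions for the example.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Explicit order, lowest to highest; without it the encoder sorts alphabetically
levels = ['High School', "Bachelor's", "Master's", 'PhD']

encoder = OrdinalEncoder(
    categories=[levels],
    handle_unknown='use_encoded_value',
    unknown_value=-1,   # sentinel for categories outside `levels`
)

train = np.array([['PhD'], ['High School'], ["Master's"]], dtype=object)
test = np.array([["Bachelor's"], ['Diploma']], dtype=object)

encoder.fit(train)
codes = encoder.transform(test).ravel()
print(codes)
```

Here `"Bachelor's"` maps to its position in `levels` even though it never appeared in the training rows, while the unlisted `'Diploma'` maps to the sentinel.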

**Feature Selection Approach:**
- **High dimensionality**: Start with filter methods
- **Small datasets**: Be cautious with wrapper methods
- **Linear models**: L1 regularization
- **Tree models**: Built-in importance
- **Production systems**: Consider computational constraints

### **Common Pitfalls to Avoid**

- **Data Leakage**: Fitting preprocessors on entire dataset before splitting
- **Target Leakage**: Using the target, or features derived from it, when imputing or encoding
- **Inconsistent Preprocessing**: Applying different transformations to train and test data
- **Ignoring Missingness Patterns**: Not investigating why data is missing
- **Over-Engineering**: Applying complex preprocessing when simple methods suffice

### **Best Practices**

1. **Always fit preprocessors on training data only**
2. **Use pipelines to ensure consistent preprocessing**
3. **Validate preprocessing choices with domain experts**
4. **Monitor preprocessing impact on model performance**
5. **Document preprocessing decisions and rationale**
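
Practices 1 and 2 can be combined in a single object. The sketch below is one possible arrangement, assuming a small synthetic table with hypothetical columns `age`, `income`, and `segment`: a `ColumnTransformer` inside a `Pipeline`, so that imputation, scaling, and encoding are refit on the training portion of every cross-validation fold.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data with hypothetical column names
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    'age': rng.normal(40, 10, n),
    'income': rng.normal(50_000, 15_000, n),
    'segment': rng.choice(['A', 'B', 'C'], n),
})
df.loc[rng.choice(n, 30, replace=False), 'income'] = np.nan  # inject missing values
y = (df['age'] + rng.normal(0, 5, n) > 40).astype(int)

# Numeric columns: impute, then scale
numeric = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])

preprocess = ColumnTransformer([
    ('num', numeric, ['age', 'income']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['segment']),
])

model = Pipeline([
    ('preprocess', preprocess),
    ('clf', LogisticRegression(max_iter=1000)),
])

# Every preprocessing step is fit only on the training part of each fold
scores = cross_val_score(model, df, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```

The same fitted `model` can later be applied to new data with `model.predict`, which reuses the training-time medians, scaling parameters, and category list.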

### **Next Steps**

Continue to Part 3 to master model evaluation and validation techniques that build upon proper data preprocessing.

### **Practice Recommendations**

1. Build preprocessing pipelines for different data types
2. Practice identifying appropriate encoding strategies
3. Implement custom preprocessing functions for domain-specific needs
4. Experiment with feature selection on high-dimensional datasets
5. Create preprocessing checklists for different project types