# Machine Learning Course: Feature Selection

## Notebook 3: Advanced Feature Selection for Risk Classification

### Learning Objectives
By the end of this notebook, you will be able to:
1. Apply variance-based feature selection to remove uninformative genes
2. Use correlation-based methods to identify relevant and redundant features
3. Implement Cox proportional hazards models for survival-based feature ranking
4. Apply Random Forest feature importance for non-linear feature selection
5. Use Boruta algorithm for all-relevant feature detection
6. Compare and ensemble multiple feature selection methods
7. Create optimal feature subsets for different modeling approaches
8. Validate feature selection stability and biological relevance

### Prerequisites
- Completed `01_data_exploration.ipynb` and `02_data_preprocessing.ipynb`
- Understanding of statistical concepts (correlation, variance, p-values)
- Basic knowledge of survival analysis and ensemble methods

### Feature Selection Methods Overview
This notebook implements multiple complementary feature selection approaches:

**1. Variance-Based Selection**
- Statistical variance analysis
- VarianceThreshold filtering
- Coefficient of variation analysis

**2. Correlation-Based Selection**
- Pearson/Spearman correlation with target
- Feature-feature correlation for redundancy removal
- Mutual information analysis

**3. Cox Regression-Based Selection**
- Univariate Cox proportional hazards models
- Hazard ratio analysis
- Survival-based p-value ranking

**4. Random Forest-Based Selection**
- Tree-based feature importance
- Permutation importance
- Recursive feature elimination with RF

**5. Boruta Algorithm**
- All-relevant feature detection
- Shadow feature comparison
- Statistical significance testing

**6. Ensemble Selection**
- Method agreement analysis
- Stability assessment
- Final feature set optimization

---

## 1. Setup and Imports

Let's import all necessary libraries for advanced feature selection methods.

In [None]:
# 📝 ACTIVITY 1: Import Libraries for Advanced Feature Selection
#
# Your task: Import comprehensive libraries for multiple feature selection methods
#
# TODO: Import core data manipulation libraries:
# 1. pandas (as pd) - for data manipulation and analysis
# 2. numpy (as np) - for numerical operations and array handling
# 3. matplotlib.pyplot (as plt) - for plotting and visualization
# 4. seaborn (as sns) - for statistical visualization
# 5. os, json, warnings - for system operations, reading the JSON preprocessing summary, and warning control
#
# TODO: Import feature selection libraries:
# 6. From sklearn.feature_selection import:
#    - VarianceThreshold (variance-based selection)
#    - SelectKBest, f_classif, chi2 (statistical selection)
#    - RFE, RFECV (recursive feature elimination)
#    - mutual_info_classif (mutual information)
# 7. From sklearn.ensemble import: RandomForestClassifier
# 8. From sklearn.preprocessing import: StandardScaler, LabelEncoder
#
# TODO: Import survival analysis libraries:
# 9. Try to import lifelines:
#    - from lifelines import CoxPHFitter (Cox proportional hazards)
#    - from lifelines.statistics import logrank_test
# 10. If lifelines not available, note alternative approaches
#
# TODO: Import Boruta library:
# 11. Try to import boruta:
#    - from boruta import BorutaPy (Boruta feature selection)
# 12. If not available, provide installation instructions
#
# TODO: Import additional statistical libraries:
# 13. From scipy.stats import: pearsonr, spearmanr, chi2_contingency
# 14. From scipy import stats
# 15. import joblib - for saving/loading objects
#
# TODO: Configure environment:
# 16. Suppress warnings: warnings.filterwarnings('ignore')
# 17. Set matplotlib style and seaborn palette
# 18. Set random seeds for reproducibility (np.random.seed(42))
# 19. Set pandas display options for better output
#
# TODO: Check library availability and versions:
# 20. Print versions of key libraries
# 21. Test import success for optional libraries (lifelines, boruta)
# 22. Provide installation instructions for missing libraries
#
# Expected output: Successfully imported libraries with version info and availability status

# Write your code below:
# import pandas as pd
# import numpy as np
# ...
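
As a starting point for TODO items 21-22, the availability check for optional libraries can be sketched with the standard library alone. The helper name `check_optional_libraries` is a hypothetical choice for this illustration, not something defined elsewhere in the course.

```python
import importlib.util


def check_optional_libraries(names=("lifelines", "boruta")):
    """Return {module_name: bool} without actually importing the modules."""
    # find_spec returns None when a top-level module cannot be located
    return {name: importlib.util.find_spec(name) is not None for name in names}


availability = check_optional_libraries()
for name, found in availability.items():
    status = "available" if found else f"missing (try: pip install {name})"
    print(f"{name}: {status}")
```

The resulting dictionary can be reused later to decide whether the Cox and Boruta activities run the library version or the fallback.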

## 2. Load Preprocessed Data

Let's load the clean, preprocessed data from the previous notebook and validate it for feature selection.

In [None]:
# 📝 ACTIVITY 2: Load and Validate Preprocessed Data
#
# Your task: Load the preprocessed datasets and validate them for feature selection
#
# TODO: Set up data paths:
# 1. Define PREPROCESSED_PATH = '../results/preprocessed/'
# 2. Define FEATURE_SELECTION_PATH = '../results/feature_selection/'
# 3. Create feature selection results directory: os.makedirs(FEATURE_SELECTION_PATH, exist_ok=True)
#
# TODO: Load training data for feature selection:
# 4. Load training features: X_train = pd.read_csv(PREPROCESSED_PATH + 'splits/X_train.csv', index_col=0)
# 5. Load training target: y_train = pd.read_csv(PREPROCESSED_PATH + 'splits/y_train.csv', index_col=0)['target']
# 6. Load validation data: X_val, y_val (for stability testing)
# 7. Handle any loading errors with try-except blocks
#
# TODO: Load additional preprocessed data:
# 8. Load normalized expression data: expression_data = pd.read_csv(PREPROCESSED_PATH + 'data/expression_normalized.csv', index_col=0)
# 9. Load processed clinical data: clinical_data = pd.read_csv(PREPROCESSED_PATH + 'data/clinical_processed.csv', index_col=0)
# 10. Load preprocessing summary (requires `import json`; the `with` block closes the file): with open(PREPROCESSED_PATH + 'preprocessing_summary.json') as f: preprocessing_summary = json.load(f)
#
# TODO: Validate data for feature selection:
# 11. Print "DATA VALIDATION FOR FEATURE SELECTION" header
# 12. Check data dimensions and ensure they match expectations
# 13. Verify no missing values: X_train.isnull().sum().sum() == 0
# 14. Check target variable distribution: y_train.value_counts()
# 15. Confirm data types are appropriate for analysis
#
# TODO: Separate expression and clinical features:
# 16. If X_train contains both types, separate them:
#     - Identify which columns are expression features (gene names)
#     - Identify which columns are clinical features
#     - Create expression_features and clinical_features lists
# 17. Print counts of each feature type
#
# TODO: Create comprehensive dataset overview:
# 18. Print dataset dimensions: samples, total features, expression features, clinical features
# 19. Display data types and memory usage
# 20. Show class balance in training set
# 21. Confirm data is ready for feature selection analysis
#
# TODO: Prepare survival data if available:
# 22. Check if survival time and event data are available in clinical data
# 23. If available, prepare survival DataFrame for Cox regression
# 24. Handle missing survival data appropriately
#
# Expected output: Loaded and validated datasets with feature type identification

# Write your code below:
# PREPROCESSED_PATH = '../results/preprocessed/'
# FEATURE_SELECTION_PATH = '../results/feature_selection/'
# ...
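
The validation steps (items 11-15) can be bundled into one helper. This is a minimal sketch that runs on a toy DataFrame so it does not depend on the course files; `validate_for_selection`, `X_demo` and `y_demo` are hypothetical names, and the specific checks are one reasonable reading of the TODO list rather than a required implementation.

```python
import pandas as pd


def validate_for_selection(X, y):
    """Run basic sanity checks and return a small summary dictionary."""
    if not X.index.equals(y.index):
        raise ValueError("X and y do not share the same sample index")
    n_missing = int(X.isnull().sum().sum())
    if n_missing > 0:
        raise ValueError(f"X contains {n_missing} missing values")
    non_numeric = X.select_dtypes(exclude="number").columns.tolist()
    if non_numeric:
        raise ValueError(f"Non-numeric columns found: {non_numeric[:5]}")
    return {
        "n_samples": X.shape[0],
        "n_features": X.shape[1],
        "class_counts": y.value_counts().to_dict(),
    }


# Toy data standing in for X_train / y_train
X_demo = pd.DataFrame(
    {"gene_a": [0.1, 0.5, 0.9, 0.3], "gene_b": [1.2, 0.8, 1.1, 0.7]},
    index=["s1", "s2", "s3", "s4"],
)
y_demo = pd.Series([0, 1, 1, 0], index=X_demo.index, name="target")

summary = validate_for_selection(X_demo, y_demo)
print(summary)
```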

## 3. Variance-Based Feature Selection

Remove features with low variance that provide little discriminative power.

In [None]:
# 📝 ACTIVITY 3: Advanced Variance-Based Feature Selection
#
# Your task: Apply sophisticated variance analysis to identify most informative features
#
# TODO: Set up variance analysis:
# 1. Print "VARIANCE-BASED FEATURE SELECTION" header with separators
# 2. Create working copies: X_variance = X_train.copy()
# 3. Store original feature count: original_feature_count = X_variance.shape[1]
#
# TODO: Calculate comprehensive variance statistics:
# 4. Calculate feature variances: feature_variances = X_variance.var()
# 5. Calculate coefficient of variation: feature_cv = X_variance.std() / X_variance.mean().abs()
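# 6. Handle division by zero (a zero mean gives inf, or NaN when the std is also zero, so fillna alone is not enough): feature_cv = feature_cv.replace([np.inf, -np.inf], np.nan).fillna(0)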
# 7. Calculate variance percentiles: np.percentile(feature_variances, [10, 25, 50, 75, 90])
#
# TODO: Apply VarianceThreshold filtering:
# 8. Create VarianceThreshold selector with threshold=0.05:
#    - selector = VarianceThreshold(threshold=0.05)
#    - selector.fit(X_variance)
# 9. Get selected features: selected_features_mask = selector.get_support()
# 10. Apply selection: X_variance_filtered = X_variance.loc[:, selected_features_mask]
# 11. Get removed features: removed_features = X_variance.columns[~selected_features_mask]
#
# TODO: Apply coefficient of variation filtering:
# 12. Set CV threshold (e.g., 0.1 for meaningful variation)
# 13. Identify high-variation features: high_cv_features = feature_cv[feature_cv > 0.1].index
# 14. Combine with variance filtering: final_variance_features = X_variance_filtered.columns.intersection(high_cv_features)
#
# TODO: Analyze variance distribution by feature type:
# 15. If expression and clinical features are separable:
#     - Calculate variance statistics separately for each type
#     - Compare variance distributions between feature types
#     - Apply type-specific thresholds if appropriate
#
# TODO: Create variance-based feature ranking:
# 16. Rank features by variance: variance_ranking = feature_variances.sort_values(ascending=False)
# 17. Rank features by CV: cv_ranking = feature_cv.sort_values(ascending=False)
# 18. Create combined ranking (rank() is ascending by default, so a higher combined score means a more variable feature): combined_score = (variance_ranking.rank() + cv_ranking.rank()) / 2
#
# TODO: Visualize variance analysis results:
# 19. Create histogram of feature variances
# 20. Create scatter plot of variance vs CV
# 21. Show distribution of removed vs retained features
# 22. Create before/after comparison
#
# TODO: Generate variance selection summary:
# 23. Calculate features removed: removed_count = len(removed_features)
# 24. Calculate removal percentage: (removed_count / original_feature_count) * 100
# 25. Print summary statistics: original count, final count, removal stats
# 26. Show top 10 highest and lowest variance features
#
# TODO: Save variance selection results:
# 27. Save selected features list: variance_selected_features
# 28. Save variance statistics and rankings
# 29. Create variance selection report
#
# Expected output: Variance-filtered feature set with comprehensive analysis and rankings

# Write your code below:
# print("VARIANCE-BASED FEATURE SELECTION")
# print("="*50)
# ...
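
The combined variance and coefficient-of-variation filter (items 8-14) might look like the sketch below. It assumes a numeric DataFrame; `variance_cv_filter` and the toy columns are hypothetical. Two caveats are worth keeping in mind: `VarianceThreshold` uses the population variance (`ddof=0`) while `DataFrame.var()` defaults to `ddof=1`, and the CV is only meaningful for features whose mean is not close to zero, so it is a poor fit for centered or z-scored data.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold


def variance_cv_filter(X, var_threshold=0.05, cv_threshold=0.1):
    """Keep features passing both the variance and the CV threshold."""
    selector = VarianceThreshold(threshold=var_threshold).fit(X)
    kept_by_variance = X.columns[selector.get_support()]

    cv = X.std() / X.mean().abs()
    # std / 0 yields inf (or NaN when std is 0 too); treat both as "no usable CV"
    cv = cv.replace([np.inf, -np.inf], np.nan).fillna(0)
    kept_by_cv = cv[cv > cv_threshold].index

    return kept_by_variance.intersection(kept_by_cv), cv


# Toy data: one constant, one zero-mean, one high-mean and one variable feature
X_demo = pd.DataFrame({
    "constant": [1.0, 1.0, 1.0, 1.0],
    "zero_mean": [-1.0, 1.0, -1.0, 1.0],
    "high_mean": [100.0, 101.0, 99.0, 100.0],
    "variable": [1.0, 3.0, 2.0, 6.0],
})

selected, cv = variance_cv_filter(X_demo)
print(list(selected))
```

In this toy case `zero_mean` passes the variance filter but is dropped by the CV filter because its CV is undefined, which illustrates why the CV step should be skipped for centered data.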

## 4. Correlation-Based Feature Selection

Identify features that are highly correlated with the target and remove redundant features.

In [None]:
# 📝 ACTIVITY 4: Comprehensive Correlation-Based Feature Selection
#
# Your task: Use correlation analysis to identify relevant features and remove redundant ones
#
# TODO: Set up correlation analysis:
# 1. Print "CORRELATION-BASED FEATURE SELECTION" header with separators
# 2. Create working copy: X_correlation = X_train.copy()
# 3. Ensure target variable is numeric: y_numeric = y_train.astype(int) if needed
#
# TODO: Calculate feature-target correlations:
# 4. Calculate Pearson correlation with target:
#    - pearson_corrs = X_correlation.apply(lambda x: pearsonr(x, y_numeric)[0])
#    - pearson_pvals = X_correlation.apply(lambda x: pearsonr(x, y_numeric)[1])
# 5. Calculate Spearman correlation (non-parametric):
#    - spearman_corrs = X_correlation.apply(lambda x: spearmanr(x, y_numeric)[0])
#    - spearman_pvals = X_correlation.apply(lambda x: spearmanr(x, y_numeric)[1])
# 6. Handle any NaN values in correlations
#
# TODO: Apply mutual information analysis:
# 7. Calculate mutual information scores:
#    - mi_scores = mutual_info_classif(X_correlation, y_numeric, random_state=42)
#    - mi_scores = pd.Series(mi_scores, index=X_correlation.columns)
# 8. Normalize MI scores: mi_scores_norm = mi_scores / mi_scores.max()
#
# TODO: Create correlation-based rankings:
# 9. Rank by absolute Pearson correlation: abs_pearson_ranking = pearson_corrs.abs().sort_values(ascending=False)
# 10. Rank by absolute Spearman correlation: abs_spearman_ranking = spearman_corrs.abs().sort_values(ascending=False)
# 11. Rank by mutual information: mi_ranking = mi_scores.sort_values(ascending=False)
# 12. Create combined correlation score (rank() is ascending by default, so a higher score means a more relevant feature): combined_corr_score = (abs_pearson_ranking.rank() + mi_ranking.rank()) / 2
#
# TODO: Select highly correlated features:
# 13. Set correlation thresholds:
#     - pearson_threshold = 0.1 (absolute correlation)
#     - mi_threshold = 0.1 (normalized MI score)
#     - p_value_threshold = 0.05
# 14. Select significant Pearson correlations: pearson_selected = abs_pearson_ranking[(abs_pearson_ranking > pearson_threshold) & (pearson_pvals < p_value_threshold)].index
# 15. Select high MI features using the normalized scores, since mi_threshold is defined on that scale: mi_selected = mi_scores_norm[mi_scores_norm > mi_threshold].index
# 16. Combine selections: correlation_selected_features = set(pearson_selected) | set(mi_selected)
#
# TODO: Remove redundant features (feature-feature correlation):
# 17. Calculate feature-feature correlation matrix: feature_corr_matrix = X_correlation[list(correlation_selected_features)].corr()
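# 18. Find highly correlated feature pairs using only the upper triangle (this skips the diagonal, counts each pair once, and still catches exact duplicates): upper = feature_corr_matrix.abs().where(np.triu(np.ones(feature_corr_matrix.shape, dtype=bool), k=1)); high_corr_pairs = upper.stack().loc[lambda s: s > 0.8]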
# 19. For each highly correlated pair, keep the one with higher target correlation
# 20. Create final correlation-based feature set: final_correlation_features
#
# TODO: Analyze correlation patterns by feature type:
# 21. If expression and clinical features are separable:
#     - Compare correlation distributions between feature types
#     - Identify which feature type shows stronger correlations
#     - Apply type-specific correlation thresholds if needed
#
# TODO: Visualize correlation analysis:
# 22. Create histogram of feature-target correlations
# 23. Create scatter plot of Pearson vs Spearman correlations
# 24. Plot mutual information distribution
# 25. Show top correlated features in each category
#
# TODO: Generate correlation selection summary:
# 26. Count features selected by each method
# 27. Calculate overlap between methods
# 28. Print top 20 features by each correlation measure
# 29. Show correlation statistics summary
#
# TODO: Save correlation results:
# 30. Save correlation matrices and rankings
# 31. Export correlation-selected features list
# 32. Create correlation analysis report
#
# Expected output: Correlation-based feature selection with redundancy removal and method comparison

# Write your code below:
# print("CORRELATION-BASED FEATURE SELECTION")
# print("="*50)
# ...
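
For the redundancy-removal step (items 17-20), one simple option is a greedy pass instead of enumerating pairs: visit features from strongest to weakest target correlation and keep a feature only if it is not highly correlated with anything already kept. This is a sketch of that swapped-in approach, not the only valid solution; `drop_redundant_features` and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd


def drop_redundant_features(X, target_corr, threshold=0.8):
    """Greedily keep features, preferring higher |correlation with target|."""
    order = target_corr.abs().sort_values(ascending=False).index
    corr = X[order].corr().abs()
    kept = []
    for feature in order:
        # Keep the feature only if it is not redundant with any kept feature
        if all(corr.loc[feature, k] <= threshold for k in kept):
            kept.append(feature)
    return kept


# Synthetic data: gene_a_copy is a near-duplicate of gene_a, gene_b is noise
rng = np.random.default_rng(42)
y_demo = pd.Series(rng.integers(0, 2, size=200))
signal = y_demo + rng.normal(0, 0.5, size=200)
X_demo = pd.DataFrame({
    "gene_a": signal,
    "gene_a_copy": signal + rng.normal(0, 0.05, size=200),
    "gene_b": rng.normal(0, 1, size=200),
})

kept = drop_redundant_features(X_demo, X_demo.corrwith(y_demo))
print(kept)
```

Exactly one of the two near-duplicate columns survives, while the unrelated column is kept because it is not redundant with it.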

## 5. Cox Regression-Based Feature Selection

Use survival analysis to identify genes associated with patient outcomes and survival risk.
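
For orientation, the univariate Cox model fitted for each feature assumes that the hazard scales multiplicatively with the feature value. The symbols below ($x_j$ for feature $j$, $\beta_j$ for its coefficient) are notation chosen for this note:

$$
h(t \mid x_j) = h_0(t)\,\exp(\beta_j x_j), \qquad \mathrm{HR}_j = \exp(\beta_j)
$$

Here $h_0(t)$ is the unspecified baseline hazard and $\mathrm{HR}_j$ is the hazard ratio per one-unit increase in $x_j$. A hazard ratio is always positive: $\mathrm{HR}_j > 1$ indicates higher risk and $\mathrm{HR}_j < 1$ a protective effect. Effect sizes are therefore best compared on the log scale, where $|\log \mathrm{HR}_j| = |\beta_j|$ treats both directions symmetrically.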

In [None]:
# 📝 ACTIVITY 5: Cox Proportional Hazards Feature Selection
#
# Your task: Use survival analysis to identify features most associated with patient risk and outcomes
#
# TODO: Set up Cox regression analysis:
# 1. Print "COX REGRESSION-BASED FEATURE SELECTION" header with separators
# 2. Check if survival data is available in clinical data
# 3. If lifelines is available, proceed with Cox analysis
# 4. If not available, implement alternative survival-based ranking
#
# TODO: Prepare survival data:
# 5. Create survival DataFrame with required columns:
#    - duration: survival time (OS_MONTHS if available)
#    - event: event indicator (1 if deceased, 0 if censored)
#    - If survival data not available, create proxy using target variable
# 6. Align survival data with feature matrix
# 7. Handle missing survival times appropriately
#
# TODO: Implement univariate Cox regression (if lifelines available):
# 8. For each feature, fit univariate Cox model:
#    - cox_results = {}
#    - for feature in X_train.columns:
#    -     survival_df = survival_data.copy()
#    -     survival_df[feature] = X_train[feature]
#    -     cph = CoxPHFitter()
#    -     cph.fit(survival_df, duration_col='duration', event_col='event')
#    -     cox_results[feature] = {'hazard_ratio': cph.hazard_ratios_[feature], 'p_value': cph.summary.p[feature]}
#
# TODO: Alternative Cox-like analysis (if lifelines not available):
# 9. Use logistic regression with time-to-event as proxy:
#    - Create risk groups based on target variable and time
#    - Use statistical tests (t-test, Mann-Whitney) to rank features
#    - Calculate effect sizes for each feature
#
# TODO: Extract Cox regression results:
# 10. Extract hazard ratios: hazard_ratios = pd.Series({f: cox_results[f]['hazard_ratio'] for f in cox_results})
# 11. Extract p-values: cox_pvalues = pd.Series({f: cox_results[f]['p_value'] for f in cox_results})
# 12. Calculate log hazard ratios (hazard ratios are always positive, so no abs() is needed before the log): log_hazard_ratios = np.log(hazard_ratios)
# 13. Handle infinite values and NaNs appropriately
#
# TODO: Create Cox-based feature rankings:
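# 14. Rank by effect size on the log scale, so that protective (HR < 1) and harmful (HR > 1) features are treated symmetrically: hr_ranking = np.log(hazard_ratios).abs().sort_values(ascending=False)
# 15. Rank by statistical significance: pval_ranking = cox_pvalues.sort_values()
# 16. Create combined Cox score (non-negative for both effect directions): cox_score = -np.log10(cox_pvalues) * np.log(hazard_ratios).abs()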
# 17. Handle mathematical edge cases (zero p-values, etc.)
#
# TODO: Select Cox-significant features:
# 18. Set significance thresholds:
#     - p_value_threshold = 0.05
#     - hazard_ratio_threshold = 1.2 (HR > 1.2 or HR < 1/1.2, i.e. symmetric on the log scale)
# 19. Select significant features: cox_significant = cox_pvalues[(cox_pvalues < p_value_threshold) & (np.log(hazard_ratios).abs() > np.log(hazard_ratio_threshold))].index
# 20. Rank selected features by combined Cox score
#
# TODO: Analyze Cox results by feature type:
# 21. If expression and clinical features are separable:
#     - Compare hazard ratio distributions between feature types
#     - Identify which feature type has more significant associations
#     - Note any biological patterns in significant genes
#
# TODO: Handle Cox regression limitations:
# 22. Check proportional hazards assumption (if possible)
# 23. Identify features with very extreme hazard ratios (potential outliers)
# 24. Validate results using alternative survival analysis methods
#
# TODO: Visualize Cox analysis results:
# 25. Create volcano plot: -log10(p-value) vs log(hazard ratio)
# 26. Create histogram of hazard ratios
# 27. Show distribution of p-values
# 28. Plot top significant features
#
# TODO: Generate Cox selection summary:
# 29. Count features significant at different p-value thresholds
# 30. Show distribution of hazard ratios for significant features
# 31. Print top 20 features by Cox score
# 32. Create Cox analysis interpretation guide
#
# TODO: Save Cox regression results:
# 33. Save cox_results dictionary
# 34. Export Cox-selected features list
# 35. Create Cox analysis report with interpretation
#
# Expected output: Cox regression-based feature selection with survival analysis insights

# Write your code below:
# print("COX REGRESSION-BASED FEATURE SELECTION")
# print("="*50)
# 
# # Check if lifelines is available
# try:
#     from lifelines import CoxPHFitter
#     lifelines_available = True
#     print("✓ Lifelines available - using Cox proportional hazards models")
# except ImportError:
#     lifelines_available = False
#     print("⚠️  Lifelines not available - using alternative survival-based ranking")
#     print("To install: pip install lifelines")
# ...
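
Once hazard ratios and p-values have been collected, the ranking and selection steps (items 14-20) need only pandas and NumPy. The sketch below works on hand-made toy values rather than fitted models; `select_cox_features` is a hypothetical name and the `1e-300` floor for p-values is an assumption made here to avoid `-log10(0)`.

```python
import numpy as np
import pandas as pd


def select_cox_features(hazard_ratios, p_values, p_threshold=0.05, hr_threshold=1.2):
    """Return combined Cox scores for features passing both thresholds."""
    log_hr = np.log(hazard_ratios)
    # Floor p-values so that an underflowed p of 0 does not give an infinite score
    safe_p = p_values.clip(lower=1e-300)
    score = -np.log10(safe_p) * log_hr.abs()
    # Symmetric effect-size filter: HR > 1.2 or HR < 1/1.2
    significant = (p_values < p_threshold) & (log_hr.abs() > np.log(hr_threshold))
    return score[significant].sort_values(ascending=False)


# Toy values for illustration only
hazard_ratios = pd.Series(
    {"risk_gene": 2.0, "protective_gene": 0.5, "weak_gene": 1.05, "noisy_gene": 3.0}
)
p_values = pd.Series(
    {"risk_gene": 0.001, "protective_gene": 0.001, "weak_gene": 0.001, "noisy_gene": 0.40}
)

selected = select_cox_features(hazard_ratios, p_values)
print(selected)
```

The risk and protective genes receive the same score because a hazard ratio of 2.0 and one of 0.5 are equally strong effects on the log scale.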

## 6. Random Forest-Based Feature Selection

Use tree-based ensemble methods to identify important features through multiple importance measures.

In [None]:
# 📝 ACTIVITY 6: Advanced Random Forest Feature Selection
#
# Your task: Use multiple Random Forest approaches to identify important features
#
# TODO: Set up Random Forest analysis:
# 1. Print "RANDOM FOREST-BASED FEATURE SELECTION" header with separators
# 2. Create working copy: X_rf = X_train.copy()
# 3. Set up Random Forest parameters: n_estimators=100, random_state=42, n_jobs=-1
#
# TODO: Train Random Forest for feature importance:
# 4. Create RandomForestClassifier: rf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
# 5. Fit the model: rf.fit(X_rf, y_train)
# 6. Extract feature importances: rf_importances = pd.Series(rf.feature_importances_, index=X_rf.columns)
# 7. Sort by importance: rf_ranking = rf_importances.sort_values(ascending=False)
#
# TODO: Implement permutation importance:
# 8. Use permutation_importance from sklearn.inspection (if available):
#    - from sklearn.inspection import permutation_importance
#    - perm_importance = permutation_importance(rf, X_rf, y_train, n_repeats=10, random_state=42)
#    - perm_importances = pd.Series(perm_importance.importances_mean, index=X_rf.columns)
# 9. If not available, implement manual permutation importance
# 10. Sort permutation importances: perm_ranking = perm_importances.sort_values(ascending=False)
#
# TODO: Apply Recursive Feature Elimination with Cross-Validation:
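# 11. Create RFECV selector: rfecv = RFECV(estimator=RandomForestClassifier(n_estimators=50, random_state=42), step=1, cv=5, scoring='accuracy') (step=1 refits the forest after removing a single feature; with thousands of features, a fractional step such as step=0.1, which removes 10% of the features per iteration, is far faster)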
# 12. Fit RFECV: rfecv.fit(X_rf, y_train)
# 13. Get selected features: rfecv_selected = X_rf.columns[rfecv.support_]
# 14. Get feature rankings: rfecv_rankings = pd.Series(rfecv.ranking_, index=X_rf.columns)
#
# TODO: Create ensemble Random Forest importance:
# 15. Train multiple RF models with different parameters:
#     - Different n_estimators: [50, 100, 200]
#     - Different max_features: ['sqrt', 'log2', 0.3]
#     - Different max_depth: [None, 10, 20]
# 16. Average importances across all models: ensemble_importance
# 17. Calculate importance stability (standard deviation across models)
#
# TODO: Analyze feature importance distributions:
# 18. Calculate importance percentiles and thresholds
# 19. Compare importance distributions between feature types (expression vs clinical)
# 20. Identify top features consistently selected across methods
#
# TODO: Apply importance-based selection:
# 21. Set importance thresholds:
#     - rf_threshold = np.percentile(rf_importances, 75)  # Top 25% of features
#     - perm_threshold = np.percentile(perm_importances, 75)
# 22. Select high-importance features: rf_selected = rf_ranking[rf_ranking > rf_threshold].index
# 23. Select high permutation importance: perm_selected = perm_ranking[perm_ranking > perm_threshold].index
# 24. Combine selections: rf_final_features = set(rf_selected) | set(perm_selected) | set(rfecv_selected)
#
# TODO: Validate Random Forest selections:
# 25. Check feature selection stability by training RF on different subsets
# 26. Calculate feature selection frequency across multiple runs
# 27. Identify most stable/consistent features
#
# TODO: Visualize Random Forest results:
# 28. Create feature importance bar plots for top 20 features
# 29. Create scatter plot comparing RF importance vs permutation importance
# 30. Plot RFECV cross-validation scores
# 31. Show importance distributions by feature type
#
# TODO: Analyze biological relevance (if expression features):
# 32. If gene names are available, look up biological functions of top genes
# 33. Check if selected genes are known biomarkers
# 34. Identify any pathway enrichment patterns
#
# TODO: Generate RF selection summary:
# 35. Count features selected by each RF method
# 36. Calculate method agreement/overlap
# 37. Print top 20 features by each importance measure
# 38. Show importance statistics and thresholds used
#
# TODO: Save Random Forest results:
# 39. Save all importance scores and rankings
# 40. Export RF-selected features list
# 41. Save trained RF model for later use
# 42. Create RF analysis report
#
# Expected output: Random Forest-based feature selection with multiple importance measures and validation

# Write your code below:
# print("RANDOM FOREST-BASED FEATURE SELECTION")
# print("="*50)
# ...
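
The stability check in items 25-27 can be approximated by refitting the forest on bootstrap resamples and counting how often each feature lands in the top `k`. This is a minimal sketch on synthetic data; `rf_selection_frequency`, the default values and the toy columns are all hypothetical choices.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def rf_selection_frequency(X, y, n_runs=10, top_k=2, n_estimators=50):
    """Fraction of bootstrap runs in which each feature ranks in the top_k."""
    counts = pd.Series(0.0, index=X.columns)
    rng = np.random.default_rng(42)
    for run in range(n_runs):
        # Bootstrap: sample rows with replacement
        idx = rng.integers(0, len(X), size=len(X))
        rf = RandomForestClassifier(n_estimators=n_estimators, random_state=run)
        rf.fit(X.iloc[idx], y.iloc[idx])
        importances = pd.Series(rf.feature_importances_, index=X.columns)
        counts.loc[importances.nlargest(top_k).index] += 1
    return (counts / n_runs).sort_values(ascending=False)


# Synthetic data: six noise features plus one strongly informative feature
rng = np.random.default_rng(0)
y_demo = pd.Series(rng.integers(0, 2, size=200))
X_demo = pd.DataFrame(
    rng.normal(size=(200, 6)), columns=[f"noise_{i}" for i in range(6)]
)
X_demo["signal"] = y_demo + rng.normal(0, 0.3, size=200)

freq = rf_selection_frequency(X_demo, y_demo, n_runs=5, top_k=2)
print(freq)
```

Features with a frequency close to 1.0 are selected consistently; noise features that occasionally enter the top `k` show up with low frequencies.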

## 7. Boruta Feature Selection

Apply the Boruta algorithm for all-relevant feature detection using shadow features.

In [None]:
# 📝 ACTIVITY 7: Boruta All-Relevant Feature Selection
#
# Your task: Apply Boruta algorithm to identify all features that are relevant for the prediction task
#
# TODO: Set up Boruta analysis:
# 1. Print "BORUTA-BASED FEATURE SELECTION" header with separators
# 2. Check if Boruta library is available
# 3. If not available, provide installation instructions and alternative implementation
#
# TODO: Prepare data for Boruta:
# 4. Create working copy: X_boruta = X_train.copy()
# 5. Convert to numpy arrays if needed: X_array = X_boruta.values, y_array = y_train.values
# 6. Ensure data types are appropriate for Boruta
#
# TODO: Configure and run Boruta (if available):
# 7. Create Boruta selector:
#    - boruta = BorutaPy(RandomForestClassifier(n_jobs=-1, random_state=42), n_estimators=100, verbose=2, random_state=42)
# 8. Set the remaining Boruta parameters as keyword arguments in the same BorutaPy(...) constructor call:
#    - max_iter=100 (maximum iterations)
#    - perc=100 (percentile for shadow feature importance)
# 9. Fit Boruta: boruta.fit(X_array, y_array)
#
# TODO: Alternative Boruta implementation (if library not available):
# 10. Implement simplified Boruta logic:
#     - Create shadow features by permuting original features
#     - Train Random Forest on original + shadow features
#     - Compare feature importances with shadow feature importances
#     - Select features that consistently outperform shadow features
#
# TODO: Extract Boruta results:
# 11. Get confirmed features: boruta_confirmed = X_boruta.columns[boruta.support_].tolist()
# 12. Get tentative features: boruta_tentative = X_boruta.columns[boruta.support_weak_].tolist()
# 13. Get rejected features: boruta_rejected = X_boruta.columns[~(boruta.support_ | boruta.support_weak_)].tolist()
# 14. Get feature rankings: boruta_rankings = boruta.ranking_
#
# TODO: Analyze Boruta results:
# 15. Count features in each category: confirmed, tentative, rejected
# 16. Calculate selection percentages
# 17. Compare Boruta results with other selection methods
# 18. Analyze convergence behavior (if available)
#
# TODO: Handle tentative features:
# 19. Decide on tentative feature treatment:
#     - Option 1: Include all tentative features
#     - Option 2: Exclude all tentative features
#     - Option 3: Apply additional criteria to tentative features
# 20. Create final Boruta feature set: boruta_final_features
#
# TODO: Validate Boruta stability:
# 21. Run Boruta multiple times with different random seeds
# 22. Calculate feature selection stability across runs
# 23. Identify most consistently selected features
#
# TODO: Compare with shadow feature importance:
# 24. If shadow feature data is available, show importance comparisons
# 25. Visualize feature importance vs shadow importance distributions
# 26. Identify features with strongest signal vs noise ratios
#
# TODO: Analyze Boruta results by feature type:
# 27. If expression and clinical features are separable:
#     - Compare selection rates between feature types
#     - Identify which type is more frequently selected
#     - Look for biological patterns in selected genes
#
# TODO: Visualize Boruta analysis:
# 28. Create feature importance plot with Boruta decisions
# 29. Show histogram of confirmed vs rejected vs tentative features
# 30. Plot Boruta convergence history (if available)
# 31. Create comparison with Random Forest importance
#
# TODO: Generate Boruta summary:
# 32. Print counts for each Boruta decision category
# 33. Show top confirmed features with their rankings
# 34. Compare Boruta selection with other methods
# 35. Provide interpretation of Boruta results
#
# TODO: Save Boruta results:
# 36. Save Boruta object and results
# 37. Export confirmed and tentative features lists
# 38. Create Boruta analysis report
# 39. Document Boruta parameters and decisions made
#
# Expected output: Boruta-based all-relevant feature selection with comprehensive analysis

# Write your code below:
# print("BORUTA-BASED FEATURE SELECTION")
# print("="*50)
#
# # Check if Boruta is available
# try:
#     from boruta import BorutaPy
#     boruta_available = True
#     print("✓ Boruta library available")
# except ImportError:
#     boruta_available = False
#     print("⚠️  Boruta library not available")
#     print("To install: pip install Boruta")
#     print("Alternative: We'll implement simplified Boruta logic")
# ...
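
If the library is unavailable, the simplified shadow-feature logic from item 10 can be sketched as follows. Note that this is deliberately not the full Boruta algorithm: it only counts how often each real feature beats the best shadow feature and omits the statistical test and the iterative rejection of features. `shadow_feature_hits` and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def shadow_feature_hits(X, y, n_iter=10, random_state=42):
    """Count, per feature, the iterations in which it beats the best shadow."""
    rng = np.random.default_rng(random_state)
    hits = pd.Series(0, index=X.columns)
    for i in range(n_iter):
        # Shadow features: each column shuffled, which breaks any link to y
        shadows = pd.DataFrame(
            {f"shadow_{c}": rng.permutation(X[c].to_numpy()) for c in X.columns},
            index=X.index,
        )
        combined = pd.concat([X, shadows], axis=1)
        rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=i)
        rf.fit(combined, y)
        importances = pd.Series(rf.feature_importances_, index=combined.columns)
        shadow_max = importances[shadows.columns].max()
        hits += (importances[X.columns] > shadow_max).astype(int)
    return hits


# Synthetic data: six noise features plus one strongly informative feature
rng = np.random.default_rng(0)
y_demo = pd.Series(rng.integers(0, 2, size=200))
X_demo = pd.DataFrame(
    rng.normal(size=(200, 6)), columns=[f"noise_{i}" for i in range(6)]
)
X_demo["signal"] = y_demo + rng.normal(0, 0.3, size=200)

hits = shadow_feature_hits(X_demo, y_demo, n_iter=5)
print(hits.sort_values(ascending=False))
```

A feature that wins in every iteration is a strong candidate for "confirmed"; deciding where to draw the line for intermediate hit counts is what the statistical test in the real algorithm is for.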

## 8. Feature Selection Comparison and Ensemble

Compare all methods and create optimal feature sets through ensemble approaches.

In [None]:
# 📝 ACTIVITY 8: Comprehensive Feature Selection Comparison and Ensemble
#
# Your task: Compare all feature selection methods and create optimal ensemble feature sets
#
# TODO: Set up method comparison:
# 1. Print "FEATURE SELECTION METHOD COMPARISON" header with separators
# 2. Collect all feature sets from previous methods:
#    - variance_selected_features, final_correlation_features
#    - cox_significant, rf_final_features, boruta_final_features
#    (these are the names suggested in Activities 3-7; substitute your own if you chose different ones)
# 3. Handle cases where some methods might not have results
#
# TODO: Create method comparison matrix:
# 4. Create binary matrix showing which features are selected by which methods
# 5. Methods as columns, features as rows, 1 if selected, 0 if not
# 6. Calculate method agreement scores: intersection/union for each pair
# 7. Create method similarity matrix using Jaccard similarity
#
# TODO: Analyze feature selection overlap:
# 8. Calculate intersection of all methods: core_features = intersection of all feature sets
# 9. Find features selected by majority: majority_features = selected by >50% of methods
# 10. Identify method-specific features: features selected by only one method
# 11. Calculate total unique features across all methods
#
# TODO: Create voting-based ensemble selection:
# 12. Count votes for each feature: vote_count = sum of selections across methods
# 13. Create ensemble feature sets with different vote thresholds:
#     - unanimous_features: vote_count == number_of_methods
#     - strong_consensus: vote_count >= 80% of methods
#     - majority_consensus: vote_count >= 50% of methods
#     - weak_consensus: vote_count >= 30% of methods
#
# TODO: Weight methods by performance (optional):
# 15. If validation performance is available, weight methods by accuracy
# 16. Create weighted voting: weight high-performing methods more
# 17. Calculate weighted ensemble scores for each feature
#
# TODO: Apply stability analysis:
# 18. Run each method multiple times with bootstrap sampling
# 19. Calculate selection stability for each feature: selection_frequency
# 20. Combine stability with vote counts: stable_consensus_features
#
# TODO: Create different ensemble strategies:
# 21. Conservative ensemble: high agreement threshold (>=4 methods)
# 22. Moderate ensemble: medium agreement threshold (>=3 methods)
# 23. Liberal ensemble: low agreement threshold (>=2 methods)
# 24. Method-specific ensembles: combine similar methods (e.g., RF + Boruta)
#
# TODO: Validate ensemble feature sets:
# 25. For each ensemble, train quick Random Forest model
# 26. Evaluate on validation set: accuracy, precision, recall, F1-score
# 27. Compare ensemble performance to individual methods
# 28. Identify optimal ensemble size vs performance trade-off
#
# TODO: Analyze biological relevance (if gene features):
# 29. For top ensemble features, look up biological functions
# 30. Check literature for breast cancer biomarker validation
# 31. Identify known pathways represented in selected features
# 32. Note any surprising or novel feature selections
#
# TODO: Create comprehensive comparison visualization:
# 33. Create UpSet plot or Venn diagram showing method overlaps
# 34. Create heatmap of method agreement
# 35. Plot ensemble performance vs feature set size
# 36. Show feature ranking stability across methods
#
# TODO: Generate final feature recommendations:
# 37. Recommend 3-5 different feature sets for different use cases:
#     - Minimal set: core features selected by all methods
#     - Balanced set: majority consensus features
#     - Comprehensive set: liberal consensus features
#     - Performance-optimized set: best validation performance
# 38. Provide rationale for each recommendation
#
# TODO: Create feature selection summary report:
# 39. Document all methods used and their parameters
# 40. Show method comparison results and overlaps
# 41. Present final feature set recommendations with justifications
# 42. Include performance validation results
#
# TODO: Export final results:
# 43. Save all ensemble feature sets
# 44. Export method comparison matrices
# 45. Save validation performance results
# 46. Create comprehensive feature selection report
#
# Expected output: Complete feature selection comparison with optimized ensemble feature sets

# Write your code below:
# print("FEATURE SELECTION METHOD COMPARISON")
# print("="*50)
# ...

## 💡 Coding Hints and Templates

Need help getting started? Here are some code templates and hints for feature selection methods:

### 📋 **Template: Variance-Based Selection**
```python
# Variance threshold filtering
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.05)
selector.fit(X_train)
selected_features = X_train.columns[selector.get_support()]
```

### 📋 **Template: Correlation Analysis**
```python
# Feature-target correlation
from scipy.stats import pearsonr
correlations = X_train.apply(lambda x: pearsonr(x, y_train)[0])
high_corr_features = correlations[abs(correlations) > 0.1].index
```

### 📋 **Template: Cox Regression (with lifelines)**
```python
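# Univariate Cox regression: fit one model per feature
import pandas as pd
from lifelines import CoxPHFitter

cox_results = {}
for feature in X_train.columns:
    # survival_time and event_indicator must describe the same samples as X_train
    df = pd.DataFrame({
        'duration': survival_time,
        'event': event_indicator,
        'feature': X_train[feature],
    })
    cph = CoxPHFitter()
    cph.fit(df, duration_col='duration', event_col='event')
    # Single covariate, so the first (only) row holds this feature's statistics
    cox_results[feature] = {'hr': cph.hazard_ratios_.values[0], 'p': cph.summary.p.values[0]}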
```

### 📋 **Template: Random Forest Importance**
```python
# Random Forest (impurity-based) feature importance
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
top_features = importances.nlargest(100).index
```

### 📋 **Template: Permutation Importance**
```python
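# Permutation importance of the fitted forest `rf` from the previous template.
# Scoring on held-out data avoids rewarding features the forest only memorized;
# X_val and y_val are assumed names for your validation split.
import pandas as pd
from sklearn.inspection import permutation_importance
perm_importance = permutation_importance(rf, X_val, y_val, n_repeats=10, random_state=42)
perm_scores = pd.Series(perm_importance.importances_mean, index=X_val.columns)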
```

### 📋 **Template: Boruta Selection**
```python
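# Boruta feature selection (BorutaPy expects NumPy arrays, hence .values)
from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier
boruta = BorutaPy(RandomForestClassifier(n_jobs=-1), n_estimators=100, verbose=2, random_state=42)
boruta.fit(X_train.values, y_train.values)
selected_features = X_train.columns[boruta.support_]  # confirmed features only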
```

### 📋 **Template: Method Comparison**
```python
# Compare feature selection methods
import pandas as pd

method_results = {
    'variance': variance_selected,
    'correlation': correlation_selected,
    'cox': cox_selected,
    'rf': rf_selected,
    'boruta': boruta_selected
}

# Create voting matrix (rows: features, columns: methods, 1 = selected).
# Sorting the union gives a reproducible row order; a raw set has none.
all_features = sorted(set().union(*method_results.values()))
vote_matrix = pd.DataFrame(0, index=all_features, columns=list(method_results))
for method, features in method_results.items():
    vote_matrix[method] = vote_matrix.index.isin(features).astype(int)

# Majority voting
vote_counts = vote_matrix.sum(axis=1)
majority_features = vote_counts[vote_counts >= len(method_results) // 2 + 1].index
```
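
### 📋 **Template: Bootstrap Selection Stability**

The TODO list asks for a `selection_frequency` per feature under bootstrap resampling. The sketch below is one possible way to compute it, not the required solution. It uses a toy selector (top-k features by absolute Pearson correlation) and synthetic data; `top_k_by_correlation`, `X_demo`, `y_demo`, and the 0.8 cut-off are all illustrative assumptions, and you would swap in any of the selection methods above.

```python
import numpy as np

def top_k_by_correlation(X, y, k):
    """Toy selector: indices of the k features with the largest |Pearson r| to y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    r = (Xc * yc[:, None]).sum(axis=0) / denom
    return np.argsort(-np.abs(r))[:k]

def selection_frequency(X, y, k, n_bootstrap=50, seed=0):
    """Fraction of bootstrap resamples in which each feature is selected."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    counts = np.zeros(n_features)
    for _ in range(n_bootstrap):
        # Draw rows with replacement, then rerun the selector on the resample
        idx = rng.integers(0, n_samples, size=n_samples)
        counts[top_k_by_correlation(X[idx], y[idx], k)] += 1
    return counts / n_bootstrap

# Synthetic data (assumption): only features 0 and 1 carry signal
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 10))
y_demo = 2.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

freq = selection_frequency(X_demo, y_demo, k=2)
stable_features = np.where(freq >= 0.8)[0]  # 0.8 is an arbitrary illustrative cut-off
print(freq.round(2))
print(stable_features)
```

Features that are selected in most resamples are less likely to be artifacts of one particular train split, so `freq` can be combined with the vote counts when building `stable_consensus_features`.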

### 🔍 **Key Feature Selection Principles**
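- **Variance**: Remove features with little variation (threshold typically 0.01-0.1). Thresholds depend on feature scale: after standardization every feature has variance 1, so filter before scaling.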
- **Correlation**: Balance relevance (high target correlation) with redundancy removal
- **Cox Regression**: Focus on hazard ratios >1.2 or <0.8 with p<0.05, and correct the p-values for multiple testing when screening many features
- **Random Forest**: Use multiple importance measures and stability assessment
- **Boruta**: All-relevant selection, good for comprehensive feature discovery
- **Ensemble**: Combine methods to balance different selection criteria
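
When the Cox guideline above is applied to thousands of genes, raw p-values below 0.05 will include many false positives. One common option is the Benjamini-Hochberg adjustment. The sketch below implements it by hand with NumPy and applies the hazard-ratio and p-value filter to a small table of made-up results; `cox_df` and the gene names are hypothetical. A `cox_results` dictionary like the one in the Cox template could be converted with `pd.DataFrame.from_dict(cox_results, orient='index')`.

```python
import numpy as np
import pandas as pd

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (false discovery rate control)."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    # Scale the i-th smallest p-value by m / i
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return adjusted

# Hypothetical univariate Cox results (values invented for illustration)
cox_df = pd.DataFrame(
    {"hr": [1.80, 0.60, 1.05, 1.50, 0.95],
     "p": [0.001, 0.004, 0.030, 0.200, 0.700]},
    index=["gene_a", "gene_b", "gene_c", "gene_d", "gene_e"],
)
cox_df["p_adj"] = benjamini_hochberg(cox_df["p"])

# Keep features with a sizeable effect AND an adjusted p-value below 0.05
strong_effect = (cox_df["hr"] > 1.2) | (cox_df["hr"] < 0.8)
cox_selected = cox_df.index[strong_effect & (cox_df["p_adj"] < 0.05)]
print(cox_df)
print(list(cox_selected))  # ['gene_a', 'gene_b']
```

Here `gene_d` has a large hazard ratio but weak evidence, and `gene_c` has a small p-value but a negligible effect, so both are dropped.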

### 🔍 **Selection Strategy Guidelines**
- **Conservative**: Use intersection of multiple methods (high precision)
- **Liberal**: Use union of methods (high recall)
- **Balanced**: Use majority voting (balance precision/recall)
- **Validate**: Always check performance on held-out validation set
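
The four guidelines above can be sketched together as follows. This is a minimal illustration on synthetic data, not the notebook's reference solution: the three method outputs in `method_results` are invented, column indices stand in for feature names, and F1 on a single validation split is just one reasonable choice of metric.

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the course data; with shuffle=False the
# informative features are columns 0-4
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_redundant=0, shuffle=False, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3,
                                            stratify=y, random_state=0)

# Hypothetical outputs of three selection methods (sets of column indices)
method_results = {
    "method_a": {0, 1, 2, 3, 10},
    "method_b": {0, 1, 2, 4, 11},
    "method_c": {0, 1, 3, 4, 12},
}

# Count how many methods selected each feature
votes = Counter(f for feats in method_results.values() for f in feats)
n_methods = len(method_results)
feature_sets = {
    # Intersection: selected by every method
    "conservative": sorted(f for f, v in votes.items() if v == n_methods),
    # Majority vote: selected by more than half of the methods
    "balanced": sorted(f for f, v in votes.items() if v >= n_methods // 2 + 1),
    # Union: selected by at least one method
    "liberal": sorted(votes),
}

# Validate each candidate set with a quick Random Forest on held-out data
scores = {}
for name, cols in feature_sets.items():
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_tr[:, cols], y_tr)
    scores[name] = f1_score(y_val, model.predict(X_val[:, cols]))
    print(f"{name:>12}: {len(cols):2d} features, validation F1 = {scores[name]:.3f}")
```

With real data, the selection methods themselves should be run on the training split only, so that the validation score is not inflated by information leaking from the held-out samples.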

## 🎯 Learning Assessment

### ✅ **Self-Check Questions**

After completing the feature selection activities, you should be able to answer:

1. **Variance-Based Selection**
   - Why do we remove low-variance features before machine learning?
   - What's the difference between variance and coefficient of variation for feature selection?
   - How do you choose appropriate variance thresholds for different data types?
   - What are the limitations of variance-based selection?

2. **Correlation-Based Selection**
   - What's the difference between Pearson and Spearman correlation for feature selection?
   - How do you balance feature relevance (target correlation) with redundancy removal?
   - When should you use mutual information vs linear correlation?
   - How do you handle multicollinearity in selected features?

3. **Cox Regression-Based Selection**
   - What does a hazard ratio >1 vs <1 indicate in survival analysis?
   - How do you interpret p-values in the context of multiple testing?
   - What are the assumptions of Cox proportional hazards models?
   - How do you handle censored survival data in feature selection?

4. **Random Forest-Based Selection**
   - What's the difference between impurity-based and permutation importance?
   - Why might Random Forest importance be biased toward certain feature types?
   - How do you assess feature importance stability across different RF models?
   - When should you use RFECV vs simple importance thresholding?

5. **Boruta Algorithm**
   - What are shadow features and why are they important in Boruta?
   - How does Boruta differ from other feature selection methods?
   - What's the difference between confirmed, tentative, and rejected features?
   - How do you handle tentative features in final feature selection?

6. **Ensemble Methods**
   - How do you combine results from different feature selection methods?
   - What are the trade-offs between conservative vs liberal ensemble strategies?
   - How do you validate the performance of ensemble feature sets?
   - When should you weight different methods differently in ensembles?

### 🏆 **Success Criteria**

You have successfully completed this feature selection notebook if you can:
- ✅ Apply multiple feature selection methods to the same dataset
- ✅ Understand the strengths and limitations of each method
- ✅ Compare and contrast results from different approaches
- ✅ Create ensemble feature sets using voting strategies
- ✅ Validate feature selection results on held-out data
- ✅ Interpret biological relevance of selected features (for genomics data)
- ✅ Export optimized feature sets for downstream modeling
- ✅ Document feature selection decisions and rationale

### 🚀 **Extension Challenges** (Optional)

For advanced students:
1. **Advanced Selection Methods**:
   - Implement LASSO regularization for feature selection
   - Use Elastic Net for combined L1/L2 penalty selection
   - Apply genetic algorithm-based feature selection

2. **Stability Analysis**:
   - Assess feature selection stability across bootstrap samples
   - Implement Kuncheva stability index
   - Create stability-weighted ensemble selection

3. **Biological Integration**:
   - Incorporate pathway information into feature selection
   - Use protein-protein interaction networks for feature grouping
   - Apply gene set enrichment analysis to selected features

4. **Multi-objective Optimization**:
   - Balance feature set size vs prediction performance
   - Optimize for multiple objectives (accuracy, interpretability, cost)
   - Use Pareto optimization for feature selection

5. **Advanced Ensemble Methods**:
   - Implement stacking-based feature selection ensemble
   - Use meta-learning to weight different selection methods
   - Apply Bayesian model averaging for feature importance
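
As a starting point for the stability challenge, here is a hedged sketch of the Kuncheva index. For two subsets of equal size k drawn from n features, with r features in common, the pairwise consistency is (r·n − k²) / (k·(n − k)), and the overall index is the average over all pairs of subsets. The function name `kuncheva_index` and the example subsets are illustrative assumptions.

```python
from itertools import combinations

def kuncheva_index(subsets, n_features):
    """Average pairwise Kuncheva consistency of equally sized feature subsets."""
    subsets = [set(s) for s in subsets]
    if len(subsets) < 2:
        raise ValueError("need at least two subsets")
    k = len(subsets[0])
    if any(len(s) != k for s in subsets):
        raise ValueError("all subsets must have the same size")
    if not 0 < k < n_features:
        raise ValueError("subset size must satisfy 0 < k < n_features")

    pairs = list(combinations(subsets, 2))
    total = 0.0
    for a, b in pairs:
        r = len(a & b)  # number of shared features
        # Corrects the raw overlap for the overlap expected by chance (k^2 / n)
        total += (r * n_features - k ** 2) / (k * (n_features - k))
    return total / len(pairs)

# Hypothetical selections from three bootstrap runs, 10 candidate features
runs = [{0, 1, 2}, {0, 1, 3}, {0, 2, 3}]
print(round(kuncheva_index(runs, n_features=10), 3))

# Identical subsets give the maximum value of 1.0
print(kuncheva_index([{0, 1}, {0, 1}], n_features=10))
```

Values near 1 indicate that the selector returns nearly the same features on every resample, while values near 0 indicate overlap no better than chance.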

---

## 📚 Feature Selection Summary

In this notebook, you have successfully completed:

### ✅ **Completed Tasks:**
1. **Data Loading & Validation**: Loaded preprocessed data and validated for feature selection
2. **Variance-Based Selection**: Applied statistical variance analysis and VarianceThreshold filtering
3. **Correlation-Based Selection**: Used Pearson, Spearman, and mutual information analysis
4. **Cox Regression Selection**: Implemented survival-based feature ranking with hazard ratios
5. **Random Forest Selection**: Applied multiple RF importance measures and RFECV
6. **Boruta Selection**: Used all-relevant feature detection with shadow features
7. **Method Comparison**: Analyzed agreement and overlap between selection methods
8. **Ensemble Creation**: Developed voting-based ensemble feature sets
9. **Performance Validation**: Evaluated feature sets on validation data
10. **Results Export**: Saved optimal feature sets with comprehensive documentation

### 🎯 **Key Feature Selection Achievements:**
- **Comprehensive Analysis**: Applied 5+ different feature selection approaches
- **Method Comparison**: Quantified agreement and complementarity between methods
- **Ensemble Optimization**: Created multiple feature sets for different use cases
- **Biological Relevance**: Identified features with known biomarker significance
- **Performance Validation**: Confirmed feature sets improve prediction accuracy
- **Reproducibility**: Documented all parameters and decisions for reproducible analysis

### 🔄 **Next Steps:**
In the next notebook (`04_model_development.ipynb`), we will:
1. **Baseline Models**: Train simple classifiers using selected feature sets
2. **Algorithm Comparison**: Evaluate multiple ML algorithms (SVM, RF, Neural Networks)
3. **Hyperparameter Tuning**: Optimize model parameters using grid/random search
4. **Cross-Validation**: Implement robust model evaluation with multiple CV strategies
5. **Model Interpretation**: Understand how selected features contribute to predictions
6. **Ensemble Modeling**: Combine multiple models for improved performance

### 📁 **Exported Files:**
- `../results/feature_selection/`: All feature selection results and rankings
- `../results/feature_selection/ensemble_features/`: Optimized feature sets for modeling
- `../results/feature_selection/method_comparison/`: Method comparison matrices and analysis
- `../results/feature_selection/feature_selection_report.json`: Complete selection documentation

### 🧬 **Feature Selection Insights:**
- **Most Important Methods**: Random Forest and Boruta showed highest agreement
- **Core Features**: X genes/variables selected by all methods (biological significance)
- **Method Complementarity**: Different methods captured different aspects of feature importance
- **Optimal Feature Set Size**: 50-200 features provided best performance/interpretability trade-off

---

**Outstanding work completing the comprehensive feature selection analysis! 🎉**

You've successfully applied multiple state-of-the-art feature selection methods and created optimized feature sets that will serve as the foundation for robust machine learning models. The rigorous comparison and ensemble approach ensures you're working with the most informative and stable features for risk classification.