# Machine Learning Course: Data Preprocessing

## Notebook 2: Data Preprocessing, Normalization, and Cleaning

### Learning Objectives
By the end of this notebook, you will be able to:
1. Load and validate data from the exploration phase
2. Implement strategies for handling missing values in genomics data
3. Filter genes based on variance and quality metrics
4. Normalize gene expression data using different methods
5. Scale and encode clinical variables appropriately
6. Identify and handle outlier samples
7. Create balanced train/validation/test splits
8. Export clean, analysis-ready datasets

### Prerequisites
- Completed `01_data_exploration.ipynb`
- Understanding of basic statistics and data quality concepts
- Familiarity with pandas and numpy operations

### Preprocessing Pipeline Overview
This notebook follows a systematic preprocessing pipeline:
1. **Data Loading & Validation** - Load exploration results and verify data integrity
2. **Missing Values Handling** - Implement appropriate strategies for different data types
3. **Gene Filtering** - Remove low-quality and uninformative genes
4. **Outlier Detection** - Identify and handle problematic samples
5. **Normalization** - Apply appropriate scaling to expression and clinical data
6. **Data Splitting** - Create stratified train/validation/test sets
7. **Quality Validation** - Verify preprocessing results
8. **Data Export** - Save clean datasets for modeling

---

## 1. Setup and Imports

Let's start by importing all necessary libraries for data preprocessing and validation.

In [None]:
# 📝 ACTIVITY 1: Import Required Libraries for Data Preprocessing
#
# Your task: Import comprehensive libraries for data preprocessing, normalization, and quality control
#
# TODO: Import core data manipulation libraries:
# 1. pandas (as pd) - for data manipulation and analysis
# 2. numpy (as np) - for numerical operations and array handling
# 3. matplotlib.pyplot (as plt) - for plotting and visualization
# 4. seaborn (as sns) - for statistical visualization
# 5. os - for file system operations
# 6. warnings - to suppress warning messages
#
# TODO: Import preprocessing and machine learning libraries:
# 7. From sklearn.preprocessing import: StandardScaler, MinMaxScaler, RobustScaler
# 8. From sklearn.impute import: SimpleImputer, KNNImputer
# 9. From sklearn.model_selection import: train_test_split, StratifiedShuffleSplit
# 10. From sklearn.feature_selection import: VarianceThreshold
#
# TODO: Import statistical and scientific libraries:
# 11. From scipy.stats import: zscore, pearsonr, spearmanr
# 12. From scipy import stats
# 13. import json - for saving/loading configuration files
#
# TODO: Configure the environment:
# 14. Suppress warnings using warnings.filterwarnings('ignore')
# 15. Set matplotlib plotting style to 'seaborn-v0_8' (or 'default' if seaborn style unavailable)
# 16. Set seaborn color palette to "Set2"
# 17. Set numpy random seed to 42 for reproducibility
# 18. Set pandas display options: pd.set_option('display.max_columns', 20)
#
# TODO: Print library versions and setup confirmation:
# 19. Print versions of key libraries (pandas, numpy, sklearn)
# 20. Print confirmation that preprocessing environment is ready
#
# Expected output: Successfully imported libraries with version information

# Write your code below:
# import pandas as pd
# import numpy as np
# ...

## 2. Data Loading and Validation

Before we can preprocess the data, we need to load it and validate its structure based on our exploration findings.

In [None]:
# 📝 ACTIVITY 2: Load Raw Data and Exploration Results
#
# Your task: Load the datasets and exploration results from the previous notebook
#
# TODO: Set up data paths and configuration:
# 1. Define DATA_PATH = '../data/' for raw data location
# 2. Define RESULTS_PATH = '../results/exploration/' for exploration results
# 3. Define PROCESSED_PATH = '../results/preprocessed/' for output location
# 4. Create the processed results directory using os.makedirs() with exist_ok=True
#
# TODO: Load raw datasets (same as in exploration notebook):
# 5. Load expression data from 'data_mrna_illumina_microarray.txt'
# 6. Load clinical patient data from 'data_clinical_patient.txt'
# 7. Load clinical sample data from 'data_clinical_sample.txt' (if available)
# 8. Use appropriate pd.read_csv() parameters (sep='\t', comment='#', index_col for expression)
# 9. Include proper error handling with try-except blocks
#
# TODO: Load exploration results:
# 10. Load gene statistics from '../results/exploration/gene_statistics.csv'
# 11. Load sample statistics from '../results/exploration/sample_statistics.csv'
# 12. Load clinical summary from '../results/exploration/clinical_summary.json'
# 13. Handle cases where exploration results might not exist
#
# TODO: Validate data consistency:
# 14. Check that data shapes match expectations from exploration
# 15. Verify that gene names and sample IDs are consistent
# 16. Compare current missing value counts with exploration results
# 17. Print validation summary with ✓ for passed checks and ⚠️ for issues
#
# TODO: Display loading summary:
# 18. Print header "DATA LOADING AND VALIDATION SUMMARY"
# 19. Show dimensions of loaded datasets
# 20. Report which exploration results were successfully loaded
# 21. Confirm data is ready for preprocessing
#
# Expected output: Successfully loaded datasets with validation confirmation

# Write your code below:
# DATA_PATH = '../data/'
# RESULTS_PATH = '../results/exploration/'
# ...

In [None]:
# 📝 ACTIVITY 3: Create Data Quality Baseline
#
# Your task: Establish baseline data quality metrics before preprocessing
#
# TODO: Calculate baseline statistics for expression data:
# 1. Print "BASELINE DATA QUALITY METRICS" header with separators
# 2. Calculate and store original data dimensions
# 3. Count missing values: expression_data.isnull().sum().sum()
# 4. Calculate percentage of missing values relative to total data size
# 5. Count genes with zero variance: (expression_data.var(axis=1) == 0).sum()
# 6. Count low-variance genes (variance < 0.01): (expression_data.var(axis=1) < 0.01).sum()
#
# TODO: Analyze sample quality:
# 7. Calculate sample means and standard deviations
# 8. Identify outlier samples using 3-sigma rule:
#    - Find samples where |sample_mean - mean(sample_means)| > 3 * std(sample_means)
#    - Do the same for sample standard deviations
# 9. Count outlier samples and store their IDs
#
# TODO: Assess clinical data quality:
# 10. Count missing values per clinical variable
# 11. Identify variables with >50% missing data
# 12. Count categorical variables vs numerical variables
# 13. Check for duplicate patient records
#
# TODO: Create baseline quality summary:
# 14. Store all baseline metrics in a dictionary: baseline_metrics
# 15. Include: original_genes, original_samples, expression_missing_pct,
#     zero_var_genes, low_var_genes, outlier_samples, clinical_high_missing
# 16. Print formatted summary of all baseline metrics
# 17. Save baseline metrics as JSON for later comparison
#
# TODO: Flag potential issues:
# 18. Create issues_found list to collect quality concerns
# 19. Add warnings for high missing rates, many low-variance genes, etc.
# 20. Print issues found or "No major quality issues detected"
#
# Expected output: Comprehensive baseline quality report with identified issues

# Write your code below:
# print("BASELINE DATA QUALITY METRICS")
# print("="*50)
# ...

## 3. Missing Values Analysis and Handling

Missing values are common in genomics data. Let's implement appropriate strategies for handling them.

In [None]:
# 📝 ACTIVITY 4: Analyze Missing Value Patterns in Detail
#
# Your task: Perform comprehensive analysis of missing value patterns to guide imputation strategy
#
# TODO: Set up missing values analysis:
# 1. Print "DETAILED MISSING VALUES ANALYSIS" header with separators
# 2. Create copies of original data for analysis: expression_missing = expression_data.copy()
#
# TODO: Analyze expression data missing patterns:
# 3. Calculate missing values per gene: missing_per_gene = expression_data.isnull().sum(axis=1)
# 4. Calculate missing values per sample: missing_per_sample = expression_data.isnull().sum(axis=0)
# 5. Find genes with any missing values: genes_with_missing = missing_per_gene[missing_per_gene > 0]
# 6. Find samples with any missing values: samples_with_missing = missing_per_sample[missing_per_sample > 0]
#
# TODO: Display missing value statistics:
# 7. Print total missing values and percentage of entire dataset
# 8. Print number of genes affected and percentage of total genes
# 9. Print number of samples affected and percentage of total samples
# 10. If genes have missing values, show top 10 genes with most missing
# 11. If samples have missing values, show top 10 samples with most missing
#
# TODO: Analyze clinical data missing patterns:
# 12. Calculate missing per variable: clinical_missing = clinical_data.isnull().sum()
# 13. Find variables with missing values: vars_with_missing = clinical_missing[clinical_missing > 0]
# 14. Calculate missing percentages: missing_pct = (vars_with_missing / len(clinical_data)) * 100
# 15. Create summary DataFrame with variable name, missing count, and percentage
# 16. Sort by missing percentage (descending) and display
#
# TODO: Categorize missing value severity:
# 17. Categorize clinical variables by missing percentage:
#     - Low missing: <5%, Moderate: 5-20%, High: 20-50%, Very high: >50%
# 18. Count variables in each category
# 19. Print categorization summary
#
# TODO: Check for missing value patterns:
# 20. Look for systematic missing patterns (e.g., all related variables missing together)
# 21. Check correlation between missing indicators: clinical_data[vars_with_missing.index].isnull().astype(int).corr()
# 22. Identify any non-random missing patterns
#
# Expected output: Comprehensive missing value analysis with categorization and pattern detection

# Write your code below:
# print("DETAILED MISSING VALUES ANALYSIS")
# print("="*50)
# ...

In [None]:
# 📝 ACTIVITY 5: Implement Missing Values Imputation Strategy
#
# Your task: Apply appropriate imputation methods for different types of missing data
#
# TODO: Set up imputation strategy:
# 1. Print "MISSING VALUES IMPUTATION" header with separators
# 2. Create working copies: expression_imputed = expression_data.copy()
# 3. Create clinical_imputed = clinical_data.copy()
#
# TODO: Handle expression data missing values:
# 4. Check if expression data has missing values: if expression_data.isnull().sum().sum() > 0:
# 5. For expression data, use KNN imputation (genes have biological relationships):
#    - Import KNNImputer from sklearn.impute
#    - Create imputer with n_neighbors=5
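#    - Fit and transform the data: imputer.fit_transform(expression_data.T).T
#    - The .T transpose is needed because KNNImputer treats rows as observations and columns as features:
#      after transposing, rows are samples, so each missing gene value is filled from the most similar samples
#    - fit_transform returns a NumPy array, so wrap the result in pd.DataFrame to restore gene and sample labels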
# 6. If no missing values, print "No missing values in expression data"
#
# TODO: Handle clinical data missing values by variable type:
# 7. Separate numerical and categorical clinical variables
# 8. For numerical variables with missing values:
#    - Use median imputation for variables with <20% missing
#    - Use KNN imputation for variables with 20-50% missing
#    - Consider dropping variables with >50% missing (flag for review)
# 9. For categorical variables with missing values:
#    - Use mode (most frequent) imputation for variables with <30% missing
#    - Create "Unknown" category for variables with >30% missing
#    - Use pd.get_dummies() later for encoding
#
# TODO: Implement clinical imputation:
# 10. Create SimpleImputer for numerical: SimpleImputer(strategy='median')
# 11. Create SimpleImputer for categorical: SimpleImputer(strategy='most_frequent')
# 12. Apply imputers separately to numerical and categorical variables
# 13. Combine results back into single DataFrame
#
# TODO: Validate imputation results:
# 14. Check that no missing values remain in either dataset, e.g. assert expression_imputed.isnull().sum().sum() == 0
# 15. Compare distributions before/after imputation for key variables
# 16. Print summary of imputation methods used
#
# TODO: Create imputation summary:
# 17. Document what imputation methods were applied to which variables
# 18. Save imputation log for reproducibility
#
# Expected output: Successfully imputed datasets with validation and method documentation

# Write your code below:
# print("MISSING VALUES IMPUTATION")
# print("="*50)
# ...

## 4. Gene Filtering and Quality Control

Not all genes are informative for machine learning. Let's filter out low-quality and uninformative genes.

In [None]:
# 📝 ACTIVITY 6: Filter Low-Variance and Uninformative Genes
#
# Your task: Remove genes that are not informative for machine learning modeling
#
# TODO: Set up gene filtering analysis:
# 1. Print "GENE FILTERING AND QUALITY CONTROL" header with separators
# 2. Create working copy: expression_filtered = expression_imputed.copy()
# 3. Store original gene count: original_gene_count = len(expression_filtered)
#
# TODO: Calculate gene-level statistics:
# 4. Calculate variance for each gene: gene_variances = expression_filtered.var(axis=1)
# 5. Calculate mean expression for each gene: gene_means = expression_filtered.mean(axis=1)
# 6. Calculate coefficient of variation: gene_cv = expression_filtered.std(axis=1) / gene_means.abs()
# 7. Handle division by zero in CV calculation (it yields inf or NaN): gene_cv.replace([np.inf, -np.inf], np.nan).fillna(0)
#
# TODO: Apply variance filtering:
# 8. Use VarianceThreshold from sklearn.feature_selection
# 9. Set threshold=0.01 (genes with variance < 0.01 will be removed)
# 10. Create variance_selector = VarianceThreshold(threshold=0.01)
# 11. Fit selector on transposed data: variance_selector.fit(expression_filtered.T)
# 12. Get selected gene indices: selected_genes_idx = variance_selector.get_support()
# 13. Filter expression data: expression_filtered = expression_filtered.loc[selected_genes_idx]
#
# TODO: Apply additional quality filters:
# 14. Remove genes with very low mean expression (mean < -3 for log2 data)
# 15. Remove genes with very high mean expression (potential housekeeping genes, mean > 3)
# 16. Filter using boolean indexing on gene_means
# 17. Update expression_filtered with these additional filters
#
# TODO: Apply coefficient of variation filter (optional):
# 18. Keep genes with moderate to high CV (CV > 0.1) - these show meaningful variation
# 19. This removes genes that are too stable; it does not remove noisy genes, which have a high CV
# 20. Apply CV filter to the genes that remain: expression_filtered = expression_filtered[gene_cv.loc[expression_filtered.index] > 0.1]
#
# TODO: Calculate and display filtering results:
# 21. Calculate filtered_gene_count = len(expression_filtered)
# 22. Calculate genes_removed = original_gene_count - filtered_gene_count
# 23. Calculate percentage_removed = (genes_removed / original_gene_count) * 100
# 24. Print summary: original count, final count, removed count and percentage
#
# TODO: Analyze filtering impact:
# 25. Compare variance distribution before/after filtering
# 26. Show range of mean expression values before/after
# 27. Create before/after comparison visualization if needed
#
# Expected output: Filtered expression dataset with quality summary

# Write your code below:
# print("GENE FILTERING AND QUALITY CONTROL")
# print("="*50)
# ...

In [None]:
# 📝 ACTIVITY 7: Outlier Detection and Handling
#
# Your task: Identify and handle outlier samples that might affect model performance
#
# TODO: Set up outlier detection:
# 1. Print "OUTLIER DETECTION AND HANDLING" header with separators
# 2. Create working copy: expression_outlier_checked = expression_filtered.copy()
#
# TODO: Sample-level outlier detection:
# 3. Calculate sample statistics:
#    - sample_means = expression_outlier_checked.mean(axis=0)
#    - sample_stds = expression_outlier_checked.std(axis=0)
#    - sample_medians = expression_outlier_checked.median(axis=0)
# 4. Calculate z-scores for sample means: mean_zscores = np.abs(zscore(sample_means))
# 5. Calculate z-scores for sample stds: std_zscores = np.abs(zscore(sample_stds))
#
# TODO: Identify outlier samples:
# 6. Set outlier threshold: z_threshold = 3.0 (3-sigma rule)
# 7. Find samples with outlier means: outlier_means = sample_means[mean_zscores > z_threshold]
# 8. Find samples with outlier stds: outlier_stds = sample_stds[std_zscores > z_threshold]
# 9. Combine all outlier samples: outlier_samples = set(outlier_means.index) | set(outlier_stds.index)
#
# TODO: Analyze outlier characteristics:
# 10. Print number of outlier samples found
# 11. If outliers exist, print their sample IDs
# 12. For each outlier, print its mean and std z-scores
# 13. Calculate outlier percentage: (len(outlier_samples) / len(sample_means)) * 100
#
# TODO: Gene-level outlier detection:
# 14. Calculate gene statistics for outlier detection:
#     - gene_means_filtered = expression_outlier_checked.mean(axis=1)
#     - gene_stds_filtered = expression_outlier_checked.std(axis=1)
# 15. Find genes with extreme values: genes with mean > 4 or mean < -4 (for log2 data)
# 16. Find genes with extreme variability: genes with std > 3
#
# TODO: Decide on outlier handling strategy:
# 17. For samples: decide whether to remove or cap outlier values
# 18. Conservative approach: flag for investigation but don't remove automatically
# 19. Alternative: remove samples that are outliers in multiple metrics
# 20. Document outlier handling decisions
#
# TODO: Apply outlier handling (if decided):
# 21. If removing outlier samples: expression_outlier_checked = expression_outlier_checked.drop(columns=list(outlier_samples))
# 22. If capping values: use np.clip() to limit extreme values
# 23. Update clinical data accordingly to match remaining samples
#
# TODO: Validate outlier handling:
# 24. Recalculate sample statistics after handling
# 25. Verify that extreme outliers are reduced
# 26. Print final sample count and outlier handling summary
#
# Expected output: Outlier analysis with handling strategy and validation

# Write your code below:
# print("OUTLIER DETECTION AND HANDLING")
# print("="*50)
# ...

## 5. Data Normalization and Scaling

Different normalization methods are appropriate for different types of data. Let's apply the right methods for expression and clinical data.

In [None]:
# 📝 ACTIVITY 8: Normalize Expression Data
#
# Your task: Apply appropriate normalization to gene expression data for machine learning
#
# TODO: Set up normalization analysis:
# 1. Print "EXPRESSION DATA NORMALIZATION" header with separators
# 2. Create working copy: expression_to_normalize = expression_outlier_checked.copy()
#
# TODO: Analyze current data distribution:
# 3. Calculate current statistics: overall_mean, overall_std, data_min, data_max
# 4. Create histogram of all expression values to understand current distribution
# 5. Check if data appears to be already log2-transformed (values typically -10 to +10)
# 6. Print current data characteristics
#
# TODO: Gene-wise (row-wise) Z-score normalization:
# 7. Apply Z-score normalization per gene (across samples):
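#    - expression_gene_zscore = pd.DataFrame(zscore(expression_to_normalize.to_numpy(), axis=1), index=expression_to_normalize.index, columns=expression_to_normalize.columns)
#    - (apply(zscore, axis=1) can return a Series of arrays instead of a DataFrame, because zscore returns a NumPy array)
#    - This makes each gene have mean=0, std=1 across samples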
# 8. Handle any NaN values that result from zero-variance genes
# 9. Calculate statistics after gene-wise normalization
#
# TODO: Sample-wise (column-wise) normalization:
# 10. Apply quantile normalization or Z-score per sample:
#     - For sample Z-score: expression_sample_zscore = expression_to_normalize.apply(zscore, axis=0)
#     - This makes each sample have mean=0, std=1 across genes
# 11. Alternative: Use RobustScaler for sample-wise scaling (less sensitive to outliers)
# 12. Compare different normalization approaches
#
# TODO: Choose and apply final normalization method:
# 13. For machine learning, gene-wise Z-score is often preferred:
#     - It makes genes comparable in their variation patterns
#     - Removes gene-specific baseline differences
# 14. Apply chosen method: expression_normalized = expression_gene_zscore.copy()
# 15. Ensure no NaN values remain: expression_normalized.fillna(0, inplace=True)
#
# TODO: Validate normalization results:
# 16. Check that genes now have mean ≈ 0 and std ≈ 1 (within tolerance)
# 17. Verify that sample relationships are preserved
# 18. Create before/after distribution comparison plots
# 19. Calculate normalization quality metrics
#
# TODO: Create normalization summary:
# 20. Print normalization method used
# 21. Show before/after statistics comparison
# 22. Document any issues encountered and how they were handled
#
# Expected output: Normalized expression data with validation metrics and quality assessment

# Write your code below:
# print("EXPRESSION DATA NORMALIZATION")
# print("="*50)
# ...

In [None]:
# 📝 ACTIVITY 9: Scale and Encode Clinical Variables
#
# Your task: Prepare clinical variables for machine learning by appropriate scaling and encoding
#
# TODO: Set up clinical data preprocessing:
# 1. Print "CLINICAL DATA SCALING AND ENCODING" header with separators
# 2. Create working copy: clinical_to_process = clinical_imputed.copy()
# 3. Ensure clinical data matches remaining samples after any outlier removal
#
# TODO: Separate variable types:
# 4. Get numerical variables: numerical_vars = clinical_to_process.select_dtypes(include=[np.number]).columns.tolist()
# 5. Get categorical variables: categorical_vars = clinical_to_process.select_dtypes(include=['object', 'category']).columns.tolist()
# 6. Print counts of each variable type
#
# TODO: Scale numerical variables:
# 7. For most ML algorithms, use StandardScaler (Z-score normalization):
#    - scaler = StandardScaler()
#    - clinical_numerical_scaled = scaler.fit_transform(clinical_to_process[numerical_vars])
#    - Convert back to DataFrame: pd.DataFrame(clinical_numerical_scaled, columns=numerical_vars, index=clinical_to_process.index)
# 8. Alternative: Use RobustScaler if outliers are a concern
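# 9. Store scaler object for later use; to avoid leakage, the final scaler should be fit on training samples only and then applied to validation/test data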
#
# TODO: Encode categorical variables:
# 10. For binary variables (like ER_IHC: Positive/Negative):
#     - Use LabelEncoder or simple mapping: {'Positive': 1, 'Negative': 0}
# 11. For multi-class variables (like CLAUDIN_SUBTYPE):
#     - Use pd.get_dummies() for one-hot encoding
#     - Set drop_first=True to avoid multicollinearity
# 12. Handle any 'Unknown' or missing categories appropriately
#
# TODO: Create encoded categorical dataset:
# 13. Apply encoding to each categorical variable
# 14. For one-hot encoded variables, use prefix to identify original variable
# 15. Combine all encoded variables: clinical_categorical_encoded
#
# TODO: Combine scaled numerical and encoded categorical data:
# 16. Concatenate along columns: clinical_processed = pd.concat([clinical_numerical_scaled, clinical_categorical_encoded], axis=1)
# 17. Ensure index alignment with expression data
# 18. Handle any column name conflicts
#
# TODO: Validate clinical preprocessing:
# 19. Check for any remaining missing values
# 20. Verify that all variables are now numerical
# 21. Check data types and value ranges
# 22. Print summary of transformations applied
#
# TODO: Create preprocessing summary:
# 23. Document which variables were scaled vs encoded
# 24. Save preprocessing objects (scalers, encoders) for later use
# 25. Print final clinical data shape and characteristics
#
# Expected output: Processed clinical data ready for machine learning with transformation summary

# Write your code below:
# print("CLINICAL DATA SCALING AND ENCODING")
# print("="*50)
# ...

## 6. Data Splitting Strategy

Creating proper train/validation/test splits is crucial for unbiased model evaluation. Let's implement stratified splitting.

In [None]:
# 📝 ACTIVITY 10: Prepare Target Variable and Stratification Strategy
#
# Your task: Define the target variable for risk classification and prepare stratification strategy
#
# TODO: Set up target variable definition:
# 1. Print "TARGET VARIABLE AND STRATIFICATION SETUP" header with separators
# 2. Identify potential target variables from clinical data (e.g., survival, ER status, molecular subtype)
#
# TODO: Create risk classification target:
# 3. For survival-based risk classification, create binary target:
#    - If OS_MONTHS exists: high_risk = (OS_MONTHS < median_survival) & (VITAL_STATUS == 'DECEASED')
#    - Alternative: use established risk groups based on molecular subtypes
# 4. If survival data unavailable, use ER_IHC status as proxy: target = (ER_IHC == 'Negative')
# 5. Ensure target variable has reasonable class balance (30-70% split is good)
#
# TODO: Analyze target variable distribution:
# 6. Calculate class counts: target.value_counts()
# 7. Calculate class percentages: target.value_counts(normalize=True) * 100
# 8. Print target variable distribution
# 9. Check if classes are severely imbalanced (>90% in one class)
#
# TODO: Identify stratification variables:
# 10. Besides target variable, consider stratifying by:
#     - Molecular subtype (CLAUDIN_SUBTYPE) if available
#     - Age groups (create age bins: young <50, middle 50-65, old >65)
#     - Tumor stage/grade if available
# 11. Create composite stratification variable combining target + important clinical factor
#
# TODO: Handle class imbalance if present:
# 12. If imbalance is detected (>80% in one class):
#     - Consider redefining target variable
#     - Plan for stratified sampling
#     - Note need for balanced evaluation metrics
# 13. Document class distribution issues
#
# TODO: Prepare data for splitting:
# 14. Ensure expression_normalized and clinical_processed have matching sample indices
# 15. Create combined feature matrix if needed: X = pd.concat([expression_normalized.T, clinical_processed], axis=1)
# 16. Define target vector: y = target variable aligned with X
# 17. Remove any samples where target is undefined/missing
#
# TODO: Validate data alignment:
# 18. Check X.index.equals(y.index) (all samples match, in the same order)
# 19. Print final dataset dimensions
# 20. Confirm no missing values in features or target
#
# Expected output: Defined target variable with class distribution and aligned feature matrix

# Write your code below:
# print("TARGET VARIABLE AND STRATIFICATION SETUP")
# print("="*50)
# ...

In [None]:
# 📝 ACTIVITY 11: Create Stratified Train/Validation/Test Splits
#
# Your task: Create balanced and representative data splits for model development and evaluation
#
# TODO: Set up splitting configuration:
# 1. Print "STRATIFIED DATA SPLITTING" header with separators
# 2. Define split ratios: train=60%, validation=20%, test=20%
# 3. Set random_state=42 for reproducibility
#
# TODO: Create initial train/temp split:
# 4. Use train_test_split from sklearn.model_selection
# 5. Split into train (60%) and temp (40%): X_train, X_temp, y_train, y_temp
# 6. Use stratify=y to maintain class balance
# 7. Set test_size=0.4, random_state=42
#
# TODO: Split temp into validation and test:
# 8. Split X_temp, y_temp into validation (50% of temp = 20% overall) and test (50% of temp = 20% overall)
# 9. Use train_test_split again: X_val, X_test, y_val, y_test
# 10. Use stratify=y_temp, test_size=0.5, random_state=42
#
# TODO: Validate split quality:
# 11. Check final split sizes:
#     - print(f"Train: {len(X_train)} samples ({len(X_train)/len(X)*100:.1f}%)")
#     - print(f"Validation: {len(X_val)} samples ({len(X_val)/len(X)*100:.1f}%)")
#     - print(f"Test: {len(X_test)} samples ({len(X_test)/len(X)*100:.1f}%)")
#
# TODO: Verify stratification worked:
# 12. Check class distribution in each split:
#     - train_dist = y_train.value_counts(normalize=True)
#     - val_dist = y_val.value_counts(normalize=True)
#     - test_dist = y_test.value_counts(normalize=True)
# 13. Create comparison DataFrame showing class percentages across splits
# 14. Ensure distributions are similar (within 5 percentage points of each other)
#
# TODO: Split expression and clinical data separately (if combined):
# 15. If X contains both expression and clinical data:
#     - Determine column indices for expression vs clinical features
#     - Create separate matrices: X_train_expr, X_train_clin, etc.
# 16. Maintain index alignment across all data splits
#
# TODO: Create split summary:
# 17. Create split_info dictionary with:
#     - split_sizes, class_distributions, feature_counts
# 18. Print comprehensive splitting summary
# 19. Document any stratification issues encountered
#
# TODO: Validate no data leakage:
# 20. Confirm no overlap between train/val/test sample indices
# 21. Check: len(set(X_train.index) & set(X_val.index)) == 0
# 22. Check: len(set(X_train.index) & set(X_test.index)) == 0
# 23. Check: len(set(X_val.index) & set(X_test.index)) == 0
#
# Expected output: Balanced train/validation/test splits with distribution verification

# Write your code below:
# print("STRATIFIED DATA SPLITTING")
# print("="*50)
# ...

## 💡 Coding Hints and Templates

Need help getting started? Here are some code templates and hints for preprocessing operations:

### 📋 **Template: Missing Values Imputation**
```python
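# KNN Imputation for expression data
import pandas as pd
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5)
# KNNImputer treats rows as observations, so transpose to samples x genes,
# impute from the most similar samples, then transpose back to genes x samples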
expression_imputed = pd.DataFrame(
    imputer.fit_transform(expression_data.T).T,
    index=expression_data.index,
    columns=expression_data.columns
)
```
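
### 📋 **Template: Clinical Imputation by Variable Type**

The clinical side of the imputation strategy (median for numerical variables, most frequent value for categorical ones) can be sketched as follows. This is a minimal illustration under stated assumptions: `clinical_toy` and its values are made up, and the column names are only placeholders for whatever your clinical table contains.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy clinical table; names and values are illustrative only
clinical_toy = pd.DataFrame(
    {
        "AGE": [45.0, np.nan, 61.0, 70.0],
        "TUMOR_SIZE": [22.0, 30.0, np.nan, 18.0],
        "ER_IHC": ["Positive", "Positive", np.nan, "Negative"],
    },
    index=["P1", "P2", "P3", "P4"],
)
# Keep plain object dtype so the imputer sees np.nan as the missing marker
clinical_toy["ER_IHC"] = clinical_toy["ER_IHC"].astype(object)

# Separate variable types
numerical_vars = clinical_toy.select_dtypes(include=[np.number]).columns.tolist()
categorical_vars = clinical_toy.select_dtypes(exclude=[np.number]).columns.tolist()

# One imputer per variable type
num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="most_frequent")

# Impute each group and write the results back into a copy of the table
clinical_imputed = clinical_toy.copy()
clinical_imputed[numerical_vars] = num_imputer.fit_transform(clinical_toy[numerical_vars])
clinical_imputed[categorical_vars] = cat_imputer.fit_transform(clinical_toy[categorical_vars])

# No missing values should remain
assert clinical_imputed.isnull().sum().sum() == 0
print(clinical_imputed)
```

Variables with a high missing rate (the activities suggest flagging anything above 50%) are probably better reviewed by hand than imputed this way.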

### 📋 **Template: Variance Filtering**
```python
# Remove low-variance genes
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.01)
# VarianceThreshold filters columns, so fit on the samples x genes view
selector.fit(expression_data.T)
selected_mask = selector.get_support()  # boolean mask, one entry per gene
expression_filtered = expression_data.loc[selected_mask]
```
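
### 📋 **Template: Coefficient of Variation Filter**

The optional CV filter from Activity 6 needs some care, because dividing by a mean of zero yields `inf` or `NaN`. The sketch below is one possible way to handle it, using a made-up `expression_toy` matrix. It assumes that an undefined CV should count as failing the filter, which is a design choice and not the only reasonable one.

```python
import numpy as np
import pandas as pd

# Hypothetical genes x samples matrix (rows = genes, columns = samples)
expression_toy = pd.DataFrame(
    {
        "S1": [5.0, 2.0, 0.0, 1.0],
        "S2": [5.0, 4.0, 0.0, -1.0],
        "S3": [5.0, 6.0, 0.0, 1.0],
        "S4": [5.0, 8.0, 0.0, -1.0],
    },
    index=["GENE_FLAT", "GENE_VARIABLE", "GENE_ZERO", "GENE_CENTERED"],
)

gene_means = expression_toy.mean(axis=1)
gene_stds = expression_toy.std(axis=1)

# std / |mean| is inf when the mean is 0 and std > 0, and NaN when both are 0
gene_cv = (gene_stds / gene_means.abs()).replace([np.inf, -np.inf], np.nan)

# Assumption: an undefined CV is treated as 0, so the gene fails the filter
keep = gene_cv.fillna(0) > 0.1
expression_cv_filtered = expression_toy.loc[keep]

print(gene_cv)
print(f"Kept {keep.sum()} of {len(keep)} genes")
```

Note that `GENE_CENTERED` clearly varies but is still dropped, because its mean is zero. On data centered near zero (such as some log-ratio or already standardized values) the CV is unreliable, so a variance-based filter may be the safer choice there.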

### 📋 **Template: Z-score Normalization**
```python
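# Gene-wise Z-score normalization
import pandas as pd
from scipy.stats import zscore
# zscore returns a NumPy array; rebuild the DataFrame to keep labels
expression_normalized = pd.DataFrame(
    zscore(expression_data.to_numpy(), axis=1),
    index=expression_data.index, columns=expression_data.columns
)
# Handle any NaN values from zero-variance genes
expression_normalized = expression_normalized.fillna(0)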
```
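
### 📋 **Template: Validating Gene-wise Normalization**

Gene-wise Z-scoring can also be written with plain pandas arithmetic, which keeps the row and column labels and makes the zero-variance case explicit. The following is a small sketch on a made-up `expression_toy` matrix. It uses `ddof=0` (population standard deviation) so that the result matches the default of `scipy.stats.zscore`.

```python
import numpy as np
import pandas as pd

# Hypothetical genes x samples matrix; GENE_CONST has zero variance
expression_toy = pd.DataFrame(
    {"S1": [1.0, 7.0, 3.0], "S2": [2.0, 7.0, 6.0], "S3": [3.0, 7.0, 9.0]},
    index=["GENE_A", "GENE_CONST", "GENE_B"],
)

gene_means = expression_toy.mean(axis=1)
gene_stds = expression_toy.std(axis=1, ddof=0)  # population std, as in scipy's zscore

# Subtract each gene's mean and divide by its std, row by row
expression_z = expression_toy.sub(gene_means, axis=0).div(gene_stds, axis=0)

# Zero-variance genes give 0/0 = NaN; set them to 0 (no deviation from the mean)
expression_z = expression_z.fillna(0)

# Validation: every gene has mean 0; every non-constant gene has std 1
row_means = expression_z.mean(axis=1)
row_stds = expression_z.std(axis=1, ddof=0)
non_constant = gene_stds > 0

print(np.allclose(row_means, 0))
print(np.allclose(row_stds[non_constant], 1))
```

Constant genes end up as rows of zeros after `fillna(0)`, so they carry no information for a model and could just as well be dropped during gene filtering.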

### 📋 **Template: Clinical Data Scaling**
```python
# Scale numerical variables
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
numerical_scaled = pd.DataFrame(
    scaler.fit_transform(clinical_data[numerical_vars]),
    columns=numerical_vars,
    index=clinical_data.index
)
```
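
### 📋 **Template: Reusing a Fitted Scaler**

The template above fits the scaler on every row. If the scaler is meant to be reused on validation or test data, one option is to fit it on the training rows only and then apply the stored parameters elsewhere. The sketch below illustrates the idea; the table, the patient IDs, and the train/test assignment are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical numerical clinical variables
clinical_toy = pd.DataFrame(
    {
        "AGE": [40.0, 50.0, 60.0, 70.0, 80.0],
        "TUMOR_SIZE": [10.0, 20.0, 30.0, 40.0, 50.0],
    },
    index=["P1", "P2", "P3", "P4", "P5"],
)
# Hypothetical split assignment, for illustration only
train_ids = ["P1", "P2", "P3", "P4"]
test_ids = ["P5"]

scaler = StandardScaler()

# Learn mean and std from the training rows only
train_scaled = pd.DataFrame(
    scaler.fit_transform(clinical_toy.loc[train_ids]),
    columns=clinical_toy.columns,
    index=train_ids,
)

# Apply the training parameters to the held-out rows without refitting
test_scaled = pd.DataFrame(
    scaler.transform(clinical_toy.loc[test_ids]),
    columns=clinical_toy.columns,
    index=test_ids,
)

print(scaler.mean_)   # per-column means learned from the training rows
print(test_scaled)
```

Held-out values are not expected to have mean 0 and standard deviation 1 after this transformation, since the parameters come from the training rows.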

### 📋 **Template: One-Hot Encoding**
```python
# Encode categorical variables
categorical_encoded = pd.get_dummies(
    clinical_data[categorical_vars], 
    drop_first=True, 
    prefix=categorical_vars
)
```
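
### 📋 **Template: Combining Binary and One-Hot Encoding**

Activity 9 treats binary and multi-class variables differently. A minimal sketch of that combination is shown below; the category labels (`LumA`, `LumB`, `Basal`) and the toy table are assumptions made for illustration, so check the actual values in your data first.

```python
import pandas as pd

# Hypothetical categorical clinical variables
clinical_toy = pd.DataFrame(
    {
        "ER_IHC": ["Positive", "Negative", "Positive", "Negative"],
        "CLAUDIN_SUBTYPE": ["LumA", "Basal", "LumB", "LumA"],
    },
    index=["P1", "P2", "P3", "P4"],
)

# Binary variable: explicit mapping (labels missing from the mapping become NaN)
er_binary = clinical_toy["ER_IHC"].map({"Positive": 1, "Negative": 0})

# Multi-class variable: one-hot encoding, dropping the first level as reference
subtype_dummies = pd.get_dummies(
    clinical_toy["CLAUDIN_SUBTYPE"],
    prefix="CLAUDIN_SUBTYPE",
    drop_first=True,
    dtype=int,
)

# Combine into a single all-numerical table
clinical_categorical_encoded = pd.concat([er_binary, subtype_dummies], axis=1)
print(clinical_categorical_encoded)
```

With `drop_first=True` the alphabetically first level (`Basal` in this toy data) becomes the reference category, so a row of zeros in the dummy columns means "Basal".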

### 📋 **Template: Stratified Splitting**
```python
# Create stratified train/validation/test splits
from sklearn.model_selection import train_test_split

# First split: train vs temp (60% vs 40%)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42
)

# Second split: validation vs test (20% vs 20%)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42
)
```
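
### 📋 **Template: Verifying Splits**

After splitting, Activity 11 asks for a check of split sizes, class balance, and sample overlap. The sketch below runs those checks on synthetic data (100 made-up samples with a 30% positive class); the feature names and sample IDs are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 3 features, 30% positive class
rng = np.random.default_rng(42)
sample_ids = [f"S{i}" for i in range(100)]
X = pd.DataFrame(rng.normal(size=(100, 3)), index=sample_ids, columns=["f1", "f2", "f3"])
y = pd.Series([1] * 30 + [0] * 70, index=sample_ids, name="target")

# 60/20/20 stratified split in two steps
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42
)

# Class percentages per split, one column per split
class_dist = pd.DataFrame(
    {
        "train": y_train.value_counts(normalize=True),
        "val": y_val.value_counts(normalize=True),
        "test": y_test.value_counts(normalize=True),
    }
) * 100

# Largest difference between splits for any class, in percentage points
max_gap = (class_dist.max(axis=1) - class_dist.min(axis=1)).max()

# Leakage check: no sample ID may appear in more than one split
train_ids, val_ids, test_ids = set(X_train.index), set(X_val.index), set(X_test.index)
no_overlap = not (train_ids & val_ids or train_ids & test_ids or val_ids & test_ids)

print(class_dist.round(1))
print(f"Max class gap: {max_gap:.1f} percentage points, no overlap: {no_overlap}")
```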

### 📋 **Template: Outlier Detection**
```python
# Sample-wise outlier detection using Z-scores
import numpy as np
from scipy.stats import zscore
sample_means = expression_data.mean(axis=0)
mean_zscores = np.abs(zscore(sample_means))
outlier_samples = sample_means[mean_zscores > 3].index
print(f"Found {len(outlier_samples)} outlier samples")
```
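
### 📋 **Template: Combining Outlier Criteria**

Activity 7 flags a sample if either its mean or its standard deviation is atypical, while the template above covers only the means. A possible way to combine both criteria is sketched below on synthetic data, where one hypothetical sample (`S7`) is shifted on purpose so that it gets flagged.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Hypothetical genes x samples matrix of standard-normal values
rng = np.random.default_rng(0)
expression_toy = pd.DataFrame(
    rng.normal(loc=0.0, scale=1.0, size=(200, 30)),
    index=[f"GENE_{i}" for i in range(200)],
    columns=[f"S{i}" for i in range(30)],
)
# Shift one sample so that its mean is clearly atypical
expression_toy["S7"] += 5.0

sample_means = expression_toy.mean(axis=0)
sample_stds = expression_toy.std(axis=0)

# Absolute z-scores, re-wrapped as Series so that sample IDs are kept
mean_z = pd.Series(np.abs(zscore(sample_means)), index=sample_means.index)
std_z = pd.Series(np.abs(zscore(sample_stds)), index=sample_stds.index)

# A sample is flagged if either statistic exceeds the threshold
z_threshold = 3.0
outlier_samples = sorted(
    set(mean_z[mean_z > z_threshold].index) | set(std_z[std_z > z_threshold].index)
)

# Conservative handling: report first, remove only after review
for sample_id in outlier_samples:
    print(f"{sample_id}: mean z={mean_z[sample_id]:.2f}, std z={std_z[sample_id]:.2f}")

expression_checked = expression_toy.drop(columns=outlier_samples)
```

If samples are removed, the clinical table needs the same rows dropped so that both datasets stay aligned.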

### 🔍 **Key Preprocessing Decisions**
- **Missing Values**: KNN for expression (biological relationships), median/mode for clinical
- **Gene Filtering**: Remove variance < 0.01, extreme mean values
- **Normalization**: Gene-wise Z-score for expression, StandardScaler for clinical
- **Encoding**: One-hot for multi-class, binary encoding for binary variables
- **Splitting**: Stratified 60/20/20 with class balance preservation

### 🔍 **Quality Checks to Remember**
- Verify no missing values remain after imputation
- Check that normalization resulted in mean≈0, std≈1
- Ensure class balance is maintained across splits
- Confirm no data leakage between train/val/test
- Validate that feature distributions look reasonable
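
Several of these checks can be bundled into one function and run before export. The helper below, `run_quality_checks`, is a hypothetical sketch and not part of the course code. It assumes that expression data is genes x samples, that clinical data is indexed by sample ID, and that gene-wise Z-scoring was done with a population standard deviation (`ddof=0`).

```python
import numpy as np
import pandas as pd

def run_quality_checks(expression_normalized, clinical_processed, tol=1e-6):
    """Return a dict of named boolean checks (hypothetical helper)."""
    # Constant genes become all-zero rows after normalization; skip them for the std check
    varying = expression_normalized.std(axis=1, ddof=0) > 0
    z = expression_normalized.loc[varying]
    return {
        "no_missing_expression": not expression_normalized.isnull().values.any(),
        "no_missing_clinical": not clinical_processed.isnull().values.any(),
        "gene_means_near_zero": bool(np.allclose(z.mean(axis=1), 0, atol=tol)),
        "gene_stds_near_one": bool(np.allclose(z.std(axis=1, ddof=0), 1, atol=tol)),
        "samples_aligned": expression_normalized.columns.equals(clinical_processed.index),
    }

# Toy usage with made-up data
raw = pd.DataFrame(
    {"S1": [1.0, 7.0], "S2": [2.0, 7.0], "S3": [3.0, 7.0]},
    index=["GENE_A", "GENE_CONST"],
)
expr = raw.sub(raw.mean(axis=1), axis=0).div(raw.std(axis=1, ddof=0), axis=0).fillna(0)
clin = pd.DataFrame({"AGE_SCALED": [-1.0, 0.0, 1.0]}, index=["S1", "S2", "S3"])

checks = run_quality_checks(expr, clin)
for name, passed in checks.items():
    print(f"{'✓' if passed else '⚠️'} {name}")
```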

## 🎯 Learning Assessment

### ✅ **Self-Check Questions**

After completing the preprocessing activities, you should be able to answer:

1. **Missing Values Handling**
   - What imputation method is most appropriate for gene expression data and why?
   - How do you handle categorical variables with >50% missing values?
   - What are the risks of using mean imputation vs KNN imputation?
   - How do you verify that imputation preserved data distributions?

2. **Gene Filtering and Quality Control**
   - Why do we remove low-variance genes before machine learning?
   - What expression value ranges indicate potential data quality issues?
   - How do you identify outlier samples without removing important biological variation?
   - What percentage of genes is typically filtered in transcriptomics studies?

3. **Normalization and Scaling**
   - What's the difference between gene-wise and sample-wise normalization?
   - When should you use StandardScaler vs RobustScaler vs MinMaxScaler?
   - Why is Z-score normalization often preferred for gene expression data?
   - How do you handle zero-variance genes during normalization?

4. **Data Splitting Strategy**
   - Why is stratification important when creating train/validation/test splits?
   - How do you ensure no data leakage between splits?
   - What's the impact of class imbalance on model evaluation?
   - Why use 60/20/20 splits instead of 80/20 or other ratios?

5. **Clinical Data Preprocessing**
   - How do you decide between one-hot encoding vs label encoding?
   - What is the multicollinearity issue with one-hot encoding, and how do you address it?
   - How do you handle ordinal vs nominal categorical variables differently?
   - When should you create interaction features between clinical variables?

### 🏆 **Success Criteria**

You have successfully completed this preprocessing notebook if you can:
- ✅ Load and validate data from the exploration phase
- ✅ Apply appropriate imputation strategies for different data types
- ✅ Filter genes based on variance and quality metrics
- ✅ Normalize expression data using gene-wise Z-score method
- ✅ Scale and encode clinical variables appropriately
- ✅ Create stratified train/validation/test splits with balanced classes
- ✅ Validate preprocessing results and identify potential issues
- ✅ Export clean, analysis-ready datasets

### 🚀 **Extension Challenges** (Optional)

For advanced students:
1. **Advanced Imputation**:
   - Implement model-based imputation (e.g., scikit-learn's `IterativeImputer`, or a matrix factorization approach)
   - Compare multiple imputation methods and evaluate their impact
   - Use biological pathway information to guide gene imputation

2. **Sophisticated Normalization**:
   - Implement quantile normalization for cross-sample comparability
   - Apply ComBat for batch effect correction
   - Compare different normalization methods on downstream model performance

3. **Feature Engineering**:
   - Create gene ratio features (e.g., ER/PR ratios)
   - Generate pathway-level summary features
   - Create interaction terms between clinical and expression features

4. **Advanced Quality Control**:
   - Implement statistical tests for outlier detection
   - Use PCA for quality assessment and batch effect detection
   - Apply clustering to identify sample subgroups

5. **Preprocessing Pipeline**:
   - Create automated preprocessing pipeline using sklearn Pipeline
   - Implement cross-validation-aware preprocessing
   - Build preprocessing parameter tuning framework
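
As a starting point for the preprocessing pipeline challenge, the sketch below wraps imputation, scaling, and encoding in an sklearn `Pipeline` so that every step is re-fitted inside each cross-validation fold. The column names (`age`, `tumor_size`, `stage`), the synthetic data, and the choice of logistic regression are hypothetical and stand in for whatever your clinical table and model actually are.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 120

# Hypothetical clinical table with a few missing ages.
clinical = pd.DataFrame({
    "age": rng.normal(60, 10, n),
    "tumor_size": rng.normal(25, 8, n),
    "stage": rng.choice(["I", "II", "III"], n),
})
clinical.loc[::15, "age"] = np.nan
y = np.array([0, 1] * (n // 2))  # synthetic, perfectly balanced target

# Numerical columns: median imputation followed by standardization.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: mode imputation followed by one-hot encoding.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric, ["age", "tumor_size"]),
    ("cat", categorical, ["stage"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Each fold fits the imputers, scaler, and encoder on its own training part only.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, clinical, y, cv=cv)
print(f"Mean CV accuracy: {scores.mean():.2f}")
```

Because the target here is random, the accuracy itself is meaningless; the point is the structure. Keeping preprocessing inside the pipeline means `cross_val_score` never lets held-out fold data influence the fitted imputation or scaling parameters.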

## 7. Export Preprocessed Data

Let's save our preprocessed datasets and create comprehensive documentation for the next modeling phase.

In [None]:
# 📝 ACTIVITY 12: Export Preprocessed Datasets and Create Documentation
#
# Your task: Save all preprocessed data and create comprehensive documentation for modeling phase
#
# TODO: Set up export directory structure:
# 1. Create directories for organized data export:
#    - os.makedirs(PROCESSED_PATH + 'data/', exist_ok=True)
#    - os.makedirs(PROCESSED_PATH + 'splits/', exist_ok=True)
#    - os.makedirs(PROCESSED_PATH + 'preprocessing_objects/', exist_ok=True)
# 2. Print "EXPORTING PREPROCESSED DATA" header with separators
#
# TODO: Export preprocessed datasets:
# 3. Save normalized expression data:
#    - expression_normalized.to_csv(PROCESSED_PATH + 'data/expression_normalized.csv')
# 4. Save processed clinical data:
#    - clinical_processed.to_csv(PROCESSED_PATH + 'data/clinical_processed.csv')
# 5. Save target variable:
#    - pd.DataFrame({'target': y}).to_csv(PROCESSED_PATH + 'data/target_variable.csv')
#
# TODO: Export train/validation/test splits:
# 6. Save all split datasets:
#    - X_train.to_csv(PROCESSED_PATH + 'splits/X_train.csv')
#    - X_val.to_csv(PROCESSED_PATH + 'splits/X_val.csv')
#    - X_test.to_csv(PROCESSED_PATH + 'splits/X_test.csv')
#    - pd.DataFrame({'target': y_train}).to_csv(PROCESSED_PATH + 'splits/y_train.csv')
#    - pd.DataFrame({'target': y_val}).to_csv(PROCESSED_PATH + 'splits/y_val.csv')
#    - pd.DataFrame({'target': y_test}).to_csv(PROCESSED_PATH + 'splits/y_test.csv')
#
# TODO: Save preprocessing objects (for applying to new data):
# 7. Use joblib to save sklearn objects:
#    - import joblib
#    - joblib.dump(scaler, PROCESSED_PATH + 'preprocessing_objects/clinical_scaler.pkl')
#    - Save any other preprocessing objects (imputers, encoders, etc.)
#
# TODO: Create comprehensive preprocessing summary:
# 8. Create preprocessing_summary dictionary with:
#    - original_dimensions: original data shapes
#    - final_dimensions: final processed data shapes
#    - preprocessing_steps: list of all steps applied
#    - gene_filtering_summary: genes removed and criteria
#    - missing_values_handling: methods used for each data type
#    - normalization_methods: scaling applied to each dataset
#    - split_information: sample counts and class distributions
#    - quality_metrics: before/after quality comparisons
#
# TODO: Export preprocessing documentation:
# 9. Save preprocessing summary as JSON (requires `import json`):
#    - with open(PROCESSED_PATH + 'preprocessing_summary.json', 'w') as f:
#          json.dump(preprocessing_summary, f, indent=2, default=str)
#
# TODO: Create data dictionary:
# 10. Create feature_descriptions DataFrame with:
#     - feature_name, feature_type, description, preprocessing_applied
# 11. Include information about expression genes and clinical variables
# 12. Save as CSV: PROCESSED_PATH + 'feature_dictionary.csv'
#
# TODO: Generate processing report:
# 13. Create human-readable processing report:
#     - Data dimensions before/after each step
#     - Number of samples and features in final datasets
#     - Class balance in train/validation/test splits
#     - Quality metrics and any issues encountered
# 14. Save as text file: PROCESSED_PATH + 'preprocessing_report.txt'
#
# TODO: Print export confirmation:
# 15. List all files created with their purposes
# 16. Print data ready for modeling confirmation
# 17. Provide guidance on next steps (modeling notebook)
#
# Expected output: Complete set of preprocessed data files with comprehensive documentation

# Write your code below:
# import joblib
# print("EXPORTING PREPROCESSED DATA")
# print("="*50)
# ...
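
Before writing the full export, it can help to confirm that saved files reload correctly. The following is a minimal round-trip sketch, not the activity solution: it uses a temporary directory in place of `PROCESSED_PATH`, and the tiny `X_train` and `scaler` are hypothetical stand-ins for the objects created earlier in the notebook.

```python
import json
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-ins for objects created earlier in the notebook.
X_train = pd.DataFrame(
    np.arange(12, dtype=float).reshape(4, 3),
    columns=["f1", "f2", "f3"],
    index=["s0", "s1", "s2", "s3"],
)
scaler = StandardScaler().fit(X_train)

# A temporary directory stands in for PROCESSED_PATH in this sketch.
export_dir = tempfile.mkdtemp()
splits_dir = os.path.join(export_dir, "splits")
objects_dir = os.path.join(export_dir, "preprocessing_objects")
os.makedirs(splits_dir, exist_ok=True)
os.makedirs(objects_dir, exist_ok=True)

# Export a split, a fitted preprocessing object, and a small summary.
X_train.to_csv(os.path.join(splits_dir, "X_train.csv"))
joblib.dump(scaler, os.path.join(objects_dir, "clinical_scaler.pkl"))

summary = {
    "final_dimensions": {"X_train": X_train.shape},
    "normalization_methods": {"clinical": "StandardScaler"},
}
with open(os.path.join(export_dir, "preprocessing_summary.json"), "w") as f:
    json.dump(summary, f, indent=2, default=str)

# Reload and verify that nothing changed on the way to disk and back.
X_reloaded = pd.read_csv(os.path.join(splits_dir, "X_train.csv"), index_col=0)
scaler_reloaded = joblib.load(os.path.join(objects_dir, "clinical_scaler.pkl"))

assert X_reloaded.shape == X_train.shape
assert np.allclose(scaler_reloaded.transform(X_reloaded), scaler.transform(X_train))
print("Round trip OK")
```

Passing `index_col=0` when reading restores the sample identifiers that `to_csv` wrote as the first column; without it, the reloaded frame gains an extra unnamed column.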

---

## 📚 Preprocessing Summary

In this notebook, you have successfully completed:

### ✅ **Completed Tasks:**
1. **Data Loading & Validation**: Loaded raw data and exploration results with integrity checks
2. **Missing Values Analysis**: Identified patterns and applied appropriate imputation strategies
3. **Gene Filtering**: Removed low-variance and uninformative genes based on quality metrics
4. **Outlier Detection**: Identified and handled extreme samples using statistical methods
5. **Expression Normalization**: Applied gene-wise Z-score normalization for ML compatibility
6. **Clinical Data Processing**: Scaled numerical and encoded categorical variables appropriately
7. **Target Variable Definition**: Created binary risk classification target with proper stratification
8. **Data Splitting**: Generated balanced 60/20/20 train/validation/test splits
9. **Quality Validation**: Verified preprocessing results and data integrity
10. **Data Export**: Saved clean datasets with comprehensive documentation

### 🎯 **Key Preprocessing Achievements:**
- **Data Quality**: Eliminated missing values and filtered uninformative features
- **Standardization**: Normalized all features for consistent ML algorithm performance
- **Balance**: Maintained class balance across all data splits
- **Documentation**: Created comprehensive preprocessing logs for reproducibility
- **Validation**: Verified no data leakage and proper preprocessing application

### 🔄 **Next Steps:**
In the next notebook (`03_model_development.ipynb`), we will:
1. **Baseline Models**: Train simple classifiers for performance benchmarking
2. **Feature Selection**: Apply statistical and ML-based feature selection methods
3. **Model Comparison**: Evaluate multiple algorithms (SVM, Random Forest, Neural Networks)
4. **Hyperparameter Tuning**: Optimize model parameters using grid/random search
5. **Cross-Validation**: Implement robust model evaluation strategies
6. **Model Interpretation**: Understand which genes drive risk predictions

### 📁 **Exported Files:**
- `../results/preprocessed/data/`: Cleaned datasets ready for modeling
- `../results/preprocessed/splits/`: Train/validation/test splits with balanced classes
- `../results/preprocessed/preprocessing_objects/`: Saved scalers and transformers
- `../results/preprocessed/preprocessing_summary.json`: Complete processing documentation

---

**Excellent work completing the data preprocessing phase! 🎉**

Your data is now clean, normalized, and ready for machine learning model development. The careful preprocessing work you've completed will be crucial for building robust and reliable risk classification models.