# Machine Learning Course: Data Preprocessing

## Notebook 2: Data Preprocessing and Quality Control

> **📋 Prerequisites:** Complete `01_data_exploration.ipynb` first. This notebook focuses on preprocessing tasks not covered in the exploration phase.

### Learning Objectives
By the end of this notebook, you will be able to:
1. **Handle missing values** in clinical data using appropriate imputation
2. **Normalize gene expression data** for machine learning compatibility  
3. **Filter low-variance genes** that are uninformative for modeling
4. **Create train/validation/test splits** with proper stratification
5. **Set up survival analysis targets** using RFS_MONTHS and RFS_STATUS
6. **Export clean datasets** ready for feature selection and modeling

### Target Variable Focus
This notebook prepares data for **recurrence-free survival (RFS) prediction**:
- **RFS_MONTHS**: Time to recurrence or last follow-up (continuous target)
- **RFS_STATUS**: Recurrence event indicator (0=no recurrence, 1=recurrence)

### Key Preprocessing Steps
1. **Load Data**: Import datasets from notebook 1
2. **Missing Value Imputation**: Handle clinical data gaps
3. **Gene Filtering**: Remove low-variance/uninformative genes
4. **Normalization**: Scale data for ML algorithms
5. **Target Setup**: Prepare RFS variables for survival analysis
6. **Data Splitting**: Create stratified train/val/test sets

---

## 1. Setup and Imports

Let's start by importing all necessary libraries for data preprocessing and validation.

In [None]:
# 📝 ACTIVITY 1: Import Preprocessing Libraries
#
# Note: Imports from notebook 1 do not carry over to this notebook's kernel, so re-import the basic libraries you need (pandas, numpy)
#
# Your task: Import specialized preprocessing libraries
#
# TODO: Import preprocessing tools:
# 1. From sklearn.preprocessing import: StandardScaler, RobustScaler
# 2. From sklearn.impute import: SimpleImputer
# 3. From sklearn.model_selection import: train_test_split, StratifiedShuffleSplit
# 4. From sklearn.feature_selection import: VarianceThreshold
#
# TODO: Set environment:
# 5. Set random seed: np.random.seed(42)
# 6. Pass random_state=42 to every sklearn call that accepts it (sklearn has no global seed of its own)
#
# Expected output: Preprocessing libraries imported successfully

# Write your code below:
# from sklearn.preprocessing import StandardScaler, RobustScaler
# from sklearn.impute import SimpleImputer
# from sklearn.model_selection import train_test_split, StratifiedShuffleSplit
# from sklearn.feature_selection import VarianceThreshold
# import numpy as np
# np.random.seed(42)
# print("✓ Preprocessing libraries imported")
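
The reproducibility setup in this activity can be checked with a small toy example. This is a minimal sketch on made-up data, not the course dataset; the constant name `RANDOM_STATE` is an assumption introduced here for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

RANDOM_STATE = 42  # hypothetical constant name, reused for every sklearn call

# Toy data: 20 samples, 2 features, two evenly sized classes
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# Two calls with the same random_state should return identical splits
split_a = train_test_split(X, y, test_size=0.25, random_state=RANDOM_STATE, stratify=y)
split_b = train_test_split(X, y, test_size=0.25, random_state=RANDOM_STATE, stratify=y)

same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(f"Splits identical across calls: {same_split}")
```

Passing the value explicitly keeps each call reproducible even if other code consumes numbers from NumPy's global random generator in between.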

## 2. Data Loading and Validation

Before we can preprocess the data, we need to load it and validate its structure based on our exploration findings.

In [None]:
# 📝 ACTIVITY 2: Load Data from Notebook 1
#
# Your task: Load the datasets explored in the previous notebook
#
# TODO: Load datasets:
# 1. Reload expression_data and clinical_data (same as notebook 1)
# 2. Use the same file paths and loading code
# 3. Verify data shapes match exploration results
#
# TODO: Focus on RFS target variables:
# 4. Check if RFS_MONTHS and RFS_STATUS columns exist in clinical data
# 5. Print basic info about RFS variables (range, missing values, distribution)
# 6. These will be our primary targets for survival analysis
#
# Expected output: Datasets loaded with focus on RFS target variables

# Write your code below:
# import os
# import pandas as pd
# 
# # Load data (same as notebook 1)
# DATA_PATH = '../data/'
# clinical_file = os.path.join(DATA_PATH, 'data_clinical_patient.txt')
# expression_file = os.path.join(DATA_PATH, 'data_mrna_illumina_microarray.txt')
# 
# clinical_data = pd.read_csv(clinical_file, sep='\t', comment='#')
# expression_data = pd.read_csv(expression_file, sep='\t', index_col=0)
# if 'Entrez_Gene_Id' in expression_data.columns:
#     expression_data = expression_data.drop('Entrez_Gene_Id', axis=1)
# 
# print(f"✓ Expression data: {expression_data.shape}")
# print(f"✓ Clinical data: {clinical_data.shape}")
# 
# # Check RFS target variables
# if 'RFS_MONTHS' in clinical_data.columns and 'RFS_STATUS' in clinical_data.columns:
#     print(f"✓ RFS_MONTHS range: {clinical_data['RFS_MONTHS'].min():.1f} - {clinical_data['RFS_MONTHS'].max():.1f}")
#     print(f"✓ RFS_STATUS distribution: {clinical_data['RFS_STATUS'].value_counts().to_dict()}")
#     print(f"✓ RFS missing values: {clinical_data[['RFS_MONTHS', 'RFS_STATUS']].isnull().sum().to_dict()}")
# else:
#     print("⚠️ RFS variables not found - check column names")
#     print(f"Available columns with 'RFS': {[col for col in clinical_data.columns if 'RFS' in col.upper()]}")

In [None]:
# 📝 ACTIVITY 3: Handle Missing Values in Clinical Data (Impute or Exclude)
#
# Your task: Impute missing values in clinical variables, or exclude records that cannot be sensibly imputed (expression data usually has no missing values)
#
# TODO: Analyze missing values in clinical data:
# 1. Check which clinical variables have missing values
# 2. Focus on important variables for survival analysis
# 3. Decide imputation strategy for each variable type
#
# TODO: Implement imputation:
# 4. For numerical variables: use median imputation (SimpleImputer with strategy='median')
# 5. For categorical variables: use most frequent imputation (strategy='most_frequent')
# 6. Apply imputation and create clinical_data_imputed
#
# TODO: Validate imputation:
# 7. Verify no missing values remain
# 8. Compare distributions before/after imputation
#
# Expected output: Clinical data with missing values properly imputed

# Write your code below:
# # Check missing values
# missing_clinical = clinical_data.isnull().sum()
# missing_vars = missing_clinical[missing_clinical > 0]
# print(f"Variables with missing values: {len(missing_vars)}")
# if len(missing_vars) > 0:
#     print(missing_vars.head(10))
# 
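# # Exclude patients with missing RFS targets: outcome values should not be imputed
# target_cols = [c for c in ['RFS_MONTHS', 'RFS_STATUS'] if c in clinical_data.columns]
# clinical_data_imputed = clinical_data.dropna(subset=target_cols).copy()
# 
# # Separate numerical and categorical predictors (target columns are left untouched)
# predictors = clinical_data_imputed.drop(columns=target_cols)
# numerical_cols = predictors.select_dtypes(include=[np.number]).columns.tolist()
# categorical_cols = predictors.select_dtypes(include=['object']).columns.tolist()
# 
# # Note: fitting imputers on all patients is a simplification; for a strict evaluation,
# # fit them on the training split only and apply them to validation/test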
# 
# # Impute numerical variables
# if len(numerical_cols) > 0:
#     num_imputer = SimpleImputer(strategy='median')
#     clinical_data_imputed[numerical_cols] = num_imputer.fit_transform(clinical_data_imputed[numerical_cols])
# 
# # Impute categorical variables  
# if len(categorical_cols) > 0:
#     cat_imputer = SimpleImputer(strategy='most_frequent')
#     clinical_data_imputed[categorical_cols] = cat_imputer.fit_transform(clinical_data_imputed[categorical_cols])
# 
# print(f"✓ Missing values after imputation: {clinical_data_imputed.isnull().sum().sum()}")
# print(f"✓ Clinical data ready: {clinical_data_imputed.shape}")
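
Imputation statistics can leak information if they are computed on patients that later end up in the validation or test set. The sketch below shows the fit-on-train, apply-everywhere pattern on a toy table. The column names `AGE` and `GRADE` and the helper `impute` are hypothetical and used only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy clinical tables (hypothetical columns, not the course data)
train = pd.DataFrame({
    "AGE": [50.0, np.nan, 70.0, 60.0],
    "GRADE": pd.Series(["2", "3", np.nan, "3"], dtype=object),
})
test = pd.DataFrame({
    "AGE": [np.nan, 45.0],
    "GRADE": pd.Series([np.nan, "1"], dtype=object),
})

# Learn the fill values from the training rows only
num_imputer = SimpleImputer(strategy="median").fit(train[["AGE"]])
cat_imputer = SimpleImputer(strategy="most_frequent").fit(train[["GRADE"]])

def impute(df):
    """Apply the training-set fill values to any table with the same columns."""
    out = df.copy()
    out[["AGE"]] = num_imputer.transform(df[["AGE"]])
    out[["GRADE"]] = cat_imputer.transform(df[["GRADE"]])
    return out

train_imputed = impute(train)
test_imputed = impute(test)
print(test_imputed)
```

The missing test values are filled with the training median and the most frequent training category, so nothing about the test rows influences the fill values.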

In [None]:
# 📝 ACTIVITY 4: Filter Low-Variance Genes
#
# Your task: Remove genes with low variance that won't be informative for machine learning
#
# TODO: Calculate gene variance:
# 1. Calculate variance for each gene across all samples
# 2. Examine the distribution of gene variances
# 3. Set a reasonable threshold (e.g., remove bottom 10% or variance < 0.1)
#
# TODO: Apply variance filtering:
# 4. Use VarianceThreshold from sklearn.feature_selection
# 5. Set threshold (e.g., 0.1 or percentile-based)
# 6. Fit on transposed expression data (genes as features)
# 7. Transform data to remove low-variance genes
#
# TODO: Validate filtering:
# 8. Print number of genes before and after filtering
# 9. Show which genes were removed
# 10. Verify remaining genes have sufficient variance
#
# Expected output: Expression data with uninformative genes removed

# Write your code below:
# print("GENE VARIANCE FILTERING")
# print("="*40)
# 
# # Calculate gene variances
# gene_variances = expression_data.var(axis=1)
# print(f"Original genes: {len(gene_variances)}")
# print(f"Variance range: {gene_variances.min():.4f} - {gene_variances.max():.4f}")
# print(f"Variance median: {gene_variances.median():.4f}")
# 
# # Set threshold (remove genes with variance < 0.1)
# variance_threshold = 0.1
# 
# # Apply VarianceThreshold
# selector = VarianceThreshold(threshold=variance_threshold)
# expression_filtered = expression_data.T  # Transpose: samples as rows, genes as columns
# expression_filtered = selector.fit_transform(expression_filtered)
# expression_filtered = pd.DataFrame(expression_filtered.T,  # Transpose back
#                                   index=expression_data.index[selector.get_support()],
#                                   columns=expression_data.columns)
# 
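# removed_gene_names = expression_data.index[~selector.get_support()]
# genes_removed = len(removed_gene_names)
# print(f"✓ Genes after filtering: {len(expression_filtered)}")
# print(f"✓ Genes removed: {genes_removed} ({genes_removed/len(expression_data)*100:.1f}%)")
# print(f"✓ Removed genes (first 10): {list(removed_gene_names[:10])}")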
# print(f"✓ Remaining variance range: {expression_filtered.var(axis=1).min():.4f} - {expression_filtered.var(axis=1).max():.4f}")
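
The TODO list mentions a percentile-based threshold as an alternative to a fixed cutoff such as 0.1. A minimal sketch of that variant on simulated data follows; the toy matrix and the names `toy_expression`, `gene_var` and `cutoff` are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(42)
n_genes, n_samples = 50, 30

# Toy genes x samples matrix whose per-gene spread grows from gene to gene
scales = np.linspace(0.05, 2.0, n_genes)
toy_expression = pd.DataFrame(
    rng.normal(size=(n_genes, n_samples)) * scales[:, None],
    index=[f"GENE_{i}" for i in range(n_genes)],
    columns=[f"S{j}" for j in range(n_samples)],
)

# Population variance (ddof=0) matches what VarianceThreshold computes internally
gene_var = toy_expression.var(axis=1, ddof=0)
cutoff = np.percentile(gene_var, 10)  # variance value at the 10th percentile

# VarianceThreshold keeps features whose variance is strictly above the threshold
selector = VarianceThreshold(threshold=cutoff)
selector.fit(toy_expression.T)  # samples as rows, genes as columns
kept = toy_expression.loc[selector.get_support()]

print(f"Cutoff: {cutoff:.4f}, genes kept: {len(kept)} of {n_genes}")
```

A percentile cutoff adapts to the scale of the data, which helps when the expression values have been log-transformed or otherwise rescaled and a fixed value like 0.1 has no obvious meaning.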

In [None]:
# 📝 ACTIVITY 5: Normalize Expression Data (Decide Whether Scaling Is Needed)
#
# Your task: Scale expression data for machine learning compatibility
#
# TODO: Choose and apply scaling:
# 1. Use StandardScaler (mean=0, std=1) or RobustScaler (median-based, outlier-resistant)
# 2. Fit scaler on expression data (transpose first: samples as rows)
# 3. Transform data and convert back to original format
# 4. Keep gene names and sample names
#
# TODO: Validate normalization:
# 5. Check that mean ≈ 0 and std ≈ 1 for StandardScaler
# 6. Print before/after statistics
# 7. Ensure no NaN or infinite values introduced
#
# Expected output: Normalized expression data ready for machine learning

# Write your code below:
# print("EXPRESSION DATA NORMALIZATION")
# print("="*40)
# 
# # Print before statistics
# print(f"Before - Mean: {expression_filtered.values.mean():.3f}, Std: {expression_filtered.values.std():.3f}")
# print(f"Before - Range: {expression_filtered.values.min():.3f} to {expression_filtered.values.max():.3f}")
# 
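# # Apply StandardScaler (samples as rows for sklearn, so each gene is scaled separately)
# # Note: fitting on all samples before splitting is a simplification; for a strict evaluation,
# # fit the scaler on the training samples only and apply it to validation/test
# scaler = StandardScaler()
# expression_scaled = scaler.fit_transform(expression_filtered.T)  # Transpose for sklearn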
# 
# # Convert back to DataFrame with original structure
# expression_scaled = pd.DataFrame(expression_scaled.T,  # Transpose back
#                                 index=expression_filtered.index,
#                                 columns=expression_filtered.columns)
# 
# # Validate normalization
# print(f"After - Mean: {expression_scaled.values.mean():.3f}, Std: {expression_scaled.values.std():.3f}")
# print(f"After - Range: {expression_scaled.values.min():.3f} to {expression_scaled.values.max():.3f}")
# print(f"✓ Shape maintained: {expression_scaled.shape}")
# print(f"✓ No missing values: {expression_scaled.isnull().sum().sum() == 0}")
# print(f"✓ Expression data normalized and ready for ML")
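
To make the choice between the two scalers in this activity more concrete, the sketch below applies both to a single toy feature that contains one extreme value. The numbers are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

# One feature, five samples, the last one an outlier
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standard = StandardScaler().fit_transform(values)  # (x - mean) / std
robust = RobustScaler().fit_transform(values)      # (x - median) / IQR

print("StandardScaler:", np.round(standard.ravel(), 3))
print("RobustScaler:  ", np.round(robust.ravel(), 3))
```

With StandardScaler the outlier inflates the mean and standard deviation, which squeezes the four typical values close together. RobustScaler centers on the median and divides by the interquartile range, so the typical values keep a useful spread and only the outlier lands far from zero.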

In [None]:
# 📝 ACTIVITY 6: Prepare RFS Target Variables
#
# Your task: Set up RFS_MONTHS and RFS_STATUS for survival analysis and classification
#
# TODO: Extract and validate RFS variables:
# 1. Extract RFS_MONTHS and RFS_STATUS from clinical_data_imputed
# 2. Check for any remaining missing values in these key variables
# 3. Validate data types and value ranges
#
# TODO: Create target variables for different analyses:
# 4. For survival analysis: keep RFS_MONTHS (time) and RFS_STATUS (event) as is
# 5. For classification: create binary outcome (e.g., recurrence within 5 years)
# 6. Handle any edge cases (negative times, invalid status codes)
#
# TODO: Align samples between expression and clinical data:
# 7. Find common samples between expression_scaled and clinical_data_imputed
# 8. Filter both datasets to include only matching samples
# 9. Ensure sample order is consistent
#
# Expected output: Aligned datasets with properly formatted RFS target variables

# Write your code below:
# print("RFS TARGET VARIABLE PREPARATION")
# print("="*40)
# 
# # Index clinical data by patient ID so it can be matched to the expression columns
# # (assumes the ID column is named PATIENT_ID; adjust to your data)
# if 'PATIENT_ID' in clinical_data_imputed.columns:
#     clinical_data_imputed = clinical_data_imputed.set_index('PATIENT_ID')
# 
# # Extract RFS variables
# if 'RFS_MONTHS' in clinical_data_imputed.columns and 'RFS_STATUS' in clinical_data_imputed.columns:
#     rfs_months = clinical_data_imputed['RFS_MONTHS'].copy()
#     # Status may be stored as text such as '1:Recurred' (an assumption about the file format);
#     # keeping the code before ':' also works when the column is already numeric
#     rfs_status = clinical_data_imputed['RFS_STATUS'].astype(str).str.split(':').str[0].astype(float).astype(int)
#     clinical_data_imputed['RFS_STATUS'] = rfs_status
#     
#     print(f"RFS_MONTHS - Missing: {rfs_months.isnull().sum()}, Range: {rfs_months.min():.1f}-{rfs_months.max():.1f}")
#     print(f"RFS_MONTHS - Negative values: {(rfs_months < 0).sum()}")
#     print(f"RFS_STATUS - Missing: {rfs_status.isnull().sum()}, Values: {rfs_status.value_counts().to_dict()}")
#     
#     # Create binary classification target (recurrence within 60 months/5 years)
#     # Caveat: patients censored before 60 months are labelled 0 although their 5-year outcome is unknown
#     recurrence_5yr = ((rfs_months <= 60) & (rfs_status == 1)).astype(int)
#     print(f"5-year recurrence rate: {recurrence_5yr.mean():.3f} ({recurrence_5yr.sum()}/{len(recurrence_5yr)})")
#     
#     # Align samples: sorted() keeps the sample order (and therefore the splits) reproducible
#     common_samples = sorted(set(expression_scaled.columns) & set(clinical_data_imputed.index))
#     if len(common_samples) == 0:
#         print("⚠️ No matching sample IDs - check sample ID alignment")
#         print(f"Expression samples (first 5): {list(expression_scaled.columns[:5])}")
#         print(f"Clinical samples (first 5): {list(clinical_data_imputed.index[:5])}")
#     else:
#         print(f"✓ Common samples found: {len(common_samples)}")
# 
# else:
#     print("❌ RFS variables not found in clinical data")
#     print("Available columns:", [col for col in clinical_data_imputed.columns if any(x in col.upper() for x in ['RFS', 'RECUR', 'SURV', 'STATUS', 'MONTH'])])
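
A binary "recurrence within 5 years" label is only well defined for patients who either recurred within 60 months or were followed for longer than 60 months. The sketch below shows one common way to handle this on a toy table: patients censored before the horizon are left unlabelled rather than counted as non-recurrent. This is an illustrative option under that assumption, not necessarily the approach intended by the course.

```python
import numpy as np
import pandas as pd

# Toy survival table (hypothetical patients)
toy = pd.DataFrame(
    {"RFS_MONTHS": [12.0, 30.0, 75.0, 90.0, 61.0], "RFS_STATUS": [1, 0, 0, 1, 1]},
    index=["P1", "P2", "P3", "P4", "P5"],
)
horizon = 60  # months

# Event observed at or before the horizon: positive label
event_before = (toy["RFS_STATUS"] == 1) & (toy["RFS_MONTHS"] <= horizon)
# Still recurrence-free after the horizon: negative label
followed_past = toy["RFS_MONTHS"] > horizon

label = pd.Series(np.nan, index=toy.index)
label[event_before] = 1.0
label[followed_past] = 0.0

# Patients censored before the horizon stay NaN and are dropped for classification
labeled = label.dropna().astype(int)
print(labeled.to_dict())
```

Here `P2` was last seen at 30 months without recurrence, so the 5-year outcome is unknown and the patient is excluded from the classification target. All five patients can still be used for time-to-event modeling with `RFS_MONTHS` and `RFS_STATUS`.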

In [None]:
# 📝 ACTIVITY 7: Create Train/Validation/Test Splits
#
# Your task: Split data into training, validation, and test sets with proper stratification
#
# TODO: Prepare data for splitting:
# 1. Ensure expression_scaled and target variables are aligned
# 2. Create stratification variable (use RFS_STATUS or 5-year recurrence)
# 3. Handle any remaining data alignment issues
#
# TODO: Create splits:
# 4. Use train_test_split to create 70% train, 30% temp
# 5. Split temp in half to get 15% validation and 15% test (the solution below reaches the same proportions by holding out the test set first)
# 6. Use stratify parameter to maintain class balance
# 7. Set random_state=42 for reproducibility
#
# TODO: Validate splits:
# 8. Check sizes of train/val/test sets
# 9. Verify class distributions are maintained
# 10. Ensure no data leakage between sets
#
# Expected output: Stratified train/val/test splits ready for feature selection and modeling

# Write your code below:
# print("DATA SPLITTING")
# print("="*30)
# 
# # Prepare aligned data (adjust sample alignment as needed)
# # This assumes we have aligned samples - modify based on actual data structure
# if 'common_samples' in locals() and len(common_samples) > 0:
#     # Use aligned samples
#     X = expression_scaled[common_samples].T  # Samples as rows
#     y_survival = clinical_data_imputed.loc[common_samples, ['RFS_MONTHS', 'RFS_STATUS']]
#     y_binary = recurrence_5yr.loc[common_samples]  # Same sample order as X
#     
#     print(f"Total samples for splitting: {len(X)}")
#     print(f"Features (genes): {X.shape[1]}")
#     print(f"Class distribution: {y_binary.value_counts().to_dict()}")
#     
#     # Create stratified splits
#     X_temp, X_test, y_temp, y_test = train_test_split(
#         X, y_binary, test_size=0.15, random_state=42, stratify=y_binary
#     )
#     
#     X_train, X_val, y_train, y_val = train_test_split(
#         X_temp, y_temp, test_size=0.176, random_state=42, stratify=y_temp  # 0.176 * 0.85 ≈ 0.15
#     )
#     
#     print(f"✓ Train set: {X_train.shape} (class dist: {y_train.value_counts().to_dict()})")
#     print(f"✓ Val set: {X_val.shape} (class dist: {y_val.value_counts().to_dict()})")
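#     print(f"✓ Test set: {X_test.shape} (class dist: {y_test.value_counts().to_dict()})")
#     
#     # Leakage check: no sample may appear in more than one split
#     assert set(X_train.index).isdisjoint(X_val.index)
#     assert set(X_train.index).isdisjoint(X_test.index)
#     assert set(X_val.index).isdisjoint(X_test.index)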
#     
# else:
#     print("❌ Cannot create splits - sample alignment needed")
#     print("Fix sample alignment in previous activity first")
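
The 70/30 then 50/50 route described in the TODO list can be sketched on toy data as follows. The data and the 30% positive rate are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 200 samples with 30% positives
X = pd.DataFrame(np.arange(400).reshape(200, 2), index=[f"P{i}" for i in range(200)])
y = pd.Series([1] * 60 + [0] * 140, index=X.index)

# Step 1: 70% train, 30% temporary hold-out
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)
# Step 2: split the hold-out in half: 15% validation, 15% test
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, random_state=42, stratify=y_temp
)

for name, X_part, y_part in [("train", X_train, y_train), ("val", X_val, y_val), ("test", X_test, y_test)]:
    print(f"{name}: n={len(X_part)}, positive rate={y_part.mean():.2f}")
```

Stratifying both steps keeps the positive rate close to the overall rate in every split. Because `X` and `y` share the same index, each label stays attached to its sample through both splits.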

In [None]:
# 📝 ACTIVITY 8: Export Preprocessed Data
#
# Your task: Save cleaned and split datasets for feature selection and modeling
#
# TODO: Create results directory and save datasets:
# 1. Create '../results/preprocessing/' directory
# 2. Save train/val/test splits as CSV files
# 3. Save both expression data (X) and target variables (y)
# 4. Include both survival targets (RFS_MONTHS, RFS_STATUS) and binary target
#
# TODO: Save preprocessing objects:
# 5. Save scaler and other preprocessing objects using joblib
# 6. Create preprocessing summary with key statistics
#
# Expected output: All preprocessed data saved and ready for next notebook

# Write your code below:
# import os
# import joblib
# 
# # Create results directory
# os.makedirs('../results/preprocessing', exist_ok=True)
# 
# if 'X_train' in locals():
#     # Save expression data splits
#     pd.DataFrame(X_train).to_csv('../results/preprocessing/X_train_scaled.csv')
#     pd.DataFrame(X_val).to_csv('../results/preprocessing/X_val_scaled.csv') 
#     pd.DataFrame(X_test).to_csv('../results/preprocessing/X_test_scaled.csv')
#     
#     # Save target variables: survival targets (RFS_MONTHS, RFS_STATUS) plus the binary 5-year label
#     for name, X_part, y_part in [('train', X_train, y_train), ('val', X_val, y_val), ('test', X_test, y_test)]:
#         targets = y_survival.loc[X_part.index].assign(recurrence_5yr=y_part.to_numpy())
#         targets.to_csv(f'../results/preprocessing/y_{name}.csv')
#     
#     # Save preprocessing objects
#     joblib.dump(scaler, '../results/preprocessing/expression_scaler.pkl')
#     
#     # Create summary
#     preprocessing_summary = {
#         'original_genes': len(expression_data),
#         'filtered_genes': len(expression_filtered),
#         'final_genes': X_train.shape[1],
#         'train_samples': len(X_train),
#         'val_samples': len(X_val),
#         'test_samples': len(X_test),
#         'target_variable': 'recurrence_5yr (RFS_STATUS == 1 and RFS_MONTHS <= 60)',
#         'missing_values_imputed': True,
#         'normalization': 'StandardScaler'
#     }
#     
#     import json
#     with open('../results/preprocessing/preprocessing_summary.json', 'w') as f:
#         json.dump(preprocessing_summary, f, indent=2)
#     
#     print("✓ All datasets exported to ../results/preprocessing/")
#     print("✓ Ready for feature selection (notebook 3)")
#     
# else:
#     print("❌ No splits to export - complete previous activities first")
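
Saving the fitted scaler matters because later notebooks need to apply exactly the same transformation to new samples. The sketch below shows a save-and-reload round trip with `joblib`, using a temporary directory and toy numbers instead of the course paths.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training matrix (samples as rows) and one new sample
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
new = np.array([[2.0, 40.0]])

scaler = StandardScaler().fit(train)

# Write the fitted scaler to disk and read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "expression_scaler.pkl")
    joblib.dump(scaler, path)
    restored = joblib.load(path)

same_output = np.allclose(scaler.transform(new), restored.transform(new))
print(f"Restored scaler reproduces the transform: {same_output}")
```

The restored object carries the means and scales learned from the training data, so new samples are transformed without refitting.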

## 7. Export Preprocessed Data

The export step above saves the preprocessed splits, the fitted scaler, and a short JSON summary for the next modeling phase.

---

## 📚 Preprocessing Summary

### ✅ **What You Accomplished:**
1. **Library Import**: Added preprocessing-specific libraries (sklearn tools)
2. **Data Loading**: Loaded datasets with focus on RFS target variables
3. **Missing Value Handling**: Imputed missing clinical data appropriately  
4. **Gene Filtering**: Removed low-variance genes uninformative for ML
5. **Normalization**: Standardized expression data for ML compatibility
6. **Target Preparation**: Set up RFS_MONTHS and RFS_STATUS for survival analysis
7. **Data Splitting**: Created stratified train/val/test splits (70/15/15)
8. **Export**: Saved preprocessed datasets for feature selection

### 🎯 **Key Focus - RFS Target Variables:**
- **RFS_MONTHS**: Time to recurrence or last follow-up (continuous)
- **RFS_STATUS**: Recurrence event indicator (0=no event, 1=recurrence)
- **5-Year Classification**: Binary target for recurrence within 5 years

### 📊 **Data Ready for Feature Selection:**
- **Clean expression data**: Normalized, low-variance genes removed
- **Imputed clinical data**: No missing values
- **Survival targets**: RFS variables ready for time-to-event modeling
- **Stratified splits**: Train/val/test sets that preserve the overall class distribution

### 🔄 **Next Steps:**
Run `03_feature_selection.ipynb` to select optimal features for RFS prediction modeling.

---

**Great job preprocessing the data for survival analysis! 🎉**