# Machine Learning Fundamentals: Data Cleaning, Labeling, and Normalization
# SOLUTION NOTEBOOK

## Introduction

This is the complete solution notebook for the ML fundamentals exercise. It demonstrates best practices for data preprocessing in machine learning.

### Learning Goals
1. Handle missing values, duplicates, and outliers
2. Standardize inconsistent labels
3. Apply and compare normalization techniques
4. Understand the impact of preprocessing on model performance
5. Avoid common pitfalls like data leakage

---

## Setup: Import Libraries

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re

# Machine learning
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# Display settings
pd.set_option('display.max_columns', 20)
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

print("Libraries imported successfully!")

## Load the Data

In [None]:
# Load datasets
expression_df = pd.read_csv('gene_expression_data.csv')
metadata_df = pd.read_csv('sample_metadata.csv')

print("Expression data shape:", expression_df.shape)
print("Metadata shape:", metadata_df.shape)
print("\nFirst few rows of expression data:")
display(expression_df.head())
print("\nFirst few rows of metadata:")
display(metadata_df.head())

---
# Part 1: Data Cleaning

## Task 1.1: Visualize Missing Data

**Key Insight**: Visualizing missing-data patterns helps reveal whether missingness is random or systematic.

In [None]:
# Check for missing values
missing_summary = expression_df.isnull().sum()
print(f"Total missing values: {expression_df.isnull().sum().sum()}")
print(f"Percentage missing: {100 * expression_df.isnull().sum().sum() / expression_df.size:.2f}%")

# Create heatmap of missing values
plt.figure(figsize=(12, 8))
sns.heatmap(expression_df.isnull(), cbar=False, cmap='YlOrRd', yticklabels=False)
plt.title('Missing Data Heatmap (Dark Red = Missing)', fontsize=14)
plt.xlabel('Genes')
plt.ylabel('Samples')
plt.tight_layout()
plt.show()
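
One way to probe whether missingness is random or systematic is to compare the missing fraction per gene and per sample. The sketch below uses a made-up toy frame; the column names and values are illustrative assumptions, not the exercise data.

```python
import numpy as np
import pandas as pd

# Toy expression table (hypothetical values, not the exercise data)
toy = pd.DataFrame({
    "GENE_A": [1.0, np.nan, 3.0, np.nan],
    "GENE_B": [2.0, 2.5, 3.5, 4.0],
    "GENE_C": [np.nan, 1.0, 1.5, np.nan],
})

# Fraction of missing values per gene (column) and per sample (row)
missing_per_gene = toy.isnull().mean(axis=0)
missing_per_sample = toy.isnull().mean(axis=1)

print("Missing fraction per gene:")
print(missing_per_gene)
print("\nMissing fraction per sample:")
print(missing_per_sample)
```

If a handful of genes or samples account for most of the gaps, the missingness is probably systematic and may deserve a closer look before imputing.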

## Task 1.2: Handle Missing Values

We'll use **median imputation** as it's robust to outliers.

In [None]:
# Make a copy to preserve original data
expr_cleaned = expression_df.copy()

# Get gene columns (all except sample_id)
gene_columns = [col for col in expr_cleaned.columns if col != 'sample_id']

print(f"Before imputation:")
print(f"  Total NaN values: {expr_cleaned[gene_columns].isnull().sum().sum()}")
print(f"  Shape: {expr_cleaned.shape}")

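# Use sklearn's SimpleImputer for median imputation
# Note: fitting on all samples lets rows that later land in the validation/test
# splits influence the medians. For a strict workflow, fit the imputer on the
# training split only (for example inside a Pipeline).
imputer = SimpleImputer(strategy='median')
expr_cleaned[gene_columns] = imputer.fit_transform(expr_cleaned[gene_columns])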

# Verify no missing values remain
nan_count = expr_cleaned.isnull().sum().sum()
print(f"\nAfter imputation:")
print(f"  Total NaN values: {nan_count}")
print(f"  Shape: {expr_cleaned.shape}")

# Additional check: verify columns are numeric
print(f"\nData types check:")
print(f"  All gene columns numeric: {all(pd.api.types.is_numeric_dtype(expr_cleaned[col]) for col in gene_columns)}")

# Final assertion
assert nan_count == 0, f"ERROR: Still have {nan_count} missing values after imputation!"
print("\n✓ Success! All missing values imputed.")
print(f"✓ Imputed {len(gene_columns)} genes using median strategy.")
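
To see why the median is described as robust to outliers, here is a minimal sketch comparing mean and median imputation on a single toy column. The values are invented for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One toy gene with a single extreme value and one missing entry
values = np.array([[1.0], [2.0], [3.0], [1000.0], [np.nan]])

mean_filled = SimpleImputer(strategy="mean").fit_transform(values)
median_filled = SimpleImputer(strategy="median").fit_transform(values)

# The last row held the missing entry
print("Mean-imputed value:  ", mean_filled[-1, 0])
print("Median-imputed value:", median_filled[-1, 0])
```

The single extreme value pulls the mean far away from the typical measurements, while the median stays close to them.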

## Task 1.3: Detect and Remove Duplicate Samples

**Key Insight**: Duplicates violate independence assumptions and can bias results.

In [None]:
# Check for duplicates
print(f"Shape before removing duplicates: {expr_cleaned.shape}")
print(f"Number of duplicate rows: {expr_cleaned.duplicated(subset=gene_columns).sum()}")

# Show duplicate rows if any
if expr_cleaned.duplicated(subset=gene_columns).sum() > 0:
    print("\nDuplicate sample IDs:")
    duplicate_mask = expr_cleaned.duplicated(subset=gene_columns, keep=False)
    print(expr_cleaned[duplicate_mask]['sample_id'].tolist())

# Remove duplicates (keep first occurrence)
expr_cleaned = expr_cleaned.drop_duplicates(subset=gene_columns, keep='first')

print(f"\nShape after removing duplicates: {expr_cleaned.shape}")
assert expr_cleaned.duplicated(subset=gene_columns).sum() == 0, "Still have duplicates!"
print("Success! All duplicates removed.")
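
The cell above uses `duplicated` with two different `keep` settings. A small sketch on a toy frame (hypothetical IDs and values) shows how they differ:

```python
import pandas as pd

# Toy frame: S1 and S2 have identical gene values
toy = pd.DataFrame({
    "sample_id": ["S1", "S2", "S3"],
    "GENE_A": [1.0, 1.0, 2.0],
    "GENE_B": [5.0, 5.0, 6.0],
})
genes = ["GENE_A", "GENE_B"]

# Default keep='first' flags only the later copies
later_copies = toy.duplicated(subset=genes).tolist()
# keep=False flags every member of a duplicate group
all_copies = toy.duplicated(subset=genes, keep=False).tolist()

deduped = toy.drop_duplicates(subset=genes, keep="first")
print(later_copies)
print(all_copies)
print(deduped["sample_id"].tolist())
```

Using `keep=False` for reporting and `keep='first'` for removal lets you list every sample involved while still retaining one copy.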

## Task 1.4: Identify and Handle Outliers

In [None]:
# Visualize distribution of first 10 genes
fig, axes = plt.subplots(2, 5, figsize=(15, 6))
axes = axes.flatten()

for i, gene in enumerate(gene_columns[:10]):
    axes[i].boxplot(expr_cleaned[gene].dropna())
    axes[i].set_title(gene, fontsize=10)
    axes[i].set_ylabel('Expression')

plt.suptitle('Gene Expression Distributions (First 10 Genes)', fontsize=14)
plt.tight_layout()
plt.show()

print("\nObservation: Some genes have extreme outliers and very different scales.")
print("We'll handle these through normalization rather than removal.")
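
Boxplots draw their whiskers using the 1.5 × IQR rule, so the same rule can be used to count outliers per gene. The helper `count_iqr_outliers` and the toy data below are hypothetical, added only as a sketch of the idea.

```python
import pandas as pd

def count_iqr_outliers(series, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(((series < lower) | (series > upper)).sum())

# Toy data: GENE_A contains one extreme value, GENE_B does not
toy = pd.DataFrame({
    "GENE_A": [1.0, 2.0, 3.0, 4.0, 5.0, 100.0],
    "GENE_B": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0],
})

outlier_counts = toy.apply(count_iqr_outliers)
print(outlier_counts)
```

Counting rather than deleting keeps the decision explicit: in this notebook the outliers are kept and tamed later by normalization.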

## Task 1.5: Standardize Sample IDs

**Common Mistake**: Ignoring ID mismatches leads to data loss during merging.

In [None]:
# Check current ID formats
print("Expression data sample IDs (first 10):")
print(expr_cleaned['sample_id'].head(10).tolist())
print("\nMetadata sample IDs (first 10):")
print(metadata_df['sample_id'].head(10).tolist())

# Check how many IDs match
matching_ids = set(expr_cleaned['sample_id']) & set(metadata_df['sample_id'])
print(f"\nCurrently matching IDs: {len(matching_ids)}")

In [None]:
def standardize_sample_id(sample_id):
    """
    Standardize sample ID to 'SAMPLE_XXX' format.
    
    Examples:
    'patient-1' -> 'SAMPLE_001'
    'Patient_01' -> 'SAMPLE_001'
    'SAMPLE-002' -> 'SAMPLE_002'
    """
    # Convert to string and uppercase
    sample_id = str(sample_id).upper()
    
    # Extract numeric part using regex
    match = re.search(r'(\d+)', sample_id)
    if match:
        number = int(match.group(1))
        # Return formatted ID with zero-padding
        return f"SAMPLE_{number:03d}"
    else:
        # If no number found, return original
        return sample_id

# Apply standardization
expr_cleaned['sample_id'] = expr_cleaned['sample_id'].apply(standardize_sample_id)
metadata_df['sample_id'] = metadata_df['sample_id'].apply(standardize_sample_id)

# Check matches now
matching_ids = set(expr_cleaned['sample_id']) & set(metadata_df['sample_id'])
print(f"Matching IDs after standardization: {len(matching_ids)}")
print("\nSample of standardized IDs:")
print(expr_cleaned['sample_id'].head(10).tolist())
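
To find out which IDs still fail to match, an outer merge with `indicator=True` is a convenient check. The sketch below uses two tiny made-up ID tables rather than the exercise data.

```python
import pandas as pd

# Hypothetical ID lists after standardization
expr_ids = pd.DataFrame({"sample_id": ["SAMPLE_001", "SAMPLE_002", "SAMPLE_003"]})
meta_ids = pd.DataFrame({"sample_id": ["SAMPLE_002", "SAMPLE_003", "SAMPLE_004"]})

# The _merge column records whether each ID came from the left table, right table, or both
check = expr_ids.merge(meta_ids, on="sample_id", how="outer", indicator=True)

only_expr = check.loc[check["_merge"] == "left_only", "sample_id"].tolist()
only_meta = check.loc[check["_merge"] == "right_only", "sample_id"].tolist()
print("Only in expression data:", only_expr)
print("Only in metadata:", only_meta)
```

It may also be worth running `duplicated()` on the standardized IDs, since two different raw spellings can collapse to the same `SAMPLE_XXX` value.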

---
# Part 2: Data Labeling

## Task 2.1: Inspect Label Inconsistencies

In [None]:
# Examine unique diagnosis values
print("Unique diagnosis labels:")
print(metadata_df['diagnosis'].unique())
print("\nDiagnosis value counts:")
print(metadata_df['diagnosis'].value_counts())

## Task 2.2: Map Labels to Binary Classification

In [None]:
def map_diagnosis_to_binary(diagnosis):
    """
    Map diagnosis string to binary classification.
    
    Returns:
    - 0 for normal/healthy/control
    - 1 for cancer/tumor/malignant
    - None for ambiguous cases
    """
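    # Treat missing or non-string entries as ambiguous instead of raising AttributeError
    if not isinstance(diagnosis, str):
        return None

    # Normalize case and surrounding whitespace before matching
    diagnosis_lower = diagnosis.strip().lower()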
    
    # Normal class
    if diagnosis_lower in ['normal', 'healthy', 'control']:
        return 0
    
    # Cancer class
    elif diagnosis_lower in ['cancer', 'tumor', 'malignant']:
        return 1
    
    # Ambiguous cases
    elif diagnosis_lower in ['borderline', 'unclear', 'suspicious']:
        return None
    
    else:
        # Unknown label
        print(f"Warning: Unknown diagnosis '{diagnosis}'")
        return None

# Apply mapping
metadata_df['label'] = metadata_df['diagnosis'].apply(map_diagnosis_to_binary)

print("Label distribution:")
print(metadata_df['label'].value_counts())
print(f"\nAmbiguous cases: {metadata_df['label'].isnull().sum()}")
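
An alternative to the `if`/`elif` chain is a dictionary lookup with `Series.map`, sketched below on made-up labels. The `label_map` dictionary and the raw values are assumptions for illustration; anything not in the dictionary, including missing entries, comes out as NaN.

```python
import pandas as pd

# Hypothetical raw labels with case and whitespace variants
raw = pd.Series(["Normal", " CANCER", "tumor ", "Borderline", None])

label_map = {
    "normal": 0, "healthy": 0, "control": 0,
    "cancer": 1, "tumor": 1, "malignant": 1,
}

# Normalize the text first, then look each label up in the dictionary
labels = raw.str.strip().str.lower().map(label_map)
print(labels.tolist())
```

The trade-off is that unknown labels become NaN silently, whereas the function above prints a warning for them.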

## Task 2.3: Handle Ambiguous Labels

We'll drop ambiguous cases for this binary classification task.

In [None]:
# Show ambiguous cases before removal
print("Ambiguous cases:")
print(metadata_df[metadata_df['label'].isnull()][['sample_id', 'diagnosis']])

# Remove rows with ambiguous labels
metadata_df = metadata_df[metadata_df['label'].notna()].copy()

print(f"\nSamples after removing ambiguous cases: {len(metadata_df)}")
assert metadata_df['label'].isnull().sum() == 0, "Still have ambiguous labels!"
print("Success! All ambiguous labels removed.")

## Task 2.4: Merge Expression Data with Labels

In [None]:
# Merge on sample_id (inner join)
full_data = expr_cleaned.merge(metadata_df, on='sample_id', how='inner')

print(f"Final dataset shape: {full_data.shape}")
print(f"\nColumns: {full_data.columns.tolist()[:10]}...")  # Show first 10 columns

# Verify merge didn't introduce NaN values
gene_nan_count = full_data[gene_columns].isnull().sum().sum()
print(f"\nNaN check after merge:")
print(f"  Gene expression NaN: {gene_nan_count}")
print(f"  Label NaN: {full_data['label'].isnull().sum()}")

if gene_nan_count > 0:
    print(f"\n⚠️  WARNING: Merge introduced {gene_nan_count} NaN values in gene data!")
    print("This shouldn't happen with an inner join. Investigating...")
    # Show which columns have NaN
    nan_cols = full_data[gene_columns].isnull().sum()
    print(f"Columns with NaN: {nan_cols[nan_cols > 0]}")
else:
    print("✓ No NaN values introduced by merge")

print(f"\nSample of merged data:")
display(full_data[['sample_id', 'GENE_001', 'GENE_002', 'diagnosis', 'label', 'age', 'gender']].head())

## Task 2.5: Visualize Class Distribution

In [None]:
# Create bar plot
plt.figure(figsize=(8, 5))
class_counts = full_data['label'].value_counts().sort_index()
plt.bar(class_counts.index, class_counts.values, color=['blue', 'red'], alpha=0.7)
plt.xlabel('Class (0=Normal, 1=Cancer)', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.title('Class Distribution', fontsize=14)
plt.xticks([0, 1])
for i, v in enumerate(class_counts.values):
    plt.text(i, v + 1, str(v), ha='center', fontsize=12, fontweight='bold')
plt.tight_layout()
plt.show()

# Calculate class balance
print(f"\nClass balance:")
print(f"Normal (0): {class_counts[0]} ({100*class_counts[0]/len(full_data):.1f}%)")
print(f"Cancer (1): {class_counts[1]} ({100*class_counts[1]/len(full_data):.1f}%)")

## Task 2.6: Create Train/Validation/Test Splits

**Key Insight**: Use stratification to maintain class balance across splits.

In [None]:
# Prepare features (X) and target (y)
X = full_data[gene_columns].values
y = full_data['label'].values.astype(int)  # Convert to int for proper bincount

# Verify no NaN in features before splitting
print(f"NaN values in X: {np.isnan(X).sum()}")
if np.isnan(X).sum() > 0:
    print("ERROR: Found NaN values in features! Check that Task 1.2 (imputation) completed successfully.")
    print("Attempting to fix by applying median imputation...")
    imputer_fix = SimpleImputer(strategy='median')
    X = imputer_fix.fit_transform(X)
    print(f"NaN values after fix: {np.isnan(X).sum()}")

# First split: 60% train, 40% temp
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42
)

# Second split: divide temp 50/50 (20% validation and 20% test of the total)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42
)

print("\nSplit sizes:")
print(f"Training: {len(X_train)} samples ({100*len(X_train)/len(X):.1f}%)")
print(f"Validation: {len(X_val)} samples ({100*len(X_val)/len(X):.1f}%)")
print(f"Test: {len(X_test)} samples ({100*len(X_test)/len(X):.1f}%)")

# Verify stratification
print("\nClass distribution in each split:")
print(f"Training: {np.bincount(y_train)} -> [{100*np.bincount(y_train)[0]/len(y_train):.1f}%, {100*np.bincount(y_train)[1]/len(y_train):.1f}%]")
print(f"Validation: {np.bincount(y_val)} -> [{100*np.bincount(y_val)[0]/len(y_val):.1f}%, {100*np.bincount(y_val)[1]/len(y_val):.1f}%]")
print(f"Test: {np.bincount(y_test)} -> [{100*np.bincount(y_test)[0]/len(y_test):.1f}%, {100*np.bincount(y_test)[1]/len(y_test):.1f}%]")
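
The effect of `stratify` is easiest to see on an imbalanced toy problem. In this sketch (synthetic labels, not the exercise data) the held-out split keeps the 80/20 class ratio of the full set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 80 zeros and 20 ones
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

print("Held-out size:", len(y_te))
print("Held-out positives:", int(y_te.sum()))
print("Training positives:", int(y_tr.sum()))
```

Without `stratify`, the number of positives in a small held-out set would vary from one random split to the next.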

---
# Part 3: Normalization and Its Impact

## Task 3.1: Baseline Model (No Normalization)

**Key Insight**: Without scaling, features measured on larger scales can dominate a regularized model and slow solver convergence.

In [None]:
# Train logistic regression without normalization
baseline_model = LogisticRegression(random_state=42, max_iter=1000)
baseline_model.fit(X_train, y_train)

# Predict on validation set
y_val_pred_baseline = baseline_model.predict(X_val)

# Evaluate
baseline_acc = accuracy_score(y_val, y_val_pred_baseline)
print("="*60)
print("BASELINE MODEL (No Normalization)")
print("="*60)
print(f"Validation Accuracy: {baseline_acc:.4f}")
print("\nConfusion Matrix:")
cm_baseline = confusion_matrix(y_val, y_val_pred_baseline)
print(cm_baseline)
print("\nClassification Report:")
print(classification_report(y_val, y_val_pred_baseline, target_names=['Normal', 'Cancer']))
print("="*60)

In [None]:
# Visualize feature importance (model coefficients)
coef_baseline = baseline_model.coef_[0]

plt.figure(figsize=(12, 4))
plt.bar(range(len(coef_baseline)), coef_baseline, alpha=0.7, color='red')
plt.xlabel('Gene Index', fontsize=12)
plt.ylabel('Coefficient', fontsize=12)
plt.title('Baseline Model Coefficients (No Normalization)', fontsize=14)
plt.axhline(y=0, color='black', linestyle='--', alpha=0.3)
plt.tight_layout()
plt.show()

print(f"Coefficient range: [{coef_baseline.min():.6f}, {coef_baseline.max():.6f}]")
print(f"Coefficient std: {coef_baseline.std():.6f}")
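
To make the scale effect concrete, here is a small sketch on synthetic data. The variable names and the 100x factor are illustrative assumptions. Two columns carry exactly the same signal, but with the default L2 penalty of `LogisticRegression` the larger-scale column ends up contributing more to the decision function.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
signal = rng.normal(size=200)
y_toy = (signal > 0).astype(int)

# Two copies of the same signal, one on a 100x larger scale
X_toy = np.column_stack([signal, 100.0 * signal])

model = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)
w_small, w_large = model.coef_[0]

# Contribution of each column to the decision function per unit of signal
contrib_small = abs(w_small * 1.0)
contrib_large = abs(w_large * 100.0)
print(f"Small-scale column contribution: {contrib_small:.6f}")
print(f"Large-scale column contribution: {contrib_large:.6f}")
```

The penalty charges every coefficient equally, so the model gets more "signal per unit of penalty" from the column with larger values, even though both columns are equally informative.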

## Task 3.2: Min-Max Scaling

**Critical**: Fit scaler on training data ONLY!

In [None]:
# Create and fit scaler on training data
minmax_scaler = MinMaxScaler()
X_train_minmax = minmax_scaler.fit_transform(X_train)

# Transform validation and test sets using the fitted scaler
X_val_minmax = minmax_scaler.transform(X_val)
X_test_minmax = minmax_scaler.transform(X_test)

# Train model on scaled data
model_minmax = LogisticRegression(random_state=42, max_iter=1000)
model_minmax.fit(X_train_minmax, y_train)

# Evaluate
y_val_pred_minmax = model_minmax.predict(X_val_minmax)
minmax_acc = accuracy_score(y_val, y_val_pred_minmax)

print("="*60)
print("MIN-MAX SCALING")
print("="*60)
print(f"Validation Accuracy: {minmax_acc:.4f}")
print(f"Improvement over baseline: {minmax_acc - baseline_acc:+.4f}")
print("\nConfusion Matrix:")
print(confusion_matrix(y_val, y_val_pred_minmax))
print("="*60)

In [None]:
# Visualize coefficients after Min-Max scaling
coef_minmax = model_minmax.coef_[0]

plt.figure(figsize=(12, 4))
plt.bar(range(len(coef_minmax)), coef_minmax, alpha=0.7, color='blue')
plt.xlabel('Gene Index', fontsize=12)
plt.ylabel('Coefficient', fontsize=12)
plt.title('Model Coefficients After Min-Max Scaling', fontsize=14)
plt.axhline(y=0, color='black', linestyle='--', alpha=0.3)
plt.tight_layout()
plt.show()

print(f"Coefficient range: [{coef_minmax.min():.4f}, {coef_minmax.max():.4f}]")
print(f"Coefficient std: {coef_minmax.std():.4f}")
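
Min-Max scaling computes `(x - min) / (max - min)` using the minimum and maximum learned from the training data. A minimal sketch with invented numbers shows a consequence that is easy to miss: validation or test values outside the training range fall outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_tr = np.array([[10.0], [20.0], [30.0]])
X_new = np.array([[5.0], [40.0]])  # values outside the training range

# The scaler learns min=10 and max=30 from the training data only
scaler = MinMaxScaler().fit(X_tr)

scaled_train = scaler.transform(X_tr).ravel()
scaled_new = scaler.transform(X_new).ravel()
print("Scaled training values:", scaled_train)
print("Scaled new values:     ", scaled_new)
```

This is expected behavior rather than a bug: refitting the scaler on the new data to force it back into [0, 1] would reintroduce leakage.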

## Task 3.3: Standardization (Z-Score)

In [None]:
# Create and fit scaler
standard_scaler = StandardScaler()
X_train_standard = standard_scaler.fit_transform(X_train)
X_val_standard = standard_scaler.transform(X_val)
X_test_standard = standard_scaler.transform(X_test)

# Train model
model_standard = LogisticRegression(random_state=42, max_iter=1000)
model_standard.fit(X_train_standard, y_train)

# Evaluate
y_val_pred_standard = model_standard.predict(X_val_standard)
standard_acc = accuracy_score(y_val, y_val_pred_standard)

print("="*60)
print("STANDARDIZATION (Z-Score)")
print("="*60)
print(f"Validation Accuracy: {standard_acc:.4f}")
print(f"Improvement over baseline: {standard_acc - baseline_acc:+.4f}")
print("\nConfusion Matrix:")
print(confusion_matrix(y_val, y_val_pred_standard))
print("="*60)

In [None]:
# Visualize coefficients
coef_standard = model_standard.coef_[0]

plt.figure(figsize=(12, 4))
plt.bar(range(len(coef_standard)), coef_standard, alpha=0.7, color='green')
plt.xlabel('Gene Index', fontsize=12)
plt.ylabel('Coefficient', fontsize=12)
plt.title('Model Coefficients After Standardization', fontsize=14)
plt.axhline(y=0, color='black', linestyle='--', alpha=0.3)
plt.tight_layout()
plt.show()

print(f"Coefficient range: [{coef_standard.min():.4f}, {coef_standard.max():.4f}]")
print(f"Coefficient std: {coef_standard.std():.4f}")
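
Standardization computes `z = (x - mean) / std` with the training mean and standard deviation. The sketch below, on toy numbers, checks a manual calculation against `StandardScaler`, which uses the population standard deviation (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_tr = np.array([[1.0], [2.0], [3.0], [4.0]])
X_new = np.array([[5.0]])

# Statistics come from the training data only
mu = X_tr.mean(axis=0)
sigma = X_tr.std(axis=0)  # population std (ddof=0), matching StandardScaler

manual_z = (X_new - mu) / sigma
sklearn_z = StandardScaler().fit(X_tr).transform(X_new)
print("Manual z-score: ", manual_z.ravel())
print("sklearn z-score:", sklearn_z.ravel())
```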

## Task 3.4: Log Transformation + Standardization

**Key Insight**: Gene expression data is often log-normally distributed.

**Important Note**: This cell requires that missing values were properly imputed in Part 1. If you skipped that step, you'll get NaN errors here!

In [None]:
# Apply log transformation (log1p handles zeros)
# Note: Ensure data is clean before log transform

# Safety check: Handle any remaining NaN values BEFORE log transform
if np.isnan(X_train).any() or np.isnan(X_val).any() or np.isnan(X_test).any():
    print("WARNING: NaN values detected before log transform!")
    print(f"  Train NaN: {np.isnan(X_train).sum()}")
    print(f"  Val NaN: {np.isnan(X_val).sum()}")
    print(f"  Test NaN: {np.isnan(X_test).sum()}")
    print("Applying emergency imputation...")
    
    # Emergency fix: impute NaN with median
    emergency_imputer = SimpleImputer(strategy='median')
    X_train = emergency_imputer.fit_transform(X_train)
    X_val = emergency_imputer.transform(X_val)
    X_test = emergency_imputer.transform(X_test)
    
    print(f"After fix - Train NaN: {np.isnan(X_train).sum()}")
    print(f"After fix - Val NaN: {np.isnan(X_val).sum()}")
    print(f"After fix - Test NaN: {np.isnan(X_test).sum()}")

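# Check for negative values (log1p returns NaN below -1 and -inf at exactly -1)
neg_train = (X_train < 0).sum()
neg_val = (X_val < 0).sum()
neg_test = (X_test < 0).sum()

# Leave X_train/X_val/X_test unchanged so later cells (e.g. the Part 4 pipeline) see the original values
X_train_pos, X_val_pos, X_test_pos = X_train, X_val, X_test

if neg_train > 0 or neg_val > 0 or neg_test > 0:
    print(f"\nWARNING: Negative values detected!")
    print(f"  Train negatives: {neg_train}")
    print(f"  Val negatives: {neg_val}")
    print(f"  Test negatives: {neg_test}")
    print("Converting to absolute values for log transform...")
    # Crude workaround: abs() alters the data, so investigate the source of negatives in real analyses
    X_train_pos = np.abs(X_train)
    X_val_pos = np.abs(X_val)
    X_test_pos = np.abs(X_test)

# Apply log transformation
X_train_log = np.log1p(X_train_pos)
X_val_log = np.log1p(X_val_pos)
X_test_log = np.log1p(X_test_pos)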

# CRITICAL: Replace any NaN values created by log transform
# This handles edge cases where log1p might produce NaN
X_train_log = np.nan_to_num(X_train_log, nan=0.0, posinf=0.0, neginf=0.0)
X_val_log = np.nan_to_num(X_val_log, nan=0.0, posinf=0.0, neginf=0.0)
X_test_log = np.nan_to_num(X_test_log, nan=0.0, posinf=0.0, neginf=0.0)

# Verify no NaN after log transform
train_nan = np.isnan(X_train_log).sum()
val_nan = np.isnan(X_val_log).sum()
test_nan = np.isnan(X_test_log).sum()

print(f"\nAfter log transform:")
print(f"  Train NaN: {train_nan}, Inf: {np.isinf(X_train_log).sum()}")
print(f"  Val NaN: {val_nan}, Inf: {np.isinf(X_val_log).sum()}")
print(f"  Test NaN: {test_nan}, Inf: {np.isinf(X_test_log).sum()}")

if train_nan > 0 or val_nan > 0 or test_nan > 0:
    print("\nERROR: NaN still present after nan_to_num! This should not happen.")
    # Last resort: use imputer
    final_imputer = SimpleImputer(strategy='constant', fill_value=0)
    X_train_log = final_imputer.fit_transform(X_train_log)
    X_val_log = final_imputer.transform(X_val_log)
    X_test_log = final_imputer.transform(X_test_log)
    print(f"Applied final imputation. NaN count: {np.isnan(X_val_log).sum()}")

# Then standardize
log_scaler = StandardScaler()
X_train_log_std = log_scaler.fit_transform(X_train_log)
X_val_log_std = log_scaler.transform(X_val_log)
X_test_log_std = log_scaler.transform(X_test_log)

print(f"\nAfter standardization:")
print(f"  Train NaN: {np.isnan(X_train_log_std).sum()}")
print(f"  Val NaN: {np.isnan(X_val_log_std).sum()}")
print(f"  Test NaN: {np.isnan(X_test_log_std).sum()}")

# Train model
model_log = LogisticRegression(random_state=42, max_iter=1000)
model_log.fit(X_train_log_std, y_train)

# Evaluate
y_val_pred_log = model_log.predict(X_val_log_std)
log_acc = accuracy_score(y_val, y_val_pred_log)

print("\n" + "="*60)
print("LOG TRANSFORMATION + STANDARDIZATION")
print("="*60)
print(f"Validation Accuracy: {log_acc:.4f}")
print(f"Improvement over baseline: {log_acc - baseline_acc:+.4f}")
print("\nConfusion Matrix:")
print(confusion_matrix(y_val, y_val_pred_log))
print("="*60)

In [None]:
# Visualize coefficients
coef_log = model_log.coef_[0]

plt.figure(figsize=(12, 4))
plt.bar(range(len(coef_log)), coef_log, alpha=0.7, color='purple')
plt.xlabel('Gene Index', fontsize=12)
plt.ylabel('Coefficient', fontsize=12)
plt.title('Model Coefficients After Log Transformation + Standardization', fontsize=14)
plt.axhline(y=0, color='black', linestyle='--', alpha=0.3)
plt.tight_layout()
plt.show()

print(f"Coefficient range: [{coef_log.min():.4f}, {coef_log.max():.4f}]")
print(f"Coefficient std: {coef_log.std():.4f}")
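
The log transform and the scaler can also be chained in a single `Pipeline` using `FunctionTransformer`. The sketch below uses synthetic log-normal features (an assumption standing in for expression values) and a toy label, so the accuracy it prints says nothing about the exercise data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

rng = np.random.default_rng(42)
# Hypothetical non-negative, right-skewed features (log-normal draws)
X_toy = rng.lognormal(mean=2.0, sigma=1.0, size=(120, 5))
y_toy = (X_toy[:, 0] > np.median(X_toy[:, 0])).astype(int)

log_pipeline = Pipeline([
    ("log", FunctionTransformer(np.log1p)),   # stateless element-wise transform
    ("scaler", StandardScaler()),             # fitted on the training rows only
    ("classifier", LogisticRegression(max_iter=1000)),
])

# First 80 rows for training, remaining 40 held out
log_pipeline.fit(X_toy[:80], y_toy[:80])
acc = log_pipeline.score(X_toy[80:], y_toy[80:])
print(f"Held-out accuracy on toy data: {acc:.3f}")
```

Because `log1p` has no fitted parameters, it cannot leak information by itself; wrapping it in the pipeline simply keeps all preprocessing steps in one object.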

## Task 3.5: Compare All Normalization Methods

In [None]:
# Create comparison table
results_df = pd.DataFrame({
    'Method': ['Baseline (No Norm)', 'Min-Max Scaling', 'Standardization', 'Log + Standardization'],
    'Validation Accuracy': [baseline_acc, minmax_acc, standard_acc, log_acc],
    'Improvement': [0, minmax_acc - baseline_acc, standard_acc - baseline_acc, log_acc - baseline_acc]
})

print("\n" + "="*70)
print("NORMALIZATION COMPARISON")
print("="*70)
print(results_df.to_string(index=False))
print("="*70)

# Find best method
best_idx = results_df['Validation Accuracy'].idxmax()
best_method = results_df.loc[best_idx, 'Method']
print(f"\nBest method: {best_method} ({results_df.loc[best_idx, 'Validation Accuracy']:.4f})")

In [None]:
# Visualize comparison
plt.figure(figsize=(10, 6))
colors = ['red', 'blue', 'green', 'purple']
bars = plt.bar(results_df['Method'], results_df['Validation Accuracy'], color=colors, alpha=0.7)
plt.ylabel('Validation Accuracy', fontsize=12)
plt.title('Model Performance: Impact of Normalization Methods', fontsize=14)
plt.xticks(rotation=45, ha='right')
plt.ylim([0.5, 1.0])
plt.axhline(y=baseline_acc, color='red', linestyle='--', alpha=0.5, label='Baseline')

# Add value labels on bars
for bar, acc in zip(bars, results_df['Validation Accuracy']):
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height + 0.01,
             f'{acc:.3f}', ha='center', va='bottom', fontsize=10, fontweight='bold')

plt.legend()
plt.tight_layout()
plt.show()

---
# Part 4: Pipeline Integration

## Task 4.1: Build a Proper Pipeline

In [None]:
# Create pipeline with StandardScaler and LogisticRegression
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(random_state=42, max_iter=1000))
])

# Fit pipeline on training data
pipeline.fit(X_train, y_train)

# Evaluate
y_val_pred_pipeline = pipeline.predict(X_val)
pipeline_acc = accuracy_score(y_val, y_val_pred_pipeline)

print("="*60)
print("SKLEARN PIPELINE")
print("="*60)
print(f"Validation Accuracy: {pipeline_acc:.4f}")
print(f"\nPipeline steps: {list(pipeline.named_steps.keys())}")
print("\nAdvantages of pipelines:")
print("  - Fits preprocessing on the training data only, which helps prevent leakage")
print("  - Cleaner, more maintainable code")
print("  - Easy to save and deploy")
print("  - Works seamlessly with cross-validation")
print("="*60)
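
A pipeline can hold the imputer as well, so that every preprocessing step is refit inside each cross-validation fold. This is a sketch on synthetic data with injected missing values; the step names and the 5% missing rate are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 4))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)
X_toy[rng.random(X_toy.shape) < 0.05] = np.nan  # inject roughly 5% missing values

full_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# Each fold refits the imputer and scaler on that fold's training portion only
scores = cross_val_score(full_pipeline, X_toy, y_toy, cv=5)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```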

## Task 4.2: Demonstrate Data Leakage Bug

**Warning**: This shows INCORRECT data handling!

In [None]:
print("⚠️  WARNING: This demonstrates INCORRECT data handling! ⚠️")
print("="*60)

# WRONG: Normalize ALL data before split (data leakage!)
leaky_scaler = StandardScaler()
X_all_normalized = leaky_scaler.fit_transform(X)  # Fit on ALL data - BAD!

# Then split
X_train_leaky, X_temp_leaky, y_train_leaky, y_temp_leaky = train_test_split(
    X_all_normalized, y, test_size=0.4, stratify=y, random_state=42
)
X_val_leaky, X_test_leaky, y_val_leaky, y_test_leaky = train_test_split(
    X_temp_leaky, y_temp_leaky, test_size=0.5, stratify=y_temp_leaky, random_state=42
)

# Train model
leaky_model = LogisticRegression(random_state=42, max_iter=1000)
leaky_model.fit(X_train_leaky, y_train_leaky)
leaky_acc = accuracy_score(y_val_leaky, leaky_model.predict(X_val_leaky))

print("Results comparison:")
print(f"  Leaky approach (WRONG):  {leaky_acc:.4f}")
print(f"  Proper approach (RIGHT): {standard_acc:.4f}")
print(f"  Difference: {leaky_acc - standard_acc:+.4f}")
print("\nWhy this is wrong:")
print("  - Validation and test set statistics leaked into training")
print("  - Performance estimate can be overly optimistic")
print("  - Validation score no longer reflects performance on truly unseen data")
print("="*60)
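
The mechanism behind the leak can be seen directly in the scaler's fitted statistics. In this toy sketch (invented numbers), one held-out sample is enough to shift the mean the scaler learns.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_tr = np.array([[1.0], [2.0], [3.0]])
X_held = np.array([[10.0]])  # hypothetical held-out sample with a large value

# Proper: statistics from the training rows only
proper = StandardScaler().fit(X_tr)
# Leaky: statistics from training and held-out rows together
leaky = StandardScaler().fit(np.vstack([X_tr, X_held]))

print("Train-only mean:", proper.mean_[0])
print("All-data mean:  ", leaky.mean_[0])
```

On a real dataset the shift is often small, so the accuracy gap in the cell above may be tiny or even negative. The approach is still incorrect because the held-out data has influenced the training inputs.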

## Task 4.3: Final Evaluation on Test Set

In [None]:
# Use the model selected on validation performance (see best_method in Task 3.5)
# Here we assume the log transformation + standardization model scored best

y_test_pred = model_log.predict(X_test_log_std)
test_acc = accuracy_score(y_test, y_test_pred)

print("\n" + "="*60)
print("FINAL TEST SET EVALUATION")
print("="*60)
print(f"Model: Log Transformation + Standardization")
print(f"Test Accuracy: {test_acc:.4f}")
print(f"Validation Accuracy: {log_acc:.4f}")
print(f"Difference: {abs(test_acc - log_acc):.4f}")
print("\nConfusion Matrix:")
cm_test = confusion_matrix(y_test, y_test_pred)
print(cm_test)
print("\nClassification Report:")
print(classification_report(y_test, y_test_pred, target_names=['Normal', 'Cancer']))
print("="*60)

In [None]:
# Visualize confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm_test, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Normal', 'Cancer'],
            yticklabels=['Normal', 'Cancer'],
            cbar_kws={'label': 'Count'})
plt.ylabel('True Label', fontsize=12)
plt.xlabel('Predicted Label', fontsize=12)
plt.title('Confusion Matrix - Test Set', fontsize=14)
plt.tight_layout()
plt.show()

---
# Summary and Key Insights

## What We Learned

### 1. Data Cleaning is Critical
- **Missing values** (7% of data) were handled with median imputation
- **Duplicates** were identified and removed to avoid bias
- **Inconsistent IDs** required standardization to merge datasets
- **Outliers** were handled through normalization rather than removal

### 2. Label Quality Matters
- Started with 11 different label terms!
- Mapped to binary classification (0=Normal, 1=Cancer)
- Removed ambiguous cases (borderline, unclear, suspicious)
- Used stratified splitting to maintain class balance

### 3. Normalization Significantly Improves Performance
- Baseline (no normalization): Struggled with different feature scales
- Min-Max scaling: Improved by bringing features to [0,1]
- Standardization: Even better, rescaling each feature to mean=0 and std=1
- Log transformation: Best for log-normally distributed gene expression

### 4. Pipelines Prevent Mistakes
- Encapsulate preprocessing + model together
- Help prevent data leakage by fitting preprocessing on training data only
- Make code cleaner and more reproducible
- Essential for production deployments

### 5. Data Leakage is a Serious Problem
- Normalizing before splitting can give overoptimistic results
- Always fit preprocessing on training data only
- Use pipelines or be very careful with manual preprocessing

## Key Takeaways

1. **Data quality often matters more than model choice**
   - A simple model on clean data beats a complex model on messy data
   
2. **Preprocessing is not optional for real-world data**
   - Real datasets always have quality issues
   - Document your cleaning decisions
   
3. **Always validate on held-out data**
   - Training accuracy alone overstates real-world performance
   - The test set gives a realistic performance estimate
   
4. **Use appropriate metrics**
   - Confusion matrix shows types of errors
   - Precision/recall matter for imbalanced classes
   
5. **Reproducibility requires discipline**
   - Set random seeds
   - Document preprocessing steps
   - Use version control

## Impact Summary Table

In [None]:
# Create comprehensive summary
summary_df = pd.DataFrame({
    'Stage': [
        'Raw messy data',
        'After cleaning',
        'No normalization',
        'Min-Max scaling',
        'Standardization',
        'Log + Standardization'
    ],
    'Data Quality': [
        'Missing, duplicates, inconsistent',
        'Clean, no missing/duplicates',
        'Clean',
        'Clean + normalized [0,1]',
        'Clean + normalized (μ=0, σ=1)',
        'Clean + log-normalized'
    ],
    'Val Accuracy': [
        'N/A',
        'N/A',
        f'{baseline_acc:.4f}',
        f'{minmax_acc:.4f}',
        f'{standard_acc:.4f}',
        f'{log_acc:.4f}'
    ]
})

print("\n" + "="*80)
print("COMPLETE PIPELINE IMPACT SUMMARY")
print("="*80)
print(summary_df.to_string(index=False))
print("="*80)
print(f"\nTotal improvement from baseline: {log_acc - baseline_acc:+.4f}")
print(f"Relative improvement: {100*(log_acc - baseline_acc)/baseline_acc:+.1f}%")

## Extension Ideas

Try these challenges to deepen your understanding:

1. **Different ML Models**
   ```python
   from sklearn.ensemble import RandomForestClassifier
   from sklearn.svm import SVC
   # Compare performance across models
   ```

2. **Cross-Validation**
   ```python
   from sklearn.model_selection import cross_val_score
   scores = cross_val_score(pipeline, X_train, y_train, cv=5)
   ```

3. **Feature Selection**
   ```python
   # Find top 10 most important genes
   coef_importance = np.abs(model_log.coef_[0])
   top_genes_idx = np.argsort(coef_importance)[-10:]
   ```

4. **Dimensionality Reduction**
   ```python
   from sklearn.decomposition import PCA
   pca = PCA(n_components=10)
   X_train_pca = pca.fit_transform(X_train_standard)
   ```

5. **Handle Class Imbalance**
   ```python
   from imblearn.over_sampling import SMOTE
   smote = SMOTE(random_state=42)
   X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
   ```

6. **Error Analysis**
   ```python
   # Find misclassified samples
   misclassified = X_test[y_test != y_test_pred]
   # Analyze what makes them difficult
   ```

---
## Congratulations!

You've completed a comprehensive machine learning preprocessing pipeline!

### Skills Mastered:
- Data cleaning and quality assessment
- Label standardization and handling ambiguity
- Multiple normalization techniques
- Proper train/validation/test splitting
- Avoiding data leakage
- Building sklearn pipelines
- Model evaluation and comparison

### Next Steps:
1. Apply these techniques to your own datasets
2. Experiment with different models and parameters
3. Learn about feature engineering
4. Explore automated ML (AutoML) tools
5. Study deep learning for complex patterns

Remember: **Good data beats fancy algorithms!**