# Tutorial 08: Comprehensive Guide to Missing Value Imputation

This tutorial walks through the missing value imputation methods that ScpTensor provides for single-cell proteomics data: when each method applies, how to run it, and how to compare the results.

## Learning Objectives

By the end of this tutorial, you will:
- Understand different types of missing values (MCAR, MAR, MNAR)
- Learn about all 10 available imputation methods in ScpTensor
- Know when to use each imputation method
- Apply each method with code examples
- Visualize and compare imputation results
- Evaluate imputation quality
- Follow best practices for imputation workflows

## Table of Contents

1. [Understanding Missing Values](#1-understanding-missing-values)
2. [Imputation Methods Overview](#2-imputation-methods-overview)
3. [MCAR Methods - Missing Completely At Random](#3-mcar-methods---missing-completely-at-random)
4. [MNAR Methods - Missing Not At Random](#4-mnar-methods---missing-not-at-random)
5. [Advanced Methods](#5-advanced-methods)
6. [Performance Comparison](#6-performance-comparison)
7. [Best Practices](#7-best-practices)

## 1. Understanding Missing Values

### 1.1 Types of Missing Values

Single-cell proteomics data contains different types of missing values:

| Type | Full Name | Description | Common Cause | Recommended Methods |
|------|-----------|-------------|--------------|---------------------|
| **MCAR** | Missing Completely At Random | Missingness unrelated to data | Technical dropout, random failures | KNN, PPCA, SVD, LLS, BPCA, MissForest |
| **MAR** | Missing At Random | Missingness related to observed data | Systematic bias, batch effects | KNN, MissForest, BPCA |
| **MNAR** | Missing Not At Random | Missingness related to unobserved value | Below detection limit (LOD) | QRILC, MinProb, MinDet |
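
To make the distinction concrete, here is a minimal NumPy sketch that applies MCAR and MNAR missingness to the same synthetic matrix. The data, the 20% missing rate, and the detection threshold are illustrative assumptions and are not taken from any ScpTensor dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic log-intensities: 200 samples x 50 features (illustrative only)
X = rng.normal(loc=20.0, scale=2.0, size=(200, 50))

# MCAR: every value has the same 20% chance of being dropped
mcar_mask = rng.random(X.shape) < 0.2

# MNAR: values below a detection threshold are dropped (left-censoring)
threshold = np.quantile(X, 0.2)
mnar_mask = X < threshold

# Compare the mean of the observed values with the true mean
true_mean = X.mean()
mcar_observed_mean = X[~mcar_mask].mean()
mnar_observed_mean = X[~mnar_mask].mean()

print(f"True mean:          {true_mean:.2f}")
print(f"Observed mean MCAR: {mcar_observed_mean:.2f}")
print(f"Observed mean MNAR: {mnar_observed_mean:.2f}")
```

Under MCAR the observed mean stays close to the true mean, while under MNAR it is biased upward because the low values are the ones that disappear. This is why the two cases call for different imputation strategies.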

### 1.2 Mask Codes in ScpTensor

ScpTensor uses mask codes to track the origin of each value:

| Code | Name | Description |
|------|------|-------------|
| 0 | VALID | Valid detected value |
| 1 | MBR | Missing Between Runs (technical variation) |
| 2 | LOD | Limit of Detection (below threshold) |
| 3 | FILTERED | Removed by quality control |
| 5 | IMPUTED | Filled by imputation method |
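
As a rough illustration of how such a mask can be summarized, the sketch below tallies the codes in a small integer matrix with plain NumPy. The helper `tally_mask_codes` and the `MASK_NAMES` mapping are hypothetical names for this example; they are not the ScpTensor `count_mask_codes` implementation.

```python
import numpy as np

# Code-to-name mapping copied from the table above (hypothetical constant name)
MASK_NAMES = {0: "VALID", 1: "MBR", 2: "LOD", 3: "FILTERED", 5: "IMPUTED"}


def tally_mask_codes(M):
    """Return {code: count} for an integer mask matrix."""
    codes, counts = np.unique(np.asarray(M), return_counts=True)
    return {int(code): int(count) for code, count in zip(codes, counts)}


# Toy mask: 2 samples x 3 features
M = np.array([[0, 0, 1], [2, 0, 5]])
tally = tally_mask_codes(M)
for code, count in tally.items():
    print(f"{MASK_NAMES.get(code, 'UNKNOWN')} ({code}): {count}")
```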

### 1.3 Setup

First, let's import the required libraries and load a dataset.

In [None]:
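from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import scienceplots  # noqa: F401  (registers the "science" styles; needed by SciencePlots >= 2.0)
from matplotlib.gridspec import GridSpec

# Apply SciencePlots style
plt.style.use(["science", "no-latex"])
plt.rcParams["figure.dpi"] = 100

# Create the directory that later cells save figures and tables into
Path("tutorial_output").mkdir(exist_ok=True)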

# Import ScpTensor
import scptensor
from scptensor import (
    # Utilities
    count_mask_codes,
    impute_bpca,
    # All imputation methods
    impute_knn,
    impute_lls,
    impute_mf,
    impute_mindet,
    impute_minprob,
    impute_ppca,
    impute_qrilc,
    impute_svd,
    norm_log,
)
from scptensor.datasets import load_simulated_scrnaseq_like

print(f"ScpTensor version: {scptensor.__version__}")

### 1.4 Load and Explore Data

In [None]:
# Load dataset
container = load_simulated_scrnaseq_like()

print(f"Dataset loaded: {container}")
print(f"Samples: {container.n_samples}")
print(f"Features: {container.assays['proteins'].n_features}")
print(f"Available layers: {list(container.assays['proteins'].layers.keys())}")

### 1.5 Analyze Missing Value Patterns

In [None]:
# Analyze missing value patterns
matrix = container.assays["proteins"].layers["raw"]
mask_counts = count_mask_codes(matrix.M)

print("Missing Value Analysis:")
print("=" * 50)
print(f"Total values: {matrix.M.size}")
for code, name in [(0, "VALID"), (1, "MBR"), (2, "LOD"), (3, "FILTERED"), (5, "IMPUTED")]:
    count = mask_counts.get(code, 0)
    pct = count / matrix.M.size * 100
    print(f"{name} ({code}): {count} ({pct:.1f}%)")

# Overall missing rate
missing_rate = (matrix.M != 0).sum() / matrix.M.size
print(f"\nOverall missing rate: {missing_rate * 100:.1f}%")

### 1.6 Visualize Missing Patterns

In [None]:
# Visualize missing value patterns
fig = plt.figure(figsize=(14, 10))
gs = GridSpec(3, 3, figure=fig)

# Missing rate per sample
ax1 = fig.add_subplot(gs[0, 0])
sample_missing = (matrix.M != 0).sum(axis=1) / matrix.M.shape[1]
ax1.bar(range(len(sample_missing)), sample_missing, color="steelblue", alpha=0.7)
ax1.set_xlabel("Sample Index")
ax1.set_ylabel("Missing Rate")
ax1.set_title("Missing Rate per Sample")
ax1.grid(axis="y", alpha=0.3)

# Missing rate per feature
ax2 = fig.add_subplot(gs[0, 1])
feature_missing = (matrix.M != 0).sum(axis=0) / matrix.M.shape[0]
ax2.hist(feature_missing, bins=30, color="coral", edgecolor="black", alpha=0.7)
ax2.set_xlabel("Missing Rate")
ax2.set_ylabel("Number of Features")
ax2.set_title("Distribution of Missing Rate per Feature")
ax2.grid(axis="y", alpha=0.3)

# Overall missing distribution
ax3 = fig.add_subplot(gs[0, 2])
mask_labels = ["VALID", "MBR", "LOD", "FILTERED"]
mask_values = [mask_counts.get(i, 0) for i in [0, 1, 2, 3]]
colors_pie = ["lightgreen", "orange", "red", "gray"]
ax3.pie(mask_values, labels=mask_labels, colors=colors_pie, autopct="%1.1f%%", startangle=90)
ax3.set_title("Mask Code Distribution")

# Spy plot (missing pattern) - subset
ax4 = fig.add_subplot(gs[1, :])
n_show = min(100, matrix.M.shape[0], matrix.M.shape[1])
missing_mask = (matrix.M[:n_show, :n_show] != 0).astype(float)
ax4.imshow(missing_mask, aspect="auto", cmap="Reds", interpolation="none")
ax4.set_xlabel(f"Feature Index (1-{n_show})")
ax4.set_ylabel(f"Sample Index (1-{n_show})")
ax4.set_title("Missing Value Pattern (Spy Plot)")

# Missing by mask code type
ax5 = fig.add_subplot(gs[2, :])
x_pos = np.arange(n_show)  # one bar per displayed feature, matching lod_rate/mbr_rate
width = 0.6
lod_rate = (matrix.M[:, :n_show] == 2).sum(axis=0) / matrix.M.shape[0]
mbr_rate = (matrix.M[:, :n_show] == 1).sum(axis=0) / matrix.M.shape[0]
ax5.bar(x_pos, lod_rate[:n_show], width, label="LOD (MNAR)", color="coral", alpha=0.7)
ax5.bar(
    x_pos,
    mbr_rate[:n_show],
    width,
    bottom=lod_rate[:n_show],
    label="MBR (MCAR)",
    color="steelblue",
    alpha=0.7,
)
ax5.set_xlabel("Feature Index")
ax5.set_ylabel("Missing Rate")
ax5.set_title("Missing Type by Feature")
ax5.legend()
ax5.set_xlim(-0.5, n_show - 0.5)

plt.tight_layout()
plt.savefig("tutorial_output/imputation_missing_patterns.png", dpi=300, bbox_inches="tight")
plt.show()

print("Missing pattern visualizations saved.")

---

## 2. Imputation Methods Overview

ScpTensor provides **10 imputation methods** for different scenarios:

### 2.1 Method Selection Guide

```python
# Quick decision tree, written as a helper so it can be run directly
def choose_method(missingness_type, n_samples, want_best_accuracy=True, high_correlation=False):
    if missingness_type == "MCAR":  # Technical dropout
        # KNN: fast, accurate for small data; PPCA: scalable to large data
        return "KNN" if n_samples < 1000 else "PPCA"
    if missingness_type == "MNAR":  # Below detection limit
        # QRILC: best for left-censored data; MinProb: faster, simpler
        return "QRILC" if want_best_accuracy else "MinProb"
    if high_correlation:
        return "LLS"  # Exploits feature correlations
    return "BPCA"  # Robust, automatic regularization


print(choose_method("MNAR", n_samples=500))  # QRILC
```

### 2.2 Method Comparison Table

| Method | Full Name | Missing Type | Speed | Accuracy | Best For |
|--------|-----------|--------------|-------|----------|----------|
| **KNN** | K-Nearest Neighbors | MCAR | Fast | Good | Small datasets, general use |
| **PPCA** | Probabilistic PCA | MCAR | Medium | Good | Large datasets, linear structure |
| **BPCA** | Bayesian PCA | MCAR | Slow | Excellent | Automatic model selection |
| **SVD** | Singular Value Decomposition | MCAR | Fast | Good | Dimensionality reduction |
| **MissForest** | Random Forest | MCAR/MAR | Very Slow | Excellent | Complex patterns |
| **LLS** | Local Least Squares | MCAR | Medium | Excellent | Correlated features |
| **QRILC** | Quantile Regression Imputation of Left-Censored Data | MNAR | Medium | Excellent | Below detection limit |
| **MinProb** | Probabilistic Minimum | MNAR | Fast | Good | Fast MNAR imputation |
| **MinDet** | Deterministic Minimum | MNAR | Very Fast | Fair | Quick baseline |
| **NMF** | Non-negative Matrix Factorization | MCAR | Medium | Good | Non-negative data |

---

## 3. MCAR Methods - Missing Completely At Random

These methods assume missingness is unrelated to the data values.

### 3.1 Preprocessing: Log Normalization

Before imputation, we typically apply log normalization to stabilize variance.
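
For reference, the transform itself is simple. The sketch below computes log(x + offset) in a chosen base with NumPy and leaves missing entries as NaN. The function name `log_transform` is hypothetical and only mirrors the `base` and `offset` arguments used in the next cell; it is not the `norm_log` implementation.

```python
import numpy as np


def log_transform(X, base=2.0, offset=1.0):
    """Compute log_base(X + offset); NaN entries stay NaN."""
    X = np.asarray(X, dtype=float)
    return np.log(X + offset) / np.log(base)


# Toy intensities with one missing value
raw = np.array([[0.0, 3.0, np.nan], [7.0, 1023.0, 15.0]])
logged = log_transform(raw)
print(logged)
```

The offset keeps zero intensities finite, since log2(0 + 1) = 0.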

In [None]:
# Apply log normalization
container = norm_log(
    container,
    assay_name="proteins",
    base_layer="raw",
    new_layer_name="log",
    base=2.0,
    offset=1.0,
)

print("Log normalization completed.")
print(f"Available layers: {list(container.assays['proteins'].layers.keys())}")

### 3.2 K-Nearest Neighbors (KNN) Imputation

**Algorithm:** For each sample with missing values, find k nearest neighbors and impute using the average of their values.

**Best for:** Small to medium datasets, general-purpose imputation.

**Key Parameters:**
- `k`: Number of neighbors (default: 10)
- Larger k = more smoothing, less local variation
- Smaller k = more sensitive to noise
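
The idea can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions (root mean squared distance over features observed in both samples, unweighted neighbor average); the function `knn_impute` is hypothetical and is not the `impute_knn` implementation.

```python
import numpy as np


def knn_impute(X, k=3):
    """Fill NaNs in each row with the mean of its k nearest rows."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = X.copy()
    for i in np.where(missing.any(axis=1))[0]:
        # Distance to every row, using only features observed in both rows
        shared = ~missing[i] & ~missing
        diff = np.where(shared, X - X[i], 0.0)
        n_shared = shared.sum(axis=1)
        dist = np.full(X.shape[0], np.inf)
        ok = n_shared > 0
        dist[ok] = np.sqrt((diff[ok] ** 2).sum(axis=1) / n_shared[ok])
        dist[i] = np.inf  # a row is not its own neighbor
        order = np.argsort(dist)
        for j in np.where(missing[i])[0]:
            # Use the k closest rows that actually observed feature j
            donors = [r for r in order if not missing[r, j] and np.isfinite(dist[r])][:k]
            if donors:
                filled[i, j] = X[donors, j].mean()
    return filled


X = np.array(
    [
        [1.0, 2.0, 3.0],
        [1.1, 2.1, np.nan],
        [1.0, 2.0, 3.2],
        [9.0, 9.0, 9.0],
    ]
)
filled = knn_impute(X, k=2)
print(filled[1, 2])
```

With `k=2`, the missing entry is filled from the two nearby samples and the distant fourth sample is ignored. Raising `k` to 3 would pull the estimate toward that outlier, which is the smoothing effect described above.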

In [None]:
# Apply KNN imputation
print("Running KNN imputation...")

container = impute_knn(
    container,
    assay_name="proteins",
    base_layer="log",
    new_layer_name="knn_imputed",
    k=10,  # Number of neighbors
)

print("KNN imputation completed.")

# Check results
knn_matrix = container.assays["proteins"].layers["knn_imputed"]
knn_imputed_rate = (knn_matrix.M == 5).sum() / knn_matrix.M.size
print(f"Imputed values marked: {knn_imputed_rate * 100:.2f}%")
print(f"Imputed count: {(knn_matrix.M == 5).sum()}")

### 3.3 Probabilistic PCA (PPCA) Imputation

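**Algorithm:** Models each sample as x = Wz + mu + epsilon, where z is a low-dimensional latent vector and epsilon is isotropic Gaussian noise. The EM algorithm alternates between estimating the latent variables and updating the model parameters.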

**Best for:** Large datasets, data with linear structure.

**Key Parameters:**
- `n_components`: Number of principal components
- `max_iter`: Maximum EM iterations
- `tol`: Convergence tolerance

In [None]:
# Apply PPCA imputation
print("Running PPCA imputation...")

container = impute_ppca(
    container,
    assay_name="proteins",
    base_layer="log",
    new_layer_name="ppca_imputed",
    n_components=10,  # Number of principal components
    max_iter=100,
)

print("PPCA imputation completed.")

# Check results
ppca_matrix = container.assays["proteins"].layers["ppca_imputed"]
print(f"Imputed values: {(ppca_matrix.M == 5).sum()}")

### 3.4 Bayesian PCA (BPCA) Imputation

**Algorithm:** Extends PPCA with Bayesian inference and Automatic Relevance Determination (ARD) priors. Automatically determines effective number of components.

**Best for:** Automatic model selection, avoiding overfitting.

**Key Features:**
- Uses ARD priors to prune unnecessary components
- More robust than standard PPCA
- Automatic determination of effective dimensions

In [None]:
# Apply BPCA imputation
print("Running BPCA imputation...")

container = impute_bpca(
    container,
    assay_name="proteins",
    base_layer="log",
    new_layer_name="bpca_imputed",
    n_components=10,  # Maximum components
    max_iter=100,
    random_state=42,
)

print("BPCA imputation completed.")

# Check results
bpca_matrix = container.assays["proteins"].layers["bpca_imputed"]
print(f"Imputed values: {(bpca_matrix.M == 5).sum()}")

### 3.5 SVD Imputation

**Algorithm:** Iterative SVD - alternates between computing a low-rank SVD approximation and refilling the missing entries from that approximation, until the filled values stop changing.

**Best for:** Large datasets, fast dimensionality reduction.

**Key Parameters:**
- `rank`: Rank for SVD approximation
- `max_iter`: Maximum iterations
- `tol`: Convergence tolerance
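
The alternating scheme can be written compactly with NumPy. The sketch below is an illustration under simple assumptions (column-mean initialization, hard rank truncation, relative-change stopping rule); `iterative_svd_impute` is a hypothetical name and not the `impute_svd` implementation.

```python
import numpy as np


def iterative_svd_impute(X, rank=2, max_iter=100, tol=1e-6):
    """Fill NaNs by repeatedly projecting onto a rank-`rank` approximation."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Start from column means (assumes every column has an observed value)
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(max_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # Observed entries are kept; only missing entries are refreshed
        updated = np.where(missing, low_rank, X)
        change = np.linalg.norm(updated - filled) / max(np.linalg.norm(filled), 1e-12)
        filled = updated
        if change < tol:
            break
    return filled


# Exactly rank-1 toy matrix with one entry removed
full = np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0])
holed = full.copy()
holed[0, 0] = np.nan
recovered = iterative_svd_impute(holed, rank=1)
print(f"true: {full[0, 0]:.3f}, recovered: {recovered[0, 0]:.3f}")
```

Because the toy matrix is exactly rank 1, the iteration recovers the removed entry almost perfectly. Real data is only approximately low-rank, so the choice of `rank` trades fidelity against overfitting.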

In [None]:
# Apply SVD imputation
print("Running SVD imputation...")

container = impute_svd(
    container,
    assay_name="proteins",
    base_layer="log",
    new_layer_name="svd_imputed",
    rank=10,  # SVD rank
)

print("SVD imputation completed.")

# Check results
svd_matrix = container.assays["proteins"].layers["svd_imputed"]
print(f"Imputed values: {(svd_matrix.M == 5).sum()}")

---

## 4. MNAR Methods - Missing Not At Random

These methods are designed for left-censored data where missingness is due to values being below the detection limit (LOD).

### 4.1 QRILC (Quantile Regression Imputation of Left-Censored Data)

**Algorithm:** For each feature, estimates the detection limit and fits a normal distribution to the detected values. Missing values are then drawn from the truncated lower tail of that distribution, below the estimated limit.

**Best for:** MNAR data, below detection limit missingness.

**Key Parameters:**
- `q`: Quantile for detection limit (0-1, default: 0.01)
- Lower q = more aggressive censoring threshold
- `random_state`: Random seed for reproducibility
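
The sketch below illustrates the general QRILC idea for a single feature: fit a normal distribution by regressing observed quantiles on theoretical normal quantiles, then draw missing values from the tail below the censoring point. It is a simplified approximation under the assumption that all missing values are left-censored; the function name is hypothetical, the censoring point is inferred from the missing fraction rather than from a `q` argument, and this is not the `impute_qrilc` implementation.

```python
from statistics import NormalDist

import numpy as np


def qrilc_like_impute_column(col, random_state=None, n_quantiles=100):
    """Impute NaNs in one feature from the lower tail of a fitted normal."""
    rng = np.random.default_rng(random_state)
    col = np.asarray(col, dtype=float)
    miss = np.isnan(col)
    p_miss = miss.mean()
    if p_miss == 0.0 or p_miss > 0.9:
        return col.copy()  # nothing to impute, or too little data to fit
    std_normal = NormalDist()
    # Observed values are assumed to cover the upper (1 - p_miss) of the distribution
    probs_full = np.linspace(p_miss + 0.001, 0.99, n_quantiles)
    theoretical = np.array([std_normal.inv_cdf(p) for p in probs_full])
    probs_observed = (probs_full - p_miss) / (1.0 - p_miss)
    empirical = np.quantile(col[~miss], probs_observed)
    # Linear fit: empirical = mu + sd * theoretical
    sd, mu = np.polyfit(theoretical, empirical, 1)
    # Inverse-CDF sampling restricted to the censored tail (probability < p_miss)
    u = rng.uniform(1e-6, p_miss, size=int(miss.sum()))
    draws = mu + sd * np.array([std_normal.inv_cdf(p) for p in u])
    out = col.copy()
    out[miss] = draws
    return out


rng = np.random.default_rng(0)
true_values = rng.normal(20.0, 2.0, size=2000)
threshold = np.quantile(true_values, 0.2)
censored = np.where(true_values < threshold, np.nan, true_values)
imputed = qrilc_like_impute_column(censored, random_state=42)
print(f"true mean: {true_values.mean():.2f}, imputed mean: {imputed.mean():.2f}")
```

On this synthetic left-censored feature the imputed values fall below the detection threshold and the feature mean stays close to the true mean, which mean or KNN imputation would overestimate.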

In [None]:
# Apply QRILC imputation
print("Running QRILC imputation...")

container = impute_qrilc(
    container,
    assay_name="proteins",
    source_layer="log",
    new_layer_name="qrilc_imputed",
    q=0.01,  # Left-censoring quantile
    random_state=42,
)

print("QRILC imputation completed.")

# Check results
qrilc_matrix = container.assays["proteins"].layers["qrilc_imputed"]
print(f"Imputed values: {(qrilc_matrix.M == 5).sum()}")

### 4.2 MinProb (Probabilistic Minimum) Imputation

**Algorithm:** For each feature, samples from a distribution centered at the minimum detected value.

**Best for:** Fast MNAR imputation, preserves uncertainty.

**Key Parameters:**
- `sigma`: Distribution width multiplier (default: 2.0)
- Larger sigma = wider distribution
- `random_state`: Random seed

In [None]:
# Apply MinProb imputation
print("Running MinProb imputation...")

container = impute_minprob(
    container,
    assay_name="proteins",
    source_layer="log",
    new_layer_name="minprob_imputed",
    sigma=2.0,  # Distribution width
    random_state=42,
)

print("MinProb imputation completed.")

# Check results
minprob_matrix = container.assays["proteins"].layers["minprob_imputed"]
print(f"Imputed values: {(minprob_matrix.M == 5).sum()}")

### 4.3 MinDet (Deterministic Minimum) Imputation

**Algorithm:** Uses a fixed deterministic value for all missing values in each feature.

**Best for:** Quick baseline, reproducible results.

**Key Parameters:**
- `sigma`: Controls how far below minimum to impute (default: 1.0)
- No random_state needed (deterministic)
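
A deterministic minimum fill can be sketched as follows. The exact formula is an assumption made for illustration: each feature's fill value is taken as its observed minimum minus `sigma` times its observed standard deviation. The function `mindet_impute` is hypothetical and may differ from what `impute_mindet` actually computes.

```python
import numpy as np


def mindet_impute(X, sigma=1.0):
    """Fill NaNs per feature with (observed min - sigma * observed std)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # ASSUMPTION: the offset below the minimum is sigma * per-feature std
    fill_values = np.nanmin(X, axis=0) - sigma * np.nanstd(X, axis=0)
    filled = X.copy()
    filled[missing] = np.broadcast_to(fill_values, X.shape)[missing]
    return filled


X = np.array(
    [
        [10.0, 5.0],
        [12.0, np.nan],
        [np.nan, 7.0],
        [14.0, 9.0],
    ]
)
filled = mindet_impute(X, sigma=1.0)
print(filled)
```

Every missing entry in a feature receives the same value, so the result is reproducible without a seed but adds no variance to the imputed values.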

In [None]:
# Apply MinDet imputation
print("Running MinDet imputation...")

container = impute_mindet(
    container,
    assay_name="proteins",
    source_layer="log",
    new_layer_name="mindet_imputed",
    sigma=1.0,
)

print("MinDet imputation completed.")

# Check results
mindet_matrix = container.assays["proteins"].layers["mindet_imputed"]
print(f"Imputed values: {(mindet_matrix.M == 5).sum()}")

---

## 5. Advanced Methods

### 5.1 Local Least Squares (LLS) Imputation

**Algorithm:** Combines KNN with local linear regression. For each sample with missing values, finds K nearest neighbors and builds a local linear model to predict missing values.

**Best for:** High-dimensional data with correlated features.

**Key Parameters:**
- `k`: Number of neighbors (default: 10)
- `max_iter`: Maximum iterations for convergence
- `tol`: Convergence threshold
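
A single pass of the local regression step can be sketched with `np.linalg.lstsq`. This simplified version assumes neighbors are drawn only from complete samples and performs no iteration, so it is an illustration of the idea rather than the `impute_lls` implementation; `lls_impute` is a hypothetical name.

```python
import numpy as np


def lls_impute(X, k=3):
    """One-pass local least squares imputation using complete rows as neighbors."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    complete_rows = np.where(~missing.any(axis=1))[0]
    filled = X.copy()
    for i in np.where(missing.any(axis=1))[0]:
        obs, mis = ~missing[i], missing[i]
        if not obs.any() or complete_rows.size == 0:
            continue
        pool = X[complete_rows]
        # k nearest complete rows, measured on the features observed in row i
        dist = np.linalg.norm(pool[:, obs] - X[i, obs], axis=1)
        nearest = pool[np.argsort(dist)[:k]]
        # Express row i as a linear combination of its neighbors on observed features
        coef, *_ = np.linalg.lstsq(nearest[:, obs].T, X[i, obs], rcond=None)
        # Apply the same combination to the neighbors' values at the missing features
        filled[i, mis] = nearest[:, mis].T @ coef
    return filled


r0 = np.array([1.0, 2.0, 3.0, 4.0])
r1 = np.array([2.0, 1.0, 0.0, 1.0])
target = 2 * r0 + r1  # [4, 5, 6, 9]
X = np.vstack([r0, r1, target])
X[2, 3] = np.nan
filled = lls_impute(X, k=2)
print(filled[2, 3])
```

Because the third row is an exact linear combination of the first two, the regression recovers the hidden value. Plain KNN averaging would not, which is why LLS tends to help when features are strongly correlated.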

In [None]:
# Apply LLS imputation
print("Running LLS imputation (this may take a moment)...")

container = impute_lls(
    container,
    assay_name="proteins",
    source_layer="log",
    new_layer_name="lls_imputed",
    k=10,
    max_iter=10,  # Reduced for faster execution
)

print("LLS imputation completed.")

# Check results
lls_matrix = container.assays["proteins"].layers["lls_imputed"]
print(f"Imputed values: {(lls_matrix.M == 5).sum()}")

### 5.2 MissForest (Random Forest) Imputation

**Algorithm:** Iterative random forest imputation. For each feature with missing values, trains a random forest on observed features and predicts missing values.

**Best for:** Complex non-linear patterns, mixed data types.

**Note:** This is the slowest method, but it is often the most accurate for complex patterns.

In [None]:
# Apply MissForest imputation
print("Running MissForest imputation (this may take longer)...")

try:
    container = impute_mf(
        container,
        assay_name="proteins",
        base_layer="log",
        new_layer_name="mf_imputed",
        max_iter=10,  # Maximum iterations
        n_estimators=50,  # Number of trees (reduced for speed)
    )
    print("MissForest imputation completed.")

    # Check results
    mf_matrix = container.assays["proteins"].layers["mf_imputed"]
    print(f"Imputed values: {(mf_matrix.M == 5).sum()}")
except ImportError:
    print("MissForest skipped: scikit-learn not available.")

---

## 6. Performance Comparison

### 6.1 Visualize Imputation Results

Let's compare the distributions of imputed values across methods.

In [None]:
# Collect all imputed layers
imputed_layers = {
    "KNN": "knn_imputed",
    "PPCA": "ppca_imputed",
    "BPCA": "bpca_imputed",
    "SVD": "svd_imputed",
    "LLS": "lls_imputed",
    "QRILC": "qrilc_imputed",
    "MinProb": "minprob_imputed",
    "MinDet": "mindet_imputed",
}

# Filter layers that exist
available_layers = {
    k: v for k, v in imputed_layers.items() if v in container.assays["proteins"].layers
}

print(f"Available imputed layers: {list(available_layers.keys())}")

In [None]:
# Visualize imputation distributions
fig, axes = plt.subplots(3, 3, figsize=(16, 12))
axes = axes.flatten()

# Original log data for comparison
log_data = container.assays["proteins"].layers["log"].X.flatten()[:5000]
log_data = log_data[np.isfinite(log_data)]  # hist() cannot bin NaN values
axes[0].hist(log_data, bins=50, color="gray", alpha=0.7, edgecolor="black")
axes[0].set_title("Original (Log, observed values only)")
axes[0].set_xlabel("Intensity")
axes[0].set_ylabel("Frequency")
axes[0].grid(axis="y", alpha=0.3)

# Plot each imputed method
for i, (name, layer_name) in enumerate(available_layers.items(), 1):
    if i >= len(axes):
        break
    X = container.assays["proteins"].layers[layer_name].X.flatten()[:5000]
    axes[i].hist(X, bins=50, alpha=0.7, edgecolor="black")
    axes[i].set_title(f"{name}")
    axes[i].set_xlabel("Intensity")
    axes[i].set_ylabel("Frequency")
    axes[i].grid(axis="y", alpha=0.3)

# Hide empty subplots
for i in range(len(available_layers) + 1, len(axes)):
    axes[i].set_visible(False)

plt.tight_layout()
plt.savefig("tutorial_output/imputation_distributions.png", dpi=300, bbox_inches="tight")
plt.show()

print("Distribution comparison saved.")

### 6.2 Pairwise Scatter Comparison

In [None]:
# Pairwise scatter comparison of methods
methods_to_compare = list(available_layers.items())[:4]  # First 4 methods

fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.flatten()

# Select a subset of samples for visualization
n_samples_show = min(500, container.n_samples)
n_features_show = min(20, container.assays["proteins"].n_features)

# Compare each method with the next one in the list
for idx, ((name1, layer1), (name2, layer2)) in enumerate(
    zip(methods_to_compare, methods_to_compare[1:])
):

    X1 = container.assays["proteins"].layers[layer1].X[:n_samples_show, :n_features_show].flatten()
    X2 = container.assays["proteins"].layers[layer2].X[:n_samples_show, :n_features_show].flatten()

    axes[idx].scatter(X1, X2, alpha=0.3, s=5)
    axes[idx].plot([X1.min(), X1.max()], [X1.min(), X1.max()], "r--", lw=2)
    axes[idx].set_xlabel(f"{name1}")
    axes[idx].set_ylabel(f"{name2}")
    axes[idx].set_title(f"{name1} vs {name2}")
    axes[idx].grid(alpha=0.3)

    # Compute correlation
    corr = np.corrcoef(X1, X2)[0, 1]
    axes[idx].text(
        0.05,
        0.95,
        f"r = {corr:.3f}",
        transform=axes[idx].transAxes,
        bbox=dict(boxstyle="round", facecolor="wheat", alpha=0.5),
    )

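# Hide panels that were not used (consecutive pairs fill one fewer panel than methods)
for ax in axes[max(len(methods_to_compare) - 1, 0):]:
    ax.set_visible(False)

plt.tight_layout()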
plt.savefig("tutorial_output/imputation_scatter_comparison.png", dpi=300, bbox_inches="tight")
plt.show()

print("Scatter comparison saved.")

### 6.3 Statistical Summary of Imputation Results

In [None]:
# Statistical summary
import pandas as pd

summary_data = []
for name, layer_name in available_layers.items():
    X = container.assays["proteins"].layers[layer_name].X
    summary_data.append(
        {
            "Method": name,
            "Mean": np.mean(X),
            "Std": np.std(X),
            "Min": np.min(X),
            "Median": np.median(X),
            "Max": np.max(X),
        }
    )

summary_df = pd.DataFrame(summary_data)
print("\nImputation Summary Statistics:")
print("=" * 80)
print(summary_df.to_string(index=False))

# Save summary
summary_df.to_csv("tutorial_output/imputation_summary.csv", index=False)
print("\nSummary saved to tutorial_output/imputation_summary.csv")

### 6.4 Correlation Heatmap of Imputed Layers

In [None]:
# Correlation heatmap between methods
method_names = list(available_layers.keys())
n_methods = len(method_names)
corr_matrix = np.zeros((n_methods, n_methods))

# Compute correlations
for i, name1 in enumerate(method_names):
    for j, name2 in enumerate(method_names):
        layer1 = available_layers[name1]
        layer2 = available_layers[name2]
        X1 = container.assays["proteins"].layers[layer1].X.flatten()[:10000]
        X2 = container.assays["proteins"].layers[layer2].X.flatten()[:10000]
        corr_matrix[i, j] = np.corrcoef(X1, X2)[0, 1]

# Plot heatmap
fig, ax = plt.subplots(figsize=(10, 8))
im = ax.imshow(corr_matrix, cmap="RdBu_r", vmin=-1, vmax=1, aspect="auto")

# Add labels
ax.set_xticks(range(n_methods))
ax.set_yticks(range(n_methods))
ax.set_xticklabels(method_names, rotation=45, ha="right")
ax.set_yticklabels(method_names)

# Add correlation values
for i in range(n_methods):
    for j in range(n_methods):
        ax.text(
            j, i, f"{corr_matrix[i, j]:.2f}", ha="center", va="center", color="black", fontsize=9
        )

ax.set_title("Correlation Between Imputation Methods")
plt.colorbar(im, ax=ax, label="Correlation")
plt.tight_layout()
plt.savefig("tutorial_output/imputation_correlation_heatmap.png", dpi=300, bbox_inches="tight")
plt.show()

print("Correlation heatmap saved.")

---

## 7. Best Practices

### 7.1 Imputation Workflow

```python
# Recommended workflow
from scptensor import (
    load_csv, norm_log, count_mask_codes,
    impute_knn, impute_qrilc
)

# 1. Load data
container = load_csv("data.csv")

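# 2. Analyze missingness patterns
mask_counts = count_mask_codes(container.assays['proteins'].layers['raw'].M)
n_lod = mask_counts.get(2, 0)
n_missing = mask_counts.get(1, 0) + n_lod  # MBR + LOD
lod_ratio = n_lod / n_missing if n_missing else 0.0  # share of missing values that are LOD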

# 3. Normalize
container = norm_log(container, "proteins", "raw", "log")

# 4. Choose imputation method based on missingness
if lod_ratio > 0.5:  # Mostly MNAR
    container = impute_qrilc(container, "proteins", "log", "imputed")
else:  # Mostly MCAR
    container = impute_knn(container, "proteins", "log", "imputed", k=10)

# 5. Verify imputation
imputed_matrix = container.assays['proteins'].layers['imputed']
print(f"Imputed: {(imputed_matrix.M == 5).sum()} values")
```

### 7.2 Choosing the Right Method

| Scenario | Recommended Method | Alternative |
|----------|-------------------|-------------|
| Quick analysis, small data | KNN | PPCA |
| Large dataset (>10k samples) | PPCA | SVD |
| High LOD missingness | QRILC | MinProb |
| Correlated features | LLS | BPCA |
| Complex patterns, enough time | MissForest | BPCA |
| Baseline comparison | MinDet | KNN |
| Automatic model selection | BPCA | PPCA |

### 7.3 Common Pitfalls to Avoid

1. **Not analyzing missingness type first**
   - Always check if missingness is MCAR or MNAR
   - Use `count_mask_codes()` to understand mask distribution

2. **Applying imputation on raw, unnormalized data**
   - Most methods work better on log-transformed data
   - Normalize first, then impute

3. **Using MNAR methods for MCAR data**
   - QRILC/MinProb will underestimate values for MCAR missingness
   - KNN/PPCA are better for random missingness

4. **Ignoring imputation uncertainty**
   - Single imputation doesn't capture uncertainty
   - Consider multiple imputations for critical analyses

5. **Over-imputing**
   - Features with >80% missingness may be better removed
   - Use `filter_features_missing()` before imputation

6. **Not validating results**
   - Always check imputed value distributions
   - Compare with original data visually
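
One common way to validate an imputation method is a mask-and-recover test: hide a fraction of the observed values, impute them, and measure the error against the hidden truth. The sketch below shows the idea with NumPy only. The helper `masked_rmse` and the two toy imputers are hypothetical stand-ins; any function that maps a matrix with NaNs to a filled matrix could be passed instead.

```python
import numpy as np


def masked_rmse(X, impute_fn, holdout_fraction=0.1, random_state=0):
    """Hide a random subset of observed values, impute, and return the RMSE."""
    rng = np.random.default_rng(random_state)
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)
    holdout = observed & (rng.random(X.shape) < holdout_fraction)
    X_masked = X.copy()
    X_masked[holdout] = np.nan
    X_filled = impute_fn(X_masked)
    error = X_filled[holdout] - X[holdout]
    return float(np.sqrt(np.mean(error**2)))


def column_mean_impute(X):
    # Stand-in for an MCAR-style method
    return np.where(np.isnan(X), np.nanmean(X, axis=0), X)


def column_min_impute(X):
    # Stand-in for an MNAR-style method that fills with low values
    return np.where(np.isnan(X), np.nanmin(X, axis=0), X)


rng = np.random.default_rng(1)
data = rng.normal(20.0, 2.0, size=(300, 20))
rmse_mean = masked_rmse(data, column_mean_impute)
rmse_min = masked_rmse(data, column_min_impute)
print(f"RMSE (column mean): {rmse_mean:.2f}")
print(f"RMSE (column min):  {rmse_min:.2f}")
```

The hidden values here are removed at random (MCAR), so the minimum-based fill scores much worse than the mean-based fill. This mirrors pitfall 3: methods designed for left-censored data underestimate values that are missing at random. Note that random masking only evaluates performance under MCAR; assessing MNAR methods requires masking low values instead.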

### 7.4 Parameter Tuning Guide

**KNN:**
- `k=5-15`: Typical range
- Smaller k for heterogeneous data
- Larger k for homogeneous data

**PPCA/BPCA:**
- `n_components`: Start with 5-20
- Use elbow method on scree plot
- BPCA automatically prunes unnecessary components
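
To pick a starting value for `n_components`, the explained variance per component can be inspected on a complete matrix (for example, after a quick preliminary fill). The sketch below uses synthetic data with three latent factors and a 95% cumulative-variance cutoff; both the data and the cutoff are illustrative assumptions.

```python
import numpy as np


def explained_variance_ratio(X):
    """Fraction of total variance captured by each principal component."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variance = singular_values**2
    return variance / variance.sum()


# Synthetic data: 3 latent factors plus a small amount of noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
loadings = rng.normal(size=(3, 30))
X = latent @ loadings + 0.1 * rng.normal(size=(200, 30))

ratio = explained_variance_ratio(X)
cumulative = np.cumsum(ratio)
# Smallest number of components whose cumulative variance reaches 95%
n_components = int(np.searchsorted(cumulative, 0.95) + 1)
print(f"Suggested n_components: {n_components}")
```

Plotting `ratio` against the component index gives the scree plot mentioned above; the elbow appears where the values drop to the noise floor.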

**QRILC:**
- `q=0.01-0.05`: Recommended for proteomics
- Lower q for more aggressive censoring

**MinProb:**
- `sigma=1.0-3.0`: Typical range
- Higher sigma for more variance in imputed values

### 7.5 Summary

**All 10 Imputation Methods in ScpTensor:**

| # | Method | Category | Key Feature | Use When... |
|---|--------|----------|-------------|-------------|
| 1 | `impute_knn` | MCAR | Fast, simple | Small datasets, general use |
| 2 | `impute_ppca` | MCAR | Scalable | Large datasets |
| 3 | `impute_bpca` | MCAR | Auto model selection | Uncertain about components |
| 4 | `impute_svd` | MCAR | Dimensionality reduction | Need fast imputation |
| 5 | `impute_mf` | MCAR/MAR | Non-linear | Complex patterns, time available |
| 6 | `impute_lls` | MCAR | Exploits correlation | Correlated features |
| 7 | `impute_qrilc` | MNAR | Best for LOD | Below detection limit |
| 8 | `impute_minprob` | MNAR | Fast MNAR | MNAR, need speed |
| 9 | `impute_mindet` | MNAR | Deterministic | Quick baseline |
| 10 | `impute_nmf` | MCAR | Non-negative | Count-like data |

**Quick Decision Guide:**
- **Start with KNN** for general-purpose imputation
- **Use QRILC** when most missingness is LOD-related
- **Try BPCA** for automatic model selection
- **Use MissForest** for the best accuracy (if time allows)

**Related Tutorials:**
- Tutorial 04: Clustering and Visualization
- Tutorial 07: Feature Selection