# Tutorial 06: Advanced Quality Control Analysis

This tutorial covers advanced quality control (QC) techniques for single-cell proteomics data, building upon the basic QC methods introduced in Tutorial 02.

## Learning Objectives

By the end of this tutorial, you will:
- Understand sensitivity analysis and feature saturation
- Analyze missing value types using mask matrices
- Compute and interpret coefficient of variation (CV)
- Detect contaminants and doublets
- Create comprehensive QC visualizations

---

## 1. Setup

Import required libraries and load an example dataset.

In [None]:
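from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import polars as pl

# Create the directory that the plt.savefig() calls in this tutorial write to
Path("tutorial_output").mkdir(exist_ok=True)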

# Apply SciencePlots style (SciencePlots >= 2.0 registers its styles on import)
import scienceplots  # noqa: F401

plt.style.use(["science", "no-latex"])

# Import ScpTensor
import scptensor
from scptensor import (
    # Missing value analysis
    analyze_missing_types,
    calculate_qc_metrics,
    compute_batch_cv,
    compute_completeness,
    compute_cumulative_sensitivity,
    # Variability analysis
    compute_cv,
    compute_jaccard_index,
    compute_missing_stats,
    # Sensitivity analysis
    compute_sensitivity,
    # Advanced QC
    detect_contaminants,
    detect_doublets,
    filter_by_cv,
    qc_report_metrics,
    report_missing_values,
)
from scptensor.datasets import load_simulated_scrnaseq_like

print(f"ScpTensor version: {scptensor.__version__}")

### Load Dataset

For this tutorial, we'll use a simulated dataset that includes batch information and realistic missing value patterns.

In [None]:
# Load simulated dataset
container = load_simulated_scrnaseq_like()

print(f"Dataset loaded: {container}")
print(f"\nSamples: {container.n_samples}")
print(f"Features: {container.assays['proteins'].n_features}")
print("\nAvailable columns in obs:")
print(container.obs.columns)
print("\nBatch distribution:")
print(container.obs["batch"].value_counts())

### Expected Output:
```
Dataset loaded: ScpContainer with 500 samples and 1 assay

Samples: 500
Features: 200

Available columns in obs:
['sample_id', 'batch', 'cell_type', ...]

Batch distribution:
shape: (3, 2)
┌───────┬───────────┐
│ batch ┆ count     │
│ ---   ┆ ---       │
│ str   ┆ u32       │
╞═══════╪═══════════╡
│ B1    ┆ 167       │
│ B2    ┆ 167       │
│ B3    ┆ 166       │
└───────┴───────────┘
```

---

## 2. Sensitivity Analysis

Sensitivity analysis measures the number of features detected per sample (local sensitivity) and across the entire dataset (total sensitivity). This is crucial for assessing data quality in single-cell proteomics, where missing values are common.
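
To make these quantities concrete, here is a minimal NumPy sketch that computes them by hand on a toy matrix. It assumes missing values are stored as NaN and that a value counts as detected when it exceeds the threshold. It is an illustration only, not the implementation behind `compute_sensitivity()`.

```python
import numpy as np

# Toy intensity matrix: 4 samples (rows) x 5 features (columns).
# NaN marks a missing value; this layout is an assumption for illustration.
X = np.array([
    [1.2, np.nan, 3.4, np.nan, 0.8],
    [0.9, 2.1, np.nan, np.nan, 1.1],
    [np.nan, 1.7, 2.9, np.nan, np.nan],
    [1.5, 2.2, 3.1, np.nan, 0.7],
])

detection_threshold = 0.0
# A value counts as detected if it is present and above the threshold.
detected = np.nan_to_num(X, nan=0.0) > detection_threshold

# Local sensitivity: number of detected features in each sample.
n_features_per_sample = detected.sum(axis=1)
# Total sensitivity: number of features detected in at least one sample.
total_features = int(detected.any(axis=0).sum())
# Completeness: detected fraction of all features, per sample.
completeness_per_sample = n_features_per_sample / X.shape[1]

print(n_features_per_sample)    # [3 3 2 4]
print(total_features)           # 4
print(completeness_per_sample)  # [0.6 0.6 0.4 0.8]
```

The fourth feature is never detected, so total sensitivity (4) is lower than the number of features in the matrix (5).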

### 2.1 Compute Basic Sensitivity Metrics

In [None]:
# Compute sensitivity metrics
metrics = compute_sensitivity(
    container, assay_name="proteins", layer_name="raw", detection_threshold=0.0
)

print("Sensitivity Metrics:")
print("=" * 50)
print(f"Total unique features detected: {metrics.total_features}")
print(f"Mean features per sample: {metrics.mean_sensitivity:.2f}")
print(f"Median features per sample: {np.median(metrics.n_features_per_sample):.2f}")
print(f"Min features per sample: {np.min(metrics.n_features_per_sample)}")
print(f"Max features per sample: {np.max(metrics.n_features_per_sample)}")
print(f"\nMean completeness: {np.mean(metrics.completeness_per_sample):.3f}")

### 2.2 Compute Completeness

Completeness is the proportion of features detected in each sample. Low completeness usually indicates a poor-quality sample.

In [None]:
# Compute completeness per sample
completeness = compute_completeness(container, assay_name="proteins", layer_name="raw")

# Identify low completeness samples
low_completeness_threshold = 0.5  # 50% completeness
low_comp_samples = np.where(completeness < low_completeness_threshold)[0]

print(f"Samples with completeness < {low_completeness_threshold * 100}%: {len(low_comp_samples)}")
print(f"Percentage: {len(low_comp_samples) / len(completeness) * 100:.2f}%")

# Plot completeness distribution
fig, ax = plt.subplots(figsize=(8, 5))
ax.hist(completeness, bins=30, edgecolor="black", alpha=0.7, color="steelblue")
ax.axvline(
    low_completeness_threshold,
    color="red",
    linestyle="--",
    label=f"Threshold: {low_completeness_threshold * 100}%",
)
ax.axvline(
    np.mean(completeness),
    color="green",
    linestyle="-.",
    label=f"Mean: {np.mean(completeness) * 100:.1f}%",
)
ax.set_xlabel("Completeness (proportion of detected features)")
ax.set_ylabel("Number of Samples")
ax.set_title("Sample Completeness Distribution")
ax.legend()
plt.tight_layout()
plt.savefig("tutorial_output/completeness_distribution.png", dpi=300)
plt.show()

print("Plot saved to: tutorial_output/completeness_distribution.png")

### 2.3 Cumulative Sensitivity (Feature Saturation)

Cumulative sensitivity analysis shows how the number of unique features detected increases as more samples are included. This helps assess whether the feature space has been adequately sampled.
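
The idea can be sketched in a few lines of NumPy: shuffle the samples, then count how many distinct features have been seen after each one is added. The helper `cumulative_sensitivity` and the simulated detection matrix are hypothetical and only illustrate the principle. `compute_cumulative_sensitivity()` may use a different sampling scheme.

```python
import numpy as np

def cumulative_sensitivity(detected, n_steps=5, seed=42):
    """Count unique detected features as samples are added in random order.

    `detected` is a boolean samples x features matrix.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(detected.shape[0])
    # Row i of the running OR is True for every feature seen in the first i+1 samples.
    seen = np.logical_or.accumulate(detected[order], axis=0)
    cumulative = seen.sum(axis=1)
    # Evaluate the curve at n_steps evenly spaced sample sizes.
    sizes = np.unique(np.linspace(1, detected.shape[0], n_steps).astype(int))
    return sizes, cumulative[sizes - 1]

# Simulated detection matrix: 100 samples, 50 features, 30% detection probability.
rng = np.random.default_rng(0)
detected = rng.random((100, 50)) < 0.3

sizes, curve = cumulative_sensitivity(detected, n_steps=5)
for n, k in zip(sizes, curve):
    print(f"{n:>3} samples -> {k} unique features")
```

A curve that flattens well before the last sample suggests the feature space is saturated. A curve that is still rising suggests more samples would reveal new features.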

In [None]:
# Compute cumulative sensitivity
cumulative_result = compute_cumulative_sensitivity(
    container, assay_name="proteins", layer_name="raw", n_steps=20, seed=42
)

print("Cumulative Sensitivity Results:")
print("=" * 50)
print(f"Total features in dataset: {container.assays['proteins'].n_features}")
print(f"Total unique detected: {cumulative_result.cumulative_features[-1]}")
if cumulative_result.saturation_point:
    print(f"Estimated saturation at: {cumulative_result.saturation_point} samples")
else:
    print("Saturation not reached within sample size")

# Plot cumulative sensitivity curve
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(
    cumulative_result.sample_sizes,
    cumulative_result.cumulative_features,
    marker="o",
    linewidth=2,
    markersize=4,
)
ax.axhline(
    container.assays["proteins"].n_features,
    color="red",
    linestyle="--",
    label=f"Total features: {container.assays['proteins'].n_features}",
)
ax.set_xlabel("Number of Samples")
ax.set_ylabel("Cumulative Unique Features Detected")
ax.set_title("Feature Saturation Curve")
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("tutorial_output/cumulative_sensitivity.png", dpi=300)
plt.show()

print("\nPlot saved to: tutorial_output/cumulative_sensitivity.png")

### 2.4 Jaccard Index (Sample Similarity)

The Jaccard index measures the overlap of detected features between sample pairs. High similarity indicates consistent detection patterns.
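
For two samples with detected feature sets A and B, the Jaccard index is |A ∩ B| / |A ∪ B|. The sketch below computes it for all sample pairs with a matrix product. The function name `jaccard_matrix` and the convention of returning 0 for two empty samples are assumptions made for this example.

```python
import numpy as np

def jaccard_matrix(detected):
    """Pairwise Jaccard index between rows of a boolean samples x features matrix."""
    d = detected.astype(np.int64)
    intersection = d @ d.T  # shared detected features for every sample pair
    counts = d.sum(axis=1)
    union = counts[:, None] + counts[None, :] - intersection
    # Two samples with no detected features have an empty union; define J = 0 there.
    return np.divide(intersection, union, out=np.zeros(union.shape), where=union > 0)

detected = np.array([
    [True, True, False, False],   # sample 0 detects features {0, 1}
    [True, False, True, False],   # sample 1 detects features {0, 2}
    [True, True, True, False],    # sample 2 detects features {0, 1, 2}
])
J = jaccard_matrix(detected)
print(np.round(J, 3))
```

Samples 0 and 1 share one of three detected features, so their index is 1/3. Each of them shares two of three features with sample 2, giving 2/3.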

In [None]:
# Compute Jaccard index matrix
jaccard = compute_jaccard_index(container, assay_name="proteins", layer_name="raw")

# Calculate summary statistics
n_samples = jaccard.shape[0]
# Get upper triangle (excluding diagonal)
upper_tri = jaccard[np.triu_indices_from(jaccard, k=1)]

print("Jaccard Index Statistics:")
print("=" * 50)
print(f"Mean similarity: {np.mean(upper_tri):.3f}")
print(f"Median similarity: {np.median(upper_tri):.3f}")
print(f"Min similarity: {np.min(upper_tri):.3f}")
print(f"Max similarity: {np.max(upper_tri):.3f}")

# Visualize as heatmap for a subset of samples
n_display = min(50, n_samples)
jaccard_subset = jaccard[:n_display, :n_display]

fig, ax = plt.subplots(figsize=(8, 7))
im = ax.imshow(
    jaccard_subset, cmap="viridis", aspect="auto", interpolation="nearest", vmin=0, vmax=1
)
cbar = plt.colorbar(im, ax=ax, fraction=0.046, pad=0.04)
cbar.set_label("Jaccard Index")
ax.set_title(f"Sample Similarity Heatmap (first {n_display} samples)")
ax.set_xlabel("Sample Index")
ax.set_ylabel("Sample Index")
plt.tight_layout()
plt.savefig("tutorial_output/jaccard_heatmap.png", dpi=300)
plt.show()

print("\nPlot saved to: tutorial_output/jaccard_heatmap.png")

### 2.5 Generate QC Report with Grouping

You can generate QC reports grouped by metadata columns such as batch or cell type.
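
As a rough illustration of what grouping means here, the following sketch computes two per-batch summaries by hand: the mean number of detected features per sample and the number of features detected at least once in the batch. The toy data and dictionary keys are made up for this example and do not mirror the exact columns that `qc_report_metrics()` adds.

```python
import numpy as np

# Toy detection matrix (6 samples x 4 features) with batch labels.
detected = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 0],
    [1, 1, 1, 1],
    [1, 1, 0, 1],
], dtype=bool)
batch = np.array(["B1", "B1", "B2", "B2", "B3", "B3"])

summary = {}
for b in np.unique(batch):
    rows = detected[batch == b]
    summary[str(b)] = {
        # Mean number of detected features per sample in the batch
        "mean_features": float(rows.sum(axis=1).mean()),
        # Number of features detected in at least one sample of the batch
        "total_features": int(rows.any(axis=0).sum()),
    }

for b, stats in summary.items():
    print(b, stats)
```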

In [None]:
# Generate QC metrics grouped by batch
container_with_qc = qc_report_metrics(
    container, assay_name="proteins", layer_name="raw", group_by="batch"
)

# View the added QC metrics
print("QC metrics added to obs:")
print("=" * 50)
print(
    container_with_qc.obs.select(
        [
            "sample_id",
            "batch",
            "n_detected_features",
            "completeness",
            "batch_mean_features",
            "batch_total_features",
        ]
    ).head(10)
)

# Generate summary report by batch
batch_report = report_missing_values(
    container_with_qc, assay_name="proteins", layer_name="raw", by="batch"
)

print("\nBatch-wise QC Report:")
print("=" * 50)
print(batch_report)

---

## 3. Missing Value Analysis

ScpTensor tracks different types of missing values using mask codes:
- **VALID (0)**: Detected values
- **MBR (1)**: Missing Between Runs
- **LOD (2)**: Below Limit of Detection
- **FILTERED (3)**: Removed by QC
- **IMPUTED (5)**: Filled by imputation
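
The statistics in this section are all counts of these codes. As a minimal sketch, assuming the mask is a dense integer matrix with samples in rows and features in columns, the overall rate of each code and the per-feature valid rate can be computed like this:

```python
import numpy as np

# Mask codes as listed above.
MASK_CODES = {"VALID": 0, "MBR": 1, "LOD": 2, "FILTERED": 3, "IMPUTED": 5}

# Toy mask matrix: 4 samples (rows) x 3 features (columns).
M = np.array([
    [0, 1, 2],
    [0, 0, 2],
    [1, 0, 5],
    [0, 3, 0],
], dtype=np.int8)

# Overall share of each code across the whole matrix.
rates = {name: float(np.mean(M == code)) for name, code in MASK_CODES.items()}

# Per-feature valid rate: fraction of samples with code 0 in each column.
valid_rate_per_feature = (M == MASK_CODES["VALID"]).mean(axis=0)

print(rates)
print(valid_rate_per_feature)
```

Here half of all entries are valid, and the three features have valid rates of 0.75, 0.5, and 0.25.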

### 3.1 Analyze Missing Types

In [None]:
# Analyze missing types based on mask matrix
container_analyzed = analyze_missing_types(container, assay_name="proteins", layer_name="raw")

# View mask statistics in var
var = container_analyzed.assays["proteins"].var

print("Missing Type Analysis (first 10 features):")
print("=" * 60)
print(
    var.select(
        [
            "feature_id",
            "mask_valid_count",
            "mask_mbr_count",
            "mask_lod_count",
            "mask_valid_rate",
            "mask_missing_rate",
        ]
    ).head(10)
)

# Summary statistics
print("\nSummary across all features:")
print("=" * 60)
print(f"Mean valid rate: {var['mask_valid_rate'].mean():.3f}")
print(f"Mean missing rate: {var['mask_missing_rate'].mean():.3f}")

### 3.2 Compute Comprehensive Missing Statistics

In [None]:
# Compute comprehensive missing value statistics
missing_report = compute_missing_stats(
    container, assay_name="proteins", layer_name="raw", high_missing_threshold=0.5
)

print("Missing Value Statistics Report:")
print("=" * 60)
print(f"Total missing rate: {missing_report.total_missing_rate:.2%}")
print(f"Valid rate: {missing_report.valid_rate:.2%}")
print(f"MBR rate: {missing_report.mbr_rate:.2%}")
print(f"LOD rate: {missing_report.lod_rate:.2%}")
print(f"Imputed rate: {missing_report.imputed_rate:.2%}")
print(
    f"\nStructural missing features (100% missing): {len(missing_report.structural_missing_features)}"
)
print(f"Samples with high missing rate (>50%): {len(missing_report.samples_with_high_missing)}")

# Visualize missing value type distribution
fig, ax = plt.subplots(figsize=(8, 6))

# Mask type counts
mask_types = ["VALID", "MBR", "LOD", "FILTERED", "IMPUTED"]
mask_codes = [0, 1, 2, 3, 5]
colors = ["#2ca02c", "#ffdd57", "#ff9f40", "#d62728", "#9467bd"]

# Get actual counts from the mask matrix
M = container.assays["proteins"].layers["raw"].M
if hasattr(M, "toarray"):
    M = M.toarray()

type_counts = []
type_labels = []
type_colors = []

for code, label, color in zip(mask_codes, mask_types, colors, strict=False):
    count = np.sum(M == code)
    if count > 0:
        type_counts.append(count)
        type_labels.append(f"{label}\n({code})")
        type_colors.append(color)

# Create bar plot
bars = ax.bar(range(len(type_counts)), type_counts, color=type_colors, edgecolor="black", alpha=0.8)
ax.set_xticks(range(len(type_labels)))
ax.set_xticklabels(type_labels)
ax.set_ylabel("Count")
ax.set_title("Missing Value Type Distribution")
ax.grid(True, alpha=0.3, axis="y")

# Add count labels on bars
for bar, count in zip(bars, type_counts, strict=False):
    height = bar.get_height()
    pct = count / M.size * 100
    ax.text(
        bar.get_x() + bar.get_width() / 2.0,
        height,
        f"{count}\n({pct:.1f}%)",
        ha="center",
        va="bottom",
        fontsize=9,
    )

plt.tight_layout()
plt.savefig("tutorial_output/missing_type_distribution.png", dpi=300)
plt.show()

print("\nPlot saved to: tutorial_output/missing_type_distribution.png")

### 3.3 Missing Value Report by Group

In [None]:
# Generate missing value report by batch
missing_by_batch = report_missing_values(
    container, assay_name="proteins", layer_name="raw", by="batch"
)

print("Missing Value Report by Batch:")
print("=" * 60)
print(missing_by_batch)

# Compare batches
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Extract data for plotting
batches = missing_by_batch["group"].to_numpy()
local_sensitivity = missing_by_batch["LocalSensitivityMean"].to_numpy()
total_sensitivity = missing_by_batch["TotalSensitivity"].to_numpy()
completeness = missing_by_batch["Completeness"].to_numpy()

# Local sensitivity
axes[0].bar(batches, local_sensitivity, color="steelblue", edgecolor="black", alpha=0.7)
axes[0].set_ylabel("Mean Detected Features")
axes[0].set_title("Local Sensitivity by Batch")
axes[0].grid(True, alpha=0.3, axis="y")

# Total sensitivity
axes[1].bar(batches, total_sensitivity, color="coral", edgecolor="black", alpha=0.7)
axes[1].set_ylabel("Total Unique Features")
axes[1].set_title("Total Sensitivity by Batch")
axes[1].grid(True, alpha=0.3, axis="y")

# Completeness
axes[2].bar(batches, completeness, color="seagreen", edgecolor="black", alpha=0.7)
axes[2].set_ylabel("Completeness")
axes[2].set_title("Completeness by Batch")
axes[2].grid(True, alpha=0.3, axis="y")

plt.tight_layout()
plt.savefig("tutorial_output/batch_missing_comparison.png", dpi=300)
plt.show()

print("\nPlot saved to: tutorial_output/batch_missing_comparison.png")

---

## 4. Variability Analysis (Coefficient of Variation)

The coefficient of variation (CV) is the ratio of a feature's standard deviation to its mean, so it measures variability relative to abundance. A high CV may reflect poor measurement quality or true biological heterogeneity.
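
A minimal sketch of the calculation, assuming missing values are NaN and are skipped. The helper `feature_cv`, the use of the sample standard deviation (`ddof=1`), and returning NaN for near-zero means are choices made for this example. `compute_cv()` may handle these details differently.

```python
import numpy as np

def feature_cv(X, min_mean=1e-6):
    """CV = standard deviation / mean per feature (column), ignoring NaNs.

    Features whose mean is below `min_mean` get NaN, because dividing by a
    near-zero mean makes the ratio unstable.
    """
    mean = np.nanmean(X, axis=0)
    std = np.nanstd(X, axis=0, ddof=1)  # sample standard deviation
    cv = np.full(mean.shape, np.nan)
    ok = mean >= min_mean
    cv[ok] = std[ok] / mean[ok]
    return cv

X = np.array([
    [10.0, 100.0, 0.0],
    [12.0, 300.0, 0.0],
    [11.0, np.nan, 0.0],
    [9.0, 200.0, 0.0],
])
cv_values = feature_cv(X)
print(np.round(cv_values, 3))
```

The first feature varies little around its mean (CV about 0.12), the second varies strongly (CV 0.5), and the third has a zero mean, so its CV is left undefined.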

### 4.1 Compute Feature CV

In [None]:
# Compute CV for all features
cv_report = compute_cv(container, assay_name="proteins", layer_name="raw", min_mean=1e-6)

print("Coefficient of Variation Report:")
print("=" * 60)
print(f"Mean CV: {cv_report.mean_cv:.3f}")
print(f"Median CV: {cv_report.median_cv:.3f}")
print(f"Min CV: {np.min(cv_report.feature_cv):.3f}")
print(f"Max CV: {np.max(cv_report.feature_cv):.3f}")

# Define CV cut-offs: below 0.1 is low, 0.1 to 0.3 is medium, 0.3 and above is high
low_cv_threshold = 0.1
medium_cv_threshold = 0.3

n_low = np.sum(cv_report.feature_cv < low_cv_threshold)
n_medium = np.sum(
    (cv_report.feature_cv >= low_cv_threshold) & (cv_report.feature_cv < medium_cv_threshold)
)
n_high = np.sum(cv_report.feature_cv >= medium_cv_threshold)

print(
    f"\nFeatures with low CV (<{low_cv_threshold}): {n_low} ({n_low / len(cv_report.feature_cv) * 100:.1f}%)"
)
print(
    f"Features with medium CV ({low_cv_threshold}-{medium_cv_threshold}): {n_medium} ({n_medium / len(cv_report.feature_cv) * 100:.1f}%)"
)
print(
    f"Features with high CV (>={medium_cv_threshold}): {n_high} ({n_high / len(cv_report.feature_cv) * 100:.1f}%)"
)

### 4.2 CV Distribution by Group

In [None]:
# Compute CV by batch
cv_report_batched = compute_cv(container, assay_name="proteins", layer_name="raw", group_by="batch")

print("CV Statistics by Batch:")
print("=" * 60)
if cv_report_batched.cv_by_group:
    for group, cv_values in cv_report_batched.cv_by_group.items():
        print(
            f"{group}: Mean CV = {np.mean(cv_values):.3f}, Median CV = {np.median(cv_values):.3f}"
        )

# Plot CV distributions by batch
fig, ax = plt.subplots(figsize=(10, 6))

if cv_report_batched.cv_by_group:
    # Prepare data for box plot
    groups = list(cv_report_batched.cv_by_group.keys())
    data_to_plot = [cv_report_batched.cv_by_group[g] for g in groups]

    bp = ax.boxplot(data_to_plot, labels=groups, patch_artist=True)

    # Color the boxes
    colors = ["#1f77b4", "#ff7f0e", "#2ca02c"]
    for patch, color in zip(bp["boxes"], colors[: len(groups)], strict=False):
        patch.set_facecolor(color)
        patch.set_alpha(0.6)

    ax.set_ylabel("Coefficient of Variation")
    ax.set_title("CV Distribution by Batch")
    ax.grid(True, alpha=0.3, axis="y")

plt.tight_layout()
plt.savefig("tutorial_output/cv_by_batch.png", dpi=300)
plt.show()

print("\nPlot saved to: tutorial_output/cv_by_batch.png")

### 4.3 Batch CV Analysis

Compare within-batch variability to between-batch variability to detect batch effects.
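
One way to picture the comparison is the sketch below, which simulates three batches that differ only by a scaling factor. It assumes within-batch CV is the feature CV across samples of one batch, averaged over features, and between-batch CV is the CV of the per-batch feature means. These definitions are assumptions for illustration and may not match how `compute_batch_cv()` aggregates its values.

```python
import numpy as np

def cv(values, axis=0):
    """Sample standard deviation divided by the mean along `axis`."""
    return np.std(values, axis=axis, ddof=1) / np.mean(values, axis=axis)

rng = np.random.default_rng(0)
n_per_batch, n_features = 50, 20
# Simulate three batches whose intensities are scaled by a batch-specific factor.
batch_factors = {"B1": 1.0, "B2": 1.5, "B3": 2.0}
data = {
    b: f * rng.normal(loc=100.0, scale=10.0, size=(n_per_batch, n_features))
    for b, f in batch_factors.items()
}

# Within-batch CV: spread of samples inside one batch, averaged over features.
within = {b: float(np.mean(cv(x, axis=0))) for b, x in data.items()}

# Between-batch CV: spread of the per-batch feature means, averaged over features.
batch_means = np.stack([x.mean(axis=0) for x in data.values()])  # batches x features
between = float(np.mean(cv(batch_means, axis=0)))

ratio = between / np.mean(list(within.values()))
print(f"within: {within}")
print(f"between: {between:.3f}, ratio: {ratio:.2f}")
```

Because the simulated batches differ much more from each other than samples do within a batch, the between/within ratio is well above 1, which is the pattern a strong batch effect produces.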

In [None]:
# Compute batch CV statistics
batch_cv_report = compute_batch_cv(
    container, assay_name="proteins", layer_name="raw", batch_col="batch", high_cv_threshold=0.3
)

print("Batch CV Analysis:")
print("=" * 60)
print("Within-batch CV by batch:")
if batch_cv_report.within_batch_cv:
    for batch, cv_val in batch_cv_report.within_batch_cv.items():
        print(f"  {batch}: {cv_val:.3f}")
print(f"\nBetween-batch CV: {batch_cv_report.between_batch_cv:.3f}")
print(f"High CV features: {len(batch_cv_report.high_cv_features)}")

# Assess batch effect
if batch_cv_report.within_batch_cv:
    within_mean = np.mean(list(batch_cv_report.within_batch_cv.values()))
    ratio = batch_cv_report.between_batch_cv / within_mean if within_mean > 0 else np.nan
    print(f"\nBatch effect ratio (between/within): {ratio:.2f}")
    if ratio > 1.5:
        print("Assessment: High batch effect detected")
    elif ratio > 1.1:
        print("Assessment: Moderate batch effect detected")
    else:
        print("Assessment: Low batch effect")

### 4.4 Filter Features by CV

In [None]:
# Filter features by CV threshold
cv_threshold = 0.5
container_filtered = filter_by_cv(
    container,
    assay_name="proteins",
    layer_name="raw",
    cv_threshold=cv_threshold,
    keep_filtered=False,  # Remove high CV features
)

print(f"Before CV filtering: {container.assays['proteins'].n_features} features")
print(f"After CV filtering: {container_filtered.assays['proteins'].n_features} features")
print(
    f"Features removed: {container.assays['proteins'].n_features - container_filtered.assays['proteins'].n_features}"
)

# Alternative: keep all features but mark filtered ones
container_marked = filter_by_cv(
    container,
    assay_name="proteins",
    layer_name="raw",
    cv_threshold=cv_threshold,
    keep_filtered=True,  # Keep all, add filter column
)

# View filter status
var_marked = container_marked.assays["proteins"].var
n_filtered = var_marked["cv_filtered"].sum()
print(f"\nFeatures marked as high CV (> {cv_threshold}): {n_filtered}")

---

## 5. Advanced Detection Methods

### 5.1 Contaminant Detection

Detect common contaminant proteins in proteomics data (e.g., keratins, trypsin, albumin).
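
The sketch below shows the general idea with a simple name-based match and a per-sample contaminant ratio. The regular expressions and feature names are hypothetical. Real workflows usually match against a curated contaminant list, and `detect_contaminants()` may use different rules.

```python
import re

import numpy as np

# Hypothetical name patterns for a few common contaminant families.
CONTAMINANT_PATTERNS = [r"^KRT\d+", r"^ALB$", r"TRYP"]
pattern = re.compile("|".join(CONTAMINANT_PATTERNS), flags=re.IGNORECASE)

feature_ids = np.array(["KRT1", "ALB", "GAPDH", "ACTB", "KRT10", "TRYP_PIG"])
is_contaminant = np.array([bool(pattern.search(f)) for f in feature_ids])

# Toy intensities: 3 samples x 6 features.
X = np.array([
    [5.0, 5.0, 40.0, 40.0, 5.0, 5.0],
    [0.0, 0.0, 50.0, 50.0, 0.0, 0.0],
    [20.0, 20.0, 10.0, 10.0, 20.0, 20.0],
])

# Share of each sample's total intensity that comes from contaminant features.
contaminant_ratio = X[:, is_contaminant].sum(axis=1) / X.sum(axis=1)

print(feature_ids[is_contaminant])
print(contaminant_ratio)
```

In this toy data the third sample gets 80% of its signal from contaminants and would be an obvious candidate for removal.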

In [None]:
# Detect contaminants
container_with_contaminants = detect_contaminants(
    container, assay_name="proteins", layer_name="raw", min_prevalence=3
)

# Check for detected contaminants
var = container_with_contaminants.assays["proteins"].var
contaminants = var.filter(pl.col("is_contaminant"))

print(f"Detected {len(contaminants)} contaminant proteins")
if len(contaminants) > 0:
    print("\nContaminant list:")
    print(contaminants.select(["feature_id", "contaminant_prevalence"]).head(10))

# View sample-level contaminant statistics
print("\nSample contaminant content:")
print(
    container_with_contaminants.obs.select(
        ["sample_id", "contaminant_content", "contaminant_ratio"]
    ).head(10)
)

# Visualize contaminant distribution
if len(contaminants) > 0:
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))

    # Contaminant ratio per sample
    contaminant_ratio = container_with_contaminants.obs["contaminant_ratio"].to_numpy()
    axes[0].hist(contaminant_ratio, bins=30, edgecolor="black", alpha=0.7, color="coral")
    axes[0].set_xlabel("Contaminant Ratio")
    axes[0].set_ylabel("Number of Samples")
    axes[0].set_title("Sample Contaminant Distribution")
    axes[0].axvline(
        np.mean(contaminant_ratio),
        color="red",
        linestyle="--",
        label=f"Mean: {np.mean(contaminant_ratio):.3f}",
    )
    axes[0].legend()

    # High contaminant samples
    high_contam_threshold = 0.05  # 5%
    n_high_contam = np.sum(contaminant_ratio > high_contam_threshold)
    axes[1].bar(
        ["All samples", f"High (> {high_contam_threshold * 100}%)"],
        [len(contaminant_ratio), n_high_contam],
        color=["steelblue", "red"],
        edgecolor="black",
    )
    axes[1].set_ylabel("Number of Samples")
    axes[1].set_title("High Contaminant Samples")

    plt.tight_layout()
    plt.savefig("tutorial_output/contaminant_analysis.png", dpi=300)
    plt.show()

    print("\nPlot saved to: tutorial_output/contaminant_analysis.png")

### 5.2 Doublet Detection

Detect potential doublets (samples containing material from multiple cells).
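
A common nearest-neighbour approach, sketched here under simplifying assumptions, is to simulate artificial doublets by summing pairs of observed samples and then score each observed sample by the fraction of simulated doublets among its nearest neighbours. The function `knn_doublet_scores` is a hypothetical illustration. It is not necessarily what `detect_doublets(method="knn")` does internally.

```python
import numpy as np

def knn_doublet_scores(X, n_neighbors=15, seed=42):
    """Score each sample by the share of simulated doublets among its neighbours."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Simulate n artificial doublets by summing two randomly chosen samples.
    pairs = rng.integers(0, n, size=(n, 2))
    simulated = X[pairs[:, 0]] + X[pairs[:, 1]]
    combined = np.vstack([X, simulated])
    is_simulated = np.arange(2 * n) >= n

    # Squared Euclidean distances from every observed sample to all points.
    d2 = ((X[:, None, :] - combined[None, :, :]) ** 2).sum(axis=2)
    d2[np.arange(n), np.arange(n)] = np.inf  # a sample is not its own neighbour
    neighbours = np.argsort(d2, axis=1)[:, :n_neighbors]
    return is_simulated[neighbours].mean(axis=1)

# Two well-separated cell types plus five true doublets (one cell of each type).
rng = np.random.default_rng(0)
a = rng.normal(loc=[10.0, 1.0], scale=0.3, size=(50, 2))
b = rng.normal(loc=[1.0, 10.0], scale=0.3, size=(50, 2))
doublets = a[:5] + b[:5]
X = np.vstack([a, b, doublets])

scores = knn_doublet_scores(X, n_neighbors=15)
print(f"Mean score, singlets: {scores[:100].mean():.2f}")
print(f"Mean score, doublets: {scores[100:].mean():.2f}")
```

Singlets sit among other singlets and score low, while the true doublets land next to simulated doublets and score high. A threshold on the score, or an expected doublet rate, then turns scores into doublet calls.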

In [None]:
# Detect doublets using KNN method
container_with_doublets = detect_doublets(
    container,
    assay_name="proteins",
    layer_name="raw",
    method="knn",  # Options: 'knn', 'isolation', 'hybrid'
    n_neighbors=15,
    expected_doublet_rate=0.1,
    random_state=42,
)

# View doublet detection results
n_doublets = container_with_doublets.obs["is_doublet"].sum()
doublet_rate = n_doublets / container_with_doublets.n_samples * 100

print("Doublet Detection Results:")
print("=" * 60)
print(f"Detected doublets: {n_doublets}")
print(f"Doublet rate: {doublet_rate:.2f}%")
print(f"Expected rate: {0.1 * 100:.1f}%")

# Doublet score statistics
doublet_scores = container_with_doublets.obs["doublet_score"].to_numpy()
print("\nDoublet score statistics:")
print(f"  Mean: {np.mean(doublet_scores):.3f}")
print(f"  Median: {np.median(doublet_scores):.3f}")
print(f"  Min: {np.min(doublet_scores):.3f}")
print(f"  Max: {np.max(doublet_scores):.3f}")

# Visualize doublet scores
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Score distribution
axes[0].hist(doublet_scores, bins=30, edgecolor="black", alpha=0.7)
# Mark the mean score of the flagged doublets, if any were detected
if n_doublets > 0:
    is_doublet = container_with_doublets.obs["is_doublet"].to_numpy().astype(bool)
    axes[0].axvline(
        doublet_scores[is_doublet].mean(),
        color="red",
        linestyle="--",
        label="Doublet mean",
    )
    axes[0].legend()
axes[0].set_xlabel("Doublet Score")
axes[0].set_ylabel("Frequency")
axes[0].set_title("Doublet Score Distribution")

# Score by batch
if "batch" in container_with_doublets.obs.columns:
    # Sort so the box order is stable; unique() alone does not guarantee an order
    batches = container_with_doublets.obs["batch"].unique().sort().to_list()
    batch_scores = [
        container_with_doublets.obs.filter(pl.col("batch") == b)["doublet_score"].to_numpy()
        for b in batches
    ]

    bp = axes[1].boxplot(batch_scores, labels=batches, patch_artist=True)
    for patch in bp["boxes"]:
        patch.set_facecolor("lightblue")
        patch.set_alpha(0.6)
    axes[1].set_ylabel("Doublet Score")
    axes[1].set_title("Doublet Score by Batch")
    axes[1].grid(True, alpha=0.3, axis="y")

plt.tight_layout()
plt.savefig("tutorial_output/doublet_detection.png", dpi=300)
plt.show()

print("\nPlot saved to: tutorial_output/doublet_detection.png")

---

## 6. QC Visualization Summary

Create a comprehensive QC summary figure combining multiple metrics.

In [None]:
# First, calculate comprehensive QC metrics
container_qc = calculate_qc_metrics(container, assay_name="proteins", layer_name="raw")

# Create comprehensive QC summary
fig = plt.figure(figsize=(16, 10))
gs = fig.add_gridspec(3, 3, hspace=0.3, wspace=0.3)

# Panel 1: Missing rate per sample
ax1 = fig.add_subplot(gs[0, 0])
missing_rate = container_qc.obs["missing_rate"].to_numpy()
ax1.hist(missing_rate, bins=30, edgecolor="black", alpha=0.7, color="coral")
ax1.set_xlabel("Missing Rate")
ax1.set_ylabel("Number of Samples")
ax1.set_title("Sample Missing Rate Distribution")

# Panel 2: Number of detected features per sample
ax2 = fig.add_subplot(gs[0, 1])
n_detected = container_qc.obs["n_detected"].to_numpy()
ax2.hist(n_detected, bins=30, edgecolor="black", alpha=0.7, color="steelblue")
ax2.set_xlabel("Number of Detected Features")
ax2.set_ylabel("Number of Samples")
ax2.set_title("Detected Features Distribution")

# Panel 3: Total intensity per sample
ax3 = fig.add_subplot(gs[0, 2])
total_intensity = container_qc.obs["total_intensity"].to_numpy()
ax3.hist(total_intensity, bins=30, edgecolor="black", alpha=0.7, color="seagreen")
ax3.set_xlabel("Total Intensity")
ax3.set_ylabel("Number of Samples")
ax3.set_title("Total Intensity Distribution")

# Panel 4: Batch comparison - missing rate
ax4 = fig.add_subplot(gs[1, 0])
if "batch" in container_qc.obs.columns:
    batch_missing = (
        container_qc.obs.group_by("batch")
        .agg(pl.col("missing_rate").mean().alias("mean_missing"))
        .sort("batch")
    )
    ax4.bar(
        batch_missing["batch"].to_list(),
        batch_missing["mean_missing"].to_list(),
        edgecolor="black",
        alpha=0.7,
    )
    ax4.set_xlabel("Batch")
    ax4.set_ylabel("Mean Missing Rate")
    ax4.set_title("Missing Rate by Batch")

# Panel 5: Batch comparison - detected features
ax5 = fig.add_subplot(gs[1, 1])
if "batch" in container_qc.obs.columns:
    batch_detected = (
        container_qc.obs.group_by("batch")
        .agg(pl.col("n_detected").mean().alias("mean_detected"))
        .sort("batch")
    )
    ax5.bar(
        batch_detected["batch"].to_list(),
        batch_detected["mean_detected"].to_list(),
        color="steelblue",
        edgecolor="black",
        alpha=0.7,
    )
    ax5.set_xlabel("Batch")
    ax5.set_ylabel("Mean Detected Features")
    ax5.set_title("Detected Features by Batch")

# Panel 6: Batch comparison - total intensity
ax6 = fig.add_subplot(gs[1, 2])
if "batch" in container_qc.obs.columns:
    batch_intensity = (
        container_qc.obs.group_by("batch")
        .agg(pl.col("total_intensity").mean().alias("mean_intensity"))
        .sort("batch")
    )
    ax6.bar(
        batch_intensity["batch"].to_list(),
        batch_intensity["mean_intensity"].to_list(),
        color="seagreen",
        edgecolor="black",
        alpha=0.7,
    )
    ax6.set_xlabel("Batch")
    ax6.set_ylabel("Mean Total Intensity")
    ax6.set_title("Total Intensity by Batch")

# Panel 7: Feature prevalence distribution
ax7 = fig.add_subplot(gs[2, 0])
feature_prevalence = container_qc.assays["proteins"].var["prevalence"].to_numpy()
ax7.hist(feature_prevalence, bins=30, edgecolor="black", alpha=0.7, color="orange")
ax7.set_xlabel("Prevalence (proportion of samples)")
ax7.set_ylabel("Number of Features")
ax7.set_title("Feature Prevalence Distribution")

# Panel 8: Feature variance distribution
ax8 = fig.add_subplot(gs[2, 1])
feature_var = container_qc.assays["proteins"].var["variance"].to_numpy()
ax8.hist(feature_var, bins=30, edgecolor="black", alpha=0.7, color="purple")
ax8.set_xlabel("Variance")
ax8.set_ylabel("Number of Features")
ax8.set_title("Feature Variance Distribution")

# Panel 9: Feature mean intensity vs variance
ax9 = fig.add_subplot(gs[2, 2])
feature_mean = container_qc.assays["proteins"].var["mean_intensity"].to_numpy()
scatter = ax9.scatter(feature_mean, feature_var, alpha=0.5, s=20, c="darkblue")
ax9.set_xlabel("Mean Intensity")
ax9.set_ylabel("Variance")
ax9.set_title("Mean-Variance Relationship")
ax9.set_xscale("log")
ax9.set_yscale("log")

# Overall title
fig.suptitle("Comprehensive QC Summary", fontsize=16, fontweight="bold", y=0.995)

plt.savefig("tutorial_output/comprehensive_qc_summary.png", dpi=300, bbox_inches="tight")
plt.show()

print("Comprehensive QC summary saved to: tutorial_output/comprehensive_qc_summary.png")

---

## 7. Summary

In this tutorial, you learned advanced QC techniques for single-cell proteomics data:

### Sensitivity Analysis:
- **`compute_sensitivity()`**: Total and local (per-sample) feature detection
- **`compute_completeness()`**: Proportion of detected values per sample
- **`compute_cumulative_sensitivity()`**: Feature saturation analysis
- **`compute_jaccard_index()`**: Sample similarity based on feature overlap
- **`qc_report_metrics()`**: Generate comprehensive QC reports

### Missing Value Analysis:
- **`analyze_missing_types()`**: Analyze mask matrix for different missing types
- **`compute_missing_stats()`**: Comprehensive missing value statistics
- **`report_missing_values()`**: Generate missing value reports by group

### Variability Analysis:
- **`compute_cv()`**: Coefficient of variation per feature
- **`compute_batch_cv()`**: Within-batch and between-batch CV
- **`filter_by_cv()`**: Filter features by CV threshold

### Advanced Detection:
- **`detect_contaminants()`**: Identify common contaminant proteins
- **`detect_doublets()`**: Detect potential multiplets/doublets

### Best Practices:
1. **Start with sensitivity analysis** to understand overall data quality
2. **Use cumulative sensitivity** to assess if you have enough samples
3. **Analyze missing types** to understand the nature of missing values
4. **Check batch effects** using batch CV analysis
5. **Detect and remove contaminants** before downstream analysis
6. **Filter high CV features** carefully - they may be biologically relevant
7. **Always visualize QC metrics** before and after filtering

### Next Steps:
- **Tutorial 03**: Imputation and Batch Correction
- **Tutorial 04**: Clustering and Visualization
- **Tutorial 05**: Differential Expression Analysis
- **Tutorial 07**: Feature Selection