# Tutorial 07: Feature Selection

This tutorial covers feature selection methods for single-cell proteomics data, helping you identify the most informative proteins/peptides for downstream analysis.

## Learning Objectives

By the end of this tutorial, you will:
- Understand the importance of feature selection in single-cell analysis
- Select highly variable features using HVG methods
- Apply variance stabilizing transformation (VST) for feature selection
- Filter features based on dropout rates
- Use model-based feature selection methods
- Perform PCA loading-based feature selection
- Compare and evaluate different feature selection strategies

---

## 1. Setup

Import required libraries and load a dataset.

In [None]:
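from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import scienceplots  # noqa: F401  (registers the "science" styles; required by SciencePlots >= 2.0)

# Apply SciencePlots style
plt.style.use(["science", "no-latex"])

# Create the directory used by the plt.savefig() calls in this tutorial
Path("tutorial_output").mkdir(parents=True, exist_ok=True)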

# Import ScpTensor
import scptensor
from scptensor import (
    get_dropout_stats,
    # Normalization
    norm_log,
    select_by_dispersion,
    select_by_dropout,
    select_by_model_importance,
    select_by_pca_loadings,
    select_by_vst,
    # Feature selection methods
    select_hvg,
)
from scptensor.datasets import load_simulated_scrnaseq_like

print(f"ScpTensor version: {scptensor.__version__}")

## 2. Load and Prepare Data

Load a dataset and apply basic normalization for feature selection.

In [None]:
# Load dataset
container = load_simulated_scrnaseq_like()

print(f"Dataset loaded: {container}")
print(f"Samples: {container.n_samples}")
print(f"Features: {container.assays['proteins'].n_features}")

# Apply log normalization for feature selection
container = norm_log(
    container,
    assay_name="proteins",
    base_layer="raw",
    new_layer_name="log",
    base=2.0,
    offset=1.0,
)

print("\nLog normalization completed.")
print(f"Available layers: {list(container.assays['proteins'].layers.keys())}")

## 3. Why Feature Selection?

Feature selection is crucial for single-cell analysis because:

1. **Dimensionality Reduction**: Proteomics data often has 1000+ features, but only a subset is informative
2. **Noise Reduction**: Removing low-quality features improves signal-to-noise ratio
3. **Computational Efficiency**: Fewer features speed up downstream analysis
4. **Biological Relevance**: Variable features often represent biologically relevant proteins
5. **Improved Clustering**: Focusing on informative features can sharpen cell type separation

Key considerations:
- **Highly Variable Features**: Show variation across samples/cells
- **Dropout Rate**: Features with high missingness may be less reliable
- **Expression Level**: Very low expression features have high technical noise
- **Correlation**: Highly correlated features provide redundant information

## 4. Highly Variable Genes/Proteins (HVG)

HVG selection identifies features with high variability using coefficient of variation (CV) or dispersion metrics.
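
As a rough illustration of what CV-based ranking involves, the sketch below computes the coefficient of variation per feature with plain NumPy and keeps the top-ranked ones. The helper `rank_by_cv` and the toy matrix are hypothetical and are not part of ScpTensor; `select_hvg` may differ in details such as the variance estimator or how ties are handled.

```python
import numpy as np


def rank_by_cv(X, n_top):
    """Return the indices of the n_top features with the highest CV.

    X is a (samples x features) array of non-negative values.
    Hypothetical helper for illustration only.
    """
    means = X.mean(axis=0)
    stds = X.std(axis=0, ddof=1)
    eps = np.finfo(float).eps  # guards against division by zero
    cv = stds / (means + eps)
    order = np.argsort(cv)[::-1]  # indices sorted by descending CV
    return order[:n_top], cv


# Toy data: 5 samples x 4 features (features 0 and 3 are constant)
X = np.array(
    [
        [10.0, 1.0, 5.0, 2.0],
        [10.0, 9.0, 5.5, 2.0],
        [10.0, 1.0, 4.5, 2.0],
        [10.0, 9.0, 5.0, 2.0],
        [10.0, 5.0, 5.0, 2.0],
    ]
)
top_idx, cv = rank_by_cv(X, n_top=2)
print("Top features by CV:", top_idx)
```

Because CV divides by the mean, features with means close to zero can receive inflated scores, which is one reason to pair CV ranking with an expression or dropout filter.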

In [None]:
# Select HVGs using coefficient of variation
n_features_original = container.assays["proteins"].n_features
n_top_features = 100  # Select top 100 most variable features

container_hvg = select_hvg(
    container=container,
    assay_name="proteins",
    layer="log",
    n_top_features=n_top_features,
    method="cv",  # Options: 'cv' (coefficient of variation) or 'dispersion'
    subset=True,  # If True, return filtered container; if False, add annotation
)

print("HVG Selection Results:")
print(f"  Original features: {n_features_original}")
print(f"  Selected features: {container_hvg.assays['proteins'].n_features}")
print(f"  Features removed: {n_features_original - container_hvg.assays['proteins'].n_features}")

### 4.1 HVG Annotation Mode

Instead of subsetting, you can annotate features without removing them.

In [None]:
# Annotate HVGs without subsetting
container_annotated = select_hvg(
    container=container,
    assay_name="proteins",
    layer="log",
    n_top_features=50,
    method="cv",
    subset=False,  # Add annotation instead of subsetting
)

# Check annotations
var = container_annotated.assays["proteins"].var
if "highly_variable" in var.columns:
    n_hvg = var["highly_variable"].sum()
    print("HVG Annotation Results:")
    print(f"  Total features: {len(var)}")
    print(f"  Highly variable: {n_hvg}")
    print(f"  Percentage: {n_hvg / len(var) * 100:.1f}%")

    # View variability scores
    if "variability_score" in var.columns:
        print("\nVariability Score Statistics:")
        print(f"  Min: {var['variability_score'].min():.4f}")
        print(f"  Max: {var['variability_score'].max():.4f}")
        print(f"  Mean: {var['variability_score'].mean():.4f}")

### 4.2 Visualizing HVG Selection

In [None]:
# Visualize mean-variance relationship for HVG selection
from scptensor.feature_selection._shared import _compute_mean_var

# Compute mean and variance
X = container.assays["proteins"].layers["log"].X
means, variances = _compute_mean_var(X, axis=0)

# Compute coefficient of variation
eps = np.finfo(means.dtype).eps
cv = np.sqrt(variances) / (means + eps)

# Get HVG annotation
is_hvg = (
    var["highly_variable"].to_numpy()
    if "highly_variable" in var.columns
    else np.zeros(len(means), dtype=bool)
)

# Plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Mean vs Variance (colored by HVG status)
colors = ["red" if h else "lightblue" for h in is_hvg]
axes[0].scatter(means, variances, c=colors, alpha=0.6, s=20)
axes[0].set_xlabel("Mean Expression")
axes[0].set_ylabel("Variance")
axes[0].set_title("Mean-Variance Relationship")
axes[0].set_xscale("log")
axes[0].set_yscale("log")

# Add legend
from matplotlib.patches import Patch

legend_elements = [
    Patch(facecolor="red", label=f"HVG ({is_hvg.sum()})"),
    Patch(facecolor="lightblue", label=f"Non-HVG ({(~is_hvg).sum()})"),
]
axes[0].legend(handles=legend_elements)

# CV distribution
axes[1].hist(cv, bins=50, color="steelblue", edgecolor="black", alpha=0.7)
axes[1].axvline(
    cv[is_hvg].min() if is_hvg.sum() > 0 else cv.max(),
    color="red",
    linestyle="--",
    linewidth=2,
    label="HVG threshold",
)
axes[1].set_xlabel("Coefficient of Variation")
axes[1].set_ylabel("Frequency")
axes[1].set_title("CV Distribution")
axes[1].legend()

plt.tight_layout()
plt.savefig("tutorial_output/hvg_selection.png", dpi=300)
plt.show()

print("HVG visualization saved to: tutorial_output/hvg_selection.png")

## 5. Variance Stabilizing Transformation (VST)

VST selects variable features using the Seurat-style approach, modeling the mean-variance relationship.
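
The idea behind the Seurat-style approach can be sketched in NumPy: fit the trend of variance against mean in log10 space, standardize each feature with the variance the trend predicts, and rank features by the variance of the standardized values. The sketch below makes two loud assumptions: it swaps Seurat's loess fit for a quadratic polynomial (`np.polyfit`), and it runs on simulated count-like data. The helper `vst_scores` is hypothetical and is not ScpTensor's implementation of `select_by_vst`.

```python
import numpy as np


def vst_scores(X, degree=2):
    """Variance of standardized values per feature (higher = more variable).

    Hypothetical helper; uses a polynomial trend instead of loess.
    """
    n_samples = X.shape[0]
    means = X.mean(axis=0)
    variances = X.var(axis=0, ddof=1)

    # Fit the mean-variance trend in log10 space on usable features only
    ok = (means > 0) & (variances > 0)
    coef = np.polyfit(np.log10(means[ok]), np.log10(variances[ok]), deg=degree)
    expected_sd = np.ones_like(means)
    expected_sd[ok] = np.sqrt(10 ** np.polyval(coef, np.log10(means[ok])))

    # Standardize with the expected SD, cap extreme values, take the variance
    Z = (X - means) / expected_sd
    Z = np.minimum(Z, np.sqrt(n_samples))
    scores = Z.var(axis=0, ddof=1)
    scores[~ok] = 0.0
    return scores


# Simulated counts: 300 samples x 200 features with Poisson noise
rng = np.random.default_rng(0)
lam = rng.uniform(1, 50, size=200)
X = rng.poisson(lam=lam, size=(300, 200)).astype(float)
# Make feature 0 bimodal so it is far more variable than the trend predicts
X[:, 0] = np.where(np.arange(300) < 150, rng.poisson(5, 300), rng.poisson(50, 300))

scores = vst_scores(X)
top = np.argsort(scores)[::-1][:10]
print("Top 10 features by standardized variance:", top)
```

Features that follow the fitted trend score close to 1, so a score well above 1 indicates more variability than expected at that expression level.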

In [None]:
# Select features using VST
container_vst = select_by_vst(
    container=container,
    assay_name="proteins",
    layer="log",
    n_top_features=100,
    subset=True,
)

print("VST Selection Results:")
print(f"  Original features: {n_features_original}")
print(f"  Selected features: {container_vst.assays['proteins'].n_features}")

## 6. Dispersion-Based Selection

Select features based on normalized dispersion (variance-to-mean ratio).
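
To make "normalized" concrete, here is a minimal sketch of one common scheme: compute the variance-to-mean ratio per feature, group features into bins of similar mean, and z-score the dispersion within each bin. The helper `normalized_dispersion`, the equal-width binning, and the simulated data are assumptions for illustration; `select_by_dispersion` may bin or scale differently.

```python
import numpy as np


def normalized_dispersion(X, n_bins=20):
    """Z-score the variance-to-mean ratio within bins of similar mean.

    Hypothetical helper for illustration only.
    """
    means = X.mean(axis=0)
    variances = X.var(axis=0, ddof=1)
    eps = np.finfo(float).eps  # guards against division by zero
    dispersion = variances / (means + eps)

    # Equal-width bins along the mean; bin ids run from 0 to n_bins - 1
    edges = np.linspace(means.min(), means.max(), n_bins + 1)
    bin_ids = np.digitize(means, edges[1:-1])

    norm_disp = np.zeros_like(dispersion)
    for b in np.unique(bin_ids):
        in_bin = bin_ids == b
        mu = dispersion[in_bin].mean()
        sd = dispersion[in_bin].std(ddof=1) if in_bin.sum() > 1 else 0.0
        # Bins with a single feature or zero spread get a neutral score of 0
        norm_disp[in_bin] = (dispersion[in_bin] - mu) / sd if sd > 0 else 0.0
    return norm_disp


rng = np.random.default_rng(1)
lam = rng.uniform(5, 50, size=200)
X = rng.poisson(lam=lam, size=(300, 200)).astype(float)

norm_disp = normalized_dispersion(X, n_bins=20)
top = np.argsort(norm_disp)[::-1][:10]
print("Top 10 features by normalized dispersion:", top)
```

Binning compares each feature only with features of similar abundance, so a feature is not favored or penalized simply because dispersion drifts with the mean.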

In [None]:
# Select features by dispersion
container_disp = select_by_dispersion(
    container=container,
    assay_name="proteins",
    layer="log",
    n_top_features=100,
    n_bins=20,  # Number of bins for normalization
    subset=True,
)

print("Dispersion Selection Results:")
print(f"  Original features: {n_features_original}")
print(f"  Selected features: {container_disp.assays['proteins'].n_features}")

## 7. Dropout-Based Filtering

Filter features based on their dropout rate (the fraction of samples in which a feature is missing).
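
For intuition, the dropout rate of a feature can be computed by hand as the fraction of samples in which it is missing. The sketch below treats both `NaN` and exact zeros as missing, which is an assumption; how `get_dropout_stats` defines a missing value is not shown in this tutorial. The helper `dropout_rate` is hypothetical.

```python
import numpy as np


def dropout_rate(X):
    """Fraction of samples in which each feature is missing (NaN or zero)."""
    missing = np.isnan(X) | (X == 0)
    return missing.mean(axis=0)


# Toy data: 4 samples x 3 features
X = np.array(
    [
        [1.2, 0.0, np.nan],
        [0.8, 0.0, 3.1],
        [1.5, 2.2, np.nan],
        [0.9, 0.0, np.nan],
    ]
)
rates = dropout_rate(X)
keep = rates <= 0.5  # same meaning as max_dropout_rate=0.5
print("Dropout rates:", rates)
print("Features kept:", np.flatnonzero(keep))
```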

In [None]:
# Get dropout statistics first
dropout_stats = get_dropout_stats(
    container=container,
    assay_name="proteins",
    layer="raw",
)

print("Dropout Statistics:")
print(dropout_stats.head(10))

# Summary
print("\nDropout Rate Summary:")
print(f"  Mean dropout rate: {dropout_stats['dropout_rate'].mean() * 100:.2f}%")
print(f"  Median dropout rate: {dropout_stats['dropout_rate'].median() * 100:.2f}%")
print(f"  Features with >50% dropout: {(dropout_stats['dropout_rate'] > 0.5).sum()}")

### 7.1 Filter by Dropout Rate

In [None]:
# Filter features with high dropout rate
container_dropout = select_by_dropout(
    container=container,
    assay_name="proteins",
    layer="raw",
    max_dropout_rate=0.5,  # Keep features detected in at least 50% of samples
    subset=True,
)

print("Dropout Filtering Results:")
print(f"  Original features: {n_features_original}")
print(f"  Features after filtering: {container_dropout.assays['proteins'].n_features}")
print(
    f"  Features removed: {n_features_original - container_dropout.assays['proteins'].n_features}"
)

### 7.2 Visualizing Dropout Rates

In [None]:
# Visualize dropout rate distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Dropout rate histogram
axes[0].hist(dropout_stats["dropout_rate"], bins=50, color="coral", edgecolor="black")
axes[0].axvline(0.5, color="red", linestyle="--", linewidth=2, label="50% threshold")
axes[0].set_xlabel("Dropout Rate")
axes[0].set_ylabel("Frequency")
axes[0].set_title("Dropout Rate Distribution")
axes[0].legend()

# Mean expression vs dropout rate
mean_expr = dropout_stats["mean_expression"].to_numpy()
dropout_rate = dropout_stats["dropout_rate"].to_numpy()

axes[1].scatter(mean_expr, dropout_rate, alpha=0.5, s=20)
axes[1].axhline(0.5, color="red", linestyle="--", linewidth=2)
axes[1].set_xlabel("Mean Expression")
axes[1].set_ylabel("Dropout Rate")
axes[1].set_title("Mean Expression vs Dropout Rate")
axes[1].set_xscale("log")

plt.tight_layout()
plt.savefig("tutorial_output/dropout_analysis.png", dpi=300)
plt.show()

print("Dropout analysis saved to: tutorial_output/dropout_analysis.png")

## 8. Model-Based Feature Selection

Use machine learning models to identify the most important features.

In [None]:
# Select features using random forest importance
# Note: This requires scikit-learn
try:
    container_rf = select_by_model_importance(
        container=container,
        assay_name="proteins",
        layer="log",
        n_top_features=100,
        model_type="random_forest",  # Options: 'random_forest', 'variance_threshold'
        random_state=42,
        subset=True,
    )

    print("Random Forest Feature Selection Results:")
    print(f"  Original features: {n_features_original}")
    print(f"  Selected features: {container_rf.assays['proteins'].n_features}")
except Exception as e:
    print(f"Random forest selection skipped: {e}")
    print("(Requires scikit-learn: pip install scikit-learn)")

## 9. PCA Loading-Based Selection

Select features based on their contribution to principal components.
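
One way to turn loadings into a per-feature score is to sum each feature's squared loadings over the leading components, weighting every component by its explained variance ratio. The sketch below does this with an SVD in NumPy. The weighting scheme and the helper `pca_loading_scores` are assumptions for illustration; `select_by_pca_loadings` may aggregate loadings differently.

```python
import numpy as np


def pca_loading_scores(X, n_components=10):
    """Score features by variance-weighted squared PCA loadings.

    Hypothetical helper for illustration only.
    """
    Xc = X - X.mean(axis=0)  # center each feature
    # Rows of Vt are the principal axes (unit-length loading vectors)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    k = min(n_components, Vt.shape[0])
    explained = s[:k] ** 2 / np.sum(s**2)  # explained variance ratio per PC
    return (Vt[:k] ** 2 * explained[:, None]).sum(axis=0)


rng = np.random.default_rng(2)
X = rng.normal(size=(100, 20))
X[:, 0] *= 10  # feature 0 dominates the total variance

scores = pca_loading_scores(X, n_components=3)
top = np.argsort(scores)[::-1][:5]
print("Top 5 features by PCA loading score:", top)
```

Without scaling, features with large raw variance dominate the leading components, so the input layer and any standardization step strongly influence which features are selected.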

In [None]:
# Select features by PCA loadings
container_pca = select_by_pca_loadings(
    container=container,
    assay_name="proteins",
    layer="log",
    n_top_features=100,
    n_components=10,  # Use first 10 PCs
    subset=True,
)

print("PCA Loading-Based Selection Results:")
print(f"  Original features: {n_features_original}")
print(f"  Selected features: {container_pca.assays['proteins'].n_features}")

## 10. Comparing Feature Selection Methods

Compare overlap and differences between different feature selection methods.
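
Besides raw overlap counts, a symmetric measure such as the Jaccard index (intersection size divided by union size) makes selections of different sizes comparable. The sketch below uses made-up feature IDs and a hypothetical `jaccard` helper; any sets of feature IDs can be substituted.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Made-up selections for illustration
selections = {
    "A": {"P1", "P2", "P3", "P4"},
    "B": {"P3", "P4", "P5"},
    "C": {"P1", "P4", "P6", "P7"},
}

pairwise = {
    (m1, m2): jaccard(selections[m1], selections[m2])
    for m1, m2 in combinations(selections, 2)
}
consensus = set.intersection(*selections.values())

for pair, value in pairwise.items():
    print(f"{pair[0]} vs {pair[1]}: Jaccard = {value:.2f}")
print(f"Selected by every method: {sorted(consensus)}")
```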

In [None]:
# Compare selected features from different methods
# Get selected feature IDs from each method
hvg_ids = set(container_hvg.assays["proteins"].var["_index"].to_list())
vst_ids = set(container_vst.assays["proteins"].var["_index"].to_list())
disp_ids = set(container_disp.assays["proteins"].var["_index"].to_list())
pca_ids = set(container_pca.assays["proteins"].var["_index"].to_list())

print("Feature Selection Comparison:")
print("=" * 60)
print(f"HVG: {len(hvg_ids)} features")
print(f"VST: {len(vst_ids)} features")
print(f"Dispersion: {len(disp_ids)} features")
print(f"PCA: {len(pca_ids)} features")

# Calculate overlaps
hvg_vst_overlap = len(hvg_ids & vst_ids)
hvg_disp_overlap = len(hvg_ids & disp_ids)
hvg_pca_overlap = len(hvg_ids & pca_ids)
all_overlap = len(hvg_ids & vst_ids & disp_ids & pca_ids)

print("\nOverlap with HVG:")
print(f"  HVG & VST: {hvg_vst_overlap} ({hvg_vst_overlap / len(hvg_ids) * 100:.1f}%)")
print(f"  HVG & Dispersion: {hvg_disp_overlap} ({hvg_disp_overlap / len(hvg_ids) * 100:.1f}%)")
print(f"  HVG & PCA: {hvg_pca_overlap} ({hvg_pca_overlap / len(hvg_ids) * 100:.1f}%)")
print(f"  All four methods: {all_overlap}")

### 10.1 Pairwise Overlap Heatmap

In [None]:
# Create a simplified overlap visualization

methods = {
    "HVG": hvg_ids,
    "VST": vst_ids,
    "Dispersion": disp_ids,
    "PCA": pca_ids,
}

# Calculate pairwise overlaps
fig, ax = plt.subplots(figsize=(10, 8))

# Create overlap matrix
method_names = list(methods.keys())
n_methods = len(method_names)
overlap_matrix = np.zeros((n_methods, n_methods))

for i, m1 in enumerate(method_names):
    for j, m2 in enumerate(method_names):
        if i == j:
            overlap_matrix[i, j] = len(methods[m1])
        else:
            overlap_matrix[i, j] = len(methods[m1] & methods[m2])

# Plot heatmap
im = ax.imshow(overlap_matrix, cmap="Blues")
ax.set_xticks(range(n_methods))
ax.set_yticks(range(n_methods))
ax.set_xticklabels(method_names)
ax.set_yticklabels(method_names)
ax.set_title("Feature Selection Overlap Matrix")

# Add text annotations
for i in range(n_methods):
    for j in range(n_methods):
        ax.text(
            j, i, int(overlap_matrix[i, j]), ha="center", va="center", color="black", fontsize=10
        )

plt.colorbar(im, ax=ax, label="Number of Features")
plt.tight_layout()
plt.savefig("tutorial_output/feature_selection_overlap.png", dpi=300)
plt.show()

print("Overlap visualization saved to: tutorial_output/feature_selection_overlap.png")

## 11. Impact of Feature Selection on Clustering

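Compare PCA embeddings computed from all features and from the HVG subset to see how feature selection changes the separation of cell types, which serves as a visual proxy for downstream clustering quality.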
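
A visual comparison can be complemented by a number. The sketch below computes the mean silhouette score of known labels in an embedding, using NumPy only; values near 1 mean well-separated groups, and values near or below 0 mean overlapping groups. The helper `mean_silhouette` and the toy points are hypothetical. It builds the full pairwise distance matrix, so it suits only small datasets, and it assumes at least two distinct labels.

```python
import numpy as np


def mean_silhouette(points, labels):
    """Mean silhouette score for labeled points (Euclidean distance).

    Hypothetical helper; memory use grows with the square of the sample count.
    """
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff**2).sum(axis=-1))  # pairwise distance matrix

    unique_labels = np.unique(labels)
    scores = np.zeros(len(points))
    for i in range(len(points)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same < 2:
            continue  # singleton groups keep a score of 0
        # Mean distance to the other members of the same group
        a = dist[i, same].sum() / (n_same - 1)
        # Mean distance to the nearest other group
        b = min(
            dist[i, labels == other].mean()
            for other in unique_labels
            if other != labels[i]
        )
        scores[i] = (b - a) / max(a, b)
    return scores.mean()


# Toy embedding: two tight groups far apart
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good = mean_silhouette(points, [0, 0, 1, 1])  # labels match the groups
bad = mean_silhouette(points, [0, 1, 0, 1])   # labels cut across the groups
print(f"Matching labels: {good:.3f}")
print(f"Mismatched labels: {bad:.3f}")
```

Applied to the first few principal components of the full and the feature-selected data with the same cell type labels, a higher score for the selected set would suggest that selection improved separation.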

In [None]:
# Compare PCA on full vs selected features
from scptensor import reduce_pca

# Run PCA on full feature set
container_full_pca = reduce_pca(
    container=container,
    assay_name="proteins",
    base_layer_name="log",
    new_assay_name="pca_full",
    n_components=10,
)

# Run PCA on HVG-selected features
container_hvg_pca = reduce_pca(
    container=container_hvg,
    assay_name="proteins",
    base_layer_name="log",
    new_assay_name="pca_hvg",
    n_components=10,
)

# Get cell type colors
cell_types = container.obs["cell_type"].to_numpy()
unique_ct = sorted(container.obs["cell_type"].unique().to_list())
ct_colors = plt.cm.Set2(np.linspace(0, 1, len(unique_ct)))
ct_to_color = dict(zip(unique_ct, ct_colors, strict=False))
point_colors = [ct_to_color[ct] for ct in cell_types]

# Plot comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Full feature PCA
pc1_full = container_full_pca.assays["pca_full"].layers["scores"].X[:, 0]
pc2_full = container_full_pca.assays["pca_full"].layers["scores"].X[:, 1]
axes[0].scatter(pc1_full, pc2_full, c=point_colors, alpha=0.6, s=30)
axes[0].set_xlabel(
    f"PC1 ({container_full_pca.assays['pca_full'].var['variance_ratio'][0] * 100:.1f}%)"
)
axes[0].set_ylabel(
    f"PC2 ({container_full_pca.assays['pca_full'].var['variance_ratio'][1] * 100:.1f}%)"
)
axes[0].set_title(f"PCA: All Features ({n_features_original})")

# Add legend for full
for ct, color in ct_to_color.items():
    axes[0].scatter([], [], c=[color], label=ct, s=50)
axes[0].legend()

# HVG PCA
pc1_hvg = container_hvg_pca.assays["pca_hvg"].layers["scores"].X[:, 0]
pc2_hvg = container_hvg_pca.assays["pca_hvg"].layers["scores"].X[:, 1]
axes[1].scatter(pc1_hvg, pc2_hvg, c=point_colors, alpha=0.6, s=30)
axes[1].set_xlabel(
    f"PC1 ({container_hvg_pca.assays['pca_hvg'].var['variance_ratio'][0] * 100:.1f}%)"
)
axes[1].set_ylabel(
    f"PC2 ({container_hvg_pca.assays['pca_hvg'].var['variance_ratio'][1] * 100:.1f}%)"
)
axes[1].set_title(f"PCA: HVG Selected ({container_hvg.assays['proteins'].n_features})")

# Add legend for HVG
for ct, color in ct_to_color.items():
    axes[1].scatter([], [], c=[color], label=ct, s=50)
axes[1].legend()

plt.tight_layout()
plt.savefig("tutorial_output/pca_feature_selection_comparison.png", dpi=300)
plt.show()

print("PCA comparison saved to: tutorial_output/pca_feature_selection_comparison.png")

## 12. Feature Selection Strategy Guidelines

### When to Use Each Method:

1. **HVG (CV-based)**:
   - Best for: General-purpose feature selection
   - Works well: When expression varies across conditions
   - Fast computation

2. **VST (Seurat-style)**:
   - Best for: Single-cell RNA-seq style analysis
   - Works well: With normalized log-transformed data
   - Accounts for mean-variance relationship

3. **Dispersion-based**:
   - Best for: Features with high variance-to-mean ratio
   - Works well: When variance increases with mean

4. **Dropout-based**:
   - Best for: Removing unreliable features
   - Works well: As a pre-filtering step
   - Conservative approach

5. **PCA loading-based**:
   - Best for: Workflows centered on dimensionality reduction
   - Works well: When using PCA for downstream analysis
   - Captures major sources of variation

6. **Model-based**:
   - Best for: Complex patterns
   - Works well: With labeled data (supervised)
   - Slower but may capture non-linear relationships

## 13. Best Practices Summary

### Recommended Workflow:

1. **Pre-filtering** (optional):
   - Remove features with very high dropout rates (>80%)
   - Remove features with near-zero variance

2. **Normalization**:
   - Apply log transformation before HVG/VST
   - Consider Z-score normalization for some methods

3. **Feature Selection**:
   - Use HVG or VST as primary method
   - Select 1000-2000 features for large datasets
   - Select 50-200 features for small datasets

4. **Validation**:
   - Check biological relevance of selected features
   - Compare PCA before/after feature selection
   - Assess clustering quality

### Common Pitfalls:
- **Over-filtering**: Removing too many features loses biological signal
- **Under-filtering**: Too many features include noise
- **Ignoring batch effects**: Batch-specific features may be selected
- **Not validating**: Always check the impact on downstream analysis

### Parameter Guidelines:
- **n_top_features**: 
  - Small datasets (<100 samples): 50-200 features
  - Medium datasets (100-1000 samples): 500-1000 features
  - Large datasets (>1000 samples): 1000-2000 features

- **max_dropout_rate**: 
  - Conservative: 0.3 (70% detection required)
  - Moderate: 0.5 (50% detection required)
  - Permissive: 0.7 (30% detection required)
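
The `n_top_features` guidelines above can be encoded as a small lookup. The helper `suggest_n_top_features` is hypothetical and simply mirrors the ranges listed in this section, capped at the number of features actually available; treat the result as a starting point rather than a rule.

```python
def suggest_n_top_features(n_samples, n_features):
    """Return a (low, high) range for n_top_features based on dataset size.

    Hypothetical helper that mirrors the guideline ranges in this tutorial.
    """
    if n_samples < 100:
        low, high = 50, 200
    elif n_samples <= 1000:
        low, high = 500, 1000
    else:
        low, high = 1000, 2000
    # Never suggest more features than the dataset contains
    return min(low, n_features), min(high, n_features)


small = suggest_n_top_features(n_samples=80, n_features=5000)
medium = suggest_n_top_features(n_samples=500, n_features=800)
large = suggest_n_top_features(n_samples=5000, n_features=3000)
print(small, medium, large)
```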

## Summary

In this tutorial, you learned:

### Feature Selection Methods:
1. **select_hvg()**: Highly variable feature selection by coefficient of variation or dispersion
2. **select_by_vst()**: Variance stabilizing transformation (Seurat-style)
3. **select_by_dispersion()**: Normalized dispersion-based selection
4. **select_by_dropout()**: Filter by missing/dropout rate
5. **select_by_model_importance()**: Model-based importance (e.g., random forest)
6. **select_by_pca_loadings()**: PCA contribution-based selection

### Key Concepts:
- **Variability**: Variable features are often more informative
- **Dropout rate**: High missingness reduces reliability
- **Mean-variance relationship**: Must be accounted for when ranking features by variability
- **Method selection**: Depends on data characteristics and goals

### Best Practices:
- Apply appropriate normalization before feature selection
- Use annotation mode (subset=False) for exploration
- Use subset mode (subset=True) for final filtering
- Compare multiple methods to find robust features
- Validate by checking impact on PCA and clustering

### Next Steps:
- **Tutorial 08**: Custom Analysis Pipeline
- Apply selected features to clustering and differential expression
- Explore pathway enrichment on selected features