# Single-Cell RNA-seq Analysis Pipeline

Interactive pipeline for scRNA-seq preprocessing, dimensionality reduction, clustering, and annotation.

## Pipeline Overview

1. **Data Loading** - Load data in h5ad, rds, or mtx format
2. **Quality Control** - Interactive QC metric visualization and filtering
3. **Doublet Detection** - Identify and remove doublets
4. **Normalization** - Choose normalization method
5. **Feature Selection** - Select highly variable genes
6. **Scaling and Regression** (optional) - Scale data and regress out unwanted variation
7. **Dimensionality Reduction** - PCA and UMAP
8. **Clustering** - Interactive resolution selection
9. **Save Preprocessed Data** - Checkpoint before annotation
10. **Gene Expression Visualization** - View gene expression on UMAP
11. **Marker Gene Analysis** - Identify cluster markers
12. **Manual Cell Type Annotation** - Annotate cell types
13. **Cluster Manipulation** (optional) - Split or merge clusters
14. **Subclustering** (optional) - Refine specific clusters
15. **Save Annotated Data** - Save the final object and export annotations
16. **Generate Reports** - Generate QC and analysis PDF reports

## Setup

In [None]:
# Import libraries
import sys
sys.path.append('..')

import scanpy as sc
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# Import pipeline modules
from src import io, qc, preprocessing, reduction, clustering, annotation, visualization, interactive, reports

# Scanpy settings
sc.settings.verbosity = 3
sc.settings.set_figure_params(dpi=80, facecolor='white')

print("✓ Setup complete")

## 1. Data Loading

Load your scRNA-seq data. Supported formats:
- `.h5ad` (AnnData)
- `.rds` (Seurat object)
- `matrix.mtx` + `barcodes.tsv` + `genes.tsv` or `features.tsv` (10X format)

In [None]:
# Option 1: Load h5ad file
# adata = io.load_h5ad('../data/your_data.h5ad')

# Option 2: Load RDS file (Seurat object)
# adata = io.load_rds('../data/your_data.rds')

# Option 3: Load 10X MTX format
# adata = io.load_10x_mtx('../data/filtered_feature_bc_matrix/')

# Option 4: Auto-detect format
adata = io.load_data('../data/your_data.h5ad')  # Change to your file path

print(f"\nLoaded: {adata.n_obs} cells × {adata.n_vars} genes")
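
How `io.load_data` picks a loader is not shown in this notebook. The sketch below is one plausible way to auto-detect the format, using a hypothetical helper `detect_format` that looks only at the file suffix or at the contents of a 10X directory.

```python
from pathlib import Path
import tempfile

def detect_format(path):
    """Guess the input format from a file suffix or a 10X directory layout."""
    p = Path(path)
    if p.is_dir():
        # A 10X directory is recognised by its matrix file (plain or gzipped)
        if (p / "matrix.mtx").exists() or (p / "matrix.mtx.gz").exists():
            return "10x_mtx"
        raise ValueError(f"No matrix.mtx found in directory: {p}")
    suffix = p.suffix.lower()
    if suffix == ".h5ad":
        return "h5ad"
    if suffix == ".rds":
        return "rds"
    raise ValueError(f"Unsupported format: {p.name}")

# Demo with a temporary directory standing in for a 10X output folder
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "matrix.mtx").touch()
    tenx_format = detect_format(tmp)

print(detect_format("sample.h5ad"), detect_format("sample.RDS"), tenx_format)
```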

In [None]:
# Save raw data checkpoint
io.save_checkpoint(adata, 'raw')
print("✓ Raw data saved")

## 2. Quality Control

Calculate QC metrics and visualize data quality.

In [None]:
# Calculate QC metrics
adata = qc.calculate_qc_metrics(adata, mito_prefix='MT-')

# Display QC summary
summary = qc.get_qc_summary(adata)
print("\nQC Metrics Summary:")
display(summary)

In [None]:
# Generate QC plots
fig = qc.plot_qc_metrics(adata, figsize=(15, 10))
plt.show()
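
For orientation, the per-cell metrics usually inspected at this step can be computed by hand. This is a minimal NumPy sketch on a toy count matrix, not the implementation of `qc.calculate_qc_metrics`; the gene names and counts are made up.

```python
import numpy as np

# Toy count matrix: 3 cells x 4 genes (hypothetical values)
counts = np.array([
    [10, 0, 5, 5],
    [0, 0, 2, 18],
    [4, 4, 4, 8],
])
gene_names = np.array(["ACTB", "CD3E", "GAPDH", "MT-CO1"])

# Genes whose symbol starts with the mitochondrial prefix
is_mito = np.char.startswith(gene_names, "MT-")

total_counts = counts.sum(axis=1)        # counts per cell
n_genes = (counts > 0).sum(axis=1)       # genes detected per cell
pct_mt = 100 * counts[:, is_mito].sum(axis=1) / total_counts

print(total_counts, n_genes, pct_mt)
```

Human gene symbols use the `MT-` prefix; mouse symbols use `mt-`, so `mito_prefix` needs to match the species.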

### Interactive QC Filtering

Adjust thresholds interactively and see the number of cells that will be retained.

In [None]:
# Interactive QC filtering widget
qc_widget = interactive.QCFilterWidget(adata)
qc_widget.display()

In [None]:
# Apply filters (run after clicking the "Apply Filters" button)
filtered = qc_widget.get_filtered_data()

if filtered is not None:
    adata = filtered  # Replace adata only once filters have been applied
    print(f"\n✓ Filtered data: {adata.n_obs} cells × {adata.n_vars} genes")
else:
    print("Please click the 'Apply Filters' button first")
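
Threshold filtering of this kind reduces to a boolean mask over cells. The sketch below uses hypothetical QC values and thresholds to show how the retained-cell count follows from the chosen cutoffs; the widget's own logic may differ.

```python
import numpy as np

# Hypothetical per-cell QC values
n_genes = np.array([150, 1200, 3000, 7500, 2200])
pct_mt = np.array([3.0, 8.0, 25.0, 4.0, 12.0])

# Hypothetical thresholds, as might be set with the sliders
min_genes, max_genes, max_pct_mt = 200, 6000, 20.0

# A cell is kept only if it passes every threshold
keep = (n_genes >= min_genes) & (n_genes <= max_genes) & (pct_mt <= max_pct_mt)
print(f"{keep.sum()} of {keep.size} cells retained")
```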

In [None]:
# Filter genes (remove genes detected in < 20 cells)
qc.filter_genes_by_counts(adata, min_cells=20)

## 3. Doublet Detection

Identify and remove potential doublets using Scrublet.

In [None]:
# Detect doublets
adata = preprocessing.detect_doublets(adata, expected_doublet_rate=0.06)
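
Scrublet scores cells by comparing them with synthetic doublets built from random pairs of observed cells. The sketch below illustrates only that simulation step on toy counts; the embedding and nearest-neighbor scoring that follow are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy observed counts: 6 cells x 4 genes (hypothetical values)
observed = rng.poisson(lam=2.0, size=(6, 4))

# Draw random pairs of cells and sum their counts to mimic a doublet
n_sim = 10
pairs = rng.integers(0, observed.shape[0], size=(n_sim, 2))
simulated = observed[pairs[:, 0]] + observed[pairs[:, 1]]

print(simulated.shape)
```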

### Interactive Doublet Filtering

In [None]:
# Interactive doublet filtering widget
doublet_widget = interactive.DoubletFilterWidget(adata)
doublet_widget.display()

In [None]:
# Apply doublet filtering (run after clicking the "Remove Doublets" button)
filtered = doublet_widget.get_filtered_data()

if filtered is not None:
    adata = filtered  # Replace adata only once doublets have been removed
    print(f"\n✓ After doublet removal: {adata.n_obs} cells × {adata.n_vars} genes")
else:
    print("Please click the 'Remove Doublets' button first")

## 4. Normalization

Choose normalization method based on your analysis goals:
- **log1p**: Library-size scaling followed by a log(1 + x) transform (recommended for general use)
- **scran**: Pooling-based size factors; widely used ahead of batch correction
- **pearson**: Analytic Pearson residuals; better for rare cell type identification

In [None]:
# Normalize data
preprocessing.normalize_data(adata, method='log1p', target_sum=1e4)
print("✓ Normalization complete")
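
As a reference for what the `log1p` option with `target_sum=1e4` typically means, here is a minimal NumPy sketch on a toy matrix. It assumes the usual two steps (scale each cell to a fixed total, then log-transform) and is not taken from `preprocessing.normalize_data`.

```python
import numpy as np

# Toy count matrix: 2 cells x 3 genes (hypothetical values)
counts = np.array([
    [1.0, 3.0, 6.0],
    [20.0, 0.0, 20.0],
])
target_sum = 1e4

# Scale each cell so its counts sum to target_sum, then apply log(1 + x)
size_factors = counts.sum(axis=1, keepdims=True) / target_sum
scaled = counts / size_factors
normalized = np.log1p(scaled)

print(scaled.sum(axis=1))
```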

## 5. Feature Selection

Select highly variable genes for downstream analysis.

In [None]:
# Select highly variable genes
preprocessing.select_highly_variable_genes(adata, method='seurat', n_top_genes=2000)

# Plot HVGs
fig = preprocessing.plot_highly_variable_genes(adata)
plt.show()
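
The idea behind dispersion-based gene selection can be sketched in a few lines. This simplified version ranks genes by their variance-to-mean ratio on toy data; it leaves out the mean-binning and within-bin standardization that the `seurat` method applies, so treat it as an illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n_cells = 200

# Toy data: genes 0-2 are flat, gene 3 is expressed in only half of the cells
flat = rng.poisson(lam=5.0, size=(n_cells, 3)).astype(float)
variable = np.where(np.arange(n_cells) < n_cells // 2, 0.0, 20.0)[:, None]
counts = np.hstack([flat, variable])

mean = counts.mean(axis=0)
dispersion = counts.var(axis=0, ddof=1) / mean   # variance-to-mean ratio

# Keep the genes with the highest dispersion
n_top_genes = 1
top = np.argsort(dispersion)[::-1][:n_top_genes]
print(top)
```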

## 6. Scaling and Regression (Optional)

Scale data and optionally regress out unwanted sources of variation.

In [None]:
# Optional: Regress out total counts and mitochondrial percentage
# preprocessing.regress_out(adata, keys=['total_counts', 'pct_counts_mt'])

# Scale data
preprocessing.scale_data(adata, max_value=10)
print("✓ Data scaled")
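
Scaling with `max_value=10` usually means a per-gene z-score followed by clipping. The sketch below shows why the cap matters: a gene expressed in a single cell produces an extreme z-score. It clips only from above; whether negative values are clipped as well depends on the implementation.

```python
import numpy as np

n_cells = 200
X = np.zeros((n_cells, 2))
X[:, 0] = np.arange(n_cells) % 2     # balanced gene: half zeros, half ones
X[0, 1] = 1.0                        # outlier gene: expressed in a single cell

max_value = 10
# Per-gene z-score: zero mean, unit variance
z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
z_clipped = np.clip(z, None, max_value)   # cap extreme z-scores from above

print(z[0, 1], z_clipped[0, 1])
```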

## 7. Dimensionality Reduction

Compute PCA, then UMAP for visualization.

In [None]:
# Compute PCA
reduction.compute_pca(adata, n_comps=50)

# Plot variance explained
fig = reduction.plot_pca_variance(adata, n_pcs=50)
plt.show()
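
The variance-explained curve can be reproduced from a singular value decomposition. The sketch below computes variance ratios on a toy matrix and picks the smallest number of PCs that reaches 90% cumulative variance; that cutoff is a hypothetical rule of thumb, not something this pipeline prescribes.

```python
import numpy as np

rng = np.random.default_rng(2)
# Toy scaled matrix: 100 cells x 5 genes, with the first gene dominating variance
X = rng.normal(size=(100, 5)) * np.array([5.0, 2.0, 1.0, 1.0, 1.0])

Xc = X - X.mean(axis=0)                       # centre each gene
singular_values = np.linalg.svd(Xc, compute_uv=False)
variance = singular_values**2 / (X.shape[0] - 1)
variance_ratio = variance / variance.sum()
cumulative = np.cumsum(variance_ratio)

# Hypothetical rule of thumb: smallest number of PCs reaching 90% of variance
n_pcs = int(np.searchsorted(cumulative, 0.90) + 1)
print(variance_ratio.round(3), n_pcs)
```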

### Interactive PCA Component Selection

In [None]:
# Interactive PCA selection widget
pca_widget = interactive.PCAWidget(adata)
pca_widget.display()

In [None]:
# Get selected number of PCs
n_pcs = pca_widget.get_n_pcs()
print(f"Using {n_pcs} principal components")

In [None]:
# Compute neighbors and UMAP
reduction.compute_neighbors(adata, n_neighbors=15, n_pcs=n_pcs)
reduction.compute_umap(adata, min_dist=0.5)
print("✓ UMAP computed")
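
Both UMAP and graph-based clustering start from a k-nearest-neighbor graph in PCA space. The sketch below builds one by brute force on a toy embedding; real implementations typically use approximate search and add edge weights, so this only illustrates the role of `n_neighbors`.

```python
import numpy as np

rng = np.random.default_rng(3)
# Toy PCA embedding: two well-separated groups of 5 cells in 3 PCs
pcs = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(5, 3)),
    rng.normal(loc=10.0, scale=0.1, size=(5, 3)),
])

n_neighbors = 3
# Pairwise Euclidean distances between all cells
diff = pcs[:, None, :] - pcs[None, :, :]
dist = np.sqrt((diff**2).sum(axis=-1))
np.fill_diagonal(dist, np.inf)                 # exclude self-matches

# Indices of the n_neighbors closest cells for each cell
knn = np.argsort(dist, axis=1)[:, :n_neighbors]
print(knn)
```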

## 8. Clustering

Cluster at several resolutions, then select the one that fits the data best. Higher resolutions yield more, smaller clusters.

In [None]:
# Compute clustering at multiple resolutions
clustering.compute_multiple_resolutions(adata, resolutions=[0.25, 0.5, 1.0, 1.5, 2.0])

### Interactive Resolution Selection

In [None]:
# Interactive clustering widget
clustering_widget = interactive.ClusteringResolutionWidget(adata, resolutions=[0.25, 0.5, 1.0, 1.5, 2.0])
clustering_widget.display()

In [None]:
# Get selected resolution
selected_res = clustering_widget.get_resolution()
print(f"\nUsing resolution: {selected_res}")

# Plot selected clustering
fig = reduction.plot_umap(adata, color=['leiden'], figsize=(10, 8))
plt.show()

In [None]:
# Display cluster sizes
fig = clustering.plot_cluster_sizes(adata, clustering_key='leiden')
plt.show()

# Cluster statistics
stats = clustering.compute_cluster_statistics(adata, clustering_key='leiden')
display(stats)
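
Cluster sizes and fractions are simple to check by hand. This sketch uses hypothetical labels and is independent of `clustering.compute_cluster_statistics`, whose exact output is not shown here.

```python
import numpy as np

# Hypothetical cluster labels, stored as strings
labels = np.array(["0", "1", "0", "2", "1", "0"])

# Count cells per cluster and convert to fractions of the dataset
clusters, sizes = np.unique(labels, return_counts=True)
fractions = sizes / sizes.sum()
for cluster, size, frac in zip(clusters, sizes, fractions):
    print(f"cluster {cluster}: {size} cells ({frac:.0%})")
```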

## 9. Save Preprocessed Data

Save a checkpoint before annotation.

In [None]:
# Save preprocessed data
io.save_checkpoint(adata, 'preprocessed')
print("✓ Preprocessed data saved")

## 10. Gene Expression Visualization

Visualize specific genes to inform annotation decisions.

In [None]:
# Interactive gene visualization widget
gene_widget = interactive.GeneVisualizationWidget(adata)
gene_widget.display()

# Enter gene names (one per line) and click "Plot Genes"

## 11. Marker Gene Analysis

Identify marker genes for each cluster.

In [None]:
# Find marker genes
annotation.find_marker_genes(adata, clustering_key='leiden', method='wilcoxon', n_genes=100)
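
With `method='wilcoxon'`, each gene is tested for higher expression in one cluster than in the remaining cells. The sketch below computes the equivalent Mann-Whitney U statistic for a single gene with a hypothetical helper `rank_sum_z`. It uses a normal approximation without tie correction, so values may differ slightly from a library implementation.

```python
import numpy as np

def rank_sum_z(in_cluster, rest):
    """Mann-Whitney U statistic with a normal-approximation z-score (no tie correction)."""
    x = np.asarray(in_cluster, dtype=float)
    y = np.asarray(rest, dtype=float)
    # U counts pairs where the in-cluster value is larger; ties count as 0.5
    greater = (x[:, None] > y[None, :]).sum()
    ties = (x[:, None] == y[None, :]).sum()
    u = greater + 0.5 * ties
    n1, n2 = x.size, y.size
    mu = n1 * n2 / 2
    sigma = np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return u, (u - mu) / sigma

# Toy expression of one gene: high in the cluster, low elsewhere
u, z = rank_sum_z([5.1, 4.8, 6.0, 5.5], [0.2, 0.0, 1.1, 0.4, 0.0, 0.3])
print(u, round(z, 2))
```

A large positive z-score marks the gene as a candidate marker for that cluster.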

In [None]:
# Plot top marker genes
fig = annotation.plot_marker_genes_heatmap(adata, n_genes=10, clustering_key='leiden')
plt.show()

In [None]:
# Dotplot of top markers
fig = annotation.plot_marker_genes_dotplot(adata, n_genes=5, clustering_key='leiden')
plt.show()

In [None]:
# View top markers for specific cluster
cluster_id = 0  # Change to cluster of interest
markers = annotation.get_top_markers(adata, cluster=cluster_id, n_genes=20)
display(markers)

## 12. Manual Cell Type Annotation

Annotate clusters based on marker genes.

In [None]:
# Interactive annotation widget
annotation_widget = interactive.AnnotationWidget(adata, clustering_key='leiden')
annotation_widget.display()

# For each cluster:
# 1. Select cluster from dropdown
# 2. Click "Show Markers" to see top marker genes
# 3. Enter cell type in text box
# 4. Click "Annotate Cluster"
# 5. Repeat for all clusters
# 6. Click "Finish Annotation" when done

### Alternative: Manual Annotation (Code)

If you prefer to annotate programmatically:

In [None]:
# Manual annotation dictionary
# annotations = {
#     '0': 'T cells',
#     '1': 'B cells',
#     '2': 'Monocytes',
#     '3': 'NK cells',
#     # ... add all clusters
# }

# annotation.annotate_clusters(adata, annotations=annotations, clustering_key='leiden')
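
Before applying an annotation dictionary, it helps to confirm that every cluster has an entry. The sketch below uses hypothetical labels and plain Python; note that the keys are strings, since cluster labels are usually stored that way.

```python
# Hypothetical cluster labels and annotation dictionary
cluster_labels = ["0", "1", "2", "0", "3", "1"]
annotations = {"0": "T cells", "1": "B cells", "2": "Monocytes"}

# Clusters present in the data but missing from the dictionary
missing = sorted(set(cluster_labels) - set(annotations))

# Map each cell to its cell type, falling back to a placeholder
cell_types = [annotations.get(label, "Unknown") for label in cluster_labels]
print(missing, cell_types)
```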

In [None]:
# Plot annotated UMAP
if 'cell_type' in adata.obs.columns:
    fig = annotation.plot_annotated_umap(adata, annotation_key='cell_type', figsize=(12, 10))
    plt.show()

## 13. Cluster Manipulation (Optional)

Split or merge clusters if needed.

In [None]:
# Merge clusters (example: merge clusters 3 and 5)
# clustering.merge_clusters(adata, clusters_to_merge=['3', '5'], clustering_key='leiden', new_cluster_name='3')

# Split cluster (example: split cluster 2)
# clustering.split_cluster(adata, cluster_id='2', clustering_key='leiden', resolution=1.5)

## 14. Subclustering (Optional)

Perform detailed analysis on specific cell populations.

In [None]:
# Subcluster specific clusters
# cluster_ids = ['0', '1']  # Specify clusters to subcluster
# adata_subset = clustering.subcluster_cells(
#     adata,
#     cluster_ids=cluster_ids,
#     clustering_key='leiden',
#     resolution=1.0
# )

# # Visualize subclusters
# fig = reduction.plot_umap(adata_subset, color=['leiden_sub'], figsize=(10, 8))
# plt.show()

# # Find markers for subclusters
# annotation.find_marker_genes(adata_subset, clustering_key='leiden_sub')

## 15. Save Annotated Data

In [None]:
# Save final annotated data
io.save_checkpoint(adata, 'annotated')
print("✓ Annotated data saved")

# Export annotations to CSV
if 'cell_type' in adata.obs.columns:
    annotation.export_annotations(adata, annotation_key='cell_type', output_file='../outputs/cell_annotations.csv')

## 16. Generate Reports

In [None]:
# Generate QC report
reports.create_qc_report(adata, output_file='../outputs/reports/qc_report.pdf')

In [None]:
# Generate analysis report
reports.create_analysis_report(
    adata,
    output_file='../outputs/reports/analysis_report.pdf',
    clustering_key='leiden',
    annotation_key='cell_type'
)

## Summary

Pipeline complete! Your outputs are saved in:
- **Checkpoints**: `../outputs/checkpoints/`
  - `raw.h5ad`
  - `preprocessed.h5ad`
  - `annotated.h5ad`
- **Reports**: `../outputs/reports/`
  - `qc_report.pdf`
  - `analysis_report.pdf`
- **Annotations**: `../outputs/cell_annotations.csv`