# PBMC Single-Cell Classification: Expert Annotation Approach
## Comprehensive Machine Learning Pipeline

**Author**: Kristof Torkenczy  
**Course**: Tech 27 Machine Learning  
**Approach**: Expert-curated SeuratData annotations

---

## Project Overview

This analysis uses the **expert-curated cell type annotations** distributed with the SeuratData package to train and evaluate PBMC (peripheral blood mononuclear cell) classifiers. Instead of assigning labels with algorithmic marker-based rules, the pipeline treats the annotations shipped with each dataset as ground truth.

### **Expert Annotation Approach**

**Method**: Professional curation by Seurat team and community
- **Source**: SeuratData package with peer-reviewed annotations
- **Quality**: Multi-study consensus and expert validation
- **Detail**: 9 to 20 annotated cell types per dataset, versus 5 broad categories from marker-based labeling
- **Standardization**: Consistent nomenclature across studies

**Key Features**:
- **Expert-curated training data**: Pre-validated cell type annotations
- **Comprehensive ML evaluation**: 9 different machine learning algorithms
- **High-quality datasets**: pbmc3k (training) + pbmcMultiome (testing)
- **Detailed analysis**: Complete downstream analysis with high-resolution figures
- **Clinical relevance**: Disease-relevant cell subtypes and biomarkers

**Datasets**:
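- **pbmc3k**: Training dataset with expert annotations (2,700 cells, 9 cell types; 2,638 cells carry a label and remain after preprocessing)
- **pbmcMultiome_full**: Test dataset with expert annotations (11,909 cells, 20 cell types; 10,412 cells and 19 cell types remain after preprocessing)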
- **Technology**: 10x Chromium (v1 and Multiome) with cross-platform validation
- **Source**: SeuratData package with expert validation

---


## 1. Data Download and Organization

Download and organize SeuratData expert-annotated PBMC datasets.


In [7]:
# Download annotated (SeuratData) data
!python download_data_unified.py --approach annotated


 Unified PBMC Data Download Pipeline
 Download approach: annotated
📁 Created directory structure:
 data/annotated/
 data/not_annotated/
 figures/EDA/annotated/
 figures/EDA/not_annotated/
 figures/machine_learning_results/annotated/
 figures/machine_learning_results/not_annotated/
🔬 Running SeuratData Download (Expert Annotations)
Running SeuratData export scripts...
PBMC3k export output:
🔬 DIRECT PBMC3K EXPORT
📊 Extracting pbmc3k data directly...
   Cells: 2700 
   Features: 13714 
   Cell types ( 10 ): Memory CD4 T, B, CD14+ Mono, NK, CD8 T, Naive CD4 T, FCGR3A+ Mono, NA, DC, Platelet 
   Cell type distribution:
       Naive CD4 T : 697 
       Memory CD4 T : 483 
       CD14+ Mono : 480 
       B : 344 
       CD8 T : 271 
       FCGR3A+ Mono : 162 
       NK : 155 
       DC : 32 
       Platelet : 14 
   💾 Converting to dense matrix...
   💾 Writing files...
   ✅ Export complete:
      pbmc3k_expression.csv ( 70.9 MB )
      pbmc3k_metadata.csv ( 0.1 MB )

🐍 Python import code:
i
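The export log above reports 10 label values for pbmc3k, one of which is `NA`, while the rest of the notebook refers to 9 cell types. As a quick sanity check, the sketch below rebuilds the label column from the printed distribution and confirms that the 9 named types account for 2,638 of the 2,700 cells. The helper `summarize_labels` and the set of strings treated as missing are assumptions made for illustration, not part of the pipeline scripts.

```python
from collections import Counter

# Strings treated as "no annotation" (an assumption; the real metadata
# file may encode missing labels differently).
MISSING = {"", "NA", "nan"}


def summarize_labels(labels):
    """Return (per-type counts, number of unlabeled cells)."""
    counts = Counter(label for label in labels if label not in MISSING)
    n_missing = sum(1 for label in labels if label in MISSING)
    return counts, n_missing


# Cell type distribution copied from the export log.
distribution = {
    "Naive CD4 T": 697,
    "Memory CD4 T": 483,
    "CD14+ Mono": 480,
    "B": 344,
    "CD8 T": 271,
    "FCGR3A+ Mono": 162,
    "NK": 155,
    "DC": 32,
    "Platelet": 14,
}

# Rebuild a label list; cells not covered by the distribution are the NA group.
total_cells = 2700
labels = [cell_type for cell_type, n in distribution.items() for _ in range(n)]
labels += ["NA"] * (total_cells - len(labels))

counts, n_missing = summarize_labels(labels)
print(f"labeled cells: {sum(counts.values())}, unlabeled: {n_missing}")
```

The labeled total matches the cell count that the later preprocessing log reports after annotation filtering.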

## 2. Exploratory Data Analysis with Expert Annotations

Perform comprehensive EDA on expert-annotated datasets without additional annotation steps.


In [8]:
# Run EDA with expert annotations (no additional annotation needed)
!python eda_unified.py --approach annotated


 Unified PBMC EDA Pipeline
 Processing approach: annotated
 Processing Annotated Data (SeuratData)
🔬 Processing pbmc3k (annotated)
 Loading annotated data from: data/annotated/pbmc3k
    Loaded: 2700 cells × 13714 genes
   Existing annotations: 9 cell types
 Creating EDA plots for pbmc3k (annotated)...
    EDA plots saved to: figures/EDA/annotated

🔬 Applying enhanced preprocessing for pbmc3k...
🔧 Preprocessing 2700 cells × 13714 genes...
   Mitochondrial genes found: 13
   Mean mitochondrial fraction: 2.22%
   After gene/cell filtering: 2700 → 2638 cells
   After mitochondrial filtering (20.0%): 2698 cells
   After annotation filtering: 2638 cells × 13714 genes
   Highly variable genes identified: 2000
   Computing PCA...
   Computing neighbors and UMAP...
   Final processed data: 2638 cells × 2000 genes
   PCA explained variance ratio (first 10 PCs): 0.073
Creating enhanced EDA plots for pbmc3k...
Enhanced EDA plots saved to: figures/EDA/annotated
ML-ready data saved to: data/ann
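The preprocessing log mentions a mitochondrial filter at 20%. The following is a minimal, dependency-free sketch of that kind of filter, assuming human gene symbols where mitochondrial genes start with `MT-` and assuming cells at or below the threshold are kept. The function names and the toy count matrix are hypothetical; the actual `eda_unified.py` implementation is not shown in this notebook and may differ.

```python
def mito_fraction(cell_counts, gene_names, prefix="MT-"):
    """Fraction of a cell's total counts that come from mitochondrial genes."""
    total = sum(cell_counts)
    if total == 0:
        return 0.0  # empty cells are handled by other QC filters
    mito = sum(c for c, g in zip(cell_counts, gene_names) if g.startswith(prefix))
    return mito / total


def filter_by_mito(matrix, gene_names, max_fraction=0.20):
    """Return indices of cells whose mitochondrial fraction is <= max_fraction."""
    return [
        i
        for i, cell_counts in enumerate(matrix)
        if mito_fraction(cell_counts, gene_names) <= max_fraction
    ]


# Toy data: rows are cells, columns are genes.
genes = ["MT-CO1", "MT-ND1", "CD3E", "MS4A1"]
cells = [
    [1, 1, 50, 48],    # 2% mitochondrial
    [30, 20, 25, 25],  # 50% mitochondrial
    [5, 5, 45, 45],    # 10% mitochondrial
]

kept = filter_by_mito(cells, genes)
print("cells kept:", kept)
```

In this toy matrix only the second cell exceeds the threshold, so it is the one dropped.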

## 3. Machine Learning Pipeline

Train and evaluate 9 machine learning algorithms on expert-annotated data with both same-dataset and cross-dataset validation.
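Cross-dataset validation only works if the training and test matrices share the same gene columns. One plausible way to build that shared feature space is to take the union of each dataset's highly variable genes (HVGs) and keep only genes measured in every dataset. The sketch below illustrates that idea, including a fallback when fewer than two datasets are available. It is an assumption about what a "Union HVG" step might do, not the code in `run_pipeline_unified.py`, and `union_hvg` plus the toy gene lists are hypothetical.

```python
def union_hvg(hvgs_by_dataset, genes_by_dataset):
    """Union of per-dataset HVGs, restricted to genes present in all datasets.

    Returns None when fewer than two datasets are supplied, so the caller
    can fall back to per-dataset HVGs.
    """
    if len(hvgs_by_dataset) < 2:
        return None
    union = set().union(*hvgs_by_dataset.values())
    shared = set.intersection(*(set(g) for g in genes_by_dataset.values()))
    return sorted(union & shared)


# Toy example with two datasets.
hvgs = {
    "pbmc3k": {"CD3E", "LYZ", "NKG7"},
    "pbmc_multiome": {"LYZ", "MS4A1", "GNLY"},
}
genes = {
    "pbmc3k": ["CD3E", "LYZ", "NKG7", "MS4A1"],
    "pbmc_multiome": ["CD3E", "LYZ", "MS4A1", "GNLY"],
}

features = union_hvg(hvgs, genes)
print("shared feature set:", features)

# With a single dataset there is nothing to unify, so the sketch signals a fallback.
single = union_hvg({"pbmc3k": hvgs["pbmc3k"]}, {"pbmc3k": genes["pbmc3k"]})
print("single-dataset result:", single)
```

Sorting the result gives a stable column order, which matters when the same feature list is used to subset both the training and the test matrix.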


In [9]:
# Run complete ML pipeline
!python run_pipeline_unified.py --approach annotated --mode both


 Unified PBMC ML Pipeline
 Approach: annotated
 Mode: both
 Processing ANNOTATED approach
 Same-Dataset Analysis (annotated) - Union HVG
🧬 Computing Union HVGs for fair cross-dataset comparison...
   pbmc3k: 2,000 HVGs
   ⚠️  Need at least 2 datasets for Union HVG
   ⚠️  Insufficient Union HVGs, falling back to standard approach
 Loading processed annotated data...
   Loading pbmc_multiome_full from data/annotated/pbmc_multiome_full_processed.h5ad
    pbmc_multiome_full: 10412 cells × 2000 genes
      Cell types: 19
   Loading pbmc3k_full from data/annotated/pbmc3k_full_processed.h5ad
    pbmc3k_full: 2638 cells × 13714 genes
      Cell types: 9
   Loading pbmc_multiome_full_full from data/annotated/pbmc_multiome_full_full_processed.h5ad
    pbmc_multiome_full_full: 10412 cells × 26346 genes
      Cell types: 19
   Loading pbmc3k from data/annotated/pbmc3k_processed.h5ad
    pbmc3k: 2638 cells × 2000 genes
      Cell types: 9
   Using pbmc_multiome_full for same-dataset analysis (sta