# 03 – Human preprocessing (GSE253355, Bandyopadhyay et al.)

This notebook documents how the human bone marrow atlas (GSE253355) was processed.

Because the authors provide a preprocessed Seurat object 
`GSE253355_Normal_Bone_Marrow_Atlas_Seurat_SB_v2.rds` with PCA and UMAP already 
computed and curated cell-type annotations (`cluster_anno_l2`, `cluster_anno_coarse`, 
`cluster_anno_l1`), all human preprocessing was done in **R/Seurat** rather than 
re-implementing the full pipeline in Python.

The key steps (run in R) were:

1. Load the Seurat object:

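   ```r
   library(Seurat)  # readRDS() is base R; Idents() and FindAllMarkers() come from Seurat
   library(dplyr)   # group_by(), slice_max(), transmute()

   bm <- readRDS("data/human/GSE253355_Normal_Bone_Marrow_Atlas_Seurat_SB_v2.rds")
   ```

2. Set the active identities to the `cluster_anno_l2` annotation:

   ```r
   Idents(bm) <- "cluster_anno_l2"
   ```

3. Find positive markers for every `cluster_anno_l2` cell type:

   ```r
   human_markers_raw <- FindAllMarkers(
     bm,
     only.pos        = TRUE,   # keep up-regulated genes only
     min.pct         = 0.10,   # minimum fraction of expressing cells
     logfc.threshold = 0.25    # minimum log2 fold change
   )
   ```

4. Keep the top 50 markers per cell type, ranked by `avg_log2FC`, and rename the columns. Note that `scores` and `logfoldchanges` both hold `avg_log2FC`:

   ```r
   human_markers <- human_markers_raw %>%
     group_by(cluster) %>%
     slice_max(order_by = avg_log2FC, n = 50) %>%
     ungroup() %>%
     transmute(
       group          = as.character(cluster),
       names          = gene,
       scores         = avg_log2FC,
       logfoldchanges = avg_log2FC,
       pvals          = p_val,
       pvals_adj      = p_val_adj
     )
   ```

5. Write the marker table to `results/tables/human_celltype_markers.csv`:

   ```r
   write.csv(
     human_markers,
     file = "results/tables/human_celltype_markers.csv",
     row.names = FALSE
   )
   ```
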
### Python sanity check

The following cell verifies that the marker table written by R exists and looks reasonable:

```python
import pandas as pd
from pathlib import Path

# Path is relative to the notebook's working directory.
markers_path = Path("../results/tables/human_celltype_markers.csv")

# Fail early with a clear message if the R step has not been run.
if not markers_path.exists():
    raise FileNotFoundError(f"Expected marker file not found: {markers_path}")

human_markers = pd.read_csv(markers_path)

# Preview the table, its columns, and the number of markers per cell type.
print(human_markers.head())
print("\nColumns:", human_markers.columns.tolist())
print("\nNumber of human cell types (groups):", human_markers["group"].nunique())
print(human_markers["group"].value_counts().head())
```
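
Beyond inspecting the printed output, the same table can be checked programmatically. The sketch below is a minimal example that assumes the column names written by the R step (`group`, `names`, `scores`, `logfoldchanges`, `pvals`, `pvals_adj`). The helper `check_marker_table`, its `max_per_group` parameter, and the toy data frame are hypothetical and introduced here for illustration only; they are not part of the original pipeline. Because `slice_max` keeps ties by default, a cell type can in principle have more than 50 rows, so the helper reports oversized groups instead of failing on them.

```python
import pandas as pd

# Columns the R step is expected to write (taken from the transmute() call).
EXPECTED_COLUMNS = ["group", "names", "scores", "logfoldchanges", "pvals", "pvals_adj"]


def check_marker_table(df: pd.DataFrame, max_per_group: int = 50) -> dict:
    """Run basic structural checks on a marker table and return a summary.

    Hypothetical helper: raises ValueError on clear problems and reports
    group sizes rather than enforcing them.
    """
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")

    # Raw and adjusted p-values should lie in [0, 1]; NaN also fails this check.
    for col in ("pvals", "pvals_adj"):
        if not df[col].between(0, 1).all():
            raise ValueError(f"{col} contains values outside [0, 1]")

    # only.pos = TRUE was used, so every log fold change should be positive.
    if (df["logfoldchanges"] <= 0).any():
        raise ValueError("Found non-positive log fold changes")

    sizes = df.groupby("group").size()
    return {
        "n_groups": int(sizes.shape[0]),
        "n_rows": int(len(df)),
        "groups_over_limit": sorted(sizes[sizes > max_per_group].index.tolist()),
    }


# Toy data (made up for illustration; not real marker results).
toy = pd.DataFrame(
    {
        "group": ["A", "A", "B"],
        "names": ["GENE1", "GENE2", "GENE3"],
        "scores": [2.0, 1.5, 3.0],
        "logfoldchanges": [2.0, 1.5, 3.0],
        "pvals": [0.001, 0.01, 0.0001],
        "pvals_adj": [0.01, 0.05, 0.001],
    }
)

summary = check_marker_table(toy, max_per_group=1)
print(summary)
```

On the real table, the call would be `check_marker_table(human_markers)`; a non-empty `groups_over_limit` points to cell types where ties at the cutoff pushed the count past 50.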