### Generate an annotated dataset

| **Category**                    | **Name / Variable**    | **Type**    | **Source / How to Compute**                                     | **Interpretation / Purpose**               |
| ------------------------------- | ---------------------- | ----------- | --------------------------------------------------------------- | ------------------------------------------ |
| **Technical Factors**        | `n_counts`             | continuous  | Sum of counts per cell (`adata.X.sum(axis=1)`)                  | Library size / sequencing depth            |
|                                 | `n_genes`              | continuous  | Number of detected genes per cell (`(adata.X > 0).sum(axis=1)`) | Cell complexity / transcriptional richness |
|                                 | `pct_mito`             | continuous  | Percentage of counts from mitochondrial genes (`MT-` prefix)    | Cell quality, apoptosis, stress            |
|                                 | `pct_ribo`             | continuous  | Percentage of counts from ribosomal genes (`RPS`/`RPL` prefix)  | Protein synthesis / technical noise        |
|                                 | `dataset_id`           | categorical | From metadata                                                   | Batch / dataset effects                    |
|                                 | `assay`                | categorical | From metadata                                                   | Sequencing platform differences            |
| **Biological State Factors** | `cell_type`            | categorical | From metadata                                                   | Known biological class label               |
|                                 | `tissue_general`       | categorical | From metadata                                                   | Tissue identity                            |
|                                 | `development_stage`    | categorical | From metadata                                                   | Developmental / age stage                  |
|                                 | `disease`              | categorical | From metadata                                                   | Disease or control status                  |
|                                 | `sex`                  | categorical | From metadata                                                   | Biological sex differences                 |
|                                 | `phase`                | categorical | From `scanpy.tl.score_genes_cell_cycle`                         | Cell cycle phase (S/G2M/G1)                |
|                                 | `S_score`, `G2M_score` | continuous  | From `scanpy.tl.score_genes_cell_cycle`                         | Quantitative cell-cycle activity           |
|                                 | `ssgsea__Inflammation` | continuous  | Per-cell ssGSEA (GSEApy)                                        | Immune activation / inflammatory state     |
|                                 | `regulon__MYC`         | continuous  | TF activity (DoRothEA / pySCENIC)                               | Transcriptional amplification / growth     |
|                                 | `regulon__TP53`        | continuous  | TF activity (DoRothEA / pySCENIC)                               | Stress response / DNA damage / apoptosis   |
|                                 | `ssgsea__Hypoxia`      | continuous  | Per-cell ssGSEA (GSEApy)                                        | Hypoxia / metabolic stress response        |


In [15]:
import anndata
import cellxgene_census
import numpy as np
import pandas as pd
import scanpy as sc
from scipy.sparse import issparse
from scipy.stats import pearsonr, spearmanr, f_oneway
from statsmodels.stats.multitest import multipletests

# GSEApy for gene-set scoring; decoupler for TF activity (DoRothEA regulons)
import gseapy as gp
from decoupler.op import dorothea
from decoupler.mt import viper

# 1. Choose a Census version and organism
ORGANISM = "homo_sapiens"
MEASUREMENT = "RNA"
CENSUS_VERSION = "2025-01-30"

# Number of cells to pull from the Census
SAMPLE_SIZE = 5000

# Precomputed cell embedding to load into adata.obsm
EMBEDDING_NAME = "geneformer"

METADATA_FIELDS = [
    "assay",
    "dataset_id",
    "cell_type",
    "development_stage",
    "disease",
    "self_reported_ethnicity",
    "sex",
    "tissue_general",
    "tissue",
    "soma_joinid"  # Need this for joining with expression data
]

### Get a sample of cells from the Census

In [7]:
with cellxgene_census.open_soma(census_version=CENSUS_VERSION) as census:
    adata = cellxgene_census.get_anndata(
        census,
        organism=ORGANISM,
        measurement_name=MEASUREMENT,
        # Takes the first SAMPLE_SIZE cells by soma_joinid: a contiguous block,
        # not a random sample of the Census.
        obs_value_filter=f"soma_joinid < {SAMPLE_SIZE}",
        var_value_filter="feature_type=='protein_coding'",
        obs_embeddings=[EMBEDDING_NAME],
        obs_column_names=METADATA_FIELDS,
    )
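
Because `soma_joinid < SAMPLE_SIZE` selects a contiguous block, the cells are likely to come from only a few datasets. If a more mixed sample is wanted, one option is to draw random join IDs first. The sketch below covers only that drawing step. `draw_joinids` and `n_total` are hypothetical names, and it assumes that join IDs run from 0 to `n_total - 1`, where `n_total` would have to be read from the Census `obs` table.

```python
import numpy as np

def draw_joinids(n_total: int, n_sample: int, seed: int = 0) -> np.ndarray:
    """Draw n_sample distinct integer IDs from range(n_total), sorted ascending."""
    if n_sample > n_total:
        raise ValueError("n_sample cannot exceed n_total")
    rng = np.random.default_rng(seed)
    ids = rng.choice(n_total, size=n_sample, replace=False)
    return np.sort(ids)

# Toy example: 10 IDs out of a hypothetical table of 1,000 cells.
ids = draw_joinids(n_total=1_000, n_sample=10, seed=42)
print(ids)

# Assumption: the IDs could then be passed to
# cellxgene_census.get_anndata(..., obs_coords=ids.tolist())
# in place of obs_value_filter.
```

Fixing the seed keeps the sample reproducible across runs.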



In [16]:
adata

AnnData object with n_obs × n_vars = 5000 × 20045
    obs: 'assay', 'dataset_id', 'cell_type', 'development_stage', 'disease', 'self_reported_ethnicity', 'sex', 'tissue_general', 'tissue', 'soma_joinid'
    var: 'soma_joinid', 'feature_id', 'feature_name', 'feature_type', 'feature_length', 'nnz', 'n_measured_obs'
    obsm: 'geneformer'

In [29]:
# Use gene symbols (feature_name) as var_names and make sure they are unique
adata.var_names = adata.var["feature_name"].astype(str)
adata.var_names_make_unique()

adata.var

| feature_name (index) | soma_joinid | feature_id | feature_name | feature_type | feature_length | nnz | n_measured_obs |
| --- | --- | --- | --- | --- | --- | --- | --- |
| NOC2L | 1 | ENSG00000188976 | NOC2L | protein_coding | 1244 | 18685092 | 105784525 |
| PERM1 | 2 | ENSG00000187642 | PERM1 | protein_coding | 2765 | 664016 | 95688802 |
| HES4 | 4 | ENSG00000188290 | HES4 | protein_coding | 961 | 19206715 | 105542421 |
| ISG15 | 5 | ENSG00000187608 | ISG15 | protein_coding | 657 | 17874266 | 106022022 |
| AGRN | 6 | ENSG00000188157 | AGRN | protein_coding | 2142 | 7710637 | 105784525 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| SHOX_ENSG00000292354 | 61859 | ENSG00000292354 | SHOX_ENSG00000292354 | protein_coding | 2854 | 18957 | 4625370 |
| SLC25A6_ENSG00000292334 | 61860 | ENSG00000292334 | SLC25A6_ENSG00000292334 | protein_coding | 900 | 2046567 | 4625370 |
| VAMP7_ENSG00000292366 | 61862 | ENSG00000292366 | VAMP7_ENSG00000292366 | protein_coding | 719 | 373206 | 4625370 |
| ZBED1_ENSG00000292345 | 61864 | ENSG00000292345 | ZBED1_ENSG00000292345 | protein_coding | 3665 | 306944 | 4625370 |
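
Gene symbols are not guaranteed to be unique in general, and some names in the output above already carry an Ensembl ID suffix (for example `SHOX_ENSG00000292354`), which suggests that repeats were resolved upstream. A quick check before relying on symbols as an index can still be useful. The sketch below uses toy symbols and a hypothetical helper, `make_names_unique`, with a simple counter-suffix scheme; it illustrates the idea and does not reproduce any library's code.

```python
import pandas as pd

def make_names_unique(names, sep="-"):
    """Keep the first occurrence of each name; suffix repeats with sep + counter."""
    seen = {}
    out = []
    for name in names:
        count = seen.get(name, 0)
        out.append(name if count == 0 else f"{name}{sep}{count}")
        seen[name] = count + 1
    return pd.Index(out)

# Toy symbols with deliberate repeats (not taken from the Census).
symbols = pd.Index(["NOC2L", "HES4", "SHOX", "SHOX", "HES4"])

repeated = symbols[symbols.duplicated()].tolist()   # second and later occurrences
unique_symbols = make_names_unique(symbols)

print(repeated)
print(unique_symbols.tolist())
```

Note that this simple scheme could still collide if a suffixed name such as `SHOX-1` already exists in the input.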


### Add annotations

In [31]:
# add metadata
annotations_df = pd.DataFrame(index=adata.obs_names)

for field in METADATA_FIELDS:
    if field in adata.obs.columns:
        annotations_df[field] = adata.obs[field]

In [None]:
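# technical metrics, computed on a dense copy of the count matrix
if issparse(adata.X):
    X = adata.X.toarray()
else:
    X = adata.X

annotations_df['n_counts'] = X.sum(axis=1)        # total counts per cell (library size)
annotations_df['n_genes'] = (X > 0).sum(axis=1)   # genes with at least one count per cell

# mitochondrial genes: symbols starting with "MT-"; result is a percentage (0-100)
mito_genes = adata.var_names.str.upper().str.startswith("MT-")
annotations_df['pct_mito'] = X[:, mito_genes].sum(axis=1) / annotations_df['n_counts'] * 100

# ribosomal protein genes: symbols starting with "RPS" or "RPL"; result is a percentage (0-100)
ribo_genes = adata.var_names.str.startswith(("RPS", "RPL"))
annotations_df['pct_ribo'] = X[:, ribo_genes].sum(axis=1) / annotations_df['n_counts'] * 100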
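
For larger samples, converting the whole matrix to dense may not fit in memory. The same four metrics can be computed directly on a sparse matrix; the sketch below shows this on a toy 3 x 4 count matrix with hypothetical gene symbols. It is an illustration of the arithmetic, not a replacement for the cell above.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy count matrix: 3 cells x 4 genes (hypothetical data).
genes = pd.Index(["MT-CO1", "RPS6", "RPL3", "ACTB"])
counts = sparse.csr_matrix(np.array([
    [10, 20, 30, 40],
    [0, 5, 0, 5],
    [1, 0, 0, 9],
]))

# Row sums on a sparse matrix come back as a 2-D matrix; flatten to 1-D arrays.
n_counts = np.asarray(counts.sum(axis=1)).ravel()
n_genes = np.asarray((counts > 0).sum(axis=1)).ravel()

# Boolean masks over genes, built from symbol prefixes.
mito = np.asarray(genes.str.startswith("MT-"))
ribo = np.asarray(genes.str.startswith("RPS") | genes.str.startswith("RPL"))

pct_mito = np.asarray(counts[:, mito].sum(axis=1)).ravel() / n_counts * 100
pct_ribo = np.asarray(counts[:, ribo].sum(axis=1)).ravel() / n_counts * 100

qc = pd.DataFrame(
    {"n_counts": n_counts, "n_genes": n_genes, "pct_mito": pct_mito, "pct_ribo": pct_ribo}
)
print(qc)
```

Cells with zero total counts would produce a division by zero here, so they would need to be filtered or handled before computing percentages.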

In [None]:
# get hallmark gene sets from gseapy
hallmark_genesets = gp.get_library(name='MSigDB_Hallmark_2020', organism='Human')
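
Before scoring, it can help to see how much of each gene set is present in `adata.var_names`, and whether two sets share genes. The sketch below uses toy stand-ins for `hallmark_genesets` and `adata.var_names`; the helper `coverage` and all gene names in it are hypothetical.

```python
def coverage(gene_set, available):
    """Return (genes from gene_set found in available, fraction found)."""
    available = set(available)
    present = [g for g in gene_set if g in available]
    frac = len(present) / len(gene_set) if gene_set else 0.0
    return present, frac

# Toy stand-ins (hypothetical symbols).
toy_sets = {"set_A": ["G1", "G2", "G3", "G4"], "set_B": ["G3", "G5"]}
toy_var_names = ["G1", "G3", "G4", "G9"]

for name, genes in toy_sets.items():
    present, frac = coverage(genes, toy_var_names)
    print(f"{name}: {len(present)}/{len(genes)} genes found ({frac:.0%})")

# Genes that appear in both sets
shared = sorted(set(toy_sets["set_A"]) & set(toy_sets["set_B"]))
print("shared between sets:", shared)
```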

In [None]:
# cell cycle scoring
# Hallmark sets are used as proxies: S phase → "E2F Targets", G2/M → "G2-M Checkpoint"
s_genes = [g for g in hallmark_genesets['E2F Targets'] if g in adata.var_names]
g2m_genes = [g for g in hallmark_genesets['G2-M Checkpoint'] if g in adata.var_names]

if len(s_genes) == 0 or len(g2m_genes) == 0:
    raise ValueError("No S/G2M genes from Hallmark found in adata.var_names.")

sc.tl.score_genes_cell_cycle(adata, s_genes=s_genes, g2m_genes=g2m_genes, copy=False)
annotations_df['S_score'] = adata.obs['S_score']
annotations_df['G2M_score'] = adata.obs['G2M_score']
annotations_df['phase'] = adata.obs['phase']

2025-11-05 11:21:41 | [INFO] Downloading and generating Enrichr library gene sets...
2025-11-05 11:21:41 | [INFO] Library is already downloaded in: /home/amoneim/.cache/gseapy/Enrichr.MSigDB_Hallmark_2020.gmt, use local file
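
To make the `phase` column easier to interpret, the sketch below reproduces the decision rule as we understand it from scanpy: a cell starts as S, becomes G2M when its G2M score exceeds its S score, and is set to G1 when both scores are negative. The score values are copied from five rows of the annotations table shown later in this notebook; `assign_phase` is a hypothetical helper, not scanpy's code.

```python
import pandas as pd

# S_score / G2M_score for cells 0, 2, 4, 4995 and 4997 of the annotations table.
scores = pd.DataFrame(
    {
        "S_score": [-0.351288, -0.540893, -0.346582, -0.050791, -0.031237],
        "G2M_score": [0.131993, -0.228877, 0.127845, -0.030933, 0.000068],
    },
    index=["0", "2", "4", "4995", "4997"],
)

def assign_phase(scores: pd.DataFrame) -> pd.Series:
    """Assign S / G2M / G1 from the two scores (illustrative rule)."""
    phase = pd.Series("S", index=scores.index)
    phase[scores["G2M_score"] > scores["S_score"]] = "G2M"
    phase[(scores < 0).all(axis=1)] = "G1"   # both scores negative
    return phase

phase = assign_phase(scores)
print(phase)
```

Under this rule, a cell such as 4997 is labelled G2M although its G2M score is barely above zero, so the continuous scores are usually more informative than the label alone.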


In [57]:
# additional biological programs via ssGSEA
marker_sets = {
    "Inflammation Response": hallmark_genesets['Inflammatory Response'],
    "Hypoxia": hallmark_genesets['Hypoxia']
}

expr_df = pd.DataFrame(X.T, index=adata.var_names, columns=adata.obs_names)

for name, genes in marker_sets.items():
    genes_present = [g for g in genes if g in adata.var_names]
    if len(genes_present) == 0:
        print(f"Warning: no genes from {name} found in adata.var_names, skipping")
        continue
    ss_res = gp.ssgsea(
        data=expr_df,
        gene_sets={name: genes_present},
        sample_norm_method="rank",
        outdir=None,
        no_plot=True,
        permutation_num=0
    )

    # res2d has one row per cell: "Name" is the cell ID, "NES" the enrichment score
    nes_series = pd.Series(
        ss_res.res2d['NES'].astype(float).to_numpy(),
        index=ss_res.res2d['Name'].astype(str),
    )

    # align scores to annotations_df by cell ID (cells without a score get NaN)
    annotations_df[name] = nes_series.reindex(annotations_df.index.astype(str)).to_numpy()
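
For intuition about what a per-cell enrichment score measures, the sketch below implements a simplified ssGSEA-style running sum for a single cell: genes are ordered by expression, genes in the set contribute a rank-weighted step up, all other genes a uniform step down, and the score is the sum of the differences. This is an illustration under stated simplifications; `ssgsea_like_score` is a hypothetical function, `alpha=0.25` is an assumed weight, and the normalization that GSEApy applies to obtain NES is omitted.

```python
import numpy as np

def ssgsea_like_score(expr, in_set, alpha=0.25):
    """Simplified ssGSEA-style score for one cell.

    expr   : 1-D array of expression values, one per gene
    in_set : boolean mask, True for genes that belong to the gene set
    """
    expr = np.asarray(expr, dtype=float)
    in_set = np.asarray(in_set, dtype=bool)
    n = expr.size
    n_hits = int(in_set.sum())
    if n_hits == 0 or n_hits == n:
        raise ValueError("gene set must contain some, but not all, genes")

    order = np.argsort(-expr, kind="stable")   # highest expression first
    ranks = np.arange(n, 0, -1, dtype=float)   # top gene gets rank n, last gene rank 1
    hits = in_set[order]

    hit_weights = np.where(hits, ranks ** alpha, 0.0)
    p_hit = np.cumsum(hit_weights) / hit_weights.sum()   # weighted CDF of set genes
    p_miss = np.cumsum(~hits) / (n - n_hits)             # CDF of all other genes
    return float(np.sum(p_hit - p_miss))

# Toy cell with 10 genes, expression decreasing from 9 to 0.
expr = np.arange(9, -1, -1)
top_set = np.array([True] * 3 + [False] * 7)      # set genes are the most expressed
bottom_set = np.array([False] * 7 + [True] * 3)   # set genes are the least expressed

top_score = ssgsea_like_score(expr, top_set)
bottom_score = ssgsea_like_score(expr, bottom_set)
print(top_score, bottom_score)
```

A set concentrated among the most highly expressed genes yields a positive score, and a set concentrated at the bottom yields a negative one, which matches the sign convention of the `Inflammation Response` and `Hypoxia` columns.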

In [58]:
annotations_df

|  | assay | dataset_id | cell_type | development_stage | disease | self_reported_ethnicity | sex | tissue_general | tissue | soma_joinid | n_counts | n_genes | pct_mito | pct_ribo | S_score | G2M_score | phase | Inflammation Response | Hypoxia |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 10x 3' v3 | d7476ae2-e320-4703-8304-da5c42627e71 | endothelial cell | 29-year-old stage | breast cancer | European | female | liver | liver | 0 | 17412.0 | 6961 | 8.884677 | 10.762692 | -0.351288 | 0.131993 | G2M | 0.007696 | 0.338464 |
| 1 | 10x 3' v3 | d7476ae2-e320-4703-8304-da5c42627e71 | malignant cell | 29-year-old stage | breast cancer | European | female | liver | liver | 1 | 16892.0 | 5246 | 10.999290 | 15.699739 | -0.243737 | 0.117134 | G2M | -0.530180 | 0.088746 |
| 2 | 10x 3' v3 | d7476ae2-e320-4703-8304-da5c42627e71 | fibroblast | 29-year-old stage | breast cancer | European | female | liver | liver | 2 | 14494.0 | 3857 | 19.325239 | 14.468056 | -0.540893 | -0.228877 | G1 | -0.279263 | 0.169821 |
| 3 | 10x 3' v3 | d7476ae2-e320-4703-8304-da5c42627e71 | fibroblast | 29-year-old stage | breast cancer | European | female | liver | liver | 3 | 13499.0 | 4911 | 11.489740 | 18.193941 | -0.376543 | -0.145575 | G1 | -0.236354 | 0.242783 |
| 4 | 10x 3' v3 | d7476ae2-e320-4703-8304-da5c42627e71 | macrophage | 29-year-old stage | breast cancer | European | female | liver | liver | 4 | 11482.0 | 4338 | 11.356907 | 7.585786 | -0.346582 | 0.127845 | G2M | 0.264579 | 0.005123 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4995 | 10x 3' v2 | bdacc907-7c26-419f-8808-969eab3ca2e8 | mature microglial cell | 82-year-old stage | Alzheimer disease | unknown | male | brain | superior frontal gyrus | 4995 | 869.0 | 749 | 0.920598 | 2.646720 | -0.050791 | -0.030933 | G1 | -0.264313 | -0.595455 |
| 4996 | 10x 3' v2 | bdacc907-7c26-419f-8808-969eab3ca2e8 | mature microglial cell | 82-year-old stage | Alzheimer disease | unknown | male | brain | superior frontal gyrus | 4996 | 539.0 | 437 | 0.556586 | 1.298701 | -0.040816 | 0.010473 | G2M | -0.346291 | -0.573556 |
| 4997 | 10x 3' v2 | bdacc907-7c26-419f-8808-969eab3ca2e8 | mature microglial cell | 82-year-old stage | Alzheimer disease | unknown | male | brain | superior frontal gyrus | 4997 | 557.0 | 489 | 4.488330 | 2.154398 | -0.031237 | 0.000068 | G2M | -0.324344 | -0.517428 |
| 4998 | 10x 3' v2 | bdacc907-7c26-419f-8808-969eab3ca2e8 | mature microglial cell | 82-year-old stage | Alzheimer disease | unknown | male | brain | superior frontal gyrus | 4998 | 1469.0 | 1158 | 0.612662 | 1.565691 | -0.067564 | 0.038921 | G2M | -0.288540 | -0.403517 |
