# C2S Donor Cells: Cell-Type Prediction (New Approach)

This notebook uses **no project code** and runs the prediction directly with the pretrained Hugging Face model `vandijklab/C2S-Pythia-410m-cell-type-prediction`.

In [1]:
# Optional: install dependencies (uncomment if needed)
# %pip install -q cell2sentence anndata scanpy datasets transformers pandas numpy scipy

In [None]:

# ── Windows: neutralize uvloop BEFORE any other import ───────────────────────
import sys, asyncio
if sys.platform == 'win32':
    from unittest.mock import MagicMock
    _mock_uvloop = MagicMock()
    _mock_uvloop.install = lambda: None
    _mock_uvloop.EventLoopPolicy = asyncio.DefaultEventLoopPolicy
    sys.modules['uvloop'] = _mock_uvloop
    asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
    print('Windows detected – uvloop mocked, using WindowsSelectorEventLoopPolicy')
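
The cell above relies on Python consulting `sys.modules` before it searches for a package, so a stub registered there is what any later `import` statement receives. A minimal sketch of that mechanism, using a made-up module name (`hypothetical_native_dep` is an assumption for illustration, not a real package):

```python
import sys
import types

# Hypothetical module name, used only for illustration.
STUB_NAME = "hypothetical_native_dep"

# Build a stand-in module exposing the one attribute callers are assumed to need.
stub = types.ModuleType(STUB_NAME)
stub.install = lambda: None  # no-op replacement

# Register the stub before anything tries to import the real package.
sys.modules[STUB_NAME] = stub

# The import system returns the cached entry instead of searching the filesystem.
import hypothetical_native_dep

resolved_to_stub = hypothetical_native_dep is stub
install_result = hypothetical_native_dep.install()
```

The notebook uses `MagicMock` rather than `types.ModuleType`, which additionally tolerates access to arbitrary attributes.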


In [2]:
from pathlib import Path
import random

import numpy as np
import pandas as pd
import anndata as ad
import scanpy as sc

import cell2sentence as cs
from cell2sentence.tasks import predict_cell_types_of_data



In [3]:
# ---------- Configuration ----------
SEED = 42
random.seed(SEED)
np.random.seed(SEED)

H5AD_PATH = Path('../../data/dominguez_conde_immune_tissue_two_donors.h5ad')
DONOR_COLUMN = 'batch_condition'  # has 2 values in this dataset (A29, A31)
DONOR_VALUE = 'A29'
N_CELLS = 24
TOP_K_GENES = 200
BASE_MODEL = 'vandijklab/C2S-Pythia-410m-cell-type-prediction'

assert H5AD_PATH.exists(), f'File not found: {H5AD_PATH.resolve()}'

In [4]:
# ---------- Load data ----------
adata = ad.read_h5ad(H5AD_PATH)
print('Shape:', adata.shape)
print('obs columns:', list(adata.obs.columns))

if DONOR_COLUMN not in adata.obs.columns:
    raise KeyError(f"Column '{DONOR_COLUMN}' not found in adata.obs.")

print(f"Available donor values in {DONOR_COLUMN}:")
print(adata.obs[DONOR_COLUMN].value_counts())

Shape: (29773, 36503)
obs columns: ['cell_type', 'tissue', 'batch_condition', 'organism', 'assay', 'sex']
Available donor values in batch_condition:
batch_condition
A29    17327
A31    12446
Name: count, dtype: int64


In [5]:
# ---------- Filter by donor + select random cells ----------
adata_donor = adata[adata.obs[DONOR_COLUMN] == DONOR_VALUE].copy()
if adata_donor.n_obs == 0:
    raise ValueError(f"No cells found for {DONOR_COLUMN}={DONOR_VALUE}.")

n_select = min(N_CELLS, adata_donor.n_obs)
idx = np.random.choice(adata_donor.n_obs, size=n_select, replace=False)
adata_small = adata_donor[idx].copy()

print(f'Selected donor: {DONOR_VALUE}')
print(f'Cells in donor: {adata_donor.n_obs}')
print(f'Sampled for inference: {adata_small.n_obs}')
print('Cell type distribution in sample:')
print(adata_small.obs['cell_type'].value_counts().head(10))

Selected donor: A29
Cells in donor: 17327
Sampled for inference: 24
Cell type distribution in sample:
cell_type
macrophage                                                                    4
naive thymus-derived CD4-positive, alpha-beta T cell                          4
alveolar macrophage                                                           3
memory B cell                                                                 2
animal cell                                                                   1
CD4-positive helper T cell                                                    1
CD8-positive, alpha-beta memory T cell, CD45RO-positive                       1
effector memory CD8-positive, alpha-beta T cell, terminally differentiated    1
effector memory CD4-positive, alpha-beta T cell                               1
classical monocyte                                                            1
Name: count, dtype: int64
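
The sampling cell depends on the global NumPy seed, so re-running it without first re-running the configuration cell draws a different subset. One way to make the draw independent of global state is a dedicated generator. The sketch below shows the idea in plain Python; `sample_indices` is a hypothetical helper, and the analogous NumPy call would be `np.random.default_rng(SEED).choice(n_obs, size=n_select, replace=False)`. Either variant would select different cells than the run recorded above.

```python
import random

SEED = 42

def sample_indices(n_obs, n_cells, seed):
    """Draw up to n_cells distinct row indices using a private generator."""
    rng = random.Random(seed)        # isolated from the global random state
    n_select = min(n_cells, n_obs)   # never request more cells than exist
    return sorted(rng.sample(range(n_obs), n_select))

# Two independent calls with the same seed give the same subset.
first_draw = sample_indices(n_obs=17327, n_cells=24, seed=SEED)
second_draw = sample_indices(n_obs=17327, n_cells=24, seed=SEED)
```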


In [6]:
# ---------- Minimal preprocessing for stable C2S inputs ----------
adata_small.var_names_make_unique()
sc.pp.normalize_total(adata_small, target_sum=1e4)
sc.pp.log1p(adata_small)

label_cols = [c for c in ['cell_type', DONOR_COLUMN, 'tissue', 'sex', 'organism'] if c in adata_small.obs.columns]
print('Label columns:', label_cols)

Label columns: ['cell_type', 'batch_condition', 'tissue', 'sex', 'organism']
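
Assuming cell sentences are built from the rank order of genes within each cell, the exact scaling matters less than one might expect: dividing by a per-cell total and applying `log1p` are both monotone within a cell and leave that order unchanged. A small plain-Python check of this, mimicking the arithmetic of `normalize_total` followed by `log1p` on one toy cell (the helper names and counts are made up for illustration):

```python
import math

def normalize_log1p(counts, target_sum=1e4):
    """Scale one cell's counts to target_sum, then apply the natural log1p."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]  # an empty cell stays all-zero
    return [math.log1p(c / total * target_sum) for c in counts]

def rank_order(values):
    """Gene indices sorted by descending value (stable for ties)."""
    return sorted(range(len(values)), key=lambda i: -values[i])

raw_counts = [0, 3, 12, 1, 7]          # toy counts for five genes in one cell
transformed = normalize_log1p(raw_counts)

raw_ranks = rank_order(raw_counts)
transformed_ranks = rank_order(transformed)
order_preserved = raw_ranks == transformed_ranks
```

The transformed values differ from the raw counts, but the gene ordering is identical in both cases.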


In [7]:
# ---------- Generate cell sentences ----------
arrow_ds, vocab = cs.CSData.adata_to_arrow(
    adata_small,
    random_state=SEED,
    sentence_delimiter=' ',
    label_col_names=label_cols,
)

csdata = cs.CSData.csdata_from_arrow(
    arrow_dataset=arrow_ds,
    vocabulary=vocab,
    save_dir='./tmp_c2s_notebook',
    save_name='donor_subset',
    dataset_backend='arrow',
)

print(csdata)

WARN: more variables (36503) than observations (24)... did you mean to transpose the object (e.g. adata.T)?
WARN: more variables (36503) than observations (24), did you mean to transpose the object (e.g. adata.T)?
100%|██████████| 24/24 [00:00<00:00, 889.00it/s]
Saving the dataset (1/1 shards): 100%|██████████| 24/24 [00:00<00:00, 1735.10 examples/s]

CSData Object; Path=./tmp_c2s_notebook\donor_subset, Format=arrow
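
As a rough mental model of what a cell sentence is (a toy sketch, not the `cell2sentence` implementation): gene names are ordered by descending expression within the cell and joined with the delimiter. Dropping unexpressed genes and breaking ties alphabetically are assumptions made here for determinism, and `toy_cell_sentence` is a hypothetical helper.

```python
def toy_cell_sentence(expression, top_k=None, delimiter=" "):
    """Join gene names in order of descending expression.

    Assumptions for this sketch: genes with zero expression are dropped
    and ties are broken alphabetically by gene name.
    """
    expressed = [(gene, value) for gene, value in expression.items() if value > 0]
    ranked = sorted(expressed, key=lambda gv: (-gv[1], gv[0]))
    names = [gene for gene, _ in ranked]
    if top_k is not None:
        names = names[:top_k]  # keep only the highest-ranked genes
    return delimiter.join(names)

# Toy expression values for a single cell.
toy_expression = {"CD3E": 5.0, "MS4A1": 0.0, "CD4": 2.5, "IL7R": 3.1, "LYZ": 0.2}
sentence = toy_cell_sentence(toy_expression, top_k=3)
```

With these toy values the sentence is `CD3E IL7R CD4`; the `top_k` argument plays the role that `TOP_K_GENES` plays at prediction time.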





In [8]:
# ---------- Load pretrained C2S model ----------
csmodel = cs.CSModel(
    model_name_or_path=BASE_MODEL,
    save_dir='./tmp_c2s_notebook_model',
    save_name='pretrained_c2s_inference',
)
print('Model loaded:', BASE_MODEL)

Using device: cpu


To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Loading weights: 100%|██████████| 292/292 [00:01<00:00, 230.94it/s, Materializing param=gpt_neox.layers.23.post_attention_layernorm.weight] 
Writing model shards: 100%|██████████| 1/1 [00:06<00:00,  6.96s/it]


Model loaded: vandijklab/C2S-Pythia-410m-cell-type-prediction


In [9]:
# ---------- Cell type prediction ----------
preds = predict_cell_types_of_data(
    csdata=csdata,
    csmodel=csmodel,
    n_genes=TOP_K_GENES,
    max_num_tokens=32,
)

df = pd.DataFrame({
    'cell_id': adata_small.obs_names.astype(str),
    'y_true': adata_small.obs['cell_type'].astype(str).values,
    'y_pred': [str(p).strip() for p in preds],
})
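# Note: this compares raw strings. The predictions end with a period
# (e.g. 'macrophage.'), so this exact match is 0 for every row shown below.
df['correct'] = (df['y_true'] == df['y_pred']).astype(int)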
df.head(20)

Reloading model from path on disk: ./tmp_c2s_notebook_model\pretrained_c2s_inference


Loading weights: 100%|██████████| 292/292 [00:01<00:00, 283.95it/s, Materializing param=gpt_neox.layers.23.post_attention_layernorm.weight] 


Predicting cell types for 24 cells using CSModel...


100%|██████████| 24/24 [08:56<00:00, 22.35s/it]


| | cell_id | y_true | y_pred | correct |
|---|---|---|---|---|
| 0 | Pan_T7935491_CATGACAGTGTGTGCC | macrophage | macrophage. | 0 |
| 1 | Pan_T7935493_CTAACTTAGACAAGCC | naive thymus-derived CD4-positive, alpha-beta ... | CD4-positive, alpha-beta T cell. | 0 |
| 2 | Pan_T7935491_CTGATAGGTGCCTTGG | macrophage | macrophage. | 0 |
| 3 | Pan_T7935491_AGTGGGACATGCGCAC | CD4-positive helper T cell | CD8-positive, alpha-beta T cell. | 0 |
| 4 | Pan_T7935494_AACGTTGGTTCTCATT | memory B cell | B cell. | 0 |
| 5 | Pan_T7935494_ATAACGCGTCAGCTAT | regulatory T cell | CD4-positive, alpha-beta memory T cell. | 0 |
| 6 | Pan_T7935495_AGTGTCAGTGTCAATC | animal cell | CD14-positive, CD16-negative classical monocyte. | 0 |
| 7 | Pan_T7935493_CTGATAGAGACTAGAT | naive thymus-derived CD8-positive, alpha-beta ... | CD8-positive, alpha-beta T cell. | 0 |
| 8 | Pan_T7935497_ACCCACTTCCCTAATT | CD8-positive, alpha-beta memory T cell, CD45RO... | CD8-positive, alpha-beta T cell. | 0 |
| 9 | Pan_T7935492_AGCGTCGTCGCTTGTC | effector memory CD4-positive, alpha-beta T cell | CD8-positive, alpha-beta T cell. | 0 |


In [10]:
# ---------- Brief evaluation + save ----------
acc = df['correct'].mean()
print(f'Accuracy (exact match) on sample: {acc:.3f}')

out_path = Path('./donor_predictions.csv')
df.to_csv(out_path, index=False)
print('Saved:', out_path.resolve())

Accuracy (exact match) on sample: 0.000
Saved: C:\Users\Daniel\Desktop\GitProjects\Improving-Cell2Sentence-with-Single-Cell-Foundation-Model-Embeddings\notebooks\c2s_donor_new_approach\donor_predictions.csv


In [11]:

# ---------- 3-tier accuracy: exact / partial / incorrect ----------
import re

def normalize(text):
    """Normalize cell type label: strip trailing dot, lowercase, collapse whitespace."""
    text = str(text).strip()
    text = text.rstrip('.')
    text = text.lower()
    text = re.sub(r'\s+', ' ', text)
    return text

def classify_prediction(y_true, y_pred):
    """
    3-tier classification:
      2 = Exactly correct   (normalized strings are identical)
      1 = Partly correct    (one label contains the other, or significant word overlap)
      0 = Not correct
    """
    t = normalize(y_true)
    p = normalize(y_pred)

    # Tier 1: exact match after normalization
    if t == p:
        return 2

    # Tier 2a: substring containment (e.g. "B cell" in "memory B cell")
    if p in t or t in p:
        return 1

    # Tier 2b: Jaccard word overlap ≥ 30 % (ignores words of two or fewer characters)
    t_words = {w for w in re.findall(r'\b\w+\b', t) if len(w) > 2}
    p_words = {w for w in re.findall(r'\b\w+\b', p) if len(w) > 2}
    if t_words and p_words:
        jaccard = len(t_words & p_words) / len(t_words | p_words)
        if jaccard >= 0.30:
            return 1

    return 0


df['norm_true'] = df['y_true'].apply(normalize)
df['norm_pred'] = df['y_pred'].apply(normalize)
df['score'] = df.apply(lambda r: classify_prediction(r['y_true'], r['y_pred']), axis=1)

label_map = {2: 'Exactly correct', 1: 'Partly correct', 0: 'Not correct'}
df['verdict'] = df['score'].map(label_map)

# ── Summary ───────────────────────────────────────────────────────────────────
print("=" * 55)
print("  3-Tier Accuracy")
print("=" * 55)
for score in [2, 1, 0]:
    lbl   = label_map[score]
    count = (df['score'] == score).sum()
    pct   = count / len(df) * 100
    print(f"  {lbl:<20}: {count:>2}/{len(df)}  ({pct:.1f} %)")

exact_plus_partial = (df['score'] >= 1).sum()
print("-" * 55)
print(f"  Exact + Partial   : {exact_plus_partial:>2}/{len(df)}  ({exact_plus_partial/len(df)*100:.1f} %)")
print("=" * 55)

# ── Per-row detail ─────────────────────────────────────────────────────────────
pd.set_option('display.max_colwidth', 70)
# Note: Styler.applymap is deprecated as of pandas 2.1 in favor of Styler.map.
df[['y_true', 'y_pred', 'verdict']].style.applymap(
    lambda v: 'background-color: #c8e6c9' if v == 'Exactly correct'
         else ('background-color: #fff9c4' if v == 'Partly correct'
         else 'background-color: #ffcdd2'),
    subset=['verdict']
)


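=======================================================
  3-Tier Accuracy
=======================================================
  Exactly correct     :  5/24  (20.8 %)
  Partly correct      : 14/24  (58.3 %)
  Not correct         :  5/24  (20.8 %)
-------------------------------------------------------
  Exact + Partial   : 19/24  (79.2 %)
=======================================================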




| | y_true | y_pred | verdict |
|---|---|---|---|
| 0 | macrophage | macrophage. | Exactly correct |
| 1 | naive thymus-derived CD4-positive, alpha-beta T cell | CD4-positive, alpha-beta T cell. | Partly correct |
| 2 | macrophage | macrophage. | Exactly correct |
| 3 | CD4-positive helper T cell | CD8-positive, alpha-beta T cell. | Not correct |
| 4 | memory B cell | B cell. | Partly correct |
| 5 | regulatory T cell | CD4-positive, alpha-beta memory T cell. | Not correct |
| 6 | animal cell | CD14-positive, CD16-negative classical monocyte. | Not correct |
| 7 | naive thymus-derived CD8-positive, alpha-beta T cell | CD8-positive, alpha-beta T cell. | Partly correct |
| 8 | CD8-positive, alpha-beta memory T cell, CD45RO-positive | CD8-positive, alpha-beta T cell. | Partly correct |
| 9 | effector memory CD4-positive, alpha-beta T cell | CD8-positive, alpha-beta T cell. | Partly correct |
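
The partial-credit tier is driven largely by token overlap, which can be checked by hand for two rows of the table above. The sketch below recomputes the Jaccard score with the same tokenization as `classify_prediction`; `content_words` and `jaccard` are helper names introduced here for illustration.

```python
import re

def content_words(label):
    """Lowercase, strip the trailing period, keep words longer than 2 characters."""
    label = str(label).strip().rstrip(".").lower()
    return {w for w in re.findall(r"\b\w+\b", label) if len(w) > 2}

def jaccard(label_a, label_b):
    """Jaccard similarity of the two labels' content-word sets."""
    words_a, words_b = content_words(label_a), content_words(label_b)
    union = words_a | words_b
    if not union:
        return 0.0  # no content words in either label
    return len(words_a & words_b) / len(union)

# Row 9: CD4 label predicted as CD8, yet scored 'Partly correct'.
cd4_vs_cd8 = jaccard(
    "effector memory CD4-positive, alpha-beta T cell",
    "CD8-positive, alpha-beta T cell.",
)

# Row 3: a similar confusion that falls below the 0.30 threshold.
helper_vs_cd8 = jaccard(
    "CD4-positive helper T cell",
    "CD8-positive, alpha-beta T cell.",
)
```

The first pair scores 4/8 = 0.5 and passes the 0.30 threshold even though it confuses CD4 with CD8, while the second scores 2/7 ≈ 0.29 and falls just below it. A stricter metric might treat lineage markers such as CD4 and CD8 separately from the shared descriptive words.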
