# Cell Type Prediction with c2s-mini

This notebook walks through cell type prediction with c2s-mini, plus cell generation and embedding:
- Loading annotated single-cell data
- Creating C2SData with metadata
- Loading the C2S model
- Running cell type predictions
- Comparing predictions to ground truth
- Generating synthetic cells
- Creating cell embeddings

## Setup

In [None]:
import scanpy as sc
import pandas as pd
import numpy as np
from c2s_mini import C2SData, C2SModel, predict_cell_types, generate_cells, embed_cells

# Set scanpy settings
sc.settings.verbosity = 1

## Load Annotated Data

We'll use the processed PBMC3k dataset, which stores its cell type annotations in the `louvain` column of `adata.obs`.

In [None]:
# Load the processed PBMC3k dataset with cell type annotations
adata = sc.datasets.pbmc3k_processed()

# Use a small subset for a faster demo; .copy() turns the AnnData view into an independent object
adata = adata[:100].copy()

print(f"Loaded {adata.n_obs} cells × {adata.n_vars} genes")
print("\nCell type clusters (Louvain):")
print(adata.obs['louvain'].value_counts())

## Create C2SData with Metadata

We'll preserve the cell type annotations by including the 'louvain' column from the AnnData object.

In [None]:
# Create C2SData object, preserving cell type metadata
csdata = C2SData.from_anndata(adata, include_obs_columns=['louvain'])

print(csdata)
print(f"\nMetadata columns: {list(csdata.metadata.columns)}")
print("\nFirst few rows of metadata:")
print(csdata.metadata.head())
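
For orientation, a cell sentence is the list of a cell's gene names ordered from highest to lowest expression. The sketch below builds one from a toy expression vector with NumPy. `to_cell_sentence` and the gene values are illustrative assumptions, and `C2SData.from_anndata` may differ in details such as normalization and tie-breaking.

```python
import numpy as np

def to_cell_sentence(expression, gene_names, n_genes=None):
    """Rank genes by descending expression and join the names of expressed genes."""
    expression = np.asarray(expression, dtype=float)
    # Stable sort on negated values: highest expression first, ties keep input order
    order = np.argsort(-expression, kind="stable")
    # Genes with zero expression carry no rank information, so drop them
    ranked = [gene_names[i] for i in order if expression[i] > 0]
    if n_genes is not None:
        ranked = ranked[:n_genes]
    return " ".join(ranked)

# Toy data (hypothetical values, not taken from PBMC3k)
genes = ["CD3D", "MS4A1", "LYZ", "NKG7", "GNLY"]
counts = [5.0, 0.0, 12.0, 1.0, 3.0]
sentence = to_cell_sentence(counts, genes, n_genes=3)
print(sentence)
```

The `n_genes` argument here plays the same role as the `n_genes` parameter used later in the notebook: it truncates the sentence to the most highly expressed genes.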

## Load the C2S Model

This will download the `vandijklab/pythia-160m-c2s` model from HuggingFace (if not already cached).

**Note**: First-time loading may take a few minutes to download the model weights (~300MB).

In [None]:
# Load the model (will use GPU if available, otherwise CPU)
model = C2SModel(device='auto')

## Predict Cell Types

Now we'll use the model to predict cell types based on gene expression patterns.

In [None]:
# Predict cell types using the top 100 genes per cell
predictions = predict_cell_types(
    csdata,
    model,
    n_genes=100,
    organism='Homo sapiens',
    batch_size=4,
    max_tokens=50
)

print(f"\nGenerated {len(predictions)} predictions")

## Compare Predictions to Ground Truth

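Let's compare the model's predictions with the Louvain cluster labels. The predictions are generated text, so their wording may differ from the annotation labels even when both describe the same cell type.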

In [None]:
# Create a comparison dataframe
results = pd.DataFrame({
    'ground_truth_cluster': csdata.metadata['louvain'],
    'model_prediction': predictions
})

print("First 20 predictions vs ground truth:")
print(results.head(20))

# Show summary statistics
print("\n" + "="*80)
print("\nPredicted cell types (top 10 most common):")
print(results['model_prediction'].value_counts().head(10))
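
Because the predictions are generated text, a direct string comparison with the `louvain` labels is sensitive to case, punctuation, and plurals. A minimal sketch of a normalized comparison follows. `normalize_label` and the toy labels are hypothetical, and the rules are deliberately crude (for example, they will not map "T cell" to "CD4 T cells").

```python
import re
import pandas as pd

def normalize_label(label):
    """Lowercase, strip punctuation, collapse whitespace, drop one trailing plural 's'."""
    text = re.sub(r"[^a-z0-9+ ]", " ", str(label).lower())
    text = re.sub(r"\s+", " ", text).strip()
    return text[:-1] if text.endswith("s") else text

# Toy stand-in for `results` (made-up rows for illustration)
toy = pd.DataFrame({
    "ground_truth_cluster": ["CD4 T cells", "B cells", "CD14+ Monocytes", "NK cells"],
    "model_prediction": ["CD4 T cell.", "b cell", "Dendritic cell", "NK cells"],
})

truth = toy["ground_truth_cluster"].map(normalize_label)
pred = toy["model_prediction"].map(normalize_label)
toy["match"] = truth == pred
agreement = toy["match"].mean()

print(toy)
print(f"Agreement after normalization: {agreement:.2f}")
```

Exact matching after normalization gives a lower bound on agreement. Labels at different levels of granularity still need a manual mapping.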

## Group by Cluster and Show Representative Predictions

In [None]:
# Group predictions by cluster
print("Predictions grouped by Louvain cluster:\n")
for cluster in sorted(results['ground_truth_cluster'].unique()):
    cluster_predictions = results[results['ground_truth_cluster'] == cluster]['model_prediction']
    print(f"Cluster {cluster}:")
    top_prediction = cluster_predictions.mode()  # empty if every prediction is missing
    print(f"  Most common prediction: {top_prediction.iloc[0] if not top_prediction.empty else 'N/A'}")
    print(f"  Unique predictions: {cluster_predictions.nunique()}")
    print(f"  Sample predictions: {list(cluster_predictions.head(3))}")
    print()
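
The same per-cluster summary can be expressed as a contingency table. The sketch below applies `pandas.crosstab` to a small stand-in for `results` (the toy rows are made up for illustration) and reports, for each cluster, the share of cells that received the most common prediction.

```python
import pandas as pd

# Toy stand-in for `results`
toy_results = pd.DataFrame({
    "ground_truth_cluster": ["B cells", "B cells", "B cells", "NK cells", "NK cells"],
    "model_prediction": ["B cell", "B cell", "T cell", "NK cell", "NK cell"],
})

# Rows: annotated clusters; columns: predicted labels; values: cell counts
table = pd.crosstab(toy_results["ground_truth_cluster"], toy_results["model_prediction"])
print(table)

# Share of each cluster assigned to its most common prediction
purity = table.max(axis=1) / table.sum(axis=1)
print(purity.round(2))
```

A purity close to 1 means the model labels the cluster consistently. It does not by itself show that the label is correct.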

## Generate Synthetic Cells

We can also use the model to generate synthetic cells conditioned on specific cell types.

In [None]:
# Generate synthetic cells for different cell types
cell_types_to_generate = ['T cell', 'B cell', 'Monocyte', 'NK cell']

print(f"Generating {len(cell_types_to_generate)} synthetic cells...\n")

generated_sentences = generate_cells(
    cell_types_to_generate,
    model,
    n_genes=100,
    organism='Homo sapiens',
    batch_size=2,
    max_tokens=512
)

print("\nGenerated cell sentences:\n")
for cell_type, sentence in zip(cell_types_to_generate, generated_sentences):
    genes = sentence.split()[:20]  # Show first 20 genes
    print(f"{cell_type}:")
    print(f"  Top genes: {' '.join(genes)}")
    print()
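
Generated sentences are plain text, so they can contain repeated tokens or tokens that are not gene names in your dataset. One way to check this is sketched below. `clean_sentence` is a hypothetical helper and the toy vocabulary is an assumption. The structure returned by `csdata.get_vocab()` may differ, so adapt the membership test accordingly.

```python
def clean_sentence(sentence, vocab):
    """Keep the first occurrence of each token that appears in vocab, preserving order."""
    seen = set()
    kept = []
    for token in sentence.split():
        if token in vocab and token not in seen:
            seen.add(token)
            kept.append(token)
    return kept

# Toy vocabulary and a made-up generated sentence
toy_vocab = {"CD3D", "CD3E", "IL7R", "LYZ", "MS4A1"}
generated = "CD3D IL7R CD3D FAKE1 CD3E"

valid = clean_sentence(generated, toy_vocab)
n_tokens = len(generated.split())
print(valid)
print(f"{len(valid)} of {n_tokens} tokens are unique, known genes")
```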

## Create Cell Embeddings

We can also generate embeddings for cells, which can be used for downstream analysis like clustering or visualization.

In [None]:
# Generate embeddings for the first 20 cells
small_csdata = C2SData(
    vocab=csdata.get_vocab(),
    sentences=csdata.get_sentences()[:20],
    metadata=csdata.metadata.head(20)
)

print("Generating embeddings for 20 cells...\n")

embeddings = embed_cells(
    small_csdata,
    model,
    n_genes=100,
    organism='Homo sapiens',
    batch_size=4
)

print(f"Embeddings shape: {embeddings.shape}")
print(f"  - Number of cells: {embeddings.shape[0]}")
print(f"  - Embedding dimension: {embeddings.shape[1]}")

## Compute Pairwise Similarities

Let's use the embeddings to compute similarities between cells.

In [None]:
# Compute cosine similarities
from sklearn.metrics.pairwise import cosine_similarity

similarities = cosine_similarity(embeddings)

print("Cosine similarity matrix (first 10x10):")
print(pd.DataFrame(
    similarities[:10, :10],
    index=[f"Cell {i}" for i in range(10)],
    columns=[f"Cell {i}" for i in range(10)]
).round(3))
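
One simple use of the similarity matrix is to look up each cell's nearest neighbor. The sketch below uses random vectors in place of `embeddings` and computes cosine similarity directly with NumPy, so it runs without the model. The array shape is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(6, 8))  # stand-in for the real embeddings

# Cosine similarity: dot products of L2-normalized rows
unit = toy_embeddings / np.linalg.norm(toy_embeddings, axis=1, keepdims=True)
toy_similarities = unit @ unit.T

# Mask the diagonal so a cell is never its own nearest neighbor
masked = toy_similarities.copy()
np.fill_diagonal(masked, -np.inf)
nearest = masked.argmax(axis=1)

for i, j in enumerate(nearest):
    print(f"Cell {i} -> Cell {j} (cosine similarity {toy_similarities[i, j]:.3f})")
```

With real embeddings, checking whether nearest neighbors share a `louvain` label is a quick sanity check on how well the embedding separates cell types.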

## Summary

In this notebook, we've demonstrated:

1. **Cell Type Prediction**: Using the model to predict cell types from gene expression
2. **Cell Generation**: Generating synthetic cells conditioned on cell type labels
3. **Cell Embeddings**: Creating vector representations of cells for downstream analysis

The c2s-mini library provides a simple interface to work with the Cell2Sentence model for various single-cell analysis tasks!

## Next Steps

You can extend this analysis by:
- Using larger datasets
- Comparing predictions across different cell populations
- Using embeddings for UMAP/t-SNE visualization
- Tuning the number of genes used (`n_genes` parameter)
- Experimenting with different generation parameters
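
As a starting point for the visualization idea, the sketch below projects embeddings to two dimensions with plain PCA via NumPy's SVD. This is a linear stand-in, not UMAP or t-SNE, and the random array stands in for the output of `embed_cells`.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(20, 16))  # stand-in for the real embeddings

# PCA: center the data, then project onto the top two right singular vectors
centered = toy_embeddings - toy_embeddings.mean(axis=0)
# Rows of vt are principal directions, ordered by decreasing singular value
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
coords = centered @ vt[:2].T

# Fraction of total variance captured by the two components
explained = singular_values[:2] ** 2 / np.sum(singular_values ** 2)
print(coords.shape)
print(f"Variance explained by 2 components: {explained.sum():.2%}")
```

The resulting `coords` array can be passed to any scatter plot, colored by the `louvain` labels.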