# Basic Usage of c2s-mini

This notebook demonstrates the basic functionality of c2s-mini:
- Loading AnnData objects
- Creating C2SData objects
- Inspecting vocabulary and cell sentences
- Formatting prompts for the model

## Setup

In [None]:
import scanpy as sc
from c2s_mini import C2SData
from c2s_mini.prompts import format_cell_type_prediction, format_cell_generation

# Set scanpy settings
sc.settings.verbosity = 1

## Load Sample Data

We'll use the PBMC3k dataset that ships with Scanpy, keeping only the first 100 cells and 500 genes so the demo runs quickly.

In [None]:
# Load the PBMC3k dataset
adata = sc.datasets.pbmc3k()

# Use a small subset for the demo; .copy() turns the AnnData view into a standalone object
adata = adata[:100, :500].copy()

print(f"Loaded {adata.n_obs} cells × {adata.n_vars} genes")
print(f"\nGene names (first 10): {list(adata.var_names[:10])}")

## Create C2SData Object

The `C2SData` class transforms the AnnData object into cell sentences: one string per cell, made of space-separated gene names ranked by expression level.
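
As a rough illustration of the ranking idea, the sketch below builds cell sentences from a tiny hand-made count matrix in plain Python. It is not c2s-mini's implementation: the helper name `to_cell_sentences`, the choice to drop zero-count genes, and the tie-breaking rule are all assumptions made for this example.

```python
def to_cell_sentences(counts, gene_names):
    """Hypothetical helper: rank genes per cell by descending count.

    Assumption: genes with zero counts are left out of the sentence.
    """
    result = []
    for row in counts:
        # sorted() is stable, so tied genes keep their original column order
        order = sorted(range(len(row)), key=lambda j: -row[j])
        expressed = [gene_names[j].upper() for j in order if row[j] > 0]
        result.append(" ".join(expressed))
    return result


toy_counts = [
    [5, 0, 2],  # cell 0
    [1, 7, 0],  # cell 1
]
toy_genes = ["CD3D", "Lyz", "NKG7"]
toy_sentences = to_cell_sentences(toy_counts, toy_genes)
print(toy_sentences)  # ['CD3D NKG7', 'LYZ CD3D']
```

The real conversion happens inside `C2SData.from_anndata`, shown in the next cell.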

In [None]:
# Create C2SData object
csdata = C2SData.from_anndata(adata)
print(csdata)

## Inspect the Vocabulary

The vocabulary contains all gene names (in uppercase) and their total expression counts across all cells.
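
To make that description concrete, here is a minimal sketch of such a vocabulary built from a toy count matrix. It assumes the vocabulary is a mapping from uppercase gene name to summed counts, as described above. `build_vocab` is a hypothetical helper, not part of c2s-mini.

```python
def build_vocab(counts, gene_names):
    """Hypothetical helper: map each uppercase gene name to its total count."""
    totals = {}
    for j, gene in enumerate(gene_names):
        # Sum column j over all cells (rows)
        totals[gene.upper()] = sum(row[j] for row in counts)
    return totals


toy_counts = [
    [5, 0, 2],  # cell 0
    [1, 7, 0],  # cell 1
]
toy_vocab = build_vocab(toy_counts, ["CD3D", "Lyz", "NKG7"])
print(toy_vocab)  # {'CD3D': 6, 'LYZ': 7, 'NKG7': 2}
```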

In [None]:
vocab = csdata.get_vocab()

print(f"Vocabulary size: {len(vocab)} genes\n")
print("First 20 genes in vocabulary:")
for i, (gene, count) in enumerate(list(vocab.items())[:20]):
    print(f"  {i+1}. {gene}: {count:.0f} total counts")

## Inspect Cell Sentences

Each cell is represented as a "sentence": a space-separated list of gene names ranked by expression level (highest to lowest).
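
Because a sentence is plain text, the ranking can be recovered by splitting on whitespace. The sketch below uses a made-up sentence and a hypothetical helper, `sentence_to_ranks`, to show the idea. Neither comes from c2s-mini.

```python
def sentence_to_ranks(sentence):
    """Hypothetical helper: map each gene to its 1-based rank in the sentence."""
    return {gene: rank for rank, gene in enumerate(sentence.split(), start=1)}


toy_sentence = "MALAT1 B2M TMSB4X RPL10"  # made-up example, not real model input
toy_ranks = sentence_to_ranks(toy_sentence)
print(toy_ranks["B2M"])  # 2
```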

In [None]:
sentences = csdata.get_sentences()

print(f"Number of cell sentences: {len(sentences)}\n")
print("First cell sentence (truncated to first 200 characters):")
print(f"  {sentences[0][:200]}...\n")
print("First 20 genes in this cell (ranked by expression):")
first_cell_genes = sentences[0].split()[:20]
for i, gene in enumerate(first_cell_genes):
    print(f"  {i+1}. {gene}")

## Example: Different Cells Have Different Gene Rankings

Let's compare the top 10 genes of the first 5 cells to see how much the rankings differ from cell to cell.
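
One simple way to put a number on that diversity is the Jaccard overlap between the top-ranked gene sets of two cells. This is only a sketch. `top_gene_overlap` is a hypothetical helper and the two sentences are invented for the example.

```python
def top_gene_overlap(sentence_a, sentence_b, k=10):
    """Hypothetical helper: Jaccard overlap of the top-k genes of two sentences."""
    top_a = set(sentence_a.split()[:k])
    top_b = set(sentence_b.split()[:k])
    union = top_a | top_b
    # Two empty sentences share nothing, so report 0.0 instead of dividing by zero
    return len(top_a & top_b) / len(union) if union else 0.0


cell_a = "MALAT1 B2M TMSB4X CD3D"   # made-up sentence
cell_b = "MALAT1 LYZ B2M S100A9"    # made-up sentence
overlap = top_gene_overlap(cell_a, cell_b, k=3)
print(overlap)  # 0.5
```

A value of 1.0 means both cells share the same top-k genes (in any order), and 0.0 means they share none.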

In [None]:
print("Top 10 genes for first 5 cells:\n")
for i in range(min(5, len(sentences))):
    top_genes = ' '.join(sentences[i].split()[:10])
    print(f"Cell {i+1}: {top_genes}")

## Format Prompts for the Model

The c2s-mini library provides prompt formatting functions for different tasks.

### Cell Type Prediction Prompt

Format a cell sentence as a prompt for predicting the cell type.
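
The `n_genes` argument used below presumably limits how many top-ranked genes go into the prompt. The sketch here shows that kind of truncation with a toy template. Both helpers are hypothetical, and the template wording is invented for illustration, so it will not match what `format_cell_type_prediction` actually produces.

```python
def truncate_sentence(sentence, n_genes):
    """Hypothetical helper: keep only the first n_genes genes of a sentence."""
    return " ".join(sentence.split()[:n_genes])


def toy_prediction_prompt(sentence, n_genes, organism):
    # Invented template; the real c2s-mini prompt text will differ
    genes = truncate_sentence(sentence, n_genes)
    return (
        f"Top {n_genes} expressed genes of a {organism} cell: {genes}. "
        "What is the cell type?"
    )


toy_prompt = toy_prediction_prompt("MALAT1 B2M TMSB4X CD3D", 2, "Homo sapiens")
print(toy_prompt)
```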

In [None]:
# Format a cell type prediction prompt using the top 50 genes
prompt = format_cell_type_prediction(
    sentences[0],
    n_genes=50,
    organism='Homo sapiens'
)

print("Cell Type Prediction Prompt:")
print("=" * 80)
print(prompt)
print("=" * 80)

### Cell Generation Prompt

Format a prompt for generating a synthetic cell of a specific type.

In [None]:
# Format a cell generation prompt
prompt = format_cell_generation(
    cell_type='T cell',
    n_genes=100,
    organism='Homo sapiens'
)

print("Cell Generation Prompt:")
print("=" * 80)
print(prompt)
print("=" * 80)

## Summary

In this notebook, we've seen:
1. How to load single-cell data and convert it to C2SData format
2. How cell sentences represent ranked gene expression
3. How to format prompts for different tasks

In the next notebook (`cell_type_prediction.ipynb`), we'll see how to use the C2S model for actual predictions!