# Example: how to read cell embeddings

The following script demonstrates how to load the cell embeddings generated by the pre-trained scGPT model.

## Data description
The [cell embeddings folder](https://drive.google.com/drive/folders/1xFOABVtRueVETWdjMkqFmobByjCRGFDN) contains three files generated from the scGPT reference mapping workflow:

1. **scGPT_embeddings_*.npy**
   - NumPy array of cell embeddings.
   - Shape: (number of cells, embedding dimension).
   - Each row corresponds to a cell.

2. **scGPT_metadata_*.csv**
   - CSV table with metadata for each cell.
   - Columns: cell_id, cell_type, lineage, dataset, embedding_index.
   - Use embedding_index to match rows to embeddings in the .npy file.

3. **scGPT_embeddings_with_metadata_*.pkl**
   - Python pickle file containing:
     - All cell embeddings.
     - Metadata (cell IDs, cell types, lineage, dataset, gene names).
   - Useful for loading everything at once in Python.
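
The internal layout of the `.pkl` file (for example, whether it is a dictionary and which keys it uses) is not specified above, so a cautious first step is to load it and inspect what it contains. The following is a minimal sketch under that assumption; `summarize_pickle` is a hypothetical helper name, not part of scGPT, and the path in the usage comment is a placeholder.

```python
import pickle


def summarize_pickle(path):
    """Load a pickle file and describe its top-level contents.

    Returns (obj, summary), where summary maps each top-level name to a
    (type name, shape or None) tuple. Hypothetical helper for illustration.
    """
    with open(path, "rb") as f:
        # Unpickling can execute arbitrary code: only load trusted files.
        obj = pickle.load(f)

    summary = {}
    if isinstance(obj, dict):
        # Describe every entry without assuming any particular key names.
        for key, value in obj.items():
            summary[str(key)] = (type(value).__name__, getattr(value, "shape", None))
    else:
        # Not a dictionary: describe the object itself.
        summary["<root>"] = (type(obj).__name__, getattr(obj, "shape", None))
    return obj, summary


# Usage (replace the placeholder with the actual .pkl file name):
# obj, summary = summarize_pickle("path-to-cell_embeddings/<your .pkl file>")
# for name, (type_name, shape) in summary.items():
#     print(name, type_name, shape)
```

Once the keys and shapes are known, the embeddings and metadata can be pulled out of the loaded object directly instead of reading the `.npy` and `.csv` files separately.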

```python
import numpy as np
import pandas as pd
import torch

# Load embeddings as a NumPy array
embeddings_file = "path-to-cell_embeddings/scGPT_embeddings_20251106_102213.npy"
embeddings = np.load(embeddings_file)
print("Embeddings shape:", embeddings.shape)

# Optionally, convert to a torch tensor (shares memory with the NumPy array)
embeddings_torch = torch.from_numpy(embeddings)
print("Torch tensor shape:", embeddings_torch.shape)

# Load metadata as a DataFrame
metadata_file = "path-to-cell_embeddings/scGPT_metadata_20251106_102213.csv"
metadata = pd.read_csv(metadata_file)
print("Metadata columns:", metadata.columns.tolist())
print("First 3 cell IDs:", metadata['cell_id'].head(3).tolist())

# Example: get the cell ID and cell type for a given embedding index.
# Match on the embedding_index column rather than the DataFrame row label,
# so the lookup stays correct even if the metadata rows are reordered.
example_index = 0
row = metadata.loc[metadata['embedding_index'] == example_index].iloc[0]
cell_id = row['cell_id']
cell_type = row['cell_type']
print(f"Embedding index {example_index} corresponds to cell '{cell_id}' of type '{cell_type}'.")
```

Output:

```text
Embeddings shape: (281, 512)
Torch tensor shape: torch.Size([281, 512])
Metadata columns: ['cell_id', 'cell_type', 'lineage', 'dataset', 'embedding_index']
First 3 cell IDs: ['Basophil_01', 'Basophil_02', 'Basophil_03']
Embedding index 0 corresponds to cell 'Basophil_01' of type 'Basophil'.
```
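
Because `embedding_index` links each metadata row to a row of the embeddings array, it can also be used to select or aggregate embeddings by any metadata column. The sketch below illustrates this on a tiny synthetic stand-in for the real files; the helper names `embeddings_for` and `mean_embedding_by` and the toy cell IDs are hypothetical, and only the column names are taken from the metadata description above.

```python
import numpy as np
import pandas as pd


def embeddings_for(embeddings, metadata, column, value):
    """Return the embedding rows whose metadata[column] equals value."""
    idx = metadata.loc[metadata[column] == value, "embedding_index"].to_numpy()
    return embeddings[idx]


def mean_embedding_by(embeddings, metadata, column="cell_type"):
    """Average the embeddings within each group of metadata[column]."""
    # Reorder the embeddings so that row i matches metadata row i.
    ordered = embeddings[metadata["embedding_index"].to_numpy()]
    frame = pd.DataFrame(ordered, index=metadata[column].to_numpy())
    return frame.groupby(level=0).mean()


# Synthetic data for illustration only (4 cells, 3 dimensions).
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(4, 3))
toy_metadata = pd.DataFrame({
    "cell_id": ["A_01", "A_02", "B_01", "B_02"],
    "cell_type": ["A", "A", "B", "B"],
    "embedding_index": [2, 0, 3, 1],  # deliberately not in row order
})

a_rows = embeddings_for(toy_embeddings, toy_metadata, "cell_type", "A")
print("Rows for cell type A:", a_rows.shape)
print("Per-type mean shape:", mean_embedding_by(toy_embeddings, toy_metadata).shape)
```

With the real files, `embeddings` and `metadata` from the loading script can be passed in place of the toy arrays, for example `embeddings_for(embeddings, metadata, "cell_type", "Basophil")`.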
