# Tahoe-100M Gene Extraction

This notebook extracts the gene metadata table from the Tahoe-100M dataset and saves it as a CSV file.

The Tahoe-100M dataset is described by its creators as the world's largest single-cell transcriptomic atlas. It contains:
- ~100 million cells
- 50 cancer cell lines
- 1,200+ drug perturbations

We'll extract the gene metadata table (the equivalent of `adata.var`), which lists every gene measured in the dataset.

In [1]:
# Import required libraries
from datasets import load_dataset
import pandas as pd
from pathlib import Path



## 1. Load Gene Metadata from Tahoe-100M

The Tahoe-100M dataset on Hugging Face has several tables:
- `expression_data` - main single-cell expression data
- `gene_metadata` - gene information (equivalent to `adata.var`)
- `cell_line_metadata` - cell line annotations
- `drug_metadata` - drug/perturbation information
- `sample_metadata` - sample-level metadata
- `obs_metadata` - cell-level metadata (equivalent to `adata.obs`)

In [2]:
# Load gene metadata from Tahoe-100M
print("Loading gene metadata from Tahoe-100M...")
gene_ds = load_dataset(
    "tahoebio/Tahoe-100M",
    "gene_metadata",
    split="train"
)

# Convert to pandas DataFrame
gene_df = gene_ds.to_pandas()

print("\nGene metadata loaded successfully!")
print(f"Shape: {gene_df.shape}")
print(f"Columns: {gene_df.columns.tolist()}")

Loading gene metadata from Tahoe-100M...

Gene metadata loaded successfully!
Shape: (62710, 3)
Columns: ['gene_symbol', 'ensembl_id', 'token_id']


In [3]:
# Display the first few genes
print("First 10 genes:")
gene_df.head(10)

First 10 genes:


|   | gene_symbol | ensembl_id | token_id |
|---|---|---|---|
| 0 | TSPAN6 | ENSG00000000003 | 3 |
| 1 | TNMD | ENSG00000000005 | 4 |
| 2 | DPM1 | ENSG00000000419 | 5 |
| 3 | SCYL3 | ENSG00000000457 | 6 |
| 4 | C1orf112 | ENSG00000000460 | 7 |
| 5 | FGR | ENSG00000000938 | 8 |
| 6 | CFH | ENSG00000000971 | 9 |
| 7 | FUCA2 | ENSG00000001036 | 10 |
| 8 | GCLC | ENSG00000001084 | 11 |
| 9 | NFYA | ENSG00000001167 | 12 |

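The `token_id` column pairs each gene with an integer ID, which makes it natural to build lookups keyed by that ID. Below is a minimal sketch using the first three rows printed above as stand-in data. `build_token_lookup` is a hypothetical helper, and how the other Tahoe-100M tables reference `token_id` is an assumption that should be checked against the dataset card.

```python
import pandas as pd

# Stand-in for gene_df: the first three rows printed above.
sample_df = pd.DataFrame(
    {
        "gene_symbol": ["TSPAN6", "TNMD", "DPM1"],
        "ensembl_id": ["ENSG00000000003", "ENSG00000000005", "ENSG00000000419"],
        "token_id": [3, 4, 5],
    }
)


def build_token_lookup(df: pd.DataFrame, value_col: str = "gene_symbol") -> dict:
    """Map each token_id to the value in `value_col` (hypothetical helper)."""
    # .tolist() converts NumPy scalars to plain Python ints and strings.
    return dict(zip(df["token_id"].tolist(), df[value_col].tolist()))


token_to_symbol = build_token_lookup(sample_df)
token_to_ensembl = build_token_lookup(sample_df, value_col="ensembl_id")
print(token_to_symbol[4])
```

On the full table the same call would be `build_token_lookup(gene_df)`.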

In [4]:
# Display the last few genes
print("Last 10 genes:")
gene_df.tail(10)

Last 10 genes:


|   | gene_symbol | ensembl_id | token_id |
|---|---|---|---|
| 62700 | POLGARF | ENSG00000291307 | 62703 |
| 62701 | ENSG00000291308 | ENSG00000291308 | 62704 |
| 62702 | LY6S | ENSG00000291309 | 62705 |
| 62703 | ENSG00000291310 | ENSG00000291310 | 62706 |
| 62704 | ENSG00000291312 | ENSG00000291312 | 62707 |
| 62705 | ENSG00000291313 | ENSG00000291313 | 62708 |
| 62706 | ENSG00000291314 | ENSG00000291314 | 62709 |
| 62707 | ENSG00000291315 | ENSG00000291315 | 62710 |
| 62708 | ENSG00000291316 | ENSG00000291316 | 62711 |
| 62709 | TMEM276 | ENSG00000291317 | 62712 |

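Several rows in the tail repeat the Ensembl ID in the `gene_symbol` column, which suggests those genes have no separate symbol in this table. A minimal sketch for separating the two groups, using four of the rows printed above as stand-in data; `split_by_symbol` is a hypothetical helper name.

```python
import pandas as pd

# Stand-in for gene_df: four of the rows printed above.
sample_df = pd.DataFrame(
    {
        "gene_symbol": ["POLGARF", "ENSG00000291308", "LY6S", "ENSG00000291310"],
        "ensembl_id": [
            "ENSG00000291307",
            "ENSG00000291308",
            "ENSG00000291309",
            "ENSG00000291310",
        ],
        "token_id": [62703, 62704, 62705, 62706],
    }
)


def split_by_symbol(df: pd.DataFrame):
    """Return (named, unnamed): rows with a distinct symbol, and rows whose
    symbol simply repeats the Ensembl ID."""
    unnamed_mask = df["gene_symbol"] == df["ensembl_id"]
    return df[~unnamed_mask], df[unnamed_mask]


named_df, unnamed_df = split_by_symbol(sample_df)
print(f"Distinct symbol: {len(named_df)}, symbol equals Ensembl ID: {len(unnamed_df)}")
```

Running `split_by_symbol(gene_df)` on the full table would show how many of the 62,710 rows fall into each group.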

## 2. Explore Gene Metadata

In [5]:
# Basic statistics
print(f"Total number of genes: {len(gene_df)}")
print("\nColumn data types:")
print(gene_df.dtypes)

Total number of genes: 62710

Column data types:
gene_symbol    object
ensembl_id     object
token_id        int64
dtype: object


In [6]:
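# Check for duplicate gene names.
# Note: this looks for a 'gene_name' column; the table uses 'gene_symbol',
# so the else branch runs and no duplicate check is actually performed.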
if 'gene_name' in gene_df.columns:
    n_unique = gene_df['gene_name'].nunique()
    n_total = len(gene_df)
    print(f"Unique gene names: {n_unique}")
    print(f"Total rows: {n_total}")
    if n_unique != n_total:
        print(f"\nNote: There are {n_total - n_unique} duplicate gene names")
        duplicates = gene_df[gene_df['gene_name'].duplicated(keep=False)]
        print("\nDuplicate genes:")
        print(duplicates)
else:
    print("No 'gene_name' column found. Available columns:")
    print(gene_df.columns.tolist())

No 'gene_name' column found. Available columns:
['gene_symbol', 'ensembl_id', 'token_id']

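The cell above took the `else` branch because the table has no `gene_name` column; its symbol column is `gene_symbol`. A minimal sketch of the same duplicate check run against the columns that do exist follows. `report_duplicates` is a hypothetical helper, and the toy rows are invented for illustration and are not taken from Tahoe-100M.

```python
import pandas as pd


def report_duplicates(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return every row whose value in `column` appears more than once."""
    if column not in df.columns:
        raise KeyError(f"{column!r} not in columns: {df.columns.tolist()}")
    return df[df[column].duplicated(keep=False)]


# Hypothetical toy data (not from Tahoe-100M): two Ensembl IDs share one symbol.
toy_df = pd.DataFrame(
    {
        "gene_symbol": ["GENE_A", "GENE_B", "GENE_A"],
        "ensembl_id": ["ENSG_X1", "ENSG_X2", "ENSG_X3"],
        "token_id": [3, 4, 5],
    }
)

for col in ["gene_symbol", "ensembl_id"]:
    dupes = report_duplicates(toy_df, col)
    print(
        f"{col}: {toy_df[col].nunique()} unique of {len(toy_df)} rows, "
        f"{len(dupes)} rows share a value"
    )
```

Raising `KeyError` for a missing column, instead of silently printing the column list, makes a wrong column name hard to overlook. On the real table the loop would run over `gene_df` in place of `toy_df`.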

In [7]:
# Summary statistics for numerical columns
print("Summary statistics:")
gene_df.describe()

Summary statistics:


|   | token_id |
|---|---|
| count | 62710.0 |
| mean | 31357.5 |
| std | 18102.962027 |
| min | 3.0 |
| 25% | 15680.25 |
| 50% | 31357.5 |
| 75% | 47034.75 |
| max | 62712.0 |

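These statistics are consistent with `token_id` being a contiguous range that starts at 3: the maximum minus the minimum plus one equals the row count. That holds only if the IDs are also unique, which `describe()` does not show. A minimal sketch of an explicit check, with `token_ids_are_contiguous` as a hypothetical helper and short stand-in series in place of the real column:

```python
import pandas as pd


def token_ids_are_contiguous(token_ids: pd.Series) -> bool:
    """True if the IDs are unique and fill every integer from min to max."""
    span = int(token_ids.max()) - int(token_ids.min()) + 1
    return bool(token_ids.is_unique and span == len(token_ids))


# Arithmetic from the describe() output above:
# max - min + 1 = 62712 - 3 + 1 = 62710, which equals the row count.
assert 62712 - 3 + 1 == 62710

# Stand-in data: the first three token IDs printed earlier, then a gapped variant.
print(token_ids_are_contiguous(pd.Series([3, 4, 5])))  # True
print(token_ids_are_contiguous(pd.Series([3, 4, 6])))  # False
```

On the real table the call would be `token_ids_are_contiguous(gene_df["token_id"])`.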

## 3. Save Gene Metadata to CSV

In [None]:
# Define output path
output_dir = Path("../data")
output_dir.mkdir(parents=True, exist_ok=True)

output_file = output_dir / "tahoe_100m_genes.csv"

# Save to CSV
gene_df.to_csv(output_file, index=False)

print(f"Gene metadata saved to: {output_file}")
print(f"File size: {output_file.stat().st_size / 1024:.1f} KB")

In [None]:
# Verify the saved file
saved_df = pd.read_csv(output_file)
print(f"Verification - rows read back: {len(saved_df)}")
print(f"Columns: {saved_df.columns.tolist()}")
saved_df.head()

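Comparing row counts and column names confirms the file was written, but not that every value survived. A minimal sketch of a stricter round-trip check that writes to a temporary directory so it can run anywhere; `roundtrip_matches` is a hypothetical helper and the sample rows are the first three printed earlier.

```python
import tempfile
from pathlib import Path

import pandas as pd


def roundtrip_matches(df: pd.DataFrame, path: Path) -> bool:
    """Write df to CSV without the index, read it back, and compare values."""
    df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)
    try:
        # Compare values only; dtypes can legitimately differ after CSV parsing.
        pd.testing.assert_frame_equal(
            df.reset_index(drop=True), reloaded, check_dtype=False
        )
    except AssertionError:
        return False
    return True


# Stand-in for gene_df: the first three rows printed earlier.
sample_df = pd.DataFrame(
    {
        "gene_symbol": ["TSPAN6", "TNMD", "DPM1"],
        "ensembl_id": ["ENSG00000000003", "ENSG00000000005", "ENSG00000000419"],
        "token_id": [3, 4, 5],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    ok = roundtrip_matches(sample_df, Path(tmp_dir) / "genes_sample.csv")

print(f"Round trip preserved the table: {ok}")
```

One caveat when applying this to real data: `pd.read_csv` parses strings such as `NA` as missing values by default, so a gene symbol spelled that way would not survive the round trip unchanged.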
## 4. Summary

This notebook loads the Tahoe-100M gene metadata table (62,710 genes with `gene_symbol`, `ensembl_id`, and `token_id` columns) and saves it to `../data/tahoe_100m_genes.csv`.

In [None]:
# Final summary
print("="*60)
print("EXTRACTION COMPLETE")
print("="*60)
print(f"Total genes extracted: {len(gene_df)}")
print(f"Output file: {output_file.absolute()}")
print("\nGene metadata columns:")
for col in gene_df.columns:
    print(f"  - {col}")