# Stereo-seq Mouse Brain Data Preprocessing

This notebook demonstrates the complete preprocessing pipeline for Stereo-seq spatial transcriptomics data:
1. Load and explore transcript data
2. Select top 500 highly expressed genes
3. Generate multi-channel gene expression maps
4. Perform nuclear segmentation using Cellpose

**Dataset**: Mouse Brain Adult Stereo-seq data (subset)

In [8]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm
import os
import tempfile
import shutil
from concurrent.futures import ThreadPoolExecutor
import tifffile

In [9]:
# Define data paths - all files are in demo_data directory
bin_file = "demo_data/Mouse_brain_Adult_GEM_bin1_sub.tsv"
dapi_image = "demo_data/Mouse_brain_Adult_sub.tif"
output_dir = "demo_data/"

print(f"Input transcript file: {bin_file}")
print(f"Input DAPI image: {dapi_image}")
print(f"Output directory: {output_dir}")

Input transcript file: demo_data/Mouse_brain_Adult_GEM_bin1_sub.tsv
Input DAPI image: demo_data/Mouse_brain_Adult_sub.tif
Output directory: demo_data/


## 1. Load and Explore Transcript Data

In [10]:
# Load bin file (TSV format)
print("Loading transcript data...")
bin_df = pd.read_csv(bin_file, sep='\t')

# Display basic information
print(f"\nData shape: {bin_df.shape}")
print(f"Columns: {bin_df.columns.tolist()}")

# Display data types
print("\nData types:")
print(bin_df.dtypes)

# Preview data
print("\nFirst few rows:")
bin_df.head()

Loading transcript data...

Data shape: (1526503, 4)
Columns: ['geneID', 'x', 'y', 'MIDCounts']

Data types:
geneID       object
x             int64
y             int64
MIDCounts     int64
dtype: object

First few rows:


          geneID    x    y  MIDCounts
0  0610005C13Rik  732   10          1
1  0610009B22Rik  154  120          1
2  0610009B22Rik  566  710          1
3  0610009B22Rik    4  600          1
4  0610009B22Rik  503  906          1


In [11]:
# Count unique genes
unique_genes = bin_df['geneID'].nunique()
print(f"Number of unique genes: {unique_genes}")

# View first 10 gene names
print(f"\nFirst 10 genes:")
print(bin_df['geneID'].unique()[:10])

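# Gene expression statistics
# Note: value_counts() counts rows per gene, not the sum of MIDCounts;
# the two agree only when every row has MIDCounts == 1
print(f"\nGene expression statistics:")
gene_counts = bin_df['geneID'].value_counts()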

# Gene expression distribution
print(f"\nGene expression distribution:")
print(f"Highest expressed gene count: {gene_counts.max()}")
print(f"Lowest expressed gene count: {gene_counts.min()}")
print(f"Average count per gene: {gene_counts.mean():.2f}")
print(f"Median count: {gene_counts.median():.2f}")

Number of unique genes: 17943

First 10 genes:
['0610005C13Rik' '0610009B22Rik' '0610009O20Rik' '0610010F05Rik'
 '0610010K14Rik' '0610012G03Rik' '0610030E20Rik' '0610037L13Rik'
 '0610038B21Rik' '0610040B10Rik']

Gene expression statistics:

Gene expression distribution:
Highest expressed gene count: 71880
Lowest expressed gene count: 1
Average count per gene: 85.08
Median count: 23.00
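
The counts above come from `value_counts()`, which tallies rows per gene. If some rows carry `MIDCounts` greater than 1, summing that column gives a molecule-level total instead. The following is a minimal sketch on a toy table; `toy_df` and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for bin_df (illustrative values only)
toy_df = pd.DataFrame({
    "geneID": ["A", "A", "B", "B", "B", "C"],
    "x": [0, 1, 2, 3, 4, 5],
    "y": [0, 0, 1, 1, 2, 2],
    "MIDCounts": [1, 3, 1, 1, 1, 2],
})

# Rows per gene: this is what value_counts() reports
rows_per_gene = toy_df["geneID"].value_counts()

# Molecules per gene: sum of MIDCounts, sorted from highest to lowest
mids_per_gene = (
    toy_df.groupby("geneID")["MIDCounts"].sum().sort_values(ascending=False)
)

print(rows_per_gene.to_dict())
print(mids_per_gene.to_dict())
```

In the toy table the two rankings differ: gene B has the most rows, while gene A has the most molecules. Whether that matters for the real data depends on how often `MIDCounts` exceeds 1.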


In [12]:
# Analyze proportion of expression covered by top 500 genes
top500_genes = gene_counts.head(500)
total_counts = gene_counts.sum()
top500_counts = top500_genes.sum()
top500_percentage = (top500_counts / total_counts) * 100

print(f"=== Top 500 Highly Expressed Genes Statistics ===")
print(f"Total number of genes: {len(gene_counts)}")
print(f"Top 500 genes count: {len(top500_genes)}")
print(f"Top 500 genes as % of total genes: {500/len(gene_counts)*100:.2f}%")
print(f"\nTotal expression count: {total_counts:,}")
print(f"Top 500 genes total count: {top500_counts:,}")
print(f"Top 500 genes as % of total expression: {top500_percentage:.2f}%")

# Cumulative expression coverage for different top-N genes
print(f"\n=== Cumulative Expression Coverage by Top-N Genes ===")
gene_numbers = [50, 100, 200, 300, 500, 1000, 2000]

for n in gene_numbers:
    if n <= len(gene_counts):
        top_n_counts = gene_counts.head(n).sum()
        percentage = (top_n_counts / total_counts) * 100
        print(f"Top {n:4d} genes cover {percentage:6.2f}% of total expression")
    else:
        print(f"Top {n:4d} genes: exceeds total gene count ({len(gene_counts)})")

=== Top 500 Highly Expressed Genes Statistics ===
Total number of genes: 17943
Top 500 genes count: 500
Top 500 genes as % of total genes: 2.79%

Total expression count: 1,526,503
Top 500 genes total count: 607,315
Top 500 genes as % of total expression: 39.78%

=== Cumulative Expression Coverage by Top-N Genes ===
Top   50 genes cover  14.75% of total expression
Top  100 genes cover  19.63% of total expression
Top  200 genes cover  26.73% of total expression
Top  300 genes cover  32.11% of total expression
Top  500 genes cover  39.78% of total expression
Top 1000 genes cover  51.66% of total expression
Top 2000 genes cover  65.42% of total expression
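
The coverage table above can also be computed in one pass with a cumulative sum, which makes it easy to ask the inverse question: how many genes are needed to reach a given share of the counts. This is a sketch on toy values; `toy_counts`, `top_n_coverage`, and `genes_needed` are hypothetical names that do not appear in the notebook.

```python
import numpy as np
import pandas as pd

# Toy per-gene counts, already sorted from highest to lowest (illustrative values)
toy_counts = pd.Series([50, 20, 15, 10, 5], index=list("ABCDE"))

# Cumulative percentage of total counts covered by the top-N genes
coverage = toy_counts.cumsum() * 100 / toy_counts.sum()

def top_n_coverage(n):
    """Percent of total counts covered by the n highest-count genes."""
    return float(coverage.iloc[n - 1])

def genes_needed(target_percent):
    """Smallest N whose cumulative coverage reaches target_percent."""
    return int(np.searchsorted(coverage.to_numpy(), target_percent) + 1)

print(top_n_coverage(2))
print(genes_needed(80))
```

On the real data, `gene_counts` is already sorted in descending order by `value_counts()`, so it could be passed through the same two steps.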


## 2. Select Top 500 Highly Expressed Genes

In [13]:
# Step 1: Filter top 500 genes and save data
print("=== Selecting Top 500 Highly Expressed Genes ===\n")

# Get top 500 gene list
top500_gene_names = gene_counts.head(500).index.tolist()
print(f"Selected {len(top500_gene_names)} genes")

# Filter data to include only these genes
bin_df_top500 = bin_df[bin_df['geneID'].isin(top500_gene_names)]

print(f"Original data rows: {len(bin_df):,}")
print(f"Filtered data rows: {len(bin_df_top500):,}")
print(f"Retention rate: {len(bin_df_top500)/len(bin_df)*100:.2f}%")

# Calculate coordinate ranges for gene map
x_min = bin_df['x'].min()
x_max = bin_df['x'].max()
y_min = bin_df['y'].min()
y_max = bin_df['y'].max()

map_width = x_max - x_min + 1
map_height = y_max - y_min + 1

print(f"\nGene map dimensions: ({map_height}, {map_width})")

# Create output directory in demo_data
top500_out_dir = os.path.join(output_dir, "top500")
os.makedirs(top500_out_dir, exist_ok=True)

# Save gene list
gene_list_path = os.path.join(top500_out_dir, 'top500_gene_names.txt')
with open(gene_list_path, 'w') as f:
    for gene in top500_gene_names:
        f.write(gene + '\n')

# Save filtered transcript data
transcripts_path = os.path.join(top500_out_dir, 'top500_transcripts.csv')
bin_df_top500.to_csv(transcripts_path, index=False)

print(f"\nSaved gene list to: {gene_list_path}")
print(f"Saved transcript data to: {transcripts_path}")

=== Selecting Top 500 Highly Expressed Genes ===

Selected 500 genes
Original data rows: 1,526,503
Filtered data rows: 607,315
Retention rate: 39.78%

Gene map dimensions: (1200, 1200)

Saved gene list to: demo_data/top500\top500_gene_names.txt
Saved transcript data to: demo_data/top500\top500_transcripts.csv


## 3. Generate Multi-Channel Gene Expression Map

In [14]:
# Step 2: Generate multi-channel gene expression map
print("=== Generating Top 500 Genes Expression Map ===\n")

# Create temporary directory in demo_data
temp_dir_top500 = os.path.join(output_dir, "temp")
os.makedirs(temp_dir_top500, exist_ok=True)
print(f"Temporary directory: {temp_dir_top500}")

# Define function to process a chunk of genes
def process_gene_chunk_top500(gene_chunk):
    """Process a chunk of genes and create individual expression maps"""
    for gene in gene_chunk:
        # Filter data for current gene
        gene_data = bin_df_top500[bin_df_top500['geneID'] == gene]
        
        # Create blank image
        gene_map = np.zeros((map_height, map_width), dtype=np.uint8)
        
        # Mark transcript locations on the image
        for _, row in gene_data.iterrows():
            x = int(row['x'] - x_min)
            y = int(row['y'] - y_min)
            
            # Ensure coordinates are within valid range
            if 0 <= x < map_width and 0 <= y < map_height:
                # NumPy arrays are indexed [row, column], i.e. [y, x]
                if gene_map[y, x] < 255:  # Prevent overflow
                    gene_map[y, x] += 1
        
        # Save individual gene expression map
        tifffile.imwrite(
            os.path.join(temp_dir_top500, f"{gene}.tif"), 
            gene_map, 
            photometric='minisblack'
        )

# Parallel processing of all genes
print("Generating individual gene expression maps in parallel...")
n_workers = min(32, (os.cpu_count() or 1) * 2)  # os.cpu_count() can return None
gene_chunks_top500 = np.array_split(top500_gene_names, n_workers)

with ThreadPoolExecutor(max_workers=n_workers) as executor:
    results = list(executor.map(process_gene_chunk_top500, gene_chunks_top500))
    
print(f"Processed {len(gene_chunks_top500)} gene chunks")

# Merge all gene expression maps into a multi-channel image
print("Merging all gene expression maps into multi-channel image...")
gene_map_all_top500 = np.zeros((map_height, map_width, len(top500_gene_names)), dtype=np.uint8)

for i, gene in enumerate(top500_gene_names):
    if (i + 1) % 100 == 0:  # Print progress every 100 genes
        print(f"  Merged {i + 1}/{len(top500_gene_names)} genes...")
    gene_map_all_top500[:, :, i] = tifffile.imread(os.path.join(temp_dir_top500, f"{gene}.tif"))

print(f"Merged all {len(top500_gene_names)} genes")

# Save multi-channel gene expression map
gene_map_path_top500 = os.path.join(top500_out_dir, "top500_gene_map.tif")
tifffile.imwrite(gene_map_path_top500, gene_map_all_top500)
print(f"Successfully saved gene expression map: {gene_map_path_top500}")

# Clean up temporary files
shutil.rmtree(temp_dir_top500, ignore_errors=True)
print("Temporary files cleaned up")

print(f"\n{'='*50}")
print(f"Top 500 Gene Map Generation Complete")
print(f"{'='*50}")
print(f"Output directory: {top500_out_dir}")
print(f"Gene map shape: {gene_map_all_top500.shape}")

=== Generating Top 500 Genes Expression Map ===

Temporary directory: demo_data/temp
Generating individual gene expression maps in parallel...
Processed 32 gene chunks
Merging all gene expression maps into multi-channel image...
  Merged 100/500 genes...
  Merged 200/500 genes...
  Merged 300/500 genes...
  Merged 400/500 genes...
  Merged 500/500 genes...
Merged all 500 genes
Successfully saved gene expression map: demo_data/top500\top500_gene_map.tif
Temporary files cleaned up

Top 500 Gene Map Generation Complete
Output directory: demo_data/top500
Gene map shape: (1200, 1200, 500)
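
The per-gene loop above writes one temporary TIFF per gene and then merges them. As an alternative, the whole multi-channel map can be filled in a single vectorized step with `np.add.at`, which accumulates correctly when several rows hit the same pixel. This is a sketch under stated assumptions, not the notebook's method: `build_gene_map` is a hypothetical helper, it sums `MIDCounts` instead of adding 1 per row, and it indexes the array as `[y, x, channel]`. The `int32` accumulator needs four times the memory of the final `uint8` map.

```python
import numpy as np
import pandas as pd

def build_gene_map(df, gene_names, x_min, y_min, height, width):
    """Return a (height, width, n_genes) uint8 map of summed MIDCounts."""
    channel_of = {g: i for i, g in enumerate(gene_names)}
    df = df[df["geneID"].isin(gene_names)]
    rows = (df["y"] - y_min).to_numpy()
    cols = (df["x"] - x_min).to_numpy()
    chans = df["geneID"].map(channel_of).to_numpy()
    values = df["MIDCounts"].to_numpy(dtype=np.int32)
    # Accumulate in a wider dtype, then clip to the uint8 range
    acc = np.zeros((height, width, len(gene_names)), dtype=np.int32)
    np.add.at(acc, (rows, cols, chans), values)
    return np.clip(acc, 0, 255).astype(np.uint8)

# Toy input (illustrative values); gene "C" is not in the selected list
toy_df = pd.DataFrame({
    "geneID": ["A", "A", "B", "C"],
    "x": [10, 10, 11, 12],
    "y": [5, 5, 6, 7],
    "MIDCounts": [1, 2, 1, 1],
})
toy_map = build_gene_map(toy_df, ["A", "B"], x_min=10, y_min=5, height=3, width=3)
print(toy_map.shape)
```

The two rows for gene A land on the same pixel and are summed, which a plain fancy-indexed `acc[rows, cols, chans] += values` would not do.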

In [15]:
# Step 3: Generate auxiliary files
print("=== Generating Auxiliary Files ===\n")

# Gene index mapping (gene name to channel index)
gene_index_df_top500 = pd.DataFrame({
    'gene': top500_gene_names,
    'index': range(len(top500_gene_names))
})
gene_index_path = os.path.join(top500_out_dir, 'gene_index_map.csv')
gene_index_df_top500.to_csv(gene_index_path, index=False)

# Gene expression statistics
gene_expr_path = os.path.join(top500_out_dir, 'gene_expression_counts.csv')
gene_counts.loc[top500_gene_names].to_csv(gene_expr_path)

print(f"Auxiliary files saved:")
print(f"  - Gene index mapping: {gene_index_path}")
print(f"  - Gene expression counts: {gene_expr_path}")

print(f"\n{'='*50}")
print(f"Top 500 Gene Processing Complete")
print(f"{'='*50}")
print(f"Output directory: {top500_out_dir}")
print(f"Number of genes: {len(top500_gene_names)}")
print(f"Number of transcripts: {len(bin_df_top500):,}")
print(f"Gene map dimensions: {gene_map_all_top500.shape}")

=== Generating Auxiliary Files ===

Auxiliary files saved:
  - Gene index mapping: demo_data/top500\gene_index_map.csv
  - Gene expression counts: demo_data/top500\gene_expression_counts.csv

Top 500 Gene Processing Complete
Output directory: demo_data/top500
Number of genes: 500
Number of transcripts: 607,315
Gene map dimensions: (1200, 1200, 500)
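
The index file written above links each gene name to its channel in the gene map. A minimal sketch of that lookup follows, using small in-memory stand-ins rather than the real files; the gene names `GeneA` to `GeneC` and the helper `channel_for` are hypothetical.

```python
import io
import numpy as np
import pandas as pd

# Stand-in for gene_index_map.csv (same two columns as written above)
index_csv = io.StringIO("gene,index\nGeneA,0\nGeneB,1\nGeneC,2\n")
gene_index = pd.read_csv(index_csv).set_index("gene")["index"]

# Stand-in for the (height, width, n_genes) gene map
gene_map = np.zeros((4, 4, 3), dtype=np.uint8)
gene_map[1, 2, 1] = 5  # pretend GeneB has signal at row 1, column 2

def channel_for(gene):
    """Return the 2D expression image for one gene."""
    return gene_map[:, :, int(gene_index[gene])]

gene_b = channel_for("GeneB")
print(gene_b.shape, gene_b[1, 2])
```

With the real outputs, the same pattern would apply after reading `gene_index_map.csv` with `pd.read_csv` and the gene map with `tifffile.imread`.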


## 4. Nuclear Segmentation with Cellpose

In [16]:
# Define DAPI image path
dapi_path = os.path.join(output_dir, "Mouse_brain_Adult_sub.tif")
print(f"DAPI image path: {dapi_path}")

DAPI image path: demo_data/Mouse_brain_Adult_sub.tif


In [17]:
# Cellpose nuclear segmentation for Stereo-seq DAPI image
from cellpose import models
from cellpose.io import save_masks

print("=== Cellpose Nuclear Segmentation ===\n")

# Output directory for segmentation results
segmentation_dir = os.path.join(output_dir, "segmentation")
os.makedirs(segmentation_dir, exist_ok=True)

# Load DAPI image
print(f"Loading DAPI image: {dapi_path}")
dapi_img = tifffile.imread(dapi_path)

print(f"\nDAPI image information:")
print(f"  - Shape: {dapi_img.shape}")
print(f"  - Data type: {dapi_img.dtype}")
print(f"  - Value range: {dapi_img.min()} - {dapi_img.max()}")

# Image preprocessing and normalization
print("\nPreprocessing image...")
# Use percentile stretching to enhance contrast
p2, p98 = np.percentile(dapi_img, (2, 98))
dapi_img_norm = np.clip((dapi_img - p2) / (p98 - p2) * 255, 0, 255).astype(np.uint8)
print(f"Percentile stretching: P2={p2:.1f}, P98={p98:.1f}")
print(f"Preprocessed image range: {dapi_img_norm.min()} - {dapi_img_norm.max()}")

# Visualize original image (optional)
plt.figure(figsize=(12, 10))
plt.imshow(dapi_img, cmap='gray')
plt.title("Stereo-seq DAPI Image")
plt.colorbar()
plt.savefig(os.path.join(segmentation_dir, "dapi_original.png"), 
            dpi=300, bbox_inches='tight')
plt.close()

# Initialize Cellpose model
print("\nInitializing Cellpose model...")
model = models.Cellpose(
    gpu=True,           # Use GPU acceleration
    model_type='nuclei' # Use nuclei segmentation model
)

# Segmentation parameters optimized for Stereo-seq
diameter = None             # Let Cellpose auto-estimate
flow_threshold = 0.4        # Flow threshold
cellprob_threshold = 0.0    # Cell probability threshold

print(f"\nSegmentation parameters:")
print(f"  - diameter: {diameter} (auto-estimate)")
print(f"  - flow_threshold: {flow_threshold}")
print(f"  - cellprob_threshold: {cellprob_threshold}")

# Run Cellpose segmentation
print("\nRunning nuclear segmentation...")
masks, flows, styles, diams_estimated = model.eval(
    dapi_img_norm,
    diameter=diameter,
    channels=[0, 0],  # Single-channel grayscale image
    flow_threshold=flow_threshold,
    cellprob_threshold=cellprob_threshold,
    do_3D=False
)

print(f"\nSegmentation complete!")
print(f"  - Estimated nucleus diameter: {diams_estimated:.1f} pixels")
print(f"  - Number of cells detected: {masks.max()}")

# Save segmentation masks
mask_path = os.path.join(segmentation_dir, "nuclei_masks.tif")
tifffile.imwrite(mask_path, masks.astype(np.uint16))
print(f"\nMask file saved: {mask_path}")

# Save visualization
segmentation_vis_path = os.path.join(segmentation_dir, "segmentation_visualization.png")
save_masks(dapi_img_norm, masks, flows, segmentation_vis_path)
print(f"Segmentation visualization saved: {segmentation_vis_path}")

# Create overlay visualization
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
plt.imshow(dapi_img_norm, cmap='gray')
plt.title("DAPI Image")
plt.axis('off')

plt.subplot(1, 3, 2)
plt.imshow(masks, cmap='viridis')
plt.title(f'Segmentation Masks\n({masks.max()} cells)')
plt.axis('off')

plt.subplot(1, 3, 3)
plt.imshow(dapi_img_norm, cmap='gray')
plt.imshow(masks > 0, alpha=0.3, cmap='viridis')
plt.title('Overlay')
plt.axis('off')

plt.tight_layout()
plt.savefig(os.path.join(segmentation_dir, "segmentation_results.png"), 
            dpi=300, bbox_inches='tight')
plt.close()

print(f"\n{'='*50}")
print(f"Cellpose Segmentation Complete")
print(f"{'='*50}")
print(f"Output directory: {segmentation_dir}")
print(f"Detected cells: {masks.max()}")
print(f"Estimated diameter: {diams_estimated:.1f} pixels")

=== Cellpose Nuclear Segmentation ===

Loading DAPI image: demo_data/Mouse_brain_Adult_sub.tif

DAPI image information:
  - Shape: (1200, 1200)
  - Data type: uint8
  - Value range: 4 - 255

Preprocessing image...
Percentile stretching: P2=7.0, P98=144.0
Preprocessed image range: 0 - 255

Initializing Cellpose model...

Segmentation parameters:
  - diameter: None (auto-estimate)
  - flow_threshold: 0.4
  - cellprob_threshold: 0.0

Running nuclear segmentation...

Segmentation complete!
  - Estimated nucleus diameter: 11.2 pixels
  - Number of cells detected: 1133

Mask file saved: demo_data/segmentation\nuclei_masks.tif
Segmentation visualization saved: demo_data/segmentation\segmentation_visualization.png
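
The percentile stretch used in the preprocessing step divides by `p98 - p2`, which is zero for a flat image. One way to guard against that is to wrap the step in a small function. This is a sketch, not part of the notebook; `percentile_stretch` is a hypothetical helper name.

```python
import numpy as np

def percentile_stretch(img, low=2, high=98):
    """Rescale img so the [low, high] percentile range maps to 0..255 (uint8)."""
    p_low, p_high = np.percentile(img, (low, high))
    if p_high <= p_low:
        # Flat image: nothing to stretch, so return zeros instead of dividing by zero
        return np.zeros(img.shape, dtype=np.uint8)
    scaled = (img.astype(np.float32) - p_low) / (p_high - p_low) * 255
    return np.clip(scaled, 0, 255).astype(np.uint8)

# Toy inputs: a gradient image and a constant image
gradient = np.arange(256, dtype=np.uint8).reshape(16, 16)
flat = np.full((8, 8), 7, dtype=np.uint8)

stretched = percentile_stretch(gradient)
stretched_flat = percentile_stretch(flat)
print(stretched.min(), stretched.max(), stretched_flat.max())
```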

## Summary

This notebook demonstrated the complete preprocessing pipeline for Stereo-seq data:

1. **Data Loading**: Loaded transcript data from TSV file
2. **Gene Selection**: Identified and selected top 500 highly expressed genes
3. **Gene Map Generation**: Created multi-channel gene expression maps
4. **Nuclear Segmentation**: Performed Cellpose segmentation on DAPI images

**Output Files**:
- `demo_data/top500/top500_gene_names.txt` - List of selected genes
- `demo_data/top500/top500_transcripts.csv` - Filtered transcript data
- `demo_data/top500/top500_gene_map.tif` - Multi-channel gene expression map
- `demo_data/top500/gene_index_map.csv` - Gene to channel index mapping
- `demo_data/top500/gene_expression_counts.csv` - Gene expression statistics
- `demo_data/segmentation/nuclei_masks.tif` - Nuclear segmentation masks
- `demo_data/segmentation/*.png` - Visualization images

**Ready for SpaCut**: The generated gene map and nuclei masks can now be used as inputs for SpaCut cell segmentation.
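
Before passing the two outputs on, it can help to confirm that the gene map and the nuclei masks cover the same pixel grid and to see how many counts fall inside each nucleus. The sketch below uses toy arrays in place of `top500_gene_map.tif` and `nuclei_masks.tif`. It says nothing about how SpaCut itself reads these files, which this notebook does not show.

```python
import numpy as np

# Toy stand-ins for the gene map (H, W, genes) and the label mask (H, W)
gene_map = np.zeros((4, 4, 2), dtype=np.uint8)
gene_map[0, 0, 0] = 2
gene_map[0, 1, 1] = 1
gene_map[3, 3, 0] = 4

masks = np.zeros((4, 4), dtype=np.uint16)
masks[0, 0:2] = 1  # nucleus 1
masks[3, 3] = 2    # nucleus 2

# Both arrays must share the same height and width
assert gene_map.shape[:2] == masks.shape

# Total counts per pixel, then summed per mask label (label 0 is background)
per_pixel = gene_map.sum(axis=2, dtype=np.int64)
counts_per_label = np.bincount(masks.ravel(), weights=per_pixel.ravel())
print(counts_per_label)
```

Entry `k` of `counts_per_label` is the total count inside the mask with label `k`, so entry 0 collects everything that falls outside the segmented nuclei.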