To generate synthetic VisiumHD data from seqFISH+, please read and run all the cells below. Thanks!

### Install prerequisite libraries

In [None]:
!pip install --upgrade pip
!pip install scipy
!pip install shapely
!pip install tifffile
!pip install plotly
!pip install tensorflow-gpu==2.10.0
!pip install stardist
!pip install geopandas
!pip install scanpy
!pip install fastparquet
!pip install imagecodecs
!pip install zarr
!pip install h5py
!pip install read-roi
!pip install seaborn
!pip install tqdm


### Import Relevant Libraries

In [None]:
import tifffile as tifi # Package to read the WSI (whole slide image)
from csbdeep.utils import normalize # Image normalization
from shapely.geometry import Polygon, Point # Representing bins and cells as Shapely Polygons and Point objects
from shapely import wkt
import geopandas as gpd # Geopandas for storing Shapely objects
from matplotlib.colors import ListedColormap
import matplotlib.pyplot as plt
import scanpy as sc
import pandas as pd
from scipy import sparse
import anndata
import os
import gzip
import numpy as np
import re
import shapely
import zarr


### Create folders to store synthetic data

For both the `seqfish_dir` and `enact_data_dir`, change `"/home/oneai/"` to the directory that stores this repo.

In [None]:
seqfish_dir = "/home/oneai/oneai-dda-spatialtr-visiumhd_analysis/synthetic_data/seqFISH" # Update it to the directory where you want to save the synthetic data
enact_data_dir = "/home/oneai/oneai-dda-spatialtr-visiumhd_analysis/cache/seqfish/chunks" # Directory that saves all the input and results of the enact pipeline, 
# should end with "oneai-dda-spatialtr-visiumhd_analysis/cache/seqfish/chunks"

transcripts_df_chunks_dir = os.path.join(seqfish_dir, "transcripts_patches") # Directory to store the files that contain the transcripts info for each chunk
output_dir = os.path.join(enact_data_dir, "bins_gdf") # Directory to store the generated synthetic binned transcript counts
cells_df_chunks_dir = os.path.join(enact_data_dir, "cells_gdf") # Directory to store the cell polygon patches (one file per fov)

# Making relevant directories
os.makedirs(seqfish_dir, exist_ok=True)
os.makedirs(enact_data_dir, exist_ok=True)
os.makedirs(transcripts_df_chunks_dir, exist_ok=True)
os.makedirs(output_dir, exist_ok=True)
os.makedirs(cells_df_chunks_dir, exist_ok=True)

### Download seqFISH+ data

1. Download "ROIs_Experiment1_NIH3T3.zip" from https://zenodo.org/records/2669683#.Xqi1w5NKg6g to `seqfish_dir`. The zip file contains the cell segmentation files.
2. Download "run1.csv.gz" from https://github.com/MonashBioinformaticsPlatform/seqfish-hack to `seqfish_dir`. It contains a tidy-format version of "seqFISH+_NIH3T3_point_locations.zip" from the official seqFISH+ Zenodo site.

### Load Cell & Transcripts Info

The following cells first unzip "ROIs_Experiment1_NIH3T3.zip" to extract the cell segmentation information, then load the transcripts dataframe from "run1.csv.gz".

In [None]:
import zipfile
import os
zip_file_path = os.path.join(seqfish_dir, "ROIs_Experiment1_NIH3T3.zip")

# Open the ZIP file and extract all the contents
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall(seqfish_dir)

print(f'Files extracted to {seqfish_dir}')

In [None]:
file_path =  os.path.join(seqfish_dir, "run1.csv.gz")

transcripts_df = pd.read_csv(file_path, compression='gzip')
print(transcripts_df)

In [None]:
# convert from pixel to um
transcripts_df.x = transcripts_df.x*0.103
transcripts_df.y = transcripts_df.y*0.103
# label cell to include fov and cell number
transcripts_df['new_cell_name'] = transcripts_df.apply(lambda x: f"{x['fov']}_Cell_{x['cell']}", axis=1)
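
As a quick sanity check of the two steps above, here is a minimal sketch on a made-up three-row table (the values are hypothetical, not taken from "run1.csv.gz"). It applies the same 0.103 um-per-pixel factor and builds the `fov_Cell_cell` label with vectorized string concatenation, which should give the same labels as the row-wise `apply` while being faster on large tables.

```python
import pandas as pd

PIXEL_SIZE_UM = 0.103  # same pixel-to-um factor as in the cell above

# Toy transcripts table (hypothetical values)
toy_df = pd.DataFrame({
    "fov": [1, 1, 2],
    "cell": [3, 3, 7],
    "gene": ["Actb", "Gapdh", "Actb"],
    "x": [100.0, 200.0, 1000.0],
    "y": [50.0, 150.0, 500.0],
})

# Convert pixel coordinates to micrometres
toy_df["x"] = toy_df["x"] * PIXEL_SIZE_UM
toy_df["y"] = toy_df["y"] * PIXEL_SIZE_UM

# Vectorized equivalent of the row-wise f-string label
toy_df["new_cell_name"] = (
    toy_df["fov"].astype(str) + "_Cell_" + toy_df["cell"].astype(str)
)
print(toy_df)
```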

### Generate Ground Truth

The following cell generates and saves the ground truth for the synthetic VisiumHD data, which is later used to evaluate the bin-to-cell assignment methods. Each row of the ground truth dataframe holds the transcript counts of one cell, and each column is a gene feature (the column name is the gene name).

In [None]:
groundtruth_df = transcripts_df.pivot_table(index=['new_cell_name'], columns='gene', aggfunc='size', fill_value=0)
ground_truth_file = os.path.join(seqfish_dir, "groundtruth.csv")
groundtruth_df.to_csv(ground_truth_file)
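
To make the shape of the ground truth table concrete, the sketch below runs the same `pivot_table` call on a small made-up table (cell names, genes, and coordinates are hypothetical). With `aggfunc='size'`, every transcript row counts once, so each entry is the number of transcripts of that gene in that cell.

```python
import pandas as pd

# Toy transcripts: three in cell 1 (two Actb, one Gapdh), one in cell 2
toy_df = pd.DataFrame({
    "new_cell_name": ["1_Cell_1", "1_Cell_1", "1_Cell_1", "1_Cell_2"],
    "gene": ["Actb", "Actb", "Gapdh", "Actb"],
    "x": [1.0, 2.0, 3.0, 4.0],
})

# Same call as above: count rows per (cell, gene) pair, filling absent pairs with 0
toy_groundtruth = toy_df.pivot_table(
    index=["new_cell_name"], columns="gene", aggfunc="size", fill_value=0
)
print(toy_groundtruth)
```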

### Generate Synthetic VisiumHD Dataset

#### Break transcripts df to patches (based on fov)

Break the transcripts df into patches based on field of view (fov), since cell segmentation is done on each individual fov separately.

In [None]:
# Create a df for each fov
grouped = transcripts_df.groupby('fov')  # group by column name so that `fov` is a scalar key, not a tuple
for fov, group in grouped:
    filename = f"patch_{fov}.csv"
    output_loc = os.path.join(transcripts_df_chunks_dir, filename)
    group.to_csv(output_loc)

    print(f"Saved {filename}")
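
The file names produced by this loop depend on the group key being a scalar. As far as I know, recent pandas versions return one-element tuples as keys when grouping by a one-element list, which would turn the name into something like `patch_(1,).csv`. A minimal sketch on made-up data, grouping by the column name:

```python
import pandas as pd

# Toy table with two fields of view (hypothetical values)
toy_df = pd.DataFrame({"fov": [1, 1, 2], "gene": ["Actb", "Gapdh", "Actb"]})

# Grouping by the column name (a string, not a list) yields scalar keys
patch_names = []
for fov, group in toy_df.groupby("fov"):
    patch_names.append((f"patch_{fov}.csv", len(group)))

print(patch_names)
```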

#### Generate synthetic VisiumHD for each patch

Each fov is divided into bins of size 2um x 2um. The synthetic data contains transcript counts organized by bin_id: each row holds the transcript counts for a unique bin, and bins with no transcripts are not included.

In addition to the gene features, two columns give the row number and column number of the bin, and one column holds the Shapely polygon that represents the bin. The first column is the bin_id.

In [None]:
def generate_synthetic_VesiumHD_data(transcripts_df, bin_size=2):
    
    filtered_df = transcripts_df.copy()
    
    # assign a bin (row and column index) to each transcript
    filtered_df.loc[:, 'row'] = np.ceil(filtered_df['y'] / bin_size).astype(int)
    filtered_df.loc[:, 'column'] = np.ceil(filtered_df['x'] / bin_size).astype(int)
    filtered_df.loc[:, 'assigned_bin_id'] = filtered_df.apply(
        lambda row: f"{bin_size}um_" + str(row['row']).zfill(5) +"_"+ str(row['column']).zfill(5),
        axis=1)
    bin_coordinates = filtered_df[['assigned_bin_id', 'row', 'column']].drop_duplicates().set_index('assigned_bin_id')
    bin_gene_matrix = filtered_df.groupby(['assigned_bin_id', 'gene']).size().unstack(fill_value=0)
    bin_gene_matrix_with_coords = bin_gene_matrix.merge(bin_coordinates, left_index=True, right_index=True)
    return bin_gene_matrix_with_coords
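
The binning logic can be checked by hand on a few transcripts. The sketch below uses made-up coordinates and builds the bin IDs with vectorized string operations instead of `apply`; it is only an illustration of the same `ceil(coordinate / bin_size)` rule, not a replacement for the function above.

```python
import numpy as np
import pandas as pd

bin_size = 2  # um

# Toy transcripts (hypothetical coordinates, already in um)
toy_df = pd.DataFrame({
    "gene": ["Actb", "Actb", "Gapdh", "Actb"],
    "x": [0.5, 1.9, 1.0, 3.2],
    "y": [0.5, 0.7, 1.5, 0.1],
})

# A coordinate in ((k - 1) * bin_size, k * bin_size] falls into bin index k
toy_df["row"] = np.ceil(toy_df["y"] / bin_size).astype(int)
toy_df["column"] = np.ceil(toy_df["x"] / bin_size).astype(int)

# Bin ID: "<bin_size>um_<row, 5 digits>_<column, 5 digits>"
toy_df["assigned_bin_id"] = (
    f"{bin_size}um_"
    + toy_df["row"].astype(str).str.zfill(5)
    + "_"
    + toy_df["column"].astype(str).str.zfill(5)
)

# Bin-by-gene count matrix; bins without transcripts never appear
toy_counts = toy_df.groupby(["assigned_bin_id", "gene"]).size().unstack(fill_value=0)
print(toy_counts)
```

The first three transcripts land in bin `2um_00001_00001` and the fourth in `2um_00001_00002`.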

In [None]:
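# Extract the row and column numbers from the bin_id and convert them
# to the bin centre coordinates (x, y) for 2um bins
def extract_numbers(entry):
    match = re.search(r'_(\d{5})_(\d{5})', entry)
    if match:
        # int() already ignores leading zeros; stripping them first would fail on "00000"
        number1 = int(match.group(1))  # row
        number2 = int(match.group(2))  # column
        return number2*2-1, number1*2-1
    else:
        return None, None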
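
The `*2-1` arithmetic is the bin centre for a 2um bin: bin index k covers the interval ((k - 1) * 2, k * 2], whose midpoint is 2k - 1. A small sketch of the same idea for an arbitrary bin size follows; `bin_id_to_centre` is a hypothetical helper name used only for this illustration.

```python
import re

def bin_id_to_centre(bin_id, bin_size=2):
    """Hypothetical helper: return the (x, y) centre in um of a bin ID
    such as '2um_00003_00010' (row 3, column 10)."""
    match = re.search(r"_(\d{5})_(\d{5})", bin_id)
    if match is None:
        return None, None
    row, column = int(match.group(1)), int(match.group(2))
    # Bin k covers ((k - 1) * bin_size, k * bin_size], so its centre is
    # k * bin_size - bin_size / 2
    x_centre = column * bin_size - bin_size / 2
    y_centre = row * bin_size - bin_size / 2
    return x_centre, y_centre

print(bin_id_to_centre("2um_00003_00010"))  # (19.0, 5.0)
```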

In [None]:
from tqdm import tqdm
def generate_bin_polys(bins_df, x_col, y_col, bin_size):
    """Represents the bins as Shapely polygons

    Args:
        bins_df (pd.DataFrame): bins dataframe
        x_col (str): column with the bin centre x-coordinate
        y_col (str): column with the bin centre y-coordinate
        bin_size (int): bin size, in the same units as the centre coordinates

    Returns:
        list: list of Shapely polygons
    """
    # Bounding box of each bin: centre +/- half the bin size
    half_bin_size = bin_size / 2
    bbox_coords = pd.DataFrame(
        {
            "min_x": bins_df[x_col] - half_bin_size,
            "min_y": bins_df[y_col] - half_bin_size,
            "max_x": bins_df[x_col] + half_bin_size,
            "max_y": bins_df[y_col] + half_bin_size,
        }
    )
    # Generates Shapely polygons to represent each bin
    geometry = [
        shapely.geometry.box(min_x, min_y, max_x, max_y)
        for min_x, min_y, max_x, max_y in tqdm(
            zip(
                bbox_coords["min_x"],
                bbox_coords["min_y"],
                bbox_coords["max_x"],
                bbox_coords["max_y"],
            ),
            total=len(bins_df),
        )
    ]

    return geometry
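
For a single bin, the polygon construction reduces to one `shapely.geometry.box` call. The sketch below builds the 2um bin centred at a made-up centre (19, 5) and checks that a hypothetical transcript position falls inside it.

```python
from shapely.geometry import Point, box

bin_size = 2                # um
centre_x, centre_y = 19, 5  # hypothetical bin centre in um
half = bin_size / 2

# Axis-aligned square: (min_x, min_y, max_x, max_y)
bin_poly = box(centre_x - half, centre_y - half, centre_x + half, centre_y + half)

# A made-up transcript position inside that bin
transcript = Point(18.4, 5.7)

print(bin_poly.bounds)                # (18.0, 4.0, 20.0, 6.0)
print(bin_poly.area)                  # 4.0
print(bin_poly.contains(transcript))  # True
```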

In [None]:
# Loop through all the transcripts_df patches and generate gene-to-bin assignments
bin_size = 2
transcripts_df_chunks = os.listdir(transcripts_df_chunks_dir)
for chunk_fname in transcripts_df_chunks:
    output_loc = os.path.join(output_dir, chunk_fname)
    if chunk_fname in [".ipynb_checkpoints"]:
        continue
    # if os.path.exists(output_loc):
    #     continue
    transcripts_df_chunk = pd.read_csv(os.path.join(transcripts_df_chunks_dir, chunk_fname))
    bin_df_chunk = generate_synthetic_VesiumHD_data(transcripts_df_chunk, bin_size)
    # Convert bin indices to bin centre coordinates in um (valid for bin_size = 2)
    bin_df_chunk['column'] = bin_df_chunk['column']*2-1
    bin_df_chunk['row'] = bin_df_chunk['row']*2-1
    bin_df_chunk['geometry'] = generate_bin_polys(bin_df_chunk, 'column', 'row', bin_size)
    bin_gdf_chunk = gpd.GeoDataFrame(bin_df_chunk, geometry=bin_df_chunk['geometry'])
    bin_gdf_chunk.to_csv(output_loc)

    print(f"Successfully assigned transcripts to bins for {chunk_fname}")

### Generate ENACT pipeline cell segmentation input

This section generates the cell_df patches required to run the ENACT pipeline. The main purpose is to create Shapely polygons that represent the cell outlines.

#### Load cell boundary data and create cell polygons

In [None]:
import read_roi
def process_roi_file(key, roi_file_path):
    roi_data = read_roi.read_roi_file(roi_file_path)
    data = roi_data[key]
    # Apply the scaling factor to each coordinate separately
    scaled_x = [x * 0.103 for x in data['x']]
    scaled_y = [y * 0.103 for y in data['y']]
    # Create the list of points using zip on the scaled coordinates
    points = [(x, y) for x, y in zip(scaled_x, scaled_y)]
    # Create and return the polygon
    polygon = Polygon(points)
    return polygon

In [None]:
def extract_fov_from_string(s):
    # Search for one or more digits in the string
    match = re.search(r'\d+', s)
    if match:
        # Convert the found number to an integer and add 1 to get the fov number
        return int(match.group(0))+1
    else:
        return None  # Return None if no number is found

In [None]:
base_path = os.path.join(seqfish_dir, "ALL_Roi")  # Change this to the path where your fov folders are stored
fov_data = []

for fov_folder in os.listdir(base_path):
    fov_folder_path = os.path.join(base_path, fov_folder)
    if os.path.isdir(fov_folder_path):
        # Loop through each ROI file in the fov folder
        for roi_file in os.listdir(fov_folder_path):
            if roi_file.endswith('.roi'):
                key = roi_file.replace('.roi', '')
                roi_file_path = os.path.join(fov_folder_path, roi_file)
                polygon = process_roi_file(key, roi_file_path)
                fov_data.append({
                    'fov':  extract_fov_from_string(fov_folder),
                    'cell': roi_file.replace('.roi', ''),
                    'geometry': polygon
                })

cell_boundary_df = pd.DataFrame(fov_data)

#### Relabel cell names of the polygons df to the standard name

In [None]:
df_sorted = cell_boundary_df.sort_values(by=['fov', 'cell'])
df_sorted['cell_id'] = df_sorted.groupby('fov').cumcount() + 1
df_sorted['cell_id'] = df_sorted.apply(lambda x: f"{x['fov']}_Cell_{x['cell_id']}", axis=1)
# Save next to the "chunks" folder, i.e. in ".../cache/seqfish/cells_df.csv"
df_sorted.to_csv(os.path.join(os.path.dirname(enact_data_dir), "cells_df.csv"))
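
The relabelling above numbers the cells within each fov in sorted order. A minimal sketch on made-up ROI names shows the mechanics. Note that `cell` holds strings, so the sort is lexicographic (a name ending in "10" sorts before one ending in "2"); whether that matters depends on how the ROI files are named, which is not checked here.

```python
import pandas as pd

# Toy cell table (hypothetical ROI names)
toy_cells = pd.DataFrame({
    "fov": [2, 1, 1],
    "cell": ["RoiB", "RoiB", "RoiA"],
})

toy_sorted = toy_cells.sort_values(by=["fov", "cell"])

# cumcount() numbers the rows of each fov group 0, 1, 2, ... in their current order
toy_sorted["cell_id"] = (
    toy_sorted["fov"].astype(str)
    + "_Cell_"
    + (toy_sorted.groupby("fov").cumcount() + 1).astype(str)
)
print(toy_sorted)
```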

#### Break cell polygons df to patches (based on fov)

In [None]:

# Create a df for each patch
grouped = df_sorted.groupby('fov')  # group by column name so that `fov` is a scalar key, not a tuple
for fov, group in grouped:
    filename = f"patch_{fov}.csv"
    output_loc = os.path.join(cells_df_chunks_dir, filename)
    group.to_csv(output_loc)

    print(f"Saved {filename}")


Saved patch_1.csv
Saved patch_2.csv
Saved patch_3.csv
Saved patch_4.csv
Saved patch_5.csv
Saved patch_6.csv
Saved patch_7.csv




### Run ENACT bin-to-cell pipeline
In the `configs.yaml` file:

1. Set "analysis_name" to "seqfish".
2. Set "run_synthetic" to True.
3. Set "bin_to_cell_method" to one of these four: "naive", "weighted_by_area", "weighted_by_gene", or "weighted_by_cluster".

Then run `make run_enact`.

### Evaluation of ENACT bin-to-cell results

To evaluate and compare the four bin-to-cell methods, please first complete the step above with all four methods. You can also only run the methods you are interested in and change the following code accordingly.

#### Calculate precision, recall, and F1 for each bin2cell method

Run this section once for each method you have run with ENACT, changing `method` in the cell below to the one you want to evaluate.

In [None]:
import os
import pandas as pd

def concatenate_csv_files(directory_path, output_file):
    dataframes = []
    output_name = os.path.basename(output_file)

    for filename in os.listdir(directory_path):
        # Skip a merged file left over from a previous run so it is not concatenated again
        if filename.endswith('.csv') and filename != output_name:
            file_path = os.path.join(directory_path, filename)
            df = pd.read_csv(file_path)
            dataframes.append(df)

    concatenated_df = pd.concat(dataframes, ignore_index=True)
    # Drop leftover index columns, if present
    concatenated_df = concatenated_df.drop(columns=['Unnamed: 0.1', 'Unnamed: 0'], errors='ignore')
    sorted_df = concatenated_df.sort_values(by='id')
    sorted_df.to_csv(output_file, index=False)
    print(f"All CSV files have been concatenated into {output_file}")

In [None]:
# Concatenate all patches of ENACT results file
method = "weighted_by_gene" # other methods: "naive", "weighted_by_area", "weighted_by_cluster"
directory_path = os.path.join(enact_data_dir, method, "bin_to_cell_assign")
output_file = os.path.join(enact_data_dir, method, "bin_to_cell_assign/merged.csv")

concatenate_csv_files(directory_path, output_file)

In [None]:
import pandas as pd
import numpy as np

def calculate_metrics(ground_truth_file, generated_file, eval_file, method_name):
    # Load ground truth and generated (ENACT) per-cell transcript counts
    ground_truth = pd.read_csv(ground_truth_file)
    generated = pd.read_csv(generated_file)
    # ENACT results store the cell name in 'id'; rename it to match the ground truth
    if 'id' in generated.columns:
        generated = generated.rename(columns={'id': 'new_cell_name'})

    # Outer-merge on the cell name; cells missing on one side get a count of 0
    merged = pd.merge(
        ground_truth, generated, on='new_cell_name', how='outer', suffixes=('_gt', '_gen')
    ).fillna(0)

    # Columns present in both tables receive the '_gt' / '_gen' suffixes;
    # use them to identify the common gene features
    common_genes = sorted(
        col[:-len('_gt')] for col in merged.columns
        if col.endswith('_gt') and f"{col[:-len('_gt')]}_gen" in merged.columns
    )
    if not common_genes:
        raise ValueError("No common gene columns found between ground truth and generated data!")

    # Extract aligned count matrices (same cells in rows, same genes in columns)
    ground_truth_aligned = merged[[f"{gene}_gt" for gene in common_genes]].to_numpy(dtype=float)
    generated_aligned = merged[[f"{gene}_gen" for gene in common_genes]].to_numpy(dtype=float)

    # Per-cell true positives: counts shared by the prediction and the ground truth
    tp = np.sum(np.minimum(generated_aligned, ground_truth_aligned), axis=1)
    predicted = np.sum(generated_aligned, axis=1)
    actual = np.sum(ground_truth_aligned, axis=1)

    # Calculate precision, recall, and F1 score for each cell.
    # Cells with no predicted or no actual counts yield NaN.
    with np.errstate(divide='ignore', invalid='ignore'):
        precision = tp / predicted
        recall = tp / actual
        f1_score = 2 * (precision * recall) / (precision + recall)

    # The 'Method' column labels every row with the evaluated method
    df = pd.DataFrame({
        'Precision': precision,
        'Recall': recall,
        'F1 Score': f1_score,
        'Method': method_name
    })

    df.to_csv(eval_file)


In [None]:
ground_truth_file = os.path.join(seqfish_dir, "groundtruth.csv")
generated_file = os.path.join(enact_data_dir, method, "bin_to_cell_assign/merged.csv")
eval_file = os.path.join(enact_data_dir, method, "eval.csv")

calculate_metrics(ground_truth_file, generated_file, eval_file, method)
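
To see what the per-cell metrics measure, here is a small worked example on made-up count matrices (two cells, three genes). True positives for a cell are the counts shared by the prediction and the ground truth, i.e. the gene-wise minimum summed over genes; precision divides them by the predicted total and recall by the actual total.

```python
import numpy as np

# Rows are cells, columns are genes (hypothetical counts)
truth = np.array([[5, 0, 3],
                  [2, 2, 0]], dtype=float)
pred = np.array([[4, 1, 3],
                 [2, 0, 0]], dtype=float)

tp = np.minimum(pred, truth).sum(axis=1)  # shared counts per cell: [7, 2]
precision = tp / pred.sum(axis=1)         # [7/8, 2/2]
recall = tp / truth.sum(axis=1)           # [7/8, 2/4]
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
```

The second cell shows the asymmetry: every predicted count is correct (precision 1.0), but half of the true transcripts are missing (recall 0.5), giving an F1 score of about 0.667.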

#### Create violin plots comparing four bin2cell methods

The following cells create violin plots for all four methods so the results can be compared side by side. To compare only the methods you have run, change the 'file_names' list to include only those.

In [None]:
file_names = [os.path.join(enact_data_dir,"naive/eval.csv"), 
              os.path.join(enact_data_dir,"weighted_by_area/eval.csv"), 
              os.path.join(enact_data_dir,"weighted_by_gene/eval.csv"),
              os.path.join(enact_data_dir,"weighted_by_cluster/eval.csv")]  # Replace with actual file paths

# Read and concatenate all files
df_list = [pd.read_csv(file) for file in file_names]
metrics_df = pd.concat(df_list, ignore_index=True)
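
Before plotting, a numeric summary per method can be a useful cross-check. The sketch below uses two made-up evaluation tables in place of the real `eval.csv` files and reports the median F1 score per method. It assumes the 'Method' column differs between files.

```python
import pandas as pd

# Stand-ins for two eval.csv files (hypothetical scores)
toy_naive = pd.DataFrame({"F1 Score": [0.90, 0.80], "Method": "naive"})
toy_area = pd.DataFrame({"F1 Score": [0.95, 0.85], "Method": "weighted_by_area"})

# Same concatenation as above, then one median per method
toy_metrics = pd.concat([toy_naive, toy_area], ignore_index=True)
summary = toy_metrics.groupby("Method")["F1 Score"].median()
print(summary)
```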

In [None]:
# Visualize the distributions using violin plots
import seaborn as sns

sns.set(style="whitegrid")

# Create a figure with subplots for each metric
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

# Precision Violin Plot
sns.violinplot(x='Method', y='Precision', data=metrics_df, ax=axes[0], inner='quartile', palette='Set2')
axes[0].set_title('Precision')
axes[0].set_xlabel('Method')
axes[0].set_ylabel('value')
axes[0].set_ylim(0.8,1)
axes[0].tick_params(axis='x', labelsize=8)  # Adjust the font size here

# Recall Violin Plot
sns.violinplot(x='Method', y='Recall', data=metrics_df, ax=axes[1], inner='quartile', palette='Set2')
axes[1].set_title('Recall')
axes[1].set_xlabel('Method')
axes[1].set_ylabel('value')
axes[1].set_ylim(0.8,1)
axes[1].tick_params(axis='x', labelsize=8)  # Adjust the font size here

# F1 Score Violin Plot
sns.violinplot(x='Method', y='F1 Score', data=metrics_df, ax=axes[2], inner='quartile', palette='Set2')
axes[2].set_title('F1 Score')
axes[2].set_xlabel('Method')
axes[2].set_ylabel('value')
axes[2].set_ylim(0.8,1)
axes[2].tick_params(axis='x', labelsize=8)  # Adjust the font size here

plt.tight_layout()
plt.show()