# Chapter 01 — Loading and Organizing the Data

## Objective

In this chapter, we load all single-nucleus RNA-seq datasets required for benchmarking the **Singulator + FACS** and **Singulator + LeviCell** protocols. We also verify metadata consistency, perform basic integrity checks, and prepare the data objects for downstream analysis.

This includes:

- Locating the data on the shared filesystem (Iris)
- Loading raw count matrices (e.g., `filtered_feature_bc_matrix.h5` from 10X)
- Loading data into `AnnData` objects

## Data Source

The data for this benchmarking project is stored on the **Iris** HPC filesystem under:

`/data1/collab002/sail/isabl/datalake/prod/010/collaborators/SAIL/projects/singulator_debris_removal_and/experiments`

We work with the sample directories whose names match the identifier `MB-4027_*`.
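
Before linking anything, it can help to list which directories actually match the identifier. The following is a minimal sketch, assuming the sample folders sit directly under a single root directory. `list_sample_dirs` is a hypothetical helper, and the demo runs on a temporary directory rather than the Iris path.

```python
import tempfile
from pathlib import Path


def list_sample_dirs(root: Path, pattern: str = "MB-4027_*") -> list[str]:
    """Return the sorted names of directories under `root` matching `pattern`."""
    return sorted(p.name for p in root.glob(pattern) if p.is_dir())


# Demo on a throwaway directory tree (a stand-in for the real data root).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ["MB-4027_SF_N", "MB-4027_SL_N", "OTHER_sample"]:
        (root / name).mkdir()
    # A regular file that matches the pattern must not be reported.
    (root / "MB-4027_notes.txt").write_text("not a sample directory")
    found = list_sample_dirs(root)

print(found)
```

Filtering on `is_dir()` mirrors the check used later in this chapter when the symlinks are created.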


## Create Conda Environment for Analysis

To ensure reproducibility and consistent dependency management, we use a Conda environment for all analyses in this benchmarking project.

The environment includes key packages for single-cell RNA-seq data handling, visualization, and benchmarking, including:

- `scanpy`
- `anndata`
- `pandas`, `numpy`, `matplotlib`, `seaborn`
- `jupyterlab` and `notebook`
- `scikit-learn`, `umap-learn`, `leidenalg`

### Step 1: Create the Environment

You can create the environment using `mamba` (recommended) or `conda`. An `env.yml` file is provided for you within this directory with all the prerequisites for performing the analysis in this book. The commands below allow you to create your environment:


In [1]:
%%bash

# Run ONE of the two commands below, not both. Creating the environment a
# second time fails with "CondaValueError: prefix already exists".

# Recommended
mamba env create -f env.yml

# Or with conda
# conda env create -f env.yml

You can now run Python with the `benchmarking_template` ipykernel. Select this kernel from the menu in the top right of the notebook.
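
Once the kernel is selected, a quick check can confirm that the key packages are importable. This is a sketch that assumes the import names below correspond to the packages listed earlier; import names can differ from package names (`scikit-learn` is imported as `sklearn`, `umap-learn` as `umap`). `missing_packages` is a hypothetical helper.

```python
from importlib.util import find_spec


def missing_packages(import_names: list[str]) -> list[str]:
    """Return the top-level import names that cannot be found in this environment."""
    return [name for name in import_names if find_spec(name) is None]


# Top-level import names for the packages listed in this chapter.
required = [
    "scanpy", "anndata", "pandas", "numpy", "matplotlib",
    "seaborn", "sklearn", "umap", "leidenalg",
]
missing = missing_packages(required)
print("Missing packages:", missing or "none")
```

`find_spec` only looks the module up and does not import it, so the check stays fast even for heavy packages.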

## Move data into local repository


To ensure reproducibility and simplify access across users and machines, we mirror the relevant input data directories into the local repository using **symbolic links** (symlinks). This allows us to reference raw data in a standardized location without duplicating large files or modifying the originals.

We can use the following commands to do so:

### Core Imports

In [2]:
import os
import scanpy as sc
import pandas as pd
import subprocess
from pathlib import Path

### Create Symlinks to Data (avoid copying very large files)

In [None]:
ORIGINAL_DATA_ROOT_DIR = Path(
    "/data1/collab002/sail/projects/ongoing/benchmarks/benchmark_singulator_levicell/data/read_only"
)
LOCAL_DATA_ROOT_DIR = Path("./data/read_only")

# Create local directory if it doesn't exist
LOCAL_DATA_ROOT_DIR.mkdir(parents=True, exist_ok=True)

# Create symlinks for each MB-4027_* subdirectory
for full_path in ORIGINAL_DATA_ROOT_DIR.glob("MB-4027_*"):
    if full_path.is_dir():
        sample_name = full_path.name
        symlink_path = LOCAL_DATA_ROOT_DIR / sample_name

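        # is_symlink() also catches dangling links, which exists() reports as missing
        if symlink_path.is_symlink() or symlink_path.exists():
            print(f"Skipping existing: {symlink_path}")
        else:
            symlink_path.symlink_to(full_path)
            print(f"Created symlink: {symlink_path} -> {full_path}")

# Remove write permission from the local directory. By default `chmod -R` does not
# follow symlinks met during traversal, so the linked originals are left untouched.
subprocess.run(["chmod", "-R", "a-w", str(LOCAL_DATA_ROOT_DIR)], check=True)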
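
A symlink can exist while its target is missing, for example if the source directory is moved. The sketch below reports such dangling links. `broken_symlinks` is a hypothetical helper, and the demo builds its own temporary directories instead of touching `./data/read_only`.

```python
import tempfile
from pathlib import Path


def broken_symlinks(directory: Path) -> list[str]:
    """Return names of symlinks in `directory` whose targets do not exist."""
    return sorted(
        p.name for p in directory.iterdir()
        # exists() follows the link, so it is False for a dangling symlink
        if p.is_symlink() and not p.exists()
    )


with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    target = tmp_path / "source" / "MB-4027_SF_N"
    target.mkdir(parents=True)
    links = tmp_path / "read_only"
    links.mkdir()
    # One valid link and one link to a path that was never created
    (links / "MB-4027_SF_N").symlink_to(target)
    (links / "MB-4027_SL_N").symlink_to(tmp_path / "source" / "missing")
    dangling = broken_symlinks(links)

print(dangling)
```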

## Create metadata file

To record which files we have, we create a `metadata.tsv` file in the `data` directory, next to `read_only`. This helps future users identify the data we are working with.

In [None]:
import pandas as pd
import os

DATA_DIR = "./data"

# Define metadata
data = {
    "sample_id": [
        "MB-4027_SF_LN",
        "MB-4027_SF_N",
        "MB-4027_SF_T",
        "MB-4027_SL_LN",
        "MB-4027_SL_N",
        "MB-4027_SL_T",
    ],
    "tissue": [
        "Normal Liver",
        "Normal Colon",
        "Tumor Colon",
        "Normal Liver",
        "Normal Colon",
        "Tumor Colon",
    ],
    "protocol": [
        "Singulator+FACS",
        "Singulator+FACS",
        "Singulator+FACS",
        "Singulator+LeviCell",
        "Singulator+LeviCell",
        "Singulator+LeviCell",
    ],
}


# Define output file path
metadata_output_file_path = os.path.join(DATA_DIR, "metadata.tsv")

# Add a description to the metadata file as '#'-prefixed comment lines
with open(metadata_output_file_path, "w") as f:
    f.write("# This file contains metadata for the benchmark dataset.\n")
    f.write(
        "# Each row corresponds to a sample with its tissue type and protocol used.\n"
    )
    f.write("# The data is organized under the 'data/read_only' directory.\n")
    f.write("# The symlinks point to the original data directories.\n")

# Create DataFrame
metadata_df = pd.DataFrame(data)

# Save to TSV
metadata_df.to_csv(metadata_output_file_path, sep="\t", mode="a", index=False)
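
Because the file starts with `#` comment lines, a reader has to skip them before parsing the table. The sketch below does this with the standard library on an in-memory stand-in for `metadata.tsv`; `read_commented_tsv` is a hypothetical helper. With pandas, `pd.read_csv(path, sep="\t", comment="#")` should give the same rows.

```python
import csv
import io

# Stand-in for the contents of data/metadata.tsv (two of the six rows shown).
metadata_text = (
    "# This file contains metadata for the benchmark dataset.\n"
    "sample_id\ttissue\tprotocol\n"
    "MB-4027_SF_N\tNormal Colon\tSingulator+FACS\n"
    "MB-4027_SL_N\tNormal Colon\tSingulator+LeviCell\n"
)


def read_commented_tsv(handle) -> list[dict]:
    """Parse a TSV while skipping blank lines and lines starting with '#'."""
    data_lines = (
        line for line in handle
        if line.strip() and not line.startswith("#")
    )
    return list(csv.DictReader(data_lines, delimiter="\t"))


rows = read_commented_tsv(io.StringIO(metadata_text))
print(rows[0]["sample_id"], rows[0]["protocol"])
```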

### Read in AnnDatas and Move to Data Directory

In [3]:
# File and directory paths
DATA_DIR = "./data"
READ_ONLY_DIR = os.path.join(DATA_DIR, "read_only")

ANALYSIS_DIR = os.path.join(DATA_DIR, "analysis")

FILTERED_FEATURE_BC_MATRIX_FILE_NAME = "filtered_feature_bc_matrix.h5"
OUTPUT_ADATA_DIR = os.path.join(ANALYSIS_DIR, "adatas", "adatas_X_filtered_cells_raw")
MOLECULE_INFO_FILE_NAME = "molecule_info.h5"

# Make sure output directories exist
os.makedirs(OUTPUT_ADATA_DIR, exist_ok=True)

# Sample metadata - in data/metadata.tsv
samples = {
    "SF_N": ("MB-4027_SF_N", "Normal Colon", "Singulator+FACS"),
    "SL_N": ("MB-4027_SL_N", "Normal Colon", "Singulator+LeviCell"),
    "SF_T": ("MB-4027_SF_T", "Tumor Colon", "Singulator+FACS"),
    "SL_T": ("MB-4027_SL_T", "Tumor Colon", "Singulator+LeviCell"),
    "SF_LN": ("MB-4027_SF_LN", "Normal Liver", "Singulator+FACS"),
    "SL_LN": ("MB-4027_SL_LN", "Normal Liver", "Singulator+LeviCell"),
}

# Color palette for plotting
protocol_color_palette = {
    "Singulator+FACS": "#AEC6CF",
    "Singulator+LeviCell": "#FFDAB9",
}
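
Since the `samples` dictionary repeats what is already in `metadata.tsv`, a small consistency check can catch typos. The sketch below assumes two conventions that hold for the entries above: each key is the folder name without the `MB-4027_` prefix, and every tissue is profiled with both protocols. The helper names are hypothetical.

```python
from collections import defaultdict

samples = {
    "SF_N": ("MB-4027_SF_N", "Normal Colon", "Singulator+FACS"),
    "SL_N": ("MB-4027_SL_N", "Normal Colon", "Singulator+LeviCell"),
    "SF_T": ("MB-4027_SF_T", "Tumor Colon", "Singulator+FACS"),
    "SL_T": ("MB-4027_SL_T", "Tumor Colon", "Singulator+LeviCell"),
    "SF_LN": ("MB-4027_SF_LN", "Normal Liver", "Singulator+FACS"),
    "SL_LN": ("MB-4027_SL_LN", "Normal Liver", "Singulator+LeviCell"),
}


def keys_match_folders(samples: dict, prefix: str = "MB-4027_") -> bool:
    """Check that every folder name equals the prefix followed by its key."""
    return all(folder == prefix + key for key, (folder, _, _) in samples.items())


def protocols_by_tissue(samples: dict) -> dict:
    """Group the protocols observed for each tissue."""
    grouped = defaultdict(set)
    for _folder, tissue, protocol in samples.values():
        grouped[tissue].add(protocol)
    return dict(grouped)


expected_protocols = {"Singulator+FACS", "Singulator+LeviCell"}
unbalanced = [
    tissue for tissue, protocols in protocols_by_tissue(samples).items()
    if protocols != expected_protocols
]
print("Keys match folders:", keys_match_folders(samples))
print("Tissues missing a protocol:", unbalanced or "none")
```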

In [4]:
# Load AnnData objects
adatas = {}
adata_metadata = {}

for key, (folder, tissue, protocol) in samples.items():
    file_path = os.path.join(
        READ_ONLY_DIR,
        folder,
        f"analyses/CellRangerGex-9.0.0-{folder}/outputs",
        FILTERED_FEATURE_BC_MATRIX_FILE_NAME,
    )
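    # read_10x_h5 warns when gene symbols repeat ("Variable names are not unique");
    # the duplicates are resolved later with var_names_make_unique().
    adata = sc.read_10x_h5(file_path)
    adatas[key] = adata
    adata_metadata[key] = (tissue, protocol)
    print(f"{key}: {adata}")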

  utils.warn_names_duplicates("var")

SF_N: AnnData object with n_obs × n_vars = 7179 × 38606
    var: 'gene_ids', 'feature_types', 'genome'
SL_N: AnnData object with n_obs × n_vars = 7929 × 38606
    var: 'gene_ids', 'feature_types', 'genome'
SF_T: AnnData object with n_obs × n_vars = 7146 × 38606
    var: 'gene_ids', 'feature_types', 'genome'
SL_T: AnnData object with n_obs × n_vars = 8593 × 38606
    var: 'gene_ids', 'feature_types', 'genome'
SF_LN: AnnData object with n_obs × n_vars = 5165 × 38606
    var: 'gene_ids', 'feature_types', 'genome'
SL_LN: AnnData object with n_obs × n_vars = 8836 × 38606
    var: 'gene_ids', 'feature_types', 'genome'
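
The loading loop assumes that every sample folder contains a Cell Ranger output at the same relative path. A sketch of a pre-flight check that reports missing inputs before any file is read is given below. `matrix_path` and `missing_inputs` are hypothetical helpers, and the demo uses a temporary directory in place of `./data/read_only`.

```python
import tempfile
from pathlib import Path


def matrix_path(read_only_dir, folder, file_name="filtered_feature_bc_matrix.h5") -> Path:
    """Build the Cell Ranger output path used in this chapter for one sample."""
    return (
        Path(read_only_dir) / folder / "analyses"
        / f"CellRangerGex-9.0.0-{folder}" / "outputs" / file_name
    )


def missing_inputs(read_only_dir, folders) -> list[str]:
    """Return the sample folders whose filtered matrix file is absent."""
    return [f for f in folders if not matrix_path(read_only_dir, f).is_file()]


with tempfile.TemporaryDirectory() as tmp:
    # Create the expected file for one sample only
    present = matrix_path(tmp, "MB-4027_SF_N")
    present.parent.mkdir(parents=True)
    present.write_bytes(b"")
    absent = missing_inputs(tmp, ["MB-4027_SF_N", "MB-4027_SL_N"])

print(absent)
```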


### Compute QC Metrics

In [5]:
def run_qc_metrics(adata) -> None:
    """
    Compute standard quality control (QC) metrics and drop low-count barcodes.

    Args:
        adata (AnnData): Annotated data matrix of shape n_obs x n_vars.

    Returns:
        None: Modifies the input AnnData object in-place by making names unique,
        adding QC metrics to `adata.obs` and `adata.var`, and removing cells
        with fewer than 500 total counts.
    """

    # Resolve duplicate gene symbols and barcodes before anything else
    adata.var_names_make_unique()
    adata.obs_names_make_unique()

    sc.pp.calculate_qc_metrics(
        adata,
        inplace=True,
        percent_top=None,
    )

    # Keep only cells with at least 500 total counts (adds `n_counts` to `adata.obs`)
    sc.pp.filter_cells(adata, min_counts=500)


# Run QC metrics on each dataset
for adata in adatas.values():
    run_qc_metrics(adata)
    print(adata)

AnnData object with n_obs × n_vars = 7179 × 38606
    obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'n_counts'
    var: 'gene_ids', 'feature_types', 'genome', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'
AnnData object with n_obs × n_vars = 7929 × 38606
    obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'n_counts'
    var: 'gene_ids', 'feature_types', 'genome', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'
AnnData object with n_obs × n_vars = 7146 × 38606
    obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'n_counts'
    var: 'gene_ids', 'feature_types', 'genome', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'
AnnData object with n_obs × n_v
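
The `min_counts=500` filter keeps a barcode when its total count is at least 500. In the output above, the first three samples keep the same number of cells as at load time, so no barcode in them fell below the threshold. The rule itself can be sketched in plain Python; the totals below are made up for illustration and `passes_min_counts` is a hypothetical helper.

```python
def passes_min_counts(total_counts, min_counts=500):
    """Return a boolean mask: True where a cell has at least `min_counts` counts."""
    return [total >= min_counts for total in total_counts]


# Hypothetical per-cell total counts, not taken from the datasets
totals = [120, 500, 499, 8300]
mask = passes_min_counts(totals)
kept = sum(mask)
print(mask, f"kept {kept} of {len(totals)} cells")
```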

### Save AnnDatas to File

In [6]:
# Save each AnnData object to file in the analysis directory
for key, adata in adatas.items():
    out_path = os.path.join(OUTPUT_ADATA_DIR, f"{key}_adata.h5ad")
    adata.write(out_path)
    print(f"Saved {key}_adata to {out_path}")

Saved SF_N_adata to ./data/analysis/adatas/adatas_X_filtered_cells_raw/SF_N_adata.h5ad
Saved SL_N_adata to ./data/analysis/adatas/adatas_X_filtered_cells_raw/SL_N_adata.h5ad
Saved SF_T_adata to ./data/analysis/adatas/adatas_X_filtered_cells_raw/SF_T_adata.h5ad
Saved SL_T_adata to ./data/analysis/adatas/adatas_X_filtered_cells_raw/SL_T_adata.h5ad
Saved SF_LN_adata to ./data/analysis/adatas/adatas_X_filtered_cells_raw/SF_LN_adata.h5ad
Saved SL_LN_adata to ./data/analysis/adatas/adatas_X_filtered_cells_raw/SL_LN_adata.h5ad
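
As a basic integrity check on the saved files, one option is to record a checksum per file so that later chapters can detect accidental changes. This is a sketch of that idea and not part of the original workflow. `sha256_of_file` and `build_manifest` are hypothetical helpers, and the demo writes placeholder files to a temporary directory instead of reading real `.h5ad` files.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files are never fully loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory: Path, pattern: str = "*.h5ad") -> dict:
    """Map each matching file name to its SHA-256 hex digest."""
    return {p.name: sha256_of_file(p) for p in sorted(directory.glob(pattern))}


with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)
    # Placeholder contents standing in for saved AnnData files
    (out_dir / "SF_N_adata.h5ad").write_bytes(b"placeholder A")
    (out_dir / "SL_N_adata.h5ad").write_bytes(b"placeholder B")
    manifest = build_manifest(out_dir)

for name, checksum in manifest.items():
    print(name, checksum[:12])
```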
