# Chapter 01 — Loading and Organizing the Data

## Objective

In this chapter, we load all single-nucleus RNA-seq datasets required for benchmarking the **Singulator + FACS** and **Singulator + LeviCell** protocols. We also verify metadata consistency, perform basic integrity checks, and prepare the data objects for downstream analysis.

This includes:

- Locating the data on the shared filesystem (Iris)
- Loading raw count matrices (e.g., `filtered_feature_bc_matrix.h5` from 10X)
- Loading data into `AnnData` objects
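
The last two steps above can be sketched as follows. This is a minimal illustration, not the project's definitive loader: it assumes each `MB-4027_*` sample directory contains a `filtered_feature_bc_matrix.h5` file somewhere beneath it, and the helper names `find_count_matrices` and `load_samples` are hypothetical. `scanpy` is imported lazily so that file discovery can run even before the analysis environment is active.

```python
from pathlib import Path

# Assumption: every sample directory holds a file with this name somewhere
# beneath it; the real directory layout may differ.
MATRIX_NAME = "filtered_feature_bc_matrix.h5"


def find_count_matrices(data_root, pattern="MB-4027_*"):
    """Map each matching sample directory name to its 10X HDF5 matrix path."""
    matrices = {}
    for sample_dir in sorted(Path(data_root).glob(pattern)):
        if not sample_dir.is_dir():
            continue
        hits = sorted(sample_dir.rglob(MATRIX_NAME))
        if hits:
            # Keep the first match; samples without a matrix are skipped
            matrices[sample_dir.name] = hits[0]
    return matrices


def load_samples(data_root):
    """Load every discovered matrix into its own AnnData object."""
    import scanpy as sc  # lazy import: discovery above needs only the stdlib

    adatas = {}
    for sample_id, h5_path in find_count_matrices(data_root).items():
        adata = sc.read_10x_h5(h5_path)
        adata.var_names_make_unique()  # 10X gene symbols can repeat
        adata.obs["sample_id"] = sample_id
        adatas[sample_id] = adata
    return adatas
```

Once the local `./data/read_only` directory described later in this chapter exists, `load_samples("./data/read_only")` would return a dictionary keyed by sample identifier.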

## Data Source

The data for this benchmarking project is stored on the **Iris** HPC filesystem under:

`/data1/collab002/sail/isabl/datalake/prod/010/collaborators/SAIL/projects/singulator_debris_removal_and/experiments`

We work with the samples whose identifiers match `MB-4027_*`.


## Create Conda Environment for Analysis

To ensure reproducibility and consistent dependency management, we use a Conda environment for all analyses in this benchmarking project.

The environment provides the key packages for single-cell RNA-seq data handling, visualization, and benchmarking:

- `scanpy`
- `anndata`
- `pandas`, `numpy`, `matplotlib`, `seaborn`
- `jupyterlab` and `notebook`
- `scikit-learn`, `umap-learn`, `leidenalg`

### Step 1: Create the Environment

Create the environment with `mamba` (recommended) or `conda`. The `env.yml` file in this directory lists all prerequisites for the analyses in this book. Run one of the commands below:


In [None]:
%%bash

# Recommended
mamba env create -f env.yml

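# Or, if mamba is not available, use conda instead (run only one of the two;
# the second call fails once the environment already exists)
# conda env create -f env.yml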

You can now run Python with the `benchmarking_template` ipykernel. Select this kernel from the menu in the top right of the notebook.
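
As a quick sanity check that the selected kernel sees the packages listed earlier, something like the sketch below can be run in a Python cell. The helper name `missing_modules` is hypothetical, and the module list is derived from the package list in this chapter rather than from `env.yml` itself.

```python
from importlib.util import find_spec

# Import names differ from the package names for some entries
# (scikit-learn -> sklearn, umap-learn -> umap).
REQUIRED_MODULES = [
    "scanpy", "anndata", "pandas", "numpy", "matplotlib",
    "seaborn", "sklearn", "umap", "leidenalg",
]


def missing_modules(modules=REQUIRED_MODULES):
    """Return the modules that the current kernel cannot locate."""
    return [name for name in modules if find_spec(name) is None]


missing = missing_modules()
print("Missing modules:", missing if missing else "none")
```

If anything is reported as missing, the most likely cause is that the notebook is attached to a different kernel than the one created from `env.yml`.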

## Move data into local repository


To ensure reproducibility and simplify access across users and machines, we mirror the relevant input data directories into the local repository using **symbolic links** (symlinks). This allows us to reference raw data in a standardized location without duplicating large files or modifying the originals.

We can use the following commands to do so:

### Core Imports

In [10]:
import os
import pandas as pd
import subprocess
from pathlib import Path

### Create Symlinks to Data (avoid copying very large files)

In [None]:
ORIGINAL_DATA_ROOT_DIR = Path("/data1/collab002/sail/projects/ongoing/benchmarks/benchmark_singulator_levicell/data/read_only")
LOCAL_DATA_ROOT_DIR = Path("./data/read_only")

# Create local directory if it doesn't exist
LOCAL_DATA_ROOT_DIR.mkdir(parents=True, exist_ok=True)

# Create symlinks for each MB-4027_* subdirectory
for full_path in ORIGINAL_DATA_ROOT_DIR.glob("MB-4027_*"):
    if full_path.is_dir():
        sample_name = full_path.name
        symlink_path = LOCAL_DATA_ROOT_DIR / sample_name

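        # exists() follows links, so also check is_symlink() to catch broken links
        if symlink_path.exists() or symlink_path.is_symlink():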
            print(f"Skipping existing: {symlink_path}")
        else:
            symlink_path.symlink_to(full_path)
            print(f"Created symlink: {symlink_path} -> {full_path}")

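# Remove write permission from the local directory. During recursive traversal,
# chmod skips symlinks by default, so the original data directories are untouched.
subprocess.run(["chmod", "-R", "a-w", str(LOCAL_DATA_ROOT_DIR)], check=True)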
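
Because symlinks can silently break when the source directory is moved or unmounted, it may be worth confirming that each link still resolves. The sketch below is one way to do that; `find_broken_symlinks` is a hypothetical helper, not part of the project code.

```python
from pathlib import Path


def find_broken_symlinks(local_root):
    """Return symlinks directly under local_root whose targets are missing."""
    broken = []
    for entry in sorted(Path(local_root).iterdir()):
        # is_symlink() inspects the link itself; exists() follows it
        if entry.is_symlink() and not entry.exists():
            broken.append(entry)
    return broken


local_root = Path("./data/read_only")
if local_root.is_dir():
    for link in find_broken_symlinks(local_root):
        print(f"Broken symlink: {link}")
```

An empty result means every link under `./data/read_only` currently points at an existing target.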


## Create metadata file

To document which files we have, we create a `metadata.tsv` file in the `data` directory, next to `read_only`. This helps future users identify the data we are working with.

In [18]:
import pandas as pd
import os

DATA_DIR = "./data"

# Define metadata
data = {
    "sample_id": [
        "MB-4027_SF_LN",
        "MB-4027_SF_N",
        "MB-4027_SF_T",
        "MB-4027_SL_LN",
        "MB-4027_SL_N",
        "MB-4027_SL_T",
    ],
    "tissue": [
        "Normal Liver",
        "Normal Colon",
        "Tumor Colon",
        "Normal Liver",
        "Normal Colon",
        "Tumor Colon",
    ],
    "protocol": [
        "Singulator+FACS",
        "Singulator+FACS",
        "Singulator+FACS",
        "Singulator+LeviCell",
        "Singulator+LeviCell",
        "Singulator+LeviCell",
    ]
}


# Define output file path
metadata_output_file_path = os.path.join(DATA_DIR, "metadata.tsv")

# Add description to the metadata file
with open(metadata_output_file_path, "w") as f:
    f.write("# This file contains metadata for the benchmark dataset.\n")
    f.write("# Each row corresponds to a sample with its tissue type and protocol used.\n")
    f.write("# The data is organized under the 'data/read_only' directory.\n")
    f.write("# The symlinks point to the original data directories.\n")

# Create DataFrame
metadata_df = pd.DataFrame(data)

# Append the table below the description lines
metadata_df.to_csv(metadata_output_file_path, sep="\t", mode="a", index=False)
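
Since the file starts with `#` description lines, reading it back requires telling pandas to skip them. The sketch below reads the table and compares its `sample_id` column against the directories actually present on disk. The helper name `compare_metadata_to_disk` is hypothetical, and the check assumes sample directories are named exactly after their `sample_id`.

```python
from pathlib import Path

import pandas as pd


def compare_metadata_to_disk(metadata_path, data_root):
    """Return (missing_on_disk, missing_in_metadata) as sorted lists."""
    # comment="#" skips the description lines written above the header
    metadata = pd.read_csv(metadata_path, sep="\t", comment="#")
    listed = set(metadata["sample_id"])
    on_disk = {p.name for p in Path(data_root).glob("MB-4027_*") if p.is_dir()}
    return sorted(listed - on_disk), sorted(on_disk - listed)
```

Calling `compare_metadata_to_disk("./data/metadata.tsv", "./data/read_only")` should return two empty lists when the metadata and the symlinked directories agree.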