# Chapter 01 — Loading and Organizing the Data

### Objective

In this chapter, we load all single-cell or single-nucleus RNA-seq datasets required for benchmarking the protocols under comparison.

This includes:

- Locating the data on the shared filesystem (Iris)
- Loading raw count matrices (e.g., `filtered_feature_bc_matrix.h5` from 10X)
- Loading data into `AnnData` objects and saving them to a local data directory

<br>


### Data Source

The data for this benchmarking project is stored on the **Iris** HPC filesystem under:

> **FILL IN HERE**  
> Include where the data is located for downstream analysis

**EXAMPLE (DELETE THIS SECTION BEFORE FINALIZING):**  
In this example, our data is located at:
`/data1/collab002/sail/isabl/datalake/prod/010/collaborators/SAIL/projects/singulator_debris_removal_and/experiments`
We will be working with the data under identifier `MB-4027_*`
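
As a quick way to confirm which sample directories match the identifier, the following minimal sketch lists them with `pathlib`. The helper name `find_sample_dirs` and the placeholder path in the usage comment are hypothetical; substitute the real experiments path and identifier.

```python
from pathlib import Path


def find_sample_dirs(experiments_root, identifier_prefix):
    """Return sorted sub-directories of `experiments_root` whose names start with the prefix."""
    root = Path(experiments_root)
    return sorted(p for p in root.glob(f"{identifier_prefix}*") if p.is_dir())


# Hypothetical usage; replace the path with the real experiments directory.
# for sample_dir in find_sample_dirs("/path/to/experiments", "MB-4027_"):
#     print(sample_dir.name)
```

Only directories are returned, so stray files that happen to share the prefix are ignored.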


<br>


### Create Conda Environment for Analysis

To ensure reproducibility and consistent dependency management, we use a Conda environment for all analyses in this benchmarking project.

The environment provides the key packages for single-cell RNA-seq data handling, visualization, and benchmarking:

- `scanpy`
- `pandas`, `numpy`, `matplotlib`, `seaborn`
- `ipykernel`
  
An `env.yml` file is included in this directory with all required dependencies.

#### Step 1: Create the Environment

You can create the environment using `mamba` (recommended) or `conda`. Run one of the commands below, not both:



In [None]:
%%bash

# Recommended (faster)
mamba env create -f env.yml

# Alternatively, use conda (uncomment if mamba is unavailable)
# conda env create -f env.yml


#### Step 2: Select the Kernel

Once the environment is created, start JupyterLab or Jupyter Notebook and select the appropriate kernel (e.g., `benchmarking_template`) from the **top-right kernel selector**.

If the kernel does not appear, run the following in your terminal to register it:

In [13]:
%%bash

python -m ipykernel install --user --name=benchmarking_template

Installed kernelspec benchmarking_template in /home/ghoshr/.local/share/jupyter/kernels/benchmarking_template
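
After selecting the kernel, it can be worth confirming that the notebook is really running inside the new environment. The sketch below prints the active interpreter and reports which of the packages listed earlier cannot be found. The helper `missing_packages` is a hypothetical name, and the package list simply mirrors the bullet list above rather than the actual contents of `env.yml`.

```python
import importlib.util
import sys

# Packages named in this chapter as part of the environment.
REQUIRED_PACKAGES = ["scanpy", "pandas", "numpy", "matplotlib", "seaborn", "ipykernel"]


def missing_packages(names):
    """Return the subset of `names` that the current interpreter cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print(f"Interpreter: {sys.executable}")
missing = missing_packages(REQUIRED_PACKAGES)
print("Missing packages:", missing if missing else "none")
```

If the interpreter path does not point into the Conda environment, the wrong kernel is probably selected.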


#### Step 3: Update config.py

Within `./utils`, there is a file called `config.py` that we use to configure file paths and sample metadata. Be sure to update this file before running any downstream analysis.
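
For orientation, here is a hedged sketch of what `config.py` might contain. The variable names are the ones this chapter reads from `config`; every value is a placeholder or an assumption. In particular, `PROJECT_ROOT` is a hypothetical helper, the directory layout is a guess, and the `tissue` and `protocol` entries must be filled in for your own samples.

```python
from pathlib import Path

# Hypothetical project root; the real file may instead resolve paths
# relative to its own location.
PROJECT_ROOT = Path(".").resolve()

# Where the original data lives on the shared filesystem (placeholder).
ORIGINAL_DATA_ROOT_DIR = Path("/path/to/original/experiments")

# Local working directories (assumed layout).
DATA_DIR = PROJECT_ROOT / "data"
ADATA_DIR = DATA_DIR / "adatas"

# Sample directories are named "<SAMPLE_IDENTIFIER>_<key>", e.g. MB-4027_SL_T.
SAMPLE_IDENTIFIER = "MB-4027"

# One entry per sample key; tissue and protocol values are placeholders.
SAMPLES_METADATA = {
    "SF_N": {"tissue": "<tissue>", "protocol": "<protocol>"},
    "SL_N": {"tissue": "<tissue>", "protocol": "<protocol>"},
    "SF_T": {"tissue": "<tissue>", "protocol": "<protocol>"},
    "SL_T": {"tissue": "<tissue>", "protocol": "<protocol>"},
    "SF_LN": {"tissue": "<tissue>", "protocol": "<protocol>"},
    "SL_LN": {"tissue": "<tissue>", "protocol": "<protocol>"},
}

# Name of the Cell Ranger matrix file to load for each sample.
CELL_RANGER_FILTERED_FEATURE_MATRIX_FILE_NAME = "filtered_feature_bc_matrix.h5"
```

Keeping paths as `pathlib.Path` objects is what allows expressions such as `config.DATA_DIR / "read_only"` later in this chapter.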


#### Step 4: Copy data from Original Source to Local Data Directory (optional)

To avoid modifying the original data, we can use the following code to copy it to a local data directory. If the files are very large, you can instead modify `config.py` to read from the original source. Alternatively, you can create symlinks to the data using `ln -s`, but be aware that modifications made through the links in the local directory still affect the original files.
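
If you prefer the symlink route, the following sketch shows one way to do it from Python rather than the shell. The function `link_sample_dirs` is a hypothetical helper, not part of this repository, and it assumes a POSIX filesystem. It links whole sample directories and applies none of the `raw_data` filtering used by the copy step.

```python
import os
from pathlib import Path


def link_sample_dirs(src_root, dst_root, identifier):
    """Create one symlink in `dst_root` per directory in `src_root` starting with `identifier`."""
    src_root, dst_root = Path(src_root), Path(dst_root)
    dst_root.mkdir(parents=True, exist_ok=True)
    created = []
    for src in sorted(src_root.iterdir()):
        if not (src.is_dir() and src.name.startswith(identifier)):
            continue
        link = dst_root / src.name
        # Leave anything that already exists (including broken links) untouched.
        if link.exists() or link.is_symlink():
            continue
        # Equivalent to `ln -s <src> <link>`. The link points at the original,
        # so writes made through it modify the source files.
        os.symlink(src.resolve(), link, target_is_directory=True)
        created.append(link)
    return created
```

Because the links resolve to the original files, treat the linked directory as read-only by convention.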

##### Run Autoreload (Automatically Reloads Edited Module Files)

In [2]:
%load_ext autoreload
%autoreload 2

##### Imports

In [None]:
import os
import subprocess

import scanpy as sc

import utils.config as config

##### Copy Code Snippet

We're going to keep these files in a "read only" directory to ensure that we aren't modifying or deleting files we're going to need for downstream analysis.

In [None]:
src_dir = config.ORIGINAL_DATA_ROOT_DIR
dst_dir = config.DATA_DIR / "read_only"
sample_identifier = config.SAMPLE_IDENTIFIER

# Ensure the destination exists
os.makedirs(dst_dir, exist_ok=True)

def should_copy(src_path):
    """Determine if a path should be copied."""
    rel_path = os.path.relpath(src_path, src_dir)
    parts = rel_path.split(os.sep)

    # Only copy directories starting with the sample identifier
    if not parts[0].startswith(sample_identifier):
        return False

    # Exclude any raw_data subdirectory
    if "raw_data" in parts:
        return False

    return True

for root, dirs, files in os.walk(src_dir, followlinks=True):
    # Prune directories that fail should_copy (other samples, raw_data)
    dirs[:] = [d for d in dirs if should_copy(os.path.join(root, d))]

    for file in files:
        src_path = os.path.join(root, file)
        if should_copy(src_path):
            rel_path = os.path.relpath(src_path, src_dir)
            dst_path = os.path.join(dst_dir, rel_path)
            # Skip files copied by an earlier run; they are read-only, so cp would fail
            if os.path.exists(dst_path):
                continue
            os.makedirs(os.path.dirname(dst_path), exist_ok=True)
            subprocess.run(
                ["cp", "-p", src_path, dst_path],
                check=True
            )
            print(f"Copied: {rel_path}")
            # Set read-only permissions
            subprocess.run(["chmod", "444", dst_path], check=True)


Copied: MB-4027_SL_T/analyses/CellRangerGex-9.0.0-MB-4027_SL_T/README.pdf
Copied: MB-4027_SL_T/analyses/CellRangerGex-9.0.0-MB-4027_SL_T/outputs/metrics_summary.csv
Copied: MB-4027_SL_T/analyses/CellRangerGex-9.0.0-MB-4027_SL_T/outputs/filtered_feature_bc_matrix.h5
Copied: MB-4027_SL_T/analyses/CellRangerGex-9.0.0-MB-4027_SL_T/outputs/analysis.tgz
Copied: MB-4027_SL_T/analyses/CellRangerGex-9.0.0-MB-4027_SL_T/outputs/possorted_genome_bam.bam
Copied: MB-4027_SL_T/analyses/CellRangerGex-9.0.0-MB-4027_SL_T/outputs/molecule_info.h5
Copied: MB-4027_SL_T/analyses/CellRangerGex-9.0.0-MB-4027_SL_T/outputs/possorted_genome_bam.bam.bai
Copied: MB-4027_SL_T/analyses/CellRangerGex-9.0.0-MB-4027_SL_T/outputs/web_summary.html
Copied: MB-4027_SL_T/analyses/CellRangerGex-9.0.0-MB-4027_SL_T/outputs/cloupe.cloupe
Copied: MB-4027_SL_T/analyses/CellRangerGex-9.0.0-MB-4027_SL_T/outputs/paths.json
Copied: MB-4027_SL_T/analyses/CellRangerGex-9.0.0-MB-4027_SL_T/outputs/raw_feature_bc_matrix.h5
Copied: MB-4027
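
Since the copy step sets each copied file to mode `444`, a quick check can confirm that nothing in the local directory is still writable. This is a minimal sketch; `writable_files` is a hypothetical helper and it inspects permission bits only, not filesystem ACLs.

```python
import os
import stat


def writable_files(root):
    """Return paths under `root` whose permission bits allow writing by user, group, or other."""
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    offenders = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if stat.S_IMODE(os.stat(path).st_mode) & write_bits:
                offenders.append(path)
    return sorted(offenders)


# Hypothetical usage with the destination directory from the copy step:
# print(writable_files(dst_dir))  # an empty list means every file is read-only
```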

### Read in AnnDatas and Move to Data Directory

Rather than modifying the original data directory, we read the count matrices into `AnnData` objects and save them in a local directory for further analysis.

##### File Paths

In [None]:
# File and directory paths
adata_dir = config.ADATA_DIR
output_adata_dir = os.path.join(adata_dir, "adatas_X_filtered_cells_raw")

# Make sure output directories exist
os.makedirs(output_adata_dir, exist_ok=True)
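
Before loading, it can help to check that every expected matrix file is actually present, so that a missing sample fails early with a clear message. The sketch below rebuilds the path layout used by the loading code in this chapter (including the `CellRangerGex-9.0.0` directory name, which may differ for other Cell Ranger versions). The helper names are hypothetical.

```python
from pathlib import Path


def expected_matrix_path(data_dir, sample_identifier, key, file_name):
    """Build the expected path of one sample's Cell Ranger matrix file."""
    folder = f"{sample_identifier}_{key}"
    return (
        Path(data_dir)
        / folder
        / "analyses"
        / f"CellRangerGex-9.0.0-{folder}"
        / "outputs"
        / file_name
    )


def find_missing_inputs(data_dir, sample_identifier, keys, file_name):
    """Return {key: path} for every sample whose matrix file does not exist."""
    paths = {k: expected_matrix_path(data_dir, sample_identifier, k, file_name) for k in keys}
    return {k: p for k, p in paths.items() if not p.is_file()}


# Hypothetical usage with the values from config.py:
# missing = find_missing_inputs(
#     config.DATA_DIR / "read_only",
#     config.SAMPLE_IDENTIFIER,
#     config.SAMPLES_METADATA.keys(),
#     config.CELL_RANGER_FILTERED_FEATURE_MATRIX_FILE_NAME,
# )
```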

In [None]:
# Load AnnData objects
adata_dict = {}
adata_metadata = {}

sample_metadata = config.SAMPLES_METADATA
sample_identifier = config.SAMPLE_IDENTIFIER
input_data_dir = config.DATA_DIR / "read_only"
file_name = config.CELL_RANGER_FILTERED_FEATURE_MATRIX_FILE_NAME

for key, sample_info in sample_metadata.items():
    tissue = sample_info["tissue"]
    protocol = sample_info["protocol"]

    folder = f"{sample_identifier}_{key}"

    file_path = os.path.join(
        input_data_dir,
        folder,
        f"analyses/CellRangerGex-9.0.0-{folder}/outputs",
        file_name,
    )

    adata = sc.read_10x_h5(file_path)
    adata_dict[key] = adata
    print(f"{key}: {adata}")

  utils.warn_names_duplicates("var")

SF_N: AnnData object with n_obs × n_vars = 7179 × 38606
    var: 'gene_ids', 'feature_types', 'genome'
SL_N: AnnData object with n_obs × n_vars = 7929 × 38606
    var: 'gene_ids', 'feature_types', 'genome'
SF_T: AnnData object with n_obs × n_vars = 7146 × 38606
    var: 'gene_ids', 'feature_types', 'genome'
SL_T: AnnData object with n_obs × n_vars = 8593 × 38606
    var: 'gene_ids', 'feature_types', 'genome'
SF_LN: AnnData object with n_obs × n_vars = 5165 × 38606
    var: 'gene_ids', 'feature_types', 'genome'
SL_LN: AnnData object with n_obs × n_vars = 8836 × 38606
    var: 'gene_ids', 'feature_types', 'genome'
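
The `warn_names_duplicates("var")` lines in the output above appear to come from a warning about non-unique variable (gene) names. To see which names are affected before they are made unique, a small standard-library sketch like the following can be used. The helper `duplicated_names` and the toy gene names are hypothetical.

```python
from collections import Counter


def duplicated_names(names):
    """Return each name that occurs more than once, mapped to its number of occurrences."""
    counts = Counter(names)
    return {name: n for name, n in counts.items() if n > 1}


# Toy example; with a loaded object this would be `duplicated_names(adata.var_names)`.
toy_names = ["GENE_A", "GENE_B", "GENE_A", "GENE_C", "GENE_B", "GENE_A"]
print(duplicated_names(toy_names))  # {'GENE_A': 3, 'GENE_B': 2}
```

The `gene_ids` column in `adata.var` remains available as an unambiguous identifier even after the names are made unique.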


### Compute QC Metrics

We compute Scanpy's basic QC metrics here, before any downstream analysis.

In [None]:
def run_qc_metrics(adata) -> None:
    """
    Compute standard quality control (QC) metrics on an AnnData object.

    Args:
        adata (AnnData): Annotated data matrix of shape n_obs x n_vars.

    Returns:
        None: Modifies the input AnnData object in-place by adding QC metrics to `adata.obs` and `adata.var`.
    """

    # Make names unique first so later lookups by gene or barcode are unambiguous
    adata.var_names_make_unique()
    adata.obs_names_make_unique()

    sc.pp.calculate_qc_metrics(
        adata,
        inplace=True,
        percent_top=None,
    )


# Run QC metrics on each dataset
for adata in adata_dict.values():
    run_qc_metrics(adata)
    print(adata)

AnnData object with n_obs × n_vars = 7179 × 38606
    obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'n_counts'
    var: 'gene_ids', 'feature_types', 'genome', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'
AnnData object with n_obs × n_vars = 7929 × 38606
    obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'n_counts'
    var: 'gene_ids', 'feature_types', 'genome', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'
AnnData object with n_obs × n_vars = 7146 × 38606
    obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'n_counts'
    var: 'gene_ids', 'feature_types', 'genome', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'
AnnData object with n_obs × n_v
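
To make the column names in the output above more concrete, here is a NumPy-only sketch of what several of them mean on a toy count matrix. It illustrates the definitions as commonly documented and is not Scanpy's implementation. The matrix values are invented for illustration.

```python
import numpy as np

# Toy count matrix: 3 cells (rows) x 4 genes (columns).
counts = np.array(
    [
        [0, 2, 0, 5],
        [1, 0, 0, 0],
        [3, 1, 0, 4],
    ]
)

# Per-cell metrics (the kind stored in `adata.obs`).
n_genes_by_counts = (counts > 0).sum(axis=1)  # genes with at least one count
total_counts = counts.sum(axis=1)             # total counts per cell
log1p_total_counts = np.log1p(total_counts)   # log(1 + total_counts)

# Per-gene metrics (the kind stored in `adata.var`).
n_cells_by_counts = (counts > 0).sum(axis=0)  # cells in which the gene is detected
mean_counts = counts.mean(axis=0)             # mean count over all cells
pct_dropout_by_counts = 100 * (1 - n_cells_by_counts / counts.shape[0])

print(n_genes_by_counts)  # [2 1 3]
print(total_counts)       # [7 1 8]
print(n_cells_by_counts)  # [2 2 0 2]
```

A gene with `pct_dropout_by_counts` of 100, such as the third column here, is detected in no cell at all.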

### Save AnnDatas to File

In [None]:
for key, sample_info in sample_metadata.items():
    tissue = sample_info["tissue"]
    protocol = sample_info["protocol"]

    out_path = os.path.join(output_adata_dir, f"{key}_adata.h5ad")
    adata_dict[key].write(out_path)
    print(f"Saved {key}_adata to {out_path}")

Saved SF_N_adata to /data1/collab002/sail/projects/ongoing/benchmarks/sail-benchmarking-template/utils/../data/adatas/adatas_X_filtered_cells_raw/SF_N_adata.h5ad
Saved SL_N_adata to /data1/collab002/sail/projects/ongoing/benchmarks/sail-benchmarking-template/utils/../data/adatas/adatas_X_filtered_cells_raw/SL_N_adata.h5ad
Saved SF_T_adata to /data1/collab002/sail/projects/ongoing/benchmarks/sail-benchmarking-template/utils/../data/adatas/adatas_X_filtered_cells_raw/SF_T_adata.h5ad
Saved SL_T_adata to /data1/collab002/sail/projects/ongoing/benchmarks/sail-benchmarking-template/utils/../data/adatas/adatas_X_filtered_cells_raw/SL_T_adata.h5ad
Saved SF_LN_adata to /data1/collab002/sail/projects/ongoing/benchmarks/sail-benchmarking-template/utils/../data/adatas/adatas_X_filtered_cells_raw/SF_LN_adata.h5ad
Saved SL_LN_adata to /data1/collab002/sail/projects/ongoing/benchmarks/sail-benchmarking-template/utils/../data/adatas/adatas_X_filtered_cells_raw/SL_LN_adata.h5ad
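
As a final sanity check, the saved files can be listed together with their sizes to confirm that one `.h5ad` file exists per sample. This is a minimal sketch using only the standard library. `list_saved_adatas` is a hypothetical helper, and the `_adata.h5ad` suffix matches the file names written above.

```python
from pathlib import Path


def list_saved_adatas(output_dir, suffix="_adata.h5ad"):
    """Map each sample key to a (path, size in bytes) tuple for its saved file."""
    saved = {}
    for path in sorted(Path(output_dir).glob(f"*{suffix}")):
        key = path.name[: -len(suffix)]
        saved[key] = (path, path.stat().st_size)
    return saved


# Hypothetical usage with the output directory defined earlier:
# for key, (path, size) in list_saved_adatas(output_adata_dir).items():
#     print(f"{key}: {size / 1e6:.1f} MB")
```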
