# Chapter 6 - CellBender Downsampling Gained Gene Analysis

## Objective


In this chapter, we load all single-nucleus RNA-seq datasets required for benchmarking the **Singulator + FACS** and **Singulator + LeviCell** protocols. We also verify metadata consistency, perform basic integrity checks, and prepare the data objects for downstream analysis.

This includes:

- Locating the data on the shared filesystem (Iris)
- Loading raw count matrices (e.g., `filtered_feature_bc_matrix.h5` from 10x Genomics)
- Loading data into `AnnData` objects
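
The loading steps listed above can be sketched roughly as follows. This is a minimal sketch, not the notebook's own loader: `find_matrix_file` and `load_sample` are hypothetical helper names, and the assumption that each sample directory holds a `filtered_feature_bc_matrix.h5` somewhere beneath it is not confirmed by this chapter. `sc.read_10x_h5` is scanpy's reader for 10x HDF5 matrices.

```python
from pathlib import Path


def find_matrix_file(sample_dir, filename="filtered_feature_bc_matrix.h5"):
    """Return the first file called `filename` found beneath `sample_dir`."""
    # Assumption: the matrix sits somewhere under the sample directory.
    matches = sorted(Path(sample_dir).rglob(filename))
    if not matches:
        raise FileNotFoundError(f"No {filename} found under {sample_dir}")
    return matches[0]


def load_sample(sample_dir):
    """Read one 10x HDF5 count matrix into an AnnData object."""
    # Imported here so the path helper above works without scanpy installed.
    import scanpy as sc

    adata = sc.read_10x_h5(find_matrix_file(sample_dir))
    adata.var_names_make_unique()  # 10x gene symbols can repeat
    return adata
```

Calling `load_sample(path_to_sample_dir)` would then return one `AnnData` object per sample; keeping the path lookup separate makes a missing file fail early with a clear message.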

## Data Source

The data for this benchmarking project is stored on the **Iris** HPC filesystem under:

`/data1/collab002/sail/isabl/datalake/prod/010/collaborators/SAIL/projects/singulator_debris_removal_and/experiments`

We work with the samples whose identifiers match `MB-4027_*`.
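
One way to enumerate the matching samples is a glob over the experiments directory. The sketch below assumes that each sample appears as a directory whose name starts with `MB-4027_` somewhere under that root; the actual datalake layout is not described here, so treat both the helper name `find_sample_dirs` and that layout as assumptions.

```python
from pathlib import Path

# Root path as given in this section; the constant name is illustrative.
EXPERIMENTS_ROOT = (
    "/data1/collab002/sail/isabl/datalake/prod/010/collaborators/SAIL/"
    "projects/singulator_debris_removal_and/experiments"
)


def find_sample_dirs(root, pattern="MB-4027_*"):
    """Map directory name -> path for each directory under `root` matching `pattern`."""
    root = Path(root)
    # rglob searches recursively; files that happen to match are skipped.
    return {p.name: p for p in sorted(root.rglob(pattern)) if p.is_dir()}
```

`find_sample_dirs(EXPERIMENTS_ROOT)` would return an empty dictionary if nothing matches, which is a quick way to check that the filesystem is mounted before any loading is attempted. A recursive search can be slow on a large shared filesystem, so a narrower root may be preferable in practice.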


## Core Imports

In [1]:
import os
import scanpy as sc
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
from typing import Callable, Dict, List
import dill

## File Paths and Metadata

In [4]:
# Constants
DATA_DIR = "./data"
READ_ONLY_DIR = os.path.join(DATA_DIR, "read_only")
FIGURES_OUTPUT_DIR = os.path.join(
    DATA_DIR, "figures", "chapter_06_cellbender_basic_analysis"
)

DOWNSAMPLING_BOOTSTRAP_RESULTS_FILE_PATH = os.path.join(
    DATA_DIR, "other", "downsampling_bootstrap_results.dill"
)
OUTPUT_ADATA_DIR = os.path.join(READ_ONLY_DIR, "cellbender_downsampling_adata")

# Make sure the output AnnData directory exists
os.makedirs(OUTPUT_ADATA_DIR, exist_ok=True)

# Sample metadata (from data/metadata.tsv): key -> (sample ID, tissue, protocol)
samples = {
    "SF_N": ("MB-4027_SF_N", "Normal Colon", "Singulator+FACS"),
    "SL_N": ("MB-4027_SL_N", "Normal Colon", "Singulator+LeviCell"),
    "SF_T": ("MB-4027_SF_T", "Tumor Colon", "Singulator+FACS"),
    "SL_T": ("MB-4027_SL_T", "Tumor Colon", "Singulator+LeviCell"),
    "SF_LN": ("MB-4027_SF_LN", "Normal Liver", "Singulator+FACS"),
    "SL_LN": ("MB-4027_SL_LN", "Normal Liver", "Singulator+LeviCell"),
}

# Color palette for plotting
protocol_color_palette = {
    "Singulator+FACS": "#AEC6CF",
    "Singulator+LeviCell": "#FFDAB9",
}
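
Because the benchmark compares two protocols on the same tissues, it can help to group the sample keys by tissue. The following is a small sketch under that assumption; `pair_by_tissue` is a hypothetical helper, and the `samples` dictionary is repeated only so the block runs on its own.

```python
from collections import defaultdict

samples = {
    "SF_N": ("MB-4027_SF_N", "Normal Colon", "Singulator+FACS"),
    "SL_N": ("MB-4027_SL_N", "Normal Colon", "Singulator+LeviCell"),
    "SF_T": ("MB-4027_SF_T", "Tumor Colon", "Singulator+FACS"),
    "SL_T": ("MB-4027_SL_T", "Tumor Colon", "Singulator+LeviCell"),
    "SF_LN": ("MB-4027_SF_LN", "Normal Liver", "Singulator+FACS"),
    "SL_LN": ("MB-4027_SL_LN", "Normal Liver", "Singulator+LeviCell"),
}


def pair_by_tissue(samples):
    """Return {tissue: {protocol: sample_key}} from the metadata tuples."""
    pairs = defaultdict(dict)
    for key, (_sample_id, tissue, protocol) in samples.items():
        pairs[tissue][protocol] = key
    return dict(pairs)


tissue_pairs = pair_by_tissue(samples)
```

With this mapping, `tissue_pairs["Normal Colon"]` gives the FACS and LeviCell keys for that tissue, so per-tissue comparisons do not need to hard-code key names.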

### Load in Data

In [5]:
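from pprint import pprint

# dill is used instead of pickle because the results are nested defaultdicts
# whose default factory is a lambda, which the standard pickle module cannot load.
with open(DOWNSAMPLING_BOOTSTRAP_RESULTS_FILE_PATH, "rb") as f:
    downsampling_bootstrap_results = dill.load(f)

pprint(downsampling_bootstrap_results)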

defaultdict(<function <lambda> at 0x7f3b626037e0>,
            {'SF_LN': defaultdict(<class 'list'>,
                                  {'mean_reads': [[1936.174830590513,
                                                   4195.799031945789,
                                                   6453.846079380445,
                                                   8712.573475314617,
                                                   10970.644336882866,
                                                   13231.7701839303,
                                                   15489.260987415295,
                                                   17746.92700871249,
                                                   20007.705905130686,
                                                   22266.049564375604],
                                                  [1934.9703775411424,
                                                   4194.496224588577,
                                                   645
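
Printing the full structure is hard to read, so a compact summary may be more useful. The sketch below assumes, based only on the output shown above, that `results[sample]["mean_reads"]` is a list of bootstrap replicates and that each replicate lists one value per downsampling level in the same order. `summarize_metric` is a hypothetical helper and uses only the standard library.

```python
from statistics import mean, stdev


def summarize_metric(results, metric="mean_reads"):
    """Return {sample: [(mean, sd), ...]} with one tuple per downsampling level."""
    summary = {}
    for sample, metrics in results.items():
        # .get avoids creating new keys on a defaultdict.
        replicates = metrics.get(metric, [])
        # Transpose so each tuple holds one level's values across replicates.
        per_level = list(zip(*replicates))
        summary[sample] = [
            (mean(values), stdev(values) if len(values) > 1 else 0.0)
            for values in per_level
        ]
    return summary
```

Calling `summarize_metric(downsampling_bootstrap_results)` would give, for each sample, the mean and standard deviation across replicates at every downsampling level, which is easier to scan than the raw nested lists.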

In [None]:
filtered_molecule_info = None

