This script performs preprocessing and filtering of the single-cell reference data.

**Author:** Yiqing Wang

**Date:** 2024-6-15

INPUT: untransformed, unnormalized single cell AnnData

OUTPUT: single cell data after gene filtering

In [None]:
import os
import scanpy as sc
import numpy as np
import cell2location

In [None]:
from matplotlib import rcParams

rcParams["pdf.fonttype"] = 42  # embed TrueType (Type 42) fonts so text in saved PDFs stays editable

In [None]:
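data_dir = "path/to/data/"  # renamed from `dir` to avoid shadowing the built-in dir()
os.chdir(data_dir)
results_folder = "./test_results"
ref_run_name = f"{results_folder}/run_name"
os.makedirs(ref_run_name, exist_ok=True)  # the filtered AnnData is written into this folder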

1. Loading single cell data

In [None]:
adata_ref = sc.read(
    f"{results_folder}/reference_file.h5ad",
)

print("reference data loaded")

print(f"Number of cells in reference data: {adata_ref.shape[0]}")
print(f"Number of genes in reference data: {adata_ref.shape[1]}")

print(adata_ref.X.max())  # check the maximum value in the data (works for dense and sparse X)

2. Preprocessing

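The maximum value is ~8, which is too small for raw counts, so adata_ref.X holds log-transformed data.
The downstream steps expect untransformed, unnormalized counts, so we copy adata_ref.raw.X, which stores the raw counts, to adata_ref.X.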

In [None]:
adata_ref.X = adata_ref.raw.X.copy()

# delete raw attribute after copying
del adata_ref.raw
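
After the copy, it can be worth confirming that the matrix really holds counts rather than transformed values. The sketch below is one possible check, not part of the original workflow: `looks_like_raw_counts` is a hypothetical helper, and the toy arrays stand in for adata_ref.X.

```python
import numpy as np


def looks_like_raw_counts(values):
    """Hypothetical helper: True if every entry is a non-negative integer."""
    values = np.asarray(values, dtype=float)
    return bool(np.all(values >= 0) and np.all(values == np.round(values)))


# Toy stand-ins for a count matrix and its log1p-transformed version.
raw_like = np.array([[0, 3, 12], [1, 0, 250]])
log_like = np.log1p(raw_like)

print(looks_like_raw_counts(raw_like))  # True
print(looks_like_raw_counts(log_like))  # False
```

If adata_ref.X is a SciPy sparse matrix, passing its stored values (`adata_ref.X.data`) should be enough, since the implicit zeros are valid counts anyway.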

In the AnnData object, .obs stores per-cell information, .var stores per-gene information (here, gene symbols as row names), and .obsm stores additional per-cell arrays, such as UMAP embeddings.

In [None]:
# Count the cells with no annotation (label "NA").
# These cells need to be removed.
cell_no_anno = (adata_ref.obs["label"] == "NA").sum()
print(f"Number of cells with no annotation: {cell_no_anno}")
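
The cell above only counts the unannotated cells. A minimal sketch of the removal step is shown below on toy NumPy arrays that stand in for adata_ref.obs["label"] and adata_ref.X; the array contents are made up for illustration.

```python
import numpy as np

# Toy stand-ins: 4 cells x 2 genes, with two cells labelled "NA".
labels = np.array(["T cell", "NA", "B cell", "NA"])
counts = np.array([[1, 0], [2, 5], [0, 3], [4, 4]])

keep = labels != "NA"           # True for annotated cells
filtered_labels = labels[keep]
filtered_counts = counts[keep]  # drop the rows of unannotated cells

print(filtered_labels)
print(filtered_counts.shape)    # (2, 2)
```

On the real object, the equivalent would presumably be `adata_ref = adata_ref[adata_ref.obs["label"] != "NA"].copy()`. If missing annotations are stored as NaN rather than the string "NA", the mask would need `adata_ref.obs["label"].notna()` instead.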

The cell2location tutorial sets gene IDs as the row names of adata_ref.var. The data used here had no gene ID information, so gene symbols were kept as row names, and the code ran fine.

It is important that gene names are consistent between the single-cell data and the spatial data. In other words, the row names of adata_ref.var and the row names of adata_vis.var must use the same identifiers.
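
One way to check this consistency is to count how many reference genes also appear in the spatial data. The sketch below uses made-up gene lists in place of adata_ref.var_names and adata_vis.var_names.

```python
import numpy as np

# Toy stand-ins for adata_ref.var_names and adata_vis.var_names.
ref_genes = np.array(["CD3E", "MS4A1", "GAPDH", "ACTB"])
vis_genes = np.array(["GAPDH", "CD3E", "ALB"])

# intersect1d returns the sorted, unique names present in both arrays.
shared = np.intersect1d(ref_genes, vis_genes)

print(f"{len(shared)} of {len(ref_genes)} reference genes found in the spatial data")
print(shared)
```

With the real objects, the call would be `np.intersect1d(adata_ref.var_names, adata_vis.var_names)`. A very small overlap usually suggests that the two datasets use different identifier types, for example gene symbols in one and gene IDs in the other.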

Next we check the number of cells with zero counts in all genes.
If there are cells with zero counts in all genes, they need to be removed.

In [None]:
# Check the number of cells with zero counts in all genes

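# np.asarray(...).ravel() flattens the (n_cells, 1) matrix that a sparse X returns
adata_ref.obs["n_genes"] = np.asarray((adata_ref.X > 0).sum(axis=1)).ravel()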
ncells_with_0_genes = (adata_ref.obs["n_genes"] == 0).sum()
print(f"Number of cells with 0 genes detected: {ncells_with_0_genes}")
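
If any empty cells are reported, they can be dropped with a boolean mask. The sketch below shows the idea on a toy dense matrix standing in for adata_ref.X; the values are made up.

```python
import numpy as np

# Toy count matrix: 3 cells x 3 genes, the second cell is empty.
counts = np.array([
    [0, 2, 1],
    [0, 0, 0],
    [5, 0, 0],
])

n_genes = (counts > 0).sum(axis=1)  # genes detected per cell
keep = n_genes > 0                  # cells with at least one detected gene
counts_nonempty = counts[keep]

print(n_genes)                # [2 0 1]
print(counts_nonempty.shape)  # (2, 3)
```

On the AnnData object this would correspond to `adata_ref = adata_ref[adata_ref.obs["n_genes"] > 0].copy()`. scanpy's `sc.pp.filter_cells(adata_ref, min_genes=1)` is an alternative that does the counting and the filtering in one call.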

3. Gene filtering

1) cell_count_cutoff: the gene is expressed in at least this number of cells
2) cell_percentage_cutoff2: the gene is expressed in at least this fraction of cells (0.03 means 3%)
3) nonz_mean_cutoff: the gene's mean expression across the cells that express it is at least this value

Filtering logic: a gene is kept if it satisfies 2 OR (1 AND 3).

The cutoff values can be customized according to the dataset.
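
To make the 2 OR (1 AND 3) rule concrete, the sketch below applies it to a toy count matrix with plain NumPy. It illustrates the logic as described above and is not the cell2location implementation: `select_genes` and the parameter name `cell_fraction_cutoff` are hypothetical, and the library may use strict inequalities or log-scaled values internally.

```python
import numpy as np


def select_genes(counts, cell_count_cutoff, cell_fraction_cutoff, nonz_mean_cutoff):
    """Hypothetical re-statement of the rule: keep a gene if 2 OR (1 AND 3)."""
    n_cells = (counts > 0).sum(axis=0)        # cells expressing each gene
    total = counts.sum(axis=0).astype(float)  # summed counts per gene
    # Mean over expressing cells only; genes never expressed get 0.
    nonz_mean = np.divide(total, n_cells, out=np.zeros_like(total), where=n_cells > 0)

    rule1 = n_cells >= cell_count_cutoff
    rule2 = n_cells >= cell_fraction_cutoff * counts.shape[0]
    rule3 = nonz_mean >= nonz_mean_cutoff
    return rule2 | (rule1 & rule3)


# Toy data: 10 cells x 4 genes.
counts = np.zeros((10, 4), dtype=int)
counts[:6, 0] = 1          # gene 0: in 6 cells, low expression (kept by rule 2)
counts[:3, 1] = [4, 2, 3]  # gene 1: in 3 cells, nonzero mean 3 (kept by 1 AND 3)
counts[:3, 2] = 1          # gene 2: in 3 cells, nonzero mean 1 (dropped)
counts[0, 3] = 9           # gene 3: in 1 cell only (dropped)

selected = select_genes(
    counts, cell_count_cutoff=2, cell_fraction_cutoff=0.5, nonz_mean_cutoff=1.5
)
print(selected)  # [ True  True False False]
```

The example shows why the OR branch matters: a widely expressed gene is kept even when its expression is low, while a rarely expressed gene must also clear the nonzero-mean threshold.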

In [None]:
print(f"Number of genes in reference data before filtering: {adata_ref.shape[1]}")

from cell2location.utils.filtering import filter_genes

selected = filter_genes(
    adata_ref, cell_count_cutoff=5, cell_percentage_cutoff2=0.03, nonz_mean_cutoff=1.12
)

adata_ref = adata_ref[:, selected].copy()

print(f"Number of genes in reference data after filtering: {adata_ref.shape[1]}")

In [None]:
# Write the filtered single cell AnnData to file
adata_ref.write(f"{ref_run_name}/adata_ref_filtered.h5ad")