# Preprocessing of the single-cell gene expression data set

This notebook summarizes the preprocessing of the single-cell gene expression data set used in the Image2Reg pipeline.

---

## 0. Environment setup

In [None]:
import pandas as pd
import scanpy as sc
import numpy as np
import os
import seaborn as sns
import matplotlib.pyplot as plt
import networkx as nx
import mygene
import copy

%load_ext nb_black

---

## 1. Read in data

The gene expression data set consists of 1,152 U2OS FUCCI cells that were sorted by FACS and sequenced at single-cell resolution using SMART-seq2 chemistry. In total, 42,728 genes were captured. We first read in the data.

In [None]:
fucci_adata = sc.read_csv("../../../data/gex/scrnaseq/GSE146773_Counts.csv")
fucci_adata

Before continuing, we translate the Ensembl gene IDs to their respective gene symbols for consistency.

In [None]:
mg = mygene.MyGeneInfo()
fucci_gene_list = list(fucci_adata.var.index)
fucci_query_results = mg.querymany(
    fucci_gene_list, scopes="ensembl.gene", fields="symbol", species="human"
)

We then filter out genes with duplicate or missing HGNC symbols.

In [None]:
fucci_gene_symbs = []
fucci_gene_ensid = []
missing_duplicate_symbs = []
seen_symbols = set()  # set membership checks are O(1), unlike list lookups
for query_result in fucci_query_results:
    gene_symbol = query_result.get("symbol")
    # Drop Ensembl IDs without a symbol or whose symbol was already assigned.
    if gene_symbol is None or gene_symbol in seen_symbols:
        missing_duplicate_symbs.append(query_result["query"])
        continue
    seen_symbols.add(gene_symbol)
    fucci_gene_symbs.append(gene_symbol)
    fucci_gene_ensid.append(query_result["query"])
len(missing_duplicate_symbs)

In total, 6,803 Ensembl IDs were not found in the reference data set, and a further 1,205 Ensembl IDs mapped to an HGNC symbol that was already assigned to another Ensembl ID. We remove the 6,803 missing genes from our analyses and, where multiple Ensembl IDs map to the same HGNC symbol, keep only the data for the first mapping.
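The two cases (missing symbol versus duplicate symbol) can be counted separately rather than together. The following is a minimal sketch on hand-made records that only mimic the `query` and `symbol` fields used above; the helper name `split_query_results` and the toy IDs are hypothetical and not part of the pipeline or of `mygene`.

```python
def split_query_results(query_results):
    """Separate query results into kept, missing, and duplicate Ensembl IDs."""
    kept = {}  # symbol -> first Ensembl ID that mapped to it
    missing = []  # Ensembl IDs without any symbol
    duplicates = []  # Ensembl IDs whose symbol was already assigned
    for result in query_results:
        symbol = result.get("symbol")
        if symbol is None:
            missing.append(result["query"])
        elif symbol in kept:
            duplicates.append(result["query"])
        else:
            kept[symbol] = result["query"]
    return kept, missing, duplicates


# Toy records (assumed shape): one entry lacks a symbol, one symbol repeats.
toy_results = [
    {"query": "ENSG_A", "symbol": "GENE1"},
    {"query": "ENSG_B", "notfound": True},
    {"query": "ENSG_C", "symbol": "GENE1"},
    {"query": "ENSG_D", "symbol": "GENE2"},
]
kept, missing, duplicates = split_query_results(toy_results)
print(len(kept), len(missing), len(duplicates))
```

Applied to `fucci_query_results`, the lengths of `missing` and `duplicates` would be expected to correspond to the two counts reported above, assuming the query results have the shape sketched here.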

In [None]:
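# Subset to the retained Ensembl IDs; copy so that later changes to .var
# are applied to a standalone object rather than to a view.
fucci_adata = fucci_adata[:, fucci_gene_ensid].copy()
fucci_adata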

In [None]:
fucci_adata.var["gene_symbol"] = fucci_gene_symbs

We are left with 34,720 genes.

---

## 2. Preprocessing

We now run a standard single-cell gene expression preprocessing pipeline that filters out cells and genes with low support in the data set, followed by library-size normalization and log-transformation.

In [None]:
# Boolean cells-by-genes matrix marking non-zero counts.
expressed = fucci_adata.to_df().to_numpy() > 0
fucci_adata.var["n_cells_per_gene"] = expressed.sum(axis=0)
fucci_adata.obs["n_genes_per_cell"] = expressed.sum(axis=1)

In [None]:
fig, ax = plt.subplots(1, 2, figsize=[12, 3])
ax = ax.flatten()
sns.histplot(fucci_adata.obs["n_genes_per_cell"], ax=ax[0])
ax[0].set_title("#Genes expressed per cell")

sns.histplot(fucci_adata.var["n_cells_per_gene"], ax=ax[1])
ax[1].set_title("#Cells expressing a gene")
plt.show()

We remove cells in which transcripts from fewer than 8,000 different genes were detected, and genes that are expressed in fewer than 10 cells. The cut-offs were chosen based on the corresponding empirical distributions in order to remove outliers.
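To illustrate what the two thresholds do, here is a small NumPy sketch on a made-up count matrix. The values and the toy thresholds are assumptions chosen for readability. The sketch mirrors the order used in this notebook (cells first, then genes on the remaining cells) but is not the scanpy implementation itself.

```python
import numpy as np

# Toy count matrix: rows are cells, columns are genes (values are made up).
counts = np.array(
    [
        [5, 0, 2, 1],
        [0, 0, 0, 3],
        [4, 1, 0, 2],
    ]
)
min_genes, min_cells = 2, 2  # hypothetical toy thresholds

# Step 1: keep cells with at least `min_genes` detected genes.
n_genes_per_cell = (counts > 0).sum(axis=1)
counts = counts[n_genes_per_cell >= min_genes]

# Step 2: among the remaining cells, keep genes detected in at least
# `min_cells` cells.
n_cells_per_gene = (counts > 0).sum(axis=0)
filtered = counts[:, n_cells_per_gene >= min_cells]
print(filtered.shape)
```

Because the gene filter runs after the cell filter, a gene's support is counted only over the cells that survived the first step.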

In [None]:
sc.pp.filter_cells(fucci_adata, min_genes=8000)
sc.pp.filter_genes(fucci_adata, min_cells=10)
fucci_adata

After this filtering step, the data set contains 1,126 cells and 21,445 genes.

In [None]:
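# Scale each cell to a total of 1e6 counts (counts per million) ...
sc.pp.normalize_total(fucci_adata, target_sum=1e6)
# ... and apply the transformation log(1 + x).
sc.pp.log1p(fucci_adata)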
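The normalization and log-transformation applied above can be written out in plain NumPy. This is a sketch on a made-up two-cell matrix, assuming the default natural logarithm; it illustrates the arithmetic only and is not the scanpy code path.

```python
import numpy as np

# Toy counts: rows are cells, columns are genes (values are made up).
counts = np.array([[10.0, 30.0, 60.0], [1.0, 1.0, 2.0]])
target_sum = 1e6

# Library-size normalization: rescale every cell to `target_sum` total counts.
library_size = counts.sum(axis=1, keepdims=True)
cpm = counts / library_size * target_sum

# Log-transformation: log(1 + x) keeps zero counts at zero.
log_cpm = np.log1p(cpm)
print(cpm.sum(axis=1))
```

After normalization, every cell has the same total, so differences in sequencing depth no longer dominate comparisons between cells.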

---

## 3. Data export

Finally, we export the preprocessed gene expression data to disk.

In [None]:
fucci_adata_fname = "../../data/gex/fucci_adata.h5"
fucci_adata.write(fucci_adata_fname)