# NN CELL-TYPE ANNOTATION

To understand your data better and make use of existing knowledge, it is useful to know which cell type or cell identity each cell in your data belongs to. For example, knowing that there are specific immune cell types in a tumor, or unusual hematopoietic stem cells in your bone marrow sample, can be a valuable insight into your data.<br>
So what is a cell type? Biologists use the term cell type to denote a cellular phenotype that is robust across datasets, identifiable based on expression of specific markers (i.e. proteins or gene transcripts), and often linked to specific functions. For example, a plasma B cell is a type of white blood cell that secretes antibodies used to fight pathogens, and it can be identified using specific markers. <br>
However, as with any categorization, the size of categories and the borders drawn between them are partly subjective and can change over time, e.g. because new technologies allow for a higher-resolution view of cells, or because specific "sub-phenotypes" that were not considered biologically meaningful are found to have important biological implications (see e.g. {cite}KadurLakshminarasimhaMurthy2022). Cell types are therefore often further classified into "subtypes" or "cell states" (e.g. activated versus resting), and some researchers use the term "cell identity" to avoid this sometimes arbitrary distinction between cell types, cell subtypes, and cell states. For a more detailed discussion of this topic, we recommend the review by Wagner et al. {cite}Wagner2016.<br>
Similarly, multiple cell types can be part of a single continuum, where one cell type might transition or differentiate into another. For example, in hematopoiesis, cells differentiate from a stem cell into a specific immune cell type. Although hard borders between early and late stages of this differentiation are often drawn, the state of these cells can be described more accurately by a differentiation coordinate between the less and more differentiated cellular phenotypes. We will discuss differentiation and cellular trajectories in subsequent chapters.<br>
So how do we go about annotating cells in single-cell data? There are multiple ways to do it, and we will give an overview of different approaches below. As we are working with transcriptomic data, each of these methods is ultimately based on the expression of specific genes or gene sets, or general transcriptomic similarity between cells. 

## NN.1 Load modules and set paths:

In [1]:
import scanpy as sc

In [4]:
path_query_data = "/home/icb/anna.schaar/data/bestprac2/neurips_qc_normalized_all_samples.h5ad"
path_reference_model = "" # waiting for Malte to integrate...

## NN.2 Load data:

Data to annotate:

In [5]:
# adata = sc.read_h5ad(path_query_data)

Reference data for label transfer:

In [None]:
# ...download reference model from Malte here...

## NN.3 Manual annotation based on marker gene expression.

The classical, and oldest, way to perform cell type annotation is based on a single marker gene or a small set of marker genes known to be associated with a particular cell type. This approach dates back to "pre-scRNA-seq times", when single-cell data was low-dimensional (e.g. FACS data with gene panels consisting of no more than 30-40 genes). It is a fast and transparent way to annotate your data. However, when no unique markers exist for a specific cell type (which is often the case), this approach becomes more complicated and less objective, as combinations of markers or expression thresholds are needed for proper annotation. A robust set of marker genes and prior knowledge or annotation experience can help here, but the approach comes with the risk of unclear and subjective decision-making.

In this setting, the data is often clustered before annotation, so that we can annotate groups of cells instead of making a per-cell call. This is not only less laborious, but also more robust to noise: a single cell might not have a count for a specific marker even if the marker was expressed in that cell, simply due to the inherent sparsity of single-cell data. Clustering groups cells that are highly similar in overall gene expression, and can therefore compensate for dropouts at the single-cell level.
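
The effect of dropouts on per-cell versus per-cluster calls can be illustrated with a small simulation. This is a minimal sketch using only the Python standard library; the detection rates, the cluster sizes, and the helper name `simulate_detection` are hypothetical and chosen for illustration, not derived from real data.

```python
import random

rng = random.Random(42)

def simulate_detection(n_cells, detection_rate):
    """Return 1/0 per cell: was at least one marker count observed?"""
    return [1 if rng.random() < detection_rate else 0 for _ in range(n_cells)]

# Hypothetical scenario: the marker is truly expressed in cluster A but,
# due to sparsity, only detected in ~30% of its cells. In cluster B the
# marker is not expressed and only shows up as rare background counts.
cluster_a = simulate_detection(n_cells=500, detection_rate=0.30)
cluster_b = simulate_detection(n_cells=500, detection_rate=0.01)

# Per-cell calling: every cell in A without a count would be called negative.
missed_in_a = cluster_a.count(0) / len(cluster_a)

# Cluster-level calling: the fraction of marker-positive cells per cluster.
frac_a = sum(cluster_a) / len(cluster_a)
frac_b = sum(cluster_b) / len(cluster_b)

print(f"cells in A missed by per-cell calls: {missed_in_a:.2f}")
print(f"fraction marker-positive, cluster A: {frac_a:.2f}")
print(f"fraction marker-positive, cluster B: {frac_b:.2f}")
```

In this toy setting, the majority of cells in the marker-positive cluster would be called negative individually, while the cluster-level fractions still separate the two clusters clearly.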

Finally, there are two angles from which to approach marker-gene-based annotation. One option is to work from a table of marker genes for all the cell types you expect in your data, and check in which clusters those genes are expressed. The other option is to check which genes are highly expressed in the clusters you defined, and then check whether they are associated with known cell types or states. If necessary, one can move back and forth between the two approaches.
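
Both angles can be sketched on a toy example. The code below is an illustration only: the marker table is a small hand-picked set of commonly used immune markers, the per-cluster mean expression values are made up, and the function names `annotate_from_table` and `lookup_top_genes` are hypothetical. "Highly expressed" is simplified here to "highest mean expression"; in practice one would compare each cluster against the others.

```python
from statistics import mean

# Hypothetical marker table: cell type -> known marker genes.
marker_table = {
    "B cell": ["MS4A1", "CD79A"],
    "T cell": ["CD3D", "CD3E"],
    "Monocyte": ["CD14", "LYZ"],
}

# Made-up mean normalized expression per cluster.
cluster_means = {
    "0": {"MS4A1": 2.1, "CD79A": 1.8, "CD3D": 0.1, "CD3E": 0.0, "CD14": 0.0, "LYZ": 0.2},
    "1": {"MS4A1": 0.0, "CD79A": 0.1, "CD3D": 1.9, "CD3E": 2.3, "CD14": 0.1, "LYZ": 0.1},
    "2": {"MS4A1": 0.1, "CD79A": 0.0, "CD3D": 0.0, "CD3E": 0.1, "CD14": 1.7, "LYZ": 3.0},
}

def annotate_from_table(cluster_means, marker_table):
    """Angle 1: start from the marker table and score every cluster."""
    annotation = {}
    for cluster, means in cluster_means.items():
        scores = {
            cell_type: mean(means.get(gene, 0.0) for gene in genes)
            for cell_type, genes in marker_table.items()
        }
        annotation[cluster] = max(scores, key=scores.get)
    return annotation

def lookup_top_genes(cluster_means, marker_table, n_top=2):
    """Angle 2: start from each cluster's top genes and look them up."""
    gene_to_type = {g: ct for ct, genes in marker_table.items() for g in genes}
    hits = {}
    for cluster, means in cluster_means.items():
        top_genes = sorted(means, key=means.get, reverse=True)[:n_top]
        hits[cluster] = sorted({gene_to_type[g] for g in top_genes if g in gene_to_type})
    return hits

table_based = annotate_from_table(cluster_means, marker_table)
cluster_based = lookup_top_genes(cluster_means, marker_table)
print(table_based)
print(cluster_based)
```

When the two angles disagree for a cluster, that is usually a cue to inspect the cluster more closely rather than to trust either result blindly.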

As an example, we show the annotation of [cell type tbd] here, based on known marker genes [...]. 

In [None]:
# ...code for annotation of cell type based on known marker...

Conversely, we can calculate marker genes per cluster and then look up whether we can link those marker genes to any known biology, such as cell types and/or states. For marker gene calculation of clusters, simple methods such as the Wilcoxon rank-sum test are thought to perform best {cite}Pullin2022.05.09.490241. Importantly, as the clusters are defined on the same data that is used for these statistical tests, the resulting p-values will be artificially small, i.e. significance is inflated, as also described here {cite}ZHANG2019383. Finally, note that more complex scenarios, such as the comparison of conditions (e.g. disease versus healthy), require more complex models, as discussed in the chapter about differential gene expression analysis (LINK HERE?).
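
The consequence of defining groups and testing them on the same data can be shown with a toy example. The sketch below uses only the Python standard library and a hand-written rank-sum test (normal approximation, no tie or continuity correction); the function name `rank_sum_pvalue`, the sample size, and the median split standing in for "clustering" are assumptions made for illustration. In scanpy, a per-cluster rank-sum test is available through `sc.tl.rank_genes_groups` with `method="wilcoxon"`.

```python
import random
from statistics import NormalDist

def rank_sum_pvalue(group_a, group_b):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation, assumes no ties)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled = sorted([(v, 0) for v in group_a] + [(v, 1) for v in group_b])
    rank_sum_a = sum(rank for rank, (_, label) in enumerate(pooled, start=1) if label == 0)
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    mean_u = n_a * n_b / 2
    sd_u = (n_a * n_b * (n_a + n_b + 1) / 12) ** 0.5
    z = (u_a - mean_u) / sd_u
    return 2 * (1 - NormalDist().cdf(abs(z)))

rng = random.Random(0)

# One homogeneous population: by construction there are no true subgroups.
expression = [rng.gauss(0.0, 1.0) for _ in range(200)]

# "Cluster" on the very values we test afterwards: split at the median.
cutoff = sorted(expression)[100]
low = [v for v in expression if v < cutoff]
high = [v for v in expression if v >= cutoff]
p_double_dip = rank_sum_pvalue(low, high)

# Control: groups defined independently of the expression values.
shuffled = expression[:]
rng.shuffle(shuffled)
p_random_split = rank_sum_pvalue(shuffled[:100], shuffled[100:])

print(f"groups defined on the tested data: p = {p_double_dip:.3g}")
print(f"groups defined independently:      p = {p_random_split:.3g}")
```

Although no real subpopulations exist, the split defined on the tested values yields an extremely small p-value. P-values from cluster marker tests are therefore better read as a ranking aid than as calibrated evidence.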

In [None]:
# ...code for marker gene calculation and annotation based on calculated markers...

## NN.4 Automated annotation based on marker gene expression.

The remaining methods discussed here are for automated, rather than manual, annotation of your data. For a more elaborate discussion of automated annotation methods, see e.g. this review from 2021 {cite}PASQUINI2021961.

Waiting for feedback from Luke here.

## NN.5 Automated annotation using pre-trained classifiers.

It should be noted that the methods discussed so far use only a small subset of the genes detected in the data: often a set of only 1 to ~10 marker genes is used. An alternative approach is to use a classifier that takes as input a larger set of genes (several thousands or tens of thousands), thereby making more use of the breadth of scRNA-seq data. Such classifiers are trained on previously annotated datasets or atlases, and are designed to be used on unannotated, newly generated datasets. Examples of these are CellTypist {cite}doi:10.1126/science.abl5197 (see also https://www.celltypist.org, where data can be uploaded to a portal to get automated cell annotations) and Clustifyr {cite}Fu2020. In this case, the quality of the annotations depends on:
1) the type of classifier chosen. Previous benchmark studies have shown that different types of classifiers often perform comparably, with neural network-based methods often performing as well as or worse than more general-purpose models such as support vector machines or linear regression models {cite}Abdelaal2019 {cite}PASQUINI2021961 {cite}Huang2021.<br>
2) the quality of the data that the classifier was trained on. If the training data was poorly annotated, or annotated at low resolution, the classifier's annotations will have the same shortcomings. Similarly, if the training data and/or its annotation was noisy, the classifier might not perform well.<br>
3) the similarity of your own data to the data that the classifier was trained on. For example, if the classifier was trained on a Drop-seq single-cell dataset and your data is 10x single-nucleus, the quality of the annotation might suffer. Classifiers trained on cross-dataset atlases that include a diversity of datasets (e.g. the CellTypist classifier trained on the Human Lung Cell Atlas {cite}Sikkema2022.03.10.483747) might give more robust and better-quality annotations.
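
To make the idea of "train on an annotated reference, predict on new data" concrete, here is a deliberately simple sketch. It uses a nearest-centroid classifier written with the Python standard library, which is swapped in for simplicity and is not the model used by any of the tools named above. The three-gene expression profiles, the labels, and the function names `fit_centroids` and `predict` are all hypothetical.

```python
from collections import defaultdict
from math import dist

def fit_centroids(profiles, labels):
    """Training: average the expression profiles of each annotated cell type."""
    grouped = defaultdict(list)
    for profile, label in zip(profiles, labels):
        grouped[label].append(profile)
    return {
        label: [sum(values) / len(values) for values in zip(*members)]
        for label, members in grouped.items()
    }

def predict(centroids, profile):
    """Prediction: return the label whose centroid is closest (Euclidean distance)."""
    return min(centroids, key=lambda label: dist(centroids[label], profile))

# Made-up annotated reference: one row per cell, one column per gene.
reference_profiles = [
    [2.0, 0.1, 0.0],
    [1.8, 0.0, 0.2],
    [0.1, 2.2, 0.1],
    [0.0, 1.9, 0.0],
]
reference_labels = ["B cell", "B cell", "T cell", "T cell"]

# The "pre-trained" model is just the stored centroids.
centroids = fit_centroids(reference_profiles, reference_labels)

# Made-up unannotated query cells, measured on the same genes.
query_profiles = [[1.7, 0.2, 0.1], [0.2, 2.0, 0.0]]
predicted_labels = [predict(centroids, profile) for profile in query_profiles]
print(predicted_labels)
```

The sketch also makes the three dependencies listed above tangible: the model can only return labels present in the reference, it inherits any mislabeling in `reference_labels`, and it assumes the query was measured and normalized like the reference.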

The three points listed above highlight possible disadvantages of using classifiers, depending on the training data and model type. Nonetheless, there are several important advantages of using pre-trained classifiers to annotate your data. First, it is a fast and easy way to annotate your data. The annotation requires neither downloading nor preprocessing the training data, and sometimes merely involves uploading your data to a web page. Second, pre-trained classifiers enable you to directly leverage the knowledge and information from previous studies, such as a high-quality annotation. Third, using such classifiers can help with harmonizing cell-type definitions across a field, thereby clearing the path towards a field-wide consensus on these definitions.

Finally, as these classifiers are often less transparent than e.g. manual marker-based annotation, it is important to include some type of uncertainty measure in the annotations. We will discuss this more extensively further down.
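
One simple form such an uncertainty measure can take is sketched below, assuming the classifier returns a probability per cell type for each cell. The function name `call_with_uncertainty`, the example probabilities, and the confidence threshold of 0.7 are hypothetical choices for illustration, not recommended defaults.

```python
from math import log

def call_with_uncertainty(probabilities, min_confidence=0.7):
    """Return (label, confidence, entropy) for one cell.

    `probabilities` maps cell-type labels to predicted probabilities.
    Cells whose top probability is below `min_confidence` stay unassigned.
    """
    best_label = max(probabilities, key=probabilities.get)
    confidence = probabilities[best_label]
    # Shannon entropy: 0 for a certain call, larger for ambiguous calls.
    entropy = -sum(p * log(p) for p in probabilities.values() if p > 0)
    label = best_label if confidence >= min_confidence else "Unassigned"
    return label, confidence, entropy

# Made-up classifier outputs for two cells.
confident_cell = {"B cell": 0.95, "T cell": 0.03, "Monocyte": 0.02}
ambiguous_cell = {"B cell": 0.45, "T cell": 0.40, "Monocyte": 0.15}

confident_call = call_with_uncertainty(confident_cell)
ambiguous_call = call_with_uncertainty(ambiguous_cell)
print(confident_call)
print(ambiguous_call)
```

Keeping the confidence and entropy next to each label makes it possible to revisit low-confidence cells later, for example with manual marker-based checks.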

## NN.6 Automated annotation by mapping to a reference.