# RNA velocity

## Motivation

Single-cell datasets allow studying biological processes such as early development at high resolution. However, even though single cells are analysed instead of a tissue as a whole, changes in a cell's phenotypic traits cannot be tracked over time. This limitation stems from the destructive nature of single-cell sequencing protocols: a cell is destroyed upon sequencing, and its defining characteristics can, thus, not be measured again at a later time point. Notably, experimental techniques fail to measure not only the general cellular profile at different times but also how quickly changes in this profile take place. Recovering the position in time along the developmental landscape can be achieved with tools from the field of *trajectory inference* (TI). However, classical TI methods do not offer any directed, dynamic information. Additionally, these algorithms traditionally do not take into account information beyond transcriptomic reads and similarity.

## Modeling RNA velocity

The change in the transcriptomic profile of a cell is triggered by a cascade of events: broadly speaking, DNA is transcribed to produce so-called unspliced precursor messenger RNA (pre-mRNA). Unspliced pre-mRNA contains regions relevant for translation (exons) as well as non-coding regions (introns). These non-coding regions are spliced out, *i.e.*, removed, to form spliced, mature mRNA. While single-cell RNA sequencing (scRNA-seq) protocols fail to capture the transcriptome at multiple time points, they do include the necessary information to distinguish unspliced from spliced mRNA reads __<span style="color: red;">[CITE]</span>__.

Identifying unspliced and spliced reads allows formulating a dynamical model describing splicing kinetics __<span style="color: red;">[CITE]</span>__ and inferring the corresponding model parameters from single-cell data. The change in spliced RNA described by the model is called RNA velocity __<span style="color: red;">[CITE]</span>__. Current models of RNA velocity assume the gene-specific model

$$
    \begin{aligned}
        \frac{du_g}{dt} &= \alpha_g - \beta_g u_g\\
        \frac{ds_g}{dt} &= \beta_g u_g - \gamma_g s_g,
    \end{aligned}
$$

with transcription rate $\alpha_g$, splicing rate $\beta_g$, and degradation rate $\gamma_g$ of spliced RNA. As the kinetics of each gene are modelled independently of each other, we will drop the index $g$ for notational simplicity. Even though the field of parameter estimation in dynamical systems is well studied, the corresponding inference algorithms require the time associated with each observation to be known. Consequently, these traditional methods cannot be applied to infer RNA velocity and its model parameters in the context of scRNA-seq data.
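To make the model above more concrete, the following minimal sketch evaluates its closed-form solution for constant rates. The function names `splicing_solution` and `velocity`, the chosen rate values, and the switch from induction to repression by setting $\alpha = 0$ are illustrative assumptions, and the formula requires $\beta \neq \gamma$.

```python
import math


def splicing_solution(t, alpha, beta, gamma, u0=0.0, s0=0.0):
    """Closed-form (u(t), s(t)) of the splicing model for constant rates.

    Assumes beta != gamma; (u0, s0) is the state at time 0.
    """
    exp_b = math.exp(-beta * t)
    exp_g = math.exp(-gamma * t)
    u = u0 * exp_b + alpha / beta * (1.0 - exp_b)
    s = (
        s0 * exp_g
        + alpha / gamma * (1.0 - exp_g)
        + (alpha - beta * u0) / (gamma - beta) * (exp_g - exp_b)
    )
    return u, s


def velocity(u, s, beta, gamma):
    """RNA velocity, i.e. the change in spliced RNA ds/dt."""
    return beta * u - gamma * s


# Illustrative (assumed) rates: induction for 3 time units, then repression.
alpha, beta, gamma = 2.0, 1.0, 0.5
u_on, s_on = splicing_solution(3.0, alpha, beta, gamma)
# Repression: transcription is switched off (alpha = 0), starting from the
# state reached at the end of the induction phase.
u_off, s_off = splicing_solution(2.0, 0.0, beta, gamma, u0=u_on, s0=s_on)
```

For long induction times, the solution approaches the steady state $(\alpha / \beta, \alpha / \gamma)$, at which the velocity vanishes. Evaluating the solution requires the time $t$ of each observation, which is precisely the quantity that is unknown for scRNA-seq data.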

## Parameter inference

Single-cell measurements are snapshot data and can, thus, not be plotted against time. Instead, classical RNA velocity methods rely on studying the cell-specific tuples $(u, s)$ of unspliced and spliced RNA for each gene. The collection of these tuples forms the so-called phase portrait. Assuming constant rates of transcription, splicing, and degradation, the phase portrait exhibits an almond shape. The upper arc corresponds to the induction phase, the lower arc to the repression phase. However, as real-world data is noisy, plotting the unspliced against the spliced counts does not recover the expected almond shape. Instead, the data needs to be smoothed first. Classically, this preprocessing step consists of averaging the gene expression of each cell over its neighbors in a cell-cell similarity graph.
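The smoothing step can be sketched as follows. This is a simplified illustration and not the implementation of any particular tool: the function name `knn_smooth`, the use of Euclidean distance on some low-dimensional cell representation, and the brute-force neighbor search are all assumptions made for brevity.

```python
import math


def knn_smooth(coords, values, k):
    """Average each cell's value over itself and its k nearest neighbors.

    coords: one coordinate tuple per cell (e.g. a low-dimensional embedding).
    values: one expression value per cell (e.g. unspliced counts of a gene).
    """
    smoothed = []
    for ci in coords:
        # Rank all cells by Euclidean distance to the current cell; the cell
        # itself has distance 0 and is therefore part of its own neighborhood.
        order = sorted(range(len(coords)), key=lambda j: math.dist(ci, coords[j]))
        neighborhood = order[: k + 1]
        smoothed.append(sum(values[j] for j in neighborhood) / len(neighborhood))
    return smoothed


# Toy example with three cells on a line and one neighbor per cell.
coords = [(0.0,), (1.0,), (10.0,)]
unspliced = [0.0, 2.0, 100.0]
unspliced_smoothed = knn_smooth(coords, unspliced, k=1)
```

Applying the same averaging to both the unspliced and the spliced counts of a gene yields the smoothed tuples $(u, s)$ that make up its phase portrait.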

### The *steady-state model*

The first attempt at estimating RNA velocity assumed gene independence and the underlying kinetics to be governed by the above model. Additionally, it is assumed that (1) the kinetics have reached their equilibrium, (2) rates are constant, and (3) there is a single, common splicing rate across all genes. In the following, we will refer to this model as the *steady-state model* due to the first assumption. The steady states themselves are found in the upper right corner of the phase portrait (induction phase) and at its origin (repression phase). Based on these extreme quantiles, the *steady-state model* estimates the steady-state ratio with a linear regression fit. RNA velocity is then defined as the residual to this fit.
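The following sketch illustrates this procedure for a single gene. It rests on several assumptions that are not taken from the text above: extreme cells are selected by ranking the spliced counts, the regression line is forced through the origin, the splicing rate is fixed to one so that the residual reads $u - \gamma s$, and the function names as well as the default quantile are hypothetical.

```python
def fit_steady_state_ratio(u, s, quantile=0.05):
    """Least-squares slope through the origin, fitted on extreme cells only."""
    ranked = sorted(range(len(s)), key=lambda i: s[i])
    n_extreme = max(1, int(quantile * len(s)))
    # Cells with the lowest and highest spliced counts are taken as proxies
    # for the two steady states (assumed selection rule).
    extreme = ranked[:n_extreme] + ranked[-n_extreme:]
    denominator = sum(s[i] ** 2 for i in extreme)
    if denominator == 0:
        raise ValueError("All selected cells have zero spliced counts.")
    return sum(u[i] * s[i] for i in extreme) / denominator


def steady_state_velocity(u, s, quantile=0.05):
    """Velocity per cell as the residual to the steady-state fit."""
    ratio = fit_steady_state_ratio(u, s, quantile)
    return [ui - ratio * si for ui, si in zip(u, s)]


# Toy gene: cells on the steady-state line u = 0.5 * s, except for a group of
# cells with excess unspliced RNA, as expected during induction.
spliced = [float(i) for i in range(100)]
unspliced = [0.5 * x + (5.0 if 40 <= x <= 60 else 0.0) for x in spliced]
velocities = steady_state_velocity(unspliced, spliced)
```

In this toy example, cells above the fitted line receive a positive velocity (the gene is being upregulated), while cells on the line have zero velocity. Note that only the extreme cells enter the fit and only the ratio is estimated.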

Even though the *steady-state model* can successfully recover the developmental direction in some systems, it is inherently limited by its model assumptions. The two assumptions readily violated are the common splicing rate across genes and that the equilibria are observed during the experiment. Consequently, inference in these cases will yield incorrect results. Additionally, the *steady-state model* only considers a subset of the data, and only the steady-state ratio but not each model parameter is inferred.

### The *EM model*

To overcome the limitations of the *steady-state model*, several extensions have been proposed. The most popular one so far is the *EM model* implemented in scVelo __<span style="color: red;">[CITE]</span>__. The *EM model* no longer assumes that steady states have been reached or that genes share a common splicing rate. Additionally, all data points are used to infer the full set of parameters as well as a gene- and cell-specific latent time of the splicing model. The algorithm uses an expectation-maximization (EM) framework to estimate parameters. The unobserved variables found in the E-step consist of each cell's time and state (induction, repression, or steady state). All other model parameters are inferred during the M-step.
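To illustrate the idea behind the E-step, the sketch below assigns a latent time to each cell for one gene, given fixed rates. It is a deliberately simplified illustration and not scVelo's implementation: only the induction phase starting from $u = s = 0$ is considered, the time is found by a brute-force search over a grid, and the function names, the grid settings, and the rate values are assumptions.

```python
import math


def induction_curve(t, alpha, beta, gamma):
    """Induction-phase solution starting from u = s = 0 (needs beta != gamma)."""
    exp_b = math.exp(-beta * t)
    exp_g = math.exp(-gamma * t)
    u = alpha / beta * (1.0 - exp_b)
    s = alpha / gamma * (1.0 - exp_g) + alpha / (gamma - beta) * (exp_g - exp_b)
    return u, s


def assign_latent_time(u_obs, s_obs, alpha, beta, gamma, t_max=20.0, n_grid=2000):
    """For each cell, pick the grid time whose model state is closest in (u, s)."""
    grid = [t_max * i / (n_grid - 1) for i in range(n_grid)]
    curve = [induction_curve(t, alpha, beta, gamma) for t in grid]
    times = []
    for u, s in zip(u_obs, s_obs):
        best = min(
            range(n_grid),
            key=lambda i: (curve[i][0] - u) ** 2 + (curve[i][1] - s) ** 2,
        )
        times.append(grid[best])
    return times


# Noise-free toy cells generated at known times with assumed rates.
true_times = [1.0, 3.0, 7.0]
cells = [induction_curve(t, 2.0, 1.0, 0.5) for t in true_times]
latent = assign_latent_time([c[0] for c in cells], [c[1] for c in cells], 2.0, 1.0, 0.5)
```

In a full EM scheme, such a time assignment would alternate with an M-step that updates the rates given the current times, and each cell would additionally be assigned to the induction, repression, or steady state.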

While the *EM model* no longer relies on key assumptions of the *steady-state model* and, thus, is more broadly applicable, the inferred RNA velocity may still violate prior biological knowledge __<span style="color: red;">[CITE]</span>__. The reasons for such failure cases are mainly twofold: On the one hand, the *EM model* continues to assume constant rates. Consequently, whenever this assumption does not hold, for example in erythroid maturation __<span style="color: red;">[CITE]</span>__, the inference is incorrect. On the other hand, the proposed model relies on phase portraits, just like its predecessor. As such, the algorithm fails whenever the gene phase portraits do not follow the expected shape.

## Key takeaways

To understand if RNA velocity analysis is applicable to a given dataset, we remark the following points:

1. To infer RNA velocity, the time scale of the developmental process under investigation must be comparable to the half-life of RNA molecules. This requirement is, for example, met in pancreatic endocrinogenesis __<span style="color: red;">[CITE]</span>__ but not in long-term diseases such as Alzheimer's or Parkinson's disease. Similarly, RNA velocity analysis is not applicable to steady-state systems, such as peripheral blood mononuclear cells, that lack any transitions between (mature) cell types.
2. RNA velocity can only be inferred robustly and reliably if the underlying model assumptions (approximately) hold true. To check the assumptions, the phase portraits can be studied to verify that they exhibit the expected almond shape. If a gene exhibits multiple, pronounced kinetics, RNA velocity analysis should be applied with caution and the data possibly subsetted to individual lineages.
3. Classically, the high-dimensional RNA velocity vectors have been visualized by projecting them onto a low-dimensional representation of the data. This approach for verifying hypotheses can be erroneous and misleading as the projected velocity stream is highly dependent on (1) the number of included genes and (2) the chosen plotting parameters. Additionally, the projection quality decreases at the boundary of the low-dimensional embedding __<span style="color: red;">[CITE]</span>__.

## New directions

Although RNA velocity has been applied successfully to many systems, some model limitations persist. Violated model assumptions may cause erroneous results __<span style="color: red;">[CITE]</span>__, and projecting the high-dimensional velocity vectors onto a low-dimensional representation of the data can be misleading. To overcome these pitfalls, several tools have been developed. CellRank __<span style="color: red;">[CITE]</span>__, for example, uses the inferred velocity field to infer likely future states of a cell. As the algorithm operates on the higher-dimensional representation of the data, misleading velocity streams on embeddings are circumvented. In contrast, a recent publication tries to improve the quality of the lower-dimensional embedding __<span style="color: red;">[CITE]</span>__.

To relax the current assumptions of RNA velocity inference, several new approaches have been suggested __<span style="color: red;">[CITE]</span>__. For example, these methods no longer assume constant rates __<span style="color: red;">[CITE]</span>__, work with raw counts __<span style="color: red;">[CITE]</span>__, or reformulate the inference in a variational inference framework to associate uncertainty with estimates. Additionally, different procedures have been proposed to help assess whether RNA velocity can be inferred for individual genes or entire datasets __<span style="color: red;">[CITE]</span>__.

## References

```{bibliography}
:filter: docname in docnames
```