# Lineage tracing data analysis

Cellular lineages are ubiquitous in biology. Perhaps the most famous example is that of embryogenesis: the process by which an organism like a human being is generated from a single cell, the fertilized egg. During this process, successive cell divisions give rise to daughter cells and, over time, entire "lineages" that can take on specialized roles within the developing embryo. The amazing complexity of this process has captured the imaginations of scientists for thousands of years, and over the past century and a half our understanding of it has been bolstered by the development of new "lineage tracing" technologies for visualizing and characterizing it {cite}`Woodworth2017`.

In this chapter, we provide a brief overview of these new technologies and focus on the analytical pipelines available to researchers. We pay special attention to CRISPR/Cas9-based "evolving" lineage tracing technologies, though several other useful alternatives exist; for a more complete view, we refer the interested reader to the excellent reviews by Wagner & Klein {cite}`wagner2018`, McKenna & Gagnon {cite}`mckenna`, and VanHorn & Morris {cite}`VanHorn2021`.

## Lineage tracing technologies

The goal of all lineage tracing techniques is to infer developmental relationships between observed cells. In this, there are two major variables to consider: scale and resolution. Classical approaches relied heavily on visual observation: for example, in the 1970s Sulston and colleagues derived the first developmental lineage of the nematode _C. elegans_ by meticulously watching cell divisions under a microscope {cite}`Sulston1973`. While greatly impressive, such approaches cannot scale to complex organisms with more stochastic developmental lineages.

Over the past two decades, the development of revolutionary sequencing assays and microfluidic devices has contributed to a tremendous investment in new lineage tracing methodology. To digest the plethora of techniques, it is helpful to classify approaches as "prospective" or "retrospective". Generally speaking, prospective lineage tracing approaches use a heritable marker to trace a clone, that is, the progeny of a single cell. Retrospective lineage tracing approaches, on the other hand, use variability between observed cells (such as genetic mutations) to infer a model of the lineage, or "phylogeny", summarizing the cell division history of a clonal population.

Prospective lineage tracing technologies use heritable marks to offer a scalable approach for tracking the clonal dynamics in a cell population. To date, there are several ways to introduce heritable marks into single-cell clonal progenitors both _in vivo_ and _in vitro_. The first collection of approaches uses recombinases driven by tissue-specific promoters or an inducible switch to activate fluorescent markers that can be used to discern between clonal populations (i.e., cells sharing the same fluorescent color descended from the same clonal progenitor). A more scalable solution introduces DNA barcodes (e.g., by lentiviral transduction) into clonal progenitors; these barcodes can then be read out at the end of an experiment to deconvolve clonal identities.
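To make the barcode readout concrete, the following minimal sketch groups cells into clones by shared barcode. The cell names, barcode sequences, and the `group_into_clones` helper are hypothetical and chosen only for illustration; real data would also require handling sequencing errors and cells carrying multiple barcodes.

```python
from collections import defaultdict

# Hypothetical readout: cell identifier -> barcode sequence detected in that cell.
cell_barcodes = {
    "cell_1": "ACGTTG",
    "cell_2": "ACGTTG",
    "cell_3": "TTAGCA",
    "cell_4": "ACGTTG",
    "cell_5": "TTAGCA",
    "cell_6": "GGCATC",
}


def group_into_clones(cell_to_barcode):
    """Group cells that share a barcode; each group is treated as one clone."""
    clones = defaultdict(list)
    for cell, barcode in cell_to_barcode.items():
        clones[barcode].append(cell)
    return dict(clones)


clones = group_into_clones(cell_barcodes)
clone_sizes = {barcode: len(cells) for barcode, cells in clones.items()}
print(clone_sizes)
```

Note that this readout only says which cells belong to the same clone; it carries no information about how cells within a clone are related to one another.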

Retrospective lineage tracers provide an additional advantage over prospective tracers in that they can report on *sub*clonal relationships between cells. Traditionally, this has been done by leveraging natural genetic variation between cells to reconstruct a cell division history: this approach is still widely and successfully used to study human tumors or tissue developmental histories. In experimental models, there are opportunities to recapitulate the advantages of retrospective tracers by engineering evolvable lineage tracers. Such evolving tracers are typically built by engineering cells with a "scratchpad" (or, synonymously, "target site") that can acquire mutations and using a system like Cas9 to introduce variability at this scratchpad. In this way, cellular lineages acquire heritable mutations over time, which can be used to infer phylogenies representing a model of the cell lineage.
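The idea of an evolving tracer can be illustrated with a toy simulation. This is a sketch under strong simplifying assumptions, not a model of any particular technology: divisions are synchronous, every site is cut with the same probability `p_cut` per division, indels are drawn uniformly from `n_indels` arbitrary labels, and an edited site is never edited again. All names and parameter values are hypothetical.

```python
import random

UNCUT = 0  # assumed encoding for an unedited cut site


def divide(scratchpad, rng, p_cut=0.2, n_indels=5):
    """Return a daughter's scratchpad: inherited edits plus possible new ones."""
    daughter = list(scratchpad)
    for site, state in enumerate(daughter):
        # Only unedited sites can be cut; existing indels are inherited unchanged.
        if state == UNCUT and rng.random() < p_cut:
            daughter[site] = rng.randint(1, n_indels)
    return daughter


def simulate(n_generations=4, n_sites=6, seed=0):
    """Simulate synchronous divisions and return a list of generations.

    Generation g holds 2**g scratchpads; cell i in generation g is the
    parent of cells 2*i and 2*i + 1 in generation g + 1.
    """
    rng = random.Random(seed)
    generations = [[[UNCUT] * n_sites]]
    for _ in range(n_generations):
        next_gen = []
        for parent in generations[-1]:
            next_gen.append(divide(parent, rng))
            next_gen.append(divide(parent, rng))
        generations.append(next_gen)
    return generations


generations = simulate()
for cell in generations[-1][:4]:
    print(cell)
```

Because edits are inherited and irreversible in this toy model, cells that share an indel at a site are likely to share an ancestor in which that edit occurred, which is the signal that tree reconstruction exploits.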

For both prospective and retrospective approaches, investigators have leveraged cutting-edge single-cell assays to provide additional molecular characterization of cell populations. Most common has been the utilization of single-cell RNA-seq (scRNA-seq) to measure the functional state of single cells in parallel to their lineage relationships. This multimodal readout has created opportunity for new computational methodologies, which we detail below. 

As stated above, in this chapter we provide a detailed walkthrough on the analysis of data from evolving CRISPR/Cas9-based lineage tracers. However, there exist other excellent resources for the analysis of prospective barcoding systems, for example CoSpar {cite}`CoSpar` which additionally provides a tutorial on this type of analysis (available at https://cospar.readthedocs.io/en/latest/).  

## Overview of analysis pipelines (Matt)

Before delving into the analysis of example datasets, here we will provide an overview of the analytical pipeline with a focus on CRISPR/Cas9-based evolving tracers. With these systems, analysis begins with raw sequencing data from an **amplicon** library of the Cas9 target sites, generated on a conventional scRNA-seq platform like 10X Chromium. Depending on the technology at hand, each sequenced amplicon will be 150 to 300 bp long and contain one or more Cas9 cut sites. Each of these cut sites can be targeted and cut by Cas9, which leaves a heritable insertion or deletion (i.e., "indel"), and analysts typically discern between "cut" and "uncut" sites. While this preprocessing step is very important, it requires a very extensive treatment.

An analysis framework will begin by processing these raw reads to summarize the observed mutations in each cell in a data structure called a **character matrix** (denoted by $\chi$). In this data structure, each row is a cell, each column is a Cas9 cut site, and the (row, column) values are categorical variables representing the identity of the indel observed in that cell at that particular cut site. In phylogenetic terms, we refer to the cells as "samples", the cut sites as "characters", and the matrix entries as "character states". In most evolving lineage tracing applications, this matrix has roughly 1,000 rows (cells) and 50 columns (cut sites).
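As a minimal sketch of this data structure, the snippet below encodes per-cell indel calls as integer character states. The indel labels, the `build_character_matrix` helper, and the encoding (0 for an uncut site, -1 for a site with no data, positive integers for indels) are assumptions made for illustration; a real pipeline would typically store the result in a table such as a pandas `DataFrame`.

```python
UNCUT = 0      # assumed code for an unedited site
MISSING = -1   # assumed code for a site with no data in a cell

# Hypothetical indel calls: cell -> {cut site -> indel label or "uncut"}.
# A site absent from a cell's dictionary is treated as missing data.
calls = {
    "cell_A": {"site_1": "2D", "site_2": "uncut", "site_3": "1I"},
    "cell_B": {"site_1": "2D", "site_2": "5D"},
    "cell_C": {"site_1": "uncut", "site_2": "5D", "site_3": "1I"},
}
sites = ["site_1", "site_2", "site_3"]


def build_character_matrix(calls, sites):
    """Encode indel labels as integer character states, one row per cell."""
    state_codes = {site: {} for site in sites}  # per-site: label -> integer
    matrix = {}
    for cell in sorted(calls):
        row = []
        for site in sites:
            label = calls[cell].get(site)
            if label is None:
                row.append(MISSING)
            elif label == "uncut":
                row.append(UNCUT)
            else:
                codes = state_codes[site]
                # A new indel at this site gets the next unused positive integer.
                row.append(codes.setdefault(label, len(codes) + 1))
        matrix[cell] = row
    return matrix, state_codes


character_matrix, state_codes = build_character_matrix(calls, sites)
for cell, row in character_matrix.items():
    print(cell, row)
```

States are numbered separately for each site, so the same integer in two different columns does not imply the same indel.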

At this point, this data structure abstracts away the technicalities of the evolving lineage tracing assay and opens up the opportunity to apply one of several tools from classical phylogenetic inference to learn a phylogenetic tree over the cells. Specifically, the goal is to learn a hierarchical tree structure ($\mathcal{T}$) over the cells in our character matrix. In this tree, each node represents a cell (observed or ancestral) and each edge represents a parent-child relationship. Importantly, we have usually observed only the _leaves_ (denoted by $\mathcal{L}$) of the tree, and we refer to the unobserved internal nodes as _ancestral_ nodes.
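The distinction between leaves and ancestral nodes can be sketched with a very small tree. The representation below (a dictionary mapping each internal node to its children) and the node names are hypothetical; dedicated libraries offer richer tree objects.

```python
# Hypothetical rooted tree stored as parent -> list of children.
# Nodes that never appear as keys have no children and are therefore leaves.
tree = {
    "root": ["anc_1", "cell_C"],
    "anc_1": ["cell_A", "cell_B"],
}


def leaves(tree, node="root"):
    """Return the observed cells (leaves) below `node`, left to right."""
    children = tree.get(node, [])
    if not children:
        return [node]
    found = []
    for child in children:
        found.extend(leaves(tree, child))
    return found


def internal_nodes(tree):
    """Return the unobserved ancestral nodes, i.e., nodes with children."""
    return [node for node, children in tree.items() if children]


print("leaves:", leaves(tree))
print("ancestral nodes:", internal_nodes(tree))
```

Here `cell_A` and `cell_B` share the ancestor `anc_1`, which they do not share with `cell_C`; only the three cells would be present in the character matrix.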

There are many algorithmic choices for inferring the phylogenetic tree $\mathcal{T}$ from the character matrix $\chi$, yet they can be broadly divided into "character-based" and "distance-based" approaches. Character-based approaches perform a combinatorial search through possible tree topologies while seeking to optimize a function over the characters (e.g., the likelihood of the evolutionary history given the mutations observed in the characters). On the other hand, distance-based approaches (like Neighbor-Joining) use a notion of cell-cell distances (denoted by $\delta$) to infer a phylogenetic tree and typically run in polynomial time. While distance-based approaches can run much faster, they require one to iteratively find the best cell-cell dissimilarity function, which can be equally time consuming.
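To illustrate the distance-based family, the sketch below computes a simple dissimilarity $\delta$ between cells and builds a tree by average-linkage clustering (UPGMA). UPGMA is used here in place of Neighbor-Joining only because it is shorter to write out. The dissimilarity function (the fraction of mutually observed sites whose states differ), the missing-data code of -1, and the example matrix are assumptions for illustration and are not the choices of any specific tool.

```python
from itertools import combinations

MISSING = -1  # assumed code for a site with no data


def dissimilarity(a, b):
    """Fraction of sites, observed in both cells, whose states differ."""
    compared = [(x, y) for x, y in zip(a, b) if x != MISSING and y != MISSING]
    if not compared:
        return 0.0
    return sum(x != y for x, y in compared) / len(compared)


def upgma(matrix):
    """Average-linkage clustering; returns the tree as nested tuples of cell names."""
    cells = list(matrix)
    # Pairwise dissimilarities between individual cells.
    delta = {
        frozenset(pair): dissimilarity(matrix[pair[0]], matrix[pair[1]])
        for pair in combinations(cells, 2)
    }
    # Each cluster is (subtree, list of member cells).
    clusters = [(cell, [cell]) for cell in cells]

    def cluster_distance(u, v):
        pairs = [(x, y) for x in u[1] for y in v[1]]
        return sum(delta[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Merge the two clusters with the smallest average dissimilarity.
        u, v = min(combinations(clusters, 2), key=lambda uv: cluster_distance(*uv))
        clusters = [c for c in clusters if c is not u and c is not v]
        clusters.append(((u[0], v[0]), u[1] + v[1]))
    return clusters[0][0]


# Hypothetical character matrix: cell -> character states at four cut sites.
character_matrix = {
    "cell_A": [1, 2, 0, 0],
    "cell_B": [1, 2, 3, 0],
    "cell_C": [1, 0, 0, 4],
    "cell_D": [1, 0, -1, 4],
}
tree = upgma(character_matrix)
print(tree)
```

In this example `cell_C` and `cell_D` agree at every site observed in both and are joined first, followed by `cell_A` and `cell_B`. Changing `dissimilarity` (for instance, to ignore shared uncut sites) can change the resulting tree, which is why the choice of $\delta$ matters so much for this family of methods.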

After tree inference has been performed, there are several options for downstream analysis. For example, one can learn about the rates of cell state changes across the developmental history or the relative propensities of cells to divide in a population. Below, we will demonstrate via code examples across two major case studies how these different components fit together to gain fundamental insights into dynamic processes. 

## Environment setup
We'll focus on using Cassiopeia as the main analysis engine, together with third-party tools for tree analysis (e.g., moscot).
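Before running the case studies, it can help to confirm that the required packages are present in the active environment. The sketch below assumes the top-level import names `cassiopeia` and `moscot`; installation commands and version requirements are not shown here and should be taken from each project's own documentation.

```python
from importlib.util import find_spec

# Assumed top-level import names; check each project's installation guide.
REQUIRED = ["cassiopeia", "moscot"]


def missing_packages(names):
    """Return the top-level modules that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All required packages were found.")
```

`find_spec` only locates the modules without importing them, so the check is quick and has no side effects.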

# Case Study: Tracing tumor development in a mouse model of lung cancer (Matt)
We'll provide brief background on the dataset. We'll use the study presented in [Yang et al., Cell (2022)](https://www.cell.com/cell/pdf/S0092-8674(22)00462-7.pdf).

## Preprocessing raw data
We'll discuss the major preprocessing steps needed to go from raw data to the character matrices that will be used for tree reconstruction.

## Reconstructing lineages
We'll discuss the algorithms available for tree reconstruction and the pros and cons of each. We'll also detail best practices for tree reconstruction.

## Interpreting tree structure
We'll demonstrate useful approaches for quantifying interesting properties on trees (e.g., expansion, fitness)
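As one simple, hypothetical example of a tree property, the sketch below counts the leaves under every node and reports each clade's share of all sampled cells, a crude proxy for how much a subclone has expanded. The tree, the node names, and the `clade_sizes` helper are illustrative assumptions and not the statistics used in the study above.

```python
# Hypothetical rooted tree stored as parent -> list of children.
tree = {
    "root": ["anc_1", "cell_E"],
    "anc_1": ["anc_2", "cell_D"],
    "anc_2": ["cell_A", "cell_B", "cell_C"],
}


def clade_sizes(tree, node="root", sizes=None):
    """Count the leaves below every node reachable from `node`."""
    if sizes is None:
        sizes = {}
    children = tree.get(node, [])
    if not children:
        sizes[node] = 1  # a leaf counts itself
    else:
        sizes[node] = sum(clade_sizes(tree, c, sizes)[c] for c in children)
    return sizes


sizes = clade_sizes(tree)
n_cells = sizes["root"]
for node in tree:  # internal nodes only
    print(node, sizes[node], round(sizes[node] / n_cells, 2))
```

A clade that holds a large share of cells relative to its sister clades is a candidate for expansion, although a proper analysis would also need to account for tree depth and sampling.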

## Learning from the tree
We'll discuss how one can integrate transcriptomic data to derive insights into evolutionary patterns. Topics to discuss are:

### Plasticity



### Coupling analysis

# Additional tools for integrating with transcriptomic data (Zoe)

# Conclusions, more resources, future directions (Matt / Zoe)

## Benchmarking / Simulations

## Recap on Dos & Don'ts