# Lineage tracing data analysis

Cellular lineages are ubiquitous in biology. Perhaps the most famous example is embryogenesis: the process by which an organism like a human being is generated from a single cell, the fertilized egg. During this process, successive cell divisions give rise to daughter cells and, over time, entire "lineages" that take on specialized roles within the developing embryo. The amazing complexity of this process has captured the imaginations of scientists for thousands of years, and over the past century and a half our understanding of it has been bolstered by new "lineage tracing" technologies for visualizing and characterizing this process and, more recently, by high-throughput sequencing assays {cite}`Woodworth2017`.

\ZP{not sure how to phrase it best but maybe add a _bottom line_ sentence here, e.g.  **These methods allow the study of cell differentiation trajectories while providing a record of cell state changes**.}

Analogously to the sequencing revolutions of the early 2000s, the marriage of single-cell assays and lineage tracing approaches has yielded an exponential growth in the complexity of datasets. As such, there has been a strong need for new computational methodology for processing these datasets {cite}`DreamChallenge`. Over the past half decade, the field has drawn heavily on the population genetics literature, producing an exciting confluence of traditional concepts in evolutionary biology with cutting-edge genome engineering techniques.

In this chapter, we provide a brief overview of these new technologies and focus on the analytical pipelines available to researchers. We pay special attention to CRISPR/Cas9-based "evolving" lineage tracing technologies, though several other highly useful alternatives exist; for a more complete view, we refer the interested reader to the excellent reviews by Wagner & Klein {cite}`wagner2018` (\ZP{wouldnt it be the 2020 review?}), McKenna & Gagnon {cite}`mckenna`, and VanHorn & Morris {cite}`VanHorn2021`.

\ZP{suggestion: The given example relates to a CRISPR/Cas9-based "evolving" lineage tracing setting. However, there exist several other greatly useful alternatives; for a more complete view, we refer the interested reader to the excellent reviews by Wagner & Klein {cite}`wagner2018` (\ZP{wouldnt it be the 2020 review?}), Mckenna & Gagnon {cite}`mckenna`, and VanHorn & Morris {cite}`VanHorn2021` }


\ZP{maybe we should also refer to recent reviews which emphasize the power and strength of LT along with current limitations?, e.g. {cite}[`rodriguez2022`](https://doi.org/10.1242/dev.200877),{cite}[`mukhopadhyay2022`](https://www.nature.com/articles/s41592-021-01370-6) }

## Lineage tracing technologies

The goal of all lineage tracing techniques is to infer developmental relationships between observed cells. In this, there are two major variables to consider: scale and resolution. Classical approaches relied heavily on visual observation: for example, in the 1970s Sulston and colleagues derived the first developmental lineage of the nematode _C. elegans_ by meticulously watching cell divisions under a microscope {cite}`Sulston1973`. While greatly impressive, such approaches cannot scale to complex organisms with more stochastic developmental lineages.

Over the past two decades, the development of revolutionary sequencing assays and microfluidic devices has contributed to a tremendous investment in new lineage tracing methodology. To digest the plethora of techniques, it is helpful to classify approaches as "_prospective_" or "_retrospective_". Generally speaking, prospective lineage tracing approaches use a heritable marker to trace a clonal population (i.e., all the descendants of a single cell). On the other hand, retrospective lineage tracing approaches use variability between observed cells (such as naturally occurring genetic mutations) to infer a model of the lineage, or "_phylogeny_", summarizing the cell division history of a clonal population.

Several approaches have been used for prospectively tracing a clonal population: for example, recombinases under a tissue-specific promoter can be used to activate fluorescent markers that act as heritable marks for a specific tissue lineage {cite}`brainbow`. Alternatively, lentiviral transduction can be used to integrate random DNA barcodes into cellular genomes to provide a heritable mark that can be used to deconvolve clonal identities with a sequencing readout. Though these approaches are highly scalable and often do not require heavy genome engineering, they can only report properties at the clonal level, such as clone size and composition.
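To make the clonal-level readout concrete, the following minimal sketch tallies clone size and composition from static barcodes. The cell identifiers, barcode sequences, and cell-type labels are hypothetical values invented for illustration; they do not come from any dataset discussed in this chapter.

```python
from collections import Counter, defaultdict

# Hypothetical observations: (cell_id, clone barcode, annotated cell type).
observations = [
    ("cell_1", "ACGTTG", "alveolar"),
    ("cell_2", "ACGTTG", "alveolar"),
    ("cell_3", "ACGTTG", "basal"),
    ("cell_4", "TTGACA", "basal"),
    ("cell_5", "TTGACA", "basal"),
    ("cell_6", "GGCATC", "alveolar"),
]

# Clone size: the number of cells sharing a barcode.
clone_sizes = Counter(barcode for _, barcode, _ in observations)

# Clone composition: cell-type counts within each clone.
clone_composition = defaultdict(Counter)
for _, barcode, cell_type in observations:
    clone_composition[barcode][cell_type] += 1

for barcode, size in clone_sizes.most_common():
    print(barcode, size, dict(clone_composition[barcode]))
```

Note that nothing in this readout distinguishes how the cells within one clone relate to each other, which is the limitation described above.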

Retrospective lineage tracers provide an additional advantage over prospective tracers in that they can report on *sub*clonal relationships reflecting how subpopulations in a clone evolve over time. Traditionally, this has been done by leveraging natural genetic variation between cells to reconstruct a cell division history. While this approach is still widely and successfully used to study human tumors or tissue developmental histories, experimentalists have little control over how often or where mutations occur. In experimental models, there are opportunities to recapitulate the advantages of retrospective tracers while mitigating these caveats by engineering evolvable lineage tracers. Such evolving tracers typically rely on engineering cells to carry a "scratchpad" (or, synonymously, "target site") that can acquire mutations. For example, a popular approach that this chapter focuses on uses Cas9 to introduce insertions and deletions (i.e., "indels") at the target site. In this way, cellular lineages acquire heritable mutations over time that can be subsequently read out with high-throughput sequencing platforms and used to infer phylogenies representing a model of the cell lineage.
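The idea of an evolving tracer can be illustrated with a toy simulation. The sketch below is a deliberately simplified model, not the behavior of any particular technology: it assumes synchronous divisions, a single per-site mutation rate, uniformly drawn indel identities, and irreversible edits. The function name `simulate_lineage` and all parameter values are hypothetical.

```python
import random


def simulate_lineage(n_generations, n_sites, mutation_rate, n_states, seed=0):
    """Simulate synchronous cell divisions with irreversible edits at target sites.

    State 0 means "unedited"; states 1..n_states stand in for distinct indels.
    """
    rng = random.Random(seed)
    cells = [[0] * n_sites]  # start from a single unedited cell
    for _ in range(n_generations):
        next_generation = []
        for parent in cells:
            for _ in range(2):  # each cell gives rise to two daughters
                daughter = list(parent)  # daughters inherit the parent's edits
                for site in range(n_sites):
                    # Only unedited sites can be cut; an edit is permanent.
                    if daughter[site] == 0 and rng.random() < mutation_rate:
                        daughter[site] = rng.randint(1, n_states)
                next_generation.append(daughter)
        cells = next_generation
    return cells


leaves = simulate_lineage(n_generations=4, n_sites=5, mutation_rate=0.2, n_states=10)
print(len(leaves))  # 2 ** 4 = 16 cells
for cell in leaves[:4]:
    print(cell)
```

Because edits are inherited and never overwritten in this model, cells that share a recent ancestor tend to share more edits, which is the signal that tree reconstruction exploits.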

To note, both classes of lineage-tracing approaches can take advantage of adjacent advances in single-cell multiomic profiling. For example, investigators have routinely used single-cell RNA-seq (scRNA-seq) to read out simultaneously the functional state of single cells and their lineage relationships. This multimodal readout has created opportunity for new computational methodologies, which we detail below. 

As stated above, in this chapter we provide a detailed walkthrough on the analysis of data from evolving CRISPR/Cas9-based lineage tracers.

\ZP{should we (i) add more citations\useful references here along text? 
(ii) at the end refer to CoSpar \ LARRY for examples in different setting?}

## Overview of evolving lineage tracing data analysis pipelines

Before delving into the analysis of our example dataset, we will provide an overview of the analytical pipeline for evolving tracers. With these systems, analysis begins with raw sequencing data of an _amplicon_ library of target sites (often derived from a conventional scRNA-seq platform like 10X Chromium). Depending on the technology at hand, each sequenced amplicon will be between 150 and 300 bp long; in the case of CRISPR/Cas9-based evolving tracers, each read will contain one or more Cas9 cut sites. Among other preprocessing tasks, analysts must align the reads to a reference sequence and identify any mutations (e.g., indels).

As one might expect, the preprocessing of these datasets is a critical step that requires detailed treatment. In the interest of space, we focus on analysis pipelines after raw sequencing data has been preprocessed and refer the reader to an external tutorial which can be found here: https://cassiopeia-lineage.readthedocs.io/en/latest/notebooks/preprocess.html.  

In most analysis frameworks, the preprocessing of the raw sequencing reads produces a data structure called the **character matrix** (denoted by $\chi$) that summarizes the observed mutations in each cell across the target sites. In this data structure, each row is a cell (or "sample"), each column is a target site (or "character"), and the (row, column) values are categorical variables representing the identity of the indel observed in that cell at that particular cut site (or "character-state"). Depending on the technology at hand, these character matrices can report on anywhere between 100 and 10,000 samples across up to 100 characters.
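As a rough illustration of this data structure, the sketch below builds a tiny character matrix in plain Python and summarizes each character. The cell names, site names, and state values are invented, and the encoding (0 for unedited, positive integers for distinct indels, -1 for missing data) is an assumption made for this example; in practice such a matrix is usually held in a `pandas.DataFrame`.

```python
MISSING = -1  # assumed missing-data code for this sketch

# Rows are cells ("samples"); columns are target sites ("characters").
characters = ["site_1", "site_2", "site_3", "site_4"]
character_matrix = {
    "cell_A": [1, 0, 2, 0],
    "cell_B": [1, 0, 2, 3],
    "cell_C": [0, 4, 0, 0],
    "cell_D": [0, 4, 5, MISSING],
}


def summarize_character(matrix, column):
    """Return (fraction edited, number of distinct indels, fraction missing)."""
    states = [row[column] for row in matrix.values()]
    observed = [s for s in states if s != MISSING]
    edited = [s for s in observed if s != 0]
    fraction_edited = len(edited) / len(observed) if observed else 0.0
    fraction_missing = 1 - len(observed) / len(states)
    return fraction_edited, len(set(edited)), fraction_missing


for index, name in enumerate(characters):
    print(name, summarize_character(character_matrix, index))
```

Summaries like these are a quick sanity check before tree inference: characters that are almost never edited, or almost always missing, carry little phylogenetic signal.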

At this point, this data structure abstracts away the technicalities of the evolving lineage tracing assay and opens up the opportunity to apply one of several tools from classical phylogenetic inference to learn a **phylogenetic tree** ($\mathcal{T}$) over the cells. Specifically, the goal is to learn a hierarchical tree structure over each of the cells in $\chi$ (our character matrix). In this tree, each node represents a sample and each edge represents a relationship. Importantly, we often have only observed the _leaves_ (denoted by $\mathcal{L}$) of the tree and we refer to any of the unobserved set of internal nodes as _ancestral_ nodes.
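The distinction between observed leaves and unobserved ancestral nodes can be sketched with a small hand-written tree. The node names and topology below are hypothetical, and the parent-to-children dictionary is just one convenient representation chosen for this example.

```python
# Directed edges point from parent to children; node names are hypothetical.
tree = {
    "root": ["anc_1", "anc_2"],
    "anc_1": ["cell_A", "cell_B"],
    "anc_2": ["cell_C", "cell_D"],
}


def get_leaves(tree, node):
    """Collect the leaves (observed cells) in the subtree below `node`."""
    children = tree.get(node, [])
    if not children:
        return [node]  # a node without children is a leaf
    leaves = []
    for child in children:
        leaves.extend(get_leaves(tree, child))
    return leaves


leaves = get_leaves(tree, "root")
ancestral_nodes = sorted(tree)  # every node with children is unobserved

print(leaves)
print(ancestral_nodes)
print(get_leaves(tree, "anc_1"))  # the clade descending from one ancestor
```

The last call shows how a single ancestral node defines a clade: the set of observed cells that descend from it.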

There are many algorithmic choices for inferring the phylogenetic tree ($\mathcal{T}$) from the character matrix $\chi$, yet these can be broken up into "character-based" and "distance-based" approaches:
- Character-based: perform a combinatorial search through all possible tree topologies while seeking to optimize a function over the characters (e.g., the likelihood of the evolutionary history given the mutations observed in the characters).
- Distance-based (e.g., Neighbor-Joining): use a notion of cell-cell distances (denoted by $\delta$) to infer a phylogenetic tree; these methods typically run in polynomial time. While distance-based approaches can be much faster, they require choosing an appropriate cell-cell dissimilarity function, and iterating over candidate functions can be equally time-consuming.
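The distance-based idea can be sketched in a few lines. Note that this example substitutes UPGMA (average-linkage clustering) for Neighbor-Joining because it is shorter to write out, and it uses a naive dissimilarity (the fraction of jointly observed characters at which two cells differ). Both choices, the toy matrix, and the -1 missing-data code are assumptions for illustration only.

```python
from itertools import combinations

MISSING = -1  # assumed encoding of missing data

character_matrix = {
    "cell_A": [1, 0, 2, 0],
    "cell_B": [1, 0, 2, 3],
    "cell_C": [0, 4, 0, 0],
    "cell_D": [0, 4, 5, MISSING],
}


def dissimilarity(a, b):
    """Fraction of jointly observed characters at which two cells differ."""
    shared = [(x, y) for x, y in zip(a, b) if x != MISSING and y != MISSING]
    if not shared:
        return 1.0
    return sum(x != y for x, y in shared) / len(shared)


def upgma(matrix):
    """Average-linkage clustering; returns the tree as nested tuples of cell names."""
    # Each cluster is keyed by its (nested) subtree and stores its member cells.
    clusters = {name: [name] for name in matrix}
    while len(clusters) > 1:

        def average_distance(pair):
            left, right = pair
            total = sum(
                dissimilarity(matrix[i], matrix[j])
                for i in clusters[left]
                for j in clusters[right]
            )
            return total / (len(clusters[left]) * len(clusters[right]))

        # Merge the two clusters with the smallest average dissimilarity.
        left, right = min(combinations(clusters, 2), key=average_distance)
        members = clusters.pop(left) + clusters.pop(right)
        clusters[(left, right)] = members
    return next(iter(clusters))


inferred_tree = upgma(character_matrix)
print(inferred_tree)
```

This naive dissimilarity treats two unedited sites as evidence of relatedness just as strongly as two shared indels, which is rarely desirable; weighing shared edits differently is exactly the kind of choice that makes selecting $\delta$ nontrivial.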

After tree inference has been performed, there are several options for downstream analysis. For example, one can learn about the rates of cell state changes across the developmental history or the relative propensities of cells to divide in a population. Below, we will demonstrate via code examples across two major case studies how these different components fit together to gain fundamental insights into dynamic processes. 

## Environment setup

In this tutorial, we will primarily make use of the `Cassiopeia` package {cite}`Jones2020` for lineage tracing analysis.

Before we begin this notebook's analysis, let's set up our environment.

```python
import cassiopeia as cas
import numpy as np
import pandas as pd
```

<div class="alert alert-block alert-warning" style="color:black;">
<b>Heads up!</b> 
If you're having trouble installing these packages, this banner will help. This banner will also provide a quick summary of common missteps. 
</div>


# Case Study: Tracing tumor development in a mouse model of lung cancer (Matt)
We'll provide a brief background on the dataset. We'll use the study presented in [Yang et al., Cell (2022)](https://www.cell.com/cell/pdf/S0092-8674(22)00462-7.pdf).

## Preprocessing raw data
We'll discuss the major preprocessing steps needed to go from raw data to the character matrices that will be used for tree reconstruction.

## Reconstructing lineages
We'll discuss the algorithms available for tree reconstruction and the pros and cons of each. We'll also detail best practices for tree reconstruction.

## Interpreting tree structure
We'll demonstrate useful approaches for quantifying interesting properties on trees (e.g., expansion, fitness)

## Learning from the tree
We'll discuss how one can integrate transcriptomic data to derive insights into evolutionary patterns. Topics to discuss are:

### Plasticity



### Coupling analysis

# Conclusions, more resources, future directions (Matt / Zoe)
First give an overview of the major applications of lineage tracing, review what we showed.

## Computational tools to interpret lineage tracing data

The increasing complexity of lineage tracing studies must be accompanied by computational methods that interpret the resulting data and extend the analysis beyond the construction of single-cell phylogenies. In particular, methods that integrate long-term lineage tracing with multi-omics measurements and temporal information enable powerful analyses of the molecular programs governing cell state, differentiation, and behavior {cite}[`mukhopadhyay2022`](https://www.nature.com/articles/s41592-021-01370-6).

As this field is in its infancy, the number of available tools is still limited, yet it is worth highlighting leading approaches:
- LineageOT {cite}[`forrow2021`](https://www.nature.com/articles/s41467-021-25133-1): A general-purpose method for inferring developmental trajectories from scRNA-seq time courses equipped with lineage information at each time point, applicable to the evolving CRISPR/Cas9-based setting. The method was proposed as an extension of the Waddington-OT {cite}[`schiebinger2019`](https://doi.org/10.1016/j.cell.2019.01.006) algorithm that takes lineage relationships into account when mapping cells from earlier to later time points. When computing transition matrices across pairs of time points, LineageOT corrects expression profiles in the later time point based on their lineage similarity. For more details and tutorials, we refer the reader to https://lineageot.readthedocs.io.
- CoSpar {cite}[`wang2022`](https://www.nature.com/articles/s41587-022-01209-1): A computational approach to infer cell dynamics from single-cell transcriptomics integrated with static barcoding lineage tracing data. The method relies on two basic assumptions about the nature of biological dynamics: (i) cells in similar states behave similarly, and (ii) cells limit their possible dynamics, giving sparse transitions. For more details and tutorials, we refer the reader to https://cospar.readthedocs.io/.


## Benchmarking / Simulations (Matt)
Discuss tools for simulating lineage data for benchmarking new tools. Potentially discuss simulating transcriptomes on top of lineages too? (e.g., TedSim)

## Recap on Dos & Don'ts (Matt & Zoe)