vignettes/qpadm.Rmd

---
title: "qpWave and qpAdm"
author: "Robert Maier"
date: "2022-11-12"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 2
vignette: >
  %\VignetteIndexEntry{qpWave and qpAdm}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---


```{r, warning = FALSE, message = FALSE}
library(magrittr)
library(tidyverse)
library(admixtools)
```

*qpWave* and *qpAdm* are programs which can answer a number of questions about the relationships of different populations, such as:

* Do populations form clades with respect to each other?
* What is the minimum number of admixture waves that separates two groups of populations?
* Can a population be modeled as a mix of a set of other populations, or would this miss an ancestry component?
* What are the relative proportions of the different admixing populations?

The first two questions can be answered by *qpWave*. The latter two questions are about a specific target population, and they can be answered by *qpAdm*, an extension of *qpWave*.

This page describes how both programs use $f_4$-statistics to address these questions.
The original description of *qpWave* and *qpAdm* which goes into more technical detail can be found in Supplementary Information section 10 in [this paper](https://www.nature.com/articles/nature14317).

## qpWave

### qpWave and f4

$f_4$-statistics can be used to test whether two pairs of populations ($L_1, L_2$ and $R_1, R_2$) form clades with respect to one another: If $f_4(L_1,L_2; R_1,R_2)$ is significantly different from 0, the two groups do not form a clade. If we have two groups with more than two populations each, and we want to know if the groups form two clades, we could compute all possible $f_4$-statistics and test if any of them deviate from zero. Except we would have to correct for multiple testing, which is not straight-forward in this case because the $f_4$-statistics are not independent of each other.

*qpWave* provides another way to test whether two groups of populations form two clades. We can empirically see that *qpWave* and $f_4$ are closely related, because in the special case of applying *qpWave* to only four populations, we get almost the same p-value as for an $f_4$-statistic of the four populations:

```{r}
L = c('Altai_Neanderthal.DG', 'Vindija.DG')
R = c('Chimp.REF', 'Mbuti.DG')
f4(example_f2_blocks, L[1], L[2], R[1], R[2])$p
qpwave(example_f2_blocks, L, R)$rankdrop$p
```

### Estimating the number of independent gene flows

Testing whether two pairs of populations form clades with each other is the same as asking whether there is only one gene flow event which separates the two pairs ($f_4 = 0$) or whether there must have been more than a single gene flow event between the two pairs ($f_4 != 0$).

In the graph on the left, $L1$ and $L2$ form a clade relative to $R1$ and $R2$, which means $L$ and $R$ are only connected through one gene flow event. On the right they do not form a clade, and they are connected through two independent gene flow events (one going through the root, and the other going through the edge `n1b|n2b`).

:::::: {style="display: flex"}

::: {style="width: 50%"}

```{r, echo = F, out.width="100%"}
matrix(c('R', 'n1', 'R', 'n2', 'n1', 'L1',
         'n1', 'L2', 'n2', 'R1', 'n2', 'R2'), , 2, byrow = T) %>%
  plotly_graph
```

:::

::: {style="width: 50%"}

```{r, echo = F, out.width="100%"}
matrix(c('R', 'n1', 'R', 'n2', 'n1', 'n1b', 'n1b', 'L2', 'n1b', 'n2b',
         'n1', 'L1', 'n2b', 'R1', 'n2', 'R2', 'n2', 'n2b'), , 2, byrow = T) %>%
  plotly_graph
```

:::

::::

*qpWave* allows us to estimate a lower bound on the number of gene flow events that separate two groups of populations $L$ and $R$ of size $n_L$ and $n_R$.

The basic idea behind it is that we can use $f_4$-statistics of the form $f_4(L_i, L_j; R_m, R_n)$ to estimate the genetic drift that is shared between pairs of left populations and pairs of right populations. There are $\frac{n_L(n_L-1)}{2} \frac{n_R(n_R-1)}{2}$ $f_4$-statistics of that form, but not all of them contain unique information. In the case where $L$ and $R$ form two clades, all $f_4$-statistics $f_4(L_i, L_j; R_m, R_n)$ are expected to be zero. If there are two independent gene flow events, a few of these $f_4$-statistics will be enough to determine what the other $f_4$-statistics should be. More generally, we can construct a matrix $X$ of these $f_4$-statistics, and the rank of $X$ tells us what the minimum number of gene flows is between $L$ and $R$. A high rank of $X$ suggests many independent gene flows between $L$ and $R$.

In practice, $X$ is constructed by fixing one population in $L$ and one population in $R$ which results in a matrix of dimensions $(n_L - 1) \times (n_R - 1)$ with all $f_4$-statistics of the form $f_4(L_1, L_i; R_1, R_j)$. By convention, $n_R > n_L$, which means that the highest possible rank of $X$ is $n_L - 1$. A rank of $n_L - 1$ suggests that at least $n_L - 1$ independent gene flows have occurred between $L$ and $R$.


### Estimating the rank of X

To estimate the rank of $X$ we test how well $X$ can be approximated by different low rank approximations. An approximation of $X$ with one rank less than full rank can be written as $AB$, where $A$ is $(n_L-1) \times (n_L-2)$, and $B$ is $(n_L-2) \times (n_R-1)$ and $AB$ should be as close as possible to $X$. In this case (one rank less than full rank), $X$ would have rank $n_L - 1$, and $AB$ would have rank $n_L - 2$. If this approximation works well, the residuals $E = X - AB$ will be very small. Large residuals suggest that the low rank approximation $AB$ doesn't capture all the information in $X$ and that a simple model with only few admixture events doesn't fit the data. How large these residuals can become before a model can be rejected depends on the standard errors of the $f_4$-statistics. If we have data from a large number of SNPs, these standard errors will be small, which gives us more power to reject models.

To get low rank approximations of $X$, we start with a singular value decompositions (SVD) of $X$, which gives us precursors of the matrices $A$ and $B$. Because we cannot ignore the covariance among the $f_4$-statistics ($Q$), $A$ and $B$ need to be further adjusted. The initial SVD estimates of $A$ and $B$ already minimize the squared sum of the residuals $E = X - AB$. What is actually minimized in the end is the quadratic form $E' Q^{-1} E$ ($E$ is a vector of length $(n_L - 1)(n_R - 1)$, and $Q$ is a square matrix of dimension $(n_L - 1)(n_R - 1) \times (n_L - 1)(n_R - 1)$). This quantity is closely related to the likelihood of the model, $L(A,B) = -\frac{1}{2} E'Q^{-1}E$.
The difference between the log likelihoods of two models (same populations, but one with higher rank than the other) approximately follows a $\chi^2$ distribution with $x$ degrees of freedom. From this $\chi^2$ distribution we can compute a p-value that tests whether we can reject simple models with only one or a few admixture waves.

<br>

Let's consider two extreme cases:

First, $L$ and $R$ are two groups of populations which form clades with respect to each other. This means that all $f_4$-statistics of the form $f_4(L_i, L_j; R_m, R_n)$ are $0$, and hence $X$ is just a matrix full of $0s$ and rank has rank $0$.

The opposite extreme is when $X$ (which has dimensions $(n_L - 1) \times (n_R - 1)$) has full rank ($(n_L - 1)$, because $n_L$ <= $n_R$). This means that there is no population in $L$ which can be modeled as a combination of other populations in $L$. Each population in $L$ shares some amount of drift with populations in $R$, which is not captured by other populations in $L$.

This second extreme case is important, because it forms the basis for *qpAdm*: Once we have identified two groups $L$ and $R$ with this property, we can add an additional population $T$ and test whether this property is retained or not.


## qpAdm

### Modelling an admixed population

The *qpWave* section above describes how we can use a matrix of $f_4$-statistics ($X$) to test the number of gene flows that separate $L$ and $R$. The idea behind *qpAdm* is that we can say something about an additional population $T$ if we follow this approach twice: Once for a model with $L$ and $R$, and once for a model with $L \cup T$ (the union of $L$ and $T$) and $R$. If we find a higher number of gene flows for the model with $L \cup T$ and $R$ than for the model with only $L$ and $R$, this suggests that there was some gene flow between $T$ and $R$. Thus $T$ cannot be modeled as a combination of populations in $L$. This may hint at the existence of a yet undiscovered “ghost” population which is related to a population in $R$, but not represented by any of the populations in $L$. On the other hand, if both models have the same rank (no extra gene flow between $T$ and $R$), then $T$ can be modeled as a combination of populations in $L$, and we can estimate the relative contribution of each population.

*qpAdm* therefore produces two types of output that are of main interest: A p-value which tests whether there is a connection between $R$ and $T$, other than through $L$, and (if the answer to the last question is no), admixture weights which can tell us in what proportions populations in $L$ have contributed to $T$.

The null hypothesis tested by *qpAdm* is that the populations in $L$ can account for all shared drift between $T$ and $R$. A significant p-value rejecting this null hypothesis indicates that there has been gene flow between $T$ and $R$ (in either direction) that is not accounted for by any population in $L$.

The admixture weights are estimated by computing the left null space of the matrix $A$ (they satisfy $w A = 0$). This is basically asking in what proportions the populations in $L$ need to be added up so that their $f_4$-statistics will be equal to the $f_4$-statistics involving $T$. These weights will also be computed when the model is rejected (the p-value is significant), but in that case they can't be interpreted as admixture weights.
It is important to keep in mind that the true source populations of $T$ are almost certainly not identical to any of the populations in $L$, and that any inferences about the mixture proportions of the true, unobserved source populations depend on an unverifiable assumption: Each left population needs to be strictly cladal with one of the source populations.


<!-- There are simpler methods which can model a population as a mixture of other populations. For example, variations of PCA or ADMIXTURE can model allele frequencies in one population as a linear combination of allele frequencies in other populations. These methods do not require a set of right populations, so why are they necessary in the qpAdm framework? The right populations serve two purposes: First, they allow us to test whether the L is sufficient to explain all variation in T, or whether there is evidence of shared drift between T and R which is not present in L. Second, they make it possible to assign the correct weights to populations in L even in cases where the populations in L have a complex demographic history. For this it is necessary that the populations in R are differentially related to the populations in L. -->
<!-- In most cases the sampled left populations L will not be the true source populations for T. However, the ancestry proportions of the true source populations can still be estimated accurately if each true source population forms a clade with its representative in L relative to all other modelled populations, and if the right populations are chosen such that no two left populations form a clade relative to all right populations. While the second assumption can be tested, the first assumption is untestable because it depends on potentially unobserved populations, which should be considered when interpreting qpAdm results. -->