In [None]:
import sys

if sys.platform != 'emscripten':
    # allow the notebook to be downloaded and run locally too
    print("This notebook is not running in JupyterLite: you may need to install tskit, tszip, etc.")
else:
    import micropip
    await micropip.install('tszip')
    await micropip.install('pyslim==1.0.4')  # Need an older version for SLiM < 5
    await micropip.install('stdpopsim')
    await micropip.install('demesdraw')
    await micropip.install('jupyterquiz')
    await micropip.install('tskit_arg_visualizer')

from jupyterquiz import display_quiz
from matplotlib import pyplot as plt
from WhatIsAnARG_module import Workbook2

Workbook2.setup()

### Overview
This is the second of 2 workbooks: each is intended to take about 1½-2 hours to complete.

<dl>
<dt>Workbook 1</dt>
    <dd>Graphs and local trees, mutations, genetic variation and statistical calculations: exercises A-J & questions 1-11</dd>
<dt>Workbook 2</dt>
    <dd>Types of ARG, simplification, and ARG simulation: exercises K-U & questions 12-30</dd>
</dl>

# Workbook 2: ARG types and simulation

### Recap

ARGs can be saved in tree sequence format, as a collection of **nodes** and **edges**. This format can also contain **individuals** (simply used to group nodes together), and if genetic variation is to be encoded, **mutations** associated with **sites**.
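
To make the node-and-edge encoding concrete, here is a minimal sketch in plain Python (it does not use the _tskit_ API). It assumes a hypothetical three-sample ARG, invented for illustration, in which each edge is a `(left, right, parent, child)` tuple, and it recovers the local tree at a genome position by keeping only the edges that span that position.

```python
# Hypothetical toy ARG: nodes 0-2 are samples, nodes 3 and 4 are ancestors.
# Each edge says that `child` inherits from `parent` over the half-open
# genomic interval [left, right).
toy_edges = [
    (0, 10, 3, 0),
    (0, 5, 3, 1),
    (5, 10, 4, 1),
    (0, 10, 4, 3),
    (0, 10, 4, 2),
]


def local_tree(edges, position):
    """Return the local tree at `position` as a {child: parent} dictionary."""
    return {
        child: parent
        for left, right, parent, child in edges
        if left <= position < right
    }


tree_left = local_tree(toy_edges, 2)   # sample 1 sits below node 3 here
tree_right = local_tree(toy_edges, 7)  # sample 1 sits below node 4 here
print(tree_left)
print(tree_right)
```

With a real tree sequence, `arg.at(position)` returns the corresponding local tree directly.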

Such ARGs can be analysed using the _tskit_ library, and drawn tree-by-tree using the built-in `.draw_svg()` method, or as a graph by using the [tskit_arg_visualizer](https://github.com/kitchensjn/tskit_arg_visualizer/blob/main/docs/tutorial.md) package, based on the D3js interactive visualization library. We have been using a so-called *full ARG* stored in the `example_ARG.trees` file.


In [None]:
import tskit
import tskit_arg_visualizer as argviz
from msprime import NODE_IS_RE_EVENT
from tskit import NODE_IS_SAMPLE

arg = tskit.load("data/example_ARG.trees")
d3arg = argviz.D3ARG.from_ts(arg)
d3arg.set_node_styles({u: {"symbol": "d3.symbolSquare"} for u in arg.samples()})
d3arg.set_node_styles({
    u: {"fill": "red" if flags & NODE_IS_RE_EVENT else "grey" if flags & NODE_IS_SAMPLE else "blue"}
    for u, flags in enumerate(arg.nodes_flags)
})
d3arg.draw()

## Full ARGs: perfect historical knowledge

The full ARG above shows "perfect knowledge". Specifically, it records the exact times of two sorts of historical event:
* <span style="color:blue">Common ancestor (CA) events</span>, caused by two genomes duplicating during a round of mitosis. These are represented by a node with two children.
* <span style="color:red">Recombination (RE) events</span>, caused by two genomes combining during a round of meiosis. These are represented by a node with two parents (as we saw, the _msprime_ simulator actually encodes these as a pair of nodes, but we visualise this pair as a single node).

However, requiring perfect knowledge can be too restrictive, especially when representing ARGs which have been inferred, imperfectly, from real data. In the next section we explore ARG representations which omit unknowable or undetectable nodes and edges. These *simplified ARGs* retain essentially the same genetic ancestry as the equivalent full ARGs, but can be substantially smaller. This means they are easier to visualise, analyse, infer, and simulate.

### Unknowable nodes

The easiest example of an unknowable structure in a full ARG is a diamond. 

In [None]:
display_quiz(Workbook2.url + "Q12.json")

We previously saw that a recombination event within a diamond does not cause a change in topology or branch lengths of the local trees. In fact, nowhere on the ARG can we place a mutation such that this diamond is detectable, because none of the nodes in a diamond help to group samples together. More specifically, none of the diamond nodes are *coalescent* in any of the local trees.

As a rule, we can treat the times of all non-coalescent nodes as unknowable. This includes all RE nodes, but also some CA nodes such as those at the top of diamonds. To find these in our ARG, we can use the `Workbook.node_coalescence_status()` function from the first workbook. We will wrap this in a function called `create_styled_D3ARG()` that creates and colours the ARG as appropriate (you don't need to understand the workings of this function, however).
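
The idea of a node being coalescent in none, some, or all of the local trees can be sketched in plain Python. This is not the workbook's own implementation: the function name `toy_coalescence_status`, the toy edge lists, and the 0/1/2 encoding (never, partially, always coalescent) are assumptions chosen to mirror how the status values are used in this notebook.

```python
from collections import defaultdict


def toy_coalescence_status(edges, num_nodes):
    """
    Classify each node as 0 (never coalescent), 1 (coalescent in only part
    of the span over which it has children) or 2 (coalescent wherever it has
    children). `edges` is a list of (left, right, parent, child) tuples.
    """
    breakpoints = sorted({x for left, right, _, _ in edges for x in (left, right)})
    span_with_children = [0.0] * num_nodes
    span_coalescent = [0.0] * num_nodes
    for lo, hi in zip(breakpoints, breakpoints[1:]):
        # Count the children of each parent in the local tree covering [lo, hi)
        num_children = defaultdict(int)
        for left, right, parent, _ in edges:
            if left <= lo and hi <= right:
                num_children[parent] += 1
        for parent, n in num_children.items():
            span_with_children[parent] += hi - lo
            if n >= 2:
                span_coalescent[parent] += hi - lo
    status = []
    for u in range(num_nodes):
        if span_coalescent[u] == 0:
            status.append(0)
        elif span_coalescent[u] == span_with_children[u]:
            status.append(2)
        else:
            status.append(1)
    return status


# A diamond: sample 0 recombines at position 5 into nodes 2 and 3, which
# rejoin in node 4 (the top of the diamond). Node 5 is the common ancestor
# of node 4 and sample 1.
diamond_edges = [
    (0, 5, 2, 0), (5, 10, 3, 0),
    (0, 5, 4, 2), (5, 10, 4, 3),
    (0, 10, 5, 4), (0, 10, 5, 1),
]
diamond_status = toy_coalescence_status(diamond_edges, num_nodes=6)

# Node 3 has two children on [0, 5) but only one on [5, 10)
partial_edges = [
    (0, 10, 3, 0), (0, 5, 3, 1), (5, 10, 4, 1),
    (0, 10, 4, 3), (0, 10, 4, 2),
]
partial_status = toy_coalescence_status(partial_edges, num_nodes=5)
print(diamond_status, partial_status)
```

In the diamond example only node 5 ever has two children in the same local tree, so every diamond node (including the CA node at its top) is classified as non-coalescent.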

<div class="alert alert-block alert-info"><b>Note:</b> Later on we shall start removing nodes from the ARG, which could change the node IDs. For this reason we will switch to labelling the nodes using metadata. Rather unimaginatively, in their metadata, the sample nodes have been assigned the names A-J, which will be maintained even if their IDs change. In real data, metadata labels should be more meaningful, e.g. <code>{'name':'HG00157', 'sex':'Male'}</code> for an individual human from the 1000 Genomes Project.
</div>

In [None]:
import json
import os
from pathlib import Path

import numpy as np

def metadata_based_labels(arg, use_ids=True):
    if use_ids:
        return {nd.id: (nd.metadata or {}).get("label", nd.id) for nd in arg.nodes()}
    else:
        # only make keys in the dictionary if there is a "label"
        return {nd.id: nd.metadata["label"] for nd in arg.nodes() if nd.metadata and "label" in nd.metadata}
        

def create_styled_D3ARG(arg, x_pos_file=None, node_map=None):
    """
    You don't need to understand this function!

    Take a tskit ARG and return a D3ARG, with coloured nodes and with sample nodes
    labelled with node.metadata["label"]. If an x_pos_file is given, load X positions
    from this file, translating the node IDs via the node_map if provided
    """
    d3arg = argviz.D3ARG.from_ts(arg)
    status = np.array(['cyan', 'blue', 'green'])[Workbook2.node_coalescence_status(arg)]
    status[arg.samples()] = 'gray'
    status[(arg.nodes_flags & NODE_IS_RE_EVENT) != 0] = 'red'
    d3arg.set_node_styles({u: {"fill": str(colour)} for u, colour in enumerate(status)})
    d3arg.set_node_styles({u: {"symbol": "d3.symbolSquare"} for u in arg.samples()})
    d3arg.set_node_labels(metadata_based_labels(arg, use_ids=False))

    if x_pos_file is not None:
        x_pos = argviz.extract_x_positions_from_json(json.loads(Path(x_pos_file).read_text()))
        if node_map is None:
            d3arg.set_node_x_positions(x_pos)
        else:
            d3arg.set_node_x_positions({node_map[u]: x for u, x in x_pos.items() if node_map[u] != tskit.NULL})

    return d3arg

d3arg = create_styled_D3ARG(arg, x_pos_file="data/Xpos.json")
d3arg.draw(height=500)

In [None]:
display_quiz(Workbook2.url + "Q13.json")

## Simplified ARGs: inheritance between genomes

It is possible to construct ARGs without unknowable nodes. Instead of being associated with explicit events, nodes in such an ARG are better imagined as *genomes*, between which genetic information is inherited. These *simplified* ARGs can be produced directly from simulation or inferred from real data. Here, however, we will explore how they relate to full ARGs, via the process of *simplification*.

### Simplification

Simplification is the act of removing redundant information from an ARG. The _tskit_ library provides a flexible [`.simplify()`](https://tskit.dev/tutorials/simplification.html) method that performs this task. The core idea is to retain only the ancestry of a specified set of "focal nodes". Nodes that are not needed to represent this ancestry are removed.
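
As a rough illustration of that core idea, the sketch below simplifies a single local tree in plain Python. It is not how _tskit_ implements `.simplify()` (which works on edges across the whole genome); the function name `toy_simplify_tree` and the example tree are hypothetical. A node is kept only if it is a focal node or a branch point between at least two lineages leading to focal nodes.

```python
from collections import defaultdict


def toy_simplify_tree(parent, focal):
    """
    `parent` maps each non-root node to its parent in one local tree.
    Return a new {child: parent} map containing only the focal nodes and
    the nodes at which lineages from focal nodes coalesce.
    """
    # For every ancestor, record which of its children lead to a focal node
    focal_children = defaultdict(set)
    for u in focal:
        while u in parent:
            focal_children[parent[u]].add(u)
            u = parent[u]
    keep = set(focal) | {u for u, c in focal_children.items() if len(c) >= 2}

    new_parent = {}
    for u in keep:
        v = parent.get(u)
        # Skip over any ancestors that are not being retained
        while v is not None and v not in keep:
            v = parent.get(v)
        if v is not None:
            new_parent[u] = v
    return new_parent


# Hypothetical tree: samples 0-3, with node 5 a unary node between 4 and 6
toy_parent = {0: 4, 1: 4, 4: 5, 5: 6, 2: 6, 3: 7, 6: 7}
three_samples = toy_simplify_tree(toy_parent, focal=[0, 1, 2])
two_samples = toy_simplify_tree(toy_parent, focal=[0, 3])
print(three_samples)
print(two_samples)
```

In the first call, the edge above node 4 now goes straight to node 6, bypassing the unary node 5, and the old root (node 7) disappears because it is no longer needed to relate the focal nodes.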


#### Partial simplification

The code below carries out a "partial simplification" of the ARG, detaching all unknowable (non-coalescent) nodes from the ancestry (although we use `filter_nodes=False` to keep them in the stored object).

Hover over the genome bar to check that a complete ancestry of the samples is still present, even though various intermediate nodes have been removed:

In [None]:
def simplify_non_coalescent_nodes(arg, **kwargs):
    # NB: this function can probably be removed when https://github.com/tskit-dev/tskit/issues/2127 is fixed
    partial_or_always_coalescent_nodes = np.where(Workbook2.node_coalescence_status(arg) > 0)[0]
    focal_nodes = np.concatenate((arg.samples(), partial_or_always_coalescent_nodes))
    return arg.simplify(
        focal_nodes,  # The set of nodes whose ancestry we want to keep
        update_sample_flags=False,  # Usually all focal nodes are turned into samples, but we don't want that here
        **kwargs,  # pass on any additional parameters (e.g. "filter_nodes") to `simplify`
    )

part_simp = simplify_non_coalescent_nodes(arg, filter_nodes=False)

create_styled_D3ARG(
    part_simp,
    x_pos_file="data/Xpos.json"
).draw(height=400)

In [None]:
display_quiz(Workbook2.url + "Q14.json")

By specifying `filter_nodes=True` (which is the default) we can actually delete the unused nodes completely. However, this will change the node IDs. So that we can plot the nodes in the right place, we'll also return the `node_map` of old to new IDs when simplifying.
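
The sketch below shows, with made-up values, how such a node map can be used to carry per-node information over to the new IDs. It assumes (as in _tskit_, where `tskit.NULL` is `-1`) that the map is an array indexed by old node ID, with `-1` marking nodes that have been removed.

```python
NULL = -1  # the same value as tskit.NULL

# Hypothetical map for an ARG in which old nodes 2 and 4 were removed:
# position in the list = old node ID, value = new node ID
toy_node_map = [0, 1, NULL, 2, NULL, 3]

# Hypothetical x positions, keyed by OLD node ID
old_x_pos = {0: 0.1, 1: 0.9, 2: 0.3, 5: 0.5}

# Re-key by NEW node ID, dropping any node that no longer exists
new_x_pos = {
    toy_node_map[u]: x
    for u, x in old_x_pos.items()
    if toy_node_map[u] != NULL
}
print(new_x_pos)
```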

In [None]:
part_simplified_arg, node_map = simplify_non_coalescent_nodes(arg, filter_nodes=True, map_nodes=True)
create_styled_D3ARG(
    part_simplified_arg,
    x_pos_file="data/Xpos.json",
    node_map=node_map
).draw(height=400)

In [None]:
display_quiz(Workbook2.url + "Q15.json")

Here's what the partially simplified trees look like. As you can see, they still look slightly unusual compared to a standard "gene tree", as there are a few locally unary nodes left. These nodes are coalescent in some of the local trees, but not in others. 

In [None]:
display_quiz(Workbook2.url + "Q16.json")

<dl class="exercise"><dt>Exercise K</dt>
    <dd>Use <code>part_simplified_arg.draw_svg()</code> to plot the local trees.
        If you change the height of the plot using <code>size=(1000, 400)</code> the tree will be clearer.
        You might also want to specify <code>node_labels=metadata_based_labels(part_simplified_arg)</code> to get nicer labels.
    </dd>
</dl>

In [None]:
# Complete Exercise K here


In [None]:
display_quiz(Workbook2.url + "Q17.json")

#### Full simplification

We can go one stage further by fully simplifying so that ancestral nodes are all-coalescent. This modifies edges so that, for example, the edge above node 18 in the first tree goes straight to node 20, rather than going via node 19. This will not change the general topology or branch lengths of the trees, and it also makes the trees simpler. However, it can create more edges (making ARG analysis less efficient), and slightly paradoxically, makes the graph visualization *less* simple:

In [None]:
fully_simplified_arg, node_map = arg.simplify(map_nodes=True)

display(fully_simplified_arg.draw_svg(
    size=(1000, 400),
    node_labels=metadata_based_labels(fully_simplified_arg),
    title="Fully simplified local trees",
))

create_styled_D3ARG(
    fully_simplified_arg,
    "data/Xpos.json",
    node_map).draw(height=500, title="Fully simplified ARG")


As you can see above, the local trees look simpler, but the graph looks more tangled (although every node is now <b style="color:green">all-coalescent</b>). The fully-simplified ARG encodes exactly the same genetic information on the samples, and the same branch lengths in the local trees, but is less genealogically-informative about how adjacent trees relate to each other.

<dl class="exercise"><dt>Exercise L</dt>
    <dd>Use the <code>.num_edges</code>, <code>.num_nodes</code>, <code>.num_trees</code>, and <code>.num_mutations</code> attributes to print out the number of edges, nodes, trees, and mutations in the <code>part_simplified_arg</code> ARG and the <code>fully_simplified_arg</code> ARG (or you could simply <code>display</code> the summary table of each ARG).</dd>
</dl>

In [None]:
# Complete Exercise L here


In [None]:
display_quiz(Workbook2.url + "Q18.json")

Because the number of sites and mutations has remained the same, and there is still a complete ancestry for all the samples, the encoded genetic variation and even branch-length calculations are the same between the full ARG and any of the simplified versions. This is easy to demonstrate:

In [None]:
print(f"    ====== Original (branch π: {arg.diversity(mode='branch'):.2f}) ========    ",
      f"    === Fully simplified (branch π: {fully_simplified_arg.diversity(mode='branch'):.2f}) ===")
for v1, v2 in zip(arg.variants(), fully_simplified_arg.variants()):
    print(
        f"pos {int(v1.site.position):>3}", v1.states(), "   ",
        f"pos {int(v2.site.position):>3}", v2.states()
    )


For historical reasons, the fully simplified format is the default output of the _msprime_ and _SLiM_ simulators, even though in some cases it can be slightly less efficient.

### Nodes represent genomes

Philosophically, the nodes in a simplified ARG no longer represent historical events, but the genomes that are produced as an *outcome* of those (unknown) events. For example, sample node 6 now has 2 parents, even though it does not itself represent a recombination event. We know a recombination event occurred some time between the time of node 6 and the time of its youngest parent, node 15, but we can't say exactly when.

### Simplifying with different focal nodes

One of the main uses of `simplify()` is to change which nodes are treated as samples. By default the existing sample nodes are taken, but we can specify a subset of the existing samples to make a smaller ARG, which is the topic of the next subsection.

However, for demonstration purposes, the code below (unusually) passes a previously non-sample node to `simplify()`. This is a shortcut that allows us to output the assumed genome of internal nodes in the ancestry: 

In [None]:
sample_ids_plus_node_21 = np.append(part_simplified_arg.samples(), 21)  # make a list of sample nodes and add node 21
new_arg = part_simplified_arg.simplify(sample_ids_plus_node_21)  # pretend that we have "sampled" node 21
print("Known variation for ancestral node 21")
for v in new_arg.variants():
    print(f"pos {int(v.site.position):>3}:   ", v.states(missing_data_string="-")[new_arg.num_samples - 1])

You can see that the start and end of the genome of node 21 are unknown (its haplotype is truncated).

In [None]:
display_quiz(Workbook2.url + "Q19.json")

#### Simplifying to reduce the sample size

If we choose a subset of the existing samples, we can reduce the ARG to reflect the ancestry of only those samples. Repeated simplification to the current-day samples is a key component of forward simulation.

<dl class="exercise"><dt>Exercise M</dt>
    <dd>Simplify the original ARG so that it only shows the ARG relating sample 0 (A) to sample 6 (F). By default <code>simplify()</code> removes any sites that no longer have any mutations (are "monomorphic"). Set <code>filter_sites=False</code> to leave those sites in the resulting tree sequence. Sites with no mutations will have a tickmark on the x-axis without an associated red v-shaped mutation mark.</dd></dl>

In [None]:
# Carry out exercise M here


In [None]:
display_quiz(Workbook2.url + "Q20.json")

### The simplest possible ARG

We can even go to the extreme, and simplify an ARG to only 2 genomes (say 0 and 9):

In [None]:
arg.simplify([0,9], filter_sites=False).draw_svg()

Although there are only 2 samples (and 3 trees) in this ARG, and the topology of the local trees is always a "cherry", there is still some ARG structure, because the time to the MRCA differs as we go along the genome. The ability to infer a simple ARG of this sort can still be extremely powerful: this is the basis behind techniques based on the Pairwise Sequentially Markovian Coalescent (PSMC) (see e.g. [this 2019 review](https://doi.org/10.1002/ece3.5888) or more recent papers by [Schweiger (2023)](https://doi.org/10.1101/gr.277665.123) and [Terhorst (2025)](https://doi.org/10.1038/s41588-025-02323-x))
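
To see why varying MRCA times still carry information, here is a small worked sketch. The three segments and their TMRCA values are invented for illustration and are not taken from the ARG above. For two samples, the branch length separating them in a local tree is twice the time to their MRCA, so the branch-mode diversity is the span-weighted average of $2 \times T_{MRCA}$ along the genome.

```python
# Hypothetical (left, right, tmrca) segments for a two-sample ARG of length 100
toy_segments = [
    (0, 40, 100.0),
    (40, 70, 250.0),
    (70, 100, 100.0),
]

sequence_length = toy_segments[-1][1] - toy_segments[0][0]

# Two samples are separated by a path up to the MRCA and back down: 2 * TMRCA
branch_diversity = sum(
    (right - left) * 2 * tmrca for left, right, tmrca in toy_segments
) / sequence_length
print(f"Branch-mode diversity: {branch_diversity} (in units of time)")
```

Methods in the PSMC family work in the opposite direction, inferring the sequence of local TMRCA values from the density of differences between the two genomes.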

## Polytomies and undetectable nodes

It is possible to create ARGs in *tskit* that contain **polytomies** (nodes in a local tree with more than 2 children). This can happen under some models of evolution (e.g. the Dirac coalescent, where there is a burst of coalescences at a single timepoint). Such ARGs are also produced by some ARG inference methods (e.g. *tsinfer*) in which nodes that are not associated with a mutation are omitted.

We can demonstrate the second sort of ARG by removing "undetectable" nodes, often those without associated mutations. This process collapses edges in a way that appears similar to parts of the simplification process:

In [None]:
Workbook2.remove_unsupported_edges(arg).draw_svg(
    title="ARG, collapsing edges unsupported by mutations",
    size=(1000, 400)
)

In [None]:
display_quiz(Workbook2.url + "Q21.json")

<div class="alert alert-block alert-info"><b>Note:</b> Unlike simplification, creating polytomies will change the branch lengths in the local trees, and hence affect branch-length measures of genetic variation. More generally, it needs special treatment when measuring features of ARGs, or performing statistical tests, which can make such ARGs hard to handle.
</div>

## Simulating ARGs

There are two basic approaches for simulating ancestry: forward and backward in time. Backward-time simulators such as _msprime_ are usually much faster but less flexible (for instance, it is hard to simulate anything but the simplest form of natural selection). Forward-time simulators such as _SLiM_ are more general, but must track the entire population during simulation, rather than simply following the ancestry of a small set of sample genomes.


<dl class="exercise"><dt>Exercise N</dt>
    <dd>Tree sequence files often have "provenance" data,
    which describes how the file was made. You can see this when you display a
tree sequence to the screen in a notebook: the list of provenances appears at the bottom of the output.
Use <code>display(arg)</code> (or simply <code>arg</code>) to show the ARG summary in the cell below.</dd>
</dl>

In [None]:
# Carry out exercise N, displaying a summary of the ARG, here


In [None]:
display_quiz(Workbook2.url + "Q22.json")

### _Msprime_: a backward-time simulator

_Msprime_ is a fast and flexible backward-time simulator whose core functions are `sim_ancestry` and `sim_mutations`. The `sim_ancestry` command is run first and this creates the ARG structure.

Running an _msprime_ simulation in Python is extremely easy. You simply need to call `sim_ancestry` with a number of (assumed diploid) samples. You'll usually want to provide a `sequence_length`, a `population_size` and a `recombination_rate` too:

In [None]:
# Name the ARG `ts` (for "tree sequence")
import msprime

ts = msprime.sim_ancestry(10, population_size=1e4, sequence_length=10_000, recombination_rate=2e-8)
print(f"Simulated a simplified ARG of {ts.num_samples} haploid genomes")

The command above generates a "fully simplified" ARG of 10 diploid samples (20 haploid genomes) over a 10 kb genome, assuming a constant effective population size ($N_e$) of 10,000, and with a recombination rate of $2\times10^{-8}$ crossovers per base pair per generation. As the resulting ARG is stored in tree sequence format, we often name the resulting simulation `ts`.
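
As a rough guide to how these parameters determine the size of the simulated ARG, the sketch below computes the population-scaled recombination rate $\rho = 4 N_e r L$ and the textbook expectation, under the standard coalescent, for the number of recombination events falling on the local trees, $\rho \sum_{i=1}^{n-1} 1/i$ for $n$ haploid genomes. This is an approximation for orientation only: the number of trees in any single simulation will vary around it, and the variable names are illustrative.

```python
# Parameters matching the sim_ancestry call above
n_haploid = 20            # 10 diploid samples
population_size = 1e4     # effective population size, N_e
sequence_length = 10_000  # base pairs
recombination_rate = 2e-8 # crossovers per base pair per generation

# Population-scaled recombination rate for the whole region
rho = 4 * population_size * recombination_rate * sequence_length

# Harmonic sum over 1 .. n-1, as appears in the expected total branch length
harmonic_sum = sum(1 / i for i in range(1, n_haploid))

expected_recombinations = rho * harmonic_sum
print(f"rho = {rho:.2f}; expect roughly {expected_recombinations:.1f} recombination events")
```

Doubling either the sequence length or the recombination rate doubles this expectation, which is one reason why long genomes produce ARGs with many local trees.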

#### Demographic models

Of course, most populations have not stayed at a constant size over time, and are often modelled as multiple subpopulations. Instead of providing a `population_size` parameter, you can provide _msprime_ with a **demographic model**. This is how we simulated the original full ARG (for the curious, the details are in the `sim_ancestry` provenance entry, under "parameters" -> "demography"). The following code reads the demographic model from the full ARG's provenance, converts it to the portable [demes](https://popsim-consortium.github.io/demes-spec-docs/main/introduction.html) format, and plots it using the elegant [_demesdraw_](https://grahamgower.github.io/demesdraw/latest/) software:

In [None]:
import demesdraw

cmd, parameters = msprime.provenance.parse_provenance(arg.provenance(0), arg)
assert cmd == "sim_ancestry"  # just check we have the right (zeroth) provenance entry

msprime_demography_object = parameters["demography"]
demesdraw.tubes(msprime_demography_object.to_demes(), log_time=False);

<dl class="exercise"><dt>Exercise O</dt>
    <dd>The widths on the X axis, indicating the two population sizes, are dominated by a rapid exponential increase in the past few hundred generations. Change <code>log_time</code> from <code>False</code> to <code>True</code> above, to zoom in on more recent times, and re-plot.</dd>
</dl>



In [None]:
display_quiz(Workbook2.url + "Q23.json")

### Running a larger simulation

We will rerun under the same demography, but simulate twenty times more of the genome (20kb). As we did in the original simulation, we will use the `record_full_arg` option to simulate a full ARG, rather than the simplified ARG that _msprime_ produces by default. Because there are two populations, we will take 5 samples from each (for a total of 10 diploid samples), and use the original simulation recombination rate of $1.15\times10^{-8}$. We'll use a fixed `random_seed` to ensure that everyone gets the same results:


In [None]:
from datetime import datetime

n_diploids=10
genome_length=20_000  # bp

start_time = datetime.now()
larger_arg = msprime.sim_ancestry(
    samples={"AFR": n_diploids // 2, "EUR": n_diploids // 2},  # integer sample counts
    recombination_rate=1.15e-08,
    sequence_length=genome_length,
    demography=msprime_demography_object,
    record_full_arg=True,
    random_seed=321
)

print(f"Simulated a full ARG of {larger_arg.num_samples} haploid samples over "
      f"{larger_arg.sequence_length/1e3}kb in {(datetime.now()-start_time).total_seconds():.2f} seconds")
print(
    f"ARG takes up {larger_arg.nbytes * 1e-6:.2f} MB, with {larger_arg.num_nodes} nodes "
    f"and {larger_arg.num_edges} edges, encoding {larger_arg.num_trees} trees\n")

# Calculate the number of "unknowable" nodes
non_coal = Workbook2.node_coalescence_status(larger_arg) == 0
p = (sum(non_coal)-larger_arg.num_samples)/(len(non_coal) - larger_arg.num_samples)
print(f'{p * 100:.2f}% of nodes in this full ARG are non-coalescent ("unknowable")')
print(f'{np.sum(larger_arg.nodes_flags & msprime.NODE_IS_RE_EVENT > 0)/ larger_arg.num_nodes * 100:.2f}% of nodes are RE nodes')

You can see that this slightly larger ARG now has quite a high proportion of unknowable nodes.
Plotting this out as a complete ARG is possible, but maybe not very helpful (feel free to change the `edge_type` below to `"ortho"` if you think it will help):

In [None]:
if larger_arg.num_edges > 10000:
    raise RuntimeError("ARG too big to plot!!!")
d3larger_arg = create_styled_D3ARG(larger_arg)
d3larger_arg.draw(
    height=800, width=800,
    title=f"ARG of {larger_arg.num_trees} trees",
    y_axis_labels={0:0, 0.5e4:0.5e4, 1e4:1e4, 1.5e4:1.5e4, 2e4:2e4, 2.5e4:2.5e4, 3e4:3e4})


As you can see, this ARG is rather too large to visualise, although it can be shrunk considerably by removing the ~70% of non-coalescent nodes (most of which are recombination nodes).

In [None]:
partial_or_always_coalescent_nodes = np.where(Workbook2.node_coalescence_status(larger_arg) > 0)[0]
focal_nodes = np.concatenate((larger_arg.samples(), partial_or_always_coalescent_nodes))
part_simp_arg = larger_arg.simplify(
    focal_nodes,  # The set of nodes whose ancestry we want to keep
    filter_nodes=False,  # Omit nodes from the genealogy, but do not completely remove them in the returned object    
    update_sample_flags=False,  # Usually all focal nodes are turned into samples, but we don't want that here
)

d3part_simp_arg = create_styled_D3ARG(part_simp_arg)
d3part_simp_arg.draw(height=1000, width=800, title="Simplified larger ARG")


<dl class="exercise"><dt>Exercise P</dt>
    <dd>Now repeat the <i>msprime</i> <code>larger_arg</code> simulation above, including all the subsequent <code>print</code> statements in that cell, but edit the <code>n_diploids</code> value to simulate 500 diploids (1000 sample genomes) and the <code>genome_length</code> line to simulate 10 Mb of genome instead. Run the simulation (which could take a minute or two). BE CAREFUL NOT TO PLOT THE RESULTING ARG (that will probably make your computer run out of memory and/or crash the browser).</dd></dl>

In [None]:
# Use this cell for Exercise P, simulating 500 diploids over a 10Mb genome


In [None]:
display_quiz(Workbook2.url + "Q24.json", shuffle_answers=False)

Rather than go through the hassle of simulating a full ARG, we can simulate an already-simplified ARG. This is much more efficient:

<dl class="exercise"><dt>Exercise Q</dt>
    <dd>Go back to the simulation code you just ran, change the <code>record_full_arg</code> parameter to <code>False</code> (or remove it entirely), and re-run the 1000 sample / 10Mb simulation. Rather than simulating the full ARG, this simulates the all-coalescent ("fully simplified") ARG, and so should be substantially faster.</dd></dl>

In [None]:
# Use this cell for Exercise Q, simulating a 10Mb genome for 500 diploids but with record_full_arg=False


In [None]:
display_quiz(Workbook2.url + "Q25.json", shuffle_answers=False)

The <code>extend_haplotypes()</code> method can infer missing non-coalescent segments of nodes, and place them back on the ARG. However, this can take some time for larger ARGs.

In [None]:
start_time = datetime.now()  # Could take a few minutes
part_coalescent_arg = larger_arg.extend_haplotypes()
print(
    "Added suggested non-coalescent regions in",
    (datetime.now()-start_time).seconds,
    "seconds",
)
display(part_coalescent_arg)

In [None]:
display_quiz(Workbook2.url + "Q26.json", shuffle_answers=False)

Although it can take some time, it may be worth creating this `part_coalescent_arg`, as a smaller ARG can make many genetic calculations more efficient, and the span of ancestral haplotypes, relevant to e.g. the span of segments considered identical-by-descent, is likely to be more accurate.

Alternatively, you can retain the exact non-coalescing spans of ancestral haplotypes during an _msprime_ simulation by using `coalescing_segments_only=False`. This is equivalent to partially simplifying the full ARG. However, this is not guaranteed to produce a smaller encoding (it retains extra unknowable information such as the genetic location of masked recombination events).

## Mutations

We can add mutations *after* simulating ancestry using the [`msprime.sim_mutations()`](https://tskit.dev/msprime/docs/stable/mutations.html) function.

It is valid to add mutations *after* simulating the ARG as long as they are neutral. One of the main reasons that ARGs are an efficient way to simulate genomes is because most mutations are neutral, and therefore do not need to be tracked during ancestry simulation.

The default _msprime_ mutation model only adds single nucleotide changes, with equal probability of mutating between the 4 bases, but other sorts of mutation models [are available](https://tskit.dev/msprime/docs/stable/mutations.html#models).
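
Under a neutral model of this sort, mutations fall on each edge as a Poisson process, so the expected number of mutations is the mutation rate multiplied by the total "area" of the edges (genomic span times branch length in time). The sketch below computes that expectation in plain Python for a hypothetical toy ARG; the edges, node times, and mutation rate are invented for illustration.

```python
# Hypothetical toy ARG: (left, right, parent, child) edges plus node times
toy_edges = [
    (0, 10, 3, 0),
    (0, 5, 3, 1),
    (5, 10, 4, 1),
    (0, 10, 4, 3),
    (0, 10, 4, 2),
]
toy_node_time = [0, 0, 0, 100, 300]  # samples at time 0, ancestors older


def expected_mutations(edges, node_time, rate):
    """Expected mutation count: rate x sum over edges of (span x branch length)."""
    area = sum(
        (right - left) * (node_time[parent] - node_time[child])
        for left, right, parent, child in edges
    )
    return rate * area


expected = expected_mutations(toy_edges, toy_node_time, rate=1e-3)
print(f"Expected number of mutations: {expected:.1f}")
```

Because the calculation needs only the edges and node times, it can be carried out after the ancestry has been simulated, which is why neutral mutations do not have to be tracked during the ancestry simulation itself.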

In [None]:
if larger_arg.sequence_length != 10_000_000 or larger_arg.num_samples != 1000:
    raise RuntimeError("You have not simulated the right length ARG: Make sure you run exercise Q correctly")
mutated_large_arg = msprime.sim_mutations(larger_arg, rate=1e-8, random_seed=42)
print(f"Simulated {mutated_large_arg.num_mutations} mutations at {mutated_large_arg.num_sites} sites")

In [None]:
display_quiz(Workbook2.url + "Q27.json")

<dl class="exercise"><dt>Exercise R</dt>
    <dd>
        Get the times of all these mutations as a big array using <code>mutated_large_arg.mutations_time</code>, take the log (base 10) of this using <code>np.log10</code>, then plot these logged times out using <code>plt.hist</code>. You might want to also set nice tickmarks, e.g. using <code>plt.xticks(np.arange(5), 10**np.arange(5))</code>, and specify a label e.g. via <code>plt.xlabel(f"Mutation time ({mutated_large_arg.time_units})")</code>
    </dd>
</dl>

In [None]:
# Plot mutation times using this cell


## Stdpopsim: easily run verified simulations

Even though the demographic model above is relatively simple, it still contains many parameters, and specifying it, or something similar, can be tricky and prone to error. For this reason, rather than using <em>msprime</em> directly, it is often much easier to use the <a href="https://popsim-consortium.github.io/stdpopsim-docs/">Standard Library for Population Genetic Simulation Models</a> (<em>stdpopsim</em>).

<img style="float:right; margin: 0.5em;" src="https://upload.wikimedia.org/wikipedia/commons/thumb/9/9a/NASA_Joins_Jane_Goodall_to_Conserve_Chimp_Habitats_%28SVS14410%29.jpg/330px-NASA_Joins_Jane_Goodall_to_Conserve_Chimp_Habitats_%28SVS14410%29.jpg" />
<em>Stdpopsim</em> is a set of tried and tested genomic and demographic models for various species. We will demonstrate using the Python API (as documented in the <a href="https://popsim-consortium.github.io/stdpopsim-docs/stable/tutorial.html#running-stdpopsim-with-the-python-interface-api">tutorial documentation</a>). For instance, here is an example of a demographic model of populations in the genus <em>Pan</em> i.e. common chimpanzees and bonobos.

In [None]:
import stdpopsim
import demesdraw
species = stdpopsim.get_species("PanTro")  # /Pan troglodytes/
model = species.get_demographic_model("BonoboGhost_4K19")

print(
    f'Chosen the demographic model "{model.id}" for `{species.name}` ({species.common_name}) '
    f"out of {len(species.demographic_models)} model(s) available:")
for d in species.demographic_models:
    print(f"* {d.id}: {d.long_description}")

colours = {'central': 'tab:blue', 'bonobo': 'tab:orange', 'western': 'tab:green'}
demesdraw.tubes(model.model.to_demes(), colours=colours);  # You can ignore the warnings here, see https://github.com/popsim-consortium/stdpopsim/issues/1734

And here's how to tell _stdpopsim_ to run an _msprime_ simulation of the model, e.g. the first 20 Mb of [chromosome 2A](https://en.wikipedia.org/wiki/Chimpanzee_genome_project#Genes_of_the_chromosome_2_fusion_site), from 20 bonobos, 12 central chimpanzees, and 8 western chimpanzees (remember that these are diploid individuals, so that makes 80 haploid sample nodes in total). Mutations will be automatically laid onto the ARG using `msprime.sim_mutations()` (you'll notice that we provide a `mutation_rate` below).

In [None]:
contig = species.get_contig("chr2A", mutation_rate=model.mutation_rate, right=20e6)  # only simulate 20 Mb
samples = {"bonobo": 20, "central": 12, "western": 8}

engine = stdpopsim.get_engine("msprime")  # Use msprime as the underlying simulator
chimp_arg = engine.simulate(
    model,
    contig,
    samples,
    msprime_model="dtwf",
    msprime_change_model=[(20, "hudson")],
    coalescing_segments_only=False,
    random_seed=42,
)
print(f"Simulated a chimp ARG of {chimp_arg.num_samples} samples with {chimp_arg.num_trees} local trees")
chimp_arg = chimp_arg.trim()

The `model`, `contig` and `samples` parameters should be obvious. Here we have also used best practice and used the "Discrete-Time Wright Fisher" ([`dtwf`](https://tskit.dev/msprime/docs/stable/ancestry.html#sec-ancestry-models-dtwf)) coalescent model for the most recent 20 generations, and the (default) [`hudson`](https://tskit.dev/msprime/docs/stable/ancestry.html#sec-ancestry-models-hudson) model further back in time (this is mainly important for whole genome simulations or when sample sizes approach the population size, see [this paper](https://doi.org/10.1371/journal.pgen.1008619)).

Finally, any extra options are passed directly to the underlying simulation engine, in this case _msprime_. In particular, we have seen that the `coalescing_segments_only` argument will generate a partially- rather than fully-simplified ARG.

<dl class="exercise"><dt>Exercise S</dt>
    <dd>
        The cell below creates a list of positions along the simulated genome that can be used as windows, and a mapping of population name to population ID. In the cell below that, get the samples that correspond to the <i>central</i> population by using <code>chimp_arg.samples(pop_ids['central'])</code>, and calculate the diversity in windows along the genome using <p><code>win_div = chimp_arg.diversity(central_samples, windows=windows)</code></p>You should then be able to plot the diversity using <p><code>plt.stairs(win_div, windows, baseline=None, label='central', color=colours['central'])</code></p>In the same cell, repeat for the <i>bonobo</i> and <i>western</i> populations, and add <code>plt.legend()</code> to generate a legend.
    </dd>
</dl>

In [None]:
# This code creates the genome windows and a map from population name to population ID.
windows = np.linspace(0, chimp_arg.sequence_length, 11)  # 11 breakpoints define 10 windows
print(f"Created {len(windows) - 1} windows of length {(windows[1]-windows[0]) / 1e6} megabases each")
pop_ids = {pop.metadata["name"]: pop.id for pop in chimp_arg.populations()}
print("Population IDs are:", pop_ids)

In [None]:
# Complete Exercise S here: plot diversity in windows along the genome for the 'central' population, then add the two other populations


In [None]:
display_quiz(Workbook2.url + "Q28.json")

### Coalescence rates

A relatively new and useful _tskit_ feature is the ability to estimate coalescence rates (including cross-coalescence rates) through time. The key method here is <code>pair_coalescence_rates()</code> which estimates rates in time windows and also allows different rates to be estimated along the genome. In this case, as we have a neutral simulation, there's not much point using genome windows. If e.g. we expect selection, or want to inspect hybrids, we could easily plot out a heatmap instead.

In [None]:
num_log_timebins = 20
time_breaks = np.logspace(
    3,  # start at 10^3 generations (not much info more recently than that)
    np.log10(chimp_arg.max_time),
    num_log_timebins
)
plot_breaks = time_breaks.copy()
time_breaks[0] = 0  # we require the first bin to start at the sampling time
time_breaks[-1] = np.inf  # we require the last bin to go to infinity
rates = chimp_arg.pair_coalescence_rates(time_windows=time_breaks)
plt.stairs(1/rates, plot_breaks)
plt.xscale("log")
plt.xlabel(f"Time ({chimp_arg.time_units})")
plt.ylabel("IICR (inverse instantaneous coal. rate)");

The "huge" IICR (sometimes interpreted as $N_e$) between 1000 and 10,000 generations probably just reflects the fact that these are separate species, and are therefore not coalescing much at all.
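
As a reminder of where the $N_e$ interpretation comes from, here is the standard result for an idealised case. It assumes a single randomly-mating diploid population of constant size, which is not the situation simulated here. Writing $\lambda$ for the pairwise coalescence rate per generation:

$$
\lambda = \frac{1}{2N_e}
\quad\Longrightarrow\quad
\frac{1}{\lambda} = 2N_e
$$

The inverse rate is therefore proportional to $N_e$ only when that assumption holds. With population structure, a low coalescence rate can reflect separation between groups rather than a large population.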

#### Cross coalescence
If we want to look at migration *between* populations, we can look at the **cross coalescence rate**. For this we define some sets of samples (the samples in each population: central, bonobo, and western) and only record a pairwise coalescence if one of the pair is from population A and the other is from population B. Since we have 3 populations, there are only 3 possible pairs to look at:

* central - bonobo
* central - western
* bonobo - western

We specify which comparisons we want to look at by using indexes into the provided list of sample sets, i.e. in this case `(0, 1)`, `(0, 2)`, and `(1, 2)`. Note that below we plot the coalescence rate (not its inverse).
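
With more populations, writing the index pairs by hand quickly becomes error-prone. As a small sketch using only the Python standard library, the pairs and matching plot labels can be generated from the list of names. The variable names `names`, `indexes` and `labels` are chosen here for illustration.

```python
from itertools import combinations

# Population names, in the same order as the list passed to sample_sets
names = ["central", "bonobo", "western"]

# Every unordered pair of positions in that list
indexes = list(combinations(range(len(names)), 2))

# A human-readable label for each comparison, in the same order
labels = [f"{names[i]}/{names[j]}" for i, j in indexes]

for pair, label in zip(indexes, labels):
    print(pair, label)
```

Row `k` of the array returned for these `indexes` then corresponds to `labels[k]`.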

In [None]:
central_samples = chimp_arg.samples(pop_ids['central'])
bonobo_samples = chimp_arg.samples(pop_ids['bonobo'])
western_samples = chimp_arg.samples(pop_ids['western'])
rates = chimp_arg.pair_coalescence_rates(
    time_windows=time_breaks,
    sample_sets=[central_samples, bonobo_samples, western_samples],
    indexes=[(0, 1), (0, 2), (1, 2)]
)
plt.stairs(rates[0, :], plot_breaks, label="central/bonobo")
plt.stairs(rates[1, :], plot_breaks, label="central/western")
plt.stairs(rates[2, :], plot_breaks, label="bonobo/western")
plt.xscale("log")
plt.xlabel(f"Time ({chimp_arg.time_units})")
plt.ylabel("Estimated instantaneous coalescence rate")
plt.legend();

In [None]:
display_quiz(Workbook2.url + "Q29.json")

### PCAs

Another _tskit_ feature is the ability to perform efficient matrix-vector multiplications on the genetic relatedness matrix. This enables a very efficient implementation of genetic principal components analysis, even for huge datasets. Here's an example:

In [None]:
principal_components = chimp_arg.pca(2).factors
for pop in chimp_arg.populations():
    samples = chimp_arg.samples(population=pop.id)
    if len(samples) > 0:
        plt.scatter(
            principal_components[samples, 0],
            principal_components[samples, 1],
            c=colours[pop.metadata["name"]],
            label=pop.metadata["name"],
            alpha=0.5,
        )
# Label the axes and add the legend once, after all populations are plotted
plt.xlabel("PCA 1")
plt.ylabel("PCA 2")
plt.legend()
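
To see why matrix-vector products are enough for a PCA, here is a sketch of plain power iteration in _numpy_. It is not the algorithm _tskit_ uses internally, and the matrix `G` is a made-up stand-in for a genetic relatedness matrix, built from random data with two artificial groups. The point is only that the top principal component can be found without ever inspecting `G` directly, using nothing but a function that multiplies it by a vector.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 30 "samples" by 200 "features", in two groups of 15
X = rng.normal(size=(30, 200))
X[15:] += 2.0             # shift the second group to create structure
X -= X.mean(axis=0)       # centre each column
G = X @ X.T / X.shape[1]  # stand-in for a relatedness matrix


def matvec(v):
    """The only access to G that the algorithm below needs."""
    return G @ v


def top_component(matvec, n, num_iter=200, seed=0):
    """Power iteration: repeatedly multiply by the matrix and renormalise."""
    v = np.random.default_rng(seed).normal(size=n)
    v /= np.linalg.norm(v)
    for _ in range(num_iter):
        w = matvec(v)
        v = w / np.linalg.norm(w)
    # The Rayleigh quotient estimates the eigenvalue for the unit vector v
    return v @ matvec(v), v


eigval, eigvec = top_component(matvec, G.shape[0])
print("Largest eigenvalue (power iteration):", eigval)
print("Largest eigenvalue (dense solver):   ", np.linalg.eigvalsh(G)[-1])
```

For a large ARG the dense matrix would never be built: only the equivalent of `matvec` is needed.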

## SLiM simulations

If you want to add complex selection or demography to a simulation (which can be done in _stdpopsim_), you will need to use a forward simulator like the SLiM simulation engine.

SLiM is a separate program that won't work in the browser, so we can't perform forward-time simulations easily in this workbook. Instead, we'll load up the results of a previous SLiM simulation. Although _tskit_ `.trees` files are quite small, there is also an additional compression program, _tszip_, which specializes in making them even smaller. It's common to use this for saving and loading large ARGs in `.tsz` format, but _tszip_ will quite happily load normal `.trees` files too, so it's not a bad idea to get used to using it to load any tree sequence:

In [None]:
import tszip
# Running SLiM is an advanced topic beyond the scope of this workbook
slim_arg = tszip.load("data/SLiM_sim.tsz")

print(
    f"Loaded a SLiM simulated ARG with {slim_arg.num_samples} sampled genomes, " +
    f"{slim_arg.num_trees} trees, " +
    f"and {slim_arg.num_mutations} mutations."
)

<dl class="exercise"><dt>Exercise T</dt>
    <dd>Get the first tree in the ARG using <code>slim_arg.first()</code>, and then plot it using <code>.draw_svg()</code>. As there are a lot of samples, you might also want to set <code>size=(1000, 400)</code> and (to omit any node labels) <code>node_labels={}</code>.</dd>
</dl>

In [None]:
# Complete Exercise T here (plot the first tree in the slim_arg)


## Recapitation

You can see that the simulation needed more burn-in, as the lineages have not all coalesced (in _tskit_ parlance, the trees have **multiple roots**). We can add the correct amount of (neutral) burn-in using _msprime_, which is usually much faster than SLiM. This is called [recapitation](https://tskit.dev/pyslim/docs/latest/tutorial.html). Here's a quick introduction.

You can carry out recapitation using `msprime.sim_ancestry()`, but there is a convenient wrapper provided by the _pyslim_ Python library:

In [None]:
import pyslim

# ignore the warning the command below generates: see the docs
coalesced_arg = pyslim.recapitate(slim_arg, recombination_rate=1e-8, ancestral_Ne=200, random_seed=5)

You can now check that the simulation has indeed fully coalesced:

In [None]:
# Check the first tree visually
coalesced_arg.first().draw_svg(size=(1000, 400), node_labels={})

<dl class="exercise"><dt>Exercise U</dt>
    <dd>It's all very well checking by eye, but you probably want to make sure that all the trees (not just the first) have coalesced. It's very useful to get into the habit of putting lots of <code>assert</code> statements into your code, to sanity check your logic. The condition in an assert statement should always be True; if it is not, your code will stop with an error. This is a great way to avoid mistakes.
    Check that all the trees in the <code>coalesced_arg</code> have a single root by iterating through the trees and calling <code>assert tree.num_roots == 1</code> for each tree.
    </dd>
</dl>
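
If you have not used `assert` before, here is a generic sketch that is unrelated to the exercise. The function `mean_branch_length` is a hypothetical helper invented for this example. The text after the comma is an optional message, shown if the check fails.

```python
def mean_branch_length(lengths):
    """Average a list of branch lengths, checking our assumptions first."""
    assert len(lengths) > 0, "expected at least one branch"
    assert all(x >= 0 for x in lengths), "branch lengths cannot be negative"
    return sum(lengths) / len(lengths)


# The assumptions hold here, so the function runs as normal
print(mean_branch_length([1.0, 2.0, 3.0]))

# Here the first assumption is violated, so an AssertionError is raised
try:
    mean_branch_length([])
except AssertionError as err:
    print("Stopped early:", err)
```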

In [None]:
# Complete Exercise U here (put in an assert to check that recapitation has worked)


Now that we have a fully coalesced ARG, we can e.g. add neutral mutations, to create realistic genetic variation:

In [None]:
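import msprime  # harmless if already imported earlier in the notebook

mu = 1e-8  # mutation rate per base pair per generation
mutated_slim_sim = msprime.sim_mutations(coalesced_arg, rate=mu, random_seed=123)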
print(f"Added {mutated_slim_sim.num_mutations} mutations to the ARG")

With a fully-coalesced ARG, we can reasonably calculate stats such as windowed genetic diversity (and because we have added mutations, we can calculate site-based as well as branch-based measures):

In [None]:
genome_windows = np.linspace(0, mutated_slim_sim.sequence_length, 1001)
step = genome_windows[1] - genome_windows[0]
site_diversity = mutated_slim_sim.diversity(windows=genome_windows)
branch_diversity = mutated_slim_sim.diversity(windows=genome_windows, mode="branch")
scaled_branch_diversity = branch_diversity * mu
plt.title(f"Genetic diversity in {step / 1000:.0f}kb genome windows")
plt.stairs(site_diversity, genome_windows, baseline=None, ls=":", label="site-based")
plt.stairs(scaled_branch_diversity, genome_windows, baseline=None, label="branch-based")
plt.legend();

In [None]:
display_quiz(Workbook2.url + "Q30.json", shuffle_answers=False)

Recapitation can be a game-changer when simulating realistic populations, saving many days of simulation time. 

# More information

That's the end of the workbook. There is a lot more official documentation at https://tskit.dev/tskit/docs/, and an ever-growing set of tutorials at https://tskit.dev/tutorials. More help to write tutorials is also most welcome! Good luck on your ARG journey.