<div style="color: white; background-color: #03303e; display: flex; align-items: flex-start; margin-bottom: 1em">
    <figure style="text-align: center; margin: 0 1em 0.5em 0.5em">
    <img src="tskit_logo.svg" style="height: 3em;" />
    <figcaption style="font-size: 0.6em; line-height: 1em">Interactive workbooks</figcaption>
</figure>
    <h1 style="margin: auto">An interactive tour of <em>tskit</em></h1></div>
This tour uses the <em>tskit</em> <a href="https://tskit.dev/tskit/docs/stable/python-api.html">Python interface</a>. Press "shift-enter" (or click the toolbar "run" button, &#9654;) to run a notebook cell and move to the next one. Repeat this from the top of the workbook to work gradually through the demonstration code.
<div class="alert alert-block alert-info"><b>Note:</b> For static worked examples see our <a href="https://tskit.dev/tutorials/">tutorials site</a>, which includes introductions for <a href="https://tskit.dev/tutorials/popgen.html">population genetics</a> and <a href="https://tskit.dev/tutorials/phylogen.html">phylogenetics</a>.</div>

## Loading an ARG in "succinct tree sequence" format

In [None]:
import tskit

In [None]:
ts = tskit.load("data/demo.trees")  # By convention we use `ts` or `arg` for the object name

In a notebook, we can show a tabular summary of the tree sequence file by displaying it to screen:

In [None]:
ts  # Running this notebook cell should display a tabular summary of the `ts` object

<div class="alert alert-block alert-info"><b>Note:</b> the <a href="https://tskit.dev/tskit/docs/stable/provenance.html">provenances</a> listed above (in the last part of the output) show how this tree sequence was originally generated. In this case, close inspection shows it was initially simulated by the command <a href="https://tskit.dev/msprime/docs/stable/api.html#msprime.sim_ancestry"><code>sim_ancestry()</code></a>, provided by <a href="https://tskit.dev/msprime/docs/stable/intro.html"><i>msprime</i></a> version 1.4.0, then simplified, with mutations finally added using the <i>msprime</i> <a href="https://tskit.dev/msprime/docs/stable/api.html#msprime.sim_mutations"><code>sim_mutations()</code></a> function.</div>

Each [node](https://tskit.dev/tskit/docs/stable/data-model.html#node-table) in the tree sequence represents a (haploid) genome. *Sample* nodes are the (usually current-day) genomes whose DNA sequences are known. Here there are 80 sample nodes, belonging to 40 (diploid) individuals.

As its name suggests, a tree sequence can be interpreted as a sequence of evolutionary ("gene") trees along a genome. This is done by linking the sample nodes to ancestral nodes via *edges*. The table above reveals that there are 6979 edges in this tree sequence, defining 1811 trees. Later in this workbook you'll see how to plot one of the local trees.
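
To make the link between edges and local trees concrete, here is a minimal sketch in plain Python (it does not use the _tskit_ API). The toy `edges` list and the helper names `tree_intervals` and `parent_map` are hypothetical, invented for illustration. Each edge says that `child` inherits from `parent` over the half-open genomic interval `[left, right)`, and a new local tree starts wherever any edge begins or ends.

```python
# Toy edge table (hypothetical values): (left, right, parent, child).
# Nodes 0-2 play the role of samples; nodes 3-5 are ancestors.
edges = [
    (0, 10, 3, 0),
    (0, 10, 3, 1),
    (0, 4, 4, 2),
    (0, 4, 4, 3),
    (4, 10, 5, 2),
    (4, 10, 5, 3),
]


def tree_intervals(edges):
    """Return the genomic interval covered by each local tree."""
    # Every distinct edge start or end position is a tree boundary
    breakpoints = sorted({x for left, right, _, _ in edges for x in (left, right)})
    return list(zip(breakpoints[:-1], breakpoints[1:]))


def parent_map(edges, position):
    """Return {child: parent} for the local tree covering `position`."""
    return {
        child: parent
        for left, right, parent, child in edges
        if left <= position < right
    }


for left, right in tree_intervals(edges):
    print(f"Tree covering [{left}, {right}):", parent_map(edges, left))
```

In _tskit_ itself this bookkeeping is done for you: iterating with `ts.trees()` yields each local tree in turn.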

## Mutations and genetic variation

*Mutations* on an ARG define the DNA sequence of the samples. These cause sample nodes to differ from each other at a number of *variable sites*.

In [None]:
print(
    f"The mutations in this ARG define {ts.num_sites} sites,",
    f"for {ts.num_samples} sample genomes,",
    f"over a total length of {ts.sequence_length} base pairs",
)

In [None]:
for v in ts.variants():
    print(
        "For the first variable site, the allelic state for",
        f"{ts.num_samples} genomes at position {v.site.position} is:\n",
        v.states(),
    )
    print(f"At variable site number # {v.site.id} (at position {v.site.position} basepairs along the chromosome)")
    for state, freq in v.frequencies().items():
        print(f"* the frequency of {state} is {freq}")
    break
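
As a rough illustration of what the frequencies printed above mean, the sketch below recomputes allele frequencies from a list of allelic states using only the standard library. The `states` list and the function name `allele_frequencies` are made up for this example; they stand in for the per-genome states at a single site.

```python
from collections import Counter


def allele_frequencies(states):
    """Map each allele to the proportion of genomes carrying it."""
    counts = Counter(states)
    n = len(states)
    return {allele: count / n for allele, count in counts.items()}


# Hypothetical allelic states for 8 sample genomes at one variable site
states = ["A", "A", "T", "A", "T", "A", "A", "A"]
freqs = allele_frequencies(states)
for allele, freq in freqs.items():
    print(f"* the frequency of {allele} is {freq}")
```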

## Basic analysis 

Genetic analysis using tree sequences is usually incredibly fast, even for huge sample sizes. Here are a few examples of analysing the loaded _tskit_ ARG: feel free to change them and re-run the cells.

Documentation for the methods is here: https://tskit.dev/tskit/docs/stable/python-api.html (e.g. see [here](https://tskit.dev/tskit/docs/stable/python-api.html#tskit.TreeSequence.allele_frequency_spectrum) for the allele frequency spectrum).

In [None]:
# Calculate the unpolarised allele frequency spectrum
afs = ts.allele_frequency_spectrum(polarised=False)
print(f"AFS calculated for {ts.num_samples} genomes of length {ts.sequence_length/1e6:.2f} Mb")

In [None]:
# and plot it
from matplotlib import pyplot as plt

plt.bar(range(ts.num_samples + 1), afs, color="grey")
plt.xlabel("Count (frequency) in sample")
plt.title("Unpolarised allele frequency spectrum");
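
If the term "unpolarised" is unfamiliar: when we do not know which allele is ancestral, a site where 1 genome carries an allele is indistinguishable from one where all but 1 carry it, so the spectrum is "folded" onto the minor allele count. The sketch below shows that folding for biallelic sites in plain Python. The `derived_counts` values and the `folded_afs` helper are hypothetical, and the result is in raw site counts, whereas _tskit_ additionally divides by the sequence length unless `span_normalise=False` is passed.

```python
def folded_afs(derived_counts, num_samples):
    """Count biallelic sites by minor allele count (a folded spectrum)."""
    # Keep length n + 1 so the x-axis matches the plot above;
    # entries above n / 2 simply stay at zero after folding.
    afs = [0] * (num_samples + 1)
    for k in derived_counts:
        afs[min(k, num_samples - k)] += 1
    return afs


# Hypothetical derived allele counts at 6 sites, among 6 sample genomes
derived_counts = [1, 5, 3, 2, 4, 1]
print(folded_afs(derived_counts, num_samples=6))
```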

In [None]:
# Calculate and plot genetic diversity in windows along the genome
genome_windows = list(range(0, int(ts.sequence_length+1), 20_000))  # or `np.linspace(0, ts.sequence_length, 51)`
windowed_diversity = ts.diversity(windows=genome_windows)
plt.stairs(windowed_diversity, genome_windows)
plt.ylabel("Genetic diversity (π)")
plt.xlabel("Genome position (bp)")
plt.title("Diversity along the genome");
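
For reference, site-based genetic diversity (π) is the mean number of differences between pairs of sample genomes, per base pair. The brute-force sketch below spells out that definition on a few hypothetical haplotypes (the data and the function name `pairwise_diversity` are invented for illustration). Looping over all pairs like this scales poorly, which is exactly the cost the tree-based calculation avoids.

```python
from itertools import combinations


def pairwise_diversity(haplotypes, sequence_length):
    """Mean pairwise differences between haplotypes, per base pair."""
    pairs = list(combinations(haplotypes, 2))
    total_diffs = sum(
        sum(a != b for a, b in zip(h1, h2)) for h1, h2 in pairs
    )
    return total_diffs / len(pairs) / sequence_length


# Hypothetical data: 4 genomes, 3 variable sites, in a 100 bp region
haplotypes = ["AAT", "AAT", "GAT", "GCA"]
pi = pairwise_diversity(haplotypes, sequence_length=100)
print(f"Diversity: {pi:.4f} differences per bp")
```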


## IDs and underlying data

At its heart, a tree sequence is just a collection of tables. For example, genomes are represented by **nodes** in the *nodes table*. Each is referred to by an ID, which is simply its row number in the table. Note that all IDs start with 0, not 1. Here, for instance, is the first node (ID: 0) in the nodes table:

In [None]:
ts.node(0)

<div class="alert alert-block alert-info"><b>Note:</b> The fact that the flag is an odd number indicates that node 0 is a <i>sample node</i>. Nodes that are not samples usually represent ancestral genomes.</div>
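
The "odd number" rule works because node flags are a bit field in which the lowest bit marks a sample. A minimal sketch of that check, assuming the sample bit has value 1 (mirroring `tskit.NODE_IS_SAMPLE`); the constant and the `is_sample` helper are redefined here purely for illustration:

```python
# Assumed to mirror tskit.NODE_IS_SAMPLE: the lowest bit of the flags field
NODE_IS_SAMPLE = 1


def is_sample(flags):
    """Return True if the sample bit is set in a node's flags."""
    return (flags & NODE_IS_SAMPLE) != 0


for flags in (0, 1, 2, 3):
    print(f"flags={flags}: sample node? {is_sample(flags)}")
```

Testing the bit, rather than comparing `flags == 1`, keeps the check correct even when other flag bits are also set.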

Here's another table (the _populations_ table, usually much smaller than the nodes table). You can see that the simulation defines 7 different populations, with IDs 0 to 6. Above, node 0 was shown as belonging to population 0 (i.e. the population named "A" below).

In [None]:
ts.tables.populations

<div class="alert alert-block alert-info"><b>Note:</b> It is not necessary to define or use populations: they are a <i>tskit</i> convenience for grouping multiple nodes together. In other words, the populations table can be left empty, in which case each node should reference a "population" with ID <a href="https://tskit.dev/tskit/docs/stable/python-api.html#tskit.NULL"><code>tskit.NULL</code> (-1)</a>. Similarly, although sample nodes are commonly paired into diploid <a href="https://tskit.dev/tskit/docs/stable/data-model.html#individual-table">individuals</a>, this is not a strict requirement for a <a href="https://tskit.dev/tskit/docs/stable/data-model.html#valid-tree-sequence-requirements">valid tree sequence</a>.</div>

## More complex analysis 

_Tskit_ is designed to allow users to write their own analysis tools, but also has several sophisticated built-in analysis methods.

### Principal components analysis

Here is an example of efficient principal components analysis (PCA). Each point is one of the 80 genomes, but the _tskit_ implementation scales to millions of genomes. Here we use `mode="branch"`, the default for PCA, which means that we don't even need mutations (realised genetic variation) to look at genetic distances (see [this tutorial](https://tskit.dev/tutorials/no_mutations.html)).

In [None]:
# Calculate the first 2 principal components for every haploid genome
# (for a more conventional PCA of diploid individuals, use the `individuals` argument)

pca_output = ts.pca(2, mode="branch", random_seed=42)  # or add `individuals=range(ts.num_individuals)`

# Plot the PCA
for pop in ts.populations():
    use = ts.samples(population=pop.id)  # `use = ts.individual_populations == pop.id` if using individuals
    if len(use) > 0:  # skip populations with no samples (keep `use.any()` if `use` is the boolean individuals mask)
        plt.scatter(*pca_output.factors[use, :].T, label=f'Population {pop.metadata["name"]}')
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.legend();

### Coalescence rates

Built-in tools exist to calculate coalescence and cross-coalescence rates. Below you can see that there is no coalescence between populations A and B more recently than 250 generations ago, suggesting populations A and B split around this time, whereas C and D probably split more recently, about 200 generations ago. This could partly explain why points from C and D overlap on the PCA plot above.

Note that the "inverse instantaneous coalescence rate" (IICR) is sometimes known as the effective population size or $N_e$, which is often used in population genetics.

In [None]:
import numpy as np
num_timebins = 20 
time_breaks = np.concatenate(( # NB: first bin must start at sample time & last at infinity
    [0], np.logspace(np.log10(35), np.log10(ts.max_time/2), num_timebins - 2), [np.inf]
))

all_rates = ts.pair_coalescence_rates(time_windows=time_breaks)
plt.stairs(all_rates[1:-1], time_breaks[1:-1], label="All coalescences")

# Now calculate some cross coalescences
samples = {pop.metadata["name"]: ts.samples(pop.id) for pop in ts.populations()}
cross_AB_rates = ts.pair_coalescence_rates(sample_sets=[samples["A"], samples["B"]], time_windows=time_breaks)
cross_CD_rates = ts.pair_coalescence_rates(sample_sets=[samples["C"], samples["D"]], time_windows=time_breaks)
plt.stairs(cross_AB_rates[1:-1], time_breaks[1:-1], label="A/B cross coalescences")
plt.stairs(cross_CD_rates[1:-1], time_breaks[1:-1], label="C/D cross coalescences")

plt.xscale("log")
plt.yscale("log")
plt.xlabel(f"Time ({ts.time_units})")
plt.ylabel("Genome-wide instantaneous coalescence rate")
plt.legend();
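
To relate these rates to effective population size, a common convention is $N_e \approx 1 / (2 \times \text{rate})$. This assumes a diploid population with time measured in generations, so that two genomes coalesce at rate $1/(2N_e)$ per generation. The sketch below applies that conversion in plain Python; the `rates` values and the `rate_to_ne` helper are hypothetical, and time bins with no coalescence information are passed through as NaN rather than producing a division by zero.

```python
import math


def rate_to_ne(rate, ploidy=2):
    """Convert a pairwise coalescence rate to an effective population size.

    Assumes time is in generations, so that rate = 1 / (ploidy * Ne).
    """
    if math.isnan(rate) or rate <= 0:
        return math.nan  # no coalescences observed in this time bin
    return 1 / (ploidy * rate)


# Hypothetical per-generation rates, one per time bin
rates = [0.0, 0.0005, 0.001, math.nan]
ne_estimates = [rate_to_ne(r) for r in rates]
print(ne_estimates)
```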

### Windowing in time and space

As with all other tskit statistics, coalescence rates can be windowed along the genome. Below you can see a burst of coalescence (red) at about 170 generations ago around 0.3 Mb along the genome, with an absence of coalescence information (grey) within that population at more distant times for that genomic region. This is indicative of a selective sweep at that genomic location (particularly clear because of the cleanliness of simulated data).

In [None]:
time_breaks = np.concatenate(( # NB: first bin must start at sample time & last at infinity
    [0], np.logspace(np.log10(35), np.log10(ts.max_time/10), num_timebins - 2), [np.inf]
))

popA_rates = ts.pair_coalescence_rates(time_windows=time_breaks, windows=genome_windows, sample_sets=[samples["A"]])
colourmap = plt.get_cmap('jet')
colourmap.set_bad(color='grey')
im = plt.pcolormesh(genome_windows, time_breaks[1:-1], popA_rates[:,1:-1].T, cmap=colourmap)
bar = plt.colorbar(im)
bar.ax.set_ylabel('pairwise coalescent density', labelpad=10, rotation=270)
plt.yscale("log")
plt.xlabel("Genome position (bp)")
plt.text(0, 170, "170\ngens", ha="right", va="center", c="darkred")
plt.ylabel(f"Time ({ts.time_units})")
plt.title("Local coalescence densities for population A");

## Information about a tree sequence's origin ("provenance")

For reproducibility, the [provenance table](https://tskit.dev/tskit/docs/stable/provenance.html) stores information about how a tree sequence was generated. The amount of detail depends on the tool(s) used to make the tree sequence. In this case, _msprime_ was used, which stores extensive provenance information including the simulated [demographic model](https://tskit.dev/msprime/docs/stable/demography.html). This can be plotted using the [demesdraw](https://grahamgower.github.io/demesdraw) software.

In [None]:
import sys
if sys.platform == 'emscripten':  # only needed for jupyterlite to load demesdraw
    import micropip
    await micropip.install("demesdraw")

import demesdraw
import msprime

first_provenance_entry = ts.provenance(0)
cmd, parameters = msprime.provenance.parse_provenance(first_provenance_entry, ts)
assert cmd == "sim_ancestry"  # just check we have the right (zeroth) provenance entry

msprime_demography_object = parameters["demography"]
demesdraw.tubes(
    msprime_demography_object.to_demes(),
    colours={"A": "tab:blue", "B": "tab:orange", "C": "tab:green", "D": "tab:red"},
    log_time=False,
)
plt.show()

The demography explains the pattern that was seen in the PCA and the cross-coalescence rates, where populations C and D split more recently than A and B.

## Tree plotting

The tree sequence allows fast extraction of ancestral trees along the genome, which can be easily plotted:

In [None]:

first_tree = ts.first()
first_tree.draw_svg(size=(1000, 250), title="Tree at genome position 0", node_labels={})

_Tskit_ tree plots can be decorated using a variety of styles (see the [viz tutorial](https://tskit.dev/tutorials/viz.html)). Here we plot the tree at position 0.3 Mb, to see if population A really does have a cluster of coalescences around 170 generations ago (at the dotted magenta line: it does!).

In [None]:
from matplotlib import colors
styles = [f".leaf.p{p.id}>.sym" + "{fill:" + colors.to_hex(f"C{p.id}") + "}" for p in ts.populations()]

# Making a legend can be done, but involves some hand-positioning of elements
legend = '<rect width="100" height="100" x="100" y="10" fill="#EEE" /><text x="140" y="25">Key</text>'
legend += "".join([  # The legend lines, one for each population.
    f'<g transform="translate(103, {40 + 18*p.id})" class="leaf p{p.id}">'  # an SVG group
    f'<rect width="6" height="6" class="sym" />'  # Square symbol
    f'<text x="10" y="7">Population {p.metadata["name"]}</text></g>'  # Label
    for p in ts.populations() if len(ts.samples(p.id)) > 0
])

ts.at(0.3e6).draw_svg(
    size=(1000, 250),
    node_labels={},    # Remove all node labels for a clearer viz
    mutation_labels={},
    style="".join(styles) + ".y-axis .tick:nth-child(2) .grid {stroke: magenta; stroke-dasharray: 4}",
    preamble=legend,
    y_axis=True,
    y_gridlines=True,
    y_ticks=[0, 170, 500, 1000, 1500],
    title="Tree at genome position 300,000 bp",
)

## Other resources and analyses
This notebook is fully interactive. Feel free to change the analyses above or perform more analysis below. 

Extensive static tutorial material is available at https://tskit.dev/tutorials/, and there are also other notebooks in this JupyterLite instance: View &rarr; File Browser will reveal them on the right-hand toolbar if they are not shown.

In [None]:
# Use this as a playground for trying out tskit
