In [None]:
import sys

if sys.platform != 'emscripten':
    # allow the notebook to be downloaded and run locally too
    print("This notebook is not running in JupyterLite: you may need to install tskit, tszip, etc.")
else:
    import micropip
    await micropip.install('jupyterquiz')
    await micropip.install('tskit_arg_visualizer')

from jupyterquiz import display_quiz
from WhatIsAnARG_module import Workbook1


# ARGs and _tskit_: an introduction

### Background reading and skills required

ARGs provide a principled way of thinking about genetic inheritance with recombination: i.e. recombinant phylogenies. Some recent reviews are [Lewanski et al. (2024)](https://doi.org/10.1371/journal.pgen.1011110) and [Nielsen et al. (2024)](https://doi.org/10.1038/s41576-024-00772-4). To complete this workbook you should be comfortable with basic programming in [Python](https://www.python.org) (including e.g. [list comprehensions](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions) and [dictionaries / dict comprehensions](https://docs.python.org/3/tutorial/datastructures.html#dictionaries)). The workbook also makes substantial use of the "numerical Python" library, [`numpy`](https://numpy.org): there is a quickstart tutorial [here](https://numpy.org/devdocs/user/quickstart.html). It may also help if you have some familiarity with the basics of the [Tree Sequence Toolkit (_tskit_)](https://tskit.dev/genetics-research/), and its Python interface, by reading https://tskit.dev/tutorials/what_is.html, https://tskit.dev/tutorials/args.html and https://tskit.dev/tutorials/no_mutations.html.

### Overview
This is the first of 2 workbooks: each is intended to take about 1½-2 hours to complete.

<dl>
<dt>Workbook 1</dt>
    <dd>Graphs and local trees, mutations, genetic variation and statistical calculations: exercises A-J & questions 1-11</dd>
<dt>Workbook 2</dt>
    <dd>Types of ARG, simplification, and ARG simulation: exercises K-U & questions 12-30</dd>
</dl>

In [None]:
Workbook1.setup()

# Workbook 1: ARGs, local trees, and genetic variation

## _Tskit_ basics

The _tskit_ library allows you to store and analyse ARGs in the *tree sequence* format. This is the library that underlies simulators such as _msprime_ and _SLiM_, and inference tools such as _tsinfer_, _tsdate_, and _SINGER_. The library is extensive and well-documented at https://tskit.dev/tskit/docs/stable/. We will use the [Python interface](https://tskit.dev/tskit/docs/stable/python-api.html), but there are also C and Rust interfaces, and it is possible to use _tskit_ from [within R]() as well.

First, we load a small simulated ARG in tree sequence format. This is a "full ARG" which describes the ancestry of a set of 10 sampled genomes and includes nodes that explicitly record recombination events (so-called "recombination nodes").

It is conventional to store such genealogies in a variable called `arg` or `ts`:

In [None]:
import tskit
arg = tskit.load("data/example_ARG.trees")  # it is conventional to use `arg` or `ts` as the standard variable
print("A tskit ARG has been loaded into the variable 'arg'")

It is important to distinguish the *sample genomes* in this ARG. Their numerical IDs can be obtained using the `.samples()` method. Often these will be sequential, starting from zero. However, this need not always be the case so it's always best to check:
<div class="alert alert-block alert-info"><b>Note:</b> For those coming from <em>R</em>, <em>tskit</em> uses the Python convention of indexing from 0 rather than from 1. So the first node has ID 0, not ID 1.</div>

In [None]:
print("The sampled genomes have IDs:", arg.samples())

## ARGs and local trees

The [_tskit_arg_visualizer_](https://github.com/kitchensjn/tskit_arg_visualizer) software uses the [D3js library](https://d3js.org) to visualise ARGs and other tree sequences interactively, in a browser or Jupyter notebook. As is conventional, the oldest nodes are drawn at the top, with the youngest, usually at time 0, at the bottom.

The visualiser creates a [`D3ARG`](https://github.com/kitchensjn/tskit_arg_visualizer/blob/main/docs/tutorial.md#what-is-a-d3arg) object from the _tskit_ ARG. This object can then be plotted using `.draw()`.

<div class="alert alert-block alert-info"><b>Note:</b> You'll see that some nodes in this plot have two IDs. Don't worry about this: as we'll see later, it's because the simulator has represented recombination using 2 nodes, which have been overlaid in the visualizer.</div>

<dl class="exercise"><dt>Exercise A</dt>
    <dd>Try tidying-up the plot below by dragging nodes horizontally. You can save the result if you want via the buttons above the graph.</dd>
</dl>

In [None]:
import tskit_arg_visualizer as argviz

d3arg = argviz.D3ARG.from_ts(arg)  # by convention, the argviz object has "d3" prepended to the original name
d3arg.set_node_styles({u: {"symbol": "d3.symbolSquare"} for u in arg.samples()}) # Set some viz styles
d3arg.draw(edge_type="ortho", width=800);  # draw the D3ARG in the notebook: the semicolon hides the return value of the draw() method

## The importance of sample nodes

Each node in the graph above represents a genome. The **square** nodes represent *sampled genomes* (or "samples") and the round "nonsample" nodes represent *ancestral genomes*. Importantly, ARGs only represent the ancestry of the sample nodes. For example, ancestor 35 would have passed on a whole genome's worth of DNA to node 34, but if you hover over the edge that joins them, you will see that only part of the genome eventually made it into the samples.

This means that even if we have the whole genomes of all the sample nodes, the known regions of ancestral genome may not cover the whole sequence length. To put it another way, ancestral genomes may only be partially knowable.
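
One way to picture a "partially knowable" ancestor is as the union of the genome spans that it passes down to its children. The following is a minimal plain-Python sketch, not _tskit_ code: the function name `known_regions` and the spans are invented for illustration and are not taken from the example ARG.

```python
def known_regions(child_edge_spans):
    """Merge half-open (left, right) spans into sorted, disjoint intervals."""
    merged = []
    for left, right in sorted(child_edge_spans):
        if merged and left <= merged[-1][1]:
            # Overlaps or abuts the previous interval, so extend it
            merged[-1][1] = max(merged[-1][1], right)
        else:
            merged.append([left, right])
    return [tuple(interval) for interval in merged]


# Hypothetical spans over which one ancestor passes material to its children
spans = [(0, 300), (200, 450), (700, 1000)]
regions = known_regions(spans)
covered = sum(right - left for left, right in regions)
print("Known regions:", regions)
print("Total known length:", covered)
```

In this made-up case the ancestor's genome is only known for 750 of 1000 positions: nothing in the samples descends from the region between 450 and 700.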

<dl class="exercise"><dt>Exercise B</dt>
    <dd>Hover over a few of the lines in the graph above to see which regions of the genome are inherited along different edges. Then try adding <code>variable_edge_width=True</code> to the <code>.draw()</code> call, to display the widths of edges as proportional to their span. Does this help show the routes through which the majority of the genome has travelled?</dd>
</dl> 

In [None]:
display_quiz(Workbook1.url + "Q1.json")

## SPR operations

Each left/right transition from one tree to another represents a *recombination breakpoint* where forward in time, two lineages (one from the maternal and one from the paternal genome) were combined via meiosis. Looking backwards in time, this results in one genome (node) having two parents.

If we know the recombination nodes, it is easy to see how one tree changes into another along the genome. A good approximation is that the tree on one side of a breakpoint can be transformed into the tree on the other side of the breakpoint via a single tree-editing operation, known as a subtree prune and regraft (SPR).
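
To make the subtree-prune-and-regraft idea concrete, here is a rough plain-Python sketch that stores a tree as a child-to-parent dictionary. It is an illustration under simplifying assumptions (strictly binary trees, node times ignored), and the function `spr` and the node IDs are hypothetical rather than part of _tskit_.

```python
def spr(parent, u, v, new_node):
    """Return a copy of the child->parent map with the subtree rooted at u
    moved onto the branch above v. Assumes a strictly binary tree."""
    parent = dict(parent)
    p = parent[u]  # the old parent of u, which is removed by the prune
    sibling = next(c for c, q in parent.items() if q == p and c != u)
    # Prune: the sibling attaches to the grandparent, or becomes the root
    grandparent = parent.pop(p, None)
    if grandparent is None:
        del parent[sibling]
    else:
        parent[sibling] = grandparent
    # Regraft: new_node splits the branch above v, and is the parent of u and v
    if v in parent:
        parent[new_node] = parent[v]
    parent[v] = new_node
    parent[u] = new_node
    return parent


# Tree ((0,1)4,(2,3)5)6 written as child -> parent
left_tree = {0: 4, 1: 4, 2: 5, 3: 5, 4: 6, 5: 6}
# Move sample 0 onto the branch above sample 2, creating a new ancestor 7
right_tree = spr(left_tree, u=0, v=2, new_node=7)
print(right_tree)
```

Note that regrafting onto the sibling branch gives back the same topology with a different parent node, which corresponds to a change in branch lengths only.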

<dl class="exercise"><dt>Exercise C</dt>
    <dd>Move your pointer over the genome bar underneath the graph, and try to get a feel for how one local tree (which will be highlighted in green, embedded in the larger ARG) is transformed into the next one.</dd>
</dl> 

In [None]:
display_quiz(Workbook1.url + "Q2.json")

## Left-right ARG traversal: the `.trees()` iterator

A fundamental ARG operation, which forms the basis of operations such as calculating genetic statistics, is to move along the genome, obtaining the local trees at each point. In _tskit_ this is done using the [`.trees()`](https://tskit.dev/tskit/docs/stable/python-api.html#tskit.TreeSequence.trees) method. Each new tree, and the data associated with it, is formed by making a slight change to the previous tree, making this an efficient operation, even for huge trees of millions of nodes.
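
The "slight change to the previous tree" can be sketched in plain Python: at each breakpoint, edges that end are removed and edges that start are added to a single child-to-parent mapping. This is only an illustration of the idea, with a made-up edge list and a hypothetical `local_trees` function; the real library uses indexed edge orderings rather than scanning every edge at every breakpoint.

```python
from collections import namedtuple

Edge = namedtuple("Edge", "left right parent child")


def local_trees(edges):
    """Yield ((left, right), child->parent dict) for each local tree."""
    breakpoints = sorted({e.left for e in edges} | {e.right for e in edges})
    parent = {}  # one mapping, updated in place as we move along the genome
    for left, right in zip(breakpoints[:-1], breakpoints[1:]):
        for e in edges:
            if e.right == left:
                del parent[e.child]  # this edge ends at the breakpoint
        for e in edges:
            if e.left == left:
                parent[e.child] = e.parent  # this edge starts at the breakpoint
        yield (left, right), dict(parent)  # copy, so callers can keep each tree


# Invented edges for 3 samples (0, 1, 2) and 2 ancestors (3, 4)
edges = [
    Edge(0, 50, 3, 0), Edge(0, 100, 3, 1), Edge(50, 100, 3, 2),
    Edge(0, 50, 4, 2), Edge(50, 100, 4, 0), Edge(0, 100, 4, 3),
]
trees = list(local_trees(edges))
for interval, tree_parents in trees:
    print(interval, tree_parents)
```

Only two of the four child-to-parent links change at position 50; the rest of the mapping is reused untouched.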

In [None]:
for tree in arg.trees():
    display(tree)

### Visualising local trees

By default, `tskit` displays each local tree as a summary table, as above. To draw the tree, you can use the [`.draw_svg()`](https://tskit.dev/tutorials/viz.html#svg-format) method, suitable for small trees of tens or hundreds of nodes each.

In [None]:
arg.draw_svg(
    size=(1200, 400),  # (width, height) in pixels
    omit_sites=True,  # Later in the workbook we'll remove this, to reveal the variable sites 
)

In [None]:
display_quiz(Workbook1.url + "Q3.json")

### Branch length changes
Perhaps surprisingly, many recombination events in an ARG do not change the topology of the local trees, but merely change branch lengths (i.e. change the time and identity of the most recent common ancestor, or MRCA, of certain sets of samples). This can be seen most clearly at the top of the graph-based ARG visualization. At 30 000 generations ago there are only two lineages left. Whichever route we take through the ARG above that point, we will end up joining node 26 and node 23, just at a potentially different time in the past.

In [None]:
display_quiz(Workbook1.url + "Q4.json")

### "Identical tree" changes

Some recombination events change neither the topology nor the branch lengths. This is seen when going from the third to the fourth tree. That change only involves node ID 24 switching to ID 25, which represents the recombination event at the bottom of the diamond. A peculiarity of ARGs created using the _msprime_ simulator is that for [technical reasons](https://tskit.dev/msprime/docs/stable/ancestry.html#recombination-events), recombination events are recorded as *two* simultaneous recombination nodes (e.g. nodes 24/25), representing the paternal and maternal genomes prior to meiosis. These are visually merged into a single node in the ARG visualizer. Assuming a single crossover event, one node is to the left and one to the right of the breakpoint.

<dl class="exercise"><dt>Exercise D</dt>
    <dd>Copy the <code>draw_svg()</code> code above into the box below, but plot only the trees between genome position 200 to 800 by using <code>x_lim=(200, 800)</code>, and add a Y axis, using <code>y_axis=True</code>. To make the Y axis ticks look nicer, you could also specify <code>y_ticks={0: "0", 20_000: "20k", 40_000: "40k"}</code>.</dd>
</dl>

In [None]:
# Complete Exercise D here


## Coalescent and non-coalescent regions

Looking at the tree-by-tree plot, it should be clear that some of the nodes in a local tree have one child in some trees, and two children in others. There are even some nodes that have only one child in every tree in which they appear (e.g. node 26). We can classify nodes into:

0. **non-coalescent**, sometimes called _always unary_ (i.e. one child in all local trees, e.g. node 26)
1. **part-coalescent**, sometimes called _locally unary_ (i.e. one child in some local trees, coalescent in others, e.g. node 18)
2. **all-coalescent**, or _never unary_ (always 2 or more children in any local tree in which they are found, e.g. node 19).

<div class="alert alert-block alert-info"><b>Note:</b> This classification will turn out to be important in Workbook 2, when we learn about <em>simplifying</em> ARGs by collapsing some nodes (particularly "non-coalescent" nodes) depending on whether they are detectable.</div>
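
The three-way classification above can be sketched in plain Python by counting, for each local tree, how many children every node has. This is an illustrative stand-in and not the helper function supplied with this workbook: the function name, the toy trees, and the choice to return `None` for nodes that never have children (e.g. sample tips) are all assumptions made here.

```python
from collections import Counter


def coalescence_status(trees, num_nodes):
    """0 = non-coalescent, 1 = part-coalescent, 2 = all-coalescent,
    None = never has any children. `trees` is a list of child->parent dicts."""
    seen_unary = [False] * num_nodes
    seen_coalescent = [False] * num_nodes
    for parent in trees:
        # Count how many children each node has in this local tree
        for node, n_children in Counter(parent.values()).items():
            if n_children == 1:
                seen_unary[node] = True
            else:
                seen_coalescent[node] = True
    status = []
    for u in range(num_nodes):
        if seen_unary[u] and seen_coalescent[u]:
            status.append(1)
        elif seen_coalescent[u]:
            status.append(2)
        elif seen_unary[u]:
            status.append(0)
        else:
            status.append(None)
    return status


# Two invented local trees over samples 0-2 and ancestors 3-6
tree_a = {0: 3, 1: 3, 3: 6, 6: 5, 2: 5}
tree_b = {0: 3, 1: 4, 2: 4, 3: 6, 6: 5, 4: 5}
status = coalescence_status([tree_a, tree_b], num_nodes=7)
print(status)
```

Here node 6 has a single child in both trees (non-coalescent), node 3 has two children in one tree but one in the other (part-coalescent), and nodes 4 and 5 always have two children where they appear (all-coalescent).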

In [None]:
display_quiz(Workbook1.url + "Q5.json")

The `Workbook1.node_coalescence_status()` function, written for this workbook, uses the `.trees()` iterator to construct an array denoting the classification of each node. `returned_array[i]` is **0** if node `i` is non-coalescent, **1** if node `i` is part-coalescent, and **2** if node `i` is all-coalescent. You can check below that this agrees with your understanding:

In [None]:
description = {0: "Non-coalescent", 1: "Part-coalescent", 2: "All-coalescent"}
samples = arg.samples()
for node_id, status in enumerate(Workbook1.node_coalescence_status(arg)):
    extra = ", sample" if node_id in samples else ""
    print(f"Node {node_id} coalescence status: {description[status]} ({status}{extra})")

## Nodes and individuals

You can think of each ARG node as representing a (haploid) genome. Humans (and most other eukaryotes) are *diploid*, meaning they contain 2 genomes: a maternal and a paternal one. 

In _tskit_, this information is stored by allowing a node to be associated with an individual ID, linked to a separate table of individuals. Usually only sample nodes have a valid (non-negative) individual ID. Other nodes have their `individual` ID set to `-1` (also known as `tskit.NULL`); this includes ancestral nodes, as we usually don't know how ancestral genomes were grouped into diploid individuals.
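
As a small plain-Python illustration of this layout, the node-to-individual link can be thought of as one integer per node, with `-1` standing in for "no individual". The array below is invented for the example and does not come from the example ARG.

```python
NULL = -1  # stands in for tskit.NULL in this sketch

# Hypothetical "individual" column: one entry per node, indexed by node ID
node_individual = [0, 0, 1, 1, NULL, NULL]

# Group node IDs by the individual that they belong to
genomes_of = {}
for node_id, individual_id in enumerate(node_individual):
    if individual_id != NULL:
        genomes_of.setdefault(individual_id, []).append(node_id)

for individual_id, node_ids in genomes_of.items():
    print(f"Individual {individual_id} contains genomes (nodes) {node_ids}")
```

Each diploid individual ends up with two node IDs, while the ancestral nodes 4 and 5 belong to no individual.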

In [None]:
colours = {
    tskit.NULL: "LightGrey",  # nodes that are not associated with an individual are coloured in grey
    0: "LightGreen",
    1: "DodgerBlue", 
    2: "MediumVioletRed",
    3: "Coral",
    4: "DarkGoldenRod"
}
for node in arg.nodes():
    individual_id = node.individual
    print(
        f"Node {node.id} is associated with individual {individual_id}",
        f"which we will colour {colours[individual_id]}" if  individual_id != tskit.NULL else ""
    )

When visualising the ARG (either as a graph or as a series of trees), we don't usually want to group the two nodes of each individual together, as the maternal and paternal genomes within an individual can have quite different histories. You can try to rearrange the sample nodes by colour, to convince yourself that the two samples from a single individual do not trivially group together.

Note that the following code does not use the "orthogonal" plot style, but reverts to the default `edge_type` of `"line"`, which simply joins nodes by straight lines. As well as colouring the nodes by individual, it colours recombination nodes as black. It also shows not just the *topology* of the ARG, but also the length (in time) of each edge, by plotting the Y axis on a linear timescale.

In [None]:
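import msprime

# Colour nodes by individual, but show recombination nodes in black
d3arg.set_node_styles({
    node.id: {"fill": "black" if node.flags & msprime.NODE_IS_RE_EVENT else colours[node.individual]}
    for node in arg.nodes()
})
d3arg.draw(title="Nodes coloured by individual", y_axis_scale="time", width=800);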

The ARG was actually simulated using a model of human evolution that reflects the Out of Africa event. As well as having a value denoting the <code>individual</code>, each node also has a value indicating a <code>population</code> to which it belongs.

<dl class="exercise"><dt>Exercise E</dt>
    <dd>Change the code above to colour by <code>node.population</code> ID rather than <code>node.individual</code> ID. You could also stop colouring the recombination nodes as black if you like.</dd>
</dl>

In [None]:
display_quiz(Workbook1.url + "Q6.json")

### Metadata (a side note)

For efficiency, most <em>tskit</em> functions refer to nodes, individuals, populations, etc. using their numerical IDs. However, it can be tricky to keep track of e.g. which populations correspond to which IDs. Moreover, these IDs can change, for instance if populations or nodes are removed from the tree sequence. So to help keep track of objects, even when their IDs change, <em>tskit</em> allows them to be associated with user-defined *metadata*. There's a <a href="https://tskit.dev/tutorials/metadata.html">tutorial on metadata</a> if you need to know more.

In our example ARG, we have added metadata for each population consisting of a name and a fuller description. You can see this when accessing a population by its ID:

In [None]:
pop_id = 0
pop = arg.population(pop_id)
print(pop)  # or print(pop.metadata) to see just the metadata

Or you can see the metadata for all populations by looping over them:

In [None]:
for pop in arg.populations():
    print(pop)

In [None]:
# Sometimes it is helpful to use a mapping of a metadata value to the object ID
pop_name_to_id_map = {pop.metadata['name']: pop.id for pop in arg.populations()}

# Now use the map to get only those samples from the `AFR` population
print(
    "The sample genomes for the AFR population have IDs:",
    arg.samples(population=pop_name_to_id_map["AFR"])
)

<dl class="exercise"><dt>Exercise F</dt>
    <dd>Using a <code>for</code> loop, iterate over <code>arg.individuals()</code> and print out each individual.</dd>
</dl>

In [None]:
# Complete Exercise F here


In [None]:
display_quiz(Workbook1.url + "Q7.json")

## Mutations

Genetic variation is created by *mutations* on the ARG. So far we have not shown mutations, but we can reveal them by setting `show_mutations` to `True` in the ARG visualizer. If you hover over the orange mutation symbols in the visualization below, their position on the lower genome bar will be revealed.

In [None]:
d3arg.draw(show_mutations=True, y_axis_scale="time", width=800, title=f"ARG with {arg.num_mutations} mutations");

### Encoding sites and mutations in _tskit_

In _tskit_, mutations occur at specific *sites*. A site is defined by a *site id*, a *position* and an *ancestral state*. A mutation is defined by a *mutation id*, a *derived state*, an associated *node id*, an (optional) *time*, and an associated *site id*. If two mutations have the same site id, those mutations occur at the same position along the genome.
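
The site and mutation definitions above can be mimicked with two small lists of records. This is a plain-Python sketch of the idea only: the `Site` and `Mutation` tuples and all of their values are invented, and the real tables hold further columns not shown here.

```python
from collections import namedtuple

Site = namedtuple("Site", "id position ancestral_state")
Mutation = namedtuple("Mutation", "id site node derived_state time")

# Hypothetical rows: the ID of each record is its position in the list
sites = [Site(0, 120.0, "A"), Site(1, 563.0, "G")]
mutations = [
    Mutation(0, site=0, node=7, derived_state="T", time=150.0),
    Mutation(1, site=1, node=3, derived_state="C", time=40.0),
    Mutation(2, site=1, node=9, derived_state="A", time=900.0),
]

# Group mutation IDs by site ID: several entries mean repeated hits at one position
mutations_at_site = {s.id: [m.id for m in mutations if m.site == s.id] for s in sites}
multiply_hit_positions = [
    sites[site_id].position
    for site_id, mutation_ids in mutations_at_site.items()
    if len(mutation_ids) > 1
]
print(mutations_at_site)
print("Positions with more than one mutation:", multiply_hit_positions)
```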

<dl class="exercise"><dt>Exercise G</dt>
    <dd>The <code>include_mutation_labels</code> and <code>condense_mutations</code> parameters provide different ways to display mutations on branches. Add <code>label_mutations=True</code> to the <code>.draw()</code> method above, to see its effect.</dd>
</dl>

<dl class="exercise"><dt>Exercise H</dt>
    <dd>In the cell below, use <code>arg.draw_svg()</code> to plot the mutated ARG as a series of trees <em>without</em> using the <code>omit_sites</code> parameter. You will see that mutations are drawn as red crosses on the trees, and also depicted as red arrows on the x axis, at the appropriate site position. Multiple mutations at the same site will have those arrows stacked.</dd>
</dl>

In [None]:
# Complete exercise H here


In [None]:
display_quiz(Workbook1.url + "Q8.json")

## Genetic variation & the `.variants()` iterator

The inheritance of mutations through the ARG defines the genotypes at each variable site for each sample. The _tskit_ [`.variants()`](https://tskit.dev/tskit/docs/stable/python-api.html#tskit.TreeSequence.variants) method goes through the local trees, decoding the genetic variation at each site by looking at the inheritance of mutations. For efficiency, the genotypes for all the samples at a site are returned as a numerical vector. Each value in the vector denotes the index into a corresponding list of alleles:

In [None]:
for variant in arg.variants():
    print(
        f"Genotypes at position {variant.site.position}",
        f"for the {arg.num_samples} samples ",
        f"are {variant.genotypes},",
        f"denoting the alleles {variant.alleles}",
    )

<dl class="exercise"><dt>Exercise I</dt>
    <dd>Change the code above to display the actual allelic states (rather than a numerical value), by using the <a href="https://tskit.dev/tskit/docs/stable/python-api.html#tskit.Variant.states"><code>variant.states()</code></a> method in place of accessing <a href="https://tskit.dev/tskit/docs/stable/python-api.html#tskit.Variant.genotypes"><code>variant.genotypes</code></a> directly. This can be useful for displaying or checking smaller examples, as opposed to numerical analysis of larger datasets.</dd>
</dl>
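
To see how genotypes relate to alleles, it may help to decode a single site by hand. The plain-Python sketch below walks up from each sample until it meets a mutation, then records the index of the resulting state in a list of alleles. It is a simplified illustration, not the library's algorithm: the function `decode_site` and the toy tree are hypothetical, and the order in which derived alleles are listed here may differ from the order _tskit_ uses.

```python
def decode_site(parent, samples, ancestral_state, mutations_above):
    """Return (genotypes, alleles) for one site.

    parent: child->parent dict for the local tree at this site
    mutations_above: dict mapping node -> derived state of a mutation above it
    """
    alleles = [ancestral_state]  # index 0 is always the ancestral state
    genotypes = []
    for u in samples:
        state = ancestral_state
        v = u
        while v is not None:
            if v in mutations_above:
                state = mutations_above[v]  # the nearest mutation determines the state
                break
            v = parent.get(v)  # None once we step above the root
        if state not in alleles:
            alleles.append(state)
        genotypes.append(alleles.index(state))
    return genotypes, alleles


# Toy tree ((0,1)3,2)4 with one A->T mutation on the branch above node 3
tree_parents = {0: 3, 1: 3, 2: 4, 3: 4}
genotypes, alleles = decode_site(tree_parents, [0, 1, 2], "A", {3: "T"})
states = [alleles[g] for g in genotypes]
print(genotypes, alleles, states)
```

Storing small integers plus one short list of alleles per site is what makes the numerical genotype vector compact; the allelic states can always be recovered by indexing, as in the last line.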

In [None]:
display_quiz(Workbook1.url + "Q9.json")

## Mutations allow ARG inference

Mutations on an ARG generate genetic variation. Conversely, genetic variation among a set of samples can be used to *infer* an ARG that could plausibly have generated the samples. To do this, it is often useful to make the "infinite sites" approximation, that every mutation occurs at a different position (as we saw above, this is not always true).

When we use mutations to infer genealogy, it can be helpful to *polarise* the alleles at a genetic site (figure out which allele is ancestral and which are derived). Then we can reasonably assume that if a set of genomes share the derived allele, they all inherited it from a common ancestor. This allows us to deduce the presence of a common ancestor node in the ARG.
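
As a rough plain-Python sketch of this reasoning, each polarised site can be turned into the set of samples that carry the derived allele, which under the infinite sites approximation is the set of descendants of some ancestor. Two such sets can sit on the same tree only if they are nested or disjoint. The function names and genotype rows below are invented for illustration, and this is not how any particular inference tool is implemented.

```python
def implied_clades(genotype_rows):
    """Each row has 0 (ancestral) or 1 (derived) per sample; return carriers per site."""
    return [frozenset(i for i, g in enumerate(row) if g == 1) for row in genotype_rows]


def compatible(a, b):
    """Two sets of descendants fit on one tree if they are nested or disjoint."""
    return a <= b or b <= a or a.isdisjoint(b)


# Hypothetical polarised genotypes for 4 samples at 3 sites
rows = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 0],
]
clades = implied_clades(rows)
first_two_fit = compatible(clades[0], clades[1])
first_and_last_fit = compatible(clades[0], clades[2])
print(clades)
print(first_two_fit, first_and_last_fit)
```

The first two sites imply nested ancestors, so a single tree can explain both. The first and third sites overlap without nesting, which under infinite sites suggests that they lie in different local trees, i.e. that a recombination breakpoint falls between them.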



In [None]:
display_quiz(Workbook1.url + "Q10.json")

## Provenance

Helpfully, _tskit_ tries to store information on how an ARG was generated. Software like _msprime_, which was used to generate this ARG, places details of its commands in the [provenance](https://tskit.dev/tskit/docs/stable/provenance.html) table. We'll use this to check on the mutation rate that was used in the simulation, so that we can use it later:

In [None]:
import json  # Provenance data is stored in JSON format
last_provenance = json.loads(arg.provenance(-1).record)
assert last_provenance['parameters']['command'] == "sim_mutations"  # Check the last operation added mutations via the msprime `sim_mutations` command
mutation_rate = last_provenance['parameters']['rate']
print("The ARG was generated using a mutation rate of", mutation_rate, "mutations per base pair per generation")

## Population genetic statistics

An ARG with mutations completely summarises the genetic variation present in the samples. Moreover, by only looking at differences between local trees, genome-wide statistical calculations can often be performed much more efficiently than naive approaches. Details of the statistics available in _tskit_ (including genetic divergence, Patterson's f statistics, PCA and genetic relatedness, etc) are at [https://tskit.dev/tskit/docs/stable/stats.html](https://tskit.dev/tskit/docs/stable/stats.html). Many of these statistics are summaries of the _allele frequency spectrum_, which itself can be returned by a single call to the [`.allele_frequency_spectrum()`](https://tskit.dev/tskit/docs/stable/stats.html#sec-stats-notes-afs) method:

In [None]:
from matplotlib import pyplot as plt


# Whole genome AFS
afs = arg.allele_frequency_spectrum(polarised=True)
plt.bar(range(arg.num_samples + 1), afs)
plt.xlabel("Number of samples inheriting a mutation")
plt.ylabel("Density")
plt.title("Polarised allele frequency spectrum")
plt.show()

<dl class="exercise"><dt>Exercise J</dt>
    <dd>Add <code>span_normalise=False</code> to the <code>.allele_frequency_spectrum()</code> call above, and re-run, so that the Y axis plots the actual number of mutations instead of dividing it by the sequence length. Does the number of mutations in all the bars add up to the total number in the ARG?</dd>
</dl>
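
For intuition, a polarised allele frequency spectrum can be tallied by hand from a genotype matrix. The plain-Python sketch below assumes every site is biallelic with a single mutation (0 = ancestral, 1 = derived); the function name and the genotype rows are made up, and the library's own calculation is tree-based and handles more general cases.

```python
def polarised_afs(genotype_rows, num_samples):
    """Count sites by the number of samples carrying the derived allele.

    Assumes biallelic sites coded as 0 (ancestral) and 1 (derived)."""
    afs = [0] * (num_samples + 1)
    for row in genotype_rows:
        afs[sum(1 for g in row if g == 1)] += 1
    return afs


# Hypothetical genotypes: one row per site, one column per sample
rows = [
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [1, 1, 1, 0],
]
sequence_length = 1000  # assumed genome length for the span-normalised version
afs = polarised_afs(rows, num_samples=4)
span_normalised_afs = [count / sequence_length for count in afs]
print(afs)
print(span_normalised_afs)
```

In this toy case the un-normalised counts sum to the number of sites, because each site carries exactly one mutation.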

### Windowed statistics

Statistics can also be calculated in windows along the genome, and also by time. Here we demonstrate a simple measure, the genetic diversity (sometimes known as $\pi$), and a more complex one, Wright's Fst.
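
To show what a windowed diversity value means, the plain-Python sketch below computes the mean number of pairwise differences per unit of genome length directly from a few haplotypes. The haplotypes, site positions, and the `diversity` function are invented for this illustration; _tskit_ arrives at equivalent site-based values far more efficiently by using the trees.

```python
from itertools import combinations


def diversity(haplotypes, positions, left, right):
    """Mean pairwise differences per unit length, for sites with left <= position < right."""
    in_window = [j for j, x in enumerate(positions) if left <= x < right]
    pairs = list(combinations(haplotypes, 2))
    total = sum(sum(a[j] != b[j] for j in in_window) for a, b in pairs)
    return total / len(pairs) / (right - left)


# Hypothetical data: 3 sampled genomes, 4 variable sites on a genome of length 100
positions = [10, 40, 60, 90]
haplotypes = ["AGTC", "AGTT", "CGAC"]

whole_genome = diversity(haplotypes, positions, 0, 100)
windows = [0, 50, 100]
windowed = [diversity(haplotypes, positions, l, r) for l, r in zip(windows[:-1], windows[1:])]
print("Whole genome:", whole_genome)
print("Windowed:", windowed)
```

Because the two windows are of equal length, the whole-genome value is simply the average of the windowed values.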

In [None]:
print(f"Average genetic diversity (π) = {arg.diversity()}")
AFR_samples = arg.samples(population=0)
EUR_samples = arg.samples(population=1)

print(f"Average Fst = {arg.Fst([AFR_samples, EUR_samples])}")

# Now show this in windows
genome_windows = [0, 250, 500, 750, 1000]
windowed_diversity = arg.diversity(windows=genome_windows)
windowed_Fst = arg.Fst([AFR_samples, EUR_samples], windows=genome_windows)

fig, (ax_top, ax_bottom) = plt.subplots(2, 1, sharex=True)
ax_top.stairs(windowed_diversity, genome_windows, baseline=None)
ax_top.set_ylabel("Genetic diversity")
ax_top.set_title(f"Genetic statistics in {len(genome_windows) - 1} windows along the genome")
ax_bottom.stairs(windowed_Fst, genome_windows, baseline=None)
ax_bottom.set_ylabel("AFR/EUR Fst")
ax_bottom.set_xlabel("Genome position")

plt.show()

### Do you need genetic variation?

It is worth noting that if you have the genealogy, then an alternative set of statistics exists which can be less noisy than those based on the raw genetic variation data. See https://tskit.dev/tutorials/no_mutations.html for more details.

These "branch" mode statistics do not require mutations, and so do not contain noise associated with the mutational process. For instance, as our ARG was created under a simple neutral model, we would expect the genetic statistics along the genome to be constant. And indeed, using `mode="branch"` results in a comparable windowed genome plot, in which the genetic statistics are less noisy than those based on the genetic variation created by mutations.
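
The link between branch-based and site-based diversity can be sketched for a single tree in plain Python. For each pair of samples, the branch statistic is the total branch length separating them; multiplying the average by the mutation rate gives the expected number of pairwise differences per unit length. The tree, node times, mutation rate, and function names below are all assumptions made for this illustration; with several local trees, the values would be averaged weighted by the span of each tree.

```python
from itertools import combinations


def mrca(parent, a, b):
    """Most recent common ancestor of a and b in a child->parent dict."""
    ancestors = set()
    while a is not None:
        ancestors.add(a)
        a = parent.get(a)  # None once we step above the root
    while b not in ancestors:
        b = parent[b]
    return b


def branch_diversity(parent, time, samples):
    """Mean, over pairs of samples, of the branch length separating the pair."""
    pairs = list(combinations(samples, 2))
    total = sum(2 * time[mrca(parent, a, b)] - time[a] - time[b] for a, b in pairs)
    return total / len(pairs)


# Toy tree ((0,1)3,2)4 with node times in generations
tree_parents = {0: 3, 1: 3, 2: 4, 3: 4}
node_time = {0: 0, 1: 0, 2: 0, 3: 100, 4: 250}
assumed_mutation_rate = 1e-8  # hypothetical value, per base pair per generation

branch_pi = branch_diversity(tree_parents, node_time, [0, 1, 2])
expected_site_pi = branch_pi * assumed_mutation_rate
print(branch_pi, expected_site_pi)
```

The branch value depends only on the genealogy, so it carries none of the randomness of where mutations happen to fall.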

In [None]:
branch_Fst = arg.Fst([AFR_samples, EUR_samples], windows=genome_windows, mode="branch")
branch_diversity = arg.diversity(windows=genome_windows, mode="branch")
scaled_branch_diversity = branch_diversity * mutation_rate


fig, (ax_top, ax_bottom) = plt.subplots(2, 1, sharex=True)
ax_top.stairs(windowed_diversity, genome_windows, baseline=None, ls=":", label="site (mutation)-based")
ax_top.stairs(scaled_branch_diversity, genome_windows, baseline=None, label="branch-based")
ax_top.set_ylabel("Genetic diversity")
ax_top.set_title(f"Genetic statistics in {len(genome_windows) - 1} windows along the genome")
ax_bottom.stairs(windowed_Fst, genome_windows, baseline=None, ls=":", label="site (mutation)-based")
ax_bottom.stairs(branch_Fst, genome_windows, baseline=None, label="branch-based")
ax_bottom.text(0, 0.13, f'Average branch Fst: {arg.Fst([AFR_samples, EUR_samples], mode="branch"):.4f}')
ax_bottom.set_ylabel("AFR/EUR Fst")

ax_bottom.set_xlabel("Genome position")
ax_top.legend()
plt.show()

In [None]:
display_quiz(Workbook1.url + "Q11.json")

We will encounter other summary statistics in future workshops.

## Extra *tskit* details

This section aims to give you a deeper understanding of how ARGs are stored within _tskit_, and how to access the raw data underlying a _tskit_ ARG. It is deliberately optional in the sense that there are no exercises or questions, but it should help you to master the _tskit_ library.

### Tables

Nodes, edges, individuals, populations, sites, and mutations are all stored as *rows* in a set of separate *tskit* **tables**. The ID of an object is simply its row number in the corresponding table. Together the tables make up a [TableCollection](https://tskit.dev/tutorials/tables_and_editing.html). You can see a summary of the tables by simply displaying a tree sequence in a notebook: this describes the number of rows in each table, and the size of each table. Below you will see that our `arg` has 42 edges, 37 nodes, and 13 mutations at 12 different sites along the genome:

In [None]:
display(arg)

Each of the tables can also be displayed separately, using `arg.tables.nodes`, `arg.tables.edges`, etc. Printing a table or the whole table collection creates a plain-text representation (helpful e.g. if you are not using a notebook):

In [None]:
print(arg.tables.edges)  # or display(arg.tables.edges) for a prettier HTML version

It should be reasonably obvious how this works. E.g. edge 0 connects parent node 10 to child node 6 in the part of the genome that spans 0 to 930 bp. For further information see [https://tskit.dev/tskit/docs/stable/data-model.html](https://tskit.dev/tskit/docs/stable/data-model.html), and for a tutorial approach, see [https://tskit.dev/tutorials/tables_and_editing.html](https://tskit.dev/tutorials/tables_and_editing.html).

We previously used [`arg.nodes()`](https://tskit.dev/tskit/docs/stable/python-api.html#tskit.TreeSequence.nodes), [`arg.individuals()`](https://tskit.dev/tskit/docs/stable/python-api.html#tskit.TreeSequence.individuals), and [`arg.populations()`](https://tskit.dev/tskit/docs/stable/python-api.html#tskit.TreeSequence.populations) to return Python objects, created by iterating over all the rows in a table. Similarly, methods exist for [`arg.edges()`](https://tskit.dev/tskit/docs/stable/python-api.html#tskit.TreeSequence.edges), [`arg.sites()`](https://tskit.dev/tskit/docs/stable/python-api.html#tskit.TreeSequence.sites), and [`arg.mutations()`](https://tskit.dev/tskit/docs/stable/python-api.html#tskit.TreeSequence.mutations). To access a specific edge, node, site, etc. as a Python object you can also use [`arg.edge(i)`](https://tskit.dev/tskit/docs/stable/python-api.html#tskit.TreeSequence.edge), [`arg.node(i)`](https://tskit.dev/tskit/docs/stable/python-api.html#tskit.TreeSequence.node), [`arg.site(i)`](https://tskit.dev/tskit/docs/stable/python-api.html#tskit.TreeSequence.site), and so on. 

In [None]:
print(arg.edge(0))

In [None]:
print(arg.node(10))
print(arg.node(6))
individual_id_for_node_6 = arg.node(6).individual

In [None]:
print(arg.individual(individual_id_for_node_6))

In [None]:
print(arg.mutation(0))

In [None]:
print(arg.site(0))

#### High performance data access

Using Python objects is convenient, but can be inefficient for large ARGs. The most performant way to access the underlying data is to use the [efficient column accessors](https://tskit.dev/tskit/docs/stable/python-api.html#efficient-table-column-access), which provide _numpy_ arrays that are a direct view into memory. For example, to find all the site positions along the genome, you can use  `arg.tables.sites.position` (or the shortcut [`arg.sites_position`](https://tskit.dev/tskit/docs/stable/python-api.html#tskit.TreeSequence.sites_position)). This is particularly relevant when dealing with ARGs containing large tables (e.g. millions of rows).

In [None]:
arg.tables.sites.position

In [None]:
arg.tables.edges.parent  # Also available via the shorthand arg.edges_parent

In [None]:
arg.tables.edges.child  # Also available via the shorthand arg.edges_child

and so on. These can be used with _numpy_, the numerical Python library:

<!-- For more information about writing high performance algorithms, see https://github.com/tskit-dev/tutorials/issues/151 (not yet written) -->

In [None]:
import numpy as np
# find the mean of the base 10 log of the times of all mutations in the arg
np.mean(np.log10(arg.tables.mutations.time))

#### High performance trees

Local trees in a tree sequence are not stored in a table, but iteratively constructed on the fly using the `arg.trees()` method. However, a tree object has a set of [fast array access methods](https://tskit.dev/tskit/docs/stable/python-api.html#array-access) to provide efficient access to tree-based information, such as the [parents of nodes](https://tskit.dev/tskit/docs/stable/python-api.html#tskit.Tree.parent_array) in a tree, the [number of children](https://tskit.dev/tskit/docs/stable/python-api.html#tskit.Tree.num_children_array) of tree nodes, or the [edge above each node](https://tskit.dev/tskit/docs/stable/python-api.html#tskit.Tree.edge_array).
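
The idea behind a parent array can be shown with an ordinary Python list, where position `u` holds the parent of node `u`. This is a simplified sketch with an invented five-node tree: the real arrays are _numpy_ views and may contain extra bookkeeping entries that are left out here.

```python
NULL = -1  # stands in for tskit.NULL in this sketch

# Hypothetical tree ((0,1)3,2)4: parent_array[u] is the parent of node u
parent_array = [3, 3, 4, 4, NULL]

# Derive the number of children of each node in a single pass
num_children = [0] * len(parent_array)
for child, p in enumerate(parent_array):
    if p != NULL:
        num_children[p] += 1


def path_to_root(u):
    """Follow parent links from node u until a node with no parent is reached."""
    path = [u]
    while parent_array[path[-1]] != NULL:
        path.append(parent_array[path[-1]])
    return path


print("Number of children per node:", num_children)
print("Path from node 0 to the root:", path_to_root(0))
```

Because the whole tree is one flat array of integers, moving to the next local tree only requires overwriting the few entries whose parent has changed.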


In [None]:
tree = arg.first()

# Simple access to the parent of node 0 in the tree
print("Parent of node 0 in the first tree is", tree.parent(0))

# High performance tree array access
parents = tree.parent_array  # The parent ID of each node in this tree, or tskit.NULL (-1) if the node is not in the tree, or has no parent
print("\nParents of all the nodes:\n", parents)

The `parent_array` is a direct view into memory, which only needs slightly adjusting to encode the parents in the next tree. For large ARGs, this is much more efficient than making a new array for each local tree:

In [None]:
tree.next()
print("Parents of all the nodes in the next tree:\n", parents)  # will be slightly different from the previous time we viewed this array

*This is the end of Workbook 1. Workbook 2 looks in detail at different sorts of ARG, and how they can be simulated.*