Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 2 additions & 2 deletions _config.yml
Original file line number Diff line number Diff line change
@@ -1,9 +1,9 @@
# Book settings
# Learn more at https://jupyterbook.org/customize/config.html

title: Tskit Tutorials
title: Tree Sequence Tutorials
author: Tskit Developers
# logo: logo.png
logo: _static/tskit_logo.svg

# Force re-execution of notebooks on each build.
# See https://jupyterbook.org/content/execute.html
Expand Down
30 changes: 30 additions & 0 deletions _static/tskit_logo.svg
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
14 changes: 7 additions & 7 deletions basics.md
Original file line number Diff line number Diff line change
Expand Up @@ -119,9 +119,9 @@ SVG(ts.draw_svg(y_axis=True, y_gridlines=True, time_scale="rank"))
Each tree records the lines of descent along which a piece of DNA has been
inherited (ignore for the moment the red symbols, which represent a mutation).
For example, the first tree tells us that DNA from ancestral genome 7 duplicated
to produce two lineages, which ended up in sample genome 1 and sample genome 4 in the
current population (since this pattern is seen in all trees, this must have happened
for all the DNA along the entire 1000 base pair genome).
to produce two lineages, which ended up in genomes 1 and 4, both of which exist in the
current population. In fact, since this pattern is seen in all trees, these particular
lines of inheritance were taken by all the DNA in this 1000 base pair genome.


(sec_basics_terminology_nodes)=
Expand Down Expand Up @@ -274,17 +274,17 @@ for mutation in ts.mutations():

The mutation can have a {attr}`~Mutation.time` or if, as in this case, the times of
mutations in the tree sequence are unknown, all mutations can have the special NaN value
``tskit.UNKNOWN_TIME``. Notice that the genomic position of the mutation is not included.
Instead, that is a property of the _site_ to which the mutation refers, in this case,
site ID 0 (which happens to be at position 257):
{data}`tskit.UNKNOWN_TIME`. Notice that the genomic position of the mutation is not
included. Instead, that is a property of the _site_ to which the mutation refers, in
this case, site ID 0 (which happens to be at position 751):

```{code-cell} ipython3
print(ts.site(0)) # For convenience, the Python API also returns the mutations at the site
```

In the plot above, since the the only mutation is above node 8 in the last tree, and has
a {attr}`~Mutation.derived_state` of "G", we know that the samples descending from node
8 in the first tree (sample genomes 2, 4, and 5) have a "G" at {attr}`~Site.position` 257,
8 in the first tree (sample genomes 2, 3, and 5) have a "G" at {attr}`~Site.position` 751,
while the others have the {attr}`~Site.ancestral_state` of "T". This means that Ada is
homozygous for "T", Bob is homozygous for "G", and Cat is heterozygous "T/G".
In other words the ancestral state and the details of any mutations at that site,
Expand Down
16 changes: 8 additions & 8 deletions getting_started.md
Original file line number Diff line number Diff line change
Expand Up @@ -76,7 +76,7 @@ contains 40 *sample nodes*, one for each genome.

## Processing trees

A common idiom is to iterate over all the {class}`tree objects<Tree>` in a tree
A common idiom is to iterate over all the {class}`Tree` objects in a tree
sequence. This process underlies many tree sequence algorithms, including those we'll
encounter later in this tutorial for calculating
{ref}`population genetic statistics<tskit:sec_stats>`.
Expand Down Expand Up @@ -137,7 +137,7 @@ Now that we know all trees have coalesced, we know that at each position in the
all the 40 sample nodes must have one most recent common ancestor (MRCA). Below, we
iterate over the trees, finding the IDs of the root (MRCA) node for each tree. The
time of this root node can be found via the {meth}`tskit.TreeSequence.node` method, which
returns a {class}`node object<Node>` whose attributes include the node time:
returns a {class}`Node` object whose attributes include the node time:

```{code-cell} ipython3
import matplotlib.pyplot as plt
Expand Down Expand Up @@ -194,7 +194,6 @@ for {meth}`~TreeSequence.simplify`.
It can often be helpful to slim down a tree sequence so that it represents the genealogy
of a smaller subset of the original samples. This can be done using the powerful
{meth}`TreeSequence.simplify` method.
.

The {meth}`TreeSequence.draw_svg` method allows us to draw
more than one tree: either the entire tree sequence, or
Expand Down Expand Up @@ -233,11 +232,11 @@ sites or mutations in your analyses.
:::

For many purposes it may be better to focus on the genealogy of your samples, rather than
the sites and
mutations that
the {ref}`sites<sec_data_model_definitions_site>` and
{ref}`mutations<sec_data_model_definitions_mutation>` that
{ref}`define <sec_what_is_dna_data>` the genome sequence itself. Nevertheless,
{program}`tskit` also provides efficient ways to return {class}`site objects<Site>` and
{class}`mutation objects<Mutation>` from a tree sequence.
{program}`tskit` also provides efficient ways to return {class}`Site` object and
{class}`Mutation` objects from a tree sequence.
For instance, under the finite sites model of mutation that we used above, multiple mutations
can occur at some sites, and we can identify them by iterating over the sites using the
{meth}`TreeSequence.sites` method:
Expand Down Expand Up @@ -512,7 +511,8 @@ in rough order of importance:
* {meth}`~TreeSequence.simplify()` reduces the number of sample nodes in the tree
sequence to a specified subset
* {meth}`~TreeSequence.keep_intervals()` (or its complement,
{meth}`~TreeSequence.delete_intervals()`) deletes unwanted parts of the genome
{meth}`~TreeSequence.delete_intervals()`) removes genetic information from
specific regions of the genome
* {meth}`~TreeSequence.draw_svg()` plots tree sequences (and {meth}`Tree.draw_svg()`
plots trees)
* {meth}`~TreeSequence.at()` returns a tree at a particular genomic position
Expand Down
4 changes: 1 addition & 3 deletions intro.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,9 +13,7 @@ kernelspec:

(sec_intro)=

# Tree sequence tutorials

**Welcome!**
# Welcome!

This site contains a number of tutorials to develop your understanding of
[tree sequences](https://tskit.dev/learn.html#what) and software programs,
Expand Down
95 changes: 88 additions & 7 deletions viz.md
Original file line number Diff line number Diff line change
Expand Up @@ -36,9 +36,14 @@ def viz_ts():
ts_small.dump("data/viz_ts_small.trees")

ts_small_mutated = msprime.sim_mutations(ts_small, rate=1e-7, random_seed=342)
# 3rd tree should have first site with 2 muts
first_site_tree_2 = next(ts_small_mutated.at_index(2).sites())
assert len(first_site_tree_2.mutations) == 2
# mutation 8 should be above node 16 in the 1st tree
assert ts_small_mutated.site(8).mutations[0].id == 8
assert ts_small_mutated.site(8).mutations[0].node == 16
ts_small_mutated.dump("data/viz_ts_small_mutated.trees")


def viz_root_mut():
"""
Taken from the drawing unit tests
Expand Down Expand Up @@ -758,7 +763,8 @@ y_ticks = np.delete(y_ticks, np.argwhere(np.ediff1d(y_ticks) <= 0.01))

svg = ts.draw_svg(size=(1000, 350), y_axis=True, y_gridlines=True, y_ticks=y_ticks, style=(
".tree .lab {font-family: sans-serif}"
".x-axis .tick .lab {text-anchor: start; transform: rotate(90deg) translate(8px)}"
# Normal X axis tick labels have dominant baseline: hanging, but it needs centring when rotated
".x-axis .tick .lab {text-anchor: start; dominant-baseline: central; transform: rotate(90deg)}"
".y-axis .grid {stroke: #DDDDDD}"
+ ".tree :not(.leaf).node > .lab {transform: translate(0,0); text-anchor:middle; fill: white}"
+ ".tree :not(.leaf).node > .sym {transform: scale(3.5)}"
Expand Down Expand Up @@ -789,11 +795,12 @@ SVG(ts.draw_svg(style=css_string, time_scale="rank", x_lim=[0, 30]))
```

#### Leaf, sample & isolated nodes
By default, sample nodes are square and non-sample nodes circular. However, neither need
to be at time 0. Moreover, leaves need not be samples, and samples need not be leaves.
Here we change the previous tree sequence to make some leaves non-samples and some samples
internal nodes. To highlight the change, we have plotted sample nodes in green, and
leaf nodes (if not samples) in blue.
By default, sample nodes are square and non-sample nodes circular (at the moment this
can't easily be changed). However, neither need to be at specific times: sample nodes can
be at times other than 0, and nonsample nodes can be at time 0. Moreover, leaves need not
be samples, and samples need not be leaves. Here we change the previous tree sequence to
make some leaves non-samples and some samples internal nodes. To highlight the change,
we have plotted sample nodes in green, and leaf nodes (if not samples) in blue.

```{code-cell} ipython3
:"tags": ["hide-input"]
Expand All @@ -812,6 +819,80 @@ means that there can be sample nodes which are "isolated" in a tree. These are d
unconnected to the main topology in one or more trees (e.g. nodes 7 and 8 above).
:::

#### Tick labels and gridlines

Y tick labels can be specified explicitly, which allows time scales to be plotted
e.g. in years even if the tree sequence ticks in generations. The grid lines associated
with each y tick can also be changed or even hidden individually using the CSS
[nth-child pseudo-selector](https://www.w3.org/TR/2018/REC-selectors-3-20181106/#nth-child-pseudo),
where tickmarks are indexed from the bottom. This is used in
the {ref}`sec_msprime_introgression` tutorial to show lines behind the trees at specific,
important times. Below we show a slightly simpler example than in that tutorial,
keeping node and mutation symbols in black, but colouring gridlines instead:

```{code-cell} ipython3
:"tags": ["hide-input"]
# Function from the introgression tutorial - see there for justification
import msprime
time_units = 1000 / 25 # Conversion factor for kya to generations

def run_simulation(sequence_length, random_seed=None):
demography = msprime.Demography()
# The same size for all populations; highly unrealistic!
Ne = 10**4
demography.add_population(name="Africa", initial_size=Ne)
demography.add_population(name="Eurasia", initial_size=Ne)
demography.add_population(name="Neanderthal", initial_size=Ne)

# 2% introgression 50 kya
demography.add_mass_migration(
time=50 * time_units, source='Eurasia', dest='Neanderthal', proportion=0.02)
# Eurasian 'merges' backwards in time into Africa population, 70 kya
demography.add_mass_migration(
time=70 * time_units, source='Eurasia', dest='Africa', proportion=1)
# Neanderthal 'merges' backwards in time into African population, 300 kya
demography.add_mass_migration(
time=300 * time_units, source='Neanderthal', dest='Africa', proportion=1)

ts = msprime.sim_ancestry(
recombination_rate=1e-8,
sequence_length=sequence_length,
samples=[
msprime.SampleSet(1, ploidy=1, population='Africa'),
msprime.SampleSet(1, ploidy=1, population='Eurasia'),
# Neanderthal sample taken 30 kya
msprime.SampleSet(1, ploidy=1, time=30 * time_units, population='Neanderthal'),
],
demography = demography,
record_migrations=True, # Needed for tracking segments.
random_seed=random_seed,
)
return ts

ts = run_simulation(20 * 10**6, 1)

css = ".y-axis .tick .lab {font-size: 85%}" # Multi-line labels unimplemented: use smaller font
css += ".y-axis .tick .grid {stroke: lightgrey}" # Default gridline type
css += ".y-axis .ticks .tick:nth-child(3) .grid {stroke-dasharray: 4}" # 3rd line from bottom
css += ".y-axis .ticks .tick:nth-child(3) .grid {stroke: magenta}" # also 3rd line from bottom
css += ".y-axis .ticks .tick:nth-child(4) .grid {stroke: blue}" # 4th line from bottom
css += ".y-axis .ticks .tick:nth-child(5) .grid {stroke: darkgrey}" # 5th line from bottom
y_ticks = {0: "0", 30: "30", 50: "Introgress", 70: "Eur origin", 300: "Nea origin", 1000: "1000"}
y_ticks = {y * time_units: lab for y, lab in y_ticks.items()}
SVG(ts.draw_svg(
size=(1200, 500),
x_lim=(0, 25_000),
time_scale="log_time",
node_labels = {0: "Afr", 1: "Eur", 2: "Neand"},
y_axis=True,
y_label="Time (kya)",
x_label="Genomic position (bp)",
y_ticks=y_ticks,
y_gridlines=True,
style=css))
```


#### Animation

The classes attached to the SVG also allow elements to be animated. Here's a
Expand Down