From 110984c61abd514034ed523e2f9b39bd01f7740a Mon Sep 17 00:00:00 2001 From: Yan Wong Date: Wed, 29 Apr 2026 09:00:18 +0100 Subject: [PATCH] More phylogen editiing --- phylogen.md | 132 ++++++++++++++++++++++++++-------------------------- 1 file changed, 66 insertions(+), 66 deletions(-) diff --git a/phylogen.md b/phylogen.md index 8483d6fa..39ca4dfc 100644 --- a/phylogen.md +++ b/phylogen.md @@ -18,9 +18,9 @@ kernelspec: # {program}`Tskit` for phylogenetics -{program}`Tskit`, the tree sequence toolkit, can be used as an efficient library for -very large evolutionary trees. {program}`Tskit` makes it easy to deal with trees with millions of -tips, as in the example below: +{program}`Tskit`, the tree sequence toolkit, is an efficient library for very large +evolutionary trees. It makes trees with millions of tips straightforward to store +and analyse, as in the example below: ```{code-cell} :"tags": ["hide-input"] @@ -130,12 +130,12 @@ def collapse_tree_for_plot(tree, max_tips=20, style=None): %%time num_tips = 1_000_000 -big_tree = tskit.Tree.generate_comb(num_tips); +big_tree = tskit.Tree.generate_comb(num_tips) print("Tree sequence takes up", big_tree.tree_sequence.nbytes / 1024**2, "Mb") print(f"Generating a 'comb' (pectinate) tree of {num_tips} tips took:") ``` -Clearly, plotting a tree with a million tips is problematic, but we can draw a summary +Plotting a tree with a million tips is impractical, but we can draw a summary (see the {ref}`visualization tutorial` for details): @@ -149,7 +149,7 @@ plot_tree.draw_svg( ``` -Calculations on these huge trees can be very efficient: +Calculations on these large trees can be very efficient: ```{code-cell} %%time @@ -181,27 +181,27 @@ import tsconvert # used for reading tree sequences from different formats # example code reading in a large file, timed -# Or read smaller trees from strings (here we create a tree spanning 1000 genomic units) +# Read a smaller tree from a string (here we create a tree spanning 1000 genomic units) ts = tsconvert.from_newick("(A:6,((B:1,C:1):2,(D:2,E:2):1):3);", span=1000) ``` The "succinct tree sequence" format used by {program}`tskit` can also store mutations (and optionally a reference genome) along with the tree(s). This results in a single unified representation of large genomic datasets, storing trees, -sequence data and metadata in a single efficient structure. Examples are given -in the section below entitled {ref}`sec_phylogen_unified_structure`. +sequence data, and metadata in a single efficient structure. See +{ref}`sec_phylogen_unified_structure` for examples. As the name suggests, a tree sequence can also store and analyse a sequence of trees along a genome (i.e. a "phylogenetic network"). This is necessary to account for recombination between lineages, and may be important even when looking at species-level phylogenies due to the effects of hybridization and incomplete lineage -sorting. An overview, and links to further details are given at the +sorting. An overview and links to further details are given at the {ref}`end of this page `. ## Hints for phylogeneticists -Unlike other phylogenetic libraries, {program}`tskit` is designed to efficiently store not just -single trees, but sequences of correlated trees along a genome. This means that the +Unlike other phylogenetic libraries, {program}`tskit` is designed to efficiently store not +just single trees, but also sequences of correlated trees along a genome. This means that the library has some features not found in more standard phylogenetic libraries. Here we focus on the {ref}`sec_python_api`, introducing seven {program}`tskit` concepts that may be useful to those with a background in @@ -226,10 +226,10 @@ phylogenetics (each is linked to a separate section below): (sec_phylogen_tree_in_sequence)= ### Trees are always part of a tree sequence -In tskit, all trees are stored as a "tree sequence" of correlated trees. This allows -easy extension of the library to multiple trees produced e.g. by hybridization. +In {program}`tskit`, all trees are stored as a "tree sequence" of correlated trees. +This extends naturally to multiple trees, such as those produced by hybridization. In the simplest case, however, the tree sequence can contain just a single tree. This -can be obtained using the {meth}`~TreeSequence.first()` method. +tree can be obtained using the {meth}`~TreeSequence.first()` method. ```{code-cell} :"tags": ["hide-input"] @@ -257,7 +257,7 @@ tree.tree_sequence # When output in a notebook, prints a summary of the tree se (sec_phylogen_ids)= ### Integer node and edge IDs -The plot above labels nodes by their name, but internally the {program}`tskit` library relies +The plot above labels nodes by name, but internally the {program}`tskit` library relies heavily on integer IDs. Here's the same tree with node IDs plotted instead: ```{code-cell} @@ -268,12 +268,12 @@ tree.draw_svg() #### Nodes Each {ref}`node` in a tree sequence is allocated an -integer ID starting from 0 to `ts.num_nodes - 1` (IDs can be allocated in any order; -often the tips are labelled starting from 0 but this is not necessarily so, and +integer ID from 0 to `ts.num_nodes - 1` (IDs can be allocated in any order; +often the tips are labelled starting from 0, but this is not guaranteed, and is not the case in the example above). For efficiency reasons, tree traversal routines, as well as many other {program}`tskit` -methods, tend to return integer IDs. You can use this ID to get specific information +methods, tend to return integer IDs. You can use these IDs to get specific information about the node and its position in the tree, for example ```{code-cell} @@ -281,22 +281,22 @@ node_id = 4 parent_id = tree.parent(node_id) child_ids = tree.children(node_id) print("The parent of", node_id, "is", parent_id, "and its children are", child_ids) -# or e.g. get all parents them as an array (where -1 means there is no parent) +# or get all parents as an array (where -1 means there is no parent) print(f"The parents of nodes 0..{ts.num_nodes-1} are", tree.parent_array) ``` Other methods also exist to {ref}`examine nodes in a tree`, e.g. {meth}`Tree.is_leaf`, {meth}`Tree.mrca` for the most recent common ancestor between -2 or more nodes, etc. +two or more nodes, etc. #### Edges -Rather than refer to "branches" of a tree, tskit tends to refer to +Rather than refer to "branches" of a tree, {program}`tskit` tends to refer to {ref}`sec_terminology_edges` (the term "edge" emphasises that these can span {ref}`sec_phylogen_multiple_trees`, although for tree sequences containing a single tree, the terms are interchangeable). Like other entities in {program}`tskit`, edges -are referred to by an integer ID. For instance, here is the edge above the internal node 4 +are referred to by an integer ID. For instance, here is the edge above the internal node 4: ```{code-cell} node_id = 4 @@ -313,10 +313,10 @@ important in tree sequences that contain more than one tree. ### The `Tree` object The {class}`Tree` object has {ref}`methods` to perform basic operations -on a tree such as traversing the nodes, identifying parents, children, and common -ancestors, etc. {ref}`Several methods` +on a tree, such as traversing nodes and identifying parents, children, and common +ancestors. {ref}`Several methods` also return numpy arrays for use in -{ref}`efficient algorithms using numba` +{ref}`efficient algorithms using numba`. ```{code-cell} for n_id in tree.nodes(order="postorder"): @@ -326,10 +326,10 @@ for n_id in tree.nodes(order="postorder"): print("Node IDs in postorder:", tree.postorder()) ``` -Various phylogenetic statistics are also available on trees, e.g +Various phylogenetic statistics are also available on trees, for example: ```{code-cell} -print(f"The colless imbalance index is {tree.colless_index()}") +print(f"The Colless imbalance index is {tree.colless_index()}") ``` See {ref}`sec_phylogen_methods` for more examples. @@ -337,14 +337,14 @@ See {ref}`sec_phylogen_methods` for more examples. (sec_phylogen_samples)= ### Sample nodes -Often we are only have detailed information about specific nodes that we have sampled, +Often, we only have detailed information about specific nodes that we have sampled, such as genomes A, B, C, D, and E in the example above. These are designated as *sample nodes*, and are plotted as square nodes. The concept of {ref}`sample nodes` is integral to the {program}`tskit` format. They can be identified by using the {meth}`Node.is_sample` and {meth}`Tree.is_sample` methods, or can be listed using {meth}`TreeSequence.samples` or {meth}`Tree.samples()` (internally, the `node.flags` -field is used to {ref}`flag up` which nodes are samples): +field is used to {ref}`record` which nodes are samples): ```{code-cell} for n_id in tree.nodes(): @@ -355,7 +355,7 @@ print("Sample nodes are", tree.tree_sequence.samples()) ``` Often the sample nodes are the leaves of a tree, but this need not be the case. There -are fast methods for identifying the sample nodes under an internal node in the tree, +are fast methods for identifying the sample nodes below an internal node in the tree, etc. @@ -371,16 +371,16 @@ node object via the `tree_sequence` to which this tree belongs: tree.tree_sequence.node(node_id) # or simply ts.node(node_id) ``` -Attributes such as `id`, `flags` and `time` are always present. Arbitrary information, -such a name or e.g. bootstrap values, are stored in *metadata* +Attributes such as `id`, `flags`, and `time` are always present. Arbitrary information, +such as a name or bootstrap values, is stored in *metadata*. ```{code-cell} for n_id in tree.nodes(): print("Node", n_id, tree.tree_sequence.node(n_id).metadata.get("name", "")) ``` -However, for large datasets, it may be more efficient to access the array of e.g. -times for all nodes, which provides direct memory access into the +However, for large datasets, it may be more efficient to access the time array for +all nodes, which provides direct memory access into the {ref}`tables` that underlie the tree sequence format: ```{code-cell} @@ -391,22 +391,23 @@ tree.tree_sequence.nodes_time (sec_phylogen_node_time)= ### Nodes must have times -Perhaps the most noticable different between a {program}`tskit` tree and the encoding of trees +Perhaps the most noticeable difference between a {program}`tskit` tree and the encoding of trees in other phylogenetic libraries is that {program}`tskit` does not explicitly store branch lengths. Instead, each node has a *time* associated with it. Branch lengths can therefore be found by calculating the difference between the time of a node and the time of its parent node. -Since nodes *must* have a time, {program}`tskit` trees aways have these (implicit) branch +Since nodes *must* have a time, {program}`tskit` trees always have implicit branch lengths. To represent a tree ("cladogram") in which the branch lengths are not meaningful, the {attr}`TreeSequence.time_units` of a tree sequence can be -specified as `"uncalibrated"` (see below) +specified as `"uncalibrated"` (see below). Another implication of storing node times rather than branch lengths is that {program}`tskit` trees are always directional (i.e. they are "rooted"). The reason that {program}`tskit` stores -times of nodes (rather than e.g. genetic distances between them) is to ensure temporal -consistency. In particular it makes it impossible for a node to be an ancestor of a -node in one tree, and a descendant of the same node in another tree in the tree sequence. +node times (rather than e.g. genetic distances between them) is to ensure temporal +consistency. In particular, it makes it impossible for one node to be an ancestor of +another node in one tree, and a descendant of that same node in another tree in the +tree sequence. This is of critical importance when extending the concept of genetic ancestry to {ref}`sec_phylogen_multiple_trees` along a genome. @@ -442,13 +443,13 @@ print( "and", target_node_2, "is", - tree.distance_between(target_node_1, target_node_2), # beware: tree.path_length counts number of edges + tree.distance_between(target_node_1, target_node_2), # beware: tree.path_length counts edges ) ``` It is worth noting that this distance is the basis for the "genetic divergence" between two samples in a tree. For this reason, an equivalent way to carry out the -calculation is to use {meth}`TreeSequence.divergence`, part of the the standard {program}`tskit` +calculation is to use {meth}`TreeSequence.divergence`, part of the standard {program}`tskit` {ref}`sec_stats` framework, setting `mode="branch"` and `windows="trees"`. This is a more flexible approach, as it allows the distance between multiple sets of samples in {ref}`sec_phylogen_multiple_trees` to be calculated @@ -467,7 +468,7 @@ print( (sec_phylogen_multiroot)= ### Roots and multiroot trees -In {program}`tskit`, {ref}`sec_data_model_tree_roots` of trees are defined with respect to the +In {program}`tskit`, {ref}`tree roots` are defined with respect to the sample nodes. In particular, if we move back in time along the tree branches from a sample, the oldest node that we encounter is defined as a root. The ID of a root can be obtained using {attr}`Tree.root`: @@ -477,9 +478,9 @@ print("The root node of the following tree has ID", tree.root) tree.draw_svg() ``` -But in {program}`tskit`, we can also create a single "tree" consisting of multiple unlinked +We can also create a single "tree" consisting of multiple unlinked clades. In our example, we can create one of these phylogenetically unusual objects -if we remove the edge above node 4, by +by removing the edge above node 4 via {ref}`editing the underlying tables`: ```{code-cell} @@ -492,15 +493,15 @@ new_tree = new_ts.first() new_tree.draw_svg() ``` -Although there are two separate topologies in this plot, in {program}`tskit` terminology, -it is considered a single tree, but with two roots: +Although there are two separate topologies in this plot, in {program}`tskit` terminology +this is considered a single tree with two roots: ```{code-cell} print("The first tree has", len(new_tree.roots), "roots:", new_tree.roots) ``` This also means that if we have no topology at all (i.e. an "empty tree"), each -sample is its own root. +sample is its own root: ```{code-cell} tables.edges.clear() @@ -510,9 +511,9 @@ print("This empty tree has", len(empty_tree.roots), "roots:", empty_tree.roots) empty_tree.draw_svg() ``` -The samples here are {ref}`sec_data_model_tree_isolated_nodes`. This may seem like a +The samples here are {ref}`isolated nodes`. This may seem like a strange corner case, but in {program}`tskit`, isolated sample nodes are used to represent -{ref}`sec_data_model_missing_data`. This therefore represents a tree in which +{ref}`sec_data_model_missing_data`. The empty tree therefore represents a case in which relationships between the samples are not known. This could apply, for instance, in regions of the genome where no genetic data exists, or where genetic ancestry has not been simulated. @@ -532,16 +533,16 @@ Demo some phylogenetic methods. e.g. (sec_phylogen_unified_structure)= ## Storing and accessing genetic data -{program}`Tskit` has been designed to capture both evolutionary tree topologies and the genetic -sequences that evolve along the branches of these trees. This is achieved by defining -{ref}`sec_terminology_mutations_and_sites` which are associated with specific positions +{program}`Tskit` is designed to capture both evolutionary tree topologies and the genetic +sequences that evolve along the branches of these trees. It does this by defining +{ref}`mutations and sites`, which are associated with specific positions along the genome. ```{code-cell} -import msprime # The `msprime` package can throw mutations onto a tree sequence +import msprime # The `msprime` package can add mutations to a tree sequence mutated_ts = msprime.sim_mutations(ts, rate=3e-3, random_seed=321) mutated_tree = mutated_ts.first() -print("Variable sites with the following IDs generated") +print("Variable sites with the following IDs were generated:") for site in mutated_tree.sites(): print( f"Site ID {site.id} @ genomic position {site.position:g}:", @@ -550,9 +551,9 @@ for site in mutated_tree.sites(): mutated_tree.draw_svg() ``` -Mutations occur above nodes in a tree, with all the descendant -nodes inheriting that specific mutation (unless replaced by a subsequent -mutation at the same site). This allows genetic variation to be +Mutations occur above nodes in a tree, and all descendant nodes inherit +that specific mutation (unless it is replaced by a subsequent mutation at the +same site). This allows genetic variation to be {ref}`efficiently represented` using the tree topology. To obtain the genetic variation at each site across the entire genome, you can use the {meth}`TreeSequence.sites` method, or (less efficiently), you can use @@ -571,7 +572,7 @@ for node_id, alignment in zip( (sec_phylogen_multiple_trees)= ## Multiple trees -Where {program}`tskit` really shines is when the ancestry of your dataset cannot be adequately +{program}`Tskit` is particularly useful when the ancestry of your dataset cannot be adequately represented by a single tree. This is a pervasive issue in genomes (even from different species) that have undergone recombination in the past. The resulting series of {ref}`local trees` along a genome are highly correlated @@ -588,20 +589,19 @@ tables = ts.dump_tables() edge_id_above_node_4 = ts.first().edge(4) left_coord_for_edges = tables.edges.left left_coord_for_edges[edge_id_above_node_4] = 50 -tables.edges.left = left_coord_for_edges # reset the right coords +tables.edges.left = left_coord_for_edges # apply the modified left coordinates tables.sort() multi_ts = tables.tree_sequence() multi_ts.draw_svg() ``` -For the left hand side of the genome we lack information about the ancestry of -node 4, but for the right hand side we know this information. The result is to -generate 2 trees in the tree sequence, which differ only in the presence of absence of -a single branch. We do not have to separately store the entire tree on the right: all +For the left-hand side of the genome, we lack information about the ancestry of +node 4, but for the right-hand side, we know this information. This generates two +trees in the tree sequence, which differ only in the presence or absence of +a single branch. We do not have to store the entire tree on the right separately: all the edges that are shared between trees are stored only once. The rest of the {program}`tskit` tutorials will lead you through the concepts involved with storing and analysing sequences of many correlated trees. For a simple introduction, you might want to start with {ref}`sec_what_is`. -