Flag up whether a mutation was placed because of a mismatch rather than because of shared ancestry #429

hyanwong · 2020-12-17T21:30:48Z

At the moment, in the matching process, we place mutations directly above the ancestor at the ancestor's focal sites. We can also place mutations at the same site when we think a mismatch has occurred. It would be useful to distinguish between these two sorts of mutation in the final TS by adding some sort of flag or metadata on the mutation.

This came up after looking at re-placing the mutations at inference sites using parsimony instead, which seems to drastically reduce the number of mutations needed to explain the pattern of variation. If most sites do have a single origin, but also have some error, then we might hope that using parsimony would keep the single origin more-or-less intact, but mess with the placement of mismatch-mutations.

jeromekelleher · 2020-12-18T09:57:20Z

Good idea. We can add a metadata schema like we have for nodes and an enumeration of the types of mutation.

jeromekelleher · 2020-12-18T15:28:17Z

Best to wait until we also have node metadata sorted out with a proper schema. Will have to wait until 0.2.1.

hyanwong · 2021-08-03T12:47:39Z

One issue here is that if a site has multiple mutations, then it is quite possible that the ancestor associated with the focal site will not be the one with the focal mutation above it. Here's a (slightly odd) example:

import msprime
import tsinfer
import tskit
import json
import collections
import numpy as np
from IPython.display import display, SVG
r = 1e-6
ts = msprime.sim_ancestry(5, sequence_length=1e8, recombination_rate=r, random_seed=2)
ts = msprime.sim_mutations(ts, rate=r, random_seed=2)  # same mutation rate as rec rate
print("Simulated TS with", ts.num_trees, "trees and", ts.num_sites, "sites")
sample_data = tsinfer.SampleData.from_tree_sequence(ts, use_sites_time=False)
ancestors = tsinfer.generate_ancestors(sample_data)
ancestors_ts = tsinfer.match_ancestors(sample_data, ancestors, recombination_rate=r)
inferred_ts = tsinfer.match_samples(sample_data, ancestors_ts)

# which focal ancestors have a mutation above them?
node_ancestor_map = {}
ancestor_node_map = {}

for n in inferred_ts.nodes():
    if n.metadata:
        ancestor_id = json.loads(n.metadata)['ancestor_data_id']
        node_ancestor_map[n.id] = ancestor_id
        assert ancestor_id not in ancestor_node_map
        ancestor_node_map[ancestor_id] = n.id

site_ancestors = {
    s.id: {node_ancestor_map[m.node] for m in s.mutations if m.node in node_ancestor_map}
    for s in inferred_ts.sites()
}
site_focal_ancestors = {}
site_map = np.where(np.isin(sample_data.sites_position[:], ancestors.sites_position[:]))[0]
for ancestor in ancestors.ancestors():
    for true_site_id in site_map[ancestor.focal_sites]:
        assert true_site_id not in site_focal_ancestors  # check only 1 focal anc per site
        site_focal_ancestors[true_site_id] = ancestor.id
assert len(site_focal_ancestors), ancestors_ts.num_sites

num_inference_sites_with_focal = 0
num_inference_sites_without_focal = 0

sites_without_focal = {}
for site_id, focal_ancestor in site_focal_ancestors.items():
    if focal_ancestor in site_ancestors[site_id]:
        num_inference_sites_with_focal += 1
    else:
        sites_without_focal[site_id] = focal_ancestor

print(
    num_inference_sites_with_focal / len(site_focal_ancestors) * 100,
    "% of inference sites have a mutation above the focal ancestor"
)
it = iter(sites_without_focal.items())
first_bad_site_id, first_bad_anc_id = next(it)

# Mark the focal ancestor as a sample, so it is plotted
print(
    "The magenta node in the first plot below is the " +
    f"ancestor on which the focal site {first_bad_site_id} exists"
)
focal_node = ancestor_node_map[first_bad_anc_id]
tables = inferred_ts.dump_tables()
flags = tables.nodes.flags
flags[focal_node] |= tskit.NODE_IS_SAMPLE
tables.nodes.flags = flags
lims = [inferred_ts.site(first_bad_site_id).position]
lims += [np.nextafter(lims[0], lims[0]+1)]
display(SVG(tables.tree_sequence().draw_svg(
    x_lim=lims, size=(400, 400), x_label="Inferred tree sequence",
    style=f".node > .sym {{fill: grey}}.n{focal_node} > .sym {{fill: magenta}}", time_scale="rank")))
display(SVG(ts.draw_svg(
    x_lim=lims, size=(400, 400), x_label="Original (simulated) tree sequence", time_scale="rank")))

In this case, rather weirdly, we could have placed a mutation in the correct place on the topology, but for some reason we mismatched to other (younger) ancestors instead of ancestor 276 (which was the one based on focal site 367). I'll open another issue about that.

jeromekelleher modified the milestones: Release 0.2.0, Release 0.2.1 Dec 18, 2020

hyanwong mentioned this issue Aug 3, 2021

Investigate causes of multiple mutations #568

Open

jeromekelleher modified the milestones: Release 0.4.0, Release 0.5.0 May 15, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Flag up whether a mutation was placed because of a mismatch rather than because of shared ancestry #429

Flag up whether a mutation was placed because of a mismatch rather than because of shared ancestry #429

hyanwong commented Dec 17, 2020 •

edited

Loading

jeromekelleher commented Dec 18, 2020

jeromekelleher commented Dec 18, 2020

hyanwong commented Aug 3, 2021 •

edited

Loading

Flag up whether a mutation was placed because of a mismatch rather than because of shared ancestry #429

Flag up whether a mutation was placed because of a mismatch rather than because of shared ancestry #429

Comments

hyanwong commented Dec 17, 2020 • edited Loading

jeromekelleher commented Dec 18, 2020

jeromekelleher commented Dec 18, 2020

hyanwong commented Aug 3, 2021 • edited Loading

hyanwong commented Dec 17, 2020 •

edited

Loading

hyanwong commented Aug 3, 2021 •

edited

Loading