Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flag up whether a mutation was placed because of a mismatch rather than because of shared ancestry #429

Open
hyanwong opened this issue Dec 17, 2020 · 3 comments
Milestone

Comments

@hyanwong
Copy link
Member

hyanwong commented Dec 17, 2020

At the moment, in the matching process, we place mutations directly above the ancestor at the ancestor's focal sites. We can also place mutations at the same site when we think a mismatch has occurred. It would be useful to distinguish between these two sorts of mutation in the final TS by adding some sort of flag or metadata on the mutation.

This came up after looking at re-placing the mutations at inference sites using parsimony instead, which seems to drastically reduce the number of mutations needed to explain the pattern of variation. If most sites do have a single origin, but also have some error, then we might hope that using parsimony would keep the single origin more-or-less intact, but mess with the placement of mismatch-mutations.

@jeromekelleher
Copy link
Member

Good idea. We can add a metadata schema like we have for nodes and an enumeration of the types of mutation.

@jeromekelleher
Copy link
Member

Best to wait until we also have node metadata sorted out with a proper schema. Will have to wait until 0.2.1.

@hyanwong
Copy link
Member Author

hyanwong commented Aug 3, 2021

One issue here is that if a site has multiple mutations, then it is quite possible that the ancestor associated with the focal site will not be the one with the focal mutation above it. Here's a (slightly odd) example:

import msprime
import tsinfer
import tskit
import json
import collections
import numpy as np
from IPython.display import display, SVG
r = 1e-6
ts = msprime.sim_ancestry(5, sequence_length=1e8, recombination_rate=r, random_seed=2)
ts = msprime.sim_mutations(ts, rate=r, random_seed=2)  # same mutation rate as rec rate
print("Simulated TS with", ts.num_trees, "trees and", ts.num_sites, "sites")
sample_data = tsinfer.SampleData.from_tree_sequence(ts, use_sites_time=False)
ancestors = tsinfer.generate_ancestors(sample_data)
ancestors_ts = tsinfer.match_ancestors(sample_data, ancestors, recombination_rate=r)
inferred_ts = tsinfer.match_samples(sample_data, ancestors_ts)

# which focal ancestors have a mutation above them?
node_ancestor_map = {}
ancestor_node_map = {}

for n in inferred_ts.nodes():
    if n.metadata:
        ancestor_id = json.loads(n.metadata)['ancestor_data_id']
        node_ancestor_map[n.id] = ancestor_id
        assert ancestor_id not in ancestor_node_map
        ancestor_node_map[ancestor_id] = n.id

site_ancestors = {
    s.id: {node_ancestor_map[m.node] for m in s.mutations if m.node in node_ancestor_map}
    for s in inferred_ts.sites()
}
site_focal_ancestors = {}
site_map = np.where(np.isin(sample_data.sites_position[:], ancestors.sites_position[:]))[0]
for ancestor in ancestors.ancestors():
    for true_site_id in site_map[ancestor.focal_sites]:
        assert true_site_id not in site_focal_ancestors  # check only 1 focal anc per site
        site_focal_ancestors[true_site_id] = ancestor.id
assert len(site_focal_ancestors), ancestors_ts.num_sites

num_inference_sites_with_focal = 0
num_inference_sites_without_focal = 0

sites_without_focal = {}
for site_id, focal_ancestor in site_focal_ancestors.items():
    if focal_ancestor in site_ancestors[site_id]:
        num_inference_sites_with_focal += 1
    else:
        sites_without_focal[site_id] = focal_ancestor

print(
    num_inference_sites_with_focal / len(site_focal_ancestors) * 100,
    "% of inference sites have a mutation above the focal ancestor"
)
it = iter(sites_without_focal.items())
first_bad_site_id, first_bad_anc_id = next(it)

# Mark the focal ancestor as a sample, so it is plotted
print(
    "The magenta node in the first plot below is the " +
    f"ancestor on which the focal site {first_bad_site_id} exists"
)
focal_node = ancestor_node_map[first_bad_anc_id]
tables = inferred_ts.dump_tables()
flags = tables.nodes.flags
flags[focal_node] |= tskit.NODE_IS_SAMPLE
tables.nodes.flags = flags
lims = [inferred_ts.site(first_bad_site_id).position]
lims += [np.nextafter(lims[0], lims[0]+1)]
display(SVG(tables.tree_sequence().draw_svg(
    x_lim=lims, size=(400, 400), x_label="Inferred tree sequence",
    style=f".node > .sym {{fill: grey}}.n{focal_node} > .sym {{fill: magenta}}", time_scale="rank")))
display(SVG(ts.draw_svg(
    x_lim=lims, size=(400, 400), x_label="Original (simulated) tree sequence", time_scale="rank")))

Screenshot 2021-08-03 at 13 47 08

In this case, rather weirdly, we could have placed a mutation in the correct place on the topology, but for some reason we mismatched to other (younger) ancestors instead of ancestor 276 (which was the one based on focal site 367). I'll open another issue about that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants