Fix parsimony on nonbinary trees #1030

jeromekelleher · 2020-11-23T14:54:29Z

Closes #987

PR Checklist:

Tests that fully cover new/changed functionality.
Documentation including tutorial content if appropriate.
Changelogs, if there are API changes.

AdminBot-tskit · 2020-11-24T00:45:43Z

📖 Docs for this PR can be previewed here

codecov · 2020-11-24T00:50:27Z

Codecov Report

Merging #1030 (0c63887) into main (2900201) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main    #1030   +/-   ##
=======================================
  Coverage   93.71%   93.71%           
=======================================
  Files          26       26           
  Lines       20920    20923    +3     
  Branches      875      875           
=======================================
+ Hits        19606    19609    +3     
  Misses       1277     1277           
  Partials       37       37

Flag	Coverage Δ
c-tests	`92.49% <100.00%> (+<0.01%)`	⬆️
lwt-tests	`93.58% <ø> (ø)`
python-c-tests	`94.90% <ø> (ø)`
python-tests	`98.61% <ø> (ø)`

Impacted Files	Coverage Δ
c/tskit/trees.c	`94.79% <100.00%> (+<0.01%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

jeromekelleher · 2020-11-25T09:20:03Z

This is ready to go I think. @hyanwong, can you take a look at this please? In particular, can you think of other test examples we should put in here, to make sure the result is parsimonious?

hyanwong · 2020-11-25T12:03:55Z

Will do. Sorry for the slow reply - internet issues.

hyanwong · 2020-11-25T13:27:10Z

> This is ready to go I think. @hyanwong, can you take a look at this please? In particular, can you think of other test examples we should put in here, to make sure the result is parsimonious?

This all looks great. Thanks for sorting this out @jeromekelleher. If you are looking for more trees to test, then perhaps:

1. We could (trivially) test a star topology, which should give the highest-frequency allele as the ancestral state, and a number of transitions equal to the number of tips carrying the lower-frequency alleles (perhaps you have, but I can't find it by searching for "star" in the changes).
2. A few more tests with more than 2 states might be helpful. For instance:
   i. the star topology above with n tips and n-1 alleles - the ancestral state should be the only one present on 2 tips.
   ii. the least parsimonious possible version of your tskit.Tree.generate_balanced(27, arity=3) tree, where there are 3 alleles, distributed as [0,1,2] * (27//3) over the tips.
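The expectations in the list above can be checked by hand with a minimal pure-Python sketch of Hartigan-style parsimony. This is not tskit's implementation: the function name `hartigan_parsimony`, the dict-based tree representation and the `star` helper are all illustrative assumptions, ties are broken by taking the smallest allele (a choice made only for this sketch), and every leaf is assumed to have an observed state (no missing data, no internal samples).

```python
from collections import Counter


def hartigan_parsimony(children, root, leaf_state):
    """Return (ancestral_state, mutations) for a single site.

    children: dict mapping each internal node to a list of its children.
    leaf_state: dict mapping each leaf to its observed allele.
    mutations: list of (node, derived_state) pairs.
    """
    # Visit nodes parent-first, then reverse so children come before parents.
    order, stack = [], [root]
    while stack:
        u = stack.pop()
        order.append(u)
        stack.extend(children.get(u, []))

    # Bottom-up pass: a node's optimal set holds the states that occur in
    # the largest number of its children's optimal sets.
    optimal = {}
    for u in reversed(order):
        if u in leaf_state:
            optimal[u] = {leaf_state[u]}
        else:
            counts = Counter(s for v in children[u] for s in optimal[v])
            best = max(counts.values())
            optimal[u] = {s for s, c in counts.items() if c == best}

    # Top-down pass: keep the parent's state where it is optimal for the
    # child, otherwise switch state and record a mutation above the child.
    ancestral_state = min(optimal[root])  # tie-break: assumption of this sketch
    mutations = []
    stack = [(root, ancestral_state)]
    while stack:
        u, state = stack.pop()
        if state not in optimal[u]:
            state = min(optimal[u])
            mutations.append((u, state))
        stack.extend((v, state) for v in children.get(u, []))
    return ancestral_state, mutations


def star(n):
    # Hypothetical helper: a root with n leaf children labelled 0..n-1.
    return {"root": list(range(n))}, "root"


children, root = star(6)
states = dict(enumerate([0, 0, 0, 0, 1, 2]))
ancestral_state, mutations = hartigan_parsimony(children, root, states)
print(ancestral_state, sorted(mutations))
```

On the six-tip star with alleles [0, 0, 0, 0, 1, 2], the sketch reports allele 0 as ancestral and places one mutation above each of the two tips carrying the rarer alleles, which is the behaviour described for the star topology above.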

It's really helpful to have the "balanced tree" generation code now! Thanks for doing that. It also occurred to me that it's now trivial to produce a "random" binary tree for testing, by coupling star_topology with split_polytomies - do you think that would be useful, in which case should I open an issue?

hyanwong · 2020-11-25T13:31:03Z

Oh, and another thing: should this deal with (a) internal sample nodes, especially when there is an internal sample node which is also a polytomy, and (b) tip nodes that aren't samples ("dangling nodes")? I suspect (b) is already tested, but maybe not when some of the branches from a polytomy are dangling, and some aren't.

(a) is more important to test than (b), I would think.

jeromekelleher · 2020-11-25T13:36:18Z

> It's really helpful to have the "balanced tree" generation code now! Thanks for doing that. It also occurred to me that it's now trivial to produce a "random" binary tree for testing, by coupling star_topology with split_polytomies - do you think that would be useful, in which case should I open an issue?

I did think of this, but then wondered if it would really be of much use since we can already call msprime.simulate() to do much the same thing. I was also slightly worried that people might use this as some sort of evolutionary model, when it's not. I could easily be convinced it's a good idea, though, if you think it's a useful thing to add.

jeromekelleher · 2020-11-25T13:36:34Z

Thanks for the review, I'll add the tests you mention.

hyanwong · 2020-11-25T13:38:01Z

Oh, and lastly (promise!), are we dealing OK with polytomies that have missing data? Both where some or all of the children of a polytomy are marked as missing (e.g. a star topology with all samples bar one missing), and where there is an internal sample but that sample is marked as missing (I guess this should behave just as if the internal sample node wasn't a sample at all).

hyanwong · 2020-11-25T13:55:50Z

> I did think of this, but then wondered if it would really be of much use since we can already call msprime.simulate() to do much the same thing. I was also slightly worried that people might use this as some sort of evolutionary model, when it's not. I could easily be convinced it's a good idea, though, if you think it's a useful thing to add.

That's a very good point. The (relatively weak) arguments that I can see for it are:

1. It allows tskit to produce random trees without relying on an external dependency. That could be useful e.g. for SLiM testing, or what have you.
2. The topologies of coalescent trees are not equiprobable. See below: the top plot shows the msprime topologies, the bottom those from split_polytomies.

import collections

import matplotlib.pyplot as plt
import msprime
import tqdm
import tskit

# Count how often each tree topology (identified by its rank) is generated.
msprime_counts = collections.defaultdict(int)
split_poly_counts = collections.defaultdict(int)
for s in tqdm.trange(1, 10000):
    msprime_counts[msprime.simulate(5, random_seed=s).first().rank()] += 1

for s in tqdm.trange(1, 10000):
    split_poly_counts[tskit.Tree.generate_star(5).split_polytomies(random_seed=s).rank()] += 1

# Top panel: msprime topologies. Bottom panel: split_polytomies topologies.
# Bars appear in the order each topology was first seen, not sorted by rank.
fig, axes = plt.subplots(2)
axes[0].bar(list(range(len(msprime_counts))), list(msprime_counts.values()))
axes[1].bar(list(range(len(split_poly_counts))), list(split_poly_counts.values()))
plt.show()
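The claim that coalescent topologies are not equiprobable can also be checked exactly, without simulation. The following is a small pure-Python sketch that uses neither msprime nor tskit; both function names are illustrative. It assumes the standard coalescent, in which every pair of lineages is equally likely to merge at each step, so every complete merge history is equally likely. Counting the histories that lead to each labelled topology therefore gives exact topology probabilities.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations


def num_labelled_binary_topologies(n):
    # (2n - 3)!!: the number of rooted, leaf-labelled binary trees on n tips.
    total = 1
    for k in range(3, 2 * n - 2, 2):
        total *= k
    return total


def coalescent_topology_probabilities(n):
    # Enumerate every sequence of pairwise merges. A topology is stored as
    # nested frozensets, so the order of the merges is forgotten.
    histories = Counter()

    def merge(lineages):
        if len(lineages) == 1:
            histories[lineages[0]] += 1
            return
        for i, j in combinations(range(len(lineages)), 2):
            rest = [x for k, x in enumerate(lineages) if k not in (i, j)]
            merge(rest + [frozenset((lineages[i], lineages[j]))])

    merge(list(range(n)))
    total = sum(histories.values())
    return {topo: Fraction(count, total) for topo, count in histories.items()}


probs = coalescent_topology_probabilities(5)
print(len(probs), num_labelled_binary_topologies(5))
print(sorted(set(probs.values())))
```

For 4 tips the same enumeration gives probability 1/9 to each of the 3 balanced topologies and 1/18 to each of the 12 caterpillar topologies, whereas a sampler that is uniform over labelled topologies would give 1/15 to all of them.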

jeromekelleher · 2020-11-25T14:05:47Z

OK, let's break the generate_random discussion to a separate issue. Can you copy the comment above so we don't lose track of it please?

jeromekelleher · 2020-11-25T14:27:19Z

I've added the extra tests @hyanwong - great ideas. The internal samples and missing data stuff was already very thoroughly tested, and I've added in a few extra cases where we're using the balanced trees of varying arity. I think we can be confident this is correct now!

hyanwong · 2020-11-25T18:36:48Z

Great. This seems mergeable to me then. Great work @jeromekelleher - hope the Hartigan algorithm didn't mess with your head too much.

benjeffery

Loving the tests here. One suggestion about fixtures, but it's no biggie as I'm happy either way.

benjeffery · 2020-11-26T12:00:45Z

c/tskit/trees.c

    uint64_t t = 1;
    int8_t r = 0;

+    assert(v != 0);


I'm assuming this isn't tsk_bug_assert as the consequences are pretty disastrous!

jeromekelleher · 2020-11-26

This is pretty perf-sensitive code, so I want to be able to compile out the assert.

benjeffery · 2020-11-26T13:11:08Z

python/tests/test_parsimony.py

-            tree = ts.first()
+    @pytest.mark.parametrize("n", range(2, 10))
+    def test_all_missing(self, n):
+        ts = msprime.simulate(n, random_seed=2)


The places you have ts = msprime.simulate(n, random_seed=2) could be a session-scoped parameterised fixture. https://docs.pytest.org/en/stable/fixture.html#fixture-parametrize

jeromekelleher · 2020-11-26

I might pick that up later. It's hard to know where to draw the line when updating the test code!
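For reference, the fixture suggestion could look roughly like the sketch below. It is not code from this PR: `build_tree_sequence` is a hypothetical stand-in for `msprime.simulate(n, random_seed=2)` so that the example depends only on pytest, and the fixture name `ts` and the test body are illustrative.

```python
import pytest


def build_tree_sequence(n):
    # Hypothetical stand-in for msprime.simulate(n, random_seed=2), used so
    # that this sketch has no dependency beyond pytest.
    return {"num_samples": n}


@pytest.fixture(scope="session", params=list(range(2, 10)))
def ts(request):
    # request.param takes each value in params in turn. With session scope,
    # pytest reuses the built object across the tests that request it.
    return build_tree_sequence(request.param)


class TestParsimony:
    def test_all_missing(self, ts):
        # Runs once for each n in 2..9 without rebuilding the input in the test.
        assert 2 <= ts["num_samples"] < 10
```

Any test that takes `ts` as an argument is then run once per parameter value, which replaces both the `@pytest.mark.parametrize("n", ...)` decorator and the repeated simulate call inside each test.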

hyanwong · 2020-11-26T13:46:08Z

Great stuff. Does this automatically make it into the likelihood compression part of the matching algorithm now?

jeromekelleher · 2020-11-26T14:32:55Z

> Great stuff. Does this automatically make it into the likelihood compression part of the matching algorithm now?

No, that's got to be done separately.

jeromekelleher force-pushed the map-mutations-bugfix branch from 8989085 to 070a4a5 Compare November 24, 2020 00:44

jeromekelleher force-pushed the map-mutations-bugfix branch 2 times, most recently from 3868846 to 5e65e57 Compare November 25, 2020 09:15

jeromekelleher marked this pull request as ready for review November 25, 2020 09:18

jeromekelleher changed the title ~~Add some tests for for polytomies in parsimony~~ Fix parsimony on nonbinary trees Nov 25, 2020

jeromekelleher force-pushed the map-mutations-bugfix branch from 5e65e57 to b78576b Compare November 25, 2020 09:21

jeromekelleher requested a review from benjeffery November 25, 2020 12:18

hyanwong mentioned this pull request Nov 25, 2020

Add Tree.generate_random() method #1033

Closed

jeromekelleher force-pushed the map-mutations-bugfix branch from b78576b to 0521c0b Compare November 25, 2020 14:25

benjeffery approved these changes Nov 26, 2020

benjeffery mentioned this pull request Nov 26, 2020

Add tests for genetic_relatedness #1023

Merged

jeromekelleher added the AUTOMERGE-REQUESTED label Nov 26, 2020

Implement Hartigan parsimony for map_mutations.

0c63887

We were incorrectly using Fitch parsimony on general trees. Closes tskit-dev#987

AdminBot-tskit force-pushed the map-mutations-bugfix branch from 0521c0b to 0c63887 Compare November 26, 2020 14:34

jeromekelleher mentioned this pull request Nov 26, 2020

Haplotype matching code needs to use Hartigan parsimony #1040

Closed

mergify bot merged commit c156792 into tskit-dev:main Nov 26, 2020

mergify bot removed the AUTOMERGE-REQUESTED label Nov 26, 2020

jeromekelleher deleted the map-mutations-bugfix branch October 26, 2022 11:13
