Skip to content

Conversation

@jeromekelleher
Copy link
Member

@jeromekelleher jeromekelleher commented Nov 23, 2020

Closes #987

PR Checklist:

  • Tests that fully cover new/changed functionality.
  • Documentation including tutorial content if appropriate.
  • Changelogs, if there are API changes.

@AdminBot-tskit
Copy link
Collaborator

📖 Docs for this PR can be previewed here

@codecov
Copy link

codecov bot commented Nov 24, 2020

Codecov Report

Merging #1030 (0c63887) into main (2900201) will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##             main    #1030   +/-   ##
=======================================
  Coverage   93.71%   93.71%           
=======================================
  Files          26       26           
  Lines       20920    20923    +3     
  Branches      875      875           
=======================================
+ Hits        19606    19609    +3     
  Misses       1277     1277           
  Partials       37       37           
Flag Coverage Δ
c-tests 92.49% <100.00%> (+<0.01%) ⬆️
lwt-tests 93.58% <ø> (ø)
python-c-tests 94.90% <ø> (ø)
python-tests 98.61% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
c/tskit/trees.c 94.79% <100.00%> (+<0.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2900201...0c63887. Read the comment docs.

@jeromekelleher jeromekelleher force-pushed the map-mutations-bugfix branch 2 times, most recently from 3868846 to 5e65e57 Compare November 25, 2020 09:15
@jeromekelleher jeromekelleher marked this pull request as ready for review November 25, 2020 09:18
@jeromekelleher jeromekelleher changed the title Add some tests for for polytomies in parsimony Fix parsimony on nonbinary trees Nov 25, 2020
@jeromekelleher
Copy link
Member Author

This is ready to go I think. @hyanwong, can you take a look at this please? In particular, can you think of other test examples we should put in here, to make sure the result is parsimonious?

@hyanwong
Copy link
Member

Will do. Sorry for the slow reply - internet issues.

@hyanwong
Copy link
Member

This is ready to go I think. @hyanwong, can you take a look at this please? In particular, can you think of other test examples we should put in here, to make sure the result is parsimonious?

This all looks great. Thanks for sorting this out @jeromekelleher . If you are looking for more trees to test, then perhaps

  1. we could (trivially) test a star topology, which should give the ancestral state being the highest freq mutation, and the number of transitions being the count of lowest freq alleles (perhaps you have, but I can't find it by searching for "star" in the changes).
  2. a few more tests with more than 2 states might be helpful. For instance,
    i. the star topology above with n tips and n-1 alleles - the ancestral state should be the only one present on 2 tips.
    ii. the least parsimonious possible version of your tskit.Tree.generate_balanced(27, arity=3) tree, where there are 3 alleles, distributed as [0,1,2] * (27//3) over the tips.

It's really helpful to have the "balanced tree" generation code now! Thanks for doing that. It also occurred to me that it's now trivial to produce a "random" binary tree for testing, by coupling star_topology with split_polytomies - do you think that would be useful, in which case should I open an issue?

@hyanwong
Copy link
Member

Oh, and another thing: should this deal with (a) internal sample nodes, especially when there is an internal sample node which is also a polytomy, and (b) tip nodes that aren't samples ("dangling nodes). I suspect (b) is already tested, but maybe not when some of the branches from a polytomy are dangling, and some aren't.

(a) is more important to test than (b), I would think.

@jeromekelleher
Copy link
Member Author

It's really helpful to have the "balanced tree" generation code now! Thanks for doing that. It also occurred to me that it's now trivial to produce a "random" binary tree for testing, by coupling star_topology with split_polytomies - do you think that would be useful, in which case should I open an issue?

I did think of this, but then wondered if it would really be of much use since we can already call msprime.simulate() to do much the same thing. I was also slightly worried that people might use this as some sort of evolutionary model, when it's not. I could easily be convinced it's a good idea, though, if you think it's a useful thing to add.

@jeromekelleher
Copy link
Member Author

Thanks for the review, I'll add the tests you mention.

@hyanwong
Copy link
Member

hyanwong commented Nov 25, 2020

Oh, and lastly (promise!), are we dealing OK with polytomies that have missing data? Both where some or all of the children of a polytomy are marked as missing (e.g. a star topology with all samples bar one missing), and where there is an internal sample but that sample is marked as missing (I guess this should behave just as if the internal sample node wasn't a sample at all.

@hyanwong
Copy link
Member

I did think of this, but then wondered if it would really be of much use since we can already call msprime.simulate() to do much the same thing. I was also slightly worried that people might use this as some sort of evolutionary model, when it's not. I could easily be convinced it's a good idea, though, if you think it's a useful thing to add.

That's a very good point. The (relatively weak) arguments that I can see for it are:

  1. It allows tskit to produce random trees without relying on an external dependency. That could be useful e.g. for SLiM testing, or what have you.
  2. The distribution of coalescence tree topologies is not equiprobable - see below - the top plot is msprime topologies, the bottom is from split_polytomy.
import matplotlib.pyplot as plt
import collections

import tqdm
import msprime
import tskit

msprime_counts = collections.defaultdict(int)
split_poly_counts = collections.defaultdict(int)
for s in tqdm.trange(1, 10000):
    msprime_counts[msprime.simulate(5, random_seed=s).first().rank()] += 1

for s in tqdm.trange(1, 10000):
    split_poly_counts[tskit.Tree.generate_star(5).split_polytomies(random_seed=s).rank()] += 1

fig, axes = plt.subplots(2)
axes[0].bar(list(range(len(msprime_counts))), msprime_counts.values())
axes[1].bar(list(range(len(split_poly_counts))), split_poly_counts.values())

image

@jeromekelleher
Copy link
Member Author

OK, let's break the generate_random discussion to a separate issue. Can you copy the comment above so we don't lose track of it please?

@jeromekelleher
Copy link
Member Author

I've added the extra tests @hyanwong - great ideas. The internal samples and missing data stuff was already very thoroughly tested, and I've added in a few extra cases where we're using the balanced trees of varying arity. I think we can be confident this is correct now!

@hyanwong
Copy link
Member

hyanwong commented Nov 25, 2020

Great. This seems mergeable to me then. Great work @jeromekelleher - hope the Hartigan algorithm didn't mess with your head too much.

Copy link
Member

@benjeffery benjeffery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Loving the tests here. One suggestion about fixtures - but it is no biggie as happy either way.

uint64_t t = 1;
int8_t r = 0;

assert(v != 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming this isn't tsk_bug_assert as the consequences are pretty disastrous!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pretty perf sensitive code, so I want to compile out the assert.

tree = ts.first()
@pytest.mark.parametrize("n", range(2, 10))
def test_all_missing(self, n):
ts = msprime.simulate(n, random_seed=2)
Copy link
Member

@benjeffery benjeffery Nov 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The places you have ts = msprime.simulate(n, random_seed=2) could be a session-scoped parameterised fixture. https://docs.pytest.org/en/stable/fixture.html#fixture-parametrize

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might pick that up later - it's hard to know where to draw the line when updating the test code!

@hyanwong
Copy link
Member

hyanwong commented Nov 26, 2020

Great stuff. Does this automatically make it into the likelihood compression part of the matching algorithm now?

@jeromekelleher
Copy link
Member Author

Great stuff. Does this automatically make it into the likelihood compression part of the matching algorithm now?

No, that's got to be done separately.

We were incorrectly using Fitch parsimony on general trees.

Closes tskit-dev#987
@mergify mergify bot merged commit c156792 into tskit-dev:main Nov 26, 2020
@jeromekelleher jeromekelleher deleted the map-mutations-bugfix branch October 26, 2022 11:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug in map_mutations() for polytomy

4 participants