Likelihood evaluation #665
Conversation
This is superb Jere! Really neat and tidy with minimal changes to the algorithm; perfect. I've pushed a quick update to add a
OK, for discussion, here's what I get with
and the same thing without the
Here's the code I used for drawing the trees btw in case it's useful:

```python
ts = msprime.load("tmp.trees")
print(ts.tables.nodes)
print(ts.tables.edges)
trees = [tree.draw(format="unicode").splitlines() for tree in ts.trees()]
for j in range(len(trees[0])):
    line = " | ".join(tree[j] for tree in trees)
    print(line)
```
Codecov Report

```
@@           Coverage Diff            @@
##           master    #665     +/-  ##
=======================================
- Coverage   84.55%   81.97%   -2.58%
=======================================
  Files          12       12
  Lines        6157     6236     +79
  Branches     1260     1283     +23
=======================================
- Hits         5206     5112     -94
- Misses        533      699    +166
- Partials      418      425      +7
```

Continue to review full report at Codecov.
This is fantastic Jere, I'm really pleased that it's looking so clean to encode the full ARG. I think it's not quite right at the moment, but I think it's an easy fix. When you crank up the number of samples and recombination rates a bit you can see the trees start getting weird, with some bits being unrooted and others having disconnected sections.
algorithms.py
Outdated
```python
if self.full_arg:
    self.store_node(y.population)
    u = len(self.tables.nodes) - 1
    self.store_edge(y.left, y.right, u, y.node)
```
This isn't quite right I think: shouldn't the `store_edge` call be inside the while loop? We want to record the edge for every segment, not just the one we happen to intersect with.
Yes, I think you're right. I was thinking of y as the node, and not the segment when writing this. Presumably we should record y.right, then go all the way left in the while loop, and add a single edge to cover the whole interval of material?
On second thought, that would end up recording sites on edges at which the corresponding parent node isn't ancestral. Storing an edge per segment is probably the best way to go, provided it doesn't completely swamp us with edges.
An edge per segment is the way to go all right --- the `flush_edges` call takes care of squashing together any edges that can be squashed, so it's equivalent.
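For readers following along, here is a minimal sketch of the kind of merging `flush_edges` is described as performing. The `Edge` tuple and the `squash_edges` function are hypothetical stand-ins, not msprime's actual implementation:

```python
from collections import namedtuple

# Hypothetical stand-in for the simulator's edge records.
Edge = namedtuple("Edge", ["left", "right", "parent", "child"])

def squash_edges(edges):
    """Merge abutting edges that share a parent and child into one edge,
    as flush_edges is described as doing for per-segment edges."""
    squashed = []
    for e in sorted(edges, key=lambda e: (e.parent, e.child, e.left)):
        if (
            squashed
            and squashed[-1].parent == e.parent
            and squashed[-1].child == e.child
            and squashed[-1].right == e.left
        ):
            last = squashed.pop()
            squashed.append(Edge(last.left, e.right, e.parent, e.child))
        else:
            squashed.append(e)
    return squashed

# One edge per segment over abutting intervals collapses to a single edge,
# so recording per segment loses nothing after squashing.
per_segment = [Edge(0.0, 0.3, 5, 2), Edge(0.3, 1.0, 5, 2)]
print(squash_edges(per_segment))
```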
algorithms.py
Outdated
```python
u = len(self.tables.nodes) - 1
self.store_edge(y.left, y.right, u, y.node)
y.node = u
while y.prev is not None:
```
This isn't quite right either I think. Shouldn't it be:
```python
u = len(self.tables.nodes) - 1
while y is not None:
    self.store_edge(y.left, y.right, u, y.node)
    y.node = u
    y = y.prev
```
algorithms.py
Outdated
```python
else:
    # split the link between x and y.
    x.next = None
    y.prev = None
    z = y
    if self.full_arg:
        self.store_node(x.population)
        u = len(self.tables.nodes) - 1
```
This is identical to the section above. To avoid repetition, maybe we could add a `store_edges_left` method which does this? (And a corresponding `store_edges_right` for below?)
That would definitely be possible. Each of the store_edge loops gets called a couple of times at present.
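As a sketch of what such helpers might look like on the doubly linked segment list, assuming a minimal `Segment` class (the class, the `link` helper, and the bare-tuple edge records are hypothetical stand-ins for the simulator's internals):

```python
class Segment:
    """Minimal stand-in for the simulator's ancestry segment (hypothetical)."""
    def __init__(self, left, right, node):
        self.left, self.right, self.node = left, right, node
        self.prev = None
        self.next = None

def link(*segments):
    # Chain segments into a doubly linked list and return the head.
    for a, b in zip(segments, segments[1:]):
        a.next, b.prev = b, a
    return segments[0]

def store_edges_left(y, u, edges):
    """Record an edge per segment from y leftwards, relabelling each
    segment to point at the new node u."""
    while y is not None:
        edges.append((y.left, y.right, u, y.node))
        y.node = u
        y = y.prev

def store_edges_right(x, u, edges):
    """Mirror image: record an edge per segment from x rightwards."""
    while x is not None:
        edges.append((x.left, x.right, u, x.node))
        x.node = u
        x = x.next

edges = []
head = link(Segment(0, 2, 7), Segment(2, 5, 7), Segment(5, 9, 7))
store_edges_right(head, 10, edges)
print(edges)  # one (left, right, parent=10, child=7) tuple per segment
```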
algorithms.py
Outdated
```python
if self.full_arg:
    self.store_node(population_index)
    u = len(self.tables.nodes) - 1
    self.store_edge(x.left, x.right, u, x.node)
```
Yeah, this is basically doing the same thing, isn't it? So, we can call store_edges_right
on x and y to do the work here.
Hmm, but what happens when we have a coalescence? Are the segments actually getting zipped together correctly with the new node IDs? I'll have to think about that one a bit.
I think they are. The only difference is that self.store_node has been called earlier, in line 841, as opposed to in line 881. The variable u still points to the new parent node, exactly as it did before I made any changes, and I haven't touched the code where mergers are recorded.
The output looks good to me. Both tree sequences depict the same genealogy as far as I can see, but the former carries all of the extra information that would be needed to implement a log likelihood function. It also looks like implementing prune-regraft moves (where we cut off an edge and reattach it somewhere else) shouldn't be too hard. Perhaps implementing those should be the next move? Thanks for the tree drawing code - that'll come in very handy. I've always tended to sketch out an ARG by hand from the edge list.
Wow, this is a nice idea. Ping @JuliaPalacios - we were talking about how it'd be nice to compute likelihoods, but we'd only get the likelihood of the particular ARG that gave the tree sequence, not the sum over all compatible ARGs. This fixes that problem, in a sense, if I understand correctly?
It's neat, isn't it? I think we're still looking at calculating the likelihood of the single ARG that we happened to traverse rather than the sum over all compatible ARGs --- I am reliably informed that this is a Hard Problem. We need all the intermediate nodes because the likelihood formula is in terms of ARG events, which we lose a lot of in the usual minimal tree sequence encoding. Being able to compute the likelihood of a given output of
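To make concrete why the intermediate nodes matter, here is a toy sketch of an ARG log-likelihood under one conventional scaling of the coalescent with recombination. Everything here --- the function name, the event-tuple layout, and the rate scaling --- is an illustrative assumption, not the formula this PR implements; the point is only that evaluating it needs the lineage count and link count at every event, which is exactly what the extra nodes preserve:

```python
import math

def arg_log_likelihood(events, rho):
    """Toy log-likelihood of a sequence of ARG events (illustration only).

    `events` is a list of (dt, kind, k, links) tuples (hypothetical layout):
    dt    -- waiting time since the previous event,
    kind  -- "coalescence" or "recombination",
    k     -- number of extant lineages just before the event,
    links -- total recombination-susceptible length just before the event.
    """
    log_lik = 0.0
    for dt, kind, k, links in events:
        total_rate = k * (k - 1) / 2 + rho * links / 2
        log_lik += -total_rate * dt  # exponential waiting-time contribution
        if kind == "coalescence":
            log_lik += math.log(1.0)  # a specific pair coalesces at rate 1
        else:
            log_lik += math.log(rho / 2)  # a recombination at a specific link
    return log_lik

# Two lineages, no links, one coalescence after 1 unit of time.
print(arg_log_likelihood([(1.0, "coalescence", 2, 0.0)], rho=1.0))  # -1.0
```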
algorithms.py
Outdated
```python
y.node = u
while y.prev is not None:
    y = y.prev
    self.store_edge(y.left, y.right, u, y.node)
```
This will do the same thing, I think, and is simpler:
```python
while y is not None:
    self.store_edge(y.left, y.right, u, y.node)
    y.node = u
    y = y.prev
```
You're right. I'll tweak that, and fix the whitespace that CircleCI is complaining about.
As Jerome said, this still only gives the likelihood of the particular ARG that was traversed. The extra nodes have been added so that the likelihood can be evaluated from the stored, augmented tree sequence, rather than just at runtime. Marginalising over all ARGs that are compatible with a given tree sequence means evaluating a high-dimensional integral over a complicated region of ARG-space, and I'd be surprised if it was much easier than obtaining the true sampling distribution outright (though I'd love to be wrong).
It's looking good Jere, but there's something not quite right with the coalescences I think. If I run
and then plot with the following code

```python
def draw_trees(ts):
    trees = [tree.draw(format="unicode").splitlines() for tree in ts.trees()]
    max_height = max(len(tree) for tree in trees)
    for j in range(max_height):
        line = ""
        for tree in trees:
            k = j - max_height + len(tree)
            if k < 0:
                line += " " * len(tree[0])
            else:
                line += tree[k]
            line += " | "
        print(line)

ts_arg = msprime.load("arg.trees")
ts_noarg = msprime.load("noarg.trees")
ts = ts_arg.simplify()
draw_trees(ts_noarg)
draw_trees(ts_arg.simplify())
```

I get
If we simplify a full-arg tree sequence, we should get the same tree sequence back as running the same simulation without arg recording (this must be true, right?). But we can see here that the topologies are different. Looking at the first two trees, for example, we have different nodes as root, which can't be right. I think this must be because we're not handling coalescences quite right. What do you think?
You're right. It wouldn't surprise me if the id of the MRCA node were higher under --full-arg, but the differences in the patterns of MRCAs mean something must be going wrong. Do the x and y returned in lines 833 and 835 always point to the leftmost segment carried by the removed ancestor? It looks like they do, and I had assumed that when writing my tweaks, but it would explain everything nicely if that were wrong.
Yes, it's always the left-most segment that is stored in the population container. Looking through again it's not obvious what the problem is... hmmm.
How about this: at the top of common_ancestor_event, we've run store_edges_right for both child lineages, which makes both of them point to the same node. I think that makes the z.node == alpha.node check in defrag_required on line 915 redundant. I don't immediately see whether that causes problems, but it does at least look like a genuine difference between the two settings.
Isn't that what we're doing now? Sorry, I don't have time to look into this now, I'm checking out until after PopGroup. Happy holidays!
No problem and likewise! I'll respond with a little more detail in preparation for January while this is fresh in my mind. I think I'm on the right track. I made I ran
while with defrag commented out I got
I'm not quite sure what to make of that, but I'll try to work on finding out exactly why there's a difference.
That's weird. Ah, but it does change the random trajectory of the simulation because of the way that recombination breakpoints are chosen, so you can't deterministically compare outcomes with and without defragmentation.
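A tiny stdlib illustration of why this matters: if defragmentation changes the total number of links, the very same seeded generator yields a different breakpoint stream from that point on. The sampler below is a made-up stand-in, not msprime's actual breakpoint code:

```python
import random

def next_breakpoints(total_links, n, seed=42):
    """Draw n recombination breakpoints uniformly over `total_links` links,
    roughly the way a simulator might (illustrative only)."""
    rng = random.Random(seed)
    return [rng.uniform(0, total_links) for _ in range(n)]

# Defragmentation changes the number of links, so even with an identical
# seed the stream of breakpoints -- and hence the whole simulation
# trajectory -- diverges from that point on.
with_defrag = next_breakpoints(total_links=100, n=5)
without_defrag = next_breakpoints(total_links=101, n=5)
print(with_defrag != without_defrag)  # True
```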
Yes, that was the conclusion I came to eventually. It also means that we can't deterministically compare simplified trees from the
Aha, yes, of course it does. Sorry, I sent you down a rabbit hole there! I think the best thing to do is probably push the implementation into C and then run our tests to verify the statistical properties (if the algorithm is wrong at the moment, it's not by very much, so there won't be any wasted effort). Usually the next step is to add the option to the
No problem - tracking down what was going on helped with my understanding of the algorithm, which can only be a good thing in the long run. In the same vein, I'll try to incorporate the same tweaks into
The algorithm changes go into msprime.c. See the developer docs for information on
Looks like this has got into a bit of a mess (because of all the churn on the msprime side). Do you mind if I jump in and clear things up @JereKoskela?
Not at all. I made a mess of this update all by myself as well. I think I've now undone all of the new damage that I caused, and also uploaded everything I meant to upload into the place to which I meant to upload it. If you can rescue the rest, I'd be very grateful.
OK, cool. I have another PR to merge in first and I'll fix this up then; should be sorted in a couple of hours.
Fixed issue where nodes with multiple segments caused missing edges in the ARG.
Shortened some while loops in store_edges_left and store_edges_right, and tidied whitespace.
Thanks @jeromekelleher! It'll take me some time to get an understanding of the overall changes (I've never heard of meson or ninja before), but all of the commits I had made are still present and look as I intended them to look.
I had a chance to read the docs today, and managed to compile and run the new version. I've not been able to break it, or generate output that looks broken to the naked eye, so I would say that it's ready for some statistical tests. I'm not sure how to run those from the C implementation (as far as I can tell the developer docs only cover validation.py). Is that something you can sort out @jeromekelleher?
Excellent, thanks @JereKoskela. Yep, I'll wire this up with the Python API ASAP.
OK, I've added the basic plumbing here so you can now call Anyway, it's all looking good. I think the first step in terms of running evaluations is to run some simple simulations to compare the distributions of the number of nodes and edges when we (a) run a normal sim and (b) run with We should get some low-level test coverage on this too (
Oh yes, before I forget: it would be good to update the
Good point - I've added similar I've also written a quick python script to test the node and edge count distributions, and the output looks good. I'll email you those directly rather than clutter the branch with extra files.
Excellent, thanks @JereKoskela. I'll make a polishing pass here and add in your test to
OK, I've made a pass through this @JereKoskela and I think improved it a bit. The main idea is to pull the The other difference is that I've made the semantics of the CA_EVENT flag mark those nodes that don't result in coalescence. These can only occur in the ARG, so it's handy to be able to track them. Would you mind taking one last look through here and seeing if it looks OK to you please? One last question also: is the reason we need two nodes to make a recombination event that we'll squash together the edges for the left and right hand sides, and so we won't be able to tell which is the RHS and which is the LHS? If so, does this actually matter? It seems a bit redundant to record this event twice, and I'm just wondering what we actually get from this.
Disabled ARG recording for multiple merger coalescents as there is a problem with 'cancelling' events in this context. It can be done easily enough, but doesn't seem worth doing right now.
Thanks @jeromekelleher. I agree that the new code is simpler. I'm not sure it's quite right though. For example, it looks to me like segments in which ancestral material merges will get stored twice when storing the full ARG: first by the Your question about needing two nodes for recombinations is a good one. I implemented it as a split into two nodes because that's what happens in the ARG, but you're right that having one node to signify the time of the recombination event might well be enough. Another point on which I'll have to get back to you, I think.
I think this is OK --- the
Removed a stray printf from msp_merge_ancestors
Yeah, it turns out everything was correct. It was just much more compact than what I had written, so I needed a while to spot where everything happened. I found one stray printf. On the subject of whether we need two nodes to store a recombination event, as it stands the answer is yes. The reason is that a single node wouldn't store the location of the recombination, and hence we wouldn't be able to track the number of extant segments in each state for likelihood evaluation. In most cases we could infer where the recombination took place from subsequent coalescence events, but not if the two recombinant lineages merge back together before anything else happens. Multiple recombinations along the same lineage before segments merge with anyone else would also cause tracking problems. We could probably replace the second node by some kind of richer label that tells us where the recombination happened, but we can't just remove the second node outright.
I see... One thing we could do is record the location of the recombination breakpoint using the node metadata. We might need to do this anyway if the exact breakpoint is needed, as we won't know this with the two-node scheme if the breakpoint falls between two segments. Is this important information, or can we forget about it? There are clearly pros and cons in both directions. The question comes down, for me, to whether one or two nodes is the right way to think about it. A node represents a distinct individual who lived at a specific time. Does a recombination event produce two distinct individuals? (I think so, right?)
That level of detail we can forget about. We need to know i) which segments of ancestral material actually got separated, and ii) how many links there are that would separate ancestral material in the event of a recombination event. Both of those things can be recovered from the present two-node approach, but would be lost in some cases with just one node without added metadata.
Yes, in the model the two recombinant segments are inherited from different ancestors.
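As a sketch of the two-node encoding being discussed (all names here are hypothetical; this is not the PR's code), a recombination at a given breakpoint splits the child's material between two freshly created parent nodes at the same time, and the breakpoint location survives as the boundary between their edges:

```python
def record_recombination(segments, breakpoint, next_node, edges, nodes, time):
    """Record a recombination as two new nodes, one inheriting material to
    the left of `breakpoint` and one to the right.

    `segments` is a list of (left, right, child_node) tuples.
    Returns the ids of the left-hand and right-hand parent nodes.
    """
    u_left, u_right = next_node, next_node + 1
    nodes.append((u_left, time))   # both recombinant parents exist at the
    nodes.append((u_right, time))  # same time: the time of the event
    for left, right, child in segments:
        if right <= breakpoint:
            edges.append((left, right, u_left, child))
        elif left >= breakpoint:
            edges.append((left, right, u_right, child))
        else:
            # a segment spanning the breakpoint is split between the parents
            edges.append((left, breakpoint, u_left, child))
            edges.append((breakpoint, right, u_right, child))
    return u_left, u_right

edges, nodes = [], []
record_recombination([(0.0, 10.0, 3)], 4.0, next_node=7, edges=edges,
                     nodes=nodes, time=1.5)
print(edges)  # [(0.0, 4.0, 7, 3), (4.0, 10.0, 8, 3)]
```

With a single node the breakpoint would have to be carried some other way (e.g. node metadata, as suggested above), since the two halves would attach to the same parent.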
OK, great, let's merge this then. Thanks @JereKoskela!
For posterity, Jere now says "I'm not opposed to the single node approach. That's almost certainly what I'd do if I was setting this up from scratch." and (with respect to the comment about single nodes not storing the location of the recombination) "my comment is not informative of anything except that I hadn't yet appreciated what we could store with edges as well as nodes." |
Both hudson_recombination_event and common_ancestor_event now begin by setting store_full_arg = 1, which is what I envisage turning into a parameter in the args object.
The calls to store_edge for tree sequence output (as opposed to full ARG) have been placed inside if-checks for store_full_arg = 0.
Correspondingly, both hudson_recombination_event and common_ancestor_event create new node(s), and corresponding edges, when store_full_arg = 1. That's always accompanied by a while-loop, which makes sure that all segments that used to point to the old node are updated to point to the new one. I'm not sure whether this is the cleanest way of doing it, though I don't have a better one in mind either.