
Likelihood evaluation #665

Merged
merged 14 commits on Jan 30, 2019

Conversation

JereKoskela
Member

Both hudson_recombination_event and common_ancestor_event now begin by setting store_full_arg = 1, which is what I envisage turning into a parameter in the args object.

The calls to store_edge for tree sequence output (as opposed to full ARG) have been placed inside if-checks for store_full_arg = 0.

Correspondingly, both hudson_recombination_event and common_ancestor_event create new node(s), and corresponding edges, when store_full_arg = 1. Each creation is accompanied by a while-loop that updates all segments pointing to the old node so that they point to the new one. I'm not sure whether this is the cleanest way of doing it, though I don't have a better one in mind either.
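The relabelling pattern described above can be sketched stand-alone. This is a hedged illustration using a toy Segment class; the names relabel_chain and the edges list are mine, not the PR's actual algorithms.py code:

```python
class Segment:
    """Toy linked-list ancestry segment, mimicking algorithms.py's Segment."""
    def __init__(self, left, right, node, prev=None):
        self.left = left
        self.right = right
        self.node = node
        self.prev = prev

def relabel_chain(y, u, edges):
    """Record an edge from each segment's old node to the new node u,
    then point the segment at u, walking leftwards along the chain."""
    while y is not None:
        edges.append((y.left, y.right, u, y.node))
        y.node = u
        y = y.prev

# Example: two segments on node 3 are moved under a new node 7.
a = Segment(0, 10, 3)
b = Segment(10, 20, 3, prev=a)
edges = []
relabel_chain(b, 7, edges)
```

After the call, every segment in the chain points at the new node and one edge per segment has been recorded.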

@jeromekelleher
Member

This is superb Jere! Really neat and tidy with minimal changes to the algorithm; perfect. I've pushed a quick update to add a --full-arg option on the command line, and I'm going to play with it for a bit.

@jeromekelleher
Member

OK, for discussion, here's what I get with python3 algorithms.py 4 -r 0.01 tmp.trees --full-arg

id      flags   population      individual      time    metadata
0       1       0       -1      0.00000000000000
1       1       0       -1      0.00000000000000
2       1       0       -1      0.00000000000000
3       1       0       -1      0.00000000000000
4       0       0       -1      0.03643713740139
5       0       0       -1      0.03643713740139
6       0       0       -1      0.04271069012711
7       0       0       -1      0.17225959614628
8       0       0       -1      0.18194209600111
9       0       0       -1      0.18194209600111
10      0       0       -1      0.30339156052675
11      0       0       -1      0.34668465218404
12      0       0       -1      0.35957225513474
id      left            right           parent  child
0       0.00000000      95.00000000     4       0
1       95.00000000     100.00000000    5       0
2       0.00000000      100.00000000    6       3
3       95.00000000     100.00000000    6       5
4       0.00000000      100.00000000    7       2
5       0.00000000      95.00000000     7       4
6       0.00000000      24.00000000     8       1
7       24.00000000     100.00000000    9       1
8       0.00000000      100.00000000    10      6
9       24.00000000     100.00000000    10      9
10      0.00000000      24.00000000     11      8
11      0.00000000      100.00000000    11      10
12      0.00000000      100.00000000    12      7
13      0.00000000      100.00000000    12      11
   12     |    12     |   12
 ┏━┻━┓    |  ┏━┻━┓    | ┏━┻━┓
 ┃   11   |  ┃   11   | ┃   11
 ┃  ┏┻┓   |  ┃   ┃    | ┃   ┃
 ┃  ┃ 10  |  ┃   10   | ┃   10
 ┃  ┃ ┃   |  ┃  ┏┻┓   | ┃  ┏┻━┓
 ┃  8 ┃   |  ┃  ┃ 9   | ┃  ┃  9
 ┃  ┃ ┃   |  ┃  ┃ ┃   | ┃  ┃  ┃
 7  ┃ ┃   |  7  ┃ ┃   | 7  ┃  ┃
┏┻┓ ┃ ┃   | ┏┻┓ ┃ ┃   | ┃  ┃  ┃
┃ ┃ ┃ 6   | ┃ ┃ 6 ┃   | ┃  6  ┃
┃ ┃ ┃ ┃   | ┃ ┃ ┃ ┃   | ┃ ┏┻┓ ┃
┃ 4 ┃ ┃   | ┃ 4 ┃ ┃   | ┃ ┃ 5 ┃
┃ ┃ ┃ ┃   | ┃ ┃ ┃ ┃   | ┃ ┃ ┃ ┃
2 0 1 3   | 2 0 3 1   | 2 3 0 1

and the same thing without the full-arg:

id      flags   population      individual      time    metadata
0       1       0       -1      0.00000000000000
1       1       0       -1      0.00000000000000
2       1       0       -1      0.00000000000000
3       1       0       -1      0.00000000000000
4       0       0       -1      0.04271069012711
5       0       0       -1      0.17225959614628
6       0       0       -1      0.30339156052675
7       0       0       -1      0.34668465218404
8       0       0       -1      0.35957225513474
id      left            right           parent  child
0       95.00000000     100.00000000    4       0
1       95.00000000     100.00000000    4       3
2       0.00000000      95.00000000     5       0
3       0.00000000      95.00000000     5       2
4       24.00000000     100.00000000    6       1
5       24.00000000     95.00000000     6       3
6       95.00000000     100.00000000    6       4
7       0.00000000      24.00000000     7       1
8       0.00000000      24.00000000     7       3
9       95.00000000     100.00000000    8       2
10      0.00000000      95.00000000     8       5
11      24.00000000     100.00000000    8       6
12      0.00000000      24.00000000     8       7
   8      |    8      |     8
 ┏━┻━┓    |  ┏━┻━┓    |   ┏━┻━┓
 ┃   7    |  ┃   6    |   6   ┃
 ┃  ┏┻┓   |  ┃  ┏┻┓   | ┏━┻┓  ┃
 5  ┃ ┃   |  5  ┃ ┃   | ┃  4  ┃
┏┻┓ ┃ ┃   | ┏┻┓ ┃ ┃   | ┃ ┏┻┓ ┃
0 2 1 3   | 0 2 1 3   | 1 0 3 2

@jeromekelleher
Member

Here's the code I used for drawing the trees btw in case it's useful:

import msprime

ts = msprime.load("tmp.trees")
print(ts.tables.nodes)
print(ts.tables.edges)

trees = [tree.draw(format="unicode").splitlines() for tree in ts.trees()]
for j in range(len(trees[0])):
    line = " | ".join(tree[j] for tree in trees)
    print(line)

@codecov

codecov bot commented Dec 20, 2018

Codecov Report

Merging #665 into master will decrease coverage by 2.57%.
The diff coverage is 68.08%.


@@            Coverage Diff             @@
##           master     #665      +/-   ##
==========================================
- Coverage   84.55%   81.97%   -2.58%     
==========================================
  Files          12       12              
  Lines        6157     6236      +79     
  Branches     1260     1283      +23     
==========================================
- Hits         5206     5112      -94     
- Misses        533      699     +166     
- Partials      418      425       +7
Flag Coverage Δ
#C 81.97% <68.08%> (-2.58%) ⬇️
#python 98.66% <100%> (ø) ⬆️
Impacted Files Coverage Δ
lib/util.c 21.59% <0%> (-74.93%) ⬇️
msprime/simulations.py 99.64% <100%> (ø) ⬆️
_msprimemodule.c 82.76% <100%> (+0.04%) ⬆️
lib/msprime.c 75.96% <66.4%> (-3.48%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a93dc3b...908c733.

@jeromekelleher left a comment
Member

This is fantastic Jere, I'm really pleased that it's looking so clean to encode the full ARG. I think it's not quite right at the moment, but I think it's an easy fix. When you crank up the number of samples and recombination rates a bit you can see the trees start getting weird, with some bits being unrooted and others having disconnected sections.

algorithms.py Outdated
if self.full_arg:
self.store_node(y.population)
u = len(self.tables.nodes) - 1
self.store_edge(y.left, y.right, u, y.node)
Member

This isn't quite right I think: shouldn't the store_edge call be inside the while loop? We want to record the edge for every segment, not just the one we happen to intersect with.

Member Author

Yes, I think you're right. I was thinking of y as the node, and not the segment when writing this. Presumably we should record y.right, then go all the way left in the while loop, and add a single edge to cover the whole interval of material?

Member Author

On second thought, that would end up recording sites on edges at which the corresponding parent node isn't ancestral. Storing an edge per segment is probably the best way to go, provided it doesn't completely swamp us with edges.

Member

An edge per segment is the way to go all right --- the flush_edges call takes care of squashing together any edges that can be, so it's equivalent.
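The squashing that the flush_edges call is described as doing can be illustrated with a small sketch. This is my own toy version of the idea, not msprime's implementation:

```python
def squash(edges):
    """Merge adjacent edges that share parent and child and abut on the
    genome, e.g. (0, 5, p, c) + (5, 9, p, c) -> (0, 9, p, c).
    Edges are (left, right, parent, child) tuples, assumed sorted."""
    out = []
    for e in edges:
        if out and out[-1][2:] == e[2:] and out[-1][1] == e[0]:
            last = out.pop()
            e = (last[0], e[1], e[2], e[3])
        out.append(e)
    return out
```

Under this squashing, storing one edge per segment and storing one edge per contiguous run of segments produce the same final edge table, which is the equivalence claimed above.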

algorithms.py Outdated
u = len(self.tables.nodes) - 1
self.store_edge(y.left, y.right, u, y.node)
y.node = u
while y.prev is not None:
Member

This isn't quite right either I think. Shouldn't it be:

u = len(self.tables.nodes) - 1
while y is not None:
    self.store_edge(y.left, y.right, u, y.node)
    y.node = u
    y = y.prev

algorithms.py Outdated
else:
# split the link between x and y.
x.next = None
y.prev = None
z = y
if self.full_arg:
self.store_node(x.population)
u = len(self.tables.nodes) - 1
Member

This is identical to the section above. To avoid repetition, maybe we could add a store_edges_left method which does this? (And a corresponding store_edges_right for below?)

Member Author

That would definitely be possible. Each of the store_edge loops gets called a couple of times at present.

algorithms.py Outdated
if self.full_arg:
self.store_node(population_index)
u = len(self.tables.nodes) - 1
self.store_edge(x.left, x.right, u, x.node)
Member

Yeah, this is basically doing the same thing, isn't it? So, we can call store_edges_right on x and y to do the work here.

Member

Hmm, but what happens when we have a coalescence? Are the segments actually getting zipped together correctly with the new node IDs? I'll have to think about that one a bit.

Member Author

I think they are. The only difference is that self.store_node has been called earlier, in line 841, as opposed to in line 881. The variable u still points to the new parent node, exactly as it did before I made any changes, and I haven't touched the code where mergers are recorded.

@JereKoskela
Member Author

The output looks good to me. Both tree sequences depict the same genealogy as far as I can see, but the former carries all of the extra information that would be needed to implement a log likelihood function. It also looks like implementing prune-regraft moves (where we cut off an edge and reattach it somewhere else) shouldn't be too hard. Perhaps implementing those should be the next move?

Thanks for the tree drawing code - that'll come in very handy. I've always tended to sketch out an ARG by hand from the edge list.

@petrelharp
Contributor

Wow, this is a nice idea. Ping @JuliaPalacios - we were talking about how it'd be nice to compute likelihoods, but we'd only get the likelihood of the particular ARG that gave the tree sequence, not the sum over all compatible ARGs. This fixes that problem, in a sense, if I understand correctly?

@jeromekelleher
Member

Wow, this is a nice idea. Ping @JuliaPalacios - we were talking about how it'd be nice to compute likelihoods, but we'd only get the likelihood of the particular ARG that gave the tree sequence, not the sum over all compatible ARGs. This fixes that problem, in a sense, if I understand correctly?

It's neat, isn't it? I think we're still looking at calculating the likelihood of the single ARG that we happened to traverse rather than the sum over all compatible ARGs --- I am reliably informed that this is a Hard Problem. We need all the intermediate nodes because the likelihood formula is in terms of ARG events, which we lose a lot of in the usual minimal tree sequence encoding. Being able to compute the likelihood of a given output of msprime.simulate would be a significant step forward though, at least opening up the possibility of full likelihood methods.
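For a rough sense of why every ARG event matters here: the likelihood is a product of per-event terms, one per waiting time. The following is a schematic with illustrative rate expressions (coalescence at rate k*(k-1)/2, recombination proportional to the amount of ancestral material), not the exact parameterisation msprime would use:

```python
import math

def log_event_term(k, total_links, rho, dt, event):
    """Schematic log-likelihood contribution of one ARG event: an
    exponential waiting time dt at the total event rate, plus the log
    rate of the particular event that occurred."""
    coal_rate = k * (k - 1) / 2       # k lineages coalescing pairwise
    rec_rate = rho * total_links      # recombination over ancestral links
    total_rate = coal_rate + rec_rate
    event_rate = coal_rate if event == "coalescence" else rec_rate
    return -total_rate * dt + math.log(event_rate)
```

Summing such terms over the full event history requires knowing every intermediate node and its time, which is exactly the information the minimal tree sequence encoding discards.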

algorithms.py Outdated
y.node = u
while y.prev is not None:
y = y.prev
self.store_edge(y.left, y.right, u, y.node)
Member

This will do the same thing, I think, and is simpler:

while y is not None:
    self.store_edge(y.left, y.right, u, y.node)
    y.node = u
    y = y.prev

Member Author

You're right. I'll tweak that, and fix the whitespace that CircleCI is complaining about.

@JereKoskela
Member Author

Wow, this is a nice idea. Ping @JuliaPalacios - we were talking about how it'd be nice to compute likelihoods, but we'd only get the likelihood of the particular ARG that gave the tree sequence, not the sum over all compatible ARGs. This fixes that problem, in a sense, if I understand correctly?

As Jerome said, this still only gives the likelihood of the particular ARG that was traversed. The extra nodes have been added so that the likelihood can be evaluated from the stored, augmented tree sequence, rather than just at runtime. Marginalising over all ARGs that are compatible with a given tree sequence means evaluating a high-dimensional integral over a complicated region of ARG-space, and I'd be surprised if it was much easier than obtaining the true sampling distribution outright (though I'd love to be wrong).

@jeromekelleher
Member

It's looking good Jere, but there's something not quite right with the coalescences I think. If I run

$ python3 algorithms.py 5 -r 0.025 arg.trees --full-arg
$ python3 algorithms.py 5 -r 0.025 noarg.trees

and then plot with the following code

import msprime

def draw_trees(ts):
    trees = [tree.draw(format="unicode").splitlines() for tree in ts.trees()]
    max_height = max(len(tree) for tree in trees)
    for j in range(max_height):
        line = ""
        for tree in trees:
            k = j - max_height + len(tree)
            if k < 0:
                line += " " * len(tree[0])
            else:
                line += tree[k]
            line += " | "
        print(line)

ts_arg = msprime.load("arg.trees")
ts_noarg = msprime.load("noarg.trees")

draw_trees(ts_noarg)
draw_trees(ts_arg.simplify())

I get

  12        |   11        |   11        |   11        |     10      |   10        |
┏━┻━┓       | ┏━┻━┓       | ┏━┻━┓       | ┏━┻━┓       |   ┏━┻━━┓    | ┏━┻━┓       |
┃   9       | ┃   9       | ┃   8       | ┃   10      |   ┃    8    | ┃   7       |
┃ ┏━┻━┓     | ┃ ┏━┻━┓     | ┃ ┏━┻━┓     | ┃ ┏━┻━┓     |   ┃   ┏┻┓   | ┃ ┏━┻━┓     |
┃ ┃   6     | ┃ ┃   6     | ┃ ┃   6     | ┃ ┃   6     |   6   ┃ ┃   | ┃ ┃   6     |
┃ ┃ ┏━┻┓    | ┃ ┃ ┏━┻┓    | ┃ ┃ ┏━┻┓    | ┃ ┃ ┏━┻┓    | ┏━┻┓  ┃ ┃   | ┃ ┃ ┏━┻┓    |
┃ ┃ ┃  5    | ┃ ┃ ┃  5    | ┃ ┃ ┃  5    | ┃ ┃ ┃  5    | ┃  5  ┃ ┃   | ┃ ┃ ┃  5    |
┃ ┃ ┃ ┏┻┓   | ┃ ┃ ┃ ┏┻┓   | ┃ ┃ ┃ ┏┻┓   | ┃ ┃ ┃ ┏┻┓   | ┃ ┏┻┓ ┃ ┃   | ┃ ┃ ┃ ┏┻┓   |
2 1 0 3 4   | 2 1 0 3 4   | 2 1 0 3 4   | 2 1 0 3 4   | 0 3 4 1 2   | 2 1 0 3 4   |
  11        |   11        |   11        |   12        |   12        |   10        |
┏━┻━┓       | ┏━┻━┓       | ┏━┻━┓       | ┏━┻━┓       | ┏━┻━┓       | ┏━┻━┓       |
┃   9       | ┃   8       | ┃   10      | ┃   10      | ┃   7       | ┃   7       |
┃ ┏━┻━┓     | ┃ ┏━┻━┓     | ┃ ┏━┻━┓     | ┃ ┏━┻━┓     | ┃ ┏━┻━┓     | ┃ ┏━┻━┓     |
┃ ┃   6     | ┃ ┃   6     | ┃ ┃   6     | ┃ ┃   6     | ┃ ┃   6     | ┃ ┃   6     |
┃ ┃ ┏━┻┓    | ┃ ┃ ┏━┻┓    | ┃ ┃ ┏━┻┓    | ┃ ┃ ┏━┻┓    | ┃ ┃ ┏━┻┓    | ┃ ┃ ┏━┻┓    |
┃ ┃ ┃  5    | ┃ ┃ ┃  5    | ┃ ┃ ┃  5    | ┃ ┃ ┃  5    | ┃ ┃ ┃  5    | ┃ ┃ ┃  5    |
┃ ┃ ┃ ┏┻┓   | ┃ ┃ ┃ ┏┻┓   | ┃ ┃ ┃ ┏┻┓   | ┃ ┃ ┃ ┏┻┓   | ┃ ┃ ┃ ┏┻┓   | ┃ ┃ ┃ ┏┻┓   |
2 1 0 3 4   | 2 1 0 3 4   | 2 1 0 3 4   | 2 1 0 3 4   | 2 1 0 3 4   | 2 1 0 3 4   |

If we simplify a full-arg tree sequence, we should get the same tree sequence back as running the same simulation without ARG recording (this must be true, right?). But we can see here that the topologies are different. In the first two trees, for example, we have different nodes as root, which can't be right.

I think this must be because we're not handling coalescences quite right. What do you think?

@JereKoskela
Member Author

You're right. It wouldn't surprise me if the id of the MRCA node were higher under --full-arg, but the differences in the patterns of MRCAs mean something must be going wrong.

Do the x and y returned in lines 833 and 835 always point to the leftmost segment carried by the removed ancestor? It looks like they do, and I had assumed that when writing my tweaks, but it would explain everything nicely if that were wrong.

@jeromekelleher
Member

Do the x and y returned in lines 833 and 835 always point to the leftmost segment carried by the removed ancestor? It looks like they do, and I had assumed that when writing my tweaks, but it would explain everything nicely if that were wrong.

Yes, it's always the left-most segment that's stored in the population container.

Looking through again it's not obvious what the problem is... hmmm.

@JereKoskela
Member Author

How about this: at the top of common_ancestor_event, we've run store_edges_right for both child lineages, which makes both of them point to the same node. I think that makes the z.node == alpha.node check in defrag_required on line 915 redundant. I don't immediately see whether that causes problems, but it does at least look like a genuine difference between the two settings.

@jeromekelleher
Member

How about this: at the top of common_ancestor_event, we've run store_edges_right for both child lineages, which makes both of them point to the same node. I think that makes the z.node == alpha.node check in defrag_required on line 915 redundant. I don't immediately see whether that causes problems, but it does at least look like a genuine difference between the two settings.

Isn't that what we're doing now? Sorry, I don't have time to look into this now, I'm checking out until after PopGroup. Happy holidays!

@JereKoskela
Member Author

Isn't that what we're doing now? Sorry, I don't have time to look into this now, I'm checking out until after PopGroup. Happy holidays!

No problem and likewise! I'll respond with a little more detail in preparation for January while this is fresh in my mind. I think I'm on the right track.

I made defrag_segment_chain print a statement at runtime, and it does get run noticeably more often under the --full-arg setting. That's because we collect segments to the same node every time there's a coancestry event, so that the z.right == alpha.left and z.node == alpha.node constraint on line 915 is much easier to satisfy. My understanding is that running or not running defrag_segment_chain shouldn't make a difference to output, but it seems that it does:

I ran algorithms.py with the same parameters you used, once as-is, and once with the call to defrag_segment_chain on line 922 commented out. Neither run used the --full-arg modifications. The one with defrag enabled gave the same output you saw,

  12        |   11        |   11        |   11        |     10      |   10        | 
┏━┻━┓       | ┏━┻━┓       | ┏━┻━┓       | ┏━┻━┓       |   ┏━┻━━┓    | ┏━┻━┓       | 
┃   9       | ┃   9       | ┃   8       | ┃   10      |   ┃    8    | ┃   7       | 
┃ ┏━┻━┓     | ┃ ┏━┻━┓     | ┃ ┏━┻━┓     | ┃ ┏━┻━┓     |   ┃   ┏┻┓   | ┃ ┏━┻━┓     | 
┃ ┃   6     | ┃ ┃   6     | ┃ ┃   6     | ┃ ┃   6     |   6   ┃ ┃   | ┃ ┃   6     | 
┃ ┃ ┏━┻┓    | ┃ ┃ ┏━┻┓    | ┃ ┃ ┏━┻┓    | ┃ ┃ ┏━┻┓    | ┏━┻┓  ┃ ┃   | ┃ ┃ ┏━┻┓    | 
┃ ┃ ┃  5    | ┃ ┃ ┃  5    | ┃ ┃ ┃  5    | ┃ ┃ ┃  5    | ┃  5  ┃ ┃   | ┃ ┃ ┃  5    | 
┃ ┃ ┃ ┏┻┓   | ┃ ┃ ┃ ┏┻┓   | ┃ ┃ ┃ ┏┻┓   | ┃ ┃ ┃ ┏┻┓   | ┃ ┏┻┓ ┃ ┃   | ┃ ┃ ┃ ┏┻┓   | 
2 1 0 3 4   | 2 1 0 3 4   | 2 1 0 3 4   | 2 1 0 3 4   | 0 3 4 1 2   | 2 1 0 3 4   |

while with defrag commented out I got

  11        |   10        |   10        |   10        | 
┏━┻━┓       | ┏━┻━┓       | ┏━┻━┓       | ┏━┻━┓       | 
┃   10      | ┃   8       | ┃   9       | ┃   7       | 
┃ ┏━┻━┓     | ┃ ┏━┻━┓     | ┃ ┏━┻━┓     | ┃ ┏━┻━┓     | 
┃ ┃   6     | ┃ ┃   6     | ┃ ┃   6     | ┃ ┃   6     | 
┃ ┃ ┏━┻┓    | ┃ ┃ ┏━┻┓    | ┃ ┃ ┏━┻┓    | ┃ ┃ ┏━┻┓    | 
┃ ┃ ┃  5    | ┃ ┃ ┃  5    | ┃ ┃ ┃  5    | ┃ ┃ ┃  5    | 
┃ ┃ ┃ ┏┻┓   | ┃ ┃ ┃ ┏┻┓   | ┃ ┃ ┃ ┏┻┓   | ┃ ┃ ┃ ┏┻┓   | 
2 1 0 3 4   | 1 2 0 3 4   | 2 1 0 3 4   | 2 1 0 3 4   |

I'm not quite sure what to make of that, but I'll try to work on finding out exactly why there's a difference.
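For reference, the defrag step under discussion can be sketched as follows. The names and the link-counting convention here are illustrative, not the msprime code itself; the second helper shows how merging abutting segments can change the number of candidate recombination links:

```python
def defrag_chain(segments):
    """Merge neighbouring segments when they abut and map to the same
    node (the z.right == alpha.left and z.node == alpha.node condition).
    segments: list of (left, right, node), ordered left to right."""
    out = []
    for seg in segments:
        if out and out[-1][1] == seg[0] and out[-1][2] == seg[2]:
            z = out.pop()
            seg = (z[0], seg[1], seg[2])
        out.append(seg)
    return out

def links(segments):
    """Candidate recombination links in a chain, counting
    right - left - 1 per segment (an illustrative convention)."""
    return sum(r - l - 1 for l, r, _ in segments)
```

Under this convention, defragging [0, 5) + [5, 9) into [0, 9) raises the link count from 7 to 8, so the internal representation affects which breakpoints can be drawn.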

@jeromekelleher
Member

That's weird, defrag_segment_chain shouldn't make any difference to the output as it's only cleaning up the internal representation of the state...

Ah, but it does change the random trajectory of the simulation because of the way that recombination breakpoints are chosen, so you can't deterministically compare outcomes with and without defragmentation.

@JereKoskela
Member Author

Yes, that was the conclusion I came to eventually. It also means that we can't deterministically compare simplified trees from the --full-arg run with the corresponding simulation without full arg recording. Can you think of another way to try to validate the changes @jeromekelleher? The distribution of all graph-functionals should be the same, so any previous validation you've done should transfer over easily as long as it can handle redundant nodes in the tree sequence.

@jeromekelleher
Member

Yes, that was the conclusion I came to eventually. It also means that we can't deterministically compare simplified trees from the --full-arg run with the corresponding simulation without full arg recording. Can you think of another way to try to validate the changes @jeromekelleher? The distribution of all graph-functionals should be the same, so any previous validation you've done should transfer over easily as long as it can handle redundant nodes in the tree sequence.

Aha, yes, of course it does. Sorry, I sent you down a rabbit hole there!

I think the best thing to do is probably push the implementation into C and then run our tests to verify the statistical properties (if the algorithm is wrong at the moment, it's not by very much, so there won't be any wasted effort). Usually the next step is to add the option to the main.c interface and get stuff working in C and then finally update the Python/C interface to percolate the option up to the top level. I'm happy to do some of this if you like, but it should be fairly straightforward if you want to plug ahead with it. Let me know how you'd like to proceed.

@JereKoskela
Member Author

Yes, that was the conclusion I came to eventually. It also means that we can't deterministically compare simplified trees from the --full-arg run with the corresponding simulation without full arg recording. Can you think of another way to try to validate the changes @jeromekelleher? The distribution of all graph-functionals should be the same, so any previous validation you've done should transfer over easily as long as it can handle redundant nodes in the tree sequence.

Aha, yes, of course it does. Sorry, I sent you down a rabbit hole there!

I think the best thing to do is probably push the implementation into C and then run our tests to verify the statistical properties (if the algorithm is wrong at the moment, it's not by very much, so there won't be any wasted effort). Usually the next step is to add the option to the main.c interface and get stuff working in C and then finally update the Python/C interface to percolate the option up to the top level. I'm happy to do some of this if you like, but it should be fairly straightforward if you want to plug ahead with it. Let me know how you'd like to proceed.

No problem - tracking down what was going on helped with my understanding of the algorithm, which can only be a good thing in the long run. In the same vein, I'll try to incorporate the same tweaks into main.c (or more specifically msprime.c, at a quick glance), and push them into this PR. I'm clueless about Python/C interfaces, so will probably need your help at that stage.

@jeromekelleher
Member

The algorithm changes go into msprime.c. See the developer docs for information on main.c --- although some of this is out of date as I'm still extracting tskit.

@jeromekelleher
Member

Looks like this has got into a bit of a mess (because of all the churn on the msprime side). Do you mind if I jump in and clear things up @JereKoskela?

@JereKoskela
Member Author

Not at all. I made a mess of this update all by myself as well. I think I've now undone all of the new damage that I caused, and also uploaded everything I meant to upload into the place to which I meant to upload it. If you can rescue the rest, I'd be very grateful.

@jeromekelleher
Member

OK, cool. I have another PR to merge in first and I'll fix this up then; should be sorted in a couple of hours.

JereKoskela and others added 5 commits January 16, 2019 13:11
Fixed issue where nodes with multiple segments caused missing edges in the ARG.
Shortened some while loops in store_edges_left and store_edges_right, and tidied whitespace.
@JereKoskela
Member Author

Thanks @jeromekelleher! It'll take me some time to get an understanding of the overall changes (I've never heard of meson or ninja before), but all of the commits I had made are still present and look as I intended them to look.

@JereKoskela
Member Author

I had a chance to read the docs today, and managed to compile and run the new version. I've not been able to break it, or generate output that looks broken to the naked eye, so I would say that it's ready for some statistical tests. I'm not sure how to run those from the C implementation (as far as I can tell the developer docs only cover validation.py). Is that something you can sort out @jeromekelleher?

@jeromekelleher
Member

Excellent, thanks @JereKoskela. Yep, I'll wire this up with the Python API ASAP.

@jeromekelleher
Member

OK, I've added the basic plumbing here so you can now call msprime.simulate(...record_full_arg=True). I also added the flags to allow us to easily identify RE nodes --- see the new test case for this in action. On reflection, the flag for the CA nodes is pointless because it's a mutually exclusive category (nodes are either RE or CA). I guess it would be useful to identify nodes in which there was no coalescence, as these will come in handy at some point?

Anyway, it's all looking good. I think the first step in terms of running evaluations is to run some simple simulations to compare the distributions of the number of nodes and edges when we (a) run a normal sim and (b) run with record_full_arg=True and then simplify the result.
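The (a) vs (b) comparison amounts to a two-sample test on the node and edge count distributions. Here's a minimal pure-Python sketch of such a check; only the statistic is shown, with placeholder samples, and the real counts would come from the simulations described above:

```python
def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of samples a and b."""
    xs = sorted(set(a) | set(b))
    d = 0.0
    for x in xs:
        fa = sum(v <= x for v in a) / len(a)
        fb = sum(v <= x for v in b) / len(b)
        d = max(d, abs(fa - fb))
    return d

# Identical samples give statistic 0; fully separated samples give 1.
```

In practice one would feed in the per-replicate num_nodes and num_edges values from the normal runs and the simplified record_full_arg runs, and check the statistic is small.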

We should get some low-level test coverage on this too (tests/tests.c). We just want something simple like test_single_locus_simulation that exercises the ARG code.

@jeromekelleher
Member

Oh yes, before I forget: it would be good to update the msp_merge_ancestors method also, so that we can deal with non-binary events properly. I know we don't need it for evaluating the likelihood, but if we leave it out it'll be a nasty bug waiting for someone in a few years when they want to get full ARGs for a multiple-merger coalescent or instantaneous bottleneck model.

@JereKoskela
Member Author

Oh yes, before I forget: it would be good to update the msp_merge_ancestors method also, so that we can deal with non-binary events properly. I know we don't need it for evaluating the likelihood, but if we leave it out it'll be a nasty bug waiting for someone in a few years when they want to get full ARGs for a multiple-merger coalescent or instantaneous bottleneck model.

Good point - I've added similar store_full_arg settings to it as already implemented in msp_merge_two_ancestors. I also added a store_full_arg run to the existing test_multi_locus_simulation test, rather than writing a whole new one. I have no experience with unit testing, so let me know if this is somehow bad practice. I note that it hasn't really improved the codecov scores, which leads me to believe that something may not work like I think it does.

I've also written a quick python script to test the node and edge count distributions, and the output looks good. I'll email you those directly rather than clutter the branch with extra files.

@jeromekelleher
Member

Excellent, thanks @JereKoskela. I'll make a polishing pass here and add in your test to verification.py, but this is basically done now. Great work, this is a really neat feature!

@jeromekelleher
Member

OK, I've made a pass through this @JereKoskela and I think improved it a bit. The main idea is to pull the ARG edge recording into a single add_arg_edges function to simplify things, and then use it to record the ARG edges after the common ancestor event is done. This cuts down the code a bit and makes it easier to see what's happening.

The other difference is that I've made the semantics of the CA_EVENT flag mark those nodes that don't result in coalescence. These can only occur in the ARG, so it's handy to be able to track them.

Would you mind taking one last look through here and see if it looks OK to you please?

One last question also: is the reason we need two nodes to make a recombination event that we'll squash together the edges for the left and right hand sides, and so we won't be able to tell which is the RHS and which is the LHS? If so, does this actually matter? It seems a bit redundant to record this event twice, and I'm just wondering what we actually get from this.

Disabled ARG recording for multiple merger coalescents as there is a
problem with 'cancelling' events in this context. It can be done easily
enough, but doesn't seem worth doing right now.
@JereKoskela
Member Author

Thanks @jeromekelleher. I agree that the new code is simpler. I'm not sure it's quite right though. For example, it looks to me like segments in which ancestral material merges will get stored twice when storing the full ARG: first by the msp_store_edge calls in the main loop, which used to be contained in if (!store_full_arg) statements but no longer are, and then again by the msp_store_arg_edges call at the end of the msp_merge_two_ancestors function. I'll have a play with this to make sure.

Your question about needing two nodes for recombinations is a good one. I implemented it as a split into two nodes because that's what happens in the ARG, but you're right that having one node to signify the time of the recombination event might well be enough. Another point on which I'll have to get back to you, I think.

@jeromekelleher
Member

Thanks @jeromekelleher. I agree that the new code is simpler. I'm not sure it's quite right though. For example, it looks to me like segments in which ancestral material merges will get stored twice when storing the full ARG: first by the msp_store_edge calls in the main loop, which used to be contained in if (!store_full_arg) statements but no longer are, and then again by the msp_store_arg_edges call at the end of the msp_merge_two_ancestors function. I'll have a play with this to make sure.

I think this is OK --- the store_arg_edges function checks if the segment's node is already set to the target node (which happens for coalesced segments in the coalescence=true case) and skips these. This avoids the double-storing problem.
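The skip logic described here can be sketched in a few lines of Python. This is an illustrative toy under assumed names, not msprime's actual store_arg_edges implementation: the `Segment` class and the tuple-based edge table are inventions of the sketch.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    left: float
    right: float
    node: int

def store_arg_edges(segments, target_node, edge_table):
    # Record an edge (left, right, parent, child) for each segment that
    # does not already point at target_node, then repoint the segment.
    # Segments whose node is already target_node (e.g. coalesced segments
    # handled by the main merge loop) are skipped, which is what avoids
    # the double-storing problem discussed above.
    for seg in segments:
        if seg.node != target_node:
            edge_table.append((seg.left, seg.right, target_node, seg.node))
            seg.node = target_node
    return edge_table
```

For example, merging two segments whose nodes are 3 and 7 into target node 7 records one edge for the first segment and silently skips the second.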

Removed a stray printf from msp_merge_ancestors
@JereKoskela
Member Author

> Thanks @jeromekelleher. I agree that the new code is simpler. I'm not sure it's quite right, though. For example, it looks to me like segments in which ancestral material merges will get double-stored when storing the full ARG: first by the msp_store_edge calls in the main loop, which used to be contained in if (!store_full_arg) statements but no longer are, and then again by the msp_store_arg_edges call at the end of the msp_merge_two_ancestors function. I'll have a play with this to make sure.

> I think this is OK --- the store_arg_edges function checks if the segment's node is already set to the target node (which happens for coalesced segments in the coalescence=true case) and skips these. This avoids the double-storing problem.

Yeah, it turns out everything was correct. It was just much more compact than what I had written, so I needed a while to spot where everything happened. I found one stray printf statement in msp_merge_ancestors, which I've removed. Everything looks good to me now.

On the subject of whether we need two nodes to store a recombination event, as it stands the answer is yes. The reason is that a single node wouldn't store the location of the recombination, and hence we wouldn't be able to track the number of extant segments in each state for likelihood evaluation. In most cases we could infer where the recombination took place from subsequent coalescence events, but not if the two recombinant lineages merge back together before anything else happens. Multiple recombinations along the same lineage before segments merge with anyone else would also cause tracking problems. We could probably replace the second node by some kind of richer label that tells us where the recombination happened, but we can't just remove the second node outright.
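The two-node bookkeeping described above can be sketched as follows. This is a hypothetical illustration of why the second node carries information, not msprime's code: the function and variable names are invented, and nodes and edges are plain tuples.

```python
def record_recombination(segments, breakpoint, time, next_node, nodes, edges):
    # A recombination creates two parent nodes at the same time: material
    # to the left of the breakpoint attaches to one parent, material to
    # the right attaches to the other. The split between the two parents
    # is what records where the recombination happened, even if the two
    # lineages merge straight back together.
    left_parent, right_parent = next_node, next_node + 1
    nodes.append((left_parent, time))
    nodes.append((right_parent, time))
    for left, right, child in segments:
        if right <= breakpoint:
            edges.append((left, right, left_parent, child))
        elif left >= breakpoint:
            edges.append((left, right, right_parent, child))
        else:
            # A segment spanning the breakpoint is split between parents.
            edges.append((left, breakpoint, left_parent, child))
            edges.append((breakpoint, right, right_parent, child))
    return left_parent, right_parent
```

Collapsing the two parents into a single node would lose the breakpoint whenever the two recombinant lineages coalesce again before any other event, which is exactly the failure case described above.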

@jeromekelleher
Member

> On the subject of whether we need two nodes to store a recombination event, as it stands the answer is yes. The reason is that a single node wouldn't store the location of the recombination, and hence we wouldn't be able to track the number of extant segments in each state for likelihood evaluation. In most cases we could infer where the recombination took place from subsequent coalescence events, but not if the two recombinant lineages merge back together before anything else happens. Multiple recombinations along the same lineage before segments merge with anyone else would also cause tracking problems. We could probably replace the second node by some kind of richer label that tells us where the recombination happened, but we can't just remove the second node outright.

I see... One thing we could do is record the location of the recombination breakpoint using the node metadata. We might need to do this anyway if the exact breakpoint is needed as we won't know this with the two-node scheme if the breakpoint falls between two segments. Is this important information, or can we forget about it?

There are clearly pros and cons in both directions. For me, the question comes down to whether one or two nodes is the right way to think about it. A node represents a distinct individual who lived at a specific time. Does a recombination event produce two distinct individuals? (I think so, right?)

@JereKoskela
Member Author

> I see... One thing we could do is record the location of the recombination breakpoint using the node metadata. We might need to do this anyway if the exact breakpoint is needed, as we won't know this with the two-node scheme if the breakpoint falls between two segments. Is this important information, or can we forget about it?

That level of detail we can forget about. We need to know i) which segments of ancestral material actually got separated, and ii) how many links there are that would separate ancestral material in the event of a recombination. Both of those things can be recovered from the present two-node approach, but would be lost in some cases with a single node and no added metadata.
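The link count in point ii) can be sketched like this, assuming the usual continuous-genome convention that a lineage's recombination rate is proportional to the span between its leftmost and rightmost ancestral coordinates (that convention is an assumption of this sketch, not something stated in the thread).

```python
def num_links(segments):
    # segments: list of (left, right) intervals of ancestral material
    # carried by one lineage. Any breakpoint strictly inside the span
    # from the leftmost to the rightmost coordinate would separate
    # ancestral material, so the span is this lineage's link count.
    return max(right for _, right in segments) - min(left for left, _ in segments)
```

A lineage carrying [0, 1) and [3, 5) has 5 links: a breakpoint anywhere in (0, 5) splits its ancestral material, including breakpoints in the non-ancestral gap (1, 3).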

> Does a recombination event produce two distinct individuals? (I think so, right?)

Yes, in the model the two recombinant segments are inherited from different ancestors.

@jeromekelleher
Member

OK, great, let's merge this then. Thanks @JereKoskela!

@jeromekelleher jeromekelleher merged commit 0e4c8d8 into tskit-dev:master Jan 30, 2019
@JereKoskela JereKoskela deleted the likelihood_evaluation branch February 6, 2019 10:29
@hyanwong
Member

hyanwong commented Dec 6, 2021

For posterity, Jere now says "I'm not opposed to the single node approach. That's almost certainly what I'd do if I was setting this up from scratch." and (with respect to the comment about single nodes not storing the location of the recombination) "my comment is not informative of anything except that I hadn't yet appreciated what we could store with edges as well as nodes."
