
Likelihood evaluation #665

Merged
merged 14 commits on Jan 30, 2019

Conversation

JereKoskela
Member

Both hudson_recombination_event and common_ancestor_event now begin by setting store_full_arg = 1, which is what I envisage turning into a parameter in the args object.

The calls to store_edge for tree sequence output (as opposed to full ARG) have been placed inside if-checks for store_full_arg = 0.

Correspondingly, both hudson_recombination_event and common_ancestor_event create new node(s), and corresponding edges, when store_full_arg = 1. Each creation is accompanied by a while-loop that updates all segments pointing to the old node so that they point to the new one. I'm not sure whether this is the cleanest way of doing it, though I don't have a better one in mind either.
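The relabelling pattern described above can be sketched stand-alone. This is a hedged illustration using a toy Segment class; the names relabel_chain and the edges list are mine, not the PR's actual algorithms.py code:

```python
class Segment:
    """Toy linked-list ancestry segment, mimicking algorithms.py's Segment."""
    def __init__(self, left, right, node, prev=None):
        self.left = left
        self.right = right
        self.node = node
        self.prev = prev

def relabel_chain(y, u, edges):
    """Record an edge from each segment's old node to the new node u,
    then point the segment at u, walking leftwards along the chain."""
    while y is not None:
        edges.append((y.left, y.right, u, y.node))
        y.node = u
        y = y.prev

# Example: two segments on node 3 are moved under a new node 7.
a = Segment(0, 10, 3)
b = Segment(10, 20, 3, prev=a)
edges = []
relabel_chain(b, 7, edges)
```

After the call, every segment in the chain points at the new node and one edge per segment has been recorded.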

@jeromekelleher
Member

This is superb Jere! Really neat and tidy with minimal changes to the algorithm; perfect. I've pushed a quick update to add a --full-arg option on the command line, and I'm going to play with it for a bit.

@jeromekelleher
Member

OK, for discussion, here's what I get with python3 algorithms.py 4 -r 0.01 tmp.trees --full-arg

id      flags   population      individual      time    metadata
0       1       0       -1      0.00000000000000
1       1       0       -1      0.00000000000000
2       1       0       -1      0.00000000000000
3       1       0       -1      0.00000000000000
4       0       0       -1      0.03643713740139
5       0       0       -1      0.03643713740139
6       0       0       -1      0.04271069012711
7       0       0       -1      0.17225959614628
8       0       0       -1      0.18194209600111
9       0       0       -1      0.18194209600111
10      0       0       -1      0.30339156052675
11      0       0       -1      0.34668465218404
12      0       0       -1      0.35957225513474
id      left            right           parent  child
0       0.00000000      95.00000000     4       0
1       95.00000000     100.00000000    5       0
2       0.00000000      100.00000000    6       3
3       95.00000000     100.00000000    6       5
4       0.00000000      100.00000000    7       2
5       0.00000000      95.00000000     7       4
6       0.00000000      24.00000000     8       1
7       24.00000000     100.00000000    9       1
8       0.00000000      100.00000000    10      6
9       24.00000000     100.00000000    10      9
10      0.00000000      24.00000000     11      8
11      0.00000000      100.00000000    11      10
12      0.00000000      100.00000000    12      7
13      0.00000000      100.00000000    12      11
   12     |    12     |   12
 ┏━┻━┓    |  ┏━┻━┓    | ┏━┻━┓
 ┃   11   |  ┃   11   | ┃   11
 ┃  ┏┻┓   |  ┃   ┃    | ┃   ┃
 ┃  ┃ 10  |  ┃   10   | ┃   10
 ┃  ┃ ┃   |  ┃  ┏┻┓   | ┃  ┏┻━┓
 ┃  8 ┃   |  ┃  ┃ 9   | ┃  ┃  9
 ┃  ┃ ┃   |  ┃  ┃ ┃   | ┃  ┃  ┃
 7  ┃ ┃   |  7  ┃ ┃   | 7  ┃  ┃
┏┻┓ ┃ ┃   | ┏┻┓ ┃ ┃   | ┃  ┃  ┃
┃ ┃ ┃ 6   | ┃ ┃ 6 ┃   | ┃  6  ┃
┃ ┃ ┃ ┃   | ┃ ┃ ┃ ┃   | ┃ ┏┻┓ ┃
┃ 4 ┃ ┃   | ┃ 4 ┃ ┃   | ┃ ┃ 5 ┃
┃ ┃ ┃ ┃   | ┃ ┃ ┃ ┃   | ┃ ┃ ┃ ┃
2 0 1 3   | 2 0 3 1   | 2 3 0 1

and the same thing without the full-arg:

id      flags   population      individual      time    metadata
0       1       0       -1      0.00000000000000
1       1       0       -1      0.00000000000000
2       1       0       -1      0.00000000000000
3       1       0       -1      0.00000000000000
4       0       0       -1      0.04271069012711
5       0       0       -1      0.17225959614628
6       0       0       -1      0.30339156052675
7       0       0       -1      0.34668465218404
8       0       0       -1      0.35957225513474
id      left            right           parent  child
0       95.00000000     100.00000000    4       0
1       95.00000000     100.00000000    4       3
2       0.00000000      95.00000000     5       0
3       0.00000000      95.00000000     5       2
4       24.00000000     100.00000000    6       1
5       24.00000000     95.00000000     6       3
6       95.00000000     100.00000000    6       4
7       0.00000000      24.00000000     7       1
8       0.00000000      24.00000000     7       3
9       95.00000000     100.00000000    8       2
10      0.00000000      95.00000000     8       5
11      24.00000000     100.00000000    8       6
12      0.00000000      24.00000000     8       7
   8      |    8      |     8
 ┏━┻━┓    |  ┏━┻━┓    |   ┏━┻━┓
 ┃   7    |  ┃   6    |   6   ┃
 ┃  ┏┻┓   |  ┃  ┏┻┓   | ┏━┻┓  ┃
 5  ┃ ┃   |  5  ┃ ┃   | ┃  4  ┃
┏┻┓ ┃ ┃   | ┏┻┓ ┃ ┃   | ┃ ┏┻┓ ┃
0 2 1 3   | 0 2 1 3   | 1 0 3 2

@jeromekelleher
Member

Here's the code I used for drawing the trees btw in case it's useful:

import msprime

ts = msprime.load("tmp.trees")
print(ts.tables.nodes)
print(ts.tables.edges)

trees = [tree.draw(format="unicode").splitlines() for tree in ts.trees()]
for j in range(len(trees[0])):
    line = " | ".join(tree[j] for tree in trees)
    print(line)

@codecov

codecov bot commented Dec 20, 2018

Codecov Report

Merging #665 into master will decrease coverage by 2.57%.
The diff coverage is 68.08%.


@@            Coverage Diff             @@
##           master     #665      +/-   ##
==========================================
- Coverage   84.55%   81.97%   -2.58%     
==========================================
  Files          12       12              
  Lines        6157     6236      +79     
  Branches     1260     1283      +23     
==========================================
- Hits         5206     5112      -94     
- Misses        533      699     +166     
- Partials      418      425       +7
Flag Coverage Δ
#C 81.97% <68.08%> (-2.58%) ⬇️
#python 98.66% <100%> (ø) ⬆️
Impacted Files Coverage Δ
lib/util.c 21.59% <0%> (-74.93%) ⬇️
msprime/simulations.py 99.64% <100%> (ø) ⬆️
_msprimemodule.c 82.76% <100%> (+0.04%) ⬆️
lib/msprime.c 75.96% <66.4%> (-3.48%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a93dc3b...908c733.

@jeromekelleher left a comment
Member

This is fantastic Jere, I'm really pleased that it's looking so clean to encode the full ARG. I think it's not quite right at the moment, but I think it's an easy fix. When you crank up the number of samples and recombination rates a bit you can see the trees start getting weird, with some bits being unrooted and others having disconnected sections.

algorithms.py Outdated
if self.full_arg:
self.store_node(y.population)
u = len(self.tables.nodes) - 1
self.store_edge(y.left, y.right, u, y.node)
Member

This isn't quite right I think: shouldn't the store_edge call be inside the while loop? We want to record the edge for every segment, not just the one we happen to intersect with.

Member Author

Yes, I think you're right. I was thinking of y as the node, and not the segment when writing this. Presumably we should record y.right, then go all the way left in the while loop, and add a single edge to cover the whole interval of material?

Member Author

On second thought, that would end up recording sites on edges at which the corresponding parent node isn't ancestral. Storing an edge per segment is probably the best way to go, provided it doesn't completely swamp us with edges.

Member

An edge per segment is the way to go all right --- the flush_edges call takes care of squashing together any edges that can be, so it's equivalent.
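The squashing that the flush_edges call is described as doing can be illustrated with a small sketch. This is my own toy version of the idea, not msprime's implementation:

```python
def squash(edges):
    """Merge adjacent edges that share parent and child and abut on the
    genome, e.g. (0, 5, p, c) + (5, 9, p, c) -> (0, 9, p, c).
    Edges are (left, right, parent, child) tuples, assumed sorted."""
    out = []
    for e in edges:
        if out and out[-1][2:] == e[2:] and out[-1][1] == e[0]:
            last = out.pop()
            e = (last[0], e[1], e[2], e[3])
        out.append(e)
    return out
```

Under this squashing, storing one edge per segment and storing one edge per contiguous run of segments produce the same final edge table, which is the equivalence claimed above.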

algorithms.py Outdated
u = len(self.tables.nodes) - 1
self.store_edge(y.left, y.right, u, y.node)
y.node = u
while y.prev is not None:
Member

This isn't quite right either I think. Shouldn't it be:

u = len(self.tables.nodes) - 1
while y is not None:
    self.store_edge(y.left, y.right, u, y.node)
    y.node = u
    y = y.prev

algorithms.py Outdated
else:
# split the link between x and y.
x.next = None
y.prev = None
z = y
if self.full_arg:
self.store_node(x.population)
u = len(self.tables.nodes) - 1
Member

This is identical to the section above. To avoid repetition, maybe we could add a store_edges_left method which does this? (And a corresponding store_edges_right for below?)

Member Author

That would definitely be possible. Each of the store_edge loops gets called a couple of times at present.

algorithms.py Outdated
if self.full_arg:
self.store_node(population_index)
u = len(self.tables.nodes) - 1
self.store_edge(x.left, x.right, u, x.node)
Member

Yeah, this is basically doing the same thing, isn't it? So, we can call store_edges_right on x and y to do the work here.

Member

Hmm, but what happens when we have a coalescence? Are the segments actually getting zipped together correctly with the new node IDs? I'll have to think about that one a bit.

Member Author

I think they are. The only difference is that self.store_node has been called earlier, in line 841, as opposed to in line 881. The variable u still points to the new parent node, exactly as it did before I made any changes, and I haven't touched the code where mergers are recorded.

@JereKoskela
Member Author

The output looks good to me. Both tree sequences depict the same genealogy as far as I can see, but the former carries all of the extra information that would be needed to implement a log likelihood function. It also looks like implementing prune-regraft moves (where we cut off an edge and reattach it somewhere else) shouldn't be too hard. Perhaps implementing those should be the next move?

Thanks for the tree drawing code - that'll come in very handy. I've always tended to sketch out an ARG by hand from the edge list.

@petrelharp
Contributor

Wow, this is a nice idea. Ping @JuliaPalacios - we were talking about how it'd be nice to compute likelihoods, but we'd only get the likelihood of the particular ARG that gave the tree sequence, not the sum over all compatible ARGs. This fixes that problem, in a sense, if I understand correctly?

@jeromekelleher
Member

Wow, this is a nice idea. Ping @JuliaPalacios - we were talking about how it'd be nice to compute likelihoods, but we'd only get the likelihood of the particular ARG that gave the tree sequence, not the sum over all compatible ARGs. This fixes that problem, in a sense, if I understand correctly?

It's neat, isn't it? I think we're still looking at calculating the likelihood of the single ARG that we happened to traverse rather than the sum over all compatible ARGs --- I am reliably informed that this is a Hard Problem. We need all the intermediate nodes because the likelihood formula is in terms of ARG events, which we lose a lot of in the usual minimal tree sequence encoding. Being able to compute the likelihood of a given output of msprime.simulate would be a significant step forward though, at least opening up the possibility of full likelihood methods.
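For a rough sense of why every ARG event matters here: the likelihood is a product of per-event terms, one per waiting time. The following is a schematic with illustrative rate expressions (coalescence at rate k*(k-1)/2, recombination proportional to the amount of ancestral material), not the exact parameterisation msprime would use:

```python
import math

def log_event_term(k, total_links, rho, dt, event):
    """Schematic log-likelihood contribution of one ARG event: an
    exponential waiting time dt at the total event rate, plus the log
    rate of the particular event that occurred."""
    coal_rate = k * (k - 1) / 2       # k lineages coalescing pairwise
    rec_rate = rho * total_links      # recombination over ancestral links
    total_rate = coal_rate + rec_rate
    event_rate = coal_rate if event == "coalescence" else rec_rate
    return -total_rate * dt + math.log(event_rate)
```

Summing such terms over the full event history requires knowing every intermediate node and its time, which is exactly the information the minimal tree sequence encoding discards.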

algorithms.py Outdated
y.node = u
while y.prev is not None:
y = y.prev
self.store_edge(y.left, y.right, u, y.node)
Member

This will do the same thing, I think, and is simpler:

while y is not None:
    self.store_edge(y.left, y.right, u, y.node)
    y.node = u
    y = y.prev

Member Author

You're right. I'll tweak that, and fix the whitespace that CircleCI is complaining about.

@JereKoskela
Member Author

Wow, this is a nice idea. Ping @JuliaPalacios - we were talking about how it'd be nice to compute likelihoods, but we'd only get the likelihood of the particular ARG that gave the tree sequence, not the sum over all compatible ARGs. This fixes that problem, in a sense, if I understand correctly?

As Jerome said, this still only gives the likelihood of the particular ARG that was traversed. The extra nodes have been added so that the likelihood can be evaluated from the stored, augmented tree sequence, rather than just at runtime. Marginalising over all ARGs that are compatible with a given tree sequence means evaluating a high-dimensional integral over a complicated region of ARG-space, and I'd be surprised if it was much easier than obtaining the true sampling distribution outright (though I'd love to be wrong).

@jeromekelleher
Member

It's looking good Jere, but there's something not quite right with the coalescences I think. If I run

$ python3 algorithms.py 5 -r 0.025 arg.trees --full-arg
$ python3 algorithms.py 5 -r 0.025 noarg.trees

and then plot with the following code

import msprime

def draw_trees(ts):
    trees = [tree.draw(format="unicode").splitlines() for tree in ts.trees()]
    max_height = max(len(tree) for tree in trees)
    for j in range(max_height):
        line = ""
        for tree in trees:
            k = j - max_height + len(tree)
            if k < 0:
                line += " " * len(tree[0])
            else:
                line += tree[k]
            line += " | "
        print(line)

ts_arg = msprime.load("arg.trees")
ts_noarg = msprime.load("noarg.trees")

draw_trees(ts_noarg)
draw_trees(ts_arg.simplify())

I get

  12        |   11        |   11        |   11        |     10      |   10        |
┏━┻━┓       | ┏━┻━┓       | ┏━┻━┓       | ┏━┻━┓       |   ┏━┻━━┓    | ┏━┻━┓       |
┃   9       | ┃   9       | ┃   8       | ┃   10      |   ┃    8    | ┃   7       |
┃ ┏━┻━┓     | ┃ ┏━┻━┓     | ┃ ┏━┻━┓     | ┃ ┏━┻━┓     |   ┃   ┏┻┓   | ┃ ┏━┻━┓     |
┃ ┃   6     | ┃ ┃   6     | ┃ ┃   6     | ┃ ┃   6     |   6   ┃ ┃   | ┃ ┃   6     |
┃ ┃ ┏━┻┓    | ┃ ┃ ┏━┻┓    | ┃ ┃ ┏━┻┓    | ┃ ┃ ┏━┻┓    | ┏━┻┓  ┃ ┃   | ┃ ┃ ┏━┻┓    |
┃ ┃ ┃  5    | ┃ ┃ ┃  5    | ┃ ┃ ┃  5    | ┃ ┃ ┃  5    | ┃  5  ┃ ┃   | ┃ ┃ ┃  5    |
┃ ┃ ┃ ┏┻┓   | ┃ ┃ ┃ ┏┻┓   | ┃ ┃ ┃ ┏┻┓   | ┃ ┃ ┃ ┏┻┓   | ┃ ┏┻┓ ┃ ┃   | ┃ ┃ ┃ ┏┻┓   |
2 1 0 3 4   | 2 1 0 3 4   | 2 1 0 3 4   | 2 1 0 3 4   | 0 3 4 1 2   | 2 1 0 3 4   |
  11        |   11        |   11        |   12        |   12        |   10        |
┏━┻━┓       | ┏━┻━┓       | ┏━┻━┓       | ┏━┻━┓       | ┏━┻━┓       | ┏━┻━┓       |
┃   9       | ┃   8       | ┃   10      | ┃   10      | ┃   7       | ┃   7       |
┃ ┏━┻━┓     | ┃ ┏━┻━┓     | ┃ ┏━┻━┓     | ┃ ┏━┻━┓     | ┃ ┏━┻━┓     | ┃ ┏━┻━┓     |
┃ ┃   6     | ┃ ┃   6     | ┃ ┃   6     | ┃ ┃   6     | ┃ ┃   6     | ┃ ┃   6     |
┃ ┃ ┏━┻┓    | ┃ ┃ ┏━┻┓    | ┃ ┃ ┏━┻┓    | ┃ ┃ ┏━┻┓    | ┃ ┃ ┏━┻┓    | ┃ ┃ ┏━┻┓    |
┃ ┃ ┃  5    | ┃ ┃ ┃  5    | ┃ ┃ ┃  5    | ┃ ┃ ┃  5    | ┃ ┃ ┃  5    | ┃ ┃ ┃  5    |
┃ ┃ ┃ ┏┻┓   | ┃ ┃ ┃ ┏┻┓   | ┃ ┃ ┃ ┏┻┓   | ┃ ┃ ┃ ┏┻┓   | ┃ ┃ ┃ ┏┻┓   | ┃ ┃ ┃ ┏┻┓   |
2 1 0 3 4   | 2 1 0 3 4   | 2 1 0 3 4   | 2 1 0 3 4   | 2 1 0 3 4   | 2 1 0 3 4   |

If we simplify a full-arg tree sequence, we should get the same tree sequence back as running the same simulation without ARG recording (this must be true, right?). But we can see here that the topologies are different. In the first two trees, for example, we have different nodes as root, which can't be right.

I think this must be because we're not handling coalescences quite right. What do you think?

@JereKoskela
Member Author

You're right. It wouldn't surprise me if the id of the MRCA node were higher under --full-arg, but the differences in the patterns of MRCAs mean something must be going wrong.

Do the x and y returned in lines 833 and 835 always point to the leftmost segment carried by the removed ancestor? It looks like they do, and I had assumed that when writing my tweaks, but it would explain everything nicely if that were wrong.

@jeromekelleher
Member

Do the x and y returned in lines 833 and 835 always point to the leftmost segment carried by the removed ancestor? It looks like they do, and I had assumed that when writing my tweaks, but it would explain everything nicely if that were wrong.

Yes, it's always the left-most segment that's stored in the population container.

Looking through again it's not obvious what the problem is... hmmm.

@JereKoskela
Member Author

How about this: at the top of common_ancestor_event, we've run store_edges_right for both child lineages, which makes both of them point to the same node. I think that makes the z.node == alpha.node check in defrag_required on line 915 redundant. I don't immediately see whether that causes problems, but it does at least look like a genuine difference between the two settings.

@jeromekelleher
Member

How about this: at the top of common_ancestor_event, we've run store_edges_right for both child lineages, which makes both of them point to the same node. I think that makes the z.node == alpha.node check in defrag_required on line 915 redundant. I don't immediately see whether that causes problems, but it does at least look like a genuine difference between the two settings.

Isn't that what we're doing now? Sorry, I don't have time to look into this now, I'm checking out until after PopGroup. Happy holidays!

@JereKoskela
Member Author

Isn't that what we're doing now? Sorry, I don't have time to look into this now, I'm checking out until after PopGroup. Happy holidays!

No problem and likewise! I'll respond with a little more detail in preparation for January while this is fresh in my mind. I think I'm on the right track.

I made defrag_segment_chain print a statement at runtime, and it does get run noticeably more often under the --full-arg setting. That's because we collect segments to the same node every time there's a coancestry event, so that the z.right == alpha.left and z.node == alpha.node constraint on line 915 is much easier to satisfy. My understanding is that running or not running defrag_segment_chain shouldn't make a difference to output, but it seems that it does:

I ran algorithms.py with the same parameters you used, once as-is, and once with the call to defrag_segment_chain on line 922 commented out. Neither run used the --full-arg modifications. The one with defrag enabled gave the same output you saw,

  12        |   11        |   11        |   11        |     10      |   10        | 
┏━┻━┓       | ┏━┻━┓       | ┏━┻━┓       | ┏━┻━┓       |   ┏━┻━━┓    | ┏━┻━┓       | 
┃   9       | ┃   9       | ┃   8       | ┃   10      |   ┃    8    | ┃   7       | 
┃ ┏━┻━┓     | ┃ ┏━┻━┓     | ┃ ┏━┻━┓     | ┃ ┏━┻━┓     |   ┃   ┏┻┓   | ┃ ┏━┻━┓     | 
┃ ┃   6     | ┃ ┃   6     | ┃ ┃   6     | ┃ ┃   6     |   6   ┃ ┃   | ┃ ┃   6     | 
┃ ┃ ┏━┻┓    | ┃ ┃ ┏━┻┓    | ┃ ┃ ┏━┻┓    | ┃ ┃ ┏━┻┓    | ┏━┻┓  ┃ ┃   | ┃ ┃ ┏━┻┓    | 
┃ ┃ ┃  5    | ┃ ┃ ┃  5    | ┃ ┃ ┃  5    | ┃ ┃ ┃  5    | ┃  5  ┃ ┃   | ┃ ┃ ┃  5    | 
┃ ┃ ┃ ┏┻┓   | ┃ ┃ ┃ ┏┻┓   | ┃ ┃ ┃ ┏┻┓   | ┃ ┃ ┃ ┏┻┓   | ┃ ┏┻┓ ┃ ┃   | ┃ ┃ ┃ ┏┻┓   | 
2 1 0 3 4   | 2 1 0 3 4   | 2 1 0 3 4   | 2 1 0 3 4   | 0 3 4 1 2   | 2 1 0 3 4   |

while with defrag commented out I got

  11        |   10        |   10        |   10        | 
┏━┻━┓       | ┏━┻━┓       | ┏━┻━┓       | ┏━┻━┓       | 
┃   10      | ┃   8       | ┃   9       | ┃   7       | 
┃ ┏━┻━┓     | ┃ ┏━┻━┓     | ┃ ┏━┻━┓     | ┃ ┏━┻━┓     | 
┃ ┃   6     | ┃ ┃   6     | ┃ ┃   6     | ┃ ┃   6     | 
┃ ┃ ┏━┻┓    | ┃ ┃ ┏━┻┓    | ┃ ┃ ┏━┻┓    | ┃ ┃ ┏━┻┓    | 
┃ ┃ ┃  5    | ┃ ┃ ┃  5    | ┃ ┃ ┃  5    | ┃ ┃ ┃  5    | 
┃ ┃ ┃ ┏┻┓   | ┃ ┃ ┃ ┏┻┓   | ┃ ┃ ┃ ┏┻┓   | ┃ ┃ ┃ ┏┻┓   | 
2 1 0 3 4   | 1 2 0 3 4   | 2 1 0 3 4   | 2 1 0 3 4   |

I'm not quite sure what to make of that, but I'll try to work on finding out exactly why there's a difference.
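For reference, the defrag step under discussion can be sketched as follows. The names and the link-counting convention here are illustrative, not the msprime code itself; the second helper shows how merging abutting segments can change the number of candidate recombination links:

```python
def defrag_chain(segments):
    """Merge neighbouring segments when they abut and map to the same
    node (the z.right == alpha.left and z.node == alpha.node condition).
    segments: list of (left, right, node), ordered left to right."""
    out = []
    for seg in segments:
        if out and out[-1][1] == seg[0] and out[-1][2] == seg[2]:
            z = out.pop()
            seg = (z[0], seg[1], seg[2])
        out.append(seg)
    return out

def links(segments):
    """Candidate recombination links in a chain, counting
    right - left - 1 per segment (an illustrative convention)."""
    return sum(r - l - 1 for l, r, _ in segments)
```

Under this convention, defragging [0, 5) + [5, 9) into [0, 9) raises the link count from 7 to 8, so the internal representation affects which breakpoints can be drawn.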

@jeromekelleher
Member

That's weird, defrag_segment_chain shouldn't make any difference to the output as it's only cleaning up the internal representation of the state...

Ah, but it does change the random trajectory of the simulation because of the way that recombination breakpoints are chosen, so you can't deterministically compare outcomes with and without defragmentation.

@JereKoskela
Member Author

Yes, that was the conclusion I came to eventually. It also means that we can't deterministically compare simplified trees from the --full-arg run with the corresponding simulation without full arg recording. Can you think of another way to try to validate the changes @jeromekelleher? The distribution of all graph-functionals should be the same, so any previous validation you've done should transfer over easily as long as it can handle redundant nodes in the tree sequence.

@jeromekelleher
Member

Yes, that was the conclusion I came to eventually. It also means that we can't deterministically compare simplified trees from the --full-arg run with the corresponding simulation without full arg recording. Can you think of another way to try to validate the changes @jeromekelleher? The distribution of all graph-functionals should be the same, so any previous validation you've done should transfer over easily as long as it can handle redundant nodes in the tree sequence.

Aha, yes, of course it does. Sorry, I sent you down a rabbit hole there!

I think the best thing to do is probably push the implementation into C and then run our tests to verify the statistical properties (if the algorithm is wrong at the moment, it's not by very much, so there won't be any wasted effort). Usually the next step is to add the option to the main.c interface and get stuff working in C and then finally update the Python/C interface to percolate the option up to the top level. I'm happy to do some of this if you like, but it should be fairly straightforward if you want to plug ahead with it. Let me know how you'd like to proceed.

@JereKoskela
Member Author

Yes, that was the conclusion I came to eventually. It also means that we can't deterministically compare simplified trees from the --full-arg run with the corresponding simulation without full arg recording. Can you think of another way to try to validate the changes @jeromekelleher? The distribution of all graph-functionals should be the same, so any previous validation you've done should transfer over easily as long as it can handle redundant nodes in the tree sequence.

Aha, yes, of course it does. Sorry, I sent you down a rabbit hole there!

I think the best thing to do is probably push the implementation into C and then run our tests to verify the statistical properties (if the algorithm is wrong at the moment, it's not by very much, so there won't be any wasted effort). Usually the next step is to add the option to the main.c interface and get stuff working in C and then finally update the Python/C interface to percolate the option up to the top level. I'm happy to do some of this if you like, but it should be fairly straightforward if you want to plug ahead with it. Let me know how you'd like to proceed.

No problem - tracking down what was going on helped with my understanding of the algorithm, which can only be a good thing in the long run. In the same vein, I'll try to incorporate the same tweaks into main.c (or more specifically msprime.c, at a quick glance), and push them into this PR. I'm clueless about Python/C interfaces, so will probably need your help at that stage.

@jeromekelleher
Member

The algorithm changes go into msprime.c. See the developer docs for information on main.c --- although some of this is out of date as I'm still extracting tskit.

@jeromekelleher
Member

Looks like this has got into a bit of a mess (because of all the churn on the msprime side). Do you mind if I jump in and clear things up @JereKoskela?

@JereKoskela
Member Author

Not at all. I made a mess of this update all by myself as well. I think I've now undone all of the new damage that I caused, and also uploaded everything I meant to upload into the place to which I meant to upload it. If you can rescue the rest, I'd be very grateful.

@jeromekelleher
Member

OK, cool. I have another PR to merge in first and I'll fix this up then; should be sorted in a couple of hours.

JereKoskela and others added 5 commits January 16, 2019 13:11
Fixed issue where nodes with multiple segments caused missing edges in the ARG.
Shortened some while loops in store_edges_left and store_edges_right, and tidied whitespace.
@JereKoskela
Member Author

Thanks @jeromekelleher! It'll take me some time to get an understanding of the overall changes (I've never heard of meson or ninja before), but all of the commits I had made are still present and look as I intended them to look.

@JereKoskela
Member Author

I had a chance to read the docs today, and managed to compile and run the new version. I've not been able to break it, or generate output that looks broken to the naked eye, so I would say that it's ready for some statistical tests. I'm not sure how to run those from the C implementation (as far as I can tell the developer docs only cover validation.py). Is that something you can sort out @jeromekelleher?

@jeromekelleher
Member

Excellent, thanks @JereKoskela. Yep, I'll wire this up with the Python API ASAP.

@jeromekelleher
Member

OK, I've added the basic plumbing here so you can now call msprime.simulate(...record_full_arg=True). I also added the flags to allow us to easily identify RE nodes --- see the new test case for this in action. On reflection, the flag for the CA nodes is pointless because it's a mutually exclusive category (nodes are either RE or CA). I guess it would be useful to identify nodes in which there was no coalescence, as these will come in handy at some point?

Anyway, it's all looking good. I think the first step in terms of running evaluations is to run some simple simulations to compare the distributions of the number of nodes and edges when we (a) run a normal sim and (b) run with record_full_arg=True and then simplify the result.
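The (a) vs (b) comparison amounts to a two-sample test on the node and edge count distributions. Here's a minimal pure-Python sketch of such a check; only the statistic is shown, with placeholder samples, and the real counts would come from the simulations described above:

```python
def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of samples a and b."""
    xs = sorted(set(a) | set(b))
    d = 0.0
    for x in xs:
        fa = sum(v <= x for v in a) / len(a)
        fb = sum(v <= x for v in b) / len(b)
        d = max(d, abs(fa - fb))
    return d

# Identical samples give statistic 0; fully separated samples give 1.
```

In practice one would feed in the per-replicate num_nodes and num_edges values from the normal runs and the simplified record_full_arg runs, and check the statistic is small.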

We should get some low-level test coverage on this too (tests/tests.c). We just want something simple like test_single_locus_simulation that exercises the ARG code.

@jeromekelleher
Member

Oh yes, before I forget: it would be good to update the msp_merge_ancestors method also, so that we can deal with non-binary events properly. I know we don't need it for evaluating the likelihood, but if we leave it out it'll be a nasty bug waiting for someone in a few years when they want to get full ARGs for a multiple-merger coalescent or instantaneous bottleneck model.

@JereKoskela
Member Author

Oh yes, before I forget: it would be good to update the msp_merge_ancestors method also, so that we can deal with non-binary events properly. I know we don't need it for evaluating the likelihood, but if we leave it out it'll be a nasty bug waiting for someone in a few years when they want to get full ARGs for a multiple-merger coalescent or instantaneous bottleneck model.

Good point - I've added similar store_full_arg settings to it as already implemented in msp_merge_two_ancestors. I also added a store_full_arg run to the existing test_multi_locus_simulation test, rather than writing a whole new one. I have no experience with unit testing, so let me know if this is somehow bad practice. I note that it hasn't really improved the codecov scores, which leads me to believe that something may not work like I think it does.

I've also written a quick python script to test the node and edge count distributions, and the output looks good. I'll email you those directly rather than clutter the branch with extra files.

@jeromekelleher
Member

Excellent, thanks @JereKoskela. I'll make a polishing pass here and add in your test to verification.py, but this is basically done now. Great work, this is a really neat feature!

@jeromekelleher
Member

OK, I've made a pass through this @JereKoskela and I think improved it a bit. The main idea is to pull the ARG edge recording into a single add_arg_edges function to simplify things, and then use it to record the ARG edges after the common ancestor event is done. This cuts down the code a bit and makes it easier to see what's happening.

The other difference is that I've made the semantics of the CA_EVENT flag mark those nodes that don't result in coalescence. These can only occur in the ARG, so it's handy to be able to track them.

Would you mind taking one last look through here and see if it looks OK to you please?

One last question also: is the reason we need two nodes to make a recombination event that we'll squash together the edges for the left and right hand sides, and so we won't be able to tell which is the RHS and which is the LHS? If so, does this actually matter? It seems a bit redundant to record this event twice, and I'm just wondering what we actually get from this.

Disabled ARG recording for multiple merger coalescents as there is a
problem with 'cancelling' events in this context. It can be done easily
enough, but doesn't seem worth doing right now.
@JereKoskela
Member Author

Thanks @jeromekelleher. I agree that the new code is simpler. I'm not sure it's quite right though. For example, it looks to me like segments in which ancestral material merges will get stored twice when storing the full ARG: first by the msp_store_edge calls in the main loop, which used to be contained in if (!store_full_arg) statements but no longer are, and then again by the msp_store_arg_edges call at the end of the msp_merge_two_ancestors function. I'll have a play with this to make sure.

Your question about needing two nodes for recombinations is a good one. I implemented it as a split into two nodes because that's what happens in the ARG, but you're right that having one node to signify the time of the recombination event might well be enough. Another point on which I'll have to get back to you, I think.

@jeromekelleher
Member

Thanks @jeromekelleher. I agree that the new code is simpler. I'm not sure it's quite right though. For example, it looks to me like segments in which ancestral material merges will get stored twice when storing the full ARG: first by the msp_store_edge calls in the main loop, which used to be contained in if (!store_full_arg) statements but no longer are, and then again by the msp_store_arg_edges call at the end of the msp_merge_two_ancestors function. I'll have a play with this to make sure.

I think this is OK --- the store_arg_edges function checks if the segment's node is already set to the target node (which happens for coalesced segments in the coalescence=true case) and skips these. This avoids the double-storing problem.
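The skip logic described here can be sketched in a few lines of Python. This is an illustrative toy under assumed names, not msprime's actual store_arg_edges implementation: the `Segment` class and the tuple-based edge table are inventions of the sketch.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    left: float
    right: float
    node: int

def store_arg_edges(segments, target_node, edge_table):
    # Record an edge (left, right, parent, child) for each segment that
    # does not already point at target_node, then repoint the segment.
    # Segments whose node is already target_node (e.g. coalesced segments
    # handled by the main merge loop) are skipped, which is what avoids
    # the double-storing problem discussed above.
    for seg in segments:
        if seg.node != target_node:
            edge_table.append((seg.left, seg.right, target_node, seg.node))
            seg.node = target_node
    return edge_table
```

For example, merging two segments whose nodes are 3 and 7 into target node 7 records one edge for the first segment and silently skips the second.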

Removed a stray printf from msp_merge_ancestors
@JereKoskela
Member Author

> Thanks @jeromekelleher. I agree that the new code is simpler. I'm not sure it's quite right, though. For example, it looks to me like segments in which ancestral material merges will get double-stored when storing the full ARG: first by the msp_store_edge calls in the main loop, which used to be contained in if (!store_full_arg) statements but no longer are, and then again by the msp_store_arg_edges call at the end of the msp_merge_two_ancestors function. I'll have a play with this to make sure.

> I think this is OK --- the store_arg_edges function checks if the segment's node is already set to the target node (which happens for coalesced segments in the coalescence=true case) and skips these. This avoids the double-storing problem.

Yeah, it turns out everything was correct. It was just much more compact than what I had written, so I needed a while to spot where everything happened. I found one stray printf statement in msp_merge_ancestors, which I've removed. Everything looks good to me now.

On the subject of whether we need two nodes to store a recombination event, as it stands the answer is yes. The reason is that a single node wouldn't store the location of the recombination, and hence we wouldn't be able to track the number of extant segments in each state for likelihood evaluation. In most cases we could infer where the recombination took place from subsequent coalescence events, but not if the two recombinant lineages merge back together before anything else happens. Multiple recombinations along the same lineage before segments merge with anyone else would also cause tracking problems. We could probably replace the second node by some kind of richer label that tells us where the recombination happened, but we can't just remove the second node outright.
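The two-node bookkeeping described above can be sketched as follows. This is a hypothetical illustration of why the second node carries information, not msprime's code: the function and variable names are invented, and nodes and edges are plain tuples.

```python
def record_recombination(segments, breakpoint, time, next_node, nodes, edges):
    # A recombination creates two parent nodes at the same time: material
    # to the left of the breakpoint attaches to one parent, material to
    # the right attaches to the other. The split between the two parents
    # is what records where the recombination happened, even if the two
    # lineages merge straight back together.
    left_parent, right_parent = next_node, next_node + 1
    nodes.append((left_parent, time))
    nodes.append((right_parent, time))
    for left, right, child in segments:
        if right <= breakpoint:
            edges.append((left, right, left_parent, child))
        elif left >= breakpoint:
            edges.append((left, right, right_parent, child))
        else:
            # A segment spanning the breakpoint is split between parents.
            edges.append((left, breakpoint, left_parent, child))
            edges.append((breakpoint, right, right_parent, child))
    return left_parent, right_parent
```

Collapsing the two parents into a single node would lose the breakpoint whenever the two recombinant lineages coalesce again before any other event, which is exactly the failure case described above.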

@jeromekelleher
Member

> On the subject of whether we need two nodes to store a recombination event, as it stands the answer is yes. The reason is that a single node wouldn't store the location of the recombination, and hence we wouldn't be able to track the number of extant segments in each state for likelihood evaluation. In most cases we could infer where the recombination took place from subsequent coalescence events, but not if the two recombinant lineages merge back together before anything else happens. Multiple recombinations along the same lineage before segments merge with anyone else would also cause tracking problems. We could probably replace the second node by some kind of richer label that tells us where the recombination happened, but we can't just remove the second node outright.

I see... One thing we could do is record the location of the recombination breakpoint using the node metadata. We might need to do this anyway if the exact breakpoint is needed as we won't know this with the two-node scheme if the breakpoint falls between two segments. Is this important information, or can we forget about it?

There are clearly pros and cons in both directions. For me, the question comes down to whether one or two nodes is the right way to think about it. A node represents a distinct individual who lived at a specific time. Does a recombination event produce two distinct individuals? (I think so, right?)

@JereKoskela
Member Author

> I see... One thing we could do is record the location of the recombination breakpoint using the node metadata. We might need to do this anyway if the exact breakpoint is needed, as we won't know this with the two-node scheme if the breakpoint falls between two segments. Is this important information, or can we forget about it?

That level of detail we can forget about. We need to know i) which segments of ancestral material actually got separated, and ii) how many links there are that would separate ancestral material in the event of a recombination. Both of those things can be recovered from the present two-node approach, but would be lost in some cases with a single node and no added metadata.
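The link count in point ii) can be sketched like this, assuming the usual continuous-genome convention that a lineage's recombination rate is proportional to the span between its leftmost and rightmost ancestral coordinates (that convention is an assumption of this sketch, not something stated in the thread).

```python
def num_links(segments):
    # segments: list of (left, right) intervals of ancestral material
    # carried by one lineage. Any breakpoint strictly inside the span
    # from the leftmost to the rightmost coordinate would separate
    # ancestral material, so the span is this lineage's link count.
    return max(right for _, right in segments) - min(left for left, _ in segments)
```

A lineage carrying [0, 1) and [3, 5) has 5 links: a breakpoint anywhere in (0, 5) splits its ancestral material, including breakpoints in the non-ancestral gap (1, 3).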

> Does a recombination event produce two distinct individuals? (I think so, right?)

Yes, in the model the two recombinant segments are inherited from different ancestors.

@jeromekelleher
Member

OK, great, let's merge this then. Thanks @JereKoskela!

@jeromekelleher jeromekelleher merged commit 0e4c8d8 into tskit-dev:master Jan 30, 2019
@JereKoskela JereKoskela deleted the likelihood_evaluation branch February 6, 2019 10:29
@hyanwong
Member

hyanwong commented Dec 6, 2021

For posterity, Jere now says "I'm not opposed to the single node approach. That's almost certainly what I'd do if I was setting this up from scratch." and (with respect to the comment about single nodes not storing the location of the recombination) "my comment is not informative of anything except that I hadn't yet appreciated what we could store with edges as well as nodes."
