-
Notifications
You must be signed in to change notification settings - Fork 84
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
C version of minimise. #545
Conversation
deca4d1
to
6b6a49c
Compare
6b6a49c
to
0d2527c
Compare
I've updated simplify here to include a new parameter What do you think of the implementation @petrelharp? (Ignore the C code; the implementation is in the simplify.py file.) What about the API, does this new parameter capture the idea well? I can implement the C side of things pretty easily if we like the general organisation. You were spot-on about this one @petrelharp: you are definitely the master of simplify! Other than a bit of fiddling around at the boundaries, the implementation was exactly as you said. |
I like the word "minimise", but it's not connoting this operation to me, exactly; "minimise topology" makes me think of other things. Maybe We'll also want to retain trees at a list of sites that aren't necessarily mutated (the integers); how do you imagine that working? We could pass in a list of such positions, but maybe that's overly general; we could have this plus the "integer positions" use case, adding another option, (looking at the code now) |
It's possible to do this by adding in a bunch of sites that have no mutations, and then running simplify: ts = msprime.simulate(n, length=L)
tables = ts.dump_tables()
for j in range(L):
tables.sites.add_row(j, "0")
tables.simplify(ts.samples(), site_topology=True) # Or whatever we call it
ts = tables.tree_sequence() # ts now has integer positions only I'm slightly wary about decoupling this from sites because it becomes a lot harder then. For example, if we were to provide another parameter I find it a bit unpleasant that we have to use zero-mutation sites to do this, but on the other hand it works very cleanly. |
That's nice and simple. But, will this be terribly unwieldy to get integer positions for genomes of size 1e9? |
tests/simplify.py
Outdated
@@ -159,20 +157,44 @@ def record_node(self, input_id, is_sample=False): | |||
self.node_id_map[input_id] = output_id | |||
return output_id | |||
|
|||
def unrecord_node(self, input_id, output_id): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rewind_node
might be a better name because this only works on the last node recorded.
Yes it certainly would. Why would you want to do that though? I can see why someone would want only the topology at their sites, but what difference does it make if the underlying tree sequence has edges at integer locations? If they do care, then shouldn't this have been simulated under a discrete loci model in the first place? |
Note:
won't keep the sites sorted by position, which the documentation says is required for simplification. It'd be straightforward to loop through and make a new site table containing those sites also, though. |
tests/simplify.py
Outdated
if left_index == 1: | ||
left_index = 0 | ||
left = X[left_index] | ||
right = X[right_index] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice. Just saying in words, because it took me a while to verify it's doing the right thing: This is replacing left
and right
with the smallest site position greater than or equal to them, i.e., slides each endpoint of an interval to the right until they hit a site position.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see what the or (left_index == 0 and right_index == 1)
is for, though? Oh, I see: does X
start with two copies of 0.0
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To include an option to retain trees at integer sites also, I think we could change this to:
if self.minimise_topology:
X = self.edge_position_table
left_index = np.searchsorted(X, left)
right_index = np.searchsorted(X, right)
# because there are two copies of 0.0 at the start of X
if left_index == 1:
left_index = 0
left_pos = X[left_index]
right_pos = X[right_index]
if keep_integer_trees:
left_pos = min(left_pos, np.ceil(left))
right_pos = min(right_pos, np.ceil(right))
if left_pos == right_pos:
return
left = left_pos
right = right_pos
Very nice and simple. I think the "keep integer sites also" (so, keep both integer positions and ones in the site table) might be easy and useful enough to include. Just to clarify: I'm thinking the integer positions option would be useful in preparing tree sequences for integer-site-based software, like SLiM. SLiM won't do anything wrong if you give it a tree sequence file with noninteger recombination breakpoints, at present, but it does seem like it'd avoid future headaches. |
Simulating under a continuous model and then restricting to discrete sites isn't wrong; it's equivalent to a discrete model (with discrete per-site recombination rate equal to I don't have anything concrete for what headaches this would avoid, though. @bhaller? |
Hmm, well, you tagged me here Peter, but honestly I have no idea what y'all are talking about. :-> The fact that I'm reading this after driving for 11 hours straight and then eating a pizza and drinking two beers might have something to do with that, but actually I suspect I wouldn't know what you were talking about anyway. :-> If you need an opinion from me in the near term you'll need to explain some stuff to me (what is this about restricting to discrete sites? how is "minimisation" different from simplification? this is the first I've heard about any of this I think). Otherwise, catch me up once I'm in Eugene. :-> |
OK, thanks @petrelharp. I think we can keep the door open for including mapping to the integers if it turns out that we do need it, but for now I can't see a pressing need. It's an easy update to make, and I'm perfectly happy to do it if it has a genuine use case. |
How about |
Sounds good. I also now understand the wierdness with zero now. |
OK, merging this. Thanks for your help @petrelharp! |
Experimental. Not to be merged.