Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

C version of minimise. #545

Merged
merged 8 commits into from
Aug 9, 2018
Merged

Conversation

jeromekelleher
Copy link
Member

Experimental. Not to be merged.

@jeromekelleher jeromekelleher force-pushed the minimise-c branch 4 times, most recently from deca4d1 to 6b6a49c Compare July 25, 2018 11:31
@jeromekelleher
Copy link
Member Author

I've updated simplify here to include a new parameter minimise_topology which discards any topology that isn't relevant to the sites. This is quite a useful operation in general I think.

What do you think of the implementation @petrelharp? (Ignore the C code; the implementation is in the simplify.py file.) What about the API, does this new parameter capture the idea well?

I can implement the C side of things pretty easily if we like the general organisation.

You were spot-on about this one @petrelharp: you are definitely the master of simplify! Other than a bit of fiddling around at the boundaries, the implementation was exactly as you said.

@petrelharp
Copy link
Contributor

I like the word "minimise", but it's not connoting this operation to me, exactly; "minimise topology" makes me think of other things. Maybe subset_sites? Or only_site_topology, since it's extracting the tree topologies at only the sites? Or only_site_trees?

We'll also want to retain trees at a list of sites that aren't necessarily mutated (the integers); how do you imagine that working? We could pass in a list of such positions, but maybe that's overly general; we could have this plus the "integer positions" use case, adding another option, only_integer_position_trees, and avoid needing to pass in a list of all integers in a possibly long genome.

(looking at the code now)

@jeromekelleher
Copy link
Member Author

jeromekelleher commented Jul 25, 2018

We'll also want to retain trees at a list of sites that aren't necessarily mutated (the integers); how do you imagine that working? We could pass in a list of such positions, but maybe that's overly general; we could have this plus the "integer positions" use case, adding another option, only_integer_position_trees, and avoid needing to pass in a list of all integers in a possibly long genome.

It's possible to do this by adding in a bunch of sites that have no mutations, and then running simplify:

ts = msprime.simulate(n, length=L)
tables = ts.dump_tables()
for j in range(L):
    tables.sites.add_row(j, "0")
tables.simplify(ts.samples(), site_topology=True) # Or whatever we call it
ts = tables.tree_sequence() # ts now has integer positions only

I'm slightly wary about decoupling this from sites because it becomes a lot harder then. For example, if we were to provide another parameter positions which specify where we want topologies, then what happens to sites that are not at these positions? We can't just shift them up or down to the nearest position because the mutations nodes may not subtend the same set of samples at that point. It gets horribly messy if we try to fix that.

I find it a bit unpleasant that we have to use zero-mutation sites to do this, but on the other hand it works very cleanly.

@petrelharp
Copy link
Contributor

It's possible to do this by adding in a bunch of sites that have no mutations, and then running simplify:

That's nice and simple. But, will this be terribly unwieldy to get integer positions for genomes of size 1e9?

@@ -159,20 +157,44 @@ def record_node(self, input_id, is_sample=False):
self.node_id_map[input_id] = output_id
return output_id

def unrecord_node(self, input_id, output_id):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rewind_node might be a better name because this only works on the last node recorded.

@jeromekelleher
Copy link
Member Author

That's nice and simple. But, will this be terribly unwieldy to get integer positions for genomes of size 1e9?

Yes it certainly would. Why would you want to do that though? I can see why someone would want only the topology at their sites, but what difference does it make if the underlying tree sequence has edges at integer locations? If they do care, then shouldn't this have been simulated under a discrete loci model in the first place?

@petrelharp
Copy link
Contributor

Note:

for j in range(L):
    tables.sites.add_row(j, "0")

won't keep the sites sorted by position, which the documentation says is required for simplification. It'd be straightforward to loop through and make a new site table containing those sites also, though.

if left_index == 1:
left_index = 0
left = X[left_index]
right = X[right_index]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice. Just saying in words, because it took me a while to verify it's doing the right thing: This is replacing left and right with the smallest site position greater than or equal to them, i.e., slides each endpoint of an interval to the right until they hit a site position.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see what the or (left_index == 0 and right_index == 1) is for, though? Oh, I see: does X start with two copies of 0.0?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To include an option to retain trees at integer sites also, I think we could change this to:

        if self.minimise_topology:
            X = self.edge_position_table
            left_index = np.searchsorted(X, left)
            right_index = np.searchsorted(X, right)
            # because there are two copies of 0.0 at the start of X
            if left_index == 1:
                left_index = 0
            left_pos = X[left_index]
            right_pos = X[right_index]
            if keep_integer_trees:
                left_pos = min(left_pos, np.ceil(left))
                right_pos = min(right_pos, np.ceil(right))
            if left_pos == right_pos:
                return
            left = left_pos
            right = right_pos

@petrelharp
Copy link
Contributor

Very nice and simple. I think the "keep integer sites also" (so, keep both integer positions and ones in the site table) might be easy and useful enough to include.

Just to clarify: I'm thinking the integer positions option would be useful in preparing tree sequences for integer-site-based software, like SLiM. SLiM won't do anything wrong if you give it a tree sequence file with noninteger recombination breakpoints, at present, but it does seem like it'd avoid future headaches.

@petrelharp
Copy link
Contributor

will this be terribly unwieldy to get integer positions for genomes of size 1e9?

Yes it certainly would. Why would you want to do that though? I can see why someone would want only the topology at their sites, but what difference does it make if the underlying tree sequence has edges at integer locations? If they do care, then shouldn't this have been simulated under a discrete loci model in the first place?

Simulating under a continuous model and then restricting to discrete sites isn't wrong; it's equivalent to a discrete model (with discrete per-site recombination rate equal to exp(-r) * sinh(r), where r is the continuous per-site rate, IIRC).

I don't have anything concrete for what headaches this would avoid, though. @bhaller?

@bhaller
Copy link

bhaller commented Jul 26, 2018

Hmm, well, you tagged me here Peter, but honestly I have no idea what y'all are talking about. :-> The fact that I'm reading this after driving for 11 hours straight and then eating a pizza and drinking two beers might have something to do with that, but actually I suspect I wouldn't know what you were talking about anyway. :-> If you need an opinion from me in the near term you'll need to explain some stuff to me (what is this about restricting to discrete sites? how is "minimisation" different from simplification? this is the first I've heard about any of this I think). Otherwise, catch me up once I'm in Eugene. :->

@jeromekelleher
Copy link
Member Author

OK, thanks @petrelharp. I think we can keep the door open for including mapping to the integers if it turns out that we do need it, but for now I can't see a pressing need. It's an easy update to make, and I'm perfectly happy to do it if it has a genuine use case.

@jeromekelleher
Copy link
Member Author

How about reduce_to_site_topology @petrelharp? This is wordy, but descriptive and reasonably unambiguous? We can add reduce_to_integer_topology pretty naturally then afterwards. I've added a couple of comments to hopfully clarify the mapping process.

@petrelharp
Copy link
Contributor

Sounds good. I also now understand the wierdness with zero now.

@jeromekelleher
Copy link
Member Author

OK, merging this. Thanks for your help @petrelharp!

@jeromekelleher jeromekelleher merged commit 55255b1 into tskit-dev:master Aug 9, 2018
@jeromekelleher jeromekelleher deleted the minimise-c branch August 9, 2018 14:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants