Mutation map #711

jeromekelleher · 2019-02-19T18:27:56Z

Closes #710.

@petrelharp, I started thinking about this and I realise I was over complicating things. Let's just do it the simplest reasonable way first and see how it performs after.

tests/test_mutations.py

petrelharp · 2019-02-20T17:31:06Z

Gee, looks just right. Good thinking.

codecov · 2019-02-21T13:04:55Z

Codecov Report

Merging #711 into master will increase coverage by 0.07%.
The diff coverage is 91.18%.

@@            Coverage Diff            @@
##           master    #711      +/-   ##
=========================================
+ Coverage   87.43%   87.5%   +0.07%     
=========================================
  Files          13      14       +1     
  Lines        7902    8045     +143     
  Branches     1628    1648      +20     
=========================================
+ Hits         6909    7040     +131     
- Misses        499     505       +6     
- Partials      494     500       +6

Flag	Coverage Δ
#C	`87.5% <91.18%> (+0.07%)`	⬆️
#python	`97.7% <100%> (+0.01%)`	⬆️

Impacted Files	Coverage Δ
lib/msprime.c	`84.02% <100%> (-0.13%)`	⬇️
msprime/cli.py	`97.81% <100%> (ø)`	⬆️
msprime/simulations.py	`97.88% <100%> (ø)`	⬆️
msprime/mutations.py	`100% <100%> (ø)`	⬆️
lib/util.c	`97.56% <100%> (+0.21%)`	⬆️
_msprimemodule.c	`84.24% <85.26%> (+0.1%)`	⬆️
lib/mutgen.c	`82.47% <89.13%> (+2.72%)`	⬆️
lib/interval_map.c	`94% <94%> (ø)`
lib/recomb_map.c	`96.93% <97.87%> (+0.64%)`	⬆️
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 022fdb2...de191d5. Read the comment docs.

jeromekelleher · 2019-02-21T13:06:06Z

OK, that's most of it done. Just need to update the high-level code and test now.

The only change here is that the map is defined by two equal length arrays now. There's no point in specifying the end value, as it just makes things tricky.

jeromekelleher · 2020-03-22T18:42:45Z

@daniel-goldstein, I'd appreciate a quick look over this please. It's a WIP where I'm refactoring the recombination map class to make use of a interval_map_t. We want the interval_map because there are other values we want to associate with intervals (here, mutation rates; the idea here is to add variable mutation rates along the genome).

What do you think of the basic setup? I think the plan now is to make a high-level Pythonic interface for the IntervalMap class and to give it some useful methods. We then refactor the RecombinationMap class to use this (not quite sure how yet), but maintaining the old interface for compatibility.

I'm also changing my mind about insisting the arrays are the same length in Python - really, the values array should be one shorter, and the C code should never look at this value.

cc also @grahamgower - I know you were thinking about #902. I think this PR should also close that issue (or at least plan to with a concrete follow up).

lib/util.h

grahamgower · 2020-03-23T07:25:40Z

This is very cool. One thing that I had in mind for #902 is that it would be nice for the python interface to do zero memory allocations/copies when getting the rate and position properties of the RecombinationMap---just return a numpy array view of the underlying memory. I didn't think it through fully due to lack of familiarity with the code, but it probably only needs small changes to memory ownership semantics in the C interface.

jeromekelleher · 2020-03-23T08:23:33Z

This is very cool.

Thanks!

One thing that I had in mind for #902 is that it would be nice for the python interface to do zero memory allocations/copies when getting the rate and position properties of the RecombinationMap---just return a numpy array view of the underlying memory. I didn't think it through fully due to lack of familiarity with the code, but it probably only needs small changes to memory ownership semantics in the C interface.

Ah, yes, now is this tricky. I've thought about doing it before (for table columns in tskit) and ended up giving up and just returning copies. In principle, we should be able to use this method and set the base member to the IntervalMap instance so that ref counting works properly. There are some worries about then changing the shape of the array (or something) and then breaking underlying C code, but maybe it can be set to a read-only-ish mode. It would be handy to directly access the underlying arrays when manipulating these maps...

I might have another go at it, it's good idea.

jeromekelleher · 2020-03-23T08:24:42Z

Ah yes, now I remember why I didn't do this for Table columns. This is because they are dynamic, and the column memory can be realloced, therefore making the memory pointed to by the numpy array invalid. We don't have this problem here though.

benjeffery · 2020-03-23T09:57:06Z

The column memory can be realloced, therefore making the memory pointed to by the numpy array invalid.

Could you get round this by supporting a callback that the C API calls with the new memory address when ever it re-allocates? Would clearly add complexity - do you have a handle on how much copies are a perf issue?

jeromekelleher · 2020-03-23T10:11:12Z

Could you get round this by supporting a callback that the C API calls with the new memory address when ever it re-allocates? Would clearly add complexity - do you have a handle on how much copies are a perf issue?

Copies are totally not a perf issue, which is mainly why I decided to not worry about zero-copy semantics for this stuff. It's more about nice APIs really than perf. We'd like to be able to change the data in place sometimes, and modifying the array is much nicer.

The realloc stuff isn't an issue here though as the size is fixed, that's a tskit Tables API concern.

jeromekelleher · 2020-03-23T11:01:10Z

I started messing around with zero copy stuff there and realised there's actually not much point in this case. The gains are pretty minimal and we'd have to do a lot of work to make sure that (e.g.) the user couldn't modify the position array while it was in use by the mutation generator. Taking copies is safe and simple. Having a good high-level API is more important here than a super-efficient low-level one.

jeromekelleher · 2020-03-23T12:02:20Z

I think the thing to do here is to merge this PR as soon as possible and to follow up with more work on improving the high-level API and testing out the mutation mapping code. There's quite a big diff here, so it would be good to get it merged now so we can keep other PRs going through.

I've added #920 as a starting follow-up where we add the create the high-level interval map tools and change the semantics to something more sensible.

daniel-goldstein · 2020-03-23T13:11:56Z

I'd appreciate a quick look over this please. It's a WIP where I'm refactoring the recombination map class to make use of a interval_map_t. We want the interval_map because there are other values we want to associate with intervals (here, mutation rates; the idea here is to add variable mutation rates along the genome).

@jeromekelleher Sounds good to me. I wonder though if interval_map is a bit too broad to benefit fully from the abstraction for recombination/mutation maps.

It makes sense the way recomb_map is currently refactored because there's no promise that interval_map.value is a rate. But a mutation map would also treat it as such and I think would then duplicate some core functionality like expected # of mutations/recombinations within interval [start, end) or up to position x. I don't know much about how mutation maps are used differently than recombination maps, but to me they seem like they could be thin skins around an interval_rate_map, with recombination map doing a little extra work surrounding loci/discrete space etc. Something like an interval_rate_map could then manage position, rate, and cumulative_value, and could give the useful method of position-at-value.

I'm also changing my mind about insisting the arrays are the same length in Python - really, the values array should be one shorter, and the C code should never look at this value.

I am also a fan of nixing the last entry on the values array. I think it's no more mental overhead to make sure it's one shorter than make sure they're the same length and ends in zero.

I started messing around with zero copy stuff there and realised there's actually not much point in this case. The gains are pretty minimal and we'd have to do a lot of work to make sure that (e.g.) the user couldn't modify the position array while it was in use by the mutation generator. Taking copies is safe and simple. Having a good high-level API is more important here than a super-efficient low-level one.

Worth noting that the C simulator copies its recomb_map handed to it from python, and does some mutation translating model rates, so we'd have to be careful with views into the arrays. I haven't seen any perf reasons not to copy, so I think this sounds like the best move right now.

jeromekelleher · 2020-03-23T13:59:33Z

Thanks @daniel-goldstein, I was wondering myself about an interval_rate_map. Let's follow this up in #920, once this is merged.

petrelharp · 2020-03-24T13:20:00Z

Wow, this is great. And: Yes!! Please make values one shorter!!! That eliminates uncertainty about which number goes with which interval, a serious source of bugs.

petrelharp · 2020-03-24T13:24:54Z

lib/interval_map.c

+
+    fprintf(out, "interval_map (%p):: size = %d\n", (void *) self, (int) self->size);
+    fprintf(out, "\tsequence_length = %f\n", interval_map_get_sequence_length(self));
+    fprintf(out, "\tindex\tposition\tvalue\n");


Shouldn't this be left, right, value?

petrelharp · 2020-03-24T13:27:43Z

lib/interval_map.c

+}
+
+size_t
+interval_map_get_size(interval_map_t *self)


Maybe num_breakpoints instead of size?

Good idea. Will follow up in #920

petrelharp

I read over this. Looks good! We might do the mutation drawing differently if we expect a lot of small rate changes per edge, but it's good for now.

- Also improve the API.

jeromekelleher · 2020-03-24T14:02:15Z

OK, thanks @petrelharp. I'm going to merge this and then pick it up again in #920 to tidy up the rough edges on the interval map stuff.

jeromekelleher commented Feb 19, 2019

View reviewed changes

tests/test_mutations.py Outdated Show resolved Hide resolved

jeromekelleher force-pushed the mutation-map branch from 750a529 to f72c31c Compare February 19, 2019 18:32

jeromekelleher force-pushed the mutation-map branch from 89a935a to bd2f8f7 Compare February 21, 2019 13:29

jeromekelleher mentioned this pull request Feb 21, 2019

More general mutation models #553

Closed

jeromekelleher force-pushed the mutation-map branch 4 times, most recently from 720bf7d to 3c1099d Compare March 22, 2020 15:55

jeromekelleher commented Mar 22, 2020

View reviewed changes

lib/util.h Outdated Show resolved Hide resolved

jeromekelleher force-pushed the mutation-map branch 2 times, most recently from bc8e820 to 8373905 Compare March 23, 2020 11:56

jeromekelleher mentioned this pull request Mar 23, 2020

Interval map #920

Merged

petrelharp reviewed Mar 24, 2020

View reviewed changes

jeromekelleher added 2 commits March 24, 2020 13:58

Add poisson and flat to RandomGenerator for testing.

0f14bc0

Outline Python implementation.

62db04c

jeromekelleher added 5 commits March 24, 2020 13:58

Added low-level code for mutation maps.

ee3a9ee

Change intervals to same as recomb map.

b752c9e

Basic definition of interval_map class.

d9d81cc

Refactor recomb_map to use the interval_map.

b993263

- Also improve the API.

Added the IntervalMap & update high-level code.

de191d5

jeromekelleher force-pushed the mutation-map branch from 8373905 to de191d5 Compare March 24, 2020 13:59

jeromekelleher merged commit 876eb0c into tskit-dev:master Mar 24, 2020

jeromekelleher deleted the mutation-map branch March 24, 2020 14:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Mutation map #711

Mutation map #711

jeromekelleher commented Feb 19, 2019

petrelharp commented Feb 20, 2019

codecov bot commented Feb 21, 2019 •

edited

jeromekelleher commented Feb 21, 2019

jeromekelleher commented Mar 22, 2020

grahamgower commented Mar 23, 2020

jeromekelleher commented Mar 23, 2020

jeromekelleher commented Mar 23, 2020

benjeffery commented Mar 23, 2020 •

edited

jeromekelleher commented Mar 23, 2020

jeromekelleher commented Mar 23, 2020

jeromekelleher commented Mar 23, 2020

daniel-goldstein commented Mar 23, 2020

jeromekelleher commented Mar 23, 2020

petrelharp commented Mar 24, 2020

petrelharp Mar 24, 2020

petrelharp Mar 24, 2020

jeromekelleher Mar 24, 2020

petrelharp left a comment

jeromekelleher commented Mar 24, 2020

Mutation map #711

Mutation map #711

Conversation

jeromekelleher commented Feb 19, 2019

petrelharp commented Feb 20, 2019

codecov bot commented Feb 21, 2019 • edited

Codecov Report

jeromekelleher commented Feb 21, 2019

jeromekelleher commented Mar 22, 2020

grahamgower commented Mar 23, 2020

jeromekelleher commented Mar 23, 2020

jeromekelleher commented Mar 23, 2020

benjeffery commented Mar 23, 2020 • edited

jeromekelleher commented Mar 23, 2020

jeromekelleher commented Mar 23, 2020

jeromekelleher commented Mar 23, 2020

daniel-goldstein commented Mar 23, 2020

jeromekelleher commented Mar 23, 2020

petrelharp commented Mar 24, 2020

petrelharp Mar 24, 2020

Choose a reason for hiding this comment

petrelharp Mar 24, 2020

Choose a reason for hiding this comment

jeromekelleher Mar 24, 2020

Choose a reason for hiding this comment

petrelharp left a comment

Choose a reason for hiding this comment

jeromekelleher commented Mar 24, 2020

codecov bot commented Feb 21, 2019 •

edited

benjeffery commented Mar 23, 2020 •

edited