Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mutation map #711

Merged
merged 7 commits into from Mar 24, 2020
Merged

Mutation map #711

merged 7 commits into from Mar 24, 2020

Conversation

jeromekelleher
Copy link
Member

Closes #710.

@petrelharp, I started thinking about this and I realise I was over complicating things. Let's just do it the simplest reasonable way first and see how it performs after.

@petrelharp
Copy link
Contributor

Gee, looks just right. Good thinking.

@codecov
Copy link

codecov bot commented Feb 21, 2019

Codecov Report

Merging #711 into master will increase coverage by 0.07%.
The diff coverage is 91.18%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master    #711      +/-   ##
=========================================
+ Coverage   87.43%   87.5%   +0.07%     
=========================================
  Files          13      14       +1     
  Lines        7902    8045     +143     
  Branches     1628    1648      +20     
=========================================
+ Hits         6909    7040     +131     
- Misses        499     505       +6     
- Partials      494     500       +6
Flag Coverage Δ
#C 87.5% <91.18%> (+0.07%) ⬆️
#python 97.7% <100%> (+0.01%) ⬆️
Impacted Files Coverage Δ
lib/msprime.c 84.02% <100%> (-0.13%) ⬇️
msprime/cli.py 97.81% <100%> (ø) ⬆️
msprime/simulations.py 97.88% <100%> (ø) ⬆️
msprime/mutations.py 100% <100%> (ø) ⬆️
lib/util.c 97.56% <100%> (+0.21%) ⬆️
_msprimemodule.c 84.24% <85.26%> (+0.1%) ⬆️
lib/mutgen.c 82.47% <89.13%> (+2.72%) ⬆️
lib/interval_map.c 94% <94%> (ø)
lib/recomb_map.c 96.93% <97.87%> (+0.64%) ⬆️
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 022fdb2...de191d5. Read the comment docs.

@jeromekelleher
Copy link
Member Author

OK, that's most of it done. Just need to update the high-level code and test now.

The only change here is that the map is defined by two equal length arrays now. There's no point in specifying the end value, as it just makes things tricky.

@jeromekelleher
Copy link
Member Author

@daniel-goldstein, I'd appreciate a quick look over this please. It's a WIP where I'm refactoring the recombination map class to make use of a interval_map_t. We want the interval_map because there are other values we want to associate with intervals (here, mutation rates; the idea here is to add variable mutation rates along the genome).

What do you think of the basic setup? I think the plan now is to make a high-level Pythonic interface for the IntervalMap class and to give it some useful methods. We then refactor the RecombinationMap class to use this (not quite sure how yet), but maintaining the old interface for compatibility.

I'm also changing my mind about insisting the arrays are the same length in Python - really, the values array should be one shorter, and the C code should never look at this value.

cc also @grahamgower - I know you were thinking about #902. I think this PR should also close that issue (or at least plan to with a concrete follow up).

lib/util.h Outdated Show resolved Hide resolved
@grahamgower
Copy link
Member

This is very cool. One thing that I had in mind for #902 is that it would be nice for the python interface to do zero memory allocations/copies when getting the rate and position properties of the RecombinationMap---just return a numpy array view of the underlying memory. I didn't think it through fully due to lack of familiarity with the code, but it probably only needs small changes to memory ownership semantics in the C interface.

@jeromekelleher
Copy link
Member Author

This is very cool.

Thanks!

One thing that I had in mind for #902 is that it would be nice for the python interface to do zero memory allocations/copies when getting the rate and position properties of the RecombinationMap---just return a numpy array view of the underlying memory. I didn't think it through fully due to lack of familiarity with the code, but it probably only needs small changes to memory ownership semantics in the C interface.

Ah, yes, now is this tricky. I've thought about doing it before (for table columns in tskit) and ended up giving up and just returning copies. In principle, we should be able to use this method and set the base member to the IntervalMap instance so that ref counting works properly. There are some worries about then changing the shape of the array (or something) and then breaking underlying C code, but maybe it can be set to a read-only-ish mode. It would be handy to directly access the underlying arrays when manipulating these maps...

I might have another go at it, it's good idea.

@jeromekelleher
Copy link
Member Author

Ah yes, now I remember why I didn't do this for Table columns. This is because they are dynamic, and the column memory can be realloced, therefore making the memory pointed to by the numpy array invalid. We don't have this problem here though.

@benjeffery
Copy link
Member

benjeffery commented Mar 23, 2020

The column memory can be realloced, therefore making the memory pointed to by the numpy array invalid.

Could you get round this by supporting a callback that the C API calls with the new memory address when ever it re-allocates? Would clearly add complexity - do you have a handle on how much copies are a perf issue?

@jeromekelleher
Copy link
Member Author

Could you get round this by supporting a callback that the C API calls with the new memory address when ever it re-allocates? Would clearly add complexity - do you have a handle on how much copies are a perf issue?

Copies are totally not a perf issue, which is mainly why I decided to not worry about zero-copy semantics for this stuff. It's more about nice APIs really than perf. We'd like to be able to change the data in place sometimes, and modifying the array is much nicer.

The realloc stuff isn't an issue here though as the size is fixed, that's a tskit Tables API concern.

@jeromekelleher
Copy link
Member Author

I started messing around with zero copy stuff there and realised there's actually not much point in this case. The gains are pretty minimal and we'd have to do a lot of work to make sure that (e.g.) the user couldn't modify the position array while it was in use by the mutation generator. Taking copies is safe and simple. Having a good high-level API is more important here than a super-efficient low-level one.

@jeromekelleher jeromekelleher force-pushed the mutation-map branch 2 times, most recently from bc8e820 to 8373905 Compare March 23, 2020 11:56
@jeromekelleher jeromekelleher mentioned this pull request Mar 23, 2020
@jeromekelleher
Copy link
Member Author

I think the thing to do here is to merge this PR as soon as possible and to follow up with more work on improving the high-level API and testing out the mutation mapping code. There's quite a big diff here, so it would be good to get it merged now so we can keep other PRs going through.

I've added #920 as a starting follow-up where we add the create the high-level interval map tools and change the semantics to something more sensible.

@daniel-goldstein
Copy link
Member

I'd appreciate a quick look over this please. It's a WIP where I'm refactoring the recombination map class to make use of a interval_map_t. We want the interval_map because there are other values we want to associate with intervals (here, mutation rates; the idea here is to add variable mutation rates along the genome).

@jeromekelleher Sounds good to me. I wonder though if interval_map is a bit too broad to benefit fully from the abstraction for recombination/mutation maps.

It makes sense the way recomb_map is currently refactored because there's no promise that interval_map.value is a rate. But a mutation map would also treat it as such and I think would then duplicate some core functionality like expected # of mutations/recombinations within interval [start, end) or up to position x. I don't know much about how mutation maps are used differently than recombination maps, but to me they seem like they could be thin skins around an interval_rate_map, with recombination map doing a little extra work surrounding loci/discrete space etc. Something like an interval_rate_map could then manage position, rate, and cumulative_value, and could give the useful method of position-at-value.

I'm also changing my mind about insisting the arrays are the same length in Python - really, the values array should be one shorter, and the C code should never look at this value.

I am also a fan of nixing the last entry on the values array. I think it's no more mental overhead to make sure it's one shorter than make sure they're the same length and ends in zero.

I started messing around with zero copy stuff there and realised there's actually not much point in this case. The gains are pretty minimal and we'd have to do a lot of work to make sure that (e.g.) the user couldn't modify the position array while it was in use by the mutation generator. Taking copies is safe and simple. Having a good high-level API is more important here than a super-efficient low-level one.

Worth noting that the C simulator copies its recomb_map handed to it from python, and does some mutation translating model rates, so we'd have to be careful with views into the arrays. I haven't seen any perf reasons not to copy, so I think this sounds like the best move right now.

@jeromekelleher
Copy link
Member Author

Thanks @daniel-goldstein, I was wondering myself about an interval_rate_map. Let's follow this up in #920, once this is merged.

@petrelharp
Copy link
Contributor

Wow, this is great. And: Yes!! Please make values one shorter!!! That eliminates uncertainty about which number goes with which interval, a serious source of bugs.


fprintf(out, "interval_map (%p):: size = %d\n", (void *) self, (int) self->size);
fprintf(out, "\tsequence_length = %f\n", interval_map_get_sequence_length(self));
fprintf(out, "\tindex\tposition\tvalue\n");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be left, right, value?

}

size_t
interval_map_get_size(interval_map_t *self)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe num_breakpoints instead of size?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. Will follow up in #920

Copy link
Contributor

@petrelharp petrelharp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read over this. Looks good! We might do the mutation drawing differently if we expect a lot of small rate changes per edge, but it's good for now.

@jeromekelleher
Copy link
Member Author

OK, thanks @petrelharp. I'm going to merge this and then pick it up again in #920 to tidy up the rough edges on the interval map stuff.

@jeromekelleher jeromekelleher merged commit 876eb0c into tskit-dev:master Mar 24, 2020
@jeromekelleher jeromekelleher deleted the mutation-map branch March 24, 2020 14:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

add genomic interval argument to mutate()
5 participants