C implementation of IBD-finding code. #679
Codecov Report
@@ Coverage Diff @@
## master #679 +/- ##
==========================================
- Coverage 93.61% 93.52% -0.10%
==========================================
Files 24 24
Lines 19563 19928 +365
Branches 789 789
==========================================
+ Hits 18314 18637 +323
- Misses 1217 1259 +42
Partials 32 32
|
|
This all looks good to me @gtsambos. To make the memory management practical, I think a good way to structure things would be to have client code that looks something like this:
tsk_ibd_finder_t ibd_finder;
tsk_segment_t *seg;
int ret = tsk_ibd_finder_alloc(&ibd_finder, /* other params... */);
// error check ret
ret = tsk_ibd_finder_run(&ibd_finder, /* other params... */);
// error check ret
for (j = 0; j < ibd_finder.num_pairs; j++) {
    ret = ibd_finder_get_pair_from_index(&ibd_finder, j, &a, &b);
    // error check ret
    printf("IBD segs for (%d, %d) = ", a, b);
    for (seg = ibd_finder.ibd_segments[j]; seg != NULL; seg = seg->next) {
        /* Not sure if these are the correct fields, this is just to give a rough idea */
        printf("(%f, %f, %d)", seg->left, seg->right, seg->node);
    }
    printf("\n");
}
tsk_ibd_finder_free(&ibd_finder);
So, we'd make the
Once this version is working, we can investigate more memory-efficient sparse ways to represent the results, and can think about a final documented C API then. Does this make sense? |
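For readers following along, here is a minimal, self-contained sketch of the storage pattern the snippet above seems to imply: one linked list of segments per sample pair, addressed by a flat pair index. Everything here (`pair_store_t`, `segment_t`, the `pair_store_*` functions, and the upper-triangle indexing scheme) is hypothetical and written only for illustration; it is not the tskit API and not necessarily how this PR stores its results.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical types for illustration only; not the tskit API. */
typedef struct segment {
    double left;
    double right;
    int node;
    struct segment *next;
} segment_t;

typedef struct {
    int num_samples;
    size_t num_pairs;     /* n * (n - 1) / 2 unordered pairs */
    segment_t **segments; /* head of the segment list for each pair index */
} pair_store_t;

int pair_store_alloc(pair_store_t *self, int num_samples)
{
    if (num_samples < 0) {
        return -1;
    }
    self->num_samples = num_samples;
    self->num_pairs = (size_t) num_samples * (size_t) (num_samples - 1) / 2;
    /* calloc zero-initialises, so every list starts out empty (NULL). */
    self->segments
        = calloc(self->num_pairs > 0 ? self->num_pairs : 1, sizeof(*self->segments));
    return self->segments == NULL ? -1 : 0;
}

/* Map a pair (a < b) to a flat index, walking the upper triangle row by row. */
int pair_store_index(const pair_store_t *self, int a, int b, size_t *index)
{
    size_t n = (size_t) self->num_samples;
    if (a < 0 || b <= a || b >= self->num_samples) {
        return -1;
    }
    *index = (size_t) a * n - (size_t) a * (size_t) (a + 1) / 2 + (size_t) (b - a - 1);
    return 0;
}

/* Inverse mapping: recover (a, b) from a flat index. */
int pair_store_get_pair(const pair_store_t *self, size_t index, int *a, int *b)
{
    int row;
    size_t row_len;
    if (index >= self->num_pairs) {
        return -1;
    }
    for (row = 0; row < self->num_samples - 1; row++) {
        row_len = (size_t) (self->num_samples - 1 - row);
        if (index < row_len) {
            *a = row;
            *b = row + 1 + (int) index;
            return 0;
        }
        index -= row_len;
    }
    return -1;
}

/* Prepend a segment to the list for pair (a, b). */
int pair_store_add(pair_store_t *self, int a, int b, double left, double right, int node)
{
    size_t index;
    segment_t *seg;
    if (pair_store_index(self, a, b, &index) != 0) {
        return -1;
    }
    seg = malloc(sizeof(*seg));
    if (seg == NULL) {
        return -1;
    }
    seg->left = left;
    seg->right = right;
    seg->node = node;
    seg->next = self->segments[index];
    self->segments[index] = seg;
    return 0;
}

/* Print every pair and its segments, mirroring the client loop above. */
void pair_store_print(const pair_store_t *self)
{
    size_t j;
    int a = 0, b = 0;
    const segment_t *seg;
    for (j = 0; j < self->num_pairs; j++) {
        if (pair_store_get_pair(self, j, &a, &b) != 0) {
            continue;
        }
        printf("IBD segs for (%d, %d) = ", a, b);
        for (seg = self->segments[j]; seg != NULL; seg = seg->next) {
            printf("(%f, %f, %d)", seg->left, seg->right, seg->node);
        }
        printf("\n");
    }
}

void pair_store_free(pair_store_t *self)
{
    size_t j;
    segment_t *seg, *next;
    for (j = 0; self->segments != NULL && j < self->num_pairs; j++) {
        for (seg = self->segments[j]; seg != NULL; seg = next) {
            next = seg->next;
            free(seg);
        }
    }
    free(self->segments);
    self->segments = NULL;
}
```

A caller would allocate the store, add segments as they are found, print or inspect them, and then free everything in one call. Note that a dense array of list heads costs memory quadratic in the number of samples, which is presumably why sparser representations are mentioned as a follow-up.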
|
Yes, thanks Jerome. I'll start by making a very rudimentary |
|
Hi @jeromekelleher, the structure you've now described in the comment above is now implemented in |
03812c0 to 78b3fc8
|
hey @jeromekelleher (or @benjeffery), is there more detailed log output from the CircleCI builds that we can see somewhere? This makes it sound like perhaps there is... I'm having some trouble fixing the CircleCI failure in this PR, because the ninja build is only failing in CircleCI -- all of my local ninja builds are fine. |
|
(posted circleCI log) ... edit: whoops, that's where you copied that line from. =) I think it's saying you have a segfault - have you run the test under valgrind? When I run it with valgrind, I do get an error (although not a segfault: |
|
Running it through valgrind is the first thing to try anyway @gtsambos - until that's running without problems things can fail in unpredictably different ways across systems. |
|
Thanks @petrelharp and @jeromekelleher -- I noticed the valgrind errors and was working on that too, but thought the ninja problem might have been independent of it. I'll see what happens when I fix up the conditional jumps. |
|
Circle CI failed due to a network issue. I re-ran and it has passed. |
|
Hallelujah! Thanks @benjeffery! Btw, this pull request is still a few commits away from being ready to review -- I just wanted to sort out the CI issues before putting the final bells and whistles in. |
|
Hi, I am new here but worked on IBD calculation code in C for Newick-format trees for my rotation project. Considering the inefficiency of this format, I really hope the code can be rewritten to work with the tskit tree sequence format. @gtsambos, I am excited to hear from @jeromekelleher that you have been working on this for your thesis. I have spent some time thinking about how to make the algorithm more efficient. I have some ideas and would like to know whether the same algorithm has already been implemented. Would you like to chat about it? |
|
Hi @gbinux, I'm glad to hear that you're excited, and I certainly hope this proves more efficient than what is currently possible with Newick trees! I'd be interested in chatting (perhaps alongside @jeromekelleher or my supervisor @dvukcevic?), but I'm currently trying to wind up a few different projects and feel a bit swamped right now. Would it be okay if I got back to you in a week or two once this PR is done?
btw @benjeffery, it looks like this problem has recurred. I'm still tidying some things up and anticipate needing to make a few more commits -- how about I just ping you when they are done so that you don't have to keep rerunning CircleCI manually? |
|
I think a rebase might pull in a fix for gcov. Best to rebase often anyway. |
Hi @gtsambos, thanks for sharing your current status. I completely understand that it is not easy to work on multiple projects at the same time. Just let me know when you feel comfortable to chat. Looking forward to talking with you about IBD. |
004dbd6 to e8c1e2b
|
Thanks @jeromekelleher, hopefully ready this time 🤞 btw:
I think I did this myself in the last round of commits -- see my comment |
fd29fe0 to b8834f3
|
I've made a pass through and tidied things up a little bit in terms of the C API @gtsambos, and I think it's all looking good there. However, it looks like your tests in Once the tests are all hooked up and we're getting identical results from the Python and C versions, then I think we're ready to merge. |
|
Hi @jeromekelleher, I've now tidied up the Python tests a little and added a First, I'm having a bit of trouble getting this to work with the Another problem is that I get a |
In addition, there are some test failures resulting from genuine discrepancy between the methods. I think these are all because my C code is inadvertently returning all IBD segments instead of just the most recent ones (related to the concept of |
2f14cf4 to 84c1c05
|
I think this is ready now @jeromekelleher! I'm just about to push up a squashed version of the PR, feel free to have a look! There is one test that I've flagged as an expected failure that I think I will fix later, if it's okay by you. It is a test case involving a particular topology of ancient samples, so not a situation that I'll encounter in my demonstrations or analyses. The Python and C implementations are returning the same result, it's just a little different from what I would expect. (I will have to make further changes to this code anyway when your memory-saving procedure comes in a later PR.) |
jeromekelleher
left a comment
Looks great, thanks @gtsambos! One final round of tweaks and we're good to merge I think.
Removed sequence length from ibd_finder input
ibd_finder_calculate_ibd is now only in the internal API
default for max_time is now DBL_MAX
ibd_finder struct no longer needs pair_index attribute
Changed output of python mock-up to be a dict of dict of numpy arrays
Python test structure changed to incorporate new output format.
Modified Python mock-up so that user can input sample pairs
Marked expected failure in one test case (fix later)
Changes so that tskitmodule.c can be built without error
Added setter methods for min_length and max_time
Changed C code to take in lists of sample pairs.
Changed C tests to check output values.
clang-format
Remove extraneous stuff needed to calculate IBD for all pairs.
Changed tskitmodule
More small changes to IBD-C code
Removed unnecessary comments
Fixed weird reformatting problems
Removed things used to calculate sample pair indices the old way.
Removed some old comments
Removed bits of the ibd_finder struct that were only needed to calculate sample pairs the old way
Changed find_sample_pair_index method to be a bit neater
Various small changes.
Changed structure of get_ibd_segments

Added test with empty result.
bug fix
Fixed problem that was generating GenericErrors in C-IBD.
Added new test to IBD-C.
Fixed parent_should_be_added
Fixed IBD-C bug
Fixed bug that was unnecessarily squashing IBD segments.
Removed memory trimming step for now
Cleaned up C code.
Removed oldest parent attribute.
Final small tweaks
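The commit notes above mention setter methods for min_length and max_time and a DBL_MAX default for max_time. As a rough illustration of how such options might behave, here is a small sketch. The type `ibd_filter_t`, the `ibd_filter_*` functions, and the exact comparison rules (strictly longer than min_length, ancestor time no later than max_time) are assumptions made for this example and are not taken from the PR's code.

```c
#include <float.h>
#include <stdbool.h>

/* Hypothetical filter options; names and semantics are assumptions. */
typedef struct {
    double min_length; /* keep segments strictly longer than this */
    double max_time;   /* keep segments whose ancestor time is <= this */
} ibd_filter_t;

/* Defaults that accept every segment of positive length. */
void ibd_filter_init(ibd_filter_t *self)
{
    self->min_length = 0.0;
    self->max_time = DBL_MAX;
}

/* Setters reject negative values and NaN, leaving the old value in place. */
int ibd_filter_set_min_length(ibd_filter_t *self, double min_length)
{
    if (!(min_length >= 0.0)) {
        return -1;
    }
    self->min_length = min_length;
    return 0;
}

int ibd_filter_set_max_time(ibd_filter_t *self, double max_time)
{
    if (!(max_time >= 0.0)) {
        return -1;
    }
    self->max_time = max_time;
    return 0;
}

/* Decide whether a candidate segment [left, right) should be reported. */
bool ibd_filter_passes(
    const ibd_filter_t *self, double left, double right, double ancestor_time)
{
    return (right - left) > self->min_length && ancestor_time <= self->max_time;
}
```

Using DBL_MAX as the default means the time check passes for any finite ancestor time, so callers who never set max_time get unfiltered results without a separate "is set" flag.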
|
Thanks @jeromekelleher, slightly revised PR just pushed now. I've added a few comments to yours above ⬆️ |
|
Hi @jeromekelleher, would it be okay if we pressed ahead with this PR and flagged the My latest problem is that |
|
Sounds good @gtsambos - I'll fix up the max_time issue now and merge. |
jeromekelleher
left a comment
Looks great, thanks @gtsambos! Looking forward to seeing what it can do!
At the moment, just contains skeleton code to show the expected structure.
@jeromekelleher @petrelharp @dvukcevic