C implementation of IBD-finding code. #679

gtsambos · 2020-06-15T06:24:48Z

At the moment, just contains skeleton code to show the expected structure.

codecov · 2020-06-15T07:00:29Z

Codecov Report

Merging #679 into master will decrease coverage by 0.09%.
The diff coverage is 88.49%.

@@            Coverage Diff             @@
##           master     #679      +/-   ##
==========================================
- Coverage   93.61%   93.52%   -0.10%     
==========================================
  Files          24       24              
  Lines       19563    19928     +365     
  Branches      789      789              
==========================================
+ Hits        18314    18637     +323     
- Misses       1217     1259      +42     
  Partials       32       32

Impacted Files	Coverage Δ
c/tskit/core.h	`100.00% <ø> (ø)`
c/tskit/tables.c	`90.57% <87.45%> (-0.17%)`	⬇️
python/_tskitmodule.c	`91.31% <90.32%> (-0.02%)`	⬇️
c/tskit/core.c	`96.95% <100.00%> (+0.02%)`	⬆️
python/tskit/tables.py	`99.70% <100.00%> (+<0.01%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c673faa...f0679b0. Read the comment docs.

jeromekelleher · 2020-06-15T09:10:52Z

This all looks good to me @gtsambos. To make the memory management practical, I think a good way to structure things would be to have client code that looks something like this:

tsk_ibd_finder_t ibd_finder;
tsk_segment_t *seg;

int ret = tsk_ibd_finder_alloc(&ibd_finder, /* other params... */);
// error check ret
ret  = tsk_ibd_finder_run(&ibd_finder, /* other params... */);
// error check ret
for (j = 0; j < ibd_finder.num_pairs; j++) {
    ret = ibd_finder_get_pair_from_index(&ibd_finder, j, &a, &b);
    // error check ret;
    printf("IBD segs for (%d, %d) = ", a, b);
    for (seg = ibd_finder.ibd_segments[j]; seg != NULL; seg = seg->next) {
          /* Not sure if these are the correct fields, this is just to give a rough idea */
          printf("(%f, %f, %d)", seg->left, seg->right, seg->node);
    }
    printf("\n");
}
tsk_ibd_finder_free(&ibd_finder);

So, we'd make the tsk_ibd_finder part of the public API (so the struct is defined in tables.h), and have client code access some of the fields inside this struct directly. This is not ideal, and in particular the big ibd_segments array will use a lot of memory. However, it will give us working code and a way to move forward.

Once this version is working, we can investigate more memory efficient sparse ways to represent the results, and can think about a final documented C API then.

Does this make sense?

gtsambos · 2020-06-16T04:10:02Z

Yes, thanks Jerome. I'll start by making a very rudimentary ibd_finder struct that doesn't do anything (perhaps just holding the sample pairs), and try to get this client code working

gtsambos · 2020-07-06T05:28:36Z

Hi @jeromekelleher, the structure you've now described in the comment above is now implemented in test_tables.c as test_ibd_finder. I'm still a bit hazy on how the user will implement these methods directly themselves (a PyObject???), but maybe we can talk about that later?

gtsambos · 2020-07-21T02:17:10Z

hey @jeromekelleher (or @benjeffery), is there more detailed log output from the CircleCI builds that we can see somewhere? This makes it sound like perhaps there is...

Full log written to /home/circleci/tskit/build-gcc/meson-logs/testlog.txt

I'm having some trouble fixing the CircleCI failure in this PR, because the ninja build is only failing in CircleCI -- all of my local ninja builds are fine.

petrelharp · 2020-07-21T04:47:23Z

(posted circleCI log) ... edit: whoops, that's where you copied that line from. =)

I think it's saying you have a segfault - have you run the test under valgrind? When I run it with valgrind, I do get an error (although not a segfault:

  Test: test_ibd_finder ...==22946== Conditional jump or move depends on uninitialised value(s)
==22946==    at 0x13839F: tsk_ibd_finder_calculate_ibd (tables.c:5760)
==22946==    by 0x1382B5: tsk_ibd_finder_run (tables.c:5734)
==22946==    by 0x137C85: tsk_table_collection_find_ibd (tables.c:5592)
==22946==    by 0x11ED8B: test_ibd_finder (test_tables.c:2823)
==22946==    by 0x49AD9D6: ??? (in /usr/lib/x86_64-linux-gnu/libcunit.so.1.0.1)
==22946==    by 0x49ADC70: ??? (in /usr/lib/x86_64-linux-gnu/libcunit.so.1.0.1)
==22946==    by 0x49AE05D: CU_run_all_tests (in /usr/lib/x86_64-linux-gnu/libcunit.so.1.0.1)
==22946==    by 0x158E2E: test_main (testlib.c:743)
==22946==    by 0x12970E: main (test_tables.c:4464)
==22946== 
L: 0.000000, R:1.000000 
==22946== Conditional jump or move depends on uninitialised value(s)
==22946==    at 0x138472: tsk_ibd_finder_calculate_ibd (tables.c:5770)
==22946==    by 0x1382B5: tsk_ibd_finder_run (tables.c:5734)
==22946==    by 0x137C85: tsk_table_collection_find_ibd (tables.c:5592)
==22946==    by 0x11ED8B: test_ibd_finder (test_tables.c:2823)
==22946==    by 0x49AD9D6: ??? (in /usr/lib/x86_64-linux-gnu/libcunit.so.1.0.1)
==22946==    by 0x49ADC70: ??? (in /usr/lib/x86_64-linux-gnu/libcunit.so.1.0.1)
==22946==    by 0x49AE05D: CU_run_all_tests (in /usr/lib/x86_64-linux-gnu/libcunit.so.1.0.1)
==22946==    by 0x158E2E: test_main (testlib.c:743)
==22946==    by 0x12970E: main (test_tables.c:4464)
==22946== 
==22946== Conditional jump or move depends on uninitialised value(s)
==22946==    at 0x1384E3: tsk_ibd_finder_calculate_ibd (tables.c:5800)
==22946==    by 0x1382B5: tsk_ibd_finder_run (tables.c:5734)
==22946==    by 0x137C85: tsk_table_collection_find_ibd (tables.c:5592)
==22946==    by 0x11ED8B: test_ibd_finder (test_tables.c:2823)
==22946==    by 0x49AD9D6: ??? (in /usr/lib/x86_64-linux-gnu/libcunit.so.1.0.1)
==22946==    by 0x49ADC70: ??? (in /usr/lib/x86_64-linux-gnu/libcunit.so.1.0.1)
==22946==    by 0x49AE05D: CU_run_all_tests (in /usr/lib/x86_64-linux-gnu/libcunit.so.1.0.1)
==22946==    by 0x158E2E: test_main (testlib.c:743)
==22946==    by 0x12970E: main (test_tables.c:4464)
==22946==

jeromekelleher · 2020-07-21T07:14:26Z

Running it through valgrind is the first thing to try anyway @gtsambos - until that's running without problems things can fail in unpredictably different ways across systems.

gtsambos · 2020-07-21T10:13:26Z

Thanks @petrelharp and @jeromekelleher -- I noticed the valgrind errors and was working on that too, but thought the ninja problem might have been independent of it. I'll see what happens when I fix up the conditional jumps.

benjeffery · 2020-07-29T10:32:13Z

Circle CI failed due to a network issue. I re-ran and it has passed.

gtsambos · 2020-07-29T10:59:22Z

Hallelujah! Thanks @benjeffery! Btw, this pull request is still a few commits away from being ready to review -- I just wanted to sort out the CI issues before putting the final bells and whistles in.

bguo068 · 2020-07-29T15:26:05Z

Hi, I am new here and but worked on IBD calculation code in C for NEWICK format trees for my rotation project. Considering the inefficiency of this format, I really hope the code can be rewritten to work with tskit tree sequence format. @gtsambos, I am excited to know from @jeromekelleher that you have been working on this for thesis work. I have spent some time thinking about the algorithm for better efficiency. I have got some idea and would like to know whether the same algorithm has been already implemented. Would you like to chat about it?

gtsambos · 2020-07-30T06:57:35Z

Hi @gbinux, I'm glad to hear that you're excited, and I certainly hope this proves more efficient than what is currently possible with Newick trees! I'd be interested in chatting (perhaps alongside @jeromekelleher or my supervisor @dvukcevic?), but I'm currently trying to wind up a few different projects and feel a bit swamped right now. Would it be okay if I got back to you in a week or two once this PR is done?

Circle CI failed due to a network issue. I re-ran and it has passed.

btw @benjeffery, it looks like this problem has recurred. I'm still tidying some things up and anticipate needing to make a few more commits -- how about I just ping you when they are done so that you don't have to keep rerunning CircleCI manually?

benjeffery · 2020-07-30T09:05:11Z

I think a rebase might pull in a fix for gcov. Best to rebase often anyway.

bguo068 · 2020-07-30T13:40:57Z

Hi @gbinux, I'm glad to hear that you're excited, and I certainly hope this proves more efficient than what is currently possible with Newick trees! I'd be interested in chatting (perhaps alongside @jeromekelleher or my supervisor @dvukcevic?), but I'm currently trying to wind up a few different projects and feel a bit swamped right now. Would it be okay if I got back to you in a week or two once this PR is done?

Hi @gtsambos, thanks for sharing your current status. I completely understand that it is not easy to work on multiple projects at the same time. Just let me know when you feel comfortable to chat. Looking forward to talking with you about IBD.

python/_tskitmodule.c

gtsambos · 2020-09-11T05:58:09Z

Thanks @jeromekelleher, hopefully ready this time 🤞

btw:

Once that's done I'll update the Python-C interface to input a 2D array of sample pairs

I think I did this myself in the last round of commits -- see my comment

jeromekelleher · 2020-09-11T12:59:52Z

I've made a pass through and tidied things up a little bit in terms of the C API @gtsambos, and I think it's all looking good there. However, it looks like your tests in test_ibd.py aren't actually calling into the C code, to me. We'd need to hook this in somehow so that we're running the library code as well as the Python prototypes (we use the compare_lib=True pattern when testing simplify).

Once the tests are all hooked up and we're getting identical results from the Python and C versions, that I think we're ready to merge.

gtsambos · 2020-09-13T07:42:32Z

Hi @jeromekelleher, I've now tidied up the Python tests a little and added a compare_lib option so the Python and C output can be compared. There are a few different issues:

First, I'm having a bit of trouble getting this to work with the min_length and max_time arguments -- I've added a few extra lines to the method in tables.py so that these arguments can at least be used, but I'm getting a tskit.GenericError when I actually try to specify non-generic values. I'm wondering whether I need to explicitly convert these into something like a PyArrayObject in _tskitmodule.c (like you did for the sample pairs)?

Another problem is that I get a tskit.GenericError whenever the output is empty. I've realised that I actually didn't have any C tests covering this case (!) so I'll add one in, as I'm not sure whether this is an issue with the C code itself, or whether there's some problem trying to convert the C output back into valid Python output.

gtsambos · 2020-09-13T07:44:45Z

Once the tests are all hooked up and we're getting identical results from the Python and C versions, that I think we're ready to merge.

In addition, there are some test failures resulting from genuine discrepancy between the methods. I think these are all because my C code is inadvertently returning all IBD segments instead of just the most recent ones (related to the concept of mrca_ibd that we discussed way earlier in the year). I have flagged these all as expected failures for now, but am looking into a fix

gtsambos · 2020-09-17T06:04:41Z

I think this is ready now @jeromekelleher! I'm just about to push up a squashed version of the PR, feel free to have a look!

There is one test that I've flagged as an expected failure that I think I will fix later, if it's okay by you. It is a test case involving a particular topology of ancient samples, so not a situation that I'll encounter in my demonstrations or analyses. The Python and C implementations are returning the same result, it's just a little different from what I would expect. (I will have to make further changes to this code anyway when your memory-saving procedure comes in a later PR.)

jeromekelleher

Looks great, thanks @gtsambos! Once final round of tweaks and we're good to merge I think.

c/tskit/tables.c

python/tskit/tables.py

Removed sequence length from ibd_finder input ibd_finder_calculate_ibd is now only in the internal API default for max_time is now DBL_MAX ibd_finder struct no longer need pair_index attribute Changed output of python mock-up to be a dict of dict of numpy arrays Python test structure changed to incorporate new output format. Modified Python mock-up so that user can input sample pairs Marked expected failure in one test case (fix later) Changes so that tskitmodule.c can be built without error Added setter methods for min_length and max_time Changed C code to take in lists of sample pairs. Changed C tests to check output values. clang-format Remove extraneous stuff needed to calculate IBD for all pairs. Changed tskitmodule More small changes to IBD-C code Removed unnecessary comments Fixed weird reformatting problems Removed things used to calculate sample pair indices the old way. Removed some old comments Removed bits of the the ibd_finder struct that were only needed to calculate sample pairs the old way Changed find_sample_pair_index method to be a bit neater Various small changes. Changed structure of get_ibd_segments

Added test with empty result. bug fix Fixed problem that was generating GenericErrors in C-IBD. Added new test to IBD-C. Fixed parent_should_be_added Fixed IBD-C bug Fixed bug that was unnecessarily squashing IBD segments. Removed memory trimming step for now Cleaned up C code. Removed oldest parent attribute. Final small tweaks

gtsambos · 2020-09-17T09:29:02Z

Thanks @jeromekelleher, slightly revised PR just pushed now. I've added a few comments to yours above ⬆️

gtsambos · 2020-09-18T01:35:43Z

Hi @jeromekelleher, would it be okay if we pressed ahead with this PR and flagged the max_time thing above as an issue to be fixed later? I've spent quite a bit of time trying and failing to get the tests working with that small alteration, and given how many deadlines I have for various things next week, I think it would be a better use of my time to focus on those just for now.

My latest problem is that sys.float_info.max in tables.py doesn't appear to be interpreted as equivalent to DBL_MAX in the C code, so most of my Python tests are failing when I change the max_time default away from what it currently is in this PR

jeromekelleher · 2020-09-18T07:31:55Z

Sounds good @gtsambos - I'll fix up the max_time issue now and merge.

jeromekelleher

Looks great, thanks @gtsambos! Looking forward to seeing what it can do!

gtsambos force-pushed the ibd-c branch from c86f7bd to 50534bb Compare June 15, 2020 06:26

gtsambos marked this pull request as draft July 3, 2020 02:41

gtsambos force-pushed the ibd-c branch from 50534bb to 181c05b Compare July 4, 2020 08:10

gtsambos force-pushed the ibd-c branch 3 times, most recently from 03812c0 to 78b3fc8 Compare July 20, 2020 06:37

gtsambos force-pushed the ibd-c branch from a2e3f1a to 360891c Compare July 21, 2020 13:16

gtsambos force-pushed the ibd-c branch from b07c326 to 2f541e2 Compare July 30, 2020 06:44

gtsambos force-pushed the ibd-c branch from 944d9d2 to 43a4bbe Compare July 30, 2020 13:38

gtsambos force-pushed the ibd-c branch 6 times, most recently from 004dbd6 to e8c1e2b Compare July 31, 2020 10:07

gtsambos marked this pull request as ready for review August 3, 2020 05:13

gtsambos commented Sep 11, 2020

View reviewed changes

python/_tskitmodule.c Show resolved Hide resolved

gtsambos force-pushed the ibd-c branch 3 times, most recently from fd29fe0 to b8834f3 Compare September 11, 2020 06:12

gtsambos force-pushed the ibd-c branch 3 times, most recently from 2f14cf4 to 84c1c05 Compare September 15, 2020 06:17

gtsambos force-pushed the ibd-c branch from 8eda0bc to b49ed0e Compare September 17, 2020 06:16

gtsambos and others added 2 commits September 17, 2020 16:18

Basics of IBD-C procedure now implemented.

0068597

Add low-level Python-C interface.

fa237ca

gtsambos force-pushed the ibd-c branch from b49ed0e to 230b097 Compare September 17, 2020 06:19

jeromekelleher reviewed Sep 17, 2020

View reviewed changes

gtsambos and others added 4 commits September 17, 2020 19:26

Tweaked C IBD API.

556e8b3

Simplified Python IBD tests and added a compare_lib option.

b75d8c9

gtsambos force-pushed the ibd-c branch from 230b097 to 732e37c Compare September 17, 2020 09:26

Set default for max_time to DBL_MAX

f0679b0

jeromekelleher approved these changes Sep 18, 2020

View reviewed changes

jeromekelleher added the AUTOMERGE-REQUESTED label Sep 18, 2020

mergify bot merged commit bc50b5a into tskit-dev:master Sep 18, 2020

mergify bot removed the AUTOMERGE-REQUESTED label Sep 18, 2020

C implementation of IBD-finding code. #679

C implementation of IBD-finding code. #679

Uh oh!

Conversation

gtsambos commented Jun 15, 2020

Uh oh!

codecov bot commented Jun 15, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

jeromekelleher commented Jun 15, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gtsambos commented Jun 16, 2020

Uh oh!

gtsambos commented Jul 6, 2020

Uh oh!

gtsambos commented Jul 21, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

petrelharp commented Jul 21, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jeromekelleher commented Jul 21, 2020

Uh oh!

gtsambos commented Jul 21, 2020

Uh oh!

benjeffery commented Jul 29, 2020

Uh oh!

gtsambos commented Jul 29, 2020

Uh oh!

bguo068 commented Jul 29, 2020

Uh oh!

gtsambos commented Jul 30, 2020

Uh oh!

benjeffery commented Jul 30, 2020

Uh oh!

bguo068 commented Jul 30, 2020

Uh oh!

Uh oh!

gtsambos commented Sep 11, 2020

Uh oh!

jeromekelleher commented Sep 11, 2020

Uh oh!

gtsambos commented Sep 13, 2020

Uh oh!

gtsambos commented Sep 13, 2020

Uh oh!

gtsambos commented Sep 17, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jeromekelleher left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

gtsambos commented Sep 17, 2020

Uh oh!

gtsambos commented Sep 18, 2020

Uh oh!

jeromekelleher commented Sep 18, 2020

Uh oh!

jeromekelleher left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

codecov bot commented Jun 15, 2020 •

edited

Loading

jeromekelleher commented Jun 15, 2020 •

edited

Loading

gtsambos commented Jul 21, 2020 •

edited

Loading

petrelharp commented Jul 21, 2020 •

edited

Loading

gtsambos commented Sep 17, 2020 •

edited

Loading