Skip to content

Conversation

@gtsambos
Copy link
Member

At the moment, just contains skeleton code to show the expected structure.

@jeromekelleher @petrelharp @dvukcevic

@codecov
Copy link

codecov bot commented Jun 15, 2020

Codecov Report

Merging #679 into master will decrease coverage by 0.09%.
The diff coverage is 88.49%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #679      +/-   ##
==========================================
- Coverage   93.61%   93.52%   -0.10%     
==========================================
  Files          24       24              
  Lines       19563    19928     +365     
  Branches      789      789              
==========================================
+ Hits        18314    18637     +323     
- Misses       1217     1259      +42     
  Partials       32       32              
Impacted Files Coverage Δ
c/tskit/core.h 100.00% <ø> (ø)
c/tskit/tables.c 90.57% <87.45%> (-0.17%) ⬇️
python/_tskitmodule.c 91.31% <90.32%> (-0.02%) ⬇️
c/tskit/core.c 96.95% <100.00%> (+0.02%) ⬆️
python/tskit/tables.py 99.70% <100.00%> (+<0.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c673faa...f0679b0. Read the comment docs.

@jeromekelleher
Copy link
Member

jeromekelleher commented Jun 15, 2020

This all looks good to me @gtsambos. To make the memory management practical, I think a good way to structure things would be to have client code that looks something like this:

tsk_ibd_finder_t ibd_finder;
tsk_segment_t *seg;

int ret = tsk_ibd_finder_alloc(&ibd_finder, /* other params... */);
// error check ret
ret  = tsk_ibd_finder_run(&ibd_finder, /* other params... */);
// error check ret
for (j = 0; j < ibd_finder.num_pairs; j++) {
    ret = ibd_finder_get_pair_from_index(&ibd_finder, j, &a, &b);
    // error check ret;
    printf("IBD segs for (%d, %d) = ", a, b);
    for (seg = ibd_finder.ibd_segments[j]; seg != NULL; seg = seg->next) {
          /* Not sure if these are the correct fields, this is just to give a rough idea */
          printf("(%f, %f, %d)", seg->left, seg->right, seg->node);
    }
    printf("\n");
}
tsk_ibd_finder_free(&ibd_finder);

So, we'd make the tsk_ibd_finder part of the public API (so the struct is defined in tables.h), and have client code access some of the fields inside this struct directly. This is not ideal, and in particular the big ibd_segments array will use a lot of memory. However, it will give us working code and a way to move forward.

Once this version is working, we can investigate more memory efficient sparse ways to represent the results, and can think about a final documented C API then.

Does this make sense?

@gtsambos
Copy link
Member Author

Yes, thanks Jerome. I'll start by making a very rudimentary ibd_finder struct that doesn't do anything (perhaps just holding the sample pairs), and try to get this client code working

@gtsambos gtsambos marked this pull request as draft July 3, 2020 02:41
@gtsambos
Copy link
Member Author

gtsambos commented Jul 6, 2020

Hi @jeromekelleher, the structure you've now described in the comment above is now implemented in test_tables.c as test_ibd_finder. I'm still a bit hazy on how the user will implement these methods directly themselves (a PyObject???), but maybe we can talk about that later?

@gtsambos gtsambos force-pushed the ibd-c branch 3 times, most recently from 03812c0 to 78b3fc8 Compare July 20, 2020 06:37
@gtsambos
Copy link
Member Author

gtsambos commented Jul 21, 2020

hey @jeromekelleher (or @benjeffery), is there more detailed log output from the CircleCI builds that we can see somewhere? This makes it sound like perhaps there is...

Full log written to /home/circleci/tskit/build-gcc/meson-logs/testlog.txt

I'm having some trouble fixing the CircleCI failure in this PR, because the ninja build is only failing in CircleCI -- all of my local ninja builds are fine.

@petrelharp
Copy link
Contributor

petrelharp commented Jul 21, 2020

(posted circleCI log) ... edit: whoops, that's where you copied that line from. =)

I think it's saying you have a segfault - have you run the test under valgrind? When I run it with valgrind, I do get an error (although not a segfault:

  Test: test_ibd_finder ...==22946== Conditional jump or move depends on uninitialised value(s)
==22946==    at 0x13839F: tsk_ibd_finder_calculate_ibd (tables.c:5760)
==22946==    by 0x1382B5: tsk_ibd_finder_run (tables.c:5734)
==22946==    by 0x137C85: tsk_table_collection_find_ibd (tables.c:5592)
==22946==    by 0x11ED8B: test_ibd_finder (test_tables.c:2823)
==22946==    by 0x49AD9D6: ??? (in /usr/lib/x86_64-linux-gnu/libcunit.so.1.0.1)
==22946==    by 0x49ADC70: ??? (in /usr/lib/x86_64-linux-gnu/libcunit.so.1.0.1)
==22946==    by 0x49AE05D: CU_run_all_tests (in /usr/lib/x86_64-linux-gnu/libcunit.so.1.0.1)
==22946==    by 0x158E2E: test_main (testlib.c:743)
==22946==    by 0x12970E: main (test_tables.c:4464)
==22946== 
L: 0.000000, R:1.000000 
==22946== Conditional jump or move depends on uninitialised value(s)
==22946==    at 0x138472: tsk_ibd_finder_calculate_ibd (tables.c:5770)
==22946==    by 0x1382B5: tsk_ibd_finder_run (tables.c:5734)
==22946==    by 0x137C85: tsk_table_collection_find_ibd (tables.c:5592)
==22946==    by 0x11ED8B: test_ibd_finder (test_tables.c:2823)
==22946==    by 0x49AD9D6: ??? (in /usr/lib/x86_64-linux-gnu/libcunit.so.1.0.1)
==22946==    by 0x49ADC70: ??? (in /usr/lib/x86_64-linux-gnu/libcunit.so.1.0.1)
==22946==    by 0x49AE05D: CU_run_all_tests (in /usr/lib/x86_64-linux-gnu/libcunit.so.1.0.1)
==22946==    by 0x158E2E: test_main (testlib.c:743)
==22946==    by 0x12970E: main (test_tables.c:4464)
==22946== 
==22946== Conditional jump or move depends on uninitialised value(s)
==22946==    at 0x1384E3: tsk_ibd_finder_calculate_ibd (tables.c:5800)
==22946==    by 0x1382B5: tsk_ibd_finder_run (tables.c:5734)
==22946==    by 0x137C85: tsk_table_collection_find_ibd (tables.c:5592)
==22946==    by 0x11ED8B: test_ibd_finder (test_tables.c:2823)
==22946==    by 0x49AD9D6: ??? (in /usr/lib/x86_64-linux-gnu/libcunit.so.1.0.1)
==22946==    by 0x49ADC70: ??? (in /usr/lib/x86_64-linux-gnu/libcunit.so.1.0.1)
==22946==    by 0x49AE05D: CU_run_all_tests (in /usr/lib/x86_64-linux-gnu/libcunit.so.1.0.1)
==22946==    by 0x158E2E: test_main (testlib.c:743)
==22946==    by 0x12970E: main (test_tables.c:4464)
==22946== 

@jeromekelleher
Copy link
Member

Running it through valgrind is the first thing to try anyway @gtsambos - until that's running without problems things can fail in unpredictably different ways across systems.

@gtsambos
Copy link
Member Author

Thanks @petrelharp and @jeromekelleher -- I noticed the valgrind errors and was working on that too, but thought the ninja problem might have been independent of it. I'll see what happens when I fix up the conditional jumps.

@benjeffery
Copy link
Member

Circle CI failed due to a network issue. I re-ran and it has passed.

@gtsambos
Copy link
Member Author

Hallelujah! Thanks @benjeffery! Btw, this pull request is still a few commits away from being ready to review -- I just wanted to sort out the CI issues before putting the final bells and whistles in.

@bguo068
Copy link
Member

bguo068 commented Jul 29, 2020

Hi, I am new here and but worked on IBD calculation code in C for NEWICK format trees for my rotation project. Considering the inefficiency of this format, I really hope the code can be rewritten to work with tskit tree sequence format. @gtsambos, I am excited to know from @jeromekelleher that you have been working on this for thesis work. I have spent some time thinking about the algorithm for better efficiency. I have got some idea and would like to know whether the same algorithm has been already implemented. Would you like to chat about it?

@gtsambos
Copy link
Member Author

Hi @gbinux, I'm glad to hear that you're excited, and I certainly hope this proves more efficient than what is currently possible with Newick trees! I'd be interested in chatting (perhaps alongside @jeromekelleher or my supervisor @dvukcevic?), but I'm currently trying to wind up a few different projects and feel a bit swamped right now. Would it be okay if I got back to you in a week or two once this PR is done?

Circle CI failed due to a network issue. I re-ran and it has passed.

btw @benjeffery, it looks like this problem has recurred. I'm still tidying some things up and anticipate needing to make a few more commits -- how about I just ping you when they are done so that you don't have to keep rerunning CircleCI manually?

@benjeffery
Copy link
Member

I think a rebase might pull in a fix for gcov. Best to rebase often anyway.

@bguo068
Copy link
Member

bguo068 commented Jul 30, 2020

Hi @gbinux, I'm glad to hear that you're excited, and I certainly hope this proves more efficient than what is currently possible with Newick trees! I'd be interested in chatting (perhaps alongside @jeromekelleher or my supervisor @dvukcevic?), but I'm currently trying to wind up a few different projects and feel a bit swamped right now. Would it be okay if I got back to you in a week or two once this PR is done?

Hi @gtsambos, thanks for sharing your current status. I completely understand that it is not easy to work on multiple projects at the same time. Just let me know when you feel comfortable to chat. Looking forward to talking with you about IBD.

@gtsambos gtsambos force-pushed the ibd-c branch 6 times, most recently from 004dbd6 to e8c1e2b Compare July 31, 2020 10:07
@gtsambos gtsambos marked this pull request as ready for review August 3, 2020 05:13
@gtsambos
Copy link
Member Author

Thanks @jeromekelleher, hopefully ready this time 🤞

btw:

Once that's done I'll update the Python-C interface to input a 2D array of sample pairs

I think I did this myself in the last round of commits -- see my comment

@gtsambos gtsambos force-pushed the ibd-c branch 3 times, most recently from fd29fe0 to b8834f3 Compare September 11, 2020 06:12
@jeromekelleher
Copy link
Member

I've made a pass through and tidied things up a little bit in terms of the C API @gtsambos, and I think it's all looking good there. However, it looks like your tests in test_ibd.py aren't actually calling into the C code, to me. We'd need to hook this in somehow so that we're running the library code as well as the Python prototypes (we use the compare_lib=True pattern when testing simplify).

Once the tests are all hooked up and we're getting identical results from the Python and C versions, that I think we're ready to merge.

@gtsambos
Copy link
Member Author

Hi @jeromekelleher, I've now tidied up the Python tests a little and added a compare_lib option so the Python and C output can be compared. There are a few different issues:

First, I'm having a bit of trouble getting this to work with the min_length and max_time arguments -- I've added a few extra lines to the method in tables.py so that these arguments can at least be used, but I'm getting a tskit.GenericError when I actually try to specify non-generic values. I'm wondering whether I need to explicitly convert these into something like a PyArrayObject in _tskitmodule.c (like you did for the sample pairs)?

Another problem is that I get a tskit.GenericError whenever the output is empty. I've realised that I actually didn't have any C tests covering this case (!) so I'll add one in, as I'm not sure whether this is an issue with the C code itself, or whether there's some problem trying to convert the C output back into valid Python output.

@gtsambos
Copy link
Member Author

Once the tests are all hooked up and we're getting identical results from the Python and C versions, that I think we're ready to merge.

In addition, there are some test failures resulting from genuine discrepancy between the methods. I think these are all because my C code is inadvertently returning all IBD segments instead of just the most recent ones (related to the concept of mrca_ibd that we discussed way earlier in the year). I have flagged these all as expected failures for now, but am looking into a fix

@gtsambos gtsambos force-pushed the ibd-c branch 3 times, most recently from 2f14cf4 to 84c1c05 Compare September 15, 2020 06:17
@gtsambos
Copy link
Member Author

gtsambos commented Sep 17, 2020

I think this is ready now @jeromekelleher! I'm just about to push up a squashed version of the PR, feel free to have a look!

There is one test that I've flagged as an expected failure that I think I will fix later, if it's okay by you. It is a test case involving a particular topology of ancient samples, so not a situation that I'll encounter in my demonstrations or analyses. The Python and C implementations are returning the same result, it's just a little different from what I would expect. (I will have to make further changes to this code anyway when your memory-saving procedure comes in a later PR.)

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks @gtsambos! Once final round of tweaks and we're good to merge I think.

gtsambos and others added 4 commits September 17, 2020 19:26
Removed sequence length from ibd_finder input

ibd_finder_calculate_ibd is now only in the internal API

default for max_time is now DBL_MAX

ibd_finder struct no longer need pair_index attribute

Changed output of python mock-up to be a dict of dict of numpy arrays

Python test structure changed to incorporate new output format.

Modified Python mock-up so that user can input sample pairs

Marked expected failure in one test case (fix later)

Changes so that tskitmodule.c can be built without error

Added setter methods for min_length and max_time

Changed C code to take in lists of sample pairs.

Changed C tests to check output values.

clang-format

Remove extraneous stuff needed to calculate IBD for all pairs.

Changed tskitmodule

More small changes to IBD-C code

Removed unnecessary comments

Fixed weird reformatting problems

Removed things used to calculate sample pair indices the old way.

Removed some old comments

Removed bits of the the ibd_finder struct that were only needed to calculate sample pairs the old way

Changed find_sample_pair_index method to be a bit neater

Various small changes.

Changed structure of get_ibd_segments
Added test with empty result.

bug fix

Fixed problem that was generating GenericErrors in C-IBD.

Added new test to IBD-C.

Fixed parent_should_be_added

Fixed IBD-C bug

Fixed bug that was unnecessarily squashing IBD segments.

Removed memory trimming step for now

Cleaned up C code.

Removed oldest parent attribute.

Final small tweaks
@gtsambos
Copy link
Member Author

Thanks @jeromekelleher, slightly revised PR just pushed now. I've added a few comments to yours above ⬆️

@gtsambos
Copy link
Member Author

Hi @jeromekelleher, would it be okay if we pressed ahead with this PR and flagged the max_time thing above as an issue to be fixed later? I've spent quite a bit of time trying and failing to get the tests working with that small alteration, and given how many deadlines I have for various things next week, I think it would be a better use of my time to focus on those just for now.

My latest problem is that sys.float_info.max in tables.py doesn't appear to be interpreted as equivalent to DBL_MAX in the C code, so most of my Python tests are failing when I change the max_time default away from what it currently is in this PR

@jeromekelleher
Copy link
Member

Sounds good @gtsambos - I'll fix up the max_time issue now and merge.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks @gtsambos! Looking forward to seeing what it can do!

@mergify mergify bot merged commit bc50b5a into tskit-dev:master Sep 18, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants