Refactor table collection integrity checks and document. #709
Codecov Report
@@            Coverage Diff             @@
##           master     #709      +/-   ##
==========================================
+ Coverage   85.41%   87.74%   +2.32%
==========================================
  Files           8       24      +16
  Lines        9546    19105    +9559
  Branches     1827     3563    +1736
==========================================
+ Hits         8154    16764    +8610
- Misses        795     1273     +478
- Partials      597     1068     +471
Will take a look at this later today.
Nice. This is a real improvement on the flags, very minor comments!
Thanks for the review @benjeffery. Clarifying this made me realise that we can fix an annoying problem with the tree sequence object, in that we can now detect any bad topologies at load time. There's no good reason not to, I think, other than potentially breaking people's code. But if their code depends on this, then I'm sure it must be broken anyway.
I considered changing TSK_CHECK_ALL to include an actual check of the trees too. In a way, this would be quite handy because we're actually depending on the trees in [...]. I backed out of it because it seemed that this would imply another iteration over the trees at tsk_treeseq_t load time, since we would still have to iterate over them once to count the trees. Maybe it's not worth worrying about this, but I figured it's better to just let this one go and get the code shipped.
Great, I think moving the contradictory children check is the right thing. As for getting ALL to check the trees: I considered check_integrity returning the number of trees, meaning that [...]
It's a good idea, but I think it would complicate the interface a bit if [...]. It wouldn't take long to do, and it probably is the right thing to do, so that we keep all the validation checks in one place. What do you reckon? We need to know the number of trees so we can malloc stuff on a per-tree basis. I don't think there's any way around that.
Note this would change the semantics a bit, so that TSK_CHECK_TREES would imply all the other checks, and not be the OR of the bits. We can't check the trees without all the other things being safe first. It would need a bit of a refactor, but not that much.
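To make the "implies all the other checks" idea concrete, here is a minimal sketch of how an options word could be expanded before any checking starts. The EX_ flag names, their bit values, and ex_expand_check_options are hypothetical and chosen only for illustration; they are not the definitions used in the library.

```c
#include <stdint.h>

/* Hypothetical flag values for illustration only; not the library's
 * actual definitions. */
#define EX_CHECK_EDGE_ORDERING (1u << 0)
#define EX_CHECK_SITE_ORDERING (1u << 1)
#define EX_CHECK_MUTATION_ORDERING (1u << 2)
#define EX_CHECK_INDEXES (1u << 3)
#define EX_CHECK_TREES (1u << 4)

/* Expand the caller's options so that requesting the tree check also
 * switches on every check that it depends on. Other flags are passed
 * through unchanged. */
static uint32_t
ex_expand_check_options(uint32_t options)
{
    if (options & EX_CHECK_TREES) {
        options |= EX_CHECK_EDGE_ORDERING | EX_CHECK_SITE_ORDERING
                   | EX_CHECK_MUTATION_ORDERING | EX_CHECK_INDEXES;
    }
    return options;
}
```

With this shape the tree check is a superset rather than one more independent bit, so a caller passing only the tree flag cannot end up walking trees built from unvalidated tables.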
Ah, of course. I guess there would be a perf hit to mallocing a reasonably sized buffer and then realloc'ing to double it every time you hit the end? I don't yet have a good intuition for how long tree iteration takes compared to that kind of operation. As for the return value: we have other operations (e.g. [...]
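For reference, the doubling strategy mentioned above could look roughly like the sketch below. The ex_buffer_t type, the ex_buffer_append function, and the initial capacity of 16 are all assumptions made for illustration and do not come from the code under review.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical growable array of per-tree values. */
typedef struct {
    double *data;
    size_t length;
    size_t capacity;
} ex_buffer_t;

/* Append one value, doubling the allocation whenever the buffer is
 * full. Returns 0 on success and -1 if memory could not be obtained.
 * A zero-initialised buffer is valid: realloc(NULL, n) behaves like
 * malloc(n). */
static int
ex_buffer_append(ex_buffer_t *self, double value)
{
    if (self->length == self->capacity) {
        size_t new_capacity = self->capacity == 0 ? 16 : 2 * self->capacity;
        double *tmp = realloc(self->data, new_capacity * sizeof(*tmp));

        if (tmp == NULL) {
            return -1; /* the old block is still owned by self->data */
        }
        self->data = tmp;
        self->capacity = new_capacity;
    }
    self->data[self->length++] = value;
    return 0;
}
```

Doubling keeps the number of reallocations logarithmic in the final length, so the amortised cost per append is constant; whether that is cheaper than a second pass over the trees would still need measuring.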
Ready for another look @benjeffery
Looks good - just the one comment about adding a test.
Just found this in my notifications. Taking a look.
@jeromekelleher, I think this all looks sensible. I'll admit that I've not dug into the details of this API before, though.
This removes the need for an extra iteration over the trees during tree sequence init.
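One way to picture how a single pass can both validate and count is a return value that is negative on error and otherwise carries the count. The sketch below is a toy stand-in under that assumption: ex_check_breakpoints, EX_ERR_BAD_BREAKPOINTS, and the breakpoint-array input are hypothetical and are not how the real function receives its data.

```c
#include <stddef.h>

#define EX_ERR_BAD_BREAKPOINTS (-1) /* hypothetical error code */

/* Validate a list of tree breakpoints and, on success, return the
 * number of trees (intervals) that they delimit. A negative return
 * value signals an error, so one pass does both jobs. */
static long
ex_check_breakpoints(const double *breakpoints, size_t num_breakpoints)
{
    size_t j;

    if (num_breakpoints < 2) {
        return EX_ERR_BAD_BREAKPOINTS;
    }
    for (j = 1; j < num_breakpoints; j++) {
        /* Breakpoints must be strictly increasing. */
        if (!(breakpoints[j - 1] < breakpoints[j])) {
            return EX_ERR_BAD_BREAKPOINTS;
        }
    }
    return (long) (num_breakpoints - 1);
}
```

A caller would test for a negative value first and only then use the result to size its per-tree allocations.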
This PR documents the tsk_table_collection_check_integrity function and refactors the set of options a bit.
As part of the process of documenting the options for this function I realised that they were a confusing mishmash, and I spent a long time trying to figure out some simple and clear semantics. We had made special cases of the mutation parents and mutation times basically for the sake of compute_mutation_parents, but it's actually much simpler to just set the values to NULL beforehand.
We do change the semantics of mutation time a little bit here too. Mixing known and unknown times at a site is fine, but we can have situations where a string of unknown-time mutations separates mutations with known times, and we need to check that these times don't conflict.
Hopefully the whole thing is a bit easier to follow now, and maybe even a little bit more efficient too.
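As an illustration of the mutation time case, the sketch below compares each known time against the nearest ancestor mutation with a known time, skipping over any run of unknown times in between. It rests on several assumptions that are not taken from this PR: unknown times are represented by a plain NaN, EX_NULL marks a mutation with no parent, parent chains are acyclic, and time runs backwards so a mutation must not be older (larger) than its ancestors. The names ex_check_mutation_times and EX_NULL are hypothetical.

```c
#include <math.h>
#include <stddef.h>

#define EX_NULL (-1) /* hypothetical marker for "no parent mutation" */

/* Return 0 if every mutation with a known time is no older than its
 * nearest ancestor mutation with a known time, and -1 otherwise.
 * Unknown times are stored as NaN here and are skipped when walking
 * up the chain of parents. */
static int
ex_check_mutation_times(const double *time, const int *parent, size_t num_mutations)
{
    size_t j;

    for (j = 0; j < num_mutations; j++) {
        int p;

        if (isnan(time[j])) {
            continue; /* nothing to compare for an unknown time */
        }
        p = parent[j];
        while (p != EX_NULL && isnan(time[p])) {
            p = parent[p]; /* skip ancestors with unknown times */
        }
        if (p != EX_NULL && time[j] > time[p]) {
            return -1; /* older than an ancestor mutation */
        }
    }
    return 0;
}
```

For a chain 0 <- 1 <- 2 with times 5, unknown, 7, the check fails because mutation 2 is compared against mutation 0 across the unknown one; with times 5, unknown, 3 it passes.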
@benjeffery, @molpopgen, would you mind taking a look at the documentation and seeing if it makes sense to you as C API users? (Any other comments would be much appreciated too, of course!)
Closes #649
Closes #592