Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor table collection integrity checks and document. #709

Merged

Conversation

jeromekelleher
Copy link
Member

This PR documents the tsk_table_collection_check_integrity function, documents it and refactors the set of options a bit.

As part of the process of documenting the options for this function I realised that they were a confusing mishmash, and I spent a long time trying to figure some simple and clear semantics. We had made special cases of the mutation parents and mutation times basically for the sake of compute_mutation_parents, but it's actually much simpler to just set the values to NULL beforehand.

We do change the semantics of mutation time a little bit here too. Mixing known and unknown times at a site is fine, but we can have situations where a string of unknown time mutations separate mutations with times, and we need to check these times don't conflict.

Hopefully the whole thing is a bit easier to follow now, and maybe even a little bit more efficient too.

@benjeffery, @molpopgen, would you mind taking a look at the documentation and see if it makes sense to you as C API users? (Any other comments would be much appreciated too, of course!)

Closes #649
Closes #592

@codecov
Copy link

codecov bot commented Jul 8, 2020

Codecov Report

Merging #709 into master will increase coverage by 2.32%.
The diff coverage is 96.19%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #709      +/-   ##
==========================================
+ Coverage   85.41%   87.74%   +2.32%     
==========================================
  Files           8       24      +16     
  Lines        9546    19105    +9559     
  Branches     1827     3563    +1736     
==========================================
+ Hits         8154    16764    +8610     
- Misses        795     1273     +478     
- Partials      597     1068     +471     
Flag Coverage Δ
#c_tests 85.43% <96.19%> (+0.02%) ⬆️
#python_c_tests 90.08% <ø> (?)
#python_tests 99.00% <ø> (?)
Impacted Files Coverage Δ
c/tskit/core.h 100.00% <ø> (ø)
c/tskit/tables.c 81.37% <95.95%> (+0.31%) ⬆️
c/tskit/core.c 93.93% <100.00%> (+0.04%) ⬆️
c/tskit/trees.c 90.45% <100.00%> (-0.21%) ⬇️
python/tskit/provenance.py 100.00% <0.00%> (ø)
python/tskit/vcf.py 100.00% <0.00%> (ø)
python/tskit/__init__.py 100.00% <0.00%> (ø)
python/tskit/cli.py 96.29% <0.00%> (ø)
... and 12 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 667abe6...2a99090. Read the comment docs.

@benjeffery
Copy link
Member

Will take a look at this later today.

Copy link
Member

@benjeffery benjeffery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. This is a real improvement on the flags, very minor comments!

c/tskit/tables.h Outdated Show resolved Hide resolved
c/tskit/tables.c Outdated Show resolved Hide resolved
c/tskit/tables.h Outdated Show resolved Hide resolved
@jeromekelleher jeromekelleher force-pushed the docs-for-check-integrity branch 2 times, most recently from 457df80 to 078ac79 Compare July 10, 2020 08:40
@jeromekelleher
Copy link
Member Author

Thanks for the review @benjeffery. Clarifying this made we realise that we can fix an annoying problem with the tree sequence object, in that we can now detect any bad topologies at load time. There's no good reason not to I think, other than potentially breaking people's code. But, if it depends on this, then I'm sure it must be broken anyway.

@jeromekelleher
Copy link
Member Author

I considered changing TSK_CHECK_ALL to include an actual check of the trees too. In a way, this would be quite handy because we're actually depending on the trees in compute_mutation_parents and so on. We may not even be checking for these bad topologies at the moment in there.

I backed out of it because it seemed like this would imply another iteration over the trees at tsk_treeseq_t load time, because we would still have to iterate over them once to count the trees. Maybe it's not worth worrying about this, but I figured it's better to just let this one go and get the code shipped.

@benjeffery
Copy link
Member

Great, I think moving the contradictory children check is the right thing.

As for getting ALL to check the trees - I considered check_integrity returning the number of trees meaning that - tsk_treeseq_init_trees wouldn't need to iterate over the trees. But on checking tsk_treeseq_init_trees also calculates breakpoints in a separate loop. From a cursory inspection I couldn't tell why this is done in a separate loop?

@jeromekelleher
Copy link
Member Author

As for getting ALL to check the trees - I considered check_integrity returning the number of trees meaning that - tsk_treeseq_init_trees wouldn't need to iterate over the trees. But on checking tsk_treeseq_init_trees also calculates breakpoints in a separate loop. From a cursory inspection I couldn't tell why this is done in a separate loop?

It's a good idea, but I think it would complicate the interface a bit if check_integrity returned 0 on success sometimes and the number of trees other times. Maybe it's not that bad though, if we have TSK_CHECK_TREES instead of TSK_CHECK_ALL, which returns the number of trees. It would save a pass through the edges at load time, which must add up for big tree sequences.

It wouldn't take long to do, and it probably is the right thing to do, so we keep all the validation checks in one place. What do you reckon?

We need to know the number of trees so we can malloc stuff on a per-tree basis. I don't think there's any way around that.

@jeromekelleher
Copy link
Member Author

Note this would change the semantics a bit, so that TSK_CHECK_TREES would imply all the other checks, and not be the OR of the bits. We can't check the trees without all the other things being safe first. Would need a little bit of a refactor but not that much.

@benjeffery
Copy link
Member

benjeffery commented Jul 10, 2020

We need to know the number of trees so we can malloc stuff on a per-tree basis. I don't think there's any way around that.

Ah of course. I guess there would be a perf hit to mallocing a reasonably sized buffer and then realloc'ing to double it every time you hit the end? I don't yet have a good intuition for how long tree iteration takes compared to that kind of operation.

As for the return value - we have other operations (e.g. add_row) using positive returns, and if it only does this on TSK_CHECK_TREES then none of the existing != 0 need to change. It never felt right to have the checks split across two functions and this is a way to do it without an extra pass.

@jeromekelleher
Copy link
Member Author

Ready for another look @benjeffery

Copy link
Member

@benjeffery benjeffery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good - just the one comment about adding a test.

c/tests/test_tables.c Outdated Show resolved Hide resolved
@molpopgen
Copy link
Member

Just found this in my notifications. Taking a look.

@molpopgen
Copy link
Member

@jeromekelleher -- I think this all looks sensible. I'll admit that I've not dug into the details of this API before, though.

@jeromekelleher jeromekelleher force-pushed the docs-for-check-integrity branch 2 times, most recently from 412c554 to 3104ecd Compare July 15, 2020 16:52
@jeromekelleher jeromekelleher merged commit 0268fa7 into tskit-dev:master Jul 15, 2020
@jeromekelleher jeromekelleher deleted the docs-for-check-integrity branch July 15, 2020 17:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants