Refactor table collection integrity checks and document. #709

jeromekelleher · 2020-07-08T13:51:59Z

This PR documents the tsk_table_collection_check_integrity function and refactors the set of options a bit.

As part of the process of documenting the options for this function I realised that they were a confusing mishmash, and I spent a long time trying to figure out some simple and clear semantics. We had made special cases of the mutation parents and mutation times basically for the sake of compute_mutation_parents, but it's actually much simpler to just set the values to NULL beforehand.

We do change the semantics of mutation time a little bit here too. Mixing known and unknown times at a site is fine, but we can have situations where a string of unknown-time mutations separates mutations with known times, and we need to check that these times don't conflict.
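
To make the mutation-time rule above concrete, here is a minimal sketch in C. It assumes, purely for illustration, that the mutations at a site are listed oldest first (so known times must be non-increasing in table order) and that an unknown time is represented by NaN. The names `check_site_mutation_times` and `time_is_unknown` are hypothetical and not part of the tskit API, and the real check may well differ in detail.

```c
#include <math.h>
#include <stddef.h>

/* Assumption for this sketch: an unknown mutation time is any NaN value. */
static int
time_is_unknown(double t)
{
    return isnan(t);
}

/* Check the times of the mutations at a single site, given in table order.
 * Unknown times are skipped, so two known times separated by a run of
 * unknown ones are still compared with each other.
 * Returns 0 if the known times are non-increasing, -1 otherwise. */
int
check_site_mutation_times(const double *time, size_t num_mutations)
{
    double last_known = INFINITY;
    size_t j;

    for (j = 0; j < num_mutations; j++) {
        if (time_is_unknown(time[j])) {
            continue;
        }
        if (time[j] > last_known) {
            return -1;
        }
        last_known = time[j];
    }
    return 0;
}
```

With this rule, the times `{5.0, unknown, unknown, 3.0}` pass, while `{3.0, unknown, unknown, 5.0}` are rejected even though the two known values are not adjacent.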

Hopefully the whole thing is a bit easier to follow now, and maybe even a little bit more efficient too.

@benjeffery, @molpopgen, would you mind taking a look at the documentation and seeing if it makes sense to you as C API users? (Any other comments would be much appreciated too, of course!)

Closes #649
Closes #592

codecov · 2020-07-08T14:25:18Z

Codecov Report

Merging #709 into master will increase coverage by 2.32%.
The diff coverage is 96.19%.

@@            Coverage Diff             @@
##           master     #709      +/-   ##
==========================================
+ Coverage   85.41%   87.74%   +2.32%     
==========================================
  Files           8       24      +16     
  Lines        9546    19105    +9559     
  Branches     1827     3563    +1736     
==========================================
+ Hits         8154    16764    +8610     
- Misses        795     1273     +478     
- Partials      597     1068     +471

| Flag | Coverage Δ |
|---|---|
| #c_tests | `85.43% <96.19%> (+0.02%)` ⬆️ |
| #python_c_tests | `90.08% <ø> (?)` |
| #python_tests | `99.00% <ø> (?)` |

| Impacted Files | Coverage Δ |
|---|---|
| c/tskit/core.h | `100.00% <ø> (ø)` |
| c/tskit/tables.c | `81.37% <95.95%> (+0.31%)` ⬆️ |
| c/tskit/core.c | `93.93% <100.00%> (+0.04%)` ⬆️ |
| c/tskit/trees.c | `90.45% <100.00%> (-0.21%)` ⬇️ |
| python/tskit/provenance.py | `100.00% <0.00%> (ø)` |
| python/tskit/vcf.py | `100.00% <0.00%> (ø)` |
| python/tskit/__init__.py | `100.00% <0.00%> (ø)` |
| python/tskit/cli.py | `96.29% <0.00%> (ø)` |
... and 12 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Last update 667abe6...2a99090.

benjeffery · 2020-07-09T11:08:55Z

Will take a look at this later today.

benjeffery

Nice. This is a real improvement on the flags, very minor comments!

jeromekelleher · 2020-07-10T08:40:51Z

Thanks for the review @benjeffery. Clarifying this made me realise that we can fix an annoying problem with the tree sequence object, in that we can now detect any bad topologies at load time. There's no good reason not to, I think, other than potentially breaking people's code. But if their code depends on loading bad topologies, then I'm sure it must be broken anyway.

jeromekelleher · 2020-07-10T08:47:11Z

I considered changing TSK_CHECK_ALL to include an actual check of the trees too. In a way, this would be quite handy because we're actually depending on the trees in compute_mutation_parents and so on. We may not even be checking for these bad topologies at the moment in there.

I backed out of it because it seemed like this would imply another iteration over the trees at tsk_treeseq_t load time, because we would still have to iterate over them once to count the trees. Maybe it's not worth worrying about this, but I figured it's better to just let this one go and get the code shipped.

benjeffery · 2020-07-10T11:08:27Z

Great, I think moving the contradictory children check is the right thing.

As for getting ALL to check the trees: I considered having check_integrity return the number of trees, meaning that tsk_treeseq_init_trees wouldn't need to iterate over the trees. But on checking, tsk_treeseq_init_trees also calculates breakpoints in a separate loop. From a cursory inspection I couldn't tell why this is done in a separate loop.

jeromekelleher · 2020-07-10T11:17:56Z

> As for getting ALL to check the trees: I considered having check_integrity return the number of trees, meaning that tsk_treeseq_init_trees wouldn't need to iterate over the trees. But on checking, tsk_treeseq_init_trees also calculates breakpoints in a separate loop. From a cursory inspection I couldn't tell why this is done in a separate loop.

It's a good idea, but I think it would complicate the interface a bit if check_integrity returned 0 on success sometimes and the number of trees other times. Maybe it's not that bad though, if we have TSK_CHECK_TREES instead of TSK_CHECK_ALL, which returns the number of trees. It would save a pass through the edges at load time, which must add up for big tree sequences.

It wouldn't take long to do, and it probably is the right thing to do, so we keep all the validation checks in one place. What do you reckon?

We need to know the number of trees so we can malloc stuff on a per-tree basis. I don't think there's any way around that.

jeromekelleher · 2020-07-10T11:19:55Z

Note this would change the semantics a bit, so that TSK_CHECK_TREES would imply all the other checks, and not be the OR of the bits. We can't check the trees without all the other things being safe first. Would need a little bit of a refactor but not that much.

benjeffery · 2020-07-10T11:28:45Z

> We need to know the number of trees so we can malloc stuff on a per-tree basis. I don't think there's any way around that.

Ah of course. I guess there would be a perf hit to mallocing a reasonably sized buffer and then realloc'ing to double it every time you hit the end? I don't yet have a good intuition for how long tree iteration takes compared to that kind of operation.
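
For reference, the grow-by-doubling approach mentioned above looks roughly like the sketch below. This is a generic illustration rather than anything taken from tskit: `growable_t` and its functions are hypothetical names, and the element type is `double` only because breakpoints are coordinates.

```c
#include <stdlib.h>

/* A hypothetical growable array of doubles, used only for this sketch. */
typedef struct {
    double *data;
    size_t length;   /* number of values stored */
    size_t capacity; /* number of values allocated */
} growable_t;

/* Allocate the initial buffer. Returns 0 on success, -1 on failure. */
int
growable_init(growable_t *self, size_t initial_capacity)
{
    if (initial_capacity == 0) {
        initial_capacity = 1;
    }
    self->data = malloc(initial_capacity * sizeof(*self->data));
    if (self->data == NULL) {
        return -1;
    }
    self->length = 0;
    self->capacity = initial_capacity;
    return 0;
}

/* Append a value, doubling the buffer when it is full.
 * Returns 0 on success, -1 if the reallocation fails. */
int
growable_append(growable_t *self, double value)
{
    if (self->length == self->capacity) {
        size_t new_capacity = 2 * self->capacity;
        double *tmp = realloc(self->data, new_capacity * sizeof(*tmp));

        if (tmp == NULL) {
            return -1; /* the old buffer is still valid and owned by self */
        }
        self->data = tmp;
        self->capacity = new_capacity;
    }
    self->data[self->length] = value;
    self->length++;
    return 0;
}

void
growable_free(growable_t *self)
{
    free(self->data);
    self->data = NULL;
    self->length = 0;
    self->capacity = 0;
}
```

Doubling keeps the amortised cost of an append constant, at the price of up to twice the memory actually needed and an occasional copy when realloc has to move the buffer. Counting the trees first avoids both, but needs the extra pass discussed in this thread.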

As for the return value: we have other operations (e.g. add_row) using positive returns, and if it only does this on TSK_CHECK_TREES then none of the existing `!= 0` checks need to change. It never felt right to have the checks split across two functions, and this is a way to do it without an extra pass.
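
A toy sketch of the return-value convention being discussed: negative on error, 0 on success, and the number of trees only when the tree check is requested, with that flag implying the other checks. Everything here is hypothetical (the `TOY_` flags, `toy_tables_t`, and representing the input as a sorted list of breakpoints); it illustrates the calling convention only, not how tskit validates tables or counts trees.

```c
#include <stddef.h>

/* Hypothetical option flags and error codes for this sketch. */
#define TOY_CHECK_ORDERING (1u << 0)
#define TOY_CHECK_TREES (1u << 1)

#define TOY_ERR_BAD_INPUT (-1)
#define TOY_ERR_UNSORTED (-2)

/* Toy stand-in for a table collection: the tree boundaries along the
 * sequence, which should start at 0 and be strictly increasing. */
typedef struct {
    const double *breakpoints;
    size_t num_breakpoints;
} toy_tables_t;

/* Returns a negative error code on failure. On success returns 0, or the
 * number of trees if TOY_CHECK_TREES was requested. */
int
toy_check_integrity(const toy_tables_t *tables, unsigned int options)
{
    size_t j;

    /* The tree check is only safe once the other checks have passed,
     * so requesting it implies them rather than being OR'd with them. */
    if (options & TOY_CHECK_TREES) {
        options |= TOY_CHECK_ORDERING;
    }
    /* Basic checks that always run. */
    if (tables->num_breakpoints < 2 || tables->breakpoints[0] != 0.0) {
        return TOY_ERR_BAD_INPUT;
    }
    if (options & TOY_CHECK_ORDERING) {
        for (j = 1; j < tables->num_breakpoints; j++) {
            if (tables->breakpoints[j] <= tables->breakpoints[j - 1]) {
                return TOY_ERR_UNSORTED;
            }
        }
    }
    if (options & TOY_CHECK_TREES) {
        return (int) (tables->num_breakpoints - 1);
    }
    return 0;
}
```

A caller that never passes the tree flag can keep testing `ret != 0`, while the loader can test `ret < 0` and otherwise use the positive value to size its per-tree allocations.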

jeromekelleher · 2020-07-10T14:01:24Z

Ready for another look @benjeffery

benjeffery

Looks good - just the one comment about adding a test.

molpopgen · 2020-07-10T18:25:54Z

Just found this in my notifications. Taking a look.

molpopgen · 2020-07-10T19:21:31Z

@jeromekelleher -- I think this all looks sensible. I'll admit that I've not dug into the details of this API before, though.

jeromekelleher requested review from benjeffery and petrelharp July 8, 2020 13:51

jeromekelleher force-pushed the docs-for-check-integrity branch from 8dc0bdf to 854c953 Compare July 8, 2020 13:57

jeromekelleher mentioned this pull request Jul 9, 2020

Index lifecycle docs #713

Merged

benjeffery reviewed Jul 10, 2020

c/tskit/tables.h

c/tskit/tables.c

c/tskit/tables.h

jeromekelleher force-pushed the docs-for-check-integrity branch 2 times, most recently from 457df80 to 078ac79 Compare July 10, 2020 08:40

jeromekelleher force-pushed the docs-for-check-integrity branch from 078ac79 to f9133af Compare July 10, 2020 08:43

benjeffery approved these changes Jul 10, 2020

jeromekelleher force-pushed the docs-for-check-integrity branch from 247fafc to db8b36f Compare July 10, 2020 14:01

jeromekelleher force-pushed the docs-for-check-integrity branch from db8b36f to 412c554 Compare July 10, 2020 14:02

benjeffery reviewed Jul 10, 2020

c/tests/test_tables.c

jeromekelleher added 3 commits July 15, 2020 17:17

Refactor table collection integrity checks and document.

9561415

Closes tskit-dev#649 Closes tskit-dev#592

Detect bad tree topologies at load time.

4757193

Move tree checks into table collection.

15305e9

This removes the need for an extra iteration over the trees during tree sequence init.

jeromekelleher force-pushed the docs-for-check-integrity branch 2 times, most recently from 412c554 to 3104ecd Compare July 15, 2020 16:52

Add version to circleCI cache key.

2a99090

jeromekelleher merged commit 0268fa7 into tskit-dev:master Jul 15, 2020

jeromekelleher deleted the docs-for-check-integrity branch July 15, 2020 17:56