Skip to content

Conversation

@benjeffery
Copy link
Member

Part of the work in #493.

I realise I'm now three levels deep on PR requests, but at ~600 lines this seemed to warrant its own PR....

@codecov
Copy link

codecov bot commented Mar 25, 2020

Codecov Report

Merging #496 into master will decrease coverage by 0.15%.
The diff coverage is 81.10%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #496      +/-   ##
==========================================
- Coverage   87.23%   87.08%   -0.16%     
==========================================
  Files          21       21              
  Lines       15420    15597     +177     
  Branches     2999     3036      +37     
==========================================
+ Hits        13452    13583     +131     
- Misses        975     1006      +31     
- Partials      993     1008      +15     
Flag Coverage Δ
#c_tests 88.17% <76.92%> (-0.28%) ⬇️
#python_c_tests 90.46% <95.94%> (+0.07%) ⬆️
#python_tests 99.18% <100.00%> (+<0.01%) ⬆️
Impacted Files Coverage Δ
c/tskit/tables.c 78.15% <73.83%> (-0.54%) ⬇️
python/_tskitmodule.c 84.04% <94.91%> (+0.19%) ⬆️
c/tskit/core.c 90.41% <100.00%> (+0.13%) ⬆️
c/tskit/trees.c 90.93% <100.00%> (+0.01%) ⬆️
python/tskit/tables.py 99.79% <100.00%> (+<0.01%) ⬆️
python/tskit/trees.py 98.60% <100.00%> (+<0.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6ae6972...fa6c865. Read the comment docs.

@benjeffery
Copy link
Member Author

@jeromekelleher This is all ready for you to have a gander.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks @benjeffery. I know how tedious all that copy-pasting is...

One high-level point: do we want to increment FILE_VERSION_MAJOR for this? If all we were doing before the next release was adding edge metadata, I would say we should just increment the minor version and conditionally read in edge metadata if it's in the file. I think we want to get more flexible about having optional columns in the file format with well-defined defaults at some point anyway, so maybe now is the time to grasp that nettle. From a space perspective this will be important anyway, as adding an extra uint32 for each edge will be significant for things like UKB.

We could put in a special case to deal with edge metadata in this PR, and then think about a general mechanism for optional columns and defaults later. I think I've convinced myself that we do need to make this optional all right --- it's quite disruptive asking users to upgrade their files and the space argument is a clincher. Happy to be convinced otherwise though.

One other thing: when mucking with the low-level C module it's a good idea to run the stress_lowlevel.py script to make sure there are no memory leaks. I find just running the low-level module tests is enough,

python3 stress_lowlevel.py -m lowlevel

This should get the iteration count up pretty quickly. A good way to build up a feel for how this should look is to run it first for say 100 iterations. Then comment out one of your XDECREFs, run it again and see what happens.

@benjeffery
Copy link
Member Author

I know how tedious all that copy-pasting is...

Wasn't too bad as in doing it I got a good tour through the layers.

One high-level point: do we want to increment FILE_VERSION_MAJOR for this?

We need to at some point draw a line and from that point forward do everything we can to be backward compatible. We talked about this being 1.0 which seems to make sense as long as it lines up with how things are being used on the ground. If you think a breaking change would cause a lot of suffering already then we should start being backward compatible now.

I think optional columns and defaults is a separate issue (although linked in design considerations). Certainly one to consider for metadata on all tables. I'll have a play and see what this might look like for edges in the C.

@jeromekelleher
Copy link
Member

I was thinking about the optional columns thing this morning, and if we align it with the semantics of set_columns then it wouldn't be too bad to do. The idea would be to read in some pointers to columns using kastore, and any that are marked optional and missing would be NULL. The semantics what's allowed would then be taken care of automaticaly. It would need a bit of replumbing at the bottom level and we'd need to get rid of the idea of using the kastore memory directly, but that's probably of very little value anyway and a few memcpys won't make any difference.

Anyway, I don't think we should do that here. We can do something like tsk_table_collection_load_indexes for the edge table. In my mind everything we're adding to the file format in terms of metadata is optional, so I'd like to keep the file format version major the same. I've no problem bumping the version number if we do find something we need to change that'll break old code, though.

@benjeffery
Copy link
Member Author

benjeffery commented Mar 27, 2020

@jeromekelleher So I've had a go at making metadata an optional column for edges. Basically the metadata and metadata_offset columns are NULL until the first row is added or row set appended that has metadata, at which point they are initialised. This saves the uint32 per edge for empty metadata. This also means the files are backwards compatible.

I have only done the C side of things - wanted to get your thoughts before going further. It is probably easier to see these specific changes by just looking at 8ec7142.

EDIT: Dammit! Just noticed the one spot I didn't put a new line between declarations and code ;)

@jeromekelleher
Copy link
Member

Looks good, thanks @benjeffery. I think we should simplify it a bit on this first pass though, and follow the same logic as the rest of the tables. Metadata/metadata_offset are never NULL, as we set them to default values if they are not specified. If we do the same thing when we read the data in from file, this should minimise the changes.

It would be simpler to just do something specific for the edge_table_t (see the bit where we're reading in the indexes, e.g) rather than making a generic optional/non-optional mechanism. However, there's no point in taking out the changes you've made to the rest of the columns as we'll need them soon.

Figuring out when we do and don't write out columns is something we can follow up on. I don't think storing NULL is the way to do it though, we should always have valid memory for each of the columns in each table as part of the API contract. (For metadata, we can do something simple like if metadata_length == 0 don't store metadata or metadata offets)

@benjeffery
Copy link
Member Author

@jeromekelleher Ok, so this latest push creates an empty metadata and metadata_offset if none are found in the file. Note that this means that we keep the uint32 extra per edge for now. Is this what you meant in your last comment?

@jeromekelleher
Copy link
Member

@jeromekelleher Ok, so this latest push creates an empty metadata and metadata_offset if none are found in the file. Note that this means that we keep the uint32 extra per edge for now. Is this what you meant in your last comment?

Yes, that's the idea. It's probably more efficient to go through this in person though, let's talk it through later.

@benjeffery
Copy link
Member Author

@jeromekelleher I've added the metadata_malloced_locally flag and added a python test I realised was missing. We talked about doing the same for migrations, I guess it would almost be a copy-paste of this?

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM @benjeffery. One nitpick about indentation and we're good to merge.

Let's do the migrations as a follow-up PR - this is big enough already.

@jeromekelleher
Copy link
Member

I'm assuming that codecov is being pedantic also, and we've covered everything that we reasonably can?

@benjeffery
Copy link
Member Author

Thought I'd fixed that indent! Must have run git-fubar at some point.
Rebased and squashed. I think coverage is complaining about the patch as it adds a few lines that aren't covered as they are only run on OOM.

@jeromekelleher jeromekelleher merged commit 4b285a5 into tskit-dev:master Mar 30, 2020
@benjeffery benjeffery deleted the edge-metadata branch April 18, 2020 22:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants