Add metadata column to edges #496

benjeffery · 2020-03-24T23:40:36Z

Part of the work in #493.

I realise I'm now three levels deep on PR requests, but at ~600 lines this seemed to warrant its own PR....

codecov · 2020-03-25T00:01:47Z

Codecov Report

Merging #496 into master will decrease coverage by 0.15%.
The diff coverage is 81.10%.

@@            Coverage Diff             @@
##           master     #496      +/-   ##
==========================================
- Coverage   87.23%   87.08%   -0.16%     
==========================================
  Files          21       21              
  Lines       15420    15597     +177     
  Branches     2999     3036      +37     
==========================================
+ Hits        13452    13583     +131     
- Misses        975     1006      +31     
- Partials      993     1008      +15

Flag	Coverage Δ
#c_tests	`88.17% <76.92%> (-0.28%)`	⬇️
#python_c_tests	`90.46% <95.94%> (+0.07%)`	⬆️
#python_tests	`99.18% <100.00%> (+<0.01%)`	⬆️

Impacted Files	Coverage Δ
c/tskit/tables.c	`78.15% <73.83%> (-0.54%)`	⬇️
python/_tskitmodule.c	`84.04% <94.91%> (+0.19%)`	⬆️
c/tskit/core.c	`90.41% <100.00%> (+0.13%)`	⬆️
c/tskit/trees.c	`90.93% <100.00%> (+0.01%)`	⬆️
python/tskit/tables.py	`99.79% <100.00%> (+<0.01%)`	⬆️
python/tskit/trees.py	`98.60% <100.00%> (+<0.01%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6ae6972...fa6c865. Read the comment docs.

benjeffery · 2020-03-25T08:42:05Z

@jeromekelleher This is all ready for you to have a gander.

jeromekelleher

Looks great, thanks @benjeffery. I know how tedious all that copy-pasting is...

One high-level point: do we want to increment FILE_VERSION_MAJOR for this? If all we were doing before the next release was adding edge metadata, I would say we should just increment the minor version and conditionally read in edge metadata if it's in the file. I think we want to get more flexible about having optional columns in the file format with well-defined defaults at some point anyway, so maybe now is the time to grasp that nettle. From a space perspective this will be important anyway, as adding an extra uint32 for each edge will be significant for things like UKB.

We could put in a special case to deal with edge metadata in this PR, and then think about a general mechanism for optional columns and defaults later. I think I've convinced myself that we do need to make this optional all right --- it's quite disruptive asking users to upgrade their files and the space argument is a clincher. Happy to be convinced otherwise though.

One other thing: when mucking with the low-level C module it's a good idea to run the stress_lowlevel.py script to make sure there are no memory leaks. I find just running the low-level module tests is enough,

python3 stress_lowlevel.py -m lowlevel

This should get the iteration count up pretty quickly. A good way to build up a feel for how this should look is to run it first for say 100 iterations. Then comment out one of your XDECREFs, run it again and see what happens.

python/_tskitmodule.c

python/tskit/tables.py

benjeffery · 2020-03-26T00:14:07Z

I know how tedious all that copy-pasting is...

Wasn't too bad as in doing it I got a good tour through the layers.

One high-level point: do we want to increment FILE_VERSION_MAJOR for this?

We need to at some point draw a line and from that point forward do everything we can to be backward compatible. We talked about this being 1.0 which seems to make sense as long as it lines up with how things are being used on the ground. If you think a breaking change would cause a lot of suffering already then we should start being backward compatible now.

I think optional columns and defaults is a separate issue (although linked in design considerations). Certainly one to consider for metadata on all tables. I'll have a play and see what this might look like for edges in the C.

jeromekelleher · 2020-03-26T08:39:14Z

I was thinking about the optional columns thing this morning, and if we align it with the semantics of set_columns then it wouldn't be too bad to do. The idea would be to read in some pointers to columns using kastore, and any that are marked optional and missing would be NULL. The semantics what's allowed would then be taken care of automaticaly. It would need a bit of replumbing at the bottom level and we'd need to get rid of the idea of using the kastore memory directly, but that's probably of very little value anyway and a few memcpys won't make any difference.

Anyway, I don't think we should do that here. We can do something like tsk_table_collection_load_indexes for the edge table. In my mind everything we're adding to the file format in terms of metadata is optional, so I'd like to keep the file format version major the same. I've no problem bumping the version number if we do find something we need to change that'll break old code, though.

benjeffery · 2020-03-27T00:27:53Z

@jeromekelleher So I've had a go at making metadata an optional column for edges. Basically the metadata and metadata_offset columns are NULL until the first row is added or row set appended that has metadata, at which point they are initialised. This saves the uint32 per edge for empty metadata. This also means the files are backwards compatible.

I have only done the C side of things - wanted to get your thoughts before going further. It is probably easier to see these specific changes by just looking at 8ec7142.

EDIT: Dammit! Just noticed the one spot I didn't put a new line between declarations and code ;)

jeromekelleher · 2020-03-27T09:41:49Z

Looks good, thanks @benjeffery. I think we should simplify it a bit on this first pass though, and follow the same logic as the rest of the tables. Metadata/metadata_offset are never NULL, as we set them to default values if they are not specified. If we do the same thing when we read the data in from file, this should minimise the changes.

It would be simpler to just do something specific for the edge_table_t (see the bit where we're reading in the indexes, e.g) rather than making a generic optional/non-optional mechanism. However, there's no point in taking out the changes you've made to the rest of the columns as we'll need them soon.

Figuring out when we do and don't write out columns is something we can follow up on. I don't think storing NULL is the way to do it though, we should always have valid memory for each of the columns in each table as part of the API contract. (For metadata, we can do something simple like if metadata_length == 0 don't store metadata or metadata offets)

benjeffery · 2020-03-27T11:35:36Z

@jeromekelleher Ok, so this latest push creates an empty metadata and metadata_offset if none are found in the file. Note that this means that we keep the uint32 extra per edge for now. Is this what you meant in your last comment?

c/tskit/tables.c

jeromekelleher · 2020-03-27T11:58:37Z

@jeromekelleher Ok, so this latest push creates an empty metadata and metadata_offset if none are found in the file. Note that this means that we keep the uint32 extra per edge for now. Is this what you meant in your last comment?

Yes, that's the idea. It's probably more efficient to go through this in person though, let's talk it through later.

benjeffery · 2020-03-30T11:24:09Z

@jeromekelleher I've added the metadata_malloced_locally flag and added a python test I realised was missing. We talked about doing the same for migrations, I guess it would almost be a copy-paste of this?

jeromekelleher

LGTM @benjeffery. One nitpick about indentation and we're good to merge.

Let's do the migrations as a follow-up PR - this is big enough already.

python/_tskitmodule.c

jeromekelleher · 2020-03-30T12:12:09Z

I'm assuming that codecov is being pedantic also, and we've covered everything that we reasonably can?

benjeffery · 2020-03-30T12:21:30Z

Thought I'd fixed that indent! Must have run git-fubar at some point.
Rebased and squashed. I think coverage is complaining about the patch as it adds a few lines that aren't covered as they are only run on OOM.

benjeffery mentioned this pull request Mar 24, 2020

[C API] Metadata schema field on tables #493

Merged

8 tasks

benjeffery force-pushed the edge-metadata branch from 1feefb9 to 0098fe7 Compare March 25, 2020 00:01

jeromekelleher reviewed Mar 25, 2020

View reviewed changes

python/_tskitmodule.c Outdated Show resolved Hide resolved

python/tskit/tables.py Show resolved Hide resolved

benjeffery force-pushed the edge-metadata branch from 0098fe7 to 8ec7142 Compare March 27, 2020 00:22

benjeffery force-pushed the edge-metadata branch from 8ec7142 to 5ef26ea Compare March 27, 2020 11:31

benjeffery force-pushed the edge-metadata branch from 5ef26ea to 599b51b Compare March 27, 2020 11:36

jeromekelleher reviewed Mar 27, 2020

View reviewed changes

c/tskit/tables.c Outdated Show resolved Hide resolved

benjeffery force-pushed the edge-metadata branch from d389c12 to d5b5b12 Compare March 28, 2020 23:56

benjeffery requested a review from jeromekelleher March 30, 2020 11:24

jeromekelleher approved these changes Mar 30, 2020

View reviewed changes

python/_tskitmodule.c Outdated Show resolved Hide resolved

Add metadata column to edges

fa6c865

benjeffery force-pushed the edge-metadata branch from f7adb73 to fa6c865 Compare March 30, 2020 12:19

jeromekelleher merged commit 4b285a5 into tskit-dev:master Mar 30, 2020

benjeffery deleted the edge-metadata branch April 18, 2020 22:58

benjeffery mentioned this pull request Jul 8, 2020

Edge metadata is not sorted when edges are. #708

Closed

Add metadata column to edges #496

Add metadata column to edges #496

Uh oh!

Conversation

benjeffery commented Mar 24, 2020

Uh oh!

codecov bot commented Mar 25, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

benjeffery commented Mar 25, 2020

Uh oh!

jeromekelleher left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

benjeffery commented Mar 26, 2020

Uh oh!

jeromekelleher commented Mar 26, 2020

Uh oh!

benjeffery commented Mar 27, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jeromekelleher commented Mar 27, 2020

Uh oh!

benjeffery commented Mar 27, 2020

Uh oh!

Uh oh!

jeromekelleher commented Mar 27, 2020

Uh oh!

benjeffery commented Mar 30, 2020

Uh oh!

jeromekelleher left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

jeromekelleher commented Mar 30, 2020

Uh oh!

benjeffery commented Mar 30, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov bot commented Mar 25, 2020 •

edited

Loading

benjeffery commented Mar 27, 2020 •

edited

Loading