
Use struct for node metadata #642

Open · wants to merge 1 commit into main
Conversation

@hyanwong (Member) commented Mar 5, 2022

Fixes #416

@hyanwong hyanwong marked this pull request as draft March 7, 2022 18:11
@hyanwong (Member, Author) commented Mar 7, 2022

This has the new metadata scheme, where the tsinfer-specific metadata is in a "tsinfer" key.

@codecov (bot) commented Mar 7, 2022

Codecov Report

Merging #642 (eb4fce8) into main (99cad13) will increase coverage by 0.00%.
The diff coverage is 100.00%.


@@           Coverage Diff           @@
##             main     #642   +/-   ##
=======================================
  Coverage   93.08%   93.09%           
=======================================
  Files          17       17           
  Lines        5136     5140    +4     
  Branches      954      954           
=======================================
+ Hits         4781     4785    +4     
  Misses        235      235           
  Partials      120      120           
Flag Coverage Δ
C 93.09% <100.00%> (+<0.01%) ⬆️
python 96.31% <100.00%> (+<0.01%) ⬆️


Impacted Files Coverage Δ
tsinfer/formats.py 98.43% <100.00%> (+<0.01%) ⬆️
tsinfer/inference.py 98.57% <100.00%> (+<0.01%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@hyanwong hyanwong marked this pull request as ready for review March 8, 2022 08:41
@hyanwong (Member, Author) commented Mar 8, 2022

This is ready for review. I reckon that if we merge it (or whatever we end up with) at the same time as #607, we'd be making both format changes at roughly the same time.

@jeromekelleher (Member) left a comment:

Looks good, but I'm not sure about the schema. I think we're overcomplicating it.

The metadata column should be interpretable as a 2D numpy array of int32s, so all the unused values are explicitly filled in as (-1, -1). There are no defaults then. I think this is much more useful in practice than using nulls and saving a few bytes.
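The fixed-width layout described here can be sketched with Python's `struct` module (the field names and the two-int32 layout are taken from this discussion; the actual tskit encoding is not shown, and a real column would come from a tree sequence rather than being packed by hand):

```python
import struct

# Hypothetical fixed-width node metadata: two int32 fields per node,
# with unused values stored explicitly as -1 rather than omitted.
ROW = struct.Struct("<ii")  # (ancestor_data_id, sample_data_id)

rows = [(0, -1), (1, -1), (-1, 5)]
column = b"".join(ROW.pack(*r) for r in rows)

# Because every row has the same width, the whole metadata column can be
# reinterpreted in one pass; with numpy this would be
#   numpy.frombuffer(column, dtype=numpy.int32).reshape(-1, 2)
decoded = [ROW.unpack_from(column, i * ROW.size)
           for i in range(len(column) // ROW.size)]
```

The point being that a fixed-width struct row makes the column directly addressable as a 2D array, which is not possible with variable-length JSON rows.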

tsinfer/formats.py (outdated)
"codec": "struct",
"type": ["object", "null"],
"properties": {
    "tsinfer": {
@jeromekelleher (Member):
I don't see the point in the "tsinfer" top key here.

@hyanwong (Member, Author):

I thought it was cleaner, because then when we e.g. add values inferred from tsdate we can store them in a tsdate property. That way it is clear where the different metadata values have come from, and they don't really risk overwriting each other etc.

@hyanwong (Member, Author):

Addendum: I see this as a nice advantage of struct: we can be verbose in the explanation (and nesting) of the properties in the metadata, and it doesn't take up extra space in the encoding of the metadata for each row. That's my understanding, anyway.

@jeromekelleher (Member):

The nice thing is that there is no storage cost to the additional level.

@hyanwong (Member, Author):

> The nice thing is that there is no storage cost to the additional level.

Yes, that's what I was trying to say (but you said it better)
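For concreteness, a sketch of the nested layout being discussed, written as a plain dict (this is illustrative, not taken from the PR diff; the "binaryFormat"/"index"/"default" keys follow the conventions of tskit's struct codec):

```python
import json

# Hypothetical nested struct schema with a "tsinfer" top-level key.
schema = {
    "codec": "struct",
    "type": ["object", "null"],
    "properties": {
        "tsinfer": {
            "type": "object",
            "properties": {
                "ancestor_data_id": {
                    "type": "number",
                    "binaryFormat": "i",
                    "index": 0,
                    "default": -1,
                    "description": "ID of this node in the tsinfer AncestorData file",
                },
                "sample_data_id": {
                    "type": "number",
                    "binaryFormat": "i",
                    "index": 1,
                    "default": -1,
                    "description": "ID of this node in the tsinfer SampleData file",
                },
            },
        },
    },
}

# Field names and descriptions live only in the schema: each encoded row
# is just the two packed int32s, so the extra nesting level adds no
# per-row storage cost.
serialised = json.dumps(schema)
```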

@jeromekelleher (Member):

As I say, the main point of using struct here for me would be to enable direct access to the metadata as numpy arrays. Let's get some info on doing that first before discussing the other details any further.

@hyanwong (Member, Author) commented Mar 8, 2022

Ah, that's not the main reason I would be using a struct, TBH. I didn't even think about wanting to get the entire array out. I can see that would be useful sometimes, though.

BTW, and not that it matters, but the sample_data_id is mostly unfilled, I think.

@jeromekelleher (Member):

If the info is sparse and you mainly plan on leaving things blank then there's no point at all in struct, I think. But being able to process these things efficiently in Python seems attractive to me.

@hyanwong (Member, Author) commented Mar 8, 2022

The sample_data_id value is sparse (I think). The ancestor_data_id is not, and storing 27 million copies of the identical string "ancestor_data_id" in JSON form in the public tree sequences seems pretty wasteful to me.

Edit - that's at least half a gigabyte in duplicated text in the tree sequence, I think?

@jeromekelleher (Member):

Maybe some concrete numbers for chr1 would be helpful.

@hyanwong (Member, Author) commented Mar 8, 2022

OK, for the p-arm of chr1, which is 123.4 MB (i.e. about 1/25th of the whole), we are storing 30 MB of node metadata, including just over a million copies of the byte string "ancestor_data_id" (≈ 16 MB), and no copies of the string "sample_data_id" (because we only use that for historically inserted samples, which we don't do in the unified ts).

@jeromekelleher (Member):

So what would that be at 8 bytes per node? Around 8 MB, I guess?

@hyanwong (Member, Author) commented Mar 8, 2022

> So what would that be at 8 bytes per node? Around 8 MB, I guess?

Yep, 8.9 MB. Note that the released unified genealogy tree sequences also don't contain the tsdate means and variances associated with node times (I'm not sure why not). If we included those, each node would gain JSON text such as

b'"mn":11039.99, "vr":1218813.29'

which I estimate would multiply the metadata storage requirements of the JSON format by ~2.5×, and would increase the struct version by another 8.9 MB if we store mean and var as 4-byte floats, or 17.8 MB if we use doubles.
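The quoted sizes are easy to sanity-check with back-of-envelope arithmetic (the node count here is rounded to the "just over a million" metadata rows mentioned for the chr1 p-arm, so the results are approximate):

```python
# Rough per-format metadata cost for ~1M nodes on the chr1 p-arm.
n_nodes = 1_000_000

# JSON repeats the 16-character key name in every row of metadata:
json_key_cost = n_nodes * len("ancestor_data_id")  # 16_000_000 bytes = 16 MB

# A struct encoding stores two 4-byte int32 fields per row instead:
struct_cost = n_nodes * 2 * 4  # 8_000_000 bytes = 8 MB
```

So the repeated key text alone already costs roughly twice what the entire struct-encoded payload would.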

@hyanwong (Member, Author) commented Mar 8, 2022

So if we store the tsdate mean and var, for all chromosomes in total we use up roughly 1.9 GB in a JSON encoding, and either 445 MB (storing floats) or 667 MB (storing doubles) for the struct encoding.

If we don't store the tsdate means and variances, we use up ~750 MB for JSON and 220 MB for struct.

@benjeffery (Member):

I've come and gone on the "top-level key with name of generating software" concept several times, but I think I've convinced myself that it is a bad idea. Here's the thinking: if downstream software wants to use a metric stored in metadata, it will have to know which software wrote it, likely hard-coding the name of that software. This would prevent a different piece of upstream software from replacing the first, or, even worse, cause it to write metrics under a key bearing the first software's name just to stay compatible!
In other words, the identity of the metric is its key, its name; the generating software is incidental (but should absolutely be recorded in the description of the field).
Tagging @hyanwong and @jeromekelleher, as I think this is something we need to decide for the ecosystem and put into the tskit docs at some point.

@hyanwong (Member, Author) commented Apr 8, 2022

Thanks @benjeffery. I guess there might be a difference between metadata that is generalisable, and metadata that is very tightly linked to the software package. In the case of the sample_data_id and ancestor_data_id from tsinfer, these values reference specific internal IDs created by the tsinfer inference process, so I can't imagine they would be useful outside that context.

That said, I'm perfectly happy to label them simply sample_data_id and to document their origin in the description field (or even to use tsinfer_sample_data_id, although that slightly undermines the justification of not hard-coding the software name).
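A sketch of what that flat alternative might look like: top-level field names, with the generating software recorded only in each field's description (the key names come from this discussion; the schema fragment itself is hypothetical, following tskit's struct-codec conventions):

```python
# Hypothetical flat "properties" section: no software-named top-level key,
# provenance recorded in each field's description instead.
flat_properties = {
    "sample_data_id": {
        "type": "number",
        "binaryFormat": "i",
        "index": 0,
        "default": -1,
        "description": "ID of this node in the SampleData file (written by tsinfer)",
    },
    "ancestor_data_id": {
        "type": "number",
        "binaryFormat": "i",
        "index": 1,
        "default": -1,
        "description": "ID of this node in the AncestorData file (written by tsinfer)",
    },
}
```

With this layout, downstream tools address the fields by name without hard-coding "tsinfer" anywhere, which is the interoperability point being made above.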

@benjeffery (Member) commented Apr 8, 2022

Yes, good point. What downstream uses are there of sample_data_id and ancestor_data_id? Is it just for analysing tsinfer's performance? I'm trying to get an idea of where they sit on the generalisability spectrum.

@hyanwong (Member, Author) commented Apr 8, 2022

Some of the rationale is at #301 and #604 (comment)

Basically, it is useful for manipulating the tree sequences when matching historical individuals into a tree sequence, inserting them, etc. This could be done either in the standard internal tsinfer routines or as part of a post-processing step.

Successfully merging this pull request may close these issues.

Add tsinfer specific schema for node metadata.