
Use struct for node metadata #642

Open · wants to merge 1 commit into main
Conversation

@hyanwong (Member) commented Mar 5, 2022

Fixes #416

@hyanwong hyanwong marked this pull request as draft March 7, 2022 18:11
@hyanwong (Member, Author) commented Mar 7, 2022

This has the new metadata scheme, where the tsinfer-specific metadata is in a "tsinfer" key.

@codecov (bot) commented Mar 7, 2022

Codecov Report

Merging #642 (eb4fce8) into main (99cad13) will increase coverage by 0.00%.
The diff coverage is 100.00%.


@@           Coverage Diff           @@
##             main     #642   +/-   ##
=======================================
  Coverage   93.08%   93.09%           
=======================================
  Files          17       17           
  Lines        5136     5140    +4     
  Branches      954      954           
=======================================
+ Hits         4781     4785    +4     
  Misses        235      235           
  Partials      120      120           
Flag Coverage Δ
C 93.09% <100.00%> (+<0.01%) ⬆️
python 96.31% <100.00%> (+<0.01%) ⬆️


Impacted Files Coverage Δ
tsinfer/formats.py 98.43% <100.00%> (+<0.01%) ⬆️
tsinfer/inference.py 98.57% <100.00%> (+<0.01%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@hyanwong hyanwong marked this pull request as ready for review March 8, 2022 08:41
@hyanwong (Member, Author) commented Mar 8, 2022

This is ready for review. I reckon that if we merge it (or whatever we end up with) at the same time as #607, we'd be making both format changes at roughly the same time.

@jeromekelleher (Member) left a comment:

Looks good, but I'm not sure about the schema. I think we're overcomplicating it.

The metadata column should be interpretable as a 2D numpy array of int32s, so all the unused values are explicitly filled in as (-1, -1). There are no defaults then. I think this is much more useful in practice than using nulls and saving a few bytes.
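The fixed-width layout described here can be sketched with Python's `struct` module (the field names and the two-int32 layout are taken from this discussion; the actual tskit encoding is not shown, and a real column would come from a tree sequence rather than being packed by hand):

```python
import struct

# Hypothetical fixed-width node metadata: two int32 fields per node,
# with unused values stored explicitly as -1 rather than omitted.
ROW = struct.Struct("<ii")  # (ancestor_data_id, sample_data_id)

rows = [(0, -1), (1, -1), (-1, 5)]
column = b"".join(ROW.pack(*r) for r in rows)

# Because every row has the same width, the whole metadata column can be
# reinterpreted in one pass; with numpy this would be
#   numpy.frombuffer(column, dtype=numpy.int32).reshape(-1, 2)
decoded = [ROW.unpack_from(column, i * ROW.size)
           for i in range(len(column) // ROW.size)]
```

The point being that a fixed-width struct row makes the column directly addressable as a 2D array, which is not possible with variable-length JSON rows.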

tsinfer/formats.py (outdated)
"codec": "struct",
"type": ["object", "null"],
"properties": {
    "tsinfer": {
@jeromekelleher (Member):
I don't see the point in the "tsinfer" top key here.

@hyanwong (Member, Author):

I thought it was cleaner, because then when we e.g. add values inferred from tsdate we can store them in a tsdate property. That way it is clear where the different metadata values have come from, and they don't really risk overwriting each other etc.

@hyanwong (Member, Author):

Addendum: I see this as a nice advantage of struct: we can be verbose in the explanation (and nesting) of the properties in the metadata, and it doesn't take up extra space in the encoding of the metadata for each row. That's my understanding, anyway.

@jeromekelleher (Member):

The nice thing is that there is no storage cost to the additional level.

@hyanwong (Member, Author):

> The nice thing is that there is no storage cost to the additional level.

Yes, that's what I was trying to say (but you said it better)
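For concreteness, a sketch of the nested layout being discussed, written as a plain dict (this is illustrative, not taken from the PR diff; the "binaryFormat"/"index"/"default" keys follow the conventions of tskit's struct codec):

```python
import json

# Hypothetical nested struct schema with a "tsinfer" top-level key.
schema = {
    "codec": "struct",
    "type": ["object", "null"],
    "properties": {
        "tsinfer": {
            "type": "object",
            "properties": {
                "ancestor_data_id": {
                    "type": "number",
                    "binaryFormat": "i",
                    "index": 0,
                    "default": -1,
                    "description": "ID of this node in the tsinfer AncestorData file",
                },
                "sample_data_id": {
                    "type": "number",
                    "binaryFormat": "i",
                    "index": 1,
                    "default": -1,
                    "description": "ID of this node in the tsinfer SampleData file",
                },
            },
        },
    },
}

# Field names and descriptions live only in the schema: each encoded row
# is just the two packed int32s, so the extra nesting level adds no
# per-row storage cost.
serialised = json.dumps(schema)
```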

@jeromekelleher (Member):

As I say, the main point of using struct here for me would be to enable direct access to the metadata as numpy arrays. Let's get some info on doing that first before discussing the other details any further.

@hyanwong (Member, Author) commented Mar 8, 2022

Ah, that's not the main reason I would be using a struct, TBH. I didn't even think about wanting to get the entire array out. I can see that would be useful sometimes, though.

BTW, and not that it matters, but the sample_data_id is mostly unfilled, I think.

@jeromekelleher (Member):

If the info is sparse and you mainly plan on leaving things blank then there's no point at all in struct, I think. But being able to process these things efficiently in Python seems attractive to me.

@hyanwong (Member, Author) commented Mar 8, 2022

The sample_data_id value is sparse (I think). The ancestor_data_id is not, and storing 27 million copies of the identical string "ancestor_data_id" in JSON form in the public tree sequences seems pretty wasteful to me.

Edit - that's at least half a gigabyte in duplicated text in the tree sequence, I think?

@jeromekelleher (Member):

Maybe some concrete numbers for chr1 would be helpful.

@hyanwong (Member, Author) commented Mar 8, 2022

OK, for the p-arm of chr1, which is 123.4 MB (i.e. about 1/25th of the whole), we are storing 30 MB of node metadata, including just over a million copies of the byte string "ancestor_data_id" (≈ 16 MB), and no copies of the string "sample_data_id" (because we only use that for historically inserted samples, which we don't do in the unified ts).

@jeromekelleher (Member):

So what would that be at 8 bytes per node? Around 8 MB, I guess?

@hyanwong (Member, Author) commented Mar 8, 2022

> So what would that be at 8 bytes per node? Around 8 MB, I guess?

Yep, 8.9 MB. Note that the released unified genealogy tree sequences also don't contain the tsdate means and variances associated with node times (I'm not sure why not). If we included those, each node would gain JSON text such as

b'"mn":11039.99, "vr":1218813.29'

which I estimate would multiply the metadata storage requirements of the JSON format by ~2.5×, and would increase the struct version by another 8.9 MB if we store mean and var as 4-byte floats, or 17.8 MB if we use doubles.
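The quoted sizes are easy to sanity-check with back-of-envelope arithmetic (the node count here is rounded to the "just over a million" metadata rows mentioned for the chr1 p-arm, so the results are approximate):

```python
# Rough per-format metadata cost for ~1M nodes on the chr1 p-arm.
n_nodes = 1_000_000

# JSON repeats the 16-character key name in every row of metadata:
json_key_cost = n_nodes * len("ancestor_data_id")  # 16_000_000 bytes = 16 MB

# A struct encoding stores two 4-byte int32 fields per row instead:
struct_cost = n_nodes * 2 * 4  # 8_000_000 bytes = 8 MB
```

So the repeated key text alone already costs roughly twice what the entire struct-encoded payload would.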

@hyanwong (Member, Author) commented Mar 8, 2022

So if we store the tsdate mean and var, for all chromosomes in total we use up roughly 1.9 GB in a JSON encoding, and either 445 MB (storing floats) or 667 MB (storing doubles) for the struct encoding.

If we don't store the tsdate means and variances, we use up ~750 MB for JSON and 220 MB for struct.

@benjeffery (Member):

I've come and gone on the "top-level key with name of generating software" concept several times, but I think I've convinced myself that it is a bad idea. Here's the thinking: if downstream software wants to use a metric stored in metadata, it will have to know which software wrote it, likely hard-coding the name of that software. This would prevent a different piece of upstream software from replacing the first, or, even worse, cause it to write metrics under a key bearing the first software's name just to stay compatible!
In other words, the identity of the metric is its key, its name; the generating software is incidental (but should absolutely be recorded in the description of the field).
Tagging @hyanwong and @jeromekelleher, as I think this is something we need to decide for the ecosystem and put into the tskit docs at some point.

@hyanwong (Member, Author) commented Apr 8, 2022

Thanks @benjeffery. I guess there might be a difference between metadata that is generalisable, and metadata that is very tightly linked to the software package. In the case of the sample_data_id and ancestor_data_id from tsinfer, these values reference specific internal IDs created by the tsinfer inference process, so I can't imagine they would be useful outside that context.

That said, I'm perfectly happy to label them simply sample_data_id and to document their origin in the description field (or even to use tsinfer_sample_data_id, although that slightly undermines the justification of not hard-coding the software name).
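A sketch of what that flat alternative might look like: top-level field names, with the generating software recorded only in each field's description (the key names come from this discussion; the schema fragment itself is hypothetical, following tskit's struct-codec conventions):

```python
# Hypothetical flat "properties" section: no software-named top-level key,
# provenance recorded in each field's description instead.
flat_properties = {
    "sample_data_id": {
        "type": "number",
        "binaryFormat": "i",
        "index": 0,
        "default": -1,
        "description": "ID of this node in the SampleData file (written by tsinfer)",
    },
    "ancestor_data_id": {
        "type": "number",
        "binaryFormat": "i",
        "index": 1,
        "default": -1,
        "description": "ID of this node in the AncestorData file (written by tsinfer)",
    },
}
```

With this layout, downstream tools address the fields by name without hard-coding "tsinfer" anywhere, which is the interoperability point being made above.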

@benjeffery (Member) commented Apr 8, 2022

Yes, good point. What downstream uses are there of sample_data_id and ancestor_data_id? Is it just for analysing tsinfer's performance? I'm trying to get an idea of where they sit on the generalisability spectrum.

@hyanwong (Member, Author) commented Apr 8, 2022

Some of the rationale is at #301 and #604 (comment)

Basically, it is useful for manipulating the tree sequences when matching historical individuals into a tree sequence, inserting them, etc. This could be done either in the standard internal tsinfer routines or as part of a post-processing step.

Successfully merging this pull request may close these issues.

Add tsinfer specific schema for node metadata.