Skip to content

Conversation

@benjeffery
Copy link
Member

@benjeffery benjeffery commented Jul 31, 2020

Closes #603, #769

@benjeffery
Copy link
Member Author

Ok, I've treid to address all the comments and have done a read through. Ready for a check over @jeromekelleher and @petrelharp!

@benjeffery benjeffery requested a review from petrelharp August 13, 2020 12:16
@codecov
Copy link

codecov bot commented Aug 13, 2020

Codecov Report

Merging #741 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #741   +/-   ##
=======================================
  Coverage   87.72%   87.72%           
=======================================
  Files          24       24           
  Lines       19394    19400    +6     
  Branches     3640     3640           
=======================================
+ Hits        17013    17019    +6     
  Misses       1290     1290           
  Partials     1091     1091           
Flag Coverage Δ
#c_tests 85.33% <ø> (ø)
#python_c_tests 90.17% <100.00%> (+<0.01%) ⬆️
#python_tests 99.02% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
python/tskit/metadata.py 98.47% <100.00%> (+0.03%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2b12450...5240745. Read the comment docs.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, some minor comments.

@benjeffery
Copy link
Member Author

@jeromekelleher Fixed up. I think we wait for @petrelharp or @molpopgen to have a look before merging.

@benjeffery benjeffery marked this pull request as ready for review August 13, 2020 16:18
which created the tree-sequence file, as the exact metadata
stored will vary depending on the use case. Subsequent processes can add or modify the schemas
if they wish to add or modify to the types (or encoding) of the metadata. Most users of tree-sequence
files will not need to modify the schemas.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps add "The purpose of the schema is to say what information is stored in each metadata record, and how it is stored." - maybe after the first sentence?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in 837e0e8

This codec places extra restrictions on the schema:

#. Each property must have a ``binaryFormat``
This sets the binary encoding for used for the property.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"for used for"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 0b31382

Copy link
Contributor

@petrelharp petrelharp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

Something that I didn't see addressed is: what if the metadata don't match the schema? The tree sequence will be loaded (since metadata are decoded lazily), but you'll get errors when trying to decode it; and to figure out what's going on (and maybe retrieve the metadata) you might need to set the schema to the null schema and look at the raw bytes.

Oh, and also: should it be said somewhere that the metadata schema is stored as a string in the underlying tables? And that if created in python, this can be viewed by str(schema)?

To determine the binary encoding of each property in the metadata the ``binaryFormat`` key is used.
This describes the encoding for each property using ``struct``
`format characters <https://docs.python.org/3/library/struct.html#format-characters>`_.
For example an unsigned 8-byte integer can be specified with::
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably acknowledge some of the below as being modified from the struct docs?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 3d63b49

The codec stores the length of the array before the array data. The format used for the
length of the array can be chosen with ``arrayLengthFormat`` which must be one
of ``B``, ``H``, ``I``, ``L`` or ``Q`` which have the same meaning as in the numeric
types above. ``Q`` is the default. As an example::
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see #769

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in fecbe45 with added error message on overflow.

"codec": "struct",
"type": "object",
"properties": {
"accession_number": {"type": "number", "binaryFormat": "i"},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this be integer?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 8b80596

This schema states that the metadata for each row of the table
is an object consisting of two properties. Property ``accession_number`` is a number
(stored as a 4-byte int) which must be specified (it is included in the ``required``
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they're both required, since it's struct, and there isn't a required list any more (should there be, to emphasize?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 4dfd481


'Bob1234'
To modify the metadata of rows in tables use the :ref:`sec_tutorial_metadata_bulk`.
to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing end of the sentence?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 004263a

accessed with ``ts.metadata`` and ``ts.metadata_schema``.

If there is no schema for a table or top-level metadata, then no decoding is performed
and ``bytes`` will be returned.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is "no schema" the same as MetadataSchema(None)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 2f5f18e

tables.individuals.metadata_schema = schema
Now that the table has a schema calls to
This will overwrite any existing schema. Now that the table has a schema calls to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps it should be mentioned here that assigning the schema does not validate all pre-existing metadata entries.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note added in a84fefb

@benjeffery
Copy link
Member Author

Something that I didn't see addressed is: what if the metadata don't match the schema? The tree sequence will be loaded (since metadata are decoded lazily), but you'll get errors when trying to decode it; and to figure out what's going on (and maybe retrieve the metadata) you might need to set the schema to the null schema and look at the raw bytes.

Yes you'll likely get some hairy message from the decoder, or worse silently corrupted or missing data. There's not much that can be done apart from encouraging use of validate_and_encode_row rather than just encode_row.

@benjeffery
Copy link
Member Author

Thanks for the great comments @petrelharp I think I've addressed them all.

Copy link
Contributor

@petrelharp petrelharp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Thanks!!

@jeromekelleher
Copy link
Member

Looks like this is ready to go after a squash @benjeffery?

@benjeffery
Copy link
Member Author

I've squashed to two commits: one docs, one code change. Thanks for everyone's input to the metadata API and docs!

@mergify mergify bot merged commit 8dcfa21 into tskit-dev:master Aug 17, 2020
@benjeffery benjeffery deleted the metadata-docs branch August 18, 2020 10:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve metadata docs

3 participants