1 change: 1 addition & 0 deletions appveyor.yml
@@ -23,6 +23,7 @@ install:
build_script:
- cmd: cd python
- cmd: python setup.py build_ext --inplace
# Install some modules needed for tests
- cmd: python -m pip install PyVCF
- cmd: python -m pip install newick
- cmd: python -m pip install python_jsonschema_objects
32 changes: 7 additions & 25 deletions docs/data-model.rst
@@ -492,32 +492,14 @@ record char Provenance record.
Metadata
========

Users of the tables API sometimes need to store auxiliary information for
the various entities defined here. For example, in a forwards-time simulation,
the simulation engine may wish to store the time at which a particular mutation
arose or some other pertinent information. If we are representing real data,
we may wish to store information derived from a VCF INFO field, or associate
information relating to samples or populations. The columns defined in tables
here are deliberately minimal: we define columns only for information which
the library itself can use. All other information is considered to be
**metadata**, and is stored in the ``metadata`` columns of the various
tables.

Arbitrary binary data can be stored in ``metadata`` columns, and the
``tskit`` library makes no attempt to interpret this information. How the
information held in this field is encoded is entirely the choice of client code.

To ensure that metadata can be safely interchanged using the :ref:`sec_text_file_format`,
each row is `base 64 encoded <https://en.wikipedia.org/wiki/Base64>`_. Thus,
binary information can be safely printed and exchanged, but may not be
human readable.

.. todo::
    We plan on providing more sophisticated tools for working with metadata
    in future, including automatic decoding of metadata via pluggable
    functions and the ability to store metadata schemas so that metadata
    is self-describing.
Each table (excluding provenance) has a metadata column for storing and passing along
information that tskit does not use or interpret. See :ref:`sec_metadata` for details.
The metadata columns are :ref:`binary columns <sec_tables_api_binary_columns>`.

When using the :ref:`sec_text_file_format`, to ensure that metadata can be safely
interchanged, each row is `base 64 encoded <https://en.wikipedia.org/wiki/Base64>`_.
Thus, binary information can be safely printed and exchanged, but may not be
human readable.
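
As a rough illustration of why base 64 encoding makes binary metadata safe to print,
the sketch below round-trips some bytes using only the Python standard library. The
``raw`` value is a made-up example, and this is not tskit's own text-format writer.

.. code-block:: python

    import base64

    # Hypothetical binary metadata containing bytes that are not printable text.
    raw = b"\x00\xffraw bytes"

    # Base 64 maps arbitrary bytes onto a printable ASCII alphabet.
    encoded = base64.b64encode(raw).decode("ascii")

    # Decoding recovers the original bytes exactly.
    decoded = base64.b64decode(encoded)

The encoded string is safe to place in a text file, but it no longer resembles the
original bytes, which is why it may not be human readable.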

.. _sec_valid_tree_sequence_requirements:

1 change: 1 addition & 0 deletions docs/index.rst
@@ -17,6 +17,7 @@ Welcome to tskit's documentation!
c-api
cli
data-model
metadata
provenance
development
tutorial
127 changes: 127 additions & 0 deletions docs/metadata.rst
@@ -0,0 +1,127 @@
.. _sec_metadata:

========
Metadata
========

Every entity (nodes, mutations, edges, etc.) in a tskit tree sequence can have
metadata associated with it. This is intended for storing and passing on information
that tskit itself does not use or interpret, for example information derived from a VCF
INFO field, or administrative information (such as unique identifiers) relating to
samples and populations. Note that provenance information about how a tree sequence
was created should not be stored in metadata; use the provenance mechanisms in
tskit instead. See :ref:`sec_provenance`.

The metadata for each entity type is described by a schema. This
schema allows the tskit Python API to encode and decode metadata and, most importantly,
tells downstream users and tools how to decode and interpret the metadata. This schema
is in the form of a
`JSON Schema <http://json-schema.org/>`_. A good guide to creating JSON Schemas is
`Understanding JSON Schema <https://json-schema.org/understanding-json-schema/>`_.

In the most common case, where the metadata schema specifies an object with properties,
the keys and types of those properties are specified along with optional
long-form names, descriptions,
and validations such as min/max or regex matching for strings. See
:ref:`sec_metadata_example` below. Names and descriptions can assist
downstream users in understanding and using the metadata. It is best practice to
populate these fields if your files will be used by any third party, or if you wish to
remember what they were some time after making the file!

The :ref:`sec_tutorial_metadata` Tutorial shows how to use schemas and access metadata
in the tskit Python API.

Note that the C API simply provides byte-array binary access to the metadata and
leaves encoding and decoding to the user. The same can be achieved with the Python
API; see :ref:`sec_tutorial_metadata_binary`.

******
Codecs
******

As the underlying metadata is stored as raw binary (see
:ref:`data model <sec_metadata_definition>`), it
must be encoded and decoded. In the Python API, this means converting to and from
Python objects.
The method for doing this is specified in the top-level schema property ``codec``.
Currently the Python API supports the ``json`` codec, which encodes metadata as
`JSON <https://www.json.org/json-en.html>`_. We plan to support more codecs soon, such
as an efficient binary encoding (see :issue:`535`). It is possible to define a custom
codec using :meth:`tskit.register_metadata_codec`. However, this should only be used
when necessary, as downstream users of the metadata will not be able to decode it
without the custom codec. For an example see :ref:`sec_tutorial_metadata_custom_codec`.
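
To make the encode/decode round trip concrete, here is a minimal sketch of what a
JSON-based codec does conceptually, written with the standard library only. The
``metadata`` value is hypothetical, and the exact bytes tskit produces (key order,
whitespace) are not specified here.

.. code-block:: python

    import json

    # Hypothetical metadata for one row, as a Python object.
    metadata = {"accession_number": 12345, "collection_date": "2020-03-31"}

    # Encode: Python object -> JSON text -> bytes suitable for a binary column.
    encoded = json.dumps(metadata).encode("utf-8")

    # Decode: bytes -> JSON text -> Python object.
    decoded = json.loads(encoded.decode("utf-8"))

Because JSON is a widely supported text format, bytes produced this way can be decoded
by downstream tools that know nothing about the program that wrote them.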

.. _sec_metadata_example:

*******
Example
*******


Here is a schema using the ``json`` codec which could apply, for example,
to the individuals in a tree sequence:

.. code-block:: json

    {
        "codec": "json",
        "type": "object",
        "properties": {
            "accession_number": {"type": "number"},
            "collection_date": {
                "name": "Collection date",
                "description": "Date of sample collection in ISO format",
                "type": "string",
                "pattern": "^([1-9][0-9]{3})-(1[0-2]|0[1-9])-(3[01]|0[1-9]|[12][0-9])?$"
            }
        },
        "required": ["accession_number"],
        "additionalProperties": false
    }

This schema states that the metadata for each row of the table
is an object consisting of two properties. Property ``accession_number`` is a number
which must be specified (it is included in the ``required`` list).
Property ``collection_date`` is an optional string which must satisfy a regex,
which checks it is a valid `ISO8601 <https://www.iso.org/iso-8601-date-and-time-format
.html>`_ date.
Any other properties are not allowed (``additionalProperties`` is false).
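
The three rules described above can be spelled out by hand as a sketch. The function
``check_metadata`` below is hypothetical and only mirrors this one example schema; it
is not how tskit validates metadata, and a real JSON Schema validator should be used in
practice.

.. code-block:: python

    import re

    # Pattern copied from the example schema above.
    DATE_PATTERN = r"^([1-9][0-9]{3})-(1[0-2]|0[1-9])-(3[01]|0[1-9]|[12][0-9])?$"
    ALLOWED_KEYS = {"accession_number", "collection_date"}


    def check_metadata(metadata):
        """Return a list of problems; an empty list means the object passes."""
        problems = []
        number = metadata.get("accession_number")
        if "accession_number" not in metadata:
            # Listed in "required", so it must be present.
            problems.append("accession_number is required")
        elif isinstance(number, bool) or not isinstance(number, (int, float)):
            problems.append("accession_number must be a number")
        if "collection_date" in metadata:
            # Optional, but if present it must be a string matching the regex.
            date = metadata["collection_date"]
            if not isinstance(date, str) or re.search(DATE_PATTERN, date) is None:
                problems.append("collection_date must match the date pattern")
        # "additionalProperties": false forbids any other keys.
        extra = set(metadata) - ALLOWED_KEYS
        if extra:
            problems.append("unexpected properties: " + ", ".join(sorted(extra)))
        return problems

For instance, ``check_metadata({"accession_number": 1})`` reports no problems, whereas
omitting ``accession_number`` or adding an unknown key produces at least one.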

.. _sec_metadata_api_overview:

****************************
Python Metadata API Overview
****************************

Schemas are represented in the Python API by the :class:`tskit.MetadataSchema`
class which can be assigned to, and retrieved from, tables via their ``metadata_schema``
attribute (e.g. :attr:`tskit.IndividualTable.metadata_schema`). The schemas
for all tables can be retrieved from a :class:`tskit.TreeSequence` by the
:attr:`tskit.TreeSequence.table_metadata_schemas` attribute.

Each table's ``add_row`` method (e.g. :meth:`tskit.IndividualTable.add_row`) will
validate and encode the metadata using the schema.

Metadata will be lazily decoded if accessed via
``tables.individuals[0].metadata`` or ``tree_sequence.individual(0).metadata``.

In the interests of efficiency, the bulk methods ``set_columns``
(e.g. :meth:`tskit.IndividualTable.set_columns`)
and ``append_columns`` (e.g. :meth:`tskit.IndividualTable.append_columns`) do not
validate or encode metadata. See :ref:`sec_tutorial_metadata_bulk` for how to prepare
metadata for these methods.
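
As a sketch of what preparing metadata for bulk methods involves, the example below
encodes each row as JSON and packs the results into one flat byte string plus an offset
list, the general layout used for variable-length columns. It uses only the standard
library; the ``rows`` data and the ``unpack`` helper are hypothetical, and in practice
the library's own helpers (such as ``tskit.pack_bytes``) should be preferred.

.. code-block:: python

    import json

    # Hypothetical per-row metadata objects.
    rows = [{"accession_number": 1}, {"accession_number": 22}]

    # Encode every row up front, since bulk methods do not do this for us.
    encoded_rows = [json.dumps(row).encode("utf-8") for row in rows]

    # Pack into one flat buffer; offsets[i]:offsets[i + 1] delimits row i.
    packed = b"".join(encoded_rows)
    offsets = [0]
    for item in encoded_rows:
        offsets.append(offsets[-1] + len(item))


    def unpack(packed, offsets):
        """Split the flat buffer back into one bytes object per row."""
        return [packed[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]

Note that the offset list has one more entry than there are rows, and its last entry
equals the total length of the packed buffer.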

Metadata processing can be disabled and raw bytes stored and retrieved. See
:ref:`sec_tutorial_metadata_binary`.

.. _sec_metadata_schema_schema:

***************
Full metaschema
***************

The schema for metadata schemas is formally defined using
`JSON Schema <http://json-schema.org/>`_ and given in full here. Any schema passed to
:class:`tskit.MetadataSchema` is validated against this metaschema.

.. literalinclude:: ../python/tskit/metadata_schema.schema.json
    :language: json
32 changes: 28 additions & 4 deletions docs/python-api.rst
@@ -321,10 +321,11 @@ Binary columns
Columns storing binary data take the same approach as
:ref:`sec_tables_api_text_columns` to encoding
:ref:`variable length data <sec_encoding_ragged_columns>`.
The difference between the two is that
only raw :class:`bytes` values are accepted: no character encoding or
decoding is done on the data. Consider the following example::

The difference between the two is that only raw :class:`bytes` values are accepted: no
character encoding or decoding is done on the data. Consider the following example,
in which the table has no ``metadata_schema``, so the Python API performs no automatic
encoding or decoding of objects and we can store and retrieve raw ``bytes``
(see :ref:`sec_metadata` for details)::

>>> t = tskit.NodeTable()
>>> t.add_row(metadata=b"raw bytes")
@@ -388,30 +389,37 @@ and use, see :ref:`the table definitions <sec_table_definitions>`.
.. autoclass:: tskit.IndividualTable()
    :members:
    :inherited-members:
    :special-members: __getitem__

.. autoclass:: tskit.NodeTable()
    :members:
    :inherited-members:
    :special-members: __getitem__

.. autoclass:: tskit.EdgeTable()
    :members:
    :inherited-members:
    :special-members: __getitem__

.. autoclass:: tskit.MigrationTable()
    :members:
    :inherited-members:
    :special-members: __getitem__

.. autoclass:: tskit.SiteTable()
    :members:
    :inherited-members:
    :special-members: __getitem__

.. autoclass:: tskit.MutationTable()
    :members:
    :inherited-members:
    :special-members: __getitem__

.. autoclass:: tskit.PopulationTable()
    :members:
    :inherited-members:
    :special-members: __getitem__

.. autoclass:: tskit.ProvenanceTable()
    :members:
@@ -461,6 +469,22 @@ Table functions

.. autofunction:: tskit.unpack_bytes

.. _sec_metadata_api:

********
Metadata
********

The ``metadata`` module provides validation, encoding and decoding of metadata
using a schema. See :ref:`sec_metadata`, :ref:`sec_metadata_api_overview` and
:ref:`sec_tutorial_metadata`.

.. autoclass:: tskit.MetadataSchema
    :members:
    :inherited-members:

.. autofunction:: tskit.register_metadata_codec

.. _sec_stats_api:

**********************