
decoding struct metadata does not check for proper length #3254

@petrelharp

Description

This was turned up in tskit-dev/pyslim#317. Consider what happens if we assign metadata:

import tskit, pyslim
tables = tskit.TableCollection(sequence_length=1)
tables.nodes.add_row(time=0, metadata=b'{"ancestor_data_id": 1}')
print(tables.nodes[0].metadata)
# b'{"ancestor_data_id": 1}'
print(tables.nodes.metadata)
# [123  34  97 110  99 101 115 116 111 114  95 100  97 116  97  95 105 100
#   34  58  32  49 125]
print(tables.nodes.metadata_offset)
# [ 0 23]

and then change the schema:

nms = pyslim.slim_metadata_schemas['node']
tables.nodes.metadata_schema = nms
print(tables.nodes[0].metadata)
# {'slim_id': 8391162008449393275, 'is_vacant': [111]}
print(tables.nodes.metadata)
# [123  34  97 110  99 101 115 116 111 114  95 100  97 116  97  95 105 100
#   34  58  32  49 125]
print(tables.nodes.metadata_offset)
# [ 0 23]

This extends to the tree sequence:

ts = tables.tree_sequence()
ts.node(0).metadata
# {'slim_id': 8391162008449393275, 'is_vacant': [111]}
ts.tables.nodes.metadata_offset
# [ 0 23]

So - the metadata bytes (including the offsets) haven't changed, and even though the bytes are the wrong length for the new schema, they still decode. This is unfortunate because one can merrily go along decoding and re-encoding the metadata (like for n in ts.nodes: tables.nodes.append(n.replace(...))) without realizing you're silently losing data and introducing nonsense.

Hm - so this is a thing that just decode_row does:

nms.decode_row(b'{"ancestor_data_id": 1}')
# {'slim_id': 8391162008449393275, 'is_vacant': [111]}
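That bogus value is reproducible with the stdlib alone. The struct codec (as the traceback below shows) pulls each field's bytes off a shared iterator with islice, so an int64 decoder just grabs the first 8 bytes of the JSON and leaves the rest unread - a minimal sketch of that pattern (not the actual tskit code):

```python
import struct
from itertools import islice

# Mimic the struct-codec pattern from tskit's metadata.py: each field
# decoder takes exactly `size` bytes from a shared byte iterator.
def decode_int64(buffer):
    return struct.unpack("<q", bytes(islice(buffer, 8)))[0]

raw = b'{"ancestor_data_id": 1}'   # 23 bytes of JSON, not struct data
it = iter(raw)
print(decode_int64(it))   # 8391162008449393275 -- the bogus slim_id above
print(bytes(it))          # b'or_data_id": 1}' -- 15 bytes silently ignored
```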

We do already error if we try to decode something that's too short:

nms.decode_row(b'foo')
Traceback (most recent call last):
  File "<python-input-29>", line 1, in <module>
    nms.decode_row(b'foo')
    ~~~~~~~~~~~~~~^^^^^^^^
  File "/home/peter/projects/tskit-dev/tsvenv/lib/python3.13/site-packages/tskit/metadata.py", line 665, in <lambda>
    self.decode = lambda buffer: decoder(iter(buffer))
                                 ~~~~~~~^^^^^^^^^^^^^^
  File "/home/peter/projects/tskit-dev/tsvenv/lib/python3.13/site-packages/tskit/metadata.py", line 462, in decode_object_or_null
    key: sub_decoder(buffer)
         ~~~~~~~~~~~^^^^^^^^
  File "/home/peter/projects/tskit-dev/tsvenv/lib/python3.13/site-packages/tskit/metadata.py", line 506, in <lambda>
    return lambda buffer: struct.unpack(f, bytes(islice(buffer, size)))[0]
                          ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
struct.error: unpack requires a buffer of 8 bytes

The underlying code uses struct.unpack, which does notice when the provided buffer is the wrong length - too short or too long:

struct.unpack("<h", b'a')
# struct.error: unpack requires a buffer of 2 bytes
struct.unpack("<h", b'aa')
# (24929,)
struct.unpack("<h", b'aaa')
# struct.error: unpack requires a buffer of 2 bytes
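But that too-long error can only fire if struct.unpack is handed the whole buffer. In the decoder it gets bytes(islice(buffer, size)), which is at most size bytes (fewer only in the too-short case caught above), so the too-long branch can never trigger - for example:

```python
import struct
from itertools import islice

buf = iter(b"aaa")                               # 3 bytes; format needs 2
val = struct.unpack("<h", bytes(islice(buf, 2)))[0]
print(val)         # 24929, same as unpacking b'aa' -- no error raised
print(bytes(buf))  # b'a' -- the surplus byte is never examined
```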

and we do check the length of the buffer there - but since the decoder slices off exactly `size` bytes with islice, struct.unpack never sees any trailing bytes. Seems like a check whether we've exhausted the buffer would do it?
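One way that check could look (a hypothetical sketch, not the actual tskit API - decode_row_checked is an invented name): after running the decoder, try to pull one more byte from the iterator and raise if anything is left over:

```python
import struct
from itertools import islice

def decode_int64(buffer):
    return struct.unpack("<q", bytes(islice(buffer, 8)))[0]

# Hypothetical wrapper: decode, then verify the byte iterator is exhausted.
def decode_row_checked(decoder, data):
    buffer = iter(data)
    result = decoder(buffer)
    if next(buffer, None) is not None:
        raise ValueError("metadata has trailing bytes after decoding")
    return result

print(decode_row_checked(decode_int64, b"\x2a" + b"\x00" * 7))  # 42
# decode_row_checked(decode_int64, b'{"ancestor_data_id": 1}')
# -> ValueError: metadata has trailing bytes after decoding
```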
