This was turned up in tskit-dev/pyslim#317. Consider what happens if we assign metadata:
import tskit, pyslim
tables = tskit.TableCollection(sequence_length=1)
tables.nodes.add_row(time=0, metadata=b'{"ancestor_data_id": 1}')
print(tables.nodes[0].metadata)
# b'{"ancestor_data_id": 1}'
print(tables.nodes.metadata)
# [123 34 97 110 99 101 115 116 111 114 95 100 97 116 97 95 105 100 34 58 32 49 125]
print(tables.nodes.metadata_offset)
# [ 0 23]
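(For orientation: the flat metadata column stores every row's bytes back to back, and the offset array gives each row's [start, end) slice. A minimal sketch of recovering row 0 by hand, using plain bytes rather than tskit:)

```python
# The flat column holds all rows' bytes; offsets delimit each row.
metadata = b'{"ancestor_data_id": 1}'
metadata_offset = [0, 23]
# Row 0 is the slice metadata[offset[0]:offset[1]]:
row0 = metadata[metadata_offset[0]:metadata_offset[1]]
print(row0)                                   # b'{"ancestor_data_id": 1}'
print(metadata_offset[1] - metadata_offset[0])  # 23 bytes in row 0
```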
and then change the schema:
nms = pyslim.slim_metadata_schemas['node']
tables.nodes.metadata_schema = nms
print(tables.nodes[0].metadata)
# {'slim_id': 8391162008449393275, 'is_vacant': [111]}
print(tables.nodes.metadata)
# [123 34 97 110 99 101 115 116 111 114 95 100 97 116 97 95 105 100 34 58 32 49 125]
print(tables.nodes.metadata_offset)
# [ 0 23]
This extends to the tree sequence:
ts = tables.tree_sequence()
ts.node(0).metadata
# {'slim_id': 8391162008449393275, 'is_vacant': [111]}
ts.tables.nodes.metadata_offset
# [ 0 23]
So - the metadata bytes (including the offsets) haven't changed, and even though the bytes are the wrong length, they decode. This is unfortunate because one can merrily go along decoding and re-encoding the metadata (like for n in ts.nodes: tables.nodes.append(n.replace(...))) without realizing you're silently losing data and introducing nonsense.
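To make the loss concrete, here is a hedged sketch using bare struct as a rough stand-in for a binary schema with an int64 followed by one byte (not the real SLiM schema, just an illustration): the decode/re-encode round trip shrinks 23 bytes of JSON to 9 bytes of struct data.

```python
import struct

raw = b'{"ancestor_data_id": 1}'  # 23 bytes of JSON text
# Decode as if the schema were: little-endian int64, then one uint8.
slim_id, is_vacant = struct.unpack_from("<qB", raw, 0)
# Re-encoding keeps only what the schema describes:
reencoded = struct.pack("<qB", slim_id, is_vacant)
print(len(raw), len(reencoded))  # 23 9 -- 14 bytes silently dropped
```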
Hm - so this is a thing that decode_row just does:
nms.decode_row(b'{"ancestor_data_id": 1}')
# {'slim_id': 8391162008449393275, 'is_vacant': [111]}
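Those nonsense values are just the JSON text reinterpreted: the first 8 bytes, read as a little-endian int64, give exactly that slim_id, and the ninth byte ('o') is 111:

```python
import struct

raw = b'{"ancestor_data_id": 1}'
# b'{"ancest' reinterpreted as a little-endian int64:
slim_id = struct.unpack("<q", raw[:8])[0]
print(slim_id)  # 8391162008449393275
print(raw[8])   # 111 == ord("o"), the reported is_vacant entry
```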
We do already error if we try to decode something that's too short:
nms.decode_row(b'foo')
Traceback (most recent call last):
File "<python-input-29>", line 1, in <module>
nms.decode_row(b'foo')
~~~~~~~~~~~~~~^^^^^^^^
File "/home/peter/projects/tskit-dev/tsvenv/lib/python3.13/site-packages/tskit/metadata.py", line 665, in <lambda>
self.decode = lambda buffer: decoder(iter(buffer))
~~~~~~~^^^^^^^^^^^^^^
File "/home/peter/projects/tskit-dev/tsvenv/lib/python3.13/site-packages/tskit/metadata.py", line 462, in decode_object_or_null
key: sub_decoder(buffer)
~~~~~~~~~~~^^^^^^^^
File "/home/peter/projects/tskit-dev/tsvenv/lib/python3.13/site-packages/tskit/metadata.py", line 506, in <lambda>
return lambda buffer: struct.unpack(f, bytes(islice(buffer, size)))[0]
~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
struct.error: unpack requires a buffer of 8 bytes
The underlying code uses struct.unpack, which does notice if the provided buffers are too long:
struct.unpack("<h", b'a')
# struct.error: unpack requires a buffer of 2 bytes
struct.unpack("<h", b'aa')
# (24929,)
struct.unpack("<h", b'aaa')
# struct.error: unpack requires a buffer of 2 bytes
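The reason the too-long case nevertheless slips through the schema codec is visible in the traceback above: the decoder iterates over the buffer and uses islice to hand struct.unpack exactly the bytes it needs, so the surplus bytes are simply never looked at. A minimal reproduction of that pattern:

```python
import struct
from itertools import islice

buf = iter(b'{"ancestor_data_id": 1}')  # 23-byte buffer, as an iterator
# islice hands struct.unpack exactly 8 bytes, so its length check passes:
value = struct.unpack("<q", bytes(islice(buf, 8)))[0]
leftover = bytes(buf)
print(value)          # 8391162008449393275
print(len(leftover))  # 15 bytes never inspected -- no error raised
```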
and we do check the length of the buffer here. Seems like a check on whether we've exhausted the buffer would do it?
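A rough sketch of what such a check could look like (hypothetical helper name and signature; the real fix would live inside tskit's metadata codec, not in user code):

```python
import struct
from itertools import islice

def decode_exact(buffer_bytes, fmt="<q"):
    # Hypothetical: decode as the schema codec does, then verify the
    # iterator is exhausted, so trailing bytes raise instead of vanishing.
    it = iter(buffer_bytes)
    value = struct.unpack(fmt, bytes(islice(it, struct.calcsize(fmt))))[0]
    trailing = bytes(it)
    if trailing:
        raise ValueError(f"{len(trailing)} trailing metadata bytes not decoded")
    return value

decode_exact(b'{"ancest')                    # fine: exactly 8 bytes
# decode_exact(b'{"ancestor_data_id": 1}')   # would raise ValueError
```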