This was turned up in tskit-dev/pyslim#317. Consider what happens if we assign metadata:
import tskit, pyslim
tables = tskit.TableCollection(sequence_length=1)
tables.nodes.add_row(time=0, metadata=b'{"ancestor_data_id": 1}')
print(tables.nodes[0].metadata)
# b'{"ancestor_data_id": 1}'
print(tables.nodes.metadata)
# [123 34 97 110 99 101 115 116 111 114 95 100 97 116 97 95 105 100 34 58 32 49 125]
print(tables.nodes.metadata_offset)
# [ 0 23]
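(For orientation: the flat metadata column stores every row's bytes back to back, and the offset array gives each row's [start, end) slice. A minimal sketch of recovering row 0 by hand, using plain bytes rather than tskit:)

```python
# The flat column holds all rows' bytes; offsets delimit each row.
metadata = b'{"ancestor_data_id": 1}'
metadata_offset = [0, 23]
# Row 0 is the slice metadata[offset[0]:offset[1]]:
row0 = metadata[metadata_offset[0]:metadata_offset[1]]
print(row0)                                   # b'{"ancestor_data_id": 1}'
print(metadata_offset[1] - metadata_offset[0])  # 23 bytes in row 0
```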
and then change the schema:
nms = pyslim.slim_metadata_schemas['node']
tables.nodes.metadata_schema = nms
print(tables.nodes[0].metadata)
# {'slim_id': 8391162008449393275, 'is_vacant': [111]}
print(tables.nodes.metadata)
# [123 34 97 110 99 101 115 116 111 114 95 100 97 116 97 95 105 100 34 58 32 49 125]
print(tables.nodes.metadata_offset)
# [ 0 23]
This extends to the tree sequence:
ts = tables.tree_sequence()
ts.node(0).metadata
# {'slim_id': 8391162008449393275, 'is_vacant': [111]}
ts.tables.nodes.metadata_offset
# [ 0 23]
So - the metadata bytes (including the offsets) haven't changed, and even though the bytes are the wrong length, they decode. This is unfortunate because one can merrily go along decoding and re-encoding the metadata (like for n in ts.nodes: tables.nodes.append(n.replace(...))) without realizing you're silently losing data and introducing nonsense.
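To make the loss concrete, here is a hedged sketch using bare struct as a rough stand-in for a binary schema with an int64 followed by one byte (not the real SLiM schema, just an illustration): the decode/re-encode round trip shrinks 23 bytes of JSON to 9 bytes of struct data.

```python
import struct

raw = b'{"ancestor_data_id": 1}'  # 23 bytes of JSON text
# Decode as if the schema were: little-endian int64, then one uint8.
slim_id, is_vacant = struct.unpack_from("<qB", raw, 0)
# Re-encoding keeps only what the schema describes:
reencoded = struct.pack("<qB", slim_id, is_vacant)
print(len(raw), len(reencoded))  # 23 9 -- 14 bytes silently dropped
```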
Hm - so this is a thing that decode_row just does:
nms.decode_row(b'{"ancestor_data_id": 1}')
# {'slim_id': 8391162008449393275, 'is_vacant': [111]}
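Those nonsense values are just the JSON text reinterpreted: the first 8 bytes, read as a little-endian int64, give exactly that slim_id, and the ninth byte ('o') is 111:

```python
import struct

raw = b'{"ancestor_data_id": 1}'
# b'{"ancest' reinterpreted as a little-endian int64:
slim_id = struct.unpack("<q", raw[:8])[0]
print(slim_id)  # 8391162008449393275
print(raw[8])   # 111 == ord("o"), the reported is_vacant entry
```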
We do already error if we try to decode something that's too short:
nms.decode_row(b'foo')
Traceback (most recent call last):
File "<python-input-29>", line 1, in <module>
nms.decode_row(b'foo')
~~~~~~~~~~~~~~^^^^^^^^
File "/home/peter/projects/tskit-dev/tsvenv/lib/python3.13/site-packages/tskit/metadata.py", line 665, in <lambda>
self.decode = lambda buffer: decoder(iter(buffer))
~~~~~~~^^^^^^^^^^^^^^
File "/home/peter/projects/tskit-dev/tsvenv/lib/python3.13/site-packages/tskit/metadata.py", line 462, in decode_object_or_null
key: sub_decoder(buffer)
~~~~~~~~~~~^^^^^^^^
File "/home/peter/projects/tskit-dev/tsvenv/lib/python3.13/site-packages/tskit/metadata.py", line 506, in <lambda>
return lambda buffer: struct.unpack(f, bytes(islice(buffer, size)))[0]
~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
struct.error: unpack requires a buffer of 8 bytes
The underlying code uses struct.unpack, which does notice if the provided buffers are too long:
struct.unpack("<h", b'a')
# struct.error: unpack requires a buffer of 2 bytes
struct.unpack("<h", b'aa')
# (24929,)
struct.unpack("<h", b'aaa')
# struct.error: unpack requires a buffer of 2 bytes
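The reason the too-long case nevertheless slips through the schema codec is visible in the traceback above: the decoder iterates over the buffer and uses islice to hand struct.unpack exactly the bytes it needs, so the surplus bytes are simply never looked at. A minimal reproduction of that pattern:

```python
import struct
from itertools import islice

buf = iter(b'{"ancestor_data_id": 1}')  # 23-byte buffer, as an iterator
# islice hands struct.unpack exactly 8 bytes, so its length check passes:
value = struct.unpack("<q", bytes(islice(buf, 8)))[0]
leftover = bytes(buf)
print(value)          # 8391162008449393275
print(len(leftover))  # 15 bytes never inspected -- no error raised
```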
and we do check the length of the buffer here. Seems like a check on whether we've exhausted the buffer would do it?
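A rough sketch of what such a check could look like (hypothetical helper name and signature; the real fix would live inside tskit's metadata codec, not in user code):

```python
import struct
from itertools import islice

def decode_exact(buffer_bytes, fmt="<q"):
    # Hypothetical: decode as the schema codec does, then verify the
    # iterator is exhausted, so trailing bytes raise instead of vanishing.
    it = iter(buffer_bytes)
    value = struct.unpack(fmt, bytes(islice(it, struct.calcsize(fmt))))[0]
    trailing = bytes(it)
    if trailing:
        raise ValueError(f"{len(trailing)} trailing metadata bytes not decoded")
    return value

decode_exact(b'{"ancest')                    # fine: exactly 8 bytes
# decode_exact(b'{"ancestor_data_id": 1}')   # would raise ValueError
```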