-
Notifications
You must be signed in to change notification settings - Fork 78
Show metadata size #3343
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Show metadata size #3343
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #3343 +/- ##
=======================================
Coverage 89.76% 89.76%
=======================================
Files 29 29
Lines 31292 31302 +10
Branches 5738 5740 +2
=======================================
+ Hits 28089 28099 +10
Misses 1794 1794
Partials 1409 1409
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
8e703c4 to
056973f
Compare
|
Bytes are better. How does this interact with the old "has metadata" flag? What happens when there is metadata but there's no schema (I think I see what it is not from rereading the code, but it would really help to use some variables and not try to make everything maximally cryptic). Some simple unit tests here based on the output of metadata_details covering the various possibilities would be sufficient. |
|
I'll go with bytes then. This replaces the "has_metadata" column (which I think is the right way to go: if there is text in there, it means either a schema is set or there is metadata). I'm trying not to bloat the tables with lots of extra text / columns, so if we are going with bytes, how about No metadata, no schema = So e.g. ? Other suggestions welcome, obviously. I could put e.g. an empty marker (e.g. The alternative is to have 2 columns, labelled e.g. |
|
I think it's confusing to omit the bytes value when it's empty, just say "0 Bytes". I think it's probably easier to understand if we have two columns all right. |
0c5e46f to
e9eb4eb
Compare
| def metadata_codec(table): | ||
| if hasattr(table, "metadata_schema"): | ||
| schema = table.metadata_schema.schema | ||
| return "raw" if schema is None else schema.get("codec", "unknown") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the tests don't check if this value is correct? (eg, no examples of "unknown" codec)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good reason why logic like this should be spelled out longhand rather than aiming for maximally concise. Much easier to track test coverage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not even sure that we can have a schema with no codec, but as this is display code, I thought it better not to fail here, and I didn't really want to dive into understanding what was and wasn't acceptable as a schema. Bu I guess I should just allow this to fail if there's no codec then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it can't be tested then it shouldn't be in the code. You need to take responsibility here for understanding what you're putting in to the code - this is what being a maintainer entails.
| if hasattr(table, "metadata_schema"): | ||
| schema = table.metadata_schema.schema | ||
| codec = schema["codec"] if schema else "raw" | ||
| codecs[codec] += 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here could check if it shows 0% for tables without metadata? And that the percent is <= 100%?
|
Looks good. How does this look for the sc2ts tree sequence so we can see the output at scale? |
jeromekelleher
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you test following the modern modular approach please? If you're writing complicated logic in test code you're doing it wrong, as a rule of thumb.
I agree we should update existing tests minimally to capture the change in format, but do the testing of the actual conversion in standalone unit tests which are as stupidly simple as possible.
| assert re.search(rf"║Time Units *│ *{ts.time_units}║", s) | ||
| for table in ts.tables.table_name_map: | ||
| assert re.search(rf"║{table.capitalize()} *│", s) | ||
| codecs = collections.defaultdict(int) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm very much negative on this style of testing now - any complex logic in a test case is to be avoided.
I suggested testing these functions as follows
class TestMetadataCodecStr:
def test_plain_table_collection(self):
s = util.metadata_codec(tskit.TableCollection(1))
assert s == "raw"
This is much more useful long term as well as being easier to write and review.
| assert f"<td>{table_name.capitalize()}</td>" in html | ||
| if hasattr(table, "metadata_schema"): | ||
| schema = table.metadata_schema.schema | ||
| codec = schema["codec"] if schema else "raw" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer "bytes" here. "raw" implies something special?
| def metadata_codec(table): | ||
| if hasattr(table, "metadata_schema"): | ||
| schema = table.metadata_schema.schema | ||
| return "raw" if schema is None else schema.get("codec", "unknown") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can't have a schema without a codec.
As a general point of taste - defensive code is almost always indicating a lack of testing or some other smell. Unless it is to do with interfacing to the real world or data you don't control. Here we control everything so defensiveness is not wanted.
|
|
||
| def metadata_size(table): | ||
| if hasattr(table, "metadata"): | ||
| frac = len(table.metadata) / table.nbytes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty sure that table.nbytes can't be zero. Can you try to create one that is? If you can't then this is fine.



Fixes #2637 - I remembered this issue when writing some recent tutorial material.
I went for the absolute size rather than percentage, and simply appended the codec type, rather than having a separate column, as it seems simpler:
I will add tests if this seems like the right format: