PoC: Metadata implementation #574
Conversation
This can now emit email and JSON metadata (still yet to be extensively tested). More decisions made:
I'm currently downloading a corpus of data from PyPI to test this PR with. There's a lot to download, so it won't be done anytime soon, but testing with what I have so far turned up 282469 files with left-over keys. I'm digging into why exactly those had left-over keys; so far the most common reason is just bad data that can't be correctly parsed due to the newline problem I mentioned. An example of a problem:
That bar character was added by me; the other lines in that description are just
I think this shows the strength of being very careful about how we deserialize data. However, this PR returns a
and the "leftover" data structure looks like:
which shows that there was an error parsing the
Here's another one, this one is subtle:
That produces a
with leftovers of:
It looks like at some point maturin was emitting
More weird bad data that normally passes silently:
Body removed for brevity:
with leftovers:
Looks like that file was emitted with a stray newline after the
This is what
So far all of the metadata files with leftover data I've investigated are due to the
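The two-part return being discussed in this thread (everything that parsed cleanly, plus a "leftovers" structure for anything that didn't) can be sketched roughly like this. This is a self-contained toy, not the PR's actual parsing code; the set of multiple-use field names here is made up for illustration.

```python
# Toy sketch of "parsed + leftovers": single-use keys that repeat are
# refused, whole, into the leftovers dict instead of silently picking one.
from email.parser import Parser
from typing import Any

# Illustrative subset of fields allowed to appear more than once.
_MULTI_OK = {"classifier", "project-url", "requires-dist"}


def parse_email_sketch(data: str) -> tuple[dict[str, Any], dict[Any, Any]]:
    msg = Parser().parsestr(data)
    parsed: dict[str, Any] = {}
    leftovers: dict[Any, Any] = {}
    for key in set(msg.keys()):
        values = msg.get_all(key)
        name = key.lower()
        if name in _MULTI_OK:
            parsed[name] = values
        elif len(values) == 1:
            parsed[name] = values[0]
        else:
            # Repeated single-use key: don't guess which copy "wins".
            leftovers[name] = values
    return parsed, leftovers


good, bad = parse_email_sketch("Name: spam\nName: eggs\nVersion: 1.0\n\n")
print(good)  # {'version': '1.0'}
print(bad)   # {'name': ['spam', 'eggs']}
```

The point of the shape is that callers can decide for themselves whether non-empty leftovers are fatal or merely worth logging.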
My latest push starts rewriting the
The new design has the following properties:
The way this PR's
If you create a Metadata class using the normal constructor
Thus, when you create a
In addition to that, it supports a number of alternate constructors:
The
We add a
Thus, we ensure that on access or writing to a property, that property's data is always valid from the POV of the user, but since we're doing it lazily, it allows partial validation if needed.
The last part isn't written, but the plan is to also add a set of
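The lazy per-property validation described above can be sketched like this. All names here (RawMetadata, Metadata, InvalidMetadata, the _validated cache, the name check) are illustrative stand-ins, not the PR's actual implementation:

```python
# Sketch: validation happens on first access to a property, and the
# validated value is cached, so a partially-bad file can still serve
# its good fields.
from __future__ import annotations

from dataclasses import dataclass


class InvalidMetadata(ValueError):
    """Raised when a field fails validation on access."""


@dataclass
class RawMetadata:
    # Unvalidated, as-parsed data; everything is optional.
    name: str | None = None
    version: str | None = None


class Metadata:
    def __init__(self, raw: RawMetadata) -> None:
        self._raw = raw
        # Cache of already-validated values: each field is checked at
        # most once, and only if it is actually used.
        self._validated: dict[str, object] = {}

    @property
    def name(self) -> str:
        if "name" not in self._validated:
            value = self._raw.name
            # Deliberately simplistic check, standing in for real
            # name validation.
            if not value or not value.replace("-", "").isalnum():
                raise InvalidMetadata(f"invalid name: {value!r}")
            self._validated["name"] = value
        return self._validated["name"]  # type: ignore[return-value]


meta = Metadata(RawMetadata(name="packaging", version="23.0"))
print(meta.name)  # validated lazily, on first access
```

An alternate "validate everything up front" constructor would just touch every property eagerly and collect the exceptions.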
Overall, this design gives a lot of power while ensuring a lot of safety, and makes several pieces easier to implement:
This lets us serve a lot of different use cases:
On the design side, we've got very clear separation of concerns:
I'm up for going with this general approach. Should we try to get the raw parts in first to keep the PRs small?
@enum.unique
class DynamicField(enum.Enum):
I'm wondering if it is worth sticking with an enum, or just using lowercase string literals for the metadata field names? Same goes for known/supported metadata versions.
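For comparison, the two options being weighed might look like this (purely illustrative; the field names are a made-up subset):

```python
# Option 1: an Enum, giving a single canonical object per field name.
import enum
from typing import Literal


@enum.unique
class DynamicField(enum.Enum):
    NAME = "name"
    VERSION = "version"


# Option 2: plain lowercase string literals, with typing.Literal for
# static checking and a frozenset for runtime validation.
DynamicFieldStr = Literal["name", "version"]
_DYNAMIC_FIELDS = frozenset(["name", "version"])


def is_dynamic_field(value: str) -> bool:
    return value.lower() in _DYNAMIC_FIELDS


print(DynamicField("name") is DynamicField.NAME)  # True
print(is_dynamic_field("Name"))                   # True
```

The literal approach reads more naturally at call sites ("name" vs DynamicField.NAME), at the cost of losing exhaustiveness from `@enum.unique` and enum identity comparisons.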
#
# However, we want to support validation to happen both up front
# and on the fly as you access attributes, and when using the
# on the fly validation, we don't want to validate anything else
Suggested change:
- # on the fly validation, we don't want to validate anything else
+ # on-the-fly validation, we don't want to validate anything else
# purpose of RawMetadata.
_raw: RawMetadata
# Likewise, we need a place to store our honest to goodness actually |
Suggested change:
- # Likewise, we need a place to store our honest to goodness actually
+ # Likewise, we need a place to store our honest-to-goodness, actually
# validated metadata too, we could just store this in a dict, but
# this will give us better typing. |
Suggested change:
- # validated metadata too, we could just store this in a dict, but
- # this will give us better typing.
+ # validated metadata, too. We could just store this in a dict, but
+ # this will give us better typing.
v2_3 = "2.3"

class _ValidatedMetadata(TypedDict, total=False): |
So would this class have a key for each piece of metadata that we are willing to perform conversions/validation on from raw metadata?
def full_validate(self, value: V | None) -> None:
    if value is not None:
        self.validate(value)

@abc.abstractmethod
def validate(self, value: V) -> None:
Why the two functions? Is it just to avoid having to deal with the None case for typing purposes?
dynamic: List[str]

# Metadata 2.3 - PEP 685
# No new fields were added in PEP 685, just some edge case were
Suggested change:
- # No new fields were added in PEP 685, just some edge case were
+ # No new fields were added in PEP 685, just some edge cases were
_EMAIL_FIELD_ORDER = [
    # Always put the metadata version first, incase it ever changes how
Suggested change:
- # Always put the metadata version first, incase it ever changes how
+ # Always put the metadata version first, in case it ever changes how
# class, some light touch ups can make a massive different in usability.

_EMAIL_FIELD_MAPPING = {
It scares me that there's a typo somewhere, but we would probably find out pretty quickly, so my brain wanting to do this as a dict comprehension just needs to calm down. 😅
# This might appear to be a mapping of the same key to itself, and in many cases
# it is. However, the algorithm in PEP 566 doesn't match 100% the keys chosen
# for RawMetadata, so we use this mapping just like with email to handle that.
_JSON_FIELD_MAPPING = {
Don't need a dict comprehension.
@dstufft what would you like to do to move this forward? Implement the raw stuff with tests first? Something else?
To help @dstufft move this forward, I have started my own branch that takes Donald's email header parsing code and begins to add tests and docs (this is currently a WIP w/ appropriate attribution to Donald via
I think this PoC has outlived its usefulness now with the work @brettcannon has been doing, so going to close it.
This is still a work in progress and hasn't been tested at all yet, even to see if it runs.

However, I think it should correctly implement parsing of METADATA and metadata.json files, with, I think, the most lenient parser I could come up with that doesn't silently let bad data go unnoticed.

Some general notes about decisions made here (so far anyways):

- RawMetadata largely matches the format of metadata.json, but has some deviations to make using it better. metadata.json has a really awkward lack of pluralization (e.g. metadata.json has classifier: list[str], but RawMetadata has classifiers: list[str]).
- Fields in RawMetadata are optional regardless of what the core metadata spec says.
- Project-URL metadata is represented as a dict[str, str], not list[str].
- The parse_FORMAT functions return a tuple of a RawMetadata that represents everything that could be parsed, and a dict[Any, Any] that represents what could not be.
- The parse_FORMAT functions take either a bytes or a str, allowing callers to give us bytes and we will do the right thing, or, if they know their document is broken in some way with a wrong encoding, they can decode it themselves before passing it in.
- Bad data (beyond the structure that RawMetadata even enforces) doesn't pass through silently.

I tried to comment through everything, but there are a lot of subtle situations that I believe this will handle about as well as possible:
- Project-URL, which conceptually is a map but, due to RFC 822 not supporting maps, is serialized as a list of strings.
- In METADATA, repeat uses of a key that does not allow multiple uses make that key unparseable and push it into the second structure. This protects a METADATA file from having Name emitted twice with different contents; if we just blindly pick one of them as "the" value, different systems may randomly pick the other one, leaving two systems to parse the same file with different results.
- Encoding problems in METADATA, such that a file that is mostly utf8 but has one mojibaked field can still have the bulk of the file parsed correctly. [1]

Anyways, tomorrow I'm going to actually test this, and try to get serialization back into METADATA and metadata.json done, which should be a lot easier and less finicky.

Footnotes
1. The stdlib email parser plus RFC 822 together make this horrible to do, because all of the parsing methods that accept bytes just do a hardcoded decode("ascii", "surrogateescape"), which means that anyone parsing METADATA with one of the byte interfaces from the email library is incorrectly parsing valid METADATA files that contain any utf8 characters that aren't also ascii characters.