
PoC: Metadata implementation #574

Closed · wants to merge 20 commits

Conversation


@dstufft dstufft commented Jul 16, 2022

This is still a work in progress and hasn't been tested at all yet, even to see whether it runs.

However, I think it correctly implements parsing of the METADATA and metadata.json files, with the most lenient parser I could come up with that doesn't silently let bad data go unnoticed.

Some general notes about decisions made here (so far anyways):

  • The RawMetadata largely matches the format of metadata.json, but has some deviations to make using it better.
    • Several of the key names in metadata.json have a really awkward lack of pluralization (e.g. metadata.json has classifier: list[str], but RawMetadata has classifiers: list[str]).
    • All values in RawMetadata are optional, regardless of what the core metadata spec says.
    • The Project-URL metadata is represented as a dict[str, str], not a list[str].
  • I believe it is unsafe to include unparsed data intermixed with parsed data, because it makes it far too difficult to differentiate between them, so the parse_FORMAT functions return a tuple: a RawMetadata that represents everything that could be parsed, and a dict[Any, Any] that represents what could not be (see the sketch after this list).
  • The parse_FORMAT functions take either a bytes or a str, allowing callers to give us a bytes and we will do the right thing, or, if they know that their document is broken in some way with a wrong encoding, they can decode it themselves before passing it in.
  • Round tripping to a byte-for-byte result is not a supported use case, but round tripping to a semantically equivalent result is.
  • Under no circumstances do we let malformed data (with what little correctness RawMetadata even enforces) pass through silently.
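
To make that two-structure return concrete, here is a minimal sketch of the shape described above (the field names and exact signature are assumptions based on this description, not necessarily the final code):

from typing import Any, Dict, List, Tuple, TypedDict, Union

class RawMetadata(TypedDict, total=False):
    # Every field is optional, names are pluralized, and Project-URL is a
    # dict rather than a list of "label, url" strings.
    name: str
    version: str
    classifiers: List[str]
    project_urls: Dict[str, str]

def parse_email(data: Union[bytes, str]) -> Tuple[RawMetadata, Dict[Any, Any]]:
    """Parse METADATA/PKG-INFO; anything unparsable lands in the second dict."""
    raw: RawMetadata = {}
    leftover: Dict[Any, Any] = {}
    # ... actual parsing elided; malformed or unknown keys go into leftover ...
    return raw, leftover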

I tried to comment through everything, but there are a lot of subtle situations that I believe this will handle about as well as possible:

  • Extraneous keys are not implicitly accepted; rather, they are pushed into a second data structure to mark them as unparsed.
    • This holds true even for Project-URL, which conceptually is a map but, due to RFC 822 not supporting maps, is serialized as a list of strings.
  • Types are explicitly checked to ensure that our typing matches our runtime; since this data is external, we can't assume that its shape matches anything in particular.
  • When parsing METADATA, repeated use of a key that does not allow multiple uses makes that key unparsable and pushes it into the second structure.
    • This might be fixing a possible security bug that's a variant of the confused deputy? There's nothing right now preventing a METADATA file from having Name emitted twice with different contents, and if we just blindly pick one of them as "the" value, different systems may pick different ones, leaving two systems that parse the same file with different results.
  • Implement an RFC 822-aware, line-by-line decoding of METADATA, such that a file that is mostly utf8, but where one field has been mojibaked, can still have the bulk of the file parsed correctly (see the sketch below) 1.
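
A rough illustration of that line-by-line decoding idea (a sketch of the approach, not this PR's exact implementation):

def decode_lines(data: bytes) -> list[str]:
    lines = []
    for raw_line in data.split(b"\n"):
        try:
            lines.append(raw_line.decode("utf-8"))
        except UnicodeDecodeError:
            # Only this line is mojibake; preserve its bytes with
            # surrogateescape so the rest of the file still parses cleanly
            # and the bad field can be pushed into the unparsed structure.
            lines.append(raw_line.decode("utf-8", "surrogateescape"))
    return lines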

Anyways, tomorrow I'm going to actually test this, and try to get serialization back into METADATA and metadata.json done, which should be a lot easier and less finicky.

Footnotes

  1. The stdlib email parser plus RFC 822 together make this horrible to do, because all of the parsing methods that accept bytes just do a hardcoded decode("ascii", "surrogateescape"), which means that anyone parsing METADATA with one of the bytes interfaces from the email library is incorrectly parsing valid METADATA files that contain any utf8 characters that aren't also ascii characters.
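
A small demonstration of that behavior (the Á below is the utf-8 byte pair 0xC3 0x81):

import email.parser

# BytesParser hardcodes decode("ascii", "surrogateescape"), so the
# non-ASCII utf-8 bytes in the header come back as surrogates.
msg = email.parser.BytesParser().parsebytes(b"Author: \xc3\x81rni\n\nbody")
print(repr(msg["Author"]))  # '\udcc3\udc81rni', not 'Árni'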

@dstufft dstufft mentioned this pull request Jul 16, 2022

dstufft commented Jul 16, 2022

This now can emit email and json metadata (still yet to be extensively tested)

More decisions made:

  • The code to emit email always does the distutils-style RFC 822 escaping (sketched below), which should be a no-op if the field contains no newlines 1.
  • The code to emit email always emits Description as the email body.
  • Emitting assumes that you've passed in a correct RawMetadata, but as an extra layer of protection it will not emit keys that are unknown to it.
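
For reference, the distutils-style escaping amounts to this (adapted from distutils.util.rfc822_escape):

def rfc822_escape(header: str) -> str:
    # RFC 822 continuation lines are indented; joining with eight spaces
    # keeps multi-line values parseable, and is a no-op when the value
    # contains no newlines.
    return ("\n" + 8 * " ").join(header.split("\n"))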

Footnotes

  1. This means that in the presence of newlines, we will emit mangled, but otherwise safe, data. Validation is left to other layers.


dstufft commented Jul 17, 2022

I'm currently downloading a corpus of data from PyPI to test this PR with. There's a lot to download, so it won't be done anytime soon, but testing with what I have so far results in 282469 METADATA or PKG-INFO files parsed with no leftover keys 1 2 and 209 parsed with leftover keys.

I'm digging into why exactly those had leftover keys; so far the most common reason is just bad data that can't be correctly parsed due to the newline problem I mentioned.

An example of a problem PKG-INFO is:

Metadata-Version: 1.1
Name: aisg-cli
Version: 0.1.0
Summary: AISG CLI Tool
Home-page: https://github.com/kensoh/aisg-cli
Author: AI Singapore
Author-email: engineering@aisingapore.org
License: UNKNOWN
Description-Content-Type: UNKNOWN
Description: # AISG CLI Tool

        Command line interface to simplify machine learning workflows - data acquisition, modeling, deployment

        |
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: System Administrators
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.4
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6

That bar character was added by me; the other lines in that description end with just \n, but the line I added the bar to was padded with whitespace out to the bar. This malformed PKG-INFO ends up being parsed by email.parser with a Description header that is set to # AISG CLI Tool, then a body payload that is set to:

        Command line interface to simplify machine learning workflows - data acquisition, modeling, deployment

        |
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: System Administrators
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.4
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6

I think this shows the strength of being very careful about how we deserialize data, and of the leftover data structure, because all of the other libraries I've tested silently ignore the fact that this metadata is malformed and throw away the # AISG CLI Tool data.

However, this PR returns a RawMetadata:

{'author': 'AI Singapore',
 'author_email': 'engineering@aisingapore.org',
 'description_content_type': 'UNKNOWN',
 'home_page': 'https://github.com/kensoh/aisg-cli',
 'license': 'UNKNOWN',
 'metadata_version': '1.1',
 'name': 'aisg-cli',
 'summary': 'AISG CLI Tool',
 'version': '0.1.0'}

and the "leftover" data strcture looks like:

{'Description': ['# AISG CLI Tool',
                 '        Command line interface to simplify machine learning '
                 'workflows - data acquisition, modeling, deployment\r'
                 '\r\n'
                 '        \r\n'
                 'Platform: UNKNOWN\r\n'
                 'Classifier: Development Status :: 3 - Alpha\r\n'
                 'Classifier: Intended Audience :: Developers\r\n'
                 'Classifier: Intended Audience :: System Administrators\r\n'
                 'Classifier: Intended Audience :: Science/Research\r\n'
                 'Classifier: License :: OSI Approved :: Apache Software '
                 'License\r\n'
                 'Classifier: Programming Language :: Python :: 2\r\n'
                 'Classifier: Programming Language :: Python :: 2.7\r\n'
                 'Classifier: Programming Language :: Python :: 3\r\n'
                 'Classifier: Programming Language :: Python :: 3.4\r\n'
                 'Classifier: Programming Language :: Python :: 3.5\r\n'
                 'Classifier: Programming Language :: Python :: 3.6\r\n']}

which shows that there was an error parsing the Description key; in this case, because it saw two values for that key, it included a list that had both values (see the sketch below for the rule).
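
As a sketch, the duplicate-key rule behaves roughly like this (the names are illustrative, not this PR's exact internals):

def record(raw, leftover, name, value, multiple_use):
    if multiple_use:
        raw.setdefault(name, []).append(value)
    elif name in raw:
        # Second sighting of a single-use key: demote the earlier value
        # too, so neither copy silently "wins".
        leftover[name] = [raw.pop(name), value]
    elif name in leftover:
        leftover[name].append(value)
    else:
        raw[name] = value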

Footnotes

  1. I'm excluding the License-File data being leftover from these results, since that is the library behaving correctly.

  2. I haven't attempted to compare the results of what can be parsed between libraries to see how this is faring yet; I'm just seeing what data it wasn't able to parse, to sort out any blatant errors first.


dstufft commented Jul 17, 2022

Here's another one; this one is subtle:

Metadata-Version: 2.1
Name: adblock
Version: 0.4.3
Classifiers: Programming Language :: Python
Classifiers: Programming Language :: Rust
Classifiers: License :: OSI Approved :: MIT License
Classifiers: License :: OSI Approved :: Apache Software License
Home-Page: https://github.com/ArniDagur/python-adblock
Author: Árni Dagur <arni@dagur.eu>
Author-Email: Árni Dagur <arni@dagur.eu>
License: MIT OR Apache-2.0
Requires-Python: >=3.6
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM

python-adblock
==========
Python wrapper for Brave's adblocking library, which is written in Rust.

### Building

\`\`\`
maturin build --release
\`\`\`

#### Build dependencies

| Build Dependency | Versions | Arch Linux | Url |
|------------------|----------|------------|-----|
| Python           | `>=3.6`  | `python3`  | -   |
| Rust             | `>=1.45` | `rust`     | -   |
| Maturin          | `*`      | `maturin`  | https://github.com/PyO3/maturin |

### Developing

I use Poetry for development. To create and enter a virtual environment, do
\`\`\`
poetry install
poetry shell
\`\`\`
then, to install the `adblock` module into the virtual environment, do
\`\`\`
maturin develop
\`\`\`

### Documentation

Rust documentation for the latest `master` branch can be found at https://arnidagur.github.io/python-adblock/docs/adblock/index.html.

### License

This project is licensed under either of

 * Apache License, Version 2.0, ([LICENSE-APACHE](LICENSE-APACHE) or
   http://www.apache.org/licenses/LICENSE-2.0)
 * MIT license ([LICENSE-MIT](LICENSE-MIT) or
   http://opensource.org/licenses/MIT)

at your option.

That produces a RawMetadata like:

{'author': 'Árni Dagur <arni@dagur.eu>',
 'author_email': 'Árni Dagur <arni@dagur.eu>',
 'description': 'python-adblock\r\n'
                '==========\r\n'
                "Python wrapper for Brave's adblocking library, which is "
                'written in Rust.\r\n'
                '\r\n'
                '### Building\r\n'
                '\r\n'
                '```\r\n'
                'maturin build --release\r\n'
                '```\r\n'
                '\r\n'
                '#### Build dependencies\r\n'
                '\r\n'
                '| Build Dependency | Versions | Arch Linux | Url |\r\n'
                '|------------------|----------|------------|-----|\r\n'
                '| Python           | `>=3.6`  | `python3`  | -   |\r\n'
                '| Rust             | `>=1.45` | `rust`     | -   |\r\n'
                '| Maturin          | `*`      | `maturin`  | '
                'https://github.com/PyO3/maturin |\r\n'
                '\r\n'
                '### Developing\r\n'
                '\r\n'
                'I use Poetry for development. To create and enter a virtual '
                'environment, do\r\n'
                '```\r\n'
                'poetry install\r\n'
                'poetry shell\r\n'
                '```\r\n'
                'then, to install the `adblock` module into the virtual '
                'environment, do\r\n'
                '```\r\n'
                'maturin develop\r\n'
                '```\r\n'
                '\r\n'
                '### Documentation\r\n'
                '\r\n'
                'Rust documentation for the latest `master` branch can be '
                'found at '
                'https://arnidagur.github.io/python-adblock/docs/adblock/index.html.\r\n'
                '\r\n'
                '### License\r\n'
                '\r\n'
                'This project is licensed under either of\r\n'
                '\r\n'
                ' * Apache License, Version 2.0, '
                '([LICENSE-APACHE](LICENSE-APACHE) or\r\n'
                '   http://www.apache.org/licenses/LICENSE-2.0)\r\n'
                ' * MIT license ([LICENSE-MIT](LICENSE-MIT) or\r\n'
                '   http://opensource.org/licenses/MIT)\r\n'
                '\r\n'
                'at your option.\r\n'
                '\n',
 'description_content_type': 'text/markdown; charset=UTF-8; variant=GFM',
 'home_page': 'https://github.com/ArniDagur/python-adblock',
 'license': 'MIT OR Apache-2.0',
 'metadata_version': '2.1',
 'name': 'adblock',
 'requires_python': '>=3.6',
 'version': '0.4.3'}

with leftovers of:

{'classifiers': ['Programming Language :: Python',
                 'Programming Language :: Rust',
                 'License :: OSI Approved :: MIT License',
                 'License :: OSI Approved :: Apache Software License']}

It looks like at some point maturin was emitting Classifiers instead of Classifier, which this immediately caught 1.

Footnotes

  1. See maturin commit where it was fixed: https://github.com/PyO3/maturin/commit/0cb3d79d5b3aa75a4cfc3a4ef8b353dfa7161279


dstufft commented Jul 17, 2022

More weird bad data that normally passes silently:

Metadata-Version: 2.1
Name: asciinema
Version: 2.2.0
Summary: Terminal session recorder
Home-page: https://asciinema.org
Download-URL: 
https: //github.com/asciinema/asciinema/archive/v2.2.0.tar.gz
Author: Marcin Kulik
Author-email: m@ku1ik.com
License: GNU GPLv3
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: System Administrators
Classifier: License :: OSI Approved :: GNU General Public License v3 or later (GPLv3+)
Classifier: Natural Language :: English
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Topic :: System :: Shells
Classifier: Topic :: Terminals
Classifier: Topic :: Utilities
Description-Content-Type: text/markdown; charset=UTF-8
License-File: LICENSE

...

Body removed for brevity:

{'author': 'Marcin Kulik',
 'author_email': 'm@ku1ik.com',
 'classifiers': ['Development Status :: 5 - Production/Stable',
                 'Environment :: Console',
                 'Intended Audience :: Developers',
                 'Intended Audience :: System Administrators',
                 'License :: OSI Approved :: GNU General Public License v3 or '
                 'later (GPLv3+)',
                 'Natural Language :: English',
                 'Programming Language :: Python',
                 'Programming Language :: Python :: 3.6',
                 'Programming Language :: Python :: 3.7',
                 'Programming Language :: Python :: 3.8',
                 'Programming Language :: Python :: 3.9',
                 'Programming Language :: Python :: 3.10',
                 'Topic :: System :: Shells',
                 'Topic :: Terminals',
                 'Topic :: Utilities'],
 'description_content_type': 'text/markdown; charset=UTF-8',
 'download_url': '',
 'home_page': 'https://asciinema.org',
 'license': 'GNU GPLv3',
 'metadata_version': '2.1',
 'name': 'asciinema',
 'platforms': ['UNKNOWN'],
 'summary': 'Terminal session recorder',
 'version': '2.2.0'}

with leftovers

{'https': ['//github.com/asciinema/asciinema/archive/v2.2.0.tar.gz'],
 'license-file': ['LICENSE']}

Looks like that file was emitted with a stray newline after the Download-URL, causing the URL to end up on the next line and get parsed as a header.

This is what pkg_metadata gets:

{'author': 'Marcin Kulik',
 'author_email': 'm@ku1ik.com',
 'classifier': ['Development Status :: 5 - Production/Stable',
                'Environment :: Console',
                'Intended Audience :: Developers',
                'Intended Audience :: System Administrators',
                'License :: OSI Approved :: GNU General Public License v3 or '
                'later (GPLv3+)',
                'Natural Language :: English',
                'Programming Language :: Python',
                'Programming Language :: Python :: 3.6',
                'Programming Language :: Python :: 3.7',
                'Programming Language :: Python :: 3.8',
                'Programming Language :: Python :: 3.9',
                'Programming Language :: Python :: 3.10',
                'Topic :: System :: Shells',
                'Topic :: Terminals',
                'Topic :: Utilities'],
 'description_content_type': 'text/markdown; charset=UTF-8',
 'download_url': '',
 'home_page': 'https://asciinema.org',
 'license': 'GNU GPLv3',
 'metadata_version': '2.1',
 'name': 'asciinema',
 'platform': ['UNKNOWN'],
 'summary': 'Terminal session recorder',
 'version': '2.2.0'}


dstufft commented Jul 17, 2022

So far all of the metadata files with leftover data I've investigated are due to the METADATA file itself being broken in some way. The bulk of them are due to a stray \n causing the rest of the file to get parsed as the body, like:

Metadata-Version: 1.1
Name: applicationinsights
Version: 0.11.10
Summary: This project extends the Application Insights API surface to support Python.
Home-page: https://github.com/Microsoft/ApplicationInsights-Python
Author: Microsoft
Author-email: appinsightssdk@microsoft.com
License: MIT
Download-URL: https://github.com/Microsoft/ApplicationInsights-Python
Description: This SDK is no longer maintained or supported by Microsoft. Check out the `Python OpenCensus SDK <https://docs.microsoft.com/azure/azure-monitor/app/opencensus-python>`_ for Azure Monitor's latest Python investments. Azure Monitor only provides support when using the `supported SDKs <https://docs.microsoft.com/en-us/azure/azure-monitor/app/platforms#unsupported-community-sdks>`_. We’re constantly assessing opportunities to expand our support for other languages, so follow our `GitHub Announcements <https://github.com/microsoft/ApplicationInsights-Announcements/issues>`_ page to receive the latest SDK news. 

        |
Keywords: analytics applicationinsights telemetry appinsights development
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Operating System :: OS Independent
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3.4
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6

The | was added by me again, to show the trailing whitespace.


dstufft commented Jul 17, 2022

My latest push starts rewriting the Metadata class to work in a much different way (though ultimately it will still be able to have mostly the same API 1).

The new design has the following properties:

  • Users manually creating Metadata objects cannot create a Metadata with invalid metadata.
  • Adds Metadata.from_FORMAT methods to go from RawMetadata or METADATA / metadata.json, which will validate the metadata by default (see the sketch after this list).
  • The from_FORMAT methods can optionally disable validation, which will let invalid data possibly be stored internally in Metadata.
  • Even in the case that validation is disabled, any access of a field will ensure that that specific field is validated, and setting a field to a new value will ensure the new value is validated.
  • Will add Metadata.to_FORMAT methods to help go from a Metadata to a serialized form.
    • These will ensure that only fully valid metadata files are emitted.
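
A hypothetical shape for those constructors, inferred from this description (the names and signatures are assumptions, and parse_email is the sketch from the first comment):

class Metadata:
    @classmethod
    def from_email(cls, data, *, validate=True):
        raw, leftover = parse_email(data)
        if leftover:
            # Hard fail: refuse to build a Metadata from a document that
            # could not be fully parsed.
            raise ValueError(f"unparsable keys: {sorted(leftover)}")
        return cls.from_raw(raw, validate=validate)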

The way this PR's Metadata class works (or will work when fully implemented) is:

If you create a Metadata class using the normal constructor, Metadata(...), then the type signature of the class will guide people towards making correct metadata (name/version with no default values, etc.), and once the data has all been copied over to the respective attributes, a "global validation" will run that ensures policy-level requirements (metadata-version is appropriate for the defined fields, etc.; basically anything that requires looking at multiple fields to actually validate).

Thus, when you create a Metadata object from its constructor, you're forced to pass only valid values to have a fully valid set of metadata.

In addition to that, it supports a number of alternate constructors: Metadata.from_raw(), Metadata.from_email(), and Metadata.from_json(). The email and json varieties of those constructors just call their respective parse_FORMAT method and, if there is any leftover unparsed data, will hard fail; otherwise they'll take the raw metadata and pass it into Metadata.from_raw().

The from_raw constructor has some light magic to avoid invoking __init__: we want __init__ to eagerly validate metadata and ensure it's all valid, but we also want to enable passing in data that may be invalid (we assume it's a valid RawMetadata, however) and lazily validating it as needed.

We add a lazy_validator class property that will implement our per-field validation (using helper validators to make it easy to compose and test them). This lazy_validator will pull data out of raw, parse and validate it, then store it in the validated dictionary to serve as a cache/store of validated data. Likewise, when setting a value, it will do the same thing, and deletion will clear the value from those dictionaries.

Thus, we ensure that on access or writing to a property, that property's data is always valid from the POV of the user, but since we're doing it lazily, it allows partial validation if needed.
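
Here is a minimal sketch of that lazy-validation machinery, assuming one descriptor per field (all names here are illustrative, not the PR's exact code):

class _LazyField:
    def __init__(self, name, validator):
        self.name, self.validator = name, validator

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        if self.name not in obj._validated:
            value = obj._raw.get(self.name)
            self.validator(value)  # raises on invalid data
            obj._validated[self.name] = value
        return obj._validated[self.name]

    def __set__(self, obj, value):
        self.validator(value)
        obj._validated[self.name] = value

class Metadata:
    name = _LazyField("name", lambda value: None)  # real validators elided

    @classmethod
    def from_raw(cls, raw, *, validate=True):
        self = object.__new__(cls)  # deliberately bypass __init__
        self._raw, self._validated = dict(raw), {}
        if validate:
            # Eager mode: touch fields to force validation up front
            # (only "name" shown; the real code would cover every field).
            _ = self.name
        return self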

The last part isn't written yet, but the plan is to also add a set of to_FORMAT methods to serialize a Metadata; as part of that serialization it will run the "global" validation again, to ensure that we don't emit any invalid metadata (each field on its own is always consistent).

Footnotes

  1. Except there's a problem with the existing API: it can't actually represent everything that is an otherwise well-formed set of metadata, which I plan to address.


dstufft commented Jul 17, 2022

Overall, this design gives a lot of power, while ensuring a lot of safety and makes several pieces easier to implement:

  • The "Raw" layer will very leniently parse and represent metadata, but does so in a way that anything that isn't fully correct metadata is immediately obvious.
  • The Metadata layer will never let someone read or write invalid metadata on a per field basis, and in almost all cases on a per document basis.
    • You can technically get invalid Metadata out, but the only kinds of rules it can break are rules that require inspecting multiple fields at once, and even then, that's only if you manually read each field and serialize them together without running the full validation.
  • However, the Metadata layer is lazy, so if you only care about a single field, you can access just that field in a validated way, without having to validate the rest of the data.
  • Metadata.from_FORMAT validates the full document by default, requiring people to opt in to the lazy validation.

This lets us serve a lot of different use cases:

  1. If your goal is to read as much data as possible, and you don't really care about whether it's well formed or not, you can use the raw layer and ignore the fact that there is leftover data (or interpret it yourself).

  2. If your goal is to read specific pieces of metadata and you don't care if the rest of the metadata is valid, you have two choices:

    • If you want to ignore malformed files that have leftover data, you'll have to use the parse_FORMAT functions, ignore the leftovers, and pass the RawMetadata into from_raw.
    • If you want to ignore invalid fields, but you want to only read from documents with valid formatting, you can use any of the from_FORMAT methods.

    In either case, you'll need to pass validate=False to the from_FORMAT method you're using to disable the eager validation.

  3. If your goal is to read the metadata and you want to only work with valid metadata, any of the from_FORMAT methods will work for you with validate=True (the default).

  4. If your goal is to write invalid metadata, you must use the raw layer; the Metadata layer will never let you write out invalid metadata (see the usage sketch after this list).
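
A short usage sketch covering those cases, reusing the hypothetical names from the earlier sketches:

metadata_bytes = b"Metadata-Version: 2.1\nName: example\nVersion: 1.0\n"

# Case 1: read as much as possible, malformed or not.
raw, leftover = parse_email(metadata_bytes)

# Case 2: specific fields only; skip the eager validation.
if not leftover:
    md = Metadata.from_raw(raw, validate=False)
    print(md.name)  # only "name" is validated here

# Case 3: strict, all-or-nothing reading (validate=True is the default).
strict = Metadata.from_email(metadata_bytes)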

On the design side, we've got very clear separation of concerns:

  • The raw layer only cares about turning bytes or str into very, very lightly parsed documents, and it focuses entirely on doing that safely. Beyond translating between the on-disk formats and the intermediate format, it doesn't care about anything else.
  • The Metadata layer only cares about valid metadata and reading/writing from the raw intermediate format. It doesn't know anything at all about the on-disk formats, nor does it need to.
  • The validations layer is set up to be easily composable, so each validation can be written with minimal knowledge or special casing.

@pradyunsg pradyunsg mentioned this pull request Oct 7, 2022
@brettcannon brettcannon self-requested a review October 22, 2022 19:25
@brettcannon brettcannon left a comment


I'm up for going with this general approach. Should we try to get the raw parts in first to keep the PRs small?



@enum.unique
class DynamicField(enum.Enum):

I'm wondering if it is worth sticking with an enum or just with lowercase string literals for the metadata field names? Same goes for known/supported metadata versions.

#
# However, we want to support validation to happen both up front
# and on the fly as you access attributes, and when using the
# on the fly validation, we don't want to validate anything else

Suggested change
# on the fly validation, we don't want to validate anything else
# on-the-fly validation, we don't want to validate anything else

# purpose of RawMetadata.
_raw: RawMetadata

# Likewise, we need a place to store our honest to goodness actually

Suggested change
# Likewise, we need a place to store our honest to goodness actually
# Likewise, we need a place to store our honest-to-goodness, actually

Comment on lines +100 to +101
# validated metadata too, we could just store this in a dict, but
# this will give us better typing.

Suggested change
# validated metadata too, we could just store this in a dict, but
# this will give us better typing.
# validated metadata, too. We could just store this in a dict, but
# this will give us better typing.

v2_3 = "2.3"


class _ValidatedMetadata(TypedDict, total=False):

So would this class have a key for each piece of metadata that we are willing to perform conversions/validation on from raw metadata?

Comment on lines +108 to +113
def full_validate(self, value: V | None) -> None:
    if value is not None:
        self.validate(value)

@abc.abstractmethod
def validate(self, value: V) -> None:

Why the two functions? Is it just to avoid having to deal with the None case for typing purposes?

dynamic: List[str]

# Metadata 2.3 - PEP 685
# No new fields were added in PEP 685, just some edge case were

Suggested change
# No new fields were added in PEP 685, just some edge case were
# No new fields were added in PEP 685, just some edge cases were



_EMAIL_FIELD_ORDER = [
# Always put the metadata version first, incase it ever changes how

Suggested change
# Always put the metadata version first, incase it ever changes how
# Always put the metadata version first, in case it ever changes how

# class, some light touch ups can make a massive different in usability.


_EMAIL_FIELD_MAPPING = {

This scares me that there's a typo somewhere, but we would probably find out pretty quickly, so my brain wanting to do this as a dict comprehension just needs to calm down. 😅

# This might appear to be a mapping of the same key to itself, and in many cases
# it is. However, the algorithm in PEP 566 doesn't match 100% the keys chosen
# for RawMetadata, so we use this mapping just like with email to handle that.
_JSON_FIELD_MAPPING = {
@brettcannon brettcannon Nov 19, 2022


Don't need a dict comprehension.

@brettcannon

@dstufft what would you like to do to move this forward? Implement the raw stuff with tests first? Something else?

@brettcannon

To help @dstufft move this forward, I have started my own branch that takes Donald's email header parsing code and begins to add tests and docs (this is currently a WIP w/ appropriate attribution to Donald via Co-authored-by): https://github.com/brettcannon/packaging/tree/raw-metadata . My hope is to get RawMetadata parsing working and then transparent Metadata validation/transformation/reading working as I have a direct need for that now (https://github.com/brettcannon/mousebender/ and getting a pure wheel resolver). We can add email header emission and such later on in separate PRs.

@brettcannon brettcannon mentioned this pull request Jan 24, 2023

dstufft commented Jun 30, 2023

I think this PoC has outlived its usefulness now with the work @brettcannon has been doing, so I'm going to close it.

@dstufft dstufft closed this Jun 30, 2023
@dstufft dstufft deleted the metadata-parsing branch September 29, 2023 21:25