Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Eliminate use of cgi #377

Merged
merged 9 commits into from
Jan 6, 2024
Merged

Eliminate use of cgi #377

merged 9 commits into from
Jan 6, 2024

Conversation

twm
Copy link
Contributor

@twm twm commented Jan 2, 2024

Stop using cgi, which was deprecated in Python 3.11 and dropped in Python 3.13. Replace it with the multipart package, which becomes a dependency.

Fixes #355.

@glyph
Copy link
Member

glyph commented Jan 2, 2024

Does vendoring a stdlib function have license implications? I'd rather deal with the possibility of tinkering in the email module than copy the PSF license.

@twm
Copy link
Contributor Author

twm commented Jan 2, 2024

@glyph I thought we've vendored stuff from the stdlib in Twisted before... is that not the case?

@twm twm requested a review from a team January 2, 2024 05:28
@twm
Copy link
Contributor Author

twm commented Jan 2, 2024

Okay, I switched to using multipart for everything.

Copy link
Member

@adiroiban adiroiban left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update.

I just left a very generic review.
I have never used treq so far.

It looks good. Only minor comments.

I prefer stdlib here... but multipart is also ok.

Regards

def _encoding_from_headers(headers: Headers) -> Optional[str]:
content_types = headers.getRawHeaders("content-type")
if content_types is None:
return None

# This seems to be the choice browsers make when encountering multiple
# content-type headers.
content_type, params = cgi.parse_header(content_types[-1])
media_type, params = multipart.parse_options_header(content_types[-1])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More of a FYI.

In Twisted we/I went for stdlib usage https://github.com/twisted/twisted/pull/12016/files#diff-3093a8f64dec64c2305f6322f276a5cee1437f0037efc660ce6524ffa58017a1R229-R234


My reasoning for using stdlib, is that this is quite common usage and if there is a bug, it's best to have it fixed in stdlib rather than rely on 3rd party packages.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My reasoning for using stdlib, is that this is quite common usage and if there is a bug, it's best to have it fixed in stdlib rather than rely on 3rd party packages.

I'm curious what experience this is based on. In my experience, known bugs often remain in the stdlib for years (perhaps because there are relatively many barriers to landing a change in the stdlib, as well as the fact that the stdlib can only change with a release which happens relatively infrequently - plus such a release can not fix the issue for already-released versions of python).

If there is a third-party module of reasonable quality that is practical to use then I would generally prefer this over something from the stdlib.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It depends on the source of the 3rd-party module. More dependencies = more attack surface, both in the literal sense (more people whose PyPI upload credentials can ruin your day), and in the metaphoric sense of unanticipated changes (what happens if a dependency goes AGPL3).

In this sense "the stdlib" or any existing dependency are roughly equivalent, and I'd probably prefer to depend more on our existing deps. Taking on a new dependency ought to give us a moment of pause, but really only a moment, especially if we know the maintainers or it's sufficiently widely used that we are not entirely on the hook for due diligence on its continued maintenance.

Copy link
Member

@glyph glyph Jan 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In any case this is a blocker to 3.13 support and the dependency is only for tests, so we should probably not block on this minor concern :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, the dependency here is not just for tests. In 0a3d17b I switched to using the equivalent of cgi.parse_header() provided by multipart to avoid vendoring code under the PSF license as that seemed to give @glyph pause.

I picked multipart because it has a few familiar faces attached, e.g. @cjwatson, who has PyPI upload permissions. It is hosted under an individual's GitHub account rather than an org, so that's a downside.

I also considered pulling the relevant bits of cgi into a separate package and putting it on PyPI. I'd be happy to do that under the Twisted org if that's preferable to an external dependency.

I can only echo @exarkun's skepticism of using the stdlib for this. The stdlib has an anti-bytes bias that is often incorrect and has led to a lot of makework for us in Twisted. The cgi module has seen some baffling refactors, then deprecation and removal. Now we're told to use the email package, which isn't explicitly implementing the HTTP RFCs. Seems risky. Also, it was recently rewritten! It's possible for code to be too maintained.

from typing import cast, AnyStr

from io import BytesIO

from multipart import MultipartParser # type: ignore
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's ok to use multipart in the testing code, to double check the implementation... even if the code under tests would have use stdlib.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For what it's worth, the cgi docs suggest the multipart package:

Deprecated since version 3.11, will be removed in version 3.13: This function, like the rest of the cgi module, is deprecated. It can be replaced with the functionality in the email package (e.g. email.message.EmailMessage/email.message.Message) which implements the same MIME RFCs, or with the multipart PyPI project.

While it suggests the email package too, I do not think that the email package is representative of real-world multipart/form-data parsers, in particular because it targets email-dialect MIME that involves nesting never seen in HTTP. (As just one example, Django's multipart parser doesn't support nesting. I guarantee no browser ever generates it.)

A more ideal test suite would test against a variety of real-world multipart parsers, but I don't think that'd have great ROI.

self.assertIsInstance(resource.request_finishes[0].value, ConnectionDone)


class EncodingFromHeadersTests(unittest.TestCase):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add a docstring to describe the criteria for grouping tests into this class.
This should help future devs know if they should add a new tests here or add it somewhere else.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure! In the future it's fine to remind me of the Twisted coding standard, no need to write so much. :)

Copy link
Contributor Author

@twm twm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review @adiroiban! I'll give a little time for folks to discuss the third-party dependency situation.

def _encoding_from_headers(headers: Headers) -> Optional[str]:
content_types = headers.getRawHeaders("content-type")
if content_types is None:
return None

# This seems to be the choice browsers make when encountering multiple
# content-type headers.
content_type, params = cgi.parse_header(content_types[-1])
media_type, params = multipart.parse_options_header(content_types[-1])
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, the dependency here is not just for tests. In 0a3d17b I switched to using the equivalent of cgi.parse_header() provided by multipart to avoid vendoring code under the PSF license as that seemed to give @glyph pause.

I picked multipart because it has a few familiar faces attached, e.g. @cjwatson, who has PyPI upload permissions. It is hosted under an individual's GitHub account rather than an org, so that's a downside.

I also considered pulling the relevant bits of cgi into a separate package and putting it on PyPI. I'd be happy to do that under the Twisted org if that's preferable to an external dependency.

I can only echo @exarkun's skepticism of using the stdlib for this. The stdlib has an anti-bytes bias that is often incorrect and has led to a lot of makework for us in Twisted. The cgi module has seen some baffling refactors, then deprecation and removal. Now we're told to use the email package, which isn't explicitly implementing the HTTP RFCs. Seems risky. Also, it was recently rewritten! It's possible for code to be too maintained.

self.assertIsInstance(resource.request_finishes[0].value, ConnectionDone)


class EncodingFromHeadersTests(unittest.TestCase):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure! In the future it's fine to remind me of the Twisted coding standard, no need to write so much. :)

from typing import cast, AnyStr

from io import BytesIO

from multipart import MultipartParser # type: ignore
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For what it's worth, the cgi docs suggest the multipart package:

Deprecated since version 3.11, will be removed in version 3.13: This function, like the rest of the cgi module, is deprecated. It can be replaced with the functionality in the email package (e.g. email.message.EmailMessage/email.message.Message) which implements the same MIME RFCs, or with the multipart PyPI project.

While it suggests the email package too, I do not think that the email package is representative of real-world multipart/form-data parsers, in particular because it targets email-dialect MIME that involves nesting never seen in HTTP. (As just one example, Django's multipart parser doesn't support nesting. I guarantee no browser ever generates it.)

A more ideal test suite would test against a variety of real-world multipart parsers, but I don't think that'd have great ROI.

@adiroiban
Copy link
Member

adiroiban commented Jan 5, 2024

Thanks for the feedback. Much appreciated!

I have approved the PR and I am ok with using multipart for treq.

My comment about multipart vs stdlib was more of a FYI for what was done in twisted/twisted to fix this issue.

Also, I remember some discussions for getting treq code include in Twisted and have it as a high-level API .

From what I can see in this PR, multipart is used in "production" only to parse the Content-Type header. I was hoping that HTTP and MIME have the same standard for Content-Type header format.


I think that HTTP handling should be a core feature of any library, and this is why I would prefer to have http code implemented in stdlib.

For example, in Twisted, HTTP protocol handling is part of main Twisted package, while for LDAP we have a separate package.

Something similar for stdlib. I would expect Python to have a good enough HTTP handing, and use a separate package for LDAP handling.

@glyph glyph merged commit 8ccd453 into trunk Jan 6, 2024
16 checks passed
@glyph glyph deleted the bye-cgi-355 branch January 6, 2024 01:51
@opoplawski
Copy link

Any chance we could get a new release with this fix? Fedora rawhide just updated to Python 3.13b2 and treq is failing to build. This PR doesn't apply cleanly to 23.11.0. Thanks.

@adiroiban
Copy link
Member

adiroiban commented Jun 15, 2024

I think that first we should fix the 3.13 support in twisted/twisted

@stratakis
Copy link

Any chance we could get a new release with this fix? Fedora rawhide just updated to Python 3.13b2 and treq is failing to build. This PR doesn't apply cleanly to 23.11.0. Thanks.

Unfortunately on Fedora we don't package multipart, but python-multipart which is a different project. And while not entirely sure of the stability of each project, the multipart used here looks abandoned upstream (no commits since 2021).

@cjwatson
Copy link

multipart may not be super-active, but it isn't abandoned. I just released 0.2.5.

@stratakis
Copy link

multipart may not be super-active, but it isn't abandoned. I just released 0.2.5.

That is awesome, thanks for the response here :). I'll work on getting it packaged for Fedora then.

@twm
Copy link
Contributor Author

twm commented Jun 27, 2024

@opoplawski Thanks for the ping! I'm currently a bit loopy on cough medicine, but I should be able to manage a release this weekend or next week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

cgi is deprecated as of Python 3.11
7 participants