Add streaming multipart/form-data encoder #2331
base: main
Conversation
Very exciting, thanks for opening this! Let me know when you're ready for reviews.
(force-pushed from a8dcd6d to f52eb3e)
(force-pushed from 13635bb to 0e2a726)
@@ -12,7 +13,7 @@
     cast,
 )

-_TYPE_FIELD_VALUE = Union[str, bytes]
+_TYPE_FIELD_VALUE = Union[str, bytes, BinaryIO]
The docstring for all of this shows using a file here, so this was missing; it had the knock-on effect of rewriting the header params, which were overloading the definition of a "FIELD_VALUE".
Hmm, I'm not sure if files were supported. IIRC it was open(x).read()? Need to check on a PC.
Hm, so I guess requests always papered over this: https://github.com/psf/requests/blob/d09659997cd1e3eca49a07c59ece5557071c0ab9/requests/models.py#L160
I don't know how we consolidate these APIs though in such a way that folks don't need to read the entire file into memory.
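For context, a rough sketch of that kind of papering-over, assuming the linked code eagerly reads file-like values (this is an illustration, not the actual requests code):

```python
def coerce_field_value(value):
    # Hypothetical helper: if the value looks like a file, read it all.
    # This is the "paper over" approach -- it works for small files but
    # forces the entire file into memory, which is what streaming avoids.
    if hasattr(value, "read"):
        return value.read()
    return value
```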
(force-pushed from eaedfa1 to 8c9ec05)
(force-pushed from 9b7e6e2 to 47cf7d9)
Thanks for this! Here are some initial comments from mobile:
src/urllib3/multipart/decoder.py (outdated):

@typing.overload
def _split_on_find(content: str, bound: str) -> typing.Tuple[str, str]:
Can we use .partition() instead of this function?
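A minimal sketch of what the built-in .partition() gives in place of a hand-rolled split helper (illustrative values only):

```python
content = "Content-Disposition: form-data\r\n\r\nhello world"
before, sep, after = content.partition("\r\n\r\n")
# before == "Content-Disposition: form-data", after == "hello world"
# If the separator is missing, sep == "" and before is the whole string,
# so callers can branch without a separate `in` check.
```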
src/urllib3/multipart/decoder.py (outdated):

self.encoding = encoding
headers: typing.Dict[str, str] = {}
# Split into header section (if any) and the content
if b"\r\n\r\n" in content:
Can we use partition and check the middle element of the returned 3-tuple to avoid doing two scans for this bytestring?
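For illustration, the single-scan version this suggests, checking the middle element of the 3-tuple (sample content made up here):

```python
content = b"Content-Type: text/plain\r\n\r\nhello"

maybe_headers, sep, body = content.partition(b"\r\n\r\n")
if sep:  # a header section was present; parse maybe_headers
    headers_block = maybe_headers
else:    # no blank line at all: everything is content
    headers_block, body = b"", content
```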
src/urllib3/multipart/decoder.py (outdated):

return part

def _parse_body(self, content: bytes) -> None:
    boundary = b"".join((b"--", self.boundary))
Use a bytes literal + boundary here.
src/urllib3/multipart/decoder.py (outdated):

    and part != b"--"
)

parts = content.split(b"".join((b"\r\n", boundary)))
Bytes literal + boundary
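A small illustration of the suggestion (the boundary value here is made up):

```python
boundary = b"a1b2c3"                 # illustrative boundary bytes

full_boundary = b"--" + boundary     # instead of b"".join((b"--", boundary))
delimiter = b"\r\n" + full_boundary  # instead of b"".join((b"\r\n", full_boundary))
```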
src/urllib3/multipart/__init__.py (outdated):

@@ -0,0 +1,19 @@
"""Multipart support for urllib3."""

from .decoder import (
Should we hide the encoder/decoder submodules and reassign the names in this module? (i.e. from .encoder import x as x)
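Something like the following layout, assuming the submodules were renamed with a leading underscore (the class names here are placeholders, not necessarily the PR's):

```python
# src/urllib3/multipart/__init__.py
from ._decoder import MultipartDecoder as MultipartDecoder
from ._encoder import MultipartEncoder as MultipartEncoder

__all__ = ["MultipartDecoder", "MultipartEncoder"]
```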
src/urllib3/multipart/decoder.py (outdated):

from .. import _collections
from .encoder import encode_with

if typing.TYPE_CHECKING:
To make the typing consistent with the other files, I would suggest not using typing.x and instead importing names directly, as in from typing import x, y, ...
I find that importing things from the typing module pollutes namespaces. I can consolidate to that here, but I really dislike the practice of importing from typing rather than importing typing (or aliasing import typing to something shorter).
I have seen both styles (import typing and from typing import x) in different projects. I am not against either; I would suggest choosing one and applying it to all files.

BTW, I am OK with import typing as well. I can update the rest of the project's files in a separate PR. @sethmlarson what do you think?
Let's switch to the namespaced usage with typing.X.
We can use libcst to do it automagically for us if we want.
More comments for you, this time from the comfort of a PC.

A general comment: if it's possible, I'd like to move away from the .len property. You likely understand the historical significance more than I do, but if there's a way we can accomplish this using only __len__ and .tell() + .seek(), that'd be best IMO.
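A minimal sketch of computing a stream's remaining length with only .tell() and .seek(), with no custom .len property (the helper name is made up):

```python
import os

def remaining_length(fp) -> int:
    """Bytes left in a seekable file-like object from its current position."""
    pos = fp.tell()
    end = fp.seek(0, os.SEEK_END)  # seek() returns the new absolute position
    fp.seek(pos)                   # restore the original position
    return end - pos
```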
src/urllib3/multipart/encoder.py (outdated):

self.boundary_value: str = boundary or uuid.uuid4().hex

# Computed boundary
self.boundary: str = f"--{self.boundary_value}"
Should we have boundary be what was passed in by the user or generated, instead of boundary_value? I'm not sure why users would need --{boundary} as a public attribute.
Generally speaking, this isn't there for the users, but we can make these private/read-only as necessary. It's just more writing to prefix everything we use frequently as _{name} because we don't trust users to be intelligent and not muck with things.
Yeah, users will be users. 🤷 At least if these attributes are private we can say "told you so" if it ends up being a problem later.
src/urllib3/multipart/encoder.py (outdated):

)

#: Fields provided by the user
self.fields = fields
Should we change the fields, encoding, and finished properties to be read-only properties?
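What the read-only variant would look like, roughly (a sketch, not the PR's actual class):

```python
class _Encoder:  # illustrative skeleton only
    def __init__(self, fields, encoding: str = "utf-8") -> None:
        self._fields = fields
        self._encoding = encoding
        self._finished = False

    @property
    def fields(self):
        """Fields provided by the user (read-only)."""
        return self._fields

    @property
    def encoding(self) -> str:
        return self._encoding

    @property
    def finished(self) -> bool:
        return self._finished
```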
It's so much extra boilerplate to do what? What motivates wanting significantly more code to make things read-only here? Is the idea that if these are read-only, users won't hunt for the private attributes storing the data and then report bugs when updating the private attributes didn't do what they expected, versus whatever happens when they change undocumented attributes that are 'public' and writable? How adversarial a relationship do you want to have with our users?
src/urllib3/multipart/encoder.py (outdated):

    "Content-Length": self.content_length,
}

def getall(self) -> bytes:
If we have .read(), do we need this function? We can document that .read() can be used to read all the data it can.
I received complaints on toolbelt about not having something that didn't imply a need for a read size. 🤷 I don't care either way.
Let's stick with .read() for now, thanks!

So today, urllib3 is more-or-less architecture agnostic. We know from the harassment of the cryptography project for adopting Rust that there are people still operating older architectures and compiling things from scratch for themselves to make that work. On the face of it, I'd guess 90+% of our users are not on 32-bit architectures, but we had a fair number of requests + urllib3 users on 32-bit architectures that couldn't use the MultipartEncoder.

Also, I haven't done a deep reconsideration of the design here, just the minimal effort to migrate it. This was designed to work best within Requests and its idiosyncrasies.
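For illustration, assuming the 32-bit limitation being referenced is CPython's Py_ssize_t bound on len() (the usual reason a .len property exists alongside __len__):

```python
import sys

class HugeBody:
    def __len__(self) -> int:
        # On a 32-bit build sys.maxsize is 2**31 - 1, so a body of ~2 GiB
        # or more cannot be reported through len() at all.
        return sys.maxsize + 1

len(HugeBody())  # raises OverflowError: cannot fit 'int' into an index-sized integer
```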
(force-pushed from e352da1 to 112bcb9)
I initially implemented this for Requests in the requests-toolbelt, but it was already pretty generic because Requests just passes file-like objects (which is how the streaming encoder behaves) directly to urllib3. All that needed to change was what we were relying on from the requests namespace, imports, and such.

This also adds the decoder in the same breath, because it's easier to ensure it's all working together properly in one go, and it all fits together nicely.

One thing we _could_ do is consolidate a bunch of the logic and make `encode_multipart_formdata` rely on the streaming encoder and call `getall`, so that we don't have two implementations of the same logic.
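A rough sketch of that consolidation, assuming the PR's MultipartEncoder exposes getall() and a headers mapping with the computed Content-Type (the exact API may differ):

```python
def encode_multipart_formdata(fields, boundary=None):
    # Delegate to the streaming encoder so only one implementation of the
    # multipart encoding logic exists; getall() drains the whole body.
    encoder = MultipartEncoder(fields, boundary=boundary)
    return encoder.getall(), encoder.headers["Content-Type"]
```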
(force-pushed from 112bcb9 to c8c153e)
Regarding CI, there are two issues left:

It's incorrect in the context of the streaming encoder changes.

The docs are fine now; coverage is left. (And also make sure all concerns are addressed.)
What's the status of this? Can it be merged? OpenAPI Generator users want this, and we use this library.
@sigmavirus24 are you still working on this? I'm thinking of taking this over to get it over the line.
@nioncode Please do. The main task is figuring out what is left to do.
@sigmavirus24 this has been open for almost 3 years; shall we close it, or do you have any plan to revive this PR?
The project wants to land it. I don't have the time for it, and others have done work intermittently to get it over the line. As it stands, urllib3 can consume gigabytes of memory without proper warning to users.
Closes #51
Closes #624