New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTML5 non-ASCII File Names by Default #856

Closed
wants to merge 4 commits into
base: master
from

Conversation

Projects
None yet
6 participants
@Spryttan

Spryttan commented May 4, 2016

These changes address one point of urllib3/urllib3#303: encoding non-ASCII file names according to the HTML5 working draft by default, while still allowing file names to be encoded according to RFC 2231.

Spryttan added some commits May 3, 2016

Added support for non-ASCII file names to be rendered according to th…
…e HTML5 working draft, and made this the default.

Support for RFC 2231-style encoding remains. Which style encoding to use
can be selected by setting `filename_encoding_style` when creating a
`RequestField` (by `__init__` or by `from_tuples`).
Added More Tests & Unicode Literals
Made sure the `format_header_param_*` methods returned unicode strings
on all Python versions on all code paths.

Added a few more comments, including "Must be unicode." to
`RequestField.__init__`. This was to match the requirement of
`from_tuples` which requires field names and file names to be unicode.
try:
result.encode('ascii')
except (UnicodeEncodeError, UnicodeDecodeError):
pass
else:
return result
if not six.PY3 and isinstance(value, six.text_type): # Python 2:
if not six.PY3: # Python 2:

This comment has been minimized.

@Lukasa

Lukasa May 4, 2016

Contributor

Why has the isinstance check been removed here?

@Lukasa

Lukasa May 4, 2016

Contributor

Why has the isinstance check been removed here?

This comment has been minimized.

@Spryttan

Spryttan May 4, 2016

It was unnecessary, as the method doc requires that value be a unicode string. If we did allow non-unicode-data through here, we would need the encoding also, so it can be passed to encode_rfc2231. Right now, the formatted result is hardcoded to display utf-8 as the encoding.

@Spryttan

Spryttan May 4, 2016

It was unnecessary, as the method doc requires that value be a unicode string. If we did allow non-unicode-data through here, we would need the encoding also, so it can be passed to encode_rfc2231. Right now, the formatted result is hardcoded to display utf-8 as the encoding.

@Lukasa

This comment has been minimized.

Show comment
Hide comment
@Lukasa

Lukasa May 4, 2016

Contributor

Thanks @Spryttan!

I have three initial questions before delving into a big code review. Firstly: why the change to HTML5-by-default? That change is going to be frustrating for consumers like requests which need to defend against a sudden and somewhat subtle change that will not be clearly exposed but will affect the wire-form of their requests.

Secondly, why have you changed the return value of format_header_param_* to be unicode in all cases, rather than native strings?

Thirdly, it doesn't seem like the output of the encoder matches the spec. My reading of the spec is that you encode a unicode code point (e.g. U+1234) to Ӓ. Your code doesn't do that. What have I failed to understand?

Contributor

Lukasa commented May 4, 2016

Thanks @Spryttan!

I have three initial questions before delving into a big code review. Firstly: why the change to HTML5-by-default? That change is going to be frustrating for consumers like requests which need to defend against a sudden and somewhat subtle change that will not be clearly exposed but will affect the wire-form of their requests.

Secondly, why have you changed the return value of format_header_param_* to be unicode in all cases, rather than native strings?

Thirdly, it doesn't seem like the output of the encoder matches the spec. My reading of the spec is that you encode a unicode code point (e.g. U+1234) to Ӓ. Your code doesn't do that. What have I failed to understand?

@Spryttan

This comment has been minimized.

Show comment
Hide comment
@Spryttan

Spryttan May 4, 2016

Hi @Lukasa ,

I think there are two reasons to support HTML5 by default: The spec you linked to forbids the RFC2231 encoding: "User agents must not use the RFC 2231 encoding suggested by RFC 2388", and the HTML5-way seems to be more widely supported - but I have no data on that, just personal experience.

format_header_param used to return a str (bytes string) in Python 2 and a str (unicode string) in Python 3 - this was confusing. Also, it is easiest to manipulate string data if we keep it unicode all the time, and don't encode until as late as possible, when the request body is actually being built.

The part of the spec you linked to is referring to how to encode form data that is not expressible in the selected character set, which is a separate process from formatting the file name. The urllib3 "selected character coding" is hardcoded to utf-8 in filepost.py, line 11 writer = codecs.lookup('utf-8')[3], and utf-8 can express all characters, so we should never see anything like Ӓ.

Spryttan commented May 4, 2016

Hi @Lukasa ,

I think there are two reasons to support HTML5 by default: The spec you linked to forbids the RFC2231 encoding: "User agents must not use the RFC 2231 encoding suggested by RFC 2388", and the HTML5-way seems to be more widely supported - but I have no data on that, just personal experience.

format_header_param used to return a str (bytes string) in Python 2 and a str (unicode string) in Python 3 - this was confusing. Also, it is easiest to manipulate string data if we keep it unicode all the time, and don't encode until as late as possible, when the request body is actually being built.

The part of the spec you linked to is referring to how to encode form data that is not expressible in the selected character set, which is a separate process from formatting the file name. The urllib3 "selected character coding" is hardcoded to utf-8 in filepost.py, line 11 writer = codecs.lookup('utf-8')[3], and utf-8 can express all characters, so we should never see anything like Ӓ.

@Lukasa

This comment has been minimized.

Show comment
Hide comment
@Lukasa

Lukasa May 4, 2016

Contributor

The part of the spec you linked to is referring to how to encode form data that is not expressible in the selected character set, which is a separate process from formatting the file name.

So, just to clarify, the net effect is that the 'encoding algorithm' used in HTML5 is just...send UTF-8?

Contributor

Lukasa commented May 4, 2016

The part of the spec you linked to is referring to how to encode form data that is not expressible in the selected character set, which is a separate process from formatting the file name.

So, just to clarify, the net effect is that the 'encoding algorithm' used in HTML5 is just...send UTF-8?

@Spryttan

This comment has been minimized.

Show comment
Hide comment
@Spryttan

Spryttan May 5, 2016

Yes, given that the selected character encoding is utf-8. And the file name just takes the same encoding as the selected character encoding, although there is leeway to approximate the precise name "if necessary".

Spryttan commented May 5, 2016

Yes, given that the selected character encoding is utf-8. And the file name just takes the same encoding as the selected character encoding, although there is leeway to approximate the precise name "if necessary".

Show outdated Hide outdated urllib3/fields.py
@Lukasa

This comment has been minimized.

Show comment
Hide comment
@Lukasa

Lukasa May 5, 2016

Contributor

@Spryttan So, just to be clear. The HTML 5 specification here can be summed up as: just encode the text don't worry about it? Wow, ok then.

I'm definitely nervous about swapping our default behaviour here, but @shazow can express a preference.

Contributor

Lukasa commented May 5, 2016

@Spryttan So, just to be clear. The HTML 5 specification here can be summed up as: just encode the text don't worry about it? Wow, ok then.

I'm definitely nervous about swapping our default behaviour here, but @shazow can express a preference.

@Spryttan

This comment has been minimized.

Show comment
Hide comment
@Spryttan

Spryttan commented May 5, 2016

@Lukasa Yes.

@Spryttan

This comment has been minimized.

Show comment
Hide comment
@Spryttan

Spryttan May 5, 2016

About the failing build: nose.proxy.OSError: [Errno 98] Address already in use Is this just a transient error?

Spryttan commented May 5, 2016

About the failing build: nose.proxy.OSError: [Errno 98] Address already in use Is this just a transient error?

@shazow

This comment has been minimized.

Show comment
Hide comment
@shazow

shazow May 5, 2016

Collaborator

About the failing build: nose.proxy.OSError: [Errno 98] Address already in use Is this just a transient error?

Should be, restarted the build.

Collaborator

shazow commented May 5, 2016

About the failing build: nose.proxy.OSError: [Errno 98] Address already in use Is this just a transient error?

Should be, restarted the build.

@Lukasa

This comment has been minimized.

Show comment
Hide comment
@Lukasa

Lukasa May 6, 2016

Contributor

@shazow Before merging, can we consider whether HTML5 or RFC 2231 should be the default?

Contributor

Lukasa commented May 6, 2016

@shazow Before merging, can we consider whether HTML5 or RFC 2231 should be the default?

@sigmavirus24

This comment has been minimized.

Show comment
Hide comment
@sigmavirus24

sigmavirus24 May 6, 2016

Member

I'd also like a little time to research the relevant RFCs. There's a more recent RFC that makes reference to the HTML5 spec but updates the multipart/form-data spec. I think I passed it over to @Lukasa too.

Member

sigmavirus24 commented May 6, 2016

I'd also like a little time to research the relevant RFCs. There's a more recent RFC that makes reference to the HTML5 spec but updates the multipart/form-data spec. I think I passed it over to @Lukasa too.

@shazow

This comment has been minimized.

Show comment
Hide comment
@shazow

shazow May 6, 2016

Collaborator

Sounds good.

It's probably the prudent thing to add a transition period with a release warning but if HTML5 turns out to be more compliant these days then I'm fine with just switching to it and seeing what happens. :)

Collaborator

shazow commented May 6, 2016

Sounds good.

It's probably the prudent thing to add a transition period with a release warning but if HTML5 turns out to be more compliant these days then I'm fine with just switching to it and seeing what happens. :)

@theacodes

This comment has been minimized.

Show comment
Hide comment
@theacodes

theacodes Apr 27, 2018

Member

Thank you for creating this pull request, unfortunately, for one reason or
another it has become stale. We are declaring PR bankruptcy (see #1370) to
better focus our limited volunteer resources. If you still think this PR is
relevant and useful, please comment and let us know!

Member

theacodes commented Apr 27, 2018

Thank you for creating this pull request, unfortunately, for one reason or
another it has become stale. We are declaring PR bankruptcy (see #1370) to
better focus our limited volunteer resources. If you still think this PR is
relevant and useful, please comment and let us know!

@theacodes theacodes closed this Apr 27, 2018

@jadolg jadolg referenced this pull request May 12, 2018

Open

API upload missing #18

@singalen

This comment has been minimized.

Show comment
Hide comment
@singalen

singalen commented Jun 14, 2018

Came here from bottlepy/bottle#852.
:~(

@theacodes

This comment has been minimized.

Show comment
Hide comment
@theacodes

theacodes Jun 15, 2018

Member

@singalen happy to accept this PR if you want to make it current!

Member

theacodes commented Jun 15, 2018

@singalen happy to accept this PR if you want to make it current!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment