Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

When Location decoding fails, fall back to original #4372

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

mjpieters
Copy link
Contributor

Issue #3888 correctly identified Location headers as usually containing UTF-8
codepoints (when not correctly URL encoded), but this is not always the case.
For example the URL
http://www.finanzen.net/suchergebnis.asp?strSuchString=DE0005933931 redirects
to b'/etf/ishares_core_dax\xae_ucits_etf_de', containing the Latin-1 byte for
the ® character.

If UTF-8 decoding fails, it is better to fall back to the original.

This issue was found via https://stackoverflow.com/questions/47113376/python-3-x-requests-redirect-with-unicode-character

Issue psf#3888 correctly identified Location headers as *usually* containing UTF-8
codepoints (when not correctly URL encoded), but this is not always the case.
For example the URL
http://www.finanzen.net/suchergebnis.asp?strSuchString=DE0005933931 redirects
to `b'/etf/ishares_core_dax\xae_ucits_etf_de'`, containing the Latin-1 byte for
the ® character.

If UTF-8 decoding fails, it is better to fall back to the original.
@mjpieters
Copy link
Contributor Author

mjpieters commented Nov 4, 2017

Crumbs, tests fail on 2.x because it encodes a bytestring (latin-1 encoded), while Python 3 handles a Unicode value. Returning a native latin-1 string should work there.

@mjpieters
Copy link
Contributor Author

mjpieters commented Nov 4, 2017

Nope, to_native_string() returns a str on Python 2. Suggestions to produce consistent output on 2.x and 3.x appreciated; just returning location.decode('latin1') doesn't work either.

@mjpieters
Copy link
Contributor Author

And another thought: Python 3 ends up with UTF8 bytes in the URL-encoded redirection URL regardless of what encoding the server used in the Location header. Surprisingly, this specific server doesn't appear to care (both variants end accepted and return the same response), but for other servers this may necessarily be the same. Most will expect the exact same byte sequence to be used for the next location. How should requests handle those?

@Lukasa
Copy link
Member

Lukasa commented Nov 5, 2017

All of this is distressingly difficult for us to handle appropriately. The biggest issue is that we do not control header decoding on Python 3 (as noted in the code comments above the change you made), so things get tricky fast.

The core issue though is that we cannot "retain the original": we need to transition the string to a native form. Have you tried using to_native_string(resp.headers['location'], 'latin1') to see if that resolves the test failure?

@mjpieters
Copy link
Contributor Author

Have you tried using to_native_string(resp.headers['location'], 'latin1') to see if that resolves the test failure?

I did, it doesn't, because in Python 2 you'd get a bytestring still. That is then urlencoded to a different representation from the Python 3 Unicode string path.

@Lukasa
Copy link
Member

Lukasa commented Nov 5, 2017

@mjpieters What is the different urlencoding output in each case?

@mjpieters
Copy link
Contributor Author

For the Latin1 å character, Python 2 outputs %E5, Python 3 %C3%A5, so the Latin-1 and UTF-8 bytes URL-encoded.

You can reproduce these in Python 3 with:

>>> from urllib.parse import quote
>>> quote('å', encoding='utf8')
'%C3%A5'
>>> quote('å', encoding='latin')
'%E5'

@Lukasa
Copy link
Member

Lukasa commented Nov 7, 2017

So, just to be clear: when given a byte string in Python 2, the quote library just quotes its bytes directly. When given a unicode string on Python 3, the quote library encodes it and then quotes the bytes?

@mjpieters
Copy link
Contributor Author

Exactly. And you can tell quote() what encoding to use too; the default is UTF-8. So if we can store the encoding for the location header (UTF-8, or if that fails, the fallback to Latin-1) we could use that information to re-encode to the same.

@Lukasa
Copy link
Member

Lukasa commented Nov 10, 2017

That sounds like it'd be the best approach, if we can swing it.

@imtbl
Copy link

imtbl commented Jan 19, 2020

Any update on this?

Edit: I see there's #4933 as well.

@nateprewitt nateprewitt changed the base branch from master to main January 3, 2022 15:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants