When Location decoding fails, fall back to original #4372
Issue psf#3888 correctly identified Location headers as *usually* containing UTF-8 codepoints (when not correctly URL encoded), but this is not always the case. For example the URL http://www.finanzen.net/suchergebnis.asp?strSuchString=DE0005933931 redirects to `b'/etf/ishares_core_dax\xae_ucits_etf_de'`, containing the Latin-1 byte for the ® character. If UTF-8 decoding fails, it is better to fall back to the original.
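The fallback described above can be sketched as follows. This is a minimal illustration, not the code in this PR: `decode_location` is a hypothetical helper, and it assumes the raw header bytes are available. On Python 3, `http.client` has already decoded headers as Latin-1, so "the original" there is the Latin-1 decoded string.

```python
def decode_location(raw: bytes) -> str:
    """Hypothetical helper: prefer UTF-8, fall back to the Latin-1 original."""
    try:
        # Most servers that send unencoded non-ASCII bytes use UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this cannot fail.
        return raw.decode("latin-1")


latin1_header = b"/etf/ishares_core_dax\xae_ucits_etf_de"
utf8_header = "/caf\u00e9".encode("utf-8")

print(decode_location(latin1_header))
print(decode_location(utf8_header))
```

The lone `\xae` byte is not valid UTF-8, so the first call takes the fallback branch; the second header decodes as UTF-8 directly.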
Crumbs, tests fail on 2.x because it encodes a bytestring (latin-1 encoded), while Python 3 handles a Unicode value. Returning a native |
Nope, |
And another thought: Python 3 ends up with UTF-8 bytes in the URL-encoded redirection URL regardless of what encoding the server used in the Location header. Surprisingly, this specific server doesn't appear to care (both variants end up accepted and return the same response), but for other servers this may not necessarily be the same. Most will expect the exact same byte sequence to be used for the next location. How should requests handle those?
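To make the byte mismatch concrete, here is a small Python 3 sketch using only `urllib.parse`. It assumes the header reaches the redirect logic as a Latin-1 decoded string and is then percent-encoded with the default UTF-8 encoding; the variable names are illustrative and do not come from requests itself.

```python
from urllib.parse import quote, unquote_to_bytes

# Bytes the server actually sent in the Location header.
server_bytes = b"/etf/ishares_core_dax\xae_ucits_etf_de"

# Assumed Python 3 path: header text decoded as Latin-1, then quoted as UTF-8.
as_text = server_bytes.decode("latin-1")
sent_path = quote(as_text)

# Bytes the next request would carry once the server unquotes the path.
sent_bytes = unquote_to_bytes(sent_path)

print(sent_path)
print(sent_bytes == server_bytes)
```

Here `sent_path` contains `%C2%AE` where the server sent the single byte `\xae`, so the comparison prints `False`.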
All of this is distressingly difficult for us to handle appropriately. The biggest issue is that we do not control header decoding on Python 3 (as noted in the code comments above the change you made), so things get tricky fast. The core issue though is that we cannot "retain the original": we need to transition the string to a native form. Have you tried using |
I did, it doesn't, because in Python 2 you'd get a bytestring still. That is then urlencoded to a different representation from the Python 3 Unicode string path.
@mjpieters What is the different urlencoding output in each case?
For the Latin1 You can reproduce these in Python 3 with:
>>> from urllib.parse import quote
>>> quote('å', encoding='utf8')
'%C3%A5'
>>> quote('å', encoding='latin')
'%E5'
So, just to be clear: when given a byte string in Python 2, the quote library just quotes its bytes directly. When given a unicode string on Python 3, the quote library encodes it and then quotes the bytes?
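The two behaviours can be compared side by side on Python 3, where `urllib.parse.quote` accepts both `bytes` and `str`. This is only an illustration of the quoting difference under discussion, not a proposed fix.

```python
from urllib.parse import quote

# Bytes are percent-encoded as-is, with no re-encoding step.
from_bytes = quote(b"\xe5")

# A str is first encoded (UTF-8 by default), then percent-encoded.
from_str_default = quote("\u00e5")

# Passing encoding= makes the str path produce the Latin-1 byte instead.
from_str_latin1 = quote("\u00e5", encoding="latin-1")

print(from_bytes, from_str_default, from_str_latin1)
```

With an explicit Latin-1 encoding, the `str` path yields the same result as quoting the raw byte.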
Exactly. And you can tell |
That sounds like it'd be the best approach, if we can swing it.
Any update on this? Edit: I see there's #4933 as well.
Issue #3888 correctly identified Location headers as usually containing UTF-8
codepoints (when not correctly URL encoded), but this is not always the case.
For example the URL
http://www.finanzen.net/suchergebnis.asp?strSuchString=DE0005933931 redirects
to `b'/etf/ishares_core_dax\xae_ucits_etf_de'`, containing the Latin-1 byte for
the ® character.
If UTF-8 decoding fails, it is better to fall back to the original.
This issue was found via https://stackoverflow.com/questions/47113376/python-3-x-requests-redirect-with-unicode-character