When Location decoding fails, fall back to original #4372

mjpieters · 2017-11-04T18:28:35Z

Issue #3888 correctly identified Location headers as usually containing UTF-8
codepoints (when not correctly URL encoded), but this is not always the case.
For example the URL
http://www.finanzen.net/suchergebnis.asp?strSuchString=DE0005933931 redirects
to b'/etf/ishares_core_dax\xae_ucits_etf_de', containing the Latin-1 byte for
the ® character.

If UTF-8 decoding fails, it is better to fall back to the original.
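A minimal sketch of the fallback idea, assuming the raw header bytes are available. `decode_location` is a hypothetical helper for illustration, not a function in requests, and the actual change in this PR may differ:

```python
def decode_location(raw: bytes) -> str:
    """Decode raw Location header bytes, preferring UTF-8.

    Hypothetical helper for illustration only; not part of requests.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this cannot fail.
        return raw.decode("latin-1")


location = decode_location(b"/etf/ishares_core_dax\xae_ucits_etf_de")
print(location)
```

With the example header above, the UTF-8 attempt fails on the lone `\xae` byte and the Latin-1 fallback yields the path with the ® character intact.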

This issue was found via https://stackoverflow.com/questions/47113376/python-3-x-requests-redirect-with-unicode-character


mjpieters · 2017-11-04T18:33:29Z

Crumbs, tests fail on 2.x because it encodes a bytestring (latin-1 encoded), while Python 3 handles a Unicode value. Returning a native latin-1 string should work there.

mjpieters · 2017-11-04T18:42:53Z

Nope, to_native_string() returns a str on Python 2. Suggestions to produce consistent output on 2.x and 3.x appreciated; just returning location.decode('latin1') doesn't work either.

mjpieters · 2017-11-04T21:34:50Z

And another thought: Python 3 ends up with UTF-8 bytes in the URL-encoded redirection URL regardless of what encoding the server used in the Location header. Surprisingly, this specific server doesn't appear to care (both variants end up accepted and return the same response), but for other servers this may not necessarily be the case. Most will expect the exact same byte sequence to be used for the next location. How should requests handle those?
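The mismatch can be illustrated with the standard library alone. This is a sketch that assumes the header bytes from the example above and uses `urllib.parse.quote` as a stand-in for whatever quoting requests applies internally:

```python
from urllib.parse import quote

# Raw Location bytes as sent by the server in the example above.
raw = b"/etf/ishares_core_dax\xae_ucits_etf_de"

# Quoting the bytes directly preserves the server's byte sequence.
as_sent = quote(raw)

# Decoding as Latin-1 and quoting the resulting str re-encodes it as
# UTF-8 (the default for quote()), which changes the bytes on the wire.
as_requoted = quote(raw.decode("latin-1"))

print(as_sent)      # /etf/ishares_core_dax%AE_ucits_etf_de
print(as_requoted)  # /etf/ishares_core_dax%C2%AE_ucits_etf_de
```

A server that compares paths byte for byte would treat these as two different URLs.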

Lukasa · 2017-11-05T09:06:48Z

All of this is distressingly difficult for us to handle appropriately. The biggest issue is that we do not control header decoding on Python 3 (as noted in the code comments above the change you made), so things get tricky fast.

The core issue though is that we cannot "retain the original": we need to transition the string to a native form. Have you tried using to_native_string(resp.headers['location'], 'latin1') to see if that resolves the test failure?

mjpieters · 2017-11-05T16:37:52Z

> Have you tried using to_native_string(resp.headers['location'], 'latin1') to see if that resolves the test failure?

I did, it doesn't, because in Python 2 you'd get a bytestring still. That is then urlencoded to a different representation from the Python 3 Unicode string path.

Lukasa · 2017-11-05T16:41:22Z

@mjpieters What is the different urlencoding output in each case?

mjpieters · 2017-11-07T14:59:37Z

For the Latin-1 å character, Python 2 outputs %E5 and Python 3 outputs %C3%A5, that is, the URL-encoded Latin-1 byte and the URL-encoded UTF-8 bytes respectively.

You can reproduce these in Python 3 with:

>>> from urllib.parse import quote
>>> quote('å', encoding='utf8')
'%C3%A5'
>>> quote('å', encoding='latin')
'%E5'

Lukasa · 2017-11-07T20:34:23Z

So, just to be clear: when given a byte string in Python 2, the quote library just quotes its bytes directly. When given a unicode string on Python 3, the quote library encodes it and then quotes the bytes?

mjpieters · 2017-11-09T15:37:17Z

Exactly. And you can tell quote() what encoding to use too; the default is UTF-8. So if we can store the encoding for the location header (UTF-8, or if that fails, the fallback to Latin-1) we could use that information to re-encode to the same byte sequence.
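One way this could look, as a rough sketch rather than a proposed patch. Both `decode_with_encoding` and `requote_location` are hypothetical names, and the sketch assumes access to the raw header bytes:

```python
from typing import Tuple
from urllib.parse import quote


def decode_with_encoding(raw: bytes) -> Tuple[str, str]:
    """Return the decoded text plus the encoding that succeeded.

    Hypothetical helper: tries UTF-8 first, then falls back to Latin-1.
    """
    for encoding in ("utf-8", "latin-1"):
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Latin-1 decodes any byte sequence, so this line is never reached.
    raise AssertionError("unreachable")


def requote_location(raw: bytes) -> str:
    """Quote the location using the same encoding it was decoded with."""
    text, encoding = decode_with_encoding(raw)
    return quote(text, encoding=encoding)


print(requote_location(b"/caf\xe5"))                  # /caf%E5
print(requote_location("/caf\u00e5".encode("utf-8")))  # /caf%C3%A5
```

Remembering the encoding lets the quoted URL reproduce the server's original bytes in both cases. Note that this sketch does not handle values that are already percent-encoded; a plain `quote()` call would escape their `%` characters again.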

Lukasa · 2017-11-10T21:54:23Z

That sounds like it'd be the best approach, if we can swing it.

imtbl · 2020-01-19T02:44:24Z

Any update on this?

Edit: I see there's #4933 as well.

tomchristie mentioned this pull request Jan 9, 2019

requests can't properly handle redirects if the response body is encoded in something else than 'utf8' #4926

Open

lasote mentioned this pull request Sep 23, 2019

"Please log in to "conan-center" to perform this action." for several packages conan-io/conan#5780

Closed

ddormer mentioned this pull request Aug 6, 2020

Issue with Latin Characters in URL #5258

Open

nateprewitt changed the base branch from master to main January 3, 2022 15:30
