[MRG+1] Fix canonicalize_url() on Python 3 and re-enable tests #1947
Current coverage is 83.23%
@@             master    #1947   diff @@
=====================================
  Files         161      161
  Lines        8631     8671    +40
  Methods         0        0
  Branches     1258     1270    +12
=====================================
+ Hits         7183     7217    +34
- Misses       1201     1204     +3
- Partials      247      250     +3
        scheme, netloc, path, params, query, fragment = _safe_ParseResult(
            parse_url(url), encoding=encoding)
    except UnicodeError as e:
        if encoding != 'utf8':
kmike
Apr 25, 2016
Member
there are several other spellings for utf8:
- utf_8;
- utf-8;
- U8;
- UTF;
these names are also case-insensitive. See https://docs.python.org/3.5/library/codecs.html#standard-encodings.
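As a minimal sketch of the point above, the standard `codecs.lookup()` function resolves aliases and ignores case, so a check against UTF-8 does not need to enumerate spellings. The helper name `is_utf8` is hypothetical and is not part of this PR.

```python
import codecs


def is_utf8(encoding):
    """Return True if `encoding` is any spelling of UTF-8 known to Python.

    Hypothetical helper for illustration; not taken from the PR.
    """
    try:
        # codecs.lookup() resolves aliases and is case-insensitive;
        # the canonical name of the UTF-8 codec is 'utf-8'.
        return codecs.lookup(encoding).name == 'utf-8'
    except LookupError:
        # Unknown encoding names are treated as "not UTF-8" here.
        return False


for name in ('utf8', 'utf_8', 'utf-8', 'U8', 'UTF', 'latin1'):
    print(name, is_utf8(name))
```

A plain string comparison such as `encoding != 'utf8'` would treat every alias other than the literal `'utf8'` as a different encoding.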
redapple
Apr 26, 2016
Author
Contributor
On second thought, the test is not really necessary. I can't think of an example of UTF8 not working, so testing whether the supplied encoding is not UTF8 does not make much sense: if encoding failed, it must have been some other encoding, so use UTF8 the second time.
    return urlunparse((scheme, netloc.lower(), path, params, query, fragment))


def _unquotepath(path):
    for reserved in ('2f', '2F', '3f', '3F'):
        path = path.replace('%' + reserved, '%25' + reserved.upper())
    return unquote(path)


if six.PY3:
kmike
Apr 25, 2016
Member
it is slightly better to use six.PY2 because six.PY3 may become False in Python 4
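A short sketch of the reasoning, assuming only that the two flags are derived from the interpreter's major version, which is how `six` defines them. The flags are recreated locally so the snippet runs without `six` installed, and `native_str_kind` is a hypothetical name.

```python
import sys

# Local copies of the flags, mirroring how six derives them
# from the interpreter's major version.
PY2 = sys.version_info[0] == 2
PY3 = sys.version_info[0] == 3


def native_str_kind():
    """Describe what the native `str` type holds on this interpreter."""
    # Branching on PY2 makes the legacy interpreter the special case.
    # Any later major version, for which PY3 would be False, still
    # falls through to the modern branch.
    if PY2:
        return 'bytes'
    return 'unicode'


print(native_str_kind())
```

Written the other way around (`if PY3: ... else: ...`), a hypothetical future major version would silently take the Python 2 branch.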
        if encoding != 'utf8':
            scheme, netloc, path, params, query, fragment = _safe_ParseResult(
                parse_url(url), encoding='utf8')
        else:
kmike
Apr 25, 2016
Member
When can it fail?
redapple
Apr 26, 2016
Author
Contributor
It fails, for example, for test_canonicalize_url_unicode_query_string_wrong_encoding
(when changing the code to raise the exception):

self = <tests.test_utils_url.CanonicalizeUrlTest testMethod=test_canonicalize_url_unicode_query_string_wrong_encoding>

    def test_canonicalize_url_unicode_query_string_wrong_encoding(self):
        # trying to encode with wrong encoding
        # fallback to UTF-8
>       self.assertEqual(canonicalize_url(u"http://www.example.com/résumé?currency=€", encoding='latin1'),
                         "http://www.example.com/r%C3%A9sum%C3%A9?currency=%E2%82%AC")

/home/paul/src/scrapy/tests/test_utils_url.py:147:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/home/paul/src/scrapy/scrapy/utils/url.py:85: in canonicalize_url
    parse_url(url), encoding=encoding)
/home/paul/src/scrapy/scrapy/utils/url.py:54: in _safe_ParseResult
    quote(to_bytes(parts.query, encoding), _safe_chars),
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
text = 'currency=€', encoding = 'latin1', errors = 'strict'

    def to_bytes(text, encoding=None, errors='strict'):
        """Return the binary representation of `text`. If `text`
        is already a bytes object, return it as-is."""
        if isinstance(text, bytes):
            return text
        if not isinstance(text, six.string_types):
            raise TypeError('to_bytes must receive a unicode, str or bytes '
                            'object, got %s' % type(text).__name__)
        if encoding is None:
            encoding = 'utf-8'
>       return text.encode(encoding, errors)
E       UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac' in position 9: ordinal not in range(256)

/home/paul/src/scrapy/scrapy/utils/python.py:120: UnicodeEncodeError
Also, don't test the passed encoding against 'utf8'; just assume that if encoding failed, it must have been with another encoding.
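The fallback described here can be sketched in isolation. This is a simplified illustration, not the PR's code: `quote_with_fallback` and its `safe` characters are assumptions, and only the standard `urllib.parse.quote` is used.

```python
from urllib.parse import quote


def quote_with_fallback(text, encoding='utf-8'):
    """Percent-encode `text`, retrying with UTF-8 if `encoding` fails.

    Hypothetical helper for illustration; not taken from the PR.
    """
    try:
        raw = text.encode(encoding)
    except UnicodeEncodeError:
        # No comparison against 'utf8' spellings: a failure here is
        # taken to mean that some narrower encoding was requested.
        raw = text.encode('utf-8')
    # '=' and '&' are kept literal so the query structure survives.
    return quote(raw, safe='=&')


# '€' is not representable in latin1, so UTF-8 is used instead.
print(quote_with_fallback(u'currency=€', encoding='latin1'))
# 'é' is representable in latin1, so the requested encoding is kept.
print(quote_with_fallback(u'résumé', encoding='latin1'))
```

The first call mirrors the query string from the failing test above. The second shows that the requested encoding is still honoured whenever it can represent the text.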
 from six.moves.urllib.parse import (ParseResult, urlunparse, urldefrag,
                                     urlparse, parse_qsl, urlencode,
-                                    unquote)
+                                    quote, unquote)
 if six.PY3:
kmike
Apr 26, 2016
Member
right
Looks good, thanks @redapple for the hard work! By the way, do you know if this is true? I think it can be a desirable property of canonicalize_url function.
-    The url passed can be a str or unicode, while the url returned is always a
-    str.
+    The url passed can be bytes or unicode, while the url returned is
+    always a native str (bytes in Python 2, unicode in Python 3).
     For examples see the tests in tests/test_utils_url.py
kmike
Apr 26, 2016
Member
I wonder if we should add a note about what the canonicalize_url function is for: it is not for cleaning up URLs before sending them to a server; it is for URL comparison and duplicate detection, right?
redapple
Apr 26, 2016
Author
Contributor
Would not hurt to remind users of this, yes.
A bit of a shame that it's used by default in link extraction
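To illustrate the comparison use case, here is a minimal sketch built on a deliberately reduced stand-in for canonicalize_url(). `simple_canonicalize` is hypothetical: lowercasing the host matches the `netloc.lower()` call in the diff above, while sorting the query arguments and dropping the fragment are assumptions chosen for the example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def simple_canonicalize(url):
    """Reduced, hypothetical stand-in for canonicalize_url().

    Lowercases the host, sorts query arguments and drops the fragment.
    """
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path, query, ''))


seen = set()
for url in ('http://Example.com/page?b=2&a=1#top',
            'http://example.com/page?a=1&b=2'):
    # The canonical form is only a comparison key; the original URL
    # is what would actually be requested.
    key = simple_canonicalize(url)
    print(url, 'duplicate' if key in seen else 'new')
    seen.add(key)
```

Keeping the canonical form as a lookup key and requesting the original URL avoids sending a rewritten URL to a server that may treat it differently.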
@kmike ,
@redapple do you want to make these changes in this PR, or should we just merge it as-is?
Would be good in this one, but it'll have to be tomorrow at the earliest for me.
Current coverage is 81.75%
@@             master    #1947   diff @@
========================================
  Files         161      161
  Lines        8631     8669    +38
  Methods         0        0
  Branches     1258     1269    +11
========================================
- Hits         7183     7087    -96
- Misses       1201     1266    +65
- Partials      247      316    +69
[backport][1.1] Fix canonicalize_url() on Python 3 and re-enable tests (PR #1947)
Bits of #1874 related to canonicalize_url()
Does NOT handle issues raised in #1941