[MRG+1] Fix canonicalize_url() on Python 3 and re-enable tests #1947
23142bb to 68dedf5
Current coverage is 83.23%
@@            master   #1947   diff @@
=====================================
  Files          161     161
  Lines         8631    8671    +40
  Methods          0       0
  Branches      1258    1270    +12
=====================================
+ Hits          7183    7217    +34
- Misses        1201    1204     +3
- Partials       247     250     +3
        scheme, netloc, path, params, query, fragment = _safe_ParseResult(
            parse_url(url), encoding=encoding)
    except UnicodeError as e:
        if encoding != 'utf8':
There are several other spellings for utf8:
- utf_8;
- utf-8;
- U8;
- UTF.
These names are also case-insensitive. See https://docs.python.org/3.5/library/codecs.html#standard-encodings.
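To illustrate the point about aliases, here is a minimal sketch that compares encoding names through the standard library codec registry instead of by string equality. The helper name is_utf8 is hypothetical and is not part of this PR.

```python
import codecs


def is_utf8(encoding):
    """Return True if `encoding` is any registered spelling of UTF-8.

    Hypothetical helper for illustration only. codecs.lookup() resolves
    aliases and ignores case, so all spellings map to one codec name.
    """
    try:
        return codecs.lookup(encoding).name == 'utf-8'
    except LookupError:
        # Unknown encoding names are not UTF-8.
        return False


for name in ('utf8', 'utf_8', 'utf-8', 'U8', 'UTF', 'latin1'):
    print(name, is_utf8(name))
```

A plain comparison such as encoding != 'utf8' would treat 'UTF-8' or 'U8' as a different encoding, which is the concern raised in the comment above.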
On second thought, the test is not really necessary. I can't think of an example where UTF-8 fails, so checking whether the supplied encoding is UTF-8 does not make much sense: if encoding failed, it must have been some other encoding, so use UTF-8 the second time.
Also, don't test the passed encoding against 'utf8'; just consider that if encoding failed, it must have been another encoding.
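The fallback described here can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code from the PR: encode_with_fallback is a hypothetical name, and it operates on a plain string rather than on parsed URL components.

```python
def encode_with_fallback(text, encoding='utf8'):
    """Encode `text` with `encoding`, retrying with UTF-8 on failure.

    Hypothetical helper for illustration only. There is no check of
    whether `encoding` is already UTF-8: UTF-8 can encode any valid
    Unicode text, so a failure implies some other encoding was used.
    """
    try:
        return text.encode(encoding)
    except UnicodeError:
        # The requested encoding cannot represent the text.
        return text.encode('utf8')


print(encode_with_fallback(u'caf\xe9', 'latin1'))   # representable in latin1
print(encode_with_fallback(u'\u20ac', 'latin1'))    # euro sign, falls back
```

Dropping the comparison also sidesteps the alias problem, since the encoding name is never inspected.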
 from six.moves.urllib.parse import (ParseResult, urlunparse, urldefrag,
                                     urlparse, parse_qsl, urlencode,
-                                    unquote)
+                                    quote, unquote)
 if six.PY3:
@kmike, probably not six.PY2 here too, right?
right
Looks good, thanks @redapple for the hard work! By the way, do you know if this is true? I think it can be a desirable property of the canonicalize_url function.
-    The url passed can be a str or unicode, while the url returned is always a
-    str.
+    The url passed can be bytes or unicode, while the url returned is
+    always a native str (bytes in Python 2, unicode in Python 3).

     For examples see the tests in tests/test_utils_url.py
I wonder if we should add a note about what the canonicalize_url function is for: it is not for cleaning up URLs before sending them to a server; it is for URL comparison and duplicate detection, right?
Would not hurt to remind users of this, yes.
A bit of a shame that it's used by default in link extraction
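To make the comparison-versus-request distinction concrete, here is a minimal sketch of duplicate detection keyed on a canonical form. simple_canonical_key is a hypothetical, much-simplified stand-in (it only sorts query arguments and drops the fragment) and does not reproduce what canonicalize_url() actually does; the example.com URLs are made up.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def simple_canonical_key(url):
    """Hypothetical, simplified comparison key for a URL.

    Sorts query arguments and drops the fragment. This is NOT the
    canonicalize_url() implementation, only an illustration.
    """
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ''))


def drop_duplicates(urls):
    """Keep the first URL seen for each canonical key."""
    seen = set()
    unique = []
    for url in urls:
        key = simple_canonical_key(url)
        if key not in seen:
            seen.add(key)
            # The key is used only for comparison; the original URL
            # is what would be sent to the server.
            unique.append(url)
    return unique


print(drop_duplicates([
    'http://example.com/a?b=2&a=1',
    'http://example.com/a?a=1&b=2#frag',
    'http://example.com/b',
]))
```

Keeping the original URL in the result, rather than the key, reflects the intent stated above: the canonical form is for comparison, not for building requests.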
@kmike,
@redapple do you want to make these changes in this PR, or should we just merge it as-is?
It would be good to do it in this one, but it'll have to be tomorrow at the earliest for me.
Current coverage is 81.75%
@@            master   #1947   diff @@
========================================
  Files          161     161
  Lines         8631    8669    +38
  Methods          0       0
  Branches      1258    1269    +11
========================================
- Hits          7183    7087    -96
- Misses        1201    1266    +65
- Partials       247     316    +69
[backport][1.1] Fix canonicalize_url() on Python 3 and re-enable tests (PR #1947)
Bits of #1874 related to canonicalize_url()
Does NOT handle issues raised in #1941