Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+1] Fix canonicalize_url() on Python 3 and re-enable tests #1947

Merged
merged 4 commits into from Apr 26, 2016

Conversation

@redapple
Copy link
Contributor

@redapple redapple commented Apr 21, 2016

Bits of #1874 related to canonicalize_url()

Does NOT handle issues raised in #1941

@redapple redapple added this to the v1.1 milestone Apr 21, 2016
@redapple redapple force-pushed the canonicalize-url branch from 23142bb to 68dedf5 Apr 21, 2016
@codecov-io
Copy link

@codecov-io codecov-io commented Apr 21, 2016

Current coverage is 83.23%

Merging #1947 into master will increase coverage by +<.01%

@@           master   #1947   diff @@
=====================================
  Files         161     161          
  Lines        8631    8671    +40   
  Methods         0       0          
  Branches     1258    1270    +12   
=====================================
+ Hits         7183    7217    +34   
- Misses       1201    1204     +3   
- Partials      247     250     +3   
  1. File ...py/utils/trackref.py (not in diff) was modified. more
    • Misses 0
    • Partials +1
    • Hits -1

Sunburst

Powered by Codecov. Last updated by 282d4e1

scheme, netloc, path, params, query, fragment = _safe_ParseResult(
parse_url(url), encoding=encoding)
except UnicodeError as e:
if encoding != 'utf8':

This comment has been minimized.

@kmike

kmike Apr 25, 2016
Member

there are several other spellings for utf8:

  • utf_8;
  • utf-8;
  • U8;
  • UTF;

these names are also case-insensitive. See https://docs.python.org/3.5/library/codecs.html#standard-encodings.

This comment has been minimized.

@redapple

redapple Apr 26, 2016
Author Contributor

On second thought, the test is not really necessary, I can't think of an example of UTF8 not working, so testing if supplied encoding is not UTF8 does not make much sense, i.e., if it failed, then it must have been some other encoding, so use UTF8 the 2nd time.

return urlunparse((scheme, netloc.lower(), path, params, query, fragment))


def _unquotepath(path):
for reserved in ('2f', '2F', '3f', '3F'):
path = path.replace('%' + reserved, '%25' + reserved.upper())
return unquote(path)

if six.PY3:

This comment has been minimized.

@kmike

kmike Apr 25, 2016
Member

it is slightly better to use six.PY2 because six.PY3 may become False in Python 4

if encoding != 'utf8':
scheme, netloc, path, params, query, fragment = _safe_ParseResult(
parse_url(url), encoding='utf8')
else:

This comment has been minimized.

@kmike

kmike Apr 25, 2016
Member

When can it fail?

This comment has been minimized.

@redapple

redapple Apr 26, 2016
Author Contributor

it fails for example for test_canonicalize_url_unicode_query_string_wrong_encoding (when changing the code to raise the exception)

self = <tests.test_utils_url.CanonicalizeUrlTest testMethod=test_canonicalize_url_unicode_query_string_wrong_encoding>

    def test_canonicalize_url_unicode_query_string_wrong_encoding(self):
        # trying to encode with wrong encoding
        # fallback to UTF-8
>       self.assertEqual(canonicalize_url(u"http://www.example.com/résumé?currency=€", encoding='latin1'),
                                          "http://www.example.com/r%C3%A9sum%C3%A9?currency=%E2%82%AC")

/home/paul/src/scrapy/tests/test_utils_url.py:147: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/home/paul/src/scrapy/scrapy/utils/url.py:85: in canonicalize_url
    parse_url(url), encoding=encoding)
/home/paul/src/scrapy/scrapy/utils/url.py:54: in _safe_ParseResult
    quote(to_bytes(parts.query, encoding), _safe_chars),
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

text = 'currency=€', encoding = 'latin1', errors = 'strict'

    def to_bytes(text, encoding=None, errors='strict'):
        """Return the binary representation of `text`. If `text`
        is already a bytes object, return it as-is."""
        if isinstance(text, bytes):
            return text
        if not isinstance(text, six.string_types):
            raise TypeError('to_bytes must receive a unicode, str or bytes '
                            'object, got %s' % type(text).__name__)
        if encoding is None:
            encoding = 'utf-8'
>       return text.encode(encoding, errors)
E       UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac' in position 9: ordinal not in range(256)

/home/paul/src/scrapy/scrapy/utils/python.py:120: UnicodeEncodeError
Also don't test passed encoding against 'utf8';
Just consider that if encoding failed, it must have been another encoding.
from six.moves.urllib.parse import (ParseResult, urlunparse, urldefrag,
urlparse, parse_qsl, urlencode,
unquote)
quote, unquote)
if six.PY3:

This comment has been minimized.

@redapple

redapple Apr 26, 2016
Author Contributor

@kmike , probably not six.PY2 here too, right?

This comment has been minimized.

@kmike

kmike Apr 26, 2016
Member

right

@kmike
Copy link
Member

@kmike kmike commented Apr 26, 2016

Looks good, thanks @redapple for the hard work!

By the way, do you know if this is true? I think it can be a desirable property of canonicalize_url function.

url = canonicalize_url(url_original, encoding=X)
url2 = canonicalize_url(url, encoding=X)
# url is always the same as url2
The url passed can be a str or unicode, while the url returned is always a
str.
The url passed can be bytes or unicode, while the url returned is
always a native str (bytes in Python 2, unicode in Python 3).
For examples see the tests in tests/test_utils_url.py

This comment has been minimized.

@kmike

kmike Apr 26, 2016
Member

I wonder if we should add a note about what is canonicalize_url function for: it is not for cleaning up URLs before sending them to a server, it is for URL comparison and duplicate detection, right?

This comment has been minimized.

@redapple

redapple Apr 26, 2016
Author Contributor

Would not hurt to remind users of this, yes.
A bit of a shame that it's used by default in link extraction

@redapple
Copy link
Contributor Author

@redapple redapple commented Apr 26, 2016

@kmike , safe_url_string is tested for this so worth a try with canonicalize_url too... and a test

@kmike kmike changed the title Fix canonicalize_url() on Python 3 and re-enable tests [MRG+1] Fix canonicalize_url() on Python 3 and re-enable tests Apr 26, 2016
@kmike
Copy link
Member

@kmike kmike commented Apr 26, 2016

@redapple do you want to make these changes in this PR, or should we just merge it as-is?

@redapple
Copy link
Contributor Author

@redapple redapple commented Apr 26, 2016

would be good in this one. but it'll have to be tomorrow at earliest for me.
Or someone else pushes (tries) new tests

@codecov-io
Copy link

@codecov-io codecov-io commented Apr 26, 2016

Current coverage is 81.75%

Merging #1947 into master will increase coverage by -1.46%

@@           master      #1947   diff @@
========================================
  Files         161        161          
  Lines        8631       8669    +38   
  Methods         0          0          
  Branches     1258       1269    +11   
========================================
- Hits         7183       7087    -96   
- Misses       1201       1266    +65   
- Partials      247        316    +69   
  1. 10 files (not in diff) in scrapy/utils were modified. more
    • Misses +4
    • Partials +11
    • Hits -15
  2. 2 files (not in diff) in scrapy/spiders were modified. more
    • Partials +3
    • Hits -3
  3. 3 files (not in diff) in scrapy/pipelines were modified. more
    • Misses +1
    • Partials +7
    • Hits -8
  4. 4 files (not in diff) in ...crapy/linkextractors were modified. more
    • Misses +27
    • Partials +7
    • Hits -34
  5. 4 files (not in diff) in scrapy/http were modified. more
    • Partials +10
    • Hits -10
  6. 3 files (not in diff) in .../downloader/handlers were modified. more
    • Misses +23
    • Partials +5
    • Hits -28
  7. 3 files (not in diff) in ...rapy/core/downloader were modified. more
    • Misses +10
    • Partials +1
    • Hits -11
  8. 1 files (not in diff) in scrapy/core were modified. more
    • Partials +1
    • Hits -1
  9. 9 files (not in diff) in scrapy were modified. more
    • Misses -2
    • Partials +14
    • Hits -12
  10. 2 files in scrapy were modified. more
    • Partials +3
    • Hits -3

Sunburst

Powered by Codecov. Last updated by c1a6d2c

@kmike kmike merged commit dbef7e2 into master Apr 26, 2016
1 check passed
1 check passed
continuous-integration/travis-ci/pr The Travis CI build passed
Details
redapple added a commit that referenced this pull request Apr 27, 2016
[backport][1.1] Fix canonicalize_url() on Python 3 and re-enable tests (PR #1947)
@kmike kmike deleted the canonicalize-url branch Apr 27, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

3 participants