[MRG+1] Fix canonicalize_url() on Python 3 and re-enable tests #1947

redapple · 2016-04-21T12:42:10Z

Bits of #1874 related to canonicalize_url()

Does NOT handle issues raised in #1941

codecov-io · 2016-04-21T12:58:45Z

Current coverage is 83.23%

Merging #1947 into master will increase coverage by +<.01%

@@           master   #1947   diff @@
=====================================
  Files         161     161          
  Lines        8631    8671    +40   
  Methods         0       0          
  Branches     1258    1270    +12   
=====================================
+ Hits         7183    7217    +34   
- Misses       1201    1204     +3   
- Partials      247     250     +3

File ...py/utils/trackref.py (not in diff) was modified.
- Misses 0
- Partials +1
- Hits -1

Powered by Codecov. Last updated by 282d4e1

kmike · 2016-04-25T19:42:29Z

scrapy/utils/url.py

+        scheme, netloc, path, params, query, fragment = _safe_ParseResult(
+            parse_url(url), encoding=encoding)
+    except UnicodeError as e:
+        if encoding != 'utf8':


There are several other spellings for utf8:

utf_8;

utf-8;

U8;

UTF;

These names are also case-insensitive. See https://docs.python.org/3.5/library/codecs.html#standard-encodings.

On second thought, the test is not really necessary. I can't think of an example where UTF8 would not work, so checking that the supplied encoding is not UTF8 does not make much sense: if encoding failed, it must have been with some other encoding, so just use UTF8 the second time.

Also don't test passed encoding against 'utf8'; Just consider that if encoding failed, it must have been another encoding.
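
To illustrate the two points in this review comment, here is a minimal sketch in plain Python. It is not the code from this PR: `is_utf8` and `encode_with_fallback` are hypothetical helper names made up for this example. The first shows why comparing against the literal 'utf8' misses aliases (`codecs.lookup` normalizes them); the second shows the simpler "if it failed, retry with UTF-8" approach suggested above.

```python
import codecs


def is_utf8(encoding):
    """Return True if `encoding` is any registered alias of UTF-8.

    Hypothetical helper: codecs.lookup() resolves aliases and ignores case,
    so 'utf8', 'UTF-8', 'utf_8', 'U8' and 'UTF' all map to the same codec.
    """
    try:
        return codecs.lookup(encoding).name == "utf-8"
    except LookupError:
        # Unknown encoding name: definitely not UTF-8.
        return False


def encode_with_fallback(text, encoding):
    """Encode `text` with `encoding`; on failure, retry once with UTF-8.

    Hypothetical helper: no check of `encoding` against 'utf8' is made.
    """
    try:
        return text.encode(encoding)
    except UnicodeError:
        return text.encode("utf-8")
```

With this shape the name comparison becomes unnecessary: a character that the requested encoding cannot represent simply ends up UTF-8 encoded.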

redapple · 2016-04-26T13:31:05Z

scrapy/utils/url.py

 from six.moves.urllib.parse import (ParseResult, urlunparse, urldefrag,
                                    urlparse, parse_qsl, urlencode,
-                                    unquote)
+                                    quote, unquote)
+if six.PY3:


@kmike, probably `not six.PY2` here too, right?

kmike · 2016-04-26T16:32:32Z

Looks good, thanks @redapple for the hard work!

By the way, do you know whether the following holds? I think idempotence would be a desirable property of the canonicalize_url function: canonicalizing an already canonicalized URL should not change it.

url = canonicalize_url(url_original, encoding=X)
url2 = canonicalize_url(url, encoding=X)
# url is always the same as url2
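
The property above could be checked with a small test helper. The sketch below is only an illustration: `toy_canonicalize` is a deliberately simplified stand-in written for this example (it is NOT Scrapy's canonicalize_url and ignores encodings), and `is_idempotent` is a hypothetical helper name.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def toy_canonicalize(url):
    """Simplified stand-in for a URL canonicalizer (assumption, not Scrapy's).

    Lowercases the host, defaults the path to '/', sorts query arguments
    and drops the fragment.
    """
    scheme, netloc, path, query, _fragment = urlsplit(url)
    pairs = sorted(parse_qsl(query, keep_blank_values=True))
    return urlunsplit((scheme, netloc.lower(), path or "/", urlencode(pairs), ""))


def is_idempotent(func, url):
    """Return True if applying `func` twice gives the same result as once."""
    once = func(url)
    return func(once) == once
```

A test for the real function would follow the same pattern, looping over a list of sample URLs and encodings and asserting `is_idempotent` for each.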

kmike · 2016-04-26T16:33:53Z

scrapy/utils/url.py

-    The url passed can be a str or unicode, while the url returned is always a
-    str.
+    The url passed can be bytes or unicode, while the url returned is
+    always a native str (bytes in Python 2, unicode in Python 3).

    For examples see the tests in tests/test_utils_url.py


I wonder if we should add a note about what the canonicalize_url function is for: it is not for cleaning up URLs before sending them to a server; it is for URL comparison and duplicate detection, right?

Would not hurt to remind users of this, yes.
A bit of a shame that it's used by default in link extraction
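
For readers unfamiliar with the "native str" wording in the docstring above, here is a rough sketch of what it means. `to_native_str_sketch` is a hypothetical helper written for this example, using only the standard library instead of six; it is not the helper used in the PR.

```python
import sys

PY2 = sys.version_info[0] == 2


def to_native_str_sketch(value, encoding="utf-8"):
    """Return `value` as the interpreter's native str type.

    Native str is bytes on Python 2 and unicode text on Python 3.
    """
    if isinstance(value, str):
        # Already the native type on either Python version.
        return value
    if PY2:
        # Python 2: the only non-native input is unicode, so encode it.
        return value.encode(encoding)
    # Python 3: the only non-native input is bytes, so decode it.
    return value.decode(encoding)
```

On Python 3, passing `b"http://example.com/"` yields the text string `"http://example.com/"`, while a text string is returned unchanged.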

redapple · 2016-04-26T17:06:59Z

@kmike, safe_url_string is tested for this, so it is worth a try with canonicalize_url too... along with a test.

kmike · 2016-04-26T17:15:57Z

@redapple do you want to make these changes in this PR, or should we just merge it as-is?

redapple · 2016-04-26T17:42:10Z

It would be good to have them in this one, but it will have to be tomorrow at the earliest for me.
Or someone else can push (try) new tests.

codecov-io · 2016-04-26T18:03:58Z

Current coverage is 81.75%

Merging #1947 into master will decrease coverage by 1.46%

@@           master      #1947   diff @@
========================================
  Files         161        161          
  Lines        8631       8669    +38   
  Methods         0          0          
  Branches     1258       1269    +11   
========================================
- Hits         7183       7087    -96   
- Misses       1201       1266    +65   
- Partials      247        316    +69

10 files (not in diff) in scrapy/utils were modified.
- Misses +4
- Partials +11
- Hits -15
2 files (not in diff) in scrapy/spiders were modified.
- Partials +3
- Hits -3
3 files (not in diff) in scrapy/pipelines were modified.
- Misses +1
- Partials +7
- Hits -8
4 files (not in diff) in ...crapy/linkextractors were modified.
- Misses +27
- Partials +7
- Hits -34
4 files (not in diff) in scrapy/http were modified.
- Partials +10
- Hits -10
3 files (not in diff) in .../downloader/handlers were modified.
- Misses +23
- Partials +5
- Hits -28
3 files (not in diff) in ...rapy/core/downloader were modified.
- Misses +10
- Partials +1
- Hits -11
1 file (not in diff) in scrapy/core was modified.
- Partials +1
- Hits -1
9 files (not in diff) in scrapy were modified.
- Misses -2
- Partials +14
- Hits -12
2 files in scrapy were modified.
- Partials +3
- Hits -3

Powered by Codecov. Last updated by c1a6d2c

[backport][1.1] Fix canonicalize_url() on Python 3 and re-enable tests (PR #1947)

redapple mentioned this pull request Apr 21, 2016

Fix and re-enabled canonicalize_url() tests #1874

Closed

redapple added this to the v1.1 milestone Apr 21, 2016

Fix canonicalize_url() on Python 3 and re-enable tests

68dedf5

redapple force-pushed the canonicalize-url branch from 23142bb to 68dedf5 Compare April 21, 2016 12:51

redapple mentioned this pull request Apr 21, 2016

Add encoding to Link object + re-enable HTML entities links tests #1949

Closed

kmike reviewed Apr 25, 2016
Use six.PY2 instead of six.PY3 for Python version variations

25401fd

Also don't test passed encoding against 'utf8'; Just consider that if encoding failed, it must have been another encoding.

redapple reviewed Apr 26, 2016
Use six.PY2 also for conditional imports

efbe75e

kmike reviewed Apr 26, 2016
kmike changed the title ~~Fix canonicalize_url() on Python 3 and re-enable tests~~ [MRG+1] Fix canonicalize_url() on Python 3 and re-enable tests Apr 26, 2016

Add idempotence tests for canonicalize_url

0e11b3e

kmike merged commit dbef7e2 into master Apr 26, 2016

redapple mentioned this pull request Apr 27, 2016

[backport][1.1] Fix canonicalize_url() on Python 3 and re-enable tests (PR #1947) #1958

Merged

redapple added a commit that referenced this pull request Apr 27, 2016

Merge pull request #1958 from redapple/backport-1.1-pr1947

0890aa6

[backport][1.1] Fix canonicalize_url() on Python 3 and re-enable tests (PR #1947)

redapple mentioned this pull request Apr 27, 2016

Update changelog with changes since 1.1.0RC3 #1927

Closed

kmike deleted the canonicalize-url branch April 27, 2016 15:41

kmike mentioned this pull request Sep 9, 2016

idna-encode netloc for international domains #903

Closed
