Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+1] Fix canonicalize_url() on Python 3 and re-enable tests #1947

Merged
merged 4 commits into from Apr 26, 2016

Conversation

redapple
Copy link
Contributor

Bits of #1874 related to canonicalize_url()

Does NOT handle issues raised in #1941

@redapple redapple added this to the v1.1 milestone Apr 21, 2016
@codecov-io
Copy link

codecov-io commented Apr 21, 2016

Current coverage is 83.23%

Merging #1947 into master will increase coverage by +<.01%

@@           master   #1947   diff @@
=====================================
  Files         161     161          
  Lines        8631    8671    +40   
  Methods         0       0          
  Branches     1258    1270    +12   
=====================================
+ Hits         7183    7217    +34   
- Misses       1201    1204     +3   
- Partials      247     250     +3   
  1. File ...py/utils/trackref.py (not in diff) was modified. more
    • Misses 0
    • Partials +1
    • Hits -1

Sunburst

Powered by Codecov. Last updated by 282d4e1

scheme, netloc, path, params, query, fragment = _safe_ParseResult(
parse_url(url), encoding=encoding)
except UnicodeError as e:
if encoding != 'utf8':
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are several other spellings for utf8:

  • utf_8;
  • utf-8;
  • U8;
  • UTF;

these names are also case-insensitive. See https://docs.python.org/3.5/library/codecs.html#standard-encodings.

Copy link
Contributor Author

@redapple redapple Apr 26, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thought, the test is not really necessary, I can't think of an example of UTF8 not working, so testing if supplied encoding is not UTF8 does not make much sense, i.e., if it failed, then it must have been some other encoding, so use UTF8 the 2nd time.

Also don't test passed encoding against 'utf8';
Just consider that if encoding failed, it must have been another encoding.
from six.moves.urllib.parse import (ParseResult, urlunparse, urldefrag,
urlparse, parse_qsl, urlencode,
unquote)
quote, unquote)
if six.PY3:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kmike , probably not six.PY2 here too, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right

@kmike
Copy link
Member

kmike commented Apr 26, 2016

Looks good, thanks @redapple for the hard work!

By the way, do you know if this is true? I think it can be a desirable property of canonicalize_url function.

url = canonicalize_url(url_original, encoding=X)
url2 = canonicalize_url(url, encoding=X)
# url is always the same as url2

The url passed can be a str or unicode, while the url returned is always a
str.
The url passed can be bytes or unicode, while the url returned is
always a native str (bytes in Python 2, unicode in Python 3).

For examples see the tests in tests/test_utils_url.py
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should add a note about what is canonicalize_url function for: it is not for cleaning up URLs before sending them to a server, it is for URL comparison and duplicate detection, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would not hurt to remind users of this, yes.
A bit of a shame that it's used by default in link extraction

@redapple
Copy link
Contributor Author

@kmike , safe_url_string is tested for this so worth a try with canonicalize_url too... and a test

@kmike kmike changed the title Fix canonicalize_url() on Python 3 and re-enable tests [MRG+1] Fix canonicalize_url() on Python 3 and re-enable tests Apr 26, 2016
@kmike
Copy link
Member

kmike commented Apr 26, 2016

@redapple do you want to make these changes in this PR, or should we just merge it as-is?

@redapple
Copy link
Contributor Author

would be good in this one. but it'll have to be tomorrow at earliest for me.
Or someone else pushes (tries) new tests

@codecov-io
Copy link

Current coverage is 81.75%

Merging #1947 into master will increase coverage by -1.46%

@@           master      #1947   diff @@
========================================
  Files         161        161          
  Lines        8631       8669    +38   
  Methods         0          0          
  Branches     1258       1269    +11   
========================================
- Hits         7183       7087    -96   
- Misses       1201       1266    +65   
- Partials      247        316    +69   
  1. 10 files (not in diff) in scrapy/utils were modified. more
    • Misses +4
    • Partials +11
    • Hits -15
  2. 2 files (not in diff) in scrapy/spiders were modified. more
    • Partials +3
    • Hits -3
  3. 3 files (not in diff) in scrapy/pipelines were modified. more
    • Misses +1
    • Partials +7
    • Hits -8
  4. 4 files (not in diff) in ...crapy/linkextractors were modified. more
    • Misses +27
    • Partials +7
    • Hits -34
  5. 4 files (not in diff) in scrapy/http were modified. more
    • Partials +10
    • Hits -10
  6. 3 files (not in diff) in .../downloader/handlers were modified. more
    • Misses +23
    • Partials +5
    • Hits -28
  7. 3 files (not in diff) in ...rapy/core/downloader were modified. more
    • Misses +10
    • Partials +1
    • Hits -11
  8. 1 files (not in diff) in scrapy/core were modified. more
    • Partials +1
    • Hits -1
  9. 9 files (not in diff) in scrapy were modified. more
    • Misses -2
    • Partials +14
    • Hits -12
  10. 2 files in scrapy were modified. more
    • Partials +3
    • Hits -3

Sunburst

Powered by Codecov. Last updated by c1a6d2c

@kmike kmike merged commit dbef7e2 into master Apr 26, 2016
redapple added a commit that referenced this pull request Apr 27, 2016
[backport][1.1] Fix canonicalize_url() on Python 3 and re-enable tests (PR #1947)
@kmike kmike deleted the canonicalize-url branch April 27, 2016 15:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants