
Add encoding to Link object + re-enable HTML entities links tests #1949

Closed
wants to merge 3 commits

Conversation

@redapple (Contributor) commented Apr 21, 2016

Add encoding parameter and attribute to Link object.
That's the only way I found to properly account for encoding information when building requests from link extractors.

Supersedes #1880
Should fix #1403

@redapple redapple added this to the v1.1 milestone Apr 21, 2016
@redapple redapple changed the title Add encoding to Link object + re-enable HTML entities links tests [WIP] Add encoding to Link object + re-enable HTML entities links tests Apr 21, 2016
@redapple (Contributor, Author) commented Apr 21, 2016

cf. test_restrict_xpaths_with_html_entities.
If you open this HTML in a browser:

<html>

<head>
  <meta http-equiv="content-type" content="text/html;charset=iso8859-15" />
</head>

<body>
  <p><a href="http://www.example.com/&hearts;/you?c=&euro;">text</a></p>
</body>

</html>

you'll see http://www.example.com/%E2%99%A5/you?c=%A4 being sent when clicked
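The browser behaviour described above (UTF-8 percent-encoding for the path, the page's declared encoding for the query) can be reproduced with the standard library's `urllib.parse.quote`; this is a minimal sketch using just the two characters from the example, not Scrapy code:

```python
from urllib.parse import quote

# Browsers percent-encode the path as UTF-8...
path = quote("\u2665", encoding="utf-8")        # heart symbol from the href path

# ...but percent-encode the query using the page's declared encoding.
query = quote("\u20ac", encoding="iso8859-15")  # euro sign from the query string

url = "http://www.example.com/%s/you?c=%s" % (path, query)
print(url)  # http://www.example.com/%E2%99%A5/you?c=%A4
```

The euro sign is 0xA4 in iso8859-15, which is why the query ends up as `c=%A4` rather than the UTF-8 form `c=%E2%82%AC`.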

@redapple redapple changed the title [WIP] Add encoding to Link object + re-enable HTML entities links tests Add encoding to Link object + re-enable HTML entities links tests Apr 27, 2016
@@ -28,6 +28,7 @@ def __init__(self, url, text='', fragment='', nofollow=False):
         self.text = text
         self.fragment = fragment
         self.nofollow = nofollow
+        self.encoding = encoding
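For reference, a minimal sketch of what the patched Link object could look like; the `__slots__` list and the `'utf-8'` default are assumptions for illustration, not the exact code in this PR:

```python
class Link:
    """Sketch of a Link carrying the encoding of the page it came from."""
    __slots__ = ['url', 'text', 'fragment', 'nofollow', 'encoding']

    def __init__(self, url, text='', fragment='', nofollow=False,
                 encoding='utf-8'):
        self.url = url
        self.text = text
        self.fragment = fragment
        self.nofollow = nofollow
        # Encoding of the source page; request builders can use it to
        # percent-encode the query string correctly.
        self.encoding = encoding
```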
Member commented:

In Python 2, when a unicode URL is passed, it is converted to bytes using UTF-8, regardless of the encoding passed. Should we use encoding instead? If so, the warning may become misleading. There should be a warning if encoding is not passed, but if the user passes it explicitly, then it may be fine to pass a unicode URL in Python 2.

Contributor (Author) commented:

I admit I don't quite understand why a unicode URL is converted to bytes within Link.
Was the intent to make it safe? (In that case, we could store the safe_url version of it, using encoding.)
Reading scrapy/tests/test_link.py, I find the following odd:

            link = Link(u"http://www.example.com/\xa3")
            self.assertIsInstance(link.url, str)
            self.assertEqual(link.url, b'http://www.example.com/\xc2\xa3')  <-----
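The odd-looking assertion can be reproduced in Python 3 by encoding the unicode URL as UTF-8, which is what the Python 2 Link constructor does under the hood; `\xa3` is the pound sign, whose UTF-8 encoding is the two bytes `\xc2\xa3`:

```python
url = "http://www.example.com/\xa3"   # unicode string containing the pound sign

# Python 2's Link.__init__ converts unicode input to bytes via UTF-8;
# the same conversion done explicitly:
encoded = url.encode("utf-8")
print(encoded)  # b'http://www.example.com/\xc2\xa3'
```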

Member commented:

In Python 2.x all URLs were bytes, so we required link.url to be bytes. For backwards compatibility, and because of missing unicode support in URL-related functions in Python 2.x, link.url is still required to be bytes in Python 2.x.

Member commented:

no idea what this test is for :)

@redapple (Contributor, Author) commented Apr 27, 2016:

So, in Python 2, we could pass it through safe_url_string, using encoding (and the conversion to bytes would be unnecessary, since the result would already be bytes)?

@redapple (Contributor, Author) commented Apr 27, 2016:

Or do you mean safe_url_string called twice, once in LinkExtractor and once in Request init?

Member commented:

yep

@redapple (Contributor, Author) commented Apr 27, 2016:

Tests seem to pass fine WITHOUT safe_url_string in LinkExtractor. Checking further.

Member commented:

My comment was about your suggestion to use safe_url_string in Link constructor; I haven't noticed it in LinkExtractor.

@codecov-io commented Apr 27, 2016

Current coverage is 81.65%

Merging #1949 into master will decrease coverage by 0.00%

@@             master      #1949   diff @@
==========================================
  Files           161        161          
  Lines          8669       8669          
  Methods           0          0          
  Messages          0          0          
  Branches       1269       1269          
==========================================
- Hits           7080       7078     -2   
  Misses         1233       1233          
- Partials        356        358     +2   
2 files (not in diff) in scrapy/core were modified:
  • Partials +2
  • Hits -2

Powered by Codecov. Last updated by a4dbf7e

@redapple (Contributor, Author) commented:

@kmike, with safe_url_string removed from link extraction (54f1c24), for the Python 2 unicode-URL input case in Link's init, one actually needs UTF-8 for to_bytes; otherwise test_restrict_xpaths_with_html_entities even fails ("iso8859-15" can only encode part of the Unicode characters).

What happens with to_bytes(url, encoding='utf8') in Link.__init__() is illustrated below:

Py2 warning: u'http://example.org/\u2665/you?c=\u20ac'
--> 'http://example.org/\xe2\x99\xa5/you?c=\xe2\x82\xac'

(link extraction's) canonicalize_url() gets this:
('http://example.org/\xe2\x99\xa5/you?c=\xe2\x82\xac', 'iso8859-15')

which goes through _safe_ParseResult(),
after parse_url('http://example.org/\xe2\x99\xa5/you?c=\xe2\x82\xac', None)

(passing None here actually saves the whole thing, as the bytes get decoded as UTF-8)

the rest is fine after that:
ParseResult(scheme=u'http', netloc=u'example.org', path=u'/\u2665/you', params='', query=u'c=\u20ac', fragment='')

_safe_ParseResult --> 'http', 'example.org', '/%E2%99%A5/you', '', 'c=%A4', ''
...
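The encoding split described in this walkthrough (path always re-quoted as UTF-8, query quoted with the page's declared encoding) can be sketched with the standard library alone; `safe_url` below is a hypothetical stand-in for the `_safe_ParseResult` pipeline, not the actual w3lib implementation:

```python
from urllib.parse import urlsplit, urlunsplit, quote

def safe_url(url, encoding='utf-8'):
    # Hypothetical re-implementation of the behaviour traced above:
    # the path is always percent-encoded as UTF-8, while the query is
    # percent-encoded using the page's declared encoding.
    parts = urlsplit(url)
    path = quote(parts.path, safe='/%', encoding='utf-8')
    query = quote(parts.query, safe='=&%', encoding=encoding)
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))

print(safe_url('http://example.org/\u2665/you?c=\u20ac', encoding='iso8859-15'))
# http://example.org/%E2%99%A5/you?c=%A4
```

This reproduces the final URL from the trace: the heart in the path becomes the UTF-8 sequence `%E2%99%A5`, while the euro in the query becomes `%A4` (its iso8859-15 code point).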

Development

Successfully merging this pull request may close these issues.

Exception in LxmLinkExtractor.extract_links 'charmap' codec can't encode character
3 participants