
Add encoding to Link object + re-enable HTML entities links tests #1949

Closed
wants to merge 3 commits

Conversation

@redapple (Contributor) commented Apr 21, 2016

Add encoding parameter and attribute to Link object.
That's the only way I found to properly account for encoding information when building requests from link extractors.

Supersedes #1880
Should fix #1403

@redapple redapple added this to the v1.1 milestone Apr 21, 2016
@redapple redapple changed the title Add encoding to Link object + re-enable HTML entities links tests [WIP] Add encoding to Link object + re-enable HTML entities links tests Apr 21, 2016
@redapple (Contributor, Author) commented Apr 21, 2016

cf. test_restrict_xpaths_with_html_entities.
If you open this HTML in a browser:

<html>

<head>
  <meta http-equiv="content-type" content="text/html;charset=iso8859-15" />
</head>

<body>
  <p><a href="http://www.example.com/&hearts;/you?c=&euro;">text</a></p>
</body>

</html>

you'll see http://www.example.com/%E2%99%A5/you?c=%A4 being sent when clicked
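The browser behaviour described above (UTF-8 percent-encoding for the path, the page's declared encoding for the query) can be reproduced with the standard library's `urllib.parse.quote`; this is a minimal sketch using just the two characters from the example, not Scrapy code:

```python
from urllib.parse import quote

# Browsers percent-encode the path as UTF-8...
path = quote("\u2665", encoding="utf-8")        # heart symbol from the href path

# ...but percent-encode the query using the page's declared encoding.
query = quote("\u20ac", encoding="iso8859-15")  # euro sign from the query string

url = "http://www.example.com/%s/you?c=%s" % (path, query)
print(url)  # http://www.example.com/%E2%99%A5/you?c=%A4
```

The euro sign is 0xA4 in iso8859-15, which is why the query ends up as `c=%A4` rather than the UTF-8 form `c=%E2%82%AC`.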

@redapple redapple changed the title [WIP] Add encoding to Link object + re-enable HTML entities links tests Add encoding to Link object + re-enable HTML entities links tests Apr 27, 2016
@@ -28,6 +28,7 @@ def __init__(self, url, text='', fragment='', nofollow=False):
         self.text = text
         self.fragment = fragment
         self.nofollow = nofollow
+        self.encoding = encoding
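For reference, a minimal sketch of what the patched Link object could look like; the `__slots__` list and the `'utf-8'` default are assumptions for illustration, not the exact code in this PR:

```python
class Link:
    """Sketch of a Link carrying the encoding of the page it came from."""
    __slots__ = ['url', 'text', 'fragment', 'nofollow', 'encoding']

    def __init__(self, url, text='', fragment='', nofollow=False,
                 encoding='utf-8'):
        self.url = url
        self.text = text
        self.fragment = fragment
        self.nofollow = nofollow
        # Encoding of the source page; request builders can use it to
        # percent-encode the query string correctly.
        self.encoding = encoding
```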
Member commented:

In Python 2, when a unicode URL is passed, it is converted to bytes using UTF-8, regardless of the encoding passed. Should we use encoding instead? If so, the warning may become misleading. There should be a warning if encoding is not passed, but if the user passes it explicitly, then it may be fine to pass a unicode URL in Python 2.

Contributor (Author) commented:

I admit I don't quite understand why a unicode URL is converted to bytes within Link.
Was the intent to make it safe? (In that case, we could store the safe_url version of it, using encoding.)
Reading scrapy/tests/test_link.py, I find the following odd:

            link = Link(u"http://www.example.com/\xa3")
            self.assertIsInstance(link.url, str)
            self.assertEqual(link.url, b'http://www.example.com/\xc2\xa3')  <-----
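The odd-looking assertion can be reproduced in Python 3 by encoding the unicode URL as UTF-8, which is what the Python 2 Link constructor does under the hood; `\xa3` is the pound sign, whose UTF-8 encoding is the two bytes `\xc2\xa3`:

```python
url = "http://www.example.com/\xa3"   # unicode string containing the pound sign

# Python 2's Link.__init__ converts unicode input to bytes via UTF-8;
# the same conversion done explicitly:
encoded = url.encode("utf-8")
print(encoded)  # b'http://www.example.com/\xc2\xa3'
```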

Member commented:

In Python 2.x all URLs were bytes, so we required link.url to be bytes. For backwards compatibility, and because of missing unicode support in URL-related functions in Python 2.x, link.url is still required to be bytes in Python 2.x.

Member commented:

no idea what this test is for :)

@redapple (Contributor, Author) commented Apr 27, 2016:

So, in Python 2, we could pass it through safe_url_string, using encoding (and the conversion to bytes would be unnecessary, since the result would already be bytes)?

@redapple (Contributor, Author) commented Apr 27, 2016:

Or do you mean safe_url_string called twice, once in LinkExtractor and once in Request init?

Member commented:

yep

@redapple (Contributor, Author) commented Apr 27, 2016:

Tests seem to pass fine WITHOUT safe_url_string in LinkExtractor. Checking further.

Member commented:

My comment was about your suggestion to use safe_url_string in Link constructor; I haven't noticed it in LinkExtractor.

@codecov-io commented Apr 27, 2016

Current coverage is 81.65%

Merging #1949 into master will decrease coverage by 0.00%

@@             master      #1949   diff @@
==========================================
  Files           161        161          
  Lines          8669       8669          
  Methods           0          0          
  Messages          0          0          
  Branches       1269       1269          
==========================================
- Hits           7080       7078     -2   
  Misses         1233       1233          
- Partials        356        358     +2   
2 files (not in diff) in scrapy/core were modified:
  • Partials +2
  • Hits -2

Powered by Codecov. Last updated by a4dbf7e

@redapple (Contributor, Author) commented:

@kmike, with safe_url_string removed from link extraction (54f1c24), for the Python 2 unicode-URL input case in Link's init, one actually needs UTF-8 for to_bytes; otherwise test_restrict_xpaths_with_html_entities even fails ("iso8859-15" can only encode part of the Unicode characters).

What happens with to_bytes(url, encoding='utf8') in Link.__init__() is illustrated below:

Py2 warning: u'http://example.org/\u2665/you?c=\u20ac'
--> 'http://example.org/\xe2\x99\xa5/you?c=\xe2\x82\xac'

(link extraction's) canonicalize_url() gets this:
('http://example.org/\xe2\x99\xa5/you?c=\xe2\x82\xac', 'iso8859-15')

which goes through _safe_ParseResult(),
after parse_url('http://example.org/\xe2\x99\xa5/you?c=\xe2\x82\xac', None)

(passing None here actually saves the whole thing, as the bytes get decoded as UTF-8)

the rest is fine after that:
ParseResult(scheme=u'http', netloc=u'example.org', path=u'/\u2665/you', params='', query=u'c=\u20ac', fragment='')

_safe_ParseResult --> 'http', 'example.org', '/%E2%99%A5/you', '', 'c=%A4', ''
...
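The encoding split described in this walkthrough (path always re-quoted as UTF-8, query quoted with the page's declared encoding) can be sketched with the standard library alone; `safe_url` below is a hypothetical stand-in for the `_safe_ParseResult` pipeline, not the actual w3lib implementation:

```python
from urllib.parse import urlsplit, urlunsplit, quote

def safe_url(url, encoding='utf-8'):
    # Hypothetical re-implementation of the behaviour traced above:
    # the path is always percent-encoded as UTF-8, while the query is
    # percent-encoded using the page's declared encoding.
    parts = urlsplit(url)
    path = quote(parts.path, safe='/%', encoding='utf-8')
    query = quote(parts.query, safe='=&%', encoding=encoding)
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))

print(safe_url('http://example.org/\u2665/you?c=\u20ac', encoding='iso8859-15'))
# http://example.org/%E2%99%A5/you?c=%A4
```

This reproduces the final URL from the trace: the heart in the path becomes the UTF-8 sequence `%E2%99%A5`, while the euro in the query becomes `%A4` (its iso8859-15 code point).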

Development

Successfully merging this pull request may close these issues.

Exception in LxmLinkExtractor.extract_links 'charmap' codec can't encode character
3 participants