
LxmlLinkExtractor fails handling unicode netlocs in Python2 #2323

Closed
starrify opened this issue Oct 12, 2016 · 7 comments

Comments

@starrify
Contributor

Affected version:
dc1f9ad

Affected Python version:
Python 2 only

Steps to reproduce:

>>> import scrapy.http
>>> response = scrapy.http.TextResponse(url='http://foo.com', body=u'<a href="http://foo\u263a">', encoding='utf8')
>>> response.css('a::attr(href)').extract()
[u'http://foo\u263a']
>>> import scrapy.linkextractors
>>> extractor = scrapy.linkextractors.LinkExtractor()
>>> extractor.extract_links(response)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/tmp/virtualenv/src/scrapy/scrapy/linkextractors/lxmlhtml.py", line 111, in extract_links
    all_links.extend(self._process_links(links))
  File "/tmp/virtualenv/src/scrapy/scrapy/linkextractors/__init__.py", line 104, in _process_links
    link.url = canonicalize_url(urlparse(link.url))
  File "/tmp/virtualenv/lib/python2.7/site-packages/w3lib/url.py", line 354, in canonicalize_url
    parse_url(url), encoding=encoding)
  File "/tmp/virtualenv/lib/python2.7/site-packages/w3lib/url.py", line 298, in _safe_ParseResult
    netloc = parts.netloc.encode('idna')
  File "/tmp/virtualenv/lib/python2.7/encodings/idna.py", line 164, in encode
    result.append(ToASCII(label))
  File "/tmp/virtualenv/lib/python2.7/encodings/idna.py", line 76, in ToASCII
    label = nameprep(label)
  File "/tmp/virtualenv/lib/python2.7/encodings/idna.py", line 21, in nameprep
    newlabel.append(stringprep.map_table_b2(c))
  File "/usr/lib64/python2.7/stringprep.py", line 197, in map_table_b2
    b = unicodedata.normalize("NFKC", al)
TypeError: normalize() argument 2 must be unicode, not str
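
Note that Scrapy isn't strictly needed to trigger this. The TypeError ("must be unicode, not str") suggests that the netloc being IDNA-encoded here is a UTF-8 byte str rather than unicode, so the stdlib idna codec's nameprep step receives byte-string characters. If so, encoding those bytes directly reproduces the failure (intermediate frames elided):

>>> 'foo\xe2\x98\xba'.encode('idna')  # UTF-8 bytes of u'foo\u263a'
Traceback (most recent call last):
  ...
TypeError: normalize() argument 2 must be unicode, not str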
@redapple
Contributor

Thanks for reporting, @starrify.
Is it the same as #2304?
Does scrapy/w3lib#75 fix it?


@redapple
Contributor

I tested scrapy 1.2.0 with scrapy/w3lib#75 (scrapy/w3lib@10865d9) and it fixes the issue:

$ scrapy version -v
Scrapy    : 1.2.0
lxml      : 3.6.4.0
libxml2   : 2.9.4
Twisted   : 16.4.1
Python    : 2.7.12 (default, Jul  1 2016, 15:12:24) - [GCC 5.4.0 20160609]
pyOpenSSL : 16.1.0 (OpenSSL 1.0.2g  1 Mar 2016)
Platform  : Linux-4.4.0-42-generic-x86_64-with-Ubuntu-16.04-xenial

$ pip freeze |grep w3lib
-e git+git@github.com:scrapy/w3lib.git@10865d916b74f26e4eb59f60a4bc11b88b89d674#egg=w3lib

$ scrapy shell
2016-10-13 10:59:02 [scrapy] INFO: Scrapy 1.2.0 started (bot: scrapybot)
>>> import scrapy.http
>>> response = scrapy.http.TextResponse(url='http://foo.com', body=u'<a href="http://foo\u263a">', encoding='utf8')
>>> response.css('a::attr(href)').extract()
[u'http://foo\u263a']
>>> import scrapy.linkextractors
>>> extractor = scrapy.linkextractors.LinkExtractor()
>>> extractor.extract_links(response)
[Link(url='http://xn--foo-4s5a/', text=u'', fragment='', nofollow=False)]
>>> 
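
Side note on that output: xn--foo-4s5a is simply the IDNA (ACE) form of the foo\u263a label. Python 2's idna codec handles that label fine when given unicode input, which is presumably what the patch ensures; it is the byte-str input shown in the original report that crashes:

>>> u'foo\u263a'.encode('idna')
'xn--foo-4s5a'
>>> 'xn--foo-4s5a'.decode('idna')
u'foo\u263a'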

@redapple redapple added this to the v1.2.1 milestone Oct 14, 2016
@redapple redapple removed this from the v1.2.1 milestone Oct 24, 2016
@jmb0z

jmb0z commented Feb 12, 2017

Any progress on this issue?

@ghost

ghost commented Feb 22, 2017

This also happens to me on version 1.3.0.
Any progress?

@esamattis

Upgrading from Scrapy 1.3.3 to 1.4.0 seems to fix this for me.

@malloxpb
Member

malloxpb commented Mar 2, 2018

@redapple do you think there's anything else to do in this issue? 😄😄😄

@elacuesta
Member

Working since (at least) w3lib 1.17.0 and scrapy 1.4.0:

Python 3

scrapy version -v
Scrapy    : 1.4.0
lxml      : 4.4.2.0
libxml2   : 2.9.9
cssselect : 1.1.0
parsel    : 1.5.2
w3lib     : 1.17.0
Twisted   : 19.10.0
Python    : 3.6.9 (default, Nov  7 2019, 10:44:02) - [GCC 8.3.0]
pyOpenSSL : 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019)
Platform  : Linux-4.15.0-20-generic-x86_64-with-LinuxMint-19.1-tessa
In [1]: from scrapy.http import TextResponse 
   ...: from scrapy.linkextractors import LinkExtractor 
   ...: response = TextResponse(url='http://foo.com', body=u'<a href="http://foo\u263a">', encoding='utf8') 
   ...: response.css('a::attr(href)').extract() 
   ...: extractor = LinkExtractor() 
   ...: extractor.extract_links(response)
Out[1]: [Link(url='http://foo☺', text='', fragment='', nofollow=False)]

Python 2

scrapy version -v
Scrapy    : 1.4.0
lxml      : 4.4.2.0
libxml2   : 2.9.9
cssselect : 1.1.0
parsel    : 1.5.2
w3lib     : 1.17.0
Twisted   : 19.10.0
Python    : 2.7.15+ (default, Oct  7 2019, 17:39:04) - [GCC 7.4.0]
pyOpenSSL : 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019)
Platform  : Linux-4.15.0-20-generic-x86_64-with-LinuxMint-19.1-tessa
In [1]: from scrapy.http import TextResponse
   ...: from scrapy.linkextractors import LinkExtractor
   ...: response = TextResponse(url='http://foo.com', body=u'<a href="http://foo\u263a">', encoding='utf8')
   ...: response.css('a::attr(href)').extract()
   ...: extractor = LinkExtractor()
   ...: extractor.extract_links(response)
   ...: 
Out[1]: [Link(url='http://foo\xe2\x98\xba', text=u'', fragment='', nofollow=False)]
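
Note that the URLs here keep their unicode form instead of the xn--foo-4s5a form from the earlier test, presumably because link extractors no longer canonicalize URLs by default as of Scrapy 1.4. The code path from the original traceback can still be exercised directly through w3lib (the expected output below is inferred from the earlier punycode result, not re-verified):

>>> from w3lib.url import canonicalize_url
>>> canonicalize_url(u'http://foo\u263a')  # raised TypeError on Python 2 before scrapy/w3lib#75
'http://xn--foo-4s5a/'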
