
LxmlLinkExtractor fails handling unicode netlocs in Python2 #2323

Closed
starrify opened this issue Oct 12, 2016 · 7 comments

Comments

@starrify
Contributor

Affected version:
dc1f9ad

Affected Python version:
Python 2 only

Steps to reproduce:

>>> import scrapy.http
>>> response = scrapy.http.TextResponse(url='http://foo.com', body=u'<a href="http://foo\u263a">', encoding='utf8')
>>> response.css('a::attr(href)').extract()
[u'http://foo\u263a']
>>> import scrapy.linkextractors
>>> extractor = scrapy.linkextractors.LinkExtractor()
>>> extractor.extract_links(response)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/tmp/virtualenv/src/scrapy/scrapy/linkextractors/lxmlhtml.py", line 111, in extract_links
    all_links.extend(self._process_links(links))
  File "/tmp/virtualenv/src/scrapy/scrapy/linkextractors/__init__.py", line 104, in _process_links
    link.url = canonicalize_url(urlparse(link.url))
  File "/tmp/virtualenv/lib/python2.7/site-packages/w3lib/url.py", line 354, in canonicalize_url
    parse_url(url), encoding=encoding)
  File "/tmp/virtualenv/lib/python2.7/site-packages/w3lib/url.py", line 298, in _safe_ParseResult
    netloc = parts.netloc.encode('idna')
  File "/tmp/virtualenv/lib/python2.7/encodings/idna.py", line 164, in encode
    result.append(ToASCII(label))
  File "/tmp/virtualenv/lib/python2.7/encodings/idna.py", line 76, in ToASCII
    label = nameprep(label)
  File "/tmp/virtualenv/lib/python2.7/encodings/idna.py", line 21, in nameprep
    newlabel.append(stringprep.map_table_b2(c))
  File "/usr/lib64/python2.7/stringprep.py", line 197, in map_table_b2
    b = unicodedata.normalize("NFKC", al)
TypeError: normalize() argument 2 must be unicode, not str
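
Note that Scrapy isn't strictly needed to trigger this. The TypeError ("must be unicode, not str") suggests that the netloc being IDNA-encoded here is a UTF-8 byte str rather than unicode, so the stdlib idna codec's nameprep step receives byte-string characters. If so, encoding those bytes directly reproduces the failure (intermediate frames elided):

>>> 'foo\xe2\x98\xba'.encode('idna')  # UTF-8 bytes of u'foo\u263a'
Traceback (most recent call last):
  ...
TypeError: normalize() argument 2 must be unicode, not str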
@redapple
Contributor

Thanks for reporting, @starrify.
Is it the same as #2304?
Does scrapy/w3lib#75 fix it?


@redapple
Contributor

I tested scrapy 1.2.0 with scrapy/w3lib#75 (scrapy/w3lib@10865d9) and it fixes the issue:

$ scrapy version -v
Scrapy    : 1.2.0
lxml      : 3.6.4.0
libxml2   : 2.9.4
Twisted   : 16.4.1
Python    : 2.7.12 (default, Jul  1 2016, 15:12:24) - [GCC 5.4.0 20160609]
pyOpenSSL : 16.1.0 (OpenSSL 1.0.2g  1 Mar 2016)
Platform  : Linux-4.4.0-42-generic-x86_64-with-Ubuntu-16.04-xenial

$ pip freeze |grep w3lib
-e git+git@github.com:scrapy/w3lib.git@10865d916b74f26e4eb59f60a4bc11b88b89d674#egg=w3lib

$ scrapy shell
2016-10-13 10:59:02 [scrapy] INFO: Scrapy 1.2.0 started (bot: scrapybot)
>>> import scrapy.http
>>> response = scrapy.http.TextResponse(url='http://foo.com', body=u'<a href="http://foo\u263a">', encoding='utf8')
>>> response.css('a::attr(href)').extract()
[u'http://foo\u263a']
>>> import scrapy.linkextractors
>>> extractor = scrapy.linkextractors.LinkExtractor()
>>> extractor.extract_links(response)
[Link(url='http://xn--foo-4s5a/', text=u'', fragment='', nofollow=False)]
>>> 
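
Side note on that output: xn--foo-4s5a is simply the IDNA (ACE) form of the foo\u263a label. Python 2's idna codec handles that label fine when given unicode input, which is presumably what the patch ensures; it is the byte-str input shown in the original report that crashes:

>>> u'foo\u263a'.encode('idna')
'xn--foo-4s5a'
>>> 'xn--foo-4s5a'.decode('idna')
u'foo\u263a'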

@redapple redapple added this to the v1.2.1 milestone Oct 14, 2016
@redapple redapple removed this from the v1.2.1 milestone Oct 24, 2016
@jmb0z

jmb0z commented Feb 12, 2017

Any progress on this issue?

@ghost

ghost commented Feb 22, 2017

This also happens to me on version 1.3.0.
Any progress?

@esamattis

Upgrading from Scrapy 1.3.3 to 1.4.0 seems to fix this for me.

@malloxpb
Member

malloxpb commented Mar 2, 2018

@redapple do you think there's anything else to do in this issue? 😄😄😄

@elacuesta
Member

Working since (at least) w3lib 1.17.0 and scrapy 1.4.0:

Python 3

scrapy version -v
Scrapy    : 1.4.0
lxml      : 4.4.2.0
libxml2   : 2.9.9
cssselect : 1.1.0
parsel    : 1.5.2
w3lib     : 1.17.0
Twisted   : 19.10.0
Python    : 3.6.9 (default, Nov  7 2019, 10:44:02) - [GCC 8.3.0]
pyOpenSSL : 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019)
Platform  : Linux-4.15.0-20-generic-x86_64-with-LinuxMint-19.1-tessa
In [1]: from scrapy.http import TextResponse 
   ...: from scrapy.linkextractors import LinkExtractor 
   ...: response = TextResponse(url='http://foo.com', body=u'<a href="http://foo\u263a">', encoding='utf8') 
   ...: response.css('a::attr(href)').extract() 
   ...: extractor = LinkExtractor() 
   ...: extractor.extract_links(response)
Out[1]: [Link(url='http://foo☺', text='', fragment='', nofollow=False)]

Python 2

scrapy version -v
Scrapy    : 1.4.0
lxml      : 4.4.2.0
libxml2   : 2.9.9
cssselect : 1.1.0
parsel    : 1.5.2
w3lib     : 1.17.0
Twisted   : 19.10.0
Python    : 2.7.15+ (default, Oct  7 2019, 17:39:04) - [GCC 7.4.0]
pyOpenSSL : 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019)
Platform  : Linux-4.15.0-20-generic-x86_64-with-LinuxMint-19.1-tessa
In [1]: from scrapy.http import TextResponse
   ...: from scrapy.linkextractors import LinkExtractor
   ...: response = TextResponse(url='http://foo.com', body=u'<a href="http://foo\u263a">', encoding='utf8')
   ...: response.css('a::attr(href)').extract()
   ...: extractor = LinkExtractor()
   ...: extractor.extract_links(response)
   ...: 
Out[1]: [Link(url='http://foo\xe2\x98\xba', text=u'', fragment='', nofollow=False)]
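
Note that the URLs here keep their unicode form instead of the xn--foo-4s5a form from the earlier test, presumably because link extractors no longer canonicalize URLs by default as of Scrapy 1.4. The code path from the original traceback can still be exercised directly through w3lib (the expected output below is inferred from the earlier punycode result, not re-verified):

>>> from w3lib.url import canonicalize_url
>>> canonicalize_url(u'http://foo\u263a')  # raised TypeError on Python 2 before scrapy/w3lib#75
'http://xn--foo-4s5a/'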
