Wikipedia robots.txt raises exceptions #2373

Closed
mohmad-null opened this issue Nov 2, 2016 · 7 comments

@mohmad-null mohmad-null commented Nov 2, 2016

I'm scraping a page that links to Wikipedia, and parsing Wikipedia's robots.txt raises the warning and exception shown below.

Python 2.7.12
Scrapy 1.2.1

2016-11-02 13:13:18 [scrapy] DEBUG: Crawled (200) <GET https://en.wikipedia.org/robots.txt> (referer: None)
2016-11-02 13:13:18 [py.warnings] WARNING: C:\Python27\lib\urllib.py:1303: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  return ''.join(map(quoter, s))

2016-11-02 13:13:18 [scrapy] ERROR: Error downloading <GET http://en.wikipedia.org/robots.txt>: u'\xd8'
Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 587, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "C:\Python27\lib\site-packages\scrapy\downloadermiddlewares\robotstxt.py", line 97, in _parse_robots
    rp.parse(body.splitlines())
  File "C:\Python27\lib\robotparser.py", line 120, in parse
    entry.rulelines.append(RuleLine(line[1], False))
  File "C:\Python27\lib\robotparser.py", line 174, in __init__
    self.path = urllib.quote(path)
  File "C:\Python27\lib\urllib.py", line 1303, in quote
    return ''.join(map(quoter, s))
KeyError: u'\xd8'
@redapple redapple added the bug label Nov 2, 2016
Contributor

@redapple redapple commented Nov 2, 2016

I can reproduce this with Python 2.7 on Linux but not with Python 3.

$ cat wiki.py
import scrapy


class WikipediaSpider(scrapy.Spider):
    name = "wikipedia"

    start_urls = ['https://en.wikipedia.org/']

    def parse(self, response):
        pass
$ scrapy version -v
Scrapy    : 1.2.1
lxml      : 3.6.4.0
libxml2   : 2.9.4
Twisted   : 16.5.0
Python    : 2.7.12 (default, Jul  1 2016, 15:12:24) - [GCC 5.4.0 20160609]
pyOpenSSL : 16.2.0 (OpenSSL 1.0.2g  1 Mar 2016)
Platform  : Linux-4.4.0-45-generic-x86_64-with-Ubuntu-16.04-xenial

$ scrapy runspider wiki.py -s ROBOTSTXT_OBEY=1
2016-11-02 14:31:24 [scrapy] INFO: Scrapy 1.2.1 started (bot: scrapybot)
2016-11-02 14:31:24 [scrapy] INFO: Overridden settings: {'ROBOTSTXT_OBEY': '1'}
(...)
2016-11-02 14:31:24 [scrapy] DEBUG: Crawled (200) <GET https://en.wikipedia.org/robots.txt> (referer: None)
2016-11-02 14:31:24 [py.warnings] WARNING: /usr/lib/python2.7/urllib.py:1299: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  return ''.join(map(quoter, s))

2016-11-02 14:31:24 [scrapy] ERROR: Error downloading <GET https://en.wikipedia.org/robots.txt>: u'\xd8'
Traceback (most recent call last):
  File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 649, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/scrapy/downloadermiddlewares/robotstxt.py", line 97, in _parse_robots
    rp.parse(body.splitlines())
  File "/usr/lib/python2.7/robotparser.py", line 120, in parse
    entry.rulelines.append(RuleLine(line[1], False))
  File "/usr/lib/python2.7/robotparser.py", line 174, in __init__
    self.path = urllib.quote(path)
  File "/usr/lib/python2.7/urllib.py", line 1299, in quote
    return ''.join(map(quoter, s))
KeyError: u'\xd8'
2016-11-02 14:31:24 [scrapy] DEBUG: Redirecting (301) to <GET https://en.wikipedia.org/wiki/Main_Page> from <GET https://en.wikipedia.org/>
2016-11-02 14:31:24 [scrapy] DEBUG: Crawled (200) <GET https://en.wikipedia.org/wiki/Main_Page> (referer: None)
2016-11-02 14:31:25 [scrapy] INFO: Closing spider (finished)
2016-11-02 14:31:25 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 804,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'downloader/response_bytes': 25391,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 2,
 'downloader/response_status_count/301': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 11, 2, 13, 31, 25, 113618),
 'log_count/DEBUG': 4,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'log_count/WARNING': 1,
 'response_received_count': 2,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2016, 11, 2, 13, 31, 24, 503170)}
2016-11-02 14:31:25 [scrapy] INFO: Spider closed (finished)

Contributor

@redapple redapple commented Nov 2, 2016

It does look like a bug in Python 2's robotparser:

>>> from robotparser import RobotFileParser
>>> lines = u'''User-agent: *
... Allow: /w/api.php?action=mobileview&
... Allow: /w/load.php?
... Allow: /api/rest_v1/?doc
... Disallow: /w/
... Disallow: /api/
... Disallow: /trap/
... #
... # ar:
... Disallow: /wiki/%D8%AE%D8%A7%D8%B5:Search
... Disallow: /wiki/%D8%AE%D8%A7%D8%B5%3ASearch'''
>>> rp = RobotFileParser('https://en.wikipedia.org/robots.txt')
>>> rp.parse(lines.splitlines())
/usr/lib/python2.7/urllib.py:1299: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  return ''.join(map(quoter, s))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/robotparser.py", line 120, in parse
    entry.rulelines.append(RuleLine(line[1], False))
  File "/usr/lib/python2.7/robotparser.py", line 174, in __init__
    self.path = urllib.quote(path)
  File "/usr/lib/python2.7/urllib.py", line 1299, in quote
    return ''.join(map(quoter, s))
KeyError: u'\xd8'
>>> 
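
A rough sketch of where the u'\xd8' seems to come from, not a confirmed diagnosis: Python 2's urllib.unquote, when given a unicode string, appears to decode each %XX escape as a single Latin-1 character, so %D8 becomes u'\xd8', which urllib.quote then cannot look up. The snippet below imitates that on Python 3 by passing encoding="latin-1" explicitly (an assumption made purely for illustration; Python 3 defaults to UTF-8, which would explain why it does not fail there).

```python
from urllib.parse import quote, unquote

# One of the Disallow paths from the robots.txt sample above.
path = "/wiki/%D8%AE%D8%A7%D8%B5:Search"

# Imitate Python 2's unquote() on a unicode string: each %XX escape is
# turned into the single code point with that value (Latin-1 decoding).
py2_style = unquote(path, encoding="latin-1")
first_decoded_char = py2_style[len("/wiki/")]

# Python 3 decodes the escapes as UTF-8 by default, giving real Arabic text,
# and quote() encodes the text back to UTF-8 before percent-escaping it.
py3_style = unquote(path)
requoted = quote(py3_style)

print(ascii(first_decoded_char))  # the character named in the KeyError
print(requoted)
```

The re-quoted value has the same form as the second Disallow line in the sample (the colon is escaped as %3A).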
Contributor

@redapple redapple commented Nov 2, 2016

This line in Python 2's robotparser seems to be the issue.
Changing it to line[1] = w3lib.url.safe_url_string(line[1].strip()) seems to fix it.
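
To illustrate the idea behind that change without depending on w3lib, here is a stdlib-only sketch for Python 3. The helper name escape_rule_path and the set of safe characters are made up for this example, and it is not w3lib's implementation; it only shows the general approach of escaping the rule path as UTF-8 instead of unquoting it first.

```python
from urllib.parse import quote

# Characters left untouched: common URL delimiters plus "%", so that
# existing escapes such as %D8 are neither decoded nor double-encoded.
# (Assumed set, chosen for illustration only.)
_SAFE = "/:?=&%#+,;@!$'()*[]~"


def escape_rule_path(path):
    """Hypothetical helper: strip the value and percent-escape any
    non-ASCII characters as UTF-8, without unquoting the path first."""
    return quote(path.strip(), safe=_SAFE)


# A path that is already percent-encoded passes through unchanged.
already_escaped = escape_rule_path(" /wiki/%D8%AE%D8%A7%D8%B5:Search ")

# A path containing raw non-ASCII text ends up in the same escaped form.
raw_unicode = escape_rule_path("/wiki/\u062e\u0627\u0635:Search")

print(already_escaped)
print(raw_unicode)
```

Either way the rule path stays plain ASCII, so nothing non-ASCII reaches the quoting step that fails in the traceback.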

Member

@kmike kmike commented Nov 2, 2016

What do you suggest @redapple? Use a backport of Python 3.x robotparser, switch to reppy, create our own robots.txt parser, tell people to use Python 3, something else?

Contributor

@redapple redapple commented Nov 2, 2016

I think we can move to reppy for this, even without the crawl-delay support for now.

Member

@kmike kmike commented Nov 2, 2016

reppy sounds fine, but I have a few small reservations about it:

  • it seems reppy is under heavy refactoring right now;
  • they combine robots.txt parsing and fetching in the same package, so they have requests in install_requires; it could be weird to have requests as a Scrapy dependency :)
Contributor

@redapple redapple commented Nov 2, 2016

Oh right, I see that now from reading some recent PRs.
Then it may be easier to use a custom RobotFileParser subclass for now.
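
For what a custom subclass could look like, here is a minimal sketch on Python 3's urllib.robotparser. The class name TextRobotFileParser and the decode-to-text strategy are assumptions for illustration; the thread does not show what the eventual fix did, and a Python 2 version would need a different normalization step.

```python
from urllib.robotparser import RobotFileParser


class TextRobotFileParser(RobotFileParser):
    """Hypothetical subclass: coerce every line to text before parsing."""

    def parse(self, lines):
        # Decode byte lines as UTF-8 (replacing undecodable bytes) so the
        # parent parser only ever sees one string type.
        text_lines = [
            line.decode("utf-8", errors="replace") if isinstance(line, bytes) else line
            for line in lines
        ]
        super().parse(text_lines)


# A shortened version of the robots.txt sample from this thread, as bytes.
body = (
    b"User-agent: *\n"
    b"Disallow: /trap/\n"
    b"# ar:\n"
    b"Disallow: /wiki/%D8%AE%D8%A7%D8%B5:Search\n"
)

rp = TextRobotFileParser("https://en.wikipedia.org/robots.txt")
rp.parse(body.splitlines())

main_page_allowed = rp.can_fetch("*", "https://en.wikipedia.org/wiki/Main_Page")
trap_allowed = rp.can_fetch("*", "https://en.wikipedia.org/trap/")

print(main_page_allowed, trap_allowed)
```

Keeping the override limited to parse() means the rest of the parser (can_fetch and friends) is inherited unchanged.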

@redapple redapple self-assigned this Nov 8, 2016
@redapple redapple added the in progress label Nov 8, 2016
redapple added a commit to redapple/scrapy that referenced this issue Nov 9, 2016
@redapple redapple added this to the v1.3 milestone Nov 16, 2016
@redapple redapple modified the milestones: v1.2.2, v1.3 Nov 30, 2016
@redapple redapple removed the in progress label Dec 1, 2016