
lazy-load downloadhandlers (continuation of #1357) #1421

Merged: 5 commits merged into scrapy:master on Aug 11, 2015

Conversation

@dangra (Member) commented Aug 10, 2015

After this change it is possible to run scrapy crawl without having to port all the download handlers that are enabled by default.

For example, the "loremipsum" spider from https://github.com/scrapinghub/testspiders, which only uses file:// URLs, works under Python 3. The errors in the log are intentional and part of the spider code.

$ scrapy crawl loremipsum
2015-08-10 18:20:58 [scrapy] INFO: Scrapy 1.1.0dev1 started (bot: testspiders)
2015-08-10 18:20:58 [scrapy] INFO: Optional features available: http11, ssl
2015-08-10 18:20:58 [scrapy] INFO: Overridden settings: {'COOKIES_ENABLED': False, 'CLOSESPIDER_PAGECOUNT': 1000, 'NEWSPIDER_MODULE': 'testspiders.spiders', 'CLOSESPIDER_TIMEOUT': 3600, 'RETRY_ENABLED': False, 'SPIDER_MODULES': ['testspiders.spiders'], 'BOT_NAME': 'testspiders'}
2015-08-10 18:20:58 [scrapy] INFO: Enabled extensions: CoreStats, CloseSpider, LogStats, SpiderState
2015-08-10 18:20:58 [scrapy] INFO: Enabled downloader middlewares: RandomUserAgent, ErrorMonkeyMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-08-10 18:20:58 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-08-10 18:20:58 [scrapy] INFO: Enabled item pipelines: 
2015-08-10 18:20:58 [scrapy] INFO: Spider opened
2015-08-10 18:20:58 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-08-10 18:20:58 [scrapy] DEBUG: Crawled (200) <GET file:///tmp/tmpdhscrtwq> (referer: None)
2015-08-10 18:20:59 [loremipsum] DEBUG: b'Lorem ipsum dolor sit amet, co'
2015-08-10 18:20:59 [loremipsum] INFO: b'nsectetuer adipiscing elit, se'
2015-08-10 18:20:59 [loremipsum] WARNING: b'd\ndiam nonummy nibh euismod ti'
2015-08-10 18:20:59 [loremipsum] ERROR: b'ncidunt ut laoreet dolore magn'
2015-08-10 18:20:59 [scrapy] DEBUG: Scraped from <200 file:///tmp/tmpdhscrtwq>
{'body': b'Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed\ndiam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat\nvolutpat. Ut wisi enim ad minim veniam, quis nostrud exerci tation ullamcorper\nsuscipit lobortis nisl ut aliquip ex ea commodo consequat. Duis autem vel eum\niriure dolor in hendrerit in vulputate velit esse molestie consequat, vel illum\ndolore eu feugiat nulla facilisis at vero eros et accumsan et iusto odio\ndignissim qui blandit praesent luptatum zzril delenit augue duis dolore te\nfeugait nulla facilisi. Nam liber tempor cum soluta nobis eleifend option\ncongue nihil imperdiet doming id quod mazim placerat facer possim assum. Typi\nnon habent claritatem insitam; est usus legentis in iis qui facit eorum\nclaritatem. Investigationes demonstraverunt lectores legere me lius quod ii\nlegunt saepius. Claritas est etiam processus dynamicus, qui sequitur mutationem\nconsuetudium lectorum. Mirum est notare quam littera gothica, quam nunc putamus\nparum claram, anteposuerit litterarum formas humanitatis per seacula quarta\ndecima et quinta decima. Eodem modo typi, qui nunc nobis videntur parum clari,\nfiant sollemnes in futurum.',
 'title': b'Lorem ipsum dolor si',
 'url': 'file:///tmp/tmpdhscrtwq'}
2015-08-10 18:20:59 [scrapy] ERROR: Error downloading <GET file:///tmp/tmpdhscrtwq?x-error-response>
Traceback (most recent call last):
  File "/home/daniel/src/twisted/twisted/internet/defer.py", line 588, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/daniel/src/scrapy/scrapy/core/downloader/middleware.py", line 46, in process_response
    response = method(request=request, response=response, spider=spider)
  File "/home/daniel/src/frankie/testspiders/testspiders/middleware.py", line 31, in process_response
    _ = 1 / 0
ZeroDivisionError: division by zero
2015-08-10 18:20:59 [scrapy] ERROR: Spider error processing <GET file:///tmp/tmpdhscrtwq?x-error-response> (referer: b'file:///tmp/tmpdhscrtwq')
Traceback (most recent call last):
  File "/home/daniel/src/twisted/twisted/internet/defer.py", line 588, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/daniel/src/frankie/testspiders/testspiders/spiders/loremipsum.py", line 52, in recover
    raise ValueError('hoho')
ValueError: hoho
2015-08-10 18:20:59 [scrapy] INFO: Closing spider (finished)
2015-08-10 18:20:59 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 593,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 38,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 8, 10, 21, 20, 59, 293849),
 'item_scraped_count': 1,
 'log_count/DEBUG': 3,
 'log_count/ERROR': 3,
 'log_count/INFO': 8,
 'log_count/WARNING': 1,
 'request_depth_max': 1,
 'response_received_count': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'spider_exceptions/ValueError': 1,
 'start_time': datetime.datetime(2015, 8, 10, 21, 20, 58, 959610)}
2015-08-10 18:20:59 [scrapy] INFO: Spider closed (finished)
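
The idea behind the change: download handlers are resolved per URL scheme on first use, and an import failure (e.g. a handler not yet ported to Python 3) is cached, so later requests for that scheme fail with NotSupported instead of the whole crawl failing at startup. A minimal sketch of the mechanism, with simplified names and settings handling (not the exact merged code):

    import logging

    from scrapy.exceptions import NotSupported
    from scrapy.utils.httpobj import urlparse_cached
    from scrapy.utils.misc import load_object

    logger = logging.getLogger(__name__)


    class DownloadHandlers(object):

        def __init__(self, crawler):
            self._crawler = crawler
            # Simplified: the real code also merges the DOWNLOAD_HANDLERS setting.
            self._schemes = dict(crawler.settings['DOWNLOAD_HANDLERS_BASE'])
            self._handlers = {}       # instantiated handlers, keyed by scheme
            self._notconfigured = {}  # cached failure reasons, keyed by scheme

        def _get_handler(self, scheme):
            """Lazily load the handler for a scheme; remember failures."""
            if scheme in self._handlers:
                return self._handlers[scheme]
            if scheme in self._notconfigured:
                return None
            if scheme not in self._schemes:
                self._notconfigured[scheme] = 'no handler available for that scheme'
                return None
            path = self._schemes[scheme]
            try:
                dhcls = load_object(path)
                dh = dhcls(self._crawler.settings)
            except Exception as ex:
                # Log once at load time, then report the scheme as unsupported.
                logger.exception('Loading "{}" for scheme "{}" handler'
                                 .format(path, scheme))
                self._notconfigured[scheme] = str(ex)
                return None
            self._handlers[scheme] = dh
            return dh

        def download_request(self, request, spider):
            scheme = urlparse_cached(request).scheme
            handler = self._get_handler(scheme)
            if not handler:
                raise NotSupported("Unsupported URL scheme '%s': %s" %
                                   (scheme, self._notconfigured[scheme]))
            return handler.download_request(request, spider)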

For a spider with http:// URLs it fails and logs the errors properly:

$ scrapy crawl followall
2015-08-10 18:22:32 [scrapy] INFO: Scrapy 1.1.0dev1 started (bot: testspiders)
2015-08-10 18:22:32 [scrapy] INFO: Optional features available: http11, ssl
2015-08-10 18:22:32 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'testspiders.spiders', 'CLOSESPIDER_PAGECOUNT': 1000, 'RETRY_ENABLED': False, 'CLOSESPIDER_TIMEOUT': 3600, 'SPIDER_MODULES': ['testspiders.spiders'], 'COOKIES_ENABLED': False, 'BOT_NAME': 'testspiders'}
2015-08-10 18:22:33 [scrapy] INFO: Enabled extensions: CloseSpider, SpiderState, LogStats, CoreStats
2015-08-10 18:22:33 [scrapy] INFO: Enabled downloader middlewares: RandomUserAgent, ErrorMonkeyMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-08-10 18:22:33 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-08-10 18:22:33 [scrapy] INFO: Enabled item pipelines: 
2015-08-10 18:22:33 [scrapy] INFO: Spider opened
2015-08-10 18:22:33 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-08-10 18:22:33 [scrapy] ERROR: Loading "scrapy.core.downloader.handlers.http.HTTPDownloadHandler" for scheme "http" handler
Traceback (most recent call last):
  File "/home/daniel/src/scrapy/scrapy/core/downloader/handlers/__init__.py", line 48, in _get_handler
    dhcls = load_object(path)
  File "/home/daniel/src/scrapy/scrapy/utils/misc.py", line 44, in load_object
    mod = import_module(module)
  File "/home/daniel/envs/scrapy3/lib/python3.4/importlib/__init__.py", line 109, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 2254, in _gcd_import
  File "<frozen importlib._bootstrap>", line 2237, in _find_and_load
  File "<frozen importlib._bootstrap>", line 2226, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 1200, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 1129, in _exec
  File "<frozen importlib._bootstrap>", line 1471, in exec_module
  File "<frozen importlib._bootstrap>", line 321, in _call_with_frames_removed
  File "/home/daniel/src/scrapy/scrapy/core/downloader/handlers/http.py", line 5, in <module>
    from .http11 import HTTP11DownloadHandler as HTTPDownloadHandler
  File "/home/daniel/src/scrapy/scrapy/core/downloader/handlers/http11.py", line 268, in <module>
    class _RequestBodyProducer(object):
  File "/home/daniel/src/scrapy/scrapy/core/downloader/handlers/http11.py", line 269, in _RequestBodyProducer
    implements(IBodyProducer)
  File "/home/daniel/envs/scrapy3/lib/python3.4/site-packages/zope/interface/declarations.py", line 412, in implements
    raise TypeError(_ADVICE_ERROR % 'implementer')
TypeError: Class advice impossible in Python3.  Use the @implementer class decorator instead.
2015-08-10 18:22:33 [scrapy] ERROR: Error downloading <GET http://scrapinghub.com/>
Traceback (most recent call last):
  File "/home/daniel/src/scrapy/scrapy/utils/defer.py", line 45, in mustbe_deferred
    result = f(*args, **kw)
  File "/home/daniel/src/scrapy/scrapy/core/downloader/handlers/__init__.py", line 67, in download_request
    (scheme, self._notconfigured[scheme]))
scrapy.exceptions.NotSupported: Unsupported URL scheme 'http': Class advice impossible in Python3.  Use the @implementer class decorator instead.
2015-08-10 18:22:33 [scrapy] INFO: Closing spider (finished)
2015-08-10 18:22:33 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
 'downloader/exception_type_count/scrapy.exceptions.NotSupported': 1,
 'downloader/request_bytes': 237,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 8, 10, 21, 22, 33, 579317),
 'log_count/ERROR': 2,
 'log_count/INFO': 7,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2015, 8, 10, 21, 22, 33, 259919)}
2015-08-10 18:22:33 [scrapy] INFO: Spider closed (finished)
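
The review comments below refer to the new error handling in scrapy/core/downloader/handlers/__init__.py: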
        return None
    except Exception as ex:
        logger.exception('Loading "{}" for scheme "{}" handler'
                         .format(path, scheme))

@kmike (Member) commented Aug 11, 2015

Shouldn't we pass crawler or spider here? I think it is also better to pass path and scheme as parameters, without calling format ourselves.

@dangra (Author, Member) commented Aug 11, 2015

Good point, updated.
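
For reference, a plausible shape of the updated call, letting the logging framework interpolate the arguments lazily and attaching the crawler via extra (parameter names here are an assumption, not necessarily the merged code):

    # Assumed form: parameters passed to the logger instead of pre-formatted,
    # with the crawler carried along for log routing.
    logger.error('Loading "%(clspath)s" for scheme "%(scheme)s"',
                 {'clspath': path, 'scheme': scheme},
                 exc_info=True, extra={'crawler': crawler})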

kmike added a commit that referenced this pull request Aug 11, 2015
lazy-load downloadhandlers (continuation of #1357)
@kmike merged commit fa123b3 into scrapy:master on Aug 11, 2015
1 check passed: continuous-integration/travis-ci/pr (The Travis CI build passed)
@dangra deleted the dangra:nyov/lazyload-downloadhandlers branch on Aug 14, 2015
@redapple added this to the Scrapy 1.1 milestone on Jan 25, 2016