
Expose current crawler in the scrapy shell. #557

Merged (4 commits) on Jan 28, 2014
38 changes: 21 additions & 17 deletions docs/intro/tutorial.rst
@@ -147,15 +147,17 @@ To put our spider to work, go to the project's top level directory and run::
The ``crawl dmoz`` command runs the spider for the ``dmoz.org`` domain. You
will get an output similar to this::

-2008-08-20 03:51:13-0300 [scrapy] INFO: Started project: dmoz
-2008-08-20 03:51:13-0300 [tutorial] INFO: Enabled extensions: ...
-2008-08-20 03:51:13-0300 [tutorial] INFO: Enabled downloader middlewares: ...
-2008-08-20 03:51:13-0300 [tutorial] INFO: Enabled spider middlewares: ...
-2008-08-20 03:51:13-0300 [tutorial] INFO: Enabled item pipelines: ...
-2008-08-20 03:51:14-0300 [dmoz] INFO: Spider opened
-2008-08-20 03:51:14-0300 [dmoz] DEBUG: Crawled <http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: <None>)
-2008-08-20 03:51:14-0300 [dmoz] DEBUG: Crawled <http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: <None>)
-2008-08-20 03:51:14-0300 [dmoz] INFO: Spider closed (finished)
+2014-01-23 18:13:07-0400 [scrapy] INFO: Scrapy started (bot: tutorial)
+2014-01-23 18:13:07-0400 [scrapy] INFO: Optional features available: ...
+2014-01-23 18:13:07-0400 [scrapy] INFO: Overridden settings: {}
+2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled extensions: ...
+2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled downloader middlewares: ...
+2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled spider middlewares: ...
+2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled item pipelines: ...
+2014-01-23 18:13:07-0400 [dmoz] INFO: Spider opened
+2014-01-23 18:13:08-0400 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
+2014-01-23 18:13:09-0400 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
+2014-01-23 18:13:09-0400 [dmoz] INFO: Closing spider (finished)

Pay attention to the lines containing ``[dmoz]``, which correspond to our
spider. You can see a log line for each URL defined in ``start_urls``. Because
@@ -253,16 +255,18 @@ This is what the shell looks like::

[ ... Scrapy log here ... ]

+2014-01-23 17:11:42-0400 [default] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
[s] Available Scrapy objects:
-[s]   2010-08-19 21:45:59-0300 [default] INFO: Spider closed (finished)
-[s]   sel        <Selector (http://www.dmoz.org/Computers/Programming/Languages/Python/Books/) xpath=None>
-[s]   item       Item()
+[s]   crawler    <scrapy.crawler.Crawler object at 0x3636b50>
+[s]   item       {}
[s]   request    <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s]   response   <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
-[s]   spider     <Spider 'default' at 0x1b6c2d0>
+[s]   sel        <Selector xpath=None data=u'<html>\r\n<head>\r\n<meta http-equiv="Conten'>
+[s]   settings   <CrawlerSettings module=None>
+[s]   spider     <Spider 'default' at 0x3cebf50>
[s] Useful shortcuts:
-[s]   shelp()           Print this help
-[s]   fetch(req_or_url) Fetch a new request or URL and update shell objects
+[s]   shelp()           Shell help (print this help)
+[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

In [1]:
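
The ``crawler`` object is the main addition in this changeset. As a
hypothetical continuation of the session above (``Crawler.settings`` and
``Crawler.stats`` are real attributes, but the values shown here are
illustrative), it lets you inspect the running crawler directly::

    In [1]: crawler.settings.get('BOT_NAME')
    Out[1]: 'tutorial'

    In [2]: crawler.stats.get_stats()
    Out[2]: {'log_count/DEBUG': 2, 'log_count/INFO': 7, ...}
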
@@ -278,13 +282,13 @@ on response's type.
So let's try it::

In [1]: sel.xpath('//title')
-Out[1]: [<Selector (title) xpath=//title>]
+Out[1]: [<Selector xpath='//title' data=u'<title>Open Directory - Computers: Progr'>]

In [2]: sel.xpath('//title').extract()
Out[2]: [u'<title>Open Directory - Computers: Programming: Languages: Python: Books</title>']

In [3]: sel.xpath('//title/text()')
-Out[3]: [<Selector (text) xpath=//title/text()>]
+Out[3]: [<Selector xpath='//title/text()' data=u'Open Directory - Computers: Programming:'>]

In [4]: sel.xpath('//title/text()').extract()
Out[4]: [u'Open Directory - Computers: Programming: Languages: Python: Books']
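
Note that ``extract()`` always returns a list of unicode strings; a
hypothetical next step in the same session (not part of this diff) would be
plain indexing to get a single value::

    In [5]: sel.xpath('//title/text()').extract()[0]
    Out[5]: u'Open Directory - Computers: Programming: Languages: Python: Books'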
73 changes: 43 additions & 30 deletions docs/topics/shell.rst
@@ -71,6 +71,8 @@ content).

Those objects are:

+* ``crawler`` - the current :class:`~scrapy.crawler.Crawler` object.

* ``spider`` - the Spider which is known to handle the URL, or a
:class:`~scrapy.spider.Spider` object if there is no spider found for
the current URL
@@ -110,16 +112,17 @@ Then, the shell fetches the URL (using the Scrapy downloader) and prints the
list of available objects and useful shortcuts (you'll notice that these lines
all start with the ``[s]`` prefix)::

-[s] Available objects
-[s]   sel        <Selector (http://scrapy.org) xpath=None>
-[s]   item       Item()
-[s]   request    <http://scrapy.org>
-[s]   response   <http://scrapy.org>
-[s]   settings   <Settings 'mybot.settings'>
-[s]   spider     <Spider 'default' at 0x2bed9d0>
+[s] Available Scrapy objects:
+[s]   crawler    <scrapy.crawler.Crawler object at 0x1e16b50>
+[s]   item       {}
+[s]   request    <GET http://scrapy.org>
+[s]   response   <200 http://scrapy.org>
+[s]   sel        <Selector xpath=None data=u'<html>\n <head>\n <meta charset="utf-8'>
+[s]   settings   <CrawlerSettings module=None>
+[s]   spider     <Spider 'default' at 0x20c6f50>
[s] Useful shortcuts:
-[s]   shelp()           Prints this help.
-[s]   fetch(req_or_url) Fetch a new request or URL and update objects
+[s]   shelp()           Shell help (print this help)
+[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

>>>
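
All of these objects are live references, not copies. For instance (a
hypothetical check, not part of this diff), the ``settings`` shortcut is the
same object as ``crawler.settings``, since the shell populates it from the
crawler::

    >>> settings is crawler.settings
    True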
@@ -131,24 +134,27 @@ After that, we can start playing with the objects::

>>> fetch("http://slashdot.org")
[s] Available Scrapy objects:
-[s]   sel        <Selector (http://slashdot.org) xpath=None>
-[s]   item       JobItem()
+[s]   crawler    <scrapy.crawler.Crawler object at 0x1a13b50>
+[s]   item       {}
[s]   request    <GET http://slashdot.org>
[s]   response   <200 http://slashdot.org>
-[s]   settings   <Settings 'jobsbot.settings'>
-[s]   spider     <Spider 'default' at 0x3c44a10>
+[s]   sel        <Selector xpath=None data=u'<html lang="en">\n<head>\n\n\n\n\n<script id="'>
+[s]   settings   <CrawlerSettings module=None>
+[s]   spider     <Spider 'default' at 0x20c6f50>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

>>> sel.xpath("//h2/text()").extract()
[u'News for nerds, stuff that matters']
>>> sel.xpath('//title/text()').extract()
[u'Slashdot: News for nerds, stuff that matters']

>>> request = request.replace(method="POST")

>>> fetch(request)
-2009-04-03 00:57:39-0300 [default] ERROR: Downloading <http://slashdot.org> from <None>: 405 Method Not Allowed
+[s] Available Scrapy objects:
+[s]   crawler    <scrapy.crawler.Crawler object at 0x1e16b50>
+...

>>>

@@ -165,47 +171,54 @@ This can be achieved by using the ``scrapy.shell.inspect_response`` function.

Here's an example of how you would call it from your spider::

+from scrapy.spider import Spider
+
+
class MySpider(Spider):
-    ...
+    name = "myspider"
+    start_urls = [
+        "http://example.com",
+        "http://example.org",
+        "http://example.net",
+    ]

    def parse(self, response):
-        if response.url == 'http://www.example.com/products.php':
+        # We want to inspect one specific response.
+        if ".org" in response.url:
            from scrapy.shell import inspect_response
            inspect_response(response)

-        # ... your parsing code ..
+        # Rest of parsing code.

When you run the spider, you will get something similar to this::

-2009-08-27 19:15:25-0300 [example.com] DEBUG: Crawled <http://www.example.com/> (referer: <None>)
-2009-08-27 19:15:26-0300 [example.com] DEBUG: Crawled <http://www.example.com/products.php> (referer: <http://www.example.com/>)
-[s] Available objects
-[s]   sel        <Selector (http://www.example.com/products.php) xpath=None>
+2014-01-23 17:48:31-0400 [myspider] DEBUG: Crawled (200) <GET http://example.com> (referer: None)
+2014-01-23 17:48:31-0400 [myspider] DEBUG: Crawled (200) <GET http://example.org> (referer: None)
+[s] Available Scrapy objects:
+[s]   crawler    <scrapy.crawler.Crawler object at 0x1e16b50>
+...

>>> response.url
-'http://www.example.com/products.php'
+'http://example.org'

Then, you can check if the extraction code is working::

->>> sel.xpath('//h1')
+>>> sel.xpath('//h1[@class="fn"]')
[]

Nope, it doesn't. So you can open the response in your web browser and see if
it's the response you were expecting::

>>> view(response)
->>>
+True

Finally you hit Ctrl-D (or Ctrl-Z in Windows) to exit the shell and resume the
crawling::

>>> ^D
-2009-08-27 19:15:25-0300 [example.com] DEBUG: Crawled <http://www.example.com/product.php?id=1> (referer: <None>)
-2009-08-27 19:15:25-0300 [example.com] DEBUG: Crawled <http://www.example.com/product.php?id=2> (referer: <None>)
-# ...
+2014-01-23 17:50:03-0400 [myspider] DEBUG: Crawled (200) <GET http://example.net> (referer: None)
+...

Note that you can't use the ``fetch`` shortcut here since the Scrapy engine is
blocked by the shell. However, after you leave the shell, the spider will
continue crawling where it stopped, as shown above.
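
If you do need ``fetch``, a standalone shell started from the command line is
not subject to this limitation, because there is no crawl in progress to
block::

    scrapy shell "http://example.org"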

21 changes: 12 additions & 9 deletions scrapy/shell.py
@@ -1,30 +1,32 @@
"""
Scrapy Shell
"""Scrapy Shell

See documentation in docs/topics/shell.rst

"""
from __future__ import print_function

import signal

from twisted.internet import reactor, threads, defer
from twisted.python import threadable
from w3lib.url import any_to_uri

from scrapy.crawler import Crawler
from scrapy.exceptions import IgnoreRequest
from scrapy.http import Request, Response
from scrapy.item import BaseItem
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.utils.spider import create_spider_for_request
from scrapy.settings import Settings
from scrapy.spider import Spider
from scrapy.utils.console import start_python_console
from scrapy.utils.misc import load_object
from scrapy.utils.response import open_in_browser
from scrapy.utils.console import start_python_console
from scrapy.settings import Settings
from scrapy.http import Request, Response
from scrapy.exceptions import IgnoreRequest
from scrapy.utils.spider import create_spider_for_request


class Shell(object):

-    relevant_classes = (Spider, Request, Response, BaseItem,
+    relevant_classes = (Crawler, Spider, Request, Response, BaseItem,
                        Selector, Settings)

    def __init__(self, crawler, update_vars=None, code=None):
@@ -91,6 +93,7 @@ def fetch(self, request_or_url, spider=None):
        self.populate_vars(response, request, spider)

    def populate_vars(self, response=None, request=None, spider=None):
+        self.vars['crawler'] = self.crawler
        self.vars['item'] = self.item_class()
        self.vars['settings'] = self.crawler.settings
        self.vars['spider'] = spider
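
This new ``crawler`` entry is what ultimately shows up in the ``[s]``
listings above. A minimal sketch of the behaviour (hypothetical test code,
not part of this diff; it assumes a ``Crawler`` can be constructed from
default ``Settings`` alone)::

    from scrapy.crawler import Crawler
    from scrapy.settings import Settings
    from scrapy.shell import Shell

    crawler = Crawler(Settings())  # assumption: default settings suffice
    shell = Shell(crawler)
    shell.populate_vars()          # nothing fetched yet
    assert shell.vars['crawler'] is crawler
    assert shell.vars['settings'] is crawler.settings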