
Updated shell docs with the crawler reference and fixed the actual shell output.

Also updated the shell example with a reproducible code example.
rmax authored and dangra committed Feb 5, 2014
1 parent f89efaf commit 875b9abe8ae4fe76ae1f9b5bb155c79d154cf7dc
Showing with 53 additions and 38 deletions.
  1. +10 −8 docs/intro/tutorial.rst
  2. +43 −30 docs/topics/shell.rst
docs/intro/tutorial.rst
@@ -253,16 +253,18 @@ This is what the shell looks like::

     [ ... Scrapy log here ... ]
+
+    2014-01-23 17:11:42-0400 [default] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
     [s] Available Scrapy objects:
-    [s]   2010-08-19 21:45:59-0300 [default] INFO: Spider closed (finished)
-    [s]   sel        <Selector (http://www.dmoz.org/Computers/Programming/Languages/Python/Books/) xpath=None>
-    [s]   item       Item()
+    [s]   crawler    <scrapy.crawler.Crawler object at 0x3636b50>
+    [s]   item       {}
     [s]   request    <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
     [s]   response   <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
-    [s]   spider     <Spider 'default' at 0x1b6c2d0>
+    [s]   sel        <Selector xpath=None data=u'<html>\r\n<head>\r\n<meta http-equiv="Conten'>
     [s]   settings   <CrawlerSettings module=None>
+    [s]   spider     <Spider 'default' at 0x3cebf50>
     [s] Useful shortcuts:
-    [s]   shelp()           Print this help
-    [s]   fetch(req_or_url) Fetch a new request or URL and update shell objects
+    [s]   shelp()           Shell help (print this help)
+    [s]   fetch(req_or_url) Fetch request (or URL) and update local objects
     [s]   view(response)    View response in a browser

     In [1]:
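
For reference, the banner above is what launching the shell on the tutorial's URL prints; the invocation is::

    scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"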
@@ -278,13 +280,13 @@ on response's type.
 So let's try it::

     In [1]: sel.xpath('//title')
-    Out[1]: [<Selector (title) xpath=//title>]
+    Out[1]: [<Selector xpath='//title' data=u'<title>Open Directory - Computers: Progr'>]

     In [2]: sel.xpath('//title').extract()
     Out[2]: [u'<title>Open Directory - Computers: Programming: Languages: Python: Books</title>']

     In [3]: sel.xpath('//title/text()')
-    Out[3]: [<Selector (text) xpath=//title/text()>]
+    Out[3]: [<Selector xpath='//title/text()' data=u'Open Directory - Computers: Programming:'>]

     In [4]: sel.xpath('//title/text()').extract()
     Out[4]: [u'Open Directory - Computers: Programming: Languages: Python: Books']
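
A natural follow-up, sketched here rather than taken from the diff: since ``extract()`` returns a list, index into it to pull out a single unicode string::

    In [5]: sel.xpath('//title/text()').extract()[0]
    Out[5]: u'Open Directory - Computers: Programming: Languages: Python: Books'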
docs/topics/shell.rst
@@ -71,6 +71,8 @@ content).

 Those objects are:

+* ``crawler`` - the current :class:`~scrapy.crawler.Crawler` object.
+
 * ``spider`` - the Spider which is known to handle the URL, or a
   :class:`~scrapy.spider.Spider` object if there is no spider found for
   the current URL
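
The ``crawler`` object documented by the new bullet exposes the running crawl. A minimal sketch of what it enables from the shell (the object address is illustrative; ``getbool`` is a standard ``Settings`` accessor)::

    >>> crawler.settings.getbool('COOKIES_ENABLED')   # effective setting for this crawl
    True
    >>> crawler.engine                                # the live execution engine
    <scrapy.core.engine.ExecutionEngine object at 0x1e16d10>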
@@ -110,16 +112,17 @@ Then, the shell fetches the URL (using the Scrapy downloader) and prints the
 list of available objects and useful shortcuts (you'll notice that these lines
 all start with the ``[s]`` prefix)::

-    [s] Available objects
-    [s]   sel        <Selector (http://scrapy.org) xpath=None>
-    [s]   item       Item()
-    [s]   request    <http://scrapy.org>
-    [s]   response   <http://scrapy.org>
-    [s]   settings   <Settings 'mybot.settings'>
-    [s]   spider     <Spider 'default' at 0x2bed9d0>
+    [s] Available Scrapy objects:
+    [s]   crawler    <scrapy.crawler.Crawler object at 0x1e16b50>
+    [s]   item       {}
+    [s]   request    <GET http://scrapy.org>
+    [s]   response   <200 http://scrapy.org>
+    [s]   sel        <Selector xpath=None data=u'<html>\n  <head>\n    <meta charset="utf-8'>
+    [s]   settings   <CrawlerSettings module=None>
+    [s]   spider     <Spider 'default' at 0x20c6f50>
     [s] Useful shortcuts:
-    [s]   shelp()           Prints this help.
-    [s]   fetch(req_or_url) Fetch a new request or URL and update objects
+    [s]   shelp()           Shell help (print this help)
+    [s]   fetch(req_or_url) Fetch request (or URL) and update local objects
     [s]   view(response)    View response in a browser

     >>>
@@ -131,24 +134,27 @@ After that, we can start playing with the objects::

     >>> fetch("http://slashdot.org")
     [s] Available Scrapy objects:
-    [s]   sel        <Selector (http://slashdot.org) xpath=None>
-    [s]   item       JobItem()
+    [s]   crawler    <scrapy.crawler.Crawler object at 0x1a13b50>
+    [s]   item       {}
     [s]   request    <GET http://slashdot.org>
     [s]   response   <200 http://slashdot.org>
-    [s]   settings   <Settings 'jobsbot.settings'>
-    [s]   spider     <Spider 'default' at 0x3c44a10>
+    [s]   sel        <Selector xpath=None data=u'<html lang="en">\n<head>\n\n\n\n\n<script id="'>
+    [s]   settings   <CrawlerSettings module=None>
+    [s]   spider     <Spider 'default' at 0x20c6f50>
     [s] Useful shortcuts:
     [s]   shelp()           Shell help (print this help)
     [s]   fetch(req_or_url) Fetch request (or URL) and update local objects
     [s]   view(response)    View response in a browser

-    >>> sel.xpath("//h2/text()").extract()
-    [u'News for nerds, stuff that matters']
+    >>> sel.xpath('//title/text()').extract()
+    [u'Slashdot: News for nerds, stuff that matters']

     >>> request = request.replace(method="POST")

     >>> fetch(request)
-    2009-04-03 00:57:39-0300 [default] ERROR: Downloading <http://slashdot.org> from <None>: 405 Method Not Allowed
+    [s] Available Scrapy objects:
+    [s]   crawler    <scrapy.crawler.Crawler object at 0x1e16b50>
+    ...

     >>>
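
``Request.replace()`` accepts any argument of the ``Request`` constructor, not just ``method``, so the same pattern can swap headers or body before re-fetching. A hedged sketch (the user-agent value is illustrative)::

    >>> request = request.replace(headers={'User-Agent': 'Mozilla/5.0'})
    >>> fetch(request)
    [s] Available Scrapy objects:
    ...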

@@ -165,47 +171,54 @@ This can be achieved by using the ``scrapy.shell.inspect_response`` function.

 Here's an example of how you would call it from your spider::

+    from scrapy.spider import Spider
+
+
     class MySpider(Spider):
-        ...
+        name = "myspider"
+        start_urls = [
+            "http://example.com",
+            "http://example.org",
+            "http://example.net",
+        ]

         def parse(self, response):
-            if response.url == 'http://www.example.com/products.php':
+            # We want to inspect one specific response.
+            if ".org" in response.url:
                 from scrapy.shell import inspect_response
                 inspect_response(response)

-            # ... your parsing code ..
+            # Rest of parsing code.

 When you run the spider, you will get something similar to this::

-    2009-08-27 19:15:25-0300 [example.com] DEBUG: Crawled <http://www.example.com/> (referer: <None>)
-    2009-08-27 19:15:26-0300 [example.com] DEBUG: Crawled <http://www.example.com/products.php> (referer: <http://www.example.com/>)
-    [s] Available objects
-    [s]   sel        <Selector (http://www.example.com/products.php) xpath=None>
+    2014-01-23 17:48:31-0400 [myspider] DEBUG: Crawled (200) <GET http://example.com> (referer: None)
+    2014-01-23 17:48:31-0400 [myspider] DEBUG: Crawled (200) <GET http://example.org> (referer: None)
+    [s] Available Scrapy objects:
+    [s]   crawler    <scrapy.crawler.Crawler object at 0x1e16b50>
+    ...

     >>> response.url
-    'http://www.example.com/products.php'
+    'http://example.org'

 Then, you can check if the extraction code is working::

-    >>> sel.xpath('//h1')
+    >>> sel.xpath('//h1[@class="fn"]')
     []

 Nope, it doesn't. So you can open the response in your web browser and see if
 it's the response you were expecting::

     >>> view(response)
-    >>>
+    True

 Finally you hit Ctrl-D (or Ctrl-Z in Windows) to exit the shell and resume the
 crawling::

     >>> ^D
-    2009-08-27 19:15:25-0300 [example.com] DEBUG: Crawled <http://www.example.com/product.php?id=1> (referer: <None>)
-    2009-08-27 19:15:25-0300 [example.com] DEBUG: Crawled <http://www.example.com/product.php?id=2> (referer: <None>)
-    # ...
+    2014-01-23 17:50:03-0400 [myspider] DEBUG: Crawled (200) <GET http://example.net> (referer: None)
+    ...

 Note that you can't use the ``fetch`` shortcut here since the Scrapy engine is
 blocked by the shell. However, after you leave the shell, the spider will
 continue crawling where it stopped, as shown above.
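
If you really need ``fetch`` while debugging a response like this, a workaround (not part of this commit) is to reproduce the request in a standalone shell, where the engine is not blocked::

    $ scrapy shell 'http://example.org'
    [ ... shell banner ... ]
    >>> fetch('http://example.net')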
