
Docs: 4-space indent for final spider example

redapple authored and dangra committed Feb 5, 2014
1 parent 368a946 commit 13846ded0fe8003bd34ce45ca3be67da03c6ba96
Showing with 37 additions and 37 deletions.
  1. +37 −37 docs/intro/tutorial.rst
@@ -48,10 +48,10 @@ This will create a ``tutorial`` directory with the following contents::
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...

These are basically:

* ``scrapy.cfg``: the project configuration file
* ``tutorial/``: the project's python module, you'll later import your code from
@@ -84,15 +84,15 @@ items.py, found in the ``tutorial`` directory. Our Item class looks like this::
        title = Field()
        link = Field()
        desc = Field()

This may seem complicated at first, but defining the item allows you to use other handy
components of Scrapy that need to know what your item looks like.
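
For reference, a minimal ``items.py`` consistent with the fields above might look like the
following sketch (not part of this diff; the ``DmozItem`` name matches the import used in
the final spider example below)::

    from scrapy.item import Item, Field

    class DmozItem(Item):
        title = Field()
        link = Field()
        desc = Field()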

Our first Spider
================

Spiders are user-written classes used to scrape information from a domain (or group
of domains).

They define an initial list of URLs to download, how to follow links, and how
to parse the contents of those pages to extract :ref:`items <topics-items>`.
@@ -112,7 +112,7 @@ define the three main, mandatory, attributes:
be called with the downloaded :class:`~scrapy.http.Response` object of each
start URL. The response is passed to the method as the first and only
argument.

This method is responsible for parsing the response data and extracting
scraped data (as scraped items) and more URLs to follow.

@@ -132,7 +132,7 @@ This is the code for our first Spider; save it in a file named
"http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
"http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
]

def parse(self, response):
filename = response.url.split("/")[-2]
open(filename, 'wb').write(response.body)
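
A brief usage sketch (not part of this diff): with the project generated earlier, the spider
is run from the top-level project directory with Scrapy's ``crawl`` command, passing the
spider's ``name`` (``"dmoz"`` in the full example further below)::

    scrapy crawl dmoz

Given the ``parse`` method above, this saves each downloaded start page to a file named
after the second-to-last segment of its URL (e.g. ``Books`` and ``Resources``).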
@@ -225,7 +225,7 @@ documentation).
argument.

* :meth:`~scrapy.selector.Selector.css`: returns a list of selectors, each of
them representing the nodes selected by the CSS expression given as argument.

* :meth:`~scrapy.selector.Selector.extract`: returns a unicode string with the
selected data.
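
As a quick illustration (a sketch using the ``sel`` object from the shell and the XPaths
that appear later in this tutorial), these methods are typically chained::

    sel.xpath('//ul/li/a/text()').extract()   # link texts as a list of unicode strings
    sel.css('ul li a').extract()              # HTML of the matching <a> nodes, via CSS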
@@ -269,7 +269,7 @@ This is what the shell looks like::
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser

In [1]:

After the shell loads, you will have the response fetched in a local
``response`` variable, so if you type ``response.body`` you will see the body
@@ -299,7 +299,7 @@ So let's try it::
Extracting the data
^^^^^^^^^^^^^^^^^^^

Now, let's try to extract some real information from those pages.

You could type ``response.body`` in the console, and inspect the source code to
figure out the XPaths you need to use. However, inspecting the raw HTML code
@@ -357,7 +357,7 @@ Let's add this code to our spider::
"http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
"http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
]

def parse(self, response):
sel = Selector(response)
sites = sel.xpath('//ul/li')
@@ -367,7 +367,7 @@ Let's add this code to our spider::
                desc = site.xpath('text()').extract()
                print title, link, desc

Notice we import our Selector class from scrapy.selector and instantiate a
new Selector object. We can now specify our XPaths just as we did in the shell.
Now try crawling the dmoz.org domain again and you'll see sites being printed
in your output. Run::
@@ -390,30 +390,30 @@ Spiders are expected to return their scraped data inside
:class:`~scrapy.item.Item` objects. So, in order to return the data we've
scraped so far, the final code for our Spider would be like this::

    from scrapy.spider import Spider
    from scrapy.selector import Selector

    from tutorial.items import DmozItem

    class DmozSpider(Spider):
        name = "dmoz"
        allowed_domains = ["dmoz.org"]
        start_urls = [
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
        ]

        def parse(self, response):
            sel = Selector(response)
            sites = sel.xpath('//ul/li')
            items = []
            for site in sites:
                item = DmozItem()
                item['title'] = site.xpath('a/text()').extract()
                item['link'] = site.xpath('a/@href').extract()
                item['desc'] = site.xpath('text()').extract()
                items.append(item)
            return items

.. note:: You can find a fully-functional variant of this spider in the dirbot_
project available at https://github.com/scrapy/dirbot
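
Since the spider now returns :class:`~scrapy.item.Item` objects, the scraped data can be
stored without writing any extra code by using Scrapy's feed exports; a sketch of the
usual command-line invocation (covered later in the tutorial, not in this diff) is::

    scrapy crawl dmoz -o items.json -t json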
@@ -449,7 +449,7 @@ pipeline if you just want to store the scraped items.

Next steps
==========

This tutorial covers only the basics of Scrapy, but there are a lot of other
features not mentioned here. Check the :ref:`topics-whatelse` section in the
:ref:`intro-overview` chapter for a quick overview of the most important ones.
