Commit 13846de

redapple authored and dangra committed
Docs: 4-space indent for final spider example
1 parent: 368a946 · commit: 13846de


docs/intro/tutorial.rst

Lines changed: 37 additions & 37 deletions
@@ -48,10 +48,10 @@ This will create a ``tutorial`` directory with the following contents::
         pipelines.py
         settings.py
         spiders/
-            __init__.py
-            ...
+            __init__.py
+            ...
 
-These are basically:
+These are basically:
 
 * ``scrapy.cfg``: the project configuration file
 * ``tutorial/``: the project's python module, you'll later import your code from
@@ -84,15 +84,15 @@ items.py, found in the ``tutorial`` directory. Our Item class looks like this::
     title = Field()
     link = Field()
     desc = Field()
-
+
 This may seem complicated at first, but defining the item allows you to use other handy
 components of Scrapy that need to know how your item looks like.
 
 Our first Spider
 ================
 
 Spiders are user-written classes used to scrape information from a domain (or group
-of domains).
+of domains).
 
 They define an initial list of URLs to download, how to follow links, and how
 to parse the contents of those pages to extract :ref:`items <topics-items>`.
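
For context while reading this hunk: the ``title``, ``link`` and ``desc`` fields above belong to the ``DmozItem`` class that the final spider example imports. A minimal sketch of ``tutorial/items.py`` as the tutorial defines it at this point, with field names taken from the diff and the import path assumed from the Scrapy 0.x API::

    from scrapy.item import Item, Field

    class DmozItem(Item):
        # the three fields shown in the hunk above
        title = Field()
        link = Field()
        desc = Field()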
@@ -112,7 +112,7 @@ define the three main, mandatory, attributes:
  be called with the downloaded :class:`~scrapy.http.Response` object of each
  start URL. The response is passed to the method as the first and only
  argument.
-
+
  This method is responsible for parsing the response data and extracting
  scraped data (as scraped items) and more URLs to follow.
 
@@ -132,7 +132,7 @@ This is the code for our first Spider; save it in a file named
         "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
         "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
     ]
-
+
     def parse(self, response):
         filename = response.url.split("/")[-2]
         open(filename, 'wb').write(response.body)
@@ -225,7 +225,7 @@ documentation).
   argument.
 
 * :meth:`~scrapy.selector.Selector.css`: returns a list of selectors, each of
-  them representing the nodes selected by the CSS expression given as argument.
+  them representing the nodes selected by the CSS expression given as argument.
 
 * :meth:`~scrapy.selector.Selector.extract`: returns a unicode string with the
   selected data.
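
A quick illustration of how the two methods described in this hunk combine in practice; the ``sel`` object and the ``//ul/li`` structure are the ones used elsewhere in the tutorial, so treat this as a sketch rather than output from a live page::

    # css() returns a list of selectors matching the CSS expression;
    # extract() then serializes the selected nodes to unicode strings
    sel.css('ul li a').extract()

    # the equivalent selection expressed as an XPath
    sel.xpath('//ul/li/a/text()').extract()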
@@ -269,7 +269,7 @@ This is what the shell looks like::
     [s] fetch(req_or_url) Fetch request (or URL) and update local objects
     [s] view(response)    View response in a browser
 
-    In [1]:
+    In [1]:
 
 After the shell loads, you will have the response fetched in a local
 ``response`` variable, so if you type ``response.body`` you will see the body
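
The shell banner shown in this hunk comes from launching Scrapy's interactive shell against a URL; a sketch of the invocation, using one of the tutorial's start URLs (the URL is quoted so the shell does not mangle it)::

    scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"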
@@ -299,7 +299,7 @@ So let's try it::
 Extracting the data
 ^^^^^^^^^^^^^^^^^^^
 
-Now, let's try to extract some real information from those pages.
+Now, let's try to extract some real information from those pages.
 
 You could type ``response.body`` in the console, and inspect the source code to
 figure out the XPaths you need to use. However, inspecting the raw HTML code
@@ -357,7 +357,7 @@ Let's add this code to our spider::
         "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
         "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
     ]
-
+
     def parse(self, response):
         sel = Selector(response)
         sites = sel.xpath('//ul/li')
@@ -367,7 +367,7 @@ Let's add this code to our spider::
             desc = site.xpath('text()').extract()
             print title, link, desc
 
-Notice we import our Selector class from scrapy.selector and instantiate a
+Notice we import our Selector class from scrapy.selector and instantiate a
 new Selector object. We can now specify our XPaths just as we did in the shell.
 Now try crawling the dmoz.org domain again and you'll see sites being printed
 in your output, run::
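
The command that ``run::`` refers to sits just outside this hunk; assuming the standard Scrapy CLI and the ``name = "dmoz"`` attribute from the spider above, it would be the usual crawl invocation, issued from the project's top-level directory::

    scrapy crawl dmoz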
@@ -390,30 +390,30 @@ Spiders are expected to return their scraped data inside
 :class:`~scrapy.item.Item` objects. So, in order to return the data we've
 scraped so far, the final code for our Spider would be like this::
 
-    from scrapy.spider import Spider
-    from scrapy.selector import Selector
-
-    from tutorial.items import DmozItem
-
-    class DmozSpider(Spider):
-        name = "dmoz"
-        allowed_domains = ["dmoz.org"]
-        start_urls = [
-            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
-            "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
-        ]
-
-        def parse(self, response):
-            sel = Selector(response)
-            sites = sel.xpath('//ul/li')
-            items = []
-            for site in sites:
-                item = DmozItem()
-                item['title'] = site.xpath('a/text()').extract()
-                item['link'] = site.xpath('a/@href').extract()
-                item['desc'] = site.xpath('text()').extract()
-                items.append(item)
-            return items
+    from scrapy.spider import Spider
+    from scrapy.selector import Selector
+
+    from tutorial.items import DmozItem
+
+    class DmozSpider(Spider):
+        name = "dmoz"
+        allowed_domains = ["dmoz.org"]
+        start_urls = [
+            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
+            "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
+        ]
+
+        def parse(self, response):
+            sel = Selector(response)
+            sites = sel.xpath('//ul/li')
+            items = []
+            for site in sites:
+                item = DmozItem()
+                item['title'] = site.xpath('a/text()').extract()
+                item['link'] = site.xpath('a/@href').extract()
+                item['desc'] = site.xpath('text()').extract()
+                items.append(item)
+            return items
 
 .. note:: You can find a fully-functional variant of this spider in the dirbot_
    project available at https://github.com/scrapy/dirbot
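
With the spider returning ``DmozItem`` objects as above, the scraped data can be written to a file straight from the command line via Scrapy's feed exports, which is the storage route the next hunk alludes to (no item pipeline needed). A sketch; the explicit ``-t json`` format flag is an assumption for Scrapy versions of this era, since later releases infer the format from the file extension::

    scrapy crawl dmoz -o items.json -t json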
@@ -449,7 +449,7 @@ pipeline if you just want to store the scraped items.
 
 Next steps
 ==========
-
+
 This tutorial covers only the basics of Scrapy, but there's a lot of other
 features not mentioned here. Check the :ref:`topics-whatelse` section in
 :ref:`intro-overview` chapter for a quick overview of the most important ones.
