@@ -48,10 +48,10 @@ This will create a ``tutorial`` directory with the following contents::
     pipelines.py
     settings.py
     spiders/
-        __init__.py
-        ...
+        __init__.py
+        ...
 
- These are basically:
+ These are basically:
 
 * ``scrapy.cfg``: the project configuration file
 * ``tutorial/``: the project's python module, you'll later import your code from
@@ -84,15 +84,15 @@ items.py, found in the ``tutorial`` directory. Our Item class looks like this::
     title = Field()
     link = Field()
     desc = Field()
-
+
 This may seem complicated at first, but defining the item allows you to use other handy
 components of Scrapy that need to know how your item looks like.
 
 Our first Spider
 ================
 
 Spiders are user-written classes used to scrape information from a domain (or group
- of domains).
+ of domains).
 
 They define an initial list of URLs to download, how to follow links, and how
 to parse the contents of those pages to extract :ref:`items <topics-items>`.
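As an aside to the item definition above: a declared Item behaves much like a dict whose keys are restricted to its declared fields, which is what lets other Scrapy components know how the item looks. A minimal, illustrative sketch (assuming the ``DmozItem`` defined in ``items.py``)::

    >>> from tutorial.items import DmozItem
    >>> item = DmozItem()
    >>> item['title'] = 'Example title'   # declared field, accepted
    >>> item['title']
    'Example title'
    >>> item['other'] = 'x'               # undeclared field, raises KeyError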
@@ -112,7 +112,7 @@ define the three main, mandatory, attributes:
 be called with the downloaded :class:`~scrapy.http.Response` object of each
 start URL. The response is passed to the method as the first and only
 argument.
-
+
 This method is responsible for parsing the response data and extracting
 scraped data (as scraped items) and more URLs to follow.
 
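To make that contract concrete, here is a minimal, hypothetical sketch (the spider name and URLs are placeholders, not part of the tutorial): a ``parse()`` callback can return or yield both scraped items and further ``Request`` objects whose callbacks receive the downloaded pages::

    from scrapy.http import Request
    from scrapy.spider import Spider

    class SketchSpider(Spider):
        name = "sketch"                            # illustrative only
        start_urls = ["http://www.example.com/"]

        def parse(self, response):
            # ... extract data from this response here ...
            # then follow another URL by yielding a Request; its callback
            # will be called with that page's Response
            yield Request("http://www.example.com/other", callback=self.parse)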
@@ -132,7 +132,7 @@ This is the code for our first Spider; save it in a file named
         "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
         "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
     ]
-
+
     def parse(self, response):
         filename = response.url.split("/")[-2]
         open(filename, 'wb').write(response.body)
@@ -225,7 +225,7 @@ documentation).
   argument.
 
 * :meth:`~scrapy.selector.Selector.css`: returns a list of selectors, each of
-  them representing the nodes selected by the CSS expression given as argument.
+  them representing the nodes selected by the CSS expression given as argument.
 
 * :meth:`~scrapy.selector.Selector.extract`: returns a unicode string with the
   selected data.
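To illustrate those selector methods, a short sketch of how they might be used from the shell (output omitted; ``response`` is the object the shell provides, as described below)::

    >>> from scrapy.selector import Selector
    >>> sel = Selector(response)
    >>> sel.xpath('//title')                    # list of selectors, one per matching node
    >>> sel.css('title::text')                  # same idea, selected with a CSS expression
    >>> sel.xpath('//title/text()').extract()   # the selected data as unicode string(s)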
@@ -269,7 +269,7 @@ This is what the shell looks like::
 [s] fetch(req_or_url)   Fetch request (or URL) and update local objects
 [s] view(response)      View response in a browser
 
- In [1]:
+ In [1]:
 
 After the shell loads, you will have the response fetched in a local
 ``response`` variable, so if you type ``response.body`` you will see the body
@@ -299,7 +299,7 @@ So let's try it::
 Extracting the data
 ^^^^^^^^^^^^^^^^^^^
 
- Now, let's try to extract some real information from those pages.
+ Now, let's try to extract some real information from those pages.
 
 You could type ``response.body`` in the console, and inspect the source code to
 figure out the XPaths you need to use. However, inspecting the raw HTML code
@@ -357,7 +357,7 @@ Let's add this code to our spider::
         "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
         "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
     ]
-
+
     def parse(self, response):
         sel = Selector(response)
         sites = sel.xpath('//ul/li')
@@ -367,7 +367,7 @@ Let's add this code to our spider::
             desc = site.xpath('text()').extract()
             print title, link, desc
 
- Notice we import our Selector class from scrapy.selector and instantiate a
+ Notice we import our Selector class from scrapy.selector and instantiate a
 new Selector object. We can now specify our XPaths just as we did in the shell.
 Now try crawling the dmoz.org domain again and you'll see sites being printed
 in your output, run::
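The command introduced by ``run::`` falls outside this hunk; for reference, it is the same crawl invocation used earlier in the tutorial, executed from the project's top-level directory::

    scrapy crawl dmoz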
@@ -390,30 +390,30 @@ Spiders are expected to return their scraped data inside
 :class:`~scrapy.item.Item` objects. So, in order to return the data we've
 scraped so far, the final code for our Spider would be like this::
 
-    from scrapy.spider import Spider
-    from scrapy.selector import Selector
-
-    from tutorial.items import DmozItem
-
-    class DmozSpider(Spider):
-        name = "dmoz"
-        allowed_domains = ["dmoz.org"]
-        start_urls = [
-            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
-            "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
-        ]
-
-        def parse(self, response):
-            sel = Selector(response)
-            sites = sel.xpath('//ul/li')
-            items = []
-            for site in sites:
-                item = DmozItem()
-                item['title'] = site.xpath('a/text()').extract()
-                item['link'] = site.xpath('a/@href').extract()
-                item['desc'] = site.xpath('text()').extract()
-                items.append(item)
-            return items
+    from scrapy.spider import Spider
+    from scrapy.selector import Selector
+
+    from tutorial.items import DmozItem
+
+    class DmozSpider(Spider):
+        name = "dmoz"
+        allowed_domains = ["dmoz.org"]
+        start_urls = [
+            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
+            "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
+        ]
+
+        def parse(self, response):
+            sel = Selector(response)
+            sites = sel.xpath('//ul/li')
+            items = []
+            for site in sites:
+                item = DmozItem()
+                item['title'] = site.xpath('a/text()').extract()
+                item['link'] = site.xpath('a/@href').extract()
+                item['desc'] = site.xpath('text()').extract()
+                items.append(item)
+            return items
 
 .. note:: You can find a fully-functional variant of this spider in the dirbot_
    project available at https://github.com/scrapy/dirbot
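As a usage note for the final spider above: since ``parse()`` now returns the scraped items, they can be serialized to a file with the Feed Exports, as the section on storing the scraped items describes; for the Scrapy version this tutorial targets the command is along the lines of::

    scrapy crawl dmoz -o items.json -t json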
@@ -449,7 +449,7 @@ pipeline if you just want to store the scraped items.
 
 Next steps
 ==========
-
+
 This tutorial covers only the basics of Scrapy, but there's a lot of other
 features not mentioned here. Check the :ref:`topics-whatelse` section in
 :ref:`intro-overview` chapter for a quick overview of the most important ones.