[MRG+2] Tutorial: rewrite tutorial seeking to improve learning path #2252
Merged
Changes from 2 commits

Commits (21):
2427791  tutorial: remove item class definition and present start_requests first (eliasdorneles)
c508f40  use harcoded URLs, remove item reference on second spider (eliasdorneles)
0da497c  updates on the first section (our first spider) (stummjr)
0cd9dfc  small fixes on tutorial (stummjr)
b2a5cdd  tutorial: update section about following links, expand examples (eliasdorneles)
21de617  mention that spiders need to subclass scrapy.Spider (eliasdorneles)
147e756  update after review comments (thanks @stummjr) (eliasdorneles)
31545a9  tutorial: updating extracting data section to introduce CSS and XPath… (eliasdorneles)
233b98d  include section describing spider arguments (stummjr)
2a409d1  [wip] changing introduction to scraping with selectors (eliasdorneles)
fee0783  Completing the data extraction section (stummjr)
f4f93c5  fix tox docs build, adjust title (eliasdorneles)
8975371  Merge branch 'master' into tutorial-upgrades (eliasdorneles)
125b691  more reviewing and editing, minor restructure, syntax fixes (eliasdorneles)
bc41fdf  address review comments, add debug log to initial spider (eliasdorneles)
a876ea5  minor grammar fix (eliasdorneles)
c126c59  address more review comments (eliasdorneles)
38266cc  recommend Dive into Python and Python tutorial instead of LPTHW for n… (eliasdorneles)
32017a7  recommend learn python the hard way for beginners (eliasdorneles)
d636e5b  better description for start_requests expected return value (eliasdorneles)
f4a2208  addressing review comments and other minor editing (eliasdorneles)
@@ -13,10 +13,9 @@ our example domain to scrape.
 This tutorial will walk you through these tasks:

 1. Creating a new Scrapy project
-2. Defining the Items you will extract
-3. Writing a :ref:`spider <topics-spiders>` to crawl a site and extract
+2. Writing a :ref:`spider <topics-spiders>` to crawl a site and extract
    :ref:`Items <topics-items>`
-4. Exporting the scraped data using command line
+3. Exporting the scraped data using command line

 Scrapy is written in Python_. If you're new to the language you might want to
 start by getting an idea of what the language is like, to get the most out of
@@ -55,34 +54,6 @@ This will create a ``tutorial`` directory with the following contents::
             __init__.py


-Defining our Item
-=================
-
-`Items` are containers that will be loaded with the scraped data; they work
-like simple Python dicts. While you can use plain Python dicts with Scrapy,
-`Items` provide additional protection against populating undeclared fields,
-preventing typos. They can also be used with :ref:`Item Loaders
-<topics-loaders>`, a mechanism with helpers to conveniently populate `Items`.
-
-They are declared by creating a :class:`scrapy.Item <scrapy.item.Item>` class and defining
-its attributes as :class:`scrapy.Field <scrapy.item.Field>` objects, much like in an ORM
-(don't worry if you're not familiar with ORMs, you will see that this is an
-easy task).
-
-We begin by modeling the item that we will use to hold the site's data obtained
-from quotes.toscrape.com. As we want to capture the text and author from each of
-the quotes listed there, we define fields for each of these three attributes. To do that, we edit
-``items.py``, found in the ``tutorial`` directory. Our Item class looks like this::
-
-    import scrapy
-
-    class QuoteItem(scrapy.Item):
-        text = scrapy.Field()
-        author = scrapy.Field()
-
-This may seem complicated at first, but defining an item class allows you to use other handy
-components and helpers within Scrapy.
-
 Our first Spider
 ================

@@ -93,20 +64,23 @@ They define an initial list of URLs to download, how to follow links, and how
 to parse the contents of pages to extract :ref:`items <topics-items>`.

 To create a Spider, you must subclass :class:`scrapy.Spider
-<scrapy.spiders.Spider>` and define some attributes:
+<scrapy.spiders.Spider>` and define some attributes and methods:

 * :attr:`~scrapy.spiders.Spider.name`: identifies the Spider. It must be
   unique within a project, that is, you can't set the same name for different
   Spiders.

-* :attr:`~scrapy.spiders.Spider.start_urls`: a list of URLs where the
-  Spider will begin to crawl from. The first pages downloaded will be those
-  listed here. The subsequent URLs will be generated successively from data
-  contained in the start URLs.
+* :meth:`~scrapy.spiders.Spider.start_requests`: must return a list
+  of requests where the Spider will begin to crawl from.
+  Subsequent requests will be generated successively from these initial requests.
+
+  As an alternative to defining this method, you can define a class
+  attribute :attr:`~scrapy.spiders.Spider.start_urls`, which the default
+  implementation of this method will use to create the proper requests.

 * :meth:`~scrapy.spiders.Spider.parse`: a method of the spider, which will
   be called with the downloaded :class:`~scrapy.http.Response` object of each
-  start URL. The response is passed to the method as the first and only
+  initial request. The response is passed to the method as the first and only
   argument.

   This method is responsible for parsing the response data and extracting
@@ -124,13 +98,18 @@ This is the code for our first Spider; save it in a file named

     class QuotesSpider(scrapy.Spider):
         name = "quotes"
-        start_urls = [
-            'http://quotes.toscrape.com/page/1/',
-            'http://quotes.toscrape.com/page/2/',
-        ]
+
+        def start_requests(self):
+            urls = [
+                'http://quotes.toscrape.com/page/1/',
+                'http://quotes.toscrape.com/page/2/',
+            ]
+            for url in urls:
+                yield scrapy.Request(url=url, callback=self.parse)

         def parse(self, response):
-            filename = 'quotes-' + response.url.split("/")[-2] + '.html'
+            page = response.url.split("/")[-2]
+            filename = 'quotes-%s.html' % page
             with open(filename, 'wb') as f:
                 f.write(response.body)

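A quick way to see what the new ``parse`` method will write to disk: the second-to-last path segment of each URL becomes part of the filename, so the two hardcoded start URLs produce ``quotes-1.html`` and ``quotes-2.html``. A minimal sketch of that string logic (plain Python, no Scrapy required)::

    >>> url = 'http://quotes.toscrape.com/page/1/'
    >>> page = url.split("/")[-2]   # second-to-last segment of the path
    >>> page
    '1'
    >>> 'quotes-%s.html' % page
    'quotes-1.html'
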
@@ -171,13 +150,13 @@ URLs, as our ``parse`` method instructs.
 What just happened under the hood?
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-Scrapy creates :class:`scrapy.Request <scrapy.http.Request>` objects
-for each URL in the ``start_urls`` attribute of the Spider, and assigns
-them the ``parse`` method of the spider as their callback function.
+Scrapy will schedule the :class:`scrapy.Request <scrapy.http.Request>` objects
+returned by the ``start_requests`` method of the Spider, and when receiving
+a response for each one it will instantiate :class:`scrapy.http.Response`
+objects and call the ``parse`` callback method passing the response as argument.

-These Requests are scheduled, then executed, and :class:`scrapy.http.Response`
-objects are returned and then fed back to the spider, through the
-:meth:`~scrapy.spiders.Spider.parse` method.
+.. TODO: add here an explanation about how this structure is so common that
+   we can do a short version of the spider w/ start_urls and default callback

 Extracting Items
 ----------------
@@ -355,9 +334,13 @@ concatenate further ``.xpath()`` calls to dig deeper into a node. We are going to use
 that property here, so::

     for quote in response.xpath('//div[@class="quote"]'):
-        text = quote.xpath('span[@class="text"]/text()').extract()
-        author = quote.xpath('span/small/text()').extract()
-        print('{}: {}'.format(author, text))
+        text = quote.xpath('span[@class="text"]/text()').extract_first()
+        author = quote.xpath('span/small/text()').extract_first()
+        print({'text': text, 'author': author})
+
+In the above snippet we've decided to use the method ``.extract_first()``
+instead of ``.extract()``, to extract the content of the first element in the
+selector list returned by ``.xpath()``.

 .. note::

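To make the difference concrete, a shell sketch (the quote text shown is abbreviated and assumed from quotes.toscrape.com's markup): ``.extract()`` always returns a list of strings, while ``.extract_first()`` returns the first match, or ``None`` when nothing matches, sparing the callback from index errors::

    >>> quote = response.xpath('//div[@class="quote"]')[0]
    >>> quote.xpath('span[@class="text"]/text()').extract()        # a list
    ['“The world as we have created it is a process of our thinking. ...”']
    >>> quote.xpath('span[@class="text"]/text()').extract_first()  # a string
    '“The world as we have created it is a process of our thinking. ...”'
    >>> quote.xpath('span[@class="bogus"]/text()').extract_first()  # no match: None
    >>> quote.xpath('span[@class="bogus"]/text()').extract_first(default='N/A')
    'N/A'
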
@@ -366,7 +349,11 @@ that property here, so::
    :ref:`topics-selectors-relative-xpaths` in the :ref:`topics-selectors`
    documentation

-Let's add this code to our spider::
+Now that you know how to use selectors, extracting data from a page is just a
+matter of yielding Python dictionaries from the callback method instead of
+printing them.
+
+Let's add the necessary code to our spider::

     import scrapy

Review comment (on "Let's add the necessary code to our spider::"): Here we could add a note about simplifying the spider via
@@ -380,54 +367,16 @@ Let's add this code to our spider::

         def parse(self, response):
             for quote in response.xpath('//div[@class="quote"]'):
-                text = quote.xpath('span[@class="text"]/text()').extract_first()
-                author = quote.xpath('span/small/text()').extract_first()
-                print(u'{}: {}'.format(author, text))
+                yield {
+                    'text': quote.xpath('span[@class="text"]/text()').extract_first(),
+                    'author': quote.xpath('span/small/text()').extract_first(),
+                }

-Note how we've changed to use the method ``.extract_first()``, which extracts
-the first element from a selector list returned by ``.xpath()``.
-
-Now try crawling quotes.toscrape.com again and you'll see sites being printed
-in your output. Run::
+Run::

     scrapy crawl quotes

-Using our item
---------------
-
-:class:`~scrapy.item.Item` objects are custom Python dicts; you can access the
-values of their fields (attributes of the class we defined earlier) using the
-standard dict syntax like::
-
-    >>> from tutorial.items import QuoteItem
-    >>> item = QuoteItem()
-    >>> item['text'] = 'Some random quote'
-    >>> item['title']
-    'Some random quote'
-
-So, in order to return the data we've scraped so far, the final code for our
-Spider would be like this::
-
-    import scrapy
-    from tutorial.items import QuoteItem
-
-
-    class QuotesSpider(scrapy.Spider):
-        name = "quotes"
-        start_urls = [
-            'http://quotes.toscrape.com/page/1/',
-            'http://quotes.toscrape.com/page/2/',
-        ]
-
-        def parse(self, response):
-            for quote in response.xpath('//div[@class="quote"]'):
-                item = QuoteItem()
-                item['text'] = quote.xpath('span[@class="text"]/text()').extract_first()
-                item['author'] = quote.xpath('span/small/text()').extract_first()
-                yield item
-
-
-Now crawling quotes.toscrape.com yields ``QuoteItem`` objects::
+Now crawling quotes.toscrape.com will show dictionary objects::

     2016-09-02 16:35:20 [scrapy] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
     {'author': 'Oscar Wilde',

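Since the spider now yields plain dicts, the command-line export promised in the tutorial's task list works unchanged; for example (the output filename here is arbitrary)::

    scrapy crawl quotes -o quotes.json

Scrapy's feed exports serialize the yielded dicts to the format implied by the file extension.
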
@@ -450,7 +399,6 @@ want for all of them?
 Here is a modification to our spider that does just that::

     import scrapy
-    from tutorial.items import QuoteItem


     class QuotesSpider(scrapy.Spider):
@@ -461,12 +409,13 @@ Here is a modification to our spider that does just that::

         def parse(self, response):
             for quote in response.xpath('//div[@class="quote"]'):
-                item = QuoteItem()
-                item['text'] = quote.xpath('span[@class="text"]/text()').extract_first()
-                item['author'] = quote.xpath('span/small/text()').extract_first()
-                yield item
+                yield {
+                    'text': quote.xpath('span[@class="text"]/text()').extract_first(),
+                    'author': quote.xpath('span/small/text()').extract_first(),
+                }

             next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()
-            if next_page:
+            if next_page is not None:
                 next_page = response.urljoin(next_page)
                 yield scrapy.Request(next_page, callback=self.parse)

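One detail worth noting in the new code: the ``href`` extracted from the next-page link can be a relative URL, and ``response.urljoin`` resolves it against the URL of the current response before the new request is made. A shell sketch, with values assumed from quotes.toscrape.com::

    >>> response.url
    'http://quotes.toscrape.com/page/1/'
    >>> response.xpath('//li[@class="next"]/a/@href').extract_first()
    '/page/2/'
    >>> response.urljoin('/page/2/')   # resolved against response.url
    'http://quotes.toscrape.com/page/2/'
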
Review comment: What about presenting the source code first and then explaining the methods and attrs that the spider needs?