Some improvements for Scrapy tutorial #1180
Conversation
navigating the structure, it can also look at the content: you're
able to select things like: *the link that contains the text 'Next Page'*.
Because of this, we encourage you to learn about XPath even if you
already know how to construct CSS selectors.

For working with XPaths, Scrapy provides :class:`~scrapy.selector.Selector`
kmike
Apr 21, 2015
Member
Selector works with CSS as well
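A minimal illustration of that point (the HTML string here is made up):

```python
from scrapy.selector import Selector

body = "<ul><li>one</li><li>two</li></ul>"
sel = Selector(text=body)

# the same Selector object handles both query languages
print(sel.css("li::text").extract())       # ['one', 'two']
print(sel.xpath("//li/text()").extract())  # ['one', 'two']
```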
def parse(self, response):
    for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
        url = urlparse.urljoin(response.url, href.extract())
kmike
Apr 21, 2015
Member
It may be good to mention response.urljoin because it is shorter, and because it handles the <base> HTML tag.
eliasdorneles
Apr 21, 2015
Author
Member
oh right!! sorry, forgot about it again! :)
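A sketch of the suggested change (the spider name, URL, and second callback here are placeholders, not from the tutorial):

```python
import scrapy


class DirectorySpider(scrapy.Spider):  # hypothetical spider, for illustration
    name = "directory"
    start_urls = ["http://example.com/directory"]  # placeholder URL

    def parse(self, response):
        for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
            # shorter than urlparse.urljoin(response.url, href.extract()),
            # and it also resolves against a <base> tag if the page has one
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_contents)

    def parse_contents(self, response):
        pass  # extract data from the followed page here
```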
What you see here is Scrapy's mechanism of following links: if you yield a
Request instead of an Item in a callback method, Scrapy will schedule that
request to be sent and register a callback method to be executed when that
request finishes.
kmike
Apr 21, 2015
Member
It may be worth mentioning that a callback method can yield both Items and Requests at the same time; there are no separate "item callbacks" and "request callbacks".
eliasdorneles
Apr 21, 2015
Author
Member
+1, will do it in a jiffy
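Something like this, presumably (the selectors here are made up; the point is just that one callback mixes both kinds of yields):

```python
import scrapy


class MixedSpider(scrapy.Spider):  # hypothetical, just to show the mixed yields
    name = "mixed"
    start_urls = ["http://example.com/articles"]  # placeholder URL

    def parse(self, response):
        # the same callback can yield items...
        for title in response.css("article h2::text").extract():
            yield {"title": title}
        # ...and also Requests; Scrapy routes each yielded object to the
        # item-processing machinery or the scheduler as appropriate
        next_links = response.css("a.next::attr('href')").extract()
        if next_links:
            yield scrapy.Request(response.urljoin(next_links[0]),
                                 callback=self.parse)
```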
One more common pain point for beginners is creating a single item from multiple web pages. In the future other solutions could appear, e.g. #1138 or #1144.
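For context, the usual workaround today is to carry the partially-built item between callbacks via request.meta, roughly like this (site structure and field names are made up):

```python
import scrapy


class BookSpider(scrapy.Spider):  # hypothetical two-page example
    name = "books"
    start_urls = ["http://example.com/books"]  # placeholder URL

    def parse(self, response):
        for href in response.css("a.book::attr('href')").extract():
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.parse_book)

    def parse_book(self, response):
        # start the item on the first page...
        item = {"title": response.xpath("//h1/text()").extract()[0]}
        author_href = response.css("a.author::attr('href')").extract()[0]
        # ...and carry it to the second page via request.meta
        yield scrapy.Request(response.urljoin(author_href),
                             callback=self.parse_author,
                             meta={"item": item})

    def parse_author(self, response):
        item = response.meta["item"]
        item["author"] = response.xpath("//h1/text()").extract()[0]
        yield item  # one item, assembled from two pages
```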
@kmike +1, I added some examples and addressed your other comments. Let me know if you think of anything else. =)
settings.py        # project settings file

spiders/           # a directory where you'll later put your spiders
    __init__.py
    ...
kmike
Apr 21, 2015
Member
in the GitHub UI the extra new lines look too sparse for my taste, but maybe they are more readable at RTFD
eliasdorneles
Apr 21, 2015
Author
Member
yeah, I tried without the new lines and it looked a bit confusing.
eliasdorneles
Apr 21, 2015
Author
Member
So, at the risk of overthinking this... :P
The main change here was to avoid the indirection (before, it showed the structure and detailed the type of content below it).
Yes, it is a bit more sparse now, but this is nicer for a reader trying to understand the structure, because 1) she can see the descriptions directly in the structure tree, and 2) she can quickly scan the vertically-aligned comments to see what kinds of files there are.
The only thing I'm not so sure about here is the vertical alignment.
kmike
Apr 21, 2015
Member
I like inline comments more.
for article in response.xpath("//article"):
    yield {
        # ... extract article data here
    }
kmike
Apr 21, 2015
Member
This example can be confusing because in the tutorial we were not yielding dicts, only Items.
kmike
Apr 21, 2015
Member
There is "Spiders are expected to return their scraped data inside :class:~scrapy.item.Item
objects." statement earlier, which doesn't show a full picture now.
There is "Spiders are expected to return their scraped data inside :class:~scrapy.item.Item
objects." statement earlier, which doesn't show a full picture now.
eliasdorneles
Apr 21, 2015
Author
Member
Hmmm, right.
I've already added a bit about being able to use Python dicts in the beginning of the "Defining our Item" section above, lemme just fix that statement then.
eliasdorneles
Apr 21, 2015
Author
Member
So, I settled for just removing the incorrect statement and changing the example to use Item, to avoid the different item type being a distraction here.
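Presumably the reworked example looks something like this (ArticleItem is a hypothetical stand-in for whatever Item the tutorial defines):

```python
import scrapy


class ArticleItem(scrapy.Item):  # hypothetical Item definition
    title = scrapy.Field()
    url = scrapy.Field()


class ArticleSpider(scrapy.Spider):
    name = "articles"
    start_urls = ["http://example.com/"]  # placeholder URL

    def parse(self, response):
        for article in response.xpath("//article"):
            item = ArticleItem()
            item["title"] = article.xpath(".//h2/text()").extract()[0]
            item["url"] = response.url
            yield item
```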
next_page = response.css("ul.navigation -> li.next-page > a::attr('href')")
kmike
Apr 21, 2015
Member
is -> valid CSS?
eliasdorneles
Apr 21, 2015
Author
Member
ooops, no! 😁
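Presumably the fix is just the plain child combinator:

```python
# '>' is the CSS child combinator; '->' is not valid CSS
next_page = response.css("ul.navigation > li.next-page > a::attr('href')")
```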
…l, removed statement implying mandatory usage of Item
A couple more ideas/issues/suggestions for the tutorial, this time more radical:
I think the tutorial should explain all the points shown in the overview again, but in more detail, expanding from the basic features (dicts, single-file spiders, CSS selectors) to the more advanced ones (items, Scrapy projects, XPath selectors). Some repetition will help users learn. IMHO, showing a feature in the overview and then using something completely different in the tutorial is not enough; it is confusing. After the overview users already know something; we should reinforce their knowledge and provide extra details.
Some improvements for Scrapy tutorial

So, here are some improvements for the Scrapy tutorial, please let me know what you think.
Summary of the changes:

- `watch` target to docs Makefile (note: depends on watchdog)

Does this look good?