
Some improvements for Scrapy tutorial #1180

Merged: 4 commits merged into scrapy:master on Apr 21, 2015

Conversation

@eliasdorneles (Member) commented Apr 21, 2015

So, here are some improvements for the Scrapy tutorial; please let me know what you think.

Summary of the changes:

  • adds an example and explanation about following links (fixes #608)
  • adds a note about CSS vs. XPath; the idea is to encourage people who know CSS to dig into XPath too
  • updates the recommended XPath tutorials to the ones I felt were most helpful as an XPath beginner
  • adds a watch target to the docs Makefile (note: depends on watchdog)
  • a few other small changes

Does this look good?

navigating the structure, it can also look at the content: you're
able to select things like *the link that contains the text 'Next Page'*.
Because of this, we encourage you to learn about XPath even if you
already know how to construct CSS selectors.

For working with XPaths, Scrapy provides :class:`~scrapy.selector.Selector`

@kmike (Member), Apr 21, 2015:

Selector works with CSS as well
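A quick sketch of that point (the toy HTML snippet is illustrative, not from the PR): the same Selector object exposes both .css() and .xpath(), and XPath can additionally match on text content:

from scrapy.selector import Selector

sel = Selector(text='<a href="/page/2">Next Page</a>')

# CSS matches on structure and attributes:
sel.css('a::attr(href)').extract()                          # [u'/page/2']

# XPath can also match on text content, e.g. "the link that
# contains the text 'Next Page'":
sel.xpath('//a[contains(., "Next Page")]/@href').extract()  # [u'/page/2']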


def parse(self, response):
    for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
        url = urlparse.urljoin(response.url, href.extract())

@kmike (Member), Apr 21, 2015:

It may be good to mention response.urljoin, because it is shorter and because it handles the <base> HTML tag.
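A sketch of the snippet rewritten that way (the follow-up callback name parse_dir_contents is illustrative):

def parse(self, response):
    for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
        # response.urljoin() resolves the href against response.url
        # and also honours a <base> tag in the page
        url = response.urljoin(href.extract())
        yield scrapy.Request(url, callback=self.parse_dir_contents)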

@eliasdorneles (Author, Member), Apr 21, 2015:

oh right!! sorry, forgot about it again! :)

What you see here is Scrapy's mechanism of following links: if you yield a
Request instead of an Item in a callback method, Scrapy will schedule that
request to be sent and register a callback method to be executed when that
request finishes.

@kmike (Member), Apr 21, 2015:

It may be worth mentioning that a callback method can yield both Items and Requests at the same time; there are no separate "item callbacks" and "request callbacks".
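A minimal sketch of such a mixed callback (assuming the tutorial's DmozItem; the spider name, start URL, and the 'next' link selector are illustrative):

import scrapy
from tutorial.items import DmozItem

class MixedSpider(scrapy.Spider):
    name = 'mixed'
    start_urls = ['http://www.dmoz.org/Computers/Programming/Languages/Python/Books/']

    def parse(self, response):
        # the same callback can yield items...
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            yield item
        # ...and requests; there are no separate callback kinds
        for href in response.css('a.next::attr(href)'):
            yield scrapy.Request(response.urljoin(href.extract()),
                                 callback=self.parse)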

@eliasdorneles (Author, Member), Apr 21, 2015:

+1, will do it in a jiffy

@kmike (Member) commented Apr 21, 2015

👍

@kmike (Member) commented Apr 21, 2015

One more common pain point for beginners is creating a single item from multiple web pages.
Maybe add a link to http://doc.scrapy.org/en/latest/topics/request-response.html#topics-request-response-ref-request-callback-arguments for now, or even discuss it briefly in the tutorial?

In the future other solutions could appear, e.g. #1138 or #1144.
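For reference, a minimal sketch of the meta-based approach that the linked page describes (the spider, URLs, and fields are hypothetical):

import scrapy

class ItemAcrossPagesSpider(scrapy.Spider):
    name = 'item-across-pages'
    start_urls = ['http://example.com/summary']

    def parse(self, response):
        item = {'title': response.xpath('//h1/text()').extract()}
        # carry the partially filled item over to the next callback
        yield scrapy.Request(response.urljoin('details.html'),
                             callback=self.parse_details,
                             meta={'item': item})

    def parse_details(self, response):
        item = response.meta['item']
        item['description'] = response.xpath('//p/text()').extract()
        yield item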

@eliasdorneles (Member, Author) commented Apr 21, 2015

@kmike +1, I added some examples and addressed your other comments.

Let me know if you think of another thing. =)


settings.py       # project settings file

spiders/          # a directory where you'll later put your spiders
    __init__.py
    ...

@kmike (Member), Apr 21, 2015:

In the GitHub UI the extra new lines look too sparse for my taste, but maybe they are more readable on Read the Docs.

@eliasdorneles (Author, Member), Apr 21, 2015:

yeah, I tried without the new lines and it looked a bit confusing.

@eliasdorneles (Author, Member), Apr 21, 2015:

So, at the risk of overthinking this... :P
The main change here was to avoid the indirection (before, it showed the structure first and detailed the kinds of content below).

Yes, it is a bit more sparse now, but this is nicer for a reader trying to understand the structure, because 1) she can see the descriptions directly in the structure tree, and 2) she can quickly scan the vertically-aligned comments to see what kinds of files there are.

The only thing I'm not so sure about here is the vertical alignment.

@kmike (Member), Apr 21, 2015:

I like inline comments more.

for article in response.xpath("//article"):
    yield {
        # ... extract article data here
    }

@kmike (Member), Apr 21, 2015:

This example can be confusing because in the tutorial we were not yielding dicts, only Items.

@kmike (Member), Apr 21, 2015:

There is "Spiders are expected to return their scraped data inside :class:~scrapy.item.Item objects." statement earlier, which doesn't show a full picture now.

@eliasdorneles (Author, Member), Apr 21, 2015:

Hmmm, right.
I've already added a bit about being able to use Python dicts at the beginning of the "Defining our Item" section above; lemme just fix that statement then.

@eliasdorneles (Author, Member), Apr 21, 2015:

So, I settled for just removing the incorrect statement and changing the example to use Item, to avoid the different item type being a distraction here.
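The revised example is not shown in this excerpt; a sketch of an Item-based version might look like this (ArticleItem and its field are hypothetical stand-ins):

for article in response.xpath("//article"):
    item = ArticleItem()    # hypothetical Item subclass
    item['title'] = article.xpath('h2/text()').extract()
    yield item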

        # ... extract article data here
    }

next_page = response.css("ul.navigation -> li.next-page > a::attr('href')")

@kmike (Member), Apr 21, 2015:

is -> valid CSS?

@eliasdorneles (Author, Member), Apr 21, 2015:

ooops, no! 😁
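For reference, -> is not a CSS combinator; the child combinator presumably intended here is >:

next_page = response.css("ul.navigation > li.next-page > a::attr('href')")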

…l, removed statement implying mandatory usage of Item
@kmike (Member) commented Apr 21, 2015

A couple more ideas/issues/suggestions for the tutorial, this time more radical:

  1. It may focus on start_requests instead of start_urls. This way users will see more clearly how Scrapy works; the parse method will be less 'magical', and from start_requests it is easy to explain that start_urls is a shortcut for a common use case (see the sketch after this list).
  2. The transition from dicts (used in the overview) to Items (used here) could be more gradual; to do that we may introduce Items later in the tutorial and continue using dicts at the beginning, i.e. start writing the spider using dicts, get something working, and then switch to Items. The jump from the overview to the tutorial could become smoother this way.

I think the tutorial should explain all the points shown in the overview again, but in more detail, expanding from basic features (dicts, single-file spiders, CSS selectors) to more advanced ones (Items, Scrapy projects, XPath selectors). Some repetition will help users learn. IMHO, showing a feature in the overview and then using something completely different in the tutorial is not enough; it is confusing. After the overview users already know something; we should reinforce their knowledge and provide extra details.
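A minimal sketch of what point 1 could look like (the spider name and URL are hypothetical):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'

    def start_requests(self):
        # the explicit form of what start_urls does for you:
        # build the initial requests and hand them to the scheduler
        yield scrapy.Request('http://example.com/', callback=self.parse)

    def parse(self, response):
        yield {'title': response.xpath('//title/text()').extract()}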

@pablohoffman merged commit 4c4eb4f into scrapy:master on Apr 21, 2015.

1 check passed: continuous-integration/travis-ci/pr (the Travis CI build passed)