
[MRG+2] Tutorial: rewrite tutorial seeking to improve learning path #2252

Merged
merged 21 commits into scrapy:master on Sep 22, 2016

Conversation

@eliasdorneles (Member) commented Sep 15, 2016

This changes the tutorial, removing the step of creating an item class
and also starts by presenting the start_requests method instead of
start_urls.

This was initially WIP because we still had to decide when to present the short version of the spider (using start_urls and relying on the default callback). This is now ready for review.

Main changes:

  • remove unnecessary item class definition
  • present start_requests first, to empower users from the beginning (see the sketch after this list)
  • show CSS before XPath, for less scary learning
  • get the learner playing in the shell as soon as possible
  • show how to parameterize spiders using arguments
  • show more illustrative code examples
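
For a concrete sense of the direction, the spider shape the rewritten tutorial leads with looks roughly like this (a sketch only; it follows the quotes.toscrape.com examples used throughout this PR):

    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"

        def start_requests(self):
            # Explicit requests rather than a start_urls attribute,
            # so learners see exactly where the crawl begins.
            urls = [
                'http://quotes.toscrape.com/page/1/',
                'http://quotes.toscrape.com/page/2/',
            ]
            for url in urls:
                yield scrapy.Request(url=url, callback=self.parse)

        def parse(self, response):
            # Minimal callback: just log the visited URL.
            self.log('Visited %s' % response.url)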

We were also trying to let learners think and guess things for themselves (instead of spoonfeeding them), so we show code examples before we explain them.

What do you think?

@codecov-io commented Sep 15, 2016

Current coverage is 83.20% (diff: 100%)

Merging #2252 into master will not change coverage

@@             master      #2252   diff @@
==========================================
  Files           161        161          
  Lines          8717       8717          
  Methods           0          0          
  Messages          0          0          
  Branches       1283       1283          
==========================================
  Hits           7253       7253          
  Misses         1215       1215          
  Partials        249        249          

Powered by Codecov. Last update 3fd947b...f4a2208

@@ -93,20 +64,23 @@ They define an initial list of URLs to download, how to follow links, and how
to parse the contents of pages to extract :ref:`items <topics-items>`.

To create a Spider, you must subclass :class:`scrapy.Spider
-<scrapy.spiders.Spider>` and define some attributes:
+<scrapy.spiders.Spider>` and define some attributes and methods:

@stummjr (Member) · Sep 15, 2016

What about presenting the source code first and then explaining the methods and attrs that the spider needs?

yield the Python dictionaries from the callback method instead of printing
them.

Let's add the necessary code to our spider::

@stummjr (Member) · Sep 15, 2016

Here we could add a note about simplifying the spider via start_urls.

(don't worry if you're not familiar with ORMs, you will see that this is an
easy task).
Spiders are classes that you define and that Scrapy uses to scrape information
from a website (or group of websites). They define an initial list of URLs to

@eliasdorneles (Author, Member) · Sep 16, 2016

this should probably be initial requests instead of initial list of URLs

@stummjr (Collaborator) commented on docs/intro/tutorial.rst in b2a5cdd · Sep 16, 2016

nitpick: "pagination links"

@stummjr (Collaborator) commented on docs/intro/tutorial.rst in b2a5cdd · Sep 16, 2016

"Another interesting thing this spider ..."

@stummjr (Collaborator) replied Sep 16, 2016

Ginormous sentence. Let's break it:

Another interesting thing this spider demonstrates is that, even if there are many quotes from the same author, we don't need to worry about visiting the same author page multiple times. By default, Scrapy filters out duplicated requests to URLs already visited, avoiding the problem of hitting servers too much because of a programming mistake.

@eliasdorneles (Owner, Author) replied Sep 16, 2016

nice, thanks! 👍

@stummjr (Collaborator) commented on docs/intro/tutorial.rst in b2a5cdd · Sep 16, 2016

"visiting the same author page multiple times**,** because ..."

Scrapy creates :class:`scrapy.Request <scrapy.http.Request>` objects
for each URL in the ``start_urls`` attribute of the Spider, and assigns
them the ``parse`` method of the spider as their callback function.
Simplifying your spider

@eliasdorneles (Author, Member) · Sep 16, 2016

I'm not sure if using start_urls can be described as simplifying, because there are actually more things entangled behind the scenes (i.e., this second version is shorter, but not necessarily simpler).

I think it's best to present as a short version, saying something like: "This pattern of starting from a few hardcoded URLs is so common that Scrapy has a shortcut for it: you can define a start_urls class attribute w/ a list of URLs, which will be used by the default implementation of Spider.start_requests to create the initial requests"

Makes sense?
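
For reference, the "short version" under discussion looks roughly like this (a sketch; the default Spider.start_requests() builds the initial requests from start_urls and assigns parse() as their callback):

    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        # The default start_requests() turns each of these URLs into a
        # scrapy.Request with self.parse as the callback.
        start_urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]

        def parse(self, response):
            self.log('Visited %s' % response.url)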

@stummjr (Collaborator) commented on docs/intro/tutorial.rst in b2a5cdd · Sep 16, 2016

You have to use double backticks to format as source code: ``parse()``.

@stummjr (Collaborator) replied Sep 16, 2016

The same applies to other parts of the text.

@eliasdorneles (Owner, Author) commented on docs/intro/tutorial.rst in b2a5cdd · Sep 16, 2016

I think we're okay: I wanted to introduce the subject as generic link following; the next paragraph specifies that it's about following the link to the next page.

@stummjr (Member) commented Sep 19, 2016

@eliasdorneles I took a stab at completing the section on data extraction. Would you mind having a look?

crawlers on top of it.

Adding a spider argument

@redapple (Contributor) · Sep 20, 2016

What do you think of using "Passing" instead of "Adding"?

@eliasdorneles (Author, Member) · Sep 20, 2016

Hm, I dunno, I have the feeling passing is more about giving a value to something that is already parameterized. This is trying to show how to parameterize first, and then how to "pass" (give the value).
Does this make sense?

@redapple (Contributor) · Sep 20, 2016

What about simply "Using [custom] spider arguments"?

@eliasdorneles (Author, Member) · Sep 20, 2016

Sounds good, let's go w/ that, less ambiguity. 👍

@eliasdorneles eliasdorneles changed the title [WIP] Tutorial: remove item class definition and present start_requests first Tutorial: remove item class definition and present start_requests first Sep 20, 2016
@eliasdorneles eliasdorneles changed the title Tutorial: remove item class definition and present start_requests first Tutorial: rewrite tutorial seeking to improve learning path Sep 20, 2016
@eliasdorneles (Member, Author) commented Sep 20, 2016

I think this is ready for review; I've updated the description with the main changes and some explanation of what @stummjr and I had in mind while writing. :)

For easier reviewing, comment in this PR and read the text here: https://github.com/eliasdorneles/scrapy/blob/tutorial-upgrades/docs/intro/tutorial.rst

Spiders are classes that you define and that Scrapy uses to scrape information
from a website (or group of websites). They must subclass
:class:`scrapy.Spider` and define the initial requests to make, how to follow
links in the pages, and how to parse the downloaded page content to extract

@redapple (Contributor) · Sep 20, 2016

"how to follow links in the pages" is in theory optional

@eliasdorneles (Author, Member) · Sep 20, 2016

right, we can make that explicit: define the initial requests to make, optionally how to follow links, and how to parse the content

... text = quote.css("span.text ::text").extract_first()
... author = quote.css("small.author ::text").extract_first()
... tags = quote.css("div.tags a.tag ::text").extract()
... print(dict(text=text, author=author, tags=tags))

@redapple (Contributor) · Sep 20, 2016

While reading it, I was expecting sample output here; the first dict, perhaps.
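
For reference, the first printed dict would look something like this (reconstructed from the sample output quoted later in this thread):

    {'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']}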

Besides the :meth:`~scrapy.selector.Selector.extract` and
:meth:`~scrapy.selector.SelectorList.extract_first` methods, you can also use
the :meth:`~scrapy.selector.Selector.re` method to extract using a regular
expression::

@kmike (Member) · Sep 20, 2016

What do you think about adding a link to python stdlib docs for re module, to point to a place where users can get information about re syntax?

@eliasdorneles (Author, Member) · Sep 20, 2016

good idea, will do!
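
(The stdlib reference is https://docs.python.org/3/library/re.html. A quick sketch of the kind of ``re()`` call in question, assuming the quotes.toscrape.com page is loaded in the shell:)

    >>> response.css('title::text').re(r'Quotes.*')
    ['Quotes to Scrape']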

>>>
In order to find the proper CSS selectors to use, you might find it useful to open
the response page from the shell in your web browser using ``view(response)``.
You can use your browser developer tools or extensions like Firebug. For more

@kmike (Member) · Sep 20, 2016

What do you think about mentioning SelectorGadget extension?

We won't cover much of XPath here. To learn more about XPath, we recommend
`this tutorial to learn XPath through examples
<http://zvon.org/comp/r/tut-XPath_1.html>`_, and `this tutorial to learn "how
to think in XPath" <http://plasmasturm.org/log/xpath101/>`_.

@kmike (Member) · Sep 20, 2016

Scrapy docs also have a section about selectors which is mostly focused on xpaths, what do you think about adding a link?

... tags = quote.css("div.tags a.tag::text").extract()
... print(dict(text=text, author=author, tags=tags))
{'text': u'\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d', 'tags': [u'change', u'deep-thoughts', u'thinking', u'world'], 'author': u'Albert Einstein'}
{'text': u'\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d', 'tags': [u'abilities', u'choices'], 'author': u'J.K. Rowling'}

@kmike (Member) · Sep 20, 2016

In Python 3 users won't get those ugly \u201cIt. What do you think about using Python 3 in the tutorial, and mentioning that in Python 2 output can be different? I'm not sure about that.

@eliasdorneles (Author, Member) · Sep 20, 2016

indeed, best to use Python 3 for everything.

Now try crawling quotes.toscrape.com again and you'll see sites being printed
in your output. Run::
2016-09-19 18:57:19 [scrapy] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'tags': ['life', 'love'], 'author': 'André Gide', 'text': '“It is better to be hated for what you are than to be loved for what you are not.”'}

@kmike (Member) · Sep 20, 2016

hm, it seems here Python 3 is used, while earlier output was from Python 2


Here is a modification to our spider that does just that::
>>> response.css('li.next a::attr("href")').extract_first()

@kmike (Member) · Sep 20, 2016

are quotes around href needed? I'm always writing it without quotes, maybe it is wrong..

@eliasdorneles (Author, Member) · Sep 20, 2016

AFAIK, both are accepted: https://github.com/scrapy/parsel/blob/master/parsel/csstranslator.py#L81

Quotes aren't needed, I just have the habit of putting them -- I'll remove them from the examples.
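
For reference, the unquoted form behaves identically (a sketch; the '/page/2/' result assumes the shell is on the first page of quotes.toscrape.com):

    >>> response.css('li.next a::attr(href)').extract_first()
    '/page/2/'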


scrapy crawl quotes -o items.json
scrapy crawl quotes -o items.json -a tag=humor

@kmike (Member) · Sep 20, 2016

If users follow the tutorial as-is, the items.json file will contain multiple JSON objects, because by default data is appended, and there was an example with -o items.json before. I think we should mention this gotcha somewhere.

@eliasdorneles (Author, Member) · Sep 20, 2016

ohh, that's kind of a big deal -- yeah, we'll have to write something.
For the tutorial, I'll just use a different file, items-humor.json.
Didn't we have a PR adding support for -O?
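
The separate-files workaround mentioned above would look roughly like this (file names are illustrative, following the items-humor.json idea):

    scrapy crawl quotes -o items.json
    scrapy crawl quotes -o items-humor.json -a tag=humor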

tag = getattr(self, 'tag', None)
if tag is not None:
    url = url + 'tag/' + tag
yield scrapy.Request(url)

@kmike (Member) · Sep 20, 2016

my vote is to always pass a callback, even if it is self.parse: yield scrapy.Request(url, self.parse).

@eliasdorneles (Member, Author) commented Sep 20, 2016

@kmike I think I've addressed all your comments as well. :)

Btw, thanks for reviewing, you all! 🙇 🙇

@eliasdorneles (Member, Author) commented Sep 21, 2016

After chatting about it w/ @kmike, I changed the recommendation on how to learn Python.

For people experienced w/ other languages, the tutorial recommends Dive into Python 3, with the official Python tutorial as an alternative.

For people learning to program, it recommends Learn Python the Hard Way, and alternatively the list of resources for non-programmers from Python's wiki.

]
* :meth:`~scrapy.spiders.Spider.parse`: a method that will be called to handle
the response downloaded for each of the requests made. The response parameter
is an instance of :class:`~scrapy.http.Response` that holds the page content and

@kmike (Member) · Sep 21, 2016

We discussed it before, but response parameter can be an instance of a Response subclass, not only a Response instance. Linking to Response docs can be confusing because Response class doesn't have methods we're using later (.css, .xpath). Even after #2275 this issue is not fixed because Response docs still don't mention new methods, and technically it is not correct that response is an instance of Response.


So we can select each ``<div class="quote">`` element belonging to the site's
list with this code::
$ scrapy shell http://quotes.toscrape.com

@kmike (Member) · Sep 21, 2016

url should be in quotes
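
The quoted form would be (a sketch; single quotes keep the shell from interpreting special characters such as & in URLs):

    $ scrapy shell 'http://quotes.toscrape.com'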

yield {
'text': quote.css('span.text::text').extract_first(),
'author': quote.css('span small::text').extract_first(),
'tags': quote.css("div.tags a.tag::text").extract(),

@kmike (Member) · Sep 21, 2016

A very minor nitpick: probably it is better to use single quotes here, for consistency with other selectors.

be relative) and yields a new request to the next page, registering itself as callback to handle the data extraction for the next page and to keep the crawling going through all the pages.

Now, after extracting the data, the ``parse()`` method looks for the link to
the next page, builds a full absolute URL using the ``response.urljoin`` method

@kmike (Member) · Sep 21, 2016

what do you think about adding a link to response.urljoin docs?

@kmike (Member) · Sep 21, 2016

though it is not that useful because of Response/TextResponse differences.. response.urljoin is better because it handles <base> html tag, but this is not explained in Response docs because it is a TextResponse feature
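
Concretely, the parse-and-follow pattern under discussion looks roughly like this (a sketch matching the tutorial's quotes example):

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            # response.urljoin() builds an absolute URL from a possibly
            # relative href (and honors any <base> tag on the page).
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)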

@@ -490,30 +551,109 @@ Another common pattern is to build an item with data from more than one page,
using a :ref:`trick to pass additional data to the callbacks
<topics-request-response-ref-request-callback-arguments>`.

Another example: scraping authors

@kmike (Member) · Sep 21, 2016

What do you think about moving the section above (a link to the trick to pass additional data to callbacks) below this example?

@kmike (Member) commented Sep 21, 2016

Besides a few minor remaining comments, the PR looks good. Great work @eliasdorneles and @stummjr!

@eliasdorneles (Member, Author) commented Sep 22, 2016

Thanks for reviewing @kmike, I've addressed your comments and made other small changes in the latest commit.
@redapple @kmike Does this look good to merge?

@kmike kmike changed the title Tutorial: rewrite tutorial seeking to improve learning path [MRG+1] Tutorial: rewrite tutorial seeking to improve learning path Sep 22, 2016
@redapple redapple changed the title [MRG+1] Tutorial: rewrite tutorial seeking to improve learning path [MRG+2] Tutorial: rewrite tutorial seeking to improve learning path Sep 22, 2016
@redapple redapple merged commit a975a50 into scrapy:master Sep 22, 2016
3 checks passed

codecov/patch: Coverage not affected when comparing 3fd947b...f4a2208
codecov/project: 83.20% (+0.00%) compared to 3fd947b
continuous-integration/travis-ci/pr: The Travis CI build passed
@redapple (Contributor) commented Sep 22, 2016

@eliasdorneles @stummjr, I believe the tutorial is now also compatible with the 1.1 branch, correct?
Maybe worth a backport?

@eliasdorneles (Member, Author) commented Sep 22, 2016

It is, but it's going to be some work -- I tried cherry-picking the first commit and got a bunch of conflicts.
I'll do it w/ more patience later today. 👍

@redapple (Contributor) commented Sep 22, 2016

@eliasdorneles, could you check #2281? I basically overwrote tutorial.rst with the content of this PR.
