
[MRG+1] some improvements to overview page #1106

Merged
7 commits merged on Mar 27, 2015

Conversation

@eliasdorneles (Member) commented Mar 25, 2015:

Hey folks!

Here is my proposal for addressing issue #609 (it replaces PR #1023, while keeping some of its ideas).

Since the overview is "the pitch", I tried my best to make it short and to the point.

Summary of the changes:

  • Added example spider showcasing both scraping and crawling (link following)
  • Wrote an explanation of what the code does, without delving much into details
  • Summarized the table of features at the end, and reordered them based on my gut feeling
  • Cut some text
  • Cut some more

Note: the example spider also showcases the features from PRs #1081 and #1086, which I assume will also be in the Scrapy 1.0 release.

So, what do you think, does this look good?

Thank you!

In the ``parse`` callback, we scrape the links to the questions and
yield a few more requests to be processed, registering for them
the method ``parse_question`` as the callback to be called when the
requests are complete.
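
For reference, this is roughly the ``parse`` callback that passage describes (a sketch assembled from the hunks quoted in this review; the class name and ``start_urls`` are assumed from the overview's StackOverflow example)::

    import scrapy

    class StackOverflowSpider(scrapy.Spider):
        name = 'stackoverflow'
        start_urls = ['http://stackoverflow.com/questions?sort=votes']

        def parse(self, response):
            for href in response.css('.question-summary h3 a::attr(href)'):
                full_url = response.urljoin(href.extract())
                # each yielded request is scheduled by Scrapy, and
                # parse_question is called when its response arrives
                yield scrapy.Request(full_url, callback=self.parse_question)

        def parse_question(self, response):
            yield {'title': response.css('h1 a::text').extract_first(),
                   'link': response.url}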

@kmike (Member), Mar 26, 2015:

I think here we should explain that these requests are scheduled and processed asynchronously.

@eliasdorneles (Author, Member), Mar 26, 2015:

Right, let me fix that!


You'll notice that all field values (except for the ``url`` which was assigned
directly) are actually lists. This is because the :ref:`selectors
<topics-selectors>` return lists. You may want to store single values, or

@kmike (Member), Mar 26, 2015:

I'm on the fence about using .extract_first() vs documenting lists. .extract_first() is not always what's needed.

@eliasdorneles (Author, Member), Mar 26, 2015:

Well, keeping in mind that this is the overview, Scrapy at a glance, we can leave that explanation for the tutorial.

@kmike (Member), Mar 26, 2015:

Removing .extract_first() can make examples shorter, so it can be good for an overview :) We can explain what to do next in the tutorial.

@eliasdorneles (Author, Member), Mar 26, 2015:

Yeah, I can work with that -- as long as it doesn't need much explanation.
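
For contrast, a minimal sketch of the two calls being weighed here (``.extract_first()`` is one of the new 1.0 APIs this example assumes)::

    from scrapy.selector import Selector

    sel = Selector(text='<div><p>one</p><p>two</p></div>')
    sel.css('p::text').extract()        # ['one', 'two'] -- always a list
    sel.css('p::text').extract_first()  # 'one' -- first match only, or None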

@@ -189,68 +100,37 @@ You've seen how to extract and store items from a website using Scrapy, but
this is just the surface. Scrapy provides a lot of powerful features for making
scraping easy and efficient, such as:

* Built-in support for :ref:`selecting and extracting <topics-selectors>` data
  from HTML and XML sources

@kmike (Member), Mar 26, 2015:

I think XPath and CSS selectors should be mentioned explicitly, not only as a part of scrapy shell.

.. highlight:: html
    def parse(self, response):
        for href in response.css('.question-summary h3 a::attr(href)'):
            full_url = response.urljoin(href.extract())

@kmike (Member), Mar 26, 2015:

href.extract() returns a list

@eliasdorneles (Author, Member), Mar 26, 2015:

We're iterating over a SelectorList here, but each href is an individual Selector, not a SelectorList, so here .extract() returns the text directly.
I've tested this example, btw, using monkey patches for the missing features.
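
A minimal sketch of the distinction being made here::

    from scrapy.selector import Selector

    sel = Selector(text='<a href="/a">A</a><a href="/b">B</a>')
    links = sel.css('a::attr(href)')   # a SelectorList
    links.extract()                    # ['/a', '/b'] -- a list, as kmike says
    for href in links:                 # but iterating yields single Selectors,
        print(href.extract())          # so this prints '/a', then '/b'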

Finally, the file size is contained in the second ``<p>`` tag inside the ``<div>``
tag with ``id=specifications``::
    def parse_question(self, response):
        title = response.css('h1 a::text').extract_first()

@kmike (Member), Mar 26, 2015:

CSS selectors are easier for newcomers to understand, but I think we should still add a note about what is happening here. We use some custom extensions which are not part of the CSS standard, so even people familiar with CSS selectors could benefit from a little explanation.

I'm also on the fence about creating variables vs returning {'title': response.css('.question ...'), ...} directly - it could make the example shorter.
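
The shorter variant being suggested might look something like this (a sketch using only selectors quoted elsewhere in this review; ``::text`` and ``::attr(...)`` are the Scrapy-specific CSS extensions in question)::

    def parse_question(self, response):
        yield {
            'title': response.css('h1 a::text').extract_first(),
            'tags': response.css('.question .post-tag::text').extract(),
            'link': response.url,
        }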



.. highlight:: none
When this finishes you will have in the ``top-stackoverflow-questions.json`` file
a list of the most upvoted questions in StackOverflow in JSON format, containing the

@kmike (Member), Mar 26, 2015:

What about showing a small example of the output? This will make the overview a bit larger, but examples help :)

@eliasdorneles (Author, Member), Mar 26, 2015:

Hmmm, I think I can work out something -- replacing the big fields so it won't turn out a mess.
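
Presumably something along these lines (a sketch based on the output fragment quoted further down in this review; the ``body`` and ``link`` values are abbreviated placeholders)::

    [{
        "body": "... abbreviated HTML of the question body ...",
        "link": "http://stackoverflow.com/questions/...",
        "tags": ["java", "c++", "performance", "optimization"],
        "title": "Why is processing a sorted array faster than an unsorted array?",
        "votes": "9924"
    },
    ...]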


For more information about XPath see the `XPath reference`_.
Here you notice one of the main advantages of Scrapy: requests are

@kmike (Member), Mar 26, 2015:

there was a single request in start_urls; it should be easier to see the advantage for the requests sent from the parse method

@eliasdorneles (Author, Member), Mar 26, 2015:

Yeah, I was in doubt about that one... Better move it one paragraph down.

scheduled and processed asynchronously. This means that Scrapy doesn't
need to wait for a request to be finished and processed, it can send
another request or do other things in the meantime, which results in much
faster crawls.

@kmike (Member), Mar 26, 2015:

There is a balance between speed and politeness: users can send multiple requests in parallel about as easily with e.g. concurrent.futures + requests, but that way politeness settings (download delay, concurrency) won't be respected. The advantage of Scrapy is that it provides helpers to maintain per-domain politeness settings (including an autothrottle extension) and still send async requests when possible (+ other benefits like automatic retries). This would all be much harder to implement with e.g. requests + a thread pool.

@eliasdorneles (Author, Member), Mar 26, 2015:

Agreed, lemme try to convey that.
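
The politeness knobs kmike mentions map to ordinary Scrapy settings; a minimal sketch of a project's ``settings.py`` (these four setting names are real, the values are illustrative)::

    DOWNLOAD_DELAY = 2                  # wait between requests to the same website
    CONCURRENT_REQUESTS_PER_DOMAIN = 8  # cap parallel requests per domain
    AUTOTHROTTLE_ENABLED = True         # autothrottle extension adapts the delay
    RETRY_ENABLED = True                # retry failed requests automatically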

@kmike (Member) commented Mar 26, 2015:

@eliasdorneles a good overview, I like it 👍

I'm trying to attack it from the position of a person who can hack together a spider using requests + concurrent.futures + pyquery + json. Please don't take it as criticism :) Why should such a person bother with Scrapy?

@eliasdorneles (Member, Author) commented Mar 26, 2015:

Don't worry @kmike, I appreciate your feedback a good deal, always good points! :)

@eliasdorneles (Member, Author) commented Mar 26, 2015:

Hey @kmike -- I've just updated the PR addressing your concerns and did some more editing.
Can you please have a look again?
Thank you!

provide any API or mechanism to access that info programmatically. Scrapy can
help you extract that information.
Once you're ready to dive in more, you can :ref:`follow the tutorial
and build a full-blown Scrapy project <intro-tutorial>`.

@kmike (Member), Mar 26, 2015:

I think this note can be moved to the end - it is unclear if users should continue reading the overview, or if they should go to the tutorial. It seems the reason you've put it here is that in addition to 'scrapy runspider' there is 'scrapy crawl' with full-blown project support, and you wanted to mention it. We can add project support to the list of Scrapy advantages - Scrapy helps to organize the code, so that projects with tens or hundreds of spiders are still manageable.


What's next?
============

- The next obvious steps are for you to `download Scrapy`_, read :ref:`the
+ The next obvious steps for you are to `download Scrapy`_, read :ref:`the

@kmike (Member), Mar 26, 2015:

I think that installation docs are more helpful than "download" docs.

* Strong :ref:`extensibility support <extending-scrapy>` and lots of built-in
  extensions and middlewares to handle things like cookies, crawl throttling,
  HTTP caching, HTTP compression, user-agent spoofing, robots.txt,
  stats collection and many more.

@kmike (Member), Mar 26, 2015:

I think we can extract some items from this list into separate items to make them more visible, e.g. crawl throttling, HTTP caching, and "Jobs: pausing and resuming crawls" may deserve their own *. It doesn't hurt to have a larger list of Scrapy advantages :) Throttling / caching / pausing are the things developers can benefit from immediately; they are nice to have out of the box.

            'body': response.css('.question .post-text').extract()[0],
            'tags': response.css('.question .post-tag::text').extract(),
            'link': response.url,
        }

@kmike (Member), Mar 26, 2015:

❤️

@eliasdorneles (Author, Member), Mar 26, 2015:

yielding dicts is so sweet!
I can barely wait for the release, want to use it NOW. :)

"tags": ["java", "c++", "performance", "optimization"],
"title": "Why is processing a sorted array faster than an unsorted array?",
"votes": "9924"
},

@kmike (Member), Mar 26, 2015:

indentation of open { brace doesn't match indentation of closing } brace - is it intentional?

@eliasdorneles (Author, Member), Mar 26, 2015:

oops, lemme fix that

@kmike (Member) commented Mar 26, 2015:

//cc @pablohoffman @shaneaevans @dangra and everyone else - thoughts? Use https://github.com/eliasdorneles/scrapy/blob/overview-page-improvements/docs/intro/overview.rst link to read it.

I think this introduction is nearly perfect :)
+1 to merge it once we have the required PRs merged.

@kmike changed the title from "some improvements to overview page" to "[MRG+1] some improvements to overview page" on Mar 26, 2015
@nyov (Contributor) commented Mar 27, 2015:

Everyone Else here. Looks good, I like it.

Well, meh, I actually clicked on that AAWS link, thinking I would get some info on how to extract API data. Amazon is big enough; maybe this could point somewhere else, like something on http://www.programmableweb.com/ ?

I would ask for a single, tiny, response.xpath query in the spider example, just to let old-timers know they aren't deprecated yet :)

Some minor things like misplaced or missing commas, but that can be ignored.
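
For instance, the kind of one-liner nyov asks for, a rough XPath equivalent of the ``response.css('h1 a::text')`` query from the example (a sketch, not part of the PR)::

    title = response.xpath('//h1//a/text()').extract_first()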

curita added a commit that referenced this pull request Mar 27, 2015
[MRG+1] some improvements to overview page
@curita merged commit f4e241a into scrapy:master on Mar 27, 2015
1 check passed: continuous-integration/travis-ci/pr (The Travis CI build passed)
@eliasdorneles (Member, Author) commented Mar 27, 2015:

Hey @nyov -- since this is already merged, please feel free to send a PR to fix those commas or whatever. =)
