
[MRG+1] Added general guide for developer tools #3400

Merged
merged 5 commits into scrapy:master from testingcan:docs-developer-tools on Aug 23, 2018

Conversation

testingcan (Contributor) opened the pull request:

See #3373 and #3372.

Reworked the Firefox and Firebug sections in favor of a general guide on how to use your browser's Developer Tools for scraping. The guide includes a basic overview of the Inspector and Network tools. To avoid overlap between the general tutorial and this section, I tried to be brief on general Scrapy terms and logic.

and so on. We'll ignore the other tabs and click directly on ``Response``.

What you should see in the ``Preview``-pane is the rendered HTML code,
which is exactly what we saw when we called ``view(response`` in the
Member commented:

view(response)

(i.e., the quoted line above is missing its closing parenthesis)


==========================
Using your browser's Developer Tools for scraping
==========================
kmike (Member) commented on Aug 22, 2018:

Travis fails with "Title overline too short." - would you mind increasing the length of both ===='s to cover the whole text?

testingcan (Contributor, author) replied:

Sure thing, I increased the length.
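For reference, the corrected heading presumably ended up looking like this (overline and underline at least as long as the title text; the exact width here is my assumption)::

    ==================================================
    Using your browser's Developer Tools for scraping
    ==================================================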

kmike requested a review from stummjr on August 22, 2018 at 14:55
kmike (Member) commented on Aug 22, 2018:

Wow, thanks @testingcan! I like the new tutorial a lot, it looks awesome.

codecov bot commented on Aug 22, 2018:

Codecov Report

Merging #3400 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #3400   +/-   ##
=======================================
  Coverage   84.46%   84.46%           
=======================================
  Files         167      167           
  Lines        9362     9362           
  Branches     1390     1390           
=======================================
  Hits         7908     7908           
  Misses       1199     1199           
  Partials      255      255

stummjr (Member) left a review:

👏 Great work @testingcan! And thanks for this awesome material in a section where a revamp had been due for a long time :)

I left some comments, but feel free to discuss in case you disagree or have further improvements in mind.

This lets us operate on the JSON-object like on a Python dictionary.
We iterate through the ``quotes`` and print out the ``quote["text"]``.
If the handy ``has_next``-element is ``true`` (try loading
`http://quotes.toscrape.com/api/quotes?page=10`_ in your browser or a
stummjr (Member) commented:
this is raising a build error:

scrapy/docs/topics/developer-tools.rst:229: WARNING: Unknown target name: "http://quotes.toscrape.com/api/quotes?page=10".

I think it's just a matter of removing the surrounding backticks and the underscore after the URL.
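To make the fix concrete, a before/after sketch (RST reference markup without a matching target, versus a bare URL, which Sphinx links automatically):

Before (raises the warning)::

    `http://quotes.toscrape.com/api/quotes?page=10`_

After::

    http://quotes.toscrape.com/api/quotes?page=10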

By far the most handy feature of the Developer Tools is the `Inspector`
feature, which allows you to inspect the underlying HTML code of
any webpage. To demonstrate the Inspector, let's take a
look at the `quotes.toscrape.com`_-site.
stummjr (Member) commented:
maybe: look at the `quotes.toscrape.com`_ website.

<div class="tags">(...)</div>
</div>

If you hover over the first ``div`` directly above the ``span``-tag highlighted
stummjr (Member) commented:
not sure if we need the hyphen after the tag names; may be better as: span tag, instead of span-tag


>>> scrapy shell "http://quotes.toscrape.com/"
(...)
>>> response.xpath('/html/body/div/div[2]/div[1]/div[1]/span[1]/text()').extract()
stummjr (Member) commented:
Instead of .extract(), we could use the new .getall() (alias) method, as I believe it's now the recommended way of getting the list of results.

testingcan (Contributor, author) replied:

Off topic, but since .getall() and .get() are now the recommended ways of getting results, should we replace all occurrences of .extract() and .extract_first() in the documentation?

stummjr (Member) replied on Aug 23, 2018:

Yeah, I think so. There's an issue regarding that, in case you want to tackle it in another PR: #3317

Member replied:
Right, this is almost done in #3390.
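As a small illustration of the equivalence, a sketch using parsel (the library behind Scrapy's selectors), assuming a version that ships the get()/getall() aliases:

from parsel import Selector

sel = Selector(text='<span class="text">hello</span>')
sel.xpath('//span/text()').extract()  # ['hello'], the older name
sel.xpath('//span/text()').getall()   # ['hello'], same result, the recommended alias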

see if we can refine our XPath a bit:

If we check the `Inspector` again we'll see that directly beneath our
expanded ``div``-tag we have eight identical ``div``-tags, each with the
stummjr (Member) commented:
aren't there another nine?

could go ahead and try out different XPaths directly, but instead
we'll check another quite useful command from the scrapy shell::

>>> scrapy shell "quotes.toscrape.com/scroll"
stummjr (Member) commented:
with >>>, it looks like scrapy shell should be invoked from a Python shell, so maybe change to $.

$ scrapy shell "http://quotes.toscrape.com/scroll"
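To spell out the prompt convention (a sketch: "$" marks the OS shell, ">>>" marks the interactive session that scrapy shell drops you into; the response.url value assumes no redirect):

$ scrapy shell "http://quotes.toscrape.com/scroll"
(...)
>>> response.url
'http://quotes.toscrape.com/scroll'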

def parse(self, response):
    data = json.loads(response.text)
    for quote in data["quotes"]:
        quote = quote["text"]
stummjr (Member) commented:
I think it's better to generate an item here, instead of printing the quote:

yield {'quote': quote['text']}
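For context, a minimal sketch of the whole callback with that suggestion applied (it matches the diff the author pushes later in this thread; the method lives on a Spider subclass):

import json

def parse(self, response):
    # Parse the JSON body and emit one item per quote instead of printing.
    data = json.loads(response.text)
    for quote in data["quotes"]:
        yield {"quote": quote["text"]}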


Therefore, you should keep in mind the following things:

* Disable Javascript while inspecting the DOM looking for XPaths to be
stummjr (Member) commented:
Maybe mention quickly that this can be done by clicking the cog icon in the developer tools?

copy XPaths to selected elements. Let's try it out: Right-click on the ``span``-
tag, select ``Copy > XPath`` and paste it in the scrapy shell like so::

>>> scrapy shell "http://quotes.toscrape.com/"
stummjr (Member) commented:

The same comment made further down in the doc about scrapy shell applies here.

see each quote. With this knowledge we can refine our XPath: Instead of a path
to follow, we'll simply select all ``span``-tags with the ``class="text"``::

>>> response.xpath('//span[@class="text"]/text()').extract()
stummjr (Member) commented:
Even though '//span[@class="text"]/text()' works for this website specifically, maybe we should promote the usage of the new has-class() XPath function here:

response.xpath('//span[has-class("text")]/text()').getall()

Comparing @class="text" can make your spider brittle, because if the website developer adds a new class to the quote elements (<span class="text odd"... >), the original selector wouldn't work anymore.
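A minimal sketch of the difference, using Scrapy's Selector and assuming a version that registers the has-class() XPath function:

from scrapy.selector import Selector

# A second class is added to the element, as in the example above.
sel = Selector(text='<span class="text odd">hi</span>')

sel.xpath('//span[@class="text"]/text()').getall()      # [], exact string comparison fails
sel.xpath('//span[has-class("text")]/text()').getall()  # ['hi']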

kmike added this to the v1.6 milestone on Aug 22, 2018
testingcan (Contributor, author) commented:
@kmike @stummjr Thanks for your input! I fixed the mistakes and expanded some sections. Let me know if there is anything else!

@@ -219,8 +238,7 @@ also request each page to get every quote on the site::

     def parse(self, response):
         data = json.loads(response.text)
         for quote in data["quotes"]:
-            quote = quote["text"]
-            print(quote)
+            yield {"quote": quote["text"]
stummjr (Member) commented:
missing a curly brace: yield {"quote": quote["text"]}

testingcan (Contributor, author) replied:
Whoops, sorry. Added it.

stummjr changed the title from "Added general guide for developer tools" to "[MRG+1] Added general guide for developer tools" on Aug 23, 2018
kmike (Member) commented on Aug 23, 2018:

@testingcan do you think we can remove firebug images from the repo?

testingcan (Contributor, author) replied:
@kmike right, as far as I can see they are not being used anywhere else in the docs, so we should be able to remove them without any issues.
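A hedged sketch of that cleanup (the paths and pattern are my assumption; the thread does not list the actual image files):

$ git rm docs/topics/firebug*.png   # hypothetical location of the Firebug screenshots
$ git commit -m "Remove unused Firebug images"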

kmike (Member) commented on Aug 23, 2018:

Thanks @testingcan for a much-needed docs update, and @stummjr for a review!

kmike merged commit e45ef7d into scrapy:master on Aug 23, 2018
testingcan deleted the docs-developer-tools branch on August 23, 2018 at 14:48
wRAR mentioned this pull request on Sep 10, 2018
kmike added the docs label on Dec 26, 2018