
[MRG+1] Added general guide for developer tools #3400

Merged: 5 commits merged into scrapy:master from testingcan:docs-developer-tools on Aug 23, 2018

Conversation

@testingcan (Contributor) commented Aug 22, 2018

See #3373 and #3372.

Reworked the Firefox and Firebug sections in favor of a general guide on how to use your browser's Developer Tools for scraping. The guide includes a basic overview of the Inspector and Network tools. To avoid overlap between the general tutorial and this section, I tried to keep the coverage of general Scrapy terms and logic brief.

and so on. We'll ignore the other tabs and click directly on ``Response``.

What you should see in the ``Preview``-pane is the rendered HTML-code,
that is exactly what we saw when we called ``view(response`` in the

@kmike (Member), Aug 22, 2018:

view(response)


==========================
Using your browser's Developer Tools for scraping
==========================

@kmike (Member), Aug 22, 2018:

Travis fails with "Title overline too short." - would you mind increasing the length of both ===='s to cover the whole text?

@testingcan (Author, Contributor), Aug 22, 2018:

Sure thing, I increased the length.
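For reference, the fix amounts to making both the overline and the underline at least as long as the title text, roughly:

    ==================================================
    Using your browser's Developer Tools for scraping
    ==================================================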

@kmike requested a review from @stummjr, Aug 22, 2018
@kmike (Member) commented Aug 22, 2018

Wow, thanks @testingcan! I like the new tutorial a lot, it looks awesome.

@codecov bot commented Aug 22, 2018

Codecov Report

Merging #3400 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #3400   +/-   ##
=======================================
  Coverage   84.46%   84.46%           
=======================================
  Files         167      167           
  Lines        9362     9362           
  Branches     1390     1390           
=======================================
  Hits         7908     7908           
  Misses       1199     1199           
  Partials      255      255
@stummjr (Member) left a comment

👏 Great work @testingcan, and thanks for this awesome material in a section where a revamp has been due for a long time :)

I left some comments, but feel free to discuss in case you disagree or have further improvements in mind.

This lets us operate on the JSON-object like on a Python dictionary.
We iterate through the ``quotes`` and print out the ``quote["text"]``.
If the handy ``has_next``-element is ``true`` (try loading
`http://quotes.toscrape.com/api/quotes?page=10`_ in your browser or a

@stummjr (Member), Aug 22, 2018:

this is raising a build error:

scrapy/docs/topics/developer-tools.rst:229: WARNING: Unknown target name: "http://quotes.toscrape.com/api/quotes?page=10".

I think it's just a matter of removing the surrounding backticks and the underscore after the URL.
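In other words, the suggested change turns the unresolved reST reference into a plain URL, roughly:

    before: `http://quotes.toscrape.com/api/quotes?page=10`_
    after:  http://quotes.toscrape.com/api/quotes?page=10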

By far the most handy feature of the Developer Tools is the `Inspector`
feature, which allows you to inspect the underlying HTML code of
any webpage. To demonstrate the Inspector, let's take a
look at the `quotes.toscrape.com`_-site.

@stummjr (Member), Aug 22, 2018:

maybe: look at the `quotes.toscrape.com`_ website.

<div class="tags">(...)</div>
</div>

If you hover over the first ``div`` directly above the ``span``-tag highlighted

@stummjr (Member), Aug 22, 2018:

not sure if we need the hyphen after the tag names, may be better as: span tag, instead of span-tag


>>> scrapy shell "http://quotes.toscrape.com/"
(...)
>>> response.xpath('/html/body/div/div[2]/div[1]/div[1]/span[1]/text()').extract()

@stummjr (Member), Aug 22, 2018:

Instead of .extract(), we could use the new .getall() (alias) method, as I believe it's now the recommended way of getting the list of results.

@testingcan (Author, Contributor), Aug 23, 2018:

Off topic, but since .getall() and .get() are now the recommended ways of getting results, should we replace all occurrences of .extract() and .extract_first() in the documentation?

@stummjr (Member), Aug 23, 2018:

Yeah, I think so. There's an issue regarding that, in case you want to tackle it in another PR: #3317

@kmike (Member), Aug 23, 2018:

Right, this is almost done in #3390.
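For anyone reading along: .getall() and .get() are aliases for .extract() and .extract_first(), so both spellings return the same results. A minimal standalone sketch (not part of this PR's diff) illustrating the equivalence:

    from scrapy.selector import Selector

    sel = Selector(text='<span class="text">"A quote."</span>')

    # .getall() is the newer alias; both return the same list of strings
    assert (sel.xpath('//span/text()').getall()
            == sel.xpath('//span/text()').extract())

    # .get() mirrors .extract_first(): the first match, or None if there is none
    assert (sel.xpath('//span/text()').get()
            == sel.xpath('//span/text()').extract_first())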

see if we can refine our XPath a bit:

If we check the `Inspector` again we'll see that directly beneath our
expanded ``div``-tag we have eight identical ``div``-tags, each with the

@stummjr (Member), Aug 22, 2018:

aren't there another nine?

could go ahead and try out different XPaths directly, but instead
we'll check another quite useful command from the scrapy shell::

>>> scrapy shell "quotes.toscrape.com/scroll"

@stummjr (Member), Aug 22, 2018:

with >>>, it looks like scrapy shell should be invoked from a Python shell, so maybe change to $.

$ scrapy shell "http://quotes.toscrape.com/scroll"
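To make the distinction explicit: scrapy shell is launched from the operating-system shell (hence the $ prompt), and the >>> prompt only appears once you are inside the resulting interactive session, e.g.:

    $ scrapy shell "http://quotes.toscrape.com/scroll"
    (...)
    >>> response.url
    'http://quotes.toscrape.com/scroll'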
def parse(self, response):
    data = json.loads(response.text)
    for quote in data["quotes"]:
        quote = quote["text"]

@stummjr (Member), Aug 22, 2018:

I think it's better to generate an item here, instead of printing the quote:

yield {'quote': quote['text']}
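Putting that suggestion together with the pagination described earlier, the callback might end up looking roughly like this. This is only a sketch: the spider name and start URL are placeholders, and it assumes the API also reports the current "page" number so the spider can request the next one.

    import json
    import scrapy

    class QuoteSpider(scrapy.Spider):
        # hypothetical spider name and start URL, for illustration only
        name = "quote"
        start_urls = ["http://quotes.toscrape.com/api/quotes?page=1"]

        def parse(self, response):
            data = json.loads(response.text)
            for quote in data["quotes"]:
                # yield an item instead of printing the quote
                yield {"quote": quote["text"]}
            # keep requesting pages while the API reports more results
            if data.get("has_next"):
                next_page = data["page"] + 1
                yield scrapy.Request(
                    "http://quotes.toscrape.com/api/quotes?page={}".format(next_page),
                    callback=self.parse,
                )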

Therefore, you should keep in mind the following things:

* Disable Javascript while inspecting the DOM looking for XPaths to be

@stummjr (Member), Aug 22, 2018:

Maybe mention quickly that this can be done by clicking the cog icon in the developer tools?

copy XPaths to selected elements. Let's try it out: Right-click on the ``span``-
tag, select ``Copy > XPath`` and paste it in the scrapy shell like so::

>>> scrapy shell "http://quotes.toscrape.com/"

@stummjr (Member), Aug 22, 2018:

same comment made further down in the doc about scrapy shell applies here.

see each quote. With this knowledge we can refine our XPath: Instead of a path
to follow, we'll simply select all ``span``-tags with the ``class="text"``::

>>> response.xpath('//span[@class="text"]/text()').extract()

@stummjr (Member), Aug 22, 2018:

Even though '//span[@class="text"]/text()' works for this website specifically, maybe we should promote the usage of the new has-class() XPath function here:

response.xpath('//span[has-class("text")]/text()').getall()

Comparing @class="text" can make your spider brittle, because if the website developer adds a new class to the quote elements (<span class="text odd"... >), the original selector wouldn't work anymore.
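A quick standalone illustration of that brittleness (assuming a Scrapy/parsel version recent enough to provide the has-class() XPath function):

    from scrapy.selector import Selector

    # imagine the site later adds a second class to the quote text element
    sel = Selector(text='<span class="text odd">"A quote."</span>')

    # exact-match comparison no longer finds the element
    print(sel.xpath('//span[@class="text"]/text()').getall())      # []

    # has-class() keeps matching as long as "text" is among the classes
    print(sel.xpath('//span[has-class("text")]/text()').getall())  # ['"A quote."']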

@kmike added this to the v1.6 milestone, Aug 22, 2018
@testingcan (Contributor, Author) commented Aug 23, 2018

@kmike @stummjr Thanks for your input! I fixed the mistakes and expanded some sections. Let me know if there is anything else!

@@ -219,8 +238,7 @@ also request each page to get every quote on the site::
     def parse(self, response):
         data = json.loads(response.text)
         for quote in data["quotes"]:
-            quote = quote["text"]
-            print(quote)
+            yield {"quote": quote["text"]

@stummjr (Member), Aug 23, 2018:

missing a curly brace: yield {"quote": quote["text"]}

@testingcan (Author, Contributor), Aug 23, 2018:

Whoops, sorry. Added it.

@stummjr changed the title from "Added general guide for developer tools" to "[MRG+1] Added general guide for developer tools", Aug 23, 2018
@kmike (Member) commented Aug 23, 2018

@testingcan do you think we can remove firebug images from the repo?

@testingcan (Contributor, Author) commented Aug 23, 2018

@kmike right, as far as I can see they are not being used anywhere else in the docs, so we should be able to remove them without any issues.

@kmike (Member) commented Aug 23, 2018

Thanks @testingcan for a much-needed docs update, and @stummjr for a review!

@kmike merged commit e45ef7d into scrapy:master on Aug 23, 2018
3 checks passed:

* codecov/patch: Coverage not affected when comparing c7654f7...79de3d5
* continuous-integration/appveyor/pr: AppVeyor build succeeded
* continuous-integration/travis-ci/pr: The Travis CI build passed
@testingcan deleted the testingcan:docs-developer-tools branch on Aug 23, 2018
@wRAR mentioned this pull request, Sep 10, 2018
@kmike added the docs label, Dec 26, 2018