[MRG+1] update Scrapy to use parsel 1.5 #3390
Codecov Report
@@ Coverage Diff @@
## master #3390 +/- ##
=======================================
Coverage 84.49% 84.49%
=======================================
Files 167 167
Lines 9376 9376
Branches 1392 1392
=======================================
Hits 7922 7922
Misses 1199 1199
Partials 255 255
* remove unused link
* fix ReST syntax
* fix a link to regular expression docs
This change is backwards incompatible if ItemLoader is used with a custom Selector subclass which overrides .extract without overriding .getall.
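To see why overriding only `.extract` stops having an effect, here is a minimal sketch with stand-in classes. `BaseSelectorList`, `LegacySubclass`, `FixedSubclass`, and `load_values` are hypothetical names, not Scrapy or parsel code; the sketch only assumes, as the commit message implies, that the caller now uses `.getall()` where it used to use `.extract()`.

```python
class BaseSelectorList:
    """Stand-in for a selector list; NOT parsel's real implementation."""

    def __init__(self, values):
        self._values = list(values)

    def getall(self):
        # Return every matched value as a list.
        return list(self._values)

    # Old name kept as an alias of the new one (assumed for this sketch).
    extract = getall


class LegacySubclass(BaseSelectorList):
    """Customizes only the old method name."""

    def extract(self):
        return [v.upper() for v in self._values]


class FixedSubclass(LegacySubclass):
    """Overrides .getall too, so both names agree."""

    def getall(self):
        return self.extract()


def load_values(selector_list):
    # Hypothetical caller that, like the updated ItemLoader is assumed
    # to do, calls .getall() instead of .extract().
    return selector_list.getall()


legacy_result = load_values(LegacySubclass(["a", "b"]))
fixed_result = load_values(FixedSubclass(["a", "b"]))
print(legacy_result)  # the .extract override is bypassed
print(fixed_result)   # the override is honored again
```

In this sketch, the fix for an affected subclass is to override `.getall` as well (or instead), so that both method names return the customized result.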
Also, response.urljoin is added in a few places, for robustness.
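As a rough illustration of the robustness point: Scrapy's documentation describes `Response.urljoin` as a wrapper over the standard library's `urljoin`, resolving a possibly relative URL against the response's URL. The sketch below uses only `urllib.parse.urljoin`; the URLs are made-up examples, not taken from the tutorial.

```python
from urllib.parse import urljoin

page_url = "http://quotes.example.com/page/2/"  # made-up base URL

# Relative hrefs as they might appear in a page's links.
relative = urljoin(page_url, "../3/")
rooted = urljoin(page_url, "/author/jane-doe/")
# An already absolute URL passes through unchanged.
absolute = urljoin(page_url, "http://other.example.com/about")

print(relative)
print(rooted)
print(absolute)
```

Joining every extracted link this way means the spider works whether the page uses relative or absolute hrefs.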
… adjust wording in the introduction
… suggest get/getall
@kmike just a heads-up that I started reviewing this today.
I love this PR so much <3 =)

``None`` when it doesn't find any element matching the selection.
However, using ``.get()`` directly on a :class:`~scrapy.selector.SelectorList`
instance avoids an ``IndexError`` and returns ``None`` when it doesn't
find any element matching the selection.

There's a lesson here: for most scraping code, you want it to be resilient to
errors due to things not being found on a page, so that even if some parts fail
to be scraped, you can at least get **some** data.

Besides the :meth:`~scrapy.selector.Selector.extract` and
stummjr
Sep 17, 2018
Member
I think we need to replace the `.extract` here with `.getall`.
@@ -641,7 +649,7 @@ this time for scraping author information::

     def parse_author(self, response):
         def extract_with_css(query):
-            return response.css(query).extract_first().strip()
+            return response.css(query).get().strip()
stummjr
Sep 17, 2018
Member
Maybe we should promote the usage of the `default` parameter here to make this example less error prone:
return response.css(query).get(default='').strip()
I know that this is somewhat unrelated, but it is still about best practices.
kmike
Sep 17, 2018
Author
Member
TIL `response.css('foo::text').get()` returns None if foo is found, but there is no text inside.
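The failure mode discussed in this thread can be reproduced without Scrapy. The sketch below uses a plain list as a stand-in for the matched text nodes; `get_first` is a hypothetical helper that only mimics the behaviour described above for `.get(default=...)` (return the first match, or the default when nothing matched). It is not parsel's implementation.

```python
def get_first(matches, default=None):
    """Hypothetical stand-in for .get(): first match or `default`."""
    for value in matches:
        return value
    return default


no_text_nodes = []  # e.g. the element exists but contains no text

# Without a default, the result is None and .strip() fails.
try:
    get_first(no_text_nodes).strip()
    failed = False
except AttributeError:
    failed = True

# With default='', the same chain is safe.
safe_value = get_first(no_text_nodes, default="").strip()
found_value = get_first(["  Jane Doe \n"], default="").strip()

print(failed, repr(safe_value), repr(found_value))
```

Passing an empty-string default keeps the `.strip()` chain from raising when a page is missing the expected text, which matches the tutorial's advice about resilient scraping code.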

-For convenience, response objects expose a selector on `.selector` attribute,
-it's totally OK to use this shortcut when possible::
+For convenience, response objects expose a selector on `.selector` attribute.
stummjr
Sep 17, 2018
Member
Is the `.selector` part supposed to be shown in monospaced font? I think we're missing an extra surrounding backtick in that case.
-    >>> Selector(text=body).xpath('//span/text()').extract()
-    [u'good']
+    >>> Selector(text=body).xpath('//span/text()').get()
+    'good'

 Constructing from response::
stummjr
Sep 17, 2018
Member
Maybe we should add a small observation here stating that this is usually not necessary, given that `Response` already contains a `selector`.
Even though it's explained later, I think we should be explicit that this could be a bad pattern.
Notice that CSS selectors can select text or attribute nodes using CSS3
pseudo-elements::

    >>> selector.css('title::text').get()
stummjr
Sep 17, 2018
Member
Should it be `response.css('title::text').get()`?

.. _topics-selectors-css-extensions:

Extensions to CSS Selectors
stummjr
Sep 17, 2018
Member
this is awesome! we were missing this section :)
    attrs.append((
        x.attrib['id'],
        x.xpath("name/text()").extract(),
        x.xpath("./type/text()").extract()))
stummjr
Sep 17, 2018
Member
`.getall()` here too?
* fixed several small issues
* re-written "Creating Selectors" section
* fixed remaining .extract usage in tests
also, move "Selecting attributes" reference closer to `a::attr(href)` example
Thanks for a careful review @stummjr! How does it look now?
Looks great to me!
shortcuts. By using ``response.selector`` or one of these shortcuts
you can also ensure the response body is parsed only once.

But if required, it is possible to use ``Selector`` directly.
stummjr
Sep 18, 2018
Member
I liked how you changed the order and rephrased this part! 🤘
Fixes #3317.
Todo: