Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+1] update Scrapy to use parsel 1.5 #3390

Merged
merged 18 commits into from Sep 18, 2018
Merged

[MRG+1] update Scrapy to use parsel 1.5 #3390

merged 18 commits into from Sep 18, 2018

Conversation

@kmike
Copy link
Member

@kmike kmike commented Aug 15, 2018

Fixes #3317.

Todo:

  • update Selector and SelectorList API documentation
  • revise Selectors tutorial: either bring more from the parsel's tutorial, or link to missing pieces (like has-class XPath extension); maybe rearrange some of the sections.
  • check .extract usages
  • find more places where .attrib makes code simpler
  • update after #3400.
  • fix "Removing namespaces" example - see scrapy/parsel#119
@codecov
Copy link

@codecov codecov bot commented Aug 21, 2018

Codecov Report

Merging #3390 into master will not change coverage.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master    #3390   +/-   ##
=======================================
  Coverage   84.49%   84.49%           
=======================================
  Files         167      167           
  Lines        9376     9376           
  Branches     1392     1392           
=======================================
  Hits         7922     7922           
  Misses       1199     1199           
  Partials      255      255
Impacted Files Coverage Δ
scrapy/linkextractors/sgml.py 96.8% <ø> (ø) ⬆️
scrapy/selector/unified.py 93.75% <ø> (ø) ⬆️
scrapy/loader/__init__.py 94.52% <100%> (ø) ⬆️
@kmike kmike changed the title [WIP] update docs to use parsel 1.5 [WIP] update Scrapy to use parsel 1.5 Aug 21, 2018
@kmike kmike force-pushed the parsel-1.5 branch from 05c3be4 to 133ca3a Aug 22, 2018
@kmike kmike requested a review from stummjr Aug 23, 2018
@kmike kmike force-pushed the parsel-1.5 branch from 133ca3a to 0794f5d Aug 23, 2018
@kmike kmike force-pushed the parsel-1.5 branch from 0794f5d to 7fdfdb7 Sep 12, 2018
… suggest get/getall
@kmike kmike changed the title [WIP] update Scrapy to use parsel 1.5 update Scrapy to use parsel 1.5 Sep 12, 2018
@kmike kmike added this to the v1.6 milestone Sep 12, 2018
@stummjr
Copy link
Member

@stummjr stummjr commented Sep 12, 2018

@kmike just a heads-up that I started reviewing this today.

@eliasdorneles
Copy link
Member

@eliasdorneles eliasdorneles commented Sep 14, 2018

I love this PR so much <3 =)

Copy link
Member

@stummjr stummjr left a comment

👏 wow, that's a huge revamp in the docs. Looks AWESOME to me! 👍

``None`` when it doesn't find any element matching the selection.
However, using ``.get()`` directly on a :class:`~scrapy.selector.SelectorList`
instance avoids an ``IndexError`` and returns ``None`` when it doesn't
find any element matching the selection.

There's a lesson here: for most scraping code, you want it to be resilient to
errors due to things not being found on a page, so that even if some parts fail
to be scraped, you can at least get **some** data.

Besides the :meth:`~scrapy.selector.Selector.extract` and

This comment has been minimized.

@stummjr

stummjr Sep 17, 2018
Member

I think we need to replace the .extract here with .getall.

@@ -641,7 +649,7 @@ this time for scraping author information::

def parse_author(self, response):
def extract_with_css(query):
return response.css(query).extract_first().strip()
return response.css(query).get().strip()

This comment has been minimized.

@stummjr

stummjr Sep 17, 2018
Member

Maybe we should promote the usage of the default parameter here to make this example less error prone:

return response.css(query).get(default='').strip()

I know that this is somehow unrelated, but still related to best practices.

This comment has been minimized.

@kmike

kmike Sep 17, 2018
Author Member

TIL response.css('foo::text').get() returns None if foo is found, but there is no text inside.


For convenience, response objects expose a selector on `.selector` attribute,
it's totally OK to use this shortcut when possible::
For convenience, response objects expose a selector on `.selector` attribute.

This comment has been minimized.

@stummjr

stummjr Sep 17, 2018
Member

Is the .selector part supposed to be shown in monospaced font? I think we're missing an extra surrounding backtick in that case.

>>> Selector(text=body).xpath('//span/text()').extract()
[u'good']
>>> Selector(text=body).xpath('//span/text()').get()
'good'

Constructing from response::

This comment has been minimized.

@stummjr

stummjr Sep 17, 2018
Member

Maybe we should add a small observation here stating that this is usually not necessary, given that Response already contains a selector.

Even though it's explained later, I think we should be explicit that this could be a bad pattern.

Notice that CSS selectors can select text or attribute nodes using CSS3
pseudo-elements::

>>> selector.css('title::text').get()

This comment has been minimized.

@stummjr

stummjr Sep 17, 2018
Member

should it be response.css('title::text').get()?


.. _topics-selectors-css-extensions:

Extensions to CSS Selectors

This comment has been minimized.

@stummjr

stummjr Sep 17, 2018
Member

this is awesome! we were missing this section :)

attrs.append((
x.attrib['id'],
x.xpath("name/text()").extract(),
x.xpath("./type/text()").extract()))

This comment has been minimized.

@stummjr

stummjr Sep 17, 2018
Member

.getall() here too?

kmike added 2 commits Sep 18, 2018
* fixed several small issues
* re-written "Creating Selectors" section
* fixed remaining .extract usage in tests
also, move "Selecting attributes" reference closer to `a::atr(href)` example
@kmike
Copy link
Member Author

@kmike kmike commented Sep 18, 2018

Thanks for a careful review @stummjr! How does it look now?

Copy link
Member

@stummjr stummjr left a comment

Looks great to me! 👏

shortcuts. By using ``response.selector`` or one of these shortcuts
you can also ensure the response body is parsed only once.

But if required, it is possible to use ``Selector`` directly.

This comment has been minimized.

@stummjr

stummjr Sep 18, 2018
Member

I liked how you changed the order and rephrased this part! 🤘

@stummjr stummjr changed the title update Scrapy to use parsel 1.5 [MRG+1] update Scrapy to use parsel 1.5 Sep 18, 2018
@dangra dangra merged commit bafd174 into master Sep 18, 2018
3 checks passed
3 checks passed
codecov/patch 100% of diff hit (target 84.49%)
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@dangra dangra deleted the parsel-1.5 branch Sep 18, 2018
@kmike kmike added the docs label Dec 26, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

4 participants