
[MRG+1] update Scrapy to use parsel 1.5 #3390

Merged: 18 commits into master from the parsel-1.5 branch, Sep 18, 2018

Conversation

kmike (Member) commented Aug 15, 2018

Fixes #3317.

Todo:

codecov bot commented Aug 21, 2018

Codecov Report

Merging #3390 into master will not change coverage.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master    #3390   +/-   ##
=======================================
  Coverage   84.49%   84.49%           
=======================================
  Files         167      167           
  Lines        9376     9376           
  Branches     1392     1392           
=======================================
  Hits         7922     7922           
  Misses       1199     1199           
  Partials      255      255
Impacted Files                   Coverage Δ
scrapy/linkextractors/sgml.py    96.8%  <ø>    (ø) ⬆️
scrapy/selector/unified.py       93.75% <ø>    (ø) ⬆️
scrapy/loader/__init__.py        94.52% <100%> (ø) ⬆️

@kmike kmike changed the title [WIP] update docs to use parsel 1.5 [WIP] update Scrapy to use parsel 1.5 Aug 21, 2018
@kmike kmike requested a review from stummjr August 23, 2018 14:17
@kmike kmike changed the title [WIP] update Scrapy to use parsel 1.5 update Scrapy to use parsel 1.5 Sep 12, 2018
@kmike kmike added this to the v1.6 milestone Sep 12, 2018
stummjr (Member) commented Sep 12, 2018

@kmike just a heads-up that I started reviewing this today.

eliasdorneles (Member) commented:
I love this PR so much <3 =)

stummjr (Member) left a comment:

👏 wow, that's a huge revamp in the docs. Looks AWESOME to me! 👍

``None`` when it doesn't find any element matching the selection.
However, using ``.get()`` directly on a :class:`~scrapy.selector.SelectorList`
instance avoids an ``IndexError`` and returns ``None`` when it doesn't
find any element matching the selection.

There's a lesson here: for most scraping code, you want it to be resilient to
errors due to things not being found on a page, so that even if some parts fail
to be scraped, you can at least get **some** data.

Besides the :meth:`~scrapy.selector.Selector.extract` and

Comment (Member):
I think we need to replace the .extract here with .getall.

@@ -641,7 +649,7 @@ this time for scraping author information::

def parse_author(self, response):
def extract_with_css(query):
return response.css(query).extract_first().strip()
return response.css(query).get().strip()

Comment (Member):
Maybe we should promote the usage of the default parameter here to make this example less error prone:

return response.css(query).get(default='').strip()

I know that this is somehow unrelated, but still related to best practices.


kmike (Author) replied:
TIL response.css('foo::text').get() returns None if foo is found, but there is no text inside.


For convenience, response objects expose a selector on `.selector` attribute,
it's totally OK to use this shortcut when possible::
For convenience, response objects expose a selector on `.selector` attribute.

Comment (Member):
Is the .selector part supposed to be shown in monospaced font? I think we're missing an extra surrounding backtick in that case.

>>> Selector(text=body).xpath('//span/text()').extract()
[u'good']
>>> Selector(text=body).xpath('//span/text()').get()
'good'

Constructing from response::

Comment (Member):
Maybe we should add a small observation here stating that this is usually not necessary, given that Response already contains a selector.

Even though it's explained later, I think we should be explicit that this could be a bad pattern.

Notice that CSS selectors can select text or attribute nodes using CSS3
pseudo-elements::

>>> selector.css('title::text').get()

Comment (Member):
should it be response.css('title::text').get()?


.. _topics-selectors-css-extensions:

Extensions to CSS Selectors

Comment (Member):
this is awesome! we were missing this section :)

attrs.append((
x.attrib['id'],
x.xpath("name/text()").extract(),
x.xpath("./type/text()").extract()))

Comment (Member):
.getall() here too?

* fixed several small issues
* re-written "Creating Selectors" section
* fixed remaining .extract usage in tests
also, move the "Selecting attributes" reference closer to the `a::attr(href)` example
kmike (Author) commented Sep 18, 2018

Thanks for a careful review @stummjr! How does it look now?

stummjr (Member) left a comment:

Looks great to me! 👏

shortcuts. By using ``response.selector`` or one of these shortcuts
you can also ensure the response body is parsed only once.

But if required, it is possible to use ``Selector`` directly.

Comment (Member):
I liked how you changed the order and rephrased this part! 🤘

@stummjr stummjr changed the title update Scrapy to use parsel 1.5 [MRG+1] update Scrapy to use parsel 1.5 Sep 18, 2018
@dangra dangra merged commit bafd174 into master Sep 18, 2018
@dangra dangra deleted the parsel-1.5 branch September 18, 2018 16:07
@kmike kmike added the docs label Dec 26, 2018
Successfully merging this pull request may close these issues.

Update Scrapy tutorial and other docs to use parsel 1.5
4 participants