[MRG+1] Migrating selectors to use parsel #1409
Conversation
}
_lxml_smart_strings = False
# supporting legacy _root argument
root = kwargs.get('_root', root)
I'm not sure if this is really needed, since the root argument seems to be only used here: https://github.com/scrapy/parsel/blob/master/parsel/unified.py#L96
Do you think we could get rid of this?
I'd leave it and log a warning about the change of parameter name. The same goes if we decide to promote self._root to self.root.
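A minimal sketch of the kind of shim being discussed, assuming the keyword is renamed from `_root` to `root` (simplified; not the actual parsel code):

```python
import warnings


class Selector(object):
    """Hypothetical shim: accept the legacy ``_root`` keyword while
    warning that it was renamed to ``root``."""

    def __init__(self, root=None, **kwargs):
        if '_root' in kwargs:
            warnings.warn(
                "The '_root' argument is deprecated, use 'root' instead",
                DeprecationWarning, stacklevel=2)
            # the legacy value only wins when 'root' was not passed
            legacy = kwargs.pop('_root')
            root = legacy if root is None else root
        self.root = root
```

Old callers keep working (with a warning) while new code uses the public name.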
Without the LxmlDocument cache, some parts of Scrapy can become less efficient. For example, LxmlLinkExtractor instantiates Selector(response), so for users who use both link extractors and selectors the response will be parsed twice. Maybe there are other cases like this (maybe in users' components), I'm not sure. If we proceed with the cache removal, I think we should at least fix Scrapy's built-in components.
do you mean replacing calls like
yep, that's it
should I make that part of this PR?
for the code that forces the selector
@eliasdorneles yes, I think so, if it passes the tests with that simple change.
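The caching idea above could look roughly like this simplified, hypothetical sketch, where the response memoizes its Selector so the body is parsed only once (class names mirror Scrapy's; the internals here are invented):

```python
class Selector(object):
    """Stand-in for the real selector; counts parses so the caching
    behaviour is observable."""
    parse_count = 0

    def __init__(self, text):
        Selector.parse_count += 1  # stands in for an expensive lxml parse
        self.text = text


class TextResponse(object):
    """Sketch of caching the Selector on the response, so link
    extractors and user code share one parsed tree instead of
    re-parsing the body the way the LxmlDocument cache avoided."""

    def __init__(self, body):
        self.body = body
        self._cached_selector = None

    @property
    def selector(self):
        if self._cached_selector is None:
            self._cached_selector = Selector(self.body)
        return self._cached_selector
```

Components then call `response.selector` instead of constructing `Selector(response)` themselves.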
@dangra talking about tests, any idea why I'm getting failures on
no, but it started to happen on the build of a previous commit that is very unlikely to be the cause: https://travis-ci.org/scrapy/scrapy/jobs/74296461
It doesn't make sense to me
@dangra it's definitely weird, but I tested locally pinning the Twisted version to 15.2.1 and it passed.
On Travis it is failing with AssertionError, but we're expecting ValueError in the test. I have no idea how it can depend on the installed requirements. For me this test passes locally with Twisted 15.3.0.
@kmike it fails if you run the whole test suite, but passes if you run just that test file
okay, this is weird. If I run just that test on its own (with the latest Twisted) it passes, but if I run the whole suite it fails. Ideas?
running all tests with Twisted 15.2.0 passes for me
yeah, the same with 15.2.1 for me.
It fails for 15.3.0 using
I think it can be related to twisted/twisted@a8d8a0c - see this line.
@kmike that's for sure.
@dangra @kmike so, I've made the changes. The only place still using LxmlDocument is here: https://github.com/scrapy/scrapy/blob/master/scrapy/http/request/form.py#L60 and I get a bunch of test failures if I try to replace that one.
@elias replace it by the corresponding direct call to the lxml parser, following the re-encoding best practices from parsel (i.e.: body_as_unicode().encode('utf8'))
@dangra okay, finally got rid of it. :) 🎉
@@ -54,10 +55,15 @@ def _urlencode(seq, enc):
    return urlencode(values, doseq=1)


def _create_parser_from_response(response, parser_cls):
+1 to expose this function in parsel instead of duplicating it
@kmike what do you think about promoting Selector._root to Selector.root in the parsel library, and using an extended-html parser, lxml.html.HTMLParser?
Sorry, I feel dumb - could you please explain what extended-html is and why it is different from html? Promoting _root to root sounds good; I had to use _root more than once.
@kmike in practice, it would mean using lxml.html.HTMLParser instead of lxml.etree.HTMLParser (which is extended by the former).
I'll promote _root in parsel, and add a deprecated attribute for _root in Scrapy's subclass.
Sorry, I feel dumb - could you please explain what is extended-html and why is it different from html?
FormRequest uses lxml.html.HTMLParser, which is different from type=html in the Selector class, which uses lxml.etree.HTMLParser.
The former provides extra methods to query HTML in a Pythonic way, but it is slower than the latter.
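A small demonstration of the difference between the two parsers (assumes lxml is installed; this is illustrative, not Scrapy code):

```python
import lxml.etree
import lxml.html

html = b'<html><body><form action="/go"><input name="q"/></form></body></html>'

# lxml.etree.HTMLParser yields plain _Element nodes: fast, but with
# no HTML-specific conveniences.
etree_root = lxml.etree.fromstring(html, parser=lxml.etree.HTMLParser())

# lxml.html.HTMLParser yields HtmlElement nodes with extras such as
# .forms, which FormRequest relies on to fill in form fields.
html_root = lxml.html.fromstring(html, parser=lxml.html.HTMLParser())
```

The etree root has no `.forms`, while the html root exposes the form and its action directly.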
why can't we use lxml.html.HTMLParser by default?
Back in the future #471 (comment)
def test_badly_encoded_body(self):
    # \xe9 alone isn't a valid utf8 sequence
    r1 = TextResponse('http://www.example.com', \
-        body='<html><p>an Jos\xe9 de</p><html>', \
+        body=u'<html><p>an Jos\xe9 de</p><html>', \
the body must be bytes for this test case, it is a badly encoded utf8 sequence.
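A quick sketch of why the body must stay bytes here: the lone 0xE9 byte cannot be decoded as UTF-8, which is exactly the error path the test exercises; a unicode body would be re-encoded cleanly and never hit it.

```python
# b'\xe9' on its own is not valid UTF-8: a byte starting 0xE9 would
# need two continuation bytes after it, so decoding must fail.
bad_body = b'<html><p>an Jos\xe9 de</p></html>'


def decodes_as_utf8(data):
    """Return whether the byte string decodes cleanly as UTF-8."""
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False
```

The same byte decodes fine as latin-1, which is why encoding-detection fallbacks matter for responses like this.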
…n warning for _root argument
…or modules, fix test
force-pushed from 6883af1 to e50610b
@dangra updated and rebased on top of current master :)
great work @eliasdorneles !
@kmike feel free to merge when you are happy
@deprecated(use_instead='.xpath()')
def select(self, xpath):
    return self.xpath(xpath)
I think it is fine to remove these deprecated methods, but we should not forget to mention it in the release notes.
I prefer to do that in another PR; this one is about migrating to the parsel lib while retaining current functionality
Do you mean we should add backwards compatibility shims for these methods in this PR?
woo, yeah, we're missing the shims for these methods! :O
Yes, I think so. We are keeping them for the Selector class.
It actually means that SelectorList must be a class attribute of the Selector class.
right, so it can be defined by subclasses.
I'll change this in parsel, do a release and update this PR.
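The class-attribute pattern being discussed might be sketched like this (names simplified and the query logic invented; parsel's real attribute for this is `selectorlist_cls`):

```python
class SelectorList(list):
    """Simplified stand-in for parsel's SelectorList."""


class Selector(object):
    # The list type is a class attribute so subclasses can override it.
    selectorlist_cls = SelectorList

    def __init__(self, values):
        self._values = values

    def xpath(self, query):
        # Real code would evaluate the XPath query; here we just wrap
        # the stored values in the configured list type.
        return self.selectorlist_cls(self._values)


class LegacySelectorList(SelectorList):
    def select(self, xpath):
        # deprecated shim kept for backward compatibility (simplified)
        return self


class ScrapySelector(Selector):
    selectorlist_cls = LegacySelectorList
```

The subclass gets its deprecated `.select()` shim on results without parsel knowing anything about it.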
Related scrapy/parsel#11
alright, shim added!
we're getting closer, guys! :D
Great job @eliasdorneles! Also, thanks @umrashrf for getting the ball rolling.
Hey, folks!
So, here is my first stab at porting selectors to use Parsel.
I'm not very happy about the initialization of the self._parser attribute, which is used to build the LxmlDocument instance for the response; that forced me to import _ctgroup from parsel. I'd love to hear any suggestions about how to handle that.
What do you think?
Thank you!