[MRG+1] Migrating selectors to use parsel #1409
Conversation
}
_lxml_smart_strings = False
# supporting legacy _root argument
root = kwargs.get('_root', root)
eliasdorneles
Aug 3, 2015
Author
Member
I'm not sure if this is really needed, since the root argument seems to be only used here: https://github.com/scrapy/parsel/blob/master/parsel/unified.py#L96
Do you think we could get rid of this?
dangra
Aug 4, 2015
Member
I'd leave it and log a warning about change of parameter name. Same goes if we decide to promote self._root to self.root.
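A minimal sketch of that suggestion, assuming Scrapy's ScrapyDeprecationWarning and a keyword-based constructor (illustrative names, not the PR code):

import warnings
from scrapy.exceptions import ScrapyDeprecationWarning

class Selector:
    def __init__(self, response=None, text=None, type=None, root=None, **kwargs):
        # supporting legacy _root argument: warn and map it onto root
        if '_root' in kwargs:
            warnings.warn("Argument `_root` is deprecated, use `root` instead",
                          ScrapyDeprecationWarning, stacklevel=2)
            if root is None:
                root = kwargs.pop('_root')
            else:
                kwargs.pop('_root')  # the new argument wins over the legacy one
        self.root = root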
if text is not None:
    response = _response_from_text(text, st)

if response is not None:
    _root = LxmlDocument(response, self._parser)
    root = LxmlDocument(response, self._parser)
dangra
Aug 4, 2015
Member
I think we can drop LxmlDocument, it is a cache of parsed DOMs keyed by Response instance; it is rarely needed now that we have a shortcut to the selector from the response. The old way of instantiating selectors manually has not been encouraged for a few versions.
eliasdorneles
Aug 4, 2015
Author
Member
I like this -- I'll be able to get rid of the _ctgroup import and code will be cleaner. :D
dangra
Aug 4, 2015
Member
👍
data = repr(self.extract()[:40])
return "<%s xpath=%r data=%s>" % (type(self).__name__, self._expr, data)
__repr__ = __str__
text = response.body_as_unicode() if response else None
dangra
Aug 4, 2015
Member
I would check for if response is not None
just in case.
eliasdorneles
Aug 4, 2015
Author
Member
hm, can response objects have non-truthy values?
dangra
Aug 4, 2015
Member
No, but evaluating the truth of some objects can become expensive if someone decides to pass a cleverish response. It happened before with selector checks calling an overridden __nonzero__.
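A tiny illustration of the failure mode dangra describes (ExpensiveResponse is a hypothetical class, not Scrapy code): a plain truthiness check invokes __bool__/__nonzero__, while an identity check against None does not.

class ExpensiveResponse:
    def __bool__(self):  # Python 3 spelling of __nonzero__
        # imagine an expensive computation here, e.g. parsing the whole body
        raise RuntimeError("truth test should not be reached")

response = ExpensiveResponse()
assert response is not None  # cheap identity check; __bool__ is never called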
def __init__(self, response=None, text=None, type=None, namespaces=None,
             _root=None, _expr=None):
    self.type = st = _st(response, type or self._default_type)
    self._parser = _ctgroup[st]['_parser']
dangra
Aug 4, 2015
Member
_ctgroup deserves a better (public) name
eliasdorneles
Aug 4, 2015
Author
Member
what about adding a property to parsel's Selector that returns the proper object according to self.type?

@property
def _parser(self):
    return _ctgroup[self.type]['_parser']
dangra
Aug 4, 2015
Member
actually, I think we shouldn't be setting _parser in this constructor. Instead call super().__init__(text=response.body_as_unicode(), type=st) and let the Parsel selector do the parsing based on type.
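Roughly, the constructor dangra describes could look like this (a sketch only; _st is the type-resolution helper seen in the diff above, stubbed here so the snippet stands alone):

from parsel import Selector as ParselSelector

def _st(response, st):
    # stand-in for the real helper, which infers the selector type from the response
    return st or 'html'

class Selector(ParselSelector):
    _default_type = None

    def __init__(self, response=None, text=None, type=None, **kwargs):
        st = _st(response, type or self._default_type)
        if response is not None:
            text = response.body_as_unicode()
        # no _parser / _ctgroup handling here: parsel parses text according to type
        super().__init__(text=text, type=st, **kwargs)
        self.response = response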
eliasdorneles
Aug 4, 2015
Author
Member
yeah, I was only setting it here because I needed it for LxmlDocument, thought we were keeping it.
I'll update the PR tomorrow.
dangra
Aug 4, 2015
Member
🆗
    TranslatorMixin,
    ScrapyGenericTranslator,
    ScrapyHTMLTranslator
)
dangra
Aug 4, 2015
Member
I would go for deprecation on all these classes, and instruct users to import from parsel directly.
eliasdorneles
Aug 4, 2015
Author
Member
hm, I suppose I should rename those in Parsel too: ParselGenericTranslator, etc.
This will need a new release of parsel, but I think it's worth it, right?
dangra
Aug 4, 2015
Member
no need to rename in parsel, you can do it like
scrapy/scrapy/settings/__init__.py, lines 177 to 197 in 311293f
dangra
Aug 4, 2015
Member
from parsel.csstranslator import ScrapyXPathExpr
ScrapyXPathExpr = create_deprecated_class("ScrapyXPathExpr", ScrapyXPathExpr, new_class_path='parsel.csstranslator.ScrapyXPathExpr')
eliasdorneles
Aug 4, 2015
Author
Member
I understand, but my question is more like: "should Parsel have classes with Scrapy on their names?"
dangra
Aug 4, 2015
Member
oh you're right! no, for sure, but having a Parsel prefix doesn't make sense either.
dangra
Aug 4, 2015
Member
to your original question, let's remove prefixes from parsel and do a new release
eliasdorneles
Aug 4, 2015
Author
Member
👍
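Following the create_deprecated_class pattern dangra points to, the Scrapy-side shims could look roughly like this once the prefixes are dropped in parsel (a sketch, not the merged code; GenericTranslator and HTMLTranslator are the assumed post-rename parsel names):

from parsel.csstranslator import GenericTranslator, HTMLTranslator
from scrapy.utils.deprecate import create_deprecated_class

# keep the old Scrapy names importable, but warn and point users at parsel
ScrapyGenericTranslator = create_deprecated_class(
    'ScrapyGenericTranslator', GenericTranslator,
    new_class_path='parsel.csstranslator.GenericTranslator')
ScrapyHTMLTranslator = create_deprecated_class(
    'ScrapyHTMLTranslator', HTMLTranslator,
    new_class_path='parsel.csstranslator.HTMLTranslator')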
def test_deprecated_root_argument(self, warnings):
    root = etree.fromstring(u'<html/>')
    sel = self.sscls(_root=root)
    self.assertEqual(root, sel._root)
dangra
Aug 5, 2015
Member
I think it should be self.assertIs()
Without the LxmlDocument cache some parts of Scrapy can become less efficient. For example, LxmlLinkExtractor instantiates Selector(response), so for users who use both link extractors and selectors the response will be parsed twice. Maybe there are other cases like this (maybe in users' components), I'm not sure. If we proceed with the cache removal I think we should at least fix Scrapy's built-in components.
do you mean replacing calls like
yep, that's it
should I make that part of this PR?
for the code that forces the selector
@eliasdorneles yes, I think so, if it passes tests with that simple change.
@dangra talking about tests, any idea why I'm getting failures on
no, but it started to happen on a previous but very unlikely commit build https://travis-ci.org/scrapy/scrapy/jobs/74296461
It doesn't make sense to me
@dangra it's definitely weird, but I tested locally pinning the Twisted version to 15.2.1 and it passed.
On Travis it is failing with AssertionError, but we're expecting ValueError in the test. I have no idea how it can depend on installed requirements. For me this test passes locally with Twisted 15.3.0.
@kmike it fails if you run the whole test suite but passes if you run just that test file
okay, this is weird. If I run (with the latest Twisted) just that test with: it passes, but if I run the whole suite with: it fails. Ideas?
running all tests with twisted 15.2.0 passes for me
yeah, the same with 15.2.1 for me.
It fails for 15.3.0 using
I think it can be related to twisted/twisted@a8d8a0c - see this line.
@kmike that's for sure.
@dangra @kmike so, I've made the changes to use
The only place still using LxmlDocument is here: https://github.com/scrapy/scrapy/blob/master/scrapy/http/request/form.py#L60
I get a bunch of test failures if I try to replace that for
@elias replace it with the corresponding direct call to the lxml parser, following re-encoding best practices from parsel (i.e. body_as_unicode().encode('utf8'))
@dangra okay, finally got rid of it. :)
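What that change could look like in form.py, sketched under the assumption that lxml.html.fromstring is called directly (the helper name _build_form_root is hypothetical):

import lxml.html

def _build_form_root(response):
    # parse the body directly with lxml instead of going through LxmlDocument,
    # re-encoding the unicode body to UTF-8 as parsel recommends
    body = response.body_as_unicode().encode('utf8')
    return lxml.html.fromstring(body, base_url=response.url)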
@@ -54,10 +55,15 @@ def _urlencode(seq, enc):
    return urlencode(values, doseq=1)


def _create_parser_from_response(response, parser_cls):
kmike
Aug 7, 2015
Member
+1 to expose this function in parsel instead of duplicating it
dangra
Aug 7, 2015
Member
@kmike what do you think about promoting Selector._root to Selector.root in the parsel library and using an extended-html parser for lxml.html.HTMLParser?
kmike
Aug 7, 2015
Member
Sorry, I feel dumb - could you please explain what extended-html is and why it is different from html?
Promoting _root to root sounds good, I had to use _root more than once.
eliasdorneles
Aug 7, 2015
Author
Member
@kmike in practice, it would mean using lxml.html.HTMLParser instead of lxml.etree.HTMLParser (which is extended by the former).
I'll promote _root in parsel, and add a deprecated attribute for _root in Scrapy's subclass.
dangra
Aug 7, 2015
Member
Sorry, I feel dumb - could you please explain what extended-html is and why it is different from html?
FormRequest uses lxml.html.HTMLParser, which is different from type=html in the Selector class, which uses lxml.etree.HTMLParser.
The former provides extra methods to query HTML in a Pythonic way, but it is slower than the latter.
kmike
Aug 7, 2015
Member
why can't we use lxml.html.HTMLParser by default?
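A small illustration of the difference dangra describes (runnable as-is with lxml installed): both parsers accept the same markup, but lxml.html yields HtmlElement objects with extra HTML-specific helpers.

import lxml.etree
import lxml.html

markup = b'<html><body><form action="/post"></form></body></html>'

etree_root = lxml.etree.fromstring(markup, lxml.etree.HTMLParser())
html_root = lxml.html.fromstring(markup)

print(type(etree_root))  # <class 'lxml.etree._Element'>
print(type(html_root))   # <class 'lxml.html.HtmlElement'>
print(html_root.forms)   # form elements exposed directly -- only on lxml.html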
if text is not None:
    response = _response_from_text(text, st)

if response is not None:
    _root = LxmlDocument(response, self._parser)
    text = response.body_as_unicode()
dangra
Aug 10, 2015
Member
now that parsel can handle base_url, let's be fully backward compatible: kwargs.setdefault('base_url', response.url)
eliasdorneles
Aug 10, 2015
Author
Member
nice! 👍
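Continuing the earlier constructor sketch (still illustrative, not the merged code), the backward-compatible base_url handling would slot in before the super() call:

from parsel import Selector as ParselSelector

class Selector(ParselSelector):
    def __init__(self, response=None, text=None, type=None, **kwargs):
        if response is not None:
            text = response.body_as_unicode()
            # default the base URL to the response URL, as suggested above
            kwargs.setdefault('base_url', response.url)
        super().__init__(text=text, type=type, **kwargs)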
…n warning for _root argument
…or modules, fix test
6883af1 to e50610b
@dangra updated and rebased on top of current master :)
great work @eliasdorneles!
@kmike feel free to merge when you are happy
@deprecated(use_instead='.xpath()')
def select(self, xpath):
    return self.xpath(xpath)
kmike
Aug 11, 2015
Member
I think it is fine to remove these deprecated methods, but we should not forget to mention it in the release notes.
dangra
Aug 11, 2015
Member
I prefer to do that in another PR, this one is about migrating to use parsel lib while retaining current functionality
kmike
Aug 11, 2015
Member
Do you mean we should add backwards compatibility shims for these methods in this PR?
eliasdorneles
Aug 11, 2015
Author
Member
woo, yeah, we're missing the shims for these methods! :O
dangra
Aug 11, 2015
Member
Yes, I think so. We are keeping them for the Selector class.
It actually means that SelectorList must be a class attribute of the Selector class.
eliasdorneles
Aug 11, 2015
Author
Member
right, so it can be defined by subclasses.
I'll change this in parsel, do a release and update this PR.
eliasdorneles
Aug 11, 2015
Author
Member
alright, shim added!
we're getting closer, guys! :D
I prefer extending from Selector.selectorlist_cls instead, but that's just my preference.
No need to import SelectorList from parsel, and you can be sure it will use the same class used by Selector.
right, one less dependency, will update in a jiffy
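For reference, the pattern settled on here could look roughly like this (a sketch of the idea, not the exact PR code): parsel's Selector exposes its list class via the selectorlist_cls class attribute, so Scrapy can subclass both without importing SelectorList from parsel.

from parsel import Selector as ParselSelector

class SelectorList(ParselSelector.selectorlist_cls):
    # Scrapy-specific backwards-compatibility shims (e.g. the deprecated
    # select() method) would live here.
    pass

class Selector(ParselSelector):
    selectorlist_cls = SelectorList  # subclasses override the class attribute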
Great job @eliasdorneles! Also, thanks @umrashrf for getting the ball rolling.
Hey, folks!
So, here is my first stab at porting selectors to use Parsel.
I'm not very happy about the initialization of the self._parser attribute, which is used to build the LxmlDocument instance for the response; that forced me to import _ctgroup from parsel. I'd love to hear any suggestions about how to handle that.
What do you think?
Thank you!