
[MRG+1] Migrating selectors to use parsel #1409

Merged
merged 17 commits into scrapy:master from eliasdorneles:migrate-parsel on Aug 11, 2015

Conversation

@eliasdorneles
Member

@eliasdorneles eliasdorneles commented Aug 3, 2015

Hey, folks!

So, here is my first stab at porting selectors to use Parsel.

I'm not very happy about the initialization of the self._parser attribute, which is used to build the LxmlDocument instance for the response; that forced me to import _ctgroup from parsel.
I'd love to hear any suggestions about how to handle that.

What do you think?
Thank you!

}
_lxml_smart_strings = False
# supporting legacy _root argument
root = kwargs.get('_root', root)


@eliasdorneles

eliasdorneles Aug 3, 2015
Author Member

I'm not sure if this is really needed, since the root argument seems to be only used here: https://github.com/scrapy/parsel/blob/master/parsel/unified.py#L96

Do you think we could get rid of this?


@dangra

dangra Aug 4, 2015
Member

I'd leave it and log a warning about the change of parameter name. Same goes if we decide to promote self._root to self.root.
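
A sketch of that shim, assuming Scrapy's ScrapyDeprecationWarning is used; the message wording and kwargs handling here are illustrative:

import warnings

from scrapy.exceptions import ScrapyDeprecationWarning

def __init__(self, response=None, text=None, type=None, root=None, **kwargs):
    # keep accepting the legacy _root kwarg, but warn about the rename
    if '_root' in kwargs:
        warnings.warn(
            "The '_root' argument is deprecated, use 'root' instead",
            ScrapyDeprecationWarning, stacklevel=2)
        root = kwargs.pop('_root')
    # ... continue using root as before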


if text is not None:
    response = _response_from_text(text, st)

if response is not None:
    _root = LxmlDocument(response, self._parser)
    root = LxmlDocument(response, self._parser)


@dangra

dangra Aug 4, 2015
Member

I think we can drop LxmlDocument; it is a cache of parsed documents keyed by Response instance, and it is rarely needed now that we have a shortcut to the selector from the response. The old way of instantiating selectors manually has been discouraged for a few versions.


@eliasdorneles

eliasdorneles Aug 4, 2015
Author Member

I like this -- I'll be able to get rid of the _ctgroup import, and the code will be cleaner. :D


@dangra

dangra Aug 4, 2015
Member

👍

    data = repr(self.extract()[:40])
    return "<%s xpath=%r data=%s>" % (type(self).__name__, self._expr, data)
__repr__ = __str__

text = response.body_as_unicode() if response else None


@dangra

dangra Aug 4, 2015
Member

I would check with if response is not None, just in case.


@eliasdorneles

eliasdorneles Aug 4, 2015
Author Member

hm, can response objects have non-truthy values?


@dangra

dangra Aug 4, 2015
Member

No, but the truth value of some objects can become expensive if someone decides to pass a cleverish response. It happened before with selector checks calling an overridden __nonzero__.
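
To illustrate with a hypothetical subclass: if response: goes through __nonzero__ (__bool__ on Python 3), while the identity check never does:

from scrapy.http import TextResponse

class CleverResponse(TextResponse):
    # hypothetical: truth testing now does real work on every check
    def __nonzero__(self):
        return bool(self.body.strip())

response = CleverResponse(url='http://example.com', body=b'<html/>', encoding='utf-8')

if response is not None:   # never calls __nonzero__
    text = response.body_as_unicode()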


def __init__(self, response=None, text=None, type=None, namespaces=None,
             _root=None, _expr=None):
    self.type = st = _st(response, type or self._default_type)
    self._parser = _ctgroup[st]['_parser']


@dangra

dangra Aug 4, 2015
Member

_ctgroup deserves a better (public) name


@eliasdorneles

eliasdorneles Aug 4, 2015
Author Member

what about adding a property to parsel's Selector that returns the proper object according to self.type?

@property
def _parser(self):
    return _ctgroup[self.type]['_parser']


@dangra

dangra Aug 4, 2015
Member

actually, I think we shouldn't be setting _parser in this constructor. Instead, call super().__init__(text=response.body_as_unicode(), type=st) and let the parsel Selector do the parsing based on type.
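
A sketch of that delegation; _st is the existing type-resolution helper from this module, and the argument handling is simplified:

import parsel

class Selector(parsel.Selector):

    def __init__(self, response=None, text=None, type=None, **kwargs):
        st = _st(response, type or self._default_type)
        if response is not None:
            text = response.body_as_unicode()
        # no _parser bookkeeping here: parsel picks the parser from type
        super(Selector, self).__init__(text=text, type=st, **kwargs)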


@eliasdorneles

eliasdorneles Aug 4, 2015
Author Member

yeah, I was only setting it here because I needed it for LxmlDocument; I thought we were keeping it.
I'll update the PR tomorrow.


@dangra

dangra Aug 4, 2015
Member

🆗

    TranslatorMixin,
    ScrapyGenericTranslator,
    ScrapyHTMLTranslator
)


@dangra

dangra Aug 4, 2015
Member

I would go for deprecating all these classes and instructing users to import from parsel directly.


@eliasdorneles

eliasdorneles Aug 4, 2015
Author Member

hm, I suppose I should rename those in Parsel too: ParselGenericTranslator, etc.
this will need a new release of parsel, but I think it's worth it, right?


@dangra

dangra Aug 4, 2015
Member

no need to rename in parsel, you can do something like:

class CrawlerSettings(Settings):

    def __init__(self, settings_module=None, **kw):
        Settings.__init__(self, **kw)
        self.settings_module = settings_module

    def __getitem__(self, opt_name):
        if opt_name in self.overrides:
            return self.overrides[opt_name]
        if self.settings_module and hasattr(self.settings_module, opt_name):
            return getattr(self.settings_module, opt_name)
        if opt_name in self.defaults:
            return self.defaults[opt_name]
        return Settings.__getitem__(self, opt_name)

    def __str__(self):
        return "<CrawlerSettings module=%r>" % self.settings_module


CrawlerSettings = create_deprecated_class(
    'CrawlerSettings', CrawlerSettings,
    new_class_path='scrapy.settings.Settings')


@dangra

dangra Aug 4, 2015
Member

from parsel.csstranslator import ScrapyXPathExpr

ScrapyXPathExpr = create_deprecated_class(
    "ScrapyXPathExpr", ScrapyXPathExpr,
    new_class_path='parsel.csstranslator.ScrapyXPathExpr')
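
For reference, the resulting shim warns when user code inherits from it; a quick illustration (the exact message comes from create_deprecated_class):

import warnings

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')

    class MyExpr(ScrapyXPathExpr):  # subclassing the deprecated shim
        pass

# a deprecation warning pointing at the new class path should be recorded
assert len(caught) == 1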


@eliasdorneles

eliasdorneles Aug 4, 2015
Author Member

I understand, but my question is more like: "should Parsel have classes with Scrapy on their names?"


@dangra

dangra Aug 4, 2015
Member

oh, you're right! no, for sure, but having a Parsel prefix doesn't make sense either.


@dangra

dangra Aug 4, 2015
Member

to your original question, let's remove prefixes from parsel and do a new release


@eliasdorneles

eliasdorneles Aug 4, 2015
Author Member

👍

def test_deprecated_root_argument(self, warnings):
    root = etree.fromstring(u'<html/>')
    sel = self.sscls(_root=root)
    self.assertEqual(root, sel._root)


@dangra

dangra Aug 5, 2015
Member

I think it should be self.assertIs()
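
(assertEqual only requires the two objects to compare equal; assertIs asserts they are the very same object, which is what round-tripping the root should guarantee:)

self.assertIs(root, sel._root)  # identity, not mere equality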

@kmike
Member

@kmike kmike commented Aug 5, 2015

Without the LxmlDocument cache some parts of Scrapy can become less efficient. For example, LxmlLinkExtractor instantiates Selector(response), so for users who use both link extractors and selectors the response will be parsed twice. Maybe there are other cases like this (maybe in users' components), I'm not sure. If we proceed with the cache removal I think we should at least fix Scrapy's built-in components.

@dangra
Member

@dangra dangra commented Aug 5, 2015

If we proceed with the cache removal I think we should at least fix Scrapy's built-in components.

do you mean replacing calls like html = Selector(response) with html = response.selector? seems quite easy.

@kmike
Member

@kmike kmike commented Aug 5, 2015

do you mean replacing calls like html = Selector(response) with html = response.selector? seems quite easy.

yep, that's it

@eliasdorneles
Member Author

@eliasdorneles eliasdorneles commented Aug 5, 2015

should I make that part of this PR?

@dangra
Member

@dangra dangra commented Aug 5, 2015

for the code that forces the selector type, we just let it be.

~/src/scrapy$ ack 'Selector\(' scrapy/
scrapy/http/response/text.py:106:            self._cached_selector = Selector(self)
scrapy/selector/unified.py:50:class Selector(object_ref):
scrapy/spiders/feed.py:72:            selector = Selector(response, type='xml')
scrapy/spiders/feed.py:76:            selector = Selector(response, type='html')
scrapy/linkextractors/sgml.py:130:            sel = Selector(response)
scrapy/linkextractors/lxmlhtml.py:68:        html = Selector(response)
scrapy/linkextractors/lxmlhtml.py:98:        html = Selector(response)
scrapy/utils/iterators.py:40:        yield Selector(text=nodetext, type='xml').xpath('//' + nodename)[0]
scrapy/utils/iterators.py:52:        xs = Selector(text=nodetext, type='xml')
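
For context, the response.selector shortcut mentioned above is the cached property in scrapy/http/response/text.py (see the first hit); roughly:

from scrapy.http import Response
from scrapy.selector import Selector

class TextResponse(Response):

    @property
    def selector(self):
        # parse lazily and memoize, so link extractors and user code
        # share a single parsed document per response
        if not hasattr(self, '_cached_selector'):
            self._cached_selector = Selector(self)
        return self._cached_selector
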
@dangra
Member

@dangra dangra commented Aug 5, 2015

@eliasdorneles yes, I think so, if it passes the tests with that simple change.

@eliasdorneles
Member Author

@eliasdorneles eliasdorneles commented Aug 5, 2015

@dangra talking about tests, any idea why I'm getting failures on test_squeues.py there? 👇 :

@dangra
Member

@dangra dangra commented Aug 5, 2015

talking about tests, any idea why I'm getting failures on test_squeues.py there?

no, but it started happening on a previous build, for a commit that is very unlikely to be the cause: https://travis-ci.org/scrapy/scrapy/jobs/74296461

@dangra
Member

@dangra dangra commented Aug 5, 2015

if it is passing for "precise" then it is probably a bug introduced by some dependency upgrade

Twisted 15.3.0 was released yesterday for example

@dangra
Member

@dangra dangra commented Aug 5, 2015

It doesn't make sense to me

>>> import sys, pickle, cPickle
>>> sys.version
'2.7.10 (default, May 26 2015, 04:16:29) \n[GCC 5.1.0]'
>>> cPickle.dumps((lambda x: x), protocol=2)
---------------------------------------------------------------------------
PicklingError                             Traceback (most recent call last)
<ipython-input-10-e076fdfef0cc> in <module>()
----> 1 cPickle.dumps((lambda x: x), protocol=2)

PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
>>> pickle.dumps((lambda x: x), protocol=2)
---------------------------------------------------------------------------
PicklingError                             Traceback (most recent call last)
<ipython-input-11-fb0cc15a50f8> in <module>()
----> 1 pickle.dumps((lambda x: x), protocol=2)

/usr/lib64/python2.7/pickle.pyc in dumps(obj, protocol)
   1372 def dumps(obj, protocol=None):
   1373     file = StringIO()
-> 1374     Pickler(file, protocol).dump(obj)
   1375     return file.getvalue()
   1376 

/usr/lib64/python2.7/pickle.pyc in dump(self, obj)
    222         if self.proto >= 2:
    223             self.write(PROTO + chr(self.proto))
--> 224         self.save(obj)
    225         self.write(STOP)
    226 

/usr/lib64/python2.7/pickle.pyc in save(self, obj)
    284         f = self.dispatch.get(t)
    285         if f:
--> 286             f(self, obj) # Call unbound method with explicit self
    287             return
    288 

/usr/lib64/python2.7/pickle.pyc in save_global(self, obj, name, pack)
    746             raise PicklingError(
    747                 "Can't pickle %r: it's not found as %s.%s" %
--> 748                 (obj, module, name))
    749         else:
    750             if klass is not obj:

PicklingError: Can't pickle <function <lambda> at 0x7f48660f6848>: it's not found as __main__.<lambda>
>>> 
@eliasdorneles
Member Author

@eliasdorneles eliasdorneles commented Aug 5, 2015

@dangra it's definitely weird, but I tested locally pinning Twisted to 15.2.1 and it passed.
no idea why, though.

@kmike
Member

@kmike kmike commented Aug 5, 2015

On Travis it is failing with AssertionError, but we're expecting ValueError in the test. I have no idea how it can depend on the installed requirements. For me this test passes locally with Twisted 15.3.0.

@dangra
Member

@dangra dangra commented Aug 5, 2015

@kmike it fails if you run the whole test suite, but passes if you run just that test file

@eliasdorneles
Member Author

@eliasdorneles eliasdorneles commented Aug 5, 2015

okay, this is weird.

if I run just that test (with the latest Twisted) using tox -r -e py27 -- tests/test_squeues.py, it passes.

but if I run the whole suite with tox -r -e py27, it fails.

ideas?

@dangra
Member

@dangra dangra commented Aug 5, 2015

running all tests with twisted 15.2.0 passes for me

@eliasdorneles
Member Author

@eliasdorneles eliasdorneles commented Aug 5, 2015

yeah, the same with 15.2.1 for me.

@dangra
Member

@dangra dangra commented Aug 5, 2015

It fails for 15.3.0 using $ tox -e py27 tests/test_crawl.py tests/test_squeues.py

@kmike
Member

@kmike kmike commented Aug 5, 2015

I think it can be related to twisted/twisted@a8d8a0c - see this line.

@dangra
Member

@dangra dangra commented Aug 5, 2015

@kmike that's for sure.

@eliasdorneles
Member Author

@eliasdorneles eliasdorneles commented Aug 6, 2015

@dangra @kmike so, I've made the changes to use response.selector in link extractors.

The only place still using LxmlDocument is here: https://github.com/scrapy/scrapy/blob/master/scrapy/http/request/form.py#L60

I get a bunch of test failures if I try to replace that with response.selector, so I suppose it's not safe to assume all responses there will have a selector -- should we keep it there?

@dangra
Member

@dangra dangra commented Aug 6, 2015

The only place still using LxmlDocument is here: https://github.com/scrapy/scrapy/blob/master/scrapy/http/request/form.py#L60

@elias replace it with the corresponding direct call to the lxml parser, following re-encoding best practices from parsel (i.e.: body_as_unicode().encode('utf8'))
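
A sketch of what that direct call could look like; the function name follows the later diff, but the signature and details here are illustrative:

import lxml.html

def _create_parser_from_response(response):
    # re-encode the unicode body to UTF-8 and tell lxml the encoding
    # explicitly, following parsel's re-encoding practice
    body = response.body_as_unicode().encode('utf8')
    parser = lxml.html.HTMLParser(encoding='utf8')
    return lxml.html.fromstring(body, parser=parser, base_url=response.url)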

@eliasdorneles
Member Author

@eliasdorneles eliasdorneles commented Aug 7, 2015

@dangra okay, finally got rid of it. :) 🎉
should I rebase and/or squash the commits?

@@ -54,10 +55,15 @@ def _urlencode(seq, enc):
    return urlencode(values, doseq=1)


def _create_parser_from_response(response, parser_cls):


@kmike

kmike Aug 7, 2015
Member

+1 to expose this function in parsel instead of duplicating it


@dangra

dangra Aug 7, 2015
Member

@kmike what do you think about promoting Selector._root to Selector.root in the parsel library, and using lxml.html.HTMLParser as an extended-html parser?


@kmike

kmike Aug 7, 2015
Member

Sorry, I feel dumb - could you please explain what extended-html is and why it is different from html?

Promoting _root to root sounds good, I had to use _root more than once.


@eliasdorneles

eliasdorneles Aug 7, 2015
Author Member

@kmike in practice, it would mean using lxml.html.HTMLParser instead of lxml.etree.HTMLParser (which is extended by the former).

I'll promote _root in parsel, and add a deprecated attribute for _root in Scrapy's subclass.


@dangra

dangra Aug 7, 2015
Member

Sorry, I feel dumb - could you please explain what extended-html is and why it is different from html?

FormRequest uses lxml.html.HTMLParser, which is different from type=html in the Selector class, which uses lxml.etree.HTMLParser.

The former provides extra methods to query HTML in a pythonic way, but it is slower than the latter.
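
To make the difference concrete, a quick comparison (output annotations are for current lxml versions):

import lxml.etree
import lxml.html

html = b"<body><p>hi <a href='/x'>there</a></p></body>"

# lxml.etree.HTMLParser -> plain _Element nodes (what type='html' uses)
etree_root = lxml.etree.fromstring(html, lxml.etree.HTMLParser())
print(type(etree_root))          # <class 'lxml.etree._Element'>

# lxml.html.HTMLParser -> HtmlElement nodes with extra conveniences
html_root = lxml.etree.fromstring(html, lxml.html.HTMLParser())
print(type(html_root))           # <class 'lxml.html.HtmlElement'>
print(html_root.text_content())  # 'hi there', one of the extra helpers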


@kmike

kmike Aug 7, 2015
Member

why can't we use lxml.html.HTMLParser by default?


@dangra

dangra Aug 7, 2015
Member

Back in the future #471 (comment)


if text is not None:
    response = _response_from_text(text, st)

if response is not None:
    _root = LxmlDocument(response, self._parser)
    text = response.body_as_unicode()


@dangra

dangra Aug 10, 2015
Member

now that parsel can handle base_url, let's be fully backward compatible: kwargs.setdefault('base_url', response.url)
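
In the response branch of Scrapy's Selector.__init__ that would look roughly like:

if response is not None:
    text = response.body_as_unicode()
    # preserve the old LxmlDocument behaviour: relative links resolve
    # against the response URL unless the caller overrides base_url
    kwargs.setdefault('base_url', response.url)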


@eliasdorneles

eliasdorneles Aug 10, 2015
Author Member

nice! 👍

@eliasdorneles eliasdorneles force-pushed the eliasdorneles:migrate-parsel branch from 6883af1 to e50610b Aug 11, 2015
@eliasdorneles
Member Author

@eliasdorneles eliasdorneles commented Aug 11, 2015

@dangra updated and rebased on top of current master :)

@dangra
Member

@dangra dangra commented Aug 11, 2015

great work, @eliasdorneles!

@dangra dangra changed the title Migrating selectors to use parsel [MRG+1] Migrating selectors to use parsel Aug 11, 2015
@dangra
Member

@dangra dangra commented Aug 11, 2015

@kmike feel free to merge when you are happy


@deprecated(use_instead='.xpath()')
def select(self, xpath):
    return self.xpath(xpath)


@kmike

kmike Aug 11, 2015
Member

I think it is fine to remove these deprecated methods, but we should not forget to mention it in the release notes.


@dangra

dangra Aug 11, 2015
Member

I prefer to do that in another PR; this one is about migrating to the parsel lib while retaining current functionality.


@kmike

kmike Aug 11, 2015
Member

Do you mean we should add backwards compatibility shims for these methods in this PR?


@eliasdorneles

eliasdorneles Aug 11, 2015
Author Member

woo, yeah, we're missing the shims for these methods! :O


@dangra

dangra Aug 11, 2015
Member

Yes, I think so. We are keeping them for the Selector class.

It actually means that SelectorList must be a class attribute of the Selector class.
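
A sketch of the shim pattern being agreed on here, assuming the selectorlist_cls hook on parsel's Selector mentioned below:

from parsel import Selector as ParselSelector
from parsel import SelectorList as ParselSelectorList

from scrapy.utils.decorators import deprecated

class SelectorList(ParselSelectorList):

    @deprecated(use_instead='.xpath()')
    def select(self, xpath):
        return self.xpath(xpath)

class Selector(ParselSelector):
    # class attribute, so xpath()/css() results are wrapped in the
    # subclass list that carries the deprecated shims
    selectorlist_cls = SelectorList

    @deprecated(use_instead='.xpath()')
    def select(self, xpath):
        return self.xpath(xpath)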


@eliasdorneles

eliasdorneles Aug 11, 2015
Author Member

right, so it can be defined by subclasses.
I'll change this in parsel, do a release and update this PR.


@eliasdorneles

eliasdorneles Aug 11, 2015
Author Member

alright, shim added!
we're getting closer, guys! :D

@dangra

@dangra dangra commented on scrapy/selector/unified.py in 766c255 Aug 11, 2015

I prefer extending from Selector.selectorlist_cls instead, but that's just my preference.


@dangra dangra replied Aug 11, 2015

No need to import SelectorList from parsel, and you can be sure it will use the same class used by Selector.

Owner Author

@eliasdorneles eliasdorneles replied Aug 11, 2015

right, one less dependency, will update in a jiffy

dangra added a commit that referenced this pull request Aug 11, 2015
[MRG+1] Migrating selectors to use parsel
@dangra dangra merged commit 15c1300 into scrapy:master Aug 11, 2015
1 check passed
continuous-integration/travis-ci/pr The Travis CI build passed
@kmike
Member

@kmike kmike commented Aug 11, 2015

Great job @eliasdorneles! Also, thanks @umrashrf for getting the ball rolling.
