
[MRG+1] Migrating selectors to use parsel #1409

Merged

merged 17 commits into scrapy:master on Aug 11, 2015

Conversation

eliasdorneles (Member, Author)

Hey, folks!

So, here is my first stab at porting selectors to use Parsel.

I'm not very happy about the initialization of the self._parser attribute, which is used to build the LxmlDocument instance for the response; that forced me to import _ctgroup from parsel.
I'd love to hear any suggestions about how to handle that.

What do you think?
Thank you!

}
_lxml_smart_strings = False
# supporting legacy _root argument
root = kwargs.get('_root', root)
Member Author:

I'm not sure if this is really needed, since the root argument seems to be only used here: https://github.com/scrapy/parsel/blob/master/parsel/unified.py#L96

Do you think we could get rid of this?

Member:

I'd leave it and log a warning about the change of parameter name. The same goes if we decide to promote self._root to self.root.
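
A minimal sketch of that suggestion (the warning text and exact keyword handling are assumptions, not the final patch):

import warnings

class Selector(object):
    def __init__(self, response=None, text=None, root=None, **kwargs):
        if '_root' in kwargs:
            # Keep accepting the legacy name, but tell users it changed.
            warnings.warn(
                "Argument `_root` is deprecated, use `root` instead",
                DeprecationWarning, stacklevel=2)
            root = kwargs.pop('_root')
        self.root = root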

@kmike commented Aug 5, 2015

Without the LxmlDocument cache some parts of Scrapy can become less efficient. For example, LxmlLinkExtractor instantiates Selector(response), so for users who use both link extractors and selectors the response will be parsed twice. Maybe there are other cases like this (maybe in users' components), I'm not sure. If we proceed with the cache removal I think we should at least fix Scrapy's built-in components.

@dangra commented Aug 5, 2015

If we proceed with the cache removal I think we should at least fix Scrapy's built-in components.

Do you mean replacing calls like html = Selector(response) with html = response.selector? Seems quite easy.

@kmike commented Aug 5, 2015

Do you mean replacing calls like html = Selector(response) with html = response.selector? Seems quite easy.

yep, that's it

@eliasdorneles (Member, Author)

should I make that part of this PR?

@dangra commented Aug 5, 2015

For the code that forces the selector type, we just let it be.

~/src/scrapy$ ack 'Selector\(' scrapy/
scrapy/http/response/text.py:106:            self._cached_selector = Selector(self)
scrapy/selector/unified.py:50:class Selector(object_ref):
scrapy/spiders/feed.py:72:            selector = Selector(response, type='xml')
scrapy/spiders/feed.py:76:            selector = Selector(response, type='html')
scrapy/linkextractors/sgml.py:130:            sel = Selector(response)
scrapy/linkextractors/lxmlhtml.py:68:        html = Selector(response)
scrapy/linkextractors/lxmlhtml.py:98:        html = Selector(response)
scrapy/utils/iterators.py:40:        yield Selector(text=nodetext, type='xml').xpath('//' + nodename)[0]
scrapy/utils/iterators.py:52:        xs = Selector(text=nodetext, type='xml')
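
For context, the cache being discussed is the one at text.py:106 above; a simplified sketch (not the exact Scrapy code) of why replacing Selector(response) with response.selector avoids parsing twice:

from parsel import Selector

class TextResponseSketch(object):
    def __init__(self, text):
        self.text = text
        self._cached_selector = None

    @property
    def selector(self):
        # Parse the body only on first access; every later caller --
        # link extractors included -- reuses the same parsed document.
        if self._cached_selector is None:
            self._cached_selector = Selector(text=self.text)
        return self._cached_selector

resp = TextResponseSketch(u"<html><body><a href='/x'>x</a></body></html>")
assert resp.selector is resp.selector  # one parse, shared instance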

@dangra commented Aug 5, 2015

@eliasdorneles yes, I think so, if it passes tests with that simple change.

@eliasdorneles (Member, Author)

@dangra talking about tests, any idea why I'm getting failures on test_squeues.py there? 👇

@dangra commented Aug 5, 2015

talking about tests, any idea why I'm getting failures on test_squeues.py there?

No, but it started happening on a previous build, for a commit that is very unlikely to be the cause: https://travis-ci.org/scrapy/scrapy/jobs/74296461

@dangra commented Aug 5, 2015

If it is passing for "precise" then it is probably a bug introduced by some dependency upgrade.

Twisted 15.3.0 was released yesterday, for example.

@dangra commented Aug 5, 2015

It doesn't make sense to me:

>>> import sys, pickle, cPickle
>>> sys.version
'2.7.10 (default, May 26 2015, 04:16:29) \n[GCC 5.1.0]'
>>> cPickle.dumps((lambda x: x), protocol=2)
---------------------------------------------------------------------------
PicklingError                             Traceback (most recent call last)
<ipython-input-10-e076fdfef0cc> in <module>()
----> 1 cPickle.dumps((lambda x: x), protocol=2)

PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
>>> pickle.dumps((lambda x: x), protocol=2)
---------------------------------------------------------------------------
PicklingError                             Traceback (most recent call last)
<ipython-input-11-fb0cc15a50f8> in <module>()
----> 1 pickle.dumps((lambda x: x), protocol=2)

/usr/lib64/python2.7/pickle.pyc in dumps(obj, protocol)
   1372 def dumps(obj, protocol=None):
   1373     file = StringIO()
-> 1374     Pickler(file, protocol).dump(obj)
   1375     return file.getvalue()
   1376 

/usr/lib64/python2.7/pickle.pyc in dump(self, obj)
    222         if self.proto >= 2:
    223             self.write(PROTO + chr(self.proto))
--> 224         self.save(obj)
    225         self.write(STOP)
    226 

/usr/lib64/python2.7/pickle.pyc in save(self, obj)
    284         f = self.dispatch.get(t)
    285         if f:
--> 286             f(self, obj) # Call unbound method with explicit self
    287             return
    288 

/usr/lib64/python2.7/pickle.pyc in save_global(self, obj, name, pack)
    746             raise PicklingError(
    747                 "Can't pickle %r: it's not found as %s.%s" %
--> 748                 (obj, module, name))
    749         else:
    750             if klass is not obj:

PicklingError: Can't pickle <function <lambda> at 0x7f48660f6848>: it's not found as __main__.<lambda>
>>> 

@eliasdorneles (Member, Author)

@dangra it's definitely weird, but I tested locally, pinning the Twisted version to 15.2.1, and it passed.
No idea why, though.

@kmike commented Aug 5, 2015

On Travis it is failing with AssertionError, but we're expecting ValueError in the test. I have no idea how it can depend on installed requirements. For me this test passes locally with Twisted 15.3.0.

@dangra commented Aug 5, 2015

@kmike it fails if you run the whole test suite, but passes if you run just that test file.

@eliasdorneles (Member, Author)

Okay, this is weird.

If I run (with the latest Twisted) just that test with tox -r -e py27 -- tests/test_squeues.py, it passes.

But if I run the whole suite with tox -r -e py27, it fails.

Ideas?

@dangra commented Aug 5, 2015

Running all tests with Twisted 15.2.0 passes for me.

@eliasdorneles (Member, Author)

yeah, the same with 15.2.1 for me.

@dangra commented Aug 5, 2015

It fails for 15.3.0 using $ tox -e py27 tests/test_crawl.py tests/test_squeues.py

@kmike commented Aug 5, 2015

I think it can be related to twisted/twisted@a8d8a0c - see this line.

@dangra commented Aug 5, 2015

@kmike that's for sure.
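
For readers following along: one way a dependency upgrade can flip this test only when the whole suite runs is a process-global pickle registration performed at import time by some other test's imports. A self-contained illustration of the mechanism (an assumption about the failure mode, not the actual Twisted code):

import copyreg
import pickle

class Token(object):
    def __init__(self, value):
        self.value = value

    def __reduce__(self):
        # Like the objects the squeue test feeds in: refuses pickling.
        raise pickle.PicklingError("Token refuses to be pickled")

token = Token(42)
try:
    pickle.dumps(token, protocol=2)
except pickle.PicklingError as exc:
    print("before registration: %s" % exc)

# As if a third-party library ran this at import time: the copyreg
# dispatch table is global, so every later pickle.dumps() in the same
# process -- including ones in unrelated test files -- now succeeds.
copyreg.pickle(Token, lambda tok: (Token, (tok.value,)))
print(pickle.loads(pickle.dumps(token, protocol=2)).value)  # 42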

@eliasdorneles (Member, Author)

@dangra @kmike so, I've made the changes to use response.selector in link extractors.

The only place still using LxmlDocument is here: https://github.com/scrapy/scrapy/blob/master/scrapy/http/request/form.py#L60

I get a bunch of test failures if I try to replace that with response.selector, so I suppose it's not safe to assume all responses there will have a selector -- should we keep it there?

@dangra commented Aug 6, 2015

The only place still using LxmlDocument is here: https://github.com/scrapy/scrapy/blob/master/scrapy/http/request/form.py#L60

@elias replace it with the corresponding direct call to the lxml parser, following re-encoding best practices from parsel (i.e. body_as_unicode().encode('utf8')).
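
A sketch of what that replacement could look like (the helper name _build_form_root is hypothetical; the re-encoding mirrors the parsel approach dangra describes):

import lxml.html

def _build_form_root(response):
    # Re-encode the unicode body to UTF-8 and tell lxml the encoding
    # explicitly, so encodings declared inside the document can't clash.
    body = response.body_as_unicode().encode('utf8')
    parser = lxml.html.HTMLParser(encoding='utf8')
    return lxml.html.document_fromstring(body, parser=parser,
                                         base_url=response.url)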

@eliasdorneles (Member, Author)

@dangra okay, finally got rid of it. :) 🎉
should I rebase and/or squash the commits?

@@ -54,10 +55,15 @@ def _urlencode(seq, enc):
return urlencode(values, doseq=1)


def _create_parser_from_response(response, parser_cls):
Member:

+1 to exposing this function in parsel instead of duplicating it.

Member:

@kmike what do you think about promoting Selector._root to Selector.root in the parsel library, and using lxml.html.HTMLParser for an extended-html parser type?

Member:

Sorry, I feel dumb - could you please explain what extended-html is and why it is different from html?

Promoting _root to root sounds good; I had to use _root more than once.

Member Author:

@kmike in practice, it would mean using lxml.html.HTMLParser instead of lxml.etree.HTMLParser (which is extended by the former).

I'll promote _root in parsel, and add a deprecated attribute for _root in Scrapy's subclass.

Member:

Sorry, I feel dumb - could you please explain what extended-html is and why it is different from html?

FormRequest uses lxml.html.HTMLParser, which is different from type='html' in the Selector class, which uses lxml.etree.HTMLParser.

The former provides extra methods to query HTML in a Pythonic way, but it is slower than the latter.
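
A quick demonstration of that difference (simplified, not Scrapy code; assumes lxml is installed):

import lxml.etree
import lxml.html

doc = b"<html><body><form action='/post'><input name='q'/></form></body></html>"

etree_root = lxml.etree.fromstring(doc, parser=lxml.etree.HTMLParser())
html_root = lxml.etree.fromstring(doc, parser=lxml.html.HTMLParser())

print(type(etree_root))  # <class 'lxml.etree._Element'>
print(type(html_root))   # <class 'lxml.html.HtmlElement'>
print(html_root.forms)   # HTML-aware helpers like .forms only exist here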

Member:

why can't we use lxml.html.HTMLParser by default?

Member:

Back in the future #471 (comment)

def test_badly_encoded_body(self):
    # \xe9 alone isn't valid utf8 sequence
    r1 = TextResponse('http://www.example.com', \
-                     body='<html><p>an Jos\xe9 de</p><html>', \
+                     body=u'<html><p>an Jos\xe9 de</p><html>', \
Member:

The body must be bytes for this test case; it is a badly encoded UTF-8 sequence.

@eliasdorneles (Member, Author)

@dangra updated and rebased on top of current master :)

@dangra commented Aug 11, 2015

great work @eliasdorneles !

@dangra changed the title from "Migrating selectors to use parsel" to "[MRG+1] Migrating selectors to use parsel" on Aug 11, 2015
@dangra commented Aug 11, 2015

@kmike feel free to merge when you are happy


@deprecated(use_instead='.xpath()')
def select(self, xpath):
    return self.xpath(xpath)
Member:

I think it is fine to remove these deprecated methods, but we should not forget to mention it in the release notes.

Member:

I prefer to do that in another PR; this one is about migrating to the parsel lib while retaining current functionality.

Member:

Do you mean we should add backwards compatibility shims for these methods in this PR?

Member Author:

woo, yeah, we're missing the shims for these methods! :O

Member:

Yes, I think so. We are keeping them for the Selector class.

It actually means that SelectorList must be a class attribute of the Selector class.

Member Author:

right, so it can be defined by subclasses.
I'll change this in parsel, do a release and update this PR.
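
A sketch of the pattern being agreed on here (simplified; names other than selectorlist_cls are illustrative):

class SelectorList(list):
    """parsel-side base class; query results are wrapped in this."""

class Selector(object):
    # A class attribute instead of a hardcoded type, so subclasses can
    # plug in a list class carrying their own shims.
    selectorlist_cls = SelectorList

    def xpath(self, query):
        # The real implementation evaluates the query; the point here is
        # that results come back wrapped in self.selectorlist_cls.
        return self.selectorlist_cls([])

# Scrapy side: a list subclass that carries the deprecated alias.
class ScrapySelectorList(SelectorList):
    def select(self, xpath):  # backwards-compatibility shim
        return self.xpath(xpath)

    def xpath(self, query):
        return self.__class__(r for sel in self for r in sel.xpath(query))

class ScrapySelector(Selector):
    selectorlist_cls = ScrapySelectorList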


Member Author:

alright, shim added!
we're getting closer, guys! :D

dangra added a commit referencing this pull request on Aug 11, 2015: [MRG+1] Migrating selectors to use parsel

@dangra merged commit 15c1300 into scrapy:master on Aug 11, 2015
@kmike commented Aug 11, 2015

Great job @eliasdorneles! Also, thanks @umrashrf for getting the ball rolling.
