
[MRG+1] Migrating selectors to use parsel #1409

Merged

merged 17 commits into scrapy:master on Aug 11, 2015

Conversation

eliasdorneles (Member, Author)

Hey, folks!

So, here is my first stab at porting selectors to use Parsel.

I'm not very happy about the initialization of the self._parser attribute, which is used to build the LxmlDocument instance for the response; that forced me to import _ctgroup from parsel.
I'd love to hear any suggestions about how to handle that.

What do you think?
Thank you!

}
_lxml_smart_strings = False
# supporting legacy _root argument
root = kwargs.get('_root', root)
Member Author:

I'm not sure if this is really needed, since the root argument seems to be only used here: https://github.com/scrapy/parsel/blob/master/parsel/unified.py#L96

Do you think we could get rid of this?

Member:

I'd leave it and log a warning about the change of parameter name. The same goes if we decide to promote self._root to self.root.
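
A minimal sketch of that suggestion (the warning text and exact keyword handling are assumptions, not the final patch):

import warnings

class Selector(object):
    def __init__(self, response=None, text=None, root=None, **kwargs):
        if '_root' in kwargs:
            # Keep accepting the legacy name, but tell users it changed.
            warnings.warn(
                "Argument `_root` is deprecated, use `root` instead",
                DeprecationWarning, stacklevel=2)
            root = kwargs.pop('_root')
        self.root = root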

@kmike commented Aug 5, 2015

Without the LxmlDocument cache some parts of Scrapy can become less efficient. For example, LxmlLinkExtractor instantiates Selector(response), so for users who use both link extractors and selectors the response will be parsed twice. Maybe there are other cases like this (maybe in users' components), I'm not sure. If we proceed with the cache removal I think we should at least fix Scrapy's built-in components.

@dangra commented Aug 5, 2015

If we proceed with the cache removal I think we should at least fix Scrapy's built-in components.

Do you mean replacing calls like html = Selector(response) with html = response.selector? Seems quite easy.

@kmike commented Aug 5, 2015

Do you mean replacing calls like html = Selector(response) with html = response.selector? Seems quite easy.

yep, that's it

@eliasdorneles (Member, Author)

should I make that part of this PR?

@dangra commented Aug 5, 2015

For the code that forces the selector type, we just let it be.

~/src/scrapy$ ack 'Selector\(' scrapy/
scrapy/http/response/text.py:106:            self._cached_selector = Selector(self)
scrapy/selector/unified.py:50:class Selector(object_ref):
scrapy/spiders/feed.py:72:            selector = Selector(response, type='xml')
scrapy/spiders/feed.py:76:            selector = Selector(response, type='html')
scrapy/linkextractors/sgml.py:130:            sel = Selector(response)
scrapy/linkextractors/lxmlhtml.py:68:        html = Selector(response)
scrapy/linkextractors/lxmlhtml.py:98:        html = Selector(response)
scrapy/utils/iterators.py:40:        yield Selector(text=nodetext, type='xml').xpath('//' + nodename)[0]
scrapy/utils/iterators.py:52:        xs = Selector(text=nodetext, type='xml')
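
For context, the cache being discussed is the one at text.py:106 above; a simplified sketch (not the exact Scrapy code) of why replacing Selector(response) with response.selector avoids parsing twice:

from parsel import Selector

class TextResponseSketch(object):
    def __init__(self, text):
        self.text = text
        self._cached_selector = None

    @property
    def selector(self):
        # Parse the body only on first access; every later caller --
        # link extractors included -- reuses the same parsed document.
        if self._cached_selector is None:
            self._cached_selector = Selector(text=self.text)
        return self._cached_selector

resp = TextResponseSketch(u"<html><body><a href='/x'>x</a></body></html>")
assert resp.selector is resp.selector  # one parse, shared instance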

@dangra commented Aug 5, 2015

@eliasdorneles yes, I think so, if it passes tests with that simple change.

@eliasdorneles (Member, Author)

@dangra talking about tests, any idea why I'm getting failures on test_squeues.py there? 👇

@dangra commented Aug 5, 2015

talking about tests, any idea why I'm getting failures on test_squeues.py there?

No, but it started happening on a previous build, for a commit that is very unlikely to be the cause: https://travis-ci.org/scrapy/scrapy/jobs/74296461

@dangra commented Aug 5, 2015

If it is passing for "precise" then it is probably a bug introduced by some dependency upgrade.

Twisted 15.3.0 was released yesterday, for example.

@dangra commented Aug 5, 2015

It doesn't make sense to me:

>>> import sys, pickle, cPickle
>>> sys.version
'2.7.10 (default, May 26 2015, 04:16:29) \n[GCC 5.1.0]'
>>> cPickle.dumps((lambda x: x), protocol=2)
---------------------------------------------------------------------------
PicklingError                             Traceback (most recent call last)
<ipython-input-10-e076fdfef0cc> in <module>()
----> 1 cPickle.dumps((lambda x: x), protocol=2)

PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
>>> pickle.dumps((lambda x: x), protocol=2)
---------------------------------------------------------------------------
PicklingError                             Traceback (most recent call last)
<ipython-input-11-fb0cc15a50f8> in <module>()
----> 1 pickle.dumps((lambda x: x), protocol=2)

/usr/lib64/python2.7/pickle.pyc in dumps(obj, protocol)
   1372 def dumps(obj, protocol=None):
   1373     file = StringIO()
-> 1374     Pickler(file, protocol).dump(obj)
   1375     return file.getvalue()
   1376 

/usr/lib64/python2.7/pickle.pyc in dump(self, obj)
    222         if self.proto >= 2:
    223             self.write(PROTO + chr(self.proto))
--> 224         self.save(obj)
    225         self.write(STOP)
    226 

/usr/lib64/python2.7/pickle.pyc in save(self, obj)
    284         f = self.dispatch.get(t)
    285         if f:
--> 286             f(self, obj) # Call unbound method with explicit self
    287             return
    288 

/usr/lib64/python2.7/pickle.pyc in save_global(self, obj, name, pack)
    746             raise PicklingError(
    747                 "Can't pickle %r: it's not found as %s.%s" %
--> 748                 (obj, module, name))
    749         else:
    750             if klass is not obj:

PicklingError: Can't pickle <function <lambda> at 0x7f48660f6848>: it's not found as __main__.<lambda>
>>> 

@eliasdorneles (Member, Author)

@dangra it's definitely weird, but I tested locally, pinning the Twisted version to 15.2.1, and it passed.
No idea why, though.

@kmike commented Aug 5, 2015

On Travis it is failing with AssertionError, but we're expecting ValueError in the test. I have no idea how it can depend on installed requirements. For me this test passes locally with Twisted 15.3.0.

@dangra commented Aug 5, 2015

@kmike it fails if you run the whole test suite, but passes if you run just that test file.

@eliasdorneles (Member, Author)

Okay, this is weird.

If I run (with the latest Twisted) just that test with tox -r -e py27 -- tests/test_squeues.py, it passes.

But if I run the whole suite with tox -r -e py27, it fails.

Ideas?

@dangra commented Aug 5, 2015

Running all tests with Twisted 15.2.0 passes for me.

@eliasdorneles (Member, Author)

yeah, the same with 15.2.1 for me.

@dangra commented Aug 5, 2015

It fails for 15.3.0 using $ tox -e py27 tests/test_crawl.py tests/test_squeues.py

@kmike commented Aug 5, 2015

I think it can be related to twisted/twisted@a8d8a0c - see this line.

@dangra commented Aug 5, 2015

@kmike that's for sure.
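
For readers following along: one way a dependency upgrade can flip this test only when the whole suite runs is a process-global pickle registration performed at import time by some other test's imports. A self-contained illustration of the mechanism (an assumption about the failure mode, not the actual Twisted code):

import copyreg
import pickle

class Token(object):
    def __init__(self, value):
        self.value = value

    def __reduce__(self):
        # Like the objects the squeue test feeds in: refuses pickling.
        raise pickle.PicklingError("Token refuses to be pickled")

token = Token(42)
try:
    pickle.dumps(token, protocol=2)
except pickle.PicklingError as exc:
    print("before registration: %s" % exc)

# As if a third-party library ran this at import time: the copyreg
# dispatch table is global, so every later pickle.dumps() in the same
# process -- including ones in unrelated test files -- now succeeds.
copyreg.pickle(Token, lambda tok: (Token, (tok.value,)))
print(pickle.loads(pickle.dumps(token, protocol=2)).value)  # 42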

@eliasdorneles (Member, Author)

@dangra @kmike so, I've made the changes to use response.selector in link extractors.

The only place still using LxmlDocument is here: https://github.com/scrapy/scrapy/blob/master/scrapy/http/request/form.py#L60

I get a bunch of test failures if I try to replace that with response.selector, so I suppose it's not safe to assume all responses there will have a selector -- should we keep it there?

@dangra commented Aug 6, 2015

The only place still using LxmlDocument is here: https://github.com/scrapy/scrapy/blob/master/scrapy/http/request/form.py#L60

@elias replace it with the corresponding direct call to the lxml parser, following re-encoding best practices from parsel (i.e. body_as_unicode().encode('utf8')).
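
A sketch of what that replacement could look like (the helper name _build_form_root is hypothetical; the re-encoding mirrors the parsel approach dangra describes):

import lxml.html

def _build_form_root(response):
    # Re-encode the unicode body to UTF-8 and tell lxml the encoding
    # explicitly, so encodings declared inside the document can't clash.
    body = response.body_as_unicode().encode('utf8')
    parser = lxml.html.HTMLParser(encoding='utf8')
    return lxml.html.document_fromstring(body, parser=parser,
                                         base_url=response.url)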

@eliasdorneles (Member, Author)

@dangra okay, finally got rid of it. :) 🎉
should I rebase and/or squash the commits?

@@ -54,10 +55,15 @@ def _urlencode(seq, enc):
return urlencode(values, doseq=1)


def _create_parser_from_response(response, parser_cls):
Member:

+1 to exposing this function in parsel instead of duplicating it.

Member:

@kmike what do you think about promoting Selector._root to Selector.root in the parsel library, and using lxml.html.HTMLParser for an extended-html parser type?

Member:

Sorry, I feel dumb - could you please explain what extended-html is and why it is different from html?

Promoting _root to root sounds good; I had to use _root more than once.

Member Author:

@kmike in practice, it would mean using lxml.html.HTMLParser instead of lxml.etree.HTMLParser (which is extended by the former).

I'll promote _root in parsel, and add a deprecated attribute for _root in Scrapy's subclass.

Member:

Sorry, I feel dumb - could you please explain what extended-html is and why it is different from html?

FormRequest uses lxml.html.HTMLParser, which is different from type='html' in the Selector class, which uses lxml.etree.HTMLParser.

The former provides extra methods to query HTML in a Pythonic way, but it is slower than the latter.
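
A quick demonstration of that difference (simplified, not Scrapy code; assumes lxml is installed):

import lxml.etree
import lxml.html

doc = b"<html><body><form action='/post'><input name='q'/></form></body></html>"

etree_root = lxml.etree.fromstring(doc, parser=lxml.etree.HTMLParser())
html_root = lxml.etree.fromstring(doc, parser=lxml.html.HTMLParser())

print(type(etree_root))  # <class 'lxml.etree._Element'>
print(type(html_root))   # <class 'lxml.html.HtmlElement'>
print(html_root.forms)   # HTML-aware helpers like .forms only exist here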

Member:

why can't we use lxml.html.HTMLParser by default?

Member:

Back in the future #471 (comment)

def test_badly_encoded_body(self):
    # \xe9 alone isn't valid utf8 sequence
    r1 = TextResponse('http://www.example.com', \
-                     body='<html><p>an Jos\xe9 de</p><html>', \
+                     body=u'<html><p>an Jos\xe9 de</p><html>', \
Member:

The body must be bytes for this test case; it is a badly encoded UTF-8 sequence.

@eliasdorneles (Member, Author)

@dangra updated and rebased on top of current master :)

@dangra commented Aug 11, 2015

great work @eliasdorneles !

@dangra changed the title from "Migrating selectors to use parsel" to "[MRG+1] Migrating selectors to use parsel" on Aug 11, 2015
@dangra commented Aug 11, 2015

@kmike feel free to merge when you are happy


@deprecated(use_instead='.xpath()')
def select(self, xpath):
    return self.xpath(xpath)
Member:

I think it is fine to remove these deprecated methods, but we should not forget to mention it in the release notes.

Member:

I prefer to do that in another PR; this one is about migrating to the parsel lib while retaining current functionality.

Member:

Do you mean we should add backwards compatibility shims for these methods in this PR?

Member Author:

woo, yeah, we're missing the shims for these methods! :O

Member:

Yes, I think so. We are keeping them for the Selector class.

It actually means that SelectorList must be a class attribute of the Selector class.

Member Author:

right, so it can be defined by subclasses.
I'll change this in parsel, do a release and update this PR.
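
A sketch of the pattern being agreed on here (simplified; names other than selectorlist_cls are illustrative):

class SelectorList(list):
    """parsel-side base class; query results are wrapped in this."""

class Selector(object):
    # A class attribute instead of a hardcoded type, so subclasses can
    # plug in a list class carrying their own shims.
    selectorlist_cls = SelectorList

    def xpath(self, query):
        # The real implementation evaluates the query; the point here is
        # that results come back wrapped in self.selectorlist_cls.
        return self.selectorlist_cls([])

# Scrapy side: a list subclass that carries the deprecated alias.
class ScrapySelectorList(SelectorList):
    def select(self, xpath):  # backwards-compatibility shim
        return self.xpath(xpath)

    def xpath(self, query):
        return self.__class__(r for sel in self for r in sel.xpath(query))

class ScrapySelector(Selector):
    selectorlist_cls = ScrapySelectorList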


Member Author:

alright, shim added!
we're getting closer, guys! :D

dangra added a commit referencing this pull request on Aug 11, 2015: [MRG+1] Migrating selectors to use parsel

@dangra merged commit 15c1300 into scrapy:master on Aug 11, 2015
@kmike commented Aug 11, 2015

Great job @eliasdorneles! Also, thanks @umrashrf for getting the ball rolling.
