
Request serialization should fail for non-picklable objects #3054

Closed
elacuesta opened this issue Dec 29, 2017 · 1 comment

@elacuesta
Member

The pickle-based disk queues silently serialize requests that shouldn't be serialized on Python <= 3.5. I found this problem when dumping a request with an ItemLoader object in its meta dict. Python 3.6 fails in this line with `TypeError: can't pickle HtmlElement objects`, because the loader contains a Selector, which in turn contains an HtmlElement object.

I tested this using the https://github.com/scrapinghub/scrapinghub-stack-scrapy repository, and found that pickle.loads(pickle.dumps(selector)) doesn't fail, but generates a broken object.
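The silent-breakage half of this can be reproduced without lxml at all: any object whose `__getstate__` quietly drops unpicklable state will survive `pickle.dumps`/`pickle.loads` without an error and come back broken. A minimal illustration (the class below is hypothetical, not part of Scrapy or parsel):

```python
import pickle

class LazyHandle:
    """Hypothetical class (not from Scrapy or parsel) illustrating how an
    object can round-trip through pickle without an error yet come back
    broken."""
    def __init__(self):
        self._tree = object()  # stands in for an unpicklable lxml tree

    def __getstate__(self):
        # Silently drop the unpicklable member instead of failing loudly.
        state = self.__dict__.copy()
        del state['_tree']
        return state

    def query(self):
        return self._tree  # raises AttributeError after a round-trip

original = LazyHandle()
restored = pickle.loads(pickle.dumps(original, protocol=2))  # no error here
# 'restored' is broken: calling restored.query() raises AttributeError,
# analogous to the invalid-Element-proxy AssertionError shown below.
```

This is the same failure mode: serialization "succeeds", and the breakage only surfaces later when the restored object is actually used.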

Python 2.7, Scrapy 1.3.3 (https://github.com/scrapinghub/scrapinghub-stack-scrapy/tree/branch-1.3)

```
root@04bfc6cf84cd:/# scrapy version -v
Scrapy    : 1.3.3
lxml      : 3.7.2.0
libxml2   : 2.9.3
cssselect : 1.0.1
parsel    : 1.1.0
w3lib     : 1.17.0
Twisted   : 16.6.0
Python    : 2.7.14 (default, Dec 12 2017, 16:55:09) - [GCC 4.9.2]
pyOpenSSL : 16.2.0 (OpenSSL 1.0.1t  3 May 2016)
Platform  : Linux-4.9.44-linuxkit-aufs-x86_64-with-debian-8.10
root@04bfc6cf84cd:/# scrapy shell "http://example.org"
2017-12-29 16:49:27 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: scrapybot)
(...)
>>> from six.moves import cPickle as pickle
>>> s2 = pickle.loads(pickle.dumps(response.selector, protocol=2))
>>> response.selector.css('a')
[<Selector xpath=u'descendant-or-self::a' data=u'<a href="http://www.iana.org/domains/exa'>]
>>> s2.css('a')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/parsel/selector.py", line 227, in css
    return self.xpath(self._css2xpath(query))
  File "/usr/local/lib/python2.7/site-packages/parsel/selector.py", line 203, in xpath
    **kwargs)
  File "src/lxml/lxml.etree.pyx", line 1584, in lxml.etree._Element.xpath (src/lxml/lxml.etree.c:59349)
  File "src/lxml/xpath.pxi", line 257, in lxml.etree.XPathElementEvaluator.__init__ (src/lxml/lxml.etree.c:170478)
  File "src/lxml/apihelpers.pxi", line 19, in lxml.etree._assertValidNode (src/lxml/lxml.etree.c:16482)
AssertionError: invalid Element proxy at 140144569743064
```

Python 3.5, Scrapy 1.3.3 (https://github.com/scrapinghub/scrapinghub-stack-scrapy/tree/branch-1.3-py3)

```
root@1945e2154919:/# scrapy version -v
Scrapy    : 1.3.3
lxml      : 3.7.2.0
libxml2   : 2.9.3
cssselect : 1.0.1
parsel    : 1.1.0
w3lib     : 1.17.0
Twisted   : 16.6.0
Python    : 3.5.4 (default, Dec 12 2017, 16:43:39) - [GCC 4.9.2]
pyOpenSSL : 16.2.0 (OpenSSL 1.0.1t  3 May 2016)
Platform  : Linux-4.9.44-linuxkit-aufs-x86_64-with-debian-8.10
root@1945e2154919:/# scrapy shell "http://example.org"
2017-12-29 16:52:37 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: scrapybot)
(...)
>>> from six.moves import cPickle as pickle
>>> s2 = pickle.loads(pickle.dumps(response.selector, protocol=2))
>>> response.selector.css('a')
[<Selector xpath='descendant-or-self::a' data='<a href="http://www.iana.org/domains/exa'>]
>>> s2.css('a')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/usr/local/lib/python3.5/site-packages/parsel/selector.py", line 227, in css
    return self.xpath(self._css2xpath(query))
  File "/usr/local/lib/python3.5/site-packages/parsel/selector.py", line 203, in xpath
    **kwargs)
  File "src/lxml/lxml.etree.pyx", line 1584, in lxml.etree._Element.xpath (src/lxml/lxml.etree.c:59349)
  File "src/lxml/xpath.pxi", line 257, in lxml.etree.XPathElementEvaluator.__init__ (src/lxml/lxml.etree.c:170478)
  File "src/lxml/apihelpers.pxi", line 19, in lxml.etree._assertValidNode (src/lxml/lxml.etree.c:16482)
AssertionError: invalid Element proxy at 139862544625976
```

Python 3.6, Scrapy 1.3.3 (https://github.com/scrapinghub/scrapinghub-stack-scrapy/tree/branch-1.3-py3)

```
root@43e690443ca7:/# scrapy version -v
Scrapy    : 1.3.3
lxml      : 3.7.2.0
libxml2   : 2.9.3
cssselect : 1.0.1
parsel    : 1.1.0
w3lib     : 1.17.0
Twisted   : 16.6.0
Python    : 3.6.4 (default, Dec 21 2017, 01:35:12) - [GCC 4.9.2]
pyOpenSSL : 16.2.0 (OpenSSL 1.0.1t  3 May 2016)
Platform  : Linux-4.9.44-linuxkit-aufs-x86_64-with-debian-8.10
root@43e690443ca7:/# scrapy shell "http://example.org"
2017-12-29 16:54:49 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: scrapybot)
(...)
>>> from six.moves import cPickle as pickle
>>> s2 = pickle.loads(pickle.dumps(response.selector, protocol=2))
Traceback (most recent call last):
  File "<console>", line 1, in <module>
TypeError: can't pickle HtmlElement objects
```
@elacuesta
Member Author

Seems to be related to https://bugs.launchpad.net/lxml/+bug/736708.
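A common way around that lxml limitation is to pickle the source markup rather than the parsed tree, and re-parse on unpickle. The sketch below only illustrates the pattern; the class and its `_parse` helper are made up for this example and are not Scrapy's or parsel's actual code:

```python
import pickle

class TextBackedSelector:
    """Hypothetical pattern (not Scrapy's actual fix): keep the raw markup
    and rebuild the parsed tree on unpickle, so only picklable state is
    ever serialized."""
    def __init__(self, text):
        self.text = text
        self._tree = self._parse(text)

    @staticmethod
    def _parse(text):
        # Stand-in for lxml parsing; real code would build an element tree.
        return {'source': text}

    def __getstate__(self):
        return {'text': self.text}  # only the picklable source text

    def __setstate__(self, state):
        self.text = state['text']
        self._tree = self._parse(self.text)  # rebuild rather than restore

sel = TextBackedSelector('<a href="http://example.org">link</a>')
sel2 = pickle.loads(pickle.dumps(sel, protocol=2))  # usable after round-trip
```

The trade-off is re-parsing cost on load, but the restored object is fully functional instead of failing later with an invalid-proxy error.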
