New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG+1] Do not serialize unpickable objects (py3) #3082
Conversation
Codecov Report
@@ Coverage Diff @@
## master #3082 +/- ##
======================================
Coverage 82.1% 82.1%
======================================
Files 228 228
Lines 9586 9586
Branches 1384 1384
======================================
Hits 7871 7871
Misses 1456 1456
Partials 259 259
|
1848bb4
to
1b74543
Compare
self.assertRaises(ValueError, q.push, a) | ||
# Selectors should fail (lxml.html.HtmlElement objects can't be pickled) | ||
sel = Selector(text='<html><body><p>some text</p></body></html>') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line rang a bell for me.
Based on the test case, I think it is clear that Selector instances can't be pickled so let's enforce it by adding a __getstate__
method that fails (it is also quicker than going down into its ._root
element to find out it fails.
Aside of that, it is unclear to me if lxml.html.HTMLElement
should be pickeable or not. I don't think https://bugs.launchpad.net/lxml/+bug/736708 is relevant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting thread on the topic that recommends raising TypeError
for unpickeable objects https://mail.python.org/pipermail/python-list/2014-April/671009.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would you pickle HtmlElement (or other Element)? As I understand, to do this you need to save the whole tree along with an element, and this can be very inefficient if you're saving many elements.
If HtmlElement shouldn't be picklable, it should be a bug report to lxml, not a monkey patch.
It seems that picking of Selector objects is tricky indeeed; +1 to disallow pickling of Selectors explicitly, a great idea. It will be easy to re-enable it if we ever figure out how to pickle lxml sutuctures, and it'd prevent errors in the meantime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your comments guys 😄
As it says in https://bugs.launchpad.net/lxml/+bug/736708: "This has been discussed on the mailing list several times, please read the archives for reasons why this is not trivial". Seems like the archives search site is down though (or I couldn't find it 😅)
I don't think there's anything inherently unpickable about the Selector
class, in fact it could probably be pickled storing the text attribute instead of the HtmlElement
object, and recreate it when unpickling (not saying it would be efficient or a good idea, just possible, I think).
Anyway, defining a __getstate__
method that raises TypeError
sounds good to me, but it should probably go in parsel
, right? I see Scrapy's Selector
inherits from it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the fix should be in parsel.
Storing text of HtmlElement is not sufficient, as you can query siblings or parents of an element.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we close and implement changes in parsel then?
anyone going to report the issue upstream to lxml?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that at least tests from this PR can be useful, so we shouldn't necessarily close it. +1 to open lxml issue and fix it in parsel - @elacuesta do you have plans to do it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes @kmike, I plan to open a PR to parsel
. I think this PR shouldn't be closed, because the change to catch TypeError
in scrapy/sques.py
is still necessary. I'll update this shortly.
1b74543
to
e46700f
Compare
FYI: parsel 1.4.0 is released; it raises a TypeError on Selector / SelectorList pickling. |
Awesome, I'll update this PR accordingly. Thanks! |
6578de7
to
0d87e77
Compare
scrapy/squeues.py
Outdated
# Python<=3.4 raises pickle.PicklingError | ||
except (pickle.PicklingError, AttributeError) as e: | ||
# Python<=3.4 raises pickle.PicklingError here while | ||
# Python>=3.5 raises AttributeError and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does it mean 3.5 <= Python version < 3.6?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, let me clarify that. To be honest I have not seen any AttributeError
, I just left what the previous comment said, but I phrased my addition poorly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, thanks for noticing!
Looks good to me, thanks @elacuesta! |
This addresses #3054 partially, as it catches the exception raised bypickle
.However, since this exception is only raised in python >= 3.6, for earlier versions it's still not able to realize it shouldn't serialize some requests.
Fixes #3054.
Added a monkey patch to prevent the pickling oflxml.html.HtmlElement
objects.Updated: raise an exception when pickling Selector/SelectorList objects (scrapy/parsel#108), catch it in the queue.