Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+1] Do not serialize unpickable objects (py3) #3082

Merged
merged 4 commits into from Feb 8, 2018

Conversation

elacuesta
Copy link
Member

@elacuesta elacuesta commented Jan 18, 2018

This addresses #3054 partially, as it catches the exception raised by pickle.
However, since this exception is only raised in python >= 3.6, for earlier versions it's still not able to realize it shouldn't serialize some requests.

Fixes #3054.

Added a monkey patch to prevent the pickling of lxml.html.HtmlElement objects.

Updated: raise an exception when pickling Selector/SelectorList objects (scrapy/parsel#108), catch it in the queue.

@codecov
Copy link

codecov bot commented Jan 18, 2018

Codecov Report

Merging #3082 into master will not change coverage.
The diff coverage is 100%.

@@          Coverage Diff           @@
##           master   #3082   +/-   ##
======================================
  Coverage    82.1%   82.1%           
======================================
  Files         228     228           
  Lines        9586    9586           
  Branches     1384    1384           
======================================
  Hits         7871    7871           
  Misses       1456    1456           
  Partials      259     259
Impacted Files Coverage Δ
scrapy/squeues.py 100% <100%> (ø) ⬆️

@elacuesta elacuesta changed the title [WIP] Do not serialize unpickable objects (py3) Do not serialize unpickable objects (py3) Jan 19, 2018
@elacuesta elacuesta force-pushed the pickle-requests branch 2 times, most recently from 1848bb4 to 1b74543 Compare January 19, 2018 14:39
self.assertRaises(ValueError, q.push, a)
# Selectors should fail (lxml.html.HtmlElement objects can't be pickled)
sel = Selector(text='<html><body><p>some text</p></body></html>')
Copy link
Member

@dangra dangra Jan 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line rang a bell for me.

Based on the test case, I think it is clear that Selector instances can't be pickled so let's enforce it by adding a __getstate__ method that fails (it is also quicker than going down into its ._root element to find out it fails.

Aside of that, it is unclear to me if lxml.html.HTMLElement should be pickeable or not. I don't think https://bugs.launchpad.net/lxml/+bug/736708 is relevant.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting thread on the topic that recommends raising TypeError for unpickeable objects https://mail.python.org/pipermail/python-list/2014-April/671009.html

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would you pickle HtmlElement (or other Element)? As I understand, to do this you need to save the whole tree along with an element, and this can be very inefficient if you're saving many elements.

If HtmlElement shouldn't be picklable, it should be a bug report to lxml, not a monkey patch.

It seems that picking of Selector objects is tricky indeeed; +1 to disallow pickling of Selectors explicitly, a great idea. It will be easy to re-enable it if we ever figure out how to pickle lxml sutuctures, and it'd prevent errors in the meantime.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your comments guys 😄

As it says in https://bugs.launchpad.net/lxml/+bug/736708: "This has been discussed on the mailing list several times, please read the archives for reasons why this is not trivial". Seems like the archives search site is down though (or I couldn't find it 😅)

I don't think there's anything inherently unpickable about the Selector class, in fact it could probably be pickled storing the text attribute instead of the HtmlElement object, and recreate it when unpickling (not saying it would be efficient or a good idea, just possible, I think).

Anyway, defining a __getstate__ method that raises TypeError sounds good to me, but it should probably go in parsel, right? I see Scrapy's Selector inherits from it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the fix should be in parsel.

Storing text of HtmlElement is not sufficient, as you can query siblings or parents of an element.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we close and implement changes in parsel then?

anyone going to report the issue upstream to lxml?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that at least tests from this PR can be useful, so we shouldn't necessarily close it. +1 to open lxml issue and fix it in parsel - @elacuesta do you have plans to do it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes @kmike, I plan to open a PR to parsel. I think this PR shouldn't be closed, because the change to catch TypeErrorin scrapy/sques.py is still necessary. I'll update this shortly.

@kmike
Copy link
Member

kmike commented Feb 8, 2018

FYI: parsel 1.4.0 is released; it raises a TypeError on Selector / SelectorList pickling.

@elacuesta
Copy link
Member Author

Awesome, I'll update this PR accordingly. Thanks!

# Python<=3.4 raises pickle.PicklingError
except (pickle.PicklingError, AttributeError) as e:
# Python<=3.4 raises pickle.PicklingError here while
# Python>=3.5 raises AttributeError and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it mean 3.5 <= Python version < 3.6?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, let me clarify that. To be honest I have not seen any AttributeError, I just left what the previous comment said, but I phrased my addition poorly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks for noticing!

@kmike kmike changed the title Do not serialize unpickable objects (py3) [MRG+1] Do not serialize unpickable objects (py3) Feb 8, 2018
@kmike
Copy link
Member

kmike commented Feb 8, 2018

Looks good to me, thanks @elacuesta!

@lopuhin lopuhin merged commit 8d2240d into scrapy:master Feb 8, 2018
@elacuesta elacuesta deleted the pickle-requests branch March 15, 2018 16:24
@kmike kmike added this to the v1.6 milestone Jul 4, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants