Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pickle: use protocol 4, update tests #4541

Merged
merged 5 commits into from May 11, 2020
Merged

Conversation

elacuesta
Copy link
Member

@elacuesta elacuesta commented May 6, 2020

Fixes #4135

I started by adding a PICKLE_PROTOCOL setting, but that didn't work well with scrapy.squeues._pickle_serialize because it's defined as a top level function. There is probably a way around that, I just didn't think it was worth the trouble.

I simplified the testing of unserializable objects, since https://twistedmatrix.com/trac/ticket/7989 only affects py2 AFAICT.

A simple and non-rigorous benchmark shows a performance improvement of about 10%:

from tempfile import mkdtemp
from scrapy import Request
from scrapy.utils.test import get_crawler
from scrapy.squeues import PickleFifoDiskQueue

def test(request_count):
    jobdir = mkdtemp()
    crawler = get_crawler(settings_dict={"JOBDIR": jobdir})
    queue = PickleFifoDiskQueue(crawler, jobdir)
    for i in range(request_count):
        queue.push(Request("https://example.org/{}".format(i)))

Protocol 2 (3878b67):

In [2]: %timeit -n 10 test(10**5)                                                                                                                                                                                                             
7.63 s ± 75.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Protocol 4:

In [2]: %timeit -n 10 test(10**5)                                                                                                                                                                                                             
6.87 s ± 11.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Additionally, I reworded the comment about pickle.PicklingError being raised in Python <= 3.4. We no longer support 3.4, but I think we should keep catching pickle.PicklingError since it's raised when trying to pickle a lambda function in 3.5 <= Python <= 3.8. For the record, the test actually raises AttributeError("Can't pickle local object 'PickleFifoDiskQueueTest.test_non_pickable_object.<locals>.<lambda>'") in CPython, but pickle.PicklingError is raised in pypy.

@Gallaecio
Copy link
Member

Gallaecio commented May 6, 2020

Given the backward incompatibility implications, such as cache invalidation, I think a setting is still worth exploring.

Thinking of users, I think we should only upgrade the version of pickle automatically for new projects (by adding a setting override to the project settings template). We could then cover the setting in some topic or section about performance, as one way to improve Scrapy performance in CPU-bound scenarios (and disk-bound scenarios?).

@elacuesta
Copy link
Member Author

elacuesta commented May 6, 2020

I don't think we need to worry about that, since pickle.load(s) are able to identify the used protocol. From the pickle module docs:

The protocol version of the pickle is detected automatically, so no protocol argument is needed

@elacuesta
Copy link
Member Author

elacuesta commented May 6, 2020

Tests are failing only on pypy3. In the past, I've encountered that pypy handles pickle somewhat differently from CPython, but I don't usually test it locally before pushing. I'll take a look.

@codecov
Copy link

codecov bot commented May 6, 2020

Codecov Report

Merging #4541 into master will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #4541      +/-   ##
==========================================
+ Coverage   84.48%   84.49%   +0.01%     
==========================================
  Files         164      164              
  Lines        9903     9903              
  Branches     1482     1481       -1     
==========================================
+ Hits         8367     8368       +1     
+ Misses       1267     1266       -1     
  Partials      269      269              
Impacted Files Coverage Δ
scrapy/exporters.py 100.00% <100.00%> (ø)
scrapy/extensions/httpcache.py 95.43% <100.00%> (ø)
scrapy/extensions/spiderstate.py 100.00% <100.00%> (ø)
scrapy/squeues.py 96.77% <100.00%> (ø)
scrapy/utils/trackref.py 82.85% <0.00%> (-2.86%) ⬇️
scrapy/core/engine.py 87.87% <0.00%> (ø)
scrapy/utils/request.py 100.00% <0.00%> (ø)
scrapy/pipelines/media.py 97.29% <0.00%> (ø)
scrapy/utils/deprecate.py 95.38% <0.00%> (ø)
scrapy/http/request/form.py 95.34% <0.00%> (ø)
... and 2 more

@elacuesta elacuesta closed this May 7, 2020
@elacuesta elacuesta reopened this May 7, 2020
@elacuesta elacuesta changed the title Pickle adjustments Pickle: use protocol 4, update tests May 7, 2020
@kmike kmike merged commit 892467c into scrapy:master May 11, 2020
2 checks passed
@kmike
Copy link
Member

kmike commented May 11, 2020

Thanks @elacuesta!

@elacuesta elacuesta deleted the pickle-adjustments branch May 11, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants