Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+2] Improve gunzip performance for big files on Python 3 #3281

Merged
merged 1 commit into from Jun 1, 2018

Conversation

@fbergen
Copy link
Contributor

@fbergen fbergen commented Jun 1, 2018

Copied from discussion on #3270 , partial fix for #2658

Problem
gunzip method runs forever for big files in python 3

bytes in Python 3 does not include string concatenation improvements that Python 2 has, effectively making the following take forever:

%%time
out = b""
for x in range(10000):
    out += b"." * 8000

Solution
Improve the gunzip method to not copy the entire output string in each iteration, but instead append to list + join, significantly improving performance from O(n^2)

Thanks @kmike for your help debugging!

@kmike
Copy link
Member

@kmike kmike commented Jun 1, 2018

//cc @nctl144 and @lopuhin, as you're working on Scrapy performance improvements and becnhamarks. This change fixes an issue explained here: #2658 (comment) - do you think large and gzip-compressed responses should be added to https://github.com/scrapy/scrapy-bench? Maybe add an option to reproduce not only HTML content of real-world websites from Alexa top, but raw binary responses and HTTP headers as well?

@kmike kmike changed the title Improve gunzip performance for big files on Python 3 [MRG+1] Improve gunzip performance for big files on Python 3 Jun 1, 2018
@kmike
Copy link
Member

@kmike kmike commented Jun 1, 2018

Looks great, thanks @fbergen! +1 to merge after tests pass.

@codecov
Copy link

@codecov codecov bot commented Jun 1, 2018

Codecov Report

Merging #3281 into master will not change coverage.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master    #3281   +/-   ##
=======================================
  Coverage   82.14%   82.14%           
=======================================
  Files         228      228           
  Lines        9599     9599           
  Branches     1385     1385           
=======================================
  Hits         7885     7885           
  Misses       1456     1456           
  Partials      258      258
Impacted Files Coverage Δ
scrapy/utils/gz.py 100% <100%> (ø) ⬆️
@cathalgarvey
Copy link
Contributor

@cathalgarvey cathalgarvey commented Jun 1, 2018

Looks good to me, thanks @fbergen!

@cathalgarvey cathalgarvey changed the title [MRG+1] Improve gunzip performance for big files on Python 3 [MRG+2] Improve gunzip performance for big files on Python 3 Jun 1, 2018
@kmike kmike merged commit 3cf871c into scrapy:master Jun 1, 2018
2 checks passed
2 checks passed
codecov/patch 100% of diff hit (target 82.14%)
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@fbergen fbergen deleted the fbergen:gunzipperf branch Jun 2, 2018
@kmike kmike added this to the v1.6 milestone Jul 4, 2018
@kmike kmike mentioned this pull request Jul 9, 2018
@kmike kmike modified the milestones: v1.6, v1.5.1 Jul 11, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

3 participants
You can’t perform that action at this time.