Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+2] Improve gunzip performance for big files on Python 3 #3281

Merged
merged 1 commit into from Jun 1, 2018

Conversation

fbergen
Copy link
Contributor

@fbergen fbergen commented Jun 1, 2018

Copied from discussion on #3270 , partial fix for #2658

Problem
gunzip method runs forever for big files in python 3

bytes in Python 3 does not include string concatenation improvements that Python 2 has, effectively making the following take forever:

%%time
out = b""
for x in range(10000):
    out += b"." * 8000

Solution
Improve the gunzip method to not copy the entire output string in each iteration, but instead append to list + join, significantly improving performance from O(n^2)

Thanks @kmike for your help debugging!

@kmike
Copy link
Member

kmike commented Jun 1, 2018

//cc @nctl144 and @lopuhin, as you're working on Scrapy performance improvements and becnhamarks. This change fixes an issue explained here: #2658 (comment) - do you think large and gzip-compressed responses should be added to https://github.com/scrapy/scrapy-bench? Maybe add an option to reproduce not only HTML content of real-world websites from Alexa top, but raw binary responses and HTTP headers as well?

@kmike kmike changed the title Improve gunzip performance for big files on Python 3 [MRG+1] Improve gunzip performance for big files on Python 3 Jun 1, 2018
@kmike
Copy link
Member

kmike commented Jun 1, 2018

Looks great, thanks @fbergen! +1 to merge after tests pass.

@codecov
Copy link

codecov bot commented Jun 1, 2018

Codecov Report

Merging #3281 into master will not change coverage.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master    #3281   +/-   ##
=======================================
  Coverage   82.14%   82.14%           
=======================================
  Files         228      228           
  Lines        9599     9599           
  Branches     1385     1385           
=======================================
  Hits         7885     7885           
  Misses       1456     1456           
  Partials      258      258
Impacted Files Coverage Δ
scrapy/utils/gz.py 100% <100%> (ø) ⬆️

@cathalgarvey
Copy link
Contributor

Looks good to me, thanks @fbergen!

@cathalgarvey cathalgarvey changed the title [MRG+1] Improve gunzip performance for big files on Python 3 [MRG+2] Improve gunzip performance for big files on Python 3 Jun 1, 2018
@kmike kmike merged commit 3cf871c into scrapy:master Jun 1, 2018
@fbergen fbergen deleted the gunzipperf branch June 2, 2018 04:50
@kmike kmike added this to the v1.6 milestone Jul 4, 2018
@kmike kmike mentioned this pull request Jul 9, 2018
@kmike kmike modified the milestones: v1.6, v1.5.1 Jul 11, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants