
Increase chunk size for stats endpoints #93

Merged
1 commit merged into master on Jan 10, 2018
Conversation

vshlapakov
Contributor

job.items.stats() takes about twice as long as a plain request to the same endpoint. This was reported by @immerrr; I've finally found some time to prepare a patch, please take a look.

In [50]: j = conn.get_job('1/2/3')

In [51]: %time x1 = requests.get('https://storage.scrapinghub.com/items/%s/stats?apikey=%s' % (j.key, apikey)).json()
CPU times: user 80 ms, sys: 12 ms, total: 92 ms
Wall time: 582 ms

In [52]: %time x2 = j.items.stats();
CPU times: user 736 ms, sys: 4 ms, total: 740 ms
Wall time: 1.27 s

In [53]: x1 == x2
Out[53]: True

In short, the problem is that the current apiget logic always iterates over the response body in 512-byte chunks, returning an iterator of JSON lines. If the response contains a single large JSON line (~800KB in the example above), it takes significant time to accumulate the content and split it into lines by the delimiter.

In [64]: profile.run('list(x1.iter_lines())')
         6591 function calls in 0.516 seconds
   Ordered by: standard name
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    0.000    0.000    0.000    0.000 :0(isinstance)
     1646    0.004    0.000    0.004    0.000 :0(len)
     1645    0.004    0.000    0.004    0.000 :0(pop)
        1    0.000    0.000    0.000    0.000 :0(setprofile)
     1645    0.452    0.000    0.452    0.000 :0(splitlines)
        1    0.000    0.000    0.516    0.516 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 models.py:717(iter_content)
        1    0.000    0.000    0.000    0.000 models.py:734(generate)
        2    0.048    0.024    0.516    0.258 models.py:772(iter_lines)
        1    0.000    0.000    0.516    0.516 profile:0(list(x1.iter_lines()))
        0    0.000             0.000          profile:0(profiler)
     1646    0.008    0.000    0.012    0.000 utils.py:449(iter_slices)

There are different solutions to this problem, but the most straightforward one is simply to increase the chunk size when working with the stats endpoints.

In [18]: %time x = list(x1.iter_lines(chunk_size=512))
CPU times: user 384 ms, sys: 0 ns, total: 384 ms
Wall time: 370 ms

In [19]: %time x = list(x1.iter_lines(chunk_size=512*1024))
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 3.5 ms
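The shape of the fix can be sketched as follows (a hedged illustration, not the exact diff in this PR; the FakeResponse stub and the "totals" payload are assumptions standing in for a real requests.Response and real stats data):

```python
import json

# 512 KB instead of the 512 B default: stats responses are one large
# JSON object, not many short JSON lines.
STATS_CHUNK_SIZE = 512 * 1024

def parse_stats(response, chunk_size=STATS_CHUNK_SIZE):
    """Read a stats response whose body is a single large JSON line."""
    lines = list(response.iter_lines(chunk_size=chunk_size))
    return json.loads(lines[0])

# Usage with a stubbed response object (hypothetical, for illustration only):
class FakeResponse:
    def __init__(self, body):
        self._body = body
    def iter_lines(self, chunk_size=512):
        # the real requests method streams the body in chunk_size pieces
        yield self._body

stats = parse_stats(FakeResponse(b'{"totals": {"input_values": 3}}'))
print(stats)  # → {'totals': {'input_values': 3}}
```

Since iter_lines already accepts a chunk_size argument, the change is confined to the callers that know they are hitting a stats-like endpoint.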

@codecov

codecov bot commented Jan 10, 2018

Codecov Report

Merging #93 into master will increase coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #93      +/-   ##
==========================================
+ Coverage   93.26%   93.28%   +0.01%     
==========================================
  Files          28       28              
  Lines        1870     1875       +5     
==========================================
+ Hits         1744     1749       +5     
  Misses        126      126
Impacted Files Coverage Δ
scrapinghub/hubstorage/resourcetype.py 97.32% <100%> (+0.07%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 738d229...e019152.

@dangra dangra merged commit 166084d into master Jan 10, 2018
@dangra dangra deleted the simplify-stats-logic-v2 branch January 10, 2018 12:10
@immerrr
Contributor

immerrr commented Jan 10, 2018

Great stuff, thank you @vshlapakov
