
Increase chunk size for stats endpoints #93

Merged
1 commit merged into master on Jan 10, 2018
Conversation

vshlapakov
Contributor

job.items.stats() takes about twice as long as a plain request to the same endpoint. This was reported by @immerrr; I've finally found some time to prepare a patch, please take a look.

In [50]: j = conn.get_job('1/2/3')

In [51]: %time x1 = requests.get('https://storage.scrapinghub.com/items/%s/stats?apikey=%s' % (j.key, apikey)).json()
CPU times: user 80 ms, sys: 12 ms, total: 92 ms
Wall time: 582 ms

In [52]: %time x2 = j.items.stats();
CPU times: user 736 ms, sys: 4 ms, total: 740 ms
Wall time: 1.27 s

In [53]: x1 == x2
Out[53]: True

In short, the problem is that the current apiget logic always iterates over the response body in 512-byte chunks, returning an iterator of JSON lines. If the response contains a single large JSON line (~800KB in the example above), it takes significant time to accumulate the content and split it into lines by the delimiter.

In [64]: profile.run('list(x1.iter_lines())')
         6591 function calls in 0.516 seconds
   Ordered by: standard name
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    0.000    0.000    0.000    0.000 :0(isinstance)
     1646    0.004    0.000    0.004    0.000 :0(len)
     1645    0.004    0.000    0.004    0.000 :0(pop)
        1    0.000    0.000    0.000    0.000 :0(setprofile)
     1645    0.452    0.000    0.452    0.000 :0(splitlines)
        1    0.000    0.000    0.516    0.516 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 models.py:717(iter_content)
        1    0.000    0.000    0.000    0.000 models.py:734(generate)
        2    0.048    0.024    0.516    0.258 models.py:772(iter_lines)
        1    0.000    0.000    0.516    0.516 profile:0(list(x1.iter_lines()))
        0    0.000             0.000          profile:0(profiler)
     1646    0.008    0.000    0.012    0.000 utils.py:449(iter_slices)

There are different solutions to this problem, but the most straightforward one is simply to increase the chunk size when working with the stats endpoints.

In [18]: %time x = list(x1.iter_lines(chunk_size=512))
CPU times: user 384 ms, sys: 0 ns, total: 384 ms
Wall time: 370 ms

In [19]: %time x = list(x1.iter_lines(chunk_size=512*1024))
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 3.5 ms
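The shape of the fix can be sketched as follows (a hedged illustration, not the exact diff in this PR; the FakeResponse stub and the "totals" payload are assumptions standing in for a real requests.Response and real stats data):

```python
import json

# 512 KB instead of the 512 B default: stats responses are one large
# JSON object, not many short JSON lines.
STATS_CHUNK_SIZE = 512 * 1024

def parse_stats(response, chunk_size=STATS_CHUNK_SIZE):
    """Read a stats response whose body is a single large JSON line."""
    lines = list(response.iter_lines(chunk_size=chunk_size))
    return json.loads(lines[0])

# Usage with a stubbed response object (hypothetical, for illustration only):
class FakeResponse:
    def __init__(self, body):
        self._body = body
    def iter_lines(self, chunk_size=512):
        # the real requests method streams the body in chunk_size pieces
        yield self._body

stats = parse_stats(FakeResponse(b'{"totals": {"input_values": 3}}'))
print(stats)  # → {'totals': {'input_values': 3}}
```

Since iter_lines already accepts a chunk_size argument, the change is confined to the callers that know they are hitting a stats-like endpoint.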

@codecov

codecov bot commented Jan 10, 2018

Codecov Report

Merging #93 into master will increase coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #93      +/-   ##
==========================================
+ Coverage   93.26%   93.28%   +0.01%     
==========================================
  Files          28       28              
  Lines        1870     1875       +5     
==========================================
+ Hits         1744     1749       +5     
  Misses        126      126
Impacted Files Coverage Δ
scrapinghub/hubstorage/resourcetype.py 97.32% <100%> (+0.07%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 738d229...e019152.

@dangra dangra merged commit 166084d into master Jan 10, 2018
@dangra dangra deleted the simplify-stats-logic-v2 branch January 10, 2018 12:10
@immerrr
Contributor

immerrr commented Jan 10, 2018

Great stuff, thank you @vshlapakov
