optimize app engine frontend instance count #578

Closed
snarfed opened this Issue Dec 18, 2015 · 38 comments

snarfed commented Dec 18, 2015

we average ~2 app engine frontend instances right now, and they cost money (at least above the first), so i've been experimenting with different tuning knobs to see if i can get that average down a bit. this isn't high priority, and i'm also not sure i can have much impact without making drastic changes, so this is mostly just for tracking.

directly related:

  • #90 figure out why requests aren't running concurrently (enough)
  • #110 optimize poll task

loosely related:

  • #8 find an alternative to the resident backend for twitter streaming
  • #53 get backend memory usage below cap
  • #145 cache in instance memory as well as memcache
  • #300 try dedicated memcache

here are a few things i've tried recently, and the results. tl;dr: none of them had a meaningful effect on instance count.

  • 12/17 12:20pm: 30m fast poll frequency, poll concurrency 2
  • 12/18 7:21am: 15m fast poll frequency, concurrency 1
  • 12/18 10:40am: 15m fast poll frequency, poll concurrency 10

[two screenshots, 2015-12-18 3:13 pm]

snarfed commented Feb 2, 2016

the really annoying part is that we don't actually need more than one instance, but app engine's scheduler doesn't handle us well and thinks we do, so it usually keeps two or three around. from https://console.developers.google.com/appengine/instances?project=brid-gy&moduleId=default&versionId=7&duration=P1D :

[screenshot: 2016-02-02 2:48 pm]

this is the norm. one instance serves almost all of the requests; the remaining instance(s) serve almost none.

details on how app engine's request scheduler/auto scaler works in the links below. i've tried lots of different values (including default) for Max Idle Instances and Min Pending Latency, different instance classes, and other settings, but haven't managed to change this behavior. :/

basic and manual scaling are tempting, but only get 8 free instance hours per day, as opposed to automatic scaling's 28, so they're right out.

@kylewm and i talked a bit on IRC just now about profiling. our request volume is mostly poll and propagate (~87% and ~9% respectively), both of which i expect spend most of their time blocked on network I/O. good idea to check, though. Appstats is great for that; we can turn it on here.
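(for reference, turning Appstats on in the Python 2.7 standard runtime is usually just a WSGI middleware hook in appengine_config.py. a minimal sketch, not necessarily how we'd end up wiring it into bridgy:)

```python
# appengine_config.py -- minimal sketch of enabling Appstats; this is the
# documented middleware hook, not bridgy's actual config.
def webapp_add_wsgi_middleware(app):
    from google.appengine.ext.appstats import recording
    # record each request's RPCs; results show up at /_ah/stats
    # (app.yaml also needs "builtins: - appstats: on" to serve that UI)
    return recording.appstats_wsgi_middleware(app)
```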

the poll task queue is currently configured to run up to 2 at a time, propagate up to 1 at a time, and we have threadsafe: yes in our app.yaml to allow instances to serve multiple requests concurrently. last time i checked with bridgy's instance info recording, this was indeed happening.
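roughly what that looks like in config; the rate values below are placeholders, only the concurrency numbers and the threadsafe flag come from this comment:

```yaml
# queue.yaml (sketch; rates are placeholders)
queue:
- name: poll
  rate: 1/s
  max_concurrent_requests: 2
- name: propagate
  rate: 1/s
  max_concurrent_requests: 1
```

```yaml
# app.yaml (excerpt)
threadsafe: yes   # lets a single instance serve multiple requests concurrently
```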

snarfed commented Feb 2, 2016

for reference, here's bridgy's lifetime cost history. it breaks down to roughly 2/3 frontend instances, 1/3 datastore ops (reads/writes).

[screenshot: costs]

snarfed commented Feb 11, 2016

tried another experiment: removed max_concurrent_requests: 4 from the poll queue and let it run as many at once as it wanted. part of the motivation was to let backlogs clear faster and avoid telling users "next poll in 1m" for long periods of time, but also a (weak) hypothesis was that running more polls simultaneously would better utilize instances.
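(concretely, the change just drops the throttle from the poll queue stanza; the rate value here is a placeholder:)

```yaml
# queue.yaml (sketch)
queue:
- name: poll
  rate: 1/s
  # max_concurrent_requests: 4   # removed for this experiment
```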

[screenshot: 2016-02-11 9:50 am]

in practice, it didn't clearly help or hurt. i flipped the switch feb 10 ~11am. the main difference i see is that our active instance count was more jagged and variable because we weren't throttling polls (and thus load).

i also don't know why feb 9 evening averaged so low, but afaik it wasn't due to anything we did.

snarfed commented Feb 11, 2016

oh, right, polls were unusually slow feb 9 evening, i suspect due to one of the silo APIs having issues and being slow itself. not sure why that would drop our average instance count, but it is the one clear correlation.

[screenshot: 2016-02-11 9:55 am]

snarfed commented Feb 19, 2016

i started experimenting with this again yesterday. coincidentally, poll latency also spiked a bit before i started experimenting, and has stayed pretty high.

[screenshot: latency]

the only change i made that could have caused that was deploying #620 and snarfed/granary@74aed5c, both of which really seem unrelated. maybe i accidentally deployed a different version of some libraries in my virtualenv? (although deploy.sh should have caught that?) maybe some bad new interaction in app engine 1.9.33? i don't know. :(

snarfed commented Feb 19, 2016

anecdotally, it seems like the cause might be slower HTTP fetches during OPD. i looked at a few poll logs and saw a number of >10s requests when fetching different people's web pages. :(

snarfed commented Feb 20, 2016

here are the experiment results, even if they're tainted by the high latency.

[screenshot: all]

the magic single-instance period starting at 10am was the first time with min_idle_instances=0, max_idle_instances=1, but the poll queue was caught up when that started, which i suspect mattered. otherwise, unclear.
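(for reference, those knobs live under automatic_scaling in app.yaml. a sketch with just the two values from this experiment, everything else left at defaults:)

```yaml
# app.yaml (excerpt)
automatic_scaling:
  min_idle_instances: 0   # don't keep a warm spare instance around
  max_idle_instances: 1   # cap idle instances at one
```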

snarfed commented Feb 21, 2016

latency is slowly maybe getting back to normal.

[screenshot: 2016-02-20 10:19 pm]

snarfed commented Feb 21, 2016

well that's a pleasant surprise.

[screenshot: instances]

snarfed added a commit that referenced this issue Feb 22, 2016

snarfed commented Feb 23, 2016

definitely promising. looks like we may be able to sustain ~1.5 instances on average. we'll see.

[screenshot: 2016-02-23 9:23 am]

snarfed referenced this issue Feb 24, 2016: optimize OPD refetch #624 (closed)
snarfed commented Feb 24, 2016

seeing substantial poll latency spikes again. anecdotally, it looks like many of the long requests are doing OPD refetch. filed #624 to optimize.

kylewm commented Feb 26, 2016

huh, there are weird big gaps between RPCs. Not sure if that's the process sleeping, or if it's actually doing something processor-intensive in that time.

[screenshot: polling profile]

here's another one

[screenshot: another profile]

snarfed commented Feb 26, 2016

whoa, those are big gaps indeed. 15-25s?! i mean, i'm sure some of our compute is inefficient, but not that inefficient.

i guess we could try to repro it in dev_appserver, refetching a local web server, and comment out parts of the code in the gap and binary search down until we isolate it. maybe tedious, but could work.

snarfed commented Feb 26, 2016

maybe it's context switching to other requests in other threads? hard to believe though. i'd think it should be doing that during the API calls, and i'd also hope no self respecting scheduler would starve a thread for 20s at a time.

maybe worth searching/asking on SO or https://groups.google.com/forum/m/#!forum/google-appengine ?

kylewm commented Feb 26, 2016

I found one where the spacing is much more regular
[screenshot: 2016-02-26 07:55]

After adding some more debug output, I'm finding that mf2 parsing regularly takes longer than the fetch itself

D 2016-02-26 16:24:34.896 fetching post permalink https://cmal.info/2016/online-legal-publishers-squabble-over-the-right-to-copyright-the
D 2016-02-26 16:24:35.248 done fetching. now parsing post for mf2 https://cmal.info/2016/online-legal-publishers-squabble-over-the-right-to-copyright-the
D 2016-02-26 16:24:36.749 done parsing https://cmal.info/2016/online-legal-publishers-squabble-over-the-right-to-copyright-the

That's 350ms for fetching, 1500ms for parsing!

snarfed commented Feb 26, 2016

hoooooly crap, really?!? that's...not OK.

also profiling++ obviously. thanks for doing this!

i wonder if switching HTML parsers would help? i guess it's time to cpu profile mf2py.
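(something like this should show where mf2py spends its time; a rough sketch, the URL is a placeholder:)

```python
# cpu-profile a single mf2py.parse call on a fetched page. rough sketch.
import cProfile
import pstats

import mf2py
import requests

url = 'https://example.com/post'  # placeholder permalink
html = requests.get(url).text

profiler = cProfile.Profile()
profiler.enable()
mf2py.parse(doc=html, url=url)
profiler.disable()

# top 20 functions by cumulative time
pstats.Stats(profiler).sort_stats('cumulative').print_stats(20)
```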

kylewm commented Feb 26, 2016

Just doing some basic dumb testing now locally. lxml doesn't seem significantly faster :(


kylewm commented Feb 26, 2016

Huh, actually html5lib uses lxml if lxml is installed. And it's a hell of a lot faster with it installed (but still slow). Parsing my homepage and all permalinks from there:

total time in requests.get: 0.322495
total time in BeautifulSoup ctor: 0.999292
total time in mf2py.parse: 2.284201
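(a rough sketch of the kind of harness that produces numbers like these; the URL is a placeholder, and it assumes requests, beautifulsoup4 with lxml, and mf2py are installed:)

```python
# time fetch vs BeautifulSoup construction vs full mf2 parse for one page.
import time

import mf2py
import requests
from bs4 import BeautifulSoup

url = 'https://example.com/'  # placeholder homepage

start = time.time()
html = requests.get(url).text
fetch_secs = time.time() - start

start = time.time()
BeautifulSoup(html, 'lxml')  # uses the lxml parser, which must be installed
soup_secs = time.time() - start

start = time.time()
mf2py.parse(doc=html, url=url)
parse_secs = time.time() - start

print('requests.get %.3fs, BeautifulSoup ctor %.3fs, mf2py.parse %.3fs'
      % (fetch_secs, soup_secs, parse_secs))
```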

snarfed added the now label Feb 26, 2016

kylewm commented Feb 27, 2016

I'm at a loss for speeding up mf2. And making the fetches asynchronous is moot when so much more time is spent processing the result. I'll continue this on #624

snarfed commented Mar 2, 2016

whoa, they added a "Billed instances estimate" (green) to the instances graph, and it's noticeably below the total instance count (blue), which is what i thought they were charging for. nice!

[screenshot: 2016-03-01 9:46 pm]

snarfed commented Mar 2, 2016

...hmm. oddly the billed instances estimate is sometimes higher than the total instance count. confusing. 🤔
[screenshot: 2016-03-02 9:59 am]

snarfed commented Mar 3, 2016

we get Twitter favorites by fetching and scraping HTML for each tweet, in serial. i thought parallelizing these with async urlfetch could be a big win, but i looked at ~10 Twitter polls anecdotally, and most didn't fetch any favorites HTML pages, and the rest only fetched one. looks like our caching logic is working. yay. measure first, then optimize, and all that!
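(for reference, the parallel version i had in mind would look roughly like this with app engine's async urlfetch; the URLs passed in would be the scraped pages, which are placeholders here:)

```python
# rough sketch of fetching several pages in parallel with async urlfetch.
from google.appengine.api import urlfetch

def fetch_all(urls, deadline=20):
    """Start all fetches up front, then block on each result."""
    rpcs = []
    for url in urls:
        rpc = urlfetch.create_rpc(deadline=deadline)
        urlfetch.make_fetch_call(rpc, url)
        rpcs.append(rpc)
    # get_result() blocks until that particular fetch finishes
    return [rpc.get_result().content for rpc in rpcs]
```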

snarfed commented Mar 5, 2016

we've been back on min_idle_instances: 0, max_idle_instances: 1 for a few days now, and it definitely does cut down instance count, but anecdotally it also hurts interactive latency pretty badly. :/ (cc @tantek @aaronpk.)

as an experiment, i dropped max_idle_instances: 1 and kept min_idle_instances: 0 to see if that helps latency while still keeping some of the instance count improvement. we'll see.

snarfed added a commit that referenced this issue Mar 7, 2016

yet another auto scaling tweak: remove min_idle_instances: 0
been running with this for a few days. it looks like it's just max_idle_instances: 1 that's keeping our instance count down, and i *think* removing min_idle_instances: 0 has improved latency.

for #578
snarfed commented Mar 14, 2016

looking at the last week, the remaining 8k qpd include:

  • 1200 WordPress (every fetch is repeated 3x :( )
  • 624 http://airbornesurfer.com crawling a bunch of his twitter and FB mf2 handlers. :/
  • 525 Domain Re-Animator Bot (ignores robots.txt; will blacklist. sketch below.)
  • 500 static
  • 400 Known (conveniently exactly 1/3 of WordPress :P)
  • 304 X-ray
  • 82 Bridgy (what the?)
  • 60 Googlebot (of which 2/3 are the home page)
  • 53 Yahoo
  • 40 bingbot

searching for all bots (ie *[Bb]ot*) shows ~930 qpd, which is totally reasonable (if very inexact).
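(the blacklist can be as simple as a user agent check in WSGI middleware. a sketch, not necessarily how the commit below does it; the substring list and response are assumptions:)

```python
# sketch of rejecting a bot by user agent at the WSGI layer.
BLOCKED_UA_SUBSTRINGS = ('Domain Re-Animator',)

class BlockBots(object):
    """Return 403 for requests whose User-Agent matches a blocked substring."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        ua = environ.get('HTTP_USER_AGENT', '')
        if any(s in ua for s in BLOCKED_UA_SUBSTRINGS):
            start_response('403 Forbidden', [('Content-Type', 'text/plain')])
            return [b'']
        return self.app(environ, start_response)

# usage: application = BlockBots(application)
```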

snarfed added a commit that referenced this issue Mar 14, 2016

blacklist Domain Re-Animator bot (https://domainreanimator.com/)
scummy looking SEO tool that ignores robots.txt. ugh.
background: #578 (comment)
kylewm commented Mar 14, 2016

hmm, I know airbornesurfer IRL, sort of doubt he knows that's happening (he should just be on the vanilla wordpress plugin).

kylewm commented Mar 14, 2016

1200 WordPress (every fetch is repeated 3x :( )

micro-caching might help?

snarfed commented Mar 14, 2016

yeah i guess that'd be #15. (that issue number tho! :P)

also i pinged airbornesurfer: https://twitter.com/schnarfed/status/709519485795840001

snarfed commented Mar 14, 2016

more importantly, 1200 qpd is <1qpm, and those requests average 1-2s, so i don't think that's really our problem. same with all the other requests in that breakdown.

i think app engine's auto scaling just really doesn't like our workload. if we can convince ourselves that min_idle_instances: 0 doesn't really hurt latency too much - which i'm not at all sure of - i vote to leave that on and call it a day.

snarfed commented Mar 21, 2016

go home billed instance estimate ur drunk

[screenshot: billed]

snarfed added a commit that referenced this issue Mar 21, 2016

snarfed commented Mar 29, 2016

so after tons of work and soul searching and waffling, i think all the optimization here was nice and appreciated, but really the one thing that did the trick all by itself (and didn't drive our latency through the roof) was max_idle_instances: 1. whee. :P

[screenshot: 2016-03-29 3:31 pm]

tentatively closing for now.

snarfed closed this Mar 29, 2016

snarfed commented Mar 30, 2016

incidentally, wow, billed instance estimate really does just get totally drunk sometimes. no clue why. funny.

[screenshot: 2016-03-30 9:44 am]

snarfed added a commit to snarfed/granary that referenced this issue Apr 8, 2016

snarfed added a commit to snarfed/oauth-dropins that referenced this issue Apr 8, 2016

snarfed commented May 8, 2016

weird new instance scaling pattern recently. we had a good month or two where our instance count mostly hung out around 1, but now it's jumping up to 2 for periods of ~12h (and right now >24h). i haven't found that it corresponds to anything yet (request volume, cron jobs, backups, etc).

[screenshot: 2016-05-08 9:06 am]

for reference, we got billed for 38 instance hrs on 5/6, 39 on 5/5, and 40 on 5/4, ie much closer to the blue line than the yellow Billed instance estimate. boo.

snarfed commented May 31, 2016

the weird pattern continues. silly. at least they're billing us for close to the estimate (yellow) instead of the total (blue).

[screenshot: 2016-05-31 9:56 am]

snarfed commented Jan 9, 2017

weird pattern is gone, more or less.

[screenshot: 2017-01-08 4:16 pm]

snarfed added a commit that referenced this issue Mar 26, 2018

add new 'background' service for running poll, propagate, etc
for #578. the granary/o-d python 3 migration bumped up our instance count for some reason, not sure why except that maybe python-future adds meaningfully more overhead?

this moves task queue and cron processing into a separate 'background' service. this decouples quick, light user-facing requests (front page, user pages, etc) from heavier background tasks (poll, propagate, etc), which makes more sense to app engine's scheduler. it now runs the frontend 'default' service on ~1.5 instances avg, and lets me control the 'background' service's instances more directly with manual or basic scaling.

background:
https://cloud.google.com/appengine/docs/standard/python/an-overview-of-app-engine
https://cloud.google.com/appengine/docs/standard/python/modules/converting
https://cloud.google.com/appengine/docs/standard/python/config/appref#basic_scaling
https://cloud.google.com/appengine/docs/standard/python/config/appref#manual_scaling
snarfed commented Mar 26, 2018

the granary/o-d python 3 migration bumped up our instance count for some reason, not sure why except that maybe python-future adds meaningfully more overhead?

this moves task queue and cron processing into a separate 'background' service. this decouples quick, light user-facing requests (front page, user pages, etc) from heavier background tasks (poll, propagate, etc), which app engine's scheduler understands and handles better. it now runs the frontend 'default' service on ~1.5 instances avg, and lets me control the 'background' service's instances more directly with manual or basic scaling.

(i'd include instance count graphs, but they're buggy right now and the console isn't showing them. :/ )

background:
https://cloud.google.com/appengine/docs/standard/python/an-overview-of-app-engine
https://cloud.google.com/appengine/docs/standard/python/modules/converting
https://cloud.google.com/appengine/docs/standard/python/config/appref#basic_scaling
https://cloud.google.com/appengine/docs/standard/python/config/appref#manual_scaling
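(roughly what the split looks like in config; the service name matches the commit, but the scaling values, handler script, and rates are assumptions, not the actual files:)

```yaml
# background.yaml (sketch): a separate service for task queue and cron work.
service: background
runtime: python27
api_version: 1
threadsafe: yes

basic_scaling:
  max_instances: 1
  idle_timeout: 10m

handlers:
- url: /.*
  script: background.application
  login: admin   # task queue and cron requests can reach admin-only handlers
```

and the heavy queues get pointed at it:

```yaml
# queue.yaml (sketch; rates are placeholders)
queue:
- name: poll
  rate: 1/s
  target: background
- name: propagate
  rate: 1/s
  target: background
```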

snarfed added a commit that referenced this issue Aug 17, 2018

temporarily switch poll and propagate back to the frontend service
for #578. experiment to see how it handles the load now that facebook is gone.

snarfed added a commit that referenced this issue Aug 17, 2018

temporarily switch cron jobs back to the frontend service
for #578. goes with 0024e66. experiment to see how it handles the load now that facebook is gone.
snarfed commented Aug 18, 2018

now that facebook is dead (#826 #817), i tried an experiment to see if a single frontend instance could serve both interactive requests and poll/propagate tasks. i disabled the dedicated background service and moved task queues back to targeting the frontend (default) service.

turns out...no. even though a single frontend instance may theoretically be enough to handle the load, app engine's scheduler still just doesn't like serving two very differently shaped workloads from the same service. it ran 3-4 instances on average, and billed us for 2-3.

i flipped the switch 8/16 1pm PST. background service requests:

[screenshot: 2018-08-18 4:05 pm]

frontend service:

[screenshot: 2018-08-18 3:58 pm]

frontend instances:

[screenshot: 2018-08-18 4:04 pm]

cost:

[screenshot: 2018-08-18 3:55 pm]

billable frontend instance hours: 60.73 8/17 (full day), 24.00 8/16 (half day).

ah well. it was worth a shot. turning on the dedicated background service now.

snarfed added a commit that referenced this issue Aug 20, 2018

snarfed added a commit that referenced this issue Aug 20, 2018
