server command-logging release tracking #630

Closed

erosson opened this issue Apr 26, 2015 · 11 comments

erosson commented Apr 26, 2015

erosson added this to the pre-1.1 milestone Apr 26, 2015

erosson commented Apr 26, 2015

10 minutes in, zoomed past the autoscaling threshold. let's wait and see if it autoscales.

...2 minutes later, not bad:

2015-04-26 15:43:11 UTC-0700 INFO Adding instance 'i-046851f8' to your environment.

latency jumped to 0.2s (from 0.05s), but went back down. cpu's hovering at 20%. no worries.


erosson commented Apr 26, 2015

app servers are holding up fine. we could even experiment with lowering the autoscale threshold. latency's steady.

db space/iops might be a concern. raising it now - expect to see error responses for a bit.
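
for future reference, the storage bump can also be scripted - a minimal boto3 sketch, with a made-up instance name and illustrative size:

```python
# minimal sketch, assuming boto3; instance name and size are made up.
# raising AllocatedStorage also raises the baseline iops, since gp2
# iops scale with storage size.
import boto3

rds = boto3.client("rds", region_name="us-east-1")
rds.modify_db_instance(
    DBInstanceIdentifier="swarm-server-db",  # hypothetical identifier
    AllocatedStorage=100,                    # illustrative size, in GB
    ApplyImmediately=True,                   # run the storage migration right away
)
```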

erosson added a commit to swarmsim/swarm-server-sails that referenced this issue Apr 26, 2015

erosson commented Apr 26, 2015

db disk space (and iops) upgraded, but from reading the docs it's clear iops are going to be a bottleneck. raising the db space helps (disk space and available iops are linked) but it's cost-prohibitive to throw more space at it, and we're going to run out of burst iops soon. stopped writing new command logs; now only write character-state updates - let's see if that helps.

longer term, redis might be a good choice for command logging - it's very write-heavy.
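
if we do go that way, a rough sketch of what write-heavy command logging against redis could look like (redis-py; key names and payload shape are made up):

```python
# rough sketch, assuming redis-py; key names and payload shape are made up.
# RPUSH is an O(1) append, which suits a write-heavy command log far better
# than per-command rows in the relational db.
import json
import time

import redis

r = redis.StrictRedis(host="localhost", port=6379)

def log_command(character_id, command):
    # append one command to that character's log (hypothetical key scheme)
    entry = json.dumps({"t": time.time(), "cmd": command})
    r.rpush("cmdlog:%s" % character_id, entry)

log_command(42, {"type": "buy", "unit": "drone", "count": 10})
```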


erosson commented Apr 26, 2015

...haha, oops - that burst of iops was from the db size migration. with that done, normal operation puts us at about [2 write iops and negligible read iops] per gb, which is sustainable. removing command logging ~halved that. adding redis is not necessary right now. aws pricing calculator is a little confusing, but https://aws.amazon.com/rds/pricing/ is clear that non-provisioned iops (what we're using) are free with enough storage.

cpu, db, latency are all under control. network i/o now looks like the biggest concern: is autoscaling at 6mb/min networkout (the default) necessary? can we gzip requests to reduce it? responses are already gzipped.
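
gzipping requests would mostly be a client-side change plus server middleware that honors Content-Encoding on incoming bodies - a hedged client-side sketch in python (endpoint is made up):

```python
# client-side sketch, assuming the `requests` library and a made-up endpoint.
# the server would need middleware that decompresses bodies sent with
# Content-Encoding: gzip; without it this request would fail or be garbled.
import gzip
import json

import requests

payload = json.dumps({"characterId": 42, "commands": []}).encode("utf-8")
compressed = gzip.compress(payload)

resp = requests.post(
    "https://api.example.com/sync",  # hypothetical endpoint
    data=compressed,
    headers={
        "Content-Type": "application/json",
        "Content-Encoding": "gzip",
    },
)
print(resp.status_code, len(payload), "->", len(compressed))
```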


erosson commented Apr 27, 2015

to automatically kill instances that fail the app healthcheck: ec2 > autoscaling group > change healthcheck type from ec2 to elb

I've raised the autoscaling thresholds a lot, since things seem to be under control now.
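
the same healthcheck change expressed against the api, for future reference - a boto3 sketch (the group name is whatever beanstalk generated, so it's a placeholder here):

```python
# sketch of the console change above, assuming boto3; the group name is a
# placeholder for whatever beanstalk generated. switching HealthCheckType
# from EC2 to ELB makes the group replace instances that fail the load
# balancer's app-level healthcheck, not just ones whose hardware dies.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="awseb-swarm-server-AWSEBAutoScalingGroup",  # placeholder name
    HealthCheckType="ELB",
    HealthCheckGracePeriod=300,  # give new instances time to boot before checking
)
```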


erosson commented Apr 27, 2015

overnight:

  • autoscaling went crazy overnight, flapping between a handful of app servers, and latency and 500s went through the roof. reconfigured it to make it much harder to remove app servers once they're in service (sketch after this list), and added a few manually. adding more boxes quickly brought latency and 500s back down; watching to see if autoscaling keeps them around this time.
  • looks like db cpu is consistently using a few burst credits; that becomes unsustainable later this evening or tomorrow morning. upgrading the db box now.
  • app servers are going to run out of cpu credits soon - that might be why they kept dying earlier. we either need to add more boxes to spread the load or upgrade to bigger ones; haven't done this yet.
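
sketch of the "harder to scale in" reconfiguration, assuming boto3 and the beanstalk environment from the monitoring links; the MinSize and threshold values are illustrative:

```python
# sketch of the scale-in reconfiguration, assuming boto3 and the environment
# id from the monitoring links. MinSize keeps a floor of instances; a lower
# LowerThreshold plus a longer BreachDuration means NetworkOut has to stay
# quiet for much longer before the trigger removes a box. values illustrative.
import boto3

eb = boto3.client("elasticbeanstalk", region_name="us-east-1")
eb.update_environment(
    EnvironmentId="e-s2ppt9c929",
    OptionSettings=[
        {"Namespace": "aws:autoscaling:asg", "OptionName": "MinSize", "Value": "4"},
        {"Namespace": "aws:autoscaling:trigger", "OptionName": "LowerThreshold", "Value": "1000000"},
        {"Namespace": "aws:autoscaling:trigger", "OptionName": "BreachDuration", "Value": "30"},
    ],
)
```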


erosson commented Apr 29, 2015

we're still choking once enough time passes: all the hosts fall over, and requests either aren't received or return 500. added cpu credits to the metrics graphed on https://console.aws.amazon.com/elasticbeanstalk/home?region=us-east-1#/environment/monitoring?applicationName=swarm-server-sails&environmentId=e-s2ppt9c929 , and it's clear this happens when we run out of cpu credits. upgraded to bigger (pricier) boxes with more cpu credits, and we should probably alarm (or autoscale?) when cpu credits run short.

db is stable.
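
a possible shape for that cpu-credit alarm, sketched with boto3 - the alarm name, threshold, and sns topic are placeholders, and the instance id is just the one from the earlier log line:

```python
# possible shape for the cpu-credit alarm, assuming boto3. CPUCreditBalance is
# a per-instance EC2 metric, so this watches a single instance; the alarm name,
# threshold, and sns topic are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
cloudwatch.put_metric_alarm(
    AlarmName="swarm-app-cpu-credits-low",  # hypothetical name
    Namespace="AWS/EC2",
    MetricName="CPUCreditBalance",
    Dimensions=[{"Name": "InstanceId", "Value": "i-046851f8"}],  # instance from the log above
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=20,  # illustrative credit floor
    ComparisonOperator="LessThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder topic
)
```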


erosson commented Apr 29, 2015

https://console.aws.amazon.com/elasticbeanstalk/home?region=us-east-1#/environment/monitoring/graph/AWS%7CEC2/AWSEBAutoScalingGroup?applicationName=swarm-server-sails&environmentId=e-s2ppt9c929&metricName=CPUCreditBalance&statistic=Average&period=60

looks like cpu credits are still unsustainable even with these bigger boxes.

it looks like we're cpu-bound. that makes sense: we gzip all requests/responses, and later - when the game's math moves server-side - we'll be even more cpu-bound, and https://aws.amazon.com/ec2/instance-types/ recommends cpu-optimized for "web servers, frontend fleets".

let's see how we do with a pair of cpu-optimized machines instead of a swarm (heh) of smaller ones. it's more expensive since the minimum cpu-optimized size is so much bigger, but still affordable. we'll see how these do overnight - maybe we can run 1 box most of the time, with 2 during deployments to avoid downtime. if we are cpu-bound, autoscaling should be based on cpu instead of network, too.
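
a sketch of what switching the trigger to cpu could look like, assuming boto3 and the same beanstalk environment; thresholds are illustrative:

```python
# sketch of moving the autoscaling trigger from NetworkOut to cpu, assuming
# boto3 and the same beanstalk environment; thresholds are illustrative.
import boto3

eb = boto3.client("elasticbeanstalk", region_name="us-east-1")
eb.update_environment(
    EnvironmentId="e-s2ppt9c929",
    OptionSettings=[
        {"Namespace": "aws:autoscaling:trigger", "OptionName": "MeasureName", "Value": "CPUUtilization"},
        {"Namespace": "aws:autoscaling:trigger", "OptionName": "Unit", "Value": "Percent"},
        {"Namespace": "aws:autoscaling:trigger", "OptionName": "UpperThreshold", "Value": "70"},
        {"Namespace": "aws:autoscaling:trigger", "OptionName": "LowerThreshold", "Value": "25"},
    ],
)
```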


erosson commented Apr 29, 2015

new app servers are good, but now we're exceeding database i/o limits. started getting 500s a few hours ago, when we ran out of i/o credits and the graph clearly started limiting our i/o. increasing that.

sure am glad I did this silent release; so much easier to do these changes when they don't impact players.


erosson commented May 2, 2015

almost there on this one. once or twice a day, 500s spike for a minute or two and both servers reboot simultaneously. it must be db-related since it hits both app servers, but the db monitoring doesn't report any problems. i/o credits aren't visible in the monitoring but we're well under our limit now.

logs are too polluted to figure out the cause of the 500s; cleaning up logs in #629 now.

happy enough with the stability to release it to the real world, though - pre-1.1 milestone is done. (client stuff, of course, is nowhere near done - this just means that if we released this today we could be reasonably sure the server won't fall over, not that the client code actually works.)

erosson modified the milestones: 1.1, pre-1.1 May 2, 2015
erosson added a commit to swarmsim/swarm-server-sails that referenced this issue May 2, 2015

erosson commented May 4, 2015

no trouble for the last couple of days. very good! nothing left to do here.

#578 will want the links from this issue.

erosson closed this as completed May 4, 2015