server command-logging release tracking #630
10 minutes in, zoomed past the autoscaling threshold. let's wait and see if it autoscales. ...2 minutes later, not bad:
latency jumped to 0.2s (from 0.05s), but went back down. cpu's hovering at 20%. no worries.
app servers are holding up fine. we could even experiment with lowering the autoscale threshold. latency's steady. db space/iops might be a concern. raising db space now - expect to see error responses for a bit.
db disk space (and iops) upgraded, but from reading the docs it's clear iops are going to be a bottleneck. raising the db space helps (disk space and available iops are linked) but it's cost-prohibitive to throw more space at it, and we're going to run out of burst iops soon. stopped writing new command logs; now only write character-state updates - let's see if that helps. longer term, redis might be a good choice for command logging - it's very write-heavy.
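for later reference, the redis idea could be as simple as something like this (a rough sketch using ioredis; `logCommand` and `REDIS_URL` are made-up names, not anything in the actual swarm-server-sails code):

```typescript
// sketch: push write-heavy command logs into redis instead of hitting rds per
// command. logCommand/REDIS_URL are made-up placeholder names.
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");

export async function logCommand(userId: string, command: object): Promise<void> {
  const key = `cmdlog:${userId}`;
  // RPUSH is cheap and append-only; a periodic job could drain these to cold
  // storage, or LTRIM keeps only recent history if that's all we need.
  await redis.rpush(key, JSON.stringify(command));
  await redis.ltrim(key, -1000, -1); // keep the last 1000 entries
}
```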
...haha, oops - that burst of iops was from the db size migration. with that done, normal operation puts us at about [2 write iops and negligible read iops] per gb, which is sustainable. removing command logging ~halved that. adding redis is not necessary right now. aws pricing calculator is a little confusing, but https://aws.amazon.com/rds/pricing/ is clear that non-provisioned iops (what we're using) are free with enough storage. cpu, db, latency are all under control. network i/o now looks like the biggest concern: is autoscaling at 6mb/min networkout (the default) necessary?
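rough sanity check on "sustainable", assuming the volume is general-purpose ssd (gp2) - that part is a guess, but it matches iops scaling with storage - whose documented baseline is 3 iops per GiB:

$$
\underbrace{3\ \mathrm{IOPS/GiB}}_{\text{gp2 baseline}} \;>\; \underbrace{2\ \mathrm{IOPS/GB}}_{\text{observed writes}} \;>\; \underbrace{\sim 1\ \mathrm{IOPS/GB}}_{\text{after dropping command logs}}
$$

so steady-state writes sit under the free baseline, and dropping command logging roughly doubles the headroom.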
to automatically kill instances that fail the app healthcheck: ec2 > autoscaling group > change healthcheck type from ec2 to elb. I've raised the autoscaling thresholds a lot, since things seem to be under control now.
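same healthcheck change via the api, in case we ever want to script it (a sketch with the aws sdk v3; the group name is the one from the instance-health link below):

```typescript
// sketch: flip the autoscaling group's healthcheck from EC2 (instance status
// checks) to ELB, so instances failing the load balancer's app healthcheck
// (/healthy) get terminated and replaced automatically.
import {
  AutoScalingClient,
  UpdateAutoScalingGroupCommand,
} from "@aws-sdk/client-auto-scaling";

const autoscaling = new AutoScalingClient({ region: "us-east-1" });

await autoscaling.send(
  new UpdateAutoScalingGroupCommand({
    AutoScalingGroupName:
      "awseb-e-s2ppt9c929-stack-AWSEBAutoScalingGroup-1IEC20HUENHGK",
    HealthCheckType: "ELB",
    HealthCheckGracePeriod: 300, // seconds to wait after launch before judging health
  })
);
```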
overnight:
we're still choking after some time passes: that is, all the hosts fall down and requests either aren't received or return 500s. added cpu credits to the metrics graphed on https://console.aws.amazon.com/elasticbeanstalk/home?region=us-east-1#/environment/monitoring?applicationName=swarm-server-sails&environmentId=e-s2ppt9c929 , and it's clear this happens when we run out of cpu credits. upgraded to bigger (pricier) boxes with more cpu credits, and we should probably alarm (or autoscale?) when cpu credits run short. db is stable.
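the "alarm when cpu credits run short" bit could look something like this (a sketch with the aws sdk v3; the threshold and sns topic arn are made-up placeholders):

```typescript
// sketch: page us before cpu credits run dry, instead of finding out via 500s.
// the threshold and sns topic arn are made-up placeholders.
import {
  CloudWatchClient,
  PutMetricAlarmCommand,
} from "@aws-sdk/client-cloudwatch";

const cloudwatch = new CloudWatchClient({ region: "us-east-1" });

await cloudwatch.send(
  new PutMetricAlarmCommand({
    AlarmName: "swarm-prod-low-cpu-credits", // placeholder name
    Namespace: "AWS/EC2",
    MetricName: "CPUCreditBalance",
    Dimensions: [
      {
        Name: "AutoScalingGroupName",
        Value: "awseb-e-s2ppt9c929-stack-AWSEBAutoScalingGroup-1IEC20HUENHGK",
      },
    ],
    Statistic: "Minimum",
    Period: 300,
    EvaluationPeriods: 2,
    Threshold: 50, // credits remaining; a guess, not tuned
    ComparisonOperator: "LessThanThreshold",
    AlarmActions: ["arn:aws:sns:us-east-1:123456789012:ops-alerts"], // placeholder arn
  })
);
```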
looks like cpu credits are still unsustainable even with these bigger boxes. it looks like we're cpu-bound. that makes sense: we gzip all requests/responses, and later - when the game's math moves server-side - we'll be even more cpu-bound, and https://aws.amazon.com/ec2/instance-types/ recommends cpu-optimized for "web servers, frontend fleets". let's see how we do with a pair of cpu-optimized machines instead of a swarm (heh) of smaller ones. more expensive since the minimum for cpu-optimized is so much bigger, but still affordable. let's see how these do overnight - maybe we can use 1 box most of the time, with 2 for deployments to avoid downtime. if we are cpu-bound, autoscaling should be based on cpu instead of network, too.
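if we do switch the trigger from network to cpu, it's just a couple of option settings on the `aws:autoscaling:trigger` namespace (a sketch with the aws sdk v3; the thresholds are guesses, not tuned numbers):

```typescript
// sketch: point the beanstalk scaling trigger at CPUUtilization instead of the
// default NetworkOut. threshold values are guesses, not tuned.
import {
  ElasticBeanstalkClient,
  UpdateEnvironmentCommand,
} from "@aws-sdk/client-elastic-beanstalk";

const beanstalk = new ElasticBeanstalkClient({ region: "us-east-1" });

await beanstalk.send(
  new UpdateEnvironmentCommand({
    EnvironmentId: "e-s2ppt9c929",
    OptionSettings: [
      { Namespace: "aws:autoscaling:trigger", OptionName: "MeasureName", Value: "CPUUtilization" },
      { Namespace: "aws:autoscaling:trigger", OptionName: "Unit", Value: "Percent" },
      { Namespace: "aws:autoscaling:trigger", OptionName: "UpperThreshold", Value: "70" },
      { Namespace: "aws:autoscaling:trigger", OptionName: "LowerThreshold", Value: "30" },
    ],
  })
);
```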
new app servers are good, but now we're exceeding database i/o limits. started getting 500s a few hours ago, when we ran out of i/o credits and the graph clearly shows our i/o being throttled. increasing that. sure am glad I did this silent release; so much easier to do these changes when they don't impact players.
almost there on this one. once or twice a day, 500s spike for a minute or two and both servers reboot simultaneously. it must be db-related since it hits both app servers, but the db monitoring doesn't report any problems. i/o credits aren't visible in the monitoring but we're well under our limit now. logs are too polluted to figure out the cause of the 500s; cleaning up logs in #629 now. happy enough with the stability to release it to the real world, though - pre-1.1 milestone is done. (client stuff, of course, is nowhere near done - this just means that if we released this today we could be reasonably sure the server won't fall over, not that the client code actually works.)
no trouble for the last couple of days. very good! nothing left to do here. #578 will want the links from this issue.
similar to #619, but with craploads more traffic.
app server stats at https://console.aws.amazon.com/elasticbeanstalk/home?region=us-east-1#/environment/monitoring?applicationName=swarm-server-sails&environmentId=e-s2ppt9c929
instance health: https://console.aws.amazon.com/ec2/autoscaling/home?region=us-east-1#AutoScalingGroups:id=awseb-e-s2ppt9c929-stack-AWSEBAutoScalingGroup-1IEC20HUENHGK;view=instances
db metrics: https://console.aws.amazon.com/rds/home?region=us-east-1#dbinstances:id=swarmsim-prod-throwaway2;sf=all;v=mm
cost: https://console.aws.amazon.com/cost-reports/home?#/custom?granularity=Daily&chartStyle=Line&timeRangeOption=Last14days
https://api.swarmsim.com/about
https://api.swarmsim.com/healthy
client released 8 minutes ago: https://swarmsim.github.io/releasewatch/
expecting traffic to ramp up for a half hour after release as clients autorefresh, then hold steady.
there's not really a "task" here, just collecting info about any scalability failures in one place.