server command-logging release tracking #630

Closed

erosson opened this issue Apr 26, 2015 · 11 comments

erosson commented Apr 26, 2015

erosson added this to the pre-1.1 milestone Apr 26, 2015

erosson commented Apr 26, 2015

10 minutes in, zoomed past the autoscaling threshold. let's wait and see if it autoscales.

...2 minutes later, not bad:

2015-04-26 15:43:11 UTC-0700 INFO Adding instance 'i-046851f8' to your environment.

latency jumped to 0.2s (from 0.05s), but went back down. cpu's hovering at 20%. no worries.


erosson commented Apr 26, 2015

app servers are holding up fine. we could even experiment with lowering the autoscale threshold. latency's steady.

db space/iops might be a concern. raising it now - expect to see error responses for a bit.
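
for future reference, the storage bump can also be scripted - a minimal boto3 sketch, with a made-up instance name and illustrative size:

```python
# minimal sketch, assuming boto3; instance name and size are made up.
# raising AllocatedStorage also raises the baseline iops, since gp2
# iops scale with storage size.
import boto3

rds = boto3.client("rds", region_name="us-east-1")
rds.modify_db_instance(
    DBInstanceIdentifier="swarm-server-db",  # hypothetical identifier
    AllocatedStorage=100,                    # illustrative size, in GB
    ApplyImmediately=True,                   # run the storage migration right away
)
```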

erosson added a commit to swarmsim/swarm-server-sails that referenced this issue Apr 26, 2015

erosson commented Apr 26, 2015

db disk space (and iops) upgraded, but from reading the docs it's clear iops are going to be a bottleneck. raising the db space helps (disk space and available iops are linked) but it's cost-prohibitive to throw more space at it, and we're going to run out of burst iops soon. stopped writing new command logs; now only write character-state updates - let's see if that helps.

longer term, redis might be a good choice for command logging - it's very write-heavy.
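
if we do go that way, a rough sketch of what write-heavy command logging against redis could look like (redis-py; key names and payload shape are made up):

```python
# rough sketch, assuming redis-py; key names and payload shape are made up.
# RPUSH is an O(1) append, which suits a write-heavy command log far better
# than per-command rows in the relational db.
import json
import time

import redis

r = redis.StrictRedis(host="localhost", port=6379)

def log_command(character_id, command):
    # append one command to that character's log (hypothetical key scheme)
    entry = json.dumps({"t": time.time(), "cmd": command})
    r.rpush("cmdlog:%s" % character_id, entry)

log_command(42, {"type": "buy", "unit": "drone", "count": 10})
```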


erosson commented Apr 26, 2015

...haha, oops - that burst of iops was from the db size migration. with that done, normal operation puts us at about [2 write iops and negligible read iops] per gb, which is sustainable. removing command logging ~halved that. adding redis is not necessary right now. aws pricing calculator is a little confusing, but https://aws.amazon.com/rds/pricing/ is clear that non-provisioned iops (what we're using) are free with enough storage.

cpu, db, latency are all under control. network i/o now looks like the biggest concern: is autoscaling at 6mb/min networkout (the default) necessary? can we gzip requests to reduce it? responses are already gzipped.
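
gzipping requests would mostly be a client-side change plus server middleware that honors Content-Encoding on incoming bodies - a hedged client-side sketch in python (endpoint is made up):

```python
# client-side sketch, assuming the `requests` library and a made-up endpoint.
# the server would need middleware that decompresses bodies sent with
# Content-Encoding: gzip; without it this request would fail or be garbled.
import gzip
import json

import requests

payload = json.dumps({"characterId": 42, "commands": []}).encode("utf-8")
compressed = gzip.compress(payload)

resp = requests.post(
    "https://api.example.com/sync",  # hypothetical endpoint
    data=compressed,
    headers={
        "Content-Type": "application/json",
        "Content-Encoding": "gzip",
    },
)
print(resp.status_code, len(payload), "->", len(compressed))
```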


erosson commented Apr 27, 2015

to automatically kill instances that fail the app healthcheck: ec2 > autoscaling group > change healthcheck type from ec2 to elb

I've raised the autoscaling thresholds a lot, since things seem to be under control now.
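
the same healthcheck change expressed against the api, for future reference - a boto3 sketch (the group name is whatever beanstalk generated, so it's a placeholder here):

```python
# sketch of the console change above, assuming boto3; the group name is a
# placeholder for whatever beanstalk generated. switching HealthCheckType
# from EC2 to ELB makes the group replace instances that fail the load
# balancer's app-level healthcheck, not just ones whose hardware dies.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="awseb-swarm-server-AWSEBAutoScalingGroup",  # placeholder name
    HealthCheckType="ELB",
    HealthCheckGracePeriod=300,  # give new instances time to boot before checking
)
```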


erosson commented Apr 27, 2015

overnight:

  • autoscaling went crazy overnight, flapping between a handful of app servers, and latency and 500s went through the roof. reconfigured it to make it much harder to remove app servers once they're in service (sketch after this list), and added a few manually. adding more boxes quickly brought latency and 500s back down; watching to see if autoscaling keeps them around this time.
  • looks like db cpu is consistently using a few burst credits; that becomes unsustainable later this evening or tomorrow morning. upgrading the db box now.
  • app servers are going to run out of cpu credits soon - that might be why they kept dying earlier. we either need to add more boxes to spread the load or upgrade to bigger ones; haven't done this yet.
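
sketch of the "harder to scale in" reconfiguration, assuming boto3 and the beanstalk environment from the monitoring links; the MinSize and threshold values are illustrative:

```python
# sketch of the scale-in reconfiguration, assuming boto3 and the environment
# id from the monitoring links. MinSize keeps a floor of instances; a lower
# LowerThreshold plus a longer BreachDuration means NetworkOut has to stay
# quiet for much longer before the trigger removes a box. values illustrative.
import boto3

eb = boto3.client("elasticbeanstalk", region_name="us-east-1")
eb.update_environment(
    EnvironmentId="e-s2ppt9c929",
    OptionSettings=[
        {"Namespace": "aws:autoscaling:asg", "OptionName": "MinSize", "Value": "4"},
        {"Namespace": "aws:autoscaling:trigger", "OptionName": "LowerThreshold", "Value": "1000000"},
        {"Namespace": "aws:autoscaling:trigger", "OptionName": "BreachDuration", "Value": "30"},
    ],
)
```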


erosson commented Apr 29, 2015

we're still choking once enough time passes: all the hosts fall over, and requests either aren't received or return 500. added cpu credits to the metrics graphed on https://console.aws.amazon.com/elasticbeanstalk/home?region=us-east-1#/environment/monitoring?applicationName=swarm-server-sails&environmentId=e-s2ppt9c929 , and it's clear this happens when we run out of cpu credits. upgraded to bigger (pricier) boxes with more cpu credits, and we should probably alarm (or autoscale?) when cpu credits run short.

db is stable.
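
a possible shape for that cpu-credit alarm, sketched with boto3 - the alarm name, threshold, and sns topic are placeholders, and the instance id is just the one from the earlier log line:

```python
# possible shape for the cpu-credit alarm, assuming boto3. CPUCreditBalance is
# a per-instance EC2 metric, so this watches a single instance; the alarm name,
# threshold, and sns topic are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
cloudwatch.put_metric_alarm(
    AlarmName="swarm-app-cpu-credits-low",  # hypothetical name
    Namespace="AWS/EC2",
    MetricName="CPUCreditBalance",
    Dimensions=[{"Name": "InstanceId", "Value": "i-046851f8"}],  # instance from the log above
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=20,  # illustrative credit floor
    ComparisonOperator="LessThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder topic
)
```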


erosson commented Apr 29, 2015

https://console.aws.amazon.com/elasticbeanstalk/home?region=us-east-1#/environment/monitoring/graph/AWS%7CEC2/AWSEBAutoScalingGroup?applicationName=swarm-server-sails&environmentId=e-s2ppt9c929&metricName=CPUCreditBalance&statistic=Average&period=60

looks like cpu credits are still unsustainable even with these bigger boxes.

it looks like we're cpu-bound. that makes sense: we gzip all requests/responses, and later - when the game's math moves server-side - we'll be even more cpu-bound, and https://aws.amazon.com/ec2/instance-types/ recommends cpu-optimized for "web servers, frontend fleets".

let's see how we do with a pair of cpu-optimized machines instead of a swarm (heh) of smaller ones. it's more expensive since the minimum cpu-optimized size is so much bigger, but still affordable. we'll see how these do overnight - maybe we can run 1 box most of the time, with 2 during deployments to avoid downtime. if we are cpu-bound, autoscaling should be based on cpu instead of network, too.
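
a sketch of what switching the trigger to cpu could look like, assuming boto3 and the same beanstalk environment; thresholds are illustrative:

```python
# sketch of moving the autoscaling trigger from NetworkOut to cpu, assuming
# boto3 and the same beanstalk environment; thresholds are illustrative.
import boto3

eb = boto3.client("elasticbeanstalk", region_name="us-east-1")
eb.update_environment(
    EnvironmentId="e-s2ppt9c929",
    OptionSettings=[
        {"Namespace": "aws:autoscaling:trigger", "OptionName": "MeasureName", "Value": "CPUUtilization"},
        {"Namespace": "aws:autoscaling:trigger", "OptionName": "Unit", "Value": "Percent"},
        {"Namespace": "aws:autoscaling:trigger", "OptionName": "UpperThreshold", "Value": "70"},
        {"Namespace": "aws:autoscaling:trigger", "OptionName": "LowerThreshold", "Value": "25"},
    ],
)
```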


erosson commented Apr 29, 2015

new app servers are good, but now we're exceeding database i/o limits. started getting 500s a few hours ago, when we ran out of i/o credits and the graph clearly started limiting our i/o. increasing that.

sure am glad I did this silent release; so much easier to do these changes when they don't impact players.


erosson commented May 2, 2015

almost there on this one. once or twice a day, 500s spike for a minute or two and both servers reboot simultaneously. it must be db-related since it hits both app servers, but the db monitoring doesn't report any problems. i/o credits aren't visible in the monitoring but we're well under our limit now.

logs are too polluted to figure out the cause of the 500s; cleaning up logs in #629 now.

happy enough with the stability to release it to the real world, though - pre-1.1 milestone is done. (client stuff, of course, is nowhere near done - this just means that if we released this today we could be reasonably sure the server won't fall over, not that the client code actually works.)

erosson modified the milestones: 1.1, pre-1.1 May 2, 2015
erosson added a commit to swarmsim/swarm-server-sails that referenced this issue May 2, 2015

erosson commented May 4, 2015

no trouble for the last couple of days. very good! nothing left to do here.

#578 will want the links from this issue.

erosson closed this as completed May 4, 2015