
Consider moving to reserved EC2 instances? #190

Closed
larsbergstrom opened this issue Jan 8, 2016 · 11 comments

Comments

@larsbergstrom (Contributor) commented Jan 8, 2016

cc @edunham @metajack

We're finding that, particularly after any network hiccups, buildbot is really bad about spinning up a new EC2 latent instance. I also can't figure out how to connect one manually - if you do the obvious thing of spinning one up and starting buildbot on it, you get an error, because the master isn't in the process of starting it up (the only time the latent ones can connect). And the master doesn't seem to spin up new ones unless you restart it, which is really hard to time between homu runs without breaking other things.

Is there some command I'm missing here that I could be using instead?

Or, should we consider moving to EC2 reserved instances? Even using a smaller instance type would help, as we're basically running with only one EC2 instance anyway most of the time.

We didn't see this as much before because we always had my linode instance to "pick up the slack".

@metajack (Contributor) commented Jan 8, 2016

I'm not sure what changed, but this seems like it used to be much more reliable.

I'm not opposed to reserved instances (I assume for 1-year periods), but this is likely to make our AWS bill quite a bit higher.

@metajack (Contributor) commented Jan 8, 2016

Or in other words, my only objection would be financial. If it fits in budget, let's do it.

@larsbergstrom (Contributor, Author) commented Jan 13, 2016

@metajack Do you remember what the speed difference was between c4.4xlarge and c4.2xlarge? Should we re-run some tests? The 2xlarge is a little under half the price of 4xlarge, and before we do some reserved instances, I was just curious how much time it's saving us on our runs :-)

@metajack (Contributor) commented Jan 13, 2016

We should definitely do this ASAP. Perusing the buildbot bug tracker it looks like these issues have been known and unfixed for 5 years, and also that slave attach and detach is non-atomic and has all sorts of bad edge cases as a result. Moving to reserved instances would eliminate slave churn and probably get rid of all the trouble.

Here's what I think should happen:

  1. Some script that will launch a slave given the instance type, AZ, AMI id, and the access and secret keys as inputs. The userdata needed to bootstrap the instance is in the buildbot/master/master.cfg in salt.
  2. Test build time for release builds on instance types that are 1 or even 2 steps down. That's c4.8xlarge, c4.4xlarge, and c4.2xlarge, I think. If that doesn't have a huge effect on build times, we should use the cheaper instances. Note that you'll need to adjust the parallelism in the build steps to account for the reduced number of cores (you might also play a bit with the settings to see which one is optimal). We'll want the cost numbers at this point so we can compare to budget and old cost.
  3. Once the best instance is decided, fire up 2x that instance as 1yr reserved instances, and change our slave config to use those instead of latent slaves. Note that we can add new slaves that are not latent slaves to the existing pool and have both configs running at once during testing to minimize interruption if needed.
  4. Move servo-master to a reserved instance of some super cheap instance type.
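A minimal sketch of the launch script from step 1, assuming the parameters listed there. Everything here is illustrative, not the real saltfs script; the actual bootstrap userdata lives in buildbot/master/master.cfg in salt, and the AMI id is a placeholder.

```python
# Sketch of step 1: assemble the arguments for an EC2 RunInstances call
# for a single build slave, given instance type, AZ, AMI id, and the
# userdata that bootstraps buildbot. Names and values are illustrative.

def build_launch_params(instance_type, az, ami_id, userdata):
    """Return keyword arguments for an EC2 RunInstances request."""
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "MinCount": 1,            # launch exactly one slave per invocation
        "MaxCount": 1,
        "UserData": userdata,     # bootstrap script that starts buildbot
        "Placement": {"AvailabilityZone": az},
    }

params = build_launch_params("c4.4xlarge", "us-west-2a", "ami-XXXXXXXX", "#!/bin/sh\n")
# With boto3 and the access/secret keys, the actual launch would be
# something like:
#   boto3.client("ec2", region_name="us-west-2").run_instances(**params)
print(params["InstanceType"], params["Placement"]["AvailabilityZone"])
```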
@metajack (Contributor) commented Jan 14, 2016

@larsbergstrom I don't think I ever did tests. I did try various --processes values for running wpt and css tests which is where most of the speedup was gained with many cores.

@larsbergstrom (Contributor, Author) commented Feb 1, 2016

OK, I've done #1 and #2 from Jack's steps above. It looks like getting some c4.4xlarge instances would save a ton of money, not cost much time, and also leave room for adding more of them as we add more tests. But we could also just get the c4.8xlarges reserved, too :-)

Price info is from https://aws.amazon.com/ec2/instance-types/, assuming we stay with US-WEST-2.

Raw data below:

c4.8xlarge (36 vCPUs, info from linux-rel; $2.10/hour on-demand, $1.34/hour 1-year reserved)

compile - 11mins, 56sec
test-wpt "./mach test-wpt --release --processes 24 --log-raw test-wpt.log" - 4mins, 9sec
test-css "./mach test-css --release --processes 24 --log-raw test-css.log" - 5mins, 42sec
build-cef - 11mins, 9sec

c4.4xlarge (16 vCPUs; $0.67/hour, 1-year reserved, all upfront)

compile(scratch) 16.28min
compile(rm -rf target) 12m28s
compile(rm -rf target w/ "-j 12") 12m36s
compile(rm -rf target w/ "-j 24") 12m38s

test-wpt w/ 16 processes - 5min
test-wpt w/ 20 processes - 4m24s
test-wpt w/ 24 processes - 3m56s
test-wpt w/ 28 processes - 3m43s
test-wpt w/ 32 processes - 3m32s

test-css w/ 16 processes - 8m49s
test-css w/ 24 processes - 8m10s

c4.2xlarge (8 vCPUs; $0.33/hour, 1-year reserved, all upfront)

compile(scratch) 19.45min
compile(rm -rf target) 14m14s

test-wpt w/ 8 processes - 9.6min
test-wpt w/ 12 processes - 7.1min
test-wpt w/ 16 processes - 5min, 56s

test-css w/ 8 processes - 15m26s
test-css w/ 16 processes - 14m5s

c4.xlarge (4 vCPUs; $0.11/hour, 1-year reserved, all upfront)

compile (rm -rf target) 20m29s
test-wpt w/ 4 processes - 18m37s
test-wpt w/ 8 processes - 11m6s
test-css w/ 4 processes - 27m16s
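To put the timings and rates above side by side, here's a rough cost-per-cycle calculation, where a "cycle" is compile (rm -rf target) + best test-wpt + best test-css, using the 1-year reserved hourly rates quoted in this comment. build-cef and other steps are ignored, so these are ballpark figures only.

```python
# Ballpark cost per build cycle from the measurements in this comment.
rate = {"c4.8xlarge": 1.34, "c4.4xlarge": 0.67, "c4.2xlarge": 0.33}
cycle_min = {
    "c4.8xlarge": 11.93 + 4.15 + 5.70,   # 11m56s, 4m09s, 5m42s
    "c4.4xlarge": 12.47 + 3.53 + 8.17,   # 12m28s, 3m32s (32 procs), 8m10s
    "c4.2xlarge": 14.23 + 5.93 + 14.08,  # 14m14s, 5m56s, 14m05s
}
for itype in rate:
    cost = rate[itype] * cycle_min[itype] / 60
    print(f"{itype}: {cycle_min[itype]:.1f} min/cycle, ~${cost:.2f}/cycle")
```

The takeaway is that the c4.4xlarge cycle is only a few minutes longer while costing roughly half as much per cycle as the c4.8xlarge.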

@larsbergstrom (Contributor, Author) commented Feb 1, 2016

Here's our usage data for the last two weeks on the linux builders w/ on-demand usage. Note that we're still having a LOT of trouble getting it to spin up a second builder reliably, so the 23/24/25-hour usage days are probably underreporting what we would have used if things were working properly.

1/19/16 - 24
1/20/16 - 15
1/21/16 - 36
1/22/16 - 11
1/23/16 - 9
1/24/16 - 4
1/25/16 - 25
1/26/16 - 32
1/27/16 - 39
1/28/16 - 23
1/29/16 - 23
1/30/16 - 15
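A back-of-the-envelope comparison using the builder-hours above: all hours billed at the $2.10/hour on-demand c4.8xlarge rate, versus two c4.4xlarge reserved instances billing around the clock at the effective all-upfront rate of $0.526/hour quoted elsewhere in this thread. An estimate only, not an actual bill.

```python
# Two weeks of builder-hours (the per-day figures above), costed two ways.
daily_hours = [24, 15, 36, 11, 9, 4, 25, 32, 39, 23, 23, 15]
on_demand = sum(daily_hours) * 2.10            # c4.8xlarge on-demand rate
reserved = 2 * len(daily_hours) * 24 * 0.526   # 2 instances, 24h/day, reserved
print(f"on-demand: ${on_demand:.2f}, reserved pair: ${reserved:.2f}")
```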

@larsbergstrom (Contributor, Author) commented Feb 1, 2016

One extra note: the default Ubuntu 14.04 images on Amazon only have 7.5GB of storage, which is barely enough for a single-flavor Servo build. The image brought over from Daala has ~400GB, which seems sufficient for the several different flavors/builds that each image brings up (a casual inspection of one of the builders being shared across targets showed about 55GB free).

bors-servo added a commit that referenced this issue Feb 4, 2016
Add EC2 reserved instances

r? @metajack @Manishearth

Closes #190

(this is live)

@larsbergstrom (Contributor, Author) commented Feb 4, 2016

We decided to pay for a full year upfront. The effective hourly cost for that is $0.526. With half up front, it's $0.537; with nothing up front, $0.621. All figures are for a c4.4xlarge.

Note that we currently pay $2.098/hour for our on-demand c4.8xlarge.
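Annualizing the rates in this comment makes the gap concrete: a c4.8xlarge on-demand running around the clock versus a c4.4xlarge reserved with full upfront payment.

```python
# Annualized cost comparison at the hourly rates quoted in this comment.
hours_per_year = 365 * 24
on_demand = 2.098 * hours_per_year   # c4.8xlarge, on-demand
reserved = 0.526 * hours_per_year    # c4.4xlarge, 1yr all-upfront effective rate
print(f"on-demand: ${on_demand:,.0f}/yr, reserved: ${reserved:,.0f}/yr, "
      f"saving: ${on_demand - reserved:,.0f}/yr")
```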

@aneeshusa (Member) commented Apr 5, 2016

Does this affect the buildbot restarting instructions? In particular, I believe there are no longer any on-demand EC2 instances, so is the "graceful shutdown" step still necessary?

@larsbergstrom (Contributor, Author) commented Apr 5, 2016

@aneeshusa Thanks! I've cleaned up the instructions a bit and removed the bit about the latent builders.
