Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove EC2 latent builders #222

Merged
merged 1 commit into from Feb 23, 2016
Merged

Conversation

@larsbergstrom
Copy link
Contributor

larsbergstrom commented Feb 23, 2016

r? @Manishearth @edunham @metajack

They continue to cause us problems; let's just stop using them, even for spillover. These problems include:

  1. Hanging around long after no longer in use
  2. Failing to git download immediately after spin-up
  3. Prioritized by buildbot occasionally over our reserved instances, increasing billing costs

Review on Reviewable

@larsbergstrom
Copy link
Contributor Author

larsbergstrom commented Feb 23, 2016

As a stopgap, I just logged into the EC2 admin console and terminated the linux builders as the spawned up. Since buildbot can't handle respawning them if they die, that should (hopefully) give us a window to land this before the next time buildbot is restarted.

@aneeshusa
Copy link
Member

aneeshusa commented Feb 23, 2016

It looks like this removes the last remaining machines with 'linux1' style names (instead of servo-head, servo-linux1, etc.). If so, can you remove the corresponding entries in the top.sls and .travis.yml files?

@larsbergstrom
Copy link
Contributor Author

larsbergstrom commented Feb 23, 2016

@aneeshusa Great catch - thanks!

@aneeshusa
Copy link
Member

aneeshusa commented Feb 23, 2016

Can you do the same in the .travis.yml file? This should speed up Travis runs too!

@larsbergstrom
Copy link
Contributor Author

larsbergstrom commented Feb 23, 2016

Ah, I removed it, but that comment also made it clear that we were not testing the style of servo-linux1 (which should be identical to the old linux1 style). So basically the same, though still much better since we were missing the test coverage!

@aneeshusa
Copy link
Member

aneeshusa commented Feb 23, 2016

Oops, good catch! I just checked on servo-master and it looks like there's still a linux2 machine alive somewhere - are we still using that machine?

@larsbergstrom
Copy link
Contributor Author

larsbergstrom commented Feb 23, 2016

Alive? That... is curious. I don't see any machines still up on linode or on Ec2 that should be linux2. I'm looking into how I can figure out its IP and see what machine that is right now. Good (and scary!) catch.

@aneeshusa
Copy link
Member

aneeshusa commented Feb 23, 2016

On servo-master, you can use salt 'linux2' grains.items to get a big data dump of info about the machine (or any other machine if you use different targeting). It looks like it's an EC2 box in us-west2 with IPv4 address 172.31.35.40.

@larsbergstrom
Copy link
Contributor Author

larsbergstrom commented Feb 23, 2016

@aneeshusa Thanks!

@metajack @edunham Can you please see if there is an EC2 box sitting in the Daala account with that IP (172.32.35.40) that's still sitting around and, if so, terminate it?

@larsbergstrom
Copy link
Contributor Author

larsbergstrom commented Feb 23, 2016

@aneeshusa Great catch - it was a somehow still-alive EC2 builder from our previous account, so I couldn't see it. It has been terminated.

@aneeshusa
Copy link
Member

aneeshusa commented Feb 23, 2016

Can you confirm whether linux1, linux3, and servo-linux9000 are also no longer being used? If so, I'll unregister all 4 machines from the Salt master with salt-key -d.

(Did linux4 never get hooked up to Salt?)

@larsbergstrom
Copy link
Contributor Author

larsbergstrom commented Feb 23, 2016

That's correct! They are all no longer being used. I don't think we ever instantiated a linux4, despite the presence of the rule. servo-linux9000 was a hack for spinning up a one-off box. In retrospect, I think I now know enough to be able to run the salt rules locally on the machine without hooking up to the salt master in that hack-y way :-)

@aneeshusa
Copy link
Member

aneeshusa commented Feb 23, 2016

Great, those 4 are gone. By the way, there are a bunch of other machines that have tried to connect to the Salt master. You should check the output of salt-key -L on servo-master, and use salt-key -d to delete any keys for machines that shouldn't be connected. (As well as figure out why they tried to connect!)

@larsbergstrom
Copy link
Contributor Author

larsbergstrom commented Feb 23, 2016

Thanks, done! They're all either lingering from old configurations or result from manually standing up a machine and starting the salt minion service after they know where the salt master is but before I've changed the minion_id file :-)

@aneeshusa
Copy link
Member

aneeshusa commented Feb 23, 2016

You actually don't need to change the minion_id file manually; you can change the machine's hostname, or set the id in the minion config file, before starting Salt and it should pick up the right value.

keypair_name="servo",
security_name="servo-test",
tags={"Name": "servo-%s" % s},
user_data=make_user_data(s)))

This comment has been minimized.

Copy link
@aneeshusa

aneeshusa Feb 23, 2016

Member

I believe this is the only use of make_user_data, so that function can also be removed.

@larsbergstrom
Copy link
Contributor Author

larsbergstrom commented Feb 23, 2016

Another great catch - thanks!

@aneeshusa
Copy link
Member

aneeshusa commented Feb 23, 2016

I think lines 9 and 10 are also unneeded now, since we only used the *_KEY variables in make_user_data.

@larsbergstrom
Copy link
Contributor Author

larsbergstrom commented Feb 23, 2016

True! Removed from there and from the corresponding python code.

@@ -6,12 +6,10 @@ from buildbot.status.results import SUCCESS

from passwords import HTTP_USERNAME, HTTP_PASSWORD, SLAVE_PASSWORD, CHANGE_PASSWORD
from passwords import GITHUB_DOC_TOKEN, GITHUB_STATUS_TOKEN, HOMU_BUILDBOT_SECRET
from passwords import AWS_ACCESS_KEY, AWS_SECRET_KEY
from passwords import MINION_PUBLIC_KEY, MINION_PRIVATE_KEY

This comment has been minimized.

Copy link
@aneeshusa

aneeshusa Feb 23, 2016

Member

This line is no longer needed as well. In addition, we should remove these 4 variables from the passwords.py file and the test pillars.

@larsbergstrom
Copy link
Contributor Author

larsbergstrom commented Feb 23, 2016

Nice! I should have spent more time looking at this as a whole instead of just relying on you to do it all :-)

I'll remove the keys from the pillar after there's r+ from one of the other maintainers.

@aneeshusa
Copy link
Member

aneeshusa commented Feb 23, 2016

Haha, I just wish there was an easy way to do batch review: make a bunch of line comments and only send one email for all of them. Hope I didn't spam you too much!

@aneeshusa
Copy link
Member

aneeshusa commented Feb 23, 2016

For the pillars - I'm not sure if you mean the real ones or the test ones, but just wanted to clarify. It would be nice to remove the 4 variables from the Travis test pillars in .travis/test_pillars/buildbot/master.sls, ideally before r+ to ensure everything still builds on Travis.

@larsbergstrom
Copy link
Contributor Author

larsbergstrom commented Feb 23, 2016

Aha, that makes perfect sense - got them!

On Servo, we use http://reviewable.io, which lets us batch comments into a smaller number of e-mails. That said, I get > 1k GitHub emails per day, so 10 or 15 actually relevant ones are totally fine :-)

@aneeshusa
Copy link
Member

aneeshusa commented Feb 23, 2016

Travis failure looks like an intermittent (likely a networking problem when downloading the Android SDK). It looks fine in Vagrant.

@larsbergstrom
Copy link
Contributor Author

larsbergstrom commented Feb 23, 2016

OK, I'm going to squash and r=anneshusa :-)

@larsbergstrom larsbergstrom force-pushed the larsbergstrom:remove_latent branch from 048ee14 to 49929b1 Feb 23, 2016
@larsbergstrom
Copy link
Contributor Author

larsbergstrom commented Feb 23, 2016

@bors-servo r=anneshusa

@bors-servo
Copy link
Contributor

bors-servo commented Feb 23, 2016

📌 Commit 49929b1 has been approved by anneshusa

@bors-servo
Copy link
Contributor

bors-servo commented Feb 23, 2016

Testing commit 49929b1 with merge 40f48f2...

bors-servo added a commit that referenced this pull request Feb 23, 2016
Remove EC2 latent builders

r? @Manishearth @edunham @metajack

They continue to cause us problems; let's just stop using them, even for spillover. These problems include:
1) Hanging around long after no longer in use
2) Failing to `git` download immediately after spin-up
3) Prioritized by buildbot occasionally over our reserved instances, increasing billing costs

<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.svg" height="40" alt="Review on Reviewable"/>](https://reviewable.io/reviews/servo/saltfs/222)
<!-- Reviewable:end -->
@bors-servo
Copy link
Contributor

bors-servo commented Feb 23, 2016

☀️ Test successful - travis

@bors-servo bors-servo merged commit 49929b1 into servo:master Feb 23, 2016
1 of 2 checks passed
1 of 2 checks passed
homu Testing commit 49929b1 with merge 40f48f2...
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@aneeshusa
Copy link
Member

aneeshusa commented Feb 23, 2016

So many great catches - time to start a team?

Also, a friendly note (noticed this on the TWIS blog also) - my username is spelled aneeshusa, not anneshusa :)

@larsbergstrom
Copy link
Contributor Author

larsbergstrom commented Feb 24, 2016

@aneeshusa I'm sorry about that! I don't know why my mind autocompletes incorrectly. I'll try to check TWiS more carefully :-(

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

3 participants
You can’t perform that action at this time.