
Move buildmaster from Linode to AWS (tracking issue) #281

Closed
edunham opened this issue Mar 28, 2016 · 14 comments
Comments

@edunham (Contributor) commented Mar 28, 2016

cc @larsbergstrom

Async steps

  • remove Longview states from SaltFS, since they're strictly for Linode instances
  • Select appropriate size AWS instance for new master
    • According to Longview, current master CPU hits 24% of one core, memory stays under 4GB, and disk usage is around 13GB.
    • an m3.medium instance on AWS has 1 core and 4.02GB RAM
  • manually create AWS instance, give it an elastic IP
  • Set up salt master w/ identical versions to old master on the AWS instance
  • turn old master's minion id into servo-master0
  • Follow instructions at https://github.com/servo/servo/wiki/Buildbot-administration#linux to set up the AWS host as servo-master and run a Salt highstate
  • Copy Pillar data and master settings directly to new master with sftp
    • /srv/salt/pillar
    • /srv/pillar/*
    • /etc/salt/master
  • Copy /etc/salt/pki/master directly to new master with sftp
  • Update all Salt minions to say master: build.servo.org in /etc/salt/minion; amend https://github.com/servo/servo/wiki/Buildbot-administration#linux to reflect this change
    • Create servo-master0.servo.org A record pointing to old buildmaster
    • Create servo-master1.servo.org A record pointing to new buildmaster
    • Set all slaves to dual-master as per https://docs.saltstack.com/en/latest/topics/tutorials/multimaster.html#configure-minions -- the processes for managing Salt with Salt look fairly involved, so it will be faster to make this change manually this time around
      • servo-mac1
      • servo-macpro1
      • servo-mac2
      • servo-mac3
      • servo-master0
      • servo-master1
      • servo-linux1
      • servo-linux2
      • servo-linux-cross1
      • servo-linux-cross2
  • Test that the new master can connect to all minions, e.g. by running a highstate in test mode
  • Verify that the Homu service is running on the new master
  • Verify that Buildbot is running on the new master
  • Verify that all requisite ports are permitted by the new master's security group
  • Check all Homu-enabled repos to verify that hooks (https://github.com/servo/servo/settings/hooks) use build.servo.org rather than the Linode master's IP
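The minion repointing in the list above comes down to a one-line setting in each minion's /etc/salt/minion. A sketch of the post-cutover form (during the transition the value becomes a list of masters instead):

```yaml
# /etc/salt/minion -- reach the master via DNS, not a hardcoded IP
master: build.servo.org
```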

At 9am PST Wednesday 3/30

  • sftp /home/servo/homu/main.db from old master to new; restart homu
  • In the Cloudflare interface, edit build.servo.org record to point at new master. All webhooks use DNS, so they will not need to be modified.
  • Retry a failed build to verify that everything's working
@aneeshusa (Member) commented Mar 28, 2016

We actually don't have instructions for setting up a Salt master yet, so I'll need to write those up and add them to the wiki (e.g. installing the right version of the right package).

Also, if you're trying to do this as a multimaster transition (which I recommend), you'll need to first list both masters in the minion configuration file, then, after verifying the new master is correctly set up, remove the old one. Note: if you're going to give both masters servo-master as an ID, don't have them connect to each other, or they will get confused by seeing two minions with the same ID. (This affects the step of copying /etc/salt/pki/master.)

@metajack (Contributor) commented Mar 28, 2016

The private Google doc should have some possibly out-of-date instructions for setting up a Salt master. I think it was mostly bootstrappable except for some minor things. Though these days we probably have things like minion keys that need to be dealt with. When I wrote the instructions I was doing a totally fresh master :)

@aneeshusa (Member) commented Mar 28, 2016

I'd be interested in taking a peek at that doc purely out of curiosity :)

As for the DNS, the fact that Homu and nginx are running on the same box as the Salt master is just an artifact of our environment, and we shouldn't bake it into our configuration files. We should add a separate DNS entry for salt.servo.org (pointing at the new master, via A record not CNAME). In the case of multimaster, this lets us set the minion configuration to say:

```yaml
master:
  - build.servo.org
  - salt.servo.org
```

right after we set up the new master. After restarting the minions and checking connectivity to the new master, we can update it to say simply:

```yaml
master: salt.servo.org
```

This lets us decouple the DNS changes for Salt from the DNS changes for Homu, when we switch over the build.servo.org DNS record.

What's the timezone for the 9am switchover? I may or may not be awake that early.

@larsbergstrom (Contributor) commented Mar 28, 2016

I think it was originally a private etherpad, and those servers have since been moved behind a firewall and then burned down (ether-pocalypse).

It's 9AM US Pacific time :-)

bors-servo added a commit that referenced this issue Mar 28, 2016
Remove longview setup for servo-master Linode exodus

Longview is Linode's proprietary monitoring service.
We are moving the servo-master machine from Linode to EC2, so
we will no longer use Longview; this commit removes it.

Refs #281

@edunham (Contributor, author) commented Mar 28, 2016

@aneeshusa Thanks for the feedback!

I just checked the secrets doc and it does not currently contain any setup instructions for the build master.

How much of the master setup can we do through Salt itself?

Does my amended checklist for doing a dual-master switchover look sane?

@aneeshusa (Member) commented Mar 29, 2016

Master setup:

For now, I just plan to create a script to install the correct salt-master package. Setting up the configuration file and starting the corresponding service can stay manual for now, as it is when setting up a new minion (although Salting this is on my todo list).
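As a sketch of what such a script might look like (the pinned version string and the apt-get invocation are assumptions for illustration, not the real script; it's shown here in dry-run form):

```shell
# Hypothetical sketch: install a pinned salt-master package so the
# master's version matches the minions'. The version string and the
# package manager invocation are assumptions for illustration.
SALT_VERSION="2015.5.10"
DRY_RUN=1   # set to 0 on the real master

cmd="apt-get install -y salt-master=${SALT_VERSION}*"
if [ "$DRY_RUN" -eq 1 ]; then
    echo "would run: $cmd"
else
    eval "$cmd"
fi
```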

More complicated setup is IMO not worth automating right now; e.g., keeping the list of accepted minion keys in sync would want some kind of CMDB (even if it's just flat files), and ditto for keeping the pillars synced. (GitFS should keep the file tree synced).

Checklist:

We should explicitly separate setting up a Salt minion on the new machine from setting up a Salt master on it in the checklist (since we haven't Salted the process of setting up a Salt master yet). It looks like the checklist already does this, but I'd like that section to be more explicit:

  • Create new EC2 instance, assign it an elastic IP
  • Install a Salt minion using the instructions from https://github.com/servo/servo/wiki/Buildbot-administration#linux and run a local highstate (see my later comments about the ID)
  • Install the Salt master package with instructions from the wiki (TODO: write these instructions)
  • [various master configuration steps - pillar data, file tree, PKI data, master config file?]
  • Start the salt-master service on the new machine
  • Point minions at both machines, etc.

We also need to add some additional steps to the switchover. Earlier, I was thinking to reuse the servo-master ID on the new machine, but on further reflection I think this will be more trouble than it's worth (as I mentioned in an earlier comment, we'd need to add steps to special-case the minion configs, edit the PKI files, etc.). Instead, we should be using separate IDs for each machine; this will require some more steps as well. I'm writing this up in more detail and will leave another comment.

Also, I have a busy week, so any chance of pushing the switchover date back a few days would be appreciated!

@aneeshusa (Member) commented Mar 29, 2016

Salt minion IDs are meant to be unique, so we should follow the convention instead of trying to reuse the servo-master ID, as it will cause trouble when running with two hot masters. Here's my new thinking:

New steps (we should do these before bringing up the new master):

  • Amend the top file so that the servo-master section matches servo-master\d+ in order to handle multiple masters.
    • We should also update the .travis.yml file to use servo-master1
  • Switch the current master ID from servo-master to servo-master1. Steps to take on the current master:
    • Edit the minion ID in the /etc/salt/minion file (removing the /etc/salt/minion_id file if necessary)
    • Go into the /etc/salt/pki/master/minions directory and rename the servo-master file to servo-master1
    • Restart the Salt minion, and once it comes back up, confirm connectivity with test.ping and targeting with a highstate in test=True mode
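Those three steps might look like the following shell sketch. To keep it hedged, it operates on a scratch copy of the /etc/salt layout; on the real master you would drop $SCRATCH, work on /etc/salt directly, and then restart the minion:

```shell
# Sketch of the servo-master -> servo-master1 rename, run against a
# scratch copy of /etc/salt rather than the real tree.
SCRATCH=$(mktemp -d)
mkdir -p "$SCRATCH/etc/salt/pki/master/minions"
printf 'id: servo-master\n' > "$SCRATCH/etc/salt/minion"
touch "$SCRATCH/etc/salt/pki/master/minions/servo-master"

# 1. Edit the minion ID in the minion config; drop any cached minion_id.
sed -i 's/^id: .*/id: servo-master1/' "$SCRATCH/etc/salt/minion"
rm -f "$SCRATCH/etc/salt/minion_id"

# 2. Rename the accepted-key file to match the new ID.
mv "$SCRATCH/etc/salt/pki/master/minions/servo-master" \
   "$SCRATCH/etc/salt/pki/master/minions/servo-master1"

# 3. On the real master: restart the minion, then confirm with
#    `salt 'servo-master1' test.ping` and a highstate in test=True mode.
cat "$SCRATCH/etc/salt/minion"
```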

At this point, we can proceed with setting up the new machine, and assigning it a Salt minion ID of servo-master2.

Also, I don't know what I was thinking yesterday with the DNS records, but here are my updated recommendations after some sleep:

  • Salt minion IDs should match the local hostname on each machine (we should Salt this/add to setup instructions on wiki at a later time)
  • Each machine should also get a corresponding A record, e.g. A records for servo-master1.servo.org and servo-master2.servo.org pointing at the old and new masters.
  • Use CNAME records for functional identification, i.e. build.servo.org.

The goal is that for a given machine, the Salt minion ID == the hostname for the machine == the DNS A record for that machine, and that this is an immutable identifier for that machine. Separately, when we point at a given hostname for functional reasons (i.e. the Homu webhooks), we should use a CNAME record to point to the appropriate machine, because the particular machine that is responsible for these settings could change at any time and the application should not know.
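In zone-file terms, the convention would look roughly like this (the IP addresses are placeholders from the documentation range, not the real ones):

```
; A records: immutable, per-machine identity (placeholder IPs)
servo-master1.servo.org.  IN  A      203.0.113.10
servo-master2.servo.org.  IN  A      203.0.113.20
; CNAME: functional name that can move between machines
build.servo.org.          IN  CNAME  servo-master2.servo.org.
```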

Updated minion config settings:

```yaml
master:
  - servo-master1.servo.org
  - servo-master2.servo.org
```

After the switchover:

```yaml
master:
  - servo-master2.servo.org
```
Since we'd like the minions to connect to all the masters and not just one, we can use the immutable names directly and we don't need a salt.servo.org CNAME record anymore. This also will make it easier to go back to multimaster mode in the future if needed.

We may also want to consider adding master_alive_interval: 30 to our minion configs: https://docs.saltstack.com/en/2015.5/ref/configuration/minion.html#master-alive-interval, although this isn't necessary.
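Putting the pieces of this comment together, a sketch of the transitional /etc/salt/minion (with the optional interval set to the value suggested above):

```yaml
# /etc/salt/minion during the transition (sketch)
master:
  - servo-master1.servo.org
  - servo-master2.servo.org
master_alive_interval: 30  # seconds between master-connectivity checks; optional
```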

bors-servo added a commit that referenced this issue Mar 29, 2016
Handle multiple (redundant) masters

Update the Salt states to handle multiple minions that host masters.
This will allow us to easily enter redundant multimaster mode to handle
switching over our master from Linode to EC2, by using separate IDs for
each machine instead of trying to reuse the `servo-master` ID.

See #281 (comment) for more details.

I haven't updated the `common/map.jinja` file yet; are we still using these hostnames in the `/etc/hosts` file or is everything happening via DNS lookups?

@edunham (Contributor, author) commented Mar 29, 2016

@aneeshusa The production salt setup is now dual-master. They're servo-master0 (the old linode host) and servo-master1 (the new AWS host). I've sftp'd all the secrets and master configs over, and they both yield identical ping and highstate results except that servo-master1 isn't aware of a few decommissioned hosts like servo-head.

I believe this covers our bases for a smooth transition on the saltfs side of things, and now the buildmaster+homu+DNS move is all that's left to worry about.

I've amended all the minions' /etc/salt/minion files to contain

```yaml
master:
  - servo-master0.servo.org
  - servo-master1.servo.org
```

and they appear to be connecting to both buildmasters successfully at the same time.

```console
root@servo-master1:~# salt-run manage.status
down:
up:
    - servo-linux-cross1
    - servo-linux-cross2
    - servo-linux1
    - servo-linux2
    - servo-mac1
    - servo-mac2
    - servo-mac3
    - servo-macpro1
    - servo-master0
    - servo-master1
root@servo-master0:~# salt-run manage.status
down:
    - servo-head
    - servo-linux-android1
    - servo-master
up:
    - servo-linux-cross1
    - servo-linux-cross2
    - servo-linux1
    - servo-linux2
    - servo-mac1
    - servo-mac2
    - servo-mac3
    - servo-macpro1
    - servo-master0
    - servo-master1
```
@edunham (Contributor, author) commented Mar 30, 2016

The only surprise I ran into was that in /srv/pillar/buildbot.sls, master was set to servo-master, which had an IP hardcoded in the hosts file from https://github.com/servo/saltfs/blob/master/common/map.jinja. Changing that setting to master: build.servo.org and re-running the highstate made the slaves look up the buildmaster via DNS rather than the hardcoded IP, which solved the problem.
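For reference, the fix amounts to a pillar change along these lines (the surrounding key layout in buildbot.sls is an assumption for illustration; only the master: value comes from the change described above):

```yaml
# /srv/pillar/buildbot.sls (sketch; surrounding keys are illustrative)
buildbot:
  master: build.servo.org  # was: servo-master, resolved via a hardcoded hosts entry
```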

@aneeshusa, I read a bit about managing Salt with Salt, and while I think it's a good idea to manage the Salt config files, it looks like it introduces a variety of complications. If you'd like to invest the time into making it work, I'd happily help get a PR through, but for now I think this type of major change to the Salt setup is such an infrequent occurrence that our effort is better focused on more commonly used parts of the system.

@edunham (Contributor, author) commented Mar 30, 2016

We're shutting down servo-master0 (the Linode host) at 5pm PST today. If you need logs from a build that happened overnight before then, substitute servo-master0.servo.org for build.servo.org in the URL to access them. After servo-master0 has been shut down, you'll have to rerun a build to get its results.

@edunham closed this Mar 30, 2016
@aneeshusa (Member) commented Mar 30, 2016

👏

I did run a grep -r 'servo-master' against the saltfs repo, but didn't think to check the pillars. We should remember to check the pillars for next time.

Does buildbot properly respect DNS TTLs/will it retry and reconnect if we change the build.servo.org DNS entry in the future?

We still list servo-master with the old IP on the wiki - can we update that?

Agreed on the Salt changes for now.

@edunham (Contributor, author) commented Mar 30, 2016

I don't know the details of how Buildbot handles TTLs, but kicking the buildslave processes seemed to make them repeat the lookup, so Buildbot itself probably isn't caching too aggressively.

Wiki updated. I'll open an issue to discuss migrating away from hardcoded IPs in common/map.jinja entirely.

@aneeshusa (Member) commented Mar 30, 2016

Kicking buildbot should be good enough in that case.

Already opened an issue for common/map.jinja - see #287.

This was referenced Mar 30, 2016