Alternatives for multi-platform CI support in servo/servo #215

Closed · larsbergstrom opened this issue Feb 4, 2016 · 11 comments

@larsbergstrom (Contributor)

Today, we have a default strategy for each new platform or build configuration in the servo/servo repository: add new buildbot rules and rebalance them across a new set of builders that we spin up.

That has a couple of issues:

  1. Editing and deploying buildbot rules is something only core CI maintainers can really do: it requires root access and the ability to "deal with meltdowns," because any change can trigger cascading failures and someone has to be able to pick the whole system back up.
  2. Each builder we spin up requires additional oversight, maintenance, etc.

One alternative that I'm considering is to add both AppVeyor support (barosl/homu#87) and the ability to gate on multiple CI systems (barosl/homu#100) to homu.

This would mean that some new platforms (Windows - servo/servo#9406, ARM - https://github.com/mmatyas/servo-nightly/blob/master/.travis.yml, etc.) and some tests (test-tidy) could run on Travis or AppVeyor infrastructure, with homu gating on the merged buildbot + Travis + AppVeyor results.
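
To make the AppVeyor part concrete, a Cargo-based Windows build there is driven by an appveyor.yml checked into the repo, and homu would then gate on the AppVeyor commit status alongside buildbot. A minimal sketch might look like the following; the installer URL, install flags, and commands are illustrative assumptions, not Servo's actual configuration:

```yaml
# Illustrative appveyor.yml sketch: install a Rust toolchain, then build and
# test with Cargo. URLs, flags, and paths here are assumptions.
version: "{build}"
install:
  - appveyor DownloadFile https://static.rust-lang.org/dist/rust-nightly-x86_64-pc-windows-gnu.exe
  - rust-nightly-x86_64-pc-windows-gnu.exe /VERYSILENT /NORESTART /DIR="C:\Rust"
  - set PATH=%PATH%;C:\Rust\bin
  - rustc --version
  - cargo --version
build_script:
  - cargo build --verbose
test_script:
  - cargo test --verbose
```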

The upsides are:

  1. We don't have to maintain those servers.
  2. As the linked PRs show, community members can contribute to and test the CI rules much more easily.

Downsides:

  1. If any one of the three services goes offline, we're blocked on landing things. Today, we're really only blocked if Amazon, Linode, MacStadium, or GitHub goes down; this would add another one or two services to that list.
  2. It may be difficult to get very large instance types (e.g., the c4.4xlarge that we use on EC2) on some of these other services, at least in the first 3-6 months, which could put a ceiling on how fast our builds can be.
  3. More homu complexity.

Thoughts? CC @Manishearth @metajack @edunham

@larsbergstrom larsbergstrom changed the title Alternatives for multi-platform support in servo/servo Alternatives for multi-platform CI support in servo/servo Feb 4, 2016
@larsbergstrom larsbergstrom self-assigned this Feb 4, 2016
@jdm (Member) commented Feb 4, 2016

I am in favour of using other services here. I am not particularly concerned about service outages right now, because:

  • the ones we would be relying upon are run by organizations that have commercial interest in minimizing downtime
  • being unable to land PRs for some bounded period of time does not have any real impact on the project until we have actual releases and deadlines

@jdm (Member) commented Feb 4, 2016

That being said, we wouldn't have the ability to log in to the builders like we do with our own, would we? That might impact our ability to reproduce certain failures.

@Manishearth (Member)

I really like the solution of adding AppVeyor (etc.) support, since a lot of our CI problems would just disappear. I'd like to see how well AppVeyor pans out before committing to it (if we plan to pay them for something, that is).

I'm really okay with the "editing buildbot is hard" thing, though (we don't need to edit the rules much). My main concern is the time sink that might get created if we have to maintain our own Windows infra (maintaining Linux/Mac is already work), especially since I suspect @larsbergstrom is the only one familiar enough with working on Windows to do it 😄

@jdm note that Travis allows you to SSH into their builders if you need it. Not sure if AppVeyor has something similar.

@larsbergstrom (Contributor, Author)

@jdm It appears that AppVeyor lets you RDP (the Windows equivalent of VNC) into the build workers: http://www.appveyor.com/docs/how-to/rdp-to-build-worker
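
For reference, the linked doc has you opt into RDP from appveyor.yml with a PowerShell one-liner, roughly as below (check the doc for the exact script location; this is quoted from memory):

```yaml
# Roughly what the AppVeyor RDP doc describes: pause at the end of the build
# so a developer can RDP into the worker and investigate interactively.
on_finish:
  - ps: $blockRdp = $true; iex ((new-object net.webclient).DownloadString('https://raw.githubusercontent.com/appveyor/ci/master/scripts/enable-rdp.ps1'))
```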

@Manishearth I'm probably more worried about the "editing buildbot is hard" thing because I feel like each time we add another builder it doubles in likelihood-to-go-boom. Adding more platforms is definitely going to make things even worse :-/

I'll agree on the Windows buildbot issues. Everyone I've talked with on the Mozilla side who has experience with it has basically said, "have fun with that."

@edunham (Contributor) commented Feb 8, 2016

I agree with @jdm regarding outages -- for now, having Windows test results sometimes will be better than never having them, and if Appveyor turns out to be down a lot we can look at moving the tests to our own infrastructure or teaching Homu to ignore failures from tests attempted on unreachable platforms.

I can see this potentially causing some confusion about where to put a given piece of testing logic (should Homu track it, or should Buildbot and Travis and Appveyor each track it independently? Will one platform ever need to run a test conditionally on some other platform's result?). As long as we document our intentions for the new system clearly, it shouldn't be much of a problem, though.

@aneeshusa (Contributor)

For the Windows builds, I agree: unless we have someone with significant Windows + buildbot experience, it makes more sense to let AppVeyor handle them.

For the ARM builds, it looks like the servo-nightly repo @larsbergstrom linked is cross-compiling for ARM. If so, I'd prefer integrating an ARM cross-compile step into our existing Buildbot flow instead of adding Travis as another external dependency. At the very least, we can add it to Buildbot now and possibly move things to Travis later, once the appropriate homu work is done. (Also, if we want real ARM hardware, I think I have some spare Raspberry Pis 😜.)
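
For context, a cross-compile flow of that kind looks roughly like the sketch below when expressed as a .travis.yml; the target triple, apt packages, and linker setup are illustrative assumptions rather than the actual servo-nightly configuration, and the same steps could just as easily be scripted under Buildbot:

```yaml
# Illustrative ARM cross-compile sketch (not the servo-nightly config):
# install a cross GCC for linking, add the ARM target to the Rust toolchain,
# point Cargo at the cross linker, and build for the ARM triple.
language: rust
sudo: required
addons:
  apt:
    packages:
      - gcc-arm-linux-gnueabihf
      - libc6-dev-armhf-cross
before_script:
  - rustup target add armv7-unknown-linux-gnueabihf
  - mkdir -p .cargo
  - printf '[target.armv7-unknown-linux-gnueabihf]\nlinker = "arm-linux-gnueabihf-gcc"\n' > .cargo/config
script:
  - cargo build --target=armv7-unknown-linux-gnueabihf --verbose
```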

On a side note, are we still using Linode? I thought we had moved away from them to EC2 reserved instances.

@Manishearth (Member)

Yeah, we now have EC2 reserved instances, but we can spin up more (on-demand, IIRC?) if we need to.

@larsbergstrom (Contributor, Author)

We're moving to EC2 as much as we can (EC2 auto-bills to Mozilla; I have to use my personal credit card for Linode, which Finance frowns upon). The smaller builders will be easy to move soon, but the actual salt master is going to be a bit of a nightmare. It'll get a new IP, we'll need a new mapping for build.servo.org, and I expect our whole CI to be offline for roughly a day while DNS entries and "oops, we left a raw IP in that GH repo's config" issues get sorted out.

@aneeshusa (Contributor)

Heh, paying off technical debt is never fun. If you give me a heads-up, I can try to be around (on IRC, I guess) for the transition. Also, I'd recommend not doing a clean cutover but phasing it with Salt multimaster: add the new master first, confirm functionality with a deprecation period (leave any webhooks running and check the logs to see whether anything is still pinging the old machine), then shut down the old one.
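
(For anyone unfamiliar with Salt multimaster: during the transition, each minion just lists both masters in its config, and the old one is removed afterwards. A minimal sketch, with placeholder hostnames:)

```yaml
# /etc/salt/minion (or a drop-in under /etc/salt/minion.d/) during the
# transition; hostnames are placeholders, not the real Servo machines.
master:
  - old-salt-master.example.org
  - new-salt-master.example.org
```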

@larsbergstrom (Contributor, Author)

Now that AppVeyor support has landed in master (servo/servo#9863) and we've got the build times down to ~30 minutes, I'm very tempted to just fix the homu bugs to get things running.

@larsbergstrom (Contributor, Author)

This has been rolled out!
