Alternatives for multi-platform CI support in servo/servo #215

Closed · larsbergstrom opened this issue Feb 4, 2016 · 11 comments

@larsbergstrom (Contributor)

Today, we have a default strategy for each new platform or build configuration in the servo/servo repository: add new buildbot rules and rebalance them across a new set of builders that we spin up.

That has a couple of issues:

  1. Editing and deploying buildbot rules is something only core CI maintainers can really do: it requires root access and the ability to "deal with meltdowns," because any change can trigger cascading failures and someone has to be able to pick the whole system back up.
  2. Each builder we spin up requires additional oversight, maintenance, etc.

One alternative that I'm considering is to add both AppVeyor support (barosl/homu#87) and the ability to gate on multiple CI systems (barosl/homu#100) to homu.

This would mean that some new platforms (Windows - servo/servo#9406, ARM - https://github.com/mmatyas/servo-nightly/blob/master/.travis.yml, etc.) and some tests (test-tidy) could run on Travis or AppVeyor infrastructure, with homu gating on the merged buildbot + Travis + AppVeyor results.
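
To make the AppVeyor part concrete, a Cargo-based Windows build there is driven by an appveyor.yml checked into the repo, and homu would then gate on the AppVeyor commit status alongside buildbot. A minimal sketch might look like the following; the installer URL, install flags, and commands are illustrative assumptions, not Servo's actual configuration:

```yaml
# Illustrative appveyor.yml sketch: install a Rust toolchain, then build and
# test with Cargo. URLs, flags, and paths here are assumptions.
version: "{build}"
install:
  - appveyor DownloadFile https://static.rust-lang.org/dist/rust-nightly-x86_64-pc-windows-gnu.exe
  - rust-nightly-x86_64-pc-windows-gnu.exe /VERYSILENT /NORESTART /DIR="C:\Rust"
  - set PATH=%PATH%;C:\Rust\bin
  - rustc --version
  - cargo --version
build_script:
  - cargo build --verbose
test_script:
  - cargo test --verbose
```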

The upsides are:

  1. We don't have to maintain those servers.
  2. As the linked PRs show, community members can contribute to and test the CI rules much more easily.

Downsides:

  1. If any one of the three services goes offline, we're blocked on landing things. Today, we're really only blocked if Amazon, Linode, MacStadium, or GitHub goes down; this would add another one or two services to that list.
  2. It may be difficult to get very large instance types (e.g., the c4.4xlarge that we use on EC2) on some of these other services, at least in the first 3-6 months, which could put a ceiling on how fast our builds can be.
  3. More homu complexity.

Thoughts? CC @Manishearth @metajack @edunham

@larsbergstrom larsbergstrom changed the title Alternatives for multi-platform support in servo/servo Alternatives for multi-platform CI support in servo/servo Feb 4, 2016
@larsbergstrom larsbergstrom self-assigned this Feb 4, 2016
@jdm (Member) commented Feb 4, 2016

I am in favour of using other services here. I am not particularly concerned about service outages right now, because:

  • the ones we would be relying upon are run by organizations that have commercial interest in minimizing downtime
  • being unable to land PRs for some bounded period of time does not have any real impact on the project until we have actual releases and deadlines

@jdm (Member) commented Feb 4, 2016

That being said, we wouldn't have the ability to log in to the builders like we do with our own, would we? That might impact our ability to reproduce certain failures.

@Manishearth (Member)

I really like the solution of adding AppVeyor (etc.) support, since a lot of our CI problems would just disappear. I'd like to see how well AppVeyor pans out before committing to it (if we plan to pay them for something, that is).

I'm really okay with the "editing buildbot is hard" thing, though (we don't need to edit the rules much). My main concern is the time sink that might get created if we have to maintain our own Windows infra (maintaining Linux/Mac is already work), especially since I suspect @larsbergstrom is the only one familiar enough with working on Windows to do it 😄

@jdm note that Travis allows you to SSH into their builders if you need it. Not sure if AppVeyor has something similar.

@larsbergstrom (Contributor, Author)

@jdm It appears that AppVeyor lets you RDP (the Windows equivalent of VNC) into the build workers: http://www.appveyor.com/docs/how-to/rdp-to-build-worker
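
For reference, the linked doc has you opt into RDP from appveyor.yml with a PowerShell one-liner, roughly as below (check the doc for the exact script location; this is quoted from memory):

```yaml
# Roughly what the AppVeyor RDP doc describes: pause at the end of the build
# so a developer can RDP into the worker and investigate interactively.
on_finish:
  - ps: $blockRdp = $true; iex ((new-object net.webclient).DownloadString('https://raw.githubusercontent.com/appveyor/ci/master/scripts/enable-rdp.ps1'))
```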

@Manishearth I'm probably more worried about the "editing buildbot is hard" thing because I feel like each time we add another builder it doubles in likelihood-to-go-boom. Adding more platforms is definitely going to make things even worse :-/

I'll agree on the Windows buildbot issues. Everyone I've talked with on the Mozilla side who has experience with it has basically said, "have fun with that."

@edunham (Contributor) commented Feb 8, 2016

I agree with @jdm regarding outages -- for now, having Windows test results sometimes will be better than never having them, and if Appveyor turns out to be down a lot we can look at moving the tests to our own infrastructure or teaching Homu to ignore failures from tests attempted on unreachable platforms.

I can see this potentially causing some confusion about where to put a given piece of testing logic (should Homu track it, or should Buildbot and Travis and Appveyor each track it independently? Will one platform ever need to run a test conditionally on some other platform's result?). As long as we document our intentions for the new system clearly, it shouldn't be much of a problem, though.

@aneeshusa (Contributor)

For the Windows builds, I agree: unless we have someone with significant Windows + buildbot experience, it makes more sense to let AppVeyor handle them.

For the ARM builds, it looks like the servo-nightly repo @larsbergstrom linked is cross-compiling for ARM. If so, I'd prefer integrating an ARM cross-compile step into our existing Buildbot flow instead of adding Travis as another external dependency. At the very least, we can add it to Buildbot now and possibly move things to Travis later, once the appropriate homu work is done. (Also, if we want real ARM hardware, I think I have some spare Raspberry Pis 😜.)
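
For context, a cross-compile flow of that kind looks roughly like the sketch below when expressed as a .travis.yml; the target triple, apt packages, and linker setup are illustrative assumptions rather than the actual servo-nightly configuration, and the same steps could just as easily be scripted under Buildbot:

```yaml
# Illustrative ARM cross-compile sketch (not the servo-nightly config):
# install a cross GCC for linking, add the ARM target to the Rust toolchain,
# point Cargo at the cross linker, and build for the ARM triple.
language: rust
sudo: required
addons:
  apt:
    packages:
      - gcc-arm-linux-gnueabihf
      - libc6-dev-armhf-cross
before_script:
  - rustup target add armv7-unknown-linux-gnueabihf
  - mkdir -p .cargo
  - printf '[target.armv7-unknown-linux-gnueabihf]\nlinker = "arm-linux-gnueabihf-gcc"\n' > .cargo/config
script:
  - cargo build --target=armv7-unknown-linux-gnueabihf --verbose
```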

On a side note, are we still using Linode? I thought we had moved away from them to EC2 reserved instances.

@Manishearth (Member)

Yeah, we now have EC2 reserved instances, but we can spin up more (on-demand, IIRC?) if we need to.

@larsbergstrom (Contributor, Author)

We're moving to EC2 as much as we can (EC2 auto-bills to Mozilla; I have to use my personal credit card for Linode, which Finance frowns upon). The smaller builders will be easy to move soon, but the actual salt master is going to be a bit of a nightmare. It'll get a new IP, we'll need a new mapping for build.servo.org, and I expect our whole CI to be offline for roughly a day while DNS entries and "oops, we left a raw IP in that GH repo's config" issues get sorted out.

@aneeshusa (Contributor)

Heh, paying off technical debt is never fun. If you give me a heads-up, I can try to be around (on IRC, I guess) for the transition. Also, I'd recommend not doing a clean cutover but phasing it with Salt multimaster: add the new master first, confirm functionality with a deprecation period (leave any webhooks running and check the logs to see whether anything is still pinging the old machine), then shut down the old one.
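
(For anyone unfamiliar with Salt multimaster: during the transition, each minion just lists both masters in its config, and the old one is removed afterwards. A minimal sketch, with placeholder hostnames:)

```yaml
# /etc/salt/minion (or a drop-in under /etc/salt/minion.d/) during the
# transition; hostnames are placeholders, not the real Servo machines.
master:
  - old-salt-master.example.org
  - new-salt-master.example.org
```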

@larsbergstrom (Contributor, Author)

Now that AppVeyor support has landed in master (servo/servo#9863) and we've got the build times down to ~30 minutes, I'm very tempted to just fix the homu bugs to get things running.

@larsbergstrom (Contributor, Author)

This has been rolled out!
