Handle Buildbot restarting more robustly #304

Open
aneeshusa opened this issue Apr 6, 2016 · 4 comments

Comments

@aneeshusa
Contributor

When updating the Buildbot configuration, we need to wait until Buildbot is not executing any jobs before we can safely restart it.

See discussion in #300.

Apparently there is a way to reload the Buildbot configuration without restarting it, via SIGHUP or buildbot reconfig, but it's fragile, so I'd prefer not to do that: http://docs.buildbot.net/current/manual/cfg-intro.html?highlight=reconfig#reloading-the-config-file-reconfig

Just to be clear, this is all for the Buildbot master config + service, not the builder machines, yes?

@larsbergstrom
Contributor

That's correct - I'm not as worried about the builder machines (personally) as we don't change their configuration very often.

cc @edunham

@aneeshusa
Contributor Author

A key component of a robust automated solution will likely be waiting until Buildbot has no open jobs. A few questions:

  • How long does it usually take to wait for Buildbot to not have any builds? Is it on the order of a few minutes, a few hours, all day? Is this latency likely to increase over time as we gain more contributions to Servo?
  • Is there a way to inform Homu to stop sending builds to Buildbot for a while (while we wait for the queue to drain), and then inform it that it's safe to send builds again (after a restart)?

If the latency here is short, I'd look towards a solution that integrates the waiting time into the highstate sequence. If the latency is longer (or likely to increase in the future), I'd prefer to do this more asynchronously - the Salt event bus should make this easy to do.
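The "wait for the queue to drain" step could be sketched roughly as below. This is only a minimal sketch: `wait_until_idle` and its `COUNT_CMD` argument are hypothetical names, and how `COUNT_CMD` actually asks Buildbot for the number of running builds is deliberately left open, since nothing in this thread settles that.

```shell
# wait_until_idle COUNT_CMD TIMEOUT_SECONDS [INTERVAL_SECONDS]
# COUNT_CMD is a hypothetical, caller-supplied command or shell function that
# prints the number of builds currently running on the master.
# Returns 0 once the count reaches 0, 1 if the timeout expires first,
# and 2 if COUNT_CMD itself fails.
wait_until_idle() {
    count_cmd=$1
    timeout=$2
    interval=${3:-30}
    waited=0
    while :; do
        # Ask the caller-supplied command how many builds are running.
        running=$($count_cmd) || return 2
        if [ "$running" -eq 0 ]; then
            return 0
        fi
        # Give up once we have waited at least TIMEOUT_SECONDS.
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
}
```

A short timeout would suit the "wait inside the highstate" option; for the asynchronous option the same loop could run out of band and report back when it returns.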

Buildbot masters also seem to have a multimaster mode that could help make these transitions more seamless: https://docs.buildbot.net/current/manual/cfg-global.html#multi-master-mode

Bonus points if we can rig up a "Buildbot is restarting" message to be shown via nginx (i.e. also inform nginx of Buildbot up/down times).

@aneeshusa
Contributor Author

Another consideration is that the Ubuntu machines (running Trusty) currently use Upstart for service management, but newer Ubuntu releases use systemd instead. It would be ideal if the chosen solution is init-agnostic, or at least has minimal coupling.
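One way to keep the coupling small might be to isolate the init-specific part in a single helper, as in this rough sketch. The function names and the `buildbot-master` service name in the usage note are hypothetical; the only checks relied on are the presence of `/run/systemd/system` (the usual test for a systemd-booted machine) and the `service` wrapper, which should also handle Upstart jobs on Trusty.

```shell
# detect_init [ROOT]
# Prints "systemd" if ROOT/run/systemd/system exists, otherwise "other".
# ROOT defaults to the real filesystem root; it is a parameter only so the
# check can be exercised against a scratch directory.
detect_init() {
    root=${1:-}
    if [ -d "$root/run/systemd/system" ]; then
        echo systemd
    else
        echo other
    fi
}

# restart_command INIT SERVICE
# Prints (does not run) the restart command for the given init system.
restart_command() {
    case $1 in
        systemd) echo "systemctl restart $2" ;;
        *)       echo "service $2 restart" ;;
    esac
}
```

For example, `restart_command "$(detect_init)" buildbot-master` prints the command that would be used on the current machine, assuming the master were registered under that (hypothetical) service name.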

@larsbergstrom
Contributor

I've had more luck with:

# su - servo
# buildbot restart --clean --nodaemon /home/servo/buildbot/master &

The only issue has been ensuring that it really is run as the correct user, which I think is much easier to do in Salt? This leaves around a process that will do a SIGHUP once the current job finishes, which I think is the most foolproof way to get the changes rolled out.
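To guard against the wrong-user problem, the two commands above could be wrapped in something like the following sketch. `restart_master_as` is a hypothetical helper, not anything Buildbot or Salt provides; it only refuses to proceed unless it is already running as the expected user, then issues the same `buildbot restart` invocation shown above.

```shell
# restart_master_as EXPECTED_USER BASEDIR
# Hypothetical wrapper: start the clean restart only when the current user
# matches EXPECTED_USER; otherwise print an error and return 1.
restart_master_as() {
    expected_user=$1
    basedir=$2
    current_user=$(id -un)
    if [ "$current_user" != "$expected_user" ]; then
        echo "refusing to restart: running as $current_user, expected $expected_user" >&2
        return 1
    fi
    # Same invocation as above, backgrounded so it can wait for running jobs.
    buildbot restart --clean --nodaemon "$basedir" &
}
```

Usage would be `restart_master_as servo /home/servo/buildbot/master`, run from whatever mechanism switches to the servo user.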

It usually takes about 45-50 minutes for a given job to complete. Our homu job queue is between empty and 10 items deep at any time, and it's hard to predict when those times are :-) I'm a little afraid of something that takes down homu to let the buildbot job end, because homu also handles all of the other queues on our other servo org repos.

Does that sound reasonable? I don't think it plays well with upstart/systemd, though.

@edunham edunham modified the milestones: Salt Best Practices, Buildbot Enhancements Jul 14, 2017
3 participants