Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Automatically restart Buildbot master on configuration change #505

Conversation

@aneeshusa
Copy link
Member

aneeshusa commented Oct 6, 2016

This should allow us to use Upstart to restart Buildbot cleanly, instead
of having to su to servo and restart Buildbot manually.


This change is Reviewable

@aneeshusa
Copy link
Member Author

aneeshusa commented Oct 6, 2016

I'm not 100% sure about how Upstart works, so we may want to try this manually on servo-master1 before merging.

r? @larsbergstrom @edunham

@metajack
Copy link
Contributor

metajack commented Oct 6, 2016

Upstart terminates the process and starts another on restart. The only difference between calling stop then start versus restart is that post-start and pre-stop and such logic is skipped for restart and the job control config is not reread.

There is also respawn, which restarts the job automatically when it exits. http://upstart.ubuntu.com/cookbook/#respawn

I don't think this patch is correct. What is the behavior you are observing and trying to fix?

@aneeshusa
Copy link
Member Author

aneeshusa commented Oct 6, 2016

Hmm, it does sound this patch won't work. Essentially, I want to teach Upstart how to restart Buildbot gracefully, so instead of running the steps at https://github.com/servo/servo/wiki/Buildbot-administration#dealing-with-troubles manually, we can always do service buildbot-master restart.

Since I don't see us upgrading Buildbot soon, most of the time we only need to reload the configuration, so another option would be cleanly reloading Buildbot. Maybe using service buildbot-master reload would work? Builbot handles SIGHUP, but it's not clear to me if it handles it cleanly (without interrupting ongoing builds).

@metajack
Copy link
Contributor

metajack commented Oct 6, 2016

There are other job types you could set up for one off things like this. What does --clean do anyway?

@aneeshusa
Copy link
Member Author

aneeshusa commented Oct 6, 2016

Buildbot's --clean parameter makes it shut down (and then possibly restart) cleanly, allowing all running builds to complete first. This was introduced in Buildbot 0.8.8.

The main point is to 1) avoid dropping builds 2) require as little admin effort as possible to update the Buildbot config (if you force-kill it it has to be cleaned up by hand to restart).

@aneeshusa aneeshusa force-pushed the aneeshusa:make-buildbot-upstart-configuration-more-robust branch from 347a310 to 0330c52 Oct 6, 2016
@aneeshusa aneeshusa changed the title Always (re)-start Buildbot cleanly from Upstart WIP: Automatically reload Buildbot master on configuration change Oct 6, 2016
@aneeshusa aneeshusa force-pushed the aneeshusa:make-buildbot-upstart-configuration-more-robust branch from 0330c52 to 9be76f2 Oct 7, 2016
@aneeshusa aneeshusa changed the title WIP: Automatically reload Buildbot master on configuration change WIP: Automatically restart Buildbot master on configuration change Oct 7, 2016
@aneeshusa
Copy link
Member Author

aneeshusa commented Oct 7, 2016

I looked into it more deeply and it seems Buildbot's reloading functionality has a lot of gotchas and edges cases it doesn't handle well. I also spent some more time with the Upstart Cookbook and I think I found another approach that may work - @metajack, does this look like a reasonable idea? (Still WIP.)

@metajack
Copy link
Contributor

metajack commented Oct 7, 2016

Looks good, though I'm confused where the value of $reason comes from.

@aneeshusa
Copy link
Member Author

aneeshusa commented Oct 7, 2016

The $reason variable is just to give Upstart unique ids for each instance; the cmd.run state will automatically set it to "salt-$(date)", while it will need to be specified manually for manual starts.

Earlier I was looking into making it work without specifying a reason, but it looks like that causes other problems, so I think I need to add another Upstart job to start an instance at boot-up.

@metajack
Copy link
Contributor

metajack commented Oct 7, 2016

Oh, I see it now. r=me if you think this is ready to go.

@aneeshusa aneeshusa force-pushed the aneeshusa:make-buildbot-upstart-configuration-more-robust branch 6 times, most recently from fb5aa02 to 6c90434 Dec 27, 2016
@aneeshusa
Copy link
Member Author

aneeshusa commented Dec 27, 2016

First commit now fixes #552, and I included a bunch more changes too beyond the original PR. This should really make Buildbot much nicer and automated!

r? @larsbergstrom or @edunham on the Salt/Buildbot changes
r? @metajack on the Upstart changes

Also, let me know what docs I should add with this PR! Most things should be handled automatically by Salt now.

I highly recommend trying this out in Vagrant with the servo-master1 VM. Use
initctl list | grep buildbot to look at the relevant jobs, and look inside the /var/log/upstart directory for relevant logs, for both buildbot-master and buildbot-master-autostart.
A few scenarios to try:

  • Fresh deploy (vagrant destroy --force servo-master1 && vagrant up servo-master1): DB should be created and Buildbot should be (re)started and stay alive
  • Reprovision: (vagrant provision servo-master1): no changes, should see the same Buildbot instance still running
  • VM Restart: (vagrant reload servo-master1): Should see an autostarted Buildbot instance
  • Upgrade (change buildbot version from 0.8.12 to 0.8.14 in buildbot/master/init.sls and run vagrant provision servo-master1): DB upgrade, Buildbot restart
  • Start from stopped: Force stop Buildbot (vagrant ssh servo-master1, then run sudo buildbot stop --clean /home/servo/buildbot/master via SSH, confirm Buildbot is stopped, then run vagrant provision servo-master1): Buildbot should be started via autostart

FYI, strange output is due to saltstack/salt#35989 and can be fixed by a Salt update.

aneeshusa added 3 commits Dec 22, 2016
For some reason (likely due to our "graceful" restarts of Buildbot),
when Buildbot is restarted after its configuration is updated,
it does not read the updated `.py` files but instead uses the existing
compiled bytecode (`.pyc`) files, which reflect an older version and are
not up to date. These bytecode files are also not regenerated,
meaning the new configuration does not (fully) take effect.

To work around this, disable bytecode caching for the Buildbot master
entirely to avoid using out-of-date bytecode.

saltfs-migration: Delete all `.pyc` files (recursively)
in the Buildbot master directory (/home/servo/buildbot/master).
When upgrading the Buildbot master version,
or starting from a fresh deploy (no existing database),
the Buildbot database must be upgraded in order
for Buildbot to start normally.
(The DB is usually created during `buildbot create-master`,
but we want to avoid checking in the database.)
Add a `cmd.run` state that only runs if the Buildbot version is changed,
using an `onchanges` requisite.

The upgrade script requires that Buildbot is not running,
presuambly to avoid conflicting updates,
so Buildbot must be stopped before running the upgrade.
We want to perform a clean stop, but the built-in `buildbot stop`
command is nonblocking and will return immediately, without waiting for
the existing Buildbot instance to finish.
Buildbot has blocking/waiting for stop functionality built-in but not
exposed, so add a small helper script to stop Buildbot and block until
it is finished shutting down, and invoke it before the upgrade.

An alternative would be trying to use Upstart or Salt directly
to stop Buildbot, as a clean shutdown boils down to sending a SIGUSR1
to the Buildbot process (only if one is running), in the Unix tradition.
However, this would be hard to integrate with; in particular, we need to
wait for the existing Buildbot process to finish running; the easiest
way to integrate this into a Salt state (without writing a custom Salt
state) is to start a process to do the waiting, hence the stop script.

Note that the upgrade-master command also adds various other cruft
to the master directory; the Buildbot internal upgradeDatabase API
is not called because it is layered in Twisted Reactor/inline callback
goop, and it is simpler to just call the CLI command.

Also update the Buildbot master states to be more strict
about using requisites for better ordering control,
and re-order/space out states for a better reading flow.
Buildbot comes with built-in functionality for clean restarts,
which entail starting a new Buildbot process that does the following:
- Cleanly shut down the existing buildmaster (existing instance)
  by waiting for pending builds to be finshed,
  ending the existing process.
- Start a new buildmaster in the new process, taking over.

Note that the new process becomes the new daemon,
and thus needs to linger/be kept alive and managed.

Upstart's built-in restart functionality hard-kills the existing
process, which is undesirable; it's also hard to make Upstart wait
for pending builds before stopping the existing process,
as only Buildbot knows about any pending builds.
Additionally, Buildbot has a limited reload functionality,
but there are many pitfalls, gotchas, and inconsistencies,
and it is not recommended for customized installations like ours.
(e.g. re-loading imports doesn't work.)

Note that when a Buildbot clean restart is requested, there are multiple
processes running simultaneously.
Model this in Upstart by using an "instance" job,
which makes it easy to queue restarts and monitor them,
without having to leave processes running in e.g. tmux or screen.

Add an additional Upstart task to spawn a single instance of the
Buildbot master service at boot-up,
or during a Salt deploy if no instances are alive.
More instances of the buildbot-master job can be started manually,
or automatically by Salt, to queue up restarts.
(The original process will exit gracefully after some time.)
All instances will stop gracefully on shutdown.

Note that the switch to an instance job means each separate instance
must be differentiated by a `reason` variable, which is not used for
any other purpose; this is automatically set to the current date/time,
suffixed with some metadata about the reason for the instance.
This has a side effect of creating separate Upstart logs
(in `/var/log/upstart` for each instance), but the choice of reason
keeps the logs sorted by date.

Finally, automate this by having Salt queue a new restart job instance
if anything about the Buildbot master configuration (or package)
changes.
@aneeshusa aneeshusa force-pushed the aneeshusa:make-buildbot-upstart-configuration-more-robust branch from 6c90434 to 0b69a00 Dec 27, 2016
@aneeshusa
Copy link
Member Author

aneeshusa commented Dec 27, 2016

Also, there is some manual migration required: Make sure to delete the existing .pyc files, e.g. with $ find /home/servo/buildbot/master -name '*.pyc' -delete. When poking inside Vagrant you can also leave a watch -d find /home/servo/buildbot/master "'*.pyc'" running to ensure no new *.pyc files get created, or use watch -d sh -c "'initctl list | grep buildbot'" to watch the running buildbot related Upstart jobs.

@bors-servo
Copy link
Contributor

bors-servo commented Jan 17, 2017

The latest upstream changes (presumably #577) made this pull request unmergeable. Please resolve the merge conflicts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

6 participants
You can’t perform that action at this time.