Reload config after sighup #15

ltratt · 2020-01-30T17:31:18Z

This PR enables snare to reload its config on SIGHUP. This is a bit more involved than it first seems because snare is multi-threaded. This PR thus goes through several stages (putting Config behind a Mutex and so on), before actually handling SIGHUP. Note that one case is handled somewhat, but not perfectly: reducing the value of maxjobs. I could do more here, but I think this is a niche case, and it's hard to test: the basic approach in this PR seems decent enough to me for the time being.

Although sharing allowed us to slightly minimise memory usage, the saving was illusory: we still had to `clone()` `String`s and so on later. One possibility is to cache `RepoConfig`s (and distribute them through `Arc` or similar), but that seemed unnecessarily fussy, and would also mean we'd have to implement extra stuff like cache eviction. This commit thus simplifies things: every time we query the `Config` about a repo, we get back a new `RepoConfig` that is not bound in any way to `Config`. However, since the GitHub secret is a `SecStr`, my assumption is that a) it's expensive to clone (since it's doing `mprotect()` and so on) and b) duplicating it repeatedly throughout the heap might make it easier for an attacker to find and decode it. We thus return that separately from the `RepoConfig`.

This is a necessary step to allowing reloading of Configs.

This is correct, but somewhat crude: any errors in the config will cause the whole process to terminate, for example.

vext01 · 2020-01-30T18:20:49Z

> putting Config behind a Mutex and so on

Is the config mutable after parsing then? That would surprise me.

ltratt · 2020-01-30T19:28:14Z

> Is the config mutable after parsing then? That would surprise me.

When you send SIGHUP, the entire config file is reparsed and a new Config produced. So the Config is not mutable as such, but it can be replaced by a new Config.

vext01 · 2020-01-31T09:45:24Z

> When you send SIGHUP, the entire config file is reparsed and a new Config produced. So the Config is not mutable as such, but it can be replaced by a new Config.

In the past, I've just had the program re-exec(3) itself; then you don't have to worry about any of this. I'm not sure if that would work for us here?

ltratt · 2020-01-31T09:45:55Z

> In the past, I've just had the program re-exec(3) itself

That would not be good here since we'd destroy the queue of jobs!

vext01 · 2020-01-31T09:47:23Z

> That would not be good here since we'd destroy the queue of jobs!

Well, you'd have to wait for them to finish. I thought you were doing that already, but I guess from your response that you allow them to continue during the reload?

ltratt · 2020-01-31T09:48:14Z

Yes, the reason this PR is quite so fiddly is that it has no effect on ongoing jobs, but we still try to enact all the reasonable changes as soon as possible.

vext01 · 2020-01-31T09:49:04Z

I see. I hadn't appreciated that!

Code review coming soon.

ltratt · 2020-01-31T09:52:29Z

Simplifying a bit: we don't change the config of any running job. So if a job was run when you said email="a@b" and you SIGHUP it so email="c@d" then the existing job will still send to a@b but all new jobs will send to c@d. As this suggests, each job has its own "unique" config in a sense (RepoConfig).
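The snapshot behaviour described here can be sketched as follows. This is a minimal illustration rather than snare's actual code: the `email` field, the `repoconfig()` method, and `snapshot_then_reload()` are hypothetical names, and the real `Config` and `RepoConfig` hold more than one field.

```rust
use std::sync::{Arc, Mutex};

/// Stand-in for the global configuration (hypothetical single field).
#[derive(Debug)]
struct Config {
    email: String,
}

/// Per-job snapshot: owns its data, so it is not tied to `Config` in any way.
#[derive(Debug)]
struct RepoConfig {
    email: String,
}

impl Config {
    /// Build a fresh, independent `RepoConfig` each time we are asked.
    fn repoconfig(&self) -> RepoConfig {
        RepoConfig {
            email: self.email.clone(),
        }
    }
}

/// Returns (address used by the already-running job, address used by a new job).
fn snapshot_then_reload() -> (String, String) {
    let conf = Arc::new(Mutex::new(Config {
        email: "a@b".to_owned(),
    }));

    // A job starts and takes its own snapshot of the config.
    let running_job = conf.lock().unwrap().repoconfig();

    // SIGHUP arrives: the whole Config is replaced, not mutated field by field.
    *conf.lock().unwrap() = Config {
        email: "c@d".to_owned(),
    };

    // A job started after the reload sees the new values.
    let new_job = conf.lock().unwrap().repoconfig();

    (running_job.email, new_job.email)
}
```

Calling `snapshot_then_reload()` yields `a@b` for the job that was already running and `c@d` for the job started after the reload.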

The major exception is 82af641: it's really fiddly -- and in the general case impossible -- to deal well with SIGHUP reducing the number of maxjobs if there are already running jobs. So I implemented something simple, given that this is a fairly niche case: 82af641#diff-b319aab93ab499624a467ced0e18a2a8R364.

vext01 · 2020-01-31T10:22:11Z

> So if a job was run when you said email="a@b" and you SIGHUP it so email="c@d" then the existing job will still send to a@b

I think that's OK. As a user I'd kind of expect that.

> to deal well with SIGHUP reducing the number of maxjobs if there are already running jobs.

Hrm, yes. That's annoying.

I can think of two "possible improvements":

- Wait until there are fewer-or-equal jobs than the new maxjobs, then create a new smaller array from the used slots of the old array. I think this is what your comment suggests with "compaction".
- Keep a count of how many jobs there are, but allow the backing array to be larger. This may waste memory and would be annoying to keep in sync.

I think the solution you have now is pretty good tbh.
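For illustration only, the first suggestion ("compaction") might look roughly like the sketch below. It is not what the PR implements: `try_resize` is a hypothetical helper, and it assumes the running jobs live in a `Vec<Option<T>>` with one slot per permitted job. Any other structure indexed by slot position (a parallel pollfds vector, say) would need the same reordering, which this sketch ignores.

```rust
/// Try to resize `running` to `new_max` slots.
///
/// Returns false, leaving `running` untouched, if more jobs are currently
/// running than `new_max` allows; the caller would retry after a job finishes.
fn try_resize<T>(running: &mut Vec<Option<T>>, new_max: usize) -> bool {
    let used = running.iter().filter(|slot| slot.is_some()).count();
    if used > new_max {
        return false;
    }
    // Compact: keep only the occupied slots, in their original order...
    let mut compacted: Vec<Option<T>> = running
        .drain(..)
        .filter(|slot| slot.is_some())
        .collect();
    // ...then pad with empty slots up to the new maximum.
    compacted.resize_with(new_max, || None);
    *running = compacted;
    true
}
```

The same function also covers the easy case of increasing `maxjobs`, since the occupied-slot check can never fail when `new_max` is at least the old length.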

vext01 · 2020-01-31T10:28:28Z

src/jobrunner.rs

-        pollfds.resize_with(snare.config.maxjobs * 2 + 1, || {
-            PollFd::new(-1, PollFlags::empty())
-        });
+        // If the unwrap() on the lock fails, the other thread has paniced.


Is it necessary to repeat this comment on every mutex unlock? Mutex poisoning is well-known among Rust programmers and the documentation explains it well.

I tend to agree. I think in a future PR I will put this on the attribute in the struct.

vext01 · 2020-01-31T10:32:09Z

src/jobrunner.rs

-        });
+        // If the unwrap() on the lock fails, the other thread has paniced.
+        let maxjobs = snare.config.lock().unwrap().maxjobs;
+        assert!(maxjobs < std::usize::MAX);


I'm not sure this is a meaningful assertion. Since both arguments are usize it's equivalent to:

assert!(maxjobs != std::usize::MAX);

Is that being used as a boundary condition elsewhere perhaps?

Actually, this should be `assert!(maxjobs < ((std::usize::MAX - 1) / 2))`! Long story, but it's basically about the number of pollfds we create. In practice, of course, we're probably not going to have enough RAM for this to ever be an issue.

Fixed in 5ba7c25.

OK, that looks more valid at least.

I missed why the / 2, but if the story is really long, I'll trust you ;)

Each job has stderr and stdout pipes, which is where the `* 2` (or `/ 2`) comes from.
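The arithmetic behind that bound can be checked with a small sketch. `pollfds_len` is a hypothetical helper, not something in the PR; it computes the `maxjobs * 2 + 1` capacity seen in the diff using checked arithmetic, so that overflow shows up as `None` instead of wrapping. What the extra `+ 1` slot is used for is not stated in this thread.

```rust
/// Number of pollfd slots needed for `maxjobs` jobs: two pipes per job
/// (stderr and stdout) plus one extra slot. Returns `None` if the
/// computation would overflow `usize`.
fn pollfds_len(maxjobs: usize) -> Option<usize> {
    maxjobs.checked_mul(2)?.checked_add(1)
}
```

The largest `maxjobs` for which this succeeds is `(usize::MAX - 1) / 2`, so the strict `<` in the assertion is conservative by one.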

vext01 · 2020-01-31T10:33:02Z

src/jobrunner.rs

+        assert!(maxjobs < std::usize::MAX);
+        let mut running = Vec::with_capacity(maxjobs);
+        running.resize_with(maxjobs, || None);
+        let mut pollfds = Vec::with_capacity(maxjobs * 2 + 1);


Perhaps the above assertion is to cater for the + 1 here?

vext01 · 2020-01-31T10:38:02Z

src/main.rs

+impl Snare {
+    /// Check to see if we've received a SIGHUP since the last check. If so, we will reload the
+    /// config file. **Note that because snare has multiple threads, the config file can change at
+    /// any arbitrary point, not just after calling this function.**


I don't understand the "not just after calling this function" part of the comment.

When else can the config change?

There are two threads in snare, so the config can be changed by the other thread even if the current thread hasn't called check_for_hup.

To be clear, the config may change when neither thread received a HUP?

At the moment, the signal comes in, a global bool is set, and then jobrunner checks that bool and reloads the config if necessary. There are only two threads, so one of them has to handle it (and the config stuff is way too much for it to be signal safe, so the signal handler can't deal with it directly).

OK, understood. But the comment makes it sound like some other part of the program (not this function) is mutating the config. As I understand it, you mean instead that this function could be run in another thread.

Is 3f03f44 better?

Much clearer, thanks.

vext01 · 2020-01-31T10:38:51Z

src/main.rs

+    /// config file. **Note that because snare has multiple threads, the config file can change at
+    /// any arbitrary point, not just after calling this function.**
+    fn check_for_hup(&self) {
+        if self.sighup_occurred.load(Ordering::Release) {


Is the ordering supposed to be acquire here?

I think Relaxed is fine here as we don't need any other read/writes to have occurred before/after this.

But we have Release.

Perhaps this and the ordering a few lines later should both be Relaxed?

(If the config weren't in a mutex, you'd certainly want acquire/release, otherwise another thread may see the config mid-move)

You're totally right: they should both be Relaxed and one of the later commits (ff45ea1#diff-639fbc4ef05b315af92b4d836c31b023R66) fixes that. I don't know why I ever thought Release was the correct ordering!
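A minimal sketch of the flag-checking pattern discussed in this thread, using only the standard library. It is an illustration under assumptions, not the PR's code: the `Snare` struct here is cut down to two fields, a `u32` stands in for `Config`, the `reload` closure stands in for reparsing the config file, and `swap` is used to read and clear the flag in one step where the PR uses separate operations. `Relaxed` is enough because the flag guards no other memory; the config itself is protected by the `Mutex`. Note also that the standard library documents `load` as panicking when given `Release` or `AcqRel`.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};

struct Snare {
    /// Stand-in for the real `Config` (hypothetical: a single number).
    conf: Mutex<u32>,
    /// Set to true by the signal handler; cleared by whichever thread reloads.
    sighup_occurred: Arc<AtomicBool>,
}

impl Snare {
    /// If a SIGHUP has been flagged since the last check, replace the config
    /// with whatever `reload` produces. Returns true if a reload happened.
    fn check_for_hup(&self, reload: impl FnOnce() -> u32) -> bool {
        // swap() atomically reads the old value and clears the flag, so two
        // threads checking at once cannot both see the same SIGHUP.
        if self.sighup_occurred.swap(false, Ordering::Relaxed) {
            // If the unwrap() fails, another thread panicked holding the lock.
            *self.conf.lock().unwrap() = reload();
            true
        } else {
            false
        }
    }
}
```

In a real signal handler only the `store(true, Ordering::Relaxed)` on the flag would run; everything else happens later on an ordinary thread.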

vext01 · 2020-01-31T10:40:29Z

src/main.rs

+    let sighup_occurred = Arc::new(AtomicBool::new(false));
+    {
+        let sighup_occurred = Arc::clone(&sighup_occurred);
+        if let Err(e) = unsafe {


If you stored the closure in a variable, I think you might be able to limit the scope of unsafe some more?

I'm not sure I understand?

So if you did something like:

    if let Err(e) = {
        let f = move || {
            // All functions called in this function must be signal safe. See signal(3).
            sighup_occurred.store(true, Ordering::Relaxed);
            unsafe { nix::unistd::write(event_write_fd, &[0]).ok() };
        };
        unsafe { signal_hook::register(signal_hook::SIGHUP, f) };
    }

Then fewer lines can be in unsafe? I may be wrong.

TBH, I'm fairly happy having the closure in unsafe because it's a signal handler and being clear that "here be dragons" is not a bad idea.

vext01 · 2020-01-31T10:44:41Z

src/main.rs

+        if self.sighup_occurred.load(Ordering::Relaxed) {
+            match Config::from_path(&self.conf_path) {
+                Ok(config) => *self.config.lock().unwrap() = config,
+                Err(msg) => eprintln!("{}", msg),


So the error is printed to stderr, which may be in the background and may go unnoticed.

Not sure how you can fix that though.

One potential idea would be to have a snare --reload which communicates with the existing snare instance and prints any errors to its own stderr (not the daemon's). However, this would probably need more complex IPC.

A later PR will send this to syslog.

sounds good!
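The "report the error and keep the old config" behaviour in the diff above can be sketched like this. The `parse` function and its one-line `maxjobs = N` format are hypothetical stand-ins for `Config::from_path`, chosen only so the example is self-contained; the point is that the new config is fully built before the lock is taken, so a bad file never disturbs the running config.

```rust
use std::sync::Mutex;

#[derive(Debug)]
struct Config {
    maxjobs: usize,
}

/// Hypothetical parser accepting a single `maxjobs = N` line.
fn parse(text: &str) -> Result<Config, String> {
    let (key, val) = text
        .split_once('=')
        .ok_or_else(|| format!("expected 'key = value', got '{}'", text))?;
    if key.trim() != "maxjobs" {
        return Err(format!("unknown option '{}'", key.trim()));
    }
    let maxjobs = val
        .trim()
        .parse::<usize>()
        .map_err(|e| format!("invalid maxjobs: {}", e))?;
    Ok(Config { maxjobs })
}

/// Replace the shared config only if `text` parses; otherwise leave it as is
/// and hand the error back to the caller to report.
fn reload(conf: &Mutex<Config>, text: &str) -> Result<(), String> {
    let new = parse(text)?; // parse first, without holding the lock
    *conf.lock().unwrap() = new;
    Ok(())
}
```

A caller would report the `Err` case (with `eprintln!` as in the diff, or via syslog as mentioned above) and carry on with the previous config.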

vext01 · 2020-01-31T10:46:18Z

src/main.rs

+        eprintln!("{}", msg);
+    } else {
+        eprintln!("{}.", msg);
+    }


I wonder if this is worth it :)

There's any number of silly things the caller might do.

msg("error.."); msg("error,"); ...

Yeah, I'm not sure either :/

ltratt · 2020-01-31T11:35:15Z

I think that's everything?

vext01 · 2020-01-31T11:56:41Z

LGTM. Please squash.

If we're running, and receive SIGHUP, it's possible that the user's changes to the config file are incorrect. Rather than aborting, it's better that we report the problem, ignore the new config file, and keep on running.

If the user asks for more jobs to be run, we have an easy task; if they ask for fewer to be run, then it is much trickier. The approach this commit takes for the latter case is simple, but means that we can find ourselves in situations where we are never able to actually reduce the maximum number of running jobs.

Previously we were inconsistent about whether variables were "conf" or "config". This commit homogenises this to "conf" (though the types are still the longer "Config").

ltratt · 2020-01-31T12:25:54Z

Squashed.

vext01 · 2020-01-31T12:31:07Z

bors r+

15: Reload config after sighup r=vext01 a=ltratt This PR enables snare to reload its config on SIGHUP. This is a bit more involved than it first seems because snare is multi-threaded. This PR thus goes through several stages (putting `Config` behind a `Mutex` and so on), before actually handling SIGHUP. Note that one case is handled somewhat, but not perfectly: reducing the value of `maxjobs`. I could do more here, but I think this is a niche case, and it's hard to test: the basic approach in this PR seems decent enough to me for the time being. Co-authored-by: Laurence Tratt <laurie@tratt.net>

ltratt · 2020-01-31T12:37:57Z

@vext01 Any idea why this failed? buildbot seems to have succeeded but bors failed?

vext01 · 2020-01-31T13:37:53Z

Yeah, that doesn't look like our issue.

Let's try again.

bors r+

vext01 · 2020-01-31T14:29:13Z

once more for luck

bors r+

ltratt added 5 commits January 30, 2020 17:21

Comments should consume input until the end of the line.

47bf0ea

Put Config into a Mutex.

6aba57d

This is a necessary step to allowing reloading of Configs.

Split loading/parsing a config file apart from command-line args pars…

389f033

…ing.

Reload the Config when SIGHUP is received.

600dece

This is correct, but somewhat crude: any errors in the config will cause the whole process to terminate, for example.

ltratt assigned vext01 Jan 30, 2020

vext01 reviewed Jan 31, 2020

ltratt added 3 commits January 31, 2020 12:25

Don't exit when errors are found in a config file after SIGHUP.

485e524

If we're running, and receive SIGHUP, it's possible that the user's changes to the config file are incorrect. Rather than aborting, it's better that we report the problem, ignore the new config file, and keep on running.

Consistently use "conf" as the variable name.

0a589e3

Previously we were inconsistent about whether variables were "conf" or "config". This commit homogenises this to "conf" (though the types are still the longer "Config").

ltratt force-pushed the reload_config_after_sighup branch from 3f03f44 to 0a589e3 Compare January 31, 2020 12:25

vext01 merged commit 1913d29 into softdevteam:master Jan 31, 2020

ltratt deleted the reload_config_after_sighup branch January 31, 2020 15:52
