New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reload config after sighup #15
Conversation
Although sharing allowed us to slightly minimise memory usage, the saving was illusory: we still had to `clone()` `String`s and so on later. One possibility is to cache `RepoConfig`s (and distribute them through `Arc` or similar), but that seemed unnecessarily fussy, and would also mean we'd have to implement extra stuff like cache eviction. This commit thus simplifies things: every time we query the `Config` about a repo, we get back a new `RepoConfig` that is not bound in anyway to `Config`. However, since the GitHub secret is a `SecStr`, my assumption is that a) it's expensive to clone (since it's doing `mprotect()` and so on) b) duplicating it repeatedly throughout the heap might make it easier for an attacker to find and decode it. We thus return that seaprately from the `RepoConfig`.
This is a necessary step to allowing reloading of Configs.
This is correct, but somewhat crude: any errrors in the config will cause the whole process to terminate, for example.
Is the config mutable after parsing then? That would surprise me. |
When you send SIGHUP, the entire config file is reparsed and a new |
In the past, I've just had the program re- |
That would not be good here since we'd destroy the queue of jobs! |
Well, you'd have to wait for them to finish. I thought you were doing that already, but I guess from your response that you allow them to continue during the reload? |
Yes, the reason this PR is quite so fiddly is that it has no effect on ongoing jobs, but we try still try to enact all the reasonable changes as soon as possible. |
I see. I hadn't appreciated that! Code review coming soon. |
Simplifying a bit: we don't change the config of any running job. So if a job was run when you said The major exception is 82af641: it's really fiddly -- and in the general case impossible -- to deal well with SIGHUP reducing the number of |
I think that's OK. As I user I'd kind of expect that.
Hrm, yes. That's annoying. I can think of two "possible improvements":
I think the solution you have now is pretty good tbh. |
pollfds.resize_with(snare.config.maxjobs * 2 + 1, || { | ||
PollFd::new(-1, PollFlags::empty()) | ||
}); | ||
// If the unwrap() on the lock fails, the other thread has paniced. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it necessary to repeat this comment on every mutex unlock? Mutex poisoning is well-known among Rust programmers and the documentation explains it well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tend to agree. I think in a future PR I will put this on the attribute in the struct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
src/jobrunner.rs
Outdated
}); | ||
// If the unwrap() on the lock fails, the other thread has paniced. | ||
let maxjobs = snare.config.lock().unwrap().maxjobs; | ||
assert!(maxjobs < std::usize::MAX); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is a meaningful assertion. Since both arguments are usize
it's equivalent to:
assert!(maxjobs != std::usize::MAX);
Is that being used as a boundary condition elsewhere perhaps?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, this should be assert!(maxjobs < ((std::usize::MAX - 1) / 2)
! Long story, but it's basically about the amount of pollfds we create. In practise, of course, we're probably not going to have enough RAM for this to ever be an issue.
Fixed in 5ba7c25.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, that looks more valid at least.
I missed why the / 2
, but if the story is really long, I'll trust you ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Each job has stderr/stdout pipes which is the *
(or /)
2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
assert!(maxjobs < std::usize::MAX); | ||
let mut running = Vec::with_capacity(maxjobs); | ||
running.resize_with(maxjobs, || None); | ||
let mut pollfds = Vec::with_capacity(maxjobs * 2 + 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps the above assertion is to cater for the + 1
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct.
src/main.rs
Outdated
impl Snare { | ||
/// Check to see if we've received a SIGHUP since the last check. If so, we will reload the | ||
/// config file. **Note that because snare has multiple threads, the config file can change at | ||
/// any arbitrary point, not just after calling this function.** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the "not just after calling this function" part of the comment.
When else can the config change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are two threads in snare, so the config can change in another thread even if it's not called check_for_hup
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear, the config may change when neither thread received a HUP?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the moment, the signal comes in, a global bool is set, and then jobrunner
checks that bool and reloads the config if necessary. There are only two threads, so one of them has to handle it (and the config stuff is way too much for it to be signal safe, so the signal handler can't deal with it directly).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK understood. But the comment makes it sound like some other part of the program (not this function) is mutating the config. As I understand you mean to say instead that this function could be run in another thread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is 3f03f44 better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Much clearer, thanks.
src/main.rs
Outdated
/// config file. **Note that because snare has multiple threads, the config file can change at | ||
/// any arbitrary point, not just after calling this function.** | ||
fn check_for_hup(&self) { | ||
if self.sighup_occurred.load(Ordering::Release) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the ordering supposed to be acquire here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Relaxed
is fine here as we don't need any other read/writes to have occurred before/after this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But we have Release
.
Perhaps both this and the ordering in a few lines time should both be relaxed?
(If the config weren't in a mutex, you'd certainly want acquire/release, otherwise another thread may see the config mid-move)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're totally right: they should both be Relaxed
and one of the later commits (ff45ea1#diff-639fbc4ef05b315af92b4d836c31b023R66) fixes that. I don't know why I ever thougt Release
was the correct ordering!
let sighup_occurred = Arc::new(AtomicBool::new(false)); | ||
{ | ||
let sighup_occurred = Arc::clone(&sighup_occurred); | ||
if let Err(e) = unsafe { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you stored the closure in a variable, I think you might be able to limit the scope of unsafe
some more?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if you did something like:
if let Err(e) = {
let f = move || {
// All functions called in this function must be signal safe. See signal(3).
sighup_occurred.store(true, Ordering::Relaxed);
unsafe { nix::unistd::write(event_write_fd, &[0]).ok() };
};
unsafe { signal_hook::register(signal_hook::SIGHUP, f) };
}
Then fewer lines can be in unsafe
? I may be wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TBH, I'm fairly happy having the closure in unsafe
because it's a signal handler and being clear that "here be dragons" is not a bad idea.
if self.sighup_occurred.load(Ordering::Relaxed) { | ||
match Config::from_path(&self.conf_path) { | ||
Ok(config) => *self.config.lock().unwrap() = config, | ||
Err(msg) => eprintln!("{}", msg), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the error is printed on stderr, which may be in the background and may go un-noticed.
Not sure how you can fix that though.
One potential idea would be to have a snare --reload
which communicates with the existing snare instance and prints any errors to its stderr, (not the daemon's). However, this would probably need more complex IPC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A later PR will send this to syslog.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good!
eprintln!("{}", msg); | ||
} else { | ||
eprintln!("{}.", msg); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this is worth it :)
There's a any number of silly things the caller might do.
msg("error..");
msg("error,");
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'm not sure either :/
I think that's everything? |
LGTM. Please squash. |
If we're running, and receive SIGHUP, it's possible that the user's changes to the config file are incorrect. Rather than aborting, it's better that we report the problem, ignore the new config file, and keep on running.
If the user asks for more jobs to be run, we have an easy task: if they ask for fewer to be run, then it is much trickier. The approach this commit takes for the latter case is simple, but means that we can find ourselves in situations where we are not ever able to actually reduce the number of maximum jobs that are running.
Previously we were inconsistent about whether variables were "conf" or "config". This commit homogenises this to "conf" (though the types are still the longer "Config").
3f03f44
to
0a589e3
Compare
Squashed. |
bors r+ |
15: Reload config after sighup r=vext01 a=ltratt This PR enables snare to reload its config on SIGHUP. This is a bit more involved than it first seems because snare is multi-threaded. This PR thus goes through several stages (putting `Config` behind a `Mutex` and so on), before actually handling SIGHUP. Note that one case is handled somewhat, but not perfectly: reducing the value of `maxjobs`. I could do more here, but I think this is a niche case, and it's hard to test: the basic approach in this PR seems decent enough to me for the time being. Co-authored-by: Laurence Tratt <laurie@tratt.net>
@vext01 Any idea why this failed? buildbot seems to have succeeded but bors failed? |
Yeah, that doesn't look like our issue. Let's try again. bors r+ |
15: Reload config after sighup r=vext01 a=ltratt This PR enables snare to reload its config on SIGHUP. This is a bit more involved than it first seems because snare is multi-threaded. This PR thus goes through several stages (putting `Config` behind a `Mutex` and so on), before actually handling SIGHUP. Note that one case is handled somewhat, but not perfectly: reducing the value of `maxjobs`. I could do more here, but I think this is a niche case, and it's hard to test: the basic approach in this PR seems decent enough to me for the time being. Co-authored-by: Laurence Tratt <laurie@tratt.net>
once more for luck bors r+ |
15: Reload config after sighup r=vext01 a=ltratt This PR enables snare to reload its config on SIGHUP. This is a bit more involved than it first seems because snare is multi-threaded. This PR thus goes through several stages (putting `Config` behind a `Mutex` and so on), before actually handling SIGHUP. Note that one case is handled somewhat, but not perfectly: reducing the value of `maxjobs`. I could do more here, but I think this is a niche case, and it's hard to test: the basic approach in this PR seems decent enough to me for the time being. Co-authored-by: Laurence Tratt <laurie@tratt.net>
This PR enables snare to reload its config on SIGHUP. This is a bit more involved than it first seems because snare is multi-threaded. This PR thus goes through several stages (putting
Config
behind aMutex
and so on), before actually handling SIGHUP. Note that one case is handled somewhat, but not perfectly: reducing the value ofmaxjobs
. I could do more here, but I think this is a niche case, and it's hard to test: the basic approach in this PR seems decent enough to me for the time being.