Reload config after sighup #15

Merged
merged 8 commits into from Jan 31, 2020

ltratt
Member

@ltratt ltratt commented Jan 30, 2020

This PR enables snare to reload its config on SIGHUP. This is a bit more involved than it first seems because snare is multi-threaded. This PR thus goes through several stages (putting Config behind a Mutex and so on), before actually handling SIGHUP. Note that one case is handled somewhat, but not perfectly: reducing the value of maxjobs. I could do more here, but I think this is a niche case, and it's hard to test: the basic approach in this PR seems decent enough to me for the time being.

Although sharing allowed us to slightly minimise memory usage, the saving was
illusory: we still had to `clone()` `String`s and so on later. One possibility
is to cache `RepoConfig`s (and distribute them through `Arc` or similar), but
that seemed unnecessarily fussy, and would also mean we'd have to implement
extra stuff like cache eviction.

This commit thus simplifies things: every time we query the `Config` about a
repo, we get back a new `RepoConfig` that is not bound in any way to `Config`.
However, since the GitHub secret is a `SecStr`, my assumption is that a) it's
expensive to clone (since it's doing `mprotect()` and so on) and b) duplicating
it repeatedly throughout the heap might make it easier for an attacker to find
and decode it. We thus return that separately from the `RepoConfig`.

This is a necessary step to allowing reloading of Configs.

This is correct, but somewhat crude: any errors in the config will cause the
whole process to terminate, for example.
@vext01
Member

vext01 commented Jan 30, 2020

> putting Config behind a Mutex and so on

Is the config mutable after parsing then? That would surprise me.

@ltratt
Member Author

ltratt commented Jan 30, 2020

> Is the config mutable after parsing then? That would surprise me.

When you send SIGHUP, the entire config file is reparsed and a new Config produced. So the Config is not mutable as such, but it can be replaced by a new Config.
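
A minimal sketch of that replace-rather-than-mutate idea, not snare's actual code: the `Config` fields, the `parse` format, and the `new_shared` and `reload` helpers are all hypothetical stand-ins. The new config is parsed before the lock is taken, and on a parse error the old config is left in place.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical, heavily simplified stand-in for snare's `Config`.
#[derive(Debug, PartialEq)]
struct Config {
    maxjobs: usize,
    email: String,
}

impl Config {
    /// Hypothetical parser: accepts "maxjobs=<n>;email=<addr>".
    fn parse(src: &str) -> Result<Config, String> {
        let mut maxjobs = None;
        let mut email = None;
        for field in src.split(';') {
            match field.split_once('=') {
                Some(("maxjobs", v)) => {
                    maxjobs = Some(v.parse::<usize>().map_err(|e| format!("maxjobs: {}", e))?)
                }
                Some(("email", v)) => email = Some(v.to_owned()),
                _ => return Err(format!("unknown field '{}'", field)),
            }
        }
        Ok(Config {
            maxjobs: maxjobs.ok_or("missing maxjobs")?,
            email: email.ok_or("missing email")?,
        })
    }
}

/// Build the shared handle that every thread keeps a clone of.
fn new_shared(src: &str) -> Result<Arc<Mutex<Config>>, String> {
    Ok(Arc::new(Mutex::new(Config::parse(src)?)))
}

/// Replace the shared config wholesale. The value behind the lock is never
/// edited field by field: it is swapped for a freshly parsed `Config`.
fn reload(shared: &Mutex<Config>, src: &str) -> Result<(), String> {
    // Parse before locking, so a bad config never touches the live one.
    let new = Config::parse(src)?;
    *shared.lock().unwrap() = new;
    Ok(())
}
```

For instance, after `reload(&shared, "maxjobs=4;email=c@d")` succeeds, every holder of the `Arc` sees `maxjobs == 4` the next time it locks, while a malformed string returns an `Err` and leaves the previous value untouched.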

@vext01
Member

vext01 commented Jan 31, 2020

> When you send SIGHUP, the entire config file is reparsed and a new Config produced. So the Config is not mutable as such, but it can be replaced by a new Config.

In the past, I've just had the program re-exec(3) itself, then you don't have to worry about any of this. I'm not sure if that would work for us here?

@ltratt
Member Author

ltratt commented Jan 31, 2020

> In the past, I've just had the program re-exec(3) itself

That would not be good here since we'd destroy the queue of jobs!

@vext01
Member

vext01 commented Jan 31, 2020

> That would not be good here since we'd destroy the queue of jobs!

Well, you'd have to wait for them to finish. I thought you were doing that already, but I guess from your response that you allow them to continue during the reload?

@ltratt
Member Author

ltratt commented Jan 31, 2020

Yes, the reason this PR is quite so fiddly is that it has no effect on ongoing jobs, but we still try to enact all the reasonable changes as soon as possible.

@vext01
Member

vext01 commented Jan 31, 2020

I see. I hadn't appreciated that!

Code review coming soon.

@ltratt
Member Author

ltratt commented Jan 31, 2020

Simplifying a bit: we don't change the config of any running job. So if a job was run when you said email="a@b" and you SIGHUP it so email="c@d" then the existing job will still send to a@b but all new jobs will send to c@d. As this suggests, each job has its own "unique" config in a sense (RepoConfig).
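
The per-job snapshot behaviour can be illustrated with a small sketch. Everything here is a hypothetical simplification rather than snare's real types: `Config` and `RepoConfig` carry a single `email` field, and `repoconfig` and `start_job` are made-up helper names. The point is only that a job copies what it needs when it starts, so replacing the shared `Config` later cannot affect it.

```rust
use std::sync::Mutex;

/// Hypothetical stand-in: the settings a single job runs with.
#[derive(Debug)]
struct RepoConfig {
    email: String,
}

/// Hypothetical stand-in for the global, reloadable config.
struct Config {
    email: String,
}

impl Config {
    /// Hand back a fresh `RepoConfig` that owns its data and holds no
    /// reference into `Config`.
    fn repoconfig(&self) -> RepoConfig {
        RepoConfig {
            email: self.email.clone(),
        }
    }
}

/// A job keeps the snapshot it was started with.
struct Job {
    rconf: RepoConfig,
}

/// Take the lock only long enough to copy out this job's settings.
fn start_job(conf: &Mutex<Config>) -> Job {
    Job {
        rconf: conf.lock().unwrap().repoconfig(),
    }
}
```

A job started before the reload keeps reporting to `a@b`; a job started afterwards picks up `c@d`.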

The major exception is 82af641: it's really fiddly -- and in the general case impossible -- to deal well with SIGHUP reducing the value of maxjobs if there are already running jobs. So I implemented something simple, given that this is a fairly niche case: see 82af641#diff-b319aab93ab499624a467ced0e18a2a8R364.

@vext01
Member

vext01 commented Jan 31, 2020

> So if a job was run when you said email="a@b" and you SIGHUP it so email="c@d" then the existing job will still send to a@b

I think that's OK. As a user I'd kind of expect that.

> to deal well with SIGHUP reducing the number of maxjobs if there are already running jobs.

Hrm, yes. That's annoying.

I can think of two "possible improvements":

  • Wait until there are fewer-or-equal jobs than the new maxjobs, then create a new smaller array from the used slot of the old array. I think this is what your comment suggests with "compaction".

  • Keep a count of how many jobs there are, but the backing array may be larger. This may waste memory and be annoying to keep in sync.

I think the solution you have now is pretty good tbh.
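
As an illustration only, the first option above might look roughly like the following. This is a sketch under assumptions, not what the PR does: `try_shrink` is a hypothetical helper, job slots are modelled as a `Vec<Option<T>>`, and the matching bookkeeping for `pollfds` (which would need its indices remapped after compaction) is ignored.

```rust
/// Try to resize the table of job slots to `new_max` entries.
///
/// Returns `false`, leaving the table unchanged, if more than `new_max` jobs
/// are currently running; the caller would have to retry later.
fn try_shrink<T>(running: &mut Vec<Option<T>>, new_max: usize) -> bool {
    let used = running.iter().filter(|slot| slot.is_some()).count();
    if used > new_max {
        return false;
    }
    // Compaction: keep only the occupied slots, preserving their order...
    running.retain(|slot| slot.is_some());
    // ...then pad with empty slots up to the new maximum.
    running.resize_with(new_max, || None);
    true
}
```

With slots `[Some("a"), None, Some("b"), None]`, a request for `new_max == 1` is refused because two jobs are running, whereas `new_max == 3` compacts the table to `[Some("a"), Some("b"), None]`.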

pollfds.resize_with(snare.config.maxjobs * 2 + 1, || {
    PollFd::new(-1, PollFlags::empty())
});
// If the unwrap() on the lock fails, the other thread has panicked.
Member

Is it necessary to repeat this comment on every mutex unlock? Mutex poisoning is well-known among Rust programmers and the documentation explains it well.

Member Author

I tend to agree. I think in a future PR I will put this on the attribute in the struct.

Member

ok

src/jobrunner.rs Outdated
});
// If the unwrap() on the lock fails, the other thread has panicked.
let maxjobs = snare.config.lock().unwrap().maxjobs;
assert!(maxjobs < std::usize::MAX);
Member

I'm not sure this is a meaningful assertion. Since both arguments are usize it's equivalent to:

assert!(maxjobs != std::usize::MAX);

Is that being used as a boundary condition elsewhere perhaps?

Member Author

Actually, this should be `assert!(maxjobs < ((std::usize::MAX - 1) / 2))`! Long story, but it's basically about the number of pollfds we create. In practice, of course, we're probably not going to have enough RAM for this to ever be an issue.

Fixed in 5ba7c25.

Member

OK, that looks more valid at least.

I missed why the / 2, but if the story is really long, I'll trust you ;)

Member Author

Each job has stderr and stdout pipes, which is where the * (or /) 2 comes from.

Member

ok
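
The bound discussed in this thread can also be checked with overflow-aware arithmetic. A minimal sketch, where `pollfds_len` is a hypothetical helper that is not part of snare: it computes the `maxjobs * 2 + 1` expression from the diff (two poll entries per job for stderr and stdout, plus one extra) and reports overflow instead of wrapping.

```rust
/// Number of poll entries needed for `maxjobs` jobs, or `None` if the
/// computation would overflow `usize`.
fn pollfds_len(maxjobs: usize) -> Option<usize> {
    maxjobs.checked_mul(2)?.checked_add(1)
}
```

The largest value that does not overflow is `(usize::MAX - 1) / 2`, so the strict `<` in the corrected assertion is marginally more conservative than necessary, which is harmless in practice.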

assert!(maxjobs < std::usize::MAX);
let mut running = Vec::with_capacity(maxjobs);
running.resize_with(maxjobs, || None);
let mut pollfds = Vec::with_capacity(maxjobs * 2 + 1);
Member

Perhaps the above assertion is to cater for the + 1 here?

Member Author

Correct.

src/main.rs Outdated
impl Snare {
    /// Check to see if we've received a SIGHUP since the last check. If so, we will reload the
    /// config file. **Note that because snare has multiple threads, the config file can change at
    /// any arbitrary point, not just after calling this function.**
Member

I don't understand the "not just after calling this function" part of the comment.

When else can the config change?

Member Author

There are two threads in snare, so the config can be changed by the other thread even if this thread has not called check_for_hup.

Member

To be clear, the config may change when neither thread received a HUP?

Member Author

At the moment, the signal comes in, a global bool is set, and then jobrunner checks that bool and reloads the config if necessary. There are only two threads, so one of them has to handle it (and the config stuff is way too much for it to be signal safe, so the signal handler can't deal with it directly).

Member

OK, understood. But the comment makes it sound like some other part of the program (not this function) is mutating the config. As I understand it, you mean instead that this function could be run in another thread.

Member Author

Is 3f03f44 better?

Member

Much clearer, thanks.

src/main.rs Outdated
    /// config file. **Note that because snare has multiple threads, the config file can change at
    /// any arbitrary point, not just after calling this function.**
    fn check_for_hup(&self) {
        if self.sighup_occurred.load(Ordering::Release) {
Member

Is the ordering supposed to be acquire here?

Member Author

I think Relaxed is fine here as we don't need any other reads/writes to have occurred before/after this.

Member

But we have Release.

Perhaps both this and the ordering a few lines later should be Relaxed?

(If the config weren't in a mutex, you'd certainly want acquire/release, otherwise another thread may see the config mid-move)

Member Author

You're totally right: they should both be Relaxed and one of the later commits (ff45ea1#diff-639fbc4ef05b315af92b4d836c31b023R66) fixes that. I don't know why I ever thought Release was the correct ordering!
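
To make the flag mechanics in this thread concrete, here is a minimal sketch under stated assumptions. It is not snare's code: the config is reduced to a `String`, `signal` stands in for the real signal handler, `check_for_hup` takes the replacement config as an argument instead of reading a file, and `swap` is used to read and clear the flag in one step where the PR uses a separate `load`. `Relaxed` is enough for the flag because it carries no data of its own; the config is published through the `Mutex`.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};

struct Snare {
    /// Set by the (simulated) signal handler, cleared by `check_for_hup`.
    sighup_occurred: Arc<AtomicBool>,
    /// Hypothetical stand-in for the real `Config`.
    conf: Mutex<String>,
}

impl Snare {
    fn new(conf: &str) -> Snare {
        Snare {
            sighup_occurred: Arc::new(AtomicBool::new(false)),
            conf: Mutex::new(conf.to_owned()),
        }
    }

    /// What a signal handler would do: only record that SIGHUP arrived.
    /// An atomic store is safe to perform in a signal handler; parsing a
    /// config file is not, so that work is deferred to a normal thread.
    fn signal(&self) {
        self.sighup_occurred.store(true, Ordering::Relaxed);
    }

    /// Called from a normal thread. If a SIGHUP was recorded since the last
    /// check, clear the flag and install `new_conf`. Returns whether a
    /// reload happened.
    fn check_for_hup(&self, new_conf: &str) -> bool {
        if self.sighup_occurred.swap(false, Ordering::Relaxed) {
            *self.conf.lock().unwrap() = new_conf.to_owned();
            true
        } else {
            false
        }
    }
}
```

Calling `check_for_hup` without a preceding `signal` is a no-op; after `signal`, exactly one subsequent call performs the reload.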

let sighup_occurred = Arc::new(AtomicBool::new(false));
{
    let sighup_occurred = Arc::clone(&sighup_occurred);
    if let Err(e) = unsafe {
Member

If you stored the closure in a variable, I think you might be able to limit the scope of unsafe some more?

Member Author

I'm not sure I understand?

Member

So if you did something like:

if let Err(e) = {
    let f = move || {
        // All functions called in this function must be signal safe. See signal(3).
        sighup_occurred.store(true, Ordering::Relaxed);
        unsafe { nix::unistd::write(event_write_fd, &[0]).ok() };
    };
    unsafe { signal_hook::register(signal_hook::SIGHUP, f) };
}

Then fewer lines can be in unsafe? I may be wrong.

Member Author

TBH, I'm fairly happy having the closure in unsafe because it's a signal handler and being clear that "here be dragons" is not a bad idea.

if self.sighup_occurred.load(Ordering::Relaxed) {
    match Config::from_path(&self.conf_path) {
        Ok(config) => *self.config.lock().unwrap() = config,
        Err(msg) => eprintln!("{}", msg),
Member

So the error is printed on stderr, which may be in the background and may go unnoticed.

Not sure how you can fix that though.

One potential idea would be to have a snare --reload which communicates with the existing snare instance and prints any errors to its own stderr (not the daemon's). However, this would probably need more complex IPC.

Member Author

A later PR will send this to syslog.

Member

sounds good!

    eprintln!("{}", msg);
} else {
    eprintln!("{}.", msg);
}
Member

I wonder if this is worth it :)

There are any number of silly things the caller might do.

msg("error..");
msg("error,");
...

Member Author

Yeah, I'm not sure either :/

@ltratt
Member Author

ltratt commented Jan 31, 2020

I think that's everything?

@vext01
Member

vext01 commented Jan 31, 2020

LGTM. Please squash.

If we're running, and receive SIGHUP, it's possible that the user's changes to
the config file are incorrect. Rather than aborting, it's better that we report
the problem, ignore the new config file, and keep on running.

If the user asks for more jobs to be run, we have an easy task: if they ask for
fewer to be run, then it is much trickier. The approach this commit takes for
the latter case is simple, but means that we can find ourselves in situations
where we are not ever able to actually reduce the number of maximum jobs that
are running.

Previously we were inconsistent about whether variables were "conf" or "config".
This commit homogenises this to "conf" (though the types are still the longer
"Config").
@ltratt
Member Author

ltratt commented Jan 31, 2020

Squashed.

@vext01
Member

vext01 commented Jan 31, 2020

bors r+

bors bot added a commit that referenced this pull request Jan 31, 2020
15: Reload config after sighup r=vext01 a=ltratt

This PR enables snare to reload its config on SIGHUP. This is a bit more involved than it first seems because snare is multi-threaded. This PR thus goes through several stages (putting `Config` behind a `Mutex` and so on), before actually handling SIGHUP. Note that one case is handled somewhat, but not perfectly: reducing the value of `maxjobs`. I could do more here, but I think this is a niche case, and it's hard to test: the basic approach in this PR seems decent enough to me for the time being.

Co-authored-by: Laurence Tratt <laurie@tratt.net>
@ltratt
Member Author

ltratt commented Jan 31, 2020

@vext01 Any idea why this failed? buildbot seems to have succeeded but bors failed?

@vext01
Member

vext01 commented Jan 31, 2020

Yeah, that doesn't look like our issue.

Let's try again.

bors r+

bors bot added a commit that referenced this pull request Jan 31, 2020
15: Reload config after sighup r=vext01 a=ltratt

@vext01
Member

vext01 commented Jan 31, 2020

once more for luck

bors r+

bors bot added a commit that referenced this pull request Jan 31, 2020
15: Reload config after sighup r=vext01 a=ltratt

@vext01 vext01 merged commit 1913d29 into softdevteam:master Jan 31, 2020
@ltratt ltratt deleted the reload_config_after_sighup branch January 31, 2020 15:52