-
Notifications
You must be signed in to change notification settings - Fork 462
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
dns-cache: during shutdown: possible use after free #1666
Conversation
Build SUCCESS, the tests were executed on test branch: master and test suite: functions |
One thought to add: |
7253da8
to
6073744
Compare
Build SUCCESS, the tests were executed on test branch: master and test suite: functions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would probably go towards the proper synchronized exit route unless that's risky. I also had a couple of review notes, but I think we might want to talk IRL if possible.
lib/dnscache.c
Outdated
@@ -419,6 +419,31 @@ TLS_BLOCK_END; | |||
static DNSCacheOptions effective_dns_cache_options; | |||
G_LOCK_DEFINE_STATIC(unused_dns_caches); | |||
static GList *unused_dns_caches; | |||
volatile gint unused_dns_caches_ref_count; | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would make sense to encapsulate this global state into a struct and add the refcounter for that struct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
e.g. "DNSCaching" could be a name for this struct, as the functions use this as a prefix.
lib/apphook.c
Outdated
@@ -215,4 +216,6 @@ app_thread_stop(void) | |||
main_loop_call_thread_deinit(); | |||
dns_caching_thread_deinit(); | |||
scratch_buffers_allocator_deinit(); | |||
g_atomic_int_dec_and_test(&main_loop_workers_running); | |||
g_cond_signal(&thread_halt_cond); | |||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should probably go the call-sites where app_thread_start/stop is getting called.
static void | ||
block_till_workers_exit() | ||
{ | ||
gint64 end_time; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are the worker threads signaled to exit at this point? e.g. IIRC iv_work_pool_put() should wake up the workers and cause them to exit in a timely manner.
our threaded destination code might be blocking for some time still, but I think that's ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
iv_work_pool_put
I call this block after main_loop_io_worker_deinit(), which calls iv_work_pool_put.
main_loop_io_worker_deinit();
main_loop_worker_deinit();
block_till_workers_exit();
our threaded destination code might be blocking for some time still, but I think that's ok.
For the threaded destinations: currently the complete exit mechanism waits for them: the exit mechanism is executed with main_loop_worker_sync_call, which waits for all threaded destinations to call main_loop_worker_job_complete(). So yes, the threaded destination drivers can block the exit, but the block happens earlier than block_till_workers_exit() called, so the timeout with g_cond_wait_until is not affected.
lib/mainloop.c
Outdated
{ | ||
/* timeout has passed. */ | ||
msg_error("Main thread timed out while waiting workers threads to exit", | ||
evt_tag_int("workers_running", main_loop_workers_running), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should include how long we waited and also what would happen next.
btw: this will probably not be printed, as the internal() source is not in operation at this point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can I resolve the non printing part somehow? Should I use fprintf here writing to stderr instead of msg_error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll change this to fprintf.
How long we waited: will be 15s as we get here only when timeout happened. As for what happens next, I'm not sure. I think we can just continue with exiting. Or should we abort? The only root cause I can think of is somehow the accounting of main_loop_workers_running is messed up (nonzero). But for sure all workers will be exited: we only get here when there are no jobs, and after 10 seconds every thread is forced to exit, so waiting 15seconds and continue is safe imo.
For text:
fprintf(stderr, "Main thread timed out (15s) while waiting workers threads to exit. workers_running: %d. Continuing ...\n", main_loop_workers_running);
lib/mainloop.c
Outdated
|
||
end_time = g_get_monotonic_time() + 15*G_TIME_SPAN_SECOND; | ||
g_mutex_lock(&dummy_workers_running_lock); | ||
while (main_loop_workers_running) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is this dummy? it should be the same lock that protects the counter.
I will drop down the dns_cache part. Lets go with the synchronization version. |
Signed-off-by: Antal Nemes <antal.nemes@balabit.com>
6073744
to
dfd63cf
Compare
Build SUCCESS, the tests were executed on test branch: master and test suite: functions |
Prior this patch: during exit, mainloop did not wait worker threads to exit, which resulted racing of app_shutdown and app_thread_stop. This patch path block mainloop exit until the number of threads alive drop to zero. Signed-off-by: Antal Nemes <antal.nemes@balabit.com>
dfd63cf
to
33c999d
Compare
Build SUCCESS, the tests were executed on test branch: master and test suite: functions |
@bazsi could you recheck this PR? |
During shutdown, mainloop did not wait worker threads to stop: app_shutdown and app_thread_stop was racing. This could lead to a use after free issue in the code: a dns_cache object could be returned to an already unallocated pool: unused_dns_caches.
Two possible resolutions to the problem:
This patchset contains both versions. I would like to open up a discussion if we need either one or both.
Me seems reference counting is simpler in terms of code complexity, and has less possible side effects comparing to the version of blocking mainloop. Though the latter one could eliminate the currently unknown or possible future race conditions of these functions as well.
I could only manually test the blocking mainloop version by adding some printf-s to the beginning and the end of the racing functions, and see if the debug-log sandwich form turns into a disjoint form. In the end, leaving these debug logs felt awkward in the code, so I removed them in the final version. Feel free to indicate if I should put them back, or if anyone has any more reasonable idea to test such thing.
The reference counting is tested only manually as well: in a simple file source file destination configuration, sending a single message created a thread with a dns_cache, and stopping syslog-ng instantly after that caused an entry in the valgrind report. On my development machine, unused_dns_caches was steadily freed before the dns_cache was appended to the list so reproducing was not a problem.
This ticket intends to resolve the dns_cache part of the two issues reported in: #1400