Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dns-cache: during shutdown: possible use after free #1666

Merged
merged 2 commits into from
Oct 26, 2017

Conversation

furiel
Copy link
Collaborator

@furiel furiel commented Sep 11, 2017

During shutdown, mainloop did not wait worker threads to stop: app_shutdown and app_thread_stop was racing. This could lead to a use after free issue in the code: a dns_cache object could be returned to an already unallocated pool: unused_dns_caches.

Two possible resolutions to the problem:

  • Reference counting for unused_dns_caches.
  • Blocking mainloop until all workers exit.

This patchset contains both versions. I would like to open up a discussion if we need either one or both.

Me seems reference counting is simpler in terms of code complexity, and has less possible side effects comparing to the version of blocking mainloop. Though the latter one could eliminate the currently unknown or possible future race conditions of these functions as well.

I could only manually test the blocking mainloop version by adding some printf-s to the beginning and the end of the racing functions, and see if the debug-log sandwich form turns into a disjoint form. In the end, leaving these debug logs felt awkward in the code, so I removed them in the final version. Feel free to indicate if I should put them back, or if anyone has any more reasonable idea to test such thing.

The reference counting is tested only manually as well: in a simple file source file destination configuration, sending a single message created a thread with a dns_cache, and stopping syslog-ng instantly after that caused an entry in the valgrind report. On my development machine, unused_dns_caches was steadily freed before the dns_cache was appended to the list so reproducing was not a problem.

This ticket intends to resolve the dns_cache part of the two issues reported in: #1400

@kira-syslogng
Copy link
Contributor

Build SUCCESS, the tests were executed on test branch: master and test suite: functions

@furiel
Copy link
Collaborator Author

furiel commented Sep 12, 2017

One thought to add:
It is not enough for mainloop to wait for worker threads to quit. Due the worker threads have a 10 seconds idle timeout before they quit, it can happen that main thread should wait that 10 seconds as well. So the quit must be early-triggered in the worker threads before the block kicks in. That's why I put the blocking call only after iv_work_pool_put(&main_loop_io_workers);

@kira-syslogng
Copy link
Contributor

Build SUCCESS, the tests were executed on test branch: master and test suite: functions

bazsi
bazsi previously requested changes Sep 13, 2017
Copy link
Collaborator

@bazsi bazsi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would probably go towards the proper synchronized exit route unless that's risky. I also had a couple of review notes, but I think we might want to talk IRL if possible.

lib/dnscache.c Outdated
@@ -419,6 +419,31 @@ TLS_BLOCK_END;
static DNSCacheOptions effective_dns_cache_options;
G_LOCK_DEFINE_STATIC(unused_dns_caches);
static GList *unused_dns_caches;
volatile gint unused_dns_caches_ref_count;

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would make sense to encapsulate this global state into a struct and add the refcounter for that struct.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e.g. "DNSCaching" could be a name for this struct, as the functions use this as a prefix.

lib/apphook.c Outdated
@@ -215,4 +216,6 @@ app_thread_stop(void)
main_loop_call_thread_deinit();
dns_caching_thread_deinit();
scratch_buffers_allocator_deinit();
g_atomic_int_dec_and_test(&main_loop_workers_running);
g_cond_signal(&thread_halt_cond);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should probably go the call-sites where app_thread_start/stop is getting called.

static void
block_till_workers_exit()
{
gint64 end_time;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are the worker threads signaled to exit at this point? e.g. IIRC iv_work_pool_put() should wake up the workers and cause them to exit in a timely manner.

our threaded destination code might be blocking for some time still, but I think that's ok.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iv_work_pool_put

I call this block after main_loop_io_worker_deinit(), which calls iv_work_pool_put.
main_loop_io_worker_deinit();
main_loop_worker_deinit();
block_till_workers_exit();

our threaded destination code might be blocking for some time still, but I think that's ok.

For the threaded destinations: currently the complete exit mechanism waits for them: the exit mechanism is executed with main_loop_worker_sync_call, which waits for all threaded destinations to call main_loop_worker_job_complete(). So yes, the threaded destination drivers can block the exit, but the block happens earlier than block_till_workers_exit() called, so the timeout with g_cond_wait_until is not affected.

lib/mainloop.c Outdated
{
/* timeout has passed. */
msg_error("Main thread timed out while waiting workers threads to exit",
evt_tag_int("workers_running", main_loop_workers_running),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should include how long we waited and also what would happen next.

btw: this will probably not be printed, as the internal() source is not in operation at this point.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can I resolve the non printing part somehow? Should I use fprintf here writing to stderr instead of msg_error?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll change this to fprintf.

How long we waited: will be 15s as we get here only when timeout happened. As for what happens next, I'm not sure. I think we can just continue with exiting. Or should we abort? The only root cause I can think of is somehow the accounting of main_loop_workers_running is messed up (nonzero). But for sure all workers will be exited: we only get here when there are no jobs, and after 10 seconds every thread is forced to exit, so waiting 15seconds and continue is safe imo.
For text:
fprintf(stderr, "Main thread timed out (15s) while waiting workers threads to exit. workers_running: %d. Continuing ...\n", main_loop_workers_running);

lib/mainloop.c Outdated

end_time = g_get_monotonic_time() + 15*G_TIME_SPAN_SECOND;
g_mutex_lock(&dummy_workers_running_lock);
while (main_loop_workers_running)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this dummy? it should be the same lock that protects the counter.

@furiel
Copy link
Collaborator Author

furiel commented Sep 20, 2017

I will drop down the dns_cache part. Lets go with the synchronization version.

Signed-off-by: Antal Nemes <antal.nemes@balabit.com>
@kira-syslogng
Copy link
Contributor

Build SUCCESS, the tests were executed on test branch: master and test suite: functions

Prior this patch: during exit, mainloop did not wait worker threads to
exit, which resulted racing of app_shutdown and app_thread_stop. This
patch path block mainloop exit until the number of threads alive drop
to zero.

Signed-off-by: Antal Nemes <antal.nemes@balabit.com>
@kira-syslogng
Copy link
Contributor

Build SUCCESS, the tests were executed on test branch: master and test suite: functions

@presidento presidento added the bug label Oct 5, 2017
@furiel
Copy link
Collaborator Author

furiel commented Oct 10, 2017

@bazsi could you recheck this PR?

@lbudai lbudai merged commit 3dad146 into syslog-ng:master Oct 26, 2017
@furiel furiel deleted the dns_cache_use_after_free branch May 8, 2018 04:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants