dns-cache: during shutdown: possible use after free #1666

furiel · 2017-09-11T14:37:24Z

During shutdown, mainloop did not wait for the worker threads to stop, so app_shutdown and app_thread_stop were racing. This could lead to a use after free: a dns_cache object could be returned to the unused_dns_caches pool after that pool had already been freed.

Two possible resolutions to the problem:

1. Reference counting for unused_dns_caches.
2. Blocking mainloop until all workers exit.

This patchset contains both versions. I would like to open a discussion on whether we need one of them or both.

It seems to me that reference counting is simpler in terms of code complexity and has fewer possible side effects than blocking mainloop. The latter, though, could also eliminate currently unknown or future race conditions between these functions.

I could only test the blocking mainloop version manually: I added some printf-s to the beginning and the end of the racing functions and checked whether the interleaved ("sandwich") debug output turned into disjoint blocks. In the end, leaving these debug logs in the code felt awkward, so I removed them from the final version. Feel free to say if I should put them back, or if anyone has a more reasonable idea for testing such a thing.

The reference counting is tested only manually as well: in a simple file source, file destination configuration, sending a single message created a thread with a dns_cache, and stopping syslog-ng immediately after that produced an entry in the valgrind report. On my development machine, unused_dns_caches was consistently freed before the dns_cache was appended to the list, so reproducing the problem was not difficult.

This ticket intends to resolve the dns_cache part of the two issues reported in #1400.

kira-syslogng · 2017-09-11T14:50:29Z

Build SUCCESS, the tests were executed on test branch: master and test suite: functions

furiel · 2017-09-12T06:20:07Z

One thought to add:
It is not enough for mainloop to wait for the worker threads to quit. Because the worker threads have a 10 second idle timeout before they quit, the main thread might have to wait those 10 seconds as well. So the quit must be triggered early in the worker threads, before the block kicks in. That's why I put the blocking call only after iv_work_pool_put(&main_loop_io_workers);
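
The ordering argument can be illustrated with a small POSIX threads sketch. It is not syslog-ng or ivykis code: `Worker`, `worker_main`, `worker_request_quit`, `shutdown_worker_and_measure` and the `IDLE_TIMEOUT_SECONDS` constant are hypothetical names chosen for illustration. An idle worker sleeps on a condition variable until its idle timeout expires; requesting the quit first wakes it at once, so the join that follows does not have to sit out the timeout.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <time.h>

#define IDLE_TIMEOUT_SECONDS 10   /* assumed value, taken from the comment above */

/* Hypothetical worker state; not the ivykis work pool. */
typedef struct
{
  pthread_mutex_t lock;
  pthread_cond_t wakeup;
  int quit_requested;
} Worker;

/* Idle worker: exits when asked to quit or when the idle timeout expires. */
static void *
worker_main(void *arg)
{
  Worker *self = arg;
  struct timespec deadline;

  clock_gettime(CLOCK_REALTIME, &deadline);
  deadline.tv_sec += IDLE_TIMEOUT_SECONDS;

  pthread_mutex_lock(&self->lock);
  while (!self->quit_requested)
    {
      if (pthread_cond_timedwait(&self->wakeup, &self->lock, &deadline) != 0)
        break;   /* idle timeout (or error): stop waiting */
    }
  pthread_mutex_unlock(&self->lock);
  return NULL;
}

/* Step 1: trigger the quit early so the worker does not wait out its timeout. */
static void
worker_request_quit(Worker *self)
{
  pthread_mutex_lock(&self->lock);
  self->quit_requested = 1;
  pthread_cond_signal(&self->wakeup);
  pthread_mutex_unlock(&self->lock);
}

/* Returns the seconds the main thread spent in steps 1 and 2. */
static double
shutdown_worker_and_measure(void)
{
  Worker worker;
  pthread_t thread;
  struct timespec start, end;

  pthread_mutex_init(&worker.lock, NULL);
  pthread_cond_init(&worker.wakeup, NULL);
  worker.quit_requested = 0;
  pthread_create(&thread, NULL, worker_main, &worker);

  clock_gettime(CLOCK_MONOTONIC, &start);
  worker_request_quit(&worker);   /* step 1: wake the worker */
  pthread_join(thread, NULL);     /* step 2: block until it is gone */
  clock_gettime(CLOCK_MONOTONIC, &end);

  pthread_cond_destroy(&worker.wakeup);
  pthread_mutex_destroy(&worker.lock);
  return (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
}
```

If step 1 were skipped, the join in step 2 could block for up to the full idle timeout, which is the situation the comment warns about.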

kira-syslogng · 2017-09-12T11:50:02Z

Build SUCCESS, the tests were executed on test branch: master and test suite: functions

bazsi

I would probably go towards the proper synchronized exit route unless that's risky. I also had a couple of review notes, but I think we might want to talk IRL if possible.

bazsi · 2017-09-13T17:24:42Z

lib/dnscache.c

@@ -419,6 +419,31 @@ TLS_BLOCK_END;
 static DNSCacheOptions effective_dns_cache_options;
 G_LOCK_DEFINE_STATIC(unused_dns_caches);
 static GList *unused_dns_caches;
+volatile gint unused_dns_caches_ref_count;
+


I think it would make sense to encapsulate this global state into a struct and add the refcounter for that struct.

e.g. "DNSCaching" could be a name for this struct, as the functions use this as a prefix.
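
A minimal sketch of what such an encapsulation might look like, assuming C11 atomics in place of GLib's atomic helpers. The field layout and the `dns_caching_new`, `dns_caching_ref` and `dns_caching_unref` helpers are hypothetical and do not exist in lib/dnscache.c; the `unused_caches` counter merely stands in for the `GList` of unused caches.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical grouping of the state that is global in lib/dnscache.c. */
typedef struct
{
  atomic_int ref_count;
  int unused_caches;   /* stand-in for the list of unused dns_cache objects */
} DNSCaching;

static DNSCaching *
dns_caching_new(void)
{
  DNSCaching *self = calloc(1, sizeof(*self));

  if (!self)
    return NULL;
  atomic_init(&self->ref_count, 1);   /* the creator holds the first reference */
  return self;
}

static DNSCaching *
dns_caching_ref(DNSCaching *self)
{
  atomic_fetch_add(&self->ref_count, 1);
  return self;
}

/* Returns 1 if this call released the last reference and freed the state. */
static int
dns_caching_unref(DNSCaching *self)
{
  if (atomic_fetch_sub(&self->ref_count, 1) != 1)
    return 0;
  free(self);
  return 1;
}
```

With this shape, the main thread and every worker thread would each hold one reference, and the state is freed by whichever of them lets go last, regardless of the order in which shutdown and thread stop run.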

bazsi · 2017-09-13T17:26:10Z

lib/apphook.c

@@ -215,4 +216,6 @@ app_thread_stop(void)
  main_loop_call_thread_deinit();
  dns_caching_thread_deinit();
  scratch_buffers_allocator_deinit();
+  g_atomic_int_dec_and_test(&main_loop_workers_running);
+  g_cond_signal(&thread_halt_cond);
 }


I think this should probably go the call-sites where app_thread_start/stop is getting called.

bazsi · 2017-09-13T17:29:50Z

lib/mainloop.c

+static void
+block_till_workers_exit()
+{
+  gint64 end_time;


are the worker threads signaled to exit at this point? e.g. IIRC iv_work_pool_put() should wake up the workers and cause them to exit in a timely manner.

our threaded destination code might be blocking for some time still, but I think that's ok.

iv_work_pool_put

I call this block after main_loop_io_worker_deinit(), which calls iv_work_pool_put.
main_loop_io_worker_deinit();
main_loop_worker_deinit();
block_till_workers_exit();

our threaded destination code might be blocking for some time still, but I think that's ok.

For the threaded destinations: the complete exit mechanism currently waits for them. It is executed with main_loop_worker_sync_call, which waits for all threaded destinations to call main_loop_worker_job_complete(). So yes, the threaded destination drivers can block the exit, but that block happens before block_till_workers_exit() is called, so the timeout with g_cond_wait_until is not affected.

bazsi · 2017-09-13T17:31:00Z

lib/mainloop.c

+      {
+        /* timeout has passed. */
+        msg_error("Main thread timed out while waiting workers threads to exit",
+                  evt_tag_int("workers_running", main_loop_workers_running),


we should include how long we waited and also what would happen next.

btw: this will probably not be printed, as the internal() source is not in operation at this point.

Can I resolve the non-printing part somehow? Should I use fprintf to write to stderr here instead of msg_error?

I'll change this to fprintf.

How long we waited: it will be 15s, as we get here only when the timeout happened. As for what happens next, I'm not sure. I think we can just continue with exiting. Or should we abort? The only root cause I can think of is that the accounting of main_loop_workers_running is somehow messed up (nonzero). But all workers will certainly have exited: we only get here when there are no jobs, and after 10 seconds every thread is forced to exit, so waiting 15 seconds and continuing is safe imo.
Proposed text:
fprintf(stderr, "Main thread timed out (15s) while waiting for worker threads to exit. workers_running: %d. Continuing ...\n", main_loop_workers_running);

bazsi · 2017-09-13T17:31:37Z

lib/mainloop.c

+
+  end_time = g_get_monotonic_time() + 15*G_TIME_SPAN_SECOND;
+  g_mutex_lock(&dummy_workers_running_lock);
+  while (main_loop_workers_running)


what is this dummy? it should be the same lock that protects the counter.
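
A sketch of the pattern this comment asks for, written with POSIX threads rather than GLib so that it stays dependency-free. All names here (`workers_running_lock`, `workers_running`, `worker_started`, `worker_stopped`) are hypothetical, and `thread_halt_cond` and `block_till_workers_exit` only borrow their names from the patch. The point is that a single mutex both guards the counter and is passed to the condition wait, so a decrement cannot slip in between the check and the wait.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <time.h>

/* One mutex guards the counter and pairs with the condition variable. */
static pthread_mutex_t workers_running_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t thread_halt_cond = PTHREAD_COND_INITIALIZER;
static int workers_running;

static void
worker_started(void)
{
  pthread_mutex_lock(&workers_running_lock);
  workers_running++;
  pthread_mutex_unlock(&workers_running_lock);
}

static void
worker_stopped(void)
{
  pthread_mutex_lock(&workers_running_lock);
  workers_running--;
  pthread_cond_signal(&thread_halt_cond);
  pthread_mutex_unlock(&workers_running_lock);
}

/* Trivial worker body: does no work and reports that it has stopped. */
static void *
worker_main(void *unused)
{
  (void) unused;
  worker_stopped();
  return NULL;
}

/* Returns the number of workers still running: 0 on success, nonzero on timeout. */
static int
block_till_workers_exit(int timeout_seconds)
{
  struct timespec deadline;
  int remaining;

  clock_gettime(CLOCK_REALTIME, &deadline);
  deadline.tv_sec += timeout_seconds;

  pthread_mutex_lock(&workers_running_lock);
  while (workers_running > 0)
    {
      if (pthread_cond_timedwait(&thread_halt_cond, &workers_running_lock,
                                 &deadline) == ETIMEDOUT)
        break;
    }
  remaining = workers_running;
  pthread_mutex_unlock(&workers_running_lock);
  return remaining;
}
```

The caller counts each worker with `worker_started()` before creating its thread, then calls `block_till_workers_exit(15)`; a nonzero return value means the timeout expired with workers still accounted for.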

furiel · 2017-09-20T07:38:00Z

I will drop the dns_cache part. Let's go with the synchronization version.

Signed-off-by: Antal Nemes <antal.nemes@balabit.com>

kira-syslogng · 2017-09-25T08:04:23Z

Build SUCCESS, the tests were executed on test branch: master and test suite: functions

Prior to this patch, mainloop did not wait for the worker threads to exit during shutdown, which resulted in a race between app_shutdown and app_thread_stop. This patch blocks mainloop exit until the number of live threads drops to zero. Signed-off-by: Antal Nemes <antal.nemes@balabit.com>

kira-syslogng · 2017-09-25T08:24:15Z

Build SUCCESS, the tests were executed on test branch: master and test suite: functions

furiel · 2017-10-10T05:59:58Z

@bazsi could you recheck this PR?

Fixed

lbudai added the in progress label Sep 11, 2017

furiel force-pushed the dns_cache_use_after_free branch from 7253da8 to 6073744 Compare September 12, 2017 11:34

bazsi previously requested changes Sep 13, 2017

mainloop: rename workers_running --> jobs_running

a463301

Signed-off-by: Antal Nemes <antal.nemes@balabit.com>

furiel force-pushed the dns_cache_use_after_free branch from 6073744 to dfd63cf Compare September 25, 2017 07:51

furiel force-pushed the dns_cache_use_after_free branch from dfd63cf to 33c999d Compare September 25, 2017 08:11

presidento added the bug label Oct 5, 2017

furiel added needs-review and removed in progress labels Oct 18, 2017

Kokan approved these changes Oct 26, 2017

lbudai merged commit 3dad146 into syslog-ng:master Oct 26, 2017

lbudai removed the needs-review label Oct 26, 2017

furiel mentioned this pull request Oct 26, 2017

Memory leak in syslog-ng 3.9.1 #1400

Closed

furiel deleted the dns_cache_use_after_free branch May 8, 2018 04:53
