Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

logreader: stop processing if the connection is closed #3985

Closed
wants to merge 1 commit into from

Conversation

sparlane
Copy link

If the connection is closed and log_reader_work_finished was running,
calling log_pipe_notify will result in the LogReader object being no longer
usable because it was cleaned up.

So to prevent SEGV, don't allow execution to continue into the next block
when the connection is marked as NC_CLOSE

It seems more likely for this to occur when the log reader is a socket
that was opened by "logger" from busybox.

We started experiencing this after updating from glibc 2.29 to 2.35
and gcc from 8.3 to 11.2.

Signed-off-by: Scott Parlane scott.parlane@alliedtelesis.co.nz

If the connection is closed and log_reader_work_finished was running,
calling log_pipe_notify will result in the LogReader object being no longer
usable because it was cleaned up.

So to prevent SEGV, don't allow execution to continue into the next block
when the connection is marked as NC_CLOSE

It seems more likely for this to occur when the log reader is a socket
that was opened by "logger" from busybox.

We started experiencing this after updating from glibc 2.29 to 2.35
and gcc from 8.3 to 11.2.

Signed-off-by: Scott Parlane <scott.parlane@alliedtelesis.co.nz>
@kira-syslogng
Copy link
Contributor

This user does not have permission to start the build. Can one of the admins verify this patch and start the build?
(admin: you have the next options (make sure you checked the code):
"ok to test" to accept this pull request (and further changes) for testing
"test this please" for a one time test run
"add to whitelist" add author of a Pull Request to whitelist (globally, be careful, it means this user can trigger kira for any PR)
do nothing -> CI won't start)

1 similar comment
@kira-syslogng
Copy link
Contributor

This user does not have permission to start the build. Can one of the admins verify this patch and start the build?
(admin: you have the next options (make sure you checked the code):
"ok to test" to accept this pull request (and further changes) for testing
"test this please" for a one time test run
"add to whitelist" add author of a Pull Request to whitelist (globally, be careful, it means this user can trigger kira for any PR)
do nothing -> CI won't start)

@bazsi
Copy link
Collaborator

bazsi commented Apr 14, 2022

@kira-syslogng ok to test;

@kira-syslogng
Copy link
Contributor

Build FAILURE

@bazsi
Copy link
Collaborator

bazsi commented Apr 14, 2022

Can you show a backtrace that this fixes? The "self" pointer at that point should still be referenced, so at the very least the pointer should still be valid.

The NC_CLOSE handler in afsocket-source.c will deinit the pipe (afsocket_sd_close_connection) prior to returning, which means that the condition right after the log_pipe_notify() call (in log_reader_work_finished) should not be entered. So I think your change should not actually change behaviour.

@MrAnno MrAnno self-requested a review April 14, 2022 09:07
@sparlane
Copy link
Author

From syslog-ng 3.21.1:

Thread 1 (Thread 0x77fa5020 (LWP 405)):
#0 0x0ff59618 in log_proto_server_reset_error (s=0x0) at ./lib/logproto/logproto-server.h:164
#1 log_reader_work_finished (s=0x10053800) at lib/logreader.c:426
#2 0x0fc75b98 in iv_fd_poll_and_run (st=st@entry=0x100270a0, abs=) at iv_fd.c:219
#3 0x0fc770b4 in iv_main () at iv_main_posix.c:112
#4 0x0ff5e060 in main_loop_run (self=0xfff629c <main_loop>) at lib/mainloop.c:657
#5 0x10001264 in main (argc=, argv=) at syslog-ng/main.c:316

I added debug print statements to the start of every function in logreader.c, and most of the branches
pending_close wasn't set, but notify_code was set to NC_CLOSE, and it called the close/free functions while it was processing the message.

Yes, I know 3.21.1 is old, but that is what we currently use, I did test with the latest release and I get an identical backtrace (with different line numbers)

There are some limitations on where we see this:

  • if init is upstart we see this, if it's systemd we don't (I suspect because of the way that systemd interferes with syslog)
  • we don't see this on our x86_64 target, but do see it on at least armv7 and ppc
  • I think it's triggered specifically by the use of logger, which starts opens the the socket to syslog, writes and message and then closes the socket.
  • I can get this to happen often if I replace our syslog-ng-ctl reload/reopen calls with restarting the syslog-ng job, I suspect this is because syslog-ng needs to be busy to have both the close and a message at the same time.

@bazsi
Copy link
Collaborator

bazsi commented Apr 20, 2022 via email

@bazsi
Copy link
Collaborator

bazsi commented Apr 27, 2022 via email

@bazsi
Copy link
Collaborator

bazsi commented Apr 28, 2022

I was delving into the sources and have so fair failed to find a scenario which leads to this crash. Certainly I am missing something. If you still have a core file on the crash, can you please run these gdb commands and record their results:

(gdb) bt full
(gdb) frame 1
(gdb) p *s

These are things I've checked:

  • the reference to the self pointer should be ensured by the reference taken by MainLoopIOJob. reader->io_job.engage takes this reference as an I/O event is submitted to the I/O work pool and is not released until log_reader_work_finished() returns. The crash happens within log_reader_work_finished() at this reference should still be intact.
  • if the self pointer is still valid, the question is how self->proto can be NULL at the point of the crash.
  • self->proto is initialized to NULL as a LogReader is created, but is set via log_reader_open() right after the constructor call. so there's no LogReader with self->proto == NULL unless we are in the process of closing it.
  • The process of closing happens when we read EOF on the socket. When we encounter an EOF, self->notify_code is set to NC_CLOSE and log_reader_work_finished() is invoked in the main thread (this function is the completion callback of the READ event the we finish handling).
  • As we enter log_reader_work_finished(), we check if self->pending_close is set. It should never be set in the afsocket case, as it is used by the file driver for cases where we close a file in the non-main thread.
  • The next thing in log_reader_work_finished() is to call log_pipe_notify() that calls the afsocket_sc_notify() virtual function to inform the socket connection that the associated connection is closing.
    • log_pipe_notify() calls afsocket_sd_close_connection(), which calls log_reader_close_proto(). log_reader_close_proto() will set self->proto to NULL (e.g. the crash trigger), but also calls log_pipe_deinit() on the source connection.
    • The deinit method for AFSocketSourceConnection (afsocket_sc_deinit) drops the owner reference AND calls log_pipe_deinit() on the LogReader.
    • The deinit method of the LogReader (log_reader_deinit), will clear the PIF_INITIALIZED flag in LogReader->flags
  • After returning from log_pipe_notify(), we check the PIF_INITIALIZED flag and if it's cleared we don't do anything. If it's not we enter the crash location, and if self->proto is NULL, we indeed crash.

In summary the PIF_INITIALIZED flag in LogReader->flags prevents the crash, it is cleared in the same process that sets proto to NULL.

The only way this could happen, if log_reader_work_finished() would reenter a second time, maybe as an invocation from log_pipe_notify() but work_finished is only called as a completion callback (either by ivykis in threaded mode, or ourselves if threading is disabled). The backtrace shows, that work_finished is called by ivykis, so threading indeed seems to be enabled.

Also NOTE that the entire list above is executed in series by the main thread. The asynchronous closing that happens in wildcard-file() is not usedin the case of afsocket.

With the extra information that you can hopefully send me, could pinpoint where I am not right in the analysis above.

@sparlane
Copy link
Author

sparlane commented Apr 29, 2022

Hi,

I had to regenerate it, my exact code is here:
https://github.com/sparlane/syslog-ng/tree/syslog_3_21_debug

Core was generated by `/usr/sbin/syslog-ng -F -f /etc/syslog-ng.conf'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0ff598bc in log_proto_server_reset_error (s=0x0) at ./lib/logproto/logproto-server.h:164
164	./lib/logproto/logproto-server.h: No such file or directory.
(gdb) bt full
#0  0x0ff598bc in log_proto_server_reset_error (s=0x0) at ./lib/logproto/logproto-server.h:164
No locals.
#1  log_reader_work_finished (s=0x10055710) at lib/logreader.c:440
        self = 0x10055710
#2  0x0fc75b98 in iv_fd_poll_and_run (st=st@entry=0x10028250, abs=<optimized out>) at iv_fd.c:219
        fd = 0x10055898
        active = {next = 0x7fd07de8, prev = 0x7fd07de8}
        run_timers = <optimized out>
#3  0x0fc770b4 in iv_main () at iv_main_posix.c:112
        _abs = {tv_sec = 0, tv_nsec = 0}
        abs = <optimized out>
        st = 0x10028250
        run_timers = <optimized out>
#4  0x0ff5e338 in main_loop_run (self=0xfff62c4 <main_loop>) at lib/mainloop.c:657
No locals.
#5  0x10001308 in main (argc=<optimized out>, argv=<optimized out>) at syslog-ng/main.c:322
        rc = 0
        ctx = <optimized out>
        error = 0x0
        main_loop = 0xfff62c4 <main_loop>
        exit_before_main_loop_run = <optimized out>
(gdb) frame 1
#1  log_reader_work_finished (s=0x10055710) at lib/logreader.c:440
440	lib/logreader.c: No such file or directory.
(gdb) p *s
Attempt to dereference a generic pointer.
(gdb) p *self
$2 = {super = {super = {ref_cnt = {counter = 268771413}, flags = 1726283135, queue = 0xff5a814 <log_source_queue>, cfg = 0x100441f0, expr_node = 0x0, pipe_next = 0x10055690, discarded_messages = 0x0, persist_name = 0x0, 
      plugin_name = 0x0, pre_init = 0x0, init = 0xff596f0 <log_reader_init>, deinit = 0xff59010 <log_reader_deinit>, post_deinit = 0x0, generate_persist_name = 0x0, clone = 0x0, free_fn = 0xff58e3c <log_reader_free>, notify = 0x0}, 
    options = 0x100676ec, threaded = 0, pos_tracked = 0, stats_id = 0x10054cd0 "\020\004X\264f\344\375\177\017\365\236", <incomplete sequence \374>, stats_instance = 0x10055bb0 "\020\004V\245f\344\375\177eam,anonymous", window_size = {
      counter = {value = 2147483647}}, last_message_seen = 0x0, recvd_messages = 0x0, last_ack_count = 0, ack_count = 0, window_full_sleep_nsec = 0, last_ack_rate_time = {tv_sec = 0, tv_nsec = 0}, ack_tracker = 0x10056e00, 
    wakeup = 0xff58dc4 <log_reader_wakeup>, window_empty_cb = 0xff58dd8 <log_reader_window_empty>}, proto = 0x0, immediate_check = 0, control = 0x10055690, options = 0x100676e8, poll_events = 0x0, peer_addr = 0x1004b908, restart_task = {
    cookie = 0x10055710, handler = 0xff598e0 <log_reader_io_handle_in>, pad = {0x100557b4, 0x100557b4, 0x23, 0x0, 0x0, 0x0}}, schedule_wakeup = {cookie = 0x10055710, handler = 0xff597b0 <log_reader_wakeup_triggered>, owner = 0x10028250, 
    list = {next = 0x100557d8, prev = 0x100557d8}}, last_msg_sent_event = {cookie = 0x10055710, handler = 0xff58d0c <_last_msg_sent>, owner = 0x10028250, list = {next = 0x100557ec, prev = 0x100557ec}}, io_job = {
    engage = 0xff57710 <log_pipe_ref>, work = 0xff59344 <log_reader_work_perform>, completion = 0xff597d0 <log_reader_work_finished>, release = 0xff57790 <log_pipe_unref>, user_data = 0x10055710, working = 0, cond = 0, work_item = {
      cookie = 0x100557f4, work = 0xff5f6b0 <_work>, completion = 0xff5f64c <_complete>, list = {next = 0x0, prev = 0x0}}}, watches_running = 0, suspended = 0, notify_code = 0, pending_close = 0, pending_close_cond = 0x10069180, 
  pending_close_lock = {mutex = 0x0, unused = {__data = {__lock = 0, __count = 0, __owner = 0, __kind = 0, __nusers = 0, {__elision_data = {__espins = 0, __elision = 0}, __list = {__next = 0x0}}}, __size = '\000' <repeats 23 times>, 
      __align = 0}}, idle_timer = {expires = {tv_sec = 0, tv_nsec = 0}, cookie = 0x10055710, handler = 0xff58ed4 <log_reader_idle_timeout>, pad = {0x0, 0x0, 0xffffffff, 0x0}}}

The output from my additional debug for this logreader is:

Fri Apr 29 00:41:51 2022
 0x10055710: Created log reader
Fri Apr 29 00:41:51 2022
 0x10055710: log_reader_open
Fri Apr 29 00:41:51 2022
 0x10055710: log_reader_apply_proto_and_poll_events: start
Fri Apr 29 00:41:51 2022
 0x10055710: log_reader_apply_proto_and_poll_events: self->proto is set
Fri Apr 29 00:41:51 2022
 0x10055710: log_reader_init
Fri Apr 29 00:41:51 2022
 0x10055710: log_reader_work_finished
Fri Apr 29 00:41:51 2022
 0x10055710: notify_code is set
Fri Apr 29 00:41:51 2022
 (nil): afsocket_sd_close_connection
Fri Apr 29 00:41:51 2022
 (nil): UNIX
Fri Apr 29 00:41:51 2022
 0x10055710: log_reader_close_proto
Fri Apr 29 00:41:51 2022
 0x10055710: log_reader_close_proto_deferred
Fri Apr 29 00:41:51 2022
 0x10055710: log_reader_apply_proto_and_poll_events: start
Fri Apr 29 00:41:51 2022
 0x10055710: self->proto set to NULL @ log_reader_apply_proto_and_poll_events
Fri Apr 29 00:41:51 2022
 0x10055710: log_reader_deinit
Fri Apr 29 00:41:51 2022
 0x10055710: log_reader_free
Fri Apr 29 00:41:51 2022
 0x10055710: PIF_INITIALIZED

@sparlane
Copy link
Author

For comparison, I used the official syslog-ng-3.36.1.tar.gz in place of our syslog-ng, and I see the same crash

Core was generated by `/usr/sbin/syslog-ng -F -f /etc/syslog-ng.conf'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0ff5093c in log_proto_server_reset_error (s=0x0) at ./lib/logproto/logproto-server.h:164
164	./lib/logproto/logproto-server.h: No such file or directory.
[Current thread is 1 (Thread 0x77fdb020 (LWP 383))]
(gdb) bt full
#0  0x0ff5093c in log_proto_server_reset_error (s=0x0) at ./lib/logproto/logproto-server.h:164
No locals.
#1  log_reader_work_finished (s=0x100bde80) at lib/logreader.c:402
        self = 0x100bde80
#2  0x0fc65b98 in iv_fd_poll_and_run (st=st@entry=0x100270f0, abs=<optimized out>) at iv_fd.c:219
        fd = 0x100bdb9c
        active = {next = 0x7fd9b208, prev = 0x7fd9b208}
        run_timers = <optimized out>
#3  0x0fc670b4 in iv_main () at iv_main_posix.c:112
        _abs = {tv_sec = 0, tv_nsec = 0}
        abs = <optimized out>
        st = 0x100270f0
        run_timers = <optimized out>
#4  0x0ff560e0 in main_loop_run (self=0xfff69f8 <main_loop>) at lib/mainloop.c:668
No locals.
#5  0x10001270 in main (argc=<optimized out>, argv=<optimized out>) at syslog-ng/main.c:311
        rc = 0
        ctx = <optimized out>
        error = 0x0
        main_loop = 0xfff69f8 <main_loop>
        exit_before_main_loop_run = 0
(gdb) frame 1
#1  log_reader_work_finished (s=0x100bde80) at lib/logreader.c:402
402	lib/logreader.c: No such file or directory.
(gdb) p *s
Attempt to dereference a generic pointer.
(gdb) p *self
$1 = {super = {super = {ref_cnt = {counter = 65725}, flags = -1730795671, queue = 0xff522d4 <log_source_queue>, cfg = 0x10045170, expr_node = 0x0, pipe_next = 0x100bb480, discarded_messages = 0x0, persist_name = 0x0, plugin_name = 0x0, 
      signal_slot_connector = 0x100bb590, pre_init = 0x0, init = 0xff506bc <log_reader_init>, deinit = 0xff502d8 <log_reader_deinit>, post_deinit = 0x0, on_config_inited = 0x0, generate_persist_name = 0x0, arcs = 0xff4e098 <_arcs>, 
      clone = 0x0, free_fn = 0xff4fde8 <log_reader_free>, notify = 0x0, info = 0x0}, options = 0x10069ccc, threaded = 0, name = 0x100bdbe0 "\020\n\265ݘ\326'ieam,AF_UNIX(anonymous)", 
    stats_id = 0x100bb390 "\020\n\350;\230\326'i\020\006", <incomplete sequence \340>, stats_instance = 0x100bbea0 "\020\n뻘\326'ieam,anonymous", window_size = {counter = {value = 100}}, dynamic_window = {pool = 0x0, stat = {n = 0, 
        sum = 0}}, window_initialized = 1, initial_window_size = 100, full_window_size = 100, window_size_to_be_reclaimed = {value = 0}, pending_reclaimed = {value = 0}, stat_window_size = 0x0, stat_full_window = 0x0, 
    last_message_seen = 0x0, recvd_messages = 0x0, stat_window_size_cluster = 0x0, stat_full_window_cluster = 0x0, last_ack_count = 0, ack_count = 0, window_full_sleep_nsec = 0, last_ack_rate_time = {tv_sec = 0, tv_nsec = 0}, 
    ack_tracker_factory = 0x10060da0, ack_tracker = 0x0, wakeup = 0xff4fd48 <log_reader_wakeup>, schedule_dynamic_window_realloc = 0xff4fd60 <_schedule_dynamic_window_realloc>}, proto = 0x0, immediate_check = 0, control = 0x100bb480, 
  options = 0x10069cc8, poll_events = 0x0, peer_addr = 0x10062160, local_addr = 0x100621e0, max_message_size = 0x0, average_messages_size = 0x0, CPS = 0x0, restart_task = {cookie = 0x100bde80, 
    handler = 0xff5095c <log_reader_io_handle_in>, pad = {0x100bdf88, 0x100bdf88, 0x21, 0x0, 0x0, 0x0}}, schedule_wakeup = {cookie = 0x100bde80, handler = 0xff50860 <log_reader_wakeup_triggered>, owner = 0x100270f0, list = {
      next = 0x100bdfac, prev = 0x100bdfac}}, io_job = {engage = 0xff4e344 <log_pipe_ref>, work = 0xff50218 <log_reader_work_perform>, completion = 0xff5087c <log_reader_work_finished>, release = 0xff4e3c4 <log_pipe_unref>, 
    user_data = 0x100bde80, working = 0, cond = 0, work_item = {cookie = 0x100bdfb4, work = 0xff575ec <_work>, completion = 0xff57588 <_complete>, list = {next = 0x0, prev = 0x0}}}, watches_running = 0, suspended = 0, 
  realloc_window_after_fetch = 0, notify_code = 0, pending_close = 0, pending_close_cond = {p = 0x0, i = {0, 0}}, pending_close_lock = {p = 0x0, i = {0, 0}}, idle_timer = {expires = {tv_sec = 0, tv_nsec = 0}, cookie = 0x100bde80, 
    handler = 0xff509f8 <log_reader_idle_timeout>, pad = {0x0, 0x0, 0xffffffff, 0x0}}}

@bazsi
Copy link
Collaborator

bazsi commented Apr 29, 2022 via email

@bazsi
Copy link
Collaborator

bazsi commented Apr 29, 2022

Do you have threading enabled? E.g. threaded flag or threaded(no) in the global config.

If you have disabled threading then I have a theory.

Non threaded execution is performed by invoking the MainloopIOJob callbacks directly. In that case MainloooIOJob->engage and release does not take a reference on the LogReader, causing exactly this crash.

Add a log_pipe_ref/unref pair around the invocation of log_reader_work_perform/finished, in the function
log_reader_io_handle_in()

This should close this issue. I am writing this on my phone and am unable to submit a patch at the moment. But if you could confirm this theory that'd be great.

Thanks

@bazsi
Copy link
Collaborator

bazsi commented Apr 29, 2022

I have now reproduced this locally, not the crash but the use-after-free situation. Probably the crash does not happen on some of your platforms as either the flags member is not overwritten when LogReader is freed, OR the self->proto becomes non-NULL during the free.

In any case, both after use-after-free conditions that eventually could lead to a crash. This patch seems to fix it:

diff --git a/lib/logreader.c b/lib/logreader.c
index 135032ce7..3f1821843 100644
--- a/lib/logreader.c
+++ b/lib/logreader.c
@@ -572,8 +572,10 @@ log_reader_io_handle_in(gpointer s)
        */
       if (!main_loop_worker_job_quit())
         {
+          log_pipe_ref(&self->super.super);
           log_reader_work_perform(s, G_IO_IN);
           log_reader_work_finished(s);
+          log_pipe_unref(&self->super.super);
         }
     }
 }

This code is only executing in the non-threaded case. While threading is enabled, this is fixed by MainLoopIOJob->engage/release methods which take and drop these very refs.

As a workaround, you could re-enable threading, that should fix the crash as well. I am opening a PR from this patch in a moment.

@sparlane
Copy link
Author

sparlane commented May 1, 2022

Hi,

There is no reference to threading anywhere in our config, so I think it is turned off.

Adding the log_ref/unref as described also resolves this crash.

@sparlane
Copy link
Author

sparlane commented May 1, 2022

Hi @bazsi,

Looking at some of the things we have previously tried to fix, I think the same bug exists in log_writer_io_handler where it has the same code pattern.

From our bug history I see that they are related to having a debug output that is connected to the local console and then the user logging out without stopping the debug output.

@bazsi
Copy link
Collaborator

bazsi commented May 2, 2022 via email

@bazsi
Copy link
Collaborator

bazsi commented May 3, 2022

Closing this in favour of #3997

@bazsi bazsi closed this May 3, 2022
@sparlane sparlane deleted the logreader_segv_fix branch May 3, 2022 09:26
@MrAnno MrAnno removed their request for review May 3, 2022 21:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants