Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix config reload crash via introducing on_config_inited in LogPipe #3176

Merged
merged 5 commits into from
Jul 13, 2020

Conversation

0140454
Copy link
Contributor

@0140454 0140454 commented Mar 13, 2020

In current version, syslog-ng registers revert_config task after cfg_init.
When applying config failed, ivykis task list can be illustrated as the following:

  | _reenable_worker_jobs | _io_process_input | ... | _revert_config |
  ^
  | ivykis task list

If there is a log before _revert_config finished, syslog-ng will use uninitialized log pipe to process it. This will lead to a crash.

Therefore, main_loop_reload_config_revert is scheduled to be called immediately via main_loop_worker_sync_call after applying config failed.

To avoid recursion in main_loop_worker_sync_call, there is a global flag named sync_call_running.
According to this flag, syslog-ng can determine whether it has to call _invoke_sync_call_actions.

fixes: #3173
closes #3196

@kira-syslogng
Copy link
Contributor

This user does not have permission to start the build. Can one of the admins verify this patch and start the build?
(admin: you have the next options (make sure you checked the code):
"ok to test" to accept this pull request (and further changes) for testing
"test this please" for a one time test run
"add to whitelist" add author of a Pull Request to whitelist (globally, be careful, it means this user can trigger kira for any PR)
do nothing -> CI won't start)

2 similar comments
@kira-syslogng
Copy link
Contributor

This user does not have permission to start the build. Can one of the admins verify this patch and start the build?
(admin: you have the next options (make sure you checked the code):
"ok to test" to accept this pull request (and further changes) for testing
"test this please" for a one time test run
"add to whitelist" add author of a Pull Request to whitelist (globally, be careful, it means this user can trigger kira for any PR)
do nothing -> CI won't start)

@kira-syslogng
Copy link
Contributor

This user does not have permission to start the build. Can one of the admins verify this patch and start the build?
(admin: you have the next options (make sure you checked the code):
"ok to test" to accept this pull request (and further changes) for testing
"test this please" for a one time test run
"add to whitelist" add author of a Pull Request to whitelist (globally, be careful, it means this user can trigger kira for any PR)
do nothing -> CI won't start)

@furiel
Copy link
Collaborator

furiel commented Mar 13, 2020

@kira-syslogng ok to test

@kira-syslogng
Copy link
Contributor

Build FAILURE

@furiel
Copy link
Collaborator

furiel commented Mar 13, 2020

I tried to copy what the failing test does, and I experience crash during shutdown.

Syslog-ng started with -Fe: ./syslog-ng -Fe -f ../etc/http.conf.

@version: 3.26

source s_network {
    network(flags(no-parse)
    ip("127.0.0.1")
    port(10001));
};

destination d_sql { sql(); };

log {
  source(s_network);
  source { example-msg-generator(num(1)); };
  destination { http(url("http://localhost:5000")); };
  destination(d_sql); # comment this line for the failed reload
};

The http server does not matter, whatever answers with 200 ok for a http post is fine. For example:
python3 app.py.

from flask import Flask
app = Flask(__name__)
from flask import request

@app.route('/', methods=["POST"])
def hello():
    return "Hello World!"

if __name__ == '__main__':
    app.run()

After starting syslog-ng, commenting out the sql source, reloading, then stopping. During stop, syslog-ng crashes:

$ gdb --core core syslog-ng -ex 'bt full' --batch
[New LWP 499]
[New LWP 604]
[New LWP 603]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `./syslog-ng -Fe -f ../etc/http.conf'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fd614c96316 in _request_worker_exit (s=0x1339890) at /home/furiel/workspace/failed-reload/syslog-ng/lib/logthrdest/logthrdestdrv.c:917
917	  self->owner->under_termination = TRUE;
[Current thread is 1 (Thread 0x7fd61507ab80 (LWP 499))]
#0  0x00007fd614c96316 in _request_worker_exit (s=0x1339890) at /home/furiel/workspace/failed-reload/syslog-ng/lib/logthrdest/logthrdestdrv.c:917
        self = 0x1339890
#1  0x00007fd614c4f6d6 in _invoke_worker_exit_callback (func=0x1339aa0) at /home/furiel/workspace/failed-reload/syslog-ng/lib/mainloop-worker.c:150
No locals.
#2  0x00007fd61492e51d in g_list_foreach () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
No symbol table info available.
#3  0x00007fd614c4f6f8 in _request_all_threads_to_exit () at /home/furiel/workspace/failed-reload/syslog-ng/lib/mainloop-worker.c:156
No locals.
#4  0x00007fd614c4fd12 in main_loop_worker_sync_call (func=0x7fd614c4e4dd <main_loop_exit_finish>, user_data=0x7fd614f09360 <main_loop>) at /home/furiel/workspace/failed-reload/syslog-ng/lib/mainloop-worker.c:409
No locals.
#5  0x00007fd614c4e533 in main_loop_exit_timer_elapsed (user_data=0x7fd614f09360 <main_loop>) at /home/furiel/workspace/failed-reload/syslog-ng/lib/mainloop.c:435
        self = 0x7fd614f09360 <main_loop>
#6  0x00007fd613aeb9fe in ?? () from /usr/lib/x86_64-linux-gnu/libivykis.so.0
No symbol table info available.
#7  0x00007fd613aee376 in iv_main () from /usr/lib/x86_64-linux-gnu/libivykis.so.0
No symbol table info available.
#8  0x00007fd614c4eba6 in main_loop_run (self=0x7fd614f09360 <main_loop>) at /home/furiel/workspace/failed-reload/syslog-ng/lib/mainloop.c:647
No locals.
#9  0x0000000000401d04 in main (argc=1, argv=0x7fff49cd41f8) at /home/furiel/workspace/failed-reload/syslog-ng/syslog-ng/main.c:313
        rc = 0
        ctx = 0x12e92c0
        error = 0x0
        main_loop = 0x7fd614f09360 <main_loop>
        exit_before_main_loop_run = 0

@0140454 0140454 changed the title Revert to old config early if applying new config failed (fix #3173) [WIP] Revert to old config early if applying new config failed (fix #3173) Mar 15, 2020
@kira-syslogng
Copy link
Contributor

Build FAILURE

@0140454
Copy link
Contributor Author

0140454 commented Mar 16, 2020

Is it possible to let me know what the failing test does?

@bazsi
Copy link
Collaborator

bazsi commented Mar 16, 2020 via email

@0140454
Copy link
Contributor Author

0140454 commented Mar 16, 2020

It gives me HTTP 404 Error.

@bazsi
Copy link
Collaborator

bazsi commented Mar 16, 2020 via email

@0140454
Copy link
Contributor Author

0140454 commented Mar 16, 2020

@MrAnno
Copy link
Collaborator

MrAnno commented Mar 16, 2020

Hello @0140454,

Thank you for your pull request.

The macOS job has been fixed recently, rebasing your branch to our master will fix that check.

Kira-starter runs our internal test suite, you have 2 failing tests there:

  • an add-contextual-data() parser reload test
  • a http() destination reload test:
@version: 3.22
options {
    time_reopen(50);
    stats_level(3);
};

source s_network {
    network(flags(no-parse) ip("127.0.0.1") port(10061));
};

destination d_http {
    http(method(post) url("http://127.0.0.1:61062"));
};

destination d_sql_invalid {
    sql();
};

log {
source(s_network);
destination(d_http);
#destination(d_sql_invalid);

flags(flow-control);
};

The test case starts syslog-ng with the above config, then activates destination(d_sql_invalid); (remove #) and reloads syslog-ng.
Since d_sql_invalid contains an incorrectly configured SQL destination, the following is done/expected:

  • syslog-ng starts with the initial correct configuration (sbin/syslog-ng -Fdev)
    • you can validate that the http() destination's worker thread has been started: Dedicated worker thread started; worker_index='0', driver='d_http#0'
  • one message is sent to s_network
  • after adding the invalid destination to the log path, the reload command (sbin/syslog-ng-ctl reload) should fail, which means
    1. the initial http() destination should stop its worker thread (you should find a log line Dedicated worker thread finished; worker_index='0', driver='d_http#0')
    2. a new reloaded http() instance should start (log: Dedicated worker thread started; worker_index='0', driver='d_http#0')
    3. the initialization of d_sql_invalid will then fail (Error initializing new configuration, reverting to old config)
    4. the syslog-ng configuration is then reverted to the initial config due to the unsuccessful reload attempt, so you should find another Dedicated worker thread finished and Dedicated worker thread started pair.
  • another message is sent to s_network

So you need a total of 3 Dedicated worker thread started and 2 Dedicated worker thread finished messages on your terminal.
The test failed because the second http worker thread hasn't been stopped:

Error initializing new configuration, reverting to old config;
Shutting down dedicated worker thread; worker_index='0', driver='d_http#0'
>>>> ... (missing "worker thread finished")
Dedicated worker thread started; worker_index='0', driver='d_http#0'

@kira-syslogng
Copy link
Contributor

Build SUCCESS

@kira-syslogng
Copy link
Contributor

Build SUCCESS

@0140454 0140454 changed the title [WIP] Revert to old config early if applying new config failed (fix #3173) Revert to old config early if applying new config failed (fix #3173) Mar 18, 2020
@furiel
Copy link
Collaborator

furiel commented Mar 19, 2020

@0140454 I checked this change with your original test from #3173 (executing syslog-ng without -e, commenting f_my_info). If I replace the system() source with example-msg-generator(num(1)), I run into the same assert.

@version: 3.25

options {
  stats_freq (0);
  flush_lines (0);
  time_reopen (10);
  log_fifo_size (10000);
  chain_hostnames (off);
  use_dns (no);
  use_fqdn (no);
  create_dirs (no);
  keep_hostname (yes);
  perm(0640);
  group("log");
};

source src {
  example-msg-generator(num(1));
  internal();
};

filter f_err { level(err); };
#filter f_my_info { level(info); };
filter f_test { filter(f_err) or filter(f_my_info); };
destination d_test { file("/var/log/test.log"); };

log { source(src); filter(f_test); destination(d_test); };

ERROR:/home/furiel/workspace/failed-reload/syslog-ng/lib/logpipe.h:343:log_pipe_queue: assertion failed: ((s->flags & PIF_INITIALIZED) != 0)

However, with system() source (your original config), the problem does not occur.

@0140454
Copy link
Contributor Author

0140454 commented Mar 19, 2020

In my understanding, there are three stages in iv_main(): timer, task and event.

while (1) {
    iv_run_timers(st);
    iv_run_tasks(st);
    iv_fd_poll_and_run(st, abs);
}

The module example-msg-generator uses timer to generate message; in the other hand, reloading config is occurred at the last stage.

To avoid crash when using example-msg-generator module, we need to finish reverting to old config in the last stage too.

Before reverting to old config, we need all threads to exit. The situation becomes that worker threads wait main thread for performing main_loop_worker_job_complete(), and main thread wait worker threads for continuing the task about reverting.

I have no idea how to deal with this issue currently.

@furiel
Copy link
Collaborator

furiel commented Mar 23, 2020

@0140454 I try to catch up with the details, and see if I can find out something.

@furiel
Copy link
Collaborator

furiel commented Mar 25, 2020

I think I found a resolution for the example-msg-generator problem. As that is a new problem, and has different root cause that you fixed here, I opened a pull request to handle it separately: #3196.

Now I look into your patchset, and try to understand the details.

lib/mainloop.c Outdated Show resolved Hide resolved
@furiel
Copy link
Collaborator

furiel commented Mar 26, 2020

Just an idea, but

@0140454 what if you turn revert_config into a zero timeout (or something in the past) timer? Could that work? To my understanding, that would ensure the revert_config "job" would be scheduled ahead both the timers and tasks? With that #3196 would not be necessary?

@0140454
Copy link
Contributor Author

0140454 commented Mar 26, 2020

I have tried to use timer to make revert_config job run earlier. However, it doesn't work.

The following is the implementation of iv_main,

while (1) {
	struct timespec _abs;
	const struct timespec *abs;

	if (run_timers)
		iv_run_timers(st);
	iv_run_tasks(st);

	if (st->quit || !st->numobjs)
		break;

	if (iv_pending_tasks(st)) {
		_abs.tv_sec = 0;
		_abs.tv_nsec = 0;
		abs = &_abs;
	} else {
		abs = iv_get_soonest_timeout(st);
	}

	run_timers = iv_fd_poll_and_run(st, abs);
}

When syslog-ng received a signal, the handler is executed in iv_fd_poll_and_run.
According to backtrace from gdb,

#0  main_loop_reload_config_apply (user_data=0x7ffff7f8a820 <main_loop>) at lib/mainloop.c:286
#1  0x00007ffff7ecd79e in _consume_action (action=0x55555564a000) at lib/mainloop-worker.c:240
#2  0x00007ffff7ecd7d6 in _invoke_sync_call_actions () at lib/mainloop-worker.c:250
#3  0x00007ffff7ecdb23 in main_loop_worker_sync_call (func=0x7ffff7ecbc84 <main_loop_reload_config_apply>, user_data=0x7ffff7f8a820 <main_loop>)
    at lib/mainloop-worker.c:397
#4  0x00007ffff7ecc0a3 in main_loop_reload_config_commence (self=0x7ffff7f8a820 <main_loop>) at lib/mainloop.c:350
#5  0x00007ffff7ecc141 in main_loop_reload_config (self=0x7ffff7f8a820 <main_loop>) at lib/mainloop.c:365
#6  0x00007ffff7ecc4d2 in sig_hup_handler (user_data=0x7ffff7f8a820 <main_loop>) at lib/mainloop.c:482
#7  0x00007ffff7f337f0 in iv_signal_event (_this=0x7ffff7f8a828 <main_loop+8>) at iv_signal.c:204
#8  0x00007ffff7f31ae7 in iv_event_raw_got_event (_this=0x7ffff7f8a868 <main_loop+72>) at iv_event_raw_posix.c:62
//// HERE //// #9  0x00007ffff7f322cb in iv_fd_poll_and_run (st=st@entry=0x5555555853a0, abs=<optimized out>) at iv_fd.c:223
#10 0x00007ffff7f331e4 in iv_main () at iv_main_posix.c:112

The return value from iv_fd_poll_and_run, which will be assigned to run_timers in iv_main, has be set as zero.
In other words, revert_config timer task will not be executed in next round, it is executed in second round.
However, log message from internal() will be processed by iv_run_tasks in next round, then leads to a crash.

@furiel
Copy link
Collaborator

furiel commented Apr 1, 2020

Ok, I see the timers cannot work. First I was confused because many of the poll methods in ivykis just return 1, but then I compiled and checked with gdb, and on my system iv_fd_poll_method_epoll_timerfd is selected for polling, which indeed returns 0 when no timers were scheduled in the previous loop (we want to add the timer in the current loop), or when interrupted by signal.

In summary, I agree with your patch, and I do not know any better way to handle the problem. The timer related ordering problems need to be addressed separately.

I would like to ask you a minor change though. I am afraid that it is easy to miss the reason of this register unregister dance, and someone might just refactor back it to the original version, not checking the commit message. Could you please encapsulate the iv_task_register(&self->revert_config); into a function with a talkative name. For example schedule_revert_config_to_the_first_ivykis_task(), or similar? That might be enough for someone to know it is not accidental, but there is a reason for this code organization.

Apart from that, it looks good to me.

I am now moving forward to your second patch. I experience a little different thing that you mention in the commit message. After applying the first patch, but without the second, my internal messages about the failed reload are not emitted immediately, but not lost either. Interestingly, during shutdown they appear. I want to look into that, and understand how the patch works in details.

@kira-syslogng
Copy link
Contributor

Build FAILURE

@kira-syslogng
Copy link
Contributor

Build FAILURE

@0140454
Copy link
Contributor Author

0140454 commented Jul 4, 2020

I am not sure the reason for build failed.

However, I think debug output is different from the current version mentioned in #3176 (comment) because dedicated worker threads are started after syslog-ng applied new config successfully instead of in halfway now.

Current

Shutting down dedicated worker thread; worker_index='0', driver='d_http#0', location='http.conf:12:5'
Dedicated worker thread finished; worker_index='0', driver='d_http#0', location='http.conf:12:5'
...
Starting dedicated worker thread; worker_index='0', driver='d_http#0', location='http.conf:12:5'
Dedicated worker thread started; worker_index='0', driver='d_http#0', location='http.conf:12:5'
...
Error initializing message pipeline; plugin_name='sql', location='http.conf:16:5'
Error initializing new configuration, reverting to old config;
...
Shutting down dedicated worker thread; worker_index='0', driver='d_http#0', location='http.conf:12:5'
Dedicated worker thread finished; worker_index='0', driver='d_http#0', location='http.conf:12:5'
...
Starting dedicated worker thread; worker_index='0', driver='d_http#0', location='http.conf:12:5'
Dedicated worker thread started; worker_index='0', driver='d_http#0', location='http.conf:12:5'
Running application hooks; hook='6'

Patch

Shutting down dedicated worker thread; worker_index='0', driver='d_http#0', location='http.conf:12:5'
Dedicated worker thread finished; worker_index='0', driver='d_http#0', location='http.conf:12:5'
...
Error initializing message pipeline; plugin_name='sql', location='http.conf:16:5'
Error initializing new configuration, reverting to old config;
...
Starting dedicated worker thread; worker_index='0', driver='d_http#0', location='http.conf:12:5'
Dedicated worker thread started; worker_index='0', driver='d_http#0', location='http.conf:12:5'
Running application hooks; hook='6'

@furiel Could you review my patch again and provide some hints about build failed? I cannot view Jenkins job output. Thank you.

@0140454 0140454 requested a review from furiel July 4, 2020 04:57
@0140454 0140454 changed the title WIP: fix config reload crash via introducing on_config_inited in LogPipe fix config reload crash via introducing on_config_inited in LogPipe Jul 4, 2020
@furiel
Copy link
Collaborator

furiel commented Jul 6, 2020

@kira-syslogng retest this please test branch=furiel-followup-on-config-inited;

@kira-syslogng
Copy link
Contributor

Build FAILURE

@mitzkia
Copy link
Contributor

mitzkia commented Jul 6, 2020

@kira-syslogng retest this please test branch=furiel-followup-on-config-inited;

@kira-syslogng
Copy link
Contributor

Build FAILURE

@mitzkia
Copy link
Contributor

mitzkia commented Jul 6, 2020

@kira-syslogng retest this please test branch=furiel-followup-on-config-inited;

@kira-syslogng
Copy link
Contributor

Build SUCCESS

@furiel
Copy link
Collaborator

furiel commented Jul 7, 2020

@0140454 that test needed a followup. It started syslog-ng, and did a failed reload, and counted the number of Starting dedicated worker thread logs. It expected to count 3, but after your patch there is only 2.

I will review the code this week.

Copy link
Collaborator

@furiel furiel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few drivers that are not implemented through threaded source or threaded destination interface, but through LogReader. For example: program source/destination, network source/destination, file source/destination, journald-source (maybe others?)

I am not too afraid of the destinations. It was the sources that started to send the messages too early, which caused the crash.

For the logreader based sources: on the other hand, (just checking the code, I did not try out):log_reader_init calls log_reader_start_watches, so maybe the logreader based drivers may do not work properly either. @MrAnno @bazsi what do you think?

Do we want to check those too in the scope of this PR? Or handle them separately?

if (!cfg_tree_start(&cfg->tree))
return FALSE;

g_assert(cfg_tree_on_inited(&cfg->tree));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

asserting on the return value: does it mean that cfg_tree_on_inited should not have return value?
log_threaded_source_driver_start_worker returns TRUE unconditionally too. Just an idea, but we might want to turn every on_inited to void consistently?

Copy link
Contributor Author

@0140454 0140454 Jul 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since I think someone may want syslog-ng to be terminated if there is an error in cfg_tree_on_inited, I use g_assert here to achieve this goal.

@MrAnno
Copy link
Collaborator

MrAnno commented Jul 10, 2020

@furiel In my understanding, with this modification and the partial revert of #2555 (8054b23#diff-d527f62d8ea146c1e6ac8145a3b466b8R281), all drivers should work now.
Non-threaded drivers are scheduled from the main thread, so they can't send too early.

@0140454
Copy link
Contributor Author

0140454 commented Jul 10, 2020

In my understand, log_reader_start_watches will call PollEvents::start_watches and they finally use one of the following to conduct wanted operation:

  • iv_fd_register
  • iv_inotify_watch_register
  • iv_timer_register

Since the partial revert of #2555, config revert is performed immediately.
ivykis' timers and others operation will not be executed until the revert finished.
Therefore, LogReader-based should be fixed too.

@MrAnno
Copy link
Collaborator

MrAnno commented Jul 10, 2020

When the config load is unsuccessful, deinit() will be called immediately, which deregisters all timers, fd watches and inotify monitors. I think everything should be fine, those scheduled tasks will be canceled before they can run, and the old configuration will be started atomically.

Copy link
Collaborator

@MrAnno MrAnno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really nice work, thanks for that!

Based on my tests, non-threaded sources (internal() and msg-generator() too) and threaded sources/destinations are working correctly.

(The name I suggested (on_config_inited()) might not be the best, I hope someone can suggest a better one.)

@furiel
Copy link
Collaborator

furiel commented Jul 13, 2020

When the config load is unsuccessful, deinit() will be called immediately, which deregisters all timers, fd watches and inotify monitors.

Thanks! This puzzle piece was missing. This PR reverts 6b7d9d0 too. As no new threads will be started (as long as all future drivers will start thread only during on_config_inited), there is no need to call revert through main_loop_worker_sync_call (which would be responsible for stopping those threads before calling the sync call stack). With main_loop_worker_sync_call out of the picture, the separate revert task is not needed either. And revert is called immediately.

This is why @0140454 did not need to move _start_timer_immediately to an on_config_inited function in example msg generator either. Timers do not cause problem, only the spawn of new worker threads.

@furiel furiel merged commit db56f14 into syslog-ng:master Jul 13, 2020
@MrAnno
Copy link
Collaborator

MrAnno commented Jul 13, 2020

\o/

szemere added a commit to szemere/syslog-ng that referenced this pull request Feb 5, 2021
with syslog-ng#3176 this function became obsolete

Signed-off-by: Laszlo Szemere <laszlo.szemere@oneidentity.com>
szemere added a commit to szemere/syslog-ng that referenced this pull request Feb 9, 2021
with syslog-ng#3176 this function became obsolete

Signed-off-by: Laszlo Szemere <laszlo.szemere@oneidentity.com>
alltilla added a commit to alltilla/syslog-ng that referenced this pull request Dec 13, 2023
We cannot fail in a threaded workers init(), because of lib/cfg.c:344.

```
  /*
   * TLDR: A half-initialized pipeline turned out to be really hard to deinitialize
   * correctly when dedicated source/destination threads are spawned (because we
   * would have to wait for workers to stop and guarantee some internal
   * task/timer/fdwatch ordering in ivykis during this action).
   * See: syslog-ng#3176 (comment)
   */
  g_assert(cfg_tree_post_config_init(&cfg->tree));
```

Signed-off-by: Attila Szakacs <attila.szakacs@axoflow.com>
alltilla added a commit to alltilla/syslog-ng that referenced this pull request Dec 13, 2023
We cannot fail in a threaded workers init(), because of lib/cfg.c:344.

```
  /*
   * TLDR: A half-initialized pipeline turned out to be really hard to deinitialize
   * correctly when dedicated source/destination threads are spawned (because we
   * would have to wait for workers to stop and guarantee some internal
   * task/timer/fdwatch ordering in ivykis during this action).
   * See: syslog-ng#3176 (comment)
   */
  g_assert(cfg_tree_post_config_init(&cfg->tree));
```

Signed-off-by: Attila Szakacs <attila.szakacs@axoflow.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

syslog-ng crashed between loading invalid config and reverting to old config
7 participants