Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

opendkim/opendkim.c:dkimf_config_free(): don't assert conf->refcnt == 0 #22

Open
dilyanpalauzov opened this issue May 29, 2018 · 10 comments

Comments

@dilyanpalauzov
Copy link

dilyanpalauzov commented May 29, 2018

xxfi_close() is not necessary called, when SIGINT, SIGHUP or SIGTERM are sent to the signal handler thread established by smfi_main(), before smfi_main() returns. For this to happen the underlaying select/poll calls need to return. On some UNIX the poll/select calls exist, when libmilter closes the connection with the MTA, on Linux select/poll exit, when the timeout set by smfi_settimeout() expires.

In turn, when SIGTERM is sent to opendkim, the opened connections are not xxfi_closed() for each previously established xxfi_negotiate(), the reference counter is not decreased to zero and the assert fails.

diff --git a/opendkim/opendkim.c b/opendkim/opendkim.c
--- a/opendkim/opendkim.c
+++ b/opendkim/opendkim.c
@@ -5905,7 +5905,6 @@ static void
 dkimf_config_free(struct dkimf_config *conf)
 {
        assert(conf != NULL);
-       assert(conf->conf_refcnt == 0);
 
        dkimf_zapkey(conf);
 
@mskucherawy
Copy link
Member

The assertion in there is to detect and report conditions where we call dkimf_config_free() improperly, so I'd prefer to leave it in if possible and fix the underlying problem.

Off the top of my head, we could try moving the refcount increment to mlfi_connect(). It looks like it's possible to make this safe by having mlfi_negotiate() use curconf, and only actually copy a reference to that object and increment the refcount in mlfi_connect(). Does this make sense?

@dilyanpalauzov
Copy link
Author

The problem happens on Ctrl+C: the xxfi_close() is not called for existing connections, the counter does not decrease and during shutdown opendkim crashes as the assert fails.

It does not matter if the increment is done is xxfi_connect() or xxfi_negotiate(), the trouble comes from skipping the decrement in xxfi_close().

@mskucherawy
Copy link
Member

I'm trying to understand how the call to mlfi_close() could get skipped. In libmilter, every thread calls xxfi_close() on shutdown if one is set, unless the thread is already in ST_QUIT state. Their documentation also says: "xxfi_close is always called once at the end of each connection."

...so it should never be skipped. We will only skip the decrement intentionally in mlfi_close() if the private per-connection status pointer is NULL, but if that's NULL then presumably the decrement already happened (or the increment never did).

Still looking.

@dilyanpalauzov
Copy link
Author

On 30 Oct 2017 I proposed per email to Sendmail to update the documentation as follows:

diff --git a/docs/xxfi_close.html b/docs/xxfi_close.html
--- a/docs/xxfi_close.html
+++ b/docs/xxfi_close.html
@@ -60,6 +60,13 @@ In particular, it is incorrect to assume the private context pointer
  will be something other than NULL in this callback.
  <LI>xxfi_close is called on close even if the previous mail
  transaction was aborted.
+<LI>xxfi_close is not necessary called after SIGHUP, SIGTERM or SIGINT are sent
+to the signal handler thread established by <A
+href="smfi_main.html">smfi_main</A>, for each call of xxfi_negotiate before
+smfi_main returns.  For this to happen the underlying select/poll calls need to exit.  
+On some UNIX the poll/select calls exit when libmilter closes the connection
+to the MTA, on Linux select/poll exit when the timeout set by <A
+href="smfi_settimeout.html">smfi_settimeout</A> expires.
  <LI>xxfi_close is responsible for freeing any resources allocated on a
  per-connection basis.
  <LI>Since the connection is already closing, the return value is

and I do not find an answer in my archives.

If I remember correctly on Ctrl+C poll/select() does not return, but waits for a timeout, and xxfi_close() is called only after poll/select() returns.

@dilyanpalauzov
Copy link
Author

Compile m1.c with -lpthread -lmilter and start it in a terminal. In another terminal start miltertest by passing this script:

conn = mt.connect("/tmp/zz", 40, 0.25)
mt.conninfo(conn, "aaa.bbb.ccc", "1.2.3.4")

The second terminal prints "miltertest: select(): timeout on fd 3", the first terminal prints "
NEGOTIATE
NEG 1
NEG 2
". If now Ctrl+C is pressed in the first terminal, CLOSE is not printed, therefore xxfi_close is not called for each xxfi_negotiate. If Ctrl+C is not pressed CLOSE is printed.

@dilyanpalauzov
Copy link
Author

I was told the documentation of libmilter will be updated as follows:

diff --git a/libmilter/docs/overview.html b/libmilter/docs/overview.html
index 78676d3..ff0a69f 100644
--- a/libmilter/docs/overview.html
+++ b/libmilter/docs/overview.html
@@ -203,11 +203,21 @@ communication with the MTA happens.
 Filters are not terminated asynchronously
 (except by signals that can't be caught).
 In the case of <TT>Abort</TT> the
-<A HREF="xxfi_abort.html">xxfi_abort</A> callback is invoked.
+<A HREF="xxfi_abort.html">xxfi_abort</A> callback is usually invoked if there is an active transaction.  However, if an invoked callback takes too long to execute (the maximum time <TT>Abort</TT> waits is currently 5s)
+<!-- XREF: MI_CHK_TIME -->
+then the filter is simply terminated, i.e.,
+neither the
+<A HREF="xxfi_abort.html">xxfi_abort</A> callback
+nor the
+<A HREF="xxfi_close.html">xxfi_close</A> callback
+is invoked.
 
 <HR size="1">
 <FONT size="-1">
-Copyright (c) 2000, 2001, 2003, 2006 Proofpoint, Inc. and its suppliers.
+Copyright (c) 2000, 2001, 2003, 2006, 2018 Proofpoint, Inc. and its suppliers.
 All rights reserved.
 <BR>
 By using this file, you agree to the terms and conditions set

@mskucherawy
Copy link
Member

Wow, that's been waiting 12 years to sting. That means we can leak state because we don't know the session referencing it went away.

Then I guess you're right, we have to stop asserting that this is a closed system.

@dilyanpalauzov
Copy link
Author

dilyanpalauzov commented Jul 24, 2019

In fact conf_refcnt should be removed from everywhere, not only from the assert.

@glts
Copy link

glts commented Jan 2, 2020

I’m running into this with the current 2.11.0-Beta2 tarball. Frequently I’m getting a core dump on shutdown:

opendkim: opendkim.c:5905: dkimf_config_free: Assertion `conf->conf_refcnt == 0' failed.
Aborted

This happens both with Ctrl-C as well as when stopped via systemd (eg systemctl restart).

@dilyanpalauzov
Copy link
Author

Yes, remove all ref_cnt and recompile, or use https://mail.aegee.org/cgit/OpenDKIM/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants