Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory leak on FreeBSD 12.1+ when communicating over vici socket. #966

Closed
mskalski opened this issue Mar 22, 2022 · 9 comments
Closed

Memory leak on FreeBSD 12.1+ when communicating over vici socket. #966

mskalski opened this issue Mar 22, 2022 · 9 comments
Labels

Comments

@mskalski
Copy link
Contributor

mskalski commented Mar 22, 2022

System (please complete the following information):

  • OS: FreeBSD 12.1, 12.2, 13.0, 13.1-BETA2
  • Kernel version (if applicable): 13.1-BETA2, the same on 12.1+
  • strongSwan version(s): latest from git (FreeBSD strongSwan U5.9.5/K13.1-BETA2)
  • Tested/confirmed with the latest version: yes

Describe the bug
On FreeBSD system (amd64, arm64) when communicating over vici socket memory leaks in terms of constantly increasing Virtual and Resident (VSS and RSS) memory of process occur, until all system memory is exhausted, when process charon is killed by kernel with message kernel: pid 903 (charon), jid 0, uid 0, was killed: failed to reclaim memory.

Any tool for memory leak detection tools (valgrind, ktrace) does not detect any memory leaks, increasing RSS is the only symptom.

Also this behaviour is not observed on Linux (tested on Ubuntu 20.04 and Debian 10 bookworm/sid).

To Reproduce
Steps to reproduce the behavior:

  1. Download any vm image with FreeBSD 12.0+ (was tested also on latest amd64 13.1-BETA2 to confirm)
  2. Configure virtual machine; for strongswan compilation give more memory, but for test 256 MB is enough.
    Run VM and disable swap (to speed-up failure)
    # swapoff /dev/gpt/swapfs
    
  3. install required packages for strongswan compilation:
    # pkg install git autoconf gperf autoconf-archive libtool m4 automake flex bison pkgconf gettext
    
  4. get strongswan: git clone https://github.com/strongswan/strongswan
  5. Compile strongswan:
    cd strongswan
    ./autogen.sh
    ./configure --disable-kernel-netlink --enable-kernel-pfroute --enable-kernel-pfkey --disable-gmp --enable-openssl --enable-mediation --disable-scripts --with-group=wheel --enable-gcm --enable-ccm --enable-pkcs11
    make -j4
    make install
    
  6. start strongswan: ipsec start
  7. run in loop any command which communicates on vici interface, swanctl --stats is enough to reproduce error:
    sh -c 'while swanctl --stats >/dev/null; do true; done'
    
    or
    sh -c 'while swanctl --stats >/dev/null; do sleep 0.1; done'
    
  8. Observe increase of VSS and RSS (Virtual and resident) memory of charon process, using e.g. top or attached memlogger.py
  9. After few hours charon should be killed by kernel due to not enough memory/swap space.

Expected behavior
Memory used by charon process should not constantly increase, and should not finally cause charon to be killed.

Logs/Backtraces

  • Attaching simple python script to monitor and write charon memory usage: memlogger.py - requires python3 and python package psutil (pkg install py38-psutil).
  • result of that command charon_mem.log
  • Increase of RSS memory used by charon from 9MB to 95MB took about 1.5 hour

Additional context
Problem occurred when monitored via vici socket state of charon daemon (tunnel definitions, SAs, etc), but it was also reproduced using simple swanctl --stats command repeated in loop.

No change in this beaviour is observed when using different configure's --with-printf-hooks= -- according to issue in pfsense: https://redmine.pfsense.org/issues/5149 this could be the reason, but tests with --with-printf-hooks=builtin, --with-printf-hooks=glibc and --with-printf-hooks=vstr did not fix the error.

@tobiasbrunner
Copy link
Member

Does it also happen if you use VICI over TCP instead of a UNIX socket?

@tobiasbrunner tobiasbrunner removed the new label Mar 22, 2022
@mskalski
Copy link
Contributor Author

mskalski commented Mar 22, 2022

Changed charon.plugins.vici.socket = tcp://0.0.0.0:4502 in /usr/local/etc/strongswan.d/charon/vici.conf and swanctl.socket = tcp://127.0.0.1:4502 in /usr/local/etc/strongswan.d/swanctl.conf, attaching
charon_mem-tcp.log and RSS memory is still increasing.

In fact I think this behaviour is because frequent memory allocation and deallocation (malloc/free functions), which is used in vici plugin. And I observed that this increase can also be caused by SA renegotiations, but that is harder to isolate. I make some tests then.

One more note: I first observed this behaviour with strongswan version FreeBSD strongSwan U5.9.2/K12.1-RELEASE-p7(which was first version I started work with, not tested earlier).

@tobiasbrunner
Copy link
Member

In fact I think this behaviour is because frequent memory allocation and deallocation (malloc/free functions), which is used in vici plugin.

That would indicate an issue in the memory allocator of FreeBSD's libc (fragmentation or something) or maybe even the kernel, not really strongSwan (in particular because it apparently isn't an issue on Linux). Did you have a look at the man page for FreeBSD's malloc()? There seem to be various options to tune it. For instance, the number of arenas or thread-specific caching (there are also some debug options).

@mskalski
Copy link
Contributor Author

mskalski commented Mar 22, 2022

Yes, It's possible, that problem is due to wrong tuning of FreeBSD memory allocator.

But on the other hand - there is no special malloc configuration for charon daemon AFAIK and other applications on FreeBSD box are not affected, which are i.e. some running python daemons (which I believe do massive allocations and use multiple threads). What is specific in a way strongswan allocates memory that RSS process memory is increasing so much?

I made similar test test on FreeBSD 9.3 VM, which is the last release before incorporating jemalloc allocator (also downloaded VM image from freebsd archives) and results are the same. I had to enable --enable-kernel-libipsec option during strongswan compilation, because ipsec module is probably not compiled into 9.3 GENERIC kernel (specific net.inet.ipsec sysctls do not exist) and charon daemon could not start without working kernel-ipsec plugin, but test was similar, I used looped swanctl --stats command.

(Attached charon_mem.log)


I also changed definition of FreeBSD 13.1 VM to use only one cpu core, but this did not change anything.

Here is result of following command on 1CPU FreeBSD13.1-BETA2:

sh -c "MALLOC_CONF='stats_print:true,narenas:1' /usr/local/libexec/ipsec/charon 2>/var/log/charon-memdump-0.log"

Attached log of periodic monitoring of used memory charon_mem-1cpu-memstats.log and stats dumped by jemalloc library: charon-jemalloc-stats.log at charon daemon exit.

Unfortunately I don't know anything about debugging of jemalloc, so these numbers are rather cryptic for me :-)

@tobiasbrunner
Copy link
Member

What is specific in a way strongswan allocates memory that RSS process memory is increasing so much?

No idea whatsoever.

I made similar test test on FreeBSD 9.3 VM, which is the last release before incorporating jemalloc allocator (also downloaded VM image from freebsd archives) and results are the same.

Unless the previous allocator behaved similarly, it might be something else then (maybe an issue with sockets).

Unfortunately I don't know anything about debugging of jemalloc, so these numbers are rather cryptic for me :-)

Don't really know either. Have you checked the jemalloc tuning guide, and the heap profiling and leak checking pages?

@mskalski
Copy link
Contributor Author

mskalski commented Mar 23, 2022

Thank you very much for your help.

Yes, I've read these pages, unfortunately heap profiling and leak checking require profiling support which is not compiled into FreeBSD's libc by default. It will require recompile, which will be next step if I cannot find solution.

I filled bug report on FreeBSD bugzilla, maybe someone from FreeBSD team would notice something obvious I cannot see: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=262743
I will add comments here when I will know more.

I also remade again test with valgrind just to be sure, attaching log - no real leaks, but showed some warnings which can be interesting.

The same test, using swanctl --stats in a loop, (test last about 30 seconds, as valgrind requires much more memory):

root@freebsd13-1:~ # sh -c 'valgrind -s --leak-check=full --show-leak-kinds=all /usr/local/libexec/ipsec/charon 2> /var/log/charon-valgind-full.log'
00[DMN] Starting IKE charon daemon (strongSwan 5.9.5, FreeBSD 13.1-BETA2, amd64)
00[CFG] PKCS11 module '<name>' lacks library path
00[KNL] unable to set IPSEC_POLICY on socket: Protocol not available
00[NET] installing IKE bypass policy failed
00[KNL] unable to set IPSEC_POLICY on socket: Protocol not available
00[NET] installing IKE bypass policy failed
00[KNL] unable to set UDP_ENCAP: Invalid argument
00[NET] enabling UDP decapsulation for IPv6 on port 4500 failed
00[KNL] unable to set IPSEC_POLICY on socket: Protocol not available
00[NET] installing IKE bypass policy failed
00[KNL] unable to set IPSEC_POLICY on socket: Protocol not available
00[NET] installing IKE bypass policy failed
00[KNL] unable to set UDP_ENCAP: Protocol not available
00[NET] enabling UDP decapsulation for IPv4 on port 4500 failed
00[CFG] loading ca certificates from '/usr/local/etc/ipsec.d/cacerts'
00[CFG] loading aa certificates from '/usr/local/etc/ipsec.d/aacerts'
00[CFG] loading ocsp signer certificates from '/usr/local/etc/ipsec.d/ocspcerts'
00[CFG] loading attribute certificates from '/usr/local/etc/ipsec.d/acerts'
00[CFG] loading crls from '/usr/local/etc/ipsec.d/crls'
00[CFG] loading secrets from '/usr/local/etc/ipsec.secrets'
00[LIB] loaded plugins: charon pkcs11 aes des rc2 sha2 sha1 md5 random nonce x509 revocation constraints pubkey pkcs1 pkcs7 pkcs8 pkcs12 pgp dnskey sshkey pem openssl fips-prf curve25519 xcbc cmac hmac ccm gcm drbg attr kernel-pfkey kernel-pfroute resolve socket-default stroke vici updown xauth-generic counters
00[DMN] removing pidfile '/var/run/charon.pid', process not running
00[JOB] spawning 16 worker threads
^C00[DMN] SIGINT received, shutting down

And valgrind's log: charon-valgrind-full.log

@tobiasbrunner
Copy link
Member

While not fixing the issue, setting tcache:false in MALLOC_CONF seems to reduce the rate of the increase (to less than 20% of the default rate, at least in my quick tests). We constantly allocate memory in one thread and free it in another, which might be an issue for the allocator.

@mskalski
Copy link
Contributor Author

mskalski commented Aug 8, 2022

Thank you again for all the help.

No change in this beaviour is observed when using different configure's --with-printf-hooks= -- according to issue in pfsense: https://redmine.pfsense.org/issues/5149 this could be the reason, but tests with --with-printf-hooks=builtin, --with-printf-hooks=glibc and --with-printf-hooks=vstr did not fix the error.

I was wrong - the --with-printf-hooks=builtin seems to resolve the problem. I must have missed that fact during previous tests.

I made long-lasting test (1.5 week) with more aggresive stressing and it looks that just few hours after start, memory footprint increases (not as much as for libc printf-hook), but when system needs more memory, unused one is released.

Note graph below - every day at 3:01 AM starts some cron job which checks the whole system, downloads some CVE reports, which is rather memory consuming task - on the graph you can see drops of used RSS memory at that time:
charon_mem

Attaching also script for tracking (sh-only based) and log of RSS memory usage for charon compiled with the --with-printf-hooks=builtin configure option.

memlogger.sh
charon_mem.log.gz

It looks like this option resolves my problem. Note, that strongswan FreeBSD package is by default compiled with this option, and when manual building from ports, this option can be changed (although it is also by default set to builtin).

Maybe this note is worth of adding to the FreeBSD installation documentation?
Or just maybe --with-printf-hooks=builtin configure option should be default for the FreeBSD OS?

I think this issue can be closed.

@tobiasbrunner
Copy link
Member

Thanks for the update. I've added the option and a note to the docs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants