[QA] Modulesd is using too much memory on dev branch 4.1.0 #6997

Closed
spothound opened this issue Dec 21, 2020 · 12 comments
Assignees: chemamartinez
Labels: module/modulesd · reporter/cicd · type/bug

@spothound
Contributor

Wazuh version | Component | Install type | Install method | Platform
v4.1.0-40102  | Modulesd  | Manager      | Sources        | CentOS

Hello team,

  • Do you want to request a feature or report a bug?

    • bug report
  • What is the current behavior?
    We've detected a memory leak in modulesd while testing the development tag v4.1.0-alpha1 with a stress test.

Results of the tests

Plots: monitor-manager-Test_stress_B1167_centos_0-pre-release_RSS, monitor-manager-Test_stress_B1167_centos_0-pre-release_Write_Ops

We've also run these tests with Valgrind and confirmed the memory leak, which appears to be related to SCA.

==7338== LEAK SUMMARY:
==7338==    definitely lost: 333,431 bytes in 81 blocks
==7338==    indirectly lost: 584,284 bytes in 11,300 blocks
==7338==      possibly lost: 6,474,447 bytes in 41 blocks
==7338==    still reachable: 694,835 bytes in 12,846 blocks
==7338==         suppressed: 0 bytes in 0 blocks
==7338== Reachable blocks (those to which a pointer was found) are not shown.
==7338== To see them, rerun with: --leak-check=full --show-leak-kinds=all


==7338== 552,688 (64 direct, 552,624 indirect) bytes in 1 blocks are definitely lost in loss record 288 of 289
==7338==    at 0x4C29EA3: malloc (vg_replace_malloc.c:309)
==7338==    by 0x534107D: cJSON_New_Item (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x53410AF: cJSON_CreateObject (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x5438B8: yaml2json_node (yaml2json.c:110)
==7338==    by 0x543603: yaml2json (yaml2json.c:70)
==7338==    by 0x4AC5AD: wm_sca_read_files (wm_sca.c:360)
==7338==    by 0x4AC2D5: wm_sca_start (wm_sca.c:314)
==7338==    by 0x4ABC2F: wm_sca_main (wm_sca.c:235)
==7338==    by 0x5E55E64: start_thread (in /usr/lib64/libpthread-2.17.so)
==7338==    by 0x616888C: clone (in /usr/lib64/libc-2.17.so)

==7338== 256,078 (255,712 direct, 366 indirect) bytes in 61 blocks are definitely lost in loss record 287 of 289
==7338==    at 0x4C29EA3: malloc (vg_replace_malloc.c:309)
==7338==    by 0x579C4B7: __os_malloc (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x576B140: __env_alloc (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x578C52A: __memp_alloc (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x579043F: __memp_fget (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x5755A15: __db_goff (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x5763F56: __db_ret (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x573A30B: __dbc_iget (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x574CEFB: __dbc_get_pp (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x530004: sys_rpm_packages (syscollector_linux.c:424)
==7338==    by 0x52F094: sys_packages_linux (syscollector_linux.c:350)
==7338==    by 0x52C4D2: wm_sys_main (syscollector_common.c:123)
==7338==    by 0x5E55E64: start_thread (in /usr/lib64/libpthread-2.17.so)
==7338==    by 0x616888C: clone (in /usr/lib64/libc-2.17.so)

==7338== 65,537 bytes in 1 blocks are definitely lost in loss record 280 of 289
==7338==    at 0x4C2BFB9: calloc (vg_replace_malloc.c:762)
==7338==    by 0x4B9794: wm_sca_request_thread (wm_sca.c:2969)
==7338==    by 0x5E55E64: start_thread (in /usr/lib64/libpthread-2.17.so)
==7338==    by 0x616888C: clone (in /usr/lib64/libc-2.17.so)

==7338== 19,290 (2,280 direct, 17,010 indirect) bytes in 1 blocks are definitely lost in loss record 271 of 289
==7338==    at 0x4C29EA3: malloc (vg_replace_malloc.c:309)
==7338==    by 0x579C4B7: __os_malloc (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x579C607: __os_calloc (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x5771AB5: db_env_create (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x5752DC3: __db_create_internal (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x57536AB: db_create (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x52F332: sys_rpm_packages (syscollector_linux.c:396)
==7338==    by 0x52F094: sys_packages_linux (syscollector_linux.c:350)
==7338==    by 0x52C4D2: wm_sys_main (syscollector_common.c:123)
==7338==    by 0x5E55E64: start_thread (in /usr/lib64/libpthread-2.17.so)
==7338==    by 0x616888C: clone (in /usr/lib64/libc-2.17.so)

==7338== 4,192 bytes in 1 blocks are definitely lost in loss record 254 of 289
==7338==    at 0x4C29EA3: malloc (vg_replace_malloc.c:309)
==7338==    by 0x579C4B7: __os_malloc (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x576B140: __env_alloc (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x578C52A: __memp_alloc (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x579043F: __memp_fget (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x571F34A: __ham_get_cpage (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x572632B: __ham_item_next (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x5717E9F: __hamc_get (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x5739E17: __dbc_iget (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x574CEFB: __dbc_get_pp (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x530004: sys_rpm_packages (syscollector_linux.c:424)
==7338==    by 0x52F094: sys_packages_linux (syscollector_linux.c:350)
==7338==    by 0x52C4D2: wm_sys_main (syscollector_common.c:123)
==7338==    by 0x5E55E64: start_thread (in /usr/lib64/libpthread-2.17.so)
==7338==    by 0x616888C: clone (in /usr/lib64/libc-2.17.so)

==7338== 4,192 bytes in 1 blocks are definitely lost in loss record 253 of 289
==7338==    at 0x4C29EA3: malloc (vg_replace_malloc.c:309)
==7338==    by 0x579C4B7: __os_malloc (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x576B140: __env_alloc (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x578C52A: __memp_alloc (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x579043F: __memp_fget (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x571D5F8: __ham_get_meta (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x571EC31: __ham_open (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x5754B23: __db_open (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x574A23D: __db_open_pp (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x52F473: sys_rpm_packages (syscollector_linux.c:407)
==7338==    by 0x52F094: sys_packages_linux (syscollector_linux.c:350)
==7338==    by 0x52C4D2: wm_sys_main (syscollector_common.c:123)
==7338==    by 0x5E55E64: start_thread (in /usr/lib64/libpthread-2.17.so)
==7338==    by 0x616888C: clone (in /usr/lib64/libc-2.17.so)
==7338== 

==7338== 1,054 (64 direct, 990 indirect) bytes in 1 blocks are definitely lost in loss record 242 of 289
==7338==    at 0x4C29EA3: malloc (vg_replace_malloc.c:309)
==7338==    by 0x534107D: cJSON_New_Item (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x53410AF: cJSON_CreateObject (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x4A18CB: Read_Log (wm_osquery_monitor.c:124)
==7338==    by 0x5E55E64: start_thread (in /usr/lib64/libpthread-2.17.so)
==7338==    by 0x616888C: clone (in /usr/lib64/libc-2.17.so)
==7338== 
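
For context, the first record above points at the cJSON tree built while converting a YAML policy. A minimal, hypothetical sketch of that pattern (illustrative names, not the actual wm_sca.c code): the tree produced by the conversion has to be released with cJSON_Delete on every scan iteration, otherwise each pass leaks the whole document.

#include <cjson/cJSON.h>   /* header path may differ; Wazuh bundles its own cJSON */

static void scan_policy_sketch(void) {
    /* Stand-in for the cJSON tree that the YAML-to-JSON conversion returns. */
    cJSON *policy = cJSON_CreateObject();
    cJSON_AddStringToObject(policy, "policy", "cis_centos7_linux");

    /* ... evaluate the policy checks here ... */

    /* Without this call at the end of every iteration, the whole tree
     * (every nested object and string) is leaked on each pass. */
    cJSON_Delete(policy);
}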

@vikman90 vikman90 added type/bug Something isn't working module/modulesd Issues related to the Modules daemon labels Dec 21, 2020
@spothound
Contributor Author

We've detected that these "leak" logs recur and seem to be produced in a loop. In this release, the changes to the SCA module are minimal. The problem may be that we've changed the way threads and processes are executed, and we're now executing problematic code in a loop. We're investigating this.

Also, we've detected another, bigger leak with Valgrind:

==23200== 4,409,345 bytes in 1 blocks are definitely lost in loss record 125 of 125
==23200==    at 0x483DFAF: realloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==23200==    by 0x11A72F: reader (wm_exec.c:571)
==23200==    by 0x4FD5608: start_thread (pthread_create.c:477)
==23200==    by 0x5111292: clone (clone.S:95)
==23200==
==23200== LEAK SUMMARY:
==23200==    definitely lost: 4,476,519 bytes in 46 blocks
==23200==    indirectly lost: 563,859 bytes in 11,334 blocks
==23200==      possibly lost: 0 bytes in 0 blocks
==23200==    still reachable: 675,681 bytes in 12,684 blocks
==23200==         suppressed: 0 bytes in 0 blocks
==23200== Reachable blocks (those to which a pointer was found) are not shown.
==23200== To see them, rerun with: --leak-check=full --show-leak-kinds=all

That could explain the overall memory increase.
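
For reference, the reader frame above belongs to the thread that grows the command-output buffer with realloc. A minimal, hypothetical sketch of that pattern (illustrative code, not the actual wm_exec.c): if neither the reader thread nor its consumer ever frees the grown buffer, the whole output is reported as definitely lost.

#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical reader thread: accumulates a command's stdout into a buffer
 * grown with realloc. */
static void *reader_sketch(void *arg) {
    int fd = *(int *)arg;
    char chunk[4096];
    char *output = NULL;
    size_t length = 0;
    ssize_t n;

    while ((n = read(fd, chunk, sizeof(chunk))) > 0) {
        char *tmp = realloc(output, length + (size_t)n + 1);
        if (!tmp)
            break;
        output = tmp;
        memcpy(output + length, chunk, (size_t)n);
        length += (size_t)n;
        output[length] = '\0';
    }

    /* If this free (or an equivalent one in the consumer) is missing, the
     * whole accumulated output shows up as "definitely lost", as above. */
    free(output);
    return NULL;
}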

@spothound
Contributor Author

spothound commented Dec 23, 2020

SCA test 7 hours

Plot: monitor-manager-Test_stress_B1201_centos_0-pre-release_RSS
There is some kind of small memory leak related to SCA, although this module doesn't seem to have changed in the new version.

@spothound
Contributor Author

spothound commented Dec 24, 2020

Additional testing

When the scan interval is increased:

Plot: monitor-manager-Test_stress_B1206_centos_0-pre-release_RSS

The leak persists, but memory grows more slowly.

@chemamartinez chemamartinez self-assigned this Dec 24, 2020
@spothound
Contributor Author

spothound commented Dec 24, 2020

Hello team,

Chema has created a new branch to solve this issue, 6997-sca-memleak, and we're testing it:

Packages: https://ci.wazuh.info/view/Packages/job/Packages_builder_tier/1012/parameters/
Stress test:

Results:

40 min

Plot: monitor-manager-Test_stress_B1215_centos_0-warehouse-test_RSS
It seems that the memory leak persists!

@spothound
Contributor Author

We've also detected a weird behaviour in analysisd while testing this leak:

Plot: monitor-manager-Test_stress_B1208_centos_0-pre-release_RSS

@spothound
Contributor Author

Fix branch results

8 Hours

Plots: plots_RSS, plots_VMS

Memory leak persists.

Fix with just SCA enabled (and stressed)

Plots: monitor-manager-Test_stress_B1226_centos_0-warehouse-test_RSS, monitor-manager-Test_stress_B1226_centos_0-warehouse-test_VMS

@chemamartinez
Contributor

Hi team,

After a deeper investigation, it seems the test is reporting the memory leak produced by the child process when a command fails, due to the use of execvp, as explained in the Rationale of #7028. The different tests run today are:

Running SCA on 4.1.0 (agent) with a policy where some commands fail

(plot attached)

  • A memory leak is clearly visible here.
  • As this test is run on an agent, we can rule out the possibility that other modules added in 4.1.0 are interfering with the results.

Running SCA on 4.0.3 with a policy where some commands fail

(plot attached)

  • The premise here is that 4.0.3 doesn't contain any known leak.
  • This test was run with an interval for the SCA scan of 30 seconds. Others were run with an interval of one second.
  • We can see that the curve still grows, although with a gentler slope due to the longer interval.
  • The conclusion of this test is that the failure of some commands makes the test report a memory leak in 4.0.3 as well.

Running SCA on 4.1.0 with a policy from 4.0.3 (with no commands failing)

(plot attached)

  • We can see the memory leak disappears under these conditions. Therefore, it seems the memory leak was produced by the child process when the commands fail.

Running the wodle command with SCA disabled

  • This test verifies that the RSS measurement includes the memory of the child process forked when a command is executed. SCA was disabled and the following configuration was added:
<wodle name="command">
  <disabled>no</disabled>
  <tag>test-memleak</tag>
  <command>fake-command</command>
  <interval>2s</interval>
  <ignore_output>no</ignore_output>
  <run_on_start>yes</run_on_start>
  <timeout>30</timeout>
  <skip_verification>yes</skip_verification>
</wodle>

The results are:

(plot attached)

  • We can see how the memory leak appears again.

Final conclusions

The new SCA policies add the following commands to several checks in the Linux policies:

# nft list ruleset
# nft list tables

However, the nft command is not available by default on all Linux distributions, making the command fail:

# nft list tables

Command 'nft' not found, but can be installed with:

apt install nftables

This can make the tests report a memory leak, since the RSS measurement seems to include the memory of the child process that runs the command. In that case, there is no actual memory leak, since the memory is reclaimed by the OS once the child process dies.
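
As a rough illustration of that mechanism (a minimal sketch, not the actual Wazuh command execution code): the forked child shares the parent's memory copy-on-write until it exits, so an RSS sample taken while children are being spawned in a loop can count that memory, even though the kernel reclaims all of it as soon as the child exits.

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static int run_command_sketch(char *const argv[]) {
    pid_t pid = fork();

    if (pid == 0) {
        execvp(argv[0], argv);
        perror("execvp");   /* only reached when the command is missing or cannot run */
        _exit(127);         /* everything the child held goes back to the OS here */
    } else if (pid > 0) {
        int status = 0;
        waitpid(pid, &status, 0);
        return status;
    }

    return -1;
}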

The solution for this could be to find a workaround for the nft command, as it is not installed by default on many Linux distributions.

In addition, #7028 attenuates the reported usage when SCA is evaluated, since the 64 KB dynamic buffer was also being interpreted as a memory leak.
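
To illustrate why a 64 KB buffer can be flagged that way (a generic, hypothetical sketch, not the actual change made in #7028): a per-request scratch buffer that is freed on only one exit path is reported by Valgrind as definitely lost for every other path, which matches the 65,537-byte record in the first Valgrind summary.

#include <stdlib.h>
#include <string.h>

#define REQUEST_BUFFER_SIZE 65536

static int handle_request_sketch(const char *request) {
    char *buffer = calloc(1, REQUEST_BUFFER_SIZE + 1);

    if (!buffer)
        return -1;

    if (!request || request[0] == '\0') {
        free(buffer);   /* dropping this free turns the early return into a ~64 KiB leak */
        return -1;
    }

    strncpy(buffer, request, REQUEST_BUFFER_SIZE);
    /* ... dispatch the request ... */

    free(buffer);
    return 0;
}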

@spothound spothound changed the title [QA] Memory leak on dev branch: v4.1.0-alpha1 [QA] Memory leak on dev branch 4.1.0 Dec 31, 2020
@spothound
Contributor Author

spothound commented Jan 18, 2021

Extra tests:

Without default policies (just the QA ones)

Agent: (plot attached)
Manager: (plot attached)

Without QA policies (just the 4.1 ones)

Agent: (plot attached)

Manager: (plot attached)

@spothound
Contributor Author

All modules enabled except SCA

Manager: (plot attached)

@spothound
Contributor Author

Modulesd memory usage research for 4.1

To explain the weird memory behaviour in Wazuh 4.1, I've run the following manual stress tests.

Test 1: Missing commands

(plot attached)

One possible explanation is that these are false-positive memory leaks caused by the execution of missing commands. To rule this option out, one test was launched after installing some of the commands that were missing by default in the original stress tests. The behaviour is quite similar to the previous one.

We've installed iw, lsof, autofs, and nftables.

Test 2: New modules on modulesd

(plot attached)

We've checked that the leak is not related to the new modules in modulesd by running tests with SCA, Syscollector, and all the other old modules disabled.
(Modulesd is one of the bottom blue lines.)
After that, I enabled SCA in this check and the memory started to grow!

(plot attached)

So the new modules probably don't affect the memory usage; it seems that the problem lies in SCA and the new 4.1 policies.

This is an example of 4.1 running with the 4.0 policies; the memory usage of modulesd looks quite good:
(plot attached)

(Modulesd is the purple line at the bottom.)
If we run just SCA with the new CentOS policies:
(plot attached)

the memory keeps growing.

Test 3: Force missing commands

In these tests, I modified the scan policies to add a lot of checks with missing commands. The result, after adding those modified checks to the policy and restarting the manager, was:
(plot attached)

(still growing)
After a few hours:
(plot attached)

So, I strongly recommend reviewing how SCA handles the memory used to execute commands.

@vikman90
Member

vikman90 commented Jan 22, 2021

Hi team,

After an exhaustive analysis of the memory usage of Modulesd, we concluded that it is caused by a memory fragmentation issue in the SCA scan, when running the CIS check "Ensure XD/NX support is enabled".

Affected installation types:

  • Manager
  • Agent

Affected platforms:

  • RHEL 6/7
  • CentOS 6/7
  • Debian 7-10

Rationale

That check runs journalctl, which produces a potentially large output. The command execution library in Modulesd has a 64 MiB limit for output data; for instance, on the testing host we had 45 MiB of data. The memory is allocated and freed according to standard memory management. However, such memory is not guaranteed to be returned to the system.

We observed that this behavior is prone to occur when multiple modules are running, such as SCA and Syscollector. Therefore, although the memory usage is much higher on the manager than on the agent, there is no guarantee that it cannot happen on the agent.
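
A minimal, self-contained illustration of that last point (glibc-specific behavior, not Wazuh code): when a large amount of freed heap space sits below a smaller allocation that is still alive, the heap cannot shrink past it, so RSS stays high even though the memory is available for reuse by the process.

#include <malloc.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK  (64 * 1024)   /* below glibc's mmap threshold, so served from the heap */
#define CHUNKS 720           /* ~45 MiB in total, like the journalctl output mentioned above */

int main(void) {
    char *blocks[CHUNKS];

    /* Grow the heap with many medium-sized allocations, as a buffered command
     * output would. */
    for (int i = 0; i < CHUNKS; i++) {
        blocks[i] = malloc(CHUNK);
        if (blocks[i])
            memset(blocks[i], 'x', CHUNK);
    }

    /* A small allocation that outlives the buffer, near the top of the heap. */
    char *pinned = strdup("long-lived allocation above the freed space");

    /* Free the buffer: the space becomes reusable by this process, but the
     * heap cannot be trimmed past the pinned block, so RSS does not drop. */
    for (int i = 0; i < CHUNKS; i++)
        free(blocks[i]);

    malloc_trim(0);   /* glibc extension; may return part of the free space to the kernel */

    free(pinned);
    return 0;
}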

Action items

  • Disable the related SCA check, until we implement a solution in SCA.
  • Fix wm_strtok so that we can group arguments through double quotes in commands, for instance:
- 'c:sh -c "journalctl | grep protection" -> r:^kernel:\s+NX \(Execute Disable\) protection: active'
  • Let SCA consume the commands' output line by line instead of the full data chunk (see the sketch after this list).
  • Investigate how to reduce memory fragmentation. However, this won't prevent the memory usage peak.
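
A minimal sketch of the line-by-line idea referenced above, assuming a popen-style interface (the real wm_exec API differs): the rule is matched against each line as it arrives, so the working set stays at one line instead of a buffer of up to 64 MiB.

#include <stdio.h>

/* Hypothetical sketch: evaluate an SCA-style rule against a command's output
 * one line at a time. match_line() stands in for the real rule-matching logic. */
static int check_command_output(const char *command,
                                int (*match_line)(const char *line)) {
    FILE *out = popen(command, "r");
    char line[4096];
    int found = 0;

    if (!out)
        return -1;

    while (fgets(line, sizeof(line), out)) {
        if (match_line(line)) {
            found = 1;
            break;   /* stop as soon as the rule is satisfied */
        }
    }

    pclose(out);
    return found;
}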

@vikman90 vikman90 changed the title [QA] Memory leak on dev branch 4.1.0 [QA] Modulesd is using too much memory on dev branch 4.1.0 Jan 24, 2021
@vikman90 vikman90 removed this from the Triage bugs detected in 4.1.0-rc4 milestone Jan 25, 2021
@vikman90 vikman90 linked a pull request Jan 28, 2021 that will close this issue
@vikman90
Member

Finally, wm_strtok will be fixed as part of #3340.
