[QA] Modulesd is using too much memory on dev branch 4.1.0 #6997

Closed
spothound opened this issue Dec 21, 2020 · 12 comments
Assignees: chemamartinez
Labels: module/modulesd · reporter/cicd · type/bug

@spothound
Contributor

Wazuh version | Component | Install type | Install method | Platform
v4.1.0-40102  | Modulesd  | Manager      | Sources        | CentOS

Hello team,

  • Do you want to request a feature or report a bug?

    • bug report
  • What is the current behavior?
    We've detected a memory leak in modulesd while testing the development tag v4.1.0-alpha1 with a stress test.

Results of the tests

Plots: monitor-manager-Test_stress_B1167_centos_0-pre-release_RSS, monitor-manager-Test_stress_B1167_centos_0-pre-release_Write_Ops

We've also run these tests with Valgrind and confirmed the memory leak, which appears to be related to SCA.

==7338== LEAK SUMMARY:
==7338==    definitely lost: 333,431 bytes in 81 blocks
==7338==    indirectly lost: 584,284 bytes in 11,300 blocks
==7338==      possibly lost: 6,474,447 bytes in 41 blocks
==7338==    still reachable: 694,835 bytes in 12,846 blocks
==7338==         suppressed: 0 bytes in 0 blocks
==7338== Reachable blocks (those to which a pointer was found) are not shown.
==7338== To see them, rerun with: --leak-check=full --show-leak-kinds=all


==7338== 552,688 (64 direct, 552,624 indirect) bytes in 1 blocks are definitely lost in loss record 288 of 289
==7338==    at 0x4C29EA3: malloc (vg_replace_malloc.c:309)
==7338==    by 0x534107D: cJSON_New_Item (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x53410AF: cJSON_CreateObject (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x5438B8: yaml2json_node (yaml2json.c:110)
==7338==    by 0x543603: yaml2json (yaml2json.c:70)
==7338==    by 0x4AC5AD: wm_sca_read_files (wm_sca.c:360)
==7338==    by 0x4AC2D5: wm_sca_start (wm_sca.c:314)
==7338==    by 0x4ABC2F: wm_sca_main (wm_sca.c:235)
==7338==    by 0x5E55E64: start_thread (in /usr/lib64/libpthread-2.17.so)
==7338==    by 0x616888C: clone (in /usr/lib64/libc-2.17.so)

==7338== 256,078 (255,712 direct, 366 indirect) bytes in 61 blocks are definitely lost in loss record 287 of 289
==7338==    at 0x4C29EA3: malloc (vg_replace_malloc.c:309)
==7338==    by 0x579C4B7: __os_malloc (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x576B140: __env_alloc (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x578C52A: __memp_alloc (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x579043F: __memp_fget (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x5755A15: __db_goff (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x5763F56: __db_ret (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x573A30B: __dbc_iget (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x574CEFB: __dbc_get_pp (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x530004: sys_rpm_packages (syscollector_linux.c:424)
==7338==    by 0x52F094: sys_packages_linux (syscollector_linux.c:350)
==7338==    by 0x52C4D2: wm_sys_main (syscollector_common.c:123)
==7338==    by 0x5E55E64: start_thread (in /usr/lib64/libpthread-2.17.so)
==7338==    by 0x616888C: clone (in /usr/lib64/libc-2.17.so)

==7338== 65,537 bytes in 1 blocks are definitely lost in loss record 280 of 289
==7338==    at 0x4C2BFB9: calloc (vg_replace_malloc.c:762)
==7338==    by 0x4B9794: wm_sca_request_thread (wm_sca.c:2969)
==7338==    by 0x5E55E64: start_thread (in /usr/lib64/libpthread-2.17.so)
==7338==    by 0x616888C: clone (in /usr/lib64/libc-2.17.so)

==7338== 19,290 (2,280 direct, 17,010 indirect) bytes in 1 blocks are definitely lost in loss record 271 of 289
==7338==    at 0x4C29EA3: malloc (vg_replace_malloc.c:309)
==7338==    by 0x579C4B7: __os_malloc (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x579C607: __os_calloc (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x5771AB5: db_env_create (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x5752DC3: __db_create_internal (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x57536AB: db_create (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x52F332: sys_rpm_packages (syscollector_linux.c:396)
==7338==    by 0x52F094: sys_packages_linux (syscollector_linux.c:350)
==7338==    by 0x52C4D2: wm_sys_main (syscollector_common.c:123)
==7338==    by 0x5E55E64: start_thread (in /usr/lib64/libpthread-2.17.so)
==7338==    by 0x616888C: clone (in /usr/lib64/libc-2.17.so)

==7338== 4,192 bytes in 1 blocks are definitely lost in loss record 254 of 289
==7338==    at 0x4C29EA3: malloc (vg_replace_malloc.c:309)
==7338==    by 0x579C4B7: __os_malloc (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x576B140: __env_alloc (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x578C52A: __memp_alloc (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x579043F: __memp_fget (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x571F34A: __ham_get_cpage (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x572632B: __ham_item_next (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x5717E9F: __hamc_get (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x5739E17: __dbc_iget (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x574CEFB: __dbc_get_pp (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x530004: sys_rpm_packages (syscollector_linux.c:424)
==7338==    by 0x52F094: sys_packages_linux (syscollector_linux.c:350)
==7338==    by 0x52C4D2: wm_sys_main (syscollector_common.c:123)
==7338==    by 0x5E55E64: start_thread (in /usr/lib64/libpthread-2.17.so)
==7338==    by 0x616888C: clone (in /usr/lib64/libc-2.17.so)

==7338== 4,192 bytes in 1 blocks are definitely lost in loss record 253 of 289
==7338==    at 0x4C29EA3: malloc (vg_replace_malloc.c:309)
==7338==    by 0x579C4B7: __os_malloc (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x576B140: __env_alloc (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x578C52A: __memp_alloc (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x579043F: __memp_fget (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x571D5F8: __ham_get_meta (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x571EC31: __ham_open (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x5754B23: __db_open (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x574A23D: __db_open_pp (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x52F473: sys_rpm_packages (syscollector_linux.c:407)
==7338==    by 0x52F094: sys_packages_linux (syscollector_linux.c:350)
==7338==    by 0x52C4D2: wm_sys_main (syscollector_common.c:123)
==7338==    by 0x5E55E64: start_thread (in /usr/lib64/libpthread-2.17.so)
==7338==    by 0x616888C: clone (in /usr/lib64/libc-2.17.so)
==7338== 

==7338== 1,054 (64 direct, 990 indirect) bytes in 1 blocks are definitely lost in loss record 242 of 289
==7338==    at 0x4C29EA3: malloc (vg_replace_malloc.c:309)
==7338==    by 0x534107D: cJSON_New_Item (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x53410AF: cJSON_CreateObject (in /var/ossec/lib/libwazuhext.so)
==7338==    by 0x4A18CB: Read_Log (wm_osquery_monitor.c:124)
==7338==    by 0x5E55E64: start_thread (in /usr/lib64/libpthread-2.17.so)
==7338==    by 0x616888C: clone (in /usr/lib64/libc-2.17.so)
==7338== 
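
For context, the first record above points at the cJSON tree built while converting a YAML policy. A minimal, hypothetical sketch of that pattern (illustrative names, not the actual wm_sca.c code): the tree produced by the conversion has to be released with cJSON_Delete on every scan iteration, otherwise each pass leaks the whole document.

#include <cjson/cJSON.h>   /* header path may differ; Wazuh bundles its own cJSON */

static void scan_policy_sketch(void) {
    /* Stand-in for the cJSON tree that the YAML-to-JSON conversion returns. */
    cJSON *policy = cJSON_CreateObject();
    cJSON_AddStringToObject(policy, "policy", "cis_centos7_linux");

    /* ... evaluate the policy checks here ... */

    /* Without this call at the end of every iteration, the whole tree
     * (every nested object and string) is leaked on each pass. */
    cJSON_Delete(policy);
}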

@vikman90 vikman90 added type/bug Something isn't working module/modulesd Issues related to the Modules daemon labels Dec 21, 2020
@spothound
Contributor Author

We've detected that these "leak" logs recur and seem to be produced in a loop. In this release, the changes to the SCA module are minimal. The problem may be that we've changed the way threads and processes are executed, and we're now executing problematic code in a loop. We're investigating this.

Also, we've detected another, bigger leak with Valgrind:

==23200== 4,409,345 bytes in 1 blocks are definitely lost in loss record 125 of 125
==23200==    at 0x483DFAF: realloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==23200==    by 0x11A72F: reader (wm_exec.c:571)
==23200==    by 0x4FD5608: start_thread (pthread_create.c:477)
==23200==    by 0x5111292: clone (clone.S:95)
==23200==
==23200== LEAK SUMMARY:
==23200==    definitely lost: 4,476,519 bytes in 46 blocks
==23200==    indirectly lost: 563,859 bytes in 11,334 blocks
==23200==      possibly lost: 0 bytes in 0 blocks
==23200==    still reachable: 675,681 bytes in 12,684 blocks
==23200==         suppressed: 0 bytes in 0 blocks
==23200== Reachable blocks (those to which a pointer was found) are not shown.
==23200== To see them, rerun with: --leak-check=full --show-leak-kinds=all

That could explain the overall memory increase.
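
For reference, the reader frame above belongs to the thread that grows the command-output buffer with realloc. A minimal, hypothetical sketch of that pattern (illustrative code, not the actual wm_exec.c): if neither the reader thread nor its consumer ever frees the grown buffer, the whole output is reported as definitely lost.

#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical reader thread: accumulates a command's stdout into a buffer
 * grown with realloc. */
static void *reader_sketch(void *arg) {
    int fd = *(int *)arg;
    char chunk[4096];
    char *output = NULL;
    size_t length = 0;
    ssize_t n;

    while ((n = read(fd, chunk, sizeof(chunk))) > 0) {
        char *tmp = realloc(output, length + (size_t)n + 1);
        if (!tmp)
            break;
        output = tmp;
        memcpy(output + length, chunk, (size_t)n);
        length += (size_t)n;
        output[length] = '\0';
    }

    /* If this free (or an equivalent one in the consumer) is missing, the
     * whole accumulated output shows up as "definitely lost", as above. */
    free(output);
    return NULL;
}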

@spothound
Contributor Author

spothound commented Dec 23, 2020

SCA test 7 hours

Plot: monitor-manager-Test_stress_B1201_centos_0-pre-release_RSS
There is some kind of small memory leak related to SCA, although this module doesn't seem to have changed in the new version.

@spothound
Contributor Author

spothound commented Dec 24, 2020

Additional testing

When the scan interval is increased:

Plot: monitor-manager-Test_stress_B1206_centos_0-pre-release_RSS

The leak persists, but memory grows more slowly.

@chemamartinez chemamartinez self-assigned this Dec 24, 2020
@spothound
Contributor Author

spothound commented Dec 24, 2020

Hello team,

Chema has created a new branch to solve this issue, 6997-sca-memleak, and we're testing it:

Packages: https://ci.wazuh.info/view/Packages/job/Packages_builder_tier/1012/parameters/
Stress test:

Results:

40 min

Plot: monitor-manager-Test_stress_B1215_centos_0-warehouse-test_RSS
It seems that the memory leak persists!

@spothound
Contributor Author

We've also detected a weird behaviour in analysisd while testing this leak:

Plot: monitor-manager-Test_stress_B1208_centos_0-pre-release_RSS

@spothound
Contributor Author

Fix branch results

8 Hours

Plots: plots_RSS, plots_VMS

Memory leak persists.

Fix with just SCA enabled (and stressed)

Plots: monitor-manager-Test_stress_B1226_centos_0-warehouse-test_RSS, monitor-manager-Test_stress_B1226_centos_0-warehouse-test_VMS

@chemamartinez
Contributor

Hi team,

After a deeper investigation, it seems the test is reporting the memory leak produced by the child process when a command fails, due to the use of execvp, as explained in the Rationale of #7028. The different tests run today are:

Running SCA on 4.1.0 (agent) with a policy where some commands fail

(plot attached)

  • A memory leak is clearly visible here.
  • As this test is run on an agent, we can rule out the possibility that other modules added in 4.1.0 are interfering with the results.

Running SCA on 4.0.3 with a policy where some commands fail

(plot attached)

  • The premise here is that 4.0.3 doesn't contain any known leak.
  • This test was run with an interval for the SCA scan of 30 seconds. Others were run with an interval of one second.
  • We can see that the curve still grows, although with a gentler slope due to the longer interval.
  • The conclusion of this test is that the failure of some commands makes the test report a memory leak in 4.0.3 as well.

Running SCA on 4.1.0 with a policy from 4.0.3 (with no commands failing)

(plot attached)

  • We can see the memory leak disappears under these conditions. Therefore, it seems the memory leak was produced by the child process when the commands fail.

Running the wodle command with SCA disabled

  • This test verifies that the RSS measurement includes the memory of the child process forked when a command is executed. SCA was disabled and the following configuration was added:
<wodle name="command">
  <disabled>no</disabled>
  <tag>test-memleak</tag>
  <command>fake-command</command>
  <interval>2s</interval>
  <ignore_output>no</ignore_output>
  <run_on_start>yes</run_on_start>
  <timeout>30</timeout>
  <skip_verification>yes</skip_verification>
</wodle>

The results are:

(plot attached)

  • We can see how the memory leak appears again.

Final conclusions

The new SCA policies add the following commands to several checks in the Linux policies:

# nft list ruleset
# nft list tables

However, the nft command is not available by default on all Linux distributions, making the command fail:

# nft list tables

Command 'nft' not found, but can be installed with:

apt install nftables

This can make the tests report a memory leak, since the RSS measurement seems to include the memory of the child process that runs the command. In that case, there is no actual memory leak, since the memory is reclaimed by the OS once the child process dies.
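
As a rough illustration of that mechanism (a minimal sketch, not the actual Wazuh command execution code): the forked child shares the parent's memory copy-on-write until it exits, so an RSS sample taken while children are being spawned in a loop can count that memory, even though the kernel reclaims all of it as soon as the child exits.

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static int run_command_sketch(char *const argv[]) {
    pid_t pid = fork();

    if (pid == 0) {
        execvp(argv[0], argv);
        perror("execvp");   /* only reached when the command is missing or cannot run */
        _exit(127);         /* everything the child held goes back to the OS here */
    } else if (pid > 0) {
        int status = 0;
        waitpid(pid, &status, 0);
        return status;
    }

    return -1;
}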

The solution for this could be to find a workaround for the nft command, as it is not installed by default on many Linux distributions.

In addition, #7028 attenuates the reported usage when SCA is evaluated, since the 64 KB dynamic buffer was also being interpreted as a memory leak.
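
To illustrate why a 64 KB buffer can be flagged that way (a generic, hypothetical sketch, not the actual change made in #7028): a per-request scratch buffer that is freed on only one exit path is reported by Valgrind as definitely lost for every other path, which matches the 65,537-byte record in the first Valgrind summary.

#include <stdlib.h>
#include <string.h>

#define REQUEST_BUFFER_SIZE 65536

static int handle_request_sketch(const char *request) {
    char *buffer = calloc(1, REQUEST_BUFFER_SIZE + 1);

    if (!buffer)
        return -1;

    if (!request || request[0] == '\0') {
        free(buffer);   /* dropping this free turns the early return into a ~64 KiB leak */
        return -1;
    }

    strncpy(buffer, request, REQUEST_BUFFER_SIZE);
    /* ... dispatch the request ... */

    free(buffer);
    return 0;
}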

@spothound spothound changed the title [QA] Memory leak on dev branch: v4.1.0-alpha1 [QA] Memory leak on dev branch 4.1.0 Dec 31, 2020
@spothound
Contributor Author

spothound commented Jan 18, 2021

Extra tests:

Without default policies (just the QA ones)

Agent: (plot attached)
Manager: (plot attached)

Without QA policies (just the 4.1 ones)

Agent: (plot attached)

Manager: (plot attached)

@spothound
Contributor Author

All modules enabled except SCA

Manager: (plot attached)

@spothound
Contributor Author

Modulesd memory usage research for 4.1

To explain the weird memory behaviour in Wazuh 4.1, I've run the following manual stress tests.

Test 1: Missing commands

(plot attached)

One possible explanation is that these are false-positive memory leaks caused by the execution of missing commands. To rule this option out, one test was launched after installing some of the commands that were missing by default in the original stress tests. The behaviour is quite similar to the previous one.

We've installed iw, lsof, autofs, and nftables.

Test 2: New modules on modulesd

(plot attached)

We've checked that the leak is not related to the new modules in modulesd by running tests with SCA, Syscollector, and all the other old modules disabled.
(Modulesd is one of the bottom blue lines.)
After that, I enabled SCA in this check and the memory started to grow!

(plot attached)

So the new modules probably don't affect the memory usage; it seems that the problem lies in SCA and the new 4.1 policies.

This is an example of 4.1 running with the 4.0 policies; the memory usage of modulesd looks quite good:
(plot attached)

(Modulesd is the purple line at the bottom.)
If we run just SCA with the new CentOS policies:
(plot attached)

the memory keeps growing.

Test 3: Force missing commands

In these tests, I modified the scan policies to add a lot of checks with missing commands. The result, after adding those modified checks to the policy and restarting the manager, was:
(plot attached)

(still growing)
After a few hours:
(plot attached)

So, I strongly recommend reviewing how SCA handles the memory used to execute commands.

@vikman90
Member

vikman90 commented Jan 22, 2021

Hi team,

After an exhaustive analysis of the memory usage of Modulesd, we concluded that it is caused by a memory fragmentation issue in the SCA scan, when running the CIS check "Ensure XD/NX support is enabled".

Affected installation types:

  • Manager
  • Agent

Affected platforms:

  • RHEL 6/7
  • CentOS 6/7
  • Debian 7-10

Rationale

That check runs journalctl, which produces a potentially large output. The command execution library in Modulesd has a 64 MiB limit for output data; for instance, on the testing host we had 45 MiB of data. The memory is allocated and freed according to standard memory management. However, such memory is not guaranteed to be returned to the system.

We observed that this behavior is prone to occur when multiple modules are running, such as SCA and Syscollector. Therefore, although the memory usage is much higher on the manager than on the agent, there is no guarantee that it cannot happen on the agent.
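
A minimal, self-contained illustration of that last point (glibc-specific behavior, not Wazuh code): when a large amount of freed heap space sits below a smaller allocation that is still alive, the heap cannot shrink past it, so RSS stays high even though the memory is available for reuse by the process.

#include <malloc.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK  (64 * 1024)   /* below glibc's mmap threshold, so served from the heap */
#define CHUNKS 720           /* ~45 MiB in total, like the journalctl output mentioned above */

int main(void) {
    char *blocks[CHUNKS];

    /* Grow the heap with many medium-sized allocations, as a buffered command
     * output would. */
    for (int i = 0; i < CHUNKS; i++) {
        blocks[i] = malloc(CHUNK);
        if (blocks[i])
            memset(blocks[i], 'x', CHUNK);
    }

    /* A small allocation that outlives the buffer, near the top of the heap. */
    char *pinned = strdup("long-lived allocation above the freed space");

    /* Free the buffer: the space becomes reusable by this process, but the
     * heap cannot be trimmed past the pinned block, so RSS does not drop. */
    for (int i = 0; i < CHUNKS; i++)
        free(blocks[i]);

    malloc_trim(0);   /* glibc extension; may return part of the free space to the kernel */

    free(pinned);
    return 0;
}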

Action items

  • Disable the related SCA check, until we implement a solution in SCA.
  • Fix wm_strtok so that we can group arguments through double quotes in commands, for instance:
- 'c:sh -c "journalctl | grep protection" -> r:^kernel:\s+NX \(Execute Disable\) protection: active'
  • Let SCA consume the commands' output line by line instead of the full data chunk (see the sketch after this list).
  • Investigate how to reduce memory fragmentation. However, this won't prevent the memory usage peak.
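
A minimal sketch of the line-by-line idea referenced above, assuming a popen-style interface (the real wm_exec API differs): the rule is matched against each line as it arrives, so the working set stays at one line instead of a buffer of up to 64 MiB.

#include <stdio.h>

/* Hypothetical sketch: evaluate an SCA-style rule against a command's output
 * one line at a time. match_line() stands in for the real rule-matching logic. */
static int check_command_output(const char *command,
                                int (*match_line)(const char *line)) {
    FILE *out = popen(command, "r");
    char line[4096];
    int found = 0;

    if (!out)
        return -1;

    while (fgets(line, sizeof(line), out)) {
        if (match_line(line)) {
            found = 1;
            break;   /* stop as soon as the rule is satisfied */
        }
    }

    pclose(out);
    return found;
}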

@vikman90 vikman90 changed the title [QA] Memory leak on dev branch 4.1.0 [QA] Modulesd is using too much memory on dev branch 4.1.0 Jan 24, 2021
@vikman90 vikman90 removed this from the Triage bugs detected in 4.1.0-rc4 milestone Jan 25, 2021
@vikman90 vikman90 linked a pull request Jan 28, 2021 that will close this issue
@vikman90
Member

Finally, wm_strtok will be fixed as part of #3340.
