
Rootcheck database inconsistencies #23477

Closed
GGP1 opened this issue May 16, 2024 · 3 comments
Labels: level/task, type/bug

GGP1 commented May 16, 2024

Description

During the investigation in #23389, we found that in 4.9.0 our test cases are failing sporadically because the agents' rootcheck databases are empty even though their logs indicate that the scans have finished.

I tried sleeping for two minutes before starting the test to see if it was a race condition but the database remained empty after multiple scans. In some other executions, even without a sleep, the database did contain the scan results.

Our agent healthcheck waits 30 seconds after the shared configuration is modified before considering the agent healthy, so before 4.9.0 the results were stored within this time frame.

We haven't noticed this behavior in previous versions, so we should investigate if there is any issue while populating the database or with the scan itself not finding results.

Logs: rootcheck_logs.zip

@vikman90

@GGP1 I've checked that all agents have logs like:

2024/05/16 18:21:57 rootcheck: INFO: Starting rootcheck scan.
2024/05/16 18:22:05 rootcheck: INFO: Ending rootcheck scan.

On the other hand, I see the tables populated.

Manual check

Agent 000

echo 'agent 000 sql select * from pm_event' | ~/wdb-query.py
Output
[
    {
        "id": 1,
        "date_first": 1715937625,
        "date_last": 1715937625,
        "log": "Starting rootcheck scan."
    },
    {
        "id": 2,
        "date_first": 1715937626,
        "date_last": 1715937626,
        "log": "Trojaned version of file '/bin/diff' detected. Signature used: 'bash|^/bin/sh|file\\.h|proc\\.h|/dev/[^n]|^/bin/.*sh' (Generic)."
    },
    {
        "id": 3,
        "date_first": 1715937626,
        "date_last": 1715937626,
        "log": "Trojaned version of file '/usr/bin/diff' detected. Signature used: 'bash|^/bin/sh|file\\.h|proc\\.h|/dev/[^n]|^/bin/.*sh' (Generic)."
    },
    {
        "id": 4,
        "date_first": 1715937661,
        "date_last": 1715937661,
        "log": "Ending rootcheck scan."
    }
]
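For reference, a scan's state can be derived from the `pm_event` rows themselves: the "Starting"/"Ending" markers tell you whether the scan finished, and any other rows are findings. A minimal sketch (the sample rows are abridged from the output above):

```python
# Classify a rootcheck scan from its pm_event rows: has it finished,
# and did it record any findings besides the start/end markers?
MARKERS = ("Starting rootcheck scan.", "Ending rootcheck scan.")

def classify_scan(rows):
    logs = [r["log"] for r in rows]
    finished = "Ending rootcheck scan." in logs
    findings = [log for log in logs if log not in MARKERS]
    return finished, findings

rows = [
    {"id": 1, "log": "Starting rootcheck scan."},
    {"id": 2, "log": "Trojaned version of file '/bin/diff' detected."},
    {"id": 3, "log": "Ending rootcheck scan."},
]
finished, findings = classify_scan(rows)
print(finished, len(findings))  # True 1
```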

Agent 002

On the other hand, some agents only have the "Starting/Ending rootcheck scan." entries, like this one.

echo 'agent 002 sql select * from pm_event' | ~/wdb-query.py
Output
[
    {
        "id": 1,
        "date_first": 1715938455,
        "date_last": 1715938461,
        "log": "Starting rootcheck scan."
    },
    {
        "id": 2,
        "date_first": 1715938469,
        "date_last": 1715938469,
        "log": "Ending rootcheck scan."
    }
]
wazuh-api -XGET https://localhost:55000/rootcheck/002
Output
{
  "data": {
    "affected_items": [],
    "total_affected_items": 0,
    "total_failed_items": 0,
    "failed_items": []
  },
  "message": "No rootcheck information was returned",
  "error": 0
}

However, the API still reports the last scan data:

wazuh-api -XGET https://localhost:55000/rootcheck/002/last_scan
Output
{
  "data": {
    "affected_items": [
      {
        "start": "2024-05-17T09:34:21Z",
        "end": "2024-05-17T09:34:29Z"
      }
    ],
    "total_affected_items": 1,
    "total_failed_items": 0,
    "failed_items": []
  },
  "message": "Last rootcheck scan of the agent was returned",
  "error": 0
}

Test implementation

If I'm correct, the Rootcheck integration test is accessing /var/ossec/queue/db/*.db manually: test_rootcheck.py

Reading these files directly may return inconsistent results or even corrupt the databases.
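If the test must read the agent databases at all, opening them read-only would at least rule out accidental writes. A sketch using Python's `sqlite3` URI mode; the throwaway file stands in for an agent DB such as `/var/ossec/queue/db/000.db` (path is illustrative):

```python
import os
import sqlite3
import tempfile

# Create a throwaway database standing in for an agent DB.
path = os.path.join(tempfile.mkdtemp(), "000.db")
with sqlite3.connect(path) as conn:
    conn.execute("CREATE TABLE pm_event (id INTEGER, log TEXT)")
    conn.execute("INSERT INTO pm_event VALUES (1, 'Starting rootcheck scan.')")
    conn.commit()

# mode=ro opens the file read-only: any accidental write raises
# sqlite3.OperationalError instead of mutating the agent's database.
ro = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
print(ro.execute("SELECT log FROM pm_event").fetchall())
try:
    ro.execute("INSERT INTO pm_event VALUES (2, 'x')")
except sqlite3.OperationalError as e:
    print("write blocked:", e)
ro.close()
```

This does not remove the race with Wazuh DB's own writes, but it does guarantee the test can never be the one corrupting the file.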

Conclusion

This might be caused by:

  • A race condition: the database is queried before the scan has ended.
  • An uncommitted database transaction.
  • A genuinely empty database.

I rule out load-related event dropping, as I've seen no logs about a full buffer. However, I'm a bit confused: the test is accessing X.db directly, yet your report shows JSON results from the API.

Next steps

Please avoid accessing database files directly. The test needs a fix; until then, we should disregard it.

Workaround

Wait 90 seconds (a 30-second margin plus 60 seconds of commit time).
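Rather than a fixed sleep, a bounded poll on the scan-finished condition is usually both faster and less flaky. A generic sketch, where `check_finished` is a placeholder for whatever query the test uses (e.g. via the wdb socket or the API):

```python
import time

def wait_for(condition, timeout=90.0, interval=5.0):
    """Poll `condition` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return False

# Example with a stub condition that becomes true on the third poll.
calls = {"n": 0}
def check_finished():
    calls["n"] += 1
    return calls["n"] >= 3

assert wait_for(check_finished, timeout=10.0, interval=0.01)
```

On the happy path this returns as soon as the scan data lands, and the 90-second figure becomes a worst-case bound instead of a mandatory delay.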

@vikman90 vikman90 closed this as not planned May 17, 2024

GGP1 commented May 17, 2024

@vikman90

The tests where this behavior was observed are the API integration tests, where we open a connection to the wdb socket (ref).

I tried the workaround you suggested by adding 90 seconds to the agents' healthcheck script, but the tests continue failing sporadically.

Logs: logs.zip

@GGP1 GGP1 reopened this May 17, 2024
@vikman90 vikman90 self-assigned this May 20, 2024
@vikman90

@GGP1 gotcha, sorry for the mistake. At first glance, it may seem that the scan data is missing from the database since the /rootcheck test fails but /rootcheck/last_scan works. However, after analyzing this case with Selu, we found that the expected values for the latter case are:

end: !anything
start: !anystring

This would validate a case where the Rootcheck scan has not finished:

{
  "data": {
    "affected_items": [
      {
        "start": "2024-05-20T08:01:06Z",
        "end": null
      }
    ],
    "total_affected_items": 1,
    "total_failed_items": 0,
    "failed_items": []
  },
  "message": "Last rootcheck scan of the agent was returned",
  "error": 0
}
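Given that response shape, the two states can be told apart by whether `end` is set. A small helper (field names taken from the API outputs above):

```python
def scan_state(last_scan_response):
    """Return 'finished', 'running', or 'never' from a /last_scan payload."""
    items = last_scan_response["data"]["affected_items"]
    if not items or not items[0].get("start"):
        return "never"
    return "finished" if items[0].get("end") else "running"

running = {"data": {"affected_items": [
    {"start": "2024-05-20T08:01:06Z", "end": None}]}}
done = {"data": {"affected_items": [
    {"start": "2024-05-17T09:34:21Z", "end": "2024-05-17T09:34:29Z"}]}}
print(scan_state(running), scan_state(done))  # running finished
```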

Conclusion

I believe it is very likely that the test failed simply because the scan had not finished.

Suggestions

  1. Test /rootcheck/last_scan before /rootcheck (determine if the scan has finished).
  2. Rename the previous case to "scan started".
  3. Add a "scan finished" case with end: !anystring.

That will help determine whether the scan data is missing from the DB or the scan did not finish.
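The suggested ordering can be sketched as a pytest-style flow, with `get_last_scan` and `get_rootcheck` standing in for the real API calls (both are hypothetical stubs, not the actual test helpers):

```python
def get_last_scan(agent_id):
    # Hypothetical stub for GET /rootcheck/{agent}/last_scan.
    return {"start": "2024-05-20T08:01:06Z", "end": "2024-05-20T08:01:14Z"}

def get_rootcheck(agent_id):
    # Hypothetical stub for GET /rootcheck/{agent}.
    return [{"log": "Ending rootcheck scan."}]

def test_rootcheck(agent_id="002"):
    scan = get_last_scan(agent_id)
    assert scan["start"], "scan never started"   # case: "scan started"
    if not scan["end"]:
        return "skipped: scan still running"     # don't query results yet
    # case: "scan finished" -- only now query the DB-backed endpoint.
    events = get_rootcheck(agent_id)
    assert events, "scan finished but no events: data missing from the DB"
    return "passed"

print(test_rootcheck())  # passed
```

Checking `/rootcheck/last_scan` first turns a flaky failure into either a clean skip (scan still running) or a genuine signal that data is missing from the database.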

Next steps

I have agreed with @Selutario to close this issue and leave it to your team to apply the suggested changes above.

Best!

@vikman90 vikman90 closed this as not planned May 20, 2024