
Rootcheck database inconsistencies #23477

Closed
GGP1 opened this issue May 16, 2024 · 3 comments
Labels: level/task, type/bug

GGP1 commented May 16, 2024

Description

During the investigation in #23389, we found that in 4.9.0 our test cases are failing sporadically because the agents' rootcheck databases are empty even though their logs indicate that the scans have finished.

I tried sleeping for two minutes before starting the test to see if it was a race condition but the database remained empty after multiple scans. In some other executions, even without a sleep, the database did contain the scan results.

Our agent healthcheck waits 30 seconds after the shared configuration is modified before considering the agent healthy, so before 4.9.0 the results were stored within this time frame.

We haven't noticed this behavior in previous versions, so we should investigate if there is any issue while populating the database or with the scan itself not finding results.

Logs: rootcheck_logs.zip

@vikman90

@GGP1 I've checked that all agents have logs like:

2024/05/16 18:21:57 rootcheck: INFO: Starting rootcheck scan.
2024/05/16 18:22:05 rootcheck: INFO: Ending rootcheck scan.

On the other hand, I see the tables populated.

Manual check

Agent 000

echo 'agent 000 sql select * from pm_event' | ~/wdb-query.py
Output
[
    {
        "id": 1,
        "date_first": 1715937625,
        "date_last": 1715937625,
        "log": "Starting rootcheck scan."
    },
    {
        "id": 2,
        "date_first": 1715937626,
        "date_last": 1715937626,
        "log": "Trojaned version of file '/bin/diff' detected. Signature used: 'bash|^/bin/sh|file\\.h|proc\\.h|/dev/[^n]|^/bin/.*sh' (Generic)."
    },
    {
        "id": 3,
        "date_first": 1715937626,
        "date_last": 1715937626,
        "log": "Trojaned version of file '/usr/bin/diff' detected. Signature used: 'bash|^/bin/sh|file\\.h|proc\\.h|/dev/[^n]|^/bin/.*sh' (Generic)."
    },
    {
        "id": 4,
        "date_first": 1715937661,
        "date_last": 1715937661,
        "log": "Ending rootcheck scan."
    }
]
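For reference, a scan's state can be derived from the `pm_event` rows themselves: the "Starting"/"Ending" markers tell you whether the scan finished, and any other rows are findings. A minimal sketch (the sample rows are abridged from the output above):

```python
# Classify a rootcheck scan from its pm_event rows: has it finished,
# and did it record any findings besides the start/end markers?
MARKERS = ("Starting rootcheck scan.", "Ending rootcheck scan.")

def classify_scan(rows):
    logs = [r["log"] for r in rows]
    finished = "Ending rootcheck scan." in logs
    findings = [log for log in logs if log not in MARKERS]
    return finished, findings

rows = [
    {"id": 1, "log": "Starting rootcheck scan."},
    {"id": 2, "log": "Trojaned version of file '/bin/diff' detected."},
    {"id": 3, "log": "Ending rootcheck scan."},
]
finished, findings = classify_scan(rows)
print(finished, len(findings))  # True 1
```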

Agent 002

On the other hand, some agents only have the "Starting/Ending rootcheck scan." entries, like this one.

echo 'agent 002 sql select * from pm_event' | ~/wdb-query.py
Output
[
    {
        "id": 1,
        "date_first": 1715938455,
        "date_last": 1715938461,
        "log": "Starting rootcheck scan."
    },
    {
        "id": 2,
        "date_first": 1715938469,
        "date_last": 1715938469,
        "log": "Ending rootcheck scan."
    }
]
wazuh-api -XGET https://localhost:55000/rootcheck/002
Output
{
  "data": {
    "affected_items": [],
    "total_affected_items": 0,
    "total_failed_items": 0,
    "failed_items": []
  },
  "message": "No rootcheck information was returned",
  "error": 0
}

However, the API still reports the last scan data:

wazuh-api -XGET https://localhost:55000/rootcheck/002/last_scan
Output
{
  "data": {
    "affected_items": [
      {
        "start": "2024-05-17T09:34:21Z",
        "end": "2024-05-17T09:34:29Z"
      }
    ],
    "total_affected_items": 1,
    "total_failed_items": 0,
    "failed_items": []
  },
  "message": "Last rootcheck scan of the agent was returned",
  "error": 0
}

Test implementation

If I'm correct, the Rootcheck integration test is accessing /var/ossec/queue/db/*.db manually: test_rootcheck.py

Reading these files directly may return inconsistent results or even corrupt the databases.
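If the test must read the agent databases at all, opening them read-only would at least rule out accidental writes. A sketch using Python's `sqlite3` URI mode; the throwaway file stands in for an agent DB such as `/var/ossec/queue/db/000.db` (path is illustrative):

```python
import os
import sqlite3
import tempfile

# Create a throwaway database standing in for an agent DB.
path = os.path.join(tempfile.mkdtemp(), "000.db")
with sqlite3.connect(path) as conn:
    conn.execute("CREATE TABLE pm_event (id INTEGER, log TEXT)")
    conn.execute("INSERT INTO pm_event VALUES (1, 'Starting rootcheck scan.')")
    conn.commit()

# mode=ro opens the file read-only: any accidental write raises
# sqlite3.OperationalError instead of mutating the agent's database.
ro = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
print(ro.execute("SELECT log FROM pm_event").fetchall())
try:
    ro.execute("INSERT INTO pm_event VALUES (2, 'x')")
except sqlite3.OperationalError as e:
    print("write blocked:", e)
ro.close()
```

This does not remove the race with Wazuh DB's own writes, but it does guarantee the test can never be the one corrupting the file.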

Conclusion

This might be caused by:

  • A race condition: the database is queried before the scan has ended.
  • An uncommitted database transaction.
  • A genuinely empty database.

I rule out load-related event dropping, as I've seen no logs about a full buffer. However, I'm a bit confused: the test is accessing X.db directly, yet your report shows JSON results from the API.

Next steps

Please avoid accessing database files directly. The test needs a fix; until then, we should disregard it.

Workaround

Wait 90 seconds (a 30-second margin plus 60 seconds of commit time).
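Rather than a fixed sleep, a bounded poll on the scan-finished condition is usually both faster and less flaky. A generic sketch, where `check_finished` is a placeholder for whatever query the test uses (e.g. via the wdb socket or the API):

```python
import time

def wait_for(condition, timeout=90.0, interval=5.0):
    """Poll `condition` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return False

# Example with a stub condition that becomes true on the third poll.
calls = {"n": 0}
def check_finished():
    calls["n"] += 1
    return calls["n"] >= 3

assert wait_for(check_finished, timeout=10.0, interval=0.01)
```

On the happy path this returns as soon as the scan data lands, and the 90-second figure becomes a worst-case bound instead of a mandatory delay.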

@vikman90 vikman90 closed this as not planned May 17, 2024

GGP1 commented May 17, 2024

@vikman90

The tests where this behavior was observed are the API integration tests, where we open a connection to the wdb socket (ref).

I tried the workaround you suggested by adding 90 seconds to the agents' healthcheck script, but the tests continue failing sporadically.

Logs: logs.zip

@GGP1 GGP1 reopened this May 17, 2024
@vikman90 vikman90 self-assigned this May 20, 2024
@vikman90

@GGP1 gotcha, sorry for the mistake. At first glance, it may seem that the scan data is missing from the database since the /rootcheck test fails but /rootcheck/last_scan works. However, after analyzing this case with Selu, we found that the expected values for the latter case are:

end: !anything
start: !anystring

This would validate a case where the Rootcheck scan has not finished:

{
  "data": {
    "affected_items": [
      {
        "start": "2024-05-20T08:01:06Z",
        "end": null
      }
    ],
    "total_affected_items": 1,
    "total_failed_items": 0,
    "failed_items": []
  },
  "message": "Last rootcheck scan of the agent was returned",
  "error": 0
}
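Given that response shape, the two states can be told apart by whether `end` is set. A small helper (field names taken from the API outputs above):

```python
def scan_state(last_scan_response):
    """Return 'finished', 'running', or 'never' from a /last_scan payload."""
    items = last_scan_response["data"]["affected_items"]
    if not items or not items[0].get("start"):
        return "never"
    return "finished" if items[0].get("end") else "running"

running = {"data": {"affected_items": [
    {"start": "2024-05-20T08:01:06Z", "end": None}]}}
done = {"data": {"affected_items": [
    {"start": "2024-05-17T09:34:21Z", "end": "2024-05-17T09:34:29Z"}]}}
print(scan_state(running), scan_state(done))  # running finished
```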

Conclusion

I believe it is very likely that the test failed simply because the scan had not finished.

Suggestions

  1. Test /rootcheck/last_scan before /rootcheck (determine if the scan has finished).
  2. Rename the previous case to "scan started".
  3. Add a "scan finished" case with end: !anystring.

That will help determine whether the scan data is missing from the DB or the scan did not finish.
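The suggested ordering can be sketched as a pytest-style flow, with `get_last_scan` and `get_rootcheck` standing in for the real API calls (both are hypothetical stubs, not the actual test helpers):

```python
def get_last_scan(agent_id):
    # Hypothetical stub for GET /rootcheck/{agent}/last_scan.
    return {"start": "2024-05-20T08:01:06Z", "end": "2024-05-20T08:01:14Z"}

def get_rootcheck(agent_id):
    # Hypothetical stub for GET /rootcheck/{agent}.
    return [{"log": "Ending rootcheck scan."}]

def test_rootcheck(agent_id="002"):
    scan = get_last_scan(agent_id)
    assert scan["start"], "scan never started"   # case: "scan started"
    if not scan["end"]:
        return "skipped: scan still running"     # don't query results yet
    # case: "scan finished" -- only now query the DB-backed endpoint.
    events = get_rootcheck(agent_id)
    assert events, "scan finished but no events: data missing from the DB"
    return "passed"

print(test_rootcheck())  # passed
```

Checking `/rootcheck/last_scan` first turns a flaky failure into either a clean skip (scan still running) or a genuine signal that data is missing from the database.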

Next steps

I have agreed with @Selutario to close this issue and leave it to your team to apply the suggested changes above.

Best!

@vikman90 vikman90 closed this as not planned May 20, 2024