
wazuh-logcollector: ERROR: socketerr (not available) problem. #10

Closed

rustybofh opened this issue Jun 14, 2024 · 1 comment

rustybofh commented Jun 14, 2024

Hi, I’m running Wazuh agent 4.7.4 on pfSense 2.7 and I keep getting these errors even though it’s working and sending information to the manager. I’ve tried changing agent parameters like queue size and events per second, but the issue persists. The manager version is 4.8, and for more context, I have Suricata running on pfSense, but even stopping Suricata on the interfaces doesn’t resolve the problem.

I've got this:

2024/06/13 17:34:42 wazuh-logcollector: INFO: Successfully reconnected to 'queue/sockets/queue'
2024/06/13 17:34:42 wazuh-logcollector: ERROR: socketerr (not available).
2024/06/13 17:34:42 wazuh-logcollector: ERROR: Unable to send message to 'queue/sockets/queue' (wazuh-agentd might be down). Attempting to reconnect.
2024/06/13 17:34:42 wazuh-logcollector: INFO: Successfully reconnected to 'queue/sockets/queue'
2024/06/13 17:34:42 wazuh-logcollector: ERROR: socketerr (not available).
2024/06/13 17:34:42 wazuh-logcollector: ERROR: Unable to send message to 'queue/sockets/queue' after a successfull reconnection...
2024/06/13 17:34:42 wazuh-logcollector: ERROR: socketerr (not available).
2024/06/13 17:34:42 wazuh-logcollector: ERROR: Unable to send message to 'queue/sockets/queue' (wazuh-agentd might be down). Attempting to reconnect.
2024/06/13 17:34:42 wazuh-logcollector: INFO: Successfully reconnected to 'queue/sockets/queue

I have tried changing parameters in the local.conf, but nothing has worked. The agent appears fine in the manager and there are no enrollment issues. There is no extra hop from the manager to the pfSense, and the other agents in various locations do not have this issue. I also do not see any firewall blocks.

Any hints or suggestions would be appreciated.

Thanks!

Update: This issue only occurs when monitoring the WAN interface; it does not happen when monitoring only the LAN interface.

@vikman90
Member

Hi @rustybofh

The Wazuh agent is divided into multiple processes that communicate through a local socket: /var/ossec/queue/sockets/queue. The wazuh-agentd process exposes this socket so that collectors (e.g., Logcollector, FIM, etc.) can send messages to the manager.

Trying to reproduce

The most common reason for a disconnection is that wazuh-agentd crashes. We see this is not the case here, as it can reconnect immediately. I suspect the issue lies with the internal buffer of that socket (provided by the operating system), so I conducted a proof of concept which worked without issues on Linux:

queue.py
#!/usr/bin/env python3
# Send messages to Wazuh's queue socket (analysisd/agentd)
#
# Syntax: queue.py [-L] [PATH]
# Reads a line from stdin
# Standard message form: <id>:<location>:<log>
#
# Example:
# echo '1:test:Hello World' | sudo ./queue.py -L

import argparse
from socket import socket, AF_UNIX, SOCK_DGRAM, SO_SNDBUF, SOL_SOCKET

ADDR = '/var/ossec/queue/sockets/queue'
BLEN = 212992

def connect(addr, blen):
    # Connect to the Unix datagram socket and try to grow its send buffer
    sock = socket(AF_UNIX, SOCK_DGRAM)
    sock.connect(addr)
    oldbuf = sock.getsockopt(SOL_SOCKET, SO_SNDBUF)

    if oldbuf < blen:
        sock.setsockopt(SOL_SOCKET, SO_SNDBUF, blen)
        newbuf = sock.getsockopt(SOL_SOCKET, SO_SNDBUF)
        print("INFO: Buffer expanded from {0} to {1}".format(oldbuf, newbuf))

    return sock


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description="Send messages to Wazuh's queue")
    parser.add_argument('-L', '--loop', action='store_true', dest='loop', help='enable loop mode')
    parser.add_argument('PATH', nargs='?', default=ADDR, help='override default queue path')

    args = parser.parse_args()
    string = input().encode()
    sock = connect(args.PATH, BLEN)

    if args.loop:
        # Flood the socket with the same message until the kernel reports an error
        i = 0

        try:
            while True:
                sock.send(string)
                i += 1
        except BaseException as e:
            print(e)
            print("Messages: {0}\nBytes: {1}".format(i, i * len(string)))

    else:
        # Send the message once
        sock.send(string)

    sock.close()

As I understand it, pfSense is based on FreeBSD. I don't have pfSense, but I tested this on FreeBSD and encountered this error:

[Errno 55] No buffer space available

Rationale

This demonstrates a difference between the two platforms: if the socket memory fills up (because Logcollector generates more messages than the agent can handle), Linux performs an implicit wait (Logcollector blocks until space is available), while BSD returns an error code.

In fact, this is how I tested it on FreeBSD: after enabling logcollector.debug=2 and inserting numerous logs into a monitored file, Logcollector produced this debug message:

2024/06/18 08:39:23 wazuh-logcollector[16057] mq_op.c:127 at SendMSGAction(): DEBUG: Socket busy, discarding message.
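
To make the difference concrete, here is a minimal Python sketch (for illustration only, not Wazuh's actual code) of how a sender could emulate Linux's implicit wait on BSD: catch ENOBUFS (errno 55) and retry after a short sleep. The socket path matches the PoC above; the retry count and delay are arbitrary assumptions.

#!/usr/bin/env python3
# Illustrative sketch only: emulate Linux's implicit wait on BSD by retrying
# the send when the kernel reports ENOBUFS ("No buffer space available").

import errno
import time
from socket import socket, AF_UNIX, SOCK_DGRAM

ADDR = '/var/ossec/queue/sockets/queue'

def send_with_backoff(sock, payload, retries=100, delay=0.01):
    # Retry while the socket's send buffer is full
    for _ in range(retries):
        try:
            sock.send(payload)
            return True
        except OSError as e:
            if e.errno != errno.ENOBUFS:
                raise
            time.sleep(delay)  # give the reader time to drain the buffer
    return False

if __name__ == '__main__':
    sock = socket(AF_UNIX, SOCK_DGRAM)
    sock.connect(ADDR)
    ok = send_with_backoff(sock, b'1:test:Hello World')
    print('sent' if ok else 'dropped after retries')
    sock.close()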

So, my hypothesis is:

  • When you enable WAN monitoring, Suricata produces a massive number of logs, which Logcollector captures but exceeds the agent's capacity to handle.
  • For some reason, pfSense in particular generates a different error code than FreeBSD. This causes Logcollector to produce that generic error and attempt to reconnect to the agent.

If this is correct, and given that Logcollector reconnects successfully, the practical effect is nearly the same (aside from the error printed in the log).

Additionally, pfSense and FreeBSD are not officially supported, so I don't believe we can prioritize development to eliminate the error message.

Workaround

If my hypothesis is valid, and this is due to a capacity issue, I believe we can implement a workaround with the configuration:

  • Ensure the agent's leaky bucket (<client_buffer>) is enabled. This improves message handling: if messages can't be sent directly, they are queued (a sample configuration is sketched after this list).
  • Reduce Logcollector's read rate. For example, limit reads to 500 lines per cycle: edit etc/local_internal_options.conf and add:
    logcollector.max_lines=500
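
For reference, a minimal sketch of the <client_buffer> block in the agent's ossec.conf; the queue size and events-per-second values below are illustrative only and should be tuned to your environment:

<client_buffer>
  <!-- Enable the leaky bucket so events are queued when they can't be sent directly -->
  <disabled>no</disabled>
  <queue_size>5000</queue_size>            <!-- illustrative value -->
  <events_per_second>500</events_per_second> <!-- illustrative value -->
</client_buffer>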
    

I hope this helps.

Best regards.

@vikman90 vikman90 self-assigned this Jun 18, 2024
@vikman90 vikman90 closed this as not planned Jun 20, 2024