Description
As part of the efforts to improve agent registration, it was found that the remote server thread may behave unexpectedly when handling multiple connections:
wazuh/src/os_auth/main-server.c
Lines 843 to 844 in dff1842
/* Thread for remote server */
void* run_remote_server(__attribute__((unused)) void *arg) {
Since delays may arise under heavy load, when many connections are being opened at once, the socket management built around the select function needs to be reviewed.
Tasks
- Analysis of the code and reproduction of the potential bug. (A script simulating the connections might be required to achieve this.)
- Improvements proposal.
- Development of the improvements.
- Unit tests update.
Status
Pending final review
Activity
MiguelazoDS commented on Apr 3, 2025
Update
After discussing this with the team and implementing a script that simulates multiple connections from agents, I could replicate the errors seen in the shared cloud environment, although not consistently.
The main cause of this issue, as analyzed by Octavio Valle, is the way authd currently manages connections. Connection management is sequential, which causes delays and blocking because clients do not necessarily send their payloads in the same order in which they connected.
The server blocks on read while waiting for the payload of the client it is currently handling.
SSL Error (-1)
This was replicated by killing the client, which ends the connection abruptly before the payload is sent. This may indicate that the analysis above is on the right track: the connection was successful but the server could not read the payload, probably because of an agent restart in a congested network.
SSL read (unable to receive message).
This was replicated by launching a large number of requests with different delays between connection and payload.
On the client side we can see that some requests receive a response and some do not.
On the server side we can see that the manager never received one of those requests, e.g. QSFCZEOFPQNFYBT.
The request ended in a timeout.
MiguelazoDS commented on Apr 4, 2025
Update
Sometimes the error code is 1 and sometimes 5. This is a custom script, not an actual agent.
The select implementation here does not manage the fd of each client connection; it is just a "smart" sleep waiting for the server.
A quick fix for this behavior is to increase the network timeout value. After changing the options for auth in
internal_options.conf
the problem no longer happens and all agents are registered. Of course, it will block for that amount of time if the
An approach that I would like to consider here is to make the server non-blocking, managing the connections itself and using epoll_wait instead of blocking in accept.
I need to better understand how to manage SSL connections, because ssl_read may also write during the handshake and other SSL operations.
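One common way to deal with this on a non-blocking socket is to look at the code returned by OpenSSL's `SSL_get_error()` after a failed read or write and re-arm the fd for the direction OpenSSL asks for. The sketch below shows only that mapping. The function name and the `SKETCH_*` constants are hypothetical stand-ins so the example compiles without OpenSSL headers; real code would use `SSL_ERROR_WANT_READ` and `SSL_ERROR_WANT_WRITE` from `<openssl/ssl.h>` directly.

```c
#include <stdint.h>
#include <sys/epoll.h>

/* Local stand-ins so the sketch compiles without OpenSSL headers.
 * Assumption: they mirror SSL_ERROR_WANT_READ (2) and SSL_ERROR_WANT_WRITE (3)
 * from <openssl/ssl.h>; real code should use those macros instead. */
enum { SKETCH_WANT_READ = 2, SKETCH_WANT_WRITE = 3 };

/* Given the code returned by SSL_get_error() after an SSL read, write or
 * handshake call on a non-blocking socket, return the epoll event mask to
 * wait for before retrying the same call, or 0 if it should not be retried. */
static uint32_t epoll_mask_for_ssl_error(int ssl_error) {
    switch (ssl_error) {
    case SKETCH_WANT_READ:
        return EPOLLIN;  /* retry once the socket is readable */
    case SKETCH_WANT_WRITE:
        return EPOLLOUT; /* a read can still need the socket to be writable */
    default:
        return 0;        /* success or a real error: nothing to wait for */
    }
}
```

The returned mask would then be passed to `epoll_ctl` with `EPOLL_CTL_MOD` for that connection before going back to `epoll_wait`.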
What I can see is that we already have some wrapper for epoll in notify_op.c/h
I'll be uploading changes here https://github.com/wazuh/wazuh/tree/enhancement/28908-authd-connection-management
There is a long way to go to get this done; I need to understand the epoll API in depth first.
MiguelazoDS commented on Apr 8, 2025
Update
Although I don't see any "definitely lost" blocks, I do not fully trust those "still reachable" leaks.
Note
I took a look and most of them may be required by other modules; they are variables that live inside shared structures.
Trying to reproduce the issue
10 agents
1000 agents (max size epoll queue)
for i in $(seq 1 1000); do ./client $((1200-i)) & done
MiguelazoDS commented on Apr 8, 2025
Update
Some thread sanitizer warnings
This seems to be an issue with closing the pipe write fd in the T2 thread
There are some others that predate this implementation, which I would like to discuss with the team.
MiguelazoDS commented on Apr 9, 2025
Update
Fixing all these cases
https://github.com/wazuh/wazuh/actions/runs/14341591613/job/40201919478
Most of them are related to the new line character in msgs coming from agents, but there's one that causes a crash.
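For the newline-related cases, one possible approach is to normalize the message before it is compared or logged. The helper below is only a sketch of that idea: `strip_trailing_newline` is a hypothetical name and is not necessarily how the failing tests were fixed.

```c
#include <string.h>

/* Remove any trailing '\n' or '\r' characters from msg, in place.
 * Returns the new length of the string. */
static size_t strip_trailing_newline(char *msg) {
    size_t len = strlen(msg);
    while (len > 0 && (msg[len - 1] == '\n' || msg[len - 1] == '\r')) {
        msg[--len] = '\0';
    }
    return len;
}
```

Calling it on a buffer holding `"hello\r\n"` leaves `"hello"` in the buffer, and a buffer that contains only a newline becomes the empty string.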
There are two tests that seem to fail due to a timeout.
I increased the buffer size to fix those last two failing tests.