Wazuh-clusterd occasionally fails copying agent-info #4007

Closed
crd1985 opened this issue Sep 25, 2019 · 4 comments
Labels: module/cluster, type/bug (Something isn't working)

Comments

crd1985 (Contributor) commented Sep 25, 2019

| Wazuh version | Component | Install type | Install method | Platform |
|---|---|---|---|---|
| 3.10.x | clusterd/modulesd | Manager | Any | Any |

Sometimes an error appears in cluster.log:

```
2019/09/24 14:03:18 wazuh-clusterd: ERROR: [Worker wazuh-manager-worker-0] [Main] Error updating agent group/status (/var/ossec/queue/cluster/wazuh-manager-worker-0/client3-any): [Errno 2] No such file or directory: '/var/ossec/queue/agent-info/client3-any.tmp'
2019/09/24 14:03:18 wazuh-clusterd: ERROR: [Worker wazuh-manager-worker-0] [Agent info] Errors updating worker files: /queue/agent-info/: 3
```

The error arises due to the following function:

```python
import shutil
from os import chown, chmod, utime
# `common` comes from the Wazuh framework and provides the ossec user/group IDs.

def safe_move(source, target, ownership=(common.ossec_uid(), common.ossec_gid()), time=None, permissions=None):
    """Move a file, even between filesystems.

    This function is useful to move files even when the target directory is on a
    different filesystem from the source. Write permission is required on the
    target directory.

    :param source: full path to the source file
    :param target: full path to the target file
    :param ownership: tuple in the form (user, group) to be set after the file is moved
    :param time: tuple in the form (access_timestamp, modified_timestamp)
    :param permissions: permission mask in octal notation, e.g. 0o640
    """
    # Copy into a temporary file first; a move across filesystems is not atomic.
    tmp_target = f"{target}.tmp"
    shutil.move(source, tmp_target, copy_function=shutil.copyfile)
    # Overwrite the target atomically (a rename within the same filesystem)
    shutil.move(tmp_target, target, copy_function=shutil.copyfile)
    # Set up metadata
    chown(target, *ownership)
    if permissions is not None:
        chmod(target, permissions)
    if time is not None:
        utime(target, time)
```

There is a race condition involving clusterd and modulesd in the window between the two shutil.move calls: if modulesd removes the temporary file created by safe_move before the second shutil.move runs, a No such file or directory exception is raised.
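
For illustration, here is a minimal, self-contained sketch of the race (not Wazuh code: the cleanup thread stands in for modulesd, and all paths and timings are invented):

```python
import shutil
import tempfile
import threading
import time
from pathlib import Path

workdir = Path(tempfile.mkdtemp())

def cleanup_tmp_files():
    """Stand-in for modulesd: repeatedly delete stray *.tmp files."""
    deadline = time.monotonic() + 5
    while time.monotonic() < deadline:
        for tmp in workdir.glob("*.tmp"):
            try:
                tmp.unlink()
            except FileNotFoundError:
                pass

def racy_safe_move(source, target):
    tmp_target = f"{target}.tmp"
    shutil.move(source, tmp_target, copy_function=shutil.copyfile)
    # Window: if the cleanup thread removes tmp_target right here, the
    # second move fails with [Errno 2] No such file or directory.
    shutil.move(tmp_target, target, copy_function=shutil.copyfile)

threading.Thread(target=cleanup_tmp_files, daemon=True).start()
for i in range(10000):
    source = workdir / f"source-{i}"
    source.write_text("agent info payload")
    racy_safe_move(source, workdir / "client3-any")  # eventually raises FileNotFoundError
```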

A possible workaround is changing the temporary file name from myfile.tmp to .myfile.tmp. This way modulesd will ignore it (see the sketch below).
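
A minimal sketch of that workaround, assuming the only change is how the temporary name is built (the helper name is hypothetical):

```python
import os

def hidden_tmp_path(target):
    """Build the temporary name as .myfile.tmp instead of myfile.tmp so
    that processes scanning the directory for agent files skip it."""
    directory, filename = os.path.split(target)
    return os.path.join(directory, f".{filename}.tmp")

# In safe_move, the only line that changes would be:
# tmp_target = hidden_tmp_path(target)
```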

I strongly recommend refactoring all daemons so that any resource (file, database or whatever) is managed by only one process, with every access to the resource delegated to its owner.
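
As a rough illustration of that single-owner design (not Wazuh code; every name here is made up), other daemons would enqueue write requests instead of touching the files themselves:

```python
import multiprocessing as mp

def agent_info_owner(requests):
    """The single owner of the agent-info files: every write goes through
    this process, so no other daemon ever touches the files directly."""
    while True:
        item = requests.get()
        if item is None:  # shutdown sentinel
            break
        path, data = item
        with open(path, "w") as handle:
            handle.write(data)

if __name__ == "__main__":
    requests = mp.Queue()
    owner = mp.Process(target=agent_info_owner, args=(requests,))
    owner.start()
    # clusterd, modulesd, etc. would enqueue writes instead of writing directly:
    requests.put(("/tmp/client3-any", "agent info payload"))  # hypothetical path
    requests.put(None)
    owner.join()
```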

adampav commented Nov 19, 2019

Hello @crd1985,
Do you think this issue might also affect agent status in clustered setups?
I am facing an annoying problem in such a setup, which to the best of my understanding is related to the "Error updating agent group/status ... No such file or directory" message.

The problem:
According to the cluster (CLI tools, API, Kibana app), some agents are disconnected, yet I receive their events normally. Specifically, all disconnected agents are associated with one worker node.
The agent-info files are updated normally on this worker node, but the info is not synced back to the master.
I have gone through the cluster logs and there are a ton of error messages.
If I restart the wazuh-manager on the master, all agents immediately show as connected.
Note that I took one worker down for a week and did not experience any such problems; all agents stayed connected.

Setup: 1 master, 2 workers, no LB.
Wazuh agents are configured in failover mode and I have manually modified the order of "managers" in each agent's config file.

Do you think this is related to the issue you describe? Should I try another version?

adampav commented Nov 19, 2019

Just found this thread: https://groups.google.com/d/msg/wazuh/md0PGQIG1eA/MPdHckMREAAJ
I will try the workaround until 3.11 is released and update this issue.

Btw, thanks to the entire team, awesome project :)
Cheers,
A.

crd1985 (Contributor, Author) commented Nov 20, 2019

Hi @adampav,

Yes, you are right. This issue may cause agents to be shown as disconnected because their agent-info files are not being refreshed.

You can try the fix in the thread you found, which is the same one applied in the PR linked to this issue: https://github.com/wazuh/wazuh/pull/4025/files.

I hope you get your problem solved soon.

Regards,

adampav commented Nov 21, 2019

Hello @crd1985,

This workaround works perfectly for me :)

Sorry for hijacking this issue.

Cheers,
Adam
