Wazuh-clusterd occasionally fails copying agent-info #4007

Closed
crd1985 opened this issue Sep 25, 2019 · 4 comments
Labels: module/cluster, type/bug (Something isn't working)

Comments

crd1985 (Contributor) commented Sep 25, 2019

| Wazuh version | Component | Install type | Install method | Platform |
|---|---|---|---|---|
| 3.10.x | clusterd/modulesd | Manager | Any | Any |

Sometimes an error appears in cluster.log:

```
2019/09/24 14:03:18 wazuh-clusterd: ERROR: [Worker wazuh-manager-worker-0] [Main] Error updating agent group/status (/var/ossec/queue/cluster/wazuh-manager-worker-0/client3-any): [Errno 2] No such file or directory: '/var/ossec/queue/agent-info/client3-any.tmp'
2019/09/24 14:03:18 wazuh-clusterd: ERROR: [Worker wazuh-manager-worker-0] [Agent info] Errors updating worker files: /queue/agent-info/: 3
```

The error arises due to the following function:

```python
import shutil
from os import chown, chmod, utime
# `common` comes from the Wazuh framework and provides the ossec user/group IDs.

def safe_move(source, target, ownership=(common.ossec_uid(), common.ossec_gid()), time=None, permissions=None):
    """Move a file, even between filesystems.

    This function is useful to move files even when the target directory is on a
    different filesystem from the source. Write permission is required on the
    target directory.

    :param source: full path to the source file
    :param target: full path to the target file
    :param ownership: tuple in the form (user, group) to be set after the file is moved
    :param time: tuple in the form (access_timestamp, modified_timestamp)
    :param permissions: permission mask in octal notation, e.g. 0o640
    """
    # Copy into a temporary file first; a move across filesystems is not atomic.
    tmp_target = f"{target}.tmp"
    shutil.move(source, tmp_target, copy_function=shutil.copyfile)
    # Overwrite the target atomically (a rename within the same filesystem)
    shutil.move(tmp_target, target, copy_function=shutil.copyfile)
    # Set up metadata
    chown(target, *ownership)
    if permissions is not None:
        chmod(target, permissions)
    if time is not None:
        utime(target, time)
```

There is a race condition involving clusterd and modulesd in the window between the two shutil.move calls: if modulesd removes the temporary file created by safe_move before the second shutil.move runs, a No such file or directory exception is raised.
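
For illustration, here is a minimal, self-contained sketch of the race (not Wazuh code: the cleanup thread stands in for modulesd, and all paths and timings are invented):

```python
import shutil
import tempfile
import threading
import time
from pathlib import Path

workdir = Path(tempfile.mkdtemp())

def cleanup_tmp_files():
    """Stand-in for modulesd: repeatedly delete stray *.tmp files."""
    deadline = time.monotonic() + 5
    while time.monotonic() < deadline:
        for tmp in workdir.glob("*.tmp"):
            try:
                tmp.unlink()
            except FileNotFoundError:
                pass

def racy_safe_move(source, target):
    tmp_target = f"{target}.tmp"
    shutil.move(source, tmp_target, copy_function=shutil.copyfile)
    # Window: if the cleanup thread removes tmp_target right here, the
    # second move fails with [Errno 2] No such file or directory.
    shutil.move(tmp_target, target, copy_function=shutil.copyfile)

threading.Thread(target=cleanup_tmp_files, daemon=True).start()
for i in range(10000):
    source = workdir / f"source-{i}"
    source.write_text("agent info payload")
    racy_safe_move(source, workdir / "client3-any")  # eventually raises FileNotFoundError
```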

A possible workaround is changing the temporary file name from myfile.tmp to .myfile.tmp. This way modulesd will ignore it (see the sketch below).
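
A minimal sketch of that workaround, assuming the only change is how the temporary name is built (the helper name is hypothetical):

```python
import os

def hidden_tmp_path(target):
    """Build the temporary name as .myfile.tmp instead of myfile.tmp so
    that processes scanning the directory for agent files skip it."""
    directory, filename = os.path.split(target)
    return os.path.join(directory, f".{filename}.tmp")

# In safe_move, the only line that changes would be:
# tmp_target = hidden_tmp_path(target)
```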

I strongly recommend refactoring all daemons so that any resource (file, database or whatever) is managed by only one process, with every access to the resource delegated to its owner.
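
As a rough illustration of that single-owner design (not Wazuh code; every name here is made up), other daemons would enqueue write requests instead of touching the files themselves:

```python
import multiprocessing as mp

def agent_info_owner(requests):
    """The single owner of the agent-info files: every write goes through
    this process, so no other daemon ever touches the files directly."""
    while True:
        item = requests.get()
        if item is None:  # shutdown sentinel
            break
        path, data = item
        with open(path, "w") as handle:
            handle.write(data)

if __name__ == "__main__":
    requests = mp.Queue()
    owner = mp.Process(target=agent_info_owner, args=(requests,))
    owner.start()
    # clusterd, modulesd, etc. would enqueue writes instead of writing directly:
    requests.put(("/tmp/client3-any", "agent info payload"))  # hypothetical path
    requests.put(None)
    owner.join()
```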

adampav commented Nov 19, 2019

Hello @crd1985,
Do you think this issue might also affect agent status in clustered setups?
I am facing an annoying problem in such a setup, which to the best of my understanding is related to the "Error updating agent group/status ... No such file or directory" message.

The problem:
According to the cluster (CLI tools, API, Kibana app), some agents are disconnected, yet I receive their events normally. Specifically, all disconnected agents are associated with one worker node.
The agent-info files are updated normally on this worker node, but the info is not synced back to the master.
I have gone through the cluster logs and there are a ton of error messages.
If I restart the wazuh-manager on the master, all agents immediately show as connected.
Note that I took one worker down for a week and did not experience any such problems; all agents stayed connected.

Setup: 1 master, 2 workers, no LB.
Wazuh agents are configured in failover mode and I have manually modified the order of "managers" in each agent's config file.

Do you think this is related to the issue you describe? Should I try another version?

adampav commented Nov 19, 2019

Just found this thread: https://groups.google.com/d/msg/wazuh/md0PGQIG1eA/MPdHckMREAAJ
I will try the workaround until 3.11 is released and update this issue.

Btw, thanks to the entire team, awesome project :)
Cheers,
A.

crd1985 (Contributor, Author) commented Nov 20, 2019

Hi @adampav,

Yes, you are right. This issue may cause agents to be shown as disconnected because their agent-info files are not being refreshed.

You can try the fix in the thread you found, which is the same one applied in the PR linked to this issue: https://github.com/wazuh/wazuh/pull/4025/files.

I hope you get your problem solved soon.

Regards,

adampav commented Nov 21, 2019

Hello @crd1985,

This workaround works perfectly for me :)

Sorry for hijacking this issue.

Cheers,
Adam
