
Refactor or remove cluster configuration options to make it enabled by default #13351

Closed
11 of 13 tasks
Selutario opened this issue May 4, 2022 · 5 comments · Fixed by #13655

Selutario commented May 4, 2022

Description

As explained in #7108, we want the cluster (master node) to be enabled by default. That involves removing some configuration options and refactoring others so they are easier to understand and set up. These are the current cluster options and their default values:

cluster_default_configuration = {
    'disabled': True,
    'node_type': 'master',
    'name': 'wazuh',
    'node_name': 'node01',
    'key': '',
    'port': 1516,
    'bind_addr': '0.0.0.0',
    'nodes': ['NODE_IP'],
    'hidden': 'no'
}

We should change the behavior of the cluster for the following options (a sketch of the resulting defaults follows this list):

  • disabled: Since the master will always be running, this option is no longer needed and should not be used by the cluster.
  • name: It is not possible to configure multiple clusters on one node, so this option is unnecessary and should be removed.
  • node_name: The hostname should be used by default.
  • node_type: The node type can be inferred from the list of master nodes (nodes). If the hostname or IP specified in that list belongs to the host where the cluster process is running, the node will be a master; otherwise, it will be a worker. However, it is necessary to study whether it is worth adding the new library required for this purpose (netifaces) and whether other problems could arise. Edit: this option will remain as it is and should not be removed from the configuration.
  • key: The goal is to eliminate the need for a fernet key and use SSL (actually TLS) to negotiate the symmetric key automatically at the start of the connection (Use SSLcontext instead of fernet key by default in Wazuh cluster #13320).
  • bind_addr: This option should not be removed, but it may not be included in the default configuration in the ossec.conf. In that case its value would be 0.0.0.0 or 127.0.0.1. If 127.0.0.1 is used, child processes should not be created and some tasks like Local Integrity or Local agent-groups should not run, since workers won't be able to connect to the master.
  • nodes: By default it should only have one value, 127.0.0.1. It is necessary to test how this affects binaries like agent_groups.
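
As an illustration only (not the final implementation), the trimmed defaults after this refactor could look roughly like the sketch below, with the hostname used when no node_name is given:

    import socket

    # Hypothetical sketch: 'disabled', 'name' and 'key' removed; hostname and
    # loopback used as defaults.
    cluster_default_configuration = {
        'node_type': 'master',
        'node_name': socket.gethostname(),  # hostname used when nothing is set
        'port': 1516,
        'bind_addr': '0.0.0.0',
        'nodes': ['127.0.0.1'],
        'hidden': 'no'
    }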

As seen above, the goal is not only to enable the cluster by default but also to make its configuration easier to understand.

Checks

wazuh/wazuh

  • Unit tests without failures. Updated and/or expanded if there are new functions/methods/outputs:
    • Cluster (framework/wazuh/core/cluster/tests/ & framework/wazuh/core/cluster/dapi/tests/)
    • Core (framework/wazuh/core/tests/)
    • SDK (framework/wazuh/tests/)
    • RBAC (framework/wazuh/rbac/tests/)
    • API (api/api/tests/)
  • API tavern integration tests without failures. Updated and/or expanded if needed (api/test/integration/):
    • Affected tests
    • Affected RBAC (black and white) tests
  • Review integration test mapping using the script (api/test/integration/mapping/integration_test_api_endpoints.json)
  • Review of spec.yaml examples and schemas (api/api/spec/spec.yaml)
  • Review exceptions remediation when any endpoint path changes or is removed (framework/wazuh/core/exception.py)
  • Changelog (CHANGELOG.md)

wazuh/wazuh-documentation

  • Migration from 3.X for changed endpoints (source/user-manual/api/equivalence.rst)
  • Update RBAC reference with new/modified actions/resources/relationships (source/user-manual/api/rbac/reference.rst)
Selutario changed the title from Change cluster ossec.conf template to Refactor cluster configuration options on May 4, 2022
Selutario changed the title from Refactor cluster configuration options to Refactor or remove cluster configuration options to make it enabled by default on May 6, 2022
Selutario self-assigned this on May 24, 2022

Selutario commented May 24, 2022

Status update

I have already removed these configuration options:

  • Disabled
  • Name (cluster name)
  • Key

In addition, the default value of these options has been changed:

  • nodes parameter now points to 127.0.0.1 instead of NODE_IP.
  • node_name uses the hostname by default instead of node01.

Right now, nothing is required in the ossec.conf of the master node for it to work as a cluster (as long as the user does not need custom names, port, etc.). This would be the only thing to set on the worker nodes:

    <cluster>
        <node_type>worker</node_type>
        <nodes>
            <node>master-ip</node>
        </nodes>
    </cluster>

Everything is working fine in my tests, although more changes and much deeper testing are still required.


Selutario commented May 25, 2022

Status update

1. Removing node_type | Discarded

I have been doing some checks and tests to determine whether removing the <node_type> setting is feasible. In order to remove it, each manager should be able to get the list of all the IPs (for all NICs) of the host where it is installed. This way, it could infer whether <node></node> points to itself (master node) or to another host (worker).
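
A minimal sketch of that inference, assuming netifaces were used (the helper name infer_node_type is hypothetical, and this approach was ultimately discarded):

    import socket
    import netifaces  # external dependency that would have to be added

    def infer_node_type(configured_nodes):
        """Guess whether this host is the master by checking if any configured
        node resolves to one of the local interface addresses."""
        local_ips = set()
        for iface in netifaces.interfaces():
            for addr in netifaces.ifaddresses(iface).get(netifaces.AF_INET, []):
                local_ips.add(addr['addr'])

        local_names = {socket.gethostname(), 'localhost'}
        for node in configured_nodes:
            try:
                resolved = socket.gethostbyname(node)
            except socket.gaierror:
                resolved = node
            if node in local_names or resolved in local_ips:
                return 'master'
        return 'worker'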

However, this implies depending on an external library (netifaces) that appears to be archived, so this approach is discarded.

Also, it seems that <node_type></node_type> is required for wazuh-authd to start. Otherwise, this error is shown and Wazuh does not start:

# service wazuh-manager start  
Starting Wazuh v4.5.0...
wazuh-apid already running...
Started wazuh-csyslogd...
Started wazuh-dbd...
2022/05/25 12:53:19 wazuh-integratord: INFO: Remote integrations not configured. Clean exit.
Started wazuh-integratord...
Started wazuh-agentlessd...
2022/05/25 12:53:19 wazuh-authd: ERROR: Invalid option at cluster configuration
wazuh-authd did not start correctly.

2. Undefined node_name

I was testing whether the Wazuh binaries related to or affected by the cluster work well after the changes, but I found this error when trying to upgrade an agent with agent_upgrade:

root@wazuh-worker1:/# /var/ossec/bin/agent_upgrade -l
ID    Name                                Version                  
008   3840aedc5ad2                        Wazuh v4.1.5             

Total outdated agents: 1
root@wazuh-worker1:/# /var/ossec/bin/agent_upgrade -a 008
Internal error: 
root@wazuh-worker1:/#
root@wazuh-worker1:/#

The problem seems to be related to the node_name of each agent appearing as undefined in the database:

{
  "data": {
    "affected_items": [
      {
        "node_name": "undefined",
        "id": "000"
      },
      {
        "node_name": "undefined",
        "id": "002"
      },
      {
        "node_name": "undefined",
        "id": "004"
      },
      {
        "node_name": "undefined",
        "id": "005"
      },
      {
        "node_name": "undefined",
        "id": "008"
      }
    ],
    "total_affected_items": 5,
    "total_failed_items": 0,
    "failed_items": []
  },
  "message": "All selected agents information was returned",
  "error": 0
}

The origin of the error is that I'm not setting any <node_name></node_name> in the cluster configuration inside the ossec.conf. This is a problem, since one of the requirements was to dispense with the need to specify a name for each cluster node: by default, if no name is set, the hostname would be used.

    <cluster>
        <node_type>worker</node_type>
        <nodes>
            <node>wazuh-master</node>
        </nodes>
    </cluster>

However, it seems that wazuh-authd, wazuh-modulesd, or whichever service writes that information to global.db obtains it from the ossec.conf.

Since the node name does not appear in the database, distributed requests do not work correctly.

@Selutario

Status update

Today I have hardly been able to work on this development. I have only verified that agent_upgrade.py and other binaries work well despite using the default value in the nodes cluster option (which is now 127.0.0.1).

Regarding the undefined node_name problem reported above, we have decided that for now it will be mandatory to set node_name in the ossec.conf.

@Selutario

Status update

1. node_name

After analyzing the problem with node_name described in previous updates, I think that forcing the user to set a value for it is not a good idea. For example, let's say there is a user who does not use the cluster and does not have any <cluster></cluster> configuration block in the ossec.conf:

  • If we make it mandatory, the cluster should not start when node_name is not defined. Note that the cluster is now needed by other services, such as the API, so this would be a problem.
  • If we don't make it mandatory, the wazuh-clusterd process would work fine. However, other services that look for that tag in the ossec.conf would not behave as expected. In addition, node_name would be undefined in the global.db and distributed requests would fail.

As a consequence, I have extended a related issue so that a default value is used for node_name and other configuration options when nothing is specified by the user. In the case of node_name, the default value should be the hostname.
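
A minimal sketch of how that default filling could work (apply_cluster_defaults is a hypothetical helper, not the actual Wazuh code):

    import socket

    def apply_cluster_defaults(user_config):
        """Fill missing cluster options with defaults, using the hostname
        when node_name is not provided."""
        config = dict(user_config)
        config.setdefault('node_name', socket.gethostname())
        config.setdefault('nodes', ['127.0.0.1'])
        config.setdefault('bind_addr', '0.0.0.0')
        return config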

2. Loopback

The cluster process will always be running now. However, if the user sets bind_addr to localhost on the master node, no worker will be able to connect to it. Therefore, some tasks focused on processing information for workers are not useful in this situation. These tasks are:

  • Local integrity
  • Local agent-groups
  • Sendsync
  • Keepalive calculation for master (not for Local server)
  • API Request Queue for master (not for Local server)

There is also no need for a process pool, so child process creation is disabled when bind_addr is localhost in order to save resources.
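
A minimal sketch of how such a loopback check could gate the process pool (is_loopback_bind and create_process_pool are hypothetical names, not the actual implementation):

    import ipaddress
    import logging
    from concurrent.futures import ProcessPoolExecutor

    logger = logging.getLogger('cluster')

    def is_loopback_bind(bind_addr):
        """Return True when the master is bound to a loopback-only address."""
        if bind_addr == 'localhost':
            return True
        try:
            return ipaddress.ip_address(bind_addr).is_loopback
        except ValueError:
            return False

    def create_process_pool(config):
        """Skip the pool (and, by extension, the worker-facing tasks) on loopback."""
        if is_loopback_bind(config.get('bind_addr', '0.0.0.0')):
            logger.info('Localhost was set in "bind_addr". Some tasks will be disabled.')
            return None
        return ProcessPoolExecutor()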

API requests are queued and handled by the Local Server when they originate in the master node itself. Therefore, these changes should not affect its behavior. In any case, a log message is printed when the master starts if localhost is used:

2022/05/30 09:53:31 DEBUG: [Cluster] [Main] Removing '/var/ossec/queue/cluster/'.
2022/05/30 09:53:31 DEBUG: [Cluster] [Main] Removed '/var/ossec/queue/cluster/'.
2022/05/30 09:53:31 INFO: [Master] [Main] Localhost was set in "bind_addr". Some tasks will be disabled.
2022/05/30 09:53:31 INFO: [Local Server] [Main] Serving on /var/ossec/queue/cluster/c-internal.sock
2022/05/30 09:53:31 DEBUG: [Local Server] [Keep alive] Calculating.
2022/05/30 09:53:31 DEBUG: [Local Server] [Keep alive] Calculated.
2022/05/30 09:53:31 INFO: [Master] [Main] Serving on ('127.0.0.1', 1516)

3. To do

This issue is blocked until the two PRs this development depends on are merged.

Once it is ready to go, unit tests and AIT should be run and updated.

@Selutario

Status update

I have resolved the conflicts after merging the two PRs that were blocking this development. I'm now updating unit tests to fix any errors and increase coverage on changed functions.

To do

  • Update unit tests (in progress).
  • Manual testing.
