
Refactor or remove cluster configuration options to make it enabled by default #13351

Closed
11 of 13 tasks
Selutario opened this issue May 4, 2022 · 5 comments · Fixed by #13655

Selutario commented May 4, 2022

Description

As explained in #7108, we want the cluster (master node) to be enabled by default. That involves removing some configuration options and refactoring others so they are easier to understand and set up. These are the current cluster options and their default values:

cluster_default_configuration = {
    'disabled': True,
    'node_type': 'master',
    'name': 'wazuh',
    'node_name': 'node01',
    'key': '',
    'port': 1516,
    'bind_addr': '0.0.0.0',
    'nodes': ['NODE_IP'],
    'hidden': 'no'
}

We should change the behavior of the cluster for the following options (a sketch of the resulting defaults follows this list):

  • disabled: Since the master will always be running, this option is no longer needed and should not be used by the cluster.
  • name: It is not possible to configure multiple clusters on one node, so this option is unnecessary and should be removed.
  • node_name: The hostname should be used by default.
  • node_type: The node type can be inferred from the list of master nodes (nodes). If the hostname or IP specified in that list belongs to the host where the cluster process is running, the node will be a master; otherwise, it will be a worker. However, it is necessary to study whether it is worth adding the new library required for this purpose (netifaces) and whether other problems could arise. Edit: this option will remain as it is and should not be removed from the configuration.
  • key: The goal is to eliminate the need for a fernet key and use SSL (actually TLS) to negotiate the symmetric key automatically at the start of the connection (Use SSLcontext instead of fernet key by default in Wazuh cluster #13320).
  • bind_addr: This option should not be removed, but it may not be included in the default configuration in the ossec.conf. In that case its value would be 0.0.0.0 or 127.0.0.1. If 127.0.0.1 is used, child processes should not be created and some tasks like Local Integrity or Local agent-groups should not run, since workers won't be able to connect to the master.
  • nodes: By default it should only have one value, 127.0.0.1. It is necessary to test how this affects binaries like agent_groups.
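
As an illustration only (not the final implementation), the trimmed defaults after this refactor could look roughly like the sketch below, with the hostname used when no node_name is given:

    import socket

    # Hypothetical sketch: 'disabled', 'name' and 'key' removed; hostname and
    # loopback used as defaults.
    cluster_default_configuration = {
        'node_type': 'master',
        'node_name': socket.gethostname(),  # hostname used when nothing is set
        'port': 1516,
        'bind_addr': '0.0.0.0',
        'nodes': ['127.0.0.1'],
        'hidden': 'no'
    }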

As seen above, the goal is not only to enable the cluster by default but also to make its configuration easier to understand.

Checks

wazuh/wazuh

  • Unit tests without failures. Updated and/or expanded if there are new functions/methods/outputs:
    • Cluster (framework/wazuh/core/cluster/tests/ & framework/wazuh/core/cluster/dapi/tests/)
    • Core (framework/wazuh/core/tests/)
    • SDK (framework/wazuh/tests/)
    • RBAC (framework/wazuh/rbac/tests/)
    • API (api/api/tests/)
  • API tavern integration tests without failures. Updated and/or expanded if needed (api/test/integration/):
    • Affected tests
    • Affected RBAC (black and white) tests
  • Review integration test mapping using the script (api/test/integration/mapping/integration_test_api_endpoints.json)
  • Review of spec.yaml examples and schemas (api/api/spec/spec.yaml)
  • Review exceptions remediation when any endpoint path changes or is removed (framework/wazuh/core/exception.py)
  • Changelog (CHANGELOG.md)

wazuh/wazuh-documentation

  • Migration from 3.X for changed endpoints (source/user-manual/api/equivalence.rst)
  • Update RBAC reference with new/modified actions/resources/relationships (source/user-manual/api/rbac/reference.rst)
Selutario changed the title from Change cluster ossec.conf template to Refactor cluster configuration options on May 4, 2022
Selutario changed the title from Refactor cluster configuration options to Refactor or remove cluster configuration options to make it enabled by default on May 6, 2022
Selutario self-assigned this on May 24, 2022

Selutario commented May 24, 2022

Status update

I have already removed these configuration options:

  • Disabled
  • Name (cluster name)
  • Key

In addition, the default value of these options has been changed:

  • nodes parameter now points to 127.0.0.1 instead of NODE_IP.
  • node_name uses the hostname by default instead of node01.

Right now, nothing is required in the ossec.conf of the master node for it to work as a cluster (as long as the user does not need custom names, port, etc.). This would be the only thing to set on the worker nodes:

    <cluster>
        <node_type>worker</node_type>
        <nodes>
            <node>master-ip</node>
        </nodes>
    </cluster>

Everything is working fine in my tests, although more changes and much deeper testing are still required.


Selutario commented May 25, 2022

Status update

1. Removing node_type | Discarded

I have been doing some checks and tests to determine whether removing the <node_type> setting is feasible. In order to remove it, each manager should be able to get the list of all the IPs (for all NICs) of the host where it is installed. This way, it could infer whether <node></node> points to itself (master node) or to another host (worker).
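
A minimal sketch of that inference, assuming netifaces were used (the helper name infer_node_type is hypothetical, and this approach was ultimately discarded):

    import socket
    import netifaces  # external dependency that would have to be added

    def infer_node_type(configured_nodes):
        """Guess whether this host is the master by checking if any configured
        node resolves to one of the local interface addresses."""
        local_ips = set()
        for iface in netifaces.interfaces():
            for addr in netifaces.ifaddresses(iface).get(netifaces.AF_INET, []):
                local_ips.add(addr['addr'])

        local_names = {socket.gethostname(), 'localhost'}
        for node in configured_nodes:
            try:
                resolved = socket.gethostbyname(node)
            except socket.gaierror:
                resolved = node
            if node in local_names or resolved in local_ips:
                return 'master'
        return 'worker'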

However, this implies depending on an external library (netifaces) that appears to be archived, so this approach is discarded.

Also, it seems that <node_type></node_type> is required for wazuh-authd to start. Otherwise, this error is shown and Wazuh does not start:

# service wazuh-manager start  
Starting Wazuh v4.5.0...
wazuh-apid already running...
Started wazuh-csyslogd...
Started wazuh-dbd...
2022/05/25 12:53:19 wazuh-integratord: INFO: Remote integrations not configured. Clean exit.
Started wazuh-integratord...
Started wazuh-agentlessd...
2022/05/25 12:53:19 wazuh-authd: ERROR: Invalid option at cluster configuration
wazuh-authd did not start correctly.

2. Undefined node_name

I was testing whether the Wazuh binaries related to or affected by the cluster work well after the changes, but I found this error when trying to upgrade an agent with agent_upgrade:

root@wazuh-worker1:/# /var/ossec/bin/agent_upgrade -l
ID    Name                                Version                  
008   3840aedc5ad2                        Wazuh v4.1.5             

Total outdated agents: 1
root@wazuh-worker1:/# /var/ossec/bin/agent_upgrade -a 008
Internal error: 
root@wazuh-worker1:/#
root@wazuh-worker1:/#

The problem seems to be related to the node_name of each agent appearing as undefined in the database:

{
  "data": {
    "affected_items": [
      {
        "node_name": "undefined",
        "id": "000"
      },
      {
        "node_name": "undefined",
        "id": "002"
      },
      {
        "node_name": "undefined",
        "id": "004"
      },
      {
        "node_name": "undefined",
        "id": "005"
      },
      {
        "node_name": "undefined",
        "id": "008"
      }
    ],
    "total_affected_items": 5,
    "total_failed_items": 0,
    "failed_items": []
  },
  "message": "All selected agents information was returned",
  "error": 0
}

The origin of the error is that I'm not setting any <node_name></node_name> in the cluster configuration inside the ossec.conf. This is a problem, since one of the requirements was to dispense with the need to specify a name for each cluster node: by default, if no name is set, the hostname would be used.

    <cluster>
        <node_type>worker</node_type>
        <nodes>
            <node>wazuh-master</node>
        </nodes>
    </cluster>

However, it seems that wazuh-authd, wazuh-modulesd, or whichever service writes that information to global.db obtains it from the ossec.conf.

Since the node name does not appear in the database, distributed requests do not work correctly.

@Selutario

Status update

Today I have hardly been able to work on this development. I have only verified that agent_upgrade.py and other binaries work well despite using the default value in the nodes cluster option (which is now 127.0.0.1).

Regarding the undefined node_name problem reported above, we have decided that for now it will be mandatory to set node_name in the ossec.conf.

@Selutario

Status update

1. node_name

After analyzing the problem with node_name described in previous updates, I think that forcing the user to set a value for it is not a good idea. For example, let's say there is a user who does not use the cluster and does not have any <cluster></cluster> configuration block in the ossec.conf:

  • If we make it mandatory, the cluster should not start when node_name is not defined. Note that the cluster is now needed by other services, such as the API, so this would be a problem.
  • If we don't make it mandatory, the wazuh-clusterd process would work fine. However, other services that look for that tag in the ossec.conf would not behave as expected. In addition, node_name would be undefined in the global.db and distributed requests would fail.

As a consequence, I have extended a related issue so that a default value is used for node_name and other configuration options when nothing is specified by the user. In the case of node_name, the default value should be the hostname.
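
A minimal sketch of how that default filling could work (apply_cluster_defaults is a hypothetical helper, not the actual Wazuh code):

    import socket

    def apply_cluster_defaults(user_config):
        """Fill missing cluster options with defaults, using the hostname
        when node_name is not provided."""
        config = dict(user_config)
        config.setdefault('node_name', socket.gethostname())
        config.setdefault('nodes', ['127.0.0.1'])
        config.setdefault('bind_addr', '0.0.0.0')
        return config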

2. Loopback

The cluster process will always be running now. However, if the user sets bind_addr to localhost on the master node, no worker will be able to connect to it. Therefore, some tasks focused on processing information for workers are not useful in this situation. These tasks are:

  • Local integrity
  • Local agent-groups
  • Sendsync
  • Keepalive calculation for master (not for Local server)
  • API Request Queue for master (not for Local server)

There is also no need for a process pool, so child process creation is disabled when bind_addr is localhost in order to save resources.
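
A minimal sketch of how such a loopback check could gate the process pool (is_loopback_bind and create_process_pool are hypothetical names, not the actual implementation):

    import ipaddress
    import logging
    from concurrent.futures import ProcessPoolExecutor

    logger = logging.getLogger('cluster')

    def is_loopback_bind(bind_addr):
        """Return True when the master is bound to a loopback-only address."""
        if bind_addr == 'localhost':
            return True
        try:
            return ipaddress.ip_address(bind_addr).is_loopback
        except ValueError:
            return False

    def create_process_pool(config):
        """Skip the pool (and, by extension, the worker-facing tasks) on loopback."""
        if is_loopback_bind(config.get('bind_addr', '0.0.0.0')):
            logger.info('Localhost was set in "bind_addr". Some tasks will be disabled.')
            return None
        return ProcessPoolExecutor()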

API requests are queued and handled by the Local Server when they originate in the master node itself. Therefore, these changes should not affect its behavior. In any case, a log message is printed when the master starts if localhost is used:

2022/05/30 09:53:31 DEBUG: [Cluster] [Main] Removing '/var/ossec/queue/cluster/'.
2022/05/30 09:53:31 DEBUG: [Cluster] [Main] Removed '/var/ossec/queue/cluster/'.
2022/05/30 09:53:31 INFO: [Master] [Main] Localhost was set in "bind_addr". Some tasks will be disabled.
2022/05/30 09:53:31 INFO: [Local Server] [Main] Serving on /var/ossec/queue/cluster/c-internal.sock
2022/05/30 09:53:31 DEBUG: [Local Server] [Keep alive] Calculating.
2022/05/30 09:53:31 DEBUG: [Local Server] [Keep alive] Calculated.
2022/05/30 09:53:31 INFO: [Master] [Main] Serving on ('127.0.0.1', 1516)

3. To do

This issue is blocked until the two PRs this development depends on are merged.

Once it is ready to go, unit tests and AIT should be run and updated.

@Selutario

Status update

I have resolved the conflicts after merging the two PRs that were blocking this development. I'm now updating unit tests to fix any errors and increase coverage on changed functions.

To do

  • Update unit tests (in progress).
  • Manual testing.
