
SPIKE - Initial Agent comms API server design #23395

Open
Tracked by #22677
Selutario opened this issue May 14, 2024 · 11 comments

@Selutario
Member

Selutario commented May 14, 2024

Epic
#22677

Description

As part of #22677, we want to replace the current wazuh-remoted and wazuh-agentd services. Instead, we intend to develop a service that uses a standard protocol such as HTTP and request-driven communication, where events can be forwarded to any of the Wazuh servers, unlike the current session-oriented approach in which an agent sends all its messages to the server it is connected to.

However, we will also need to maintain a session-oriented connection so that the server can send commands to the agents on demand. Some proposals for this other mode of communication could include the use of websockets or gRPC.

The preliminary design of the server must have a /login endpoint so that agents/clients can authenticate and obtain a token using their credentials. Additionally, requests of four different types must be handled:

  • Stateless: After receiving events, the API will immediately respond to the client, without waiting to confirm whether the engine can process said events.
  • Stateful: The API must confirm that the event has been processed and indexed before responding to the client.
  • Commands: Session-oriented. The server must verify that a command exists (commands will be listed in an indexer index) and forward it to the appropriate place.
  • Management: API endpoints related to management tasks such as getting the information of a configuration group, receiving a package to upgrade the Agent, etc.

The API must be versioned.

This is a research issue.

Implementation restrictions

  • The opensearch-py library should be considered for API-Indexer communication.
  • Use existing libraries as much as possible, avoiding developing very complex components on our own, and test the maximum supported workload.
  • Collaborate with the Agent team to align on communication protocols and API integration.

Plan

  • Analyze everything that Wazuh does and create an extensive list of endpoints to support it. We can rely on the devel-agent team and their research issue for this.
    • A high level of detail is not necessary. It is enough for each endpoint in the list to include a description of what it does.
  • Create a list of use cases. For example:
    • How to connect and communicate with the indexer to perform X action.
    • How to connect and communicate with the engine to perform Y action.
  • Investigation of available/candidate and most suitable technologies: websockets, gRPC, etc.
  • Library research: Starlette, FastAPI, Connexion, etc. Carry out a test that verifies the maximum number of concurrent connections supported by the framework.
  • Initial server design that meets the requirements listed in Agent/server communication protocol #22677. For example: how to request the token? How to request a new token once it expires?
@GGP1
Member

GGP1 commented May 29, 2024

API

The Agent comms API will be exposed via an HTTP server using TLS as the transport layer. It will be versioned (/api/v1/{endpoint}) and contain the endpoints below.

Authentication

POST /authentication

Request a token from the server to make authenticated requests.

Body: UUID, password
Response: JWT token

Events

POST /events/stateless

Send events that are not necessarily processed by the engine.

Body: Event
Response: Event received message

POST /events/stateful

Send events that must be processed and persisted.

Body: Event
Response: Processing status

Commands

GET /commands

Get commands from the server. The connection will be held open for X seconds and any available commands will be sent to the agent. If there are no commands within a certain period of time (TBD), the request will time out.

Parameters: -
Response: Commands or timeout

POST /commands/results

Send the results of one or multiple commands.

Body: Commands results.
Response: Acknowledge message

Management

Note

In case we opt to transmit configurations and SCA policies via byte streams, we may only need GET /files.

GET /configuration

Get information about the group configuration.

Parameters: -
Response: Current configuration

GET /files

Download files from the Wazuh manager (WPK packages, configuration files, etc).

Parameters: File name
Response: File bytes stream

GET /sca

Get Security Configuration Assessment policies.

Parameters: filters (offset, limit, select)
Response: SCA policies list

@GGP1
Member

GGP1 commented May 30, 2024

API-Indexer communication

The communication with the indexer will be performed through the API it exposes, using the opensearch-py library as an SDK.

For example, if a new agent wants to log in, we craft and send an HTTP POST request to the indexer with the agent's identifiers so we can validate the authorization token.

flowchart TD

subgraph Agents
    Endpoints
    Clouds
    Other
end

subgraph Server["Server cluster"]

    subgraph Wazuh1["Server node n"]
        api1["Agent comms API"]
    end

    subgraph Wazuh2[" Server node 2"]
        api2["Agent comms API"]
    end

end

subgraph Indexer
    subgraph Data_states["Data states"]
        agents_list["Agents list"]
        states["States"]
    end
end

subgraph lb["Load Balancer"]
    lb_node["Per request"]
end

Agents -- /login --> lb
lb -- /login --> Wazuh1
lb -- /login --> Wazuh2
Wazuh1 -- Read credentials --> agents_list
Wazuh2 -- Read credentials --> agents_list

style Wazuh1 fill:#abc2eb
style Wazuh2 fill:#abc2eb
style Data_states fill:#abc2eb

API-Engine communication

For the communication with the Engine, we will use Unix sockets, like the Server management API does. Since an Engine instance will be running on each node of the cluster, there is no need to communicate over the network.

The API takes the request received from an agent, builds a request for the Engine, and sends it through the socket. In the case of stateless events, it is not necessary to wait for the response.
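The forwarding step could be sketched with asyncio's Unix socket support. The socket path and line-delimited framing are assumptions; the Engine's actual protocol is not defined in this issue.

```python
import asyncio
import json
from typing import Optional

# Hypothetical socket path; the real one depends on the installation layout
ENGINE_SOCKET = "/var/ossec/queue/sockets/engine"

async def forward_event(event: dict, wait_response: bool = False) -> Optional[bytes]:
    """Send an event to the local Engine instance through its Unix socket."""
    reader, writer = await asyncio.open_unix_connection(ENGINE_SOCKET)
    writer.write(json.dumps(event).encode() + b"\n")
    await writer.drain()
    # Stateless events do not wait for the Engine's answer
    response = await reader.readline() if wait_response else None
    writer.close()
    await writer.wait_closed()
    return response
```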

flowchart TD

subgraph Agents
    Endpoints
    Clouds
    Other
end

subgraph Server["Server cluster"]

    subgraph Wazuh1["Server node n"]
        api1["Agent comms API"]
        server1["Server <br/> management API"]
        Engine1["Engine"]
        VD1["VD"]
    end

    subgraph Wazuh2[" Server node 2"]
        api2["Agent comms API"]
        server2["Server <br/> management API"]
        Engine2["Engine"]
        VD2["VD"]
    end

end

subgraph lb["Load Balancer"]
    lb_node["Per request"]
end

Agents -- /events/stateless --> lb
lb -- /events/stateless --> Wazuh1
lb -- /events/stateless --> Wazuh2
api1 -- Unix socket --> Engine1
api2 -- Unix socket --> Engine2

style Wazuh1 fill:#abc2eb
style Wazuh2 fill:#abc2eb

@GGP1
Member

GGP1 commented May 31, 2024

Agent registration

The agent registration is exactly as explained in #22887: the agent first registers with the Server management API and then uses the /login endpoint to request a token from the Agent comms API.

Once it has the token, it uses it to perform requests to the API. When the token expires, the agent repeats the login process by calling the /login endpoint and obtaining a new token.

The agent may know by itself when the token expires (by looking at the token's timestamp) or notice when any of the API requests fail with 401 Unauthorized.
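The agent-side expiry check can be done without verifying the signature, since the agent only needs to read the exp claim to decide whether to log in again. A minimal sketch (the 30-second leeway is an assumption):

```python
import base64
import json
import time

def token_expired(token: str, leeway: int = 30) -> bool:
    # Decode the JWT payload (second segment) without verifying the
    # signature; the agent only needs the exp claim.
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore base64url padding
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    return payload["exp"] <= time.time() + leeway
```

On a 401 Unauthorized response, or when this check returns True, the agent repeats the /login flow.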

flowchart TD

subgraph Agents
    Endpoints
    Clouds
    Other_sources
end

subgraph Indexer["Indexer cluster"]

    subgraph Data_states["Data states"]
        agents_list["Agents list"]
    end

end

subgraph Server["Server cluster"]

    subgraph Wazuh1["Server node n"]
        api1["Agent comms API"]
        server1["Server <br/> management API"]

    end

end

subgraph lb["Load Balancer"]
    lb_node["Per request"]
end

Agents -- 1. /register --> lb
Agents -- 2. /login --> lb
Agents -- 3. /request_with_token --> lb

lb -- 1. /register --> server1
lb -- 2. /login --> api1
lb -- 3. /request_with_token --> api1

server1 -- 1. Store credentials --> agents_list
api1 -- 2. Read credentials --> agents_list

style Wazuh1 fill:#abc2eb
style Data_states fill:#abc2eb

More details about the registration system design will be discussed in #23393

@GGP1
Member

GGP1 commented Jun 3, 2024

Login fallback

As per the protocol requirements, every time an agent requests a token through the /login endpoint it should receive one. We should never fail and keep the token stored without giving it to the agent, and we cannot block the server execution forever, keeping the agent hanging while it waits for a response.

A potential solution could be to fail with a timeout and send the login token when it's ready in an event through the /commands endpoint (requires the agent to be listening for events).

Similarly, if the connection fails for any reason, we should determine if the agent should retry the request (ideal) or if we could use the same approach.

Authentication events

In case there are other factors that affect the authentication other than the token, the server could send an authentication event to the agent through the /commands endpoint, instructing the agent to re-authenticate before attempting any other action.

@GGP1
Member

GGP1 commented Jun 4, 2024

Server sent events

SSE and websockets

Note

Discarded in favor of HTTP long polling because of the asynchronous nature of the design, the load balancing architecture and the short token expiration times.

Two good alternatives for pushing events from the server are Server-Sent Events (SSE) and WebSockets.

Here is an image that explains their differences pretty well.

sse-websockets

Server-Sent Events are simpler and use the HTTP protocol under the hood; messages can flow in one direction only, and due to their simplicity no external library may be required. Another advantage is that enterprise firewalls do not have issues inspecting the packets, as happens with WebSockets.

I would only choose WebSockets if the message structure is a limitation and we want to use a specific encoding for the commands.

HTTP long polling

Another alternative to SSE and WebSockets, and the one preferred by the team, is HTTP long polling.

Long polling is a technique that uses a long-lived connection where the client sends a request to the server, and the server holds the request open until new data is available or a certain timeout is reached.

Once new data is available, the server responds with the updated information, and the client immediately sends another request to continue the cycle.

long_polling

The disadvantage of long polling is that every request requires a connection establishment and carries the HTTP headers instead of just the data. WebSockets and SSE are generally more scalable than HTTP-based long polling, as they allow for more efficient use of server resources.

On the positive side, long polling is the simplest of the three and does not require any library or protocol other than HTTP requests.
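Server-side, the long-polling cycle described above can be sketched with an asyncio queue per agent connection. The 25-second hold time and the queue wiring are assumptions; the actual timeout is TBD in the design.

```python
import asyncio

# Hypothetical hold time; the actual value is TBD
COMMAND_TIMEOUT = 25  # seconds the server keeps the request open

async def get_commands(queue: "asyncio.Queue[dict]") -> dict:
    # Hold the connection until a command arrives or the timeout fires,
    # mirroring the GET /commands behaviour described above
    try:
        command = await asyncio.wait_for(queue.get(), timeout=COMMAND_TIMEOUT)
        return {"commands": [command]}
    except asyncio.TimeoutError:
        # The agent immediately sends a new request after a timeout
        return {"commands": []}
```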

@GGP1
Member

GGP1 commented Jun 4, 2024

API authentication

This update focuses on the Agent comms API authentication; the many details regarding the agent registration and its credentials will be analyzed in #23393.

The API will use the JSON Web Tokens (JWT) open standard (RFC 7519) to authenticate agent requests.

All requests except POST /login (which returns the token) will contain an Authorization header with the value Bearer {token}, the JWT token being base64url encoded.

If the token is invalid, expired or non-existent, the API will respond with an HTTP 401 (Unauthorized) error.

Algorithm

The algorithm used to sign the tokens will be the Elliptic Curve Digital Signature Algorithm with the P-256 curve and the SHA-256 cryptographic hash function (ES256).

ECDSA is able to provide equivalent security to RSA cryptography but using shorter key sizes and with greater processing speed for many operations (ref).

The 256-bit key size was chosen over a larger one mainly because of its length-to-security ratio: a 256-bit key has enough keyspace to secure our tokens (it provides 128 bits of security), and longer keys could be considered overkill while negatively impacting the server's communications.

Payload

The payload will contain public claims such as the issuer, audience, subject and timestamps indicating when the token was issued and when it expires.

Additionally, it will contain a uuid field that will store the agent's UUID. This way, the API will be able to identify which agent is doing the request and respond according to it.

For example, if an agent sends a GET /commands request, the server must know who is listening for commands in that connection to send the correct commands.

The JWT token payload will look like:

{
  "iss": "wazuh", > issuer
  "aud": "Wazuh Agent comms API", > audience
  "iat": 1717524220, > issued at
  "exp": 1717525120, > expiration
  "uuid": "018fe477-31c8-7580-ae4a-e0b36713eb05" > agent UUID v7
}

The payload must be as compact as possible considering we are using long polling to fetch commands; every request will contain the JWT token and all its information as a consequence. By reducing the amount of information in it, we reduce the network bandwidth between agents and the server.

Timestamp and expiration

The iat field will be populated with the timestamp at which the token is issued and the exp field will be set to the timestamp at which the token is considered expired (iat + expiration_time).

Note

The server could also set the nbf (not valid before) field if for any reason the token should not be usable until later in the future.

The expiration time we will use is 900 seconds (15m), which is what we are using currently in the Server management API.

The agents will be able to tell when a token is about to expire by looking at the exp field; there is no need for the server to communicate that a token has expired because that information is already available to the agents.

Tokens revocation

No token revocation mechanism will be in place, since that would require database/indexer accesses on every request, which is what we are looking to avoid.

Tokens refreshing

Refresh tokens are used to request access tokens without requiring frequent use of the credentials. They are mainly useful to avoid cases in which users have to manually log in. In our design this is not the case: the credentials are generated and stored locally by the agent, which logs in to the server automatically.

Therefore, we will not use refresh tokens; agents will request new ones by hitting the POST /login endpoint.
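The scheme above (ES256-signed tokens, 15-minute lifetime, the payload shown earlier) could be sketched with PyJWT. The inline key generation is for illustration only; in practice the server would load its signing key from a secure location.

```python
import time

import jwt  # PyJWT, with the "cryptography" dependency installed
from cryptography.hazmat.primitives.asymmetric import ec

# Illustrative key generation; real key management is out of scope here
private_key = ec.generate_private_key(ec.SECP256R1())  # P-256, i.e. ES256
public_key = private_key.public_key()

payload = {
    "iss": "wazuh",
    "aud": "Wazuh Agent comms API",
    "iat": int(time.time()),
    "exp": int(time.time()) + 900,  # 15-minute lifetime
    "uuid": "018fe477-31c8-7580-ae4a-e0b36713eb05",
}

token = jwt.encode(payload, private_key, algorithm="ES256")

# Verification checks the signature plus the aud/iss/exp claims
claims = jwt.decode(
    token,
    public_key,
    algorithms=["ES256"],
    audience="Wazuh Agent comms API",
    issuer="wazuh",
)
```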

@GGP1
Member

GGP1 commented Jun 5, 2024

Asymmetric cryptography authentication

Note

Discarded in favor of JWT tokens. Signed events would require a query to the indexer on every request, while the other design requires one per token expiration (15m), and the latter is preferred.

Each agent will generate an ECDSA key pair (public/private) locally and send signed events to the server. On registration with the Server management API, the agent will provide its UUID and public key.

This way, once the agent publishes an event using POST /events, the Agent comms API will ask the indexer for the public key associated with the UUID and can validate that the event was indeed signed by the agent and that its content has not been altered. Only the agent could have generated the event signature because no one else has the private key.

An event would have the following format:

{
  "uuid": "018fe477-31c8-7580-ae4a-e0b36713eb05", > agent UUID
  "data": "data", > event information
  "datetime": 1717587895, > event creation timestamp
  "signature": "MEYCIQDi0+QlWVW9jHbIXNv78xGDpdXDI40bmbSdOEtCzfI7WwIhAOC5UBaiiSTpl93+HtgOyrr9s5u+RrsOUlrPXdqO/AWP" > signature of the concatenated values of the other fields
}

On the other endpoints, like GET /commands, the request should include the header

Authorization: Bearer <UUID>+<expiration_timestamp>+<previous_information_signature>

so the server can, again, authenticate the agent by validating the signature.

The token fields shown above would be encoded.

The value of expiration_timestamp is necessary to avoid always using the same token, and to prevent its use in case it is leaked.

The agent would choose a window in which the token is valid; it should generate new tokens frequently (every 15 minutes, for example). The server validates that the timestamp is not expired and could potentially reject timestamps that are more than X seconds in the future, preventing misuse of the header.

If an attacker changed the token timestamp, it wouldn't work because the signature would be invalid.
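The sign/verify round trip described above can be sketched with the `cryptography` library. The concatenation of the event fields into the signed message is an assumption for illustration; a real implementation would define a canonical serialization.

```python
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import ec

# Agent side: generate the key pair locally (only the public key is
# shared during registration) and sign the concatenated event fields
private_key = ec.generate_private_key(ec.SECP256R1())
public_key = private_key.public_key()

message = b"018fe477-31c8-7580-ae4a-e0b36713eb05" + b"data" + b"1717587895"
signature = private_key.sign(message, ec.ECDSA(hashes.SHA256()))

# Server side: fetch the public key by UUID from the indexer (omitted)
# and verify that the event was not altered
try:
    public_key.verify(signature, message, ec.ECDSA(hashes.SHA256()))
    valid = True
except InvalidSignature:
    valid = False
```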

Advantages

  • Elimination of the POST /login endpoint and the need to request tokens every X minutes. We would save all these requests.
  • The signature computation load is transferred to the agents. In an environment with a lot of agents, a server constantly generating tokens could become overloaded; distributing this task relieves the load.
  • The events integrity is guaranteed, unlike with JWT.
  • It's more secure, with JWT we would be sending the password during the registration and login, while in this design only the public key and signatures are shared.

The server would still validate the signature (as with JWT), but validation is far less expensive than signing.

Disadvantages

  • The signatures are now performed by the agents, although nowadays every device connected to a service generates signatures locally with relative ease. It shouldn't be an impediment to the requirement of keeping agents lightweight.
  • We would have to do one signature per event instead of one every X minutes, but this is the trade-off of the third advantage.
  • Every request would require a query to the indexer asking for the agent public key.

Side comments

  • Regarding the request size, the POST /events ones would be the same in both cases, since one includes the JWT token signature and the other the event signature on every request.
  • What we would do with JWT is basically the same, but with a password and with the information inside the token.
  • We could even replace the UUID with the public key, that could serve as an identifier.

TLDR

With this alternative we are doing the same things as with JWT but avoiding constant requests to the POST /login endpoint, distributing the burden of signatures to the agents. In addition, we gain the certainty that the events were not altered.

@GGP1
Member

GGP1 commented Jun 6, 2024

FastAPI concurrent connections benchmark

I ran some tests to validate the maximum number of concurrent connections supported by the FastAPI framework using the Docker image from https://github.com/tiangolo/uvicorn-gunicorn-fastapi-docker and a simple endpoint (files below).

The gunicorn configuration specified 8 workers per CPU core and an unlimited number of workers. I could establish 42350 TCP connections without the API failing with an out-of-memory error.

That number of connections established remained constant during the whole test, indicating that the benchmark was constrained by physical resources.

In conclusion, the limitations will likely be related to CPU, RAM or other resource consumption during the operations the API will perform. Most Python frameworks are minimal and based on the same technologies, so this shouldn't be a determining factor.

main.py
from fastapi import FastAPI
import time

app = FastAPI()

# Each request blocks forever, keeping its TCP connection established,
# so the number of ESTABLISHED sockets measures the connection ceiling
@app.get("/")
def root():
    while True:
        time.sleep(10)
Dockerfile
FROM tiangolo/uvicorn-gunicorn-fastapi:python3.11

COPY main.py /app/main.py
Test

Window 1

gasti@gasti:/tmp/benchmarks/fastapi$ docker build -t fastapi .
...
gasti@gasti:/tmp/benchmarks/fastapi$ docker run --rm -p 80:80 fastapi
...

Window 2

gasti@gasti:/tmp/benchmarks/fastapi$ echo "GET http://localhost:80/" | vegeta attack -insecure -rate=1000000 -connections=1000000 > /dev/null

Window 3

gasti@gasti:/tmp/benchmarks/fastapi$ netstat -an | grep ESTABLISHED | grep -w 80 | wc -l
42350

@GGP1
Member

GGP1 commented Jun 7, 2024

API framework

There are many Python web frameworks available. After reading the documentation of the most popular ones (FastAPI, Tornado, Falcon, Quart, Sanic, Connexion 3.0), I have noticed that the differences between them are mostly in their design. Most cover the same set of functionalities and share part of the syntax required to build the API.

With this in mind, and given that most of them are based on the same technologies (e.g. FastAPI and Connexion 3.0 are based on Starlette and Uvicorn) and consequently similar in terms of dependencies and performance, I would prioritize the community and maintenance of the project, and that's where FastAPI wins by a big margin.

The team has agreed to go with this option for the proof of concept development.

@vikman90 vikman90 added the phase/spike Spike label Jun 11, 2024
@fdalmaup fdalmaup self-assigned this Jun 18, 2024
@fdalmaup
Member

Proposed layout

├── apis
│   ├── server_management_api
│   │   ├── ...
│   │   ├── setup.py
│   │   └── tests
│   ├── comms_api
│   │   ├── __init__.py
│   │   ├── setup.py
│   │   ├── api.py
│   │   ├── configuration
│   │   │   └── ...
│   │   ├── dependencies
│   │   │   ├── __init__.py
│   │   │   ├── auth.py
│   │   │   ├── indexer.py
│   │   │   └── ...
│   │   ├── middlewares
│   │   │   ├── __init__.py
│   │   │   └── ...
│   │   ├── models
│   │   │   ├── __init__.py
│   │   │   ├── command.py
│   │   │   ├── event.py
│   │   │   ├── stateless_event.py
│   │   │   └── ...
│   │   ├── routers
│   │   │   ├── __init__.py
│   │   │   ├── commands.py
│   │   │   ├── events.py
│   │   │   ├── management.py
│   │   │   └── login.py
│   │   └── tests
│   ├── test
│   ├── Makefile
│   ├── scripts
│   └── wrappers
└── framework
    ├── examples
    ├── __init__.py
    ├── Makefile
    ├── pytest.ini
    ├── requirements-dev.txt
    ├── requirements.txt
    ├── scripts
    ├── setup.py
    ├── wazuh
    └── wrappers
  • Both the Server Management API and the Agents comms API will be part of the same apis directory.
  • The Agent comms API versioning will be carried out internally, i.e., as part of the code and not at the directory level. We want to reduce code duplication when breaking changes affect one component of the API, so we will use FastAPI routers to redirect requests in the newer version to the components without changes, modifying only what is required.

@GGP1
Member

GGP1 commented Jun 19, 2024

Responses standard proposal

Errors

I believe error responses should be concise and clear; they shouldn't contain more information than is required to understand why the request failed.

In this proposal, the error responses contain the message that explains what went wrong, and an error code to identify the exact issue.

The code could be the same as the response status if it's an error at the protocol level, or it could be a Wazuh-specific code if the logic failed at some point in the request handling.

This way, we are giving users enough context and information about what failed, facilitating the building of custom API clients that, depending on the error code, take different actions to make a successful request.

{
    "error": {
        "message": "Invalid JWT token",
        "code": 401
    }
}

Note

Some errors like the ones related to rate limiting will also contain metadata (i.e. retry-after, x-ratelimit-remaining, etc) in the response headers.

Success

Responses to successful requests will depend on the endpoint; we won't have the same structure wrapping all of our responses. In addition, responses that do not modify the status of the API and don't require returning any information will be empty.

For example:

  • GET /files: bytes stream
  • POST /authentication: JSON struct with a token field -> {"token": "<TOKEN>"}
  • POST /events/stateless: Empty response
  • POST /commands/results: Empty response
  • GET /commands: Timeout or JSON struct with a list of commands -> {"commands": [<commands>]}
