SPIKE - Initial Comms API server design #23395
API
The Agent comms API will be exposed via an HTTP server using TLS as the transport layer. It will be versioned.
Authentication
POST /authentication
Request a token from the server to make authenticated requests. Body: UUID, password
Events
POST /events/stateless
Send events that are not necessarily processed by the engine. Body: Event
POST /events/stateful
Send events that must be processed and persisted. Body: Event
Commands
GET /commands
Get commands from the server. The connection will hang for X seconds and send any commands to the agent. If there are no commands in a certain period of time (TBD), the request will time out. Parameters: -
POST /commands/results
Send the results of one or multiple commands. Body: Commands results.
Management
Note: In case we opt to transmit configurations and SCA policies via byte streams, we may only need
GET /files
Download files from the Wazuh manager (WPK packages, configuration files, etc). Parameters: File name
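To make the endpoint list above easier to reason about, here is a minimal sketch of it as a versioned route table. The `/api/v1` prefix, the `Route` type and the `resolve` helper are assumptions made for illustration; the actual versioning scheme is not defined in this comment.

```python
from typing import NamedTuple, Optional

# Hypothetical version prefix: the text only says the API will be versioned.
API_PREFIX = "/api/v1"


class Route(NamedTuple):
    method: str
    path: str
    description: str


# Method/path pairs taken from the endpoint list above.
ROUTES = (
    Route("POST", "/authentication", "Request a token"),
    Route("POST", "/events/stateless", "Events not necessarily processed by the engine"),
    Route("POST", "/events/stateful", "Events that must be processed and persisted"),
    Route("GET", "/commands", "Long-poll for pending commands"),
    Route("POST", "/commands/results", "Results of one or multiple commands"),
    Route("GET", "/files", "Download files from the Wazuh manager"),
)


def resolve(method: str, url_path: str) -> Optional[Route]:
    """Return the matching route, or None if the method/path pair is unknown."""
    if not url_path.startswith(API_PREFIX + "/"):
        return None  # unversioned or foreign path
    relative = url_path[len(API_PREFIX):]
    for route in ROUTES:
        if route.method == method.upper() and route.path == relative:
            return route
    return None
```

With this table, `resolve("POST", "/api/v1/events/stateless")` returns the stateless events route, while an unversioned path such as `/commands` resolves to `None`.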
API-Indexer communication
The communication with the indexer will be performed through the API it exposes, using the opensearch-py library as an SDK. For example, if a new agent wants to log in, we craft and send an HTTP POST request to the indexer with the identifiers of the agent so we can validate the authorization token.
flowchart TD
subgraph Agents
Endpoints
Clouds
Other
end
subgraph Server["Server cluster"]
subgraph Wazuh1["Server node n"]
api1["Agent comms API"]
end
subgraph Wazuh2[" Server node 2"]
api2["Agent comms API"]
end
end
subgraph Indexer
subgraph Data_states["Data states"]
agents_list["Agents list"]
states["States"]
end
end
subgraph lb["Load Balancer"]
lb_node["Per request"]
end
Agents -- /login --> lb
lb -- /login --> Wazuh1
lb -- /login --> Wazuh2
Wazuh1 -- Read credentials --> agents_list
Wazuh2 -- Read credentials --> agents_list
style Wazuh1 fill:#abc2eb
style Wazuh2 fill:#abc2eb
style Data_states fill:#abc2eb
API-Engine communication
For the communication with the Engine, we will use Unix sockets, like the Server management API does. Since we will have an Engine instance running on each of the nodes of a cluster, there's no need to use the broader internet to communicate. The API receives the request from an agent, builds a request to the Engine and sends it through the socket. In the case of stateless events, it's not necessary to wait for the response.
flowchart TD
subgraph Agents
Endpoints
Clouds
Other
end
subgraph Server["Server cluster"]
subgraph Wazuh1["Server node n"]
api1["Agent comms API"]
server1["Server </br> management API"]
Engine1["Engine"]
VD1["VD"]
end
subgraph Wazuh2[" Server node 2"]
api2["Agent comms API"]
server2["Server </br> management API"]
Engine2["Engine"]
VD2["VD"]
end
end
subgraph lb["Load Balancer"]
lb_node["Per request"]
end
Agents -- /events/stateless --> lb
lb -- /events/stateless --> Wazuh1
lb -- /events/stateless --> Wazuh2
api1 -- Unix socket --> Engine1
api2 -- Unix socket --> Engine2
style Wazuh1 fill:#abc2eb
style Wazuh2 fill:#abc2eb
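The API-Engine path described above could be sketched roughly as follows. The newline-delimited JSON framing, the function names and the 5-second timeout are assumptions for illustration only; the Engine's real wire protocol is not specified here.

```python
import json
import socket


def forward_stateless_event(sock: socket.socket, event: dict) -> None:
    """Send one event and return immediately, without waiting for a reply."""
    # Assumed framing: compact JSON terminated by a newline.
    payload = json.dumps(event, separators=(",", ":")).encode("utf-8") + b"\n"
    sock.sendall(payload)


def forward_stateful_event(sock: socket.socket, event: dict, timeout: float = 5.0) -> bytes:
    """Send one event and block until a newline-terminated reply arrives."""
    forward_stateless_event(sock, event)
    sock.settimeout(timeout)
    buffer = b""
    while not buffer.endswith(b"\n"):
        chunk = sock.recv(4096)
        if not chunk:  # peer closed the connection
            break
        buffer += chunk
    return buffer.rstrip(b"\n")
```

In a real deployment `sock` would be an `AF_UNIX` stream socket connected to the Engine's socket path; for a quick local check, `socket.socketpair()` can stand in for both ends.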
Agent registration
The agent registration is exactly what has been explained in #22887: the agent first registers with the Server management API and then uses the Once it has the token, it uses it to perform requests to the API. When the token expires, the agent repeats the login process by calling the
flowchart TD
subgraph Agents
Endpoints
Clouds
Other_sources
end
subgraph Indexer["Indexer cluster"]
subgraph Data_states["Data states"]
agents_list["Agents list"]
end
end
subgraph Server["Server cluster"]
subgraph Wazuh1["Server node n"]
api1["Agent comms API"]
server1["Server </br> management API"]
end
end
subgraph lb["Load Balancer"]
lb_node["Per request"]
end
Agents -- 1. /register --> lb
Agents -- 2. /login --> lb
Agents -- 3. /request_with_token --> lb
lb -- 1. /register --> server1
lb -- 2. /login --> api1
lb -- 3. /request_with_token --> api1
server1 -- 1. Store credentials --> agents_list
api1 -- 2. Read credentials --> agents_list
style Wazuh1 fill:#abc2eb
style Data_states fill:#abc2eb
More details about the registration system design will be discussed in #23393.
Login fallback
As per the protocol requirements, every time an agent requests a token through the A potential solution could be to fail with a timeout and send the login token when it's ready in an event through the Similarly, if the connection fails for any reason, we should determine if the agent should retry the request (ideal) or if we could use the same approach.
Authentication events
In case there are other factors that affect the authentication other than the token, the server could send an authentication event to the agent through the
Server sent events
SSE and websockets
Note: Discarded in favor of HTTP long polling because of the asynchronous nature of the design, the load balancing architecture and the short token expiration.
Two good alternatives for server-side events are Server sent events (SSE) and WebSockets. Here is an image that explains their differences pretty well.
Server sent events are simpler and use the HTTP protocol under the hood. The messages can flow in one direction only and, due to their simplicity, no external library may be required. Another advantage is that enterprise firewalls do not have issues inspecting the packets, as happens with WebSockets. I would only choose WebSockets if the message structure is a limitation and we want to use a specific encoding for the commands.
HTTP long polling
Another alternative to SSE and WebSockets, and the one preferred by the team, is HTTP long polling. Long polling is a technique that uses a long-lived connection where the client sends a request to the server, and the server holds the request open until new data is available or a certain timeout is reached. Once new data is available, the server responds with the updated information, and the client immediately sends another request to continue the cycle.
The disadvantage of long polling is that every request requires establishing a connection and carries the HTTP headers instead of just the data. WebSocket and SSE are generally more scalable than HTTP-based long polling, as they allow for more efficient use of server resources.
On the positive side, long polling is the simplest of the three and does not require any library or protocol other than HTTP requests.
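The server side of the long-polling cycle described above can be sketched with plain asyncio. This is only an illustration of the "hold the request until data or timeout" logic: the `poll_commands` and `demo` names, the in-memory queue and the timeout values are all assumptions, and no HTTP framework is involved.

```python
import asyncio
from typing import List


async def poll_commands(queue: asyncio.Queue, timeout: float) -> List[dict]:
    """Hold the request open until a command arrives or the timeout elapses."""
    try:
        first = await asyncio.wait_for(queue.get(), timeout=timeout)
    except asyncio.TimeoutError:
        return []  # nothing arrived; the agent simply sends another request
    commands = [first]
    while not queue.empty():  # also return anything else already queued
        commands.append(queue.get_nowait())
    return commands


async def demo() -> tuple:
    queue: asyncio.Queue = asyncio.Queue()
    # First poll: no command is queued, so the call times out with an empty list.
    empty = await poll_commands(queue, timeout=0.05)
    # Second poll: a command shows up while the request is being held open.
    loop = asyncio.get_running_loop()
    loop.call_later(0.01, queue.put_nowait, {"command": "restart"})
    delivered = await poll_commands(queue, timeout=1.0)
    return empty, delivered
```

Running `asyncio.run(demo())` exercises both outcomes: a timed-out poll followed by one that returns as soon as a command is queued.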
API authentication
This update will focus on the The API will use the JSON Web Tokens (JWTs) open standard (RFC 7519) to authenticate agents' requests. All requests except to If the token is invalid, expired or non-existent, the API will respond with an HTTP 401 (Unauthorized) error.
Algorithm
The algorithm used to sign the tokens will be the Elliptic Curve Digital Signature Algorithm with the P-256 curve and the SHA-256 cryptographic hash function. ECDSA is able to provide equivalent security to RSA cryptography but using shorter key sizes and with greater processing speed for many operations (ref). The 256-bit key size was chosen over a 512 one mainly because of its length to security ratio: a 256-bit key has enough keyspace to secure our tokens (it provides 128 bits of security), and longer keys could be considered overkill while negatively impacting the server's communications.
Payload
The payload will contain public claims such as the issuer, audience, subject and timestamps indicating when the token was issued and when it expires. Additionally, it will contain a For example, if an agent sends a The JWT token payload will look like:
{
"iss": "wazuh", > issuer
"aud": "Wazuh Agent comms API", > audience
"iat": 1717524220, > issued at
"exp": 1717525120, > expiration
"uuid": "018fe477-31c8-7580-ae4a-e0b36713eb05", > agent UUID v7
}
The payload must be as compact as possible considering we are using long polling to fetch commands. As a consequence, every request will contain the JWT token and all its information. By reducing the amount of information in it, we reduce the network bandwidth used between agents and the server.
Timestamp and expiration
The Note The server could also set the The expiration time we will use is 900 seconds (15m), which is what we are currently using in the Server management API. The agents will be able to tell when a token is about to expire by looking at the
Token revocation
No token revocation mechanism will be in place, since that would require database/indexer accesses on every request, which is what we are looking to avoid.
Token refreshing
Refresh tokens are used to request access tokens without requiring the frequent use of the credentials. However, they are useful to avoid cases in which users have to manually log in. In our design this is not the case: the credentials are generated and stored locally by the agent, which logs in to the server automatically. Therefore, we will not use refresh tokens; agents will request new ones by hitting the
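As a rough illustration of the payload and expiration rules above, the sketch below builds the claims with a 900-second lifetime and lets an agent check, without verifying the signature, whether its token is close to expiring. Signing is deliberately left out (it would need a cryptography library), and the function names and the 60-second renewal margin are assumptions, not part of the design.

```python
import base64
import json
import time
from typing import Optional

TOKEN_TTL_SECONDS = 900  # 15 minutes, as stated above


def build_claims(agent_uuid: str, now: Optional[int] = None) -> dict:
    """Build the compact payload shown above for a given agent UUID."""
    issued_at = int(time.time()) if now is None else now
    return {
        "iss": "wazuh",
        "aud": "Wazuh Agent comms API",
        "iat": issued_at,
        "exp": issued_at + TOKEN_TTL_SECONDS,
        "uuid": agent_uuid,
    }


def read_unverified_claims(token: str) -> dict:
    """Decode the payload segment of a JWT without checking its signature."""
    payload_b64 = token.split(".")[1]
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)  # restore base64 padding
    return json.loads(base64.urlsafe_b64decode(padded))


def needs_renewal(token: str, now: Optional[int] = None, margin: int = 60) -> bool:
    """Agent-side check: True when the token expires within `margin` seconds."""
    current = int(time.time()) if now is None else now
    return read_unverified_claims(token)["exp"] - current <= margin
```

The unverified read is only suitable on the agent side, to decide when to log in again; the server must always validate the signature before trusting any claim.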
Asymmetric cryptography authentication
Note: Discarded in favor of JWT tokens. Signed events would require a query to the indexer on every request, while the other design will require one on every token expiration (15m), and the latter is preferred.
Each agent will generate a pair of ECDSA keys (public/private) locally and send the events signed to the server. On the agent registration with the Server management API, the agent will provide its UUID and public key. This way, once the agent publishes an event using An event would have the following format:
{
"uuid": "018fe477-31c8-7580-ae4a-e0b36713eb05", > agent UUID
"data": "data", > event information
"datetime": 1717587895, > event creation timestamp
"signature": "MEYCIQDi0+QlWVW9jHbIXNv78xGDpdXDI40bmbSdOEtCzfI7WwIhAOC5UBaiiSTpl93+HtgOyrr9s5u+RrsOUlrPXdqO/AWP" > signature of the other fields values concatenated
}
On the other endpoints like
so the server can, again, authenticate the agent by validating the signature
The value of The agent would choose a window in which the token is valid, it should generate new tokens frequently (every 15 minutes for example). The server validates that the timestamp is not expired and could potentially reject timestamps that are further than X seconds in the future, preventing a misuse of the header.
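The timestamp validation mentioned above could look roughly like this on the server side. Both constants are assumptions: 900 seconds mirrors the 15-minute example, and 30 seconds is a placeholder for the unspecified "X seconds" tolerance for timestamps in the future.

```python
import time
from typing import Optional

MAX_AGE_SECONDS = 900         # assumed validity window (15-minute example above)
MAX_FUTURE_SKEW_SECONDS = 30  # hypothetical value for the "X seconds" tolerance


def is_timestamp_acceptable(timestamp: int, now: Optional[int] = None) -> bool:
    """Reject timestamps that are expired or too far in the future."""
    current = int(time.time()) if now is None else now
    if timestamp > current + MAX_FUTURE_SKEW_SECONDS:
        return False  # too far ahead of the server clock
    return current - timestamp <= MAX_AGE_SECONDS
```

This check only covers the time window; the signature over the timestamp would still have to be validated separately.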
Advantages
Disadvantages
Side comments
TLDR
With this alternative we are doing the same things as with JWT but avoiding constant requests to the
FastAPI concurrent connections benchmark
I ran some tests to validate the maximum number of concurrent connections supported by the FastAPI framework, using the docker image from https://github.com/tiangolo/uvicorn-gunicorn-fastapi-docker and a simple endpoint (files below). The gunicorn configuration used specified 8 workers per CPU core and an unlimited number of workers. I could establish 42350 TCP connections without the API failing with an out of memory error. That number of established connections remained constant during the whole test, indicating that the benchmark was constrained by physical resources.
As a conclusion, the limitations will likely be related to the CPU, RAM or other resource consumption during the operations that the API will perform. Most Python frameworks are minimal and based on the same technologies, so this shouldn't be a determining factor.
main.py
from fastapi import FastAPI
import time

app = FastAPI()


@app.get("/")
def root():
    # Never return, so every request keeps its TCP connection open.
    while True:
        time.sleep(10)
Dockerfile
FROM tiangolo/uvicorn-gunicorn-fastapi:python3.11
COPY main.py /app/main.py
Test
Window 1
gasti@gasti:/tmp/benchmarks/fastapi$ docker build -t fastapi .
...
gasti@gasti:/tmp/benchmarks/fastapi$ docker run --rm -p 80:80 fastapi
...
Window 2
gasti@gasti:/tmp/benchmarks/fastapi$ echo "GET http://localhost:80/" | vegeta attack -insecure -rate=1000000 -connections=1000000 > /dev/null
Window 3
gasti@gasti:/tmp/benchmarks/fastapi$ netstat -an | grep ESTABLISHED | grep -w 80 | wc -l
42350
API framework
There are many Python web frameworks available. After reading the documentation of the most popular ones (FastAPI, Tornado, Falcon, Quart, Sanic, Connexion 3.0), I have noticed that the differences between them are mostly in their design. Most cover the same set of functionalities and share parts of the syntax required to build the API.
With this in mind, and given that most of them are based on the same technologies (e.g. FastAPI and Connexion 3.0 are based on Starlette and Uvicorn) and are consequently similar in terms of dependencies and performance, I would prioritize the community and maintenance of the project, and that's where FastAPI wins by a big margin. Together with the team, we have decided to go with this option for the proof of concept development.
Proposed layout
Responses standard proposal
Errors
I believe error responses should be concise and clear; they shouldn't contain more information than is required to understand why the request failed.
In this proposal, the error responses contain the message that explains what went wrong, and an error code to identify the exact issue. The code could be the same as the response status if it's an error at the protocol level, or it could be a Wazuh-specific code if the logic failed at some point in the request handling. This way, we are giving users enough context and information about what failed, facilitating the building of custom API clients that, depending on the error code, take different actions to make a successful request.
{
"error": {
"message": "Invalid JWT token",
"code": 403
}
}
Note: Some errors like the ones related to rate limiting will also contain metadata.
Success
Responses to successful requests will be dependent on the endpoint. We won't have the same structure wrapping all of our responses. In addition to this, responses that do not modify the status of the API and don't require giving any information in exchange will be empty. For example:
Conclusion
Description
We want to, as part of #22677, replace the current wazuh-remoted and wazuh-agentd services. Instead, we intend to develop a service that uses a standard protocol such as HTTP and request-driven communication, where different events can be forwarded to any of the Wazuh servers, unlike the current session-oriented approach where an agent sends all its messages to the server where it is connected.
However, we will also need to maintain a session-oriented connection so that the server can send commands to the agents on demand. Some proposals for this other mode of communication could include the use of websockets or gRPC.
The preliminary design of the server must have a /login endpoint so that agents/clients can authenticate and obtain a token from the obtained credentials. Additionally, requests of three different types must be handled:
The API must be versioned.
This is a research issue.
Implementation restrictions
Agent team to align on communication protocols and API integration.
Plan
devel-agent team for this and their research issue.