
SPIKE - Initial Agent comms API server design #23395

Open
Tracked by #22677
Selutario opened this issue May 14, 2024 · 11 comments

@Selutario
Member

Selutario commented May 14, 2024

Epic
#22677

Description

As part of #22677, we want to replace the current wazuh-remoted and wazuh-agentd services. Instead, we intend to develop a service that uses a standard protocol such as HTTP and request-driven communication, where events can be forwarded to any of the Wazuh servers, unlike the current session-oriented approach in which an agent sends all its messages to the server it is connected to.

However, we will also need to maintain a session-oriented connection so that the server can send commands to the agents on demand. Some proposals for this other mode of communication could include the use of websockets or gRPC.

The preliminary design of the server must have a /login endpoint so that agents/clients can authenticate and obtain a token using their credentials. Additionally, requests of four different types must be handled:

  • Stateless: After receiving events, the API will immediately respond to the client, without waiting to confirm whether the engine can process said events.
  • Stateful: The API must confirm that the event has been processed and indexed before responding to the client.
  • Commands: Session-oriented. The server must verify that a command exists (commands will be listed in an indexer index) and forward it to the appropriate place.
  • Management: API endpoints related to management tasks such as getting the information of a configuration group, receiving a package to upgrade the Agent, etc.

The API must be versioned.

This is a research issue.

Implementation restrictions

  • The opensearch-py library should be considered for API-Indexer communication.
  • Use existing libraries as much as possible, avoiding developing very complex components on our own, and test the maximum supported workload.
  • Collaborate with the Agent team to align on communication protocols and API integration.

Plan

  • Analyze everything that Wazuh does and create an extensive list of endpoints to support it. We can rely on the devel-agent team and their research issue for this.
    • A high level of detail is not necessary. It is enough for each endpoint in the list to include a description of what it does.
  • Create a list of use cases. For example:
    • How to connect and communicate with the indexer to perform X action.
    • How to connect and communicate with the engine to perform Y action.
  • Investigation of available/candidate and most suitable technologies: websockets, gRPC, etc.
  • Library research: Starlette, FastAPI, Connexion, etc. Carry out a test that verifies the maximum number of concurrent connections supported by the framework.
  • Initial server design that meets the requirements listed in Agent/server communication protocol #22677. For example: how to request the token? How to request a new token once it expires?
@GGP1
Member

GGP1 commented May 29, 2024

API

The Agent comms API will be exposed via an HTTP server using TLS as the transport layer. It will be versioned (/api/v1/{endpoint}) and contain the endpoints below.

Authentication

POST /authentication

Request a token from the server to make authenticated requests.

Body: UUID, password
Response: JWT token

Events

POST /events/stateless

Send events that are not necessarily processed by the engine.

Body: Event
Response: Event received message

POST /events/stateful

Send events that must be processed and persisted.

Body: Event
Response: Processing status

Commands

GET /commands

Get commands from the server. The connection will be held open for X seconds and any available commands will be sent to the agent. If there are no commands within a certain period of time (TBD), the request will time out.

Parameters: -
Response: Commands or timeout

POST /commands/results

Send the results of one or multiple commands.

Body: Commands results.
Response: Acknowledge message

Management

Note

In case we opt to transmit configurations and SCA policies via byte streams, we may only need GET /files.

GET /configuration

Get information about the group configuration.

Parameters: -
Response: Current configuration

GET /files

Download files from the Wazuh manager (WPK packages, configuration files, etc).

Parameters: File name
Response: File bytes stream

GET /sca

Get Security Configuration Assessment policies.

Parameters: filters (offset, limit, select)
Response: SCA policies list

@GGP1
Member

GGP1 commented May 30, 2024

API-Indexer communication

The communication with the indexer will be performed through the API it exposes, using the opensearch-py library as an SDK.

For example, if a new agent wants to log in, we craft and send an HTTP POST request to the indexer with the agent's identifiers so we can validate the authorization token.

flowchart TD

subgraph Agents
    Endpoints
    Clouds
    Other
end

subgraph Server["Server cluster"]

    subgraph Wazuh1["Server node n"]
        api1["Agent comms API"]
    end

    subgraph Wazuh2[" Server node 2"]
        api2["Agent comms API"]
    end

end

subgraph Indexer
    subgraph Data_states["Data states"]
        agents_list["Agents list"]
        states["States"]
    end
end

subgraph lb["Load Balancer"]
    lb_node["Per request"]
end

Agents -- /login --> lb
lb -- /login --> Wazuh1
lb -- /login --> Wazuh2
Wazuh1 -- Read credentials --> agents_list
Wazuh2 -- Read credentials --> agents_list

style Wazuh1 fill:#abc2eb
style Wazuh2 fill:#abc2eb
style Data_states fill:#abc2eb

API-Engine communication

For the communication with the Engine, we will use Unix sockets, like the Server management API does. Since an Engine instance will be running on each node of the cluster, there is no need to communicate over the network.

The API takes the request received from an agent, builds a request for the Engine, and sends it through the socket. In the case of stateless events, it is not necessary to wait for the response.
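The forwarding step could be sketched with asyncio's Unix socket support. The socket path and line-delimited framing are assumptions; the Engine's actual protocol is not defined in this issue.

```python
import asyncio
import json
from typing import Optional

# Hypothetical socket path; the real one depends on the installation layout
ENGINE_SOCKET = "/var/ossec/queue/sockets/engine"

async def forward_event(event: dict, wait_response: bool = False) -> Optional[bytes]:
    """Send an event to the local Engine instance through its Unix socket."""
    reader, writer = await asyncio.open_unix_connection(ENGINE_SOCKET)
    writer.write(json.dumps(event).encode() + b"\n")
    await writer.drain()
    # Stateless events do not wait for the Engine's answer
    response = await reader.readline() if wait_response else None
    writer.close()
    await writer.wait_closed()
    return response
```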

flowchart TD

subgraph Agents
    Endpoints
    Clouds
    Other
end

subgraph Server["Server cluster"]

    subgraph Wazuh1["Server node n"]
        api1["Agent comms API"]
        server1["Server <br/> management API"]
        Engine1["Engine"]
        VD1["VD"]
    end

    subgraph Wazuh2[" Server node 2"]
        api2["Agent comms API"]
        server2["Server <br/> management API"]
        Engine2["Engine"]
        VD2["VD"]
    end

end

subgraph lb["Load Balancer"]
    lb_node["Per request"]
end

Agents -- /events/stateless --> lb
lb -- /events/stateless --> Wazuh1
lb -- /events/stateless --> Wazuh2
api1 -- Unix socket --> Engine1
api2 -- Unix socket --> Engine2

style Wazuh1 fill:#abc2eb
style Wazuh2 fill:#abc2eb

@GGP1
Member

GGP1 commented May 31, 2024

Agent registration

The agent registration is exactly as explained in #22887: the agent first registers with the Server management API and then uses the /login endpoint to request a token from the Agent comms API.

Once it has the token, it uses it to perform requests to the API. When the token expires, the agent repeats the login process by calling the /login endpoint and obtaining a new token.

The agent may know by itself when the token expires (by looking at the token's timestamp) or notice when any of the API requests fail with 401 Unauthorized.
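The agent-side expiry check can be done without verifying the signature, since the agent only needs to read the exp claim to decide whether to log in again. A minimal sketch (the 30-second leeway is an assumption):

```python
import base64
import json
import time

def token_expired(token: str, leeway: int = 30) -> bool:
    # Decode the JWT payload (second segment) without verifying the
    # signature; the agent only needs the exp claim.
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore base64url padding
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    return payload["exp"] <= time.time() + leeway
```

On a 401 Unauthorized response, or when this check returns True, the agent repeats the /login flow.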

flowchart TD

subgraph Agents
    Endpoints
    Clouds
    Other_sources
end

subgraph Indexer["Indexer cluster"]

    subgraph Data_states["Data states"]
        agents_list["Agents list"]
    end

end

subgraph Server["Server cluster"]

    subgraph Wazuh1["Server node n"]
        api1["Agent comms API"]
        server1["Server <br/> management API"]

    end

end

subgraph lb["Load Balancer"]
    lb_node["Per request"]
end

Agents -- 1. /register --> lb
Agents -- 2. /login --> lb
Agents -- 3. /request_with_token --> lb

lb -- 1. /register --> server1
lb -- 2. /login --> api1
lb -- 3. /request_with_token --> api1

server1 -- 1. Store credentials --> agents_list
api1 -- 2. Read credentials --> agents_list

style Wazuh1 fill:#abc2eb
style Data_states fill:#abc2eb

More details about the registration system design will be discussed in #23393

@GGP1
Member

GGP1 commented Jun 3, 2024

Login fallback

As per the protocol requirements, every time an agent requests a token through the /login endpoint it should receive one. We should never fail and keep the token stored without giving it to the agent, and we cannot block the server execution forever, keeping the agent hanging while it waits for a response.

A potential solution could be to fail with a timeout and send the login token when it's ready in an event through the /commands endpoint (requires the agent to be listening for events).

Similarly, if the connection fails for any reason, we should determine if the agent should retry the request (ideal) or if we could use the same approach.

Authentication events

In case there are other factors that affect the authentication other than the token, the server could send an authentication event to the agent through the /commands endpoint, instructing the agent to re-authenticate before attempting any other action.

@GGP1
Member

GGP1 commented Jun 4, 2024

Server sent events

SSE and websockets

Note

Discarded in favor of HTTP long polling because of the asynchronous nature of the design, the load balancing architecture and the short token expiration times.

Two good alternatives for pushing events from the server are Server-Sent Events (SSE) and WebSockets.

Here is an image that explains their differences pretty well.

sse-websockets

Server-Sent Events are simpler and use the HTTP protocol under the hood; messages can flow in one direction only, and due to their simplicity no external library may be required. Another advantage is that enterprise firewalls do not have issues inspecting the packets, as happens with WebSockets.

I would only choose WebSockets if the message structure is a limitation and we want to use a specific encoding for the commands.

HTTP long polling

Another alternative to SSE and WebSockets, and the one preferred by the team, is HTTP long polling.

Long polling is a technique that uses a long-lived connection where the client sends a request to the server, and the server holds the request open until new data is available or a certain timeout is reached.

Once new data is available, the server responds with the updated information, and the client immediately sends another request to continue the cycle.

long_polling

The disadvantage of long polling is that every request requires a connection establishment and carries the HTTP headers instead of just the data. WebSockets and SSE are generally more scalable than HTTP-based long polling, as they allow for more efficient use of server resources.

On the positive side, long polling is the simplest of the three and does not require any library or protocol other than HTTP requests.
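Server-side, the long-polling cycle described above can be sketched with an asyncio queue per agent connection. The 25-second hold time and the queue wiring are assumptions; the actual timeout is TBD in the design.

```python
import asyncio

# Hypothetical hold time; the actual value is TBD
COMMAND_TIMEOUT = 25  # seconds the server keeps the request open

async def get_commands(queue: "asyncio.Queue[dict]") -> dict:
    # Hold the connection until a command arrives or the timeout fires,
    # mirroring the GET /commands behaviour described above
    try:
        command = await asyncio.wait_for(queue.get(), timeout=COMMAND_TIMEOUT)
        return {"commands": [command]}
    except asyncio.TimeoutError:
        # The agent immediately sends a new request after a timeout
        return {"commands": []}
```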

@GGP1
Member

GGP1 commented Jun 4, 2024

API authentication

This update focuses on the Agent comms API authentication; the many details regarding the agent registration and its credentials will be analyzed in #23393.

The API will use the JSON Web Tokens (JWT) open standard (RFC 7519) to authenticate agent requests.

All requests except POST /login (which returns the token) will contain an Authorization header with the value Bearer {token}, the JWT token being base64url encoded.

If the token is invalid, expired or non-existent, the API will respond with an HTTP 401 (Unauthorized) error.

Algorithm

The algorithm used to sign the tokens will be the Elliptic Curve Digital Signature Algorithm with the P-256 curve and the SHA-256 cryptographic hash function (ES256).

ECDSA is able to provide equivalent security to RSA cryptography but using shorter key sizes and with greater processing speed for many operations (ref).

The 256-bit key size was chosen over a larger one mainly because of its length-to-security ratio: a 256-bit key has enough keyspace to secure our tokens (it provides 128 bits of security), and longer keys could be considered overkill while negatively impacting the server's communications.

Payload

The payload will contain public claims such as the issuer, audience, subject and timestamps indicating when the token was issued and when it expires.

Additionally, it will contain a uuid field that will store the agent's UUID. This way, the API will be able to identify which agent is doing the request and respond according to it.

For example, if an agent sends a GET /commands request, the server must know who is listening for commands in that connection to send the correct commands.

The JWT token payload will look like:

{
  "iss": "wazuh", > issuer
  "aud": "Wazuh Agent comms API", > audience
  "iat": 1717524220, > issued at
  "exp": 1717525120, > expiration
  "uuid": "018fe477-31c8-7580-ae4a-e0b36713eb05" > agent UUID v7
}

The payload must be as compact as possible considering we are using long polling to fetch commands; every request will contain the JWT token and all its information as a consequence. By reducing the amount of information in it, we reduce the network bandwidth between agents and the server.

Timestamp and expiration

The iat field will be populated with the timestamp at which the token is issued and the exp field will be set to the timestamp at which the token is considered expired (iat + expiration_time).

Note

The server could also set the nbf (not valid before) field if for any reason the token should not be usable until later in the future.

The expiration time we will use is 900 seconds (15m), which is what we are using currently in the Server management API.

The agents will be able to tell when a token is about to expire by looking at the exp field; there is no need for the server to communicate that a token has expired because that information is already available to the agents.

Tokens revocation

No token revocation mechanism will be in place, since that would require database/indexer accesses on every request, which is what we are looking to avoid.

Tokens refreshing

Refresh tokens are used to request access tokens without requiring frequent use of the credentials. They are mainly useful to avoid cases in which users have to manually log in. In our design this is not the case: the credentials are generated and stored locally by the agent, which logs in to the server automatically.

Therefore, we will not use refresh tokens; agents will request new ones by hitting the POST /login endpoint.
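The scheme above (ES256-signed tokens, 15-minute lifetime, the payload shown earlier) could be sketched with PyJWT. The inline key generation is for illustration only; in practice the server would load its signing key from a secure location.

```python
import time

import jwt  # PyJWT, with the "cryptography" dependency installed
from cryptography.hazmat.primitives.asymmetric import ec

# Illustrative key generation; real key management is out of scope here
private_key = ec.generate_private_key(ec.SECP256R1())  # P-256, i.e. ES256
public_key = private_key.public_key()

payload = {
    "iss": "wazuh",
    "aud": "Wazuh Agent comms API",
    "iat": int(time.time()),
    "exp": int(time.time()) + 900,  # 15-minute lifetime
    "uuid": "018fe477-31c8-7580-ae4a-e0b36713eb05",
}

token = jwt.encode(payload, private_key, algorithm="ES256")

# Verification checks the signature plus the aud/iss/exp claims
claims = jwt.decode(
    token,
    public_key,
    algorithms=["ES256"],
    audience="Wazuh Agent comms API",
    issuer="wazuh",
)
```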

@GGP1
Member

GGP1 commented Jun 5, 2024

Asymmetric cryptography authentication

Note

Discarded in favor of JWT tokens. Signed events would require a query to the indexer on every request, while the other design requires one per token expiration (15m), and the latter is preferred.

Each agent will generate an ECDSA key pair (public/private) locally and send signed events to the server. On registration with the Server management API, the agent will provide its UUID and public key.

This way, once the agent publishes an event using POST /events, the Agent comms API will ask the indexer for the public key associated with the UUID and can validate that the event was indeed signed by the agent and that its content has not been altered. Only the agent could have generated the event signature because no one else has the private key.

An event would have the following format:

{
  "uuid": "018fe477-31c8-7580-ae4a-e0b36713eb05", > agent UUID
  "data": "data", > event information
  "datetime": 1717587895, > event creation timestamp
  "signature": "MEYCIQDi0+QlWVW9jHbIXNv78xGDpdXDI40bmbSdOEtCzfI7WwIhAOC5UBaiiSTpl93+HtgOyrr9s5u+RrsOUlrPXdqO/AWP" > signature of the concatenated values of the other fields
}

On the other endpoints, like GET /commands, the request should include the header

Authorization: Bearer <UUID>+<expiration_timestamp>+<previous_information_signature>

so the server can, again, authenticate the agent by validating the signature.

The token fields shown above would be encoded.

The value of expiration_timestamp is necessary to avoid always using the same token, and to prevent its use in case it is leaked.

The agent would choose a window in which the token is valid; it should generate new tokens frequently (every 15 minutes, for example). The server validates that the timestamp is not expired and could potentially reject timestamps that are more than X seconds in the future, preventing misuse of the header.

If an attacker changed the token timestamp, it wouldn't work because the signature would be invalid.
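The sign/verify round trip described above can be sketched with the `cryptography` library. The concatenation of the event fields into the signed message is an assumption for illustration; a real implementation would define a canonical serialization.

```python
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import ec

# Agent side: generate the key pair locally (only the public key is
# shared during registration) and sign the concatenated event fields
private_key = ec.generate_private_key(ec.SECP256R1())
public_key = private_key.public_key()

message = b"018fe477-31c8-7580-ae4a-e0b36713eb05" + b"data" + b"1717587895"
signature = private_key.sign(message, ec.ECDSA(hashes.SHA256()))

# Server side: fetch the public key by UUID from the indexer (omitted)
# and verify that the event was not altered
try:
    public_key.verify(signature, message, ec.ECDSA(hashes.SHA256()))
    valid = True
except InvalidSignature:
    valid = False
```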

Advantages

  • Elimination of the POST /login endpoint and the need to request tokens every X minutes. We would save all these requests.
  • The signature computation load is transferred to the agents. In an environment with a lot of agents, a server constantly generating tokens could become overloaded; distributing this task relieves the load.
  • The events integrity is guaranteed, unlike with JWT.
  • It's more secure, with JWT we would be sending the password during the registration and login, while in this design only the public key and signatures are shared.

The server would still validate the signature (as with JWT), but validation is far less expensive than signing.

Disadvantages

  • The signatures are now performed by the agents, although nowadays every device connected to a service generates signatures locally with relative ease. It shouldn't be an impediment to the requirement of keeping agents lightweight.
  • We would have to do one signature per event instead of one every X minutes, but this is the trade-off of the third advantage.
  • Every request would require a query to the indexer asking for the agent public key.

Side comments

  • Regarding the request size, the POST /events ones would be the same in both cases, since one includes the JWT token signature and the other the event signature on every request.
  • What we would do with JWT is basically the same, but with a password and with the information inside the token.
  • We could even replace the UUID with the public key, that could serve as an identifier.

TLDR

With this alternative we are doing the same things as with JWT but avoiding constant requests to the POST /login endpoint, distributing the burden of signatures to the agents. In addition, we gain the certainty that the events were not altered.

@GGP1
Member

GGP1 commented Jun 6, 2024

FastAPI concurrent connections benchmark

I ran some tests to validate the maximum number of concurrent connections supported by the FastAPI framework using the Docker image from https://github.com/tiangolo/uvicorn-gunicorn-fastapi-docker and a simple endpoint (files below).

The gunicorn configuration specified 8 workers per CPU core and an unlimited number of workers. I could establish 42350 TCP connections without the API failing with an out-of-memory error.

That number of connections established remained constant during the whole test, indicating that the benchmark was constrained by physical resources.

In conclusion, the limitations will likely be related to CPU, RAM or other resource consumption during the operations the API will perform. Most Python frameworks are minimal and based on the same technologies, so this shouldn't be a determining factor.

main.py
from fastapi import FastAPI
import time

app = FastAPI()

# Each request blocks forever, keeping its TCP connection established,
# so the number of ESTABLISHED sockets measures the connection ceiling
@app.get("/")
def root():
    while True:
        time.sleep(10)
Dockerfile
FROM tiangolo/uvicorn-gunicorn-fastapi:python3.11

COPY main.py /app/main.py
Test

Window 1

gasti@gasti:/tmp/benchmarks/fastapi$ docker build -t fastapi .
...
gasti@gasti:/tmp/benchmarks/fastapi$ docker run --rm -p 80:80 fastapi
...

Window 2

gasti@gasti:/tmp/benchmarks/fastapi$ echo "GET http://localhost:80/" | vegeta attack -insecure -rate=1000000 -connections=1000000 > /dev/null

Window 3

gasti@gasti:/tmp/benchmarks/fastapi$ netstat -an | grep ESTABLISHED | grep -w 80 | wc -l
42350

@GGP1
Member

GGP1 commented Jun 7, 2024

API framework

There are many Python web frameworks available. After reading the documentation of the most popular ones (FastAPI, Tornado, Falcon, Quart, Sanic, Connexion 3.0), I have noticed that the differences between them are mostly in their design. Most cover the same set of functionalities and share part of the syntax required to build the API.

With this in mind, and given that most of them are based on the same technologies (e.g. FastAPI and Connexion 3.0 are based on Starlette and Uvicorn) and consequently similar in terms of dependencies and performance, I would prioritize the community and maintenance of the project, and that's where FastAPI wins by a big margin.

The team has agreed to go with this option for the proof of concept development.

@vikman90 vikman90 added the phase/spike Spike label Jun 11, 2024
@fdalmaup fdalmaup self-assigned this Jun 18, 2024
@fdalmaup
Member

Proposed layout

├── apis
│   ├── server_management_api
│   │   ├── ...
│   │   ├── setup.py
│   │   └── tests
│   ├── comms_api
│   │   ├── __init__.py
│   │   ├── setup.py
│   │   ├── api.py
│   │   ├── configuration
│   │   │   └── ...
│   │   ├── dependencies
│   │   │   ├── __init__.py
│   │   │   ├── auth.py
│   │   │   ├── indexer.py
│   │   │   └── ...
│   │   ├── middlewares
│   │   │   ├── __init__.py
│   │   │   └── ...
│   │   ├── models
│   │   │   ├── __init__.py
│   │   │   ├── command.py
│   │   │   ├── event.py
│   │   │   ├── stateless_event.py
│   │   │   └── ...
│   │   ├── routers
│   │   │   ├── __init__.py
│   │   │   ├── commands.py
│   │   │   ├── events.py
│   │   │   ├── management.py
│   │   │   └── login.py
│   │   └── tests
│   ├── test
│   ├── Makefile
│   ├── scripts
│   └── wrappers
└── framework
    ├── examples
    ├── __init__.py
    ├── Makefile
    ├── pytest.ini
    ├── requirements-dev.txt
    ├── requirements.txt
    ├── scripts
    ├── setup.py
    ├── wazuh
    └── wrappers
  • Both the Server Management API and the Agents comms API will be part of the same apis directory.
  • The Agent comms API versioning will be carried out internally, i.e., as part of the code and not at the directory level. We want to reduce code duplication when breaking changes affect one component of the API, so we will use FastAPI routers to redirect requests in the newer version to the components without changes, modifying only what is required.

@GGP1
Member

GGP1 commented Jun 19, 2024

Responses standard proposal

Errors

I believe error responses should be concise and clear; they shouldn't contain more information than is required to understand why the request failed.

In this proposal, the error responses contain the message that explains what went wrong, and an error code to identify the exact issue.

The code could be the same as the response status if it's an error at the protocol level, or it could be a Wazuh-specific code if the logic failed at some point in the request handling.

This way, we are giving users enough context and information about what failed, facilitating the building of custom API clients that, depending on the error code, take different actions to make a successful request.

{
    "error": {
        "message": "Invalid JWT token",
        "code": 401
    }
}

Note

Some errors like the ones related to rate limiting will also contain metadata (i.e. retry-after, x-ratelimit-remaining, etc) in the response headers.

Success

Responses to successful requests will depend on the endpoint; we won't have the same structure wrapping all of our responses. In addition, responses that do not modify the status of the API and don't require returning any information will be empty.

For example:

  • GET /files: bytes stream
  • POST /authentication: JSON struct with a token field -> {"token": "<TOKEN>"}
  • POST /events/stateless: Empty response
  • POST /commands/results: Empty response
  • GET /commands: Timeout or JSON struct with a list of commands -> {"commands": [<commands>]}
