Replies: 5 comments 3 replies
-
Minor feedback (the direction seems great): according to its README, simdjson is 4x faster than RapidJSON. simdjson is used in Tenzir.
-
I've personally wondered why we don't just have memcached available through a BiF and skip Broker, the manager, etc.: just take a well-tuned OS key/value store and link it in easily.
-
Have we articulated clear requirements, and do we understand well the access patterns that are to be supported? For example, is key-value lookup the only access path? Or is a more elaborate, predicate-based selection of subsets of data also a valid workload? In other words, how do we describe the data access patterns of a multi-machine cluster, or of a script that would make use of stored data?
-
Full support for moving storage out of Broker. The data stores in Broker are quite a headache, because they add a lot of complexity and always felt to me like they ended up being more complicated than they should have been, because they tried to serve different roles from one API. For a key/value store, it always bugged me that there is no leader election or other recovery mechanism. I do have some similar concerns as @mavam regarding the overhaul: before talking about technical work, I think it would pay off to better understand the problem space and to investigate alternatives.
-
As mentioned in person, to leave a record here, with slight reference to #3450 (comment): if storage is initially about key-value storage, I could even see a first step fully in script-land, if a user was to provide the glue.
EDIT: the latter was meant to be something like the following.
If they opted for Redis, they could then also read values back via Python/Ruby and would be in control of the value encoding. That may not be as convenient as Vals, but it would move the responsibility of dealing with conversion errors or migrations up to the user, rather than burying them in the storage layer, where error reporting and recovery may be somewhat difficult.
-
Context
The existing Broker implementation tightly wraps the storage framework with the IPC framework, but this is likely completely unnecessary. The storage framework should be a consumer of the IPC framework if that functionality is needed, but doesn’t need to be part of it directly.
There is a consideration of whether the storage framework needs IPC at all. Depending on how the storage framework is implemented, and what targets are chosen, it may not be necessary to use the IPC framework. For example, if the target backend is something server-based and/or network-accessible (postgres, redis, mongo, etc), it should be possible for the framework to just open connections directly to that backend from each process.
Having processes open their own connections would have a number of benefits. First, we would avoid the performance implications of serializing the data into some network format, only to deserialize it to be stored into the backend. Second, this would allow us to avoid having to aggregate the data onto a single node before writing to the backend. Lastly, it would allow writers to write the data directly in a format native to the backend, which then enables a whole lot of use cases that are currently not feasible (such as log writers reading directly from the storage backends).
Technical Work
Ideally this framework would use the existing Zeek plugin system. That would allow extra backends to be implemented, using common idioms that are already familiar to plugin developers. It would also allow us to use a common script framework for talking to the storage backends. Plugins could decide how to implement their own storage models depending on what the underlying storage is.
The framework itself will follow the typical layout that we have for other frameworks: a manager type that handles coordination between Zeek and the plugins, a plugin component and component manager, and a set of plugins that handle the heavy lifting. Each plugin can implement its own set of script-land options for connecting to/opening whatever its underlying storage is, plus whatever is needed to read and write data with that storage.
The storage model ends up being a giant question mark here. Our current storage model is a key-value system, with the value being a binary blob that gets stored directly into whatever the underlying storage is. That’s obviously not the ideal model for something like an SQL database, but it allows Zeek to be consistent across all of the storage backends. If we leave the storage model up to the plugins to decide, they’re going to have to figure out how to convert from native Zeek structures such as records into something native to the underlying storage. This gets sticky pretty quickly. For SQL databases the best approach would be to write SQL directly, but we’re not going to convert all of the record structures into SQL tables unless we can do it programmatically at startup.
SQL databases come with some considerations of their own. The best approach would be to write SQL directly into tables for each record type, but we're not going to want to generate tables for every record, plus deal with tables and vectors, etc. If we were to do that, we would end up having to support a migration system of some sort for the cases where types change, and it gets really messy. With a key-value system based on JSON, you can have one table with key and value columns. We'll have to rely on RapidJSON to do the type conversions though, so we have to be careful that it doesn't become a bottleneck.
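For the single-table key/value layout, the schema could be as small as the following sketch. Table and column names are assumptions; the upsert syntax shown works in SQLite 3.24+ and Postgres.

```sql
-- One table per store; values are JSON-encoded blobs.
CREATE TABLE IF NOT EXISTS zeek_store (
    key   TEXT PRIMARY KEY,
    value TEXT NOT NULL  -- JSON produced above the backend layer
);

-- Upsert on write, point lookup on read:
INSERT INTO zeek_store (key, value) VALUES (?, ?)
    ON CONFLICT (key) DO UPDATE SET value = excluded.value;
SELECT value FROM zeek_store WHERE key = ?;
```

No per-record tables means no schema migrations when script-land record types change; the trade-off is that the database can't index or query inside the values.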
There’s also a big problem with multi-machine clusters. If the backends are able to do all of the communication with the underlying storage themselves, how do backends like flat files work? Could the backend be smart enough to know about its runtime environment (maybe reading something from the management framework?) and only open the flat file on a single node? Or should we just deprecate the idea of flat-file backends and require users to have a server-based backend? Note that this all applies not just to flat files, but to anything file-based, such as SQLite (which we definitely shouldn’t get rid of).
The existing script-land API in Broker implements a wide range of operations for modifying data in-situ for stored entries (like most of a map-style API), plus a bunch of operations for dealing with Broker data objects. For the new API, we’ll keep it simple to start with: