Replies: 5 comments 3 replies
-
Minor feedback (the direction seems great): according to its README, simdjson is 4x faster than RapidJSON. simdjson is used in Tenzir.
-
I've personally wondered why we don't just have memcached available through a BiF and skip Broker, the manager, etc.: just take a well-tuned OS key/value store and link it in easily.
-
Have we articulated clear requirements, and do we understand well the access patterns that are to be supported? For example, is key-value lookup the only access path? Or is a more elaborate, predicate-based selection of subsets of data also a valid workload? In other words, how do we describe the data access patterns of a multi-machine cluster, or of a script that would make use of stored data?
-
Full support for moving storage out of Broker. The data stores in Broker are quite a headache, because they add a lot of complexity and always felt to me like they ended up being more complicated than they should have been, because they tried to serve different roles from one API. For a key/value store, it always bugged me that there is no leader election or other recovery mechanism. I do have some similar concerns as @mavam regarding the overhaul: before talking about technical work, I think it would pay off to better understand the problem space and to investigate alternatives.
-
As mentioned in person, to leave a record here, with slight reference to #3450 (comment): if storage is initially about key-value storage, I could even see a first step fully in script-land, if a user was to provide the glue.
EDIT: the latter was meant to be something like the following.
If they opted for Redis, they could then also read values back via Python/Ruby and would be in control of the value encoding. That may not be as convenient as Vals, but it would move the responsibility of dealing with conversion errors or migrations up to the user, rather than burying them in the storage layer, where error reporting and recovery may be somewhat difficult.
-
Context
The existing Broker implementation tightly wraps the storage framework with the IPC framework, but this is likely completely unnecessary. The storage framework should be a consumer of the IPC framework if that functionality is needed, but doesn’t need to be part of it directly.
There is a consideration of whether the storage framework needs IPC at all. Depending on how the storage framework is implemented, and what targets are chosen, it may not be necessary to use the IPC framework. For example, if the target backend is something server-based and/or network-accessible (postgres, redis, mongo, etc), it should be possible for the framework to just open connections directly to that backend from each process.
Having processes open their own connections would have a number of benefits. First, we would avoid the performance implications of serializing the data into some network format, only to deserialize it to be stored into the backend. Second, this would allow us to avoid having to aggregate the data onto a single node before writing to the backend. Lastly, it would allow writers to write the data directly in a format native to the backend, which then enables a whole lot of use cases that are currently not feasible (such as log writers reading directly from the storage backends).
Technical Work
Ideally this framework would use the existing Zeek plugin system. That would allow extra backends to be implemented, using common idioms that are already familiar to plugin developers. It would also allow us to use a common script framework for talking to the storage backends. Plugins could decide how to implement their own storage models depending on what the underlying storage is.
The framework itself will follow the typical layout that we have for other frameworks: a manager type that handles coordination between Zeek and the plugins, a plugin component and component manager, and a set of plugins that handle the heavy lifting. Each plugin can implement its own set of script-land options for connecting to/opening whatever its underlying storage is, plus whatever is needed to read and write data with that storage.
The storage model ends up being a giant question mark here. Our current storage model is a key-value system, with the value being a binary blob that gets stored directly into whatever the underlying storage is. That’s obviously not the ideal model for something like an SQL database, but it allows Zeek to be consistent across all of the storage backends. If we leave the storage model up to the plugins to decide, they’re going to have to figure out how to convert from native Zeek structures such as records into something native to the underlying storage. This gets sticky pretty quickly. For SQL databases the best approach would be to write SQL directly, but we’re not going to convert all of the record structures into SQL tables unless we can do it programmatically at startup.
SQL databases come with some considerations of their own. The best approach would be to write SQL directly into tables for each record type, but we're not going to want to generate tables for every record, plus deal with tables and vectors, etc. If we were to do that, we would end up having to support a migration system of some sort for the cases where types change, and it gets really messy. With a key-value system based on JSON, you can have one table with key and value columns. We'll have to rely on RapidJSON to do the type conversions though, so we have to be careful that it doesn't become a bottleneck.
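For the single-table key/value layout, the schema could be as small as the following sketch. Table and column names are assumptions; the upsert syntax shown works in SQLite 3.24+ and Postgres.

```sql
-- One table per store; values are JSON-encoded blobs.
CREATE TABLE IF NOT EXISTS zeek_store (
    key   TEXT PRIMARY KEY,
    value TEXT NOT NULL  -- JSON produced above the backend layer
);

-- Upsert on write, point lookup on read:
INSERT INTO zeek_store (key, value) VALUES (?, ?)
    ON CONFLICT (key) DO UPDATE SET value = excluded.value;
SELECT value FROM zeek_store WHERE key = ?;
```

No per-record tables means no schema migrations when script-land record types change; the trade-off is that the database can't index or query inside the values.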
There’s also a big problem with multi-machine clusters. If the backends are able to do all of the communication with the underlying storage themselves, how do backends like flat files work? Could the backend be smart enough to know about its runtime environment (maybe reading something from the management framework?) and only open the flat file on a single node? Or should we just deprecate the idea of flat-file backends and require users to have a server-based backend? Note that this all applies not just to flat files, but to anything file-based, such as SQLite (which we definitely shouldn’t get rid of).
The existing script-land API in Broker implements a wide range of operations for modifying data in-situ for stored entries (like most of a map-style API), plus a bunch of operations for dealing with Broker data objects. For the new API, we’ll keep it simple to start with: