Handle dynamic field mapping with the dot notation in the elasticsearch sink #3588
In case it helps anyone else, I'm using this Lua script to dedot (specifically the Kubernetes pod labels from Vector) before indexing to ES:

```toml
[transforms.dedot_labels]
type = "lua"
inputs = ["input"]
version = "2"
hooks.process = "process"
source = """
function process(event, emit)
  if event.log.kubernetes == nil then
    return
  end
  dedot(event.log.kubernetes.pod_labels)
  emit(event)
end

-- Replace dots with underscores in the keys of the given table, in place.
function dedot(map)
  if map == nil then
    return
  end
  local new_map = {}
  local changed_keys = {}
  for k, v in pairs(map) do
    local dedotted = string.gsub(k, "%.", "_")
    if dedotted ~= k then
      new_map[dedotted] = v
      changed_keys[k] = true
    end
  end
  -- Remove the original dotted keys, then add the renamed keys back.
  for k in pairs(changed_keys) do
    map[k] = nil
  end
  for k, v in pairs(new_map) do
    map[k] = v
  end
end
"""
```
+1 for this - I had to add a very similar Lua script myself. As prior art for the new ES sink option, see "Replace_Dots" from Fluent-Bit, in https://docs.fluentbit.io/manual/pipeline/outputs/elasticsearch#configuration-parameters
I seem to be missing some context on this issue, and so am unclear what exactly is required for this. The link to SO references object mapping in label names and hints that we might need to reshape those labels somehow, but the replies indicate this is simply a matter of optionally replacing dots with underscores. Is that all that is required, or something more? Should it be optional, or is this always required?
Why do you need to de-dot the fields?
Currently, if you don't use a Lua script like the one above, you'll end up with issues when indexing: a document with a Kubernetes pod label whose name contains a dot will clash with the mapping Elasticsearch builds for other documents.
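For illustration, here is a minimal sketch of that failure mode with hypothetical label names (not the ones from the comment above): once Elasticsearch has dynamically mapped `app` as an object because a dotted label was indexed first, a later document where `app` is a plain string is rejected (typically with a `mapper_parsing_exception`).

```
# Document 1: "app.component" causes Elasticsearch to create an object mapping for "app"
{"kubernetes": {"pod_labels": {"app.component": "api"}}}

# Document 2: "app" is now a plain string at the same path, which conflicts
# with the object mapping created above and is rejected
{"kubernetes": {"pod_labels": {"app": "frontend"}}}
```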
Nested objects only happen if you have a mapping type specified ahead of time. Ideally, you can index documents with the dotted label names kept as flat keys, and then query using simple dot notation, much like the original data (i.e. in the Kibana Discover tab search bar).
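A minimal sketch of that ideal case, using a hypothetical `app.name` label: the document keeps the dotted key as-is, and the Kibana Discover search bar addresses it with the same dot notation.

```
# Indexed document, dotted label kept as a flat key
{"kubernetes": {"pod_labels": {"app.name": "frontend"}}}

# Kibana Discover query
kubernetes.pod_labels.app.name: "frontend"
```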
So based on your example above - if there's another label that gets indexed there that's just the bare prefix of a dotted label (say `app` alongside `app.name`), won't the mappings conflict?
Yes, but ... do we really want to support arbitrary fields like this? There is a limit to the number of fields available per index (it defaults to 1,000). While one can change the limit, it is still a limit, and bad field names in captured data (like those that do something stupid such as using a "random" number in the field name - seen it happen) will blow up indexing for other records that have "real" fields that should be added. In reality, nobody wants an indexer for Elasticsearch that is going to cause headaches like this. Once an index is corrupted by too many fields, all further data to be indexed would be lost.
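For reference, the per-index limit mentioned here is Elasticsearch's `index.mapping.total_fields.limit` setting, which defaults to 1000. It can be raised per index, for example (index name hypothetical):

```
PUT my-index/_settings
{
  "index.mapping.total_fields.limit": 2000
}
```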
Your arguments against allowing arbitrary fields apply whether dots are handled properly or not. It's not right to confuse users with two complementary mistakes on ES's part that appear on the surface to offer something correct, but actually aren't when they look closely at the returned data or get into the weeds of more complex searches. The original data is a flat list of key/value strings, not an object hierarchy, and it's simpler to explain to users that ES cannot handle dots despite the fixes in ES v2, and that they should search for underscore-replaced keys instead. I was disappointed that Vector appeared to make the same mistake internally; see #2814 and #3595.

Unfortunately we're not done here, as my Lua equivalent has also had to start renaming fields by type (i.e. "app" -> "app.string", "app.number" or "app.date", not renaming if a real object). This is because ES auto-detects field types from values, and if two apps disagree on what to log - e.g. a "status" field being a numeric HTTP status code or an alphanumeric identifier - then only the first one to log gets to write to the index.

We're seriously considering giving up on ES as a logging destination.
Correct.
Is it really a mistake on Elasticsearch's part? Every data ingestion system implements trade-offs to gain the desired behaviors the system offers. In the case of Elasticsearch, they offer powerful "pre-join" capabilities to speed up searches, sacrificing completely arbitrary field handling (Elasticsearch is not schema-less). It is really up to the entity sending data to Elasticsearch to decide on how to map the input data to the JSON documents to be indexed. Some entity has to map from the input space to the storage space.
Certainly one could argue it is not simpler, because the key transformation is not always reversible, and the confusion brought on by the transforms can complicate using the data. Ingestors need to put data into a storage system in such a way as to leverage the benefits of that storage system. Why would a user want to use a data ingestor to send data to their existing Elasticsearch instance if the ingestor putting data into it does not take advantage of its power? In this case, if Vector offered a convenient list of accepted key transforms to handle most cases, gave users a way to add to that list, and documented the expected behavior when conflicts are encountered, they would be delighted.
Many implementations of data scrapers like Vector have encountered the same problem.
Yes, that is a problem others have encountered and either ignored as a responsibility of the user (to the detriment of the admins who have to maintain a running system when it breaks) or handled with a clearly defined set of conflict-handling behaviors.
Dynamic templates can help here. Elasticsearch is not schema-less, so either the user has to agree to provide and manage their own mapping definitions, or agree to what Vector provides.
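A rough sketch of the kind of dynamic template this refers to, with a hypothetical index name and label path, mapping every pod label as a `keyword` so that value types never conflict:

```
PUT my-logs-index
{
  "mappings": {
    "dynamic_templates": [
      {
        "pod_labels_as_keywords": {
          "path_match": "kubernetes.pod_labels.*",
          "mapping": { "type": "keyword" }
        }
      }
    ]
  }
}
```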
That seems a shame. There are many users of Elasticsearch out there. Users can set up their indexes with mappings that handle their data properly. If Vector offered a way to try to index data normally, and on 400s stored the JSON document in a safe way for users to handle themselves (either in a separate Elasticsearch index, or locally - e.g. see py-es-bulk), then users of Elasticsearch could take advantage of the power of Vector, as long as they are clear on what the error-handling behaviors are with Vector.
I propose to add an optional setting to the elasticsearch sink that controls how field names containing dots are handled.

Is this an approach that will assist you with the problems you are experiencing, or is some other non-linear transform more appropriate?
If
Noting that we are going to implement this as a remap function. |
@spencergilbert Is there a specific reason why you return if `event.log.kubernetes` is `nil`?
It's been a while... but I suspect I had ensured only logs from the kubernetes_logs source reached this transform, and only had events with those labels - I also wouldn't be surprised if I'd missed something obvious. As mentioned above, this was pretty much taken from existing work around Fluent Bit (fluent/fluent-bit#1134 (comment)).
Would it be possible to get a PR accepted that would add a configuration option to the kubernetes_logs source that would do this at annotation time? Of course that would not solve all potential elasticsearch interop problems, but it seems likely that kubernetes labels is the most common problem of this kind. (And for us specifically, changing the elasticsearch sink would not help us because we want to first replace fluentd with vector for the log collection and leave logstash for writing into elasticsearch from kafka for now.) |
@karlbohlmark it's not a bad idea, but I think @bruceg's recommendation is better, since it is within the sink that requires this mapping change. The rest of the upstream pipeline shouldn't concern itself with Elasticsearch limitations unless you want to normalize on that schema (if that makes sense). If you wanted to contribute this change (which would be great!), we would accept @bruceg's approach.
@binarylogic I hear you, but then we would have to change our tentative plan for adopting vector so that we also replace logstash, which is running as a StatefulSet reading from kafka and writing to Elasticsearch. For this I assume we would need the still-in-private-beta helm chart for vector running as an aggregator, so we would need to wait for that first.
For what it's worth, the aggregator will be hitting a more "open" beta period in the next two weeks. In the interim (prior to handling via VRL) does the |
Thanks for the response! No, we are running the Lua transform now and it looks good, so we will likely go ahead with that in the meantime.
👍 cool, wanted to make sure - I knew it had been working when I was testing running Vector myself, and wanted to make sure there wasn't a recent issue with that. I'm personally pretty confident about running Vector as an aggregator as well; I don't expect much to change Helm-chart-wise, but I'm happy to work with people on getting it running (we have a specific channel in Discord for it now).
Note that there are rules around legal/supported field names in Elasticsearch that are not captured here (especially around legal first characters). Wherever it is implemented, solutions like options 2 and 3 above should probably be aligned with Elasticsearch naming rules, and therefore treat first characters differently.
Note that one of the key behavioral differences is that a strict JSON hierarchy does not map cleanly onto key/value pairs whose key names have internal structure: a flat key containing dots, the nested object hierarchy Elasticsearch expands it into, and an underscore-replaced flat key are three different shapes of the same data.
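A sketch of those three shapes, using a hypothetical label name:

```
# Flat key/value pair whose key name has internal structure (the original label)
{"app.kubernetes.io/name": "frontend"}

# The same key interpreted as a strict JSON hierarchy (how Elasticsearch expands the dots)
{"app": {"kubernetes": {"io/name": "frontend"}}}

# The de-dotted flat key (what the Lua transform earlier in this thread produces)
{"app_kubernetes_io/name": "frontend"}
```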
Closing since this is possible via VRL.
For future reference, for anyone ending up here trying to figure out how to dedot their keys with VRL, there's a good example here; it took a bit to find, though.
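To save others the search, this is roughly the shape such a transform can take - a sketch assuming VRL's `map_keys` and `replace` functions, applied recursively to every key in the event (the input name is hypothetical):

```toml
[transforms.dedot]
type = "remap"
inputs = ["kubernetes_logs_source"]
source = '''
# Replace dots with underscores in every key, recursing into nested objects
. = map_keys(., recursive: true) -> |key| { replace(key, ".", "_") }
'''
```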
This is something we should address within the elasticsearch sink.