
Handle dynamic field mapping with the dot notation in the elasticsearch sink #3588

Closed
binarylogic opened this issue Aug 26, 2020 · 25 comments

Labels: have: must, meta: feedback, sink: elasticsearch, type: enhancement

Comments

@binarylogic
Contributor

I did hit https://stackoverflow.com/questions/59472323/elasticsearch-dynamic-field-mapping-and-json-dot-notation with the pod_label, and had to toss a Lua transform in after the json_parser, but so far it looks good. Results show up in Kibana.

This is something we should address within the elasticsearch sink.
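
For concreteness, a minimal illustration of the conflict (hypothetical label data, assuming dynamic mapping is enabled): Elasticsearch expands dotted key names into object paths, so two perfectly valid label sets can demand incompatible mappings for the same field:

    # First document: "app" is mapped as a string field.
    { "pod_labels": { "app": "foo" } }

    # Second document: dot expansion needs "app" to be an object, which
    # conflicts with the existing string mapping, so indexing fails.
    { "pod_labels": { "app.kubernetes.io/name": "bar" } }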

@binarylogic added labels: sink: elasticsearch, type: enhancement, meta: feedback, have: must, platform: kubernetes on Aug 26, 2020
@jszwedko changed the title from "Handle dynamic field mapping with the dot notation in the elasticsearc sink" to "Handle dynamic field mapping with the dot notation in the elasticsearch sink" on Aug 26, 2020
@spencergilbert
Contributor

In case it helps anyone else, I'm using this Lua script to dedot (specifically the Kubernetes pod labels from Vector) before indexing to ES.

[transforms.dedot_labels]
	type = "lua"
	inputs = ["input"]
	version = "2"

	hooks.process = "process"
	source = """
    function process(event, emit)
        if event.log.kubernetes == nil then
            return
        end
        dedot(event.log.kubernetes.pod_labels)
        emit(event)
    end

    -- Replace dots in the given map's keys with underscores, in place.
    -- Renames are staged in new_map first, because adding keys to a table
    -- while traversing it with pairs() is undefined behavior in Lua.
    function dedot(map)
        if map == nil then
            return
        end
        local new_map = {}
        local changed_keys = {}
        for k, v in pairs(map) do
            local dedotted = string.gsub(k, "%.", "_")
            if dedotted ~= k then
                new_map[dedotted] = v
                changed_keys[k] = true
            end
        end
        for k in pairs(changed_keys) do
            map[k] = nil
        end
        for k, v in pairs(new_map) do
            map[k] = v
        end
    end
    """

@tyrken
Contributor

tyrken commented Sep 22, 2020

+1 for this - I had to add a very similar Lua script myself.

As prior art for the new ES sink option, see "Replace_Dots" from Fluent-Bit, in https://docs.fluentbit.io/manual/pipeline/outputs/elasticsearch#configuration-parameters
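
As a concrete reference, enabling it there is a single flag on the es output (a sketch; Host and Match are placeholder values, only Replace_Dots matters here):

    [OUTPUT]
        Name         es
        Match        *
        Host         elasticsearch
        Replace_Dots On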

@jamtur01 removed the platform: kubernetes label on Nov 6, 2020
@bruceg
Member

bruceg commented Nov 30, 2020

I seem to be missing some context on this issue, and so am unclear what exactly is required for this. The link to SO references object mapping in label names and hints that we might need to reshape those labels somehow, but the replies indicate this is simply a matter of optionally replacing dots with underscores. Is that all that is required, or something more? Should it be optional, or is this always required?

@portante

portante commented Dec 1, 2020

Why do you need to de-dot the fields?

@spencergilbert
Contributor

Currently, if you don't use a Lua script like the one above, you'll hit issues at indexing time: a document with a Kubernetes pod label of app.kubernetes.io/instance will be indexed as nested objects instead of as a single field name.

@portante

portante commented Dec 2, 2020

Nested objects only happen if you have a mapping type specified ahead of time as nested, right? Or are you referring to some other behavior of Elasticsearch data type detection?

Ideally, you can index documents like this:

Keys                          Values
-----                         ------
app.kubernetes.io/instance    foobar
app.kubernetes.io/location    East
app.kubernetes.cpu            Fast
app.kubernetes.disk           Slow

JSON:

{
  "app": {
    "kubernetes": {
      "io": {
        "instance": "foobar",
        "location": "East"
      },
      "cpu": "Fast",
      "disk": "Slow"
    }
  }
}

Then one can query using simple dot notation, much like the original data (e.g. in the Kibana Discover tab's search bar): app.kubernetes.io.instance: foobar*

@spencergilbert
Contributor

So, based on your example above: if another label that's just app: foo gets indexed there, it will error, because the actual label key is the entire app.kubernetes.io/... string, which ES interprets as an object when in reality it's just a string.

@portante

portante commented Dec 2, 2020

So, based on your example above: if another label that's just app: foo gets indexed there, it will error, because the actual label key is the entire app.kubernetes.io/... string, which ES interprets as an object when in reality it's just a string.

Yes, but ... do we really want to support arbitrary fields like this? There is a limit to the number of fields per index (it defaults to 1,000). While one can raise the limit, it is still a limit, and bad field names in captured data (for example, keys that embed a "random" number in the field name; I've seen it happen) will blow up all indexing for other records that have "real" fields to be added.

Blowing up on app: foo could be handled in two ways: don't allow arbitrary fields, folding them into a raw text string under a known (reserved) field name; or move the conflicting value into a conflict namespace, like app.string: foo (where .string field names are otherwise disallowed). The first option avoids arbitrary fields entirely; the second allows them, but in a more controlled manner. Both require a known namespace for allowed fields.

In reality, nobody wants an indexer for Elasticsearch that is going to cause headaches like this. Once an index is corrupted by too many fields, all further data to be indexed is lost.
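
To make the two options concrete (a hypothetical sketch, not a settled design; the field names are illustrative):

    # Option 1: don't index arbitrary fields individually; preserve them
    # as raw text under a single known (reserved) field:
    { "labels_raw": "app=foo app.kubernetes.io/name=bar" }

    # Option 2: move the conflicting scalar into a ".string" conflict
    # namespace, leaving "app" free to be an object:
    { "app": { "string": "foo" } }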

@tyrken
Contributor

tyrken commented Dec 2, 2020

Your arguments against allowing arbitrary fields apply whether dots are handled properly or not. It's not right to confuse users with two complementary mistakes on ES's part that appear correct on the surface but are actually not once they look closely at the returned data or get into the weeds of more complex searches. The original data is a flat list of key/value strings, not an object hierarchy, and it's simpler to explain to users that ES cannot handle dots (despite the fixes in ES v2) and that they should search for underscore-replaced keys instead.

I was disappointed that Vector appeared to make the same mistake internally, see #2814 and #3595

Unfortunately we're not done here, as my Lua equivalent has also had to start renaming fields by type (i.e. "app" -> "app.string", "app.number" or "app.date", not renaming if it's a real object). This is because ES auto-detects field value types, and if two apps disagree on what to log, e.g. a "status" field being a numeric HTTP status code versus an alphanumeric identifier, then only the first one to log gets to write to the index.
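
(Roughly, that per-type renaming can look like the sketch below; suffix_types is a hypothetical helper in the style of the dedot script above. It uses Lua's type() as a stand-in for ES type detection, so it cannot distinguish dates:)

    function suffix_types(map)
        local renamed = {}
        for k, v in pairs(map) do
            if type(v) == "table" then
                renamed[k] = v    -- real objects keep their name
            else
                -- e.g. "app" -> "app.string", a numeric "status" -> "status.number"
                renamed[k .. "." .. type(v)] = v
            end
        end
        return renamed
    end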

We’re seriously considering giving up on ES as a logging destination.

@portante

portante commented Dec 2, 2020

Your arguments against allowing arbitrary fields apply whether dots are handled properly or not.

Correct.

It's not right to confuse users with two complementary mistakes on ES's part that appear correct on the surface but are actually not once they look closely at the returned data or get into the weeds of more complex searches.

Is it really a mistake on Elasticsearch's part? Every data ingestion system implements trade-offs to gain the desired behaviors the system offers. In the case of Elasticsearch, they offer powerful "pre-join" capabilities to speed up searches, sacrificing completely arbitrary field handling (Elasticsearch is not schema-less).

It is really up to the entity sending data to Elasticsearch to decide on how to map the input data to the JSON documents to be indexed.

Some entity has to map from the input space to the storage space.

The original data is a flat list of key/value strings, not an object hierarchy, and it's simpler to explain to users that ES cannot handle dots (despite the fixes in ES v2) and that they should search for underscore-replaced keys instead.

Certainly, one could argue it is not simpler: the key transformation is not always reversible, and the confusion brought on by the transforms can complicate using the data.

Ingestors need to put data into a storage system in such a way as to leverage the benefits of that storage system. Why would a user want a data ingestor to send data to their existing Elasticsearch instance if it does not take advantage of Elasticsearch's power?

In this case, if Vector offered a convenient list of accepted key transforms to handle most cases, gave users a way to add to that list, and documented the expected behavior when conflicts are encountered, they would be delighted.

I was disappointed that Vector appeared to make the same mistake internally, see #2814 and #3595.

Many implementations of data scrapers like Vector have encountered the same problem.

Unfortunately we're not done here, as my Lua equivalent has also had to start renaming fields by type (i.e. "app" -> "app.string", "app.number" or "app.date", not renaming if it's a real object).

Yes, that is a problem others have encountered and either ignored as a responsibility of the user (to the detriment of the admins who have to maintain a running system when it breaks) or handled with a clearly defined set of conflict-handling behaviors.

This is because ES auto-detects field value types, and if two apps disagree on what to log, e.g. a "status" field being a numeric HTTP status code versus an alphanumeric identifier, then only the first one to log gets to write to the index.

Dynamic templates can help here.
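
For instance, a minimal dynamic template (a sketch, assuming labels land under kubernetes.pod_labels and that you manage the index mapping yourself) can pin every label field to keyword, so a numeric value is stored as a string instead of creating a conflicting numeric mapping:

    {
      "mappings": {
        "dynamic_templates": [
          {
            "labels_as_keywords": {
              "path_match": "kubernetes.pod_labels.*",
              "mapping": { "type": "keyword" }
            }
          }
        ]
      }
    }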

Elasticsearch is not schema-less, so either the user has to agree to provide and manage their own mapping definitions, or agree to what Vector provides.

We’re seriously considering giving up on ES as a logging destination.

That seems a shame. There are many users of Elasticsearch out there.

Users can set up their indexes with mappings that handle their data properly. If Vector offered a way to try to index data normally and, on 400s, store the JSON document in a safe way for users to handle themselves (either in a separate Elasticsearch index, or locally; e.g. see py-es-bulk), then users of Elasticsearch could take advantage of the power of Vector, as long as they are clear on what Vector's error-handling behaviors are.

@bruceg
Member

bruceg commented Dec 3, 2020

I propose to add an optional key_mapping setting to this sink, initially with three values:

  1. asis (the default) makes no changes to the key names
  2. underscore will convert all non-alphanumeric characters to underscores
  3. periods will convert all non-alphanumeric characters to periods

Is this an approach that will assist you with the problems you are experiencing, or is some other non-linear transform more appropriate?
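
(For illustration, the sink configuration could look like the sketch below; the key_mapping option and the sink/input names are hypothetical, per the proposal above:)

    [sinks.es]
        type = "elasticsearch"
        inputs = ["app_logs"]
        key_mapping = "underscore"  # "asis" (default) | "underscore" | "periods"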

@spencergilbert
Contributor

I propose to add an optional key_mapping setting to this sink, initially with three values:

  1. asis (the default) makes no changes to the key names
  2. underscore will convert all non-alphanumeric characters to underscores
  3. periods will convert all non-alphanumeric characters to periods

Is this an approach that will assist you with the problems you are experiencing, or is some other non-linear transform more appropriate?

If underscore would replace a field in the Vector log that's app.kubernetes.io with app_kubernetes_io, that would definitely solve my use case.

@bruceg
Member

bruceg commented Dec 3, 2020

Noting that we are going to implement this as a remap function.
Blocked on #5377

@Tolsto
Contributor

Tolsto commented Mar 22, 2021

In case it helps anyone else, I'm using this Lua script to dedot (specifically the Kubernetes pod labels from Vector) before indexing to ES.

[...]
    function process(event, emit)
        if event.log.kubernetes == nil then
            return
        end
        dedot(event.log.kubernetes.pod_labels)
        emit(event)
    end
    [...]

@spencergilbert Is there a specific reason why you return if event.log.kubernetes == nil without emitting an event?

@binarylogic added this to the Vector 0.13 milestone on Mar 22, 2021
@spencergilbert
Contributor

@spencergilbert Is there a specific reason why you return if event.log.kubernetes == nil without emitting an event?

It's been a while... but I suspect I had ensured that only logs from the kubernetes_logs source reached it, and that all events had those labels; I also wouldn't be surprised if I'd missed something obvious. As mentioned above, this was pretty much taken from existing work around Fluent Bit (fluent/fluent-bit#1134 (comment)).

@karlbohlmark

Would it be possible to get a PR accepted that adds a configuration option to the kubernetes_logs source to do this at annotation time?

Of course that would not solve all potential Elasticsearch interop problems, but it seems likely that Kubernetes labels are the most common problem of this kind. (And for us specifically, changing the elasticsearch sink would not help, because we want to first replace fluentd with Vector for log collection and, for now, keep logstash writing into Elasticsearch from Kafka.)

@binarylogic
Contributor Author

@karlbohlmark it's not a bad idea, but I think @bruceg's recommendation is better, since it is within the sink that requires this mapping change. The rest of the upstream pipeline shouldn't concern itself with Elasticsearch limitations unless you want to normalize on that schema (if that makes sense). If you wanted to contribute this change (which would be great!), we would accept @bruceg's approach.

@karlbohlmark

@binarylogic I hear you, but then we would have to change our tentative plan for adopting vector so that we also replace logstash which is running as a StatefulSet reading from kafka and writing to Elasticsearch. For this I assume we would need the still-in-private-beta helm chart for vector running as an aggregator, so we would need to wait for that first.

@spencergilbert
Contributor

spencergilbert commented Aug 17, 2021

@binarylogic I hear you, but then we would have to change our tentative plan for adopting vector so that we also replace logstash which is running as a StatefulSet reading from kafka and writing to Elasticsearch. For this I assume we would need the still-in-private-beta helm chart for vector running as an aggregator, so we would need to wait for that first.

For what it's worth, the aggregator will be hitting a more "open" beta period in the next two weeks. In the interim (prior to handling via VRL) does the lua transform example not suffice, or are you having other issues with it?

@karlbohlmark
Copy link

Thanks for the response! No, we are running the lua transform now and it looks good, so we will likely go ahead with that in the mean time.

@spencergilbert
Contributor

Thanks for the response! No, we are running the lua transform now and it looks good, so we will likely go ahead with that in the mean time.

👍 Cool, just wanted to make sure: I knew it had been working when I was testing Vector myself, and wanted to confirm there wasn't a recent issue with it. I'm personally pretty confident about running Vector as an aggregator as well; I don't expect much to change Helm-chart-wise, but I'm happy to work with people on getting it running (we have a specific channel in Discord for it now).

@tomryanx

tomryanx commented Dec 2, 2021

I propose to add an optional key_mapping setting to this sink, initially with three values:

1. `asis` (the default) makes no changes to the key names

2. `underscore` will convert all non-alphanumeric characters to underscores

3. `periods` will convert all non-alphanumeric characters to periods

Is this an approach that will assist you with the problems you are experiencing, or is some other non-linear transform more appropriate?

Note that there are rules around legal/supported field names in Elasticsearch that are not captured here (especially around legal first characters).

Wherever it is implemented, solutions like options 2 and 3 above should probably be aligned with Elasticsearch's naming rules, and therefore treat first characters differently.

@portante

portante commented Dec 2, 2021

Note that one of the key behavioral differences is that a strict JSON hierarchy cannot faithfully map key/value pairs whose key names themselves have structure:

   key0: value1
   key0/subkey1: value2

vs

  # Note here the key structure is ignored.
  { "key0": "value1", "key0/subkey1": "value2" }

vs

  # This results in invalid JSON if you split on "/" because in JSON "key0" can't have two mappings.
  { "key0": "value1", "key0": { "subkey1": "value2" } }

@jszwedko
Member

jszwedko commented Aug 3, 2022

Closing since this is possible via remap now that it supports iteration.

@jszwedko closed this as completed on Aug 3, 2022
@rlabrecque

For future reference, for anyone ending up here trying to figure out how to dedot their keys with VRL, there's a good example here (it took a bit to find, though):

https://vector.dev/docs/reference/vrl/functions/#map_keys
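
A minimal sketch (assuming the event shape produced by the kubernetes_logs source; adjust the path to your pipeline):

    [transforms.dedot_labels]
    type = "remap"
    inputs = ["input"]
    source = '''
    if exists(.kubernetes.pod_labels) {
        .kubernetes.pod_labels = map_keys(object!(.kubernetes.pod_labels)) -> |key| {
            replace(key, ".", "_")
        }
    }
    '''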
