
Handle dynamic field mapping with the dot notation in the elasticsearch sink #3588

Closed
binarylogic opened this issue Aug 26, 2020 · 25 comments

Labels: have: must, meta: feedback, sink: elasticsearch, type: enhancement

Comments

@binarylogic
Contributor

I did hit https://stackoverflow.com/questions/59472323/elasticsearch-dynamic-field-mapping-and-json-dot-notation with the pod_label, and had to toss a Lua transform in after the json_parser, but so far it looks good. Results show up in Kibana.

This is something we should address within the elasticsearch sink.
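
For concreteness, a minimal illustration of the conflict (hypothetical label data, assuming dynamic mapping is enabled): Elasticsearch expands dotted key names into object paths, so two perfectly valid label sets can demand incompatible mappings for the same field:

    # First document: "app" is mapped as a string field.
    { "pod_labels": { "app": "foo" } }

    # Second document: dot expansion needs "app" to be an object, which
    # conflicts with the existing string mapping, so indexing fails.
    { "pod_labels": { "app.kubernetes.io/name": "bar" } }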

@binarylogic added labels: sink: elasticsearch, type: enhancement, meta: feedback, have: must, platform: kubernetes on Aug 26, 2020
@jszwedko changed the title from "Handle dynamic field mapping with the dot notation in the elasticsearc sink" to "Handle dynamic field mapping with the dot notation in the elasticsearch sink" on Aug 26, 2020
@spencergilbert
Contributor

In case it helps anyone else, I'm using this Lua script to dedot (specifically the Kubernetes pod labels from Vector) before indexing to ES.

[transforms.dedot_labels]
	type = "lua"
	inputs = ["input"]
	version = "2"

	hooks.process = "process"
	source = """
    function process(event, emit)
        if event.log.kubernetes == nil then
            return
        end
        dedot(event.log.kubernetes.pod_labels)
        emit(event)
    end

    -- Replace dots in the given map's keys with underscores, in place.
    -- Renames are staged in new_map first, because adding keys to a table
    -- while traversing it with pairs() is undefined behavior in Lua.
    function dedot(map)
        if map == nil then
            return
        end
        local new_map = {}
        local changed_keys = {}
        for k, v in pairs(map) do
            local dedotted = string.gsub(k, "%.", "_")
            if dedotted ~= k then
                new_map[dedotted] = v
                changed_keys[k] = true
            end
        end
        for k in pairs(changed_keys) do
            map[k] = nil
        end
        for k, v in pairs(new_map) do
            map[k] = v
        end
    end
    """

@tyrken
Contributor

tyrken commented Sep 22, 2020

+1 for this - I had to add a very similar Lua script myself.

As prior art for the new ES sink option, see "Replace_Dots" from Fluent-Bit, in https://docs.fluentbit.io/manual/pipeline/outputs/elasticsearch#configuration-parameters
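
As a concrete reference, enabling it there is a single flag on the es output (a sketch; Host and Match are placeholder values, only Replace_Dots matters here):

    [OUTPUT]
        Name         es
        Match        *
        Host         elasticsearch
        Replace_Dots On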

@jamtur01 removed the platform: kubernetes label on Nov 6, 2020
@bruceg
Member

bruceg commented Nov 30, 2020

I seem to be missing some context on this issue, and so am unclear what exactly is required for this. The link to SO references object mapping in label names and hints that we might need to reshape those labels somehow, but the replies indicate this is simply a matter of optionally replacing dots with underscores. Is that all that is required, or something more? Should it be optional, or is this always required?

@portante

portante commented Dec 1, 2020

Why do you need to de-dot the fields?

@spencergilbert
Contributor

Currently, if you don't use a Lua script like the one above, you'll hit issues at indexing time: a document with a Kubernetes pod label of app.kubernetes.io/instance will be indexed as nested objects instead of as a single field name.

@portante

portante commented Dec 2, 2020

Nested objects only happen if you have a mapping type specified ahead of time as nested, right? Or are you referring to some other behavior of Elasticsearch data type detection?

Ideally, you can index documents like this:

Keys                          Values
-----                         ------
app.kubernetes.io/instance    foobar
app.kubernetes.io/location    East
app.kubernetes.cpu            Fast
app.kubernetes.disk           Slow

JSON:

{
  "app": {
    "kubernetes": {
      "io": {
        "instance": "foobar",
        "location": "East"
      },
      "cpu": "Fast",
      "disk": "Slow"
    }
  }
}

Then one can query using simple dot notation, much like the original data (e.g. in the Kibana Discover tab's search bar): app.kubernetes.io.instance: foobar*

@spencergilbert
Contributor

So, based on your example above: if another label that's just app: foo gets indexed there, it will error, because the actual label key is the entire app.kubernetes.io/... string, which ES interprets as an object when in reality it's just a string.

@portante

portante commented Dec 2, 2020

So, based on your example above: if another label that's just app: foo gets indexed there, it will error, because the actual label key is the entire app.kubernetes.io/... string, which ES interprets as an object when in reality it's just a string.

Yes, but ... do we really want to support arbitrary fields like this? There is a limit to the number of fields per index (it defaults to 1,000). While one can raise the limit, it is still a limit, and bad field names in captured data (for example, keys that embed a "random" number in the field name; I've seen it happen) will blow up all indexing for other records that have "real" fields to be added.

Blowing up on app: foo could be handled in two ways: don't allow arbitrary fields, folding them into a raw text string under a known (reserved) field name; or move the conflicting value into a conflict namespace, like app.string: foo (where .string field names are otherwise disallowed). The first option avoids arbitrary fields entirely; the second allows them, but in a more controlled manner. Both require a known namespace for allowed fields.

In reality, nobody wants an indexer for Elasticsearch that is going to cause headaches like this. Once an index is corrupted by too many fields, all further data to be indexed is lost.
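
To make the two options concrete (a hypothetical sketch, not a settled design; the field names are illustrative):

    # Option 1: don't index arbitrary fields individually; preserve them
    # as raw text under a single known (reserved) field:
    { "labels_raw": "app=foo app.kubernetes.io/name=bar" }

    # Option 2: move the conflicting scalar into a ".string" conflict
    # namespace, leaving "app" free to be an object:
    { "app": { "string": "foo" } }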

@tyrken
Contributor

tyrken commented Dec 2, 2020

Your arguments against allowing arbitrary fields apply whether dots are handled properly or not. It's not right to confuse users with two complementary mistakes on ES's part that appear correct on the surface but are actually not once they look closely at the returned data or get into the weeds of more complex searches. The original data is a flat list of key/value strings, not an object hierarchy, and it's simpler to explain to users that ES cannot handle dots (despite the fixes in ES v2) and that they should search for underscore-replaced keys instead.

I was disappointed that Vector appeared to make the same mistake internally, see #2814 and #3595

Unfortunately we're not done here, as my Lua equivalent has also had to start renaming fields by type (i.e. "app" -> "app.string", "app.number" or "app.date", not renaming if it's a real object). This is because ES auto-detects field value types, and if two apps disagree on what to log, e.g. a "status" field being a numeric HTTP status code versus an alphanumeric identifier, then only the first one to log gets to write to the index.
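
(Roughly, that per-type renaming can look like the sketch below; suffix_types is a hypothetical helper in the style of the dedot script above. It uses Lua's type() as a stand-in for ES type detection, so it cannot distinguish dates:)

    function suffix_types(map)
        local renamed = {}
        for k, v in pairs(map) do
            if type(v) == "table" then
                renamed[k] = v    -- real objects keep their name
            else
                -- e.g. "app" -> "app.string", a numeric "status" -> "status.number"
                renamed[k .. "." .. type(v)] = v
            end
        end
        return renamed
    end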

We’re seriously considering giving up on ES as a logging destination.

@portante

portante commented Dec 2, 2020

Your arguments against allowing arbitrary fields apply whether dots are handled properly or not.

Correct.

It's not right to confuse users with two complementary mistakes on ES's part that appear correct on the surface but are actually not once they look closely at the returned data or get into the weeds of more complex searches.

Is it really a mistake on Elasticsearch's part? Every data ingestion system implements trade-offs to gain the desired behaviors the system offers. In the case of Elasticsearch, they offer powerful "pre-join" capabilities to speed up searches, sacrificing completely arbitrary field handling (Elasticsearch is not schema-less).

It is really up to the entity sending data to Elasticsearch to decide on how to map the input data to the JSON documents to be indexed.

Some entity has to map from the input space to the storage space.

The original data is a flat list of key/value strings, not an object hierarchy, and it's simpler to explain to users that ES cannot handle dots (despite the fixes in ES v2) and that they should search for underscore-replaced keys instead.

Certainly, one could argue it is not simpler: the key transformation is not always reversible, and the confusion brought on by the transforms can complicate using the data.

Ingestors need to put data into a storage system in such a way as to leverage the benefits of that storage system. Why would a user want a data ingestor to send data to their existing Elasticsearch instance if it does not take advantage of Elasticsearch's power?

In this case, if Vector offered a convenient list of accepted key transforms to handle most cases, gave users a way to add to that list, and documented the expected behavior when conflicts are encountered, they would be delighted.

I was disappointed that Vector appeared to make the same mistake internally, see #2814 and #3595.

Many implementations of data scrapers like Vector have encountered the same problem.

Unfortunately we're not done here, as my Lua equivalent has also had to start renaming fields by type (i.e. "app" -> "app.string", "app.number" or "app.date", not renaming if it's a real object).

Yes, that is a problem others have encountered and either ignored as a responsibility of the user (to the detriment of the admins who have to maintain a running system when it breaks) or handled with a clearly defined set of conflict-handling behaviors.

This is because ES auto-detects field value types, and if two apps disagree on what to log, e.g. a "status" field being a numeric HTTP status code versus an alphanumeric identifier, then only the first one to log gets to write to the index.

Dynamic templates can help here.
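
For instance, a minimal dynamic template (a sketch, assuming labels land under kubernetes.pod_labels and that you manage the index mapping yourself) can pin every label field to keyword, so a numeric value is stored as a string instead of creating a conflicting numeric mapping:

    {
      "mappings": {
        "dynamic_templates": [
          {
            "labels_as_keywords": {
              "path_match": "kubernetes.pod_labels.*",
              "mapping": { "type": "keyword" }
            }
          }
        ]
      }
    }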

Elasticsearch is not schema-less, so either the user has to agree to provide and manage their own mapping definitions, or agree to what Vector provides.

We’re seriously considering giving up on ES as a logging destination.

That seems a shame. There are many users of Elasticsearch out there.

Users can set up their indexes with mappings that handle their data properly. If Vector offered a way to try to index data normally and, on 400s, store the JSON document in a safe way for users to handle themselves (either in a separate Elasticsearch index, or locally; e.g. see py-es-bulk), then users of Elasticsearch could take advantage of the power of Vector, as long as they are clear on what Vector's error-handling behaviors are.

@bruceg
Member

bruceg commented Dec 3, 2020

I propose to add an optional key_mapping setting to this sink, initially with three values:

  1. asis (the default) makes no changes to the key names
  2. underscore will convert all non-alphanumeric characters to underscores
  3. periods will convert all non-alphanumeric characters to periods

Is this an approach that will assist you with the problems you are experiencing, or is some other non-linear transform more appropriate?
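
(For illustration, the sink configuration could look like the sketch below; the key_mapping option and the sink/input names are hypothetical, per the proposal above:)

    [sinks.es]
        type = "elasticsearch"
        inputs = ["app_logs"]
        key_mapping = "underscore"  # "asis" (default) | "underscore" | "periods"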

@spencergilbert
Contributor

I propose to add an optional key_mapping setting to this sink, initially with three values:

  1. asis (the default) makes no changes to the key names
  2. underscore will convert all non-alphanumeric characters to underscores
  3. periods will convert all non-alphanumeric characters to periods

Is this an approach that will assist you with the problems you are experiencing, or is some other non-linear transform more appropriate?

If underscore would replace a field in the Vector log that's app.kubernetes.io with app_kubernetes_io, that would definitely solve my use case.

@bruceg
Member

bruceg commented Dec 3, 2020

Noting that we are going to implement this as a remap function.
Blocked on #5377

@Tolsto
Contributor

Tolsto commented Mar 22, 2021

In case it helps anyone else, I'm using this Lua script to dedot (specifically the Kubernetes pod labels from Vector) before indexing to ES.

[...]
    function process(event, emit)
        if event.log.kubernetes == nil then
            return
        end
        dedot(event.log.kubernetes.pod_labels)
        emit(event)
    end
    [...]

@spencergilbert Is there a specific reason why you return if event.log.kubernetes == nil without emitting an event?

@binarylogic added this to the Vector 0.13 milestone on Mar 22, 2021
@spencergilbert
Contributor

@spencergilbert Is there a specific reason why you return if event.log.kubernetes == nil without emitting an event?

It's been a while... but I suspect I had ensured that only logs from the kubernetes_logs source reached it, and that all events had those labels; I also wouldn't be surprised if I'd missed something obvious. As mentioned above, this was pretty much taken from existing work around Fluent Bit (fluent/fluent-bit#1134 (comment)).

@karlbohlmark

Would it be possible to get a PR accepted that adds a configuration option to the kubernetes_logs source to do this at annotation time?

Of course that would not solve all potential Elasticsearch interop problems, but it seems likely that Kubernetes labels are the most common problem of this kind. (And for us specifically, changing the elasticsearch sink would not help, because we want to first replace fluentd with Vector for log collection and, for now, keep logstash writing into Elasticsearch from Kafka.)

@binarylogic
Contributor Author

@karlbohlmark it's not a bad idea, but I think @bruceg's recommendation is better, since it is within the sink that requires this mapping change. The rest of the upstream pipeline shouldn't concern itself with Elasticsearch limitations unless you want to normalize on that schema (if that makes sense). If you wanted to contribute this change (which would be great!), we would accept @bruceg's approach.

@karlbohlmark

@binarylogic I hear you, but then we would have to change our tentative plan for adopting vector so that we also replace logstash which is running as a StatefulSet reading from kafka and writing to Elasticsearch. For this I assume we would need the still-in-private-beta helm chart for vector running as an aggregator, so we would need to wait for that first.

@spencergilbert
Contributor

spencergilbert commented Aug 17, 2021

@binarylogic I hear you, but then we would have to change our tentative plan for adopting vector so that we also replace logstash which is running as a StatefulSet reading from kafka and writing to Elasticsearch. For this I assume we would need the still-in-private-beta helm chart for vector running as an aggregator, so we would need to wait for that first.

For what it's worth, the aggregator will be hitting a more "open" beta period in the next two weeks. In the interim (prior to handling via VRL) does the lua transform example not suffice, or are you having other issues with it?

@karlbohlmark
Copy link

Thanks for the response! No, we are running the lua transform now and it looks good, so we will likely go ahead with that in the mean time.

@spencergilbert
Contributor

Thanks for the response! No, we are running the lua transform now and it looks good, so we will likely go ahead with that in the mean time.

👍 Cool, just wanted to make sure: I knew it had been working when I was testing Vector myself, and wanted to confirm there wasn't a recent issue with it. I'm personally pretty confident about running Vector as an aggregator as well; I don't expect much to change Helm-chart-wise, but I'm happy to work with people on getting it running (we have a specific channel in Discord for it now).

@tomryanx

tomryanx commented Dec 2, 2021

I propose to add an optional key_mapping setting to this sink, initially with three values:

1. `asis` (the default) makes no changes to the key names

2. `underscore` will convert all non-alphanumeric characters to underscores

3. `periods` will convert all non-alphanumeric characters to periods

Is this an approach that will assist you with the problems you are experiencing, or is some other non-linear transform more appropriate?

Note that there are rules around legal/supported field names in Elasticsearch that are not captured here (especially around legal first characters).

Wherever it is implemented, solutions like options 2 and 3 above should probably be aligned with Elasticsearch's naming rules, and therefore treat first characters differently.

@portante

portante commented Dec 2, 2021

Note that one of the key behavioral differences is that a strict JSON hierarchy cannot faithfully map key/value pairs whose key names themselves have structure:

   key0: value1
   key0/subkey1: value2

vs

  # Note here the key structure is ignored.
  { "key0": "value1", "key0/subkey1": "value2" }

vs

  # This results in invalid JSON if you split on "/" because in JSON "key0" can't have two mappings.
  { "key0": "value1", "key0": { "subkey1": "value2" } }

@jszwedko
Member

jszwedko commented Aug 3, 2022

Closing since this is possible via remap now that it supports iteration.

@jszwedko closed this as completed on Aug 3, 2022
@rlabrecque

For future reference, for anyone ending up here trying to figure out how to dedot their keys with VRL, there's a good example here (it took a bit to find, though):

https://vector.dev/docs/reference/vrl/functions/#map_keys
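
A minimal sketch (assuming the event shape produced by the kubernetes_logs source; adjust the path to your pipeline):

    [transforms.dedot_labels]
    type = "remap"
    inputs = ["input"]
    source = '''
    if exists(.kubernetes.pod_labels) {
        .kubernetes.pod_labels = map_keys(object!(.kubernetes.pod_labels)) -> |key| {
            replace(key, ".", "_")
        }
    }
    '''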
