enhancement(sinks): Allow Encoding config to only/except list fields #1915

Merged
merged 45 commits into from
Mar 5, 2020

Conversation

Hoverbear
Contributor

@Hoverbear Hoverbear commented Feb 24, 2020

This should fix #1448 .

So far

Things should work as described in #1448 (comment):

[sinks.my_sink]
  type = "clickhouse"
  encoding.format = "json"
  encoding.only_fields = ["timestamp", "message"]
  encoding.except_fields = ["_meta"]

only_fields and except_fields are mutually exclusive, and only_fields should take priority. This, of course, can be represented as a table as well:

[sinks.my_sink]
 type = "clickhouse"

 [sinks.my_sink.encoding]
   format = "json"
   only_fields = ["timestamp", "message"]
   except_fields = ["_meta"]

Please help me verify:

  • Each applicable sink (Those with Encoding at all) has been updated
  • Each sink calls encoding.validate()?
  • Each sink calls encoding.apply_rules(_) on all sunk events.
  • Each sink has the _encoding_ partial in its .meta/_partials folder. E.g.:
    <%= render("_partials/_encoding.toml", namespace: "sinks.console.options") %>
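
For reviewers checking the validate() and apply_rules(_) calls, here is a minimal sketch of one possible reading of the rules above. It is not the code in this PR: the Event alias, the struct layout, and the exact validation rule (rejecting a field listed in both options) are assumptions made for illustration.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a log event: field name -> value.
type Event = BTreeMap<String, String>;

/// Hypothetical config holding only the two new options.
#[derive(Debug, Default, Clone)]
struct EncodingConfig {
    only_fields: Option<Vec<String>>,
    except_fields: Option<Vec<String>>,
}

impl EncodingConfig {
    /// Assumed rule: reject configs where a field is both kept and dropped.
    fn validate(&self) -> Result<(), String> {
        if let (Some(only), Some(except)) = (&self.only_fields, &self.except_fields) {
            if let Some(field) = only.iter().find(|f| except.contains(*f)) {
                return Err(format!(
                    "`{}` is listed in both only_fields and except_fields",
                    field
                ));
            }
        }
        Ok(())
    }

    /// Filter the event in place; only_fields takes priority when it is set.
    fn apply_rules(&self, event: &mut Event) {
        if let Some(only) = &self.only_fields {
            event.retain(|key, _| only.contains(key));
        } else if let Some(except) = &self.except_fields {
            for field in except {
                event.remove(field);
            }
        }
    }
}
```

With the example config above, an event carrying timestamp, message, and _meta would come out with only timestamp and message.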

To decide

I need some help from folks to determine the following:

  • Should this new setting be present on other sinks eg clickhouse, blackhole, or datadog?
  • Is there a different structure which might apply better here?
  • Documentation is pending

Signed-off-by: Ana Hobden <operator@hoverbear.org>
@Hoverbear Hoverbear self-assigned this Feb 24, 2020
@Hoverbear Hoverbear added the domain: sinks Anything related to the Vector's sinks label Feb 24, 2020
Signed-off-by: Ana Hobden <operator@hoverbear.org>
@Hoverbear Hoverbear changed the title feat: Allow Encoding config to black or whitelist fields feat: Allow Encoding config to black or whitelist fields (WIP) Feb 25, 2020
Signed-off-by: Ana Hobden <operator@hoverbear.org>
@binarylogic
Contributor

binarylogic commented Feb 25, 2020

Here's a list to make this easier:

  • aws_cloudwatch_logs
    • calls validate()
    • calls encoding.apply_rules(_)
    • has the _encoding_ partial in its .meta/_partials folder
  • aws_kinesis_firehose
    • calls validate()
    • calls encoding.apply_rules(_)
    • has the _encoding_ partial in its .meta/_partials folder
  • aws_kinesis_streams
    • calls validate()
    • calls encoding.apply_rules(_)
    • has the _encoding_ partial in its .meta/_partials folder
  • aws_s3
    • calls validate()
    • calls encoding.apply_rules(_)
    • has the _encoding_ partial in its .meta/_partials folder
  • clickhouse
    • calls validate()
    • calls encoding.apply_rules(_)
    • has the _encoding_ partial in its .meta/_partials folder
  • console
    • calls validate()
    • calls encoding.apply_rules(_)
    • has the _encoding_ partial in its .meta/_partials folder
  • datadog (Excluded: Metrics)
  • elasticsearch
    • calls validate()
    • calls encoding.apply_rules(_)
    • has the _encoding_ partial in its .meta/_partials folder
  • file
    • calls validate()
    • calls encoding.apply_rules(_)
    • has the _encoding_ partial in its .meta/_partials folder
  • gcp_cloud_storage
    • calls validate()
    • calls encoding.apply_rules(_)
    • has the _encoding_ partial in its .meta/_partials folder
  • gcp_stackdriver_logging
    • calls validate()
    • calls encoding.apply_rules(_)
    • has the _encoding_ partial in its .meta/_partials folder
  • http
    • calls validate()
    • calls encoding.apply_rules(_)
    • has the _encoding_ partial in its .meta/_partials folder
  • humio_logs
    • calls validate()
    • calls encoding.apply_rules(_)
    • has the _encoding_ partial in its .meta/_partials folder
  • kafka
    • calls validate()
    • calls encoding.apply_rules(_)
    • has the _encoding_ partial in its .meta/_partials folder
  • logdna
    • calls validate()
    • calls encoding.apply_rules(_)
    • has the _encoding_ partial in its .meta/_partials folder
  • loki
    • calls validate()
    • calls encoding.apply_rules(_)
    • has the _encoding_ partial in its .meta/_partials folder
  • new_relic_logs
    • calls validate() (in http sink)
    • calls encoding.apply_rules(_) (in http sink)
    • has the _encoding_ partial in its .meta/_partials folder
  • sematext
    • calls validate()
    • calls encoding.apply_rules(_) (in ES sink)
    • has the _encoding_ partial in its .meta/_partials folder
  • socket
    • calls validate()
    • calls encoding.apply_rules(_)
    • has the _encoding_ partial in its .meta/_partials folder
  • splunk_hec
    • calls validate()
    • calls encoding.apply_rules(_)
    • has the _encoding_ partial in its .meta/_partials folder

@binarylogic
Contributor

binarylogic commented Feb 25, 2020

Should this new setting be present on other sinks eg clickhouse, blackhole, or datadog?

I think so, since one of the use cases is to discard private data before writing downstream (e.g. dropping a _private object/field). We don't need this for blackhole since it does not output data.

@binarylogic binarylogic changed the title feat: Allow Encoding config to black or whitelist fields (WIP) enhancement(sinks): Allow Encoding config to black or whitelist fields (WIP) Feb 25, 2020
Signed-off-by: Ana Hobden <operator@hoverbear.org>
@Hoverbear Hoverbear requested a review from a user February 25, 2020 19:52
Signed-off-by: Ana Hobden <operator@hoverbear.org>
Member

@lukesteensen lukesteensen left a comment

This looks like a reasonable approach 👍

The main downside seems to be requiring the validate and apply_rules calls. I think the eventual way around that is to centralize encoders so that it all can be pushed into the "library" portion of the code. That's a more invasive change to the sinks, so it makes sense to put it off for now. Could be worth opening an issue though.

@Hoverbear
Contributor Author

Hoverbear commented Feb 26, 2020

@lukesteensen Yeah, I think that is a good future step. Having better usability was kind of hampered by each sink having its own encoding type in most cases. If we could unify those it'd be really nice.

@lukesteensen
Member

@Hoverbear yep, exactly. We'd need to find a good way to let each sink specify the types of encoding it supported and then delegate the actual functionality to a shared implementation.

@Hoverbear
Contributor Author

@lukesteensen One thing we can do is have an Encoder trait with a required encode function that each sink could then implement. Then the EncodingConfig<E: Encoder> can call Encoder::encode.
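
A rough sketch of that idea, for discussion only. Everything here is hypothetical: the ConsoleCodec enum, the use of &str as the event type, and the naive JSON formatting are assumptions, not code from this PR.

```rust
/// Hypothetical trait: each sink-specific codec type says how to encode an event.
trait Encoder {
    fn encode(&self, event: &str) -> Vec<u8>;
}

/// Hypothetical codec enum for one sink.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ConsoleCodec {
    Text,
    Json,
}

impl Encoder for ConsoleCodec {
    fn encode(&self, event: &str) -> Vec<u8> {
        match self {
            ConsoleCodec::Text => event.as_bytes().to_vec(),
            // Naive JSON for illustration only; no escaping is performed.
            ConsoleCodec::Json => format!("{{\"message\":\"{}\"}}", event).into_bytes(),
        }
    }
}

/// Shared config, generic over the sink's codec type.
struct EncodingConfig<E: Encoder> {
    codec: E,
}

impl<E: Encoder> EncodingConfig<E> {
    /// Shared entry point; field filtering could be done here once
    /// instead of in every sink.
    fn encode(&self, event: &str) -> Vec<u8> {
        self.codec.encode(event)
    }
}
```

The appeal of this shape is that each sink keeps its own codec enum for config purposes while the shared wrapper owns the common behaviour.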

@lukesteensen
Member

@Hoverbear Oooo I like that. Sinks could still define their own encoders for config purposes, but maybe delegate to shared utils internally? Would also be possible to make that change incrementally.

@Hoverbear
Contributor Author

Yup! How about as a follow up to this so we don't scope creep?

@lukesteensen
Member

Sounds good!

@Hoverbear
Contributor Author

@binarylogic No, we settled on encoding.codec etc

@binarylogic
Contributor

Yep, I understand. So how are we handling backward compatibility? That's a very big consideration here, especially since this is a popular option.

@LucioFranco
Contributor

@binarylogic it looks like using encoding = "string" should still work and thus maintain backwards compatibility. This supports both methods.
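
To illustrate what supporting both forms could mean, here is a small sketch that normalizes the legacy string form and the new table form into one struct. The RawEncoding enum and its variants are hypothetical and stand in for whatever the deserializer actually sees; this is not the PR's implementation.

```rust
/// Hypothetical parsed form of the `encoding` option as written in a config file.
enum RawEncoding {
    /// Legacy root-level form, e.g. encoding = "json"
    Legacy(String),
    /// New table form, e.g. encoding.codec = "json" plus optional field lists
    Table {
        codec: String,
        only_fields: Option<Vec<String>>,
        except_fields: Option<Vec<String>>,
    },
}

/// Normalized config that the rest of a sink would work with.
#[derive(Debug, PartialEq)]
struct EncodingConfig {
    codec: String,
    only_fields: Option<Vec<String>>,
    except_fields: Option<Vec<String>>,
}

impl From<RawEncoding> for EncodingConfig {
    fn from(raw: RawEncoding) -> Self {
        match raw {
            // The legacy form carries only a codec, so no field rules apply.
            RawEncoding::Legacy(codec) => EncodingConfig {
                codec,
                only_fields: None,
                except_fields: None,
            },
            RawEncoding::Table { codec, only_fields, except_fields } => EncodingConfig {
                codec,
                only_fields,
                except_fields,
            },
        }
    }
}
```

Under this sketch, an existing config that sets only the root-level string ends up equivalent to a table with just the codec set.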

@binarylogic
Contributor

Sounds good, :shipit:

Signed-off-by: Lucio Franco <luciofranco14@gmail.com>
Contributor

@LucioFranco LucioFranco left a comment

This looks really really good! Left some comments in line, let me know if you have any questions :)

src/sinks/aws_cloudwatch_logs/mod.rs Show resolved Hide resolved
}

#[typetag::serde(name = "aws_kinesis_firehose")]
impl SinkConfig for KinesisFirehoseSinkConfig {
    fn build(&self, cx: SinkContext) -> crate::Result<(super::RouterSink, super::Healthcheck)> {
        let config = self.clone();
        config.encoding.validate()?;
Contributor

Since it looks like we call this on each build, could we just nest this into our Deserialize impl for EncodingConfig?

src/sinks/aws_cloudwatch_logs/mod.rs Outdated Show resolved Hide resolved
    pub encoding: EncodingConfig,
    #[serde(
        deserialize_with = "EncodingConfigWithDefault::from_deserializer",
        skip_serializing_if = "skip_serializing_if_default",
Contributor

Maybe I'm missing something but why do some sinks have this and others don't? My assumption is that this allows a "default" encoding?

To avoid further bikeshedding this should in theory be possible to do within a custom impl but happy to leave that out for now.

Contributor Author

We want to make sure a user never has to see encoding.format if there is only one option. So this makes sure we skip this field when it's not required in vector generate calls

#[typetag::serde(name = "humio_logs")]
impl SinkConfig for HumioLogsConfig {
    fn build(&self, cx: SinkContext) -> crate::Result<(super::RouterSink, super::Healthcheck)> {
        if self.encoding.codec != Encoding::Json {
Contributor

I wonder if there is a better way around this, in theory could we just implement a custom encoding enum here that doesn't have Json?

Contributor Author

Yeah I don't think this is the nicest way either but it's very straightforward

Contributor

The one thing that worries me is that I've seen serde sometimes enumerate all the possible options, so in this case a user would see json as being available, but the sink would fail when configured with it. Just something we should check.

Contributor Author

@Hoverbear Hoverbear Mar 4, 2020

Yeah, hm... The only way to fix that would be to give this a custom encoding type and transmute it

Contributor

Why not write a custom enum here that doesn't include json?
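
For reference, a sink-specific enum along these lines might look like the sketch below. The names are hypothetical, and which variant the sink actually accepts is an assumption here: the `codec != Encoding::Json` check in the hunk above is read as meaning Json is the only codec this sink supports.

```rust
/// Stand-in for the shared codec enum used by most sinks.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Encoding {
    Text,
    Json,
}

/// Hypothetical sink-specific enum that lists only what this sink accepts,
/// so an unsupported codec cannot be configured in the first place.
#[derive(Debug, Clone, Copy, PartialEq)]
enum HumioEncoding {
    Json,
}

impl From<HumioEncoding> for Encoding {
    /// Convert to the shared enum before handing off to a wrapped sink.
    fn from(codec: HumioEncoding) -> Self {
        match codec {
            HumioEncoding::Json => Encoding::Json,
        }
    }
}
```

With this shape the runtime check in build() becomes unnecessary, and any list of valid options derived from the enum would only show what the sink supports.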

Contributor Author

src/sinks/util/encoding.rs Show resolved Hide resolved
src/sinks/util/encoding.rs Outdated Show resolved Hide resolved
Signed-off-by: Ana Hobden <operator@hoverbear.org>
@binarylogic
Contributor

Just to verify: Do we have a behavior test with the root level encoding (legacy) option set? If not, we should add one or more.

@Hoverbear
Contributor Author

@binarylogic
Contributor

That link doesn't go anywhere :(

@LucioFranco
Contributor

gh links for lines in PRs have been broken for me for a while now...

Signed-off-by: Ana Hobden <operator@hoverbear.org>
@Hoverbear
Contributor Author

@LucioFranco Our powers combined, this seems ready once the tests pass.

Contributor

@LucioFranco LucioFranco left a comment

Fantastic work! :shipit: 🔥

src/sinks/file/mod.rs Outdated Show resolved Hide resolved
src/sinks/new_relic_logs.rs Outdated Show resolved Hide resolved
@binarylogic
Contributor

I'm slightly concerned this is not correct given:

  1. The back and forth around defaults, enumerations, error messages, and so on.
  2. The fact that chore(config): Make encoding non-optional #894 missed a number of sinks.

How can we simply test for this? It seems silly to test for something so trivial, but I feel like we need to at this point. Should we create configs and lint them? Maybe fuzz testing?

@Hoverbear
Contributor Author

@binarylogic Let's not stack anything more on this PR, we already chose to unnecessarily stack the fixes for #894 in this. (Which I think was a good idea!)

So, please create a separate issue! I'd love to share my thoughts there!

Prior to merging I'll be doing some manual user acceptance testing. Our existing unit tests already serve as a fairly good configuration test bank. Note how our clickhouse tests tend to use configuration fragments. I think this is a good way to do things since it lets us safely make these changes and catch mistakes. It'd be nice if we trended towards doing this more over using structs.

@Hoverbear
Contributor Author

I'm going to test this out locally. I've identified a few cases in particular I'm checking:

  • Check that sinks with a single encoding type that don't get into'd don't print it out with vector generate.
    vector generate stdin//elasticsearch
    # Ensure `encoding.codec` doesn't exist. Is runnable.
  • Check that sinks with a single encoding type that gets into'd don't print it out with vector generate.
    vector generate stdin//humio_logs
    # Ensure `encoding.codec` doesn't exist. Is runnable.
  • Check that sinks with multiple encodings and no default don't default to one.
    vector generate stdin//file
    # Ensure `encoding.codec` doesn't exist, and that we get an error on run until a user writes it in.
  • Check that sinks with multiple encodings and a default do default to that one.
    vector generate stdin//aws_s3
    # Ensure `encoding.codec` doesn't exist. Is runnable.

Signed-off-by: Ana Hobden <operator@hoverbear.org>
@Hoverbear
Contributor Author

Okay, those examples work (except #1987), so I'm going to merge this under the impression that any remaining fixes will be (very) minor and will most likely be related to setting encoding.only_fields without setting encoding.codec (so, we missed the skip_if_default serde flag), which has a very straightforward workaround.

Labels
domain: sinks Anything related to the Vector's sinks type: feature A value-adding code addition that introduce new functionality.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Private fields used for processing that are not encoded
5 participants