Expand supported redactors for redact #112

Open
jszwedko opened this issue May 13, 2021 · 6 comments
Labels
type: enhancement A value-adding code change that enhances its existing functionality vrl: stdlib Changes to the standard library

Comments

@jszwedko
Member

Broken off from vectordotdev/vector#7250 (comment)

The initial implementation of redact just had one redactor that always replaced with [REDACTED]. We should expand this to support additional redactors like:

  • Customizing the redaction string
  • Hashing the value to have consistent replacement text
  • Masking such as only showing the last 4 for social security numbers
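To make the three proposed redactor styles concrete, here is a minimal sketch in Python (not VRL). The function names `redact_fixed`, `redact_hash`, and `redact_mask` are hypothetical and only illustrate the behaviour each redactor might have; they are not part of the `redact` API, and SHA-256 is an assumed choice of hash.

```python
import hashlib


def redact_fixed(value: str, replacement: str = "[REDACTED]") -> str:
    """Replace the whole value with a caller-chosen string."""
    return replacement


def redact_hash(value: str) -> str:
    """Replace the value with its SHA-256 hex digest.

    The same input always yields the same digest, so redacted values
    can still be correlated across events. SHA-256 is an assumption here.
    """
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def redact_mask(value: str, keep_last: int = 4, mask_char: str = "*") -> str:
    """Mask everything except the last `keep_last` characters."""
    if keep_last <= 0:
        return mask_char * len(value)
    hidden = max(len(value) - keep_last, 0)
    return mask_char * hidden + value[hidden:]


print(redact_fixed("123-45-6789", "<hidden>"))  # <hidden>
print(redact_mask("123-45-6789"))               # *******6789
print(redact_hash("123-45-6789") == redact_hash("123-45-6789"))  # True
```

An unsalted hash of a low-entropy value such as a social security number can be brute-forced, so a real hashing redactor would probably want a key or salt.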
@jszwedko jszwedko added the type: enhancement A value-adding code change that enhances its existing functionality label May 13, 2021
jszwedko referenced this issue in vectordotdev/vector Jun 3, 2021
A few notes:

- I opted not to highlight emitting multiple events from `remap` yet as I'd really like to get the `unnest` PR in there first #7404 since it's not super useful until then.
- I opted not to highlight the new `redact` function until we add additional filters (#7435) and possibly redactors (#7445) since it is a bit lackluster until then.

Signed-off-by: Jesse Szwedko <jesse@szwedko.me>
@mr-karan

+1 for this. I was migrating from a Logstash-based config where I'm using gsub to achieve this. I wanted to preserve the first and last few characters of a sensitive token field, but it looks like that isn't possible.

For example, it would help if this worked: replace(.message, r'(my_token)(.*?):(.*?)(\S{8})', r'\1*\3'). I wanted to preserve the field name itself and the last 8 characters.

Is there any workaround using other string substitution methods?

@JeanMertz
Contributor

@mr-karan you can still use replace to achieve this, but:

  1. You need to use $1 to reference capture groups
  2. The third argument to replace has to be a string

$ .message = "my_token:abcdefghijklmnopqrstuvwxyz"
"my_token:abcdefghijklmnopqrstuvwxyz"

$ replace(.message, r'(my_token):(.*)(\S{8})', "$1*$3")
"my_token*stuvwxyz"

You can try it out yourself by running vector vrl.
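For readers who want to see how the capture groups split the input, the same pattern can be sketched in Python. Python's `re` module writes group references as `\1` rather than `$1`, so this only illustrates the grouping logic, not VRL syntax. The helper name `mask_token` and its `keep_colon` flag are hypothetical.

```python
import re

# Same pattern as in the comment above: group 1 is the field name,
# group 2 is the greedy middle part, group 3 is the last 8 non-space chars.
PATTERN = re.compile(r"(my_token):(.*)(\S{8})")


def mask_token(message: str, keep_colon: bool = False) -> str:
    """Keep the field name and the last 8 characters, mask the rest.

    The colon sits outside every capture group, so it is dropped unless
    it is written back explicitly in the replacement template.
    """
    template = r"\1:*\3" if keep_colon else r"\1*\3"
    return PATTERN.sub(template, message)


message = "my_token:abcdefghijklmnopqrstuvwxyz"
print(PATTERN.search(message).groups())
# ('my_token', 'abcdefghijklmnopqr', 'stuvwxyz')
print(mask_token(message))                   # my_token*stuvwxyz
print(mask_token(message, keep_colon=True))  # my_token:*stuvwxyz
```

The greedy `(.*)` first consumes the whole token and then backtracks just far enough for `(\S{8})` to match, which is why group 3 ends up holding exactly the last eight characters.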

@mr-karan

@JeanMertz Thanks for the help. Works well 👍

@mr-karan

mr-karan commented Jul 2, 2021

@JeanMertz A bit perplexed here. I tried out the replace in vrl and it worked perfectly fine. However it doesn't work in the actual pipeline. I wrote a small unit test for you to check. (I can open a new issue if that is more relevant)

[transforms.format_logs]
type = "remap" 
inputs = ["haproxy_logs"] 
source = '''
.message = replace!(.message,r'(auth=token)(.*?):(.*?)(\S{8})&', "$1$2:*$4&")
'''

[[tests]]
  name = "check if token is redacted"

  [[tests.inputs]]
    insert_at = "format_logs"
    type = "raw"
    value = "auth=token myapp:GDyB5onL3Whi69RY2MELVPLWs1nVYamq&-"

  [[tests.outputs]]
    extract_from = "format_logs"

    [[tests.outputs.conditions]]
      type = "check_fields"
      "message.equals" = "auth=token myapp:*s1nVYamq&-"

When running vector test:

Running tests
Jul 02 10:58:25.195  WARN vector::conditions::check_fields: The `check_fields` condition is deprecated, use `remap` instead.
test check if token is redacted ... failed

failures:

test check if token is redacted:

check transform 'format_logs' failed conditions:
  condition[0]: predicates failed: [ message.equals: "auth=token myapp:*s1nVYamq&-" ]
payloads (events encoded as JSON):
   input: {"message":"auth=token myapp:GDyB5onL3Whi69RY2MELVPLWs1nVYamq&-","timestamp":"2021-07-02T05:28:25.194973128Z"}
  output: {"message":":*&-","timestamp":"2021-07-02T05:28:25.194973128Z"}

When doing the same thing with vector vrl:

$ msg = "auth=token myapp:GDyB5onL3Whi69RY2MELVPLWs1nVYamq&-"
"auth=token myapp:GDyB5onL3Whi69RY2MELVPLWs1nVYamq&-"

$ replace(msg,r'(auth=token)(.*?):(.*?)(\S{8})&', "$1$2:*$4&")
"auth=token myapp:*s1nVYamq&-"

I am really confused how this is happening 😵

@jszwedko
Member Author

jszwedko commented Jul 2, 2021

Hi @mr-karan. I think you are running into the same issue as vectordotdev/vector#8067.

The issue is that $1 is interpreted when Vector loads the config to mean you want to inject the environment variable $1 into the config file. This behavior is described here: https://vector.dev/docs/reference/configuration/#environment-variables

You can escape the $ via $$ so something like replace(msg,r'(auth=token)(.*?):(.*?)(\S{8})&', "$$1$$2:*$$4&") should work for you.

This is a pretty big gotcha, though, since the replacement groups use $. I'm wondering if we could improve the error messaging here, although right now you should at least see a warning in Vector's output at startup that $1, $2, etc. are undefined. We could certainly at least call it out in the documentation.
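The effect described above can be reproduced with a small Python model. This is a simplified sketch, not Vector's actual implementation: the `interpolate` function assumes that `$NAME` and `${NAME}` expand to the environment value (empty when undefined) and that `$$` yields a literal `$`. The helper `apply_replace` is hypothetical and translates `$N` group references into Python's `\g<N>` form.

```python
import re

# Simplified model of config-time interpolation (an assumption):
# "$$" is an escaped dollar, "$NAME" / "${NAME}" are variable lookups.
_VAR = re.compile(r"\$\$|\$(\w+)|\$\{(\w+)\}")


def interpolate(text: str, env: dict) -> str:
    """Expand variables the way the config loader is described to."""
    def expand(match: re.Match) -> str:
        if match.group(0) == "$$":
            return "$"
        name = match.group(1) or match.group(2)
        return env.get(name, "")  # undefined variables become empty
    return _VAR.sub(expand, text)


def apply_replace(message: str, pattern: str, replacement: str) -> str:
    """Apply a `$N`-style replacement using Python's `re` module."""
    template = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    return re.sub(pattern, template, message)


PATTERN = r"(auth=token)(.*?):(.*?)(\S{8})&"
MSG = "auth=token myapp:GDyB5onL3Whi69RY2MELVPLWs1nVYamq&-"

# Unescaped: $1, $2 and $4 are treated as undefined environment variables.
broken = interpolate("$1$2:*$4&", env={})
print(broken)                               # :*&
print(apply_replace(MSG, PATTERN, broken))  # :*&-

# Escaped: $$ survives interpolation as a single $.
fixed = interpolate("$$1$$2:*$$4&", env={})
print(fixed)                                # $1$2:*$4&
print(apply_replace(MSG, PATTERN, fixed))   # auth=token myapp:*s1nVYamq&-
```

Under these assumptions the unescaped run yields `:*&-`, the same message as in the failing unit test output, while the escaped run matches what `vector vrl` printed.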

@mr-karan

mr-karan commented Jul 3, 2021

Thanks @jszwedko for the explanation :) Escaping $ worked!
