Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(stdlib): add sieve string function #724

Merged
merged 5 commits into from
Mar 11, 2024

Conversation

esensar
Copy link
Contributor

@esensar esensar commented Mar 4, 2024

Adds a sieve string function which can remove unwanted characters from a string using a list of allowed characters (or regex of allowed patterns).

Fixes: #704

Adds a `sieve` string function which can remove unwanted characters from a string using a list of
allowed characters (or regex of allowed patterns).

Fixes: vectordotdev#704
@jszwedko jszwedko requested a review from pront March 4, 2024 15:04
required: false,
},
Parameter {
keyword: "replace_repeated",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❔ Can you explain a bit what this replace_repeated argument does? I think we could come up with a better name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it is a bit confusing.

Based on the original issue (#704), it is meant to provide a way to set a replacement string when multiple repeated characters are removed from the original string.

You can also see how it behaves in tests and examples.

all_options example might represent this the best:

all_options {
    args: func_args![value: value!("test123%456.فوائد.net."), permitted_characters: regex::Regex::new("[a-z.0-9]").unwrap(), replace_single: "X", replace_repeated: "<REMOVED>"],
    want: Ok(value!("test123X456.<REMOVED>.net.")),
    tdef: TypeDef::bytes().infallible(),
}

In this example the single % found in the string that was not allowed, was replaced with X, but that Arabic (maybe?) script string was replaced with <REMOVED>.

let mut result = String::with_capacity(value.len());
let mut missed_length = 0;
for char in value.chars() {
if characters.contains(&char) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any crate that uses a more efficient algorithm for string matching?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will check it out and let you know.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, this is an optional step, I was curious. If there's nothing mature available we can just leave a comment that this can be optimized.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I forgot to respond. I was thinking, maybe it would make sense to remove the option to use string and enforce only regex. On that one at least no manual checks are performed and I just iterate through the returned matches.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds reasonable to me

Copy link
Contributor

@pront pront left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this contribution!

@pront pront added this pull request to the merge queue Mar 11, 2024
Merged via the queue into vectordotdev:main with commit 14f195c Mar 11, 2024
11 checks passed
@esensar esensar deleted the feature/string_sieve_function branch March 12, 2024 09:34
github-merge-queue bot pushed a commit to vectordotdev/vector that referenced this pull request Mar 21, 2024
* docs(vrl): add documentation for `sieve` function

Related: vectordotdev/vrl#724

* Fix typo in sieve docs

Co-authored-by: jhgilbert <j.h.gilbert@gmail.com>

* Update function docs after removing string pattern option

* cue fmt

Signed-off-by: Jesse Szwedko <jesse.szwedko@datadoghq.com>

* Fix example

Signed-off-by: Jesse Szwedko <jesse.szwedko@datadoghq.com>

---------

Signed-off-by: Jesse Szwedko <jesse.szwedko@datadoghq.com>
Co-authored-by: jhgilbert <j.h.gilbert@gmail.com>
Co-authored-by: Jesse Szwedko <jesse.szwedko@datadoghq.com>
AndrooTheChen pushed a commit to discord/vector that referenced this pull request Sep 23, 2024
* docs(vrl): add documentation for `sieve` function

Related: vectordotdev/vrl#724

* Fix typo in sieve docs

Co-authored-by: jhgilbert <j.h.gilbert@gmail.com>

* Update function docs after removing string pattern option

* cue fmt

Signed-off-by: Jesse Szwedko <jesse.szwedko@datadoghq.com>

* Fix example

Signed-off-by: Jesse Szwedko <jesse.szwedko@datadoghq.com>

---------

Signed-off-by: Jesse Szwedko <jesse.szwedko@datadoghq.com>
Co-authored-by: jhgilbert <j.h.gilbert@gmail.com>
Co-authored-by: Jesse Szwedko <jesse.szwedko@datadoghq.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add "sieve" string function
2 participants