Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(stdlib): add parse_etld function #669

Merged
merged 3 commits into from
Feb 6, 2024

Conversation

esensar
Copy link
Contributor

@esensar esensar commented Feb 1, 2024

This adds parse_etld function to stdlib, based on psl crate

Fixes: #651

This adds `parse_etld` function to stdlib, based on `psl` crate

Fixes: vectordotdev#651
@esensar
Copy link
Contributor Author

esensar commented Feb 1, 2024

This is currently using psl crate for performance, but there is also publicsuffix crate, recommended by the same author, for cases when dynamic list is needed. psl crate is using a list that is hardcoded in the crate and autogenerated whenever it changes, meaning tracking newest list requires updating the crate.

I could switch it to that crate if it would make more sense.

Also, I was also considering adding this to parse_url function, but it might not be beneficial to that many users, so it is probably best to keep it separate. Also, it allows customizing number of additional parts provided (like eTLD+1 but for more than 1), but I am not sure if that is a valid use case. It might be better to keep it simple and just pass eTLD and domain as result. This will also always return result, even for TLDs that are not found in the list, but it will have known_suffix set to false in that case.

@jszwedko jszwedko requested a review from pront February 1, 2024 19:08
@johnhtodd
Copy link

johnhtodd commented Feb 5, 2024

Very happy to see this feature. I suspect it will have a better chance of approval if it also had documentation in the main docs repository as part of the PR.

@pront
Copy link
Collaborator

pront commented Feb 5, 2024

Thank you @esensar, this looks good!

This is currently using psl crate for performance, but there is also publicsuffix crate, recommended by the same author, for cases when dynamic list is needed.

Let's switch to publicsuffix.

Also, I was also considering adding this to parse_url function, but it might not be beneficial to that many users, so it is probably best to keep it separate.

👍

@esensar
Copy link
Contributor Author

esensar commented Feb 5, 2024

Let's switch to publicsuffix.

Alright, makes sense, although that might complicate things a bit. With that crate we need to download the list in runtime, so I need to figure out when to do it and how and when to cache it. I will see if I can update the PR tomorrow, but let me know if you have ideas on how I should approach that.

@pront
Copy link
Collaborator

pront commented Feb 5, 2024

Let's switch to publicsuffix.

Alright, makes sense, although that might complicate things a bit. With that crate we need to download the list in runtime, so I need to figure out when to do it and how and when to cache it. I will see if I can update the PR tomorrow, but let me know if you have ideas on how I should approach that.

Actually, this is a good point. This introduces complexity. Let me take a closer look soon.

@pront
Copy link
Collaborator

pront commented Feb 5, 2024

psl crate is using a list that is hardcoded in the crate and autogenerated whenever it changes, meaning tracking newest list requires updating the crate.

@esensar on a second thought, this is preferable to the alternative 👍

},
Example {
title: "parse etld with plus parts",
source: r#"parse_etld!("vector.dev", plus_parts: 1)"#,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if by mistake we had plus_parts: 10 here? Can you add a test for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is actually a test for that, named short_plus, but it uses plus_parts: 3 when only 2 are available. I will change it to 10 for better visibility and clarity, but this makes me think if behavior in that case is right. It will just silently return whatever is available (so "vector.dev" in the above case). Would it be better to somehow let it be known in the results that it was not able to get all of these parts, since this way it would require some additional checks on the callers end.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but this makes me think if behavior in that case is right.

Some options to consider:

  1. return the max available parts
  2. return an error

I am leaning towards 1 which I believe is the current behavior but let me know if you have further thoughts.

Copy link
Contributor Author

@esensar esensar Feb 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with case 1. That should cover the general case and keep things simple.

Copy link
Collaborator

@pront pront left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @esensar!

@pront pront added this pull request to the merge queue Feb 6, 2024
@jszwedko
Copy link
Member

jszwedko commented Feb 6, 2024

Could we add these new functions to the benchmarks in benches/stdlib.rs?

Merged via the queue into vectordotdev:main with commit caf8c79 Feb 6, 2024
11 checks passed
jszwedko added a commit that referenced this pull request Feb 6, 2024
Missed in #669

Signed-off-by: Jesse Szwedko <jesse.szwedko@datadoghq.com>
github-merge-queue bot pushed a commit that referenced this pull request Feb 6, 2024
Missed in #669

Signed-off-by: Jesse Szwedko <jesse.szwedko@datadoghq.com>
@esensar esensar deleted the feature/parse_etld_function branch February 7, 2024 08:27
@esensar
Copy link
Contributor Author

esensar commented Feb 7, 2024

Are functions like this documented in the main docs section as part of the PR typically, or is that something that the maintainers reserve as a controlled area? (https://vector.dev/docs/reference/vrl/functions/)

There is a PR open for that update and if I understood correctly, it will be merged after new VRL version is released, to be up to date with the latest available version.

@johnhtodd
Copy link

Are functions like this documented in the main docs section as part of the PR typically, or is that something that the maintainers reserve as a controlled area? (https://vector.dev/docs/reference/vrl/functions/)

There is a PR open for that update and if I understood correctly, it will be merged after new VRL version is released, to be up to date with the latest available version.

I didn't understand the method by which those pages are created, and it seems that you've already done this work - sorry for the confusion.

esensar added a commit to esensar/vrl that referenced this pull request Feb 7, 2024
Missed in vectordotdev#669

This also adds `known_suffix` to type definition of `parse_etld` result, that was missed in original
PR.
esensar added a commit to esensar/vrl that referenced this pull request Feb 7, 2024
Missed in vectordotdev#669

This also adds `known_suffix` to type definition of `parse_etld` result, that was missed in original
PR.
github-merge-queue bot pushed a commit that referenced this pull request Feb 7, 2024
Missed in #669

This also adds `known_suffix` to type definition of `parse_etld` result, that was missed in original
PR.
github-merge-queue bot pushed a commit to vectordotdev/vector that referenced this pull request Feb 9, 2024
* docs(vrl): add documentation for `parse_etld` function

Related: vectordotdev/vrl#669

* Allow etld in spellchecker

* Fix formatting of `parse_etld.cue`

---------

Co-authored-by: Jesse Szwedko <jesse.szwedko@datadoghq.com>
@esensar
Copy link
Contributor Author

esensar commented Apr 30, 2024

Thank you @esensar, this looks good!

This is currently using psl crate for performance, but there is also publicsuffix crate, recommended by the same author, for cases when dynamic list is needed.

Let's switch to publicsuffix.

Also, I was also considering adding this to parse_url function, but it might not be beneficial to that many users, so it is probably best to keep it separate.

👍

Sorry for bringing this topic back, but turns out custom PSL is needed. I was thinking about how to implement it and since these are not really big, it would probably be okay to keep them in memory. I thought of maybe implementing it in a similar way to how enrichment_tables are implemented, by adding additional config option to set up PSL URL (and maybe support file paths too). Does something like that make sense? I just want to check before I go into implementation. That would also mean that this function would have to move to vector repo I think.

We can also have a separate function for that "dynamic" approach, so we don't have to remove this one from vrl.

@esensar
Copy link
Contributor Author

esensar commented May 6, 2024

Thank you @esensar, this looks good!

This is currently using psl crate for performance, but there is also publicsuffix crate, recommended by the same author, for cases when dynamic list is needed.

Let's switch to publicsuffix.

Also, I was also considering adding this to parse_url function, but it might not be beneficial to that many users, so it is probably best to keep it separate.

👍

Sorry for bringing this topic back, but turns out custom PSL is needed. I was thinking about how to implement it and since these are not really big, it would probably be okay to keep them in memory. I thought of maybe implementing it in a similar way to how enrichment_tables are implemented, by adding additional config option to set up PSL URL (and maybe support file paths too). Does something like that make sense? I just want to check before I go into implementation. That would also mean that this function would have to move to vector repo I think.

We can also have a separate function for that "dynamic" approach, so we don't have to remove this one from vrl.

@pront not sure if you saw this. What do you think about this?

@johnhtodd
Copy link

Some context here: the PSL data set doesn't quite summarize things as desired, and is not helping in our goal of reducing/aggregating data to the degree that we want to be fed out of edge nodes.

I suspect almost all users of the PSL list have different interest levels of depth into the DNS tree that they wish to aggregate, based on different zone cuts. Many of the domains (voluntarily submitted to the PSL) are much "deeper" in the domain set than normally one would want to go for summarization. For example, if I'm trying to find out counters for every domain that is a registry-listed domain (or at least a close approximation.) So in the simplest case, the desire is to count how many events are inside of "amazonaws.com" as an example aggregation where we look at ETLD+1 (.com plus one zone cut to the left.) However, the PSL list shows that things like "execute-api.ap-east-1.amazonaws.com" are listed, which means that we see every zone cut underneath that zone. While they are not "ETLD+1", they act as if they are if the list is consumed without editing. Needless to say, what we wanted to have as a fairly compact set of namespace has turned out to be an exploding cardinality issue, with probably 50% of the elements in the namespace seen as un-necessarily collected based on our experience so far.

To solve this problem, we'd like to edit the PSL on our own, so that Vector uses a customized version. This doesn't seem too difficult - we're just hoping to have a URL or filepath that allows the PSL data to specified on launch. This changes very infrequently, so it does not need to be dynamic, but it does need to allow a custom file to be identified.

@pront I know this is an edge case, but it is fairly close to the heart of the concept of having the ETLD function and making it useful. Customization of data files like this I hope is consistent with Vector's focus on giving operators highly-editable functionality at the edge of the network. In our case, this turns into billions (hundreds of billions?) of events per day for us that we could condense, so it does have a significant result if implemented.

@johnhtodd
Copy link

We're kind of blocked here, and we hope that the path we're going to take isn't going to create problems with getting the patches accepted.

@pront
Copy link
Collaborator

pront commented May 20, 2024

👋 This thread fell through the cracks, can you summarize what is the proposed change here? Do you want add custom PSL support using enrichment tables?

@jszwedko
Copy link
Member

jszwedko commented May 20, 2024

Thanks for the bump @johnhtodd . I agree with @pront that the proposal isn't entirely clear in that I'm not sure what UX you are expecting.

I could imagine something like the feature to load aliases from a file that was added to parse_groks recently. That might make the function call look something like:

parse_etld(value: "example.com", psl: "/some/path/to/psl/file")

Is that something like what you were expecting? It seems like a reasonable enhancement to me, if so.

I'd suggest we limit it to local files for now rather than opening the door to loading from, for example, URLs as suggested in #669 (comment) since Vector/VRL doesn't have a precedent for that.

@jszwedko
Copy link
Member

I think we should still keep the default psl if a custom one is not provided.

@esensar
Copy link
Contributor Author

esensar commented May 21, 2024

Thanks for the bump @johnhtodd . I agree with @pront that the proposal isn't entirely clear in that I'm not sure what UX you are expecting.

I could imagine something like the feature to load aliases from a file that was added to parse_groks recently. That might make the function call look something like:

parse_etld(value: "example.com", psl: "/some/path/to/psl/file")

Is that something like what you were expecting? It seems like a reasonable enhancement to me, if so.

I'd suggest we limit it to local files for now rather than opening the door to loading from, for example, URLs as suggested in #669 (comment) since Vector/VRL doesn't have a precedent for that.

Yeah, I guess file is okay for most cases, the URL option doesn't really offer much more, since the file can be downloaded before configuring anyways. I agree with the proposed API, but I think it would be good to move the actual file to configuration and refer to PSL by name, because that way we can parse it once on startup and then just use it, but that means that the function would have to move to vector (similar to enrichment tables functions).

I am not sure if that is alright with you, since we would have to remove it from vrl and move it to vector, since otherwise we would have to parse the file every time the function is called (since there is no caching and no runtime in vrl).

I think we should still keep the default psl if a custom one is not provided.

Agreed on that one. We can just use the library with pre-generated PSL for that I guess.

@jszwedko
Copy link
Member

Yeah, I guess file is okay for most cases, the URL option doesn't really offer much more, since the file can be downloaded before configuring anyways. I agree with the proposed API, but I think it would be good to move the actual file to configuration and refer to PSL by name, because that way we can parse it once on startup and then just use it, but that means that the function would have to move to vector (similar to enrichment tables functions).

You can actually do this while leaving it as an argument by requiring the argument to be static and so it can be loaded only once at start-time. You can see an example of this with #194.

@esensar
Copy link
Contributor Author

esensar commented May 21, 2024

Yeah, I guess file is okay for most cases, the URL option doesn't really offer much more, since the file can be downloaded before configuring anyways. I agree with the proposed API, but I think it would be good to move the actual file to configuration and refer to PSL by name, because that way we can parse it once on startup and then just use it, but that means that the function would have to move to vector (similar to enrichment tables functions).

You can actually do this while leaving it as an argument by requiring the argument to be static and so it can be loaded only once at start-time. You can see an example of this with #194.

Oh, that makes sense, it is probably the best solution to do that then. I will then make a PR to vrl repo with these changes. Thanks!

esensar added a commit to esensar/vrl that referenced this pull request May 21, 2024
This adds an optional argument `psl` to `parse_etld`function,
which should be a file path to a custom PSL list (default one can
be found on https://publicsuffix.org/list/public_suffix_list.dat).
If none is provided, the default is used.

Related: vectordotdev#669
github-merge-queue bot pushed a commit that referenced this pull request May 28, 2024
* feat(stdlib): add optional `psl` argument to `parse_etld`

This adds an optional argument `psl` to `parse_etld`function,
which should be a file path to a custom PSL list (default one can
be found on https://publicsuffix.org/list/public_suffix_list.dat).
If none is provided, the default is used.

Related: #669

* Add changelog entry

* Remove needless variables from `parse_etld` VRL tests

* Fix vrl tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature request: public suffix (TLD and ETLD+1) parsing
4 participants