[FEA] LIKE support for a column of patterns, but a scalar escape character #10797
I'm a bit confused. Does this mean it's already possible to support the
The key part is we want a column of patterns.
I'd like to get clarification on what patterns are expected in the LIKE clause. I found this summary, which lists the varying support across different SQL databases: https://stackoverflow.com/questions/712580/list-of-special-characters-for-sql-like-clause
You can see the Spark code that compiles a LIKE pattern to a regular expression here. I am fine if we don't do the error checking on the patterns; we can do that ourselves. Multi-patterns in Spark are optimized differently to avoid extra processing, but we can always implement them as multiple separate calls to a LIKE operator with boolean operations to combine them.
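The multi-pattern fallback described here (separate LIKE calls combined with boolean operations) could be sketched as follows. This is purely illustrative, not Spark's or cuDF's implementation, and the helper names are hypothetical; escape-character handling is omitted for brevity:

```python
import re

def like_one(s: str, pattern: str) -> bool:
    # Minimal LIKE matcher: '%' matches any run, '_' matches one
    # character, everything else is literal (no escape handling).
    regex = "".join(
        ".*" if c == "%" else "." if c == "_" else re.escape(c)
        for c in pattern
    )
    return re.fullmatch(regex, s, re.DOTALL) is not None

def like_any(strings, patterns):
    # One LIKE call per pattern, OR-ed together row by row.
    results = [False] * len(strings)
    for p in patterns:
        results = [r or like_one(s, p) for r, s in zip(results, strings)]
    return results
```

Combining per-pattern boolean columns this way costs one extra pass per pattern, which is why a fused multi-pattern kernel can be an optimization, but the semantics are equivalent.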
What happens if the LIKE pattern includes a class range specifier (e.g.
The translation code calls
Can you provide some examples of LIKE patterns? Not for testing LIKE/regex conversion, but rather for supporting LIKE directly, for prototyping/testing purposes only. You probably have a test suite of these already, so a link to those here would be fine.
Sure. The following are some patterns that we have been testing with: '', '\r', '\n', 'a{3}bar', '12345678', '12345678901234', '%SystemDrive%\Users\John', '%o%', '%a%', '', '\%SystemDrive\%\\Users%', 'oo', 'oo%', '%oo', '\u201c%', 'a[d]%', 'a(d)%', '$', '$%', '.', '?|}{%', '%a{3}%'. Then we have also been testing patterns with special escape characters.
These are all from our tests in https://github.com/NVIDIA/spark-rapids/blob/branch-22.08/integration_tests/src/main/python/string_test.py, but be careful because that file also contains tests for rlike, which uses regular expressions.
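The caution about mixing up `like` and `rlike` tests matters because the same pattern string can mean very different things in each. Taking `a{3}bar` from the list above as an illustration:

```python
import re

pattern = "a{3}bar"

# As a LIKE pattern: it contains no '%' or '_', so it is a literal string.
like_regex = re.escape(pattern)
assert re.fullmatch(like_regex, "a{3}bar") is not None
assert re.fullmatch(like_regex, "aaabar") is None

# As an rlike (regex) pattern: '{3}' is a repetition quantifier,
# so the pattern means 'aaa' followed by 'bar'.
assert re.fullmatch(pattern, "aaabar") is not None
assert re.fullmatch(pattern, "a{3}bar") is None
```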
This issue has been labeled
Adds new strings `like` function to cudf. This is a wildcard-based string matching function based on SQL's LIKE statement: https://www.sqltutorial.org/sql-like/

Though some SQL implementations provide regex-like capabilities in the `like` statement pattern, the implementation here is strictly limited to the `%` (multi-character placeholder) and the `_` (single-character placeholder) behavior. It also accepts an optional escape character that can be used when trying to match strings that contain `%` or `_` in them. This is an easier (and faster) alternative to using the regex-based `contains` function.

Example usage:

```
s = cudf.Series(["David", "Daniel", "Darcy"])
s.str.like('Da%')    ==> [True, True, True]    # starts with 'Da'
s.str.like('_a_i%')  ==> [True, True, False]   # 2nd character is 'a' and 4th character is 'i'
s.str.like('_____')  ==> [True, False, True]   # match any 5 characters
s.str.like('%y')     ==> [False, False, True]  # ends with 'y'
```

This PR includes gtests, pytests, and an nvbench-mark.

Reference #10797

Authors:
- David Wendt (https://github.com/davidwendt)
- Bradley Dice (https://github.com/bdice)

Approvers:
- Michael Wang (https://github.com/isVoid)
- Tobias Ribizel (https://github.com/upsj)
- Bradley Dice (https://github.com/bdice)
- GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #11558
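The PR description mentions an optional escape character, but the example above doesn't exercise it. Here is a pure-Python sketch of the intended semantics (the `like_escaped` helper is hypothetical and is not the cuDF implementation): the escape character makes the next pattern character literal, so `%` and `_` themselves can be matched.

```python
import re

def like_escaped(s: str, pattern: str, escape: str = "\\") -> bool:
    """LIKE match where `escape` makes the following character literal."""
    out, i = [], 0
    while i < len(pattern):
        if pattern[i] == escape and i + 1 < len(pattern):
            out.append(re.escape(pattern[i + 1]))  # escaped char is literal
            i += 2
        elif pattern[i] == "%":
            out.append(".*")  # any run of characters
            i += 1
        elif pattern[i] == "_":
            out.append(".")   # exactly one character
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.fullmatch("".join(out), s, re.DOTALL) is not None
```

For example, the pattern `100\%` with escape `\` matches only the literal string `100%`, whereas `100%` alone matches anything starting with `100`.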
To be clear, for the escape character we don't need a column of escape characters. We only need a scalar.
There is a pattern per row, so the size of the patterns column will be the size of the input column.
Adds a `cudf::strings::like` function that accepts a column of patterns where each pattern is matched against the corresponding input string row. Only a single escape character is supported for all patterns.

Closes #10797

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Robert Maynard (https://github.com/robertmaynard)
- Bradley Dice (https://github.com/bdice)

URL: #12269
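The per-row semantics described here (one pattern per input row, a single scalar escape character shared by all rows) could be sketched like this. This is an illustration only, not the libcudf implementation, and the helper names are hypothetical:

```python
import re

def _like_regex(pattern: str, escape: str) -> str:
    # Translate one LIKE pattern to a regex; the escape character
    # makes the next pattern character literal.
    out, i = [], 0
    while i < len(pattern):
        c = pattern[i]
        if c == escape and i + 1 < len(pattern):
            out.append(re.escape(pattern[i + 1]))
            i += 2
        else:
            out.append(".*" if c == "%" else "." if c == "_" else re.escape(c))
            i += 1
    return "".join(out)

def like_column(strings, patterns, escape="\\"):
    """Row i of `patterns` is matched against row i of `strings`;
    the escape character is one scalar for every row."""
    return [
        re.fullmatch(_like_regex(p, escape), s, re.DOTALL) is not None
        for s, p in zip(strings, patterns)
    ]
```

Because the patterns column is the same size as the input column, each row compiles and applies its own pattern, which is what makes the column-of-patterns form more expensive than a single scalar pattern.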
Is your feature request related to a problem? Please describe.
I have to start off by apologizing. I know this is crazy difficult to do efficiently, but we have a customer that is asking for this. In SQL there is a string pattern matching system around `LIKE`. Spark only supports `%` and `_` as special characters, with an optional escape character passed in that defaults to `\`.

In Spark, if the pattern is a literal value, then it will parse it and translate the pattern into things like starts_with for `SOMETHING%`, ends_with for `%SOMETHING`, or contains for `%SOMETHING%`. If it is anything else, it is sent to the like operator. For us, we translate the pattern into a regular expression: `%` becomes `(.|\n)*` and `_` becomes `(.|\n)`. All other characters we escape before passing it to cuDF. I realize that this is super hard and it may need to be very slow to keep memory management under control. We get that.

Describe the solution you'd like
I would like to see a LIKE operator added to cuDF. Ideally it would take either a scalar pattern or a column of patterns along with an escape character, and return a boolean column saying whether the string matches the pattern. If the pattern is null or the string is null, the output should be null.
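A minimal sketch of the translation and null semantics requested here (`%` to `(.|\n)*`, `_` to `(.|\n)`, other characters escaped, and null in either input producing null output). This is illustrative only, not the Spark or cuDF code:

```python
import re

def like_via_regex(s, pattern):
    """LIKE via the regex translation described above; None propagates."""
    if s is None or pattern is None:
        return None  # null pattern or null string -> null output
    regex = "".join(
        r"(.|\n)*" if c == "%" else r"(.|\n)" if c == "_" else re.escape(c)
        for c in pattern
    )
    return re.fullmatch(regex, s) is not None
```

The explicit `(.|\n)` alternation is used instead of `.` because `.` does not match newlines by default, while SQL LIKE placeholders must match any character including `\n`.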
Describe alternatives you've considered