
[FEA] LIKE support for a column of patterns, but a scalar escape character #10797

Closed
revans2 opened this issue May 5, 2022 · 11 comments · Fixed by #12269
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@revans2
Contributor

revans2 commented May 5, 2022

Is your feature request related to a problem? Please describe.
I have to start off by apologizing. I know this is crazy difficult to do efficiently, but we have a customer asking for this. SQL provides string pattern matching through the LIKE operator.

Spark only supports % and _ as special characters with an optional escape character passed in that defaults to \.

In Spark, if the pattern is a literal value, it is parsed and translated into things like starts_with for SOMETHING%, ends_with for %SOMETHING, or contains for %SOMETHING%. Anything else is sent to the like operator. For that case we translate the pattern into a regular expression: % becomes (.|\n)* and _ becomes (.|\n), and all other characters are escaped before the expression is passed to CUDF. I realize that this is super hard and it may need to be very slow to keep memory management under control. We get that.
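That translation step can be sketched in pure Python (a hypothetical helper mirroring the rules above, not the actual plugin code; `escape` defaults to backslash as in Spark):

```python
import re

def like_to_regex(pattern: str, escape: str = "\\") -> str:
    """Translate a SQL LIKE pattern into an anchored regex:
    % -> (.|\\n)*, _ -> (.|\\n), escaped characters become literals,
    and everything else is regex-escaped."""
    out = []
    i = 0
    while i < len(pattern):
        c = pattern[i]
        if c == escape and i + 1 < len(pattern):
            out.append(re.escape(pattern[i + 1]))  # escaped char is a literal
            i += 2
            continue
        if c == "%":
            out.append("(.|\\n)*")   # any run of characters, including none
        elif c == "_":
            out.append("(.|\\n)")    # exactly one character
        else:
            out.append(re.escape(c))  # everything else matches literally
        i += 1
    return "^" + "".join(out) + "$"
```

For example, `like_to_regex("Da%")` yields a regex that matches any string starting with `Da`.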

Describe the solution you'd like
I would like to see a LIKE operator added to CUDF. Ideally it would take either a scalar pattern or a column of patterns, along with an escape character, and return a boolean column indicating whether each string matches its pattern. If the pattern is null or the string is null, the output should be null.

Describe alternatives you've considered

  1. Write something ourselves
  2. Go back to the customer and ask them what are they actually trying to do, and see if there is a better way to do it that does not involve this complex operator.
@revans2 revans2 added feature request New feature or request Needs Triage Need team to review and classify Spark Functionality that helps Spark RAPIDS labels May 5, 2022
@github-actions github-actions bot added this to Needs prioritizing in Feature Planning May 5, 2022
@jrhemstad
Contributor

I'm a bit confused.

For us we translate the pattern into a regular expression. % becomes (.|\n)* and _ becomes (.|\n)

Does this mean it's already possible to support the LIKE operator using the existing regex machinery?

@revans2
Contributor Author

revans2 commented May 11, 2022

The key part is we want a column of patterns.

@davidwendt
Contributor

I'd like to get clarification on what patterns are expected in the LIKE clause. I found this summary which lists the varying support across different SQL databases: https://stackoverflow.com/questions/712580/list-of-special-characters-for-sql-like-clause
But I see this Spark documentation: https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-like.html claims only % and _ for wildcards. Also, that page mentions a possible multi-pattern option?
I want to consider a possible implementation that either uses a subset of regex or does not use regex at all.
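A non-regex implementation is indeed possible: with only the two wildcards, LIKE admits a classic greedy scan that backtracks on the last `%` seen. A hypothetical per-string CPU sketch of that approach (the device version would run this logic per row):

```python
def like_match(s: str, pattern: str, escape: str = "\\") -> bool:
    """Direct LIKE matcher with no regex: '%' matches any run of
    characters (possibly empty), '_' matches exactly one character,
    and `escape` makes the following pattern character literal."""
    si = pi = 0              # cursors into the string and the pattern
    star_si = star_pi = -1   # backtrack point for the last '%' seen
    while si < len(s):
        if pi < len(pattern):
            c = pattern[pi]
            if c == "%":
                star_si, star_pi = si, pi   # remember where to retry from
                pi += 1
                continue
            if c == escape and pi + 1 < len(pattern):
                if s[si] == pattern[pi + 1]:    # escaped char: literal match
                    si, pi = si + 1, pi + 2
                    continue
            elif c == "_" or c == s[si]:
                si, pi = si + 1, pi + 1
                continue
        if star_pi == -1:
            return False                    # mismatch and no '%' to fall back on
        star_si += 1                        # let the last '%' absorb one more char
        si, pi = star_si, star_pi + 1
    while pi < len(pattern) and pattern[pi] == "%":
        pi += 1                             # trailing '%'s match the empty tail
    return pi == len(pattern)
```

This needs no compiled program per pattern, which matters for the column-of-patterns case where every row can carry a different pattern.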

@revans2
Contributor Author

revans2 commented Jun 14, 2022

Spark only supports % and _ as special characters with an optional escape character passed in that defaults to \.

You can see the Spark code that compiles a LIKE pattern to a regular expression here.

I am fine if we don't do the error checking on the patterns. We can do that ourselves.

Multi-patterns in Spark are optimized differently to avoid extra processing, but we can always implement them as multiple separate calls to a LIKE operator, combined with boolean operations.
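That fallback can be sketched over plain Python lists as a CPU stand-in for boolean columns (`like` below is a hypothetical helper using the same LIKE-to-regex translation discussed above, with no escape handling for brevity):

```python
import re

def like(values, pattern):
    """Evaluate one LIKE pattern against every value; a pure-Python
    stand-in for a single columnar LIKE call."""
    rx = "^" + "".join(
        "(.|\\n)*" if c == "%" else "(.|\\n)" if c == "_" else re.escape(c)
        for c in pattern
    ) + "$"
    return [re.match(rx, v) is not None for v in values]

values = ["foo.txt", "bar.csv", "baz.md"]
# 'a LIKE p1 OR a LIKE p2' as two separate calls combined element-wise:
combined = [x or y for x, y in zip(like(values, "%.txt"), like(values, "%.csv"))]
print(combined)  # [True, True, False]
```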

@davidwendt
Contributor

What happens if the LIKE pattern includes a class range specifier (e.g. [a-g] or [^1-8]) as supported in some databases?
Are these treated as literals then since Spark does not expect this? Or do they get passed through the regex conversion intact?
I could not tell from the snippet.

@revans2
Contributor Author

revans2 commented Jun 14, 2022

The translation code calls Pattern.quote on any character it wants to pass directly through to the regular expression. So [a-g] will be translated into something where the Java regular expression matches those characters exactly and does not treat them as anything special: \Q[\E\Qa\E\Q-\E\Qg\E\Q]\E (ugly but it works)
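Python's `re.escape` plays the same role as Java's `Pattern.quote` here: it backslash-escapes metacharacters instead of wrapping the text in `\Q...\E`, but the effect is identical, and the class syntax is neutralized:

```python
import re

quoted = re.escape("[a-g]")            # metacharacters are backslash-escaped
assert re.fullmatch(quoted, "[a-g]")   # matches the literal text "[a-g]"
assert not re.fullmatch(quoted, "b")   # does NOT act as a character class
```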

@davidwendt
Contributor

Can you provide some examples of LIKE patterns? Not for testing the LIKE-to-regex conversion, but for supporting LIKE directly, for prototyping and testing purposes. You probably already have a test suite of these, so a link to it here would be fine.

@revans2
Contributor Author

revans2 commented Jun 14, 2022

Sure. The following are some patterns that we have been testing with.

'', '\r', '\n', 'a{3}bar', '12345678', '12345678901234', '%SystemDrive%\Users\John', '%o%', '%a%', '', '\%SystemDrive\%\\Users%', 'oo', 'oo%', '%oo', '\u201c%', 'a[d]%', 'a(d)%', '$', '$%', '.', '?|}{%', '%a{3}%'.

Then we also have been testing patterns with special escape characters.

            'a like "_a^d%" escape "c"',
            'a like "a_a" escape "c"',
            'a like "a%a" escape "c"',
            'a like "c_" escape "c"',
            'a like x "6162632325616263" escape "#"',
            'a like x "61626325616263" escape "#"'

These are all from our tests in https://github.com/NVIDIA/spark-rapids/blob/branch-22.08/integration_tests/src/main/python/string_test.py, but be careful: that file also contains tests for rlike, which uses regular expressions.

@GregoryKimball GregoryKimball added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Jun 29, 2022
@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

rapids-bot bot pushed a commit that referenced this issue Aug 26, 2022
Adds new strings `like` function to cudf. This is a wildcard-based string matching function based on SQL's LIKE statement.
https://www.sqltutorial.org/sql-like/
Though some SQL implementations provide regex-like capabilities in the `like` statement pattern, the implementation here is strictly limited to the `%` (multi-character placeholder) and the `_` (single character placeholder) behavior. It also accepts an optional escape character that can be used when trying to match strings that contain `%` or `_` in them.

This is an easier (and faster) alternative to using the regex based `contains` function.
Example usage:
```
s = cudf.Series(["David", "Daniel", "Darcy"])
s.str.like('Da%')   ==> [True, True, True]    # starts with 'Da'
s.str.like('_a_i%') ==> [True, True, False]   # 2nd character is 'a' and 4th character is 'i'
s.str.like('_____') ==> [True, False, True]   # match any 5 characters
s.str.like('%y')    ==> [False, False, True]  # ends with 'y'
```

This PR includes gtests, pytest, and an nvbench-mark.

Reference #10797

Authors:
  - David Wendt (https://github.com/davidwendt)
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Michael Wang (https://github.com/isVoid)
  - Tobias Ribizel (https://github.com/upsj)
  - Bradley Dice (https://github.com/bdice)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #11558
@revans2 revans2 changed the title [FEA] LIKE support for a column of patterns [FEA] LIKE support for a column of patterns, but a scalar escape character Nov 21, 2022
@revans2
Contributor Author

revans2 commented Nov 21, 2022

To be clear for the escape character we don't need a column of escape characters. We only need a scalar.

@davidwendt
Contributor

There is a pattern per row. So the size of the patterns column will be the size of the input column.

Feature Planning automation moved this from Needs prioritizing to Closed Jan 4, 2023
rapids-bot bot pushed a commit that referenced this issue Jan 4, 2023
Adds a `cudf::strings::like` function that accepts a column of patterns where each pattern is matched against the corresponding input string row. Only a single escape character is supported for all patterns.
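The per-row semantics can be summarized with a pure-Python reference sketch (hypothetical, not the libcudf API): row i of the strings is matched against row i of the patterns, a single scalar escape character applies to every row, and a null string or null pattern yields a null result, as requested in the issue.

```python
import re

def like_per_row(strings, patterns, escape="\\"):
    """Reference semantics for the column-of-patterns overload:
    one pattern per row, one scalar escape, nulls (None) propagate."""
    def to_regex(p):
        out, i = [], 0
        while i < len(p):
            c = p[i]
            if c == escape and i + 1 < len(p):
                out.append(re.escape(p[i + 1])); i += 2  # escaped -> literal
            elif c == "%":
                out.append("(.|\\n)*"); i += 1           # any run of chars
            elif c == "_":
                out.append("(.|\\n)"); i += 1            # exactly one char
            else:
                out.append(re.escape(c)); i += 1         # literal character
        return "^" + "".join(out) + "$"
    return [None if s is None or p is None
            else re.match(to_regex(p), s) is not None
            for s, p in zip(strings, patterns)]

print(like_per_row(["David", "Daniel", None], ["Da%", "%y", "%"]))
# [True, False, None]
```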

Closes #10797

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - Bradley Dice (https://github.com/bdice)

URL: #12269