
[FEA] LIKE support for a column of patterns, but a scalar escape character #10797

Closed
revans2 opened this issue May 5, 2022 · 11 comments · Fixed by #12269
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@revans2
Contributor

revans2 commented May 5, 2022

Is your feature request related to a problem? Please describe.
I have to start off by apologizing. I know this is crazy difficult to do efficiently, but we have a customer asking for this. SQL provides string pattern matching through the LIKE operator.

Spark only supports % and _ as special characters with an optional escape character passed in that defaults to \.

In Spark, if the pattern is a literal value, it is parsed and translated into things like starts_with for SOMETHING%, ends_with for %SOMETHING, or contains for %SOMETHING%. Anything else is sent to the like operator. For that case we translate the pattern into a regular expression: % becomes (.|\n)* and _ becomes (.|\n), and all other characters are escaped before the expression is passed to CUDF. I realize that this is super hard and it may need to be very slow to keep memory management under control. We get that.
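That translation step can be sketched in pure Python (a hypothetical helper mirroring the rules above, not the actual plugin code; `escape` defaults to backslash as in Spark):

```python
import re

def like_to_regex(pattern: str, escape: str = "\\") -> str:
    """Translate a SQL LIKE pattern into an anchored regex:
    % -> (.|\\n)*, _ -> (.|\\n), escaped characters become literals,
    and everything else is regex-escaped."""
    out = []
    i = 0
    while i < len(pattern):
        c = pattern[i]
        if c == escape and i + 1 < len(pattern):
            out.append(re.escape(pattern[i + 1]))  # escaped char is a literal
            i += 2
            continue
        if c == "%":
            out.append("(.|\\n)*")   # any run of characters, including none
        elif c == "_":
            out.append("(.|\\n)")    # exactly one character
        else:
            out.append(re.escape(c))  # everything else matches literally
        i += 1
    return "^" + "".join(out) + "$"
```

For example, `like_to_regex("Da%")` yields a regex that matches any string starting with `Da`.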

Describe the solution you'd like
I would like to see a LIKE operator added to CUDF. Ideally it would take either a scalar pattern or a column of patterns, along with an escape character, and return a boolean column indicating whether each string matches its pattern. If the pattern is null or the string is null, the output should be null.

Describe alternatives you've considered

  1. Write something ourselves
  2. Go back to the customer and ask them what are they actually trying to do, and see if there is a better way to do it that does not involve this complex operator.
@revans2 revans2 added feature request New feature or request Needs Triage Need team to review and classify Spark Functionality that helps Spark RAPIDS labels May 5, 2022
@github-actions github-actions bot added this to Needs prioritizing in Feature Planning May 5, 2022
@jrhemstad
Contributor

I'm a bit confused.

For us we translate the pattern into a regular expression. % becomes (.|\n)* and _ becomes (.|\n)

Does this mean it's already possible to support the LIKE operator using the existing regex machinery?

@revans2
Contributor Author

revans2 commented May 11, 2022

The key part is we want a column of patterns.

@davidwendt
Contributor

I'd like to get clarification on what patterns are expected in the LIKE clause. I found this summary which lists the varying support across different SQL databases: https://stackoverflow.com/questions/712580/list-of-special-characters-for-sql-like-clause
But I see this Spark documentation: https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-like.html claims only % and _ for wildcards. Also, that page mentions a possible multi-pattern option?
I want to consider a possible implementation that either uses a subset of regex or does not use regex at all.
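A non-regex implementation is indeed possible: with only the two wildcards, LIKE admits a classic greedy scan that backtracks on the last `%` seen. A hypothetical per-string CPU sketch of that approach (the device version would run this logic per row):

```python
def like_match(s: str, pattern: str, escape: str = "\\") -> bool:
    """Direct LIKE matcher with no regex: '%' matches any run of
    characters (possibly empty), '_' matches exactly one character,
    and `escape` makes the following pattern character literal."""
    si = pi = 0              # cursors into the string and the pattern
    star_si = star_pi = -1   # backtrack point for the last '%' seen
    while si < len(s):
        if pi < len(pattern):
            c = pattern[pi]
            if c == "%":
                star_si, star_pi = si, pi   # remember where to retry from
                pi += 1
                continue
            if c == escape and pi + 1 < len(pattern):
                if s[si] == pattern[pi + 1]:    # escaped char: literal match
                    si, pi = si + 1, pi + 2
                    continue
            elif c == "_" or c == s[si]:
                si, pi = si + 1, pi + 1
                continue
        if star_pi == -1:
            return False                    # mismatch and no '%' to fall back on
        star_si += 1                        # let the last '%' absorb one more char
        si, pi = star_si, star_pi + 1
    while pi < len(pattern) and pattern[pi] == "%":
        pi += 1                             # trailing '%'s match the empty tail
    return pi == len(pattern)
```

This needs no compiled program per pattern, which matters for the column-of-patterns case where every row can carry a different pattern.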

@revans2
Contributor Author

revans2 commented Jun 14, 2022

Spark only supports % and _ as special characters with an optional escape character passed in that defaults to \.

You can see the Spark code that compiles a LIKE pattern to a regular expression here.

I am fine if we don't do the error checking on the patterns. We can do that ourselves.

Multi-patterns in Spark are optimized differently to avoid extra processing, but we can always implement them as multiple separate calls to a LIKE operator, combined with boolean operations.
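That fallback can be sketched over plain Python lists as a CPU stand-in for boolean columns (`like` below is a hypothetical helper using the same LIKE-to-regex translation discussed above, with no escape handling for brevity):

```python
import re

def like(values, pattern):
    """Evaluate one LIKE pattern against every value; a pure-Python
    stand-in for a single columnar LIKE call."""
    rx = "^" + "".join(
        "(.|\\n)*" if c == "%" else "(.|\\n)" if c == "_" else re.escape(c)
        for c in pattern
    ) + "$"
    return [re.match(rx, v) is not None for v in values]

values = ["foo.txt", "bar.csv", "baz.md"]
# 'a LIKE p1 OR a LIKE p2' as two separate calls combined element-wise:
combined = [x or y for x, y in zip(like(values, "%.txt"), like(values, "%.csv"))]
print(combined)  # [True, True, False]
```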

@davidwendt
Contributor

What happens if the LIKE pattern includes a class range specifier (e.g. [a-g] or [^1-8]) as supported in some databases?
Are these treated as literals then since Spark does not expect this? Or do they get passed through the regex conversion intact?
I could not tell from the snippet.

@revans2
Contributor Author

revans2 commented Jun 14, 2022

The translation code calls Pattern.quote on any character it wants to pass directly through to the regular expression. So [a-g] will be translated into something where the Java regular expression matches those characters exactly and does not treat them as anything special: \Q[\E\Qa\E\Q-\E\Qg\E\Q]\E (ugly but it works)
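Python's `re.escape` plays the same role as Java's `Pattern.quote` here: it backslash-escapes metacharacters instead of wrapping the text in `\Q...\E`, but the effect is identical, and the class syntax is neutralized:

```python
import re

quoted = re.escape("[a-g]")            # metacharacters are backslash-escaped
assert re.fullmatch(quoted, "[a-g]")   # matches the literal text "[a-g]"
assert not re.fullmatch(quoted, "b")   # does NOT act as a character class
```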

@davidwendt
Contributor

Can you provide some examples of LIKE patterns? Not for testing the LIKE-to-regex conversion, but for supporting LIKE directly, for prototyping and testing purposes. You probably already have a test suite of these, so a link to it here would be fine.

@revans2
Contributor Author

revans2 commented Jun 14, 2022

Sure. The following are some patterns that we have been testing with.

'', '\r', '\n', 'a{3}bar', '12345678', '12345678901234', '%SystemDrive%\Users\John', '%o%', '%a%', '', '\%SystemDrive\%\\Users%', 'oo', 'oo%', '%oo', '\u201c%', 'a[d]%', 'a(d)%', '$', '$%', '.', '?|}{%', '%a{3}%'.

Then we also have been testing patterns with special escape characters.

            'a like "_a^d%" escape "c"',
            'a like "a_a" escape "c"',
            'a like "a%a" escape "c"',
            'a like "c_" escape "c"',
            'a like x "6162632325616263" escape "#"',
            'a like x "61626325616263" escape "#"'

These are all from our tests in https://github.com/NVIDIA/spark-rapids/blob/branch-22.08/integration_tests/src/main/python/string_test.py, but be careful: that file also contains tests for rlike, which uses regular expressions.

@GregoryKimball GregoryKimball added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Jun 29, 2022
@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

rapids-bot bot pushed a commit that referenced this issue Aug 26, 2022
Adds new strings `like` function to cudf. This is a wildcard-based string matching function based on SQL's LIKE statement.
https://www.sqltutorial.org/sql-like/
Though some SQL implementations provide regex-like capabilities in the `like` statement pattern, the implementation here is strictly limited to the `%` (multi-character placeholder) and the `_` (single character placeholder) behavior. It also accepts an optional escape character that can be used when trying to match strings that contain `%` or `_` in them.

This is an easier (and faster) alternative to using the regex based `contains` function.
Example usage:
```
s = cudf.Series(["David", "Daniel", "Darcy"])
s.str.like('Da%')   ==> [True, True, True]    # starts with 'Da'
s.str.like('_a_i%') ==> [True, True, False]   # 2nd character is 'a' and 4th character is 'i'
s.str.like('_____') ==> [True, False, True]   # match any 5 characters
s.str.like('%y')    ==> [False, False, True]  # ends with 'y'
```

This PR includes gtests, pytest, and an nvbench-mark.

Reference #10797

Authors:
  - David Wendt (https://github.com/davidwendt)
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Michael Wang (https://github.com/isVoid)
  - Tobias Ribizel (https://github.com/upsj)
  - Bradley Dice (https://github.com/bdice)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #11558
@revans2 revans2 changed the title [FEA] LIKE support for a column of patterns [FEA] LIKE support for a column of patterns, but a scalar escape character Nov 21, 2022
@revans2
Contributor Author

revans2 commented Nov 21, 2022

To be clear for the escape character we don't need a column of escape characters. We only need a scalar.

@davidwendt
Contributor

There is a pattern per row. So the size of the patterns column will be the size of the input column.

Feature Planning automation moved this from Needs prioritizing to Closed Jan 4, 2023
rapids-bot bot pushed a commit that referenced this issue Jan 4, 2023
Adds a `cudf::strings::like` function that accepts a column of patterns where each pattern is matched against the corresponding input string row. Only a single escape character is supported for all patterns.
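The per-row semantics can be summarized with a pure-Python reference sketch (hypothetical, not the libcudf API): row i of the strings is matched against row i of the patterns, a single scalar escape character applies to every row, and a null string or null pattern yields a null result, as requested in the issue.

```python
import re

def like_per_row(strings, patterns, escape="\\"):
    """Reference semantics for the column-of-patterns overload:
    one pattern per row, one scalar escape, nulls (None) propagate."""
    def to_regex(p):
        out, i = [], 0
        while i < len(p):
            c = p[i]
            if c == escape and i + 1 < len(p):
                out.append(re.escape(p[i + 1])); i += 2  # escaped -> literal
            elif c == "%":
                out.append("(.|\\n)*"); i += 1           # any run of chars
            elif c == "_":
                out.append("(.|\\n)"); i += 1            # exactly one char
            else:
                out.append(re.escape(c)); i += 1         # literal character
        return "^" + "".join(out) + "$"
    return [None if s is None or p is None
            else re.match(to_regex(p), s) is not None
            for s, p in zip(strings, patterns)]

print(like_per_row(["David", "Daniel", None], ["Da%", "%y", "%"]))
# [True, False, None]
```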

Closes #10797

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - Bradley Dice (https://github.com/bdice)

URL: #12269