[FEA] Implement extract_all_re function #9856

andygrove · 2021-12-07T18:03:43Z

Is your feature request related to a problem? Please describe.
I would like to be able to implement a GPU version of Spark's regexp_extract_all function. Here is an example usage from the documentation.

SELECT regexp_extract_all('100-200, 300-400', '(\\d+)-(\\d+)', 1);
 ["100","300"]

Describe the solution you'd like
I would like a cuDF function extract_all_re which is similar to extract_re but returns columns of type List<String> containing all matches for each group.

Additionally, it would be nice if I could optionally pass cuDF a group index so that we don't always have to extract for all of the groups in the pattern. There is a related issue #9855 asking for this capability for extract_re.

Describe alternatives you've considered
None

Additional context
None

The text was updated successfully, but these errors were encountered:

davidwendt · 2021-12-10T22:05:15Z

Looking a bit closer, this does appear to be similar to extract_record but the all means continue match/extracting after the first match is found? Also, I'm guessing the example in the description includes an index feature that is the subject of a separate issue #9855 and therefore is a bit confusing here. Ignoring indexing, this is what I'm looking at implementing so far:

s = [ "100-200 300-400", "500-600", "700-800", "900" ]
r =  extract_all( s, "'(\d+)-(\d+)" )
r is a LIST column of strings that looks like this:

[ [ "100", "200", "300", "400" ], // 2 sets of matches
  [ "500", "600" ], // 1 match
  [ "700", "800" ], // 1 match
  NULL              // no match
]

The 1st string "100-200 300-400" had 2 matches to (\d+)-(\d+) and so both sets of groups are returned.
The 2nd and 3rd strings each had one match and included those in the output rows.
The 4th string "900" does not match the pattern and so the output for this row is NULL.

andygrove · 2021-12-10T23:05:16Z

Yes, the all here means continue matching. Your example looks correct to me, ignoring indexing. For reference, here is the equivalent example in Spark:

scala> val df = Seq("100-200 300-400", "500-600", "700-800", "900").toDF("str")
scala> df.createOrReplaceTempView("strings")

scala> spark.sql("SELECT regexp_extract_all(str, '(\\\\d+)-(\\\\d+)', 0) FROM strings").show
+---------------------------------------+
|regexp_extract_all(str, (\d+)-(\d+), 0)|
+---------------------------------------+
|                     [100-200, 300-400]|
|                              [500-600]|
|                              [700-800]|
|                                     []|
+---------------------------------------+

scala> spark.sql("SELECT regexp_extract_all(str, '(\\\\d+)-(\\\\d+)', 1) FROM strings").show
+---------------------------------------+
|regexp_extract_all(str, (\d+)-(\d+), 1)|
+---------------------------------------+
|                             [100, 300]|
|                                  [500]|
|                                  [700]|
|                                     []|
+---------------------------------------+

scala> spark.sql("SELECT regexp_extract_all(str, '(\\\\d+)-(\\\\d+)', 2) FROM strings").show
+---------------------------------------+
|regexp_extract_all(str, (\d+)-(\d+), 2)|
+---------------------------------------+
|                             [200, 400]|
|                                  [600]|
|                                  [800]|
|                                     []|
+---------------------------------------+

davidwendt · 2021-12-11T00:50:40Z

Please do not include indexing examples here.

I don't understand the first example.

+---------------------------------------+
|regexp_extract_all(str, (\d+)-(\d+), 0)|
+---------------------------------------+
|                     [100-200, 300-400]|
|                              [500-600]|
|                              [700-800]|
|                                     []|
+---------------------------------------+

This does not look like extract to me.
Extract will return the values found inside the groups ( ). Like [ "100", "200" ] in my previous comment post.
Your example now appears to return the entire match result instead. Like [ "100-200" ]

andygrove · 2021-12-13T17:01:20Z

The Spark function is confusing. A zero index is a special case and returns all of the strings that match the entire regexp pattern. I will see if the existing cuDF functions can already support this case and will file a new issue if not.

For the "normal" indexing case I think I can implement based on the proposed results that you previously posted.

davidwendt · 2021-12-13T17:10:54Z

I think the closest behavior we have is findall
But I suspect you will need _record version of this API to return a lists column instead of a table. This would match the result in this comment.

Do you still need an extract_all as I've described in my comment?
Or perhaps an extract_record that would be similar to this but would only return the first matched group set.

andygrove · 2021-12-13T18:42:16Z

Do you still need an extract_all as I've described in my comment?

Yes, I believe that this would give me what I need.

Closes #9856 Adds a new `cudf::strings::extract_all` API that returns a LIST column of extracted strings given a regex pattern. This is similar to nvstrings version of `extract` called `extract_record` but returns groups from all matches in each string instead of just the first match. Here is pseudo code of it's behavior on various strings input: ``` s = [ "ABC-200 DEF-400", "GHI-60", "JK-800", "900", NULL ] r = extract_all( s, "'(\w+)-(\d+)" ) r is a LIST column of strings that looks like this: [ [ "ABC", "200", "DEF", "400" ], // 2 matches [ "GHI", "60" ], // 1 match [ "JK", "800" ], // 1 match NULL, // no match NULL ] ``` Each match results in two groups as specified in the regex pattern. Also reorganized the extract source code into `src/strings/extract` directory. The match-counting has been factored out into new `count_matches.cuh` since it will become common code used with `findall_record` in a follow on PR. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Robert Maynard (https://github.com/robertmaynard) - Bradley Dice (https://github.com/bdice) - Mike Wilson (https://github.com/hyperbolic2346) URL: #9909

Reference #9856 specifically #9856 (comment) Adds `cudf::strings::findall_record` which was initially implemented in nvstrings but not ported over since LIST column types did not exist at the time and returning a vector of small columns was very inefficient. This API should also allow using the current python function `cudf.str.findall()` with the `expand=False` parameter more effectively. A follow-on PR will address these python changes. This PR reorganizes the libcudf strings _find_ source files into the `cpp/src/strings/search` subdirectory as well. Also, `findall()` has only a regex version so the `_re` suffix is dropped from the name in the libcudf implementation. The python changes in this PR address only the name change and the addition of the new API in the cython interface. Depends on #9909 -- shares the `cudf::strings::detail::count_matches()` utility function. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Robert Maynard (https://github.com/robertmaynard) - Vyas Ramasubramani (https://github.com/vyasr) URL: #9911

andygrove added feature request New feature or request Needs Triage Need team to review and classify labels Dec 7, 2021

github-actions bot added this to Needs prioritizing in Feature Planning Dec 7, 2021

andygrove mentioned this issue Dec 7, 2021

[FEA] Implement regexp_extract_all on GPU for idx > 0 NVIDIA/spark-rapids#4283

Closed

davidwendt self-assigned this Dec 7, 2021

davidwendt added libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) labels Dec 8, 2021

This was referenced Dec 15, 2021

Add cudf::strings::extract_all API #9909

Merged

Add cudf::strings::findall_record API #9911

Merged

beckernick removed the Needs Triage Need team to review and classify label Dec 16, 2021

rapids-bot bot closed this as completed in #9909 Jan 5, 2022

Feature Planning automation moved this from Needs prioritizing to Closed Jan 5, 2022

bdice mentioned this issue Mar 18, 2022

[BUG] cudf::strings::findall_record should return an empty list instead of null if no matches are found #10458

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEA] Implement extract_all_re function #9856

[FEA] Implement extract_all_re function #9856

andygrove commented Dec 7, 2021

davidwendt commented Dec 10, 2021

andygrove commented Dec 10, 2021 •

edited

Loading

davidwendt commented Dec 11, 2021

andygrove commented Dec 13, 2021

davidwendt commented Dec 13, 2021

andygrove commented Dec 13, 2021

[FEA] Implement extract_all_re function #9856

[FEA] Implement extract_all_re function #9856

Comments

andygrove commented Dec 7, 2021

davidwendt commented Dec 10, 2021

andygrove commented Dec 10, 2021 • edited Loading

davidwendt commented Dec 11, 2021

andygrove commented Dec 13, 2021

davidwendt commented Dec 13, 2021

andygrove commented Dec 13, 2021

andygrove commented Dec 10, 2021 •

edited

Loading