[FEA] Add version of extract_re that takes an index #9855

andygrove · 2021-12-07T17:56:51Z

Is your feature request related to a problem? Please describe.
When we call extract_re from Spark, we are often interested in only a single group rather than all the groups in the pattern. We currently call extract_re, which returns a Table, then take the column we are interested in and discard the others. It would be more efficient if we could pass the column index to cuDF so that only one column needs to be instantiated.

Describe the solution you'd like
I would like a signature something like extract_re(pattern, index).
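To make the difference concrete, here is a minimal sketch in plain Python using the standard `re` module rather than cuDF. The function names `extract_all_groups` and `extract_single_group`, the 0-based `index` convention, and the use of `None` for non-matching rows are all assumptions for illustration; they are not the cuDF API.

```python
import re
from typing import List, Optional


def extract_all_groups(strings: List[str], pattern: str) -> List[List[Optional[str]]]:
    """Today's shape: build one output column per capture group."""
    prog = re.compile(pattern)
    columns: List[List[Optional[str]]] = [[] for _ in range(prog.groups)]
    for s in strings:
        m = prog.search(s)
        for g in range(prog.groups):
            # Group numbers in `re` are 1-based; None marks "no match".
            columns[g].append(m.group(g + 1) if m else None)
    return columns


def extract_single_group(strings: List[str], pattern: str, index: int) -> List[Optional[str]]:
    """Requested shape: build only the column for one group (0-based index, assumed)."""
    prog = re.compile(pattern)
    if not 0 <= index < prog.groups:
        raise IndexError(f"pattern has {prog.groups} groups, got index {index}")
    column: List[Optional[str]] = []
    for s in strings:
        m = prog.search(s)
        column.append(m.group(index + 1) if m else None)
    return column


data = ["abc-123-XYZ", "no match here"]
pattern = r"([a-z]+)-([0-9]+)-([A-Z]+)"

table = extract_all_groups(data, pattern)        # three columns allocated
wanted = table[1]                                # two of them are then discarded
direct = extract_single_group(data, pattern, 1)  # one column allocated
```

Both paths produce the same column; the second simply never materializes the columns that would be thrown away.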

Describe alternatives you've considered
None

Additional context
None

davidwendt · 2021-12-08T20:33:21Z

Why include groups in the pattern that will not return anything?
For example, if you have the following pattern with 3 groups:

"([a-z]*)-([0-9]*)-([A-Z]*)"

and you only care about the 2nd group, just change the pattern to remove the ( ) from the groups that are not needed:

"[a-z]*-([0-9]*)-[A-Z]*"
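As a quick check of the equivalence, the two patterns can be compared with Python's standard `re` module. This is only an illustration: Python's regex engine is not the one cuDF uses, and the sample string is made up.

```python
import re

with_all_groups = re.compile(r"([a-z]*)-([0-9]*)-([A-Z]*)")
only_second_group = re.compile(r"[a-z]*-([0-9]*)-[A-Z]*")

sample = "abc-123-XYZ"  # hypothetical input

# Group 2 of the three-group pattern ...
second_of_three = with_all_groups.search(sample).group(2)
# ... is the same text as group 1 of the single-group pattern.
only_group = only_second_group.search(sample).group(1)
```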

andygrove · 2021-12-08T22:20:29Z

I cannot argue with the logic here, but this is just how the Spark function works and we have no control over how people invoke it.

We could potentially rewrite the regexp pattern in the plugin to remove unreferenced groups but I would be nervous about going that far.
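For a sense of what such a rewrite would involve, here is a deliberately naive sketch in Python. It is not plugin code: `keep_only_group` is a hypothetical helper, it targets Python `re` syntax rather than Java or cuDF regex syntax, it converts unwanted groups to `(?:...)` instead of deleting the parentheses (so quantified groups keep their meaning), and it assumes the target engine accepts non-capturing groups. It refuses patterns it cannot handle safely, which hints at why a production version would need a real regex parser.

```python
import re


def keep_only_group(pattern: str, index: int) -> str:
    """Make every capturing group except group `index` (1-based) non-capturing.

    Known gaps in this sketch: a literal "]" placed first in a character
    class, conditional groups, and inline comments are not handled.
    """
    # Backreferences depend on group numbering, so bail out conservatively.
    if re.search(r"\\[1-9]", pattern):
        raise ValueError("possible backreference; not rewriting")
    # Named groups are capturing too, but this scanner does not count them.
    if "(?P" in pattern:
        raise ValueError("named groups are not supported by this sketch")

    out = []
    group_no = 0
    in_class = False  # True while inside [...]
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            out.append(pattern[i:i + 2])  # copy an escaped character untouched
            i += 2
            continue
        if in_class:
            if ch == "]":
                in_class = False
        elif ch == "[":
            in_class = True
        elif ch == "(" and pattern[i + 1:i + 2] != "?":
            group_no += 1
            if group_no != index:
                out.append("(?:")  # demote to a non-capturing group
                i += 1
                continue
        out.append(ch)
        i += 1
    return "".join(out)


rewritten = keep_only_group(r"([a-z]*)-([0-9]*)-([A-Z]*)", 2)
extracted = re.search(rewritten, "abc-123-XYZ").group(1)
```

After the rewrite the requested group is always group 1 of the new pattern, so the caller would also have to remap the index it passes on.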

beckernick · 2021-12-15T15:21:41Z

Is the goal to leverage information downstream in the DAG that indicates some of the capture groups weren't actually necessary for this execution?

Could you share a bit about the use cases where this comes up? For example, I can imagine it coming up in exploratory data analysis. Curious to understand how common/significant this is.

vyasr · 2021-12-22T06:20:22Z

I am also curious to understand this problem. From the original description it doesn't seem like the request is stemming from some complex workflow where some groups are "discovered" to be unnecessary, but rather to optimize user workflows even when users invoke the function with suboptimal regexes. Naively this seems like a use case where we should be aiming to educate users as to best practices for performance rather than optimizing additional code paths, i.e. we should be documenting that including unnecessary groups in the regex will impact performance. That would be consistent with a lot of our messaging around cuDF Python where there are often ways to do things with pandas that would translate to slow cuDF solutions, and we try to document and socialize knowledge about faster ways to do those things in cuDF without trying to optimize all the slower ways. Maybe I'm missing a key reason that this use case is different, though.

andygrove · 2021-12-22T16:01:25Z

To add some more context to this request: Spark provides a SQL function regexp_extract with the signature regexp_extract(str, pattern, idx). This function was requested in SPARK-8255 to expose the equivalent Java regexp functionality. It doesn't look like much thought was given to the inefficiency of this approach. The function has existed since Spark 1.5.0, so it is likely that people are using it. We already support this function in the RAPIDS Accelerator for Spark, but it would be more efficient if we could request that just a single group be extracted.
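For readers unfamiliar with the Spark function, the per-row behavior can be approximated in plain Python as below. This is a sketch under assumptions, not Spark's implementation: it uses Python's `re` instead of Java's regex engine, and it models the documented behavior that `idx` 0 means the whole match and that an empty string is returned when the pattern or the requested group does not match.

```python
import re


def regexp_extract(s: str, pattern: str, idx: int = 1) -> str:
    """Rough per-row emulation of Spark SQL's regexp_extract(str, pattern, idx)."""
    prog = re.compile(pattern)
    if idx < 0 or idx > prog.groups:
        raise ValueError(f"group index must be between 0 and {prog.groups}")
    m = prog.search(s)  # first match anywhere in the string
    if m is None:
        return ""
    value = m.group(idx)  # idx 0 is the entire match
    return value if value is not None else ""


first = regexp_extract("100-200", r"(\d+)-(\d+)", 1)
second = regexp_extract("100-200", r"(\d+)-(\d+)", 2)
whole = regexp_extract("100-200", r"(\d+)-(\d+)", 0)
missing = regexp_extract("foo", r"(\d+)-(\d+)", 1)
```

Only one group is ever returned per call, which is why extracting every group into its own column is wasted work for this function.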

github-actions · 2022-01-21T16:07:24Z

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions · 2022-04-21T16:08:40Z

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

andygrove added the feature request and Needs Triage labels Dec 7, 2021

andygrove mentioned this issue Dec 7, 2021

Add support for regexp_extract on the GPU NVIDIA/spark-rapids#4285

Merged

davidwendt self-assigned this Dec 7, 2021

andygrove mentioned this issue Dec 7, 2021

[FEA] Implement extract_all_re function #9856

Closed

beckernick added the libcudf and strings labels and removed the Needs Triage label Dec 16, 2021

github-actions bot added the inactive-30d label Jan 21, 2022

github-actions bot added the inactive-90d label Apr 21, 2022

GregoryKimball removed the inactive-90d label Apr 3, 2023

vyasr removed the inactive-30d label Feb 23, 2024
