Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Add version of extract_re that takes an index #9855

Open
andygrove opened this issue Dec 7, 2021 · 7 comments
Open

[FEA] Add version of extract_re that takes an index #9855

andygrove opened this issue Dec 7, 2021 · 7 comments
Assignees
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python)

Comments

@andygrove
Copy link
Contributor

Is your feature request related to a problem? Please describe.
From Spark, when we call extract_re we often are only interested in extracting a single group rather than all the groups in the pattern. We currently call extract_re which returns a Table and we then get the column we are interested in and discard the others. It would be more efficient if we could pass the column index to cuDF so that only one column needs instantiating.

Describe the solution you'd like
I would like a signature something like extract_re(pattern, index).

Describe alternatives you've considered
None

Additional context
None

@davidwendt
Copy link
Contributor

Why include groups in the pattern that will not return anything?
For example, if you have the following pattern with 3 groups:

"([a-z]*)-([0-9]*)-([A-Z]*)"

and you only care about the 2nd group just change the pattern to remove the ( ) for the groups that are not needed.

"[a-z]*-([0-9]*)-[A-Z]*"

@andygrove
Copy link
Contributor Author

I cannot argue with the logic here, but this is just how the Spark function works and we have no control over how people invoke it.

We could potentially rewrite the regexp pattern in the plugin to remove unreferenced groups but I would be nervous about going that far.

@beckernick
Copy link
Member

Is the goal to leverage information downstream in the DAG that indicates some of the capture groups weren't actually necessary for this execution?

Could you share a bit about the use cases where this comes up? For example, I can imagine it coming up in exploratory data analysis. Curious to understand how common/significant this is.

@beckernick beckernick added libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) and removed Needs Triage Need team to review and classify labels Dec 16, 2021
@vyasr
Copy link
Contributor

vyasr commented Dec 22, 2021

I am also curious to understand this problem. From the original description it doesn't seem like the request is stemming from some complex workflow where some groups are "discovered" to be unnecessary, but rather to optimize user workflows even when users invoke the function with suboptimal regexes. Naively this seems like a use case where we should be aiming to educate users as to best practices for performance rather than optimizing additional code paths, i.e. we should be documenting that including unnecessary groups in the regex will impact performance. That would be consistent with a lot of our messaging around cuDF Python where there are often ways to do things with pandas that would translate to slow cuDF solutions, and we try to document and socialize knowledge about faster ways to do those things in cuDF without trying to optimize all the slower ways. Maybe I'm missing a key reason that this use case is different, though.

@andygrove
Copy link
Contributor Author

To add some more context to this request. Spark provides a SQL function regexp_extract with the signature regexp_extract(str, pattern, idx). This function was requested in SPARK-8255 to expose the equivalent Java regexp functionality. It doesn't look like too much thought was given to the inefficiency of this approach. The function has existed since Spark 1.5.0 so it is likely that people are using it. We do support this function in the RAPIDS Accelerator for Spark already but it would be more efficient if we could just request a single group to be extracted.

@github-actions
Copy link

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@github-actions
Copy link

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python)
Projects
None yet
Development

No branches or pull requests

5 participants