Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Make line terminator sequences in regular expression using $ a configurable option when MULTILINE flag is enabled #11979

Closed
NVnavkumar opened this issue Oct 24, 2022 · 5 comments · Fixed by #12181
Assignees
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python)

Comments

@NVnavkumar
Copy link
Contributor

NVnavkumar commented Oct 24, 2022

Is your feature request related to a problem? Please describe.
In regular expression multiline mode, currently the $ matches at the position right before a newline character \n (a line terminator) in cuDF. In Python, this behavior makes sense and is consistent with the Python implementation. However, Apache Spark uses the JDK (Java) implementation, and line terminator sequences are a bit more complex. The JDK regular expression library utilizes either of newline (\n), carriage return (\r), carriage return followed by newline (\r\n), and 3 other Unicode newline variants as a line terminator.

Describe the solution you'd like
It would be useful if we could configure the concept of line terminator sequences in cuDF. Ideally, this could be an optional parameter that would support a simple array of strings for line terminator sequences. Another alternative could be another flag for line terminators or another type of MULTILINE flag

Describe alternatives you've considered
Currently, spark-rapids handles $ by doing a heavy translation from a JDK regular expression to another regular expression supported by cuDF that handles the multiple possible line terminator sequences that the JDK uses. It cannot use cuDF MULTILINE mode because only the newline is handled there. With this translation, we are limited to only using the $ in simple scenarios at the end of the regular expression, we cannot use them in choice | right now among other constructions because of the complexity.

@NVnavkumar NVnavkumar added Needs Triage Need team to review and classify feature request New feature or request labels Oct 24, 2022
@davidwendt davidwendt added the strings strings issues (C++ and Python) label Oct 25, 2022
@davidwendt
Copy link
Contributor

davidwendt commented Oct 25, 2022

Could you provide some example use cases?
Only \n is supported for new-line but it may be possible to optionally support \r instead or to optionally support either \r or \n. But I do not think it will be possible to support \r\n as a single new-line entity(?).

Would it be possible to convert \r\n to \n before calling the regex APIs? This should work for matching but maybe there is an issue with extract or replace that I'm not seeing.

>>> import cudf
>>> import re
>>> s = cudf.Series(["abc\nfff\nabc", "fff\nabc\nlll", "abc", "", "abc\n"])
>>> s
0    abc\nfff\nabc
1    fff\nabc\nlll
2              abc
3                 
4            abc\n
dtype: object
>>> sr = cudf.Series(["abc\r\nfff\r\nabc", "fff\r\nabc\r\nlll", "abc", "", "abc\r\n"])
>>> sr
0    abc\r\nfff\r\nabc
1    fff\r\nabc\r\nlll
2                  abc
3                     
4              abc\r\n
dtype: object

>>> s.str.extract("(^abc$)", flags=re.MULTILINE)
      0
0   abc
1   abc
2   abc
3  <NA>
4   abc

>>> sr.str.replace('\r\n', '\n', regex=False).str.extract("(^abc$)", flags=re.MULTILINE)
      0
0   abc
1   abc
2   abc
3  <NA>
4   abc

>>> sr.str.replace('\r\n', '\n', regex=False).str.extract("(^abc$)")
      0
0  <NA>
1  <NA>
2   abc
3  <NA>
4  <NA>
>>> s.str.extract("(^abc$)")
      0
0  <NA>
1  <NA>
2   abc
3  <NA>
4  <NA>

(verified these results with Pandas as well)

Note that you get the same result for (^abc$) and ^(abc)$

The new-line cannot be captured with ^ or $ with extract. For example:

>>> s.str.extract("(^abc$)", re.MULTILINE)
      0
0   abc
1   abc
2   abc
3  <NA>
4   abc
>>> s.str.extract("(^abc$f)", re.MULTILINE)
      0
0  <NA>
1  <NA>
2  <NA>
3  <NA>
4  <NA>
>>> s.str.extract("(^abc$^f)", re.MULTILINE)
      0
0  <NA>
1  <NA>
2  <NA>
3  <NA>
4  <NA>
>>> s.str.extract("(^abc^f)", re.MULTILINE)
      0
0  <NA>
1  <NA>
2  <NA>
3  <NA>
4  <NA>

(same for Pandas too)

Can you provide examples where replacing \r\n with \n would not work for you?

@davidwendt davidwendt self-assigned this Oct 26, 2022
@GregoryKimball GregoryKimball added 0 - Waiting on Author Waiting for author to respond to review and removed Needs Triage Need team to review and classify labels Oct 30, 2022
@NVnavkumar
Copy link
Contributor Author

In this case, the use case is using extract, so substituting \r\n for \n will probably not work.

@davidwendt
Copy link
Contributor

davidwendt commented Oct 31, 2022

In this case, the use case is using extract, so substituting \r\n for \n will probably not work.

Could you provide such an example where this would not work?
My examples above the extracted string never contain the \r or the \n in the result so there would be no need to fix up the result.

@NVnavkumar
Copy link
Contributor Author

In this case, the use case is using extract, so substituting \r\n for \n will probably not work.

Could you provide such an example where this would not work? My examples above the extracted string never contain the \r or the \n in the result so there would be no need to fix up the result.

So I did some initial testing. This substitution approach might work; however, I think there is a small unresolved issue. It looks like if we use re.MULTILINE, the \\Z cannot easily be used in combination with $ (because $ at the point can also match end of string). What's also interesting here is that in Python natively, when NOT using re.MULTILINE, $ actually matches before the \n:

>>> s = cudf.Series(["a.html\r\n", "b.txt\r\n", "c.html\r\nabcd", "d.txt"])
>>> r = s.str.replace("\r\n", "\n")
>>> r.str.extract("\\w+\\.(html$|txt$)")
      0
0  <NA>
1  <NA>
2  <NA>
3   txt

>>> for s in ["a.html\n", "b.txt\n", "c.html\nabcd", "d.txt"]:
...     m = re.match("\\w+\\.(html$|txt$)", s)
...     if m:
...             m.group(1)
...     else:
...             '<NA>'
...
'html'
'txt'
'<NA>'
'txt'

Now with re.MULTILINE:

>>> r.str.extract("\\w+\\.(html$|txt$)", re.MULTILINE)
      0
0  html
1   txt
2  html
3   txt
>>> for s in ["a.html\n", "b.txt\n", "c.html\nabcd", "d.txt"]:
...     m = re.match("\\w+\\.(html$|txt$)", s, re.MULTILINE)
...     if m:
...             m.group(1)
...     else:
...             '<NA>'
...
'html'
'txt'
'html'
'txt'

So, it looks like cuDF is consistent with native Python in MULTILINE mode, and not in normal mode. This substitution strategy could work in normal mode, if that issue is resolved.

@davidwendt
Copy link
Contributor

The edge case of '$' matching "...\n" when \n is at the end of the string in non-MULTILINE mode is not Python specific: https://www.regular-expressions.info/refanchors.html
I can look at supporting this edge case in libcudf.

@sameerz sameerz removed the 0 - Waiting on Author Waiting for author to respond to review label Nov 16, 2022
@GregoryKimball GregoryKimball added the libcudf Affects libcudf (C++/CUDA) code. label Nov 21, 2022
rapids-bot bot pushed a commit that referenced this issue Nov 28, 2022
…2181)

Support regex EOL where the string ends with a new-line character.
This matches the behavior for the EOL anchor `$` described here: https://www.regular-expressions.info/refanchors.html
Additional gtests are included.
The doxygen for cudf regex support is also updated.

Close #11979

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #12181
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants