[FEA] Make line terminator sequences in regular expression using `$` a configurable option when `MULTILINE` flag is enabled #11979
Comments
Could you provide some example use cases? Would it be possible to convert
(verified these results with Pandas as well) Note that you get the same result for The new-line cannot be captured with
(same for Pandas too) Can you provide examples where replacing |
In this case, the use case is using extract, so substituting |
Could you provide such an example where this would not work? |
So I did some initial testing. This substitution approach might work; however, I think there is a small unresolved issue. It looks like if we use
>>> s = cudf.Series(["a.html\r\n", "b.txt\r\n", "c.html\r\nabcd", "d.txt"])
>>> r = s.str.replace("\r\n", "\n")
>>> r.str.extract("\\w+\\.(html$|txt$)")
0
0 <NA>
1 <NA>
2 <NA>
3 txt
>>> for s in ["a.html\n", "b.txt\n", "c.html\nabcd", "d.txt"]:
...     m = re.match("\\w+\\.(html$|txt$)", s)
...     if m:
...         m.group(1)
...     else:
...         '<NA>'
...
'html'
'txt'
'<NA>'
'txt'
Now with
>>> r.str.extract("\\w+\\.(html$|txt$)", re.MULTILINE)
0
0 html
1 txt
2 html
3 txt
>>> for s in ["a.html\n", "b.txt\n", "c.html\nabcd", "d.txt"]:
...     m = re.match("\\w+\\.(html$|txt$)", s, re.MULTILINE)
...     if m:
...         m.group(1)
...     else:
...         '<NA>'
...
'html'
'txt'
'html'
'txt'
So, it looks like cuDF is consistent with native Python in MULTILINE mode, but not in normal mode. This substitution strategy could work in normal mode if that issue is resolved. |
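The substitution strategy discussed in this comment can be sketched in plain Python with the standard `re` module (not cuDF). This is only an illustration under assumptions: the helper name `normalize_newlines` is hypothetical, and rewriting every JDK-style terminator (rather than just `\r\n`, as in the session above) is an extension that the thread does not confirm.

```python
import re

# Assumption: rewrite "\r\n" and the other JDK-style terminators to "\n".
# The session above only replaces "\r\n".
_TERMINATORS = re.compile("\r\n|[\r\u0085\u2028\u2029]")

def normalize_newlines(s):
    """Hypothetical helper: map every recognized terminator to a single '\\n'."""
    return _TERMINATORS.sub("\n", s)

# Same pattern as in the session above, compiled in MULTILINE mode.
pattern = re.compile(r"\w+\.(html$|txt$)", re.MULTILINE)

rows = ["a.html\r\n", "b.txt\r\n", "c.html\r\nabcd", "d.txt"]
extracted = []
for s in rows:
    m = pattern.match(normalize_newlines(s))
    extracted.append(m.group(1) if m else None)

print(extracted)  # ['html', 'txt', 'html', 'txt']
```

One caveat of this approach: normalization changes the string itself, so match offsets and any captured text that contains a terminator no longer correspond exactly to the original input.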
The edge case of |
…2181) Support regex EOL where the string ends with a new-line character. This matches the behavior for the EOL anchor `$` described here: https://www.regular-expressions.info/refanchors.html
Additional gtests are included. The doxygen for cudf regex support is also updated.
Close #11979
Authors:
- David Wendt (https://github.com/davidwendt)
Approvers:
- Bradley Dice (https://github.com/bdice)
- Vyas Ramasubramani (https://github.com/vyasr)
URL: #12181
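For reference, the end-of-line behavior that this change aligns with can be observed in Python's own `re` module. The snippet below is a small illustration using only the standard library, not cuDF; the variable names are arbitrary.

```python
import re

# Without MULTILINE, `$` matches at the very end of the string and also
# just before a single trailing newline, but not before an interior newline.
pattern = re.compile(r"\w+\.(html$|txt$)")

def extract(s):
    m = pattern.match(s)
    return m.group(1) if m else None

results = {s: extract(s) for s in ["a.html\n", "b.txt\n", "c.html\nabcd", "d.txt"]}
# "a.html\n"     -> 'html'  (trailing newline is tolerated by `$`)
# "b.txt\n"      -> 'txt'
# "c.html\nabcd" -> None    (newline is not at the end of the string)
# "d.txt"        -> 'txt'

# `\Z` is stricter: it matches only at the very end of the string.
strict = re.search(r"html\Z", "a.html\n")  # None
```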
Is your feature request related to a problem? Please describe.
In regular expression multiline mode, `$` currently matches at the position right before a newline character `\n` (a line terminator) in cuDF. This behavior makes sense and is consistent with the Python implementation. However, Apache Spark uses the JDK (Java) implementation, where line terminator sequences are a bit more complex. The JDK regular expression library treats any of newline (`\n`), carriage return (`\r`), carriage return followed by newline (`\r\n`), and 3 other Unicode newline variants as a line terminator.
Describe the solution you'd like
It would be useful if we could configure the concept of line terminator sequences in cuDF. Ideally, this could be an optional parameter that would support a simple array of strings for line terminator sequences. Another alternative could be another flag for line terminators or another type of MULTILINE flag.
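To make the request concrete, here is a rough sketch in plain Python (standard `re` module, not cuDF) of what a configurable list of line terminators could mean for `$`. The helper names `eol_anchor` and `extract_extension` are hypothetical and do not correspond to any cuDF API. The terminator list is the default set documented for `java.util.regex.Pattern`. The sketch is a simplification and does not attempt to handle the position between `\r` and `\n` specially.

```python
import re

# Default line terminators documented for java.util.regex.Pattern.
JDK_LINE_TERMINATORS = ["\r\n", "\n", "\r", "\u0085", "\u2028", "\u2029"]

def eol_anchor(terminators):
    """Hypothetical helper: a lookahead that stands in for `$` in MULTILINE
    mode, succeeding before any of the given terminators or at end of input."""
    # Longest sequences first so "\r\n" is tried before "\r".
    ordered = sorted(terminators, key=len, reverse=True)
    alternatives = "|".join(re.escape(t) for t in ordered)
    return rf"(?=(?:{alternatives})|\Z)"

def extract_extension(s, terminators):
    """Hypothetical helper: extract the extension using the configured anchor."""
    pattern = r"\w+\.(html|txt)" + eol_anchor(terminators)
    m = re.match(pattern, s)
    return m.group(1) if m else None

rows = ["a.html\r\n", "b.txt\r", "c.html\u2028abcd", "d.txt"]

# Only "\n" counts as a line terminator (the behavior described above).
newline_only = [extract_extension(s, ["\n"]) for s in rows]
# All JDK-style terminators count.
jdk_style = [extract_extension(s, JDK_LINE_TERMINATORS) for s in rows]

print(newline_only)  # [None, None, None, 'txt']
print(jdk_style)     # ['html', 'txt', 'html', 'txt']
```

Because the anchor is built from a plain list of strings, the same pattern can be evaluated under different terminator sets, which is the kind of option the request describes.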
Describe alternatives you've considered
Currently, spark-rapids handles `$` by doing a heavy translation from a JDK regular expression to another regular expression supported by cuDF that handles the multiple possible line terminator sequences that the JDK uses. It cannot use cuDF MULTILINE mode because only the newline is handled there. With this translation, we are limited to using `$` only in simple scenarios at the end of the regular expression; we cannot currently use it in a choice (`|`), among other constructions, because of the complexity.