[FEA] Make line terminator sequence handling in regular expression engine a configurable option #15746

NVnavkumar · 2024-05-14T19:17:48Z

Is your feature request related to a problem? Please describe.
Some notes from #11979 here: The $ matches at the position right before a line terminator in regular expressions. In cuDF (and in Python), this is right before a newline\n. However, in Spark (or rather the JDK), the line terminator can be any one of the following sequences: \r, \n, \r\n, \u0085, \u2028, or \u2029 (unless UNIX_LINES mode is activated) (see https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#lt).

Describe the solution you'd like
It would be useful if we could configure the concept of line terminator sequences in cuDF. Ideally, this could be an optional parameter that would support a simple array of strings for line terminator sequences. But this also be a flag that enables a JDK_MODE which would enabling the more complex handling that can be enabled when calling the corresponding methods from the CUDF Java library.

Describe alternatives you've considered
Currently, spark-rapids handles $ by doing a heavy translation from a JDK regular expression to another regular expression supported by cuDF that handles the multiple possible line terminator sequences that the JDK uses. With this translation, we are limited to only using the $ in simple scenarios at the end of the regular expression, we cannot use them in choice | right now among other constructions because of the complexity (see NVIDIA/spark-rapids#10764)

The text was updated successfully, but these errors were encountered:

davidwendt · 2024-05-14T20:02:45Z

This has been requested before: #11979
Supporting an array of single characters may be doable but supporting \r\n will likely not be possible.

GregoryKimball · 2024-05-21T20:54:02Z

Thank you @NVnavkumar for raising this topic. Would you please share more information about this?

what is the performance for a line terminator pattern that only matches \n versus the workarounds Spark-RAPIDS has for the set of JDK_MODE line terminators?
would you please share a few examples of how line terminators interact with multiline regex patterns in Spark?
As @davidwendt mentioned, supporting a \r\n line terminator may not be possible. What other options do we have to help Spark return correct results in this case?
Would there be benefit to adding a JDK_MODE flag that supports line terminators of \r, \n, \u0085, \u2028, or \u2029 but not \r\n?

NVnavkumar · 2024-05-28T19:01:11Z

Thank you @NVnavkumar for raising this topic. Would you please share more information about this?

what is the performance for a line terminator pattern that only matches \n versus the workarounds Spark-RAPIDS has for the set of JDK_MODE line terminators?

would you please share a few examples of how line terminators interact with multiline regex patterns in Spark?

As @davidwendt mentioned, supporting a \r\n line terminator may not be possible. What other options do we have to help Spark return correct results in this case?

Would there be benefit to adding a JDK_MODE flag that supports line terminators of \r, \n, \u0085, \u2028, or \u2029 but not \r\n?

Addressing these questions here:

I'm still working on measuring the performance impact here, and trying to ascertain that for certain strings that only include newlines (\n), what is the performance impact of the transpiled regex vs sending the original regex into cudf directly. The theory is that these would get pretty close to the Spark output so transpilation overhead can be reduced.
Line terminators actually dictate both ^ and $ behavior, since they dictate ultimately both the start and end of the line. Sometimes we want to use these in more complicated ways like choice (e.g.abc|$ matches abc or we just want the end of the line, see [FEA] Support single '$' or '^' on right side of regexp choice NVIDIA/spark-rapids#10764 for corresponding spark-rapids issue). In multiline mode, this means that they basically become a matcher for the line terminator characters themselves or the end of the string.
One option we have (and we might even have it in lieu of this issue), is to substitute \r\n for \n, and then run the cudf regexp engine. However, this substitution adds an additional GPU operation and manipulates the original string, so for some operations (like extract), we won't get the same output since we won't be able to include the original line terminator in the output. Another option is we could simplify the transpilation to something like \r\n$|$. If that works, that might be a better option to maintain compatibility with Spark. I also would like to propose that Spark could disable such a transpilation under a "maximizeCompatiblity" flag for perfomance purposes.
Using the second option described in the previous paragraph, this could still be potentially very useful with the simplified transpilation.

I will try to update with some performance numbers soon.

NVnavkumar · 2024-06-03T21:24:40Z

From a performance perspective, I did a very brief test with regexp_extract and Spark involving randomly generated strings which only included newlines at the end of the string (as opposed to other types of line terminators). I chose a relatively simple pattern (boo:+)$ (which involves a single capture group and the $ to match the end of the string. I chose extract since that's the case where the Spark-RAPIDS regexp transpiler code since it tests how extract handles the potential non-capturing group for $ transpilation:

Here is the performance of the extract kernel measured with nsys in 2 scenarios, 1 using the transpiled regex, 2 using the regex as given to cudf.

1) Transpiled Regex
Name	Start	Duration	TID	Category
extract	20.7979s	5.596 ms	939788	
extract	20.7979s	5.593 ms	939789	
extract	20.8318s	4.843 ms	939790	
extract	20.8317s	4.827 ms	939791	
------------------------------------------
 avg                          5.215 ms

2) cuDF using the original regex
Name	Start	Duration	TID	Category
extract	20.6463s	2.719 ms	941563	
extract	20.6789s	        1.840 ms	941564	
extract	20.6465s	2.634 ms	941565	
extract	20.6789s	        1.685 ms	941566	
------------------------------------------
 avg                                   2.220 ms

More testing can be done to fully vet this performance characteristic, but it should illustrate how much complexity the transpiled regex is adding to handle these additional line terminators.

GregoryKimball · 2024-06-04T19:58:41Z

@NVnavkumar Would you please post an example of the transpiled pattern? I would love to see the new pattern that comes out after you go through the steps to convert $ into something that gives correct result.

NVnavkumar · 2024-06-06T01:21:14Z

@NVnavkumar Would you please post an example of the transpiled pattern? I would love to see the new pattern that comes out after you go through the steps to convert $ into something that gives correct result.

So the original pattern (boo:+)$ becomes (boo:+)(?:\r|\u0085|\u2028|\u2029|\r\n)?$. This is to accomodate the \r\n double character and the possibility of supporting the 4 additional characters \r and the 3 unicode line breaks.

davidwendt · 2024-06-10T13:00:19Z

@NVnavkumar Could you provide some example strings to test against?
I just want to make sure I have a good representation of test data you are seeing.

davidwendt · 2024-06-10T21:36:26Z

@NVnavkumar Would you be able to run some tests with PR #15961 ?
No special flags are needed. I hardcoded the single-character new-line types.

Add support for multiple new-line characters for BOL (`^` / `\A`) and EOL (`$` / `\Z`): - `\n` line-feed (already supported) - `\r` carriage-return - `\u0085` next line (NEL) - `\u2028` line separator - `\u2029` paragraph separator Reference #15746 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Vukasin Milovanovic (https://github.com/vuule) - Nghia Truong (https://github.com/ttnghia) - Navin Kumar (https://github.com/NVnavkumar) URL: #15961

davidwendt · 2024-10-15T18:57:16Z

#15961 solves this issue except for supporting multi-character new line \r\n

NVnavkumar added the feature request New feature or request label May 14, 2024

NVnavkumar mentioned this issue May 14, 2024

[FEA] Support single '$' or '^' on right side of regexp choice NVIDIA/spark-rapids#10764

Open

davidwendt added the strings strings issues (C++ and Python) label May 16, 2024

davidwendt mentioned this issue Jun 10, 2024

Support multiple new-line characters in regex APIs #15961

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEA] Make line terminator sequence handling in regular expression engine a configurable option #15746

[FEA] Make line terminator sequence handling in regular expression engine a configurable option #15746

NVnavkumar commented May 14, 2024

davidwendt commented May 14, 2024

GregoryKimball commented May 21, 2024 •

edited

Loading

NVnavkumar commented May 28, 2024

NVnavkumar commented Jun 3, 2024

GregoryKimball commented Jun 4, 2024

NVnavkumar commented Jun 6, 2024 •

edited

Loading

davidwendt commented Jun 10, 2024

davidwendt commented Jun 10, 2024 •

edited

Loading

davidwendt commented Oct 15, 2024

[FEA] Make line terminator sequence handling in regular expression engine a configurable option #15746

[FEA] Make line terminator sequence handling in regular expression engine a configurable option #15746

Comments

NVnavkumar commented May 14, 2024

davidwendt commented May 14, 2024

GregoryKimball commented May 21, 2024 • edited Loading

NVnavkumar commented May 28, 2024

NVnavkumar commented Jun 3, 2024

GregoryKimball commented Jun 4, 2024

NVnavkumar commented Jun 6, 2024 • edited Loading

davidwendt commented Jun 10, 2024

davidwendt commented Jun 10, 2024 • edited Loading

davidwendt commented Oct 15, 2024

GregoryKimball commented May 21, 2024 •

edited

Loading

NVnavkumar commented Jun 6, 2024 •

edited

Loading

davidwendt commented Jun 10, 2024 •

edited

Loading