Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Make line terminator sequence handling in regular expression engine a configurable option #15746

Open
NVnavkumar opened this issue May 14, 2024 · 9 comments
Labels
feature request New feature or request strings strings issues (C++ and Python)

Comments

@NVnavkumar
Copy link
Contributor

Is your feature request related to a problem? Please describe.
Some notes from #11979 here: The $ matches at the position right before a line terminator in regular expressions. In cuDF (and in Python), this is right before a newline\n. However, in Spark (or rather the JDK), the line terminator can be any one of the following sequences: \r, \n, \r\n, \u0085, \u2028, or \u2029 (unless UNIX_LINES mode is activated) (see https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#lt).

Describe the solution you'd like
It would be useful if we could configure the concept of line terminator sequences in cuDF. Ideally, this could be an optional parameter that would support a simple array of strings for line terminator sequences. But this also be a flag that enables a JDK_MODE which would enabling the more complex handling that can be enabled when calling the corresponding methods from the CUDF Java library.

Describe alternatives you've considered
Currently, spark-rapids handles $ by doing a heavy translation from a JDK regular expression to another regular expression supported by cuDF that handles the multiple possible line terminator sequences that the JDK uses. With this translation, we are limited to only using the $ in simple scenarios at the end of the regular expression, we cannot use them in choice | right now among other constructions because of the complexity (see NVIDIA/spark-rapids#10764)

@davidwendt
Copy link
Contributor

This has been requested before: #11979
Supporting an array of single characters may be doable but supporting \r\n will likely not be possible.

@davidwendt davidwendt added the strings strings issues (C++ and Python) label May 16, 2024
@GregoryKimball
Copy link
Contributor

GregoryKimball commented May 21, 2024

Thank you @NVnavkumar for raising this topic. Would you please share more information about this?

  • what is the performance for a line terminator pattern that only matches \n versus the workarounds Spark-RAPIDS has for the set of JDK_MODE line terminators?
  • would you please share a few examples of how line terminators interact with multiline regex patterns in Spark?
  • As @davidwendt mentioned, supporting a \r\n line terminator may not be possible. What other options do we have to help Spark return correct results in this case?
  • Would there be benefit to adding a JDK_MODE flag that supports line terminators of \r, \n, \u0085, \u2028, or \u2029 but not \r\n?

@NVnavkumar
Copy link
Contributor Author

Thank you @NVnavkumar for raising this topic. Would you please share more information about this?

  • what is the performance for a line terminator pattern that only matches \n versus the workarounds Spark-RAPIDS has for the set of JDK_MODE line terminators?
  • would you please share a few examples of how line terminators interact with multiline regex patterns in Spark?
  • As @davidwendt mentioned, supporting a \r\n line terminator may not be possible. What other options do we have to help Spark return correct results in this case?
  • Would there be benefit to adding a JDK_MODE flag that supports line terminators of \r, \n, \u0085, \u2028, or \u2029 but not \r\n?

Addressing these questions here:

  • I'm still working on measuring the performance impact here, and trying to ascertain that for certain strings that only include newlines (\n), what is the performance impact of the transpiled regex vs sending the original regex into cudf directly. The theory is that these would get pretty close to the Spark output so transpilation overhead can be reduced.

  • Line terminators actually dictate both ^ and $ behavior, since they dictate ultimately both the start and end of the line. Sometimes we want to use these in more complicated ways like choice (e.g.abc|$ matches abc or we just want the end of the line, see [FEA] Support single '$' or '^' on right side of regexp choice  NVIDIA/spark-rapids#10764 for corresponding spark-rapids issue). In multiline mode, this means that they basically become a matcher for the line terminator characters themselves or the end of the string.

  • One option we have (and we might even have it in lieu of this issue), is to substitute \r\n for \n, and then run the cudf regexp engine. However, this substitution adds an additional GPU operation and manipulates the original string, so for some operations (like extract), we won't get the same output since we won't be able to include the original line terminator in the output. Another option is we could simplify the transpilation to something like \r\n$|$. If that works, that might be a better option to maintain compatibility with Spark. I also would like to propose that Spark could disable such a transpilation under a "maximizeCompatiblity" flag for perfomance purposes.

  • Using the second option described in the previous paragraph, this could still be potentially very useful with the simplified transpilation.

I will try to update with some performance numbers soon.

@NVnavkumar
Copy link
Contributor Author

From a performance perspective, I did a very brief test with regexp_extract and Spark involving randomly generated strings which only included newlines at the end of the string (as opposed to other types of line terminators). I chose a relatively simple pattern (boo:+)$ (which involves a single capture group and the $ to match the end of the string. I chose extract since that's the case where the Spark-RAPIDS regexp transpiler code since it tests how extract handles the potential non-capturing group for $ transpilation:

Here is the performance of the extract kernel measured with nsys in 2 scenarios, 1 using the transpiled regex, 2 using the regex as given to cudf.

1) Transpiled Regex
Name	Start	Duration	TID	Category
extract	20.7979s	5.596 ms	939788	
extract	20.7979s	5.593 ms	939789	
extract	20.8318s	4.843 ms	939790	
extract	20.8317s	4.827 ms	939791	
------------------------------------------
 avg                          5.215 ms

2) cuDF using the original regex
Name	Start	Duration	TID	Category
extract	20.6463s	2.719 ms	941563	
extract	20.6789s	        1.840 ms	941564	
extract	20.6465s	2.634 ms	941565	
extract	20.6789s	        1.685 ms	941566	
------------------------------------------
 avg                                   2.220 ms

More testing can be done to fully vet this performance characteristic, but it should illustrate how much complexity the transpiled regex is adding to handle these additional line terminators.

@GregoryKimball
Copy link
Contributor

@NVnavkumar Would you please post an example of the transpiled pattern? I would love to see the new pattern that comes out after you go through the steps to convert $ into something that gives correct result.

@NVnavkumar
Copy link
Contributor Author

NVnavkumar commented Jun 6, 2024

@NVnavkumar Would you please post an example of the transpiled pattern? I would love to see the new pattern that comes out after you go through the steps to convert $ into something that gives correct result.

So the original pattern (boo:+)$ becomes (boo:+)(?:\r|\u0085|\u2028|\u2029|\r\n)?$. This is to accomodate the \r\n double character and the possibility of supporting the 4 additional characters \r and the 3 unicode line breaks.

@davidwendt
Copy link
Contributor

@NVnavkumar Could you provide some example strings to test against?
I just want to make sure I have a good representation of test data you are seeing.

@davidwendt
Copy link
Contributor

davidwendt commented Jun 10, 2024

@NVnavkumar Would you be able to run some tests with PR #15961 ?
No special flags are needed. I hardcoded the single-character new-line types.

rapids-bot bot pushed a commit that referenced this issue Sep 17, 2024
Add support for multiple new-line characters for BOL (`^` / `\A`) and EOL (`$` / `\Z`):
-  `\n` line-feed (already supported)
-  `\r` carriage-return
-  `\u0085` next line (NEL)
-  `\u2028` line separator
-  `\u2029` paragraph separator

Reference #15746

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Nghia Truong (https://github.com/ttnghia)
  - Navin Kumar (https://github.com/NVnavkumar)

URL: #15961
@davidwendt
Copy link
Contributor

#15961 solves this issue except for supporting multi-character new line \r\n

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request strings strings issues (C++ and Python)
Projects
Status: In Progress
Development

No branches or pull requests

4 participants
@GregoryKimball @davidwendt @NVnavkumar and others