[FEA] Make line terminator sequence handling in regular expression engine a configurable option #15746
Comments
This has been requested before: #11979
Thank you @NVnavkumar for raising this topic. Would you please share more information about this?
Addressing these questions here:
I will try to update with some performance numbers soon.
From a performance perspective, I did a very brief test with regexp_extract and Spark involving randomly generated strings which only included newlines at the end of the string (as opposed to other types of line terminators). I chose a relatively simple pattern Here is the performance of the
More testing is needed to fully vet this performance characteristic, but it should illustrate how much complexity the transpiled regex adds in order to handle these additional line terminators.
@NVnavkumar Would you please post an example of the transpiled pattern? I would love to see the new pattern that comes out after you go through the steps to convert |
So the original pattern |
@NVnavkumar Could you provide some example strings to test against?
@NVnavkumar Would you be able to run some tests with PR #15961?
Add support for multiple new-line characters for BOL (`^` / `\A`) and EOL (`$` / `\Z`):
- `\n` line-feed (already supported)
- `\r` carriage-return
- `\u0085` next line (NEL)
- `\u2028` line separator
- `\u2029` paragraph separator

Reference #15746

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Vukasin Milovanovic (https://github.com/vuule)
- Nghia Truong (https://github.com/ttnghia)
- Navin Kumar (https://github.com/NVnavkumar)

URL: #15961
#15961 solves this issue except for supporting multi-character new-line sequences (such as `\r\n`).
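To make the remaining gap concrete, here is a minimal Python sketch (not cuDF code, and not taken from PR #15961) of what treating each terminator as a single character means. The names `SINGLE_CHAR_EOL`, `EOL_SINGLE`, and `eol_positions` are hypothetical, and the pattern is only an assumption about how a single-character, multiline-style EOL check could be expressed with Python's `re` module.

```python
import re

# Single-character terminators from the list in the PR description.
SINGLE_CHAR_EOL = "\n\r\u0085\u2028\u2029"

# Hypothetical multiline-style EOL: matches at the end of the input, or just
# before any one of the single terminator characters.
EOL_SINGLE = re.compile(r"(?=[" + SINGLE_CHAR_EOL + r"]|\Z)")


def eol_positions(text: str) -> list[int]:
    """Return every offset at which the single-character EOL check succeeds."""
    return [m.start() for m in EOL_SINGLE.finditer(text)]


# With "\r\n" treated as two separate characters, an EOL is also reported at
# offset 2, between "\r" and "\n". The JDK treats "\r\n" as one terminator.
print(eol_positions("a\r\nb"))
```

A multi-character-aware engine would need to suppress the position between `\r` and `\n`, which is the part that a per-character check cannot express.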
Is your feature request related to a problem? Please describe.
Some notes from #11979 here: The `$` matches at the position right before a line terminator in regular expressions. In cuDF (and in Python), this is right before a newline `\n`. However, in Spark (or rather the JDK), the line terminator can be any one of the following sequences: `\r`, `\n`, `\r\n`, `\u0085`, `\u2028`, or `\u2029` (unless UNIX_LINES mode is activated) (see https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#lt).
Describe the solution you'd like
It would be useful if we could configure the concept of line terminator sequences in cuDF. Ideally, this could be an optional parameter that would support a simple array of strings for line terminator sequences. But this could also be a flag that enables a `JDK_MODE`, which would enable the more complex handling and could be turned on when calling the corresponding methods from the cuDF Java library.
Describe alternatives you've considered
Currently, spark-rapids handles `$` by doing a heavy translation from a JDK regular expression to another regular expression supported by cuDF that handles the multiple possible line terminator sequences that the JDK uses. With this translation, we are limited to using `$` only in simple scenarios at the end of the regular expression. We cannot use it in a choice (`|`), among other constructions, because of the complexity (see NVIDIA/spark-rapids#10764).
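As a rough illustration of why this translation gets complicated, the sketch below approximates a JDK-style `$` (without MULTILINE) using Python's `re`, which natively recognizes only `\n`. This is not the pattern spark-rapids actually generates (that pattern is not shown in this issue). `JDK_DOLLAR` and both helper functions are hypothetical names, and the lookaround construction is an assumption based on the line terminator list in the JDK `Pattern` documentation.

```python
import re

# Single-character line terminators from the JDK Pattern documentation
# (applies when UNIX_LINES mode is not active).
JDK_SINGLE_CHAR_TERMINATORS = "\n\r\u0085\u2028\u2029"

# Hypothetical replacement for `$`: match at the end of the input, or just
# before one final line terminator, but never between "\r" and "\n".
JDK_DOLLAR = (
    r"(?!(?<=\r)\n)"
    r"(?=(?:\r\n|[" + JDK_SINGLE_CHAR_TERMINATORS + r"])?\Z)"
)


def ends_with_python_style(text: str, literal: str) -> bool:
    """Check `literal$` with Python's native `$` (only "\\n" is a terminator)."""
    return re.search(re.escape(literal) + "$", text) is not None


def ends_with_jdk_style(text: str, literal: str) -> bool:
    """Check `literal$` with the approximated JDK-style end-of-line rule."""
    return re.search(re.escape(literal) + JDK_DOLLAR, text) is not None


for sample in ["abc\n", "abc\r\n", "abc\u2028", "abc\n\n"]:
    print(repr(sample),
          ends_with_python_style(sample, "abc"),
          ends_with_jdk_style(sample, "abc"))
```

Even for this one anchor the replacement is a pair of nested lookarounds, which hints at why embedding it inside alternations or other constructs becomes hard to do mechanically.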