Skip to content

grep: honor whitespace escapes in BRE#55

Closed
wondr-wclabs wants to merge 1 commit into
uutils:mainfrom
wondr-wclabs:codex/bre-whitespace-shorthands
Closed

grep: honor whitespace escapes in BRE#55
wondr-wclabs wants to merge 1 commit into
uutils:mainfrom
wondr-wclabs:codex/bre-whitespace-shorthands

Conversation

@wondr-wclabs
Copy link
Copy Markdown
Contributor

@wondr-wclabs wondr-wclabs commented Jun 5, 2026

Fixes #31.

GNU grep accepts \s and \S as whitespace shorthands in basic regular expressions. Extended mode already recognizes those spellings through Syntax::gnu_regex(), but the default BRE path treated them as literal s/S because Syntax::grep() does not enable that GNU extension by default.

This now enables Oniguruma's SYNTAX_OPERATOR_ESC_S_WHITE_SPACE for BRE. That keeps the implementation small and uses the regex engine's own syntax support instead of rewriting patterns before compilation.

One tradeoff is intentionally left in place: Oniguruma's direct \S behavior can differ from GNU grep on lone invalid UTF-8 bytes. I had previously worked around that by expanding \s/\S into POSIX classes, but that was a disproportionately large local workaround while other GNU/Oniguruma semantic gaps still remain. This version fixes the common valid-input shorthand behavior and BRE repetition support without adding a custom mini parser.

Validation:

  • cargo fmt --all -- --check
  • cargo test bre_whitespace_shorthands
  • cargo test
  • cargo clippy --all-targets --workspace -puu_grep -- -D warnings
  • git diff --check

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 5, 2026

GNU grep testsuite comparison:

Test results comparison:
  Current:   TOTAL: 128 / PASSED: 73 / FAILED: 34 / SKIPPED: 21
  Reference: TOTAL: 128 / PASSED: 72 / FAILED: 35 / SKIPPED: 21

Changes from main branch:
  TOTAL: +0
  PASSED: +1
  FAILED: -1

New test failures (1):
  - backslash-s-vs-invalid-multibyte

Test improvements (2):
  + backslash-s-and-repetition-operators
  + multibyte-white-space

@codspeed-hq
Copy link
Copy Markdown

codspeed-hq Bot commented Jun 5, 2026

Merging this PR will not alter performance

✅ 10 untouched benchmarks
⏩ 17 skipped benchmarks1


Comparing wondr-wclabs:codex/bre-whitespace-shorthands (f319232) with main (e925ff4)

Open in CodSpeed

Footnotes

  1. 17 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@wondr-wclabs
Copy link
Copy Markdown
Contributor Author

I pushed a follow-up for the GNU comparison failure (backslash-s-vs-invalid-multibyte). The first version enabled Oniguruma's \s/\S operator directly for BRE, which fixed the positive GNU shorthand cases but made \S match a lone invalid UTF-8 byte as "not whitespace". GNU's test expects neither \s nor \S to match that byte in a UTF-8 locale.

The updated version expands BRE \s/\S outside bracket expressions to POSIX space classes before compiling: \s becomes [[:space:]], and \S becomes [[:^space:]]. That keeps the GNU shorthand behavior, preserves BRE repetition like \s\+, and avoids the invalid-byte match because Oniguruma's negative POSIX class form does not include invalid UTF-8 as a valid non-space character.

I verified the specific GNU test locally (backslash-s-vs-invalid-multibyte) in addition to rerunning cargo fmt --all -- --check, cargo test, cargo clippy --all-targets --workspace -puu_grep -- -D warnings, and git diff --check.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 5, 2026

GNU grep testsuite comparison:

Test results comparison:
  Current:   TOTAL: 128 / PASSED: 74 / FAILED: 33 / SKIPPED: 21
  Reference: TOTAL: 128 / PASSED: 72 / FAILED: 35 / SKIPPED: 21

Changes from main branch:
  TOTAL: +0
  PASSED: +2
  FAILED: -2

Test improvements (2):
  + backslash-s-and-repetition-operators
  + multibyte-white-space

Copy link
Copy Markdown
Collaborator

@lhecker lhecker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally strongly prefer the broken \S behavior but using SYNTAX_OPERATOR_ESC_S_WHITE_SPACE for a shorter solution.

...there's is no way to get Oniguruma to work like GNU grep's regex engine in the first place. We shouldn't try filling tiny gaps with big workarounds while still leaving gaping holes.

@wondr-wclabs
Copy link
Copy Markdown
Contributor Author

@lhecker That direction makes sense. I was optimizing for the backslash-s-vs-invalid-multibyte GNU comparison case, but the workaround made this PR carry a custom BRE pattern rewrite while the underlying engine still has larger GNU compatibility gaps. That is not a good tradeoff for a small syntax extension.

I changed the branch back to the direct SYNTAX_OPERATOR_ESC_S_WHITE_SPACE approach for BRE and removed the rewrite/helper code. The remaining behavior this PR now targets is the common valid-input GNU shorthand behavior: \s, \S, and composition with BRE repetition like \s\+. I also removed the invalid-byte expectations from the local regression test, since those were specifically validating the larger workaround rather than the narrower syntax fix.

Validation after the change:

  • cargo fmt --all -- --check
  • cargo test bre_whitespace_shorthands
  • cargo test
  • cargo clippy --all-targets --workspace -puu_grep -- -D warnings
  • git diff --check

@wondr-wclabs wondr-wclabs force-pushed the codex/bre-whitespace-shorthands branch from 825ff67 to f319232 Compare June 5, 2026 16:17
@wondr-wclabs
Copy link
Copy Markdown
Contributor Author

One follow-up after the rebase/CI run: the direct SYNTAX_OPERATOR_ESC_S_WHITE_SPACE implementation does exactly what we discussed, but it also confirms the tradeoff in CI. The GNU aggregate check is red because backslash-s-vs-invalid-multibyte is a new deterministic GNU-regression compared with main.

I looked for an appropriate way to classify that as an accepted engine mismatch, but the only ignore mechanism in this repo is .github/workflows/ignore-intermittent.txt, which is documented for flaky/environment-dependent tests. I do not think it would be honest to put this case there: it is not intermittent; it is a stable Oniguruma-vs-GNU semantic difference.

So I think there are three real options:

  1. Keep this direct-operator implementation, but the GNU aggregate gate needs an explicit project-level way to accept deterministic known incompatibilities.
  2. Restore the POSIX-class rewrite, which makes CI green but adds the local parser workaround you objected to.
  3. Close this PR and leave BRE \s / \S unsupported until the broader regex-engine direction changes.

Given your feedback, I am leaning toward option 3 unless maintainers want either an explicit known-incompatibility mechanism or the narrower workaround despite the extra code.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 5, 2026

GNU grep testsuite comparison:

Test results comparison:
  Current:   TOTAL: 128 / PASSED: 73 / FAILED: 34 / SKIPPED: 21
  Reference: TOTAL: 128 / PASSED: 72 / FAILED: 35 / SKIPPED: 21

Changes from main branch:
  TOTAL: +0
  PASSED: +1
  FAILED: -1

New test failures (1):
  - backslash-s-vs-invalid-multibyte

Test improvements (2):
  + backslash-s-and-repetition-operators
  + multibyte-white-space

@wondr-wclabs
Copy link
Copy Markdown
Contributor Author

Closing this for now rather than leaving a known-red PR open.

After switching back to the direct SYNTAX_OPERATOR_ESC_S_WHITE_SPACE implementation you preferred, the branch still fails the GNU aggregate gate deterministically on backslash-s-vs-invalid-multibyte. That failure is a real Oniguruma-vs-GNU semantic difference, not a flaky/environmental test, so putting it in the existing intermittent-ignore file would be misleading.

The only two ways I see to make this mergeable are either the POSIX-class rewrite workaround, which you reasonably pushed back on because it adds parser code for a small compatibility gap, or a broader project-level mechanism for accepted deterministic GNU incompatibilities. Without one of those, keeping this open would not be useful. I will leave BRE \s / \S unsupported for now and revisit only if the regex compatibility direction changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

\s and \S are not honored in basic-regexp (-G) mode like GNU

2 participants