grep: honor whitespace escapes in BRE by wondr-wclabs · Pull Request #55 · uutils/grep

wondr-wclabs · 2026-06-05T10:41:44Z

Fixes #31.

GNU grep accepts \s and \S as whitespace shorthands in basic regular expressions. Extended mode already recognizes those spellings through Syntax::gnu_regex(), but the default BRE path treated them as literal s/S because Syntax::grep() does not enable that GNU extension by default.

This now enables Oniguruma's SYNTAX_OPERATOR_ESC_S_WHITE_SPACE for BRE. That keeps the implementation small and uses the regex engine's own syntax support instead of rewriting patterns before compilation.

One tradeoff is intentionally left in place: Oniguruma's direct \S behavior can differ from GNU grep on lone invalid UTF-8 bytes. I had previously worked around that by expanding \s/\S into POSIX classes, but that was a disproportionately large local workaround while other GNU/Oniguruma semantic gaps still remain. This version fixes the common valid-input shorthand behavior and BRE repetition support without adding a custom mini parser.

Validation:

cargo fmt --all -- --check
cargo test bre_whitespace_shorthands
cargo test
cargo clippy --all-targets --workspace -puu_grep -- -D warnings
git diff --check

github-actions · 2026-06-05T10:43:35Z

GNU grep testsuite comparison:

Test results comparison:
  Current:   TOTAL: 128 / PASSED: 73 / FAILED: 34 / SKIPPED: 21
  Reference: TOTAL: 128 / PASSED: 72 / FAILED: 35 / SKIPPED: 21

Changes from main branch:
  TOTAL: +0
  PASSED: +1
  FAILED: -1

New test failures (1):
  - backslash-s-vs-invalid-multibyte

Test improvements (2):
  + backslash-s-and-repetition-operators
  + multibyte-white-space

codspeed-hq · 2026-06-05T10:44:15Z

Merging this PR will not alter performance

✅ 10 untouched benchmarks
⏩ 17 skipped benchmarks¹

_{Comparing wondr-wclabs:codex/bre-whitespace-shorthands (f319232) with main (e925ff4)}

17 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports. ↩

wondr-wclabs · 2026-06-05T10:57:43Z

I pushed a follow-up for the GNU comparison failure (backslash-s-vs-invalid-multibyte). The first version enabled Oniguruma's \s/\S operator directly for BRE, which fixed the positive GNU shorthand cases but made \S match a lone invalid UTF-8 byte as "not whitespace". GNU's test expects neither \s nor \S to match that byte in a UTF-8 locale.

The updated version expands BRE \s/\S outside bracket expressions to POSIX space classes before compiling: \s becomes [[:space:]], and \S becomes [[:^space:]]. That keeps the GNU shorthand behavior, preserves BRE repetition like \s\+, and avoids the invalid-byte match because Oniguruma's negative POSIX class form does not include invalid UTF-8 as a valid non-space character.

I verified the specific GNU test locally (backslash-s-vs-invalid-multibyte) in addition to rerunning cargo fmt --all -- --check, cargo test, cargo clippy --all-targets --workspace -puu_grep -- -D warnings, and git diff --check.

github-actions · 2026-06-05T10:59:39Z

GNU grep testsuite comparison:

Test results comparison:
  Current:   TOTAL: 128 / PASSED: 74 / FAILED: 33 / SKIPPED: 21
  Reference: TOTAL: 128 / PASSED: 72 / FAILED: 35 / SKIPPED: 21

Changes from main branch:
  TOTAL: +0
  PASSED: +2
  FAILED: -2

Test improvements (2):
  + backslash-s-and-repetition-operators
  + multibyte-white-space

lhecker

I personally strongly prefer the broken \S behavior but using SYNTAX_OPERATOR_ESC_S_WHITE_SPACE for a shorter solution.

...there's is no way to get Oniguruma to work like GNU grep's regex engine in the first place. We shouldn't try filling tiny gaps with big workarounds while still leaving gaping holes.

wondr-wclabs · 2026-06-05T16:09:54Z

@lhecker That direction makes sense. I was optimizing for the backslash-s-vs-invalid-multibyte GNU comparison case, but the workaround made this PR carry a custom BRE pattern rewrite while the underlying engine still has larger GNU compatibility gaps. That is not a good tradeoff for a small syntax extension.

I changed the branch back to the direct SYNTAX_OPERATOR_ESC_S_WHITE_SPACE approach for BRE and removed the rewrite/helper code. The remaining behavior this PR now targets is the common valid-input GNU shorthand behavior: \s, \S, and composition with BRE repetition like \s\+. I also removed the invalid-byte expectations from the local regression test, since those were specifically validating the larger workaround rather than the narrower syntax fix.

Validation after the change:

cargo fmt --all -- --check
cargo test bre_whitespace_shorthands
cargo test
cargo clippy --all-targets --workspace -puu_grep -- -D warnings
git diff --check

wondr-wclabs · 2026-06-05T16:23:50Z

One follow-up after the rebase/CI run: the direct SYNTAX_OPERATOR_ESC_S_WHITE_SPACE implementation does exactly what we discussed, but it also confirms the tradeoff in CI. The GNU aggregate check is red because backslash-s-vs-invalid-multibyte is a new deterministic GNU-regression compared with main.

I looked for an appropriate way to classify that as an accepted engine mismatch, but the only ignore mechanism in this repo is .github/workflows/ignore-intermittent.txt, which is documented for flaky/environment-dependent tests. I do not think it would be honest to put this case there: it is not intermittent; it is a stable Oniguruma-vs-GNU semantic difference.

So I think there are three real options:

Keep this direct-operator implementation, but the GNU aggregate gate needs an explicit project-level way to accept deterministic known incompatibilities.
Restore the POSIX-class rewrite, which makes CI green but adds the local parser workaround you objected to.
Close this PR and leave BRE \s / \S unsupported until the broader regex-engine direction changes.

Given your feedback, I am leaning toward option 3 unless maintainers want either an explicit known-incompatibility mechanism or the narrower workaround despite the extra code.

github-actions · 2026-06-05T16:31:52Z

GNU grep testsuite comparison:

Test results comparison:
  Current:   TOTAL: 128 / PASSED: 73 / FAILED: 34 / SKIPPED: 21
  Reference: TOTAL: 128 / PASSED: 72 / FAILED: 35 / SKIPPED: 21

Changes from main branch:
  TOTAL: +0
  PASSED: +1
  FAILED: -1

New test failures (1):
  - backslash-s-vs-invalid-multibyte

Test improvements (2):
  + backslash-s-and-repetition-operators
  + multibyte-white-space

wondr-wclabs · 2026-06-05T23:35:08Z

Closing this for now rather than leaving a known-red PR open.

After switching back to the direct SYNTAX_OPERATOR_ESC_S_WHITE_SPACE implementation you preferred, the branch still fails the GNU aggregate gate deterministically on backslash-s-vs-invalid-multibyte. That failure is a real Oniguruma-vs-GNU semantic difference, not a flaky/environmental test, so putting it in the existing intermittent-ignore file would be misleading.

The only two ways I see to make this mergeable are either the POSIX-class rewrite workaround, which you reasonably pushed back on because it adds parser code for a small compatibility gap, or a broader project-level mechanism for accepted deterministic GNU incompatibilities. Without one of those, keeping this open would not be useful. I will leave BRE \s / \S unsupported for now and revisit only if the regex compatibility direction changes.

wondr-wclabs mentioned this pull request Jun 5, 2026

grep: honor whitespace escapes in BRE #37

Closed

lhecker requested changes Jun 5, 2026

View reviewed changes

grep: honor whitespace escapes in BRE

f319232

wondr-wclabs force-pushed the codex/bre-whitespace-shorthands branch from 825ff67 to f319232 Compare June 5, 2026 16:17

wondr-wclabs mentioned this pull request Jun 5, 2026

grep: support POSIX equivalence classes #39

Closed

wondr-wclabs closed this Jun 5, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

grep: honor whitespace escapes in BRE#55

grep: honor whitespace escapes in BRE#55
wondr-wclabs wants to merge 1 commit into
uutils:mainfrom
wondr-wclabs:codex/bre-whitespace-shorthands

wondr-wclabs commented Jun 5, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 5, 2026

Uh oh!

codspeed-hq Bot commented Jun 5, 2026 •

edited

Loading

Uh oh!

wondr-wclabs commented Jun 5, 2026

Uh oh!

github-actions Bot commented Jun 5, 2026

Uh oh!

lhecker left a comment

Uh oh!

wondr-wclabs commented Jun 5, 2026

Uh oh!

wondr-wclabs commented Jun 5, 2026

Uh oh!

github-actions Bot commented Jun 5, 2026

Uh oh!

wondr-wclabs commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

wondr-wclabs commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented Jun 5, 2026

Uh oh!

codspeed-hq Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Merging this PR will not alter performance

Footnotes

Uh oh!

wondr-wclabs commented Jun 5, 2026

Uh oh!

github-actions Bot commented Jun 5, 2026

Uh oh!

lhecker left a comment

Choose a reason for hiding this comment

Uh oh!

wondr-wclabs commented Jun 5, 2026

Uh oh!

wondr-wclabs commented Jun 5, 2026

Uh oh!

github-actions Bot commented Jun 5, 2026

Uh oh!

wondr-wclabs commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

wondr-wclabs commented Jun 5, 2026 •

edited

Loading

codspeed-hq Bot commented Jun 5, 2026 •

edited

Loading