Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Thompson NFA regex engine doesn’t advance stream position when matching #258

Closed
2 of 12 tasks
bbarenblat opened this issue Oct 2, 2022 · 1 comment
Closed
2 of 12 tasks
Assignees
Labels
bug Something isn't working fixed-in-110.99.4 issues that will be fixed in the 110.99.4 version regexp-lib problem with Regular Expression library

Comments

@bbarenblat
Copy link

bbarenblat commented Oct 2, 2022

Version

Pre-110.97

Operating System

  • Any
  • Linux
  • macOS
  • Windows
  • Other Unix

OS Version

Debian 11.5 (“bullseye”)

Processor

  • Any
  • Arm (using Rosetta)
  • PowerPC
  • Sparc
  • x86 (32-bit)
  • x86-64 (64-bit)
  • Other

System Component

SML/NJ Library

Severity

Minor

Description

Unlike the backtracking and DFA-based regular expression engines, the engine based on the Thompson NFA construction (ThompsonEngine) does not advance the stream cursor when matching. That is, the example code I’ve provided in this bug report prints

SOME (true, "xxx")
SOME (true, "xxx")
SOME (true, "xxx")
SOME (true, "truexxx")

Transcript

No response

Expected Behavior

The Thompson engine should exhibit the same behavior as the other regex engines. That is, the example code I’ve provided in this bug report should produce

SOME (true, "xxx")
SOME (true, "xxx")
SOME (true, "xxx")
SOME (true, "xxx")

Steps to Reproduce

thompson.sml:

(* Copyright 2022 Google LLC
 * SPDX-License-Identifier: Apache-2.0 *)
structure BackTrack = RegExpFn(structure P = AwkSyntax
                               structure E = BackTrackEngine)
structure DFA       = RegExpFn(structure P = AwkSyntax
                               structure E = DfaEngine)
structure Thompson  = RegExpFn(structure P = AwkSyntax
                               structure E = ThompsonEngine)

fun boolScanBackTrack getc =
    BackTrack.match [("true", fn _ => true), ("false", fn _ => false)] getc
fun boolScanDFA getc =
    DFA.match       [("true", fn _ => true), ("false", fn _ => false)] getc
fun boolScanThompson getc =
    Thompson.match  [("true", fn _ => true), ("false", fn _ => false)] getc

fun runMatcher f =
  case f Substring.getc (Substring.full "truexxx")
   of NONE => print "NONE"
    | SOME (v, s) => print ("SOME (" ^ Bool.toString v ^ ", \""
                            ^ Substring.string s ^ "\")\n")

val () = (
  runMatcher Bool.scan;
  runMatcher boolScanBackTrack;
  runMatcher boolScanDFA;
  runMatcher boolScanThompson)

thompson.cm:

(* Copyright 2022 Google LLC
 * SPDX-License-Identifier: Apache-2.0 *)
Group is
$/basis.cm
$/regexp-lib.cm
thompson.sml

Additional Information

Although I’m running an old version of SML/NJ, I’ve verified that this behavior persists with the latest thompson-engine.sml.

Email address

bbarenblat@gmail.com

@bbarenblat bbarenblat added the bug Something isn't working label Oct 2, 2022
@JohnReppy JohnReppy added the regexp-lib problem with Regular Expression library label Oct 3, 2022
@JohnReppy JohnReppy self-assigned this Oct 3, 2022
JohnReppy added a commit that referenced this issue Jul 27, 2023
@JohnReppy JohnReppy added the fixed-in-110.99.4 issues that will be fixed in the 110.99.4 version label Jul 27, 2023
@JohnReppy
Copy link
Contributor

Fixed for 110.99.4

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working fixed-in-110.99.4 issues that will be fixed in the 110.99.4 version regexp-lib problem with Regular Expression library
Projects
None yet
Development

No branches or pull requests

2 participants