New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Very slow regex #2013

Open
Juerd opened this Issue Jul 3, 2018 · 3 comments

Comments

Projects
None yet
4 participants
@Juerd

Juerd commented Jul 3, 2018

The Problem

say "127.0.0.1" ~~ /^ [ <{ 0 .. 255 }> ]**4 % \. $/

takes 6 seconds

Expected Behavior

I didn't really know what to expect, but something faster than this at least :)

Bonus segfault!

When I tried to see if I could figure out the cause myself, I got a segfault:

1;0 juerd@cxien:~$ perl6 --profile -e'say "127.0.0.1" ~~ /^ [ <{ 0 .. 255 }> ]**4 % \. $/; say now - INIT now'
Segmentation fault

core file is available on request; it's 136 MB uncompressed, 15 MB gzip.

Version

perl6 -V.txt

@zoffixznet

This comment has been minimized.

Contributor

zoffixznet commented Jul 3, 2018

Reproduced both the slowness and the profiler segfault on 2018.06-46-g64bdb3dd7

FWIW, I'd expect that construct to be much slower than regular regex: you're executing a piece of code at runtime that returns a value to be treated as a regex, which then compiles a regex that matches an alternation with 255 alternatives. Also, you're doing that 4 times (the compiler doesn't know what sort of code you're executing, so it doesn't cache it or the result of converting it to a regex).

This version is a lot faster:

$ ./perl6 -e 'my @octet = 0..255; say "127.0.0.1" ~~ /@octet ** 4 % \./; say now - ENTER now'
「127.0.0.1」
0.05932732
@MasterDuke17

This comment has been minimized.

Contributor

MasterDuke17 commented Jul 3, 2018

Some gdb output.

dan@hermes:~/p6$ pg6 --profile -e 'say "127.0.0.1" ~~ /^ [ <{ 0 .. 255 }> ]**4 % \. $/; say now - INIT now'
================================================================================================
This is Rakudo Perl 6 running in the GNU debugger, which often allows the user to generate useful back-
traces to debug or report issues in Rakudo, the MoarVM backend or the currently running code.

This Rakudo version is 2018.06.61.g.55.b.2.e.32.b.9 built on MoarVM version 2018.06.103.gaf.455397.f,
running on ubuntu (18.04.LTS.Bionic.Beaver) / linux (4.15.0.23.generic)

Type `bt full` to generate a backtrace if applicable, type `q` to quit or `help` for help.
------------------------------------------------------------------------------------------------
Reading symbols from /home/dan/Source/perl6/install/bin/moar...done.
Starting program: /home/dan/Source/perl6/install/bin/moar --execname=/home/dan/Source/perl6/install/bin/perl6-gdb-m --libpath=/home/dan/Source/perl6/install/share/nqp/lib --libpath=/home/dan/Source/perl6/install/share/perl6/lib --libpath=/home/dan/Source/perl6/install/share/perl6/runtime /home/dan/Source/perl6/install/share/perl6/runtime/perl6.moarvm --profile -e say\ \"127.0.0.1\"\ \~\~\ /\^\ \[\ \<\{\ 0\ ..\ 255\ \}\>\ \]\*\*4\ %\ \\.\ \$/\;\ say\ now\ -\ INIT\ now
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff6241700 (LWP 7890)]
「127.0.0.1」
14.3995042
Writing profiler output to profile-1530637532.6419826.html
MoarVM panic: Heap corruption detected: pointer 0x7ffff6410010 to past fromspace
[Thread 0x7ffff6241700 (LWP 7890) exited]
[Inferior 1 (process 7886) exited with code 01]
(gdb) r
Starting program: /home/dan/Source/perl6/install/bin/moar --execname=/home/dan/Source/perl6/install/bin/perl6-gdb-m --libpath=/home/dan/Source/perl6/install/share/nqp/lib --libpath=/home/dan/Source/perl6/install/share/perl6/lib --libpath=/home/dan/Source/perl6/install/share/perl6/runtime /home/dan/Source/perl6/install/share/perl6/runtime/perl6.moarvm --profile -e say\ \"127.0.0.1\"\ \~\~\ /\^\ \[\ \<\{\ 0\ ..\ 255\ \}\>\ \]\*\*4\ %\ \\.\ \$/\;\ say\ now\ -\ INIT\ now
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff6241700 (LWP 7893)]

Thread 1 "moar" received signal SIGSEGV, Segmentation fault.
0x00007ffff3a8087b in ?? ()
(gdb) bt
#0  0x00007ffff3a8087b in ?? ()
#1  0xffffffffffffffff in ?? ()
#2  0x000055555ae9d4b0 in ?? ()
#3  0x0000000000000008 in ?? ()
#4  0x00007fffec229bd0 in ?? ()
#5  0x0000555558433f98 in ?? ()
#6  0x00007ffff75e7939 in MVM_frame_invoke (tc=0x55555ba06810, static_frame=<optimized out>, callsite=0x7fffffffd720, args=0x3, outer=<optimized out>, code_ref=<optimized out>, spesh_cand=<optimized out>) at src/core/frame.c:557
#7  0x0000555555758c70 in ?? ()
#8  0x00005555584b0708 in ?? ()
#9  0x0000555555758c70 in ?? ()
#10 0x00005555558367e8 in ?? ()
#11 0x00007ffff75dcace in MVM_interp_run (tc=tc@entry=0x555555758c70, initial_invoke=0x3c420, invoke_data=0x5555557c7a10) at src/core/interp.c:5883
#12 0x00007ffff76d3225 in MVM_vm_run_file (instance=0x555555758260, filename=<optimized out>) at src/moar.c:412
#13 0x0000555555555612 in main (argc=9, argv=0x7fffffffdd08) at src/main.c:299
(gdb) q
A debugging session is active.

        Inferior 1 [process 7892] will be killed.

Quit anyway? (y or n) y
dan@hermes:~/p6$ MVM_JIT_DISABLE=1 pg6 --profile -e 'say "127.0.0.1" ~~ /^ [ <{ 0 .. 255 }> ]**4 % \. $/; say now - INIT now'
================================================================================================
This is Rakudo Perl 6 running in the GNU debugger, which often allows the user to generate useful back-
traces to debug or report issues in Rakudo, the MoarVM backend or the currently running code.

This Rakudo version is 2018.06.61.g.55.b.2.e.32.b.9 built on MoarVM version 2018.06.103.gaf.455397.f,
running on ubuntu (18.04.LTS.Bionic.Beaver) / linux (4.15.0.23.generic)

Type `bt full` to generate a backtrace if applicable, type `q` to quit or `help` for help.
------------------------------------------------------------------------------------------------
Reading symbols from /home/dan/Source/perl6/install/bin/moar...done.
Starting program: /home/dan/Source/perl6/install/bin/moar --execname=/home/dan/Source/perl6/install/bin/perl6-gdb-m --libpath=/home/dan/Source/perl6/install/share/nqp/lib --libpath=/home/dan/Source/perl6/install/share/perl6/lib --libpath=/home/dan/Source/perl6/install/share/perl6/runtime /home/dan/Source/perl6/install/share/perl6/runtime/perl6.moarvm --profile -e say\ \"127.0.0.1\"\ \~\~\ /\^\ \[\ \<\{\ 0\ ..\ 255\ \}\>\ \]\*\*4\ %\ \\.\ \$/\;\ say\ now\ -\ INIT\ now
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff6241700 (LWP 7912)]

Thread 1 "moar" received signal SIGSEGV, Segmentation fault.
0x00007ffff75dbc5a in MVM_interp_run (tc=tc@entry=0x555555758c70, initial_invoke=0x7fffec9c6268, invoke_data=0x4) at src/core/interp.c:5390
5390                    if (!check || !IS_CONCRETE(check) || STABLE(check) != want)
(gdb) bt
#0  0x00007ffff75dbc5a in MVM_interp_run (tc=tc@entry=0x555555758c70, initial_invoke=0x7fffec9c6268, invoke_data=0x4) at src/core/interp.c:5390
#1  0x00007ffff76d3225 in MVM_vm_run_file (instance=0x555555758260, filename=<optimized out>) at src/moar.c:412
#2  0x0000555555555612 in main (argc=9, argv=0x7fffffffdcf8) at src/main.c:299
(gdb) q
A debugging session is active.

        Inferior 1 [process 7908] will be killed.

Quit anyway? (y or n) y
dan@hermes:~/p6$ MVM_SPESH_DISABLE=1 pg6 --profile -e 'say "127.0.0.1" ~~ /^ [ <{ 0 .. 255 }> ]**4 % \. $/; say now - INIT now'
================================================================================================
This is Rakudo Perl 6 running in the GNU debugger, which often allows the user to generate useful back-
traces to debug or report issues in Rakudo, the MoarVM backend or the currently running code.

This Rakudo version is 2018.06.61.g.55.b.2.e.32.b.9 built on MoarVM version 2018.06.103.gaf.455397.f,
running on ubuntu (18.04.LTS.Bionic.Beaver) / linux (4.15.0.23.generic)

Type `bt full` to generate a backtrace if applicable, type `q` to quit or `help` for help.
------------------------------------------------------------------------------------------------
Reading symbols from /home/dan/Source/perl6/install/bin/moar...done.
Starting program: /home/dan/Source/perl6/install/bin/moar --execname=/home/dan/Source/perl6/install/bin/perl6-gdb-m --libpath=/home/dan/Source/perl6/install/share/nqp/lib --libpath=/home/dan/Source/perl6/install/share/perl6/lib --libpath=/home/dan/Source/perl6/install/share/perl6/runtime /home/dan/Source/perl6/install/share/perl6/runtime/perl6.moarvm --profile -e say\ \"127.0.0.1\"\ \~\~\ /\^\ \[\ \<\{\ 0\ ..\ 255\ \}\>\ \]\*\*4\ %\ \\.\ \$/\;\ say\ now\ -\ INIT\ now
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
「127.0.0.1」
10.9597802
Writing profiler output to profile-1530637602.0831006.html
[Inferior 1 (process 7926) exited normally]
(gdb)

It succeeds with MVM_SPESH_INLINE_DISABLE=1, but not with MVM_SPESH_OSR_DISABLE=1, nor with MVM_JIT_EXPR_DISABLE=1.

@jnthn

This comment has been minimized.

Member

jnthn commented Dec 13, 2018

To make matters worse, the <{ ... }> form interprets everything it sees as code, and it also needs to do LTM, meaning that we are:

  • Compiling each value form 0..255 into a regex
  • Building an NFA consisting of all of those regexes
  • Using the NFA to get an evaluation order of those that match
  • Then matching those that match and picking the longest (which is not really right, I think we should just pick the first that matches, given the NFA already did the sort, and it's declarative longest that counts). Just for fun, we compile the matching ones into regexes again.

And we repeat this 4 times. I think we do cache the compilation somewhat for the next 3 times, but it clearly doesn't help a lot.

Things we could do are:

  • Fix it to at least just take the first matching regex in the final step, since it's declarative longest prefix that counts, and the NFA already does that. This counts as a semantic fix.
  • At least store the MAKE_REGEX result so we don't have to do it multiple times

We might also detect the case when we have values that would just compile into a literal. If that's all we have, then sort them by length descending and pick the first one at the current offset that we match. This avoids the whole NFA and regex evaluation machinery and would be a significant speedup for this case.

The thing is, that's what writing @(0..255) does already. So the question is whether we want to penalize people using <{ ... }> "correctly" (in the sense of really needing their stuff EVAL'd) by putting in analysis to try and catch poor use of it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment