Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is there SIMD support? #411

Open
ntysdd opened this issue Aug 8, 2022 · 8 comments
Open

Is there SIMD support? #411

ntysdd opened this issue Aug 8, 2022 · 8 comments

Comments

@ntysdd
Copy link

ntysdd commented Aug 8, 2022

For example, use SIMD intrinsics explicity, or use long long to process 8 bytes together?

@skvadrik
Copy link
Owner

Not yet. I had some ideas about it, but it's not in the works yet.

@ntysdd
Copy link
Author

ntysdd commented Aug 15, 2022

Thanks, that's good to know.

@skvadrik
Copy link
Owner

A note about alignment: since we cannot guarantee input data alignment, it would be impossible to use multi-byte reads (not without explicit guarantee from the user that the input is aligned). However, it should still be possible to combine multiple bytes into a 2/4/8-byte value and do the switch on this combined value rather than do 2/4/8 consequent switch statements. This optimization is not straightforward, as it the underlying DFA may not have many linear segments that can be combined this way (due to the grammar, or due to the possible end-of-input after each byte). This needs some study and experiments.

@pmetzger
Copy link
Contributor

I wonder if explicitly open coding the SIMD compiler intrinsics beats trying to emit constructs that the compiler can SIMD itself.

@skvadrik
Copy link
Owner

I don't think compiler can perform such optimization, as they require a bit of high-level insight.

Consider this simple regular grammar:

bool lex(const char* YYCURSOR) {
    const char* YYMARKER;
/*!re2c
    re2c:yyfill:enable = 0;
    re2c:define:YYCTYPE = char;

    "abcd" { return true; }
    *      { return false; }
*/
}

It has just two rules: either a string "abcd" or anything else. It would be easy for a human to read four bytes (as four 1-byte reads with arbitrary alignment) then combine into a single 4-byte value and compare it against the numeric value of abcd. Of course, it would be necessary to ensure that there are 4 byte in the buffer, so this optimization would only work with padding-based end-of-input checks.

Currently re2c generates the following "branchy" code (re2c 1.re -is -o 1.c, where -s is for compactness only and does not affect the reasoning):

bool lex(const char* YYCURSOR) {
    const char* YYMARKER;
{
	char yych;
	yych = *YYCURSOR;
	if (yych == 'a') goto yy2;
	++YYCURSOR;
yy1:
	{ return false; }
yy2:
	yych = *(YYMARKER = ++YYCURSOR);
	if (yych != 'b') goto yy1;
	yych = *++YYCURSOR;
	if (yych == 'c') goto yy4;
yy3:
	YYCURSOR = YYMARKER;
	goto yy1;
yy4:
	yych = *++YYCURSOR;
	if (yych != 'd') goto yy3;
	++YYCURSOR;
	{ return true; }
}
}

Which is compiled to very similar "branchy" assembly (g++ -O2 -c 1.c -o 1.o && objdump -d 1.o):

0000000000000000 <_Z3lexPKc>:
   0:	31 c0                	xor    %eax,%eax
   2:	80 3f 61             	cmpb   $0x61,(%rdi)
   5:	74 09                	je     10 <_Z3lexPKc+0x10>
   7:	c3                    	ret
   8:	0f 1f 84 00 00 00 00 	nopl   0x0(%rax,%rax,1)
   f:	00
  10:	80 7f 01 62          	cmpb   $0x62,0x1(%rdi)
  14:	75 f1                	jne    7 <_Z3lexPKc+0x7>
  16:	80 7f 02 63          	cmpb   $0x63,0x2(%rdi)
  1a:	75 eb                	jne    7 <_Z3lexPKc+0x7>
  1c:	80 7f 03 64          	cmpb   $0x64,0x3(%rdi)
  20:	0f 94 c0             	sete   %al
  23:	c3                        	ret

Both GCC and Clang with -O2 generate almost identical code. And they cannot reorder the branches with the reads: this kind of optimization is too unsafe for a compiler to perform (at least in my limited understanding of C++ compilers).

@pmetzger
Copy link
Contributor

Precisely. This is why I suggest that using the compiler intrinsics is probably the correct path. clang, gcc, etc. support mostly the same set.

@skvadrik
Copy link
Owner

@pmetzger What intrinsics specifically do you mean? I don't see how an intrinsic can restructure the program and squash the four check-and-branch pieces in the example into one.

@pmetzger
Copy link
Contributor

pmetzger commented Oct 1, 2022

You can play games like the one you're proposing with the use of intrinsics. They're gross and have limited portabilty, but the end user could specify whether they wanted the use of intrinsics or not.

Intrinsics also let you call SIMD instructions directly from generated C code. gcc and clang both support a wide variety of intrinsics. Here, for example, is some explanation of the vector extensions both support.

https://gcc.gnu.org/onlinedocs/gcc-12.2.0/gcc/Vector-Extensions.html

https://clang.llvm.org/docs/LanguageExtensions.html#vectors-and-extended-vectors

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants