Add word boundary support #167

Closed
JesseBuesking opened this issue Dec 22, 2016 · 5 comments


@JesseBuesking

commented Dec 22, 2016

Would it be possible to support word boundaries? As far as I can tell, the \b referred to in the documentation is actually the backspace character.

My YYFILL function returns 0 when the known string length is met:

#define YYFILL(n) if (ss->cursor >= ss->limit) return 0;

I have a rule in my grammar to match numbers containing commas with an optional decimal component:

[0-9]{1,3} ("," [0-9]{3})+ ("." [0-9]+)*

Given an input of "1,000,000.00", the lexer reaches the final 0 while processing this rule and then calls YYFILL, which returns 0, so the rule fails to match.

If word boundaries were supported, I could change my pattern to:

\b [0-9]{1,3} ("," [0-9]{3})+ ("." [0-9]+)* \b

Then the rule would theoretically know that the end of the string is met and return the token prior to reaching the end of the string.

I'm not sure how much work would be involved to support this, so if you have a recommendation for a workaround please let me know. My only idea to address this is to specifically include null bytes in my rules.

@skvadrik

Owner

commented Dec 23, 2016

A word boundary wouldn't help in this case: the lexer would still need to read one character past the valid input "1,000,000.00" to determine that a word boundary has been reached. If you can guarantee the existence of a terminating character (one that does not appear in valid input, hence [^0-9,.]), then you need neither a word boundary nor YYFILL; you can simply use the sentinel method described here: http://re2c.org/examples/example_01.html

#include <stdio.h>

static const char *lex(const char *YYCURSOR)
{
    const char *YYMARKER;
    /*!re2c
        re2c:define:YYCTYPE = char;
        re2c:yyfill:enable = 0;

        end = "\x00";
        dgt = [0-9];
        str = dgt{1,3} ("," dgt{3})+ ("." dgt+)*;

        *       { return "err"; }
        str end { return "str"; }
    */
}

int main(int argc, char **argv)
{
    for (int i = 1; i < argc; ++i) {
        printf ("%s: %s\n", lex(argv[i]), argv[i]);
    }
    return 0;
}

It works like this:

$ re2c -o a.cc a.re -W && g++ -o a a.cc && ./a "1,000,000.00" "1,0000"
str: 1,000,000.00
err: 1,0000

However, if you cannot guarantee the existence of a terminating character (your input is valid up to the last character and cannot be temporarily modified to end with a terminating character), then you cannot use the sentinel method to stop the lexer. There are a number of other options.

Your definition of YYFILL is incorrect: it shouldn't ignore the n parameter. Suppose that n is, say, 10 and you have, say, 3 characters left between cursor and limit. Your YYFILL lets lexing resume, but the lexer may perform no further checks for the next 10 characters. If the memory past the end of your input happens to contain a valid input suffix, the lexer won't stop at the end of input.
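To make the point about n concrete, here is a minimal sketch in plain C (not re2c output). The `scanner` struct and the `can_read` helper are hypothetical names chosen for illustration; the field names follow the `ss->cursor` and `ss->limit` used in the original YYFILL.

```c
#include <stddef.h>

/* Hypothetical scanner state, modelled on the issue's ss->cursor / ss->limit. */
struct scanner {
    const char *cursor; /* next character to read */
    const char *limit;  /* one past the last valid character */
};

/* Returns 1 if at least n characters remain between cursor and limit,
 * 0 otherwise. Assumes cursor never exceeds limit. */
static int can_read(const struct scanner *ss, size_t n)
{
    return (size_t)(ss->limit - ss->cursor) >= n;
}

/* For unbuffered input, a YYFILL that respects n could then be sketched as:
 *   #define YYFILL(n) if (!can_read(ss, n)) return 0;
 */
```

With a check like this the lexer may stop a few characters before the true end of input. My understanding is that the buffering examples deal with this by padding the input, but treat that as an assumption and check the linked examples.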

If you have very long input that needs buffering, follow these examples: http://re2c.org/examples/example_02.html and http://re2c.org/examples/example_03.html.

If you don't use buffering and you just want to stop at exactly N characters or at the first error, then you can't do it with the default API (because you need to check for the end of input on every character), but you can still do it with the generic API: http://re2c.org/manual/features/generic_api/generic_api.html. You'll have to disable YYFILL with re2c:yyfill:enable = 0; and define YYPEEK to check for the end of input: #define YYPEEK() (cursor >= limit ? 0 : *cursor). There is an example of this approach in the re2c test suite:
https://github.com/skvadrik/re2c/blob/691a28a17c837441347c2b8ed679df72a3f5b94e/re2c/test/input_custom_mjson.--input%28custom%29.re
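As a rough illustration of the bounded-peek idea, the sketch below is a hand-written stand-in for generated code (it is not re2c output, and `peek`, `is_digit` and `match_number` are hypothetical names). It matches the number rule from this issue against a buffer of known length and never reads at or beyond the limit.

```c
#include <stddef.h>

/* Bounded peek: returns 0 once cursor reaches limit, so the matcher never
 * reads past the end of the buffer. Same idea as the YYPEEK definition. */
static char peek(const char *cursor, const char *limit)
{
    return cursor >= limit ? 0 : *cursor;
}

static int is_digit(char c) { return c >= '0' && c <= '9'; }

/* Matches [0-9]{1,3} ("," [0-9]{3})+ ("." [0-9]+)* against the whole buffer
 * [s, s + len). Returns 1 on a full match, 0 otherwise. */
static int match_number(const char *s, size_t len)
{
    const char *cur = s, *lim = s + len;
    int n = 0, groups = 0;

    /* [0-9]{1,3} */
    while (n < 3 && is_digit(peek(cur, lim))) { ++cur; ++n; }
    if (n == 0) return 0;

    /* ("," [0-9]{3})+ */
    while (peek(cur, lim) == ',') {
        ++cur;
        for (n = 0; n < 3 && is_digit(peek(cur, lim)); ++n) ++cur;
        if (n != 3) return 0;
        ++groups;
    }
    if (groups == 0) return 0;

    /* ("." [0-9]+)* */
    while (peek(cur, lim) == '.') {
        ++cur;
        if (!is_digit(peek(cur, lim))) return 0;
        while (is_digit(peek(cur, lim))) ++cur;
    }

    /* Accept only if the whole buffer was consumed. */
    return cur == lim;
}
```

Calling `match_number("1,000,000.00999", 12)` accepts the first 12 characters even though the bytes after them would also be valid input, because every read goes through `peek`. The cost is one comparison per character, which is the trade-off against the sentinel method.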

@JesseBuesking

Author

commented Dec 23, 2016

Yesterday evening after posting, I attempted to include null byte checks in my lexer and it worked with surprisingly few modifications. However, I believe your last paragraph is exactly what I need. Thank you for your detailed and timely response! I'm going to try your suggestions next.

@skvadrik

Owner

commented Dec 23, 2016

Just to clarify, if you can use the sentinel method (append NULL or set the last input byte to NULL), then by all means do so. It's much more efficient than checking for the end of input on every character.

@JesseBuesking

Author

commented Dec 23, 2016

Agreed! Unfortunately, I need to support char arrays containing NULL characters, which is why I was originally trying to use YYFILL to do bounds checks.

@skvadrik

Owner

commented Dec 23, 2016

The sentinel doesn't have to be NULL. It can be any character that does not appear in valid input (but might appear in invalid input). Only if all characters are allowed by your regexp should you use YYFILL.
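A minimal sketch of a non-NULL sentinel in plain C, assuming the rule's alphabet is [0-9,.] as in this issue. The choice of ';' as the sentinel and the names `SENTINEL`, `in_alphabet` and `scan_run` are hypothetical; any byte outside the alphabet would do.

```c
#include <stddef.h>

/* Assumed sentinel: any byte outside the rule's alphabet [0-9,.] works. */
#define SENTINEL ';'

static int in_alphabet(char c)
{
    return (c >= '0' && c <= '9') || c == ',' || c == '.';
}

/* buf must have room for len + 1 bytes; the extra byte receives the sentinel.
 * Returns the length of the leading run of alphabet characters. The loop has
 * no bounds check: the sentinel at buf[len] is guaranteed to stop it. */
static size_t scan_run(char *buf, size_t len)
{
    const char *p = buf;
    buf[len] = SENTINEL;
    while (in_alphabet(*p)) ++p;
    return (size_t)(p - buf);
}
```

An embedded NULL byte simply ends the run early like any other invalid character, so it is no longer confused with the end of input.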

@skvadrik skvadrik closed this Mar 19, 2019
