Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A code generated for period (.) requires 4 bytes. #243

Closed
unixod opened this issue Feb 11, 2019 · 3 comments

Comments

@unixod
Copy link

commented Feb 11, 2019

Hi,

I noticed interesting behavior when tried to generated a code for the following re2c-specification:

bool consumeOneCodePoint(InputCursor pos, InputCursor end)
{
    /*!re2c
        re2c:flags:utf-8 = 1;

        . { return pos; }
        * { return end; }
    */
}

as I figured out, the generated code, on input requires at least 4 bytes not 1:

/* Generated by re2c 1.1.1 on Mon Feb 11 12:20:12 2019 */
#line 1 "example.re"
bool consumeOneCodePoint(InputCursor pos, InputCursor end)
{

#line 7 "<stdout>"
{
	YYCTYPE yych;
	if ((YYLIMIT - YYCURSOR) < 4) YYFILL(4);
	yych = *YYCURSOR;
	switch (yych) {
	case 0x00:
        ...
}

this requirement seems a little bit strange to me, because UTF-8 code points aren't necessary to have 4 bytes length.

Could you help me please to clarify, whether this behavior is bug or an expected feature? :)


PS.
My initial code, where I encountered this behavior, was a little bit more complex and looked as follows:

struct ConsumptionResult {
    bool success;
    InputCursor pos;
};

inline ConsumptionResult execConsumeCodePoint(InputCursor begin, InputCursor end)
{
    auto pos = begin.get();
    [[maybe_unused]] decltype(pos) YYMARKER;

    /*!re2c
        re2c:define:YYCTYPE = 'std::decay_t<decltype(*pos)>';
        re2c:define:YYCURSOR = pos;
        re2c:define:YYLIMIT = end;
        re2c:define:YYFILL:naked = 1;
        re2c:define:YYFILL = 'return {false, begin};';
        re2c:flags:utf-8 = 1;

        . { return {true, pos}; }
        * { return {false, end}; }
    */
}

in other words, in the code, at triggering of YYFILL I intended to return a fail from function, due to exhaustion of input.

@skvadrik

This comment has been minimized.

Copy link
Owner

commented Feb 11, 2019

Hi Eldar,

Dot . means "any symbol except newline". In UTF-8 this expands to something like this (the graph is actually for [^]). The length of code points in this range varies from 1 to 4.

In your case re2c demands 4 bytes, because it uses default input model (explained below). This is not necessarily the best option --- which model to choose depends on your particular setting:

  1. Sentinel model is the simplest; it assumes that you have a "sentinel" character that does not occur in valid input (for example, 0xc0 can never occur in valid UTF-8). Then all you have to do is append this character at the end of input, set re2c:yyfill:enable = 0; and catch the terminating 0xc0 using default rule *. See lex_sentinel in the example below (sentinel is 0xc0).

  2. Default input model is based on YYLIMIT and YYFILL: at certain points lexer checks if YYLIMIT - YYCURSOR < n. Check points do not occur on each step; for efficiency purposes re2c generates only one check point per strongly connected component in DFA. The number of charaters to be filled (n) equals to the maximal number of characters that lexer may read before the next check point (it is calculated by re2c as the maximal path length of the strongly connected component). This input model is efficient, but the user must generate additional padding at the end of input, as described here: http://re2c.org/examples/example_02.html. See lex_default in the example below.

  3. Recently added "simple" input model is somewhere in-between the previous two: it looks simple like sentinel model, but can be used in cases when there is no sentinel (when every character may appear in valid input, for example if you intend to process broken UTF-8 and skip over invalid byte sequences). See lex_simple in the example below (sentinel is 0).

  4. Other models, like this: http://re2c.org/examples/example_16.html and this: http://re2c.org/examples/example_15.html.

Below is a program that shows how 1, 2 and 3 can be used (3 was added recently and you'll need re2c from git for it to work). The program repeatedly calls lexer in a loop, each time consuming one UTF-8 character or skipping one code unit in case of error. Build it as re2c -i a.re -oa.cc -W && g++ a.cc -oa and run as printf 'zы\xc0й' | xargs ./a, sample output:

z: 7a
ы: d1 8b
lex error: c0 d0 b9
skipping one byte ...
й: d0 b9
done!

z: 7a
ы: d1 8b
lex error: c0 d0 b9 ff ff ff ff
skipping one byte ...
й: d0 b9
done!

z: 7a
ы: d1 8b
lex error: c0 d0 b9
skipping one byte ...
й: d0 b9
done!

Code:

#include <assert.h>
#include <string.h>
#include <stdio.h>
#include <stdint.h>

/*!re2c
    re2c:define:YYCTYPE = uint8_t;
    re2c:define:YYCURSOR = cur;
    re2c:define:YYLIMIT = end;
    re2c:flags:utf-8 = 1;
*/

enum Ecode {EOK, EEOF, ENOLEX};

struct Result {
    Ecode err;
    uint8_t *pos;
};

static void print_bytes(const uint8_t *s, const uint8_t *e)
{
    for (; s < e; ++s) printf(" %02x", *s);
    printf("\n");
}

typedef Result (lex1_t)(uint8_t*, uint8_t*, uint8_t*);

static Result lex1_sentinel(uint8_t *cur, uint8_t *end, uint8_t* /* unused */)
{
    uint8_t *YYMARKER, *tok = cur;
    /*!re2c
        re2c:yyfill:enable = 0;

        . { return {EOK, cur}; }
        * { return {tok == end && *tok == 0xc0 ? EEOF : ENOLEX, tok}; }
    */
}

/*!max:re2c*/

static Result lex1_default(uint8_t *cur, uint8_t *end, uint8_t *last)
{
    uint8_t *YYMARKER, *tok = cur;
    /*!re2c
        re2c:yyfill:enable = 1;
        re2c:define:YYFILL:naked = 1;
        re2c:define:YYFILL = 'assert(false);';

        . { return {EOK, cur}; }
        * { return {tok == last ? EEOF : ENOLEX, tok}; }
    */
}

static int fill() { return 1; } // always error

static Result lex1_simple(uint8_t *cur, uint8_t* end, uint8_t* /* unused */)
{
    uint8_t *YYMARKER, *tok = cur;
    /*!re2c
        re2c:yyfill:enable = 1;
        re2c:define:YYFILL = fill;
        re2c:eof = 0;

        . { return {EOK, cur}; }
        $ { return {EEOF, cur}; }
        * { return {ENOLEX, tok}; }
    */
}

static void lex(uint8_t *cur, uint8_t *end, uint8_t *lim, lex1_t *f)
{
    Result res;
    for (;;) {
        res = f(cur, end, lim);
        if (res.err == EEOF) {
            printf("done!\n\n");
            break;
        }
        else if (res.err == ENOLEX) {
            printf("lex error:");
            print_bytes(res.pos, end);
            printf("skipping one byte ...\n");
            cur = res.pos + 1;
        }
        else {
            assert (res.err == EOK);
            printf("%.*s:", (int)(res.pos - cur), cur);
            print_bytes(cur, res.pos);
            cur = res.pos;
        }
    }
}

static void lex_sentinel(char *str)
{
    const size_t len = strlen(str);
    str[len] = 0xc0;
    uint8_t *cur = (uint8_t*)str, *end = cur + len;
    lex(cur, end, NULL, lex1_sentinel);
    str[len] = 0;
}

static void lex_default(char *str)
{
    const size_t len = strlen(str);
    uint8_t *buf = new uint8_t[len + YYMAXFILL];
    uint8_t *cur = buf;
    uint8_t *end = buf + len;
    uint8_t *lim = end + YYMAXFILL;
    memcpy(buf, str, len);
    memset(end, 0xff, YYMAXFILL);
    lex(cur, lim, end, lex1_default);
    delete[] buf;
}

static void lex_simple(char *str)
{
    const size_t len = strlen(str);
    uint8_t *cur = (uint8_t*)str, *end = cur + len;
    lex(cur, end, NULL, lex1_simple);
}

int main(int argc, char **argv)
{
    for (int i = 1; i < argc; ++i) {
        lex_sentinel(argv[i]);
        lex_default(argv[i]);
        lex_simple(argv[i]);
    }
    return 0;
}
@skvadrik

This comment has been minimized.

Copy link
Owner

commented Feb 11, 2019

Of course, in real world code strlen can be avoided: either string is null-terminated and cannot contain 0 in the middle (in which case 0 can be sentinel and model 1 can be used), or input length is known in advance.

@unixod

This comment has been minimized.

Copy link
Author

commented Feb 12, 2019

Ulya, thanks a lot for your explanation and for the description of possible ways to solve my problem!

@unixod unixod closed this Feb 12, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants
You can’t perform that action at this time.