
Cases of retokenization invalidly assume that the tokenizer is stateless, leading to crashes in various edge cases #12674

Closed
moosichu opened this issue Aug 29, 2022 · 7 comments · Fixed by #12661
Labels
bug Observed behavior contradicts documented or intended behavior frontend Tokenization, parsing, AstGen, Sema, and Liveness.
Milestone
0.11.0

Comments

@moosichu
Sponsor Contributor

moosichu commented Aug 29, 2022

Zig Version

0.10.0-dev.5788+f6c9b78c8

Steps to Reproduce

This is mainly a blocker for #12661, where it causes a crash. However, this is a design bug currently present in the code, and I'm sure there are other ways to trigger it.

Expected Behavior

The following function in Ast.zig shows the issue quite well:

pub fn tokenSlice(tree: Ast, token_index: TokenIndex) []const u8 {
    const token_starts = tree.tokens.items(.start);
    const token_tags = tree.tokens.items(.tag);
    const token_tag = token_tags[token_index];

    // Many tokens can be determined entirely by their tag.
    if (token_tag.lexeme()) |lexeme| {
        return lexeme;
    }

    // For some tokens, re-tokenization is needed to find the end.
    var tokenizer: std.zig.Tokenizer = .{
        .buffer = tree.source,
        .index = token_starts[token_index],
        .pending_invalid_token = null,
    };
    const token = tokenizer.next();
    assert(token.tag == token_tag);
    return tree.source[token.loc.start..token.loc.end];
}

It assumes that a fresh tokenizer, created at the correct token start index, will reproduce the original token, effectively treating the tokenizer as stateless. This is not the case.

Actual Behavior

This issue is why #11414 crashes instead of just causing an error.

The tokenizer uses a pending_invalid_token field to keep track of an invalid token nested inside another one, e.g. an invalid byte found inside a string literal.

However, if you restart the tokenizer at an arbitrary character index that falls within one of the tokens that can create a pending_invalid_token, it will be in a different state at that index than it was when the file was originally tokenized.
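
To make this concrete, here is a minimal sketch of the failure mode. The input string is invented for illustration, assuming (as described above) that an invalid control byte inside a string literal is what queues the pending_invalid_token; the struct fields are the same ones tokenSlice() initializes:

const std = @import("std");

test "retokenizing from an arbitrary index can diverge" {
    // Hypothetical input: a string literal containing an invalid
    // control byte (0x01).
    const source: [:0]const u8 = "const a = \"some\x01stuff\";";

    // First pass: tokenize from the beginning, recording where each
    // token starts, just as the parser populates tree.tokens.
    var starts: [32]usize = undefined;
    var n: usize = 0;
    var first = std.zig.Tokenizer.init(source);
    while (true) {
        const tok = first.next();
        if (tok.tag == .eof) break;
        starts[n] = tok.loc.start;
        n += 1;
    }

    // Second pass: restart a fresh tokenizer at each recorded start,
    // exactly as Ast.tokenSlice() does. A token that was queued as
    // pending_invalid_token starts *inside* the enclosing string token,
    // so a fresh tokenizer is in a different state there and the assert
    // in tokenSlice() can fire.
    for (starts[0..n]) |start| {
        var second = std.zig.Tokenizer{
            .buffer = source,
            .index = start,
            .pending_invalid_token = null,
        };
        _ = second.next(); // may not match the token from the first pass
    }
}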

I think the solution here is to roll the tokenizer back to the start of the current line and retokenize from there until the matching token is found at the desired character index. I'm working on a PR that does that, but it will take me a couple of days, so I thought I would file this issue whilst I had a smidgen of time.
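
A minimal sketch of that idea, written as a variant of tokenSlice() inside Ast.zig (this is not the eventual PR; the function name and loop structure are assumptions):

// Sketch only: back up to the start of the line, then retokenize
// forward until we reach the token that starts at the requested index,
// so the tokenizer can rebuild any pending_invalid_token state it had
// on the original pass.
pub fn tokenSliceByRollback(tree: Ast, token_index: TokenIndex) []const u8 {
    const start = tree.tokens.items(.start)[token_index];

    // Scan backwards for the beginning of the current line.
    var line_start: usize = start;
    while (line_start > 0 and tree.source[line_start - 1] != '\n') {
        line_start -= 1;
    }

    var tokenizer: std.zig.Tokenizer = .{
        .buffer = tree.source,
        .index = line_start,
        .pending_invalid_token = null,
    };
    while (true) {
        const token = tokenizer.next();
        if (token.loc.start == start) {
            return tree.source[token.loc.start..token.loc.end];
        }
        assert(token.loc.start < start); // must not walk past the target
    }
}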

Apologies for the weird bug report: this one was identified from looking at the code while trying to solve another issue, so the bug only becomes symptomatic with the additional changes from my PR. I also had to dash off and write this quickly, so I will refine it later! Sorry if it doesn't make a lot of sense.

@moosichu moosichu added the bug Observed behavior contradicts documented or intended behavior label Aug 29, 2022
@matu3ba
Contributor

matu3ba commented Aug 30, 2022

> I think the solution here is to roll the tokenizer back to the start of the current line and retokenize from there until the matching token is found at the desired character index

I think it must do something like reading from the start of the line to the end of the line to take multi-line strings into account, but I did not check how the code handles them.
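
For reference, a Zig multiline string literal spans several source lines:

// Each `\\` line is tokenized as its own token
// (Token.Tag.multiline_string_literal_line), so a literal like this
// covers several lines of source:
const text =
    \\first line
    \\second line
;

Whether rolling back a single line is enough for these depends on each `\\` line really being tokenized independently of the ones before it.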

@Vexu
Member

Vexu commented Aug 30, 2022

> if you restart the tokenizer at an arbitrary character index that falls within one of the tokens that can create a pending_invalid_token

Where is that happening? It seems like the only reason pending_invalid_token wouldn't be null is if we skipped over an invalid token.

@moosichu
Sponsor Contributor Author

> Where is that happening? It seems like the only reason pending_invalid_token wouldn't be null is if we skipped over an invalid token.

I think I poorly explained things there.

So let's say you have:

const a = "some stuff ";

The string "some stuff" could contain invalid bytes. When one of these is encountered, the rest of the string is still processed as normal, but the invalid byte is stored as self.pending_invalid_token in the tokenizer and isn't returned until the next call to tokenizer.next():

pub fn next(self: *Tokenizer) Token {
    if (self.pending_invalid_token) |token| {
        self.pending_invalid_token = null;
        return token;
    }
    ...

However, when Ast.tokenSlice() tries to retrieve the source slice for that invalid token, it naively recreates the tokenizer:

    // For some tokens, re-tokenization is needed to find the end.
    var tokenizer: std.zig.Tokenizer = .{
        .buffer = tree.source,
        .index = token_starts[token_index],
        .pending_invalid_token = null,
    };
    const token = tokenizer.next();

However, the tokenizer will not be in the state it was in when it originally returned that pending_invalid_token, because that token was created while tokenizing the larger string token containing it.

I hope that makes more sense now?

@Vexu
Member

Vexu commented Aug 30, 2022

Hmm, I think I understand now. It seems like this would be most easily solved by implementing #12449.

@moosichu
Sponsor Contributor Author

moosichu commented Aug 30, 2022

#12692 is a fairly simple change anyway and fixes this known cause of a crash in the short term, so I think it should still be merged, although I agree #12449 is the better long-term solution to the problem.

@zigazeljko
Contributor

@moosichu I found a better solution that should be relatively simple to implement.

Instead of setting pending_invalid_token, checkLiteralCharacter can just flag the current token as containing invalid bytes (this requires a new flag in std.zig.Token). The parser can then use this flag (inside the tokenizer loop) to generate error/warning messages for invalid bytes.
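
A reduced, hypothetical sketch of what that could look like (the field name contains_invalid_bytes is invented here, and the Tag enum is cut down for brevity):

pub const Token = struct {
    tag: Tag,
    loc: Loc,
    /// Hypothetical new flag: set by checkLiteralCharacter when the
    /// token's source bytes contain an invalid character, replacing the
    /// tokenizer's separate pending_invalid_token bookkeeping.
    contains_invalid_bytes: bool = false,

    pub const Loc = struct {
        start: usize,
        end: usize,
    };

    // Cut down for this sketch; the real Tag enum is much larger.
    pub const Tag = enum {
        invalid,
        string_literal,
        eof,
    };
};

The parser would then check the flag on each token it consumes and emit a diagnostic there, instead of receiving a separate invalid token out of band.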

@moosichu
Sponsor Contributor Author

moosichu commented Sep 5, 2022

@zigazeljko I don't think making tokens larger is necessarily a better solution to this problem; #12449 is definitely the best long-term solution. However, I think fixing the crash is important, and merging the current workaround is better than nothing. Once that is done, hopefully I can find some time to work on #12449.

@andrewrk andrewrk added the frontend Tokenization, parsing, AstGen, Sema, and Liveness. label Oct 15, 2022
@andrewrk andrewrk added this to the 0.10.0 milestone Oct 15, 2022
@andrewrk andrewrk modified the milestones: 0.10.0, 0.10.1 Oct 31, 2022
@andrewrk andrewrk modified the milestones: 0.10.1, 0.11.0 Jan 10, 2023