
Cases of retokenization invalidly assume that the tokenizer is stateless, leading to crashes in various edge cases #12674

Closed
moosichu opened this issue Aug 29, 2022 · 7 comments · Fixed by #12661
Labels
bug Observed behavior contradicts documented or intended behavior frontend Tokenization, parsing, AstGen, Sema, and Liveness.
Milestone
0.11.0

Comments

@moosichu
Sponsor Contributor

moosichu commented Aug 29, 2022

Zig Version

0.10.0-dev.5788+f6c9b78c8

Steps to Reproduce

This is mainly a blocker for #12661, where it causes a crash. However, this is a design bug currently present in the code, and I'm sure there are other ways to trigger it.

Expected Behavior

The following function in Ast.zig shows the issue quite well:

pub fn tokenSlice(tree: Ast, token_index: TokenIndex) []const u8 {
    const token_starts = tree.tokens.items(.start);
    const token_tags = tree.tokens.items(.tag);
    const token_tag = token_tags[token_index];

    // Many tokens can be determined entirely by their tag.
    if (token_tag.lexeme()) |lexeme| {
        return lexeme;
    }

    // For some tokens, re-tokenization is needed to find the end.
    var tokenizer: std.zig.Tokenizer = .{
        .buffer = tree.source,
        .index = token_starts[token_index],
        .pending_invalid_token = null,
    };
    const token = tokenizer.next();
    assert(token.tag == token_tag);
    return tree.source[token.loc.start..token.loc.end];
}

It assumes that a fresh tokenizer, created at the correct token start index, will reproduce the original token, effectively treating the tokenizer as stateless. This is not the case.

Actual Behavior

This issue is why #11414 crashes instead of just causing an error.

The tokenizer uses a pending_invalid_token field to keep track of an invalid token nested inside another one, e.g. an invalid byte found inside a string literal.

However, if you restart the tokenizer at an arbitrary character index that falls within one of the tokens that can create a pending_invalid_token, it will be in a different state at that index than it was when the file was originally tokenized.
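
To make this concrete, here is a minimal sketch of the failure mode. The input string is invented for illustration, assuming (as described above) that an invalid control byte inside a string literal is what queues the pending_invalid_token; the struct fields are the same ones tokenSlice() initializes:

const std = @import("std");

test "retokenizing from an arbitrary index can diverge" {
    // Hypothetical input: a string literal containing an invalid
    // control byte (0x01).
    const source: [:0]const u8 = "const a = \"some\x01stuff\";";

    // First pass: tokenize from the beginning, recording where each
    // token starts, just as the parser populates tree.tokens.
    var starts: [32]usize = undefined;
    var n: usize = 0;
    var first = std.zig.Tokenizer.init(source);
    while (true) {
        const tok = first.next();
        if (tok.tag == .eof) break;
        starts[n] = tok.loc.start;
        n += 1;
    }

    // Second pass: restart a fresh tokenizer at each recorded start,
    // exactly as Ast.tokenSlice() does. A token that was queued as
    // pending_invalid_token starts *inside* the enclosing string token,
    // so a fresh tokenizer is in a different state there and the assert
    // in tokenSlice() can fire.
    for (starts[0..n]) |start| {
        var second = std.zig.Tokenizer{
            .buffer = source,
            .index = start,
            .pending_invalid_token = null,
        };
        _ = second.next(); // may not match the token from the first pass
    }
}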

I think the solution here is to roll the tokenizer back to the start of the current line and retokenize from there until the matching token is found at the desired character index. I'm working on a PR that does that, but it will take me a couple of days, so I thought I would file this issue whilst I had a smidgen of time.
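
A minimal sketch of that idea, written as a variant of tokenSlice() inside Ast.zig (this is not the eventual PR; the function name and loop structure are assumptions):

// Sketch only: back up to the start of the line, then retokenize
// forward until we reach the token that starts at the requested index,
// so the tokenizer can rebuild any pending_invalid_token state it had
// on the original pass.
pub fn tokenSliceByRollback(tree: Ast, token_index: TokenIndex) []const u8 {
    const start = tree.tokens.items(.start)[token_index];

    // Scan backwards for the beginning of the current line.
    var line_start: usize = start;
    while (line_start > 0 and tree.source[line_start - 1] != '\n') {
        line_start -= 1;
    }

    var tokenizer: std.zig.Tokenizer = .{
        .buffer = tree.source,
        .index = line_start,
        .pending_invalid_token = null,
    };
    while (true) {
        const token = tokenizer.next();
        if (token.loc.start == start) {
            return tree.source[token.loc.start..token.loc.end];
        }
        assert(token.loc.start < start); // must not walk past the target
    }
}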

Apologies for the weird bug report: this one was identified from looking at the code while trying to solve another issue, so the bug only becomes symptomatic with the additional changes from my PR. I also had to dash off and write this quickly, so I will refine it later! Sorry if it doesn't make a lot of sense.

@moosichu moosichu added the bug Observed behavior contradicts documented or intended behavior label Aug 29, 2022
@matu3ba
Contributor

matu3ba commented Aug 30, 2022

> I think the solution here is to roll the tokenizer back to the start of the current line and retokenize from there until the matching token is found at the desired character index

I think it must do something like reading from the start of the line to the end of the line to take multi-line strings into account, but I did not check how the code handles them.
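
For reference, a Zig multiline string literal spans several source lines:

// Each `\\` line is tokenized as its own token
// (Token.Tag.multiline_string_literal_line), so a literal like this
// covers several lines of source:
const text =
    \\first line
    \\second line
;

Whether rolling back a single line is enough for these depends on each `\\` line really being tokenized independently of the ones before it.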

@Vexu
Member

Vexu commented Aug 30, 2022

> if you restart the tokenizer at an arbitrary character index that falls within one of the tokens that can create a pending_invalid_token

Where is that happening? It seems like the only reason pending_invalid_token wouldn't be null is if we skipped over an invalid token.

@moosichu
Sponsor Contributor Author

> Where is that happening? It seems like the only reason pending_invalid_token wouldn't be null is if we skipped over an invalid token.

I think I poorly explained things there.

So let's say you have:

const a = "some stuff ";

The string "some stuff" could contain invalid bytes. When one of these is encountered, the rest of the string is still processed as normal, but the invalid byte is stored as self.pending_invalid_token in the tokenizer and isn't returned until the next call to tokenizer.next():

pub fn next(self: *Tokenizer) Token {
    if (self.pending_invalid_token) |token| {
        self.pending_invalid_token = null;
        return token;
    }
    ...

However, when Ast.tokenSlice() tries to retrieve the source slice for that invalid token, it naively recreates the tokenizer:

    // For some tokens, re-tokenization is needed to find the end.
    var tokenizer: std.zig.Tokenizer = .{
        .buffer = tree.source,
        .index = token_starts[token_index],
        .pending_invalid_token = null,
    };
    const token = tokenizer.next();

However, the tokenizer will not be in the state it was in when it originally returned that pending_invalid_token, because that token was created while tokenizing the larger string token containing it.

I hope that makes more sense now?

@Vexu
Member

Vexu commented Aug 30, 2022

Hmm, I think I understand now. It seems like this would be most easily solved by implementing #12449.

@moosichu
Sponsor Contributor Author

moosichu commented Aug 30, 2022

#12692 is a fairly simple change anyway and fixes this known cause of a crash in the short term, so I think it should still be merged, although I agree #12449 is the better long-term solution to the problem.

@zigazeljko
Contributor

@moosichu I found a better solution that should be relatively simple to implement.

Instead of setting pending_invalid_token, checkLiteralCharacter can just flag the current token as containing invalid bytes (this requires a new flag in std.zig.Token). The parser can then use this flag (inside the tokenizer loop) to generate error/warning messages for invalid bytes.
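
A reduced, hypothetical sketch of what that could look like (the field name contains_invalid_bytes is invented here, and the Tag enum is cut down for brevity):

pub const Token = struct {
    tag: Tag,
    loc: Loc,
    /// Hypothetical new flag: set by checkLiteralCharacter when the
    /// token's source bytes contain an invalid character, replacing the
    /// tokenizer's separate pending_invalid_token bookkeeping.
    contains_invalid_bytes: bool = false,

    pub const Loc = struct {
        start: usize,
        end: usize,
    };

    // Cut down for this sketch; the real Tag enum is much larger.
    pub const Tag = enum {
        invalid,
        string_literal,
        eof,
    };
};

The parser would then check the flag on each token it consumes and emit a diagnostic there, instead of receiving a separate invalid token out of band.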

@moosichu
Sponsor Contributor Author

moosichu commented Sep 5, 2022

@zigazeljko I don't think making tokens larger is necessarily a better solution to this problem; #12449 is definitely the best long-term solution. However, I think fixing the crash is important, and merging the current workaround is better than nothing. Once that is done, hopefully I can find some time to work on #12449.

@andrewrk andrewrk added the frontend Tokenization, parsing, AstGen, Sema, and Liveness. label Oct 15, 2022
@andrewrk andrewrk added this to the 0.10.0 milestone Oct 15, 2022
@andrewrk andrewrk modified the milestones: 0.10.0, 0.10.1 Oct 31, 2022
@andrewrk andrewrk modified the milestones: 0.10.1, 0.11.0 Jan 10, 2023