
ANTLR4 Memory Usage for Tokenizing a 200MB File #4770

Open
@openai0229

Description


I am experiencing an issue where tokenizing a large 200MB file using ANTLR4 results in over 1GB of memory usage. Here’s the code I am using to process the file:

import java.io.FileInputStream;
import java.io.InputStream;

import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.Lexer;
import org.antlr.v4.runtime.Token;
import org.antlr.v4.runtime.UnbufferedTokenStream;

try (InputStream inputStream = new FileInputStream(file)) {
    // buffers the entire stream contents before lexing starts
    CharStream charStream = CharStreams.fromStream(inputStream);
    Lexer lexer = new MySqlLexer(charStream);
    UnbufferedTokenStream<Token> tokenStream = new UnbufferedTokenStream<>(lexer);
    while (true) {
        Token token = tokenStream.LT(1);
        int tokenType = token.getType();
        if (tokenType == Token.EOF) {
            break;
        }
        // ....other code
        tokenStream.consume(); // advance; without this, LT(1) keeps returning the same token
    }
} catch (Exception e) {
    // Handle exception
}

I noticed that memory usage spikes to over 1GB as soon as the token stream is created. I am using ANTLR4 4.13.1 on macOS 15.2, with the MySqlLexer.g4 grammar.
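One thing worth noting about where the memory goes: CharStreams.fromStream(...) reads and decodes the entire input up front, so the file is fully buffered before the UnbufferedTokenStream ever produces a token. A minimal sketch of a fully streaming setup would use the runtime's UnbufferedCharStream instead, paired with CommonTokenFactory(true) so each token copies its own text and the char stream can discard consumed characters. This is an untested sketch against the snippet above (same hypothetical file variable and MySqlLexer), not a confirmed fix:

import java.io.FileInputStream;
import java.io.InputStream;

import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CommonTokenFactory;
import org.antlr.v4.runtime.Token;
import org.antlr.v4.runtime.UnbufferedCharStream;
import org.antlr.v4.runtime.UnbufferedTokenStream;

try (InputStream inputStream = new FileInputStream(file)) {
    // keeps only a small sliding window of characters in memory
    CharStream charStream = new UnbufferedCharStream(inputStream);
    MySqlLexer lexer = new MySqlLexer(charStream);
    // copyText=true: tokens copy their text out of the char stream,
    // so the stream need not retain characters for later getText() calls
    lexer.setTokenFactory(new CommonTokenFactory(true));
    UnbufferedTokenStream<Token> tokenStream = new UnbufferedTokenStream<>(lexer);
    while (tokenStream.LA(1) != Token.EOF) {
        Token token = tokenStream.LT(1);
        // ....process token
        tokenStream.consume();
    }
} catch (Exception e) {
    // Handle exception
}

Even with this setup the lexer's internal DFA cache should still grow as it sees new input patterns, but that growth is bounded by the grammar rather than by the file size.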
