
ANTLR4 Memory Usage for Tokenizing a 200MB File #4770

Open
openai0229 opened this issue Feb 7, 2025 · 12 comments
@openai0229

I am experiencing an issue where tokenizing a large 200MB file using ANTLR4 results in over 1GB of memory usage. Here’s the code I am using to process the file:

try (InputStream inputStream = new FileInputStream(file)) {
    CharStream charStream = CharStreams.fromStream(inputStream); // buffers the whole file in memory
    Lexer lexer = new MySqlLexer(charStream);
    UnbufferedTokenStream<Token> tokenStream = new UnbufferedTokenStream<>(lexer);
    while (true) {
        Token token = tokenStream.LT(1);
        int tokenType = token.getType();
        if (tokenType == Token.EOF) {
            break;
        }
        // ....other code
        tokenStream.consume(); // advance past the current token
    }
} catch (Exception e) {
    // Handle exception
}

I noticed that as soon as I get the tokenStream, the memory usage spikes to over 1GB. I am using ANTLR4 version 4.13.1 on macOS 15.2. The grammar file I am using is MySqlLexer.g4.
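
For what it's worth, CharStreams.fromStream(...) decodes and buffers the entire file before the lexer reads a single character, so the whole 200MB (plus decoding overhead) sits in memory no matter how the tokens are consumed afterwards. Below is a minimal sketch of a fully streaming alternative, assuming the generated MySqlLexer has no actions or predicates that need to look back at characters the stream has already released:

import java.io.FileInputStream;
import java.io.InputStream;

import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CommonTokenFactory;
import org.antlr.v4.runtime.Token;
import org.antlr.v4.runtime.UnbufferedCharStream;

try (InputStream in = new FileInputStream(file)) {
    // Reads the file in small buffers instead of loading all of it up front.
    CharStream chars = new UnbufferedCharStream(in);
    MySqlLexer lexer = new MySqlLexer(chars);
    // The unbuffered char stream discards characters as lexing proceeds, so
    // each token must copy its own text, or getText() would fail later.
    lexer.setTokenFactory(new CommonTokenFactory(true));

    for (Token t = lexer.nextToken(); t.getType() != Token.EOF; t = lexer.nextToken()) {
        // process one token at a time here
    }
} catch (Exception e) {
    // Handle exception
}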

@PanOscar

@openai0229 How did you manage to fix it?

@openai0229
Author

I haven't found a solution, so for now I can only divide my SQL file into chunks and give each chunk to the lexer separately.

openai0229 reopened this Feb 14, 2025
@jimidle
Collaborator

jimidle commented Feb 14, 2025 via email

Each token is going to create an object in the token stream, and lexing is done upfront. Plus 200MB to hold your source.

@openai0229
Author

The reason I’m doing this is because I have a 200MB SQL file, and I need to split the SQL queries within it into individual executable SQL statements. I’m planning to use ANTLR for lexical analysis (tokenization), and then manually process the SQL statements based on the tokens. While the file is large, I need to extract and handle each SQL query individually to ensure that each one is executed correctly.

@ericvergnaud
Contributor

What you could try is to parse a statement, then remove it, and repeat until the file is empty.

@openai0229
Author

openai0229 commented Feb 14, 2025

What you could try is to parse a statement, then remove it, and repeat until the file is empty.

Thank you for your suggestion! However, I’m not quite sure I fully understand your idea. Could you please provide me with a code example to illustrate how I can implement this approach? It would really help me understand better. Thanks again!

@ericvergnaud
Contributor

ericvergnaud commented Feb 14, 2025

The idea is to use the MySql parser and to invoke rule sqlStatement.
Using an unbuffered stream (like you already do) this might prevent the memory consumption you're experiencing.
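
A rough sketch of what that could look like, reusing the streaming setup from the earlier sketch and assuming the grammars-v4 MySQL grammar (top-level rule sqlStatement, generated MySqlLexer/MySqlParser) with a SEMI separator token; the exact rule and token names need to be checked against the grammar actually in use:

CharStream chars = new UnbufferedCharStream(inputStream);
MySqlLexer lexer = new MySqlLexer(chars);
lexer.setTokenFactory(new CommonTokenFactory(true)); // tokens keep a copy of their text
UnbufferedTokenStream<Token> tokens = new UnbufferedTokenStream<>(lexer);
MySqlParser parser = new MySqlParser(tokens);

while (tokens.LA(1) != Token.EOF) {
    // Parse exactly one statement; its parse tree becomes garbage as soon as
    // this iteration is done with it.
    MySqlParser.SqlStatementContext stmt = parser.sqlStatement();
    handleStatement(stmt);                      // hypothetical application callback
    while (tokens.LA(1) == MySqlLexer.SEMI) {   // skip statement separators, if present
        tokens.consume();
    }
}

Note that the parser's adaptive prediction may temporarily mark the token stream and buffer a window of tokens while it decides, so how much this saves depends on how far ahead the grammar needs to look.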

@openai0229
Author

The idea is to use the MySql parser and to invoke rule sqlStatement. Using an unbuffered stream (like you already do) this might prevent the memory consumption you're experiencing.

Thanks for the clarification! I’ll give that a try. Thanks again for your help!

@jimidle
Collaborator

jimidle commented Feb 14, 2025

The reason I’m doing this is because I have a 200MB SQL file, and I need to split the SQL queries within it into individual executable SQL statements. I’m planning to use ANTLR for lexical analysis (tokenization), and then manually process the SQL statements based on the tokens. While the file is large, I need to extract and handle each SQL query individually to ensure that each one is executed correctly.

The reason I suggest this is an XY question is that I don't think you should start with a 200MB file of all the SQL statements in one blob. Where are you getting this file from?

@openai0229
Author

The reason I’m doing this is because I have a 200MB SQL file, and I need to split the SQL queries within it into individual executable SQL statements. I’m planning to use ANTLR for lexical analysis (tokenization), and then manually process the SQL statements based on the tokens. While the file is large, I need to extract and handle each SQL query individually to ensure that each one is executed correctly.

The reason I suggest this is an XY question is that I don't think you should start with a 200MB file of all the SQL statements in one blob. Where are you getting this file from?

Thank you for your suggestion! Before I received your message, I had already started splitting the 200MB file into smaller chunks for processing. This approach has helped me significantly reduce memory consumption, and it’s been working much better for handling large files. You’re right, maybe trying to process such a large file directly with ANTLR was not the best choice.

@KvanTTT
Member

KvanTTT commented Feb 15, 2025

The reason I suggest this is an XY question is that I don't think you should start with a 200MB file of all the SQL statements in one blob. Where are you getting this file from?

I've encountered such big SQL files in my previous work (not 200 MB, but 30 MB, which is still very big). Generated code can also be very big (for instance, code generated by ANTLR itself). That's why I think the question is very reasonable.

Each token is going to create an object in the token stream, and lexing is
done upfront. Plus 200MB to hold your source.

But a token probably doesn't hold the entire string, only start and end offsets. That's why the total memory used by all tokens could even be less than 200 MB, especially if the input contains big string literals. Also, tokens can be handled on the fly.
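
That matches the Java runtime: with the default token factory, CommonToken stores only start/stop indices and getText() re-reads the text from the CharStream on demand, so tokens stay small as long as the char stream itself is kept alive. A tiny illustration (MySqlLexer is the generated lexer and is assumed to accept this input):

CharStream chars = CharStreams.fromString("SELECT 1;");
MySqlLexer lexer = new MySqlLexer(chars);   // default factory: tokens do not copy text
Token first = lexer.nextToken();
// The token holds only offsets; the text printed below is fetched from `chars`.
System.out.println(first.getText());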

The idea is to use the MySql parser and to invoke rule sqlStatement.
Using an unbuffered stream (like you already do) this might prevent the memory consumption you're experiencing.

As far as I understand, @openai0229 doesn't even use a parser. The topic is about tokenization, and parsing would only increase memory usage.

I haven't found a solution, so for now I can only divide my SQL file into chunks and give each chunk to the lexer separately.

I don't think it's an ideal solution; I'd try to set up token processing on the fly. But I don't actually remember whether it's possible to configure UnbufferedTokenStream or another stream to release already-processed tokens. Also, is a 1GB spike really that big? I suppose the GC can reclaim most of that memory after all tokens are processed.
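
On the UnbufferedTokenStream question: as far as I can tell it does drop tokens it has already handed out, but only once they are consume()d and no mark() is outstanding, and the token text has to be copied up front if the char stream is unbuffered as well. A minimal sketch of that on-the-fly loop:

// Assumes the lexer was built on an UnbufferedCharStream with
// lexer.setTokenFactory(new CommonTokenFactory(true)), as sketched above.
UnbufferedTokenStream<Token> tokens = new UnbufferedTokenStream<>(lexer);
while (tokens.LA(1) != Token.EOF) {
    Token t = tokens.LT(1);
    process(t);        // hypothetical per-token callback
    tokens.consume();  // with no mark() held, the stream can now discard `t`
}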

@openai0229
Author

I don't think it's an ideal solution; I'd try to set up token processing on the fly. But I don't actually remember whether it's possible to configure UnbufferedTokenStream or another stream to release already-processed tokens. Also, is a 1GB spike really that big? I suppose the GC can reclaim most of that memory after all tokens are processed.

Thank you for your reply! I also agree that chunking the file is not an ideal solution. Unfortunately, it’s the approach I’ve had to resort to for now. I’m still looking for better ways to optimize memory usage, especially to minimize ANTLR’s memory consumption.

During my testing, the memory usage spikes as soon as I obtain the Lexer object. The highest peak I’ve seen is around 2GB.

[screenshots: memory usage peaking around 2GB]

As a temporary solution, I’ve currently divided the file into chunks, with each chunk being around 1MB. Below is the memory usage after applying this method. While it’s not the best solution, it seems to address my needs for the time being.

[screenshot: memory usage after switching to ~1MB chunks]


If you have any experience or suggestions, I would greatly appreciate it! Thank you again!
