ANTLR4 Memory Usage for Tokenizing a 200MB File #4770
@openai0229 How did you manage to fix it?
I haven't found a solution, so I can only divide my SQL file into chunks and feed each chunk to the lexer separately.
Why would you parse 200MB of SQL as a single file? This seems like an XY question to me.
Each token is going to create an object in the token stream, and lexing is
done upfront. Plus 200MB to hold your source.
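For a rough sense of scale (assumed numbers, not measurements): at an average of roughly 8 characters per SQL token, 200MB of input is on the order of 25 million tokens. In the Java runtime each CommonToken is a heap object with several int fields plus object references, typically several dozen bytes once headers and the token-list entry are counted, so a fully buffered token stream can plausibly exceed 1GB on its own, in addition to the 200MB source buffer.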
Are you trying to parse a complete dump of every query you have in one file?
The reason I'm doing this is that I have a 200MB SQL file, and I need to split the SQL queries within it into individual executable SQL statements. I'm planning to use ANTLR for lexical analysis (tokenization) and then manually process the SQL statements based on the tokens. Although the file is large, I need to extract and handle each SQL query individually to ensure that each one is executed correctly.
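For what it's worth, a minimal sketch of that token-based splitting in the Java runtime (an assumption; `MySqlLexer.SEMI`, the file name, and `handleStatement` are illustrative placeholders, not code from this thread):

```java
import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.misc.Interval;

// Split the dump into individual statements by scanning lexer tokens for ';'.
// Pulling tokens with nextToken() avoids buffering them all in a token stream.
public class StatementSplitter {
    public static void main(String[] args) throws Exception {
        CharStream input = CharStreams.fromFileName("dump.sql"); // still holds the full source text
        MySqlLexer lexer = new MySqlLexer(input);

        int stmtStart = 0; // char offset where the current statement begins
        for (Token t = lexer.nextToken(); t.getType() != Token.EOF; t = lexer.nextToken()) {
            if (t.getType() == MySqlLexer.SEMI) { // assumed name of the ';' token
                String stmt = input.getText(Interval.of(stmtStart, t.getStopIndex()));
                handleStatement(stmt.trim());
                stmtStart = t.getStopIndex() + 1;
            }
        }
        // note: a final statement without a trailing ';' would need a flush after the loop
    }

    static void handleStatement(String sql) {
        // placeholder: execute the statement, write it to its own file, etc.
    }
}
```

Because the split happens on lexer tokens rather than raw text, semicolons inside string literals or comments don't cut a statement in half.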
What you could try is to parse a statement, then remove it, and repeat until the file is empty.
Thank you for your suggestion! However, I'm not quite sure I fully understand your idea. Could you please provide me with a code example to illustrate how I can implement this approach? It would really help me understand better. Thanks again!
The idea is to use the MySql parser and to invoke rule
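A rough shape of that loop, assuming the Java runtime and a statement-level rule named `sqlStatement` in the grammars-v4 MySQL parser (the rule name, its `SqlStatementContext` class, and the `SEMI` token name are assumptions):

```java
import org.antlr.v4.runtime.*;

// Parse one statement at a time instead of the whole file in a single rule call.
public class StatementLoop {
    public static void main(String[] args) throws Exception {
        CharStream input = CharStreams.fromFileName("dump.sql");
        MySqlLexer lexer = new MySqlLexer(input);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        MySqlParser parser = new MySqlParser(tokens);

        while (tokens.LA(1) != Token.EOF) {
            if (tokens.LA(1) == MySqlLexer.SEMI) { // assumed ';' token: skip statement separators
                tokens.consume();
                continue;
            }
            MySqlParser.SqlStatementContext ctx = parser.sqlStatement(); // one statement per call
            process(ctx);
            // Caveat: CommonTokenStream keeps consumed tokens buffered, so the token buffer
            // still grows over a 200MB input; only parse-tree memory stays bounded if each
            // ctx is dropped after processing.
        }
    }

    static void process(MySqlParser.SqlStatementContext ctx) {
        // placeholder: inspect, rewrite, or extract the text of the single statement
    }
}
```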
Thanks for the clarification! I'll give that a try. Thanks again for your help!
The reason I call this an XY question is that I am suggesting you not start with a 200MB file of all the SQL statements in one blob. Where are you getting this file from?
Thank you for your suggestion! Before I received your message, I had already started splitting the 200MB file into smaller chunks for processing. This approach has significantly reduced memory consumption, and it's been working much better for handling large files. You're right; trying to process such a large file directly with ANTLR may not have been the best choice.
I've encountered such big SQL files in my previous work (not 200 MB, but 30 MB, which is still very big). Generated code can also be very big (for instance, code generated by ANTLR itself). That's why I think the question is very reasonable.
But a token probably doesn't hold the entire string, only start and end offsets. That's why the total memory used by all tokens could even be less than 200 MB, especially if the input contains big string literals. Also, tokens can be handled on the fly.
As far as I understand, @openai0229 doesn't even use the parser. The topic is about tokenization, and parsing only increases memory usage.
I don't think it's an ideal solution; I'd try to set up token processing on the fly. But I actually don't remember if it's possible to configure
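If the Java runtime is being used (an assumption), ANTLR's unbuffered streams are the usual way to set that up: UnbufferedCharStream avoids loading the whole file, and CommonTokenFactory(true) copies each token's text because the underlying characters are discarded as the stream advances.

```java
import java.io.FileInputStream;
import org.antlr.v4.runtime.*;

// Stream tokens on the fly without buffering the source or the token list.
public class StreamingLexing {
    public static void main(String[] args) throws Exception {
        try (FileInputStream file = new FileInputStream("dump.sql")) {
            CharStream input = new UnbufferedCharStream(file);
            MySqlLexer lexer = new MySqlLexer(input);
            lexer.setTokenFactory(new CommonTokenFactory(true)); // copy token text eagerly

            for (Token t = lexer.nextToken(); t.getType() != Token.EOF; t = lexer.nextToken()) {
                // handle each token immediately; nothing accumulates between iterations
            }
        }
    }
}
```

The trade-off is that token text is copied eagerly and earlier input can't be cheaply revisited, but memory use stays roughly flat regardless of file size.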
I am experiencing an issue where tokenizing a large 200MB file using ANTLR4 results in over 1GB of memory usage. Here’s the code I am using to process the file:
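A representative sketch of that setup, assuming the Java runtime and the standard CharStreams / CommonTokenStream API (the reporter's actual code may differ):

```java
import java.util.List;
import org.antlr.v4.runtime.*;

public class TokenizeWholeFile {
    public static void main(String[] args) throws Exception {
        CharStream input = CharStreams.fromFileName("dump.sql");    // the whole 200MB source is buffered
        MySqlLexer lexer = new MySqlLexer(input);
        CommonTokenStream tokenStream = new CommonTokenStream(lexer);
        tokenStream.fill();                                         // eagerly lexes and keeps every token object
        List<Token> tokens = tokenStream.getTokens();
        System.out.println(tokens.size() + " tokens");
    }
}
```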
I noticed that as soon as I get the tokenStream, the memory usage spikes to over 1GB. I am using ANTLR4 version 4.13.1 on macOS 15.2. The grammar file I am using is MySqlLexer.g4.