Avoid object instantiation in lexers #1147
One way to reduce the memory usage of Rouge is to avoid instantiation of objects in individual lexers. Based on the report generated by
Initial testing suggested this can lead to a substantial reduction in memory used by the lexers. The AppleScript lexer (the largest lexer by memory usage) can be reduced from 65KB to 2.22KB by simply putting the regular expressions defined as local variables behind class methods.
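To illustrate the idea (this is a minimal sketch, not Rouge's actual AppleScript lexer), a regular expression that was a local variable at class-definition time can instead be built lazily inside a memoized class method, so the `Regexp` object is only allocated on first use and then shared. The class and keyword list here are hypothetical:

```ruby
# Hypothetical lexer class for illustration only.
class ExampleLexer
  # The Regexp is not allocated when the class is loaded; it is built
  # on first call and cached in a class-level instance variable.
  def self.keywords
    @keywords ||= /\b(?:if|then|else|end)\b/
  end

  def keyword?(token)
    token.match?(self.class.keywords)
  end
end

lexer = ExampleLexer.new
lexer.keyword?("then") # => true
```

Loading dozens of lexers written this way costs almost nothing up front; the regular expressions only exist for the lexers a user actually invokes.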
Do you have any interest in doing this thankless task, @ashmaroli? (I promise I will thank you!)
I've been playing around with seeing the impact of deleting certain lexers. I reduced memory allocation by almost 1.2MB (!!!) just by deleting all the lexers starting with
I haven't looked at which lexer in particular was causing problems but if it was because of unnecessary object instantiation, this could be a huge help :)
As a first pass, my preferred option would be to put all local variables (be they arrays of special words or single regular expressions) behind class methods. This has the benefit of not requiring any further editing of lexers that were using local variables (although, as you noted, they weren't all doing that so it's not going to be true universally).
Once that's done, further optimisation could involve replacing some of the individual regular expressions with references to commonly used ones that are centrally defined (as you proposed in #1139). I know @jneen expressed concern about unnecessary obfuscation with that approach, though, and I think it would be worth testing how much it improves things.
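A sketch of what centrally defined expressions could look like (the module and method names here are assumptions, not anything proposed in #1139): each lexer refers to one shared, memoized `Regexp` instead of defining its own copy.

```ruby
# Hypothetical central home for commonly used patterns.
module CommonRegexps
  def self.identifier
    @identifier ||= /[A-Za-z_][A-Za-z0-9_]*/
  end
end

# Two hypothetical lexers referencing the shared pattern.
class LexerA
  def self.id_re
    CommonRegexps.identifier
  end
end

class LexerB
  def self.id_re
    CommonRegexps.identifier
  end
end

# Both lexers share the exact same Regexp object:
LexerA.id_re.equal?(LexerB.id_re) # => true
```

The trade-off mentioned above applies: the lexer no longer shows its pattern inline, which is the obfuscation concern, so the memory win would need to justify it.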
If the early indications are correct, 'wrapping' variables should drastically reduce memory allocation for most use cases. Users are rarely going to invoke Rouge to highlight more than a handful of languages, so even without centrally defined expressions, the number of regular expressions actually instantiated in practice should be low (I think).
So I kept playing around and now have this code as a kind of hacky proof of concept.
The important stuff is all in
All tests pass.
Here's the stats for memory on master:
and on this branch:
The memory for just loading the library has been reduced to:
The information from the lexer files is extracted using a simple regular expression. If a lexer file does not express the information required (class name, tag, aliases) in the format expected, this won't work.
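As a rough sketch of that extraction step (the exact patterns below are assumptions about how most lexer files are written, i.e. `class Foo < RegexLexer`, `tag '...'`, `aliases '...', '...'`), a few regular expressions can pull the metadata out of the raw source without loading the lexer:

```ruby
# Hypothetical lexer source, in the format most Rouge lexers follow.
SOURCE = <<~RUBY
  module Rouge
    module Lexers
      class AppleScript < RegexLexer
        tag 'applescript'
        aliases 'osascript'
      end
    end
  end
RUBY

# Scrape the class name, tag, and aliases with simple regexes.
name    = SOURCE[/class\s+(\w+)\s*</, 1]
tag     = SOURCE[/^\s*tag\s+'([^']+)'/, 1]
aliases = SOURCE.scan(/^\s*aliases\s+(.+)$/)
                .flatten
                .flat_map { |line| line.scan(/'([^']+)'/) }
                .flatten

name    # => "AppleScript"
tag     # => "applescript"
aliases # => ["osascript"]
```

A lexer that defines its tag or aliases in any other way (computed values, double quotes, multi-line calls) would slip through this net, which is the fragility noted above.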
A better approach may be to formally split each lexer into two files: one containing the details needed to select the appropriate lexer, and another containing the lexing logic. This might be something appropriate for version 4.0.