
Avoid object instantiation in lexers #1147

Open
pyrmont opened this issue Jun 2, 2019 · 20 comments

@pyrmont (Member) commented Jun 2, 2019

One way to reduce the memory usage of Rouge is to avoid instantiation of objects in individual lexers. Based on the report generated by rake profile_memory, it would seem the biggest causes of unnecessary object instantiation in a lexer are:

  1. a failure to wrap arrays of keywords in class methods, so that you do this:

    def self.keywords
      @keywords ||= ['if', 'elsif', 'else']
    end

    rather than this:

    keywords = ['if', 'elsif', 'else']

  2. a failure to wrap regexes (especially those involving interpolation) in class methods, so that you do this:

    def self.reserved
      @reserved ||= /(\b#{keywords.join('|')}\b)/
    end

    rather than this:

    reserved = /(\b#{keywords.join('|')}\b)/

Initial testing suggests this can lead to a substantial reduction in the memory used by the lexers. The AppleScript lexer (the largest lexer by memory usage) can be reduced from 65KB to 2.22KB simply by putting the regular expressions currently defined as local variables behind class methods.
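For illustration, here is a hedged sketch of what the memoised class-method pattern looks like in a complete lexer. ExampleLexer, its tag and its keyword list are invented for this example; no such lexer exists in Rouge.

    # Hypothetical lexer, for illustration only (assumes rouge is loaded).
    module Rouge
      module Lexers
        class ExampleLexer < RegexLexer
          tag 'example'

          # The array is built once, on first use, and cached on the class.
          def self.keywords
            @keywords ||= %w(if elsif else end)
          end

          # The interpolated regex is likewise compiled only once.
          def self.reserved
            @reserved ||= /\b(?:#{keywords.join('|')})\b/
          end

          state :root do
            rule ExampleLexer.reserved, Keyword
            rule %r/[a-zA-Z_]\w*/, Name
            rule %r/\s+/, Text
          end
        end
      end
    end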

Do you have any interest in doing this thankless task, @ashmaroli? (I promise I will thank you!)

@jneen (Member) commented Jun 2, 2019

Yes! And also, as far as possible, reducing the number of giant regexes we generate by interpolating large arrays would go a long way.
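For example, instead of interpolating a long word list into a single alternation regex, a lexer can match identifiers generically and check membership in a Set inside the rule callback. The fragment below is a hedged sketch only, reusing the invented ExampleLexer from the opening comment and assuming set has been required:

    # Fragment of the hypothetical ExampleLexer class body.
    def self.keywords
      @keywords ||= Set.new %w(if elsif else end def class module)
    end

    state :root do
      rule %r/\s+/, Text

      # One small regex plus a Set lookup, rather than a giant interpolated
      # alternation like /\b(?:if|elsif|else|...)\b/.
      rule %r/[a-zA-Z_]\w*/ do |m|
        if self.class.keywords.include?(m[0])
          token Keyword
        else
          token Name
        end
      end
    end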

@ashmaroli (Contributor) commented Jun 2, 2019

@pyrmont I have no issues undertaking this task. Optimizing memory will help everyone.
I'll submit PR(s) towards resolving this.

@pyrmont (Member Author) commented Jun 2, 2019

Thanks @ashmaroli!

I've been playing around with seeing the impact of deleting certain lexers. I reduced memory allocation by almost 1.2MB (!!!) just by deleting all the lexers starting with L.

I haven't looked at which lexer in particular was causing problems but if it was because of unnecessary object instantiation, this could be a huge help :)

@ashmaroli (Contributor) commented Jun 2, 2019

Haha! Trials by Elimination.. 😄

@pyrmont (Member Author) commented Jun 2, 2019

Deleting P and S also reduces memory allocation by around 1MB per letter. In general, I'd estimate it's about 300KB or so per letter.

@ashmaroli (Contributor) commented Jun 2, 2019

I see a lot of inconsistency in the lexers. Some have keywords =, some have KEYWORDS =, and some have def self.keywords.
I think the first step here would be to decide on a consistent design.

@pyrmont (Member Author) commented Jun 2, 2019

Using a class method was @jneen's suggestion when reviewing the lexer development guide, and it was the basis for the proposal above. That feels idiomatic to me.

@ashmaroli (Contributor) commented Jun 2, 2019

Great. Here's my plan: convert all keywords = and similar assignments that store arrays into class methods, and all id = assignments that store simple regexps into CONSTANTS. What say?

@pyrmont (Member Author) commented Jun 2, 2019

As a first pass, my preferred option would be to put all local variables (be they arrays of special words or single regular expressions) behind class methods. This has the benefit of not requiring any further editing of lexers that were using local variables (although, as you noted, they weren't all doing that so it's not going to be true universally).

Once that's done, further optimisation could involve replacing some of the individual regular expressions with references to commonly used ones that are centrally defined (as you proposed in #1139). I know @jneen expressed concern about unnecessary obfuscation with that approach, though, and I think it would be worth testing how much it improves things.

If the early indications are correct, 'wrapping' variables should drastically reduce memory allocation for most use cases. Users are rarely going to invoke Rouge to syntax-highlight more than a handful of languages, so even without centrally defined expressions the number of regular expressions actually instantiated in practice should be low (I think).

@ashmaroli (Contributor) commented Jun 2, 2019

@pyrmont I won't be able to finish this task. The various lexers are structured differently, and some specs fail when keywords = is changed to def self.keywords or def keywords. It is time-consuming to keep the spec files in sync with changes to the lexer source.

@pyrmont (Member Author) commented Jun 2, 2019

No problem at all! Thanks for taking a look at it!

@pyrmont (Member Author) commented Jun 7, 2019

So I kept playing around and now have this code as a kind of hacky proof of concept.

Approach

The important stuff is all in lib/rouge/lexer.rb. In short, rather than loading the individual lexer files up front, it:

  1. extracts the class names, tags and aliases from these files;
  2. saves these values as keys in a hash, with the associated value being the relative path of the file; and
  3. loads the lexer file for <Lexer Class> when the consumer either (a) attempts to access Lexers::<Lexer Class> or (b) attempts to retrieve an instance of <Lexer Class> from Lexer.registry (see the sketch below).

All tests pass.
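A minimal sketch of the idea follows; the table name, paths and const_missing hook below are illustrative assumptions rather than the actual code in the branch (which also hooks the registry lookup):

    # Illustrative sketch only.
    module Rouge
      module Lexers
        # class name / tag / alias => relative path of the lexer file,
        # built up front by scanning the files.
        LAZY_LEXERS = {
          'Ruby' => 'lexers/ruby.rb',
          'ruby' => 'lexers/ruby.rb',
          'rb'   => 'lexers/ruby.rb',
        }.freeze

        # Nothing is loaded until someone actually references Lexers::Ruby
        # (or asks the registry for 'ruby').
        def self.const_missing(name)
          path = LAZY_LEXERS[name.to_s] or return super
          load File.join(__dir__, path)
          const_get(name)
        end
      end
    end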

Results

Here are the stats for memory on master:

Total allocated: 11.5 MB (109823 objects)
Total retained:  4.37 MB (30845 objects)

and on this branch:

Total allocated: 2.34 MB (14721 objects)
Total retained:  480.52 kB (3446 objects)

The memory for just loading the library has been reduced to:

Total allocated: 2.14 MB (12720 objects)
Total retained:  380.64 kB (2804 objects)

Limitations

The information from the lexer files is extracted using a simple regular expression. If a lexer file does not express the information required (class name, tag, aliases) in the format expected, this won't work.
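For a sense of what that scan involves, here is a rough illustration; it is not the actual expression used in the branch:

    # Illustrative only: scrape the metadata out of a lexer source file
    # without evaluating it. A lexer that declares its class, tag or aliases
    # in any other form would be missed, which is the limitation noted above.
    src = File.read('lib/rouge/lexers/ruby.rb')

    class_name = src[/^\s*class\s+(\w+)\s*</, 1]
    tag        = src[/^\s*tag\s+['"]([^'"]+)['"]/, 1]
    aliases    = src.scan(/^\s*aliases\s+(.+)/).flatten
                    .flat_map { |args| args.scan(/['"]([^'"]+)['"]/).flatten }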

Future Work

A better approach may be to formally split lexers into two files: one that contains the details necessary for selection of the appropriate lexer and the other that contains the lexing logic. This might be something appropriate for version 4.0.

@ashmaroli (Contributor) commented Jun 7, 2019

From the compare diff, it looks like your tree hasn't been rebased onto the latest rouge:master (or so it appears, because the recent escaping logic has been removed...?)

@pyrmont (Member Author) commented Jun 7, 2019

Rebased for your comparison pleasure.

Memory usage is now:

Total allocated: 2.36 MB (14870 objects)
Total retained:  482.79 kB (3477 objects)

and for vanilla loading:

Total allocated: 2.15 MB (12867 objects)
Total retained:  382.91 kB (2835 objects)

@ashmaroli (Contributor) commented Jun 7, 2019

"Vanilla loading" being the master..?

@pyrmont (Member Author) commented Jun 7, 2019

@ashmaroli Vanilla loading means loading the library without actually running any highlighting. I introduced a rake task that just loads the library, and that's what I mean by vanilla loading.
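Presumably something along these lines; the task name and reporting options here are assumptions, built on the memory_profiler gem:

    # Rakefile sketch (assumes the memory_profiler gem is available).
    require 'memory_profiler'

    desc 'Report the memory allocated and retained just by loading the library'
    task :profile_load do
      report = MemoryProfiler.report do
        require 'rouge'
      end
      report.pretty_print(scale_bytes: true)
    end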

@jneen (Member) commented Jun 7, 2019

Another option that does away with that scary regex: keep the lexer keyword, but do not save the block. Instead, re-load the file later with a special thread-variable flag set, such that the lexer actually gets populated.
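One possible reading of that suggestion, as a sketch only: every name below (the lexer registration hook, the flag, populate) is hypothetical and not part of Rouge's current API.

    # Hypothetical sketch of the flag-and-reload idea.
    module Rouge
      module Lexers
        def self.lexer_files
          @lexer_files ||= {}
        end

        # First pass: record where the definition lives and drop the block.
        # Second pass (flag set): evaluate the block so the lexer really exists.
        def self.lexer(tag, file, &defn)
          if Thread.current[:rouge_populate_lexers]
            defn.call
          else
            lexer_files[tag] = file
          end
        end

        def self.populate(tag)
          Thread.current[:rouge_populate_lexers] = true
          load lexer_files.fetch(tag)
        ensure
          Thread.current[:rouge_populate_lexers] = false
        end
      end
    end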

@pyrmont (Member Author) commented Jun 7, 2019

@jneen I feel like I've just been galaxy brained.

@jneen (Member) commented Jun 7, 2019

lol maybe. it might be a bit much.

also make sure to update RougeVersion in rouge.jneen.net so it can preload everything before removing the global constant (and the files)
