
new lexer: thrift #787

Closed
jneen wants to merge 2 commits into main from feature.thrift

jneen (Member) commented Sep 24, 2017

Closes #784 - Implements a lexer for Apache Thrift.

state :root do
  mixin :comments_and_whitespace

  rule name do |m|

Contributor

Why don't you use the rule regexp, type DSL?

jneen (Member, Author)

Usually, the biggest performance loss is in regexp scanning. Using this strategy, we only need to scan once - if the name regexp fails to scan, the lexer can simply skip this entire block. On the other hand, if it matches, we only need to perform a small number of Set lookups to determine the correct token.
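
The scan-once strategy described above can be sketched in plain Ruby. This is a stand-in built on the standard library's StringScanner and Set rather than Rouge's own DSL, and the module name, word lists, and token symbols are hypothetical choices for illustration, not the PR's actual code.

```ruby
require 'strscan'
require 'set'

module SingleScanSketch
  # Illustrative word lists only; the PR's real sets are not shown in this thread.
  KEYWORDS = Set.new(%w[struct service enum typedef const namespace])
  BUILTINS = Set.new(%w[bool byte i16 i32 i64 double string binary])

  # One regexp covers every identifier-like word.
  NAME = /[A-Za-z_][A-Za-z0-9_.]*/

  # After the single scan, a few Set lookups pick the token type.
  def self.classify(word)
    return :keyword if KEYWORDS.include?(word)
    return :builtin if BUILTINS.include?(word)
    :name
  end

  def self.tokenize(source)
    scanner = StringScanner.new(source)
    tokens = []
    until scanner.eos?
      next if scanner.skip(/\s+/)
      if (word = scanner.scan(NAME))
        # The name regexp matched once; no further regexp work is needed.
        tokens << [classify(word), word]
      else
        # The name regexp failed, so the whole classification step is skipped.
        tokens << [:other, scanner.getch]
      end
    end
    tokens
  end
end
```

Calling SingleScanSketch.tokenize("struct Point") scans each word with one regexp and then classifies it by lookup. The alternative would be one regexp rule per keyword group, each of which has to be tried at every position.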

bigpick commented Feb 4, 2021

Last active in 2017 but still open; is there any outlook on Thrift lexer support coming to Rouge highlighting?

kpumuk (Contributor) commented Apr 18, 2026

Is there any chance to merge the Thrift lexer? :-) We are using Jekyll for the Thrift website, and it is kind of a shame that there is no syntax highlighting on the homepage https://thrift.apache.org/

[Two screenshots: Before | After]

kpumuk (Contributor) commented Apr 18, 2026

After a review against the IDL spec, there are a few gaps:

  1. Core IDL terminals are missing from the token sets
    The lexer never classifies include, cpp_include, oneway, throws, void, or uuid, even though they are first-class terminals in idl.md rules [3], [4], [20]-[22], and [25]. This is already user-visible on the rendered homepage snippet from tutorial/tutorial.thrift: void, throws, and oneway are emitted as plain Name tokens instead of keyword/builtin tokens.

  2. Double constants tokenize incorrectly
    idl.md rule [33] allows DoubleConstant, but the lexer only has an integer rule. In a real Rouge run:

    • 3.14 tokenized as Integer("3"), Error("."), Integer("14")
    • 1e9 tokenized as Integer("1"), Name("e9")

    That means valid Thrift constants render with lexer errors.

  3. Shell comments are treated as syntax errors
    Thrift files commonly use shell-style comments, and the official tutorial says .thrift files support standard shell comments. The current lexer only handles /* */ and //, so lines starting with # emit an Error("#") token and then lex the remainder as identifiers.
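
The three gaps above can be illustrated with a small plain-Ruby tokenizer. It is only a sketch using StringScanner and Set, not Rouge's DSL and not the code of either PR; the module name, the split of the missing terminals into keywords and types, the token symbols, and the exact regexps are all assumptions made for the example.

```ruby
require 'strscan'
require 'set'

module ThriftGapsSketch
  # Gap 1: terminals the review lists as missing. How they are grouped
  # into keyword vs. type tokens here is an assumption.
  KEYWORDS = Set.new(%w[include cpp_include oneway throws])
  TYPES    = Set.new(%w[void uuid])

  NAME = /[A-Za-z_][A-Za-z0-9_.]*/

  # Gap 2: a number with a fraction or an exponent is one float token.
  # This rule must be tried before the integer rule.
  FLOAT   = /[+-]?(?:\d*\.\d+(?:[eE][+-]?\d+)?|\d+[eE][+-]?\d+)/
  INTEGER = /[+-]?\d+/

  # Gap 3: shell-style comments alongside // and /* */.
  COMMENT = %r{\#[^\n]*|//[^\n]*|/\*.*?\*/}m

  def self.classify(word)
    return :keyword if KEYWORDS.include?(word)
    return :type if TYPES.include?(word)
    :name
  end

  def self.tokenize(source)
    scanner = StringScanner.new(source)
    tokens = []
    until scanner.eos?
      next if scanner.skip(/\s+/)
      if (text = scanner.scan(COMMENT))
        tokens << [:comment, text]
      elsif (text = scanner.scan(NAME))
        tokens << [classify(text), text]
      elsif (text = scanner.scan(FLOAT))
        tokens << [:float, text]
      elsif (text = scanner.scan(INTEGER))
        tokens << [:integer, text]
      else
        tokens << [:other, scanner.getch]
      end
    end
    tokens
  end
end
```

With these rules, 3.14 and 1e9 each come out as a single float token, a line starting with # becomes one comment token, and void, throws, and oneway are no longer plain names.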

@jneen Would you mind if I pick up from where you left off and submit a new PR?

jneen (Member, Author) commented Apr 18, 2026

Please do - this is separate from IDLang other than in name, correct?

jneen (Member, Author) commented May 2, 2026

Closing in favour of #2284

jneen closed this May 2, 2026

Linked issue: Lexer: Apache Thrift

4 participants