tajmone edited this page Jan 11, 2019 · 8 revisions
Highlight v3.41

Debugging Languages Definitions

Useful references on how to debug syntax definition files and pinpoint issues.




Introduction

When creating a new language definition, unexpected parser behavior may show up in some edge cases. Tracking down the problem is not always easy, because the Highlight parser is a black box to the user, except for its internal state variables, which expose the current state of the parser inside hook functions.

Here you'll find some guidelines and tools on how to leverage those internal states to isolate the problem.

State IDs Plugin

Highlight ships with token_add_state_ids.lua, a plugin that shows the parser's state changes and their IDs in the output document:

Description="Add internal state IDs behind each token (for debugging)."

function syntaxUpdate(desc)
  function Decorate(token, state)
    return token .. ' ('.. string.format("%d",state) .. ')'
  end
end

Plugins={

  { Type="lang", Chunk=syntaxUpdate },

}

The plugin can be turned on and off in the Highlight GUI, from the "Plug-in" tab, which lets you visually track the parser states for the current input file.

A Python example, without the State IDs plugin:

[Screenshot: state-IDs plugin OFF]

… and with the State IDs plugin enabled:

[Screenshot: state-IDs plugin ON]

At each parser state change, the plugin appends the integer ID of the new state, enclosed in parentheses. This reveals interesting details about the parser's inner workings. For example, the screenshot above shows that while parsing the string the parser goes through several updates, even though the same state is confirmed each time. In other words, the parser consumes the string in chunks, trying to isolate any tokens that could match legitimate sub-string elements (escape sequences, interpolations).

As for the actual numbers, they represent the possible parser states. They are assigned at initialization time and may vary with each syntax, depending on which elements are actually defined. The correspondence between parser states and integer values can be retrieved via the --verbose option.

Let's try it with the example file used with the Highlight GUI in our screenshots. From the command line we'll invoke highlight --verbose StatesIDs-plugin-Example.py and try to pinpoint in the output which states correspond to "11", "9" and "1" (the actual output is trimmed here to save space):

> highlight --verbose StatesIDs-plugin-Example.py

Loading language definition:
C:\Program Files\Highlight\langDefs\python.lang

Description: Python

LUA GLOBALS:
...
HL_INTERPOLATION: number [ 10 ]
HL_INTERPOLATION_END: number [ 19 ]
HL_KEYWORD: number [ 11 ]                   <-- (11) = Keywords
HL_KEYWORD_END: number [ 20 ]
...
HL_NUMBER: number [ 2 ]
HL_OPERATOR: number [ 9 ]                   <--  (9) = Operators
HL_OPERATOR_END: number [ 18 ]
...
HL_STANDARD: number [ 0 ]
HL_STRING: number [ 1 ]                     <--  (1) = Strings
HL_STRING_END: number [ 12 ]
HL_UNKNOWN: number [ 100 ]
...

I've added arrows on the right side, pointing to the values we were looking for. Now we know what these numbers mean in terms of parser states:

  • 1 represents Strings
  • 9 represents Operators
  • 11 represents Keywords
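
The manual lookup above can also be scripted. The following is a minimal sketch, assuming the verbose listing has been captured as text in the "NAME: number [ N ]" form shown above; the function name parse_verbose_states is hypothetical and not part of Highlight. The listing may also contain HL_* numbers that are not parser states, so treat the resulting table as a starting point rather than a definitive map.

```lua
-- Sketch: build a value -> name table from lines such as
--   "HL_KEYWORD: number [ 11 ]"
-- parse_verbose_states is a hypothetical helper, not a Highlight API.
function parse_verbose_states(text)
  local names = {}
  -- Capture the HL_* name and the integer between the square brackets.
  for name, value in text:gmatch("(HL_[%u_]+):%s*number%s*%[%s*(%d+)%s*%]") do
    names[tonumber(value)] = name
  end
  return names
end

-- Usage with a few lines copied from the listing above.
local sample = [[
HL_KEYWORD: number [ 11 ]
HL_OPERATOR: number [ 9 ]
HL_STRING: number [ 1 ]
]]
local names = parse_verbose_states(sample)
print(names[11], names[9], names[1])
```

Because state values may differ from one syntax to another, the table should be rebuilt from the verbose output of each language definition being debugged.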

Let's analyse the plugin output:

[Screenshot: state-IDs plugin ON]

We can now get a clear picture of how the Highlight parser tokenizes the "print("Hello!")" line, step by step:

token      state ID    parser state
"print"    (11)        Keyword token
"("        (9)         Operator token
"""        (1)         String token
"Hello"    (1) (1)     String token
"!"        (1) (1)     String token
"""        (1)         String token
")"        (9)         Operator token

You'll also notice that the Decorate() hook set up by syntaxUpdate() is called twice for tokens inside the string (i.e., for "Hello" and "!"). This means that, for the current syntax definition, the parser needs two state updates to evaluate those tokens: basically, one update to establish that they are not sub-elements (e.g., an escape sequence), and another to establish that the string state needs to carry on.
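
The doubled suffix can be reproduced outside Highlight with the plugin's own decoration function. This is only an illustration: it assumes that the second update receives the already decorated token, which the output suggests but which is not confirmed here.

```lua
-- Same decoration logic as token_add_state_ids.lua.
function Decorate(token, state)
  return token .. ' (' .. string.format("%d", state) .. ')'
end

-- Assumption: the second update is handed the already decorated token.
once  = Decorate("Hello", 1)
twice = Decorate(once, 1)

print(once)   -- Hello (1)
print(twice)  -- Hello (1) (1)
```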

In complex language definitions, the parser might go through multiple updates to evaluate each token, depending on the token's context and the definitions provided by the syntax, and especially when custom rules hooked into OnStateChange() force it to return custom values (e.g., HL_REJECT, HL_STANDARD, etc.).
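
To see what a custom OnStateChange() rule actually returns, one option is to wrap it in a tracing plugin. The sketch below rests on several assumptions: that the plugin chunk runs after the syntax definition has set its own hook, that the hook receives (oldState, newState, token, ...) as arguments, and that the io library is available to plugins. The stderr logging is an illustration, not a built-in Highlight feature.

```lua
Description = "Trace parser state changes on stderr (debugging sketch)."

function syntaxUpdate(desc)
  -- Keep whatever hook the syntax definition already installed (may be nil).
  local original = OnStateChange

  function OnStateChange(oldState, newState, token, ...)
    -- Without an existing hook, accept the state proposed by the parser.
    local result = newState
    if original then
      result = original(oldState, newState, token, ...)
    end
    -- Log the proposed transition and the value actually returned.
    io.stderr:write(string.format("%s -> %s (returned %s) %s\n",
      tostring(oldState), tostring(newState),
      tostring(result), tostring(token)))
    return result
  end
end

Plugins = {

  { Type="lang", Chunk=syntaxUpdate },

}
```

Everything is defined inside syntaxUpdate() on purpose, mirroring the stock plugin, so that the chunk does not rely on helpers defined elsewhere in the plugin file. A line where the returned value differs from the proposed new state points at a rule that overrides the parser.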

Playing around with the state-IDs plugin and following the parser's syntax updates and state changes with various input examples and languages, while studying their syntax definition code, is a great way to gain insight into Highlight's internals and into how custom code in the hook functions can alter the parser's behaviour.
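
For such experiments, a variant of the plugin that prints state names instead of numbers can be easier to read. This is a sketch, assuming the HL_* globals listed by --verbose are visible as Lua globals when the plugin chunk runs; the list of names is limited to those shown in the listing above and can be extended.

```lua
Description = "Add internal state names behind each token (debugging sketch)."

function syntaxUpdate(desc)
  -- Names taken from the --verbose listing above; extend as needed.
  local wanted = { "HL_STANDARD", "HL_STRING", "HL_NUMBER",
                   "HL_OPERATOR", "HL_INTERPOLATION", "HL_KEYWORD" }

  -- Build a value -> name table from the globals that are actually defined.
  local names = {}
  for _, name in ipairs(wanted) do
    local value = _G[name]
    if type(value) == "number" then
      names[value] = name
    end
  end

  function Decorate(token, state)
    -- Fall back to the numeric ID for states without a known name.
    local label = names[state] or string.format("%d", state)
    return token .. ' (' .. label .. ')'
  end
end

Plugins = {

  { Type="lang", Chunk=syntaxUpdate },

}
```

Looking the values up at load time, rather than hard-coding them, keeps the names correct even when the integers differ between syntaxes.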
