[for reference] all work done which is not in original repo #338

GerHobbelt · 2016-12-15T23:39:13Z

Since a lot has been done and several of these features are tough to 'extract cleanly' to produce 'simple' patches (they won't be simple anyway), the list of differences (features and fixes in the derived repo):

[to be completed]

Main features

full Unicode support (okay, astral codepoints are hairy and only partly supported) in lexer and parser
- lexer can handle XRegExp \pXXX unicode regex atoms, e.g. \p{Alphabetic}
  - jison auto-expands and re-combines these when used inside regex set expressions in macros, e.g.
```
ALPHA                                   [{UNICODE_LETTER}a-zA-Z_]
```
    will be reduced to the equivalent of
```
ALPHA                                   [{UNICODE_LETTER}_]
```
    hence you don't need to worry your regexes will include duplicate characters in regex [...] set expressions.
- parser rule names can be Unicode identifiers (you're not limited to US ASCII there).

lexer macros can be used inside regex set expressions (in other macros and/or lexer rules); the lexer will barf a hairball (i.e. throw an informative error) when the macro cannot be expanded to represent a character set without causing counter-intuitive results), e.g. this is a legal series of lexer macros now:

ASCII_LETTER                            [a-zA-z]
UNICODE_LETTER                          [\p{Alphabetic}{ASCII_LETTER}]
ALPHA                                   [{UNICODE_LETTER}_]
DIGIT                                   [\p{Number}]
WHITESPACE                              [\s\r\n\p{Separator}]
ALNUM                                   [{ALPHA}{DIGIT}]

NAME                                    [{ALPHA}](?:[{ALNUM}-]*{ALNUM})?
ID                                      [{ALPHA}]{ALNUM}*

the parser generator produces optimized parse kernels: any feature you do not use in your grammar (e.g. error rule driven error recovery or @elem location info tracking) is rigorously stripped from the generated parser kernel, producing the fastest possible parser engine.
you can define a custom written lexer in the grammar definition file's %lex ... /lex section in case you find the standard lexer is too slow to your liking on otherwise insufficient. (This is done by specifying a no-rules lexer with the custom lexer placed in the lexer trailing action code block.)
you can %include action code chunks from external files, in case you find that the action code blurbs obscure the grammar's / lexer's definition. Use this when you have complicated/extensive action code for rules or a large amount of 'trailing code' ~ code following the %% end-of-ruleset marker.
CLI: -c 2 -- you now have the choice between two different table compression algorithms:
- mode 2 creates the smallest tables,
- mode 1 is the one available in 'vanilla jison' and
- mode 0 is 'no compression what-so-ever'

Minor 'Selling Points'

you can produce parsers which do not include a try ... catch wrapper for that last bit of speed and/or when you want to handle errors in surrounding userland code.
all errors are thrown using a parser and lexer-specific Error-derived class which allows userland code to discern which type of error (and thus: available extra error information!) is being processed via a simple/fast instanceof check for either of them.
the jison CLI tool will print additional error information when a grammar parse error occurred (derived off / closely related to Add detail grammar error output #321 and Stop silencing useful information. #258)
the jison CLI tool will print parse table statistics when requested (-I commandline switch) so you can quickly see how much table space your grammar is consuming. Handy when you are optimizing your grammar to reduce the number of states per parse for performance reasons.
includes [a derivative or close relative of] Added support for ES2015 module generation #326, Pass moduleMain option to generator #316, Support %options ranges in the grammar. #302, Fix moduleName (and other options?) being lost #290, #282,#283 - support specifying moduleName on CLI, support generate names... #284
fixes
- TypeError: undefined is not an object (evaluating 'this.yy.parser') #358 (crashes on this.yy.parser missing errors)
- Fix issue where the location of the past lexer input is reported on error #356 (wrong input attached to error)
- this.lexer is undefined #353 (crashes on this.yy.lexer missing errors)
- Always remove the token stack label #352 (token_stack label issue: jison-gho's way of code stripping does depend on labels at all, so the issue is moot now)
- [feature req] add bison YYRECOVERING support #349 (YYRECOVERING macro support -- should work, fingers crossed 😉)
- Getting "this.performAction.call is not a function" when trying to parse text #348 (performAction invocation trouble)
- definitions inconsistencies between lex and jison #333 (lexer recognizes literal regex parts without quotes whenever possible),
- jison@0.4.17 throws non-Error #328 (all errors are Error-derived instances with a text message and extra info attached),
- lexer.pushState does not work as expected (or intended?) #317 (?not sure?),
- "Jison is not defined" when running generated parser in debug mode #313,
- Support ranges in locations #301,
- is there error recovery? #299 (with minor additional abilities compared to vanilla jison, e.g. configurable error recovery search depth),
- Feature Request: Report unused rules to improve refactoring of large grammars #296 (unused grammar rules are reported and nuked, i.e. not included in the generated output),
- Add parameter in the command-line tools to support specifying moduleName #282,
- Not work for JSON format #276 (and we support JSON5 format besides!),
- EBNF unable to get at sub-expression #254,
- Is it possible to inspect the state stack in helper functions? #239 (all parser stacks are available in all grammar rule action code via yyvstack, yysstack, etc. -- documented in the documented grammar file's top API documenting comment chunk),
- How to extract values from nested EBNF clauses? #233 (EBNF rewriting to BNF now works; see also the wiki),
- Typos in documentation #231,
- Implementing own parser.yy.parseError function throws error #218 (and parseError can now produce a return value for the parser to return to the calling userland code),
- Any plan to support epsilon? #210,
- Is there a way to include another file just in the middle of the current one? #175 (kind of..., we now support %include filepath statements in stead of any code chunk),
- Changing lexer mode from parser fails #165 (kind of... now jison does not fetch look-ahead when the rule reduce action doesn't need it; it requires intimate understanding of your grammar and the way this LALR grammar engine handles it, but you can once again code 'lexer hacks' from inside parser rules' action code. Shudder or rejoice, depending on your mental make-up ;-) ),
- need way to distinguish errors #138 (instanceof of parser and lexer error class),
- Support %initial-action #121 (indirectly, you can now do this by writing an action code chunk for an initial 'epsilon' rule and get this behaviour that way)

Where is this thing heading?

using recast et al to help analyze rule action code to help code-strip both parser and lexer to produce fast parse/lex runs. Currently only the parser gets analyzed (a tad roughly) to strip costly operations from the parser run-time to make it fast / efficient.
also note migrate towards using a monorepo: ES6/rollup/etc. is a horror otherwise GerHobbelt/jison#16: moving towards a babel-style monorepo. This work has now completed (oct-nov 2017: jison-gho releases 0.6.1-200+)

… in the console.log() statements in there: console.log() adds a newline automatically, while the original C code `printf()` does not.

…ter code stripping. Adjusted stripper regexes to fix this.

… the preceeding commits: `action === 0` is the error parse state and that one, when it is discovered during error **recovery** in the inner slow parse loop, is handed back to the outer loop to prevent undue code duplication. Handing back means the outer loop will have to process that state, not exit on it immediately!

…reset/cleanup the `recoveringErrorInfo` object as one may invoke `yyerrok` while still inside the error recovery phase of the parser, thus *potentially* causing trouble down the lane for subsequent parse states. (This is another edge case that's hard to produce: better-safe-than-sorry coding style applies.)

…amples/Makefile. Tweak `make superclean` to ensure that we can bootstrap once you've run `make prep` by reverting the jison/dist/ directory after 'supercleaning'.

…about a piece of action code which "does not compile": lexer and parser line tracking yylloc info starts counting at line ONE(1) instead of ZERO(0) hence we do NOT need to compensate when bumping down the action code before parsing/validating it in here.

…mpare the full set of examples` output vs. a given reference. This is basically a 'system test' / 'acceptance test' **test level** that co-exists with the unit tests and integration tests in the tests/ directory: those tests are already partly leaning towards a 'system test' level and that is "polluting" the implied simplicity of unit tests...

…ch is included with every generated parser: this makes those reports easier to understand at a glance.

…ippets and other code blocks. We don't want to do them all, so there's #26

…liver a cleaner info set when custom lexers are involved AND not exhibit side effects such as modifying the provided lexer spec when it comes in native format, i.e. doesn't have to be parsed or JSON.parse()d anymore: we should strive for an overall cleaner interface behaviour, even if that makes some internals a tad more hairy.

… it should always have produced an 'expected set of tokens' in the info hash, whether you're running in an error recovery enabled grammar or a simple (non-error-recovering) grammar.

- DO NOT cleanup the old one before we start the new error info track: the old one will *linger* on the error stack and stay alive until we invoke the parser's cleanup API! - `recoveringErrorInfo` is also part of the `__error_recovery_infos` array, hence has been destroyed already: no need to do that *twice*.

…llback set a la jison parser run-time: - `fastLex()`: return next match that has a token. Identical to the `lex()` API but does not invoke any of the `pre_lex()` nor any of the `post_lex()` callbacks. - `canIUse()`: return info about the lexer state that can help a parser or other lexer API user to use the most efficient means available. This API is provided to aid run-time performance for larger systems which employ this lexer. - now executes all `pre_lex()` and `post_lex()` callbacks provided as + member function, i.e. `lexer.pre_lex()` and `lexer.post_lex()` + member of the 'shared state' `yy` as passed to the lexer via the `setInput()` API, i.e. `lexer.yy.pre_lex()` and `lexer.yy.post_lex()` + member of the lexer options, i.e. `lexer.options.pre_lex()` and `lexer.options.post_lex()`

…lon rule (which has no location info); add / introduce the `lexer::deriveLocationInfo()` API to help you & us to construct a more-or-less useful/sane location info object from the context surrounding it when the requested location info itself is not available.

…comparison` as it will compare more than just the generated codegen parsers' sources...

…e used to reconstruct missing/epsilon location infos. This helps fix crashes observed when reporting some errors that are triggered while parsing epsilon rules, but will also serve other purposes. The important bit here is that it helps prevent crashes inside the lexer's `prettyPrintRange()` API when no or faulty location info object(s) have been passed as parameters: robuster lexer APIs.

…ed according to the internal action+ parse kernel analysis. NOTE: the fact that the error reporting/recovery logic checks the **lexer.yylineno** lexer attribute does not count as that code won't need / touch the internal `yylineno` variable in any way.

# Conflicts: # lib/jison-parser-kernel.js

… is obsoleted anyway.

…dn't work as the `parseError` would not propagate into the parser kernel due to the way `shallow_copy_noclobber` worked. This is quite hairy as we depend on its behaviour of NOT overwriting members so that we can use it for yylloc propagation code inside the kernel. With this fix, that functionality should remain unchanged while now anything set in `parser.yy` should make it into the parser kernel *properly* once again.

…ernel into the main source file (see previous commit)

…de a more robust lexer interface: // 1) make sure any outside interference is detected ASAP: // these attributes are to be treated as 'const' values // once the lexer has produced them with the token (return value \`r\`). // 2) make sure any subsequent \`lex()\` API invocation CANNOT // edit the \`yytext\`, etc. token attributes for the *current* // token, i.e. provide a degree of 'closure safety' so that // code like this: // // t1 = lexer.lex(); // v = lexer.yytext; // l = lexer.yylloc; // t2 = lexer.lex(); // assert(lexer.yytext !== v); // assert(lexer.yylloc !== l); // // succeeds. Older (pre-v0.6.5) jison versions did not *guarantee* // these conditions. this.yytext = Object.freeze(this.yytext); this.matches = Object.freeze(this.matches); this.yylloc.range = Object.freeze(this.yylloc.range); this.yylloc = Object.freeze(this.yylloc);

… lexer

# Conflicts: # lib/jison.js # package-lock.json # package.json # packages/jison-lex/regexp-lexer.js # packages/jison2json/tests/tests.js

# Conflicts: # README.md # lib/cli.js # package.json

…'re going to take a different route towards parsing jison action code as the current approach is a maintenance nightmare. recast is again playing up and I'm getting sick of it all and that never was the goal of this.

…is pair to cooperate.

added js-sequence-diagrams to demo projects list

@n

…code to (temporarily) turn the jison generated source code into 'regular javascript' so we can pull it through standard babel or similar tools. (The previous attempt was to enhance the babel tokenizer and have the jison identifiers processed that way, but given the structure of babel, it meant tracking a slew of large packages, which turned out way too costly. So we revert to this 'Unicode hack' which employs the JavaScript specification about which Unicode characters are *legal in a JavaScript identifier*. TODO: Should write a blog/article about this. Here's the comments from the horse's mouth: --- Determine which Unicode NonAsciiIdentifierStart characters are unused in the given sourcecode and provide a mapping array from given (JISON) start/end identifier character-sequences to these. The purpose of this routine is to deliver a reversible transform from JISON to plain JavaScript for any action code chunks. This is the basic building block which helps us convert jison variables such as `$id`, `$3`, `$-1` ('negative index' reference), `@id`, `#id`, `#TOK#` to variable names which can be parsed by a regular JavaScript parser such as esprima or babylon. ``` function generateMapper4JisonGrammarIdentifiers(input) { ... } ``` IMPORTANT: we only want the single char Unicodes in here so we can do this transformation at 'Char'-word rather than 'Code'-codepoint level. ``` const IdentifierStart = unicode4IdStart.filter((e) => e.codePointAt(0) < 0xFFFF); ``` As we will be 'encoding' the Jison Special characters @ and # into the IDStart Unicode range to make JavaScript parsers *not* barf a hairball on Jison action code chunks, we must consider a few things while doing that: We CAN use an escape system where we replace a single character with multiple characters, as JavaScript DOES NOT discern between single characters and multi-character strings: anything between quotes is a string and there's no such thing as C/C++/C#'s `'c'` vs `"c"` which is *character* 'c' vs *string* 'c'. As we can safely escape characters, all we need to do is find a character (or set of characters) which are in the ID_Start range and are expected to be used rarely while clearly identifyable by humans for ease of debugging of the escaped intermediate values. The escape scheme is simple and borrowed from ancient serial communication protocols and the JavaScript string spec alike: - assume the escape character is A - then if the original input stream includes an A, we output AA - if the original input includes a character #, which must be escaped, it is encoded/output as A This is the same as the way the backslash escape in JavaScript strings works and has a minor issue: sequences of AAA with an odd number of A's CAN occur in the output, which might be a little hard to read. Those are, however, easily machine-decodable and that's what's most important here. To help with that AAA... issue AND because we need to escape multiple Jison markers, we choose to a slightly tweaked approach: we are going to use a set of 2-char wide escape codes, where the first character is fixed and the second character is chosen such that the escape code DOES NOT occur in the original input -- unless someone would have intentionally fed nasty input to the encoder as we will pick the 2 characters in the escape from 2 utterly different *human languages*: - the first character is ဩ which is highly visible and allows us to quickly search through a source to see if and where there are *any* Jison escapes. - the second character is taken from the Unicode CANADIAN SYLLABICS range (0x1400-0x1670) as far as those are part of ID_Start (0x1401-0x166C or there-abouts) and, unless an attack is attempted at jison, we can be pretty sure that this 2-character sequence won't ever occur in real life: even when one writes such a escape in the comments to document this system, e.g. 'ဩᐅ', then there's still plenty alternatives for the second character left. - the second character represents the escape type: $-n, $#, #n, @n, #ID#, etc. and each type will pick a different base shape from that CANADIAN SYLLABICS charset. - note that the trailing '#' in Jison's '#TOKEN#' escape will be escaped as a different code to signal '#' as a token terminator there. - meanwhile, only the initial character in the escape needs to be escaped if encountered in the original text: ဩ -> ဩဩ as the 2nd and 3rd character are only there to *augment* the escape. Any CANADIAN SYLLABICS in the original input don't need escaping, as these only have special meaning when prefixed with ဩ - if the ဩ character is used often in the text, the alternative ℹ இ ண ஐ Ϟ ല ઊ characters MAY be considered for the initial escape code, hence we start with analyzing the entire source input to see which escapes we'll come up with this time. The basic shapes are: - 1401-141B: ᐁ 1 - 142F-1448: ᐯ 2 - 144C-1465: ᑌ 3 - 146B-1482: ᑫ 4 - 1489-14A0: ᒉ 5 - 14A3-14BA: ᒣ 6 - 14C0-14CF: ᓀ - 14D3-14E9: ᓓ 7 - 14ED-1504: ᓭ 8 - 1510-1524: ᔐ 9 - 1526-153D: ᔦ - 1542-154F: ᕂ - 1553-155C: ᕓ - 155E-1569: ᕞ - 15B8-15C3: ᖸ - 15DC-15ED: ᗜ 10 - 15F5-1600: ᗵ - 1614-1621: ᘔ - 1622-162D: ᘢ ## JISON identifier formats ## - direct symbol references, e.g. `#NUMBER#` when there's a `%token NUMBER` for your grammar. These represent the token ID number. -> (1+2) start-# + end-# - alias/token value references, e.g. `$token`, `$2` -> $ is an accepted starter, so no encoding required - alias/token location reference, e.g. `@token`, `@2` -> (6) single-@ - alias/token id numbers, e.g. `#token`, `#2` -> (3) single-# - alias/token stack indexes, e.g. `##token`, `##2` -> (4) double-# - result value reference `$$` -> $ is an accepted starter, so no encoding required - result location reference `@$` -> (6) single-@ - rule id number `#$` -> (3) single-# - result stack index `##$` -> (4) double-# - 'negative index' value references, e.g. `$-2` -> (8) single-negative-$ - 'negative index' location reference, e.g. `@-2` -> (7) single-negative-@ - 'negative index' stack indexes, e.g. `##-2` -> (5) double-negative-#

# Conflicts: # ports/csharp/Jison/Jison/csharp.js # ports/php/php.js # ports/php/template.php

…h-pages

…cters.

…now have an augmented API.

…a second argument (`options`): cleaning up calling code which assumed as much.

gitbook-legacy bot pushed a commit to GerHobbelt/jison that referenced this pull request Feb 26, 2017

gitbook: dumped zaach#338 into there for quick ref while I work on it

5c0054a

GerHobbelt force-pushed the master branch from 8d2f0df to 72a1830 Compare July 31, 2017 03:15

GerHobbelt mentioned this pull request Sep 13, 2017

Lexical error when parsing Integer values #360

Open

GerHobbelt added 21 commits October 30, 2017 00:17

fixup the error-handling-and-yyerrok* examples: remove extra newlines…

637d3a3

… in the console.log() statements in there: console.log() adds a newline automatically, while the original C code `printf()` does not.

observed in one of the examples: lingering yyleng = 0; statement af…

c3255fd

…ter code stripping. Adjusted stripper regexes to fix this.

add a make profile task to simply only run the profiler tasks in ex…

c4e5e7f

…amples/Makefile. Tweak `make superclean` to ensure that we can bootstrap once you've run `make prep` by reverting the jison/dist/ directory after 'supercleaning'.

regenerated library files and updated NPM packages.

008454a

use *positive* statements in jison code generator analysis report whi…

d0d982a

…ch is included with every generated parser: this makes those reports easier to understand at a glance.

tweak examples to pass the JS validation tests for all action code sn…

8c12447

…ippets and other code blocks. We don't want to do them all, so there's #26

parser kernel fix: when yyError() internally invokes parseError()…

ef1732a

… it should always have produced an 'expected set of tokens' in the info hash, whether you're running in an error recovery enabled grammar or a simple (non-error-recovering) grammar.

examples: replaced the codegen-feature-tester make task with `make …

91fe96c

…comparison` as it will compare more than just the generated codegen parsers' sources...

updated NPM packages and rebuilt library files.

7c54a72

regenerated library files

4694c5c

Merge branch 'master' into debuggable

2da1887

# Conflicts: # lib/jison-parser-kernel.js

GerHobbelt and others added 30 commits November 17, 2019 20:51

only supporting node versions 8 and above from now on. Anything older…

580427a

… is obsoleted anyway.

remove test comment from example

6c54adb

fix copy/pasta API reference which broke any build

fa23ebc

couple of "use strict" additions plus incorporated the jison parser k…

b046d9c

…ernel into the main source file (see previous commit)

unifying the rollup configurations, using a single template reference.

ac75af4

add couple more test files; one for jison (parser), one for the jison…

0608e9e

… lexer

fix test reference/sollwert: this is how the *new* BNF parse tree looks!

0e0a2a4

Merge branch 'point-6-1-maintenance'

6ff5ffe

# Conflicts: # lib/jison.js # package-lock.json # package.json # packages/jison-lex/regexp-lexer.js # packages/jison2json/tests/tests.js

Merge branch 'debuggable'

f96cf98

Merge remote-tracking branch 'remotes/runtrizapps/master'

a8d67dc

# Conflicts: # README.md # lib/cli.js # package.json

added test for XRegExp usage

8005635

hard revert to base recast and babel. Let's see how far we can get th…

07c9f90

…is pair to cooperate.

Merge pull request zaach#354 from anirvan/patch-1

b947363

added js-sequence-diagrams to demo projects list

bump version

b347662

updated npm packages

1dc2d1b

Merge remote-tracking branch 'remotes/zaach-original/master'

9d7459b

# Conflicts: # ports/csharp/Jison/Jison/csharp.js # ports/php/php.js # ports/php/template.php

Merge remote-tracking branch 'remotes/zaach-original/gh-pages' into g…

a3cc10b

…h-pages

sync build revision

d3fd3a8

further work on the Unicode mapping of Jison special Identifier chara…

625fcce

…cters.

fixed bug in regex for Unicode mapper decodeJisonTokens() method.

e12e57d

fix bugs in the Unicode mapper and augment tests for it.

be19e1a

fix the basic tests which employ the Unicode mapper: those functions …

1b2c5a1

…now have an augmented API.

regenerated helpers library files

e8a07c9

hacky: make jison work with the new unicode id mapper.

bd10539

parser/github-issues-nr-XX-gho.ejs(input) API method does NOT have …

c913ae8

…a second argument (`options`): cleaning up calling code which assumed as much.

test examples added for issues 51 and 58

396e925

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[for reference] all work done which is not in original repo #338

[for reference] all work done which is not in original repo #338

GerHobbelt commented Dec 15, 2016 •

edited

Loading

[for reference] all work done which is not in original repo #338

Are you sure you want to change the base?

[for reference] all work done which is not in original repo #338

Conversation

GerHobbelt commented Dec 15, 2016 • edited Loading

Main features

Minor 'Selling Points'

Where is this thing heading?

GerHobbelt commented Dec 15, 2016 •

edited

Loading