lexr is a lightweight tokenizer written in JavaScript, intended as a more modern and cleaner alternative to its C ancestor, flex.
lexr is self-contained and works on its own, but it is designed to be used hand in hand with grammr once that project is finished.
lexr and grammr are an effort to rethink how flex and bison traditionally work together and to move development to a more modern process.
Both projects are being developed for the rework of ivi, a project that aims to visualize code for teaching introductory programming.
The current lexical analyzer has built-in support for JavaScript, with plans to extend it to other languages.
If your language is not supported, or you simply want to use custom tokens, you can do that as well.
Currently supported features:
- Using built-in or custom tokens
- Adding Tokens either one by one or in a set to the tokenizer
- Error Detection
- Functions on Token Recognition
- Overwriting error token name
- Removing Tokens from the token set
- Ignoring tokens for output either one by one or in a set
- UnIgnoring tokens from the token set
- Custom Output for tokens
- Removing custom output for tokens
Built-in languages currently supported:
- Javascript
- Json
The entire library wraps around a Tokenizer class.
First, import the library:
let lexr = require('lexr');
To use a built-in language, initialize the tokenizer like so:
let tokenizer = new lexr.Tokenizer("Javascript");
To use fully custom tokens, initialize it with an empty string instead:
let tokenizer = new lexr.Tokenizer("");
If you have selected a built-in language, you will not be able to add or remove tokens until you disable strict mode for tokens.
To do so, call the disableStrict() function on the tokenizer instance.
Once you have done so, or if you are working with a fully custom tokenizer, you can add tokens in two ways:
// Add a single token
// Arguments: tokenName, RegExp pattern
tokenizer.addToken("L_PAREN", /\(/);
// Add multiple tokens
// Must be in the form of a Javascript Object
const tokens = {
    L_BRACE : /{/,
    R_BRACE : new RegExp('}'),
};
tokenizer.addTokenSet(tokens);
You can also remove pre-existing tokens if you are using a custom language or have disabled strict mode.
tokenizer.removeToken("L_PAREN");
The tokenizer can ignore whitespace and newlines; call the corresponding methods on the instance.
Example:
let x = new lexr.Tokenizer("Json");
x.ignoreWhiteSpace();
x.ignoreNewLine();
To run a function whenever a token is recognized, register functions either as a set with addFunctionSet or one at a time with addFunction.
// Add functions through set
let funs = {
    WHITESPACE : function() { whitespaceCount += 1 },
    IDENTIFIER : function() { idNum += 1 }
}
tokenizer.addFunctionSet(funs);
// Add single function
tokenizer.addFunction("NEW_LINE", function() { lineCount += 1 });
// Remove function
tokenizer.removeFunction("IDENTIFIER");
lexr differs slightly from flex in how functions are handled.
Your functions should not rely on information taken from the current token, since that information is available to you after tokenization.
This keeps the functions that are executed small and clean.
An example of code that goes with these functions:
// Counters are declared outside the function object so they can be read after tokenizing
let whitespaceCount = 0;
let idNum = 0;
let funs = {
    WHITESPACE : function() { whitespaceCount += 1 },
    IDENTIFIER : function() { idNum += 1 }
}
tokenizer.addFunctionSet(funs);
let input = `var a = 4;
var b = 3;`;
tokenizer.tokenize(input);
Because the functions are stored in an object inside the tokenizer, scoping can get tricky.
You can use the example above; however, the suggested approach is to create a functions.js file in order to:
- map each of your tokens to its function
- declare any variables you need outside of the function object
- export the function object, along with any variables you need access to, to the files that use them
Instead of using yytext inside your functions, as you would in flex, the suggested approach is to analyze the output after tokenization.
For example, collecting all identifier names into a symbol table could look like this:
let input = `var a = 4;
var b = 3;`;
let output = tokenizer.tokenize(input);
let symbolTable = {};
for (let i = 0; i < output.length; i++) {
    if (output[i].token === "IDENTIFIER") {
        symbolTable[output[i].value] = undefined;
    }
}
By default, input that matches no token is reported under the token name ERROR.
To change that name, call setErrTok:
tokenizer.setErrTok("DIFF_ERROR");
You can also keep certain tokens out of the output, either by calling addIgnore:
tokenizer.addIgnore("WHITESPACE");
Or by adding an entire set as an array or an object:
let ignore = ["WHITESPACE", "VAR"];
tokenizer.addIgnoreSet(ignore);
// Or through an object which allows true or false
let ignore2 = {
    "WHITESPACE" : true,
    "VAR" : false,
};
tokenizer.addIgnoreSet(ignore2);
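The object form is handy when the list of ignored tokens is computed rather than written by hand. The sketch below builds such an object from an array of names. toIgnoreObject is a hypothetical helper, not a lexr function, and it assumes that true marks a token as ignored, which the example above suggests but does not state.

```javascript
// Hypothetical helper (not part of lexr): build the object form that
// addIgnoreSet accepts from a list of token names.
// Assumption: true means "ignore this token", false means "keep it".
function toIgnoreObject(names, ignored = true) {
    const set = {};
    for (const name of names) {
        set[name] = ignored;
    }
    return set;
}

const ignoreObj = toIgnoreObject(["WHITESPACE", "NEW_LINE"]);
// Individual entries can be switched off before the object is passed on.
ignoreObj.NEW_LINE = false;
console.log(ignoreObj); // { WHITESPACE: true, NEW_LINE: false }
```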
To stop ignoring a token programmatically, call the unIgnore method:
tokenizer.unIgnore("WHITESPACE");
You can make your output more verbose by adding a customOut field to the output objects.
As with the other operations, you can add custom outputs as a set or one token at a time, and remove them again.
// Add a set of custom outputs
let customOut = {
    "WHITESPACE" : 2,
    "VAR" : 'declaration',
}
tokenizer.addCustomOutSet(customOut);
// Add a single custom output
tokenizer.addCustomOut("SEMI_COLON", 111);
// Remove a custom out
tokenizer.removeCustomOut("VAR");
A sample output object would then look like:
{ token: 'WHITESPACE', value: ' ', customOut: 2 }
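Because customOut travels with each output object, later stages can branch on it without re-checking token names. This is a minimal sketch that uses hand-written objects in the shape shown above rather than real tokenizer output; the sample values are assumptions.

```javascript
// Hand-written sample in the documented output shape (not real lexr output).
const output = [
    { token: 'VAR', value: 'var', customOut: 'declaration' },
    { token: 'WHITESPACE', value: ' ', customOut: 2 },
    { token: 'IDENTIFIER', value: 'a' }
];

// Keep only the tokens tagged as declarations.
const declarations = output.filter(t => t.customOut === 'declaration');

// Assumption: tokens without a custom output lack the field entirely.
const untagged = output.filter(t => !('customOut' in t));

console.log(declarations.map(t => t.value)); // [ 'var' ]
console.log(untagged.map(t => t.token));     // [ 'IDENTIFIER' ]
```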
Lastly, to tokenize your input, call the tokenizer's tokenize method:
let output = tokenizer.tokenize(aString);
Currently, tokenize(aString) returns a list of objects, each with two properties: token, the name of the token that was captured, and value, the text that matched it.
let lexr = require('lexr');
let tokenizer = new lexr.Tokenizer("Javascript");
let input = "var a = null;";
let output = tokenizer.tokenize(input);
console.log(output);
The output would then be:
[ { token: 'VAR', value: 'var' },
{ token: 'WHITESPACE', value: ' ' },
{ token: 'IDENTIFIER', value: 'a' },
{ token: 'WHITESPACE', value: ' ' },
{ token: 'ASSIGN', value: '=' },
{ token: 'WHITESPACE', value: ' ' },
{ token: 'NULL_LIT', value: 'null' },
{ token: 'SEMI_COLON', value: ';' } ]
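Since the result is a plain array of objects, ordinary array handling is enough for most post-processing. As a small sketch, the snippet below counts tokens per type. The array is copied from the output shown above so that it runs without lexr installed, and countByToken is a hypothetical helper name.

```javascript
// Copied from the sample output above so the snippet runs on its own.
const output = [
    { token: 'VAR', value: 'var' },
    { token: 'WHITESPACE', value: ' ' },
    { token: 'IDENTIFIER', value: 'a' },
    { token: 'WHITESPACE', value: ' ' },
    { token: 'ASSIGN', value: '=' },
    { token: 'WHITESPACE', value: ' ' },
    { token: 'NULL_LIT', value: 'null' },
    { token: 'SEMI_COLON', value: ';' }
];

// Hypothetical helper: tally how often each token name occurs.
function countByToken(tokens) {
    const counts = {};
    for (const t of tokens) {
        counts[t.token] = (counts[t.token] || 0) + 1;
    }
    return counts;
}

const tokenCounts = countByToken(output);
console.log(tokenCounts.WHITESPACE); // 3
console.log(tokenCounts.IDENTIFIER); // 1
```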
A second example, using a fully custom tokenizer with a renamed error token:
let lexr = require('lexr');
let tokenizer = new lexr.Tokenizer("");
tokenizer.addToken("PLUS", /\+/);
tokenizer.setErrTok("DIFF_ERROR");
let input = "5+5;";
let output = tokenizer.tokenize(input);
console.log(output);
The output would then be:
[ { token: 'DIFF_ERROR', value: '5' },
{ token: 'PLUS', value: '+' },
{ token: 'DIFF_ERROR', value: '5;' } ]
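Error tokens appear in the output like any other token, so detecting bad input comes down to filtering on the error token name. This is a minimal sketch over the output shown above, copied by hand; findErrors is a hypothetical helper, not a lexr function.

```javascript
// Copied from the sample output above.
const output = [
    { token: 'DIFF_ERROR', value: '5' },
    { token: 'PLUS', value: '+' },
    { token: 'DIFF_ERROR', value: '5;' }
];

// Hypothetical helper: collect every entry carrying the error token name.
// The default argument mirrors the documented default name, ERROR.
function findErrors(tokens, errTok = 'ERROR') {
    return tokens.filter(t => t.token === errTok);
}

const errors = findErrors(output, 'DIFF_ERROR');
if (errors.length > 0) {
    console.log('Unrecognized input:', errors.map(e => e.value).join(', '));
    // prints: Unrecognized input: 5, 5;
}
```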
If you are not using a built-in language, I suggest separating each part of the tokenization setup into its own file.
For a complex language whose regexes can become very large, build those regexes up in a separate file and export only the final regex to your token object.
Because the Tokenizer accepts sets of information, it is easiest to keep everything separate and share it between files through exports.
+-- src/
|   +-- index.js
|   +-- functions/
|   |   +-- functions.js
|   +-- tokens/
|   |   +-- tokens.js
|   |   +-- regexPatterns.js
|   +-- ignore/
|   |   +-- ignoreTokens.js
|   +-- customOut/
|       +-- customOutput.js
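To illustrate the regexPatterns.js idea, the sketch below assembles a larger pattern from named pieces and hands only the final regex to the token object. The pieces and the NUMBER token are made-up examples. For brevity both files are shown in one snippet; in the layout above the first part would live in regexPatterns.js and the token object in tokens.js.

```javascript
// --- regexPatterns.js (illustrative pieces, not lexr's own patterns) ---
const digits = /[0-9]+/;
const fraction = /\.[0-9]+/;
const exponent = /[eE][+-]?[0-9]+/;

// Combine the pieces through their .source strings; only this result is shared.
const number = new RegExp(
    digits.source + '(?:' + fraction.source + ')?(?:' + exponent.source + ')?'
);

// --- tokens.js (would import `number` from regexPatterns.js) ---
const tokens = {
    NUMBER: number,
    PLUS: /\+/
};

console.log(tokens.NUMBER.source);
console.log(tokens.NUMBER.test('3.14e-2')); // true
```

The resulting tokens object has the same shape as the one passed to addTokenSet earlier, so it could be exported and registered in index.js.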
If any useful features are missing, feel free to open an issue with a feature request.