A character-based tokenizer for programming languages. It tokenizes:
- Identifiers;
- Number and string literals (double quotes only, by default);
- Punctuation symbols, each as a separate token.
>>> from toksic import tokenize
>>> tokenize('a + "b = c"')
['a', '+', '"b = c"']
>>> tokenize('ant+b?:c')
['ant', '+', 'b', '?', ':', 'c']
>>> tokenize('a == b')
['a', '=', '=', 'b']
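The default behaviour shown above can be approximated with a short character loop. This is only a sketch, not toksic's actual implementation: the name sketch_tokenize is hypothetical, and it assumes that whitespace is dropped, that underscores count as identifier characters, and that an unterminated string runs to the end of the input.

```python
def sketch_tokenize(text):
    """Illustrative only; toksic's real tokenize may differ in details."""
    tokens, i = [], 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            # Assumption: whitespace only separates tokens and is discarded.
            i += 1
        elif ch == '"':
            # Double-quoted literal: consume up to and including the closing quote.
            end = text.find('"', i + 1)
            end = len(text) if end == -1 else end + 1
            tokens.append(text[i:end])
            i = end
        elif ch.isalnum() or ch == '_':
            # Identifier or number: consume a run of word characters.
            j = i
            while j < len(text) and (text[j].isalnum() or text[j] == '_'):
                j += 1
            tokens.append(text[i:j])
            i = j
        else:
            # Any other character is a one-character punctuation token.
            tokens.append(ch)
            i += 1
    return tokens
```

Under these assumptions the sketch splits 'a == b' into four tokens, just like the last example above, because every punctuation character is emitted on its own.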
To capture multi-character keywords or operators such as == as single tokens, pass a Trie:
>>> from toksic import tokenize, Trie
>>> trie = Trie(); trie.insert('==')
>>> tokenize('a == b', trie)
['a', '==', 'b']
>>> trie = Trie(); trie.insert('not in')
>>> tokenize('a not in b', trie)
['a', 'not in', 'b']
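One way a trie can support this kind of matching is by returning the longest inserted word that starts at a given position. The class below is a hypothetical stand-in written for illustration (SketchTrie and longest_match are not part of toksic, and the real Trie may work differently).

```python
class SketchTrie:
    """Hypothetical stand-in for illustration; not toksic's actual Trie."""

    def __init__(self):
        self.children = {}     # maps a character to the next node
        self.terminal = False  # True if an inserted word ends at this node

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, SketchTrie())
        node.terminal = True

    def longest_match(self, text, start=0):
        """Return the length of the longest inserted word at `start`, or 0."""
        node, best = self, 0
        for offset, ch in enumerate(text[start:], 1):
            node = node.children.get(ch)
            if node is None:
                break
            if node.terminal:
                best = offset
        return best


trie = SketchTrie()
trie.insert('=')
trie.insert('==')
trie.insert('not in')
print(trie.longest_match('a == b', 2))      # 2
print(trie.longest_match('a not in b', 2))  # 6
print(trie.longest_match('a + b', 2))       # 0
```

Preferring the longest match is a design choice made in this sketch: it lets '==' win over '=' when both are registered.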
Only double-quoted string literals are supported by default, but you can register other delimiter pairs through the literals argument:
# Custom delimiters
>>> tokenize("'a$' = /b+/", literals=[("'", "'"), ("/", "/")])
["'a$'", '=', '/b+/']
# Handle single & double quotes
>>> tokenize("'a$' = \"b+\"", literals=[("'", "'"), ('"', '"')])
["'a$'", '=', '"b+"']
# Different start & end symbols
>>> tokenize("^a+$ = \"b+\"", literals=[("^", "$"), ('"', '"')])
['^a+$', '=', '"b+"']
The only limitation is that each delimiter must be a single character, though the opening and closing delimiters may differ.
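To see why single-character delimiters keep things simple, consider a scanner that compares one character against each opening delimiter and then searches for the matching closing one. The helper below is a hypothetical sketch (scan_literal is not a toksic function); it assumes no escape sequences and reports an unterminated literal as None.

```python
def scan_literal(text, start, literals=(('"', '"'),)):
    """Return the end index (exclusive) of a literal opening at `start`, or None.

    Hypothetical helper: no escape handling; unterminated literals give None.
    """
    for opener, closer in literals:
        # A single-character opener can be tested with one comparison.
        if text[start] == opener:
            end = text.find(closer, start + 1)
            return None if end == -1 else end + 1
    return None


pairs = [("^", "$"), ('"', '"')]
text = '^a+$ = "b+"'
end = scan_literal(text, 0, pairs)
print(text[0:end])  # ^a+$
```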
You can also run the package as a script to tokenize a string given on the command line:
# Usage: python -m toksic <literal> [<specials> ...]
$ python -m toksic "a++ + -b" "++"
a
++
+
-
b