toksic

A character-based tokenizer for programming languages. It tokenizes:

  • Identifiers;
  • Number and string literals (double quotes only by default);
  • Punctuation symbols, each as a separate token.

>>> from toksic import tokenize

>>> tokenize('a + "b = c"')
['a', '+', '"b = c"']

>>> tokenize('ant+b?:c')
['ant', '+', 'b', '?', ':', 'c']

>>> tokenize('a == b')
['a', '=', '=', 'b']
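
To make the default behaviour concrete, here is a minimal sketch of character-based tokenization. It is not toksic's implementation: `sketch_tokenize` is a hypothetical name, it assumes identifiers and numbers are runs of alphanumerics or underscores, and it leaves out string literals entirely.

```python
def sketch_tokenize(text):
    """Hypothetical illustration only; not toksic's actual tokenize()."""
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            # Whitespace only separates tokens; it is never emitted.
            i += 1
        elif ch.isalnum() or ch == '_':
            # Consume a whole run of word characters as one token.
            j = i
            while j < len(text) and (text[j].isalnum() or text[j] == '_'):
                j += 1
            tokens.append(text[i:j])
            i = j
        else:
            # Every other character becomes its own token.
            tokens.append(ch)
            i += 1
    return tokens


print(sketch_tokenize('ant+b?:c'))  # ['ant', '+', 'b', '?', ':', 'c']
print(sketch_tokenize('a == b'))    # ['a', '=', '=', 'b']
```

Because punctuation is emitted one character at a time, `==` comes out as two `=` tokens unless something recognises it as a unit.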

Keywords

To capture specific keywords or operators such as ==, you can use a Trie:

>>> from toksic import tokenize, Trie
>>> trie = Trie(); trie.insert('==')
>>> tokenize('a == b', trie)
['a', '==', 'b']

>>> trie = Trie(); trie.insert('not in')
>>> tokenize('a not in b', trie)
['a', 'not in', 'b']
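
For readers unfamiliar with tries, the sketch below shows how one can find the longest registered keyword starting at a given position. `SketchTrie` and `longest_match` are hypothetical names used for illustration; whether toksic's own `Trie` works this way internally is an assumption.

```python
class SketchTrie:
    """Hypothetical stand-in for a trie; not toksic's actual Trie class."""

    def __init__(self):
        self.children = {}     # maps a character to the next node
        self.terminal = False  # True if an inserted word ends here

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, SketchTrie())
        node.terminal = True

    def longest_match(self, text, start=0):
        """Return the longest inserted word beginning at text[start], or None."""
        node = self
        best = None
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                break
            if node.terminal:
                best = text[start:i + 1]
        return best


trie = SketchTrie()
trie.insert('=')
trie.insert('==')
trie.insert('not in')

print(trie.longest_match('a == b', 2))      # ==
print(trie.longest_match('a not in b', 2))  # not in
print(trie.longest_match('a + b', 2))       # None
```

Walking the trie one character at a time means a keyword containing spaces, such as `not in`, is matched the same way as a two-character operator.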

String literals

Only double-quoted string literals are supported by default, but you can register other delimiter pairs through the literals argument:

# Arbitrary custom delimiters
>>> tokenize("'a$' = /b+/", literals=[("'", "'"), ("/", "/")])
["'a$'", "=", "/b+/"]

# Handle single & double quotes
>>> tokenize("'a$' = \"b+\"", literals=[("'", "'"), ('"', '"')])
["'a$'", "=", '"b+"']

# Different start & end symbols
>>> tokenize("^a+$ = \"b+\"", literals=[("^", "$"), ('"', '"')])
["^a+$", "=", '"b+"']

The only limitation is that each delimiter must be a single character, though the opening and closing delimiters can differ.
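
One way to see why single-character delimiters are convenient: the opening character alone is enough to look up the closing one. The sketch below assumes this approach; `scan_literal` is a hypothetical helper, not part of toksic, and it ignores escape sequences.

```python
def scan_literal(text, start, literals):
    """Return the literal beginning at text[start], or None if none begins there.

    `literals` is a list of (open, close) single-character pairs, the same
    shape as the `literals` argument shown above.
    """
    closers = dict(literals)          # opening char -> closing char
    close = closers.get(text[start])
    if close is None:
        return None                   # text[start] does not open a literal
    end = text.find(close, start + 1)
    if end == -1:
        return None                   # unterminated literal; handling here is an assumption
    return text[start:end + 1]


pairs = [("^", "$"), ('"', '"')]
source = '^a+$ = "b+"'

print(scan_literal(source, 0, pairs))  # ^a+$
print(scan_literal(source, 7, pairs))  # "b+"
print(scan_literal(source, 5, pairs))  # None
```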

Command Line

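You can also run the package from the command line to tokenize a string on the fly. The first argument is the text to tokenize; any further arguments are special tokens (such as ++) to keep whole: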

$ python -m toksic "a++ + -b" "++"
#                   <literal> [, <specials>]*
a
++
+
-
b
