Tokens

Tokens.jl supplies tools for working with small text fragments, such as those produced by the lexer stage when processing source code of computer languages. A lexer segments source text into "tokens": usually short sequences that are classified, for example, as identifiers, numbers, whitespace and so forth. A token in the context of Tokens.jl consists of a category ID (an enumeration with 16 predefined values) and a string.

The design goals are memory and runtime efficiency, in particular for very short strings. Strings with up to 7 code units can be stored directly in a minimal token structure of 8 bytes; tokens with longer strings use a separate buffer. A token vector uses one common buffer for all elements, which avoids a heap allocation for each token created.
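To make the 8-byte idea concrete, here is a minimal sketch of how a category and up to 7 code units could be packed into a single UInt64. The bit layout and the names pack_direct, unpack_category and unpack_string are assumptions chosen for illustration. They are not the actual Tokens.jl encoding or API.

```julia
# Hypothetical layout (an assumption, not the real Tokens.jl encoding):
#   bits 60-63: category (0..15)
#   bits 56-58: length in code units (0..7)
#   bits  0-55: up to 7 code units, first code unit in the highest byte

function pack_direct(category::Integer, s::String)
    0 <= category <= 15 || throw(ArgumentError("category must be in 0:15"))
    n = ncodeunits(s)
    n <= 7 || throw(ArgumentError("at most 7 code units fit into 8 bytes"))
    bits = (UInt64(category) << 60) | (UInt64(n) << 56)
    for (i, cu) in enumerate(codeunits(s))
        bits |= UInt64(cu) << (8 * (7 - i))   # i = 1 lands in bits 48-55
    end
    return bits
end

unpack_category(bits::UInt64) = Int(bits >> 60)

function unpack_string(bits::UInt64)
    n = Int((bits >> 56) & 0x07)
    # Read the code units back in the order they were written.
    return String(UInt8[UInt8((bits >> (8 * (7 - i))) & 0xff) for i in 1:n])
end

t = pack_direct(3, "while")
sizeof(t)             # 8 bytes, no heap allocation for the contents
unpack_category(t)    # 3
unpack_string(t)      # "while"
```

With a layout like this, testing two tokens for equality of category and contents is a single integer comparison.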

Inspirations came from Nicholas Ormrod's talk on strange details of std::string at Facebook, ShortStrings for interned strings, and WeakRefStrings for string arrays using a shared content buffer.

Current State

Tokens.jl compiles and passes its defined tests. However, test coverage is very incomplete and features are not mature; expect breaking changes in upcoming releases before release 1.0.

Basic structures

AbstractToken <: AbstractString

A string with an attached token category. There are 16 categories with predefined meaning, closely related to the lexer methods in the Tokens package.

DToken, BToken, HToken <: AbstractToken

The type central to this package comes in three flavours, identified with a prefix character.

B in BToken stands for "buffered". Token contents are stored in a separate string buffer, and a BToken instance holds a reference to it. Memory layout and behavior are very similar to a SubString. An instance consists of 16 bytes (without the buffer).
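The following sketch illustrates how a buffered token could fit into 16 bytes on a 64-bit platform: one 8-byte word for category, offset and length, plus one reference to the buffer. The type BufferedSketch, its field names, the bit widths and the helper functions are assumptions made for this example. They are not taken from the Tokens.jl sources.

```julia
# Hypothetical stand-in for a buffered token (not the real BToken).
# meta layout (assumed): bits 60-63 category, bits 32-59 length, bits 0-31 offset.
struct BufferedSketch
    meta::UInt64
    buffer::String    # reference to the shared contents, nothing is copied
end

function BufferedSketch(category::Integer, buffer::String, offset::Integer, len::Integer)
    0 <= category <= 15 || throw(ArgumentError("category must be in 0:15"))
    (0 <= offset && 0 <= len && offset + len <= ncodeunits(buffer)) ||
        throw(ArgumentError("offset and length must lie within the buffer"))
    (len <= 0x0fffffff && offset <= 0xffffffff) ||
        throw(ArgumentError("offset or length too large for this layout"))
    meta = (UInt64(category) << 60) | (UInt64(len) << 32) | UInt64(offset)
    return BufferedSketch(meta, buffer)
end

sketch_category(t::BufferedSketch) = Int(t.meta >> 60)

function sketch_text(t::BufferedSketch)
    len = Int((t.meta >> 32) & 0x0fffffff)
    offset = Int(t.meta & 0xffffffff)
    # SubString expects the index of the last character, not of the last code unit.
    stop = prevind(t.buffer, offset + len + 1)
    return SubString(t.buffer, offset + 1, stop)
end

source = "if x then y"
tok = BufferedSketch(2, source, 5, 4)
sketch_text(tok)        # "then", a view into `source`
sketch_category(tok)    # 2
```

For comparison, a SubString{String} occupies 24 bytes on a 64-bit platform, because it stores offset and length as two full machine words next to the string reference.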

D in DToken stands for "direct": token contents are encoded directly within the instance, and no separate buffer is used. An instance consists of 8 bytes. Contents size is limited to 7 code units. Compared to BToken, it saves memory, has increased data locality, and many operations, e.g. comparisons, are much faster.

H in HToken stands for "hybrid". HToken combines DToken and BToken in one type, similar to a Julia union type. An instance consists of 16 bytes. At runtime, it is cast to a DToken or BToken instance. In scenarios where a high percentage of tokens are short enough to fit into a DToken, it saves memory and has better overall performance.

SharedIO <: IO

An IOBuffer derivative with extended token support. It uses a "copy on write" approach, keeping track of the parts of its buffer that are referenced by tokens and token vectors.
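The copy-on-write idea can be sketched in a few lines. This is a deliberately simplified illustration: it tracks sharing for the whole buffer with one flag, whereas the description above speaks of tracking the referenced parts. The type CowBufferSketch and the functions share! and write_at! are hypothetical names and not part of SharedIO.

```julia
# Hypothetical copy-on-write buffer (not the real SharedIO).
mutable struct CowBufferSketch
    data::Vector{UInt8}
    shared::Bool    # true once somebody else holds a reference to `data`
end

CowBufferSketch(s::String) = CowBufferSketch(Vector{UInt8}(s), false)

# Hand out the current contents, e.g. to a token, and remember that they are shared.
function share!(b::CowBufferSketch)
    b.shared = true
    return b.data
end

# Overwrite one byte. If the contents are shared, copy them first,
# so that existing references keep seeing the old bytes.
function write_at!(b::CowBufferSketch, pos::Integer, byte::UInt8)
    if b.shared
        b.data = copy(b.data)
        b.shared = false
    end
    b.data[pos] = byte
    return b
end

buf = CowBufferSketch("let x")
ref = share!(buf)                # a token would keep this reference
write_at!(buf, 1, UInt8('s'))    # triggers the copy
String(copy(ref))                # "let x", the referenced bytes are untouched
String(copy(buf.data))           # "set x"
```

Writes to an unshared buffer modify it in place, so the copy is only paid for when a reference actually exists.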

TokenVector <: AbstractVector

A memory-efficient vector implementation that uses one SharedIO for all of its token elements. Within the structure, each token consumes 8 bytes, plus its contents bytes if the contents size exceeds 7 code units.
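As a rough illustration of this storage scheme, the sketch below keeps one 8-byte entry per token and appends only the contents of longer tokens to a single shared byte buffer. It uses a plain Vector{UInt8} in place of SharedIO. The type TokenVectorSketch, the functions push_token!, token_text and token_category, and the entry layout are all assumptions for this example, not the actual TokenVector implementation.

```julia
# Hypothetical entry layout (assumed):
#   bit 63: 1 = contents live in the shared buffer, 0 = contents stored inline
#   bits 59-62: category
#   inline:   bits 56-58 length, bits 0-55 up to 7 code units
#   buffered: bits 32-58 length, bits 0-31 offset into the shared buffer
struct TokenVectorSketch
    entries::Vector{UInt64}
    buffer::Vector{UInt8}
end

TokenVectorSketch() = TokenVectorSketch(UInt64[], UInt8[])

const BUFFERED_FLAG = UInt64(1) << 63

function push_token!(v::TokenVectorSketch, category::Integer, s::String)
    0 <= category <= 15 || throw(ArgumentError("category must be in 0:15"))
    units = codeunits(s)
    n = length(units)
    entry = UInt64(category) << 59
    if n <= 7
        entry |= UInt64(n) << 56
        for (i, cu) in enumerate(units)
            entry |= UInt64(cu) << (8 * (7 - i))
        end
    else
        offset = length(v.buffer)
        (n <= 0x07ffffff && offset <= 0xffffffff) ||
            throw(ArgumentError("token or buffer too large for this layout"))
        append!(v.buffer, units)    # the only place where contents bytes are stored
        entry |= BUFFERED_FLAG | (UInt64(n) << 32) | UInt64(offset)
    end
    push!(v.entries, entry)
    return v
end

token_category(v::TokenVectorSketch, i::Integer) = Int((v.entries[i] >> 59) & 0x0f)

function token_text(v::TokenVectorSketch, i::Integer)
    entry = v.entries[i]
    if (entry & BUFFERED_FLAG) == 0
        n = Int((entry >> 56) & 0x07)
        return String(UInt8[UInt8((entry >> (8 * (7 - k))) & 0xff) for k in 1:n])
    else
        n = Int((entry >> 32) & 0x07ffffff)
        offset = Int(entry & 0xffffffff)
        return String(v.buffer[offset+1:offset+n])
    end
end

v = TokenVectorSketch()
for (c, txt) in [(1, "function"), (0, " "), (2, "tokenize"), (3, "("), (3, ")")]
    push_token!(v, c, txt)
end
length(v.entries) * 8    # 40 bytes of entries for 5 tokens
length(v.buffer)         # 16 bytes, only the two 8-code-unit tokens use the buffer
```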

TokenTree

A compact tree structure on top of a TokenVector, intended as a syntax tree.
