Skip to content
A tokenizer written in (SWI-)Prolog. It has some useful features and some flexibility and it might improve.
Branch: master
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
prolog
LICENSE
README.md
pack.pl

README.md

Synopsis

?- tokenize(`\tExample  Text.`, Tokens).
Tokens = [cntrl('\t'), word(example), spc(' '), spc(' '), word(text), punct('.')] 

?- tokenize(`\tExample  Text.`, Tokens, [cntrl(false), pack(true), cased(true)]).
Tokens = [word('Example', 1), spc(' ', 2), word('Text', 1), punct('.', 1)] 

?- tokenize(`\tExample  Text.`, Tokens), untokenize(Tokens, Text), format('~s~n', [Text]).
	example  text.
Tokens = [cntrl('\t'), word(example), spc(' '), spc(' '), word(text), punct('.')],
Text = [9, 101, 120, 97, 109, 112, 108, 101, 32|...] 

Description

Module tokenize aims to provide a straightforward tool for tokenizing text into a simple format. It is the result of a learning exercise, and it is far from perfect. If there is sufficient interest from myself or anyone else, I'll try to improve it.

It is packaged as an SWI-Prolog pack, available here. Install it into your SWI-Prolog system with the query

?- pack_install(tokenize).

Please visit the wiki for more detailed instructions and examples, including a full list of options supported.

You can’t perform that action at this time.