# Tokenizer

Currently, we support the following tokenizers:

- `Bert`: the default uncased BERT tokenizer.
- `Tocken`: a Unicode tokenizer pre-trained on wiki-103-raw with `min_freq=10`.
- `Unicode`: a Unicode tokenizer that is trained on your own data.

## Usage

### Pre-trained Tokenizer

`Bert` and `Tocken` are pre-trained tokenizers. You can use them directly by calling the `tokenize` function.

```sql
SELECT tokenize('A quick brown fox jumps over the lazy dog.', 'Bert');  -- or 'Tocken'
-- {2058:1, 2474:1, 2829:1, 3899:1, 4248:1, 4419:1, 5376:1, 5831:1}
```
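The returned `bm25vector` maps each token ID to its count in the input text. Since `Bert` and `Tocken` need no training step, they can plug straight into the search pipeline described below. Here is a minimal sketch, assuming a pre-trained name such as `'Bert'` is accepted wherever a tokenizer name is expected:

```sql
-- A sketch of full-text search with the pre-trained Bert tokenizer.
-- Assumes 'Bert' is accepted as the tokenizer name in to_bm25query.
CREATE TABLE corpus (id SERIAL, text TEXT, embedding bm25vector);
INSERT INTO corpus (text) VALUES ('PostgreSQL is a powerful, open-source object-relational database system.');
UPDATE corpus SET embedding = tokenize(text, 'Bert'); -- no training step is needed
CREATE INDEX corpus_embedding_bm25 ON corpus USING bm25 (embedding bm25_ops);
SELECT id, text, embedding <&> to_bm25query('corpus_embedding_bm25', 'PostgreSQL', 'Bert') AS rank
    FROM corpus
    ORDER BY rank
    LIMIT 10;
```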

### Train on Your Data

The `Unicode` tokenizer is trained on your data during document tokenization. You can use it with or without a trigger:

- with trigger (convenient but slower; see the check after this list)

```sql
CREATE TABLE corpus (id SERIAL, text TEXT, embedding bm25vector);
SELECT create_unicode_tokenizer_and_trigger('test_token', 'corpus', 'text', 'embedding');
INSERT INTO corpus (text) VALUES ('PostgreSQL is a powerful, open-source object-relational database system.'); -- insert text into the table
CREATE INDEX corpus_embedding_bm25 ON corpus USING bm25 (embedding bm25_ops);
SELECT id, text, embedding <&> to_bm25query('corpus_embedding_bm25', 'PostgreSQL', 'test_token') AS rank
    FROM corpus
    ORDER BY rank
    LIMIT 10;
```
- without trigger (faster, but you need to call the `tokenize` function manually)

```sql
CREATE TABLE corpus (id SERIAL, text TEXT, embedding bm25vector);
INSERT INTO corpus (text) VALUES ('PostgreSQL is a powerful, open-source object-relational database system.'); -- insert text into the table
SELECT create_tokenizer('test_token', $$
tokenizer = 'unicode'
stopwords = 'nltk'
table = 'corpus'
column = 'text'
$$);
UPDATE corpus SET embedding = tokenize(text, 'test_token');
CREATE INDEX corpus_embedding_bm25 ON corpus USING bm25 (embedding bm25_ops);
SELECT id, text, embedding <&> to_bm25query('corpus_embedding_bm25', 'PostgreSQL', 'test_token') AS rank
    FROM corpus
    ORDER BY rank
    LIMIT 10;
```
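With the trigger variant, newly inserted rows are tokenized automatically, so the `embedding` column stays in sync with `text`. A quick check, as a sketch, assuming the trigger from the first variant above is installed:

```sql
-- The trigger fills in `embedding` on insert; no manual tokenize call is needed.
INSERT INTO corpus (text) VALUES ('BM25 ranks documents by term frequency and inverse document frequency.');
SELECT id, embedding IS NOT NULL AS has_embedding FROM corpus ORDER BY id;
```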

## Configuration

We use TOML to configure the tokenizer. You can specify the tokenizer type and the table/column to train on.

Here is what each field means:

| Field | Type | Description |
| --- | --- | --- |
| `tokenizer` | String | The tokenizer type (`bert`, `tocken`, or `unicode`). |
| `stopwords` | String | The stopword list used by the `unicode` tokenizer (`lucene`, `nltk`, or `iso`); defaults to `nltk`. |
| `table` | String | The table to train the `unicode` tokenizer on. |
| `column` | String | The column to train the `unicode` tokenizer on. |
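For example, a configuration that trains a `unicode` tokenizer on `corpus.text` with the `lucene` stopword list would look like this (a sketch built only from the fields documented above; the name `my_unicode` is arbitrary):

```sql
SELECT create_tokenizer('my_unicode', $$
tokenizer = 'unicode'
stopwords = 'lucene'
table = 'corpus'
column = 'text'
$$);
```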

## Note

- `tokenizer_name` is case-sensitive. Make sure to use the exact name when calling the `tokenize` function.
- `tokenizer_name` can only contain alphanumeric characters and underscores, and it must start with a letter.
- `tokenizer_name` is unique. You cannot create two tokenizers with the same name.
- The maximum token length in the `unicode` tokenizer is 2600. Tokens longer than this are split into multiple tokens. If you need to support longer tokens, feel free to open an issue.
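As a sketch of the naming rules, assuming the `test_token` tokenizer created in the Usage section exists:

```sql
-- Valid: alphanumerics and underscores, starting with a letter.
SELECT tokenize('hello world', 'test_token');
-- Names like '2my_token', 'my-token', or 'my token' are expected to be rejected.
```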

## Contribution

To create another tokenizer that is pre-trained on your data, follow the steps below:

1. Update `TOKENIZER_RESERVED_NAMES` and the `create_tokenizer`, `drop_tokenizer`, and `tokenize` functions in `token.rs`.
2. (Optional) Store the pre-trained data under the `tokenizer` directory.