Currently, we support the following tokenizers:
- `Bert`: the default uncased BERT tokenizer.
- `Tocken`: a Unicode tokenizer pre-trained on wiki-103-raw with `min_freq=10`.
- `Unicode`: a Unicode tokenizer that will be trained on your data.
`Bert` and `Tocken` are pre-trained tokenizers. You can use them directly by calling the `tokenize` function:
```sql
SELECT tokenize('A quick brown fox jumps over the lazy dog.', 'Bert'); -- or 'Tocken'
-- {2058:1, 2474:1, 2829:1, 3899:1, 4248:1, 4419:1, 5376:1, 5831:1}
```
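The output can be stored in a `bm25vector` column, just as the later examples do. Here is a minimal sketch; the `docs` table is only for illustration and not part of the extension:

```sql
-- Minimal sketch: store the output of a pre-trained tokenizer in a bm25vector column.
CREATE TABLE docs (id SERIAL, text TEXT, embedding bm25vector);
INSERT INTO docs (text, embedding)
VALUES (
    'A quick brown fox jumps over the lazy dog.',
    tokenize('A quick brown fox jumps over the lazy dog.', 'Bert')
);
```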
`Unicode` will be trained on your data during document tokenization. You can use it with or without a trigger:
- with trigger (convenient but slower)
```sql
CREATE TABLE corpus (id SERIAL, text TEXT, embedding bm25vector);
SELECT create_unicode_tokenizer_and_trigger('test_token', 'corpus', 'text', 'embedding');
INSERT INTO corpus (text) VALUES ('PostgreSQL is a powerful, open-source object-relational database system.'); -- insert text into the table
CREATE INDEX corpus_embedding_bm25 ON corpus USING bm25 (embedding bm25_ops);
SELECT id, text, embedding <&> to_bm25query('corpus_embedding_bm25', 'PostgreSQL', 'test_token') AS rank
FROM corpus
ORDER BY rank
LIMIT 10;
```
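With the trigger in place, rows inserted later are tokenized automatically, so no manual `tokenize` call is needed. A quick check, as a sketch reusing the table above:

```sql
-- The trigger created above fills `embedding` from `text` on INSERT.
INSERT INTO corpus (text) VALUES ('BM25 ranks documents by term frequency and document length.');
SELECT id, embedding IS NOT NULL AS has_embedding FROM corpus;
```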
- without trigger (faster, but you need to call the `tokenize` function manually)
```sql
CREATE TABLE corpus (id SERIAL, text TEXT, embedding bm25vector);
INSERT INTO corpus (text) VALUES ('PostgreSQL is a powerful, open-source object-relational database system.'); -- insert text into the table
SELECT create_tokenizer('test_token', $$
tokenizer = 'unicode'
stopwords = 'nltk'
table = 'corpus'
column = 'text'
$$);
UPDATE corpus SET embedding = tokenize(text, 'test_token');
CREATE INDEX corpus_embedding_bm25 ON corpus USING bm25 (embedding bm25_ops);
SELECT id, text, embedding <&> to_bm25query('corpus_embedding_bm25', 'PostgreSQL', 'test_token') AS rank
FROM corpus
ORDER BY rank
LIMIT 10;
```
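Without the trigger, the `embedding` column is not maintained automatically, so re-run `tokenize` for any rows added later. A minimal sketch continuing the example above:

```sql
-- Rows inserted after the initial UPDATE have no embedding yet,
-- so tokenize them explicitly before querying.
INSERT INTO corpus (text) VALUES ('BM25 is a ranking function used by search engines.');
UPDATE corpus SET embedding = tokenize(text, 'test_token') WHERE embedding IS NULL;
```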
We utilize TOML to configure the tokenizer. You can specify the tokenizer type and the table/column to train on.
Here is what each field means:
| Field | Type | Description |
| --- | --- | --- |
| `tokenizer` | String | The tokenizer type (`bert`, `tocken`, or `unicode`). |
| `stopwords` | String | The stopword list used by the Unicode tokenizer (`lucene`, `nltk`, or `iso`); the default is `nltk`. |
| `table` | String | The table name to train on for the Unicode tokenizer. |
| `column` | String | The column name to train on for the Unicode tokenizer. |
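For example, the without-trigger setup above uses the `unicode` type, which needs `table` and `column` for training. The following is a sketch based on the field table for selecting the pre-trained `bert` type instead; `bert_token` is an illustrative name, and we assume the pre-trained types need no `table`/`column`:

```sql
-- Sketch: configure a tokenizer that reuses the pre-trained BERT model.
-- `stopwords`, `table`, and `column` apply to the `unicode` type only.
SELECT create_tokenizer('bert_token', $$
tokenizer = 'bert'
$$);
```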
- `tokenizer_name` is case-sensitive. Make sure to use the exact name when calling the `tokenize` function (see the sketch after this list).
- `tokenizer_name` can only contain alphanumeric characters and underscores, and it must start with a letter.
- `tokenizer_name` is unique. You cannot create two tokenizers with the same name.
- The maximum length of a token in the Unicode tokenizer is 2600. If a token is longer than this, it will be split into multiple tokens. If you need to support longer tokens, feel free to submit an issue here.
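For instance, because names are case-sensitive, a differently cased name does not resolve to the tokenizer created above (a small sketch):

```sql
SELECT tokenize('PostgreSQL', 'test_token');    -- matches the tokenizer created above
-- SELECT tokenize('PostgreSQL', 'Test_Token'); -- would fail: no tokenizer with that exact name
```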
To create another tokenizer that is pre-trained on your data, you can follow the steps below: