
Full-text-search: add an option to use ICU tokenizer #1095

Merged 1 commit into master on Dec 1, 2020

Conversation

di72nn (Member) commented Oct 25, 2020

Probably fixes #1090.

The currently used FTS tokenizer (unicode61) doesn't know anything about CJK, so it doesn't split text in those languages into words.
I'm not sure about the quality, but the icu tokenizer seems to do a better job at this (to my understanding, unicode61 is still better for Latin-based languages, hence it remains the default).
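For context, the tokenizer is fixed when an FTS table is created, so the option boils down to choosing the tokenize argument at table-creation time. A minimal sketch, assuming a hypothetical articles_fts fts4 table (the app's actual table and column names differ):

-- Default: unicode61, which does not word-break CJK text
CREATE VIRTUAL TABLE articles_fts USING fts4(title, content, tokenize=unicode61);

-- With the new option enabled: ICU-based word breaking
CREATE VIRTUAL TABLE articles_fts_icu USING fts4(title, content, tokenize=icu);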

Here are some tests I ran on an emulator (Android 8.1), using the fts3tokenize virtual table to inspect each tokenizer's output:

adb shell
sqlite3

CREATE VIRTUAL TABLE ft3_tokenize_test_unicode USING fts3tokenize(unicode61);
CREATE VIRTUAL TABLE ft3_tokenize_test_icu USING fts3tokenize(icu);
CREATE VIRTUAL TABLE ft3_tokenize_test_icu_cn_simplified USING fts3tokenize(icu, zh_CN);
CREATE VIRTUAL TABLE ft3_tokenize_test_icu_cn_traditional USING fts3tokenize(icu, zh_TW);

SELECT token, start, end, position FROM ft3_tokenize_test_unicode WHERE input='为什么不支持中文 fts test';
SELECT token, start, end, position FROM ft3_tokenize_test_icu WHERE input='为什么不支持中文 fts test';
SELECT token, start, end, position FROM ft3_tokenize_test_icu_cn_simplified WHERE input='为什么不支持中文 fts test';
SELECT token, start, end, position FROM ft3_tokenize_test_icu_cn_traditional WHERE input='为什么不支持中文 fts test';

SELECT token, start, end, position FROM ft3_tokenize_test_unicode WHERE input='据台湾中时新闻网报道,一份最新民调今天(24日)出炉 fts test 2';
SELECT token, start, end, position FROM ft3_tokenize_test_icu WHERE input='据台湾中时新闻网报道,一份最新民调今天(24日)出炉 fts test 2';
SELECT token, start, end, position FROM ft3_tokenize_test_icu_cn_simplified WHERE input='据台湾中时新闻网报道,一份最新民调今天(24日)出炉 fts test 2';
SELECT token, start, end, position FROM ft3_tokenize_test_icu_cn_traditional WHERE input='据台湾中时新闻网报道,一份最新民调今天(24日)出炉 fts test 2';

icu, icu zh_CN, and icu zh_TW all produced the same result in this case.
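Note that the icu tokenizer is only available if the platform's SQLite build was compiled with ICU support; a quick check from the same sqlite3 shell (ENABLE_ICU should appear in the output if it was):

-- Lists the compile-time options of the SQLite build
PRAGMA compile_options;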

I also tested finding this article with the query 据台湾.
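For reference, the corresponding full-text query would look roughly like this (again assuming the hypothetical articles_fts table from above). The MATCH expression is tokenized with the same tokenizer as the table, which is why CJK queries only work against an icu-tokenized table:

-- Matches articles containing the tokens produced from '据台湾'
SELECT * FROM articles_fts WHERE articles_fts MATCH '据台湾';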

@di72nn di72nn added this to the 2.4.2 milestone Nov 25, 2020
@tcitworld tcitworld merged commit 3664c2c into master Dec 1, 2020
@tcitworld tcitworld deleted the fts_icu branch December 1, 2020 11:38
Development

Successfully merging this pull request may close these issues:

"The search cannot be completed when inputting Chinese"