
Full-text-search: add an option to use ICU tokenizer #1095

Merged 1 commit into master on Dec 1, 2020

Conversation

di72nn (Member) commented Oct 25, 2020

Probably fixes #1090.

The currently used FTS tokenizer (unicode61) doesn't know anything about CJK, so it doesn't split text in those languages into words.
I'm not sure about the quality, but the icu tokenizer seems to do a better job at this (to my understanding, unicode61 is still better for Latin-based languages, hence it remains the default).
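For context, the tokenizer is fixed when an FTS table is created, so the option boils down to choosing the tokenize argument at table-creation time. A minimal sketch, assuming a hypothetical articles_fts fts4 table (the app's actual table and column names differ):

-- Default: unicode61, which does not word-break CJK text
CREATE VIRTUAL TABLE articles_fts USING fts4(title, content, tokenize=unicode61);

-- With the new option enabled: ICU-based word breaking
CREATE VIRTUAL TABLE articles_fts_icu USING fts4(title, content, tokenize=icu);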

Here are some tests I ran on an emulator (Android 8.1), using the fts3tokenize virtual table to inspect each tokenizer's output:

adb shell
sqlite3

CREATE VIRTUAL TABLE ft3_tokenize_test_unicode USING fts3tokenize(unicode61);
CREATE VIRTUAL TABLE ft3_tokenize_test_icu USING fts3tokenize(icu);
CREATE VIRTUAL TABLE ft3_tokenize_test_icu_cn_simplified USING fts3tokenize(icu, zh_CN);
CREATE VIRTUAL TABLE ft3_tokenize_test_icu_cn_traditional USING fts3tokenize(icu, zh_TW);

SELECT token, start, end, position FROM ft3_tokenize_test_unicode WHERE input='为什么不支持中文 fts test';
SELECT token, start, end, position FROM ft3_tokenize_test_icu WHERE input='为什么不支持中文 fts test';
SELECT token, start, end, position FROM ft3_tokenize_test_icu_cn_simplified WHERE input='为什么不支持中文 fts test';
SELECT token, start, end, position FROM ft3_tokenize_test_icu_cn_traditional WHERE input='为什么不支持中文 fts test';

SELECT token, start, end, position FROM ft3_tokenize_test_unicode WHERE input='据台湾中时新闻网报道,一份最新民调今天(24日)出炉 fts test 2';
SELECT token, start, end, position FROM ft3_tokenize_test_icu WHERE input='据台湾中时新闻网报道,一份最新民调今天(24日)出炉 fts test 2';
SELECT token, start, end, position FROM ft3_tokenize_test_icu_cn_simplified WHERE input='据台湾中时新闻网报道,一份最新民调今天(24日)出炉 fts test 2';
SELECT token, start, end, position FROM ft3_tokenize_test_icu_cn_traditional WHERE input='据台湾中时新闻网报道,一份最新民调今天(24日)出炉 fts test 2';

icu, icu zh_CN, and icu zh_TW all produced the same result in this case.
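Note that the icu tokenizer is only available if the platform's SQLite build was compiled with ICU support; a quick check from the same sqlite3 shell (ENABLE_ICU should appear in the output if it was):

-- Lists the compile-time options of the SQLite build
PRAGMA compile_options;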

I also tested finding this article with the query 据台湾.
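For reference, the corresponding full-text query would look roughly like this (again assuming the hypothetical articles_fts table from above). The MATCH expression is tokenized with the same tokenizer as the table, which is why CJK queries only work against an icu-tokenized table:

-- Matches articles containing the tokens produced from '据台湾'
SELECT * FROM articles_fts WHERE articles_fts MATCH '据台湾';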

@di72nn di72nn added this to the 2.4.2 milestone Nov 25, 2020
@tcitworld tcitworld merged commit 3664c2c into master Dec 1, 2020
@tcitworld tcitworld deleted the fts_icu branch December 1, 2020 11:38
Development

Successfully merging this pull request may close these issues:

"The search cannot be completed when inputting Chinese"