add a chinese tokenizer #2008
Conversation
Can you provide a simple source-code installation document? I'd like to test this feature.
You can clone the repository and check out the branch for this PR. Remember to build in release mode.
Is this speed adjustable on the chinese-tokenize branch?
@yangshike Yes, when compiled in release mode.
When I run it, the console shows cluster mode starting. Why does cluster mode start? How can I start in stand-alone mode only?
// Reset the text of the reused token and advance its position counter.
self.token.text.clear();
self.token.position = self.token.position.wrapping_add(1);

// Chain the buffered last char (if any) with the remaining char stream.
let mut iter = self.last_char.take().into_iter().chain(&mut self.chars);
char_iter
I'm not sure what you mean by this
Ah, sorry. I meant that iter is not a great name; char_iter would be nicer.
You can just run it. As mentioned by @PSeitz, make sure you run in release mode.
@trinity-1686a Possibly a silly idea. Do you think we should do an is_ascii check on the whole string? If it is all ASCII, just run the regular tokenizer logic, and if not, run this one... Then maybe we could enable this tokenizer as the default if the perf diff is insignificant?
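For illustration, a minimal sketch of that composite idea, assuming hypothetical ascii_tokenize / cjk_tokenize helpers (made-up names, not the actual tantivy/quickwit tokenizer API):

```rust
// Sketch only: check the whole string once, then dispatch to a cheap
// ASCII path or to the per-char CJK path from this PR.
fn composite_tokenize(text: &str) -> Vec<String> {
    if text.is_ascii() {
        // Fast happy path: all-ASCII text goes through plain whitespace splitting.
        ascii_tokenize(text)
    } else {
        // Otherwise fall back to the CJK-aware tokenizer.
        cjk_tokenize(text)
    }
}

// Hypothetical stand-in for the regular tokenizer logic.
fn ascii_tokenize(text: &str) -> Vec<String> {
    text.split_ascii_whitespace().map(str::to_owned).collect()
}

// Hypothetical stand-in for the tokenizer added in this PR:
// one token per non-whitespace char.
fn cjk_tokenize(text: &str) -> Vec<String> {
    text.chars()
        .filter(|c| !c.is_whitespace())
        .map(|c| c.to_string())
        .collect()
}
```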
It would deserve a couple more unit tests and maybe nicer naming here and there, but great job overall.
Do whatever fixes you feel are good, and then you can merge.
curl "http://127.0.0.1:7280/api/v1/customer2/search/?query=234" when i run: Has No output and The console does not see the request log, I tested again. It is OK to use the command line, but not http This is my index file: doc_mapping: |
It has been solved. It works if the Chinese characters in the URL are percent-encoded, like: curl "http://127.0.0.1:7280/api/v1/customer2/search/?query=wechat_name:%E5%91%B5%E5%91%B5"
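For reference, %E5%91%B5%E5%91%B5 is just the UTF-8 bytes of 呵呵 percent-encoded. A toy sketch of that encoding, not what curl or quickwit actually uses:

```rust
// Toy percent-encoder: keep unreserved ASCII bytes as-is and escape
// everything else. "呵呵" encodes to "%E5%91%B5%E5%91%B5", matching the
// working curl command above.
fn percent_encode(input: &str) -> String {
    let mut out = String::new();
    for &b in input.as_bytes() {
        match b {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'_' | b'.' | b'~' => {
                out.push(b as char)
            }
            _ => out.push_str(&format!("%{:02X}", b)),
        }
    }
    out
}

fn main() {
    assert_eq!(percent_encode("呵呵"), "%E5%91%B5%E5%91%B5");
}
```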
@fulmicoton I don't know, it depends a lot on the performance of this tokenizer vs the default. I'll try to do some benchmarks.
add proptest
rename tokenizer to chinese_compatible
@trinity-1686a By "this tokenizer" you mean the composite one I was describing or the one in this PR? In the composite version, the is_ascii stuff should give us a fast happy path. Once we have identified that the string is ascii, we could have whitespace tokenizer implementaiton that operates on a (Note that the whitespace did not need to decode utf-8 to begin with but that's another story) |
@yangshike thank you for the tests and the follow-up. I am surprised your browser did not do the URL encoding directly!
The curl I used is not a browser.
Ah, that makes sense @yangshike! :)
@fulmicoton I mean both. There are many cases where we can get non-ASCII text that would work fine with the current default tokenizer (a single emoji in a document, …).
pure ascii: 20% slower

Maybe having a composite tokenizer can make sense to avoid a 20% hit on ASCII, but close to 10% on non-ASCII, non-Chinese text is a bit hefty in my opinion, and I'm not sure it can be made the default then. It's easy to get any of …
Description
This adds a simple tokenizer for CJK. Before, something like "你好世界" (hello world) would be a single token because it contains no whitespace. This means searching for "你好" would yield no result.
A more intelligent tokenizer would probably split it into two tokens (hello, world). This tokenizer simply splits at each char, creating 4 tokens. This is much faster at indexing, but requires using a phrase query to match a word written as two or more chars.
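A toy illustration of the difference in behaviour (not the PR's actual code):

```rust
fn main() {
    let text = "你好世界";

    // Whitespace-based tokenization: no whitespace, so one single token.
    let whitespace_tokens: Vec<&str> = text.split_whitespace().collect();
    assert_eq!(whitespace_tokens, vec!["你好世界"]);

    // Per-char tokenization as described above: four tokens, so "你好"
    // can be matched with a phrase query over two consecutive tokens.
    let char_tokens: Vec<String> = text.chars().map(|c| c.to_string()).collect();
    assert_eq!(char_tokens, vec!["你", "好", "世", "界"]);
}
```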
fix #1979
How was this PR tested?
Some tests were added for the tokenizer, plus a manual test: indexing the wiki-articles-10000 dataset with the new tokenizer on the body field, and searching for "毛藝" (name of a Chinese gymnast), "毛" (first half), "藝" (second half), and "藝毛" (wrong order):