add a chinese tokenizer #2008

Merged
merged 8 commits into from Sep 30, 2022

Conversation

trinity-1686a
Contributor

Description

This adds a simple tokenizer for CJK. Before, something like "你好世界" (hello world) would be a single token because it contains no whitespace. This means searching for "你好" would yield no result.

A more intelligent tokenizer would probably split this into two tokens (hello, world). This tokenizer simply splits at each char, creating 4 tokens. This is much faster at indexing, but requires a phrase query to match a word written as two or more chars.
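A minimal sketch of the per-char splitting described above may help. It is not the code from this PR: the function names and the CJK ranges in `is_cjk` are assumptions made for illustration, and the real tokenizer plugs into the tokenizer traits of the underlying search library rather than returning a `Vec`.

```rust
/// Rough CJK ideograph check. The ranges are an assumption for this sketch,
/// not necessarily the ones used by the PR.
fn is_cjk(c: char) -> bool {
    matches!(
        c as u32,
        0x4E00..=0x9FFF        // CJK Unified Ideographs
            | 0x3400..=0x4DBF  // Extension A
            | 0x20000..=0x2A6DF // Extension B
            | 0xF900..=0xFAFF  // Compatibility Ideographs
    )
}

/// Each CJK char becomes its own token; runs of other alphanumeric chars
/// are grouped into one token; everything else acts as a separator.
fn tokenize(text: &str) -> Vec<String> {
    let mut tokens = Vec::new();
    let mut current = String::new();
    for c in text.chars() {
        if is_cjk(c) {
            // Flush any pending non-CJK run, then emit the char alone.
            if !current.is_empty() {
                tokens.push(std::mem::take(&mut current));
            }
            tokens.push(c.to_string());
        } else if c.is_alphanumeric() {
            current.push(c);
        } else if !current.is_empty() {
            tokens.push(std::mem::take(&mut current));
        }
    }
    if !current.is_empty() {
        tokens.push(current);
    }
    tokens
}
```

With this sketch, `tokenize("你好世界")` produces four one-char tokens, while `tokenize("hello world")` still produces two word tokens.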

fix #1979

How was this PR tested?

Some tests were added for the tokenizer, plus a manual test: indexing the wiki-articles-10000 dataset with the new tokenizer on the body field, then searching for "毛藝" (the name of a Chinese gymnast), "毛" (first half), "藝" (second half) and "藝毛" (wrong order):

  • "毛藝": yields a doc both before and after this change
  • "毛": yields a doc only after
  • "藝": yields a doc only after
  • "藝毛": yields nothing
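The results above follow from how a phrase query works on per-char tokens: the query tokens must appear at consecutive positions, in the same order. The sketch below illustrates that rule on plain slices; `phrase_match` is a hypothetical helper, not how the search engine evaluates phrase queries internally (it uses recorded token positions in the index).

```rust
/// Returns true if `phrase` occurs in `doc` as consecutive tokens, in order.
fn phrase_match(doc: &[&str], phrase: &[&str]) -> bool {
    // `windows(0)` would panic, and a phrase longer than the doc cannot match.
    if phrase.is_empty() || phrase.len() > doc.len() {
        return false;
    }
    doc.windows(phrase.len()).any(|window| window == phrase)
}
```

For a document tokenized as `["毛", "藝"]`, the phrases `["毛", "藝"]`, `["毛"]` and `["藝"]` match, while the reversed phrase `["藝", "毛"]` does not.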

@yangshike
Contributor

Could you provide simple instructions for building from source? I would like to test this feature.

@trinity-1686a
Contributor Author

You can clone the repository and check out the branch for this PR, cd into quickwit/quickwit, and run cargo build (if you don't have cargo installed, look up how to install Rust on your operating system). You'll find the quickwit binary in target/debug/quickwit. With just that, you won't have the web interface, but you'll have the CLI and the API, which should be enough to test.

Remember to use chinese as your tokenizer instead of default or en_stem for the fields where you want Chinese text to be indexed.

@yangshike
Contributor

yangshike commented Sep 26, 2022

Is the indexing speed adjustable on the chinese-tokenizer branch?
I ran file-based index creation on it (60 million lines in JSON format).
Throughput is less than 1 MB/s:
Num docs 2621360 Parse errs 0 PublSplits 2 Input size 143MB Thrghput 0.88MB/s Time 00:02:47
Num docs 2639792 Parse errs 0 PublSplits 2 Input size 144MB Thrghput 0.88MB/s Time 00:02:48
Num docs 2658204 Parse errs 0 PublSplits 2 Input size 145MB Thrghput 0.78MB/s Time 00:02:49
Num docs 2676568 Parse errs 0 PublSplits 2 Input size 146MB Thrghput 1.00MB/s Time 00:02:50
Num docs 2694614 Parse errs 0 PublSplits 2 Input size 147MB Thrghput 1.00MB/s Time 00:02:51
Num docs 2712715 Parse errs 0 PublSplits 2 Input size 148MB Thrghput 0.96MB/s Time 00:02:52
Num docs 2712715 Parse errs 0 PublSplits 2 Input size 148MB Thrghput 0.75MB/s Time 00:02:53
Num docs 2731041 Parse errs 0 PublSplits 2 Input size 149MB Thrghput 0.75MB/s Time 00:02:54
Num docs 2749538 Parse errs 0 PublSplits 2 Input size 150MB Thrghput 0.75MB/s Time 00:02:55
Num docs 2767715 Parse errs 0 PublSplits 2 Input size 151MB Thrghput 0.78MB/s Time 00:02:56

@PSeitz
Contributor

PSeitz commented Sep 26, 2022

@yangshike Yes, when compiled in release mode: run cargo build --release and use target/release/quickwit.

@yangshike
Contributor

When I run:
./quickwit run --service searcher

the console shows:
2022-09-26T09:21:37.129Z INFO quickwit: version="0.3.1-nightly" commit="unknown"
2022-09-26T09:21:37.130Z INFO quickwit_config::config: Using listen address as advertise address. advertise_address=127.0.0.1
2022-09-26T09:21:37.130Z INFO quickwit_config::config: Using listen address as advertise address. advertise_address=127.0.0.1
2022-09-26T09:21:37.131Z WARN quickwit_config::config: Cluster ID is not set, falling back to default value. cluster_id=quickwit-default-cluster
2022-09-26T09:21:37.131Z WARN quickwit_config::config: Peer seed list is empty.
2022-09-26T09:21:37.131Z INFO quickwit_cli: Loaded Quickwit config. config_uri=file:///root/quickwit-v0.3.1/config/quickwit.yaml config=QuickwitConfig { version: 0, cluster_id: "quickwit-default-cluster", node_id: "node-old-DLbk", rest_listen_addr: 127.0.0.1:7280, gossip_listen_addr: 127.0.0.1:7280, grpc_listen_addr: 127.0.0.1:7281, gossip_advertise_addr: 127.0.0.1:7280, grpc_advertise_addr: 127.0.0.1:7281, enabled_services: {Indexer, Metastore, Searcher, Janitor}, peer_seeds: [], metastore_uri: Uri { uri: "file:///root/quickwit-v0.3.1/qwdata/indexes#polling_interval=30s" }, default_index_root_uri: Uri { uri: "file:///root/quickwit-v0.3.1/qwdata/indexes" }, data_dir_path: "./qwdata", indexer_config: IndexerConfig { split_store_max_num_bytes: Byte(100000000000), split_store_max_num_splits: 1000 }, searcher_config: SearcherConfig { fast_field_cache_capacity: Byte(1000000000), split_footer_cache_capacity: Byte(500000000), max_num_concurrent_split_searches: 100, max_num_concurrent_split_streams: 100 } }
2022-09-26T09:21:37.131Z WARN quickwit_config::config: Cluster ID is not set, falling back to default value. cluster_id=quickwit-default-cluster
2022-09-26T09:21:37.131Z WARN quickwit_config::config: Peer seed list is empty.
2022-09-26T09:21:37.131Z INFO quickwit_cluster::cluster: Joining cluster. cluster_id=quickwit-default-cluster node_id=node-old-DLbk grpc_public_addr=127.0.0.1:7281 available_services={Searcher} gossip_listen_addr=127.0.0.1:7280 gossip_public_addr=127.0.0.1:7280 peer_seed_addrs=
2022-09-26T09:21:47.133Z ERROR quickwit_serve: No metastore service found among cluster members, stopping server.
Command failed: Failed to start server: no metastore service was found among cluster members. Try running Quickwit with additional metastore service quickwit run --service metastore.
[root@iZ8vbb3fnrdi22x6avhwmhZ quickwit-v0.3.1]# ./quickwit_v2 run --service searcher --config=./config/quickwit.yaml
2022-09-26T09:22:59.681Z INFO quickwit: version="0.3.1-nightly" commit="unknown"
2022-09-26T09:22:59.682Z INFO quickwit_config::config: Using listen address as advertise address. advertise_address=127.0.0.1
2022-09-26T09:22:59.682Z INFO quickwit_config::config: Using listen address as advertise address. advertise_address=127.0.0.1
2022-09-26T09:22:59.683Z WARN quickwit_config::config: Cluster ID is not set, falling back to default value. cluster_id=quickwit-default-cluster
2022-09-26T09:22:59.683Z WARN quickwit_config::config: Peer seed list is empty.
2022-09-26T09:22:59.683Z INFO quickwit_cli: Loaded Quickwit config. config_uri=file:///root/quickwit-v0.3.1/config/quickwit.yaml config=QuickwitConfig { version: 0, cluster_id: "quickwit-default-cluster", node_id: "node-young-3tP6", rest_listen_addr: 127.0.0.1:7280, gossip_listen_addr: 127.0.0.1:7280, grpc_listen_addr: 127.0.0.1:7281, gossip_advertise_addr: 127.0.0.1:7280, grpc_advertise_addr: 127.0.0.1:7281, enabled_services: {Metastore, Indexer, Janitor, Searcher}, peer_seeds: [], metastore_uri: Uri { uri: "file:///root/quickwit-v0.3.1/qwdata/indexes#polling_interval=30s" }, default_index_root_uri: Uri { uri: "file:///root/quickwit-v0.3.1/qwdata/indexes" }, data_dir_path: "./qwdata", indexer_config: IndexerConfig { split_store_max_num_bytes: Byte(100000000000), split_store_max_num_splits: 1000 }, searcher_config: SearcherConfig { fast_field_cache_capacity: Byte(1000000000), split_footer_cache_capacity: Byte(500000000), max_num_concurrent_split_searches: 100, max_num_concurrent_split_streams: 100 } }
2022-09-26T09:22:59.683Z WARN quickwit_config::config: Cluster ID is not set, falling back to default value. cluster_id=quickwit-default-cluster
2022-09-26T09:22:59.683Z WARN quickwit_config::config: Peer seed list is empty.
2022-09-26T09:22:59.683Z INFO quickwit_cluster::cluster: Joining cluster. cluster_id=quickwit-default-cluster node_id=node-young-3tP6 grpc_public_addr=127.0.0.1:7281 available_services={Searcher} gossip_listen_addr=127.0.0.1:7280 gossip_public_addr=127.0.0.1:7280 peer_seed_addrs=
2022-09-26T09:23:09.685Z ERROR quickwit_serve: No metastore service found among cluster members, stopping server.
Command failed: Failed to start server: no metastore service was found among cluster members. Try running Quickwit with additional metastore service quickwit run --service metastore.

Why does it start in cluster mode? How do I start it in standalone mode only?

self.token.text.clear();
self.token.position = self.token.position.wrapping_add(1);

let mut iter = self.last_char.take().into_iter().chain(&mut self.chars);
Contributor

char_iter

Contributor Author

I'm not sure what you mean by this

Contributor

Ah sorry, I meant that iter is not a great name.
char_iter would be nicer.
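For readers unfamiliar with the pattern in the reviewed line, here is a sketch of how chaining a saved lookahead char in front of the remaining chars can work, using the suggested char_iter name. The `CharGrouper` type and its grouping rules are assumptions for illustration only; they are not the PR's actual token stream.

```rust
use std::str::Chars;

/// Hypothetical token producer: groups ASCII alphanumerics,
/// emits each non-ASCII char alone, and skips other ASCII chars.
struct CharGrouper<'a> {
    chars: Chars<'a>,
    /// Char read by the previous call that belongs to the next token.
    last_char: Option<char>,
}

impl<'a> CharGrouper<'a> {
    fn new(text: &'a str) -> Self {
        CharGrouper { chars: text.chars(), last_char: None }
    }

    fn next_token(&mut self) -> Option<String> {
        let mut token = String::new();
        // Replay the saved char (if any) before reading fresh ones.
        let char_iter = self.last_char.take().into_iter().chain(&mut self.chars);
        for c in char_iter {
            if c.is_ascii_alphanumeric() {
                token.push(c);
            } else if !token.is_empty() {
                // `c` ends the current run; keep it for the next call.
                self.last_char = Some(c);
                break;
            } else if !c.is_ascii() {
                // A non-ASCII char forms a token on its own.
                token.push(c);
                break;
            }
            // Otherwise: ASCII whitespace or punctuation before a token, skip it.
        }
        if token.is_empty() { None } else { Some(token) }
    }
}
```

Calling `next_token` repeatedly on `"ab你好 c1"` yields `ab`, `你`, `好`, `c1`, then `None`: the char that terminated `ab` is not lost, because it is stored in `last_char` and replayed first on the next call.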

@fulmicoton
Contributor

fulmicoton commented Sep 26, 2022

You can just run ./quickwit run. It will start the metastore service, the indexer and the searcher in standalone mode.

As mentioned by @PSeitz, make sure you run in --release mode.

@fulmicoton
Contributor

@trinity-1686a Possibly a silly idea. Do you think we should do an "is_ascii" check on the whole string? If it is all ASCII, just run the regular tokenizer logic, and if not, run this one. Then maybe we could enable this tokenizer as the default if the perf difference is insignificant?
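The composite idea can be sketched as a dispatch on `str::is_ascii`. Everything below is an assumption for illustration (function names, the single CJK range, the splitting rules); it only shows the shape of the fast path, not an implementation proposed in this thread.

```rust
/// Rough check covering only the main CJK Unified Ideographs block (assumption).
fn is_cjk(c: char) -> bool {
    ('\u{4E00}'..='\u{9FFF}').contains(&c)
}

/// Fast path: pure ASCII input, split on anything that is not alphanumeric.
fn tokenize_ascii(text: &str) -> Vec<&str> {
    text.split(|c: char| !c.is_ascii_alphanumeric())
        .filter(|t| !t.is_empty())
        .collect()
}

/// Slow path: each CJK char is its own token, other alphanumeric runs are grouped.
fn tokenize_cjk_aware(text: &str) -> Vec<&str> {
    let mut tokens = Vec::new();
    let mut start: Option<usize> = None; // byte offset of the current non-CJK run
    for (i, c) in text.char_indices() {
        let cjk = is_cjk(c);
        if cjk || !c.is_alphanumeric() {
            if let Some(s) = start.take() {
                tokens.push(&text[s..i]);
            }
            if cjk {
                tokens.push(&text[i..i + c.len_utf8()]);
            }
        } else if start.is_none() {
            start = Some(i);
        }
    }
    if let Some(s) = start {
        tokens.push(&text[s..]);
    }
    tokens
}

/// Composite: one cheap scan decides which tokenizer runs.
fn tokenize_composite(text: &str) -> Vec<&str> {
    if text.is_ascii() {
        tokenize_ascii(text)
    } else {
        tokenize_cjk_aware(text)
    }
}
```

Note that any single non-ASCII char (an accented letter, an emoji) sends the whole string down the slower path, which is the concern raised later in the thread.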

Contributor

@fulmicoton fulmicoton left a comment

It would deserve a couple more unit tests and maybe nicer naming here and there, but great job overall.

Make whatever fixes you see fit, and then you can merge.

@yangshike
Contributor

yangshike commented Sep 26, 2022

curl "http://127.0.0.1:7280/api/v1/customer2/search/?query=234"
output:
{
"document": {
"id": 452873747743539869,
"wechat_name": "积善得福234"
}
},
{
"document": {
"id": 573296179717722190,
"wechat_name": "234"
}
},
{
"document": {
"id": 486971530163212350,
"wechat_name": "234"
}
}
and logs in console:
2022-09-26T10:18:00.558Z INFO quickwit_serve::search_api::rest_handler: search index_id=customer2 request=SearchRequestQueryString { query: "234", aggs: None, search_fields: None, snippet_fields: None, start_timestamp: None, end_timestamp: None, max_hits: 20, start_offset: 0, format: PrettyJson, sort_by_field: None }
2022-09-26T10:18:00.558Z INFO leaf_search: quickwit_search::service: leaf_search index="customer2" splits=[SplitIdAndFooterOffsets { split_id: "01GDWMJPAMX98X2HMCDY53GT6V", split_footer_start: 370245642, split_footer_end: 370433502 }, SplitIdAndFooterOffsets { split_id: "01GDWMKFHZ81ZQHB02K3A0BM9D", split_footer_start: 371469783, split_footer_end: 371657749 }, SplitIdAndFooterOffsets { split_id: "01GDWMM8GD09EHAFFWZAY7JZPM", split_footer_start: 369785641, split_footer_end: 369972703 }, SplitIdAndFooterOffsets { split_id: "01GDWMN402HK4C39XD80H9VVCF", split_footer_start: 368958150, split_footer_end: 369144758 }, SplitIdAndFooterOffsets { split_id: "01GDWMP073VXVCZWJR7VD8BYVN", split_footer_start: 368857240, split_footer_end: 369044696 }, SplitIdAndFooterOffsets { split_id: "01GDWMPXZP2JD99PS8MNRK38E9", split_footer_start: 361718851, split_footer_end: 361895041 }, SplitIdAndFooterOffsets { split_id: "01GDWMQWFMQPF0FBRQQD0AHGAN", split_footer_start: 9672130, split_footer_end: 9681047 }]

When I run:
curl "http://127.0.0.1:7280/api/v1/customer2/search/?query=积善得福234"
or
curl "http://127.0.0.1:7280/api/v1/customer2/search/?query=积善"
or
curl "http://127.0.0.1:7280/api/v1/customer2/search/?query=积善得福"

there is no output, and no request log appears in the console.

I tested again: it works from the command line, but not over HTTP.
./quickwit index search --index customer2 --query "wechat_name:积善得福234"
{
"num_hits": 1,
"hits": [
{
"document": {
"id": 452873747743539869,
"wechat_name": "积善得福234"
}
}
],
"elapsed_time_micros": 37319,
"errors": []
}

This is my index file:
version: 0
index_id: customer2

doc_mapping:
  field_mappings:
    - name: id
      type: i64
      fast: false
    - name: wechat_name
      type: text
      tokenizer: chinese
      record: position
      stored: true
search_settings:
  default_search_fields: [id, wechat_name]

@yangshike
Contributor

yangshike commented Sep 26, 2022

This has been solved. It works once the Chinese characters in the URL are percent-encoded.

For example: curl "http://127.0.0.1:7280/api/v1/customer2/search/?query=wechat_name:%E5%91%B5%E5%91%B5"
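The encoding in that URL is plain percent-encoding of the UTF-8 bytes of the query. A minimal sketch, assuming a hypothetical `percent_encode` helper that escapes every byte outside the RFC 3986 unreserved set (so it would also escape the `:` that the curl command above leaves as-is):

```rust
/// Percent-encodes every byte that is not an RFC 3986 unreserved character.
fn percent_encode(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for &b in input.as_bytes() {
        match b {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' => {
                out.push(b as char)
            }
            // Each remaining byte becomes %XX with uppercase hex digits.
            _ => out.push_str(&format!("%{:02X}", b)),
        }
    }
    out
}
```

Encoding "呵呵" this way gives `%E5%91%B5%E5%91%B5`, the value used in the working request.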

@trinity-1686a
Contributor Author

trinity-1686a commented Sep 26, 2022

Possibly a silly idea. Do you think we should do a "is_ascii" on the whole string.

@fulmicoton I don't know, it depends a lot on the performance of this tokenizer vs the default. I'll try to do some benchmarks.

add proptest
rename tokenizer to chinese_compatible
@fulmicoton
Contributor

@fulmicoton I don't know, it depends a lot on the performance of this tokenizer vs the default. I'll try to do some benchmarks.

@trinity-1686a By "this tokenizer" you mean the composite one I was describing or the one in this PR?

In the composite version, the is_ascii check should give us a fast happy path.
Actually, the current whitespace tokenizer could also benefit from this happy path.

Once we have identified that the string is ASCII, we could have a whitespace tokenizer implementation that operates on a &[u8].
Right now we decode UTF-8; skipping that step should actually be faster.

(Note that the whitespace tokenizer did not need to decode UTF-8 to begin with, but that's another story.)
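To make the byte-level idea concrete, here is a sketch of a whitespace tokenizer that only inspects bytes once the input is known to be ASCII. The function name and the `Option` return are assumptions for illustration; slicing at arbitrary byte offsets is valid here only because every byte of an ASCII string is a char boundary.

```rust
/// Splits on ASCII whitespace by scanning bytes, without decoding UTF-8.
/// Returns None for non-ASCII input so that the caller can fall back
/// to a char-aware tokenizer.
fn tokenize_ascii_whitespace(text: &str) -> Option<Vec<&str>> {
    if !text.is_ascii() {
        return None;
    }
    let mut tokens = Vec::new();
    let mut start: Option<usize> = None; // byte offset where the current token began
    for (i, b) in text.as_bytes().iter().enumerate() {
        if b.is_ascii_whitespace() {
            if let Some(s) = start.take() {
                tokens.push(&text[s..i]);
            }
        } else if start.is_none() {
            start = Some(i);
        }
    }
    if let Some(s) = start {
        tokens.push(&text[s..]);
    }
    Some(tokens)
}
```

Whether this is measurably faster than the char-based version would need a benchmark; the sketch only shows that no UTF-8 decoding is needed on this path.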

@fulmicoton
Contributor

@yangshike Thank you for the tests and the follow-up. I am surprised your browser did not do the URL encoding directly!

@yangshike
Contributor

@yangshike Thank you for the tests and the follow-up. I am surprised your browser did not do the URL encoding directly!

I used curl, which is not a browser.

@fulmicoton
Contributor

Ah that makes sense @yangshike ! :)

@quickwit-oss quickwit-oss deleted a comment from yangshike Sep 28, 2022
@trinity-1686a
Contributor Author

trinity-1686a commented Sep 28, 2022

@trinity-1686a By "this tokenizer" you mean the composite one I was describing or the one in this PR?
In the composite version, the is_ascii stuff should give us a fast happy path.

@fulmicoton I mean both. There are many cases where we can get non-ASCII text that would work fine with the current default tokenizer (a single emoji in a document, é or any other accented letter in languages that have them, ...). I don't expect the tokenizer from this PR to be substantially slower than the default (but until tested, that's speculation). If it happens to be, such datasets would be impacted.

@fulmicoton fulmicoton enabled auto-merge (squash) September 30, 2022 03:40
@fulmicoton fulmicoton merged commit 1d72a77 into main Sep 30, 2022
@fulmicoton fulmicoton deleted the chinese-tokenizer branch September 30, 2022 07:21
@trinity-1686a
Contributor Author

trinity-1686a commented Sep 30, 2022

@fulmicoton

bench                         min           avg           max
default tokenizer + ascii     260.51 MiB/s  261.66 MiB/s  262.84 MiB/s
chinese tokenizer + ascii     209.05 MiB/s  209.26 MiB/s  209.51 MiB/s
default tokenizer + chinese   91.950 MiB/s  92.087 MiB/s  92.249 MiB/s
chinese tokenizer + chinese   66.451 MiB/s  66.952 MiB/s  67.276 MiB/s
default tokenizer + french    162.68 MiB/s  162.77 MiB/s  162.84 MiB/s
chinese tokenizer + french    146.14 MiB/s  146.41 MiB/s  146.66 MiB/s

pure ascii: 20% slower
pure chinese: 28% slower (but default tokenizer returns incorrect result)
french (mostly ascii, with a few utf8 chars lying around): 9.8% slower
(test strings were the first few lines of "full text search" on Wikipedia)
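As a rough cross-check, the relative slowdown can be recomputed from the average throughputs in the table. The comment does not say which column or formula produced the quoted percentages, so the `slowdown_percent` helper below is an assumption; it gives values close to, but not exactly matching, the quoted ones.

```rust
/// Relative throughput loss of `candidate` versus `baseline`, in percent.
/// Both arguments are throughputs in the same unit (here MiB/s).
fn slowdown_percent(baseline: f64, candidate: f64) -> f64 {
    (1.0 - candidate / baseline) * 100.0
}
```

Applied to the avg column, this gives about 20.0% for ASCII (261.66 vs 209.26), about 27.3% for Chinese (92.087 vs 66.952) and about 10.1% for French (162.77 vs 146.41).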

Maybe having a composite tokenizer can make sense to avoid a 20% hit on ASCII, but a hit of close to 10% on non-ASCII, non-Chinese text is a bit hefty in my opinion, and I'm not sure it can be made the default then. It's easy to get any of ßé¿😀 in most datasets.
But maybe there is a bigger bottleneck after tokenization, and being a bit slower has no real impact on the whole pipeline? In which case, having that composite tokenizer as the default would be nice.

Successfully merging this pull request may close these issues.

Chinese search is not supported?
5 participants