[Feature]: Support Chinese Tokenization for BM25 using Jieba #610

### What feature would you like to request?

Currently, fastembed supports BM25 sparse embeddings, but it has no dedicated support for Chinese text. Whitespace tokenization is ineffective for Chinese because the language does not put spaces between words, so an entire clause can end up as a single token. This leads to poor retrieval performance when fastembed's BM25 implementation is used on Chinese datasets. It would help to support Chinese word segmentation for BM25, for example with Jieba.


### Is there any additional information you would like to provide?

_No response_
