Search does not support non-English languages #1081

taills · 2019-10-29T03:50:46Z

Unable to search for Chinese Keywords

weihanglo · 2019-10-29T16:23:35Z

mdBook uses elasticlunr.js for offline search under the hood. According to weixsong/elasticlunr.js#53, there seems to be no plan to support searching in other languages.

hysencn · 2020-03-27T10:45:34Z

It is a good tool. I love it.

More than 1 million of us have the same issue. Could you help with it?
We need Chinese search support.

Or could we search Chinese with Google?
How can we do that?

futurist · 2020-05-10T02:26:31Z

For Chinese-like languages, local search may not be suitable. Instead, one could use a service like Algolia, which is also used in VuePress.

cxumol · 2020-08-16T06:08:21Z

Yes, it is highly possible to add searching in Chinese characters.

elasticlunr is the search engine used by mdBook.
Go to the elasticlunr official documentation and read the section Other Languages. With just 3 more lines of code, elasticlunr can be used with other languages.
Chinese language support for lunr-languages has been submitted as a PR but is not yet merged.
Alternatively, suggested by comments on Need Chinese support MihaiValentin/lunr-languages#32, Japanese language can be used as a workaround. That's because this line has covered 一-龠, which is a usual range including most Chinese characters on the Unicode table.
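The range claim above can be checked directly. The following is only an illustrative sketch: `HAN_RANGE` and `coveredByRange` are hypothetical names, and the character class is built from the range quoted in this comment, not copied from the lunr-languages source.

```javascript
// Character class from the comment above: 一 (U+4E00) through 龠 (U+9FA0).
const HAN_RANGE = /^[\u4E00-\u9FA0]+$/;

// Returns true when every character of `text` falls inside that range.
function coveredByRange(text) {
  return HAN_RANGE.test(text);
}

console.log(coveredByRange('搜索'));    // true: a common Chinese word
console.log(coveredByRange('search'));  // false: Latin text is outside the range
console.log(coveredByRange('\u9FA5'));  // false: U+9FA5 lies just past U+9FA0
```

The last line shows why the range covers "most" rather than all Chinese characters: code points beyond U+9FA0 are not matched.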

bigdogs · 2022-04-09T03:29:02Z

Any new progress for this issue?

Sciroccogti · 2022-04-15T10:13:12Z

Any new progress for this issue?

There is a PR, #1496, working on it, but it needs help.

wc7086 · 2023-05-14T11:05:52Z

Replacing elasticlunr.js with https://github.com/ajitid/fzf-for-js may allow this issue to be resolved.

silence-coding · 2023-05-22T01:52:10Z

Looking forward to new progress on this MR

switefaster · 2023-07-05T07:16:54Z

@ehuss Would you please tell me if this feature would be accepted? I don't think I'm able to find out any more issues by prototyping.

In case you're too busy to read the whole comment, I will highlight some key issues for you. All modifications are feature-gated.

- I intend to import lunr-languages's Chinese extension and a WebAssembly module as a segmenter, which leads to:
  - extra static file dependencies: not only several .js files but also a .wasm file
  - usage of ES6 modules and async
- I have created a custom Language trait implementation, which is somewhat unrelated to mdBook itself.
- I want to let users include a custom dictionary. My plan is to add a subsection to book.toml such as [output.html.zh] with a field additional-dict, just like additional-js. I don't know if you will be comfortable with this.

All mentioned modifications except the custom dictionary are available to check in my fork. I would appreciate it a lot if you could tell me about your attitudes toward this feature and/or the issues I listed.

Importing lunr.zh.js (slightly modified to make it compatible with elasticlunr) and the relevant extensions does not work, because mdBook uses a search index pre-generated by elasticlunr-rs when building the book. The correct way to solve this is PR #1496. However, besides some flaws pointed out by the core maintainers, the original PR does not use a preferable solution; for example, it does not use an appropriate segmenter. The PR also no longer seems to be updated. I've figured out a possibly more elegant solution, namely to use either Intl.Segmenter or jieba-wasm as a Chinese segmenter. I'd love to work on this issue, but as per the Contribution Guide, I'm not sure whether this issue is getting any attention from the maintainers, and I won't bother making a pull request if it is not.

Apart from all these, there are still some details we need to discuss:

- Whether to use Intl.Segmenter or jieba-wasm. I prefer the latter, since Intl.Segmenter is not supported by Firefox while WebAssembly is supported by almost all browsers, and using jieba ensures consistency between the generated index and the segments produced in the browser. However, jieba-wasm requires at least two extra files, jieba_rs_wasm.js and jieba_rs_wasm_bg.wasm, and I'm not sure the maintainers will be happy with that even if we can control them with feature flags.
- elasticlunr-rs's Chinese support is incomplete: its stop word filter is inconsistent with the one from lunr-languages, and there is no sign of this being fixed. We can implement the Language trait ourselves in mdBook, but again I'm not sure this is appropriate, as it is somewhat unrelated to mdBook itself. If the maintainers prefer not to, we will have to make a PR to elasticlunr-rs first and wait for it to be merged.
- The search results are somewhat odd (showing results that apparently do not match well) because the segmenter splits some idiom-like phrases into individual words, e.g. "换而言之" -> ["换", "而言", "之"]; this also happens with uncommon terms. This could be solved by allowing users to add a custom dictionary. Another option would be to use no segmenter at all, since most users would just be searching for keywords, in which case matching the whole term is more reasonable, and then none of the points above would need to be considered. However, we should not consider using no segmenter, because elasticlunr depends heavily on a tokenizer to work; otherwise we would have to build our own searcher or switch to another one, which I don't consider a good trade-off.
- Search results in the result list are not highlighted. To be exact, it seems that only matches preceded by whitespace characters (space, tab, \n, etc.) are highlighted. My first guess was that something in searcher.js needed to be modified; I have since solved this by changing searcher.js, in particular the makeTeaser() function.
- ...and more, if any
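The choice between Intl.Segmenter and a bundled segmenter can be sketched as a feature check with a fallback. This is a minimal illustration, not the code from the fork: `segmentWords` and `perCharacter` are hypothetical names, and the per-character fallback merely stands in for a jieba-style segmenter.

```javascript
// Split text into candidate search tokens.
// SegmenterCtor defaults to the built-in Intl.Segmenter when the runtime has one;
// passing null forces the fallback path.
function segmentWords(text, fallback, SegmenterCtor) {
  if (SegmenterCtor === undefined) {
    SegmenterCtor = (typeof Intl !== 'undefined' && Intl.Segmenter) || null;
  }
  if (SegmenterCtor) {
    const segmenter = new SegmenterCtor('zh', { granularity: 'word' });
    return Array.from(segmenter.segment(text))
      .filter((part) => part.isWordLike) // drop punctuation and whitespace
      .map((part) => part.segment);
  }
  return fallback(text);
}

// Hypothetical stand-in for a real segmenter: one token per character.
const perCharacter = (text) => Array.from(text);

// Built-in segmenter if available; the exact split depends on the runtime's data.
console.log(segmentWords('换而言之', perCharacter));
// Forced fallback: one token per character.
console.log(segmentWords('换而言之', perCharacter, null));
```

Because the built-in segmenter's output can differ between runtimes, the index built at compile time and the query segmented in the browser may disagree, which is the consistency concern raised in the first bullet.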

I'll be keeping an eye on this comment to see if anyone is interested.

Update

I created a fork as a proof of concept and found some problems that I had not noticed when I posted this comment. jieba-wasm somehow requires an async mechanism to work. As a consequence, I was forced to change the loading method of searcher.js from regular CJS to an import() call in index.hbs and lunr.zh.js. Eventually it worked well, and ECMAScript modules are well supported, so I don't think that's a big issue, but I note it here because mdBook has not done this anywhere before.

Anyone willing to test it may build the forked repository with the zh feature and use the resulting build as usual. It produces the expected result on my machine.

Update 2

Listed some more problems.

chens08 · 2024-01-13T23:22:14Z

How do I use it? Can you give an example of the book.toml? I forked your
project and built it, but it doesn't work.

miaomiao1992 · 2024-03-14T11:37:24Z

Any new progress on this issue?

aneteanetes · 2024-04-01T17:54:21Z

just 3 more lines of code

It was a lie.

And kind of a long journey...

But anyway, here I am, with my instructions for adding non-English search at little cost:

1. First of all, you need to add lunr.stemmer.support.js and lunr.YOURLANG.js. You can do this in several ways:
   1.1 Create head.hbs in the theme folder and add an HTML tag that loads them
   1.2 Add the scripts via the additional-js key in book.toml
   1.3 Or just append these scripts to the overridden file in step 4
2. Additionally, I advise adding lunr.multi.js too.
3. Then, you need to override searcher.js by putting a copy of the original file into the src folder.
4. The best part: find the line searchindex = elasticlunr.Index.load(config.index); and replace it with:

searchindex = elasticlunr(function() {
    // Register the (multi)language pipeline
    this.use(elasticlunr.multiLanguage('en', 'ru'));

    // Fields to index
    this.addField('title');
    this.addField('body');
    this.addField('breadcrumbs');

    // Field that identifies a document
    this.setRef('id');

    // Re-add all documents stored in the prebuilt index
    for (let key in config.index.documentStore.docs) {
        this.addDoc(config.index.documentStore.docs[key]);
    }
});

And search will be fine.

But one more thing: when mdbook builds the search index, it completely ignores the language settings in book.toml. It uses only English characters for creating the index, even though elasticlunr-rs supports other languages. Because of this behaviour, all attempts at adding an additional language will fail. I do not write Rust code and can't create a PR, but I hope this information will help someone.
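One way an English-only pipeline can lose non-English terms is ASCII-only trimming. The sketch below mirrors the `\W`-based trimming style of the lunr.js / elasticlunr.js English trimmer; `asciiTrim` is a hypothetical name, and whether the index built by mdBook trims tokens in exactly this way is an assumption that has not been checked against elasticlunr-rs.

```javascript
// Strip leading and trailing characters that are not in [A-Za-z0-9_].
// In JavaScript, \W matches every non-ASCII letter, including Cyrillic and Han.
function asciiTrim(token) {
  return token.replace(/^\W+/, '').replace(/\W+$/, '');
}

console.log(JSON.stringify(asciiTrim('search,')));  // "search"
console.log(JSON.stringify(asciiTrim('поиск')));    // ""
console.log(JSON.stringify(asciiTrim('搜索')));      // ""
```

A token that trims to an empty string never reaches the index, which would be consistent with the behaviour described above. Registering a language pipeline, as in the replacement snippet, swaps in a trimmer that knows the language's word characters.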

peaceshi mentioned this issue Feb 22, 2020

can't search chinese #1120

Closed

ehuss added A-Localization Area: Localization, language support, etc. A-Search Area: Search labels Apr 21, 2020

kLiHz mentioned this issue Jul 11, 2021

When will text search support Chinese characters #1444

Closed

ehuss changed the title ~~Unable to search for Chinese Keywords~~ Search does not support non-English languages Jul 27, 2021

ehuss mentioned this issue Jul 27, 2021

Search in RU or other language #1479

Closed

zoziha mentioned this issue Oct 10, 2021

The existing website does not support Chinese search (现有网页不支持中文搜索) fortran-fans/Fortran-in-Action#4

Open

sunface mentioned this issue Nov 29, 2021

Add Chinese search support (添加中文搜索功能) sunface/rust-course#86

Open

ehuss mentioned this issue Nov 20, 2023

Search doesn't work for non-latin characters #2244

Closed

chenrui333 mentioned this issue Dec 28, 2023

fix mdbook config zigcc/learning-zig#21

Merged

ehuss mentioned this issue Feb 13, 2024

No RTL Support & Search not working for languages like arabic or persian! #2316

Open

fw-immunant mentioned this issue May 9, 2024

Cannot search for "From and Into" google/comprehensive-rust#2068

Open

Sunshine40 mentioned this issue May 20, 2024

Non-English search support #2393

Open
