[Feature request] Index text without diacritics #53

Sanqui · 2022-05-16T16:41:38Z

When I search for "zlutoucky", I would like to be able to find "žluťoučký", and vice versa. These results should have a lower relevancy/priority. Would this be possible?

scambier · 2022-05-17T07:48:11Z

Omnisearch would have to index an additional computed field to hold the converted text without diacritics. Definitely doable, but I'd put that behind a setting, since that would roughly double the indexing duration and increase memory consumption.
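A minimal sketch of that "additional computed field" idea, assuming a plain in-memory index rather than Omnisearch's actual search library. Every name here (`IndexedNote`, `fold`, `indexNote`, `score`) and both weights are hypothetical. It shows why the text is effectively stored twice, and how a match found only through the diacritics-free field can rank lower:

```typescript
// Hypothetical names throughout; this is not Omnisearch's or its search library's API.
interface IndexedNote {
  path: string
  content: string // original text, diacritics preserved
  contentFolded: string // extra computed field, diacritics stripped
}

// Decompose accented letters (NFD), drop the combining marks, lowercase.
function fold(text: string): string {
  return text.normalize('NFD').replace(/[\u0300-\u036f]/g, '').toLowerCase()
}

// Each note keeps a second copy of its text, hence the extra time and memory.
function indexNote(path: string, content: string): IndexedNote {
  return { path, content, contentFolded: fold(content) }
}

// Exact matches score 1; matches found only via the folded field score 0.5.
// The weights are arbitrary and only illustrate "lower relevancy".
function score(note: IndexedNote, query: string): number {
  if (note.content.toLowerCase().includes(query.toLowerCase())) return 1
  if (note.contentFolded.includes(fold(query))) return 0.5
  return 0
}

const notes = [
  indexNote('a.md', 'žluťoučký kůň'),
  indexNote('b.md', 'zlutoucky kun'),
]

// Searching for "zlutoucky" finds both notes, with the exact match first.
const ranked = notes
  .map(n => ({ path: n.path, score: score(n, 'zlutoucky') }))
  .filter(r => r.score > 0)
  .sort((x, y) => y.score - x.score)
```

The same scoring works in the other direction: querying "žluťoučký" matches `a.md` exactly and `b.md` only through the folded field.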

I'm not sure how it will work with the in-file search, but I can take a look at it. Anyway, this part needs a rework.

> These results should have a lower relevancy/priority

Because the same word with and without diacritics could have 2 different meanings?

Sanqui · 2022-05-17T10:40:15Z

I don't mind it being behind a setting. It's too bad you may have to convert the text; it would be nice if the search library you're using offered diacritics removal alongside stemming and possibly lemmatization.

> These results should have a lower relevancy/priority
>
> Because the same word with and without diacritics could have 2 different meanings?

Indeed, there are many such cases. I would have to see it in practice to know whether it would be usable with the results ranked the same, though I suspect it would be only a minor annoyance.

scambier · 2022-05-17T11:28:28Z

> It's too bad you may have to convert the text, would be nice if the search library you're using offered diacritics removal

Well, I suspect it's specifically because diacritics can change the meaning of a word. I have the same issue in French, but to a lesser degree: the fuzzy matching usually works around it ("creme brulee" will match "crème brulée", but not "crème brûlée").
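One way to see why the second spelling falls out of reach is to count edits per word. The sketch below uses a standard Levenshtein distance and assumes the fuzzy matcher tolerates only a small per-word edit distance; the thread does not state the actual threshold, and `levenshtein` is an illustrative helper, not part of Omnisearch.

```typescript
// Classic dynamic-programming Levenshtein distance (two-row variant).
function levenshtein(a: string, b: string): number {
  // Compare in NFC so each accented letter is a single code unit.
  a = a.normalize('NFC')
  b = b.normalize('NFC')
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j)
  for (let i = 1; i <= a.length; i++) {
    const curr = [i]
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1
      curr[j] = Math.min(
        prev[j] + 1, // deletion
        curr[j - 1] + 1, // insertion
        prev[j - 1] + cost // substitution (or match)
      )
    }
    prev = curr
  }
  return prev[b.length]
}

const oneAccent = levenshtein('brulee', 'brulée') // only é differs
const twoAccents = levenshtein('brulee', 'brûlée') // û and é both differ
```

Each added diacritic costs one substitution, so a word with two accented letters is twice as far from its plain spelling as a word with one.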

I think that a better solution would be a toggle to simply ignore diacritics: normalize all search queries, and normalize notes before indexing.
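A minimal sketch of that toggle's core step, assuming Unicode NFD decomposition followed by stripping combining marks. This is a common technique and not necessarily what the commits referenced below do; `removeDiacritics` is a hypothetical helper name.

```typescript
// Hypothetical helper; not taken from the Omnisearch source.
function removeDiacritics(text: string): string {
  // NFD splits "ž" into "z" + U+030C (combining caron); the regex then drops
  // every code point in the Combining Diacritical Marks block (U+0300-U+036F).
  return text.normalize('NFD').replace(/[\u0300-\u036f]/g, '')
}

const note = 'Příliš žluťoučký kůň'
const query = 'zlutoucky'

// Apply the same normalization to the note before indexing and to the query.
const indexedText = removeDiacritics(note).toLowerCase()
const normalizedQuery = removeDiacritics(query).toLowerCase()

const found = indexedText.includes(normalizedQuery)
```

Because both sides go through the same function, only one copy of the text is indexed, at the cost of no longer distinguishing words that differ only by diacritics.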

scambier added the good first issue label May 17, 2022

scambier added this to the 1.4 milestone May 19, 2022

scambier self-assigned this Jun 3, 2022

scambier added a commit that referenced this issue Jun 8, 2022

#53 - Ignoring diacritics

ca80850

scambier added a commit that referenced this issue Jun 8, 2022

#53 - also remove diacritics from the cache

7b741fa

scambier added a commit that referenced this issue Jun 8, 2022

#53 - Fixed some highlighting issues

8f954a3

scambier closed this as completed Jun 8, 2022
