[Feature request] Index text without diacritics #53

Closed
Sanqui opened this issue May 16, 2022 · 3 comments

@Sanqui

Sanqui commented May 16, 2022

When I search for "zlutoucky", I would like to be able to find "žluťoučký", and vice versa. These results should have a lower relevancy/priority. Would this be possible?

@scambier
Owner

Omnisearch would have to index an additional computed field to hold the converted text without diacritics. Definitely doable, but I'd put that behind a setting, since that would roughly double the indexing duration and increase memory consumption.
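As a rough illustration of the "additional computed field" idea, here is a minimal, self-contained sketch. It does not use Omnisearch's actual index or its search library; the `Note`, `IndexedNote`, `indexNotes`, and `search` names, the substring matching, and the two weights are all assumptions made up for the example. It shows one way a diacritics-free copy of the content could be matched with a lower score than the original text.

```typescript
// Hypothetical types for this sketch only (not Omnisearch's data model).
interface Note { path: string; content: string }
interface IndexedNote extends Note { contentFolded: string }
interface Hit { path: string; score: number }

// Assumed weights: a match on the original text outranks a folded match.
const EXACT_WEIGHT = 1;
const FOLDED_WEIGHT = 0.5;

// Lowercase after composing to NFC so literals compare consistently.
const lower = (s: string): string => s.normalize("NFC").toLowerCase();

// Decompose to NFD, then drop combining marks (U+0300 to U+036F).
const fold = (s: string): string =>
  s.normalize("NFD").replace(/[\u0300-\u036f]/g, "");

// Compute the extra field once, at indexing time. This is the step that
// costs additional indexing time and memory.
function indexNotes(notes: Note[]): IndexedNote[] {
  return notes.map((n) => ({ ...n, contentFolded: fold(lower(n.content)) }));
}

// Naive substring search: try the original text first, then the folded copy.
function search(index: IndexedNote[], query: string): Hit[] {
  const q = lower(query);
  const qFolded = fold(q);
  const hits: Hit[] = [];
  for (const note of index) {
    let score = 0;
    if (lower(note.content).includes(q)) score = EXACT_WEIGHT;
    else if (note.contentFolded.includes(qFolded)) score = FOLDED_WEIGHT;
    if (score > 0) hits.push({ path: note.path, score });
  }
  return hits.sort((a, b) => b.score - a.score);
}

const index = indexNotes([
  { path: "a.md", content: "Příliš žluťoučký kůň" },
  { path: "b.md", content: "zlutoucky kun" },
]);
const plainQuery = search(index, "zlutoucky");
const accentedQuery = search(index, "žluťoučký");
```

With these assumed weights, both queries find both notes, and the note whose text matches the query as typed is ranked first.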

I'm not sure how it will work with the in-file search, but I can take a look at it. Anyway, this part needs a rework.

> These results should have a lower relevancy/priority

Because the same word with and without diacritics could have 2 different meanings?

@Sanqui
Author

Sanqui commented May 17, 2022

I don't mind it being behind a setting. It's too bad you may have to convert the text, would be nice if the search library you're using offered diacritics removal alongside stemming and possibly lemmatization.

> These results should have a lower relevancy/priority

> Because the same word with and without diacritics could have 2 different meanings?

Indeed, there are many such cases. I would have to see it in practice to know whether it would be usable with the results ranked the same, though I suspect it would be only a minor annoyance.

@scambier
Owner

scambier commented May 17, 2022

> It's too bad you may have to convert the text, would be nice if the search library you're using offered diacritics removal

Well, I suspect it's specifically because diacritics can change the meaning of a word. I have the same issue in French, but to a lesser degree: the fuzzy matching usually works around it ("creme brulee" will match "crème brulée", but not "crème brûlée").
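One plausible reading of the French example is in terms of edit distance. The sketch below is a plain Levenshtein implementation, not the search library's own fuzzy matcher, and the threshold of one edit per term is an assumption chosen only to illustrate the point; `editDistance` and `MAX_EDITS` are hypothetical names.

```typescript
// Classic Levenshtein distance over Unicode code points.
// Inputs are composed to NFC so that "é" counts as a single character.
function editDistance(a: string, b: string): number {
  const s = Array.from(a.normalize("NFC"));
  const t = Array.from(b.normalize("NFC"));
  // prev[j] holds the distance between the first i-1 chars of s and the first j chars of t.
  let prev: number[] = Array.from({ length: t.length + 1 }, (_, j) => j);
  for (let i = 1; i <= s.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= t.length; j++) {
      const cost = s[i - 1] === t[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution or match
      );
    }
    prev = curr;
  }
  return prev[t.length];
}

// Assumed fuzzy threshold for this illustration: at most one edit per term.
const MAX_EDITS = 1;

const dCreme = editDistance("creme", "crème");       // one substitution
const dBrulee1 = editDistance("brulee", "brulée");   // one substitution
const dBrulee2 = editDistance("brulee", "brûlée");   // two substitutions

const matchesOneAccent = dBrulee1 <= MAX_EDITS;
const matchesTwoAccents = dBrulee2 <= MAX_EDITS;
```

Under that assumed threshold, "brulee" is within reach of "brulée" but not of "brûlée", which is consistent with the behavior described above. Words with several accented letters fall outside the fuzzy window quickly, which is why Czech text suffers more than French.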

I think that a better solution would be a toggle to simply ignore diacritics: normalize all search queries, and normalize notes before indexing.
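A minimal sketch of that normalization, assuming Unicode decomposition is an acceptable way to strip diacritics. The `removeDiacritics` and `normalizeForSearch` names and the `ignoreDiacritics` flag are hypothetical and only stand in for the proposed setting; the same function would be applied both to note text before indexing and to each query.

```typescript
// Decompose to NFD so accented letters become base letter + combining mark,
// then remove the combining marks (U+0300 to U+036F).
function removeDiacritics(text: string): string {
  return text.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

// Hypothetical toggle: when enabled, notes and queries are both folded,
// so they meet on the same diacritics-free form.
function normalizeForSearch(text: string, ignoreDiacritics: boolean): string {
  const lowered = text.toLowerCase();
  return ignoreDiacritics ? removeDiacritics(lowered) : lowered;
}

const foldedCzech = removeDiacritics("žluťoučký");      // "zlutoucky"
const foldedFrench = removeDiacritics("crème brûlée");  // "creme brulee"
const sameForm =
  normalizeForSearch("Crème Brûlée", true) === normalizeForSearch("creme brulee", true);
```

Note that this only handles characters with a canonical decomposition. Letters such as "ø" or "ł" are not base letter plus combining mark, so they pass through unchanged and would need an explicit mapping if they matter.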

scambier added the "good first issue" label May 17, 2022
scambier added this to the 1.4 milestone May 19, 2022
scambier self-assigned this Jun 3, 2022
scambier added a commit that referenced this issue Jun 8, 2022
scambier added a commit that referenced this issue Jun 8, 2022
scambier closed this as completed Jun 8, 2022