
DokuWiki should do search ignoring accents (diacritics) #3243

Open
moisesbr-dw opened this issue Aug 26, 2020 · 5 comments

moisesbr-dw (Contributor) commented Aug 26, 2020

(Issue #2037 had two parts: search and sort/collation. PR #3115 resolved the second part but closed the issue. This new issue reopens the first part.)


When running a wiki in a language that uses extended Latin (e.g. French) or a non-Latin script (e.g. Greek), the indexing and search functions have the following two limitations:

Accented and unaccented versions of the same character are not treated as equivalent. For example, "é" and "e" (French), or "έ" and "ε" (Greek), are not treated as the same character for indexing and search purposes, even though that is the expected behaviour for speakers of those languages.

In extended Latin and other non-ASCII alphabets, capital and lowercase letters are not treated as equivalent, while in basic ASCII they are. For example, in Greek the search function cannot find "Εντός" when searching for "εντός" (Ε/ε are the upper/lowercase pair).

Originally posted by @disk0x in #2037 (comment)


There are three crucial points in the search problem (a short sketch follows the list):

  1. inc/indexer.php, line 60: strlen() - this function counts bytes, not characters, so every Latin letter with an accent is miscounted and the word is put in the wrong index file, because its UTF-8 representation uses 2 bytes, not 1.
  2. inc/Search/Indexer.php, line 798: array_search() - this function looks for an exact match in the index.
  3. inc/Search/Indexer.php, line 811: preg_grep() - this function looks for part of a word in the index, but the part given in the regular expression still has to match exactly (accents included).
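
To make the three points concrete, here is a rough PHP sketch (the $index contents are made up for illustration):

```php
<?php
// Point 1: strlen() counts bytes, so "été" looks 5 "characters" long
// and the word is filed under the wrong length in the index.
var_dump(strlen('été'));             // int(5) -- "é" takes 2 bytes in UTF-8
var_dump(mb_strlen('été', 'UTF-8')); // int(3) -- the real character count

// Points 2 and 3: both lookups are exact, accent-sensitive matches.
$index = ['ete', 'eteindre'];           // made-up index content
var_dump(array_search('été', $index)); // bool(false) -- no exact match
var_dump(preg_grep('/été/u', $index)); // array(0) {} -- the regex is accent-sensitive too
```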

If the DokuWiki search engine worked only with exact matching, I would already have a working solution that fixes only point 2 above, by providing a carefully configured collator for matching.

But things are not like that, and AFAIK there is no way to use regular expressions and collation at the same time, so point 3 remains unsolved.
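
For reference, this is roughly what a collator-based fix for point 2 alone could look like (a sketch only, assuming the intl extension; at PRIMARY strength accents and case are ignored):

```php
<?php
$collator = new Collator('fr_FR');
$collator->setStrength(Collator::PRIMARY);       // compare base letters only

var_dump($collator->compare('é', 'e'));          // int(0) -- treated as equal
var_dump($collator->compare('Εντός', 'εντός'));  // int(0) -- case is ignored as well
// But preg_grep() (point 3) cannot be told to use this collator.
```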

A working solution, then, should do the following:

  • When indexing, the words must be normalized (deaccented etc.) before being put in the index files.
  • When searching, the query must be normalized before being compared to the index, so exact and partial searches would work with no modification in points 2 and 3 above.

This implementation:

  • would not be backward-compatible (we would increment the INDEXER_VERSION);
  • could use the Normalizer class (a minimal sketch follows below), or use romanization as a fallback.

Originally posted by @moisesbr-dw in #2037 (comment)
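
A minimal sketch of what such a normalization step could look like with the Normalizer class (the function name idx_normalize is made up; a real patch would hook this into the indexer and the query parsing, and bump INDEXER_VERSION):

```php
<?php
// Applied to every word at index time and to the query at search time,
// so the exact and partial lookups in points 2 and 3 keep working unchanged.
function idx_normalize(string $word): string
{
    // Decompose into base characters + combining marks (NFD)...
    $decomposed = Normalizer::normalize($word, Normalizer::FORM_D);
    // ...drop the combining marks, then lowercase.
    $stripped = preg_replace('/\p{Mn}/u', '', $decomposed);
    return mb_strtolower($stripped, 'UTF-8');
}

echo idx_normalize('Été'), "\n";   // ete
echo idx_normalize('Εντός'), "\n"; // εντος -- accent and case gone, script kept
```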

moisesbr-dw (Contributor, Author) commented

Just FYI I think there is ongoing work on refactoring FulltextIndex in #2943.

Originally posted by @phy25 in #2037 (comment)

moisesbr-dw (Contributor, Author) commented

> Just FYI I think there is ongoing work on refactoring FulltextIndex in #2943.

It seems that #2943 doesn't address the issue posted here.
All the code pointed out in the first message remains the same (just in another place, of course).

ssahara (Collaborator) commented Aug 27, 2020

Hi, the scope of #2943 is to make DokuWiki PSR-12 compliant, and it does not address the Latin-script search issue.
I think this issue should be addressed separately.

Anyway, I think DokuWiki could use the ICU Transliterator class as a possible third option for $conf['deaccent'], and a transliterator will be necessary to address the Latin-script search issue. I am working on writing dokuwiki\Utf8\Transliterator in a different branch: https://github.com/ssahara/dokuwiki/tree/translit
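
For illustration, the ICU Transliterator can chain rules like this (the rule chain below is only an example, not necessarily what the translit branch uses):

```php
<?php
$translit = Transliterator::create('Any-Latin; Latin-ASCII; Lower');

echo $translit->transliterate('Εντός'), "\n";   // entos
echo $translit->transliterate('déjà vu'), "\n"; // deja vu
```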

moisesbr-dw (Contributor, Author) commented

Hi Satoshi. Thanks for the reply.

> Hi, the scope of #2943 is to make DokuWiki PSR-12 compliant, and it does not address the Latin-script search issue.
> I think this issue should be addressed separately.

Good, so that's what this issue is for.

> Anyway, I think DokuWiki could use the ICU Transliterator class as a possible third option for $conf['deaccent'],

In #3151 I argue differently: deaccentation should not be used in modern systems.

> and a transliterator will be necessary to address the Latin-script search issue.

At first I thought of Normalizer, but Transliterator is much more powerful.
Thank you for pointing that out.

> I am working on writing dokuwiki\Utf8\Transliterator in a different branch: https://github.com/ssahara/dokuwiki/tree/translit

Is there a related issue or PR?

moisesbr-dw (Contributor, Author) commented

I'm doing extensive tests and still don't have "beautiful" code to present, but I'd like to share some thoughts.

First and most important thing: the character normalization applied to the indexed text and to the search string must be language-dependent.

For Latin and Greek characters, as stated in the opening comment, the normalization consists mostly of removing all accents (called marks in Unicode).

But this same process applied to Indic texts would absolutely cripple the indexed text. Indian languages use alphasyllabaries instead of alphabets: the consonants have an inherent vowel, and marks are used to denote the other vowels. So it is absolutely necessary not to remove the marks and to index the text as it is.
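
A quick illustration of the danger (hypothetical code; गुरु "guru" uses the non-spacing vowel sign U+0941 twice):

```php
<?php
// The Latin/Greek recipe: decompose, then strip the combining marks.
$strip = fn(string $w): string =>
    preg_replace('/\p{Mn}/u', '', Normalizer::normalize($w, Normalizer::FORM_D));

echo $strip('été'), "\n";  // "ete" -- fine for French
echo $strip('गुरु'), "\n"; // "गर"  -- both "u" vowel signs are lost, the word is mangled
```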

Second thing: how to implement this?

One approach is to create a new configuration item in lang.php telling which normalization should be used. But this would not be enough for multilingual pages and wikis. Maybe the translation plugin could help select the right normalization for each page.

Another approach is to apply a dynamic process that selects the appropriate normalization on the fly, based on the actual characters being indexed. Some sort of table would give the normalization to use for each character range.
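
A rough sketch of the dynamic idea, assuming PCRE's Unicode script properties are used instead of a hand-written range table (the function name and the list of scripts are made up):

```php
<?php
function idx_normalize_by_script(string $word): string
{
    // Scripts whose marks carry vowels: index the word untouched.
    if (preg_match('/[\p{Devanagari}\p{Bengali}\p{Tamil}]/u', $word)) {
        return $word;
    }
    // Latin, Greek, Cyrillic, ...: decompose, strip marks, lowercase.
    $decomposed = Normalizer::normalize($word, Normalizer::FORM_D);
    return mb_strtolower(preg_replace('/\p{Mn}/u', '', $decomposed), 'UTF-8');
}
```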

Third thing: we definitely need help from all translators. Only they can tell how they search on the web, e.g. with Google, and confirm which things are important. In my tests I'm doing a lot of guesswork based on Unicode and Wikipedia information, and of course there will be inaccuracies.

One idea here is to create a specific topic on the forum and ask people to contribute.

Fourth thing: it seems wise to let #2943 be merged before any big change in the search mechanism. I will focus on finishing the normalization tests, not on opening a branch. A dedicated repository for this work is the goal for now.
