DokuWiki should do search ignoring accents (diacritics) #3243
Just FYI, I think there is ongoing work on refactoring FulltextIndex in #2943. Originally posted by @phy25 in #2037 (comment)
Hi, the scope of #2943 is to make DokuWiki PSR-12 compliant, and does not address the Latin-script search issue. Anyway, I think DokuWiki could use the ICU transliterator class as a possible third option of …
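(Editorial illustration, not DokuWiki code: the mark-stripping that such a transliterator performs can be sketched in Python. DokuWiki itself is PHP, where ICU's `Transliterator` with the transform ID `NFD; [:Nonspacing Mark:] Remove; NFC` does the equivalent.)

```python
import unicodedata

def deaccent(text: str) -> str:
    """Decompose, drop combining marks, recompose - the same effect as
    ICU's 'NFD; [:Nonspacing Mark:] Remove; NFC' transliterator."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed
                       if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)

print(deaccent("café"))   # -> cafe
print(deaccent("Εντός"))  # -> Εντος
```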
Hi Satoshi. Thanks for the reply.
Good, so that's what this issue is for.
In #3151 I argue differently: deaccentation should not be used in modern systems.
Firstly I thought of …
Is there a related issue or PR?
I'm doing extensive tests and still don't have "beautiful" code to present, but I'd like to share some thoughts.

First and most important thing: the character normalization applied to the indexed text and the search string must be language-dependent. For Latin and Greek characters, as stated in the opening comment, the normalization consists mostly of removing all accents, called marks in Unicode. But this same process applied to Indian texts would absolutely cripple the indexed text. Indian languages use alphasyllabaries instead of alphabets: the consonants have an inherent vowel, and marks are used to show other vowels. So it is absolutely necessary not to remove the marks, and to index the text as it is.

Second thing: how to implement this? One approach is to create a new configuration item in … Another approach is to apply a dynamic process that selects the appropriate normalization on the fly, based on the very characters being indexed. Some sort of table would give the normalization to use based on character ranges.

Third thing: we definitely need help from all translators. Only they can tell how they search on the net, e.g. with Google, and confirm which things are important. In my tests I'm doing a lot of guesswork based on Unicode and Wikipedia information, and of course inaccuracies are to be found. One idea here is to create a specific topic on the forum and ask people to contribute.

Fourth thing: it seems wise to let #2943 be merged before any big change in the search mechanism. I will focus on finishing the normalization tests, not on opening a branch. A dedicated repository for this work is the goal now.
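(Editorial illustration of the alphasyllabary point, sketched in Python rather than PHP: the same naive mark removal that is harmless for Latin and Greek destroys Devanagari vowel signs, which Unicode also classifies as marks. The word `गुरु` ("guru") is an assumed example.)

```python
import unicodedata

def strip_marks(text: str) -> str:
    # Language-blind mark removal: NFD, drop nonspacing marks (Mn), NFC
    nfd = unicodedata.normalize("NFD", text)
    kept = "".join(c for c in nfd if unicodedata.category(c) != "Mn")
    return unicodedata.normalize("NFC", kept)

# Harmless for Latin and Greek:
print(strip_marks("déjà"))    # -> deja
print(strip_marks("έντονο"))  # -> εντονο

# Crippling for Devanagari: vowel signs are marks too
print(strip_marks("गुरु"))     # -> गर  (both u-vowels are lost)
```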
(Issue #2037 had two parts: search and sort/collation. PR #3115 resolved the second part but closed the issue. This new issue reopens the first part.)
When running a wiki in a language that uses extended Latin (e.g. French) or non-Latin scripts (e.g. Greek), the indexing and search function has the following two limitations:
Accented and non-accented versions of the same character are not treated as equivalent. For example, "é" and "e" (French), or "έ" and "ε" (Greek), are not treated as the same character for indexing and search purposes, even though that is the expected behaviour for speakers of those languages.
In extended Latin and non-ASCII alphabets, capital and lowercase letters are not treated as equivalent, while in basic ASCII they are. For example, in Greek the search function cannot find "Εντός" when searching for "εντός" (Ε/ε are the upper/lowercase pair).
Originally posted by @disk0x in #2037 (comment)
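(Editorial note on the case question: full Unicode case folding, rather than ASCII lowercasing, already equates such pairs. A Python illustration follows; in PHP the mbstring extension offers case folding, e.g. via `MB_CASE_FOLD` since PHP 7.3.)

```python
# Unicode case folding equates case pairs beyond ASCII,
# including the Greek sigma forms (Σ/σ/ς):
assert "Εντός".casefold() == "εντός".casefold()
assert "ΕΝΤΌΣ".casefold() == "εντός".casefold()
print("Εντός".casefold())  # -> εντόσ (folding uses the non-final sigma)
```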
There are 3 crucial points in the search problem:

1. `inc/indexer.php`, line 60: `strlen()` - this function counts bytes, not characters, so every Latin letter with an accent is miscounted and put in the wrong index file, because its UTF-8 representation uses 2 bytes, not 1.
2. `inc/Search/Indexer.php`, line 798: `array_search()` - this function looks for an exact match in the index.
3. `inc/Search/Indexer.php`, line 811: `preg_grep()` - this function looks for part of a word in the index, but the part given in the regular expression will be an exact match.

If the DokuWiki search engine worked only with exact matching, I would already have a working solution that only fixes point 2 above, providing a carefully configured collator for matching.
But things are not like this, and AFAIK there is no way to use regular expressions and collation at the same time to solve point 3.
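(One way around the regex-vs-collation conflict, sketched in Python as an editorial aside, not as the proposed implementation: apply the same fold to the indexed words and to the search term at query time; plain regular expressions then suffice. The names `fold` and `grep` are illustrative.)

```python
import re
import unicodedata

def fold(text: str) -> str:
    # One normalization applied to both the index and the query:
    # case folding plus nonspacing-mark removal
    nfd = unicodedata.normalize("NFD", text.casefold())
    return "".join(c for c in nfd if unicodedata.category(c) != "Mn")

index = ["Εντός", "café", "creme"]        # hypothetical indexed words
folded = [fold(w) for w in index]

def grep(part: str):
    # preg_grep-style partial match, run against the folded forms
    pat = re.compile(re.escape(fold(part)))
    return [w for w, f in zip(index, folded) if pat.search(f)]

print(grep("εντος"))   # -> ['Εντός']
print(grep("CAFE"))    # -> ['café']
```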
A working solution, then, should do the following: …

This implementation:
- … (`INDEXER_VERSION`);
- … the `Normalizer` class, or use romanization as a fallback.

Originally posted by @moisesbr-dw in #2037 (comment)