We have a problem with searching text across languages.
The tsvector is calculated using the actual language of the text.
This causes a problem when we search that text from another language.
This might show the problem directly:
(zotonic009prod@miffy)17> iolist_to_binary(mod_search:to_tsquery("monique den boer", z_context:set_language(nl, z:c(maxclass)))).
<<"monique & den & boer:*">>
(zotonic009prod@miffy)18> iolist_to_binary(mod_search:to_tsquery("monique den boer", z_context:set_language(en, z:c(maxclass)))).
<<"moniqu & den & boer:*">>
Depending on the mix of the indexing language and the search language we will find Monique or not.
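For illustration, the mismatch can be reproduced directly against PostgreSQL; a minimal sketch using z_db:q1/2 with the same shell context as above:

%% The text was pivoted with the Dutch config, but the query was stemmed
%% with the English config ("monique" became "moniqu"), so the lexemes no
%% longer line up and the match fails.
Context = z:c(maxclass),
z_db:q1("select to_tsvector('pg_catalog.dutch', 'Monique den Boer')
         @@ to_tsquery('pg_catalog.english', 'moniqu & den & boer:*')",
        Context).
%% expected: false, as the tsvector holds the lexeme 'monique', not 'moniqu'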
I propose to index and search all texts using the default language of the system.
Can't we index all languages?
And search the index for the language the user currently has selected...
All translations are collected together into one tsvector column.
It is just that the language-dependent processing of the text has to match the processing used when searching.
And as we are searching across any language, we need to process all languages in the same way.
I propose to either use the site's default language or en for processing and searching, and to remove the language-specific processing.
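Roughly what I mean, as a sketch (stemmer_config/1 is a hypothetical helper, not existing code; it assumes z_trans:default_language/1 for the site default):

%% Pick one fixed text search configuration for the whole site and use it
%% for both pivoting and querying, instead of the language of the text or
%% the language of the user.
stemmer_config(Context) ->
    case z_trans:default_language(Context) of
        nl -> <<"pg_catalog.dutch">>;
        _  -> <<"pg_catalog.english">>
    end.

%% Index and search with the very same config:
%% z_db:q1("select to_tsvector($1::regconfig, $2)",
%%         [stemmer_config(Context), Text], Context).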
I think that is indeed the easiest solution: using the site's default language both to store all texts and to search.
Ok, then I will change the search and pivot code accordingly, if nobody objects...
We need to check what the influence is when searching for "münchen", "munchen" or "muenchen".
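A quick way to check that, sketched against both configs with z_db:q1/3:

%% Compare how each config normalises the three spellings.
Variants = [<<"münchen"/utf8>>, <<"munchen">>, <<"muenchen">>],
Configs  = [<<"pg_catalog.english">>, <<"pg_catalog.dutch">>],
[ {Cfg, W, z_db:q1("select to_tsvector($1::regconfig, $2)", [Cfg, W], Context)}
  || Cfg <- Configs, W <- Variants ].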
Yeah, I'm not familiar enough with the search routines to have any objections :p
I just checked to_tsvector:
For pg_catalog.english it maps München to "munchen"
For pg_catalog.dutch it maps München to "munch"
I propose that we perform a transliteration ourselves before we call the mapping functions.
For that we can use the routines present in z_string:to_name/1 (and friends).
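Something along these lines; transliterate/1 is hypothetical and reuses the to_name/1 machinery per word so that the spaces survive:

%% Normalise to lowercase ASCII before handing the text to
%% to_tsvector/to_tsquery, so that e.g. 'München' and 'Munchen'
%% end up as the same lexeme.
transliterate(Text) ->
    Words = binary:split(z_convert:to_binary(Text), <<" ">>, [global]),
    iolist_to_binary(join([ z_string:to_name(W) || W <- Words ], <<" ">>)).

join([], _Sep) -> [];
join([W], _Sep) -> [W];
join([W|Ws], Sep) -> [W, Sep | join(Ws, Sep)].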
It does do stemming and it does remove common words.
So that is indeed a problem, as both are language dependent.
(zotonic001@Lamma)5> iolist_to_binary(mod_search:to_tsquery("This is a test", z_context:set_language(en, z:c(maxclass)))).
<<"test:*">>
(zotonic001@Lamma)6> iolist_to_binary(mod_search:to_tsquery("This is a test", z_context:set_language(nl, z:c(maxclass)))).
<<"this & a & test:*">>
Can't we turn off stemming and stopword removal?
Alternatively we'd need a column per site language to search on, stemmed/stopped per language.
A column/index for every supported language is probably the only correct approach.
But text indexing is expensive (CPU-wise); do we want to update that many text indices whenever a resource is changed?
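For reference, the per-language variant would look roughly like this (the column names are made up; the real pivot column is rsc.pivot_tsv):

%% One tsvector column per site language, each pivoted with its own
%% config and each carrying its own (expensive) GIN index.
z_db:q("alter table rsc add column pivot_tsv_en tsvector", Context),
z_db:q("alter table rsc add column pivot_tsv_nl tsvector", Context),
z_db:q("create index rsc_pivot_tsv_en_key on rsc using gin(pivot_tsv_en)", Context),
z_db:q("create index rsc_pivot_tsv_nl_key on rsc using gin(pivot_tsv_nl)", Context).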
Another approach might be: http://en.wikipedia.org/wiki/Stemming#Multilingual_stemming
Together with transliteration to map common non-ASCII characters to ASCII (and lowercase).
For studying/reading: http://www.postgresql.org/docs/9.2/static/textsearch-dictionaries.html
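Combining both ideas in a sketch: PostgreSQL's built-in 'simple' config does no stemming and no stopword removal, and the contrib extension unaccent handles the transliteration:

%% Language-neutral processing: unaccent() transliterates, the 'simple'
%% config only lowercases and splits the text into words.
z_db:q("create extension if not exists unaccent", Context),
z_db:q1("select to_tsvector('simple', unaccent('München'))", Context).
%% expected: 'munchen':1 -- the same lexeme for münchen and munchen
%% (muenchen would still be a distinct lexeme, though).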
Can't we keep a hash of the version of the text that has been indexed, so we don't reindex all languages but only those that were actually changed?
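A sketch of that check (maybe_reindex/4 and reindex_language/4 are hypothetical pivot helpers, not existing code):

%% Only redo the expensive per-language indexing when the stored hash
%% of that language's text no longer matches the current text.
maybe_reindex(Lang, Text, OldHash, Context) ->
    NewHash = crypto:hash(md5, z_convert:to_binary(Text)),
    case NewHash of
        OldHash -> ok;  %% unchanged, skip reindexing this language
        _ -> reindex_language(Lang, Text, NewHash, Context)
    end.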
Moving to 0.10...
Indexing using the site language does not help for a site that is used to create/manage translations. In that case it is a user setting which language is primary. Of course this is not the regular setup.
I would like to add that word stemming should help the user get more results (the word stem "help" also finds "helping", "helped"), but it should not be the only route to the text. The original, literal text should also be available: people's names, brands, products, biological classifications, etc. should be preserved and findable under their original name.
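That could be done by concatenating a stemmed tsvector with an unstemmed 'simple' one, weighted so that literal matches can be ranked separately; a sketch with illustrative input:

%% The result contains both the stem ('help') and the literal word
%% ('helping'), so exact names stay findable next to the stems.
z_db:q1("select setweight(to_tsvector('english', $1), 'A')
             || setweight(to_tsvector('simple',  $1), 'B')",
        [<<"Helping Monique">>], Context).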
Moving to 0.11; this is an open-ended issue.
Why not consider integrating the Sphinx full-text search engine into Zotonic, using its different language packages, such as Coreseek for Chinese?
A solid CMS framework ought to have a proper full-text search engine built in.
There is mod_search_solr, which integrates Solr into the full-text search.
That might also help with indexing different languages.