
Language and search (tsvector) #553

Open
mworrell opened this Issue · 19 comments

6 participants

@mworrell
Owner

We have a problem with searching text across languages.

The tsvector is calculated using the actual language of a text.
This causes a problem when we search that text from another language.

This might show the problem directly:

(zotonic009prod@miffy)17> iolist_to_binary(mod_search:to_tsquery("monique den boer", z_context:set_language(nl, z:c(maxclass)))).
<<"monique & den & boer:*">>
(zotonic009prod@miffy)18> iolist_to_binary(mod_search:to_tsquery("monique den boer", z_context:set_language(en, z:c(maxclass)))).
<<"moniqu & den & boer:*">>

Depending on the mix of the indexing language and the search language we will find Monique or not.
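The same difference can be reproduced directly in psql, since mod_search:to_tsquery builds on PostgreSQL's text search configurations (a sketch, assuming a stock PostgreSQL install):

```sql
-- The English Snowball stemmer strips the final 'e'; the Dutch one does not:
SELECT to_tsquery('pg_catalog.english', 'monique');  -- 'moniqu'
SELECT to_tsquery('pg_catalog.dutch',   'monique');  -- 'monique'
```

A text pivoted with the Dutch configuration stores the lexeme 'monique', so a query normalised with the English configuration ('moniqu') will never match it.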

I propose to index and search all texts using the default language of the system.

@kaos
Owner

Can't we index all languages?
And search the index with the language the user currently has selected...

@mworrell
Owner

All translations are collected together into one tsvector column.
The language-dependent processing of the text at index time just has to match the processing at search time.

And as a search can be in any language, we need to process all languages in the same way.

I propose to either use the site's default language or en for both processing and searching, and to remove the language-specific processing.

@arjan
Owner

I think that is the easiest solution indeed, using the site's default language to store all texts and to search.

@mworrell
Owner

Ok, then I will change the search and pivot code in that way, if nobody objects...

We need to check what the influence is when searching for "münchen", "munchen" or "muenchen".

@kaos
Owner

Yeah, I'm not familiar enough with the search routines to have any objections :p

@mworrell
Owner

I just checked the to_tsvector.

For pg_catalog.english it maps München to "munchen".
For pg_catalog.dutch it maps München to "munch".

I propose that we perform a transliteration ourselves before we call the mapping functions.
For that we can use the routines present in z_string:to_name/1 (and friends).

Ok?
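As an aside: PostgreSQL itself ships a contrib extension, unaccent, that can do a comparable folding on the database side (a sketch, assuming the unaccent contrib module is installed; this is an alternative to, not the same code path as, z_string:to_name/1):

```sql
CREATE EXTENSION IF NOT EXISTS unaccent;

-- Fold diacritics before the language-specific mapping runs:
SELECT unaccent('München');                          -- 'Munchen'
SELECT to_tsvector('english', unaccent('München'));  -- 'munchen':1
```

Doing the transliteration in Erlang with z_string:to_name/1 has the advantage that indexing and tsquery construction are guaranteed to use identical folding, which is the invariant that matters here.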

@mmzeeman
Owner
@mworrell
Owner

It does do stemming and it does remove common words.
So that is indeed a problem, as both are language-dependent.

@mworrell
Owner

Example:

(zotonic001@Lamma)5> iolist_to_binary(mod_search:to_tsquery("This is a test", z_context:set_language(en, z:c(maxclass)))).
<<"test:*">>
(zotonic001@Lamma)6> iolist_to_binary(mod_search:to_tsquery("This is a test", z_context:set_language(nl, z:c(maxclass)))).
<<"this & a & test:*">>
@arjan
Owner

Can't we turn off stemming and stop-word removal?
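PostgreSQL has a built-in configuration for exactly that: 'simple' lowercases and tokenizes, but applies no stemming and no stop-word list (a sketch, assuming stock PostgreSQL):

```sql
-- Compare with the language-specific configurations shown above:
SELECT to_tsvector('simple', 'This is a test');
-- 'a':3 'is':2 'test':4 'this':1
```

The trade-off is that "helping" no longer matches "help", and stop words bloat the index.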

@arjan
Owner

Alternatively, we'd need a column per site language to search on, each stemmed and stop-worded for its own language.
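A hypothetical column-per-language layout could look like the following (table and column names are illustrative, not Zotonic's actual schema):

```sql
ALTER TABLE rsc
    ADD COLUMN tsv_en tsvector,
    ADD COLUMN tsv_nl tsvector;

CREATE INDEX rsc_tsv_en_idx ON rsc USING gin (tsv_en);
CREATE INDEX rsc_tsv_nl_idx ON rsc USING gin (tsv_nl);

-- Query the column that matches the user's current language:
SELECT id FROM rsc WHERE tsv_nl @@ to_tsquery('pg_catalog.dutch', 'helpen:*');
```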

@mworrell
Owner

A column/index for every supported language is probably the only correct approach.

But text indexing is expensive (CPU-wise), and do we want to update that many text indices every time a resource changes?

Another approach might be: http://en.wikipedia.org/wiki/Stemming#Multilingual_stemming

Together with transliteration to map common non-ASCII characters to ASCII (and lowercase them).

@kaos
Owner

Can't we keep a hash of the version of the text that has been indexed? Then we don't reindex all languages, only the ones that actually changed.
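A sketch of that idea in SQL, assuming a per-language tsvector column already exists (all table and column names here are hypothetical): keep a digest of the source text next to the tsvector and only recompute when the digest changes.

```sql
-- Hypothetical: tsv_nl is the per-language tsvector, body_nl the source text.
-- md5() ships with PostgreSQL core.
ALTER TABLE rsc ADD COLUMN tsv_nl_hash text;

UPDATE rsc
   SET tsv_nl      = to_tsvector('pg_catalog.dutch', body_nl),
       tsv_nl_hash = md5(body_nl)
 WHERE tsv_nl_hash IS DISTINCT FROM md5(body_nl);
```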

@kaos
Owner

moving to 0.10..

@ArthurClemens
Collaborator

Indexing using the site language does not help for a site that is used to create and manage translations; there, which language is primary is a user setting. Of course, this is not the regular setup.

I would like to add that word stemming should help the user get more results (the stem "help" also finds "helping" and "helped"), but it should not be the only way to reach the text. The original, literal text should also be searchable: people's names, brands, products, biological classifications, etc. should be preserved and findable under their original name.

@arjan arjan modified the milestone: Release 0.11, Release 0.10
@arjan
Owner

Moving to 0.11; this is an open-ended issue.

@arjan arjan self-assigned this
@arjan arjan modified the milestone: Wish list, Release 0.11
@witeman

Why not consider integrating the Sphinx full-text search engine into Zotonic, using a language-specific package where needed, such as Coreseek for Chinese?

A solid CMS framework should have a proper full-text search engine built in.

@mworrell
Owner

There is mod_search_solr, which integrates Solr into the full-text search:

http://modules.zotonic.com/page/323/mod-search-solr

This might also be helpful with indexing different languages.
