Language and search (tsvector) #553

Closed
mworrell opened this Issue Apr 23, 2013 · 25 comments


@mworrell
Zotonic member

We have a problem with searching text across languages.

The tsvector is calculated using the actual language of a text.
This gives a problem when we search that text from another language.

This might show the problem directly:

(zotonic009prod@miffy)17> iolist_to_binary(mod_search:to_tsquery("monique den boer", z_context:set_language(nl, z:c(maxclass)))).
<<"monique & den & boer:*">>
(zotonic009prod@miffy)18> iolist_to_binary(mod_search:to_tsquery("monique den boer", z_context:set_language(en, z:c(maxclass)))).
<<"moniqu & den & boer:*">>

Depending on the mix of the indexing language and the search language we will find Monique or not.
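
The mismatch can be reproduced outside PostgreSQL with a minimal Python sketch. The two `stem_*` functions are crude stand-ins invented for illustration, not the Snowball stemmers PostgreSQL uses; they only mimic the outputs shown above for this one query, and the sketch ignores the `:*` prefix match.

```python
# Toy stand-ins for language-specific stemmers (hypothetical, NOT Snowball).
def stem_en(word: str) -> str:
    # Mimics the English stemmer dropping a trailing "e" ("monique" -> "moniqu").
    return word[:-1] if len(word) > 3 and word.endswith("e") else word


def stem_nl(word: str) -> str:
    # Mimics the Dutch stemmer leaving these particular words unchanged.
    return word


STEMMERS = {"en": stem_en, "nl": stem_nl}


def index_terms(text: str, lang: str) -> set[str]:
    """Terms stored in the index when the text is processed as `lang`."""
    return {STEMMERS[lang](word) for word in text.lower().split()}


def matches(query: str, query_lang: str, indexed: set[str]) -> bool:
    """True when every stemmed query term is present in the index."""
    return index_terms(query, query_lang) <= indexed


# The text was indexed as Dutch, then searched from a Dutch and an English context.
indexed_as_nl = index_terms("Monique den Boer", "nl")
same_language = matches("monique den boer", "nl", indexed_as_nl)
other_language = matches("monique den boer", "en", indexed_as_nl)
```

Here `same_language` is true and `other_language` is false: the English-stemmed term `moniqu` never appears in an index that was built with the Dutch rules.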

I propose to index and search all texts using the default language of the system.

@kaos
Zotonic member

Can't we index all languages?
And search the index for the language the user currently has selected...

@mworrell
Zotonic member

All translations are collected together into one tsvector column.
The point is that the language-dependent processing applied when indexing a text must be the same as the processing applied when searching it.

And as we search across all languages, all of them need to be processed in the same way.

I propose to either use the site's default language or en for processing and searching, and remove the language specific processing.

@arjan
Zotonic member

I think that is the easiest solution indeed, using the site's default language to store all texts and to search.

@mworrell
Zotonic member

OK, then I will change the search and pivot code that way, unless somebody objects...

We need to check what the influence is when searching for "münchen", "munchen" or "muenchen".

@kaos
Zotonic member

Yeah, I'm not familiar enough with the search routines to have any objections :p

@mworrell
Zotonic member

I just checked the to_tsvector.

For pg_catalog.english it maps München to "munchen"
For pg_catalog.dutch it maps München to "munch"

I propose that we perform a transliteration ourselves before we call the mapping functions.
For that we can use the routines present in z_string:to_name/1 (and friends).

Ok?
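
A minimal sketch of such a transliteration step, written in Python with only the standard library. It is an assumption about what "transliterate before stemming" could look like and does not reproduce the behaviour of z_string:to_name/1; the name `fold_to_ascii` is made up for this example.

```python
import unicodedata


def fold_to_ascii(text: str) -> str:
    """Lowercase the text and strip diacritics.

    Characters that have no ASCII base form are dropped.
    """
    # NFKD splits "ü" into "u" followed by a combining diaeresis.
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(
        char for char in decomposed if not unicodedata.combining(char)
    )
    return without_marks.encode("ascii", "ignore").decode("ascii").lower()


variants = [fold_to_ascii(word) for word in ("München", "munchen", "muenchen")]
```

With this folding "München" and "munchen" become the same token, but "muenchen" stays distinct. Mapping "ü" to "ue" would need an extra, language-specific table.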

@mmzeeman
Zotonic member
@mworrell
Zotonic member

It does do stemming and it does remove common words.
So that is indeed a problem, as both are language dependent.

@mworrell
Zotonic member

Example:

(zotonic001@Lamma)5> iolist_to_binary(mod_search:to_tsquery("This is a test", z_context:set_language(en, z:c(maxclass)))).
<<"test:*">>
(zotonic001@Lamma)6> iolist_to_binary(mod_search:to_tsquery("This is a test", z_context:set_language(nl, z:c(maxclass)))).
<<"this & a & test:*">>
@arjan
Zotonic member

can't we turn off stemming and stopping?

@arjan
Zotonic member

alternatively we'd need a column per site language to search on, which is stemmed/stopped per language.

@mworrell
Zotonic member

A column/index for every supported language is probably the only correct approach.

But text indexing is expensive (cpu wise), and do we want to update so many text indices when a resource is changed?

Another approach might be: http://en.wikipedia.org/wiki/Stemming#Multilingual_stemming

Together with transliteration to map common non-ascii characters to ascii (and lowercase).

@kaos
Zotonic member

Can't we keep a hash of the version of the text that has been indexed, so we don't reindex all languages, but only those that were actually changed?
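
The hash idea could be sketched roughly like this in Python. The function names and the dictionary holding the stored digests are hypothetical; in Zotonic the digests would have to live somewhere next to the pivot data.

```python
import hashlib


def text_digest(text: str) -> str:
    """Stable digest of the text that was last indexed for one language."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def languages_to_reindex(
    translations: dict[str, str], stored_digests: dict[str, str]
) -> list[str]:
    """Languages whose text is new or differs from the indexed version."""
    return sorted(
        lang
        for lang, text in translations.items()
        if stored_digests.get(lang) != text_digest(text)
    )


# Digests remembered from the previous pivot run (assumed storage).
stored = {"en": text_digest("Hello"), "nl": text_digest("Hallo")}

# Only "nl" was edited and "de" is a new translation.
changed = languages_to_reindex(
    {"en": "Hello", "nl": "Hallo wereld", "de": "Hallo"}, stored
)
```

Only the languages listed in `changed` would need their text index recalculated; the unchanged English text is skipped.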

@kaos
Zotonic member

moving to 0.10..

@ArthurClemens

Indexing using the site language does not help on a site that is used to create and manage translations. There, which language is primary is a user setting. Of course this is not the regular setup.

I would like to add that word stemming should help the user get more results (the word stem "help" also finds "helping" and "helped"), but it should not be the only way to find the text. The original, literal text should also be searchable. People's names, brands, products, biological classifications, etc. should be preserved and findable by their original name.
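
One way to picture this is an index that stores every word twice, once literally and once stemmed. The Python sketch below is only an illustration: `toy_stem` is a made-up suffix stripper, not a real stemmer, and the `exact:`/`stem:` prefixes are an invented convention. In PostgreSQL a similar effect might be reached by combining a vector built with the non-stemming `simple` configuration with a stemmed one, but that is an assumption and not something tested here.

```python
from typing import Optional


def toy_stem(word: str) -> str:
    # Hypothetical suffix stripper, for illustration only.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text: str) -> set[str]:
    """Store each word both literally and stemmed."""
    terms = set()
    for word in text.lower().split():
        terms.add("exact:" + word)
        terms.add("stem:" + toy_stem(word))
    return terms


def search(query: str, indexed: set[str]) -> Optional[str]:
    """Return how a single-word query matched: 'exact', 'stem' or None."""
    word = query.lower()
    if "exact:" + word in indexed:
        return "exact"
    if "stem:" + toy_stem(word) in indexed:
        return "stem"
    return None


indexed = index_terms("Helping Hastings")
```

Searching for "helped" matches through the stem, while the name "Hastings" is still found literally even though the toy stemmer mangles it to "hasting".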

@arjan arjan modified the milestone: Release 0.11, Release 0.10 Apr 17, 2014
@arjan
Zotonic member

Moving to 0.11; this is an open ended issue

@arjan arjan self-assigned this Sep 30, 2014
@arjan arjan modified the milestone: Wish list, Release 0.11 Sep 30, 2014
@witeman

Why not consider integrating the Sphinx full-text search engine into Zotonic? It could use different language packages, such as Coreseek for Chinese.

A solid CMS framework should come with a good full-text search engine built in.

@mworrell
Zotonic member

There is mod_search_solr, which integrates Solr into the full text search:

http://modules.zotonic.com/page/323/mod-search-solr

This might also be helpful with indexing different languages.

@ddeboer
Zotonic member

What is the path forward here? Do we concentrate on dedicated search solutions such as Solr and Elasticsearch for full-text search, or do we want to fix this in Zotonic itself, too?

@mworrell
Zotonic member

We are uncertain about the "fix".

We definitely do not want to require any external search engine.

I propose to index & search all texts in the default language of the site. If the default language changes then the texts need to be pivoted again.

@mworrell mworrell modified the milestone: Release 0.13.5, Wish list Oct 6, 2015
@ddeboer
Zotonic member

Why not index in all content languages that are being used?

@mworrell
Zotonic member

@ddeboer Because we have sites with 10+ languages... And then we have to index all N languages in N variations.

@ddeboer
Zotonic member

But that’s only a problem of data storage size, isn’t it? When searching, we could limit the index that is consulted to the searcher’s current language so performance suffers as little as possible.

@mworrell
Zotonic member

Not just data storage. You need more tables and also joins to the full text index tables, so performance would suffer. And of course pivoting will be slower, with the increased number of pivot tables.

@mworrell mworrell added a commit that referenced this issue Oct 8, 2015
@mworrell mworrell core: change stemming of the full text indexes.
This fixes a problem where the stemming of the full text index didn't match the stemming of the search query.
The stemmer is selected using the language code from either the i18n.language_stemmer or i18n.language configuration.
Call z_pivot_rsc:stemmer_language(Context) to see which PostgreSQL stemmer will be used.
You have to ensure that the selected stemmer is available in PostgreSQL.
With psql you can use the command '\dFd' to see the list of available stemmers.

Fixes #553
c8aad27
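
The selection described in the commit message might look roughly like the Python sketch below. Everything in it is an assumption made for illustration: that i18n.language_stemmer takes precedence over i18n.language, the small language-code table, and the choice to return `None` for an unknown code. The actual logic is in z_pivot_rsc:stemmer_language/1.

```python
from typing import Optional

# Assumed subset of language codes mapped to PostgreSQL text search
# configuration names; a real installation may offer more or fewer.
PG_CONFIGS = {"en": "english", "nl": "dutch", "de": "german", "fr": "french"}


def stemmer_language(config: dict[str, str]) -> Optional[str]:
    """Pick the stemmer from i18n.language_stemmer, else i18n.language.

    Returns None when the language code has no known configuration
    (an assumption of this sketch, not documented behaviour).
    """
    code = config.get("i18n.language_stemmer") or config.get("i18n.language")
    if code is None:
        return None
    return PG_CONFIGS.get(code)


explicit = stemmer_language({"i18n.language_stemmer": "nl", "i18n.language": "en"})
fallback = stemmer_language({"i18n.language": "en"})
unknown = stemmer_language({"i18n.language": "xx"})
```

The same stemmer name would then be used both when pivoting and when building the search query, which is what removes the mismatch this issue started with.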
@mworrell mworrell assigned mworrell and unassigned arjan Oct 8, 2015
@mworrell mworrell added a commit that closed this issue Oct 8, 2015
@mworrell mworrell core: change stemming of the full text indexes.
This fixes a problem where the stemming of the full text index didn't match the stemming of the search query.
The stemmer is selected using the language code from either the i18n.language_stemmer or i18n.language configuration.
Call z_pivot_rsc:stemmer_language(Context) to see which PostgreSQL stemmer will be used.
You have to ensure that the selected stemmer is available in PostgreSQL.
With psql you can use the command '\dFd' to see the list of available stemmers.

Fixes #553

(cherry picked from commit c8aad27)
114c8c4
@mworrell mworrell closed this in 114c8c4 Oct 8, 2015