Elasticsearch issue when search Thai language #1887

herdianabdillah · 2021-02-03T02:56:46Z

Describe the bug
I'm unable to search for some Thai words. This is an example sentence I have in my FAQ body: ปรับการปฏิบัติงานโดยรวม
The search finds the FAQ with the first 5 characters, but when I try with the last 5 characters, the FAQ is not found.

To Reproduce
Steps to reproduce the behavior:

Go to the home page -> change to the Thai language
Paste the last 5 characters into the search field; the FAQ is not found

Expected behavior
The FAQ should be found with either the first 5 characters or the last 5 characters, as long as it contains the word.

Screenshots
This is the FAQ:

This is the result when I search with the first 5 characters:

This is the result when I search with the last 5 characters:

phpMyFAQ (please complete the following information):

phpMyFAQ version : 3.0.4
PHP version : 7
Database : MySQL
Elasticsearch yes/no - Yes
Elasticsearch version : 6.8.12

thorsten · 2021-02-03T05:41:07Z

Does the Thai language have spaces between the words?

thorsten · 2021-02-03T05:42:11Z

Looks like you need a different Tokenizer for the Thai language: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-thai-tokenizer.html

herdianabdillah · 2021-02-03T06:12:31Z

Does the Thai language have spaces between the words?

It seems the Thai language does not have spaces between words.

herdianabdillah · 2021-02-03T06:15:19Z

Looks like you need a different Tokenizer for the Thai language: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-thai-tokenizer.html

So should I drop the current index?
And then in this function, should I change the tokenizer to an array and add Thai to it? I still want to be able to search other languages like Chinese, Japanese, etc.

private function getParams()
{
    global $PMF_ELASTICSEARCH_STEMMING_LANGUAGE;

    return [
        'index' => $this->esConfig['index'],
        'body' => [
            'settings' => [
                'number_of_shards' => PMF_ELASTICSEARCH_NUMBER_SHARDS,
                'number_of_replicas' => PMF_ELASTICSEARCH_NUMBER_REPLICAS,
                'analysis' => [
                    'filter' => [
                        'autocomplete_filter' => [
                            'type' => 'edge_ngram',
                            'min_gram' => 1,
                            'max_gram' => 20
                        ],
                        'Language_stemmer' => [
                            'type' => 'stemmer',
                            'name' => $PMF_ELASTICSEARCH_STEMMING_LANGUAGE[$this->config->getDefaultLanguage()]
                        ]
                    ],
                    'analyzer' => [
                        'autocomplete' => [
                            'type' => 'custom',
                            'tokenizer' => 'standard',
                            'filter' => [
                                'lowercase',
                                'autocomplete_filter',
                                'Language_stemmer'
                            ]
                        ]
                    ]
                ]
            ]
        ]
    ];
}
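
For context on the symptom reported above: the `autocomplete_filter` in this function is an `edge_ngram` filter, which emits only prefixes anchored at the start of each token. If the `standard` tokenizer hands over the unsegmented Thai phrase as a single token (an assumption here, not verified on this installation), the trailing characters of the phrase never reach the index as a term of their own. The following minimal sketch imitates that prefix generation in plain PHP; `edgeNgrams()` is a hypothetical helper for illustration, not part of phpMyFAQ or Elasticsearch, and it assumes the mbstring extension is loaded.

```php
<?php

/**
 * Imitates an edge_ngram token filter: returns the prefixes of $token
 * whose length lies between $minGram and $maxGram code points.
 * Hypothetical helper, for illustration only.
 */
function edgeNgrams(string $token, int $minGram = 1, int $maxGram = 20): array
{
    $length = mb_strlen($token, 'UTF-8');
    $grams = [];
    for ($size = $minGram; $size <= min($maxGram, $length); $size++) {
        // Every gram starts at offset 0, so no gram ever begins mid-token.
        $grams[] = mb_substr($token, 0, $size, 'UTF-8');
    }
    return $grams;
}

// The phrase from the FAQ body, treated as one unsegmented token (assumption).
$token = 'ปรับการปฏิบัติงานโดยรวม';
$grams = edgeNgrams($token);

$firstFive = mb_substr($token, 0, 5, 'UTF-8');
$lastFive = mb_substr($token, -5, null, 'UTF-8');

var_dump(in_array($firstFive, $grams, true)); // leading characters are among the grams
var_dump(in_array($lastFive, $grams, true));  // trailing characters are not
```

Under that assumption, a tokenizer that splits the phrase into words would give the filter more token starts to build prefixes from, which is why the choice of tokenizer matters here.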

thorsten · 2021-02-03T06:17:45Z

I think this Tokenizer would be the best: https://www.elastic.co/guide/en/elasticsearch/plugins/7.10/analysis-icu-tokenizer.html

thorsten · 2021-02-03T06:19:11Z

So, it would be cool if you could test it. Use the ICU Tokenizer, drop the index and re-create it and re-index the content.

herdianabdillah · 2021-02-03T06:36:42Z

I think this Tokenizer would be the best: https://www.elastic.co/guide/en/elasticsearch/plugins/7.10/analysis-icu-tokenizer.html

Okay, so I just need to change "standard" to "icu_tokenizer", right? Let me try it.

thorsten · 2021-02-03T06:37:20Z

Yes, seems so :-)

herdianabdillah · 2021-02-03T07:05:37Z

Hi Thorsten,

I got this error when I clicked "create index":

Fatal error: Uncaught Elasticsearch\Common\Exceptions\BadRequest400Exception: {"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"Custom Analyzer [autocomplete] failed to find tokenizer under name [icu_tokenizer]"}],"type":"illegal_argument_exception","reason":"Custom Analyzer [autocomplete] failed to find tokenizer under name [icu_tokenizer]"},"status":400} in /var/www/html/phpMyFAQ/src/libs/elasticsearch/elasticsearch/src/Elasticsearch/Connections/Connection.php:632
Stack trace:
#0 /var/www/html/phpMyFAQ/src/libs/elasticsearch/elasticsearch/src/Elasticsearch/Connections/Connection.php(317): Elasticsearch\Connections\Connection->process4xxError(Array, Array, Array)
#1 /var/www/html/phpMyFAQ/src/libs/react/promise/src/FulfilledPromise.php(28): Elasticsearch\Connections\Connection->Elasticsearch\Connections{closure}(Array)
#2 /var/www/html/phpMyFAQ/src/libs/ezimuel/ringphp/src/Future/CompletedFutureValue.php(55): React\Promise\FulfilledPromise->then(Object(Closure), NULL, NULL)
#3 /var/www/html/phpMyFAQ in /var/www/html/phpMyFAQ/src/libs/elasticsearch/elasticsearch/src/Elasticsearch/Connections/Connection.php on line 632

thorsten · 2021-02-03T07:27:52Z

Ah, sorry, it looks like you have to install the ICU support via a plugin:

sudo bin/elasticsearch-plugin install analysis-icu

herdianabdillah · 2021-02-03T07:42:45Z

I installed the ICU support, created the index successfully, and ran a full import.
I tried the search again, but it is still not working.

thorsten · 2021-02-03T16:37:24Z

Could you please try this example?

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-thai-tokenizer.html

herdianabdillah · 2021-02-04T13:03:57Z

I changed the tokenizer to "thai", but the behavior is still the same: I can search with the first 5 characters but can't find the FAQ with the last 5 characters.

thorsten · 2021-02-04T17:05:31Z

Did you try the example mentioned in the Elasticsearch documentation?

herdianabdillah · 2021-02-08T04:51:18Z

Hi Thorsten, the Elasticsearch documentation for the Thai tokenizer has no example configuration, but judging from the configuration of other tokenizers, the point is to change the tokenizer value to the one we want to use, right? Correct me if I'm wrong, because I'm not really good at Elasticsearch.
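
One way to compare tokenizers without re-creating the index is the Elasticsearch `_analyze` API, which returns the tokens a given tokenizer produces for a sample text. The sketch below only builds the request bodies; `buildAnalyzeBody()` is a hypothetical helper, and actually sending the requests (for example with `curl` or the PHP client) is left out because it depends on the local setup. `icu_tokenizer` is only available once the `analysis-icu` plugin is installed.

```php
<?php

/**
 * Builds the JSON body for a POST request to the _analyze endpoint.
 * Hypothetical helper, for illustration only.
 */
function buildAnalyzeBody(string $tokenizer, string $text): string
{
    $json = json_encode(
        ['tokenizer' => $tokenizer, 'text' => $text],
        JSON_UNESCAPED_UNICODE | JSON_PRETTY_PRINT
    );
    if ($json === false) {
        throw new RuntimeException(json_last_error_msg());
    }
    return $json;
}

// The phrase from the FAQ body in this issue.
$text = 'ปรับการปฏิบัติงานโดยรวม';

foreach (['standard', 'thai', 'icu_tokenizer'] as $tokenizer) {
    echo "POST /_analyze\n", buildAnalyzeBody($tokenizer, $text), "\n\n";
}
```

If the response for a tokenizer lists the phrase as a single token, only its leading characters can match through the `edge_ngram` filter; if it lists several shorter tokens, a search can match from the start of each of them.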

thorsten · 2021-02-08T05:41:46Z

I'll try to reproduce it on my test installation.

thorsten · 2021-02-08T19:31:23Z

I tried it with the ICU Tokenizer on my local v3.1 installation using Elasticsearch 7.10. The search worked for

the full string "ปรับการปฏิบัติงานโดยรวม"
the string "รับการปฏิบัติงานโดยรวม"
the string "บการปฏิบัติงานโดยรวม"

Removing one more character results in an empty search result.

herdianabdillah · 2021-02-09T00:05:37Z

Can you try with only 5 to 8 characters, counted from the last character? So with the latest development version and the latest Elasticsearch, it works fine with the full string and with 1 or 2 characters removed from the full string. Will it also work like that in the upcoming 3.0.8 release?

thorsten · 2021-02-09T07:32:25Z

I moved it to the next version as I need the possibility for the users to configure which tokenizer will be used. So we need a new configuration to handle this.
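
A configurable tokenizer could look roughly like the following sketch. It is not the change that went into phpMyFAQ; `buildAutocompleteAnalysis()` and its parameters are hypothetical names, and the filter and analyzer names are simply carried over from the `getParams()` function quoted earlier in this thread.

```php
<?php

/**
 * Builds the 'analysis' part of the index settings with a
 * caller-supplied tokenizer instead of the hard-coded 'standard'.
 * Hypothetical helper, for illustration only.
 */
function buildAutocompleteAnalysis(string $tokenizer, string $stemmerLanguage): array
{
    return [
        'filter' => [
            'autocomplete_filter' => [
                'type' => 'edge_ngram',
                'min_gram' => 1,
                'max_gram' => 20,
            ],
            'Language_stemmer' => [
                'type' => 'stemmer',
                'name' => $stemmerLanguage,
            ],
        ],
        'analyzer' => [
            'autocomplete' => [
                'type' => 'custom',
                // e.g. 'standard', 'thai', or 'icu_tokenizer' (needs analysis-icu)
                'tokenizer' => $tokenizer,
                'filter' => ['lowercase', 'autocomplete_filter', 'Language_stemmer'],
            ],
        ],
    ];
}

// Example values; a real installation would read them from its configuration.
$analysis = buildAutocompleteAnalysis('icu_tokenizer', 'english');
echo json_encode($analysis, JSON_PRETTY_PRINT), "\n";
```

Documents that are already indexed keep the tokens produced by the old tokenizer, so switching the value still means re-creating the index and re-importing the content.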

thorsten · 2021-02-09T07:33:05Z

The v3.1 release will be a drop-in replacement for 3.0, so there is no need to change the templates.

herdianabdillah · 2021-02-10T04:10:41Z

Thank you, can't wait for the new version :)

I also found another issue with MSSQL as the database, but I already fixed it in the code and in the column data types. I don't know whether it is caused by the collation or not, because the SQL Server I use doesn't have any collation with the _utf8 suffix; I just use the default collation.

Here is an example of the issue: we can't insert Chinese, Japanese, or Thai characters in MSSQL because the data type is "varchar". I changed it to "nvarchar" and also added N before the string in the query. Should I open a new issue for this?
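
The `N` prefix mentioned above marks a string literal as Unicode in SQL Server; without it, the literal is converted through the database's code page before it reaches an `nvarchar` column. A minimal sketch of such a literal builder follows; `quoteUnicodeLiteral()` and the table name are hypothetical, not phpMyFAQ code, and bound parameters are generally preferable to building literals by hand.

```php
<?php

/**
 * Wraps a string in an N'...' Unicode literal for SQL Server,
 * doubling single quotes so the literal stays well-formed.
 * Hypothetical helper, for illustration only.
 */
function quoteUnicodeLiteral(string $value): string
{
    return "N'" . str_replace("'", "''", $value) . "'";
}

// faq_example is a made-up table name for this sketch.
$sql = 'INSERT INTO faq_example (question) VALUES ('
    . quoteUnicodeLiteral('ปรับการปฏิบัติงานโดยรวม') . ')';

echo $sql, "\n";
```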

thorsten · 2021-02-10T06:04:05Z

@herdianabdillah a new issue would be awesome. I would change the CREATE TABLE statement for MS SQL.

thorsten self-assigned this Feb 3, 2021

thorsten added the Bug label Feb 3, 2021

thorsten added this to To do in 3.0.8 via automation Feb 3, 2021

thorsten added this to the 3.0.8 milestone Feb 3, 2021

thorsten added this to To do in 3.1.0-beta via automation Feb 8, 2021

thorsten removed this from To do in 3.0.8 Feb 8, 2021

thorsten modified the milestones: 3.0.8, 3.1 Feb 8, 2021

thorsten moved this from To do to In progress in 3.1.0-beta Feb 13, 2021

thorsten closed this as completed in c8ad78a Mar 2, 2021

3.1.0-beta automation moved this from In progress to Done Mar 2, 2021
