Elasticsearch issue when search Thai language #1887

Closed
herdianabdillah opened this issue Feb 3, 2021 · 22 comments

@herdianabdillah

Describe the bug
Searching for some Thai words does not work. This is an example sentence I have in my FAQ body: ปรับการปฏิบัติงานโดยรวม
The search finds the FAQ when I enter the first 5 characters, but when I enter the last 5 characters the FAQ is not found.

To Reproduce
Steps to reproduce the behavior:

  1. Go to the home page and switch to the Thai language
  2. Paste the last 5 characters into the search field; the FAQ is not found

Expected behavior
The FAQ should be found with either the first 5 or the last 5 characters, as long as it contains the word.

Screenshots
This is the FAQ:
image

This is the result when I search with the first 5 characters:
image

This is the result when I search with the last 5 characters:
image

phpMyFAQ (please complete the following information):

  • phpMyFAQ version : 3.0.4
  • PHP version : 7
  • Database : MySQL
  • Elasticsearch yes/no - Yes
  • Elasticsearch version : 6.8.12
@thorsten thorsten self-assigned this Feb 3, 2021
@thorsten thorsten added the Bug label Feb 3, 2021
@thorsten thorsten added this to To do in 3.0.8 via automation Feb 3, 2021
@thorsten thorsten added this to the 3.0.8 milestone Feb 3, 2021
@thorsten
Owner

thorsten commented Feb 3, 2021

Does the Thai language have spaces between the words?

@thorsten
Owner

thorsten commented Feb 3, 2021

Looks like you need a different Tokenizer for the Thai language: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-thai-tokenizer.html

@herdianabdillah
Author

> Does the Thai language have spaces between the words?

It seems the Thai language does not have spaces between words.

@herdianabdillah
Author

> Looks like you need a different Tokenizer for the Thai language: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-thai-tokenizer.html

So should I drop the current index?
And in this function, should I change the tokenizer to an array and add thai to it? I still want to be able to search other languages such as Chinese, Japanese, etc.

private function getParams()
{
    global $PMF_ELASTICSEARCH_STEMMING_LANGUAGE;

    return [
        'index' => $this->esConfig['index'],
        'body' => [
            'settings' => [
                'number_of_shards' => PMF_ELASTICSEARCH_NUMBER_SHARDS,
                'number_of_replicas' => PMF_ELASTICSEARCH_NUMBER_REPLICAS,
                'analysis' => [
                    'filter' => [
                        'autocomplete_filter' => [
                            'type' => 'edge_ngram',
                            'min_gram' => 1,
                            'max_gram' => 20
                        ],
                        'Language_stemmer' => [
                            'type' => 'stemmer',
                            'name' => $PMF_ELASTICSEARCH_STEMMING_LANGUAGE[$this->config->getDefaultLanguage()]
                        ]
                    ],
                    'analyzer' => [
                        'autocomplete' => [
                            'type' => 'custom',
                            'tokenizer' => 'standard',
                            'filter' => [
                                'lowercase',
                                'autocomplete_filter',
                                'Language_stemmer'
                            ]
                        ]
                    ]
                ]
            ]
        ]
    ];
}
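
For reference, the `edge_ngram` filter configured in these settings only emits prefixes of each token, from `min_gram` to `max_gram` characters long. The sketch below is a plain-PHP illustration of that behaviour, not Elasticsearch code. The helper name `edgeNgrams` is hypothetical, and the sketch assumes that the tokenizer passes the whole Thai phrase to the filter as a single token, which is one plausible explanation for why the first characters match while the last ones do not.

```php
<?php
// Hypothetical helper: mimics what an edge_ngram token filter emits for ONE token.
// Requires the mbstring extension so that Thai characters are counted correctly.
function edgeNgrams(string $token, int $minGram = 1, int $maxGram = 20): array
{
    $grams = [];
    $length = mb_strlen($token, 'UTF-8');
    for ($size = $minGram; $size <= min($maxGram, $length); $size++) {
        // Every gram starts at position 0, so only prefixes are produced.
        $grams[] = mb_substr($token, 0, $size, 'UTF-8');
    }
    return $grams;
}

// Assumption: the tokenizer did not split the phrase, so it is a single token.
$token = 'ปรับการปฏิบัติงานโดยรวม';
$grams = edgeNgrams($token);

$firstFive = mb_substr($token, 0, 5, 'UTF-8');
$lastFive  = mb_substr($token, -5, null, 'UTF-8');

var_dump(in_array($firstFive, $grams, true)); // bool(true): a prefix of the token
var_dump(in_array($lastFive, $grams, true));  // bool(false): a suffix is never indexed
```

If the tokenizer splits the phrase into several words, each word gets its own set of prefixes, so a query only needs to start at a word boundary rather than at the beginning of the sentence.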

@thorsten
Owner

thorsten commented Feb 3, 2021

I think this Tokenizer would be the best: https://www.elastic.co/guide/en/elasticsearch/plugins/7.10/analysis-icu-tokenizer.html

@thorsten
Owner

thorsten commented Feb 3, 2021

So, it would be cool if you could test it. Use the ICU Tokenizer, drop the index and re-create it and re-index the content.

@herdianabdillah
Author

> I think this Tokenizer would be the best: https://www.elastic.co/guide/en/elasticsearch/plugins/7.10/analysis-icu-tokenizer.html

Okay, so I just need to change "standard" to "icu_tokenizer", right? Let me try it.

@thorsten
Owner

thorsten commented Feb 3, 2021

Yes, seems so :-)
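
The change discussed here, sketched as the `analyzer` part of the settings array returned by `getParams()`. Only the `tokenizer` value differs from the function quoted earlier in the thread. The variable name `$analyzer` is only for illustration, and the sketch assumes the `analysis-icu` plugin is installed on the Elasticsearch node.

```php
<?php
// Sketch of the 'analyzer' entry only; everything else in getParams() stays unchanged.
$analyzer = [
    'autocomplete' => [
        'type' => 'custom',
        // Was 'standard'. 'icu_tokenizer' is provided by the analysis-icu plugin.
        'tokenizer' => 'icu_tokenizer',
        'filter' => [
            'lowercase',
            'autocomplete_filter',
            'Language_stemmer',
        ],
    ],
];
```

Analyzer settings are fixed when an index is created, so the existing index has to be dropped, re-created, and re-indexed for the new tokenizer to take effect.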

@herdianabdillah
Author

Hi Thorsten,

I got this error when I clicked "Create index":


Fatal error: Uncaught Elasticsearch\Common\Exceptions\BadRequest400Exception: {"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"Custom Analyzer [autocomplete] failed to find tokenizer under name [icu_tokenizer]"}],"type":"illegal_argument_exception","reason":"Custom Analyzer [autocomplete] failed to find tokenizer under name [icu_tokenizer]"},"status":400} in /var/www/html/phpMyFAQ/src/libs/elasticsearch/elasticsearch/src/Elasticsearch/Connections/Connection.php:632
Stack trace:
#0 /var/www/html/phpMyFAQ/src/libs/elasticsearch/elasticsearch/src/Elasticsearch/Connections/Connection.php(317): Elasticsearch\Connections\Connection->process4xxError(Array, Array, Array)
#1 /var/www/html/phpMyFAQ/src/libs/react/promise/src/FulfilledPromise.php(28): Elasticsearch\Connections\Connection->Elasticsearch\Connections{closure}(Array)
#2 /var/www/html/phpMyFAQ/src/libs/ezimuel/ringphp/src/Future/CompletedFutureValue.php(55): React\Promise\FulfilledPromise->then(Object(Closure), NULL, NULL)
#3 /var/www/html/phpMyFAQ in /var/www/html/phpMyFAQ/src/libs/elasticsearch/elasticsearch/src/Elasticsearch/Connections/Connection.php on line 632

@thorsten
Owner

thorsten commented Feb 3, 2021

Ah, sorry, it looks like you have to install ICU support as a plugin:

sudo bin/elasticsearch-plugin install analysis-icu

@herdianabdillah
Author

I installed ICU support, then successfully created the index and ran a full import.
I tried the search again, but it is still not working.

@thorsten
Owner

thorsten commented Feb 3, 2021

@herdianabdillah
Author

I changed the tokenizer to "thai", but the behavior is the same: I can search with the first 5 characters but cannot find the FAQ with the last 5 characters.

@thorsten
Owner

thorsten commented Feb 4, 2021

Did you try the example mentioned in the Elasticsearch documentation?

@herdianabdillah
Author

Hi Thorsten, the Elasticsearch documentation for the Thai tokenizer has no example configuration, but judging from the configuration of other tokenizers, the point is to change the tokenizer value to the one we want to use, right? Correct me if I'm wrong, because I'm not very experienced with Elasticsearch.
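
One way to check how a given tokenizer splits the example sentence, independent of phpMyFAQ, is the Elasticsearch `_analyze` API. Below is a minimal sketch using the elasticsearch-php client that appears in the stack trace earlier in the thread. The autoloader path and the host `localhost:9200` are assumptions and need to be adapted to the actual installation.

```php
<?php
// Assumption: Composer autoloader with the elasticsearch/elasticsearch package available.
require 'vendor/autoload.php';

use Elasticsearch\ClientBuilder;

$client = ClientBuilder::create()
    ->setHosts(['localhost:9200']) // assumption: local single-node cluster
    ->build();

// Ask Elasticsearch to tokenize the text without indexing anything.
$response = $client->indices()->analyze([
    'body' => [
        'tokenizer' => 'thai', // try 'standard' or 'icu_tokenizer' to compare
        'text' => 'ปรับการปฏิบัติงานโดยรวม',
    ],
]);

// Print one token per line to see where the tokenizer placed word boundaries.
foreach ($response['tokens'] as $token) {
    echo $token['token'], PHP_EOL;
}
```

Comparing the token lists for the different tokenizers shows whether the sentence is actually being split into words before the `edge_ngram` filter runs.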

@thorsten
Owner

thorsten commented Feb 8, 2021

I'll try to reproduce it on my test installation.

@thorsten thorsten added this to To do in 3.1.0-beta via automation Feb 8, 2021
@thorsten thorsten removed this from To do in 3.0.8 Feb 8, 2021
@thorsten thorsten modified the milestones: 3.0.8, 3.1 Feb 8, 2021
@thorsten
Owner

thorsten commented Feb 8, 2021

I tried it with the ICU Tokenizer on my local v3.1 installation using Elasticsearch 7.10. The search worked for

  • the full string "ปรับการปฏิบัติงานโดยรวม"
  • the string "รับการปฏิบัติงานโดยรวม"
  • the string "บการปฏิบัติงานโดยรวม"

Removing one more character results in an empty search result.

@herdianabdillah
Author

herdianabdillah commented Feb 9, 2021

Can you try with only the last 5 to 8 characters? So with the latest development version and the latest Elasticsearch, it works with the full string and with 1 or 2 characters removed from the full string. Will it also work like that in the upcoming 3.0.8 release?

@thorsten
Owner

thorsten commented Feb 9, 2021

I moved it to the next version because users need to be able to configure which tokenizer is used, so we need a new configuration option to handle this.

@thorsten
Owner

thorsten commented Feb 9, 2021

The v3.1 release will be a drop-in replacement for 3.0, so there is no need to change the templates.

@herdianabdillah
Author

Thank you, I can't wait for the new version :)

I also found another issue with MSSQL as the database, which I have already fixed in the code and by changing the data type in the table. I don't know whether it is caused by the collation or not, because the SQL Server I use has no collation with a _utf8 suffix; I just use the default collation.

Here is an example of the issue: we can't insert Chinese, Japanese, or Thai characters in MSSQL because the data type is "varchar". I changed it to "nvarchar" and added N before the strings in the queries. Should I open a new issue for this?
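
A rough sketch of the two changes described here, written as SQL strings in PHP. The table name `example_faq` and column name `question` are hypothetical and are not taken from the phpMyFAQ schema.

```php
<?php
// Hypothetical table and column names, for illustration only.

// Change the column type from varchar to nvarchar so it can store Unicode text.
$alterColumn = 'ALTER TABLE example_faq ALTER COLUMN question NVARCHAR(255) NOT NULL';

// The N prefix marks the literal as Unicode. Without it, SQL Server converts the
// literal to the database code page first, and unsupported characters can become "?".
$insertUnicode = "INSERT INTO example_faq (question) VALUES (N'ปรับการปฏิบัติงานโดยรวม')";
```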

@thorsten
Copy link
Owner

@herdianabdillah a new issue would be awesome. I would change the CREATE TABLE statement for MS SQL.

@thorsten thorsten moved this from To do to In progress in 3.1.0-beta Feb 13, 2021
3.1.0-beta automation moved this from In progress to Done Mar 2, 2021