Irregular results #114

Open
stell opened this issue Mar 9, 2021 · 21 comments

@stell

stell commented Mar 9, 2021

Everything is up to date and freshly indexed. I have two pages containing the word "geokoordinaten", which is German for "geocoordinates".

When searching for "geokoordinaten" I get 2 results, which is correct.

"geokoordina" outputs 0 results.
"geokoo" outputs 2 results again.

On the other hand, if I search for "geoko" I get 2 results again, but for different pages. Shouldn't this return 4 pages?

I don't know whether this is a bug or something is set up badly, but I don't see what the problem is.

You can try it out here:
https://www.jfewo.de/docs/de/suche?q=geokoordinaten

This is the config:

enabled: true
search_route: /suche
query_route: /s
built_in_css: true
built_in_js: true
built_in_search_page: true
enable_admin_page_events: true
search_type: auto
fuzzy: false
phrases: true
stemmer: german
display_route: true
display_hits: true
display_time: true
live_uri_update: true
limit: "20"
min: "4"
snippet: "300"
index_page_by_default: true
scheduled_index:
  enabled: false
  at: "* 2 * * 1-7"
  logs: logs/tntsearch-index.out
filter:
  items:
    - root@.descendants
powered_by: true
search_object_type: Grav

@nqb

nqb commented Mar 12, 2021

Hello,

I noticed the same kind of issue on YunoHost's documentation.
If I search for "revers", I get 5 pages with "reverse" or "Reverse" results, but if I add an "e" (for "reverse"), I get only one result.

@ViliusS
Contributor

ViliusS commented Mar 31, 2021

Try disabling the stemmer and rebuilding the index after that. See if that helps.

@stell
Author

stell commented Mar 31, 2021

Already tried that. Same irregular results.

@ViliusS
Contributor

ViliusS commented Mar 31, 2021

Just tried your link and it shows the same 2 results for "geokoordina" and "geokoo".

@stell
Author

stell commented Mar 31, 2021

Cannot confirm.
"geokoordina" outputs 0 results
"geokoo" outputs 2 results
"geoko" outputs 2 (different) results

In fact, all of these should return more than 2 results.

@ViliusS
Contributor

ViliusS commented Mar 31, 2021

I don't know enough about the German language, but the difference between "geokoordina" and "geokoordinat" really looks like a stemmer issue. Setting stemmer to 'no' should return the same results for both queries.

@stell
Author

stell commented Mar 31, 2021

Yeah, I thought the same.

@ViliusS
Contributor

ViliusS commented Mar 31, 2021

Without stemmer:

[]# php -r " include('Stemmer.php'); require_once ('NoStemmer.php'); echo TeamTNT\TNTSearch\Stemmer\NoStemmer::stem('geokoordina').PHP_EOL;"
geokoordina

[]# php -r " include('Stemmer.php'); require_once ('NoStemmer.php'); echo TeamTNT\TNTSearch\Stemmer\NoStemmer::stem('geokoordinat').PHP_EOL;"
geokoordinat

[]# php -r " include('Stemmer.php'); require_once ('NoStemmer.php'); echo TeamTNT\TNTSearch\Stemmer\NoStemmer::stem('geokoordinaten').PHP_EOL;"
geokoordinaten

With stemmer:
[]# php -r " include('Stemmer.php'); require_once ('GermanStemmer.php'); echo TeamTNT\TNTSearch\Stemmer\GermanStemmer::stem('geokoordina').PHP_EOL;"
geokoordina

[]# php -r " include('Stemmer.php'); require_once ('GermanStemmer.php'); echo TeamTNT\TNTSearch\Stemmer\GermanStemmer::stem('geokoordinat').PHP_EOL;"
geokoordinat

[]# php -r " include('Stemmer.php'); require_once ('GermanStemmer.php'); echo TeamTNT\TNTSearch\Stemmer\GermanStemmer::stem('geokoordinaten').PHP_EOL;"
geokoordinat

Check whether you really disabled the stemmer, i.e. set it to 'no', because at least the last difference is a stemmer issue. Stemming is a complex beast, so to debug the issue properly I would disable it for now.
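
The checks above can be mirrored in a small, self-contained sketch. It is written in Python rather than PHP and does not use TNTSearch. `no_stemmer` and `toy_suffix_stemmer` are hypothetical stand-ins: the toy one only strips a trailing "en", which is enough to reproduce the `geokoordinaten` to `geokoordinat` output shown above, and it is not the real German stemmer. The point it illustrates, under the assumption that terms are looked up exactly, is that the stemmer used at index time and the one used at query time need to agree.

```python
def no_stemmer(word):
    """Stand-in for a stemmer that leaves words untouched (apart from lowercasing)."""
    return word.lower()


def toy_suffix_stemmer(word):
    """Hypothetical stemmer: strips a trailing 'en' only (NOT the real German stemmer)."""
    word = word.lower()
    return word[:-2] if word.endswith("en") else word


def build_index(words, stem):
    """Return the set of terms an index would store for the given words."""
    return {stem(w) for w in words}


def exact_hit(index, query, stem):
    """True if the stemmed query equals a stored term (exact lookup, no prefix matching)."""
    return stem(query) in index


page_words = ["Geokoordinaten"]

stemmed_index = build_index(page_words, toy_suffix_stemmer)
plain_index = build_index(page_words, no_stemmer)

# Same stemmer on both sides: the full word is found.
same_side = exact_hit(stemmed_index, "geokoordinaten", toy_suffix_stemmer)
# Index built with the toy stemmer, query left unstemmed: the lookup misses.
mixed_side = exact_hit(stemmed_index, "geokoordinaten", no_stemmer)
# No stemming anywhere: the full word is found again.
plain_side = exact_hit(plain_index, "geokoordinaten", no_stemmer)

print(same_side, mixed_side, plain_side)
```

If an index was built with one stemmer and is queried with another, the middle case is the one to expect, which is why rebuilding the index after changing the setting matters.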

@stell
Author

stell commented Mar 31, 2021

As I said, I tested it without the stemmer at the beginning and two more times after. The results are the same.
I tested setting stemmer: no and stemmer: "no", which is saved from the admin backend.

@mbirth

mbirth commented Jul 19, 2021

Same here:

Screenshot 2021-07-19 at 12 31 54

But if I add the final "e", I get no results:

Screenshot 2021-07-19 at 12 32 01

My config:

enabled: true
search_route: /search
query_route: /s
built_in_css: true
built_in_js: true
built_in_search_page: true
enable_admin_page_events: true
search_type: auto
fuzzy: true
phrases: true
stemmer: "no"
display_route: true
display_hits: true
display_time: true
live_uri_update: true
limit: '20'
min: '3'
snippet: '300'
index_page_by_default: true
scheduled_index:
  enabled: false
  at: '30 3 * * *'
  logs: logs/tntsearch-index.out
filter:
  items:
    - root@.descendants
  published: true
powered_by: false
search_object_type: Grav

PHP:

PHP 7.4.21 (cli) (built: Jun 29 2021 15:17:15) ( NTS )
Copyright (c) The PHP Group
Zend Engine v3.4.0, Copyright (c) Zend Technologies
    with Zend OPcache v7.4.21, Copyright (c), by Zend Technologies

Grav and TNTSearch are up to date.

@ViliusS
Contributor

ViliusS commented Jul 19, 2021

Did you rebuild the index after disabling the stemmer? Delete the old index file completely.

Again this looks like a German stemmer issue:
[]# php -r " include('Stemmer.php'); require_once ('GermanStemmer.php'); echo TeamTNT\TNTSearch\Stemmer\GermanStemmer::stem('spectr').PHP_EOL;"
spectr

[]# php -r " include('Stemmer.php'); require_once ('GermanStemmer.php'); echo TeamTNT\TNTSearch\Stemmer\GermanStemmer::stem('spectre').PHP_EOL;"
spectr

[]# php -r " include('Stemmer.php'); require_once ('NoStemmer.php'); echo TeamTNT\TNTSearch\Stemmer\NoStemmer::stem('spectr').PHP_EOL;"
spectr

[]# php -r " include('Stemmer.php'); require_once ('NoStemmer.php'); echo TeamTNT\TNTSearch\Stemmer\NoStemmer::stem('spectre').PHP_EOL;"
spectre

@mbirth

mbirth commented Jul 19, 2021

Of course, I've rebuilt the index several times. And the German stemmer was never activated at any time. I've even deleted the index files before reindexing to make sure they are built from scratch.

...
Added   335 /de/software/autohotkey/u3helper/tutorial
Added   336 /de/software/autohotkey/u3helper/u3helper-vs-packagefactory
Total rows 336


Indexed in 6.8s
mbirth@server /...> bin/plugin tntsearch query Spectre
{
    "number_of_hits": 0,
    "execution_time": "1.0729 ms"
}
mbirth@server /...>

EDIT: It looks like the indexing process does something weird, as I can't find the word "Spectre" in the index:

sqlite> select * from wordlist where term="spectre";
sqlite> select * from wordlist where term="spectr";
id|term|num_hits|num_docs
5201|spectr|9|4
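
The same check can be scripted. The sketch below uses Python's built-in `sqlite3` module against an in-memory table so that it runs anywhere. The table layout is an assumption inferred only from the column names in the output above (`id`, `term`, `num_hits`, `num_docs`), and the single row is the one shown there. `lookup_terms` is a hypothetical helper name. Against a real index you would connect to the index file instead; its path is not given in this thread.

```python
import sqlite3


def lookup_terms(conn, terms):
    """Return {term: row or None} for exact matches in the wordlist table."""
    result = {}
    for term in terms:
        row = conn.execute(
            "SELECT id, term, num_hits, num_docs FROM wordlist WHERE term = ?",
            (term,),
        ).fetchone()
        result[term] = row
    return result


# In-memory stand-in for the index; schema assumed from the output columns above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE wordlist ("
    "id INTEGER PRIMARY KEY, term TEXT, num_hits INTEGER, num_docs INTEGER)"
)
conn.execute("INSERT INTO wordlist VALUES (5201, 'spectr', 9, 4)")

found = lookup_terms(conn, ["spectre", "spectr"])
for term, row in found.items():
    print(term, "->", row)
conn.close()
```

A stored `spectr` with no `spectre` row would be consistent with a stemmer having been applied while the index was built.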

@stell
Author

stell commented Jul 19, 2021

Same problem is still present here.

@mbirth

mbirth commented Jul 19, 2021

Okay, I don't know why, but it seems the Indexer still used the PorterStemmer even though I had "no" in my config. Now after changing the value via the Grav Admin interface and then setting it back to "no" via text editor (the same thing I did yesterday), it seems to work correctly and the word "Spectre" is indexed fine.

On a side note: selecting "Disable" in the Grav Admin turns the YAML into stemmer: no, which translates into stemmer: false or stemmer: 0 and makes the indexer try to load a class 0Stemmer, which fails. The correct entry has to be stemmer: 'no' for it to work.
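
The class lookup described above can be sketched as follows. This is a Python stand-in, not the library code. The naming pattern (namespace, then the lowercased value with its first letter capitalised, then `Stemmer`) follows the `setLanguage` line quoted in the diff further down this thread, and the list of known names is taken from the doc comment in that same diff. `php_string`, `stemmer_class` and `is_loadable` are hypothetical helpers; `php_string` imitates how PHP casts scalars to strings (`false` becomes an empty string, `0` becomes `"0"`).

```python
NAMESPACE = "TeamTNT\\TNTSearch\\Stemmer\\"

# Names taken from the patched setLanguage() doc comment quoted in this thread.
KNOWN_STEMMERS = {
    "no", "arabic", "croatian", "german", "italian",
    "porter", "portuguese", "russian", "ukrainian",
}


def php_string(value):
    """Hypothetical helper: mimic PHP's scalar-to-string cast."""
    if value is True:
        return "1"
    if value is False or value is None:
        return ""
    return str(value)


def stemmer_class(value):
    """Build the class name the way ucfirst(strtolower($language)) . 'Stemmer' would."""
    name = php_string(value).lower()
    return NAMESPACE + name[:1].upper() + name[1:] + "Stemmer"


def is_loadable(value):
    """True only if the configured value names one of the known stemmers."""
    return php_string(value).lower() in KNOWN_STEMMERS


for configured in ("no", 0, False):
    print(repr(configured), stemmer_class(configured), is_loadable(configured))
```

Only the quoted string `'no'` yields a name from the known list; a value that has already been turned into `0` or `false` by the config layer does not.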

@ViliusS
Contributor

ViliusS commented Jul 19, 2021

Interesting indeed. Could be related to #116, which is unfortunately still waiting to be merged. Also check https://github.com/teamtnt/tntsearch/pull/243/files . I'm not sure which Grav version you are using and how up to date the TNTSearch library it includes is.

@bgdnlp

bgdnlp commented Aug 1, 2021

I can also confirm, on v3.3.1, Grav v1.7.18. Stemmer does make a difference, but disabling it doesn't fix the problem.

fuzzy is false

With stemmer set to English (porter):

  • enc finds encode, doesn't find encryption and encrypted
  • encr finds encryption and encrypted
  • encry doesn't find anything
  • encryp and encrypt find both encryption and encrypted
  • encrypte and encrypted find encrypted
  • encrypti and encryptio don't find anything
  • encryption finds encryption
    When it finds something, it finds both posts.

With stemmer set to 'no' or default:

  • encrypt finds encryption and encrypted, but only one article
  • encrypti to encryption find encryption in two articles

I don't know if I'm setting something wrong, but this is too unreliable.
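
One possible reading of part of the list above, offered as an assumption rather than a confirmed diagnosis: if the index stores a single stem such as `encrypt` for both "encryption" and "encrypted", and partial queries are matched as prefixes of stored terms, then a query that is longer than the stored stem cannot match. The Python sketch below models only that. `STORED_TERMS` and `prefix_hits` are hypothetical, it does not use TNTSearch, and it does not account for every row reported (for example `encry`).

```python
# Hypothetical stored terms: assumes "encryption" and "encrypted" were both
# reduced to "encrypt", and "encode" to "encod", at index time.
STORED_TERMS = ["encod", "encrypt"]


def prefix_hits(stored_terms, query):
    """Return stored terms that start with the query (simple prefix match)."""
    query = query.lower()
    return [term for term in stored_terms if term.startswith(query)]


queries = ["encryp", "encrypt", "encrypti", "encryptio"]
results = {q: prefix_hits(STORED_TERMS, q) for q in queries}

for q, hits in results.items():
    print(f"{q!r:12} -> {hits}")
```

Under these assumptions `encryp` and `encrypt` match the stored stem while `encrypti` and `encryptio` match nothing, which lines up with four of the rows reported for the English stemmer.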

@ViliusS
Contributor

ViliusS commented Aug 1, 2021

@bgdnlp try with patches in https://github.com/teamtnt/tntsearch/pull/243/files and #116

@ufukayyildiz

@bgdnlp try with patches in https://github.com/teamtnt/tntsearch/pull/243/files and #116

This fixed my problem. Thanks.

@thekenshow
Contributor

I've updated to the latest Grav 1.7.23 and applied the changes noted in #114 (comment), but I'm still not getting the desired results.

Test case is a search for "spk", which should return "spk1000" and "spk7457", but only the first appears:

Screen Shot 2021-10-25 at 4 31 03 PM

A search for "spk7" returns "spk7457", which should also appear in the previous search:

Screen Shot 2021-10-25 at 4 31 13 PM

I don't believe I've missed anything, but here is a diff showing the changes I've applied:

    diff --git a/user/config/plugins/tntsearch.yaml b/user/config/plugins/tntsearch.yaml
    index a1ea9789..05a15902 100644
    --- a/user/config/plugins/tntsearch.yaml
    +++ b/user/config/plugins/tntsearch.yaml
    @@ -8,7 +8,7 @@ enable_admin_page_events: true
     search_type: auto
     fuzzy: false
     phrases: true
    -stemmer: 'default'
    +stemmer: 'no'
     display_route: true
     display_hits: true
     display_time: true
    diff --git a/user/plugins/tntsearch/classes/GravTNTSearch.php b/user/plugins/tntsearch/classes/GravTNTSearch.php
    index f5a1082d..9e9a75ac 100644
    --- a/user/plugins/tntsearch/classes/GravTNTSearch.php
    +++ b/user/plugins/tntsearch/classes/GravTNTSearch.php
    @@ -42,7 +42,7 @@ class GravTNTSearch
             $locator = Grav::instance()['locator'];
     
             $search_type = $config->get('plugins.tntsearch.search_type', 'auto');
    -        $stemmer = $config->get('plugins.tntsearch.stemmer', 'default');
    +        $stemmer = $config->get('plugins.tntsearch.stemmer', 'no');
             $limit = $config->get('plugins.tntsearch.limit', 20);
             $snippet = $config->get('plugins.tntsearch.snippet', 300);
             $data_path = $locator->findResource('user://data', true) . '/tntsearch';
    @@ -225,8 +225,10 @@ class GravTNTSearch
             $this->tnt->setDatabaseHandle(new GravConnector);
             $indexer = $this->tnt->createIndex($this->index);
     
    -        // Set the stemmer language if set
    -        if ($this->options['stemmer'] !== 'default') {
    +        // Disable stemmer for users with older configuration.
    +        if ($this->options['stemmer'] == 'default') {
    +            $indexer->setLanguage('no');
    +        } else {
                 $indexer->setLanguage($this->options['stemmer']);
             }
     
    @@ -340,4 +342,4 @@ class GravTNTSearch
     
             return $fields;
         }
    -}
    +}
    \ No newline at end of file
    diff --git a/user/plugins/tntsearch/vendor/teamtnt/tntsearch/src/Classifier/TNTClassifier.php b/user/plugins/tntsearch/vendor/teamtnt/tntsearch/src/Classifier/TNTClassifier.php
    index b96e6dd1..50096aad 100644
    --- a/user/plugins/tntsearch/vendor/teamtnt/tntsearch/src/Classifier/TNTClassifier.php
    +++ b/user/plugins/tntsearch/vendor/teamtnt/tntsearch/src/Classifier/TNTClassifier.php
    @@ -2,7 +2,7 @@
     
     namespace TeamTNT\TNTSearch\Classifier;
     
    -use TeamTNT\TNTSearch\Stemmer\PorterStemmer;
    +use TeamTNT\TNTSearch\Stemmer\NoStemmer;
     use TeamTNT\TNTSearch\Support\Tokenizer;
     
     class TNTClassifier
    @@ -18,7 +18,7 @@ class TNTClassifier
         public function __construct()
         {
             $this->tokenizer = new Tokenizer;
    -        $this->stemmer   = new PorterStemmer;
    +        $this->stemmer   = new NoStemmer;
         }
     
         public function predict($statement)
    @@ -128,4 +128,4 @@ class TNTClassifier
             $this->tokenizer = $classifier->tokenizer;
             $this->stemmer   = $classifier->stemmer;
         }
    -}
    +}
    \ No newline at end of file
    diff --git a/user/plugins/tntsearch/vendor/teamtnt/tntsearch/src/Indexer/TNTIndexer.php b/user/plugins/tntsearch/vendor/teamtnt/tntsearch/src/Indexer/TNTIndexer.php
    index 1742d3ae..8182d4aa 100644
    --- a/user/plugins/tntsearch/vendor/teamtnt/tntsearch/src/Indexer/TNTIndexer.php
    +++ b/user/plugins/tntsearch/vendor/teamtnt/tntsearch/src/Indexer/TNTIndexer.php
    @@ -13,7 +13,7 @@ use TeamTNT\TNTSearch\Connectors\SQLiteConnector;
     use TeamTNT\TNTSearch\Connectors\SqlServerConnector;
     use TeamTNT\TNTSearch\FileReaders\TextFileReader;
     use TeamTNT\TNTSearch\Stemmer\CroatianStemmer;
    -use TeamTNT\TNTSearch\Stemmer\PorterStemmer;
    +use TeamTNT\TNTSearch\Stemmer\NoStemmer;
     use TeamTNT\TNTSearch\Support\Collection;
     use TeamTNT\TNTSearch\Support\Tokenizer;
     use TeamTNT\TNTSearch\Support\TokenizerInterface;
    @@ -41,7 +41,7 @@ class TNTIndexer
     
         public function __construct()
         {
    -        $this->stemmer    = new PorterStemmer;
    +        $this->stemmer    = new NoStemmer;
             $this->tokenizer  = new Tokenizer;
             $this->filereader = new TextFileReader;
         }
    @@ -71,7 +71,7 @@ class TNTIndexer
             if (!isset($this->config['driver'])) {
                 $this->config['driver'] = "";
             }
    -    
    +
             if (!isset($this->config['wal'])) {
                 $this->config['wal'] = true;
             }
    @@ -131,9 +131,9 @@ class TNTIndexer
         }
     
         /**
    -     * @param string $language  - one of: arabic, croatian, german, italian, porter, russian, ukrainian
    +     * @param string $language  - one of: no, arabic, croatian, german, italian, porter, portuguese, russian, ukrainian
          */
    -    public function setLanguage($language = 'porter')
    +    public function setLanguage($language = 'no')
         {
             $class = 'TeamTNT\\TNTSearch\\Stemmer\\'.ucfirst(strtolower($language)).'Stemmer';
             $this->setStemmer(new $class);
    @@ -178,7 +178,7 @@ class TNTIndexer
             $this->index = new PDO('sqlite:'.$this->config['storage'].$indexName);
             $this->index->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
     
    -        if($this->config['wal']) {
    +        if ($this->config['wal']) {
                 $this->index->exec("PRAGMA journal_mode=wal;");
             }
     
    @@ -306,7 +306,7 @@ class TNTIndexer
                 if ($counter % 10000 == 0) {
                     $this->index->commit();
                     $this->index->beginTransaction();
    -                $this->info("Commited");
    +                $this->info("Committed");
                 }
             }
             $this->index->commit();
    @@ -692,4 +692,4 @@ class TNTIndexer
                 echo $text.PHP_EOL;
             }
         }
    -}
    +}
    \ No newline at end of file
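
The intent of the `GravTNTSearch.php` change in this diff, namely that an older `stemmer: 'default'` value should now mean "no stemming", can be sketched in isolation. This is a Python stand-in and not plugin code; `resolve_stemmer` and the plain `options` dictionary are hypothetical names used only for illustration.

```python
def resolve_stemmer(options):
    """Map the configured stemmer to the language passed on to the indexer.

    Mirrors the patched branch: 'default' (older configs) or a missing key
    resolves to 'no'; any other value is passed through unchanged.
    """
    configured = options.get("stemmer", "no")
    return "no" if configured == "default" else configured


cases = [{"stemmer": "default"}, {"stemmer": "no"}, {"stemmer": "german"}, {}]
resolved = [resolve_stemmer(case) for case in cases]
print(resolved)
```

Note that this only covers string values; a value that the config layer has already converted to `false` or `0` would pass through unchanged here as well.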

@ViliusS
Contributor

ViliusS commented Oct 26, 2021

@thekenshow your case is different from this issue. This issue deals with the stemmer, which operates only on normal words. If numbers are involved, you should create a separate issue.

@thekenshow
Contributor

Ah, good to know, thanks. Filed a new issue.
