Search issues #216

Closed
peteruithoven opened this issue Feb 3, 2019 · 10 comments

@peteruithoven

I'm on elementary OS (Built on Ubuntu 18.04 LTS)
AppStream version: 0.12.5

I noticed a search issue in the elementary OS AppCenter (elementary/appcenter#942), which is apparently reproducible when using appstreamcli search directly.

Summary of search results:
cal / calc / calcu / calcul / calculator: Expected results, such as all sorts of calculator apps.
calcula / calculat: Only 2 seemingly irrelevant results (these were also included in the results above).
calculato: No results.

Searching calcul (using grep for brevity):

$ appstreamcli search calcul | grep calculator
Icon: mate-calc_accessories-calculator.png
Package: gnome-calculator
Icon: gnome-calculator_gnome-calculator.png
Identifier: deepin-calculator.desktop [desktop-application]
Package: deepin-calculator
Icon: deepin-calculator_deepin-calculator.png
Identifier: pcbcalculator.desktop [desktop-application]
Icon: kicad_pcbcalculator.png
Identifier: org.kde.plasma.calculator [generic]
Icon: plasma-widgets-addons_accessories-calculator.png
Identifier: io.elementary.calculator.desktop [desktop-application]
Package: pantheon-calculator
Icon: accessories-calculator
Summary: A simple calculator for chemistry
Summary: Homebrewer's recipe calculator
Summary: A high-precision scientific calculator
Summary: Graphing calculator emulator
Summary: A cute little Body Mass Index calculator
Summary: Powerful and easy to use calculator
Summary: a GTK 2 / GTK 3 algebraic and RPN calculator
Icon: kcalc_accessories-calculator.png
Summary: Resistor color code calculator

Searching calculat

$ appstreamcli search calculat
Identifier: org.kde.step.desktop [desktop-application]
Name: Step
Summary: Interactive Physical Simulator
Package: step
Homepage: http://edu.kde.org/step/
Icon: step_step.png
---
Identifier: soundkonverter.desktop [desktop-application]
Name: soundKonverter
Summary: Audio file conversion tool
Package: soundkonverter
Homepage: https://github.com/dfaust/soundkonverter
Icon: soundkonverter_soundkonverter.png

Searching calculato

$ appstreamcli search calculato
Unable to find component matching calculato!

I looked through the existing issues and all search issues seemed fixed quite a while ago.
Please let me know if I can provide more information.

@JMoerman

Another example: my app Go For It! can barely be found. appstreamcli search "go for it" > ./search yields a list where more than 1100 apps precede it.

Only when running appstreamcli search "go for it!" > ./search or appstreamcli search "it!" > ./search does appstreamcli return any somewhat reasonable results (second place, amarok ranks higher).

Attracting new users isn't a huge issue as searching for todo/timer/... works fine, but finding the app by name is basically impossible.

@ximion
Owner

ximion commented Mar 2, 2019

Hmm, I can't reproduce any of this.
Is your AppStream compiled with stemming support? What is your system locale? "Go for it" is maybe filtered out because it contains tokens that are generic and are therefore removed ("for" and "it", "Go" probably not).

If you turn on verbose mode with --verbose you get some information on what is actually searched for.

ximion added a commit that referenced this issue Mar 2, 2019
Tokenize the search query with tokenize-and-fold, do not compare search
terms with an English-only word blacklist and speed up the search token
validity check.
Also, unittest the stemming feature.
See #216
@JMoerman

JMoerman commented Mar 2, 2019

Is your AppStream compiled with stemming support?

No idea, probably? (** (appstreamcli:4767): DEBUG: 20:39:33.108: Stemming language is: en)

What is your system locale?

Locale has no effect for me; en_US and nl_NL yield roughly the same results. (System locale is nl_NL.)

Using --verbose results in the following information (using en_US locale to make this easier to reproduce):

$ LANG=en_US.UTF-8 LANGUAGE=en_US appstreamcli search "go for it" --verbose
...
** (appstreamcli:4767): DEBUG: 20:39:33.108: Percentage of valid components: 100.000
** (appstreamcli:4767): DEBUG: 20:39:33.108: Stemming language is: en
** (appstreamcli:4767): DEBUG: 20:39:33.108: Search term invalid. Matching everything.
...
$ LANG=en_US.UTF-8 LANGUAGE=en_US appstreamcli search "go for it!" --verbose
...
** (appstreamcli:5481): DEBUG: 20:42:57.388: Searching for: it!
Identifier: org.kde.amarok [desktop-application]
Name: Amarok
Summary: Amarok - Rediscover Your Music!
Package: amarok
Icon: amarok_amarok.png
---
Identifier: com.github.jmoerman.go-for-it [desktop-application]
Name: Go For It!
Summary: A stylish to-do list with built-in productivity timer
Package: com.github.jmoerman.go-for-it
Homepage: http://manuel-kehl.de/projects/go-for-it/
Icon: com.github.jmoerman.go-for-it_com.github.jmoerman.go-for-it.png

So it seems you are right that the individual tokens are dropped.
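
The behaviour in the verbose output above can be modelled with a minimal Python sketch. This is not AppStream's code: the word list, the length cutoff and the function names are all assumptions chosen only to reproduce the two log lines ("Search term invalid. Matching everything." and "Searching for: it!").

```python
# Toy model of query preprocessing. GENERIC_WORDS and MIN_TOKEN_LENGTH are
# hypothetical values, not taken from AppStream.
GENERIC_WORDS = {"for", "it", "the", "and", "a", "an", "of", "to"}
MIN_TOKEN_LENGTH = 3


def preprocess_query(query):
    """Lowercase, split on whitespace, drop generic or very short tokens."""
    tokens = query.lower().split()
    return [t for t in tokens
            if t not in GENERIC_WORDS and len(t) >= MIN_TOKEN_LENGTH]


def search(query, component_names):
    """Match names containing any surviving token.

    If no token survives, the query is treated as invalid and every
    component is returned, mirroring the "Matching everything" log line.
    """
    tokens = preprocess_query(query)
    if not tokens:
        return list(component_names)
    return [name for name in component_names
            if any(tok in name.lower() for tok in tokens)]


names = ["Amarok", "Go For It!", "GNOME Calculator"]
everything = search("go for it", names)    # all tokens dropped
narrowed = search("go for it!", names)     # only "it!" survives
```

In this sketch "go for it" loses every token and so returns the whole list, while "go for it!" keeps the single token "it!" purely because the trailing punctuation stops it from matching the generic word "it".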

@JMoerman

JMoerman commented Mar 2, 2019

Reproducing the behavior observed by @peteruithoven:

$ LANG=en_US.UTF-8 LANGUAGE=en_US appstreamcli search "calculato" --verbose
...
** (appstreamcli:8083): DEBUG: 20:53:27.019: Searching for: calculato
Unable to find component matching calculato!
$ LANG=en_US.UTF-8 LANGUAGE=en_US appstreamcli search "calculator" --verbose
...
** (appstreamcli:8820): DEBUG: 20:56:46.010: Searching for: calcul
Identifier: mate-calc.desktop [desktop-application]
Name: MATE Calculator
Summary: Perform arithmetic, scientific or financial calculations
Package: mate-calc
Icon: mate-calc_accessories-calculator.png
---
Identifier: org.gnome.Calculator.desktop [desktop-application]
Name: GNOME Calculator
Summary: Perform arithmetic, scientific or financial calculations
Package: gnome-calculator
Homepage: https://wiki.gnome.org/Apps/Calculator
Icon: gnome-calculator_gnome-calculator.png
---

@ximion
Owner

ximion commented Mar 2, 2019

Hehe ^^
The "problem" here is that "calculato" does not get stemmed by the stemming algorithm, while "calculator" does get stemmed to "calcul".
The token cache for components contains pre-stemmed tokens, so "calculato" will not match the "calcul" token.
That's quite an interesting issue, but something that can be fixed with either some performance or memory impact (we either stem tokens while searching or we create a stemmed-tokens cache).
I am considering moving the AppStream metadata pool out of memory entirely and permanently onto a cache, maybe backed by LMDB. That would allow AsPool to hold hundreds of thousands of components without enormous memory usage.
There's no final plan for that yet, though.
(I know from GNOME Software's experiences that the memory consumption can grow quite a lot with this, and currently libappstream's memory usage is delightfully low in most scenarios).
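
To illustrate the mismatch described in this comment, here is a minimal Python sketch. It assumes a toy suffix-stripping stemmer (`toy_stem`, hypothetical) in place of the real stemming library, and a plain set as a stand-in for the pre-stemmed token cache.

```python
# Toy suffix stripper. It is only an illustration and is far simpler than a
# real stemming algorithm; the suffix list is an assumption.
SUFFIXES = ("ations", "ation", "ators", "ator", "ate")


def toy_stem(word):
    """Strip the first matching suffix, or return the word unchanged."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


def build_token_cache(texts):
    """Store only stemmed tokens, like the pre-stemmed cache described above."""
    cache = set()
    for text in texts:
        for word in text.lower().split():
            cache.add(toy_stem(word))
    return cache


def exact_lookup(query, cache):
    """Stem the query, then require an exact match against the cache."""
    return toy_stem(query.lower()) in cache


cache = build_token_cache([
    "GNOME Calculator",
    "Perform arithmetic, scientific or financial calculations",
])
full_word_hit = exact_lookup("calculator", cache)   # stems to "calcul"
partial_word_hit = exact_lookup("calculato", cache) # stays "calculato"
```

The full word is reduced to the cached stem and matches, while the partial word is left untouched by the stemmer and therefore has nothing to match against.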

@JMoerman

JMoerman commented Mar 2, 2019

The "problem" here is that "calculato" does not get stemmed by the stemming algorithm, while "calculator" does get stemmed to "calcul".

I noticed that, yes. That also means that the issue I'm having has a different cause, as it isn't a result of stemming. In both cases a naive substring search would "solve" the issue, however (assuming excessively large processing power and memory space).
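
For comparison, a naive substring search of the kind mentioned here could look like the following Python sketch. The component data and the function name are illustrative assumptions; the cost is a full scan of every component's text on every query, which is the trade-off noted above.

```python
def substring_search(query, components):
    """Return ids of components whose text contains the query as a substring.

    `components` maps a component id to its searchable text. No tokenizing,
    stemming or filtering is applied.
    """
    needle = query.lower()
    return [cid for cid, text in components.items() if needle in text.lower()]


# Illustrative data only.
components = {
    "org.gnome.Calculator.desktop": "GNOME Calculator",
    "com.github.jmoerman.go-for-it": "Go For It! A stylish to-do list",
}
partial_word = substring_search("calculato", components)
app_name = substring_search("go for it", components)
```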

I do think that individual tokens should not be discarded when searching for application names as this would make the situation for apps with names like the one I maintain rather hopeless.

ximion added a commit that referenced this issue Jun 10, 2019
This will make it possible for users to find apps like "Go for it!"
which otherwise would be impossible to search for.
(Note: We do not keep such small search tokens in the cache, except for
high-value texts, like name and summary)
CC: #216
@ximion
Owner

ximion commented Jun 10, 2019

As part of the caching rework, I also landed a few search optimizations which should improve the results.
Example:

$ appstreamcli s cal | grep Identifier | wc -l
134
$ appstreamcli s calc | grep Identifier | wc -l
2
$ appstreamcli s calcula | grep Identifier | wc -l
41
$ appstreamcli s calculato | grep Identifier | wc -l
18
$ appstreamcli s calculator | grep Identifier | wc -l
18

The numbers look odd at first, but are easy to explain: Nothing matches "cal" or "calcula" directly (they also don't get stemmed), so a prefix match is performed and we get broad results from tokens that have these prefixes. "calc", however, is a direct-match token, so we will find "libreoffice calc" (as that's its name). The other queries will match calculator apps with varying precision.
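
The "direct match first, prefix match as fallback" behaviour described above can be sketched in Python as follows. This is a simplified model under stated assumptions: the index contents are made up, stemming is left out, and `lookup` is a hypothetical name rather than a libappstream function.

```python
def lookup(term, index):
    """Return component ids for `term`.

    `index` maps a token to the set of component ids containing it.
    A direct token match wins outright; only if there is none do we
    collect every token that starts with the term.
    """
    if term in index:
        return set(index[term])
    results = set()
    for token, ids in index.items():
        if token.startswith(term):
            results |= ids
    return results


# Illustrative index only.
index = {
    "calc": {"libreoffice-calc"},
    "calcul": {"gnome-calculator", "kcalc"},
    "calendar": {"gnome-calendar"},
}
broad = lookup("cal", index)    # no direct match, prefix fallback
narrow = lookup("calc", index)  # direct match only
```

This shows why a longer query can return fewer results than a shorter one: "calc" stops at its direct match, while "cal" falls through to every token with that prefix.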

How many calculators do we find? (grep for "calculator")

cal: 11
calc: 0
calcula: 11
calculato: 11
calculator: 11

As for the "Go for It!" oddity, with the new algorithm changes AppStream will keep small tokens in the index if they are of high value (= they stem from the component's name, summary or ID). With a very recent change to the user search query preprocessing, you should also now be able to search using small search tokens.
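
A minimal Python sketch of the "keep small tokens only if they are of high value" rule, assuming a hypothetical length cutoff and field names (`id`, `name`, `summary`, `description`) that stand in for the real component data:

```python
# Both constants are assumptions for illustration.
HIGH_VALUE_FIELDS = {"id", "name", "summary"}
MIN_TOKEN_LENGTH = 3


def index_tokens(component):
    """Collect index tokens for one component.

    Short tokens are kept only when they come from a high-value field;
    elsewhere they are dropped to keep the index small.
    """
    tokens = set()
    for field, text in component.items():
        for word in text.lower().split():
            if len(word) >= MIN_TOKEN_LENGTH or field in HIGH_VALUE_FIELDS:
                tokens.add(word)
    return tokens


component = {
    "name": "Go For It!",
    "summary": "A stylish to-do list with built-in productivity timer",
    "description": "It is an app to do things",
}
tokens = index_tokens(component)
```

Here "go" survives because it comes from the name, while the equally short "is" from the description is discarded.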

I hope this helps! I am not done with improving search, there is still a lot of stuff to be looked into and potentially to be improved. I am also not sure whether the "prefer exact match only" approach ("calc" only finding Calc) is actually a good thing or confusing to users.

If you want to test things, the changes are in master (because the changes are massive and invasive, I expect a bit of time for the dust to settle and all issues to be found - API is also not 100% behaving the same as before yet, that's something that needs to be addressed prior to the release).

@ximion
Owner

ximion commented Jun 16, 2019

I think this is fixed now, search works well now and the original issue of this bug report is addressed.
I will keep an eye on the "fewer but more precise results vs. more but less precise results" issue though, and change that behavior later, if needed/requested.

ximion closed this as completed Jun 16, 2019
@peteruithoven
Author

Thanks for looking into this!
I hope this will make it to elementary OS's AppCenter soon.

@ximion
Owner

ximion commented Jun 16, 2019

I am planning to make a new release soon :-)
