Order first by exact match in results #287
478a463
to
37854bc
Hm, I'm worried that there's a deeper problem here we may want to fix. If the normal postgres ranking isn't putting the exact crate match at the very top, then something about how we're ranking results seems quite wrong, so maybe postgres was accidentally misconfigured?
I don't think this is a problem with the postgresql configuration. The ts_rank function calculates the rank based on some weights. The cargo crate has no keywords and includes the word "cargo" only once in its description, so other packages that include the word in their name, keywords, and description appear before it. For example, cargo-edit includes the word "cargo" in its name, twice in its keywords, and three times in its description. The tokenization is a problem too: to postgresql, cargo-edit is 2 words, cargo and edit, which means that every cargo-anything crate has cargo in its name, and probably in its keywords and description as well. I'm not sure we can do this without a small "hack" to force the exact match priority.
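To make the effect described above concrete, here is a minimal sketch. It is not the real ts_rank formula; it is a simplified weighted term count. The field weights are an assumption loosely modeled on PostgreSQL's default A/B/C class weights, and the keyword and description strings are made up so that the occurrence counts match the ones mentioned in the comment.

```python
import re

# Hypothetical field weights (assumption, loosely modeled on PostgreSQL's
# default weights for the A, B and C classes).
WEIGHTS = {"name": 1.0, "keywords": 0.4, "description": 0.2}

def tokenize(text):
    """Split on anything that is not a letter or digit: "cargo-edit" becomes two tokens."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def score(crate, term):
    """Simplified rank: weighted count of `term` occurrences across all fields."""
    return sum(w * tokenize(crate[field]).count(term) for field, w in WEIGHTS.items())

# Illustrative records; the keyword and description text is invented.
cargo = {
    "name": "cargo",
    "keywords": "",
    "description": "Cargo, a package manager for Rust.",
}
cargo_edit = {
    "name": "cargo-edit",
    "keywords": "cargo cargo-subcommand",
    "description": "Extends cargo with cargo add and cargo rm.",
}

# Both names contain the token "cargo", so the name field cannot separate them;
# the extra occurrences in keywords and description push cargo-edit ahead.
cargo_score = score(cargo, "cargo")
cargo_edit_score = score(cargo_edit, "cargo")
```

Under this simplified model the exact match loses purely because the other crate repeats the term more often, which is the behavior the comment attributes to the search results.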
This seems like our weights might be incorrect though? Sounds like we should much more heavily weight the title than we currently are (so it trumps everything) and basically go along the lines of that.
I'm not sure changing the weights will be enough, because the word cargo is included in a lot of names and the exact match on cargo doesn't get a better rank. A combination of a very heavy weight on the title and a normalization to ensure that shorter names come first may work (http://www.postgresql.org/docs/9.4/static/textsearch-controls.html#TEXTSEARCH-RANKING). Anyway, I have to test it properly; I am trying to get the crates-io backend working.
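A small sketch of why a heavier title weight alone may not help, and what length normalization adds. The function and the weight value are hypothetical, and dividing by the token count only mimics the idea behind the "divide the rank by the document length" normalization option in the linked PostgreSQL documentation; it is not that implementation.

```python
def name_rank(name, term, name_weight=10.0, normalize=False):
    """Hypothetical name-only rank: weight times the number of matching tokens."""
    tokens = name.lower().split("-")
    rank = name_weight * tokens.count(term)
    if normalize:
        # Shorter names get a higher rank for the same number of matches.
        rank /= len(tokens)
    return rank

# Without normalization, the heavy weight applies equally to both names.
plain_exact = name_rank("cargo", "cargo")
plain_prefixed = name_rank("cargo-edit", "cargo")

# With normalization, the one-word name comes out ahead.
norm_exact = name_rank("cargo", "cargo", normalize=True)
norm_prefixed = name_rank("cargo-edit", "cargo", normalize=True)
```

However large name_weight is, the two unnormalized values stay equal, so the weight by itself cannot break the tie.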
Hm ok, let's try and use the normal ranking functions in postgres then? If you need help setting up the backend just let me know.
37854bc
to
adb40ca
I have pushed another idea. If everything is in the same search block, I can't give more importance to "shorter" crate names. So calculating the rank for the crate name separately, with a normalization based on the number of "words", gives a better result. The problem here is that any package with the word in its name will always appear before a package without the word in its name, regardless of the number of times the word appears in the rest of the fields. I have to keep thinking about it, but I think the correct approach may be a variant of this one.
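The trade-off described above can be sketched as a two-part sort key: a length-normalized name rank first, then a rank for the remaining fields. This is an illustration only, not the query from the PR; the helper names, the build-helper crate, and all descriptions are hypothetical.

```python
def tokens(text):
    """Lowercase and split on hyphens and whitespace."""
    return text.lower().replace("-", " ").split()

def sort_key(crate, term):
    name_tokens = tokens(crate["name"])
    # Name rank normalized by the number of "words" in the name.
    name_rank = name_tokens.count(term) / len(name_tokens)
    # Rank of the remaining fields, here just the description.
    rest_rank = tokens(crate["description"]).count(term)
    return (name_rank, rest_rank)

# Invented records for illustration.
crates = [
    {"name": "cargo-edit", "description": "add and remove dependencies"},
    {"name": "build-helper", "description": "cargo cargo cargo cargo"},
    {"name": "cargo", "description": "cargo package manager"},
]

ordered = sorted(crates, key=lambda c: sort_key(c, "cargo"), reverse=True)
names = [c["name"] for c in ordered]
```

The exact match comes first, but build-helper lands last even though its description mentions the term four times, because the name rank always dominates the second key.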
Hm, I'm not quite following what the literal postgres query is doing here, but it seems legit? We can always play around with various tweaks to see what gives us the best results. So long as it's better than the previous it seems nice :) Do you think that any tests could be added for this as well?
For example, when searching "num" the exact crate is the fourth result. I'm not even sure how the first result (Seems like the high download count should factor in too, but that's a separate concern...)
e90d046
to
56ca925
I reviewed it and went back to the first approach: the latest version simply puts the exact match at the top and then orders the rest of the results by the ts_rank result. I revisited the idea of separate rankings for the name field and the rest of the fields, but it doesn't work correctly: to get the exact match first I have to normalize the matches by the field size, so the rank of the name ends up based on the length of the field, and the rank of the rest of the fields is left as a secondary ordering, which gives worse results. Another idea, building on the previous one, is to calculate the rank as a combination of the rank of the name and the rank of the rest of the fields, but it is complex and harder to understand. I think the solution of exact match first, with postgres-ranked results after it, is good enough for all cases, has the expected behavior, and is really simple to understand. I will add a test too, to ensure the exact match always goes first.
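The final ordering can be sketched as a sort on two keys: whether the name equals the query, then the search rank. This is a minimal illustration under assumptions, not the PR's actual query; the rank values and the cargo-foo crate are invented, and order_results is a hypothetical helper.

```python
def order_results(rows, query):
    """Exact name match first, then everything else by descending rank."""
    return sorted(
        rows,
        key=lambda r: (r["name"] == query, r["rank"]),
        reverse=True,
    )

# Invented rows: the exact match has the lowest rank on purpose.
rows = [
    {"name": "cargo-edit", "rank": 0.9},
    {"name": "cargo", "rank": 0.3},
    {"name": "cargo-foo", "rank": 0.7},
]

result = [r["name"] for r in order_results(rows, "cargo")]
```

In SQL terms this roughly corresponds to ordering by an equality test on the name in descending order and then by the rank in descending order, which keeps the rest of the list in whatever order the ranking function produces.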
56ca925
to
298b496
Ok, seems like it's at least an improvement over the status quo, so let's do that. Thanks @jespino!