Improve search relevance #389

Open
heathermiller opened this issue Feb 15, 2017 · 22 comments

Contributor

@heathermiller heathermiller commented Feb 15, 2017

I can't find Spark anymore... I have to search for Spark and then fiddle with the filters. Spark should really turn up on the first page when searching for "spark".

Member

@julienrf julienrf commented Feb 15, 2017

The score is highly dependent on keywords. I think that's the best way to avoid false positives, even though it requires manual editing of projects.

The other results for the "spark" query are not irrelevant. It is super hard to design a magic scoring function that always puts your expected result first...

I added the "spark" keyword to the apache/spark project, in order to boost its score.

@kzys kzys commented Feb 16, 2017

Can we just use the number of stars as the default sort order?
https://index.scala-lang.org/search?q=spark&page=1&sort=stars

Member

@julienrf julienrf commented Feb 16, 2017

@kzys We tried that but then some projects that are not completely related to a given search query but have a lot of stars (e.g. playframework) appear at the top of the results. That does not work well either…

Member

@MasseGuillaume MasseGuillaume commented Feb 16, 2017

@heathermiller Let's use issue #303 and close this one as a duplicate?

@cvogt cvogt commented Mar 4, 2017

This looks off. scalafmt does not show up at all in the live preview.

[image]

@cvogt cvogt commented Mar 4, 2017

scalafmt doesn't show anywhere in this list even though I typed the exact name.

@cvogt cvogt commented Mar 4, 2017

When I press Enter, it does show at position 8 of that list.

Maybe projects that have the exact search term in their name should show first?

Member

@olafurpg olafurpg commented Mar 29, 2017

Scalafmt does show up when searching for scalameta:

[image: screen shot 2017-03-29 at 14 18 07]

I think the scalameta repo should be the top result, however.

@ShaneDelmore ShaneDelmore commented Apr 1, 2017

Can we give a heavy weight to matches in the name? If I search for shapeless I should get shapeless, even if a library named ShapetyShapeShape mentions Shapeless 42 times in the readme and has more stars than Travis Brown has StackOverflow points. Is this controversial?

Member

@MasseGuillaume MasseGuillaume commented Apr 1, 2017

@ShaneDelmore The counterexample is a library named json. Is it more relevant than circe?

@ShaneDelmore ShaneDelmore commented Apr 1, 2017

Yes. If I search for a library named json I would like it to show up before Circe. If I list the category json by popularity then I would expect Circe to be near the top. Has the counter example proved a larger problem than the current results ranking?

@MasseGuillaume MasseGuillaume mentioned this issue Apr 3, 2017

Member

@olafurpg olafurpg commented Jun 14, 2017

The search results for "scalafix" list scalacenter/scalafix in 11th place: https://index.scala-lang.org/search?q=scalafix

[image: screen shot 2017-06-14 at 09 11 10]

Contributor Author

@heathermiller heathermiller commented Jun 14, 2017

This issue of search relevance really needs to be resolved. This is crippling Scaladex.

@MasseGuillaume MasseGuillaume added this to Backlog in Tasks Jun 20, 2017
@MasseGuillaume MasseGuillaume moved this from Backlog to In Progress in Tasks Jun 27, 2017

Member

@olafurpg olafurpg commented Jun 27, 2017

Searching for "shapeless" does not list https://github.com/milessabin/shapeless on the front page https://index.scala-lang.org/search?q=shapeless at the moment.

Member

@MasseGuillaume MasseGuillaume commented Jun 27, 2017

https://index.scala-lang.org/milessabin

And someone lost his cat: #430.

I'm taking a look.

Member

@MasseGuillaume MasseGuillaume commented Jun 28, 2017

@olafurpg found the problem: #430 (comment)

@MasseGuillaume MasseGuillaume moved this from In Progress to Done in Tasks Jun 30, 2017
@MasseGuillaume MasseGuillaume moved this from Done to Deployed in Tasks Jun 30, 2017
@MasseGuillaume MasseGuillaume moved this from Deployed to Done in Tasks Jun 30, 2017
@MasseGuillaume MasseGuillaume moved this from Done to Deployed in Tasks Jun 30, 2017
@MasseGuillaume MasseGuillaume moved this from Deployed to Done in Tasks Jun 30, 2017
@MasseGuillaume MasseGuillaume moved this from Done to Deployed in Tasks Jul 2, 2017
@MasseGuillaume MasseGuillaume removed this from Deployed in Tasks Jul 3, 2017
@MasseGuillaume MasseGuillaume added this to In Progress in Tasks Jul 11, 2017

Member

@MasseGuillaume MasseGuillaume commented Jul 11, 2017

I added tests for relevance:

first(query)(org/repo): org/repo should be the first result when searching for query
top(query)(repos): repos should be on the first page; the ordering is not relevant
exactly(query)(repos): repos should show in this order
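
Editor's sketch: the three relevance checks listed above could look roughly like this. It is an illustration in Python, not the Scaladex test suite. The `search` callable, the `fake_search` index, the `example/...` repository names, and the page size of 20 are all assumptions made for the example, and `exactly` is read here as "the results begin with these repos, in this order".

```python
from typing import Callable, List

PAGE_SIZE = 20  # assumption: number of results on the first page

Search = Callable[[str], List[str]]  # query -> ranked "org/repo" names


def first(search: Search, query: str, repo: str) -> bool:
    """True if `repo` is the first result for `query`."""
    results = search(query)
    return bool(results) and results[0] == repo


def top(search: Search, query: str, repos: List[str]) -> bool:
    """True if every repo is on the first page; order does not matter."""
    first_page = set(search(query)[:PAGE_SIZE])
    return set(repos) <= first_page


def exactly(search: Search, query: str, repos: List[str]) -> bool:
    """True if the results begin with `repos`, in exactly this order."""
    return search(query)[:len(repos)] == list(repos)


# Hypothetical in-memory index standing in for the real search backend.
FAKE_INDEX = {
    "spark": ["apache/spark", "example/spark-utils", "example/spark-notebook"],
}


def fake_search(query: str) -> List[str]:
    """Return the canned ranking for a query, or no results."""
    return FAKE_INDEX.get(query, [])
```

Passing the search function in as a parameter keeps the checks independent of the backend, so the same three helpers could run against a stub like `fake_search` or against a real index.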

@MasseGuillaume MasseGuillaume changed the title Search relevance is meh again Improve search relevance Jul 11, 2017
@MasseGuillaume MasseGuillaume moved this from Today to V3 in Tasks Jul 12, 2017

@ShaneDelmore ShaneDelmore commented Aug 14, 2018

I attempted to work on this a couple of weekends ago and then forgot about it. I am out of motivation for the time being, but I’ll share what I found before I stopped. I got stuck because I did not know how to inspect the contents of the index to see what to search on.

Multiple people have said something to the effect of “just rank by github stars” and the answer has often been “you don’t want to do that, or spark will always be at the top of the list”. I believe the problem is that there is not enough separation between filtering results returned and ranking them.

The ranking isn’t bad from what I can see, although I did improve results in my tests by playing with the weights a bit. The biggest difference, though, came from no longer trying so hard to make sure we never miss a result. I noticed that we don’t just search the title, tags, and the readme; we also search dependency names and other things that I think should not cause a library to be returned as a result.

For example, if you search for config and a library that does Units of Measure calculations ends up depending on typesafe-config, or maybe mentions that you can set defaults in the lib.config file it will be included as a result. I don’t think anyone wants this (please post here if I am wrong though, maybe I don’t know of your use case).

If we stop including so many libraries in results, then we won’t have to worry about them getting promoted to the top of the list. The problem is not that, when querying json, Spark has more GitHub stars than circe; the problem is that Spark is not a json library. Just having a dependency on a module named spark-json doesn’t make it a json library. Heck, if we want to match spark-json, great, return that as a hit result, but not Spark itself.

Out of ~6000 libraries I get the following number of hits for various searches:
Scalatest: 2026
Json: 1106
Play: 989
Config: 852
Shapeless: 323

I think these numbers are hugely exaggerated, almost all due to dependencies. Do we have 323 libraries that depend on shapeless? Sure, but we only have a few libs that are the one and only Shapeless or part of its ecosystem. And while a lot of libraries use scalatest, when someone searches for scalatest they want ScalaTest and libraries that add functionality to ScalaTest, not every library that happens to use it in its own tests.

TLDR: If we return fewer, more targeted results, I think the ordering will end up being pretty easy; we are just returning far too many results. Don’t use dependency names as inclusion criteria. The problem is not missing a few libs: if libs are missed, the author can just add a topic with the search term they want to hit. But if we include too many libs, it becomes difficult to rank them, and the lib author has far less control over that.

I’m sorry I’m just leaving a comment and not a PR but I’m realizing that I may not have time to get to this for a while and my findings may be useful for someone else.
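
Editor's sketch: the separation between filtering and ranking argued for in this comment could look roughly like the following Python. It is a minimal illustration, not Scaladex code. The `Project` fields, the choice to match on name, description, and topics only, and all sample data (including the star counts) are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Project:
    repo: str                      # "org/name"
    description: str = ""
    topics: List[str] = field(default_factory=list)
    dependencies: List[str] = field(default_factory=list)
    stars: int = 0


def matches(project: Project, query: str) -> bool:
    """Filtering step: dependencies are deliberately NOT consulted."""
    q = query.lower()
    name = project.repo.split("/")[-1].lower()
    return (
        q in name
        or q in project.description.lower()
        or any(q == topic.lower() for topic in project.topics)
    )


def search(projects: List[Project], query: str) -> List[Project]:
    """Filter first, then rank only the surviving hits by stars."""
    hits = [p for p in projects if matches(p, query)]
    return sorted(hits, key=lambda p: p.stars, reverse=True)


# Hypothetical sample data; the star counts are invented.
PROJECTS = [
    Project("apache/spark", "Cluster computing engine", ["spark"],
            ["spark-json"], stars=1000),
    Project("circe/circe", "JSON library for Scala", ["json"], [], stars=100),
]

print([p.repo for p in search(PROJECTS, "json")])
```

Because the dependency list never enters `matches`, the more popular project with a `spark-json` dependency is filtered out before ranking, so sorting by stars can no longer push it above a real json library.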

@eed3si9n eed3si9n commented May 8, 2019

It's now 2019, and clicking on the first topic from the front page, akka (110), we still get odd results.

[image: topics_akka_2019]

json (85) is not much better, since it is missing obvious ones like circe and json4s on the first page.

[image: topics_json_2019]

In both cases, GitHub's relevance search filtered by Scala language returns much more usable results:

[image: 2019_Search_akka]

[image: 2019_Search_json]

From the looks of it, it's a mixture of repo name, tags, description, and stars?

@eed3si9n eed3si9n commented May 8, 2019

Here are the back-of-the-napkin weights I reverse engineered from looking at the ranking:
https://docs.google.com/spreadsheets/d/1L6IjOKJ67GjwiBdLGa8M7GSa_AU645WmtYpkXN4ujeY/edit#gid=0

[image: GitHub_relevance]

In the "name" column, you get 10 points for an exact match on the project name, like "akka"; otherwise you get 1 point. paypal/squbs showing up in second place is a hint of how repo names are weighed against stars.

akka-sample showing up above akka-in-action shows that tags do matter, but proportionally, they should matter much less than the description.
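
Editor's sketch: one way to turn the weights described above into a scoring function. Only the name points (10 for an exact match, 1 otherwise) come from the comment. The description and topic weights, the logarithmic star factor, and the way the parts are combined are assumptions chosen for illustration; they are not the spreadsheet's formula and not what the PR implements.

```python
import math
from typing import Iterable


def name_points(repo_name: str, query: str) -> int:
    """10 points for an exact project-name match, otherwise 1 (from the comment)."""
    return 10 if repo_name.lower() == query.lower() else 1


def score(repo_name: str, description: str, topics: Iterable[str],
          stars: int, query: str,
          description_weight: float = 3.0,   # assumed value
          topic_weight: float = 0.5) -> float:  # assumed value
    """Combine text signals with a damped popularity factor (assumed shape)."""
    q = query.lower()
    text = float(name_points(repo_name, q))
    if q in description.lower():
        text += description_weight      # description counts for more
    if q in (t.lower() for t in topics):
        text += topic_weight            # topics act as a tie breaker
    # log10 keeps very popular but loosely related projects from dominating
    return text * math.log10(stars + 10)


# Hypothetical star counts, for illustration only.
exact_match = score("akka", "", [], 50, "akka")
popular_unrelated = score("playframework", "", [], 10_000, "akka")
print(exact_match > popular_unrelated)
```

Multiplying the text score by a damped star count means stars only separate projects that already match well, which addresses the earlier objection that sorting by stars alone puts loosely related but popular projects at the top.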

eed3si9n added a commit to eed3si9n/scaladex that referenced this issue May 11, 2019
Please see scalacenter#389 (comment)

From looking at GitHub's search result, which is very good, the signal we should be boosting more is the description. The tag seems to mix in more random things, so it's ok as a tie breaker, but it's not very reliable.

@eed3si9n eed3si9n commented May 11, 2019

I attempted to send a PR to implement the scoring formula - #571
If someone could review and correct it that'd be great.

@eed3si9n eed3si9n commented May 11, 2019

I also agree with @ShaneDelmore's comment about

I think these numbers are hugely exaggerated, almost all due to dependencies.

Could someone show me how to cut those out?
