This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

Fuzzy searching #1270

Closed
reillysiemens opened this issue Mar 1, 2018 · 3 comments
Labels
A-backend ⚙️ A-keywords A-search C-enhancement ✨ Category: Adding new behavior or a change to the way an existing feature works

Comments

@reillysiemens

I recently searched crates.io for "git" using the default sorting of relevance. I expected to find the git2 crate, but instead found the git crate.

Below is a screenshot of the exact match currently found when searching with https://crates.io/search?q=git.

[Screenshot: crates.io search results for "git", sorted by relevance]

Searching for "git2" directly with https://crates.io/search?q=git2 produces the desired result with an exact match.
[Screenshot: crates.io search results for "git2", sorted by relevance]

I took a look at the source for crates.io briefly last night and it looks like the search controller uses the PostgreSQL ts_rank_cd text search function for the default search. I'm not familiar enough with the Cover Density Ranking algorithm to explain why or whether this produces the results above, but that might be a starting point in digging deeper into this.

Relevance seems like a tricky term here. The default search probably does produce the most relevant package from a text similarity standpoint, but not necessarily to me as a programmer looking for a git library to use. Maybe a hybrid approach that considers text relevance, all-time downloads, and recent downloads would produce something closer to what I expected.

@moore3071
Contributor

I'll add the little I know to help explain why git2 isn't higher in the search results. git2 is currently the 49th result when searching for "git". The document searched is a concatenation of the crate name, keywords, description, and readme fields, with weights A, B, C, and D respectively. See the PostgreSQL documentation for details on these weight classes, but simply put, a crate is less relevant if the search term appears once in the readme (D) than if it appears once in the crate name (A).

The git2 crate's text search vector (minus the readme) looks like:

'allow':20C 'bind':3C 'git':2B,9C,25C 'git2':1A 'interoper':7C 'libgit2':5C 'librari':12C 'memori':17C 'read':22C 'repositori':10C,26C 'safe':18C 'threadsaf':15C 'write':24C

Notice that the entries for git are 2B, 9C, and 25C: git is the 2nd word in the document with weight B (keywords), and the 9th and 25th words with weight C (description). These matches are the only reason git2 shows up in this search at all, since its weight-A crate name is the separate lexeme git2.
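To make the position and weight notation concrete, here is a minimal Python sketch that parses the vector text above. The function names are hypothetical, and `naive_weight_sum` is only an illustration: it adds up PostgreSQL's documented default class weights (D = 0.1, C = 0.2, B = 0.4, A = 1.0) per occurrence and is not the cover density calculation that ts_rank_cd performs.

```python
import re

# PostgreSQL's documented default weights for the classes D, C, B, A.
DEFAULT_WEIGHTS = {"D": 0.1, "C": 0.2, "B": 0.4, "A": 1.0}

# The git2 vector quoted above (readme omitted).
TSVECTOR = (
    "'allow':20C 'bind':3C 'git':2B,9C,25C 'git2':1A 'interoper':7C "
    "'libgit2':5C 'librari':12C 'memori':17C 'read':22C "
    "'repositori':10C,26C 'safe':18C 'threadsaf':15C 'write':24C"
)


def parse_tsvector(text):
    """Map each lexeme to a list of (position, weight class) pairs."""
    entries = {}
    for lexeme, positions in re.findall(r"'([^']+)':(\S+)", text):
        pairs = []
        for item in positions.split(","):
            match = re.fullmatch(r"(\d+)([A-D]?)", item)
            # A position without a letter belongs to the default class D.
            pairs.append((int(match.group(1)), match.group(2) or "D"))
        entries[lexeme] = pairs
    return entries


def naive_weight_sum(entries, lexeme):
    """Sum the class weights of every occurrence of a lexeme (NOT ts_rank_cd)."""
    return sum(DEFAULT_WEIGHTS[weight] for _, weight in entries.get(lexeme, []))


vector = parse_tsvector(TSVECTOR)
print(vector["git"])    # positions and weight classes for the lexeme git
print(vector["git2"])   # the crate name is a different lexeme with weight A
```

Under this simplified scoring, a query for git only ever touches the B and C entries, while a query for git2 hits the single weight-A entry.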

I like the idea of a hybrid approach, but I'd be curious how it would affect query speed. It would be done by ordering on a function of the ts_rank_cd result and the other pieces (all-time downloads, etc.). I'm curious whether Diesel can do this.
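As a rough illustration of what "a function of the ts_rank_cd result and the other pieces" could look like, here is a Python sketch. The formula, the `recent_weight` parameter, the crate names, and every number are made up for the example; nothing here is what crates.io does. The text rank is scaled by a log-damped popularity term so that download counts can reorder results without swamping text relevance entirely.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    text_rank: float        # stand-in for a ts_rank_cd result
    downloads: int          # all-time downloads
    recent_downloads: int   # downloads in some recent window


def hybrid_score(candidate, recent_weight=2.0):
    """Hypothetical score: text rank scaled by log-damped popularity."""
    popularity = (
        math.log10(1 + candidate.downloads)
        + recent_weight * math.log10(1 + candidate.recent_downloads)
    )
    return candidate.text_rank * (1.0 + popularity)


# Made-up candidates: a rarely used exact name match and a heavily used
# crate that matches only through keywords and description.
candidates = [
    Candidate("exact-name-match", text_rank=1.0,
              downloads=1_000, recent_downloads=10),
    Candidate("popular-keyword-match", text_rank=0.4,
              downloads=5_000_000, recent_downloads=500_000),
]

ranked = sorted(candidates, key=hybrid_score, reverse=True)
print([c.name for c in ranked])
```

In SQL this would correspond to an ORDER BY over an expression rather than over the rank alone, which is where the question about query speed comes in.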

Alternatively, in this case, #1266 would have included the title in the ranking, since all trigrams of the search term are in the package title. That could still miss relevant results, though, so it's not a catch-all.
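For intuition about trigram matching, here is a small Python sketch. It assumes pg_trgm-style padding (two spaces before each word and one after) and a shared-over-total similarity, and it handles single words only. Whether #1266 matches in exactly this way is not stated in this thread, so treat the padding and the formula as assumptions.

```python
def trigrams(word):
    """Return the set of trigrams of a single word, padded pg_trgm-style."""
    padded = "  " + word.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a, b):
    """Shared trigrams divided by the total number of distinct trigrams."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


print(sorted(trigrams("git")))
print(sorted(trigrams("git2")))
print(similarity("git", "git2"))
```

Under these assumptions, "git" and "git2" share three padded trigrams, and the same holds for "ssh" and "ssh2", so a similarity threshold would pull both pairs together.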

As another alternative, you could search by the keyword git using the API, where git2 is the top result: https://crates.io/api/v1/crates?keyword=git. I don't think keyword searching is available on crates.io's frontend, but I could be wrong.
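The keyword query above can be built and fetched from a script. This is a sketch using only the Python standard library: the function names and the User-Agent string are hypothetical, and the assumption that the response is a JSON object with a "crates" list whose entries carry a "name" field should be checked against the live API. `fetch_crate_names` is defined but deliberately not called here, since it needs network access.

```python
import json
import urllib.parse
import urllib.request

API_ROOT = "https://crates.io/api/v1/crates"


def keyword_search_url(keyword):
    """Build the keyword query URL used in the comment above."""
    return f"{API_ROOT}?{urllib.parse.urlencode({'keyword': keyword})}"


def fetch_crate_names(url, user_agent="fuzzy-search-example (hypothetical contact)"):
    """Fetch a crate listing and return crate names in the order given.

    Assumes the response is a JSON object with a "crates" list whose
    entries have a "name" field.
    """
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [crate["name"] for crate in payload.get("crates", [])]


print(keyword_search_url("git"))
```

Calling `fetch_crate_names(keyword_search_url("git"))` would then return the names in whatever order the API chooses for that query.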

@sgrif sgrif mentioned this issue May 3, 2018
@sgrif
Contributor

sgrif commented May 3, 2018

> This would be done by ordering on a function of the ts_rank_cd result and the other pieces(all-time downloads, etc). I'm curious if Diesel can do this.

Yes, it can.

> Alternatively, in this case, #1266 would have included the title into the ranking as all trigrams of the search were in the package title. This could still miss relevant searches though, so it's not a catch-all.

I'm happy to experiment. Can you give me some specific queries you'd like tried?

> As another alternative, you could search by keyword git using the API and git2 is the top result https://crates.io/api/v1/crates?keyword=git. I don't think keyword searching is available on crates.io's frontend, but I could be wrong.

It is. https://crates.io/keywords/git. I'm definitely open to suggestions for better exposing that.

@carols10cents carols10cents changed the title Relevance sorting in searches does not always produce desired results Fuzzy searching Jun 27, 2018
@carols10cents carols10cents added C-enhancement ✨ Category: Adding new behavior or a change to the way an existing feature works A-search A-postgres A-keywords labels Jun 27, 2018
@phil-opp

Probably related: with the default relevance sorting, a search for "ssh" returns ssh2, probably the most mature SSH crate, only on page 6. This makes discovering new crates very difficult.
