Search: improve keyword search prototype #52233
Update: I noticed that we try to automatically pull out language aliases (like "typescript", "html", etc.) into language filters. This adds a lot of noise, since it converts common terms like "batch", "json", "text" to file name filters. After removing it, we see good results compared to ripgrep:
Positive signal: this prototype does better than ripgrep on my set of example searches for the
@@ -114,14 +114,17 @@ func (s *searchClient) Plan(
	}
	tr.LazyPrintf("parsing done")

	features := ToFeatures(featureflag.FromContext(ctx), s.logger)
This is hacky! Fixing this requires a refactor, which I'll do in a follow-up (but I didn't want to clutter this PR with the changes).
Update: I opened #52649 to fix this.
Codenotify: Notifying subscribers in CODENOTIFY files for diff 39999bb...f3d2c38.
	return commonCodeSearchTerms.Has(input) || stopWords.Has(input)
}

var commonCodeSearchTerms = stringSet{
Highlighting the main issue with this approach: if a common word sneaks through our aggressive stopwords lists, then it can make the results very noisy. Truly fixing this would require improvements in Zoekt:
- Make its new BM25 scoring logic respect IDF, so that common terms are down-weighted
- Make it better able to handle large "OR" queries efficiently
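To illustrate why IDF would help here, below is a minimal sketch of the textbook BM25 inverse document frequency term. It is not Zoekt's scoring code; the function name `bm25IDF` and the document counts are made up for the example. A term that appears in most documents gets a weight close to zero, while a rare term gets a much larger one, so a common word that slips past the stopword lists would contribute little to the ranking.

```go
package main

import (
	"fmt"
	"math"
)

// bm25IDF returns the textbook BM25 inverse document frequency for a term
// that appears in docsWithTerm out of totalDocs documents.
// Hypothetical helper for illustration only; not taken from Zoekt.
func bm25IDF(totalDocs, docsWithTerm int) float64 {
	n := float64(docsWithTerm)
	total := float64(totalDocs)
	// The "1 +" inside the logarithm keeps the weight non-negative
	// even for terms that occur in more than half of the documents.
	return math.Log(1 + (total-n+0.5)/(n+0.5))
}

func main() {
	const totalDocs = 10000 // made-up corpus size

	// A term found in 90% of documents versus one found in a handful.
	fmt.Printf("common term weight: %.3f\n", bm25IDF(totalDocs, 9000))
	fmt.Printf("rare term weight:   %.3f\n", bm25IDF(totalDocs, 12))
}
```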
Looks great, thanks!
}

func stemTerm(input string) string {
	// Attempt to stem words, but only use the stem if it's a prefix of the original term.
This is unintuitive to me. Do we just exclude non-prefix stems because otherwise it wouldn't match the exact input term?
That's right -- stems aren't always prefixes of the original term, which means the original term would no longer match. One example I ran into is "Cody" -> "Codi", which is inaccurate and creates a lot of noise. I added a comment explaining this.
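The prefix rule discussed in this thread can be sketched roughly as follows. This is not the PR's `stemTerm` implementation: `toyStem` is a hypothetical stand-in for a real stemmer, written only so the example is self-contained, and `stemIfPrefix` is an illustrative name. The sketch uses lowercase input and shows that a stem like "codi" is rejected because it is not a prefix of "cody", while "search" is accepted for "searching".

```go
package main

import (
	"fmt"
	"strings"
)

// toyStem is a hypothetical stand-in for a real stemming library.
// It only mimics two behaviours: stripping "ing" and rewriting a
// trailing "y" to "i" (which is how "cody" becomes "codi").
func toyStem(term string) string {
	switch {
	case strings.HasSuffix(term, "ing") && len(term) > 5:
		return strings.TrimSuffix(term, "ing")
	case strings.HasSuffix(term, "y") && len(term) > 2:
		return strings.TrimSuffix(term, "y") + "i"
	}
	return term
}

// stemIfPrefix returns the stem only when it is a prefix of the original
// term, so that a search for the stem still matches the original term.
// Otherwise it falls back to the unmodified term.
func stemIfPrefix(term string, stem func(string) string) string {
	stemmed := stem(term)
	if stemmed != "" && strings.HasPrefix(term, stemmed) {
		return stemmed
	}
	return term
}

func main() {
	for _, term := range []string{"searching", "cody", "zoekt"} {
		fmt.Printf("%s -> %s\n", term, stemIfPrefix(term, toyStem))
	}
}
```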
	"github.com/sourcegraph/sourcegraph/internal/search/query"
)

const maxTransformedPatterns = 10
Any details on why 10? I could imagine that's still pretty expensive in Zoekt
This is not very scientific: I just tested a search that tokenizes to 10 final terms. On my local instance, it always takes < 400ms, which seemed reasonable. We can definitely bump this down after testing on larger datasets.
Cool, sounds good to me 👍
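As a rough sketch of how stopword filtering and the `maxTransformedPatterns` cap could fit together: the `stringSet` definition, the tiny stopword list, and the `selectTerms` helper below are assumptions made for illustration, and dropping terms beyond the limit is only one possible way to apply the cap, not necessarily what the PR does.

```go
package main

import "fmt"

const maxTransformedPatterns = 10

// stringSet is an assumed definition; the PR only shows that the type
// exists and has a Has method.
type stringSet map[string]struct{}

// Has reports whether v is a member of the set.
func (s stringSet) Has(v string) bool {
	_, ok := s[v]
	return ok
}

// A tiny illustrative list; the real lists in the PR are much larger.
var stopWords = stringSet{"the": {}, "a": {}, "how": {}, "do": {}, "i": {}}

// selectTerms is a hypothetical helper: it drops stopwords and keeps at
// most limit of the remaining tokens, preserving their original order.
func selectTerms(tokens []string, limit int) []string {
	kept := make([]string, 0, limit)
	for _, tok := range tokens {
		if stopWords.Has(tok) {
			continue
		}
		if len(kept) == limit {
			break
		}
		kept = append(kept, tok)
	}
	return kept
}

func main() {
	tokens := []string{"how", "do", "i", "parse", "json"}
	fmt.Println(selectTerms(tokens, maxTransformedPatterns))
}
```

Bounding the number of terms keeps the size of the resulting "OR" query predictable, which is the cost concern raised above.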
We recently enabled Zoekt's new BM25 scoring for the `keyword` search type. We enabled the option using feature flags, which is hacky because users will never be touching this setting. This PR refactors all Zoekt-related jobs to use the `search.ZoektParameters` struct. This lets us pass the flag to Zoekt directly when constructing jobs. Follow-up to #52233
We have an experimental search type called `patterntype:keyword`. In testing it on Cody-style queries, it had worse relevance than our ripgrep implementation, and was sometimes quite slow.

This PR makes improvements to query analysis:
* Reduce the number of tokens we search by using a more aggressive stopword list
* Make stemming cheaper and less noisy by only using the stem if it's a prefix of the original
* Limit the max number of tokens we'll search over
* Remove language detection because it was too noisy and made it hard to compare to other search strategies

It also improves ranking:
* Enable Zoekt's keyword scoring to rank documents by (approximate) BM25
* Remove unused ranking logic related to "match groups"

Addresses #50786
Test plan
New unit tests, plus quality tests (results in a comment below).