Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Search: improve keyword search prototype #52233

Merged
merged 10 commits into from
May 25, 2023
Merged

Conversation

jtibshirani
Copy link
Member

@jtibshirani jtibshirani commented May 19, 2023

We have an experimental search type called patterntype:keyword. In testing it on Cody-style queries, it had worse relevance than our ripgrep implementation, and was sometimes quite slow.

This PR makes improvements to query analysis:

  • Reduce the number of tokens we search by using a more aggressive stopword list
  • Make stemming cheaper and less noisy by using the stem if it's a prefix of the original
  • Limit the max number of tokens we'll search over
  • Remove language detection because it was too noisy and makes it hard to compare to other search strategies

It also improves ranking:

  • Enable Zoekt's keyword scoring to rank documents by (approximate) BM25
  • Removes unused ranking logic related to "match groups"

Addresses #50786

Test plan

New unit tests, plus quality tests (results in a comment below).

@cla-bot cla-bot bot added the cla-signed label May 19, 2023
@jtibshirani
Copy link
Member Author

jtibshirani commented May 20, 2023

Neutral/ negative signal: this implementation does worse than ripgrep on CodeSearchNet. It's not horrible though, and NDCG is a tough metric (compared to what we really care about, which is recall.

Update: I noticed that we try to automatically pull out language aliases (like "typescript", "html", etc.) into language filters. This adds a lot of noise, since it converts common terms like "batch", "json", "text" to file name filters. After removing it, we see good results compared to ripgrep:

NDCG@k ripgrep prototype
@1 0.1830 0.2155
@5 0.2921 0.3608
@10 0.3409 0.4178
@20 0.4176 0.5017

Positive signal: this prototype does better than ripgrep on my set of example searches for the sourcegraph/sourcegraph repo. The following table shows whether each method finds the correct file in the top 10.

Search Analyzed search (for ripgrep/ keyword) Correct file ripgrep prototype embeddings
What does InternalDoer do? InternalDoer internal/httpcli/client.go
Is crewjam/saml used anywhere in the code? crewjam/saml go.mod
Where are the Cody “no context messages” defined? Cody context messages enterprise/cmd/worker/internal/ embeddings/contextdetection/dataset.go
Where are the embeddings no context regexes? embeddings context regexes enterprise/cmd/embeddings/ shared/context_detection.go
Are sub-repo permissions respected in the embeddings service? sub-repo permissions embeddings service enterprise/cmd/embeddings/ shared/main.go
Where do we convert lang filters to file extensions? convert lang filters file extensions internal/search/query/helpers.go
Where are the grafana dashboards defined for frontend search ranking? grafana dashboards frontend search ranking monitoring/definitions/frontend.go

@@ -114,14 +114,17 @@ func (s *searchClient) Plan(
}
tr.LazyPrintf("parsing done")

features := ToFeatures(featureflag.FromContext(ctx), s.logger)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is hacky! Fixing this requires a refactor, which I'll do in a follow-up (but I didn't want to clutter this PR with the changes).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: I opened #52649 to fix this.

@jtibshirani jtibshirani marked this pull request as ready for review May 24, 2023 19:48
@sourcegraph-bot
Copy link
Contributor

sourcegraph-bot commented May 24, 2023

Codenotify: Notifying subscribers in CODENOTIFY files for diff 39999bb...f3d2c38.

Notify File(s)
@camdencheek internal/search/client/client.go
internal/search/keyword/BUILD.bazel
internal/search/keyword/match_groups.go
internal/search/keyword/match_groups_test.go
internal/search/keyword/query_transformer.go
internal/search/keyword/query_transformer_test.go
internal/search/keyword/stop_words.go
internal/search/keyword/term_utils.go
internal/search/types.go
internal/search/zoekt/zoekt.go
@keegancsmith internal/search/client/client.go
internal/search/keyword/BUILD.bazel
internal/search/keyword/match_groups.go
internal/search/keyword/match_groups_test.go
internal/search/keyword/query_transformer.go
internal/search/keyword/query_transformer_test.go
internal/search/keyword/stop_words.go
internal/search/keyword/term_utils.go
internal/search/types.go
internal/search/zoekt/zoekt.go

return commonCodeSearchTerms.Has(input) || stopWords.Has(input)
}

var commonCodeSearchTerms = stringSet{
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Highlighting the main issue with this approach: if a common word sneaks through our aggressive stopwords lists, then it can make the results very noisy. Truly fixing this would require improvements in Zoekt:

  • Make its new BM25 scoring logic respect IDF, so that common terms are down-weighted
  • Make it better able to handle large "OR" queries efficiently

Copy link
Member

@camdencheek camdencheek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks!

internal/search/keyword/term_utils.go Outdated Show resolved Hide resolved
}

func stemTerm(input string) string {
// Attempt to stem words, but only use the stem if it's a prefix of the original term.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unintuitive to me. Do we just exclude non-prefix stems because otherwise it wouldn't match the exact input term?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right -- stems aren't always prefixes of the original term, which means the original term would no longer match. One example I ran into is "Cody" -> "Codi", which is inaccurate and creates a lot of noise. I added a comment explaining this.

"github.com/sourcegraph/sourcegraph/internal/search/query"
)

const maxTransformedPatterns = 10
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any details on why 10? I could imagine that's still pretty expensive in Zoekt

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not very scientific, I just tested out a search that tokenizes to 10 final terms. On my local instance, this always takes < 400ms, which seemed reasonable. We can definitely bump this down after testing on larger datasets.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, sounds good to me 👍

@jtibshirani jtibshirani merged commit 80a8177 into main May 25, 2023
5 checks passed
@jtibshirani jtibshirani deleted the jtibs/keyword-search branch May 25, 2023 18:13
jtibshirani added a commit that referenced this pull request May 31, 2023
We recently enabled Zoekt's new BM25 scoring for the `keyword` search
type. We enabled the option using feature flags, which is hacky because
users will never be touching this setting.

This PR refactors all Zoekt-related jobs to use the
`search.ZoektParameters` struct. This lets us pass the flag to Zoekt
directly when constructing jobs.

Follow-up to #52233
ErikaRS pushed a commit that referenced this pull request Jun 22, 2023
We have an experimental search type called `patterntype:keyword`. In
testing it on Cody-style queries, it had worse relevance than our
ripgrep implementation, and was sometimes quite slow.

This PR makes improvements to query analysis:
* Reduce the number of tokens we search by using a more aggressive
stopword list
* Make stemming cheaper and less noisy by using the stem if it's a
prefix of the original
* Limit the max number of tokens we'll search over
* Remove language detection because it was too noisy and makes it hard
to compare to other search strategies

It also improves ranking:
* Enable Zoekt's keyword scoring to rank documents by (approximate) BM25
* Removes unused ranking logic related to "match groups"

Addresses #50786
ErikaRS pushed a commit that referenced this pull request Jun 22, 2023
We recently enabled Zoekt's new BM25 scoring for the `keyword` search
type. We enabled the option using feature flags, which is hacky because
users will never be touching this setting.

This PR refactors all Zoekt-related jobs to use the
`search.ZoektParameters` struct. This lets us pass the flag to Zoekt
directly when constructing jobs.

Follow-up to #52233
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants