Search: improve keyword search prototype #52233

jtibshirani · 2023-05-19T23:57:46Z

We have an experimental search type called `patterntype:keyword`. In testing it on Cody-style queries, it had worse relevance than our ripgrep implementation, and was sometimes quite slow.

This PR makes improvements to query analysis:

* Reduce the number of tokens we search by using a more aggressive stopword list
* Make stemming cheaper and less noisy by only using the stem if it's a prefix of the original term
* Limit the max number of tokens we'll search over
* Remove language detection because it was too noisy and makes it hard to compare to other search strategies
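
A minimal sketch of the stopword and token-limit steps listed above, for illustration only. The `analyzeQuery` name, the tiny stopword set, and the `maxTerms` constant are assumptions made for this example and are not taken from the PR; stemming is left out.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// maxTerms caps how many terms we search over. The value is illustrative.
const maxTerms = 10

// stopWords is a tiny, hypothetical stopword set; a real list is far larger.
var stopWords = map[string]struct{}{
	"a": {}, "are": {}, "do": {}, "does": {}, "for": {}, "in": {}, "is": {},
	"the": {}, "to": {}, "we": {}, "what": {}, "where": {},
}

// analyzeQuery splits a natural-language query into search terms. It trims
// punctuation from both ends of each word, drops stopwords and duplicates
// (case-insensitively), and keeps at most maxTerms terms.
func analyzeQuery(query string) []string {
	seen := make(map[string]struct{})
	terms := make([]string, 0, maxTerms)
	for _, field := range strings.Fields(query) {
		term := strings.TrimFunc(field, unicode.IsPunct)
		key := strings.ToLower(term)
		if key == "" {
			continue
		}
		if _, isStop := stopWords[key]; isStop {
			continue
		}
		if _, isDup := seen[key]; isDup {
			continue
		}
		seen[key] = struct{}{}
		terms = append(terms, term)
		if len(terms) == maxTerms {
			break
		}
	}
	return terms
}

func main() {
	// Prints: [convert lang filters file extensions]
	fmt.Println(analyzeQuery("Where do we convert lang filters to file extensions?"))
}
```

Capping the term count bounds the size of the resulting "OR" query, at the cost of silently ignoring the tail of very long questions.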

It also improves ranking:

* Enable Zoekt's keyword scoring to rank documents by (approximate) BM25
* Remove unused ranking logic related to "match groups"

Addresses #50786

Test plan

New unit tests, plus quality tests (results in a comment below).

jtibshirani · 2023-05-20T00:10:20Z

~~Neutral/negative signal: this implementation does worse than ripgrep on CodeSearchNet. It's not horrible though, and NDCG is a tough metric (compared to what we really care about, which is recall).~~

Update: I noticed that we try to automatically pull out language aliases (like "typescript", "html", etc.) into language filters. This adds a lot of noise, since it converts common terms like "batch", "json", "text" to file name filters. After removing it, we see good results compared to ripgrep:

| NDCG@k | ripgrep | prototype |
|--------|---------|-----------|
| @1     | 0.1830  | 0.2155    |
| @5     | 0.2921  | 0.3608    |
| @10    | 0.3409  | 0.4178    |
| @20    | 0.4176  | 0.5017    |
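
For readers unfamiliar with the metric, here is a sketch of one common way to compute NDCG@k in Go. The thread does not say which NDCG variant the evaluation used, so the `rel / log2(rank + 1)` gain and the choice to derive the ideal ordering from the grades passed in are assumptions.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// dcgAtK sums rel_i / log2(i + 1) over the first k results, where i is the
// 1-based rank and rels holds the relevance grade of each ranked result.
func dcgAtK(rels []float64, k int) float64 {
	sum := 0.0
	for i := 0; i < k && i < len(rels); i++ {
		sum += rels[i] / math.Log2(float64(i)+2)
	}
	return sum
}

// ndcgAtK divides DCG@k by the DCG@k of the best possible ordering of the
// same grades. It returns 0 when no result has a positive grade.
func ndcgAtK(rels []float64, k int) float64 {
	ideal := append([]float64(nil), rels...)
	sort.Sort(sort.Reverse(sort.Float64Slice(ideal)))
	idcg := dcgAtK(ideal, k)
	if idcg == 0 {
		return 0
	}
	return dcgAtK(rels, k) / idcg
}

func main() {
	// The single relevant document is ranked second out of three.
	fmt.Printf("%.4f\n", ndcgAtK([]float64{0, 1, 0}, 10))
}
```

Because the gain is discounted by rank, a relevant file at position 2 already scores noticeably below 1, which is one reason NDCG is stricter than a plain recall check.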

Positive signal: this prototype does better than ripgrep on my set of example searches for the sourcegraph/sourcegraph repo. The following table shows whether each method finds the correct file in the top 10.

| Search | Analyzed search (for ripgrep/keyword) | Correct file | ripgrep | prototype | embeddings |
|--------|---------------------------------------|--------------|---------|-----------|------------|
| What does InternalDoer do? | InternalDoer | internal/httpcli/client.go | ❌ | ❌ | ✅ |
| Is crewjam/saml used anywhere in the code? | crewjam/saml | go.mod | ❌ | ✅ | ✅ |
| Where are the Cody “no context messages” defined? | Cody context messages | enterprise/cmd/worker/internal/embeddings/contextdetection/dataset.go | ❌ | ❌ | ✅ |
| Where are the embeddings no context regexes? | embeddings context regexes | enterprise/cmd/embeddings/shared/context_detection.go | ✅ | ✅ | ✅ |
| Are sub-repo permissions respected in the embeddings service? | sub-repo permissions embeddings service | enterprise/cmd/embeddings/shared/main.go | ✅ | ✅ | ✅ |
| Where do we convert lang filters to file extensions? | convert lang filters file extensions | internal/search/query/helpers.go | ✅ | ✅ | ❌ |
| Where are the grafana dashboards defined for frontend search ranking? | grafana dashboards frontend search ranking | monitoring/definitions/frontend.go | ❌ | ✅ | ❌ |
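
The check behind each ✅/❌ cell can be sketched as a simple "hit in top k" test. The function name and the ranked list in `main` are hypothetical; the actual evaluation may well have been done by hand.

```go
package main

import "fmt"

// foundInTopK reports whether want appears among the first k ranked results.
func foundInTopK(ranked []string, want string, k int) bool {
	for i := 0; i < k && i < len(ranked); i++ {
		if ranked[i] == want {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical ranking for illustration; not actual search output.
	ranked := []string{
		"internal/search/query/types.go",
		"internal/search/query/helpers.go",
	}
	fmt.Println(foundInTopK(ranked, "internal/search/query/helpers.go", 10))
}
```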

jtibshirani · 2023-05-22T22:11:24Z

internal/search/client/client.go

@@ -114,14 +114,17 @@ func (s *searchClient) Plan(
 	}
 	tr.LazyPrintf("parsing done")

+	features := ToFeatures(featureflag.FromContext(ctx), s.logger)


This is hacky! Fixing this requires a refactor, which I'll do in a follow-up (but I didn't want to clutter this PR with the changes).

Update: I opened #52649 to fix this.

sourcegraph-bot · 2023-05-24T19:50:51Z

Codenotify: Notifying subscribers in CODENOTIFY files for diff 39999bb...f3d2c38.

| Notify | File(s) |
|--------|---------|
| @camdencheek | internal/search/client/client.go internal/search/keyword/BUILD.bazel internal/search/keyword/match_groups.go internal/search/keyword/match_groups_test.go internal/search/keyword/query_transformer.go internal/search/keyword/query_transformer_test.go internal/search/keyword/stop_words.go internal/search/keyword/term_utils.go internal/search/types.go internal/search/zoekt/zoekt.go |
| @keegancsmith | internal/search/client/client.go internal/search/keyword/BUILD.bazel internal/search/keyword/match_groups.go internal/search/keyword/match_groups_test.go internal/search/keyword/query_transformer.go internal/search/keyword/query_transformer_test.go internal/search/keyword/stop_words.go internal/search/keyword/term_utils.go internal/search/types.go internal/search/zoekt/zoekt.go |

jtibshirani · 2023-05-24T20:01:43Z

internal/search/keyword/term_utils.go

+	return commonCodeSearchTerms.Has(input) || stopWords.Has(input)
+}
+
+var commonCodeSearchTerms = stringSet{


Highlighting the main issue with this approach: if a common word sneaks through our aggressive stopword lists, it can make the results very noisy. Truly fixing this would require improvements in Zoekt:

* Make its new BM25 scoring logic respect IDF, so that common terms are down-weighted
* Make it better able to handle large "OR" queries efficiently
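
To illustrate why IDF matters here, the sketch below uses the textbook BM25 term score with a widely used smoothed IDF. It is not Zoekt's implementation; the `k1` and `b` values are common defaults, and the corpus numbers in `main` are made up.

```go
package main

import (
	"fmt"
	"math"
)

// Typical BM25 defaults; assumptions for this sketch.
const (
	k1 = 1.2
	b  = 0.75
)

// idf returns a smoothed inverse document frequency. Terms that occur in
// most documents get a weight close to zero.
func idf(numDocs, docFreq int) float64 {
	n, df := float64(numDocs), float64(docFreq)
	return math.Log(1 + (n-df+0.5)/(df+0.5))
}

// bm25Term scores one query term for one document, given the term frequency
// in the document, the document length, and the average document length.
func bm25Term(tf, docLen, avgDocLen float64, numDocs, docFreq int) float64 {
	norm := tf + k1*(1-b+b*docLen/avgDocLen)
	return idf(numDocs, docFreq) * tf * (k1 + 1) / norm
}

func main() {
	// Same term frequency and document length, different corpus frequency.
	common := bm25Term(3, 100, 100, 1000, 900)
	rare := bm25Term(3, 100, 100, 1000, 5)
	fmt.Printf("common term: %.3f, rare term: %.3f\n", common, rare)
}
```

Without the `idf` factor, both terms above would score the same, so a common word that slips past the stopword list would count as much as a distinctive identifier.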

camdencheek

Looks great, thanks!

camdencheek · 2023-05-25T15:36:49Z

internal/search/keyword/term_utils.go

+}
+
+func stemTerm(input string) string {
+	// Attempt to stem words, but only use the stem if it's a prefix of the original term.


This is unintuitive to me. Do we just exclude non-prefix stems because otherwise it wouldn't match the exact input term?

That's right -- stems aren't always prefixes of the original term, which means the original term would no longer match. One example I ran into is "Cody" -> "Codi", which is inaccurate and creates a lot of noise. I added a comment explaining this.
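
The prefix rule discussed above can be sketched as follows. The stemmer is passed in as a function, and `toyStem` is a hypothetical stand-in that only mimics the "Cody" -> "Codi" behavior; the PR's `stemTerm` and its actual stemmer may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// stemIfPrefix returns the stem only when it is a non-empty prefix of the
// original term, so that a search for the result still matches the original
// term. Otherwise it falls back to the original term.
func stemIfPrefix(term string, stem func(string) string) string {
	stemmed := stem(term)
	if stemmed != "" && strings.HasPrefix(term, stemmed) {
		return stemmed
	}
	return term
}

// toyStem is a hypothetical, deliberately crude stemmer used only to make
// this example self-contained.
func toyStem(word string) string {
	switch {
	case strings.HasSuffix(word, "y"):
		return strings.TrimSuffix(word, "y") + "i"
	case strings.HasSuffix(word, "s"):
		return strings.TrimSuffix(word, "s")
	}
	return word
}

func main() {
	fmt.Println(stemIfPrefix("Cody", toyStem))    // "Codi" is not a prefix, so prints Cody
	fmt.Println(stemIfPrefix("filters", toyStem)) // prints filter
}
```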

camdencheek · 2023-05-25T15:40:03Z

internal/search/keyword/query_transformer.go

 	"github.com/sourcegraph/sourcegraph/internal/search/query"
 )

+const maxTransformedPatterns = 10


Any details on why 10? I could imagine that's still pretty expensive in Zoekt

This is not very scientific: I just tested a search that tokenizes to 10 final terms. On my local instance, this always takes < 400ms, which seemed reasonable. We can definitely bump this down after testing on larger datasets.

Cool, sounds good to me 👍

We recently enabled Zoekt's new BM25 scoring for the `keyword` search type. We enabled the option using feature flags, which is hacky because users will never be touching this setting. This PR refactors all Zoekt-related jobs to use the `search.ZoektParameters` struct. This lets us pass the flag to Zoekt directly when constructing jobs. Follow-up to #52233


Search: improve keyword search prototype

b72ba9f

cla-bot bot added the cla-signed label May 19, 2023

jtibshirani added 2 commits May 22, 2023 09:36

Fix test

abff689

Remove language identification

b2921ff

jtibshirani requested a review from a team May 22, 2023 18:16

jtibshirani mentioned this pull request May 22, 2023

Implement keyword search context as fallback for cody web #50786

Closed

jtibshirani commented May 22, 2023

jtibshirani added 3 commits May 22, 2023 15:12

Apply formatting

840910b

Merge remote-tracking branch 'upstream/main' into jtibs/keyword-search

7189e8b

Bazel configure

0815b81

jtibshirani marked this pull request as ready for review May 24, 2023 19:48

Merge remote-tracking branch 'upstream/main' into jtibs/keyword-search

f28ab6f

jtibshirani commented May 24, 2023

keegancsmith approved these changes May 25, 2023

camdencheek approved these changes May 25, 2023

jtibshirani added 3 commits May 25, 2023 09:57

Simplify removePunctuation

d38e218

Clarify stemming strategy

7285c57

Bazel configure

f3d2c38

jtibshirani merged commit 80a8177 into main May 25, 2023
5 checks passed

jtibshirani deleted the jtibs/keyword-search branch May 25, 2023 18:13

jtibshirani mentioned this pull request May 30, 2023

Search: fix UseKeywordScoring hack #52649

Merged

This was referenced Jun 9, 2023

Cody context: keyword search bug fixes #53274

Closed

LLM-enhanced keyword context #52815

Merged

Search: solidify Zoekt options for keyword search #53430

Merged
