Global search on Sourcegraph.com should indicate that repo: filter is needed #3966

Closed
sqs opened this issue May 13, 2019 · 11 comments

@sqs
Member

sqs commented May 13, 2019

Global search isn't supported on Sourcegraph.com. If the user specifies a query that exceeds the maximum number of allowed repositories to search over (implemented in the code link below), they should see a helpful message instead of just a "context deadline exceeded" error.

Repro:

  1. Go to https://sourcegraph.com/search?q=foo+bar

Actual:
[screenshot of the error]

Expected: The "Too many matching repositories" alert at https://sourcegraph.com/github.com/sourcegraph/sourcegraph@6f954bbc1cfc4a54d715f42325cb2ea4037e93ec/-/blob/cmd/frontend/graphqlbackend/search_alert.go#L222:11

(From https://twitter.com/sqs/status/1127746584811192320)
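
A minimal sketch of the expected behavior, assuming the repo list is resolved before the search runs. The names `SearchAlert` and `check_repo_limit`, the description wording, and the "0 means unlimited" convention are hypothetical and are not taken from the linked Go code; only the alert title comes from this issue.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class SearchAlert:
    title: str
    description: str


def check_repo_limit(
    repos: Sequence[str], max_repos_to_search: int
) -> Optional[SearchAlert]:
    """Return an alert if the query matches more repos than allowed, else None.

    Assumption of this sketch: a limit of 0 or less means "unlimited".
    """
    if max_repos_to_search > 0 and len(repos) > max_repos_to_search:
        return SearchAlert(
            title="Too many matching repositories",
            description=(
                f"The query matched {len(repos)} repositories; the limit is "
                f"{max_repos_to_search}. Add a repo: filter to narrow the search."
            ),
        )
    return None
```

Checking the limit before any search work starts means the user gets the alert immediately instead of waiting for a deadline to expire.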

@sqs added the bug and search labels May 13, 2019
@sqs added this to the 3.5 milestone May 13, 2019
@sqs assigned ijt May 13, 2019

@ijt
Contributor

ijt commented May 14, 2019

@nicksnyder changed the default searchable number of repos to be unlimited, and sourcegraph.com appears to be using the default.

Instead of an error, I think we could show some repo results that could then be used to refine the query. Watch count could be a useful indicator of relevance. Here's a BigQuery query showing what that looks like for "vim":

[Screenshot: BigQuery results for "vim"]

@ijt
Contributor

ijt commented May 14, 2019

To be clear, I'm not necessarily saying we should use BigQuery for this. It just happens to have a dataset that helps to see what would happen with a simple approach like this.

One possibility that would work beyond GitHub is to periodically query the recent_searches table to find out how frequently each repo shows up in sourcegraph.com queries.

@ijt
Contributor

ijt commented May 14, 2019

To give a rough idea of what that could look like, here's some data from the recent_searches table:

sg=# create temp table rs as select substring(query from 'repo:([^ @]+)') as repo, count(*) from recent_searches where query like '%repo:%' group by repo order by count desc;    
SELECT 5211

sg=# select * from rs limit 20;
                                     repo                                      | count 
-------------------------------------------------------------------------------+-------
 ^github.com/tensorflow/tensorflow$                                            |  1381
 ^github.com/fastai/fastai$                                                    |  1334
 ^github.com/openai/baselines$                                                 |   618
 ^github.com/kaldi-asr/kaldi$                                                  |   528
 ^github.com/apache/spark$                                                     |   492
 ^github.com/sourcegraph/sourcegraph$                                          |   480
 ^github.com/numpy/numpy$                                                      |   465
 ^github.com/docker/compose$                                                   |   436
 ^github.com/pytorch/pytorch$                                                  |   435
 ^github.com/rlworkgroup/garage$                                               |   432
 ^github.com/huggingface/pytorch-pretrained-BERT$                              |   408
 ^github.com/akanimax/Variational_Discriminator_Bottleneck$                    |   395
 ^github.com/envoyproxy/envoy$                                                 |   358
 ^github.com/nearprotocol/nearcore$                                            |   355
 ^github.com/pytorch/vision$                                                   |   349
 ^github.com/ray-project/ray$                                                  |   342
 ^github.com/paritytech/parity-wasm$                                           |   334
 ^github.com/LZQthePlane/Online-Realtime-Action-Recognition-based-on-OpenPose$ |   296
 ^github.com/dmlc/tvm$                                                         |   283
 ^github.com/travislee8964/ocserv-auto$                                        |   273
(20 rows)

sg=# select * from rs where repo ~ 'vim[^/]*$' limit 20;
                    repo                     | count 
---------------------------------------------+-------
 ^github.com/autozimu/LanguageClient-neovim$ |    76
 ^github.com/amix/vimrc$                     |    19
 ^github\.com/neovim/neovim$                 |    18
 ^github.com/neovim/neovim$                  |    11
 ^github.com/daa84/neovim-lib$               |     6
 ^github.com/vim/vim$                        |     5
 ^github.com/crazyclerk/vimrc$               |     4
 ^github.com/sakhnik/nvim-gdb$               |     2
 ^github\.com/airblade/vim-gitgutter$        |     2
 ^github.com/JetBrains/ideavim$              |     2
 ^github.com/hmybmny/vim.cpp$                |     1
(11 rows)

Here the 'vim[^/]*$' regex matches entries whose last repo path component contains "vim". The result looks fairly reasonable as a result set for the query "vim".

@ijt
Contributor

ijt commented May 14, 2019

Of course we'd have to join rs against the repo table since the repo column in rs is actually a regex, not an actual repo name.
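
A rough sketch of that join, done in application code rather than SQL. The function name `popularity_by_repo` is hypothetical, and Python's `re` module stands in for whatever regex engine the real query path uses, so edge-case matching behavior may differ. It credits each pattern's search count to every repo name the pattern matches, which also merges the escaped and unescaped variants seen in the table above.

```python
import re
from collections import Counter
from typing import Dict, Iterable


def popularity_by_repo(
    pattern_counts: Dict[str, int], repo_names: Iterable[str]
) -> Counter:
    """Credit each pattern's search count to every repo name it matches."""
    totals: Counter = Counter()
    names = list(repo_names)
    for pattern, count in pattern_counts.items():
        try:
            compiled = re.compile(pattern)
        except re.error:
            continue  # skip user-entered patterns that are not valid regexes
        for name in names:
            if compiled.search(name):
                totals[name] += count
    return totals


# Counts copied from the rs table above.
pattern_counts = {
    r"^github\.com/neovim/neovim$": 18,
    r"^github.com/neovim/neovim$": 11,
    r"^github.com/vim/vim$": 5,
}
repo_names = ["github.com/neovim/neovim", "github.com/vim/vim"]
totals = popularity_by_repo(pattern_counts, repo_names)
```

With this input, the two neovim patterns collapse into a single entry for github.com/neovim/neovim with a combined count of 29.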

@ijt
Contributor

ijt commented May 21, 2019

Streaming (#3991) should make it unnecessary to require a repo: filter on sourcegraph.com, since some results would show up quickly in many cases, and all of them would eventually show up, at least for the repos that are currently cloned.
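
To illustrate the idea only (this is not the design in #3991), here is a small sketch in which results are yielded as each repo's search finishes rather than after all of them complete. The function `stream_results` and the `search_repo` callback are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterable, Iterator, List, Tuple


def stream_results(
    repos: Iterable[str],
    search_repo: Callable[[str], List[str]],
    max_workers: int = 4,
) -> Iterator[Tuple[str, List[str]]]:
    """Yield (repo, matches) pairs in completion order, skipping empty results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the repo it is searching.
        futures = {pool.submit(search_repo, repo): repo for repo in repos}
        for future in as_completed(futures):
            matches = future.result()
            if matches:
                yield futures[future], matches
```

Because the caller can render each pair as it arrives, fast repos show results right away even when slow ones are still being searched.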

@slimsag
Member

slimsag commented May 21, 2019

I have set "maxReposToSearch": 50 back in the Sourcegraph.com site configuration. I have NO idea how this got removed or if it was just never added, but this was a huge gaping performance issue on our site and caused everything to slow down significantly just now.

This is fixed now, so closing.

@slimsag closed this as completed May 21, 2019

@felixfbecker
Contributor

@slimsag IMO this issue is not resolved. A private instance can still configure a higher number and run into the "Context deadline exceeded" error instead of the expected alert.

@slimsag
Member

slimsag commented May 22, 2019

@felixfbecker point taken, Context deadline exceeded is not a helpful message and we should fix that.

However, the correct behavior in this case would be telling the user that they need to increase the timeout parameter. If I say I want my instance to be able to search over 1000 repositories and that's too slow, I don't expect it to tell me to specify a repo: filter.
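
A minimal sketch of that distinction, assuming the backend surfaces the raw error string. The function name `describe_search_failure` and the message wording are hypothetical; the point is only that a deadline error maps to advice about the timeout rather than about a repo: filter.

```python
def describe_search_failure(
    error: str, repo_count: int, timeout_seconds: float
) -> str:
    """Translate a raw search error into a user-facing message."""
    if "context deadline exceeded" in error.lower():
        return (
            f"Search timed out after {timeout_seconds:g}s while searching "
            f"{repo_count} repositories. Increase the timeout to let the "
            "search finish."
        )
    # Any other error is passed through unchanged in this sketch.
    return error
```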

@ijt
Contributor

ijt commented May 22, 2019 via email

@ijt
Contributor

ijt commented Jun 14, 2019

@slimsag took care of this specific issue. Closing it.

[screenshot]

@ijt closed this as completed Jun 14, 2019

@ijt
Contributor

ijt commented Jun 14, 2019

@felixfbecker, there should be a separate issue about the point you raised if we're able to reproduce it.
