Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

symbol search slow at scale #3534

Open
slimsag opened this issue Apr 22, 2019 · 15 comments · Fixed by #5192

Comments

@slimsag
Copy link
Member

commented Apr 22, 2019

Because symbol search is implemented via repository -> SQLite regex matching, it can be very slow when hitting large repositories (100+ seconds for repositories on the order of ~20GB). This is in contrast to our zoekt search which is rather fast on a similar size repository I believe.

Most of these perf issues occur in filterSymbols here: https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/cmd/symbols/internal/symbols/search.go#L155

Possible solutions:

  1. Optimize simple regex matches to LIKE matches, e.g. f.*b.*r. becomes LIKE f%b%r_ to match foobar1. This is a good idea but will only help in simple cases.
  2. Prioritize proper indexing of symbols via zoekt. I believe this makes sense for multiple reasons:
    1. It would improve symbol search performance substantially in all cases
    2. It would improve perf of basic code intel and allow us to more easily support it on sourcegraph.com as both would be heavily reliant on symbol search perf.

Reported by https://app.hubspot.com/contacts/2762526/company/407948923

@slimsag slimsag added this to the Backlog milestone Apr 22, 2019

@slimsag slimsag self-assigned this Apr 22, 2019

@slimsag slimsag modified the milestones: Backlog, 3.4 Apr 22, 2019

@slimsag

This comment has been minimized.

Copy link
Member Author

commented Apr 22, 2019

I'll tackle the first optimization in 3.4.

More discussion is needed before committing to the 2nd solution here (which would be the proper long term solution I believe).

@chrismwendt

This comment has been minimized.

Copy link
Contributor

commented Apr 22, 2019

Do you happen to know what the common usage patterns are? Are users searching in the sidebar? Maybe a prefix index would be a sufficient optimization for that.

@slimsag

This comment has been minimized.

Copy link
Member Author

commented Apr 22, 2019

Via a script I believe.

@chrismwendt

This comment has been minimized.

Copy link
Contributor

commented Apr 22, 2019

For literal matches, ^foo$ (hits the index) is much faster than foo (not indexed, falls back to a scan).

@ijt

This comment has been minimized.

Copy link
Contributor

commented May 7, 2019

@slimsag, are you actively working on this? Are you still planning to finish it for 3.4?

@slimsag

This comment has been minimized.

Copy link
Member Author

commented May 7, 2019

I'll tackle the first optimization in 3.4.

But we do need someone here to be aware of the broader issue & take ownership on the 2nd issue point here.

@ijt

This comment has been minimized.

Copy link
Contributor

commented May 8, 2019

I reproduced this locally with fakehub on the Linux kernel.

Screen Shot 2019-05-07 at 5 48 38 PM

@slimsag

This comment has been minimized.

Copy link
Member Author

commented May 15, 2019

I missed this in 3.4 but may come in a subsequent patch release. Moving to 3.5

@slimsag slimsag modified the milestones: 3.4, 3.5 May 15, 2019

@slimsag slimsag referenced this issue May 29, 2019
4 of 31 tasks complete
@slimsag

This comment has been minimized.

Copy link
Member Author

commented Jun 7, 2019

@keegancsmith said:

FYI zoekt has support to run and index ctags output.

@slimsag slimsag changed the title symbol search slow on very large repositories symbol search slow at scale Jun 7, 2019

@slimsag slimsag removed this from the 3.5 milestone Jun 7, 2019

@ijt ijt referenced this issue Jun 19, 2019
1 of 5 tasks complete

@kzh kzh self-assigned this Jun 19, 2019

@ijt ijt added this to the 3.6 milestone Jun 20, 2019

@beyang beyang referenced this issue Jun 27, 2019
4 of 15 tasks complete
@slimsag

This comment has been minimized.

Copy link
Member Author

commented Jul 1, 2019

Reasons for pursuing symbol search via Zoekt:

  • Queries like type:symbol ^Router$ count:9999999 taking much longer than indexed zoekt text searches (and sometimes timing out). e.g. compare plaintext vs. symbol (this is a real query indicative of what a major customer is trying to do).
  • Timeouts on a large customer instance when running non-large-count queries when many of these queries (incl. large-count) are ongoing. Zoekt-based textsearch doesn't behave poorly in this way.
  • High memory usage introduced in 3.3, not directly attributed to sqlite-based symbol search but I suspect could be helped by a zoekt-based implementation, see #4643 for further explanation here.

Note that even though we persue symbol search via Zoekt, we will still maintain the old implementation specifically for cases of non-indexed branches and revisions.

@chrismwendt

This comment has been minimized.

Copy link
Contributor

commented Jul 1, 2019

Those problems could also be solved by eagerly indexing symbols, right? I'm wondering if it would be faster to implement that.

That said, the eager-indexing infrastructure code might already be implemented for Zoekt (or even inside it?), and reimplementing it for symbols might take longer than switching to Zoekt. Also, there might be customer need to run a lot of non-^...$ symbols searches (which Zoekt is specifically designed for).

@keegancsmith

This comment has been minimized.

Copy link
Member

commented Jul 2, 2019

Those problems could also be solved by eagerly indexing symbols, right? I'm wondering if it would be faster to implement that.

I would be surprised if our current architecture could handle 30k repo symbol requests. For example, just resolving HEAD for 30k repos is not possible, so we would need some caching for that as well. We have the correct architecture to keep up to date a large set of repos.

Additionally zoekt has two things in its favour to our current impl:

  • It is purpose built to answer any regex quickly.
  • It has some support for ctags already.

That said, the eager-indexing infrastructure code might already be implemented for Zoekt (or even inside it?), and reimplementing it for symbols might take longer than switching to Zoekt. Also, there might be customer need to run a lot of non-^...$ symbols searches (which Zoekt is specifically designed for).

Yes, zoekt only does eager indexing so we can't get rid of symbol searcher. However, once you have Zoekt handling HEAD its then becomes quite hard to do searches for large reposets which aren't HEAD (with our current query syntax). So the mix of both is great.

Some extra context from a discussion I had on slack with @kzh:

zoekt only retains the indices (start, end) of the symbols from the ctags information generated for its rankings. It discards the rest (excluding language) including a notable field, kind which we use to indicate the type of the symbol (ie func, interface, etc).

@kzh is now modifying zoekt to index more information. We will see where this takes us.

@slimsag: What is a case where we do symbol searches across all repos. Is it only in the case of type:symbol queries, or is it triggered more than that now? Maybe language extensions using symbol search instead of text search?

@slimsag

This comment has been minimized.

Copy link
Member Author

commented Jul 12, 2019

I missed this, apologies.

@slimsag: What is a case where we do symbol searches across all repos. Is it only in the case of type:symbol queries, or is it triggered more than that now? Maybe language extensions using symbol search instead of text search?

Basic code intel is implemented 100% on top of type:symbol queries, or at the very least uses it heavily for cross-repo ops, so this is a major source of these requests and how it ended up central to our product AFAIK.

@keegancsmith keegancsmith modified the milestones: 3.6, 3.7 Jul 15, 2019

@keegancsmith

This comment has been minimized.

Copy link
Member

commented Jul 15, 2019

I've bumped to the next milestone since we will be continuing work on this.

@slimsag

This comment has been minimized.

Copy link
Member Author

commented Aug 26, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.