Note: I plan on editing this write-up and publishing it someplace once this PR is merged :)
Introduction
We're seeking to add workspace symbols to ZLS, that is, the ability to search for declarations by name across all the files in your editor's file tree. This incurs significant construction and access overhead, which must be minimized to maintain the interactivity that is crucial in the editor environments where ZLS runs.
N-grams and trigrams
An n-gram is a chunk of text of size n. It is produced by sliding a window of size n and stride 1 across a larger chunk of text, ending when a new window of size n cannot be created. A trigram is an n-gram where n = 3.
To obtain the trigrams of agent, for example, we obtain the first 3 characters, age, then shift our window by 1 to obtain gen, and again to obtain ent. Shifting our window by 1 again would not yield 3 characters, so we stop here. Thus, the trigrams of agent are age, gen, and ent.
N-grams are a nice way to execute approximate searches over a large corpus of text, allowing the consideration of not only the entirety, prefix, or suffix of a search target, but also all of its constituent parts. Trigrams also enable efficient large-scale regular expression searches (see Zoekt from my ex-employer Sourcegraph), but that's out of scope for this article.
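The sliding-window procedure above can be sketched in a few lines. This is an illustrative Python version, not the Zig code in ZLS; the function name `ngrams` is made up for this example, and it slides over Python string characters, whereas a real implementation may well operate on bytes.

```python
def ngrams(text: str, n: int = 3) -> list[str]:
    """Slide a window of size n with stride 1 over text."""
    # range() is empty when len(text) < n, so inputs that are too
    # short simply yield no n-grams.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(ngrams("agent"))  # ['age', 'gen', 'ent']
print(ngrams("ab"))     # []
```

A name shorter than three characters produces no trigrams at all, which is why such names cannot be found through a trigram index.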
Indexing
We need to index the name of every single global constant, variable, and function declaration. This is easily doable with the now-refactored DocumentScope, which lists all declarations in a single contiguous list.
Note that we could perform trigram indexing immediately during the construction of the DocumentScope, but that would incur overhead on every edit that we'd rather split into a separate task in our multithreaded setup to keep ZLS fast and responsive.
We can begin by attaching a flag, should_be_indexed_for_trigrams, during the construction of the DocumentScope to each declaration identifying whether it's one of our search targets, thus preventing locals and symbols with names shorter than three characters long from being indexed.
During indexing, we iterate over the declarations for each document and find the trigrams for their names. We then create an inverse mapping from each trigram in the declaration's name to the declaration.
So for example, if declaration Declaration.Index(1) has name agent, our inverse mapping would map each of the trigrams age, gen, and ent to Declaration.Index(1).
This inverse mapping is constructed per-document. After it is constructed, we also construct a Binary Fuse filter to quickly disqualify documents that do not contain certain trigrams at query time.
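As a rough illustration of the per-document indexing step, here is a minimal Python sketch. The `Declaration` class and `build_trigram_index` function are hypothetical stand-ins invented for this example (only the flag name comes from the description above), and the real ZLS data layout is certainly different.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Declaration:
    # Hypothetical stand-in for an entry in a DocumentScope's declaration list.
    name: str
    should_be_indexed_for_trigrams: bool


def build_trigram_index(declarations: list[Declaration]) -> dict[str, list[int]]:
    """Map each trigram to the indices of the declarations whose names contain it."""
    index: dict[str, list[int]] = defaultdict(list)
    for decl_index, decl in enumerate(declarations):
        if not decl.should_be_indexed_for_trigrams:
            continue  # locals and too-short names are skipped
        name = decl.name
        # A set, so a trigram repeated within one name is recorded only once.
        for trigram in {name[i:i + 3] for i in range(len(name) - 2)}:
            index[trigram].append(decl_index)
    return dict(index)


decls = [
    Declaration("i", False),
    Declaration("agent", True),
    Declaration("gentle", True),
]
index = build_trigram_index(decls)
print(index["age"])  # [1]
print(index["gen"])  # [1, 2]
```

Because declarations are visited in increasing index order, every list in the mapping comes out sorted, which is the property a merge-style intersection relies on at query time.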
Querying
We begin by obtaining the trigrams for our query. We then iterate through all our documents and check if each document contains each trigram in our query via the Binary Fuse filter, which cannot return false negatives but can return false positives, albeit with a very low false positive rate. This allows us to reduce our computation to only documents that likely have all our query trigrams, and is especially effective for longer queries.
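The candidate-selection step might look roughly like the Python sketch below. Note the big assumption: a `frozenset` of trigrams stands in for each document's Binary Fuse filter. A set is exact, whereas a real Binary Fuse filter is far more compact and may occasionally report a trigram that is not there. The function names and file names are invented for the example.

```python
def trigrams(text: str) -> set[str]:
    return {text[i:i + 3] for i in range(len(text) - 2)}


def candidate_documents(query: str, filters: dict[str, frozenset[str]]) -> list[str]:
    """Return the documents whose filter reports every trigram of the query."""
    wanted = trigrams(query)
    # Stand-in semantics: a real Binary Fuse filter may say "yes" for an
    # absent trigram (false positive) but never "no" for a present one.
    # In this sketch, a query with no trigrams matches every document;
    # how ZLS treats such short queries is not covered here.
    return [uri for uri, present in filters.items() if wanted <= present]


filters = {
    "a.zig": frozenset({"age", "gen", "ent"}),
    "b.zig": frozenset({"gen", "ent", "ntl", "tle"}),
}
print(candidate_documents("gent", filters))   # ['a.zig', 'b.zig']
print(candidate_documents("agent", filters))  # ['a.zig']
```

Longer queries contribute more trigrams, so more documents fail at least one membership check, which matches the observation that the filter is especially effective for longer queries.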
Once we've gathered our candidate documents, querying is essentially just performing an intersection.
We use a "merge intersection," borrowed from the merge step of merge sort, to intersect sorted lists. King tried beating this approach in a couple of purely hashmap-based ways, but with our setup, which mostly involves many small inverse mappings (~10,000 trigrams with ~30 declarations each, for example), the merge intersection always won out.
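For reference, a merge intersection over two sorted lists can be sketched as follows. This is a generic Python rendering of the technique, assuming each list is sorted and free of duplicates; it is not the ZLS implementation, and the name `merge_intersect` is made up here.

```python
def merge_intersect(a: list[int], b: list[int]) -> list[int]:
    """Intersect two sorted, duplicate-free lists in O(len(a) + len(b))."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1  # a[i] is too small to appear in b from position j onward
        elif a[i] > b[j]:
            j += 1  # b[j] is too small to appear in a from position i onward
        else:
            result.append(a[i])  # common element: keep it, advance both sides
            i += 1
            j += 1
    return result


print(merge_intersect([1, 2, 5, 9], [2, 3, 5, 10]))  # [2, 5]
```

To intersect the lists for all trigrams in a query, the result of one intersection can be fed into the next. Each pass is a linear scan with no hashing or allocation beyond the output, which plausibly explains why it does well on many small lists.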
I took a look at and partially implemented Fast Set Intersection in Memory, but it seems to be significant overkill for this sort of small intersection application. In Section 4, "Experimental Evaluation," they show that merge intersection performs rather well and sometimes comparably to the implementations shown in the paper for small intersection sizes, so that's what we're sticking with unless someone can find a better solution.