LSIF: Delete indexed files not known by git #8523
Conversation
fec4959 to d54ad0d
LGTM. Some questions inline; my main interest is how many in-flight requests can occur for a single path checker.
From a design perspective, is it possible to restructure this from a PathVisibilityChecker to a PathVisibilityFilter? I.e. the caller gives it a list of all the paths and it returns the visible ones. This would allow more flexibility to implement batching, etc.
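A filter-style interface along these lines might look like the following sketch. All names here are hypothetical illustrations of the suggestion, not code from the PR:

```typescript
// Hypothetical sketch of the suggested filter-style interface: the caller
// hands over every candidate path at once, and the filter returns only the
// visible ones, leaving room for batching internally.
interface PathVisibilityFilter {
    filterVisiblePaths(paths: string[]): Promise<string[]>
}

// A trivial implementation for illustration: visibility is decided by a
// predicate supplied up-front. A real implementation would consult
// gitserver, batching ls-tree calls as needed.
class StaticPathVisibilityFilter implements PathVisibilityFilter {
    constructor(private isVisible: (path: string) => boolean) {}

    public filterVisiblePaths(paths: string[]): Promise<string[]> {
        return Promise.resolve(paths.filter(p => this.isVisible(p)))
    }
}
```

Because the filter sees the whole list at once, an implementation is free to group the paths by parent directory and issue one query per directory.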
private frontendUrl?: string
private ctx?: TracingContext
private mockGetDirectoryChildren?: typeof getDirectoryChildren
private directoryContents = new Map<string, Set<string>>()
Out of interest, how likely is it that you will have something in an LSIF dump where the parent directory exists in git but the file does not? Just storing directory visibility would be more efficient to query and cache.
Somewhat likely, but not the common case, I suspect. I'm more interested in ignoring vendored deps that are large trees of untracked files, but there is also the case where we have a single generated file or a handful of generated files (grpc, protobuf, thrift; our schema.ts file generated from our graphql definition fits into this category).
Just storing directory visibility would be more efficient to query and cache.
The idea of having the directory contents here is that we get the visibility of all sibling files at once. Are you suggesting that we instead use something like private visibleDirectories = new Map<string, boolean>()? If that's the case then we'll still jump to files that don't exist in the tree unless we also re-check file visibility.
Maybe I misunderstood here - could you elaborate a bit? Where/why is it more efficient to query a directory's visibility rather than list its tree non-recursively?
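The trade-off under discussion can be sketched as follows. This is an illustrative reconstruction with hypothetical names, not the PR's actual code: one listing of a parent directory answers visibility for every sibling file, whereas a map of visible directories alone cannot distinguish a tracked file from an untracked sibling in the same tracked directory.

```typescript
// Hypothetical sketch: cache directory *contents* so that a single
// non-recursive listing answers file-level visibility for all siblings.
class DirectoryContentsCache {
    private directoryContents = new Map<string, Set<string>>()

    // listDirectory stands in for a non-recursive "git ls-tree"-style call
    // returning the root-relative paths of a directory's children.
    constructor(private listDirectory: (dir: string) => Promise<Set<string>>) {}

    public async isFileVisible(path: string): Promise<boolean> {
        const slash = path.lastIndexOf('/')
        const dir = slash === -1 ? '' : path.slice(0, slash)
        let children = this.directoryContents.get(dir)
        if (children === undefined) {
            children = await this.listDirectory(dir) // one listing per directory
            this.directoryContents.set(dir, children)
        }
        return children.has(path)
    }
}
```

With a visible-directories map instead, a lookup for an untracked generated file inside a tracked directory would wrongly report it visible unless the file is re-checked separately.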
 * @param documentPath The path of the file relative to the dump root.
 * @param requireDocumentDump Whether or not we require the path to be within the dump root.
 */
public async shouldIncludePath(documentPath: string, requireDocumentDump: boolean = true): Promise<boolean> {
Can this be called concurrently? I.e. is there any chance we will have multiple promises in flight leading into this function, and therefore duplicate work against this.isInGitTree?
Not currently. If we end up doing batching/concurrency I'll add a hard limit on the number of in-flight requests. They're currently performed sequentially, on demand.
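The hard limit mentioned above could be as simple as a small semaphore capping in-flight requests. This is a generic sketch of that idea, not code from the PR; all names are hypothetical:

```typescript
// Hypothetical sketch: a tiny promise-based semaphore that caps the number
// of concurrent gitserver requests if the checks are ever made concurrent.
class Semaphore {
    private waiters: (() => void)[] = []

    constructor(private available: number) {}

    public async acquire(): Promise<void> {
        if (this.available > 0) {
            this.available--
            return
        }
        await new Promise<void>(resolve => this.waiters.push(resolve))
    }

    public release(): void {
        const next = this.waiters.shift()
        if (next) {
            next() // hand the permit directly to the next waiter
        } else {
            this.available++
        }
    }
}

// Run fn while holding a permit, releasing it even if fn throws.
async function withLimit<T>(sem: Semaphore, fn: () => Promise<T>): Promise<T> {
    await sem.acquire()
    try {
        return await fn()
    } finally {
        sem.release()
    }
}
```

Wrapping each gitserver call in withLimit bounds the worst case regardless of how many documents the dump contains.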
Oh, and one other important bit: how much insight will we have into the time spent in PathVisibilityChecker and gitserver RPS? This has the potential to become a non-insignificant cost of an LSIF upload.
I don't have anything to add besides what @keegancsmith already mentioned, but I just wanted to formally approve and say: man, I love how you write code 👍
Since we cache all of the results, we can just supply the list up-front and not change any of the downstream code. How would you suggest these queries be batched? Just send multiple distinct git ls-tree requests in the same request?
We're currently logging the time spent for each request. I'll do a pre-run so we can get the total time spent over all requests as well, so we have that without needing to add the durations of distinct spans by hand.
Co-Authored-By: M. J. Fromberger <michael@sourcegraph.com>
…h/sourcegraph into lsif-remove-unreachable-paths
Indexers such as lsif-tsc will add index entries for files outside of the dump root (e.g. will include ../node_modules/ when indexing web/), add index entries for files outside of the repository root (e.g. will include ../../lsif-tsc/node_modules/typescript/lib/lib.es2015.symbol.d.ts), and add index entries for files in the repository/dump root but possibly untracked by git (e.g. node_modules, other generated files).
For the most part, this is just a waste of space in the dump. However, it also manifests itself badly in jump-to-definition (j2d) and find-references results. This PR removes references to any location that we can't jump to in the UI. It does not change any hover data, though: if we pull a hover definition out of lib.es2015.symbol.d.ts, it will still be in the result dump (we just won't give you the option to jump to a 404).
@sourcegraph/core-services, please review the general algorithm implemented in lsif/src/worker/conversion/visibility.ts described below, as it adds additional queries to gitserver at conversion time. If there is any concern about scale, or we need to address possible issues before this is merged, please start this discussion sooner rather than later.
For each document in the LSIF dump (this includes each source file in the root, and possibly additional files that it imports from vendored dependencies), we need to determine if the document path is a file known to git. We do this with the following steps:
Each git ls-tree command is non-recursive and we memoize the results of each call. This means that we add at most one git ls-tree call per unique parent directory of the LSIF index. Because we also query from the root of the repo down to the leaf, once we hit an untracked vendor root we don't need to go any further (think lsif-go indexing vendor/github.com/..., or lsif-tsc indexing node_modules in the root or a subdir). This will cut out any git ls-tree call for a non-existent directory within these vendor directories.
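The memoized root-to-leaf walk described above can be sketched as follows. This is a reconstruction for illustration only, with hypothetical helper names; see lsif/src/worker/conversion/visibility.ts for the actual implementation:

```typescript
// Hypothetical sketch of the walk: starting at the repo root, list each
// ancestor directory non-recursively (one memoized "ls-tree"-style call per
// unique directory) and stop as soon as an ancestor is untracked, so
// nothing below an untracked vendor root is ever queried.
async function isPathInGitTree(
    path: string,
    cache: Map<string, Set<string>>,
    // listDirectory stands in for a non-recursive git ls-tree call returning
    // the root-relative paths of a directory's tracked children.
    listDirectory: (dir: string) => Promise<Set<string>>
): Promise<boolean> {
    let current = ''
    for (const part of path.split('/')) {
        let children = cache.get(current)
        if (children === undefined) {
            children = await listDirectory(current)
            cache.set(current, children) // memoize: one call per directory
        }
        const candidate = current === '' ? part : `${current}/${part}`
        if (!children.has(candidate)) {
            return false // untracked ancestor: no further queries needed
        }
        current = candidate
    }
    return true
}
```

For a dump that indexes node_modules, the very first lookup under that directory fails at the root listing, so the entire vendored subtree costs a single cached call.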
If there are ways to more intelligently batch these requests I'd be happy to hear them. See this slack thread for some additional context.
Before
After