Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LSIF: Delete indexed files not known by git #8523

Merged
merged 15 commits into from Feb 21, 2020

Conversation

efritz
Copy link
Contributor

@efritz efritz commented Feb 20, 2020

Indexers such as lsif-tsc will add index entries for files outside of the dump root (e.g. will include ../node_modules/ when indexing web/), add index entries for files outside of the repository root (e.g. will include ../../lsif-tsc/node_modules/typescript/lib/lib.es2015.symbol.d.ts), and add index entries for files in the repository/dump root but possibly untracked by git (e.g. node_modules, other generated files).

For the most part, this is just a waste of space in the dump. However, it also manifests itself in j2d and find references results in a bad way. This PR removes references to any location that we can't jump to in the UI. This does not change any hover data, though. So if we pull a hover definition out of lib.es2015.symbol.d.ts, it will still be in the result dump (we just won't give you the option to jump to a 404).

@sourcegraph/core-services, please review the general algorithm implemented in lsif/src/worker/conversion/visibility.ts described below, as it adds additional queries to gitserver at conversion time. If there is any concern with scale or we need to address possible issues before this is merged please start this discussion earlier than later.

For each document in the LSIF dump (this includes each source file in the root, and possibly additional files that it imports from vendored dependencies), we need to determine if the document path is a file known to git. We do this with the following steps:

  • determine the dirname of the file relative to the repository root
  • return early if it's outside of the repo (starts with ..) as git ls-tree will fail here
  • run a (cached) git ls-tree on all ancestors of the dirname; if any ancestor returns an empty child set, we can early out
  • determine if the path exists in the child set of the parent directory

Each git ls-tree command is non-recursive and we memoize the results of each call. This means that we add at most one git ls-tree call per unique parent directory of the LSIF index. Because we also query from the root of the repo down to the leaf, once we hit an untracked vendor root we don't need to go any further (think lsif-go indexing vendor/github.com/..., or lsif-tsc indexing node_modules in the root or a subdir). This will cut out any git ls-tree call for a non-existent directory within these vendor directories.

If there are ways to more intelligently batch these requests I'd be happy to hear them. See this slack thread for some additional context.

Before

Kapture 2020-02-20 at 15 57 09

After

Kapture 2020-02-20 at 15 59 30

@efritz efritz changed the base branch from lsif-merge-documents to master February 20, 2020 21:12
@efritz efritz force-pushed the lsif-remove-unreachable-paths branch from fec4959 to d54ad0d Compare February 20, 2020 21:13
@efritz efritz changed the title WIP. LSIF: Delete indexed files not known by git missing from g Feb 20, 2020
@efritz efritz changed the title LSIF: Delete indexed files not known by git missing from g LSIF: Delete indexed files not known by git Feb 20, 2020
@efritz efritz marked this pull request as ready for review February 20, 2020 21:32
@efritz efritz requested a review from a team as a code owner February 20, 2020 21:32
@efritz efritz requested a review from a team February 20, 2020 21:32
lsif/src/worker/conversion/conversion.ts Outdated Show resolved Hide resolved
Copy link
Member

@keegancsmith keegancsmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Some questions inline, main interest is around how many inflight requests can occur for a single path checker.

From a design perspective is it possible to structure this from a PathVisibilityChecker to a PatchVisibilityFilter. IE the caller gives it a list of all the paths and it returns the visible ones. This would allow for more flexibility to implement batching/etc.

private frontendUrl?: string
private ctx?: TracingContext
private mockGetDirectoryChildren?: typeof getDirectoryChildren
private directoryContents = new Map<string, Set<string>>()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

out of interest how likely is it that you will have something in an lsif dump where the parent directory exists in git but the file does not? Just storing directories visibility is more efficient to query and cache.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somewhat likely but not the common case, I would suspect. I'm more interesting in ignoring vendored deps that are large trees of untracked files, but there is also the case where we have single generated files or a handful of generated files (grpc, protobuf, thrift, our schema.ts file generated from our graphql definition fits into this category).

Just storing directories visibility is more efficient to query and cache.

The idea of having the directory contents here is that we get the visibility of all sibling files at once. Are you suggesting that we instead use something like private visibleDirectories = new Map<string, bool>()? If that's the case then we'll still jump to files that don't exist in the tree unless we also re-check file visibility.

Maybe I misunderstood here - could you elaborate a bit? Where/why is more efficient to query a directory's visibility rather than list its tree non-recursively?

* @param documentPath The path of the file relative to the dump root.
* @param requireDocumentDump Whether or not we require the path to be within the dump root.
*/
public async shouldIncludePath(documentPath: string, requireDocumentDump: boolean = true): Promise<boolean> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be called concurrently? ie is there any chance we will have multiple promises on the go leading to this function, and therefore duplicate work against this.isInGitTree

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not currently. If we end up doing batching/concurrency I'll add a hard limit for the number of in-flight requests. They're currently performed sequentially on-demand.

@keegancsmith
Copy link
Member

Oh and one other important bit. How much insight will we have to the time spent on PathVisibilityChecker and gitserver rps. This has the potential to become a non-insignificant cost of an LSIF upload.

Copy link
Contributor

@mrnugget mrnugget left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have anything to add besides what @keegancsmith already mentioned, but I just wanted to formally approve and say: man, I love how you write code 👍

@efritz
Copy link
Contributor Author

efritz commented Feb 21, 2020

From a design perspective is it possible to structure this from a PathVisibilityChecker to a PatchVisibilityFilter. IE the caller gives it a list of all the paths and it returns the visible ones. This would allow for more flexibility to implement batching/etc.

Since we cache all of the results we can just supply the list up-front and not change any of the downstream code. How would you suggest these queries be batched? Just send multiple distinct git ls-tree requests in the same request?

@efritz
Copy link
Contributor Author

efritz commented Feb 21, 2020

How much insight will we have to the time spent on PathVisibilityChecker and gitserver rps.

We're currently logging the time spent for each request - I'll do a pre-run so we can get the total time spent over all requests as well so we have that without needing to add the durations of distinct spans by hand.

lsif/src/worker/conversion/visibility.ts Outdated Show resolved Hide resolved
lsif/src/worker/conversion/visibility.ts Outdated Show resolved Hide resolved
lsif/src/worker/conversion/visibility.ts Outdated Show resolved Hide resolved
lsif/src/worker/conversion/visibility.ts Outdated Show resolved Hide resolved
lsif/src/worker/conversion/visibility.ts Outdated Show resolved Hide resolved
lsif/src/worker/conversion/visibility.ts Outdated Show resolved Hide resolved
lsif/src/worker/conversion/visibility.ts Outdated Show resolved Hide resolved
efritz and others added 8 commits February 21, 2020 10:56
Co-Authored-By: M. J. Fromberger <michael@sourcegraph.com>
Co-Authored-By: M. J. Fromberger <michael@sourcegraph.com>
Co-Authored-By: M. J. Fromberger <michael@sourcegraph.com>
Co-Authored-By: M. J. Fromberger <michael@sourcegraph.com>
…h/sourcegraph into lsif-remove-unreachable-paths
@unknwon
Copy link
Member

unknwon commented Feb 21, 2020

@efritz efritz merged commit e29a361 into master Feb 21, 2020
@efritz efritz deleted the lsif-remove-unreachable-paths branch February 21, 2020 21:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants