LSIF: Delete indexed files not known by git #8523
Conversation
fec4959 to d54ad0d
LGTM. Some questions inline; my main interest is how many in-flight requests can occur for a single path checker.
From a design perspective, is it possible to restructure this from a PathVisibilityChecker to a PathVisibilityFilter? I.e. the caller gives it a list of all the paths and it returns the visible ones. This would allow more flexibility to implement batching, etc.
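A filter-style interface along these lines might look like the following sketch. All names here are hypothetical illustrations of the suggestion, not code from the PR:

```typescript
// Hypothetical sketch of the suggested filter-style interface: the caller
// hands over every candidate path at once, and the filter returns only the
// visible ones, leaving room for batching internally.
interface PathVisibilityFilter {
    filterVisiblePaths(paths: string[]): Promise<string[]>
}

// A trivial implementation for illustration: visibility is decided by a
// predicate supplied up-front. A real implementation would consult
// gitserver, batching ls-tree calls as needed.
class StaticPathVisibilityFilter implements PathVisibilityFilter {
    constructor(private isVisible: (path: string) => boolean) {}

    public filterVisiblePaths(paths: string[]): Promise<string[]> {
        return Promise.resolve(paths.filter(p => this.isVisible(p)))
    }
}
```

Because the filter sees the whole list at once, an implementation is free to group the paths by parent directory and issue one query per directory.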
private frontendUrl?: string
private ctx?: TracingContext
private mockGetDirectoryChildren?: typeof getDirectoryChildren
private directoryContents = new Map<string, Set<string>>()
Out of interest, how likely is it that you will have something in an LSIF dump where the parent directory exists in git but the file does not? Just storing directory visibility would be more efficient to query and cache.
Somewhat likely, but not the common case, I suspect. I'm more interested in ignoring vendored deps that are large trees of untracked files, but there is also the case where we have a single generated file or a handful of generated files (grpc, protobuf, thrift; our schema.ts file generated from our graphql definition fits into this category).
Just storing directory visibility would be more efficient to query and cache.
The idea of having the directory contents here is that we get the visibility of all sibling files at once. Are you suggesting that we instead use something like private visibleDirectories = new Map<string, boolean>()? If that's the case then we'll still jump to files that don't exist in the tree unless we also re-check file visibility.
Maybe I misunderstood here - could you elaborate a bit? Where/why is it more efficient to query a directory's visibility rather than list its tree non-recursively?
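The trade-off under discussion can be sketched as follows. This is an illustrative reconstruction with hypothetical names, not the PR's actual code: one listing of a parent directory answers visibility for every sibling file, whereas a map of visible directories alone cannot distinguish a tracked file from an untracked sibling in the same tracked directory.

```typescript
// Hypothetical sketch: cache directory *contents* so that a single
// non-recursive listing answers file-level visibility for all siblings.
class DirectoryContentsCache {
    private directoryContents = new Map<string, Set<string>>()

    // listDirectory stands in for a non-recursive "git ls-tree"-style call
    // returning the root-relative paths of a directory's children.
    constructor(private listDirectory: (dir: string) => Promise<Set<string>>) {}

    public async isFileVisible(path: string): Promise<boolean> {
        const slash = path.lastIndexOf('/')
        const dir = slash === -1 ? '' : path.slice(0, slash)
        let children = this.directoryContents.get(dir)
        if (children === undefined) {
            children = await this.listDirectory(dir) // one listing per directory
            this.directoryContents.set(dir, children)
        }
        return children.has(path)
    }
}
```

With a visible-directories map instead, a lookup for an untracked generated file inside a tracked directory would wrongly report it visible unless the file is re-checked separately.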
 * @param documentPath The path of the file relative to the dump root.
 * @param requireDocumentDump Whether or not we require the path to be within the dump root.
 */
public async shouldIncludePath(documentPath: string, requireDocumentDump: boolean = true): Promise<boolean> {
Can this be called concurrently? I.e. is there any chance we will have multiple promises in flight leading into this function, and therefore duplicate work against this.isInGitTree?
Not currently. If we end up doing batching/concurrency I'll add a hard limit on the number of in-flight requests. They're currently performed sequentially, on demand.
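The hard limit mentioned above could be as simple as a small semaphore capping in-flight requests. This is a generic sketch of that idea, not code from the PR; all names are hypothetical:

```typescript
// Hypothetical sketch: a tiny promise-based semaphore that caps the number
// of concurrent gitserver requests if the checks are ever made concurrent.
class Semaphore {
    private waiters: (() => void)[] = []

    constructor(private available: number) {}

    public async acquire(): Promise<void> {
        if (this.available > 0) {
            this.available--
            return
        }
        await new Promise<void>(resolve => this.waiters.push(resolve))
    }

    public release(): void {
        const next = this.waiters.shift()
        if (next) {
            next() // hand the permit directly to the next waiter
        } else {
            this.available++
        }
    }
}

// Run fn while holding a permit, releasing it even if fn throws.
async function withLimit<T>(sem: Semaphore, fn: () => Promise<T>): Promise<T> {
    await sem.acquire()
    try {
        return await fn()
    } finally {
        sem.release()
    }
}
```

Wrapping each gitserver call in withLimit bounds the worst case regardless of how many documents the dump contains.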
Oh, and one other important bit: how much insight will we have into the time spent in PathVisibilityChecker and gitserver RPS? This has the potential to become a non-insignificant cost of an LSIF upload.
I don't have anything to add besides what @keegancsmith already mentioned, but I just wanted to formally approve and say: man, I love how you write code 👍
Since we cache all of the results, we can just supply the list up-front and not change any of the downstream code. How would you suggest these queries be batched? Just send multiple distinct git ls-tree requests in the same request?
We're currently logging the time spent for each request. I'll do a pre-run so we can get the total time spent over all requests as well, so we have that without needing to add the durations of distinct spans by hand.
Co-Authored-By: M. J. Fromberger <michael@sourcegraph.com>
…h/sourcegraph into lsif-remove-unreachable-paths
Indexers such as lsif-tsc will add index entries for files outside of the dump root (e.g. will include ../node_modules/ when indexing web/), add index entries for files outside of the repository root (e.g. will include ../../lsif-tsc/node_modules/typescript/lib/lib.es2015.symbol.d.ts), and add index entries for files in the repository/dump root but possibly untracked by git (e.g. node_modules, other generated files).
For the most part, this is just a waste of space in the dump. However, it also manifests itself badly in jump-to-definition (j2d) and find-references results. This PR removes references to any location that we can't jump to in the UI. It does not change any hover data, though: if we pull a hover definition out of lib.es2015.symbol.d.ts, it will still be in the result dump (we just won't give you the option to jump to a 404).
@sourcegraph/core-services, please review the general algorithm implemented in lsif/src/worker/conversion/visibility.ts described below, as it adds additional queries to gitserver at conversion time. If there is any concern about scale, or we need to address possible issues before this is merged, please start this discussion sooner rather than later.
For each document in the LSIF dump (this includes each source file in the root, and possibly additional files that it imports from vendored dependencies), we need to determine if the document path is a file known to git. We do this with the following steps:
Each git ls-tree command is non-recursive and we memoize the results of each call. This means that we add at most one git ls-tree call per unique parent directory of the LSIF index. Because we also query from the root of the repo down to the leaf, once we hit an untracked vendor root we don't need to go any further (think lsif-go indexing vendor/github.com/..., or lsif-tsc indexing node_modules in the root or a subdir). This will cut out any git ls-tree call for a non-existent directory within these vendor directories.
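The memoized root-to-leaf walk described above can be sketched as follows. This is a reconstruction for illustration only, with hypothetical helper names; see lsif/src/worker/conversion/visibility.ts for the actual implementation:

```typescript
// Hypothetical sketch of the walk: starting at the repo root, list each
// ancestor directory non-recursively (one memoized "ls-tree"-style call per
// unique directory) and stop as soon as an ancestor is untracked, so
// nothing below an untracked vendor root is ever queried.
async function isPathInGitTree(
    path: string,
    cache: Map<string, Set<string>>,
    // listDirectory stands in for a non-recursive git ls-tree call returning
    // the root-relative paths of a directory's tracked children.
    listDirectory: (dir: string) => Promise<Set<string>>
): Promise<boolean> {
    let current = ''
    for (const part of path.split('/')) {
        let children = cache.get(current)
        if (children === undefined) {
            children = await listDirectory(current)
            cache.set(current, children) // memoize: one call per directory
        }
        const candidate = current === '' ? part : `${current}/${part}`
        if (!children.has(candidate)) {
            return false // untracked ancestor: no further queries needed
        }
        current = candidate
    }
    return true
}
```

For a dump that indexes node_modules, the very first lookup under that directory fails at the root listing, so the entire vendored subtree costs a single cached call.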
If there are ways to more intelligently batch these requests I'd be happy to hear them. See this slack thread for some additional context.
Before
After