Skip to content
This repository has been archived by the owner on Oct 16, 2020. It is now read-only.

Scanning all files in the rootUri makes the server unusable for directories with many files #471

Open
keyboardDrummer opened this issue Jun 12, 2018 · 3 comments

Comments

@keyboardDrummer
Copy link
Contributor

keyboardDrummer commented Jun 12, 2018

I have a package directory that contains about 200k files, many of which are generated and are not required for providing Typescript language tooling. The glob npm package used by this LSP server takes ~45seconds to scan all files in the package directory. During this time the LSP server will not respond to requests, making it unusable.

I've compared using the 'glob' package to using the 'find' package. The latter completes in about 1 second, which would be sufficient to make the LSP server usable. However, the find package finds slightly different files than glob. It seems to include also files in 'hidden' directories such as .git, but then skips files in symlinked directories.

Lastly, running find -type f | wc -l in my package directory completes instantly.

I think the root cause of this issue is that the LSP server aggressively scans all files in the rootUri. This is also a problem when requiring files from outside the rootUri directory. I see that the LanguageServiceHost interface that you need to implement for the Typescript compiler contains a synchronous readFile method. Is that why you preload all the existing files? Can you not just synchronously read the file at that time?

Have you looked at what tsc does? It finishes running on this package in a few seconds. Also it allows requiring packages outside of the tsconfig.json folder.

@felixfbecker
Copy link
Contributor

I see that the LanguageServiceHost interface that you need to implement for the Typescript compiler contains a synchronous readFile method. Is that why you preload all the existing files?

Exactly.

Can you not just synchronously read the file at that time?

No, because the file content don't have to come from disk, they may be requested through LSP too, which is always async.

@keyboardDrummer
Copy link
Contributor Author

keyboardDrummer commented Jun 13, 2018

No, because the file content don't have to come from disk, they may be requested through LSP too, which is always async.

It sounds like you're saying LSP supports having the server request file contents from the client. However, I don't see that in the protocol. Could you elaborate on what you're referring to? The protocol allows the client to push file content to the server, but obviously after those are stored in memory on the server, they can be served synchronously to Typescript.

Looking at how tsc and tsserver implement ModuleResolutionHost: they use fs.readFileSync (code). The result is great performance. Why not use the same approach for this LSP server?

Update: I see now that you've extended LSP with file requests from server to client, such as textDocument/xcontent and workspace/xfiles, which explains your earlier comment. Can you give me some insight into the use-case for these methods? If you're storing the source files in some location, why would you not run the LSP server there? I can see a use-case where the LSP client is on a different machine than the source code, but then it would be easiest to have the communication between the LSP client and server bridge the distance, and keep the LSP server with the source.

Have you made attempts to get your LSP extensions merged into LSP?

Anyways, I feel that in the scenario that you're not using remote file fetching, this LSP server should stop prefetching and simply use fs.readFileSync, since that provides reliable performance. I've made a quick prototype here that solves the problem for our many-files package. It also solves this issue I found.

Update:
I read up on SourceGraph's blog post detailing their LSP extensions. I would say given that you want file fetching as part of the LSP protocol, then the ModuleResolutionHost methods such as readFile from Typescript should be async. I've looked at their issues but I don't see this issue there. Have you tried adding it and getting the Typescript API changed?

@felixfbecker
Copy link
Contributor

@keyboardDrummer seems like you found the answers to most of your questions. To summarize, to support thousands of clients browsing repos at different revisions at the same time, we run language servers in containers, acting as TCP servers (speaking LSP). For performance reasons, the containers don't have access to a fully checked-out workspace on disk, but fetch the files they need on-demand, over the LSP files extension (spec here).

If you're storing the source files in some location, why would you not run the LSP server there?

The files are actually not checked out - that would be too expensive for big monorepos, because latest master changes all the time and every connection would require a fresh git checkout (because multiple revisions can be looked at by different users at the same time). The implementation of workspace/xfiles basically runs git ls-files, and the implementation of textDocument/xcontent basically runs git show.

Have you made attempts to get your LSP extensions merged into LSP?

We have not for this extension, because every prior attempt to merge things into LSP that didn't solve any problem for VS Code wasn't successful. E.g. see microsoft/language-server-protocol#245 or microsoft/language-server-protocol#182 (comment) (not blaming anyone here, but figuring these things out properly for the official spec is hard, especially when there is little interest from other parties). It may be worth to revisit now, since VS Code is gaining support for file providers to enable e.g. working over FTP.

But that wouldn't solve the unvelrying problem in TypeScript: As you mentioned, TypeScript's ModuleResolutionHost is synchronous, which is tracked in an issue here: microsoft/TypeScript#1857
We are not the only one suffering from this exact problem, see e.g. microsoft/TypeScript#1857 (comment)

This is already an issue for the VS case today. Right now we end up having to load up all files on the VS side before we can even call into the script side to answer a question. We do this so we have the data ready when the LS asks the host "what has changed between these files?" And we have to do this, even though the LS might not ever ask that question (because the version hasn't changed). If we were async through-and-through, we wouldn't have this problem.

We'd simply call into the script LS. It would synchronize info with the host, and then the LS would just ask for the changes to files with different versions in an async manner. This would be much better, and would provide a lot less stress on the managed side (for example, Roslyn could flush these files to their cache, instead of having to keep them all in memory.).

I'm suprised that fs.readFileSync works somewhat well (both in your PR and for tsserver itself), but my guess is that this is only true as long as the server is only serving a single project. NodeJS is single-threaded, so in the case of running as a server serving multiple at once, one connection blocking the whole event loop for all would be a nightmare (borrowing from Ryan Dahl's original NodeJS presentation, one blocking disk I/O operation means 41.000.000 CPU cycles wasted). This doesn't have to be a TCP server serving multiple connections, it may be a single connection if we support workspaceFolders.

Now of course I would love to get rid of the prefetching. It's a thorn in my eye. To somewhat improve performance, we use heuristics to not fetch all files, but only ones that are TypeScript files belonging to the project and some that "look" like global declaration files (e.g. types/node.d.ts, types/mocha.d.ts, ...). This can create problems where we if we guessed wrong, TypeScript may ask for a file and we don't have it. I would love nothing more than to avoid all that, but the only solution that would univerially work that I see is making the ModuleResolutionHost actually async.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants