This repository has been archived by the owner on Dec 29, 2022. It is now read-only.

Minimal code changes run rls at 500% CPU for minutes #688

Closed
sfackler opened this issue Jan 29, 2018 · 23 comments
Labels: bug, racer (an issue caused by or could be solved by Racer)

Comments

@sfackler
Member

  1. Clone https://github.com/sfackler/typed-headers
  2. Write a line of code somewhere.
  3. Watch the rls binary run at 500-600% CPU for 1-2 minutes.

I'm using rls-preview 0.124.0-nightly (511321a 2018-01-23) from the latest nightly. Did request cancellation break maybe?

@sfackler changed the title from "Adding a new function to a crate pegs rls at 500% CPU for minutes" to "Minimal code changes run rls at 500% CPU for minutes" on Jan 29, 2018
@sfackler
Member Author

I tried running with RUST_LOG=debug, but it produced so much output that vscode basically soft-locked. Most of it seemed to be coming from racer, though.

@sfackler
Member Author

This also appears to be the case for the rls that's shipped with the current stable release.

@nrc added the bug label on Feb 1, 2018
@nrc
Member

nrc commented Feb 1, 2018

So, I tried to repro but Cargo couldn't make the patch for http work, so I removed that from the Cargo.toml. On the altered project I couldn't repro - I got 120% CPU for a short time - 30 sec or less. (I'm using master, but I don't think much has changed compared with 0.124).

One thing that changed somewhat recently is that if there is no build running, we start one immediately on typing instead of waiting for some time (we only wait on subsequent builds). We still don't run more than one build at once (though we may be running a build at the same time as indexing the previous build). We've never been able to cancel builds once they start.
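
In pseudo-Rust, that scheduling looks roughly like the following minimal sketch. All names here (BuildScheduler, DEBOUNCE, run_cargo_check) are hypothetical, not the actual RLS code: the first build after an idle period starts immediately, later edits are coalesced, and at most one build runs at a time.

use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

// Hypothetical debounce; the real RLS value may differ.
const DEBOUNCE: Duration = Duration::from_millis(500);

#[derive(Default)]
struct State {
    build_running: bool,
    build_pending: bool,
}

#[derive(Clone, Default)]
struct BuildScheduler {
    state: Arc<Mutex<State>>,
}

impl BuildScheduler {
    // Called on every edit notification from the client.
    fn on_change(&self) {
        let mut s = self.state.lock().unwrap();
        if s.build_running {
            // A build is in flight: remember that another is wanted,
            // but never run more than one build at once.
            s.build_pending = true;
        } else {
            // No build in flight: start one immediately, no wait.
            s.build_running = true;
            drop(s);
            self.spawn_build();
        }
    }

    fn spawn_build(&self) {
        let this = self.clone();
        thread::spawn(move || {
            run_cargo_check(); // stand-in for `cargo check` + save-analysis
            let mut s = this.state.lock().unwrap();
            s.build_running = false;
            if s.build_pending {
                // Subsequent builds wait for the debounce timeout first.
                s.build_pending = false;
                s.build_running = true;
                drop(s);
                thread::sleep(DEBOUNCE);
                this.spawn_build();
            }
        });
    }
}

fn run_cargo_check() {}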

@nrc
Member

nrc commented Feb 5, 2018

I have a hypothesis for something which might cause this. Builds in the RLS are a cargo check followed by post-processing the generated save-analysis data. The former usually takes much longer than the latter. The way we queue builds is that we never start one until the current one has finished, and we wait for a timeout before starting the next. However, we don't queue the save-analysis indexing - we start that as soon as the build is finished, and it can run in parallel with another build.

It is possible that if you have a crate where the indexing time is greater than the build time, then you could end up with lots of indexing happening at the same time, even though multiple builds are not. I wonder if something like that is happening here?
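
To make the hypothesis concrete, here is a minimal sketch (hypothetical durations, not real RLS code) of how unqueued indexing could pile up when indexing outlives the next build:

use std::thread;
use std::time::Duration;

// Hypothetical timings: the build is shorter than the indexing step.
const BUILD_TIME: Duration = Duration::from_secs(2);
const INDEX_TIME: Duration = Duration::from_secs(10);

fn main() {
    let mut indexers = Vec::new();
    // Builds are serialised: the next one only starts after the previous
    // one finishes. Indexing, however, is fired off unqueued.
    for i in 0..5 {
        thread::sleep(BUILD_TIME); // stand-in for `cargo check`
        indexers.push(thread::spawn(move || {
            // Stand-in for post-processing save-analysis data. Because
            // INDEX_TIME > BUILD_TIME, several of these run concurrently,
            // which is the multi-core CPU spike hypothesised above.
            println!("indexing build {}", i);
            thread::sleep(INDEX_TIME);
        }));
    }
    for h in indexers {
        h.join().unwrap();
    }
}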

@sfackler could you run with RUST_LOG=rls=debug please? That would avoid getting spammed by Racer or the compiler.

@sfackler
Member Author

sfackler commented Feb 5, 2018 via email

@alexheretic
Member

I can reproduce this behaviour. The culprit in my case is racer. For this to happen, racer completion must take far longer than our request timeout, so it shows up in medium/large crates. After the timeout these tasks remain running on their threads. Though we do limit the max tasks to num_cpus, this does technically allow textDocument/completion spam in such crates to cause 100% CPU for as long as these racer tasks take.

This is how I reproduce it: by rapidly typing qwer qwer qwer qwer qwer qwer qwer qwer etc., causing my client to spam completion requests to the server.

First balm for the wound will be failing requests as soon as our request work pool is full (currently they'll just stack up). This will at least bound the 100% CPU time to the completion of N requests. I'll raise a PR to do this, and log a warning about which tasks have consumed the work pool. This will also help confirm my suspicion that racer is causing this in general.
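
Roughly, the idea is something like the following sketch of a counting gate over a fixed-size pool; all names here are hypothetical, not the code in the PR.

use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

// Hypothetical sketch of "fail fast when the request work pool is full".
struct WorkPool {
    active: Arc<AtomicUsize>,
    max: usize, // e.g. the number of CPUs in the real thing
}

enum Dispatch {
    Accepted,
    Rejected, // the request is failed immediately instead of stacking up
}

impl WorkPool {
    fn try_run<F: FnOnce() + Send + 'static>(&self, name: &str, work: F) -> Dispatch {
        let prev = self.active.fetch_add(1, Ordering::SeqCst);
        if prev >= self.max {
            self.active.fetch_sub(1, Ordering::SeqCst);
            // Log which kinds of tasks have consumed the pool.
            eprintln!("work pool full, failing `{}` request", name);
            return Dispatch::Rejected;
        }
        let active = self.active.clone();
        thread::spawn(move || {
            work(); // e.g. a racer completion that may outlive its timeout
            active.fetch_sub(1, Ordering::SeqCst);
        });
        Dispatch::Accepted
    }
}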

Really we'd want to be able to cancel long running threads, like racer requests and builds.

@nrc
Member

nrc commented Feb 7, 2018

Interesting, thanks for investigating!

Instead of/as well as limiting the number of active requests we should probably limit the number of active Racer tasks, probably to 1 (like we do for builds). That would allow other requests without us getting bogged down in Racer jobs. I hope we can do that without a whole queuing infrastructure like we do for builds...

@alexheretic
Member

alexheretic commented Feb 8, 2018

Unlike the builds, where we have to deal with slowness, request actions are expected to be fast because they're generally useless unless that's true. Because of this expectation we time out such requests. Without cancellation we are faced with either blocking future tasks until the timed-out task finishes, or starting the new ones while trusting the timed-out tasks to finish before long.

We go with the latter; #702, then, is about mitigating the worst edge cases that can cause this issue. I was thinking it better to address this generally rather than with racer-specific rules. Racer shouldn't be so slow and intensive, right?

The #702 rules shouldn't generally be hit: for example, for textDocument/completion to be failed on a 4-thread CPU we'd need 2 tasks already running. Since each task is waited on for 1.5 seconds, that means one of those tasks has been running for at least 3 seconds and the other for 1.5 seconds. The new thinking is: once a task is taking that long, it becomes acceptable to shift our trust and cancel future tasks of the same type.

For racer to cause this issue (6-thread utilisation) with the current logic, a racer task had to be running for more than 7.5-9 seconds. #702 will help, but after merging can we find out why racer is so slow and intensive?

@alexheretic
Member

After I added the long-task logging I see:

WARN 2018-02-08T13:27:37Z: rls::actions::requests: `textDocument/completion` took 62.5s
WARN 2018-02-08T13:27:36Z: rls::actions::requests: `textDocument/completion` took 62.7s

So beyond #702 we need to get to the bottom of why racer takes so long. Maybe in the meantime add logic to disable racer completion functionality after we detect it's slow?

@nrc
Member

nrc commented Feb 11, 2018

cc @jwilm since this is a Racer issue

@nrc
Member

nrc commented Feb 11, 2018

Without cancellation we are faced with either blocking future tasks until the timed out task finishes

This seems pretty fine as a throttling mechanism to me. It effectively detects slow runs of Racer and prevents it running - it is not optimal since we are still running Racer and thus using some CPU, but we shouldn't really crush multiple cores because we're only launching one task at a time.

I was thinking it was better to address this generally rather than racer specific rules

I'm in favour of both general and specific mechanisms here - I don't think it is over-specific - we only outsource work to rustc, Racer, and Rustfmt, so I think specific mechanisms for all three are still OK, complexity-wise.

@alexheretic
Member

alexheretic commented Feb 12, 2018

This seems pretty fine as a throttling mechanism to me. It effectively detects slow runs of Racer and prevents it running - it is not optimal since we are still running Racer and thus using some CPU, but we shouldn't really crush multiple cores because we're only launching one task at a time.

Yep, we can absolutely do better with what we have; that's what #702 is about. But cancellation just seems such a good fit for the RLS, which is why I keep harping on about it. At least it's something to revisit when a decent solution exists in std or a multi-platform crate.

I'm in favour of both general and specific mechanisms here

Fair enough. It is worth noting that on some projects racer isn't slow and our assumptions work. I don't think it was particularly wrong to have an optimistic "tasks should finish fast" approach, where we leave timed-out tasks on the pool and expect them not to live long.

That's why I think it's ok to have 2 concurrent racer-completion tasks running. When the 2nd one runs and finds another already running, it means one racer task has timed out. Optimistically we should assume that the first will finish soon and go ahead with the new request (not generally wanting to cancel requests). However, when a 3rd request comes in and we have 2 ongoing tasks, we can be much more sure there's an issue and fast-fail the request.

In this way we handle the cases where completion is fast, where completion is roughly equal to the timeout, and where completion is slow. Worst case: 2 cores are used for the duration of a single racer execution, while typing.
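
A minimal sketch of that per-request-type rule (hypothetical names and structure, not the #702 code itself): track in-flight tasks per request type and fast-fail once two of the same type are already running.

use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Hypothetical: at most 2 in-flight tasks per request type; the 3rd
// concurrent request of the same type is failed immediately.
const MAX_PER_TYPE: usize = 2;

#[derive(Clone, Default)]
struct TypeGate {
    in_flight: Arc<Mutex<HashMap<&'static str, usize>>>,
}

impl TypeGate {
    // Returns false if the request should be fast-failed.
    fn try_start(&self, req_type: &'static str) -> bool {
        let mut map = self.in_flight.lock().unwrap();
        let count = map.entry(req_type).or_insert(0);
        if *count >= MAX_PER_TYPE {
            // Two tasks of this type are already running, so at least one
            // has blown well past its timeout: assume something is wrong.
            return false;
        }
        *count += 1;
        true
    }

    fn finish(&self, req_type: &'static str) {
        let mut map = self.in_flight.lock().unwrap();
        if let Some(count) = map.get_mut(req_type) {
            *count = count.saturating_sub(1);
        }
    }
}

A handler would call try_start("textDocument/completion") before dispatching to racer and finish(...) when the task eventually returns, even if the 1.5 second response timeout has already fired.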

I think this reasoning will apply to every kind of request.

But of course if we can improve the worst case without harming the others all the better!

  • Maybe we should dynamically disable completion when we see it taking >N seconds (see the sketch after this list)? Because when racer is taking, say, 5 seconds, you can't really use it and it's likely to be taking up a couple of cores.
  • Maybe we should allow disabling completion in config, so people having issues can manually remove functionality to address CPU usage?
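
The first idea could look something like the following sketch (hypothetical names and threshold; not an existing RLS feature): once a single completion is observed blowing past a slow limit, flip a flag and skip racer from then on.

use std::sync::atomic::{AtomicBool, Ordering};
use std::time::{Duration, Instant};

// Hypothetical threshold for "completion is too slow to be useful".
const SLOW_LIMIT: Duration = Duration::from_secs(5);

static COMPLETION_DISABLED: AtomicBool = AtomicBool::new(false);

fn complete(run_racer: impl FnOnce() -> Vec<String>) -> Vec<String> {
    if COMPLETION_DISABLED.load(Ordering::Relaxed) {
        return Vec::new(); // completion has been dynamically disabled
    }
    let start = Instant::now();
    let results = run_racer();
    if start.elapsed() > SLOW_LIMIT {
        eprintln!("racer completion took {:?}, disabling completion", start.elapsed());
        COMPLETION_DISABLED.store(true, Ordering::Relaxed);
    }
    results
}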

@mooman219

I hate to +1 with a comment, but this bug is extremely frustrating to the point where I cannot edit projects using the latest RLS build. I've reverted back to the 01-01-2018 build where I only occasionally get the issue. Restarting the RLS temporarily resolved the issue.

@jedisct1

jedisct1 commented Mar 2, 2018

I completely stopped using RLS. It happens within the first minute, so RLS has become unusable for me.

@sfackler
Member Author

sfackler commented Mar 2, 2018

I still see this when working on typed-headers - it's just pinned at 200% CPU now due to #702.

@nrc
Member

nrc commented Mar 2, 2018

Maybe in the meantime add logic to disable racer completion functionality after we detect it's slow?

We should think about how/if we can do this. It will be pretty annoying to disable Racer since it means no more code completion, but that seems better than no code completion and all of the CPU. What I'm trying to say is we need to be careful not to get too many false positives.

@alexheretic
Member

alexheretic commented Mar 2, 2018

We should think about how/if we can do this. It will be pretty annoying to disable Racer since it means no more code completion, but that seems better than no code completion and all of the CPU. What I'm trying to say is we need to be careful not to get too many false positives.

Yes, we need to be careful with disabling autocomplete automatically, but it would be useful if we can do it well.

However, having an RLS config option for autocomplete would be fine and would allow manual disabling for large projects where autocomplete isn't behaving anyway.

# atom rls.toml
autocomplete = false

On another front, has anyone raised an issue for racer to try to improve this upstream? The issue should be reproducible with a more basic/direct racer plugin.

What's the status of autocomplete without racer? I'm guessing that's still far off?

@cynecx

cynecx commented Mar 3, 2018

Sorry if this isn't the right place to ask, and I'm not exactly familiar with how rls works internally, but why is rls actually using racer? Why can't rls "just use" the data from rustc? I'm not quite sure what the status of racer is, but it seems quite buggy (it can't auto-complete features it cannot parse, I assume? see #588; it also seems to use a really old version of syntex_syntax, which may be the cause of some parsing issues).

@nrc
Member

nrc commented Mar 3, 2018

@cynecx basically because we can't compile the program fast enough - once the user is typing, we need to compile the program and calculate the code completion results fast enough to be useful (approx. <= 0.25s). We can't get anywhere near that right now. Hopefully, as incremental compilation gets better, plus some other changes, we will be able to.

@mooman219

I'm very much for a manual option to disable code completion. There's still a lot of value in error/warning highlighting that I can't get right now because code completion is preventing that.

@alexheretic
Member

alexheretic commented Mar 4, 2018

I actually cloned typed-headers and wow, this project really suffers from this; you weren't joking, @sfackler.

I can confirm that the culprit is indeed racer completion. It's the first two requests that time out and keep running for ... I waited a long time to see how long they take. I think this project may somehow cause an infinite loop in racer. This should be reported on https://github.com/racer-rust/racer.

When I restarted the RLS with racer completion disabled, #747 style, the problem vanished. That projects like this can be such hard edge cases for racer is reason enough to provide the config option.

alexheretic added a commit to alexheretic/rls that referenced this issue Mar 4, 2018
Provides manual workaround for rust-lang#688
@alexheretic
Member

With #747 now merged, the racer_completion config option should be available with the next rls release.

So, for example, in Atom, adding the following line to your rls.toml would disable racer completion.

# rls.toml for atom
racer_completion = false

@nrc
Member

nrc commented Jun 21, 2018

I assume this is fixed by the last racer update; if not, let me know.

@nrc closed this as completed on Jun 21, 2018