Rust Language Server (IDE support) #1317

Merged
merged 6 commits on Feb 11, 2016

@nrc
Contributor
nrc commented Oct 13, 2015

This RFC describes how we intend to modify the compiler to support IDEs. The
intention is that support will be as generic as possible. A follow-up internals
post will describe how we intend to focus our energies and deploy Rust support
in actual IDEs.

There are two sets of technical changes proposed in this RFC: changes to how we
compile, and the creation of an 'oracle' tool (name of tool TBC).

Thanks to Phil Dawes, Bruno Medeiros, Vosen, eddyb, Evgeny Kurbatsky, and Dmitry Jemerov for early feedback.

@nrc nrc self-assigned this Oct 13, 2015
@killercup
killercup commented Oct 13, 2015 edited
@killercup killercup commented on an outdated diff Oct 13, 2015
text/0000-ide.md
+ let foo = foo();
+ let x = fo;
+}
+```
+
+will parse, but fail name resolution, but again we would like to suggest code
+completion options.
+
+There are two issues: dealing with incomplete or incorrect names (e.g., `fo` in
+the second example), and dealing with unfinished AST nodes (e.g., in the first
+example we need an identifier to finish the `.` expression, a `;` to terminate
+the let statement, and a `}` to terminate the `main` function).
+
+A solution to the first problem is replacing invalid names with some magic
+identifier, and ignoring errors involving that identifier. @sanxiyn implemented
+something like the second feature in a [PR](https://github.com/rust-
@killercup
killercup Oct 13, 2015

Line break → link break

@liigo
Contributor
liigo commented Oct 14, 2015

'oracle' is not a perfect name here, since when you google 'rustlang oracle', Google doesn't know what you intend to search for: a compiler tool or a database API?

@daniel-vainsencher daniel-vainsencher and 2 others commented on an outdated diff Oct 14, 2015
text/0000-ide.md
+Incremental compilation is where, rather than re-compiling an entire crate, only
+code which is changed and its dependencies are re-compiled. See
+[RFC #1298](https://github.com/rust-lang/rfcs/pull/1298).
+
+Lazy compilation is where, rather than compiling an entire crate, we start by
+compiling a single function (or possibly some other unit of code), and re-
+compiling code which is depended on until we are done. Not all of a crate will
+be compiled in this fashion.
+
+These two compilation strategies are faster than the current compilation model
+(compile everything, every time). They are somewhat orthogonal - compilation can
+be either lazy or incremental without implying the other. The [current
+proposal](https://github.com/rust-lang/rfcs/pull/1298) for supporting
+incremental compilation involves some lazy compilation as an implementation
+detail.
+
@daniel-vainsencher
daniel-vainsencher Oct 14, 2015

Seems like "incremental" here is used to describe push-driven compilation, in contrast to the lazy, pull-driven compilation. Which is more efficient depends strongly on the use case, hence supporting both, and combinations of them, is probably useful. If the dependencies, changes (the pushes) and interests (the pulls) are managed explicitly, combining the strategies should be feasible. I would use the word "incremental" instead to describe all of these partial compilations.

@nrc
nrc Oct 15, 2015 Contributor

The two are orthogonal - you can have lazy incremental, eager incremental, lazy non-incremental, and eager non-incremental.

@daniel-vainsencher
daniel-vainsencher Oct 15, 2015

I'm saying that the reference to incremental in line 101 seems to define it as eager, conflicting with the orthogonality set up in lines 110-115. Or maybe that's just the way it reads to me.

@Ericson2314
Ericson2314 Oct 15, 2015 Contributor

@daniel-vainsencher I think the intended distinction is that "incremental" describes what is done (only what hasn't been done before), while "lazy" describes the algorithm of figuring what to do (start at end goal and work through dependency dag). Laziness without caching is non-incremental, and the "push-driven" approach you mention is a different algorithm for incremental compilation than the lazy one. Does that clarify the orthogonality?
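
A rough sketch of that orthogonality, using a toy query model (the `dependencies_of` and `compute` helpers are hypothetical stand-ins, not compiler APIs): demand-driven evaluation is the "lazy" axis, and the optional result cache is the "incremental" axis, and either can be switched independently of the other.

```rust
use std::collections::HashMap;

// Toy model: a "query" is a unit of compilation work identified by a name.
// "Lazy" = demand-driven: a query (and, recursively, its dependencies) is
// only computed when something asks for it.
// "Incremental" = cached: earlier results are kept and reused while valid.
struct Compiler {
    // Some(_) => incremental (results cached); None => recompute every time.
    cache: Option<HashMap<String, String>>,
}

impl Compiler {
    // Demand-driven evaluation: start from the requested query and walk the
    // dependency DAG only as far as necessary.
    fn demand(&mut self, query: &str) -> String {
        if let Some(cache) = &self.cache {
            if let Some(result) = cache.get(query) {
                return result.clone(); // incremental reuse of a cached result
            }
        }
        let deps = dependencies_of(query); // hypothetical dependency lookup
        let dep_results: Vec<String> = deps.iter().map(|d| self.demand(d)).collect();
        let result = compute(query, &dep_results); // hypothetical "do the work" step
        if let Some(cache) = &mut self.cache {
            cache.insert(query.to_string(), result.clone());
        }
        result
    }
}

// Stand-ins for real compiler machinery.
fn dependencies_of(_query: &str) -> Vec<String> { Vec::new() }
fn compute(query: &str, _deps: &[String]) -> String { format!("result of {}", query) }
```

With the cache present and work done on demand you get lazy incremental; enabling or disabling either knob independently gives the other three combinations listed above.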

@eddyb eddyb and 1 other commented on an outdated diff Oct 14, 2015
text/0000-ide.md
+expanded source code. Furthermore, serialising any of the compiler's IRs is not
+good enough: the AST and HIR do not contain any type or name information, the
+HIR and MIR are post-expansion.
+
+The best option seems to be the save-analysis information. This is in a poor
+format, but is the 'right' data (it can be based on an early AST and includes
+type and name information). It can be evolved into a more efficient form over the
+long run (it has been a low priority task for a while to support different
+formats for the information).
+
+Full builds will generate a dump of save-analysis data for a whole crate. Quick
+checks will generate data for the changed code. In both cases the oracle must
+incrementally update its knowledge of the source code. How exactly to do this
+when neither names nor ids are stable is an interesting question, but too much
+detail for this RFC (especially as the implementation of ids in the compiler is
+evolving).
@eddyb
eddyb Oct 14, 2015 Member

Why exactly can't the oracle drive rustc directly?
Keeping compiler sessions in memory seems like the most efficient and accurate method at our disposal.

@nrc
nrc Oct 15, 2015 Contributor

It could, I list this as an alternative in the alternatives section (I assume you mean the quick-check version of rustc). I think it may be a better approach. The only real downside is that the oracle then has to know about dependencies between crates.

In terms of when to do a full build of a crate, I think you want to avoid this as much as possible, so the IDE is the best driver to choose when it is necessary.

@eddyb
Member
eddyb commented Oct 14, 2015

I forgot to comment this on the pre-RFC thread (sorry @nrc), but I prefer rider as the name of the oracle tool, keeping racer (with @phildawes' permission, of course) as the one-shot completion (and perhaps quickcheck) tool.

The way I see it, racer "races" to give a useful result each time, whereas rider "rides" along with an IDE, for as long as there are Rust projects to be handled by it, and can have a slow start without impacting UX.

@kud1ing
Contributor
kud1ing commented Oct 14, 2015

I don't have a name suggestion (yet), but I expect calling it "Oracle" will lead to confusion and legal trouble.

@killercup

(I like how this RFC concentrates on the name first 😉)

How about rustcage? Either pronounced rust cage or rust sage.

@phildawes

Personally I'm not entirely sold on the oracle/rider concept yet. I think the fundamental requirement is that we need a stable interface to rustc to support IDEs and plugins. I'm not yet sure whether a long running database process is required/desirable.

Some things that might cause problems:

  • The oracle concept implies a single view of the source code, and I think this clashes a bit with the tools that will perform refactoring, reformatting and suggestions. I suspect they will want interactive access to the compiler (compile this pre-processed snippet, what errors occur if I do this?)
  • If the oracle were to support completion plugins directly from its database, it would need to have very low latency update turnaround (i.e. on every keypress, in <100ms). I'm not sure how well this would work with it being separate from rustc and not driven by the plugins.

It may be that we have an oracle in addition to a tools-oriented interface to rustc.

(aside: I noticed that the go oracle isn't used by the gocode completion tool, but I don't know the reason for this, it could just be historical)

@daniel-vainsencher daniel-vainsencher and 1 other commented on an outdated diff Oct 14, 2015
text/0000-ide.md
+We might want to allow some shortcuts to invalidate an entire file or
+recursively invalidate a directory.
+
+
+**Description**
+
+*get definition*
+
+Takes a span, returns all 'definitions and declarations' for the identifier
+covered by the span. Can return an error if the span does not cover exactly one
+identifier or the oracle has no data for an identifier.
+
+The returned data is a list of 'defintion' data. That data includes the span for
+the item, any documentation for the item, a code snippet for the item,
+optionally a type for the item, and one or more kinds of definition (e.g.,
+'variable definition', 'field definition', 'function declaration').
@daniel-vainsencher
daniel-vainsencher Oct 14, 2015

This style of API, where the IDE queries for what it needs right now, is hard for doing push-driven updates (based on a file changing and being saved), because who knows what exactly the IDE is interested in?

An alternative is to allow IDEs to register interest in some queries (for a typical heavy IDE "all definitions in this project", but in Racer, only a fast changing "whatever is under this cursor location"), and then:

  • Use the registrations to notify IDE only of interesting changes.
  • Use positive registrations to know someone cares about this at all (even before they ask for it).
@nrc
nrc Oct 15, 2015 Contributor

Could you explain the push-driven update a bit more please? I am assuming the IDE plugin is pretty much state-less with respect to the oracle's data and whenever the user performs an action (right click, hover, hotkey, etc.) then it will query the oracle for the data it needs (thus why the oracle must be quick).

@daniel-vainsencher
daniel-vainsencher Oct 15, 2015

Two things can drive the oracle to do work:

  • Code changed (two major usecases: a drip as in editing, or massive as in git pull/automatic refactoring/cargo update). So IIUC, what you've called eager evaluation would be to react to the change in code immediately; I called this push driven. This can easily be a waste of time when you are recomputing something that nobody cares about. However, if in the API the IDE said "I care about X, Y, Z and any updates on them until further notice" then you can be correctly selective.
  • IDE asked for something (say, a definition) and not everything has been precomputed in advance. Triggers some lazy (I called this pull driven) computation. This is never a waste of time, but may have unreasonable latency (whoops, I should have d/led those new versions of two crates before, huh?) and might also lose opportunities for doing things in parallel.
@nrc
nrc Oct 16, 2015 Contributor

I think that for the push-based changes, the IDE calls the update functions (for the big changes, that is why we need to invalidate whole files or directories). But I don't think the result of any of that has to be communicated back to the IDE (other than error messages, maybe) - the IDE won't keep any state about the program - that is all kept in the oracle, which will be updated by the compiler (or the two are integrated together).

You should think of the oracle as part of the IDE really; the IDE shouldn't manage state of its own about the program.

Does that sound right? Or am I missing something about the work flow?

@daniel-vainsencher daniel-vainsencher commented on an outdated diff Oct 14, 2015
text/0000-ide.md
+no invalidated spans, the update call adds data (which will cause an error if
+there are conflicts). Where there is no input data, update just invalidates.
+
+We might want to allow some shortcuts to invalidate an entire file or
+recursively invalidate a directory.
+
+
+**Description**
+
+*get definition*
+
+Takes a span, returns all 'definitions and declarations' for the identifier
+covered by the span. Can return an error if the span does not cover exactly one
+identifier or the oracle has no data for an identifier.
+
+The returned data is a list of 'defintion' data. That data includes the span for
@daniel-vainsencher
daniel-vainsencher Oct 14, 2015

defintion -> definition

@bungcip
bungcip commented Oct 14, 2015

Found a paper related to IDE integration:
http://wasdett.org/2013/submissions/wasdett2013_submission_10.pdf

@daniel-vainsencher

Apologies for linking my own paper. [1] is a (proto-) pattern language describing methods to support multiple analyses over a changing source base (some from Smalltalk designs, some also used then in Eclipse). Basically, something like the oracle discussed here. One major decision point is: what objects from the analysis persist when the program text is not completely valid? For example, adding a "{" early in a file can be seen to invalidate every scope after it; do we then not use those function definitions in auto complete? Smalltalk solves this by giving most definitions (classes, methods) a persistent identity over time. The text edited only corresponds to a single definition, and updates the canonical version only when saved (and syntactically valid). Rust currently seems to encourage large files, which makes this more difficult, though not necessarily impossible. For example, if we have both a recent valid version of the source and the changed span, we can use the valid old definitions except where valid current versions are available.

[1] http://hillside.net/plop/2006/Papers/ACMConferenceProceedings/Intimacy_Gradient/a15-vainsencher.pdf

@olivren olivren and 3 others commented on an outdated diff Oct 14, 2015
text/0000-ide.md
+incremental compilation), but there is essentially no change from a compile
+today. It is not very interesting and won't be discussed further.
+
+The incremental check follows a new model of compilation. This check must be as
+fast as possible but does not need to generate machine code. We'll describe it
+in more detail below. We call this kind of compilation a 'quick-check'.
+
+This RFC also covers making compilation more robust.
+
+
+## The oracle
+
+The oracle is a long running daemon process. It will keep a database
+representation of an entire project's source code and semantic information (as
+opposed to the compiler which operates on a crate at a time). It is
+incrementally updated by the compiler and provides an IPC API for providing
@olivren
olivren Oct 14, 2015

Preliminary notes: I'm commenting on this RFC because I started writing a Rust plugin for the QtCreator IDE. This plugin is in a very preliminary state and it does nothing useful for the moment. I don't know much about compilers in general, and I'm discovering the QtCreator API as I write my plugin. So, be suspicious of anything I could say!

What is the rationale for proposing a daemon process + an IPC API? I fail to see an advantage compared to the oracle just being a library with both a Rust and a C API, that I could call directly from my IDE plugin code. It would remove the pain of writing communication code. It would also allow me to instantiate two services at the same time with isolated content.

The only pros I can see for the daemon approach are:

  1. Share information between multiple IDE instances.
    It seems like a rare use case, and I doubt there is much data to share.
  2. Be easier to integrate in languages with awful FFI capabilities.
    The IPC API could be designed on top of an existing library, if the need is real.

When I investigated to integrate Racer to my IDE plugin, I came to the conclusion that it would be easier to use Racer as a library rather than invoking it as a process and parsing the output.

@eddyb
eddyb Oct 15, 2015 Member

Having an API means all sorts of stabilization concerns, and also passing structured data through a C API.

@phildawes
phildawes Oct 15, 2015

Would we not have to apply the same stabilization concerns to the IPC interface?
(genuine question, what's the difference between a rust api and an ipc api wrt stabilization?)

@nrc
nrc Oct 15, 2015 Contributor

The major reason is parallelism - we want the IDE to carry on doing its thing whilst the compiler (coordinated by the oracle) compiles the non-urgent parts of changed source. Then the oracle needs to update its database with the results. Although the IDE could handle the threading to make all this work, it is simpler (and more reusable) just to do it all in a separate process.

With a decent IPC library, having an IPC API is not a lot more complex than having an FFI API. And in terms of implementation, having a totally separate process is marginally easier.

In terms of stability, I don't think there is much difference between the two. An IPC API might be a little more robust because using an intermediate data structure and having parsing/serialisation adds a little abstraction.

@olivren
olivren Oct 15, 2015

At least for QtCreator, the extension API is already designed with asynchrony in mind. Having to deal with asynchronous IO would be a lot more painful for me than having a simple synchronous API. In fact, if you give me an IPC API, the first thing I'll do will be to wrap IO and parsing into a simple synchronous API. In the end there will just be an unnecessary indirection layer.

The oracle library will of course deal with its own update work in a dedicated thread, but you will have to have such a thread anyway, with a process-based design.

@nrc
nrc Oct 16, 2015 Contributor

Hmm, interesting, that suggests implementing the oracle as a library might be a better solution than as a separate process. I'm not 100% convinced, but it seems like a reasonable alternative to consider. Let me have a think about it...

@nrc
nrc Oct 16, 2015 Contributor

Does anyone know how this would work out in the Java world? How does a JNI/FFI interface compare to an IPC interface?

@eddyb
eddyb Oct 16, 2015 Member

I guess this got lost in the bikeshed: the reason I independently came up with the IPC idea is that compartmentalization and serialization is unavoidable.

I have implemented a system where rustc threads are spawned in the background, and there is a duplex request/response channel to control the thread and get data from it.

The messages are owned ADTs, which means rustc-specific data (e.g. an interned string) has to be converted to a general representation (String).

To avoid tying the in-memory format to Rust or C-based representations, an actual serialization format can be used.
Cap'n Proto, for example, has built-in versioning and extensibility.

And, finally, for added reliability and to get rid of the FFI surface, the threads can be moved to a separate process, which is how we end up with the oracle/rider model.
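
A minimal sketch of that shape, with invented message types (nothing here is an actual rustc API): a background thread owns the long-lived compiler state, and only owned, rustc-independent data crosses the duplex channel.

```rust
use std::sync::mpsc;
use std::thread;

// Everything crossing the channel is an owned ADT - plain Strings and byte
// offsets, never rustc-internal types such as interned strings.
enum Request {
    TypeAt { file: String, offset: usize },
    Shutdown,
}

enum Response {
    Type(String),
    NoData,
}

// Spawn a background analysis thread and hand back the duplex channel ends.
// In a real implementation the thread would own a long-lived compiler session.
fn spawn_analysis_thread() -> (mpsc::Sender<Request>, mpsc::Receiver<Response>) {
    let (req_tx, req_rx) = mpsc::channel::<Request>();
    let (resp_tx, resp_rx) = mpsc::channel::<Response>();
    thread::spawn(move || {
        for req in req_rx {
            match req {
                Request::TypeAt { file, offset } => {
                    // Placeholder: query the in-memory compiler state here.
                    let _ = (file, offset);
                    let _ = resp_tx.send(Response::NoData);
                }
                Request::Shutdown => break,
            }
        }
    });
    (req_tx, resp_rx)
}
```

Moving that thread into a separate process, as described above, keeps the same message shapes but swaps the channel for an IPC transport.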

@olivren olivren and 1 other commented on an outdated diff Oct 14, 2015
text/0000-ide.md
+Takes nothing, returns a list of project ids.
+
+Each of the remaining calls takes a project identifier.
+
+
+**Update**
+
+See section on input data format below.
+
+*update*
+
+Takes input data (actual source code rather than spans since we cannot assume
+the user has saved the file) and a list of spans to invalidate. Where there are
+no invalidated spans, the update call adds data (which will cause an error if
+there are conflicts). Where there is no input data, update just invalidates.
+
@olivren
olivren Oct 14, 2015

I understand this is not a detailed API, but I don't see how this works when there are multiple spans to invalidate. Must all spans be computed based on the initial content of the file? It sounds like it could be a bit painful to compute from the plugin side. Maybe a simpler API could live alongside this one, that takes the full content of the file and lets the oracle compute the differences based on its current state.

@nrc
nrc Oct 15, 2015 Contributor

The spans are relative to the last update passed to the oracle. I don't think that should be hard to compute - the plugin just has to keep track of any text deleted or edited since the last update call.

The trouble with diff'ing before and after snapshots is that it is hard to do well, and so we'd end up making mistakes or overestimating the invalidated region. I imagine it is not super-cheap either.

@olivren
olivren Oct 15, 2015

This makes sense. I think you are right and the proposed design is best.
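
For concreteness, a hedged sketch of what such an update request might look like; the type and field names are invented, not taken from the RFC. Spans are relative to the state sent in the previous update, so the plugin only has to track edits made since its last call.

```rust
// Hypothetical shape of the oracle's `update` call, not the RFC's actual API.
struct Span {
    file: String,
    start: usize, // byte offset, relative to the previously submitted text
    end: usize,   // exclusive
}

struct Update {
    // Actual source text being supplied, since the file may be unsaved.
    // With no `invalidate` entries this purely adds data (conflicts are errors).
    new_source: Vec<(Span, String)>,
    // Regions of previously submitted source that are no longer valid.
    // With no `new_source` entries this is a pure invalidation.
    invalidate: Vec<Span>,
    // Possible shortcuts mentioned in the RFC text: invalidate a whole file
    // or recursively invalidate a directory.
    invalidate_files: Vec<String>,
    invalidate_dirs: Vec<String>,
}
```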

@olivren olivren and 1 other commented on an outdated diff Oct 14, 2015
text/0000-ide.md
+the user has saved the file) and a list of spans to invalidate. Where there are
+no invalidated spans, the update call adds data (which will cause an error if
+there are conflicts). Where there is no input data, update just invalidates.
+
+We might want to allow some shortcuts to invalidate an entire file or
+recursively invalidate a directory.
+
+
+**Description**
+
+*get definition*
+
+Takes a span, returns all 'definitions and declarations' for the identifier
+covered by the span. Can return an error if the span does not cover exactly one
+identifier or the oracle has no data for an identifier.
+
@olivren
olivren Oct 14, 2015

Why is the input a span here? I would expect an offset. Would I have to find the span of the entire word I want the definition of, or is a partial span (or even an empty span) a valid input? (Note that the same remark applies to all the subsequent API functions)

@nrc
nrc Oct 15, 2015 Contributor

I'm assuming that the IDE has a tokeniser and therefore already knows the span of the identifier (even simple editors usually have a tokeniser in order to implement syntax highlighting). On the other hand, I suppose that from the oracle's point of view it is as easy to identify the identifier by a single position as it is a span, so it might be as well to take a single position.

@olivren
olivren Oct 15, 2015

The hand written tokenizer on the plugin side will certainly not be as accurate as the oracle's one. I'd rather delegate to the oracle as much as possible.
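
Pulling this exchange together, an illustrative sketch of the *get definition* call and its 'definition data'; the names are invented, and the input is shown as a single position rather than a span, per the discussion above.

```rust
// Illustrative only, not the RFC's actual API surface.
struct Position {
    file: String,
    offset: usize, // byte offset of the caret; a span would work equally well
}

enum DefKind {
    VariableDefinition,
    FieldDefinition,
    FunctionDeclaration,
    // ... other kinds of definition/declaration
}

struct Definition {
    span: (String, usize, usize), // file, start, end of the defining item
    docs: Option<String>,         // any documentation attached to the item
    snippet: String,              // code snippet for the item
    ty: Option<String>,           // type of the item, when known
    kinds: Vec<DefKind>,          // one or more kinds of definition
}

// Errors cover "the position does not identify exactly one identifier" and
// "the oracle has no data for this identifier".
fn get_definition(_at: Position) -> Result<Vec<Definition>, String> {
    unimplemented!("sketch only")
}
```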

@olivren olivren and 1 other commented on an outdated diff Oct 14, 2015
text/0000-ide.md
+*get definition*
+
+Takes a span, returns all 'definitions and declarations' for the identifier
+covered by the span. Can return an error if the span does not cover exactly one
+identifier or the oracle has no data for an identifier.
+
+The returned data is a list of 'defintion' data. That data includes the span for
+the item, any documentation for the item, a code snippet for the item,
+optionally a type for the item, and one or more kinds of definition (e.g.,
+'variable definition', 'field definition', 'function declaration').
+
+*get references*
+
+Takes a span, returns a list of reference data (or an error). Each datum
+consists of the span of the reference and a code snippet.
+
@olivren
olivren Oct 14, 2015

The output should also tell the "kind" of each reference found. For example, if we want to find the references of a function declared in a trait, the output could be either a "method call" kind, or a "definition in a trait impl" kind, or a "function as a value" kind.

@nrc
nrc Oct 15, 2015 Contributor

The "reference data" includes the kind of definition, see lines 299, 300

@olivren
olivren Oct 15, 2015

The term used in the previous section was "definition data". It is not obvious that "reference data" designates the "kind" of references.

@nrc
nrc Oct 16, 2015 Contributor

I'll try and polish the text here.

@olivren olivren and 1 other commented on an outdated diff Oct 14, 2015
text/0000-ide.md
+'variable definition', 'field definition', 'function declaration').
+
+*get references*
+
+Takes a span, returns a list of reference data (or an error). Each datum
+consists of the span of the reference and a code snippet.
+
+*get docs*
+
+Takes a span, returns the same data as *get definition* but limited to doc strings.
+
+*get type*
+
+Takes a span, returns the same data as *get definition* but limited to type information.
+
+Question: are these useful/necessary? Or should users just call *get definition*?
@olivren
olivren Oct 14, 2015

I think it only depends whether the oracle is able to give one type of answer more quickly than another one. If the oracle can only be "ready to answer any request" or "not ready", then a unique get definition function is enough.

@daniel-vainsencher
daniel-vainsencher Oct 15, 2015

I don't think that get type (presumably over any expression, not just identifiers) is the same kind of query as get definition: the user wants the type specialized to the current call's context, including values of any type parameters.

@olivren olivren and 1 other commented on an outdated diff Oct 14, 2015
text/0000-ide.md
+*get docs*
+
+Takes a span, returns the same data as *get definition* but limited to doc strings.
+
+*get type*
+
+Takes a span, returns the same data as *get definition* but limited to type information.
+
+Question: are these useful/necessary? Or should users just call *get definition*?
+
+*search for identifier*
+
+Takes a search string or an id, and a struct of search parameters including case
+sensitivity, and the kind of items to search (e.g., functions, traits, all
+items). Returns a list of spans and code snippets.
+
@olivren
olivren Oct 14, 2015

The search for identifier should also take as an input the scope of the search. QtCreator's Locator lets the user search for a symbol in the currently opened file, and it's one of the Locator functions I use the most.

@nrc
nrc Oct 15, 2015 Contributor

This is a good idea, I'll add it.

Note that the IDE is expected to present an interface over this functionality, and plain text search should be done entirely in the IDE, the only search the oracle helps with is finding uses of a particular identifier (i.e., semantic search). Defining a scope to search over would still be useful though.

@olivren
olivren Oct 15, 2015

Yes, I was indeed talking about symbols, not full text searches. The Locator will give top level items only.
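
A sketch of the *search for identifier* parameters with the scope addition folded in, as discussed just above; the field and type names are hypothetical.

```rust
// Illustrative sketch of the `search for identifier` parameters, including
// the search scope suggested in this thread; not the RFC's actual API.
enum ItemKind { Function, Trait, Struct, Enum, All }

enum SearchScope {
    CurrentFile(String), // e.g. QtCreator's Locator searching the open file
    Crate(String),
    Project,
}

struct SearchParams {
    query: String,        // search string (an id could be an alternative input)
    case_sensitive: bool,
    kinds: Vec<ItemKind>, // which kinds of items to search
    scope: SearchScope,   // where to search
}

struct SearchHit {
    span: (String, usize, usize), // file, start, end
    snippet: String,
}
```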

@olivren olivren and 1 other commented on an outdated diff Oct 14, 2015
text/0000-ide.md
+Takes a search string or an id, and a struct of search parameters including case
+sensitivity, and the kind of items to search (e.g., functions, traits, all
+items). Returns a list of spans and code snippets.
+
+
+**Code completion**
+
+*get suggestions*
+
+Takes a span (note that this span could be empty, e.g, for `foo.` we would use
+the empty span which starts after the `.`; for `foo.b` we would use the span for
+`b`), and returns a list of suggestions (is this useful? Is there any difference
+from just using the caret position?). Each suggestion consists of the text for
+completion plus the same information as returned for the *get definition* call.
+
+
@olivren
olivren Oct 14, 2015

I would add another function to the API: a way to search among the documentation. QtCreator's Locator can search in the available documentation, and display the html doc of the matches.

@nrc
nrc Oct 15, 2015 Contributor

Is this a text search of the docs, or does it look up documentation for a particular function (or whatever)?

@olivren
olivren Oct 15, 2015

Not sure about the specifics. With a Qt project, this Locator feature matches either a symbol name, or a word that appears inside a title of the generated html doc (they are structured docs). This is clearly not a critical feature, it may not belong to an initial RFC.

@olivren olivren commented on an outdated diff Oct 14, 2015
text/0000-ide.md
+
+*search for identifier*
+
+Takes a search string or an id, and a struct of search parameters including case
+sensitivity, and the kind of items to search (e.g., functions, traits, all
+items). Returns a list of spans and code snippets.
+
+
+**Code completion**
+
+*get suggestions*
+
+Takes a span (note that this span could be empty, e.g, for `foo.` we would use
+the empty span which starts after the `.`; for `foo.b` we would use the span for
+`b`), and returns a list of suggestions (is this useful? Is there any difference
+from just using the caret position?). Each suggestion consists of the text for
@olivren
olivren Oct 14, 2015

In this case I think a span is ok. If a user highlights a part of the text and invokes the autocompletion, I expect the completion to give results that could replace the highlighted selection.

@olivren olivren and 1 other commented on an outdated diff Oct 14, 2015
text/0000-ide.md
+under normal compilation, only when performing a quick-check from the IDE.
+
+
+
+## Error format
+
+Currently the compiler generates text error messages. I propose that we add a
+mechanism to the compiler to support different formats for error messages. We
+already structure our error messages to some extent (separating the span
+information, the message, and the error code). Rather than turning these
+components into text in a fairly ad hoc manner, we should preserve that
+structure, and some central error handler should convert into a chosen format.
+We should support the current text format, JSON (or some other structured
+format) for tools to use, and HTML for rich error messages (this is somewhat
+orthogonal to this RFC, but has been discussed in the past as a desirable
+feature).
@olivren
olivren Oct 14, 2015

Could you give an example of how an HTML error message could be useful? Or give a link to a discussion on the subject?

@nrc
nrc Oct 15, 2015 Contributor

The idea is that we could use colouring or shading or whatever to indicate things like the scope of borrows or lifetimes and have links to relevant documentation, etc. This is not useful for IDEs so much as a general way to improve our error messages.
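
As a concrete illustration of "preserve the structure, render it in several formats", a sketch of what a structured diagnostic might carry; the fields are invented for illustration and are not a proposed wire format.

```rust
// Invented structure, keeping the pieces the compiler already has (span,
// message, error code) so that text, JSON and HTML renderers can all be
// driven from the same data.
struct Diagnostic {
    level: Level,
    code: Option<String>,      // e.g. "E0308"
    message: String,
    file: String,
    line_start: usize,
    column_start: usize,
    line_end: usize,
    column_end: usize,
    children: Vec<Diagnostic>, // attached notes, helps, suggestions
}

enum Level {
    Error,
    Warning,
    Note,
    Help,
}

// A JSON renderer would then emit something along the lines of:
//   {"level":"error","code":"E0308","message":"mismatched types",
//    "file":"src/main.rs","line_start":3,"column_start":9, ...}
// while the existing text renderer and a future HTML renderer would walk the
// same structure.
```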

@olivren olivren commented on an outdated diff Oct 14, 2015
text/0000-ide.md
+having to reload metadata, but having to keep metadata for an entire project in
+memory would be expensive. We could perhaps compromise by unloading when the
+user needs to recompile a different crate. I believe it is probably better in
+the long run, but a batch process is OK to start with.
+
+How and when should we generate crate metadata. It seems sensible to generate
+this when we switch to editing/re-compiling a different crate. However, it's not
+clear if this must be done from scratch or if it can be produced from the
+incremental compilation metadata (see that RFC, I guess).
+
+What should we call the oracle tool? I don't particularly like "oracle",
+although it is descriptive (it comes from the Go tool of the same name).
+Alternatives are 'Rider', 'Racer Server', or anything you can think of.
+
+How do we handle different versions of Rust and interact with multi-rust?
+Upgrades to the next stable version of Rust?
@olivren
olivren Oct 14, 2015

I too was wondering how to deal with multiple versions of Rust. In my IDE plugin, I would like to let the user choose their compiler, and be able to invoke the nightly compiler or the stable one explicitly using different targets. I don't know how I could give this choice to the user for the code model used for navigation/completion/refactoring.

@eddyb
Member
eddyb commented Oct 15, 2015

@phildawes To remove as much latency as possible and to prevent rustc stabilization concerns, I believe that the oracle should use rustc's internal APIs and be behind the stability wall.
That is, the oracle ships with rustc and uses its libraries directly.

I really see no point in having some output format from rustc itself, that's added design complexity and inefficiency for no gain (in case of the oracle, at least).

@phildawes

Hi @eddyb!

I agree with that idea, if oracle is to be a thing then it makes sense to put oracle behind the stabilization wall and link it directly to rustc.

However I'm worried about the whole oracle thing. Definitely we need a stable interface to rustc. My concerns with oracle being the stable interface to rustc are:

  • 'database of code' is seductive, but imposing this top-down abstraction could easily result in incidental complexity (overengineering)
  • it might turn out that building the various refactorings and functionality require an interface that doesn't fit well into the database-of-code paradigm.

It already appears to me that completions maybe don't fit naturally into this design, and we've barely got started.

I'd prefer to see us drive the interface out bottom-up from the ide plugins. Try to build stuff using rustc directly, understand what stable interface is required.

@nrc
Contributor
nrc commented Oct 15, 2015

'oracle' is not a perfect name here

@liigo oracle is a terrible name! But it does have some precedent from Go, and I can't think of a better one. Suggestions welcome!

@nrc
Contributor
nrc commented Oct 15, 2015

@phildawes

I think the fundamental requirement is that we need a stable interface to rustc to support IDEs and plugins.

The save-analysis API should evolve into this. But it seems there is a level of computation that must be done on top of that data (i.e., the data the compiler has) and that this computation must be done by any IDE, so it makes sense to me to share that code in the oracle.

I'm not yet sure whether a long running database process is required/desirable.

The database aspect is just an implementation detail. I think it is necessary in order to do the cross-referencing that the compiler does not do (e.g., enumerating all the implementations of a trait). I also think some kind of cache is necessary for perf reasons. That doesn't have to be a database, but there is a lot of info for Rust programs (indexing Rust or Servo for DXR produces several gigs of data) and we'll hit lots of problems with data of that size. Databases have these problems solved already.

The oracle concept implies a single view of the sourcecode, and I think this jibes a bit with the tools that will perform refactoring, reformatting and suggestions. I suspect they will want interactive access to the compiler (compile this pre-processed snippet, what errors occur if I do this?)

They will, and they will be able to interact with the compiler directly still. I imagine that something like a reformatting tool will start by querying the oracle, and then use the compiler directly.

If the oracle were to support completion plugins directly from its database, it would need to have very low latency update turnaround (i.e. on every keypress, in <100ms). I'm not sure how well this would work with it being separate from rustc and not driven by the plugins.

This is a concern. I think the only way we can be fast enough is to be totally separate from rustc - a db query should be orders of magnitude faster than compiling even a small section of a program. It is possible that the IPC overhead will be too high and we'll have to provide the oracle as a library for the plugins to use in-process, but that probably means writing it in Java to avoid expensive JNI calls. We should get this perf info from an early prototype, and can revisit then if necessary.

It may be that we have an oracle in addition to a tools-oriented interface to rustc.

Yeah, I envisage this.

@nrc
Contributor
nrc commented Oct 15, 2015

@eddyb

I believe that the oracle should use rustc's internal APIs and be behind the stability wall.

Yeah, we could definitely put the oracle and quick-check compiler together. The downside of this is purely about development practice - it will be a pain to prototype the oracle if we have to land stuff into the compiler all the time and deal with the compiler internals changing.

I think I would like to start separately for easier development (they are two radically different kinds of software), work out how things will look and then merge them if necessary. I imagine that using the save-analysis API directly rather than data provided by it should not be too difficult.

@nrc
Contributor
nrc commented Oct 15, 2015

@phildawes

It already appears to me that completions maybe don't fit naturally into this design, and we've barely got started.

Could you expand on this please? I'm keen to understand the limitations of the database approach.

I'd prefer to see us drive the interface out bottom-up from the ide plugins. Try to build stuff using rustc directly, understand what stable interface is required.

To deal with rustc directly, it would need to be modified to be a long running process - the latency of reading and processing crate metadata would kill the latency alone. I suspect we could do simple lookups and probably code completion based on querying the compiler. Some more complicated searches are impossible due to the compiler never having the info. For large projects, there would be big overheads in keeping the compiler in memory - it uses a lot, and we'd need to figure out a way to have an instance running for each crate we currently care about. Then there are questions about how quickly we can reload the data for a crate - hopefully the incremental compilation work helps there, but it is a little bit of an unknown.

In comparison, the database approach feels simpler - it is really just a cache of the compiler's data, pre-cross-referenced and stored in an efficient manner. It's also somewhat tried and tested as the implementation for DXR and is I believe how most IDEs for other languages work (with the difference that the database is managed by the IDE rather than a helper oracle).

@eddyb
Member
eddyb commented Oct 15, 2015

For large projects, there would be big overheads in keeping the compiler in memory - it uses a lot, and we'd need to figure out a way to have an instance running for each crate we currently care about.

There is... always a compromise: if we use only flat arrays without pointers, and no pointer-sized types, we can start memory-mapping them out of crate metadata.
I suggested something like this for HIR in one of the discussions, and I'm pretty sure it can work out, albeit at the cost of less ergonomic internal APIs.

Also, most of the memory overhead comes from LLVM, IME.
I was able to keep ~120 rustc instances in RAM, for all the .rs files in rust-by-example, in about 400MB (IIRC, it could've been more).
I had to remove some eager impl loading logic to get there, and there were some other things (eager macro loading) that could also be removed, but it was more work for a smaller win.
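
A sketch of the flat-array idea: nodes refer to one another by indices rather than pointers, so the tables could in principle be written into crate metadata and memory-mapped back without pointer fixups. The types here are invented for illustration; a real implementation would also have to pin down layout (e.g. with an explicit `repr`).

```rust
// Invented illustration of "flat arrays, no pointer-sized types".
#[derive(Clone, Copy)]
struct NodeId(u32); // index into `FlatHir::nodes`, never a pointer

#[derive(Clone, Copy)]
struct StrId(u32); // offset into `FlatHir::strings`

#[derive(Clone, Copy)]
enum FlatNode {
    Fn { name: StrId, body: NodeId },
    Call { callee: NodeId, arg: NodeId },
    Lit { value: u32 },
}

struct FlatHir {
    // Fixed-size, Copy elements and no pointers, so (given a guaranteed
    // layout) the buffer could be written out byte-for-byte and mapped back.
    nodes: Vec<FlatNode>,
    strings: Vec<u8>, // concatenated string data, addressed by StrId offsets
}

impl FlatHir {
    fn node(&self, id: NodeId) -> FlatNode {
        self.nodes[id.0 as usize]
    }
}
```

The ergonomic cost mentioned above shows up in exactly these index lookups replacing direct references.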

@phildawes

Alternative straw-man design: (this is a starting point + probably full of holes, don't be too hard on me!)

  • There is no project api.
  • get analysis for Item
    pass the filename, file-contents (if not the same as on-disk), and a point in the file
    returns the save-analysis style info for every element in the Item.
    If there are parse or semantic errors/warnings, also returns these, but does its best to complete
    • turnaround time: aim for ~50ms
  • get analysis for File
    pass the filename, file-contents (if not the same as on-disk)
    returns the analysis info for every element in the file
    If there are parse or semantic errors/warnings, also returns these
    • turnaround time: aim for ~250ms
  • get analysis for project
    pass a filename in the project (doesn't matter which file)
    Returns analysis info for every type and function skeleton, but not bodies
    If there are parse or semantic errors/warnings, also returns these
    • turnaround time: aim for ~250ms

The advantage of this approach is that it assumes less about the intentions of the caller and is more flexible.

All the cross referencing is done by the IDE plugins according to their needs. Additional middleware can be written to assist plugins later if required.
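
Reading the straw-man as an interface, a hedged sketch of what those three calls might look like; the trait and type names are invented here, and `Analysis` stands in for save-analysis style records.

```rust
// Invented names; `Analysis` stands in for save-analysis style records and
// `Diagnostic` for parse/semantic errors and warnings.
struct Analysis;
struct Diagnostic;

struct AnalysisResult {
    analysis: Vec<Analysis>,
    // Best-effort: diagnostics are returned alongside whatever analysis
    // could still be produced.
    diagnostics: Vec<Diagnostic>,
}

trait QuickCheck {
    // Analyse only the item containing `offset` (straw-man target: ~50ms).
    // `contents` overrides the on-disk text when the buffer is unsaved.
    fn analyse_item(&self, file: &str, contents: Option<&str>, offset: usize) -> AnalysisResult;

    // Analyse a whole file (straw-man target: ~250ms).
    fn analyse_file(&self, file: &str, contents: Option<&str>) -> AnalysisResult;

    // Analyse type and function skeletons across the project, skipping
    // function bodies (straw-man target: ~250ms). Any file in the project
    // identifies the project; there is no separate project API.
    fn analyse_project(&self, any_file_in_project: &str) -> AnalysisResult;
}
```

The point of this shape is that the caller supplies unsaved buffer contents and gets best-effort analysis plus diagnostics back, leaving cross-referencing to the plugin (or to middleware built later).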

@phildawes

Hi @nrc

To deal with rustc directly, it would need to be modified to be a long running process - the latency of reading and processing crate metadata would kill the latency alone. I suspect we could do simple lookups and probably code completion based on querying the compiler. Some more complicated searches are impossible due to the compiler never having the info. For large projects, there would be big overheads in keeping the compiler in memory - it uses a lot, and we'd need to figure out a way to have an instance running for each crate we currently care about. Then there are questions about how quickly we can reload the data for a crate - hopefully the incremental compilation work helps there, but it is a little bit of an unknown.

I think I disagree with this.

  • When profiling rustc, reading the crate metadata takes a very small fraction of the time compared to resolution and analysis.
  • We wouldn't keep a rustc running, we'd link to the libraries and perform the analysis lazily.
    (I think we have to do this if we want to support quick-check-style analysis for completions, even if we do the database thing)

In comparison, the database approach feels simpler - it is really just a cache of the compiler's data, pre-cross-referenced and stored in an efficient manner. It's also somewhat tried and tested as the implementation for DXR and is I believe how most IDEs for other languages work (with the difference that the database is managed by the IDE rather than a helper oracle).

  • I don't think DXR is a good precedent in this case, since DXR works on a static fully-compiled view of the crate. If a static view was all that was required we'd already have good tool support built on top of save-analysis. DXR doesn't have any of the incremental update or unfinished code complexity (cache invalidation being one of the 2 hard things in software engineering! http://martinfowler.com/bliki/TwoHardThings.html)
  • I'm not sure that IDEs work like this. E.g. @dgrunwald hints that SharpDevelop and Microsoft's new 'Roslyn' C# compiler use a lazy approach to semantic analysis rather than caching the entire semantic project. https://www.reddit.com/r/rust/comments/3a4qbj/racer_rustc_update/cs9w3ny?context=3
@Ericson2314
Contributor

@eddyb I agree that we should minimize the number of processes. Every process boundary means screwing around with serialization formats and flattening things, which is a waste of code and effort. Furthermore the nature of what is being done lends itself to rich data structures, so this flattening can be extra harmful.


I brought this up in the earlier incremental compilation RFC, but at some point I'd rather have all dependency graphs be explicit, and traverse/cache in some generic manner. My hunch is that building such graphs is far less expensive than storing them. If so, we might as well have all builds work this way, and instead make the decision how much of the graph should be persisted - in memory in the case of a daemon, or on disk in the case of repeated runs. I'd assume once we have GC that building and immediately chucking the graphs would be even more of a non-issue.

@phildawes

@nrc

.. appears to me that completions maybe don't fit naturally into this design, ..

Could you expand on this please? I'm keen to understand the limitations of the database approach.

My thinking is:

  • the requirement for completions is that we provide suggestions for the code as it is being written with low latency (ideally < 100ms on each keypress).
  • With the database fed by the compiler approach the compiler has to be passed the new state by the ide after each keypress. The compiler generates the analysis data (quick-check), which is then fed to update the database. After the database is updated the ide can then perform the get-suggestions query.
  • A more straightforward approach would be to obtain the analysis data direct from the compiler (in order to get the type of the expression being completed), and then do a separate field+methods search on the (more static) skeleton data.
@phildawes

Hello again!

Reading it all back, I suspect you (@nrc) and I are mostly in agreement about the details, but just have a different focus.

Have I got this correct?:

  • we both agree that there needs to be a stable low level tools interface to rustc for 'quick check'. Only rustc can do quick-check, since it requires the complicated type and lifetime analysis performed by librustc_typeck et al. (by 'rustc' I mean the code that makes up the compiler rather than the binary)
    • the current view is that this interface will deliver content similar to save-analysis, but will support generation of per-Item (or smaller?) chunks.
  • Your focus is on the oracle database-style component to assist ide plugins, with the intention of saving duplicated effort. From this perspective the interaction with the compiler/quick-check is an implementation detail.
  • My concern is the low level quickcheck functionality and api. For me this is the piece blocking everything else (including the oracle!).

I'm not yet sure if the oracle is a necessary piece to get good ide support. Once you take away typeck+resolve I think the rest is mostly graph traversal. But with the low level interface in place it doesn't matter if it isn't a good fit; ide developers can write their own traversal/search code instead (and in java if they like!).

@nrc
Contributor
nrc commented Oct 16, 2015

@phildawes yeah, I think we are more in agreement than not. I think in your mental model, the oracle is part of the IDE, and it is perfectly reasonable to view it as a reuse mechanism shared between IDEs layered on top of the compiler, rather than the compiler itself.

I guess where we differ is that for code completion (as opposed to code search), you don't need the oracle layer, you could get the answer straight from the compiler. I guess you want that for efficiency, whereas I was imagining the oracle giving a more uniform interface by providing everything.

It would certainly be possible for Racer to skip the oracle and talk directly to the compiler for code completion. The same goes for IDEs, although not if we take the approach where the oracle drives the compiler, rather than the IDE directly. However, I think the mode of operation in this case could be pretty quick - the oracle queries the compiler, passes the result to the IDE, and then updates its database. As long as the oracle and quick check compiler are merged into one program, this should be as quick as using the compiler directly. If they are separate processes, then we'd need to measure the difference.

(More in the next reply).

@nrc
Contributor
nrc commented Oct 16, 2015

@phildawes I suggested it might be possible to ask the compiler directly for code completion and skip the oracle. I want to go into more detail about how this might work.

Example:

    let foo = ...;
    foo.<caret>

So, let's assume there is enough code for the compiler to precisely infer the type of foo, and we want the code completion options at <caret> - i.e., we want a list of all fields and methods that would make the expression type check.

The current state of the compiler is that it checks an expression like foo.bar() given the type of foo. Fields are much easier than methods, but still not super simple - we might be applying autoref/deref and due to the Deref trait that could mean different concrete types. We also have to take into account coercions. For methods, we again have to check various combinations of auto-ref/deref and coercions, but we also have to check any traits in scope. The compiler checks if a trait has a method named bar and then tries to match it to the adjusted type of foo, taking into account the self type. We are pretty eager about pruning based on the name bar.

Now for code completion, we are not asking if a given name checks out somehow, but instead are asking for the set of all names which could check out. The compiler cannot answer that question at the moment. It could be modified to do so, I think. It would require a lot of modification to the method lookup code (which would only be used for code completion, not other things). This would require ploughing through a lot of data. We're essentially asking the compiler to do a bunch of computation and data management.

Implementing this in the oracle would be non-trivial too. But we would ask the compiler for the type of foo and then could look that up in the db to find a list of methods/fields with a single (albeit complex) query. Note that this doesn't require updating the db. I imagine a database would be quicker at answering the query than the compiler would be at computing all possible names. In the db case, we are only doing the data management, and databases are good at that!

So, I think it is possible to do code completion without the oracle. But I don't think it is easy to implement, and I'm not sure if it will be quicker than using the oracle.

There are some things we need the oracle for (questions the compiler will never be able to answer), e.g., 'find all uses of this type/variable', so we do need the oracle around; I imagine using it for code completion would not be too much harder.
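
A toy sketch of the split described above: the compiler answers "what is the type of `foo`?", and a pre-built index answers "which members does that type expose?". The index layout is invented for illustration and deliberately ignores autoref/deref, coercions and generics, which are exactly the hard parts mentioned above.

```rust
use std::collections::HashMap;

// Invented index shape; a real oracle might keep this in a database rather
// than in memory, but the lookup is the same.
struct Index {
    fields: HashMap<String, Vec<String>>,        // type name -> field names
    methods: HashMap<String, Vec<String>>,       // type name -> inherent methods
    impls: HashMap<String, Vec<String>>,         // type name -> traits it implements
    trait_methods: HashMap<String, Vec<String>>, // trait name -> methods
}

impl Index {
    // `ty` is the concrete type of `foo`, obtained from the compiler.
    // Ignores autoref/deref, coercions, where-clauses and generics.
    fn completions(&self, ty: &str) -> Vec<String> {
        let mut out = Vec::new();
        out.extend(self.fields.get(ty).into_iter().flatten().cloned());
        out.extend(self.methods.get(ty).into_iter().flatten().cloned());
        for tr in self.impls.get(ty).into_iter().flatten() {
            out.extend(self.trait_methods.get(tr).into_iter().flatten().cloned());
        }
        out
    }
}
```

The single `completions` lookup corresponds to the "single (albeit complex) query" above; the compiler is still consulted once per request, to produce `ty`.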

@arielb1
Contributor
arielb1 commented Oct 16, 2015

@nrc

I am not sure how much we want a database separate from rustc - rustc certainly has a way (TraitDef::for_each_impl) to look up all the implementations of a trait, and while there is no index (e.g. for uses of a definition), full AST scans can be done in <1s, and adding a mode to get all methods in method::probe isn't particularly difficult. On the other hand, operations on types (e.g. trait matching, method lookup) are complicated enough I would prefer them not to be duplicated.

@arielb1
Contributor
arielb1 commented Oct 16, 2015

In different words, we already have a program for quickly answering questions about Rust code - it is just librustc.

@nrc
Contributor
nrc commented Oct 19, 2015

There are a few problems with using rustc as the frontend for answering questions about Rust programs (of course it is the ultimate source of knowledge in this proposal):

  • it doesn't have complete information about a program, in particular about downstream crates - TraitDef::for_each_impl for example only knows about the current crate and crates we depend on. Crates could come later in compilation which it does not know about. You could make sure to only ask this question when compiling the 'last' crate, but we're then adding significant complexity to the use.
  • There are some questions it can't answer easily - find all uses of a field in a struct or a local variable. Sure we could extend the compiler to keep track of these things - but it is a whole lot of data that most uses of the compiler don't need, and you end up running the compiler in different modes and everything just gets more complicated - why not keep this stuff external and reduce complexity?
  • Dealing with compiler internals is scarily complicated and unstable, so we need some API surface (save-analysis is meant to be this); the oracle or whatever is a thin layer over this API that does some post-analysis computation and caching for you.

I'm not suggesting the oracle does trait matching, method lookup, etc. The compiler still does all this, the oracle just keeps track of the data the compiler generates, and keeps it in an easily accessible form.

@arielb1
Contributor
arielb1 commented Oct 19, 2015

@nrc

Some things, like a simple reverse definition lookup, are basically database features (but I am not sure that a simple linear scan won't be good enough given the performance requirements - this is a not-particularly-obvious space+precomputation vs. time tradeoff). However, things that involve type inference or trait matching, while not really fitting anything resembling a relational (or relationless) database, can be very easily done by something like rustc. Pre-computing these once and checking them is non-trivial, as methods can come from where-clauses and there are infinitely many (generic) types.

@juarezr
juarezr commented Oct 20, 2015

The best player I know of in this area is Microsoft's C# compiler, Roslyn.
Worth taking a look:
https://github.com/dotnet/roslyn

@eddyb
Member
eddyb commented Oct 20, 2015

@juarezr Java IDEs are very powerful without as much work, because the language is simpler (compared to, say, C++, although libclang is solving some of that).

While C# has more features, the "compilation model" is similar to Java's, with limited nesting, limited inference and nothing like type-classes (traits in Rust).

@phildawes

I had a Skype conversation with @nrc on Friday which was very thought-provoking for me and I learnt a lot. (Nick is a v smart guy!) I've only just got round to writing it up, sorry.

Notes are in no particular order (I didn't write things down at the time)

  • I didn't realise, but Nick was proposing using an actual database (e.g. postgres) to store the type info. To me that sounded like an additional unnecessary dependency, but he swayed me that this could be a good tradeoff if we need to do a bunch of fine-grained invalidation / re-indexing. DXR used to use postgres, currently it uses elastic search.
    (Nick would recommend against using Elasticsearch for IDEs because of less of an emphasis on textual search)
  • Also I hadn't fully considered the scope of what was being considered for the index:
    • At the limit, changing a field or method in libcollections could result in the index telling us all the places it is impacted in other projects
  • Nick didn't realise, but my type-inference prototype/experiment (which strips function bodies except for the interesting one to speed things up) actually strips the text as it is loaded, rather than stripping the ast. This means the rustc parse + expand + check stages for big crates are much quicker (e.g. sub second for rustc_typeck). In the demo I gave it runs from scratch on each query.
  • Intellij seems to take multiple seconds doing find-references style big searches - they are not always subsecond matches, implying to me that maybe they are not fully indexed. Maybe a combination of partial index and search strikes a balance.
  • Nick says Visual Studio takes a long time indexing the Firefox source when you start up. Anecdotally this suggests to me a more thorough index than IntelliJ builds, or maybe it is just that C++ is more difficult to analyse than Java. (As Nick understands it, in C++ type checking gets done after monomorphisation, so you have to have the types at the call site to check the body of generic functions.)
  • In a former life I wrote a small refactoring tool for python called bicycle-repair-man. It performed find-references by doing a sort-of text search and some other heuristics to narrow down the scope of possible targets, and then performed a more expensive find-definition on each one to see if it was actually a hit. Some similar narrow-then-check might be workable for this sort of query in rust.

I came away thinking that the big outstanding question impacting the architecture of the tools solution will be how much cached/persisted indexing is required between calls for each usecase/search.

  • At one end of the scale: If we can perform an entire query from scratch each time then we don't need a persisted index. 'oracle' style functionality could be stateless
    • architecturally simplest
    • could have multiple oracles with little memory cost, which would suggest a library design
    • new searches could be added incrementally as we need them (since different plugins can use different versions of the query library)
  • At the other end of the scale: We require a detailed cached index of everything to meet our performance targets
    • must deal with cache invalidation (possibly fine-grained cache invalidation)
    • detailed index has a large data overhead, so suggests a server design like a database server
    • centralised, functionality tightly coupled to the database
      • either controlled through an api like the proposed oracle one
        • (in which case we need to anticipate usecases and provide appropriate api for them)
      • or index format must be tightly specified and evolve carefully

I don't think we can meet performance targets without some indexing persisted between queries (though I would love this not to have to be the case!). There are however tangible benefits to minimising the level of indexing especially with regard to cache-invalidation and re-indexing complexity.

I think the most useful next stage for me personally is to do some more prototyping to see what is possible and get some performance numbers on various types of query / analysis. As a start I'll post some numbers from my type-inference prototype soon.

@phildawes

I posted some timings from my type inference experiment and a short write-up here

@phildawes

@arielb1
FWIW I agree with you that we need to explore this space+precomputation vs time tradeoff, and also get a better feel for the requirements before we dive into indexing.
It might be that we're happy to take 3 seconds for a global find-references, and it might be that a linear scan + targeted rustc resolves can meet that. Unfortunately we don't have much data at the moment.

It's probably worth enumerating some usecases so we can think about them and try and get some data around them. For me, the most useful usecases are:

  1. completions/suggestions as you type
  2. go-to-definition (I think is just a subset of completions. Also get-docs is a similar query)
  3. find-references (rename refactoring is a similar type of query)
  4. extract fn/method

The last one is interesting because it requires lifetime/region information to do accurately. It might be that a half-assed version of this is ok to start. Personally I find manually extracting functions in rust a real chore, so any automated help would be welcome!

@arielb1
Contributor
arielb1 commented Oct 21, 2015

@phildawes

extract-method-style refactoring and code completion require detailed inference of the current version of the relevant method, which is hopeless when using a database and very fast when using rustc. find-references could use a database (at least for the find part), but rename-refactor might want to use a compiler for the "actually refactor" part to ensure it gets things correct.

@arielb1
Contributor
arielb1 commented Oct 21, 2015

@phildawes

rustc basically does "generate suggestions" every time it type-checks a method, and type-checking a method can be done in <100ms as you have measured. We probably want a rustc with metadata ready and caches hot anyway.

@arielb1
Contributor
arielb1 commented Oct 21, 2015

@phildawes

rustc also basically supports loading code at runtime - see what inline does.

@juarezr
juarezr commented Oct 21, 2015

@eddyb

I wasn't comparing the features or simplicity of the Rust, C# or Java languages.
I just posted the link as a reference for research for this RFC.
Microsoft has recently rebuilt from scratch the .NET compiler used for code analysis.
As this is a very similar kind of compiler, I think this is an excellent opportunity to extract useful knowledge from that project's documentation that could help with planning the design of this feature.
From what I have read in this discussion, the exact architecture is starting to take shape. It also appears to be an enormous effort.
I hope this helps to improve ideas and avoid future rework.

@daniel-vainsencher

Database indeed evokes an image of something slow, out of date and partial.
Seems to me the oracle should be an in memory cache of all relevant rustc
computed datastructures, sharing rustc code to preserve the exact semantics.

And then the real question is whether it is fast enough to call rustc on
all relevant code to extract information just-in-time instead.

Seems like a hard case of interest is wanting to modify a module in rustc,
and asking for all of its callers. How long does it take rustc to process
rustc (+ deps) with no code generation?
On Oct 19, 2015 2:05 PM, "arielb1" notifications@github.com wrote:

@nrc https://github.com/nrc

Some things, like a simple reverse definition lookup, are basically
database features (but I am not sure that a simple linear scan won't be
good enough given the performance requirements - this is a
not-particularly-obvious space+precomputation vs. time tradeoff). However,
things that involve type inference or trait matching, while not really
fitting anything resembling a relational (or relationless) database, can be
very easily done by something like rustc. Pre-computing these once and
checking them is non-trivial, as methods can come from where-clauses and
there are infinitely many (generic) types.



@eddyb
Member
eddyb commented Oct 22, 2015

About Visual Studio taking a long time: something similar happens in KDevelop (tried using it for LLVM development).
The issue there is C++'s ridiculous compilation model: in C++, you can't even parse an AST without doing full type-checking.
What takes 5 seconds for rustc working on the entirety of librustc (from parsing to name resolution, before type-checking) would take minutes for an equivalent C++ codebase.
C++1z modules could help with the include explosion problem, but not the rest.

About @nrc wanting to use PostgreSQL: I really can't see how a general database can outperform rustc's dedicated data structures, in both space and time complexity (although space could be reduced by compression).

Traversing the entire AST has a baseline cost of ~40ms on a large crate (librustc), but that can be lowered with a flat array approach, with no indirections (our AST isn't really cache-friendly).
Using a specialized cross-reference index on top of rustc would be a couple orders of magnitude faster than just the database transport.
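
A minimal sketch of the flat-array idea, purely illustrative and not rustc's actual data structures: nodes live contiguously in a Vec and refer to each other by index, so whole-tree traversals become linear scans over cache-friendly memory.

```rust
// Illustrative only; field and type names are hypothetical.
#[derive(Clone, Copy)]
struct NodeId(u32);

enum NodeKind {
    Item,
    Expr,
    Pat,
    // ...
}

struct Node {
    kind: NodeKind,
    parent: Option<NodeId>,
    // Children are reached by index rather than by owned pointers.
    first_child: Option<NodeId>,
    next_sibling: Option<NodeId>,
}

struct FlatAst {
    nodes: Vec<Node>,
}

impl FlatAst {
    // Visiting every node is a single pass over contiguous memory.
    fn count_matching(&self, pred: impl Fn(&Node) -> bool) -> usize {
        self.nodes.iter().filter(|n| pred(n)).count()
    }
}
```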

And at the end of the day, isn't rustc with incremental recompilation a huge specialized database? Why add pointless layers on top?

@ArtemGr
ArtemGr commented Oct 22, 2015

Database indeed evokes an image of something slow, out of date and partial. Seems to me the oracle should be an in memory cache of all relevant rustc computed datastructures, sharing rustc code to preserve the exact semantics.

In-memory structures are faster until they're not. There's a memory budget and it isn't as large as one might think. Windows tends to swap out the largest application when it needs memory. IDEs already tend to be large. Suppose you have an IDE open and want to switch to a different activity for a while - Windows will provision for more memory and will likely swap out the IDE. That's a problem Firefox has, making it less resilient against swap-outs compared to Chrome.

Also, startup speed is an important user experience factor. The user will notice if the IDE rescans all those dependencies after restart. It might not seem like a big deal, but it surely gave Java its slowish reputation.

Still, the oracle might very well be designed and implemented first as an in-memory cache, because it might be simpler to do it that way. The on-disk persistence might be implemented later, as a fall-back option for when the project is rather large.

I really can't see how a general database can outperform rustc's dedicated data structures

Writing a new dedicated data structure isn't necessarily the best example of code reuse. It makes rustc more complex, increasing the maintenance burden. A carefully picked database solution might offer better abstractions and isolate the IDE from rustc implementation details.

PostgreSQL isn't the best tool for the job, though. It will be hard to embed in an IDE.

@phildawes

Hi @arielb1

rustc also basically supports loading code at runtime - see what inline does.

I guess the difference is that inlined HIR has already been analysed when the inline happens (in the trans stage). N.B. I'm not very familiar with the trans stage so please correct me if I'm wrong!

Newly loaded code would need to go through the other stages using the existing side tables. I guess this is the logical next thing to attempt.

@phildawes

(BTW, I'm experimenting with code patching because it is needed to support typeinference for autocomplete. I'm not convinced that we should be doing the subsequent method/field lookups etc.. using the compiler tables + ast; Surfacing an api to allow this sounds like a more robust solution to me at this stage.)

@arielb1
Contributor
arielb1 commented Oct 23, 2015

@phildawes

First, inlining is currently only used for post-typeck code, but there is nothing really stopping it from being used for pre-typeck code (except we will probably need a GC, but maybe we could just discard the rustc process when it gets bad enough - most of the bloat will involve caches anyway).

The correct way to access the compiler tables is via the rustc API. We certainly should have some stable wrapper around it, but it will probably need to be developed in sync with the wrappers.

@Valloric

Hi, I'm the author of YouCompleteMe (AKA YCM) and ycmd; I have several years of experience dealing with semantic engines that support code-completion and code-comprehension (GoTo, type/doc info etc) features.

I've read through the RFC, and here are some observations based on that plus the experience I've gathered integrating libclang, OmniSharp, Jedi, gocode etc. I've interacted with people who work on all of those projects.

I've also written some thoughts of how semantic engines should be structured before.

  1. The suggested approach with an oracle process that keeps AST information in memory is a good one. This is not only "good" for performance, it is critical for it.
  2. The oracle should be an HTTP+JSON server. Even with slow Python implementations, this communication mechanism is more than fast enough (that link has benchmarks) for a localhost server. Seriously, stop looking at Thrift, Cap'nProto etc. HTTP+JSON is incredibly well understood, trivially easy to integrate support for and works flawlessly on every platform. Those of you concerned about performance: ycmd has been using this approach for two years. The serialization & communication overhead on a localhost server is utterly negligible (see benchmarks above). We know for a fact that this is beyond fast. Hundreds of thousands of YCM users use this setup every day. Fight that urge to over-engineer and prematurely optimize. :)
  3. Please have the oracle interface use UTF-8. Just because Visual Studio has to live with the legacy decisions of the Windows API doesn't mean the rest of us should.
  4. Please have the protocol support filesystem file state override. In other words, don't just ask for a filename + line & column number in your code-completion (and GoTo etc) interface; you must also support receiving a list of (filename -> file contents) pairs in the same call that represent the in-memory state of the user's code in case they haven't saved their changes to disk. For each such pair, the state of the file on disk is "overridden." Clients having to create temp files just for the semantic engine breaks all sorts of things and is slow. (A rough sketch of such a request follows this list.)
  5. Please don't have notions of "projects," just files. The client apps that will be built on top of your oracle will handle that. The only thing your semantic engine needs to support is "in this file, line & column number location, what are the possible code-completion candidates" or info about the identifier at that location etc. That's it, you don't really need more than that. Certainly not for a v1 (and I'd say never). All the semantic engines ycmd talks to work like this and it all works great (except OmniSharp which must support the "solution" project files from msbuild, and that's a pain). You can provide "cursor" objects, AST walking and other abstractions later. They're not needed for 99% of the functionality clients want. Start small.
  6. Please don't ask clients to provide "invalidation spans" or anything of the sort. It's wildly unnecessary. Libclang does just fine without anything like it and it has to build (and update) in-memory ASTs for freaking C++ which is harder to parse than Rust. AFAIK it does so by computing an AST cache of all the files that are included by a particular file A and then when another parse request for that file A comes in, it checks did any of the included files change. If they didn't, that info can just be re-used. File A itself is always fully reparsed. This is still plenty fast; perf problems with libclang come from having to compile transitive dependencies because of the C++ include model (which Rust thankfully doesn't have).
  7. Please have your API use 1-based line & column numbers for location points. A point in a file is (absolute filename, line, column). A range is two of those tuples. That's it. You don't need anything more than that, I guarantee it. A column is a byte index, not a char one. You want 1-based line & column numbers because that's what users see in editors. You're going to provide diagnostic info so don't force every single client to map your 0-based numbers to 1-based numbers for user display and editor APIs for jump points (almost all of which use 1-based numbers). Take my word for it; I've made this mistake before and fixing it took a while.
  8. Please talk to potential clients before stabilizing an API. Don't start from "does this API work for you?", start with "what are your use-cases?" and clients will suggest the simplest API that works for them. Write an implementation, have us integrate it and provide feedback. Then iterate.
  9. Please make your API thread-safe. If you take any piece of advice from what I've written here, please let it be this one. Don't make requests serialize on every call to your API.
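
To make points 4 and 7 concrete, here is a rough sketch of what a completion request and response might carry. All names and fields are hypothetical, not a proposed API; in practice they would be serialized as JSON.

```rust
use std::collections::HashMap;
use std::path::PathBuf;

/// 1-based line, 1-based byte column - what users see in their editors.
struct Position {
    line: u32,
    column: u32,
}

struct CompletionRequest {
    /// Absolute path of the file the cursor is in.
    file: PathBuf,
    position: Position,
    /// In-memory contents of any dirty buffers, overriding the on-disk state,
    /// so clients never have to write temp files.
    overridden_files: HashMap<PathBuf, String>,
}

struct Completion {
    text: String,
    /// e.g. a type signature or doc summary to show next to the candidate.
    detail: String,
}

struct CompletionResponse {
    completions: Vec<Completion>,
}
```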

I'm happy to answer any questions you might have.

@arielb1
Contributor
arielb1 commented Oct 25, 2015

Please make your API thread-safe. If you take any piece of advice from what I've written here, please let it be this one. Don't make requests serialize on every call to your API.

We don't like threads in Rust. Are you talking about parallel recompile+autocomplete, or something else?

@dgrunwald
Contributor

IDEs are multi-threaded. There could be one thread running lints in the background; another thread running semantic syntax highlighting, and a third thread performing code completion. And the file contents can be mutated at any time by the IDE main thread.

If you force IDEs to serialize access to the oracle, you force low-latency operations like code completion to wait for more expensive operations like lints. That makes the IDE feel unresponsive.

Moreover, once the IDE gets into non-trivial operations like refactorings, it's damn near impossible to keep the thing free of deadlocks. We had that issue with SharpDevelop, and ended up rewriting the whole C# semantic analysis engine (NRefactory) to use immutable data structures to get rid of the locking complexity.

@Valloric

@dgrunwald provided a perfect explanation on why thread-safety is a must.

I didn't know NRefactory used immutable data-structures internally. That's awesome and probably the reason why we don't have problems talking to OmniSharp (which uses NRefactory); if only libclang did that... in YCM we get exactly the scenario that was mentioned, where code-completion calls that are supposed to be super-fast get serialized behind an async full-reparse that we asked the engine to perform N seconds previously (it takes a long time for C++ files that have large transitive includes) because libclang is completely thread-hostile.

@arielb1
Contributor
arielb1 commented Oct 25, 2015

@dgrunwald

This is annoying because rustc fundamentally relies on caching to work with acceptable performance. I guess we could have multiple "worker" type contexts. I am not sure how much of a win sharing data between workers will be (a tcx has 200MB of shareable-looking HIR+resolution and 50MB of non-shareable stuff - we should try to use some kind of GC for HIR and just have multiple tcx-s). That will also improve our garbage collection story, because we can kill workers when they get too fat.

@dgrunwald
Contributor

Memory could well be an issue, since the oracle might have hundreds of crates loaded simultaneously.
My recommendation is to store only the item signatures; and use a full AST/HIR only for the current module of interest. This also allows skipping function bodies in the parser (in NRefactory 4, the parser would tell the lexer to skip the current block without fully tokenizing it).
And a fast parser could remove the need to serialize any of the oracle state to disk.

On "Find References": SharpDevelop uses a full linear scan to do this. First we use a "grep -r"-like operation to find all possible files that might reference the target item. Then those files are parsed (remember, we didn't store full syntax trees, only the signatures), and the functions containing a possible match are analyzed.
This makes "Find References" extremely fast in the common case where "grep" only returns a few dozen files.
For example, on the SharpDevelop code base (1.1 million lines of C# code), finding references to the SyntaxTree class takes <3 seconds, returning 227 occurrences in 84 files.
Of course, common identifiers like "Value" or "Text" take a while longer, and especially operator overloads are slow as almost all code has to be analyzed. But these operations are rare enough that the performance is acceptable.
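
A minimal sketch of that narrow-then-check strategy, with the expensive resolution step abstracted behind a caller-supplied check (everything here is illustrative, not an existing SharpDevelop, Racer or rustc API):

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// A candidate occurrence of `name`, found by a cheap textual scan.
struct Candidate {
    file: PathBuf,
    byte_offset: usize,
}

/// Phase 1: a "grep -r"-like scan to narrow the search down to a few files/offsets.
fn narrow(files: &[PathBuf], name: &str) -> Vec<Candidate> {
    let mut out = Vec::new();
    for file in files {
        if let Ok(src) = fs::read_to_string(file) {
            let mut start = 0;
            while let Some(pos) = src[start..].find(name) {
                out.push(Candidate { file: file.clone(), byte_offset: start + pos });
                start += pos + name.len();
            }
        }
    }
    out
}

/// Phase 2: run the expensive, precise analysis only on the candidates.
/// `resolves_to_target` stands in for real name/type resolution.
fn find_references(
    files: &[PathBuf],
    name: &str,
    resolves_to_target: impl Fn(&Path, usize) -> bool,
) -> Vec<Candidate> {
    narrow(files, name)
        .into_iter()
        .filter(|c| resolves_to_target(&c.file, c.byte_offset))
        .collect()
}
```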

My tips:

  • Keep the amount of persistent state as small as possible to avoid problems with invalidation. (persistent state = everything that gets reused after code changes).
  • Don't store anything to disk. It's likely the oracle can be fast enough without doing this; and unnecessary complexity creates bugs. "Have you tried deleting the .ncb file?" (I remember having to do this a couple times per day when using VS, ca. 2005)
  • Use lazy evaluation. The IDE is only interested in very specific bits of information, almost always restricted to a couple of lines around the cursor. Avoid calculating stuff that might never get used before it gets invalidated by the next code change.
  • I think it's important to be lazy across crate boundaries. Let's take servo as example for a big dependency graph. After I've changed the signature of a public function in crate geom (e.g. loosening generic bounds), I want to be able to use it immediately in crate servo (=see the new signature in code completion), without having to wait for every crate in between to fully typecheck.
  • At least for C#, laziness saves so much time that incremental compilation is unnecessary for IDE purposes (of course, it's still interesting when actually compiling).

@arielb1
Contributor
arielb1 commented Oct 25, 2015

@dgrunwald

rustc already stores most of the item signatures in easy-to-access (~70ms-ish on my slow laptop) and well-indexed metadata (on disc) - we may just store them all and kick a tcx for a function body only on demand (unfortunately, rust's lack of an ABI creates basically the same problem as C++ includes, but we are planning on getting around it with incremental compilation).

@arielb1
Contributor
arielb1 commented Oct 25, 2015

Actually, not really - because of how impls work, displaying method autocomplete requires loading all traits+impls in dependent crates (or at least everything that has wide-ranging blanket impls), which takes about ~200ms, along with associated invalidation concerns. Maybe do that when you click on a method, and cache the param-env + loaded metadata?

@phildawes

@arielb1

We don't like threads in Rust...

Do you mean in rustc? There's lots of multithreaded software written in rust.

@phildawes

@valloric, @dgrunwald, thanks for the info + advice, this is really good stuff. The point about not serializing requests is especially important as we weigh up using rustc for searches.

Out of interest, what do you consider to be acceptable turnaround times for the various types of queries?
e.g.

  • code completion
  • find-references / rename
  • lint a file's worth of code

@dgrunwald
Contributor

Code completion: 250ms is acceptable; but if you want the IDE to feel fast, you should aim to beat 100ms.

Go to definition / get item information: 500ms should be acceptable.

Find references: Depends on the scope in which references are searched.
An editor might want to highlight all references to the item under the caret in the current file. This should be quick to react when the caret is moved (<250ms?), but may take longer (1-2s?) if the file was just edited (this feature is interesting when reading code, not so much while writing).

Across a whole project, find references/rename isn't really performance-critical as it's triggered manually by the user. <2s is optimal, but users will be OK with it taking longer, especially in large projects.

Linting the current file: 1s.
To avoid burning the CPU, you'll probably want to start linting only when the user stops typing for a moment (500ms idle time?), or immediately after typing a semicolon.

@arielb1
Contributor
arielb1 commented Oct 26, 2015

@phildawes

I meant we don't really like the "multiple threads all mutating the same data structures" programming model, but it looks like using multiple type contexts will work fine (maybe even use that on regular compilations for parallelism?)

The annoying part is that unless we get incremental type collection we would need to re-collect and re-wf-check dependent crates, and that could take a few seconds on a slow computer and large codebase. Because of how impls work, you will probably need to do that every time you poke the signature of a struct.

We also badly need some sort of incremental expansion+resolution, but that sounds easier to me (also, not my area).

@dgrunwald

We can totally type-check a non-huge function in a few tens of ms, as long as we do it on a warm cache. We just need to manage these.

@Ericson2314
Contributor

@arielb1 I thought that all of this is basically contingent on the incremental compilation work. Also I don't think many threads sharing a cache is a problem. We ought to only need to cache deterministic functions, which opens up a bunch of opportunities for increasing performance (e.g. avoiding ensuring that queries reflect the latest state of the cache).

@arielb1
Contributor
arielb1 commented Oct 27, 2015

@Ericson2314

Even if we get incremental compilation, I am not sure we will get incremental type collection. The problem with multiple threads sharing a cache is that they will need to lock that cache, and that tends to cause problems.

@bruno-medeiros

Whoa, I've arrived late to this thread and missed a lot of discussion. I've got to say, it's a bit hard to follow, since sometimes it seems we use terminology which is unclear - at least it has been to me in a few cases. The devil is in the details, and even a slight difference in the concepts that different people have in mind can make a big difference in conducting an effective discussion.

That said, let me pick up from what @Valloric said, since I pretty much agree with nearly everything he said in a previous post. I'll add my thoughts to his comments:

I've also written some thoughts of how semantic engines should be structured before.

The suggested approach with an oracle process that keeps AST information in memory is a good one. This is not only "good" for performance, it is critical for it.

Indeed. As far as I see things, when considering only the basic Oracle functionality like quick-check, find-definition (and maybe even code-completion), then the Oracle is little more than a long-running rustc, with AST and other semantic data structures persisted in memory across multiple requests. There really isn't much more to the Oracle, other than performance aspects (aspects which will probably require changes in rustc itself), and the IPC API.

Only for more complex functionality like find-references does the Oracle start to require more specialized data structures of its own (like a database or an index).

The oracle should be an HTTP+JSON server. Even with slow Python implementations, this communication mechanism is more than fast enough (that link has benchmarks) for a localhost server. Seriously, stop looking at Thrift, Cap'nProto etc. HTTP+JSON is incredibly well understood, trivially easy to integrate support for and works flawlessly on every platform. Those of you concerned about performance: ycmd has been using this approach for two years. The serialization & communication overhead on a localhost server is utterly negligible (see benchmarks above). We know for a fact that this is beyond fast. Hundreds of thousands of YCM users use this setup every day. Fight that urge to over-engineer and prematurely optimize. :)

Indeed. If we look at it, in nearly every Oracle operation, the amount of data that is processed by the Oracle process (including rustc) is orders of magnitude larger than the data sent by IPC to the IDE. Therefore worrying about optimization in the IPC layer is premature.

Please have the oracle interface use UTF-8. Just because Visual Studio has to live with the legacy decisions of the Windows API doesn't mean the rest of us should.

Please have the protocol support filesystem file state override. In other words, don't just ask for a filename + line & column number in your code-completion (and GoTo etc) interface; you must also support receiving a list of (filename -> file contents) pairs in the same call that represent the in-memory state of the user's code in case they haven't saved their changes to disk. For each such pair, the state of the file on disk is "overridden." Clients having to create temp files just for the semantic engine breaks all sorts of things and is slow.

I had the same design approach when I developed a semantic engine for the D language (for the DDT Eclipse IDE), and I implemented it as such, but later found out things would be much simpler if the IDE simply saved the file before a semantic operation was requested. I then tried that approach, to see if Eclipse would handle it properly - unfortunately it doesn't handle this paradigm entirely well. This is unlike IntelliJ, which seems to handle it perfectly - there is no longer a UX concept of saving files; it is all done transparently in IntelliJ (local history and version control subsume the need for explicitly saving).

I believe the IntelliJ way is the future, but not all IDEs and editors support that out-of-the-box, nor can they easily be modified to support it. Therefore I think the Oracle should support file state override, but this should be a late-stage feature, not a high priority one.

Please don't have notions of "projects," just files. The client apps that will be built on top of your oracle will handle that. The only thing your semantic engine needs to support is "in this file, line & column number location, what are the possible code-completion candidates" or info about the identifier at that location etc. That's it, you don't really need more than that. Certainly not for a v1 (and I'd say never). All the semantic engines ycmd talks to work like this and it all works great (except OmniSharp which must support the "solution" project files from msbuild, and that's a pain). You can provide "cursor" objects, AST walking and other abstractions later. They're not needed for 99% of the functionality clients want. Start small.

I also didn't quite understand what an Oracle "project" would be for, as mentioned in the RFC. The fact that we have Cargo crates seems to make "projects" redundant: a crate is all the info the Oracle needs. Only for a reverse-search kind of functionality, like find-references, would there be a need to specify a scope of projects/crates in which to search for references. But this scope would be a simple data structure/format; there would be no need for lifecycle init/delete management as mentioned in the RFC.

Please don't ask clients to provide "invalidation spans" or anything of the sort. It's wildly unnecessary. Libclang does just fine without anything like it and it has to build (and update) in-memory ASTs for freaking C++ which is harder to parse than Rust. AFAIK it does so by computing an AST cache of all the files that are included by a particular file A and then when another parse request for that file A comes in, it checks did any of the included files change. If they didn't, that info can just be re-used. File A itself is always fully reparsed. This is still plenty fast; perf problems with libclang come from having to compile transitive dependencies because of the C++ include model (which Rust thankfully doesn't have).

Yeah, invalidation spans don't seem like a good idea to me either. Their only purpose would be to implement incremental parsing in the Oracle, which would be a very complex Oracle/rustc feature (with little benefit). And it could also be a bit complex for IDEs to provide this information (depending on the IDE). Best to have the Oracle itself detect if a file's source has changed, and just reparse the whole file, even if only a few bits of source have changed.

Please have your API use 1-based line & column numbers for location points. A point in a file is (absolute filename, line, column). A range is two of those tuples. That's it. You don't need anything more than that, I guarantee it. A column is a byte index, not a char one. You want 1-based line & column numbers because that's what users see in editors. You're going to provide diagnostic info so don't force every single client to map your 0-based numbers to 1-based numbers for user display and editor APIs for jump points (almost all of which use 1-based numbers). Take my word for it; I've made this mistake before and fixing it took a while.

On this point I disagree. Not all clients use the 1-based line+column format to represent a span. In Eclipse in particular, it's a 0-based character offset (from file start, no lines or columns). The Oracle should just provide and accept both formats, since it's trivial for the Oracle to convert between them anyway.

(As a general principle, the Oracle should make the IDEs work as simple as possible - because there is only one Rust Oracle, but many Rust IDEs/editors)
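
The conversion really is trivial; a minimal sketch, assuming 1-based lines and byte-index columns as discussed above:

```rust
/// Convert a 0-based byte offset into a 1-based (line, column) pair,
/// where the column is a byte index within the line.
fn offset_to_line_col(src: &str, offset: usize) -> (usize, usize) {
    let mut line = 1;
    let mut line_start = 0;
    for (i, b) in src.bytes().enumerate() {
        if i >= offset {
            break;
        }
        if b == b'\n' {
            line += 1;
            line_start = i + 1;
        }
    }
    (line, offset - line_start + 1)
}

/// The reverse direction: a 1-based (line, column) back to a 0-based byte offset.
fn line_col_to_offset(src: &str, line: usize, column: usize) -> usize {
    let mut current_line = 1;
    let mut line_start = 0;
    if line > 1 {
        for (i, b) in src.bytes().enumerate() {
            if b == b'\n' {
                current_line += 1;
                line_start = i + 1;
                if current_line == line {
                    break;
                }
            }
        }
    }
    line_start + (column - 1)
}
```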

Please talk to potential clients before stabilizing an API. Don't start from "does this API work for you?", start with "what are your use-cases?" and clients will suggest the simplest API that works for them. Write an implementation, have us integrate it and provide feedback. Then iterate.

Agreed. Especially on the iterate bit. And short iterations please. There are a lot of simple and basic features that are not implemented yet, but would already be of use to IDEs if they were implemented (such as persistent/consistent warnings for cargo build).

Please make your API thread-safe. If you take any piece of advice from what I've written here, please let it be this one. Don't make requests serialize on every call to your API.

Agreed.

@alexcrichton alexcrichton referenced this pull request in rust-lang/cargo Oct 29, 2015
Open

Persistent warning messages when building #2089

@bruno-medeiros

@phildawes I was taking a better look at Racer, and I was wondering what it uses to parse Rust code. I see it comes from the syntex_syntax crate. Is this just a port of the libsyntax module from rustc, or does it do more stuff? And if so, does Racer use this extra functionality from syntex_syntax?

@phildawes

Hi @bruno-medeiros. syntex_syntax is just a port of libsyntax so that it can be used with rust stable.

@bruno-medeiros

Hi @bruno-medeiros. syntex_syntax is just a port of libsyntax so that it can be used with rust stable.

And libsyntax is just the Rust parser (and AST) used by rustc? Or more than that?

Who currently maintains it - anyone from the Rust team, or external developers?

@arielb1
Contributor
arielb1 commented Nov 3, 2015

@bruno-medeiros

Yes, libsyntax is rustc's parser.

@DemiMarie

Please make your API thread-safe. If you take any piece of advice from what I've written here, please let it be this one. Don't make requests serialize on every call to your API.

In particular, this means that different requests must be able to run concurrently. Ideally, they would run in parallel on multicore systems.

@eddyb
Member
eddyb commented Nov 8, 2015

@drbo If the requests are for different crates, with rustc integration that would mean they would go to different compiler session threads, and the compiler session threads would run in parallel (except when dependencies between them need propagating).

@phildawes

I suspect @drbo means for the same crate too - e.g. the lint delaying completion example that @dgrunwald gave.

@phildawes

Hi @valloric

I have a question about 5. from your post above:

  1. Please don't have notions of "projects," just files. The client apps that will be built on top of your oracle will handle that. The only thing your semantic engine needs to support is "in this file, line & column number location, what are the possible code-completion candidates" or info about the identifier at that location etc. That's it, you don't really need more than that. Certainly not for a v1 (and I'd say never).

How do YCM and libclang deal with multiple build targets using the same file? Racer does the same thing, just (file,line,column), but we've run into a few situations where people are building for multiple cargo binaries and racer is unable to find the completion they're expecting because it is assuming the wrong root.
(in racer currently you have to fudge it by reordering the binaries in your Cargo.toml).

Thanks!

@bruno-medeiros

How do YCM and libclang deal with multiple build targets using the same file? Racer does the same thing, just (file,line,column), but we've run into a few situations where people are building for multiple cargo binaries and racer is unable to find the completion they're expecting because it is assuming the wrong root.
(in racer currently you have to fudge it by reordering the binaries in your Cargo.toml).

Another newbie question from me, but I don't understand why this would be a problem for Racer... Why does the root matter? (I'm assuming root means the Rust source file that contains the main function of the binary.) The file where completion is invoked should be all that matters to Racer, no?

@phildawes

Hi @bruno-medeiros. Mainly because the root source file contains the 'extern crate' and mod directives.
The root also contains the feature flags, config attributes etc., although currently racer is not sophisticated enough for these to be a problem yet (they would be a problem for rustc though).

@Valloric

@phildawes Hm, seems like an issue specific to Rust. None of the other languages ycmd supports have a notion of a root file. In any case, racer shouldn't have to make a decision on which root file (out of several) to use; the client will provide it. Provide an API that will list possible root files for a given file; also, completion calls should take an additional root file argument (you could make it optional if only one root is possible and mandatory if there's ambiguity).

If you can't determine which root files are possible, just always require one to be provided in calls. The clients will ask the user to list the root file in a configuration file/runtime option/whatever.

Keep in mind that users will not be querying racer directly, they'll be using a client. From the perspective of the user, they're configuring the client options and know nothing about the existence of the backend (nor should they, it's an implementation detail).
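
One possible shape for that, sketched loosely (hypothetical names, not racer's or the RLS's real interface): an endpoint that enumerates candidate crate roots for a file, plus an optional root argument on completion calls.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical interface; a real implementation would derive candidate roots
/// from Cargo metadata (lib/bin/test targets) or from user configuration.
trait SemanticEngine {
    /// All crate roots (lib.rs, main.rs, bin targets, ...) that can reach `file`.
    fn possible_roots(&self, file: &Path) -> Vec<PathBuf>;

    /// `root` may be omitted when only one root is possible, and is
    /// mandatory when there is ambiguity.
    fn complete(
        &self,
        file: &Path,
        line: u32,
        column: u32,
        root: Option<&Path>,
    ) -> Vec<String>;
}
```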

@erkinalp

but i expect calling it "Oracle" will lead to confusion and legal trouble

It is a trademark, and trademarks have a sector of use defined in their registration.

@bruno-medeiros

Hi @bruno-medeiros. Mainly because the root source file contains the 'extern crate' and mod directives.
The root also contains the feature flags, config attributes etc., although currently racer is not sophisticated enough for these to be a problem yet (they would be a problem for rustc though).

Whoa, I just read up on that and gave it a try, and I see what you mean now about 'extern crate' and mod directives... I wasn't expecting this; this seems to be a very bad design choice for Rust. A very alien one at best. It basically means you can't perform semantic analysis on a source file "A" by looking only at source file A, and whatever source file A "includes". Rather, the semantic meaning of source file A can change depending on who imports A.

In other words, an arbitrary Rust source file cannot be considered a compilation unit (unlike Java, C#, D, etc.); rather, only the root files are... Has this been discussed in the Rust community at large? Seems like a misguided choice. 😧

@bruno-medeiros

but i expect calling it "Oracle" will lead to confusion and legal trouble

It is a trademark, and trademarks have a sector of use defined in their registration.

Yeah, according to one of the Go oracle guys, even the Go oracle will change its name (to "guru" or something). We can probably leave it to whoever actually builds the Rust oracle first to choose its name...

@Ericson2314
Contributor

@bruno-medeiros There are better ways to do incremental compilation than per-file, see https://github.com/rust-lang/rfcs/blob/master/text/1298-incremental-compilation.md. Rust modules within a crate contain many things that would have to go in headers anyways, so the comparison with C and C++ isn't quite fair. Javac doesn't do very interesting things at compile time so that also isn't fair.

@cmr
Contributor
cmr commented Nov 25, 2015

@bruno-medeiros you're right, the compilation unit isn't the source file. It's the crate. This is intentional, and is an old decision. It's largely worked out very nicely for us.

@erkinalp

@Ericson2314: The ISO C++ WG is now preparing a module system for C++, so they will have more complex module optimisation than us in C++17.

@Ericson2314
Contributor

Hmm? My point was just that file-based compilation units are lousy. If C++ gets a module system, that's great, but it will probably violate file-based compilation units.

@bruno-medeiros

@bruno-medeiros There are better ways to do incremental compilation than per-file, see https://github.com/rust-lang/rfcs/blob/master/text/1298-incremental-compilation.md. Rust modules within a crate contain many things that would have to go in headers anyways, so the comparison with C and C++ isn't quite fair. Javac doesn't do very interesting things at compile time so that also isn't fair.

Huh? I didn't compare with C/C++, I only mentioned Java, C#, and D. D can also do a lot of compile-time stuff, so I doubt that's any excuse to say you can't have a file-based compilation unit.

And yeah, "There are better ways to do incremental compilation than per-file", but having a file-based compilation unit doesn't preclude nor makes it any harder for tools to do finer-grained incremental compilation.

However, the other way around - not having file-based compilation units - makes it impossible (or very hard) to perform file-based incremental compilation/analysis, something which is otherwise fairly easy to implement: one just needs a file-based dependency graph (which can actually be implemented independently of the language).

This way will force a tool such as Racer to use rustc (or reimplement a very complex dependency system itself) if it wants to have incremental analysis capabilities. Maybe that won't be a problem since the current plan is for Racer to use rustc anyways, but we're not sure yet how well that is going to work out.

@bruno-medeiros

@bruno-medeiros you're right, the compilation unit isn't the source file. It's the crate. This is intentional, and is an old decision. It's largely worked out very nicely for us.

Why has it worked nicely? Is that rationale or discussion available somewhere? I'm still not convinced. I tried to look it up; so far I've only found this: https://gist.github.com/DanielKeep/470f4e114d28cd0c8d43 (but that document only explains how Rust modules work, not why they work that way)

@arielb1
Contributor
arielb1 commented Nov 26, 2015

@bruno-medeiros

Rust files can (and do) have circular dependencies, which makes file-based incremental analysis problematic. Java/C# also have this problem, and their analysis is also not really file-based.

@arielb1
Contributor
arielb1 commented Nov 26, 2015

@bruno-medeiros

The crate-is-compilation-unit decision was made long ago (I think by @graydon himself). It seems to work very nicely in practice despite being somewhat confusing to newcomers - it basically removes all kinds of annoying build-system issues.

Although there may be alternatives we have not considered, if you don't have crate-is-compilation-unit you either need some way to handle inter-module circular dependencies or to have manually-written header files.

@DanielKeep's description is not entirely accurate. Modules are not meant to be hermetically-sealed little boxes (as crates are) - they just control namespacing.

At least how I conceptualize it, each Rust crate is made of items that are exposed through a module tree, in the same way a Unix-style filesystem is made of inodes that are exposed through directory entries (this analogy does not match private use-s precisely, which basically only guide name resolution).

As Rust is a statically-compiled language, if dlopen is not used then every crate in the final program is accessible from the root crate, but not necessarily in a public and/or canonical way (for example, crate a can depend on b and c, both of which depend on d).

@Nemikolh

A very alien one at best. It basically means you can't perform semantic analysis on a source file "A" by looking only at source file A, and whatever source file A "includes". Rather, the semantic meaning of source file A can change depending on who imports A.

@bruno-medeiros Java has the same problem. If you don't have the class path, you can't find the correct library versions and provide correct completions. Looking only at a Java file for completion is surely not sufficient at all.

I think the mention of compilation units here is off-topic. The way Rust performs compilation, and whether or not a file is a compilation unit, is irrelevant to completion. The same applies to incremental compilation.

Going back to autocompletion in the case of multiple roots, it seems that the alternative to asking the user for a root file is to suggest only completions for the API shared between the roots. It doesn't seem trivial though.

@graydon
graydon commented Dec 4, 2015

Well, let's be clear and not mix issues. On an intra-crate basis, this is exactly what's planned when people talk about doing incremental compilation as in RFC 1298; if not in-memory then at least on-disk.

The conversation (that is very far afield) in the past couple comments here has been about whether crates ought to exist as a cyclicality-containment-boundary at all, or whether rustc should analyze potentially the entire world of referenced source code (i.e. all the crates you depend on from crates.io, and all of their dependencies, and so forth) transitively, and recompile subgraphs of that super-graph when things change.

I think I can confidently state that that's out of scope for the language as it currently exists; it was designed to support a cyclicality boundary on purpose, and it'd be a tremendous amount of work to change, not to mention that it would completely defeat any concept of independent versioning, binary distribution, etc.

@Ericson2314
Contributor

I still think that the idea of storing the entire representation in-memory, and updating it incrementally, might be a good one, if rustc can do this without a complete rewrite.

Well, let's be clear and not mix issues. On an intra crate basis, this is exactly what's planned, when people talk about doing incremental compilation as in RFC 1298; if not in-memory than at least on-disk.

The incremental compilation RFC also mentions this for multiple crates long term---useful when developing multiple crates simultaneously. As far as I can tell, nothing proposed so far here or elsewhere would hinder such a thing.

@arielb1
Contributor
arielb1 commented Dec 4, 2015

@Ericson2314

I feel that having incremental compilation work for multiple crates might even be easier than having it work for a single crate - it basically gives you a fixed ABI, as C has.

@bruno-medeiros

@graydon Hold on, what's this about suggesting that it would be better if crates could have cyclic dependencies? That certainly was not what I was on about. I think we are having some confusion here because we are misunderstanding the term "compilation-unit". "compilation-unit" has a specific meaning in C/C++, but perhaps it doesn't apply as clearly to other languages.

A compilation unit in Java is any .java source file that is part of the classpath. Each .java file translates to exactly one .class file (the compiled code). And you can ask the compiler to compile any .java file in the classpath, and that java file will always translate to the same compiled code, regardless of whether the java file was requested to be compiled directly on the compiler command line, or whether it is being compiled because it is included in some other compilation-unit being compiled.

An example, say you have: Foo.java => Other.java where "=>" means depends on. If you request a compile of Foo.java, it will compile both java files to their respective .class compiled code (Other.java will be compiled first, then Foo.java).
But instead, you could also request Other.java to be compiled by javac - producing a Other.class file - and then on a different javac invocation request Foo.java to be compiled. Then, Other.java would not be recompiled, because it has been compiled already. This second javac invocation would just use the existing Other.class file (assuming of course, Other.java hasn't been modified since). And note, this also works fine even if there are cyclic dependencies between .java files.

In Rust, the equivalent is not possible. In Rust, only the crate root files are a compilation unit. If there is a lib.rs file whose contents are mod other; //..., and then there is a other.rs file next to lib.rs, you can't just ask rustc to compile the other.rs file because it has no semantic meaning, it has no semantic existence on its own. (I mean you can ask rustc to compile other.rs, but it won't produce anything meaningful)
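
For concreteness, the Rust shape of that example - the second file only acquires its meaning through the mod declaration in the crate root:

```rust
// lib.rs (crate root): this declaration is what gives other.rs its place in the
// crate's module tree; compiling other.rs on its own has no meaning.
mod other;

pub fn foo() -> u32 {
    other::helper() + 1
}
```

```rust
// other.rs: on its own this file has no crate context; its module path,
// visibility and cfg attributes are all determined by whoever declares `mod other;`.
pub fn helper() -> u32 {
    41
}
```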

This was my original complaint, that it seems in Rust you can't have compilation-units that are finer-grained than a crate/pseudo-crate. In every other modern statically compiled language I know, you can have a source file as a compilation unit.

@ArtemGr
ArtemGr commented Dec 4, 2015

In every other modern statically compiled language I know, you can have a source file as a compilation unit.

So what if I #include-ed a file from C or C++? How is that different from Rust modules?

The C file I'm including might depend on something I defined before the #include, it is not at all a compilation unit. Having C files as separate compilation units is just a convention. You can do the same with Rust by not using mod.

P.S. An IDE that would treat a C/C++ file as a separate compilation unit is just broken; it will report errors in such a file because it tries to process it separately instead of going through the compilation paths defined in the build system.

P.S. By convention we call such files "header" files and no C/C++ IDE in its right mind will try to compile them separately. In Rust the mod relationships between source files are specified to a larger degree than the #include relationships in C/C++.

P.S. But then again, maybe C/C++ is not the modern statically compiled language you were referring to? Check out Fantom, they have a Pod and not a source file as a compilation unit.

@graydon
graydon commented Dec 4, 2015

Yeah, it's funny you mention this being something you "can't have" in Rust. That's not true: rustc very much can (and actually always does) run on a "single file". It just happens to interpret the file you pass as the root of a module-tree (reading the transitive closure of its mod statements from there). So what you want works when your project only has single-file crates, but starts to fail when you have crates that span nested modules.

I suspect (though am not certain) that you're upset that the module path of a file within a crate is determined by the sequence of references that lead to it, from the root of a crate. That is definitely intentional, and I'll admit it's a little unusual compared to many languages. But Rust's design favours hierarchical (and local) naming much more heavily than other languages, in several aspects: for example, each crate is considered its own complete module namespace as well.

In any case, the module-path-determined-by-references is not a construct that Rust alone is capable of expressing. If you think of Rust mod statements as a convention for combining a C++ nested namespace { ... } block with an #include directive, you won't be far off as far as what a compiler can and can't get away with (there's an additional wrinkle that all the modules in a given crate are further mangled to avoid inter-crate conflicts over reuse of the same module name in separate crates, but for the purposes of this discussion that's irrelevant).

@bruno-medeiros

@ArtemGr

PS: But then again, maybe C/C++ is not the modern statically compiled language you were referring to?

Exactly, it's most definitely not! 😱 Especially with regards to this discussion: C/C++ has a rubbish compilation model. I'm sure we can all agree on that. That's why comparing the Rust compilation model to C/C++ isn't really saying much, even if the Rust compilation model turns out to be better.

No, rather, I was thinking of languages like Java,C#,D,Go,Nim, etc (I meant to say statically-typed languages). In all those, the compilation/translation unit (as I described above) is a file, with the exception of Go, where instead the compilation unit is a Go package. However, that still compares favorably to Rust since Go packages are typically more fine grained than Rust crates. They are somewhere in between a crate and a Rust module. (Go is a bit weird in this regard since it doesn't have a proper package management system.)

@bruno-medeiros

So what you want works when your project only has single-file crates, but starts to fail when you have crates that span nested modules.

Exactly. Which is the vast majority of crates (multi-file crates).

I suspect (though am not certain) that you're upset that the module path of a file within a crate is determined by the sequence of references that lead to it, from the root of a crate.

Yes, that's half of what upsets me. It makes it less clear, less obvious, how to understand the module structure of a Rust crate by looking only at its source files and folders. The other half of what upsets me is the performance implications of this model.

That is definitely intentional, and I'll admit it's a little unusual compared to many languages. But Rust's design favours hierarchical (and local) naming much more heavily than other languages, in several aspects: for example, each crate is considered its own complete module namespace as well.

It's indeed a good thing that Rust favours a more hierarchical naming structure. But I don't think having a compilation unit model like Java's would negatively affect that aspect at all.

In any case, the module-path-determined-by-references is not a construct that Rust alone is capable of expressing. If you think of Rust mod statements as a convention for combining a C++ nested namespace { ... } block with an #include directive, you won't be far off as far as what a compiler can and can't get away with (there's an additional wrinkle that all the modules in a given crate are further mangled to avoid inter-crate conflicts over reuse of the same module name in separate crates, but for the purposes of this discussion that's irrelevant).

That you can perform a similar construct in C++ is not really much of a favorable argument to me (see previous post).

@graydon
graydon commented Dec 4, 2015

But I don't think having a compilation unit model like Java would negatively affect such aspect at all.

I'll admit it could probably be made to work, but:

  • It'd require each file to declare its own package / module, which is redundant and churny
  • It'd defeat the idiom of conditional compilation through attributes on a mod item
  • It defeats the idiom of using the same module in different settings
  • It'd probably mean going to per-file .o outputs with all the actual codegen deferred to LTO
  • It'd probably mean reparsing all dependencies per input file (or reading ASTs and metadata out of .o files, as we do for .rlibs currently) so I suspect it'd wind up doing more work to compile a crate
  • It would break all existing rust code

Overall I'd vote against such a change (were I still on the core team, which of course I'm not). The current system is powerful, flexible and parsimonious. I see no strong reason to re-engineer it when other options are available that don't break the world.

The other half of what upsets me is the performance implications of this model

Do you mean the performance implication of completing a symbol in an IDE? I can see it might take a little longer in a couple contexts, but in most I think it won't be visible: in anything type-dependent you need the whole crate in memory to complete the symbol anyway.

If you mean compilation-speed, I believe the current model is actually faster than what you propose. On a whole-crate basis, already; and on a one-file-changed incremental basis, it will be when RFC 1298 is implemented.

Keep in mind that a .o file from such a scheme absolutely wouldn't contain "object code" in the conventional sense: it's going to contain ASTs for reloading in dependencies and LLVM bitcode for downstream codegen / inlining / LTO. Exactly the same as we produce in any intermediate artifacts in the compiler-maintained cache of RFC 1298, except with redundant copies due to the source file boundaries not matching up with the connected components of the dependency graph.

Under RFC 1298's scheme, touching a file will re-read, typecheck and codegen only the items that changed in the file, but also all its dependent definitions in a crate, exactly once, regardless of which file they originate in. Under the scheme you're proposing, touching a file would produce a .o file that is more expensive in at least 3 dimensions:

  • All definitions in the touched file would get rebuilt, instead of just those that changed (unless you add an RFC 1298-like definition-cache at this level as well)
  • All those definitions would then get re-read when compiling each of its dependent files
  • Any generic combination / inlining tree that involves those definitions would get re-codegen'ed into each dependent file's .o output, then eliminated at LTO

I really don't see any potential compilation perf win here. I think part of the problem is that you're assuming there's a lot more opportunity for truly "independent" compilation than there actually is. Java compiles to .class files which have extremely independent, high level contents; they're really just compiler IR files. It defers all "real" codegen including inlining and specializing to runtime. So you're setting up a kind of false comparison by talking about "compilation to .class files"; they're not the finished goods. You'd have to describe how your scheme goes from .class files to machine code as well.

A language like Rust does all its optimization and codegen ahead of time. Every language has to do it somewhere! I believe we're currently doing it at nearly the cheapest place we can, given the perf niche we're aiming for; we just need to do it less often, and in finer grains (which RFC 1298 is about). There's little to be gained by trying to map those grains to input-file-sized chunks: the dependencies of a definition are rarely-if-ever single files.

@arielb1
Contributor
arielb1 commented Dec 4, 2015

@bruno-medeiros

In a Java-style model, when one file in a strongly-connected-component changes you have to recompile them all. If you don't split your code into headers and source files, you basically have to recompile everything when you touch a file. Also, javac must read every source file multiple times when doing a full compilation - I imagine that autocomplete takes some shortcuts here.

@phildawes

Related to this rfc: I'm trying to figure out how best to give Racer access to stable rustc for type inference:
https://internals.rust-lang.org/t/racer-type-inference-with-stable-rustc/2971

@bruno-medeiros

@graydon

I'll admit it could probably be made to work, but:

I believe I could counter those points to a satisfactory level, but it would take a lot of explaining/writing effort, and some design effort too (to address some of the problems mentioned). Not a trivial task at the moment, and it's also massively off-topic; I don't think it would be right to continue such a discussion here. Maybe I'll pick it up later on the forums... although if RFC 1298 comes quickly enough, it won't matter much anyway.

As for the performance implications of these two different module systems, I was just referring to the incremental compilation case. Yeah, for one-shot compilation, it's not gonna make a difference. I guess I should have mentioned that explicitly in my Foo.java => Other.java example. That's why what intermediate format a source file would compile to is not relevant. Yeah, Java compiles to Java bytecode (the .class) which defers most optimizations to run-time. D on the other hand compiles to native code (.o). But it doesn't matter: even if the intermediate format was just whatever results immediately after semantic analysis (but before any code generation), it would still make incremental compilation faster with a Java-like module system - but again, only if RFC 1298 is not implemented.

Also, what I just said about incremental compilation, also applies to incremental analysis as performed by an IDE tool like Racer (since they both need to perform semantic analysis). So looking back at this comment:

Do you mean the performance implication of completing a symbol in an IDE? I can see it might take a little longer in a couple contexts, but in most I think it won't be visible: in anything type-dependent you need the whole crate in memory to complete the symbol anyway.

You usually need the whole crate in memory to complete the symbol, yeah. But again the key here is incremental analysis, that is, re-analyzing source code after source changes occur. With the current system, a change in a source file means the whole in-memory crate structure needs to be discarded, and a new one created from semantic analysis (unless the IDE tool is smart enough to implement something like RFC 1298). That wouldn't be the case with a Java-like module system - only the dependent source files would need to be reanalyzed.

@bruno-medeiros

Hi guys, an update: I've started work this week on an external tool that would perform a "parse-analysis" of Rust source code: https://github.com/RustDT/rust_parse_describe . Parse-analysis is basically what I call an operation that supplies an IDE/editor with any useful information that can be derived from just parsing alone (no semantic analysis). This would be:

  • Parse errors (if any). This can be used to provide on-the-fly parse errors reporting in the editor.
  • The structural elements of the source file (that is, the top-level definitions). This can be used to provide an editor outline, or to provide the block source ranges for editor block folding (see the sketch below).
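A hypothetical shape for that output (these types are illustrative only, not rust_parse_describe's actual format):

```rust
// Sketch of the data a parse-only analysis could hand back to an editor.
struct ParseDescribeOutput {
    errors: Vec<ParseError>,      // drives on-the-fly parse error reporting
    outline: Vec<OutlineElement>, // drives the editor outline and block folding
}

struct ParseError {
    span: (usize, usize), // byte range of the offending token(s)
    message: String,
}

struct OutlineElement {
    kind: String,               // "fn", "struct", "mod", ...
    name: String,
    name_span: (usize, usize),  // where the identifier sits
    block_span: (usize, usize), // full source range, usable for code folding
    children: Vec<OutlineElement>,
}
```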

I imagine this tool could be integrated into Racer, for convenience. Also, if integrated with Racer (or the Rust-oracle), and used in daemon mode, there is theoretically the possibility to optimize by reusing calculated data structures, like the parse tree.

PS: On the Eclipse/RustDT side I already have framework support for this functionality, as I did the above for Goclipse, using go oracle as a backend tool. So once https://github.com/RustDT/rust_parse_describe is up and running, shouldn't take long to plug that into RustDT.

nrc added some commits Dec 15, 2015
@nrc nrc Remove material on error format and robust compilation
This is a somewhat separate topic, most of which doesn't need an RFC. The material here didn't really add much.
1615236
@nrc nrc Rewrite most of the RFC
More focussed on the RLS (formerly oracle). A smaller and mostly simpler design.
3dd97de
@nrc nrc removed the T-compiler label Jan 13, 2016
@nrc nrc changed the title from Changes to the compiler to support IDEs to Rust Language Server (IDE support) Jan 13, 2016
@nrc
Contributor
nrc commented Jan 13, 2016

I've updated the RFC, it is smaller and more focussed. Hopefully it addresses many of the comments here. This update views the RLS (formerly oracle) and the compiler as an integrated piece of software. Using the compiler's data is considered in the alternatives section. I still favour using a database; hopefully the issues are explained a bit better. In summary, it seems that using the compiler's data directly for finding all references to a definition is not possible in the short run. In the long run we might want to do searches (e.g., find method calls) which are more difficult to answer directly from the compiler's data.

@phildawes

I haven't had much of a chance to digest this fully, but I think with these changes the remit for the RLS is basically the same as for racer. (i.e. run as a daemon, provide results to 'find-definition', 'find-references', 'complete' etc..).

@Valloric

The new RFC is very nice, great work!

There's one fairly important point that should IMO be changed:

Communication between the RLS and an IDE will be via IPC calls

Please make the daemon an HTTP+JSON server. People will tell you that's "slow," but I thoroughly debunked that myth years ago. That post is 2.5 years old, and since then we've had ycmd in YCM for years with absolutely zero perf problems with the transport mechanism. For code-completion to feel "instant," the response from the server needs to come within 200ms. The post I linked to above has benchmarks and details, but the TL;DR is that the overhead of the HTTP+JSON communication layer is 2ms in the absolute worst case, with a 100kb request & response and both the server and client completely written in freaking Python.

And like I've said, we've had years of experience with this in YCM. Other than ycmd, there are several other completion servers that also talk HTTP+JSON: gocode, tern.js, JediHTTP, OmniSharpServer and Rust's own racerd (plus a few I'm forgetting). Please go with the industry standard here and don't pick custom IPC systems that won't work well across platforms, aren't nearly as well understood as HTTP+JSON and aren't trivial to write clients for in every language.

@eddyb
Member
eddyb commented Jan 14, 2016

@Valloric what about JSON over stdio with an external wrapper for HTTP, if desired?
I would expect separate instances to coexist gracefully and not be reachable via TCP.

@Valloric

@eddyb

what about JSON over stdio with an external wrapper for HTTP, if desired?

Why do stdio? What would be the benefit? I only see drawbacks, like problems with multiplexing concurrent requests.

We have this exact problem with TypeScript's tsserver, which is a semantic completion daemon over stdio. We (the ycmd devs) hate the transport mechanism they picked, since now we have all these problems (like multiplexing requests) which we don't have for all the other completion servers we talk to that reasonably use HTTP+JSON.

So I have experience using such a stdio completion daemon and it's a much more painful approach for, critically, absolutely no benefit. Really, none.

I would expect separate instances to coexist gracefully and not be reachable via TCP.

If your desire is to support several instances at once, that's not an issue with HTTP completion daemons. We start a new ycmd instance for every Vim instance the user opens, and I routinely have 5+ of them open at a time. Ycmd just asks the OS for a free port on localhost (by binding to port 0) and gets it.

That's not even the full story, we create a new instance of ycmd and of every completion server it talks to for every new Vim instance. Zero problems with this across Linux/Windows/Mac/FreeBSD/etc with great perf.

@eddyb
Member
eddyb commented Jan 14, 2016

@Valloric Then we're stuck with a dependency on Cargo-based compilation of the Rust distribution to pull in hyper or some other HTTP implementation.
Although, I guess you could still have TCP without HTTP on top, if you really wanted to avoid that extra dependency.

@jwilm
jwilm commented Jan 14, 2016

Alternatively, wrapping these compiler utilities in a server could be left to community projects. This RFC would effectively become "Build equivalent of libclang for Rust", and that's not a bad thing as long as we are mindful of the intended use case.

@arielb1
Contributor
arielb1 commented Jan 14, 2016

@Valloric

Opening random ports on localhost is bad from a security perspective - any website being visited on the local machine can talk to it and try to hack into it. I don't want to deal with these kinds of things.

This can be solved by Unix sockets or their Windows equivalent (COM?).

Also, we don't really want to have rustc depend on an HTTP library. I think that providing a Rust API and having IPC bindings on crates.io would be a good solution.

@ArtemGr
ArtemGr commented Jan 14, 2016

Regarding the IPC bindings, I'd really like to see the completion work over the net. A lot of server-side projects are actually compiled on the server and not on the developer's workstation. The server might have access to C libraries that are not readily portable to another OS. It will be a pain porting such projects just to make the IDE completion work.

@eddyb
Member
eddyb commented Jan 14, 2016

@arielb1 As long as the API is opaque like IPC would be (i.e. owned ADTs over Rust channels), I am fine with that.

@jwilm
jwilm commented Jan 14, 2016

@arielb1 Preventing random services from querying a localhost service is not difficult. Both YCM and racerd support HMAC auth to prevent exactly that.

That said, I agree having an HTTP lib in rustc is undesirable. The IPC version doesn't make sense to me either. If you look at the expected consumers of the RLS, it's not clear to me who would actually want to integrate via IPC primitives.

The most pragmatic approach here is going to be unopinionated about such server-client interactions. This is conveniently also less work for rustc. The focus becomes building a high level Rust API for managing an RLS instance and accessing completions, references, definitions, and compiler messages. Matters such as transport and protocol are then left to the actual server implementations or direct integrations.

The rest of this is regarding the RFC text.

Unresolved Questions

A problem is that Visual Studio uses UTF16 while Rust uses UTF8, there is (I understand) no efficient way to convert between byte counts in these systems. I'm not sure how to address this. It might require the RLS to be able to operate in UTF16 mode.

If the APIs just speak in terms of 1-based column/line locations (as-is-natural), shouldn't such a conversion be trivial?

Alternatives

We could only provide the RLS as a library

YES. For reasons above. 😄

Detailed design > Architecture

It seems to me that most queries would require some level of compilation to be done. For example, every time completions are requested, it's probably because the user just typed something new like my_vec. Reordering queries ahead of compilation requests would potentially result in stale data being returned. For example, if the user just wrote my_vec = vec![1, 2, 3] and compilation hadn't been performed, the my_vec binding wouldn't be in the compilation database, and nothing would be returned. I suppose in such cases, a compilation could be performed using the newly supplied buffers after such a failed query.

Lazy compilation

Editors typically provide a set of dirty buffers for all queries. These buffers should be applied first to always return correct results.

@Valloric

The most pragmatic approach here is going to be unopinionated about such server-client interactions. This is conveniently also less work for rustc. The focus becomes building a high level Rust API for managing an RLS instance and accessing completions, references, definitions, and compiler messages. Matters such as transport and protocol are then left to the actual server implementations or direct integrations.

Exactly. I don't actually want rustc to provide a completion daemon, I want a libclang version of rustc. We'll build the daemon.

But if rustc really wants to build a daemon, it should be HTTP+JSON. But a libclang-like library would be best.

Both YCM and racerd support HMAC auth to prevent exactly that.

To clarify what Joe said a bit to prevent confusion, both ycmd and racerd require HMAC auth for security. It's not optional.

@dgrunwald
Contributor

A problem is that Visual Studio uses UTF16 while Rust uses UTF8, there is (I understand) no efficient way to convert between byte counts in these systems. I'm not sure how to address this. It might require the RLS to be able to operate in UTF16 mode.

If the APIs just speak in terms of 1-based column/line locations (as-is-natural), shouldn't such a conversion be trivial?

What does "as-is-natural" mean? What's 1 column? One code unit, one code point, or one grapheme cluster? How are tabs counted?

Byte offsets are much simpler than line/column pairs.

As for Visual Studio / UTF-16: at some point edits will have to be converted from UTF-16 to UTF-8 so that the rust compiler can parse the code. Converting byte counts isn't less efficient than that, so I don't really see a problem here. But if this does turn out to be a performance problem for UTF-16 editors: it's quite possible to efficiently maintain a data structure that maps UTF-16 <-> UTF-8 offsets. Editors already use similar data structures for mapping between line numbers and offsets.
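For what it's worth, a naive byte-offset to UTF-16-offset conversion is nearly a one-liner in Rust (a sketch only; a real editor would maintain the mapping incrementally rather than rescanning the text):

```rust
/// Convert a UTF-8 byte offset within `text` to the corresponding
/// UTF-16 code-unit offset. Assumes `utf8_offset` lies on a char boundary.
fn utf8_to_utf16_offset(text: &str, utf8_offset: usize) -> usize {
    text[..utf8_offset].encode_utf16().count()
}

fn main() {
    let line = "let héllo = 1;";
    let byte_off = line.find('=').unwrap(); // 11: 'é' is two bytes in UTF-8
    assert_eq!(utf8_to_utf16_offset(line, byte_off), 10); // 'é' is one UTF-16 unit
}
```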

@nrc
Contributor
nrc commented Jan 14, 2016

I've added text to expand on some of the alternatives/questions discussed in the last few comments. I've also added a paragraph about the RLS taking changed text not saved to a file.

To explain my position on these questions:

I think we should provide a daemon rather than a library because, to work, such a piece of software must be kept in memory, so every client will want this facility. Not providing it seems like asking clients to duplicate work. I don't imagine there will be much difference in how they would implement it either (except for the client/server protocol perhaps, see below). There will be a library interface available for lower-level tools like Racer, if it wants to use the RLS. The IDE authors I've talked to have said they prefer an IPC interface.

I believe HTTP+JSON is the best solution for IPC. I envisage the RLS to live in its own repo using rustc as a submodule. Therefore pulling in Hyper or some other library is not an issue. The security issues seem solvable. It allows easily using the RLS over a network as well as locally. It is a proven solution - it is used by Roslyn and YCMD.

@eddyb
Member
eddyb commented Jan 14, 2016

@nrc RLS could be split, with the HTTP+JSON component using the API bundled with the compiler.

@nrc
Contributor
nrc commented Jan 14, 2016

@jwilm to answer your other points:

UTF16/UTF8

RLS will provide both row/column and byte offset information. Apparently the latter is useful for IDE authors. I added some text on this to the RFC; I think the question boils down to whether the RLS or the client does the conversion. I believe it can be done by either and isn't very nice for either. In the interests of making progress, the RLS will not initially do the conversion. In the long run (if there are multiple VS plugins, for example) it might make sense to add it to the RLS.

Reordering queries

I imagine that we would track dependencies and never move a query ahead of a compilation that it depended on. But we could move queries ahead of compilations that are orthogonal (say of a different file or crate). In any case this is an extension that won't be around for a while.

dirty buffers

Yes, this is true. I'd thought of this but not put it in the RFC, I've added a para for it.

@nrc
Contributor
nrc commented Jan 14, 2016

@eddyb I'm envisaging that there will be some changes to improve the save-analysis API. The other changes in this RFC (tracking multiple crates, using a database to answer queries over the whole program rather than one crate) are better to live outside the compiler. I expect that this second level (db + work queue) can live in a crate and the daemon/IPC stuff can live in another.

@eddyb
Member
eddyb commented Jan 14, 2016

@nrc part of that has to live with the compiler and provide a stable API which is the IPC API, in Rust (communicating over channels with everything else, for example).
Unless you don't want to distribute RLS with the Rust compiler and use various hacks to get it to compile for stable versions.

@arielb1
Contributor
arielb1 commented Jan 14, 2016

@nrc

I don't really want rustc to be in the API business - we already have enough troubles with language stability, we don't need more of these.

Having an official crates.io server along with an unstable librustc_interactive (we really need a better story than #[unstable] for the compiler libs) in the main repo would be the best.

@bruno-medeiros

On the issue of HTTP vs. other transport mechanisms.

First of all, the core of RLS functionality should be well abstracted from whatever actual transport mechanism is used. It should be trivial to add interface code that uses HTTP, TCP sockets, command-line stdio, or something else. In practice the choice will be left to the RLS client (the editor/IDE). The above implies that RLS should not be a single crate - but rather one crate for RLS as a library, another crate for an RLS daemon using HTTP as an interface, etc.

@Valloric @jwilm That said, I don't see the point of using HTTP as an interface; it seems like bloat. I'm not saying there is necessarily a significant performance hit, I'm saying I don't see the benefit at all, versus, say, TCP sockets (+JSON). I've written one such daemon, and the socket handling code is super simple: https://github.com/DDT-IDE/DDT/blob/master/plugin_tooling/src/dtool/genie/AbstractSocketServer.java . That's 140 lines, and most of it is just nice error handling, and dealing with some Java particularities (like making sure unchecked exceptions don't kill a thread without being logged somewhere). However, the core code of a TCP socket server is super simple:

  • Open a server socket, listen for connections
  • When a new connection is made, create a new thread to read/write on that new connection.

What's the benefit of bringing in a whole HTTP server just for that?
So you can use URLs to encode command options or other information to the daemon? To be able to view daemon information on a browser?
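To make the "super simple" claim concrete, here is a minimal sketch of that thread-per-connection loop in Rust, using only the standard library (illustrative only: no JSON parsing, protocol, or real error handling):

```rust
use std::io::{BufRead, BufReader, Write};
use std::net::TcpListener;
use std::thread;

fn main() -> std::io::Result<()> {
    // Bind to port 0 so the OS picks a free port.
    let listener = TcpListener::bind("127.0.0.1:0")?;
    println!("listening on {}", listener.local_addr()?);
    for stream in listener.incoming() {
        let stream = stream?;
        // One thread per connection, reading newline-delimited requests.
        thread::spawn(move || {
            let mut reader = BufReader::new(stream.try_clone().expect("clone stream"));
            let mut writer = stream;
            let mut line = String::new();
            while reader.read_line(&mut line).unwrap_or(0) > 0 {
                // A real daemon would parse `line` as a JSON request here
                // and write back a JSON response.
                let _ = writeln!(writer, "{{\"echo\": {:?}}}", line.trim_end());
                line.clear();
            }
        });
    }
    Ok(())
}
```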

@bgourlie

@bruno-medeiros HTTP+JSON is trivially consumable by anything and anyone. Having to operate at the socket level increases the barrier to entry for anyone looking to integrate. I think it's easy to underestimate, for example, how useful being able to use something like curl to query really is, or how much simpler, programmatically, using HTTP is vs sockets.

If performance isn't a concern, what are the implications of the additional bloat introduced in using HTTP?

@bruno-medeiros

On the RFC itself:

Compilation

It would be nice if we could call this "compile-analysis" or some other less ambiguous term. That is, to distinguish the "compilation" operation where we actually want binaries/libraries to be generated from the "compilation" operation where we just want semantic analysis performed and errors/warnings provided to the user.

@nrc
"The analysis information will be provided by the save-analysis API." What exactly doesn't mean, I didn't quite get it. What again is the save-analysis? (this term is no longer defined in the current RFC text)

Alternatives

The RFC is still a bit light on the architecture/design of the semantic database. As an example, let me provide a gist of how this is done in the Eclipse JDT (or CDT).

First, a term definition: compile-analysis data - this is all the semantic data that must be calculated in order to produce errors/warnings for the whole project. Obviously this includes reference resolution ("with reference X in an AST, what definition does it actually refer to?"), etc.

  • Compile-analysis information is stored in memory, in a format closely tied to and interlinked with the AST. This means that with, say, a Reference AST node, it's trivial - API-wise - to request the resolution of that reference (this is called a binding). The underlying code also takes care to see if a new semantic resolution is necessary, or if the previous one is still valid.

Then there is the information used to process queries such as "find all references to type Foo", or "find types named *Foo", where * is a wildcard. Let's call this information query-analysis data. Note that most of this information is not required to perform an actual compilation; it is only of use to IDE users.

In the JDT/CDT world this query-analysis data is stored in a structure called the "index", which is divided by source file (a source file is "indexed", and the result is called an index document). An index document is a collection of index entries, each of which is quite simple: just a data tuple of the form <TYPE, SPAN, NAME>, where

  • TYPE is the type of entry, such as Reference or Definition (there are other, more complex ones);
  • SPAN is the span (aka source range);
  • NAME is the unresolved name of the index entry.

So Rust code like let x = foo; would provide these entries:

<DEF, 4:5, "x">
<REF, 8:12, "foo">

Notice that in the ref entry above, the name is just "foo" -> it's not a fully qualified name, and as such is ambiguous... to what definition does "foo" actually point? You can't tell from the index entry alone.

So how does it work when someone asks to find all references to type xpto::foo of a given crate? (I'm using equivalent Rust terminology here to be more clear)
This search is executed in two phases. First, it looks up in the index for potential matches, that is, references that could potentially refer to xpto::foo (it checks that by comparing the name). So the <REF, 8:12, "foo"> entry would be accepted as a potential match.

In the second phase, the engine just requests a find-definition operation on all those potential-match references to figure out what actual element each refers to. It does this by finding the Reference AST node at the span given in the entry. After that it is trivial to request the ref resolution.
Note that this last part is efficient because it uses the compile-analysis data, which was already calculated beforehand, and is stored/cached in-memory. So resolving each potential match is nearly instantaneous. And once it knows the actual element referred to, it checks whether it is the same as xpto::foo itself, and discards it otherwise.
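A rough sketch of that two-phase lookup, expressed in Rust rather than JDT's actual API (all names here are illustrative):

```rust
#[derive(PartialEq)]
enum EntryKind { Def, Ref }

struct IndexEntry {
    kind: EntryKind,
    span: (usize, usize), // source range within the file
    name: String,         // unresolved name, e.g. "foo"
}

// `resolve` stands in for the cached compile-analysis data: it maps a span
// to the id of the definition that reference resolves to.
fn find_references(
    index: &[IndexEntry],
    target_name: &str,
    target_def_id: u32,
    resolve: impl Fn((usize, usize)) -> Option<u32>,
) -> Vec<(usize, usize)> {
    index
        .iter()
        // Phase 1: cheap textual filter on the unresolved name in the index.
        .filter(|e| e.kind == EntryKind::Ref && e.name == target_name)
        // Phase 2: resolve each surviving candidate and keep true matches only.
        .filter(|e| resolve(e.span) == Some(target_def_id))
        .map(|e| e.span)
        .collect()
}
```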

The index documents are usually kept in-memory, but they are also persisted and cached to disk. This is because the index documents can take up a lot of memory, so it might not be good to keep them all in memory (Java projects can be huge, mind you, especially if using 3rd party libraries). Also, having index documents persisted to a file means they can be loaded more quickly when the IDE restarts (that is, quicker than re-indexing the source files).

Also, the reason the index is divided into documents that correspond to the source file they originate from is that it makes it easier to update them (the index documents). When a user types some text, it invalidates the index for the source file being edited. This is also why an entry such as <REF, 8:12, "foo"> for, say, source file "file_A.rs" does not contain the fully qualified name/id of the element "foo" the reference points at: such a resolution could become invalid if a source file other than "file_A.rs" is edited. In this way, the index documents are designed so that their entries only depend on the source file they originate from, not on other source files.


Now, I'm not saying this is the architecture that RLS should have.

What I'm saying is that this describes with some precision how the data structures for an IDE/oracle engine are structured, how they are updated in the face of user editing, how queries use them, etc. In the RFC it is not clear at all what the suggested design is (at least I can't figure it out).

Some points I think this should bring up:

  • We should make a distinction, at least conceptually, between compilation-analysis data and query-analysis data.
  • How is compilation-analysis data structured, in such a way that it allows for incremental compilation? I think this is a very important consideration that should be kept in mind at all times, because incremental compilation is going to be very useful, if not crucial, for RLS.
  • How much overlap should exist between compilation-analysis data and query-analysis data? Should a user query use both, or just the query-analysis data? For example if a user asks "what definition is pointed to at location X?", does that use the same data structures that an incremental compiler would use, or does it use a separate query-analysis data stored in a different format? (I think it should be the former case)
  • Should the compilation-analysis data be structured in a SQL database, or a simpler data structure, closer to the AST/MIR?
@eddyb
Member
eddyb commented Jan 15, 2016

@bruno-medeiros FWIW the compiler no longer works on the AST but HIR, which started out as being identical to the AST but is getting differential improvements, e.g. Items are no longer nested in terms of ownership, i.e. an ItemMod effectively containing a vector of Item IDs, instead of the Items themselves.
There have also been suggestions along the lines of having a HIR format which can be memory-mapped from metadata found in crates.

What again is the save-analysis?

It's @nrc's pass to convert from Rust's internal representations into some textual DXR-friendly format (based on CSV atm AFAIK).
IMO save-analysis is a massive bodge and all of the work it does could be abstracted into a nicer internal compiler API, on top of which RLS could be built.

There are 0 uses for having a textual format for storing any kind of metadata at any point during RLS' operation.

Should the compilation-analysis data be structured in a SQL database, or a simpler data structure, closer to the AST/MIR?

I believe it should be using the HIR: not something like the HIR, and not the HIR alone, but a layer on top of the HIR.

The HIR and the existing side-tables form the data store, while the RLS database is a layer of indexes which the compiler itself doesn't already have, or does have but doesn't lend itself to RLS queries (such as fuzzy searches).

In this way, the index documents are designed so that their entries only depend on the source file they originate from, not other source files.

This is solved by building RLS on top of incremental compilation instead of overengineering it.

Given the flexible and automatic approach @nikomatsakis took for tracking dependencies between compiler passes, for incremental recompilation, it should be possible to also track RLS indexes using the same mechanism and recompute them as necessary.

@nrc
Contributor
nrc commented Jan 15, 2016

@bruno-medeiros @eddyb to clarify, when I say save-analysis API I do not mean the CSV dump, I mean the API (which is WIP) that provides some of the services as functions, see trans::save::get_item_data, etc. This is just an API for the compiler to provide information about its data without exposing its internals. There is no textual representation anywhere.

Note that interaction with the save-analysis API uses the AST, not the HIR because the HIR is meant to be private to the compiler, whereas the AST is public API.

I intentionally left the specification of the database vague. I don't believe it will benefit from too much up-front planning. A relational SQL database is what I have in mind, but I know there are many options (DXR for example uses ElasticSearch, but I don't think that would be useful for the RLS since it will be doing mostly relational, not textual, queries).

@bruno-medeiros, that's a lot of info, thanks! It'll take me a little while to chew that over. In the meantime, I'll answer the questions that I can:

For incremental compilation, the compiler will return a set of code which is invalidated, either by span in the source or by node id. We can then query the DB using this set and delete all matching records.

In the alternatives section, I discuss a little bit the option of using incremental compilation data rather than our own DB layer. All other things being equal, I would prefer to just use the data we have, rather than stick a DB on top. However, I see two big points in favour of the DB: first is timing - incremental compilation data will not support this out of the box - the starting point is not to keep data for function bodies, only items. So if incremental compilation is fast enough, it is not clear when (if ever) we'll get the required data for the RLS. It is certainly not as simple as serialising the HIR or MIR.

Secondly indexing: I hope that we could actually be quick enough to use the 'raw' data without indexing. The only query which requires non-local data is find-all-references, which is not expected to be instantaneous. I believe one could read and search the data for references, without indexing, in a few seconds for the largest programs. However, for more sophisticated searches we would need indexing, and maybe even for find-all-references if that was not fast enough. As soon as we get into managing indexes ourselves, I would rather outsource this to a general purpose database. This stuff is hard to do and there is no point in us rolling our own. I believe that is the more complex solution.

@eddyb I see using a database to abstract away questions about indexing, when to read data from disk, etc., as a simplification. The "HIR alone" + layers on top + indexes seems more complex to me, especially as we would need to heavily modify the HIR first. The only advantage of not using the DB is that invalidation is easier.

@phildawes

(disclaimer: racer maintainer!)

Personally I'd like to see this RFC focus on compiler changes to enable IDE functionality, rather than building an RLS server product. I think originally the proposal was called 'Changes to the compiler to support IDEs' and that is more where I think this should go.

The big problems blocking the analysis required for good IDE functionality are:

  • how to generate analysis information for pieces of code, fast. (interactive compile)
  • how to analyze unfinished or incomplete code

These require an official rust-team backed proposal because they involve changes to the rust compiler and so need coordination and to be blessed by rust teams. There is only one rust compiler.

The other stuff is much more speculative - we are talking about how an ide should be wired together and what functionality it should have.

There can be many solutions here, and specifying one - the RLS - and implicitly giving it rust team blessing will discourage the community from taking up future projects in this area.

@phildawes

To clarify: I'm not saying we shouldn't discuss this stuff and build software as part of this effort. In fact this is required to drive out the necessary compiler changes. However only the compiler changes require official rust-team blessing.

Moreover:

  • Having an official Rust Language Server program as the only stable interface to the compiler means all future ide functionality involving analysis would also have to be specified and built into it.
  • It has a narrow vertical interface. The RLS is saying 'here's how completion and navigation functionality will work in an ide'.

I would like to see: 'here's how to get analysis information out of the compiler, fast and for incomplete code'.

@Valloric

I completely agree with @phildawes here. I don't think rustc should be in the business of dictating the details of how editors should access completion/semantic info. We need a stable library to access this data; no more, no less. Again, look at libclang.

@nrc
Contributor
nrc commented Jan 17, 2016

@phildawes, I think (hope) we are more in agreement than it would appear. Maybe the choice of names and the use of requirements are misleading. I'll try and clarify.

how to generate analysis information for pieces of code, fast. (interactive compile)

This is discussed in the RFC. To summarise: start with incremental compilation, then extend with lazy compilation, keeping the compiler in memory.

how to analyze unfinished or incomplete code

I decided this stuff was unrelated to the rest of this RFC and removed it. It is still a priority. Some is implemented already. What remains is error recovery in the parser. This may or may not be in an RFC, but it will be a separate one if it gets one. It might be simple enough to just implement.

On the question of doing too much of the IDE's work, let me revisit two of the requirements. I do want to only supply the data from the compiler and let the IDE do the rest, I certainly don't want to constrain innovation in the IDE or other tools. By choosing these requirements I'm just picking some things an IDE might want to do in order to ensure the RLS meets those needs. I'm not saying an IDE must do them or must do them in the way suggested.

  • highlight all references to an item on the page

The RLS here is just returning a list of references to the IDE. I.e., a lookup and a reverse lookup in the name resolution tables. The data is limited to a single file. The important thing is that data is limited to one crate (which makes the search easier for the RLS).

The IDE is free to use that information however it likes. It could use it to highlight the other uses on the screen/in the file. The RLS is not doing this though.

  • code completion

Here, the RLS is taking a cursor position and a compiled file and returning a list of fields or methods that could be called. It is explicitly not doing full code completion. In particular the IDE (or Racer) would need to filter this list according to what is typed, possibly applying spelling correction, etc. The list would also be structured data giving information about the fields and methods such as types and where they are defined. It is up to the client to decide how to present this data (just names, or more info). This seems the minimal information that the compiler could provide.

I feel like maybe the name and the mode of operation (daemon process, etc.) are distracting from the information presented. Again, this is a low level library, simply presenting the compiler's data in an easy to consume way. What data is exposed is a separate issue to how it is exposed, and I believe here that presenting an IPC API is a convenience for clients (note in addition to a standard, in-process library API for those who choose to access it that way).

@Valloric to address the libclang alternative approach. Having a stable interface to compiler internals is somewhat orthogonal to providing a nice interface for IDEs. In some ways it is a much harder question, because we would want to expose much more of the compiler and that risks de-facto stabilisation of some parts. There are plans in this area - the save-analysis API is a low-level library for accessing compiler internals without exposing implementation details. It is WIP. In the longer run, we will probably design some kind of compiler plugin API.

However, even if we concentrated on this approach it is not suitable for an IDE. One point is speed. The RLS as proposed takes care of managing compilation, this is necessary especially for fast compilation models in the future. This work would be identical between IDEs, so there seems no point in making them duplicate it. Second point is that for searching you need access to the whole program, not just a single crate. The compiler only has access to the current crate and item signatures in upstream crates. For searches, the IDE needs function bodies for all crates. Managing that is tricky and again is something that all clients must do, so it makes sense to provide it in the RLS.

@phildawes

Hi @nrc! Sorry I don't think I made myself clear. I'm arguing for just the analog of the 'save-analysis' api to be specified here. The way to extract analysis information from the compiler, fast and for incomplete code. This requires agreement and rust-team blessing to progress because it requires changes to rustc.
Not 'code-completion', 'find-references' etc..

@DemiMarie

I think one thing that is clear is that the means by which the IDE and RLS communicate is beyond the scope of this RFC. The optimum mechanism will likely be IDE- and possibly platform-specific, and I doubt that plaintext HTTP will be the answer (notably, it is a poor choice on multi-user systems because if the server dies a local attacker can replace it and cause an information leak at a minimum). Something based on Windows named pipes and Unix domain sockets is far more satisfactory from a security perspective. HTTP may require TLS with cert pinning.

@bruno-medeiros

I still think a full relational database seems overkill, mostly because virtually all of the IDE queries will be relatively simple in structure. But anyways, this ultimately is an RLS implementation aspect, and I think it's of limited value to be discussing upfront design for implementation aspects of RLS (even though I did just that before 😅 ). Also, not to mention this is a feature that depends on other more essential features being implemented first.

Of more importance are the functionality goals and general design principles of RLS, and in this I agree with @phildawes and @Valloric too: it is best to focus on incremental improvements to rustc that would make it more amenable to use by IDEs and RLS. I'd rather see an API being gradually introduced (even if it's a very unstable API) than no API at all. But rustc would have to be fully committed to this design objective, not just do things half-way. Otherwise it might make things hard for RLS, hard to the point that RLS might have been better off just using the rustc parser and then working with the AST on its own (which is essentially what Racer does currently, correct?).

@nrc
Contributor
nrc commented Jan 18, 2016

@phildawes ah, so maybe not so much agreement :-) Let's look at a higher level: to get IDE (and IDE-like) support we need three pieces - the compiler, middleware, and the editor. I don't intend doing any planning in the editor space, it seems implementation is well under way. The compiler needs some work - it must be faster and more robust. Work is already underway on incremental compilation; lazy compilation may be built on top of that, but I don't think we can spec it in much detail until incremental compilation is further along. Furthermore, incremental compilation alone will get us a long way - from impossible to use to either annoyingly slow or good enough - so I don't consider faster compilation too important right now. On the robustness front, I think that the only big issue is error recovery in the parser - I do want to get into that, but it seems pretty orthogonal to this RFC.

So that leaves the middleware and that is what I am trying to plan in this RFC. I think this is necessary because there is a lot of work in getting data from the compiler to an IDE, and it is work that must be done by all IDEs and (although a bit complex) doesn't offer any room for innovation. The work required is managing compilation, preservation of data between compiler runs, and some cross-referencing of that data.

I'm not really sure what you are objecting to - you seem to be saying that we should not be planning the middleware at all, rather than that the plan should change. To me, the middleware seems to be the most important thing to plan right now.

There will be RFCs for some compiler issues at some point (lazy compilation in the future, perhaps error recovery soon), but for now I think the middleware is more important.

I also want to push back on the idea that we are providing a high level service here. The data we're providing is low level data - "find all references" is really just find all identifiers with the same definition id; this is just simple cross-referencing of the compiler data (the complexity comes from the fact that the data is from multiple crates). "Code completion" is find all fields and methods that are allowed after a dot; this requires post-processing before presenting to the user (by an IDE, not the RLS), and I can't think of a lower-level request to the compiler which would allow you to do type-based code completion than this.

@nrc
Contributor
nrc commented Jan 18, 2016

@drbo this is not clear to me. We are designing the RLS, and how the RLS communicates with its clients seems to be a core part of that design. The fact that it is being discussed means it is an important part of the design, not that we should ignore it.

AIUI, securing an HTTP server is pretty much a solved problem. The solution may be complex, but it is known. I would worry far more about securing a new or custom protocol.

@nrc
Contributor
nrc commented Jan 18, 2016

@bruno-medeiros I think I answered your comment in the reply to Phil above. But in case not, could you be specific about what you think needs to change in rustc? When you say API, what kind of API are you talking about? My understanding is that the middleware is more of an unknown at the moment than the compiler is.

@nrc
Contributor
nrc commented Jan 18, 2016

@bruno-medeiros (in reply to your comment from 3 days ago). Running the compiler in 'compile-analysis' mode is possible today, either from the command line or via the driver API. It is used for example in cargo check. I don't believe there is more to do in this area.

Thanks for the explanation of JDT. What is the advantage of splitting the index data from the compilation data? I see a few - no duplication of the compilation data, smaller index data, easier updating, perhaps the index data can be used without the compile data for some things? The first few don't seem very important today (maybe on low powered hardware), I don't know about the last.

The Rust equivalent of compilation data will exist in the incremental compilation data. However, that will not contain enough information (initially) to re-compile inside a function body. It will probably not be optimal for searching (this might be solvable with indexes over the data, etc., but that adds complexity). I'm also keen not to constrain evolution of the compiler at this stage. So, I think it is best for the RLS to keep its own copy of compilation data. Given that, it seems simplest to combine the index and compile data that the RLS holds.

For me, the simplest solution to managing this data is just to dump it into a relational database. Let the db manage when to read from disk, manage indexes, etc. These are hard problems and I don't see how we can do better by doing it ourselves (without an awful lot of effort). I also don't see this as complexity - clearly there is complexity but it is isolated and abstracted in the database.

Querying data (for search-like queries) is trivial. Looking up potential methods and fields is probably best answered from the compiler directly, rather than via the db since it needs some non-trivial computation.

Invalidating data is slightly more complex, but still fairly straightforward - the compiler will inform the RLS of spans in the (old) source code which are invalid, and we then delete any data from the db within these spans.
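A sketch of that invalidation step, assuming the records are mirrored in memory (with a SQL backend this would be the equivalent DELETE over the same span predicate); the types and names here are illustrative:

```rust
struct Record {
    file: String,
    span: (usize, usize), // byte range in the old source text
    // ...the actual definition/reference payload is elided
}

/// Drop every record for `file` that falls inside one of the invalidated
/// spans reported by the compiler.
fn invalidate(records: &mut Vec<Record>, file: &str, invalid_spans: &[(usize, usize)]) {
    records.retain(|r| {
        r.file != file
            || !invalid_spans
                .iter()
                .any(|&(lo, hi)| lo <= r.span.0 && r.span.1 <= hi)
    });
}
```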

@phildawes

Hi @nrc, Thanks for replying, sorry if my comments are coming over a bit snarky.

I do think we should be talking about middleware, how it can work, what will be required by clients and from the compiler. However, I think saying there will be one official rust program to provide all the analysis operations for IDEs, and that it will be the only way for clients to perform these operations, is going a bit far.

It worries me because it affects racer, racerd and other future community projects. My initial reaction to RLS is 'if at some point there is going to be an official rust language server then I should probably stop working on racer'.

I do view the RLS operations as high level, as with no other way to extract analysis information from rustc they dictate the only operations an IDE can provide. There is a strong burden to support every operation an IDE could conceivably want to do.

For example,

"Code Completion" is find all fields and methods that are allowed after a dot:

  • is this all the currently legal fields/methods?
  • What about currently private fields?
  • What about non-imported modules?

I don't want that every time an IDE wants to do something new or slightly different it requires adding a new operation or additional functionality to the RLS. E.g. If an IDE or client tool wants to add refactoring functionality - e.g. Extract Method, does it first have to convince the RLS team to add the right operations to RLS? (edit: rewrote this paragraph)

Instead I'm advocating getting the compiler analysis extraction interface right. I think the save-analysis information in a library form is a good starting point.

There are tradeoffs involved in the RLS design, both implementation tradeoffs (relational db?, separate process? http?, up-front caching vs lazy? per-ide? per-workstation?) and functional tradeoffs (which operations?). Saying there will be one official program providing all the operations assumes we can nail a set of tradeoffs that will be right for all IDEs and tooling. I think this is unnecessarily restrictive and I don't think we need to do this to succeed.

@nrc
Contributor
nrc commented Jan 19, 2016

Hi @phildawes, so let me take a step back. First off, I don't think there should be one piece of official middleware forever. I hope that both the interface presented by the middleware to the IDE and the compiler's API will be stable and specified, so we can have alternative implementations. There are a lot of tricky, inter-related levers here, it is an area I don't know much about, and at the compiler end there is a moving target, so I doubt we'll get it right the first time. Furthermore, I don't even think that there is one ideal middleware solution - different tools will want very different levels of abstraction.

I envisage the RLS fitting in between heavyweight IDEs and the compiler. For more lightweight editors and other tools, I don't think that will be ideal. There perhaps the stack is compiler, Racer, editor, or compiler, RLS (without the daemon stuff), Racer, editor.

I think the code completion example is useful for explaining where the RLS fits in - the RLS will return all fields and methods, given what the type will allow (this is still a bit handwavery, but basically what the compiler sees at type checking time). The data returned for each field (and likewise for methods) will be something like an id, a name, a type (possibly both generic and specific), privacy information, attributes, a span for the definition, whether the field is accessible from the current cursor position, and so forth.
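A rough sketch of what such a per-candidate record might look like (the field names are illustrative, not a committed API):

```rust
struct CompletionCandidate {
    def_id: u64,                      // id of the definition in the compiler's tables
    name: String,                     // field or method name
    ty: String,                       // rendered type (generic and/or instantiated)
    is_public: bool,                  // privacy information
    attributes: Vec<String>,
    def_span: (String, usize, usize), // file, start byte, end byte of the definition
    accessible_from_cursor: bool,
}
```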

The client is then responsible for what to show to the user, whether the client is an IDE or Racer (or other middleware that lives on top of RLS). So the bullet points you suggest are things the client filters on, not the RLS - if the client wants to show inaccessible fields, they can, if they want to add more fields (such as from non-imported modules) they can.

It seems to me that there is a lot of room for innovation in the client here, but not so much in the RLS - it just spits out what the compiler knows.

On refactoring, I don't think this should be done by the RLS, the IDE or a refactoring tool should do it. That tool will probably need information that the compiler doesn't provide, so we'll need to add APIs to both the compiler and RLS to expose the low-level data, but I'm assuming that would be an API to check "is this name currently used in this scope", or similar things, not high level things like extract a method. So, "kind of" is the answer, but they'd have to persuade the compiler team first in any case.

Finally, I'd like to push back on the idea that the RLS precludes APIs in the compiler. It does not. We can still (and will) have public APIs in the compiler that tools can use. The RLS is just another point on the spectrum - if you want really low level data - ask the compiler, really high level data - ask Racer, somewhere in between - ask the RLS. In particular, I see the RLS as a convenience to allow asking questions about a whole program, whereas the compiler will only ever allow asking questions about a single crate.

I'm keen to start with the RLS rather than the compiler APIs because I'm more sure of what they should look like and they will help inform what the compiler APIs look like.

To emphasise, I don't see the RLS as a replacement for Racer, but as a tool to help Racer, IDEs, and other tools to have easier access to the compiler's data. Put another way, the vision for the RLS is not to be a tool that uses the compiler, but to lower the bar to entry for tools that want to use the compiler.

@bruno-medeiros

@drbo this is not clear to me. We are designing the RLS, and how the RLS communicates with its clients seems to be a core part of that design. The fact that it is being discussed means it is an important part of the design, not that we should ignore it.

Hum, the format of the messages is an important part of the design, but whether the messages are delivered via HTTP, plain TCP sockets, stdin, or something else, not so much. Not until you want to take security into consideration, but it shouldn't affect the operation of the RLS itself (or of the IDE).

Is security the reason why people (@Valloric @jwilm, etc.) were suggesting using HTTP as a base for the RLS communication?

@bruno-medeiros

@bruno-medeiros I think I answered your comment in the reply to Phil above. But in case not, could you be specific about what you think needs to change in rustc? When you say API, what kind of API are you talking about? My understanding is that the middleware is more of an unknown at the moment than the compiler is.

I don't have specific suggestions, as I'm not familiar with rustc internals. What I was saying in general about the API is that the API should allow RLS (the middleware) to build on the compiler and answer all sorts of interesting questions for an IDE, like "find all references to element foo", or "is this name currently used in this scope?". Like you said later on, "we'll need to add APIs to both the compiler and RLS to expose the low-level data". It's a shift to designing a compiler (rustc) as a library for all sorts of purposes, not just compilation into binaries. I'm not familiar with libclang, but I think this is what @Valloric meant when he suggested rustc should evolve to be more like libclang (and then RLS would build on top of that).

It is used for example in cargo check

cargo check does nothing on my machine, is this something currently available only on nightly?

Thanks for the explanation of JDT. What is the advantage of splitting the index data from the compilation data?

Like you said, easier updating/invalidation, but also, I think, because it is very easy to persist the index data to disk, since it consists of only plain textual (or numeric) data. There are no references to other objects (unlike the compilation data, which has lots of internal references, complex data structures, etc.).

For me, the simplest solution to managing this data is just to dump it into a relational database. Let the db manage when to read from disk, manage indexes, etc. These are hard problems and I don't see how we can do better by doing it ourselves (without an awful lot of effort).

I'm not saying we should write our own code to handle disk/memory caching, etc. (although I don't think it would be that hard of a problem, for the purposes of RLS usage alone). There might be some other libraries that help with that already. What about using a NoSQL database? That might be a better choice. To be honest it's not an area I'm familiar with, so I don't know what could be good choices here. On top of that we are also constrained a bit by how easy it would be to use such libraries/databases in Rust (i.e., whether there is already Rust code to interact with the library/database or not).

@bruno-medeiros

@phildawes

I don't want that every time an IDE wants to do something new or slightly different it requires adding a new operation or additional functionality to the RLS. E.g. If an IDE or client tool wants to add refactoring functionality - e.g. Extract Method, does it first have to convince the RLS team to add the right operations to RLS?

Hum, yes. Or the IDE developers have to modify RLS themselves to do that. I mean, RLS will be an open-source project, so it will be easy to add patches (or even fork/mod).
You seem to not like this scenario, but what alternative could there be? The other alternative I can think of is not practical: to have RLS submit all sorts of low-level compiler data to the IDE, and then have the IDE try to figure things out for itself. Mind you, the IDE would be a different process, and likely written in a different language (C, Java, etc.), so you are talking about marshaling very complex Rust data structures to a different language.

@eddyb
Member
eddyb commented Jan 19, 2016

@nrc I still don't understand why you would ever use an external database, and here's why:

We already have a highly specialized & efficient pure Rust relational database; we call it "librustc"

Our on-disk metadata format already has a hash index, but even ignoring that:
Incremental recompilation will also require us to have on-disk formats which support (amortized) O(1) insertion, removal and lookup.

So why use a relational database, when you have all of that infrastructure?
Why spend time binding to some C library and writing all of the transformations from our structures to something dumb enough for it to understand?

Relational databases certainly aren't magic and a lot of the complexity they have is around handling arbitrary data in arbitrary ways.
We have specialized data and special needs, therefore a relational database is not only overkill, but potentially harmful.

And I haven't really touched the save-analysis stuff: are you seriously suggesting that we dump all the compiler data in bulk into some stringly-typed mush?

How will that scale to very large crates like the compiler itself?
That's where I most need an IDE, and your plan seems to cripple that use case.

In any case, my prediction, were this plan of yours to be put in practice, is that at some point, someone will rewrite all of that unnecessary abstraction, reducing code size and eliminating one (or more) non-Rust dependencies.

You will have an increase in performance and anyone looking at the diff will wonder why it was so complicated in the first place.

@arielb1
Contributor
arielb1 commented Jan 19, 2016

@eddyb

sqlite is quite nice as a file storage format, though I am not sure how capable it will be of handling rustc's very chatty reads.

@ArtemGr
ArtemGr commented Jan 19, 2016

In any case, my prediction, were this plan of yours to be put in practice, is that at some point, someone will rewrite all of that unnecessary abstraction, reducing code size and eliminating one (or more) non-Rust dependencies.

That might very well be true and I'd like to point out that it is a good thing!

"When designing a new kind of system, a team will design a throw-away system (whether it intends to or not). This system acts as a "pilot plant" that reveals techniques that will subsequently cause a complete redesign of the system." - https://en.wikipedia.org/wiki/The_Mythical_Man-Month#The_pilot_system

Relational databases certainly aren't magic and a lot of the complexity they have is around handling arbitrary data in arbitrary ways.
We have specialized data and special needs, therefore a relational database is not only overkill, but potentially harmful.

Relational databases are self-documenting. You have a table, you can always look it up, see what columns, expression indexes, and foreign keys it has, browse the data, make some slight modifications necessary for debugging, etc. It's like always having a UML diagram on hand.

In my experience, and I have worked with both relational and key-value (LevelDB, LMDB, Oracle Berkeley DB Java Edition, Riak, Cassandra, Voldemort) databases, the relational side is much easier to prototype with. And as you've noted yourself, they're interchangeable: what you keep in a relational database you can keep in a key-value store and vice versa, so if the compiler can use a specialized index then it can use a relational database instead. The nice property of relational databases is that you don't have to maintain the index integrity or even the referential integrity by yourself. You don't have to reinvent the necessary abstractions. Want to upgrade your data to a different format? It's a matter of several SQL queries, not of writing a whole program just to make that tiny upgrade possible, or maintaining the code of the old model in order to be able to access that old format during the upgrade.

Eventually both approaches have their gotchas and shortcuts to efficient use. And both can fail miserably. It all depends. That's why I think that picking one over the other shouldn't be a part of this RFC but rather an implementation detail left to the implementors and what they're most familiar and comfortable with.

@nrc
Contributor
nrc commented Jan 19, 2016

@bruno-medeiros I agree we don't need to go in to too much detail - but whether we use a library or IPC seems important.

Is security the reason why people (@Valloric @jwilm, etc.) where suggesting using HTTP

I believe ease of use is a more important reason. Security has been mentioned as a reason not to use HTTP.

cargo check does nothing on my machine, is this something currently available only on nightly ?

Not exactly sure which version of Cargo is required, but it is fairly recent.

It may be better to use NoSql rather than an SQL db, I'm not an expert in the space. I know DXR uses ElasticSearch, but that is more to support fancy text searching. I've found old school SQL dbs to be quite a good match for PL stuff in the past - the data is very homogeneous, cross-referencing by id fits the relational paradigm nicely, etc.

@nrc
Contributor
nrc commented Jan 19, 2016

@eddyb all things being equal I would like to re-use the compiler's data, but there are a few reasons not to:

  • it's not usable now. Incremental compilation data isn't done and it's likely to change a fair bit even once it is done.
  • metadata and planned incremental compilation data does not store any info about function bodies - we would have to serialise all this extra info. It will probably not be used for anything else, so it would be just for the RLS. We would want to not emit it in other cases, because it would be a large amount of data.
  • none of this data is designed to be used by anyone except the compiler - sure it can be used, but it will not be trivial. In the compiler the functions for reading the data are tightly coupled to how the data is used, if we want to use it in other ways we need to implement that.
  • stability - this is all compiler-internal stuff which we might change at any moment.
  • complex searches - in the future we might want to add sophisticated searches like 'find all impls of this trait'; each search like this would mean adding another index over the data and thus more and more complexity.
  • concurrent access - if we're rebuilding (assume non-incremental for whatever reason) we can't query the data.

It still seems reasonable to argue that using the compiler's data is a better approach, but it is definitely not as clear-cut as you make out. It may well be that we choose the db path and that, once the incremental compilation stuff has settled and we have function body data, we can get rid of the db - that seems fine to me. But I don't think we can implement this today without the db.

And I haven't really touched the save-analysis stuff: are you seriously suggesting that we dump all the compiler data in bulk into some stringly-typed mush?

No I am not. I think you missed my reply to you above where I make a distinction between the save-analysis data dumps (data in bulk) and the save-analysis API which is more typed, allows access to data on a fine-grained basis, and is basically just a future-proof wrapper around the compiler's data.

@arielb1
Contributor
arielb1 commented Jan 19, 2016

@ArtemGr

Want to upgrade your data to a different format? It's a matter of several SQL queries and not of writing a whole program just to make that tiny upgrade possible,

Rust has a homegrown, basically-relational-database metadata format (we can't use an SQL database there because rustc is way too chatty). Many parts of Rust source fit there nicely. However, one of the most important things - types - is a very poor fit for a relational database because of their tree structure.

I suspect that all queries involving type information require a live compiler anyway. It would be nice, though, if someone (me?) wrote a metadata-to-sqlite translator.
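
A toy illustration of the tree shape (not rustc's actual type representation): a type can contain arbitrarily nested types, which maps naturally to recursive data rather than to a fixed set of relational columns.

```rust
enum Ty {
    Int,
    Ref(Box<Ty>),                         // &T
    Tuple(Vec<Ty>),                       // (T1, T2, ...)
    Adt { name: String, args: Vec<Ty> },  // Foo<T1, T2, ...>
}

// Vec<&(i32, Foo<i32>)> - three levels of nesting already.
fn example() -> Ty {
    Ty::Adt {
        name: "Vec".into(),
        args: vec![Ty::Ref(Box::new(Ty::Tuple(vec![
            Ty::Int,
            Ty::Adt { name: "Foo".into(), args: vec![Ty::Int] },
        ])))],
    }
}
```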

@sfackler
Member

SQLite virtual tables could potentially be used to provide a SQL interface to compiler metadata without needing to go through the overhead of updating physical tables: https://www.sqlite.org/vtab.html.

The question of whether the metadata is reasonably representable in SQL would still stand, of course.

@phildawes

@nrc Hi Nick, thanks for your reply. I'm glad the compiler api is not out of scope - I was under the impression that with this new cut-down proposal the compiler api wasn't deemed necessary because the RLS would provide the public interface.

I would however still like to argue that the compiler-api be the focus and priority:

  • I'm skeptical about the big-database-of-code approach, and want the door left open for alternatives
  • I'd also like to see (and have a go at) projects to bring refactoring to rust

I think the api could be as straightforward as a focused save-analysis-style api (e.g. analysis-for-function / analysis-for-item, crate top-level-analysis (no function bodies)).
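
A rough sketch of what such a result could look like (all the names here - Span, SpanInfo, analysis_for_function - are invented for illustration and are not an existing rustc API):

```rust
pub struct Span {
    pub file: String,
    pub line_start: u32,
    pub col_start: u32,
    pub line_end: u32,
    pub col_end: u32,
}

/// One record per expression/identifier inside the requested function or item.
pub struct SpanInfo {
    pub span: Span,
    pub ty: Option<String>,       // pretty-printed type, if known
    pub def_path: Option<String>, // path of the definition referred to, if any
}

/// Inputs mirror what you would pass to the compiler, plus the item of interest.
pub fn analysis_for_function(_compiler_args: &[String], _item_path: &str) -> Vec<SpanInfo> {
    unimplemented!("sketch only")
}
```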

@phildawes

To elaborate on the compiler api:

  • racer currently uses syntex-syntax for parsing rust code snippets, and this works surprisingly well. The api changes frequently (because it tracks the compiler libsyntax api), but keeping up with it is pretty ok and allows racer to be a crates.io project that can be built with rust stable. I don't think we need an alternative stable api for parsing at this point.
  • The new analysis api could start by mapping expression spans to type and definition paths. This is along the same lines as the save-analysis output, but I'm advocating that there be a rustc api for generating small pieces of this data (e.g. at the Item (function) level as a start). In order to keep things simple, the input to this analysis api could be the same arguments you would pass to the compiler.

I think (guess!) the combination of these two could be enough to feed the RLS and enable it to be a crates.io project rather than something that must be shipped with the compiler. It also would leave the door open for alternative approaches and other tools that are currently out of scope for RLS.

@phildawes

Hi @bruno-medeiros.

(Re: refactoring)

Hum, yes. Or the IDE developers have to modify RLS themselves to do that. I mean, RLS will be an open-source project, so it will be easy to add patches (or even fork/mod).

The problem is when you want to add something contentious to support some niche tooling, or try something experimental that might not pay off. Forking a project is a socially aggressive action, and it is a shame to have political fallout and hurt over some small technical point or direction. (we all saw it a couple of times on contentious rust PRs)

Also, without a compiler api, the RLS will probably need to be shipped with rustc, making it hard to build an alternative tool.

You seem to not like this scenario, but what alternative could there be?

The alternative I'm advocating is that there be a compiler-analysis api to extract the analysis information. (the analysis information is not dissimilar to the JDT/CDT index-document you described. Rustc has something similar as a way to get information into DXR, although I think it might be partially broken at the moment).

@nikomatsakis
Contributor

So, I've been sort of half-following this thread, but I wanted to throw in some thoughts. Sorry for not having read every comment in detail, I'm sure I'm repeating things that have been said. What's worse, I'm going to ask some questions at the end that have probably been answered. Humor me. :)

In general, I think that the idea of having an oracle process that serves as a kind of database for Rust metadata makes a lot of sense. The current compiler data structures are not designed to support IDEs or random queries, and I don't feel like I would want that as a constraint on how the compiler is structured internally. In particular, the compiler as it is today was really written to be a simple batch compiler a la gcc. It expects to compile a single crate and generate some output. We're working on changing that model, via incremental compilation and other things, but it's going to be a long road I think.

In the meantime, I think it makes sense to have a second layer that aggregates the output of multiple compilations (e.g., for multiple crates) and keeps a coherent view of things. This would keep that data indexed however it makes sense to have the data be indexed and be able to answer queries like "what was the type of the expression here" or "what traits are implemented for this type" and so forth. Obviously I'm glossing over a ton of details here. This layer is more-or-less what I thought the Oracle is.
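
As a sketch, the kinds of queries I have in mind could look something like this (the trait and method names are invented for illustration, not part of the RFC):

```rust
pub trait Oracle {
    /// "What was the type of the expression here?"
    fn type_at(&self, file: &str, line: u32, col: u32) -> Option<String>;
    /// "What traits are implemented for this type?"
    fn traits_implemented_for(&self, ty: &str) -> Vec<String>;
    /// Find-all-references for the definition with this id.
    fn references_to(&self, def_id: u64) -> Vec<(String, u32, u32)>;
}
```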

To me, it doesn't matter so much if this layer is actually a process that communicates via IPC, or some shared libraries that everybody uses. I sort of prefer processes for various reasons, but whatever. If it's some shared libraries, we'll presumably need some file locking or other mechanisms to cope with the scenario where the compiler is trying to update some shared repository of information and the IDE is trying to read it.

Over time, I could see that this oracle layer and the compiler move closer together, and maybe the compiler eventually becomes a nifty library that can be integrated more closely into IDEs. In that case the "oracle" would just be various APIs, and not really much in the way of code.

Does this overarching thought make sense? @nrc, is this roughly what you were thinking with this RFC, or am I totally off base? If it doesn't make sense, what are the major alternatives?

@nrc
Contributor
nrc commented Jan 21, 2016

@nikomatsakis spot on.

@nrc
Contributor
nrc commented Jan 21, 2016

@phildawes my intention is for the RLS to live outside the Rust repo and use the compiler as a library, like any other tool. Clearly it is going to be unstable for a long time (which is another good reason to put it in its own process instead of linking it).

My current model for a compiler API is that the tool uses libsyntax to parse source code, and then has the AST. You can then use AST nodes (currently by node id; spans have issues with macros) to query the compiler about types and so forth. This works pretty well; however, we might do better, for example by giving a synthetic 'AST' which also includes type information, etc. (the HAIR approach). We should have a discussion about how best to do this, but not here.
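
As a rough sketch of that query-by-node-id shape (the trait here is invented for illustration, not the actual save-analysis API; NodeId stands in for libsyntax's node ids):

```rust
pub struct NodeId(pub u32);

pub trait CompilerQueries {
    /// Pretty-printed type of the expression with this node id, if known.
    fn type_of(&self, node: NodeId) -> Option<String>;
    /// Node id of the definition that the identifier at `node` resolves to.
    fn definition_of(&self, node: NodeId) -> Option<NodeId>;
}
```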

I have no idea what functions should be presented by the API; I've been adding stuff on a pretty ad hoc basis to the save-analysis API (note, the API, not the dumps, which are both working fine, btw). My plan for the API is more to just implement anything anyone wants (and is possible) until we have a good idea of the use cases, then make an RFC to stabilise some of it. If there are specific things you want, please let me know and/or file issues (or PRs) on the Rust repo.

@phildawes

@nrc, @nikomatsakis

Hi Niko, Hi Nick,

In general, I think that the idea of having an oracle process that serves as a kind of database for Rust metadata makes a lot of sense.

Personally I'm wary of the database-of-code metaphor. I think it hides the two hard problems of rust code completion, and instead encourages thinking about indexing and cache invalidation. As I see it, the two hard technical problems that need solving are:

  1. complete-as-you-type requires rustc to get involved in the hot path.
    • type inference of the expression-being-completed can't be pre-cached. (database is no help here)
    • rustc must be capable of delivering the type (of the bit before the dot) in <100ms
  2. complete-as-you-type requires processing incomplete/broken code
    • the crate isn't semantically complete or compilable (again, database is no help here)

I don't think a code index is a requirement for fast completion. This is counter-intuitive, but racer specifically avoids doing it on grounds of simplicity and is fast, even for big multi-crate projects. (some discussion here).

I also don't think a database is required for find-references, but I have no evidence for this, only intuition.

@nrc
Contributor
nrc commented Jan 21, 2016

Hi @phildawes,

  1. I agree, I probably wouldn't use the database at all for code completion info - that is a purely local query so can work straight off the compiler (on a slightly separate note, I see the compiler/RLS only returning info after the ., updating the actual code completion suggestions based on what is being typed is the code completion tool's job).

  2. I don't think this is an issue. The compiler will process broken code (I've been working on more support for that this week). Code that has changed is invalidated from the DB, parts of code that we can type check are added, the rest is ignored. E.g., if the user types x.f; but there is only a foo field, the DB will have info about x but will ignore the f (we might even record that there is a field access, but the field name is unknown, if that is useful - see the sketch below).
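
A sketch of what such a partial record could look like (these types are invented for illustration, not the RLS schema):

```rust
struct VarInfo {
    name: String, // "x"
    ty: String,   // the inferred type of `x`, which can still be recorded
}

struct FieldAccessInfo {
    receiver: String,      // "x"
    field: Option<String>, // None when the typed field (`f`) does not resolve
}
```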

Find references is definitely the motivation for using a DB.

@nikomatsakis
Contributor

Personally I'm wary of the database-of-code metaphor. I think it
hides the two hard problems of rust code completion, and instead
encourages thinking about indexing and cache invalidation.

This makes a lot of sense. I certainly agree that adding an oracle
doesn't get us fast code completion for free -- that seems to require
not just incr. comp., but also a bit more work towards lazy
compilation and better error recovery (some of which is ongoing, as
@nrc notes). It does seem like an oracle would help for Find All
References, and I'm not sure what else.

Still, it seems to me that even in code completion, there is perhaps a
role for an oracle. In particular, you talk about wanting to determine
the type of the receiver of a method, but once you know that receiver,
you still want to figure out things like "what are the set of traits
that are implemented for that type?" (and note that those traits may
not be imported, so the compiler might not consider them
normally). This might be an ideal place for an oracle to come in.

Moreover, some completion scenarios don't necessarily require knowing
the full type, e.g., if I write some_fn(, then you can likely
resolve some_fn to a specific item (but in fact you really want to
be resolving some_fn all along to potential matches, even before
I've finished typing it).

Maybe there's another way to say what I'm saying. It seems clear that
"ideal IDE integration" is going to be a long time coming, and we're
going to be evolving how rustc works for some time. But in the short
to medium time frame, I expect you are going to want to do fast,
approximate queries across several crates, and it seems like we can
get that more easily through some separate oracle (you will also
want to be able to ask rustc for stuff and get back a fast answer, of
course).

And if in the future it happens that rustc gets fast enough that we
don't need a database to answer those queries, we can just recompute
the data, that's fine, right? All the old code that expected
potentially stale answers should be fine when their answers are 100%
up to date?

@bruno-medeiros

https://crates.io/crates/syntex_syntax/

Speaking of, who maintains this project? Are the Rust developers involved?

I think changes and improvements to the Rust parser would be a nice incremental step and starting point to improving the toolchain, with regards to how it's used by IDEs or IDE related tools (like Racer).

I've been working on one such tool, https://github.com/RustDT/Rainicorn (formerly called rust-parse-describe, which I mentioned in a previous comment some weeks ago), also using syntex_syntax, and I can see some avenues for improvement already. For starters, a panic-less parser would be nice; the way the parser currently works makes my code a bit more complicated than would otherwise be necessary.

Forking a project is a socially aggressive action, and it is a shame to have political fallout and hurt over some small technical point or direction.

I didn't mean forking as in a divergent, or organizational fork (duplication of efforts, vastly incompatible code baselines, no communication, development heading in different directions, etc.).

I meant fork as in the Git way: a clone/branch that you create, you make some modifications or additions, but it doesn't diverge that much from the upstream source, and you try to merge updates from the upstream source regularly. And occasionally you might submit patches to upstream as well.

@nrc
Contributor
nrc commented Jan 22, 2016

@bruno-medeiros syntex is maintained by @erickt, who is a core member of the community (community and moderation teams, as well as just being generally involved), but not part of the compiler team. It is pretty much just a straight clone of libsyntax from the compiler, so improvements to the compiler's parser show up in syntex quite quickly.

Part of the plan with procedural macros/syntax extensions is to present a stable interface for them to work on, at which point syntex gets a lot less necessary (only useful for tools). In the long term I'd like to stabilise enough of libsyntax that tools don't need it either.

There is work going on to make the parser panic-less; it no longer panics on error. I've also been doing some work on error recovery, for example rust-lang/rust#31065 adds some error correction for missing identifiers. Ideally it should be panic-free under normal use and recover from most errors within the next few months.

@nrc
Contributor
nrc commented Jan 22, 2016

@nikomatsakis I did not intend the DB to be used for the code completion suggestions. It is useful for find all references, also queries like find all impls of a given trait, find all sub-traits, etc.

@bruno-medeiros

Still, it seems to me that even in code completion, there is perhaps a role for an oracle

Just to be clear, the Oracle, as in, "the resident process that is responsible for serving requests of various sorts to the IDE", should handle code completion as well. Even if the data structures that are used to determine the results of code completion are entirely different from the data structures used for, say, find-references, it makes no sense for this to be two different processes, or two different tools. This is because, at the very minimum, these two operations can share cached AST data, not to mention that eventually (and sooner than later) we will want the Oracle to support functionality to manage/supply dirty editor buffers (i.e., use a document that is being edited in memory in an IDE, but has not yet been persisted to disk). Even the functionality I'm coding in the Rainicorn tool should ideally also eventually be integrated into an oracle.
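
As a sketch, a dirty-buffer ("overlay") request could carry something like this (the shape is invented for illustration and is not part of the RFC):

```rust
struct OverlayFile {
    path: String,     // the on-disk path the buffer corresponds to
    contents: String, // the unsaved editor contents to use instead of the file
}

struct CompletionRequest {
    file: String,
    line: u32,
    col: u32,
    overlays: Vec<OverlayFile>, // unsaved buffers the tool should prefer over disk
}
```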

Of course, as an early prototype, it's okay for different operations to be handled by different tools, etc., but the end goal should be to integrate everything in the oracle, all-knowing as it is. 😉 (gotta hand it to the Go guys, they chose the perfect name - apart from the trademark thing.. 😝 )

@nrc BTW, I was just looking at the Nim language, and to my surprise found out they have an "oracle" tool already: http://nim-lang.org/docs/idetools.html
And by the looks of it, it's quite a bit more advanced than the Go oracle; it actually seems to achieve all those key aspects that were mentioned before:

  • Handles incorrect/incomplete code well
  • Is a resident process
  • Supports managing dirty buffers
  • Supports all sorts of IDE query operations in a single tool: resolve-definition ("Definitions"), code-completion ("Suggestions"), find references ("Symbol usages"), etc.. (Something like parse-analysis seems to be missing though)
@bruno-medeiros

Ideally it should be panic-free under normal use and recover from most errors within the next few months.

Sweet 👍

@alexcrichton
Member

🔔 This RFC is now entering its two-week long final comment period 🔔

@lilianmoraru

Don't know if it is of any help but I will throw it out there.
QtCreator (mainly a C++ IDE) has a plugin that uses Clang to offer C++ code completion.
Here is the code:
http://code.qt.io/cgit/qt-creator/qt-creator.git/tree/src/libs/clangbackendipc
http://code.qt.io/cgit/qt-creator/qt-creator.git/tree/src/plugins/clangcodemodel

QtCreator also has an older custom C++ code model (which is being replaced by the clang one) that was a lot faster; it seems to be accepted in the community that the code model that uses the compiler will be considerably slower, but it offers more information and is more accurate.
I mention the speed because I saw @phildawes mentioning the hard requirement on the compiler to deliver the information in <100ms, and I am not sure we should expect it to be that fast.

@DemiMarie

@lilianmoraru The reason the compiler needs to deliver the information so quickly is that the GUI is waiting on it. The user expects the completion to appear as soon as they press ., and that requires the compiler to respond fast.

@michaelwoerister

I'm in favor of accepting this RFC. The approach seems worth exploring and it does not preclude improving the compiler's amenability to being used as a library. On the contrary, I think the compiler's APIs will benefit from trying to build the RLS on top of it, and it can only help if people on the compiler team are actual clients of their own APIs. The RFC leaves many open questions when it comes to specifics, and we just need a prototype implementation and the experience that comes from building it in order to decide how to proceed further. Worst case, we'll learn a bunch of stuff about what doesn't work :)

@erkinalp
erkinalp commented Feb 8, 2016

1-based line numbers and 0-based column numbers please.

@nrc
Contributor
nrc commented Feb 9, 2016

@erkinalp why?

@bruno-medeiros

@erkinalp why?

I was wondering the same, why is a mixed format being used (1-based in one, and 0-based for the other)

@erkinalp
erkinalp commented Feb 9, 2016

Emacs uses 0-based column numbers.

@ticki
Contributor
ticki commented Feb 9, 2016

@erkinalp That's a perfect reason for not doing that 😜 .

@nrc
Contributor
nrc commented Feb 9, 2016

afaik, there is no standard for editors to use 1-based line numbers. Having both 0-based seems like the least confusing thing to do; it's easy enough for editors to add one to each line number.

@dgrunwald
Contributor

Compilers (including rustc) tend to use 1-based line and column numbers in their error messages. Following that standard seems like the least confusing thing to do. I certainly wouldn't expect 0-based line numbers.

But the concepts of "line" and "column" are ill-defined anyways.
Does a vertical tab (\v) count as a new line? What about form feed? What about all the other exotic Unicode newlines?
Does a tab count as N columns (where N is configurable; usually 4 or 8?), does it advance to the next multiple of N columns, or does it count as only 1 character?
Are full-width characters like 'ｘ' one or two columns wide?
Or does the editor count each Unicode codepoint as 1 column? Maybe a 'column' really is a UTF-8 byte offset within the line? Maybe it's measured in UTF-16 code units? Grapheme clusters?

It's difficult to find two editors that agree in their line/column counting for all possible input files.

@Valloric
Valloric commented Feb 9, 2016

I've integrated 5+ semantic engines in ycmd, and the only thing that makes sense is 1-based line and column numbers. Columns are byte offsets in UTF-8. Done.

it's easy enough for editors to add one to each line number.

But why should they? Line & column numbers coming from your oracle will be shown to the user and they expect 1-based numbering.

there is no standard for editors to use 1-based line numbers.

And yet they ~all do use 1-based numbers in the user interface. When you put your caret on the first line in the file, the editor doesn't say the line number is 0, it says it's 1. Same for columns.

@erkinalp

Vertical tab means skip one line below and continue from same column offset.
CR, CR+LF, LF, LS and NEL are regular line feeds.
FF and PS count as two lines instead of one.

@bruno-medeiros

I've integrated 5+ semantic engines in ycmd, and the only thing that makes sense is 1-based line and column numbers. Columns are byte offsets in UTF-8. Done.

Why byte offsets and not Unicode character offsets? It's not like an error or position for a Rust symbol will ever start in the middle of a Unicode character.

there is no standard for editors to use 1-based line numbers.

And yet they ~all do use 1-based numbers in the user interface. When you put your caret on the first line in the file, the editor doesn't say the line number is 0, it says it's 1. Same for columns.

Because the internal API for lines and columns can be 0-based, despite the UI being 1-based. This is certainly the case for Eclipse, for IntelliJ, and probably for most IDEs/editors out there. It would not surprise me if Vim is the odd one out... 😆
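
As a sketch of that split (names invented for illustration): the internal/API positions stay 0-based, and only the display layer shifts them.

```rust
#[derive(Clone, Copy)]
struct Position {
    line: u32, // 0-based line index
    col: u32,  // 0-based UTF-8 byte offset within the line
}

impl Position {
    /// What an editor would display to the user: both components shifted to 1-based.
    fn display(self) -> (u32, u32) {
        (self.line + 1, self.col + 1)
    }
}
```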

@adelarsq

Keep it simple. Just use 0-based for lines and columns.

@alexcrichton
Member

This RFC was discussed during the tools team triage today and the decision was to merge. This RFC is still at a somewhat high level and some minor details can continue to be ironed out in the implementation over time, but there seems to be widespread agreement about the body of the RFC here.

Thanks again for the discussion everybody!

@alexcrichton alexcrichton merged commit 37d72be into rust-lang:master Feb 11, 2016
@alexcrichton alexcrichton referenced this pull request in rust-lang/rust Feb 11, 2016
Open

Tracking issue for the Rust Language Server #31548
