Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow memory leak in Dotty's parser #1584

Open
Blaisorblade opened this issue Oct 12, 2016 · 24 comments
Open

Slow memory leak in Dotty's parser #1584

Blaisorblade opened this issue Oct 12, 2016 · 24 comments

Comments

@Blaisorblade
Copy link
Contributor

Splitting out of #1527 to give a clearer summary.

Dotty interns identifiers and never releases them, even if you drop the compiler instance, because the hashtable used for interning is stored in a static field. While this only leaks memory slowly, I'm not sure that's OK (though @odersky is not convinced this is a real problem). So if one wants to avoid this leak in Ensime/ScalaIDE, it seems one would have to load Dotty in a separate classloader.

I'm trying to anticipate whether this is indeed a problem, so pinging @fommil and @sschaef:

  • Is that indeed a false problem for Ensime/ScalaIDE?
  • If the problem is there, is loading Dotty in a separate classloader (and discarding it on different projects) acceptable for Ensime/ScalaIDE? Is that something they do today with Scalac?
@Blaisorblade
Copy link
Contributor Author

For completeness, let me quote @odersky's response here too:

As long as you compile the same files, the nametable will not grow, since
everything is hashed. In my experience, even if you compile different
files, the name table grows very slowly. There are simply not so many
different names to deal with. So, it could become a problem at some point,
but right now I am not seeing it.

@fommil
Copy link

fommil commented Oct 12, 2016

@Blaisorblade the reason I bring up https://github.com/fommil/class-monkey is because it is not possible to "discard" a classloader that uses scalac at present because of several locations where files are not closed correctly (fundamentally JarFile's finalizer is never called). Hopefully this is remedied in dotty. I have several workarounds for scalac, including an additional monkey patch in ensime itself https://github.com/ensime/ensime-server/blob/2.0/monkeys/src/main/scala-2.11/scala/reflect/io/ZipArchive.scala

@smarter
Copy link
Member

smarter commented Oct 12, 2016

@fommil Right now dotty relies on scala.reflect.io though we'd love to get rid of that dependency.

@smarter smarter closed this as completed Oct 12, 2016
@smarter smarter reopened this Oct 12, 2016
@Blaisorblade
Copy link
Contributor Author

@fommil OK, I see that's relevant. Is that the only obstacle or is there more?

@fommil
Copy link

fommil commented Oct 12, 2016

that's probably the biggest memory leak that I'm aware of with scalac. Another memory leak we've observed is when the PC just doesn't return and we have to start a new one, but the old one never dies (eating CPU and memory). Having the ability to really demand "look, here, just stop what you're doing old chap" would be fantastic.

@Blaisorblade
Copy link
Contributor Author

Blaisorblade commented Oct 12, 2016

Sounds like you'd like to "isolate" Scalac more. I feared it would be hard.* If we do that, there are fewer reasons to fix the slow leak discussed in this issue.

*EDIT: I mean: I feared that encapsulating the compiler from outside would be an annoyance for clients, in which case on Dotty would have to fix all memory leaks instead of relying on "compiler encapsulation". Relying on encapsulation would reduce non-functional requirements on Dotty (a bit à la Erlang).

@kiritsuku
Copy link

I would like to avoid leaking memory, even if it is only a little bit. It simply makes up-scaling of the compilation process more difficult. We already have problems with scalac because some resources are not correctly handled/freed. Especially code generation in our test suite is a pain because we have to ensure uniqueness of types/symbols manually. Resetting an existing scalac instance is simply not possible. Btw, we generate thousands of compilation units in tests, therefore even a small leak may have a larger impact.

Because of complexity overhead, we often don't reuse existing instances of scalac but recreate them. This adds some startup overhead though. However, if dotc is able to minimize the startup overhead, it can leak as much memory as it wants because then we can simply trash it whenever we want. But I'm not sure to which extend it is possible to minimize the startup overhead. The largest problem is that for large projects a lot of dependencies may need to be recompiled.

@Blaisorblade
Copy link
Contributor Author

@sschaef Thanks—I don't claim to have a full overview of Dotty, but I'd guess this is better fixed earlier than later. But @odersky argued fixing this is not easy, so a few questions:

Btw, we generate thousands of compilation units in tests, therefore even a small leak may have a larger impact.

This leak is from interning names—Dotty stores forever (in an object field) all the unique names it ever saw, so reusing names won't cause leaks.

it can leak as much memory as it wants because then we can simply trash it whenever we want

That won't help with the leak under discussion, since the leaked memory is stored in an object field. That's why I talk about discarding classloaders—unloading a class is the only way to discard static members. (Storing this table in Context is also apparently a problem).

@retronym
Copy link
Member

Discarding state by throwing away classloaders is problematic for performance, as you also throw away all the JITted code.

@retronym
Copy link
Member

A pragmatic solution here would be to add a clear method to the name table. Obviously that would be thread-unsafe, and could only be used when no compilers are active.

@Blaisorblade What did you mean by:

Storing this table in Context is also apparently a problem

For large multi-project builds, one problem is that you end up with lots of duplication between the name tables of the compilers for each sub-project. Name table sharing would be worthwhile in those cases. This could be still be achieved without a global name table by passing in an existing name table when creating the root context for a new compiler.

@Blaisorblade
Copy link
Contributor Author

Blaisorblade commented Oct 12, 2016

A pragmatic solution here would be to add a clear method to the name table. Obviously that would be thread-unsafe, and could only be used when no compilers are active.

👍 💯
Important nitpick: "No active compiler" doesn't make it safe, but that'd be fixed by memory barriers to prevent reorderings by CPUs/caches/compilers (say, creating a happens-before edge via writes/reads to volatiles—clear would need to write to one, while creating a new compiler would have to read the same volatile). Is this worth mentioning or it is obvious here?

What did you mean by:

Storing this table in Context is also apparently a problem

I just referred (badly?) to Odersky's rebuttal of the proposal in the parent thread:
#1527 (comment)

Can't we just move the relevant hashtable to whatever the equivalent of
Global is?

No, names have conceptually larger scope than a single global or, in the
case of dotty, root context). I [tried] that early on in the dotty design and
it became a mess quickly.

@Blaisorblade
Copy link
Contributor Author

A pragmatic solution here would be to add a clear method to the name table.

Forgot: Do we probably want a more generic API for clients, in case it later needs to clear other object fields?

@fommil
Copy link

fommil commented Oct 13, 2016

Interning of symbol tables sounds very reasonable to me. We would never parse an entire project in ensime, just the sources the user is interested in.

Well pointed out on JIT. If we can avoid having to use a classloader to use dottypc that would be good. If it used classloaders internally to just load resources, that's fine but please remember to close them after.

Somewhat related is the functionality provided by scalap. With scalap we really do scan the entire classpath, every classfile, and if that were to introduce a memory leak we'd be in huge trouble. Will scalap continue our will there be a replacement for it? We only use a small part of scalap and it might be easier if you just wrote something from scratch.

@smarter
Copy link
Member

smarter commented Oct 13, 2016

scalap can't be reused since pickling is completely different in dotty, you can get similar information (and much more) from unpickling tasty, though there's no public API to do this currently.

@fommil
Copy link

fommil commented Oct 13, 2016

Ok, good we caught this early, me definitely need that functionality to build or index. Should I create a ticket or will you? Happy to do a walkthrough of our usecase.

@odersky
Copy link
Contributor

odersky commented Oct 13, 2016

I think it's possible to move the name table to the root context of a
project (i.e. move it to class ContextState). We'd lose the ability to talk
about a name independently of a compiler instance, but we'd prevent
possible memory leaks for good. Note that once we start clearing a static
NameTable we lose the ability to use names outside of compiler instances
anyway.

On Thu, Oct 13, 2016 at 7:12 AM, Sam Halliday notifications@github.com
wrote:

Ok, good we caught this early, me definitely need that functionality to
build or index. Should I create a ticket or will you? Happy to do a
walkthrough of our usecase.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#1584 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAwlVrjPyjjTBJEflvofBNGjhJInJ1urks5qzb2pgaJpZM4KUkE7
.
{"api_version":"1.0","publisher":{"api_key":"
05dde50f1d1a384dd78767c55493e4bb","name":"GitHub"},"entity":
{"external_key":"github/lampepfl/dotty","title":"
lampepfl/dotty","subtitle":"GitHub repository","main_image_url":"
https://cloud.githubusercontent.com/assets/143418/17495839/a5054eac-5d88-
11e6-95fc-7290892c7bb5.png","avatar_image_url":"https://
cloud.githubusercontent.com/assets/143418/15842166/
7c72db34-2c0b-11e6-9aed-b52498112777.png","action":{"name":"Open in
GitHub","url":"https://github.com/lampepfl/dotty"}},"
updates":{"snippets":[{"icon":"PERSON","message":"@fommil in #1584: Ok,
good we caught this early, me definitely need that functionality to build
or index. Should I create a ticket or will you? Happy to do a walkthrough
of our usecase."}],"action":{"name":"View Issue","url":"https://github.
com//issues/1584#issuecomment-253416070"}}}

Prof. Martin Odersky
LAMP/IC, EPFL

@smarter
Copy link
Member

smarter commented Oct 13, 2016

Ok, good we caught this early, me definitely need that functionality to build or index. Should I create a ticket or will you? Happy to do a walkthrough of our usecase.

A walkthrough would be interesting. Note that when I say "public API" I really mean "external API that we plan to support", but there's no external API for anything in dotty yet (except the very limited https://github.com/lampepfl/dotty/tree/master/interfaces), for now using the compiler internal APIs is fine. There's no good usage examples now but I'm currently experimenting with using tasty to provide some IDE features such as jump to definition and (maybe) renaming, I'll keep you up to date.

@fommil
Copy link

fommil commented Oct 13, 2016

@odersky for ensime it might be useful to have a shared name table. Consider the case where we are spawning new presentationcompiler instances fairly regularly (e.g. one per sub-project, and for the project's tests, and maybe fresh PCs for refactoring operations).

@fommil
Copy link

fommil commented Oct 13, 2016

@smarter what do you mean by "jump to definition"? you'd need to understand where source jars are stored to do this, and also java. It seems like this is an area where what core dotty can do should be limited to avoid duplicated efforts in ensime.

@smarter
Copy link
Member

smarter commented Oct 13, 2016

@fommil I'm ignoring java and outside projects, we store the source file path in Tasty so we should have the information we need, I'm not trying to duplicate ensime but rather to evaluate how various parts of dotty (including tasty, which was not explicitly designed for interactive use) can be used in IDEs. I'll have code and a write-up in a few days.

@fommil
Copy link

fommil commented Oct 13, 2016

@smarter I'm not sure you have everything you need in there, you won't know the location of the jar or the source directory root. If what you are providing is (relative path from source root, source file name, position in file) then that's perfect. The current classfiles only give (source file name, line number) for java FQNs, so this would be a good upgrade. It's important that you store only the relative path from the root, not the full path. (this will clean up some of the nasty workarounds we need in ENSIME! https://github.com/ensime/ensime-server/blob/2.0/core/src/main/scala/org/ensime/indexer/SourceResolver.scala)

@fommil
Copy link

fommil commented Oct 13, 2016

also, this could be awesome for debugging if this information is available at runtime by converting source/line to specific blocks of code! e.g. imagine being able to "step into lambda"

@smarter
Copy link
Member

smarter commented Oct 13, 2016

I agree that it should be relative, not sure if it is right now, which is exactly why I'm testing these features :). In any case this is very much off-topic, we can chat more on the dotty gitter if you want.

@Blaisorblade
Copy link
Contributor Author

Blaisorblade commented Mar 20, 2018

For anybody caring: the static table is still there under dotty.tools.dotc.core.Names.table.
https://github.com/lampepfl/dotty/blob/e5ff11c111e0051061e3a4edcd9334a9ab76b4ff/compiler/src/dotty/tools/dotc/core/Names.scala#L542

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants