
Fuse emitting and printing of trees in the backend #4917

Merged: 3 commits merged into scala-js:main on Jan 29, 2024

Conversation

@gzm0 (Contributor) commented Dec 9, 2023

This allows us to use the Emitter's powerful caching mechanism to
directly cache printed trees (as byte buffers) and not cache
JavaScript trees anymore at all.

This reduces in-between run memory usage on the test suite from
1.13 GB (not GiB) to 1.01 GB on my machine (roughly 10%).

Runtime performance (both batch and incremental) is unaffected.

It is worth pointing out that, in order to avoid duplicate caching, we do not cache full class trees anymore.
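The idea can be sketched as follows. This is a minimal illustration with hypothetical names (`PrintedCache`, `getOrCompute`), not the actual Emitter API:

```scala
import java.nio.charset.StandardCharsets

// Minimal sketch of caching printed output instead of JS trees.
// `PrintedCache` and `getOrCompute` are hypothetical names for illustration.
final class PrintedCache {
  private var lastVersion: Option[String] = None
  private var printedBytes: Array[Byte] = Array.emptyByteArray

  def getOrCompute(version: String, print: () => String): Array[Byte] = {
    if (!lastVersion.contains(version)) {
      // First run or input changed: print once, keep only the bytes, and
      // let the JavaScript tree become garbage.
      printedBytes = print().getBytes(StandardCharsets.UTF_8)
      lastVersion = Some(version)
    }
    printedBytes
  }
}
```

While the version is unchanged, subsequent runs reuse the cached byte buffer and never rebuild or reprint the JavaScript tree.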

@gzm0 gzm0 marked this pull request as draft December 9, 2023 21:20
@gzm0 gzm0 changed the title Fuse emitting and printing of trees in the backend RFC: Fuse emitting and printing of trees in the backend Dec 9, 2023
@gzm0 (Contributor, Author) commented Dec 9, 2023

Performance measurements

Sorry for the swapped colors between inc and batch.

Incremental (including warmup run)

Full

plot-inc-full

Cropped (same data)

plot-inc-limited

Batch

plot-batch

Supplemental info

Script to calculate "Full Backend" timing.

library(readr)
library(ggplot2)
library(dplyr)

d_full <- read_csv("logger-timings-batch.csv", col_names = c("variant", "op", "t_ns"), col_types = "ffn")

# Sum the "Emitter" and "BasicBackend: Write result" timings into a
# synthetic "Full Backend" operation.
d_backend <- bind_cols(
  d_full %>% filter(op == "Emitter") %>% select(variant, t_ns),
  d_full %>% filter(op == "BasicBackend: Write result") %>% select(t_ns) %>% rename(t_ns2 = t_ns)
) %>% mutate(op = "Full Backend", t_ns = t_ns + t_ns2, t_ns2 = NULL)

d <- bind_rows(
  d_full %>% filter(grepl('Emitter|BasicBackend', op)),
  d_backend
)

ggplot(d, aes(x = op, color = variant, y = t_ns)) + geom_boxplot()

logger-timings-batch.csv
logger-timings-inc.csv

@gzm0 gzm0 requested a review from sjrd December 9, 2023 21:29
@gzm0 (Contributor, Author) commented Dec 17, 2023

I've had a look at this with YourKit.

Findings:

| What | Size [MB] |
| --- | ---: |
| main: JS trees | 232 |
| main: printed trees | 120 |
| main: total | 352 |
| pr: printed trees | 237 |
| absolute savings (pr) | 115 |
| wasted nested trees (pr) | 117 |

Thoughts:

  • The savings are consistent with the overall heap measurements (equal up to rounding error).
  • It seems we're wasting a lot of space on nested trees, even on a codebase with few nested invocations (710 cases out of 6538). This is a major concern, since codebases heavier in JS classes will suffer from this even more.

@gzm0 (Contributor, Author) commented Dec 17, 2023

I don't think I'm comfortable proceeding with this given the potential negative impacts depending on the codebase.

Options I see:

  • Abandon this idea.
  • Stop caching the entire class and only cache its parts. @sjrd do you think this might be a viable approach?

@gzm0 (Contributor, Author) commented Dec 17, 2023

FWIW: I'm going to try and benchmark the second option :)

@sjrd (Member) commented Dec 17, 2023

  • Stop caching the entire class and only cache its parts. @sjrd do you think this might be a viable approach?

That is probably viable, yes.

@gzm0 (Contributor, Author) commented Dec 17, 2023

Numbers look comparable (incremental):

plot
logger-timings.csv

@gzm0 (Contributor, Author) commented Dec 25, 2023

I have made some progress here.

@sjrd there are two points I'd like your input on:

  • Getting proper indentation is indeed not super tricky; what I do not know is how to test it. So far, tree printing was just unit tested (which was fine). However, at this point a lot of non-local decisions go into indenting the cached blocks, so it doesn't seem feasible to unit test this :-/ @sjrd any ideas?
  • In order to properly deal with semicolons in the printer, I needed to move semicolon printing into the statement (rather than the block context). As a result, we also print a semicolon after the last statement in a block. This leads to quite a bit of (uncompressed) fastlink size increase. Is this acceptable? (relevant commit: 567cb00)
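The semicolon difference can be shown with a toy printer. This is illustrative only, not the actual Scala.js printer:

```scala
// Semicolon as a block-level separator: no ';' after the last statement.
def printBlockSeparators(stats: List[String]): String =
  stats.mkString("{", "; ", "}")

// Semicolon as part of each statement: self-contained cached chunks, but
// a trailing ';' after the last statement of every block.
def printBlockTerminators(stats: List[String]): String =
  stats.map(_ + ";").mkString("{", " ", "}")

printBlockSeparators(List("a()", "b()")) // "{a(); b()}"
printBlockTerminators(List("a()", "b()")) // "{a(); b();}"
```

Per-statement semicolons let each cached printed fragment be concatenated without knowing whether it is the last statement of its block, at the cost of the extra trailing semicolons.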

TODOs for me:

  • Clean up the last 4 commits.
  • Benchmark (both memory and speed)
  • Diff the output of the linked test suite to check it is unchanged.

@gzm0 (Contributor, Author) commented Dec 28, 2023

Updated benchmarking results (incremental only):

plot

Cropped:

plot-cropped
logger-timings.csv

@gzm0 (Contributor, Author) commented Dec 28, 2023

Note to self: The adjustments to the JS printer are incorrect: There are semicolons after flow-control statements (e.g. if-else chains).

@gzm0 gzm0 force-pushed the more-cache branch 4 times, most recently from 268775c to bea04d2 Compare December 31, 2023 07:14
@gzm0 gzm0 changed the title RFC: Fuse emitting and printing of trees in the backend Fuse emitting and printing of trees in the backend Dec 31, 2023
@gzm0 (Contributor, Author) commented Dec 31, 2023

@sjrd, this and the dependencies (#4920, #4921, #4922) are ready for review.

I suggest you review 22db2f6 first, to get an idea of what the overall thing is doing.

@gzm0 (Contributor, Author) commented Jan 1, 2024

The CI failure reveals a problem with the `Transformed` tree: we cannot implement `show` anymore for a tree (notably not for trees that contain non-`PrintedTree` chunks).

@gzm0 (Contributor, Author) commented Jan 1, 2024

Fixing `show` (and running scripted locally) reveals another problem: we cannot cache GCC trees that easily.

It seems they keep a reference to their parents (and siblings). So caching them potentially leaks large amounts of memory, but also doesn't actually let us re-use the nodes.

Maybe the best strategy for now is to not cache AST transforms for GCC?
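A toy illustration of why parent-linked nodes cache poorly (this is not GCC's actual `Node` class, just a minimal sketch of the same structure):

```scala
import scala.collection.mutable.ListBuffer

// Illustrative only: a node that, like GCC's AST nodes, knows its parent.
final class Node(val name: String) {
  var parent: Node = null
  val children = ListBuffer.empty[Node]
  def attach(child: Node): Unit = {
    child.parent = this
    children += child
  }
}

val oldRoot = new Node("oldRoot")
val cached = new Node("cached")
oldRoot.attach(cached)
// Caching `cached` retains `oldRoot` (and everything reachable from it)
// through the parent link, so the old AST cannot be garbage-collected.

val newRoot = new Node("newRoot")
newRoot.attach(cached)
// Re-attaching mutates the cached instance (its parent changes), so it
// cannot safely be shared between the old and the new tree anyway.
```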

@sjrd (Member) commented Jan 1, 2024

  • Maybe the best strategy for now is to not cache AST transforms for GCC?

That's probably fine. When we run GCC, we're not incremental anyway, because GCC itself isn't. One level of transformation is not going to change much, especially if we have to recreate new Nodes anyway.

@gzm0 gzm0 force-pushed the more-cache branch 2 times, most recently from 0885927 to b255236 Compare January 2, 2024 11:02
@gzm0 (Contributor, Author) commented Jan 2, 2024

I have added a commit (to be squashed) that stops caching GCC trees. TODO: write a unit test for the GCC backend that links twice consecutively (we should not have had to rely on a scripted test to catch this).

gzm0 added a commit to gzm0/scala-js that referenced this pull request Jan 2, 2024
@gzm0 (Contributor, Author) commented Jan 2, 2024

PR for test is here: #4924.

gzm0 added a commit to gzm0/scala-js that referenced this pull request Jan 2, 2024
```scala
(_tree, false)
// Input has not changed and we were not invalidated.
// --> nothing has changed (we recompute to save memory).
(compute, false)
```
Review comment by @sjrd (Member):

This seems weird. Now both sides of the `if` call `compute`, and they do not use its result other than to return it. Shouldn't we return only the `Boolean` then, and let the caller do the `compute` themselves? That would avoid the tuple and, more importantly, the `extractChangedAndWithGlobals`.

Reply by @gzm0 (Contributor, Author):

Hum, yes. You have discovered what I have attempted to brush under the rug 😅 If I extract `compute` after calling `trackChanged`, `org.scalajs.linker.BasicLinkerBackendTest.noInvalidatedModuleInSecondRun` fails.

I'm still at the stage of debugging where I think "this cannot happen". So all-in-all, I guess this clearly warrants further investigation :P

Reply by @gzm0 (Contributor, Author):

I found the issue: the alternative code was using a short-circuiting or-assignment (`||=`), so the RHS was not always executed. Putting it into a `val` first fixed the problem. (Updated.)
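The pitfall is easy to reproduce in isolation (`trackChangedLike` is a hypothetical stand-in for the real `trackChanged`):

```scala
var rhsEvaluations = 0
def trackChangedLike(): Boolean = { rhsEvaluations += 1; true }

var changed = true
// `changed ||= trackChangedLike()` desugars to
// `changed = changed || trackChangedLike()`; since `||` short-circuits,
// the side-effecting RHS is skipped whenever `changed` is already true.
changed ||= trackChangedLike() // rhsEvaluations is still 0 here

// The fix: force evaluation into a val first, then combine.
val didChange = trackChangedLike()
changed = changed || didChange // rhsEvaluations is now 1
```

Binding the RHS to a `val` guarantees the change-tracking side effect runs exactly once per iteration, regardless of the current value of `changed`.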

This allows us to use the Emitter's powerful caching mechanism to
directly cache printed trees (as byte buffers) and not cache
JavaScript trees anymore at all.

This reduces in-between run memory usage on the test suite from
1.12 GB (not GiB) to 1.00 GB on my machine (roughly 10%).

Runtime performance (both batch and incremental) is unaffected.

It is worth pointing out that, due to how the Emitter caches trees,
classes that end up being ES6 classes will be held twice in memory
(once as the individual methods, once as the entire class).

On the test suite, this is the case for 710 cases out of 6538.
In the next commit, we want to avoid caching entire classes because of
the memory cost. However, the BasicLinkerBackend relies on the
identity of the generated trees to detect changes: Since that identity
will change if we stop caching them, we need to provide an explicit
"changed" signal.
This reduces some memory overhead for negligible performance cost.

Residual (post-link) memory benchmarks for the test suite:
baseline 1.13 GB, new 1.01 GB.
@gzm0 gzm0 marked this pull request as ready for review January 29, 2024 13:15
@gzm0 gzm0 requested a review from sjrd January 29, 2024 13:15
@sjrd (Member) left a review:

Cool. Looks good! Thanks!

@sjrd sjrd merged commit 64e7725 into scala-js:main Jan 29, 2024
3 checks passed
@gzm0 gzm0 deleted the more-cache branch February 3, 2024 13:25