High memory usage during composition + inability to increase v8 composer's heap size results in OOMs #863

Open
jsalem-brex opened this issue Jun 10, 2024 · 6 comments
Labels
bug Something isn't working internally-reviewed The issue has been reviewed internally.

Comments

@jsalem-brex

Component(s)

composition

Component version

v0.0.0-20240521213547-4077ab4a01f0

wgc version

N/A

controlplane version

0.88.2

router version

0.88.9

What happened?

If possible, please create a PR with a failing test to illustrate the issue clearly.
Otherwise, please attach a minimum reproduction through a GitHub repository that includes
essential information such as the relevant subgraph SDLs.
Please also make sure that the instructions for the reproduction are clear, tested, and fully accurate.

Description

We are testing self-hosted Cosmo in our backend as we transition from Apollo. We recently enabled a validation mode in which we compose using both Apollo and Cosmo's v8 composer. Cosmo's v8 isolate routinely OOMs during composition. At first these OOMs happened at the 3Gi limit we placed on the container. We then raised the composition container's memory limit to 6Gi, but the OOM still occurs at ~4Gi. v8 has a default heap size of 4GB and requires the --max-old-space-size flag (passed directly or via NODE_OPTIONS) to increase its available heap size.

Normally, I would increase the heap size for v8, but I don't see a mechanism to configure the v8 isolate in

func newVM() (*v8Vm, error) {
	isolate := v8.NewIsolate()
	global := v8.NewObjectTemplate(isolate)
	stringHash := v8.NewFunctionTemplate(isolate, stringHashV8)
	if err := global.Set("stringHash", stringHash, v8.ReadOnly); err != nil {
		return nil, err
	}
	urlParse := v8.NewFunctionTemplate(isolate, urlParseV8)
	if err := global.Set("urlParse", urlParse, v8.ReadOnly); err != nil {
		return nil, err
	}
	ctx := v8.NewContext(isolate, global)
	if _, err := ctx.RunScript(jsPrelude, "prelude.js"); err != nil {
		return nil, fmt.Errorf("error running prelude: %w", debugErr(err))
	}
	if _, err := ctx.RunScript(indexJs, "shim.js"); err != nil {
		return nil, fmt.Errorf("error loading shim: %w", debugErr(err))
	}
	shim, err := ctx.Global().Get("shim")
	if err != nil {
		return nil, fmt.Errorf("error retrieving shim: %w", debugErr(err))
	}
	shimFunc := func(name string) (*v8.Function, error) {
		fpv, err := shim.Object().Get(name)
		if err != nil {
			return nil, fmt.Errorf("error retrieving shim function %s: %w", name, debugErr(err))
		}
		fp, err := fpv.AsFunction()
		if err != nil {
			return nil, fmt.Errorf("error converting shim function to JS function %s: %w", name, debugErr(err))
		}
		return fp, nil
	}
	federateSubgraphs, err := shimFunc("federateSubgraphs")
	if err != nil {
		return nil, err
	}
	buildRouterConfiguration, err := shimFunc("buildRouterConfiguration")
	if err != nil {
		return nil, err
	}
	return &v8Vm{
		isolate:                  isolate,
		ctx:                      ctx,
		null:                     v8.Null(ctx.Isolate()),
		federateSubgraphs:        federateSubgraphs,
		buildRouterConfiguration: buildRouterConfiguration,
	}, nil
}
type vm = v8Vm
.

I'm currently hacking in a patch that uses v8.SetFlags("--max-old-space-size=<new limit>"), but there should probably be a generic mechanism for passing a config to the v8 runtime.
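
To make the request concrete, here is a minimal sketch of what such a generic mechanism could look like. VMOptions and its Flags method are hypothetical names that do not exist in Cosmo; the sketch only builds the flag strings and deliberately does not import v8go, on the assumption that newVM would hand the result to v8.SetFlags before creating the isolate.

```go
package main

import (
	"fmt"
	"strings"
)

// VMOptions is a hypothetical config struct; nothing like it exists in Cosmo today.
type VMOptions struct {
	// MaxOldSpaceSizeMB maps to V8's --max-old-space-size flag (in MiB).
	// Zero leaves the V8 default untouched.
	MaxOldSpaceSizeMB int
	// ExtraFlags are additional V8 flags passed through verbatim.
	ExtraFlags []string
}

// Flags renders the options as a list of V8 command-line flags.
func (o VMOptions) Flags() ([]string, error) {
	if o.MaxOldSpaceSizeMB < 0 {
		return nil, fmt.Errorf("max old space size must not be negative, got %d", o.MaxOldSpaceSizeMB)
	}
	var flags []string
	if o.MaxOldSpaceSizeMB > 0 {
		flags = append(flags, fmt.Sprintf("--max-old-space-size=%d", o.MaxOldSpaceSizeMB))
	}
	for _, f := range o.ExtraFlags {
		// Basic sanity check only; V8 itself decides whether a flag is known.
		if !strings.HasPrefix(f, "--") {
			return nil, fmt.Errorf("invalid V8 flag %q: expected a leading \"--\"", f)
		}
		flags = append(flags, f)
	}
	return flags, nil
}

func main() {
	flags, err := VMOptions{MaxOldSpaceSizeMB: 5120, ExtraFlags: []string{"--trace-gc"}}.Flags()
	if err != nil {
		panic(err)
	}
	// Assumption: newVM would call v8.SetFlags(flags...) before v8.NewIsolate().
	fmt.Println(strings.Join(flags, " ")) // prints "--max-old-space-size=5120 --trace-gc"
}
```

Keeping the options as a plain struct would let callers set the heap limit from their own configuration (for example, derived from the container memory limit) without the library having to know about every V8 flag.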

Excerpt from OOM logs:

[16:0x7f1990751570] 499 ms: Scavenge 50.2 (69.1) -> 43.6 (70.6) MB, 115.1 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
# Fatal javascript OOM in MarkCompactCollector: young object promotion failed
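
For anyone triaging logs like these, a small sketch of pulling the heap figures out of a Scavenge trace line may help. It assumes the usual reading of V8's GC trace format, where the number outside the parentheses is the size of live objects and the number inside is the committed heap, both in MB; parseScavenge and heapSample are illustrative names, not part of Cosmo or v8go.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// scavengeRe matches the heap figures in a trace line such as
// "... Scavenge 50.2 (69.1) -> 43.6 (70.6) MB, ...".
var scavengeRe = regexp.MustCompile(`Scavenge ([\d.]+) \(([\d.]+)\) -> ([\d.]+) \(([\d.]+)\) MB`)

// heapSample holds the four MB figures from one Scavenge trace line.
// Assumption: "used" is live object size, "total" is committed heap.
type heapSample struct {
	UsedBefore, TotalBefore, UsedAfter, TotalAfter float64
}

// parseScavenge extracts the heap figures from a single log line.
// It returns false if the line is not a Scavenge trace line.
func parseScavenge(line string) (heapSample, bool) {
	m := scavengeRe.FindStringSubmatch(line)
	if m == nil {
		return heapSample{}, false
	}
	var v [4]float64
	for i := range v {
		f, err := strconv.ParseFloat(m[i+1], 64)
		if err != nil {
			return heapSample{}, false
		}
		v[i] = f
	}
	return heapSample{v[0], v[1], v[2], v[3]}, true
}

func main() {
	line := "[16:0x7f1990751570] 499 ms: Scavenge 50.2 (69.1) -> 43.6 (70.6) MB, 115.1 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure"
	if s, ok := parseScavenge(line); ok {
		fmt.Printf("used %.1f -> %.1f MB, committed %.1f -> %.1f MB\n",
			s.UsedBefore, s.UsedAfter, s.TotalBefore, s.TotalAfter)
	}
}
```

Run against the excerpt above, this reports roughly 50 MB used and 70 MB committed at the time of the trace, which may be worth comparing with the ~4Gi figure observed at the container level.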

I'm a bit surprised at the memory requirements compared to Apollo's composition, which never pushed against the 1Gi limit we set on it. Is this expected?

release commit: 4077ab4

Steps to Reproduce

Compose a sufficiently complex graph.

Expected Result

Ability to increase the v8 heap size to accommodate a complex graph.

Actual Result

Allocation failures during promotion in a GC causing an OOM in composition container.

Example logs from OOM

#
# Fatal javascript OOM in Scavenger: semi-space copy
#
<--- JS stacktrace --->
[18:0x7f35ec399940] 696 ms: Scavenge 21.6 (30.3) -> 21.6 (30.3) MB, 324.9 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
[18:0x7f35ec399940] 369 ms: Scavenge 20.7 (30.3) -> 20.7 (30.3) MB, 230.5 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
[18:0x7f35ec399940] 135 ms: Scavenge 19.2 (29.1) -> 19.2 (30.3) MB, 3.3 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
<--- Last few GCs --->
Starting cosmo composition in port <redacted>, admin port <redacted>
[16:0x7f1990751570] 769 ms: Scavenge 50.4 (70.6) -> 44.0 (70.6) MB, 242.0 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
#
<--- JS stacktrace --->
[16:0x7f1990751570] 499 ms: Scavenge 50.2 (69.1) -> 43.6 (70.6) MB, 115.1 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
# Fatal javascript OOM in MarkCompactCollector: young object promotion failed
[16:0x7f1990751570] 499 ms: Scavenge 50.2 (69.1) -> 43.6 (70.6) MB, 115.1 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
<--- Last few GCs --->
[16:0x7f1990751570] 990 ms: Scavenge 44.0 (70.6) -> 44.0 (70.6) MB, 220.3 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
# Fatal javascript OOM in MarkCompactCollector: young object promotion failed
[16:0x7f1990751570] 769 ms: Scavenge 50.4 (70.6) -> 44.0 (70.6) MB, 242.0 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
# Fatal javascript OOM in MarkCompactCollector: young object promotion failed
<--- JS stacktrace --->
[16:0x7f1990751570] 990 ms: Scavenge 44.0 (70.6) -> 44.0 (70.6) MB, 220.3 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
#
[16:0x7f1990751570] 769 ms: Scavenge 50.4 (70.6) -> 44.0 (70.6) MB, 242.0 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
[16:0x7f1990751570] 990 ms: Scavenge 44.0 (70.6) -> 44.0 (70.6) MB, 220.3 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
# Fatal javascript OOM in MarkCompactCollector: young object promotion failed
<--- JS stacktrace --->
# Fatal javascript OOM in MarkCompactCollector: young object promotion failed
#
#
#
[16:0x7f1990751570] 499 ms: Scavenge 50.2 (69.1) -> 43.6 (70.6) MB, 115.1 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
# Fatal javascript OOM in MarkCompactCollector: young object promotion failed
[16:0x7f1990751570] 990 ms: Scavenge 44.0 (70.6) -> 44.0 (70.6) MB, 220.3 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
[16:0x7f1990751570] 769 ms: Scavenge 50.4 (70.6) -> 44.0 (70.6) MB, 242.0 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
<--- JS stacktrace --->
<--- Last few GCs --->
<--- Last few GCs --->
[16:0x7f1990751570] 499 ms: Scavenge 50.2 (69.1) -> 43.6 (70.6) MB, 115.1 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
[16:0x7f1990751570] 769 ms: Scavenge 50.4 (70.6) -> 44.0 (70.6) MB, 242.0 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
[16:0x7f1990751570] 499 ms: Scavenge 50.2 (69.1) -> 43.6 (70.6) MB, 115.1 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
<--- Last few GCs --->
#
#
#
<--- JS stacktrace --->
<--- JS stacktrace --->
#
<--- JS stacktrace --->
<--- Last few GCs --->
[16:0x7f1990751570] 769 ms: Scavenge 50.4 (70.6) -> 44.0 (70.6) MB, 242.0 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
[16:0x7f1990751570] 990 ms: Scavenge 44.0 (70.6) -> 44.0 (70.6) MB, 220.3 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
#
<--- Last few GCs --->
[16:0x7f1990751570] 990 ms: Scavenge 44.0 (70.6) -> 44.0 (70.6) MB, 220.3 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
#
#
<--- Last few GCs --->
[16:0x7f1990751570] 990 ms: Scavenge 44.0 (70.6) -> 44.0 (70.6) MB, 220.3 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
[16:0x7f1990751570] 990 ms: Scavenge 44.0 (70.6) -> 44.0 (70.6) MB, 220.3 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
#
#
#
[16:0x7f1990751570] 499 ms: Scavenge 50.2 (69.1) -> 43.6 (70.6) MB, 115.1 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
#
[16:0x7f1990751570] 499 ms: Scavenge 50.2 (69.1) -> 43.6 (70.6) MB, 115.1 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
[16:0x7f1990751570] 769 ms: Scavenge 50.4 (70.6) -> 44.0 (70.6) MB, 242.0 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
<--- JS stacktrace --->
[16:0x7f1990751570] 769 ms: Scavenge 50.4 (70.6) -> 44.0 (70.6) MB, 242.0 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
# Fatal javascript OOM in MarkCompactCollector: young object promotion failed
# Fatal javascript OOM in MarkCompactCollector: young object promotion failed
[16:0x7f1990751570] 499 ms: Scavenge 50.2 (69.1) -> 43.6 (70.6) MB, 115.1 / 0.0 ms (average mu = 1.000, current mu = 1.000) allocation failure
<--- Last few GCs --->

Environment information

Environment

OS: minimal linux used in distroless base images
Compiler(if manually compiled): go 1.22

Router configuration

No response

Router execution config

No response

Log output

No response

Additional context

No response

@jsalem-brex jsalem-brex added the bug Something isn't working label Jun 10, 2024

@StarpTech
Contributor

Hi @jsalem-brex, thank you for the report. To provide better support, we need a fully reproducible example. Other than that, what’s the reason for choosing the Go approach rather than using our TypeScript composition library directly?

@StarpTech StarpTech added the internally-reviewed The issue has been reviewed internally. label Jun 11, 2024
@flymedllva
Contributor

I'd like to use the Go variant too, since it's our primary language. But our pipeline was built using wgc because those JS <-> Go transformations looked scary to rely on and, judging by this issue, not without reason.

We do not use anything but the router.

The only problem we still have is a memory leak when updating the schema: #756

@jsalem-brex
Author

we need a fully reproducible example

I can't provide you with our schemas directly. I am also not sure what about our schemas is triggering the high memory usage, so creating an arbitrary example will be difficult. I'll see what I can put together, though.

We likely have more subgraphs in our schema than average. We also have a schema that is duplicated in all of our subgraphs for default shared types. Would either of those contribute to greater than typical memory usage?

Are there any specific configuration levers I should pull to get you more information in the meantime?

what’s the reason for choosing the Go approach rather than using our Typescript composition library directly?

A big reason we started looking at Cosmo in the first place was the ability to have Go as the primary language. It reduces the complexity of maintaining the system and we have a higher density of Go expertise/infra maturity compared to JS.

Is there a reason we shouldn't be using the Go libraries for composition?

@StarpTech
Contributor

Hi @jsalem-brex, thanks for the information. A reproducible example would definitely help. You can also share with me a private example (private repo) that we will handle confidentially.

A big reason we started looking at Cosmo in the first place was the ability to have Go as the primary language. It reduces the complexity of maintaining the system and we have a higher density of Go expertise/infra maturity compared to JS.

That absolutely makes sense.

Is there a reason we shouldn't be using the Go libraries for composition?

No, just for the sake of a potential workaround.

@Aenimus
Member

Aenimus commented Sep 20, 2024

Hi @jsalem-brex,

Could you give us an update on the state of this issue with the latest Go composition?

Thanks
