
Cloudserver memory leak #1069

Closed
descrepes opened this issue Nov 6, 2020 · 8 comments
@descrepes

Bug Report Information

Memory leak in Cloudserver since 8.1

Description

We upgraded two Zenko instances to 1.2 a month ago and have since noticed a lot of cloudserver pod restarts.
It happens on both instances. One instance has 3 locations and 3 cloudserver pods; the other has more than 30 cloudserver pods and more than 100 locations.

Steps to Reproduce the Issue

Deploy the latest Zenko chart.
Look at cloudserver restarts and the Grafana cloudserver dashboard.
We tested 8.1.20 and 8.2.6.

Actual Results

8.2.6 metrics:
8-2-6.png

8.1.20 metrics:
8-1-20.png

You can see that on both 8.1.20 and 8.2.6 the heap keeps growing, and it ends with a pod restart and a Node.js stack trace:

      Last State:  Terminated
      Reason:    Error
      Message:   ======================================

    0: ExitFrame [pc: 0xa18e7edbe1d]
Security context: 0x04e105e1e6e9 <JSObject>
    1: connectToNext(aka connectToNext) [0x23d0dc3908f1] [/usr/src/app/node_modules/utapi/node_modules/ioredis/built/connectors/SentinelConnector/index.js:~41] [pc=0xa18e9273f62](this=0x014cadf026f1 <undefined>)
    2: /* anonymous */(aka /* anonymous */) [0x3a96bdf132a1] [/usr/src/app/node_modules/utapi/node_modules/ioredis/built/connectors/SentinelConnector...

FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
 1: 0x8fa050 node::Abort() [node]
 2: 0x8fa09c  [node]
 3: 0xb0020e v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [node]
 4: 0xb00444 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [node]
 5: 0xef4952  [node]
 6: 0xef4a58 v8::internal::Heap::CheckIneffectiveMarkCompact(unsigned long, double) [node]
 7: 0xf00b32 v8::internal::Heap::PerformGarbageCollection(v8::internal::GarbageCollector, v8::GCCallbackFlags) [node]
 8: 0xf01464 v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [node]
 9: 0xf040d1 v8::internal::Heap::AllocateRawWithRetryOrFail(int, v8::internal::AllocationSpace, v8::internal::AllocationAlignment) [node]
10: 0xecd554 v8::internal::Factory::NewFillerObject(int, bool, v8::internal::AllocationSpace) [node]
11: 0x116d6de v8::internal::Runtime_AllocateInNewSpace(int, v8::internal::Object**, v8::internal::Isolate*) [node]
12: 0xa18e7edbe1d
Aborted (core dumped)
npm ERR! code ELIFECYCLE
npm ERR! errno 134
npm ERR! @zenko/cloudserver@8.1.2 start_s3server: `node index.js`
npm ERR! Exit status 134
npm ERR!
npm ERR! Failed at the @zenko/cloudserver@8.1.2 start_s3server script.
npm ERR! This is probably not a problem with npm. There is likely additional logging output above.

npm ERR! A complete log of this run can be found in:
npm ERR!     /root/.npm/_logs/2020-11-05T14_17_01_647Z-debug.log

      Exit Code:    134
      Started:      Wed, 04 Nov 2020 04:36:10 +0100
      Finished:     Thu, 05 Nov 2020 15:17:01 +0100
    Ready:          True
    Restart Count:  13

Expected Results

As in 8.0, pods should not restart and the heap should not grow until it reaches OOM.

8.0.22 metrics:
8-0-22.png

Additional Information

Deployment configuration:

  • Azure AKS 1.16.13

We also tested the latest-8.2 Docker image and observed the same symptoms as with 8.2.6 and 8.1.20.

Regards

@jonathan-gramain
Contributor

Hi @descrepes,

Thank you for your bug report!

If at least one of the Zenko locations is an AWS-S3-compatible location, this might be due to a socket leak that we are currently fixing for the next patch release of Zenko. It can happen when connection errors or timeouts occur on connections to the AWS-compatible backend; the leaked sockets usually retain some data in their TCP buffers, which causes a memory leak as well. It is not guaranteed to be the same issue you are seeing, but once the fix is ready you may give it a try and see whether the memory leak is resolved for you.
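
One quick way to check whether it is the same socket leak is to watch the number of live socket handles in the process alongside the heap graphs. A minimal diagnostic sketch that could be preloaded into cloudserver (for example via NODE_OPTIONS="--require ..."); note that process._getActiveHandles() is an undocumented internal Node.js API, so treat this as a throwaway diagnostic only:

    // socket-count.js: periodically log how many TCP sockets the process holds.
    // process._getActiveHandles() is undocumented and may change between Node versions.
    const net = require('net');

    setInterval(() => {
        const handles = process._getActiveHandles();
        const sockets = handles.filter(h => h instanceof net.Socket);
        // A socket count that grows without bound alongside the heap points to a socket leak.
        console.log(`active handles: ${handles.length}, sockets: ${sockets.length}`);
    }, 60000).unref(); // unref() so the timer does not keep the process alive

If the socket count stays flat while the heap keeps growing, the leak is probably somewhere else.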

Another idea would be to instrument the running cloudserver process with node --inspect, which can give an idea of where the memory is being consumed.
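
On Linux, sending SIGUSR1 to a running node process also activates the inspector without a restart. As a lighter-weight first step, the built-in v8 module can report heap usage so that growth can be correlated with the Grafana graphs; a minimal sketch (the 60-second interval is arbitrary):

    // heap-stats.js: log V8 heap statistics periodically.
    const v8 = require('v8');

    setInterval(() => {
        const stats = v8.getHeapStatistics();
        const mb = n => Math.round(n / 1024 / 1024);
        console.log(`heap used: ${mb(stats.used_heap_size)} MB, ` +
                    `total: ${mb(stats.total_heap_size)} MB, ` +
                    `limit: ${mb(stats.heap_size_limit)} MB`);
    }, 60000).unref();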

@jonathan-gramain
Contributor

jonathan-gramain commented Nov 6, 2020

@descrepes in the meantime, if you would like to try a provisional fix before we release a patch, you can apply the following diff to cloudserver 8.2 and rebuild the image. The actual fix is in a branch of the Arsenal repository, so this is just a dependency update; it also includes another fix, for a cloudserver worker crash:

diff --git a/package.json b/package.json
index a0417b82..a3527108 100644
--- a/package.json
+++ b/package.json
@@ -20,7 +20,7 @@
   "homepage": "https://github.com/scality/S3#readme",
   "dependencies": {
     "@hapi/joi": "^17.1.0",
-    "arsenal": "github:scality/Arsenal#2461b5c",
+    "arsenal": "github:scality/Arsenal#1106b7f",
     "async": "~2.5.0",
     "aws-sdk": "2.363.0",
     "azure-storage": "^2.1.0",

@descrepes
Author

Hi,
Thanks for the information 😄

I rebuilt the image with the patch applied to 8.2, but the memory leak is still there:

8-2-patched.png

@descrepes
Author

Hi,

We upgraded to 8.2.7 and we still have the memory leak.
This leak is not present in 8.0.22.

Regards.

@descrepes
Author

@jonathan-gramain one important thing to note is that we are mostly using Azure Blob as the backend.

Regards.

@descrepes
Author

I can send you some Node.js memory profiles if that helps :)

@vrancurel
Contributor

It is possible that there is also a memory leak in the Azure Blob backend. Please send the Node.js profiles!
Thanks a lot
Vianney
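
For reference, one way to capture a heap snapshot that can be opened in Chrome DevTools is to wire v8.writeHeapSnapshot() to a signal. This assumes a Node.js version of at least 11.13, where that function was added; the signal choice and the output path below are arbitrary:

    // heap-snapshot.js: write a heap snapshot when the process receives SIGUSR2.
    // Taking a snapshot is synchronous and can require memory about twice the heap size.
    const v8 = require('v8');

    process.on('SIGUSR2', () => {
        const file = `/tmp/cloudserver-${process.pid}-${Date.now()}.heapsnapshot`;
        v8.writeHeapSnapshot(file);
        console.log(`heap snapshot written to ${file}`);
    });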

@rahulreddy
Collaborator

Closing this, as it was confirmed offline that this issue has been fixed in 1.2.2.
