
Cloudserver memory leak #1069

Closed
descrepes opened this issue Nov 6, 2020 · 8 comments
@descrepes

Bug Report Information

Memory leak in Cloudserver since 8.1

Description

We upgraded two Zenko instances to 1.2 a month ago and have since noticed a lot of cloudserver pod restarts.
It happens on both instances. One instance has 3 locations and 3 cloudserver pods; the other has more than 30 cloudserver pods and more than 100 locations.

Steps to Reproduce the Issue

Deploy the latest Zenko chart.
Look at cloudserver restarts and the Grafana cloudserver dashboard.
We tested 8.1.20 and 8.2.6.

Actual Results

8.2.6 metrics:
8-2-6.png

8.1.20 metrics:
8-1-20.png

You can see that on both 8.1.20 and 8.2.6 the heap keeps growing, and it ends with a pod restart and a Node.js stack trace:

      Last State:  Terminated
      Reason:    Error
      Message:   ======================================

    0: ExitFrame [pc: 0xa18e7edbe1d]
Security context: 0x04e105e1e6e9 <JSObject>
    1: connectToNext(aka connectToNext) [0x23d0dc3908f1] [/usr/src/app/node_modules/utapi/node_modules/ioredis/built/connectors/SentinelConnector/index.js:~41] [pc=0xa18e9273f62](this=0x014cadf026f1 <undefined>)
    2: /* anonymous */(aka /* anonymous */) [0x3a96bdf132a1] [/usr/src/app/node_modules/utapi/node_modules/ioredis/built/connectors/SentinelConnector...

FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
 1: 0x8fa050 node::Abort() [node]
 2: 0x8fa09c  [node]
 3: 0xb0020e v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [node]
 4: 0xb00444 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [node]
 5: 0xef4952  [node]
 6: 0xef4a58 v8::internal::Heap::CheckIneffectiveMarkCompact(unsigned long, double) [node]
 7: 0xf00b32 v8::internal::Heap::PerformGarbageCollection(v8::internal::GarbageCollector, v8::GCCallbackFlags) [node]
 8: 0xf01464 v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [node]
 9: 0xf040d1 v8::internal::Heap::AllocateRawWithRetryOrFail(int, v8::internal::AllocationSpace, v8::internal::AllocationAlignment) [node]
10: 0xecd554 v8::internal::Factory::NewFillerObject(int, bool, v8::internal::AllocationSpace) [node]
11: 0x116d6de v8::internal::Runtime_AllocateInNewSpace(int, v8::internal::Object**, v8::internal::Isolate*) [node]
12: 0xa18e7edbe1d
Aborted (core dumped)
npm ERR! code ELIFECYCLE
npm ERR! errno 134
npm ERR! @zenko/cloudserver@8.1.2 start_s3server: `node index.js`
npm ERR! Exit status 134
npm ERR!
npm ERR! Failed at the @zenko/cloudserver@8.1.2 start_s3server script.
npm ERR! This is probably not a problem with npm. There is likely additional logging output above.

npm ERR! A complete log of this run can be found in:
npm ERR!     /root/.npm/_logs/2020-11-05T14_17_01_647Z-debug.log

      Exit Code:    134
      Started:      Wed, 04 Nov 2020 04:36:10 +0100
      Finished:     Thu, 05 Nov 2020 15:17:01 +0100
    Ready:          True
    Restart Count:  13

Expected Results

As in 8.0, pods should not restart and the heap should not grow until it reaches OOM.

8.0.22 metrics:
8-0-22.png

Additional Information

Deployment configuration:

  • Azure AKS 1.16.13

We also tested the latest-8.2 Docker image and observed the same symptoms as with 8.2.6 and 8.1.20.

Regards

@jonathan-gramain
Contributor

Hi @descrepes,

Thank you for your bug report!

If at least one of the Zenko locations is an AWS-S3-compatible location, this might be due to a socket leak that we are currently fixing for the next patch release of Zenko. It can happen when connection errors or timeouts occur on connections to the AWS-compatible backend; the leaked sockets usually retain some data in their TCP buffers, which causes a memory leak as well. It is not guaranteed to be the same issue you are seeing, but once the fix is ready you may give it a try and see whether the memory leak is resolved for you.
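
One quick way to check whether it is the same socket leak is to watch the number of live socket handles in the process alongside the heap graphs. A minimal diagnostic sketch that could be preloaded into cloudserver (for example via NODE_OPTIONS="--require ..."); note that process._getActiveHandles() is an undocumented internal Node.js API, so treat this as a throwaway diagnostic only:

    // socket-count.js: periodically log how many TCP sockets the process holds.
    // process._getActiveHandles() is undocumented and may change between Node versions.
    const net = require('net');

    setInterval(() => {
        const handles = process._getActiveHandles();
        const sockets = handles.filter(h => h instanceof net.Socket);
        // A socket count that grows without bound alongside the heap points to a socket leak.
        console.log(`active handles: ${handles.length}, sockets: ${sockets.length}`);
    }, 60000).unref(); // unref() so the timer does not keep the process alive

If the socket count stays flat while the heap keeps growing, the leak is probably somewhere else.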

Another idea would be to instrument the running cloudserver process with node --inspect, which can give an idea of where the memory is being consumed.
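
On Linux, sending SIGUSR1 to a running node process also activates the inspector without a restart. As a lighter-weight first step, the built-in v8 module can report heap usage so that growth can be correlated with the Grafana graphs; a minimal sketch (the 60-second interval is arbitrary):

    // heap-stats.js: log V8 heap statistics periodically.
    const v8 = require('v8');

    setInterval(() => {
        const stats = v8.getHeapStatistics();
        const mb = n => Math.round(n / 1024 / 1024);
        console.log(`heap used: ${mb(stats.used_heap_size)} MB, ` +
                    `total: ${mb(stats.total_heap_size)} MB, ` +
                    `limit: ${mb(stats.heap_size_limit)} MB`);
    }, 60000).unref();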

@jonathan-gramain
Contributor

jonathan-gramain commented Nov 6, 2020

@descrepes in the meantime, if you would like to try a provisional fix before we release a patch, you can apply the following diff to cloudserver 8.2 and rebuild the image. The actual fix is in a branch of the Arsenal repository, so this is just a dependency update; it also includes another fix, for a cloudserver worker crash:

diff --git a/package.json b/package.json
index a0417b82..a3527108 100644
--- a/package.json
+++ b/package.json
@@ -20,7 +20,7 @@
   "homepage": "https://github.com/scality/S3#readme",
   "dependencies": {
     "@hapi/joi": "^17.1.0",
-    "arsenal": "github:scality/Arsenal#2461b5c",
+    "arsenal": "github:scality/Arsenal#1106b7f",
     "async": "~2.5.0",
     "aws-sdk": "2.363.0",
     "azure-storage": "^2.1.0",

@descrepes
Author

Hi,
Thanks for the information 😄

I rebuilt the image with the patch applied to 8.2, but the memory leak is still there:

8-2-patched.png

@descrepes
Author

Hi,

We upgraded to 8.2.7 and we still have the memory leak.
This leak is not present in 8.0.22.

Regards.

@descrepes
Author

@jonathan-gramain one important thing to note is that we are mostly using Azure Blob as the backend.

Regards.

@descrepes
Author

I can send you some Node.js memory profiles if that helps :)

@vrancurel
Contributor

It is possible that there is also a memory leak in the Azure Blob backend. Please send the Node.js profiles!
Thanks a lot
Vianney
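
For reference, one way to capture a heap snapshot that can be opened in Chrome DevTools is to wire v8.writeHeapSnapshot() to a signal. This assumes a Node.js version of at least 11.13, where that function was added; the signal choice and the output path below are arbitrary:

    // heap-snapshot.js: write a heap snapshot when the process receives SIGUSR2.
    // Taking a snapshot is synchronous and can require memory about twice the heap size.
    const v8 = require('v8');

    process.on('SIGUSR2', () => {
        const file = `/tmp/cloudserver-${process.pid}-${Date.now()}.heapsnapshot`;
        v8.writeHeapSnapshot(file);
        console.log(`heap snapshot written to ${file}`);
    });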

@rahulreddy
Collaborator

Closing this, as it was confirmed offline that this issue has been fixed in 1.2.2.
