
Pods crash when scheduled on nodes with >24 CPUs #48

Closed
phenixblue opened this issue Oct 2, 2020 · 1 comment · Fixed by #51
Labels: bug (Something isn't working), needs-triage, python

Comments

@phenixblue
Contributor

What happened:

Installing and running MagTape on worker nodes with 24 or more CPUs generates a high number of Gunicorn workers/threads, and there appears to be a memory leak of some sort.

What you expected to happen:

Pods to start up normally

How to reproduce it (as minimally and precisely as possible):

Run the simple install in a cluster with worker nodes that have 24 or more CPUs

Anything else we need to know?:

Experienced on worker nodes with 24 cores and 128 GB RAM

Example output from MagTape container logs:

[2020-10-02 04:52:27 +0000] [107] [INFO] Booting worker with pid: 107
[2020-10-02 04:52:27 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:62)
[2020-10-02 04:52:27 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:63)
[2020-10-02 04:52:27 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:64)
[2020-10-02 04:52:27 +0000] [62] [INFO] Worker exiting (pid: 62)
[2020-10-02 04:52:27 +0000] [64] [INFO] Worker exiting (pid: 64)
[2020-10-02 04:52:27 +0000] [63] [INFO] Worker exiting (pid: 63)
[2020-10-02 04:52:29 +0000] [108] [INFO] Booting worker with pid: 108
[2020-10-02 04:52:29 +0000] [109] [INFO] Booting worker with pid: 109
[2020-10-02 04:52:30 +0000] [1] [INFO] Unhandled exception in main loop
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/gunicorn/arbiter.py", line 211, in run
    self.manage_workers()
  File "/usr/local/lib/python3.8/site-packages/gunicorn/arbiter.py", line 545, in manage_workers
    self.spawn_workers()
  File "/usr/local/lib/python3.8/site-packages/gunicorn/arbiter.py", line 616, in spawn_workers
    self.spawn_worker()
  File "/usr/local/lib/python3.8/site-packages/gunicorn/arbiter.py", line 567, in spawn_worker
    pid = os.fork()
OSError: [Errno 12] Out of memory

Environment:

  • Kubernetes version (use kubectl version): v1.15.5
  • Worker Node OS: Ubuntu 16.04
  • Cloud provider or hardware configuration:
  • Others:
@phenixblue phenixblue added bug Something isn't working python needs-triage labels Oct 2, 2020
@phenixblue
Contributor Author

phenixblue commented Oct 2, 2020

This is related to the dynamic sizing of workers/threads in the Gunicorn config. While the docs recommend (2 x $num_cores) + 1 workers, they also recommend not going above 12 workers total. I'm testing hard-coding the worker/thread values to a reasonable default and using an HPA to scale out instead of up.
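
As a rough sketch of what I'm testing (the file name and numbers below are illustrative, not the final MagTape defaults), a fixed-size Gunicorn config would look something like this:

# gunicorn.conf.py (illustrative sketch, not the actual MagTape config)
#
# The old behavior derived the worker count from the node's CPU count,
# roughly (2 x num_cores) + 1, so a 24-core node forks 49 worker
# processes and eventually hits the os.fork() OSError shown in the logs above.
#
# Hard-code small, node-size-independent defaults and let an HPA scale
# the Deployment out instead of scaling each pod up.
workers = 2
threads = 4
worker_class = "gthread"  # threaded workers keep the process count low
timeout = 30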

Thanks to Alex, @ilrudie, and Shahar for the consultations!
