
Pods crash when scheduled on nodes with >24 CPUs #48

Closed
phenixblue opened this issue Oct 2, 2020 · 1 comment · Fixed by #51
Labels: bug (Something isn't working), needs-triage, python

Comments

@phenixblue
Contributor

What happened:

Installing and running MagTape on worker nodes with 24 or more CPUs generates a high number of Gunicorn workers/threads, and there appears to be a memory leak of some sort.

What you expected to happen:

Pods to start up normally

How to reproduce it (as minimally and precisely as possible):

Run the simple install in a cluster with worker nodes that have 24 or more CPUs

Anything else we need to know?:

Experienced on worker nodes with 24 cores and 128 GB RAM

Example output from MagTape container logs:

[2020-10-02 04:52:27 +0000] [107] [INFO] Booting worker with pid: 107
[2020-10-02 04:52:27 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:62)
[2020-10-02 04:52:27 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:63)
[2020-10-02 04:52:27 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:64)
[2020-10-02 04:52:27 +0000] [62] [INFO] Worker exiting (pid: 62)
[2020-10-02 04:52:27 +0000] [64] [INFO] Worker exiting (pid: 64)
[2020-10-02 04:52:27 +0000] [63] [INFO] Worker exiting (pid: 63)
[2020-10-02 04:52:29 +0000] [108] [INFO] Booting worker with pid: 108
[2020-10-02 04:52:29 +0000] [109] [INFO] Booting worker with pid: 109
[2020-10-02 04:52:30 +0000] [1] [INFO] Unhandled exception in main loop
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/gunicorn/arbiter.py", line 211, in run
    self.manage_workers()
  File "/usr/local/lib/python3.8/site-packages/gunicorn/arbiter.py", line 545, in manage_workers
    self.spawn_workers()
  File "/usr/local/lib/python3.8/site-packages/gunicorn/arbiter.py", line 616, in spawn_workers
    self.spawn_worker()
  File "/usr/local/lib/python3.8/site-packages/gunicorn/arbiter.py", line 567, in spawn_worker
    pid = os.fork()
OSError: [Errno 12] Out of memory

Environment:

  • Kubernetes version (use kubectl version): v1.15.5
  • Worker Node OS: Ubuntu 16.04
  • Cloud provider or hardware configuration:
  • Others:
@phenixblue phenixblue added bug Something isn't working python needs-triage labels Oct 2, 2020
@phenixblue
Contributor Author

phenixblue commented Oct 2, 2020

This is related to the dynamic sizing of workers/threads in the Gunicorn config. While the docs recommend (2 x $num_cores) + 1 workers, they also recommend not going above 12 workers total. I'm testing hard-coding the worker/thread values to a reasonable default and using an HPA to scale out instead of up.
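
As a rough sketch of what I'm testing (the file name and numbers below are illustrative, not the final MagTape defaults), a fixed-size Gunicorn config would look something like this:

# gunicorn.conf.py (illustrative sketch, not the actual MagTape config)
#
# The old behavior derived the worker count from the node's CPU count,
# roughly (2 x num_cores) + 1, so a 24-core node forks 49 worker
# processes and eventually hits the os.fork() OSError shown in the logs above.
#
# Hard-code small, node-size-independent defaults and let an HPA scale
# the Deployment out instead of scaling each pod up.
workers = 2
threads = 4
worker_class = "gthread"  # threaded workers keep the process count low
timeout = 30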

Thanks to Alex, @ilrudie, and Shahar for the consultations!
