[resotocore] Update process interrupted with large graph updates #636

Closed
MrMarvin opened this issue Feb 8, 2022 · 5 comments
Labels
bug Something isn't working

Comments

@MrMarvin (Contributor) commented Feb 8, 2022

Description

When collecting a large number of accounts *1, the graph merge fails with any version newer than v2.0.0a13.

Using #608 (on a v2.0.0a15 image) to dump the collected graph and model:

-rw-r--r--   1 marv  staff   306M Feb  7 23:31 resoto-graph-zasag9am.ndjson
-rw-r--r--   1 marv  staff   103K Feb  7 23:27 resoto-model-bag5nlua.json

Trying the steps @lloesche documented, I was unable to reproduce the behaviour locally (macOS, current main, graphdb in Docker). The graph imports fine, albeit very slowly (which might just be due to the graphdb being undersized, etc.).

Trying to understand what's going on from the provided stack trace:

  1. Import process starts at 19:49:07
  2. Import process dies at 19:49:40, 33 seconds later.

This leads me to believe that Queue#get(True, 30) hits its 30-second timeout.
Without diving in too deep yet: the underlying locking mechanism seems like it might be platform/OS dependent, which would explain why I'm not seeing the same behaviour locally.
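
For illustration only (this is just stdlib behaviour, not the resoto code): a multiprocessing.Queue.get() with block=True and a 30 second timeout raises queue.Empty once the timeout expires, which would match the ~33 second gap above. A minimal sketch (it deliberately never enqueues anything, so running it waits out the full 30 seconds):

import multiprocessing
import queue

def child(q):
    try:
        # Same call shape as db_updater.next_action(): block for up to 30 seconds.
        action = q.get(True, 30)
        print(f"received: {action}")
    except queue.Empty:
        # Raised when nothing arrives on the queue within the timeout.
        print("no action within 30 s -> queue.Empty")

if __name__ == "__main__":
    q = multiprocessing.Queue()
    p = multiprocessing.Process(target=child, args=(q,))
    p.start()
    p.join()  # nothing is ever put on the queue, so the child hits the timeout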

Unfortunately I won't be able to supply the collected graph for debugging.

Version

v2.0.0a14+

Environment

Linux, Resoto one-for-all container image running on GKE

Steps to Reproduce

  1. Collect a large graph, let resotoworker send it to resotocore for merge, and grep the logs of the spawned import process

Logs

22-02-07 19:49:01|resotoworker| INFO|85|worker-collect_cleanup  resotocore Event Handler called
22-02-07 19:49:01|resotocore| INFO|84|MainThread  127.0.0.1 [07/Feb/2022:19:49:01 +0000] "POST /graph/resoto HTTP/1.1" 200 414 "-" "python-requests/2.27.1"
22-02-07 19:49:04|resotocore| INFO|84|MainThread  127.0.0.1 [07/Feb/2022:19:49:02 +0000] "PATCH /model HTTP/1.1" 200 138466 "-" "python-requests/2.27.1"
22-02-07 19:49:05|resotocore| INFO|84|MainThread  Received merge_graph request
======== Running on http://0.0.0.0:8900 ========
(Press CTRL+C to quit)
22-02-07 19:49:07|resotocore| INFO|497|MainThread  Import process started: 497
22-02-07 19:49:07|resotocore| INFO|497|MainThread  Create ArangoHTTPClient with timeout=900 and verify=True
22-02-07 19:49:33|resotocore| INFO|84|MainThread  127.0.0.1 [07/Feb/2022:19:49:33 +0000] "GET /metrics HTTP/1.1" 200 30633 "-" "Prometheus/2.32.1"
22-02-07 19:49:40|resotocore|ERROR|497|MainThread  Update process interrupted. Preemptive Exit. 
Traceback (most recent call last):
  File "/usr/local/resoto-venv-pypy3/lib/pypy3.8/site-packages/core/model/db_updater.py", line 155, in run
    result = asyncio.run(self.setup_and_merge())
  File "/usr/local/pypy/lib/pypy3.8/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/local/pypy/lib/pypy3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "/usr/local/pypy/lib/pypy3.8/asyncio/futures.py", line 178, in result
    raise self._exception
  File "/usr/local/pypy/lib/pypy3.8/asyncio/tasks.py", line 280, in __step
    result = coro.send(None)
  File "/usr/local/resoto-venv-pypy3/lib/pypy3.8/site-packages/core/model/db_updater.py", line 145, in setup_and_merge
    result = await self.merge_graph(db)
  File "/usr/local/resoto-venv-pypy3/lib/pypy3.8/site-packages/core/model/db_updater.py", line 125, in merge_graph
    nxt = self.next_action()
  File "/usr/local/resoto-venv-pypy3/lib/pypy3.8/site-packages/core/model/db_updater.py", line 120, in next_action
    return self.read_queue.get(True, 30)  # type: ignore
  File "/usr/local/pypy/lib/pypy3.8/multiprocessing/queues.py", line 108, in get
    raise Empty
queue.Empty
22-02-07 19:49:40|resotocore| INFO|497|MainThread  Shutdown initiated for current process.

Additional Context

*1: about 2,000 GCP projects with only a few select resource types, resulting in a 306M resoto-graph .ndjson file

@MrMarvin added the bug label Feb 8, 2022
@aquamatthias (Member)

Thanks for reporting this @MrMarvin. We ran into the same issue here and created a fix for this problem: #633
We will create a new release including the fix ASAP.

@MrMarvin (Contributor, Author) commented Feb 8, 2022

Cool 👍
I'll give the current :edge image a spin and see if that fixes it.

@MrMarvin (Contributor, Author) commented Feb 8, 2022

Thanks a lot! That seemed to work... somewhat. We're no longer running into the 30-second timeout. Closing this issue.

However, we are now back to the import process getting OOM killed. Found the OOM message in dmesg:

[Tue Feb  8 09:53:35 2022] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=c82355e14b2fa6fcc5757285b1d5f1f4cc5c197a1ec8da98ac04408854eddeed,mems_allowed=0,oom_memcg=/kubepods/burstable/pod42aa47fa-d4ec-4613-ab50-49fcdbf9a8d1,task_memcg=/kubepods/burstable/pod42aa47fa-d4ec-4613-ab50-49fcdbf9a8d1/c82355e14b2fa6fcc5757285b1d5f1f4cc5c197a1ec8da98ac04408854eddeed,task=pypy3,pid=201234,uid=0
[Tue Feb  8 09:53:36 2022] Memory cgroup out of memory: Killed process 201234 (pypy3) total-vm:10659868kB, anon-rss:10413844kB, file-rss:51120kB, shmem-rss:8kB, UID:0 pgtables:20748kB oom_score_adj:865
[Tue Feb  8 09:53:36 2022] oom_reaper: reaped process 201234 (pypy3), now anon-rss:0kB, file-rss:0kB, shmem-rss:8kB
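
As a side note (a small sketch rather than anything resoto-specific): since this is a memcg OOM (CONSTRAINT_MEMCG), the relevant ceiling is the container's cgroup memory limit. Assuming a cgroup v1 node, which is what I'd expect on GKE here, that limit can be read like this:

from pathlib import Path

# cgroup v1 path; on cgroup v2 the equivalent would be /sys/fs/cgroup/memory.max.
limit_bytes = int(Path("/sys/fs/cgroup/memory/memory.limit_in_bytes").read_text())
print(f"container memory limit: {limit_bytes / 1024 ** 3:.1f} GiB")

# For comparison: dmesg above reports anon-rss:10413844kB (~9.9 GiB) for the
# pypy3 import process alone at kill time.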

For completeness, here is how an OOM kill looks from the perspective of the userland logs. Not immediately helpful as-is.

22-02-08 09:51:33|resotocore| INFO|490|MainThread  Import process started: 490
22-02-08 09:51:33|resotocore| INFO|490|MainThread  Create ArangoHTTPClient with timeout=900 and verify=True
22-02-08 09:53:20|resotocore| INFO|490|MainThread  Resolve attributes in graph
22-02-08 09:53:43|resotocore| INFO|490|MainThread  Resolve attributes finished.
22-02-08 09:55:38|resotocore| INFO|83|MainThread  Abort update manually: b14e5aba-88c4-11ec-a7dc-82ec9b375184
22-02-08 09:55:38|resotocore| INFO|83|MainThread  Request <Request POST /graph/resoto/merge > has failed with exception: Error: ImportAborted
Message: Import process died. (ExitCode: -9)

22-02-08 09:55:38|resotoworker|ERROR|88|worker-collect_cleanup  b'Error: ImportAborted\nMessage: Import process died. (ExitCode: -9)'
22-02-08 09:55:38|resotoworker|ERROR|88|worker-collect_cleanup  Failed to collect: Failed to send graph: b'Error: ImportAborted\nMessage: Import process died. (ExitCode: -9)'
Traceback (most recent call last):
  File "/usr/local/resoto-venv-python3/lib/python3.10/site-packages/resotoworker/__main__.py", line 161, in core_actions_processor
    collect(collectors)
  File "/usr/local/resoto-venv-python3/lib/python3.10/site-packages/resotoworker/collect.py", line 51, in collect
    send_to_resotocore(graph)
  File "/usr/local/resoto-venv-python3/lib/python3.10/site-packages/resotoworker/resotocore.py", line 22, in send_to_resotocore
    send_graph(graph, base_uri, resotocore_graph, dump_json=dump_json)
  File "/usr/local/resoto-venv-python3/lib/python3.10/site-packages/resotoworker/resotocore.py", line 93, in send_graph
    raise RuntimeError(f"Failed to send graph: {r.content}")
RuntimeError: Failed to send graph: b'Error: ImportAborted\nMessage: Import process died. (ExitCode: -9)'
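
A small aside on reading that error: a negative exit code from a Python multiprocessing child is, by convention, the number of the signal that terminated it, so "ExitCode: -9" means the import process was killed with SIGKILL, which lines up with the OOM kill in dmesg. Illustrative snippet only:

import signal

exit_code = -9  # from "Import process died. (ExitCode: -9)"
if exit_code < 0 and -exit_code == signal.SIGKILL:
    print("import process was killed by SIGKILL (e.g. by the kernel OOM killer)")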

@MrMarvin closed this as completed Feb 8, 2022
@aquamatthias (Member)

Thanks for reporting back, @MrMarvin.

To clarify the OOM issue: the update worked in 2.0.0a13 and is OOM killed in the current version (no other changes on your side). Is this correct?

@lloesche ^^

@MrMarvin (Contributor, Author) commented Feb 8, 2022

To clarify the OOM issue: the update worked in 2.0.0a13 and is OOM killed in the current version (no other changes on your side). Is this correct?

We do run some feature flags that were only introduced in a later version. Changes we introduced implicitly by moving to v2.0.0a16 (via :edge):

        - name: RESOTOWORKER_GRAPH_MERGE_KIND
          value: account
        - name: RESOTOWORKER_DEBUG_DUMP_JSON
          value: "true"
