Running out of memory with 12GB available #116

Running

awless -e sync

with 12GB of memory available ends up in the process being killed because of an OOM situation. I would have expected 12GB to be plenty of memory for the task.
Comments
Thanks for reporting! Let us see if the size of your infra is causing the OOM. First we are going to sync just the infra:
Then we can deactivate the sync of the infra and sync again:
Can you run that and output the results? Cheers |
Hi Simon - I just ran the steps and the 2nd chunk ran into OOM again. However, I think that's because you had one too many Here it is with just
I think we can close this now, as there is a work-around. |
Thanks. Indeed, my steps were not clear enough; also, I did want to run the 2 distinct scenarios:
Also, I remember now that the first install always triggers a new sync without any log (it is just written So are you saying that your last 2 commands Also, what do you consider a work-around? It seems to me that Would you mind sharing in
|
Thanks for taking the time to look into this in more detail. I believe I understood what the intention of your previous 2 commands was: sync just infra, then sync just non-infra. That's why I removed the I considered it a work-around, b/c I was able to do at least one error-free sync using the 2 steps. Now that I try to get a clean run with ~/.awless removed, it always seems to fail:
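(For reference, a sketch of the clean-run scenario described here, reusing the awless -e sync invocation from the original report; the exact commands were not preserved:)

$ rm -rf ~/.awless    # force a clean first install
$ awless -e sync      # this first sync is the one that runs out of memory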
If I am repeating the same sync-command without removing ~/.awless
I'm positive that during the previous attempts I ran |
Yeap. In my opinion this is definitely due to the size/amount of what is fetched:
So adding up the infra (which, as your output shows, does not have a negligible size) to the previously synced services makes it blow up. There are actually a lot of parallel calls done for some services. For instance, to retrieve Anyway, we will have a look at how to mitigate and improve that in the coming weeks. If you have any ideas to share on this issue, or more comments on the CLI, do not hesitate. I will keep this open until we have a kind of resolution. Thanks. |
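(Illustration only, not the actual awless code: the per-service parallel fan-out mentioned in the previous comment looks roughly like this in Go; all names here are hypothetical.)

package sketch

import "sync"

type resource struct{ ID string }

// fetchService fans out one goroutine per resource type and gathers all
// results into a single in-memory map -- which is where the memory
// pressure described above comes from.
func fetchService(types []string, fetch func(string) ([]resource, error)) map[string][]resource {
	var (
		wg  sync.WaitGroup
		mu  sync.Mutex
		out = make(map[string][]resource)
	)
	for _, t := range types {
		wg.Add(1)
		go func(t string) { // one concurrent API call per resource type
			defer wg.Done()
			res, err := fetch(t)
			if err != nil {
				return // error handling elided in this sketch
			}
			mu.Lock()
			out[t] = res // everything accumulates in memory until flushed to disk
			mu.Unlock()
		}(t)
	}
	wg.Wait()
	return out
}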
@thoellrich I have been running some memory benchmarks locally. So far the greediest service is actually fetching & resolving the access info (basically IAM users, groups, roles, ...). In your case, I notice that syncing the access takes the longest, as it never displays the time it took (when in verbose mode). I am curious what would happen if you were to sync only the access with the following:
(it does not have to be a first install) Would you mind outputting the result of the command? Cheers. |
Here you go:
|
Thanks a lot @thoellrich. So here a full sync is done (due to the first install) and then the command to sync only the access. The funny thing is that you did not have the OOM issue! I have also just fixed the fact that for a first install we did not inject a proper logger into the Sync, hence it did not properly log all info. For instance, in your case we should have seen Anyway, roughly we can see that each of your services takes a minimum of 1G in memory, which is a lot, and we will have to figure out how to improve that. Also, a sync in your case would take more than 10 seconds. I am wondering if that does not make, in your case, the one-liners creation and As we try to focus Cheers. |
I'm on a mac with 8GB RAM and I can see that's clearly not enough. |
@gauravarora Thanks for reporting the issue. The sync process is obviously proportional to the size of each service's resources to fetch and resolve. Given the size you have:
... you indeed run into a limitation of local resources. And let me be clear: this is not acceptable from our point of view. Your infra should be considered a normal infra (tending towards large, maybe) and Also, the sync (per service) is run at different times when using awless (see doc). So 3 mins or 10 mins to resolve a service is definitely a no-no. With PS: Thanks for opening a ticket for the separate issue of Note: As said in the Getting Started doc, you can disable autosync with |
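(The setting elided at the end of the previous comment is presumably the autosync key from the Getting Started doc; assuming the documented form:)

$ awless config set autosync false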
snapshot instead of graph to avoid numerous in-loop calls to triplestore.Source.Snapshot() (a call that pre-allocates memory). Tackles part of issue #116. Overall rough improvements:
- CPU: 50% better on fetching, mainly Access resources
- Mem alloc_object: from ~70% to ~40% of cumulated alloc_objects for Snapshot() method calls
I have reduced the memory usage during the sync (see commit 417a0a9). This is only available for now on @thoellrich @gauravarora More importantly, the flag
... it will dump the Go profiling files. Then, to inspect memory, enter the interactive pprof and run the web command like so:
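(The exact invocation is not preserved above; with standard Go tooling, and assuming the dump file names that appear later in this thread, it would look like:)

$ go tool pprof awless mem-sync.prof   # open the memory profile against the binary
(pprof) top10                          # list the biggest allocators
(pprof) web                            # render the call graph as SVG in the browser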
The Cheers! |
Overall, independently of local or small improvements, the main issue is that we hold in memory all the cloud resources fetched. They are represented as RDF triples and indexed in a map before being flushed and written to their corresponding service local file (under ~/.awless). My take is that in order NOT to use that amount of memory, we will have to stream (i.e. channel) the triples from creation down to being written to their respective files. That would avoid holding them all in memory while still closing the file when all triples for this service are done. Obviously, the interesting challenge is that cloud resources are held in memory to reconcile them amongst themselves and build up their relations. |
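(A minimal sketch of the streaming idea from the previous comment, assuming triples serialize to one line each, N-Triples style, and a producer that closes the channel once the service fetch is done. This is a design sketch, not awless's implementation.)

package sketch

import (
	"bufio"
	"fmt"
	"os"
)

// flushTriples writes each serialized triple to the service file as it
// arrives on the channel, instead of indexing every triple in a map first.
func flushTriples(path string, triples <-chan string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	w := bufio.NewWriter(f)
	for t := range triples { // drains until the producer closes the channel
		if _, err := fmt.Fprintln(w, t); err != nil {
			return err
		}
	}
	return w.Flush() // the file is complete once all triples for this service are done
}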
(issue #116) Running with /usr/bin/time --verbose we get a 2.5× decrease in Maximum Resident Size memory. Profiling gives us a decrease in alloc_space from ~90% to below 50%. See profiling results:

NEW
Showing top 10 nodes out of 292 (cum >= 2.51MB)
    flat  flat%   sum%      cum   cum%
 20.87MB 20.87% 20.87%  20.87MB 20.87%  runtime.makemap
 12.50MB 12.51% 33.38%  48.87MB 48.88%  github.com/wallix/awless/vendor/github.com/wallix/triplestore.(*source).Snapshot
 11.50MB 11.50% 44.88%  11.50MB 11.50%  runtime.rawstringtmp

OLD
Showing top 10 nodes out of 80 (cum >= 2MB)
     flat  flat%   sum%      cum   cum%
165.34MB 46.51% 46.51% 165.34MB 46.51%  runtime.makemap
 80.51MB 22.65% 69.16%  80.51MB 22.65%  runtime.rawstringtmp
 55.54MB 15.62% 84.78% 323.87MB 91.10%  github.com/wallix/awless/vendor/github.com/wallix/triplestore.(*source).Snapshot
To see the improvements from the commit above (8bc14ce), and to see what still takes memory with a big infra, one can run:
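(A sketch, assuming the profiling flag introduced in the previous comment is enabled and the dump files land in the working directory:)

$ awless sync
$ go tool pprof awless mem-sync.prof
(pprof) top10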
|
I was getting the OOM problem on my MBP with 16GB of memory.
|
@deinspanjer Thanks for running the latest head and reporting your findings! That is very useful. So you had the OOM issue on your 16GB MBP. All 16GB used up! Yikes! Indeed if you do not mind the profiling Many thanks! |
We can try the Send service; you just better hope you are the first one to try downloading the file. :)
Avoid snapshotting the datastore for querying when not necessary when building relations (issue #116)
* Go profiling:
- making Snapshotting not top 5 CPU anymore (now below top 25)
- making Snapshotting not top 5 Mem anymore (now below top 30)
@deinspanjer I got them OK through the Send service. But the prof files do not contain any metrics. Here is what I got after unzipping:
$ ll -h *{prof,txt}
-rw-r--r-- 1 simon simon 6,6K août 4 13:00 cpu-sync.prof
-rw-r--r-- 1 simon simon 6,7K août 4 13:00 mem-sync.prof
-rw-r--r-- 1 simon simon 1,1K août 4 13:01 sync-profile-stdout.txt

Doing a ... anyway, let's leave the prof files aside. To validate that we have an improvement and a fix for you, simply run:
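(Presumably the /usr/bin/time invocation from the commit message above:)

$ /usr/bin/time --verbose awless -e sync   # GNU time; reports "Maximum resident set size"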
... and output the result here. We will basically check that, instead of using 8GB of RAM, we have a lower Max Resident Size memory. (You can even pull the latest master before doing that to get the latest improvement; see the commit above.) Thanks. |
Sorry, it seems the BSD (OSX) version of time doesn't support the verbose flag. Here are the results of the -e run with the latest head:
|
@deinspanjer Ok. Too bad. I am on macOS 10.12.1 and I got the flag Anyway, to close the issue and validate a reduction of memory consumption (even if I know there is an important one), I wanted demonstrative figures. In your case, it seems that I will re-ping the other users to have them re-sync with the new version and make sure it is working for them as well. Thanks. |
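(For the record, the flag in question is likely -l, the BSD/macOS equivalent for reporting memory usage:)

$ /usr/bin/time -l awless -e sync   # BSD time: -l prints rusage fields, including "maximum resident set size"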
Ugh, I don't know what is up with my time. Maybe I overrode it with a brew coreutil or something? Sorry my run isn't the demonstrative figure you were looking for, but I am certainly happy that it went from not working at all and failing after almost a minute to being nice and snappy with just a few seconds. :)
Somehow my ~/.awless data also seems to have gotten corrupted at some point. Before purging the ~/.awless dir, I couldn't ever get past the infra sync stage (CPU steadily increasing way out of control on the way to crashing the system); after deleting the dir and its contents, awless is happy and functional again.
pre purge:
post purge:
Thanks! |
@cmcconnell1 I notice on pre-purge that your profile was Anyway, happy it is working now. Side note on
|
And one more from OP. Great progress guys! Thanks! Should we close it?
|
Thanks @thoellrich. If after those fixes you find now
No need to keep it open, because I no longer see the OOM. If I find other stuff I'll open another issue. |