-
Notifications
You must be signed in to change notification settings - Fork 798
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug: High idle cpu usage and log flooded with "Delr page was empty" messages since 1.4.0 #3906
Comments
My environment is same, surrealdb in docker. Firstly, I tried exporting (in 1.4.0 itself) and importing all my databases manually (to a new instance), this did lower the CPU usage back to near 0%, but still had the logs as before. This new instance got to same 25% CPU usage after a day or so. I just created another new instance with only one database imported. Seems fine for more than a day. The logs were present nevertheless. This log I could see some other failed attempts to connect to port 4317, not sure if this relates to CPU usage.
|
also seeing this |
Same here Screen.Recording.2024-04-21.at.10.45.21.mov |
I just tried removing my old database file (the one I created with v1.0.0) and started 1.4.0. The cpu usage is back to the 0-1% range. It seems like the problem is with something in the database file itself. |
It's not just related to docker, just running sample.mp4Linux |
+1 |
I noticed after about a week of normal operation the process started idling at 6-8%. I updated to 1.4.2 a few days ago, but not sure if it has anything to do with it. |
I've been observing the CPU usage and it seems that it keeps going upwards every day. From 6-8% 4 days ago it went up to around 10-12%. I'm really afraid to deploy it in production, I see it as a huge risk. I have to revert back to 1.3.1. If anyone figures out a solution to this, please, share it with us |
Don't dare deploy it in production, it's an existential problem that has appeared since certain versions. I've never had this problem before. We tried to downgrade to earlier versions, but there were compatibility errors with our backups. New versions work with backups created with older versions, but not older versions, which display compatibility errors. This problem also exists on Windows, apparently describing the same problem on the same releases. Since opening the ticket, the team has not yet reacted, so you'll have to wait if you don't have a database alternative. |
Hi @iteroji, @ragavendaran, @MerryOscar, @Micnubinub, @micisse, @KristjanVall We have some ideas, but don't have access to your environments - could you please share some information on the above to help us debug?
We have some ideas what it could be, but we would like to get some confirmation so any help will be useful on the above questions. Thanks for the help in finding this and helping resolve, Hugh |
1. In my case and since I've been using SurrealDB, no. The novelty of this bug is that when I CTRL+C to stop the server, I get a |
That's really useful, thank you! Particularly about the channel - I saw something like that in development and believed it was resolved; Will have a look as you shouldn't see the channel closed |
Hi @phughk I'm not using LiveQueries, ChangeFeeds (3) I've exported and imported my data in a new Instance of SurrealDB 1.4.0, the CPU usage was high after 12 hours (overnight) when I checked. (5) Sure, I did a backup after the clean import, and another backup after 12 hours (docker volumes) I've attached both logs, also included the tree structure of the database folder at both snapshots in the top. High-CPU-Usage-(12H-aged).txt I did not see I suspected if Events were causing this, testing by inserting some data in tables with events set up, but to no avail. |
Hi @ragavendaran thank you for providing those logs. I am a bit surprised that the timestamps don't indicate that the db was running for a long time. For example the first log message is
and the last is
As far as I can see, that indicates about a single minute of logging. We are going to try and reproduce this internally, but were wondering if you may have better log samples. Also if you are able to get a DTrace from the instance, that would be fantastic. A tool I have been using for that is flamegraph https://github.com/flamegraph-rs/flamegraph?tab=readme-ov-file |
Hi @phughk I ran the instance only for a minute, the CPU usage is high in a few seconds (the problem won't go away with restarts once started) The first time to spot the issue was after 12 hours in a clean imported instance. I've got the flamegraphs (for linux-amd64 bare-metal) highlighting the CPU usage from both versions 1.4.0 and 1.4.2 I did see the I'll be happy to provide more details if needed. Edit: Also confirmed this issue with v1.4.0.beta.1 release. |
Hi everyone thank you for sharing all the information, it is very useful. The flamegraphs indicate that there are 3 main parts to this, and the largest part seems to be accessing rocksdb (libc.so.6). Because you are all running release builds, we don't have debug information in the flamegraphs and cannot decipher easily exactly what the above functions are. However we can speculate that since we have polling functions present, this is to do with that. And the Can you please confirm @iteroji, @ragavendaran, @MerryOscar, @Micnubinub, @micisse, @KristjanVall
We are continuing trying to reproduce this locally but can't immediately see how this is happening for you. We can confirm #3952 and have a fix in the works for that. |
I took the opportunity to run a test with and without a backup created with older versions of surrealdb.
without.backup.mp4
with.backup.mp4 |
Hi @phughk
Was able to get a flamegraph from a debug build of 1.4.0-beta-1, but limited samples for 10 seconds (CPU usage was high the entire time) I can try getting a full minute flamegraph from debug build with even fewer samples, if the above one is insufficient. I've created another clean instance (1.4.0-beta-1) with a single database including events, if the issue crops up there, I'll have full logs to share. |
As expected, I was able to recreate this time, after letting the instance (1.4.0-beta.1 arm64-docker) run for about 12 hours, I saw 2% constant CPU usage, without any connections / queries. I backed up this instance (entire database file) at 4 points in time.
I used these backups to run with a debug build (1.4.0-beta.1 amd64-local) for each snapshots (each 1 minute runtime). I'm attaching only the perfs this time, ( Thank you for introducing me to
perfs before and after 12h, despite very little activity in between the snapshots, CPU usage became high. |
Hey @ragavendaran 2% passive usage is tolerable, as the database is doing liveness checks. I can see you mentioned that after 12h the CPU was high, but how high? 2% or higher? and was this constant or simply spikes? Thank you for providing this information and captures, looking at them now. The perfdata files are missing the corresponding stack files - do you have these by the chance? |
Hi @phughk, there were no changes to the database before and after 12hours of idle, there was no CPU usage in the beginning (snapshot 3), and 2 to 4 % after the 12hour idle (snapshot 4). It spikes every 1 second and seems constant in I believe, the perf command records stack frames I'm not sure about the stack files. If there is a specific perf command to use, and some directions to get the stack files, I'll be able to get them. |
…ix-incorrect-shutdown
Yeah, this is the polling rate for change feeds and live queries. We have changed it back to the 10s polling rate that it was in 1.3, as 1.4 changed it to 1s; 1.5 will have the fix for this. Thanks all for your cooperation in resolving this Will close this ticket after 1.5 is tested by you all |
1.5.0-beta1 is out, please try it if you have time. It should resolve the issue. |
Hi @phughk, i've been using v1.5.0-beta.1 since this Saturday afternoon and I haven't noticed any load increase. I've only been using it for a day and a half and so far there's nothing to report about the problem that caused the ticket. However, I'm still seeing these messages and/or errors, as in the video below (view in full):
perhaps planned in another fix Surrealdb.v1.5.0-beta.1.test.mp4Have a nice day |
Hey @micisse , thanks. Those are just error messages, it isn't actually a failure. When a channel is closed on one side, the other side failed to send and stopped as well. We will remove the error message, but they are not a major concern. The delr is not a concern and will also be removed from trace logs. |
Closing as this is resolved in 1.5.0-beta1, but can be mitigated on 1.4.x with |
Describe the bug
Starting with version 1.4.0 my database instance in docker idles at ~25% and the log prints the following pattern of messages every second or even more often.
The same occurs on the nightly build, however downgrading to 1.3.1 or below returns idling to ~0.1%.
Have there been changes with the file format or anything that would require migration?
Steps to reproduce
Upgrade to 1.4.0 or above from previous releases (docker) .
Expected behaviour
Idling around 0-1% instead of 25%
SurrealDB version
1.4.0 for linux on x86_64
Is there an existing issue for this?
Code of Conduct
The text was updated successfully, but these errors were encountered: