Bug: High idle cpu usage and log flooded with "Delr page was empty" messages since 1.4.0 #3906

iteroji · 2024-04-18T08:22:47Z

Describe the bug

Starting with version 1.4.0, my database instance in Docker idles at ~25% CPU, and the log prints the following pattern of messages every second or even more often.

container-surrealdb-1  | 2024-04-18T07:55:45.661776Z TRACE surrealdb::api::engine::tasks: Node agent tick: Instant { tv_sec: 606859, tv_nsec: 540700528 }
container-surrealdb-1  | 2024-04-18T07:55:45.661857Z TRACE surrealdb_core::kvs::ds: Ticking at timestamp 1713426945
container-surrealdb-1  | 2024-04-18T07:55:45.985632Z TRACE surrealdb_core::kvs::tx: Delr page was empty
container-surrealdb-1  | 2024-04-18T07:55:46.228224Z TRACE surrealdb_core::kvs::tx: Delr page was empty
container-surrealdb-1  | 2024-04-18T07:55:46.638700Z TRACE surrealdb_core::kvs::tx: Delr page was empty

The same occurs on the nightly build; however, downgrading to 1.3.1 or below returns idle CPU usage to ~0.1%.

Have there been changes to the file format or anything else that would require a migration?
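To put a number on "every second or even more often", the log can be bucketed by second. The following is only a minimal sketch: the helper name `delr_per_second` and the regular expression are illustrative assumptions, written against the line format of the sample lines quoted above, and are not part of SurrealDB.

```python
import re
from collections import Counter

# Sample lines copied verbatim from the log excerpt in this report.
SAMPLE_LOG = """\
container-surrealdb-1  | 2024-04-18T07:55:45.661776Z TRACE surrealdb::api::engine::tasks: Node agent tick: Instant { tv_sec: 606859, tv_nsec: 540700528 }
container-surrealdb-1  | 2024-04-18T07:55:45.661857Z TRACE surrealdb_core::kvs::ds: Ticking at timestamp 1713426945
container-surrealdb-1  | 2024-04-18T07:55:45.985632Z TRACE surrealdb_core::kvs::tx: Delr page was empty
container-surrealdb-1  | 2024-04-18T07:55:46.228224Z TRACE surrealdb_core::kvs::tx: Delr page was empty
container-surrealdb-1  | 2024-04-18T07:55:46.638700Z TRACE surrealdb_core::kvs::tx: Delr page was empty
"""

# Assumed line shape: "<timestamp>Z TRACE surrealdb_core::kvs::tx: Delr page was empty".
# The capture group keeps the timestamp truncated to whole seconds.
DELR_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\.\d+Z\s+TRACE\s+"
    r"surrealdb_core::kvs::tx: Delr page was empty"
)


def delr_per_second(log_text):
    """Count 'Delr page was empty' lines per wall-clock second."""
    counts = Counter()
    for line in log_text.splitlines():
        match = DELR_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    for second, count in sorted(delr_per_second(SAMPLE_LOG).items()):
        print(second, count)
```

Feeding a longer capture (for example the output of `docker logs`) into the same function gives a per-second rate that can be compared between versions.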

Steps to reproduce

Upgrade to 1.4.0 or above from a previous release (Docker).

Expected behaviour

Idling around 0-1% instead of 25%

SurrealDB version

1.4.0 for linux on x86_64

Is there an existing issue for this?

I have searched the existing issues

Code of Conduct

I agree to follow this project's Code of Conduct

ragavpr · 2024-04-18T12:37:12Z

My environment is the same: SurrealDB in Docker.

First, I tried exporting (in 1.4.0 itself) and manually importing all my databases into a new instance. This lowered the CPU usage back to near 0%, but the logs were still the same as before.

This new instance got back to the same 25% CPU usage after a day or so.

I just created another new instance with only one database imported. It has seemed fine for more than a day. The logs were present nevertheless.

The log line surrealdb_core::kvs::tx: Delr page was empty seems to be printed as many times as there are databases in the instance.

In addition to that, I could see some failed attempts to connect to port 4317; not sure if this relates to the CPU usage.

surrealdb  | 2024-04-18T12:32:41.355797Z TRACE tonic::transport::service::reconnect: poll_ready; connecting
surrealdb  | 2024-04-18T12:32:41.355811Z TRACE hyper::client::connect::http: connect error for 127.0.0.1:4317: ConnectError("tcp connect error", Os { code: 111, kind: ConnectionRefused, message: "Connection refused" })
surrealdb  | 2024-04-18T12:32:41.355821Z TRACE tonic::transport::service::reconnect: poll_ready; error
surrealdb  | 2024-04-18T12:32:41.355824Z DEBUG tonic::transport::service::reconnect: reconnect::poll_ready: hyper::Error(Connect, ConnectError("tcp connect error", Os { code: 111, kind: ConnectionRefused, message: "Connection refused" }))
surrealdb  | 2024-04-18T12:32:41.355830Z DEBUG tower::buffer::worker: service.ready=true processing request
surrealdb  | 2024-04-18T12:32:41.355836Z TRACE tonic::transport::service::reconnect: Reconnect::call
surrealdb  | 2024-04-18T12:32:41.355839Z DEBUG tonic::transport::service::reconnect: error: error trying to connect: tcp connect error: Connection refused (os error 111)

MerryOscar · 2024-04-20T12:05:58Z

also seeing this

Micnubinub · 2024-04-21T00:46:40Z

Same here 1.4.0 and 1.4.2, sitting between 75 and 100% CPU usage. RAM keeps cycling between 220 and 320MB... No queries running

Screen.Recording.2024-04-21.at.10.45.21.mov

iteroji · 2024-04-21T11:36:58Z

I just tried removing my old database file (the one I created with v1.0.0) and started 1.4.0. The cpu usage is back to the 0-1% range. It seems like the problem is with something in the database file itself.

micisse · 2024-04-21T18:28:12Z

It's not just related to docker, just running surrealdb start ... causes this problem and hogs all the memory, when I stop it the machine goes back to normal. It seems to launch several instances in a single run.

sample.mp4

Linux
SurrealDB 1.4.0 for linux on x86_64 & SurrealDB 1.4.2 for linux on x86_64

KristjanVall · 2024-04-25T07:20:37Z

+1

iteroji · 2024-04-27T11:39:07Z

I noticed after about a week of normal operation the process started idling at 6-8%. I updated to 1.4.2 a few days ago, but not sure if it has anything to do with it.

iteroji · 2024-05-01T08:32:00Z

I've been observing the CPU usage and it seems that it keeps going upwards every day. From 6-8% 4 days ago it went up to around 10-12%. I'm really afraid to deploy it in production, I see it as a huge risk. I have to revert back to 1.3.1.

If anyone figures out a solution to this, please, share it with us

micisse · 2024-05-01T11:55:27Z

Don't dare deploy it in production; it's an existential problem that has appeared since certain versions. I've never had this problem before. We tried to downgrade to earlier versions, but there were compatibility errors with our backups. Newer versions work with backups created with older versions, but older versions do not work with backups from newer ones and display compatibility errors. This problem also exists on Windows; the issue below apparently describes the same problem on the same releases. Since the ticket was opened, the team has not yet reacted, so you'll have to wait if you don't have a database alternative.

Bug: Massive performance degradation on v1.4.2 #3935

phughk · 2024-05-01T14:34:12Z

Hi @iteroji, @ragavendaran, @MerryOscar, @Micnubinub, @micisse, @KristjanVall

We have some ideas, but don't have access to your environments - could you please share some information on the above to help us debug?

1. Are you using live queries?
2. Are you using change feeds?
3. How long have you been running the DBMS (SurrealDB)? (creation date, approximate size)
4. How long have you been running change feeds, and what is the retention size?
5. Could you please attach the output logs at trace logging level?
6. Could you try running without trace logs to see if that reduces usage, please?
7. Are you using Live queries v2?

We have some ideas what it could be, but we would like to get some confirmation so any help will be useful on the above questions.

Thanks for the help in finding this and helping resolve,

Hugh

micisse · 2024-05-01T15:40:40Z

@phughk

1. In my case and since I've been using SurrealDB, no.
2. In my current case and for this app, no.
3. Difficult to assess, but in my case, I can say well before version 1.0.0, i.e. v0.x.x, so since October/November 2023.
4. 0, I've had no use for it.
5. I uploaded this video as soon as SurrealDB was launched with the --trace flag #3906 (comment). The logs are still the same, just like the ticket author's (for docker). If you need anything else, give us instructions on how to get exactly the logs you're interested in.
6. In my case, I've been running SurrealDB without logs (--trace) ever since I started using it, because everything was working normally and there was nothing special to observe. I only used the logs to make the video above. In dev mode, here's the command I use: surrealdb start --allow-all -u root -p root file:///...
7. In my case, no; it's not necessary for the application in its current state.

The novelty of this bug is that when I CTRL+C to stop the server, I get a Failed to send shutdown signal to task: sending on a closed channel error as in the video, which I didn't get before. Another observation is that this bug is variable: one day everything's fine, and another day we see excessive memory consumption.

phughk · 2024-05-01T15:46:44Z

That's really useful, thank you! Particularly about the channel - I saw something like that in development and believed it was resolved; Will have a look as you shouldn't see the channel closed

ragavpr · 2024-05-02T12:18:07Z

Hi @phughk

I'm not using live queries or change feeds.

(3) I exported my data and imported it into a new instance of SurrealDB 1.4.0; the CPU usage was high when I checked after 12 hours (overnight).
(6) Running without logs has no effect on CPU usage; it remains high either way.

(5) Sure, I took a backup after the clean import and another backup after 12 hours (Docker volumes).
No matter which machine I use (arm/x86), the new one has 0% CPU usage and the aged one has a constant 16% CPU usage; the logs seem the same.

I've attached both logs; each also includes, at the top, the tree structure of the database folder at that snapshot.

High-CPU-Usage-(12H-aged).txt
No-CPU-Usage-(New-Instance).txt
Hope this helps.

I did not see Failed to send shutdown signal to task: sending on a closed channel in my instance when stopping.

I suspected that events were causing this and tested by inserting some data into tables with events set up, but to no avail.

phughk · 2024-05-02T14:01:47Z

Hi @ragavendaran thank you for providing those logs. I am a bit surprised that the timestamps don't indicate that the db was running for a long time. For example the first log message is

 47 surrealdb  | 2024-05-02T11:45:17.700722Z TRACE tower::buffer::worker: worker polling for next message

and the last is

1072 surrealdb  | 2024-05-02T11:45:59.903767Z TRACE surrealdb_core::kvs::tx: Delr page was empty

As far as I can see, that indicates about a single minute of logging.
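The span can be checked directly from the two timestamps quoted above. This is a minimal sketch using only the Python standard library; the helper name `parse_ts` is an illustrative assumption, and the line numbers 47 and 1072 are the ones shown in front of the two quoted log lines.

```python
from datetime import datetime


def parse_ts(ts):
    """Parse a timestamp in the form printed by the log, e.g. 2024-05-02T11:45:17.700722Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")


# First and last timestamps quoted from the attached log.
first = parse_ts("2024-05-02T11:45:17.700722Z")
last = parse_ts("2024-05-02T11:45:59.903767Z")

# Wall-clock time covered by the log, in seconds.
elapsed = (last - first).total_seconds()

# Line numbers of the first and last quoted lines in the attached file.
lines_logged = 1072 - 47

if __name__ == "__main__":
    print(f"elapsed: {elapsed:.3f} s")
    print(f"average rate: {lines_logged / elapsed:.1f} lines/s")
```

The difference comes out at roughly 42 seconds, which is consistent with "about a single minute of logging".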

We are going to try and reproduce this internally, but were wondering if you may have better log samples. Also if you are able to get a DTrace from the instance, that would be fantastic. A tool I have been using for that is flamegraph https://github.com/flamegraph-rs/flamegraph?tab=readme-ov-file

Micnubinub · 2024-05-02T16:01:09Z

@phughk

Yes, I am using Live Queries
A few months, exported only about 2MB... (High usage is usually after a restart)
Not sure, I'm using JS beta5 and the lib always logs this (haven't used them since the beta got released):

ragavpr · 2024-05-02T18:49:55Z

Hi @phughk

I ran the instance for only a minute; the CPU usage gets high within a few seconds (once the problem has started, it won't go away with restarts).

I first spotted the issue after 12 hours in a cleanly imported instance.

I've got the flamegraphs (for linux-amd64 bare-metal) highlighting the CPU usage from both versions 1.4.0 and 1.4.2
flamegraphs.tar.gz

I did see the Failed to send shutdown signal to task: sending on a closed channel message this time.

I'll be happy to provide more details if needed.

Edit: Also confirmed this issue with v1.4.0.beta.1 release.

phughk · 2024-05-03T13:33:07Z

Hi everyone thank you for sharing all the information, it is very useful.

The flamegraphs indicate that there are 3 main parts to this, and the largest part seems to be accessing rocksdb (libc.so.6). Because you are all running release builds, we don't have debug information in the flamegraphs and cannot easily decipher exactly what the above functions are. However, since polling functions are present, we can speculate that polling is involved, and the Delr message may be part of the issue.

Can you please confirm @iteroji, @ragavendaran, @MerryOscar, @Micnubinub, @micisse, @KristjanVall

number of namespaces
number of databases
number of tables
number of records
are you using events
approximately how many connections you have open to the database
nature of requests (number of select vs update vs create vs other)

We are continuing trying to reproduce this locally but can't immediately see how this is happening for you. We can confirm #3952 and have a fix in the works for that.

micisse · 2024-05-03T14:38:45Z

@phughk

number of namespaces: 6 created by me
number of databases: each namespace contains one database
number of tables: each database contains 4 tables (SCHEMALESS)
number of records:
- 2 tables with 300 records (I've limited them to 300, otherwise there would be many more). Select is frequent; Delete and Create are not very frequent. When a new entry is added, the oldest is automatically deleted (1 to 2 changes per day depending on the interface). No Update.
- 2 tables with currently 7 records (creation, modification, deletion are not very frequent: 1 to 2 changes per day, or none)
are you using events: no
approximate number of open database connections: only one, in dev mode
nature of requests (number of select vs update vs create vs other): CRUD, mostly Select. Creations, deletions and modifications are rare. The data is known and pre-loaded, transformed once from CSV into Rust.

I took the opportunity to run a test with and without a backup created with older versions of surrealdb.

Running surrealdb v1.4.x without a backup, starting from 0 in a new folder with no load ramp-up, high memory consumption (just launch, no select, delete or update...).

without.backup.mp4

Running surrealdb v1.4.x with a backup created from previous versions and then modified (export/import) as surrealdb is updated, with an immediate increase in load and memory consumption (just launch, no selection, deletion or update...).

with.backup.mp4

ragavpr · 2024-05-03T14:50:14Z

Hi @phughk

1 Namespace
6 Databases (2 databases have Events).
28 to 30 Tables.
500K+ records.
I do use events.
max 2 to 3 connections for a short time (then no connections most of the time; this issue probably started when there were no connections)
60% select, 30% create, 6% update, 4% delete (CPU usage will be high even without any connections / running queries)

Was able to get a flamegraph from a debug build of 1.4.0-beta-1, but limited samples for 10 seconds (CPU usage was high the entire time)
flamegraph-debug-10sec.tar.gz

I can try getting a full minute flamegraph from debug build with even fewer samples, if the above one is insufficient.

I've created another clean instance (1.4.0-beta-1) with a single database including events, if the issue crops up there, I'll have full logs to share.

ragavpr · 2024-05-04T05:44:10Z

As expected, I was able to recreate the issue this time. After letting the instance (1.4.0-beta.1 arm64-docker) run for about 12 hours, I saw a constant 2% CPU usage without any connections or queries.

I backed up this instance (entire database file) at 4 points in time.

Immediately after creation (0 min)
After importing 1 database with events (15 min)
After performing all operations in that database (30 min and 200+ queries)
After 12 hours of idling (12 hours and 20 queries)

I used these backups to run a debug build (1.4.0-beta.1 amd64-local) against each snapshot (1 minute of runtime each).

I'm attaching only the perfs this time (flamegraph takes a really long time to generate an .svg from debug perfs).
debug-perfs.tar.gz

Thank you for introducing me to flamegraphs. I've found yet another amazing tool, Hotspot.

perfs after 12 hours (snapshot 4)

perfs before 12 hours (snapshot 3)

perfs before and after 12h, despite very little activity in between the snapshots, CPU usage became high.

phughk · 2024-05-07T13:15:47Z

Hey @ragavendaran, 2% passive usage is tolerable, as the database is doing liveness checks. I can see you mentioned that after 12h the CPU was high, but how high? 2% or higher? And was this constant or just spikes? Thank you for providing this information and the captures; I'm looking at them now. The perfdata files are missing the corresponding stack files. Do you have these by any chance?

ragavpr · 2024-05-09T14:33:56Z

Hi @phughk, there were no changes to the database between the start and the end of the 12 hours of idling. There was no CPU usage in the beginning (snapshot 3), and 2 to 4% after the 12-hour idle (snapshot 4). It spikes every second and looks constant in htop.
Yes, 2 to 4% CPU usage in this test instance, which is lower than the ~25% CPU usage in the original instance with 6 databases.

I believe, the perf command records stack frames
perf record --call-graph dwarf ...

I'm not sure about the stack files. If there is a specific perf command to use, and some directions to get the stack files, I'll be able to get them.
After your suggestion, I'll try getting the perfs again for this test instance and the original instance with 25% CPU usage.

phughk · 2024-05-10T09:09:41Z

Yeah, this is the polling rate for change feeds and live queries. We have changed it back to the 10s polling rate that it was in 1.3, as 1.4 changed it to 1s; 1.5 will have the fix for this. Thanks all for your cooperation in resolving this
@iteroji @ragavendaran @MerryOscar @Micnubinub @micisse @KristjanVall

Will close this ticket after 1.5 is tested by you all
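As a rough illustration of what the 1s versus 10s polling rate means for an idle server, the number of ticks per day can be worked out as follows. This is only back-of-the-envelope arithmetic: the function name `ticks_per_day` is hypothetical, and it assumes every tick costs the same, so it does not model why the per-tick cost grew as instances aged.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def ticks_per_day(interval_seconds):
    """Number of polling ticks fired in one day for a given tick interval (whole seconds)."""
    return SECONDS_PER_DAY // interval_seconds


if __name__ == "__main__":
    fast = ticks_per_day(1)    # 1.4.x default polling rate described above
    slow = ticks_per_day(10)   # 1.3 rate, restored for 1.5
    print(f"1s interval:  {fast} ticks/day")
    print(f"10s interval: {slow} ticks/day")
    print(f"reduction factor: {fast // slow}x")
```

Under that assumption, going back to a 10s interval cuts the idle polling work by a factor of ten.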

phughk · 2024-05-10T21:52:50Z

1.5.0-beta1 is out, please try it if you have time. It should resolve the issue.
https://deploy-preview-564--surrealdb-web.netlify.app/releases/#v1-5-0-beta-1

micisse · 2024-05-13T07:40:36Z

Hi @phughk, I've been using v1.5.0-beta.1 since Saturday afternoon and I haven't noticed any load increase. I've only been using it for a day and a half, and so far there's nothing to report about the problem that caused the ticket.

However, I'm still seeing these messages and/or errors, as in the video below (view in full):

...TRACE surrealdb_core::kvs::tx: Delr page was empty, but this time with a certain delay and still in bursts.
Server shutdown error (CTRL+C) ...ERROR surreal::cli::start: Failed to send shutdown signal to task: sending on a closed channel

perhaps planned in another fix

Surrealdb.v1.5.0-beta.1.test.mp4

Have a nice day

phughk · 2024-05-13T08:29:17Z

Hey @micisse, thanks. Those are just error messages; it isn't actually a failure. When a channel is closed on one side, the other side fails to send and stops as well. We will remove the error message, but these messages are not a major concern.

The delr is not a concern and will also be removed from trace logs.

phughk · 2024-05-13T09:43:53Z

Closing as this is resolved in 1.5.0-beta1, but can be mitigated on 1.4.x with --tick-interval 10s

iteroji added the bug (Something isn't working) and triage (This issue is new) labels Apr 18, 2024

phughk self-assigned this May 1, 2024

phughk added the topic:performance (Improvements to database performance), topic:live (This is related to live queries and push notifications), and topic:changefeeds (This is related to changefeeds) labels and removed the triage (This issue is new) label May 1, 2024

phughk mentioned this issue May 2, 2024

Improve shutdown of tasks #3977

Closed

3 tasks

This was referenced May 2, 2024

Bug: live queries kill DB connection #3978

Closed

Bug: Error running node agent tick #3952

Open

phughk mentioned this issue May 6, 2024

Bug: Deadlock encountered #3987

Open

2 tasks

This was referenced May 8, 2024

Delete old change feed timestamps #3996

Closed

Gh 3906 reduce polling rate #3999

Merged

phughk added a commit to phughk/surrealdb that referenced this issue May 9, 2024

Merge remote-tracking branch 'surrealdb/main' into surrealdbgh-3906-fix-incorrect-shutdown (20f3e62)

phughk mentioned this issue May 13, 2024

Bug: Delr is displayed in trace logs of releases #4029

Closed

2 tasks

phughk closed this as completed May 13, 2024
