
Potential memory leak in 6.2 #1865

Closed
fmoessbauer opened this issue Feb 8, 2022 · 19 comments
Labels
ILP Issues or changes relating to Influx Line Protocol

Comments

@fmoessbauer

Describe the bug

We are running QuestDB 6.2 (container) and ingesting data at 14 kHz (via ILP, single writer) for a long period. During this period (a couple of days), we do not run a single query.

We observed that the memory usage of QuestDB rises over time, eventually allocating all available memory. When limiting the memory using cgroups (technically via the docker-compose mem_limit setting), the process is periodically OOM-killed and restarted. Further, we observed the following:

  • we do not overrun the DB (all data ends up on disk up to the last commit before the OOM; checked after restart)
  • when stopping the ingestion, the memory usage does not shrink
  • the JVM heap is limited to 1 GB (that could be put into the docs)
  • restarts happen at very regular intervals

time between OOMs (restarts)

  • 03:48:02
  • 04:03:43
  • 03:53:27
  • 03:58:54
  • 05:00:44

docker-compose.yml

services:
  questdb:
    image: docker.io/questdb/questdb:6.2
    volumes:
      - /mnt/db-storage:/root/.questdb
      # Note: setting log-level via env var did not work
      - ./questdb-conf/log.conf:/root/.questdb/conf/log.conf:ro
    ports:
      - '8812:8812'
      - '9000:9000'
      - '9009:9009'
    environment:
      - QDB_LINE_TCP_MAINTENANCE_JOB_INTERVAL=10000
      - QDB_LINE_TCP_DEFAULT_PARTITION_BY=HOUR
      - QDB_CAIRO_COMMIT_LAG=10000
      - QDB_HTTP_QUERY_CACHE_ENABLED=false
      - QDB_PG_SELECT_CACHE_ENABLED=false
      - QDB_PG_INSERT_CACHE_ENABLED=false
    cpus: 4
    mem_limit: 2G
    restart: unless-stopped

QuestDB output after startup

2022-02-08T06:36:59.104203Z A server-main Server config : /root/.questdb/conf/server.conf
2022-02-08T06:36:59.111826Z A server-main Config changes applied:
2022-02-08T06:36:59.111843Z A server-main   http.enabled : true
2022-02-08T06:36:59.111870Z A server-main   tcp.enabled  : true
2022-02-08T06:36:59.111891Z A server-main   pg.enabled   : true
2022-02-08T06:36:59.111912Z A server-main open database [id=8916382354024914915.-5271762009964388491]
2022-02-08T06:36:59.111948Z A server-main platform [bit=64]
2022-02-08T06:36:59.111969Z A server-main OS/Arch: linux/amd64 [AVX2,8]
2022-02-08T06:36:59.112352Z A server-main available CPUs: 4
2022-02-08T06:36:59.112388Z A server-main db root: /root/.questdb/db
2022-02-08T06:36:59.112410Z A server-main backup root: null
2022-02-08T06:36:59.112482Z A server-main db file system magic: 0x6edc97c2 [BTRFS] SUPPORTED
2022-02-08T06:36:59.112712Z A server-main SQL JIT compiler mode: off
2022-02-08T06:36:59.298217Z A i.q.TelemetryJob instance [id=0x05d6f771771573000001f41cdc0163, enabled=true]
2022-02-08T06:36:59.307648Z A http-server listening on 0.0.0.0:9000 [fd=58 backlog=256]
2022-02-08T06:36:59.347898Z A pg-server listening on 0.0.0.0:8812 [fd=62 backlog=10]
2022-02-08T06:36:59.380170Z A tcp-line-server listening on 0.0.0.0:9009 [fd=64 backlog=256]
2022-02-08T06:36:59.408147Z A server-main enjoy
2022-02-08T06:36:59.345514Z A http-min-server listening on 0.0.0.0:9003 [fd=60 backlog=4]

To reproduce

No response

Expected Behavior

A fixed upper bound on allocated memory. At least for the containerized version, this limit should be read from the cgroup.
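
(For illustration only: a minimal sketch, not QuestDB code, of how a process could read the cgroup memory limit it runs under; the file paths are the standard cgroup v2/v1 locations and an assumption about the container environment.)

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.OptionalLong;

public class CgroupMemoryLimit {
    // Standard locations of the cgroup memory limit (assumed container layout).
    private static final String[] LIMIT_FILES = {
            "/sys/fs/cgroup/memory.max",                  // cgroup v2
            "/sys/fs/cgroup/memory/memory.limit_in_bytes" // cgroup v1
    };

    public static OptionalLong readLimitBytes() {
        for (String file : LIMIT_FILES) {
            try {
                Path p = Path.of(file);
                if (Files.exists(p)) {
                    String value = Files.readString(p).trim();
                    if (!"max".equals(value)) {           // "max" means unlimited on cgroup v2
                        return OptionalLong.of(Long.parseLong(value));
                    }
                }
            } catch (Exception ignore) {
                // fall through to the next candidate file
            }
        }
        return OptionalLong.empty();
    }

    public static void main(String[] args) {
        readLimitBytes().ifPresentOrElse(
                limit -> System.out.println("cgroup memory limit: " + limit + " bytes"),
                () -> System.out.println("no cgroup memory limit detected"));
    }
}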

Environment

- **QuestDB version**: 6.2 (container)
- **OS**: Debian Bullseye
- **container runtime**: podman + crun (rootless)
- **storage**: BTRFS (for DB), fuse-overlayfs for container rootfs

Additional context

Without memory limitation:

[screenshot: questdb-leak — memory usage (RSS) growing over time without a memory limit]

@bziobrowski
Contributor

bziobrowski commented Feb 8, 2022

Hello @fmoessbauer
Could you run
select dump_memory_usage();
while memory usage is high and paste the logged memory stats here?
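
(Side note: a minimal sketch of how the statement can be issued from Java over PGWire on port 8812, assuming the standard PostgreSQL JDBC driver on the classpath and QuestDB's default admin/quest credentials; the memory breakdown is written to the QuestDB server log rather than returned to the client.)

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.Properties;

public class DumpMemoryUsage {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.setProperty("user", "admin");      // default QuestDB PGWire credentials (assumption)
        props.setProperty("password", "quest");
        props.setProperty("sslmode", "disable");

        // Port 8812 matches the docker-compose mapping in the issue description.
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost:8812/qdb", props);
             Statement stmt = conn.createStatement()) {
            // Output goes to the server log, not the result set.
            stmt.execute("select dump_memory_usage();");
        }
    }
}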

puzpuzpuz added the ILP (Issues or changes relating to Influx Line Protocol) label on Feb 8, 2022
@fmoessbauer
Author

Done. The current usage when executing the statement (according to cgroup) is nearly 2G (with a 2G limit).

2022-02-08T09:47:51.934903Z A dump-memory-usage
        TOTAL: 928297437
        MMAP_DEFAULT: 229520
        NATIVE_DEFAULT: 31735241
        MMAP_O3: 0
        NATIVE_O3: 167772192
        NATIVE_RECORD_CHAIN: 0
        MMAP_TABLE_WRITER: 436248576
        NATIVE_TREE_CHAIN: 0
        MMAP_TABLE_READER: 55094960
        NATIVE_COMPACT_MAP: 0
        NATIVE_FAST_MAP: 0
        NATIVE_FAST_MAP_LONG_LIST: 0
        NATIVE_HTTP_CONN: 102781568
        NATIVE_PGW_CONN: 134217728
        MMAP_INDEX_READER: 8516
        MMAP_INDEX_WRITER: 53488
        MMAP_INDEX_SLIDER: 0
        MMAP_BLOCK_WRITER: 0
        NATIVE_REPL: 131072
        NATIVE_SAMPLE_BY_LONG_LIST: 0
        NATIVE_LATEST_BY_LONG_LIST: 0
        NATIVE_JIT_LONG_LIST: 0
        NATIVE_LONG_LIST: 0
        NATIVE_JIT: 24576

@puzpuzpuz
Contributor

@fmoessbauer would you be able to repeat the query in 30-60 minutes on the same instance, so that we can have a diff?

@fmoessbauer
Author

Right after executing the command, the DB crashed and restarted (no memory stats output). After that (running at ~250MB memory):

        TOTAL: 621618615
        MMAP_DEFAULT: 274600
        NATIVE_DEFAULT: 31734985
        MMAP_O3: 0
        NATIVE_O3: 32
        NATIVE_RECORD_CHAIN: 0
        MMAP_TABLE_WRITER: 352354304
        NATIVE_TREE_CHAIN: 0
        MMAP_TABLE_READER: 90
        NATIVE_COMPACT_MAP: 0
        NATIVE_FAST_MAP: 0
        NATIVE_FAST_MAP_LONG_LIST: 0
        NATIVE_HTTP_CONN: 102781568
        NATIVE_PGW_CONN: 134217728
        MMAP_INDEX_READER: 46172
        MMAP_INDEX_WRITER: 53488
        MMAP_INDEX_SLIDER: 0
        MMAP_BLOCK_WRITER: 0
        NATIVE_REPL: 131072
        NATIVE_SAMPLE_BY_LONG_LIST: 0
        NATIVE_LATEST_BY_LONG_LIST: 0
        NATIVE_JIT_LONG_LIST: 0
        NATIVE_LONG_LIST: 0
        NATIVE_JIT: 24576
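
(To make the comparison with the first dump easier, a small hypothetical helper, not part of QuestDB, that parses two pasted dumps and prints the per-category delta:)

import java.util.LinkedHashMap;
import java.util.Map;

public class MemoryDumpDiff {
    // Parses lines of the form "CATEGORY: value" into an ordered map.
    static Map<String, Long> parse(String dump) {
        Map<String, Long> stats = new LinkedHashMap<>();
        for (String line : dump.split("\n")) {
            String[] parts = line.trim().split(":\\s+");
            if (parts.length == 2 && parts[1].matches("\\d+")) {
                stats.put(parts[0], Long.parseLong(parts[1]));
            }
        }
        return stats;
    }

    // Prints only the categories whose value changed between the two dumps.
    static void diff(String before, String after) {
        Map<String, Long> a = parse(before);
        Map<String, Long> b = parse(after);
        for (String key : a.keySet()) {
            long delta = b.getOrDefault(key, 0L) - a.get(key);
            if (delta != 0) {
                System.out.printf("%-24s %+d%n", key, delta);
            }
        }
    }

    public static void main(String[] args) {
        String before = "TOTAL: 928297437\nMMAP_TABLE_WRITER: 436248576\nNATIVE_O3: 167772192";
        String after  = "TOTAL: 621618615\nMMAP_TABLE_WRITER: 352354304\nNATIVE_O3: 32";
        diff(before, after); // TOTAL -306678822, MMAP_TABLE_WRITER -83894272, NATIVE_O3 -167772160
    }
}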

@ideoma
Collaborator

ideoma commented Feb 8, 2022

Thanks for all the info. Can you please include the table definition? You can copy it to the clipboard in the web console.

@bziobrowski
Contributor

Could you also share logs from the time of the crash?
If QDB was killed due to OOM, could you share the related system log lines?

@fmoessbauer
Author

Table definitions (tables and columns renamed due to IP):

table_a (HOUR):
 a (double)
 b (double)
 c (double)
 d (double)
 e (double)
 f (double)
 timestamp (timestamp)
 
table_b (HOUR):
 a (symbol)
 b (long)
 c (long)
 d (long)
 timestamp (timestamp)
 
table_c (HOUR):
 a (double)
 b (double)
 c (double)
 d (double)
 timestamp (timestamp)

syslog at time of OOM:

[588914.530368] VM Periodic Tas invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[588914.541333] CPU: 6 PID: 76329 Comm: VM Periodic Tas Not tainted 5.10.0-10-amd64 #1 Debian 5.10.84-1
[588914.551012] Hardware name: [...]
[588914.559217] Call Trace:
[588914.562156]  dump_stack+0x6b/0x83
[588914.565917]  dump_header+0x4a/0x1f0
[588914.569841]  oom_kill_process.cold+0xb/0x10
[588914.574463]  out_of_memory+0x1bd/0x500
[588914.578649]  mem_cgroup_out_of_memory+0x134/0x150
[588914.583792]  try_charge+0x75e/0x7d0
[588914.587712]  mem_cgroup_charge+0x7f/0x240
[588914.592159]  ? __add_to_page_cache_locked+0x3b7/0x430
[588914.597651]  __add_to_page_cache_locked+0x3b7/0x430
[588914.602967]  ? scan_shadow_nodes+0x30/0x30
[588914.607507]  add_to_page_cache_lru+0x48/0xc0
[588914.612527]  pagecache_get_page+0x145/0x380
[588914.617336]  filemap_fault+0x69e/0x900
[588914.621710]  ? xas_find+0x172/0x1c0
[588914.625831]  ? filemap_map_pages+0x223/0x410
[588914.630710]  __do_fault+0x36/0x120
[588914.634733]  handle_mm_fault+0x11e7/0x1bf0
[588914.639418]  do_user_addr_fault+0x1b8/0x3f0
[588914.644205]  exc_page_fault+0x78/0x160
[588914.648554]  ? asm_exc_page_fault+0x8/0x30
[588914.653257]  asm_exc_page_fault+0x1e/0x30
[588914.657874] RIP: 0033:0x7f5f65b73cf0
[588914.661886] Code: Unable to access opcode bytes at RIP 0x7f5f65b73cc6.
[588914.668848] RSP: 002b:00007f5f3e9f3db8 EFLAGS: 00010206
[588914.674524] RAX: 00007f5f660cc5f8 RBX: 00007f5f66cb8fc8 RCX: 0000000000000002
[588914.682106] RDX: 0000000000000001 RSI: 0000000000000032 RDI: 00007f5f60197650
[588914.689692] RBP: 00007f5f3e9f3dd0 R08: 00007f5f60197650 R09: 0000000000056c11
[588914.697270] R10: 00007ffc23bfa080 R11: 0000000000000206 R12: 00007f5f60197740
[588914.704852] R13: 0000000000000001 R14: 0000000000000002 R15: 00007f5f661db800
[588914.712636] memory: usage 2097792kB, limit 2097152kB, failcnt 23910853
[588914.719814] swap: usage 0kB, limit 2097152kB, failcnt 0
[588914.725721] Memory cgroup stats for /user.slice/user-1000.slice/user@1000.service/user.slice/user-libpod_pod_[...]/container:
[588914.725761] anon 2134052864
                file 0
                kernel_stack 294912
                percpu 0
                sock 2641920
                shmem 0
                file_mapped 0
                file_dirty 0
                file_writeback 0
                anon_thp 1730150400
                inactive_anon 2134024192
                active_anon 0
                inactive_file 0
                active_file 0
                unevictable 0
                slab_reclaimable 3935776
                slab_unreclaimable 262136
                slab 4197912
                workingset_refault_anon 0
                workingset_refault_file 15293883
                workingset_activate_anon 0
                workingset_activate_file 515724
                workingset_restore_anon 0
                workingset_restore_file 5412
                workingset_nodereclaim 14751
                pgfault 43938444
                pgmajfault 6930
                pgrefill 431699
                pgscan 48606847
                pgsteal 31996098
                pgactivate 2046858
                pgdeactivate 408481
                pglazyfree 0
                pglazyfreed 0
                thp_fault_alloc 660
                thp_collapse_alloc 495
[588914.853266] Tasks state (memory values in pages):
[588914.858438] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[588914.867528] [  76205]  1000 76205  1608174   520773  5038080        0             0 java
[588914.876084] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=container,mems_allowed=0,oom_memcg=/user.slice/user-1000.slice/user@1000.service/user.slice/user-libpod_pod_[...].scope/container,task_memcg=/user.slice/user-1000.slice/user@1000.service/user.slice/user-libpod_pod_[...].scope/container,task=java,pid=76205,uid=1000
[588914.930880] Memory cgroup out of memory: Killed process 76205 (java) total-vm:6432696kB, anon-rss:2083092kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:4920kB oom_score_adj:0
[588914.971095] oom_reaper: reaped process 76205 (java), now anon-rss:0kB, file-rss:104kB, shmem-rss:0kB

@ideoma
Collaborator

ideoma commented Feb 8, 2022

One more question: does the number of distinct values in the symbol column increase over time, and if so, how fast?

@fmoessbauer
Author

One more question: does the number of distinct values in the symbol column increase over time, and if so, how fast?

No, we use just 2 fixed values there (3 and 14 characters each).

@bziobrowski
Contributor

bziobrowski commented Feb 8, 2022

Did you prepare the table definition by copying from the web console?
Could you please check whether the tables are in fact partitioned (e.g. by looking into the data directories)?

By the way, with memory limited to 2GB you might be getting much more frequent crashes if the JVM tries to use more memory. I think the default 1GB heap limit comes from JVM ergonomics and doesn't include non-heap areas like metaspace or thread stacks. 2GB doesn't leave a lot of room for mmap-ed files.
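
(A quick way to confirm the effective heap ceiling inside the container is to print the runtime values; a minimal sketch using only standard JDK APIs, not QuestDB code:)

public class JvmMemoryReport {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory() reflects the effective -Xmx / ergonomics-derived heap ceiling.
        System.out.printf("max heap:   %d MiB%n", rt.maxMemory() / (1024 * 1024));
        System.out.printf("total heap: %d MiB%n", rt.totalMemory() / (1024 * 1024));
        System.out.printf("free heap:  %d MiB%n", rt.freeMemory() / (1024 * 1024));
        // Note: this covers the Java heap only; metaspace, thread stacks and
        // mmap-ed files come on top of this figure.
    }
}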

@fmoessbauer
Author

Did you prepare the table definition by copying from the web console?

Yes, and it is actually partitioned by hour on disk (directory names similar to 2022-02-08T13 for all tables).

By the way, with memory limited to 2GB you might be getting much more frequent crashes if the JVM tries to use more memory

Yes, that was the intention: I wanted to make the OOM appear sooner. But in fact you can choose an arbitrary size here (tested with 4GB, 8GB, 16GB); only the time until the OOM is longer.

@bziobrowski
Contributor

Could you come over to our public Slack channel -> https://slack.questdb.io/? It will be faster to troubleshoot this way.

@bziobrowski
Contributor

After checking smaps, OS stats & JVM heap stats, it looks like:
the RSS reported by smaps is much lower than that reported by OS tools.
The JVM heap footprint is tiny.

Potential root causes:

  1. JVM leak (seems unlikely because the reported heap size was tiny)
  2. off-heap leak (unlikely because the smaps RSS wasn't high and the QDB stats didn't show anything unusual)
  3. OS/container leak

@fmoessbauer I'm on the public Slack channel if you'd like to continue troubleshooting.
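
(For reference, a sketch of the kind of cross-check used above: summing the Rss entries from /proc/<pid>/smaps and comparing the total with the container cgroup's usage counter, which is what OS tools and the OOM killer go by; Linux-only, standard procfs/cgroup paths assumed.)

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class RssCrossCheck {
    // Sums all "Rss:" entries (in kB) from /proc/<pid>/smaps.
    static long smapsRssKb(long pid) throws IOException {
        try (var lines = Files.lines(Path.of("/proc/" + pid + "/smaps"))) {
            return lines.filter(l -> l.startsWith("Rss:"))
                    .mapToLong(l -> Long.parseLong(l.replaceAll("\\D+", "")))
                    .sum();
        }
    }

    // Reads the cgroup's own usage counter (the value charged against the memory limit).
    static long cgroupUsageKb() throws IOException {
        for (String file : new String[]{
                "/sys/fs/cgroup/memory.current",                   // cgroup v2
                "/sys/fs/cgroup/memory/memory.usage_in_bytes"}) {  // cgroup v1
            Path p = Path.of(file);
            if (Files.exists(p)) {
                return Long.parseLong(Files.readString(p).trim()) / 1024;
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        long pid = args.length > 0 ? Long.parseLong(args[0]) : ProcessHandle.current().pid();
        System.out.println("smaps Rss sum: " + smapsRssKb(pid) + " kB");
        System.out.println("cgroup usage:  " + cgroupUsageKb() + " kB");
    }
}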

@ideoma
Collaborator

ideoma commented Feb 10, 2022

I tried to reproduce the issue by sending random data to the 3 tables with your schema at 14 kHz over a single write connection:

CREATE TABLE 'table_a' (a DOUBLE, b DOUBLE, c DOUBLE, d DOUBLE, e DOUBLE, f DOUBLE, timestamp TIMESTAMP) timestamp (timestamp) PARTITION BY HOUR;

CREATE TABLE 'table_b' (a SYMBOL capacity 256 CACHE, b LONG, c LONG, d LONG, timestamp TIMESTAMP) timestamp (timestamp) PARTITION BY HOUR;

CREATE TABLE 'table_c' (a DOUBLE, b DOUBLE, c DOUBLE, d DOUBLE, timestamp TIMESTAMP) timestamp (timestamp) PARTITION BY HOUR;

// Imports below are assumed based on the QuestDB 6.x source tree.
import io.questdb.cutlass.line.LineTcpSender;
import io.questdb.network.Net;
import io.questdb.std.Os;
import io.questdb.std.Rnd;

public class LineTCPSenderMain {
    public static void main(String[] args) {
        long count = 0;
        String hostIPv4 = "127.0.0.1";
        int port = 9009;               // default ILP/TCP port
        int bufferCapacity = 4 * 1024;

        final Rnd rnd = new Rnd();
        long start = System.nanoTime();
        try (LineTcpSender sender = new LineTcpSender(Net.parseIPv4(hostIPv4), port, bufferCapacity)) {
            while (true) {
                // Throttle to roughly 14k rows/s: flush and back off when ahead of the target rate.
                if (count * 1_000_000_000L / (System.nanoTime() - start) > 14_000) {
                    sender.flush();
                    Os.sleep(10);
                } else {
                    int metric = (int) (rnd.nextLong() % 3); // may be negative; negatives hit the default branch and are skipped
                    switch (metric) {
                        case 0:
                            sender.metric("table_a")
                                    .field("a", rnd.nextDouble())
                                    .field("b", rnd.nextDouble())
                                    .field("c", rnd.nextDouble())
                                    .field("d", rnd.nextDouble())
                                    .field("e", rnd.nextDouble())
                                    .field("f", rnd.nextDouble());
                            break;
                        case 1:
                            sender.metric("table_b")
                                    .tag("a", rnd.nextBoolean() ? "a" : "asldfkjasldalkdjf")
                                    .field("b", rnd.nextLong())
                                    .field("c", rnd.nextLong())
                                    .field("d", rnd.nextLong());
                            break;
                        case 2:
                            sender.metric("table_c")
                                    .field("a", rnd.nextDouble())
                                    .field("b", rnd.nextDouble())
                                    .field("c", rnd.nextDouble())
                                    .field("d", rnd.nextDouble());
                            break;
                        default:
                            continue;
                    }
                    sender.$(Os.currentTimeNanos());
                    count++;
                }
            }
        }
    }
}

After 20 minutes the memory was stable; dump_memory_usage did not change by a single byte. The values are the same in total and in every category, which is very different from what you see.

	TOTAL: 883919410

Here are a few questions to help simulate your usage:

  1. Do you use the same TCP connection to send data for all 3 tables?
  2. Do you write to these 3 tables only before experiencing the OOM, or is the total number of tables written greater?
  3. Is your data out of order (e.g. you specify the timestamp in the ILP message and it may be out of order)?
  4. Do you connect / disconnect the ILP TCP connection or keep it always open?

@bziobrowski
Contributor

I did another check with Felix and things seemed stable with the memory limit set to 4GB, ongoing data ingestion, and queries against big tables.
As this is a test system, Felix will temporarily remove the memory limit and let it run until 5GB RSS is reached, and then we'll inspect memory allocations again. I'll conduct some experiments to see if mmap read memory pressure can trigger an OOM when (almost) all RAM is utilized.

@fmoessbauer
Author

Hi, after ~5 hours of insertion, we are now at 5GB RSS.

dump_memory_usage()

2022-02-14T15:34:40.519115Z A dump-memory-usage
        TOTAL: 1530713181
        MMAP_DEFAULT: 200824
        NATIVE_DEFAULT: 31734713
        MMAP_O3: 707879832
        NATIVE_O3: 167776272
        NATIVE_RECORD_CHAIN: 0
        MMAP_TABLE_WRITER: 385908736
        NATIVE_TREE_CHAIN: 0
        MMAP_TABLE_READER: 0
        NATIVE_COMPACT_MAP: 0
        NATIVE_FAST_MAP: 0
        NATIVE_FAST_MAP_LONG_LIST: 0
        NATIVE_HTTP_CONN: 102781568
        NATIVE_PGW_CONN: 134217728
        MMAP_INDEX_READER: 8516
        MMAP_INDEX_WRITER: 49344
        MMAP_INDEX_SLIDER: 0
        MMAP_BLOCK_WRITER: 0
        NATIVE_REPL: 131072
        NATIVE_SAMPLE_BY_LONG_LIST: 0
        NATIVE_LATEST_BY_LONG_LIST: 0
        NATIVE_JIT_LONG_LIST: 0
        NATIVE_LONG_LIST: 0
        NATIVE_JIT: 24576

QuestDB config

      - QDB_LINE_TCP_MAINTENANCE_JOB_INTERVAL=10000
      - QDB_LINE_TCP_DEFAULT_PARTITION_BY=HOUR
      - QDB_CAIRO_COMMIT_LAG=10000
      - QDB_CAIRO_MAX_UNCOMMITTED_ROWS=10000
      - QDB_HTTP_QUERY_CACHE_ENABLED=false
      - QDB_PG_SELECT_CACHE_ENABLED=false
      - QDB_PG_INSERT_CACHE_ENABLED=false

I just sent the smaps, GC and heap stats via Slack.

Regarding the questions from above:

Do you use the same TCP connection to send data for all 3 tables?
No, one connection per table.

Do you write to these 3 tables only before experiencing the OOM, or is the total number of tables written greater?
We just have these three tables.

Is your data out of order (e.g. you specify the timestamp in the ILP message and it may be out of order)?
No, strictly sequential per table. There might be a lag of ~10 seconds between tables.

Do you connect / disconnect the ILP TCP connection or keep it always open?
We keep it open.

@bluestreak01
Member

Should be fixed in #1901.

@pswu11
Contributor

pswu11 commented Mar 1, 2022

Hi @fmoessbauer, this issue is fixed in our latest release, 6.2.1. Could you give it a try and let us know if it's working on your side? Thanks!

@fmoessbauer
Author

@pswu11 Looks like that fixed the issue. We no longer observe monotonic growth of the RSS.
