
Memory leak in zfs when importing a pool with a txg specified to use for rollback #5389

Closed · fling- opened this issue Nov 12, 2016 · 11 comments
Labels: Component: Memory Management (kernel memory management)

Comments

fling- (Contributor) commented Nov 12, 2016

I have a healthy and importable pool.
But zpool import hangs when I try to import with a txg:
zpool import -o readonly=on -R /mnt/gentoo -T (some-recent-txg-number) tmp

With at least one of the txgs, zfs starts allocating RAM and stops at ~16 GB.
With all the other txgs tested it never stops allocating, consuming at least 60 GB, and memory usage keeps growing.
The box hangs in both cases; the import never returns.
The issue is reproducible on both FreeBSD and illumos, and the leaking behavior is identical.
The last tested version is 0.6.5.8.
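
For anyone trying to reproduce this, the memory growth can be watched from a second shell while the rewind import runs; a minimal sketch for ZFS on Linux (the /proc paths below are Linux-specific and will not exist on FreeBSD or illumos, and the 2-second interval is arbitrary):

# Watch ARC size and the SPL slab caches grow while 'zpool import -T' runs.
watch -n 2 "grep -E '^(size|c_max) ' /proc/spl/kstat/zfs/arcstats; head -n 20 /proc/spl/kmem/slab"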

behlendorf added the Component: Memory Management label Nov 14, 2016
behlendorf added this to the 0.7.0 milestone Nov 14, 2016
behlendorf (Contributor) commented

@fling- could you check whether the issue is reproducible with the 0.7.0-rc2 tag or newer?

fling- (Contributor, Author) commented Nov 15, 2016

@behlendorf should I image the drives for backup purposes prior to trying the import with 0.7.x, or am I good if using -o readonly=on?

behlendorf (Contributor) commented

@fling- there's no need to image the drives before trying 0.7.0. Just make sure you don't run zpool upgrade, which will enable several new features; enabling them will prevent you from going back to a 0.6.5.x release. Importing the pool read-only isn't a bad idea if you want to be extra careful; that will ensure no changes are made to the pool.
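
A minimal sketch of the cautious workflow described above, using the pool name tmp and mount point from earlier in the thread (these commands only read pool state; nothing here enables new features):

# Import read-only so nothing on disk can be modified.
zpool import -o readonly=on -R /mnt/gentoo tmp

# Confirm the pool really came up read-only.
zpool get readonly tmp

# List the pool's feature flags and their state without changing anything.
zpool get all tmp | grep feature@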

behlendorf (Contributor) commented
@fling- any update?

behlendorf modified the milestones: 0.8.0, 0.7.0 Mar 20, 2017
fling- (Contributor, Author) commented Sep 23, 2017

@behlendorf still reproducible.

[recovery] localhost ~ # uname -r
4.12.4-recovery-zfs-af0f842
[recovery] localhost ~ # cat /sys/module/{zfs,spl}/version
0.7.0-78_ga35b4cc8
0.7.0-12_g9df9692
[recovery] localhost ~ # zpool import -o readonly=on -R /mnt/gentoo -T 13068024 tmp
2-1.fc27 04/01/2014
[  448.003073] Call Trace:
[  448.003073]  dump_stack+0x4d/0x6a
[  448.003073]  panic+0xca/0x203
[  448.003073]  out_of_memory+0x334/0x470
[  448.003073]  __alloc_pages_slowpath+0xc2f/0xd10
[  448.003073]  __alloc_pages_nodemask+0x1f7/0x210
[  448.003073]  alloc_pages_current+0x8e/0x140
[  448.003073]  __vmalloc_node_range+0x1c0/0x2f0
[  448.003073]  copy_process.part.47+0x5a3/0x1890
[  448.003073]  ? _do_fork+0xbd/0x370
[  448.003073]  ? set_next_entity+0xf6/0x6b0
[  448.003073]  ? put_prev_entity+0x2a/0x540
[  448.003073]  ? kthread_create_on_node+0x40/0x40
[  448.003073]  ? pick_next_task_fair+0x3db/0x4a0
[  448.003073]  _do_fork+0xbd/0x370
[  448.003073]  kernel_thread+0x24/0x30
[  448.003073]  kthreadd+0x12d/0x170
[  448.003073]  ? kthread_create_on_cpu+0x90/0x90
[  448.003073]  ret_from_fork+0x22/0x30
[  448.003073] Kernel Offset: disabled
[  448.003073] ---[ end Kernel panic - not syncing: Out of memory and no killable processes...

fling- (Contributor, Author) commented Sep 23, 2017

@behlendorf the pool is also importable after wiping all the txgs newer than 13068024 with zfs_revert-0.1.py, which destroys the uberblocks containing those txgs. Thanks to @jshoward.
No read-only import is needed, nothing crashes, and there are no leaks. This suggests the txg itself is a safe one and the issue is in the OpenZFS code, not in broken on-disk data.
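
For context, the uberblocks that zfs_revert removes can be inspected beforehand with zdb; a minimal sketch, assuming /dev/sdb1 stands in for one of the pool's vdevs (a hypothetical device name) and a zdb build that can dump uberblocks from labels; run it against an image of the disks rather than the originals:

# Dump the vdev labels and their uberblock arrays; each uberblock records the
# txg and timestamp it belongs to, which shows what a rewind would discard.
zdb -lu /dev/sdb1 | grep -E 'Uberblock|txg|timestamp'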

fling- (Contributor, Author) commented Sep 23, 2017

The import works even when I revert to older txgs. I can snapshot and send old, deleted datasets without any issues. For some txgs I get corrupted data or the pool refuses to import, but in general it works.
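
A minimal sketch of pulling data off the pool once a rewound import succeeds, assuming a recovered dataset tmp/data and a separate pool named backup (both names hypothetical):

# Snapshot the recovered dataset so its current state can be replicated.
zfs snapshot tmp/data@recovered

# Stream it to another pool, or redirect the stream to a file for safekeeping.
zfs send tmp/data@recovered | zfs receive backup/data-recovered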

behlendorf (Contributor) commented

That's good. So the data on disk is almost certainly fine; we're just requiring too much memory as part of the import. Were the most recent results you reported using the 0.7.1 tag?

fling- (Contributor, Author) commented Sep 28, 2017

@behlendorf this one:

[recovery] localhost ~ # cat /sys/module/{zfs,spl}/version
0.7.0-78_ga35b4cc8
0.7.0-12_g9df9692

I used the zfs_revert script on a qcow2 snapshot of the pool in qemu; this is the only way I found to get to the older txgs.
The import with -T is still not working because of the memory usage, even with recent versions.
The regular import works just fine, no issues.

dweeezil (Contributor) commented

@fling- This issue caught my eye in light of the recent changes to the pool import code (6cb8e53 etc.). As pointed out in that commit's log, the import process now allows much more flexibility when rewinding pools and, along with related commits, can provide better error messages when an import fails. Do you still have this pool? If so, could you try a recent master to see whether the problem still occurs?
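
For reference, the rewind machinery that 6cb8e53 reworked is driven from the same zpool import options; a hedged sketch of the recovery-mode variants (check zpool(8) on your build, and prefer a dry run or an image of the disks, since extreme rewind can discard recent txgs):

# Dry-run recovery: report whether the pool could be made importable and to
# which txg it would be rewound, without changing anything.
zpool import -F -n tmp

# Extreme rewind: search much further back for a usable txg. Keep -n for a
# dry run, or run it only against a copy of the devices.
zpool import -F -X -n tmp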

behlendorf (Contributor) commented

Closing. The improved import code in 6cb8e53 should handle this better; if there are still problems for specific pools, let's open a new issue.
