
Memory leak in zfs when importing a pool with a txg specified to use for rollback #5389

Closed · fling- opened this issue Nov 12, 2016 · 11 comments
Labels: Component: Memory Management (kernel memory management)

Comments

fling- (Contributor) commented Nov 12, 2016

I have a healthy and importable pool.
But zpool import hangs when I try to import with a txg:
zpool import -o readonly=on -R /mnt/gentoo -T (some-recent-txg-number) tmp

With at least one of the txgs, zfs starts allocating RAM and stops at ~16 GB.
With all the other txgs tested it never stops allocating, consuming at least 60 GB, and memory usage keeps growing.
The box hangs in both cases; the import never returns.
The issue is reproducible on both FreeBSD and illumos, and the leaking behavior is identical.
The last tested version is 0.6.5.8.
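
For anyone trying to reproduce this, the memory growth can be watched from a second shell while the rewind import runs; a minimal sketch for ZFS on Linux (the /proc paths below are Linux-specific and will not exist on FreeBSD or illumos, and the 2-second interval is arbitrary):

# Watch ARC size and the SPL slab caches grow while 'zpool import -T' runs.
watch -n 2 "grep -E '^(size|c_max) ' /proc/spl/kstat/zfs/arcstats; head -n 20 /proc/spl/kmem/slab"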

behlendorf added the Component: Memory Management label Nov 14, 2016
behlendorf added this to the 0.7.0 milestone Nov 14, 2016
behlendorf (Contributor) commented

@fling- could you check whether the issue is reproducible with the 0.7.0-rc2 tag or newer?

fling- (Contributor, Author) commented Nov 15, 2016

@behlendorf should I image the drives for backup purposes prior to trying the import with 0.7.x, or am I good if using -o readonly=on?

behlendorf (Contributor) commented

@fling- there's no need to image the drives before trying 0.7.0. Just make sure you don't run zpool upgrade, which will enable several new features; enabling them will prevent you from going back to a 0.6.5.x release. Importing the pool read-only isn't a bad idea if you want to be extra careful; that will ensure no changes are made to the pool.
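
A minimal sketch of the cautious workflow described above, using the pool name tmp and mount point from earlier in the thread (these commands only read pool state; nothing here enables new features):

# Import read-only so nothing on disk can be modified.
zpool import -o readonly=on -R /mnt/gentoo tmp

# Confirm the pool really came up read-only.
zpool get readonly tmp

# List the pool's feature flags and their state without changing anything.
zpool get all tmp | grep feature@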

behlendorf (Contributor) commented
@fling- any update?

behlendorf modified the milestones: 0.8.0, 0.7.0 Mar 20, 2017
fling- (Contributor, Author) commented Sep 23, 2017

@behlendorf still reproducible.

[recovery] localhost ~ # uname -r
4.12.4-recovery-zfs-af0f842
[recovery] localhost ~ # cat /sys/module/{zfs,spl}/version
0.7.0-78_ga35b4cc8
0.7.0-12_g9df9692
[recovery] localhost ~ # zpool import -o readonly=on -R /mnt/gentoo -T 13068024 tmp
2-1.fc27 04/01/2014
[  448.003073] Call Trace:
[  448.003073]  dump_stack+0x4d/0x6a
[  448.003073]  panic+0xca/0x203
[  448.003073]  out_of_memory+0x334/0x470
[  448.003073]  __alloc_pages_slowpath+0xc2f/0xd10
[  448.003073]  __alloc_pages_nodemask+0x1f7/0x210
[  448.003073]  alloc_pages_current+0x8e/0x140
[  448.003073]  __vmalloc_node_range+0x1c0/0x2f0
[  448.003073]  copy_process.part.47+0x5a3/0x1890
[  448.003073]  ? _do_fork+0xbd/0x370
[  448.003073]  ? set_next_entity+0xf6/0x6b0
[  448.003073]  ? put_prev_entity+0x2a/0x540
[  448.003073]  ? kthread_create_on_node+0x40/0x40
[  448.003073]  ? pick_next_task_fair+0x3db/0x4a0
[  448.003073]  _do_fork+0xbd/0x370
[  448.003073]  kernel_thread+0x24/0x30
[  448.003073]  kthreadd+0x12d/0x170
[  448.003073]  ? kthread_create_on_cpu+0x90/0x90
[  448.003073]  ret_from_fork+0x22/0x30
[  448.003073] Kernel Offset: disabled
[  448.003073] ---[ end Kernel panic - not syncing: Out of memory and no killable processes...

fling- (Contributor, Author) commented Sep 23, 2017

@behlendorf the pool is also importable after wiping all the txgs newer than 13068024 with zfs_revert-0.1.py, which destroys the uberblocks containing those txgs. Thanks to @jshoward.
No read-only import is needed, nothing crashes, and there are no leaks. This suggests the txg itself is a safe one and the issue is in the OpenZFS code, not in broken on-disk data.
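
For context, the uberblocks that zfs_revert removes can be inspected beforehand with zdb; a minimal sketch, assuming /dev/sdb1 stands in for one of the pool's vdevs (a hypothetical device name) and a zdb build that can dump uberblocks from labels; run it against an image of the disks rather than the originals:

# Dump the vdev labels and their uberblock arrays; each uberblock records the
# txg and timestamp it belongs to, which shows what a rewind would discard.
zdb -lu /dev/sdb1 | grep -E 'Uberblock|txg|timestamp'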

fling- (Contributor, Author) commented Sep 23, 2017

The import works even when I revert to older txgs. I can snapshot and send old, deleted datasets without any issues. For some txgs I get corrupted data or the pool refuses to import, but in general it works.
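
A minimal sketch of pulling data off the pool once a rewound import succeeds, assuming a recovered dataset tmp/data and a separate pool named backup (both names hypothetical):

# Snapshot the recovered dataset so its current state can be replicated.
zfs snapshot tmp/data@recovered

# Stream it to another pool, or redirect the stream to a file for safekeeping.
zfs send tmp/data@recovered | zfs receive backup/data-recovered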

behlendorf (Contributor) commented

That's good. So the data on disk is almost certainly fine; we're just requiring too much memory as part of the import. Were the most recent results you reported using the 0.7.1 tag?

fling- (Contributor, Author) commented Sep 28, 2017

@behlendorf this one:

[recovery] localhost ~ # cat /sys/module/{zfs,spl}/version
0.7.0-78_ga35b4cc8
0.7.0-12_g9df9692

I used the zfs_revert script on a qcow2 snapshot of the pool in qemu; this is the only way I found to get to the older txgs.
The import with -T is still not working because of the memory usage, even with recent versions.
The regular import works just fine, no issues.

dweeezil (Contributor) commented

@fling- This issue caught my eye in light of the recent changes to the pool import code (6cb8e53 etc.). As pointed out in that commit's log, the import process now allows much more flexibility when rewinding pools and, along with related commits, can provide better error messages when an import fails. Do you still have this pool? If so, could you try a recent master to see whether the problem still occurs?
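
For reference, the rewind machinery that 6cb8e53 reworked is driven from the same zpool import options; a hedged sketch of the recovery-mode variants (check zpool(8) on your build, and prefer a dry run or an image of the disks, since extreme rewind can discard recent txgs):

# Dry-run recovery: report whether the pool could be made importable and to
# which txg it would be rewound, without changing anything.
zpool import -F -n tmp

# Extreme rewind: search much further back for a usable txg. Keep -n for a
# dry run, or run it only against a copy of the devices.
zpool import -F -X -n tmp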

behlendorf (Contributor) commented

Closing. The improved import code in 6cb8e53 should handle this better; if there are still problems for specific pools, let's open a new issue.
