failed constraint check (ipos <= new_vec.rbegin()->m_end) #26
ipos 15650891 is not a multiple of 4096, so something is weird there. Is this near the end of a file, especially a file that is currently or recently being modified? There are sometimes glitches where something changes on the filesystem and one source of btrfs metadata disagrees with another because they're capturing data from the filesystem at different times (sometimes several minutes apart). That tends to get discovered by an exception handler when the discrepancy breaks something that doesn't like the filesystem changing underneath it (like the binary search algorithm in ExtentWalker). It's usually harmless, and bees will visit the extents again on the next scan generation. If the file keeps getting modified all the time, bees will keep triggering the exception (patches to implement a glob- or regexp-based path exclusion framework are welcome ;); however, if the file really is being modified all the time, then dedup on that file will be ineffective anyway.
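To give a rough idea of what such an exclusion framework could look like, here is a purely illustrative C++ sketch; the PathExcluder class and its methods are invented for this comment and do not exist in bees, and where it would hook into the scanner is left open:

```cpp
// Hypothetical sketch of a regexp-based path exclusion filter.
// None of these names exist in bees; this only illustrates the idea.
#include <regex>
#include <string>
#include <vector>

class PathExcluder {
	std::vector<std::regex> m_patterns;
public:
	// Compile one exclusion pattern, e.g. ".*/steamapps/.*"
	void add_pattern(const std::string &pattern) {
		m_patterns.emplace_back(pattern, std::regex::extended);
	}

	// True if the path matches any exclusion pattern and the
	// scanner should skip it instead of throwing repeatedly.
	bool excluded(const std::string &path) const {
		for (const auto &re : m_patterns) {
			if (std::regex_match(path, re)) {
				return true;
			}
		}
		return false;
	}
};
```

The scanner would consult excluded() before opening a file, so frequently modified files could simply be skipped by configuration.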
Nothing accesses the file. Actually it's game data from the Windows Steam version running within Wine. But Wine (and thus Steam) is not running currently, and hasn't been running for days. Maybe it results from a bad interaction with autodefrag? Though I don't see why this should happen, because nobody is writing directly to the file. Even bees seems to refuse to work on the file due to the constraint-check exceptions above. There seem to be at least two files of the same game data that keep coming up again and again in a loop, and bees never finishes with them. More details:
It's not even big...
That is sounding more like a bug in the binary search algorithm previously mentioned. :-P Could you run fiemap on the offending file? That will show the extent metadata fields as bees sees them. Maybe there's something anomalous in there that is confusing BtrfsExtentWalker. For that matter, fiewalk should fail in the same way as bees does, but it's a bit more controlled, so it's easier to understand what's going on (be sure to change the #if 0 to #if 1 so we get both walking directions tested).
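In case it helps, dumping the extent metadata can also be done directly with the standard Linux FS_IOC_FIEMAP ioctl; the following is a generic sketch of that (not bees' fiewalk code), and it only fetches the first batch of extents:

```cpp
// Minimal FIEMAP dump: print logical offset, physical offset, length and
// flags for the first 256 extents of a file, similar to `filefrag -v`.
#include <cstdio>
#include <cstdlib>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int main(int argc, char **argv) {
	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	int fd = open(argv[1], O_RDONLY);
	if (fd < 0) { perror("open"); return 1; }

	const unsigned count = 256;  // extents fetched in this single call
	size_t sz = sizeof(struct fiemap) + count * sizeof(struct fiemap_extent);
	struct fiemap *fm = (struct fiemap *)calloc(1, sz);
	fm->fm_start = 0;
	fm->fm_length = FIEMAP_MAX_OFFSET;
	fm->fm_extent_count = count;

	if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) { perror("FS_IOC_FIEMAP"); return 1; }

	for (unsigned i = 0; i < fm->fm_mapped_extents; ++i) {
		const struct fiemap_extent *fe = &fm->fm_extents[i];
		printf("logical %llu physical %llu length %llu flags 0x%x\n",
		       (unsigned long long)fe->fe_logical,
		       (unsigned long long)fe->fe_physical,
		       (unsigned long long)fe->fe_length,
		       fe->fe_flags);
	}
	free(fm);
	close(fd);
	return 0;
}
```

The fe_logical values in particular would show whether anything around offset 15650891 is unaligned or overlapping.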
Meanwhile, my btrfs crashed with some refcount issues, some double-linked extents, and some orphaned extents. btrfsck wasn't able to fix this because compression is in use and the extent lengths differ from what it expects. I was able to fix it by running btrfs-zero-log, then deleting the affected inodes, then letting btrfsck fix the rest (it needed 3 runs to fix everything). Finally, I restored good copies of the deleted files from the backup. The problem seems to no longer occur, so this error may be an indicator of an already broken filesystem. But I think this error was introduced by using bees, which of course is probably not bees' fault but rather a bug in btrfs that is already documented and tracked on the mailing list (object already exists, errno=-17). Running bees can inject this problem into the filesystem; the system then freezes and needs a hard reboot. I had a few of these, and it eventually results in the problem I described.
s/introduced/triggered/ by bees. It wouldn't be the first bug that bees found in the kernel. Which kernel version are you running?
Ah yes, "triggered"... I was struggling to find the right word. ;-) It's been 4.12.13, now 4.12.14, with the ck patchset, using the bfq scheduler. As a vague guess, bfq might be involved in triggering it. I had similar errors some months back when I tried the out-of-tree bfq patches. Since bees' initial scan finished, the system has run stable (without freezes), except for that last time when it presented the errno=-17 to me upon cold reboot. BTW: after investigating the logs since the last boot, I still see messages like this:
So either something in my btrfs is still toast, or bees cannot handle this. It keeps repeating a lot of those messages for the same file (with long runs of identical ipos, but changing over time). Over time, it switches to a different file to throw this exception on. According to the package manager, the above file is still pristine. So maybe it's nothing to worry too much about.
I've heard bad experiences involving bfq and btrfs from multiple people. Not so much from bfq itself as from the multiqueue IO infrastructure it depends on. I'd test that a lot more thoroughly on VMs before turning it on in production. I wouldn't worry too much about the bees exception. I hit that a few times myself. My guess is that there's a bug in the ExtentWalker binary-search code that is triggered by the specific details of the layout of these files on your filesystem. Maybe there's a hole that is exactly the wrong size and gets missed during search window expansion. The other possibility is that there's a duplicate or overlapping extent ref in the filesystem, but in that case it should have shown up in the fiemap output too, and it didn't. I haven't had a chance to verify this yet (and my TODO list includes rewriting ExtentWalker anyway...).
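To make the hole hypothesis a bit more concrete, here is a deliberately simplified toy model; the Extent struct and find_extent() below are invented for illustration and are not bees' ExtentWalker code:

```cpp
// Toy model only: a lookup that assumes extents are contiguous can return
// an extent whose end lies below the requested position when that position
// falls inside a hole, tripping a post-condition of the same shape as the
// "ipos <= ...->m_end" check in the log.
#include <cassert>
#include <cstdint>
#include <vector>

struct Extent { uint64_t m_begin, m_end; };  // half-open range [m_begin, m_end)

// Naive lookup: last extent starting at or before ipos, no hole check.
const Extent &find_extent(const std::vector<Extent> &v, uint64_t ipos) {
	const Extent *best = &v.front();
	for (const auto &e : v) {
		if (e.m_begin <= ipos) best = &e;
	}
	return *best;
}

int main() {
	// Two extents with a hole between 8192 and 16384.
	std::vector<Extent> extents = { {0, 8192}, {16384, 32768} };

	uint64_t ipos = 12288;  // inside the hole
	const Extent &e = find_extent(extents, ipos);

	// This assertion fails (e.m_end == 8192 < ipos), which is the kind of
	// constraint violation a window-expansion bug around a hole could cause.
	assert(ipos <= e.m_end);
	return 0;
}
```

Whether that is actually what happens in ExtentWalker is just a guess until the fiemap output confirms where the hole (or overlapping ref) sits.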
Well, I don't consider my home system a real production system... While, yes, I'm using it as my main system for production-like work such as development, documentation, and games (if you want to call games "production"), with daily backups, I'm also using it as a testbed for pre-production environments, that is, preparing and testing containers before pushing things to the production environment. bfq is one of the things I would currently not deploy to those production environments. It works well enough here (and does improve perceived performance a lot), but I fear it would create fatal problems in a 24/7 production environment, where it isn't that suitable as an IO scheduler anyway because that environment is about server containers. I'm running deadline/noop there, because the underlying storage is a SAS HBA with BBU and an SSD cache. If you're going to rewrite ExtentWalker, I'm fine with not working on non-fatal bugs in the current implementation, especially if they're hard to find.
I'm seeing many of the following messages, mostly in the context of the same file. After a while, I see the same messages repeating for another file.
ipos doesn't change as long as the file context doesn't change: