failed constraint check (ipos <= new_vec.rbegin()->m_end) #26
ipos 15650891 is not a multiple of 4096, so something is weird there. Is this near the end of a file, especially a file that is currently or recently being modified? There are sometimes glitches where something changes on the filesystem and one source of btrfs metadata disagrees with another because they're capturing data from the filesystem at different times (sometimes several minutes apart). That tends to get discovered by an exception handler when the discrepancy breaks something that doesn't like the filesystem changing underneath it (like the binary search algorithm in ExtentWalker). It's usually harmless, and bees will visit the extents again on the next scan generation. If the file keeps getting modified all the time, bees will keep triggering the exception (patches to implement a glob- or regexp-based path exclusion framework are welcome ;); however, if the file really is being modified all the time, then dedup on that file will be ineffective anyway.
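To give a rough idea of what such an exclusion framework could look like, here is a purely illustrative C++ sketch; the PathExcluder class and its methods are invented for this comment and do not exist in bees, and where it would hook into the scanner is left open:

```cpp
// Hypothetical sketch of a regexp-based path exclusion filter.
// None of these names exist in bees; this only illustrates the idea.
#include <regex>
#include <string>
#include <vector>

class PathExcluder {
	std::vector<std::regex> m_patterns;
public:
	// Compile one exclusion pattern, e.g. ".*/steamapps/.*"
	void add_pattern(const std::string &pattern) {
		m_patterns.emplace_back(pattern, std::regex::extended);
	}

	// True if the path matches any exclusion pattern and the
	// scanner should skip it instead of throwing repeatedly.
	bool excluded(const std::string &path) const {
		for (const auto &re : m_patterns) {
			if (std::regex_match(path, re)) {
				return true;
			}
		}
		return false;
	}
};
```

The scanner would consult excluded() before opening a file, so frequently modified files could simply be skipped by configuration.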
Nothing accesses the file. Actually it's game data from the Windows Steam version running within Wine. But Wine (and thus Steam) is not running currently, and hasn't been running for days. Maybe it results from a bad interaction with autodefrag? Though I don't see why this should happen, because nobody is writing directly to the file. Even bees seems to refuse to work on the file due to the constraint-check exceptions above. There seem to be at least two files of the same game data that keep coming up again and again in a loop, and bees never finishes with them. More details:
It's not even big...
That is sounding more like a bug in the binary search algorithm previously mentioned. :-P Could you run fiemap on the offending file? That will show the extent metadata fields as bees sees them. Maybe there's something anomalous in there that is confusing BtrfsExtentWalker. For that matter, fiewalk should fail in the same way as bees does, but it's a bit more controlled, so it's easier to understand what's going on (be sure to change the #if 0 to #if 1 so we get both walking directions tested).
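In case it helps, dumping the extent metadata can also be done directly with the standard Linux FS_IOC_FIEMAP ioctl; the following is a generic sketch of that (not bees' fiewalk code), and it only fetches the first batch of extents:

```cpp
// Minimal FIEMAP dump: print logical offset, physical offset, length and
// flags for the first 256 extents of a file, similar to `filefrag -v`.
#include <cstdio>
#include <cstdlib>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int main(int argc, char **argv) {
	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	int fd = open(argv[1], O_RDONLY);
	if (fd < 0) { perror("open"); return 1; }

	const unsigned count = 256;  // extents fetched in this single call
	size_t sz = sizeof(struct fiemap) + count * sizeof(struct fiemap_extent);
	struct fiemap *fm = (struct fiemap *)calloc(1, sz);
	fm->fm_start = 0;
	fm->fm_length = FIEMAP_MAX_OFFSET;
	fm->fm_extent_count = count;

	if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) { perror("FS_IOC_FIEMAP"); return 1; }

	for (unsigned i = 0; i < fm->fm_mapped_extents; ++i) {
		const struct fiemap_extent *fe = &fm->fm_extents[i];
		printf("logical %llu physical %llu length %llu flags 0x%x\n",
		       (unsigned long long)fe->fe_logical,
		       (unsigned long long)fe->fe_physical,
		       (unsigned long long)fe->fe_length,
		       fe->fe_flags);
	}
	free(fm);
	close(fd);
	return 0;
}
```

The fe_logical values in particular would show whether anything around offset 15650891 is unaligned or overlapping.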
Meanwhile, my btrfs crashed with some refcount issues, some double-linked extents, and some orphaned extents. btrfsck wasn't able to fix this because compression is in use and the extent lengths differ from what it expects. I was able to fix it by running btrfs-zero-log, then deleting the affected inodes, then letting btrfsck fix the rest (it needed 3 runs to fix everything). Finally, I restored good copies of the deleted files from the backup. The problem seems to no longer occur, so this error may be an indicator of an already broken filesystem. But I think this error was introduced by using bees, which of course is probably not bees' fault but rather a bug in btrfs that is already documented and tracked on the mailing list (object already exists, errno=-17). Running bees can inject this problem into the filesystem; the system then freezes and needs a hard reboot. I had a few of these, and it eventually results in the problem I described.
s/introduced/triggered/ by bees. It wouldn't be the first bug that bees found in the kernel. Which kernel version are you running?
Ah yes, "triggered"... I was struggling to find the right word. ;-) It's been 4.12.13, now 4.12.14, with the ck patchset, using the bfq scheduler. As a vague guess, bfq might be involved in triggering it. I had similar errors some months back when I tried the out-of-tree bfq patches. Since bees' initial scan finished, the system has run stable (without freezes), except for that last time when it presented the errno=-17 to me upon cold reboot. BTW: after investigating the logs since the last boot, I still see messages like this:
So either something in my btrfs is still toast, or bees cannot handle this. It keeps repeating a lot of those messages for the same file (with long runs of identical ipos, but changing over time). Over time, it switches to a different file to throw this exception on. According to the package manager, the above file is still pristine. So maybe it's nothing to worry too much about.
I've heard bad experiences involving bfq and btrfs from multiple people. Not so much from bfq itself as from the multiqueue IO infrastructure it depends on. I'd test that a lot more thoroughly on VMs before turning it on in production. I wouldn't worry too much about the bees exception. I hit that a few times myself. My guess is that there's a bug in the ExtentWalker binary-search code that is triggered by the specific details of the layout of these files on your filesystem. Maybe there's a hole that is exactly the wrong size and gets missed during search window expansion. The other possibility is that there's a duplicate or overlapping extent ref in the filesystem, but in that case it should have shown up in the fiemap output too, and it didn't. I haven't had a chance to verify this yet (and my TODO list includes rewriting ExtentWalker anyway...).
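To make the hole hypothesis a bit more concrete, here is a deliberately simplified toy model; the Extent struct and find_extent() below are invented for illustration and are not bees' ExtentWalker code:

```cpp
// Toy model only: a lookup that assumes extents are contiguous can return
// an extent whose end lies below the requested position when that position
// falls inside a hole, tripping a post-condition of the same shape as the
// "ipos <= ...->m_end" check in the log.
#include <cassert>
#include <cstdint>
#include <vector>

struct Extent { uint64_t m_begin, m_end; };  // half-open range [m_begin, m_end)

// Naive lookup: last extent starting at or before ipos, no hole check.
const Extent &find_extent(const std::vector<Extent> &v, uint64_t ipos) {
	const Extent *best = &v.front();
	for (const auto &e : v) {
		if (e.m_begin <= ipos) best = &e;
	}
	return *best;
}

int main() {
	// Two extents with a hole between 8192 and 16384.
	std::vector<Extent> extents = { {0, 8192}, {16384, 32768} };

	uint64_t ipos = 12288;  // inside the hole
	const Extent &e = find_extent(extents, ipos);

	// This assertion fails (e.m_end == 8192 < ipos), which is the kind of
	// constraint violation a window-expansion bug around a hole could cause.
	assert(ipos <= e.m_end);
	return 0;
}
```

Whether that is actually what happens in ExtentWalker is just a guess until the fiemap output confirms where the hole (or overlapping ref) sits.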
Well, I don't consider my home system a real production system... While, yes, I'm using it as my main system for production-like work such as development, documentation, and games (if you want to call games "production"), with daily backups, I'm also using it as a testbed for pre-production environments, that is, preparing and testing containers before pushing things to the production environment. bfq is one of the things I would currently not deploy to those production environments. It works well enough here (and does improve perceived performance a lot), but I fear it would create fatal problems in a 24/7 production environment, where it isn't that suitable as an IO scheduler anyway because that environment is about server containers. I'm running deadline/noop there, because the underlying storage is a SAS HBA with BBU and an SSD cache. If you're going to rewrite ExtentWalker, I'm fine with not working on non-fatal bugs in the current implementation, especially if they're hard to find.
I'm seeing many of the following messages, mostly in the context of the same file. After a while, I see the same messages repeating for another file.
ipos doesn't change as long as the file context doesn't change: