Wasted memory in inode cache and dentry cache #14506
Possible solutions:
The 2nd proposed solution is easy to implement. Another improvement I'd like to make is to reopen the file descriptor in the shard that really owns it: slab allocation today is NUMA-aware, so we can benefit from a fair distribution of resources across NUMA nodes. Actually, by just reopening the file descriptor, I guess we'll cause a natural defragmentation of xfs's internal and dentry caches?! Worth testing.
More sstables will appear and the kernel will re-use the holes in partially-filled slabs?
With a very high-traffic node it may defragment itself after a while. But the odds are low: these are 16k slabs, so 4 sstables in the same slab need to be deleted.
Closing a file doesn't cause its dentry+inode to be evicted. Quite the opposite: the kernel caches dentries for closed files, and the respective inodes are pinned in memory too. Eviction (called "reclaim" in the kernel) happens either on page allocation failure (not small-object allocation failure) or in a background thread (but I'm not very confident about the thread; background reclaim was very different in different kernels).
And with a low-traffic node, we're stuck with 25% utilization.
Yes, the idea is to allocate long-lived inodes together, and short-lived inodes together, so that if/when the short-lived inodes are reclaimed they free a slab.
There's the kmem controller in the cgroups area. When allocations come from different groups, the kmem cache dedicates different slab pages to each. If threads can live in different kmem groups and this thing still works 🤞 it can be an option too.
I don't get why. If the traffic is low we have 25% utilization of a few pages, so hardly a problem. If the traffic is high then those 5 files that are opened once (or twice) for a short time should be reclaimed and then re-allocated by newly appearing sstables.
Traffic is low ≠ few files. We've seen a misconfigured TWCS table with ~200k sstables. That translated to ~800k files and ~1GB slab, instead of 200MB slab.
Unicode fail: I don't understand why the =/= character (U+2260) can't be copied here.
≢ <- this does work
That's why email is better. I was able to see your symbol from my mail reader :-)
⍯ <- this too
Maybe it's just misrendered locally.
You mean the ≠ in […]
In your quote, the first ≠ is misrendered and I can select individual glyphs (but it's also kerned!). The second ≠ is rendered correctly (so for sure font related, since the second instance uses a different font).
Ah, now I see it. It's a two-step thing. First, the inode cache is populated with […]

I remember that Cristopher tried to do something about it long ago, e.g. https://lwn.net/Articles/371892/

This can be a justification for rewriting Linux in C++, with the ability to std::move() dentries+inodes during slab defragmentation.
Yes, about 4000 years ago @tgrabiec demonstrated the problem (IIRC on seastar memcached) and everyone is busy admiring LSA ever since.
Our usage of inodes is dual:
- the Index.db and Data.db components are pinned in memory as the files are open
- all other components are read once and never looked at again

As such, tune the kernel to prefer evicting dcache/inodes to memory pages. The default is 100, so the value of 2000 increases it by a factor of 20.

Ref scylladb#14506
↑ patch implements the first proposal, but it won't solve the issue without the second, due to fragmentation.
← celebrating Unicode and the Compose key
There are ~20 dentries per slab and ~15+ xfs inodes per slab (21 and 17 in the original issue's kernel). Opening all files in one go, then closing all but Index and Data, will leave the […]
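For reference, the per-slab counts mentioned here can be read from /proc/slabinfo (root-only on recent kernels). A minimal sketch, assuming the standard slabinfo v2.1 column layout; the sample line's numbers are made up to resemble the 16k-slab, 17-inodes-per-slab, 25%-utilization case discussed in this thread:

```python
# Compute slab utilization from a /proc/slabinfo line (slabinfo v2.1 format).
# Columns: name active_objs num_objs objsize objperslab pagesperslab ...
def slab_utilization(line):
    fields = line.split()
    name = fields[0]
    active_objs, num_objs = int(fields[1]), int(fields[2])
    objperslab = int(fields[4])
    util = active_objs / num_objs if num_objs else 0.0
    return name, objperslab, util

# Hypothetical sample line: 17 inodes per 16k slab (4 pages), 25% in use.
sample = ("xfs_inode  170000  680000  960  17  4 : "
          "tunables 0 0 0 : slabdata 40000 40000 0")
name, per_slab, util = slab_utilization(sample)
print(name, per_slab, f"{util:.0%}")  # xfs_inode 17 25%
```

With real data you would iterate over the lines of /proc/slabinfo and filter for `dentry` and `xfs_inode`.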
I think the point is to open long-lived first and short-lived later. So the slabs will be used, used, used, ..., unused, unused, ...; then reclaiming will have an easy job freeing from the back.
But note that patching the sstable loader only defers the issue: newer sstables are introduced one at a time, so we cannot prevent fragmentation in the long run.
In the long run sstables replace each other. A deleted inode is freed from slab and the next sstable to be created will take its place. The problem only happens during initial load.
Long-lived from several sstables first, then short-lived from the same set.
Exactly
...unless long-lived files are eventually closed; AFAIU, we expect that compaction should provide that in the long run.
I was thinking to open ALL short-lived components first. We have to open -TOC.txt first, so at that point we may as well open all the other short-lived components. Then a second pass to open Index/Data. |
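The two-pass idea can be sketched as follows. This is a minimal Python illustration, not Scylla's actual loader; the component file names and naming scheme are assumptions for the example:

```python
import os

# Hypothetical component lists: an sstable has 7 component files, of which
# only Data.db and Index.db stay open (see the discussion above).
SHORT_LIVED = ["TOC.txt", "CompressionInfo.db", "Filter.db",
               "Statistics.db", "Summary.db"]
LONG_LIVED = ["Data.db", "Index.db"]

def load_directory(dirpath, sstables):
    """Two-pass load: first open-and-close every short-lived component of
    *every* sstable, so their dentries/inodes land in contiguous slabs,
    then open the long-lived components and keep them open."""
    # Pass 1: short-lived components, read once and closed immediately.
    for gen in sstables:
        for comp in SHORT_LIVED:
            with open(os.path.join(dirpath, f"{gen}-{comp}"), "rb") as f:
                f.read()  # parse-and-forget
    # Pass 2: long-lived components; file objects are kept open.
    held = []
    for gen in sstables:
        for comp in LONG_LIVED:
            held.append(open(os.path.join(dirpath, f"{gen}-{comp}"), "rb"))
    return held
```

Because every short-lived open happens before any long-lived open, the short-lived inodes fill whole slabs that can later be reclaimed as a unit.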
Our usage of inodes is dual:
- the Index.db and Data.db components are pinned in memory as the files are open
- all other components are read once and never looked at again

As such, tune the kernel to prefer evicting dcache/inodes to memory pages. The default is 100, so the value of 2000 increases it by a factor of 20.

Ref #14506
Closes #14509
We might also tune vm.min_slab_ratio
If we set it to 1%, then as soon as we reach 1% we'll evict inodes and new sstables will start to fill the holes. 5% is too close to our 7% reserve.
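For reference, the two knobs discussed in this thread, as a sysctl.d-style config fragment. The values are the ones proposed above (2000, and 1% instead of the default 5%); note that vm.min_slab_ratio exists only on NUMA kernels and is consulted only when zone reclaim is active:

```
# Bias reclaim towards dentries/inodes: default is 100, so 2000 makes
# the kernel ~20x more willing to reclaim dcache/icache vs. page cache.
vm.vfs_cache_pressure = 2000

# Reclaim a zone's slabs once reclaimable slab pages exceed 1% of the
# zone (NUMA kernels only, effective with zone reclaim enabled).
vm.min_slab_ratio = 1
```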
LKML might appreciate an fcntl extension (or an O_SOMETHING flag) that would prevent a file's dentry and inode from being cached.
Implementing multiple passes should be done in the sstable_directory initially so we can have those passes:
For newly written sstables (by memtable flush or compaction) we can consider a background crawler in the sstables manager that will process a batch of sstables, closing and reopening the data and index in each.
This is enough to place the in-memory and on-disk inodes together, ruining the ability to reclaim slabs.
Opening and closing does nothing for the inode cache. I'm less worried about runtime; there's constant deletion of sstables, so inode slots will get reused.
Well, except in the streaming/repair case.
I thought that the next step will resolve the fragmentation issue.
No. Closing files doesn't remove them from the inode cache. Even if it did, they would get placed in the same slots.
Or vice versa: first open only data/index and then open all the rest.
I think that delayed loading of data/index aligns better on the trajectory towards tiered storage.
I don't think it matters, it's localized to sstable_directory.
@denesb, @raphaelsc how's that for an introductory project for @lkshminarayanan ?
Looks good to me. Should be a good introduction on how we load sstable files.
Update the SSTable loader mechanism to allow loading the SSTable files in two phases. First, load the files of all SSTables in a directory, which will be closed immediately after loading. Later, load the respective data and index files of all those SSTables.

Loading them in two phases will help reduce memory fragmentation that can occur in the dentry/inode cache due to interleaving the opening of files that will be closed immediately and the files that will be held open for a longer duration.

Fixes scylladb#14506
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
During startup, the contents of the data directory are verified to ensure that they have the right owner and permissions. Verifying all the contents together, including files that will be read and closed immediately alongside files that will be held open for longer durations, can lead to memory fragmentation in the dentry/inode cache. Mitigate this by updating the verification in such a way that these two sets of files are verified separately, ensuring their separation in the dentry/inode cache.

Fixes scylladb#14506
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
…fying data directory contents' from Lakshmi Narayanan Sreethar

During startup, the contents of the data directory are verified to ensure that they have the right owner and permissions. Verifying all the contents together, including files that will be read and closed immediately alongside files that will be held open for longer durations, can lead to memory fragmentation in the dentry/inode cache. Mitigate this by updating the verification in such a way that these two sets of files are verified separately, ensuring their separation in the dentry/inode cache.

Fixes #14506
Closes #16952

* github.com:scylladb/scylladb:
  directories: prevent inode cache fragmentation by orderly verifying data directory contents
  directories: skip verifying data directory contents during startup
  directories: co-routinize create_and_verify
An sstable is composed of 7 files. They will be opened in close proximity. Two will remain open, five will be closed.
After some time, the kernel will run low on memory and will start evicting the inodes of non-open files. But because 2 out of the 7 files are still open, memory fragmentation will prevent the slabs containing those inodes/dentries from being reclaimed.
Let's take for example 200k sstables. That's 200k*1k*7 = 1.4GB memory used, of which 200k*1k*5 = 1GB is wasted.
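The arithmetic above, spelled out as a quick sanity check (assuming, as the text does, ~1k of dentry+inode metadata per file and decimal units):

```python
# Back-of-the-envelope numbers from the issue description.
sstables = 200_000
files_per_sstable = 7    # components per sstable
open_per_sstable = 2     # Data.db + Index.db stay open
bytes_per_file = 1_000   # ~1k of dentry+inode metadata per file

total = sstables * files_per_sstable * bytes_per_file
wasted = sstables * (files_per_sstable - open_per_sstable) * bytes_per_file
print(total / 1e9, wasted / 1e9)  # 1.4 1.0  (GB used, GB wasted)
```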